Ladislava Janku

 

PhD Thesis (not finished yet): Robust speech recognition: a computational auditory scene analysis and missing data reconstruction approach.

Advisor: Assoc. Prof. Vladimir Eck, Ph.D.

 

The aim of my research is to create a computer system that can separate and recognize sound sources in a complex auditory environment. In contrast to commonly used speech recognition methods, which rely only on statistical models and assume a single sound source, this research follows a machine listening approach. This approach consists in modeling the detection mechanism, in this case the mammalian auditory processes, or in creating artificial intelligence or pattern recognition systems inspired by the properties of the human auditory system. It draws on studies of sound perception, models of the cochlea and the auditory periphery, neural signal processing, auditory masking, loudness and pitch perception, and auditory grouping, in combination with statistical pattern recognition approaches.

 

PhD Thesis Proposal:

Janků, L. Several Methods for Computational Auditory Scene Analysis. Prague : CTU FEE, Department of Cybernetics, BIO Laboratory, 2000. BIO333-02/2000. 23 p.

 

PhD Thesis Bibliography

[1]           Aertsen, A.M.J.H., Johannesma, P.I.M. (1980): Spectro-temporal receptive fields of auditory neurons in the grassfrog. I. Characterisation of tonal and natural stimuli. Biological Cybernetics, 38, 223-234.

[2]           Aitkin, L., Dunlop, C., Webster, W. (1966): Click-evoked response patterns of single units in the medial geniculate body of the cat, J. Neurophysiology, vol. 29, pp. 109-123.

[3]           Aikawa, K., Singer, H., Kawahara, H., & Tohkura, Y. (1993): A dynamic cepstrum incorporating time-frequency masking and its application to continuous speech recognition, Proc. IEEE ICASSP, II-668-671.

[4]           Allen, J.B. (1994): How do humans process and recognize speech? IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, 567-577.

[5]           Arai, T., M. Pavel, H. Hermansky, and C. Avendano (1996): Perception of Speech With High-Passed, Low-Passed, and Band-Passed Spectral Envelopes. Proc. Intl. Conf. on Spoken Language Processing.

[6]           Avendano, C., S. van Vuuren, and H. Hermansky (1996): Optimizing RASTA Filters on Corrupted Speech. Proc. Intl. Conf. on Spoken Language Processing.

[7]           Beauvois, M. W., Meddis, R. (1996): Computer simulation of auditory stream segregation in alternating-tone sequences. J. Acoust. Soc. Am., 99(4), 2270-2280.

[8]           Berg, B. G. (1996): On the relation between comodulation masking release and temporal modulation transfer functions. J. Acoust. Soc. Am., 100(2), 1013-1023.

[9]           Bigand, E., Parncutt, R., Lerdahl, F. (1996): Perception of musical tension in short chord sequences: The influence of harmonic function, sensory dissonance, horizontal motion, and musical training. Perception and Psychophysics, 58(1), 125-141.

[10]       Bishop, C. M. (1995): Neural networks for pattern recognition. New York: Oxford University Press, 1995.

[11]        Bourlard, H., Dupont, S. (1996): A new ASR approach based on independent processing and re-combination of partial frequency bands. Proc. ICSLP.

[12]        Brandenburg, K. (1998): Perceptual coding of high quality digital audio. In: Kahrs, M., Brandenburg, K.(eds.): Applications of Digital Signal Processing to Audio and Acoustics. New York: Kluwer Academic, 39-83.

[13]       Bregman, A.S.(1990): Auditory Scene Analysis: the Perceptual Organization of Sound. Cambridge, MA. MIT Press.

[14]        Bregman, A.S., Pinker, S.: Auditory streaming and the building of timbre. Can. Journal Psych., 32, 19-31.

[15]        Bregman, A.S., Abramson, J., Doehring, P., Darwin, C.J.: Spectral integration based on common amplitude modulation, Perception and Psychophysics, 37, 483-493.

[16]        Britt, R., Starr, A. (1975): Synaptic events and discharge patterns of cochlear nucleus cells. I. Steady frequency tone bursts. J. Neurophys., 39, 162-178.

[17]        Brown, G. J. & Cooke, M. (1994): Computational auditory scene analysis. Computer Speech and Language 8(2), 297-336.

[18]        Brown, G. J. & Cooke, M. (1994): Perceptual grouping of musical sounds: A computational model. J. New Music Res., 23, 107-132.

[19]        Brown, G. J. & Wang, D. (1997): Modelling the perceptual segregation of double vowels with a network of neural oscillators. Neural Networks 10(9), 1547-1558.

[20]        Brown, J. & Puckette, M. S. (1989): Calculation of a ’narrowed’ autocorrelation function. J. Acoust. Soc. Am.,  85(5), 1595-1601.

[21]        Brown, J. C. (1991): Calculation of a constant Q spectral transform. J. Acoust. Soc. Am., 89(1), 425-434.

[22]        Brown, J. C. (1993): Determination of the meter of musical scores by autocorrelation. J. Acoust. Soc. Am.,  94(4), 1953-1957.

[23]        Brown, J. C. & Puckette, M. S. (1992): An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. Am., 92(5), 2698-2701.

[24]        Brown, G.J., Cooke, M. (1997): Temporal Synchronization in a Neural Oscillator Model of Primitive Auditory Stream Segregation. In: Rosenthal, D.F., Okuno, H. G. (Eds.): Computational Auditory Scene Analysis. Lawrence Erlbaum Associates, Inc., Publishers.

[25]        Buus, S. (1985): Release from masking caused by envelope fluctuations. J. Acoust. Soc. Am.,   78(6), 1958-1965.

[26]        Carter, N. P., Bacon, R. A. & Messenger, T. (1988): The acquisition, representation and reconstruction of printed music by computer: A review. Comp. and Hum., 22(2), 117-136.

[27]        Casey, M. A. (1998): Auditory Group Theory with Applications to Statistical Basis Methods for Structured Audio. Ph.D. thesis, MIT Media Laboratory, Cambridge MA.

[28]        Cariani, P. A. & Delgutte, B. (1996): Neural correlates of the pitch of complex tones. I. Pitch and pitch salience. J. Neurophysiology, 76(3), 1698-1734.

[29]        Carlyon, R. P. (1991): Discriminating between coherent and incoherent frequency modulation of complex tones. J. Acoust. Soc. Am., 89(1), 329-340.

[30]        Carlyon, R. P. (1994): Further evidence against an across-frequency mechanism specific to the detection of frequency modulation (FM) incoherence between resolved frequency components. J. Acoust. Soc. Am., 95(2), 949-952.

[31]        Carlyon, R.P., Demars, L., Semal, C. (1992): Detection of across-frequency differences in fundamental frequency, J. Acoust. Soc. Am., 91(1), 279-292.

[32]        Chafe, C., Mont-Reynaud, B. & Rush, L. (1982): Toward an intelligent editor of digital audio: Recognition of musical constructs. Comp. Music J., 6(1), 30-41.

[33]        Chistovitch, L.A. (1998): Synaptic events and discharge patterns of cochlear nucleus cells. II. Frequency-modulated tones. J. Neurophys., 39, 162-178, 1975

[34]        Chistovich, L.A. (1985): Central auditory processing of peripheral vowel spectra. Journal Acoust. Soc. Am., 77, 789-805.

[35]       Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing, 2002, John Wiley & Sons Ltd, ISBN 0471-60791-6

[36]        Clarke, E. F. & Krumhansl, C. L. (1990): Perceiving musical time. Music Perc., 7(3), 213-252.

[37]        Clarkson, B., Sawhney N., Pentland, A.: Auditory Context Awareness via Wearable Computing, Perceptual Computing Group and Speech Interface Group, MIT Media Laboratory, 1998

[38]        Clynes, M. (1995): Microstructural musical linguistics: composers’ pulses are liked most by the best musicians. Cognition, 55, 269-310.

[39]        Cohen, J.R. (1989): Application of an auditory model to speech recognition. J. Acoust. Soc. Am., vol. 85, no. 6, pp. 2623-2629.

[40]        Comon, P.: Independent Component Analysis – a New Concept? Signal Processing 36, pp. 287-314

[41]        Cooper, F. S., Delattre, P. C., Liberman, A. M., Borst, J. M., Gerstman, L. J. (1952): Some experiments on the perception of synthetic speech sounds, J. Acoust. Soc. Am., Vol. 24, pp. 579-606.

[42]        Cook, G. D., Christie, J. P., Clarkson, P. R., Hochberg, M. M., Logan, B. T., Robinson, A. J. (1996): Real-time recognition of broadcast radio speech, Proc. ICASSP-96 (141-144).

[43]        Cook, N. (1998): Music: A Very Short Introduction ???

[44]        Cooke, M.P.: Modelling Auditory Processing and Organization, Ph.D. Thesis (1991): Distinguished Dissertations in Computer Science Series, Cambridge University Press, 1993

[45]        Cope, D. (1992): Computer modeling of musical intelligence in EMI. Computer Music Journal 16(2), 69-83.

[46]        Crowder, R. G. (1985a): Perception of the major/minor distinction: I. Historical and theoretical foundations. Psychomusicology, 4(1), 3-12.

[47]        Crowder, R. G. (1985b): Perception of the major/minor distinction: II. Experimental investigations. Psychomusicology, 5(1), 3-24.

[48]        Cumming, D. (1988): Parallel algorithms for polyphonic pitch tracking. M.S. thesis, MIT Media Laboratory and Dept. of Electrical Engineering, Cambridge MA.

[49]        Davis, S.B. and Mermelstein, P. (1980): Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. on Acoustics, Speech & Signal Processing, vol. 28(4), 357-366.

[50]        de Cheveigné, A.: Cancellation model of pitch perception.  J. Acoust. Soc. Am., 103(3), 1261-1271, 1998.

[51]        de Cheveigné, A. (1993): Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing. Journal of the Acoustical Society of Am., 93(6), 3271-3290.

[52]        de Cheveigné, A. (1997): Concurrent vowel identification. III. A neural model of harmonic interference cancellation. J. Acoust. Soc. Am., 101(5), 2857-2865.

[53]        de Cheveigné, A. (1998b): Cancellation model of pitch perception. J. Acoust. Soc. Am., 103(3), 1261-1271.

[54]        de Cheveigné, A. & Kawahara, H. (1999): Multiple period estimation and pitch perception model. Speech Communication 27(3-4), 175-185.

[55]        De Poli, G., Piccialli, A. & Roads, C. (eds.) (1991): Representations of Musical Signals. Cambridge MA: MIT Press.

[56]        Delgutte, B., Hammond, B. M., Kalluri, S., Litvak, L. M. & Cariani, P. A. (1997): Neural encoding of temporal envelope and temporal interactions in speech. In Proceedings of the 1997 XIth International Conference on Hearing. Grantham UK.

[57]        Deliège, I., Melen, M., Stammers, D. & Cross, I. (1996): Musical schemata in real-time listening to a piece of music. Music Perception 14(2), 117-160.

[58]        Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977): Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (series B) 39(1), 1-38.

[59]        Desain, P. (1995): A (de)composable theory of rhythm perception. Music Perception 9(3), 439-454.

[60]        Desain, P. & Honing, H. (1999): Computational models of beat induction: The rule-based approach. J. New Music Res., 28(1), 29-42.

[61]        Desain, P., Honing, H., van Thienen, H. & Windsor, L. (1998): Computational modeling of music cognition: Problem or solution? Music Perception 16(1), 151-166.

[62]        Divenyi, P. L., Carre, R. & Algazi, A. P. (1997): Auditory segregation of vowel-like sounds with static and dynamic spectral properties. In Proceedings of the 1997 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. Mohonk, NY.

[63]        Drake, C. (1998): Psychological processes involved in the temporal organization of complex auditory sequences: Universal and acquired processes. Music Perception 16(1), 11-26.

[64]        Drullman, R., Festen, J.M., and Plomp, R. (1994): Effect of temporal envelope smearing on speech reception, J. Acoust. Soc. Am.,  95, 1053-1064.

[65]        Drullman, R., Festen, J.M., and Plomp, R. (1994b): Effect of reducing slow temporal modulations on speech reception, J. Acoust. Soc. Am., 95, 2670-2680.

[66]       Duda, R. O. & Hart, P. E. (1973): Pattern Classification and Scene Analysis. New York: John Wiley and Sons

[67]        Duda, R. O., Lyon, R. F. & Slaney, M. (1990): Correlograms and the separation of sounds. In Proceedings of the 1990 IEEE Asilomar Workshop. Asilomar CA.

[68]        Ellis, D. P. W. (1994): A computer implementation of psychoacoustic grouping rules. MIT Media Laboratory Perceptual Computing Technical Report #224, Cambridge, MA. Available from http://vismod.www.media.mit.edu/vismod/publications.

[69]        Ellis, D. P. W. (1996a): Prediction-Driven Computational Auditory Scene Analysis. Ph.D. thesis, MIT Dept. of Electrical Engineering and Computer Science, Cambridge MA.

[70]        Ellis, D. P. W. (1997): The weft: A representation for periodic sounds. In Proceedings of the 1997 Int. Conf. on Acoust. Speech and Sig. Proc. (pp. 1307-1310): Munich.

[71]        Ellis, D. P. W. & Rosenthal, D. F. (1998): Mid-level representations for computational auditory scene analysis: The weft element. In D. F. Rosenthal & H. G. Okuno (eds.), Readings in Computational Auditory Scene Analysis (pp. 257-272). Mahwah, NJ: Lawrence Erlbaum.

[72]        Erickson, R. (1985): Sound Structure in Music. Berkeley, CA: University of California Press

[73]        Povel, D.-J. & Essens, P. (1985): Perception of temporal patterns. Music Perception 2(3), 411-440.

[74]        Flanagan, J. L. & Golden, R. M. (1966): Phase vocoder. Bell System Technical J.,  45, 1493-1509.

[75]        Foote, J. (1999): An overview of audio information retrieval. Multimedia Systems, 7(1), 2-10.

[76]        Foster, S., Schloss, W. A. & Rockmore, A. J. (1982): Toward an intelligent editor of digital audio: Signal processing methods. Comp. Music J., 6(1), 42-51.

[77]        Freed, D. J. (1990): Auditory correlates of perceived mallet hardness for a set of recorded percussive sound events. J. Acoust. Soc. Am., 87(1), 311-322.

[78]        Fucci, D., Harris, D., Petrosino, L. & Banks, M. (1993): The effect of preference for rock music on magnitude-estimation scaling behavior in young adults. Percept. Motor Skills, 76(3), 1171-1176.

[79]        Fucci, D., Petrosino, L., Banks, M., Zaums, K. & Wilcox, C. (1996): The effect of preference for three different types of music on magnitude estimation-scaling behavior in young adults. Percept. and Motor Skills 83(1), 339-347.

[80]        Gabor, D. (1947): Acoustical quanta and the theory of hearing. Nature 159, 591-594.

[81]        Glasberg, B. R., Moore, B. C. J. (1990): Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47.

[82]       Gold, B., Morgan, N.: Speech and Audio Signal Processing, John Wiley & Sons, Inc., 2000, ISBN 0-471-35154-7

[83]        Godsmark, D., Brown, G.: Context-Sensitive Selection of Competing Auditory Organizations: A Blackboard Model. In: Rosenthal, D.F., Okuno, H. G. (Eds.): Computational Auditory Scene Analysis. Lawrence Erlbaum Associates, Inc., Publishers, 1997

[84]        Goldstein, J. L. (1973): An optimum processor theory for the central formation of the pitch of complex tones. J. Acoust. Soc. Am., 54(6), 1496-1516.

[85]        Goto, M. (1999): Real-time beat tracking for drumless audio signals: Chord change detection for musical decisions. Speech Communication, 27(3-4), 311-335.

[86]        Goto, M. & Hayamizu, S. (1999): A real-time music scene description system: Detecting melody and bass lines in audio signals. In Proceedings of the 1999 International Joint Conference on Artificial Intelligence Workshop on Computational Auditory Scene Analysis (31-40): Stockholm.

[87]        Goto, M. & Muraoka, Y. (1998): Music understanding at the beat level: Real-time beat tracking for audio signals. In D. F. Rosenthal & H. Okuno (eds.), Readings in Computational Auditory Scene Analysis (157-176): Mahwah, NJ: Lawrence Erlbaum.

[88]        Green, D. M. (1996): Discrimination changes in spectral shape: Profile analysis. Acustica 82, S31-S36.

[89]        Green, P.D., M.P. Cooke, and M.D. Crawford (1995): Auditory scene analysis and hidden Markov model recognition of speech in noise. Proc. IEEE ICASSP (401-404).

[90]        Grey, J. M. (1977): Multidimensional perceptual scaling of musical timbres. J. Acoust. Soc. Am., 61(5), 1270-1277.

[91]        Haeb-Umbach, R., Geller, D., Ney, H. (1994): Improvements in connected digit recognition using linear discriminant analysis and mixture densities. Proc. IEEE ICASSP (239-242).

[92]        Hall, J. W., Haggard, M. P. & Fernandes, M. A. (1984): Detection in noise by spectro-temporal pattern analysis. J. Acoust. Soc. Am., 76(1), 50-56.

[93]       Handel, S. (1989): Listening: An Introduction to the Perception of Auditory Events. Cambridge, MA: MIT Press

[94]        Hartmann, W. M. (1996): Pitch, periodicity, and auditory organization. J. Acoust. Soc. Am., 100(6), 3491-3502.

[95]       Hassoun, M.H.: Fundamentals of Artificial Neural Networks, The MIT Press, ISBN 0-262-08239-X

[96]        Hawley, M. J. (1993): Structure out of Sound. Ph.D. thesis, MIT Media Laboratory, Cambridge MA.

[97]        Herault, J., Jutten, C.: Blind Separation of Sources, part I, An Adaptive Algorithm based on Neuromimetic Architecture. Signal Processing 24, 1-10, 1991

[98]        Hermansky, H. (1990): Perceptual linear predictive (PLP) analysis of speech, Journal Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752.

[99]        Hermansky, H.(1995): Exploring temporal domain for robustness in speech recognition, Proc. of 15th International Congress on Acoustics, (Trondheim, Norway), Vol. II., pp. 61-64.

[100]     Hermansky, H., M. Pavel, S. Tibrewala, N. Mirghafori, and N. Morgan (1996): Re-Combination of Sub-Band Information for Digit Recognition, to be published in Proc. Intl. Conf. on Spoken Language Processing 96.

[101]     Hermansky, H., Fujisaki, H. & Sato Y. (1983): Analysis and synthesis of speech based on spectral transform linear predictive method, Proc. IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, (Boston, MA), pp. 777-780.

[102]     Hermansky, H. & Morgan, N. (1994): RASTA processing of speech, IEEE Trans. on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589.

[103]     Hermansky, H., Wan, E., & Avendano, C. (1995): Speech enhancement based on temporal processing, Proc. IEEE Intl. Conf. on Acoustics, Speech, & Signal Processing, 405- 408.

[104]     Hermansky, H. and D. Broad (1989): The effective second formant F2’ and the vocal tract front cavity, Proc. Int. Conf. Acoust. Speech and Sig. Proc. 89, 480-483.

[105]     Hermansky, H., N. Morgan, A. Bayya and P. Kohn (1991): Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP), Proc. Eurospeech ’91, pp. 1367-1371, Genova, Italy.

[106]     Hermansky, H. and Pavel, M. (1995): Psychophysics of Speech Engineering Systems, Invited paper, 13th International Congress on Phonetic Sciences, Stockholm, Sweden, August 1995.

[107]     Hermes, D. J. (1988): Measurement of pitch by subharmonic summation. J. Acoust. Soc. Am., 83(1), 257-264.

[108]     Hewlett, W. B. (ed.) (1998): Melodic Similarity: Concepts, Procedures, and Applications. Computing in Musicology. Cambridge, MA: MIT Press.

[109]     Hirsch, I. J. & Watson, C. S. (1996): Auditory psychophysics and perception. Annual Review of Psychology 47, 461-484.

[110]     Hirsch, H.G., P. Meyer, and H. Ruehl: Improved speech recognition using high-pass filtering of subband envelopes, Proc. Eurospeech ’91, 1991, Genova, Italy.

[111]     Holleran, S., Jones, M. R. & Butler, D. (1995): Perceiving musical harmony: The influence of melodic and harmonic context. J. Experiment. Psych.: Learning, Memory, and Cognition 21(3), 737-753.

[112]     Huron, D. (1991): Tonal Consonance versus tonal fusion in polyphonic sonorities. Music Perception,  9(2), 135-154.

[113]     Huron, D. & Sellmer, P. (1992): Critical bands and the spelling of vertical sonorities. Music Perception, 10(2), 129-150.

[114]    Huang, X., Acero, A., Hon, H.-W.: Spoken Language Processing, Microsoft Research, Prentice Hall PTR, 2001, ISBN: 0-13-022616-5

[115]     Irino, T. & Patterson, R. D. (1996): Temporal asymmetry in the auditory system. J. Acoust. Soc. Am., 99(4), 2316-2331.

[116]     Izmirli, Ö. & Bilgen, S. (1996): A model for tonal context time course calculation from acoustical input. J. of New Music Res.,  25(3), 276-288.

[117]    Jelinek, F.: Statistical Methods for Speech Recognition, The MIT Press, 1999, Second Edition, 0-262-10066-5

[118]     Johnson-Laird, P. N. (1991b): Rhythm and meter: A theory at the computational level. Psychomusicology, 10, 88-106.

[119]     Jones, M. R. & Boltz, M. (1989): Dynamic attending and responses to time. Psychological Review 96(3), 459-491.

[120]     Juslin, P. N. (1997): Emotional communication in music performance: A functionalist perspective and some data. Music Perception 14(4), 383-418.

[121]     Kashino, K. & Murase, H. (1997): Sound source identification for ensemble music based on the music stream extraction. In Proceedings of the 1997 Int. Joint Conf. on AI Workshop on Computational Auditory Scene Analysis (pp. 127-134): Tokyo.

[122]     Kashino, K., Nakadai, K., Kinoshita, T. & Tanaka, H. (1995): Application of Bayesian probability network to music scene analysis. In Proceedings of the 1995 Int. Joint Conf. on AI Workshop on Computational Auditory Scene Analysis (pp. 52-59): Montreal.

[123]     Klassner, F. I. (1996): Data Preprocessing in Signal Understanding Systems. Ph.D. thesis, University of Massachusetts Computer Science, Amherst, MA.

[124]     Klassner, F. I., Lesser, V. & Nawab, S. H. (1998): The IPUS blackboard architecture as a framework for computational auditory scene analysis. In D. F. Rosenthal & H. Okuno (eds.), Readings in Computational Auditory Scene Analysis (pp. 177-193): Mahwah, NJ: Erlbaum.

[125]     Kohonen, T. (1995): Self-Organizing Maps. Berlin: Springer-Verlag

[126]     Krumhansl, C. L. (1979): The psychological representation of musical pitch in a tonal context. Cognitive Psych.,  11, 346-374.

[127]     Krumhansl, C. L. (1991a): Memory for musical surface. Memory & Cognition 19(4), 401-411.

[128]     Krumhansl, C. L. (1991b): Music psychology: tonal structures in perception and memory. Annual Rev. Psych., 42, 277-303.

[129]     Krumhansl, C. L. (1997): Effects of perceptual organization and musical form on melodic expectancies. In: M. Leman (ed.), Music, Gestalt, and Computing: Studies in Systematic and Cognitive Musicology (pp. 294-320): Berlin: Springer.

[130]     Krumhansl, C. L., Kessler, E. J. (1982): Tracing the dynamic changes in perceived tonal organization in a spatial representation of musical keys. Psychological Review 89(4), 334-368.

[131]     Kuhn, W. B. (1990): A real-time pitch recognition algorithm for music applications. Comp. Music J., 14(3), 60-71.

[132]     Langner, G. (1992): Periodicity coding in the auditory system. Hearing Research 60(1), 115-142.

[133]     Large, E. W.,   Kolen, J. F. (1994): Resonance and the perception of musical meter. Connection Science, 6(2), 177-208.

[134]     Lee, X. F., Logan, R. J., Pastore, R. E. (1991): Perception of acoustic source characteristics: Walking sounds. J. Acoust. Soc. Am., 90(6), 3036-3049.

[135]     Leman, M. (1989): Symbolic and subsymbolic information processing in models of musical communication and cognition. Interface, 18, 141-160.

[136]     Leman, M. (1994): Schema-based tone center recognition of musical signals. J. of New Music Res., 23(2), 169-204.

[137]     Levitin, D. J. (1994): Absolute memory for musical pitch: Evidence from the production of learned melodies. Percept. Psychophysics, 56(4), 414-423.

[138]     Levitin, D. J., Cook, P. R. (1996): Memory for musical tempo: Additional evidence that auditory memory is absolute. Percept.  Psychophysics 58(6), 927-935.

[139]     Licklider, J. C. R. (1951a): Basic correlates of the auditory stimulus. In S. S. Stevens (ed.), Handbook of Experimental Psychology (985-1035): New York: Wiley.

[140]     Licklider, J. C. R. (1951b): A duplex theory of pitch perception. Experientia 7, 128-134.

[141]     Liu, Z., Wang, Y. & Chen, T. (1998): Audio feature extraction and analysis for scene segmentation and classification. J.  VLSI Sig. Process., 20(1-2), 61-79.

[142]     Longuet-Higgins, H. C. (1994): Artificial intelligence and musical cognition. Philosophical Transactions of the Royal Society of London (A) 349, 103-113.

[143]     Maes, P., Lashkari, Y. & Metral, M. (1997): Collaborative interface agents. In M. H. Huhns & M. P. Singh (eds.), Readings in Agents. New York: Morgan Kaufmann Publishers.

[144]     Maher, R. C. (1990): Evaluation of a method for separating digitized duet signals. J. Audio Eng. Soc. 38(12), 956-979.

[145]     Mani, R. (1999): Knowledge-based processing of multicomponent signals in a musical application. Signal Processing, 74(1), 47-69.

[146]     Marr, D. (1982): Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. New York: W.H. Freeman & Co

[147]     Martin, K. D. (1996a): Automatic transcription of simple polyphonic music: Robust front-end processing. MIT Media Laboratory Perceptual Computing Technical Report #399, Cambridge MA. Available from http://vismod.www.media.mit.edu/vismod/publications.

[148]     Martin, K. D. (1996b): A blackboard system for automatic transcription of simple polyphonic music. MIT Media Laboratory Perceptual Computing Technical Report #385, Cambridge MA. Available from http://vismod.www.media.mit.edu/vismod/publications.

[149]     Martin, K. D. (1999): Sound-Source Recognition: A Theory and Computational Model. Ph.D. thesis, MIT, Department of Electrical Engineering and Computer Science, Cambridge, MA.

[150]     Martin, K. D., Scheirer, E. D. & Vercoe, B. L. (1998): Musical content analysis through models of audition. In Proceedings of the 1998 ACM Multimedia Workshop on Content-Based Processing of Music. Bristol UK.

[151]     McAdams, S. (1984): Spectral Fusion, Spectral Parsing, and the Formation of Auditory Images. Ph.D. thesis, Stanford University CCRMA, Dept of Music, Stanford, CA.

[152]     McAdams, S. (1987): Music: A science of the mind? Contemporary Music Review 2, 1-61.

[153]     McAdams, S. (1989): Segregation of concurrent sounds. I: Effects of frequency modulation coherence. J. Acoust. Soc.  Am. 86(6), 2148-2159.

[154]     McAdams, S., Botte, M.-C. & Drake, C. (1998): Auditory continuity and loudness computation. J. Acoust. Soc. Am., 103(3), 1580-1591.

[155]     McAulay, R. J. & Quatieri, T. F. (1986): Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4), 744-754.

[156]     McCabe, S. L. & Denham, M. J. (1997): A model of auditory streaming. J. Acoust. Soc.  Am., 101(3), 1611-1621.

[157]     Meddis, R. & Hewitt, M. J. (1991): Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification. J. Acoust. Soc.  Am., 89(6), 2866-2882.

[158]     Meddis, R. & Hewitt, M. J. (1992): Modeling the identification of concurrent vowels with different fundamental frequencies. J. Acoust. Soc.  Am., 91(1), 233-244.

[159]     Mellinger, D. K. (1991): Event Formation and Separation in Musical Sound. Ph.D. thesis, Stanford University Dept. of Computer Science, Palo Alto CA.

[160]     Minami, K., Akutsu, A., Hamada, H. & Tonomura, Y. (1998): Video handling with music and speech detection. IEEE Multimedia 5(3), 17-25.

[161]     Mont-Reynaud, B. M., Mellinger, D. K.: Source separation by frequency co-modulation. Proc 1st Int Conf on Music Perception and Cognition, Kyoto, 1989.

[162]    Moore, B. C. J. (1997): An Introduction to the Psychology of Hearing. San Diego: Academic Press

[163]    Moore, B.C.J (1995): Hearing, Academic Press London, 1995

[164]     Moorer, J. A. (1977): On the transcription of musical sound by computer. Comp. Music J. 1(4), 32-38.

[165]     Nawab, S. H., Espy-Wilson, C. Y., Mani, R. & Bitar, N. N. (1998): Knowledge-based analysis of speech mixed with sporadic environmental sounds. In D. F. Rosenthal & H. Okuno (eds.), Readings in Computational Auditory Scene Analysis (pp. 177-193): Mahwah, NJ: Erlbaum.

[166]     Ng, K., Boyle, R. & Cooper, D. (1996): Automatic detection of tonality using note distribution. J. New Music Res., 25(4), 369-381.

[167]     Parncutt, R. (1994a): A perceptual model of pulse salience and metrical accent in musical rhythms. Music Perception 11(2), 409-464.

[168]     Parncutt, R. (1994b): Template-matching models of musical pitch and rhythm perception. J. New Music Res., 23, 145-167.

[169]     Parsons, T.W.: Separation of speech from interfering speech by means of harmonic selection. J. Acoust. Soc. Am. 60, 1976

[170]     Patterson, R. D., Allerhand, M. H. & Giguere, C. (1995): Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform. J. Acoust. Soc. Am., 98(4), 1890-1894.

[171]     Patterson, R. D., Moore, B.C. J.: Auditory Filters and Excitation Patterns as Representation of Frequency Resolution. In Hearing B.C.J. Moore (eds.), Academic Press London, 1986

[172]     Patterson R. D: A pulse ribbon model of monaural phase perception. J. Acoust. Soc. Am. 82(5), 1987

[173]     Pereira, F. & Koenen, R. H. (2000): MPEG-7: Status and directions. In A. Puri & T. Chen (eds.), Advances in Multimedia: Signals, Standards, and Networks (pp. 611-630):

[174]     Pfeiffer, S. and Fischer, S. and Effelsberg,W.: Automatic Audio Content Analysis. University of Mannheim, 1997.

[175]     Perrott, D. & Gjerdingen, R. O. (1999): Scanning the dial: An exploration of factors in the identification of musical style. In Proceedings of the 1999 Society for Music Perception & Cognition (pp. 88 (abstract)): Evanston, IL.

[176]     Pielemeier, W. J., Wakefield, G. H. & Simoni, M. H. (1996): Time-frequency analysis of musical signals. Proc IEEE 84(9), 1216-1230.

[177]     Piszczalski, M. & Galler, B. A. (1977): Automatic music transcription. Computer Music Journal 1(4), 24-31. Piszczalski, M. & Galler, B. A. (1983): A computer model of music recognition. In M. Clynes (ed.),

[178]     Plomp, R. & Levelt, W. J. M. (1965): Tonal consonance and critical bandwidth. J. Acoust. Soc. Am., 38(2), 548-560.

[179]     Povel, D. & Okkerman, H. (1981): Accents in equitone sequences. Percept. Psychophysics 30(3), 565-572.

[180]     Povel, D.-J. & Essens, P. (1985): Perception of temporal patterns. Music Perception 2(2), 411-480.

[181]     Povel, D.-J. & van Egmond, R. (1993): The function of accompanying chords in the recognition of melodic fragments. Music Perception 11(2), 101-115.

[182]     Quatieri, T. F. & McAulay, R. J. (1998): Audio signal processing based on sinusoidal analysis/synthesis. In M. Kahrs & K. Brandenburg (eds.), Applications of Digital Signal Processing to Audio and Acoustics (pp. 343-411): New York: Kluwer Academic.

[183]     Quatieri, T. F., Danisewicz, D. G. (1990): An approach to co-channel talker interference suppression using a sinusoidal model for speech. IEEE Tr. ASSP 38(1).

[184]     Rabiner, L. R., Cheng, M. J., Rosenberg, A. E. & McGonegal, C. A. (1976): A comparative performance study of several pitch detection algorithms. IEEE Trans ASSP 24(5), 399-418.

[185]     Roads, C., Pope, S. T., Piccialli, A. & de Poli, G. (eds.) (1997): Musical Signal Processing. Stud. New Music Res. Lisse, NL: Swets & Zeitlinger.

[186]     Robinson, K. (1993): Brightness and octave position: Are changes in spectral envelope and in tone height perceptually equivalent? Contemporary Music Review 9(1,2), 83-95.

[187]     Rose, M. M. & Moore, B. C. J. (1997): Perceptual grouping of tone sequences by normally hearing and hearing-impaired listeners. J. Acoust. Soc. Am., 102(3), 1768-1778.

[188]     Rosenthal, D. F. (1992): Machine Rhythm: Computer Emulation of Human Rhythm Perception. Ph.D. thesis, MIT Media Laboratory, Cambridge, MA.

[189]     Rosenthal, D. F. & Okuno, H. G. (eds.) (1998): Computational Auditory Scene Analysis (pp. 27-42): Mahwah, NJ: Lawrence Erlbaum. ISBN 0-8058-2283-6.

[190]     Sandell, G. J. (1995): Roles for spectral centroid and other factors in determining "blended" instrument pairings in orchestration. Music Perception 13(2), 209-246.

[191]     Saint-Arnaud, N. (1995): Classification of Sound Textures. M.S. thesis in Media Arts and Sciences, MIT.

[192]     Scheirer, E. D. (1999): Towards music understanding without separation: Segmenting music with correlogram comodulation. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[193]     Scheirer, E. D. (1995): Extracting expressive performance information from recorded music. M.S. thesis, MIT Media Laboratory, Cambridge, MA.

[194]     Scheirer, E. D. (1996): Bregman’s chimerae: Music perception as auditory scene analysis. In Proceedings of the 1996 International Conference on Music Perception and Cognition (pp. 317-322): Montreal: Society for Music Perception and Cognition.

[195]     Scheirer, E. D. (1998a): Tempo and beat analysis of acoustic musical signals.  J. Acoust. Soc. Am., 103(1), 588-601.

[196]     Scheirer, E. D. (1998b): Using musical knowledge to extract expressive performance information from recorded signals. In D. F. Rosenthal & H. Okuno (eds.), Readings in Computational Auditory Scene Analysis (pp. 361-380): Mahwah, NJ: Lawrence Erlbaum.

[197]     Scheirer, E. D., Slaney, M. (1997): Construction and evaluation of a robust multifeature speech/music discriminator. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 1331-1334): Munich: IEEE.

[198]     Scheirer, E. D. & Vercoe, B. L. (1999): SAOL: The MPEG-4 Structured Audio Orchestra Language. Computer Music Journal 23(2), 31-51.

[199]     Schellenberg, E. G. (1996): Expectancy in melody: Tests of the implication-realization model. Cognition, 58(1), 75-125.

[200]     Schmuckler, M. A. & Boltz, M. G. (1994): Harmonic and rhythmic influences on musical expectancy. Percept. Psychophysics 56(3), 313-325.

[201]     Shamma, S. A. (1996): Auditory cortical representation of complex acoustic spectra as inferred from the ripple analysis method. Network - Computation in Neural Systems 7(3), 439-476.

[202]     Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J. & Ekelid, M. (1995): Speech recognition with primarily temporal cues. Science 270, 303-304.

[203]     Shepard, R. N. (1964): Circularity in judgments of relative pitch. J. Acoust. Soc. Am., 36(12), 2346-2353.

[204]     Shepard, R. N. (1982): Geometrical approximations to the structure of musical pitch. Psychological Review,  89(4), 305-333.

[205]     Slaney, M. (1994): Auditory toolbox. Apple Computer, Inc. Technical Report #45, Cupertino CA. Available from http://www.interval.com/~malcolm.

[206]     Slaney, M. (1997): Connecting correlograms to neurophysiology and psychoacoustics. In Proceedings of the 1997 XIth International Symposium on Hearing. Lincolnshire UK.

[207]     Slaney, M. (1998): A critique of pure audition. In D. F. Rosenthal & H. G. Okuno (eds.), Computational Auditory Scene Analysis (pp. 27-42): Mahwah, NJ: Lawrence Erlbaum.

[208]     Slaney, M. & Lyon, R. F. (1990): A perceptual pitch detector. In Proceedings of the 1990 International Conference on Acoustics, Speech, and Signal Processing (pp. 357-360): Albuquerque: IEEE.

[209]     Slaney, M. & Lyon, R. F. (1991): Apple Hearing Demo Reel. Apple Computer, Inc. Technical Report #25, Cupertino CA. Available from malcolm@interval.com.

[210]     Slaney, M., Naar, D. & Lyon, R. F. (1994): Auditory model inversion for sound separation. In Proceedings of the 1994 ICASSP. Adelaide AU.

[211]     Smaragdis, P. J. (1997): Information Theoretic Approaches to Source Separation. M.S. thesis in Media Arts and Sciences, MIT.

[212]     Smith, J. D. (1997): The place of musical novices in music science. Music Perception 14(3), 227-262.

[213]     Smoliar, S. W. (1995): Parsing, structure, memory and affect. J. of New Music Res. 24(1), 21-33.

[214]     Snyder, J. S. & Krumhansl, C. L. (1999): Cues to pulse-finding in piano ragtime music. In Proceedings of the 1999 Society for Music Perception and Cognition. Evanston, IL.

[215]     Summerfield, Q., Lea, A. & Marshall, D. (1990): Modelling auditory scene analysis: strategies for source segregation using autocorrelograms. Proceedings of the Institute of Acoustics 12(10), 507-514.

[216]     Temperley, D. (1997): An algorithm for harmonic analysis. Music Perception 15(1), 31-68.

[217]     Terhardt, E. (1974): Pitch, consonance, and harmony. J. Acoust. Soc. Am., 55(5), 1061-1069.

[218]     Terhardt, E. (1978): Psychoacoustic evaluation of musical sounds. Percept. Psychophysics 23(6), 483-492.

[219]     Terhardt, E. (1991): Music perception and sensory information acquisition: Relationships and low-level analogies. Music Perception 8(3), 217-240.

[220]     Terhardt, E., Stoll, G. & Seewann, M. (1982): Algorithm for extraction of pitch and pitch salience from complex tonal signals. J. Acoust. Soc. Am., 71(3), 679-688.

[221]     Thompson, W. F. (1993): Modeling perceived relationships between melody, harmony, and key. Percept. Psychophysics 53(1), 13-24.

[222]     Thompson, W. F. & Parncutt, R. (1997): Perceptual judgments of triads and dyads: Assessment of a psychoacoustic model. Music Perception 14(3), 263-280.

[223]     Thomson, W. (1993): The harmonic root: A fragile marriage of concept and percept. Music Perception, 10(4), 385-416.

[224]     Todd, N. P. M. (1994): The auditory "primal sketch": A multiscale model of rhythmic grouping. J. New Music Res., 23, 25-70.

[225]     Van Immerseel, L. M. & Martens, J.-P. (1992): Pitch and voiced/unvoiced determination with an auditory model. J. Acoust. Soc. Am. 91(6), 3511-3526.

[226]     van Noorden, L. P. A. S. (1977): Minimum differences of level and frequency for perceptual fission of tone sequences ABAB. J. Acoust. Soc. Am., 61(4), 1041-1045.

[227]     Vercoe, B. L. (1984): The synthetic performer in the context of live performance. In Proceedings of the 1984 International Computer Music Conference (pp. 199-200): Paris: International Computer Music Association.

[228]     Vercoe, B. L. (1988): Hearing polyphonic music on the connection machine. In Proceedings of the 1988 First AAAI Workshop on Artificial Intelligence and Music (pp. 183-194). Minneapolis.

[229]     Vercoe, B. L. (1997): Computational auditory pathways to music understanding. In I. Deliège & J. Sloboda (eds.), Perception and Cognition of Music (pp. 307-326). London: Psychology Press.

[230]     Vercoe, B. L., Gardner, W. G. & Scheirer, E. D. (1998): Structured audio: The creation, transmission, and rendering of parametric sound representations. Proceedings of the IEEE 86(5), 922-940.

[231]     Vercoe, B. L. & Puckette, M. S. (1985): Synthetic rehearsal: Training the synthetic performer. In Proceedings of the 1985 ICMC (pp. 275-278). Burnaby BC, Canada.

[232]     Verhey, J. L., Dau, T. & Kollmeier, B. (1999): Within-channel cues in comodulation masking release (CMR): Experiments and model predictions using a modulation-filterbank model. J. Acoust. Soc. Am., 106(5), 2733-2745.

[233]     Versnel, H. & Shamma, S. A. (1998): Spectral-ripple representation of steady-state vowels in primary auditory cortex. J. Acoust. Soc. Am., 103(5), 2502-2514.

[234]     Vliegen, J. & Moore, B. C. J. (1999): The role of spectral and periodicity cues in auditory stream segregation, measured using a temporal discrimination task. J. Acoust. Soc. Am., 106(2), 938-945.

[235]     Vliegen, J. & Oxenham, A. J. (1999): Sequential stream segregation in the absence of spectral cues. J. Acoust. Soc. Am., 105(1), 339-346.

[236]     Vos, P. G. & Van Geenen, E. W. (1996): A parallel-processing key-finding model. Music Perception 14(2), 185-224.

[237]     Wang, D. (1996): Primitive auditory segregation based on oscillatory correlation. Cognitive Science 20, 409-456.

[238]     Wang, K. & Shamma, S. A. (1995): Spectral shape-analysis in the central auditory system. IEEE TSAP, 3(5), 382-395.

[239]     Wang, D.: Stream segregation based on oscillatory correlation. In D. F. Rosenthal & H. G. Okuno (eds.), Computational Auditory Scene Analysis. Lawrence Erlbaum Associates, Inc., Publishers.

[240]     Warren, R. M. (1970): Perceptual restoration of missing speech sounds. Science 167, 392-393.

[241]     Warren, R. M., Obusek, C. J. & Ackroff, J. M. (1972): Auditory induction: perceptual synthesis of absent sounds. Science 176, 1149-1151.

[242]     Warren, W. H. & Verbrugge, R. R. (1984): Auditory perception of breaking and bouncing events: A case study in ecological acoustics. Journal of Experimental Psychology: Human Perception and Performance 10(4), 704-712.

[243]     Weintraub, M. (1985): A Theory and Computational Model of Auditory Monaural Sound Separation. Ph.D. thesis, Stanford University Dept. of Electrical Engineering, Palo Alto, CA.

[244]     Wightman, F. L. (1973): The pattern-transformation model of pitch. J. Acoust. Soc. Am., 54(2), 407-416.

[245]     Wold, E., Blum, T., Keislar, D. & Wheaton, J. (1996): Content-based classification, search, and retrieval of audio. IEEE Multimedia 3(3), 27-36.

[246]     Yost, W. A. (1991): Auditory image perception and analysis: The basis for hearing. Hearing Research 56,

 

 

Publications in ASR and CASA Oriented Research

-         Janku, L.: Sound Source Separation through Models of Auditory Processes and Fuzzy-Rule System. In: MOSIS 01-Modelling and Simulation of Systems. MARQ, 2001, vol. 1, p. 195-200. ISBN 80-85988-57-7.

-         Janků, L. - Lhotska, L.: Musical Instrument Classification through the Model of Auditory Periphery and a Neural Network. In: N. Mastorakis, V. Mladenov, B. Suter and L.J. Wand (Editors): Advances in Scientific Computing, Computational Intelligence and Applications. WSES Press, 2001.

-         Janků, L. - Lhotska, L.: Towards the Application of Fuzzy Logic to the Sound Recordings Fingerprint Computation and Comparison. In: Recent Advances in Computers, Computing and Communications. New Jersey. WSEAS Press, 2002.

-         Janku, L.: An Efficient Machine Listening Approach to Sound Source Separation. In: Proceedings of Workshop 2001. Prague : CTU, 2001, vol. A, p. 262-263. ISBN 80-01-02335-4.

-         Janků, L.: Psychical States Estimation Based on Voice Analysis and Speech Recognition I - Introduction. [Research Report]. Prague : CTU FEE, Department of Cybernetics, BIO Laboratory, 1999. BIO333-07/99. (in Czech).

-         Janků, L. - Eck, V.: Psychical States Estimation Based on Voice Analysis and Speech Recognition II - Methods of Voice Analysis. [Research Report]. Prague : CTU FEE, Department of Cybernetics, BIO Laboratory, 1999. BIO333-15/99. (in Czech).

-         Janků, L. - Eck, V.: Workplace for Audio Signal Processing and Recognition, I & II. [Research Reports]. Prague : CTU FEE, Department of Cybernetics, BIO Laboratory, 1999. BIO333.  (in Czech).