Ladislava Janku
PhD. Thesis (NOT FINISHED YET): Robust speech
recognition: computational auditory scene analysis and missing data
reconstruction approach.
Advisor: Assoc. Prof. Vladimir Eck, Ph.D.
The aim of my research is to create a computer
system that can separate and recognize sound sources in a complex auditory
environment. In contrast to commonly used speech recognition methods, which
rely only on statistical models and assume a single sound source, this
research follows a machine listening approach. This approach consists of
modeling the detection mechanism, in this case the mammalian auditory
processes, or of creating artificial intelligence or pattern recognition
systems inspired by the properties of the human auditory system. It draws on
studies of sound perception, models of the cochlea and auditory
periphery, neural signal processing, auditory masking, loudness and pitch
perception, and auditory grouping, combined with statistical pattern
recognition approaches.
PhD. Thesis Proposal:
Janků, L. Several Methods for
Computational Auditory Scene Analysis. Prague : CTU FEE, Department of
Cybernetics, BIO Laboratory, 2000. BIO333-02/2000. 23 p.
[1]
Aertsen, A.M.J.H., Johannesma, P.I.M. (1980): Spectro-temporal receptive fields of
auditory neurons in the grassfrog. I. Characterisation of tonal and natural
stimuli. Biological Cybernetics, 38, 223-234.
[2]
Aitkin, L., Dunlop, C., Webster, W. (1966):
Click-evoked response patterns of single units in the medial geniculate body of
the cat, J. Neurophysiology, vol. 29, pp. 109-123.
[3]
Aikawa, K., Singer, H., Kawahara, H., &
Tohkura, Y. (1993): A dynamic cepstrum incorporating time-frequency masking and
its application to continuous speech recognition, Proc. IEEE ICASSP, II-668-671.
[4]
Allen, J.B. (1994): How do humans process and
recognize speech? IEEE Trans. on Speech and Audio Processing, vol. 2,
no. 4, 567-577.
[5]
Arai, T., M. Pavel, H. Hermansky, and C.
Avendano (1996): Perception of Speech With High-Passed, Low-Passed, and
Band-Passed Spectral Envelopes. Proc.
Intl. Conf. on Spoken Language Processing.
[6]
Avendano, C., S. van Vuuren, and H. Hermansky
(1996): Optimizing RASTA Filters on Corrupted Speech. Proc. Intl. Conf. on
Spoken Language Processing.
[7]
Beauvois, M. W., Meddis, R. (1996): Computer
simulation of auditory stream segregation in alternating-tone sequences. J.
Acoust. Soc. Am., 99(4), 2270-2280.
[8]
Berg, B. G. (1996): On the relation between
comodulation masking release and temporal modulation transfer functions. J.
Acoust. Soc. Am., 100(2), 1013-1023.
[9]
Bigand, E., Parncutt, R., Lerdahl, F. (1996):
Perception of musical tension in short chord sequences: The influence of
harmonic function, sensory dissonance, horizontal motion, and musical training.
Perception and Psychophysics, 58(1), 125-141.
[10]
Bishop, C. M. (1995): Neural
networks for pattern recognition. New York: Oxford University Press, 1995.
[11]
Bourlard, H.,
Dupont, S. (1996): A new ASR approach based on independent processing
and re-combination of partial frequency bands. Proc. ICSLP.
[12]
Brandenburg, K. (1998): Perceptual coding of
high quality digital audio. In: Kahrs, M., Brandenburg, K.(eds.): Applications
of Digital Signal Processing to Audio and Acoustics. New York: Kluwer Academic,
39-83.
[13]
Bregman, A.S. (1990): Auditory
Scene Analysis: the Perceptual Organization of Sound. Cambridge, MA: MIT Press.
[14]
Bregman, A.S., Pinker, S.: Auditory streaming and
the building of timbre. Can. Journal Psych., 32, 19-31.
[15]
Bregman A.S., Abramson, J., Doehring, P.,
Darwin, C.J.: Spectral Integration based on common amplitude modulation,
Perception and Psychophysics, 37, 483-493.
[16]
Britt, R., Starr, A. (1975): Synaptic events
and discharge patterns of cochlear nucleus cells. I. Steady frequency tone
bursts. J. Neurophys., 39, 162-178.
[17]
Brown, G. J. & Cooke, M. (1994):
Computational auditory scene analysis. Computer Speech and Language 8(2),
297-336.
[18]
Brown, G. J. & Cooke, M. (1994): Perceptual
grouping of musical sounds: A computational model. J. New Music Res., 23,
107-132.
[19]
Brown, G. J. & Wang, D. (1997): Modelling
the perceptual segregation of double vowels with a network of neural
oscillators. Neural Networks 10(9), 1547-1558.
[20]
Brown, J. & Puckette, M. S. (1989):
Calculation of a ’narrowed’ autocorrelation function. J. Acoust. Soc. Am., 85(5), 1595-1601.
[21]
Brown, J. C. (1991): Calculation of a constant
Q spectral transform. J. Acoust. Soc. Am., 89(1), 425-434.
[22]
Brown, J. C. (1993): Determination of the meter
of musical scores by autocorrelation. J. Acoust. Soc. Am., 94(4), 1953-1957.
[23]
Brown, J. C. & Puckette, M. S. (1992): An
efficient algorithm for the calculation of a constant Q transform. J. Acoust.
Soc. Am., 92(5), 2698-2701.
[24]
Brown, G.J., Cooke, M. (1997): Temporal
Synchronization in a Neural Oscillator Model of Primitive Auditory Stream
Segregation. In: Rosenthal, D.F., Okuno, H. G. (Eds.): Computational Auditory
Scene Analysis. Lawrence Erlbaum Associates, Inc., Publishers.
[25]
Buus, S. (1985): Release from masking caused by
envelope fluctuations. J. Acoust. Soc. Am.,
78(6), 1958-1965.
[26]
Carter, N. P., Bacon, R. A. & Messenger, T.
(1988): The acquisition, representation and reconstruction of printed music by
computer: A review. Comp. and Hum., 22(2), 117-136.
[27]
Casey, M. A. (1998): Auditory Group Theory with
Applications to Statistical Basis Methods for Structured Audio. Ph.D. thesis,
MIT Media Laboratory, Cambridge MA.
[28]
Cariani, P. A. & Delgutte, B. (1996):
Neural correlates of the pitch of complex tones. I. Pitch and pitch salience.
J. Neurophysiology, 76(3), 1698-1734.
[29]
Carlyon, R. P. (1991): Discriminating between
coherent and incoherent frequency modulation of complex tones. J. Acoust. Soc.
Am., 89(1), 329-340.
[30]
Carlyon, R. P. (1994): Further evidence against
an across-frequency mechanism specific to the detection of frequency modulation
(FM) incoherence between resolved frequency components. J. Acoust. Soc. Am.,
95(2), 949-952.
[31]
Carlyon, R.P., Demars, L., Semal, C. (1992):
Detection of across-frequency differences in fundamental frequency, J. Acoust.
Soc. Am., 91(1), 279-292.
[32]
Chafe, C., Mont-Reynaud, B. & Rush, L.
(1982): Toward an intelligent editor of digital audio: Recognition of musical
constructs. Comp. Music J., 6(1), 30-41.
[33]
Chistovitch, L.A. (1998): Synaptic events and
discharge patterns of cochlear nucleus cells. II. Frequency/modulated tones. J.
Neurophys., 39, 162-178, 1975
[34]
Chistovich, L.A. (1985): Central auditory
processing of peripheral vowel spectra. Journal Acoust. Soc. Am., 77, 789-805.
[35]
Cichocki, A., Amari, S.:
Adaptive Blind Signal and Image Processing, 2002, John Wiley & Sons Ltd,
ISBN 0471-60791-6
[36]
Clarke, E. F. & Krumhansl, C. L. (1990):
Perceiving musical time. Music Perc., 7(3), 213-252.
[37]
Clarkson, B., Sawhney N., Pentland, A.:
Auditory Context Awareness via Wearable Computing, Perceptual Computing Group
and Speech Interface Group, MIT Media Laboratory, 1998
[38]
Clynes, M. (1995): Microstructural musical
linguistics: composers’ pulses are liked most by the best musicians. Cognition,
55, 269-310.
[39]
Cohen, J.R. (1989): Application of an auditory
model to speech recognition. J. Acoust. Soc. Am., vol. 85, no. 6, pp.
2623-2629.
[40]
Comon, P.: Independent Component Analysis – a
New Concept? Signal Processing 36, pp. 287-314
[41]
Cooper, F. S., Delattre, P. C., Liberman, A.
M., Borst, J. M., Gerstman, L. J. (1952): Some experiments on the perception of
synthetic speech sounds, J. Acoust. Soc. Am., Vol. 24, pp. 579-606.
[42]
Cook, G. D., Christie, J. P., Clarkson, P. R.,
Hochberg, M. M., Logan, B. T., Robinson, A. J. (1996): Real-time recognition of
broadcast radio speech, Proc. ICASSP-96 (141-144).
[43]
Cook, N. (1998): Music: A Very Short
Introduction ???
[44]
Cooke, M.P.: Modelling Auditory Processing and
Organization, Ph.D. Thesis (1991): Distinguished Dissertations in Computer
Science Series, Cambridge University Press, 1993
[45]
Cope, D. (1992): Computer modeling of musical
intelligence in EMI. Computer Music Journal 16(2), 69-83.
[46]
Crowder, R. G. (1985a): Perception of the
major/minor distinction: I. Historical and theoretical foundations.
Psychomusicology, 4(1), 3-12.
[47]
Crowder, R. G. (1985b): Perception of the
major/minor distinction: II. Experimental investigations. Psychomusicology,
5(1), 3-24.
[48]
Cumming, D. (1988): Parallel algorithms for
polyphonic pitch tracking. M.S. thesis, MIT Media Laboratory and Dept. of
Electrical Engineering, Cambridge MA.
[49]
Davis, S.B. and Mermelstein, P. (1980):
Comparison of parametric representations for monosyllabic word recognition in
continuously spoken sentences, IEEE Trans. on Acoustics, Speech & Signal
Processing, vol. 28(4), 357-366.
[50]
de Cheveigné, A.: Cancellation model of pitch
perception. J. Acoust. Soc. Am.,
103(3), 1261-1271, 1998.
[51]
de Cheveigné, A. (1993): Separation of
concurrent harmonic sounds: Fundamental frequency estimation and a time-domain
cancellation model of auditory processing. Journal of the Acoustical Society of
Am., 93(6), 3271-3290.
[52]
de Cheveigné, A. (1997): Concurrent vowel
identification. III. A neural model of harmonic interference
cancellation. J. Acoust. Soc. Am., 101(5), 2857-2865.
[53]
de Cheveigné, A. (1998b): Cancellation model of
pitch perception. J. Acoust. Soc. Am., 103(3), 1261-1271.
[54]
de Cheveigné, A. & Kawahara, H. (1999):
Multiple period estimation and pitch perception model. Speech Communication
27(3-4), 175-185.
[55]
De Poli, G., Piccialli, A. & Roads, C.
(eds.) (1991): Representations of Musical Signals. Cambridge MA: MIT Press.
[56]
Delgutte, B., Hammond, B. M., Kalluri, S.,
Litvak, L. M. & Cariani, P. A. (1997): Neural encoding of temporal envelope
and temporal interactions in speech. In Proceedings of the 1997 XIth
International Conference on Hearing. Grantham UK.
[57]
Deliège, I., Melen, M., Stammers, D. &
Cross, I. (1996): Musical schemata in real-time listening to a piece of music.
Music Perception 14(2), 117-160.
[58]
Dempster, A. P., Laird, N. M. & Rubin, D.
B. (1977): Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society (series B) 39(1), 1-38.
[59]
Desain, P. (1995): A (de)composable theory of
rhythm perception. Music Perception 9(3), 439-454.
[60]
Desain, P. & Honing, H. (1999):
Computational models of beat induction: The rule-based approach. J. New Music
Res., 28(1), 29-42.
[61]
Desain, P., Honing, H., van Thienen, H. &
Windsor, L. (1998): Computational modeling of music cognition: Problem or
solution? Music Perception 16(1), 151-166.
[62]
Divenyi, P. L., Carre, R. & Algazi, A. P.
(1997): Auditory segregation of vowel-like sounds with static and dynamic
spectral properties. In Proceedings of the 1997 IEEE Workshop on Applications
of Signal Processing to Audio and Acoustics. Mohonk, NY.
[63]
Drake, C. (1998): Psychological processes
involved in the temporal organization of complex auditory sequences: Universal
and acquired processes. Music Perception 16(1), 11-26.
[64]
Drullman, R., Festen, J.M., and Plomp, R.
(1994): Effect of temporal envelope smearing on speech reception, J. Acoust.
Soc. Am., 95, 1053-1064.
[65]
Drullman, R., Festen, J.M., and Plomp, R.
(1994b): Effect of reducing slow temporal modulations on speech reception,
J. Acoust. Soc. Am., 95, 2670-2680.
[66]
Duda, R. O. & Hart, P.
E. (1973): Pattern Classification and Scene Analysis. New York: John Wiley and
Sons
[67]
Duda, R. O., Lyon, R. F. & Slaney, M.
(1990): Correlograms and the separation of sounds. In Proceedings of the 1990
IEEE Asilomar Workshop. Asilomar CA.
[68]
Ellis, D. P. W. (1994): A computer
implementation of psychoacoustic grouping rules. MIT Media Laboratory
Perceptual Computing Technical Report #224, Cambridge, MA. Available from
http://vismod.www.media.mit.edu/vismod/publications.
[69]
Ellis, D. P. W. (1996a): Prediction-Driven
Computational Auditory Scene Analysis. Ph.D. thesis, MIT Dept. of Electrical
Engineering and Computer Science, Cambridge MA.
[70]
Ellis, D. P. W. (1997): The weft: A
representation for periodic sounds. In Proceedings of the 1997 Int. Conf. on
Acoust. Speech and Sig. Proc. (pp. 1307-1310): Munich.
[71]
Ellis, D. P. W. & Rosenthal, D. F. (1998):
Mid-level representations for computational auditory scene analysis: The weft
element. In D. F. Rosenthal & H. G. Okuno (eds.), Readings in Computational
Auditory Scene Analysis (pp. 257-272). Mahwah, NJ: Lawrence Erlbaum.
[72]
Erickson, R. (1985): Sound Structure in Music.
Berkeley, CA: University of California Press
[73]
Povel, D.-J. & Essens, P. (1985): Perception of
temporal patterns. Music Perception 2(3), 411-440.
[74]
Flanagan, J. L. & Golden, R. M. (1966):
Phase vocoder. Bell System Technical J.,
45, 1493-1509.
[75]
Foote, J. (1999): An overview of audio
information retrieval. Multimedia Systems, 7(1), 2-10.
[76]
Foster, S., Schloss, W. A. & Rockmore, A.
J. (1982): Toward an intelligent editor of digital audio: Signal processing
methods. Comp. Music J., 6(1), 42-51.
[77]
Freed, D. J. (1990): Auditory correlates of
perceived mallet hardness for a set of recorded percussive sound events. J.
Acoust. Soc. Am., 87(1), 311-322.
[78]
Fucci, D., Harris, D., Petrosino, L. &
Banks, M. (1993): The effect of preference for rock music on
magnitude-estimation scaling behavior in young adults. Percept. Motor Skills,
76(3), 1171-1176.
[79]
Fucci, D., Petrosino, L., Banks, M., Zaums, K.
& Wilcox, C. (1996): The effect of preference for three different types of
music on magnitude estimation-scaling behavior in young adults. Percept. and
Motor Skills 83(1), 339-347.
[80]
Gabor, D. (1947): Acoustical quanta and the
theory of hearing. Nature 159, 591-594.
[81]
Glasberg, B. R., Moore, B. C. J. (1990):
Derivation of auditory filter shapes from notched-noise data. Hearing Research,
47.
[82]
Gold, B., Morgan, N.: Speech
and Audio Signal Processing, John Wiley & Sons, Inc., 2000, ISBN
0-471-35154-7
[83]
Godsmark, D., Brown, G.: Context-Sensitive
Selection of Competing Auditory Organizations: A Blackboard Model. In:
Rosenthal, D.F., Okuno, H. G. (Eds.): Computational Auditory Scene Analysis.
Lawrence Erlbaum Associates, Inc., Publishers, 1997
[84]
Goldstein, J. L. (1973): An optimum processor
theory for the central formation of the pitch of complex tones. J. Acoust. Soc.
Am., 54(6), 1496-1516.
[85]
Goto, M. (1999): Real-time beat tracking for
drumless audio signals: Chord change detection for musical decisions. Speech
Communication, 27(3-4), 311-335.
[86]
Goto, M. & Hayamizu, S. (1999): A real-time
music scene description system: Detecting melody and bass lines in audio
signals. In Proceedings of the 1999 International Joint Conference on
Artificial Intelligence Workshop on Computational Auditory Scene Analysis
(31-40): Stockholm.
[87]
Goto, M. & Muraoka, Y. (1998): Music
understanding at the beat level: Real-time beat tracking for audio signals. In
D. F. Rosenthal & H. Okuno (eds.), Readings in Computational Auditory Scene
Analysis (157-176): Mahwah, NJ: Lawrence Erlbaum.
[88]
Green, D. M. (1996): Discrimination changes in
spectral shape: Profile analysis. Acustica 82, S31-S36.
[89]
Green, P.D., M.P. Cooke, and M.D. Crawford
(1995): Auditory scene analysis and hidden Markov model recognition of speech
in noise. Proc. IEEE ICASSP (401-404).
[90]
Grey, J. M. (1977): Multidimensional perceptual
scaling of musical timbres. J. Acoust. Soc. Am., 61(5), 1270-1277.
[91]
Haeb-Umbach, R., Geller, D., Ney, H. (1994):
Improvements in connected digit recognition using linear discriminant analysis
and mixture densities. Proc. IEEE ICASSP (239-242).
[92]
Hall, J. W., Haggard, M. P. & Fernandes, M.
A. (1984): Detection in noise by spectro-temporal pattern analysis. J. Acoust.
Soc. Am., 76(1), 50-56.
[93]
Handel, S. (1989):
Listening: An Introduction to the Perception of Auditory Events. Cambridge, MA:
MIT Press
[94]
Hartmann, W. M. (1996): Pitch, periodicity, and
auditory organization. J. Acoust. Soc. Am., 100(6), 3491-3502.
[95]
Hassoun, M.H.: Fundamentals
of Artificial Neural Networks, The MIT Press, ISBN 0-262-08239-X
[96]
Hawley, M. J. (1993): Structure out of Sound.
Ph.D. thesis, MIT Media Laboratory, Cambridge MA.
[97]
Herault, J., Jutten, C.: Blind Separation of
Sources, part I, An Adaptive Algorithm based on Neuromimetic Architecture.
Signal Processing 24, 1-10, 1991
[98]
Hermansky, H. (1990): Perceptual linear
predictive (PLP) analysis of speech, Journal Acoust. Soc. Am., vol. 87, no. 4,
pp. 1738-1752.
[99]
Hermansky, H. (1995): Exploring temporal domain
for robustness in speech recognition, Proc. of 15th International Congress on Acoustics,
(Trondheim, Norway), Vol. II., pp. 61-64.
[100] Hermansky, H., M. Pavel, S. Tibrewala, N. Mirghafori, and N. Morgan
(1996): Re-Combination of Sub-Band Information for Digit Recognition, to be
published in Proc. Intl. Conf. on Spoken Language Processing 96.
[101] Hermansky, H., Fujisaki, H. & Sato Y. (1983): Analysis and synthesis
of speech based on spectral transform linear predictive method, Proc. IEEE
Intl. Conf. on Acoustics, Speech, & Signal Processing, (Boston, MA), pp.
777-780.
[102] Hermansky, H. & Morgan, N. (1994): RASTA processing of speech, IEEE
Trans. on Speech and Audio Processing, vol. 2, no. 4 pp. 578-589.
[103] Hermansky, H., Wan, E., & Avendano, C. (1995): Speech enhancement
based on temporal processing, Proc. IEEE Intl. Conf. on Acoustics, Speech,
& Signal Processing, 405- 408.
[104] Hermansky, H. and D. Broad (1989): The effective second formant F2’ and
the vocal tract front cavity, Proc. Int. Conf. Acoust. Speech and Sig. Proc.
89, 480-483.
[105] Hermansky, H., N. Morgan, A. Bayya and P. Kohn (1991): Compensation for
the effect of the communication channel in auditory-like analysis of speech
(RASTA-PLP), Proc. Eurospeech ’91, pp. 1367-1371, Genova, Italy.
[106] Hermansky, H. and Pavel, M. (1995): Psychophysics of Speech Engineering
Systems, Invited paper, 13th International Congress on Phonetic Sciences,
Stockholm, Sweden, August 1995.
[107] Hermes, D. J. (1988): Measurement of pitch by subharmonic summation. J.
Acoust. Soc. Am., 83(1), 257-264.
[108] Hewlett, W. B. (ed.) (1998): Melodic Similarity: Concepts, Procedures,
and Applications. Computing in Musicology. Cambridge, MA: MIT Press.
[109] Hirsch, I. J. & Watson, C. S. (1996): Auditory psychophysics and
perception. Annual Review of Psychology 47, 461-484.
[110] Hirsch, H.G., P. Meyer, and H. Ruehl: Improved speech recognition using
high-pass filtering of subband envelopes, Proc. Eurospeech ’91, 1991, Genova,
Italy.
[111] Holleran, S., Jones, M. R. & Butler, D. (1995): Perceiving musical
harmony: The influence of melodic and harmonic context. J. Experiment. Psych.:
Learning, Memory, and Cognition 21(3), 737-753.
[112] Huron, D. (1991): Tonal Consonance versus tonal fusion in polyphonic
sonorities. Music Perception, 9(2),
135-154.
[113] Huron, D. & Sellmer, P. (1992): Critical bands and the spelling of
vertical sonorities. Music Perception, 10(2), 129-150.
[114] Huang, X., Acero, A., Hon, H.-W.: Spoken Language Processing, Microsoft
Research, Prentice Hall PTR, 2001, ISBN: 0-13-022616-5
[115] Irino, T. & Patterson, R. D. (1996): Temporal asymmetry in the
auditory system. J. Acoust. Soc. Am., 99(4), 2316-2331.
[116] Izmirli, Ö. & Bilgen, S. (1996): A model for tonal context time
course calculation from acoustical input. J. of New Music Res., 25(3), 276-288.
[117] Jelinek, F.: Statistical Methods for Speech Recognition, The MIT Press,
1999, Second Edition, 0-262-10066-5
[118] Johnson-Laird, P. N. (1991b): Rhythm and meter: A theory at the
computational level. Psychomusicology, 10, 88-106.
[119] Jones, M. R. & Boltz, M. (1989): Dynamic attending and responses to
time. Psychological Review 96(3), 459-491.
[120] Juslin, P. N. (1997): Emotional communication in music performance: A
functionalist perspective and some data. Music Perception 14(4), 383-418.
[121] Kashino, K. & Murase, H. (1997): Sound source identification for
ensemble music based on the music stream extraction. In Proceedings of the 1997
Int. Joint Conf. on AI Workshop on Computational Auditory Scene Analysis (pp.
127-134): Tokyo.
[122] Kashino, K., Nakadai, K., Kinoshita, T. & Tanaka, H. (1995):
Application of Bayesian probability network to music scene analysis. In
Proceedings of the 1995 Int. Joint Conf. on AI Workshop on Computational
Auditory Scene Analysis (pp. 52-59): Montreal.
[123] Klassner, F. I. (1996): Data Preprocessing in Signal Understanding
Systems. Ph.D. thesis, University of Massachusetts Computer Science, Amherst,
MA.
[124] Klassner, F. I., Lesser, V. & Nawab, S. H. (1998): The IPUS
blackboard architecture as a framework for computational auditory scene
analysis. In D. F. Rosenthal & H. Okuno (eds.), Readings in Computational
Auditory Scene Analysis (pp. 177-193): Mahwah, NJ: Erlbaum.
[125] Kohonen, T. (1995): Self-Organizing Maps. Berlin:
Springer-Verlag
[126] Krumhansl, C. L. (1979): The psychological representation of musical
pitch in a tonal context. Cognitive Psych.,
11, 346-374.
[127] Krumhansl, C. L. (1991a): Memory for musical surface. Memory & Cognition
19(4), 401-411.
[128] Krumhansl, C. L. (1991b): Music psychology: tonal structures in
perception and memory. Annual Rev. Psych., 42, 277-303.
[129] Krumhansl, C. L. (1997): Effects of perceptual organization and musical
form on melodic expectancies. In: M. Leman (ed.), Music, Gestalt, and
Computing: Studies in Systematic and Cognitive Musicology (pp. 294-320):
Berlin: Springer.
[130] Krumhansl, C. L., Kessler, E. J. (1982): Tracing the dynamic changes in
perceived tonal organization in a spatial representation of musical keys.
Psychological Review 89(4), 334-368.
[131] Kuhn, W. B. (1990): A real-time pitch recognition algorithm for music
applications. Comp. Music J., 14(3), 60-71.
[132] Langner, G. (1992): Periodicity coding in the auditory system. Hearing
Research 60(1), 115-142.
[133] Large, E. W., Kolen, J. F.
(1994): Resonance and the perception of musical meter. Connection Science,
6(2), 177-208.
[134] Lee, X. F., Logan, R. J., Pastore, R. E. (1991): Perception of acoustic
source characteristics: Walking sounds. J. Acoust. Soc. Am., 90(6), 3036-3049.
[135] Leman, M. (1989): Symbolic and subsymbolic information processing in
models of musical communication and cognition. Interface, 18, 141-160.
[136] Leman, M. (1994): Schema-based tone center recognition of musical
signals. J. of New Music Res., 23(2), 169-204.
[137] Levitin, D. J. (1994): Absolute memory for musical pitch: Evidence from
the production of learned melodies. Percept. Psychophysics, 56(4), 414-423.
[138] Levitin, D. J., Cook, P. R. (1996): Memory for musical tempo: Additional
evidence that auditory memory is absolute. Percept. Psychophysics 58(6), 927-935.
[139] Licklider, J. C. R. (1951a): Basic correlates of the auditory stimulus.
In S. S. Stevens (ed.), Handbook of Experimental Psychology (985-1035): New
York: Wiley.
[140] Licklider, J. C. R. (1951b): A duplex theory of pitch perception.
Experientia 7, 128-134.
[141] Liu, Z., Wang, Y. & Chen, T. (1998): Audio feature extraction and
analysis for scene segmentation and classification. J. VLSI Sig. Process., 20(1-2), 61-79.
[142] Longuet-Higgins, H. C. (1994): Artificial intelligence and musical
cognition. Philosophical Transactions of the Royal Society of London (A) 349,
103-113.
[143] Maes, P., Lashkari, Y. & Metral, M. (1997): Collaborative interface
agents. In M. H. Huhns & M. P. Singh (eds.), Readings in Agents. New York:
Morgan Kaufmann Publishers.
[144] Maher, R. C. (1990): Evaluation of a method for separating digitized
duet signals. J. Audio Eng. Soc. 38(12), 956-979.
[145] Mani, R. (1999): Knowledge-based processing of multicomponent signals in
a musical application. Signal Processing, 74(1), 47-69.
[146] Marr, D. (1982): Vision: A Computational Investigation into the Human
Representation and Processing of Visual Information. New York: W.H. Freeman
& Co
[147] Martin, K. D. (1996a): Automatic transcription of simple polyphonic
music: Robust front-end processing. MIT Media Laboratory Perceptual Computing
Technical Report #399, Cambridge MA. Available from
http://vismod.www.media.mit.edu/vismod/publications.
[148] Martin, K. D. (1996b): A blackboard system for automatic transcription
of simple polyphonic music. MIT Media Laboratory Perceptual Computing Technical
Report #385, Cambridge MA. Available from
http://vismod.www.media.mit.edu/vismod/publications.
[149] Martin, K. D. (1999): Sound-Source Recognition: A Theory and
Computational Model. Ph.D. thesis, MIT, Department of Electrical Engineering
and Computer Science, Cambridge, MA.
[150] Martin, K. D., Scheirer, E. D. & Vercoe, B. L. (1998): Musical
content analysis through models of audition. In Proceedings of the 1998 ACM
Multimedia Workshop on Content-Based Processing of Music. Bristol UK.
[151] McAdams, S. (1984): Spectral Fusion, Spectral Parsing, and the Formation
of Auditory Images. Ph.D. thesis, Stanford University CCRMA, Dept of Music,
Stanford, CA.
[152] McAdams, S. (1987): Music: A science of the mind? Contemporary Music
Review 2, 1-61.
[153] McAdams, S. (1989): Segregation of concurrent sounds. I: Effects of
frequency modulation coherence. J. Acoust. Soc. Am. 86(6), 2148-2159.
[154] McAdams, S., Botte, M.-C. & Drake, C. (1998): Auditory continuity
and loudness computation. J. Acoust. Soc.
Am., 103(3), 1580-1591.
[155] McAulay, R. J. & Quatieri, T. F. (1986): Speech analysis/synthesis
based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech,
and Signal Processing 34(4), 744-754.
[156] McCabe, S. L. & Denham, M. J. (1997): A model of auditory streaming.
J. Acoust. Soc. Am., 101(3), 1611-1621.
[157] Meddis, R. & Hewitt, M. J. (1991): Virtual pitch and phase
sensitivity of a computer model of the auditory periphery. I: Pitch
identification. J. Acoust. Soc. Am.,
89(6), 2866-2882.
[158] Meddis, R. & Hewitt, M. J. (1992): Modeling the identification of
concurrent vowels with different fundamental frequencies. J. Acoust. Soc. Am., 91(1), 233-244.
[159] Mellinger, D. K. (1991): Event Formation and Separation in Musical
Sound. Ph.D. thesis, Stanford University Dept. of Computer Science, Palo Alto
CA.
[160] Minami, K., Akutsu, A., Hamada, H. & Tonomura, Y. (1998): Video
handling with music and speech detection. IEEE Multimedia 5(3), 17-25.
[161] Mont-Reynaud, B. M., Mellinger, D. K.: Source separation by frequency
co-modulation. Proc 1st Int Conf on Music Perception and Cognition, Kyoto,
1989.
[162] Moore, B. C. J. (1997): An Introduction to the Psychology of Hearing.
San Diego: Academic Press
[163] Moore, B.C.J (1995): Hearing, Academic Press London, 1995
[164] Moorer, J. A. (1977): On the transcription of musical sound by computer.
Comp. Music J. 1(4), 32-38.
[165] Nawab, S. H., Espy-Wilson, C. Y., Mani, R. & Bitar, N. N. (1998):
Knowledge-based analysis of speech mixed with sporadic environmental sounds. In
D. F. Rosenthal & H. Okuno (eds.), Readings in Computational Auditory Scene
Analysis (pp. 177-193): Mahwah, NJ: Erlbaum.
[166] Ng, K., Boyle, R. & Cooper, D. (1996): Automatic detection of
tonality using note distribution. J. New Music Res., 25(4), 369-381.
[167] Parncutt, R. (1994a): A perceptual model of pulse salience and metrical
accent in musical rhythms. Music Perception 11(2), 409-464.
[168] Parncutt, R. (1994b): Template-matching models of musical pitch and
rhythm perception. J. New Music Res., 23, 145-167.
[169] Parsons, T.W.: Separation of speech from interfering speech by means of
harmonic selection. J. Acoust. Soc. Am. 60, 1976
[170] Patterson, R. D., Allerhand, M. H. & Giguere, C. (1995): Time-domain
modeling of peripheral auditory
processing: A modular architecture and a software platform. J. Acoust.
Soc. Am., 98(4), 1890-1894.
[171] Patterson, R. D., Moore, B.C. J.: Auditory Filters and Excitation
Patterns as Representation of Frequency Resolution. In Hearing B.C.J. Moore
(eds.), Academic Press London, 1986
[172] Patterson R. D: A pulse ribbon model of monaural phase perception. J.
Acoust. Soc. Am. 82(5), 1987
[173] Pereira, F. & Koenen, R. H. (2000): MPEG-7: Status and directions.
In A. Puri & T. Chen (eds.), Advances in Multimedia: Signals, Standards,
and Networks (pp. 611-630):
[174] Pfeiffer, S., Fischer, S. and Effelsberg, W.: Automatic Audio Content
Analysis. University of Mannheim, 1997.
[175] Perrott, D. & Gjerdingen, R. O. (1999): Scanning the dial: An
exploration of factors in the identification of musical style. In Proceedings
of the 1999 Society for Music Perception & Cognition (pp. 88 (abstract)):
Evanston, IL.
[176] Pielemeier, W. J., Wakefield, G. H. & Simoni, M. H. (1996):
Time-frequency analysis of musical signals. Proc IEEE 84(9), 1216-1230.
[177] Piszczalski, M. & Galler, B. A. (1977): Automatic music
transcription. Computer Music Journal 1(4), 24-31. Piszczalski, M. &
Galler, B. A. (1983): A computer model of music recognition. In M. Clynes
(ed.),
[178] Plomp, R. & Levelt, W. J. M. (1965): Tonal consonance and critical
bandwidth. J. Acoust. Soc. Am., 38(2), 548-560.
[179] Povel, D. & Okkerman, H. (1981): Accents in equitone sequences. Percept. Psychophysics 30(3), 565-572.
[180] Povel, D.-J. & Essens, P. (1985): Perception of temporal patterns. Music Perception 2(2), 411-480.
[181] Povel, D.-J. & van Egmond, R. (1993): The function of accompanying
chords in the recognition of melodic fragments. Music Perception 11(2),
101-115.
[182] Quatieri, T. F. & McAulay, R. J. (1998): Audio signal processing
based on sinusoidal analysis/synthesis. In M. Kahrs & K. Brandenburg
(eds.), Applications of Digital Signal Processing to Audio and Acoustics (pp.
343-411): New York: Kluwer Academic.
[183] Quatieri, T. F., Danisewicz, D. G. (1990): An approach to co-channel
talker interference suppression using a sinusoidal model for speech. IEEE Tr.
ASSP 38(1).
[184] Rabiner, L. R., Cheng, M. J., Rosenberg, A. E. & McGonegal, C. A.
(1976): A comparative performance study of several pitch detection algorithms. IEEE Trans ASSP 24(5), 399-418.
[185] Roads, C., Pope, S. T., Piccialli, A. & de Poli, G. (eds.) (1997): Musical Signal Processing. Stud. New Music Res. Lisse, NL: Swets & Zeitlinger.
[186] Robinson, K. (1993): Brightness and octave position: Are changes in
spectral envelope and in tone height perceptually equivalent? Contemporary
Music Review 9(1,2), 83-95.
[187] Rose, M. M. & Moore, B. C. J. (1997): Perceptual grouping of tone
sequences by normally hearing and hearing-impaired listeners. J. Acoust. Soc.
Am., 102(3), 1768-1778.
[188] Rosenthal, D. F. (1992): Machine Rhythm: Computer Emulation of Human
Rhythm Perception. Ph.D. thesis, MIT Media Laboratory, Cambridge, MA.
[189] Rosenthal, D. F., Okuno, H. G.
(eds.), Computational Auditory Scene Analysis (pp. 27-42): Mahwah, NJ: Lawrence
Erlbaum. 1998, ISBN 0-8058-2283-6
[190] Sandell, G. J. (1995): Roles for spectral centroid and other factors in determining
"blended" instrument pairings in orchestration. Music Perception
13(2), 209-246.
[191] Saint-Arnaud, N.: Classification of Sound Textures, M.S. Thesis in Media
Arts and Sciences, MIT, 1995
[192] Scheirer, E.D.: Towards music
understanding without separation: segmenting music with correlogram
comodulation, IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics, 1999
[193] Scheirer, E. D. (1995): Extracting expressive performance information
from recorded music. M.S. thesis, MIT Media Laboratory, Cambridge, MA.
[194] Scheirer, E. D. (1996): Bregman’s chimerae: Music perception as auditory
scene analysis. In Proceedings of the 1996 International Conference on Music
Perception and Cognition (pp. 317-322): Montreal: Society for Music Perception
and Cognition.
[195] Scheirer, E. D. (1998a): Tempo and beat analysis of acoustic musical
signals. J. Acoust. Soc. Am., 103(1),
588-601.
[196] Scheirer, E. D. (1998b): Using musical knowledge to extract expressive
performance information from recorded signals. In D. F. Rosenthal & H.
Okuno (eds.), Readings in Computational Auditory Scene Analysis (pp. 361-380):
Mahwah, NJ: Lawrence Erlbaum.
[197] Scheirer, E. D., Slaney, M. (1997): Construction and evaluation of a
robust multifeature speech/music discriminator. In Proceedings of the 1997 IEEE
International Conference on Acoustics, Speech, and Signal Processing (pp.
1331-1334): Munich: IEEE.
[198] Scheirer, E. D. & Vercoe, B. L. (1999): SAOL: The MPEG-4 Structured
Audio Orchestra Language. Computer Music Journal 23(2), 31-51.
[199] Schellenberg, E. G. (1996): Expectancy in melody: Tests of the
implication-realization model. Cognition, 58(1), 75-125.
[200] Schmuckler, M. A. & Boltz, M. G. (1994): Harmonic and rhythmic
influences on musical expectancy. Percept. Psychophysics 56(3), 313-325.
[201] Shamma, S. A. (1996): Auditory cortical representation of complex
acoustic spectra as inferred from the ripple analysis method. Network -
Computation in Neural Systems 7(3), 439-476.
[202] Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J. & Ekelid, M.
(1995): Speech recognition with primarily temporal cues. Science 270, 303-304.
[203] Shepard, R. N. (1964): Circularity in judgments of relative pitch. J.
Acoust. Soc. Am., 36(12), 2346-2353.
[204] Shepard, R. N. (1982): Geometrical approximations to the structure of
musical pitch. Psychological Review,
89(4), 305-333.
[205] Slaney, M. (1994): Auditory toolbox. Apple
Computer, Inc. Technical Report #45, Cupertino CA. Available
from http://www.interval.com/~malcolm.
[206] Slaney, M. (1997): Connecting correlograms to neurophysiology and
psychoacoustics. In Proceedings of the 1997 XIth International Symposium on
Hearing. Lincolnshire UK.
[207] Slaney, M. (1998): A critique of pure audition. In D. F. Rosenthal &
H. G. Okuno (eds.), Computational Auditory Scene Analysis (pp. 27-42): Mahwah,
NJ: Lawrence Erlbaum.
[208] Slaney, M. & Lyon, R. F. (1990): A perceptual pitch detector. In
Proceedings of the 1990 International Conference on Acoustics, Speech, and
Signal Processing (pp. 357-360): Albuquerque: IEEE.
[209] Slaney, M. & Lyon, R. F. (1991): Apple Hearing Demo Reel. Apple Computer, Inc. Technical Report #25,
Cupertino CA. Available from malcolm@interval.com.
[210] Slaney, M., Naar, D. & Lyon, R. F. (1994): Auditory model inversion
for sound separation. In Proceedings of the 1994 ICASSP. Adelaide AU.
[211] Smaragdis, P.J.: Information Theoretic Approaches to Source Separation,
M.S. Thesis in Media Arts and Sciences, MIT,
1997
[212] Smith, J. D. (1997): The place of musical novices in music science.
Music Perception 14(3), 227-262.
[213] Smoliar, S. W. (1995): Parsing, structure, memory and affect. J. of New
Music Res. 24(1), 21-33.
[214] Snyder, J. S. & Krumhansl, C. L. (1999): Cues to pulse-finding in
piano ragtime music. In Proceedings of the 1999 Society for Music Perception
and Cognition. Evanston, IL.
[215] Summerfield, Q., Lea, A. & Marshall, D. (1990): Modelling auditory
scene analysis: strategies for source segregation using autocorrelograms.
Proceedings of the Institute of Acoustics 12(10), 507-514.
[216] Temperley, D. (1997): An algorithm for harmonic analysis. Music
Perception 15(1), 31-68.
[217] Terhardt, E. (1974): Pitch, consonance, and harmony. J. Acoust. Soc.
Am., 55(5), 1061-1069.
[218] Terhardt, E. (1978): Psychoacoustic evaluation of musical sounds.
Percept. Psychophysics 23(6), 483-492.
[219] Terhardt, E. (1991): Music perception and sensory information
acquisition: Relationships and low-level analogies. Music Perception 8(3),
217-240.
[220] Terhardt, E., Stoll, G. & Seewann, M. (1982): Algorithm for
extraction of pitch and pitch salience from complex tonal signals. J. Acoust.
Soc. Am., 71(3), 679-688.
[221] Thompson, W. F. (1993): Modeling perceived relationships between melody,
harmony, and key. Percept. Psychophysics 53(1), 13-24.
[222] Thompson, W. F. & Parncutt, R. (1997): Perceptual judgments of
triads and dyads: Assessment of a psychoacoustic model. Music Perception 14(3),
263-280.
[223] Thomson, W. (1993): The harmonic root: A fragile marriage of concept and
percept. Music Perception, 10(4), 385-416.
[224] Todd, N. P. M. (1994): The auditory "primal sketch": A
multiscale model of rhythmic grouping. J. New Music Res., 23, 25-70.
[225] Van Immerseel, L. M. & Martens, J.-P. (1992): Pitch and
voiced/unvoiced determination with an auditory model. J. Acoust. Soc. Am.
91(6), 3511-3526.
[226] van Noorden, L. P. A. S. (1977): Minimum differences of level and
frequency for perceptual fission of tone sequences ABAB. J. Acoust. Soc. Am.,
61(4), 1041-1045.
[227] Vercoe, B. L. (1984): The synthetic performer in the context of live
performance. In Proceedings of the 1984 International Computer Music Conference
(pp. 199-200): Paris: International Computer Music Association.
[228] Vercoe, B. L. (1988): Hearing polyphonic music on the connection
machine. In Proceedings of the 1988 First AAAI Workshop on Artificial
Intelligence and Music (pp. 183-194). Minneapolis.
[229] Vercoe, B. L. (1997): Computational auditory pathways to music
understanding. In I. Deliège & J. Sloboda (eds.), Perception and Cognition
of Music (pp. 307-326). London: Psychology Press.
[230] Vercoe, B. L., Gardner, W. G. & Scheirer, E. D. (1998): Structured
audio: The creation, transmission, and rendering of parametric sound
representations. Proceedings of the IEEE 85(5), 922-940.
[231] Vercoe, B. L. & Puckette, M. S. (1985): Synthetic rehearsal:
Training the synthetic performer. In Proceedings of the 1985 ICMC (pp.
275-278). Burnaby BC, Canada.
[232] Verhey, J. L., Dau, T. & Kollmeier, B. (1999): Within-channel cues
in comodulation masking release (CMR): Experiments and model predictions using
a modulation-filterbank model. J. Acoust. Soc. Am., 106(5), 2733-2745.
[233] Versnel, H. & Shamma, S. A. (1998): Spectral-ripple representation
of steady-state vowels in primary auditory cortex. J. Acoust. Soc. Am., 103(5),
2502-2514.
[234] Vliegen, J. & Moore, B. C. J. (1999): The role of spectral and
periodicity cues in auditory stream segregation, measured using a temporal
discrimination task. Journal of the Acoustical Society of America, 106(2),
938-945.
[235] Vliegen, J. & Oxenham, A. J. (1999): Sequential stream segregation
in the absence of spectral cues. J. Acoust. Soc. Am., 105(1), 339-346.
[236] Vos, P. G. & Van Geenen, E. W. (1996): A parallel-processing
key-finding model. Music Perception 14(2), 185-224.
[237] Wang, D. (1996): Primitive auditory segregation based on oscillatory
correlation. Cognitive Science 20, 409-456.
[238] Wang, K. & Shamma, S. A. (1995): Spectral shape-analysis in the central
auditory system. IEEE TSAP, 3(5), 382-395.
[239] Wang, D.: Stream Segregation Based on Oscillatory Correlation. In:
Rosenthal, D.F., Okuno, H. G. (Eds.): Computational Auditory Scene Analysis.
Lawrence Erlbaum Associates, Inc., Publishers
[240] Warren, R. M. (1970): Perceptual restoration of missing speech sounds.
Science 167, 392-393.
[241] Warren, R. M., Obusek, C. J. & Ackroff, J. M. (1972): Auditory
induction: perceptual synthesis of absent sounds. Science 176, 1149-1151.
[242] Warren, W. H. & Verbrugge, R. R. (1984): Auditory perception of
breaking and bouncing events: A case study in ecological acoustics. Journal of
Experimental Psychology: Human Perception and Performance 10(4), 704-712.
[243] Weintraub, M. (1985): A Theory and Computational Model of Auditory
Monaural Sound Separation. Ph.D. thesis, Stanford University Dept. of
Electrical Engineering, Palo Alto, CA.
[244] Wightman, F. L. (1973): The pattern-transformation model of pitch. J.
Acoust. Soc. Am., 54(2), 407-416.
[245] Wold, E., Blum, T., Keislar, D. & Wheaton, J. (1996): Content-based
classification, search, and retrieval of audio. IEEE Multimedia 3(3), 27-36.
[246] Yost, W. A. (1991): Auditory image perception and analysis: The basis
for hearing. Hearing Research 56,
- Janku, L.: Sound Source Separation through
Models of Auditory Processes and Fuzzy-Rule System. In: MOSIS
01-Modelling and Simulation of Systems. MARQ, 2001, vol. 1, p. 195-200. ISBN
80-85988-57-7.
- Janků, L. - Lhotska, L.: Musical Instrument Classification through the Model of
Auditory Periphery and a Neural Network. In: N. Mastorakis, V. Mladenov,
B. Suter and L.J. Wand (Editors): Advances in Scientific Computing,
Computational Intelligence and Applications, WSES Press, 2001.
- Janků, L. - Lhotska, L.: Towards the Application of Fuzzy Logic to the Sound
Recordings Fingerprint Computation and Comparison. In: Recent Advances
in Computers, Computing and Communications. New Jersey. WSEAS Press, 2002.
- Janku, L.: An Efficient Machine Listening Approach
to Sound Source Separation. In: Proceedings of Workshop 2001. Prague : CTU,
2001, vol. A, p. 262-263. ISBN 80-01-02335-4.
- Janků, L.: Psychical States Estimation Based on Voice Analysis and Speech
Recognition I - Introduction. [Research Report]. Prague : CTU FEE,
Department of Cybernetics, BIO Laboratory, 1999. BIO333-07/99. (in Czech).
- Janků, L. - Eck, V.: Psychical States Estimation Based on Voice Analysis and
Speech Recognition II - Methods of Voice Analysis. [Research Report].
Prague : CTU FEE, Department of Cybernetics, BIO Laboratory, 1999.
BIO333-15/99. (in Czech).
- Janků, L. - Eck, V.: Workplace for Audio Signal Processing and Recognition, I
& II. [Research Reports]. Prague : CTU FEE, Department of
Cybernetics, BIO Laboratory, 1999. BIO333. (in Czech).