Welcome to the Speech and Signal Processing Laboratory
For the past few years we have been collaborating with auditory
scientists (from the University of Amsterdam and the Eaton-Peabody Lab for
Auditory Physiology (MIT/Harvard)) to learn how the auditory
system processes acoustic signals such as speech, encodes them, and
draws inferences from them. Our goal is to identify those aspects of
auditory processing that are responsible for its superiority over
current artificial implementations and to emulate the practically
useful ones in a computer.

The biggest barrier to widespread use of automatic
speech recognition (ASR) systems in real-life situations is their
unreliable performance in background noise and interference. In marked
contrast to current artificial systems, human listeners are able to
correctly identify speech utterances in many acoustically challenging
contexts. Humans also do remarkably well at separating individual
voices from those of other speakers and from acoustic clutter of all
sorts (the cocktail-party effect). How are we able to do this?
Examination of auditory perception and its neurophysiological basis
suggests to us that this difference is due to powerful sound-separation
mechanisms coupled with robust spectro-temporal representations of
signals used by the auditory system.

Currently, every speech-recognition system that engineers have built
uses framewise feature vectors. The feature vectors are derived from
short-term spectral envelopes computed by standard spectral analysis or
by using a bank of fixed bandpass filters (BPFs). When speech is
degraded by noise, interference, and channel effects (such as telephone
channels and reverberation), perturbations at one frequency affect the
entire feature vector, rendering the extracted features vulnerable.
This type of framewise spectral-envelope extraction, which models the
speech and the interference together, is at odds with how the auditory
system processes and recognizes speech.
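As a point of reference, the following is a minimal sketch, in Python with NumPy/SciPy, of this kind of framewise front end. The frame length, the uniform band layout, and the function name framewise_features are illustrative assumptions, not the feature extractor of any particular recognizer. It shows how interference confined to a single frequency, once the short-term spectrum is pooled into log band energies and passed through a cepstral DCT, perturbs every component of every frame's feature vector.

# Minimal sketch of a conventional framewise front end (illustrative only).
import numpy as np
from scipy.fft import dct
from scipy.signal import stft

def framewise_features(x, fs, n_bands=20, n_cep=13, frame_len=0.025, hop=0.010):
    """One feature vector per frame: pool the short-term power spectrum into
    fixed bands (a crude BPF bank), take logs, then apply a DCT across bands
    (cepstrum-like coefficients, as in MFCC-style front ends)."""
    nperseg = int(frame_len * fs)
    noverlap = nperseg - int(hop * fs)
    f, t, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    power = np.abs(X) ** 2                                # (freq bins, frames)
    edges = np.linspace(0, len(f), n_bands + 1, dtype=int)
    bands = np.stack([power[edges[k]:edges[k + 1]].sum(axis=0)
                      for k in range(n_bands)], axis=1)   # (frames, bands)
    log_env = np.log(bands + 1e-12)
    return dct(log_env, type=2, norm="ortho", axis=1)[:, :n_cep]

if __name__ == "__main__":
    fs = 8000
    rng = np.random.default_rng(0)
    x = rng.standard_normal(fs)                           # stand-in for a speech signal
    tone = 0.5 * np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)  # narrowband interference
    clean = framewise_features(x, fs)
    noisy = framewise_features(x + tone, fs)
    # The interference lives at a single frequency, yet after log-band pooling
    # and the DCT it spreads across all coefficients of each frame's vector.
    print("mean change per coefficient:",
          np.round(np.mean(np.abs(noisy - clean), axis=0), 3))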
In the auditory system, by contrast, sound components are spectrally and
temporally separated, analyzed, and subsequently fused into unified
objects, streams, and voices that exhibit perceptual attributes such as
pitch, timbre, loudness, and location.
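By way of contrast, here is an equally rough sketch of a fixed multichannel, spectro-temporal representation: the signal is split by a bank of bandpass filters and each channel's temporal envelope is retained. The Butterworth bands and the function name channel_envelopes are illustrative stand-ins for cochlea-like filtering, not the auditory model used in this research; the point is simply that a narrowband disturbance stays confined to the few channels whose passbands cover it.

# Minimal sketch of a multichannel spectro-temporal representation (illustrative only).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def channel_envelopes(x, fs, center_freqs):
    """Band-pass the signal into fixed channels and return each channel's
    temporal (Hilbert) envelope: a time-frequency surface rather than a
    single framewise spectral-envelope vector."""
    envs = []
    for fc in center_freqs:
        lo, hi = 0.8 * fc, 1.2 * fc                      # roughly 1/3-octave bands
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        envs.append(np.abs(hilbert(band)))
    return np.stack(envs)                                 # (channels, samples)

if __name__ == "__main__":
    fs = 8000
    rng = np.random.default_rng(0)
    x = rng.standard_normal(fs)                           # stand-in for a speech signal
    fcs = np.geomspace(200, 3200, 12)                     # log-spaced channel centers
    clean = channel_envelopes(x, fs, fcs)
    tone = 0.5 * np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
    noisy = channel_envelopes(x + tone, fs, fcs)
    # The 1 kHz interference disturbs only the channels whose passbands cover it;
    # the remaining channels' envelopes are nearly untouched.
    change = np.mean(np.abs(noisy - clean), axis=1)
    for fc, c in zip(fcs, change):
        print(f"{fc:7.1f} Hz  envelope change {c:.3f}")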
We propose to develop methods and algorithms to process complex acoustic
signals observed by one or more acoustic sensors. The long-term goal is
to develop a machine that can deal with the day-to-day booming, buzzing
acoustic environment around us and draw inferences from sounds the way
human beings and animals do. Current signal-analysis methods are
inadequate for this purpose. Since the auditory system provides an
existence proof of such a system, it seems reasonable to use it as an
inspiration for our strategy. However, our algorithm development is
anchored in fundamental signal-processing principles. The major aims of
our current research are as follows:
Recent Publications
Acknowledgements
This research was supported by grants from the National Science
Foundation under grant numbers EIA-0130793 and CCR-0105499.
Email Dr. Kumaresan (kumar@ele.uri.edu) with comments.
Page last updated: September 10, 2009