This is a term that represents the wide range of speech, speaker, channel, and environmental conditions that people typically encounter, and routinely adapt to, when perceiving and recognizing speech. Currently, ASR systems deliver significantly degraded performance when they encounter audio that differs from the limited conditions under which they were originally developed and trained. This is true in many cases even when the differences are slight.
This focused research area would concentrate on creating and developing systems that are much more robust against variability and shifts in acoustic environments, reverberation, external noise sources, communication channels (e.g., far-field microphones, cellular phones), speaker characteristics (e.g., speaking style, nonnative accents, emotional state), and language characteristics (e.g., formal/informal styles, dialects, vocabulary, topic domain). New techniques and architectures are proposed to explore these critical issues in environments as diverse as meeting-room presentations and unstructured conversations. A primary focus would be exploring alternatives for automatically adapting to changing conditions along multiple dimensions, even simultaneously. The goal is to deliver accurate and useful speech transcripts automatically under many more environments and diverse circumstances than is now possible, thereby enabling many more applications. This challenging problem can productively draw on expertise and knowledge from related disciplines, including natural-language processing, information retrieval, and cognitive science.
Speaker Characteristics and Style
It is well known that speech characteristics vary widely among speakers due to many factors, including speaker physiology, speaking style (e.g., speech rate, spontaneity, emotional state), and accents (both regional and nonnative). The primary method currently used for making ASR systems more robust to variations in speaker characteristics is to include a wide range of speakers in the training data. Speaker adaptation mildly alleviates problems with new speakers within the "span" of known speaker and speech types but usually fails for new types.
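The flavor of the speaker adaptation mentioned above can be illustrated with MAP (maximum a posteriori) adaptation of a Gaussian mean, one standard way to shift speaker-independent model parameters toward a new speaker given a little adaptation data. This is a minimal sketch, not the article's specific method; the function name, the relevance factor value, and the toy data are illustrative assumptions.

```python
# Sketch: MAP adaptation of a single Gaussian mean toward a new speaker.
# The relevance factor tau and all example values are illustrative.

def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """Interpolate the speaker-independent mean with the new speaker's data.

    tau controls how strongly the prior is trusted: with few adaptation
    frames the adapted mean stays near the prior; with many frames it
    approaches the speaker's own sample mean.
    """
    n = len(adaptation_frames)
    sample_sum = sum(adaptation_frames)
    return (tau * prior_mean + sample_sum) / (tau + n)

# Speaker-independent mean of some acoustic feature (e.g., a cepstral
# coefficient), and a handful of frames from a systematically shifted speaker.
si_mean = 0.0
frames = [1.0, 1.2, 0.8, 1.1, 0.9]
print(round(map_adapt_mean(si_mean, frames), 3))  # pulled partway toward 1.0
```

Because the adapted estimate is a weighted blend, a speaker far outside the "span" of the training population still starts from a poorly matched prior, which is one intuition for why adaptation tends to fail on genuinely new speaker types.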
Current ASR systems assume a pronunciation lexicon that models native speakers of a language, and they are trained on large amounts of speech from many native speakers. Several approaches to modeling accented speech have been explored, including explicit accent-specific models, adaptation of native acoustic models using accented speech data, and hybrid systems that combine these two approaches. Pronunciation variants have also been added to the lexicon to accommodate accented speech. Beyond small gains, however, the problem remains largely unsolved.
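The pronunciation-variant idea above can be sketched concretely: the lexicon maps each word to a set of phone sequences, and accented variants are simply added as alternative decoding paths. This is a minimal sketch under assumed conventions; the phone symbols follow ARPAbet, and the specific variants are illustrative, not drawn from any deployed system.

```python
# Sketch: an ASR pronunciation lexicon with accented variants added.
# Phones are ARPAbet symbols; all entries are illustrative examples.

lexicon = {
    # Canonical native-speaker pronunciations.
    "the": [["DH", "AH"], ["DH", "IY"]],
    "very": [["V", "EH", "R", "IY"]],
}

def add_variant(lexicon, word, phones):
    """Append an alternative pronunciation if it is not already listed."""
    variants = lexicon.setdefault(word, [])
    if phones not in variants:
        variants.append(phones)

# Some nonnative speakers realize /v/ as /w/, so the decoder is given
# that realization as an extra path for "very".
add_variant(lexicon, "very", ["W", "EH", "R", "IY"])
print(len(lexicon["very"]))  # two pronunciations are now available
```

The limitation the text notes follows from this design: every added variant enlarges the decoder's search space and can introduce new confusions between words, which is one reason lexicon-only fixes have yielded only small gains.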
Similarly, some progress has been made in automatically detecting speaking rate from the speech signal, but such knowledge is not yet exploited in ASR systems, mainly because there is no explicit mechanism for modeling speaking rate in the recognition process.
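The kind of signal-based rate detection mentioned here is often done by counting syllable-like peaks in a frame-energy envelope. The sketch below shows that idea in its simplest form; the frame length, threshold, and peak criterion are illustrative assumptions, and real estimators use smoothed sub-band energies rather than raw frame energy.

```python
# Sketch: estimating speaking rate by counting energy-envelope peaks,
# in the spirit of syllable-rate estimators. Parameters are illustrative.

def frame_energies(samples, frame_len):
    """Energy of consecutive non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def count_peaks(env, threshold):
    """Local maxima above threshold, used as a rough syllable count."""
    return sum(1 for i in range(1, len(env) - 1)
               if env[i] > threshold
               and env[i] >= env[i - 1] and env[i] > env[i + 1])

def speaking_rate(samples, sample_rate, frame_len=400, threshold=1.0):
    """Estimated syllables per second over the whole utterance."""
    env = frame_energies(samples, frame_len)
    return count_peaks(env, threshold) / (len(samples) / sample_rate)

# Synthetic check: three energy bursts ("syllables") in 0.55 s at 8 kHz.
burst, gap = [1.0] * 400, [0.0] * 800
signal = gap + burst + gap + burst + gap + burst + gap
print(round(speaking_rate(signal, 8000), 2))  # 3 peaks / 0.55 s -> 5.45
```

Even with such an estimate in hand, the text's point stands: current recognizers have no model parameter that the rate value could directly drive, so the information goes unused.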
Cited from: Baker, J., Deng, L., Glass, J., Khudanpur, S., Lee, C.-H., Morgan, N., and O'Shaughnessy, D., "Developments and Directions in Speech Recognition and Understanding," IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75-80, May 2009.