There has been much progress, and ASR technology is now in widespread use; however, a considerable gap remains between human and machine performance, particularly in adverse conditions.
How does human evolution make humans different from machines in perceiving speech signals?
What are the major differences between humans and machines when processing speech signals, and which of them are the crucial ones?
The parts-based model (PBM), based on previous work in machine vision, uses graphical models to represent speech with a deformable template of spectro-temporally localized "parts", as opposed to modeling speech as a sequence of fixed spectral profiles.
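A minimal sketch of the deformable-template idea behind the PBM (my own illustration, not the exact PBM formulation): each "part" is a small spectro-temporal template with a preferred anchor position in the spectrogram; during matching it may shift within a window, paying a deformation penalty, and the whole-template score sums the parts' best local matches. The function names, window size, and penalty are assumptions for illustration.

```python
import numpy as np

def match_part(spec, template, anchor, max_shift=2, deform_cost=0.5):
    """Best (match score - deformation penalty) for one part near its anchor."""
    f0, t0 = anchor
    fh, tw = template.shape
    best = -np.inf
    for df in range(-max_shift, max_shift + 1):
        for dt in range(-max_shift, max_shift + 1):
            f, t = f0 + df, t0 + dt
            if f < 0 or t < 0 or f + fh > spec.shape[0] or t + tw > spec.shape[1]:
                continue
            # correlation of the part's template with a local spectrogram patch
            score = np.sum(spec[f:f+fh, t:t+tw] * template)
            score -= deform_cost * (abs(df) + abs(dt))  # penalize displacement
            best = max(best, score)
    return best

def pbm_score(spec, parts):
    """Score of a deformable template = sum of its parts' best local matches."""
    return sum(match_part(spec, tpl, anc) for tpl, anc in parts)
```

A full PBM would place this in a graphical model with pairwise constraints between parts; the point here is only the contrast with a fixed spectral profile: each part is localized in time and frequency and is allowed to move.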
Perhaps most importantly, ASR systems have benefited greatly from general improvements in computer technology: very large training datasets have become available, and ever-increasing computing power allows more powerful search techniques to be used during recognition.
Reaching the ultimate goal of ASR - human-level (or beyond) performance in all conditions and on all tasks - will require investigating other regions of this landscape, even if doing so means backtracking on progress in the short term.
Utilize knowledge of acoustic phonetics and human speech perception in speech recognition.
We will argue that well-known phonetic cues crucial to human speech perception are not modeled effectively in standard ASR systems, and that there is a benefit to modeling such cues explicitly rather than implicitly.
The "glimpsing" model of speech perception, supported by empirical evidence and computational models, suggests that humans can robustly decode noise-corrupted speech by exploiting local time-frequency (T-F) regions with high SNR.
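The core of the glimpsing idea can be sketched in a few lines: given speech and noise energy on a T-F grid (both in dB), keep only the cells whose local SNR clears a threshold. The function name, the 3 dB threshold, and the random "spectrograms" below are illustrative assumptions, not values from the literature.

```python
import numpy as np

def glimpse_mask(speech_db, noise_db, snr_threshold_db=3.0):
    """Boolean mask of T-F cells where speech dominates the noise ("glimpses")."""
    local_snr = speech_db - noise_db          # per-cell SNR in dB
    return local_snr > snr_threshold_db

# Toy example with random energy grids standing in for real spectrograms.
rng = np.random.default_rng(0)
speech = rng.normal(10.0, 5.0, size=(64, 100))   # pretend speech energy (dB)
noise = rng.normal(8.0, 5.0, size=(64, 100))     # pretend noise energy (dB)
mask = glimpse_mask(speech, noise)
glimpsed = np.where(mask, speech, -np.inf)       # decode using only the glimpses
```

A glimpsing-style recognizer would then score hypotheses against only the masked cells, treating the rest as missing data.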
Auditory neuroscience: tonotopic maps. Also, recent research characterizing the behavior of individual neurons in the mammalian auditory cortex has produced models in which cortical neurons act as localized spectro-temporal pattern detectors, each represented by its so-called spectro-temporal receptive field (STRF).
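In computational terms, a model neuron's STRF is just a small 2-D kernel, and its response is the (rectified) correlation of that kernel with the spectrogram. The sketch below is a simplified linear-STRF illustration; the kernel values and rectification choice are assumptions.

```python
import numpy as np
from scipy.signal import correlate2d

def strf_response(spectrogram, strf):
    """Response map of one model neuron over a [freq, time] spectrogram."""
    r = correlate2d(spectrogram, strf, mode="valid")
    return np.maximum(r, 0.0)   # simple half-wave rectification

# Toy STRF loosely tuned to an upward frequency sweep (illustrative values).
strf = np.array([[0.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 0.0, 0.0]])
spec = np.random.default_rng(1).random((32, 50))
resp = strf_response(spec, strf)
```

A bank of such kernels, each localized in time and frequency, gives exactly the "localized T-F pattern detectors" listed below.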
+ Localized T-F pattern detectors
+ Explicit modeling of phonetic cues
acoustic signal ----> phonetic cues ----> phonemes
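The two-stage pipeline above can be sketched as a toy: hypothetical binary cue detectors (e.g. voicing, frication) feed a cue-pattern-to-phoneme-class table. The cue names, thresholds, and table entries are all illustrative assumptions, not claims about any real recognizer.

```python
# Maps a tuple of binary cues (voiced, fricated) to a coarse phoneme class.
CUE_TABLE = {
    (1, 0): "vowel-like",   # voiced, no frication
    (1, 1): "z-like",       # voiced fricative
    (0, 1): "s-like",       # unvoiced fricative
    (0, 0): "silence",
}

def detect_cues(frame_energy, periodicity):
    """Toy cue detectors: thresholds stand in for real acoustic-phonetic tests."""
    voiced = int(periodicity > 0.5)
    fricated = int(frame_energy > 0.7 and periodicity < 0.8)
    return (voiced, fricated)

def classify(frame_energy, periodicity):
    """acoustic measurements -> phonetic cues -> phoneme class."""
    return CUE_TABLE[detect_cues(frame_energy, periodicity)]
```

The point of the sketch is the intermediate representation: cues are detected explicitly and the phoneme decision is made from them, rather than mapping acoustics to phonemes in one opaque step.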