Dream & Passion: January 2013

As currently working on recognizing speech under noisy conditions, I'm always wondering what's the best way of solving and evaluating this problem and how human perceive speech under noisy environments.

For research purpose, we usually collect clean data and pure noise data and then generate noisy speech by combining them. Sometimes, we use some filters to create the channel distortion effects. Speech at different SNRs are created and used for evaluation. Like in the Aurora 2 dataset, 6 SNRs from 20dB to -5dB are used for evaluation.

To understand the difficulty in recognizing the noisy speech, the spectrograms of them at different SNRs are plotted. From these relatively high resolution spectrograms, at SNR0 the patterns are already quite confusing. At SNR-5 it is hard to extract speech patterns from the noise.

While in speech recognition, the FBank features used are rather low resolution to the spectrograms. Due to the value ranges, the patterns are relatively hard. That's also why usually we use CMVN to preprocess the features before sending to NNs.

The CMVN normalized FBank features are shown as follows. The dynamic parameters actually helps a lot to locate the patterns.

Although from my experience, the dynamic coefficients are really helpful. I always have the question of whether is that because the high dimension. If we use higher dimensional static features, we will have much more detailed information, will that outperform the dynamic features? However, when I try to extract the same number of FBanks, there are several dimensions always giving 0 values. This may be saying the current feature extraction methods are in some aspects limited.
First the per utterance normalized static parts (40D) of the above features are displayed below:

Following are illustrations of 80 FBanks of static features.

Although more FBanks are visually more favorable (at least to me), it is hard to say how the ASR can benefit form them. Maybe some automatically learnt features directly from the waveform signals would be helpful.

Thursday, January 24, 2013

Noisy speech