Characteristics of the localized time-frequency features:
1) Locality in frequency: unlike MFCCs, where every feature is affected by all frequencies, each feature here depends only on a local frequency band;
2) Temporal dynamics: the features model long and variable time durations, whereas MFCC features are all short-time and of fixed duration.
In this paper, the set of filters adopted is very simple: they are essentially basic edge detectors taking only the values +1 and -1. The selection includes vertical edges (of varying frequency span and temporal duration) for onsets and offsets; wide horizontal edges for frication cutoffs; and horizontal edges tilted at various slopes to model formant transitions. The ranges of the various parameters were chosen based on acoustic-phonetic knowledge, such as typical formant bandwidths, average phone durations, and typical rates of formant movement [Book: Acoustic Phonetics].
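The three filter families described above can be sketched as small ±1 masks. This is a minimal illustration, not the paper's exact parameterization; the function names, mask sizes, and the way the tilted boundary is discretized are my own assumptions.

```python
import numpy as np

def vertical_edge_filter(freq_span, time_dur):
    """Onset/offset detector: -1 over the first half of the time
    window, +1 over the second half, across `freq_span` frequency bins."""
    f = np.ones((freq_span, time_dur))
    f[:, : time_dur // 2] = -1.0
    return f

def horizontal_edge_filter(freq_span, time_dur):
    """Frication-cutoff detector: -1 over the lower frequency half,
    +1 over the upper half."""
    f = np.ones((freq_span, time_dur))
    f[: freq_span // 2, :] = -1.0
    return f

def tilted_edge_filter(freq_span, time_dur, slope):
    """Formant-transition detector: a horizontal edge whose -1/+1
    boundary rises by `slope` frequency bins per time step."""
    f = np.empty((freq_span, time_dur))
    for t in range(time_dur):
        boundary = freq_span // 2 + int(round(slope * (t - time_dur // 2)))
        boundary = int(np.clip(boundary, 0, freq_span))
        col = np.ones(freq_span)
        col[:boundary] = -1.0  # below the boundary: -1; above: +1
        f[:, t] = col
    return f
```

With slope 0 the tilted filter reduces to the plain horizontal edge, which is a quick sanity check on the construction.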
With these filters, the features are computed as follows. For each filter:
1) center it at a particular frequency, and convolve it with the whole spectrogram along the time axis at that frequency;
2) for each frequency value, this yields a time series of convolution sums;
3) to reduce the feature dimension, the convolution sums are down-sampled over a 16*32 point grid;
4) the 32 frequency points are taken linearly between 0 and 4 kHz;
5) the 16 time points (specific to their task) are the centers of the 16 states of the HMM word model in the state alignment.
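The steps above can be sketched roughly as follows. This is a simplified illustration with assumed shapes: I slide the filter over all valid frequency centers and time offsets, then pick evenly spaced grid points, whereas the paper samples the 32 frequency points linearly in 0-4 kHz and takes the 16 time points from the HMM state alignment.

```python
import numpy as np

def filter_response(spectrogram, filt):
    """For each frequency at which `filt` (shape: freq x time) can be
    centered, correlate it with the spectrogram along the time axis.
    Returns an array of shape (n_freq_centers, n_time_offsets)."""
    F, T = spectrogram.shape
    fh, fw = filt.shape
    out = np.empty((F - fh + 1, T - fw + 1))
    for fc in range(F - fh + 1):
        rows = spectrogram[fc : fc + fh]   # the filter's frequency span
        for t in range(T - fw + 1):
            out[fc, t] = np.sum(rows[:, t : t + fw] * filt)
    return out

def downsample_to_grid(response, n_freq=32, n_time=16):
    """Down-sample the response to a 32 (freq) x 16 (time) grid by
    picking evenly spaced sample points (uniform spacing is an
    assumption; the paper uses state-alignment centers in time)."""
    fi = np.linspace(0, response.shape[0] - 1, n_freq).round().astype(int)
    ti = np.linspace(0, response.shape[1] - 1, n_time).round().astype(int)
    return response[np.ix_(fi, ti)]
```

A feature value is then one entry of the down-sampled grid for one filter shape, matching the "filter shape plus time-frequency point" definition below.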
Thus, a feature refers to both the filter shape, and the time-frequency point (16*32 point grid).
In this paper, the task is to classify isolated digits, so each spectrogram contains a single digit. The features are computed per digit, and are therefore global to the target class.
The problem with current frame-based features is their lack of localization, which makes it difficult to model speaker variability, as shown in the following figure: