1. Two MLPs are used in a Tandem fashion for phone recognition. The first MLP nonlinearly converts the acoustic features into posterior features.
Acoustic features are known to exhibit a high degree of nonlinguistic variability, such as speaker and environmental (e.g. noise, channel) characteristics. The first MLP classifier can be interpreted as a discriminatively trained nonlinear transformation from the acoustic feature space to the posterior feature space. It has been shown that a well-trained MLP classifier (trained on a large population of speakers and on different conditions) can achieve invariance to speaker as well as environment characteristics. Moreover, it has also been shown that the effect of co-articulation is less severe on the posterior features than on the acoustic features.
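The two-stage arrangement in item 1 can be pictured with a minimal NumPy sketch. Everything here is an assumption made for illustration: the layer sizes, the tanh hidden units, the random untrained weights, the helper names `init_mlp` and `mlp_posteriors`, and feeding the raw posteriors straight into the second MLP (the notes do not say what transformation, if any, is applied in between).

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_mlp(rng, n_in, n_hidden, n_out):
    # Hypothetical helper: random weights stand in for a trained network.
    return (rng.normal(0.0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0.0, 0.1, (n_hidden, n_out)), np.zeros(n_out))

def mlp_posteriors(x, w_hidden, b_hidden, w_out, b_out):
    # One hidden layer followed by a softmax output layer.
    hidden = np.tanh(x @ w_hidden + b_hidden)
    return softmax(hidden @ w_out + b_out)

rng = np.random.default_rng(0)
n_frames, n_acoustic, n_phones = 5, 39, 40  # assumed sizes
acoustic = rng.normal(size=(n_frames, n_acoustic))  # toy acoustic features

first_mlp = init_mlp(rng, n_acoustic, 64, n_phones)
second_mlp = init_mlp(rng, n_phones, 32, n_phones)

# Stage 1: acoustic features -> posterior features.
posterior_features = mlp_posteriors(acoustic, *first_mlp)
# Stage 2: posterior features -> final phone posteriors.
final_posteriors = mlp_posteriors(posterior_features, *second_mlp)
```

Each row of `posterior_features` and `final_posteriors` is a probability distribution over the assumed phone classes for one frame.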
2. The behavior of the second MLP is analyzed using Volterra series.
3. Benefits of MLP-based acoustic modeling:
a) It obviates the need for strong assumptions about the statistics of the features and the parametric form of their density function, which makes it easy to combine features;
b) MLPs have been shown to be invariant to speaker characteristics and environment-specific information such as noise when trained on a large amount of data;
c) The outputs of the MLP are probabilities with useful properties;
d) MLPs can be trained efficiently and scale to large amounts of data.
4. Posterior features exhibit less nonlinguistic variability, have a sparse representation, and are linearly separable.
5. MLP acoustic modeling could be improved in the following ways:
a) using richer acoustic information;
b) increasing the capacity of the MLP, although this approach is often limited by the amount of training data;
c) using a finer representation of the output classes, such as sub-phoneme states.
6. Normalization of the posterior features removes the effect of the unigram phonetic class priors learned by the first MLP. The priors are, however, learned again by the second MLP classifier.
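One way to see why normalization can cancel a per-class bias such as a prior is sketched below. It assumes that "normalization" means per-dimension mean and variance normalization of log posteriors, which these notes do not actually specify; the function name `normalize_log_posteriors`, the toy Dirichlet data, and the scale factors are all hypothetical. In the log domain a class-dependent scale factor becomes a constant offset on that dimension, and mean subtraction removes it.

```python
import numpy as np

def normalize_log_posteriors(posteriors, floor=1e-30):
    # Assumed scheme: take logs, then normalize each class dimension
    # to zero mean and unit variance over the frames.
    log_post = np.log(np.maximum(posteriors, floor))
    mean = log_post.mean(axis=0, keepdims=True)
    std = log_post.std(axis=0, keepdims=True)
    return (log_post - mean) / std

rng = np.random.default_rng(0)
# Toy posteriors: 200 frames over 4 classes, each row sums to one.
posteriors = rng.dirichlet(np.ones(4), size=200)

# Hypothetical class-dependent factors standing in for a prior-like bias.
class_scale = np.array([0.5, 2.0, 1.0, 4.0])
scaled = posteriors * class_scale

features = normalize_log_posteriors(posteriors)
features_scaled = normalize_log_posteriors(scaled)

# The per-class factor is a constant offset in the log domain,
# so both inputs give the same normalized features.
print(np.allclose(features, features_scaled))
```

The second MLP is trained on targets whose class frequencies reflect the priors, which is consistent with the note that it learns them again.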
7. The MLPs are trained using the Quicknet package. The phoneme n-gram models are trained using the SRILM toolkit, and phoneme recognition is performed using the weighted finite state transducer based Juicer decoder.
8. A potential application of the MLP-based hierarchical system is task adaptation. In the first stage of the hierarchical system, a well-trained MLP available off the shelf could be used. The second MLP is then trained on the posterior features estimated for the target task (adaptation data). It has already been observed that the second MLP in the hierarchy requires fewer parameters and can be trained using less data.
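The adaptation recipe in item 8 can be sketched as a toy NumPy experiment: the first stage is frozen and only a small second MLP is trained on its posterior features. This is an illustration under stated assumptions, not the setup used with Quicknet. The frozen random weights, the synthetic adaptation data and labels, the log and mean/variance normalization of the inputs, the layer sizes, and the plain full-batch gradient descent are all hypothetical choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_frames, n_acoustic, n_classes, n_hidden = 300, 13, 5, 8  # assumed sizes

# Stand-in for an off-the-shelf first MLP: fixed weights, never updated.
w1_frozen = rng.normal(0.0, 0.5, (n_acoustic, 16))
w2_frozen = rng.normal(0.0, 0.5, (16, n_classes))

def first_stage(x):
    return softmax(np.tanh(x @ w1_frozen) @ w2_frozen)

# Synthetic adaptation data: target labels are a relabeling of the
# first-stage decisions, mimicking a different target class inventory.
acoustic = rng.normal(size=(n_frames, n_acoustic))
relabel = rng.permutation(n_classes)
labels = relabel[first_stage(acoustic).argmax(axis=1)]
targets = np.eye(n_classes)[labels]

# Posterior features for the target task (assumed log + normalization).
features = np.log(first_stage(acoustic))
features = (features - features.mean(axis=0)) / features.std(axis=0)

# Small second MLP: the only parameters that are trained.
w_h = rng.normal(0.0, 0.1, (n_classes, n_hidden))
b_h = np.zeros(n_hidden)
w_o = rng.normal(0.0, 0.1, (n_hidden, n_classes))
b_o = np.zeros(n_classes)

learning_rate = 0.2
losses = []
for epoch in range(300):
    # Forward pass through the second MLP.
    hidden = np.tanh(features @ w_h + b_h)
    probs = softmax(hidden @ w_o + b_o)
    losses.append(-np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1)))

    # Backward pass for the mean cross-entropy loss.
    grad_logits = (probs - targets) / n_frames
    grad_w_o = hidden.T @ grad_logits
    grad_b_o = grad_logits.sum(axis=0)
    grad_hidden = (grad_logits @ w_o.T) * (1.0 - hidden ** 2)
    grad_w_h = features.T @ grad_hidden
    grad_b_h = grad_hidden.sum(axis=0)

    # Gradient descent step on the second MLP only.
    w_o -= learning_rate * grad_w_o
    b_o -= learning_rate * grad_b_o
    w_h -= learning_rate * grad_w_h
    b_h -= learning_rate * grad_b_h

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Only `w_h`, `b_h`, `w_o` and `b_o` change during training, so the adaptation cost is limited to the small second network.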