Dream & Passion: [Speech] Hierarchical structures of neural networks for phoneme recognition

Wednesday, January 19, 2011

Four neural-network-based phoneme recognition systems are investigated:

a) the TRAPs system (Fig. 1a) - separate networks for processing of speech in frequency bands;

b) the split temporal context (STC) system (Fig. 1b) - separate networks for processing of blocks of spectral vectors;

c) a combination of both (Fig. 1c) - split in both frequency and time;

d) a tandem of two networks - the front-end network is trained in the classical way, and the back-end network is trained on the combination of the front-end's posteriors and the original features.

The assumptions behind systems a) to c) are, respectively:

a) Independent processing of speech in critical bands;

b) Independent processing of parts of phonemes;

c) both a) and b).
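
The three ways of splitting the input listed above can be illustrated with a small sketch. It treats one input block as a list of frames, each frame being a list of band energies, and shows only how the data would be partitioned before being handed to separate networks. The function names, the equal-length non-overlapping time blocks, and the default of two time blocks are assumptions made for illustration; the paper's exact block boundaries are not given here.

```python
def split_by_band(block):
    """TRAPs-style view: one temporal trajectory per frequency band."""
    n_bands = len(block[0])
    return [[frame[band] for frame in block] for band in range(n_bands)]


def split_by_time(block, n_parts=2):
    """STC-style view: consecutive sub-blocks of frames.

    Assumption: parts do not overlap and are as equal in length as possible.
    """
    n_frames = len(block)
    bounds = [i * n_frames // n_parts for i in range(n_parts + 1)]
    return [block[bounds[i]:bounds[i + 1]] for i in range(n_parts)]


def split_both(block, n_parts=2):
    """Combined view: split in time first, then split each part by band."""
    return [split_by_band(part) for part in split_by_time(block, n_parts)]
```

Each returned piece would be the input to its own network, and the outputs of those networks would then be merged by a further network, which is the hierarchical structure the title refers to.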

Phoneme strings are a basic representation for automatic language recognition, and language recognition results have been shown to be highly correlated with phoneme recognition results. Phoneme posteriors are a useful representation for acoustic keyword search: they contain enough information to distinguish among all words, and they are small enough to store compared, for example, with posteriors from context-dependent Gaussian mixture models.

Two ways to provide additional information for NN training:

i) windowing: a context window of multiple frames, weighted by a Hamming window to emphasize the central frame;

ii) output representation: some improvements have been observed when a net was trained for multiple tasks at the same time.
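
As a rough sketch of the windowing idea in i), the following code gathers the frames around a central frame and scales each one by a Hamming weight, so that the central frame keeps weight 1.0 and the outermost frames are scaled to 0.08. The function names are hypothetical, and repeating the first or last frame at utterance edges is an assumption; the paper may handle edges differently.

```python
import math


def hamming(n_points):
    """Hamming window: 0.08 at both ends, 1.0 at the centre for odd lengths."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def context_window(frames, center, half_width):
    """Return the weighted frames center-half_width .. center+half_width.

    Assumption: indices outside the utterance reuse the first or last frame.
    """
    weights = hamming(2 * half_width + 1)
    last = len(frames) - 1
    window = []
    for offset in range(-half_width, half_width + 1):
        idx = min(max(center + offset, 0), last)
        weight = weights[offset + half_width]
        window.append([weight * value for value in frames[idx]])
    return window
```

With `half_width=15` this yields the 31-frame context used elsewhere in this summary.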

A special phoneme set mapping is adopted in this paper: closures are merged with the following burst instead of with silence (bcl b -> b, not bcl b -> pau b). This mapping is believed to be more appropriate for features that use a longer temporal context.
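
The closure mapping can be sketched on a label sequence as follows. The table covers only the six TIMIT-style stop closures. Leaving a closure unchanged when it is not followed by its own burst is an assumption of this sketch, since the summary does not say how the paper treats that case, and affricates are ignored.

```python
# Stop closures and the burst each one is merged into.
CLOSURE_TO_BURST = {
    "bcl": "b", "dcl": "d", "gcl": "g",
    "pcl": "p", "tcl": "t", "kcl": "k",
}


def merge_closures(labels):
    """Drop a closure label when the next label is its matching burst."""
    merged = []
    for i, label in enumerate(labels):
        burst = CLOSURE_TO_BURST.get(label)
        next_label = labels[i + 1] if i + 1 < len(labels) else None
        if burst is not None and next_label == burst:
            continue  # "bcl b" becomes "b"
        merged.append(label)  # assumption: unmatched closures stay as they are
    return merged
```

In a real label file the segment boundaries would also have to be merged, so that the burst segment starts where the closure started.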

The number of neurons in the hidden layer of the neural networks was increased until the phoneme error rate (PER) saturated. The resulting hidden layer size was approximately 500 neurons.
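
One way to picture this procedure is a loop that tries increasing hidden layer sizes and stops once the PER no longer improves by a meaningful margin. This is only a sketch: `evaluate_per` is a hypothetical callback that trains a network of the given size and returns its PER in percent, and the `min_gain` stopping threshold is an assumption, since the summary does not state how saturation was judged.

```python
def find_saturation_size(evaluate_per, sizes, min_gain=0.1):
    """Return (size, per) for the last size that still improved PER.

    evaluate_per: hypothetical callable, hidden size -> PER in percent.
    sizes: candidate hidden layer sizes in increasing order.
    min_gain: smallest PER reduction that counts as an improvement (assumed).
    """
    best_size = sizes[0]
    best_per = evaluate_per(best_size)
    for size in sizes[1:]:
        per = evaluate_per(size)
        if best_per - per < min_gain:
            break  # PER has saturated, so stop growing the layer
        best_size, best_per = size, per
    return best_size, best_per
```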

Table 1 shows the superiority of Mel-bank energies taken over a long temporal context, but also a great improvement coming from the three-state model. (A block of 31 vectors of Mel-bank energies (MBE) spans 310 ms. Temporal trajectories in bands were weighted by a Hamming window and down-sampled by DCT to 11 coefficients.)
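
The per-band feature extraction described in the parentheses (Hamming weighting followed by a DCT that keeps 11 coefficients) might look roughly like this for a single 31-point trajectory. The unnormalised DCT-II and the function names are assumptions; the summary does not say which DCT variant or scaling the paper uses.

```python
import math


def hamming(n_points):
    """Hamming window: 0.08 at both ends, 1.0 at the centre for odd lengths."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def dct_ii(signal, n_coeffs):
    """First n_coeffs unnormalised DCT-II coefficients of signal."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (i + 0.5) / n)
                for i, x in enumerate(signal))
            for k in range(n_coeffs)]


def compress_trajectory(trajectory, n_coeffs=11):
    """Weight one band's temporal trajectory and keep the low-order DCT terms."""
    weights = hamming(len(trajectory))
    windowed = [w * x for w, x in zip(weights, trajectory)]
    return dct_ii(windowed, n_coeffs)
```

Applied to every band, this turns each 31-frame trajectory into 11 numbers, which shrinks the network input while keeping the slowly varying shape of the trajectory.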

The best final PER reported in this paper is obtained with the 5-block STC system and a bigram LM, as shown in the following table: