This paper proposes a Twin-Output MLP model that predicts two sets of phone posteriors: one for the adapted speaker and one for the "world" (i.e., all speakers other than the adapted one).
The model can be used for both speech recognition and speaker identification.
The structure of the model is shown below:
The model is trained in the following steps:
1) Train the speaker-independent (SI) MLP model using all the training data;
2) Duplicate the SI MLP's output layer so that there are two identical sets of phone units; both the weights and the biases are duplicated;
3) Update the cloned model using the speaker-specific data together with the same amount of data randomly selected from the previously used training data;
4) For speech recognition, the authors show that using only the renormalized posteriors of the speaker-specific phone units improves performance over the original SI model.
5) The model can also be used for speaker recognition, since the two sets of posteriors are obtained simultaneously for second-stage speaker verification.
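
The output-layer duplication in step 2 and the renormalization in step 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single softmax over all 2N output units (which would explain why the speaker-specific posteriors need renormalizing) and assumes the first N units are the speaker-specific ones; the function names, layer sizes, and random weights are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def duplicate_output_layer(W, b):
    # Step 2: clone the SI output layer so there are two identical
    # sets of phone units (weights and biases both duplicated).
    # W has shape (hidden, n_phones), b has shape (n_phones,).
    return np.concatenate([W, W], axis=1), np.concatenate([b, b])

def twin_posteriors(h, W2, b2):
    # Assumption: one softmax spans all 2 * n_phones output units.
    # Assumption: the first half is speaker-specific, the second is "world".
    post = softmax(h @ W2 + b2)
    n_phones = W2.shape[1] // 2
    return post[:, :n_phones], post[:, n_phones:]

def renormalize(p):
    # Step 4: rescale one set of posteriors so each frame sums to 1.
    return p / p.sum(axis=-1, keepdims=True)

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
hidden, n_phones = 8, 5
W = rng.normal(size=(hidden, n_phones))
b = rng.normal(size=n_phones)
h = rng.normal(size=(3, hidden))  # hidden activations for 3 frames

si_post = softmax(h @ W + b)              # original SI posteriors
W2, b2 = duplicate_output_layer(W, b)     # twin-output layer before adaptation
spk, world = twin_posteriors(h, W2, b2)
spk_renorm = renormalize(spk)
```

Under these assumptions, immediately after duplication and before any adaptation update, the two sets of posteriors are identical, each carries half of the probability mass, and the renormalized speaker-specific posteriors equal the original SI posteriors. The adaptation update in step 3 is what makes the two sets diverge.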