Semi-supervised speaker adaptation is carried out by combining the MLLR and MAP adaptation techniques with a confidence measure (CM) score generated by a neural network (NN).
MLLR can adapt the model with only a limited amount of adaptation data, and is therefore used at the early stage of the unsupervised, online adaptation.
After that, the MLLR-adapted model is used as the prior for the MAP method to update the model. As the amount of speaker-specific data increases, MAP adaptation converges towards the speaker-dependent (SD) system.
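The interplay between the MLLR prior and the MAP update can be illustrated with a minimal sketch for a single Gaussian mean. It assumes the standard affine MLLR mean transform and the standard MAP mean re-estimation formula; the function names, the prior weight `tau`, and all numeric values are hypothetical and chosen only for illustration.

```python
def mllr_transform(mean, A, b):
    """Apply an affine MLLR mean transform: A * mean + b."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]


def map_update_mean(prior_mean, frames, weights, tau=10.0):
    """MAP re-estimate of a Gaussian mean.

    frames  : feature vectors aligned to this Gaussian
    weights : per-frame occupation probabilities
    tau     : prior weight (hypothetical value; larger trusts the prior more)
    """
    occ = sum(weights)
    dim = len(prior_mean)
    weighted_sum = [sum(w * x[d] for w, x in zip(weights, frames))
                    for d in range(dim)]
    return [(tau * prior_mean[d] + weighted_sum[d]) / (tau + occ)
            for d in range(dim)]


# Toy speaker-independent mean and a toy MLLR transform (illustrative values).
si_mean = [0.0, 0.0]
A = [[1.0, 0.0], [0.0, 1.0]]
b = [0.5, -0.5]
prior = mllr_transform(si_mean, A, b)            # [0.5, -0.5]

# The speaker's data is centred on [2.0, 2.0] in this toy example.
few = [[2.0, 2.0]] * 5
many = [[2.0, 2.0]] * 5000

mean_few = map_update_mean(prior, few, [1.0] * len(few))     # ~[1.0, 0.333]
mean_many = map_update_mean(prior, many, [1.0] * len(many))  # ~[1.997, 1.995]
```

With few frames the estimate stays close to the MLLR-adapted prior; with many frames the prior term is outweighed and the estimate approaches the speaker's own sample mean, which is the sense in which MAP tends towards the SD system.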
Instead of using the standard CM scores obtained from the recognition process, which are computationally expensive, an NN is trained to predict whether each phone is recognized correctly. The features used to predict the CM scores are mainly related to phoneme durations, speaking rate and acoustic score.
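As a rough sketch of how such a predictor could be used, the following shows a small feed-forward network mapping three per-phone features to a confidence value, which then gates the phones passed on to adaptation. The feature definitions, network size, weights, and the 0.5 threshold are all hypothetical placeholders; in practice the weights would be trained on phones labelled as correctly or incorrectly recognized, and the exact way the score enters the adaptation is not specified here.

```python
import math

# Hypothetical, hand-set weights for a 3-input, 2-hidden-unit, 1-output network.
# A real system would learn these from labelled recognition output.
W_HIDDEN = [[-1.5, -0.5, 0.0],   # reacts to duration and speaking-rate deviation
            [0.0, 0.0, 1.0]]     # reacts to the normalized acoustic score
B_HIDDEN = [0.0, 0.0]
W_OUT = [2.0, 2.0]
B_OUT = 1.0


def phone_confidence(features):
    """Forward pass: features -> tanh hidden layer -> sigmoid confidence in (0, 1).

    features = [duration_deviation, rate_deviation, acoustic_score]
    (assumed to be normalized; these definitions are illustrative).
    """
    hidden = [math.tanh(sum(w * f for w, f in zip(row, features)) + bias)
              for row, bias in zip(W_HIDDEN, B_HIDDEN)]
    z = sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT
    return 1.0 / (1.0 + math.exp(-z))


def select_for_adaptation(phones, threshold=0.5):
    """Keep only the phones whose predicted confidence reaches the threshold."""
    return [label for label, feats in phones
            if phone_confidence(feats) >= threshold]


# Toy recognized phones: (label, [duration_dev, rate_dev, acoustic_score]).
phones = [
    ("ah", [0.1, 0.0, 1.0]),    # typical duration, good acoustic score
    ("t",  [2.5, 1.0, -1.0]),   # unusual duration, poor acoustic score
]

conf_ah = phone_confidence(phones[0][1])
conf_t = phone_confidence(phones[1][1])
selected = select_for_adaptation(phones)
```

Hard gating is only one option; the confidence could instead scale each frame's occupation weight in the MAP update, so that doubtful phones contribute less rather than nothing.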