Semi-Supervised Training in Deep Learning Acoustic Model

Yan Huang; Yongqiang Wang; Yifan Gong

Interspeech 2016 | September 2016

We studied semi-supervised training in a fully connected deep neural network (DNN), an unfolded recurrent neural network (RNN), and a long short-term memory recurrent neural network (LSTM-RNN) with respect to transcription quality, importance data sampling, and training data amount. We found that the DNN, unfolded RNN, and LSTM-RNN are, in that order, increasingly sensitive to labeling errors. For example, with simulated erroneous training transcriptions at the 5%, 10%, or 15% word error rate (WER) level, the semi-supervised DNN yields a 2.37%, 4.84%, or 7.46% relative WER increase over the baseline model trained with perfect transcriptions; in comparison, the corresponding WER increase is 2.53%, 4.89%, or 8.85% for the unfolded RNN and 4.47%, 9.38%, or 14.01% for the LSTM-RNN. We further found that importance sampling has a similar impact on all three models, with a 2-3% relative WER reduction compared with random sampling. Lastly, we compared modeling capability with increased training data. Experimental results suggest that the LSTM-RNN benefits more from enlarged training data than the unfolded RNN and DNN.
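The relative WER figures quoted above are presumably computed in the usual way, as the change in WER divided by the baseline WER. The following is a minimal sketch of that arithmetic, assuming the standard definition; the function name `relative_wer_change` and the absolute WER values in the example are hypothetical and are not taken from the paper.

```python
def relative_wer_change(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER change in percent.

    A positive value means the system is worse than the baseline
    (relative WER increase); a negative value means it is better
    (relative WER reduction).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (system_wer - baseline_wer) / baseline_wer


# Hypothetical absolute WERs (percent), chosen only for illustration.
baseline = 20.0   # model trained with perfect transcriptions
degraded = 20.5   # model trained with erroneous transcriptions

print(relative_wer_change(baseline, degraded))
```

Under this definition, a relative figure says nothing about the absolute WER on its own; the same relative increase corresponds to a larger absolute gap when the baseline WER is higher.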

We trained a semi-supervised LSTM-RNN using 2,600 hours of transcribed and 10,000 hours of untranscribed data on a mobile speech task. The semi-supervised LSTM-RNN yields a 7.9% relative WER reduction over the supervised baseline.