Improving unsupervised language model adaptation with discriminative data filtering

INTERSPEECH

In this paper we propose a method for improving unsupervised language model (LM) adaptation by discriminatively filtering the adaptation training material. The solution addresses two main issues: first, how to automatically identify recognition errors and better alternatives without manual transcription; second, how to update the model parameters based on these recognition error cues. Within the adaptation framework, we address the first issue by predicting regression pairs between the recognition results of the baseline LM and those of an initial adapted LM, using features such as the language model score difference. For the second issue, we adopt a data-filtering approach that penalizes potent error attractors introduced by the unsupervised adaptation data, using N-gram set-difference statistics computed on the predicted regression pairs. Experimental results on a large real-world voice catalog search application demonstrate that the proposed solution yields significant recognition error reduction over an initial adapted LM.
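The filtering step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a predicted regression pair is a (baseline hypothesis, adapted-LM hypothesis) string pair, collects the N-grams present only in the regressed hypothesis as candidate error attractors, and drops adaptation sentences containing them. The function names, the bigram order, and the count threshold are all hypothetical choices.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def error_attractor_ngrams(regression_pairs, n=2):
    """Count N-grams that appear in the adapted-LM hypothesis but not in
    the baseline hypothesis of each predicted regression pair (a simple
    N-gram set-difference statistic)."""
    counts = Counter()
    for baseline_hyp, adapted_hyp in regression_pairs:
        diff = (set(ngrams(adapted_hyp.split(), n))
                - set(ngrams(baseline_hyp.split(), n)))
        counts.update(diff)
    return counts


def filter_adaptation_data(sentences, attractor_counts, n=2, threshold=1):
    """Drop adaptation sentences that contain a frequent error-attractor
    N-gram, so the re-adapted LM no longer reinforces those errors."""
    bad = {g for g, c in attractor_counts.items() if c >= threshold}
    return [s for s in sentences
            if not (set(ngrams(s.split(), n)) & bad)]
```

For example, if the baseline recognized "call john smith" but the adapted LM regressed to "call jon smith", the bigrams ("call", "jon") and ("jon", "smith") become candidate attractors, and adaptation sentences containing them would be filtered before re-estimating the LM.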