Acoustic Model Adaptation for Presentation Transcription and Intelligent Meeting Assistant Systems
We present our solution for unsupervised rapid speaker adaptation in a state-of-the-art presentation and intelligent meeting transcription system. We adopt the Kullback-Leibler (KL) divergence regularized model adaptation paradigm. For the adaptation architecture, we find that linear projection layer adaptation yields competitive performance with the additional benefits of simplicity and robustness to small amounts of adaptation data. To address imperfect supervision, we use a supervision committee, formed by multiple systems or a single system's n-best hypotheses, to mask possibly mislabeled frames. To relieve the data sparsity issue, we apply noise and speaking-rate perturbation data augmentation techniques to create a richer adaptation data set.
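The KL-divergence regularization above can be implemented by interpolating the hard supervision labels with the speaker-independent model's posteriors, so that minimizing cross-entropy against the interpolated targets equals the regularized objective up to a constant. The following is a minimal sketch of that target interpolation; the function name and the weight `rho` are illustrative, not from this paper.

```python
import numpy as np

def kl_regularized_targets(one_hot, si_posteriors, rho=0.5):
    """Interpolate hard labels with speaker-independent (SI) posteriors.

    Training the adapted model with cross-entropy against these soft
    targets is equivalent (up to a constant) to minimizing
    (1 - rho) * CE(labels) + rho * KL(SI posteriors || adapted posteriors).
    A larger rho keeps the adapted model closer to the SI model.
    """
    return (1.0 - rho) * one_hot + rho * si_posteriors
```

For example, a hard label `[1, 0]` combined with SI posteriors `[0.6, 0.4]` at `rho=0.5` gives the soft target `[0.8, 0.2]`.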
In summary, the proposed solution consists of KL-divergence regularized linear projection layer adaptation with frame masking and data augmentation.
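The frame-masking step can be sketched as follows: frames on which the supervision committee members disagree are treated as possibly mislabeled and excluded from adaptation. This is only an illustrative sketch of committee agreement over frame-level labels; the exact masking criterion in the system may differ.

```python
def committee_mask(label_seqs):
    """Return a per-frame mask that keeps only frames where all
    committee members (each a frame-level label sequence) agree.

    Frames with any disagreement are masked out of the adaptation loss.
    """
    reference = label_seqs[0]
    return [
        all(seq[t] == reference[t] for seq in label_seqs)
        for t in range(len(reference))
    ]
```

For example, three committee members labeling three frames as `[1, 2, 3]`, `[1, 2, 4]`, and `[1, 2, 3]` agree only on the first two frames, so only those frames would contribute to the adaptation loss.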
On a presentation transcription task and a meeting transcription task, our proposed methodology yields 7.3% and 7.9% relative word error rate (WER) reductions, respectively, against a strong baseline model trained on tens of thousands of hours of speech. To the best of our knowledge, this is the first reported work on rapid speaker adaptation in a state-of-the-art production system.