Rapid RNN-T Adaptation Using Personalized Speech Synthesis and Neural Language Generator
- Yan Huang,
- Jinyu Li,
- Lei He,
- Wenning Wei,
- William Gale,
- Yifan Gong
Interspeech
Rapid unsupervised speaker adaptation in an E2E system poses new challenges due to its unified end-to-end structure, in addition to the intrinsic difficulty of data sparsity and imperfect labels [1]. Previously, we proposed utilizing content-relevant personalized speech synthesis for rapid speaker adaptation and achieved a significant performance breakthrough in a hybrid system [2]. In this paper, we answer two questions: first, how to effectively perform rapid speaker adaptation in an RNN-T; second, whether our previously proposed approach is still beneficial for the RNN-T, and what modifications and distinct observations it entails. We apply the proposed methodology to a speaker adaptation task in a state-of-the-art presentation transcription RNN-T system. In the 1 min setup, it yields an 11.58% or 7.95% relative word error rate (WER) reduction for supervised or unsupervised adaptation, respectively, compared with the negligible gain obtained when adapting with 1 min of source speech. In the 10 min setup, it yields a 15.71% or 8.00% relative WER reduction, doubling the gain of source speech adaptation. We further apply various data filtering techniques and significantly bridge the gap between supervised and unsupervised adaptation.
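To make the general recipe concrete, below is a minimal PyTorch sketch of the underlying idea: fine-tuning a seed RNN-T on a small set of TTS-generated personalized utterances. This is not the authors' implementation; the `TinyRNNT` model, the `adapt_rnnt` helper, the choice to update only the encoder, and all hyperparameters are illustrative assumptions, with `torchaudio.functional.rnnt_loss` standing in for the transducer objective.

```python
# Minimal sketch (not the paper's code) of RNN-T speaker adaptation on
# TTS-generated personalized speech. Model, helper names, and all
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio.functional as AF


class TinyRNNT(nn.Module):
    """Toy RNN-T with transcription (encoder), prediction, and joint nets."""

    def __init__(self, n_feats=80, vocab=29, hidden=256, blank=0):
        super().__init__()
        self.blank = blank
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        self.joint = nn.Sequential(nn.Tanh(), nn.Linear(hidden, vocab))

    def forward(self, feats, tokens):
        enc, _ = self.encoder(feats)                      # (B, T, H)
        # Prepend a blank "start" symbol so the prediction net sees U+1 steps.
        sos = torch.full_like(tokens[:, :1], self.blank)
        pred, _ = self.predictor(self.embed(torch.cat([sos, tokens], 1)))
        # Broadcast-add encoder and prediction states: (B, T, U+1, H).
        joint_in = enc.unsqueeze(2) + pred.unsqueeze(1)
        return self.joint(joint_in)                       # (B, T, U+1, V)


def adapt_rnnt(model, adaptation_set, epochs=3, lr=1e-5, blank=0):
    """Fine-tune only the encoder on synthesized personalized speech.

    Freezing the prediction/joint networks is one common way to limit
    over-fitting when adaptation data is scarce (e.g. 1-10 min).
    """
    for p in model.parameters():
        p.requires_grad = False
    for p in model.encoder.parameters():  # adapt the acoustic branch only
        p.requires_grad = True
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)

    model.train()
    for _ in range(epochs):
        for feats, feat_lens, tokens, token_lens in adaptation_set:
            logits = model(feats, tokens)
            loss = AF.rnnt_loss(
                logits, tokens.int(), feat_lens.int(), token_lens.int(),
                blank=blank, reduction="mean")
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


if __name__ == "__main__":
    # Random tensors stand in for TTS-synthesized features and their
    # token labels (reference text for supervised adaptation, first-pass
    # hypotheses for unsupervised adaptation).
    model = TinyRNNT()
    feats = torch.randn(2, 100, 80)
    feat_lens = torch.tensor([100, 100])
    tokens = torch.randint(1, 29, (2, 12))  # avoid the blank index 0
    token_lens = torch.tensor([12, 12])
    adapt_rnnt(model, [(feats, feat_lens, tokens, token_lens)])
```

In this framing, the supervised/unsupervised distinction reduces to whether the token sequences paired with the synthesized audio come from reference transcripts or from first-pass recognition hypotheses, which is why the data filtering the abstract mentions matters most in the unsupervised case.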