Using personalized speech synthesis and neural language generator for rapid speaker adaptation


We propose using personalized speech synthesis and a neural
language generator to synthesize content-relevant personalized
speech for rapid speaker adaptation. The approach has two distinct benefits.
First, it relieves the general data sparsity issue in rapid adaptation by
making use of additional synthesized personalized speech. Second,
it circumvents the obstacle of explicit labeling errors in unsupervised
adaptation by converting it to pseudo-supervised adaptation.
In this setup, the labeling error is implicitly rendered as less damaging
speech distortion in the personalized synthesized speech. This
results in a significant performance breakthrough in rapid unsupervised
speaker adaptation. We apply the proposed methodology to a
speaker adaptation task in a state-of-the-art speech transcription system.
With 1 minute (min) of adaptation data, our proposed approach yields
9.19% and 5.98% relative word error rate (WER) reduction for
supervised and unsupervised adaptation, respectively, compared with the negligible
gain when adapting with only 1 min of original speech. With 10
min of adaptation data, it yields 12.53% and 7.89% relative WER reduction,
respectively, doubling the gain of the baseline adaptation. The proposed
approach is particularly suitable for unsupervised adaptation.
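
For readers less familiar with the metric quoted above, the following is a minimal sketch of how WER and relative WER reduction are commonly computed. It is not taken from the paper: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the absolute WER values in the usage note are illustrative only, since the abstract reports relative reductions but no absolute WERs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words (standard Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer
```

As an illustration with made-up numbers, a hypothetical baseline WER of 20.0% that drops to 18.0% after adaptation corresponds to `relative_wer_reduction(20.0, 18.0)`, a 10% relative reduction.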