Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias are All You Need
- Yan Huang
- Guoli Ye
- Jinyu Li
- Yifan Gong

Interspeech 2021
The conformer transducer achieves new state-of-the-art end-to-end
(E2E) system performance and has become increasingly appealing
for production. In this paper, we study how to effectively
perform rapid speaker adaptation in a conformer transducer
and how it compares with the RNN transducer. We hierarchically
decompose the conformer transducer and compare
adapting each component through fine-tuning. Among various
interesting observations, three findings stand out. First,
adapting the self-attention can achieve more than 80% of the gain of
full network adaptation. When the adaptation data is extremely
scarce, attention is all you need to adapt. Second,
within the self-attention, adapting the value projection significantly
outperforms adapting the key or the query projection.
Lastly, bias adaptation, despite its compact parameter space,
is surprisingly effective. We conduct experiments on a state-of-the-art
conformer transducer for an email dictation task. With
3 to 5 minutes of source speech and 200 minutes of augmented personalized
TTS speech, the best-performing encoder and joint network
adaptations yield 38.37% and 19.90% relative word error rate
(WER) reductions, respectively. Combining attention and bias adaptation
can achieve 90% of the gain with a significantly smaller footprint.
Further comparison with the RNN transducer suggests that the
new state-of-the-art conformer transducer benefits from personalization
as much as, if not more than, its RNN counterpart.
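
To make the selective fine-tuning concrete, below is a minimal PyTorch sketch (not the authors' implementation) of attention-value and bias adaptation via parameter freezing. The `ToyAttention` module, the parameter-name patterns, and the learning rate are illustrative assumptions; a real system would apply the same selection to every self-attention block of the conformer encoder.

```python
import torch
import torch.nn as nn

# Toy stand-in for one attention block's q/k/v projections; the real
# conformer transducer and its parameter names are assumptions here.
class ToyAttention(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

def select_adaptation_params(model, adapt_value_proj=True, adapt_biases=True):
    """Freeze everything except the chosen adaptation targets and
    return the trainable parameters for the optimizer."""
    trainable = []
    for name, param in model.named_parameters():
        # Value-projection adaptation: update only the v-projection weights.
        is_value_proj = adapt_value_proj and "v_proj.weight" in name
        # Bias adaptation: update every bias vector; the parameter
        # footprint is tiny compared with the full weight matrices.
        is_bias = adapt_biases and name.endswith("bias")
        param.requires_grad = is_value_proj or is_bias
        if param.requires_grad:
            trainable.append(param)
    return trainable

model = ToyAttention()
params = select_adaptation_params(model)
optimizer = torch.optim.Adam(params, lr=1e-4)  # lr is an assumption
# Fine-tune on the speaker's (possibly TTS-augmented) adaptation data here.
```

Freezing all other parameters keeps the per-speaker footprint small, which is the practical appeal of the attention-plus-bias combination reported above.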