Multi-Transcription-Style Speech Transcription Using an Attention-Based Encoder-Decoder Model
Human professional transcription services offer a variety of transcription styles to meet different customer needs.
To accommodate different users and facilitate seamless integration with downstream applications, we propose a framework that generates multi-style transcriptions in an attention-based encoder-decoder (AED) model using three different architectures: (A) style-dependent layers; (B) mixed-style output; (C) style-dependent prompt. Within this framework, both the verbatim lexical transcription and readable transcriptions in various styles can be generated simultaneously or separately, through a single decoding pass or multiple decoding passes on demand. We conduct experiments on a large-scale AED-based speech transcription system trained with 50k hours of speech. The proposed framework achieves performance nearly on par with single-style AED models while yielding significant savings in model footprint and decoding cost. Moreover, it provides an efficient data-sharing mechanism across styles through knowledge transfer.
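To illustrate one of the three architectures, the following is a minimal PyTorch sketch of architecture (C), a style-dependent prompt, under our own assumptions rather than the paper's implementation: a style token is prepended to the decoder input so a single model can emit a verbatim or readable transcript on demand. All names, vocabulary sizes, style-token IDs, and model dimensions (PromptedAEDDecoder, STYLE_TOKENS, VOCAB, d_model) are hypothetical.

```python
# Sketch (assumption, not the paper's implementation) of a style-dependent prompt
# in an AED decoder: the requested style token is prepended to the target sequence,
# so every decoding step can attend to it.
import torch
import torch.nn as nn

VOCAB = 1000                                                   # hypothetical subword vocabulary size
STYLE_TOKENS = {"<verbatim>": VOCAB, "<readable>": VOCAB + 1}  # hypothetical style-token IDs

class PromptedAEDDecoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + len(STYLE_TOKENS), d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, VOCAB + len(STYLE_TOKENS))

    def forward(self, targets, encoder_out, style):
        # Prepend the style prompt token to the previously decoded tokens.
        prompt = torch.full((targets.size(0), 1), STYLE_TOKENS[style],
                            dtype=torch.long, device=targets.device)
        tokens = torch.cat([prompt, targets], dim=1)
        # Causal mask so each position only attends to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.decoder(self.embed(tokens), encoder_out, tgt_mask=mask)
        return self.out(hidden)

# Usage: one encoder pass, then a separate decoding pass per requested style.
enc = torch.randn(2, 50, 256)          # (batch, frames, d_model) from the AED encoder
tgt = torch.randint(0, VOCAB, (2, 7))  # previously decoded target tokens
dec = PromptedAEDDecoder()
verbatim_logits = dec(tgt, enc, "<verbatim>")
readable_logits = dec(tgt, enc, "<readable>")
```

Under this reading, the prompt variant shares all decoder parameters across styles, which is consistent with the abstract's claim of reduced model footprint; producing several styles then costs one decoding pass per style over a single shared encoder pass.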