A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability

Workshop of Automatic Speech Recognition and Understanding

Organized by IEEE

Neural transducers have been widely used for streaming automatic speech recognition (ASR) and speech translation (ST). In this paper, we present our work on building a Streaming Multilingual Speech Model ($SM^2$), which uses a single neural transducer to transcribe or translate speech in multiple languages into target languages. $SM^2$ is trained on weakly supervised data created by translating speech recognition transcriptions with a machine translation model. Trained on 351 thousand hours of speech from 25 languages, $SM^2$ achieves impressive ST performance. Furthermore, we demonstrate the truly zero-shot capability of $SM^2$ when expanding to new target languages: it generates high-quality zero-shot ST output for \{source-speech, target-text\} pairs that were not seen during training.