Session-level Language Modeling for Conversational Speech

Wayne Xiong; Lingfeng Wu; Jun Zhang; Andreas Stolcke

Proceedings EMNLP | November 2018

Published by Association for Computational Linguistics

We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena such as adjacency pairs, lexical entrainment, and topical coherence. The model consists of a long short-term memory (LSTM) recurrent network that reads the entire word-level history of a conversation, as well as information about turn taking and speaker overlap, in order to predict each next word. The model is applied in a rescoring framework, where the word history prior to the current utterance is approximated with preliminary recognition results. In experiments in the conversational telephone speech domain (Switchboard), we find that such a model gives substantial perplexity reductions over a standard LSTM-LM with utterance scope, as well as improvements in word error rate.
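The rescoring setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. All names here (`build_history`, `score_utterance`, `rescore_nbest`, the `<sc>` speaker-change token) are hypothetical, and a toy cache model stands in for the LSTM so that the example runs without any dependencies. Speaker overlap is not modeled.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical marker tokens; the abstract does not say how turn taking is encoded.
SPEAKER_CHANGE = "<sc>"
UTT_END = "</s>"

# A language model here is any function (context, next_word) -> log probability.
LogProbFn = Callable[[Sequence[str], str], float]


def build_history(prev_utts: Sequence[Tuple[str, List[str]]],
                  current_speaker: str) -> List[str]:
    """Flatten earlier utterances, given as (speaker, words), into one token stream.

    The words are assumed to be preliminary 1-best recognition output,
    standing in for the true (unknown) conversation history.
    """
    history: List[str] = []
    last_speaker = None
    for speaker, words in prev_utts:
        if last_speaker is not None and speaker != last_speaker:
            history.append(SPEAKER_CHANGE)
        history.extend(words)
        history.append(UTT_END)
        last_speaker = speaker
    # Mark a speaker change between the last utterance and the current one.
    if last_speaker is not None and current_speaker != last_speaker:
        history.append(SPEAKER_CHANGE)
    return history


def score_utterance(lm: LogProbFn, history: Sequence[str],
                    words: Sequence[str]) -> float:
    """Sum log probabilities of the words, each conditioned on the session so far."""
    context = list(history)
    total = 0.0
    for word in list(words) + [UTT_END]:
        total += lm(context, word)
        context.append(word)
    return total


def rescore_nbest(lm: LogProbFn, history: Sequence[str],
                  nbest: Sequence[Tuple[List[str], float]],
                  lm_weight: float = 1.0) -> List[str]:
    """Pick the hypothesis maximizing acoustic score + weighted LM score."""
    best = max(nbest,
               key=lambda h: h[1] + lm_weight * score_utterance(lm, history, h[0]))
    return best[0]


def toy_cache_lm(context: Sequence[str], word: str,
                 vocab_size: int = 1000, cache_weight: float = 0.5) -> float:
    """Toy stand-in for the LSTM: uniform distribution mixed with a history cache.

    Words already seen in the context get a boost, loosely mimicking lexical
    entrainment. Assumes all words belong to a vocabulary of `vocab_size`.
    """
    uniform = 1.0 / vocab_size
    if not context:
        return math.log(uniform)
    cache = context.count(word) / len(context)
    return math.log((1.0 - cache_weight) * uniform + cache_weight * cache)


# Speaker A's earlier utterance, as recognized in a first pass.
prev_utts = [("A", ["do", "you", "like", "jazz"])]
history = build_history(prev_utts, current_speaker="B")

# Two hypotheses for speaker B's utterance with identical acoustic scores.
nbest = [(["i", "like", "chess"], -10.0), (["i", "like", "jazz"], -10.0)]
print(rescore_nbest(toy_cache_lm, history, nbest))
```

With the session history, the toy model prefers the hypothesis that reuses "jazz" from the previous turn. With an empty history (utterance scope), the two hypotheses receive the same score, which is the kind of ambiguity that cross-utterance context is meant to resolve.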