Predicting unseen triphones with senones

  • Mei-Yuh Hwang
  • Xuedong Huang
  • Fileno A. Alleva

IEEE Trans. on Speech and Audio Processing, Vol 4: pp. 412-419

In large-vocabulary speech recognition, we often encounter triphones that are not covered in the training data. These unseen triphones are usually backed off to their corresponding diphones or context-independent phones, which carry less context but have plenty of training examples. In this paper, we propose to use decision-tree-based senones to generate the senonic baseforms needed for these unseen triphones. A decision tree is built for each Markov state of each base phone; the leaves of the trees constitute the senone pool. To find the senone associated with a Markov state of any triphone, the corresponding tree is traversed until a leaf node is reached. The effectiveness of the proposed approach was demonstrated on the ARPA 5000-word speaker-independent Wall Street Journal dictation task. The word error rate was reduced by 11% when unseen triphones were modeled by the decision-tree-based senones instead of context-independent phones. When there were more than five unseen triphones in each test utterance, the error rate reduction was more than 20%.
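
The tree traversal described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names (`Triphone`, `Node`, `Leaf`), the phonetic questions (`left_is_nasal`, `right_is_vowel`), the phone sets, the senone ids, and the choice of three Markov states per phone are all hypothetical, chosen only to show how one tree per (base phone, state) pair yields a senonic baseform for a triphone that was never seen in training.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass(frozen=True)
class Triphone:
    left: str   # left-context phone
    base: str   # base phone being modeled
    right: str  # right-context phone


@dataclass
class Leaf:
    senone_id: int  # index into the senone pool (hypothetical ids)


@dataclass
class Node:
    question: Callable[[Triphone], bool]  # yes/no question about the context
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


Tree = Union[Node, Leaf]

# Toy phone classes; a real system would use a much richer question set.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "ae", "iy", "uw"}


def left_is_nasal(t: Triphone) -> bool:
    return t.left in NASALS


def right_is_vowel(t: Triphone) -> bool:
    return t.right in VOWELS


def find_senone(tree: Tree, triphone: Triphone) -> int:
    """Walk the tree from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(triphone) else node.no
    return node.senone_id


def senonic_baseform(
    trees: Dict[Tuple[str, int], Tree], triphone: Triphone, num_states: int = 3
) -> List[int]:
    """One senone per Markov state, looked up in that state's tree."""
    return [
        find_senone(trees[(triphone.base, state)], triphone)
        for state in range(num_states)
    ]


# One hand-built tree per Markov state of base phone "k" (assumed 3 states).
trees: Dict[Tuple[str, int], Tree] = {
    ("k", 0): Node(left_is_nasal, Leaf(0), Node(right_is_vowel, Leaf(1), Leaf(2))),
    ("k", 1): Node(right_is_vowel, Leaf(3), Leaf(4)),
    ("k", 2): Node(right_is_vowel, Leaf(5), Leaf(6)),
}

# A triphone assumed to be absent from training still reaches a leaf in each tree.
unseen = Triphone(left="ng", base="k", right="iy")
baseform = senonic_baseform(trees, unseen)
print(baseform)
```

Because every question is answerable from the phone context alone, any triphone reaches a leaf in each tree, so no back-off to a diphone or context-independent model is needed in this sketch.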