Traditionally, speech recognition systems are built with models that represent an average over many different users. Such a speaker-independent model works reasonably well for a large percentage of users, but accuracy can be improved if the acoustic model is personalized to the given user, i.e. if the system learns the user's voice characteristics. Dictation systems often do this in an “enrollment phase” that typically lasts at least 10 minutes. We would also like to adapt the language model to the user, but the error reduction only becomes significant once a large number of sentences written by that user is available. To supply that text, we have built a service that continually analyzes the user’s sent emails to personalize the language model, and we have observed a 30% reduction in error rate for text dictated in the body of emails.
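The text does not say how the email-derived data is folded into the language model. One common technique is linear interpolation of a general background model with a user-specific model; the sketch below assumes that approach, uses add-one smoothed unigrams for brevity, and compares perplexity on held-out user text rather than recognition error rate. The function names, the interpolation weight `lam`, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter

def unigram_model(sentences, vocab):
    """Add-one smoothed unigram probabilities over a fixed (closed) vocabulary."""
    counts = Counter(w for s in sentences for w in s.lower().split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def interpolate(background, personal, lam):
    """Assumed adaptation scheme: P(w) = lam * P_personal(w) + (1 - lam) * P_background(w)."""
    return {w: lam * personal[w] + (1 - lam) * background[w] for w in background}

def perplexity(model, sentences):
    """exp of the average negative log-probability per word."""
    words = [w for s in sentences for w in s.lower().split()]
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))

# Toy data, made up for illustration only.
generic_corpus = [
    "the meeting is on monday",
    "please see the attached report",
    "thanks for the update",
]
sent_emails = [  # stands in for the user's sent mail
    "the speech demo build is ready",
    "the speech demo needs a new build",
]
held_out = ["the speech demo build is on monday"]

# Simplification: a closed vocabulary that includes the held-out words.
vocab = {w for s in generic_corpus + sent_emails + held_out for w in s.lower().split()}
background = unigram_model(generic_corpus, vocab)
personal = unigram_model(sent_emails, vocab)
adapted = interpolate(background, personal, lam=0.5)  # hypothetical weight

print(f"background: {perplexity(background, held_out):.1f}")
print(f"adapted:    {perplexity(adapted, held_out):.1f}")
```

A real system would use higher-order n-grams and tune the weight on held-out user text; lower perplexity on the user's own writing is only a proxy for the dictation error-rate reduction described above.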
Following the ideas developed in MiPAD, we use context to develop field-specific grammars. Additionally, we have personalized the grammars used in the “To:” and “Cc:” fields to reflect the distribution of the user's correspondents. This significantly reduces perplexity, because a user is much more likely to send email to close colleagues than to other people in the organization, and thus improves both accuracy and speed.
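The perplexity argument can be made concrete with a small sketch. The source does not specify how its personalized grammars are built, so this assumes, hypothetically, that a recipient grammar is a probability distribution over an organization directory, weighted by how often the user has addressed each person, with add-alpha smoothing so that nobody becomes unreachable. The directory, names, counts, and `alpha` value are made up for illustration.

```python
import math
from collections import Counter

def uniform_grammar(directory):
    """Speaker-independent baseline: every directory entry is equally likely."""
    p = 1.0 / len(directory)
    return {name: p for name in directory}

def personalized_grammar(directory, sent_recipients, alpha=0.1):
    """Weight each entry by how often the user has addressed it.

    Hypothetical add-alpha smoothing keeps every directory entry reachable.
    """
    known = set(directory)
    counts = Counter(r for r in sent_recipients if r in known)
    total = sum(counts.values()) + alpha * len(directory)
    return {name: (counts[name] + alpha) / total for name in directory}

def perplexity(grammar, recipients):
    """exp of the average negative log-probability of the observed recipients."""
    log_prob = sum(math.log(grammar[r]) for r in recipients)
    return math.exp(-log_prob / len(recipients))

# Hypothetical 1,000-person directory; alice, bob and carol are close colleagues.
directory = ["alice", "bob", "carol"] + [f"employee{i}" for i in range(997)]
history = ["alice"] * 20 + ["bob"] * 15 + ["carol"] * 5  # past To:/Cc: entries
new_recipients = ["alice", "bob", "alice", "carol"]

baseline = uniform_grammar(directory)
personal = personalized_grammar(directory, history)
print(f"uniform:      {perplexity(baseline, new_recipients):.1f}")
print(f"personalized: {perplexity(personal, new_recipients):.1f}")
```

The uniform grammar's perplexity equals the directory size (1,000 here), whereas the personalized grammar concentrates probability on frequent correspondents, so the recognizer effectively chooses among far fewer likely names.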
In the long term, we would like to build a system that uses personalization and context to narrow down the options the recognizer needs to search, and thus increase accuracy.