Constrained Convolutional-recurrent Networks to Improve Speech Quality with Low Impact on Recognition Accuracy
- Rasool Fakoor
- Xiaodong He
- Ivan Tashev
- Shuayb Zarar
IEEE Int. Conf. Acoustics Speech and Signal Processing (ICASSP)
For a speech-enhancement algorithm, it is highly desirable to simultaneously improve perceptual quality and recognition rate. Because of limitations on the cost functions, it is challenging to train a model that effectively optimizes both metrics at the same time. In this paper, we propose a method for speech enhancement that combines local and global contextual structure information through convolutional-recurrent neural networks to improve perceptual quality. At the same time, we introduce a new constraint on the objective function, using a language model/decoder, that limits the impact on recognition rate. Based on experiments conducted with real user data, we demonstrate that our new context-augmented machine-learning approach for speech enhancement improves PESQ and WER by an additional 24.5% and 51.3%, respectively, when compared to the best-performing methods in the literature.
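The idea of constraining an enhancement objective with a recognition-side term can be illustrated with a penalty-weighted loss. The following is only a minimal sketch, not the method from the paper: the additive form, the function names `mean_squared_error` and `constrained_objective`, the `weight` parameter, and the use of a single scalar `recognition_penalty` in place of a language model/decoder score are all assumptions made for illustration.

```python
# Hypothetical sketch: the names and the additive form are assumptions,
# not taken from the paper.

def mean_squared_error(enhanced, clean):
    """Average squared difference between two equal-length feature sequences."""
    if len(enhanced) != len(clean):
        raise ValueError("sequences must have the same length")
    return sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)


def constrained_objective(enhanced, clean, recognition_penalty, weight=0.1):
    """Enhancement loss plus a weighted recognition-side penalty.

    recognition_penalty is a stand-in for a score that a language
    model/decoder might assign to the enhanced signal (assumed to be
    non-negative, with lower values meaning less harm to recognition).
    """
    if recognition_penalty < 0:
        raise ValueError("recognition_penalty must be non-negative")
    # Soft constraint: the penalty term discourages enhancements that
    # lower the enhancement loss at the expense of recognition.
    return mean_squared_error(enhanced, clean) + weight * recognition_penalty
```

With `weight=0`, the objective reduces to the plain enhancement loss; larger values trade perceptual-quality gains for a smaller impact on recognition.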
© IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.