Improving mask learning based speech enhancement system with restoration layers and residual connection
For single-channel speech enhancement, the neural network based mask learning approach has been shown to outperform the feature mapping approach and to be effective as a pre-processor for automatic speech recognition. However, it assumes that the mixture and the clean reference have matching scales. This assumption does not hold for data collected in the real world, and it leads to significant performance degradation on parallel recorded data. In this paper, we first extend mask learning based speech enhancement by integrating two types of restoration layers to address the scale mismatch problem. We further propose a novel residual learning based speech enhancement model by adding different shortcut connections to a feature mapping network. We show that such a structure can benefit from both mask learning and feature mapping. We evaluate the proposed speech enhancement models on CHiME 3 data. Without retraining the acoustic model, the best bidirectional LSTM with residual connections yields a 24.90% relative WER reduction on real data and 34.57% WER on simulated data.
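The scale mismatch problem can be illustrated with a toy numerical sketch. It is not the restoration layers proposed in this paper: the magnitudes are made-up numbers, additive magnitudes are a simplifying assumption, and `restore` is a hypothetical stand-in (a single scalar gain) for whatever rescaling a restoration layer would learn. The sketch only shows that a mask bounded to [0, 1] cannot reach a reference recorded at a larger scale than the mixture, while a rescaling step after the mask can.

```python
import numpy as np


def apply_mask(mixture_mag, mask):
    """Mask-based enhancement: scale each time-frequency bin by a mask in [0, 1]."""
    return np.clip(mask, 0.0, 1.0) * mixture_mag


def oracle_bounded_mask(mixture_mag, target_mag):
    """Best per-bin mask in [0, 1] for a known target (oracle, for illustration only)."""
    ratio = target_mag / np.maximum(mixture_mag, 1e-8)
    return np.clip(ratio, 0.0, 1.0)


def restore(masked_mag, scale):
    """Hypothetical restoration step: one scalar gain applied after masking."""
    return scale * masked_mag


# Toy magnitudes for four time-frequency bins (made-up numbers).
speech = np.array([1.0, 0.5, 2.0, 0.1])
noise = np.array([0.5, 0.5, 0.2, 0.9])
mixture = speech + noise  # assumption: magnitudes add

# Matched scale: the clean reference is the speech component of the mixture.
matched_est = apply_mask(mixture, oracle_bounded_mask(mixture, speech))

# Mismatched scale: the reference was recorded at a higher gain (assumed 3x).
gain = 3.0
reference = gain * speech
mismatched_est = apply_mask(mixture, oracle_bounded_mask(mixture, reference))

# Rescaling the matched-scale masked output recovers the mismatched reference.
restored_est = restore(matched_est, gain)

print("matched error:   ", np.max(np.abs(matched_est - speech)))
print("mismatched error:", np.max(np.abs(mismatched_est - reference)))
print("restored error:  ", np.max(np.abs(restored_est - reference)))
```

Even with an oracle mask, the mismatched case leaves a large error in every bin where the reference exceeds the mixture, because the mask saturates at 1 there.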