MULTI-CHANNEL OVERLAPPED SPEECH RECOGNITION WITH LOCATION GUIDED SPEECH EXTRACTION NETWORK

Spoken Language Technology

Organized by IEEE

Although advances in close-talk speech recognition have resulted in
relatively low error rates, the recognition performance in far-field
environments is still limited due to low signal-to-noise ratio, reverberation,
and overlapped speech from simultaneous speakers, the last of which
is especially difficult. To solve these problems, beamforming
and speech separation networks were previously proposed. However,
they tend to suffer from leakage of interfering speech or limited
generalizability. In this work, we propose a simple yet effective
method for multi-channel far-field overlapped speech recognition.
In the proposed system, three different features are formed for each
target speaker, namely, spectral, spatial, and angle features. A
neural network is then trained on all three features, with the clean
speech of the target speaker as its training target. An iterative update procedure is proposed
in which mask-based beamforming and mask estimation
are performed alternately. The proposed system was evaluated
on real recorded meetings with different overlap ratios.
The results show that the proposed system achieves a relative word
error rate (WER) reduction of more than 24% over fixed beamforming
with oracle selection. Moreover, as the overlap ratio rises from 20%
to over 70%, the WER of the proposed system increases by only 3.8%.
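
The alternation between mask-based beamforming and mask estimation can be illustrated with a minimal NumPy sketch. It is not the system evaluated above: the abstract does not name the beamformer, so a standard mask-based MVDR beamformer is assumed here, and `estimate_mask` is a hypothetical callable standing in for the trained network. All function names, the array layout (frequency, time, channel), and the choice to feed the beamformed output back into mask estimation are assumptions made for illustration.

```python
import numpy as np


def masked_covariance(Y, mask):
    """Mask-weighted spatial covariance for each frequency bin.

    Y: complex STFT, shape (F, T, C). mask: values in [0, 1], shape (F, T).
    Returns an array of shape (F, C, C).
    """
    weighted = Y * mask[..., None]
    cov = np.einsum("ftc,ftd->fcd", weighted, Y.conj())
    norm = np.maximum(mask.sum(axis=1), 1e-8)
    return cov / norm[:, None, None]


def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-6):
    """MVDR weights w = (Phi_n^-1 Phi_s) u_ref / trace(Phi_n^-1 Phi_s)."""
    n_channels = phi_s.shape[-1]
    phi_n = phi_n + eps * np.eye(n_channels)[None]  # diagonal loading
    num = np.linalg.solve(phi_n, phi_s)             # (F, C, C)
    trace = np.trace(num, axis1=1, axis2=2)
    return num[:, :, ref] / (trace[:, None] + eps)


def beamform(Y, w):
    """Apply per-frequency weights: out[f, t] = w[f]^H Y[f, t]."""
    return np.einsum("fc,ftc->ft", w.conj(), Y)


def alternate(Y, mask, estimate_mask, n_iter=2):
    """Alternate beamforming and mask estimation for n_iter rounds.

    estimate_mask(enhanced, Y) is a hypothetical stand-in for the network;
    it must return a new (F, T) mask with values in [0, 1].
    """
    enhanced = None
    for _ in range(n_iter):
        phi_s = masked_covariance(Y, mask)        # target speaker statistics
        phi_n = masked_covariance(Y, 1.0 - mask)  # interference statistics
        enhanced = beamform(Y, mvdr_weights(phi_s, phi_n))
        mask = estimate_mask(enhanced, Y)
    return enhanced, mask
```

In this sketch each round uses the current mask to compute the target and interference statistics, beamforms the mixture, and then hands the enhanced signal back to the mask estimator, so a better mask and a better beamformer can reinforce each other.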