Blind C50 estimation from single-channel speech using a convolutional neural network
The early-to-late reverberation energy ratio is an important parameter describing the acoustic properties of an environment. C50, i.e., the ratio between the first 50 ms and the remaining late energy, affects the perceived clarity and intelligibility of speech, and can be used as a design parameter in mixed reality applications or to predict the performance of speech recognition systems. While established methods exist to derive C50 from impulse response measurements, such measurements are rarely available in practice. Recently, methods have been proposed to estimate C50 blindly from reverberant speech signals. Here, a convolutional neural network (CNN) architecture with a long short-term memory (LSTM) layer is proposed to estimate C50 blindly. The CNN-LSTM operates directly on the spectrogram of variable-length, noisy, reverberant utterances. A feature comparison indicates that log Mel spectrogram features with a frame size of 128 samples achieve the best performance with an average root-mean-square error of about 2.7 dB, outperforming previously proposed blind C50 estimators.