Semi-supervised multi-task learning for acoustic parameter estimation

The acoustic properties of a room can affect the quality of speech audio for listeners. Two basic parameters that characterize the acoustic environment well are reverberation time (RT60), defined as the time it takes for the sound energy to decay by 60 dB after the source is switched off, and clarity (C50/C80), measured as the ratio between the energy of the early reflections (up to 50/80 ms) and the energy of the late response in the decay curve. Another important attribute of speech is its perceived quality. Many algorithms have been developed for speech enhancement and for removing noise, echo, and reverberation, but these algorithms do not necessarily improve speech quality as assessed by human listeners. The mean opinion score (MOS) is a standardized measure for the perceptual evaluation of speech quality, obtained by asking listeners to rate the quality of an audio signal on a scale from one to five. We are interested in estimating RT60, C50, and MOS in a multi-task framework. We combined several partially labeled data sets and applied a semi-supervised approach to estimate RT60, C50, and MOS simultaneously.
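
To make the definitions of RT60 and clarity concrete, the following is a minimal sketch of how both quantities could be computed when a room impulse response (RIR) is available. It is not the estimation method of this work, which predicts the parameters from speech audio. The function names, the -5 dB to -25 dB fitting range, and the synthetic RIR (exponentially decaying noise with a nominal RT60 of 0.5 s) are all illustrative assumptions.

```python
import numpy as np


def schroeder_decay_db(rir):
    """Energy decay curve in dB via backward integration of the squared RIR."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])


def rt60_from_rir(rir, fs, lo_db=-5.0, hi_db=-25.0):
    """Fit a line to the decay curve between lo_db and hi_db and
    extrapolate to a 60 dB decay (fit range is an assumed choice)."""
    edc = schroeder_decay_db(rir)
    t = np.arange(len(rir)) / fs
    mask = (edc <= lo_db) & (edc >= hi_db)
    slope, _ = np.polyfit(t[mask], edc[mask], 1)  # slope in dB per second
    return -60.0 / slope


def clarity_db(rir, fs, early_ms=50.0):
    """Early-to-late energy ratio in dB (C50 for 50 ms, C80 for 80 ms).
    The direct sound is taken to be the largest-magnitude sample."""
    onset = int(np.argmax(np.abs(rir)))
    split = onset + int(round(early_ms * 1e-3 * fs))
    early = np.sum(rir[onset:split] ** 2)
    late = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(early / late)


# Synthetic RIR (assumption, not data from this work): Gaussian noise with
# an exponential envelope chosen so that the energy drops 60 dB in 0.5 s.
fs = 16000
true_rt60 = 0.5
t = np.arange(fs) / fs  # one second of samples
rng = np.random.default_rng(0)
envelope = np.exp(-3.0 * np.log(10.0) * t / true_rt60)
rir = rng.standard_normal(fs) * envelope

rt60_est = rt60_from_rir(rir, fs)
c50 = clarity_db(rir, fs, early_ms=50.0)
c80 = clarity_db(rir, fs, early_ms=80.0)
print(f"RT60 ~ {rt60_est:.2f} s, C50 ~ {c50:.1f} dB, C80 ~ {c80:.1f} dB")
```

Fitting only the upper part of the decay curve and extrapolating to 60 dB is a common practical choice, since the tail of a measured response is usually dominated by noise. C80 is never smaller than C50 for the same response, because moving the boundary later shifts energy from the late part to the early part.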