Learning Utterance-level Representations for Speech Emotion and Age/Gender Recognition Using Deep Neural Networks
Accurately recognizing speaker emotion and age/gender from speech can provide a better user experience for many spoken dialogue systems. In this study, we propose to use deep neural networks (DNNs) to encode each utterance into a fixed-length vector by pooling the activations of the last hidden layer over time. The feature encoding process is jointly trained with the utterance-level classifier so that the encoded vectors are discriminative for the target task. A kernel extreme learning machine (ELM) is then trained on the encoded vectors to further improve utterance-level classification. Experiments on a Mandarin dataset demonstrate the effectiveness of the proposed methods on speech emotion and age/gender recognition tasks.
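The following is a minimal sketch of the utterance-level encoding idea described above: a frame-level DNN whose last-hidden-layer activations are pooled over time into a fixed-length vector, trained jointly with an utterance-level classifier. The layer sizes, feature dimension, and the use of mean pooling are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Frame-level DNN whose last-hidden-layer activations are
    mean-pooled over time to yield a fixed-length utterance vector.
    Dimensions and mean pooling are assumptions for illustration."""
    def __init__(self, feat_dim=40, hidden_dim=256, num_classes=4):
        super().__init__()
        self.frame_dnn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Utterance-level classifier trained jointly with the encoder.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):
        # frames: (num_frames, feat_dim) acoustic features of one utterance
        h = self.frame_dnn(frames)      # (num_frames, hidden_dim)
        utt_vec = h.mean(dim=0)         # temporal pooling -> (hidden_dim,)
        return self.classifier(utt_vec), utt_vec

# Example: encode a 300-frame utterance of 40-dimensional features.
logits, utt_vec = UtteranceEncoder()(torch.randn(300, 40))
```

Under this kind of setup, the pooled vector `utt_vec` could then be passed to a separate utterance-level classifier such as a kernel ELM, as the abstract describes.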