Toxic Speech and Speech Emotions: Investigations of Audio-based Modeling and Intercorrelations
- Wei-Cheng Lin
- Dimitra Emmanouilidou
European Signal Processing Conference (EUSIPCO) | Published by IEEE | Organized by EURASIP
Content moderation (CM) systems have become essential following the monumental increase in multimodal and online social platforms. While a growing body of published work focuses on text-based solutions, work on audio-based methods remains limited. In this study we explore the relationships between speech emotions and toxic speech as part of a CM scenario. We first investigate an appropriate framework for combining speech emotion recognition (SER) and audio-based CM models. We then investigate which emotional aspects (i.e., attributes, sentiment, or attitude) contribute the most to facilitating audio-based CM recognition platforms. Our experimental results indicate that conventional shared-feature-encoder approaches may fail to capture additional discriminative features that would boost audio-based CM tasks when SER learning is utilized. We further investigate the performance trade-offs of late-fusion frameworks for combining SER and CM information. We argue that these observations can be attributed to an emotionally biased distribution in the CM scenario, and conclude that SER could indeed play a role in content moderation frameworks, given added application-specific emotional information.
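As a minimal sketch of the two frameworks contrasted in the abstract, the PyTorch snippet below shows (a) a shared-feature-encoder setup, where one encoder feeds separate SER and CM heads, and (b) a late-fusion setup, where the posteriors of independently trained SER and CM models are combined by a shallow fusion layer. All module names, dimensions, class counts, and the fusion rule are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """(a) Hard parameter sharing: one encoder, separate SER and CM heads.
    Dimensions and class counts are illustrative assumptions."""
    def __init__(self, feat_dim=128, hidden=256, n_ser=3, n_cm=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.ser_head = nn.Linear(hidden, n_ser)  # e.g., Pos/Neu/Neg sentiment
        self.cm_head = nn.Linear(hidden, n_cm)    # e.g., toxic vs. non-toxic

    def forward(self, x):
        z = self.encoder(x)                       # shared representation
        return self.ser_head(z), self.cm_head(z)

class LateFusionCM(nn.Module):
    """(b) Late fusion: concatenate posteriors of frozen, independently
    trained SER and CM models, then learn a shallow fusion classifier."""
    def __init__(self, ser_model, cm_model, n_ser=3, n_cm=2):
        super().__init__()
        self.ser_model, self.cm_model = ser_model, cm_model
        self.fusion = nn.Linear(n_ser + n_cm, n_cm)

    def forward(self, x):
        with torch.no_grad():                     # keep pretrained models frozen
            p_ser = self.ser_model(x).softmax(dim=-1)
            p_cm = self.cm_model(x).softmax(dim=-1)
        return self.fusion(torch.cat([p_ser, p_cm], dim=-1))

# Example usage with dummy utterance-level features:
x = torch.randn(4, 128)                           # batch of 4 utterance embeddings
ser_logits, cm_logits = SharedEncoderMultiTask()(x)

ser_only = nn.Linear(128, 3)                      # stand-ins for pretrained models
cm_only = nn.Linear(128, 2)
toxicity_logits = LateFusionCM(ser_only, cm_only)(x)  # shape: (4, 2)
```

The shared-encoder variant corresponds to the approach the abstract finds may not yield additional discriminative features for CM, while the late-fusion variant corresponds to the framework whose performance trade-offs the paper examines.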
Table. Recognition performance of the Content Moderation (CM) model compared to emotion-based recognition models.
- Attr-1D: Regression model for the emotional attributes arousal and valence, trained on IEMOCAP (1 corpus)
- Senti-1D: Categorical classifier for the 3 sentiment classes Pos/Neu/Neg, trained on IEMOCAP (1 corpus)
- Senti-5D: Categorical classifier for the 3 sentiment classes Pos/Neu/Neg, trained on 5 corpora
- Atti-1D: Categorical classifier for the 3 attitude classes Pos/Neu/Neg, trained on a Customer Support Calls attitude corpus