Approaching Multi-Lingual Emotion Recognition from Speech - On Language Dependency of Acoustic/Prosodic Features for Anger Detection
This paper reports on the mono- and cross-lingual performance of different acoustic and prosodic features. We analyze how to define an optimal feature set when building a multilingual emotion classification system, i.e. a system that can handle more than a single input language. Since we find that cross-lingual emotion recognition suffers from low recognition rates, we analyze our features on both an American English and a German database. Both databases contain speech of real-life users calling into interactive voice response (IVR) platforms. After calculating performance scores under cross-lingual decoding, i.e. when an emotion classification system is confronted with a language it has not been trained on, we report on different strategies to build a single feature space capable of dealing with both languages. We estimate the relative importance of individual features for each language by examining their distributions, their classification scores, and their rank in terms of information gain ratio. Finally, we construct a feature space on the joint data, replacing the two formerly separate systems by a single one. The resulting bilingual emotion recognition system performs as well as the monolingual systems on the test data.
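The feature-ranking step mentioned above can be illustrated with a minimal sketch of information gain ratio computed per feature. The data here is entirely synthetic and the feature names (`f0_mean`, `jitter`) are hypothetical stand-ins for the paper's acoustic/prosodic features; the discretization into equal-width bins is one simple choice among many, not the paper's actual procedure.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels, n_bins=10):
    """Information gain ratio of a continuous feature w.r.t. class labels,
    after equal-width discretization (a simplifying assumption)."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    bins = np.digitize(feature, edges[1:-1])  # interior edges -> bin ids
    h_y = entropy(labels)
    # Conditional entropy H(Y | binned feature)
    cond = sum((bins == b).mean() * entropy(labels[bins == b])
               for b in np.unique(bins))
    split_info = entropy(bins)  # H of the binning itself
    return 0.0 if split_info == 0 else (h_y - cond) / split_info

# Synthetic example: rank two hypothetical features for anger detection
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)               # 0 = non-angry, 1 = angry
f0_mean = y * 1.5 + rng.normal(size=200)  # class-dependent (informative)
jitter = rng.normal(size=200)             # class-independent (uninformative)

ranking = sorted([("f0_mean", gain_ratio(f0_mean, y)),
                  ("jitter", gain_ratio(jitter, y))],
                 key=lambda t: -t[1])
```

Ranking features this way per language, and then comparing the resulting orderings, is one way to judge which features are language-dependent and which carry over to a joint feature space.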