Silviana Widya Lestari, Saliyah Kahar, Trismayanti Dwi
Corresponding email: [email protected]
ABSTRACT
Speech emotion recognition is gaining importance in the domains of pattern recognition and natural language processing. Recent years have seen notable progress in the field, primarily attributed to the successful application of deep learning techniques. However, some research in this area lacks a thorough comparative study of the different deep learning models and techniques used for speech emotion detection, which makes it difficult to identify the best-performing approaches and their relative strengths and weaknesses. The purpose of this work is therefore to provide a comprehensive and detailed overview of deep learning methods for speech emotion detection. The method used is a comparative literature analysis of previous articles relevant to the topic, covering both the deep learning methods and the datasets. The datasets analyzed are EMO-DB, RAVDESS, TESS, CREMA-D, IEMOCAP, and the Danish Emotional Speech Database. The datasets are in English, except for EMO-DB, which is in German, and the Danish Emotional Speech Database, which is in Danish. Most of the emotion types covered by these datasets are basic emotions such as happiness, sadness, neutrality, disgust, surprise, and anger. The results of this review show that the application of deep learning techniques has brought significant progress to speech emotion detection. Complex deep learning models, for instance the CNN-RNN combination, can extract relevant acoustic features and produce accurate results in recognizing emotion from speech. This advancement has significant implications for various applications, including human-computer interaction, affective computing, call center analytics, psychological research, and clinical diagnosis.