Performance Comparison of Deep Learning Algorithm for Speech Emotion Recognition

  • I Gusti Bagus Arya Pradnja Paramitha Universitas Nusa Mandiri
  • Hendra Budi Kusnawan Universitas Nusa Mandiri Jakarta
  • Muji Ernawati Universitas Nusa Mandiri Jakarta
DOI: https://doi.org/10.29303/jcosine.v6i2.443

Abstract

One of the challenges in speech emotion recognition is that speech is time-series data, whereas the feedforward process in a neural network is unidirectional: the output of one layer is passed directly to the next. Such a feedforward process cannot retain past information. Consequently, when a Deep Neural Network (DNN) is used for speech emotion recognition, problems arise, such as handling the speaker's speech rate: a DNN cannot analyze the underlying acoustic patterns and therefore cannot map different speech rates. A method that can process sequential input while retaining relevant information from previous steps is the Recurrent Neural Network (RNN). This paper presents the characteristics of the RNN method, comprising the LSTM and GRU techniques, for speech emotion recognition on the Berlin EMODB dataset. The dataset is split into 80% for training and 20% for testing. The feature extraction methods used are Zero Crossing Rate (ZCR), Mel Frequency Cepstral Coefficients (MFCC), Root Mean Square Energy (RMSE), Mel Spectrogram, and Chroma. This study compares the CNN, LSTM, and GRU algorithms. The classification results show that the CNN algorithm performs best, reaching an accuracy of 79.13%, while LSTM and GRU achieve only 55.76% and 55.14%, respectively.

Published
2022-12-21
Section
Embedded System and Data Communications