Performance Comparison of Deep Learning Algorithms for Speech Emotion Recognition
Abstract
One of the challenges in speech emotion recognition is that speech is time-series data, whereas the feedforward process in a neural network is unidirectional: the output of one layer is passed directly to the next. Such a feedforward process cannot retain past information. Thus, when a Deep Neural Network (DNN) is used for speech emotion recognition, problems arise, such as handling the speaker's speech rate: a DNN cannot analyze the underlying acoustic patterns and therefore cannot map different speech rates. A method that can process sequential input while retaining relevant information from previous steps is the Recurrent Neural Network (RNN). This paper examines the characteristics of RNN methods, namely the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) techniques, for speech emotion recognition on the Berlin EMODB dataset. The dataset is split into 80% for training and 20% for testing. The feature extraction methods used are Zero Crossing Rate (ZCR), Mel Frequency Cepstral Coefficients (MFCC), Root Mean Square Energy (RMSE), Mel Spectrogram, and Chroma. This study compares the Convolutional Neural Network (CNN), LSTM, and GRU algorithms. The classification results show that the CNN algorithm achieves the best accuracy, 79.13%, while LSTM and GRU reach only 55.76% and 55.14%, respectively.
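As a minimal sketch of the feature-extraction step named in the abstract (ZCR, MFCC, RMSE, Mel Spectrogram, Chroma), the following assumes the librosa library; the file path, sampling rate, and number of MFCC coefficients are illustrative assumptions, not settings taken from the paper.

```python
# Sketch of the five feature extractors listed in the abstract, using librosa.
# The frame-level features are averaged over time into one fixed-size vector;
# the paper may instead keep the full frame sequences (assumption).
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=13):
    """Return a single feature vector built from ZCR, MFCC, RMSE, Mel Spectrogram, Chroma."""
    y, sr = librosa.load(path, sr=sr)

    zcr = librosa.feature.zero_crossing_rate(y)              # shape (1, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape (n_mfcc, frames)
    rmse = librosa.feature.rms(y=y)                          # shape (1, frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)         # shape (128, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # shape (12, frames)

    # Average each feature over the time axis and concatenate.
    return np.concatenate([
        zcr.mean(axis=1),
        mfcc.mean(axis=1),
        rmse.mean(axis=1),
        mel.mean(axis=1),
        chroma.mean(axis=1),
    ])
```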
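The comparison of the three classifiers could look like the hedged sketch below, written with tf.keras; the layer sizes, optimizer, and epoch count are illustrative assumptions rather than the configurations reported in the paper, and only the 80/20 split and the seven EMODB emotion classes come from the source material.

```python
# Sketch of interchangeable CNN / LSTM / GRU classifiers for sequence input,
# plus the 80/20 train/test split described in the abstract.
import tensorflow as tf
from sklearn.model_selection import train_test_split

def build_model(kind, input_shape, n_classes=7):   # Berlin EMODB has 7 emotion classes
    inputs = tf.keras.Input(shape=input_shape)     # (timesteps, features)
    if kind == "cnn":
        x = tf.keras.layers.Conv1D(64, 5, activation="relu")(inputs)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
    elif kind == "lstm":
        x = tf.keras.layers.LSTM(64)(inputs)
    else:  # "gru"
        x = tf.keras.layers.GRU(64)(inputs)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage (X and y are assumed: X of shape (samples, timesteps, features),
# y holding integer emotion labels):
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# model = build_model("cnn", input_shape=X_train.shape[1:])
# model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))
```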