Speech Emotion Recognition Using MFCC Features and LSTM-Based Deep Learning Model

arXiv cs.AI / 4/30/2026


Key Points

  • The paper proposes a speech emotion recognition system that extracts Mel-Frequency Cepstral Coefficients (MFCC) features and feeds them into an LSTM neural network to model time-dependent patterns in speech.
  • Using the Toronto Emotional Speech Set (TESS), the audio signals are preprocessed and converted into MFCCs to capture salient temporal characteristics relevant to different emotions.
  • Experimental results indicate that the proposed MFCC-LSTM approach learns long-term features in sequential audio and achieves highly accurate emotion classification across multiple emotion categories.
  • Compared with a traditional baseline of an RBF-kernel SVM (98% accuracy), the LSTM model improves performance to 99% accuracy.
  • The study suggests practical uses such as virtual assistants and mental-health monitoring systems that can interpret emotional cues from speech.
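The MFCC pipeline the key points describe (frame the signal, take a power spectrum, apply a mel filterbank, take logs, then a DCT) can be sketched in plain NumPy. This is a minimal illustrative sketch, not the paper's implementation; in practice one would typically use a library such as `librosa`, and all parameter values below (frame size, hop, filter counts) are assumptions, not taken from the paper.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Minimal MFCC extraction: pre-emphasis, framing, windowing,
    power spectrum, mel filterbank, log, DCT. Illustrative only."""
    # Pre-emphasis boosts high frequencies
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames *= np.hamming(n_fft)
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then DCT to decorrelate -> cepstral coefficients
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T  # shape: (n_frames, n_mfcc)

# Example: one second of a 440 Hz tone at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
feats = mfcc(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (61, 13): 61 frames of 13 coefficients
```

The resulting (frames × coefficients) matrix is exactly the kind of time-ordered feature sequence that the LSTM described in the paper would consume.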

Abstract

Speech Emotion Recognition (SER) is the use of machines to detect the emotional state of humans from speech, a capability of growing importance in natural human-computer interaction. Speech is a valuable source of information because emotions modify its patterns: pitch, energy, and even timing. Nonetheless, SER is not an easy task, owing to speaker variability, changing recording conditions, and the acoustic similarity between certain emotions. In this work, the author introduces a speech emotion recognition system that uses Mel-Frequency Cepstral Coefficients (MFCCs) for feature extraction and a Long Short-Term Memory (LSTM) neural network for classification. The Toronto Emotional Speech Set (TESS) speech signals were preprocessed and transformed into MFCC features to capture salient temporal characteristics. The resulting features were fed into an LSTM model, which can learn long-term features of sequential audio data. The trained model was evaluated over the emotion classes occurring in the dataset. As the experimental results show, the proposed MFCC-LSTM approach succeeds in capturing emotional patterns in speech and provides highly accurate classifications across all of the chosen emotion classes. A Support Vector Machine (SVM) with an RBF kernel served as a classical baseline, achieving 98% accuracy, against which the proposed LSTM model, achieving 99% accuracy, was validated. Overall, the results confirm that LSTM-based architectures can address the task of speech emotion recognition. Practical applications of the proposed system include virtual assistants and mental-health monitoring.
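The abstract's claim that the LSTM "can learn long-term features of sequential audio data" comes down to the cell state it carries across frames. A minimal NumPy forward pass of one LSTM layer over an MFCC-like sequence illustrates the mechanism; the weights here are random stand-ins (a real model, e.g. in Keras, would learn them by training on TESS), and the dimensions and 7-class softmax head are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_seq, W, U, b, hidden=32):
    """Run one LSTM layer over a (T, d) feature sequence and return
    the final hidden state, as used for utterance-level classification."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)          # cell state carries long-term information
    for x in x_seq:
        z = W @ x + U @ h + b     # all four gate pre-activations at once
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates
        c = f * c + i * np.tanh(g)   # gated update of the cell state
        h = o * np.tanh(c)           # hidden state emitted at this step
    return h

d, hidden, T = 13, 32, 61            # 13 MFCCs per frame, 61 frames (assumed)
W = rng.normal(0, 0.1, (4 * hidden, d))
U = rng.normal(0, 0.1, (4 * hidden, hidden))
b = np.zeros(4 * hidden)
x_seq = rng.normal(size=(T, d))      # stand-in for one utterance's MFCC sequence
h_final = lstm_forward(x_seq, W, U, b, hidden)
# A softmax layer over h_final would yield emotion-class probabilities
logits = rng.normal(0, 0.1, (7, hidden)) @ h_final   # e.g. 7 TESS emotions
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.shape)  # (7,)
```

Because the forget gate `f` multiplicatively scales the previous cell state, information from early frames can persist across the whole utterance, which is precisely the long-range modeling advantage over a frame-wise classifier such as the SVM baseline.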