Speech emotion recognition (SER), the task of identifying a speaker's emotional state from audio, remains a challenging problem. SER is used to assess speakers' emotional states in numerous real-time applications such as human behaviour assessment, human-robot interaction, virtual reality, and emergency call centres. In this work, a convolutional neural network (CNN) is utilised to extract high-level features from audio spectrograms, with the aim of improving recognition accuracy while limiting model complexity. The MFCC (Mel-frequency cepstral coefficient) algorithm converts each selected speech segment into a spectrogram-like representation, which is fed into the CNN model to extract discriminative and salient features. The CNN features are then passed to a deep long short-term memory (LSTM) network, which learns the temporal information needed to recognise the final emotional state. To reduce the computational complexity of the overall model, we process only the essential segments of each utterance rather than the entire signal. The proposed approach is evaluated on standard datasets, including TESS and RAVDESS, with the goals of increasing recognition accuracy and shortening the model's processing time.
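As a concrete illustration of the MFCC front end described above, the following is a minimal NumPy sketch of the standard MFCC computation (framing, windowing, power spectrum, mel filterbank, log, DCT-II). All parameter values (sample rate, FFT size, hop length, filter counts) are illustrative defaults, not values taken from the paper.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Compute MFCC features from a 1-D audio signal.
    Parameters are illustrative defaults, not the paper's settings."""
    # 1. Split the signal into overlapping frames and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)

    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 3. Triangular mel filterbank (filters equally spaced on the mel scale).
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)

    # 4. Log mel energies, then a DCT-II to decorrelate them into MFCCs.
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct_mat = np.cos(np.pi / n_mels * (n + 0.5)[None, :] * np.arange(n_mfcc)[:, None])
    return log_mel @ dct_mat.T  # shape: (n_frames, n_mfcc)
```

The resulting time-by-coefficient matrix is the 2-D representation that a CNN can consume in place of a raw spectrogram image.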
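The idea of processing only the essential segments of an utterance can be sketched with a simple short-time-energy criterion: keep the highest-energy frames and discard near-silent ones before feature extraction. The paper's exact selection rule is not specified here, so this energy-based rule and its parameters are purely illustrative assumptions.

```python
import numpy as np

def select_key_segments(signal, frame_len=400, hop=160, keep_ratio=0.5):
    """Keep only the highest-energy frames of an utterance.
    A hypothetical selection rule; the paper's criterion may differ."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    starts = np.arange(n_frames) * hop
    # Short-time energy of each frame.
    energy = np.array([np.sum(signal[s:s + frame_len] ** 2) for s in starts])
    # Retain the top `keep_ratio` fraction of frames, in temporal order.
    k = max(1, int(keep_ratio * n_frames))
    keep = np.sort(np.argsort(energy)[-k:])
    return np.concatenate([signal[s:s + frame_len] for s in starts[keep]])
```

Shortening the input this way reduces the number of frames the CNN and LSTM must process, which is the source of the computational saving claimed above.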



Software and Hardware