Researchers at the University of Cambridge, led by Saifelden M. Ismail, have developed a mobile-efficient system for Speech Emotion Recognition (SER) that could have practical applications in the energy industry, particularly in human-machine interfaces and energy management systems.
The team’s work addresses the computational demands of state-of-the-art transformer architectures, which have limited the deployment of SER in mobile applications. They introduced a system built on DistilHuBERT, a distilled transformer that they further compressed with 8-bit quantization, cutting the parameter count well below that of full-scale Wav2Vec 2.0 models while maintaining competitive accuracy.
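To make the compression step concrete, here is a minimal sketch of how such a model could be loaded and quantized in PyTorch. The public ntu-spml/distilhubert checkpoint and the torch.ao.quantization.quantize_dynamic call are illustrative assumptions, not the authors’ reported pipeline.

```python
import os

import torch
from transformers import AutoModel

# Pretrained distilled backbone; DistilHuBERT has roughly a quarter of the
# parameters of a Wav2Vec 2.0 base encoder.
model = AutoModel.from_pretrained("ntu-spml/distilhubert").eval()

# 8-bit dynamic quantization: linear-layer weights are stored as int8 and
# dequantized on the fly at inference time (the convolutional feature
# extractor stays in fp32).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m: torch.nn.Module) -> float:
    """Serialized state_dict size, a rough proxy for on-device footprint."""
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print(f"fp32: {size_mb(model):.1f} MB -> int8: {size_mb(quantized):.1f} MB")
```

At one byte per weight, an int8 model’s footprint in megabytes roughly equals its parameter count in millions, which is consistent with the 23 MB figure reported below for a model of this scale.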
The researchers conducted a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence. They also augmented training with data from the CREMA-D dataset to improve generalization. This cross-corpus training improved Weighted Accuracy and Macro F1-score and reduced cross-fold variance, with the Neutral class benefiting most.
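The fold construction itself is straightforward, as the sketch below shows using synthetic stand-in metadata rather than the real corpus: holding out one of IEMOCAP’s five sessions per fold keeps that fold’s speakers out of the training split.

```python
import random

# Synthetic stand-in for IEMOCAP metadata: each utterance records which of
# the five sessions it came from. Every session was recorded with a distinct
# speaker pair, so holding a session out keeps its speakers unseen in training.
random.seed(0)
utterances = [
    {"session": s, "label": random.choice(["ang", "hap", "neu", "sad"])}
    for s in range(1, 6)
    for _ in range(100)
]

def loso_folds(data, sessions=range(1, 6)):
    """Yield (held-out session, train split, test split) for each fold."""
    for held in sessions:
        train = [u for u in data if u["session"] != held]
        test = [u for u in data if u["session"] == held]
        yield held, train, test

for held, train, test in loso_folds(utterances):
    # A real run would train on `train` and score Unweighted Accuracy on
    # `test`, then average the five per-fold scores.
    print(f"fold {held}: {len(train)} train / {len(test)} test utterances")
```

Because each session features a different speaker pair, averaging over these five folds measures speaker-independent generalization rather than speaker memorization.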
The quantized model, at a footprint of only 23 MB, achieved an Unweighted Accuracy of 61.4%, approximately 91% of full-scale baseline performance (implying a full-scale baseline of roughly 67% Unweighted Accuracy). This result establishes a Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on resource-constrained mobile devices.
In cross-corpus evaluation on the RAVDESS dataset, the model demonstrated robust arousal detection, with high recall for anger and sadness. However, the theatrical nature of RAVDESS’s acted emotions caused predictions to cluster by arousal level rather than by valence: acoustic saturation in high-energy expressions led to some confusion between happiness and anger, which share high arousal but differ in valence.
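A toy circumplex-model reading of this failure mode, with conventional arousal-valence placements that are illustrative assumptions rather than values from the paper:

```python
# Conventional circumplex-model coordinates (arousal, valence) for the four
# emotions discussed; illustrative placements, not values from the paper.
AROUSAL_VALENCE = {
    "anger":     (0.8, -0.6),
    "happiness": (0.8,  0.6),
    "sadness":   (-0.6, -0.5),
    "neutral":   (0.0,  0.0),
}

# When high-energy speech saturates the acoustic features, the model in
# effect sees only the arousal axis: anger and happiness then collapse onto
# the same value and become hard to separate, while sadness stays distinct.
for emotion, (arousal, valence) in AROUSAL_VALENCE.items():
    print(f"{emotion:9s}  arousal={arousal:+.1f}  valence={valence:+.1f}")
```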
For the energy sector, this research could improve human-machine interfaces in energy management systems, enabling more efficient and user-friendly interactions. The ability to recognize and respond to user emotions could also make energy-saving interventions delivered through mobile applications more effective at encouraging behavior change.
This research was published in the Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023).
This article is based on research available on arXiv.

