Abstract
Emotion recognition is a crucial component in human-computer interaction, enabling systems to respond appropriately to users’ emotional states. This thesis investigates how
neural networks can effectively utilize multimodal inputs—specifically facial expressions
and speech—to enhance emotion recognition compared to unimodal approaches. Four
models were developed: a facial emotion recognition (FER) model, a speech emotion
recognition (SER) model, and two multimodal models employing Early Fusion and Late
Fusion techniques. The models were trained and evaluated on the IEMOCAP dataset.
The results demonstrate that the multimodal models outperform the unimodal models in terms of classification accuracy and F1-score, with the Late Fusion model achieving
the best overall performance. To assess the generalizability of the models, a user study
was conducted where participants expressed various emotions in a controlled environment. However, the models struggled to accurately recognize emotions in this new data,
failing to achieve significant improvements over baseline accuracy. This indicates a lack
of robustness and generalizability, especially in real-world scenarios where emotional
expressions are often subtle and individualized.
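To make the distinction between the two fusion strategies concrete, the following minimal PyTorch sketch contrasts Early Fusion (combining facial and speech features before a shared classifier) with Late Fusion (combining the outputs of separate unimodal classifiers). This is an illustrative sketch only, not the thesis's implementation; the module names, feature dimensions, and the assumption of four emotion classes are placeholders.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate facial and speech features, then classify jointly."""
    def __init__(self, face_dim=512, speech_dim=128, num_classes=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(face_dim + speech_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, face_feat, speech_feat):
        # Fusion happens at the feature level, before classification.
        fused = torch.cat([face_feat, speech_feat], dim=-1)
        return self.classifier(fused)

class LateFusion(nn.Module):
    """Classify each modality separately, then combine the predictions."""
    def __init__(self, face_dim=512, speech_dim=128, num_classes=4):
        super().__init__()
        self.face_head = nn.Linear(face_dim, num_classes)
        self.speech_head = nn.Linear(speech_dim, num_classes)

    def forward(self, face_feat, speech_feat):
        # One simple late-fusion rule: average the per-modality class scores.
        return (self.face_head(face_feat) + self.speech_head(speech_feat)) / 2
```

In this simplified view, Early Fusion lets the classifier learn cross-modal interactions at the feature level, while Late Fusion keeps the unimodal models independent and merges only their decisions.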
| Date of Award | 2024 |
| --- | --- |
| Original language | German (Austria) |
| Supervisor | Werner Christian Kurschl (Supervisor) |