Neural Networks and Fusion Strategies for Multimodal Emotion Recognition

  • Martin Reder

    Student thesis: Master's Thesis

    Abstract

    Emotion recognition is a crucial component in human-computer interaction, enabling
    systems to respond appropriately to users’ emotional states. This thesis investigates how
    neural networks can effectively utilize multimodal inputs—specifically facial expressions
    and speech—to enhance emotion recognition compared to unimodal approaches. Four
    models were developed: a facial emotion recognition (FER) model, a speech emotion
    recognition (SER) model, and two multimodal models employing Early Fusion and Late
    Fusion techniques. The models were trained and evaluated on the IEMOCAP dataset.
    The results demonstrate that the multimodal models outperform the unimodal models
    in terms of classification accuracy and F1-score, with the Late Fusion model
    achieving the best overall performance. To assess the generalizability of the
    models, a user study was conducted in which participants expressed various emotions
    in a controlled environment. However, the models struggled to accurately recognize
    emotions in this new data, failing to achieve significant improvements over
    baseline accuracy. This indicates a lack of robustness and generalizability,
    especially in real-world scenarios where emotional expressions are often subtle
    and individualized.
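
    The two fusion strategies named in the abstract can be illustrated with a minimal
    sketch. The feature dimensions, class count, and linear classifiers below are
    illustrative assumptions standing in for the thesis's actual neural networks,
    which are not specified on this page: Early Fusion concatenates the modality
    features before a single joint classifier, while Late Fusion classifies each
    modality separately and combines the per-modality scores.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical per-modality feature vectors; the real dimensions would come
    # from the FER and SER backbones, which are not described here.
    face_features = rng.normal(size=128)    # e.g. output of a facial (FER) encoder
    speech_features = rng.normal(size=64)   # e.g. output of a speech (SER) encoder

    NUM_CLASSES = 4  # IEMOCAP is commonly evaluated on four emotion classes

    def softmax(z):
        """Numerically stable softmax over a score vector."""
        e = np.exp(z - z.max())
        return e / e.sum()

    # Early Fusion: concatenate modalities, then apply one joint classifier.
    W_early = rng.normal(size=(NUM_CLASSES, 128 + 64))
    fused = np.concatenate([face_features, speech_features])
    early_probs = softmax(W_early @ fused)

    # Late Fusion: classify each modality on its own, then average the
    # class-probability distributions (one simple combination rule).
    W_face = rng.normal(size=(NUM_CLASSES, 128))
    W_speech = rng.normal(size=(NUM_CLASSES, 64))
    late_probs = (softmax(W_face @ face_features)
                  + softmax(W_speech @ speech_features)) / 2

    print("early fusion prediction:", early_probs.argmax())
    print("late fusion prediction:", late_probs.argmax())
    ```

    Averaging probabilities is only one possible late-fusion rule; weighted sums or
    a small classifier over the concatenated score vectors are common alternatives.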
    Date of Award: 2024
    Original language: German (Austria)
    Supervisor: Werner Christian Kurschl

    Study program

    • Software Engineering
