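Evaluation of Current Speech Processing Systems for Speaker Diarization in Practical Scenarios (original title: Evaluierung aktueller Sprachverarbeitungssysteme für Speaker Diarization in praxisnahen Szenarien)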

  • Moritz Peter Gruber

    Student thesis: Master's Thesis

    Abstract

    Speaker diarization is the task of determining which speaker is talking at which point
    in an audio file, with each speaker identified by a label. The field has been researched
    for several years, and recently developed methods based on neural networks have improved
    the accuracy of the process. Speaker diarization is offered by both cloud providers and
    local providers and supports numerous application scenarios, such as meeting
    transcription. The aim of this master's thesis is to compare and evaluate several speech
    processing systems on different data sets. In addition to the evaluation, a separate
    data set is created.
    To meet these goals, a prototype was built that integrates four speech processing
    systems, two from cloud providers and two from local providers. Four data sets are used
    for the evaluation: one is self-collected and the other three are freely available. Two
    metrics are used, the diarization error rate (DER) and the Jaccard error rate (JER).
    Both indicate how accurately a speech processing system performs speaker diarization.
    The self-collected data set consists of three audio files, each five minutes long and
    each containing three speakers, at least one male and one female. The speakers were to
    speak in an Upper Austrian dialect, and different recording locations and recording
    devices were to be used.
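    The two metrics named above can be illustrated with a minimal Python sketch. It is not
    the thesis's implementation: it assumes frame-level labels with at most one speaker per
    frame (None marks non-speech), no scoring collar, and a brute-force search over speaker
    mappings that is only practical for a handful of speakers. The function names der and
    jer and the helper _mappings are hypothetical.

```python
from itertools import permutations


def _mappings(ref_spk, hyp_spk):
    """All one-to-one assignments of reference speakers to hypothesis speakers.

    None in a mapping means the reference speaker stays unmatched.
    Brute force: suitable only for small speaker counts (assumption of this sketch).
    """
    padded = hyp_spk + [None] * len(ref_spk)
    return set(permutations(padded, len(ref_spk)))


def der(ref, hyp):
    """Frame-level diarization error rate: (missed + false alarm + confusion) / reference speech."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    total = sum(r is not None for r in ref)
    if total == 0:
        raise ValueError("reference contains no speech")
    missed = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))

    # Count frames shared by each (reference speaker, hypothesis speaker) pair.
    overlap = {}
    for r, h in zip(ref, hyp):
        if r is not None and h is not None:
            overlap[(r, h)] = overlap.get((r, h), 0) + 1
    both = sum(overlap.values())

    ref_spk = sorted({r for r in ref if r is not None})
    hyp_spk = sorted({h for h in hyp if h is not None})
    # The best mapping maximises the number of correctly attributed frames.
    best = max(
        sum(overlap.get((r, h), 0) for r, h in zip(ref_spk, m))
        for m in _mappings(ref_spk, hyp_spk)
    )
    confusion = both - best
    return (missed + false_alarm + confusion) / total


def jer(ref, hyp):
    """Frame-level Jaccard error rate: mean over reference speakers of 1 - |intersection| / |union|."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    ref_spk = sorted({r for r in ref if r is not None})
    hyp_spk = sorted({h for h in hyp if h is not None})
    if not ref_spk:
        raise ValueError("reference contains no speech")

    def jaccard_error(r, h):
        if h is None:  # an unmatched reference speaker counts as a full error
            return 1.0
        inter = sum(a == r and b == h for a, b in zip(ref, hyp))
        union = sum(a == r or b == h for a, b in zip(ref, hyp))
        return 1.0 - inter / union

    # Pick the mapping with the lowest mean per-speaker Jaccard error.
    return min(
        sum(jaccard_error(r, h) for r, h in zip(ref_spk, m)) / len(ref_spk)
        for m in _mappings(ref_spk, hyp_spk)
    )


if __name__ == "__main__":
    reference = ["A", "A", "A", "B", "B", None]
    hypothesis = ["x", "x", "y", "y", "y", "y"]
    print(f"DER = {der(reference, hypothesis):.3f}")
    print(f"JER = {jer(reference, hypothesis):.3f}")
```

    Because JER averages over speakers rather than over speech time, a speaker with little
    speech weighs as much as a dominant one, which is one reason the two metrics can diverge
    on the same recording.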
    In the evaluation, the pyannote.audio speech processing system, a local provider,
    performed best in terms of both the diarization error rate and the Jaccard error rate.
    The other speech processing systems (Amazon Transcribe, Microsoft Azure Speech to Text,
    Picovoice Falcon) performed slightly worse than pyannote.audio. In general, the
    diarization error rates are low, that is, favorable, whereas the Jaccard error rates
    are less favorable. The systems perform less well on the self-collected data set, with
    the diarization error rate and the Jaccard error rate being very high in some cases.
    The annotation effort was also examined: annotating a five-minute audio file takes
    between two and three hours. Identifying the speakers during annotation worked well.
    Date of Award: 2025
    Original language: German (Austria)
    Supervisor: Werner Christian Kurschl

    Study program

    • Software Engineering
