Abstract
Speaker diarization is the recognition of speakers in an audio file; the speakers are identified using labels. The subject area has been researched for several years, and in recent years new methods for performing speaker diarization have been developed using neural networks, which improve the accuracy of this process. Speaker diarization is supported by both cloud providers and local providers and offers numerous application scenarios, such as meeting transcription. The aim of this master's thesis is to compare several speech processing systems by evaluating them on different data sets. In addition to the evaluation, a separate data set is also created.
To fulfill these goals, a prototype is created that contains four different speech processing systems: two cloud providers and two local providers. Four data sets are used for the evaluation, one of which is self-collected; the remaining data sets are freely available. Two metrics are used for the evaluation, namely the diarization error rate (DER) and the Jaccard error rate (JER). Both metrics provide information about the accuracy of the speech processing systems with regard to speaker diarization. The self-collected data set consists of three audio files, each five minutes long and containing three speakers, at least one male and one female. The speakers are required to speak in an Upper Austrian dialect, and different recording locations and recording devices are to be used.
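The diarization error rate mentioned above is conventionally defined as the sum of missed speech, false alarm speech, and speaker confusion, divided by the total speech time. As a minimal sketch (the function and argument names are illustrative, not taken from the thesis prototype):

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Standard DER: (missed detection + false alarm + speaker
    confusion) / total speech duration. All arguments are durations
    in seconds; the result can exceed 1.0 when false alarms are large."""
    return (missed + false_alarm + confusion) / total_speech

# Example: 3 s missed, 2 s false alarm, 5 s confusion over 100 s of speech
print(diarization_error_rate(3.0, 2.0, 5.0, 100.0))  # 0.1, i.e. 10 % DER
```

The Jaccard error rate is computed per speaker (from the Jaccard index between reference and hypothesis segments) and then averaged, which is why it can diverge noticeably from the DER when individual speakers are poorly recognized.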
In the subsequent evaluation, it is noticeable that the pyannote.audio speech processing system, a local provider, performed best in terms of both the diarization error rate and the Jaccard error rate. The other speech processing systems (Amazon Transcribe, Microsoft Azure Speech to Text, Picovoice Falcon) performed slightly worse than pyannote.audio. In general, the results for the diarization error rate are low, whereas the results for the Jaccard error rate are not as good.
The self-collected data set performs less well in the evaluation, with the diarization error rate and Jaccard error rate being very high in some cases. The annotation effort was also examined: annotating a five-minute audio file takes between two and three hours. The identification of the speakers worked well during annotation.
| Date of Award | 2025 |
|---|---|
| Original language | German (Austria) |
| Supervisor | Werner Christian Kurschl (Supervisor) |
Study program
- Software Engineering