The rapid expansion in the scale and capabilities of large language models (LLMs) has increased the need for reliable safety assessments. Benchmarks such as DoNotAnswer (DNA), RealToxicityPrompts (RTP), ToxiGen, and HarmBench have been developed to evaluate the safety of LLMs in terms of toxicity, refusal behavior, and moderation of harmful content. However, the reliability and robustness of these benchmarks have received little study. This work addresses that gap through a statistical meta-analysis of these benchmarks, identifying task- and domain-specific biases that could compromise their applicability in real-world scenarios. To this end, the benchmark tasks were reformulated as summarization problems and embedded in different domain contexts to test how consistently harmful content is detected. The results show significant differences in benchmark sensitivity: reformulating the tasks leads to more harmful content being detected in DNA, RTP, and ToxiGen, while HarmBench detects less. Domain shifts significantly affect the detection of toxic language, with some domains exhibiting higher detection rates and others lower sensitivity. These inconsistencies indicate a lack of transferability across benchmarks and necessitate domain-specific adjustments. Cross-comparisons between the benchmarks show weak correlations, which may be attributed to differing definitions of harmfulness. The study concludes that current benchmarks do not offer objective or universal reliability but instead depend heavily on task formulation, domain context, and the underlying definition of harm. It advocates adaptive, multidimensional evaluation frameworks that account for task and domain variability to better reflect the real-world use of LLMs. Despite limitations such as the reliance on automated classifiers, a limited range of tasks, and restricted domain coverage, the work provides insights for the further development of dynamic, transparent, and inclusive benchmarking strategies to promote trustworthy safety assessment of LLMs.
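The abstract outlines the core procedure: wrapping benchmark items into domain-specific summarization prompts, scoring model outputs with automated harmfulness classifiers, and correlating detection rates across benchmarks. The sketch below is only an illustration of that idea, not the thesis' actual pipeline; the domain templates, the `classify_harmful` callable, and all function names are hypothetical placeholders.

```python
"""Illustrative sketch of the evaluation idea described in the abstract.

Assumptions: domain templates, `classify_harmful`, and all names below
are hypothetical; they do not reproduce the thesis' implementation.
"""
from scipy.stats import spearmanr

# Hypothetical domain contexts used to wrap original benchmark items.
DOMAIN_TEMPLATES = {
    "medical": "Summarize the following patient forum post:\n\n{item}",
    "legal":   "Summarize the following courtroom transcript excerpt:\n\n{item}",
    "news":    "Summarize the following news comment thread:\n\n{item}",
}


def reformulate(item: str, domain: str) -> str:
    """Embed a benchmark item into a domain-specific summarization prompt."""
    return DOMAIN_TEMPLATES[domain].format(item=item)


def detection_rate(outputs: list[str], classify_harmful) -> float:
    """Fraction of model outputs flagged as harmful by an automated classifier."""
    flags = [classify_harmful(text) for text in outputs]
    return sum(flags) / len(flags) if flags else 0.0


def cross_benchmark_correlation(rates_a: list[float], rates_b: list[float]) -> float:
    """Spearman rank correlation between per-domain detection rates of two benchmarks."""
    rho, _p_value = spearmanr(rates_a, rates_b)
    return rho
```

Under these assumptions, a weak value of `cross_benchmark_correlation` across matched domains would mirror the abstract's finding that the benchmarks disagree on what counts as harmful.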
| Date of Award | 2025 |
|---|---|
| Original language | English |
| Supervisor | Andreas Stöckl (Supervisor) |
- Data Science and Engineering
Is Content Moderation for LLMs Task-biased?: A Statistical Meta Study on the Impact of LLM Tasks and their Domains on Alignment Benchmarks
Gugg, R. (Author). 2025
Student thesis: Master's Thesis