Success Potential of ChatGPT in Solving Tasks from Professional Examinations in Bookkeeping, Financial Accounting and Cost Accounting: An Experimental Study

  • Alexandra Strobl

    Student thesis: Master's Thesis

    Abstract

    The rapid development of large language models (LLMs) such as ChatGPT opens up new possibilities in the field of knowledge work, including in accounting. While the use of LLMs is becoming increasingly established in practice, their suitability for processing demanding technical examination content in accounting has so far been investigated only to a limited extent. International studies report partially divergent results when LLMs are used to solve accounting examinations. For German-speaking countries, particularly with regard to national bookkeeping and financial reporting under Austrian tax and company law, empirically sound findings are still lacking. This master's thesis addresses this research gap and systematically analyses the performance of ChatGPT in solving tasks from national professional accounting examinations.

    The thesis is divided into five chapters. After a theoretical introduction to LLMs and an overview of the current state of research, the experimental study at the centre of the thesis is presented. A total of 62 tasks were selected from the Austrian accountant examinations (autumn 2024) and presented to the ChatGPT 4o and o3 models under standardised conditions. The tasks covered a broad spectrum, from open questions and accounting entries to complex cost accounting calculations. Each test was repeated four times in order to measure the consistency of the answers. The evaluation was based on official answer keys, supplemented by qualitative error categories.

    The results show that the success rate depends heavily on the type of task. Structured formats such as single-choice or true/false questions were answered with a high degree of accuracy, while accounting entries and open calculation problems were particularly error-prone.
These required specific expertise, precise application of the Austrian Commercial Code (UGB) and detailed tax knowledge, areas in which both models tested (ChatGPT 4o and o3) showed systematic weaknesses. Notably, the o3 model consistently achieved better results than 4o, both in the average points scored and in the reproducibility across repetitions. For example, o3 achieved up to 85% of the possible points in cost accounting, while 4o in some cases remained below the pass mark of 60%. Even with identical prompts, the answers sometimes varied significantly. Overall, the experimental study shows that current LLMs offer considerable potential for supporting standardised tasks in bookkeeping and accounting. However, their suitability for highly qualified specialised examinations is limited and depends strongly on prompt design, model version and task type.
    Date of Award: 2025
    Original language: German (Austria)
    Supervisor: Susanne Leitner-Hanetseder

    Study programme

    • Controlling, Accounting and Financial Management
