Contribution to Automation in Learning Video Production - GPT-Supported Software Documentation Comprehension and Click Position Analysis

  • Milan Vasilic

    Student thesis: Master's Thesis

    Abstract

    Recent methods and tools from the fields of computer vision and AI can assist with a wide
    range of tasks. Companies such as OpenAI have transformed text understanding, image
    classification, and the ability to ask questions about an image and receive an answer.
    With OpenAI's newest large language and vision models, it is possible to interact with
    text and analyze images. This master's thesis examines the automation of learning video
    production, with the main goal of gaining initial insights into using GPTs, especially
    OpenAI's models, for analyzing software documentation.
    The thesis is divided into a comprehension section and a click position determination
    section. OpenAI's large language model is used to read the documentation and check
    comprehension, providing an initial insight into the automatic creation of scripts for
    learning videos. To achieve this, a section of the company's documentation is provided
    to OpenAI's GPT models (GPT-4 Turbo and GPT-4o), followed by two questions about this
    documentation. Employees of the company then answer additional questions about the
    given responses to verify understanding and accuracy.
    To evaluate the practical application of accurately positioning mouse clicks, OpenAI's
    vision model was used to identify the positions of text in buttons and text input fields
    in images and to determine the corresponding click positions for interaction. Ten
    prompts were tested on sixty images to assess accuracy. It was found that prompt
    complexity hardly affects accuracy. The vision model struggles more with the precise
    determination of bounding boxes than with identifying click coordinates. A minimal
    increase in accuracy was observed when Matplotlib was used to verify image coordinates.
    With the best prompt, the mean Euclidean distances between predicted and actual click
    positions were small, indicating reasonable accuracy. Errors were more frequent in the
    Y-coordinate than in the X-coordinate. The thesis demonstrates initial steps toward
    automating learning video production, showing that GPT models can be used effectively,
    though further improvements are needed to enhance accuracy and efficiency.
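    The abstract does not include the evaluation code itself. As a minimal sketch of the
    metric described above (mean Euclidean distance between predicted and ground-truth click
    positions, with separate per-axis error counts), the comparison could look like the
    following; all names, and the pixel tolerance, are illustrative assumptions rather than
    the thesis's actual implementation:

    ```python
    import math

    def click_errors(predicted, ground_truth, tolerance=10):
        """Compare predicted click positions against ground truth.

        predicted, ground_truth: lists of (x, y) pixel coordinates.
        tolerance: pixel threshold for counting an axis as erroneous
                   (illustrative value, not taken from the thesis).
        Returns the mean Euclidean distance and per-axis error counts.
        """
        distances = []
        x_errors = y_errors = 0
        for (px, py), (gx, gy) in zip(predicted, ground_truth):
            # Euclidean distance between predicted and true click point
            distances.append(math.hypot(px - gx, py - gy))
            # Count deviations per axis, mirroring the X- vs Y-error comparison
            if abs(px - gx) > tolerance:
                x_errors += 1
            if abs(py - gy) > tolerance:
                y_errors += 1
        mean_distance = sum(distances) / len(distances)
        return mean_distance, x_errors, y_errors
    ```

    A metric of this shape would make the reported observation measurable: Y-axis errors
    can be counted separately from X-axis errors, and prompts can be ranked by mean
    distance.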
    Date of Award: 2024
    Original language: German (Austria)
    Supervisors: Andreas Stöckl (Supervisor), Elias Ramoser (Supervisor) & Christof Feischl (Supervisor)