Abstract
Many new methods and tools in the fields of computer vision and AI can be a helpful aid in many tasks. Companies like OpenAI have revolutionized text understanding, image classification, and the ability to ask questions about an image and receive a response. With OpenAI's newest large language and vision models, it is possible to interact with text and analyze images. This master's thesis analyzes the automation of learning-video production, with the main goal of gaining initial insights into using GPTs, especially OpenAI's models, for analyzing software documentation.
The thesis is divided into a comprehension section and a click-position determination section. OpenAI's large language model is used to read the documentation and check comprehension, providing an initial insight into the automatic creation of scripts for learning videos. To achieve this, a section of the company's documentation is provided to OpenAI's GPT models (GPT-4 Turbo and GPT-4o), followed by two questions about this documentation. Company employees then answer additional questions about the given responses to verify understanding and accuracy.
To evaluate the practical application of accurately positioning mouse clicks, OpenAI's vision model was used to identify the positions of text in buttons and text input fields in images and to determine the corresponding click positions for interaction. Ten prompts were tested on sixty images to assess accuracy. Prompt complexity was found to hardly affect accuracy. The vision model struggles more with the precise determination of bounding boxes than with identifying click coordinates. A minimal increase in accuracy was observed when Matplotlib was used to check image coordinates. With the best prompt, the mean Euclidean distances between predicted and actual click positions were small, indicating reasonable accuracy. Errors were more frequent in the Y-coordinate than in the X-coordinate. The thesis demonstrates initial steps toward automating learning-video production, showing that GPT models can be used effectively, though further improvements are necessary to enhance accuracy and efficiency.
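The abstract reports mean Euclidean distances between predicted and actual click positions, along with per-axis error behavior, but does not specify the implementation. The following minimal sketch shows one plausible way to compute these metrics; the function names and the sample coordinates are illustrative assumptions, not taken from the thesis.

```python
import math

def mean_euclidean_distance(predicted, actual):
    """Mean Euclidean distance between predicted and ground-truth
    click positions, each given as (x, y) pixel tuples."""
    assert len(predicted) == len(actual)
    dists = [math.hypot(px - ax, py - ay)
             for (px, py), (ax, ay) in zip(predicted, actual)]
    return sum(dists) / len(dists)

def mean_axis_errors(predicted, actual):
    """Mean absolute error per axis, to compare X- vs Y-coordinate errors."""
    n = len(predicted)
    mx = sum(abs(px - ax) for (px, _), (ax, _) in zip(predicted, actual)) / n
    my = sum(abs(py - ay) for (_, py), (_, ay) in zip(predicted, actual)) / n
    return mx, my

# Hypothetical data: model-predicted vs annotated click centers (pixels)
pred = [(120, 48), (300, 210)]
truth = [(118, 52), (305, 200)]
print(mean_euclidean_distance(pred, truth))
print(mean_axis_errors(pred, truth))
```

Comparing the two per-axis means directly would reproduce the abstract's observation that Y-coordinate errors occur more often than X-coordinate errors.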
| Date of Award | 2024 |
|---|---|
| Original language | German (Austria) |
| Supervisor | Andreas Stöckl (Supervisor), Elias Ramoser (Supervisor) & Christof Feischl (Supervisor) |