Self-Repairing Data Scraping for Websites

Research output: Chapter in Book/Report/Conference proceedings › Conference contribution › peer-review

Abstract

Pre-processing and cleaning data from the web is challenging because web pages are updated regularly. Whenever a monitored page changes its layout, existing processing pipelines break and must be adapted to the new design. We present an approach to scraping website data that uses Large Language Models (LLMs) to determine the location of the desired information and to create a JavaScript path to the corresponding object in the Document Object Model (DOM) of the HTML page. Our approach automatically detects when the path can no longer be parsed and repairs it, keeping the scraping pipeline continuously up to date. Using the website kununu.com as an example, we show that our approach enables consistent scraping and self-repair without significantly impacting system performance: the LLM is only activated when an error in the pipeline is detected. In the future, we plan to extend this approach to multiple websites and data sources.
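The detect-then-repair loop described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the Python libraries (requests, BeautifulSoup), the CSS-selector representation of the DOM path, and the repair_selector helper are assumptions; in the paper the path is a JavaScript path into the DOM and the repair step is performed by an LLM.

```python
# Minimal sketch of a self-repairing scrape step (illustrative assumptions,
# not the authors' implementation).
import requests
from bs4 import BeautifulSoup

def repair_selector(html: str, description: str) -> str:
    """Placeholder for the LLM repair step: given the new page HTML and a
    natural-language description of the desired information, return a new
    DOM selector pointing at it."""
    raise NotImplementedError("call an LLM of your choice here")

def scrape(url: str, selector: str, description: str) -> tuple[str, str]:
    """Return (value, selector); the selector is regenerated only on failure."""
    html = requests.get(url, timeout=30).text
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    if node is None:                                   # layout changed, path broke
        selector = repair_selector(html, description)  # repair step (LLM) activated
        node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True), selector
```

The relevant property, as stated in the abstract, is that the repair step runs only on the failure branch, so steady-state scraping incurs no LLM overhead.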
Original language: English
Title of host publication: International Conference on Electrical, Computer, Communications and Mechatronics Engineering, ICECCME 2024
Publisher: IEEE
Pages: 1-4
Number of pages: 4
ISBN (Electronic): 9798350391183
ISBN (Print): 979-8-3503-9119-0
DOIs
Publication status: Published - 6 Nov 2024
Event: 2024 4th International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME) - Male, Maldives
Duration: 4 Nov 2024 - 6 Nov 2024

Publication series

Name: International Conference on Electrical, Computer, Communications and Mechatronics Engineering, ICECCME 2024

Conference

Conference: 2024 4th International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME)
Period: 04.11.2024 - 06.11.2024

Keywords

  • Data Cleaning
  • Data Scraping
  • LLM
