Self-Repairing Data Scraping for Websites

Publication: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Pre-processing and cleaning data from the web is challenging because web pages are updated regularly. Whenever a monitored page changes its layout, existing processing pipelines break and must be adapted to the new design. We present an approach to scraping website data that uses Large Language Models (LLMs) to locate the desired information and to create a JavaScript path to the corresponding object in the Document Object Model (DOM) of the HTML page. Our approach automatically detects when the path can no longer be parsed and repairs it, keeping the scraper up to date. Using the website kununu.com as an example, we show that our approach allows for consistent scraping and self-repair without overly impacting system performance, because LLMs are only activated when an error in the pipeline is detected. In the future, we plan to extend this approach to multiple websites and data sources.
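
The detect-and-repair loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simplified setting in which the page is well-formed markup, the stored path is an ElementTree-style path rather than a JavaScript path, and the LLM call is replaced by a hypothetical stub (stub_llm_repair) that returns a hard-coded path. The class name, function names, and sample pages are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# Two invented versions of the same page: the layout changes between them.
OLD_PAGE = "<html><body><div><span>4.2</span></div></body></html>"
NEW_PAGE = "<html><body><section><p class='score'>4.3</p></section></body></html>"


def extract(page: str, path: str) -> Optional[str]:
    """Return the text found at `path`, or None if the path no longer resolves."""
    root = ET.fromstring(page)
    node = root.find(path)
    if node is None or node.text is None:
        return None
    text = node.text.strip()
    return text if text else None


def stub_llm_repair(page: str, broken_path: str) -> str:
    """Hypothetical stand-in for the LLM call.

    A real system would send the page (or a reduced version of it) and a
    description of the wanted value to an LLM and receive a new path back.
    Here the answer is hard-coded so the sketch runs offline.
    """
    return ".//p[@class='score']"


class SelfRepairingScraper:
    """Keeps one path and asks for a repair only when extraction fails."""

    def __init__(self, path: str, repair: Callable[[str, str], str]) -> None:
        self.path = path
        self.repair = repair
        self.repair_calls = 0  # counts how often the (expensive) repair ran

    def scrape(self, page: str) -> Optional[str]:
        value = extract(page, self.path)
        if value is None:
            # Error detected: invoke the repair step once and retry.
            self.repair_calls += 1
            self.path = self.repair(page, self.path)
            value = extract(page, self.path)
        return value


scraper = SelfRepairingScraper("./body/div/span", stub_llm_repair)
first = scraper.scrape(OLD_PAGE)   # stored path still works, no repair
second = scraper.scrape(NEW_PAGE)  # layout changed, path is repaired once
third = scraper.scrape(NEW_PAGE)   # repaired path is reused, no new repair
```

In this sketch the repair function runs only on the first scrape after the layout change. Later scrapes reuse the repaired path, which mirrors the abstract's point that the LLM is activated only when the pipeline detects an error.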
Original language: English
Title of host publication: International Conference on Electrical, Computer, Communications and Mechatronics Engineering, ICECCME 2024
Publisher: IEEE
Pages: 1-4
Number of pages: 4
ISBN (electronic): 9798350391183
ISBN (print): 979-8-3503-9119-0
DOIs
Publication status: Published - 6 Nov 2024
Event: 2024 4th International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME) - Male, Maldives
Duration: 4 Nov 2024 to 6 Nov 2024

Publication series

Name: International Conference on Electrical, Computer, Communications and Mechatronics Engineering, ICECCME 2024

Conference

Conference: 2024 4th International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME)
Period: 04.11.2024 to 06.11.2024

Keywords

  • Mechatronics
  • System performance
  • Soft sensors
  • Large language models
  • Pipelines
  • Layout
  • Web pages
  • Maintenance engineering
  • Data models
  • Monitoring
