Document Segmentation for Topic Modelling with Sentence Embeddings

Andreas Stöckl, Ernes Ibrovic

Research output: Chapter in Book/Report/Conference proceedings › Conference contribution › peer-review

Abstract

Topic modeling, a powerful technique to extract hidden structures from voluminous documents, encounters challenges with large documents encompassing myriad topics. Traditional segmentation methods struggle to identify precise points of topic shifts. This paper presents an innovative approach to document segmentation that leverages embeddings and vector distances to discern topic transitions. Building upon the TextTiling method, which primarily relies on patterns of lexical distribution, our method enriches the segmentation process by incorporating syntactic structures and rhetorical cues.

Our method utilizes Sentence-BERT (SBERT) to generate embeddings for individual sentences. Through this, our approach compares embeddings of consecutive sentences to identify coherence and determine segmentation points governed by a tunable parameter, n. The computed similarities, termed 'gap scores,' undergo a smoothing process to counteract noise, improving the accuracy of segment identification.

Furthermore, recognizing that longer documents may revisit similar topics at different intervals, our method introduces a clustering mechanism. This groups segments of analogous content, ensuring each topic's unique representation in the document, mitigating redundancy, and enhancing topic modeling results. Our approach delivers a robust, comprehensive, and efficient method for document segmentation tailored for advanced topic modeling applications.
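
The pipeline described in the abstract (gap scores over a window of n sentences, smoothing, boundary selection, then grouping of similar segments) might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the authors' implementation: it expects sentence embeddings that were already computed (for example with an SBERT model), and the moving-average smoother, the local-minimum rule with a `threshold`, the greedy grouping with `min_similarity`, and all function names are hypothetical choices made here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gap_scores(embeddings, n=2):
    """Score the gap after sentence i by comparing the mean embedding of up to
    n sentences before the gap with up to n sentences after it."""
    scores = []
    for i in range(len(embeddings) - 1):
        left = embeddings[max(0, i - n + 1): i + 1]
        right = embeddings[i + 1: i + 1 + n]
        scores.append(cosine(mean_vector(left), mean_vector(right)))
    return scores


def smooth(scores, width=1):
    """Moving average over `width` neighbours on each side (an assumed smoother)."""
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - width): i + width + 1]
        out.append(sum(window) / len(window))
    return out


def boundaries(scores, threshold=0.5):
    """Assumed rule: cut at local minima of the smoothed scores below `threshold`.
    Gap i lies between sentence i and sentence i + 1."""
    cuts = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i < len(scores) - 1 else float("inf")
        if s < threshold and s < left and s <= right:
            cuts.append(i)
    return cuts


def segment(sentences, embeddings, n=2, width=1, threshold=0.5):
    """Split `sentences` into lists of consecutive sentences at detected cuts."""
    cuts = boundaries(smooth(gap_scores(embeddings, n), width), threshold)
    segments, start = [], 0
    for c in cuts:
        segments.append(sentences[start: c + 1])
        start = c + 1
    segments.append(sentences[start:])
    return segments


def group_segments(segment_embeddings, min_similarity=0.8):
    """Greedy grouping (an assumed stand-in for the paper's clustering step):
    put each segment into the first group whose centroid is similar enough."""
    groups = []
    for idx, emb in enumerate(segment_embeddings):
        for g in groups:
            centroid = mean_vector([segment_embeddings[j] for j in g])
            if cosine(centroid, emb) >= min_similarity:
                g.append(idx)
                break
        else:
            groups.append([idx])
    return groups
```

In practice the `embeddings` argument would hold one vector per sentence from whichever sentence-embedding model is used, and `n`, `width`, and `threshold` would need tuning per corpus. The segment embeddings passed to `group_segments` could, for instance, be the mean of each segment's sentence vectors.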
Original language: English
Title of host publication: International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications, ACDSA 2024
Pages: 1-5
Number of pages: 5
ISBN (Electronic): 9798350394528
Publication status: Published - Mar 2024

Publication series

Name: International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications, ACDSA 2024

Keywords

  • Deep learning
  • Technological innovation
  • Smoothing methods
  • Semantics
  • Computer architecture
  • Syntactics
  • Transformers
  • document segmentation
  • topic modeling
  • sentence embeddings
