Document Segmentation for Topic Modelling with Sentence Embeddings

Andreas Stöckl, Ernes Ibrovic

Publication: Contribution to book/report/conference proceedings, conference contribution, peer-reviewed


Topic modeling, a powerful technique to extract hidden structures from voluminous documents, encounters challenges with large documents encompassing myriad topics. Traditional segmentation methods struggle to identify precise points of topic shifts. This paper presents an innovative approach to document segmentation that leverages embeddings and vector distances to discern topic transitions. Building upon the TextTiling method, which primarily relies on patterns of lexical distribution, our method enriches the segmentation process by incorporating syntactic structures and rhetorical cues.

Our method utilizes Sentence-BERT (SBERT) to generate embeddings for individual sentences. Through this, our approach compares embeddings of consecutive sentences to identify coherence and determine segmentation points governed by a tunable parameter, n. The computed similarities, termed 'gap scores', undergo a smoothing process to counteract noise, improving the accuracy of segment identification.

Furthermore, recognizing that longer documents may revisit similar topics at different intervals, our method introduces a clustering mechanism. This groups segments of analogous content, ensuring each topic's unique representation in the document, mitigating redundancy, and enhancing topic modeling results. Our approach delivers a robust, comprehensive, and efficient method for document segmentation tailored for advanced topic modeling applications.
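The pipeline outlined in the abstract (gap scores between neighbouring sentence embeddings, smoothing, boundary selection, clustering of segments) might be sketched roughly as follows. This is an illustrative reading only, not the authors' implementation. The abstract does not specify the details, so the following are all assumptions: n is taken to be the number of sentences averaged on each side of a gap, smoothing is a moving average, boundaries are local minima below a fixed threshold, and clustering is a greedy cosine-similarity grouping. The function names and the toy two-dimensional vectors are hypothetical. In practice the embeddings would come from an SBERT model, for example via the sentence-transformers package.

```python
import numpy as np

def gap_scores(emb, n=2):
    """Cosine similarity between the mean of the n sentences before and
    after each gap (assumed reading of the abstract's parameter n)."""
    emb = np.asarray(emb, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    scores = []
    for i in range(1, len(unit)):
        left = unit[max(0, i - n):i].mean(axis=0)
        right = unit[i:i + n].mean(axis=0)
        denom = np.linalg.norm(left) * np.linalg.norm(right)
        scores.append(float(left @ right / denom))
    return np.array(scores)

def smooth(scores, width=3):
    """Moving average with edge padding; width is assumed to be odd."""
    pad = width // 2
    padded = np.pad(scores, pad, mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")

def boundaries(scores, threshold=0.5):
    """Indices of sentences that start a new segment: local minima of
    the smoothed gap scores that fall below an assumed fixed threshold."""
    cuts = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else np.inf
        right = scores[i + 1] if i < len(scores) - 1 else np.inf
        if s < threshold and s <= left and s <= right:
            cuts.append(i + 1)  # gap i lies between sentence i and i + 1
    return cuts

def segment(emb, n=2, width=3, threshold=0.5):
    """Return (start, end) sentence index pairs, end exclusive."""
    emb = np.asarray(emb, dtype=float)
    cuts = boundaries(smooth(gap_scores(emb, n), width), threshold)
    edges = [0] + cuts + [len(emb)]
    return list(zip(edges[:-1], edges[1:]))

def cluster_segments(seg_emb, min_sim=0.8):
    """Greedy grouping: a segment joins the most similar existing
    cluster representative, or opens a new cluster."""
    reps, labels = [], []
    for v in seg_emb:
        v = np.asarray(v, dtype=float)
        v = v / np.linalg.norm(v)
        sims = [float(v @ r) for r in reps]
        if sims and max(sims) >= min_sim:
            labels.append(int(np.argmax(sims)))
        else:
            reps.append(v)
            labels.append(len(reps) - 1)
    return labels

# Toy stand-in for SBERT output: topic A, then topic B, then A again.
A, B = [1.0, 0.0], [0.0, 1.0]
toy = np.array([A] * 4 + [B] * 4 + [A] * 4)

spans = segment(toy)
seg_emb = [toy[a:b].mean(axis=0) for a, b in spans]
labels = cluster_segments(seg_emb)
print(spans, labels)
```

On the toy input, the first and third segments share a cluster label, which mirrors the abstract's point that a topic revisited later in a document should be represented only once.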
Title of host publication: International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications, ACDSA 2024
ISBN (electronic): 9798350394528
Publication status: Published - March 2024




  • document segmentation
  • sentence embeddings
  • topic modeling

