TY - GEN
T1 - Document Segmentation for Topic Modelling with Sentence Embeddings
AU - Stöckl, Andreas
AU - Ibrovic, Ernes
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/3
Y1 - 2024/3
N2 - Topic modeling, a powerful technique to extract hidden structures from voluminous documents, encounters challenges with large documents encompassing myriad topics. Traditional segmentation methods struggle to identify precise points of topic shifts. This paper presents an innovative approach to document segmentation that leverages embeddings and vector distances to discern topic transitions. Building upon the TextTiling method, which primarily relies on patterns of lexical distribution, our method enriches the segmentation process by incorporating syntactic structures and rhetorical cues. Our method utilizes Sentence-BERT (SBERT) to generate embeddings for individual sentences. Through this, our approach compares embeddings of consecutive sentences to identify coherence and determine segmentation points governed by a tunable parameter, n. The computed similarities, termed 'gap scores,' undergo a smoothing process to counteract noise, improving the accuracy of segment identification. Furthermore, recognizing that longer documents may revisit similar topics at different intervals, our method introduces a clustering mechanism. This groups segments of analogous content, ensuring each topic's unique representation in the document, mitigating redundancy, and enhancing topic modeling results. Our approach delivers a robust, comprehensive, and efficient method for document segmentation tailored for advanced topic modeling applications.
KW - Deep learning
KW - Technological innovation
KW - Smoothing methods
KW - Semantics
KW - Computer architecture
KW - Syntactics
KW - Transformers
KW - document segmentation
KW - topic modeling
KW - sentence embeddings
UR - http://www.scopus.com/inward/record.url?scp=85189935640&partnerID=8YFLogxK
U2 - 10.1109/ACDSA59508.2024.10467643
DO - 10.1109/ACDSA59508.2024.10467643
M3 - Conference contribution
T3 - International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications, ACDSA 2024
SP - 1
EP - 5
BT - International Conference on Artificial Intelligence, Computer, Data Sciences, and Applications, ACDSA 2024
ER -