Disambiguating Unknown Entities in News Articles

  • Jakob Sebastian Lammel

    Student thesis: Master's Thesis

    Abstract

    Entity Linking (EL) is a well-known technique that connects mentions of entities in
    documents to entries in a knowledge graph. However, numerous mentions have no corresponding entry, and many natural language processing
    applications overlook them despite the valuable information they contain. This issue is particularly prevalent in the news domain, where many mentions lack knowledge graph entries because
    the entities are either not considered important enough or have only just emerged.
    Existing approaches tend to address both linked and unlinked mentions simultaneously, which prevents the full utilization of modern EL techniques. This thesis presents
    a novel method that focuses exclusively on disambiguating unlinked mentions in news
    articles.
    The proposed approach employs an agglomerative clustering algorithm to group
    these mentions. The necessary similarities for clustering are derived from three factors:
    contextual semantic similarity, similarity of entities present in the document, as well as
    textual similarity of the mention.
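    The clustering step described above can be illustrated with a minimal sketch. It is
    not the thesis implementation: the weights, the merge threshold, the average-linkage
    rule, and the choice of cosine, Jaccard, and difflib ratio as the three similarity
    measures are all assumptions made for illustration, and the mentions are made-up toy data.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations
from math import sqrt


@dataclass(frozen=True)
class Mention:
    text: str                    # surface form of the unlinked mention
    context: tuple[float, ...]   # context embedding (toy values here)
    entities: frozenset[str]     # linked entities found in the same document


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Overlap of two entity sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(m1, m2, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of the three factors; the weights are assumed, not from the thesis."""
    w_ctx, w_ent, w_txt = weights
    textual = SequenceMatcher(None, m1.text.lower(), m2.text.lower()).ratio()
    return (w_ctx * cosine(m1.context, m2.context)
            + w_ent * jaccard(m1.entities, m2.entities)
            + w_txt * textual)


def agglomerate(mentions, threshold=0.6):
    """Average-linkage agglomerative clustering with an assumed stop threshold."""
    clusters = [[m] for m in mentions]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            score = sum(similarity(a, b) for a, b in pairs) / len(pairs)
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # no remaining pair of clusters is similar enough to merge
        i, j = best_pair  # i < j, so popping j leaves index i valid
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical mentions: two refer to the same person, one to a business.
mentions = [
    Mention("Anna Huber", (0.9, 0.1, 0.0), frozenset({"Linz", "SPÖ"})),
    Mention("A. Huber", (0.8, 0.2, 0.1), frozenset({"Linz", "SPÖ"})),
    Mention("Huber Bakery", (0.0, 0.2, 0.9), frozenset({"Vienna"})),
]

for cluster in agglomerate(mentions):
    print([m.text for m in cluster])
```

    With these toy values the two person mentions end up in one cluster and the bakery
    stays separate. The pairwise scan is quadratic in the number of clusters per merge
    step, so a sketch like this is only suitable for small inputs.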
    To ensure practical applicability, this thesis further explores and evaluates the use
    of modern database systems, specifically vector and graph databases. The approach
    is tested on a dataset consisting of one million news articles from both German and
    English news outlets.
    Date of Award: 2024
    Original language: English (American)
    Supervisor: Erik Pitzer
