Machine-Learning-basierte Klassifizierung von Phishing-Webseiten (Machine-Learning-Based Classification of Phishing Websites)

  • Gerhard Kasess

    Student thesis: Master's Thesis

    Abstract

    The master’s thesis deals with the automated detection of phishing websites using machine learning methods. The aim of the thesis is to develop a deep learning model
    that precisely classifies websites based on their characteristics in order to minimise the
    security risks posed by phishing.
    Previous research offers various approaches for recognising phishing websites,
    such as list-based methods, classification based on the URL and the analysis of features
    from websites and external sources. Particularly promising are content-based approaches
    that take into account not only the URL but also the characteristics of the website itself.
    This master’s thesis pursues such a content-based approach and combines classification
    based on the URL, the structure of the Document Object Model (DOM) and the text
    content of the website. This enables a comprehensive evaluation of both the structural
    and semantic properties of the web pages.
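    The three feature categories named above (URL, DOM structure, text content) can be
    illustrated with a minimal sketch. It uses only the Python standard library; the
    class name DomTextExtractor, the function extract_features, and the choice to
    represent the DOM as a flat sequence of opening tags are assumptions made for
    illustration, not the feature extraction used in the thesis.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class DomTextExtractor(HTMLParser):
    """Collects the sequence of opening tags and the visible text of a page.

    Hypothetical helper: the thesis does not describe its extraction code.
    """

    SKIPPED = {"script", "style"}  # their contents are not visible text

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <script>/<style>.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_features(url, html):
    """Return one string per feature category: URL, DOM structure, text."""
    parser = DomTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "url": url,
        "host": urlsplit(url).hostname or "",
        "dom": " ".join(parser.tags),
        "text": " ".join(parser.text_parts),
    }


if __name__ == "__main__":
    page = "<html><body><form><input type='password'></form><p>Sign in</p></body></html>"
    print(extract_features("https://example.com/login", page))
```

    Each of the three strings could then be tokenised separately, so that a model
    receives one token sequence per feature category.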
    Public datasets containing both the URL and the HTML code of the web pages
    are used to train the model. This data is pre-processed and the selected features are
    extracted. The TikToken tokeniser is used for tokenisation. A deep learning model consisting of a convolutional neural network (CNN) and a Long Short-Term Memory (LSTM) network is then trained.
    These network architectures generate a vector for each feature category. The vectors are then
    combined in a neural network for classification.
    The developed model achieves a classification accuracy of 97.6 % on the test dataset.
    In addition, an evaluation dataset comprising current websites from 2023 and 2024 as well
    as internal websites is created to test the model for productive use.
    On this evaluation dataset, the model achieves an accuracy of 83.1 %. By adjusting
    the threshold, this accuracy can be further increased at the expense of classifiability,
    making the model suitable for productive use.
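    The abstract does not state how the threshold is applied. One plausible reading is a
    reject option: the model only commits to a label when its score is confident enough,
    and leaves the remaining websites unclassified. The sketch below illustrates that
    trade-off between accuracy and the share of classified websites. The function names,
    the symmetric threshold, and the example scores are assumptions for illustration and
    are not results from the thesis.

```python
def classify_with_rejection(p_phishing, threshold=0.9):
    """Return 'phishing', 'benign', or None if the score is not confident enough.

    p_phishing is an assumed model output in [0, 1]; threshold lies in [0.5, 1.0].
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.5, 1.0]")
    if p_phishing >= threshold:
        return "phishing"
    if p_phishing <= 1.0 - threshold:
        return "benign"
    return None  # left unclassified


def accuracy_and_coverage(scores, labels, threshold):
    """Accuracy over the classified samples and the fraction that was classified."""
    decisions = [classify_with_rejection(s, threshold) for s in scores]
    decided = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(decided) / len(scores) if scores else 0.0
    accuracy = sum(d == y for d, y in decided) / len(decided) if decided else 0.0
    return accuracy, coverage


if __name__ == "__main__":
    # Made-up scores and labels, purely for illustration.
    scores = [0.95, 0.6, 0.4, 0.05]
    labels = ["phishing", "benign", "phishing", "benign"]
    for t in (0.5, 0.9):
        print(t, accuracy_and_coverage(scores, labels, t))
```

    With a stricter threshold, the uncertain samples are excluded, so accuracy on the
    remaining samples can rise while fewer websites receive a label at all.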
    The thesis shows that the combination of structural and semantic features of the
    web pages with deep learning models can lead to an improvement in the detection of
    phishing web pages. The results contribute to the development of robust and efficient
    tools for phishing detection and provide a basis for future research and applications in
    this area.
    Date of Award: 2024
    Original language: German (Austria)
    Supervisor: Eckehard Hermann
