Machine Learning-Based Classification of Phishing Websites (original title: Machine-Learning-basierte Klassifizierung von Phishing-Webseiten)

  • Gerhard Kasess

Student thesis: Master's Thesis

Abstract

This master’s thesis addresses the automated detection of phishing websites using
machine learning methods. Its aim is to develop a deep learning model
that precisely classifies websites based on their characteristics in order to minimise the
security risks posed by phishing.
Previous research offers various approaches to recognising phishing websites,
such as list-based methods, classification based on the URL and the analysis of features
from websites and external sources. Particularly promising are content-based approaches
that take into account not only the URL but also the characteristics of the website itself.
This master’s thesis pursues such a content-based approach and combines classification
based on the URL, the structure of the Document Object Model (DOM) and the text
content of the website. This enables a comprehensive evaluation of both the structural
and semantic properties of the web pages.
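
The three feature categories named above (URL, DOM structure and text content) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the thesis's actual pre-processing: the names `PageViews` and `extract_views`, the simple URL splitting rule and the choice to skip `script` and `style` content are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class PageViews(HTMLParser):
    """Collects the sequence of opening tags and the visible text of a page."""

    SKIPPED = {"script", "style"}  # assumption: their content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []         # DOM structure as a flat tag sequence
        self.text_parts = []   # visible text fragments
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_views(url, html):
    """Return the URL tokens, the DOM tag sequence and the text of one page."""
    parts = urlsplit(url)
    # Hypothetical rule: split host and path on any non-alphanumeric character.
    url_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", parts.netloc + parts.path) if t]
    parser = PageViews()
    parser.feed(html)
    parser.close()
    return {
        "url": url_tokens,
        "dom": parser.tags,
        "text": " ".join(parser.text_parts),
    }


if __name__ == "__main__":
    page = "<html><body><form><input type='password'></form><p>Log in now</p></body></html>"
    print(extract_views("http://login.example.com/secure/update", page))
```

Each of the three resulting views would then be tokenised and passed to its own sub-network; that step is outside the scope of this sketch.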
Public datasets containing both the URL and the HTML code of the web pages
are used to train the model. This data is pre-processed and the selected features are
extracted. The TikToken tokeniser is used for tokenisation. A deep learning model
consisting of a convolutional neural network (CNN) and a Long Short-Term Memory
(LSTM) network is then trained. These networks generate a vector for each feature
category; the vectors are then combined in a neural network for classification.
The developed model achieves a classification accuracy of 97.6 % on the test dataset.
In addition, an evaluation dataset is created to test the model for production use; it
includes current websites from the years 2023 and 2024 as well as internal websites.
On this evaluation dataset, the model achieves an accuracy of 83.1 %. By adjusting
the threshold, this accuracy can be further increased at the expense of classifiability,
making the model suitable for production use.
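
The trade-off between accuracy and classifiability can be sketched as follows. The abstract does not say how the threshold is applied, so this sketch assumes a symmetric confidence threshold on the predicted phishing probability, with pages between the two cut-offs left unclassified. The function names and the scores in the usage example are invented for illustration and are not results from the thesis.

```python
def classify_with_rejection(p_phishing, threshold):
    """Return 'phishing', 'legitimate', or None when the model is not confident enough.

    Assumption: `threshold` lies in [0.5, 1.0] and is applied symmetrically
    to both classes.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.5, 1.0]")
    if p_phishing >= threshold:
        return "phishing"
    if p_phishing <= 1.0 - threshold:
        return "legitimate"
    return None  # page stays unclassified


def accuracy_and_coverage(probs, labels, threshold):
    """Accuracy over the classified pages and the share of pages classified."""
    decided = 0
    correct = 0
    for p, label in zip(probs, labels):
        decision = classify_with_rejection(p, threshold)
        if decision is None:
            continue
        decided += 1
        if decision == label:
            correct += 1
    accuracy = correct / decided if decided else None
    coverage = decided / len(probs) if probs else 0.0
    return accuracy, coverage


if __name__ == "__main__":
    # Made-up scores and labels, for illustration only.
    probs = [0.97, 0.55, 0.03, 0.40, 0.91, 0.05]
    labels = ["phishing", "legitimate", "legitimate",
              "phishing", "phishing", "legitimate"]
    for t in (0.5, 0.9):
        print(t, accuracy_and_coverage(probs, labels, t))
```

With these illustrative scores, raising the threshold from 0.5 to 0.9 removes the two uncertain pages from the decision set: accuracy on the remaining pages rises while the share of classified pages falls.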
The thesis shows that the combination of structural and semantic features of the
web pages with deep learning models can lead to an improvement in the detection of
phishing web pages. The results contribute to the development of robust and efficient
tools for phishing detection and provide a basis for future research and applications in
this area.
Date of Award: 2024
Original language: German (Austria)
Supervisor: Eckehard Hermann
