Machine Learning in Online Marketplaces: Application of Product Categorization, Condition Classification and Price Prediction

  • Philip Kräutl

Student thesis: Master's Thesis

Abstract

This thesis investigates the application of machine learning (ML) to improve user support in the creation of listings in an online flea market application. The aim was to find
out what performance can be achieved when classifying products into categories and
conditions and when predicting the price of a product. The dataset was collected
by web scraping multiple popular online marketplaces and totals 607,455 product
samples. The categories and conditions from the online marketplaces were mapped to
self-defined categories and conditions to provide consistency in the data. In the category
classification, the product title was preprocessed and features were extracted using various natural language processing (NLP) techniques. The approach was to train a model
that classifies the products into 11 base categories. For each base category, a separate
model was trained to classify the subcategories within the base category, totaling 72
subcategories. Multiple classification algorithms were explored and the best ones were
chosen. The models were then evaluated and an overall accuracy of 72% was achieved.
For the condition classification, the title and description were combined to form a unified
text. This combined text was then preprocessed and features were extracted using NLP
techniques. Multiple classification algorithms were explored and a model was trained
that is able to classify the products into five conditions. An overall accuracy of 92% was
achieved for this task. However, it was also found that the model could benefit from a
more balanced dataset. The price prediction uses the title, category and condition of
the product to predict the minimum and maximum price of the product. The title was
preprocessed and features were extracted in the same way as for the category
classification. Due to the high price variance, outliers were removed. The category and
condition were transformed into numeric values using encoding techniques. Cosine similarity was used to form groups, and the minimum and maximum prices were calculated within
these groups. Multiple regression algorithms were explored and models were trained for
both the minimum and maximum price predictions. An accuracy of 83% was achieved,
measured as the share of actual prices lying within the predicted minimum and maximum prices. It was
found that the price ranges were often too high to be useful in a production environment.
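
The grouping and range-based accuracy described for the price prediction can be illustrated with a minimal sketch. It is not the implementation from the thesis: the whitespace bag-of-words features, the similarity threshold of 0.5, the function names and the example listings are all assumptions chosen for illustration, and the regression models trained in the thesis are left out.

```python
import math
from collections import Counter


def bag_of_words(title):
    """Lowercased, whitespace-tokenized term counts (assumed stand-in for the NLP features)."""
    return Counter(title.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def price_range(query_title, listings, threshold=0.5):
    """Min and max price over listings similar to the query title, or None if no listing is similar.

    The threshold is a hypothetical value, not taken from the thesis.
    """
    query = bag_of_words(query_title)
    prices = [
        price
        for title, price in listings
        if cosine_similarity(query, bag_of_words(title)) >= threshold
    ]
    if not prices:
        return None
    return min(prices), max(prices)


def range_accuracy(actual_prices, predicted_ranges):
    """Share of items whose actual price lies within the predicted [min, max] range."""
    hits = sum(
        1
        for price, (low, high) in zip(actual_prices, predicted_ranges)
        if low <= price <= high
    )
    return hits / len(actual_prices)


# Made-up example listings as (title, price) pairs.
listings = [
    ("apple iphone 12 64gb", 300.0),
    ("apple iphone 12 128gb", 350.0),
    ("wooden dining table", 80.0),
]

print(price_range("iphone 12 black", listings))
print(range_accuracy([320.0, 400.0], [(300.0, 350.0), (300.0, 350.0)]))
```

A range that spans a loosely matched group can cover most actual prices and still be too broad to help a seller, which is consistent with the limitation noted at the end of the abstract.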
Date of Award: 2024
Original language: English (American)
Supervisor: Marc Kurz
