Machine Learning in Online Marketplaces: Application of Product Categorization, Condition Classification and Price Prediction

  • Philip Kräutl

    Student thesis: Master's Thesis

    Abstract

    This thesis investigates the application of machine learning (ML) to improve user support in the creation of listings in an online flea market application. The aim was to determine
    what performance can be achieved when classifying products into categories and
    conditions and when predicting the price of a product. The dataset was collected
    by web scraping multiple popular online marketplaces and comprises 607,455 product
    samples. The categories and conditions from the online marketplaces were mapped to
    self-defined categories and conditions to provide consistency in the data. For category
    classification, the product title was preprocessed and features were extracted using various natural language processing (NLP) techniques. The approach was to train a model
    that classifies the products into 11 base categories. For each base category, a separate
    model was trained to classify the subcategories within that base category, for a total of 72
    subcategories. Multiple classification algorithms were explored and the best-performing ones were
    selected. The models were then evaluated and an overall accuracy of 72% was achieved.
    For condition classification, the title and description were combined to form a unified
    text. This combined text was then preprocessed and features were extracted using NLP
    techniques. Multiple classification algorithms were explored and a model was trained
    that classifies the products into five conditions. An overall accuracy of 92% was
    achieved for this task. However, it was also found that the model could benefit from a
    more balanced dataset. The price prediction uses the title, category and condition of
    the product to predict the minimum and maximum price of the product. The title was
    preprocessed and features were extracted in the same way as for the category
    classification. Due to the high price variance, outliers were removed. The category and
    condition were transformed into numeric values using encoding techniques. Cosine similarity was used to form groups, and the minimum and maximum prices were calculated within
    these groups. Multiple regression algorithms were explored and models were trained for
    both the minimum and maximum price predictions. An accuracy of 83% was achieved,
    measured as the share of actual prices lying within the predicted minimum and maximum prices. It was
    found that the predicted price ranges were often too wide to be useful in a production environment.
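
The grouping step and the range-based accuracy described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not state how titles were vectorized, which similarity threshold was used, or how groups were formed, so the bag-of-words features, the `threshold` value, the sample listings, and all function names below are assumptions made purely for illustration. The regression models trained in the thesis are also not reproduced here.

```python
import math
from collections import Counter


def bag_of_words(title):
    """Term counts over lowercase whitespace tokens (an assumed stand-in
    for the NLP features used in the thesis)."""
    return Counter(title.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def price_range(query_title, listings, threshold=0.5):
    """Return (min, max) price over listings whose title is similar to the
    query, or None if no listing reaches the (assumed) threshold."""
    query = bag_of_words(query_title)
    prices = [
        price
        for title, price in listings
        if cosine_similarity(query, bag_of_words(title)) >= threshold
    ]
    if not prices:
        return None
    return min(prices), max(prices)


def coverage(ranges_and_actuals):
    """Share of items whose actual price lies inside its [min, max] range."""
    hits = sum(
        1 for (low, high), actual in ranges_and_actuals if low <= actual <= high
    )
    return hits / len(ranges_and_actuals)


# Hypothetical listings: (title, price)
listings = [
    ("apple iphone 12 64gb black", 350.0),
    ("apple iphone 12 128gb white", 420.0),
    ("wooden dining table oak", 120.0),
]

estimated = price_range("iphone 12 64gb", listings)
```

With these made-up listings, the two phone titles fall into the same similarity group and the table does not. The `coverage` function mirrors the accuracy notion in the abstract: a prediction counts as correct when the actual price lies between the predicted minimum and maximum, which also shows why very wide ranges can score well while being of little practical use.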
    Date of Award: 2024
    Original language: English (American)
    Supervisor: Marc Kurz
