A Distributed ETL Framework Using an Actor System

  • Marcel Hasieber

    Student thesis: Master's Thesis

    Abstract

    Extract, Transform and Load (ETL) is a well-known technique for extracting data
    from a source, transforming it and loading it into a target system. It is commonly
    used for data warehouses and in Big Data applications. A wide range of software
    systems covers exactly this use case and excels at the performance metrics relevant
    to such applications. However, data integration is not only of interest when Big
    Data is involved; it also matters for applications where data volumes remain
    manageable and performance is not of the utmost importance.
    This thesis builds upon actor system building blocks, in particular Akka Streams,
    to construct an ETL framework that uses asynchronous message passing as its main
    communication bus. It uses a serializable data model for ETL configurations with a
    domain-specific language for parameterization, which can be transported over the
    network and allows the separation of definition and execution. The system supports
    distributed execution via actors, employs fault tolerance mechanisms and implements
    a custom protocol for stateful actor communication.
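    To make this architecture concrete, the following is a minimal sketch of such a
    pipeline in Scala, assuming Akka 2.6-style APIs; the in-memory source, the CSV-like
    transform and the stdout sink are hypothetical stand-ins for the framework's actual
    connectors.

        import akka.Done
        import akka.actor.ActorSystem
        import akka.stream.scaladsl.{Flow, Sink, Source}

        import scala.concurrent.Future

        object EtlPipelineSketch extends App {
          implicit val system: ActorSystem = ActorSystem("etl")

          // Extract: a source of raw rows (an in-memory stand-in for a real connector).
          val extract: Source[String, _] = Source(List("1,alice", "2,bob"))

          // Transform: parse each row into an (id, name) pair.
          val transform: Flow[String, (Int, String), _] =
            Flow[String].map { line =>
              val Array(id, name) = line.split(',')
              (id.toInt, name)
            }

          // Load: emit each element to the target system (stdout stands in here).
          val load: Sink[(Int, String), Future[Done]] =
            Sink.foreach(row => println(s"loaded $row"))

          // Stages communicate via asynchronous message passing with backpressure.
          extract.via(transform).runWith(load)
            .onComplete(_ => system.terminate())(system.dispatcher)
        }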
    The effect of the serialization mechanism on overall performance in Akka is
    empirically highlighted by comparing Java Serialization, Jackson, Kryo and Protobuf
    across seven proposed serialization classes. Protobuf demonstrates the best
    performance for most serialization classes. Kryo proves to be the most flexible in
    that it is schemaless, while providing slightly slower performance than Protobuf.
    Java Serialization is significantly outperformed in most serialization classes.
    However, it is shown that Java Serialization, Jackson and Kryo each have specific
    serialization classes in which they perform exceptionally well.
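    In Akka, the serializer used for each message family is configuration-driven, which
    is what makes such a comparison practical. The following Scala sketch shows one
    possible wiring for three of the compared serializers; the marker trait names under
    com.example.etl are hypothetical, and the Kryo entry assumes the community
    akka-kryo-serialization library.

        import akka.actor.ActorSystem
        import com.typesafe.config.ConfigFactory

        object SerializerBindingSketch extends App {
          // Bind each (hypothetical) message marker trait to a serializer via Akka's
          // standard serializers / serialization-bindings configuration.
          val config = ConfigFactory.parseString(
            """
            akka.actor {
              serializers {
                jackson-json = "akka.serialization.jackson.JacksonJsonSerializer"
                kryo = "io.altoo.akka.serialization.kryo.KryoSerializer"
                proto = "akka.remote.serialization.ProtobufSerializer"
              }
              serialization-bindings {
                "com.example.etl.JsonMessage" = jackson-json
                "com.example.etl.KryoMessage" = kryo
                "com.google.protobuf.Message" = proto
              }
            }
            """)

          val system = ActorSystem("etl", config.withFallback(ConfigFactory.load()))
          system.terminate()
        }

    Java Serialization needs no binding of its own, but in recent Akka versions it is
    disabled by default and must be re-enabled via akka.actor.allow-java-serialization.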
    It is demonstrated that using asynchronous message passing as the main
    communication bus in an ETL framework exhibits favorable performance
    characteristics for traditional row-based data up to 10⁵ elements. Compared to the
    Big Data oriented stream processing engine Apache Flink, however, it is not
    adequate for significantly larger data volumes, as the desirable scaling behavior
    between input size and runtime is not observed.
    Date of Award: 2024
    Original language: English (American)
    Supervisors: Erik Pitzer & Dominik Wachholder
