A Distributed ETL Framework Using an Actor System

  • Marcel Hasieber

Student thesis: Master's Thesis

Abstract

Extract, Transform and Load (ETL) is a well known technique to extract data from
a source, transform it and load it into some target system. It is commonly used for
data warehouses and in Big Data applications. A wide range of software systems exist
that cover exactly that use case and excel in performance metrics for said applications.
However, data integration is not only of interest when Big Data is involved. It is also
of interest for applications where data is constrained to manageable dimensions and
performance is not of the utmost importance.
This thesis builds upon actor system building blocks and especially Akka Streams
to build an ETL framework that uses asynchronous message passing as main communication bus. It uses a serializable data model for ETL configurations with a domain
specific language for parameterization that can be transported over the network and allows separation of definition and execution. The system supports distributed execution
via actors, employs failure tolerance mechanisms and implements a custom protocol for
stateful actor communication.
The effect of the serialization mechanism on overall performance in Akka is empirically highlighted by comparing Java Serialization, Jackson, Kryo and Protobuf for seven
proposed serialization classes. Protobuf demonstrates the best performance for the most
serialization classes. Kryo proves to be the most flexible in that it is schemaless while
providing slightly slower performance as Protobuf. Java Serialization is significantly
outperformed in most serialization classes. However, it is shown that Java Serialization,
Jackson and Kryo have specific serialization classes where they perform exceptionally
well.
It is demonstrated that using asynchronous message passing as main communication
bus in an ETL framework exhibits favorable performance characteristics for traditional
row based data for up to 105
elements, but is not adequate for significantly larger data
volumes, compared to the Big Data oriented stream processing engine Apache Flink as
a desirable scaling behavior between input size and runtime is not observed.
Date of Award2024
Original languageEnglish (American)
SupervisorErik Pitzer (Supervisor) & Dominik Wachholder (Supervisor)

Cite this

'