Long-term Storage for Mass Data

  • René Stadler

    Student thesis: Master's Thesis

    Abstract

    Dynatrace is developing a new data store for persisting information monitored by
    their platform. One key kind of data is entities. Information about entities is received
    periodically, confirming that the monitored entity is still available. The lifetime of most of
    these entities is limited, and hence the set of actively monitored entities changes over time.
    To limit the amount of data that has to be stored locally and to reduce the
    number of actively maintained entities, the data has to be separated into an active and
    an archived dataset.
    The Dynatrace store deals with this change by archiving information about entities
    that are no longer observed in an archive datastore outside of the active dataset. While
    most entities remain in this unobserved state for the remainder of their storage lifetime,
    some entities are reobserved at later points in time (e.g. a server that has been shut
    down to save power and is restarted a few days later when load demands increase).
    In these cases, the previously archived data needs to be reintegrated into the active
    dataset. This process is internally referred to as “resurrecting the entity”. This thesis
    addresses the management of the archival and resurrection workflows.
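    The archival and resurrection workflows described above can be sketched as moves between two datasets. This is a minimal illustration only; the `EntityStore` class, its dictionary-backed datasets, and the method names are hypothetical stand-ins, not the thesis's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class EntityStore:
    """Hypothetical in-memory stand-ins for the active and archived datasets."""
    active: dict = field(default_factory=dict)
    archive: dict = field(default_factory=dict)

    def archive_entity(self, entity_id: str) -> None:
        """Move an entity that is no longer observed out of the active dataset."""
        self.archive[entity_id] = self.active.pop(entity_id)

    def resurrect_entity(self, entity_id: str) -> None:
        """Reintegrate a previously archived entity into the active dataset."""
        self.active[entity_id] = self.archive.pop(entity_id)
```

    Because an entity lives in exactly one of the two datasets at any time, each dataset can be queried independently, as the abstract notes below.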
    Once a monitored entity is archived, its prior existence needs to be recorded. When
    a new entity arrives that is not part of the active dataset, it is necessary to determine
    whether it is a completely new or a previously known entity. This check needs to be
    fast and cheap (from a resource perspective). The complete data in the store
    has to be available at all times; therefore, a fast way to access the archived dataset has
    to be provided. Each entity is available in either the active or the archived dataset,
    which has the advantage that both datastores can be queried independently. Single
    entities can be deleted from the active and the archived dataset at any time. This is
    necessary to remove data that is no longer needed and to delete data at the request
    of a store user.
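    The abstract does not name a technique for the fast and cheap prior-existence check; one common approach for this kind of probabilistic membership test is a Bloom filter, sketched below. A negative answer is definitive, while a positive answer may be a false positive and must be confirmed against the archive datastore. All names and parameters here are illustrative assumptions.

```python
import hashlib


class ExistenceFilter:
    """A tiny Bloom filter over entity IDs (illustrative sketch, not the thesis design)."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into one int for simplicity

    def _positions(self, entity_id: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{entity_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, entity_id: str) -> None:
        """Record that this entity has existed before (e.g. on archival)."""
        for pos in self._positions(entity_id):
            self.bits |= 1 << pos

    def might_contain(self, entity_id: str) -> bool:
        """False means definitely unknown; True means possibly known (check archive)."""
        return all(self.bits >> pos & 1 for pos in self._positions(entity_id))
```

    A check like this avoids touching the archive datastore at all for entities that are genuinely new, which is the common case.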
    The goal of this thesis is to evaluate different approaches for the archive data storage.
    An important criterion is that all data in a specific timeframe can be queried quickly
    and can also be deleted at the entity level. Access to a single entity should also be
    fast, so that the resurrection process completes in a timely manner. An additional
    requirement is that the entries can be separated at the customer level, so that access
    across multiple customers is not permitted.
    The solution should be cloud-agnostic, so that it can run in any cloud, avoiding
    binding to a single vendor and allowing deployment in different environments.
    Against these criteria, multiple data stores such as Redis, Cassandra, and Kafka will
    be compared and evaluated.
    Date of Award: 2024
    Original language: English (American)
    Supervisors: Johann Heinzelreiter & Christian Gesswagner
