Long-term Storage for Mass Data

  • René Stadler

Student thesis: Master's Thesis

Abstract

Dynatrace is developing a new data store for persisting information monitored by
its platform. One key type of data is entities. Information about entities is received
periodically, confirming that the monitored entity is still available. The lifetime of most of those
entities is limited, and hence the set of actively monitored entities changes over time.
To limit the amount of data that has to be stored locally and to reduce the
number of actively maintained entities, the data has to be separated into an active and
an archived dataset.
Dynatrace storage deals with this change by archiving information about these no
longer observed entities into an archive datastore outside of the active dataset. While
most entities remain in this unobserved state for the remainder of their storage lifetime,
some entities are observed again at a later point in time (e.g. a server that has been shut
down to save power and is restarted a few days later when load increases).
In these cases, the previously archived data needs to be reintegrated into the active
dataset. This process is internally referred to as “resurrecting the entity”. This thesis
addresses the management of the archival and resurrection workflows.
Once a monitored entity is archived, its prior existence needs to be recorded. When
an entity arrives that is not part of the active dataset, it is necessary to determine
whether it is completely new or previously known. This check needs to be fast and
cheap in terms of resources. The complete data in the store
has to be available at all times, so fast access to the archived dataset has
to be provided. Data is available either in the active or in the archived dataset, which
has the advantage that both datastores can be queried independently. Single entities can
be deleted from the active and the archived dataset at any time. This is necessary to
remove data that is no longer needed and to delete data at the request of a store
user.
The goal of this thesis is to evaluate different approaches for the archive data storage.
An important criterion is that all data in a specific timeframe can be queried quickly
and can also be deleted at the entity level. Access to a single entity should also be
fast, so that the resurrection process can be completed in a timely manner. An additional
requirement is that the entries can be separated at the customer level, so that access
across multiple customers is not possible.
The solution should be cloud-agnostic, so that it can be used in any cloud, avoids
binding to a single vendor, and can be deployed in different environments.
Against these criteria, multiple data stores such as Redis, Cassandra, and Kafka are compared
and evaluated.
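
The archival and resurrection workflow described above can be illustrated with a minimal in-memory sketch. It is not the thesis's implementation: the class `EntityStore`, its methods, and the `(customer_id, entity_id)` key layout are hypothetical names chosen for illustration, and plain dictionaries stand in for the active and archived datastores.

```python
from dataclasses import dataclass, field


@dataclass
class EntityStore:
    """Toy model of an active and an archived dataset (hypothetical names)."""

    # Both datasets are keyed by (customer_id, entity_id), so a lookup for
    # one customer can never return an entity of another customer.
    active: dict = field(default_factory=dict)
    archived: dict = field(default_factory=dict)

    def archive(self, customer_id, entity_id):
        """Move an entity that is no longer observed into the archive."""
        key = (customer_id, entity_id)
        self.archived[key] = self.active.pop(key)

    def observe(self, customer_id, entity_id, data):
        """Handle an incoming observation and report how it was classified."""
        key = (customer_id, entity_id)
        if key in self.active:
            self.active[key].update(data)
            return "active"
        if key in self.archived:
            # Resurrection: reintegrate the archived data into the active dataset.
            restored = self.archived.pop(key)
            restored.update(data)
            self.active[key] = restored
            return "resurrected"
        self.active[key] = dict(data)
        return "new"

    def delete(self, customer_id, entity_id):
        """Delete a single entity from whichever dataset currently holds it."""
        key = (customer_id, entity_id)
        self.active.pop(key, None)
        self.archived.pop(key, None)


store = EntityStore()
store.observe("customer-a", "host-1", {"state": "up"})
store.archive("customer-a", "host-1")
result = store.observe("customer-a", "host-1", {"load": 3})
```

In this sketch an entity lives in exactly one of the two dictionaries at any time, the existence check is a single key lookup, and deletion works per entity in either dataset. A real archive store would replace the dictionaries with one of the evaluated data stores.
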
Date of Award: 2024
Original language: English (American)
Supervisor: Johann Heinzelreiter (Supervisor) & Christian Gesswagner (Supervisor)
