Versioning Scheme

Jan Ehmueller edited this page Jul 27, 2017 · 3 revisions

Overview

Entries in the subject table are versioned on a per-attribute basis. This makes it possible to

  • selectively revert changes (to single attributes, entities, or the whole table) to any previous value
  • export only data from certain data sources
  • find the program responsible for errors (easier debugging)
  • manually edit values and protect them from being automatically overwritten
  • define validity parameters (e.g. time duration of validity) for single attributes and relations

Every field in subject has a corresponding history field (e.g. name and name_history).
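
As a rough illustration of the field/history pairing, the following Python sketch keeps a subject as a plain dictionary and appends every change to the matching `_history` field. The helper names `set_attribute` and `revert_attribute`, the dictionary layout of a history entry, and the program name `manual_revert` are assumptions made for this example; they are not taken from the actual implementation.

```python
from typing import Any, Dict, List


def set_attribute(subject: Dict[str, Any], field: str, value: Any, program: str) -> None:
    """Write a new value and record it in the matching history field."""
    history: List[dict] = subject.setdefault(f"{field}_history", [])
    history.append({"value": value, "program": program})
    subject[field] = value


def revert_attribute(subject: Dict[str, Any], field: str, steps: int = 1) -> None:
    """Restore the value the field had `steps` changes ago."""
    history = subject[f"{field}_history"]
    if steps >= len(history):
        raise ValueError("not enough history to revert that far")
    previous = history[-1 - steps]["value"]
    # The revert is itself recorded as a change, so the history stays append-only.
    set_attribute(subject, field, previous, program="manual_revert")


# Hypothetical usage: two programs write `name`, then the last change is undone.
subject: Dict[str, Any] = {}
set_attribute(subject, "name", "ACME", program="import_a")
set_attribute(subject, "name", "ACME Corp.", program="import_b")
revert_attribute(subject, "name")
```

Recording the revert as one more history entry (instead of deleting entries) is a design choice of this sketch: it keeps the information about which program wrote which value, which is what makes the debugging use case above possible.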

Data structure (Cassandra UDT version)

The type version is the core data structure for the history fields. Each instance represents a change made to the datalake by a single program. It contains the value of that change as well as some metadata:

  • the version ID (the same across all changes of a single version)
  • validity data (e.g. time duration of validity)
  • data sources used in this step
  • timestamp of the change
  • program that modified this attribute
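
The fields listed above could be modelled roughly as follows. This is a minimal Python sketch, not the actual Cassandra UDT definition: the field names, their types, the helper `latest_from_sources`, and the program and data source names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Version:
    version: UUID                 # same ID for all changes of a single version
    program: str                  # program that modified the attribute
    value: List[str]              # value written by this change
    validity: Dict[str, str] = field(default_factory=dict)    # e.g. duration of validity
    datasources: List[str] = field(default_factory=list)      # data sources used in this step
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def latest_from_sources(history: List[Version], allowed: List[str]) -> Optional[Version]:
    """Return the newest history entry that used at least one allowed data source."""
    for entry in reversed(history):   # assumes the history is ordered oldest to newest
        if set(entry.datasources) & set(allowed):
            return entry
    return None


# Hypothetical history of a `name` attribute written by two different runs.
name_history = [
    Version(uuid4(), "import_a", ["ACME"], datasources=["source_a"]),
    Version(uuid4(), "import_b", ["ACME Corp."], datasources=["source_b"]),
]
exported = latest_from_sources(name_history, ["source_a"])
```

Because every entry carries its data sources, an export restricted to certain sources only has to filter the history; here `exported.value` is the value written by `import_a`, even though `import_b` wrote a newer one.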

version table

The version table is used to identify the latest version of the datalake and can be used in the curation interface to display a history of processes that were run in the past.
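
Both uses of the version table can be sketched in a few lines of Python. The row layout (`version`, `program`, `timestamp`), the helper names, and the rows themselves are made-up illustrative values, not the real schema or real data.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Hypothetical rows of the version table, one per program run.
version_table: List[Dict] = [
    {"version": "v1", "program": "import_a", "timestamp": datetime(2017, 7, 1, tzinfo=timezone.utc)},
    {"version": "v3", "program": "merge_c", "timestamp": datetime(2017, 7, 20, tzinfo=timezone.utc)},
    {"version": "v2", "program": "import_b", "timestamp": datetime(2017, 7, 10, tzinfo=timezone.utc)},
]


def latest_version(rows: List[Dict]) -> Dict:
    """Identify the latest version of the datalake by its timestamp."""
    return max(rows, key=lambda row: row["timestamp"])


def run_history(rows: List[Dict]) -> List[str]:
    """Format past runs, newest first, e.g. for display in a curation interface."""
    ordered = sorted(rows, key=lambda row: row["timestamp"], reverse=True)
    return [f'{row["timestamp"]:%Y-%m-%d} {row["program"]}' for row in ordered]


latest = latest_version(version_table)
history_lines = run_history(version_table)
```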
