
Versioning Scheme

Jan Ehmueller edited this page Jul 6, 2017 · 3 revisions

Overview

Entries in the subject table are versioned on a per-attribute basis. This makes it possible to:

  • selectively revert changes (to single attributes, entities, or the whole table) to any previous value
  • export data from selected datasources only
  • trace errors back to the program that caused them (easier debugging)
  • edit values manually and protect them from being overwritten automatically
  • define validity parameters (e.g. time duration of validity) for single attributes and relations

Every field in subject has a corresponding history field (e.g. name and name_history).
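
To illustrate how a value field and its history field can work together, here is a minimal Python sketch of a per-attribute revert. It is not the project's implementation: the classes Change and Subject, the helpers set_name and revert_name, the string version IDs, and the program names are all hypothetical, and the history entry is reduced to three fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Change:
    """One entry of a history field (simplified; names are illustrative)."""
    version: str       # stand-in for the version ID
    value: List[str]   # value the attribute took in this version
    program: str       # program that made the change


@dataclass
class Subject:
    """A subject with one versioned attribute and its history field."""
    name: Optional[str] = None
    name_history: List[Change] = field(default_factory=list)


def set_name(subject: Subject, value: str, version: str, program: str) -> None:
    """Write a new name and record the change in name_history."""
    subject.name = value
    subject.name_history.append(Change(version, [value], program))


def revert_name(subject: Subject, version: str) -> None:
    """Restore the name written in `version`; other attributes are untouched."""
    for change in subject.name_history:
        if change.version == version:
            subject.name = change.value[0] if change.value else None
            return
    raise KeyError(f"no change with version {version!r} in name_history")


subject = Subject()
set_name(subject, "Acme", version="v1", program="import_a")
set_name(subject, "ACME Corp.", version="v2", program="import_b")

# Which program wrote the current value? Useful when debugging bad data.
last_program = subject.name_history[-1].program

# Roll back only this attribute to its value from version v1.
revert_name(subject, "v1")
```

Because the history is kept per attribute, the revert touches only name; whether a revert should itself be recorded as a new history entry is a design decision this sketch leaves open.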

Data structure (Cassandra UDT version)

The version type is the core data structure for the history fields. It represents a change made to the datalake by a single program and contains the value of that change as well as some meta information:

  • the version ID (the same across all changes of a single version)
  • validity data (e.g. time duration of validity)
  • data sources used in this step
  • timestamp of the change
  • program that modified this attribute

This is the CQL command used to create the version UDT:

	create type datalake.version(
		version timeuuid,
		value list<text>,
		validity map<text, text>,
		datasources list<text>,
		timestamp timestamp,
		program text
	);
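
As a rough client-side counterpart to the UDT, the following Python sketch mirrors its fields in a dataclass and filters a history by datasource, which is one way the "export from selected datasources" use case could look. The class Version, the helper from_datasource, the datasource names, and the program names are assumptions made for illustration; the element type of value is assumed to be text.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class Version:
    """Client-side mirror of the version UDT (illustrative, not project code)."""
    version: uuid.UUID         # timeuuid, shared by all changes of one version
    value: List[str]           # value written by this change
    validity: Dict[str, str]   # validity data; keys are not specified here
    datasources: List[str]     # data sources used in this step
    timestamp: datetime        # time of the change
    program: str               # program that modified the attribute


def from_datasource(history: List[Version], datasource: str) -> List[Version]:
    """Keep only the history entries that used the given datasource."""
    return [entry for entry in history if datasource in entry.datasources]


now = datetime.now(timezone.utc)
history = [
    # uuid.uuid1() produces a time-based (version 1) UUID, like a CQL timeuuid.
    Version(uuid.uuid1(), ["Acme"], {}, ["source_a"], now, "import_a"),
    Version(uuid.uuid1(), ["ACME Corp."], {}, ["source_a", "source_b"], now, "merge"),
]

# Hypothetical export restricted to entries that involve source_b.
only_b = from_datasource(history, "source_b")
```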

version table

The version table is used to identify the latest version of the datalake. The curation interface can also use it to display a history of the processes that were run in the past.

This is the CQL command used to create the version table:

	create table datalake.version(
		version timeuuid primary key,
		timestamp timestamp,
		datasources list<text>,
		program text
	);
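
The two uses of the version table described above can be sketched in Python as follows. The sketch assumes the rows have already been fetched and are represented as plain dictionaries keyed by column name; the helpers latest_version and process_history and the sample rows are hypothetical. Sorting is done on the client because version is the only primary key column, so the table presumably cannot be ordered by timestamp in CQL itself.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List

Row = Dict[str, Any]  # one row of the version table, keyed by column name


def latest_version(rows: List[Row]) -> Row:
    """Return the row with the most recent timestamp."""
    return max(rows, key=lambda row: row["timestamp"])


def process_history(rows: List[Row]) -> List[Row]:
    """Return all rows, newest first, e.g. for display in a history view."""
    return sorted(rows, key=lambda row: row["timestamp"], reverse=True)


# Sample rows with made-up programs, datasources, and dates.
rows = [
    {"version": uuid.uuid1(), "program": "import_a",
     "datasources": ["source_a"],
     "timestamp": datetime(2017, 7, 1, tzinfo=timezone.utc)},
    {"version": uuid.uuid1(), "program": "merge",
     "datasources": ["source_a", "source_b"],
     "timestamp": datetime(2017, 7, 3, tzinfo=timezone.utc)},
    {"version": uuid.uuid1(), "program": "import_b",
     "datasources": ["source_b"],
     "timestamp": datetime(2017, 7, 2, tzinfo=timezone.utc)},
]

latest = latest_version(rows)
programs_newest_first = [row["program"] for row in process_history(rows)]
```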