Domain Sharding #527
Replies: 3 comments
-
I like the sound of this! I particularly like the idea contained in the diagram.
-
I really like the idea of using foreign_type so that we can just union graphs and have something that works! We just need to make really sure that foreign types can't also be local types, so we can be absolutely certain that there cannot be duplicate triples.
-
Libraries

When creating a data product, it is extremely common to want to relate to other standard data products. Geo-coordinates with their respective countries and cities are one such example. Another is a standard treatment of dimensional values, such as mass in kilograms, length in metres, currency in dollars, etc. These dimensional values are only mathematically comparable if they are of the same dimension and converted to the same unit. Incorrect manipulation of dimensional values has caused untold damage in computer systems, so it is very useful if this information can be retained in a data product. These standard data products contain information which is both schema and document.

Dynamic or Static Linking

There are two different practical approaches to linking: static and dynamic. Both are useful for us. If you can completely shard on some boundary, then dynamic linking has some advantages in allowing a completely separate evolution of the data product. It can be updated with new information, but one must take care not to update externally visible ids or alter the contract of behaviour in some surprising way (or you need additional tooling to facilitate such changes being made automatically). The static linking approach essentially means pulling the schema and data into the same data product. This is the sort of thing you need to do if you are trying to import sub-documents, as sub-documents can not really live in a separate data product.

Static Linking Library Management

When you statically link a library, a number of different actions should take place to facilitate this relocation. With careful use of URIs it shouldn't be difficult to merge the data as documents, or the schema information, but the schema itself needs special handling: this import action can not simply be a merge or rebase, as there is a conflict with the '@context'. We need some method of treating this problem.

Dynamic Linking Library Management

In the case of weather or countries (as opposed to dimensional units, for instance), it would be more convenient to leave these as separate data products with their own evolution. This suggests dynamic linking. We had long thought about the need to surface data products at specific locations for reference of both the schema and the data, and to facilitate sharing (in the tradition of GitHub), but we have punted on this longer-term goal for the moment. Since it would also facilitate domain sharding, we should probably think about what these locations should be. And to make sure that we can upgrade, we desperately need a simple operation which "moves" a data product from one URI prefix set to another! This would make a lot of things with respect to renaming easier (a sketch of such a move appears below).
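A minimal sketch of what such a "move" operation could look like, assuming triples are represented as plain tuples; the library URIs and the `move_prefix` helper are hypothetical, and a real implementation would also have to update the '@context' and any external references:

```python
# Toy sketch of the "move" operation described above: rewrite every URI
# under an old prefix to a new prefix across a triple set. This shows only
# the mechanical rewrite; the '@context' and inbound references from other
# data products would also need updating.
def move_prefix(triples, old_prefix: str, new_prefix: str):
    def rewrite(term):
        if isinstance(term, str) and term.startswith(old_prefix):
            return new_prefix + term[len(old_prefix):]
        return term
    return [tuple(rewrite(t) for t in triple) for triple in triples]

# Example: relocate a statically linked units library into our namespace.
triples = [
    ("http://lib.example/units/kg", "rdf:type", "http://lib.example/units/Unit"),
]
moved = move_prefix(triples, "http://lib.example/units/", "http://my.org/units/")
```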
-
Domain Sharding
TerminusDB and TerminusX allow you to query across different data-products via the `Using` WOQL word. The document interface doesn't have any support for multiple data-products, but extending our current offering in a non-breaking, backward-compatible way should be relatively easy. Combined with object storage on TerminusX, this would give us infinitely scalable domain sharding.
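As a rough illustration (not a tested recipe), a cross-data-product read with the `using` word via the Python client might look like the following; the data product paths (`admin/finance`, `admin/hr`) and the properties queried are invented for the example:

```python
# Join an invoice in the finance data product to the employee it bills,
# whose description lives in the hr data product. Paths and property
# names here are illustrative, not from a real schema.
from terminusdb_client import Client, WOQLQuery as WOQL

client = Client("http://localhost:6363")
client.connect(team="admin", user="admin", key="root")

query = WOQL().woql_and(
    WOQL().using(
        "admin/finance",
        WOQL().triple("v:Invoice", "@schema:billed_to", "v:Employee"),
    ),
    WOQL().using(
        "admin/hr",
        WOQL().triple("v:Employee", "@schema:name", "v:Name"),
    ),
)
result = client.query(query)
```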
Foreign Types
Currently, when we reference foreign instances, we either construct an empty class and tag the instance as being of this class, or we reproduce the entire class definition inside the data product. This doesn't scale well and ties us to representations which other domain teams should be in charge of.
The way to fix this is to add a foreign type edge. This would be like `rdf:type`, but instead use the tag `system:foreign_type` or something along those lines. All we would need to alter is the instance checking procedure, to treat `system:foreign_type`-tagged entities as opaque and terminal (they can have no exiting edges).

The reason to use a different edge from `rdf:type` is to facilitate a union operation. We would like to keep our strategy of only recording the principal type of an object, and never having two type designators; otherwise things will become significantly more complicated in the constraint reasoner.

This is probably a few-line change, shouldn't take more than a day, and has no backward compatibility issues. It gets us 90% of the way there; we should do this and then build an example data product assemblage using Marketing, Finance and HumanResources, for instance, which has links going in different directions.
Foreign Type Discovery
Assuming we use the above approach, it becomes very easy to have a "late binding strategy" where you simply look the object up in another data product to get its description. It would be handy, though, to know which data products it might live in! We will at some point want a data product discovery service which can tell you which data products make the URI available to you (it might live in more than one, because there may be different versions of the data product that you have access to).
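A naive version of such a lookup might simply probe candidate data products directly with the Python client; the candidate list here is hypothetical, and a real discovery service would maintain an index rather than probing on demand:

```python
# Hypothetical late-binding lookup: ask each candidate data product
# whether it holds a document with the given id.
from terminusdb_client import Client

def discover(client: Client, doc_id: str, candidates: list[str]) -> list[str]:
    found = []
    for path in candidates:
        team, db = path.split("/")
        client.connect(team=team, db=db)
        try:
            client.get_document(doc_id)  # raises if absent or not visible
            found.append(path)
        except Exception:
            pass
    return found

# e.g. discover(client, "Employee/alice", ["admin/hr", "admin/finance"])
```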
Union
To create assemblages for analysis, it would be nice if you could create a union of branches from various data products and give it to the document API. Updates would always take place in the data product in which the type is defined, and reads could happen across the data products.
To implement the union, we just need to add some sort of descriptor for unions (a list should work fine) and extend the API so that you can send in a list of paths. At the lower level, reads already go across collections of descriptors, but writes will require discovery of the principal. We could probably have a quick check that there is no duplication in the schema when you do an assemblage.
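A sketch of what the union descriptor and write routing could look like, under the assumption that each member schema's class names are known up front; the descriptor shape is invented for illustration, not the actual internal representation:

```python
# A union descriptor as "a list of paths", with writes routed to the
# data product whose schema defines the document's type, and a quick
# check against duplicated type definitions at assemblage time.
from dataclasses import dataclass

@dataclass
class UnionDescriptor:
    paths: list[str]                   # e.g. ["admin/marketing", "admin/finance"]
    schema_types: dict[str, set[str]]  # path -> class names its schema defines

    def validate(self):
        # No class may be defined in more than one member schema.
        seen: dict[str, str] = {}
        for path, types in self.schema_types.items():
            for t in types:
                if t in seen:
                    raise ValueError(f"type {t} defined in both {seen[t]} and {path}")
                seen[t] = path

    def route_write(self, doc_type: str) -> str:
        # Writes go to the data product that owns the type; reads span all paths.
        for path, types in self.schema_types.items():
            if doc_type in types:
                return path
        raise KeyError(f"no data product in the union defines {doc_type}")
```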
Infinitely scalable domain sharding
With a few lines of code we can reasonably claim to scale up indefinitely for assemblages of data products. Simply shard your data by domain, and use pointers across the shards. The rest are really just convenience features.
We can produce some test examples, but internally it would be extremely useful for us to pursue this approach for the system graph itself. Since we already shard servers by organisation, it probably makes sense to shard the data on this boundary as well. This will reduce the overhead of system graph writes, which will currently have difficulty scaling up.