Domain Sharding #527
Replies: 3 comments
-
I like the sound of this! I particularly like the idea contained in the diagram.
-
I really like the idea of using foreign_type so that we can just union graphs and have something that works! We just need to make really sure that foreign types can't also be local types, so we can be absolutely certain that there cannot be duplicate triples.
-
Libraries

When creating a data product, it is extremely common to want to relate to other standard data products. Geo-coordinates with their respective countries and cities are one such example. Another is a standard treatment of dimensional values, such as mass in kilograms, length in metres, currency in dollars, etc. These dimensional values are only mathematically comparable if they are of the same dimension and converted to the same unit. Incorrect manipulation of dimensional values has caused untold damage in computer systems, so it is very useful if this information can be retained in a data product. These standard data products contain information which is both schema and document.

Dynamic or Static Linking

There are two different practical approaches to linking: static and dynamic. Both are useful for us. If you can completely shard on some boundary, then dynamic linking has some advantages in allowing a completely separate evolution of the data product. It can be updated with new information, but one must take care not to update externally visible ids or alter the contract of behaviour in some surprising way (or you need additional tooling to facilitate such changes being made automatically). The static linking approach essentially means pulling the schema and data into the same data product. This is the sort of thing you need to do if you are trying to import sub-documents, as sub-documents can not really live in a separate data product.

Static Linking Library Management

When you statically link a library, a number of different actions should take place to facilitate this relocation. With careful use of URIs it shouldn't be difficult to merge the data as documents, or the schema information, but the schema itself needs special handling: this import action can not simply be a merge or rebase, as there is a conflict with the '@context'. We need some method of treating this problem.

Dynamic Linking Library Management

In the case of weather or countries (as opposed to dimensional units, for instance), it would be more convenient to leave these as separate data products with their own evolution. This suggests dynamic linking. We had long thought about the need to surface data products at specific locations for reference of both the schema and the data, and to facilitate sharing (in the tradition of GitHub), but we have punted on this longer-term goal for the moment. Since it would also facilitate domain sharding, we should probably think about what these locations should be. And to make sure that we can upgrade, we desperately need a simple operation which "moves" a data product from one URI prefix set to another! This would make a lot of things with respect to renaming easier (a sketch of such a move appears below).
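A minimal sketch of what such a "move" operation could look like, assuming triples are represented as plain tuples; the library URIs and the `move_prefix` helper are hypothetical, and a real implementation would also have to update the '@context' and any external references:

```python
# Toy sketch of the "move" operation described above: rewrite every URI
# under an old prefix to a new prefix across a triple set. This shows only
# the mechanical rewrite; the '@context' and inbound references from other
# data products would also need updating.
def move_prefix(triples, old_prefix: str, new_prefix: str):
    def rewrite(term):
        if isinstance(term, str) and term.startswith(old_prefix):
            return new_prefix + term[len(old_prefix):]
        return term
    return [tuple(rewrite(t) for t in triple) for triple in triples]

# Example: relocate a statically linked units library into our namespace.
triples = [
    ("http://lib.example/units/kg", "rdf:type", "http://lib.example/units/Unit"),
]
moved = move_prefix(triples, "http://lib.example/units/", "http://my.org/units/")
```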
-
Domain Sharding
TerminusDB and TerminusX allow you to query across different data-products via the `Using` WOQL word. The document interface doesn't have any support for multiple data-products, but extending our current offering in a non-breaking, backward-compatible way should be relatively easy. Combined with object storage on TerminusX, this would give us infinitely scalable domain sharding.
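As a rough illustration (not a tested recipe), a cross-data-product read with the `using` word via the Python client might look like the following; the data product paths (`admin/finance`, `admin/hr`) and the properties queried are invented for the example:

```python
# Join an invoice in the finance data product to the employee it bills,
# whose description lives in the hr data product. Paths and property
# names here are illustrative, not from a real schema.
from terminusdb_client import Client, WOQLQuery as WOQL

client = Client("http://localhost:6363")
client.connect(team="admin", user="admin", key="root")

query = WOQL().woql_and(
    WOQL().using(
        "admin/finance",
        WOQL().triple("v:Invoice", "@schema:billed_to", "v:Employee"),
    ),
    WOQL().using(
        "admin/hr",
        WOQL().triple("v:Employee", "@schema:name", "v:Name"),
    ),
)
result = client.query(query)
```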
Foreign Types
Currently, when we reference foreign instances, we either construct an empty class and tag the instance as being of this class, or we reproduce the entire class definition inside the data product. This doesn't scale well and ties us to representations which other domain teams should be in charge of.
The way to fix this is to add a foreign type edge. This would be like `rdf:type`, but instead use the tag `system:foreign_type` or something along those lines. All we would need to alter is the instance checking procedure, to treat `system:foreign_type`-tagged entities as opaque and terminal (they can have no exiting edges).

The reason to use a different edge from `rdf:type` is to facilitate a union operation. We would like to keep our strategy of only recording the principal type of an object, and never having two type designators; otherwise things will become significantly more complicated in the constraint reasoner.

This is probably a few-line change, shouldn't take more than a day, and has no backward compatibility issues. It gets us 90% of the way there; we should do this and then build an example data product assemblage using Marketing, Finance and HumanResources, for instance, which has links going in different directions.
Foreign Type Discovery
Assuming we use the above approach, it becomes very easy to have a "late binding strategy" where you simply look the object up in another data product to get its description. It would be handy, though, to know which data products it might live in! We will at some point want a data product discovery service which can tell you which data products make the URI available to you (it might live in more than one, because there may be different versions of the data product that you have access to).
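A naive version of such a lookup might simply probe candidate data products directly with the Python client; the candidate list here is hypothetical, and a real discovery service would maintain an index rather than probing on demand:

```python
# Hypothetical late-binding lookup: ask each candidate data product
# whether it holds a document with the given id.
from terminusdb_client import Client

def discover(client: Client, doc_id: str, candidates: list[str]) -> list[str]:
    found = []
    for path in candidates:
        team, db = path.split("/")
        client.connect(team=team, db=db)
        try:
            client.get_document(doc_id)  # raises if absent or not visible
            found.append(path)
        except Exception:
            pass
    return found

# e.g. discover(client, "Employee/alice", ["admin/hr", "admin/finance"])
```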
Union
To create assemblages for analysis, it would be nice if you could create a union of branches from various data products and give it to the document API. Updates would always take place in the data product in which the type is defined, and reads could happen across the data products.
To implement the union, we just need to add some sort of descriptor for unions (a list should work fine) and extend the API so that you can send in a list of paths. At the lower level, reads already go across collections of descriptors, but writes will require discovery of the principal. We could probably have a quick check that there is no duplication in the schema when you do an assemblage.
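A sketch of what the union descriptor and write routing could look like, under the assumption that each member schema's class names are known up front; the descriptor shape is invented for illustration, not the actual internal representation:

```python
# A union descriptor as "a list of paths", with writes routed to the
# data product whose schema defines the document's type, and a quick
# check against duplicated type definitions at assemblage time.
from dataclasses import dataclass

@dataclass
class UnionDescriptor:
    paths: list[str]                   # e.g. ["admin/marketing", "admin/finance"]
    schema_types: dict[str, set[str]]  # path -> class names its schema defines

    def validate(self):
        # No class may be defined in more than one member schema.
        seen: dict[str, str] = {}
        for path, types in self.schema_types.items():
            for t in types:
                if t in seen:
                    raise ValueError(f"type {t} defined in both {seen[t]} and {path}")
                seen[t] = path

    def route_write(self, doc_type: str) -> str:
        # Writes go to the data product that owns the type; reads span all paths.
        for path, types in self.schema_types.items():
            if doc_type in types:
                return path
        raise KeyError(f"no data product in the union defines {doc_type}")
```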
Infinitely scalable domain sharding
With a few lines of code we can reasonably claim to scale up indefinitely for assemblages of data products. Simply shard your data by domain, and use pointers across the shards. The rest are really just convenience features.
We can produce some test examples, but internally it would be extremely useful for us to pursue this approach for the system graph itself. Since we already shard servers by organisation, it probably makes sense to shard the data on this boundary as well. This will reduce the overhead of system graph writes, which will currently have difficulty scaling up.