Consider using a graph db for services #226
Comments
cc @maroshmka, as he was the one who first suggested it, for their possible work on data flow modelling.
What kinds of queries are limiting us now, or could limit us at a larger scale?
We don't have any at the moment, @Stranger6667, as we are only just starting to map our infrastructure. But thanks for your point, I think it's definitely something to consider. Queries I can imagine we would like to answer:
Considering that not all services are required for my service to work, even if we relate them together, we'd either have to keep a "flattened" dependency list on each service (which also seems to be the

And considering data flows, which @maroshmka can probably tell more about, they would like to model the different actions (extraction, transformation, usage, querying) that the nodes would perform on each other. Again, this can also be represented in a relational database; the question here is whether the convenience of a graph db (standalone or on top of pgsql) is worth the effort.
Btw, related to #194
For the amount of data we can reasonably expect in The Zoo (far fewer than a million Services), having these graph-like relations in PostgreSQL should be fine. In Django we can use

Adding a graph database would come at the cost of added complexity, and it does not magically solve everything. I would consider adding a graph database, or another solution, once we hit performance issues with PostgreSQL and the Django ORM.
Yep, it seems like a job for a recursive CTE for depth > 1 (with depth one we can use simple joins; a CTE will still work there, though maybe less efficiently). I assume that even for a big number of services, e.g. 1M, we can utilize index(-only) scans with fairly good performance in the recursive part of the CTE. However, it would be interesting to compare different options. As @JanBednarik mentions, an intermediate table should be a really good fit for representing the relations (

For now, it seems to me that PG should still fit. I haven't worked with graph databases, so it is hard to say whether they would be convenient to use, but I expect more effort in the long run for the graph DB approach than for PG, mainly because of my lack of experience with specific graph DBs and my existing experience with graphs represented in PG.
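As a rough illustration of the recursive CTE idea over an intermediate relation table, here is a minimal sketch. It uses Python's built-in `sqlite3` so that it runs without a server; this is a stand-in for PostgreSQL, not the Zoo's actual setup. The table names (`service`, `dependency`), the column names, and the sample services are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for runnability);
# the WITH RECURSIVE query below uses syntax both databases accept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE service (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- Hypothetical intermediate table: one row per "source depends on target".
    CREATE TABLE dependency (
        source_id INTEGER NOT NULL REFERENCES service(id),
        target_id INTEGER NOT NULL REFERENCES service(id),
        PRIMARY KEY (source_id, target_id)
    );
""")

# Made-up sample data: web -> api -> (auth -> db, cache)
names = ["web", "api", "auth", "db", "cache"]
conn.executemany("INSERT INTO service (name) VALUES (?)", [(n,) for n in names])
ids = {name: sid for sid, name in conn.execute("SELECT id, name FROM service")}
edges = [("web", "api"), ("api", "auth"), ("api", "cache"), ("auth", "db")]
conn.executemany(
    "INSERT INTO dependency (source_id, target_id) VALUES (?, ?)",
    [(ids[a], ids[b]) for a, b in edges],
)


def transitive_dependencies(conn, service_name, max_depth=5):
    """Return (name, shortest depth) for every dependency within max_depth hops."""
    return conn.execute(
        """
        WITH RECURSIVE deps(id, depth) AS (
            -- Anchor: direct dependencies of the starting service.
            SELECT d.target_id, 1
            FROM dependency d
            JOIN service s ON s.id = d.source_id
            WHERE s.name = ?
            UNION
            -- Recursive step: follow one more edge, bounded by max_depth
            -- so that cycles cannot loop forever.
            SELECT d.target_id, deps.depth + 1
            FROM dependency d
            JOIN deps ON d.source_id = deps.id
            WHERE deps.depth < ?
        )
        SELECT s.name, MIN(deps.depth) AS min_depth
        FROM deps
        JOIN service s ON s.id = deps.id
        GROUP BY s.name
        ORDER BY min_depth, s.name
        """,
        (service_name, max_depth),
    ).fetchall()


print(transitive_dependencies(conn, "web"))
```

The depth bound doubles as cycle protection, which matters if services can depend on each other circularly. How this compares with a graph database in readability and performance is exactly what a case study would need to measure.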
Hey guys, lots of good points and questions here. I generally agree with @JanBednarik: we shouldn't increase system complexity with a graph database when it is not needed. Therefore I would do a case study that models a specific problem, and we then compare the solutions in terms of complexity, readability, robustness and so on. I would imagine something that meets these criteria (or a subset):
Answering the question would help with having more visibility on what a service is going to influence, e.g. in case of breaking changes, or it can help in OPS. I would start with the last one, as it is the most generic and we should be able to answer the other ones (with some small changes) with it. My hypothesis is that the code itself would be less readable and less robust when using a relational database. In neo4j we could answer the question with a one-line query:
for reference

I don't want to say it would be "better"; that's why I'm suggesting a case study to compare the solutions on a very limited use case that should be implementable in a matter of days for each design. What do you think? And @aexvir, re #194: we're still not sure about this, but we will have an update soon.
I can imagine that in the Zoo we will have a few use cases for getting data from this graph. If there are performance issues with a "naive" implementation in PostgreSQL, we can look at these use cases and try to optimize them. We can do some denormalization, use flexible data types like JSON, hstore or Array, and probably more. I would just try to implement it for the real use cases we have in the Zoo, as a proof of concept, and then we will see whether it's good enough or whether we need to adjust it, optimize it, or refactor it using a different kind of data type or database.
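To make the denormalization option concrete, here is a small sketch of precomputing a "flattened" dependency list per service and serializing it as JSON, the kind of value that could sit in a JSON or Array column. The function name, the input shape, and the sample services are assumptions made for illustration only.

```python
import json


def flatten_dependencies(direct):
    """Map each service to the sorted list of all its transitive dependencies.

    `direct` maps a service name to the list of services it depends on directly.
    """
    flattened = {}
    for service in direct:
        stack = list(direct[service])
        seen = set()
        while stack:
            dep = stack.pop()
            # Skip anything already visited, and the service itself (cycles).
            if dep in seen or dep == service:
                continue
            seen.add(dep)
            stack.extend(direct.get(dep, []))
        flattened[service] = sorted(seen)
    return flattened


# Made-up sample data.
direct = {
    "web": ["api"],
    "api": ["auth", "cache"],
    "auth": ["db"],
    "db": [],
    "cache": [],
}
flat = flatten_dependencies(direct)

# One JSON document per service, ready to store in a denormalized column.
payload = {name: json.dumps(deps) for name, deps in flat.items()}
print(payload["web"])
```

The trade-off is the usual one for denormalization: reads become a single-row lookup, but the flattened lists have to be recomputed whenever a dependency edge changes.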
Just for the record: I was talking mostly about software qualities like robustness, openness to extension and readability. Not about performance problems or other hard technical problems, but software development ones.
One of our goals is to have The Zoo as the main source of truth for all our microservices. Our microservices interact with each other in many ways, and we need to represent that interaction.
Currently we have a really simple model hierarchy, mainly because our data is limited, but as we plan to add more and more data, it will be more complex to represent all the possible dependencies between them.
With a graph db we can have all the current services as nodes and use edges to represent the different interactions (`requires`, `uses`, `belongs`, etc.) in a much more performant and optimized way, allowing "long shot" queries without having to do many "inner join" operations to find relations 5 layers deep.

I wouldn't like to model the whole system as a graph, as it doesn't bring that much value for the other resources we have, and we'd be losing the ORM. But as Django allows using multiple databases, I'd move only the parts that benefit from being modelled as graphs there.
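To illustrate the node-and-typed-edge model independently of any particular database, here is a minimal in-memory sketch in plain Python. It is not neo4j and makes no performance claim; the `ServiceGraph` class, its methods, and the sample services are hypothetical.

```python
from collections import defaultdict, deque


class ServiceGraph:
    """Services as nodes, typed edges such as "requires", "uses", "belongs"."""

    def __init__(self):
        # source -> list of (relation, target)
        self._out = defaultdict(list)

    def add_edge(self, source, relation, target):
        self._out[source].append((relation, target))

    def reachable(self, start, relations, max_depth):
        """Breadth-first search following only the given relation types.

        Returns {node: shortest depth} for nodes within max_depth hops.
        """
        seen = {start}
        found = {}
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            for relation, target in self._out.get(node, []):
                if relation in relations and target not in seen:
                    seen.add(target)
                    found[target] = depth + 1
                    queue.append((target, depth + 1))
        return found


# Made-up sample data.
graph = ServiceGraph()
graph.add_edge("web", "requires", "api")
graph.add_edge("api", "requires", "auth")
graph.add_edge("api", "uses", "cache")
graph.add_edge("auth", "requires", "db")
graph.add_edge("web", "belongs", "frontend-team")

# Hard requirements only, up to 5 layers deep.
print(graph.reachable("web", {"requires"}, max_depth=5))
```

Filtering by relation type is what separates "services required for mine to work" from looser relations, which is the distinction raised earlier in the thread.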
The disadvantage here would be that we'd have to write a thin layer to keep operations over the graph db consistent; that layer would probably just use the neo4j bolt driver. There are a couple of projects aiming to offer ORM-like interfaces, but all of them seem kind of abandoned.

Or, another option could be a multi-model db approach, like AgensGraph, which supports both graph and relational models on top of PostgreSQL.
We'd also probably be losing the admin unless we do some work ourselves. Personally I don't use the admin that much, but for an OSS project this might be a bigger inconvenience.