Achieve horizontal scalability of API server #375
DataNucleus L2 Cache

Distributed Cache

I did some preliminary testing with distributed caching. Due to how many caching operations are currently performed, it became evident very quickly that distributed L2 caching is not realistic with the current cache usage. Everything took far longer, and that was with a local Redis server already; it will be a lot worse with remote Redis instances. I will try to collect some actual numbers here to have some evidence.

Disable L2 Cache

Here's the thing: the L2 cache stores objects by their ID. If we don't do lookups by ID, we do not benefit from the cache at all. We mostly query objects by UUID (which is not the ID field), or by other attributes (names etc.). I suspect that we can either:
Comparing performance with the L2 cache enabled vs. disabled has priority. Memory usage is significantly reduced without it. If we don't get a good benefit out of caching, then the increase in heap usage is simply not justified. Alpine 2.2.1 / 2.3.0 is capable of exposing metrics of the DN caches: stevespringett/Alpine#471. DN doesn't expose hit rate information though. To measure hit rates, we'd have to use an external cache that exposes such information.
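To make the ID-vs-UUID point concrete, here is a minimal JDO sketch (the `Project` stub, persistence unit name, and field names are simplified stand-ins, not DT's actual code): only the lookup by primary-key identity can be answered from the L2 cache, while the UUID query always goes to the database.

```java
import java.util.Map;
import java.util.UUID;
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.Query;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.PrimaryKey;

public class CacheLookupSketch {

    // Stand-in for DT's actual Project model class.
    @PersistenceCapable
    public static class Project {
        @PrimaryKey
        long id;   // datastore identity, which is what the L2 cache is keyed on
        UUID uuid; // the field DT usually queries by
    }

    public static void main(String[] args) {
        // Hypothetical persistence unit; DT/Alpine bootstraps its PMF differently.
        PersistenceManagerFactory pmf = JDOHelper.getPersistenceManagerFactory("dt-sketch");
        PersistenceManager pm = pmf.getPersistenceManager();
        try {
            // Lookup by ID: the only access pattern the L2 cache can serve without a DB hit.
            Project byId = pm.getObjectById(Project.class, 42L);

            // Lookup by UUID (a non-identity column): DataNucleus issues a SQL query,
            // so the L2 cache is bypassed and the database is hit every time.
            Query<Project> query = pm.newQuery(Project.class, "uuid == :uuid");
            query.setNamedParameters(Map.of("uuid", UUID.randomUUID()));
            Project byUuid = query.executeUnique();
        } finally {
            pm.close();
        }
    }
}
```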
Task Scheduling / Processing

The remaining tasks that are still scheduled and processed internally are:
Singleton = only one instance of the task should run at any given time.

Publishing all events that trigger these tasks to Kafka is not an option, as it would dramatically increase the number of topics, partitions, and consequently the number of streams threads the API server has to run. It would also be very inefficient to have dedicated consumer threads for every single task type, as they would mostly sit idle.

Fault tolerance

All tasks are executed via the in-memory event bus. If they fail, there is no DLQ or retry mechanism, but errors are logged.

System load

Certain tasks are designed as singletons due to their high impact on load. PortfolioMetricsUpdateTask triggers metrics updates for all projects and their components. Updates for projects are executed concurrently to speed up the process; the level of concurrency is bound to the number of available CPU cores. For large portfolios this will take a while and puts noticeable load on the DB server. Concurrent execution of such heavy tasks should be avoided. However, if they do end up running concurrently, it's not an end-of-the-world scenario.

Data Consistency

All tasks use short-lived database transactions. If a task or the entire application dies while in a transaction, the transaction is never committed and consistency is preserved. Metrics updates have recently been migrated to stored procedures (DependencyTrack/hyades-apiserver#149). Each stored procedure invocation executes in its own transaction.
If the task dies before it reaches …

Simple solution:
MVP: Introduce an API endpoint that executes a long-running task (…).

Another option: Use different "entrypoints" in the DT application that make it possible to run instances that execute only specific tasks. The container image of DT could then be reused in CronJobs.
Complication: Some tasks are scheduled AND can be triggered via the UI. If we rely on k8s CronJobs, we only get guarantees against concurrent execution for the job itself, not for manual triggers from within the application. Database-level locking will be required.
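As a rough sketch of what such database-level locking could look like, assuming a PostgreSQL backend, a session-level advisory lock over plain JDBC would do. The lock key, connection details, and control flow below are made up for illustration and do not reflect DT's actual implementation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SingletonTaskLockSketch {

    // Arbitrary, application-chosen lock key identifying this particular task.
    private static final long PORTFOLIO_METRICS_LOCK_KEY = 1001L;

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/dtrack", "dtrack", "dtrack")) {

            // pg_try_advisory_lock returns immediately: true if we got the lock,
            // false if another instance (or a CronJob) already holds it.
            boolean acquired;
            try (PreparedStatement ps = conn.prepareStatement("SELECT pg_try_advisory_lock(?)")) {
                ps.setLong(1, PORTFOLIO_METRICS_LOCK_KEY);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    acquired = rs.getBoolean(1);
                }
            }

            if (!acquired) {
                System.out.println("Task is already running elsewhere, skipping.");
                return;
            }

            try {
                // ... execute the singleton task here ...
            } finally {
                // Release the lock so the next trigger (scheduled or manual) can run.
                try (PreparedStatement ps = conn.prepareStatement("SELECT pg_advisory_unlock(?)")) {
                    ps.setLong(1, PORTFOLIO_METRICS_LOCK_KEY);
                    ps.execute();
                }
            }
        }
    }
}
```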
Subject to further research: Hazelcast would be able to cater to both the distributed caching and the distributed task execution requirements. Investigate how well it would integrate with Alpine, and how seamless switching between single- and multi-node deployments would be.
Update

DataNucleus L2 Cache

We have disabled the L2 cache in the majority of places where background processing is done. A caching mechanism that caches all objects written to the datastore does not scale, especially not under heavy load. Uploading 100 BOMs meant 100 projects and all of their components were stored in memory. Unless we want to expect users to provision VMs with >32GB of RAM, L2 caching is not an option. The costs outweigh the benefits here, even for single-node deployments. The L2 cache should be disabled globally.

A distributed cache is undesirable as it couples the availability of the API server to yet another external system. We still have the option to enable it though, so if at some point we find we need it, it would be easy to integrate.

Update on Lucene Search Indexes

The indexes are currently only used for the … A dedicated search engine like Solr or Elasticsearch would be great, but the danger of dual writes and all sorts of data inconsistencies would be quite high. Kafka may be able to help (keyword: CDC), but that again introduces a lot of complexity as well.
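For reference, turning the L2 cache off globally comes down to a single DataNucleus persistence property. A minimal sketch, assuming a plain JDO bootstrap (DT/Alpine wires its PersistenceManagerFactory differently, and the connection settings here are placeholders):

```java
import java.util.Properties;
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManagerFactory;

public class DisableL2CacheSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("javax.jdo.PersistenceManagerFactoryClass",
                "org.datanucleus.api.jdo.JDOPersistenceManagerFactory");
        // Hypothetical datastore settings; DT derives these from its own configuration.
        props.setProperty("javax.jdo.option.ConnectionURL", "jdbc:postgresql://localhost:5432/dtrack");
        props.setProperty("javax.jdo.option.ConnectionUserName", "dtrack");
        props.setProperty("javax.jdo.option.ConnectionPassword", "dtrack");

        // Turn the L2 cache off entirely: objects are no longer retained across
        // PersistenceManagers, which keeps heap usage bounded.
        props.setProperty("datanucleus.cache.level2.type", "none");

        PersistenceManagerFactory pmf = JDOHelper.getPersistenceManagerFactory(props);
        // ... obtain PersistenceManagers from pmf as usual ...
        pmf.close();
    }
}
```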
Great work folks! Hard to catch up after so much time away from DT. I think you've made the right decisions regarding the Lucene indexes: trying to maintain dual systems or propagating changes through Kafka would introduce more problems than it would solve, especially considering the benefits of this feature. If anything, for the future I would advise looking at solutions like zombodb, where consistency between the RDBMS and the search engine is managed by the RDBMS itself. If only it weren't specific to PostgreSQL. One question that springs to mind and that has always been bugging me with DT scheduled tasks: how would observability be managed (especially with an event-based architecture)?
Thanks @syalioune!
Agreed, this is a challenge. What I think would be required to address all of these points is some form of workflow engine that orchestrates task executions based on events (like the Camunda / Zeebe platform we both know :p). However, I am not yet sold on implementing a fully-fledged workflow engine in DT, or even integrating a 3rd party one. For the short- to mid-term, we came to the conclusion that keeping a high-level state of workflows will be sufficient (see #664 for details). This is mostly required to support the CI/CD use case of uploading BOMs and waiting for their processing to complete. We would love your feedback on that if you ever get the chance to look into it!
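Purely to illustrate the CI/CD use case that the high-level workflow state is meant to support (the actual design is described in #664; the endpoint path, response format, and token below are hypothetical), a client could upload a BOM and then poll until processing completes:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BomProcessingPollSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical token returned by the BOM upload endpoint.
        String token = "d1a64a8c-0000-0000-0000-000000000000";

        // Poll a (hypothetical) workflow status endpoint until processing finishes.
        while (true) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/api/v1/workflow/" + token + "/status"))
                    .header("X-Api-Key", "changeme") // placeholder API key
                    .GET()
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            // The body is assumed to carry a coarse state such as PENDING, COMPLETED, or FAILED.
            if (response.body().contains("COMPLETED") || response.body().contains("FAILED")) {
                System.out.println("BOM processing finished: " + response.body());
                break;
            }
            Thread.sleep(Duration.ofSeconds(5).toMillis());
        }
    }
}
```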
I had a sneak peek today and it looks awesome so far. Will look at it more deeply this weekend and provide more elaborate feedback 😉
I'm going to close this issue. The API server has been running in a multi-replica setup in a production environment for almost two months now. The original ask of this issue is fulfilled. This is not to say that work on this is done; there are certainly more than enough areas still to improve on.
Currently, DataNucleus will put all objects into the L2 cache. Given the volume of objects being processed by DT, this behavior quickly adds up to enormous cache sizes. Users continue to be bamboozled by DT's memory requirements, which for a large part are driven by the wasteful L2 caching.

While working on DependencyTrack#4305, it became obvious that the hit rates of the cache are absolutely dwarfed by the high rate of misses. Storing such large volumes of objects in RAM is simply not justified if hit rates are that low.

Disabling the L2 cache solves a lot of recurring issues we and users are facing. If we want to introduce caching again in the future, we should do it in targeted areas, and preferably not directly in the persistence layer.

We disabled the L2 cache in Hyades a long time ago, and it has worked out very well for us. It was a precondition to make the API server horizontally scalable.

Some more context:
* DependencyTrack/hyades#375 (comment)
* DependencyTrack/hyades#576

Supersedes DependencyTrack#4305

Signed-off-by: nscuro <[email protected]>
Current Behavior
The API server is currently not horizontally scalable, for various reasons:
Proposed Behavior
Ensure that the API server is horizontally scalable. It should be possible to run multiple active instances of it in parallel.
Checklist