Achieve horizontal scalability of API server #375

Closed
nscuro opened this issue Feb 24, 2023 · 8 comments
Labels: architecture, component/api-server, p1 (Critical bugs that prevent DT from being used, or features that must be implemented ASAP), size/XL (Higher effort)

Comments

nscuro (Member) commented Feb 24, 2023

Current Behavior

The API server is currently not horizontally scalable, for various reasons:

  • Search indexes being read from and written to on local disk
  • TaskExecutors for event subscribers consuming from an in-memory queue
    • Some of this has been offloaded to Kafka
  • DataNucleus' L2 cache being in-memory
    • There are options available to use distributed caches, but perhaps disabling caches altogether is also an option
  • Keys for secret encryption, and JWT signing and verification are stored on disk
    • Not really an issue; Can be replaced with k8s secrets and mounted at runtime
      • We already do this for the secret key

Proposed Behavior

Ensure that the API server is horizontally scalable. It should be possible to run multiple active instances of it in parallel.

Checklist

nscuro (Member, Author) commented Mar 7, 2023

DataNucleus L2 Cache

Distributed Cache

I did some preliminary testing with distributed caching using the javax.cache option, Redisson, and Redis.

Given how many caching operations are currently performed, it quickly became evident that distributed L2 caching is not realistic with the current cache usage. Everything took considerably longer, and that was with a local Redis server; it will be a lot worse with remote Redis instances.

I will try to collect some actual numbers here to have some evidence.

Disable L2 Cache

Here's the thing: The L2 cache stores objects by their ID. If we don't do lookups by ID, we do not benefit from the cache at all. We mostly query objects by UUID (which is not the ID field), or other attributes (names etc.).

I suspect that we can either:

  • Disable the L2 cache
  • Severely limit cache usage, making distributed cache usage more viable
    • Set datanucleus.cache.level2.mode to ENABLE_SELECTIVE (default is DISABLE_SELECTIVE)
      • Instead of caching everything per default, opt-in to caching when actually needed / beneficial
    • Annotate to-be-cached classes and fields with @Cacheable (see the sketch after this list)
    • Disable L2 cache explicitly in cases where we perform a lot of inserts / updates (e.g. mirroring)
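A minimal sketch of what selective caching could look like with JDO annotations. The property name comes from the DataNucleus documentation; which classes would actually be good caching candidates is an assumption, not DT's current configuration:

```java
// Persistence configuration (passed to the PersistenceManagerFactory):
//   datanucleus.cache.level2.mode=ENABLE_SELECTIVE
// With ENABLE_SELECTIVE, only classes explicitly marked as cacheable go into the L2 cache.

import javax.jdo.annotations.Cacheable;
import javax.jdo.annotations.PersistenceCapable;

@Cacheable("true") // opt in: frequently read, rarely written (assumption for illustration)
@PersistenceCapable
class RepositoryMetaComponent {
}

@Cacheable("false") // keep out of the cache: written in bulk during BOM processing
@PersistenceCapable
class Component {
}
```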

Comparing performance with L2 cache enabled vs. disabled has priority. Memory usage is significantly reduced without it. If we don't get a good benefit out of caching, then the increase in heap usage is simply not justified.

Alpine 2.2.1 / 2.3.0 is capable of exposing metrics of the DN caches: stevespringett/Alpine#471

DN doesn't expose information of hit rates though. To measure hit rates, we'll have to use an external cache that exposes such information.

nscuro (Member, Author) commented Mar 7, 2023

Task Scheduling / Processing

The remaining tasks that are still scheduled and processed internally are:

| Name | DB Read / Write | (Default) Interval | Notes |
|------|-----------------|--------------------|-------|
| BomUploadProcessingTask | R+W | n/a | - |
| VexUploadProcessingTask | R+W | n/a | - |
| CloneProjectTask | R+W | n/a | Manual trigger via "Add version" for a project in the UI or via API |
| LdapSyncTask | R+W | 6h | - |
| EpssMirrorTask | R+W | 24h | - |
| GitHubAdvisoryMirrorTask | n/a | 24h | Only emits event to Kafka topic |
| NistMirrorTask | n/a | 24h | Only emits event to Kafka topic |
| OsvMirrorTask | n/a | 24h | Only emits event to Kafka topic |
| VulnDbSyncTask | n/a | 24h | Only emits event to Kafka topic |
| PortfolioMetricsUpdateTask | R+W | 1h | Manual trigger via UI or API |
| ProjectMetricsUpdateTask | R+W | 1h | Called by PortfolioMetricsUpdateTask; triggered after vuln analysis and policy eval; manual trigger via UI or API |
| VulnerabilityMetricsUpdateTask | R+W | 1h | - |
| ComponentMetricsUpdateTask | R+W | 1h | Called by ProjectMetricsUpdateTask; triggered after vuln analysis and policy eval; manual trigger via UI or API |
| InternalComponentIdentificationTask | R+W | 6h | Manual trigger via admin panel |
| DefectDojoUploadTask | R | 1h | - |
| FortifySscUploadTask | R | 1h | - |
| KennaSecurityUploadTask | R | 1h | - |
| RepositoryMetaAnalyzerTask | R | 24h | Recurring for portfolio; triggered for a project after BOM upload |
| VulnerabilityAnalysisTask | R | 24h | Recurring for portfolio; triggered for a project after BOM upload; manual trigger for projects via UI or API ("Reanalyze" button) |
| IndexTask | R | 3h | - |
| CallbackTask | n/a | n/a | Internal multi-purpose task for task chaining |
| NewVulnerableDependencyAnalysisTask | R | n/a | Internal task executed after vuln analysis for a BOM upload to figure out which new dependencies are vulnerable |

Singleton = Only one instance of the task should run at any given time

Publishing all events that trigger these tasks to Kafka is not an option, as it would dramatically increase the number of topics, partitions, and consequently the number of Kafka Streams threads the API server has to run. It would also be very inefficient to have dedicated consumer threads for every single task type, as they would mostly run idle.


Fault tolerance

All tasks are executed via an in-memory event bus. If they fail, there is no DLQ or retry mechanism; errors are only logged.
Tasks enqueued on the event bus are lost when the application crashes or restarts, see: https://github.com/DependencyTrack/hyades/blob/main/WTF.md#limitations

System load

Certain tasks are designed as singletons due to their high impact on load.

PortfolioMetricsUpdateTask triggers metrics updates for all projects and their components. Project updates are executed concurrently to speed up the process, with the level of concurrency bound to the number of available CPU cores. For large portfolios this takes a while and puts noticeable load on the DB server.

Concurrent execution of such heavy tasks should be avoided. However, if they do end up running concurrently, it's not an end-of-the-world scenario.

Data Consistency

All tasks use short-lived database transactions. If a task or the entire application dies while in a transaction, the transaction is never committed and consistency is preserved.

Metrics updates have recently been migrated to stored procedures (DependencyTrack/hyades-apiserver#149). Each stored procedure invocation executes in its own transaction. The overall flow is as follows (a code sketch follows the list):

  1. Fetch first X projects
  2. For each project:
    • Fetch first X components
      • For each component:
        • Execute UPDATE_COMPONENT_METRICS procedure (in its own transaction)
      • Fetch next X components, repeat until no more components
    • Execute UPDATE_PROJECT_METRICS procedure (in its own transaction)
  3. Fetch next X projects, repeat until no more projects
  4. Execute UPDATE_PORTFOLIO_METRICS procedure (in its own transaction)
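A rough sketch of that loop in code, assuming a hypothetical MetricsDao abstraction and page size (DT's actual persistence layer differs; the procedure names match the description above):

```java
import java.util.List;
import java.util.UUID;

// Hypothetical DAO; each call*Procedure() method is assumed to run the stored
// procedure in its own short-lived transaction, as described above.
interface MetricsDao {
    List<UUID> fetchProjects(long offset, int limit);
    List<UUID> fetchComponents(UUID projectUuid, long offset, int limit);
    void callUpdateComponentMetrics(UUID componentUuid); // UPDATE_COMPONENT_METRICS
    void callUpdateProjectMetrics(UUID projectUuid);     // UPDATE_PROJECT_METRICS
    void callUpdatePortfolioMetrics();                   // UPDATE_PORTFOLIO_METRICS
}

class PortfolioMetricsLoop {
    static void run(MetricsDao dao) {
        final int pageSize = 500; // "X" from the description; the value is an assumption
        long projectOffset = 0;
        List<UUID> projects;
        while (!(projects = dao.fetchProjects(projectOffset, pageSize)).isEmpty()) {
            for (UUID project : projects) {
                long componentOffset = 0;
                List<UUID> components;
                while (!(components = dao.fetchComponents(project, componentOffset, pageSize)).isEmpty()) {
                    for (UUID component : components) {
                        dao.callUpdateComponentMetrics(component); // own transaction
                    }
                    componentOffset += components.size();
                }
                dao.callUpdateProjectMetrics(project); // own transaction
            }
            projectOffset += projects.size();
        }
        dao.callUpdatePortfolioMetrics(); // own transaction
    }
}
```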

If the task dies before it reaches UPDATE_PORTFOLIO_METRICS, it is possible that the portfolio metrics do not correctly reflect the sum of all project metrics. However, this is also the case while the task is running, as project metrics are updated before portfolio metrics.


Simple solution:

  1. Perform scheduling externally (via K8s CronJob), just curling an API endpoint to trigger tasks
    • Needs AuthN / AuthZ; Needs dedicated trigger endpoints
  2. For scheduled tasks that should only run once, and are performing DB write operations (e.g. PortfolioMetricsUpdateTask), implement a locking mechanism to ensure there can't be multiple instances running concurrently
    • Even though DB transactions are used, for long-running tasks it's hard to guarantee consistency if they fail in between
    • Conceptually we need something like ShedLock in Spring
    • What to do with tasks blocked by the lock? Drop? Reschedule?

  • Making the tasks blocking instead of event-based; a K8s CronJob can prevent them from running in parallel
    • What happens when K8s kills the pod that is running the command that invokes the REST API endpoint? Does the task continue to execute or does it stop immediately?
    • As long as K8s gracefully shuts down the CronJob pod, and the HTTP connection to the API server is correctly closed, the task should also just stop and not leave the API server / DB in an inconsistent state
    • DB transactions should prevent data being left in an inconsistent state

MVP: Introduce an API endpoint that executes a long-running task (PortfolioMetricsUpdateTask) and blocks while doing it. Invoke the endpoint via a K8s CronJob and see what happens; kill the Job pod in the middle of processing, etc.
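A minimal sketch of what such a blocking trigger endpoint could look like (JAX-RS; the path, class name, and wiring are assumptions for illustration, not an existing DT endpoint):

```java
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

// Hypothetical resource: runs the portfolio metrics update synchronously so an
// external scheduler (e.g. a K8s CronJob running curl) can observe success or
// failure through the HTTP status code instead of fire-and-forget events.
@Path("/v1/task/portfolio-metrics")
public class PortfolioMetricsTriggerResource {

    @POST
    public Response triggerPortfolioMetricsUpdate() {
        try {
            runPortfolioMetricsUpdate(); // blocks until the task has completed
            return Response.noContent().build();
        } catch (RuntimeException e) {
            return Response.serverError().entity(e.getMessage()).build();
        }
    }

    private void runPortfolioMetricsUpdate() {
        // In a real implementation this would invoke the PortfolioMetricsUpdateTask
        // logic directly instead of dispatching an event to the in-memory bus.
    }
}
```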

Another option: Use different "entrypoints" in the DT application that make it possible to run instances that execute specific tasks. The container image of DT could be reused in CronJobs.

  • Would work well with Spring or Quarkus, but in Alpine we can't really hook into the application's main method
  • Would allow us to "block" without using resources of the REST API

Complication: Some tasks are scheduled AND can be triggered via UI. If we rely on k8s CronJobs, we only get concurrent execution guarantees for the job, not for manual triggers from within the application. Database-level locking will be required.
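One possible shape for such a lock, assuming a PostgreSQL advisory lock (this is not something DT implements today; ShedLock or a dedicated lock table would be alternatives):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch: guard a singleton task with a session-level PostgreSQL
// advisory lock so that a CronJob-triggered run and a manual UI/API trigger
// cannot execute the same task concurrently.
final class SingletonTaskGuard {

    // Arbitrary application-defined lock key for the portfolio metrics task.
    private static final long PORTFOLIO_METRICS_LOCK_KEY = 375_001L;

    static boolean runExclusively(Connection connection, Runnable task) throws SQLException {
        if (!tryLock(connection, PORTFOLIO_METRICS_LOCK_KEY)) {
            return false; // another instance already holds the lock; skip this run
        }
        try {
            task.run();
            return true;
        } finally {
            unlock(connection, PORTFOLIO_METRICS_LOCK_KEY);
        }
    }

    private static boolean tryLock(Connection connection, long key) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT pg_try_advisory_lock(?)")) {
            ps.setLong(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getBoolean(1);
            }
        }
    }

    private static void unlock(Connection connection, long key) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT pg_advisory_unlock(?)")) {
            ps.setLong(1, key);
            ps.execute();
        }
    }
}
```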

Note
We should use startingDeadlineSeconds to avoid jobs piling up over time if they take too long to execute or are retried for too long. We also need to pay attention to the backoff settings of the job template, see: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

nscuro (Member, Author) commented Mar 7, 2023

Subject to further research: Hazelcast would be able to cater to both distributed caching and distributed task execution requirements. Investigate how well it would integrate with Alpine, and how seamless switching between single- and multi-node deployments is.

@nscuro nscuro changed the title Achieve horizontal scalability Achieve horizontal scalability of API server Mar 10, 2023
@nscuro nscuro transferred this issue from DependencyTrack/hyades-apiserver Mar 10, 2023
@nscuro nscuro added component/api-server p2 Non-critical bugs, and features that help organizations to identify and reduce risk architecture size/XL Higher effort labels Mar 10, 2023
nscuro (Member, Author) commented May 5, 2023

Update on DataNucleus L2 Cache

We have disabled the L2 cache in the majority of places where background processing is done. A caching mechanism that caches all objects written to the datastore does not scale, especially not under heavy load. Uploading 100 BOMs meant that 100 projects and all of their components were stored in memory. Unless we expect users to provision VMs with >32 GB of RAM, L2 caching is not an option. The costs outweigh the benefits here, even for single-node deployments.

The L2 cache should be disabled globally. A distributed cache is undesirable as it couples availability of the API server to yet another external system. We still have the option to enable it though, so if at some point we find we need it, it would be easy to integrate.
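For reference, disabling the DataNucleus L2 cache globally boils down to a single persistence property. The sketch below shows the idea; the actual wiring happens in Alpine rather than in application code like this, and the factory class is an illustrative assumption:

```java
import java.util.HashMap;
import java.util.Map;
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManagerFactory;

class NoL2CachePmfFactory {

    // Creates a PersistenceManagerFactory with the L2 cache turned off entirely.
    // Datastore connection properties are assumed to be provided by the caller.
    static PersistenceManagerFactory create(Map<String, Object> connectionProps) {
        Map<String, Object> props = new HashMap<>(connectionProps);
        props.put("javax.jdo.PersistenceManagerFactoryClass",
                "org.datanucleus.api.jdo.JDOPersistenceManagerFactory");
        props.put("datanucleus.cache.level2.type", "none"); // no L2 cache at all
        return JDOHelper.getPersistenceManagerFactory(props);
    }
}
```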

Update on Lucene Search Indexes

The indexes are currently only used for the /api/v1/search endpoint, which is not called by the UI. It did not receive much love in the recent past and has limited use (DependencyTrack/dependency-track#2310 (comment)). Potentially breaking the API for the sake of a better implementation would likely be OK.

A dedicated search engine like Solr or Elasticsearch would be great, but the risk of dual writes and all sorts of data inconsistencies would be quite high. Kafka may be able to help (keyword: CDC), but that again introduces a lot of complexity.

@nscuro nscuro pinned this issue May 7, 2023
@mehab mehab added p1 Critical bugs that prevent DT from being used, or features that must be implemented ASAP and removed p2 Non-critical bugs, and features that help organizations to identify and reduce risk labels Jul 7, 2023
syalioune (Contributor) commented:
Great work folks! It's hard to catch up after so much time away from DT.

I think you've made the right decisions regarding the Lucene indexes: trying to maintain dual systems or propagate changes through Kafka would introduce more problems than it would resolve, especially considering the limited benefits of this feature. If anything, for the future I would advise looking at solutions like ZomboDB, where consistency between the RDBMS and the search engine is managed by the RDBMS itself. If only it weren't specific to PostgreSQL.
Leveraging the FTS capabilities of the RDBMS would be better, but that requires expertise in fine-tuning indexes and the like.

One question that springs to mind, and that has always been bugging me with DT scheduled tasks: how would observability be managed (especially with an event-based architecture)?

  • Historical view of executed tasks
  • Status of each task execution and access to failure description if any
  • Next execution time

nscuro (Member, Author) commented Jul 17, 2023

Thanks @syalioune!

One question that springs to mind, and that has always been bugging me with DT scheduled tasks: how would observability be managed (especially with an event-based architecture)?

  • Historical view of executed tasks
  • Status of each task execution and access to failure description if any
  • Next execution time

Agreed, this is a challenge. What I think would be required for addressing all of these points is some form of workflow engine that would orchestrate task executions based on events (like the Camunda / Zeebe platform we both know :p).

However, I am not yet sold on implementing a fully-fledged workflow engine in DT, or even integrating a 3rd party one.

For the short to mid term, we came to the conclusion that keeping high-level workflow state will be sufficient (see #664 for details). This is mostly required to support the CI/CD use case of uploading BOMs and waiting for their processing to complete.

We would love your feedback on that if you ever get the chance to look into it!

syalioune (Contributor) commented:
We would love your feedback on that if you ever get the chance to look into it!

I had a sneak peek today and it looks awesome so far. I'll look at it more deeply this weekend and provide more elaborate feedback 😉

nscuro (Member, Author) commented Sep 28, 2023

I'm going to close this issue. The API server has been running in a multi-replica setup in a production environment for almost two months now. The original ask of this issue is fulfilled.

This is not to say that work on this is done; there are certainly more than enough areas still to improve on.

@nscuro nscuro closed this as completed Sep 28, 2023
@nscuro nscuro unpinned this issue Sep 28, 2023
nscuro added a commit to nscuro/dependency-track that referenced this issue Oct 25, 2024
Currently, DataNucleus will put all objects into the L2 cache. Given the volume of objects being processed by DT, this behavior quickly adds up to enormous cache sizes. Users continue to be bamboozled by DT's memory requirements, which for a large part are driven by the wasteful L2 caching.

While working on DependencyTrack#4305, it became obvious that the hit rates of the cache are absolutely dwarfed by the high rate of misses. Storing such large volumes of objects in RAM is simply not justified if hit rates are that low.

Disabling the L2 cache solves a lot of recurring issues we and users are facing. If we want to introduce caching again in the future, we should do it in targeted areas, and preferably not directly in the persistence layer.

We disabled the L2 cache in Hyades a long time ago, and it has worked out very well for us. It was a precondition to make the API server horizontally scalable. Some more context:

* DependencyTrack/hyades#375 (comment)
* DependencyTrack/hyades#576

Supersedes DependencyTrack#4305

Signed-off-by: nscuro <[email protected]>