Achieve horizontal scalability of API server #375

Closed
nscuro opened this issue Feb 24, 2023 · 8 comments
Labels: architecture, component/api-server, p1 (Critical bugs that prevent DT from being used, or features that must be implemented ASAP), size/XL (Higher effort)

Comments

nscuro (Member) commented Feb 24, 2023

Current Behavior

The API server is currently not horizontally scalable, for various reasons:

  • Search indexes being read from and written to on local disk
  • TaskExecutors for event subscribers consuming from an in-memory queue
    • Some of this has been offloaded to Kafka
  • DataNucleus' L2 cache being in-memory
    • There are options available to use distributed caches, but perhaps disabling caches altogether is also an option
  • Keys for secret encryption, and JWT signing and verification are stored on disk
    • Not really an issue; Can be replaced with k8s secrets and mounted at runtime
      • We already do this for the secret key

Proposed Behavior

Ensure that the API server is horizontally scalable. It should be possible to run multiple active instances of it in parallel.

Checklist

nscuro (Member, Author) commented Mar 7, 2023

DataNucleus L2 Cache

Distributed Cache

I did some preliminary testing with distributed caching using the javax.cache option, Redisson, and Redis.

Given how many caching operations are currently performed, it quickly became evident that distributed L2 caching is not realistic with the current cache usage. Everything took considerably longer, and that was with a local Redis server; it will be a lot worse with remote Redis instances.

I will try to collect some actual numbers here to have some evidence.

Disable L2 Cache

Here's the thing: The L2 cache stores objects by their ID. If we don't do lookups by ID, we do not benefit from the cache at all. We mostly query objects by UUID (which is not the ID field), or other attributes (names etc.).

I suspect that we can either:

  • Disable the L2 cache
  • Severely limit cache usage, making distributed cache usage more viable
    • Set datanucleus.cache.level2.mode to ENABLE_SELECTIVE (default is DISABLE_SELECTIVE)
      • Instead of caching everything per default, opt-in to caching when actually needed / beneficial
    • Annotate to-be-cached classes and fields with @Cacheable (see the sketch after this list)
    • Disable L2 cache explicitly in cases where we perform a lot of inserts / updates (e.g. mirroring)
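A minimal sketch of what selective caching could look like with JDO annotations. The property name comes from the DataNucleus documentation; which classes would actually be good caching candidates is an assumption, not DT's current configuration:

```java
// Persistence configuration (passed to the PersistenceManagerFactory):
//   datanucleus.cache.level2.mode=ENABLE_SELECTIVE
// With ENABLE_SELECTIVE, only classes explicitly marked as cacheable go into the L2 cache.

import javax.jdo.annotations.Cacheable;
import javax.jdo.annotations.PersistenceCapable;

@Cacheable("true") // opt in: frequently read, rarely written (assumption for illustration)
@PersistenceCapable
class RepositoryMetaComponent {
}

@Cacheable("false") // keep out of the cache: written in bulk during BOM processing
@PersistenceCapable
class Component {
}
```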

Comparing performance with L2 cache enabled vs. disabled has priority. Memory usage is significantly reduced without it. If we don't get a good benefit out of caching, then the increase in heap usage is simply not justified.

Alpine 2.2.1 / 2.3.0 is capable of exposing metrics of the DN caches: stevespringett/Alpine#471

DN doesn't expose information of hit rates though. To measure hit rates, we'll have to use an external cache that exposes such information.

nscuro (Member, Author) commented Mar 7, 2023

Task Scheduling / Processing

The remaining tasks that are still scheduled and processed internally are:

| Name | DB Read / Write | (Default) Interval | Notes |
|------|-----------------|--------------------|-------|
| BomUploadProcessingTask | R+W | n/a | - |
| VexUploadProcessingTask | R+W | n/a | - |
| CloneProjectTask | R+W | n/a | Manual trigger via "Add version" for a project in the UI or via API |
| LdapSyncTask | R+W | 6h | - |
| EpssMirrorTask | R+W | 24h | - |
| GitHubAdvisoryMirrorTask | n/a | 24h | Only emits event to Kafka topic |
| NistMirrorTask | n/a | 24h | Only emits event to Kafka topic |
| OsvMirrorTask | n/a | 24h | Only emits event to Kafka topic |
| VulnDbSyncTask | n/a | 24h | Only emits event to Kafka topic |
| PortfolioMetricsUpdateTask | R+W | 1h | Manual trigger via UI or API |
| ProjectMetricsUpdateTask | R+W | 1h | Called by PortfolioMetricsUpdateTask; triggered after vuln analysis and policy eval; manual trigger via UI or API |
| VulnerabilityMetricsUpdateTask | R+W | 1h | - |
| ComponentMetricsUpdateTask | R+W | 1h | Called by ProjectMetricsUpdateTask; triggered after vuln analysis and policy eval; manual trigger via UI or API |
| InternalComponentIdentificationTask | R+W | 6h | Manual trigger via admin panel |
| DefectDojoUploadTask | R | 1h | - |
| FortifySscUploadTask | R | 1h | - |
| KennaSecurityUploadTask | R | 1h | - |
| RepositoryMetaAnalyzerTask | R | 24h | Recurring for portfolio; triggered for a project after BOM upload |
| VulnerabilityAnalysisTask | R | 24h | Recurring for portfolio; triggered for a project after BOM upload; manual trigger for projects via UI or API ("Reanalyze" button) |
| IndexTask | R | 3h | - |
| CallbackTask | n/a | n/a | Internal multi-purpose task for task chaining |
| NewVulnerableDependencyAnalysisTask | R | n/a | Internal task executed after vuln analysis for a BOM upload to figure out which new dependencies are vulnerable |

Singleton = Only one instance of the task should run at any given time

Publishing all events that trigger these tasks to Kafka is not an option, as it would dramatically increase the number of topics, partitions, and consequently the number of Kafka Streams threads the API server has to run. It would also be very inefficient to have dedicated consumer threads for every single task type, as they would mostly run idle.


Fault tolerance

All tasks are executed via an in-memory event bus. If they fail, there is no DLQ or retry mechanism; errors are only logged.
Tasks enqueued on the event bus are lost when the application crashes or restarts, see: https://github.com/DependencyTrack/hyades/blob/main/WTF.md#limitations

System load

Certain tasks are designed as singletons due to their high impact on load.

PortfolioMetricsUpdateTask triggers metrics updates for all projects and their components. Project updates are executed concurrently to speed up the process, with the level of concurrency bound to the number of available CPU cores. For large portfolios this takes a while and puts noticeable load on the DB server.

Concurrent execution of such heavy tasks should be avoided. However, if they do end up running concurrently, it's not an end-of-the-world scenario.

Data Consistency

All tasks use short-lived database transactions. If a task or the entire application dies while in a transaction, the transaction is never committed and consistency is preserved.

Metrics updates have recently been migrated to stored procedures (DependencyTrack/hyades-apiserver#149). Each stored procedure invocation executes in its own transaction. The overall flow is as follows (a code sketch follows the list):

  1. Fetch first X projects
  2. For each project:
    • Fetch first X components
      • For each component:
        • Execute UPDATE_COMPONENT_METRICS procedure (in its own transaction)
      • Fetch next X components, repeat until no more components
    • Execute UPDATE_PROJECT_METRICS procedure (in its own transaction)
  3. Fetch next X projects, repeat until no more projects
  4. Execute UPDATE_PORTFOLIO_METRICS procedure (in its own transaction)
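A rough sketch of that loop in code, assuming a hypothetical MetricsDao abstraction and page size (DT's actual persistence layer differs; the procedure names match the description above):

```java
import java.util.List;
import java.util.UUID;

// Hypothetical DAO; each call*Procedure() method is assumed to run the stored
// procedure in its own short-lived transaction, as described above.
interface MetricsDao {
    List<UUID> fetchProjects(long offset, int limit);
    List<UUID> fetchComponents(UUID projectUuid, long offset, int limit);
    void callUpdateComponentMetrics(UUID componentUuid); // UPDATE_COMPONENT_METRICS
    void callUpdateProjectMetrics(UUID projectUuid);     // UPDATE_PROJECT_METRICS
    void callUpdatePortfolioMetrics();                   // UPDATE_PORTFOLIO_METRICS
}

class PortfolioMetricsLoop {
    static void run(MetricsDao dao) {
        final int pageSize = 500; // "X" from the description; the value is an assumption
        long projectOffset = 0;
        List<UUID> projects;
        while (!(projects = dao.fetchProjects(projectOffset, pageSize)).isEmpty()) {
            for (UUID project : projects) {
                long componentOffset = 0;
                List<UUID> components;
                while (!(components = dao.fetchComponents(project, componentOffset, pageSize)).isEmpty()) {
                    for (UUID component : components) {
                        dao.callUpdateComponentMetrics(component); // own transaction
                    }
                    componentOffset += components.size();
                }
                dao.callUpdateProjectMetrics(project); // own transaction
            }
            projectOffset += projects.size();
        }
        dao.callUpdatePortfolioMetrics(); // own transaction
    }
}
```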

If the task dies before it reaches UPDATE_PORTFOLIO_METRICS, it is possible that the portfolio metrics do not correctly reflect the sum of all project metrics. However, this is also the case while the task is running, as project metrics are updated before portfolio metrics.


Simple solution:

  1. Perform scheduling externally (via K8s CronJob), just curling an API endpoint to trigger tasks
    • Needs AuthN / AuthZ; Needs dedicated trigger endpoints
  2. For scheduled tasks that should only run once, and are performing DB write operations (e.g. PortfolioMetricsUpdateTask), implement a locking mechanism to ensure there can't be multiple instances running concurrently
    • Even though DB transactions are used, for long-running tasks it's hard to guarantee consistency if they fail in between
    • Conceptually we need something like ShedLock in Spring
    • What to do with tasks blocked by the lock? Drop? Reschedule?

  • Making the tasks blocking instead of event-based; a K8s CronJob can prevent them from running in parallel
    • What happens when K8s kills the pod that is running the command that invokes the REST API endpoint? Does the task continue to execute or does it stop immediately?
    • As long as K8s gracefully shuts down the CronJob pod, and the HTTP connection to the API server is correctly closed, the task should also just stop and not leave the API server / DB in an inconsistent state
    • DB transactions should prevent data being left in an inconsistent state

MVP: Introduce an API endpoint that executes a long-running task (PortfolioMetricsUpdateTask) and blocks while doing it. Invoke the endpoint via a K8s CronJob and see what happens; kill the Job pod in the middle of processing, etc.
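A minimal sketch of what such a blocking trigger endpoint could look like (JAX-RS; the path, class name, and wiring are assumptions for illustration, not an existing DT endpoint):

```java
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

// Hypothetical resource: runs the portfolio metrics update synchronously so an
// external scheduler (e.g. a K8s CronJob running curl) can observe success or
// failure through the HTTP status code instead of fire-and-forget events.
@Path("/v1/task/portfolio-metrics")
public class PortfolioMetricsTriggerResource {

    @POST
    public Response triggerPortfolioMetricsUpdate() {
        try {
            runPortfolioMetricsUpdate(); // blocks until the task has completed
            return Response.noContent().build();
        } catch (RuntimeException e) {
            return Response.serverError().entity(e.getMessage()).build();
        }
    }

    private void runPortfolioMetricsUpdate() {
        // In a real implementation this would invoke the PortfolioMetricsUpdateTask
        // logic directly instead of dispatching an event to the in-memory bus.
    }
}
```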

Another option: Use different "entrypoints" in the DT application that make it possible to run instances that execute specific tasks. The container image of DT could be reused in CronJobs.

  • Would work well with Spring or Quarkus, but in Alpine we can't really hook into the application's main method
  • Would allow us to "block" without using resources of the REST API

Complication: Some tasks are scheduled AND can be triggered via UI. If we rely on k8s CronJobs, we only get concurrent execution guarantees for the job, not for manual triggers from within the application. Database-level locking will be required.
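One possible shape for such a lock, assuming a PostgreSQL advisory lock (this is not something DT implements today; ShedLock or a dedicated lock table would be alternatives):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch: guard a singleton task with a session-level PostgreSQL
// advisory lock so that a CronJob-triggered run and a manual UI/API trigger
// cannot execute the same task concurrently.
final class SingletonTaskGuard {

    // Arbitrary application-defined lock key for the portfolio metrics task.
    private static final long PORTFOLIO_METRICS_LOCK_KEY = 375_001L;

    static boolean runExclusively(Connection connection, Runnable task) throws SQLException {
        if (!tryLock(connection, PORTFOLIO_METRICS_LOCK_KEY)) {
            return false; // another instance already holds the lock; skip this run
        }
        try {
            task.run();
            return true;
        } finally {
            unlock(connection, PORTFOLIO_METRICS_LOCK_KEY);
        }
    }

    private static boolean tryLock(Connection connection, long key) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT pg_try_advisory_lock(?)")) {
            ps.setLong(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getBoolean(1);
            }
        }
    }

    private static void unlock(Connection connection, long key) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement("SELECT pg_advisory_unlock(?)")) {
            ps.setLong(1, key);
            ps.execute();
        }
    }
}
```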

Note
We should use startingDeadlineSeconds to avoid jobs piling up over time if they take too long to execute or are retried for too long. We also need to pay attention to the backoff settings of the job template, see: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

nscuro (Member, Author) commented Mar 7, 2023

Subject to further research: Hazelcast would be able to cater to both distributed caching and distributed task execution requirements. Investigate how well it would integrate with Alpine, and how seamless switching between single- and multi-node deployments is.

@nscuro nscuro changed the title Achieve horizontal scalability Achieve horizontal scalability of API server Mar 10, 2023
@nscuro nscuro transferred this issue from DependencyTrack/hyades-apiserver Mar 10, 2023
@nscuro nscuro added component/api-server p2 Non-critical bugs, and features that help organizations to identify and reduce risk architecture size/XL Higher effort labels Mar 10, 2023
nscuro (Member, Author) commented May 5, 2023

Update on DataNucleus L2 Cache

We have disabled the L2 cache in the majority of places where background processing is done. A caching mechanism that caches all objects written to the datastore does not scale, especially not under heavy load. Uploading 100 BOMs meant that 100 projects and all of their components were stored in memory. Unless we expect users to provision VMs with >32 GB of RAM, L2 caching is not an option. The costs outweigh the benefits here, even for single-node deployments.

The L2 cache should be disabled globally. A distributed cache is undesirable as it couples availability of the API server to yet another external system. We still have the option to enable it though, so if at some point we find we need it, it would be easy to integrate.
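For reference, disabling the DataNucleus L2 cache globally boils down to a single persistence property. The sketch below shows the idea; the actual wiring happens in Alpine rather than in application code like this, and the factory class is an illustrative assumption:

```java
import java.util.HashMap;
import java.util.Map;
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManagerFactory;

class NoL2CachePmfFactory {

    // Creates a PersistenceManagerFactory with the L2 cache turned off entirely.
    // Datastore connection properties are assumed to be provided by the caller.
    static PersistenceManagerFactory create(Map<String, Object> connectionProps) {
        Map<String, Object> props = new HashMap<>(connectionProps);
        props.put("javax.jdo.PersistenceManagerFactoryClass",
                "org.datanucleus.api.jdo.JDOPersistenceManagerFactory");
        props.put("datanucleus.cache.level2.type", "none"); // no L2 cache at all
        return JDOHelper.getPersistenceManagerFactory(props);
    }
}
```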

Update on Lucene Search Indexes

The indexes are currently only used for the /api/v1/search endpoint, which is not called by the UI. It did not receive much love in the recent past and has limited use (DependencyTrack/dependency-track#2310 (comment)). Potentially breaking the API for the sake of a better implementation would likely be OK.

A dedicated search engine like Solr or Elasticsearch would be great, but the risk of dual writes and all sorts of data inconsistencies would be quite high. Kafka may be able to help (keyword: CDC), but that again introduces a lot of complexity.

@nscuro nscuro pinned this issue May 7, 2023
@mehab mehab added p1 Critical bugs that prevent DT from being used, or features that must be implemented ASAP and removed p2 Non-critical bugs, and features that help organizations to identify and reduce risk labels Jul 7, 2023
syalioune (Contributor) commented:
Great work folks! It's hard to catch up after so much time away from DT.

I think you've made the right decisions regarding the Lucene indexes: trying to maintain dual systems or propagate changes through Kafka would introduce more problems than it would resolve, especially considering the limited benefits of this feature. If anything, for the future I would advise looking at solutions like ZomboDB, where consistency between the RDBMS and the search engine is managed by the RDBMS itself. If only it weren't specific to PostgreSQL.
Leveraging the FTS capabilities of the RDBMS would be better, but that requires expertise in fine-tuning indexes and the like.

One question that springs to mind, and that has always been bugging me with DT scheduled tasks: how would observability be managed (especially with an event-based architecture)?

  • Historical view of executed tasks
  • Status of each task execution and access to failure description if any
  • Next execution time

nscuro (Member, Author) commented Jul 17, 2023

Thanks @syalioune!

One question that springs to mind, and that has always been bugging me with DT scheduled tasks: how would observability be managed (especially with an event-based architecture)?

  • Historical view of executed tasks
  • Status of each task execution and access to failure description if any
  • Next execution time

Agreed, this is a challenge. What I think would be required for addressing all of these points is some form of workflow engine that would orchestrate task executions based on events (like the Camunda / Zeebe platform we both know :p).

However, I am not yet sold on implementing a fully-fledged workflow engine in DT, or even integrating a 3rd party one.

For the short to mid term, we came to the conclusion that keeping high-level workflow state will be sufficient (see #664 for details). This is mostly required to support the CI/CD use case of uploading BOMs and waiting for their processing to complete.

We would love your feedback on that if you ever get the chance to look into it!

syalioune (Contributor) commented:
We would love your feedback on that if you ever get the chance to look into it!

I had a sneak peek today and it looks awesome so far. I'll look at it more deeply this weekend and provide more elaborate feedback 😉

nscuro (Member, Author) commented Sep 28, 2023

I'm going to close this issue. The API server has been running in a multi-replica setup in a production environment for almost two months now. The original ask of this issue is fulfilled.

This is not to say that work on this is done; there are certainly more than enough areas still to improve on.

@nscuro nscuro closed this as completed Sep 28, 2023
@nscuro nscuro unpinned this issue Sep 28, 2023
nscuro added a commit to nscuro/dependency-track that referenced this issue Oct 25, 2024
Currently, DataNucleus will put all objects into the L2 cache. Given the volume of objects being processed by DT, this behavior quickly adds up to enormous cache sizes. Users continue to be bamboozled by DT's memory requirements, which for a large part are driven by the wasteful L2 caching.

While working on DependencyTrack#4305, it became obvious that the hit rates of the cache are absolutely dwarfed by the high rate of misses. Storing such large volumes of objects in RAM is simply not justified if hit rates are that low.

Disabling the L2 cache solves a lot of recurring issues we and users are facing. If we want to introduce caching again in the future, we should do it in targeted areas, and preferably not directly in the persistence layer.

We disabled the L2 cache in Hyades a long time ago, and it has worked out very well for us. It was a precondition to make the API server horizontally scalable. Some more context:

* DependencyTrack/hyades#375 (comment)
* DependencyTrack/hyades#576

Supersedes DependencyTrack#4305

Signed-off-by: nscuro <[email protected]>