Externalize BOM ingestion pipeline #633

nscuro · 2023-06-27T11:40:06Z

At the moment, uploaded BOMs are processed entirely in memory.

BomUploadProcessingTasks are enqueued to the internal task queue (see https://github.com/DependencyTrack/hyades/blob/main/WTF.md#why), and processed by the EventService thread pool.

The current design has some downsides:

- When the API server crashes or is stopped, queued tasks are lost. Users have to re-upload their BOMs after the API server is restarted, which may or may not be practical, depending on their workflows.
- Processing cannot be shared among multiple instances of the API server.
- We do not store the original BOM, because it is not practical to store many large documents in an RDBMS.
- If BOMs are re-uploaded for the same project in close succession, we'll run into race conditions, as BOM ingestion is not executed in a single, large DB transaction.

The proposed enhancement involves storage of uploaded BOMs in a Koala-compatible system (e.g. the CycloneDX BOM Repository Server), and publishing "BOM uploaded" events to Kafka. Consumers (API server or specialized workers) consume from the Kafka topic, and perform the actual ingestion into the database.

- Processing could then happen in a distributed fashion, by one or more API server instances (or even specialized workers).
- Uploaded BOMs would not be lost if an API server instance is stopped or crashes.
- Original BOMs would be retained, and could be referenced by the DT UI.
  - See BOM Retention Policy dependency-track#877
- Because events are keyed by project UUID, Kafka's consumer model would ensure that BOM uploads for the same project are processed serially.
- It would allow for policies on original BOMs.
  - See Policy: Add support for BOM in policy conditions dependency-track#773
- If Koala were able to accept SPDX but return CycloneDX, we could implement official support for SPDX ingestion.
  - See Add support for SPDX v3 dependency-track#1746
- It aids in achieving horizontal scalability of the API server (#375).

sequenceDiagram
    Client->>+API Server: Upload BOM
    API Server->>Koala: Upload BOM
    Koala->>Koala: Validate BOM
    Koala->>API Server: Location of BOM in Koala (URL)
    API Server->>API Server: Generate and persist correlation ID<br/>identifying the upload
    API Server->>Kafka: Publish event to "BOM uploaded" topic
    Note over API Server, Kafka: Key=Project UUID<br/>Value=Koala URL<br/>Header=Correlation ID
    API Server->>Client: Report correlation ID
    loop continuously
        API Server->>Kafka: Consume from "BOM uploaded" topic
        loop for each event
            API Server->>Koala: Fetch BOM
            API Server->>API Server: Process BOM
            alt processing failed
                API Server->>API Server: Update status of upload in DB to "failed"
                API Server->>Kafka: Publish event to "BOM Processing failed" topic
            else processing succeeded
                API Server->>API Server: Update status of upload in DB to "successful"
                API Server->>Kafka: Publish event to "BOM Processed" topic
                API Server->>API Server: Trigger vuln analysis etc.
            end
        end
    end
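
To make the event shape from the diagram concrete, here is a minimal sketch of a "BOM uploaded" record and of why keying by project UUID keeps uploads for one project in order. The names `BomUploadedEvent`, `new_event`, `partition_for`, the `correlation-id` header name, and the example URLs are all hypothetical. The hash is a stand-in too: Kafka's default partitioner uses murmur2 on the key bytes, not CRC32, but the principle (same key, same partition) is the same.

```python
import uuid
import zlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BomUploadedEvent:
    """Hypothetical record shape, mirroring the note in the diagram."""
    key: str                                      # project UUID
    value: str                                    # location of the BOM (URL)
    headers: dict = field(default_factory=dict)   # carries the correlation ID


def new_event(project_uuid: str, bom_url: str) -> BomUploadedEvent:
    # A fresh correlation ID identifies this particular upload.
    correlation_id = str(uuid.uuid4())
    return BomUploadedEvent(
        key=project_uuid,
        value=bom_url,
        headers={"correlation-id": correlation_id},
    )


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in for Kafka's key-based partitioning (assumption: CRC32
    # instead of murmur2). Equal keys always map to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Since a partition is consumed by at most one member of a consumer group at a time, two uploads for the same project land in the same partition and are handled one after the other.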

Warning
Because processing huge BOMs can take several minutes, it is not viable to perform it in Kafka Streams, where short processing times are mandatory (see #529). We need to either offload processing to a separate thread pool, or write custom logic around the low-level Kafka consumer.
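
As a rough illustration of the "separate thread pool" option, the sketch below hands long-running work to worker threads while keeping tasks for the same key in order. `KeyedWorkerPool` is a hypothetical helper, not something that exists in the code base, and it leaves out everything Kafka-specific (polling, offset commits, rebalancing).

```python
import zlib
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


class KeyedWorkerPool:
    """Hypothetical helper: runs tasks off the polling thread, in order per key."""

    def __init__(self, lanes: int = 4) -> None:
        # One single-threaded executor per lane keeps tasks within a lane ordered.
        self._lanes = [ThreadPoolExecutor(max_workers=1) for _ in range(lanes)]

    def submit(self, key: str, task: Callable[..., Any], *args: Any) -> Future:
        # The same key always hashes to the same lane, so uploads for one
        # project never run concurrently or out of order.
        index = zlib.crc32(key.encode("utf-8")) % len(self._lanes)
        return self._lanes[index].submit(task, *args)

    def shutdown(self) -> None:
        # Wait for all queued tasks to finish before returning.
        for lane in self._lanes:
            lane.shutdown(wait=True)
```

A real consumer built this way would presumably also have to keep polling while tasks are in flight and only commit an offset once the corresponding future has completed. Those parts are deliberately not shown here.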

We need to look into proper AuthN / AuthZ for the Koala service. The BOM repo server does not have those built-in.

From the client's perspective, existing workflows should continue to work:

sequenceDiagram
    Client->>+API Server: Upload BOM
    API Server->>Client: Report correlation ID
    loop continuously
        Client->>API Server: Is BOM still being processed?<br/>(Using correlation ID)
        alt processing ongoing
            API Server->>Client: "true"
        else processing completed
            API Server->>Client: "false"
            Client->>Client: Stop polling
        end
    end
    Client->>API Server: Fetch findings
    Client->>API Server: Fetch policy violations
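
The client-side loop in the diagram could look roughly like the sketch below. `wait_for_processing` and its parameters are hypothetical. The `is_processing` callable stands in for whatever request the client uses to ask the API server whether the upload identified by the correlation ID is still being processed.

```python
import time
from typing import Callable


def wait_for_processing(
    correlation_id: str,
    is_processing: Callable[[str], bool],
    poll_interval_s: float = 1.0,
    timeout_s: float = 300.0,
) -> bool:
    """Poll until processing has completed.

    Returns True once the server reports that processing is no longer
    ongoing, or False if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not is_processing(correlation_id):
            return True   # "false" from the server: stop polling
        time.sleep(poll_interval_s)
    return False
```

Once this returns `True`, the client would go on to fetch findings and policy violations, as in the last two steps of the diagram. The timeout is an addition for safety; the diagram itself polls indefinitely.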


nscuro · 2023-07-11T12:19:13Z

Decoupled the centralized tracking of processing status(es) into #664. Deprioritizing this issue.

nscuro · 2024-07-18T11:56:36Z

Project Koala / Transparency Exchange API will not arrive anytime soon, so we need to evaluate alternatives.

I don't think we should introduce a generic blob storage for this yet. Instead, we might want to consider storing uploaded BOMs in a new table, in a BYTEA column.

BOMs can be arbitrarily large. While Postgres compresses large values, we still need to send all that data over the wire twice (once for storage, once for retrieval). The default compression (pglz) is also not particularly good.

We already bring in zstd-jni via kafka-clients. I did some testing and it is possible to compress a ~22MB JSON BOM to ~1MB with reasonable resource consumption using zstd. I would thus propose that we perform compression/decompression in the application.
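
A minimal sketch of application-side compression around the storage step is shown below. It uses `zlib` from the Python standard library purely as a stand-in so the example is self-contained; the proposal above is to use zstd via zstd-jni, and the ratios it reports do not carry over to this code. `compress_bom` and `decompress_bom` are hypothetical names.

```python
import zlib


def compress_bom(bom_bytes: bytes, level: int = 6) -> bytes:
    # Compress before writing to the BYTEA column, so that less data
    # crosses the wire. Stand-in codec: zlib instead of zstd.
    return zlib.compress(bom_bytes, level)


def decompress_bom(stored: bytes) -> bytes:
    # Reverse step after reading the column, before parsing the BOM.
    return zlib.decompress(stored)
```

With this shape, the database only ever sees the compressed bytes, and the choice of codec stays an application detail.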


nscuro added the enhancement (New feature or request), architecture, size/L (High effort), and component/api-server labels Jun 27, 2023

This was referenced May 7, 2023

Identify feature gaps with vanilla Dependency-Track #373

Closed

Find a way to include original BOMs (or references to them) in notifications #646

Open

nscuro added this to Hyades Jun 30, 2023

nscuro moved this to Todo in Hyades Jun 30, 2023

nscuro added the size/XL (Higher effort) label and removed the size/L (High effort) label Jul 7, 2023

nscuro mentioned this issue Jul 11, 2023

Track workflow state of BOM processing / analysis in database to make it accessible to multiple API server instances #664

Closed

7 tasks

nscuro removed this from Hyades Jul 11, 2023

nscuro added the p3 (Nice-to-have features) label and removed the p2 (Non-critical bugs, and features that help organizations to identify and reduce risk) label Jul 11, 2023

nscuro added this to the 0.6.0 milestone Jul 18, 2024

This was referenced Jul 22, 2024

Externalize BOM ingestion pipeline DependencyTrack/hyades-apiserver#794

Draft

Add config options for BOM upload storage DependencyTrack/hyades-frontend#101

Closed

nscuro self-assigned this Aug 2, 2024

nscuro added a commit that referenced this issue Aug 5, 2024

Add e2e test for S3 BOM upload storage

1b770ea

Relates to #633. Signed-off-by: nscuro

nscuro mentioned this issue Aug 5, 2024

Add e2e test for BOM upload storage #1432

Open

3 tasks

nscuro modified the milestones: 0.6.0, 0.7.0 Aug 22, 2024

nscuro added a commit that referenced this issue Sep 20, 2024

Add e2e test for S3 BOM upload storage

77b6bcb

Relates to #633. Signed-off-by: nscuro

nscuro mentioned this issue Oct 2, 2024

GA Roadmap #860

Open

34 tasks

nscuro added this to Hyades Oct 19, 2024
