`diff` panics while diffing two files #74

SimoneLazzaris · 2024-03-20T16:37:18Z

Hi,
I'm trying to compare two SBoMs generated with two different versions of trivy.

sbom-utils thinks hard for a bit and then panics with panic: runtime error: slice bounds out of range [2004:1743]

Here are the files:

nats-box-49.sbom.json
nats-box-50.sbom.json

And this is the command line I've used:

sbom-utility diff --input-file nats-box-49.sbom.json --input-revision nats-box-50.sbom.json

The text was updated successfully, but these errors were encountered:

mrutkows · 2024-03-27T14:58:40Z

@SimoneLazzaris will try to repro. as soon as I can which may not be until next week. For sanity, can you run using the trace (--trace or -t) or even debug (--debug) and capture the full stdout/stderr here?

mrutkows · 2024-04-01T18:46:57Z

@SimoneLazzaris Just wanted to let you know I just was able to try this out on my local machine and saw the indications of some infinite loop (not waiting for a seg. violation and killed it) given that the files are relatively small in size. Having validated both and run other commands against its contents, it is my fear that the error may lie in one of the imported libraries which may take some time to pinpoint (and even more time to perhaps fix upstream if possible).

mrutkows · 2024-04-01T19:59:44Z

@SimoneLazzaris Comparing these 2 "Trivy" SBOM files using an online, general text diff tool, it finds 2694 removals and 8296 additions. I am sure that the underlying diff comparator is losing its mind and running out of memory looking for JSON objects matches between the two files (which uses deep hashes).

The warning for "diff" command use is that the files must be relatively similar... despite being binary image scans from Trivy from apparent image file minor point revisions (semantic image file versions) these files are completely dissimilar from one primary aspect:

order of components within JSON arrays appears to be completely mismatched.

The generalized "diff" libs used are not specific to any schema (SPDX, CycloneDX or any other JSON) and have no means to "normalize" the contents prior to comparison. In fact, normalization of JSON is only possible with custom knowledge of what makes each array entry (esp. for anonymous types) unique (i.e., a unique key or set of fields that create a unique key) to hash by reliably (relative to the JSON object).

In addition, within each component there are several Trivy custom properties that are identical across many components which makes comparison (similarity weighting) near impossible. For example:

    {
      "bom-ref": "1041129c-b3a8-4896-9ba4-cf92e58ed5d2",
      "type": "application",
      "name": "usr/local/bin/nsc",
      "properties": [
        {
          "name": "aquasecurity:trivy:Class",
          "value": "lang-pkgs"
        },
        {
          "name": "aquasecurity:trivy:Type",
          "value": "gobinary"
        }
      ]
    },
    {
      "bom-ref": "4ce1b5d8-fb7a-4506-9c92-ff2ca0de8e69",
      "type": "application",
      "name": "usr/local/bin/nats",
      "properties": [
        {
          "name": "aquasecurity:trivy:Class",
          "value": "lang-pkgs"
        },
        {
          "name": "aquasecurity:trivy:Type",
          "value": "gobinary"
        }
      ]
    },

where the similarity "score" would be high between these 2 components (with no knowledge the the only key in the case component objects was bom-ref.

This complexity is the reason why "merge" functions are not simple from any tool (even github commits of similar files) and require human (not yet AI) analysis to resolve "merge conflicts".

If a great deal of custom hashing code were written (which means a unique hash function per-object in the JSON schema), then normalization becomes more realistic. However, objects with any depth of nested objects increases the time necessary for deep comparisons of objects (as well as lots of hashing memory overhead).

mrutkows · 2024-04-01T20:28:26Z

All hope is not lost... as I said I planned on adding a "sort" function with knowledge of CycloneDX data schema structure (at least for top-level objects like components, services, and vulnerabilities for example. However, I cannot promise when I could begin such a feature, but would love help ;)

SimoneLazzaris · 2024-04-03T06:51:30Z

@mrutkows thanks for your effort. I know that mine was not a textbook example. I just find bad for the software to panic and wanted to report that.

mrutkows · 2024-04-03T13:10:32Z

The problem lies in the go-dff library which has many operations that look to create new "diff" representations by using slice text/string insertion into an existing slice (i.e., to create +/- prefixed strings with the difference text). Primarily the 2 I debugged were in the file: github.com/sergi/[email protected]/diffmatchpatch/patch.go and include functions:

patchMake2(text1 string, diffs []Diff) and specifically its logic statements as follows step outside slice bounds:

		case DiffDelete:
			patch.Length1 += len(aDiff.Text)
			patch.diffs = append(patch.diffs, aDiff)
			postpatchText = postpatchText[:charCount2] + postpatchText[charCount2+len(aDiff.Text):]

as well as another function PatchAddContext(patch Patch, text string) and the logic:

		pattern = text[patch.Start2 : patch.Start2+patch.Length1]

Short of a rewrite (as the lengths being passed in the patches are not being calculated properly, clearly) and perhaps the introduction of a "safe" slice reallocation routine, this will not be fixed anytime soon.

mrutkows · 2024-04-03T13:12:05Z

@SimoneLazzaris

@mrutkows thanks for your effort. I know that mine was not a textbook example. I just find bad for the software to panic and wanted to report that.

I can "catch" the panics; but, that really does not solve the underlying library I relied upon from working properly to produce a diff :(

I truly believe that normalizing the data (both files) will result in coherent/useful diff results (even using the faulty library), but it is a very complicated task in-and-of itself.

mrutkows · 2024-04-05T17:23:44Z

@SimoneLazzaris I managed to add "guard rails" to avoid the panics within the upstream file that was the source (of more than one) panic where they were accessing string slices past their current size/memory allocations...

The result is a large "patch" file that really has many large blocks of meaningless deltas:
diff.txt

The source code file I patched locally to avoid the panics is from the upstream library (i.e., "patch.go"); specifically, I added "if" tests before indexing into string slices in 2 places:
patch.go.txt

but again, even if I pushed this to upstream, the library has not been touched in like 6 years... and it absolutely masks other "bad" logic that is creating the leading to the bad character counting used to index into the slices that cause the panics.

mrutkows · 2024-04-05T19:52:15Z

@SimoneLazzaris checkout the cleaner output if such a panic is now caught:

Welcome to the sbom-utility! Version `latest` (sbom-utility) (darwin/arm64)
===========================================================================
[INFO] Loading (embedded) default schema config file: `config.json`...
[INFO] Loading (embedded) default license policy file: `license.json`...
[INFO] Reading file (--input-file): `nats-box-49.sbom.json` ...
[INFO] Reading file (--input-revision): `nats-box-50.sbom.json` ...
[INFO] Comparing files: `nats-box-49.sbom.json` (base) to `nats-box-50.sbom.json` (revised) ...
[ERROR] panic occurred: runtime error: slice bounds out of range [2004:1743]
[ERROR] diff failed: differences between files perhaps too large.

In addition, the exit code is now 1 (app. error).

mrutkows self-assigned this Apr 1, 2024

mrutkows added the bug Something isn't working label Apr 1, 2024

mrutkows added enhancement New feature or request help wanted Extra attention is needed and removed bug Something isn't working labels Apr 1, 2024

mrutkows mentioned this issue Apr 5, 2024

Catch "diff" command panics within external library deps. and exit gracefully #80

Merged

mrutkows added the bug Something isn't working label Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`diff` panics while diffing two files #74

`diff` panics while diffing two files #74

SimoneLazzaris commented Mar 20, 2024

mrutkows commented Mar 27, 2024

mrutkows commented Apr 1, 2024 •

edited

Loading

mrutkows commented Apr 1, 2024 •

edited

Loading

mrutkows commented Apr 1, 2024

SimoneLazzaris commented Apr 3, 2024

mrutkows commented Apr 3, 2024 •

edited

Loading

mrutkows commented Apr 3, 2024 •

edited

Loading

mrutkows commented Apr 5, 2024 •

edited

Loading

mrutkows commented Apr 5, 2024 •

edited

Loading

diff panics while diffing two files #74

diff panics while diffing two files #74

Comments

SimoneLazzaris commented Mar 20, 2024

mrutkows commented Mar 27, 2024

mrutkows commented Apr 1, 2024 • edited Loading

mrutkows commented Apr 1, 2024 • edited Loading

mrutkows commented Apr 1, 2024

SimoneLazzaris commented Apr 3, 2024

mrutkows commented Apr 3, 2024 • edited Loading

mrutkows commented Apr 3, 2024 • edited Loading

mrutkows commented Apr 5, 2024 • edited Loading

mrutkows commented Apr 5, 2024 • edited Loading

`diff` panics while diffing two files #74

`diff` panics while diffing two files #74

mrutkows commented Apr 1, 2024 •

edited

Loading

mrutkows commented Apr 1, 2024 •

edited

Loading

mrutkows commented Apr 3, 2024 •

edited

Loading

mrutkows commented Apr 3, 2024 •

edited

Loading

mrutkows commented Apr 5, 2024 •

edited

Loading

mrutkows commented Apr 5, 2024 •

edited

Loading