Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(spdx): use the hasExtractedLicensingInfos field for licenses that are not listed in the SPDX #8077

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

DmitriyLewen
Copy link
Contributor

@DmitriyLewen DmitriyLewen commented Dec 10, 2024

Description

For cases where the SPDX license list does not contain the detected license, we should use the hasExtractedLicensingInfos field.
See #7721 for more details.

Before:

➜ trivy -q image bitnami/wordpress:6.6.2-debian-12-r14 --format spdx-json --cache-backend memory | grep '"name": "libsasl2-2",' -A 8 
      "name": "libsasl2-2",
      "SPDXID": "SPDXRef-Package-eb685f3647d2fa92",
...
      "licenseConcluded": "BSD-4-Clause AND OpenSSL AND SSLeay AND BSD-3-Clause AND BSD-2-Clause AND GPL-3.0-or-later AND GPL-3.0-only AND BSD-4-Clause-UC AND RSA-MD AND BSD-4-clause-KTH AND text:--BSD-4-clause AND IBM-as-is AND BSD-3-clause-JANET AND BSD-3-clause-PADL AND MIT-OpenVision AND OpenLDAP AND FSFULLR AND MIT-CMU AND MIT-Export AND BSD-2.2-clause AND text:--IBM-as-is",
      "licenseDeclared": "BSD-4-Clause AND OpenSSL AND SSLeay AND BSD-3-Clause AND BSD-2-Clause AND GPL-3.0-or-later AND GPL-3.0-only AND BSD-4-Clause-UC AND RSA-MD AND BSD-4-clause-KTH AND text:--BSD-4-clause AND IBM-as-is AND BSD-3-clause-JANET AND BSD-3-clause-PADL AND MIT-OpenVision AND OpenLDAP AND FSFULLR AND MIT-CMU AND MIT-Export AND BSD-2.2-clause AND text:--IBM-as-is",

After:

➜  ./trivy -q image bitnami/wordpress:6.6.2-debian-12-r14 --format spdx-json --cache-backend memory | grep '"name": "libsasl2-2",' -A 8
      "name": "libsasl2-2",
      "SPDXID": "SPDXRef-Package-eb685f3647d2fa92",
...
      "licenseConcluded": "BSD-4-Clause AND OpenSSL AND LicenseRef-ea55a018594e6bf5 AND BSD-3-Clause AND BSD-2-Clause AND GPL-3.0-or-later AND GPL-3.0-only AND BSD-4-Clause-UC AND LicenseRef-65f52c543487922d AND LicenseRef-a02a4c35e67d69da AND LicenseRef-148e27c4af509b1f AND LicenseRef-6f1acaa4d7f6c64a AND LicenseRef-f3daae6e2ead6f1 AND LicenseRef-90cfa1054f767716 AND LicenseRef-d336b55b7573f941 AND LicenseRef-7955e7acaf9b5cc3 AND LicenseRef-17a620271d638d1c AND LicenseRef-157e923866327e93 AND LicenseRef-5a72bdaf3db99c53 AND LicenseRef-d5845310a6a8e792",
      "licenseDeclared": "BSD-4-Clause AND OpenSSL AND LicenseRef-ea55a018594e6bf5 AND BSD-3-Clause AND BSD-2-Clause AND GPL-3.0-or-later AND GPL-3.0-only AND BSD-4-Clause-UC AND LicenseRef-65f52c543487922d AND LicenseRef-a02a4c35e67d69da AND LicenseRef-148e27c4af509b1f AND LicenseRef-6f1acaa4d7f6c64a AND LicenseRef-f3daae6e2ead6f1 AND LicenseRef-90cfa1054f767716 AND LicenseRef-d336b55b7573f941 AND LicenseRef-7955e7acaf9b5cc3 AND LicenseRef-17a620271d638d1c AND LicenseRef-157e923866327e93 AND LicenseRef-5a72bdaf3db99c53 AND LicenseRef-d5845310a6a8e792",
➜  ./trivy -q image bitnami/wordpress:6.6.2-debian-12-r14 --format spdx-json --cache-backend memory | grep '"licenseId": "LicenseRef-a02a4c35e67d69da",' -A 2
      "licenseId": "LicenseRef-a02a4c35e67d69da",
      "extractedText": "NOASSERTION",
      "name": "BSD-4-clause-KTH"
➜  ./trivy -q image bitnami/wordpress:6.6.2-debian-12-r14 --format spdx-json --cache-backend memory | grep '"licenseId": "LicenseRef-148e27c4af509b1f",' -A 2
      "licenseId": "LicenseRef-148e27c4af509b1f",
      "extractedText": "BSD-4-clause and IBM-as-is",
      "name": "NOASSERTION"

test runs for cron action:

Related issues

Checklist

  • I've read the guidelines for contributing to this repository.
  • I've followed the conventions in the PR title.
  • I've added tests that prove my fix is effective or that my feature works.
  • I've updated the documentation with the relevant information (if needed).
  • I've added usage information (if the PR introduces new options)
  • I've included a "before" and "after" example to the description (if the PR is a user interface change).

return fmt.Sprintf("(%s)", license)
}), " AND ")

normalizedLicense, err := expression.Normalize(license, licensing.NormalizeLicense, expression.NormalizeForSPDX)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to override the expression.NormalizeForSPDX function, but we use With as a delimiter to parse licenses:

%token<token> IDENT OR AND WITH

So there are problems with overwriting licenses with the wrong exception

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about my approach?
#8257

It would be as below. I just wrote a PoC, so it doesn't work as is, of course, but I hope you'll get my point.

otherLicenses := map[string]*spdx.OtherLicense{}
replaceOtherLicenses := func(expr expression.Expression) expression.Expression {
    var licenseName string
    switch e := expr.(type) {
    case expression.SimpleExpression:
        if expression.ValidSpdxLicense(e.License) {
            return license
        }
        licenseName = e.License
    case expression.CompoundExpression:
        if e.Conjunction() != expression.TokenWith {
            return expr
        }
        if expression.ValidSpdxLicense(e.Left().String()) && expression.ValidSpdxLicense(e.Right().String()) {
            return expr
        }
        licenseName = e.String()
    }

    license := newOtherLicense(licenseName, false)
    if license == nil {
        ...
    }
    otherLicenses[license.LicenseIdentifier] = license
    return expression.SimpleExpr{License: license.LicenseIdentifier}
}

normalizedLicense, err := expression.Normalize(license, licensing.NormalizeLicense, expression.NormalizeForSPDX, replaceOtherLicenses)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    case expression.CompoundExpression:
        if e.Conjunction() != expression.TokenWith {
            return expr
        }
        if expression.ValidSpdxLicense(e.Left().String()) && expression.ValidSpdxLicense(e.Right().String()) {
            return expression.SimpleExpr{License: expr.String()}
        }
        licenseName = e.String()

It looks like we need to return SimpleExpr to a valid SPDX license to avoid a separate license and exception checking, but overall your approach should work!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we need to return SimpleExpr to a valid SPDX license to avoid a separate license and exception checking

It's just a waste of time as it will validate the same licenses and exceptions twice, but it doesn't cause any bugs, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't we get something like duplicates?
I mean license with exception, license and separate exception.
But maybe I'm wrong - I'll check and let you know

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe not, but let's see.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use otherLicense for exception.
e.g. for happy path with WITH operator test:

-AFL-2.0 AND AFL-3.0 WITH Autoconf-exception-3.0
+AFL-2.0 AND AFL-3.0 WITH LicenseRef-ab9a5d8bfe1416c4

So i saved license with exception as SimpleExp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in a8a85ad

Copy link
Collaborator

@knqyf263 knqyf263 Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the happy path with WITH operator test is failing just because the exception validation is missing. The following code should pass the test.

diff --git a/pkg/sbom/spdx/marshal.go b/pkg/sbom/spdx/marshal.go
index 8c570d869..b99e28a0f 100644
--- a/pkg/sbom/spdx/marshal.go
+++ b/pkg/sbom/spdx/marshal.go
@@ -432,7 +432,7 @@ func (m *Marshaler) normalizeLicenses(licenses []string) (string, []*spdx.OtherL
                                e.License = strings.TrimPrefix(e.License, "text:--")
                        }

-                       if expression.ValidateSPDXLicense(e.License) {
+                       if expression.ValidateSPDXLicense(e.License) || expression.ValidateSPDXException(e.License) {
                                return e
                        }

@@ -447,7 +447,7 @@ func (m *Marshaler) normalizeLicenses(licenses []string) (string, []*spdx.OtherL
                        if expression.ValidateSPDXLicense(e.Left().String()) && expression.ValidateSPDXException(e.Right().String()) {
                                // Use SimpleExpr for a valid SPDX license with an exception,
                                // to avoid parsing the license and exception separately.
-                               return expression.SimpleExpr{License: e.String()}
+                               return e
                        }

                        licenseName = e.String()

@DmitriyLewen
Copy link
Contributor Author

@goneall I created this PR to use the hasExtractedLicensingInfos field (as you wrote in #7716).
It would be great if you took a look if you have time

@DmitriyLewen DmitriyLewen marked this pull request as ready for review December 11, 2024 07:03
@DmitriyLewen DmitriyLewen self-assigned this Dec 11, 2024
Copy link

@goneall goneall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this code will fix the issue and provide compliance SPDX license expressions.

I had a few suggestions to produce a bit more understandable results and a few coding suggestions.

}
if text {
otherLicense.ExtractedText = license
} else {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {

I would always include the license as the license name - not just when the text is present

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second thought, this may be appropriate if the license text is the actual text of the license. In most cases, the metadata for packages includes text that are supposed to be a license name or identifier in which case it should also be in the name. If we know the license text is really the text, then the existing code is OK.

If, however, the license text is what is found in the package metadata files and they are not the actual text, I would add the same field as the name PLUS add a LicenseComment to explain - something like `otherLicense.LicenseComment = "The license text represents text found in package metadata and may not represent the full text of the license"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases (e.g. license field from python METADATA files) when we can't understand that license is name/id/text.
We tried to detect text of license:

var licenseTextKeywords = []string{
"http://",
"https://",
"(c)",
"as-is",
";",
"hereby",
"permission to use",
"permission is",
"use in source",
"use, copy, modify",
"using",
}
func isLicenseText(str string) bool {
for _, keyword := range licenseTextKeywords {
if strings.Contains(str, keyword) {
return true
}
}
return false
}

So Trivy split license name/id and license text(text:// prefix).
That is why i used both LicenseName and ExtractedText fields:
https://github.com/DmitriyLewen/trivy/blob/f851f9bb18411db838fc65e0c6c4351b04953f8f/pkg/sbom/spdx/marshal_private_test.go#L92-L112

Another question - i used NOASSERTION for LicenseName and ExtractedText fields (see link above), because these fields are mandatory (https://github.com/spdx/tools-golang/blob/f6e45fdb9e4e0c993105f798bee5f8aa8ea70f84/spdx/v2/v2_3/other_license.go#L12-L20).
Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a LicenseComment to explain - something like `otherLicense.LicenseComment = "The license text represents text found in package metadata and may not represent the full text of the license"

I liked this idea 👍
Added in 041ab21

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So Trivy split license name/id and license text(text:// prefix).
That is why i used both LicenseName and ExtractedText fields

Makes sense - I wasn't sure how the text was captured. Since you are capturing the text and name distinctly, your approach should work. BTW, it's a bit tricky to find the start and end of the license text and even trickier to match it to know licenses - something we've been working on in the SPDX java libraries for about 10 years and still don't have it perfected ;).

i used NOASSERTION for LicenseName and ExtractedText fields (see link above), because these fields are mandatory (https://github.com/spdx/tools-golang/blob/f6e45fdb9e4e0c993105f798bee5f8aa8ea70f84/spdx/v2/v2_3/other_license.go#L12-L20).
Is this correct?

For the license name, this is OK. The spec isn't very clear on how these should be treated, so many people make up a name based on the text. Unlike other parts of the spec, NOASSERTION is not required if you don't know - but in this case I think it would be fine to use NOASSERTION - for the name.

For the text, I would put in whatever string was actually found - even if it is the name. The definition of the field is the license text found - so if someone puts in "This software is licensed under mylicense" - you can put that exact string in the text field even though it is not technically the text of "mylicense". I wouldn't use "NOASSERTION" for the text field.

Copy link
Contributor Author

@DmitriyLewen DmitriyLewen Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Thanks for your opinion!
Updated text field in 659f992

"VSFTPD-OPENSSL-EXCEPTION",
"WXWINDOWS-EXCEPTION-3.1",
"X11VNC-OPENSSL-EXCEPTION",
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment may deserve to be a separate suggestion, but in reading the code I would recommend building the license and exception IDs from the JSON files maintained by the SPDX legal team. The license list is updated every 3 months with new IDs and maintaining these in code can be a challenge to keep up and maintain. What I do in the code I maintain is attempt to access the current JSON files on the website https://spdx.org/licenses/licenses.json and https://spdx.org/licenses/exceptions.json. If I can not access the website or if the user specified not to use the online version, I'll use a cached version of the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases when users run multiple times.
Downloading these files for each run is not good.
But we can save licenses.json and exceptions.json files in the cache dir and use them.
The files contain releaseDate field, so we can update this file only when releaseDate + 3 months has expired.

The license list is updated every 3 months

How strictly is this rule followed?

Anyway let's move this discussion into another issue/pr.


What I do in the code I maintain is attempt to access the current JSON files on the website https://spdx.org/licenses/licenses.json and https://spdx.org/licenses/exceptions.json

I found that https://spdx.org/licenses/exceptions.json and https://github.com/spdx/license-list-data/blob/592c2dcb8497c6fe829eea604045f77d3bce770b/json/exceptions.json are different (see harbour-exception).
Which file would be more correct to use?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How strictly is this rule followed?

Not very strictly. There is, however, a license list version field which is reliably incremented on release.

I found that https://spdx.org/licenses/exceptions.json and https://github.com/spdx/license-list-data/blob/592c2dcb8497c6fe829eea604045f77d3bce770b/json/exceptions.json are different (see harbour-exception).
Which file would be more correct to use?

The lists at https://spdx.org/licenses - these will always be the latest released version. The github repo master will have the latest in development version which may not be stable. The github repo is tagged with release versions, so if you go to the tag for the latest release in github, it will match what is on the website.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, update exception list in 659f992

}
return NormalizeLicense(c.Licenses)
otherLicense.LicenseIdentifier = LicenseRefPrefix + "-" + licenseID
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion - you could simplify this by including the "-" in the LicenseRefPrefix definition and just use LicenseRefPrefix + licenseID The string "LicenseRef" will likely never be used without the trailing "-".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is our internal decision - we don't use - suffixes in constants :)

pkg/licensing/expression/category.go Outdated Show resolved Hide resolved
pkg/licensing/expression/category.go Outdated Show resolved Hide resolved
pkg/licensing/expression/category.go Outdated Show resolved Hide resolved
pkg/licensing/expression/category.go Outdated Show resolved Hide resolved
pkg/licensing/expression/category.go Outdated Show resolved Hide resolved
pkg/licensing/expression/category.go Outdated Show resolved Hide resolved
pkg/sbom/spdx/marshal.go Outdated Show resolved Hide resolved
pkg/licensing/expression/category.go Outdated Show resolved Hide resolved
return fmt.Sprintf("(%s)", license)
}), " AND ")

normalizedLicense, err := expression.Normalize(license, licensing.NormalizeLicense, expression.NormalizeForSPDX)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about my approach?
#8257

It would be as below. I just wrote a PoC, so it doesn't work as is, of course, but I hope you'll get my point.

otherLicenses := map[string]*spdx.OtherLicense{}
replaceOtherLicenses := func(expr expression.Expression) expression.Expression {
    var licenseName string
    switch e := expr.(type) {
    case expression.SimpleExpression:
        if expression.ValidSpdxLicense(e.License) {
            return license
        }
        licenseName = e.License
    case expression.CompoundExpression:
        if e.Conjunction() != expression.TokenWith {
            return expr
        }
        if expression.ValidSpdxLicense(e.Left().String()) && expression.ValidSpdxLicense(e.Right().String()) {
            return expr
        }
        licenseName = e.String()
    }

    license := newOtherLicense(licenseName, false)
    if license == nil {
        ...
    }
    otherLicenses[license.LicenseIdentifier] = license
    return expression.SimpleExpr{License: license.LicenseIdentifier}
}

normalizedLicense, err := expression.Normalize(license, licensing.NormalizeLicense, expression.NormalizeForSPDX, replaceOtherLicenses)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I save and check exceptions in uppercase because we don't have normalization for exceptions like we do for licenses.

I suggest waiting for user feedback.
Maybe we need to update this logic (e.g. check in uppercase but save in original (from this file) case).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to update this logic (e.g. check in uppercase but save in original (from this file) case).

In the SPDX tools I maintain, I take the approach of saving in the original case but comparing ignoring case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @goneall

i updated logic in d4e67dc + 5d0f7e1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix(spdx): use hasExtractedLicensingInfos for licenses not in the SPDX license list
3 participants