Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(environment): support collecting license evidence from installed distributions #674

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nejch
Copy link

@nejch nejch commented Feb 6, 2024

Closes #570

Todo: still in draft as I need to add tests

@jkowalleck jkowalleck changed the title feat(environment): support collecting license evidence from installed distributions feat: support collecting license evidence from installed distributions Feb 6, 2024
@jkowalleck jkowalleck changed the title feat: support collecting license evidence from installed distributions feat(environment): support collecting license evidence from installed distributions Feb 6, 2024
if not (dist_files := dist.files):
return None

if not (license_files := [f for f in dist_files if LICENSE_FILE_REGEX.match(f.name)]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to guess the name of the file.
You might be able to access all names via dist.metadata.get_all('License-File')
this is similar to how declared component licenses are collected.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkowalleck maybe the scope of this PR differs slightly from the issue description, but one of my main goals with this was to also collect licenses that the maintainers do not declare. For example, with dual licenses or in bundled and vendored code or where we can't trust maintainers to provide all the fields.

So when we build SBOMs internally at my employer I'd then get complaints that SBOMs are incomplete/inaccurate. This tries to address parts of that as well.

Take this example of the popular mkdocs-material theme (https://github.com/squidfunk/mkdocs-material):

import re
from importlib.metadata import distribution
dist = distribution("mkdocs-material")

LICENSE_FILE_REGEX = re.compile('LICEN[CS]E.*|COPYING.*')
dist.metadata.get_all('License-File')
> ['LICENSE']
[f for f in dist.files if LICENSE_FILE_REGEX.match(f.name)]
> [PackagePath('material/templates/.icons/fontawesome/LICENSE.txt'), PackagePath('material/templates/.icons/material/LICENSE'), PackagePath('material/templates/.icons/octicons/LICENSE'), PackagePath('material/templates/.icons/simple/LICENSE.md'), PackagePath('mkdocs_material-9.4.7.dist-info/licenses/LICENSE')]

Or just tried another one, https://pypi.org/project/packaging/ with a dual license:

dist = distribution("packaging")
dist.metadata.get_all('License-File')
> 
[f for f in dist.files if LICENSE_FILE_REGEX.match(f.name)]
> [PackagePath('packaging-23.2.dist-info/LICENSE'), PackagePath('packaging-23.2.dist-info/LICENSE.APACHE'), PackagePath('packaging-23.2.dist-info/LICENSE.BSD')]

Maybe it's slightly different scopes but I think it's worth adding all those licenses, WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see.
then maybe do your auto-detection, and add the declared license files, too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks - so if I understand correctly we should try make a set out of the collected and maintainer-declared files right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might consider preventing duplicates in certain means, yes.
To what extent? I'd leave the answer to you, as you are probably a user of this feature. I am certain you will make the right decision.

@@ -47,6 +50,8 @@

T_AllComponents = Dict[str, Tuple['Component', Iterable[Requirement]]]

LICENSE_FILE_REGEX = re.compile('LICEN[CS]E.*|COPYING.*')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

files to consider:

  • LICENCE/LICENSE - obviously
  • NOTICE - some license texts, like Apache-2.0, explicitly state this exact file (name) as a optional source of custom license addendum
  • COPYING - obviously

read also: https://wheel.readthedocs.io/en/stable/user_guide.html?#including-license-files-in-the-generated-wheel-file
They describe this already, and they also allow arbitrary file suffixes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice catch! I think I got the original idea for the pattern from another license finder.

I'm thinking with AUTHORS and NOTICE it's a bit less clear if some of the results should be considered license files in the sense that I'd expect in a CycloneDX SBOM, but perhaps it still makes sense. See e.g.
https://github.com/CycloneDX/cyclonedx-python/blob/main/NOTICE (maybe it counts.. not sure)
https://github.com/python-gitlab/python-gitlab/blob/main/AUTHORS

Suffixes should already be covered but I'll make sure to add some tests for this!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding NOTICE file:
such a file is a case of an addendum to standard license texts.
example:
the https://github.com/CycloneDX/cyclonedx-python/blob/main/NOTICE counts, since it is anchored in the apache-2.0 license - see https://github.com/CycloneDX/cyclonedx-python/blob/main/LICENSE#L106-L121

the AUTHORS is unclear, this is true.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkowalleck thanks! Ok makes sense. WDYT should we skip AUTHORS or add it but make the pattern configurable? (this would make more sense if there was a tiny high-level public API for programmatic usage perhaps, I'll open a separate issue to discuss).

Copy link
Member

@jkowalleck jkowalleck Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skip authors for now. if it is needed, somebody can add the feature later.

PS: Oftern i've seen AUTHORS being just a huge collection of email addresses and names with not much benefit. If it had any legal meaning, it would be where it belonged, instead of being an extra file. I am no lawyer and i did not read all license texts, maybe it is similar to the NOTICE file ... IDK

@jkowalleck
Copy link
Member

added PEP 639 compliant license gathering via #755

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Add complete License-Text to cyclonedx bom
2 participants