Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(fs): use git commit hash as cache key for clean repositories #8278

Merged
merged 14 commits into from
Jan 27, 2025

Conversation

knqyf263
Copy link
Collaborator

@knqyf263 knqyf263 commented Jan 22, 2025

Overview

Improve cache efficiency for Git repositories by using commit hash as cache key when the repository is in a clean state.

Description

When scanning a Git repository that is in a clean state, this change uses the commit hash as the cache key instead of calculating it from UUID. This allows for better cache reuse and prevents unnecessary cache deletion.

Currently, the file system scanner uses UUID to calculate the cache key, which effectively does not work as a cache key and is deleted after scanning with DeleteBlobs. However, if the directory is a Git repository, we can use the commit hash as a cache key, eliminating the need to delete the cache after scanning.

Key changes:

  • Add Git repository detection and status check
    • For Git repositories, check if the repository is dirty
      • Use commit hash as cache key for clean repositories
      • Use UUID for dirty repositories (existing behavior)
    • Skip cache deletion for clean repositories
  • Add comprehensive tests for both clean and dirty repository states

Related Issues

Checklist

  • I have read the guidelines for contributing to this repository.
  • I have followed the conventions in the PR title.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have updated the documentation with the relevant information (if needed).

When scanning a git repository that is in a clean state, use the commit hash as the cache key instead of calculating it from blob info. This allows for better cache reuse and prevents unnecessary cache deletion.
@knqyf263 knqyf263 self-assigned this Jan 22, 2025
- Introduced a new fixture in `internal/gittest/testdata/fixture.go` to clone a Git repository for unit tests.
- Updated the `magefiles/magefile.go` to include the new fixture in the unit test dependencies.
- Added the test repository path to `.gitignore` to prevent it from being tracked.
- Added NewServerWithRepository function to create a git server with an existing repository, including cloning and fetching all branches and tags.
- Updated NewTestServer to check for the existence of the test repository and provide a clear error message if not found.
- Modified the cloning process in fixture.go to include all branches and tags.
- Removed unused test files.
Signed-off-by: knqyf263 <[email protected]>
Add debug logging when using the latest git commit hash for calculating the artifact cache key, providing more visibility into the cache key generation process.
Signed-off-by: knqyf263 <[email protected]>
Signed-off-by: knqyf263 <[email protected]>
@knqyf263 knqyf263 marked this pull request as ready for review January 24, 2025 08:51
Copy link
Contributor

@DmitriyLewen DmitriyLewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

But i think we need to update docs:

  • fs may save/extract cache for clean repositories
  • by default we don't save cache for repositories (--cache-backend memory is default for fs/repo mode)
  • it works by default for client/server mode
  • cache only works if the repository directory is the target of Trivy scan (but maybe that's redundant)

art.logger.Debug("Using the latest commit hash for calculating cache key", log.String("commit_hash", hash))
art.commitHash = hash
} else {
art.logger.Debug("Random cache key will be used", log.Err(err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't need to show this log unless it's a git repository directory.
Because we will show the following log for each fs scan

[fs] Random cache key will be used      err="failed to open git repository: repository does not exist"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, it might make sense to add path here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated 3802124

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there is a bug 😄 I'll fix it.

Copy link
Contributor

@DmitriyLewen DmitriyLewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Can we merge this PR?

@knqyf263 knqyf263 added this pull request to the merge queue Jan 27, 2025
Merged via the queue into aquasecurity:main with commit b5062f3 Jan 27, 2025
17 checks passed
@knqyf263 knqyf263 deleted the feat/use_git_commit branch January 27, 2025 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(fs): improve cache efficiency using git commit hash
3 participants