Gitlab fingerprinting improvements #3996

Agalin · 2023-04-17T14:32:43Z

Agalin
Apr 17, 2023

Problem

Currently fingerprint in Gitlab format is based on location, filename and rule type (as a result of #3441, earlier it was just rule type).

But there is a problem. If you add even a blank line at the top of the file, all violations will change fingerprints leading to falsely marked fixed and new issues in Gitlab's UI.

Solutions?

How could it be done better? TBH I'm not aware of an existing solution but it's possible some tools have some heuristics included.

I've been thinking about it for some time and got a few ideas.

Note: By false-negative I mean a fingerprint that hasn't changed even though it should, false-positive means unwanted fingerprint change.

Git-aware

Closest thing that comes to my mind when thinking if given line has changed is git's diff algorithm or checking last commit sha for given lane (more usable here) - but implementing such approach would require removing line number leading to duplicate fingerprints.

Change in fingerprint:

- line number
+ line commit sha

Context-aware

Another option would be to implement some kind of a context-aware fingerprint. Instead of file line number, fingerprint would include info about position inside, for example, closest named scope. While not perfect, it would greatly decrease number of false changes (only line changes in the same scope would affect the fingerprint). But retrieving such data may be much harder (or maybe not, I didn't check how does Ruff handle Python's AST).

It wouldn't help with top-level stuff for which there might be yet another solution (described in the next block).

There is also a problem of context changing, e.g. function being renamed. In current approach it wouldn't lead to a false-positives but with this change fingerprint obviously must be updated too.

Change in fingerprint:

- line number
+ line number in the closest named context, e.g. function, class.
+ name of the closest named context, eg. function, class.

Neighbours-aware

There is an option to use position relative to previous/next line(s). Instead of using line number, we may hash entire line(s) that is directly below / above the one with issues. With this approach even moving blocks of top-level code around the file may preserve fingerprints.

This approach is highly configurable and may lead to false-negatives (line generates the same error while its content was changed but fingerprint stays the same) and false-positives (line above was changed). It may not protect completely from fingerprint duplicates if blocks of code repeat in the file.

There are two parameters in this approach that may be customized and change likelihood of issues described above:

Neighbourhood direction (line below, above, both).
Number of lines to take.

Change in fingerprint:

- line number
+ content of line(s) above/below

Hybrid context and neighbour aware

Context-aware approach and neighbour-aware approach can be combined. Including scope name should decrease likelihood of fingerprint duplicates between copied code blocks - but I don't think it's possible to completely eliminate.

Change in fingerprint:

- line number
+ content of line(s) above/below
+ name of the closest named context, eg. function, class.

Summary

I believe it's possible to improve fingerprinting. while it seems pretty hard, if not impossible, to avoid unexpected fingerprint changes without storing data externally, I think that at least context-aware approach should heavily decrease number of such events.

Considering that it depends on type of changes applied to a repository, it might be best to keep original (line-based) approach as an option.

Thoughts? Am I missing something obvious?

MichaReiser · 2023-04-22T23:00:28Z

MichaReiser
Apr 22, 2023
Maintainer

Thanks for writing this up and making me aware of the problem. I almost shipped a severe regression to the fingerprint in #3931 by hashing the byte offset.

0 replies

charliermarsh · 2023-04-22T23:17:35Z

charliermarsh
Apr 22, 2023
Maintainer

Perhaps we could use the name of the scope plus the row and column information relative to the parent scope?

1 reply

Agalin Apr 24, 2023
Author

That's what I've described as context-aware. But keep in mind parent scope name would likely need to be fully qualified / recursive to avoid collisions.

MichaReiser · 2023-04-22T23:52:01Z

MichaReiser
Apr 22, 2023
Maintainer

Sonar's approach is interesting by taking the line hash OR line number into account. But I don't know how to encode an OR into a single hash (probably impossible). source

1 reply

Agalin Apr 24, 2023
Author

Sonar explicitlly states it uses an issues list from previous analysis. That may be tricky in Gitlab environment.

carlfischerjba · 2024-02-08T21:13:04Z

carlfischerjba
Feb 8, 2024

For anyone else coming across this page, this was fixed in September 2023. See #7203.

2 replies

charliermarsh Feb 8, 2024
Maintainer

Thanks @carlfischerjba!

Agalin Feb 9, 2024
Author

Thanks for the update, I'm not up to date with Ruff's changelogs. And it's another implementation, not described here. Need to finally update our CI to 0.1+.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gitlab fingerprinting improvements #3996

{{title}}

Replies: 4 comments 4 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Gitlab fingerprinting improvements #3996

Agalin Apr 17, 2023

Problem

Solutions?

Git-aware

Context-aware

Neighbours-aware

Hybrid context and neighbour aware

Summary

Replies: 4 comments · 4 replies

MichaReiser Apr 22, 2023 Maintainer

charliermarsh Apr 22, 2023 Maintainer

Agalin Apr 24, 2023 Author

MichaReiser Apr 22, 2023 Maintainer

Agalin Apr 24, 2023 Author

carlfischerjba Feb 8, 2024

charliermarsh Feb 8, 2024 Maintainer

Agalin Feb 9, 2024 Author

Agalin
Apr 17, 2023

Replies: 4 comments 4 replies

MichaReiser
Apr 22, 2023
Maintainer

charliermarsh
Apr 22, 2023
Maintainer

Agalin Apr 24, 2023
Author

MichaReiser
Apr 22, 2023
Maintainer

Agalin Apr 24, 2023
Author

carlfischerjba
Feb 8, 2024

charliermarsh Feb 8, 2024
Maintainer

Agalin Feb 9, 2024
Author