Replies: 4 comments 4 replies
-
Thanks for writing this up and making me aware of the problem. I almost shipped a severe regression to the fingerprint in #3931 by hashing the byte offset. |
Beta Was this translation helpful? Give feedback.
-
Perhaps we could use the name of the scope plus the row and column information relative to the parent scope? |
Beta Was this translation helpful? Give feedback.
-
Sonar's approach is interesting by taking the line hash OR line number into account. But I don't know how to encode an OR into a single hash (probably impossible). source |
Beta Was this translation helpful? Give feedback.
-
For anyone else coming across this page, this was fixed in September 2023. See #7203. |
Beta Was this translation helpful? Give feedback.
-
Problem
Currently fingerprint in Gitlab format is based on location, filename and rule type (as a result of #3441, earlier it was just rule type).
But there is a problem. If you add even a blank line at the top of the file, all violations will change fingerprints leading to falsely marked fixed and new issues in Gitlab's UI.
Solutions?
How could it be done better? TBH I'm not aware of an existing solution but it's possible some tools have some heuristics included.
I've been thinking about it for some time and got a few ideas.
Note: By false-negative I mean a fingerprint that hasn't changed even though it should, false-positive means unwanted fingerprint change.
Git-aware
Closest thing that comes to my mind when thinking if given line has changed is git's diff algorithm or checking last commit sha for given lane (more usable here) - but implementing such approach would require removing line number leading to duplicate fingerprints.
Change in fingerprint:
Context-aware
Another option would be to implement some kind of a context-aware fingerprint. Instead of file line number, fingerprint would include info about position inside, for example, closest named scope. While not perfect, it would greatly decrease number of false changes (only line changes in the same scope would affect the fingerprint). But retrieving such data may be much harder (or maybe not, I didn't check how does Ruff handle Python's AST).
It wouldn't help with top-level stuff for which there might be yet another solution (described in the next block).
There is also a problem of context changing, e.g. function being renamed. In current approach it wouldn't lead to a false-positives but with this change fingerprint obviously must be updated too.
Change in fingerprint:
Neighbours-aware
There is an option to use position relative to previous/next line(s). Instead of using line number, we may hash entire line(s) that is directly below / above the one with issues. With this approach even moving blocks of top-level code around the file may preserve fingerprints.
This approach is highly configurable and may lead to false-negatives (line generates the same error while its content was changed but fingerprint stays the same) and false-positives (line above was changed). It may not protect completely from fingerprint duplicates if blocks of code repeat in the file.
There are two parameters in this approach that may be customized and change likelihood of issues described above:
Change in fingerprint:
Hybrid context and neighbour aware
Context-aware approach and neighbour-aware approach can be combined. Including scope name should decrease likelihood of fingerprint duplicates between copied code blocks - but I don't think it's possible to completely eliminate.
Change in fingerprint:
Summary
I believe it's possible to improve fingerprinting. while it seems pretty hard, if not impossible, to avoid unexpected fingerprint changes without storing data externally, I think that at least context-aware approach should heavily decrease number of such events.
Considering that it depends on type of changes applied to a repository, it might be best to keep original (line-based) approach as an option.
Thoughts? Am I missing something obvious?
Beta Was this translation helpful? Give feedback.
All reactions