Add Conditional Collision Tracking Compilation Flag to fs-index #64

Open
tareknaser opened this issue May 20, 2024 · 1 comment
@tareknaser
Collaborator

Description

Currently, fs-index tracks collisions regardless of whether the hash function used to compute the resource ID is cryptographic.
This adds unnecessary computation in some cases. With a cryptographic hash function, accidental collisions are practically infeasible, so this tracking can be skipped.

Note: some users might still want collision counting to track files with the same content.

Plan

We plan to include a compilation flag in the fs-index crate to manage collision tracking. This was initially intended for #42, but we postponed it until we update the ResourceIndex API, which will simplify the collision tracking code. For more details, see this comment.

Notes

  • The compilation flag might be better named collision-counting since it will now only track the number of occurrences of the same hash.
  • An alternative approach could be to create a separate crate for the different implementation.
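
A minimal sketch of how such a flag might look, assuming a Cargo feature named `collision-counting` (the name suggested in the notes above, not an existing flag) declared under `[features]` in the crate's `Cargo.toml`. The `Index` struct, its fields and methods, and the `u64` ID type are hypothetical stand-ins for illustration, not the actual ResourceIndex API.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical stand-in for a resource ID; the real crate's type may differ.
type ResourceId = u64;

/// Hypothetical, simplified index; not the actual `ResourceIndex` API.
#[derive(Default)]
pub struct Index {
    /// Maps each resource ID to the first path that produced it.
    id_to_path: HashMap<ResourceId, PathBuf>,
    /// Compiled in only when the assumed `collision-counting` feature is on.
    #[cfg(feature = "collision-counting")]
    collisions: HashMap<ResourceId, usize>,
}

impl Index {
    /// Registers a path under its ID.
    /// Returns `true` if the ID was already present in the index.
    pub fn insert(&mut self, id: ResourceId, path: PathBuf) -> bool {
        if self.id_to_path.contains_key(&id) {
            self.record_collision(id);
            return true;
        }
        self.id_to_path.insert(id, path);
        false
    }

    /// Number of distinct resource IDs in the index.
    pub fn len(&self) -> usize {
        self.id_to_path.len()
    }

    /// With the feature enabled, bump the counter for a repeated ID.
    #[cfg(feature = "collision-counting")]
    fn record_collision(&mut self, id: ResourceId) {
        *self.collisions.entry(id).or_insert(0) += 1;
    }

    /// With the feature disabled, this is a no-op and no counter map exists.
    #[cfg(not(feature = "collision-counting"))]
    fn record_collision(&mut self, _id: ResourceId) {}

    /// Only available when the feature is enabled.
    #[cfg(feature = "collision-counting")]
    pub fn collision_count(&self, id: ResourceId) -> usize {
        self.collisions.get(&id).copied().unwrap_or(0)
    }
}
```

With this layout, builds without the feature carry neither the extra map nor the bookkeeping, while callers that opt in get `collision_count`. The separate-crate alternative would trade the `cfg` attributes for two independent implementations.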
tareknaser self-assigned this May 20, 2024
@kirillt
Member

kirillt commented May 23, 2024

Collision counting (the current implementation) might be useful, but the developer needs to know how to interpret the counts and what to do when a collision is not just a duplicate:

  • with a cryptographic hash function, a collision means the data is duplicated
  • with a non-cryptographic hash function, a collision could mean either a duplicate or a real collision

Collision tracking would be more useful for consumer apps when used together with cryptographic hash functions, because we would also provide the developer with the list of duplicates for each resource ID. For example, a photo app or a document vault could implement deduplication features using this.

tl;dr:
collision counting is ResourceId -> usize
collision tracking is ResourceId -> Vec<PathBuf>
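
To make the two shapes concrete, here is a rough sketch that builds both maps from the same list of entries. Everything in it is an assumption for illustration: `ResourceId` is aliased to `u64`, and `count_by_id`, `paths_by_id`, and `duplicate_groups` are hypothetical helpers, not functions from fs-index.

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// Hypothetical stand-in for a resource ID; the real crate's type may differ.
type ResourceId = u64;

/// Collision counting: ResourceId -> usize.
/// Records how many paths share each ID, but not which ones.
pub fn count_by_id(entries: &[(ResourceId, PathBuf)]) -> HashMap<ResourceId, usize> {
    let mut counts = HashMap::new();
    for (id, _path) in entries {
        *counts.entry(*id).or_insert(0) += 1;
    }
    counts
}

/// Collision tracking: ResourceId -> Vec<PathBuf>.
/// Keeps every path per ID, in the order the entries were given.
pub fn paths_by_id(entries: &[(ResourceId, PathBuf)]) -> HashMap<ResourceId, Vec<PathBuf>> {
    let mut paths: HashMap<ResourceId, Vec<PathBuf>> = HashMap::new();
    for (id, path) in entries {
        paths.entry(*id).or_default().push(path.clone());
    }
    paths
}

/// Groups holding more than one path. With a cryptographic hash these are
/// duplicates; with a non-cryptographic hash they may also be real collisions.
pub fn duplicate_groups(paths: &HashMap<ResourceId, Vec<PathBuf>>) -> Vec<&Vec<PathBuf>> {
    paths.values().filter(|group| group.len() > 1).collect()
}
```

A deduplication feature like the photo app example would need the second map, since the count alone does not say which files to compare or remove.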
