Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a more efficient datastructure for searching for duplicates #28

Open
ianwal opened this issue Jul 10, 2023 · 1 comment
Open

Use a more efficient datastructure for searching for duplicates #28

ianwal opened this issue Jul 10, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

@ianwal
Copy link
Collaborator

ianwal commented Jul 10, 2023

Current searching requires comparing every video at least once, it takes n*(n-1)/2 comparisons.

There exists more appropriate data structures for searching for similar objects.


Vantage-point tree

A vantage-point tree would require rebuilding the tree when videos are added, O(n * log(n)), but searching would be greatly improved.

The distance function for the VP tree would be the similarity of a video. Currently, that is a number [0,100].

It could also be possible to store the individual PDQHashes in a VP tree, but the tree would become extremely large with large videos, and it would require storing a table for pdqhash:video. It would likely reduce search time though because it would avoid calculating the similarity of videos while traversing the tree.


Another potential optimization is bucketing videos by their video duration. There's not really a point in comparing videos that are significantly different lengths since they are not likely to be duplicates. Of course, if a video is 1 second vs 2 second that is much different than 1 hour vs 2 hours, so they can't be compared just by their ratios.

@ianwal ianwal added the enhancement New feature or request label Jul 10, 2023
@KennethSamael
Copy link

There's not really a point in comparing videos that are significantly different lengths since they are not likely to be duplicates.

I feel it's worth mentioning that I have gotten useful duplicates from videos of significantly different durations; such as when the shorter video is a single scene clipped out of a longer video. Might not be relevant for everyone, so it could still be useful to have the option to only look for dupes among videos with similar duration, but it's definitely not pointless to compare videos of different durations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants