[FEATURE] Detect duplicate samples when adding new data to tensors (images) #1757
Comments
Hi @michelemoretti, thanks for the feature request! Yes, this is a great idea, but it would add some overhead for computing hashes of the data during ingestion (assuming the images are exactly the same). Can you tell us more about how this would simplify the dataset extension pipeline for you (maybe just illustrating with an example)? The answer would help us prioritize. |
Hi David, |
Got it, @michelemoretti. Just to make sure we are on the same page: are those repeated images pixel-perfect identical, or could there still be minor differences between them? |
Absolutely. We're talking about identical files/images. |
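For the exact-duplicate case confirmed above, one possible approach is to hash the raw bytes of each file at ingestion time and skip any sample whose digest has already been seen. The sketch below is only an illustration of that idea: `DedupIngestor`, `content_digest`, and `add` are hypothetical names, not part of the Hub API, and the in-memory `set` stands in for whatever index the dataset would actually keep.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a sample's raw bytes."""
    return hashlib.sha256(data).hexdigest()


class DedupIngestor:
    """Hypothetical helper: keeps only samples whose bytes were not seen before."""

    def __init__(self) -> None:
        self.seen = set()   # digests of every sample accepted so far
        self.samples = []   # stand-in for the real dataset storage

    def add(self, data: bytes) -> bool:
        """Store the sample and return True, or return False if it is a duplicate."""
        digest = content_digest(data)
        if digest in self.seen:
            return False
        self.seen.add(digest)
        self.samples.append(data)
        return True
```

The overhead mentioned earlier in the thread shows up here as one hash computation per ingested sample plus the memory for the digest set. Note that this only catches byte-identical files; the same picture re-encoded or resized would produce a different digest.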
I want to work on this issue. Please assign me this issue. Thanks. |
I want to work on this issue. Please assign me this issue. Thanks. @michelemoretti @davidbuniat @sgrove @jraman |
hey @protocolog , thanks a lot for your contribution, and apologies for the late reply. Assigned the issue! You can join the Activeloop community slack (slack.activeloop.ai) to ask questions. :) |
Please assign issue #1757 to me. You assigned it to me, but my profile shows it as not assigned. Your Slack link is not working; please provide an alternate way to contact you. @davidbuniat @mikayelh @michelemoretti @sgrove |
@protocolog apologies, fixed the link. Please refrain from tagging people who are not involved in this conversation to spare their inboxes. Thanks. :) |
I am unable to join the workspace on Slack. Please help me out. My Slack ID is [email protected]. Thanks. |
This is an interesting problem that I face too, not just for identical images but also for near-identical ones. I have not actually tested this workflow, but I imagine it could be done by generating a perceptual hash (or normal hash) of the image (e.g. with the |
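To make the perceptual-hash idea concrete, here is a minimal, dependency-free sketch of an average hash (aHash), one of the simplest perceptual hashes. The comment above is cut off before it names a library, so this does not reproduce whatever tool was being suggested. It assumes the image has already been converted to grayscale and downscaled to a small grid (8x8 is typical) by an imaging library of your choice; the function names and the `max_distance` threshold are illustrative assumptions.

```python
def average_hash(pixels):
    """Compute an average hash from a small 2D grid of grayscale values.

    Each pixel contributes one bit: 1 if it is brighter than the grid's mean,
    otherwise 0. Assumes `pixels` is a non-empty, already downscaled grid.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(h1, h2, max_distance=5):
    """Treat two images as near-duplicates if their hashes differ in few bits.

    The default threshold is an arbitrary illustrative value and would need
    tuning on real data.
    """
    return hamming_distance(h1, h2) <= max_distance
```

Unlike a cryptographic hash, small pixel-level changes tend to leave most bits of this hash unchanged, so comparing Hamming distances against a threshold can flag near-identical images. The trade-off is that lookups are no longer a simple set membership test, and the threshold controls the balance between missed duplicates and false positives.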
thanks a lot @nmichlo for chiming in here and the suggestion! @protocolog I've re-sent you an invite to our slack but I noticed that you joined. Let me know if you have other questions:) I'm also tagging @istranic here in case he thinks this can be included on the roadmap. :) |
🚨🚨 Feature Request
Is it possible to discard samples that are already present in the dataset when adding new data? If not, would this be something interesting to implement? I feel like this would make the dataset extension pipeline much easier to use and implement.