[Feature Request] Support SemHash: Fast Semantic Text Deduplication #1444
Comments
Hi @Wendong-Fan, I'm interested in implementing the SemHash feature. Before proceeding, I have some questions:
Looking forward to your guidance on these points before starting the implementation.
Hi @User235514, thank you so much for your willingness to contribute!
Feel free to reach out if you have any questions during the integration process!
Hi @User235514, there are no licensing issues as far as I can tell: they use the MIT license (https://github.com/MinishLab/semhash/blob/main/LICENSE). You do need to include the copyright notice somewhere in our folder, though. @Wendong-Fan can further advise on this.
When trying to add semhash package using
Attempted Solutions
Neither of these solutions resolved the issue. The process still gets stuck at downloading antlr4-python3-runtime.
System information
Steps to reproduce
Additional context
Hey @User235514, it took me 135.8s to add the dependency. Could you try with
Hi @Wendong-Fan. Thank you for your response. I just tried again with
The overall behavior remains consistent with my previous report: even with a simpler command, the dependency resolution is still not completing in a reasonable timeframe.
Have we encountered similar cases before where users experienced significantly longer dependency resolution times? If so, what were the common causes and solutions?
Would you have any additional suggestions for troubleshooting this issue? Perhaps there are some verbose logging options we could enable to better understand where exactly the process is getting stuck?
Hey @User235514, thank you for your detailed update. Could you try waiting a bit longer to see if the resolution eventually completes? The delay might be because this is the first time you're running the command, which can take longer while dependencies are resolved and cached. You can also try updating your Poetry version by
Required prerequisites
Motivation
Useful for deduplicating generated synthetic data and graph data.
https://github.com/MinishLab/semhash
SemHash supports both single-dataset deduplication (e.g., cleaning up a train set) and multi-dataset deduplication (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
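To make the two modes concrete, here is a minimal sketch of single-dataset and multi-dataset deduplication. It does not use SemHash or its API: the function names (`self_deduplicate`, `deduplicate_against`), the `threshold` parameter, and the bag-of-words `embed` helper are all hypothetical, and word-count cosine similarity is only a toy stand-in for the semantic embeddings a real tool would use.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def self_deduplicate(texts, threshold=0.9):
    """Single-dataset case: keep a text only if no already-kept text is too similar.

    Returns the kept texts plus (dropped_text, matched_text) pairs for inspection.
    """
    kept, dropped = [], []
    for text in texts:
        vec = embed(text)
        match = next((k for k in kept if cosine(vec, embed(k)) >= threshold), None)
        if match is None:
            kept.append(text)
        else:
            dropped.append((text, match))
    return kept, dropped


def deduplicate_against(reference, candidates, threshold=0.9):
    """Multi-dataset case: drop candidates too similar to any reference text."""
    ref_vecs = [embed(r) for r in reference]
    return [
        c for c in candidates
        if all(cosine(embed(c), r) < threshold for r in ref_vecs)
    ]


train = ["The cat sat on the mat.", "the cat sat on the mat", "Dogs bark loudly."]
kept, dropped = self_deduplicate(train)
test_clean = deduplicate_against(train, ["Dogs bark loudly!", "Birds sing."])
```

The `dropped` pairs play the role of the inspection step mentioned above: each entry records which text was removed and which kept text it matched, so the threshold can be tuned by looking at borderline cases.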
Solution
No response
Alternatives
No response
Additional context
No response