[Feature Request] Support SemHash: Fast Semantic Text Deduplication #1444

Open
1 of 2 tasks
Wendong-Fan opened this issue Jan 14, 2025 · 7 comments · May be fixed by #1545
@Wendong-Fan
Member

Required prerequisites

Motivation

SemHash would be useful for deduplicating generated synthetic data and graph data.

https://github.com/MinishLab/semhash
SemHash supports both single-dataset deduplication (e.g., cleaning up a train set) and multi-dataset deduplication (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
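
To make the two modes described above concrete, here is a minimal, standard-library-only sketch of single-dataset and cross-dataset deduplication. It is not SemHash's implementation or API: SemHash compares texts semantically, whereas this sketch swaps in a character-level similarity from `difflib` as a stand-in. The names `toy_similarity`, `self_deduplicate`, `deduplicate_against`, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, List

Similarity = Callable[[str, str], float]


def toy_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]: character-level ratio, NOT semantic."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def self_deduplicate(
    records: List[str],
    threshold: float = 0.9,  # assumed cutoff, chosen for illustration
    sim: Similarity = toy_similarity,
) -> List[str]:
    """Single-dataset mode: keep a record only if it is not a
    near-duplicate of a record that was already kept."""
    kept: List[str] = []
    for record in records:
        if all(sim(record, other) < threshold for other in kept):
            kept.append(record)
    return kept


def deduplicate_against(
    records: List[str],
    reference: List[str],
    threshold: float = 0.9,
    sim: Similarity = toy_similarity,
) -> List[str]:
    """Multi-dataset mode: drop records that are near-duplicates of
    anything in the reference set (e.g. test items leaking from train)."""
    return [
        record
        for record in records
        if all(sim(record, ref) < threshold for ref in reference)
    ]


train = ["The cat sat on the mat.", "the cat sat on the mat", "Dogs bark loudly."]
test = ["Dogs bark loudly.", "Birds sing at dawn."]

clean_train = self_deduplicate(train)
clean_test = deduplicate_against(test, clean_train)
```

Here `clean_train` keeps the first copy of the near-identical sentence pair, and `clean_test` drops the sentence that also appears in the train set. The pairwise comparison is quadratic and only suitable for small lists.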

Solution

No response

Alternatives

No response

Additional context

No response

Wendong-Fan added the New Feature, call for contribution, and P0 (Task with high level priority) labels on Jan 14, 2025
Wendong-Fan added this to the Sprint 21 milestone on Jan 14, 2025
@User235514

Hi @Wendong-Fan,

I'm interested in implementing the SemHash feature. Before proceeding, I have some questions:

  1. Implementation approach:

    • Should we implement SemHash from scratch?
    • Or should we integrate the existing MinishLab/semhash library?
  2. If we use the existing library:

    • Are there any licensing considerations since CAMEL uses Apache 2.0?
    • Any specific requirements for third-party dependencies?
  3. Initial thoughts on integration:

    • Plan to add it in camel/utils/dedup.py
    • Will support both single and multi-dataset deduplication
    • Will maintain CAMEL's existing code style and documentation standards

Looking forward to your guidance on these points before starting the implementation.

@Wendong-Fan
Member Author

Wendong-Fan commented Jan 16, 2025

Hi @User235514,

Thank you so much for your willingness to contribute!

  1. We can integrate the existing MinishLab/semhash library.
  2. I believe there are no licensing issues, but could @ Kaiming Hu double-check this? We use Poetry to manage dependencies, so you can add the new dependency to the pyproject.toml file.
  3. The plan looks solid to me.

Feel free to reach out if you have any questions during the integration process!

@lilpaulgotdrill
Collaborator

Hi @User235514, there are no licensing issues. As far as I can tell, they use the MIT license (https://github.com/MinishLab/semhash/blob/main/LICENSE), but you do need to include the copyright notice somewhere in our folder. @Wendong-Fan can advise further on this.

@User235514

User235514 commented Jan 26, 2025

When I try to add the semhash package using poetry add semhash@^0.2.0, dependency resolution gets stuck for a very long time (>1000s) while downloading antlr4-python3-runtime.

Attempted Solutions

  1. Used Tsinghua mirror with POETRY_SOURCE="https://pypi.tuna.tsinghua.edu.cn/simple"
  2. Cleared poetry cache using poetry cache clear . --all

Neither of these solutions resolved the issue. The process still gets stuck at downloading antlr4-python3-runtime.

System information

  • Python version: 3.11.11
  • Poetry version: 2.0.1

Steps to reproduce

  1. Clone and install the CAMEL repository
  2. Run poetry add semhash@^0.2.0
  3. The process gets stuck at resolving dependencies

Additional context

  • I tried using Tsinghua mirror with POETRY_SOURCE="https://pypi.tuna.tsinghua.edu.cn/simple", but the issue persists
  • The process gets stuck specifically at downloading antlr4-python3-runtime
  • Total resolution time exceeds a few hours without completing

@Wendong-Fan
Member Author

Hey @User235514, it took me 135.8s to add the dependency. Could you try with poetry add semhash?


@User235514

Hi @Wendong-Fan. Thank you for your response. I just tried again with poetry add semhash, but I'm still experiencing the same issue as before. The process gets stuck at "Resolving dependencies..." for an extended period (>1000s). However, I noticed that this time it seems to be stuck at a different point in the resolution process compared to my previous attempt.

The overall behavior remains consistent with my previous report - even with a simpler command, the dependency resolution is still not completing in a reasonable timeframe.

Have we encountered similar cases before where users experienced significantly longer dependency resolution times? If so, what were the common causes and solutions?

Would you have any additional suggestions for troubleshooting this issue? Perhaps there are some verbose logging options we could enable to better understand where exactly the process is getting stuck?
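
One way to capture where the process stalls is to run the command with Poetry's `-vvv` (debug verbosity) flag under a hard timeout and keep whatever output was produced. The sketch below is only an illustration: the helper names `run_with_timeout` and `poetry_add_verbose` and the 1000-second default are hypothetical, and it assumes `poetry` is on the PATH.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(cmd: List[str], timeout_s: float) -> Tuple[Optional[int], str]:
    """Run cmd and return (exit_code, combined stdout/stderr).

    exit_code is None if the command was killed after timeout_s seconds;
    in that case the output captured so far (if any) is returned.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired as exc:
        partial = exc.output or ""
        if isinstance(partial, bytes):  # partial output may arrive undecoded
            partial = partial.decode(errors="replace")
        return None, partial


def poetry_add_verbose(package: str, timeout_s: float = 1000.0) -> Tuple[Optional[int], str]:
    """Hypothetical helper: run `poetry add <package> -vvv` under a timeout."""
    return run_with_timeout(["poetry", "add", package, "-vvv"], timeout_s)
```

Calling `poetry_add_verbose("semhash")` and inspecting the tail of the returned log should show the last package the resolver was working on when the timeout fired.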


@Wendong-Fan
Member Author

Hey @User235514, thank you for your detailed update. Could you try waiting a bit longer to see if the resolution eventually completes? The delay might be because this is the first time you're running the command, which can take longer while dependencies are resolved and cached.

You can also try updating Poetry by running poetry self update.

User235514 linked a pull request on Feb 3, 2025 that will close this issue
9 tasks