test: add semhash tests for data deduplication #1545

Open
wants to merge 1 commit into
base: master

@User235514 User235514 commented Feb 3, 2025

Description

  • Add tests for single dataset deduplication
  • Add tests for cross-dataset deduplication
  • Add tests for multi-column dataset deduplication
  • Add tests for custom encoders and pandas integration
  • Ensure that the semhash dependency is added to the project.
  • Note: I could not add semhash through Poetry when submitting this code, so the reviewer will need to add the dependency manually.
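
To illustrate the kind of assertions these test categories imply, here is a minimal sketch. It deliberately does not call semhash: `self_deduplicate`, `cross_deduplicate`, and `_key` are hypothetical stand-ins that match records on normalized text only (no semantic similarity), and they are not taken from this PR or from the semhash API. The real tests would presumably swap in the library's own calls.

```python
# Hypothetical stand-in for a semantic deduplicator; NOT the semhash API.
# Two records count as duplicates when their normalized text is identical.


def _key(record, columns=None):
    """Build a comparison key from a string or from selected dict columns."""
    if isinstance(record, str):
        parts = [record]
    else:
        parts = [str(record[c]) for c in (columns or sorted(record))]
    # Lowercase and collapse whitespace so trivial variants compare equal.
    return tuple(" ".join(p.lower().split()) for p in parts)


def self_deduplicate(records, columns=None):
    """Keep the first occurrence of each record within one dataset."""
    seen, kept = set(), []
    for record in records:
        key = _key(record, columns)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def cross_deduplicate(reference, candidates, columns=None):
    """Drop candidates that already appear in the reference dataset."""
    reference_keys = {_key(r, columns) for r in reference}
    return [c for c in candidates if _key(c, columns) not in reference_keys]


def test_single_dataset():
    texts = ["Hello world", "hello   world", "Goodbye"]
    assert self_deduplicate(texts) == ["Hello world", "Goodbye"]


def test_cross_dataset():
    train = ["Hello world", "Goodbye"]
    test = ["HELLO WORLD", "Something new"]
    assert cross_deduplicate(train, test) == ["Something new"]


def test_multi_column():
    rows = [
        {"question": "What is AI?", "context": "Intro"},
        {"question": "what is ai?", "context": "intro"},
        {"question": "What is AI?", "context": "Advanced"},
    ]
    kept = self_deduplicate(rows, columns=["question", "context"])
    assert kept == [rows[0], rows[2]]
```

The three test functions map onto the single-dataset, cross-dataset, and multi-column cases listed above; custom encoders and pandas integration are left out because they depend on the actual library interface.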

Motivation and Context

close #1444

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

Checklist

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

Member

@Wendong-Fan Wendong-Fan left a comment

Hey @User235514, thanks for the contribution! It seems this PR only contains the test code. We need the core functionality code to check the tests against; could you also add that part?

Successfully merging this pull request may close these issues.

[Feature Request] Support SemHash: Fast Semantic Text Deduplication
2 participants