Create deduplication data for hand-annotating training data for a simple classifier #646

Open
janehmueller opened this issue Jun 6, 2018 · 0 comments
@janehmueller
Collaborator

  • compute the score of every (string) similarity measure for the names of each pair of subjects being compared
  • apply a threshold to the mean of these scores (to reduce the amount of data)
  • export the duplicate candidates to a file as follows:
    • for every new subject, there should be a list of the subjects in the knowledge base that it can be linked to
    • the subjects in each group should be ranked by the mean of their scores, in descending order
    • every subject should contain all of its subject data, the individual similarity measure scores, and the mean of the scores
    • the export format should be JSON
  • a Python script should then read the exported data and make it possible to annotate it manually
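
A minimal sketch of how these steps could fit together, using only the Python standard library. Everything in it is an assumption made for illustration rather than the project's actual code: the two similarity measures, the field names (`id`, `name`, `subject`, `candidates`, `scores`, `mean`), the default threshold of 0.5, and the choice to keep pairs whose mean score is at or above the threshold.

```python
import json
from difflib import SequenceMatcher
from statistics import mean


def sequence_ratio(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Jaccard similarity of the whitespace-separated, lowercased tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical set of measures; the real project may use different ones.
MEASURES = {"sequence_ratio": sequence_ratio, "token_jaccard": token_jaccard}


def score_pair(new_subject, kb_subject):
    """Return every measure's score for the two names, plus their mean."""
    scores = {
        name: measure(new_subject["name"], kb_subject["name"])
        for name, measure in MEASURES.items()
    }
    return scores, mean(scores.values())


def build_candidates(new_subjects, kb_subjects, threshold=0.5):
    """Group knowledge base candidates under each new subject.

    Pairs with a mean score below `threshold` are dropped (assumed direction),
    and each group is sorted by mean score in descending order.
    """
    groups = []
    for new in new_subjects:
        candidates = []
        for kb in kb_subjects:
            scores, avg = score_pair(new, kb)
            if avg >= threshold:
                candidates.append({"subject": kb, "scores": scores, "mean": avg})
        candidates.sort(key=lambda c: c["mean"], reverse=True)
        groups.append({"subject": new, "candidates": candidates})
    return groups


def annotate(groups, ask=input):
    """Ask for a y/n label per candidate pair; `ask` can be swapped for tests."""
    labels = []
    for group in groups:
        for candidate in group["candidates"]:
            prompt = (
                f'{group["subject"]["name"]} == '
                f'{candidate["subject"]["name"]}? [y/n] '
            )
            labels.append({
                "new_id": group["subject"]["id"],
                "kb_id": candidate["subject"]["id"],
                "is_duplicate": ask(prompt).strip().lower() == "y",
            })
    return labels


# Made-up example data, only to show the shape of the export.
example_new = [{"id": "n1", "name": "Volkswagen AG"}]
example_kb = [
    {"id": "k1", "name": "Volkswagen"},
    {"id": "k2", "name": "Volkswagen AG"},
    {"id": "k3", "name": "Siemens"},
]
example_groups = build_candidates(example_new, example_kb)
example_json = json.dumps(example_groups, indent=2)
```

Calling `annotate(json.loads(example_json))` would then prompt once per remaining candidate pair. Keeping the export as plain JSON and passing the prompt function in as a parameter means the annotation step can run separately from the scoring step and can be tested without user input.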