Create deduplication data for hand-annotating training data for a simple classifier #646

janehmueller · 2018-06-06T12:26:06Z

Generate the score of every (string) similarity measure for the names of two subjects.
Threshold the mean of the scores (to reduce the amount of data).
Export the duplicate candidates to a file in the following way:
- for every new subject, there should be a list of subjects in the knowledge base that it can be linked to
- the grouped subjects should be ranked by the mean of their scores in descending order
- every subject should contain all of its subject data, the similarity measure scores, and the mean of the scores
- the export format should be JSON
A Python script should then use the exported data to enable manual annotation.
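
A rough sketch of how the steps above could fit together, not the project's actual implementation. The two similarity measures (`ratio`, `jaccard`), the `name` and `id` fields, the `is_duplicate` label, the default threshold, and the output layout are all assumptions made for illustration; the real job would use the project's own measures and subject schema.

```python
import json
from difflib import SequenceMatcher
from statistics import mean


def ratio(a: str, b: str) -> float:
    """Character-level similarity from difflib, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    """Token-set overlap of the two names, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical stand-ins for the project's similarity measures.
MEASURES = {"ratio": ratio, "jaccard": jaccard}


def build_candidates(new_subjects, kb_subjects, threshold=0.5):
    """Group knowledge-base subjects under each new subject, ranked by mean score."""
    groups = []
    for new in new_subjects:
        candidates = []
        for kb in kb_subjects:
            scores = {name: fn(new["name"], kb["name"]) for name, fn in MEASURES.items()}
            avg = mean(scores.values())
            if avg >= threshold:  # drop unlikely pairs to reduce data
                candidates.append({"subject": kb, "scores": scores, "mean": avg})
        # Highest mean first, as requested in the issue.
        candidates.sort(key=lambda c: c["mean"], reverse=True)
        if candidates:
            groups.append({"subject": new, "candidates": candidates})
    return groups


def annotate(groups, ask=input):
    """Ask y/n for every candidate pair and store the answer as a label."""
    for group in groups:
        for cand in group["candidates"]:
            prompt = (
                f'{group["subject"]["name"]} == {cand["subject"]["name"]} '
                f'(mean {cand["mean"]:.2f})? [y/n] '
            )
            cand["is_duplicate"] = ask(prompt).strip().lower() == "y"
    return groups


if __name__ == "__main__":
    new = [{"id": "n1", "name": "Volkswagen AG"}]
    kb = [{"id": "k1", "name": "Volkswagen"}, {"id": "k2", "name": "Audi AG"}]
    # Print the JSON that would be written to the export file.
    print(json.dumps(build_candidates(new, kb), indent=2))
```

Passing `ask` as a parameter keeps the annotation loop testable without a terminal; an interactive run would simply use the default `input`.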


janehmueller added the deduplication label Jun 6, 2018

janehmueller self-assigned this Jun 6, 2018

janehmueller added a commit that referenced this issue Jun 6, 2018

Refs #646: moves implicits to package object

3fbfd83

janehmueller added a commit that referenced this issue Jun 6, 2018

Refs #646: add json serialization to jsonparser

735153b

janehmueller added a commit that referenced this issue Jun 6, 2018

Ref #646: add methods to export deduplication candidates as data to be annotated

073e154

janehmueller added a commit that referenced this issue Aug 23, 2018

Refs #646: finish annotation export job

7233023
