Duplicate Detection
Duplicate detection is part of the structured data import and is essential for maintaining a clean database. To detect duplicates within the newly imported data source and the existing subjects, a configuration file has to be created. Three different settings have to be set up for a working configuration: the blocking parameters, the confidence threshold, and the similarity score.
For the blocking process, two parameters have to be set: filterUndefined and maxBlockSize. Both parameters influence the filter behavior of the blocking. filterUndefined is a boolean; if it is set to true, all undefined blocks are filtered out. maxBlockSize defines the maximum number of subjects allowed within a block. All blocks containing more than maxBlockSize subjects are filtered out as well.
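The effect of the two blocking parameters can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the function name filter_blocks, the dictionary representation of blocks, and the choice to model an "undefined" block as one whose key is None are all assumptions made for this example.

```python
from typing import Dict, List, Optional


def filter_blocks(
    blocks: Dict[Optional[str], List[str]],
    filter_undefined: bool,
    max_block_size: int,
) -> Dict[Optional[str], List[str]]:
    """Return only the blocks that pass both blocking filters.

    Assumption: a block whose key is None stands for an "undefined" block,
    i.e. subjects for which no blocking key could be derived.
    """
    kept: Dict[Optional[str], List[str]] = {}
    for key, subjects in blocks.items():
        # filterUndefined = true drops the undefined block entirely.
        if filter_undefined and key is None:
            continue
        # Blocks with more subjects than maxBlockSize are dropped as well.
        if len(subjects) > max_block_size:
            continue
        kept[key] = subjects
    return kept


blocks = {
    "smith": ["s1", "s2"],
    None: ["s3"],
    "lee": ["s4", "s5", "s6"],
}
filtered = filter_blocks(blocks, filter_undefined=True, max_block_size=2)
```

With these inputs only the "smith" block remains: the undefined block is removed by filterUndefined, and the "lee" block exceeds maxBlockSize.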
The confidence parameter has to be set to a real number within the interval [0,1]. It is the minimum similarity score threshold for a pair of subjects to be declared duplicates.
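A minimal sketch of how such a threshold might be applied is shown below. The function name is_duplicate is hypothetical, and the inclusive comparison (a score exactly equal to the confidence counts as a duplicate) is an assumption; the text only states that confidence is the minimum score.

```python
def is_duplicate(score: float, confidence: float) -> bool:
    """Decide whether a pair of subjects counts as duplicates.

    Assumption: the threshold is inclusive, so score == confidence
    is treated as a duplicate.
    """
    # The confidence parameter must lie within [0, 1].
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within the interval [0, 1]")
    return score >= confidence


pair_above = is_duplicate(0.80, confidence=0.75)
pair_below = is_duplicate(0.70, confidence=0.75)
```

A higher confidence yields fewer but more reliable duplicate pairs, while a lower value reports more candidate pairs at the cost of more false positives.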
The similarity score setting is a nested value that configures how the similarity score is computed. For each attribute that should be considered when comparing subjects, an attribute tag should be created, containing the name of the attribute, a weight representing its relevance, and a list of similarity measures.
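To make the roles of weight and similarity measures concrete, here is one possible way to combine them, sketched in Python. The text does not specify the actual formula, so everything here is an assumption: the measures of an attribute are averaged, the attribute scores are combined as a weighted average, and the names similarity_score, exact_match, and jaccard_tokens are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Measure = Callable[[str, str], float]


def exact_match(a: str, b: str) -> float:
    """Return 1.0 if both values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def jaccard_tokens(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_score(
    subject_a: Dict[str, str],
    subject_b: Dict[str, str],
    config: Dict[str, Tuple[float, List[Measure]]],
) -> float:
    """Weighted average of per-attribute scores (assumed combination rule)."""
    total_weight = sum(weight for weight, _ in config.values())
    if total_weight == 0:
        return 0.0
    weighted_sum = 0.0
    for attribute, (weight, measures) in config.items():
        if not measures:
            raise ValueError(f"no similarity measures for {attribute!r}")
        value_a = subject_a.get(attribute, "")
        value_b = subject_b.get(attribute, "")
        # Assumption: the measures of one attribute are simply averaged.
        attribute_score = sum(m(value_a, value_b) for m in measures) / len(measures)
        weighted_sum += weight * attribute_score
    return weighted_sum / total_weight


# Hypothetical configuration: attribute name -> (weight, similarity measures).
config = {
    "name": (2.0, [exact_match]),
    "city": (1.0, [exact_match, jaccard_tokens]),
}
a = {"name": "Ada Lovelace", "city": "New York"}
b = {"name": "Ada Lovelace", "city": "York"}
score = similarity_score(a, b, config)  # 0.75 under the assumptions above
```

In this example the names match exactly (score 1.0, weight 2), while the cities score 0.25 on average (weight 1), giving (2 * 1.0 + 1 * 0.25) / 3 = 0.75. The resulting score would then be compared against the confidence parameter.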
Next step Data Merging