Identical score if only one of two identifiers match #89
What happens here is that there is no item with the requested identifier value. The reconciliation service therefore first searches for items with this identifier, finds none, and falls back on a normal search without the identifier. The first five items have the exact required string as label or alias, so they all get the maximum score. It could indeed make sense to give them all less than the maximum score (because they all lack the P902=007804 claim). At the moment I am reluctant to make changes to the scoring mechanism - not because I think it is flawless, but because I think there is value in the stability of that mechanism. In the future, I would like reconciliation clients to be able to rely on more granular scores (identifier match, label match) rather than a single score whose computation is quite opaque.
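For reference, the kind of query being discussed looks roughly like this under the Reconciliation Service API: the client sends the label plus a property constraint, and the service only benefits from the property if the identifier lookup succeeds. A minimal sketch of building such a query payload (the type filter and exact field handling are illustrative assumptions, not verified behaviour of this particular service):

```python
import json

# A reconciliation query in the shape used by the Reconciliation Service API:
# a label to match plus a property constraint (here P902 = HLS identifier).
query = {
    "q0": {
        "query": "Teufen",                  # the label ("Lemma") to match
        "properties": [
            {"pid": "P902", "v": "007804"}  # the disambiguating identifier
        ],
    }
}

# The batch of queries is sent as a form field named "queries",
# JSON-encoded as a single string.
payload = {"queries": json.dumps(query)}
print(payload["queries"])
```

If the P902 lookup finds no item, the service falls back to plain label search, which is why the property constraint ends up having no effect on the score.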
@wetneb If this helps: in Freebase, I remember that a single column added as a disambiguator would drop the score by 50% if it didn't match. I don't remember the algorithm for additional disambiguator columns, but the first column added would drop it from 100 to 50, for instance. Maybe Andy had 5 or 10% drops on additional columns; videos around the internet of it being demonstrated might help us approximate the algorithm. It's possible Tom might remember about additional columns; I only recall the first disambiguator column added and its percentage drop on a no-match. Agree on giving power to the user to apply their own weighting through smarter clients, including OpenRefine.
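The remembered Freebase behaviour could be sketched like this. To be clear, this is a guess reconstructed from the comment above, not Freebase's actual algorithm; the 10% penalty for later columns is purely an assumed placeholder:

```python
def disambiguated_score(base_score, column_matches):
    """Penalize mismatched disambiguator columns.

    A guess at the remembered Freebase behaviour: the first mismatched
    column halves the score; later mismatches apply a smaller (assumed
    10%) penalty each. Not the real algorithm.
    """
    score = base_score
    for i, matched in enumerate(column_matches):
        if not matched:
            score *= 0.5 if i == 0 else 0.9
    return score

print(disambiguated_score(100, [False]))        # first column mismatch -> 50.0
print(disambiguated_score(100, [True, False]))  # later column mismatch -> 90.0
```

The key property is simply that a mismatch in any disambiguator column visibly lowers the score instead of leaving it at 100.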
I think this absolutely makes sense and is what users would expect, particularly if 100 is meant to convey "perfect match."
I would favor continuous improvement over stability. The requested behavior sounds like a clear improvement to me.
The problem with scoring tweaks is that they generally sound very reasonable when looking at a particular use case, but it is hard to ensure they do not affect other legitimate use cases in a detrimental way. The risk in starting down this path is being drawn into a series of follow-up fixes to cater to the needs of whoever next reports a regression in their own workflow. So, personally, if I had time to dedicate to this I would rather make progress on OpenRefine/OpenRefine#3139 than on this issue, because I fear the downstream consequences of tweaking a scoring mechanism that has been stable for a long time, and I do not believe in a one-score-fits-all paradigm. Once people can rely on individual scoring features in their reconciliation workflows, it might become easier to tweak the global score (with the understanding that people should instead rely on the individual features if they care about reproducibility). But that should not prevent people from running their own versions of the service with modified scoring mechanisms (locally or as a publicly hosted instance).
I guess we'll have to hope that Wikidata fixes this in their production reconciliation service then, but that could be a very long wait. If someone does host a service that fixes it before then, I'd favor making that service the bundled OpenRefine service.
Hmm, it seems like there are two types of scoring going on: one by identifier and one by name. It would be great if there were greater transparency, so the user could see which mechanism was used and could potentially sort by which column actually matched (identifier or name).
That is exactly what OpenRefine/OpenRefine#3139 is about. If you look at the raw response from the service, you will see two "features" in each candidate, one indicating whether the name matches and one whether the identifier matches. I would like to make these scores available in OpenRefine itself (for now it ignores these values returned by the service).
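For illustration, a candidate in the raw JSON response might carry per-feature sub-scores alongside the opaque global score. The feature ids below ("name_match", "id_match") are placeholders; the actual ids and values depend on the service version:

```python
import json

# Hypothetical raw candidate from the reconciliation service.
# Feature ids are placeholders, not the service's real ids.
raw = json.loads("""
{
  "id": "Q67209",
  "name": "Teufen",
  "score": 100,
  "features": [
    {"id": "name_match", "value": 100},
    {"id": "id_match", "value": 0}
  ]
}
""")

# A client could index the features to sort or filter candidates
# by which column actually matched.
features = {f["id"]: f["value"] for f in raw["features"]}
print(features)
```

Exposing these values in OpenRefine would let users sort by "identifier matched" instead of trusting the single blended score.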
Originally posted by @hroest at OpenRefine/OpenRefine#3191:
When doing reconciliation with Wikidata, for example, a match will often produce a score of 100 if only the name matches the lemma or any of its aliases. This is the case even if a second column is provided that can be used as an external identifier: the score will still be 100 even if the second identifier does not match, while there is one candidate where both columns match, basically making the single-column match indistinguishable from the two-column match.
Maybe I am doing this wrong, I would appreciate some help.
Proposed solution
The score should be higher for the match where both the Lemma name and the identifier match and lower for those that do not match.
Alternatives considered
Additional context
I am working with data where there are often items that share the same name and need a secondary identifier to be distinguished.
An example:
Lemma: Teufen
External Identifier: 007804
External Identifier used: HLS https://www.wikidata.org/wiki/Property:P902
Correct match: https://www.wikidata.org/wiki/Q67209
Here is how I do the reconciliation:
Here is the result:
What I expect: I expect that the correct match https://www.wikidata.org/wiki/Q67209 where the Lemma and the external identifier P902 match will get the highest score.
What happens instead:
There are five hits that all have a match score of 100:
This is simply based on matching the Lemma and does not take the HLS identifier into consideration at all.