[#56] Use || instead of && in the Scala jw UDF
This doesn't affect the return values of the function, but it does mean
that we only call JaroWinklerSimilarity.apply() when both arguments are strings
with length > 0. This should help prevent similar edge cases from showing up
in the future.
riley-harper committed Nov 29, 2022
1 parent bbb6588 commit cebb30e
Showing 2 changed files with 5 additions and 3 deletions.
Binary file modified hlink/spark/jars/hlink_lib-assembly-1.0.jar
Binary file not shown.
@@ -7,8 +7,10 @@ package com.isrdi.udfs
 import org.apache.commons.text.similarity._
 
 class SerJaroWinklerSimilarity extends JaroWinklerSimilarity with Serializable {
-  // JaroWinklerSimilarity.apply("", "") returns 1.0, but for our use case it makes
-  // more sense for two empty strings to have similarity 0.0.
+  // If either input string is empty, immediately return 0. Otherwise, return
+  // the value of calling JaroWinklerSimilarity's apply function.
+  // JaroWinklerSimilarity.apply("", "") returns 1, but for our use case it makes
+  // more sense for two empty strings to have similarity 0.
   //
   // By my understanding, the comparison of two empty strings is essentially
   // undefined. To me it makes sense to directly apply the definition of
@@ -17,7 +19,7 @@ class SerJaroWinklerSimilarity extends JaroWinklerSimilarity with Serializable {
   // empty strings 1 - 0 = 1. So I understand the decision to make the
   // similarity 1 and the distance 0 as well.
   override def apply(left: CharSequence, right: CharSequence): java.lang.Double = {
-    if (left == "" && right == "") {
+    if (left == "" || right == "") {
       0.0
     } else {
       super.apply(left, right)
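
The effect of the `||` guard can be sketched without the commons-text dependency. In the minimal example below, `CountingSimilarity` is a hypothetical stand-in for `JaroWinklerSimilarity` (it is not the real Jaro-Winkler algorithm; it only counts calls and compares for equality), and `GuardedSimilarity` is an illustrative name that reuses the guard from the new `apply` in this commit. The point is only that the parent `apply` is never reached when either argument is empty.

```scala
// Hypothetical stand-in for JaroWinklerSimilarity. It counts how often it is
// called and returns 1.0 for equal inputs and 0.0 otherwise. This is NOT the
// real Jaro-Winkler computation.
class CountingSimilarity {
  var calls: Int = 0

  def apply(left: CharSequence, right: CharSequence): java.lang.Double = {
    calls += 1
    if (left.toString == right.toString) 1.0 else 0.0
  }
}

// Illustrative subclass using the same guard as the new version of
// SerJaroWinklerSimilarity.apply: short-circuit to 0.0 if either side is "".
class GuardedSimilarity extends CountingSimilarity {
  override def apply(left: CharSequence, right: CharSequence): java.lang.Double = {
    if (left == "" || right == "") {
      0.0
    } else {
      super.apply(left, right)
    }
  }
}

object GuardDemo {
  def main(args: Array[String]): Unit = {
    val sim = new GuardedSimilarity

    // All three calls hit the guard, so the parent apply is never invoked.
    println(sim("", ""))
    println(sim("", "smith"))
    println(sim("smith", ""))
    println(s"parent calls after empty inputs: ${sim.calls}")

    // Both arguments are non-empty, so the call is delegated to the parent.
    println(sim("smith", "smith"))
    println(s"parent calls after non-empty inputs: ${sim.calls}")
  }
}
```

With the earlier `&&` guard, the second and third calls would have been passed through to the parent implementation; with `||`, only the last call is.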