-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Add Row Level Result Treatment Options for Uniqueness and Completeness #532
Conversation
src/main/scala/com/amazon/deequ/utilities/RowLevelFilterTreatement.scala
Outdated
Show resolved
Hide resolved
…rows as null for row-level results
…ed rows will be labeled (default True)
…able instead of extending, create RowLevelAnalyzer trait
666651e
to
f856816
Compare
…lterTreatment trait
f856816
to
8e37b92
Compare
@@ -255,12 +256,18 @@ case class NumMatchesAndCount(numMatches: Long, count: Long, override val fullCo | |||
} | |||
} | |||
|
|||
case class AnalyzerOptions(nullBehavior: NullBehavior = NullBehavior.Ignore) | |||
case class AnalyzerOptions(nullBehavior: NullBehavior = NullBehavior.Ignore, | |||
filteredRow: FilteredRow = FilteredRow.TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about filteredRowOutcome
or filteredRowEvaluationStatus
. AnalyzerOptions
is a public facing API, and filteredRow
could be confusing for customers.
@@ -25,7 +25,7 @@ import com.amazon.deequ.repository._ | |||
import org.apache.spark.sql.{DataFrame, SparkSession} | |||
|
|||
/** A class to build a VerificationRun using a fluent API */ | |||
class VerificationRunBuilder(val data: DataFrame) { | |||
class VerificationRunBuilder(val data: DataFrame) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: excess
: Column = { | ||
conditionColumn | ||
.map { condition => { | ||
when(not(condition), expr(filterTreatment)).when(condition, selection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can remove the {
after =>
and its enclosing }
@@ -51,4 +53,16 @@ case class Completeness(column: String, where: Option[String] = None) extends | |||
|
|||
@VisibleForTesting // required by some tests that compare analyzer results to an expected state | |||
private[deequ] def criterion: Column = conditionalSelection(column, where).isNotNull | |||
|
|||
@VisibleForTesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this annotation? The method is accessible to classes in com.amazon.deequ
which the tests are under.
: Column = { | ||
conditionColumn | ||
.map { condition => { | ||
when(not(condition), expr(filterTreatment)).when(condition, selection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we delegate the expr(filterTreatment)
to the parameter of the method? We can update the type of filterTreatment
to a type of FilteredRow
. The expressions for each type of FilteredRow
enumerations can sit inside FilteredRow
itself. Right now, we have a .toString
which breaks the connection between FilteredRow
and this method. Ideally, we want to keep that connection to aid in refactoring and general readability of the code.
private def getRowLevelFilterTreatment: FilteredRow = { | ||
analyzerOptions | ||
.map { options => options.filteredRow } | ||
.getOrElse(FilteredRow.TRUE) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This is repeated in a few places, so could go into the base class.
rowLevelColumn => { | ||
conditionColumn.map { | ||
condition => { | ||
when(not(condition), expr(getRowLevelFilterTreatment.toString)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment for expr(getRowLevelFilterTreatment.toString)
as above
def hasUniqueness(column: String, assertion: Double => Boolean, hint: Option[String], | ||
analyzerOptions: Option[AnalyzerOptions]) | ||
: CheckWithLastConstraintFilterable = { | ||
hasUniqueness(Seq(column), assertion, hint, analyzerOptions) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we combine all the hasUniqueness
into one method and update the body of the method based on the existence of parameters?
// assert(SimpleResultSerde.deserialize(jsonA) == | ||
// SimpleResultSerde.deserialize(jsonB)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove these comments here and in other files below?
separateResults.asInstanceOf[Set[DoubleMetric]].foreach( result => { | ||
assert(runnerResults.toString.contains(result.toString)) | ||
} | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of code has some excess whitespace and misaligned brackets. In a future PR, can we add these rules into scala-style such that the build fails otherwise?
…mpleteness (#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition
…mpleteness (#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition
…mpleteness (#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition
…mpleteness (#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition
…mpleteness (#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition
…mpleteness (#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition
…mpleteness (#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition
…mpleteness (awslabs#532) * Modified Completeness analyzer to label filtered rows as null for row-level results * Modified GroupingAnalyzers and Uniqueness analyzer to label filtered rows as null for row-level results * Adjustments for modifying the calculate method to take in a filterCondition * Add RowLevelFilterTreatement trait and object to determine how filtered rows will be labeled (default True) * Modify VerificationRunBuilder to have RowLevelFilterTreatment as variable instead of extending, create RowLevelAnalyzer trait * Do row-level filtering in AnalyzerOptions rather than with RowLevelFilterTreatment trait * Modify computeStateFrom to take in optional filterCondition
…um (awslabs#535) * Address comments on PR awslabs#532 * Add filtered row-level result support for Minimum, Maximum, Compliance, PatternMatch, MinLength, MaxLength analyzers * Refactored criterion for MinLength and MaxLength analyzers to separate rowLevelResults logic
Issue #, if available: #530
Description of changes:
This PR adds the option
FileteredRow
to AnalyzerOptions that defines how filtered rows will be labeled as when retrieving row level results. The two options areTrue
andNull
and this defaults to True.This PR defines the behavior for the
Completeness
andUniqueness
analyzers, and will be updated for other analyzers in future PRs.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.