-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add data synchronization test to verification Suite. #526
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @VenkataKarthikP for this PR! I've left some comments.
Note: We use this check for the DatasetMatch
rule in AWS Glue Data Quality. In a future PR, we can rename DataSynchronization
to DatasetMatch
for consistency purposes.
* Data Synchronization Analyzer | ||
* | ||
* @param dfToCompare DataFrame to compare | ||
* @param columnMappings columns mappings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We can add more documentation here. Are these the key column mappings? What about the other parameter, comparison column mappings? Will that be added in a future PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, will update documentation.
I was planning to do this in phased manner instead of one big PR, will definitely do a follow up PR to add that as well.
src/main/scala/com/amazon/deequ/comparison/ComparisonResult.scala
Outdated
Show resolved
Hide resolved
case class ComparisonFailed(errorMessage: String, ratio: Double = 0) extends ComparisonResult | ||
case class ComparisonSucceeded(ratio: Double = 0) extends ComparisonResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep 1 set of states. Right now, there are 2 sets of states. Comparison[Failed/Succeeded]
vs DataSynchronization[Failed/Succeeded]
. Having too many states can result in confusion for the end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning, to consolidate once ReferentialIntegrity
is also integrated into verification suite. I will handle that in next PR.
.collect { case constraint: AnalysisBasedConstraint[_, _, _] => constraint.analyzer } | ||
.collect { | ||
case constraint: AnalysisBasedConstraint[_, _, _] => constraint.analyzer | ||
case constraint: DataSynchronizationConstraint => constraint.analyzer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't the existing statement match DataSynchronizationConstraint
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, yes we don't need this.
|
||
override def evaluate(metrics: Map[Analyzer[_, Metric[_]], Metric[_]]): ConstraintResult = { | ||
|
||
val anz = Try(metrics.filter(i => i._1.isInstanceOf[DataSynchronizationAnalyzer]).head._2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we write it as the following, for better readability?
val (_, anz) = Try(metrics.filter { case(analyzer, _) => analyzer.isInstanceOf[DataSynchronizationAnalyzer] }.head)
What happens if .head
is called on empty list? Is that possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will refactor to make it improve readability, thanks for suggestion.
@rdsharma26 thanks for the review, updated with review comments. |
* | ||
* @param dfToCompare The DataFrame to compare with the primary DataFrame that is setup | ||
* during [[com.amazon.deequ.VerificationSuite.onData]] setup. | ||
* @param columnMappings A map where each key-value pair represents a column in the primary DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename to keyColumnMappings
in next PR.
* | ||
*/ | ||
case class DataSynchronizationAnalyzer(dfToCompare: DataFrame, | ||
columnMappings: Map[String, String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename to keyColumnMappings
in next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks again for the PR! Please address the following in next PR.
columnMappings
should be renamed tokeyColumnMappings
- Add
matchColumnMappings
- [If possible] Change
DataSynchronization
toDatasetMatch
* add data synchronization test to verification suite * review comments * update test and doc strings
* add data synchronization test to verification suite * review comments * update test and doc strings
* add data synchronization test to verification suite * review comments * update test and doc strings
* add data synchronization test to verification suite * review comments * update test and doc strings
* add data synchronization test to verification suite * review comments * update test and doc strings
* add data synchronization test to verification suite * review comments * update test and doc strings
* add data synchronization test to verification suite * review comments * update test and doc strings
*Issue, if available: #501
Description of changes: Adding data synchronization check to verification suite, with this change users can define
isDataSynchronized
check.Example usage -
cc: @mentekid @rdsharma26
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.