Filter or quarantine data based on row-level checks #477
Is there any reason why Deequ has no feature to filter or quarantine data based on the Checks that are defined in a VerificationSuite? I understand that some Constraints, such as DistinctnessConstraint or UniquenessConstraint, relate to the whole DataFrame and thus can only be used to either discard or keep the whole batch; however, some of them, such as PatternMatchConstraint or MaxLengthConstraint, could be used to filter the DataFrame and discard the rows that do not pass the row-level, non-aggregated constraints.

Is there any plan to extend Deequ to add this feature, or would it require massive refactoring of the code base, as this was not Deequ's intended purpose? I briefly checked the code base and it seems like it would be hard, because ScanShareableAnalyzer only exposes aggregationFunctions, which is a sequence of aggregated Spark SQL Columns, and e.g. the PatternMatch Analyzer does not expose the Spark SQL non-aggregated expressions that use regexp_extract.
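As a rough illustration of the kind of row-level filtering being asked about here, a minimal plain-Spark sketch (not Deequ API) could look like the following. The predicates mirror what PatternMatchConstraint and MaxLengthConstraint check, and the column names `email` and `name` are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length}

// Hypothetical row-level predicates mirroring PatternMatchConstraint and
// MaxLengthConstraint; the column names "email" and "name" are made up.
def splitByRowLevelChecks(df: DataFrame): (DataFrame, DataFrame) = {
  val passes = col("email").rlike("^[^@\\s]+@[^@\\s]+$") &&
    (length(col("name")) <= 100)

  (df.filter(passes), df.filter(!passes)) // (kept rows, quarantined rows)
}
```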
Hi, in recent pull requests we have added row-level support for certain checks. These pull requests also include tests that show how a VerificationSuite can be used to return a new column per check. You can see an example here: We are working on adding support for more checks, and we're merging them as we complete the work required for each. Is this what you were looking for?
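For readers landing here later, a sketch of the usage described above might look roughly like this. The helper `VerificationResult.rowLevelResultsAsDataFrame` and its signature reflect my reading of the linked PRs and may differ in your Deequ version; `spark`, `df`, and the column names are assumed:

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}

// Run a suite with checks that are eligible for row-level results.
val result = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "row-level checks")
      .isComplete("id")                              // hypothetical column
      .hasPattern("email", "^[^@\\s]+@[^@\\s]+$".r)) // hypothetical column
  .run()

// Ask for a copy of the input data with one boolean column per check,
// indicating whether each row passed; rows where it is false can be
// quarantined. (Helper name per the linked PRs; may vary by version.)
val withOutcomes = VerificationResult
  .rowLevelResultsAsDataFrame(spark, result, df)
```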
I think only the first PR managed to get included in release 2.0.3-spark-3.3, which is the one I am testing, but it seems to be exactly what I was looking for! I am glad you are working towards including new checks too.
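For reference, depending on that release from an sbt build is a one-liner (coordinates as published for the Spark 3.3 build):

```scala
// build.sbt — pulls the release mentioned above from Maven Central
libraryDependencies += "com.amazon.deequ" % "deequ" % "2.0.3-spark-3.3"
```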
@mentekid Hi, is there any plan to release the new record-level features in the deequ artifact? Thanks.
@torontoairbnb While I'm waiting for this new row-level feature to be finished and released, I packaged the JAR myself from the latest commit on master and added it as an external JAR dependency to my project.
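For anyone taking the same route in an sbt project, one possible way to wire in a locally built JAR is shown below; the `lib/` location and file name are hypothetical (Deequ itself builds with Maven):

```scala
// build.sbt — sbt already picks up JARs placed directly in lib/; this
// explicit setting is one way to point at a custom location instead.
// The file name below is hypothetical.
Compile / unmanagedJars += Attributed.blank(
  baseDirectory.value / "lib" / "deequ-2.x-SNAPSHOT.jar"
)
```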
Yes, we are working on completing the work to support eligible Deequ checks, and will make a new release with this and other features we have added since release 2.0.3.
You are right, this approach keeps each analyzer standalone and lets them share scans wherever possible, which is how Deequ optimizes its Spark usage while keeping classes decoupled. Not every user needs row-level features, and our benchmarking shows that providing this information is slower, so this was the best solution we could come up with that doesn't degrade the performance of what's already here. However, the current implementation still makes the same number of passes over the data for row-level results as it does for the analysis results: it adds new columns to the DataFrame, but as long as actions are only performed after the analysis, only the minimum number of passes is needed. If you spot an inefficiency, however, or think there's something we can improve, let us know or feel free to open a PR.
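As a sketch of the scan-sharing described above: scan-shareable analyzers submitted together can have their aggregations computed in shared passes over the data. A minimal example (column names hypothetical; `df` assumed in scope) might be:

```scala
import com.amazon.deequ.analyzers.{Completeness, Maximum, Size}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}

// All three analyzers are scan-shareable, so their aggregation functions
// can be computed together rather than in one pass per analyzer.
val context: AnalyzerContext = AnalysisRunner
  .onData(df)
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("email")) // hypothetical column
  .addAnalyzer(Maximum("price"))      // hypothetical column
  .run()

// Each metric value is a Try, reporting either the result or the failure.
context.metricMap.foreach { case (analyzer, metric) =>
  println(s"$analyzer -> ${metric.value}")
}
```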