
Filter or quarantine data based on row-level checks #477

Closed
riccardodelega opened this issue May 3, 2023 · 7 comments

Labels
question Further information is requested

Comments

@riccardodelega

Is there any reason why Deequ has no feature to filter or quarantine data based on the Checks that are defined in a VerificationSuite?

I understand that some Constraints, such as DistinctnessConstraint or UniquenessConstraint, apply to the whole DataFrame and can therefore only be used to discard or keep the entire batch. However, others, such as PatternMatchConstraint or MaxLengthConstraint, could be used to filter the DataFrame and discard the rows that fail these row-level, non-aggregated constraints.

Is there any plan to extend Deequ with this feature, or would it require a massive refactoring of the code base, since this was not Deequ's intended purpose? I briefly checked the code base and it seems like it would be hard, because ScanShareableAnalyzer only exposes aggregationFunctions, which is a sequence of aggregated SparkSQL Columns, and e.g. the PatternMatch Analyzer does not expose the non-aggregated SparkSQL expressions that use regexp_extract.
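
To make the idea concrete, here is a rough, hypothetical sketch (plain Spark, not Deequ API) of the kind of row-level filtering described above; the column name and regex pattern are made up:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit, regexp_extract}

// Illustrative sketch only (plain Spark, not Deequ API): split a DataFrame into
// rows that pass and rows that fail a PatternMatch-style, non-aggregated
// predicate. The column name and regex pattern passed in are hypothetical.
def splitByPattern(df: DataFrame, column: String, pattern: String): (DataFrame, DataFrame) = {
  // A row passes if the regex matches a non-empty part of the value; nulls fail.
  val passes = coalesce(regexp_extract(col(column), pattern, 0) =!= "", lit(false))
  (df.filter(passes), df.filter(!passes)) // (kept rows, quarantined rows)
}
```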

@riccardodelega riccardodelega added the question Further information is requested label May 3, 2023
@mentekid
Contributor

mentekid commented May 3, 2023

Hi,

In recent pull requests, we have added row-level support for certain Constraints:

  1. Completeness and MaxLength: Feature: Row Level Results #451
  2. MinLength: Feature: Length Row Level Results #465
  3. Uniqueness and related (primary key, unique value ratio, etc): Feature/uniqueness row level results #471
  4. Max, Min, and PatternMatch: Added ColumnValues to Row-level Results #476

These pull requests also include tests that show how a VerificationSuite can be used to return a new column per check. You can see an example here:
https://github.com/awslabs/deequ/blob/master/src/test/scala/com/amazon/deequ/VerificationSuiteTest.scala#L231
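
As a rough sketch of that usage, assuming the API shape shown in the linked test (the SparkSession setup, sample data, column names, and check description below are made up):

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}
import org.apache.spark.sql.SparkSession

// Hypothetical session and data, just to keep the sketch self-contained.
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq(("a", "Widget"), (null, "A very long product name")).toDF("item", "name")

// Constraints that already have row-level support.
val check = Check(CheckLevel.Error, "row level checks")
  .isComplete("item")
  .hasMaxLength("name", _ <= 10)

val result = VerificationSuite()
  .onData(df)
  .addCheck(check)
  .run()

// One boolean column per check is appended alongside the original rows.
val rowLevel = VerificationResult.rowLevelResultsAsDataFrame(spark, result, df)
rowLevel.show()
```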

We are working on adding support for more checks, and we're merging them as we complete the work required for each.

Is this what you were looking for?

@riccardodelega
Author

I think only the first PR managed to get included in release 2.0.3-spark-3.3, which is the one I am testing, but it seems to be exactly what I was looking for! I am glad you are working towards including new checks too.
It would be nice if VerificationResult.rowLevelResultsAsDataFrame did not require recomputing the checks, but I guess doing it in a single sweep (e.g. when the suite is run) would require a lot of refactoring.

@torontoairbnb

torontoairbnb commented May 5, 2023

@mentekid Hi, is there any plan to release the new record-level features in a Deequ artifact? Thanks.

@riccardodelega
Author

@torontoairbnb while I'm waiting for this new row-level feature to be finished and released, I packaged the JAR myself from the latest commit on master and added it as an external JAR dependency to my project.
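
For anyone doing the same, one possible (unofficial) way to wire such a locally built jar into an sbt project; the jar name and location are hypothetical:

```scala
// build.sbt — illustrative only; the jar file name and path are made up.
// Build Deequ from the master branch first (it is a Maven project), then point
// sbt at the resulting jar as an unmanaged dependency.
Compile / unmanagedJars += Attributed.blank(baseDirectory.value / "lib" / "deequ-SNAPSHOT.jar")
```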

@mentekid
Contributor

mentekid commented May 5, 2023

Yes, we are working on completing row-level support for the eligible Deequ checks, and we will make a new release with this and other features we have added since release 2.0.3.

@mentekid mentekid closed this as completed May 5, 2023
@mentekid mentekid reopened this May 5, 2023
@mentekid
Contributor

mentekid commented May 5, 2023

@riccardodelega

It would be nice if VerificationResult.rowLevelResultsAsDataFrame did not require recomputing the checks, but I guess doing it in a single sweep (e.g. when the suite is run) would require a lot of refactoring.

You are right. This approach keeps each analyzer standalone and lets them share scans wherever possible, which is how Deequ optimizes its Spark usage while keeping classes decoupled. Not every user needs row-level features, and our benchmarking shows that producing this information is slower, so this was the best solution we could come up with that doesn't degrade the performance of what's already there.

However, the current implementation still makes the same number of passes over the data for row-level results as it would for the analysis results: it adds new columns to the DataFrame, but if actions are only performed after the analysis, only the minimum number of passes is needed.
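
To illustrate that point, here is a hypothetical continuation of the earlier sketch (the names spark, result, check, and df are reused from it, and the per-check result column is assumed to carry the check description): filtering and quarantining can both be driven from the single annotated DataFrame, with actions deferred until after the analysis has run.

```scala
import org.apache.spark.sql.functions.{col, not}

// Illustrative only: reuse the annotated output from the earlier sketch.
// The row-level result column is assumed here to be named after the check description.
val rowLevel = VerificationResult.rowLevelResultsAsDataFrame(spark, result, df)

val goodRows        = rowLevel.filter(col("row level checks"))
val quarantinedRows = rowLevel.filter(not(col("row level checks")))
```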

If you spot some inefficiency, however, or you think there's something we can improve, let us know or feel free to open a PR.

@mentekid mentekid closed this as completed May 5, 2023
@eapframework

eapframework commented Feb 6, 2024

@mentekid We are currently using the bad-record filtering functionality: a new column is added with true or false based on the row-level checks. However, if a rule has a where filter condition, that filter condition is ignored.

I have opened a separate issue for this: #530
Could you please advise how to fix it?
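
For reference, a rough sketch of the kind of check being described (the column names and filter condition are hypothetical); the report in #530 is that the row-level result column does not honour the where filter:

```scala
import com.amazon.deequ.checks.{Check, CheckLevel}

// Hypothetical example of a constraint scoped with a `where` filter.
// Per the report above, the row-level boolean column appears to ignore this filter.
val scopedCheck = Check(CheckLevel.Error, "scoped completeness")
  .isComplete("email")
  .where("customer_type = 'registered'")
```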
