Data Validation Framework: Source + Product data #1241
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1241 +/- ##
==========================================
+ Coverage 70.58% 70.74% +0.16%
==========================================
Files 115 119 +4
Lines 5966 6041 +75
Branches 695 706 +11
==========================================
+ Hits 4211 4274 +63
- Misses 1609 1617 +8
- Partials 146 150 +4
A couple of small notes/thoughts, but
- this all seems great to me, big picture-wise
- you and Alex seem to be in a good spot (and agreement) about some of the details, so I'll continue to peek and give any wanted feedback, but carry on!
don't wanna miss out on the approval party! great stuff!
@fvankrieken @damonmcc @alexrichey Should I merge with unhappy
No, those should be resolved in some way, particularly the couple failures outside of You also have a pytest failure - looks like import issues due to renaming
* Add pandera
* automated compiling of python requirements (#1361)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
---------
Co-authored-by: fvankrieken <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Create `CheckAttributes` model
* Update attr `Column.checks` expected type (forward and backward compatible)
Partially resolves #650
What
This PR implements a rough, initial architecture for data checks with Pandera in dcpy.

In order to run data checks on a dataframe with Pandera, you need to construct a pandera object, `pandera.DataFrameSchema`. The Pandera object takes in `pandera.Column`s, which consist of `pandera.Check`s. Most of the PR is about creating these pandera objects in the `validate/pandera_utils.py` module. The rest of the PR creates a format for data checks in our templates.

Example of running data checks from Pandera docs:

Example of data check format in ingest templates/metadata files:
Check names in a template correspond to pandera check names or to our custom checks (same approach as in preprocessing steps). In the last commit, I created a custom check as an example of how we would do it; the check itself can be scrapped later.
I recommend reviewing this PR commit by commit, or starting with the `validate.pandera_utils.run_data_checks()` fn and working your way backwards.

🚨 Feedback needed
- `validate.pandera_utils.create_check()` fn: I have some validation logic in it that checks for correct input values (i.e. that the check name and check parameters exist). I would like to move it under the `dataset.CheckAttributes` model -- does it make sense to do that?
- `mypy` in the meantime, before pandera checks are fully implemented.

Next steps/PRs
- `bbl`)
- `text` -> `pandera.dtypes.DataType.String`
- `validate.run()` function which reads in a local file into df, pulls in a template, and runs data checks.
- `checks` data type. Refactor distribution metadata files.

References