Registering a Custom Check with multiple fields #1263

crnudal · 2023-07-16T03:55:48Z

Hello,

I want to start off by thanking everyone involved in the creation of this library.

Honestly, if I had known about this library in my previous jobs, where I didn't use Python as much, I still would have gone for it as my go-to data validator.

I have a small problem on which I think I have already spent too much energy, therefore I turn to the people of this repo for some guidance.

Let's assume that I want to add a custom check that involves validating col_4 based on 3 other columns.
For this, I have the following code snippet:

import pandera as pa
import pandas as pd
from pandera.typing import Series
from pandera import extensions

@extensions.register_check_method(check_type = 'element_wise')
def check_col_4_concat(df: pd.Series):
    return df['col_4'] == df['col_1'] + df['col_2'] + df['col_3']


class ConcatClass(pa.SchemaModel):
    col_1: Series[str] = pa.Field(nullable=False)
    col_2: Series[str] = pa.Field(nullable=False)
    col_3: Series[str] = pa.Field(nullable=False)
    col_4: Series[str] = pa.Field(nullable=False, description='Column that gets checked for concatenation')
    col_5: Series[str] = pa.Field(nullable=False)

    class Config:
        check_col_4_concat = ()


df_dict = {'col_1': ['a', 'd', 'g'], 'col_2': ['b', 'e', 'h'], 'col_3': ['c', 'f', 'i'],
           'col_4': ['abc', 'de', 'ghi'], 'col_5': ['foo', 'bar', 'whatever']}

df = pd.DataFrame(df_dict)

try:
    ConcatClass.validate(df, lazy=True)
    
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
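
For reference, the row-wise logic of the check can be reproduced with plain pandas. This is only a sketch: it assumes that an element_wise check registered at the DataFrame level is evaluated roughly like df.apply(..., axis=1), and it does not call pandera itself. It shows that the check rejects a whole row, not a single cell.

```python
import pandas as pd

# Same data as in the snippet above.
df = pd.DataFrame({
    "col_1": ["a", "d", "g"],
    "col_2": ["b", "e", "h"],
    "col_3": ["c", "f", "i"],
    "col_4": ["abc", "de", "ghi"],
    "col_5": ["foo", "bar", "whatever"],
})


def check_col_4_concat(row: pd.Series) -> bool:
    # Same comparison as the registered check, applied to a single row.
    return row["col_4"] == row["col_1"] + row["col_2"] + row["col_3"]


# Assumption: a DataFrame-level element-wise check behaves like a row-wise apply.
row_passes = df.apply(check_col_4_concat, axis=1)

# Index labels of the rows that fail the comparison.
failing_index = row_passes[~row_passes].index.tolist()
print(failing_index)  # [1]
```

The result is one boolean per row, so the information about which column "caused" the failure is not part of the check's output.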

So the check:

@extensions.register_check_method(check_type = 'element_wise')
def check_col_4_concat(df: pd.Series):
    return df['col_4'] == df['col_1'] + df['col_2'] + df['col_3']

For this data frame:

  col_1 col_2 col_3 col_4     col_5
0     a     b     c   abc       foo
1     d     e     f    de       bar
2     g     h     i   ghi  whatever

it returns:

    schema_context column               check  check_number failure_case  index
0  DataFrameSchema  col_1  check_col_4_concat             0            d      1
1  DataFrameSchema  col_2  check_col_4_concat             0            e      1
2  DataFrameSchema  col_3  check_col_4_concat             0            f      1
3  DataFrameSchema  col_4  check_col_4_concat             0           de      1
4  DataFrameSchema  col_5  check_col_4_concat             0          bar      1

N.B. col_5 isn't even involved in the evaluation.
What I would like to have is:

    schema_context column               check  check_number failure_case  index
3  DataFrameSchema  col_4  check_col_4_concat             0           de      1

I fully understand that I passed the check inside the Config class, so it makes sense that it returns the entire row (index) where the check fails. Is there another way to do this, so that I can specify for which column I want this check to run?
I know one could aggregate the failure cases DataFrame, but for my use case that would take quite a bit of work and would require deciding on a naming convention for the checks.
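
A minimal sketch of that post-processing route, for comparison. It assumes the failure-cases frame has the columns shown in the output above; keep_target_column is a hypothetical helper, not part of pandera, and the frame is rebuilt by hand here so the example runs with pandas alone.

```python
import pandas as pd


def keep_target_column(failure_cases: pd.DataFrame, check: str, column: str) -> pd.DataFrame:
    """Drop the rows that `check` reported for any column other than `column`.

    Hypothetical helper: rows produced by other checks are left untouched.
    """
    from_other_column = (failure_cases["check"] == check) & (failure_cases["column"] != column)
    return failure_cases[~from_other_column]


# Failure cases copied from the output shown above.
failure_cases = pd.DataFrame({
    "schema_context": ["DataFrameSchema"] * 5,
    "column": ["col_1", "col_2", "col_3", "col_4", "col_5"],
    "check": ["check_col_4_concat"] * 5,
    "check_number": [0] * 5,
    "failure_case": ["d", "e", "f", "de", "bar"],
    "index": [1] * 5,
})

filtered = keep_target_column(failure_cases, "check_col_4_concat", "col_4")
print(filtered)
```

With a real validation run, the same helper would be applied to e.failure_cases inside the except block. The mapping from check name to target column still has to be maintained somewhere, which is the naming-convention cost mentioned above.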

I have also tried to use the pa.Check decorator and to pass the check directly inside pa.Field, like so:

col_4 : Series[str] = pa.Field(check_col_4_concat = ())
This raises a TypeError, since a field-level check is mostly useful when you can validate one field by itself, without having to check it against other fields.
Besides this, I have also tried adding the statistics argument to @extensions.register_check_method, but the behavior stays the same.
Disabling the checks for the other columns might be a way to achieve this, but I really cannot go down that route with my use case.

Thx a lot!

Replies: 0 comments
