I want to start off by thanking everyone involved in the creation of this library.
Honestly, if I had known about this library in my previous jobs, where I didn't use Python as much, I still would have adopted it as my go-to data validator.
I have a small problem on which I have already spent too much energy, so I am turning to the people of this repo for some guidance.
Let's assume that I want to add a custom check that validates col_4 against three other columns.
For this, I have the following code snippet:
import pandera as pa
import pandas as pd
from pandera.typing import Series
from pandera import extensions


@extensions.register_check_method(check_type='element_wise')
def check_col_4_concat(df: pd.Series):
    # Registered element-wise at the DataFrame level, so each call receives one row.
    return df['col_4'] == df['col_1'] + df['col_2'] + df['col_3']


class ConcatClass(pa.SchemaModel):
    col_1: Series[str] = pa.Field(nullable=False)
    col_2: Series[str] = pa.Field(nullable=False)
    col_3: Series[str] = pa.Field(nullable=False)
    col_4: Series[str] = pa.Field(nullable=False, description='Column that gets checked for concatenation')
    col_5: Series[str] = pa.Field(nullable=False)

    class Config:
        # Attach the registered check to the whole DataFrame.
        check_col_4_concat = ()

So the check:

df_dict = {'col_1': ['a', 'd', 'g'], 'col_2': ['b', 'e', 'h'], 'col_3': ['c', 'f', 'i'],
           'col_4': ['abc', 'de', 'ghi'], 'col_5': ['foo', 'bar', 'whatever']}
df = pd.DataFrame(df_dict)
try:
    ConcatClass.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
For data frame:
  col_1 col_2 col_3 col_4     col_5
0     a     b     c   abc       foo
1     d     e     f    de       bar
2     g     h     i   ghi  whatever
Will return:
    schema_context  column               check  check_number  failure_case  index
0  DataFrameSchema   col_1  check_col_4_concat             0             d      1
1  DataFrameSchema   col_2  check_col_4_concat             0             e      1
2  DataFrameSchema   col_3  check_col_4_concat             0             f      1
3  DataFrameSchema   col_4  check_col_4_concat             0            de      1
4  DataFrameSchema   col_5  check_col_4_concat             0           bar      1
N.B. col_5 isn't even involved in the evaluation.
What I would like to have is:
    schema_context  column               check  check_number  failure_case  index
3  DataFrameSchema   col_4  check_col_4_concat             0            de      1
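For reference, the row-level condition and the single-row report I am after can be reproduced in plain pandas, outside of pandera. This is only a sketch of the intended result, not a pandera feature; the variable names `is_valid` and `report` are made up for illustration.

```python
import pandas as pd

# Same sample data as in the snippet above.
df = pd.DataFrame({
    "col_1": ["a", "d", "g"],
    "col_2": ["b", "e", "h"],
    "col_3": ["c", "f", "i"],
    "col_4": ["abc", "de", "ghi"],
    "col_5": ["foo", "bar", "whatever"],
})

# Vectorized form of the row-wise comparison used in check_col_4_concat.
is_valid = df["col_4"] == df["col_1"] + df["col_2"] + df["col_3"]

# Report only the offending col_4 values, one row per failing index.
report = (
    df.loc[~is_valid, "col_4"]
    .rename("failure_case")
    .rename_axis("index")
    .reset_index()
)
report.insert(0, "column", "col_4")
print(report)
```

Only row 1 fails the comparison, so the report contains a single entry for col_4.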
I fully understand that I have passed the check inside the Config class, so it makes sense that it returns the entire row where the check fails. Is there another way to specify which column this check should be reported for?
I know one could aggregate the failure cases DataFrame, but for my use case that would need quite a bit of work, plus deciding on a naming convention for the checks.
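To make that aggregation idea concrete, here is a rough sketch of the kind of post-processing I mean. The failure cases frame is copied by hand from the output above, and the `TARGET_COLUMN` mapping and the `keep_target_rows` helper are hypothetical names of my own, not part of pandera.

```python
import pandas as pd

# Failure cases as printed above, rebuilt by hand for the example.
failure_cases = pd.DataFrame({
    "schema_context": ["DataFrameSchema"] * 5,
    "column": ["col_1", "col_2", "col_3", "col_4", "col_5"],
    "check": ["check_col_4_concat"] * 5,
    "check_number": [0] * 5,
    "failure_case": ["d", "e", "f", "de", "bar"],
    "index": [1] * 5,
})

# Hypothetical convention: each multi-column check names the one column
# it should be reported against.
TARGET_COLUMN = {"check_col_4_concat": "col_4"}


def keep_target_rows(cases: pd.DataFrame, targets: dict) -> pd.DataFrame:
    """Drop rows of a mapped check that do not belong to its target column."""
    expected = cases["check"].map(targets)
    # Checks without a mapping are left untouched (expected is NaN there).
    mask = expected.isna() | (cases["column"] == expected)
    return cases[mask]


filtered = keep_target_rows(failure_cases, TARGET_COLUMN)
print(filtered)
```

This works on the sample, but it means maintaining the mapping for every multi-column check, which is the extra bookkeeping I would like to avoid.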
I have also tried using the pa.Check decorator and passing the check directly inside pa.Field, like so:
col_4 : Series[str] = pa.Field(check_col_4_concat = ())
This raises a TypeError, since a field-level check is mostly useful when one field can be validated by itself, without having to compare it against other fields.
Besides this, I have also tried adding the statistics argument to @extensions.register_check_method, but the behavior stays the same.
Disabling the checks for the other columns might be a way to achieve this, but I really cannot go down that route with my use case.
Thx a lot!