Parser is executed after validation, not before #1865

Open
svnv-svsv-jm opened this issue Nov 28, 2024 · 2 comments
Labels
bug Something isn't working

svnv-svsv-jm commented Nov 28, 2024

I am using this class:

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
from pandera.errors import SchemaError

class Schema(pa.DataFrameModel):
    """Schema."""

    index: Index[int]

    class Config:
        """Schema config."""

        strict = False
        coerce = True
        add_missing_columns = True

    a: Series[str]
    b: Series[int]
    c: Series[int]

    @pa.dataframe_parser
    @classmethod
    def preprocess(cls, df: pd.DataFrame) -> pd.DataFrame:
        """Preprocessing."""
        if 'b' not in df.columns and 'c' not in df.columns:
            raise SchemaError(schema=cls, data=df, message=f"No `b` or `c` in {df.columns}")
        
        # raise Exception()
        
        if 'b' not in df.columns:
            df['b'] = df['c']
        if 'c' not in df.columns:
            df['c'] = df['b']
        return df

Schema.validate(pd.DataFrame({'a': ['xxx'], 'b': 0}))

and it complains that c is not in the dataframe.

It should not, because I add c in the parser.

Also, if I uncomment the raise Exception() line, that exception is never raised, because the missing-column error occurs first.

pandera == 0.19.3

@svnv-svsv-jm svnv-svsv-jm added the bug Something isn't working label Nov 28, 2024

ppalyafari commented Dec 7, 2024

Hello,

I've checked, and it seems that when the inner Config class is provided, the built-in checks (core parsers) run before the custom parsers.

When it tries to add column 'c' to the dataframe because add_missing_columns = True, it fails: no default value is provided for that column, so it never reaches your custom parsers.

If the order of execution in /pandera/backends/pandas/container.py:

        # Collect status of columns against schema
        column_info = self.collect_column_info(check_obj, schema)
        core_parsers: List[Tuple[Callable[..., Any], Tuple[Any, ...]]] = [
            (self.add_missing_columns, (schema, column_info)),
            (self.strict_filter_columns, (schema, column_info)),
            (self.coerce_dtype, (schema,)),
        ]

        for parser, args in core_parsers:
            try:
                check_obj = parser(check_obj, *args)
            except SchemaError as exc:
                error_handler.collect_error(
                    validation_type(exc.reason_code), exc.reason_code, exc
                )
            except SchemaErrors as exc:
                error_handler.collect_errors(exc.schema_errors)

        # run custom parsers
        check_obj = self.run_parsers(schema, check_obj)

was changed to this, then it would work:

        # run custom parsers
        check_obj = self.run_parsers(schema, check_obj)
        
        # Collect status of columns against schema
        column_info = self.collect_column_info(check_obj, schema)

        core_parsers: List[Tuple[Callable[..., Any], Tuple[Any, ...]]] = [
            (self.add_missing_columns, (schema, column_info)),
            (self.strict_filter_columns, (schema, column_info)),
            (self.coerce_dtype, (schema,)),
        ]

        for parser, args in core_parsers:
            try:
                check_obj = parser(check_obj, *args)
            except SchemaError as exc:
                error_handler.collect_error(
                    validation_type(exc.reason_code), exc.reason_code, exc
                )
            except SchemaErrors as exc:
                error_handler.collect_errors(exc.schema_errors)

I have no idea if this is an intended behavior or not and I haven't checked this any deeper. If this is indeed a bug, I'd be happy to pick this issue up.
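
To make the ordering problem concrete, here is a minimal toy sketch in plain Python. It is not pandera code: a dict stands in for the dataframe, and `MissingColumn`, `add_missing_columns`, `preprocess` and `run` are hypothetical stand-ins for illustration only. The toy raises on the first error for simplicity, whereas the real backend routes errors through an error handler.

```python
from typing import Callable, Dict, List

# Toy stand-in for a dataframe: column name -> list of values.
Frame = Dict[str, List[int]]

# Toy schema: both columns are required and neither has a default.
REQUIRED = ("b", "c")


class MissingColumn(Exception):
    """Hypothetical stand-in for a schema error."""


def add_missing_columns(frame: Frame) -> Frame:
    """Toy core parser: with no default available, a missing column is an error."""
    for name in REQUIRED:
        if name not in frame:
            raise MissingColumn(name)
    return frame


def preprocess(frame: Frame) -> Frame:
    """Toy custom parser: copy whichever of b/c is present into the other."""
    out = dict(frame)
    if "b" not in out:
        out["b"] = list(out["c"])
    if "c" not in out:
        out["c"] = list(out["b"])
    return out


def run(frame: Frame, steps: List[Callable[[Frame], Frame]]) -> Frame:
    """Apply each step in order, feeding the result to the next one."""
    for step in steps:
        frame = step(frame)
    return frame


data: Frame = {"b": [0]}

# Core parser first: it fails before the custom parser gets a chance to run.
try:
    run(data, [add_missing_columns, preprocess])
    core_first_ok = True
except MissingColumn:
    core_first_ok = False

# Custom parser first: c is filled from b, so the core parser is satisfied.
custom_first = run(data, [preprocess, add_missing_columns])
```

In this toy, only the second ordering succeeds, which mirrors the reordering proposed above.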

svnv-svsv-jm (Author) commented

Thanks a lot for the clarification.

> When it tries to add column 'c' to the dataframe due to add_missing_columns = True, it fails, because there are no default values provided, therefore it never gets to your custom parsers.

Indeed, but for my use-case, where the column c needs to be populated with values from column b, I cannot provide a default value.

In my opinion, this cannot be an intended behavior, especially if this can be fixed (perhaps with your suggestion) without breaking anything else.

Or, even cooler, users could be allowed to pass a mode argument to @pa.dataframe_parser, à la pydantic.
