Parser is executed after validation, not before #1865

svnv-svsv-jm · 2024-11-28T13:21:51Z

I am using this class:

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series
from pandera.errors import SchemaError

class Schema(pa.DataFrameModel):
    """Schema."""

    index: Index[int]

    class Config:
        """Schema config."""

        strict = False
        coerce = True
        add_missing_columns = True

    a: Series[str]
    b: Series[int]
    c: Series[int]

    @pa.dataframe_parser
    @classmethod
    def preprocess(cls, df: pd.DataFrame) -> pd.DataFrame:
        """Preprocessing."""
        if 'b' not in df.columns and 'c' not in df.columns:
            raise SchemaError(schema=cls, data=df, message=f"No `b` or `c` in {df.columns}")
        
        # raise Exception()
        
        if 'b' not in df.columns:
            df['b'] = df['c']
        if 'c' not in df.columns:
            df['c'] = df['b']
        return df

Schema.validate(pd.DataFrame({'a': ['xxx'], 'b': 0}))

and it complains that c is not in the dataframe.

It should not, because I add c in the parser.

Also, if I uncomment the raise Exception() line, that exception is never raised, because the missing-column error is raised first.

pandera == 0.19.3

ppalyafari · 2024-12-07T18:32:35Z

Hello,

I've checked and it seems like if the inner Config class is provided, then those checks (core parsers) will run before the custom parsers.

When it tries to add column 'c' to the dataframe because add_missing_columns = True, it fails because no default value is provided, so it never reaches your custom parsers.

If the order of execution in /pandera/backends/pandas/container.py:

        # Collect status of columns against schema
        column_info = self.collect_column_info(check_obj, schema)
        core_parsers: List[Tuple[Callable[..., Any], Tuple[Any, ...]]] = [
            (self.add_missing_columns, (schema, column_info)),
            (self.strict_filter_columns, (schema, column_info)),
            (self.coerce_dtype, (schema,)),
        ]

        for parser, args in core_parsers:
            try:
                check_obj = parser(check_obj, *args)
            except SchemaError as exc:
                error_handler.collect_error(
                    validation_type(exc.reason_code), exc.reason_code, exc
                )
            except SchemaErrors as exc:
                error_handler.collect_errors(exc.schema_errors)

        # run custom parsers
        check_obj = self.run_parsers(schema, check_obj)

was changed to this, then it would work:

        # run custom parsers
        check_obj = self.run_parsers(schema, check_obj)
        
        # Collect status of columns against schema
        column_info = self.collect_column_info(check_obj, schema)

        core_parsers: List[Tuple[Callable[..., Any], Tuple[Any, ...]]] = [
            (self.add_missing_columns, (schema, column_info)),
            (self.strict_filter_columns, (schema, column_info)),
            (self.coerce_dtype, (schema,)),
        ]

        for parser, args in core_parsers:
            try:
                check_obj = parser(check_obj, *args)
            except SchemaError as exc:
                error_handler.collect_error(
                    validation_type(exc.reason_code), exc.reason_code, exc
                )
            except SchemaErrors as exc:
                error_handler.collect_errors(exc.schema_errors)

I have no idea if this is an intended behavior or not and I haven't checked this any deeper. If this is indeed a bug, I'd be happy to pick this issue up.
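To make the ordering argument concrete, here is a minimal toy model of the two orders in plain Python. It is only a sketch: `Frame`, `REQUIRED`, `MissingColumnError`, `add_missing_columns`, `custom_parser`, and `run` are hypothetical stand-ins, not pandera code, and the toy raises immediately where the real backend routes errors through its error handler.

```python
from typing import Callable, Dict, List, Optional

# Toy stand-in for a DataFrame: column name -> list of values.
Frame = Dict[str, List[int]]

# Required columns and their defaults (None means "no default provided").
REQUIRED: Dict[str, Optional[int]] = {"b": None, "c": None}


class MissingColumnError(Exception):
    """Toy stand-in for a schema error (not pandera's SchemaError)."""


def add_missing_columns(frame: Frame) -> Frame:
    """Toy core parser: fill missing columns from defaults, fail without one."""
    out = dict(frame)
    n_rows = len(next(iter(out.values()), []))
    for name, default in REQUIRED.items():
        if name not in out:
            if default is None:
                raise MissingColumnError(f"column '{name}' is missing and has no default")
            out[name] = [default] * n_rows
    return out


def custom_parser(frame: Frame) -> Frame:
    """Toy custom parser: mirror b into c (or c into b), as in the issue."""
    out = dict(frame)
    if "b" not in out and "c" in out:
        out["b"] = list(out["c"])
    if "c" not in out and "b" in out:
        out["c"] = list(out["b"])
    return out


def run(frame: Frame, steps: List[Callable[[Frame], Frame]]) -> Frame:
    """Apply each step in order, passing the result along."""
    for step in steps:
        frame = step(frame)
    return frame


data: Frame = {"b": [0]}

# Current order: the core parser fails before the custom parser is reached.
try:
    run(data, [add_missing_columns, custom_parser])
    current_order_ok = True
except MissingColumnError:
    current_order_ok = False

# Proposed order: the custom parser creates c before the core check runs.
proposed = run(data, [custom_parser, add_missing_columns])
```

In this toy, only the second ordering succeeds, which mirrors the behaviour described in this comment; whether the real change is safe for other schemas has not been checked here.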

svnv-svsv-jm · 2024-12-09T11:37:46Z

Thanks a lot for the clarification.

> When it tries to add column 'C' to the dataframe due to add_missing_columns = True, it fails, because there are no default values provided, therefore it never gets to your custom parsers.

Indeed, but for my use-case, where the column c needs to be populated with values from column b, I cannot provide a default value.
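Until the ordering is settled, one possible workaround is to do the fill in a plain function before validation, so that both columns already exist when the schema sees the frame. This is a sketch only: `fill_b_and_c` is a hypothetical helper, not part of pandera, and it uses pandas alone.

```python
import pandas as pd


def fill_b_and_c(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: make sure both b and c exist before validation."""
    if "b" not in df.columns and "c" not in df.columns:
        raise ValueError(f"No `b` or `c` in {list(df.columns)}")
    out = df.copy()  # leave the caller's frame untouched
    if "b" not in out.columns:
        out["b"] = out["c"]
    if "c" not in out.columns:
        out["c"] = out["b"]
    return out


prepared = fill_b_and_c(pd.DataFrame({"a": ["xxx"], "b": 0}))
```

The prepared frame could then be passed to Schema.validate(prepared), at the cost of keeping this logic outside the schema class.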

In my opinion, this cannot be an intended behavior, especially if this can be fixed (perhaps with your suggestion) without breaking anything else.

Or, even cooler, let users provide a mode argument to @pa.dataframe_parser, à la pydantic.

svnv-svsv-jm added the bug label on Nov 28, 2024
