IntCastingNaNError #530

Open
chenzeDoris opened this issue Jan 6, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@chenzeDoris

Describe the bug
Hi, thank you for this library. It saves me the time I would otherwise spend re-implementing existing architectures for tabular data. However, I have found a bug in tabular_datamodule.py. When the OrdinalEncoder transforms test or validation data, that data may contain category values that were not seen during fitting. The ordinal encoder imputes these with NAN_CATEGORY, which results in IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer.

To Reproduce
Steps to reproduce the behavior:

  1. Go to tabular_datamodule.py
  2. Function _encode_categorical_columns()
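The failure mode described above can be sketched with plain pandas, without the library itself. This is only an illustration under assumptions: the `mapping` dict and the `cast_to_int` helper are hypothetical stand-ins, not the code in tabular_datamodule.py. It shows that a category missing from a fitted mapping becomes NaN, and that the later integer cast is what raises.

```python
import pandas as pd

# Hypothetical mapping learned at fit time; "green" was never seen.
mapping = {"red": 1, "blue": 2}

test_values = pd.Series(["red", "green", "blue"])

# Series.map yields NaN for any key that is missing from the mapping,
# so the result is a float column containing one NaN.
encoded = test_values.map(mapping)


def cast_to_int(series: pd.Series) -> pd.Series:
    """Cast to int64; pandas refuses if any NaN remains in the column."""
    return series.astype("int64")


try:
    cast_to_int(encoded)
    raised = False
except ValueError as err:
    # In recent pandas versions this is IntCastingNaNError,
    # which is a subclass of ValueError.
    raised = True
    print(f"{type(err).__name__}: {err}")
```

If the real encoder leaves NaN in a column for the same reason, filling or remapping those entries before the cast would be the place to intervene.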

Expected behavior
Unseen values are normal in real-world data. I think they should be kept rather than replaced with NaN when transforming data that contains unseen values.
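One way to get this behavior is to reserve an integer code for anything not seen during fitting, so the output never contains NaN. The following is a minimal sketch of that idea, not the library's implementation: `ReservedCodeOrdinalEncoder` and `UNKNOWN_CODE` are hypothetical names chosen for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Optional

UNKNOWN_CODE = 0  # hypothetical reserved code for unseen or missing values


class ReservedCodeOrdinalEncoder:
    """Toy ordinal encoder: known categories get 1..K, everything else gets 0."""

    def __init__(self) -> None:
        self.mapping: Dict[Hashable, int] = {}

    def fit(self, values: Iterable[Optional[Hashable]]) -> "ReservedCodeOrdinalEncoder":
        # Assign codes in order of first appearance, skipping missing values.
        self.mapping = {}
        for value in values:
            if value is None:
                continue
            if value not in self.mapping:
                self.mapping[value] = len(self.mapping) + 1
        return self

    def transform(self, values: Iterable[Optional[Hashable]]) -> List[int]:
        # Unseen and missing values fall back to the reserved integer code,
        # so the result is always a list of plain ints.
        return [self.mapping.get(value, UNKNOWN_CODE) for value in values]


encoder = ReservedCodeOrdinalEncoder().fit(["red", "blue", "red"])
codes = encoder.transform(["blue", "green", None])
print(codes)
```

With this scheme, a downstream embedding table would need one extra row for the reserved code, which is a design cost worth weighing against dropping or erroring on unseen rows.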

Desktop (please complete the following information):

  • OS: macOS
  • Version 12.7.5 (21H1222)

dosubot bot commented Jan 6, 2025

@chenzeDoris Hello! I'm here to help you with any bugs, questions, or contributions you have. Let's squash those bugs together!

To resolve the IntCastingNaNError that occurs when the OrdinalEncoder meets unseen data in test or validation datasets, you can configure the OrdinalEncoder to impute unseen categories instead of raising an error. This is done by setting the handle_unseen parameter to "impute" in the _encode_categorical_columns method of the TabularDatamodule class.

Here's the relevant code snippet from the tabular_datamodule.py file:

def _encode_categorical_columns(self, data: DataFrame, stage: str) -> DataFrame:
    if stage != "fit":
        # Inference
        return self.categorical_encoder.transform(data)
    # Fit
    logger.debug("Encoding Categorical Columns using OrdinalEncoder")
    self.categorical_encoder = OrdinalEncoder(
        cols=self.config.categorical_cols,
        handle_unseen=("impute" if self.config.handle_unknown_categories else "error"),
        handle_missing="impute" if self.config.handle_missing_values else "error",
    )
    data = self.categorical_encoder.fit_transform(data)
    return data

Ensure that self.config.handle_unknown_categories is set to True to enable the imputation of unseen categories. This will prevent the IntCastingNaNError by imputing a value for unseen categories instead of causing an error [1].
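Before changing the configuration, it may help to confirm which columns actually contain unseen categories. The helper below is a rough diagnostic sketch using plain pandas; `find_unseen_categories` and the example frames are hypothetical and not part of the library.

```python
from typing import Dict, List, Set

import pandas as pd


def find_unseen_categories(
    train: pd.DataFrame, other: pd.DataFrame, categorical_cols: List[str]
) -> Dict[str, Set]:
    """Return, per column, the values present in `other` but absent from `train`."""
    unseen: Dict[str, Set] = {}
    for col in categorical_cols:
        # Missing values are ignored here; only real category values are compared.
        known = set(train[col].dropna().unique())
        candidates = set(other[col].dropna().unique())
        diff = candidates - known
        if diff:
            unseen[col] = diff
    return unseen


train = pd.DataFrame({"color": ["red", "blue"], "size": ["S", "M"]})
test = pd.DataFrame({"color": ["red", "green"], "size": ["M", "S"]})

report = find_unseen_categories(train, test, ["color", "size"])
print(report)
```

An empty result would suggest the NaN values come from somewhere other than unseen categories, for example missing values in the input.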

dosubot bot added the bug label Jan 6, 2025