Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(WIP) Partial fix for getting feature names out #398

Conversation

JaimeArboleda
Copy link
Contributor

I think this is a partial fix for this opened issue:

#395

It remains to check the behaviour of other estimators that are not ONE_TO_ONE.

Please, let me know if you like the work in progress and I will try to continue.

of pandas dataframe, to ensure full compatibility. We do that in the
__init__.py of the library. Besides, the get_feature_names_out methods
have a modified signature to match what is expected from them in
sklearn. A complete suite of tests have been added to check that feature
names work properly.
modified:   category_encoders/__init__.py
modified:   category_encoders/quantile_encoder.py
modified:   category_encoders/utils.py
modified:   tests/test_feature_names.py
import warnings
from textwrap import dedent

if sklearn.__version__ < '1.2.0':
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like the input warning here. I know of some users who use this library in their project and they try to suppress the warning for their end-users.
Also, at the moment it also does not work anyway.
I'd be an favor of not issuing a warning here but have something on the index.rst (c.f. below comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree, in fact warnings are annoying. Let's remove this. Do you want me to add a new commit or you prefer to do it in the merge process?

* Full compatibility with sklearn pipelines, input an array-like dataset like any other transformer
* Full compatibility with sklearn pipelines, input an array-like dataset like any other transformer (\*)

(\*) For full compatibility with Pipelines and ColumnTransformers, and consistent behaviour of `get_feature_names_out`, it's recommended to upgrade `sklearn` to a version at least '1.2.0' and to set output as pandas:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please introduce a section known issues here after usage and before contents, that states something like:

"""
CategoryEncoders internally works with pandas DataFrames as apposed to sklearn which works with numpy arrays. This can cause problems in sklearn versions prior to 1.2.0. In order to ensure full compatibility with sklearn set sklearn to also output DataFrames. This can be done by

sklearn.set_config(transform_output="pandas")

for a whole project or just for a single pipeline using

Pipeline(
    steps=[
        ("preprocessor", SomePreprocessor().set_output("pandas"),
        ("encoder", SomeEncoder()),
    ]
)

If you experience another bug feel free to report it on github https://github.com/scikit-learn-contrib/category_encoders/issues

"""
I think this makes it sufficiently clear and is better than the status quo. Do you agree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's very good, yes. Thanks! I don't see any need for more. Let me add this piece and remove the other one so that you can merge :)

@PaulWestenthanner
Copy link
Collaborator

Perfect. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants