-
Notifications
You must be signed in to change notification settings - Fork 397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(WIP) Partial fix for getting feature names out #398
(WIP) Partial fix for getting feature names out #398
Conversation
…r one hot encoder
of pandas dataframe, to ensure full compatibility. We do that in the __init__.py of the library. Besides, the get_feature_names_out methods have a modified signature to match what is expected from them in sklearn. A complete suite of tests have been added to check that feature names work properly. modified: category_encoders/__init__.py modified: category_encoders/quantile_encoder.py modified: category_encoders/utils.py modified: tests/test_feature_names.py
category_encoders/__init__.py
Outdated
import warnings | ||
from textwrap import dedent | ||
|
||
if sklearn.__version__ < '1.2.0': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like the input warning here. I know of some users who use this library in their project and they try to suppress the warning for their end-users.
Also, at the moment it also does not work anyway.
I'd be an favor of not issuing a warning here but have something on the index.rst
(c.f. below comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully agree, in fact warnings are annoying. Let's remove this. Do you want me to add a new commit or you prefer to do it in the merge process?
* Full compatibility with sklearn pipelines, input an array-like dataset like any other transformer | ||
* Full compatibility with sklearn pipelines, input an array-like dataset like any other transformer (\*) | ||
|
||
(\*) For full compatibility with Pipelines and ColumnTransformers, and consistent behaviour of `get_feature_names_out`, it's recommended to upgrade `sklearn` to a version at least '1.2.0' and to set output as pandas: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please introduce a section known issues
here after usage and before contents, that states something like:
"""
CategoryEncoders internally works with pandas DataFrames
as apposed to sklearn which works with numpy arrays. This can cause problems in sklearn versions prior to 1.2.0
. In order to ensure full compatibility with sklearn set sklearn to also output DataFrames
. This can be done by
sklearn.set_config(transform_output="pandas")
for a whole project or just for a single pipeline using
Pipeline(
steps=[
("preprocessor", SomePreprocessor().set_output("pandas"),
("encoder", SomeEncoder()),
]
)
If you experience another bug feel free to report it on github https://github.com/scikit-learn-contrib/category_encoders/issues
"""
I think this makes it sufficiently clear and is better than the status quo. Do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's very good, yes. Thanks! I don't see any need for more. Let me add this piece and remove the other one so that you can merge :)
Perfect. Thanks! |
I think this is a partial fix for this opened issue:
#395
It remains to check the behaviour of other estimators that are not ONE_TO_ONE.
Please, let me know if you like the work in progress and I will try to continue.