added option to use sklearn's OneHotEncoder to handle unknown categories #174
…ethod | added description
…the one_hot_encoder is fitted)
Thanks for starting this PR! This is a tricky topic. Have you tried running the unit tests? I think this will fail due to supplementary columns... I have booked some time on my calendar to look into this. I'll let you know.
And thanks for the appreciation :)
Hi, thank you. I didn't try the unit tests, and as you said, they are failing. Please let me know if there is anything I can do. Also, may I know the reason for having supplementary columns?
I modified the mca file to handle unknown features. The unit test fails because some features seen in fit are not seen when transforming, so I modified the _prepare function in mca.py:
I ran the unit tests and didn't have issues on my side. Please let me know if this works.
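For readers following along, here is a minimal sketch of one way to handle columns that differ between fit and transform. It is not the code from this PR or from mca.py: the `DummyAligner` class and its `columns_` attribute are hypothetical names used only for illustration. It remembers the indicator columns produced at fit time and reindexes to them at transform time.

```python
import pandas as pd


class DummyAligner:
    """Hypothetical helper (not part of the library) that keeps dummy columns stable."""

    def fit(self, X):
        # Remember the indicator columns produced from the training data.
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X, dtype=int)
        # Columns missing from X are added as zeros;
        # columns that were not seen during fit are dropped.
        return dummies.reindex(columns=self.columns_, fill_value=0)


train = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "S"]})
test = pd.DataFrame({"color": ["green", "blue"], "size": ["S", "L"]})

aligner = DummyAligner().fit(train)
aligned = aligner.transform(test)
print(aligned)
```

With this approach an unknown category such as "green" simply yields a row of zeros for that feature, and the transformed frame always has the same columns as the one seen during fit.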
OK, thanks for looking into it. I will take a good look! I also want to make sure the change you're bringing resolves this issue.
Sure, thank you. I saw the error in the clean code test and made a change.
Hi @MaxHalford, is there any update on this?
Hey @Vaseekaran-V! I finally carved out some time to look into this. Turns out I found a simpler solution in #181
This library is amazing. I noticed a small issue when using Multiple Correspondence Analysis: because it uses pd.get_dummies internally to one-hot encode the data, I got an error when my test set contained categories in certain categorical features that were not present in the train set.
Therefore, I have initialized a OneHotEncoder object from sklearn.preprocessing to process the data when the user opts out of the get_dummies function.
These are the three attributes that I have specified:
I have updated the _prepare function as well:
Let me know if there is anything else I can do, or whether the workings are correct.
Thanks again for this great library <3