Multidimensional/composite target encoding #429
Comments
If we implemented it for TargetEncoder, we would also need it for all other encoders where each column is encoded independently (that is, all encoders except hashing).
Great! I like the preprocessing idea. I will scope it out and work on a PR for this in the next couple of weeks.
One point of clarification on what you wrote:
The use case I have in mind is to get an encoding of the joint product & color fields, but not return a string column of those two (as it will internally handle the encoding and then throw away the concatenated field). I may or may not want to also encode product & color separately. Here is a pseudocode example for how I would do it now.
So it is a minor nuisance (2 extra lines of code). The cleanest solution I can think of is to allow tuples to be passed into the `cols` arg, where each tuple indicates columns that should be concatenated before encoding. So I may have spoken too soon about the preprocessing idea. Let me know if this seems like enough of a quality-of-life improvement to warrant a modification. I'm also curious about this point:
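To make the tuple idea concrete, here is a rough sketch of how a `cols` list mixing plain names and tuples could be interpreted. The helper `mean_encode` is hypothetical and is not part of `category_encoders`; it uses a plain per-category target mean with no smoothing as a stand-in for a real encoder, and it assumes `_` as the separator:

```python
import pandas as pd

def mean_encode(X, y, cols, sep="_"):
    """Hypothetical helper: each entry of cols is either a column name
    or a tuple of column names to be encoded jointly."""
    out = X.copy()
    for entry in cols:
        if isinstance(entry, tuple):
            # Temporary joint key; it is never stored as a string column.
            key = X[list(entry)].astype(str).apply(sep.join, axis=1)
            name = sep.join(entry)
        else:
            key = X[entry].astype(str)
            name = entry
        # Unsmoothed target mean per category (a simplification).
        out[name] = y.groupby(key).transform("mean")
    return out

X = pd.DataFrame({
    "product": ["shirt", "shirt", "hat", "hat"],
    "color": ["red", "red", "red", "blue"],
})
y = pd.Series([1, 0, 1, 0])

# Encode product+color jointly and color on its own; product stays raw.
encoded = mean_encode(X, y, [("product", "color"), "color"])
```

With this interpretation the caller decides per column whether it is encoded alone, jointly, or both, which is the flexibility discussed above.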
If I were in this position and didn't want to specify columns (maybe I have a long list of categorical columns to encode), then I think it would be simple enough to drop product & color before encoding. Let me know if I'm missing something and if you'd like me to build a solution for it.
I like the idea that you can choose if you want to encode
I would like to do target encoding on the composite of multiple columns, but the current functionality only allows a single column to be encoded.
For example: I have a column named `product` and another named `color`, and I'd like a unique target encoding value for each product+color combination. Currently, I can only have an encoding for each unique product and each unique color separately.
The workaround would be to concatenate the column values together and then target encode, but that is a bit clunky and leads to some unnecessary categorical features in my dataframe. Let me know if this is something worth raising a PR for.
The implementation I'm thinking of is optionally allowing a new argument called something like `composite_cols` (open to better naming suggestions). This arg can be a list of lists, where each inner list indicates the column names to be concatenated together, and each element in the outer list makes up a composite column. If passed, convert the values to string and concatenate them together before passing into the encoder the same way as regular cols. The composite column can be named as the concatenation of all its component column names.
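The preprocessing step described above could be sketched roughly as follows. The function name `add_composite_cols` and the `_` separator are assumptions made for illustration; nothing like this exists in the library as far as this thread shows:

```python
import pandas as pd

def add_composite_cols(df, composite_cols, sep="_"):
    """Hypothetical helper: for each inner list of column names, add one
    string column that concatenates the component values."""
    out = df.copy()
    names = []
    for group in composite_cols:
        # Name the composite column after its component column names.
        name = sep.join(group)
        combined = df[group[0]].astype(str)
        for col in group[1:]:
            combined = combined + sep + df[col].astype(str)
        out[name] = combined
        names.append(name)
    return out, names

df = pd.DataFrame({
    "product": ["shirt", "shirt", "hat"],
    "color": ["red", "blue", "red"],
})
with_composites, composite_names = add_composite_cols(df, [["product", "color"]])
```

The returned `composite_names` could then be appended to the regular `cols` list, so the composite columns go through the encoder the same way as any other column.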