Steps to Reproduce the Problem
Add the code from this issue to the code from the linked example.
Run it.
Specifications
Version: 2.5.0
Platform: Linux zboox 5.18.3-1-MANJARO #1 SMP PREEMPT_DYNAMIC Thu Jun 9 09:54:55 UTC 2022 x86_64 GNU/Linux
Subsystem:
> Or is it the constant different from 1 when there is only one occurrence?

This is actually the case in the current implementation. It is done to avoid over-fitting. However, there is a discussion about whether we should drop this behaviour and instead handle categories with a small sample size via regularization. This is also the case in, e.g., the target encoder. We discuss this in issue #327.
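The behaviour described in this reply can be sketched in a few lines. This is only an illustration based on the numbers reported in this issue, not the library's code: the helper name `encoded_value` is hypothetical, and the rule "a category seen only once gets the prior" is inferred from the reply and from the observed value 2.5 for video games.

```python
def encoded_value(target_sum, count, prior):
    """Hypothetical helper: value assigned to one category at transform time.

    Assumption (taken from this thread): a category that occurs only once
    falls back to the prior; otherwise (TargetSum + Prior) / (FeatureCount + 1).
    """
    if count <= 1:
        return prior
    return (target_sum + prior) / (count + 1)


prior = 2.5  # global mean of 'grade' in the issue's data

# (sum, count) per category of 'interests', copied from the issue
interests = {
    "instruments": (7, 2),
    "painting": (9, 3),
    "sketching": (8, 4),
    "video games": (1, 1),
}

encoded = {name: encoded_value(s, c, prior) for name, (s, c) in interests.items()}

for name, value in encoded.items():
    print(f"{name:12s} {value:.6f}")
```

With this fallback, video games comes out as 2.5 instead of 1.75, which matches what the encoder returned in the report.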
Expected Behavior
I have not found a function that maps the encoded values back to the categorical values when using category_encoders' CatBoostEncoder.
I was trying to do it manually using the following equation: (TargetSum + Prior) / (FeatureCount + 1)
I am using the following example:
https://www.geeksforgeeks.org/categorical-encoding-with-catboost-encoder/
And the following code:
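import category_encoders as ce
# train and target are the DataFrames built in the linked example
cbe_encoder = ce.cat_boost.CatBoostEncoder()
mapp = cbe_encoder.fit(train, target)
# Prior: global mean of the target
prior = target['grade'].sum() / len(train)
# Per-category target sum and count stored by the fitted encoder
color = mapp.mapping.get('color').reset_index()
color['encoder'] = (color['sum'] + prior) / (color['count'] + 1)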
Same for the column 'interests'.
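For readers without the linked example at hand, the manual calculation above can be reproduced from the raw rows alone. The following is a minimal sketch in plain Python that does not call category_encoders; the function name `manual_encoding` is made up for illustration, and it applies the equation exactly as written in this issue, without any special handling of rare categories.

```python
from collections import defaultdict

# Training rows from the issue: (color, interests, grade)
rows = [
    ("red", "sketching", 1), ("blue", "painting", 2),
    ("blue", "instruments", 3), ("green", "sketching", 2),
    ("red", "painting", 3), ("red", "video games", 1),
    ("black", "painting", 4), ("black", "instruments", 4),
    ("blue", "sketching", 2), ("green", "sketching", 3),
]


def manual_encoding(categories, targets):
    """Apply (TargetSum + Prior) / (FeatureCount + 1) to every category."""
    prior = sum(targets) / len(targets)  # global mean of the target
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: (sums[cat] + prior) / (counts[cat] + 1) for cat in sums}


grades = [r[2] for r in rows]
color_enc = manual_encoding([r[0] for r in rows], grades)
interests_enc = manual_encoding([r[1] for r in rows], grades)

for name, value in sorted(interests_enc.items()):
    print(f"{name:12s} {value:.6f}")
```

The plain equation gives 1.75 for video games, which is the value from the manual calculation rather than the one returned by the encoder.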
   color    interests  height  grade
0    red    sketching      68      1
1   blue     painting      64      2
2   blue  instruments      87      3
3  green    sketching      45      2
4    red     painting      54      3
5    red  video games      64      1
6  black     painting      67      4
7  black  instruments      98      4
8   blue    sketching      90      2
9  green    sketching      87      3

   color  interests  height
0  1.875   2.100000      68
1  2.375   2.875000      64
2  2.375   3.166667      87
3  2.500   2.100000      45
4  1.875   2.875000      54
5  1.875   2.500000      64
6  3.500   2.875000      67
7  3.500   3.166667      98
8  2.375   2.100000      90
9  2.500   2.100000      87

2.5

   index  sum  count  encoder
0  black    8      2    3.500
1   blue    7      3    2.375
2  green    5      2    2.500
3    red    5      3    1.875

         index  sum  count   encoder
0  instruments    7      2  3.166667
1     painting    9      3  2.875000
2    sketching    8      4  2.100000
3  video games    1      1  1.750000
Actual Behavior
As you can see, all values match except for video games: the encoder assigns it 2.5, but applying the equation yields 1.75, which seems to be the correct value to me.
Or is it the constant different from 1 when there is only one occurrence?
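As for the original goal of mapping encoded values back to categories, one possible approach is a reverse lookup built from the per-category values. The sketch below is an assumption-laden illustration, not a category_encoders feature: `build_reverse_lookup` and `decode` are hypothetical names, the values are the ones the encoder returned for 'interests' in this issue, and it presumes the encoded column holds one fixed value per category, as in the output above. It returns a list because several categories can share one encoded value, so the inverse is not guaranteed to be unique.

```python
from collections import defaultdict


def build_reverse_lookup(encoding, ndigits=6):
    """Map each rounded encoded value to the categories that produce it."""
    reverse = defaultdict(list)
    for category, value in encoding.items():
        reverse[round(value, ndigits)].append(category)
    return dict(reverse)


# Encoded values for 'interests' as returned by the encoder in this issue
interests_encoding = {
    "instruments": 3.166667,
    "painting": 2.875,
    "sketching": 2.1,
    "video games": 2.5,
}

reverse = build_reverse_lookup(interests_encoding)


def decode(value, ndigits=6):
    """Return all categories whose encoded value matches, or an empty list."""
    return reverse.get(round(value, ndigits), [])


print(decode(2.5))       # candidates for the encoded value 2.5
print(decode(3.166667))  # candidates for the encoded value 3.166667
```

Rounding to a fixed number of digits is a design choice to tolerate printing precision; a separate lookup is needed for each encoded column.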