Minor bias in split selection? #228

jlmelville · 2023-10-21T03:17:11Z

In euclidean_random_projection_split, this part of the code is picking two random points to form the hyperplane:

Lines 221 to 224 in db258ce

    left_index = tau_rand_int(rng_state) % indices.shape[0]
    right_index = tau_rand_int(rng_state) % indices.shape[0]
    right_index += left_index == right_index
    right_index = right_index % indices.shape[0]

right_index has an ever so slight bias toward being chosen as 0: when left_index is the last position (indices.shape[0] - 1) and right_index is also sampled as the last position, the increment pushes it past the end and the modulo wraps it around ("overflows" it) to 0.

If right_index was selected as:

    right_index = tau_rand_int(rng_state) % (indices.shape[0] - 1)

then the bias is removed and there is no need for the right_index = right_index % indices.shape[0] -- I think?

This also affects the angular and sparse variants, but I assume this doesn't really matter. I am mainly asking to make sure I didn't miss something.
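
For anyone who wants to check this without running the library, here is a minimal sketch that enumerates the exact probabilities for a small node. It makes several assumptions: that tau_rand_int(rng_state) % n is itself uniform on 0..n-1, that the proposal keeps the `right_index += left_index == right_index` line and only drops the final modulo (one reading of the suggestion above), and it adds a third, swapped-in variant (shift every draw at or above left_index up by one) purely for comparison. The function names are hypothetical and do not exist in the library.

```python
from collections import defaultdict
from fractions import Fraction


def current_scheme(left, n):
    """Final right index for each equally likely draw, as in the quoted code."""
    out = []
    for right in range(n):          # models tau_rand_int(...) % n
        right += left == right      # bump by one on a collision
        out.append(right % n)       # wrap around past the end
    return out


def proposed_scheme(left, n):
    """Assumed reading of the proposal: draw modulo n - 1, keep the bump."""
    out = []
    for right in range(n - 1):      # models tau_rand_int(...) % (n - 1)
        right += left == right
        out.append(right)
    return out


def shift_scheme(left, n):
    """Comparison variant (not from the issue): skip over left entirely."""
    return [right + (right >= left) for right in range(n - 1)]


def distributions(scheme, n):
    """Exact joint P(left, right) and marginal P(right), left uniform on 0..n-1."""
    joint = defaultdict(Fraction)
    marginal = defaultdict(Fraction)
    for left in range(n):
        outcomes = scheme(left, n)
        weight = Fraction(1, n * len(outcomes))
        for right in outcomes:
            joint[(left, right)] += weight
            marginal[right] += weight
    return joint, marginal


n = 4  # a small node, where any bias is largest
for scheme in (current_scheme, proposed_scheme, shift_scheme):
    joint, marginal = distributions(scheme, n)
    print(scheme.__name__)
    print("  marginal of right:", [str(marginal[k]) for k in range(n)])
    print("  distinct pair probabilities:", sorted({str(p) for p in joint.values()}))
```

Under these assumptions, the quoted code gives right_index a uniform marginal, but the pair (left_index, left_index + 1), wrapping at the end, comes out twice as likely as any other pair, so the skew appears to be toward the next position rather than toward 0 specifically. With the literal reading of the proposal, the last position is only reachable when left_index is the second-to-last one. The shift variant assigns every ordered pair of distinct positions the same probability.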


lmcinnes · 2023-10-22T02:34:15Z

No, I think you are correct. This should probably be fine for large indices, but when you get low in the tree perhaps it could actually come into play? Thanks for pointing this out; I'll have to see if I can run some small experiments and see if it is worth trying to fix this up.
