
Stop using boston housing dataset #6180

Open · wants to merge 1 commit into base: branch-25.02
Conversation

@betatim (Member) commented Dec 12, 2024

Instead we use the California housing dataset.

closes #5158


copy-pr-bot bot commented Dec 12, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@github-actions github-actions bot added the Cython / Python Cython or Python issue label Dec 12, 2024
@betatim betatim marked this pull request as ready for review December 12, 2024 13:52
@betatim betatim requested a review from a team as a code owner December 12, 2024 13:52
@betatim betatim requested review from bdice and viclafargue December 12, 2024 13:52
@bdice (Contributor) commented Dec 12, 2024

Looks like there are some failures in hdbscan with the new dataset:

=========================== short test summary info ============================
FAILED test_hdbscan.py::test_hdbscan_sklearn_datasets[housing_dataset-knn-leaf-True-min_samples_cluster_size_bounds1-0.0] - AssertionError: 
Arrays are not almost equal to -25 decimals

(shapes (47,), (48,) mismatch)
 ACTUAL: array([   25,    27,    28,    29,    29,    29,    31,    32,    33,
          35,    36,    36,    40,    41,    43,    45,    47,    49,
          50,    54,    61,    62,    65,    71,    72,    72,    78,...
 DESIRED: array([   25,    25,    25,    26,    29,    30,    30,    31,    32,
          33,    35,    35,    35,    37,    40,    41,    44,    44,
          47,    49,    50,    57,    60,    61,    64,    71,    71,...
FAILED test_hdbscan.py::test_hdbscan_sklearn_datasets[housing_dataset-knn-leaf-False-min_samples_cluster_size_bounds1-0.0] - AssertionError: 
Arrays are not almost equal to -25 decimals

(shapes (47,), (48,) mismatch)
 ACTUAL: array([   25,    27,    28,    29,    29,    29,    31,    32,    33,
          35,    36,    36,    40,    41,    43,    45,    47,    49,
          50,    54,    61,    62,    65,    71,    72,    72,    78,...
 DESIRED: array([   25,    25,    25,    26,    29,    30,    30,    31,    32,
          33,    35,    35,    35,    37,    40,    41,    44,    44,
          47,    49,    50,    57,    60,    61,    64,    71,    71,...
= 2 failed, 14387 passed, 6025 skipped, 646 xfailed, 66 xpassed, 1152 warnings in 4889.78s (1:21:29) =

https://github.com/rapidsai/cuml/actions/runs/12297823749/job/34322294951?pr=6180#step:8:8331

@dantegd (Member) commented Dec 13, 2024

@betatim can you open an issue with the log of the error?

I think we should use this PR to debug that failure. It fails with this specific set of relevant parameters:

  • cluster_selection_method=leaf not eom
  • cluster_selection_epsilon=0.0 not the larger ones
  • min_samples=50
  • min_cluster_size=25

This combination of parameters should give us a clue when debugging. Importantly, it happens with cluster_selection_epsilon=0.0, which means it would affect out-of-sample inference.

On a first quick thought, this suggests to me a bug either in the leaf code (in void leaf(const raft::handle_t& handle, ...)) or somewhere else in the cluster selection logic.

@dantegd (Member) commented Dec 13, 2024

Something additional that always frustrates me about asserts in pytest: I'd love to know beforehand the results of the following assertions (calculating adjusted_rand_score and checking for matching cluster_persistence_). In more recent tests I've been moving away from a simple assert that fails on the first mismatch, and instead accumulating the errors in a list so the test fails and displays them all at the end:

# Report all errors at the end

Might be useful to adopt that approach in general; I will open an issue to capture that as well.
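The accumulate-then-assert pattern described above can be sketched like this (a generic illustration, not the actual cuML test; the check names and values are made up):

```python
def check_all(checks):
    """Run every (description, passed) check and fail once at the end
    with a combined report, instead of stopping at the first assert."""
    errors = [desc for desc, passed in checks if not passed]
    # Report all errors at the end
    assert not errors, "failed checks:\n" + "\n".join(errors)

# Illustrative usage with made-up results:
check_all([
    ("adjusted_rand_score above threshold", 0.97 > 0.95),
    ("cluster_persistence_ matches", True),
])
```

When several checks fail, the single assertion message lists all of them, which is exactly the extra context a CI log would otherwise hide.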

@betatim (Member, Author) commented Dec 13, 2024

An interesting thing that I didn't spot initially is that the lengths of the arrays differ. So it isn't just the number of samples assigned to each cluster that differs, but also the total number of clusters (I think).

Poking around a bit to understand what is what.

@betatim (Member, Author) commented Dec 13, 2024

adjusted_rand_score(cuml_agg.labels_, sk_agg.labels_)=0.9694956618348944
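For reference, the adjusted Rand index reported above measures agreement between two labelings while correcting for chance; a score near 1.0 means the clusterings are nearly identical. A minimal pure-Python sketch of the formula (not the scikit-learn implementation used in the test) is:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI from the pair-counting contingency table."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # chance agreement
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions score 1.0 even when the label ids are permuted, which is why it is a better robustness check than comparing raw label arrays.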

@divyegala (Member) commented Dec 17, 2024

@betatim @dantegd

Here's a full sequence of min_samples values that occur in the mutual reachability graph for the last data point in the housing dataset:

10.9316,10.9316,10.9316,10.9316,13.351,11.2694,10.9316,15.2398,10.9316,10.9316,11.3027,12.227,11.1692,10.9772,13.3884,10.9316,13.3978,11.4673,12.6491,11.3798,10.9316,10.9316,10.9316,11.2694,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,12.1347,15.6445,10.9316,18.2688,11.8954,10.9316,11.5434,12.052,10.9316,11.2138,10.9316,12.5399,10.9316,10.9316,10.9316,10.9316,10.9316,11.3248,10.9316,10.9316,14.2653,10.9316,11.1131,11.6833,10.9316,10.9316,12.0934,10.9316,11.2027,10.9316,10.9316,10.9316,10.9316,10.9316,10.9316,3.40282e+38

where the minimum of this array, 10.9316, is highly repeated. In fact, there appear to be very few unique edge weights for this data point. Thus, no parallel MST can be deterministic, because any edge with weight 10.9316 could be selected to build the MST. A sequential MST, however, would be deterministic and always select the same edge despite the ties (most likely the first instance of 10.9316). Once this difference occurs, completely different dendrograms can be constructed, which leads to very different final solutions.

NOTE: I can also verify by eye that almost all data points in the mutual reachability graph have similarly repetitive edge-weight patterns.
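The tie-breaking issue can be illustrated with a toy example (a sketch, not cuML's MST code): with repeated edge weights, Kruskal's algorithm produces different, equally valid spanning trees depending on the order in which tied edges are visited.

```python
def kruskal(n, edges):
    """Kruskal's MST with union-find; ties broken by input order."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, key=lambda e: e[0]):  # stable sort
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

# A 4-cycle where every edge has the same weight: any 3 of the 4
# edges form a valid MST, so the result depends on visit order.
edges = [(1.0, 0, 1), (1.0, 1, 2), (1.0, 2, 3), (1.0, 3, 0)]
t1 = kruskal(4, edges)
t2 = kruskal(4, list(reversed(edges)))
# Both trees have total weight 3.0, but they contain different edges.
```

A parallel MST has even less control over which tied edge wins, so two runs (or two implementations) can legitimately disagree, which then cascades into different dendrograms.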

@betatim (Member, Author) commented Dec 19, 2024

The takeaway here is that we can't make the current test pass, and that is a feature (not a bug)? Because cuML uses a parallel MST construction, we can't promise to break ties in the same way as a sequential implementation.

If yes, I see two (+1) options:

  1. special-case the California housing dataset to only check e.g. the score but not the exact clustering
  2. in general, loosen the requirements in the test, because we have just been lucky that it passes for the other datasets
  3. try other datasets until we find one that passes the test

WDYT?
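Option 1 could be sketched roughly as follows (a hypothetical illustration; the function and variable names are placeholders, not the actual test's code):

```python
def compare_clusterings(dataset_name, cuml_sizes, sk_sizes, score):
    """Compare cuML vs scikit-learn cluster-size arrays, relaxing the
    check only for the dataset where tie-breaking is unstable."""
    if dataset_name == "housing_dataset":
        # Ties in the parallel MST make exact cluster sizes unstable;
        # only require a high agreement score.
        assert score > 0.95, f"agreement score too low: {score}"
    else:
        assert sorted(cuml_sizes) == sorted(sk_sizes)
```

Option 2 would instead apply the score-only check to every dataset rather than special-casing one.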

@divyegala (Member) commented

@betatim I would want to try a dataset with fewer repetitive patterns. The California housing dataset has many repeated values in the first 3 columns.

Labels
Cython / Python Cython or Python issue
Development

Successfully merging this pull request may close these issues.

[BUG] Remove usage of Boston dataset in test files
4 participants