Enable UMAP to properly handle sparse data #772
Conversation
Signed-off-by: Rishi Chandra <[email protected]>
build
build
build
build
df_for_scoring = df_for_scoring.select(
    F.slice(feature_col, 1, dim).alias(feature_col), output_col
)
Removing since it looks resolved: NVIDIA/spark-rapids#10770
Good find.
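For context, a small standalone sketch of what the removed workaround did: it trimmed the feature array column to the model dimension with F.slice before scoring. Column names, `dim`, and the data below are illustrative, not taken from the PR.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative input: an array column longer than the model dimension,
# e.g. padded output from an upstream stage (names and values are made up).
df = spark.createDataFrame(
    [([0.1, 0.2, 0.0, 0.0], 1), ([0.3, 0.4, 0.0, 0.0], 2)],
    ["features", "id"],
)

dim = 2
# The removed workaround: keep only the first `dim` elements of the array
# column (F.slice uses 1-based indexing) before scoring.
df_sliced = df.select(F.slice("features", 1, dim).alias("features"), "id")
df_sliced.show(truncate=False)
```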
loc_umap = trustworthiness(input_raw_data.toarray(), embedding, n_neighbors=15)

trust_diff = loc_umap - dist_umap
assert trust_diff <= 0.15
What is the range of the trustworthiness score? Would a tolerance of 0.15 be too large, given that CI actually runs single-GPU UMAP?
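As a reference point on the range: sklearn.manifold.trustworthiness is bounded in [0, 1], with 1.0 meaning local neighborhoods are fully preserved. Below is a minimal sketch of the same comparison shape as the quoted test, using PCA embeddings as stand-ins for the local and distributed UMAP embeddings (purely illustrative).

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

X, _ = load_digits(return_X_y=True)

# Stand-ins for the local (single-GPU) and distributed UMAP embeddings.
emb_local = PCA(n_components=2).fit_transform(X)
emb_dist = PCA(n_components=2, svd_solver="randomized", random_state=0).fit_transform(X)

# trustworthiness returns a score in [0, 1]; higher means neighborhoods are better preserved.
loc_umap = trustworthiness(X, emb_local, n_neighbors=15)
dist_umap = trustworthiness(X, emb_dist, n_neighbors=15)

# The test's tolerance: the distributed embedding may score at most 0.15 below the local one.
assert loc_umap - dist_umap <= 0.15
```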
Re: the discussion on how to avoid chunking twice: chunking when returning from fit is by row count and is meant to conform with maxRecordsPerBatch, so that Spark only loads one batch at a time into memory, whereas chunking on the driver side is based on byte size to stay below the 8GB broadcast limit. During fit we could calculate the tighter of the two restrictions before yielding to the driver and then reuse those chunks during broadcast (see the sketch after this comment). However, if the maxRecordsPerBatch chunks are significantly smaller in bytes, we'd have the overhead of unnecessarily broadcasting lots of small chunks. Not sure there's a fix that makes sense here.
Also, chunking the CSR matrix by row count when returning from _fit poses separate challenges, since the CSR components and the embedding matrix have different row counts, so it's unclear how many chunks we should return there.
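Not the implementation, just a rough sketch of the "compute the tighter of the two limits once and reuse the chunks" idea for a dense array; constants and names are made up, and the CSR case is harder for the reason above.

```python
import numpy as np

MAX_BROADCAST_BYTES = 8 * 1024**3  # Spark's broadcast size limit (8 GiB)

def row_chunks(arr: np.ndarray, max_records_per_batch: int,
               max_bytes: int = MAX_BROADCAST_BYTES):
    """Yield row-wise chunks small enough for both the Arrow batch row limit
    (spark.sql.execution.arrow.maxRecordsPerBatch) and the broadcast byte
    limit, so the same chunks could be reused when broadcasting the model."""
    bytes_per_row = max(arr.itemsize * arr.shape[1], 1)
    rows_by_bytes = max(max_bytes // bytes_per_row, 1)
    rows_per_chunk = int(min(max_records_per_batch, rows_by_bytes))
    for start in range(0, arr.shape[0], rows_per_chunk):
        yield arr[start:start + rows_per_chunk]

# Example: 10k rows of 128-dim float32 embeddings, 2048-row Arrow batches.
emb = np.zeros((10_000, 128), dtype=np.float32)
chunks = list(row_chunks(emb, max_records_per_batch=2048))
assert sum(c.shape[0] for c in chunks) == emb.shape[0]
```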
build
Thanks for addressing the comments. I am good with merging the PR. Some corner cases can be studied in the future.
build
build
Addresses a bug found by QA where UMAP did not properly handle sparse inputs for the 'jaccard' metric.
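A minimal sketch of the input pattern this fixes: binary sparse vectors fit with metric="jaccard". The import path and parameter names below follow the spark-rapids-ml / cuML style but should be treated as illustrative assumptions rather than the exact API.

```python
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

from spark_rapids_ml.umap import UMAP  # assumed import path

spark = SparkSession.builder.getOrCreate()

# Binary sparse features, the typical input for the 'jaccard' metric.
df = spark.createDataFrame(
    [
        (Vectors.sparse(6, {0: 1.0, 3: 1.0}),),
        (Vectors.sparse(6, {1: 1.0, 4: 1.0}),),
        (Vectors.sparse(6, {2: 1.0, 5: 1.0}),),
    ],
    ["features"],
)

# Parameter names follow the cuML-style API; treat them as assumptions.
umap = UMAP(n_neighbors=2, metric="jaccard").setFeaturesCol("features")
model = umap.fit(df)
model.transform(df).show(truncate=False)
```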
Changes:
Handling sparse data:
Reading/writing:
Testing:
Todos:
These features can be targeted for 24.12: