
Enable UMAP to properly handle sparse data #772

Merged
merged 15 commits into NVIDIA:branch-24.10 on Nov 9, 2024

Conversation

rishic3 (Collaborator) commented Oct 30, 2024

Addresses a bug found by QA, where UMAP did not properly handle sparse inputs for the 'jaccard' metric.

Changes:

Handling sparse data:

  • Convert UMAP input data to a CSR matrix in the case of SparseVectors by inheriting the enable_sparse_data_optim code.
  • Return and store the "raw_data" attribute after fit as a dict of lists containing the CSR attributes.
  • Chunk and broadcast each CSR attribute independently, storing the results in a dict.
  • Fixed a bug where UMAP inherited from the wrong CumlModel and returned a new dataframe rather than adding columns to the input dataframe.
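The CSR attribute round-trip described above can be sketched as follows. This is an illustrative sketch only (the function and key names are assumptions, not the PR's actual code): decompose a SciPy CSR matrix into plain Python lists of its defining arrays, then reassemble it.

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_to_attrs(csr):
    # Store the three defining CSR arrays plus the shape as plain Python
    # objects, suitable for a dict-of-lists "raw_data"-style attribute.
    return {
        "indices": csr.indices.tolist(),
        "indptr": csr.indptr.tolist(),
        "data": csr.data.tolist(),
        "shape": csr.shape,
    }

def attrs_to_csr(attrs):
    # Reassemble the CSR matrix from its stored components.
    return csr_matrix(
        (attrs["data"], attrs["indices"], attrs["indptr"]), shape=attrs["shape"]
    )

dense = np.array([[0.0, 1.0], [2.0, 0.0]])
roundtrip = attrs_to_csr(csr_to_attrs(csr_matrix(dense)))
assert (roundtrip.toarray() == dense).all()
```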

Reading/writing:

  • Added logic to save/load CSR array components.
  • Refactored to save model attributes as a single compressed .npz file covering all chunks (rather than separate .npy files) - this costs compression time but uses less disk space.
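The single-archive save/load can be sketched like this. A minimal sketch assuming only NumPy/SciPy; the file name and archive keys are illustrative, not the PR's actual persistence format.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

mat = sparse_random(10, 5, density=0.3, format="csr", random_state=0)

# Save all CSR components into one compressed .npz archive instead of
# separate .npy files: slower to write, but smaller on disk.
path = os.path.join(tempfile.mkdtemp(), "model_attrs.npz")
np.savez_compressed(
    path,
    data=mat.data,
    indices=mat.indices,
    indptr=mat.indptr,
    shape=np.array(mat.shape),
)

with np.load(path) as npz:
    restored = csr_matrix(
        (npz["data"], npz["indices"], npz["indptr"]), shape=tuple(npz["shape"])
    )

assert (restored != mat).nnz == 0  # identical values and sparsity pattern
```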

Testing:

  • Added test for sparse inputs.
  • Added test for model persistence in sparse case, including when data is saved/loaded in chunks.

Todos:

These features can be targeted for 24.12:

  • Extend SparseDataGen to generate binary sparse dataset for testing jaccard metric (can be reused for exact/approx KNN)
  • Refactor UMAP model persistence to enable saving/loading arrays to cloud storage (e.g., put arrays into a spark dataframe and use Spark dataframe writer)
  • Chunking by bytes/rows currently assumes a relatively uniform memory distribution across rows - add better handling of skewed data.

@rishic3 rishic3 marked this pull request as ready for review October 30, 2024 23:39
rishic3 (Collaborator Author) commented Oct 31, 2024

build

rishic3 (Collaborator Author) commented Nov 4, 2024

build

rishic3 (Collaborator Author) commented Nov 4, 2024

build

rishic3 (Collaborator Author) commented Nov 4, 2024

build

df_for_scoring = df_for_scoring.select(
F.slice(feature_col, 1, dim).alias(feature_col), output_col
)

rishic3 (Collaborator Author) commented on the lines above:

Removing since it looks resolved: NVIDIA/spark-rapids#10770

A collaborator replied:

Good find.

loc_umap = trustworthiness(input_raw_data.toarray(), embedding, n_neighbors=15)

trust_diff = loc_umap - dist_umap
assert trust_diff <= 0.15
A collaborator commented on the lines above:

What is the range of the trustworthiness score? Would a tolerance of 0.15 be too large, given that CI actually runs single-GPU UMAP?
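For context on the range question: scikit-learn's trustworthiness score lies in [0, 1], where 1 means the local neighborhood structure is perfectly preserved. A quick self-contained check (the data here is synthetic, not the test's dataset):

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
emb = X[:, :2]  # a trivial "embedding": just the first two coordinates

# n_neighbors must be < n_samples / 2; 15 matches the test snippet above.
score = trustworthiness(X, emb, n_neighbors=15)
assert 0.0 <= score <= 1.0  # trustworthiness is bounded in [0, 1]
```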

rishic3 (Collaborator Author) commented Nov 6, 2024

Re: discussion on how to avoid chunking twice - chunking when returning from fit is by # rows, and is meant to help conform with maxRecordsPerBatch so that Spark only loads a batch at a time into memory, whereas chunking on the driver side is based on # bytes to stay below the 8GB broadcast limit.

During fit we can calculate the tighter of the two restrictions prior to yielding to the driver and then reuse these chunks during broadcast. Though if the maxRecordsPerBatch chunk is significantly smaller in bytes, we'd have the overhead of unnecessarily broadcasting lots of small chunks. Not sure if there's a fix that makes sense here.
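The driver-side byte-based chunking described above can be sketched as follows. This is illustrative only: the helper name and the byte limit are assumptions, and the real code would apply this to each CSR component array before broadcast.

```python
import numpy as np

def chunk_by_bytes(arr, max_bytes):
    # Split a 1-D array into slices whose size stays under max_bytes,
    # assuming roughly uniform bytes per element (the caveat noted above).
    elems_per_chunk = max(1, max_bytes // arr.itemsize)
    return [arr[i:i + elems_per_chunk] for i in range(0, len(arr), elems_per_chunk)]

arr = np.arange(10, dtype=np.float64)  # 8 bytes per element
chunks = chunk_by_bytes(arr, 32)       # at most 4 elements per chunk
assert all(c.nbytes <= 32 for c in chunks)
assert np.concatenate(chunks).tolist() == arr.tolist()
```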

rishic3 (Collaborator Author) commented Nov 6, 2024

Also, chunking the CSR matrix by number of rows when returning from _fit poses separate challenges, since the CSR components and the embedding matrix have different numbers of rows, so it's unclear how many chunks we should return here.
If we could chunk the CSR matrix directly and reconcatenate using scipy.sparse.vstack, that would be ideal, but there's no easy way to get the CSR matrix into a serializable format (i.e., Python lists) without extracting the components. Will think about this more.
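The scipy.sparse.vstack idea mentioned above can be illustrated directly (a sketch, not the PR's code): slice a CSR matrix row-wise into chunks, then reconcatenate losslessly.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

mat = csr_matrix(np.arange(12, dtype=np.float64).reshape(6, 2))

# Row-wise chunks; each slice is itself a CSR matrix.
chunks = [mat[i:i + 2] for i in range(0, mat.shape[0], 2)]

rebuilt = vstack(chunks, format="csr")
assert (rebuilt != mat).nnz == 0  # lossless reconcatenation
```

The serialization obstacle described above remains: each chunk is still a CSR object, not a plain-Python structure, so its components would still need extracting before broadcast.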

rishic3 (Collaborator Author) commented Nov 6, 2024

build

lijinf2 (Collaborator) commented Nov 7, 2024

Thanks for addressing the comments. I am good with merging the PR. Some corner cases can be studied in the future.

rishic3 (Collaborator Author) commented Nov 8, 2024

build

lijinf2 (Collaborator) commented Nov 8, 2024

build

@rishic3 rishic3 merged commit d10e9f0 into NVIDIA:branch-24.10 Nov 9, 2024
2 checks passed
@rishic3 rishic3 deleted the umap-sparse branch November 9, 2024 00:32