
Enable UMAP to properly handle sparse data #772

Merged
merged 15 commits into NVIDIA:branch-24.10 on Nov 9, 2024

Conversation

rishic3 (Collaborator) commented Oct 30, 2024

Addresses a bug found by QA, where UMAP did not properly handle sparse inputs for the 'jaccard' metric.

Changes:

Handling sparse data:

  • Convert UMAP input data to a CSR matrix in the case of SparseVectors by inheriting the enable_sparse_data_optim code.
  • Return and store the "raw_data" attribute after fit as a dict of lists containing the CSR attributes.
  • Chunk and broadcast each CSR attribute independently, storing the results in a dict.
  • Fixed a bug where UMAP inherited from the wrong CumlModel and returned a new dataframe rather than adding columns to the input dataframe.
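The CSR attribute round-trip described above can be sketched as follows. This is an illustrative sketch only (the function and key names are assumptions, not the PR's actual code): decompose a SciPy CSR matrix into plain Python lists of its defining arrays, then reassemble it.

```python
import numpy as np
from scipy.sparse import csr_matrix

def csr_to_attrs(csr):
    # Store the three defining CSR arrays plus the shape as plain Python
    # objects, suitable for a dict-of-lists "raw_data"-style attribute.
    return {
        "indices": csr.indices.tolist(),
        "indptr": csr.indptr.tolist(),
        "data": csr.data.tolist(),
        "shape": csr.shape,
    }

def attrs_to_csr(attrs):
    # Reassemble the CSR matrix from its stored components.
    return csr_matrix(
        (attrs["data"], attrs["indices"], attrs["indptr"]), shape=attrs["shape"]
    )

dense = np.array([[0.0, 1.0], [2.0, 0.0]])
roundtrip = attrs_to_csr(csr_to_attrs(csr_matrix(dense)))
assert (roundtrip.toarray() == dense).all()
```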

Reading/writing:

  • Added logic to save/load CSR array components.
  • Refactored to save model attributes as a single compressed .npz file covering all chunks (rather than separate .npy files) - this costs compression time but uses less disk space.
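The single-archive save/load can be sketched like this. A minimal sketch assuming only NumPy/SciPy; the file name and archive keys are illustrative, not the PR's actual persistence format.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

mat = sparse_random(10, 5, density=0.3, format="csr", random_state=0)

# Save all CSR components into one compressed .npz archive instead of
# separate .npy files: slower to write, but smaller on disk.
path = os.path.join(tempfile.mkdtemp(), "model_attrs.npz")
np.savez_compressed(
    path,
    data=mat.data,
    indices=mat.indices,
    indptr=mat.indptr,
    shape=np.array(mat.shape),
)

with np.load(path) as npz:
    restored = csr_matrix(
        (npz["data"], npz["indices"], npz["indptr"]), shape=tuple(npz["shape"])
    )

assert (restored != mat).nnz == 0  # identical values and sparsity pattern
```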

Testing:

  • Added test for sparse inputs.
  • Added test for model persistence in sparse case, including when data is saved/loaded in chunks.

Todos:

These features can be targeted for 24.12:

  • Extend SparseDataGen to generate binary sparse dataset for testing jaccard metric (can be reused for exact/approx KNN)
  • Refactor UMAP model persistence to enable saving/loading arrays to cloud storage (e.g., put arrays into a spark dataframe and use Spark dataframe writer)
  • Chunking by bytes/rows currently assumes a relatively uniform memory distribution across rows - add better handling of skewed data.

@rishic3 rishic3 marked this pull request as ready for review October 30, 2024 23:39
rishic3 (Collaborator Author) commented Oct 31, 2024

build

rishic3 (Collaborator Author) commented Nov 4, 2024

build

rishic3 (Collaborator Author) commented Nov 4, 2024

build

rishic3 (Collaborator Author) commented Nov 4, 2024

build

df_for_scoring = df_for_scoring.select(
F.slice(feature_col, 1, dim).alias(feature_col), output_col
)

rishic3 (Collaborator Author) commented on the lines above:

Removing since it looks resolved: NVIDIA/spark-rapids#10770

A collaborator replied:

Good find.

loc_umap = trustworthiness(input_raw_data.toarray(), embedding, n_neighbors=15)

trust_diff = loc_umap - dist_umap
assert trust_diff <= 0.15
A collaborator commented on the lines above:

What is the range of the trustworthiness score? Would a tolerance of 0.15 be too large, given that CI actually runs single-GPU UMAP?
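For context on the range question: scikit-learn's trustworthiness score lies in [0, 1], where 1 means the local neighborhood structure is perfectly preserved. A quick self-contained check (the data here is synthetic, not the test's dataset):

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
emb = X[:, :2]  # a trivial "embedding": just the first two coordinates

# n_neighbors must be < n_samples / 2; 15 matches the test snippet above.
score = trustworthiness(X, emb, n_neighbors=15)
assert 0.0 <= score <= 1.0  # trustworthiness is bounded in [0, 1]
```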

rishic3 (Collaborator Author) commented Nov 6, 2024

Re: discussion on how to avoid chunking twice - chunking when returning from fit is by # rows, and is meant to help conform with maxRecordsPerBatch so that Spark only loads a batch at a time into memory, whereas chunking on the driver side is based on # bytes to stay below the 8GB broadcast limit.

During fit we can calculate the tighter of the two restrictions prior to yielding to the driver and then reuse these chunks during broadcast. Though if the maxRecordsPerBatch chunk is significantly smaller in bytes, we'd have the overhead of unnecessarily broadcasting lots of small chunks. Not sure if there's a fix that makes sense here.
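The driver-side byte-based chunking described above can be sketched as follows. This is illustrative only: the helper name and the byte limit are assumptions, and the real code would apply this to each CSR component array before broadcast.

```python
import numpy as np

def chunk_by_bytes(arr, max_bytes):
    # Split a 1-D array into slices whose size stays under max_bytes,
    # assuming roughly uniform bytes per element (the caveat noted above).
    elems_per_chunk = max(1, max_bytes // arr.itemsize)
    return [arr[i:i + elems_per_chunk] for i in range(0, len(arr), elems_per_chunk)]

arr = np.arange(10, dtype=np.float64)  # 8 bytes per element
chunks = chunk_by_bytes(arr, 32)       # at most 4 elements per chunk
assert all(c.nbytes <= 32 for c in chunks)
assert np.concatenate(chunks).tolist() == arr.tolist()
```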

rishic3 (Collaborator Author) commented Nov 6, 2024

Also, chunking the CSR matrix by number of rows when returning from _fit poses separate challenges, since the CSR components and the embedding matrix have different numbers of rows, so it's unclear how many chunks we should return here.
If we could chunk the CSR matrix directly and reconcatenate using scipy.sparse.vstack, that would be ideal, but there's no easy way to get the CSR matrix into a serializable format (i.e., Python lists) without extracting the components. Will think about this more.
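The scipy.sparse.vstack idea mentioned above can be illustrated directly (a sketch, not the PR's code): slice a CSR matrix row-wise into chunks, then reconcatenate losslessly.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

mat = csr_matrix(np.arange(12, dtype=np.float64).reshape(6, 2))

# Row-wise chunks; each slice is itself a CSR matrix.
chunks = [mat[i:i + 2] for i in range(0, mat.shape[0], 2)]

rebuilt = vstack(chunks, format="csr")
assert (rebuilt != mat).nnz == 0  # lossless reconcatenation
```

The serialization obstacle described above remains: each chunk is still a CSR object, not a plain-Python structure, so its components would still need extracting before broadcast.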

rishic3 (Collaborator Author) commented Nov 6, 2024

build

lijinf2 (Collaborator) commented Nov 7, 2024

Thanks for addressing the comments. I am good with merging the PR. Some corner cases can be studied in the future.

rishic3 (Collaborator Author) commented Nov 8, 2024

build

lijinf2 (Collaborator) commented Nov 8, 2024

build

@rishic3 rishic3 merged commit d10e9f0 into NVIDIA:branch-24.10 Nov 9, 2024
2 checks passed
@rishic3 rishic3 deleted the umap-sparse branch November 9, 2024 00:32