Fix partitioning in explicit-comms shuffle #1356

rjzamora · 2024-07-03T19:51:44Z

The current version of the explicit-comms shuffle does not produce partitioning that is consistent with dask.dataframe.

rjzamora · 2024-07-05T14:14:56Z

dask_cuda/explicit_comms/dataframe/shuffle.py

+    # Make sure partitions are properly ordered
+    futures = [_futures.pop(i) for i in range(npartitions)]


@ayushdg - FYI: I think this means the ordering of partitions could have been "wrong" even before the dtype-casting change in dask :/

Ah, I see. Based on the limited testing I've done, the biggest change in my test results came from #1323, even when paired with an older version of Dask from before the dtype change.
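For readers following the ordering fix, here is a minimal sketch of the idea behind the added line. It assumes, as the diff suggests, that `_futures` is a mapping keyed by output partition index; the string values and the helper name `order_partitions` are hypothetical stand-ins for real futures and are not part of dask-cuda.

```python
def order_partitions(results_by_id, npartitions):
    """Return the results in partition-index order, consuming the mapping."""
    return [results_by_id.pop(i) for i in range(npartitions)]


# Hypothetical results keyed by output partition index, stored in an
# arbitrary order (for example, the order in which workers reported back).
completed = {2: "part-2", 0: "part-0", 3: "part-3", 1: "part-1"}

# Taking the values directly keeps the arbitrary insertion order.
unordered = list(completed.values())

# Popping by index restores the order implied by the partition ids.
ordered = order_partitions(completed, npartitions=4)
```

Building the list from the values alone would attach partition 2's data to output partition 0, which is the kind of mismatch the comment in the diff guards against.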

rjzamora · 2024-07-05T14:17:43Z

dask_cuda/tests/test_explicit_comms.py

-                    ddf = dd.from_pandas(df.copy(), npartitions=input_nparts).persist(
-                        workers=all_workers
-                    )
+                    ddf1 = dd.from_pandas(df.copy(), npartitions=input_nparts)


Strange finding: Even without the changes in this PR, I get a "cancelled" error when I persist ddf and then perform an explicit-comms shuffle followed by a task-based shuffle. I don't understand the cause of this yet.

Are we OK to still move ahead with the changes even though there's a cancellation error?

Yes. The changes in this PR are independent of the cancellation error (it happens in branch-24.08 without these changes).

Another important detail: I only get the error when query-planning is disabled. Therefore, I assume the problem has to do with a key-name collision that doesn't happen with dask-expr (which is much more disciplined about key names than the legacy API is).

rjzamora · 2024-07-05T14:18:38Z

dask_cuda/tests/test_explicit_comms.py

-                                ddf,
+                                ddf1,


Seems a bit confusing to me that we were previously modifying the initial collection.

Makes sense to avoid that. I think @madsbk may have an idea if this was an oversight or was intentional when he gets back.

I think it was an oversight :)
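As an illustration of the pattern the test now follows (keep the input collection under its own name and compare shuffle outputs partition by partition), here is a pandas-only sketch. `toy_shuffle` is a hypothetical stand-in for a shuffle and is not the dask-cuda or dask implementation; the data and column names are made up.

```python
import pandas as pd


def toy_shuffle(parts, column, npartitions):
    """Hypothetical shuffle: route each row by the hash of `column`."""
    df = pd.concat(parts, ignore_index=True)
    ids = pd.util.hash_pandas_object(df[column], index=False) % npartitions
    # Sort rows inside each output partition so results are comparable
    # regardless of the order in which the input partitions arrived.
    return [
        df[ids == i].sort_values(list(df.columns)).reset_index(drop=True)
        for i in range(npartitions)
    ]


df = pd.DataFrame({"key": [5, 3, 8, 3, 5, 1, 8, 0], "value": range(8)})

# The input stays bound to its own name and is never reassigned.
input_parts = [df.iloc[:4], df.iloc[4:]]

# Two runs that see the input partitions in a different order.
result_a = toy_shuffle(input_parts, "key", npartitions=3)
result_b = toy_shuffle(input_parts[::-1], "key", npartitions=3)

# Consistent partitioning means partition i matches in both results.
consistent = all(a.equals(b) for a, b in zip(result_a, result_b))
```

Because the outputs are bound to new names, both results can be traced back to the same untouched input, which makes a mismatch easier to attribute to the shuffle itself.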

pentschev

This seems sensible to me, thanks @rjzamora. I've left a few questions, but I don't think any of them should be blockers. If you're satisfied and think there's nothing more to be done, feel free to merge it.

pentschev · 2024-07-08T20:43:11Z

dask_cuda/tests/test_explicit_comms.py

-                    ddf = dd.from_pandas(df.copy(), npartitions=input_nparts).persist(
-                        workers=all_workers
-                    )
+                    ddf1 = dd.from_pandas(df.copy(), npartitions=input_nparts)


Are we OK to still move ahead with the changes even though there's a cancellation error?

pentschev · 2024-07-08T20:44:31Z

dask_cuda/tests/test_explicit_comms.py

-                                ddf,
+                                ddf1,


Makes sense to avoid that. I think @madsbk may have an idea if this was an oversight or was intentional when he gets back.

pentschev · 2024-07-08T20:48:08Z

dask_cuda/explicit_comms/dataframe/shuffle.py

+    # TODO: Use `partition_by_hash` if/when dtype-casting is added
+    # (See: https://github.com/rapidsai/dask-cuda/pull/1356)


The link refers to this PR, what is there exactly to see here for additional information on this TODO?

Ah good catch! I was planning to file an issue to add a cast_dtype argument to partition_by_hash (similar to the argument added to partitioning_index in dask/dask#10705). I'll update the link before this gets merged.
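To make the dtype-casting point concrete, here is a rough sketch of what a hash-based partitioning index with an optional cast could look like. It uses `pandas.util.hash_pandas_object` as an assumed stand-in for the hashing step; the function name `partitioning_index_sketch` is hypothetical, and this is not the actual `partitioning_index` or `partition_by_hash` implementation.

```python
import pandas as pd


def partitioning_index_sketch(df, npartitions, cast_dtype=None):
    """Assign each row an output partition id in [0, npartitions)."""
    if cast_dtype is not None:
        # Cast first so the hash depends on the values, not on the
        # dtype the caller happened to pass in.
        df = df.astype(cast_dtype)
    hashes = pd.util.hash_pandas_object(df, index=False)
    return (hashes % npartitions).to_numpy()


keys32 = pd.DataFrame({"key": [1, 2, 3, 1, 2, 3]}, dtype="int32")
keys64 = keys32.astype("int64")

# After casting to a common dtype, both inputs hash identical data.
ids32 = partitioning_index_sketch(keys32, npartitions=4, cast_dtype="int64")
ids64 = partitioning_index_sketch(keys64, npartitions=4, cast_dtype="int64")
```

With the cast applied on both sides, two shuffle implementations that receive the same keys under different dtypes would still agree on the destination partition of every row.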

rjzamora · 2024-07-09T12:54:14Z

/merge

adjust shuffle to produce partitions that are consistent with dask.dataframe.shuffle

51079a7

rjzamora added the bug, 2 - In Progress, and non-breaking labels Jul 3, 2024

rjzamora self-assigned this Jul 3, 2024

github-actions bot added the python label Jul 3, 2024

rjzamora added 2 commits July 3, 2024 12:58

add link to PR to track partition_by_hash change

463ab8d

fix strange error after persisting

8bc7570

rjzamora commented Jul 5, 2024

rjzamora marked this pull request as ready for review July 5, 2024 14:59

rjzamora requested a review from a team as a code owner July 5, 2024 14:59

pentschev approved these changes Jul 8, 2024

rjzamora commented Jul 8, 2024

dask_cuda/explicit_comms/dataframe/shuffle.py (outdated, resolved)

Update dask_cuda/explicit_comms/dataframe/shuffle.py

2344777

ayushdg mentioned this pull request Jul 8, 2024

Skip explicit comms shuffle for dask-cuda 24.06 NVIDIA/NeMo-Curator#147

Merged

3 tasks

rjzamora added the 5 - Ready to Merge label and removed the 2 - In Progress label Jul 9, 2024

rapids-bot bot merged commit fe23e45 into rapidsai:branch-24.08 Jul 9, 2024
27 checks passed

ayushdg mentioned this pull request Jul 30, 2024

Incorrect Shuffle results with dask-cuda 24.06 & above NVIDIA/NeMo-Curator#134

Closed
