
Initial setup for dask-upstream-testing #1

Open · wants to merge 8 commits into main

Conversation

@TomAugspurger commented Jan 8, 2025

This has a quick pass at the setup for the dask-upstream testing. The basic idea is to add a cron job that

  1. pulls dask and distributed
  2. installs nightly versions of the RAPIDS packages

and then runs the gpu-marked dask and distributed tests. It's mostly based on https://github.com/rapidsai/cudf/blob/branch-25.02/.github/workflows/pandas-tests.yaml, which does something similar for pandas.
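In outline, the script amounts to something like the sketch below (the package list, refs, and pytest selection here are placeholders, not the final scripts/run.sh):

```sh
#!/usr/bin/env bash
# Sketch only: names and package set are illustrative.
set -euo pipefail

# 1. Pull dask and distributed
git clone --depth 1 https://github.com/dask/dask.git
git clone --depth 1 https://github.com/dask/distributed.git

# 2. Install nightly RAPIDS wheels plus the test extras of the local checkouts
uv pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    --prerelease=allow \
    "cudf-cu12" "dask-cudf-cu12" \
    "./dask[test]" "./distributed[test]"

# 3. Run only the gpu-marked tests
python -m pytest -m gpu ./dask ./distributed
```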

I need to figure out the best place to report failures. Open an issue? A Slack message?

Review thread on scripts/run.sh (outdated, resolved)
Review thread on .github/workflows/cron.yaml (resolved)
@TomAugspurger (Author) commented Jan 8, 2025

/ok to test

edit: that appears to have done nothing :) I'm guessing some bot isn't hooked up here.

@TomAugspurger marked this pull request as ready for review January 8, 2025 23:14
@quasiben (Member) commented Jan 8, 2025

Maybe try adding the copy-pr-bot as well?

.github/copy-pr-bot.yaml

# Copyright (c) 2025, NVIDIA CORPORATION.

# Configuration file for `copy-pr-bot` GitHub App
# https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/

enabled: true

@TomAugspurger (Author)

Failure at https://github.com/rapidsai/dask-upstream-testing/actions/runs/12681051375

https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#workflow_dispatch does mention that

> This event will only trigger a workflow run if the workflow file exists on the default branch.

I'm not sure which version of the workflow file is getting run, the one from this branch or the one from main. I'll look into that.

Review thread on scripts/run.sh (outdated, resolved)

scripts/run.sh (outdated):

uv pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
--prerelease=allow \
"cudf-cu12" \
@TomAugspurger (Author):

Figure out what all we want/need to install here. Probably depends on which pytest.importorskip calls we have in dask/dask and dask/distributed. We don't want any gpu-marked tests skipped for missing dependencies.
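One quick way to enumerate the importorskip'd modules (illustrative; assumes ./dask and ./distributed are local clones):

```sh
# List every importorskip(...) call in the dask and distributed checkouts,
# so we can see which optional dependencies the test suites expect.
grep -rho "importorskip([^)]*)" dask distributed | sort | uniq -c | sort -rn
```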

@TomAugspurger (Author):

Thanks, will take a look. Right now I'm thinking that the best place for this to live would be in dask & distributed's pyproject.toml, so that we can just install `dask[test,gpu] distributed[test,gpu]` and have everything we need.
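If those extras existed, the install step here could collapse to something like this (hypothetical: neither package currently ships a gpu extra):

```sh
# Hypothetical: relies on dask and distributed publishing test and gpu extras upstream.
uv pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    --prerelease=allow \
    "dask[test,gpu]" "distributed[test,gpu]"
```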

Review thread on scripts/run.sh (outdated, resolved)
@wence- (Contributor) commented Jan 9, 2025

/ok to test

copy-pr-bot bot commented Jan 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wence- (Contributor) commented Jan 9, 2025

/ok to test

This adds a cron job to the GitHub Actions setup that runs the `gpu`-marked
tests from dask and distributed.

The setup here is modeled after https://github.com/rapidsai/cudf/blob/branch-25.02/.github/workflows/pandas-tests.yaml, which
runs the pandas tests against the nightly versions of cudf.

# RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"
RAPIDS_PY_CUDA_SUFFIX=12
# TODO: set this to main once dask-cudf is compatible
@TomAugspurger (Author):

I think rapidsai/dask-cuda#1424 is the PR to track here. For now, using 2024.12.1 is slightly preferable, otherwise all the tests fail on import.
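If the 2024.12.1 pin refers to the dask/distributed release tag, checking out the tagged version instead of main might look roughly like this (variable name and exact clone commands are illustrative, not the actual run.sh):

```sh
# TODO: switch back to main once dask-cudf is compatible (see rapidsai/dask-cuda#1424).
DASK_REF="2024.12.1"   # hypothetical variable name
git clone --branch "${DASK_REF}" --depth 1 https://github.com/dask/dask.git
git clone --branch "${DASK_REF}" --depth 1 https://github.com/dask/distributed.git
```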

@TomAugspurger (Author)

I've added a few more optional dependencies (ucx-py, dask-cuda) whose absence was causing some tests to be skipped locally. Here are the remaining skips for dask/dask:

SKIPPED [1] dask/array/tests/test_image.py:8: could not import 'skimage': No module named 'skimage'
SKIPPED [1] dask/array/tests/test_sparse.py:12: could not import 'sparse': No module named 'sparse'
SKIPPED [1] dask/array/tests/test_xarray.py:9: could not import 'xarray': No module named 'xarray'
SKIPPED [1] dask/bag/tests/test_avro.py:10: could not import 'fastavro': No module named 'fastavro'
SKIPPED [1] dask/bytes/tests/test_http.py:17: could not import 'requests': No module named 'requests'
SKIPPED [1] dask/bytes/tests/test_s3.py:14: could not import 's3fs': No module named 's3fs'
SKIPPED [1] dask/dataframe/io/tests/test_sql.py:16: could not import 'sqlalchemy': No module named 'sqlalchemy'
SKIPPED [1] dask/dataframe/tests/test_format.py:31: format is completely different
SKIPPED [1] dask/dataframe/tests/test_optimize_dataframe.py:17: doesn't make sense with dask-expr
SKIPPED [1] dask/tests/test_cache.py:13: could not import 'cachey': No module named 'cachey'
SKIPPED [1] dask/tests/test_dot.py:12: could not import 'ipycytoscape': No module named 'ipycytoscape'
SKIPPED [1] dask/tests/test_spark_compat.py:13: could not import 'pyspark': No module named 'pyspark'
SKIPPED [4] dask/array/tests/test_cupy_core.py:316: Unsupported assigning `list` to CuPy array
SKIPPED [4] dask/array/tests/test_cupy_core.py:358: Assigning `range` to CuPy array is not supported
SKIPPED [2] dask/array/tests/test_cupy_core.py:358: Unsupported assigning `list` to CuPy array
SKIPPED [1] dask/array/tests/test_cupy_core.py:358: Unsupported assigning Dask Array to CuPy array
SKIPPED [5] dask/array/tests/test_cupy_slicing.py:83: dask.Array.nonzero() doesn't support non-NumPy arrays yet
SKIPPED [3] dask/dataframe/tests/test_dataframe.py:3612: https://github.com/rapidsai/cudf/issues/16560
SKIPPED [1] dask/dataframe/tests/test_dataframe.py:3685: https://github.com/rapidsai/cudf/issues/16560
SKIPPED [1] dask/dataframe/tests/test_shuffle.py:818: we don't do division inference for 1 partition frames
SKIPPED [2] dask/tests/test_backends.py:16: not properly supported yet

And for dask/distributed:

SKIPPED [1] distributed/cli/tests/test_dask_scheduler.py:8: could not import 'requests': No module named 'requests'
SKIPPED [1] distributed/cli/tests/test_dask_ssh.py:9: could not import 'paramiko': No module named 'paramiko'
SKIPPED [1] distributed/dashboard/tests/test_components.py:5: could not import 'bokeh': No module named 'bokeh'
SKIPPED [1] distributed/dashboard/tests/test_scheduler_bokeh.py:11: could not import 'bokeh': No module named 'bokeh'
SKIPPED [1] distributed/dashboard/tests/test_worker_bokeh.py:10: could not import 'bokeh': No module named 'bokeh'
SKIPPED [1] distributed/deploy/tests/test_old_ssh.py:7: could not import 'paramiko': No module named 'paramiko'
SKIPPED [1] distributed/deploy/tests/test_ssh.py:5: could not import 'asyncssh': No module named 'asyncssh'
SKIPPED [1] distributed/diagnostics/tests/test_memray.py:5: could not import 'memray': No module named 'memray'
SKIPPED [1] distributed/diagnostics/tests/test_progress_stream.py:5: could not import 'bokeh': No module named 'bokeh'
SKIPPED [1] distributed/diagnostics/tests/test_progress_widgets.py:14: could not import 'ipywidgets': No module named 'ipywidgets'
SKIPPED [1] distributed/protocol/tests/test_h5py.py:8: could not import 'h5py': No module named 'h5py'
SKIPPED [1] distributed/protocol/tests/test_keras.py:5: could not import 'keras': No module named 'keras'
SKIPPED [1] distributed/protocol/tests/test_netcdf4.py:5: could not import 'netCDF4': No module named 'netCDF4'
SKIPPED [1] distributed/protocol/tests/test_sparse.py:6: could not import 'sparse': No module named 'sparse'
SKIPPED [1] distributed/protocol/tests/test_torch.py:8: could not import 'torch': No module named 'torch'
SKIPPED [1] distributed/tests/test_jupyter.py:18: could not import 'requests': No module named 'requests'
SKIPPED [1] distributed/diagnostics/tests/test_cudf_diagnostics.py:37: cuDF spill stats monitoring must be enabled manually

That "cuDF spill stats monitoring must be enabled manually" skip looks a bit suspicious. I'll look into that.

Longer-term, I'd prefer to have the list of packages needed defined upstream in dask and distributed, maybe as a gpu extra.

@@ -1,78 +1,29 @@
# Based off https://github.com/rapidsai/cudf/blob/branch-25.02/.github/workflows/pandas-tests.yaml
@TomAugspurger (Author):

We need another job to actually trigger this, e.g. https://github.com/rapidsai/cudf/blob/cb77046d8baad31f4856c097f7052b3a3858c363/.github/workflows/build.yaml#L180. I wonder if the workflow-dispatch stuff is needed, or if we can just use a cron job?

@bdice commented Jan 9, 2025

@TomAugspurger (Author) commented Jan 9, 2025

Ah, thanks. I gave up searching after I didn't see it in either rapidsai/cudf or rapidsai/shared-workflows :)

@TomAugspurger (Author):

IIUC, rapidsai/workflows#69 should be necessary and sufficient to get this workflow running on a schedule.

@TomAugspurger (Author):

Based on a suggestion from @wence-, I've moved the cron definition here to avoid the need for this cross-repo dispatch.

I'm not sure if we can test this before merging, but I've hopefully gotten the GitHub Actions workflow set up so that we have all the inputs necessary for the rapidsai/shared-workflows/.github/workflows/[email protected] workflow to run. Does anyone know offhand whether things like ${{ inputs.date }} will be available? My guess is that those come from the workflow dispatch, so I've added a job to collect that and the branch name.
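Since a cron trigger provides no workflow_dispatch inputs, the collecting job presumably has to derive them itself; a rough sketch of what that shell step could look like (step and variable names are hypothetical, not taken from this PR):

```sh
# Hypothetical step: compute the date and branch that the shared workflow
# expects as inputs when there is no workflow_dispatch payload.
RUN_DATE="$(date --utc '+%Y-%m-%d')"
BRANCH="${GITHUB_REF_NAME:-main}"   # GITHUB_REF_NAME is set by GitHub Actions
echo "date=${RUN_DATE}" >> "${GITHUB_OUTPUT}"
echo "branch=${BRANCH}" >> "${GITHUB_OUTPUT}"
```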

@TomAugspurger (Author) commented Jan 9, 2025

I expect some failures once this is merged:

    def test_visible_devices_uuid():
        if nvml.device_get_count() < 1:
            pytest.skip("No GPUs available")
    
        info = nvml.get_device_index_and_uuid(0)
        assert info.uuid
    
        with mock.patch.dict(
>           os.environ, {"CUDA_VISIBLE_DEVICES": info.uuid.decode("utf-8")}
        ):
E       AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?

Looking into that (edit: fixing upstream at dask/distributed#8981).

  • Lots of scipy.sparse failures: FAILED distributed/protocol/tests/test_cupy.py::test_serialize_cupy_sparse[cuda-dtype1-coo_matrix] - ValueError: scipy.sparse does not support dtype float32. The only supported types are: bool, int8, uint8, int16, uint16, int32, uint32, int64, uint64, longlong, ulonglong, float32, float64, longdouble, complex64, complex128, clongdouble. These may just need to be skipped, ideally upstream. (edit: fixed in dask/distributed#8982, "Removed big-endian sparse tests")

We have the option of defining skips here when we invoke pytest, but for now I think we should pursue upstream fixes.
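For the record, deselecting a known failure at invocation time would just mean passing its node ID to pytest, along these lines (illustrative; the node ID is copied from the failure above, and the actual run.sh arguments may differ):

```sh
# Skip a known-failing test locally instead of patching upstream.
python -m pytest -m gpu distributed \
    --deselect "distributed/protocol/tests/test_cupy.py::test_serialize_cupy_sparse[cuda-dtype1-coo_matrix]"
```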

Review thread on .github/workflows/cron.yaml (outdated, resolved)
Review threads on scripts/run.sh (outdated, resolved)
Review thread on scripts/test (outdated, resolved)
Review thread on scripts/run.sh (outdated, resolved)
@TomAugspurger (Author) commented Jan 9, 2025

@charlesbluca I see you added the cudf spill stats test at dask/distributed#8148. Just for simplicity, I'd like to avoid adding a second pytest invocation for distributed. It looks like we can use cudf.set_option to programmatically set the cudf settings. Unfortunately, setting dask.config.set(**{"distributed.diagnostics.cudf": True}) in the test seems to be too late for capturing the metrics: I get a test failure about cudf not being in the metrics if that's not set before the test process starts.

Hopefully setting DASK_DISTRIBUTED__DIAGNOSTICS__CUDF=1 for all the tests isn't a problem, but LMK if you think it will be and I can adjust stuff here.

edit: dask/distributed#8984 is the proposed restructuring that should let us avoid another pytest invocation.
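Concretely, enabling the diagnostics for all the tests would just mean wrapping the single pytest invocation, roughly like this (sketch; the actual arguments in run.sh may differ):

```sh
# Enable the cuDF spill diagnostics for every test in the one distributed pytest run.
DASK_DISTRIBUTED__DIAGNOSTICS__CUDF=1 python -m pytest -m gpu distributed
```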

@charlesbluca (Member)
Thanks @TomAugspurger for this work 😄

> LMK if you think it will be and I can adjust stuff here.

It's been a while, but my main recollection of why we chose to split this off into two separate pytest invocations is that some of the other tests we wanted running on GPU failed if we enabled spilling, so we'd ideally want cuDF spilling enabled on a per-test basis; this thread roughly covers that issue. Am I right in assuming dask/distributed#8984 does some refactoring so that this configuration is a bit easier to "toggle" on/off from a single process?

@TomAugspurger (Author)

Thanks for the link to dask/distributed#8148 (comment). Yeah, the changes in dask/distributed#8984 should make this toggling behavior slightly easier. We should get the spilling behavior just in this one test, while avoiding spilling in the other tests.
