
Initial setup for dask-upstream-testing #1

Open · wants to merge 8 commits into main

Conversation

@TomAugspurger commented Jan 8, 2025

This has a quick pass at the setup for the dask-upstream testing. The basic idea is to add a cron job that

  1. pulls dask and distributed
  2. installs nightly versions of the RAPIDS packages

and then runs the gpu-marked dask and distributed tests. It's mostly based on https://github.com/rapidsai/cudf/blob/branch-25.02/.github/workflows/pandas-tests.yaml, which does something similar for pandas.
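In outline, the script amounts to something like the sketch below (the package list, refs, and pytest selection here are placeholders, not the final scripts/run.sh):

```sh
#!/usr/bin/env bash
# Sketch only: names and package set are illustrative.
set -euo pipefail

# 1. Pull dask and distributed
git clone --depth 1 https://github.com/dask/dask.git
git clone --depth 1 https://github.com/dask/distributed.git

# 2. Install nightly RAPIDS wheels plus the test extras of the local checkouts
uv pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    --prerelease=allow \
    "cudf-cu12" "dask-cudf-cu12" \
    "./dask[test]" "./distributed[test]"

# 3. Run only the gpu-marked tests
python -m pytest -m gpu ./dask ./distributed
```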

I need to figure out the best place to report failures. Open an issue? A Slack message?

Review thread on scripts/run.sh (outdated, resolved)
Review thread on .github/workflows/cron.yaml (resolved)
@TomAugspurger (Author) commented Jan 8, 2025

/ok to test

edit: that appears to have done nothing :) I'm guessing some bot isn't hooked up here.

@TomAugspurger marked this pull request as ready for review January 8, 2025 23:14
@quasiben (Member) commented Jan 8, 2025

Maybe try adding the copy-pr-bot as well?

.github/copy-pr-bot.yaml

# Copyright (c) 2025, NVIDIA CORPORATION.

# Configuration file for `copy-pr-bot` GitHub App
# https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/

enabled: true

@TomAugspurger (Author)

Failure at https://github.com/rapidsai/dask-upstream-testing/actions/runs/12681051375

https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/events-that-trigger-workflows#workflow_dispatch does mention that

> This event will only trigger a workflow run if the workflow file exists on the default branch.

I'm not sure which version of the workflow file is getting run, the one from this branch or the one from main. I'll look into that.

Review thread on scripts/run.sh (outdated, resolved)

scripts/run.sh (outdated):

uv pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
--prerelease=allow \
"cudf-cu12" \
@TomAugspurger (Author):

Figure out what all we want/need to install here. Probably depends on which pytest.importorskip calls we have in dask/dask and dask/distributed. We don't want any gpu-marked tests skipped for missing dependencies.
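One quick way to enumerate the importorskip'd modules (illustrative; assumes ./dask and ./distributed are local clones):

```sh
# List every importorskip(...) call in the dask and distributed checkouts,
# so we can see which optional dependencies the test suites expect.
grep -rho "importorskip([^)]*)" dask distributed | sort | uniq -c | sort -rn
```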

@TomAugspurger (Author):

Thanks, will take a look. Right now I'm thinking that the best place for this to live would be in dask & distributed's pyproject.toml, so that we can just install `dask[test,gpu] distributed[test,gpu]` and have everything we need.
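If those extras existed, the install step here could collapse to something like this (hypothetical: neither package currently ships a gpu extra):

```sh
# Hypothetical: relies on dask and distributed publishing test and gpu extras upstream.
uv pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple \
    --prerelease=allow \
    "dask[test,gpu]" "distributed[test,gpu]"
```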

Review thread on scripts/run.sh (outdated, resolved)
@wence- (Contributor) commented Jan 9, 2025

/ok to test

copy-pr-bot bot commented Jan 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wence- (Contributor) commented Jan 9, 2025

/ok to test

This adds a cron job to the GitHub Actions setup that runs the `gpu`-marked
tests from dask and distributed.

The setup here is modeled after https://github.com/rapidsai/cudf/blob/branch-25.02/.github/workflows/pandas-tests.yaml, which
runs the pandas tests against the nightly versions of cudf.

# RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"
RAPIDS_PY_CUDA_SUFFIX=12
# TODO: set this to main once dask-cudf is compatible
@TomAugspurger (Author):

I think rapidsai/dask-cuda#1424 is the PR to track here. For now, using 2024.12.1 is slightly preferable, otherwise all the tests fail on import.
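If the 2024.12.1 pin refers to the dask/distributed release tag, checking out the tagged version instead of main might look roughly like this (variable name and exact clone commands are illustrative, not the actual run.sh):

```sh
# TODO: switch back to main once dask-cudf is compatible (see rapidsai/dask-cuda#1424).
DASK_REF="2024.12.1"   # hypothetical variable name
git clone --branch "${DASK_REF}" --depth 1 https://github.com/dask/dask.git
git clone --branch "${DASK_REF}" --depth 1 https://github.com/dask/distributed.git
```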

@TomAugspurger (Author)

I've added a few more optional dependencies (ucx-py, dask-cuda) whose absence was causing some tests to be skipped locally. Here are the remaining skips for dask/dask:

SKIPPED [1] dask/array/tests/test_image.py:8: could not import 'skimage': No module named 'skimage'
SKIPPED [1] dask/array/tests/test_sparse.py:12: could not import 'sparse': No module named 'sparse'
SKIPPED [1] dask/array/tests/test_xarray.py:9: could not import 'xarray': No module named 'xarray'
SKIPPED [1] dask/bag/tests/test_avro.py:10: could not import 'fastavro': No module named 'fastavro'
SKIPPED [1] dask/bytes/tests/test_http.py:17: could not import 'requests': No module named 'requests'
SKIPPED [1] dask/bytes/tests/test_s3.py:14: could not import 's3fs': No module named 's3fs'
SKIPPED [1] dask/dataframe/io/tests/test_sql.py:16: could not import 'sqlalchemy': No module named 'sqlalchemy'
SKIPPED [1] dask/dataframe/tests/test_format.py:31: format is completely different
SKIPPED [1] dask/dataframe/tests/test_optimize_dataframe.py:17: doesn't make sense with dask-expr
SKIPPED [1] dask/tests/test_cache.py:13: could not import 'cachey': No module named 'cachey'
SKIPPED [1] dask/tests/test_dot.py:12: could not import 'ipycytoscape': No module named 'ipycytoscape'
SKIPPED [1] dask/tests/test_spark_compat.py:13: could not import 'pyspark': No module named 'pyspark'
SKIPPED [4] dask/array/tests/test_cupy_core.py:316: Unsupported assigning `list` to CuPy array
SKIPPED [4] dask/array/tests/test_cupy_core.py:358: Assigning `range` to CuPy array is not supported
SKIPPED [2] dask/array/tests/test_cupy_core.py:358: Unsupported assigning `list` to CuPy array
SKIPPED [1] dask/array/tests/test_cupy_core.py:358: Unsupported assigning Dask Array to CuPy array
SKIPPED [5] dask/array/tests/test_cupy_slicing.py:83: dask.Array.nonzero() doesn't support non-NumPy arrays yet
SKIPPED [3] dask/dataframe/tests/test_dataframe.py:3612: https://github.com/rapidsai/cudf/issues/16560
SKIPPED [1] dask/dataframe/tests/test_dataframe.py:3685: https://github.com/rapidsai/cudf/issues/16560
SKIPPED [1] dask/dataframe/tests/test_shuffle.py:818: we don't do division inference for 1 partition frames
SKIPPED [2] dask/tests/test_backends.py:16: not properly supported yet

And for dask/distributed:

SKIPPED [1] distributed/cli/tests/test_dask_scheduler.py:8: could not import 'requests': No module named 'requests'
SKIPPED [1] distributed/cli/tests/test_dask_ssh.py:9: could not import 'paramiko': No module named 'paramiko'
SKIPPED [1] distributed/dashboard/tests/test_components.py:5: could not import 'bokeh': No module named 'bokeh'
SKIPPED [1] distributed/dashboard/tests/test_scheduler_bokeh.py:11: could not import 'bokeh': No module named 'bokeh'
SKIPPED [1] distributed/dashboard/tests/test_worker_bokeh.py:10: could not import 'bokeh': No module named 'bokeh'
SKIPPED [1] distributed/deploy/tests/test_old_ssh.py:7: could not import 'paramiko': No module named 'paramiko'
SKIPPED [1] distributed/deploy/tests/test_ssh.py:5: could not import 'asyncssh': No module named 'asyncssh'
SKIPPED [1] distributed/diagnostics/tests/test_memray.py:5: could not import 'memray': No module named 'memray'
SKIPPED [1] distributed/diagnostics/tests/test_progress_stream.py:5: could not import 'bokeh': No module named 'bokeh'
SKIPPED [1] distributed/diagnostics/tests/test_progress_widgets.py:14: could not import 'ipywidgets': No module named 'ipywidgets'
SKIPPED [1] distributed/protocol/tests/test_h5py.py:8: could not import 'h5py': No module named 'h5py'
SKIPPED [1] distributed/protocol/tests/test_keras.py:5: could not import 'keras': No module named 'keras'
SKIPPED [1] distributed/protocol/tests/test_netcdf4.py:5: could not import 'netCDF4': No module named 'netCDF4'
SKIPPED [1] distributed/protocol/tests/test_sparse.py:6: could not import 'sparse': No module named 'sparse'
SKIPPED [1] distributed/protocol/tests/test_torch.py:8: could not import 'torch': No module named 'torch'
SKIPPED [1] distributed/tests/test_jupyter.py:18: could not import 'requests': No module named 'requests'
SKIPPED [1] distributed/diagnostics/tests/test_cudf_diagnostics.py:37: cuDF spill stats monitoring must be enabled manually

That "cuDF spill stats monitoring must be enabled manually" skip looks a bit suspicious. I'll look into that.

Longer-term, I'd prefer to have the list of packages needed defined upstream in dask and distributed, maybe as a gpu extra.

@@ -1,78 +1,29 @@
# Based off https://github.com/rapidsai/cudf/blob/branch-25.02/.github/workflows/pandas-tests.yaml
@TomAugspurger (Author):

We need another job to actually trigger this, e.g. https://github.com/rapidsai/cudf/blob/cb77046d8baad31f4856c097f7052b3a3858c363/.github/workflows/build.yaml#L180. I wonder if the workflow-dispatch stuff is needed, or if we can just use a cron job?

@bdice commented Jan 9, 2025

@TomAugspurger (Author) commented Jan 9, 2025

Ah, thanks. I gave up searching after I didn't see it in either rapidsai/cudf or rapidsai/shared-workflows :)

@TomAugspurger (Author):

IIUC, rapidsai/workflows#69 should be necessary and sufficient to get this workflow running on a schedule.

@TomAugspurger (Author):

Based on a suggestion from @wence-, I've moved the cron definition here to avoid the need for this cross-repo dispatch.

I'm not sure if we can test this before merging, but I've hopefully gotten the GitHub Actions workflow set up so that we have all the inputs necessary for the rapidsai/shared-workflows/.github/workflows/[email protected] workflow to run. Does anyone know offhand whether things like ${{ inputs.date }} will be available? My guess is that those come from the workflow dispatch, so I've added a job to collect that and the branch name.
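Since a cron trigger provides no workflow_dispatch inputs, the collecting job presumably has to derive them itself; a rough sketch of what that shell step could look like (step and variable names are hypothetical, not taken from this PR):

```sh
# Hypothetical step: compute the date and branch that the shared workflow
# expects as inputs when there is no workflow_dispatch payload.
RUN_DATE="$(date --utc '+%Y-%m-%d')"
BRANCH="${GITHUB_REF_NAME:-main}"   # GITHUB_REF_NAME is set by GitHub Actions
echo "date=${RUN_DATE}" >> "${GITHUB_OUTPUT}"
echo "branch=${BRANCH}" >> "${GITHUB_OUTPUT}"
```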

@TomAugspurger (Author) commented Jan 9, 2025

I expect some failures once this is merged:

    def test_visible_devices_uuid():
        if nvml.device_get_count() < 1:
            pytest.skip("No GPUs available")
    
        info = nvml.get_device_index_and_uuid(0)
        assert info.uuid
    
        with mock.patch.dict(
>           os.environ, {"CUDA_VISIBLE_DEVICES": info.uuid.decode("utf-8")}
        ):
E       AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?

Looking into that (edit: fixing upstream at dask/distributed#8981).

  • Lots of scipy.sparse failures: FAILED distributed/protocol/tests/test_cupy.py::test_serialize_cupy_sparse[cuda-dtype1-coo_matrix] - ValueError: scipy.sparse does not support dtype float32. The only supported types are: bool, int8, uint8, int16, uint16, int32, uint32, int64, uint64, longlong, ulonglong, float32, float64, longdouble, complex64, complex128, clongdouble. These may just need to be skipped, ideally upstream. (edit: fixed in dask/distributed#8982, "Removed big-endian sparse tests")

We have the option of defining skips here when we invoke pytest, but for now I think we should pursue upstream fixes.
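For the record, deselecting a known failure at invocation time would just mean passing its node ID to pytest, along these lines (illustrative; the node ID is copied from the failure above, and the actual run.sh arguments may differ):

```sh
# Skip a known-failing test locally instead of patching upstream.
python -m pytest -m gpu distributed \
    --deselect "distributed/protocol/tests/test_cupy.py::test_serialize_cupy_sparse[cuda-dtype1-coo_matrix]"
```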

Review thread on .github/workflows/cron.yaml (outdated, resolved)
Review threads on scripts/run.sh (outdated, resolved)
Review thread on scripts/test (outdated, resolved)
Review thread on scripts/run.sh (outdated, resolved)
@TomAugspurger (Author) commented Jan 9, 2025

@charlesbluca I see you added the cudf spill stats test at dask/distributed#8148. Just for simplicity, I'd like to avoid adding a second pytest invocation for distributed. It looks like we can use cudf.set_option to programmatically set the cudf settings. Unfortunately, setting dask.config.set(**{"distributed.diagnostics.cudf": True}) in the test seems to be too late for capturing the metrics: I get a test failure about cudf not being in the metrics if that's not set before the test process starts.

Hopefully setting DASK_DISTRIBUTED__DIAGNOSTICS__CUDF=1 for all the tests isn't a problem, but LMK if you think it will be and I can adjust stuff here.

edit: dask/distributed#8984 is the proposed restructuring that should let us avoid another pytest invocation.
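Concretely, enabling the diagnostics for all the tests would just mean wrapping the single pytest invocation, roughly like this (sketch; the actual arguments in run.sh may differ):

```sh
# Enable the cuDF spill diagnostics for every test in the one distributed pytest run.
DASK_DISTRIBUTED__DIAGNOSTICS__CUDF=1 python -m pytest -m gpu distributed
```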

@charlesbluca (Member)
Thanks @TomAugspurger for this work 😄

> LMK if you think it will be and I can adjust stuff here.

It's been a while, but my main recollection of why we chose to split this off into two separate pytest invocations is that some of the other tests we wanted running on GPU failed if we enabled spilling, so we'd ideally want cuDF spilling enabled on a per-test basis; this thread roughly covers that issue. Am I right in assuming dask/distributed#8984 does some refactoring so that this configuration is a bit easier to "toggle" on/off from a single process?

@TomAugspurger (Author)

Thanks for the link to dask/distributed#8148 (comment). Yeah, the changes in dask/distributed#8984 should make this toggling behavior slightly easier. We should get the spilling behavior just in this one test, while avoiding spilling in the other tests.
