[BUG] sort_values on categorical column fails in dask_cudf #11795

Open
eriknw opened this issue Sep 27, 2022 · 5 comments
Labels: bug (Something isn't working), dask (Dask issue), Python (Affects Python cuDF API)

eriknw commented Sep 27, 2022

Describe the bug
ddf.sort_values(col) fails with a ValueError on a dask_cudf DataFrame when col is categorical (here, a categorical column with string categories).

Steps/Code to reproduce bug

import cudf
import dask_cudf
df = cudf.DataFrame({"a": list("caba"), "b": list(range(4))})
df["a"] = df["a"].astype("category")
ddf = dask_cudf.from_cudf(df, npartitions=2)
df.sort_values("a")  # <-- works as expected
ddf.sort_values("a")  # <-- raises ValueError

Traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [7], line 1
----> 1 ddf.sort_values("a")

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/dask_cudf/core.py:256, in DataFrame.sort_values(self, by, ignore_index, max_branch, divisions, set_divisions, ascending, na_position, sort_function, sort_function_kwargs, **kwargs)
    251 if kwargs:
    252     raise ValueError(
    253         f"Unsupported input arguments passed : {list(kwargs.keys())}"
    254     )
--> 256 df = sorting.sort_values(
    257     self,
    258     by,
    259     max_branch=max_branch,
    260     divisions=divisions,
    261     set_divisions=set_divisions,
    262     ignore_index=ignore_index,
    263     ascending=ascending,
    264     na_position=na_position,
    265     sort_function=sort_function,
    266     sort_function_kwargs=sort_function_kwargs,
    267 )
    269 if ignore_index:
    270     return df.reset_index(drop=True)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/dask_cudf/sorting.py:266, in sort_values(df, by, max_branch, divisions, set_divisions, ignore_index, ascending, na_position, sort_function, sort_function_kwargs)
    264 # Step 1 - Calculate new divisions (if necessary)
    265 if divisions is None:
--> 266     divisions = quantile_divisions(df, by, npartitions)
    268 # Step 2 - Perform repartitioning shuffle
    269 meta = df._meta._constructor_sliced([0])

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/dask_cudf/sorting.py:213, in quantile_divisions(df, by, npartitions)
    211 dtype = df[col].dtype
    212 if dtype != "object":
--> 213     divisions[col] = divisions[col].astype("int64")
    214     divisions[col].iloc[-1] += 1
    215     divisions[col] = divisions[col].astype(dtype)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/cudf/core/series.py:1857, in Series.astype(self, dtype, copy, errors, **kwargs)
   1855 else:
   1856     dtype = {self.name: dtype}
-> 1857 return super().astype(dtype, copy, errors, **kwargs)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/cudf/core/indexed_frame.py:3264, in IndexedFrame.astype(self, dtype, copy, errors, **kwargs)
   3262 except Exception as e:
   3263     if errors == "raise":
-> 3264         raise e
   3265     return self
   3267 return self._from_data(data, index=self._index)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/cudf/core/indexed_frame.py:3261, in IndexedFrame.astype(self, dtype, copy, errors, **kwargs)
   3258     raise ValueError("invalid error value specified")
   3260 try:
-> 3261     data = super().astype(dtype, copy, **kwargs)
   3262 except Exception as e:
   3263     if errors == "raise":

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/cudf/core/frame.py:328, in Frame.astype(self, dtype, copy, **kwargs)
    326 dt = dtype.get(col_name, col.dtype)
    327 if not is_dtype_equal(dt, col.dtype):
--> 328     result[col_name] = col.astype(dt, copy=copy, **kwargs)
    329 else:
    330     result[col_name] = col.copy() if copy else col

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/cudf/core/column/column.py:857, in ColumnBase.astype(self, dtype, **kwargs)
    851 dtype = (
    852     pandas_dtypes_alias_to_cudf_alias.get(dtype, dtype)
    853     if isinstance(dtype, str)
    854     else pandas_dtypes_to_np_dtypes.get(dtype, dtype)
    855 )
    856 if _is_non_decimal_numeric_dtype(dtype):
--> 857     return self.as_numerical_column(dtype, **kwargs)
    858 elif is_categorical_dtype(dtype):
    859     return self.as_categorical_column(dtype, **kwargs)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/cudf/core/column/categorical.py:1236, in CategoricalColumn.as_numerical_column(self, dtype, **kwargs)
   1235 def as_numerical_column(self, dtype: Dtype, **kwargs) -> NumericalColumn:
-> 1236     return self._get_decategorized_column().as_numerical_column(dtype)

File ~/miniconda3/envs/cugraph_dev15/lib/python3.9/site-packages/cudf/core/column/string.py:5312, in StringColumn.as_numerical_column(self, dtype, **kwargs)
   5310 if out_dtype.kind in {"i", "u"}:
   5311     if not libstrings.is_integer(string_col).all():
-> 5312         raise ValueError(
   5313             "Could not convert strings to integer "
   5314             "type due to presence of non-integer values."
   5315         )
   5316 elif out_dtype.kind == "f":
   5317     if not libstrings.is_float(string_col).all():

ValueError: Could not convert strings to integer type due to presence of non-integer values.
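
Reading the frames above, the failure seems to come from the `dtype != "object"` branch in `quantile_divisions`: any non-object dtype is treated as numeric and cast to int64, and a categorical column with string categories is decategorized to strings before that cast. The sketch below is a pure-Python toy model of that branch, not the dask_cudf code. The function names and the guard are hypothetical, and the cast back to the original dtype is left out.

```python
def bump_last_division(divisions, dtype):
    """Toy model of the dtype branch shown in the traceback (not dask_cudf code)."""
    if dtype != "object":
        # Mirrors `.astype("int64")`: every non-object dtype is assumed numeric.
        as_int = [int(v) for v in divisions]
        # Mirrors `divisions[col].iloc[-1] += 1`.
        as_int[-1] += 1
        return as_int
    return list(divisions)


def bump_last_division_guarded(divisions, dtype):
    """Hypothetical guard: leave categorical divisions untouched, like object ones."""
    if dtype in ("object", "category"):
        return list(divisions)
    return bump_last_division(divisions, dtype)


try:
    bump_last_division(["a", "c"], "category")
except ValueError as exc:
    # int("a") cannot succeed, just like the string-to-integer cast in the report.
    print("fails like the report:", exc)

print(bump_last_division_guarded(["a", "c"], "category"))
```

Whether skipping the increment is the right fix for categorical divisions is an open question here; the model only shows why the unconditional integer cast cannot succeed for string categories.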

Expected behavior
I expect this to work, that is, to match the results of cudf and dask.dataframe.
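
For reference, here is a pure-Python sketch of the ordering I would expect for the reproducer data, assuming the categories are the sorted unique labels and rows are ordered by their integer category codes. It does not call cudf or dask_cudf, and the variable names are only illustrative. The relative order of the two "a" rows assumes a stable sort, which the libraries may not guarantee.

```python
values = list("caba")      # column "a" from the reproducer
payload = list(range(4))   # column "b" from the reproducer

# Assumption: categories are the sorted unique labels, so codes follow label order.
categories = sorted(set(values))
code_of = {label: code for code, label in enumerate(categories)}
codes = [code_of[v] for v in values]

# Sort row positions by integer code; Python's sort is stable, so ties keep input order.
order = sorted(range(len(values)), key=codes.__getitem__)
sorted_a = [values[i] for i in order]
sorted_b = [payload[i] for i in order]

print(sorted_a)
print(sorted_b)
```

Sorting on the integer codes never needs to convert the labels themselves to integers, which is the step that raises in the traceback.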

Environment overview

  • Environment location: Bare-metal
  • Method of cuDF install: conda

Environment details

 ***git***
 commit 82b9922cc1635f6f0923f633a17a7c12f93ebdbe (HEAD -> pg_set_index_and_categorical, eriknw/pg_set_index_and_categorical)
 Author: Erik Welch <[email protected]>
 Date:   Tue Sep 27 11:28:04 2022 -0700

 workaround dask_cudf issue with `sort_values` on categorical column
 ***git submodules***

 ***OS Information***
 DGX_NAME="DGX Server"
 DGX_PRETTY_NAME="NVIDIA DGX Server"
 DGX_SWBUILD_DATE="2020-03-04"
 DGX_SWBUILD_VERSION="4.4.0"
 DGX_COMMIT_ID="ee09ebc"
 DGX_PLATFORM="DGX Server for DGX-1"
 DGX_SERIAL_NUMBER="QTFCOU8220028"

 DGX_R418_REPO_ENABLED=20220727-142458

 DGX_OTA_VERSION="4.13.0"
 DGX_OTA_DATE="Wed Jul 27 14:38:05 PDT 2022"
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=18.04
 DISTRIB_CODENAME=bionic
 DISTRIB_DESCRIPTION="Ubuntu 18.04.6 LTS"
 NAME="Ubuntu"
 VERSION="18.04.6 LTS (Bionic Beaver)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 18.04.6 LTS"
 VERSION_ID="18.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=bionic
 UBUNTU_CODENAME=bionic
 Linux dgx12 4.15.0-189-generic #200-Ubuntu SMP Wed Jun 22 19:53:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

 ***GPU Information***
 Tue Sep 27 12:26:17 2022
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
 | N/A   32C    P0    42W / 300W |      3MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
 | N/A   30C    P0    42W / 300W |      3MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
 | N/A   28C    P0    41W / 300W |      3MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
 | N/A   28C    P0    41W / 300W |      3MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
 | N/A   30C    P0    42W / 300W |      3MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
 | N/A   30C    P0    41W / 300W |      3MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
 | N/A   33C    P0    43W / 300W |      3MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
 | N/A   29C    P0    41W / 300W |      3MiB / 32768MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+

 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |  No running processes found                                                 |
 +-----------------------------------------------------------------------------+

 ***CPU***
 Architecture:        x86_64
 CPU op-mode(s):      32-bit, 64-bit
 Byte Order:          Little Endian
 CPU(s):              80
 On-line CPU(s) list: 0-79
 Thread(s) per core:  2
 Core(s) per socket:  20
 Socket(s):           2
 NUMA node(s):        2
 Vendor ID:           GenuineIntel
 CPU family:          6
 Model:               79
 Model name:          Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
 Stepping:            1
 CPU MHz:             3343.606
 CPU max MHz:         3600.0000
 CPU min MHz:         1200.0000
 BogoMIPS:            4389.85
 Virtualization:      VT-x
 L1d cache:           32K
 L1i cache:           32K
 L2 cache:            256K
 L3 cache:            51200K
 NUMA node0 CPU(s):   0-19,40-59
 NUMA node1 CPU(s):   20-39,60-79
 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

 ***CMake***
 /home/nfs/erwelch/miniconda3/envs/cugraph_dev15/bin/cmake
 cmake version 3.24.2

 CMake suite maintained and supported by Kitware (kitware.com/cmake).

 ***g++***
 /home/nfs/erwelch/miniconda3/envs/cugraph_dev15/bin/g++
 g++ (conda-forge gcc 10.4.0-16) 10.4.0
 Copyright (C) 2020 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


 ***nvcc***
 /home/nfs/erwelch/miniconda3/envs/cugraph_dev15/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2021 NVIDIA Corporation
 Built on Thu_Nov_18_09:45:30_PST_2021
 Cuda compilation tools, release 11.5, V11.5.119
 Build cuda_11.5.r11.5/compiler.30672275_0

 ***Python***
 /home/nfs/erwelch/miniconda3/envs/cugraph_dev15/bin/python
 Python 3.9.13

 ***Environment Variables***
 PATH                            : /home/nfs/erwelch/miniconda3/envs/cugraph_dev15/bin:/home/nfs/erwelch/miniconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
 LD_LIBRARY_PATH                 :
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /home/nfs/erwelch/miniconda3/envs/cugraph_dev15
 PYTHON_PATH                     :

 ***conda packages***
 /home/nfs/erwelch/miniconda3/condabin/conda
 # packages in environment at /home/nfs/erwelch/miniconda3/envs/cugraph_dev15:
 #
 # Name                    Version                   Build  Channel
 _libgcc_mutex             0.1                 conda_forge    conda-forge
 _openmp_mutex             4.5                       2_gnu    conda-forge
 alabaster                 0.7.12                     py_0    conda-forge
 argon2-cffi               21.3.0             pyhd8ed1ab_0    conda-forge
 argon2-cffi-bindings      21.2.0           py39hb9d737c_2    conda-forge
 arrow-cpp                 9.0.0           py39hd3ccb9b_2_cpu    conda-forge
 asttokens                 2.0.8              pyhd8ed1ab_0    conda-forge
 asvdb                     0.4.2               g90e8f2c_40    rapidsai
 attrs                     22.1.0             pyh71513ae_1    conda-forge
 aws-c-cal                 0.5.11               h95a6274_0    conda-forge
 aws-c-common              0.6.2                h7f98852_0    conda-forge
 aws-c-event-stream        0.2.7               h3541f99_13    conda-forge
 aws-c-io                  0.10.5               hfb6a706_0    conda-forge
 aws-checksums             0.1.11               ha31a3da_7    conda-forge
 aws-sdk-cpp               1.8.186              hb4091e7_3    conda-forge
 babel                     2.10.3             pyhd8ed1ab_0    conda-forge
 backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
 backports                 1.0                        py_2    conda-forge
 backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
 beautifulsoup4            4.11.1             pyha770c72_0    conda-forge
 binutils                  2.36.1               hdd6e379_2    conda-forge
 binutils_impl_linux-64    2.36.1               h193b22a_2    conda-forge
 binutils_linux-64         2.36                hf3e587d_10    conda-forge
 bleach                    5.0.1              pyhd8ed1ab_0    conda-forge
 bokeh                     2.4.3              pyhd8ed1ab_3    conda-forge
 boost                     1.80.0           py39hac2352c_1    conda-forge
 boost-cpp                 1.80.0               h75c5d50_0    conda-forge
 boto3                     1.24.81            pyhd8ed1ab_0    conda-forge
 botocore                  1.27.81            pyhd8ed1ab_0    conda-forge
 brotlipy                  0.7.0           py39hb9d737c_1004    conda-forge
 bzip2                     1.0.8                h7f98852_4    conda-forge
 c-ares                    1.18.1               h7f98852_0    conda-forge
 c-compiler                1.5.0                h166bdaf_0    conda-forge
 ca-certificates           2022.9.24            ha878542_0    conda-forge
 cachetools                5.2.0              pyhd8ed1ab_0    conda-forge
 certifi                   2022.9.24          pyhd8ed1ab_0    conda-forge
 cffi                      1.15.1           py39he91dace_0    conda-forge
 charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
 clang                     11.1.0               ha770c72_1    conda-forge
 clang-11                  11.1.0          default_ha53f305_1    conda-forge
 clang-tools               11.1.0          default_ha53f305_1    conda-forge
 clangxx                   11.1.0          default_ha53f305_1    conda-forge
 click                     8.1.3            py39hf3d152e_0    conda-forge
 cloudpickle               2.2.0              pyhd8ed1ab_0    conda-forge
 cmake                     3.24.2               h5432695_0    conda-forge
 colorama                  0.4.5              pyhd8ed1ab_0    conda-forge
 commonmark                0.9.1                      py_0    conda-forge
 coverage                  6.4.4            py39hb9d737c_0    conda-forge
 cryptography              37.0.4           py39hd97740a_0    conda-forge
 cuda-python               11.7.0           py39h3fd9d12_0    nvidia
 cudatoolkit               11.5.1               hcf5317a_9    nvidia
 cudf                      22.10.00a220920 cuda_11_py39_g0528b38f2b_241    rapidsai-nightly
 cugraph                   22.10.0a0+84.gc2f983f0          pypi_0    pypi
 cupy                      11.1.0           py39hc3c280e_0    conda-forge
 cxx-compiler              1.5.0                h924138e_0    conda-forge
 cython                    0.29.32          py39h5a03fae_0    conda-forge
 cytoolz                   0.12.0           py39hb9d737c_0    conda-forge
 dask                      2022.9.1           pyhd8ed1ab_0    conda-forge
 dask-core                 2022.9.1           pyhd8ed1ab_0    conda-forge
 dask-cuda                 22.10.00a220927 py39_g8de9ce3_19    rapidsai-nightly
 dask-cudf                 22.10.00a220920 cuda_11_py39_g0528b38f2b_241    rapidsai-nightly
 debugpy                   1.6.3            py39h5a03fae_0    conda-forge
 decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
 defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
 distributed               2022.9.1           pyhd8ed1ab_0    conda-forge
 distro                    1.6.0              pyhd8ed1ab_0    conda-forge
 dlpack                    0.5                  h9c3ff4c_0    conda-forge
 docutils                  0.19             py39hf3d152e_0    conda-forge
 doxygen                   1.9.5                h583eb01_0    conda-forge
 entrypoints               0.4                pyhd8ed1ab_0    conda-forge
 executing                 1.1.0              pyhd8ed1ab_0    conda-forge
 expat                     2.4.9                h27087fc_0    conda-forge
 fastavro                  1.6.1            py39hb9d737c_0    conda-forge
 fastrlock                 0.8              py39h5a03fae_2    conda-forge
 flake8                    5.0.4              pyhd8ed1ab_0    conda-forge
 flit-core                 3.7.1              pyhd8ed1ab_0    conda-forge
 freetype                  2.12.1               hca18f0e_0    conda-forge
 fsspec                    2022.8.2           pyhd8ed1ab_0    conda-forge
 future                    0.18.2           py39hf3d152e_5    conda-forge
 gcc                       10.4.0              hb92f740_10    conda-forge
 gcc_impl_linux-64         10.4.0              h7ee1905_16    conda-forge
 gcc_linux-64              10.4.0              h9215b83_10    conda-forge
 gflags                    2.2.2             he1b5a44_1004    conda-forge
 gh                        2.16.1               ha8f183a_0    conda-forge
 glog                      0.6.0                h6f12383_0    conda-forge
 gmock                     1.10.0               h4bd325d_7    conda-forge
 grpc-cpp                  1.47.1               hbad87ad_6    conda-forge
 gtest                     1.10.0               h4bd325d_7    conda-forge
 gxx                       10.4.0              hb92f740_10    conda-forge
 gxx_impl_linux-64         10.4.0              h7ee1905_16    conda-forge
 gxx_linux-64              10.4.0              h6e491c6_10    conda-forge
 heapdict                  1.0.1                      py_0    conda-forge
 icecream                  2.1.3              pyhd8ed1ab_0    conda-forge
 icu                       70.1                 h27087fc_0    conda-forge
 idna                      3.4                pyhd8ed1ab_0    conda-forge
 imagesize                 1.4.1              pyhd8ed1ab_0    conda-forge
 importlib-metadata        4.11.4           py39hf3d152e_0    conda-forge
 importlib_resources       5.9.0              pyhd8ed1ab_0    conda-forge
 iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
 ipykernel                 6.16.0             pyh210e3f2_0    conda-forge
 ipython                   8.5.0              pyh41d4057_1    conda-forge
 ipython_genutils          0.2.0                      py_1    conda-forge
 jedi                      0.18.1             pyhd8ed1ab_2    conda-forge
 jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
 jmespath                  1.0.1              pyhd8ed1ab_0    conda-forge
 joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
 jpeg                      9e                   h166bdaf_2    conda-forge
 jsonschema                4.16.0             pyhd8ed1ab_0    conda-forge
 jupyter_client            7.3.4              pyhd8ed1ab_0    conda-forge
 jupyter_core              4.11.1           py39hf3d152e_0    conda-forge
 jupyterlab_pygments       0.2.2              pyhd8ed1ab_0    conda-forge
 kernel-headers_linux-64   2.6.32              he073ed8_15    conda-forge
 keyutils                  1.6.1                h166bdaf_0    conda-forge
 krb5                      1.19.3               h3790be6_0    conda-forge
 lcms2                     2.12                 hddcbb42_0    conda-forge
 ld_impl_linux-64          2.36.1               hea4e1c9_2    conda-forge
 lerc                      4.0.0                h27087fc_0    conda-forge
 libabseil                 20220623.0      cxx17_h48a1fff_4    conda-forge
 libblas                   3.9.0           16_linux64_openblas    conda-forge
 libbrotlicommon           1.0.9                h166bdaf_7    conda-forge
 libbrotlidec              1.0.9                h166bdaf_7    conda-forge
 libbrotlienc              1.0.9                h166bdaf_7    conda-forge
 libcblas                  3.9.0           16_linux64_openblas    conda-forge
 libclang-cpp11.1          11.1.0          default_ha53f305_1    conda-forge
 libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
 libcudf                   22.10.00a220920 cuda11_g0528b38f2b_241    rapidsai-nightly
 libcugraphops             22.10.00a220927 cuda11_g553bacf_29    rapidsai-nightly
 libcurl                   7.83.1               h7bff187_0    conda-forge
 libcusolver               11.4.0.1                      0    nvidia
 libcusparse               11.7.4.91                     0    nvidia
 libdeflate                1.14                 h166bdaf_0    conda-forge
 libedit                   3.1.20191231         he28a2e2_2    conda-forge
 libev                     4.33                 h516909a_1    conda-forge
 libevent                  2.1.10               h9b69904_4    conda-forge
 libffi                    3.4.2                h7f98852_5    conda-forge
 libgcc-devel_linux-64     10.4.0              h74af60c_16    conda-forge
 libgcc-ng                 12.1.0              h8d9b700_16    conda-forge
 libgfortran-ng            12.1.0              h69a702a_16    conda-forge
 libgfortran5              12.1.0              hdcd56e2_16    conda-forge
 libgomp                   12.1.0              h8d9b700_16    conda-forge
 libgoogle-cloud           2.1.0                h9ebe8e8_2    conda-forge
 libiconv                  1.17                 h166bdaf_0    conda-forge
 liblapack                 3.9.0           16_linux64_openblas    conda-forge
 libllvm11                 11.1.0               hf817b99_3    conda-forge
 libnghttp2                1.47.0               hdcd2b5c_1    conda-forge
 libnsl                    2.0.0                h7f98852_0    conda-forge
 libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
 libpng                    1.6.38               h753d276_0    conda-forge
 libprotobuf               3.20.1               h6239696_4    conda-forge
 libraft-distance          22.10.00a220927 cuda11_g1dd2feb1_54    rapidsai-nightly
 libraft-headers           22.10.00a220927 cuda11_g1dd2feb1_54    rapidsai-nightly
 librmm                    22.10.00a220927 cuda11_g6e0d65a9_20    rapidsai-nightly
 libsanitizer              10.4.0              hde28e3b_16    conda-forge
 libsodium                 1.0.18               h36c2ea0_1    conda-forge
 libsqlite                 3.39.3               h753d276_0    conda-forge
 libssh2                   1.10.0               haa6b8db_3    conda-forge
 libstdcxx-devel_linux-64  10.4.0              h74af60c_16    conda-forge
 libstdcxx-ng              12.1.0              ha89aaad_16    conda-forge
 libthrift                 0.16.0               h491838f_2    conda-forge
 libtiff                   4.4.0                h55922b4_4    conda-forge
 libutf8proc               2.7.0                h7f98852_0    conda-forge
 libuuid                   2.32.1            h7f98852_1000    conda-forge
 libuv                     1.44.2               h166bdaf_0    conda-forge
 libwebp-base              1.2.4                h166bdaf_0    conda-forge
 libxcb                    1.13              h7f98852_1004    conda-forge
 libxml2                   2.10.2               h4c7fe37_1    conda-forge
 libxslt                   1.1.35               h8affb1d_0    conda-forge
 libzlib                   1.2.12               h166bdaf_3    conda-forge
 llvmlite                  0.38.1           py39h7d9a04d_0    conda-forge
 locket                    1.0.0              pyhd8ed1ab_0    conda-forge
 lxml                      4.9.1            py39hb9d737c_0    conda-forge
 lz4                       4.0.0            py39h029007f_2    conda-forge
 lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
 make                      4.3                  hd18ef5c_1    conda-forge
 markdown                  3.4.1              pyhd8ed1ab_0    conda-forge
 markupsafe                2.1.1            py39hb9d737c_1    conda-forge
 matplotlib-inline         0.1.6              pyhd8ed1ab_0    conda-forge
 mccabe                    0.7.0              pyhd8ed1ab_0    conda-forge
 mistune                   2.0.4              pyhd8ed1ab_0    conda-forge
 msgpack-python            1.0.4            py39hf939315_0    conda-forge
 nbclient                  0.6.8              pyhd8ed1ab_0    conda-forge
 nbconvert                 7.0.0              pyhd8ed1ab_0    conda-forge
 nbconvert-core            7.0.0              pyhd8ed1ab_0    conda-forge
 nbconvert-pandoc          7.0.0              pyhd8ed1ab_0    conda-forge
 nbformat                  5.6.1              pyhd8ed1ab_0    conda-forge
 nbsphinx                  0.8.9              pyhd8ed1ab_0    conda-forge
 nccl                      2.14.3.1             h0800d71_0    conda-forge
 ncurses                   6.3                  h27087fc_1    conda-forge
 nest-asyncio              1.5.5              pyhd8ed1ab_0    conda-forge
 networkx                  2.8.6              pyhd8ed1ab_0    conda-forge
 notebook                  6.4.12             pyha770c72_0    conda-forge
 numba                     0.55.2           py39h66db6d7_0    conda-forge
 numpy                     1.22.4           py39hc58783e_0    conda-forge
 numpydoc                  1.4.0              pyhd8ed1ab_1    conda-forge
 nvcc_linux-64             10.1                hcaf9a05_10
 nvtx                      0.2.3            py39h3811e60_1    conda-forge
 openjpeg                  2.5.0                h7d73246_1    conda-forge
 openssl                   1.1.1q               h166bdaf_0    conda-forge
 orc                       1.7.6                h6c59b99_0    conda-forge
 packaging                 21.3               pyhd8ed1ab_0    conda-forge
 pandas                    1.4.4            py39h1832856_0    conda-forge
 pandoc                    2.19.2               ha770c72_0    conda-forge
 pandocfilters             1.5.0              pyhd8ed1ab_0    conda-forge
 parquet-cpp               1.5.1                         2    conda-forge
 parso                     0.8.3              pyhd8ed1ab_0    conda-forge
 partd                     1.3.0              pyhd8ed1ab_0    conda-forge
 pexpect                   4.8.0              pyh9f0ad1d_2    conda-forge
 pickleshare               0.7.5                   py_1003    conda-forge
 pillow                    9.2.0            py39hd5dbb17_2    conda-forge
 pip                       22.2.2             pyhd8ed1ab_0    conda-forge
 pkgutil-resolve-name      1.3.10             pyhd8ed1ab_0    conda-forge
 pluggy                    1.0.0            py39hf3d152e_3    conda-forge
 prometheus_client         0.14.1             pyhd8ed1ab_0    conda-forge
 prompt-toolkit            3.0.31             pyha770c72_0    conda-forge
 protobuf                  3.20.1           py39h5a03fae_0    conda-forge
 psutil                    5.9.2            py39hb9d737c_0    conda-forge
 pthread-stubs             0.4               h36c2ea0_1001    conda-forge
 ptxcompiler               0.2.0            py39h107f55c_0    rapidsai
 ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
 pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
 py                        1.11.0             pyh6c4a22f_0    conda-forge
 py-cpuinfo                8.0.0              pyhd8ed1ab_0    conda-forge
 pyarrow                   9.0.0           py39hc0775d8_2_cpu    conda-forge
 pycodestyle               2.9.1              pyhd8ed1ab_0    conda-forge
 pycparser                 2.21               pyhd8ed1ab_0    conda-forge
 pydata-sphinx-theme       0.10.1             pyhd8ed1ab_0    conda-forge
 pyflakes                  2.5.0              pyhd8ed1ab_0    conda-forge
 pygal                     2.4.0                      py_0    conda-forge
 pygments                  2.13.0             pyhd8ed1ab_0    conda-forge
 pylibcugraph              22.10.0a0+84.gc2f983f0           dev_0    <develop>
 pylibraft                 22.10.00a220927 cuda11_py39_g1dd2feb1_54    rapidsai-nightly
 pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
 pyopenssl                 22.0.0             pyhd8ed1ab_1    conda-forge
 pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
 pyrsistent                0.18.1           py39hb9d737c_1    conda-forge
 pysocks                   1.7.1              pyha2e5f31_6    conda-forge
 pytest                    7.1.3            py39hf3d152e_0    conda-forge
 pytest-benchmark          3.2.3              pyh9f0ad1d_0    conda-forge
 pytest-cov                3.0.0              pyhd8ed1ab_0    conda-forge
 python                    3.9.13          h9a8a25e_0_cpython    conda-forge
 python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
 python-fastjsonschema     2.16.2             pyhd8ed1ab_0    conda-forge
 python_abi                3.9                      2_cp39    conda-forge
 pytz                      2022.2.1           pyhd8ed1ab_0    conda-forge
 pyyaml                    6.0              py39hb9d737c_4    conda-forge
 pyzmq                     24.0.1           py39headdf64_0    conda-forge
 raft-dask                 22.10.00a220927 cuda11_py39_g1dd2feb1_54    rapidsai-nightly
 rapids-pytest-benchmark   0.0.14                     py_0    rapidsai
 re2                       2022.06.01           h27087fc_0    conda-forge
 readline                  8.1.2                h0f457ee_0    conda-forge
 recommonmark              0.7.1              pyhd8ed1ab_0    conda-forge
 requests                  2.28.1             pyhd8ed1ab_1    conda-forge
 rhash                     1.4.3                h166bdaf_0    conda-forge
 rmm                       22.10.00a220927 cuda11_py39_g6e0d65a9_20    rapidsai-nightly
 s2n                       1.0.10               h9b69904_0    conda-forge
 s3transfer                0.6.0              pyhd8ed1ab_0    conda-forge
 scikit-build              0.15.0             pyhb871ab6_0    conda-forge
 scikit-learn              1.1.2            py39he5e8d7e_0    conda-forge
 scipy                     1.9.1            py39h8ba3f38_0    conda-forge
 send2trash                1.8.0              pyhd8ed1ab_0    conda-forge
 setuptools                65.4.0                   pypi_0    pypi
 six                       1.16.0             pyh6c4a22f_0    conda-forge
 snappy                    1.1.9                hbd366e4_1    conda-forge
 snowballstemmer           2.2.0              pyhd8ed1ab_0    conda-forge
 sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
 soupsieve                 2.3.2.post1        pyhd8ed1ab_0    conda-forge
 spdlog                    1.8.5                h4bd325d_1    conda-forge
 sphinx                    5.2.1              pyhd8ed1ab_0    conda-forge
 sphinx-copybutton         0.5.0              pyhd8ed1ab_0    conda-forge
 sphinx-markdown-tables    0.0.17             pyh6c4a22f_0    conda-forge
 sphinxcontrib-applehelp   1.0.2                      py_0    conda-forge
 sphinxcontrib-devhelp     1.0.2                      py_0    conda-forge
 sphinxcontrib-htmlhelp    2.0.0              pyhd8ed1ab_0    conda-forge
 sphinxcontrib-jsmath      1.0.1                      py_0    conda-forge
 sphinxcontrib-qthelp      1.0.3                      py_0    conda-forge
 sphinxcontrib-serializinghtml 1.1.5              pyhd8ed1ab_2    conda-forge
 sphinxcontrib-websupport  1.2.4              pyhd8ed1ab_1    conda-forge
 sqlite                    3.39.3               h4ff8645_0    conda-forge
 stack_data                0.5.1              pyhd8ed1ab_0    conda-forge
 sysroot_linux-64          2.12                he073ed8_15    conda-forge
 tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
 terminado                 0.15.0           py39hf3d152e_0    conda-forge
 threadpoolctl             3.1.0              pyh8a188c0_0    conda-forge
 tinycss2                  1.1.1              pyhd8ed1ab_0    conda-forge
 tk                        8.6.12               h27826a3_0    conda-forge
 toml                      0.10.2             pyhd8ed1ab_0    conda-forge
 tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
 toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
 tornado                   6.1              py39hb9d737c_3    conda-forge
 traitlets                 5.4.0              pyhd8ed1ab_0    conda-forge
 typing_extensions         4.3.0              pyha770c72_0    conda-forge
 tzdata                    2022d                h191b570_0    conda-forge
 ucx                       1.13.1               h538f049_0    conda-forge
 ucx-proc                  1.0.0                       gpu    rapidsai
 ucx-py                    0.28.00a220926  py39_g8e07f67_25    rapidsai-nightly
 urllib3                   1.26.11            pyhd8ed1ab_0    conda-forge
 wcwidth                   0.2.5              pyh9f0ad1d_2    conda-forge
 webencodings              0.5.1                      py_1    conda-forge
 wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
 xorg-libxau               1.0.9                h7f98852_0    conda-forge
 xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
 xz                        5.2.6                h166bdaf_0    conda-forge
 yaml                      0.2.5                h7f98852_2    conda-forge
 zeromq                    4.3.4                h9c3ff4c_1    conda-forge
 zict                      2.2.0              pyhd8ed1ab_0    conda-forge
 zipp                      3.8.1              pyhd8ed1ab_0    conda-forge
 zlib                      1.2.12               h166bdaf_3    conda-forge
 zstd                      1.5.2                h6239696_4    conda-forge


Additional context
Encountered in PropertyGraph in cugraph.

@eriknw eriknw added Needs Triage Need team to review and classify bug Something isn't working labels Sep 27, 2022
@shwina shwina added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Sep 28, 2022
@quasiben quasiben added dask-cudf dask Dask issue labels Oct 25, 2022
@shwina shwina moved this to In Progress in cuDF/Dask/Numba/UCX Oct 27, 2022
@shwina shwina moved this from In Progress to Todo in cuDF/Dask/Numba/UCX Oct 27, 2022
@shwina shwina moved this to Todo in cuDF/Dask/Numba/UCX Oct 27, 2022
@galipremsagar galipremsagar self-assigned this Jan 4, 2023
@shwina shwina moved this from In Progress to TODO in cuDF/Dask/Numba/UCX Jan 19, 2023
@vyasr
Contributor

vyasr commented May 17, 2024

This works now, with the caveat that the categorical column must be explicitly marked as ordered:

In [1]: import cudf
   ...: import dask_cudf
   ...: df = cudf.DataFrame({"a": list("caba"), "b": list(range(4))})
   ...: df["a"] = df["a"].astype("category").cat.as_ordered()  # Without ordering the dask version fails.
   ...: ddf = dask_cudf.from_cudf(df, npartitions=2)
   ...: print(df.sort_values("a"))
   ...: print(ddf.sort_values("a").compute())
   a  b
1  a  1
3  a  3
2  b  2
0  c  0
   a  b
1  a  1
3  a  3
2  b  2
0  c  0

@rjzamora any idea why dask_cudf behaves differently from cudf w.r.t. the ordering?
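For context on what "ordered" changes, here is a minimal pandas-only sketch (pandas rather than cudf so it runs without a GPU). It only illustrates pandas semantics: sorting an unordered categorical works, while reductions that need an ordering, such as `min`, raise. Whether this is the exact reason the dask path fails is an assumption, not something confirmed in this thread.

```python
import pandas as pd

# Same column as in the reproducer, left as an unordered categorical.
s = pd.Series(list("caba"), dtype="category")

# Sorting works on an unordered categorical (it follows category order).
sorted_vals = s.sort_values().tolist()

# Reductions that rely on an ordering raise TypeError when unordered.
try:
    s.min()
    unordered_min_ok = True
except TypeError:
    unordered_min_ok = False

# After marking the column as ordered, the same reduction succeeds.
ordered_min = s.cat.as_ordered().min()

print(sorted_vals, unordered_min_ok, ordered_min)
```
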

@rjzamora
Member

rjzamora commented May 17, 2024

Good catch @vyasr - The dask behavior was actually "fixed" recently in dask-expr (dask/dask-expr#1058), but I just realized that the pd.CategoricalDtype check will need to be updated to work for cudf (my mistake for missing that when I reviewed).

Even with dask-expr fixed, however, your snippet will not work for dask_cudf, because there seems to be a bug in cudf:

import cudf as lib  # Works for pandas, but not for cudf

df = lib.DataFrame({"a": list("caba"), "b": list(range(4))})
df["a"] = df["a"].astype("category")
df = df.iloc[:2]
df["a"].cat.as_ordered()
...
ValueError: Length of values (4) does not match length of index (2)

EDIT: I submitted #15778 to track this.
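For comparison, the pandas side of the snippet above can be written as a small check. This is only a sketch of the behavior cudf would be expected to match; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": list("caba"), "b": list(range(4))})
df["a"] = df["a"].astype("category")

# Slice first, then mark the categorical as ordered.
head = df.iloc[:2]
ordered = head["a"].cat.as_ordered()

# The result keeps the sliced length and the full category set.
print(len(ordered), ordered.cat.ordered, list(ordered.cat.categories))
```
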

@rjzamora
Member

Update: Latest version of dask-expr:main + dask:main now results in an ugly segfault when sorting on a categorical column. After #15788, the user will get a clear error until the upstream divisions logic is "generalized" to work with cudf.

@vyasr
Contributor

vyasr commented May 20, 2024

The chain has gotten a bit long here, so let me summarize to make sure I have everything right. #15780 will fix #15778. Once that is merged, will this issue also be fixed in the dask-expr case, or is there still work to be done to generalize dask-expr to work correctly for cudf because dask/dask-expr#1058 wasn't complete? And in either case, do we still expect this to fail for users of the legacy dask API (which I guess isn't too important if we're going to be forced to migrate to dask-expr anyway)?

@rjzamora
Member

rjzamora commented May 20, 2024

Summary:

I certainly want to fix categorical sorting for 24.06 if possible, but my current expectation is that we will need to raise an error and tell the user to disable query planning. If I can find a work-around in the next day or so, then we can remove the error. Otherwise, the proper/upstream fix will only apply to 24.08.

rapids-bot bot pushed a commit that referenced this issue May 22, 2024
Follow up to #15788

Adds a temporary workaround for sorting on categorical columns in 24.06: We convert only the partitioning column to pandas to calculate divisions.

This is related to #11795, but I don't want to "close" that issue until `RepartitionQuantiles` works with cudf-backed data.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #15801
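To make the idea in the commit message concrete, here is a rough pandas-only sketch of computing sort divisions from a single categorical column on the CPU. It is not the code from #15801: `categorical_divisions` is a hypothetical helper, and picking boundaries from sorted category codes is an assumed, simplified stand-in for the real quantile logic. In dask_cudf the column would first be moved to host memory (for example with `to_pandas()`), which is skipped here.

```python
import numpy as np
import pandas as pd


def categorical_divisions(col, npartitions):
    """Hypothetical helper: pick npartitions + 1 boundary values
    from a categorical column, using the order of its categories."""
    cat = col.cat.as_ordered()
    codes = cat.cat.codes.to_numpy()
    codes = np.sort(codes[codes >= 0])  # drop nulls (code -1), then sort
    # Evenly spaced positions in the sorted codes act as rough quantiles.
    positions = np.linspace(0, len(codes) - 1, npartitions + 1).round().astype(int)
    boundary_codes = codes[positions]
    # Map the integer codes back to category labels.
    return cat.cat.categories.take(boundary_codes).tolist()


# Only the partitioning column is needed, not the whole frame.
col = pd.Series(list("caba"), dtype="category")
divisions = categorical_divisions(col, npartitions=2)
print(divisions)
```

Only the partitioning column is touched, so the rest of the frame can stay on the GPU while the boundaries are computed.
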
@vyasr vyasr added this to cuDF Python Nov 5, 2024