[BUG] sort_values on categorical column fails in dask_cudf #11795
Comments
This works now, with the caveat that the categorical column must be explicitly marked as ordered.
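A minimal sketch of what "explicitly marked as ordered" means, using pandas as a stand-in so it runs without a GPU. The frame and column names are illustrative; the assumption is that the same `.cat.as_ordered()` call is applied to the dask_cudf column before `sort_values`.

```python
import pandas as pd

# Illustrative data; column names are assumptions, not taken from the report.
df = pd.DataFrame({"a": list("caba"), "b": list(range(4))})
df["a"] = df["a"].astype("category")

# Categoricals are unordered by default, so mark the column as ordered
# explicitly before sorting on it.
df["a"] = df["a"].cat.as_ordered()

result = df.sort_values("a")
```

With dask_cudf, the equivalent step would presumably be done on the collection's column before calling `ddf.sort_values("a")`.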
@rjzamora any idea why dask_cudf behaves differently from cudf w.r.t. the ordering?
Good catch @vyasr - The dask behavior was actually "fixed" recently in dask-expr (dask/dask-expr#1058), but I just realized that the
Even with dask-expr fixed, however, your snippet will not work for dask_cudf, because there seems to be a bug in cudf:
import cudf as lib # Works for pandas, but not for cudf
df = lib.DataFrame({"a": list("caba"), "b": list(range(4))})
df["a"] = df["a"].astype("category")
df = df.iloc[:2]
df["a"].cat.as_ordered()
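For comparison, here is a sketch of the same steps with pandas substituted for `lib`, which is the case the comment above says works. It only shows the pandas side; the cudf failure tracked in #15778 is not reproduced here.

```python
import pandas as pd

df = pd.DataFrame({"a": list("caba"), "b": list(range(4))})
df["a"] = df["a"].astype("category")

# Slicing keeps the full set of categories, including ones that no longer
# appear in the remaining rows.
df = df.iloc[:2]

# as_ordered() returns a new Series; it does not modify df in place.
ordered = df["a"].cat.as_ordered()
```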
EDIT: I submitted #15778 to track this.
Update: Latest version of
The chain has gotten a bit long here, let me summarize to make sure I have everything right. #15780 will fix #15778. Once that is merged, will this issue also be fixed in the dask-expr case, or is there still work to be done to generalize dask-expr to work correctly for cudf because dask/dask-expr#1058 wasn't complete? And in either case, do we still expect this to fail for users of the legacy dask API (which I guess isn't too important if we're going to be forced to migrate to dask-expr anyway)?
Summary:
I certainly want to fix categorical sorting for 24.06 if possible, but my current expectation is that we will need to raise an error and tell the user to disable query planning. If I can find a workaround in the next day or so, then we can remove the error. Otherwise, the proper/upstream fix will only apply to 24.08.
Follow-up to #15788.
Adds a temporary workaround for sorting on categorical columns in 24.06: we convert only the partitioning column to pandas to calculate divisions. This is related to #11795, but I don't want to "close" that issue until `RepartitionQuantiles` works with cudf-backed data.
Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)
Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
URL: #15801
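To illustrate the idea of calculating divisions from the partitioning column alone, here is a rough pandas-only sketch. It is not the code from #15801: `sketch_divisions` is a hypothetical helper, the evenly spaced selection is a simplification of real quantile logic, and the column is assumed to have already been converted to pandas (for example with `to_pandas()`) and to contain no missing values.

```python
import numpy as np
import pandas as pd


def sketch_divisions(col: pd.Series, npartitions: int) -> list:
    """Hypothetical helper: pick npartitions + 1 boundary values
    from a categorical column, following the category order."""
    ordered = col.cat.as_ordered()
    # Category codes follow the category order, so sorting the codes
    # sorts the values. Assumes no missing values (code -1).
    codes = np.sort(ordered.cat.codes.to_numpy())
    # Evenly spaced positions in the sorted codes stand in for quantiles.
    positions = np.linspace(0, len(codes) - 1, npartitions + 1).round().astype(int)
    return [ordered.cat.categories[c] for c in codes[positions]]


# Only the sort column is needed to compute the boundaries.
col = pd.Series(list("caba"), dtype="category")
divisions = sketch_divisions(col, npartitions=2)
```

The point of the design is that only one column leaves the GPU, so the rest of the data can stay cudf-backed while the boundaries are computed.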
Describe the bug
`ddf.sort_values(col)` does not work with a `dask_cudf` DataFrame when `col` is categorical.
Steps/Code to reproduce bug
Traceback
Expected behavior
I expect it to work, that is, to match the result of cudf and dask.dataframe.
Environment overview (please complete the following information)
Environment details
Additional context
Encountered in PropertyGraph in cugraph.