Add DataFrame fill_nan/fill_null #1019
def test_coalesce(df):
    # Create a DataFrame with null values
    ctx = SessionContext()
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array(["Hello", None, "!"]),  # string column with null
            pa.array([4, None, 6]),  # integer column with null
            pa.array(["hello ", None, " !"]),  # string column with null
            pa.array(
                [datetime(2022, 12, 31), None, datetime(2020, 7, 2)]
            ),  # datetime with null
            pa.array([False, None, True]),  # boolean column with null
        ],
        names=["a", "b", "c", "d", "e"],
    )
    df_with_nulls = ctx.create_dataframe([[batch]])

    # Test coalesce with different data types
    result_df = df_with_nulls.select(
        f.coalesce(column("a"), literal("default")).alias("a_coalesced"),
        f.coalesce(column("b"), literal(0)).alias("b_coalesced"),
        f.coalesce(column("c"), literal("default")).alias("c_coalesced"),
        f.coalesce(column("d"), literal(datetime(2000, 1, 1))).alias("d_coalesced"),
        f.coalesce(column("e"), literal(False)).alias("e_coalesced"),
    )

    result = result_df.collect()[0]

    # Verify results
    assert result.column(0) == pa.array(
        ["Hello", "default", "!"], type=pa.string_view()
    )
    assert result.column(1) == pa.array([4, 0, 6], type=pa.int64())
    assert result.column(2) == pa.array(
        ["hello ", "default", " !"], type=pa.string_view()
    )
    assert result.column(3) == pa.array(
        [datetime(2022, 12, 31), datetime(2000, 1, 1), datetime(2020, 7, 2)],
        type=pa.timestamp("us"),
    )
    assert result.column(4) == pa.array([False, False, True], type=pa.bool_())

    # Test multiple arguments
    result_df = df_with_nulls.select(
        f.coalesce(column("a"), literal(None), literal("fallback")).alias(
            "multi_coalesce"
        )
I could not find existing tests for coalesce, which I used to implement fill_null, so I added these.
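For readers less familiar with coalesce, the semantics the test checks can be modeled in plain Python: for each row, take the first argument that is not null. This is only a sketch under stated assumptions. `None` stands in for SQL NULL, and the helper names `coalesce_rows` and `fill_null_column` are hypothetical, chosen for illustration. It is neither the DataFusion API nor this PR's implementation.

```python
from typing import Any, Optional, Sequence


def coalesce_rows(*columns: Sequence[Optional[Any]]) -> list:
    """Row-wise coalesce: the first non-None value across the given columns.

    Hypothetical helper; None models SQL NULL.
    """
    result = []
    for row in zip(*columns):
        # next() falls back to None when every value in the row is null
        result.append(next((v for v in row if v is not None), None))
    return result


def fill_null_column(column: Sequence[Optional[Any]], value: Any) -> list:
    """Model of fill_null expressed as coalesce(column, literal(value))."""
    literal = [value] * len(column)  # a literal applies to every row
    return coalesce_rows(column, literal)


a = ["Hello", None, "!"]
b = [4, None, 6]
print(fill_null_column(a, "default"))  # ['Hello', 'default', '!']
print(fill_null_column(b, 0))          # [4, 0, 6]
```

The multi-argument case from the test maps onto the same model: a column of nulls in the middle is skipped, and the later fallback is used only for rows where the first column is null.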
This looks like a very worthwhile and useful addition. Thank you!
We've tried to keep most of the heavier logic on the Rust side and to keep the Python wrappers as a way to convert from Rust to Pythonic interfaces. Do you think this is a case where doing the logic on the Python side makes more sense?
More generally, do you think this is something we can or should upstream to the core datafusion repo? I can assist with that if you like.
Which issue does this PR close?
Closes #922.
Rationale for this change
DataFusion currently lacks built-in methods for handling missing values (nulls and NaNs) in DataFrames. This functionality is commonly needed in data processing workflows and is available in other DataFrame libraries like pandas and PySpark.
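The two kinds of missing value behave differently: a null is an absent value, while NaN is a valid floating-point value. A rough plain-Python sketch of NaN replacement follows, assuming `None` models NULL and that only float values can be NaN. The function name `fill_nan_column` is illustrative and is not taken from this PR's code.

```python
import math
from typing import Optional, Sequence


def fill_nan_column(column: Sequence[Optional[float]], value: float) -> list:
    """Replace NaN with `value`; leave nulls (None) and all other values untouched."""
    return [
        value if isinstance(v, float) and math.isnan(v) else v
        for v in column
    ]


col = [1.5, float("nan"), None, 3.0]
print(fill_nan_column(col, 0.0))  # [1.5, 0.0, None, 3.0]
```

Note that the null in the third position survives: filling NaNs and filling nulls are separate operations in this model.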
The changes add:
- fill_null() method to replace NULL values with a specified value
- fill_nan() method to replace NaN values with a numeric value in floating-point columns
What changes are included in this PR?
- Added fill_null() method to DataFrame class:
- Added fill_nan() method to DataFrame class:
- Added comprehensive test cases for both methods covering:
Are there any user-facing changes?
Yes, two new public methods are added to the DataFrame class: