Add DataFrame fill_nan/fill_null #1019

kosiew · 2025-02-12T08:40:40Z

Which issue does this PR close?

Closes #922.

Rationale for this change

DataFusion currently lacks built-in methods for handling missing values (nulls and NaNs) in DataFrames. This functionality is commonly needed in data processing workflows and is available in other DataFrame libraries like pandas and PySpark.

The changes add:

- fill_null() method to replace NULL values with a specified value
- fill_nan() method to replace NaN values with a numeric value in floating-point columns
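
The two methods are separate because NULL and NaN are different kinds of missing value: NULL means no value is stored, while NaN is an actual IEEE 754 floating-point value. A minimal sketch of the distinction in plain Python follows. Using None as a stand-in for NULL is an assumption made for illustration only, and the snippet does not call DataFusion.

```python
import math

# None stands in for NULL here; float("nan") is a real float value.
values = [1.5, None, float("nan"), 4.0]

# Positions holding no value at all (what fill_null targets).
null_positions = [i for i, v in enumerate(values) if v is None]

# Positions holding the float value NaN (what fill_nan targets).
nan_positions = [
    i for i, v in enumerate(values) if isinstance(v, float) and math.isnan(v)
]

# NaN never compares equal to itself, so an equality check cannot find it.
nan_equals_itself = values[2] == values[2]
```

This is why a NaN check needs math.isnan (or an equivalent) rather than a comparison, and why a null fill does not touch NaN entries.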

What changes are included in this PR?

- Added fill_null() method to DataFrame class:
  - Replaces NULL values with a specified value
  - Handles type casting validation
  - Allows filling specific columns or entire DataFrame
  - Preserves columns where casting fails
- Added fill_nan() method to DataFrame class:
  - Replaces NaN values with numeric values
  - Only operates on floating-point columns
  - Validates input types and column types
  - Allows filling specific columns or all numeric columns
- Added comprehensive test cases for both methods covering:
  - Different data types
  - Column subsetting
  - Type validation
  - Error cases
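
The behavior listed above can be modeled on plain Python lists. This is only a sketch of the described semantics, not the PR's implementation: the helper names fill_null_column and fill_nan_column, the caster parameter, and the choice to raise ValueError for a non-numeric fill value are all hypothetical.

```python
import math


def fill_null_column(values, fill_value, caster):
    """Fill None entries; leave the column unchanged if the cast fails."""
    try:
        typed_fill = caster(fill_value)
    except (TypeError, ValueError):
        return list(values)  # cast failed: preserve the column as-is
    return [typed_fill if v is None else v for v in values]


def fill_nan_column(values, fill_value):
    """Replace NaN entries in a float column; None (NULL) is left alone."""
    # Hypothetical validation: reject bools and non-numeric fill values.
    if isinstance(fill_value, bool) or not isinstance(fill_value, (int, float)):
        raise ValueError("fill value must be numeric")
    return [
        float(fill_value) if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]


ints = fill_null_column([4, None, 6], 0, int)
names = fill_null_column(["a", None], "missing", str)
kept = fill_null_column([4, None, 6], "missing", int)  # int("missing") fails
prices = fill_nan_column([1.0, float("nan"), None], 99.9)
```

In this model, filling an integer column with "missing" leaves the column untouched, which corresponds to the "preserves columns where casting fails" bullet, and fill_nan_column leaves the None entry in place.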

Are there any user-facing changes?

Yes, two new public methods are added to the DataFrame class:

# Fill nulls with a value
df = df.fill_null(0)  # Fill all nulls with 0 where possible
df = df.fill_null("missing", subset=["name"])  # Fill specific columns

# Fill NaN values in numeric columns
df = df.fill_nan(0)  # Fill all NaNs with 0
df = df.fill_nan(99.9, subset=["price"])  # Fill specific columns

kosiew · 2025-02-12T09:05:06Z

python/tests/test_functions.py

+def test_coalesce(df):
+    # Create a DataFrame with null values
+    ctx = SessionContext()
+    batch = pa.RecordBatch.from_arrays(
+        [
+            pa.array(["Hello", None, "!"]),  # string column with null
+            pa.array([4, None, 6]),  # integer column with null
+            pa.array(["hello ", None, " !"]),  # string column with null
+            pa.array(
+                [datetime(2022, 12, 31), None, datetime(2020, 7, 2)]
+            ),  # datetime with null
+            pa.array([False, None, True]),  # boolean column with null
+        ],
+        names=["a", "b", "c", "d", "e"],
+    )
+    df_with_nulls = ctx.create_dataframe([[batch]])
+
+    # Test coalesce with different data types
+    result_df = df_with_nulls.select(
+        f.coalesce(column("a"), literal("default")).alias("a_coalesced"),
+        f.coalesce(column("b"), literal(0)).alias("b_coalesced"),
+        f.coalesce(column("c"), literal("default")).alias("c_coalesced"),
+        f.coalesce(column("d"), literal(datetime(2000, 1, 1))).alias("d_coalesced"),
+        f.coalesce(column("e"), literal(False)).alias("e_coalesced"),
+    )
+
+    result = result_df.collect()[0]
+
+    # Verify results
+    assert result.column(0) == pa.array(
+        ["Hello", "default", "!"], type=pa.string_view()
+    )
+    assert result.column(1) == pa.array([4, 0, 6], type=pa.int64())
+    assert result.column(2) == pa.array(
+        ["hello ", "default", " !"], type=pa.string_view()
+    )
+    assert result.column(3) == pa.array(
+        [datetime(2022, 12, 31), datetime(2000, 1, 1), datetime(2020, 7, 2)],
+        type=pa.timestamp("us"),
+    )
+    assert result.column(4) == pa.array([False, False, True], type=pa.bool_())
+
+    # Test multiple arguments
+    result_df = df_with_nulls.select(
+        f.coalesce(column("a"), literal(None), literal("fallback")).alias(
+            "multi_coalesce"
+        )


I could not find tests for coalesce, which I used for fill_null, so I added these.
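
For readers unfamiliar with coalesce, here is a rough pure-Python model of its semantics (return the first non-null argument) and of how a null fill can be expressed with it. The coalesce function below is a stand-in written for illustration, not the DataFusion function exercised in the test above.

```python
def coalesce(*args):
    """Return the first argument that is not None, or None if all are."""
    for arg in args:
        if arg is not None:
            return arg
    return None


column_a = ["Hello", None, "!"]

# Filling nulls in one column amounts to coalesce(value, literal) per row.
filled = [coalesce(v, "default") for v in column_a]

# Extra arguments are tried in order, so a None in the middle is skipped.
multi = [coalesce(v, None, "fallback") for v in column_a]
```

The sample values mirror column "a" and the multiple-argument case from the test.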

timsaucer

This looks like a very worthwhile and useful addition. Thank you!

We've tried to keep most of the heavier logic on the Rust side and to keep the Python wrappers as a way to convert from Rust to Pythonic interfaces. Do you think this is a case where doing the logic on the Python side makes more sense?

More generally, do you think this is something we can or should upstream to the core DataFusion repo? I can assist with that if you like.

kosiew added 8 commits February 12, 2025 15:01

feat: add fill_null method to DataFrame for handling null values

106555e

test: add coalesce function tests for handling default values

cff9b7c

Resolve test cases for fill_null

4cf7496

feat: add fill_nan method to DataFrame for handling NaN values

df6208e

move imports out of functions

23ba1bd

docs: add documentation for fill_null and fill_nan methods in DataFrame

d6ca465

Add more tests

8582104

fix ruff errors

73b692f

kosiew force-pushed the fill-null branch from 9509f6d to 73b692f on February 12, 2025 09:03

kosiew marked this pull request as ready for review February 12, 2025 09:47

timsaucer reviewed Feb 15, 2025

This was referenced Feb 19, 2025

Add DataFrame fill_null apache/datafusion#14765 (Closed)

Add DataFrame fill_nan apache/datafusion#14770 (Open)
