Add DataFrame fill_nan/fill_null #1019

Open: kosiew wants to merge 8 commits into main

kosiew
Contributor

@kosiew kosiew commented Feb 12, 2025

Which issue does this PR close?

Closes #922.

Rationale for this change

DataFusion currently lacks built-in methods for handling missing values (nulls and NaNs) in DataFrames. This functionality is commonly needed in data processing workflows and is available in other DataFrame libraries like pandas and PySpark.

The changes add:

  • fill_null() method to replace NULL values with a specified value
  • fill_nan() method to replace NaN values with a numeric value in floating-point columns

What changes are included in this PR?

  1. Added fill_null() method to DataFrame class:

    • Replaces NULL values with a specified value
    • Handles type casting validation
    • Allows filling specific columns or entire DataFrame
    • Leaves a column unchanged when the fill value cannot be cast to its type
  2. Added fill_nan() method to DataFrame class:

    • Replaces NaN values with numeric values
    • Only operates on floating-point columns
    • Validates input types and column types
    • Allows filling specific columns or all floating-point columns
  3. Added comprehensive test cases for both methods covering:

    • Different data types
    • Column subsetting
    • Type validation
    • Error cases
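
The fill_null() behavior listed above can be modeled in plain Python. This is a minimal sketch of the semantics only, not the code in this PR: it treats a table as a dict of column lists with None standing in for NULL, assumes the fill acts like a per-column coalesce(column, value), and uses a hypothetical cast_to helper in place of real type-cast validation.

```python
from typing import Any, Dict, List, Optional

Table = Dict[str, List[Any]]


def coalesce(*values: Any) -> Any:
    """Return the first argument that is not None (NULL), or None if all are."""
    for v in values:
        if v is not None:
            return v
    return None


def cast_to(value: Any, column: List[Any]) -> Any:
    """Hypothetical cast check: convert value to the column's element type.

    Raises TypeError or ValueError when the conversion is not possible.
    """
    non_null = [v for v in column if v is not None]
    if not non_null:
        return value  # no type information available; keep the value as given
    target = type(non_null[0])
    if isinstance(value, target):
        return value
    return target(value)


def fill_null(table: Table, value: Any, subset: Optional[List[str]] = None) -> Table:
    """Model of fill_null: replace None in the chosen columns where the cast works."""
    names = list(table) if subset is None else subset
    result = dict(table)
    for name in names:
        column = table[name]
        try:
            fill = cast_to(value, column)
        except (TypeError, ValueError):
            continue  # cast failed: leave this column unchanged
        result[name] = [coalesce(v, fill) for v in column]
    return result


table = {"name": ["a", None], "qty": [1, None]}
filled = fill_null(table, "missing")
# "name" is filled; "qty" is left as-is because "missing" cannot be cast to int
```

In this model a failed cast silently skips the column, which mirrors the "where possible" wording in the description; whether the real method reports skipped columns is not stated here.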

Are there any user-facing changes?

Yes, two new public methods are added to the DataFrame class:

# Fill nulls with a value
df = df.fill_null(0)  # Fill all nulls with 0 where possible
df = df.fill_null("missing", subset=["name"])  # Fill specific columns

# Fill NaN values in numeric columns
df = df.fill_nan(0)  # Fill all NaNs with 0
df = df.fill_nan(99.9, subset=["price"])  # Fill specific columns
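
NaN and NULL are different things: NaN is a floating-point value, while NULL marks a missing entry. The sketch below models the fill_nan() semantics described above in plain Python; it is not the PR's implementation. The dict-of-lists table, the is_float_column helper, and the choice to raise when a non-float column is named in subset are all assumptions made for illustration.

```python
import math
from typing import Any, Dict, List, Optional, Union

Table = Dict[str, List[Any]]
Number = Union[int, float]


def is_float_column(column: List[Any]) -> bool:
    """Treat a column as floating-point if every non-null entry is a float."""
    non_null = [v for v in column if v is not None]
    return bool(non_null) and all(isinstance(v, float) for v in non_null)


def fill_nan(table: Table, value: Number, subset: Optional[List[str]] = None) -> Table:
    """Model of fill_nan: replace NaN in float columns, leaving NULLs untouched."""
    # Validate the fill value; bool is excluded even though it subclasses int.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("fill value must be numeric")
    names = list(table) if subset is None else subset
    result = dict(table)
    for name in names:
        column = table[name]
        if not is_float_column(column):
            if subset is not None:
                # Assumption: an explicitly requested non-float column is an error.
                raise ValueError(f"column {name!r} is not floating-point")
            continue  # no subset given: skip non-float columns
        result[name] = [
            float(value) if isinstance(v, float) and math.isnan(v) else v
            for v in column
        ]
    return result


table = {"price": [1.5, float("nan"), None], "qty": [1, None, 3]}
out = fill_nan(table, 0)
# The NaN in "price" becomes 0.0, its None stays None, and "qty" is not touched.
```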

Comment on lines +1178 to +1224
def test_coalesce(df):
    # Create a DataFrame with null values
    ctx = SessionContext()
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array(["Hello", None, "!"]),  # string column with null
            pa.array([4, None, 6]),  # integer column with null
            pa.array(["hello ", None, " !"]),  # string column with null
            pa.array(
                [datetime(2022, 12, 31), None, datetime(2020, 7, 2)]
            ),  # datetime with null
            pa.array([False, None, True]),  # boolean column with null
        ],
        names=["a", "b", "c", "d", "e"],
    )
    df_with_nulls = ctx.create_dataframe([[batch]])

    # Test coalesce with different data types
    result_df = df_with_nulls.select(
        f.coalesce(column("a"), literal("default")).alias("a_coalesced"),
        f.coalesce(column("b"), literal(0)).alias("b_coalesced"),
        f.coalesce(column("c"), literal("default")).alias("c_coalesced"),
        f.coalesce(column("d"), literal(datetime(2000, 1, 1))).alias("d_coalesced"),
        f.coalesce(column("e"), literal(False)).alias("e_coalesced"),
    )

    result = result_df.collect()[0]

    # Verify results
    assert result.column(0) == pa.array(
        ["Hello", "default", "!"], type=pa.string_view()
    )
    assert result.column(1) == pa.array([4, 0, 6], type=pa.int64())
    assert result.column(2) == pa.array(
        ["hello ", "default", " !"], type=pa.string_view()
    )
    assert result.column(3) == pa.array(
        [datetime(2022, 12, 31), datetime(2000, 1, 1), datetime(2020, 7, 2)],
        type=pa.timestamp("us"),
    )
    assert result.column(4) == pa.array([False, False, True], type=pa.bool_())

    # Test multiple arguments
    result_df = df_with_nulls.select(
        f.coalesce(column("a"), literal(None), literal("fallback")).alias(
            "multi_coalesce"
        )
Contributor Author

I could not find existing tests for coalesce, which fill_null relies on, so I added these.

@kosiew kosiew marked this pull request as ready for review February 12, 2025 09:47
Contributor

@timsaucer timsaucer left a comment

This looks like a very worthwhile and useful addition. Thank you!

We've tried to keep most of the heavier logic on the Rust side and to keep the Python wrappers as a way to convert from Rust to Pythonic interfaces. Do you think this is a case where doing the logic on the Python side makes more sense?

More generally, do you think this is something we can or should upstream to the core DataFusion repo? I can assist with that if you like.
