feat: support casting to and from spark-like structs #1991

Merged
FBruzzesi merged 11 commits into main from feat/pyspark-struct-dtype on Feb 16, 2025

Conversation

@FBruzzesi (Member) commented Feb 11, 2025

Reason

There are multiple reasons for this PR to happen 😁

  • Eventually I would like to support Schema.to_pyspark
  • For some integrations it might be useful to have:
    • nw.struct emulating pl.struct
    • .struct.unnest() and/or Frame.unnest (see the polars sketch after this list)
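
A minimal polars sketch (illustrative only, not something this PR adds) of the behaviour the last bullet refers to; nw.struct and Frame.unnest would be the narwhals counterparts:

import polars as pl

df = pl.DataFrame({"movie": ["Cars", "Toy Story"], "rating": [4.5, 4.9]})

# pl.struct packs existing columns into a single Struct column...
packed = df.select(pl.struct(["movie", "rating"]).alias("a"))

# ...and unnest expands the Struct fields back into top-level columns.
unpacked = packed.unnest("a")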

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

I am having a hard time testing this 🤔

@FBruzzesi FBruzzesi added the enhancement New feature or request label Feb 11, 2025
@FBruzzesi FBruzzesi changed the title WIP, feat: support casting to and from spark-like structs feat: support casting to and from spark-like structs Feb 11, 2025
@FBruzzesi FBruzzesi marked this pull request as ready for review February 11, 2025 13:08
@MarcoGorelli (Member)

thanks!

> I am having a hard time testing this 🤔

😄 sorry could you elaborate please?

@FBruzzesi (Member, Author) commented Feb 11, 2025

> 😄 sorry could you elaborate please?

Sure, sorry 😄

Ideally we would want to:

def test_cast_struct(request: pytest.FixtureRequest, constructor: Constructor) -> None:
    if any(
-         backend in str(constructor) for backend in ("dask", "modin", "cudf", "pyspark")
+         backend in str(constructor) for backend in ("dask", "modin", "cudf")
    ):

However, pyspark converts the following input into a column of type MAP<STRING, STRING>:

data = {
    "a": [
        {"movie ": "Cars", "rating": 4.5},
        {"movie ": "Toy Story", "rating": 4.9},
    ]
}

and conversion via cast is not supported.
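
A minimal sketch of the behaviour (assuming a local SparkSession; the MAP<STRING, STRING> inference is as reported above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = [
    {"a": {"movie ": "Cars", "rating": 4.5}},
    {"a": {"movie ": "Toy Story", "rating": 4.9}},
]
# Python dicts are inferred as MapType by default, so "a" comes out as a map
# rather than a struct, and casting it to a struct afterwards is not supported.
spark.createDataFrame(rows).printSchema()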

I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the Field types. Do you think that would be enough as a test?

(Here is the link to the above test)
@MarcoGorelli (Member)

> I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the Field types. Do you think that would be enough as a test?

sure thanks!

@FBruzzesi (Member, Author)

> > I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the Field types. Do you think that would be enough as a test?
>
> sure thanks!

I had already forgotten 🙈 pushed now!

@osoucy (Contributor) commented Feb 15, 2025

Great work! I had done something very similar on my side!

For testing, however, I had a slightly different strategy. Instead of creating a new test, I used the existing test_cast_struct as follows:

# Assumed imports (not shown in the original comment), as in the narwhals
# test-suite helpers:
import pytest

import narwhals as nw
from tests.utils import PANDAS_VERSION, Constructor


def test_cast_struct(request: pytest.FixtureRequest, constructor: Constructor) -> None:
    if any(backend in str(constructor) for backend in ("dask", "modin", "cudf")):
        request.applymarker(pytest.mark.xfail)

    if "pandas" in str(constructor) and PANDAS_VERSION < (2, 2):
        request.applymarker(pytest.mark.xfail)

    data = {
        "a": [
            {"movie ": "Cars", "rating": 4.5},
            {"movie ": "Toy Story", "rating": 4.9},
        ]
    }

    dtype = nw.Struct([nw.Field("movie ", nw.String()), nw.Field("rating", nw.Float64())])

    native_df = constructor(data)
    if "spark" in str(constructor):
        import pyspark.sql.functions as F
        import pyspark.sql.types as T

        # Re-define "a" to force a StructType, since PySpark infers the dict
        # input above as a MapType instead.
        native_df = native_df.withColumn(
            "a",
            F.struct(
                F.col("a.movie ").alias("movie ").cast(T.StringType()),
                F.col("a.rating").alias("rating").cast(T.DoubleType()),
            ),
        )

    result = nw.from_native(native_df).select(nw.col("a").cast(dtype)).lazy().collect()
    assert result.schema == {"a": dtype}

As you can see, when the constructor is PySpark, we need to re-define the column "a" to force a StructType instead of a MapType, which is something you faced yourself.

However, I still had an issue when calling the last .collect() as it uses df._collect_to_pyarrow() which does not seem to support StructType. A normal df.collect() would work, but it would not return a DataFrame object.

Have you seen the same thing when you run your test?

@FBruzzesi (Member, Author)

> Great work! I had done something very similar on my side!

Thanks @osoucy, and I am sorry to hear we duplicated work 🥲

> However, I still had an issue when calling the last .collect() as it uses df._collect_to_pyarrow() which does not seem to support StructType. A normal df.collect() would work, but it would not return a DataFrame object.
>
> Have you seen the same thing when you run your test?

Not really; locally I have no issues with your code either. If you fancy sharing your GitHub commit email, I can add you as a co-author.

@osoucy (Contributor) commented Feb 15, 2025

Here is my email: [email protected]

In that case, it must be an issue with my specific environment (Python vs PySpark vs PyArrow versions). I'm glad it's only me!

@FBruzzesi (Member, Author)

> Here is my email: [email protected]

The one used for commits should be something like: [email protected] (see how to find it)
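
(For example, git log -1 --format='%ae' prints the author email of your most recent commit, and git config user.email shows the one configured for new commits.)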

> In that case, it must be an issue with my specific environment (Python vs PySpark vs PyArrow versions). I'm glad it's only me!

We did some refactoring + added new features; let us know if you keep having problems with the env in the future 🤔

@osoucy (Contributor) commented Feb 16, 2025

Sorry, I read too quickly. Here it is: [email protected]

Glad to see you were able to incorporate my suggested changes for the unit tests.

@FBruzzesi (Member, Author)

> Sorry, I read too quickly. Here it is: [email protected]

No worries, I tried with the other email and I can see you as co-author for 5dc3a09, so it worked!

> Glad to see you were able to incorporate my suggested changes for the unit tests.

Thanks for reviewing and for providing a cleaner solution 👌

Comment on lines +126 to +132
if isinstance_or_issubclass(dtype, (dtypes.List, dtypes.Array)):
    return spark_types.ArrayType(
        elementType=narwhals_to_native_dtype(
            dtype.inner,  # type: ignore[union-attr]
            version=version,
            spark_types=spark_types,
        )
@dangotbanned (Member)

The # type: ignore here is an example of this issue (#1807 (comment))
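
For context, a minimal self-contained sketch (stand-in class names; the helper only mimics the call shown above) of the union-attr complaint that the ignore silences: a helper returning a plain bool does not narrow the union for mypy, so the .inner access still looks unsafe to it:

from typing import Union

class List_:  # stand-in for dtypes.List
    inner: int = 0

class Array_:  # stand-in for dtypes.Array
    inner: int = 0

class Int64_: ...  # stand-in for a dtype without .inner

def isinstance_or_issubclass(obj: object, cls: tuple[type, ...]) -> bool:
    # Returns a plain bool (not a typing.TypeGuard), so mypy cannot narrow
    # `obj` at the call site.
    return isinstance(obj, cls)

def inner_dtype(dtype: Union[List_, Array_, Int64_]) -> int:
    if isinstance_or_issubclass(dtype, (List_, Array_)):
        return dtype.inner  # type: ignore[union-attr]  # mypy still sees Int64_
    return -1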

Off-topic-ish, but should I spin that out into a new issue?

I think it might get lost in that PR

@FBruzzesi (Member, Author)

Thanks @dangotbanned. I'd say let's track it in a dedicated issue, as it's not even introduced in this specific PR.

@EdAbati (Collaborator) left a comment

This looks 👌👌👌

@FBruzzesi FBruzzesi merged commit 6662df5 into main Feb 16, 2025
27 of 28 checks passed
@FBruzzesi FBruzzesi deleted the feat/pyspark-struct-dtype branch February 16, 2025 15:02
Labels
enhancement New feature or request

Successfully merging this pull request may close these issues:
[Enh]: cast expr in SparkLike

5 participants