feat: support casting to and from spark-like structs #1991

Merged
FBruzzesi merged 11 commits into main from feat/pyspark-struct-dtype on Feb 16, 2025

Conversation

@FBruzzesi (Member) commented Feb 11, 2025

Reason

There are multiple reasons for this PR to happen 😁

  • Eventually I would like to support Schema.to_pyspark
  • For some integrations it might be useful to have:
    • nw.struct emulating pl.struct
    • .struct.unnest() and/or Frame.unnest (see the polars sketch after this list)
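
A minimal polars sketch (illustrative only, not something this PR adds) of the behaviour the last bullet refers to; nw.struct and Frame.unnest would be the narwhals counterparts:

import polars as pl

df = pl.DataFrame({"movie": ["Cars", "Toy Story"], "rating": [4.5, 4.9]})

# pl.struct packs existing columns into a single Struct column...
packed = df.select(pl.struct(["movie", "rating"]).alias("a"))

# ...and unnest expands the Struct fields back into top-level columns.
unpacked = packed.unnest("a")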

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

I am having a hard time testing this 🤔

@FBruzzesi FBruzzesi added the enhancement New feature or request label Feb 11, 2025
@FBruzzesi FBruzzesi changed the title WIP, feat: support casting to and from spark-like structs feat: support casting to and from spark-like structs Feb 11, 2025
@FBruzzesi FBruzzesi marked this pull request as ready for review February 11, 2025 13:08
@MarcoGorelli (Member)

thanks!

> I am having a hard time testing this 🤔

😄 sorry could you elaborate please?

@FBruzzesi (Member, Author) commented Feb 11, 2025

> 😄 sorry could you elaborate please?

Sure, sorry 😄

Ideally we would want to:

def test_cast_struct(request: pytest.FixtureRequest, constructor: Constructor) -> None:
    if any(
-         backend in str(constructor) for backend in ("dask", "modin", "cudf", "pyspark")
+         backend in str(constructor) for backend in ("dask", "modin", "cudf")
    ):

However, pyspark converts the following input into a column of type MAP<STRING, STRING>:

data = {
    "a": [
        {"movie ": "Cars", "rating": 4.5},
        {"movie ": "Toy Story", "rating": 4.9},
    ]
}

and conversion via cast is not supported.
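
A minimal sketch of the behaviour (assuming a local SparkSession; the MAP<STRING, STRING> inference is as reported above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rows = [
    {"a": {"movie ": "Cars", "rating": 4.5}},
    {"a": {"movie ": "Toy Story", "rating": 4.9}},
]
# Python dicts are inferred as MapType by default, so "a" comes out as a map
# rather than a struct, and casting it to a struct afterwards is not supported.
spark.createDataFrame(rows).printSchema()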

I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the Field types. Do you think that would be enough as a test?

(Here is the link to the above test)
@MarcoGorelli (Member)

> I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the Field types. Do you think that would be enough as a test?

sure thanks!

@FBruzzesi (Member, Author)

> > I didn't have time today, but I can add a dedicated test for pyspark which initializes a dataframe with a column already of type Struct, but changes the Field types. Do you think that would be enough as a test?
>
> sure thanks!

I had already forgotten 🙈 pushed now!

@osoucy (Contributor) commented Feb 15, 2025

Great work! I had done something very similar on my side!

For testing, however, I had a slightly different strategy. Instead of creating a new test, I used the existing test_cast_struct as follows:

# Assumed imports (not shown in the original comment), as in the narwhals
# test-suite helpers:
import pytest

import narwhals as nw
from tests.utils import PANDAS_VERSION, Constructor


def test_cast_struct(request: pytest.FixtureRequest, constructor: Constructor) -> None:
    if any(backend in str(constructor) for backend in ("dask", "modin", "cudf")):
        request.applymarker(pytest.mark.xfail)

    if "pandas" in str(constructor) and PANDAS_VERSION < (2, 2):
        request.applymarker(pytest.mark.xfail)

    data = {
        "a": [
            {"movie ": "Cars", "rating": 4.5},
            {"movie ": "Toy Story", "rating": 4.9},
        ]
    }

    dtype = nw.Struct([nw.Field("movie ", nw.String()), nw.Field("rating", nw.Float64())])

    native_df = constructor(data)
    if "spark" in str(constructor):
        import pyspark.sql.functions as F
        import pyspark.sql.types as T

        # Re-define "a" to force a StructType, since PySpark infers the dict
        # input above as a MapType instead.
        native_df = native_df.withColumn(
            "a",
            F.struct(
                F.col("a.movie ").alias("movie ").cast(T.StringType()),
                F.col("a.rating").alias("rating").cast(T.DoubleType()),
            ),
        )

    result = nw.from_native(native_df).select(nw.col("a").cast(dtype)).lazy().collect()
    assert result.schema == {"a": dtype}

As you can see, when the constructor is PySpark, we need to re-define the column "a" to force a StructType instead of a MapType, which is something you faced yourself.

However, I still had an issue when calling the last .collect() as it uses df._collect_to_pyarrow() which does not seem to support StructType. A normal df.collect() would work, but it would not return a DataFrame object.

Have you seen the same thing when you run your test?

@FBruzzesi (Member, Author)

> Great work! I had done something very similar on my side!

Thanks @osoucy, and I am sorry to hear we duplicated work 🥲

> However, I still had an issue when calling the last .collect() as it uses df._collect_to_pyarrow() which does not seem to support StructType. A normal df.collect() would work, but it would not return a DataFrame object.
>
> Have you seen the same thing when you run your test?

Not really; locally I have no issues with your code either. If you fancy sharing your GitHub commit email, I can add you as a co-author.

@osoucy (Contributor) commented Feb 15, 2025

Here is my email: [email protected]

In that case, it must be an issue with my specific environment (Python vs PySpark vs PyArrow versions). I'm glad it's only me!

@FBruzzesi (Member, Author)

> Here is my email: [email protected]

The one used for commits should be something like: [email protected] (see how to find it)
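
(For example, git log -1 --format='%ae' prints the author email of your most recent commit, and git config user.email shows the one configured for new commits.)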

> In that case, it must be an issue with my specific environment (Python vs PySpark vs PyArrow versions). I'm glad it's only me!

We did some refactoring + added new features; let us know if you keep having problems with the env in the future 🤔

@osoucy (Contributor) commented Feb 16, 2025

Sorry, I read too quickly. Here it is: [email protected]

Glad to see you were able to incorporate my suggested changes for the unit tests.

@FBruzzesi (Member, Author)

> Sorry, I read too quickly. Here it is: [email protected]

No worries, I tried with the other email and I can see you as co-author for 5dc3a09, so it worked!

> Glad to see you were able to incorporate my suggested changes for the unit tests.

Thanks for reviewing and for providing a cleaner solution 👌

Comment on lines +126 to +132
if isinstance_or_issubclass(dtype, (dtypes.List, dtypes.Array)):
    return spark_types.ArrayType(
        elementType=narwhals_to_native_dtype(
            dtype.inner,  # type: ignore[union-attr]
            version=version,
            spark_types=spark_types,
        )
@dangotbanned (Member)

The # type: ignore here is an example of this issue (#1807 (comment))
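
For context, a minimal self-contained sketch (stand-in class names; the helper only mimics the call shown above) of the union-attr complaint that the ignore silences: a helper returning a plain bool does not narrow the union for mypy, so the .inner access still looks unsafe to it:

from typing import Union

class List_:  # stand-in for dtypes.List
    inner: int = 0

class Array_:  # stand-in for dtypes.Array
    inner: int = 0

class Int64_: ...  # stand-in for a dtype without .inner

def isinstance_or_issubclass(obj: object, cls: tuple[type, ...]) -> bool:
    # Returns a plain bool (not a typing.TypeGuard), so mypy cannot narrow
    # `obj` at the call site.
    return isinstance(obj, cls)

def inner_dtype(dtype: Union[List_, Array_, Int64_]) -> int:
    if isinstance_or_issubclass(dtype, (List_, Array_)):
        return dtype.inner  # type: ignore[union-attr]  # mypy still sees Int64_
    return -1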

Off-topic-ish, but should I spin that out into a new issue?

I think it might get lost in that PR

@FBruzzesi (Member, Author)

Thanks @dangotbanned. I'd say let's track it in a dedicated issue, as it's not even introduced in this specific PR.

@EdAbati (Collaborator) left a comment

This looks 👌👌👌

@FBruzzesi FBruzzesi merged commit 6662df5 into main Feb 16, 2025
27 of 28 checks passed
@FBruzzesi FBruzzesi deleted the feat/pyspark-struct-dtype branch February 16, 2025 15:02
Labels
enhancement New feature or request

Successfully merging this pull request may close these issues:
[Enh]: cast expr in SparkLike

5 participants