
.as[Foo] doesn't catch DataFrame Dataset type differences #282

Open

dnmfarrell opened this issue Sep 9, 2022 · 3 comments

dnmfarrell commented Sep 9, 2022

Spark 3.1
sparksql-scalapb_2.12:0.11.0

When used as a Dataset, ScalaPB-generated case classes don't catch type mismatches in the logical plan the way ordinary Scala case classes do.

E.g. in spark shell (regular case class):

scala> case class Foo(i:Int)
defined class Foo
scala> val badData = Seq[(Long)]((5L))
badData: Seq[Long] = List(5)
scala> val bdf = badData.toDF("i")
bdf: org.apache.spark.sql.DataFrame = [i: bigint]
scala> bdf.as[Foo]
org.apache.spark.sql.AnalysisException: Cannot up cast `i` from bigint to int.
...

ScalaPB case-class-based encoders don't blow up on as; the mismatch only surfaces when the query actually runs (e.g. bdf.as[Foo].head).
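For contrast, a minimal sketch of the ScalaPB side, assuming Foo is a hypothetical ScalaPB-generated message with a single int32 field i, and that its encoder comes from scalapb.spark.Implicits:

import org.apache.spark.sql.SparkSession
import scalapb.spark.Implicits._               // ScalaPB-provided encoders

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Build a DataFrame whose i column is bigint, while Foo.i is int32
val bdf = spark.range(1).selectExpr("(id + 5) as i")

val ds = bdf.as[Foo]   // no AnalysisException at analysis time...
ds.head                // ...the mismatch only blows up here, at execution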

dnmfarrell (Author) commented Sep 9, 2022

(I would post my ScalaPB example, but I'm running into the Spark vs. ScalaPB protobuf version conflict, which I usually shade away; the shading isn't working right now.)

Ah, this is caused by spark-shell automatically importing the Spark implicits.
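To see why the auto-import matters: a ScalaPB message is also a case class (a Product), so both spark.implicits._ (auto-imported in spark-shell) and scalapb.spark.Implicits._ can supply an Encoder[Foo], and which one .as[Foo] resolves determines whether the upcast check runs at analysis time. A hedged sketch outside the shell, where the imports can be controlled (Foo is still the hypothetical generated message; PlainFoo is an ordinary case class):

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val bdf = spark.range(1).selectExpr("(id + 5) as i")   // i: bigint

// Ordinary case class + Spark's reflective product encoder: fails fast at
// analysis time with "Cannot up cast `i` from bigint to int".
case class PlainFoo(i: Int)
bdf.as[PlainFoo](Encoders.product[PlainFoo])

// ScalaPB message + ScalaPB encoder, summoned with no competing implicits
// in scope: analysis passes, the failure is deferred to execution.
val protoEnc: Encoder[Foo] = {
  import scalapb.spark.Implicits._
  implicitly[Encoder[Foo]]
}
bdf.as[Foo](protoEnc).head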

dnmfarrell (Author) commented Sep 12, 2022

I forked the test repo and can reproduce the issue:

  • 0.11.0 displays the behavior described in this ticket: val ds: Dataset[Small] = df.as[Small] does not throw an exception, despite the DataFrame being incompatible with the Dataset type.
  • 1.0.1 does throw an exception, but it's a java.lang.NoSuchMethodError - stack trace

dnmfarrell (Author) commented

Re: java.lang.NoSuchMethodError - that was my mistake: I compiled for Spark 3.2 but ran on Spark 3.1 🤦
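For reference, a hedged build.sbt sketch of keeping the compile-time Spark version aligned with the runtime cluster (the versions and artifact coordinates below are illustrative assumptions, not taken from this repo):

// build.sbt — Spark artifacts are "provided" so the cluster's jars are used
// at run time, and the declared version matches the Spark 3.1 runtime.
val sparkVersion = "3.1.2"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
  "com.thesamet.scalapb" %% "sparksql-scalapb" % "1.0.1" // coordinates may differ per Spark version
)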

It seems like casts from one shape of data to another are caught, e.g. (x: Int) -> (x: Int, y: Int) emits org.apache.spark.sql.AnalysisException: cannot resolve 'y' given input columns.
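A quick spark-shell illustration of that shape check, using a plain case class (this part is standard Spark behavior):

scala> case class Wide(x: Int, y: Int)
defined class Wide

scala> Seq(1).toDF("x").as[Wide]
org.apache.spark.sql.AnalysisException: cannot resolve 'y' given input columns: [x];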
