
[BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark #682

Open
1 task done
james-miles-ccy opened this issue Nov 16, 2022 · 20 comments

Comments

@james-miles-ccy

james-miles-ccy commented Nov 16, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When running the v2 Excel PySpark code below on Databricks Runtime 11.3 LTS:

df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)

I receive the following error upon attempting to display or use the resulting dataframe:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 101) (10.94.235.131 executor 1): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;
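
For readers unfamiliar with the shape of this error: the text after `AbstractMethodError:` is a JVM method reference made of the owning class, the method name, and a type descriptor. The sketch below is only an illustration. The function name `parse_abstract_method_error` is made up here and is not part of Spark or spark-excel. It decodes the message quoted above into readable parts.

```python
import re

# The error text from the report above, split across lines for readability.
ERROR = (
    "java.lang.AbstractMethodError: "
    "org.apache.spark.sql.execution.datasources.v2."
    "FilePartitionReaderFactory.options()"
    "Lorg/apache/spark/sql/catalyst/FileSourceOptions;"
)

# owner class, method name, parameter descriptor, return descriptor
_PATTERN = re.compile(
    r"AbstractMethodError: (?P<owner>[\w.$]+)\.(?P<method>\w+)"
    r"\((?P<params>[^)]*)\)(?P<ret>.+)$"
)


def parse_abstract_method_error(message):
    """Hypothetical helper: split an AbstractMethodError message into parts."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    ret = match.group("ret")
    # Object types are written as L<slash/separated/name>; in JVM descriptors.
    # Other descriptors (primitives, arrays) are left untouched.
    if ret.startswith("L") and ret.endswith(";"):
        ret = ret[1:-1].replace("/", ".")
    return {
        "owner": match.group("owner"),
        "method": match.group("method"),
        "params": match.group("params"),
        "returns": ret,
    }


if __name__ == "__main__":
    print(parse_abstract_method_error(ERROR))
```

Read this way, the message says that a method `options()` returning `FileSourceOptions` was called on `FilePartitionReaderFactory`, but the class loaded at runtime provides no implementation of it. That typically points to a library compiled against a different version of a class than the one present at runtime.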

Expected Behavior

The resulting Dataframe should display correctly.

Steps To Reproduce

Set the folderpath variable to a location containing Excel files, and run the Python code below on the latest Databricks runtime:

df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)

Environment

- Spark version: 3.3.0
- Spark-Excel version: 0.18.5
- OS: Windows 10
- Cluster environment

Anything else?

No response

@nightscape
Owner

Hey @james-miles-ccy, the Spark-Excel version should consist of the Spark version and the version of Spark-Excel itself.
You were only specifying the version of Spark-Excel. Can you check you were using 3.3.1_0.18.5?
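
As a small illustration of the naming scheme described above, the sketch below splits a combined version string such as `3.3.1_0.18.5` into its Spark part and its Spark-Excel part. The helper names are hypothetical. The JAR file name pattern follows the names quoted elsewhere in this thread (for example `spark-excel_2.12-3.3.1_0.18.7`), and the Scala version `2.12` is an assumption taken from those names.

```python
def split_spark_excel_version(version):
    """Hypothetical helper: split '<spark>_<spark-excel>' into its two parts."""
    spark_version, separator, library_version = version.partition("_")
    if not separator or not spark_version or not library_version:
        raise ValueError(
            f"{version!r} does not look like '<spark version>_<spark-excel version>'"
        )
    return spark_version, library_version


def jar_file_name(version, scala_version="2.12"):
    """Build a JAR file name in the pattern seen in this thread (assumed)."""
    split_spark_excel_version(version)  # validate the combined version first
    return f"spark-excel_{scala_version}-{version}"


if __name__ == "__main__":
    print(split_spark_excel_version("3.3.1_0.18.5"))
    print(jar_file_name("3.3.1_0.18.5"))
```

A bare `0.18.5`, as given in the Environment section, fails the check because it carries no Spark version prefix.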

@james-miles-ccy
Author

Yes I am using 3.3.1_0.18.5

@nightscape
Owner

Can you check the same thing with a local or other non-Databricks Spark 3.3.0?
We have already had one case where the Spark version in the Databricks Runtime differed slightly from the officially published one and was not fully API-compatible with it.

@james-miles-ccy
Author

james-miles-ccy commented Nov 22, 2022

I have installed PySpark and spark-excel locally. The v1 format works fine and generates dataframes on Spark 3.3.1, but using a path that matches multiple files (i.e. the v2 format) causes cells to hang and never complete. I am using the same spark-excel version as stated above.

@nightscape
Owner

Is it the same error/issue as on Databricks?

@james-miles-ccy
Author

james-miles-ccy commented Nov 24, 2022

No, in Databricks you receive the error listed in my original comment, whereas running locally results in execution that never completes.

FYI, this is only an issue for v2; v1 works both in Databricks and locally.
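
For anyone who needs a stopgap, the v1 reader mentioned here can be sketched as follows. This is a minimal sketch under assumptions: it uses the v1 format name `com.crealytics.spark.excel` from the spark-excel README, the helper name `read_excel_v1` is hypothetical, and it is given a single workbook path rather than a multi-file pattern.

```python
# Format name of the v1 data source, as documented in the spark-excel README.
V1_FORMAT = "com.crealytics.spark.excel"


def read_excel_v1(spark, path):
    """Hypothetical helper: read one workbook through the v1 data source.

    `spark` is expected to be an active SparkSession and `path` a single
    Excel file (an assumption; multi-file patterns are the v2 use case).
    """
    return (
        spark.read.format(V1_FORMAT)
        .option("header", True)
        .option("inferSchema", True)
        .load(path)
    )
```

As noted later in the thread, `input_file_name()` is only supported in v2, so this fallback does not help where the source file name is needed.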

@snehawankhade

snehawankhade commented Nov 30, 2022

I am facing the same issue with v2 (Spark version: 3.3.0, Spark-Excel: 3.3.1_0.18.5). v1 works, but not completely: input_file_name() returns an empty string.

@nightscape
Owner

input_file_name is only supported in v2. Unfortunately, I didn't have time to look into the original issue.

@dazfuller

Hey @nightscape. This got mentioned in our implementation as well

I think I've traced the issue to Databricks using a patched Spark runtime in the 11.x runtimes (and the 12.0 beta runtime). It includes a change from the master branch of Spark that isn't in the 3.3 support branch.

I'm looking into this further at the moment and I'll shout if I find anything

@dazfuller

Just to add an update: I've been talking with Databricks and there's a fix coming which will resolve this in the 11.x and 12.x runtimes. It should hopefully be coming in January.

@nightscape
Owner

@dazfuller thanks a lot for pushing this forward and keeping us updated here!!
We had a similar issue before, so I guess Databricks breaking compatibility with the open-source Spark version is something we have to keep an eye on...

@james-miles-ccy
Author

Hi All, FYI looks like this has all been resolved by Databricks on 12.1 runtime!

@minnieshi

This also happens with:

  • spark-excel_2.12-3.3.1_0.18.7 + Spark 3.5.0 (Azure Databricks 15.4 LTS)
  • spark-excel_2.12-3.3.3_0.20.3 + Spark 3.4.1 (Azure Databricks 13.3 LTS)
  • spark-excel_2.12-3.3.3_0.20.3 + Spark 3.5.0 (Azure Databricks 15.4 LTS)

@nightscape
Owner

@minnieshi do you get the exact same error as in the first post?

java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;

@minnieshi

minnieshi commented Dec 18, 2024 via email

@mmicu

mmicu commented Jan 27, 2025

@minnieshi, I think I am having a similar issue. Do you know which version I could use to make it run for Databricks 14.3 and Spark 3.5.0?

@minnieshi

minnieshi commented Jan 27, 2025 via email

@mmicu

mmicu commented Jan 27, 2025

@minnieshi, thanks for the quick feedback. Do you know what exactly causes the issue?

I ask because I have checked the JAR which contains that file and the corresponding class, and everything looks fine and properly defined.

@nightscape
Owner

@mmicu can you access the JAR files on Databricks?
My best guess is that Databricks (again) made some non-binary-compatible changes in their version of Spark.
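
One way to gather this kind of information from a running cluster is JVM reflection through PySpark's py4j gateway. The following is a rough sketch under several assumptions: the helper name `declared_method_names` is hypothetical, `spark._jvm` is an internal PySpark attribute that may be unavailable on some cluster access modes, and `Class.forName` is assumed to resolve the Spark class from the driver's classpath.

```python
def declared_method_names(jvm, class_name):
    """Hypothetical helper: sorted, de-duplicated names of the methods a JVM
    class declares, looked up with java.lang.Class reflection via py4j."""
    cls = jvm.java.lang.Class.forName(class_name)
    return sorted({method.getName() for method in cls.getDeclaredMethods()})


# Example use in a notebook where `spark` is an active SparkSession (assumption):
#
# names = declared_method_names(
#     spark._jvm,
#     "org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory",
# )
# print(spark.version, "options" in names)
```

Running the same check on the cluster and on an open-source Spark installation reporting the same version would show whether the two builds of the class differ in the method named in the error.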

nightscape reopened this Jan 28, 2025

@mmicu

mmicu commented Jan 28, 2025

@nightscape, yes, I should have access. I could try to get some information from the cluster and its JAR if you need.
