[BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark #682
Comments
Hey @james-miles-ccy, the Spark-Excel version should consist of the Spark version and the version of Spark-Excel itself. |
Yes I am using 3.3.1_0.18.5 |
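As a small illustration of the naming scheme described above, the sketch below splits a version string such as 3.3.1_0.18.5 into its Spark part and its spark-excel part, then compares the Spark part with a runtime's Spark version. The helper names `split_spark_excel_version` and `same_minor_line` are hypothetical and not part of spark-excel; the version strings are the ones quoted in this thread.

```python
from __future__ import annotations


def split_spark_excel_version(artifact_version: str) -> tuple[str, str]:
    """Split '<spark version>_<spark-excel version>' into its two parts."""
    spark_part, sep, excel_part = artifact_version.partition("_")
    if not sep or not spark_part or not excel_part:
        raise ValueError(
            f"expected '<spark>_<spark-excel>', got {artifact_version!r}"
        )
    return spark_part, excel_part


def same_minor_line(artifact_version: str, runtime_spark_version: str) -> bool:
    """True when the artifact targets the runtime's major.minor Spark line."""
    spark_part, _ = split_spark_excel_version(artifact_version)
    return spark_part.split(".")[:2] == runtime_spark_version.split(".")[:2]


# Version strings quoted in this thread.
print(split_spark_excel_version("3.3.1_0.18.5"))
print(same_minor_line("3.3.1_0.18.5", "3.3.0"))
print(same_minor_line("3.3.1_0.18.7", "3.5.0"))
```

A matching major.minor line is only a first sanity check. A vendor-patched runtime can still differ from the open-source Spark release the artifact was built against.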
Can you check the same thing with a local or other non-Databricks Spark 3.3.0? |
I have installed PySpark and spark-excel locally. The V1 format works fine and generates dataframes on Spark 3.3.1, but using a path for multiple files (i.e. the V2 format) causes cells to hang without completing. I am using the same spark-excel version as stated above. |
Is it the same error/issue as on DataBricks? |
No. In Databricks you receive the error listed in my original comment, whereas locally the execution never completes. FYI, this is only an issue for V2; V1 works in both Databricks and local. |
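For readers comparing the two code paths, here is a minimal sketch of how the same read could be switched between V1 and V2. It assumes the V1 data source is registered as "com.crealytics.spark.excel" (my understanding of the spark-excel README, not something stated in this thread) and that V2 uses the short name "excel" as in the original snippet. `read_excel` is a hypothetical helper and `spark` is an existing SparkSession.

```python
# Assumption: V1 data source name as documented by spark-excel.
V1_FORMAT = "com.crealytics.spark.excel"
# V2 short name, as used in the snippet from the original report.
V2_FORMAT = "excel"


def read_excel(spark, path, use_v2=True):
    """Build an Excel DataFrame read with either the V1 or V2 data source."""
    fmt = V2_FORMAT if use_v2 else V1_FORMAT
    return (
        spark.read.format(fmt)
        .option("header", True)
        .option("inferSchema", True)
        .load(path)
    )
```

Running the same path through both variants makes it easier to confirm that only the V2 code path triggers the failure.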
I am facing the same issue with V2 (Spark version: 3.3.0, spark-excel: 3.3.1_0.18.5). V1 works, but not completely: input_file_name() returns an empty string. |
|
Hey @nightscape. This came up in our implementation as well. I think I've traced the issue down to Databricks using a patched Spark runtime in the 11.x runtimes (and the 12.0 beta runtime), which includes a change from the master branch of Spark that isn't in the 3.3 support branch. I'm looking into this further at the moment and I'll shout if I find anything. |
Just to add an update: I've been talking with Databricks and there's a fix coming which will resolve this in the 11.x and 12.x runtimes. It should hopefully arrive in January. |
@dazfuller thanks a lot for pushing this forward and keeping us updated here!! |
Hi All, FYI looks like this has all been resolved by Databricks on 12.1 runtime! |
This also happens with spark-excel_2.12-3.3.1_0.18.7 on Spark 3.5.0 (Azure Databricks 15.4 LTS). |
@minnieshi do you get the exact same error as in the first post?
|
Yes. The same error. @nightscape
So I used a combination of lower versions instead, and I am posting here in the hope that higher versions can be supported.
I have now copied the error below:
`AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;`
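The tail of that message is a JVM method descriptor. The sketch below decodes it into a readable form; `decode_missing_method` is a hypothetical helper written for this thread, not part of Spark or spark-excel. In general, a `java.lang.AbstractMethodError` means that an abstract method was called on a class that provides no implementation of it, which usually points to a library compiled against a different version of a base class than the one present at runtime.

```python
# JVM descriptor codes for primitive types.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float", "I": "int",
    "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, pos):
    """Read one type from a JVM descriptor; return (type name, next position)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":  # object type: L<slash/separated/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def decode_missing_method(message):
    """Decode 'pkg.Class.method(<args>)<return>' from an AbstractMethodError."""
    qualified, _, rest = message.partition("(")
    owner, _, method = qualified.rpartition(".")
    args_desc, _, ret_desc = rest.partition(")")
    args, pos = [], 0
    while pos < len(args_desc):
        arg, pos = _read_type(args_desc, pos)
        args.append(arg)
    returns, _ = _read_type(ret_desc, 0)
    return {"owner": owner, "method": method, "args": args, "returns": returns}


info = decode_missing_method(
    "org.apache.spark.sql.execution.datasources.v2."
    "FilePartitionReaderFactory.options()"
    "Lorg/apache/spark/sql/catalyst/FileSourceOptions;"
)
print(info)
```

Read this way, the runtime expects a zero-argument `options` method returning `FileSourceOptions` on `FilePartitionReaderFactory`. That is consistent with the earlier comment that the runtime carries a change the library was not built against.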
|
@minnieshi, I think I am having a similar issue (#926). Do you know which version I could use to make it run on Databricks 14.3 with Spark 3.5.0? |
I tried every combination in the version matrix and could not get it to run on Spark 3.5.0.
Kind regards
Min
|
@minnieshi, thanks for the quick feedback. Do you know what exactly causes the issue? I have checked the JAR which contains that file and the corresponding class, and everything looks fine and properly defined. |
@mmicu can you access the JAR files on Databricks? |
@nightscape, yes, I should have access. I could try to get some information from the cluster and its JAR if you need. |
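One way to gather that information is to check which JARs actually contain a given class, since a JAR is a ZIP archive. The sketch below uses only Python's standard `zipfile` module. The helper names and the JAR path in the usage comment are hypothetical, and the right directory to inspect on a Databricks cluster is not given in this thread.

```python
import zipfile


def class_entry(class_name):
    """Map a fully qualified class name to its path inside a JAR."""
    return class_name.replace(".", "/") + ".class"


def jar_has_class(jar_path, class_name):
    """Return True if the JAR at jar_path contains the given class file."""
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry(class_name) in jar.namelist()


def classes_under(jar_path, package):
    """List the class files in the JAR that live under the given package."""
    prefix = package.replace(".", "/") + "/"
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name for name in jar.namelist()
            if name.startswith(prefix) and name.endswith(".class")
        )


# Usage with a hypothetical path:
# jar_has_class(
#     "/path/to/some-spark-sql.jar",
#     "org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory",
# )
```

Finding the class only shows that it is on the classpath. It does not show whether its method signatures match the version the library was compiled against, which is what an `AbstractMethodError` is about.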
Is there an existing issue for this?
Current Behavior
When running the V2 Excel PySpark code below on Databricks 11.3 LTS Runtime:
# Read Excel files through the V2 data source ("excel" format)
df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)
I receive the following error upon attempting to display or use the resulting dataframe:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 101) (10.94.235.131 executor 1): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;
Expected Behavior
The resulting Dataframe should display correctly.
Steps To Reproduce
Set the folderpath variable to a location containing Excel files, and run the Python code below on the latest Databricks runtime:
# Read Excel files through the V2 data source ("excel" format)
df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)
Environment
Anything else?
No response