[bitnami/spark] connect spark session on Jupyter #77486
Comments
Hi! Are you running the Jupyter notebook inside the Kubernetes cluster? Just to ensure that the Spark master is accessible.
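One way to check the accessibility question above from inside the Jupyter container is a plain TCP probe. This is only a sketch: the host name `spark-master` is a hypothetical service name, and 7077 is the default port of a Spark standalone master, which may differ in a given deployment. A successful probe shows only that the port is reachable, not that jobs will run.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (hypothetical service name, default standalone master port):
# can_reach("spark-master", 7077)
```

Running the same probe in the opposite direction, from a Spark worker toward the Jupyter container, can also be informative, since in client mode the executors have to open connections back to the driver.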
Dear Mr. Javier, Thank you for your response. I have attempted to set up the environment using both Kubernetes and Docker Compose. In Kubernetes, both the Jupyter pod and the Spark master were within the same namespace, ensuring they could communicate. Similarly, in Docker Compose, they shared the same network. However, the setup did not function as expected. On Jupyter, I was able to establish a connection with the Spark session, but I encountered issues when performing DataFrame operations. On the other hand, when using a Spark session from spark-shell, Spark DataFrame operations, including reading from and writing to HDFS, worked as expected. For my Hadoop setup, I deployed the cluster in Kubernetes using the Helm chart pfisterer/apache-hadoop-helm. In Docker Compose, I used the BDE2020 Hadoop image. Additionally, I used the Jupyter image jupyter/all-spark-notebook:x86_64-spark-3.5.0. Interestingly, when running a Spark session with spark.master=local in Jupyter (instead of connecting to the Bitnami Spark master cluster), I was able to successfully perform read/write operations in Hadoop. I sincerely appreciate your effort and assistance.
Hi, So, if I understood correctly, the Bitnami Spark container supports DataFrame operations when you execute them inside the container, but for some reason they fail when using Jupyter Notebook. It seems to me that this goes beyond the Bitnami packaging of Spark and that something may be incorrect in the Jupyter integration with Spark, affecting only DataFrame operations. I will leave the ticket open in case someone from the community wants to share their experience, but my suggestion would be to check with the upstream Jupyter and Spark communities to see if there is something incorrect in the Jupyter configuration.
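As a starting point for checking the Jupyter-side configuration, the sketch below collects the driver networking settings that matter when a notebook runs as a client-mode driver against a standalone master. Nothing in this thread confirms that these settings are the cause; it is an assumption based on the symptom that the session connects but DataFrame operations fail. The property names are standard Spark configuration keys, while the helper names, host names, and port numbers (7078, 7079) are hypothetical.

```python
def client_mode_conf(master_url: str, driver_host: str,
                     driver_port: int = 7078,
                     block_manager_port: int = 7079) -> dict:
    """Build Spark settings so executors can connect back to the notebook driver."""
    return {
        "spark.master": master_url,
        # Address that executors use to reach the driver (the Jupyter container).
        "spark.driver.host": driver_host,
        # Listen on all interfaces inside the container.
        "spark.driver.bindAddress": "0.0.0.0",
        # Fixed ports (hypothetical values) so they can be exposed explicitly.
        "spark.driver.port": str(driver_port),
        "spark.blockManager.port": str(block_manager_port),
    }


def build_session(conf: dict, app_name: str = "jupyter-client"):
    """Create a SparkSession from the settings; requires pyspark at call time."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example (hypothetical host names):
# spark = build_session(client_mode_conf("spark://spark-master:7077", "jupyter"))
```

Fixing the two ports rather than letting Spark pick random ones makes it possible to expose them in a Kubernetes Service or a Docker Compose file, so that workers can reach the driver on known addresses.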
Okay, thank you for the explanation and suggestion.
Name and Version
bitnami/spark:3.5
What architecture are you using?
None
What steps will reproduce the bug?
The Jupyter container can successfully connect to the Spark session:
from pyspark.sql import SparkSession
But it encounters an error when creating a Spark DataFrame.
What is the expected behavior?
It can perform Spark DataFrame and SQL operations, as well as read from and write to HDFS.
What do you see instead?
If I use spark-shell inside the Spark master container, it can perform some Spark DataFrame operations and read from and write to the Hadoop HDFS container cluster. However, it does not work when using a remote client via the Jupyter container.
Additional information
No response