
[bitnami/spark] connect spark session on Jupyter #77486

Open
gesangwibawono1 opened this issue Feb 14, 2025 · 4 comments
Labels: spark, tech-issues (The user has a technical issue about an application), triage (Triage is needed)

Comments


gesangwibawono1 commented Feb 14, 2025

Name and Version

bitnami/spark:3.5

What architecture are you using?

None

What steps will reproduce the bug?

The Spark session can be created successfully from the Jupyter container.

from pyspark.sql import SparkSession

# Create Spark session with proper configuration
spark = SparkSession.builder \
    .appName("JupyterTest") \
    .master("spark://spark-master:7077") \
    .config("spark.driver.host", "jupyter") \
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000") \
    .getOrCreate()

But an error occurs when creating a Spark DataFrame.

# Create test DataFrame
data = [("John", 30), ("Alice", 25), ("Bob", 35)]
df = spark.createDataFrame(data, ["name", "age"])
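A common cause of this pattern in Spark standalone mode (session creation succeeds, but the first DataFrame operation fails) is that the workers cannot connect back to the driver: connecting to the master only needs Jupyter → master traffic, while running tasks also needs worker → driver traffic. A hedged sketch of extra driver-side settings that are often needed when the driver runs in a separate container; the port numbers 40000/40001 are arbitrary examples and are not from this issue, and they must be reachable/exposed on the Jupyter container:

```python
from pyspark.sql import SparkSession

# Sketch (not verified against this setup): pin the driver's ports so they can
# be exposed on the Jupyter container, and bind to all interfaces inside it.
spark = (
    SparkSession.builder
    .appName("JupyterTest")
    .master("spark://spark-master:7077")
    .config("spark.driver.host", "jupyter")         # hostname workers use to reach the driver
    .config("spark.driver.bindAddress", "0.0.0.0")  # listen on all interfaces in the container
    .config("spark.driver.port", "40000")           # fixed driver RPC port (example value)
    .config("spark.blockManager.port", "40001")     # fixed block-manager port (example value)
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")
    .getOrCreate()
)
```

With the ports pinned, the corresponding container ports can be published in Docker Compose or the Kubernetes pod spec so the workers can open connections back to the driver.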

What is the expected behavior?

It can perform Spark DataFrame and SQL operations, as well as read from and write to HDFS.

What do you see instead?

When using spark-shell inside the Spark master container, Spark DataFrame operations and reads/writes to the Hadoop HDFS cluster work correctly. However, the same operations fail when using a remote client via the Jupyter container.

Additional information

No response

@gesangwibawono1 gesangwibawono1 added the tech-issues The user has a technical issue about an application label Feb 14, 2025
@github-actions github-actions bot added the triage Triage is needed label Feb 14, 2025
@javsalgar javsalgar changed the title connect spark session on Jupyter [bitnami/spark] connect spark session on Jupyter Feb 17, 2025
@javsalgar (Contributor)

Hi!

Are you running the Jupyter notebook inside the Kubernetes cluster? Just to make sure the Spark master is accessible.

@gesangwibawono1 (Author)

Dear Mr. Javier,

Thank you for your response.

I have attempted to set up the environment using both Kubernetes and Docker Compose. In Kubernetes, both the Jupyter pod and the Spark master were within the same namespace, ensuring they could communicate. Similarly, in Docker Compose, they shared the same network. However, the setup did not function as expected.

On Jupyter, I was able to establish a connection with the Spark session, but I encountered issues when performing DataFrame operations. On the other hand, when using the Spark session from spark-shell, Spark DataFrame operations, including reading and writing to HDFS, worked as expected.

For my Hadoop setup, I deployed the cluster in Kubernetes using the Helm chart pfisterer/apache-hadoop-helm. In Docker Compose, I used the BDE2020 Hadoop image. Additionally, I used the Jupyter image jupyter/all-spark-notebook:x86_64-spark-3.5.0. Interestingly, when running a Spark session with spark.master=local in Jupyter (instead of connecting to the Bitnami Spark master cluster), I was able to successfully perform read/write operations in Hadoop.
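Since everything works with `spark.master=local` but fails against the standalone master, one quick way to narrow things down is to confirm basic TCP reachability between the containers (Jupyter → spark-master:7077, Jupyter → namenode:9000, and, if the driver ports are pinned, worker → driver). A minimal stdlib sketch; the hostnames in the usage comments are the ones from this thread:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (run inside the Jupyter container):
# can_reach("spark-master", 7077)   # Spark master RPC port
# can_reach("namenode", 9000)       # HDFS namenode port
```

If these checks pass in both directions, the problem is more likely a driver/worker mismatch (e.g. Python or Spark version differences between the Jupyter image and the Bitnami workers) than pure networking.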

I sincerely appreciate your effort and assistance.

@javsalgar (Contributor)

Hi,

So, if I understood correctly, the Bitnami Spark container supports DataFrame operations when you run them inside the container, but for some reason they fail when driven from the Jupyter Notebook. This seems to go beyond the Bitnami packaging of Spark; something may be incorrect in the Jupyter integration with Spark, since only remote DataFrame operations are affected. I will leave the ticket open in case someone from the community wants to share their experience, but my suggestion would be to check with the upstream Jupyter and Spark communities to see if there is something incorrect in the Jupyter configuration.

@gesangwibawono1 (Author)

Okay, thank you for the explanation and suggestion.
