HDFS support for catalog

Description
If possible, I would like to be able to work with both a remote Hadoop cluster and S3 storage while using Spark in a Jupyter Notebook. This includes the possibility of (a short sketch follows the list below):
Writing in and reading from S3
Writing in and reading from HDFS
Writing in S3 and reading from HDFS
Writing in HDFS and reading from S3
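For illustration, a minimal sketch of what this looks like from a notebook once everything is wired up, assuming the s3a:// connector for S3; the bucket and paths are placeholders:

from pyspark.sql import SparkSession

# Reuse the SparkSession provided by the notebook (or create one).
spark = SparkSession.builder.getOrCreate()

# Read from S3, write to HDFS.
df = spark.read.parquet("s3a://my-bucket/input/")
df.write.mode("overwrite").parquet("hdfs:///user/<user>/data/")

# Read from HDFS, write back to S3.
df2 = spark.read.parquet("hdfs:///user/<user>/data/")
df2.write.mode("overwrite").parquet("s3a://my-bucket/output/")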
How we currently do it
The following steps explain how we achieved an HDFS connection together with S3 using Onyxia, Spark, and Jupyter.
We are using an external Hadoop cluster, which only requires a password to connect.
Note: keytabs are also widely used, but the details are not explicitly covered in this document. The keytab would be mounted into the Jupyter pod, allowing one to kinit with it from the UI.
Step 1 - Get configuration files from Hadoop cluster
We need the following configuration files from the Hadoop cluster, located on an edge node (see the sketch after this list):
core-site.xml located in /etc/hadoop/conf;
hdfs-site.xml located in /etc/hadoop/conf;
hive-site.xml located in /etc/hive/conf;
krb5.conf located in /etc/krb5.conf.
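A hedged sketch of how to retrieve them, assuming SSH access to the edge node; the hostname and user are placeholders:

# Copy the configuration files from the edge node to the local machine.
scp <user>@<edge-node>:/etc/hadoop/conf/core-site.xml .
scp <user>@<edge-node>:/etc/hadoop/conf/hdfs-site.xml .
scp <user>@<edge-node>:/etc/hive/conf/hive-site.xml .
scp <user>@<edge-node>:/etc/krb5.conf .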
Step 2 - S3 credentials
The files above do not contain the S3 credentials required for the Jupyter notebook to work with S3 storage. These credentials can be found in Onyxia under My Account / Connect to storage.
We append the S3 properties to the Hadoop core-site.xml, replacing the $S3 placeholders with Onyxia's S3 credentials:
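As a hedged sketch of the kind of properties involved, using the standard Hadoop S3A configuration keys (the exact keys and placeholder names are assumptions and may differ):

<!-- S3A endpoint and temporary credentials provided by Onyxia -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>$S3_ENDPOINT</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>$S3_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>$S3_SECRET_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>$S3_SESSION_TOKEN</value>
</property>
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>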
Step 3 - Upload files to cluster
You must have access to these files from within your cluster in order to create configMaps from them.
Step 4 - Modify the StatefulSet
To give Jupyter's pods access to the configuration, we need a Kubernetes configMap for each file and to mount them into the pods. First, we create these configMaps. Set a variable with your Onyxia account username (to access the correct namespace):
ONYXIA_USER=<user>
Given that, in this region, the namespace is formed as u-<username>, run the following command from one of the master nodes of the Kubernetes cluster:
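A hedged sketch of what these commands could look like, assuming kubectl is available on the master node; the configMap names other than config-coresite and config-hivesite are assumptions:

# Create one configMap per configuration file in the user's namespace.
kubectl -n u-$ONYXIA_USER create configmap config-coresite --from-file=core-site.xml
kubectl -n u-$ONYXIA_USER create configmap config-hivesite --from-file=hive-site.xml
kubectl -n u-$ONYXIA_USER create configmap config-hdfssite --from-file=hdfs-site.xml
kubectl -n u-$ONYXIA_USER create configmap config-krb5 --from-file=krb5.conf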
Next, edit the Jupyter StatefulSet (for example with kubectl edit, which opens the manifest in an editor such as nano); the volumeMounts and volumes fields need to be modified. Delete the existing config-coresite and config-hivesite configMap entries in volumes and add the following in the respective fields:
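A minimal sketch of what the edited sections could look like; the mount paths depend on the image and are assumptions here:

# Under the container spec:
volumeMounts:
  - name: config-coresite
    mountPath: /opt/hadoop/etc/hadoop/core-site.xml
    subPath: core-site.xml
  - name: config-hivesite
    mountPath: /opt/hadoop/etc/hadoop/hive-site.xml
    subPath: hive-site.xml
  - name: config-krb5
    mountPath: /etc/krb5.conf
    subPath: krb5.conf
# Under the pod spec:
volumes:
  - name: config-coresite
    configMap:
      name: config-coresite
  - name: config-hivesite
    configMap:
      name: config-hivesite
  - name: config-krb5
    configMap:
      name: config-krb5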
Step 5 - Authenticate with Kerberos
We authenticate against the Hadoop cluster's Kerberos with our account credentials by running kinit -E in a terminal inside the Jupyter UI. The -E option allows selecting a specific user to authenticate as.
kinit -E <user>
If the cluster requires a keytab, pass the keytab as you usually would.
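For reference, a keytab-based login typically looks like this; the keytab path and principal are placeholders:

kinit -kt /path/to/<user>.keytab <user>@<REALM>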
Once you enter your password, you can run Spark jobs against HDFS from the Jupyter UI.
Step 6 - Certificates (if using a custom certificate authority)
For S3 compatibility, we need to make sure that the certificates are accepted by the pod. Using nano, create a file named ca.pem and paste in the contents of the certificate authority (or the linked chain).
In a root shell (sudo su), run:
cat ca.pem >> /etc/ssl/certs/ca-certificates.crt
Then switch back to the notebook user (sudo su jovyan) and run the corresponding command for that user.
Once these steps are done, it should be good to go.
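Note that Spark's S3A client runs in the JVM, which uses its own truststore rather than the system bundle; if the certificate is still not accepted, an additional import along these lines may be needed (the alias and keystore path are assumptions):

# Import the custom CA into the JVM truststore (default password is "changeit").
keytool -importcert -noprompt -alias custom-ca -file ca.pem \
  -keystore "$JAVA_HOME/lib/security/cacerts" -storepass changeit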
Possible implementation
Allow a user to pass the mentioned xxx-site.xml files, as well as the krb5.conf, at the initialization of the pod, in My Services.
When core-site.xml is passed, it should be merged with the one currently present in the Onyxia configuration (which contains all the S3 credentials), overwriting conflicting fields where applicable.
Allow a user to mount a certificate authority in order to connect to the minio.demo.insee.io endpoint.
Change the StatefulSet of the targeted pod linked to the service to include the certificates and the files (adding configMaps, as before).
Then, the user would only have to kinit and start playing around with the notebook.