Zeppelin on Windows

Install Spark

  1. Download the binaries from: https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
  2. Extract the files into the /d/data-analytics directory: $ tar -xvf /c/temp/spark-2.4.3-bin-hadoop2.7.tgz -C /d/data-analytics

Setup environment variables

Add the following environment variables:

Variable          Value
SPARK_HOME        D:\data-analytics\spark-2.4.3-bin-hadoop2.7
HADOOP_CONF_DIR   %HADOOP_HOME%\etc\hadoop
LD_LIBRARY_PATH   %HADOOP_HOME%\lib\native

Add %SPARK_HOME%\bin to the Path environment variable.
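
To sanity-check that the variables are visible to new processes, here is a minimal sketch you can run in spark-shell (plain Scala, using only the variable names defined above):

// Print the environment variables Spark and Hadoop rely on;
// "<not set>" means the variable did not reach this process.
Seq("SPARK_HOME", "HADOOP_CONF_DIR", "LD_LIBRARY_PATH").foreach { name =>
  println(s"$name = ${sys.env.getOrElse(name, "<not set>")}")
}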

Hadoop bundles an older Netty jar than Spark; replace Hadoop's copy with the version that ships with Spark to avoid class conflicts:

PS C:\> mv $env:HADOOP_HOME\share\hadoop\hdfs\lib\netty-all-4.0.52.Final.jar $env:HADOOP_HOME\share\hadoop\hdfs\lib\netty-all-4.0.52.Final.jar.old
PS C:\> cp $env:SPARK_HOME\jars\netty-all-4.1.17.Final.jar $env:HADOOP_HOME\share\hadoop\hdfs\lib\

Configure Spark

$ cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
$ nano $SPARK_HOME/conf/spark-defaults.conf

Use the following settings:

spark.master						yarn
#spark.driver.memory					512m
#spark.yarn.am.memory				512m
#spark.executor.memory				512m
#spark.eventLog.enabled				true
#spark.eventLog.dir					hdfs://pshp111:9000/spark-logs
#spark.history.provider				org.apache.spark.deploy.history.FsHistoryProvider
#spark.history.fs.logDirectory		hdfs://pshp111:9000/spark-logs
#spark.history.fs.update.interval	10s
#spark.history.ui.port				18080
spark.yarn.archive					hdfs://pshp111:9000/spark/spark-libs-2.4.3.jar
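
Once spark-shell is running, the effective configuration can be read back to confirm the file was picked up; a minimal sketch using the keys set above:

// Confirm spark-defaults.conf was applied to the running context.
println(sc.getConf.get("spark.master"))             // expect: yarn
println(sc.getConf.getOption("spark.yarn.archive")) // the spark-libs archive in HDFS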

Create the log directory in HDFS:

PS C:\> hdfs dfs -mkdir /spark-logs

Configure YARN

On all the nodes, add the following properties to the Hadoop yarn-site.xml file. The two check properties disable YARN's physical and virtual memory enforcement so containers are not killed for exceeding those limits.

<property>
	<name>yarn.scheduler.maximum-allocation-mb</name>
	<value>14336</value>
</property>
<property>
	<name>yarn.nodemanager.pmem-check-enabled</name>
	<value>false</value>
</property>
<property>
	<name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
</property>
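
Assuming the YARN configuration is on the classpath (it is once HADOOP_CONF_DIR is set and the master is yarn), the new values can be read back from spark-shell; a minimal sketch:

// Read the yarn-site.xml properties back through the Hadoop configuration.
println(sc.hadoopConfiguration.get("yarn.scheduler.maximum-allocation-mb")) // expect: 14336
println(sc.hadoopConfiguration.get("yarn.nodemanager.vmem-check-enabled"))  // expect: false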

Restart the YARN and HDFS daemons on all the nodes to apply the new settings:

PS C:\> stop-yarn
stopping yarn daemons
SUCCESS: Sent termination signal to the process with PID 10580.
SUCCESS: Sent termination signal to the process with PID 1020.

INFO: No tasks running with the specified criteria.
PS C:\> stop-dfs
SUCCESS: Sent termination signal to the process with PID 11992.
SUCCESS: Sent termination signal to the process with PID 10524.
PS C:\> start-yarn
starting yarn daemons
PS C:\> start-dfs

Configure Spark JAR Location

Package all the Spark jars into a single archive and upload it to HDFS; this is the file referenced by spark.yarn.archive in spark-defaults.conf:

$ cd /c/temp/
$ jar cv0f spark-libs-2.4.3.jar -C $SPARK_HOME/jars/ .
PS C:\> hdfs dfs -mkdir /spark
PS C:\> hdfs dfs -put c:\temp\spark-libs-2.4.3.jar /spark
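
To confirm the archive landed where spark.yarn.archive points, a minimal sketch using the Hadoop FileSystem API from spark-shell:

import org.apache.hadoop.fs.{FileSystem, Path}

// Check that the uploaded archive exists at the path referenced
// by spark.yarn.archive in spark-defaults.conf.
val fs = FileSystem.get(sc.hadoopConfiguration)
println(fs.exists(new Path("/spark/spark-libs-2.4.3.jar"))) // expect: true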

Patch Hadoop

With the versions currently in use, when the sample code below is executed in spark-shell to load a CSV file:

val logPath = "hdfs://pshp111:9000/u_ex190620.log"

// Read the raw log file and drop the comment/header lines.
val logTextData = sc.textFile(logPath)
    .filter(line => !line.startsWith("#"))

// Parse the remaining lines as space-delimited CSV.
val logData = spark.read
    .option("delimiter", " ")
    .csv(logTextData.toDS)

logData.count

The result will be this exception:

java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 3, local class serialVersionUID = 2

As the error indicates, this is caused by a version mismatch between $HADOOP_HOME/share/hadoop/common/lib/commons-lang3-3.4.jar and the version Spark uses in $SPARK_HOME/jars/commons-lang3-3.5.jar. Replace Hadoop's copy with Spark's:

PS C:\> mv $env:HADOOP_HOME\share\hadoop\common\lib\commons-lang3-3.4.jar $env:HADOOP_HOME\share\hadoop\common\lib\commons-lang3-3.4.jar.orig
PS C:\> cp $env:SPARK_HOME\jars\commons-lang3-3.5.jar $env:HADOOP_HOME\share\hadoop\common\lib\
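
To verify which copy of the class actually gets loaded after the patch, a small sketch that prints the jar the JVM resolved FastDateParser from (run in spark-shell):

// Print the jar org.apache.commons.lang3.time.FastDateParser was loaded from;
// after the patch it should resolve to commons-lang3-3.5.jar.
println(classOf[org.apache.commons.lang3.time.FastDateParser]
  .getProtectionDomain.getCodeSource.getLocation)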

Submit a Spark Application

PS C:\> spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi $env:SPARK_HOME/examples/jars/spark-examples_2.11-2.4.3.jar 10
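
The submitted example estimates Pi with a Monte Carlo simulation; a minimal sketch of the same idea, runnable interactively in spark-shell to verify the YARN setup end to end:

// Sample random points in the square [-1,1] x [-1,1] and count how
// many fall inside the unit circle; the ratio approximates Pi/4.
val n = 1000000
val count = sc.parallelize(1 to n, 10).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / n}")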

Install Zeppelin

Install Binaries

$ cd /d/data-analytics
$ wget http://apache.is.co.za/zeppelin/zeppelin-0.8.1/zeppelin-0.8.1-bin-all.tgz
$ tar xvzf zeppelin-0.8.1-bin-all.tgz

$ echo "zeppelin-0.8.1-bin-all" > zeppelin-0.8.1-bin-all/_version.txt
$ mv zeppelin-0.8.1-bin-all zeppelin

Setup environment variables

Add the following environment variables:

Variable        Value
ZEPPELIN_HOME   D:\data-analytics\zeppelin

Add %ZEPPELIN_HOME%\bin to the Path environment variable.

Run Zeppelin

Please note that common.cmd has a mistake on line 77: a ) should be used instead of a }.

PS C:\> zeppelin.cmd

After Zeppelin has started successfully, go to http://localhost:8080 with your web browser.

Click on Interpreter (top of the page), and edit the Spark section:

  • master == yarn-client
  • Save
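
After saving, a new note with a single test paragraph can confirm the interpreter reaches the cluster; a minimal sketch (sc and spark are provided by Zeppelin's %spark interpreter):

%spark
// Should print the Spark version and run a small job on the cluster.
println(sc.version)
println(spark.range(100).count())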

Connect Docker Zeppelin to Cluster

WARNING

The steps below do not currently work, but I would like to experiment with them again in the future, so I am leaving them here as a reference.

  1. Copy /hadoop/etc/hadoop to a /conf/hadoop folder that will be used in Zeppelin
  2. Copy the /spark folder or use the link to it.
  3. Run the image with the appropriate settings:
docker run -p 8080:8080 --rm \
-v ~/logs:/logs \
-v ~/code/local-hadoop/notebook:/notebook \
-v ~/code/local-hadoop/data:/data \
-v ~/code/local-hadoop/lib/hadoop:/usr/lib/hadoop \
-v ~/code/local-hadoop/lib/archive:/usr/lib/archive \
-e ZEPPELIN_LOG_DIR='/logs' \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
-e HADOOP_CONF_DIR='/usr/lib/hadoop' \
-e SPARK_HOME='/usr/lib/spark' \
-e HADOOP_USER_NAME='fouldsjo' \
--name zeppelin apache/zeppelin:0.8.1
  4. Click on Interpreter (top of the page), and edit the Spark section:
  • master == yarn-client
  • Save
