Zeppelin on Windows

Install Spark

  1. Download the binaries from: https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
  2. Extract the files into the /d/data-analytics directory: $ tar -xvf /c/temp/spark-2.4.3-bin-hadoop2.7.tgz -C /d/data-analytics

Setup environment variables

Add the following environment variables:

Variable          Value
SPARK_HOME        D:\data-analytics\spark-2.4.3-bin-hadoop2.7
HADOOP_CONF_DIR   %HADOOP_HOME%\etc\hadoop
LD_LIBRARY_PATH   %HADOOP_HOME%\lib\native

Add %SPARK_HOME%\bin to the Path environment variable.
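
To sanity-check that the variables are visible to new processes, here is a minimal sketch you can run in spark-shell (plain Scala, using only the variable names defined above):

// Print the environment variables Spark and Hadoop rely on;
// "<not set>" means the variable did not reach this process.
Seq("SPARK_HOME", "HADOOP_CONF_DIR", "LD_LIBRARY_PATH").foreach { name =>
  println(s"$name = ${sys.env.getOrElse(name, "<not set>")}")
}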

Hadoop bundles an older Netty jar than Spark; replace Hadoop's copy with the version that ships with Spark to avoid class conflicts:

PS C:\> mv $env:HADOOP_HOME\share\hadoop\hdfs\lib\netty-all-4.0.52.Final.jar $env:HADOOP_HOME\share\hadoop\hdfs\lib\netty-all-4.0.52.Final.jar.old
PS C:\> cp $env:SPARK_HOME\jars\netty-all-4.1.17.Final.jar $env:HADOOP_HOME\share\hadoop\hdfs\lib\

Configure Spark

$ cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
$ nano $SPARK_HOME/conf/spark-defaults.conf

Use the following settings:

spark.master						yarn
#spark.driver.memory					512m
#spark.yarn.am.memory				512m
#spark.executor.memory				512m
#spark.eventLog.enabled				true
#spark.eventLog.dir					hdfs://pshp111:9000/spark-logs
#spark.history.provider				org.apache.spark.deploy.history.FsHistoryProvider
#spark.history.fs.logDirectory		hdfs://pshp111:9000/spark-logs
#spark.history.fs.update.interval	10s
#spark.history.ui.port				18080
spark.yarn.archive					hdfs://pshp111:9000/spark/spark-libs-2.4.3.jar
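
Once spark-shell is running, the effective configuration can be read back to confirm the file was picked up; a minimal sketch using the keys set above:

// Confirm spark-defaults.conf was applied to the running context.
println(sc.getConf.get("spark.master"))             // expect: yarn
println(sc.getConf.getOption("spark.yarn.archive")) // the spark-libs archive in HDFS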

Create the log directory in HDFS:

PS C:\> hdfs dfs -mkdir /spark-logs

Configure YARN

On all the nodes, add the following properties to the Hadoop yarn-site.xml file. The two check properties disable YARN's physical and virtual memory enforcement so containers are not killed for exceeding those limits.

<property>
	<name>yarn.scheduler.maximum-allocation-mb</name>
	<value>14336</value>
</property>
<property>
	<name>yarn.nodemanager.pmem-check-enabled</name>
	<value>false</value>
</property>
<property>
	<name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
</property>
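
Assuming the YARN configuration is on the classpath (it is once HADOOP_CONF_DIR is set and the master is yarn), the new values can be read back from spark-shell; a minimal sketch:

// Read the yarn-site.xml properties back through the Hadoop configuration.
println(sc.hadoopConfiguration.get("yarn.scheduler.maximum-allocation-mb")) // expect: 14336
println(sc.hadoopConfiguration.get("yarn.nodemanager.vmem-check-enabled"))  // expect: false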

Restart the YARN and HDFS daemons on all the nodes to apply the new settings:

PS C:\> stop-yarn
stopping yarn daemons
SUCCESS: Sent termination signal to the process with PID 10580.
SUCCESS: Sent termination signal to the process with PID 1020.

INFO: No tasks running with the specified criteria.
PS C:\> stop-dfs
SUCCESS: Sent termination signal to the process with PID 11992.
SUCCESS: Sent termination signal to the process with PID 10524.
PS C:\> start-yarn
starting yarn daemons
PS C:\> start-dfs

Configure Spark JAR Location

Package all the Spark jars into a single archive and upload it to HDFS; this is the file referenced by spark.yarn.archive in spark-defaults.conf:

$ cd /c/temp/
$ jar cv0f spark-libs-2.4.3.jar -C $SPARK_HOME/jars/ .
PS C:\> hdfs dfs -mkdir /spark
PS C:\> hdfs dfs -put c:\temp\spark-libs-2.4.3.jar /spark
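
To confirm the archive landed where spark.yarn.archive points, a minimal sketch using the Hadoop FileSystem API from spark-shell:

import org.apache.hadoop.fs.{FileSystem, Path}

// Check that the uploaded archive exists at the path referenced
// by spark.yarn.archive in spark-defaults.conf.
val fs = FileSystem.get(sc.hadoopConfiguration)
println(fs.exists(new Path("/spark/spark-libs-2.4.3.jar"))) // expect: true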

Patch Hadoop

With the versions currently in use, when the sample code below is executed in spark-shell to load a CSV file:

val logPath = "hdfs://pshp111:9000/u_ex190620.log"

// Read the raw log file and drop the comment/header lines.
val logTextData = sc.textFile(logPath)
    .filter(line => !line.startsWith("#"))

// Parse the remaining lines as space-delimited CSV.
val logData = spark.read
    .option("delimiter", " ")
    .csv(logTextData.toDS)

logData.count

The result will be this exception:

java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 3, local class serialVersionUID = 2

As the error indicates, this is caused by a version mismatch between $HADOOP_HOME/share/hadoop/common/lib/commons-lang3-3.4.jar and the version Spark uses in $SPARK_HOME/jars/commons-lang3-3.5.jar. Replace Hadoop's copy with Spark's:

PS C:\> mv $env:HADOOP_HOME\share\hadoop\common\lib\commons-lang3-3.4.jar $env:HADOOP_HOME\share\hadoop\common\lib\commons-lang3-3.4.jar.orig
PS C:\> cp $env:SPARK_HOME\jars\commons-lang3-3.5.jar $env:HADOOP_HOME\share\hadoop\common\lib\
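
To verify which copy of the class actually gets loaded after the patch, a small sketch that prints the jar the JVM resolved FastDateParser from (run in spark-shell):

// Print the jar org.apache.commons.lang3.time.FastDateParser was loaded from;
// after the patch it should resolve to commons-lang3-3.5.jar.
println(classOf[org.apache.commons.lang3.time.FastDateParser]
  .getProtectionDomain.getCodeSource.getLocation)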

Submit a Spark Application

PS C:\> spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi $env:SPARK_HOME/examples/jars/spark-examples_2.11-2.4.3.jar 10
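
The submitted example estimates Pi with a Monte Carlo simulation; a minimal sketch of the same idea, runnable interactively in spark-shell to verify the YARN setup end to end:

// Sample random points in the square [-1,1] x [-1,1] and count how
// many fall inside the unit circle; the ratio approximates Pi/4.
val n = 1000000
val count = sc.parallelize(1 to n, 10).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / n}")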

Install Zeppelin

Install Binaries

$ cd /d/data-analytics
$ wget http://apache.is.co.za/zeppelin/zeppelin-0.8.1/zeppelin-0.8.1-bin-all.tgz
$ tar xvzf zeppelin-0.8.1-bin-all.tgz

$ echo "zeppelin-0.8.1-bin-all" > zeppelin-0.8.1-bin-all/_version.txt
$ mv zeppelin-0.8.1-bin-all zeppelin

Setup environment variables

Add the following environment variables:

Variable        Value
ZEPPELIN_HOME   D:\data-analytics\zeppelin

Add %ZEPPELIN_HOME%\bin to the Path environment variable.

Run Zeppelin

Please note that common.cmd has a mistake on line 77: a ) should be used instead of a }.

PS C:\> zeppelin.cmd

After Zeppelin has started successfully, go to http://localhost:8080 with your web browser.

Click on Interpreter (top of the page), and edit the Spark section:

  • master == yarn-client
  • Save
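
After saving, a new note with a single test paragraph can confirm the interpreter reaches the cluster; a minimal sketch (sc and spark are provided by Zeppelin's %spark interpreter):

%spark
// Should print the Spark version and run a small job on the cluster.
println(sc.version)
println(spark.range(100).count())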

Connect Docker Zeppelin to Cluster

WARNING

The steps below do not currently work, but I would like to experiment with them again in the future, so I am leaving them here as a reference.

  1. Copy /hadoop/etc/hadoop to a /conf/hadoop folder that will be used in Zeppelin
  2. Copy the /spark folder or use the link to it.
  3. Run the image with the appropriate settings:
docker run -p 8080:8080 --rm \
-v ~/logs:/logs \
-v ~/code/local-hadoop/notebook:/notebook \
-v ~/code/local-hadoop/data:/data \
-v ~/code/local-hadoop/lib/hadoop:/usr/lib/hadoop \
-v ~/code/local-hadoop/lib/archive:/usr/lib/archive \
-e ZEPPELIN_LOG_DIR='/logs' \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
-e HADOOP_CONF_DIR='/usr/lib/hadoop' \
-e SPARK_HOME='/usr/lib/spark' \
-e HADOOP_USER_NAME='fouldsjo' \
--name zeppelin apache/zeppelin:0.8.1
  4. Click on Interpreter (top of the page), and edit the Spark section:
  • master == yarn-client
  • Save
