Zeppelin on Windows
- Download the binaries from: https://www.apache.org/dyn/closer.lua/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
- Extract the files into the /d/data-analytics directory (the $-prefixed commands here assume a Unix-style shell on Windows, such as Gow, linked in the references below):
$ cd /d/data-analytics
$ tar -xvzf /c/temp/spark-2.4.3-bin-hadoop2.7.tgz
Add the following environment variables:
Variable | Value
---|---
SPARK_HOME | D:\data-analytics\spark-2.4.3-bin-hadoop2.7
HADOOP_CONF_DIR | %HADOOP_HOME%\etc\hadoop
LD_LIBRARY_PATH | %HADOOP_HOME%\lib\native
Add %SPARK_HOME%\bin to the Path environment variable.
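As a sketch, the same variables can be set from an elevated PowerShell prompt with setx; the /M switch writes machine-level variables, and note that $env:HADOOP_HOME is expanded at the time the command runs:
PS C:\> setx SPARK_HOME "D:\data-analytics\spark-2.4.3-bin-hadoop2.7" /M
PS C:\> setx HADOOP_CONF_DIR "$env:HADOOP_HOME\etc\hadoop" /M
PS C:\> setx LD_LIBRARY_PATH "$env:HADOOP_HOME\lib\native" /M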
Hadoop's HDFS libraries ship an older netty-all jar than the one Spark uses; move it out of the way and substitute Spark's version:
PS C:\> mv $env:HADOOP_HOME\share\hadoop\hdfs\lib\netty-all-4.0.52.Final.jar $env:HADOOP_HOME\share\hadoop\hdfs\lib\netty-all-4.0.52.Final.jar.old
PS C:\> cp $env:SPARK_HOME\jars\netty-all-4.1.17.Final.jar $env:HADOOP_HOME\share\hadoop\hdfs\lib\
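To confirm the swap took effect, list the netty jars on Hadoop's HDFS classpath; only the 4.1.17 jar and the renamed .old file should remain:
PS C:\> Get-ChildItem $env:HADOOP_HOME\share\hadoop\hdfs\lib\netty-all-*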
Next, create Spark's configuration file from the packaged template:
$ cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
$ nano $SPARK_HOME/conf/spark-defaults.conf
Use the following settings:
spark.master yarn
#spark.driver.memory 512m
#spark.yarn.am.memory 512m
#spark.executor.memory 512m
#spark.eventLog.enabled true
#spark.eventLog.dir hdfs://pshp111:9000/spark-logs
#spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
#spark.history.fs.logDirectory hdfs://pshp111:9000/spark-logs
#spark.history.fs.update.interval 10s
#spark.history.ui.port 18080
spark.yarn.archive hdfs://pshp111:9000/spark/spark-libs-2.4.3.jar
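A quick way to confirm the settings are picked up is to start spark-shell with --verbose, which should print the properties read from spark-defaults.conf during startup (look for spark.master and spark.yarn.archive in the output):
PS C:\> spark-shell --verbose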
Create the log directory in HDFS:
PS C:\> hdfs dfs -mkdir /spark-logs
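The /spark-logs directory is only written to once the event-log settings above are uncommented, but it is worth confirming it was created:
PS C:\> hdfs dfs -ls /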
On all nodes, add the following properties to the Hadoop yarn-site.xml file. They raise the maximum container allocation and disable the physical- and virtual-memory checks that would otherwise kill Spark containers:
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>14336</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Restart YARN and HDFS on all nodes to apply the new settings:
PS C:\> stop-yarn
stopping yarn daemons
SUCCESS: Sent termination signal to the process with PID 10580.
SUCCESS: Sent termination signal to the process with PID 1020.
INFO: No tasks running with the specified criteria.
PS C:\> stop-dfs
SUCCESS: Sent termination signal to the process with PID 11992.
SUCCESS: Sent termination signal to the process with PID 10524.
PS C:\> start-yarn
starting yarn daemons
PS C:\> start-dfs
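Once the daemons are back, a quick sanity check confirms that the NodeManagers and DataNodes have re-registered:
PS C:\> yarn node -list
PS C:\> hdfs dfsadmin -report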
Package the Spark jars into a single archive for YARN to cache; this is the file referenced by the spark.yarn.archive setting above:
$ cd /c/temp/
$ jar cv0f spark-libs-2.4.3.jar -C $SPARK_HOME/jars/ .
Upload the archive to HDFS:
PS C:\> hdfs dfs -mkdir /spark
PS C:\> hdfs dfs -put c:\temp\spark-libs-2.4.3.jar /spark
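Confirm the archive landed in HDFS:
PS C:\> hdfs dfs -ls /spark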
With the versions installed above, run the following sample code in spark-shell to load a CSV file:
val logPath = "hdfs://pshp111:9000/u_ex190620.log"
// Drop the header/comment lines that start with #
val logTextData = sc.textFile(logPath).filter(line => !line.startsWith("#"))
// Parse the remaining space-delimited lines as CSV
val logData = spark.read
  .option("delimiter", " ")
  .csv(logTextData.toDS)
logData.count
The result will be this exception:
java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 3, local class serialVersionUID = 2
As the error indicates, this is caused by a version mismatch between the $HADOOP_HOME/share/hadoop/common/lib/commons-lang3-3.4.jar that Hadoop ships and the $SPARK_HOME/jars/commons-lang3-3.5.jar version that Spark uses.
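A quick way to confirm which versions sit on each classpath (a sketch using PowerShell's Get-ChildItem):
PS C:\> Get-ChildItem $env:HADOOP_HOME\share\hadoop\common\lib\commons-lang3-*
PS C:\> Get-ChildItem $env:SPARK_HOME\jars\commons-lang3-*
Then replace Hadoop's copy with Spark's newer jar: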
PS C:\> mv $env:HADOOP_HOME\share\hadoop\common\lib\commons-lang3-3.4.jar $env:HADOOP_HOME\share\hadoop\common\lib\commons-lang3-3.4.jar.orig
PS C:\> cp $env:SPARK_HOME\jars\commons-lang3-3.5.jar $env:HADOOP_HOME\share\hadoop\common\lib\
Verify the whole setup by submitting the bundled SparkPi example in cluster mode (note the PowerShell-style $env:SPARK_HOME; %SPARK_HOME% would not expand here):
PS C:\> spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi $env:SPARK_HOME/examples/jars/spark-examples_2.11-2.4.3.jar 10
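In cluster mode the driver runs inside YARN, so the Pi result ends up in the application logs rather than on the console. Once the job finishes, they can be pulled with yarn logs; the application ID below is a placeholder, substitute the one spark-submit prints:
PS C:\> yarn logs -applicationId application_1561000000000_0001 | Select-String "Pi is roughly"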
Now download and extract Zeppelin:
$ cd /d/data-analytics
$ wget http://apache.is.co.za/zeppelin/zeppelin-0.8.1/zeppelin-0.8.1-bin-all.tgz
$ tar xvzf zeppelin-0.8.1-bin-all.tgz
$ echo "zeppelin-0.8.1-bin-all" > zeppelin-0.8.1-bin-all/_version.txt
$ mv zeppelin-0.8.1-bin-all zeppelin
Add the following environment variables:
Variable | Value
---|---
ZEPPELIN_HOME | D:\data-analytics\zeppelin
Add %ZEPPELIN_HOME%\bin to the Path environment variable.
Please note that common.cmd has a mistake on line 77: a ) should be used instead of a }. After fixing it, start Zeppelin:
PS C:\> zeppelin.cmd
After Zeppelin has started successfully, go to http://localhost:8080 with your web browser.
Click on Interpreter (top of the page), and edit the Spark section:
- master == yarn-client
- Save
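The same settings can also be pre-configured outside the UI. As a sketch, assuming the paths used above, copy conf\zeppelin-env.cmd.template to conf\zeppelin-env.cmd and add:
set SPARK_HOME=D:\data-analytics\spark-2.4.3-bin-hadoop2.7
set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop
rem same as the interpreter's master setting above
set MASTER=yarn-client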
WARNING
The steps below do not currently work, but I would like to experiment with them again in the future, so I am leaving them here as a reference.
- Copy /hadoop/etc/hadoop to a /conf/hadoop folder that will be used in Zeppelin.
- Copy the /spark folder or use the link to it.
- Run the image with the appropriate settings:
docker run -p 8080:8080 --rm \
-v ~/logs:/logs \
-v ~/code/local-hadoop/notebook:/notebook \
-v ~/code/local-hadoop/data:/data \
-v ~/code/local-hadoop/lib/hadoop:/usr/lib/hadoop \
-v ~/code/local-hadoop/lib/archive:/usr/lib/archive \
-e ZEPPELIN_LOG_DIR='/logs' \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
-e HADOOP_CONF_DIR='/usr/lib/hadoop' \
-e SPARK_HOME='/usr/lib/spark' \
-e HADOOP_USER_NAME='fouldsjo' \
--name zeppelin apache/zeppelin:0.8.1
- Click on Interpreter (top of the page), and edit the Spark section:
- master == yarn-client
- Save
- Install, Configure, and Run Spark on Top of a Hadoop YARN Cluster - https://www.linode.com/docs/databases/hadoop/install-configure-run-spark-on-top-of-hadoop-yarn-cluster/
- Install Spark 2.2.1 in Windows - https://kontext.tech/docs/DataAndBusinessIntelligence/p/install-spark-221-in-windows
- Apache Zeppelin on Spark Cluster Mode - https://zeppelin.apache.org/docs/0.7.0/install/spark_cluster_mode.html
- Running Spark on YARN - https://spark.apache.org/docs/latest/running-on-yarn.html
- Gow - The lightweight alternative to Cygwin - https://github.com/bmatzelle/gow/wiki
- Property spark.yarn.jars - how to deal with it? - https://stackoverflow.com/questions/41112801/property-spark-yarn-jars-how-to-deal-with-it
- Use with remote Spark cluster and Yarn - https://datascientists.info/index.php/2016/09/29/apache-zeppelin-use-remote-spark-cluster-yarn/
- Spark Interpreter - https://zeppelin.apache.org/docs/0.5.5-incubating/interpreter/spark.html
- Apache Zeppelin installation on Windows 10 - https://hernandezpaul.wordpress.com/2016/11/14/apache-zeppelin-installation-on-windows-10/
- Install Zeppelin 0.8.1 - https://zeppelin.apache.org/docs/latest/quickstart/install.html
- spark 2.1 uses a more recent commons-lang3 - https://issues.apache.org/jira/browse/ZEPPELIN-1977
- Why Spark jobs don't work on Zeppelin while they work when using pyspark shell - https://stackoverflow.com/questions/54773526/why-spark-job-dont-work-on-zepplin-while-they-work-when-using-pyspark-shell
- ZEPPELIN-4177 - https://issues.apache.org/jira/browse/ZEPPELIN-4177