Commit 3b2c06a: Update doc.
LucaCanali committed Oct 14, 2020 (parent commit: 8a60cdb)
Showing 1 changed file: README.md, 38 additions and 31 deletions.

# SparkPlugins
![SparkPlugins CI](https://github.com/cerndb/SparkPlugins/workflows/SparkPlugins%20CI/badge.svg)


A repo with code and examples of Apache Spark Plugin extensions to the metrics system,
applied to measuring I/O from cloud filesystems, OS/system metrics, and custom metrics.

---
Apache Spark 3.0 comes with an improved plugin framework. Plugins allow extending Spark monitoring functionality.
- One important use case is extending Spark instrumentation with custom metrics:
OS metrics, I/O metrics, external applications monitoring, etc.
- Note: The code in this repo is for Spark 3.x.
Expand All @@ -13,45 +14,25 @@ Apache Spark 3.0 comes with a new plugin framework. Plugins allow extending Spar
Plugin notes:
- Spark plugins implement the `org.apache.spark.api.plugin.SparkPlugin` interface; they can be written in Scala or Java
and can be used to run custom code at the startup of Spark executors and driver.
- Plugins basic configuration: `--conf spark.plugins=<list of plugin classes>` (see the example after this list)
- Plugin JARs need to be made available to Spark executors
- YARN: you can distribute the plugin code to the executors using `--jars`.
- K8S: when using Spark 3.0.1 on Kubernetes, `--jars` or `--packages` distribution will **not work**; you will need to make the JAR available in the Spark container when you build it.
- [SPARK-32119](https://issues.apache.org/jira/browse/SPARK-32119) fixes this issue and allows using `--jars` to distribute plugin code.
- Link to [Spark monitoring documentation](https://spark.apache.org/docs/latest/monitoring.html#advanced-instrumentation)
- See also [SPARK-29397](https://issues.apache.org/jira/browse/SPARK-29397), [SPARK-28091](https://issues.apache.org/jira/browse/SPARK-28091).
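
For illustration, a minimal sketch of enabling two of the plugins described later in this README in a single session; the jar path is a placeholder and `yarn` is just an example master:
```
bin/spark-shell --master yarn \
  --jars <path>/sparkplugins_2.12-0.1.jar \
  --conf spark.plugins=ch.cern.DemoPlugin,ch.cern.RunOSCommandPlugin
```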

Author and contact: [email protected]

## Getting Started
- Build or download the SparkPlugin `jar`. For example:
- Build from source with:
```
sbt package
```
- Or download the jar from the automatic build in [github actions](https://github.com/cerndb/SparkPlugins/actions)

This repo contains code and Spark Plugin examples to extend [Spark metrics instrumentation](https://spark.apache.org/docs/3.0.0/monitoring.html#metrics)
and use an extended Grafana dashboard with the Plugin metrics of interest.
## Plugins in this Repository
### Demo and Basic Plugins
- [DemoPlugin](src/main/scala/ch/cern/DemoPlugin.scala)
- `--conf spark.plugins=ch.cern.DemoPlugin`
- Basic plugin, demonstrates how to write Spark plugins in Scala, for demo and testing.
- [RunOSCommandPlugin](src/main/scala/ch/cern/RunOSCommandPlugin.scala)
- `--conf spark.plugins=ch.cern.RunOSCommandPlugin`
- Example illustrating how to use plugins to run actions on the OS.
- Action implemented: runs an OS command on the executors, by default it runs: `/usr/bin/touch /tmp/plugin.txt`
- Configurable action: `--conf spark.cernSparkPlugin.command="command or script you want to run"`
- Example:
```
bin/spark-shell --master yarn \
--jars <path>/sparkplugins_2.12-0.1.jar \
--conf spark.plugins=ch.cern.RunOSCommandPlugin
```
- You can see if the plugin has run by checking that the file `/tmp/plugin.txt` has been
created on the executor machines.
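
As a variation, a sketch of overriding the default action using the configuration property shown above; the file name touched here is a hypothetical example:
```
bin/spark-shell --master yarn \
  --jars <path>/sparkplugins_2.12-0.1.jar \
  --conf spark.plugins=ch.cern.RunOSCommandPlugin \
  --conf spark.cernSparkPlugin.command="/usr/bin/touch /tmp/plugin_test_run.txt"
```
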
## Spark Metrics System and Spark Performance Dashboard
- Most of the Plugins described in this repo are intended to extend the Spark Metrics System.
- See the details on the Spark metrics system at [Spark Monitoring documentation](https://spark.apache.org/docs/latest/monitoring.html#metrics).
- You can find the metrics generated by the plugins in the Spark metrics system stream,
under `namespace=plugin.<Plugin Class Name>`.
- Metrics need to be stored and visualized.
- This repo refers to the work on [how to deploy a Spark Performance Dashboard using Spark metrics](https://github.com/cerndb/spark-dashboard)
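
For illustration, a sketch of shipping plugin metrics to a Graphite endpoint with Spark's standard metrics sink configuration; the host, port, and choice of plugin are placeholders for your setup:
```
bin/spark-shell --master yarn \
  --jars <path>/sparkplugins_2.12-0.1.jar \
  --conf spark.plugins=ch.cern.CgroupMetrics \
  --conf "spark.metrics.conf.*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink" \
  --conf "spark.metrics.conf.*.sink.graphite.host=graphite.example.com" \
  --conf "spark.metrics.conf.*.sink.graphite.port=2003" \
  --conf "spark.metrics.conf.*.sink.graphite.period=10" \
  --conf "spark.metrics.conf.*.sink.graphite.unit=seconds"
```
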
---
### OS metrics instrumentation with cgroups, for Spark on Kubernetes
- [CgroupMetrics](src/main/scala/ch/cern/CgroupMetrics.scala)
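
A usage sketch of this plugin on Kubernetes, assuming the plugin jar has been made available in the Spark container image (see the Plugin notes above); the master URL and image name are placeholders:
```
bin/spark-shell --master k8s://https://<k8s-api-server>:6443 \
  --conf spark.kubernetes.container.image=<registry>/spark:v3.0.1-with-plugins \
  --conf spark.plugins=ch.cern.CgroupMetrics
```
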
#### HDFS extended storage statistics
This Plugin measures HDFS extended statistics.
In particular, it provides information on read locality and erasure coding usage (for HDFS 3.x).
- [HDFSMetrics](src/main/scala/ch/cern/HDFSMetrics.scala)
- Configure with: `--conf spark.plugins=ch.cern.HDFSMetrics`
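- Example (the jar path is a placeholder):
```
bin/spark-shell --master yarn \
  --jars <path>/sparkplugins_2.12-0.1.jar \
  --conf spark.plugins=ch.cern.HDFSMetrics
```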
- [CloudFSMetrics27](src/main/scala/ch/cern/CloudFSMetrics27.scala)
- Collects I/O metrics for Hadoop-compatible filesystems using the Hadoop 2.7 API; use with Spark built with Hadoop 2.7.
- Metrics are the same as for CloudFSMetrics; they have the prefix `ch.cern.CloudFSMetrics27`.
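- A minimal usage sketch (the jar path is a placeholder; any filesystem-name configuration used by CloudFSMetrics is omitted here):
```
bin/spark-shell --master yarn \
  --jars <path>/sparkplugins_2.12-0.1.jar \
  --conf spark.plugins=ch.cern.CloudFSMetrics27
```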
---
## Experimental Plugins for I/O Time Instrumentation
This section details a few experimental Spark plugins used to expose metrics for I/O-time
instrumentation of Spark workloads using Hadoop-compliant filesystems.
These plugins use instrumented experimental/custom versions of the Hadoop client.
- Note: this requires a custom S3A client implementation; see the experimental code at: [HDFS and S3A custom instrumentation](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
- Spark config:
- `--conf spark.plugins=ch.cern.experimental.S3ATimeInstrumentation`
- Custom jar needed: `--jars hadoop-aws-3.2.0.jar`
- build [from this fork](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
- or download a pre-built copy of the [hadoop-aws-3.2.0.jar at this link](https://cern.ch/canali/res/hadoop-aws-3.2.0.jar)
- Metrics implemented (gauges), with prefix `ch.cern.experimental.S3ATimeInstrumentation`:
- `S3AReadTimeMuSec`
- `S3ASeekTimeMuSec`
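
Putting the pieces together, a sketch of a session with the S3A time instrumentation enabled; jar paths are placeholders, and the custom `hadoop-aws` jar is the one built or downloaded as described above:
```
bin/spark-shell --master yarn \
  --jars <path>/sparkplugins_2.12-0.1.jar,<path>/hadoop-aws-3.2.0.jar \
  --conf spark.plugins=ch.cern.experimental.S3ATimeInstrumentation
```
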
#### HDFS time instrumentation
- Spark config:
- `--conf spark.plugins=ch.cern.experimental.HDFSTimeInstrumentation`
- `--jars <PATH>/sparkplugins_2.12-0.1.jar`
- Non-standard configuration required for using this instrumentation:
- replace `$SPARK_HOME/jars/hadoop-hdfs-client-3.2.0.jar` with the jar built [from this fork](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
- for convenience you can download a pre-built copy of the [hadoop-hdfs-client-3.2.0.jar at this link](https://cern.ch/canali/res/hadoop-hdfs-client-3.2.0.jar)
- Metrics implemented (gauges), with prefix `ch.cern.experimental.HDFSTimeInstrumentation`:
- `HDFSReadTimeMuSec`
- `HDFSCPUTimeDuringReadMuSec`
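
A sketch of a session with this instrumentation, assuming the plugin class name matches the metrics prefix and that `$SPARK_HOME/jars/hadoop-hdfs-client-3.2.0.jar` has already been replaced as described above; the plugin jar path is a placeholder:
```
bin/spark-shell --master yarn \
  --jars <path>/sparkplugins_2.12-0.1.jar \
  --conf spark.plugins=ch.cern.experimental.HDFSTimeInstrumentation
```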
