diff --git a/README.md b/README.md
index dc034e3..5dd9a06 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,11 @@
 # SparkPlugins
 
 ![SparkPlugins CI](https://github.com/cerndb/SparkPlugins/workflows/SparkPlugins%20CI/badge.svg)
-Code and examples of how to use Apache Spark Plugin extensions to the metrics system with Apache Spark 3.0.
-
+A repo with code and examples of Apache Spark Plugin extensions to the metrics system,
+applied to measuring I/O from cloud filesystems, OS/system metrics, and custom metrics.
+
 ---
-Apache Spark 3.0 comes with a new plugin framework. Plugins allow extending Spark monitoring functionality.
+Apache Spark 3.0 comes with an improved plugin framework. Plugins allow extending Spark monitoring functionality.
 - One important use case is extending Spark instrumentation with custom metrics: OS metrics, I/O metrics, external applications monitoring, etc.
 - Note: The code in this repo is for Spark 3.x.
 
@@ -13,10 +14,10 @@ Apache Spark 3.0 comes with a new plugin framework. Plugins allow extending Spar
 
 Plugin notes:
 - Spark plugins implement the `org.apache.spark.api.plugin.SparkPlugin` interface; they can be written in Scala or Java and can be used to run custom code at the startup of the Spark executors and driver (see the sketch below).
-- Plugins configuration: `--conf spark.plugins=<list of plugin classes>`
+- Plugins basic configuration: `--conf spark.plugins=<list of plugin classes>`
 - Plugin JARs need to be made available to the Spark executors:
   - YARN: you can distribute the plugin code to the executors using `--jars`.
-  - K8S, when using Spark 3.0.1 on K8S, `--jars` distribution will **not work**, you will need to make the JAR available in the Spark container when you build it.
+  - K8S: when using Spark 3.0.1 on K8S, `--jars` or `--packages` distribution will **not work**; you will need to make the JAR available in the Spark container when you build it.
   - [SPARK-32119](https://issues.apache.org/jira/browse/SPARK-32119) fixes this issue and allows using `--jars` to distribute the plugin code.
 - Link to the [Spark monitoring documentation](https://spark.apache.org/docs/latest/monitoring.html#advanced-instrumentation)
 - See also [SPARK-29397](https://issues.apache.org/jira/browse/SPARK-29397), [SPARK-28091](https://issues.apache.org/jira/browse/SPARK-28091).
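+
+For illustration, here is a minimal sketch of a Spark plugin written in Scala (the class name `HelloPlugin` is hypothetical, not part of this repo); it simply prints a message when the driver and each executor start:
+```
+package ch.cern
+
+import java.util.{Collections, Map => JMap}
+
+import org.apache.spark.SparkContext
+import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}
+
+// Hypothetical minimal plugin, for illustration only.
+class HelloPlugin extends SparkPlugin {
+
+  // Driver-side component: init is called early during SparkContext startup.
+  override def driverPlugin(): DriverPlugin = new DriverPlugin {
+    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
+      println("HelloPlugin: driver started")
+      Collections.emptyMap[String, String]()
+    }
+  }
+
+  // Executor-side component: init is called when each executor starts.
+  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
+    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
+      println(s"HelloPlugin: executor ${ctx.executorID()} started")
+    }
+  }
+}
+```
+Package a class like this into a jar and activate it with `--conf spark.plugins=ch.cern.HelloPlugin`.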
@@ -24,34 +25,14 @@ Plugin notes:
 Author and contact: Luca.Canali@cern.ch
 
 ## Getting Started
-- Build or download the SparkPlugin `jar`
+- Build or download the SparkPlugin `jar`. For example:
   - Build from source with:
   ```
   sbt package
   ```
-  - Or download from the automatic build in [github actions](https://github.com/cerndb/SparkPlugins/actions)
-
-- Test and debug with a demo plugin `ch.cern.RunOSCommandPlugin`:
-  this runs an OS command on the executors, customizable (see below), by default `"/usr/bin/touch /tmp/plugin.txt"`.
-  ```
-  bin/spark-shell --master yarn \
-  --jars /sparkplugins_2.12-0.1.jar \
-  --conf spark.plugins=ch.cern.RunOSCommandPlugin
-  ```
-  - You can see if the plugin has run by checking that the file `/tmp/plugin.txt` has been
-  created on the executor machines.
-
-## How to use the metrics produced by the plugins
+  - Or download the jar from the automatic build in [GitHub Actions](https://github.com/cerndb/SparkPlugins/actions)
-
- - You can find the metrics generated by the plugin in the Spark metrics system stream (see details below).
- - Spark Dashboard using the metrics system:
-Link with details on [how to deploy a dashboard using Spark metrics](https://github.com/cerndb/spark-dashboard)
-This repo contains code and Spark Plugin examples to extend [Spark metrics instrumentation](https://spark.apache.org/docs/3.0.0/monitoring.html#metrics)
-and use an extended grafana dashboard with the Plugin metrics of interest.
-
-## Plugins in this Repository
-
-### Demo and basics
+### Demo and Basic Plugins
 - [DemoPlugin](src/main/scala/ch/cern/DemoPlugin.scala)
 - `--conf spark.plugins=ch.cern.DemoPlugin`
 - Basic plugin, demonstrates how to write Spark plugins in Scala, for demo and testing.
@@ -65,6 +46,26 @@ Author and contact: Luca.Canali@cern.ch
 - Example illustrating how to use plugins to run actions on the OS.
 - Action implemented: runs an OS command on the executors, by default it runs: `/usr/bin/touch /tmp/plugin.txt`
 - Configurable action: `--conf spark.cernSparkPlugin.command="command or script you want to run"`
+- Example:
+  ```
+  bin/spark-shell --master yarn \
+    --jars <path>/sparkplugins_2.12-0.1.jar \
+    --conf spark.plugins=ch.cern.RunOSCommandPlugin
+  ```
+- You can verify that the plugin has run by checking that the file `/tmp/plugin.txt` has been
+  created on the executor machines.
+
+## Spark Metrics System and Spark Performance Dashboard
+
+- Most of the plugins described in this repo are intended to extend the Spark metrics system.
+- See the details of the Spark metrics system in the [Spark monitoring documentation](https://spark.apache.org/docs/latest/monitoring.html#metrics).
+- You can find the metrics generated by the plugins in the Spark metrics system stream under the
+  namespace `plugin.<Plugin Class Name>`.
+- Metrics need to be stored and visualized.
+- This repo builds on the work on [how to deploy a Spark Performance Dashboard using Spark metrics](https://github.com/cerndb/spark-dashboard); see also the example sketched below.
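+- For example, plugin metrics can be shipped with Spark's `GraphiteSink` to a Graphite-compatible
+  endpoint, such as the InfluxDB Graphite endpoint used by the dashboard work linked above.
+  A sketch, where `<path>`, `<graphiteEndPoint>`, and `<graphitePort>` are placeholders to fill in:
+  ```
+  bin/spark-shell --master yarn \
+    --jars <path>/sparkplugins_2.12-0.1.jar \
+    --conf spark.plugins=ch.cern.CgroupMetrics \
+    --conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
+    --conf "spark.metrics.conf.*.sink.graphite.host"=<graphiteEndPoint> \
+    --conf "spark.metrics.conf.*.sink.graphite.port"=<graphitePort> \
+    --conf "spark.metrics.conf.*.sink.graphite.period"=10 \
+    --conf "spark.metrics.conf.*.sink.graphite.unit"=seconds
+  ```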
+
+---
+## Plugins in this Repository
 
 ### OS metrics instrumentation with cgroups, for Spark on Kubernetes
 - [CgroupMetrics](src/main/scala/ch/cern/CgroupMetrics.scala)
@@ -111,7 +112,7 @@ Author and contact: Luca.Canali@cern.ch
 #### HDFS extended storage statistics
 
 This Plugin measures HDFS extended statistics.
-In particular it provides information on read locality and erasure coding usage (for HDFS 3.x).
+In particular, it provides information on read locality and erasure coding usage (for HDFS 3.x).
 
 - [HDFSMetrics](src/main/scala/ch/cern/HDFSMetrics.scala)
 - Configure with: `--conf spark.plugins=ch.cern.HDFSMetrics`
@@ -203,7 +204,8 @@ storage system exposed as a Hadoop Compatible Filesystem). Use this for Spark bu
 - Collects I/O metrics for Hadoop-compatible filesystems using the Hadoop 2.7 API; use with Spark built with Hadoop 2.7.
 - Metrics are the same as for CloudFSMetrics; they have the prefix `ch.cern.CloudFSMetrics27`.
 
-## Experimental Plugins for I/O time instrumentation
+---
+## Experimental Plugins for I/O Time Instrumentation
 
 This section details a few experimental Spark plugins used to expose metrics for I/O-time instrumentation
 of Spark workloads using Hadoop-compliant filesystems.
@@ -214,7 +216,10 @@ These plugins use instrumented experimental/custom versions of the Hadoop client
 - Note: this requires a custom S3A client implementation, see the experimental code at: [HDFS and S3A custom instrumentation](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
 - Spark config:
   - `--conf spark.plugins=ch.cern.experimental.S3ATimeInstrumentation`
-  - Custom jar needed: `--jars hadoop-aws-3.2.0.jar` built [from this fork](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
+  - Custom jar needed: `--jars hadoop-aws-3.2.0.jar`
+    - build it [from this fork](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
+    - or download a pre-built copy of the [hadoop-aws-3.2.0.jar at this link](https://cern.ch/canali/res/hadoop-aws-3.2.0.jar)
+
 - Metrics implemented (gauges), with prefix `ch.cern.experimental.S3ATimeInstrumentation`:
   - `S3AReadTimeMuSec`
   - `S3ASeekTimeMuSec`
@@ -259,6 +264,8 @@ These plugins use instrumented experimental/custom versions of the Hadoop client
   - `--jars <path>/sparkplugins_2.12-0.1.jar`
 - Non-standard configuration required for using this instrumentation:
   - replace `$SPARK_HOME/jars/hadoop-hdfs-client-3.2.0.jar` with the jar built [from this fork](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
+  - for convenience, you can download a pre-built copy of the [hadoop-hdfs-client-3.2.0.jar at this link](https://cern.ch/canali/res/hadoop-hdfs-client-3.2.0.jar)
+
 - Metrics implemented (gauges), with prefix `ch.cern.experimental.HDFSTimeInstrumentation`:
   - `HDFSReadTimeMuSec`
   - `HDFSCPUTimeDuringReadMuSec`
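+
+For example, a sketch of how the HDFS time instrumentation could be launched, assuming the
+patched `hadoop-hdfs-client` jar described above is already in place under `$SPARK_HOME/jars`
+(`<path>` is a placeholder):
+  ```
+  bin/spark-shell --master yarn \
+    --jars <path>/sparkplugins_2.12-0.1.jar \
+    --conf spark.plugins=ch.cern.experimental.HDFSTimeInstrumentation
+  ```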