Version 0.2 for Scala 2.12 and Scala 2.13.
LucaCanali committed Aug 26, 2022
1 parent 5f79a9a commit 3210bc5
Showing 3 changed files with 24 additions and 21 deletions.
32 changes: 17 additions & 15 deletions README.md
@@ -2,10 +2,10 @@
![SparkPlugins CI](https://github.com/cerndb/SparkPlugins/workflows/SparkPlugins%20CI/badge.svg?branch=master&event=push)
[![Maven Central](https://maven-badges.herokuapp.com/maven-central/ch.cern.sparkmeasure/spark-plugins_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/ch.cern.sparkmeasure/spark-plugins_2.12)

-This repo contains code and examples of Apache Spark Plugins for Spark 3.x.
-Apache Spark plugins are an interface and configuration allowing to inject custom code on executors
-start-up, with a hook to the Spark metrics system.
-This provides a way to extend metrics collection beyond what is normally available in Apache Spark.
+This repository contains code and examples of how to use Apache Spark Plugins.
+Spark plugins have been part of Spark core since version 3.0; they provide an interface,
+and related configuration, for injecting custom code on executors as they are initialized.
+Spark plugins can also be used to implement custom extensions to the Spark metrics system.
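For readers new to the mechanism, a minimal plugin sketch in Scala may help. It targets Spark 3.x's `org.apache.spark.api.plugin` interface; the class name and log lines are illustrative, not part of this repository:

```
// Minimal sketch of a Spark plugin (illustrative; not part of this repo).
// Activate with: --conf spark.plugins=ch.cern.example.HelloPlugin
package ch.cern.example

import java.util.{Collections, Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class HelloPlugin extends SparkPlugin {
  // Runs in the driver JVM when the SparkContext is created.
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      println("HelloPlugin: driver initialized")
      Collections.emptyMap[String, String] // extra conf passed to executor plugins
    }
  }

  // Runs in each executor JVM as it is initialized.
  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit =
      println(s"HelloPlugin: executor ${ctx.executorID} initialized")
  }
}
```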

### Motivations
- Instrumenting parts of the Spark workload with plugins provides additional flexibility compared
@@ -43,17 +43,17 @@ Spark I/O from cloud Filesystems, OS metrics, and custom application metrics.
---
## Getting Started
- Deploy the jar from maven central
-- `--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1`
+- `--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.2`
- Build or download the SparkPlugin `jar`. For example:
-- Build from source with `sbt package`
+- Build from source with `sbt +package`
- Or download the jar from the automatic build in [github actions](https://github.com/cerndb/SparkPlugins/actions)

### Demo and Basic Plugins
- [DemoPlugin](src/main/scala/ch/cern/DemoPlugin.scala)
-- `--conf spark.plugins=ch.cern.DemoPlugin`
+- `--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.2 --conf spark.plugins=ch.cern.DemoPlugin`
- Basic plugin, demonstrates how to write Spark plugins in Scala, for demo and testing.
- [DemoMetricsPlugin](src/main/scala/ch/cern/DemoMetricsPlugin.scala)
-- `--conf spark.plugins=ch.cern.DemoMetricsPlugin`
+- `--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.2 --conf spark.plugins=ch.cern.DemoMetricsPlugin`
- Example plugin illustrating integration with the Spark metrics system.
- Metrics implemented:
- `ch.cern.DemoMetricsPlugin.DriverTest42`: a gauge reporting a constant integer value, for testing.
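To show how a plugin hooks into the metrics system, here is a sketch of registering such a gauge via the plugin's `PluginContext`; it approximates what `DemoMetricsPlugin` does, and the class name is illustrative:

```
// Sketch: register a Dropwizard gauge with the Spark metrics system.
import java.util.{Collections, Map => JMap}

import com.codahale.metrics.Gauge
import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class GaugeSketchPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // Reported by the configured metrics sinks under the plugin's namespace.
      ctx.metricRegistry.register("DriverTest42", new Gauge[Int] {
        override def getValue: Int = 42
      })
      Collections.emptyMap[String, String]
    }
  }
  // Returning null is allowed when there is no executor-side component.
  override def executorPlugin(): ExecutorPlugin = null
}
```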
@@ -65,7 +65,7 @@ Spark I/O from cloud Filesystems, OS metrics, and custom application metrics.
- Example:
```
bin/spark-shell --master yarn \
---packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1 \
+--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.2 \
--conf spark.plugins=ch.cern.RunOSCommandPlugin
```
- You can see if the plugin has run by checking that the file `/tmp/plugin.txt` has been
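As a sketch of how such a plugin can work, the executor-side `init` hook can shell out with `scala.sys.process` (the command below assumes the plugin's default behavior of touching `/tmp/plugin.txt`; the class name is illustrative):

```
// Sketch: run an OS command once per executor JVM at startup.
import java.util.{Map => JMap}

import scala.sys.process._

import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class OSCommandSketchPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = null
  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
      // "!" runs the command and returns its exit code.
      val exitCode = Seq("/usr/bin/touch", "/tmp/plugin.txt").!
      if (exitCode != 0)
        System.err.println(s"command failed with exit code $exitCode")
    }
  }
}
```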
@@ -97,8 +97,8 @@ Spark I/O from cloud Filesystems, OS metrics, and custom application metrics.
```
bin/spark-shell --master k8s://https://<K8S URL>:6443 --driver-memory 1g \
--num-executors 2 --executor-cores 2 --executor-memory 2g \
---conf spark.kubernetes.container.image=<registry>/spark:v311 \
---packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1 \
+--conf spark.kubernetes.container.image=<registry>/spark:v330 \
+--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.2 \
--conf spark.plugins=ch.cern.HDFSMetrics,ch.cern.CgroupMetrics \
--conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
--conf "spark.metrics.conf.*.sink.graphite.host"=mytestinstance \
@@ -139,7 +139,7 @@ In particular, it provides information on read locality and erasure coding usage.
- Example
```
bin/spark-shell --master yarn \
---packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1 \
+--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.2 \
--conf spark.plugins=ch.cern.HDFSMetrics \
--conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
--conf "spark.metrics.conf.*.sink.graphite.host"=mytestinstance \
@@ -163,7 +163,7 @@ storage system exposed as a Hadoop Compatible Filesystem).
- `--conf spark.cernSparkPlugin.cloudFsName=<name of the filesystem>` (example: "s3a", "gs", "wasbs", "root", "oci", etc.)
- Optional configuration: `--conf spark.cernSparkPlugin.registerOnDriver` (default true)
- Collects I/O metrics for Hadoop-compatible filesystems using Hadoop's GlobalStorageStatistics API.
-- Note: use this with Spark built with Hadoop 3.2 (requires Hadoop client version 2.8 or higher).
+- Note: use this with Spark built with Hadoop 3.x (requires Hadoop client version 2.8 or higher).
- Spark also allows to measure filesystem metrics using
`--conf spark.executor.metrics.fileSystemSchemes=<filesystems to measure>` (default: `file,hdfs`)
however in Spark (up to 3.1) this is done using Hadoop Filesystem getAllStatistics, deprecated in recent versions of Hadoop.
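For background, the GlobalStorageStatistics API that this plugin builds on can be inspected directly, for example from `spark-shell`; a sketch, using `"s3a"` as an example scheme (available with Hadoop client 2.8 and higher):

```
// Sketch: enumerate Hadoop's global per-scheme I/O counters.
import org.apache.hadoop.fs.GlobalStorageStatistics

// Returns null until the filesystem scheme has actually been used.
val stats = GlobalStorageStatistics.INSTANCE.get("s3a")
if (stats != null) {
  val it = stats.getLongStatistics // iterator over name/value counters
  while (it.hasNext) {
    val stat = it.next()
    println(s"${stat.getName} = ${stat.getValue}")
  }
}
```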
@@ -177,7 +177,7 @@ storage system exposed as a Hadoop Compatible Filesystem).
bin/spark-shell --master k8s://https://<K8S URL>:6443 --driver-memory 1g \
--num-executors 2 --executor-cores 2 --executor-memory 2g \
--conf spark.kubernetes.container.image=<registry>/spark:v311 \
---packages org.apache.hadoop:hadoop-aws:3.2.0,ch.cern.sparkmeasure:spark-plugins_2.12:0.1 \
+--packages org.apache.hadoop:hadoop-aws:3.3.2,ch.cern.sparkmeasure:spark-plugins_2.12:0.2 \
--conf spark.plugins=ch.cern.CloudFSMetrics,ch.cern.CgroupMetrics \
--conf spark.cernSparkPlugin.cloudFsName="s3a" \
--conf spark.hadoop.fs.s3a.secret.key="<SECRET KEY HERE>" \
@@ -217,6 +217,7 @@ These plugins use instrumented experimental/custom versions of the Hadoop client
- Instruments the Hadoop S3A client.
- Note: this requires custom S3A client implementation, see experimental code at: [HDFS and S3A custom instrumentation](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
- Spark config:
+- Use this with Spark 3.1.x (which uses Hadoop version 3.2.0)
- `--conf spark.plugins=ch.cern.experimental.S3ATimeInstrumentation`
- Custom jar needed: `--jars hadoop-aws-3.2.0.jar`
- build [from this fork](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
@@ -259,7 +260,8 @@ These plugins use instrumented experimental/custom versions of the Hadoop client
- Instruments the Hadoop HDFS client.
- Note: this requires custom HDFS client implementation, see experimental code at: [HDFS and S3A custom instrumentation](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
- Spark config:
-- `--conf spark.plugins=ch.cern.experimental.HDFSTimeInstrumentation`
+- Use this with Spark 3.1.x (which uses Hadoop version 3.2.0)
+- `--conf spark.plugins=ch.cern.experimental.HDFSTimeInstrumentation`
- `--packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1`
- Non-standard configuration required for using this instrumentation:
- replace `$SPARK_HOME/jars/hadoop-hdfs-client-3.2.0.jar` with the jar built [from this fork](https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation)
11 changes: 6 additions & 5 deletions build.sbt
@@ -1,15 +1,16 @@
name := "spark-plugins"

version := "0.2-SNAPSHOT"
isSnapshot := true
version := "0.2"
isSnapshot := false

scalaVersion := "2.12.15"
+crossScalaVersions := Seq("2.12.15", "2.13.8")

licenses += ("Apache-2.0", url("http://www.apache.org/licenses/LICENSE-2.0"))

libraryDependencies += "io.dropwizard.metrics" % "metrics-core" % "4.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "3.3.1"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "io.dropwizard.metrics" % "metrics-core" % "4.2.7"
libraryDependencies += "org.apache.hadoop" % "hadoop-client-api" % "3.3.2"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0"

// publishing to Sonatype Nexus repository and Maven
publishMavenStyle := true
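For context on the cross-build introduced in this file: sbt's `+` task prefix runs a task once per entry in `crossScalaVersions`, and `%%` selects the dependency artifact matching each Scala version. A sketch (the jar paths are the expected sbt defaults):

```
// Sketch, mirroring the settings in this commit.
// `sbt +package` then builds one jar per Scala version, e.g.:
//   target/scala-2.12/spark-plugins_2.12-0.2.jar
//   target/scala-2.13/spark-plugins_2.13-0.2.jar
crossScalaVersions := Seq("2.12.15", "2.13.8")

// %% appends the Scala binary version to the artifact name, resolving to
// spark-core_2.12 or spark-core_2.13 as appropriate.
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.0"
```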
2 changes: 1 addition & 1 deletion project/build.properties
@@ -1 +1 @@
-sbt.version = 1.5.5
+sbt.version = 1.7.1
