This project houses the experimental client for Spark Connect for Apache Spark, written in Rust.

Currently, the Spark Connect client for Rust is highly experimental and should not be used in any production setting. It is a "proof of concept" to identify how to interact with a Spark cluster from Rust.

The spark-connect-rs crate aims to provide an entrypoint to Spark Connect and DataFrame API interactions similar to the other Spark Connect clients.
```
├── core        <- core implementation in Rust
│   └─ spark    <- git submodule for apache/spark
├── rust        <- shim for 'spark-connect-rs' from core
```

The future state would be to have additional bindings for other languages alongside the top-level rust folder.
This section explains how to run Spark Connect for Rust locally, starting from scratch.

Step 1: Install Rust via rustup: https://www.rust-lang.org/tools/install

Step 2: Ensure you have cmake and protobuf installed on your machine

Step 3: Run the following commands to clone the repo and build the project

```bash
git clone https://github.com/sjrusso8/spark-connect-rs.git
cd spark-connect-rs
git submodule update --init --recursive
cargo build
```
Step 4: Set up the Spark Driver on localhost, either by downloading Spark or by using Docker.

With local Spark:

- Download the Spark distribution (3.4.0+) and unzip the package.
- Start the Spark Connect server with the following command (make sure to use a package version that matches your Spark distribution):

```bash
sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
```

With Docker:

- Start the Spark Connect server by leveraging the docker-compose.yml provided in this repo. This will start a Spark Connect server running on port 15002.

```bash
docker compose up --build -d
```
Step 5: Run an example from the repo under /examples
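As a sketch of what such an example might look like, the snippet below connects to the local Spark Connect server started in Step 4 and runs a simple SQL query. The `SparkSessionBuilder::remote`, `sql`, and `show` calls mirror the API tables later in this README, but the exact signatures are assumptions and may differ between crate versions; a tokio runtime is also assumed.

```rust
// Minimal sketch of a spark-connect-rs client program.
// NOTE: method signatures here are assumptions based on the API tables below.
use spark_connect_rs::{SparkSession, SparkSessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the Spark Connect server started in Step 4 (default port 15002).
    let spark: SparkSession = SparkSessionBuilder::remote("sc://127.0.0.1:15002/")
        .build()
        .await?;

    // Run a simple SQL query and print the resulting DataFrame.
    let df = spark.sql("SELECT 'hello' AS greeting, 42 AS answer").await?;
    df.show(Some(5), None, None).await?;

    Ok(())
}
```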
The following section outlines some of the larger functionality that is not yet working with this Spark Connect implementation.

- TLS authentication & Databricks compatibility via the feature flag `feature = 'tls'`
- StreamingQueryManager
- UDFs or any type of functionality that takes a closure (foreach, foreachBatch, etc.)
Spark Session type object and its implemented traits
Spark DataFrame type object and its implemented traits.
Spark Connect should respect the format as long as your cluster supports the specified type and has the required jars
| DataFrameWriter | API | Comment |
|-----------------|-----|---------|
| format          |     |         |
| option          |     |         |
| options         |     |         |
| mode            |     |         |
| bucketBy        |     |         |
| sortBy          |     |         |
| partitionBy     |     |         |
| save            |     |         |
| saveAsTable     |     |         |
| insertInto      |     |         |
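As a rough sketch of how these writer methods chain together (the method names come from the table above, but the exact Rust signatures, the string-based `mode`, and the `DataFrame` re-export path are assumptions):

```rust
// Sketch: write a DataFrame out as parquet with the builder-style writer.
// NOTE: signatures are assumptions; mode may be an enum rather than a string.
use spark_connect_rs::DataFrame; // assumed re-export path

async fn save_as_parquet(df: DataFrame, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    df.write()
        .format("parquet")   // any format your cluster supports (see note above)
        .mode("overwrite")   // hypothetical string form; may be a SaveMode enum
        .save(path)
        .await?;
    Ok(())
}
```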
Start a streaming job and return a StreamingQuery
object to handle the stream operations.
| DataStreamWriter | API | Comment |
|------------------|-----|---------|
| format           |     |         |
| foreach          |     |         |
| foreachBatch     |     |         |
| option           |     |         |
| options          |     |         |
| outputMode       |     | Uses an Enum for OutputMode |
| partitionBy      |     |         |
| queryName        |     |         |
| trigger          |     | Uses an Enum for TriggerMode |
| start            |     |         |
| toTable          |     |         |
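A streaming write might be wired up roughly as below; the `readStream`/`writeStream` chain and the `OutputMode` enum follow the table above, but the import paths and exact signatures are assumptions.

```rust
// Sketch: stream from the built-in "rate" source to the console sink.
// NOTE: import paths and signatures are assumptions based on the tables above.
use spark_connect_rs::SparkSession;
use spark_connect_rs::streaming::{OutputMode, StreamingQuery}; // assumed module path

async fn start_rate_to_console(
    spark: SparkSession,
) -> Result<StreamingQuery, Box<dyn std::error::Error>> {
    // The "rate" source emits monotonically increasing rows, handy for testing.
    let df = spark.readStream().format("rate").load(None)?;

    // Configure the sink and start the query; start returns a StreamingQuery handle.
    let query = df
        .writeStream()
        .format("console")
        .outputMode(OutputMode::Append)
        .queryName("rate_to_console")
        .start(None)
        .await?;

    Ok(query)
}
```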
A handle to a query that is executing continuously in the background as new data arrives.
| StreamingQuery | API | Comment |
|----------------|-----|---------|
| awaitTermination    | | |
| exception           | | |
| explain             | | |
| processAllAvailable | | |
| stop                | | |
| id                  | | |
| isActive            | | |
| lastProgress        | | |
| name                | | |
| recentProgress      | | |
| runId               | | |
| status              | | |
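Continuing the streaming sketch above, a returned query handle might be managed like this; the method names follow the table, while the type path and the async/Result shapes are assumptions.

```rust
// Sketch: drain and shut down a running StreamingQuery.
// NOTE: the type path and whether these methods are async/return Result are assumptions.
use spark_connect_rs::streaming::StreamingQuery;

async fn drain_and_stop(query: StreamingQuery) -> Result<(), Box<dyn std::error::Error>> {
    // Block until everything currently available at the source has been processed.
    query.processAllAvailable().await?;

    // Basic introspection of the running query.
    println!("id={} runId={} active={}", query.id(), query.runId(), query.isActive());

    // Shut the stream down.
    query.stop().await?;
    Ok(())
}
```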
Spark Column type object and its implemented traits
Only a few of the functions are covered by unit tests.
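As an illustration of building column expressions, the sketch below combines `col` and `lit` in a `select`/`filter` chain; the `functions` module path, the operator overloads, and the `DataFrame` re-export path are assumptions.

```rust
// Sketch: build column expressions with col/lit and use them in select/filter.
// NOTE: module paths and method signatures are assumptions.
use spark_connect_rs::DataFrame; // assumed re-export path
use spark_connect_rs::functions::{col, lit};

// Select id and id * 2, keeping only rows where id > 5.
fn doubled_large_ids(df: DataFrame) -> DataFrame {
    df.select(vec![col("id"), (col("id") * lit(2)).alias("doubled")])
        .filter(col("id").gt(lit(5)))
}
```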
Spark schema objects have not yet been translated into Rust objects.
Create Spark literal types from these Rust types. E.g. `lit(1_i64)` would be a `LongType()` in the schema.

An array can be created with `lit([1_i16, 2_i16, 3_i16])`, which would result in an `ArrayType(Short)` since all the values of the slice can be translated into a literal type.
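For example (a sketch: the `lit` conversions follow the description above, while the `select`/`alias` calls and the `DataFrame` re-export path are assumptions):

```rust
// Sketch: Rust values converted to Spark literals via lit.
// NOTE: surrounding method names/signatures are assumptions.
use spark_connect_rs::DataFrame; // assumed re-export path
use spark_connect_rs::functions::lit;

// Build a LongType column and an ArrayType(Short) column from Rust literals.
fn with_literal_columns(df: DataFrame) -> DataFrame {
    df.select(vec![
        lit(1_i64).alias("long_col"),               // LongType
        lit([1_i16, 2_i16, 3_i16]).alias("shorts"), // ArrayType(Short)
    ])
}
```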
For ease of use it's recommended to use `Window` to create the `WindowSpec`.
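A sketch of building a `WindowSpec` with the `Window` helper is below; the module paths, the `row_number` function, and the exact builder signatures are assumptions.

```rust
// Sketch: create a WindowSpec with Window and apply it via over.
// NOTE: import paths and signatures are assumptions.
use spark_connect_rs::DataFrame; // assumed re-export path
use spark_connect_rs::expressions::Window; // assumed module path
use spark_connect_rs::functions::{col, row_number};

// Number rows within each `grp` partition, ordered by `id`.
fn with_row_numbers(df: DataFrame) -> DataFrame {
    let window = Window::new()
        .partitionBy(vec![col("grp")])
        .orderBy(vec![col("id")]);

    df.withColumn("row", row_number().over(window))
}
```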