``` shell
build/sbt clean package publishLocal
```

!!! note
You have to change `build.sbt` to build the Spark Integration module. A [fix](https://github.com/unitycatalog/unitycatalog/pull/228) is coming.

### Two-Level V1 Table Identifiers

=== "Spark {{ spark.version }} + Delta Lake {{ delta.version }}"

``` shell
./bin/spark-shell \
--conf spark.jars.ivy=$HOME/.ivy2 \
--packages \
io.delta:delta-spark_2.13:{{ delta.version }},io.unitycatalog:unitycatalog-spark:0.1.0-SNAPSHOT \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.unitycatalog.connectors.spark.UCSingleCatalog \
--conf spark.sql.catalog.spark_catalog.uri=http://localhost:8080
```

``` text
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-preview1
      /_/

Using Scala version 2.13.14 (OpenJDK 64-Bit Server VM, Java 17.0.11)
```

```scala
assert(spark.version == "{{ spark.version }}", "Unity Catalog {{ uc.version }} supports Apache Spark {{ spark.version }} only")
assert(io.delta.VERSION == "{{ delta.version }}", "Unity Catalog {{ uc.version }} supports Delta Lake {{ delta.version }} only")
```

=== "Spark 3.5.1 + Delta Lake 3.2.0"

``` shell
./bin/spark-shell \
--conf spark.jars.ivy=$HOME/.ivy2 \
--packages \
io.delta:delta-spark_2.13:3.2.0,io.unitycatalog:unitycatalog-spark:0.1.0-SNAPSHOT \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.unitycatalog.connectors.spark.UCSingleCatalog \
--conf spark.sql.catalog.spark_catalog.uri=http://localhost:8080
```

``` text
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/

Using Scala version 2.13.8 (OpenJDK 64-Bit Server VM, Java 17.0.11)
```

``` shell
./bin/uc catalog create --name spark_catalog
```

``` shell
./bin/uc schema create --catalog spark_catalog --name default
```

``` shell
./bin/uc table create --full_name spark_catalog.default.uc_demo --columns 'id INT' --storage_location /tmp/uc_demo --format delta
```
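
The created table can be inspected back with the Unity Catalog CLI (an optional check, not part of the original walkthrough; it assumes the `table get` command of this UC build):

``` shell
./bin/uc table get --full_name spark_catalog.default.uc_demo
```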

``` text
scala> spark.catalog.listCatalogs.show(truncate=false)
+-------------+-----------+
|name         |description|
+-------------+-----------+
|spark_catalog|NULL       |
+-------------+-----------+
```

``` text
scala> spark.catalog.listDatabases()
org.apache.spark.sql.AnalysisException: Catalog spark_catalog does not support namespaces.
at org.apache.spark.sql.errors.QueryCompilationErrors$.missingCatalogAbilityError(QueryCompilationErrors.scala:1952)
at org.apache.spark.sql.connector.catalog.CatalogV2Implicits$CatalogHelper.asNamespaceCatalog(CatalogV2Implicits.scala:96)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy.apply(DataSourceV2Strategy.scala:402)
...
```
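
Since `UCSingleCatalog` does not support namespaces yet, listing the schemas of `spark_catalog` has to happen outside Spark, e.g. with the Unity Catalog CLI (a possible workaround, not part of the original walkthrough):

``` shell
./bin/uc schema list --catalog spark_catalog
```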

``` text
scala.NotImplementedError: an implementation is missing
...
```

``` scala
spark.catalog.tableExists("uc_demo")
```

Equivalent to the following:

``` scala
spark.catalog.tableExists("default.uc_demo")
```

``` scala
sql("insert into uc_demo(id) values (1)").show(truncate=false)
```
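
If the `INSERT` went through, the table's Delta transaction log should show up under the storage location used at `uc table create` time (an optional filesystem-level check, since reads from Spark are not available yet):

``` shell
ls /tmp/uc_demo/_delta_log
```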

``` text
scala> sql("select * from uc_demo").show()
org.apache.spark.sql.AnalysisException: Table does not support reads: spark_catalog.default.uc_demo.
at org.apache.spark.sql.errors.QueryCompilationErrors$.tableDoesNotSupportError(QueryCompilationErrors.scala:1392)
at org.apache.spark.sql.errors.QueryCompilationErrors$.tableDoesNotSupportReadsError(QueryCompilationErrors.scala:1396)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits$TableHelper.asReadable(DataSourceV2Implicits.scala:38)
at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$createScanBuilder$1.applyOrElse(V2ScanRelationPushDown.scala:58)
at org.apache.spark.sql.execution.datasources.v2.V2ScanRelationPushDown$$anonfun$createScanBuilder$1.applyOrElse(V2ScanRelationPushDown.scala:56)
...
```

Equivalent to the following:

``` scala
spark.table("uc_demo").show(truncate=false)
```
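
Until reads through `UCSingleCatalog` are supported, one way to peek at the data is to bypass the catalog and load the Delta table from its storage location (a workaround sketch using the `/tmp/uc_demo` location from `uc table create`; whether it passes Delta's session-configuration check in this particular session has not been verified):

``` scala
// Read the Delta table files directly, bypassing spark_catalog
spark.read.format("delta").load("/tmp/uc_demo").show(truncate=false)
```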

### Three-Level V2 Table Identifiers

``` shell
./bin/spark-shell \
--conf spark.jars.ivy=$HOME/.ivy2 \
--packages \
io.delta:delta-spark_2.13:3.2.0,io.unitycatalog:unitycatalog-spark:0.1.0-SNAPSHOT \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.unity=io.unitycatalog.connectors.spark.UCSingleCatalog \
--conf spark.sql.catalog.unity.uri=http://localhost:8080
```

``` text
scala> spark.catalog.tableExists("unity.default.numbers")
val res0: Boolean = true
```

``` text
scala> spark.table("unity.default.numbers").show(truncate=false)
2024-07-16 13:55:36 [main] INFO org.apache.spark.sql.execution.datasources.DataSourceStrategy:60 - Pruning directories with:
org.apache.spark.sql.delta.DeltaAnalysisException: [DELTA_CONFIGURE_SPARK_SESSION_WITH_EXTENSION_AND_CATALOG] This Delta operation requires the SparkSession to be configured with the
DeltaSparkSessionExtension and the DeltaCatalog. Please set the necessary
configurations when creating the SparkSession as shown below.
at org.apache.spark.sql.delta.DeltaErrorsBase.configureSparkSessionWithExtensionAndCatalog(DeltaErrors.scala:1698)
at org.apache.spark.sql.delta.DeltaErrorsBase.configureSparkSessionWithExtensionAndCatalog$(DeltaErrors.scala:1695)
at org.apache.spark.sql.delta.DeltaErrors$.configureSparkSessionWithExtensionAndCatalog(DeltaErrors.scala:3382)
at org.apache.spark.sql.delta.DeltaLog.checkRequiredConfigurations(DeltaLog.scala:613)
... 140 elided
```
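
A spark-shell invocation that follows the hint in the error message, with `spark_catalog` handled by Delta's `DeltaCatalog` while `unity` stays on `UCSingleCatalog` (a sketch only, not verified here; the `DeltaCatalog` class name comes from the error message):

``` shell
./bin/spark-shell \
--conf spark.jars.ivy=$HOME/.ivy2 \
--packages \
io.delta:delta-spark_2.13:3.2.0,io.unitycatalog:unitycatalog-spark:0.1.0-SNAPSHOT \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.unity=io.unitycatalog.connectors.spark.UCSingleCatalog \
--conf spark.sql.catalog.unity.uri=http://localhost:8080
```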
