
Create a databricks-iris starter that enables packaged deployment on Databricks #129

Merged
155 changes: 155 additions & 0 deletions databricks-iris/.gitignore
@@ -0,0 +1,155 @@
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep
.telemetry

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**
logs/**

# except their sub-folders
!data/**/
!logs/**/

# also keep all .gitkeep files
!.gitkeep

# also keep the example dataset
!data/01_raw/iris.csv


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
59 changes: 59 additions & 0 deletions databricks-iris/README.md
@@ -0,0 +1,59 @@
# The `databricks-iris` Kedro starter

## Introduction

The code in this repository demonstrates best practice when working with Kedro and PySpark on Databricks. It contains a Kedro starter template with some initial configuration and an example pipeline, and accompanies the documentation on [developing and deploying Kedro projects on Databricks](https://docs.kedro.org/en/stable/integrations/index.html#databricks-integration).

This repository is a fork of the `pyspark-iris` starter that has been modified to run natively on Databricks.

## Getting started

The starter template can be used to start a new project using the [`starter` option](https://kedro.readthedocs.io/en/stable/get_started/starters.html) in `kedro new`:

```bash
kedro new --starter=databricks-iris
```

## Features

### Configuration for Databricks in `conf/base`

This starter has a base configuration that allows it to run natively on Databricks. The directories used to store data and logs must still be created manually in the user's Databricks DBFS instance:

```bash
/dbfs/FileStore/iris-databricks/data
/dbfs/FileStore/iris-databricks/logs
```
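
For example, these directories could be created from a Databricks notebook cell. This is a minimal sketch, assuming the `dbutils` helper that the Databricks runtime injects into notebooks and jobs:

```python
# Run inside a Databricks notebook or job, where `dbutils` is provided
# by the runtime; mkdirs creates the directories if they do not exist.
dbutils.fs.mkdirs("dbfs:/FileStore/iris-databricks/data")
dbutils.fs.mkdirs("dbfs:/FileStore/iris-databricks/logs")
```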

See the documentation on deploying a packaged Kedro project to Databricks for more information.

### Databricks entry point

The starter contains a script (`databricks_run.py`) that serves as an entry point, enabling a project packaged with this starter to run on Databricks. See the documentation on deploying a packaged Kedro project to Databricks for more information.
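
As a rough sketch (the exact contents of the script may differ), such an entry point parses the run environment, configuration source, and package name from the command line, then runs the packaged project inside a `KedroSession`:

```python
import argparse

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default=None, help="Kedro environment to run in")
    parser.add_argument("--conf-source", default=None, help="Path to the project configuration")
    parser.add_argument("--package-name", required=True, help="Name of the packaged Kedro project")
    args = parser.parse_args()

    # Register the installed package as the active Kedro project,
    # then run its default pipeline inside a session.
    configure_project(args.package_name)
    with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
        session.run()


if __name__ == "__main__":
    main()
```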

### Single configuration in `/conf/base/spark.yml`

While Spark allows you to specify many different [configuration options](https://spark.apache.org/docs/latest/configuration.html), this starter uses `/conf/base/spark.yml` as a single configuration location.

### `SparkSession` initialisation

This Kedro starter contains the initialisation code for `SparkSession` in the `ProjectContext` and takes its configuration from `/conf/base/spark.yml`. Modify this code if you want to further customise your `SparkSession`, e.g. to use [YARN](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html).
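
A minimal sketch of that pattern, assuming the key-value pairs from `spark.yml` have already been loaded into a dictionary by the config loader:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession


def init_spark_session(parameters: dict, app_name: str) -> SparkSession:
    # Translate the options loaded from conf/base/spark.yml into a SparkConf
    spark_conf = SparkConf().setAll(parameters.items())
    # Reuse an existing session if one is already running (as on Databricks),
    # otherwise create one with the given configuration
    return (
        SparkSession.builder.appName(app_name)
        .config(conf=spark_conf)
        .getOrCreate()
    )
```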

### Configures `MemoryDataSet` to work with Spark objects

Out of the box, Kedro's `MemoryDataSet` works with Spark's `DataFrame`. However, it doesn't work with other Spark objects, such as machine learning models, unless you add further configuration. This Kedro starter demonstrates how to configure `MemoryDataSet` for Spark's machine learning models in `catalog.yml`.

> Note: The use of `MemoryDataSet` is encouraged to propagate Spark's `DataFrame` between nodes in the pipeline. A best practice is to delay triggering Spark actions for as long as possible to take advantage of Spark's lazy evaluation.
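
The relevant setting is typically `copy_mode: assign`, which stores the object by reference rather than deep-copying it (Spark ML models do not support deep copies). As a sketch, the equivalent configuration through the Python API looks like this:

```python
from kedro.io import MemoryDataSet

# copy_mode="assign" keeps a reference to the stored object instead of
# deep-copying it, which Spark ML models do not support.
classifier_dataset = MemoryDataSet(copy_mode="assign")
```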

### An example machine learning pipeline that uses only `PySpark` and `Kedro`

![Iris Pipeline Visualisation](./images/iris_pipeline.png)

This Kedro starter uses the simple and familiar [Iris dataset](https://www.kaggle.com/uciml/iris). It contains the code for an example machine learning pipeline that runs a 1-nearest neighbour classifier to classify an iris.
[Transcoding](https://kedro.readthedocs.io/en/stable/data/data_catalog.html#transcoding-datasets) is used to convert the Spark DataFrames into pandas DataFrames after splitting the data into training and testing sets.

The pipeline includes:

* A node to split the data into training and testing datasets using a configurable ratio
* A node to run a simple 1-nearest neighbour classifier and make predictions (see the sketch after this list)
* A node to report the accuracy of the predictions performed by the model
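
For illustration, here is a minimal sketch of how the 1-nearest-neighbour prediction node might look once the split data has been transcoded to pandas. Names and signature are illustrative, not necessarily the starter's exact code:

```python
import numpy as np
import pandas as pd


def make_predictions(
    X_train: pd.DataFrame, X_test: pd.DataFrame, y_train: pd.Series
) -> pd.Series:
    # Squared Euclidean distance between every test row and every training row,
    # computed via broadcasting: shape (n_train, n_test)
    distances = np.sum(
        (X_train.to_numpy()[:, None, :] - X_test.to_numpy()[None, :, :]) ** 2,
        axis=-1,
    )
    # Index of the single nearest training example for each test example
    nearest = distances.argmin(axis=0)
    # Predict the label of that nearest neighbour
    return pd.Series(y_train.to_numpy()[nearest], index=X_test.index)
```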
6 changes: 6 additions & 0 deletions databricks-iris/cookiecutter.json
@@ -0,0 +1,6 @@
{
"project_name": "Iris",
"repo_name": "{{ cookiecutter.project_name.strip().replace(' ', '-').replace('_', '-').lower() }}",
"python_package": "{{ cookiecutter.project_name.strip().replace(' ', '_').replace('-', '_').lower() }}",
"kedro_version": "{{ cookiecutter.kedro_version }}"
}
18 changes: 18 additions & 0 deletions databricks-iris/credentials.yml
@@ -0,0 +1,18 @@
# Here you can define credentials for different datasets and environments.
#
# THIS FILE MUST BE PLACED IN `conf/local`. DO NOT PUSH THIS FILE TO GitHub.
#
# Example:
#
# dev_s3:
# client_kwargs:
# aws_access_key_id: token
# aws_secret_access_key: key
#
# prod_s3:
# aws_access_key_id: token
# aws_secret_access_key: key
#
# dev_sql:
# username: admin
# password: admin
Binary file added databricks-iris/images/iris_pipeline.png
9 changes: 9 additions & 0 deletions databricks-iris/prompts.yml
@@ -0,0 +1,9 @@
project_name:
title: "Project Name"
text: |
Please enter a human readable name for your new project.
Spaces, hyphens, and underscores are allowed.
regex_validator: "^[\\w -]{2,}$"
error_message: |
It must contain only alphanumeric symbols, spaces, underscores and hyphens and
be at least 2 characters long.