diff --git a/docs/source/deployment/databricks/databricks_deployment_workflow.md b/docs/source/deployment/databricks/databricks_deployment_workflow.md index 431c8b318c..d020a72ae1 100644 --- a/docs/source/deployment/databricks/databricks_deployment_workflow.md +++ b/docs/source/deployment/databricks/databricks_deployment_workflow.md @@ -1,32 +1,13 @@ -# Use a Databricks job to deploy a Kedro project +# Use Databricks Asset Bundles and jobs to deploy a Kedro project Databricks jobs are a way to execute code on Databricks clusters, allowing you to run data processing tasks, ETL jobs, or machine learning workflows. In this guide, we explain how to package and run a Kedro project as a job on Databricks. -## What are the advantages of packaging a Kedro project to run on Databricks? -Packaging your Kedro project and running it on Databricks enables you to execute your pipeline without a notebook. This approach is particularly well-suited for production, as it provides a structured and reproducible way to run your code. - -Here are some typical use cases for running a packaged Kedro project as a Databricks job: - -- **Data engineering pipeline**: the output of your Kedro project is a file or set of files containing cleaned and processed data. -- **Machine learning with MLflow**: your Kedro project runs an ML model; metrics about your experiments are tracked in MLflow. -- **Automated and scheduled runs**: your Kedro project should be [run on Databricks automatically](https://docs.databricks.com/workflows/jobs/schedule-jobs.html#add-a-job-schedule). -- **CI/CD integration**: you have a CI/CD pipeline that produces a packaged Kedro project. - -Running your packaged project as a Databricks job is very different from running it from a Databricks notebook. The Databricks job cluster has to be provisioned and started for each run, which is significantly slower than running it as a notebook on a cluster that has already been started. In addition, there is no way to change your project's code once it has been packaged. Instead, you must change your code, create a new package, and then upload it to Databricks again. - -For those reasons, the packaging approach is unsuitable for development projects where rapid iteration is necessary. For guidance on developing a Kedro project for Databricks in a rapid build-test loop, see the [development workflow guide](./databricks_ide_development_workflow.md). - -## What this page covers - -- [Set up your Kedro project for deployment on Databricks](#set-up-your-project-for-deployment-to-databricks). -- [Run your project as a job using the Databricks workspace UI](#deploy-and-run-your-kedro-project-using-the-workspace-ui). -- [Resources for automating your Kedro deployments to Databricks](#resources-for-automatically-deploying-to-databricks). ## Prerequisites - An active [Databricks deployment](https://docs.databricks.com/getting-started/index.html). -- [`conda` installed](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) on your local machine in order to create a virtual environment with a specific version of Python (>= 3.7 is required). If you have Python >= 3.7 installed, you can use other software to create a virtual environment. +- [`conda` installed](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) on your local machine to create a virtual environment with a specific version of Python (>= 3.7 is required). If you have Python >= 3.7 installed, you can use other software to create a virtual environment. 
## Set up your project for deployment to Databricks @@ -37,8 +18,8 @@ The sequence of steps described in this section is as follows:
3. [Authenticate the Databricks CLI](#authenticate-the-databricks-cli)
4. [Create a new Kedro project](#create-a-new-kedro-project)
5. [Create an entry point for Databricks](#create-an-entry-point-for-databricks)
-6. [Package your project](#package-your-project)
-7. [Upload project data and configuration to DBFS](#upload-project-data-and-configuration-to-dbfs)
+6. [Create Asset bundles files](#create-asset-bundles-files)
+7. [Upload project data to DBFS](#upload-project-data-to-dbfs)

### Note your Databricks username and host @@ -59,20 +40,23 @@ The following commands will create a new `conda` environment, activate it, and t In your local development environment, create a virtual environment for this tutorial using `conda`:

```bash
-conda create --name iris-databricks python=3.10
+conda create --name databricks-iris python=3.10 -y
```

Once it is created, activate it:

```bash
-conda activate iris-databricks
+conda activate databricks-iris
```

-With your `conda` environment activated, install Kedro and the Databricks CLI:
+With your `conda` environment activated, install Kedro:

```bash
-pip install kedro databricks-cli
+pip install kedro
```

+**This tutorial was created with Kedro 0.19.6.**
+
+Next, install the Databricks CLI by following the [installation instructions for your operating system](https://docs.databricks.com/en/dev-tools/cli/install.html).

### Authenticate the Databricks CLI

@@ -90,14 +74,14 @@ pip install kedro databricks-cli Create a Kedro project by using the following command in your local environment:

```bash
-kedro new --starter=databricks-iris
+kedro new --starter=databricks-iris --name iris --telemetry no
```

-This command creates a new Kedro project using the `databricks-iris` starter template. Name your new project `iris-databricks` for consistency with the rest of this guide.
-
- ```{note}
- If you are not using the `databricks-iris` starter to create a Kedro project, **and** you are working with a version of Kedro **earlier than 0.19.0**, then you should [disable file-based logging](https://docs.kedro.org/en/0.18.14/logging/logging.html#disable-file-based-logging) to prevent Kedro from attempting to write to the read-only file system.
- ```
+```bash
+Congratulations!
+Your project 'iris' has been created in the directory
+```

### Create an entry point for Databricks

The default entry point of a Kedro project uses a Click command line interface (

@@ -105,7 +89,7 @@ The `databricks-iris` starter has this entry point pre-built, so there is no extra work to do here, but generally you must **create an entry point manually for your own projects using the following steps**:

-1. **Create an entry point script**: Create a new file in `/src/iris_databricks` named `databricks_run.py`. Copy the following code to this file:
+1. **Create an entry point script**: Create a new file in `/src/iris` named `databricks_run.py`. Copy the following code to this file:

```python
import argparse

@@ -157,33 +141,25 @@ Because you are no longer using the default entry-point for Kedro, you will not - `--conf-source`: specifies the location of the `conf/` directory to use with your Kedro project.
```

-### Package your project
-
-To package your Kedro project for deployment on Databricks, you must create a Wheel (`.whl`) file, which is a binary distribution of your project.
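+If you want to sanity-check the entry point before Databricks builds and runs the wheel, you can invoke the module directly from the project root. This is an optional, illustrative sketch only: it assumes your package is named `iris`, that it is installed in (or importable from) your active environment, and that your local machine can run the pipeline (the starter needs PySpark):
+
+```bash
+# Optional local smoke test of the custom entry point (illustrative only).
+python -m iris.databricks_run \
+    --package-name iris \
+    --env base \
+    --conf-source conf
+```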
In the root directory of your Kedro project, run the following command:
-```bash
-kedro package
-```
-This command generates a `.whl` file in the `dist` directory within your project's root directory.
-### Upload project data and configuration to DBFS
+### Upload project data to DBFS

```{note}
A Kedro project's configuration and data do not get included when it is packaged. They must be stored somewhere accessible to allow your packaged project to run.
```

-Your packaged Kedro project needs access to data and configuration in order to run. Therefore, you will need to upload your project's data and configuration to a location accessible to Databricks. In this guide, we will store the data on the Databricks File System (DBFS).
+Your packaged Kedro project needs access to data and configuration to run. Therefore, you will need to upload your project's data to a location accessible to Databricks; the configuration is uploaded to your workspace automatically when you deploy the Asset Bundle later in this guide. We will store the data on the Databricks File System (DBFS).

The `databricks-iris` starter contains a [catalog](../../data/data_catalog.md) that is set up to access data stored in DBFS (`/conf/`). You will point your project to use configuration stored on DBFS using the `--conf-source` option when you create your job on Databricks.

There are several ways to upload data to DBFS: you can use the [DBFS API](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/dbfs), the [`dbutils` module](https://docs.databricks.com/dev-tools/databricks-utils.html) in a Databricks notebook or the [Databricks CLI](https://docs.databricks.com/archive/dev-tools/cli/dbfs-cli.html). In this guide, it is recommended to use the Databricks CLI because of the convenience it offers.

-- **Upload your project's data and config**: at the command line in your local environment, use the following Databricks CLI commands to upload your project's locally stored data and configuration to DBFS:
+- **Upload your project's data**: at the command line in your local environment, use the following Databricks CLI command to upload your project's locally stored data to DBFS:

```bash
-databricks fs cp --recursive /data/ dbfs:/FileStore/iris-databricks/data
-databricks fs cp --recursive /conf/ dbfs:/FileStore/iris-databricks/conf
+databricks fs cp --recursive data/ dbfs:/FileStore/iris-databricks/data
```

The `--recursive` flag ensures that the entire folder and its contents are uploaded. You can list the contents of the destination folder in DBFS using the following command: @@ -205,125 +181,141 @@ You should see the contents of the project's `data/` directory printed to your t 08_reporting
```

-## Deploy and run your Kedro project using the workspace UI
-
-To run your packaged project on Databricks, login to your Databricks account and perform the following steps in the workspace:
-
-1. [Create a new job](#create-a-new-job)
-2. [Create a new job cluster specific to your job](#create-a-new-job-cluster-specific-to-your-job)
-3. [Configure the job](#configure-the-job)
-4. 
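+For reference, this is roughly what a catalog entry reading from the uploaded DBFS location looks like. The `databricks-iris` starter already ships a catalog configured for DBFS, so the snippet below is illustrative only; the dataset name, load arguments and the exact dataset type name are assumptions and may differ depending on your `kedro-datasets` version:
+
+```yaml
+# Illustrative sketch of a conf/base/catalog.yml entry pointing at the DBFS
+# path used above (the starter's own catalog already does something similar).
+example_iris_data:
+  type: spark.SparkDataset
+  filepath: /dbfs/FileStore/iris-databricks/data/01_raw/iris.csv
+  file_format: csv
+  load_args:
+    header: true
+    inferSchema: true
+```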
[Run the job](#run-the-job) - -### Create a new job - -In the Databricks workspace, navigate to the `Workflows` tab and click `Create Job` **or** click the `New` button, then `Job`: - -![Create Databricks job](../../meta/images/databricks_create_new_job.png) - -### Create a new job cluster specific to your job - -Create a dedicated [job cluster](https://docs.databricks.com/clusters/index.html) to run your job by clicking on the drop-down menu in the `Cluster` field and then clicking `Add new job cluster`: - -**Do not use the default `Job_cluster`, it has not been configured to run this job.** - -![Create Databricks job cluster](../../meta/images/databricks_create_job_cluster.png) - -Once you click `Add new job cluster`, the configuration page for this cluster appears. - -Configure the job cluster with the following settings: - -- In the `name` field enter `kedro_deployment_demo`. -- Select the radio button for `Single node`. -- Select the runtime `13.3 LTS` in the `Databricks runtime version` field. -- In the `Advanced options` section, under the `Spark` tab, locate the `Environment variables` field. Add the following line: -`KEDRO_LOGGING_CONFIG="/dbfs/FileStore/iris-databricks/conf/logging.yml"` -Here, ensure you specify the correct path to your custom logging configuration. This step is crucial because the default Kedro logging configuration incorporates the rich library, which is incompatible with Databricks jobs. In the `databricks-iris` Kedro starter, the `rich` handler in `logging.yml` is altered to a `console` handler for compatibility. For additional information about logging configurations, refer to the [Kedro Logging Manual](https://docs.kedro.org/en/stable/logging/index.html). -- Leave all other settings with their default values in place. - -The final configuration for the job cluster should look the same as the following: - -![Configure Databricks job cluster](../../meta/images/databricks_configure_job_cluster.png) - -### Configure the job - -Configure the job with the following settings: - -- Enter `iris-databricks` in the `Name` field. -- In the dropdown menu for the `Type` field, select `Python wheel`. -- In the `Package name` field, enter `iris_databricks`. This is the name of your package as defined in your project's `src/setup.py` file. -- In the `Entry Point` field, enter `databricks_run`. This is the name of the [entry point](#create-an-entry-point-for-databricks) to run your package from. -- Ensure the job cluster you created in step two is selected in the dropdown menu for the `Cluster` field. -- In the `Dependent libraries` field, click `Add` and upload [your project's `.whl` file](#package-your-project), making sure that the radio buttons for `Upload` and `Python Whl` are selected for the `Library Source` and `Library Type` fields. -- In the `Parameters` field, enter the following list of runtime options: +### Create asset bundles files + + + +#### 1. Create the bundle + +Create a folder called `assets` in the root directory containing the file `assets/batch-inference-workflow-asset.yml`, this name is just an example: +```yaml +common_permissions: &permissions + permissions: + - level: CAN_VIEW + group_name: users + +experimental: + python_wheel_wrapper: true + +artifacts: + default: + type: whl + build: kedro package + path: .. 
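+  # Note (added for clarity): the `artifacts` block above describes how the bundle
+  # builds the project wheel -- `build` is the command the bundle runs (Kedro's own
+  # `kedro package`) and `path: ..` points from this `assets/` folder back to the
+  # project root.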
+resources:
+  jobs:
+    iris-databricks-job: # used to run the job from the CLI
+      name: "iris-databricks" # name of the job in Databricks
+      tasks:
+        - task_key: kedro_run
+          existing_cluster_id: ${var.existing_cluster_id}
+          python_wheel_task:
+            package_name: iris # Kedro package name
+            entry_point: databricks_run
+            named_parameters:
+              --conf-source: /Workspace${workspace.file_path}/conf
+              --package-name: iris
+              --env: ${var.environment}
+          libraries:
+            - whl: ../dist/*.whl
+      schedule:
+        quartz_cron_expression: "0 0 11 * * ?" # daily at 11am
+        timezone_id: UTC
+      <<: *permissions
+```
+This file provides a template for a job that runs Kedro as a package. A single asset file can hold multiple job definitions, for instance one per pipeline.
+
+In this example an existing cluster ID is used, which allows fast iteration during development. For a production deployment, however, you should use a job cluster. To enable ephemeral clusters, modify the YAML file (`assets/batch-inference-workflow-asset.yml`) to use a `job_cluster_key` instead of an `existing_cluster_id`, as in the following excerpt of the `resources` section:
+
+```yaml
+jobs:
+  iris-databricks-job: # used to run the job from the CLI
+    name: "iris-databricks" # name of the job in Databricks
+    job_clusters:
+      - job_cluster_key: job_cluster
+        new_cluster:
+          # Azure node type
+          node_type_id: Standard_DS3_v2
+          spark_version: 14.3.x-scala2.12
+    tasks:
+      - task_key: kedro_run
+        job_cluster_key: job_cluster
+        python_wheel_task:
+          package_name: iris # Kedro package name
+          entry_point: databricks_run
+          named_parameters:
+            --conf-source: /Workspace${workspace.file_path}/conf
+            --package-name: iris
+            --env: ${var.environment}
+        libraries:
+          - whl: ../dist/*.whl
+    schedule:
+      quartz_cron_expression: "0 0 11 * * ?" # daily at 11am
+      timezone_id: UTC
+    <<: *permissions
+```
+#### 2. Create the bundle’s configuration file
+
+Create the file `databricks.yml` in the root directory:
+
+```yaml
+bundle:
+  name: databricks_iris
+
+variables:
+  existing_cluster_id:
+    description: The ID of an existing cluster for development
+    default:
+  environment:
+    description: The environment to run the job in
+    default: dev
+
+include:
+  - assets/*.yml
+
+# Deployment Target specific values for workspace
+targets:
+  dev:
+    default: true
+    mode: development
+    workspace:
+      host:
+    run_as:
+      user_name:
+    variables:
+      environment: base
+
+  prod:
+    mode: production
+    workspace:
+      host:
+    run_as:
+      # This runs as your_user@mail.com in production. We could also use a service principal here
+      # using service_principal_name (see https://docs.databricks.com/dev-tools/bundles/permissions.html).
+      user_name:
+```
+### Deploy and run the job
+
+You can deploy the bundle to your remote workspace with the following command:
```bash
-["--conf-source", "/dbfs/FileStore/iris-databricks/conf", "--package-name", "iris_databricks"]
+databricks bundle deploy -t dev
```

-The final configuration for your job should look the same as the following:
-
-![Configure Databricks job](../../meta/images/databricks_configure_new_job.png)
-
-Click `Create` and then `Confirm and create` in the following pop-up asking you to name the job.
-
-### Run the job
-
-Click `Run now` in the top-right corner of your new job's page to start a run of the job. The status of your run can be viewed in the `Runs` tab of your job's page. Navigate to the `Runs` tab and track the progress of your run:
-
-![Databricks job status](../../meta/images/databricks_job_status.png)
-
-This page also shows an overview of all past runs of your job. As you only just started your job run, it's status will be `Pending`. 
A status of `Pending` indicates that the cluster is being started and your code is waiting to run.
-
-The following things happen when you run your job:
-
-- The job cluster is provisioned and started (job status: `Pending`).
-- The packaged Kedro project and all its dependencies are installed (job status: `Pending`)
-- The packaged Kedro project is run from the specified `databricks_run` entry point (job status: `In Progress`).
-- The packaged code finishes executing and the job cluster is stopped (job status: `Succeeded`).
-
-A run will take roughly six to seven minutes.
-
-When the status of your run is `Succeeded`, your job has successfully finished executing. You can view the logging output created by the run by clicking on the link with the text `Go to the latest successful run` to take you to the `main run` view. You should see logs similar to the following:
+This will:
+1. Build the wheel of your Kedro project: `iris-0.1-py3-none-any.whl`
+2. Create a notebook containing the code used by the job to execute the Kedro project. You can inspect it locally at `.databricks/bundle/dev/.internal/notebook_iris-databricks-job_kedro_run.py`
+3. Upload the wheel to `/Workspace/Users//.bundle/databricks_iris/dev/artifacts/.internal/`
+4. Upload all the files in the root directory to `/Workspace/Users//.bundle/databricks_iris/dev/files`, including `conf`
+5. Create the job using the notebook from step 2.
+
+You can now run the deployed job with:
```bash
-...
-2023-06-06 12:56:14,399 - iris_databricks.nodes - INFO - Model has an accuracy of 0.972 on test data.
-2023-06-06 12:56:14,403 - kedro.runner.sequential_runner - INFO - Completed 3 out of 3 tasks
-2023-06-06 12:56:14,404 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
+databricks bundle run -t dev iris-databricks-job
```
+You can destroy all the resources created by the bundle with:
-
-By following these steps, you packaged your Kedro project and manually ran it as a job on Databricks using the workspace UI.
-
-## Resources for automatically deploying to Databricks
-
-Up to this point, this page has described a manual workflow for deploying and running a project on Databricks. The process can be automated in two ways:
-
-- [Use the Databricks API](#how-to-use-the-databricks-api-to-automatically-deploy-a-kedro-project).
-- [Use the Databricks CLI](#how-to-use-the-databricks-cli-to-automatically-deploy-a-kedro-project).
-
-Both of these methods enable you to store information about your job declaratively in the same version control system as the rest of your project. For each method, the information stored declaratively is the same as what is entered manually in the [above section on creating and running a job in Databricks](#deploy-and-run-your-kedro-project-using-the-workspace-ui).
-
-These methods can be integrated into a CI/CD pipeline to automatically deploy a packaged Kedro project to Databricks as a job.
-
-### How to use the Databricks API to automatically deploy a Kedro project
-
-The Databricks API enables you to programmatically interact with Databricks services, including job creation and execution. You can use the Jobs API to automate the deployment of your Kedro project to Databricks. The following steps outline how to use the Databricks API to do this:
-
-1. [Set up your Kedro project for deployment on Databricks](#set-up-your-project-for-deployment-to-databricks)
-2. Create a JSON file containing your job's configuration.
-3. 
Use the Jobs API's [`/create` endpoint](https://docs.databricks.com/workflows/jobs/jobs-api-updates.html#create) to create a new job.
-4. Use the Jobs API's [`/runs/submit` endpoint](https://docs.databricks.com/workflows/jobs/jobs-api-updates.html#runs-submit) to run your newly created job.
-
-### How to use the Databricks CLI to automatically deploy a Kedro project
-
-The Databricks Command Line Interface (CLI) is another way to automate deployment of your Kedro project. The following steps outline how to use the Databricks CLI to automate the deployment of a Kedro project:
-
-1. [Set up your Kedro project for deployment on Databricks.](#set-up-your-project-for-deployment-to-databricks)
-2. Install the Databricks CLI and authenticate it with your workspace.
-3. Create a JSON file containing your job's configuration.
-4. Use the [`jobs create` command](https://docs.databricks.com/archive/dev-tools/cli/jobs-cli.html#create-a-job) to create a new job.
-5. Use the [`jobs run-now` command](https://docs.databricks.com/archive/dev-tools/cli/jobs-cli.html#run-a-job) to run your newly created job.
-
-## Summary
-
-This guide demonstrated how to deploy a packaged Kedro project on Databricks. This is a structured and reproducible way to run your Kedro projects on Databricks that can be automated and integrated into CI/CD pipelines.
+```bash
+databricks bundle destroy --auto-approve -t dev --force-lock
+```
diff --git a/docs/source/deployment/databricks/databricks_ide_development_workflow.md b/docs/source/deployment/databricks/databricks_ide_development_workflow.md index f85799272d..2e12767592 100644 --- a/docs/source/deployment/databricks/databricks_ide_development_workflow.md +++ b/docs/source/deployment/databricks/databricks_ide_development_workflow.md @@ -1,6 +1,6 @@
-# Use an IDE, dbx and Databricks Repos to develop a Kedro project
+# Use Databricks Connect to develop a Kedro project

-This guide demonstrates a workflow for developing Kedro projects on Databricks using your local environment for development, then using dbx and Databricks Repos to sync code for testing on Databricks.
+This guide demonstrates a workflow for developing Kedro projects on Databricks, using your local environment for development and Databricks Connect to test your code on Databricks.

By working in your local environment, you can take advantage of features within an IDE that are not available on Databricks notebooks:

@@ -12,26 +12,21 @@ To set up these features, look for instructions specific to your IDE (for instan If you prefer to develop a projects in notebooks rather than an in an IDE, you should follow our guide on [how to develop a Kedro project within a Databricks workspace](./databricks_notebooks_development_workflow.md) instead.

-``` {note}
-[Databricks now recommends](https://docs.databricks.com/en/archive/dev-tools/dbx/index.html) that you use now use Databricks asset bundles instead of dbx. This Kedro deployment documentation has not yet been updated but you may wish to consult [What are Databricks Asset Bundles?](https://docs.databricks.com/en/dev-tools/bundles/index.html) and [Migrate from dbx to bundles](https://docs.databricks.com/en/archive/dev-tools/dbx/dbx-migrate.html) for further information. 
-```
-
## What this page covers

The main steps in this tutorial are as follows:

-- [Create a virtual environment and install and configure dbx.](#install-kedro-and-dbx-in-a-new-virtual-environment)
+- [Install Kedro and the Databricks CLI in a new virtual environment.](#install-kedro-and-databricks-cli-in-a-new-virtual-environment)
- [Create a new Kedro project using the `databricks-iris` starter.](#create-a-new-kedro-project)
-- [Create a Repo on Databricks and sync your project using dbx.](#create-a-repo-on-databricks)
- [Upload project data to a location accessible by Kedro when run on Databricks (such as DBFS).](#upload-project-data-to-dbfs)
-- [Create a Databricks notebook to run your project.](#create-a-new-databricks-notebook)
+- [Modify the Spark hook to allow remote execution.](#modify-spark-hook)
- [Modify your project in your local environment and test the changes on Databricks in an iterative loop.](#modify-your-project-and-test-the-changes)

## Prerequisites

- An active [Databricks deployment](https://docs.databricks.com/getting-started/index.html).
-- A [Databricks cluster](https://docs.databricks.com/clusters/configure.html) configured with a recent version (>= 11.3 is recommended) of the Databricks runtime.
-- [Conda installed](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) on your local machine in order to create a virtual environment with a specific version of Python (>= 3.8 is required). If you have Python >= 3.8 installed, you can use other software to create a virtual environment.
+- A [Databricks cluster](https://docs.databricks.com/clusters/configure.html) configured with a recent version (>= 14 is recommended) of the Databricks runtime.
+- [Conda installed](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) on your local machine in order to create a virtual environment with a specific version of Python (>= 3.10 is required). If you have Python >= 3.10 installed, you can use other software to create a virtual environment.

## Set up your project

@@ -47,26 +42,29 @@ Find your Databricks username in the top right of the workspace UI and the host Your databricks host must include the protocol (`https://`).
```

-### Install Kedro and dbx in a new virtual environment
+### Install Kedro and Databricks CLI in a new virtual environment

In your local development environment, create a virtual environment for this tutorial using Conda:

```bash
-conda create --name iris-databricks python=3.10
+conda create --name databricks-iris python=3.10
```

Once it is created, activate it:

```bash
-conda activate iris-databricks
+conda activate databricks-iris
```

-With your Conda environment activated, install Kedro and dbx:
+With your `conda` environment activated, install Kedro and Databricks Connect:

```bash
-pip install kedro dbx --upgrade
+pip install kedro
+pip install --upgrade "databricks-connect==14.3.*" # Or X.Y.* to match your cluster version.
```

+Next, install the Databricks CLI by following the [installation instructions for your operating system](https://docs.databricks.com/en/dev-tools/cli/install.html).
+
### Authenticate the Databricks CLI

**Now, you must authenticate the Databricks CLI with your Databricks instance.**

[Refer to the Databricks documentation](https://docs.databricks.com/en/dev-tools/cli/authentication.html) for a complete guide on how to authenticate your CLI. The key steps are:

1. Create a personal access token for your user on your Databricks instance.
-2. Run `databricks configure --token`.
-3. 
Enter your token and Databricks host when prompted.
+2. Run `databricks configure --token --configure-cluster`.
+3. Enter your token, your Databricks host, and the ID of the cluster you want to run your code on when prompted.
4. Run `databricks fs ls dbfs:/` at the command line to verify your authentication.
-
-```{note}
-dbx is an extension of the Databricks CLI, a command-line program for interacting with Databricks without using its UI. You will use dbx to sync your project's code with Databricks. While Git can sync code to Databricks Repos, dbx is preferred for development as it avoids creating new commits for every change, even if those changes do not work.
-```
+5. Run `databricks-connect test` to verify the connection to the remote cluster.

### Create a new Kedro project

Create a Kedro project with the `databricks-iris` starter using the following command:

```bash
kedro new --starter=databricks-iris
```

-Name your new project `iris-databricks` for consistency with the rest of this guide. This command creates a new Kedro project using the `databricks-iris` starter template.
-
- ```{note}
-If you are not using the `databricks-iris` starter to create a Kedro project, **and** you are working with a version of Kedro **earlier than 0.19.0**, then you should [disable file-based logging](https://docs.kedro.org/en/0.18.14/logging/logging.html#disable-file-based-logging) to prevent Kedro from attempting to write to the read-only file system.
- ```
-
-### Create a Repo on Databricks
-
-Create a new Repo on Databricks by navigating to `New` tab in the Databricks workspace UI side bar and clicking `Repo` in the drop-down menu that appears.
-
-In this guide, you will not sync your project with a remote Git provider, so uncheck `Create repo by cloning a Git repository` and enter `iris-databricks` as the name of your new repository:
-
-![Create a new Repo on Databricks](../../meta/images/databricks_repo_creation.png)
-
-### Sync code with your Databricks Repo using dbx
-
-The next step is to use dbx to sync your project to your Repo.
-
-**Open a new terminal instance**, activate your conda environment, and navigate to your project directory and start `dbx sync`:
-
-```bash
-conda activate iris-databricks
-cd 
-dbx sync repo --dest-repo iris-databricks --source .
-```
-
-This command will sync your local directory (`--source .`) with your Repo (`--dest-repo iris-databricks`) on Databricks. When started for the first time, `dbx sync` will write output similar to the following to your terminal:
-
-```bash
-...
-[dbx][2023-04-13 21:59:48.148] Putting /Repos//iris-databricks/src/tests/__init__.py
-[dbx][2023-04-13 21:59:48.168] Putting /Repos//iris-databricks/src/tests/test_pipeline.py
-[dbx][2023-04-13 21:59:48.189] Putting /Repos//iris-databricks/src/tests/test_run.py
-[dbx][2023-04-13 21:59:48.928] Done. Watching for changes...
-```
-
-**Keep the second terminal (running dbx sync) alive during development; closing it stops syncing new changes.**
-
-`dbx sync` will automatically sync any further changes made in your local project directory with your Databricks Repo while it runs.
-
-```{note}
-Syncing with dbx is one-way only, meaning changes you make using the Databricks Repos code editor will not be reflected in your local environment. Only make changes to your project in your local environment while syncing, not in the editor that Databricks Repos provides.
-```
-
-### Create a `conf/local` directory in your Databricks Repo
-
-Kedro requires your project to have a `conf/local` directory to exist to successfully run, even if it is empty. 
`dbx sync` does not copy the contents of your local `conf/local` directory to your Databricks Repo, so you must create it manually. +Name your new project `databricks-iris` for consistency with the rest of this guide. This command creates a new Kedro project using the `databricks-iris` starter template. -Open the Databricks workspace UI and using the panel on the left, navigate to `Repos -> -> iris-databricks -> conf`, right click and select `Create -> Folder` as in the image below: -![Create a conf folder in Databricks Repo](../../meta/images/databricks_conf_folder_creation.png) - -Name the new folder `local`. In this guide, we have no local credentials to store and so we will leave the newly created folder empty. Your `conf/local` and `local` directories should now look like the following: - -![Final conf folder](../../meta/images/final_conf_folder.png) ### Upload project data to DBFS @@ -177,96 +120,67 @@ You should see the contents of the project's `data/` directory printed to your t 08_reporting ``` -### Create a new Databricks notebook - -Now that your project is available on Databricks, you can run it on a cluster using a notebook. - -To run the Python code from your Databricks Repo, [create a new Python notebook](https://docs.databricks.com/notebooks/notebooks-manage.html#create-a-notebook) in your workspace. Name it `iris-databricks` for traceability and attach it to your cluster: - -![Create a new notebook on Databricks](../../meta/images/databricks_notebook_creation.png) - -### Run your project - -Open your newly-created notebook and create **four new cells** inside it. You will fill these cells with code that runs your project. When copying the following code snippets, remember to replace `` with your username on Databricks such that `project_root` correctly points to your project's location. - -1. Before you import and run your Python code, you'll need to install your project's dependencies on the cluster attached to your notebook. Your project has a `requirements.txt` file for this purpose. Add the following code to the first new cell to install the dependencies: - -```ipython -%pip install -r "/Workspace/Repos//iris-databricks/requirements.txt" -``` - -2. To run your project in your notebook, you must load the Kedro IPython extension. Add the following code to the second new cell to load the IPython extension: - -```ipython -%load_ext kedro.ipython -``` - -3. Loading the extension allows you to use the `%reload_kedro` line magic to load your Kedro project. Add the following code to the third new cell to load your Kedro project: - -```ipython -%reload_kedro /Workspace/Repos//iris-databricks -``` - -4. Loading your Kedro project with the `%reload_kedro` line magic will define four global variables in your notebook: `context`, `session`, `catalog` and `pipelines`. You will use the `session` variable to run your project. 
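+The Spark hook you will add in the next section reads your Databricks connection details from the `~/.databrickscfg` file that the Databricks CLI maintains. For reference, a profile containing the three keys the hook expects looks roughly like the sketch below; all values are placeholders, and if your profile has no `cluster_id` entry you can add one manually with the ID of your development cluster:
+
+```ini
+# ~/.databrickscfg (placeholder values)
+[DEFAULT]
+host       = https://adb-1234567890123456.7.azuredatabricks.net/
+token      = dapi0123456789abcdef0123456789abcdef
+cluster_id = 0123-456789-abcdefgh
+```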
Add the following code to the fourth new cell to run your Kedro project:
-
-```ipython
-session.run()
-```
-
-After completing these steps, your notebook should match the following image:
-
-![Databricks completed notebook](../../meta/images/databricks_finished_notebook.png)
-
-Run the completed notebook using the `Run All` bottom in the top right of the UI:
-
-![Databricks notebook run all](../../meta/images/databricks_run_all.png)
-
-On your first run, you will be prompted to consent to analytics, type `y` or `N` in the field that appears and press `Enter`:
-
-![Databricks notebook telemetry consent](../../meta/images/databricks_telemetry_consent.png)
-
-You should see logging output while the cell is running. After execution finishes, you should see output similar to the following:
+## Modify spark hook
+
+To enable remote execution, modify `src/databricks_iris/hooks.py` as shown below. For more details, see [How to integrate Databricks Connect and Kedro](https://kedro.org/blog/how-to-integrate-kedro-and-databricks-connect).
+
+```python
+import configparser
+import os
+from pathlib import Path
+
+from kedro.framework.hooks import hook_impl
+from pyspark.sql import SparkSession
+
+
+class SparkHooks:
+    @hook_impl
+    def after_context_created(self) -> None:
+        """Initialises a SparkSession using the config
+        from Databricks.
+        """
+        set_databricks_creds()
+        _spark_session = SparkSession.builder.getOrCreate()
+
+
+def set_databricks_creds():
+    """
+    Pass Databricks credentials as environment variables when running on a local machine.
+    If you set the DATABRICKS_PROFILE environment variable, the matching profile in
+    `.databrickscfg` is used, otherwise the DEFAULT profile is used.
+    """
+    DEFAULT = os.getenv("DATABRICKS_PROFILE", "DEFAULT")
+    if os.getenv("SPARK_HOME") != "/databricks/spark":
+        config = configparser.ConfigParser()
+        config.read(Path.home() / ".databrickscfg")
+
+        host = (
+            config[DEFAULT]["host"].split("//", 1)[1].strip().rstrip("/")
+        )  # remove "https://" and any trailing "/" from the host
+        cluster_id = config[DEFAULT]["cluster_id"]
+        token = config[DEFAULT]["token"]
+        os.environ[
+            "SPARK_REMOTE"
+        ] = f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"
+```
+## Run the project
+
+Now you can run the project in the remote environment with:
```bash
-...
-2023-06-06 17:21:53,221 - iris_databricks.nodes - INFO - Model has an accuracy of 0.960 on test data.
-2023-06-06 17:21:53,222 - kedro.runner.sequential_runner - INFO - Completed 3 out of 3 tasks
-2023-06-06 17:21:53,224 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
+kedro run
```
-
## Modify your project and test the changes

Now that your project has run successfully once, you can make changes using the convenience and power of your local development environment. In this section, you will modify the project to use a different ratio of training data to test data and check the effect of this change on Databricks.

### Modify the training / test split ratio

-The `databricks-iris` starter uses a default 80-20 ratio of training data to test data when training the classifier. In this section, you will change this ratio to 70-30 by editing your project in your local environment, then sync it with the Databricks Repo using `dbx`, and then run the modified project on Databricks to observe the different result.
-
-Open the file `/conf/base/parameters.yml` in your local environment. Edit the line `train_fraction: 0.8` to `train_fraction: 0.7` and save your changes. 
Look in the terminal where `dbx sync` is running, you should see it automatically sync your changes with your Databricks Repo:
-
-```bash
-...
-[dbx][2023-04-14 18:29:39.235] Putting /Repos//iris-databricks/conf/base/parameters.yml
-[dbx][2023-04-14 18:29:40.820] Done
-```
+The `databricks-iris` starter uses a default 80-20 ratio of training data to test data when training the classifier. In this section, you will change this ratio to 70-30 by editing your project in your local environment and then running the modified project on Databricks to observe the different result.
+
+Open the file `/conf/base/parameters.yml` in your local environment. Edit the line `train_fraction: 0.8` to `train_fraction: 0.7` and save your changes.
### Re-run your project

-Return to your Databricks notebook. Re-run the third and fourth cells in your notebook (containing the code `%reload_kedro ...` and `session.run()`). The project will now run again, producing output similar to the following:
-
+Run your project again to see the effect of the change:
```bash
-...
-2023-06-06 17:23:19,561 - iris_databricks.nodes - INFO - Model has an accuracy of 0.972 on test data.
-2023-06-06 17:23:19,562 - kedro.runner.sequential_runner - INFO - Completed 3 out of 3 tasks
-2023-06-06 17:23:19,564 - kedro.runner.sequential_runner - INFO - Pipeline execution completed successfully.
+kedro run
```

-You can see that your model's accuracy has changed now that you are using a different classifier to produce the result.
-
-```{note}
-If your cluster terminates, you must re-run your entire notebook, as libraries installed using `%pip install ...` are ephemeral. If not, repeating this step is only necessary if your project's requirements change.
```
## Summary

-This guide demonstrated a development workflow on Databricks, using your local development environment, dbx, and Databricks Repos to sync code. This approach improves development efficiency and provides access to powerful development features, such as auto-completion, linting, and static type checking, that are not available when working exclusively with Databricks notebooks.
+This guide demonstrated a development workflow on Databricks, using your local development environment and Databricks Connect. This approach improves development efficiency and provides access to powerful development features, such as auto-completion, linting, and static type checking, that are not available when working exclusively with Databricks notebooks.
diff --git a/docs/source/deployment/databricks/databricks_notebooks_development_workflow.md b/docs/source/deployment/databricks/databricks_notebooks_development_workflow.md index 8fc4b76f4c..557f12a1c4 100644 --- a/docs/source/deployment/databricks/databricks_notebooks_development_workflow.md +++ b/docs/source/deployment/databricks/databricks_notebooks_development_workflow.md @@ -18,8 +18,8 @@ This tutorial introduces a Kedro project development workflow using only the Dat ## Prerequisites

- An active [Databricks deployment](https://docs.databricks.com/getting-started/index.html).
-- A [Databricks cluster](https://docs.databricks.com/clusters/configure.html) configured with a recent version (>= 11.3 is recommended) of the Databricks runtime.
-- Python >= 3.8 installed.
+- A [Databricks cluster](https://docs.databricks.com/clusters/configure.html) configured with a recent version (>= 14.0 is recommended) of the Databricks runtime.
+- Python >= 3.10 installed.
- Git installed.
- A [GitHub](https://github.com/) account. 
- A Python environment management system installed, [venv](https://docs.python.org/3/library/venv.html), [virtualenv](https://virtualenv.pypa.io/en/latest/) or [Conda](https://docs.conda.io/en/latest/) are popular choices.
diff --git a/docs/source/deployment/databricks/index.md b/docs/source/deployment/databricks/index.md index 6fba3b6089..b00e329318 100644 --- a/docs/source/deployment/databricks/index.md +++ b/docs/source/deployment/databricks/index.md @@ -13,13 +13,13 @@ To avoid the overhead of setting up and syncing a local development environment **I want a hybrid workflow model combining local IDE with Databricks**

-The workflow documented in ["Use an IDE, dbx and Databricks Repos to develop a Kedro project"](./databricks_ide_development_workflow.md) is for those that prefer to work in a local IDE.
+The workflow documented in ["Use Databricks Connect to develop a Kedro project"](./databricks_ide_development_workflow.md) is for those that prefer to work in a local IDE.

If you're in the early stages of learning Kedro, or your project requires constant testing and adjustments, choose this workflow. You can use your IDE's capabilities for faster, error-free development, while testing on Databricks. Later you can make the transition into a production deployment with this approach, although you may prefer to switch to use [job-based deployment](./databricks_deployment_workflow.md) and fully optimise your workflow for production.

**I want to deploy a packaged Kedro project to Databricks**

-The workflow documented in ["Use a Databricks job to deploy a Kedro project"](./databricks_deployment_workflow.md) is the go-to choice when dealing with complex project requirements that need a high degree of structure and reproducibility. It's your best bet for a production setup, given its support for CI/CD, automated/scheduled runs and other advanced use cases. It might not be the ideal choice for projects requiring quick iterations due to its relatively rigid nature.
+The workflow documented in ["Use Databricks Asset Bundles and jobs to deploy a Kedro project"](./databricks_deployment_workflow.md) is the go-to choice when dealing with complex project requirements that need a high degree of structure and reproducibility. It's your best bet for a production setup, given its support for CI/CD, automated/scheduled runs and other advanced use cases. It might not be the ideal choice for projects requiring quick iterations due to its relatively rigid nature.

---
Here's a flowchart to guide your choice of workflow:
@@ -31,11 +31,11 @@ Here's a flowchart to guide your choice of workflow:
% A[Start] --> B{Do you prefer developing your projects in notebooks?}
% B -->|Yes| C[Use a Databricks workspace to develop a Kedro project]
% B -->|No| D{Are you a beginner with Kedro?}
-% D -->|Yes| E[Use an IDE, dbx and Databricks Repos to develop a Kedro project]
+% D -->|Yes| E[Use Databricks Connect to develop a Kedro project]
% D -->|No| F{Do you have advanced project requirements
e.g. CI/CD, scheduling, production-ready, complex pipelines, etc.?} % F -->|Yes| G{Is rapid development needed for your project needs?} -% F -->|No| H[Use an IDE, dbx and Databricks Repos to develop a Kedro project] -% G -->|Yes| I[Use an IDE, dbx and Databricks Repos to develop a Kedro project] +% F -->|No| H[Use Databricks Connect to develop a Kedro project] +% G -->|Yes| I[Use Databricks Connect to develop a Kedro project] % G -->|No| J[Use a Databricks job to deploy a Kedro project] diff --git a/docs/source/meta/images/databricks-flow-chart.png b/docs/source/meta/images/databricks-flow-chart.png index a3e5b8abf9..1f74aa5e48 100644 Binary files a/docs/source/meta/images/databricks-flow-chart.png and b/docs/source/meta/images/databricks-flow-chart.png differ