Update Databricks docs #3360

AhdraMeraliQB · 2023-11-29T13:26:53Z

Description

It looks like Databricks have deprecated their cli tools, which has had the knock on effect of breaking our docs. A quick fix in #3358 adds the necessary /archive/ to the broken links, but maybe we should rethink the section as a whole?

CC: @stichbury

docs: https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html

Edit: See #3360 (comment) for current status of this parent issue

The text was updated successfully, but these errors were encountered:

astrojuanlu · 2023-11-29T13:53:28Z

And also dbx:

Databricks recommends that you use Databricks asset bundles instead of dbx by Databricks Labs. See What are Databricks asset bundles? and Migrate from dbx to bundles.

https://docs.databricks.com/en/archive/dev-tools/dbx/index.html

stichbury · 2023-11-30T09:40:08Z

I'll take a look at this in an upcoming sprint -- we did some updates for the asset bundles recently as suggested by Harmony.

astrojuanlu · 2023-12-01T18:04:48Z

Another couple of things I found in our Databricks workspace guide databricks_notebooks_development_workflow.md:

The first half of the guide, which tells users to set up a GitHub repository, a personal access token, push the code, then create a Databricks Repo, is not strictly needed: kedro new works fine on Databricks notebooks.
- One has to be careful that the files are created in the Workspace, and not in the ephemeral driver storage. Depending on the runtime version, the default working directory is different, so a cd /Workspace/... might be needed.
"On Databricks, Kedro cannot access data stored directly in your project’s directory." this is not correct. From the docs:
- "Spark cannot directly interact with workspace files on compute configured with shared access mode." however, clusters configured with Single User access mode should, and can, access workspace files.
- However: “You can use Spark to read data files. You must provide Spark with the fully qualified path." https://learn.microsoft.com/en-us/azure/databricks/files/workspace-interact#read-data-workspace-files this means that spark.read.load("/Workspace/...") won't work (because it will assume dbfs:/), but spark.read.load("file:/Workspace/...") will.
  - Now, whether or not this can be incorporated in actual Kedro catalogs (in other words: whether or not our fsspec mangling will work on paths like these) is a different story. One can't just simply add file:/ in front of the dataset filepath, because then it will be taken as an absolute path and not a relative one.

It's true that creating a Databricks Repo synced with a GH repo gives some nice advantages, like being able to edit the code in an actual IDE (whether a local editor or a cloud development environment like Gitpod or GitHub Codespaces). And it's also true that Databricks recommends in different places that data should live in the DBFS root.

However, it would be nice to consider what's the shortest and simplest guide we can write for users to get started with Kedro on Databricks, and then build from there.

astrojuanlu · 2023-12-02T23:57:21Z

To clarify on the initial comment:

The DBFS CLI (legacy) becomes https://docs.databricks.com/en/dev-tools/cli/fs-commands.html. The commands are still databricks fs, on first look nothing has changed on the CLI (only the structure of the documentation). For example, compare old and new.
The Jobs CLI (legacy) stops being documented because "CLI command groups that are not documented in the REST API reference have their own separate reference articles", and Jobs is not one of them. Documentation of the REST API can be found in https://docs.databricks.com/api/workspace/jobs, and the list of jobs subcommands is on GitHub at the moment https://github.com/databricks/cli/blob/main/docs/commands.md#databricks-jobs---manage-databricks-workflows. Less than ideal, but that's the current status AFAIU.

Both things above require the Databricks CLI version 0.205 and above. Apart from that, the commands haven't changed, so what we should do in this regard is making sure we're not sending users to legacy docs and that's it.

astrojuanlu · 2023-12-02T23:59:46Z

To summarise:

Replace references to DBFS and Jobs legacy CLI docs
Rewrite docs from dbx to Asset Bundles (migration guide)
Simplify Databricks notebooks guide to better serve as starting point for Kedro on Databricks

stichbury · 2023-12-19T12:10:13Z

@astrojuanlu Is this something that the team can pick up or do we need to ask for time from Jannic or another databricks expert (maybe @deepyaman could ultimately review)?

What are we prioritising this? I'm guessing it's relatively high importance to keep databricks docs tracking with their tech.

astrojuanlu · 2023-12-19T16:16:09Z

We need to build Databricks expertise in the team, so I hope we don't need to ask external experts to do it (it's OK if they give assistance, but we need to own this).

astrojuanlu · 2023-12-19T16:16:37Z

Added this to the Inbox so that we prioritise.

deepyaman · 2023-12-19T17:06:14Z

@astrojuanlu Is this something that the team can pick up or do we need to ask for time from Jannic or another databricks expert (maybe @deepyaman could ultimately review)?

At this point, it's been almost 4 years since I've used Databricks (and don't currently have any interest in getting back into it), so I'd defer to somebody else. 🙂

stichbury · 2023-12-19T17:13:26Z

More than fair enough @deepyaman! Good to confirm though.

astrojuanlu · 2024-02-07T09:45:24Z

I'm adding one more item:

Document integration with Databricks Unity Catalog

Every time I give a talk or workshop, invariably somebody from the audience asks "how does the Kedro Catalog play along with Databricks Unity Catalog?".

Our reference docs for kedro-datasets mention it exactly once, in the API docs of pandas.DeltaTableDataset.

And there's one subtle mention of it in databricks.ManagedTableDataset ("the name of the catalog in Unity".

The broader question of Delta datasets is a topic for kedro-org/kedro-plugins#542.

astrojuanlu · 2024-02-21T16:36:11Z

Relevant: @dannyrfar 's https://github.com/dannyrfar/databricks-kedro-starter

felipemonroy · 2024-06-01T03:09:36Z

Maybe this could help:
https://github.com/JenspederM/databricks-kedro-bundle

astrojuanlu · 2024-06-03T10:23:07Z

This looks really cool. @JenspederM do you want to share a bit more insight on how far do you intend to go with your project?

JenspederM · 2024-06-03T11:03:30Z

Hey @astrojuanlu

Actually I don't really know if there's more to do. I almost want the project to be as barebone as possible.

The way I left it now is with a very simple datasets implementation for unity so that people can customize as required.

As for the DAB resource generator, I'm considering if I could find a better way for users to set defaults such as which job clusters, instance pools, etc..

One thing that is generally lacking is the documentation so that will definitely receive some attention once I have the time.

Do you have any suggestions?

astrojuanlu · 2024-07-05T07:01:26Z

However, it would be nice to consider what's the shortest and simplest guide we can write for users to get started with Kedro on Databricks, and then build from there.

I gave two Kedro on Databricks demos yesterday, so I'm sharing that very simple notebook here https://github.com/astrojuanlu/kedro-databricks-demo hopefully it can be the basis of what I proposed in #3360 (comment) (still no Kedro Framework there)

astrojuanlu · 2024-07-05T07:02:20Z

Hey @astrojuanlu

Actually I don't really know if there's more to do. I almost want the project to be as barebone as possible.

The way I left it now is with a very simple datasets implementation for unity so that people can customize as required.

As for the DAB resource generator, I'm considering if I could find a better way for users to set defaults such as which job clusters, instance pools, etc..

One thing that is generally lacking is the documentation so that will definitely receive some attention once I have the time.

Do you have any suggestions?

@JenspederM I gave your kedro-databricks a quick try yesterday and it didn't work out of the box, so if you're open to me opening issues, I'll gladly start doing so 😄

JenspederM · 2024-07-05T08:56:22Z

@astrojuanlu Go for it!

I've been a bit busy these last few days and haven't had the chance to make any progress.

But it's always nice to have some concrete issues to address. 😉

JenspederM · 2024-07-07T11:17:09Z

@astrojuanlu Just fyi, I'll merge a quite big PR soon so hopefully that will address most issues that you found.

The substitution algorithm was a bit more cumbersome than first anticipated..

noklam · 2024-09-09T13:10:49Z

There is now a community plugin exists: https://github.com/JenspederM/kedro-databricks

We need to update the documentation according to the readme to walkthrough the steps to setup on Databricks.

noklam · 2024-10-15T14:46:01Z

To summarise:

Replace references to DBFS and Jobs legacy CLI docs[ ] Rewrite docs from dbx to Asset Bundles (migration guide)[ ] Simplify Databricks notebooks guide to better serve as starting point for Kedro on Databricks

This comment is also 10 months old. Do we want to mention the kedro-databricks plugin as the shortest path to experiment Kedro on Databricks? @astrojuanlu

Document integration with Databricks Unity Catalog

There is also a comment about UnityCatalog, is it in the scope of this ticket or should we separate this out? Not quite sure what should be done, maybe we can mention there are some databricks dataset that can work with UnityCatalog (Databricks platform version)

How do we think about the VSCode IDE Databricks

Do we want to mention it?

astrojuanlu · 2024-10-15T16:21:46Z

Do we want to mention the kedro-databricks plugin as the shortest path to experiment Kedro on Databricks?

Yes!

There is also a comment about UnityCatalog, is it in the scope of this ticket or should we separate this out?

It's in the scope of this ticket because this is a parent ticket. No need to do everything in the same PR.

The key thing is that we explain clearly how the Kedro Catalog and Unity Catalog can be used together. Example: https://github.com/astrojuanlu/kedro-databricks-demo/blob/main/First%20Steps%20with%20Kedro%20on%20Databricks.ipynb

How do we think about the VSCode IDE Databricks

Do you mean https://docs.databricks.com/en/dev-tools/vscode-ext/index.html ? Not sure if there's anything Kedro-specific about that extension. We can discuss about that in another ticket.

noklam · 2024-10-16T13:04:39Z

@astrojuanlu

Do you mean https://docs.databricks.com/en/dev-tools/vscode-ext/index.html ? Not sure if there's anything Kedro-specific about that extension. We can discuss about that in another ticket.

Not directly, but it could be a better way to work with Databricks for the IDE experience. (no manual sync etc). Will keep this outside of the ticket.

Also, I think we should add new pages instead of just deleting the old one. I think there are still a significant amount of users are using dbx, we can recommend the new page and marked the old one as legacy.

noklam · 2024-10-27T21:56:46Z

Steps to use `kedro-databricks`

Create starter kedro new -s databrick-iris
Install databricks-cli curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sudo sh
databricks configure copy link & create tokens
Install kedro-databricks
kedro databricks init #databricks.yml
kedro databricks bundle
kedro databricks deploy
databricks bundle run or use the UI to trigger the job.

Notes

2024-10-25 13:59:33,723 - databrick_iris - INFO - Substituting DBFS paths: Checking conf/base/catalog.yml
2024-10-25 13:59:33,724 - databrick_iris - INFO - Substituting DBFS paths: Checking conf/local/catalog.yml
2024-10-25 13:59:33,724 - databrick_iris - WARNING - Substituting DBFS paths: conf/local/catalog.yml does not exist.

When I run kedro databricks init, I see these logs showing up. I am a bit nervous about updating user file directly. For new file it's fine. This Create the databricks.yml that is needed

kedro databricks bundle

This create the resource/databricks-iris.yml, wrapper around databricks bundle init

Uploading databrick_iris-0.1-py3-none-any.whl...
Uploading bundle files to /Workspace/Users/[email protected]/.bundle/databrick_iris/local/files...
Deploying resources...
Updating deployment state...
Deployment complete!

kedro databricks deploy

wrapper around databricks bundle deploy

Changes

The step to run a bundled job is not mentioned. Either through UI or using the databricks CLI directly with databricks bundle run

JenspederM · 2024-10-28T05:32:03Z

Yeah, I'm not too happy about having to change user paths either.

But what is done, is simply to check that paths that refer to dbfs use the correct package name.

This is mostly a legacy feature that helped developer experience with the old version of the starter, where dbfs paths were not aligned with the package name. Since that has now changed, I might also remove this from the init command.

noklam · 2024-10-28T17:09:38Z

@DimedS Do you want to take the rest of the docs as I am working on the data asset bundles ?

DimedS · 2024-10-28T17:14:45Z

@DimedS Do you want to take the rest of the docs as I am working on the data asset bundles ?

Yes, I’m ready to take care of them. I just need to wrap up the synthesis of our deployment interviews. Four of them were about Databricks, and we identified four distinct ways people deploy there. I believe we’ll complete the Databricks synthesis in the next two days. After that, I’ll share my thoughts on updating the Databricks docs here and will start making updates once we align on the changes.

noklam · 2024-10-28T21:39:51Z

Just want to share some notes for my DAB deployment experience with our own internal infra, there are two options: 1. Azure, 2. AWS instance

With Azure Databricks (the more common one), everything works well until the step 8 (run job with CLI) because we don't have job cluster with the sandbox.

With AWS Databricks, we do have job cluster, but I get stuck even earlier at step 7 (kedro databricks deploy). After some investigation, it seems to related to permission issue. For some reason, I have permission to run a job, create a job through the UI but not with the CLI. I spent a good afternoon on it and decide it's not worth the effort to continue. This is the error that I see repeatly.

Error: exit status 1
Error: cannot read job: User [email protected] does not have Admin or Manage Run or Owner or View permissions on job 1111051477632764

To proceed, I will go with the Azure Databricks option, and use the UI to run the job as the last step. I expect users will likely encounter cryptic error message related to security/permission. I am not sure what kind of help we can provide other

DimedS · 2024-10-31T10:15:38Z

Hi, I'd like to share a part of synthesis of our recent interviews with four users who deploy projects to Databricks. In the following diagram, I’ve outlined their user flows, highlighting the main steps for deploying to Databricks. While there are many similarities, certain parts of the process differ.

As you can see, the overall workflow is quite similar across users, but there are a few points where multiple options are available. I believe adding a diagram like this to our main Databricks documentation page (https://docs.kedro.org/en/stable/deployment/databricks/index.html) would be helpful. It could provide a concise overview of the steps required to deploy code and the various options for each step. This approach would be more efficient than detailing each complete workflow as we currently do.

For instance, these two pages:

currently overlap by about 50%. Instead, we could focus on describing specific differences and options at each step, such as:

Transferring Project Code to Databricks:
- Using the Databricks asset bundle
- Manually syncing the repository
- Using the VSCode plugin
Running Kedro on Databricks:
- Via a packaged .whl file
- Running code directly
- Running code through a notebook
Creating a Databricks Job to Schedule Runs:
- Setting up manually
- Using the Databricks asset bundle
- Using the Kedro-Databricks plugin (and its advantages)
Choosing the Right Cluster for Jobs:
- Job cluster: Provides isolation but requires 5–10 minutes to start up
- Existing cluster: Ready immediately but lacks isolation and can be costly to maintain 24/7

This structure would allow users to better understand the options available at each step without redundant information.

astrojuanlu · 2024-10-31T15:29:34Z

This diagram is fantastic, and probably more useful than the current one we have.

Also +1 on trying to reduce overlap between the pages.

And finally, I noticed that some users didn't really want to kedro package their code, is that something you observed as well @DimedS ?

DimedS · 2024-10-31T16:05:44Z

This diagram is fantastic, and probably more useful than the current one we have.

Also +1 on trying to reduce overlap between the pages.

And finally, I noticed that some users didn't really want to kedro package their code, is that something you observed as well @DimedS ?

Yes, from what I gathered, users didn’t find much benefit from packaging their projects - they only followed that approach because we recommended it. Some simplified the process and realised they could just upload code directly to their Databricks Repo, and they mentioned they don't need to package their projects. One user, Miguel, figured out that by specifying the notebook that runs the Kedro project within the Databricks Job deployment, he could use the same notebook during development, making debugging easier.

noklam · 2024-11-12T10:33:33Z

Do you plan to write these documentation, or it will be created as separate ticket from the current issue?

astrojuanlu · 2024-11-12T11:38:43Z

Another think we could try to document at some point is how to make use of the VS Code extension + databricks-connect. I tried and I got a MissingConfigException: Given configuration path either does not exist or is not a valid directory: /Workspace/Users/[email protected]/.bundle/kedro_databricks_playground/local/files/conf/local

And the workaround seems to be ~~to do git add -f conf/local~~ to un-ignore conf/local on .gitignore (see #2593).

DimedS · 2024-11-12T11:40:58Z

Do you plan to write these documentation, or it will be created as separate ticket from the current issue?

I don’t think I proceed with it in the current sprint. Let’s create a new issue as a follow-up to the Databricks deployment research #4317. We can probably close current one after merging your PR, which transitions from dbx to Databricks asset bundles.

astrojuanlu · 2024-11-26T16:36:37Z

Current status of this issue:

P0: Deprecated dbx and documented Databricks Asset Bundles (Add databricks asset bundles docs #4265)
P1: Document integration with Databricks Unity Catalog
- Source: Update Databricks docs #3360 (comment)
P1: Describe more clearly how to connect to data stored in Databricks
- Source: User interviews November 2024
P1: Write a very simple guide on how to use Kedro on Databricks without leaving the workspace (notebooks) (or simplify databricks_notebooks_development_workflow.md to the extreme)
- Source: Update Databricks docs #3360 (comment)
P2: Fix links to the legacy DBFS CLI
- Source: Update Databricks docs #3360 (comment)
P2: Document how to use the Databricks VS Code extension + databricks-connect
- Source: Update Databricks docs #3360 (comment)
P2: Review claims on data storage ("On Databricks, Kedro cannot access data stored directly in your project’s directory")
- Source: Update Databricks docs #3360 (comment)
P2: Rework decision diagram for deployment methods
- Source: Update Databricks docs #3360 (comment)

astrojuanlu changed the title ~~Update docs that reference Databricks CLI~~ Update docs that reference Databricks CLI and dbx Nov 29, 2023

astrojuanlu added the Component: Documentation 📄 Issue/PR for markdown and API documentation label Nov 29, 2023

stichbury added this to the Improve Kedro documentation used by advanced users milestone Nov 30, 2023

github-actions bot mentioned this issue Dec 1, 2023

Monthly issue metrics report #3375

Closed

astrojuanlu changed the title ~~Update docs that reference Databricks CLI and dbx~~ Update Databricks docs Dec 1, 2023

astrojuanlu mentioned this issue Mar 26, 2024

Add docs databricks asset bundles #3744

Closed

7 tasks

merelcht assigned ankatiyar Sep 16, 2024

ElenaKhaustova assigned noklam and DimedS Oct 14, 2024

noklam unassigned ankatiyar Oct 28, 2024

noklam mentioned this issue Oct 28, 2024

Add databricks asset bundles docs #4265

Merged

7 tasks

noklam linked a pull request Nov 12, 2024 that will close this issue

Add databricks asset bundles docs #4265

Merged

7 tasks

astrojuanlu closed this as completed in #4265 Nov 26, 2024

astrojuanlu reopened this Nov 26, 2024

astrojuanlu added the Type: Parent Issue label Nov 26, 2024

astrojuanlu mentioned this issue Feb 7, 2025

Improve Third-Party Deployment Plugins Reliability and Compatibility #4318

Open

1 task

Update Databricks docs #3360

Update Databricks docs #3360

Comments

AhdraMeraliQB commented Nov 29, 2023 • edited by astrojuanlu Loading

Description

astrojuanlu commented Nov 29, 2023

stichbury commented Nov 30, 2023

astrojuanlu commented Dec 1, 2023

astrojuanlu commented Dec 2, 2023

astrojuanlu commented Dec 2, 2023

stichbury commented Dec 19, 2023

astrojuanlu commented Dec 19, 2023 • edited Loading

astrojuanlu commented Dec 19, 2023

deepyaman commented Dec 19, 2023

stichbury commented Dec 19, 2023

astrojuanlu commented Feb 7, 2024

astrojuanlu commented Feb 21, 2024

felipemonroy commented Jun 1, 2024

astrojuanlu commented Jun 3, 2024

JenspederM commented Jun 3, 2024 • edited Loading

astrojuanlu commented Jul 5, 2024

astrojuanlu commented Jul 5, 2024

JenspederM commented Jul 5, 2024

JenspederM commented Jul 7, 2024

noklam commented Sep 9, 2024

noklam commented Oct 15, 2024

astrojuanlu commented Oct 15, 2024

noklam commented Oct 16, 2024 • edited Loading

noklam commented Oct 27, 2024

Steps to use kedro-databricks

Notes

kedro databricks bundle

kedro databricks deploy

Changes

JenspederM commented Oct 28, 2024

noklam commented Oct 28, 2024

DimedS commented Oct 28, 2024

noklam commented Oct 28, 2024

DimedS commented Oct 31, 2024 • edited Loading

astrojuanlu commented Oct 31, 2024

DimedS commented Oct 31, 2024

noklam commented Nov 12, 2024

astrojuanlu commented Nov 12, 2024 • edited Loading

DimedS commented Nov 12, 2024

astrojuanlu commented Nov 26, 2024 • edited Loading

AhdraMeraliQB commented Nov 29, 2023 •

edited by astrojuanlu

Loading

astrojuanlu commented Dec 19, 2023 •

edited

Loading

JenspederM commented Jun 3, 2024 •

edited

Loading

noklam commented Oct 16, 2024 •

edited

Loading

Steps to use `kedro-databricks`

DimedS commented Oct 31, 2024 •

edited

Loading

astrojuanlu commented Nov 12, 2024 •

edited

Loading

astrojuanlu commented Nov 26, 2024 •

edited

Loading