From b6cf0fee1c966514840ef82996bc0ccab12c81d2 Mon Sep 17 00:00:00 2001
From: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
Date: Fri, 10 Jan 2025 22:32:46 +0100
Subject: [PATCH] chore: lakefs integration docs

Signed-off-by: Ion Koutsouris <15728914+ion-elgreco@users.noreply.github.com>
---
 docs/integrations/object-storage/lakefs.md | 53 ++++++++++++++++++++++
 mkdocs.yml                                 |  1 +
 2 files changed, 54 insertions(+)
 create mode 100644 docs/integrations/object-storage/lakefs.md

diff --git a/docs/integrations/object-storage/lakefs.md b/docs/integrations/object-storage/lakefs.md
new file mode 100644
index 0000000000..f875ae58d6
--- /dev/null
+++ b/docs/integrations/object-storage/lakefs.md
@@ -0,0 +1,53 @@
+# LakeFS
+`delta-rs` offers native support for using LakeFS as an object storage backend. Each
+deltalake operation runs in a temporary transaction branch, which is then safely merged into your source branch.
+
+You don’t need to install any extra dependencies to read or write Delta tables on LakeFS with engines that use `delta-rs`. You do need to configure your LakeFS access credentials correctly.
+
+## Passing LakeFS Credentials
+
+You can pass your LakeFS credentials explicitly by using:
+
+- the `storage_options` kwarg
+- environment variables
+
+## Example
+
+Let's work through an example with Polars. The same logic applies to other Python engines like Pandas, Daft, Dask, etc.
+
+Follow the steps below to use Delta Lake on LakeFS with Polars:
+
+1. Install Polars and deltalake. For example, using:
+
+    `pip install polars deltalake`
+
+2. Create a dataframe with some toy data (this assumes `import polars as pl`).
+
+    `df = pl.DataFrame({'x': [1, 2, 3]})`
+
+3. Set your `storage_options` correctly.
+
+    ```python
+    storage_options = {
+        "endpoint": "https://mylakefs.intranet.com",  # LakeFS endpoint
+        "access_key_id": "LAKEFSID",
+        "secret_access_key": "LAKEFSKEY",
+    }
+    ```
+
+4. Write data to a Delta table using the `storage_options` kwarg. The first path segment after the bucket (the LakeFS repository) is always the branch you want to write into.
+
+    ```python
+    df.write_delta(
+        "lakefs://bucket/branch/table",
+        storage_options=storage_options,
+    )
+    ```
+
+## Cleaning up failed transaction branches
+
+A deltalake operation can fail midway. In that case the LakeFS transaction branch has been created but never deleted. These branches are hidden in the UI, but each branch name starts with `delta-tx`.
+
+With the lakefs Python library you can list these branches and delete stale ones.
+
+
diff --git a/mkdocs.yml b/mkdocs.yml
index 384ef54be4..923afb581c 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -88,6 +88,7 @@ nav:
           - integrations/object-storage/hdfs.md
           - integrations/object-storage/s3.md
           - integrations/object-storage/s3-like.md
+          - integrations/object-storage/lakefs.md
       - Arrow: integrations/delta-lake-arrow.md
       - Daft: integrations/delta-lake-daft.md
       - Dagster: integrations/delta-lake-dagster.md
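
The "Cleaning up failed transaction branches" section of `lakefs.md` says stale `delta-tx` branches can be listed and deleted but gives no example. A minimal sketch of that step might look like the following. It is not taken from the patch: `cleanup_tx_branches` and `BranchLike` are hypothetical names, and the helper only assumes that each branch object exposes an `id` string and a `delete()` method, which is what the lakefs Python library is expected to provide. Check this against the SDK version you use, including whether hidden branches are returned by a default listing.

```python
from typing import Iterable, List, Protocol

TX_PREFIX = "delta-tx"  # prefix of transaction branches, as stated in the docs


class BranchLike(Protocol):
    """Assumed shape of a branch object (hypothetical, for illustration)."""

    id: str

    def delete(self) -> None: ...


def cleanup_tx_branches(
    branches: Iterable[BranchLike], dry_run: bool = True
) -> List[str]:
    """Return the ids of branches starting with TX_PREFIX.

    With dry_run=False the matching branches are also deleted. The prefix
    alone cannot tell a stale branch from one owned by an operation that is
    still running, so the default only reports candidates.
    """
    matched = []
    for branch in branches:
        if branch.id.startswith(TX_PREFIX):
            if not dry_run:
                branch.delete()
            matched.append(branch.id)
    return matched


# Possible usage with the lakefs library (assumption, verify before use):
#   import lakefs
#   repo = lakefs.repository("bucket")
#   print(cleanup_tx_branches(repo.branches()))             # report only
#   cleanup_tx_branches(repo.branches(), dry_run=False)      # delete
```

Defaulting to a dry run is a deliberate choice here: it lets you review the candidate list before deleting anything.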