
Retrieving Unity Catalog tables fails #3708

Open
datanikkthegreek opened this issue Jan 17, 2025 · 15 comments
Labels
bug (Something isn't working), p1 (Important to tackle soon, but preemptable by p0)

Comments

datanikkthegreek commented Jan 17, 2025

Describe the bug

This happens when running the code described at https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations/unity-catalog.html. I have redacted my Databricks workspace URL from the error below.

unity = UnityCatalog(
    endpoint="https://adb-***.azuredatabricks.net/",
    token="mytoken",
)
print(unity.list_catalogs())

I get the following error:

TypeError: Client.__init__() got an unexpected keyword argument 'proxies'
File <command-6092925827694198>, line 1
----> 1 unity = UnityCatalog(
      2     endpoint="***",
      3     token="***",
      4 )
      5 print(unity.list_catalogs())

To Reproduce

Run this code

from daft.unity_catalog import UnityCatalog

unity = UnityCatalog(
    endpoint="https://<databricks_workspace_id>.cloud.databricks.com",
    # Authentication can be retrieved from your provider of Unity Catalog
    token="my-token",
)

# See all available catalogs
print(unity.list_catalogs())

Expected behavior

Listing the catalogs

Component(s)

Python Runner

Additional context

No response

datanikkthegreek added the bug (Something isn't working) and needs triage labels on Jan 17, 2025
datanikkthegreek (Author) commented Jan 17, 2025

pip install httpx==0.27.2 solved the issue. It seems there were some breaking changes in httpx 0.28.0 and above, which was on my Databricks runtime.
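For context, httpx 0.28.0 removed the long-deprecated proxies argument from httpx.Client, which the unitycatalog Python client still passes, hence the TypeError above. A minimal way to pin it in a Databricks notebook, assuming a runtime where the %pip magic and dbutils.library.restartPython() are available:

# Notebook cell: pin httpx below 0.28 so the client's `proxies` keyword is still accepted.
%pip install "httpx<0.28"

# In a separate cell, restart the Python process so the pinned version is picked up:
dbutils.library.restartPython()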

datanikkthegreek (Author) commented:

I would recommend at least documenting this somewhere, or updating to the new unitycatalog client library? :)

datanikkthegreek (Author) commented:

Now that I fixed this, unfortunately loading the data itself did not work.

[Image attachment]

datanikkthegreek (Author) commented:

The environment is on Azure with ADLS Gen2 and Databricks.

colin-ho (Contributor) commented Jan 17, 2025

Can you try manually providing credentials to read_deltalake, e.g.:

import daft
from daft.daft import S3Config, IOConfig

s3_config_from_env = S3Config.from_env()
io_config = IOConfig(s3=s3_config_from_env)

df = daft.read_deltalake(unity_table, io_config=io_config)
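For an Azure environment like the reporter's (ADLS Gen2), the analogous manual wiring would go through AzureConfig instead of S3Config. A minimal sketch, assuming a storage account name and SAS token are available; the placeholder values and abfss:// path are illustrative, not from this thread:

import daft
from daft.io import IOConfig, AzureConfig

# Sketch: pass Azure credentials explicitly instead of relying on the Unity Catalog
# client to vend them. Replace the placeholders with your own values.
azure_config = AzureConfig(
    storage_account="<storage-account-name>",
    sas_token="<sas-token>",
)
io_config = IOConfig(azure=azure_config)

df = daft.read_deltalake(
    "abfss://<container>@<storage-account-name>.dfs.core.windows.net/<path-to-delta-table>",
    io_config=io_config,
)
df.show()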

anilmenon14 (Contributor) commented Jan 18, 2025

Hi @datanikkthegreek, I am on Daft v0.4.1 and the code block below is all I need to load an 'external' table.
One thing I have noticed is that managed tables cannot be loaded due to a minor issue: a READ_WRITE credential is requested when reading managed tables, and Unity Catalog does not allow that kind of credential for managed tables. Hence, I would avoid reading a managed table for now. I will log a PR with a fix for this in the upcoming week.

import daft
from daft.unity_catalog import UnityCatalog
import os

# Set up your 'adb-......databricks.net' workspace URL as env var and a personal access token (PAT) 
DATABRICKS_HOST_AZURE = os.environ.get('DATABRICKS_HOST_AZURE') 
PAT_TOKEN_AZURE = os.environ.get('PAT_TOKEN_AZURE') 

unity = UnityCatalog(endpoint=DATABRICKS_HOST_AZURE,token=PAT_TOKEN_AZURE)
unity_table_ext = unity.load_table("some_uc_catalog.some_schema.some_table") # This is an external table
df_ext = daft.read_deltalake(unity_table_ext)
df_ext.show()

I noticed that your error seems to be a response from an AWS control plane, while you appear to be accessing an Azure Databricks control plane, so something may be off in your environment variable setup.


pip install httpx==0.27.2 solved the issue. It seems there were some breaking changes in httpx 0.28.0 and above, which was on my Databricks runtime.

As for the issue quoted above, this is unfortunately an issue in the unitycatalog Python client.
On the Daft side, we attempted to pin this in requirements-dev.txt via PR #3522; however, I too noticed that if you have other packages in your environment (e.g. ipykernel in my case), pip may install a higher version of httpx unless you specifically pin it during that install.

Hope this helps and feel free to share more of the error text if you still have the issue.

datanikkthegreek (Author) commented Jan 18, 2025

Can you try manually providing credentials into read_deltalake [...]

@colin-ho yes, that's what I also thought, and it worked. It's definitely something with the unitycatalog client. I created some simple functions around the REST API myself. I feel the Unity clients are a bit overengineered and hard to use, even though I am not a Python expert. I see you also use the client internally in Daft, but the old one.

WORKSPACE = "WORKSPACE"
TOKEN = "YOUR TOKEN"

HOST=WORKSPACE.rstrip("/") + "/api/2.1/unity-catalog/"
DEFAULT_HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def get_table(tbl_name):
  url = f"{HOST}/tables/{tbl_name}"
  response = requests.get(url, headers=DEFAULT_HEADERS)
  return response.json()
get_table("pt_dh_sand.test.test_table")

def get_table_id(tbl_name):
  return get_table(tbl_name)["table_id"]
get_table_id("pt_dh_sand.test.test_table4")

def get_tbl_path(tbl_name):
  return get_table(tbl_name)["storage_location"]
get_tbl_path("pt_dh_sand.test.test_table4")

def get_table_credentials(tbl_name, operation = "READ"):
  table_id = get_table_id(tbl_name)
  url = f"{HOST}/temporary-table-credentials"
  body = {
    "operation": operation, #READ_WRITE to write from outside only for external tables
    "table_id": table_id,
  }
  response = requests.post(url, json=body, headers=DEFAULT_HEADERS)
  return response.json()["azure_user_delegation_sas"]
get_table_credentials("pt_dh_sand.test.test_table")

from daft.io import IOConfig, AzureConfig
import daft
tbl_name = "pt_dh_sand.test.test_table4"
azure = AzureConfig(sas_token=get_table_credentials(tbl_name)["sas_token"])
io_config = IOConfig(azure=azure)

df = daft.read_deltalake(get_tbl_path(tbl_name), io_config=io_config)
df.show()

#Write with Daft
azure = AzureConfig(sas_token=get_table_credentials(tbl_name, operation="READ_WRITE")["sas_token"])
io_config = IOConfig(azure=azure)
df.write_deltalake(get_tbl_path(tbl_name), mode="overwrite", io_config=io_config)

datanikkthegreek (Author) commented Jan 18, 2025

Hi @datanikkthegreek , I am on Daft v0.4.1 and the below code block is all I need to load an 'external' table. [...]

Thanks for the detailed reply. I also ran into the issue with managed tables. There are two options: set the operation to READ, or parameterize it so that writing tables is also possible. I think you can also easily check whether a table is managed via the tables API (see the sketch at the end of this comment).

The error is really strange; it's definitely Azure on my side. As I am running this on Databricks, I could not really change anything. In my previous response you can also see that everything works when using the REST API. I find the REST API more convenient and understandable than the Python clients, both the new and the old one.

You were right, it works now. I had tested Daft with a managed table before; an external table works, as you proposed. This can easily be fixed by making it READ or parameterizing it.
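As a concrete sketch of that last point, one could branch on the table's type before requesting credentials, reusing the REST helpers from the earlier comment. This assumes the Unity Catalog tables API returns a table_type field with values such as "MANAGED" and "EXTERNAL":

import requests

WORKSPACE = "WORKSPACE"
TOKEN = "YOUR TOKEN"
HOST = WORKSPACE.rstrip("/") + "/api/2.1/unity-catalog"
DEFAULT_HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def get_table(tbl_name):
  # Fetch table metadata from the Unity Catalog tables API.
  response = requests.get(f"{HOST}/tables/{tbl_name}", headers=DEFAULT_HEADERS)
  return response.json()

def credential_operation_for(tbl_name):
  # Managed tables can only be read from outside Databricks, so request READ;
  # external tables may request READ_WRITE (assumes `table_type` comes back as
  # "MANAGED" or "EXTERNAL" in the tables API response).
  table_type = get_table(tbl_name).get("table_type", "")
  return "READ" if table_type == "MANAGED" else "READ_WRITE"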

universalmind303 added the p1 (Important to tackle soon, but preemptable by p0) label and removed the needs triage label on Jan 21, 2025
pmogren commented Feb 4, 2025

I ran into both of these issues using Databricks on AWS. I wanted to read managed tables and write external tables but no version of Daft supports both. IMHO the code should make fewer assumptions and let the caller specify their intent.

jaychia (Contributor) commented Feb 4, 2025

Sorry folks, we understand the Databricks/Unity experience has been less than ideal so far. We'll work with the Databricks Unity team to try and iron out some more of these issues!

@datanikkthegreek and @pmogren am I right to understand that all the issues in this thread can be traced back to Daft's support for managed tables in Unity Catalog, and that Daft currently works fine for external tables?

anilmenon14 (Contributor) commented Feb 4, 2025

Hi @jaychia ,

I was just reviewing this and responding, and wanted to tag you and @kevinzwang for thoughts on how we solve this.

The reason this is an issue is that Unity does not support vending READ_WRITE credentials for managed tables, and we ran into this once write support was enabled for Unity tables in Daft (i.e. it currently supports external tables only). I know read/write support for managed tables is on the Unity roadmap; however, today it throws the error seen below (tested on Daft 0.4.2) when we attempt this. It only works for external tables, as @pmogren rightly pointed out.

Error:

BadRequestError: Error code: 400 - {'error_code': 'INVALID_PARAMETER_VALUE', 'message': "Table with id f6e86dfe-f024-4d1e-8914-b738cf7d2b39 cannot be written from outside of Databricks Compute Environment due to its kind being TABLE_DELTA. Only 'TABLE_EXTERNAL' and 'TABLE_DELTA_EXTERNAL' table kinds can be written externally.", 'details': [{'@type': 'type.googleapis.com/google.rpc.ErrorInfo', 'reason': 'EXTERNAL_WRITE_NOT_ALLOWED_FOR_TABLE', 'domain': 'unity-catalog.databricks.com', 'metadata': {'tableId': 'f6e86dfe-f024-4d1e-8914-b738cf7d2b39', 'securableKind': 'TABLE_DELTA'}}, {'@type': 'type.googleapis.com/google.rpc.RequestInfo', 'request_id': '6ec26f24-5022-4e73-b0c6-e643adcd73b6', 'serving_data': ''}]}

The piece of code responsible for this behavior is: https://github.com/anilmenon14/Daft/blob/main/daft/unity_catalog/unity_catalog.py#L140-L142

IMO we have two ways to solve this:

  1. Shall we consider having load_table() accept a parameter named intent, with a default of READ (or READ_WRITE)? For situations where a managed table has to be read, the user could call load_table(intent="READ") (sketched below).
  2. The alternative is to wait for Unity to support READ_WRITE for managed tables.

If you think option 1 is the better approach, I am happy to help contribute.
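For illustration only, option 1 could look roughly like this from the caller's side; the intent parameter here is part of the proposal above, not an existing Daft API:

import daft
from daft.unity_catalog import UnityCatalog

unity = UnityCatalog(
    endpoint="https://adb-<workspace-id>.azuredatabricks.net",
    token="<personal-access-token>",
)

# Hypothetical: explicitly request READ-only temporary credentials so a managed
# table can at least be loaded for reading (the `intent` parameter is the proposal,
# not a shipped API).
managed_tbl = unity.load_table("some_catalog.some_schema.some_managed_table", intent="READ")

df = daft.read_deltalake(managed_tbl)
df.show()

Defaulting to READ_WRITE would preserve current behavior for external tables, while letting callers opt into read-only credentials for managed tables.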

datanikkthegreek (Author) commented:

@jaychia For me there are three issues:

  • support for httpx 0.28.0 and above, so this works on the latest Databricks runtime
  • managed table support
  • making the operation parameterizable to define read vs. read/write

Currently it's easier for me to use the Delta Lake API from Daft instead of the Unity Catalog integration, and I am calling the Unity REST API myself.

I also realise you don't use the new Unity pypi package.

pmogren commented Feb 5, 2025

Confirmed: by forking the project and implementing approach #1, I was able to read a managed table without error.
For some reason the data comes back as all None values, but I suspect that is a different problem. I opened a discussion about that.

jaychia (Contributor) commented Feb 7, 2025

I also realise you don't use the new Unity pypi package.

Yeah, we actually built both Unity pypi packages -- the Databricks folks didn't like our first one because we used a tool called Stainless, so we made a new one but haven't yet moved over 😀

@pmogren any chance you'd like to open a PR for your approach? We'd love to take a contribution!

pmogren commented Feb 12, 2025

@jaychia Yes I'll put together a PR, sorry for the delayed response.
