Context
I've been exploring different ways of getting large amounts of data (100GB+) out of Unity Catalog and into external Ray clusters for distributed ML model training. While assessing databricks-sql-python, I noticed that download speeds are significantly slower than with the Statement Execution API. In the actual external Ray cluster the difference is 10x, and I was able to replicate it, to a lesser extent, in a Databricks notebook.
Replication
The first two approaches both lead to a download speed of ~45 MB/s on an i3.4xlarge.
Using databricks-sql-python directly
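A minimal sketch of this approach, for reference (the hostname, HTTP path, token, and table name below are placeholders):

```python
from databricks import sql

# Open a connection to a SQL warehouse; credentials here are placeholders.
with sql.connect(
    server_hostname="<workspace-host>",
    http_path="<warehouse-http-path>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM my_catalog.my_schema.my_table")
        # Results stream back sequentially through the cursor.
        table = cursor.fetchall_arrow()  # pyarrow.Table
```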
Using databricks-sql-python + ray.data.read_sql
reference: https://docs.ray.io/en/latest/data/api/doc/ray.data.read_sql.html#ray.data.read_sql
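Roughly, assuming the same placeholder credentials as above:

```python
import ray
from databricks import sql

def create_connection():
    # read_sql calls this factory and then pages through a cursor,
    # so reads are still sequential per connection.
    return sql.connect(
        server_hostname="<workspace-host>",
        http_path="<warehouse-http-path>",
        access_token="<personal-access-token>",
    )

ds = ray.data.read_sql("SELECT * FROM my_catalog.my_schema.my_table", create_connection)
```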
However, when I use ray.data.read_databricks_tables, I can reach download speeds of ~150 MB/s on the same machine.
Using ray.data.read_databricks_tables
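A sketch of that approach (the warehouse ID and names are placeholders; this API authenticates via the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables):

```python
import ray

# Internally this goes through the Statement Execution API and fetches
# result chunks in parallel, which is presumably where the speedup comes from.
ds = ray.data.read_databricks_tables(
    warehouse_id="<warehouse-id>",
    catalog="my_catalog",
    schema="my_schema",
    query="SELECT * FROM my_table",
)
```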
Potential Cause
I suspect this is because the Statement Execution API allows you to make separate parallel requests to retrieve different "chunks" of the result, whereas the SQL connector adopts a cursor-based approach in which data can only be retrieved sequentially.
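To make the contrast concrete, here is a rough sketch of the chunked pattern I mean, using the Statement Execution API with the EXTERNAL_LINKS disposition (endpoint shapes follow the public REST docs; polling and error handling are elided, and it assumes the statement succeeds immediately):

```python
import concurrent.futures
import requests

HOST = "https://<workspace-host>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Submit the statement; EXTERNAL_LINKS returns presigned URLs per chunk.
resp = requests.post(
    f"{HOST}/api/2.0/sql/statements",
    headers=HEADERS,
    json={
        "warehouse_id": "<warehouse-id>",
        "statement": "SELECT * FROM my_catalog.my_schema.my_table",
        "disposition": "EXTERNAL_LINKS",
        "format": "ARROW_STREAM",
    },
).json()

statement_id = resp["statement_id"]
chunks = resp["manifest"]["chunks"]  # assumes state == SUCCEEDED already

def fetch_chunk(chunk_index: int) -> bytes:
    # Resolve the chunk's presigned URL, then download it directly
    # (no auth header on the presigned URL itself).
    links = requests.get(
        f"{HOST}/api/2.0/sql/statements/{statement_id}/result/chunks/{chunk_index}",
        headers=HEADERS,
    ).json()["external_links"]
    return requests.get(links[0]["external_link"]).content

# Chunks are independent, so they can be downloaded in parallel.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    blobs = list(pool.map(fetch_chunk, [c["chunk_index"] for c in chunks]))
```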
Ask
Are there any plans to support a similar chunking pattern for databricks-sql-python, and in lieu of that, is there currently any way to reach download-speed parity with the Statement Execution API? databricks-sql-python is great because it does not have the 100GB limit of the Statement Execution API, but the slow download speed is a major blocker for ML applications that require transferring large amounts of data, which, to be fair, may not be the use case databricks-sql-python was designed for.