Error sending Delta files to Azure Gen2 Storage if over a certain size. #2086
Comments
OK, I'm not sure of the details of that issue. It's worth pointing out that the file is created correctly in the normalized completed_jobs folder, but when it is sent to Azure the upload is attempted in a single block. If I use parquet rather than delta, the file is sent in small blocks. Delta files are also created fine on a local filesystem; it only fails when I target Azure. I'm not sure whether other cloud services have a similar issue.
@pwr-philarmstrong yes, if you set it to parquet, we use our own upload code and send it file by file (parallelized, but limited to the number of load step workers). For delta tables we use the Python deltalake package, which loads all files into memory before sending them. As @rudolfix pointed out, we are waiting for a bug to be fixed in the deltalake Rust backend.
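Under the hood the delta path boils down to a deltalake write like the one below; a minimal sketch, with placeholder paths and storage option keys that are indicative only:

```python
import pyarrow.parquet as pq
from deltalake import write_deltalake

# read one of the parquet files produced during normalization (placeholder path)
table = pq.read_table("completed_jobs/my_large_table.parquet")

# deltalake holds the whole table in memory and hands it to the Rust writer,
# which is the code path where the large single upload to ADLS Gen2 happens
write_deltalake(
    "abfss://my-container@myaccount.dfs.core.windows.net/delta/my_table",  # placeholder URL
    table,
    mode="append",
    storage_options={  # credential key names as accepted by object_store; indicative only
        "account_name": "myaccount",
        "account_key": "...",
    },
)
```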
Thanks for the clarification. I'll watch out for that bug being fixed.
Update: the delta bug delta-io/delta-rs#2968 (comment) is still open.
dlt version
1.3.0
Describe the problem
I have a pipeline that copies a table from SQL Server to Azure Gen2 storage. It creates delta files and works fine while the parquet files are small, but when they get larger the upload fails and goes into a retry loop.
Logging the Azure storage, I can see this sort of error detail:
The pipeline part looks like this:
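In outline it is a SQL Server table loaded to a filesystem destination with the delta table format set on the resource; a minimal sketch with placeholder names (credentials live in secrets.toml):

```python
import dlt
from dlt.sources.sql_database import sql_table

# placeholder table name; SQL Server credentials are resolved from secrets.toml
source = sql_table(table="my_large_table")
source.apply_hints(table_format="delta")  # write a Delta table on the filesystem destination

pipeline = dlt.pipeline(
    pipeline_name="mssql_to_adls",
    destination=dlt.destinations.filesystem(bucket_url="az://my-container/delta"),  # placeholder container
    dataset_name="raw",
)
print(pipeline.run(source))
```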
And this is a chunk of the log file from around where it fails, though it only shows that it is waiting and that it fails after 5 retries:
Expected behavior
I would expect the files to be sent to Azure storage successfully, either as a single upload (as with the smaller files) or broken into blocks.
Steps to reproduce
I've managed to create some code that reproduces the issue.
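A minimal sketch along those lines, generating a large DataFrame locally instead of reading from SQL Server (row counts and names are placeholders):

```python
import dlt
import numpy as np
import pandas as pd

# build a DataFrame big enough that the resulting parquet file is several hundred MB
# (placeholder size; grow it until the upload starts failing)
df = pd.DataFrame(
    np.random.rand(5_000_000, 20),
    columns=[f"col_{i}" for i in range(20)],
)

@dlt.resource(table_format="delta")  # with this hint removed (plain parquet load) the upload succeeds
def big_table():
    yield df

pipeline = dlt.pipeline(
    pipeline_name="delta_upload_repro",
    destination=dlt.destinations.filesystem(bucket_url="az://my-container/repro"),  # placeholder container
    dataset_name="repro",
)
print(pipeline.run(big_table()))
```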
It works if the file is sent as parquet, but it fails if the format is delta.
Operating system
Windows
Runtime environment
Local
Python version
3.11
dlt data source
Microsoft SQL Server, but the problem also happens with a DataFrame data source
dlt destination
Filesystem & buckets
Other deployment details
No response
Additional information
Looking at the Azure logs for smaller files, they look like this:
I also tried sending the larger parquet file using a standalone Python script and the azure.storage.blob package.
This worked fine and seemed to send the file in blocks.
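For reference, a minimal sketch of that kind of standalone upload (account, container and file names are placeholders; the block-size settings are illustrative):

```python
from azure.storage.blob import BlobServiceClient

# placeholder account/container/file names; the key would normally come from an env var or Key Vault
service = BlobServiceClient(
    account_url="https://myaccount.blob.core.windows.net",
    credential="<account-key>",
    max_single_put_size=4 * 1024 * 1024,  # anything bigger than 4 MiB is staged as blocks
    max_block_size=4 * 1024 * 1024,       # 4 MiB per staged block
)
blob = service.get_blob_client(container="my-container", blob="my_large_table.parquet")

with open("my_large_table.parquet", "rb") as f:
    blob.upload_blob(f, overwrite=True)  # the SDK stages the blocks and commits the block list
```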
The logs for one block look like this:
I was also able to send pure parquet files to Azure without an issue; however, delta seems to create larger parquet files.
I also tried adjusting a number of the dlt config items, e.g.:
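Settings along these lines (option names as documented for dlt; the values shown are illustrative, not a fix):

```python
import os

# rotate the intermediary parquet files earlier so each one stays smaller (values illustrative)
os.environ["NORMALIZE__DATA_WRITER__FILE_MAX_ITEMS"] = "100000"
os.environ["NORMALIZE__DATA_WRITER__FILE_MAX_BYTES"] = str(100 * 1024 * 1024)
# reduce the number of parallel load jobs
os.environ["LOAD__WORKERS"] = "2"
```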