
Fix a bug in the _upload_file_part_concurrent method #910

Merged: 2 commits merged into fsspec:main on Nov 6, 2024

Conversation

@nils-braun (Contributor)

The _upload_file_part_concurrent method is used as part of the put_file function to upload a file in multiple parts (when the file is larger than a certain size limit).
The function reads the original file in chunks (50 MB by default) and schedules up to 10 upload calls per batch. It has two branches: if more than one chunk remains, it schedules them in parallel; if only one remains, it runs the upload directly.

This last branch has a bug: it uses the variable chunk, which is defined in an enclosing scope (the for loop before it), so by the time this branch runs it no longer holds the chunk that should be uploaded. This leads to wrong data on the remote location: if you upload a file whose size is, e.g., between 20 * 50 MB and 21 * 50 MB, it will always be truncated to exactly 20 * 50 MB on S3. This bug is fixed in this PR.
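
For illustration, here is a minimal sketch of the loop in question (paraphrased, not the verbatim s3fs code; `upload_parts` and `upload_chunk` are placeholder names):

```python
import asyncio

async def upload_parts(f, upload_chunk, *, chunksize=50 * 2**20, max_concurrency=10):
    """Sketch of the buggy batching loop in _upload_file_part_concurrent."""
    out = []
    while True:
        chunks = []
        for _ in range(max_concurrency):
            chunk = f.read(chunksize)  # `chunk` leaks out of this for loop
            if chunk:
                chunks.append(chunk)
        if not chunks:
            break
        if len(chunks) > 1:
            # more than one chunk left: schedule the whole batch in parallel
            out.extend(
                await asyncio.gather(
                    *(upload_chunk(c, len(out) + i) for i, c in enumerate(chunks, 1))
                )
            )
        else:
            # BUG: `chunk` holds the result of the *last* f.read() above,
            # which is b"" once the file is exhausted -- not chunks[0] --
            # so the final partial part is uploaded as empty data.
            out.append(await upload_chunk(chunk, len(out) + 1))
    return out
```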

@martindurant (Member)

Thanks for the fix. It should be easy to test this, right?
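
For instance, a check along the following lines reproduces the truncation against the sketch above (a hypothetical illustration, not the test that was actually added in this PR):

```python
import asyncio
import io

async def fake_upload(chunk, part_number):
    return len(chunk)  # record only how many bytes each "part" carries

async def main():
    chunksize = 1024
    # 10 full chunks plus one partial chunk: the partial one hits the single-chunk branch
    data = b"x" * (10 * chunksize + chunksize // 2)
    sizes = await upload_parts(io.BytesIO(data), fake_upload, chunksize=chunksize)
    assert sum(sizes) == len(data), f"truncated: {sum(sizes)} != {len(data)}"

asyncio.run(main())  # AssertionError with the buggy loop; passes once fixed
```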

@martindurant (Member) commented Nov 5, 2024

Maybe a simpler fix would be to run "in parallel" even for just one remaining chunk.

@nils-braun (Contributor, Author)

@martindurant - I added a test and simplified the code to use only a single branch ("in parallel" for both cases).
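
In terms of the sketch above, the single-branch version looks roughly like this (again paraphrased, not the verbatim patch):

```python
async def upload_parts_fixed(f, upload_chunk, *, chunksize=50 * 2**20, max_concurrency=10):
    """Single branch: asyncio.gather handles one remaining chunk as well as many."""
    out = []
    while True:
        chunks = []
        for _ in range(max_concurrency):
            chunk = f.read(chunksize)
            if chunk:
                chunks.append(chunk)
        if not chunks:
            break
        # always schedule "in parallel"; the stale `chunk` variable is never used
        out.extend(
            await asyncio.gather(
                *(upload_chunk(c, len(out) + i) for i, c in enumerate(chunks, 1))
            )
        )
    return out
```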

@martindurant (Member)

Perfect, thank you

@martindurant merged commit ff8e4fe into fsspec:main on Nov 6, 2024
25 checks passed