[BUG] Java SDK "append" API uses enormous and uncontrollable amounts of memory #43640
Comments
@wheezil Thank you for reaching out. For security reasons, we cannot accept .zip files of your code so I have removed the download link. Please instead upload a code snippet and reproduction steps so we can look into the issue. @ibrahimrabab when additional context is provided, can you look into this issue?
Main test class
MemoryLogger util class
MarkableFileInputStream class, used to read from a file without buffering the entire file
pom.xml file to build using Maven
log4j2.properties file, put in resources to enable logger output
FYI I've been advised on Stack Overflow to simply use the raw REST API, but this is a much less desirable solution, as we'd always be chasing security updates and other changes, which we'd really prefer the SDK handle for us.
@jairmyree I believe you can remove the "needs more info" tag now that I've attached the code.
Describe the bug
We are using DataLakeFileSystemClient.appendWithResponse() to upload multiple parts in parallel. Despite making our own input stream, which does not buffer yet still satisfies the "markable" property, the SDK wants to buffer an arbitrary amount of data in memory. We see no way to control this. Furthermore, uploading several such files concurrently multiplies the amount of memory being used, leading to OOM at some point.
Exception or Stack Trace
To Reproduce
Make a Maven project out of the attached code snippets. This should be simple. Sorry, I tried to just attach a ZIP archive with the entire project, but it was rejected.
Build and run the project with arguments 8 200, which uploads 8 parts of 200 MB each in parallel. You can see the heap logged:
Since we are not buffering data in memory, why is the SDK doing it? Our stream is markable and the SDK should just read from the stream and rewind it if needed for a retry.
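To make the "markable without buffering" idea concrete, here is a minimal sketch of such a stream. It is not the class attached to this issue: the name PositionMarkableFileInputStream and its constructor parameters are hypothetical, and it assumes the part being uploaded is a byte range of a local file. mark() only records the current file offset and reset() seeks back to it, so a retry can replay the part without any of it being held in memory.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;

/** Hypothetical sketch: reads one byte range of a file; mark/reset seek instead of buffering. */
final class PositionMarkableFileInputStream extends InputStream {
    private final FileInputStream in;
    private final FileChannel channel; // shares its position with 'in'
    private final long end;            // absolute offset one past the last byte of this part
    private long markPos;              // absolute offset that reset() returns to

    PositionMarkableFileInputStream(String path, long offset, long length) throws IOException {
        this.in = new FileInputStream(path);
        this.channel = in.getChannel();
        channel.position(offset);
        this.end = Math.min(offset + length, channel.size());
        this.markPos = offset;
    }

    @Override
    public int read() throws IOException {
        if (channel.position() >= end) {
            return -1; // end of this part
        }
        return in.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        long remaining = end - channel.position();
        if (remaining <= 0) {
            return -1;
        }
        // Never read past the end of the part, even if the file continues.
        return in.read(b, off, (int) Math.min(len, remaining));
    }

    @Override
    public boolean markSupported() {
        return true;
    }

    @Override
    public synchronized void mark(int readLimit) {
        // readLimit is ignored: nothing is buffered, so any distance can be rewound.
        try {
            markPos = channel.position();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public synchronized void reset() throws IOException {
        channel.position(markPos); // seek back; the next read replays from the mark
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

A caller that retries would call mark() before the first attempt and reset() before each subsequent one. The only per-stream memory is whatever read buffer the caller supplies.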
Code Snippet
See attached
Expected behavior
Use no more than a reasonable amount of memory for in-flight data transfer: just enough to get good buffering performance from the local disk read, such as 128 KB per upload thread.
OR, have an alternative API which uses less memory.
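As an illustration of the memory footprint being asked for, the sketch below streams data through a single fixed 128 KiB buffer. It is not how the SDK transfers data, and FixedBufferCopy is a hypothetical name; the OutputStream stands in for whatever sink the transfer writes to. With one such buffer per upload thread, 8 parallel uploads would hold about 8 x 128 KiB = 1 MiB in flight, regardless of part size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical sketch: copy a stream using one fixed-size buffer. */
final class FixedBufferCopy {
    static final int BUFFER_SIZE = 128 * 1024; // 128 KiB, the figure suggested above

    /** Copies everything from 'in' to 'out' and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE]; // the only allocation that scales with nothing
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}
```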
Screenshots
Setup (please complete the following information):