[CORE] LocalPartitionWriter causes OOM during mergeSpills #7860
Comments
Would you like to post the OOM error message you have seen? Or is it only an abnormal RssFile consumption? |
CC @marin-ma |
So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, this eventually maps all the spill data into memory. The "Killed by YARN" error makes sense here. Let's see if ReadableFile performance is the same as MemoryMappedFile's. It's an easy fix if so. Otherwise we need to manually unmap the file. |
@jinchengchenghh, what's the way Velox uses to merge spill files? The most efficient way should be mmap + MADV_SEQUENTIAL + manual munmap: https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks |
@zhztheplayer we caught this issue by dumping
@FelixYBW Another approach is in … @ccat3z, can you commit a PR and trigger the community performance benchmark by adding the comment /Benchmark Velox? |
The Velox benchmark doesn't have spill. How much is the performance regression? Is the regression compared with the mmap file or with vanilla Spark? |
@marin-ma see if there is any way to call madvise(MADV_SEQUENTIAL); it should be able to release physical memory quickly. |
I tested the following patch on Arrow; it seems the memory will not be released immediately. It might still require an explicit …

```diff
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc	2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc	2024-11-09 14:50:37.035869206 +0800
@@ -41,6 +41,7 @@
 #include <sstream>
 #include <string>
 #include <utility>
+#include <iostream>
 // ----------------------------------------------------------------------
 // Other Arrow includes
@@ -575,6 +576,12 @@
       return Status::IOError("Memory mapping file failed: ",
                              ::arrow::internal::ErrnoMessage(errno));
     }
+    int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+    if (madv_res != 0) {
+      return Status::IOError("madvise failed: ",
+                             ::arrow::internal::ErrnoMessage(errno));
+    }
+    std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
     map_len_ = mmap_length;
     offset_ = offset;
     region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -720,6 +727,27 @@
   return ::arrow::internal::MemoryAdviseWillNeed(regions);
 }
```
In io_util.cc, madvise(MADV_WILLNEED) is called again. You may disable it. |
MemoryMappedFile has a Region; when the region is destructed, it calls munmap. See how we can make use of it. |
@kecookier Not sure how much performance impact this might introduce. This approach requires invoking mmap and munmap for each partition, and some partitions in a single spill file may be quite small.
How much performance regression do you see? Could you share some results? Thanks! |
Velox uses … As I see, it reads bufferSize_, which is controlled by QueryConfig. |
@marin-ma After internal re-testing last weekend, no noticeable performance regression was found. |
The new patch is here, but it's not working either. There might still be some code that hasn't been found; I'll recheck it if required.

```diff
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc	2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc	2024-11-09 18:09:06.540567675 +0800
@@ -41,6 +41,7 @@
 #include <sstream>
 #include <string>
 #include <utility>
+#include <iostream>
 // ----------------------------------------------------------------------
 // Other Arrow includes
@@ -575,6 +576,12 @@
       return Status::IOError("Memory mapping file failed: ",
                              ::arrow::internal::ErrnoMessage(errno));
     }
+    int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+    if (madv_res != 0) {
+      return Status::IOError("madvise failed: ",
+                             ::arrow::internal::ErrnoMessage(errno));
+    }
+    std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
     map_len_ = mmap_length;
     offset_ = offset;
     region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -660,8 +667,8 @@
   ARROW_ASSIGN_OR_RAISE(
       nbytes, internal::ValidateReadRange(position, nbytes, memory_map_->size()));
   // Arrange to page data in
-  RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
-      {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
+  // RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
+  //     {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
   return memory_map_->Slice(position, nbytes);
 }
```
The optimal way would be to map around 1 MB at a time and unmap it once accessed. Looks like MemoryMappedFile::Region can implement it. |
Thank you for your test. Can you try posix_fadvise? We have a case showing madvise doesn't work on file mappings. |
Not sure :( It was a test from two years ago. |
I wrote a simple gist to dump memory usage while reading the file. Only … So I believe that the combination of manual … |
Backend
VL (Velox)

Bug description
LocalPartitionWriter::mergeSpills uses arrow::io::MemoryMappedFile to read all spill files. But the memory is not munmapped in time, which results in a significant consumption of RssFile.

Spark version
None

Spark configurations
No response

System information
No response

Relevant logs
No response