[CORE] LocalParitionWriter causes OOM during mergeSpills #7860

Open
ccat3z opened this issue Nov 8, 2024 · 24 comments · May be fixed by #7861
Labels: bug (Something isn't working), triage

Comments

@ccat3z (Contributor) commented Nov 8, 2024

Backend

VL (Velox)

Bug description

LocalPartitionWriter::mergeSpills uses arrow::io::MemoryMappedFile to read all spill files, but the mappings are not munmapped in time, which results in significant RssFile consumption.

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

ccat3z added the bug and triage labels on Nov 8, 2024
@zhztheplayer (Member)

Would you like to post the OOM error message you have seen? Or is it only an abnormal RssFile consumption?

@ccat3z (Contributor, Author) commented Nov 8, 2024

[screenshot of the container-killed-by-YARN error]
The error message is the same as in #6947 (killed by YARN), but the root cause is different.

@zhouyuan (Contributor) commented Nov 8, 2024

CC @marin-ma

ccat3z changed the title from "[CORE] LocalParitionWrriter causes OOM during mergeSpills" to "[CORE] LocalParitionWriter causes OOM during mergeSpills" on Nov 8, 2024
@FelixYBW (Contributor) commented Nov 8, 2024

So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, it eventually maps all the spill data into memory. The "killed by YARN" error makes sense here.

Let's see whether ReadableFile performance matches MemoryMappedFile. If so, it's an easy fix; otherwise we need to manually unmap the file.
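
For reference, a minimal sketch of what the ReadableFile-based path could look like (not the Gluten code; the function name, output stream, and chunk size are illustrative). Unlike MemoryMappedFile, where every page touched stays resident in RssFile until Close(), ReadableFile copies data into buffers we allocate and free ourselves:

#include <algorithm>
#include <memory>
#include <string>

#include <arrow/buffer.h>
#include <arrow/io/file.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Hypothetical helper: stream one spill file into an output stream in 1 MB chunks.
arrow::Status copySpillWithReadableFile(const std::string& spillPath,
                                        arrow::io::OutputStream* out) {
  ARROW_ASSIGN_OR_RAISE(auto in, arrow::io::ReadableFile::Open(spillPath));
  ARROW_ASSIGN_OR_RAISE(int64_t size, in->GetSize());
  constexpr int64_t kChunk = 1 << 20;  // 1 MB read buffer
  for (int64_t offset = 0; offset < size; offset += kChunk) {
    ARROW_ASSIGN_OR_RAISE(auto buf, in->Read(std::min(kChunk, size - offset)));
    ARROW_RETURN_NOT_OK(out->Write(buf));
  }
  return in->Close();
}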

@FelixYBW (Contributor) commented Nov 8, 2024

@jinchengchenghh, what approach does Velox use to merge spill files?

The most efficient way should be mmap + MADV_SEQUENTIAL + manual munmap:

https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks
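
A minimal sketch of that pattern in plain POSIX (outside Arrow; the function name and callback are illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole spill file, hint a one-pass sequential scan, and unmap
// explicitly as soon as we are done so RssFile drops immediately instead
// of waiting for the file object to be closed.
bool scanSpillFile(const char* path, void (*consume)(const char*, size_t)) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;
  struct stat st{};
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return false; }
  void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping keeps its own reference to the file
  if (addr == MAP_FAILED) return false;
  madvise(addr, st.st_size, MADV_SEQUENTIAL);  // aggressive read-ahead for one pass
  consume(static_cast<const char*>(addr), st.st_size);
  munmap(addr, st.st_size);  // release the pages now, not at Close()
  return true;
}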

@kecookier (Contributor)

Would you like to post the OOM error message you have seen? Or is it only an abnormal RssFile consumption?

@zhztheplayer we caught this issue by dumping /proc/self/status when the executor is killed. It shows RssFile at almost 3 GB while VmRSS is 3.5 GB. This issue affects a number of our large-scale ETL jobs.
[screenshot of /proc/self/status output]

So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, it eventually maps all the spill data into memory. The "killed by YARN" error makes sense here.

Let's see whether ReadableFile performance matches MemoryMappedFile. If so, it's an easy fix; otherwise we need to manually unmap the file.

@FelixYBW
Arrow's MemoryMappedFile does not expose the underlying mmap address. However, its internal MemoryMappedFile::MemoryMap does have a method to get the region's data pointer, so we could add a method to MemoryMappedFile that returns head() or data().

Another approach is, in mergeSpills(), to open the file for each partition and close it when finished. We ran some internal performance tests, and they show ReadableFile may have some regression.

@ccat3z Can you submit a PR and trigger the community performance benchmark by adding the comment /Benchmark Velox?

@FelixYBW (Contributor) commented Nov 9, 2024

The Velox benchmark doesn't exercise spill. How much performance regression is it? Is the regression compared with the mmap-based file or with vanilla Spark?

@FelixYBW (Contributor) commented Nov 9, 2024

@marin-ma see if we can find a way to call madvise(MADV_SEQUENTIAL); it should be able to release physical memory quickly.

@ccat3z (Contributor, Author) commented Nov 9, 2024

@marin-ma see if we can find a way to call madvise(MADV_SEQUENTIAL); it should be able to release physical memory quickly.

I tested the following patch on Arrow; it seems the memory is still not released immediately. It might still require an explicit MADV_DONTNEED in this case.

diff -ru apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc apache-arrow-15.0.0/cpp/src/arrow/io/file.cc
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc   2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc        2024-11-09 14:50:37.035869206 +0800
@@ -41,6 +41,7 @@
 #include <sstream>
 #include <string>
 #include <utility>
+#include <iostream>

 // ----------------------------------------------------------------------
 // Other Arrow includes
@@ -575,6 +576,12 @@
       return Status::IOError("Memory mapping file failed: ",
                              ::arrow::internal::ErrnoMessage(errno));
     }
+    int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+    if (madv_res != 0) {
+      return Status::IOError("madvise failed: ",
+                             ::arrow::internal::ErrnoMessage(errno));
+    }
+    std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
     map_len_ = mmap_length;
     offset_ = offset;
     region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -720,6 +727,27 @@
   return ::arrow::internal::MemoryAdviseWillNeed(regions);
 }

@FelixYBW (Contributor) commented Nov 9, 2024

return ::arrow::internal::MemoryAdviseWillNeed(regions);
}

In io_util.cc, madvise(willneed) is called again. You may disable it.

@FelixYBW (Contributor) commented Nov 9, 2024

MemoryMappedFile has a Region; when the Region is destructed, it calls munmap. Let's see how we can make use of it.

@marin-ma (Contributor) commented Nov 11, 2024

Another approach is, in mergeSpills(), to open the file for each partition and close it when finished.

@kecookier Not sure how much performance impact this might introduce. This approach requires invoking mmap and munmap for each partition, and some partitions in a single spill file may be quite small.

We ran some internal performance tests, and they show ReadableFile may have some regression.

How much performance regression do you see? Could you share some results? Thanks!

@jinchengchenghh (Contributor)

Velox uses SpillReadFile to read the spill file. It uses FileInputStream to read the file and simd::memcpy to copy the bytes, and it outputs RowVector batches one by one. FileInputStream uses velox::LocalReadFile's pread or preadv to read the file.

As far as I can see, it reads bufferSize_ bytes at a time, controlled by the QueryConfig kSpillReadBufferSize (default 1 MB). Note: if the file system supports async read, it reads double bufferSize_ at a time.
@FelixYBW

readBytes = readSize();
      VELOX_CHECK_LT(
          0, readBytes, "Read past end of FileInputStream {}", fileSize_);
      NanosecondTimer timer_2{&readTimeNs};
      file_->pread(fileOffset_, readBytes, buffer()->asMutable<char>());

uint64_t FileInputStream::readSize() const {
  return std::min(fileSize_ - fileOffset_, bufferSize_);
}
/* Read data from file descriptor FD at the given position OFFSET
   without change the file pointer, and put the result in the buffers
   described by IOVEC, which is a vector of COUNT 'struct iovec's.
   The buffers are filled in the order specified.  Operates just like
   'pread' (see <unistd.h>) except that data are put in IOVEC instead
   of a contiguous buffer.

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern ssize_t preadv (int __fd, const struct iovec *__iovec, int __count,
		       __off_t __offset) __wur;
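
For illustration, a standalone sketch of that read pattern in plain POSIX (not the actual Velox code; the function name, buffer size, and callback are illustrative). Because the data lands in a reusable heap buffer rather than a mapping, it never accumulates in RssFile:

#include <algorithm>
#include <vector>

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Read the file sequentially in fixed-size chunks via pread, mirroring
// FileInputStream's readSize()/pread loop described above.
void readSpillInChunks(const char* path, size_t bufferSize,
                       void (*consume)(const char*, size_t)) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st{};
  if (fstat(fd, &st) != 0) { close(fd); return; }
  std::vector<char> buffer(bufferSize);
  for (off_t offset = 0; offset < st.st_size;) {
    size_t want = std::min<off_t>(bufferSize, st.st_size - offset);
    ssize_t n = pread(fd, buffer.data(), want, offset);
    if (n <= 0) break;            // error or unexpected EOF
    consume(buffer.data(), n);    // e.g. copy into the output batch
    offset += n;
  }
  close(fd);
}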

@ccat3z (Contributor, Author) commented Nov 11, 2024

How much performance regression do you see? Could you share some results? Thanks!

@marin-ma After internal re-testing last weekend, no noticeable performance regression was found.

@ccat3z (Contributor, Author) commented Nov 11, 2024

In io_util.cc, madvise(willneed) is called again. You may disable it.

The new patch is below, but it's not working either. There might still be some code paths I haven't found; I'll recheck if required.

diff -ru apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc apache-arrow-15.0.0/cpp/src/arrow/io/file.cc
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc   2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc        2024-11-09 18:09:06.540567675 +0800
@@ -41,6 +41,7 @@
 #include <sstream>
 #include <string>
 #include <utility>
+#include <iostream>

 // ----------------------------------------------------------------------
 // Other Arrow includes
@@ -575,6 +576,12 @@
       return Status::IOError("Memory mapping file failed: ",
                              ::arrow::internal::ErrnoMessage(errno));
     }
+    int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+    if (madv_res != 0) {
+      return Status::IOError("madvise failed: ",
+                             ::arrow::internal::ErrnoMessage(errno));
+    }
+    std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
     map_len_ = mmap_length;
     offset_ = offset;
     region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -660,8 +667,8 @@
   ARROW_ASSIGN_OR_RAISE(
       nbytes, internal::ValidateReadRange(position, nbytes, memory_map_->size()));
   // Arrange to page data in
-  RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
-      {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
+  // RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
+  //     {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
   return memory_map_->Slice(position, nbytes);
 }

@FelixYBW (Contributor)

If ReadableFile doesn't have a significant perf loss, let's use the quick fix and optimize with mmap in the future. @marin-ma can you test #7861 on our Jenkins with spill?

@FelixYBW (Contributor)

Velox uses SpillReadFile to read the spill file. It uses FileInputStream to read the file and simd::memcpy to copy the bytes, and it outputs RowVector batches one by one. FileInputStream uses velox::LocalReadFile's pread or preadv to read the file.

The optimal way would be to map roughly 1 MB at a time and unmap each chunk once it has been consumed; a minimal sketch of the idea is below. It looks like MemoryMappedFile::Region could implement this.
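
A standalone sketch of that chunked map/unmap scheme in plain POSIX (about 1 MB per window; not based on the Arrow Region API, names are illustrative):

#include <algorithm>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Scan a file through a sliding ~1 MB mapping window. Each window is
// munmapped as soon as it has been consumed, so resident file pages stay
// bounded by the window size instead of growing to the whole spill file.
void scanWithMappedWindows(const char* path,
                           void (*consume)(const char*, size_t)) {
  constexpr off_t kWindow = 1 << 20;  // 1 MB; mmap offsets must be page-aligned,
                                      // and 1 MB is a multiple of the page size
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st{};
  if (fstat(fd, &st) != 0) { close(fd); return; }
  for (off_t offset = 0; offset < st.st_size; offset += kWindow) {
    size_t len = std::min<off_t>(kWindow, st.st_size - offset);
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset);
    if (addr == MAP_FAILED) break;
    consume(static_cast<const char*>(addr), len);
    munmap(addr, len);  // releases this window's pages immediately
  }
  close(fd);
}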

@FelixYBW (Contributor)

In io_util.cc, madvise(willneed) is called again. You may disable it.

The new patch above isn't working either. There might still be some code paths I haven't found; I'll recheck if required.

Thank you for your test. Can you try posix_fadvise? We have seen a case where madvise doesn't work on a file mapping.
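
For reference, a minimal sketch of the posix_fadvise variant (plain POSIX read path; names and chunk size are illustrative). Note that posix_fadvise applies to the page cache through the file descriptor rather than to a mapping, so it pairs naturally with a pread loop:

#include <vector>

#include <fcntl.h>
#include <unistd.h>

// Read sequentially and tell the kernel to drop each consumed range from
// the page cache, so cached spill data does not pile up against the
// container's memory limit.
void readAndDropBehind(const char* path, void (*consume)(const char*, size_t)) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // whole file: favor read-ahead
  std::vector<char> buf(1 << 20);                  // 1 MB chunks
  off_t offset = 0;
  ssize_t n;
  while ((n = pread(fd, buf.data(), buf.size(), offset)) > 0) {
    consume(buf.data(), n);
    posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);  // drop what was just read
    offset += n;
  }
  close(fd);
}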

@FelixYBW (Contributor)

Just found some test:

[screenshot of test output]

@ccat3z (Contributor, Author) commented Nov 13, 2024

Just found some test:

[screenshot of test output]

Could you explain what each column in the stdout represents?

@FelixYBW (Contributor)

Could you explain what each column in the stdout represents?

Not sure :( It's a test from two years ago.

@FelixYBW (Contributor)

Summary:

[screenshot of summary]

@ccat3z (Contributor, Author) commented Nov 14, 2024

I wrote a simple gist to dump memory usage while reading a file. Only MADV_DONTNEED and munmap release RssFile immediately. MAP_POPULATE significantly improves performance, but it uses hundreds of MB of uncontrollable RssFile, which is not acceptable in this case.

So I believe the combination of manual MADV_WILLNEED and MADV_DONTNEED is the best solution; a minimal sketch is below. I will test it in Gluten later.
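
A minimal sketch of that scheme over a single whole-file mapping (plain POSIX; the window size and names are illustrative): page in the current window with MADV_WILLNEED, consume it, then hand its pages back with MADV_DONTNEED so they leave RssFile right away.

#include <algorithm>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Keep one mapping of the whole spill file, but only ever keep about one
// window of it resident: prefetch a window, consume it, then drop it.
void scanWithAdvise(const char* path, void (*consume)(const char*, size_t)) {
  constexpr off_t kWindow = 1 << 20;  // 1 MB, a multiple of the page size
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st{};
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return; }
  void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);
  if (mapped == MAP_FAILED) return;
  char* base = static_cast<char*>(mapped);
  for (off_t offset = 0; offset < st.st_size; offset += kWindow) {
    size_t len = std::min<off_t>(kWindow, st.st_size - offset);
    madvise(base + offset, len, MADV_WILLNEED);   // page this window in
    consume(base + offset, len);
    madvise(base + offset, len, MADV_DONTNEED);   // and drop it from RssFile
  }
  munmap(base, st.st_size);
}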

FelixYBW reopened this on Nov 14, 2024