[CORE] LocalParitionWriter causes OOM during mergeSpills #7860

Open
ccat3z opened this issue Nov 8, 2024 · 24 comments · May be fixed by #7861
Labels: bug (Something isn't working), triage

Comments

@ccat3z (Contributor) commented Nov 8, 2024

Backend

VL (Velox)

Bug description

LocalPartitionWriter::mergeSpills uses arrow::io::MemoryMappedFile to read all spill files, but the mappings are not munmapped in time, which results in significant RssFile consumption.

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

ccat3z added the bug and triage labels on Nov 8, 2024
@zhztheplayer (Member)

Would you like to post the OOM error message you have seen? Or is it only an abnormal RssFile consumption?

@ccat3z (Contributor, Author) commented Nov 8, 2024

[screenshot of the container-killed-by-YARN error]
The error message is the same as in #6947 (killed by YARN), but the root cause is different.

@zhouyuan (Contributor) commented Nov 8, 2024

CC @marin-ma

ccat3z changed the title from "[CORE] LocalParitionWrriter causes OOM during mergeSpills" to "[CORE] LocalParitionWriter causes OOM during mergeSpills" on Nov 8, 2024
@FelixYBW (Contributor) commented Nov 8, 2024

So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, it eventually maps all the spill data into memory. The "killed by YARN" error makes sense here.

Let's see whether ReadableFile performance matches MemoryMappedFile. If so, it's an easy fix; otherwise we need to manually unmap the file.
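
For reference, a minimal sketch of what the ReadableFile-based path could look like (not the Gluten code; the function name, output stream, and chunk size are illustrative). Unlike MemoryMappedFile, where every page touched stays resident in RssFile until Close(), ReadableFile copies data into buffers we allocate and free ourselves:

#include <algorithm>
#include <memory>
#include <string>

#include <arrow/buffer.h>
#include <arrow/io/file.h>
#include <arrow/io/interfaces.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Hypothetical helper: stream one spill file into an output stream in 1 MB chunks.
arrow::Status copySpillWithReadableFile(const std::string& spillPath,
                                        arrow::io::OutputStream* out) {
  ARROW_ASSIGN_OR_RAISE(auto in, arrow::io::ReadableFile::Open(spillPath));
  ARROW_ASSIGN_OR_RAISE(int64_t size, in->GetSize());
  constexpr int64_t kChunk = 1 << 20;  // 1 MB read buffer
  for (int64_t offset = 0; offset < size; offset += kChunk) {
    ARROW_ASSIGN_OR_RAISE(auto buf, in->Read(std::min(kChunk, size - offset)));
    ARROW_RETURN_NOT_OK(out->Write(buf));
  }
  return in->Close();
}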

@FelixYBW (Contributor) commented Nov 8, 2024

@jinchengchenghh, what approach does Velox use to merge spill files?

The most efficient way should be mmap + MADV_SEQUENTIAL + manual munmap:

https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks
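
A minimal sketch of that pattern in plain POSIX (outside Arrow; the function name and callback are illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole spill file, hint a one-pass sequential scan, and unmap
// explicitly as soon as we are done so RssFile drops immediately instead
// of waiting for the file object to be closed.
bool scanSpillFile(const char* path, void (*consume)(const char*, size_t)) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;
  struct stat st{};
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return false; }
  void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping keeps its own reference to the file
  if (addr == MAP_FAILED) return false;
  madvise(addr, st.st_size, MADV_SEQUENTIAL);  // aggressive read-ahead for one pass
  consume(static_cast<const char*>(addr), st.st_size);
  munmap(addr, st.st_size);  // release the pages now, not at Close()
  return true;
}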

@kecookier (Contributor)

Would you like to post the OOM error message you have seen? Or is it only an abnormal RssFile consumption?

@zhztheplayer we caught this issue by dumping /proc/self/status when the executor is killed. It shows RssFile at almost 3 GB while VmRSS is 3.5 GB. This issue affects a number of our large-scale ETL jobs.
[screenshot of /proc/self/status output]

So the root cause is that the memory is not unmapped until the file is closed. When we merge the spills, it eventually maps all the spill data into memory. The "killed by YARN" error makes sense here.

Let's see whether ReadableFile performance matches MemoryMappedFile. If so, it's an easy fix; otherwise we need to manually unmap the file.

@FelixYBW
Arrow's MemoryMappedFile does not expose the underlying mmap address. However, its internal MemoryMappedFile::MemoryMap does have a method to get the region's data pointer, so we could add a method to MemoryMappedFile that returns head() or data().

Another approach is, in mergeSpills(), to open the file for each partition and close it when finished. We ran some internal performance tests, and they show ReadableFile may have some regression.

@ccat3z Can you submit a PR and trigger the community performance benchmark by adding the comment /Benchmark Velox?

@FelixYBW (Contributor) commented Nov 9, 2024

The Velox benchmark doesn't exercise spill. How much performance regression is it? Is the regression compared with the mmap-based file or with vanilla Spark?

@FelixYBW (Contributor) commented Nov 9, 2024

@marin-ma see if we can find a way to call madvise(MADV_SEQUENTIAL); it should be able to release physical memory quickly.

@ccat3z (Contributor, Author) commented Nov 9, 2024

@marin-ma see if we can find a way to call madvise(MADV_SEQUENTIAL); it should be able to release physical memory quickly.

I tested the following patch on Arrow; it seems the memory is still not released immediately. It might still require an explicit MADV_DONTNEED in this case.

diff -ru apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc apache-arrow-15.0.0/cpp/src/arrow/io/file.cc
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc   2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc        2024-11-09 14:50:37.035869206 +0800
@@ -41,6 +41,7 @@
 #include <sstream>
 #include <string>
 #include <utility>
+#include <iostream>

 // ----------------------------------------------------------------------
 // Other Arrow includes
@@ -575,6 +576,12 @@
       return Status::IOError("Memory mapping file failed: ",
                              ::arrow::internal::ErrnoMessage(errno));
     }
+    int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+    if (madv_res != 0) {
+      return Status::IOError("madvise failed: ",
+                             ::arrow::internal::ErrnoMessage(errno));
+    }
+    std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
     map_len_ = mmap_length;
     offset_ = offset;
     region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -720,6 +727,27 @@
   return ::arrow::internal::MemoryAdviseWillNeed(regions);
 }

@FelixYBW (Contributor) commented Nov 9, 2024

return ::arrow::internal::MemoryAdviseWillNeed(regions);
}

In io_util.cc, madvise(willneed) is called again. You may disable it.

@FelixYBW (Contributor) commented Nov 9, 2024

MemoryMappedFile has a Region; when the Region is destructed, it calls munmap. Let's see how we can make use of it.

@marin-ma (Contributor) commented Nov 11, 2024

Another approach is, in mergeSpills(), to open the file for each partition and close it when finished.

@kecookier Not sure how much performance impact this might introduce. This approach requires invoking mmap and munmap for each partition, and some partitions in a single spill file may be quite small.

We ran some internal performance tests, and they show ReadableFile may have some regression.

How much performance regression do you see? Could you share some results? Thanks!

@jinchengchenghh (Contributor)

Velox uses SpillReadFile to read the spill file. It uses FileInputStream to read the file and simd::memcpy to copy the bytes, and it outputs RowVector batches one by one. FileInputStream uses velox::LocalReadFile's pread or preadv to read the file.

As far as I can see, it reads bufferSize_ bytes at a time, controlled by the QueryConfig kSpillReadBufferSize (default 1 MB). Note: if the file system supports async read, it reads double bufferSize_ at a time.
@FelixYBW

readBytes = readSize();
      VELOX_CHECK_LT(
          0, readBytes, "Read past end of FileInputStream {}", fileSize_);
      NanosecondTimer timer_2{&readTimeNs};
      file_->pread(fileOffset_, readBytes, buffer()->asMutable<char>());

uint64_t FileInputStream::readSize() const {
  return std::min(fileSize_ - fileOffset_, bufferSize_);
}
/* Read data from file descriptor FD at the given position OFFSET
   without change the file pointer, and put the result in the buffers
   described by IOVEC, which is a vector of COUNT 'struct iovec's.
   The buffers are filled in the order specified.  Operates just like
   'pread' (see <unistd.h>) except that data are put in IOVEC instead
   of a contiguous buffer.

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern ssize_t preadv (int __fd, const struct iovec *__iovec, int __count,
		       __off_t __offset) __wur;
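
For illustration, a standalone sketch of that read pattern in plain POSIX (not the actual Velox code; the function name, buffer size, and callback are illustrative). Because the data lands in a reusable heap buffer rather than a mapping, it never accumulates in RssFile:

#include <algorithm>
#include <vector>

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Read the file sequentially in fixed-size chunks via pread, mirroring
// FileInputStream's readSize()/pread loop described above.
void readSpillInChunks(const char* path, size_t bufferSize,
                       void (*consume)(const char*, size_t)) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st{};
  if (fstat(fd, &st) != 0) { close(fd); return; }
  std::vector<char> buffer(bufferSize);
  for (off_t offset = 0; offset < st.st_size;) {
    size_t want = std::min<off_t>(bufferSize, st.st_size - offset);
    ssize_t n = pread(fd, buffer.data(), want, offset);
    if (n <= 0) break;            // error or unexpected EOF
    consume(buffer.data(), n);    // e.g. copy into the output batch
    offset += n;
  }
  close(fd);
}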

@ccat3z (Contributor, Author) commented Nov 11, 2024

How much performance regression do you see? Could you share some results? Thanks!

@marin-ma After internal re-testing last weekend, no noticeable performance regression was found.

@ccat3z (Contributor, Author) commented Nov 11, 2024

In io_util.cc, madvise(willneed) is called again. You may disable it.

The new patch is below, but it's not working either. There might still be some code paths I haven't found; I'll recheck if required.

diff -ru apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc apache-arrow-15.0.0/cpp/src/arrow/io/file.cc
--- apache-arrow-15.0.0.orig/cpp/src/arrow/io/file.cc   2024-11-09 11:39:58.497266369 +0800
+++ apache-arrow-15.0.0/cpp/src/arrow/io/file.cc        2024-11-09 18:09:06.540567675 +0800
@@ -41,6 +41,7 @@
 #include <sstream>
 #include <string>
 #include <utility>
+#include <iostream>

 // ----------------------------------------------------------------------
 // Other Arrow includes
@@ -575,6 +576,12 @@
       return Status::IOError("Memory mapping file failed: ",
                              ::arrow::internal::ErrnoMessage(errno));
     }
+    int madv_res = madvise(result, mmap_length, MADV_SEQUENTIAL);
+    if (madv_res != 0) {
+      return Status::IOError("madvise failed: ",
+                             ::arrow::internal::ErrnoMessage(errno));
+    }
+    std::cerr << "madvise success: " << result << " " << mmap_length << std::endl;
     map_len_ = mmap_length;
     offset_ = offset;
     region_ = std::make_shared<Region>(shared_from_this(), static_cast<uint8_t*>(result),
@@ -660,8 +667,8 @@
   ARROW_ASSIGN_OR_RAISE(
       nbytes, internal::ValidateReadRange(position, nbytes, memory_map_->size()));
   // Arrange to page data in
-  RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
-      {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
+  // RETURN_NOT_OK(::arrow::internal::MemoryAdviseWillNeed(
+  //     {{memory_map_->data() + position, static_cast<size_t>(nbytes)}}));
   return memory_map_->Slice(position, nbytes);
 }

@FelixYBW (Contributor)

If ReadableFile doesn't have a significant perf loss, let's use the quick fix and optimize with mmap in the future. @marin-ma can you test #7861 on our Jenkins with spill?

@FelixYBW (Contributor)

Velox uses SpillReadFile to read the spill file. It uses FileInputStream to read the file and simd::memcpy to copy the bytes, and it outputs RowVector batches one by one. FileInputStream uses velox::LocalReadFile's pread or preadv to read the file.

The optimal way would be to map roughly 1 MB at a time and unmap each chunk once it has been consumed; a minimal sketch of the idea is below. It looks like MemoryMappedFile::Region could implement this.
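
A standalone sketch of that chunked map/unmap scheme in plain POSIX (about 1 MB per window; not based on the Arrow Region API, names are illustrative):

#include <algorithm>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Scan a file through a sliding ~1 MB mapping window. Each window is
// munmapped as soon as it has been consumed, so resident file pages stay
// bounded by the window size instead of growing to the whole spill file.
void scanWithMappedWindows(const char* path,
                           void (*consume)(const char*, size_t)) {
  constexpr off_t kWindow = 1 << 20;  // 1 MB; mmap offsets must be page-aligned,
                                      // and 1 MB is a multiple of the page size
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st{};
  if (fstat(fd, &st) != 0) { close(fd); return; }
  for (off_t offset = 0; offset < st.st_size; offset += kWindow) {
    size_t len = std::min<off_t>(kWindow, st.st_size - offset);
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset);
    if (addr == MAP_FAILED) break;
    consume(static_cast<const char*>(addr), len);
    munmap(addr, len);  // releases this window's pages immediately
  }
  close(fd);
}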

@FelixYBW (Contributor)

In io_util.cc, madvise(willneed) is called again. You may disable it.

The new patch above isn't working either. There might still be some code paths I haven't found; I'll recheck if required.

Thank you for your test. Can you try posix_fadvise? We have seen a case where madvise doesn't work on a file mapping.
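
For reference, a minimal sketch of the posix_fadvise variant (plain POSIX read path; names and chunk size are illustrative). Note that posix_fadvise applies to the page cache through the file descriptor rather than to a mapping, so it pairs naturally with a pread loop:

#include <vector>

#include <fcntl.h>
#include <unistd.h>

// Read sequentially and tell the kernel to drop each consumed range from
// the page cache, so cached spill data does not pile up against the
// container's memory limit.
void readAndDropBehind(const char* path, void (*consume)(const char*, size_t)) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // whole file: favor read-ahead
  std::vector<char> buf(1 << 20);                  // 1 MB chunks
  off_t offset = 0;
  ssize_t n;
  while ((n = pread(fd, buf.data(), buf.size(), offset)) > 0) {
    consume(buf.data(), n);
    posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);  // drop what was just read
    offset += n;
  }
  close(fd);
}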

@FelixYBW (Contributor)

Just found some test:

[screenshot of test output]

@ccat3z (Contributor, Author) commented Nov 13, 2024

Just found some test:

[screenshot of test output]

Could you explain what each column in the stdout represents?

@FelixYBW (Contributor)

Could you explain what each column in the stdout represents?

Not sure :( It's a test from two years ago.

@FelixYBW (Contributor)

Summary:

[screenshot of summary]

@ccat3z (Contributor, Author) commented Nov 14, 2024

I wrote a simple gist to dump memory usage while reading a file. Only MADV_DONTNEED and munmap release RssFile immediately. MAP_POPULATE significantly improves performance, but it uses hundreds of MB of uncontrollable RssFile, which is not acceptable in this case.

So I believe the combination of manual MADV_WILLNEED and MADV_DONTNEED is the best solution; a minimal sketch is below. I will test it in Gluten later.
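
A minimal sketch of that scheme over a single whole-file mapping (plain POSIX; the window size and names are illustrative): page in the current window with MADV_WILLNEED, consume it, then hand its pages back with MADV_DONTNEED so they leave RssFile right away.

#include <algorithm>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Keep one mapping of the whole spill file, but only ever keep about one
// window of it resident: prefetch a window, consume it, then drop it.
void scanWithAdvise(const char* path, void (*consume)(const char*, size_t)) {
  constexpr off_t kWindow = 1 << 20;  // 1 MB, a multiple of the page size
  int fd = open(path, O_RDONLY);
  if (fd < 0) return;
  struct stat st{};
  if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return; }
  void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);
  if (mapped == MAP_FAILED) return;
  char* base = static_cast<char*>(mapped);
  for (off_t offset = 0; offset < st.st_size; offset += kWindow) {
    size_t len = std::min<off_t>(kWindow, st.st_size - offset);
    madvise(base + offset, len, MADV_WILLNEED);   // page this window in
    consume(base + offset, len);
    madvise(base + offset, len, MADV_DONTNEED);   // and drop it from RssFile
  }
  munmap(base, st.st_size);
}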

FelixYBW reopened this on Nov 14, 2024