
[GLUTEN-7860][CORE] In shuffle writer, replace MemoryMappedFile to avoid OOM #7861

Open

ccat3z wants to merge 3 commits into main from mmap-read-file

Conversation

ccat3z
Contributor

@ccat3z ccat3z commented Nov 8, 2024

What changes were proposed in this pull request?

This PR fixes #7860 by introducing MmapFileStream, which extends arrow::io::InputStream. MmapFileStream invokes madvise(MADV_DONTNEED) to release previously read memory when it reads the next range of data.
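For readers unfamiliar with the pattern, below is a minimal standalone sketch of the idea (not the PR's code): consume an mmap-ed spill file sequentially and hand the already-consumed, page-aligned prefix back to the kernel with madvise(MADV_DONTNEED), so the resident set stays bounded even for a multi-GB file. The 8 MB chunk size and the checksum "consumer" are only illustrative.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
  if (argc < 2) {
    std::fprintf(stderr, "usage: %s <spill-file>\n", argv[0]);
    return 1;
  }
  const int fd = ::open(argv[1], O_RDONLY);
  if (fd < 0) return 1;

  struct stat st;
  if (::fstat(fd, &st) != 0) return 1;
  const int64_t size = st.st_size;

  void* mapped = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (mapped == MAP_FAILED) return 1;
  auto* data = static_cast<uint8_t*>(mapped);

  const int64_t pageSize = ::sysconf(_SC_PAGESIZE);
  const int64_t pageMask = ~(pageSize - 1);

  int64_t pos = 0;                // bytes consumed so far
  int64_t posRetain = 0;          // start of pages not yet released
  const int64_t chunk = 8 << 20;  // pretend the merger copies 8 MB at a time

  uint64_t checksum = 0;
  while (pos < size) {
    const int64_t n = std::min<int64_t>(chunk, size - pos);
    // "Consume" the next range; the real merger would copy it into the
    // merged output instead of checksumming it.
    for (int64_t i = 0; i < n; ++i) checksum += data[pos + i];
    pos += n;

    // Hand the fully consumed, page-aligned prefix back to the kernel so it
    // can be reclaimed instead of growing the process RSS.
    const int64_t purge = (pos - posRetain) & pageMask;
    if (purge > 0 && ::madvise(data + posRetain, purge, MADV_DONTNEED) == 0) {
      posRetain += purge;
    }
  }

  ::munmap(data, size);
  ::close(fd);
  std::printf("checksum=%llu\n", static_cast<unsigned long long>(checksum));
  return 0;
}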

How was this patch tested?

// Generate 10 partitions, each partition has about 10GB random data.
def gen(scale: Int, parts: Int) = {
  sc.parallelize(1 to (1024*1024), numSlices = 1000)
    .map(x => (x % 1000, randStr(scale * parts)))
    .repartition(parts)
    .toDF("a", "b")
    .save./* ... */
}

// Trigger shuffle spill by `repartition(50)`.
def test(parts: Int = 50) = {
  spark.read./* ... */.repartition(parts)
    .filter(expr("a < 0*rand()")) // non-deterministic predicate so the filter is not pushed down past the repartition
}
# Executor Memory Config
spark.executor.memory=512M
spark.yarn.executor.memoryOverhead=512M
spark.gluten.memory.offHeap.size.in.bytes=1610612736

Test Result:

| impl | avg time to merge spills (s) | avg total spilled size of each task (MB) |
| --- | --- | --- |
| read (arrow ReadableFile) | 10.58706836156 | 9935.920098495480 |
| mmap (open required range by MemoryMappedFile) | 6.602059312420000 | 9935.920098495480 |
| madv (this PR) | 6.73993204562 | 9935.920098495480 |
| mmap (replace madv with munmap in this PR) | 6.55791399852 | 9935.920098495480 |

The munmap patch used in the test above:

diff --git a/cpp/core/shuffle/Utils.cc b/cpp/core/shuffle/Utils.cc
index 1ceb777f1..742c53c90 100644
--- a/cpp/core/shuffle/Utils.cc
+++ b/cpp/core/shuffle/Utils.cc
@@ -243,9 +243,9 @@ void MmapFileStream::advance(int64_t length) {
 
   auto purgeLength = (pos_ - posRetain_) & pageMask;
   if (purgeLength > 0) {
-    int ret = madvise(data_ + posRetain_, purgeLength, MADV_DONTNEED);
+    int ret = munmap(data_ + posRetain_, purgeLength);
     if (ret != 0) {
-      LOG(WARNING) << "fadvise failed " << ::arrow::internal::ErrnoMessage(errno);
+      LOG(WARNING) << "munmap failed " << ::arrow::internal::ErrnoMessage(errno);
     }
     posRetain_ += purgeLength;
   }
@@ -269,7 +269,7 @@ void MmapFileStream::willNeed(int64_t length) {
 
 arrow::Status MmapFileStream::Close() {
   if (data_ != nullptr) {
-    int result = munmap(data_, size_);
+    int result = munmap(data_ + posRetain_, size_ - posRetain_);
     if (result != 0) {
       LOG(WARNING) << "munmap failed";
     }

@github-actions github-actions bot added the VELOX label Nov 8, 2024

@ccat3z
Contributor Author

ccat3z commented Nov 8, 2024

cc @kecookier

@zhztheplayer zhztheplayer changed the title [GLUTEN-7860][CORE] Replace MemoryMappedFile with ReadableFile to avoid OOM [GLUTEN-7860][CORE] In shuffle writer, replace MemoryMappedFile with ReadableFile to avoid OOM Nov 8, 2024
@kecookier
Contributor

/Benchmark Velox

@ccat3z
Contributor Author

ccat3z commented Nov 9, 2024

/Benchmark Velox

@ccat3z ccat3z marked this pull request as ready for review November 9, 2024 03:15
@ccat3z
Contributor Author

ccat3z commented Nov 9, 2024

/Benchmark Velox

Member

@zhztheplayer zhztheplayer left a comment

@ccat3z Do you see #7860 fixed with this approach?

I am triggering a benchmark manually.

cc @marin-ma @FelixYBW

@@ -73,7 +73,7 @@ void Spill::insertPayload(

 void Spill::openSpillFile() {
   if (!is_) {
-    GLUTEN_ASSIGN_OR_THROW(is_, arrow::io::MemoryMappedFile::Open(spillFile_, arrow::io::FileMode::READ));
+    GLUTEN_ASSIGN_OR_THROW(is_, arrow::io::ReadableFile::Open(spillFile_));
Member

Is the API implemented with buffered read?

Not sure whether https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/buffered.h may help here.
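For reference, a minimal sketch (not part of this PR) of what the buffered.h suggestion would look like: wrap the plain ReadableFile in an arrow::io::BufferedInputStream so small sequential reads are served from a user-space buffer. The helper name openBufferedSpill and the 1 MiB buffer size are illustrative assumptions.

#include <memory>
#include <string>

#include <arrow/io/buffered.h>
#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>

// Open the spill file and layer a buffered stream over it so that many small
// sequential reads turn into fewer, larger reads against the file.
arrow::Result<std::shared_ptr<arrow::io::BufferedInputStream>> openBufferedSpill(
    const std::string& spillFile) {
  ARROW_ASSIGN_OR_RAISE(auto raw, arrow::io::ReadableFile::Open(spillFile));
  return arrow::io::BufferedInputStream::Create(
      /*buffer_size=*/1 << 20, arrow::default_memory_pool(), std::move(raw));
}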

Contributor

Spill merge doesn't need buffering.

@marin-ma
Contributor

> I am triggering a benchmark manually.

@zhztheplayer There's no shuffle spill on jenkins. The change won't be tested.

@zhztheplayer
Member

> I am triggering a benchmark manually.
>
> @zhztheplayer There's no shuffle spill on jenkins. The change won't be tested.

I thought we always rely on Spark-controlled spill in shuffle. Does the Jenkins CI always have enough memory for all shuffle data?


@FelixYBW
Contributor

> @zhztheplayer There's no shuffle spill on jenkins. The change won't be tested.

Is it because the spill will be triggered on other operators in the pipeline, like a sort + shuffle? Will the sort be triggered, or the shuffle?

@FelixYBW
Contributor

@zhztheplayer @marin-ma can we create a query and config to test it?

@ccat3z ccat3z changed the title [GLUTEN-7860][CORE] In shuffle writer, replace MemoryMappedFile with ReadableFile to avoid OOM [GLUTEN-7860][CORE] In shuffle writer, replace MemoryMappedFile to avoid OOM Nov 18, 2024
@ccat3z ccat3z force-pushed the mmap-read-file branch 2 times, most recently from eaf10aa to 43a4f06 Compare November 18, 2024 03:34
@ccat3z
Contributor Author

ccat3z commented Nov 18, 2024

@FelixYBW @zhztheplayer I added MmapFileStream in this PR. MmapFileStream invokes madvise(MADV_DONTNEED) to release previously read memory when reading the next range of data. The test approach and results have been updated in the PR description.

Comment on lines 230 to 233
auto fstream = std::shared_ptr<MmapFileStream>(new MmapFileStream());
fstream->fd_ = std::move(fd);
fstream->data_ = static_cast<uint8_t*>(result);
fstream->size_ = size;
Contributor

Can we use std::make_shared and pass the arguments through the constructor?
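A rough sketch of what the suggestion amounts to, with simplified stand-in types rather than the PR's actual definitions (the real constructor takes an arrow::internal::FileDescriptor, as a later snippet shows):

#include <cstdint>
#include <memory>

// Simplified stand-in for the real class, which wraps the mapped state
// behind arrow::io::InputStream.
class MmapStreamSketch {
 public:
  MmapStreamSketch(int fd, uint8_t* data, int64_t size)
      : fd_(fd), data_(data), size_(size) {}

 private:
  int fd_;
  uint8_t* data_;
  int64_t size_;
};

// Factory: construct and initialize in one allocation, with no
// post-construction member assignment needed.
std::shared_ptr<MmapStreamSketch> makeStream(int fd, void* mapped, int64_t size) {
  return std::make_shared<MmapStreamSketch>(fd, static_cast<uint8_t*>(mapped), size);
}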

@@ -72,4 +72,34 @@ arrow::Result<std::shared_ptr<arrow::RecordBatch>> makeUncompressedRecordBatch(

std::shared_ptr<arrow::Buffer> zeroLengthNullBuffer();

class MmapFileStream : public arrow::io::InputStream {
Contributor

Could you please add some comments to explain the usage/functionality for this class?

Contributor

@marin-ma marin-ma left a comment

Some minor comments. Thanks!

// to prefetch and release memory timely.
class MmapFileStream : public arrow::io::InputStream {
 public:
  MmapFileStream(arrow::internal::FileDescriptor fd, uint8_t* data, int64_t size)
Contributor

Please separate the declaration and definition. And add a blank line between two member functions.

  arrow::Status Close() override;
  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override;
  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override;
  bool closed() const override {
Contributor

ditto

};

private:
arrow::Result<int64_t> actualReadSize(int64_t nbytes) {
Contributor

ditto

@@ -72,4 +72,37 @@ arrow::Result<std::shared_ptr<arrow::RecordBatch>> makeUncompressedRecordBatch(

std::shared_ptr<arrow::Buffer> zeroLengthNullBuffer();

// MmapFileStream is used to optimize sequential file reading. It uses madvise
// to prefetch and release memory timely.
class MmapFileStream : public arrow::io::InputStream {
Contributor

You may contribute MmapFileStream to Apache Arrow in future.

@FelixYBW
Contributor

Thank you. Looks like a good solution!

Contributor

@marin-ma marin-ma left a comment

LGTM. Thanks!

@zhztheplayer
Member

> Is it because the spill will be triggered on other operators in the pipeline, like a sort + shuffle? Will the sort be triggered, or the shuffle?

So far the spill is triggered on whichever component holds more memory, no matter whether it's a Velox operator or the shuffle. We have a basic priority setting in the Spiller API, and in the future we can extend it to implement a fixed spill order.

@FelixYBW
Contributor

> Is it because the spill will be triggered on other operators in the pipeline, like a sort + shuffle? Will the sort be triggered, or the shuffle?

> So far the spill is triggered on whichever component holds more memory, no matter whether it's a Velox operator or the shuffle. We have a basic priority setting in the Spiller API, and in the future we can extend it to implement a fixed spill order.

So now once spill is called, every operator's spill is triggered, right?

@zhztheplayer
Member

zhztheplayer commented Nov 20, 2024

> Is it because the spill will be triggered on other operators in the pipeline, like a sort + shuffle? Will the sort be triggered, or the shuffle?

> So far the spill is triggered on whichever component holds more memory, no matter whether it's a Velox operator or the shuffle. We have a basic priority setting in the Spiller API, and in the future we can extend it to implement a fixed spill order.

> So now once spill is called, every operator's spill is triggered, right?

We pass a target spill size to the Velox API, so the spill call usually stops once enough memory has been reclaimed. As a result, some of the operators can be skipped in the procedure.

Successfully merging this pull request may close these issues.

[CORE] LocalParitionWriter causes OOM during mergeSpills
6 participants