repartition-based fallback for hash aggregate v3 #11712

Draft · wants to merge 52 commits into branch-24.12

Conversation

binmahone (Collaborator) commented Nov 8, 2024

This PR replaces #11116, since it has diverged too much from #11116.

binmahone marked this pull request as draft November 8, 2024 07:21
binmahone changed the title from "240821 repartition agg v3" to "repartition-based fallback for hash aggregate v3" Nov 8, 2024
abellina requested review from abellina and revans2 and removed request for revans2 November 8, 2024 14:33
revans2 (Collaborator) left a comment
I have not finished reviewing yet. Could you post an explanation of the changes? I see places in the code that appear to have duplicate functionality, not to mention that the old sort-based agg code completely duplicates a lot of the newer hash repartition-based code.

I really just want to understand what the workflow is supposed to be.

```diff
@@ -335,7 +513,10 @@ class AggHelper(
     // We need to merge the aggregated batches into 1 before calling post process,
     // if the aggregate code had to split on a retry
     if (aggregatedSeq.size > 1) {
-      val concatted = concatenateBatches(metrics, aggregatedSeq)
+      val concatted =
+        withResource(aggregatedSeq) { _ =>
```
Collaborator:
I am confused by this. Was this a bug? This change feels wrong to me.

concatenateBatches has the contract that it will either close everything in toConcat, or, if there is a single item in the sequence, just return it without closing anything. By putting it within a withResource it looks like we are going to double close the data in aggregatedSeq.
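To make that ownership contract concrete, here is a minimal standalone sketch of it, using a plain AutoCloseable stand-in rather than the plugin's actual SpillableColumnarBatch type (the names `concatSketch` and `combine` are illustrative, not the plugin's API):

```scala
// Minimal sketch of the contract described above. Either every input is
// closed, or a lone input is returned unclosed with ownership transferred
// to the caller.
def concatSketch[T <: AutoCloseable](toConcat: Seq[T])(combine: Seq[T] => T): T = {
  if (toConcat.size == 1) {
    toConcat.head // single element: returned as-is, nothing is closed
  } else {
    try {
      combine(toConcat) // the combined result is owned by the caller
    } finally {
      toConcat.foreach(_.close()) // all inputs are closed on this path
    }
  }
}
```

Under that contract, wrapping the call site in `withResource(aggregatedSeq) { _ => ... }` closes each element a second time on the multi-element path.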

Collaborator:

SpillableColumnarBatch has the nasty habit (?) of hiding double closes from us (https://github.com/NVIDIA/spark-rapids/blob/branch-24.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/SpillableColumnarBatch.scala#L137). I'd like to remove this behavior with my spillable changes.
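For readers unfamiliar with the pattern, this is roughly the shape of a close() that hides double closes (a generic illustration, not a copy of SpillableColumnarBatch):

```scala
// Illustrative only: a close() guarded by a flag. A second (buggy) close
// silently becomes a no-op instead of failing loudly, so ownership bugs
// like the double close above go unnoticed.
class GuardedResource extends AutoCloseable {
  private var closed = false
  override def close(): Unit = {
    if (!closed) {
      closed = true
      // release the underlying device buffers here
    }
    // fall through silently on a repeated close
  }
}
```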

binmahone (Author):
I think I was misled by

`val concatBatch = withResource(batches) { _ =>`

on the main branch, and thought concatenateBatches would not close input batches. Will revert this part.

abellina (Collaborator) commented Nov 8, 2024

I will review this today.

```scala
realIter = Some(ConcatIterator.apply(firstPassIter,
  (aggOutputSizeRatio * configuredTargetBatchSize).toLong
))
firstPassAggToggle.set(false)
```
Collaborator:
I don't think this does what you think it does. The line that reads this flag runs when the iterator is created; it is not inside an iterator deciding per batch whether we should or should not do the agg. I added some print statements and verified that it does indeed agg every batch, even if the first batch set this to false. That is a good thing, because if you disabled the initial aggregation on something where the output types do not match the input types, you would get a crash or data corruption.
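The distinction can be shown with a tiny standalone example (the names here are made up; this is not the plugin's code): a flag read while the iterator chain is being built is frozen at construction time, whereas a flag read inside the per-element function is consulted for every batch:

```scala
import java.util.concurrent.atomic.AtomicBoolean

val doAgg = new AtomicBoolean(true)

// Flag read at construction time: flipping doAgg later has no effect,
// because the branch was taken before any element flowed through.
def frozenAtBuild(input: Iterator[Int]): Iterator[Int] =
  if (doAgg.get()) input.map(_ * 2) else input

// Flag read per element: flipping doAgg mid-stream changes how the
// remaining elements are handled.
def checkedPerElement(input: Iterator[Int]): Iterator[Int] =
  input.map(x => if (doAgg.get()) x * 2 else x)
```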

binmahone (Author):

Known issue, will revert this part.

binmahone (Author):
I remember getting a crash or data corruption when I tried to fix the iterator bug here. Do you think it would be beneficial to still convert the output types, but skip any row-wise aggregate when heuristics show that the first-pass agg does not reduce rows much?


```scala
// Handle the case of skipping second and third pass of aggregation
// This only works when spark.rapids.sql.agg.skipAggPassReductionRatio < 1
if (!firstBatchChecked && firstPassIter.hasNext
```
Collaborator:
If we are doing an aggregate every time, would it be better to check each batch and skip repartitioning if the batch stayed large?
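If it helps, here is one way that per-batch check could look (a hedged sketch; the names and the exact rule are hypothetical, not what the PR implements): compare each batch's post-agg row count to its input row count, and only route batches that actually shrank into the repartition passes.

```scala
// Hypothetical per-batch heuristic: a batch that "stayed large" after the
// first-pass agg (little row reduction) skips the repartition passes.
def stayedLarge(inputRows: Long, outputRows: Long, skipRatio: Double): Boolean =
  inputRows > 0 && outputRows.toDouble / inputRows > skipRatio

// Example: with skipRatio = 0.9, a batch whose agg kept more than 90% of
// its rows would bypass repartitioning and be emitted directly.
```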

binmahone (Author):

I don't quite understand this. Can you elaborate?

binmahone (Author) commented Nov 10, 2024

Hi @revans2 and @abellina, this PR, as I stated in Slack and marked on the PR itself, is still in DRAFT, so I didn't expect a thorough review before I change it from Draft to Ready. (Or do we already have some special usage of Draft PRs in our dev process?) It still had some major problems, such as mixing in other features (e.g. the so-called voluntary release check) and a non-working implementation of skipping the first-pass agg. The refinement was not ready last Friday, but it is now. Please go ahead and review.

The reason I showed you this PR is to prove that "we do a complete pass over all of the batches and cache them into GPU memory before we make a decision on when and how to release a batch" is no longer true. I assume having access to the state-of-the-art implementation of agg at our customer may help you better understand the problem we're facing now.

```scala
    batches: mutable.ArrayBuffer[SpillableColumnarBatch],
    metrics: GpuHashAggregateMetrics,
    concatAndMergeHelper: AggHelper): SpillableColumnarBatch = {
  // TODO: concatenateAndMerge (and calling code) could output a sequence
```
binmahone (Author):

FYI: this TODO is not newly introduced.

```diff
@@ -2113,6 +2187,7 @@ class DynamicGpuPartialSortAggregateIterator(
     }
     val newIter = if (doSinglePassAgg) {
       metrics.singlePassTasks += 1
+      // To discuss in PR: is singlePassSortedAgg still necessary?
```
binmahone (Author):

Should we remove this?

binmahone (Author):

Next steps for this PR will be:

  1. Addressing review comments. When reviewing, please be aware that we found the repartition-based agg is more prone to container OOM (and then being killed by K8S) than the sort-based one. I'm troubleshooting on my side, but any insight from reviewers would be a great help.
  2. We have selected four representative queries at our customer, and before/after results are being collected. Part of the collected results has already shown improvements.
  3. I'll run a regression test on NDS to ensure no or only minor regression.
