Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a potential NPE error in the sized hash join #12101

Merged
merged 1 commit into from
Feb 13, 2025

Conversation

firestarman
Copy link
Collaborator

close #12100

This PR fixes a potential NPE error in the sized hash join by checking the buffer's existence before actually closing it.
Otherwise duplicate call to this close method will lead to a NPE error.

It is a quite simple change so no tests for it.

Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

build

@binmahone binmahone changed the title Fix a potential NPE erro in the sized hash join Fix a potential NPE error in the sized hash join Feb 11, 2025
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is simple, but I don't understand the situation that caused the SpillableColumnarBatchResult to be closed twice. I look at the code and I don't see any code that could have caused it. But I have not dug that deeply. My main concern is that this might be masking something where we might have a used after free. I think it is more likely that this is the correct fix, but I would prefer it if I understood what happened.

@firestarman
Copy link
Collaborator Author

The change is simple, but I don't understand the situation that caused the SpillableColumnarBatchResult to be closed twice. I look at the code and I don't see any code that could have caused it. But I have not dug that deeply. My main concern is that this might be masking something where we might have a used after free. I think it is more likely that this is the correct fix, but I would prefer it if I understood what happened.

I found a possible case of the duplicate close in the sized hash join, you can refer to the linked issue for more details.

@firestarman
Copy link
Collaborator Author

firestarman commented Feb 13, 2025

Post the possible case here.

There was always a host memory OOM before this NPE error. And looks like the close of a SpillableHostConcatResult was called more than once for some exception cases.
I just found a possible case in the sized hash join as the below. A callback to call the close is registered to the task completion event when a HostQueueBatchIterator instance is created in setupForJoin. If the following prefetching fails and throws out an exception, the outer closeOnExcept will close the queue first, and the callback then will close the queue again.

        closeOnExcept(mutable.Queue.empty[T]) { rightQueue =>
          var leftSize = 0L
          var rightSize = 0L
          var buildSide: GpuBuildSide = null
          while (buildSide == null) {
            if (leftSize < rightSize || (startWithLeftSide && leftSize == rightSize)) {
              if (leftIter.hasNext) {
                val leftBatch = leftIter.next()
                if (getProbeBatchRowCount(leftBatch) > 0) {
                  leftQueue += leftBatch
                  leftSize += getProbeBatchDataSize(leftBatch)
                }
              } else {
                buildSide = GpuBuildLeft
              }
            } else {
              if (rightIter.hasNext) {
                val rightBatch = rightIter.next()
                if (getProbeBatchRowCount(rightBatch) > 0) {
                  rightQueue += rightBatch
                  rightSize += getProbeBatchDataSize(rightBatch)
                }
              } else {
                buildSide = GpuBuildRight
              }
            }
          }
          val exprs = BoundJoinExprs.bind(joinType, leftKeys, leftOutput, rightKeys, rightOutput,
            condition, buildSide)
          val (buildQueue, buildSize, streamQueue, rawStreamIter) = buildSide match {
            case GpuBuildRight =>
              buildTime += rightTime.value
              streamTime += leftTime.value
              (rightQueue, rightSize, leftQueue, rawLeftIter)
            case GpuBuildLeft =>
              buildTime += leftTime.value
              streamTime += rightTime.value
              (leftQueue, leftSize, rightQueue, rawRightIter)
          }
          metrics(BUILD_DATA_SIZE).set(buildSize)
          val baseBuildIter = setupForJoin(buildQueue, Iterator.empty, exprs.buildTypes,
            gpuBatchSizeBytes, metrics)
          val buildIter = if (exprs.buildSideNeedsNullFilter) {
            new NullFilteredBatchIterator(baseBuildIter, exprs.boundBuildKeys, metrics(OP_TIME))
          } else {
            baseBuildIter
          }
          val streamIter = new CollectTimeIterator("fetch join stream",
            setupForJoin(streamQueue, rawStreamIter, exprs.streamTypes, gpuBatchSizeBytes, metrics),
            streamTime)
          JoinInfo(joinType, buildSide, buildIter, buildSize, None, streamIter, exprs)
        }
      }

   override def setupForJoin(
        queue: mutable.Queue[SpillableHostConcatResult],
        remainingIter: Iterator[ColumnarBatch],
        batchTypes: Array[DataType],
        gpuBatchSizeBytes: Long,
        metrics: Map[String, GpuMetric]): Iterator[ColumnarBatch] = {
      val concatMetrics = getConcatMetrics(metrics)
      GpuShuffleCoalesceUtils.getGpuShuffleCoalesceIterator(
        new HostQueueBatchIterator(queue, remainingIter),
        gpuBatchSizeBytes,
        batchTypes,
        readOption,
        concatMetrics,
        prefetchFirstBatch = true)
    }

@firestarman firestarman merged commit 4a4f713 into NVIDIA:branch-25.04 Feb 13, 2025
52 checks passed
@firestarman firestarman deleted the fix-npe branch February 13, 2025 01:47
@sameerz sameerz added the bug Something isn't working label Feb 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Potential duplicate call to the close method of a SpillableHostConcatResult in sized hash join.
3 participants