Adding Parquet footer metrics behind debug config flag #9513

Closed

mattahrens
Collaborator

Contributes to #9265

@mattahrens mattahrens self-assigned this Oct 23, 2023
@mattahrens mattahrens added the performance A performance related task/issue label Oct 23, 2023
val footerBuffer = getFooterBuffer(filePath, conf, metrics)
val footerFetchTime = System.nanoTime() - startTime
Collaborator

readAndFilterFooter is invoked from many places. I see the following so far:

  • On task thread: GpuParquetMultiFilePartitionReaderFactory.buildBaseColumnarReaderForCloud > filterBlocks > readAndFilterFooter. Your change is ok: task thread.

  • On task thread: GpuParquetMultiFilePartitionReaderFactory.buildBaseColumnarReaderForCoalescing > filterBlocksForCoalescingReader:

    • If numFilesFilterParallel > 0: we submit a CoalescingFilterRunner to a thread pool. CoalescingFilterRunner then invokes filterBlocksForCoalescingReader in a thread that is not the task thread. Your change is not ok here, because the footer fetch time is then not task-relative.
    • else, we call filterBlocksForCoalescingReader directly from the task thread. Your change is ok here.
  • On task thread: GpuParquetPartitionReaderFactory.buildColumnarReader > buildBaseColumnarParquetReader > filterBlocks. Your change is also ok here.

Note that we have the metric FILTER_TIME. It is updated in GpuParquetMultiFilePartitionReaderFactory.buildBaseColumnarReaderForCoalescing and in GpuParquetPartitionReaderFactory.buildBaseColumnarParquetReader. These measure task-relative time because they run in the task thread, starting before we submit to the thread pool and ending after all the futures have resolved.
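To illustrate the distinction, here is a minimal sketch (not the plugin's code) of why a timer started inside a pool thread is not task-relative, while one that brackets the submit and the future resolution on the task thread is. `TaskRelativeTiming`, `fakeFooterWork`, and `run` are hypothetical names, and the sleep stands in for real footer work.

```scala
import java.util.concurrent.{Callable, Executors}

object TaskRelativeTiming {
  // Hypothetical stand-in for per-file footer work done on a pool thread.
  // Returns the time measured on that thread, in nanoseconds.
  def fakeFooterWork(millis: Long): Long = {
    val start = System.nanoTime()
    Thread.sleep(millis)
    System.nanoTime() - start
  }

  /** Returns (sum of per-thread times, task-relative wall time), both in ns. */
  def run(numFiles: Int, workMillis: Long): (Long, Long) = {
    val pool = Executors.newFixedThreadPool(numFiles)
    try {
      // Task-relative timer: starts before submitting to the pool...
      val taskStart = System.nanoTime()
      val futures = (0 until numFiles).map { _ =>
        pool.submit(new Callable[Long] {
          override def call(): Long = fakeFooterWork(workMillis)
        })
      }
      // Summing the per-thread timers counts overlapping work several times.
      val perThreadTotal = futures.map(_.get()).sum
      // ...and ends after all the futures have resolved.
      val taskWall = System.nanoTime() - taskStart
      (perThreadTotal, taskWall)
    } finally {
      pool.shutdown()
    }
  }
}
```

With several files processed in parallel, the summed per-thread time exceeds what the task actually waited, which is why adding it directly to a task-level metric would overstate the cost.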

For the multi-file cloud readers, we update FILTER_TIME in MultiFileCloudPartitionReaderBase.next using a different approach. There, we obtain the percentage of time that a thread (not the task thread) spent buffering vs. filtering (getFilterTimePct and getBufferTimePct). We then multiply that percentage by the amount of time the task was blocked waiting for the threads to finish. The HostMemoryBuffersWithMetaDataBase class has the logic for how we compute that. It seems we would want to do something similar for the footer fetch/filter time in the CoalescingFilterRunner case.
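The proportional approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the logic in HostMemoryBuffersWithMetaDataBase: `ProportionalAttribution`, `ThreadTimes`, and `attribute` are hypothetical names, and only the idea of scaling the task's blocked time by the background thread's filter/buffer percentages comes from the comment.

```scala
object ProportionalAttribution {
  /** Times measured on a background (non-task) thread, in ns. Hypothetical shape. */
  final case class ThreadTimes(filterNs: Long, bufferNs: Long) {
    private val total: Long = filterNs + bufferNs
    // Fraction of the thread's time spent filtering; 0 when nothing was measured.
    def filterPct: Double = if (total == 0L) 0.0 else filterNs.toDouble / total
    // Fraction of the thread's time spent buffering; 0 when nothing was measured.
    def bufferPct: Double = if (total == 0L) 0.0 else bufferNs.toDouble / total
  }

  /**
   * Splits the time the task thread was blocked waiting on the pool
   * into (filter share, buffer share), both in ns.
   */
  def attribute(times: ThreadTimes, taskBlockedNs: Long): (Long, Long) = {
    val filterShare = math.round(times.filterPct * taskBlockedNs)
    val bufferShare = math.round(times.bufferPct * taskBlockedNs)
    (filterShare, bufferShare)
  }
}
```

A footer fetch metric could presumably be handled the same way by adding a third measured component and its percentage, so that the attributed shares never exceed the time the task was actually blocked.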

@mattahrens mattahrens changed the base branch from branch-23.12 to branch-24.02 November 28, 2023 15:04
@mattahrens mattahrens closed this Jan 30, 2024

2 participants