feat: Add one config to limit max disk usage for spilling queries #14975
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
For memory-limit queries, executors might write temporary results into the disk to reduce memory load. It's important to have a configuration option to limit the max disk usage, in case some query would bloat the disk and cause other issues.
DuckDB provides a similar configuration:
What changes are included in this PR?
This PR is mostly refactoring, the actual change is small. I made one big refactor into a separate commit, so you can only look at the second commit for the related changes.
Changes
1. refactor: Move
metrics
module fromdatafusion_physical_plan
crate todatafusion_execution
crateCurrent code organization:
datafusion_physical_plan
depends ondatafusion_execution
datafusion_physical_plan
includes physical executors likeSortExec
datafusion_execution
includes memory reservation and disk managerThis PR requires adding metrics tracking ability to
DiskManager
, besides, I believe in the future we can also add moreMetrics
intoMemoryPool
andDiskManager
, for aggregated statistics across different executors, so this PR first made this refactoring change.2. refactor: Group all spilling-related metrics in executors into a single struct
SpillMetrics
This way we can ensure the metrics across different operators can be kept consistent.
3. refactor: Move
IPCStreamWriter
fromdatafusion_physical_plan
todatafusion_execution::DiskManager
4. Implement the utility function to spill batches inside
DiskManager
try_spill_record_batches
is implemented to replace the old utility functionspill_record_batches
, this way the total spilled file size can be tracked, also the interface for external operators to use is a little simpler.Are these changes tested?
Yes, integration test is included for queries exceed/not-exceed the disk limit.
Are there any user-facing changes?
No. However there is one deprecation, I'll explain separately.