Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added documentation for SortMergeJoin #13469

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

athultr1997
Copy link
Contributor

Which issue does this PR close?

Closes #10357

Rationale for this change

What changes are included in this PR?

Added documentation for SortMergeJoin

Are these changes tested?

No tests needed since only documentation is added.

Are there any user-facing changes?

No

@athultr1997 athultr1997 marked this pull request as draft November 18, 2024 15:55
@github-actions github-actions bot added the physical-expr Physical Expressions label Nov 18, 2024
@athultr1997 athultr1997 marked this pull request as ready for review November 18, 2024 16:59
@comphead
Copy link
Contributor

I would be adding this short video to make the dev/user familiar with SMJ concepts https://www.youtube.com/watch?v=jiWCPJtDE2c

/// will fail under the same conditions. Multiple record batches of buffered could be
/// present in memory/disk during the exectution.
///
/// Only one record batch of streamed input will be present in the memory at all times. There is no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it depends on the batch size and there can batch_size of streamed rows currently lives in memory.
For the buffered the buffered batch lives until BufferedState switches and due to presorted inputs the algorithm understands the buffered batch is not needed anymore and it gets released

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @athultr1997 this is good start
@korowa @viirya do you have anything to add on it?

/// partitions.
/// Join execution plan that executes equi-join predicates on multiple partitions using Sort-Merge
/// join algorithm and applies an optional filter post join. Can be used to join arbitrarily large
/// inputs where one or both of the inputs don't fit in the available memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where one or both of the inputs don't fit in the available memory.

Hmm, is this true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

streamed - Always taken one batch at a time.
buffered - Has spilling support.
Hence, the inputs don't have to fit in memory.

Also, I think this was the vision behind SMJ: #1599 (comment).

///
/// Buffered input is buffered for all record batches having the same value of join key.
/// If the memory limit increases beyond the specified value and spilling is enabled,
/// buffered batches could be spilled to disk. If spilling is disabled, the execution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a config for spilling? Shall we mention it here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One has to enable the disk manager when creating the RuntimeEnv during the creation of TaskContext.

let runtime = RuntimeEnvBuilder::new()
            .with_memory_limit(100, 1.0)
            .with_disk_manager(DiskManagerConfig::NewOs)
            .build_arc()?;

I am not sure if we should mention it here, since this is all the way over in RuntimeEnv and there are multiple strategies to enable the disk manager.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document Sort Merge Join algorithm
3 participants