Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow setting the checkpoint directory through SparkConnection #182

Merged
merged 5 commits into from
Dec 13, 2024

Conversation

riley-harper
Copy link
Contributor

@riley-harper riley-harper commented Dec 13, 2024

This work is for #181.

This PR adds a new checkpoint_dir argument to SparkConnection. This is a required argument, so this is a breaking change.

This fixes a bug where we always set Spark's checkpoint directory to tmp_dir, which we used to set the spark.local.dir configuration option. The problem is that these directories should be on separate disks! tmp_dir should be on a disk local to each executor. The checkpoint directory should be on shared storage so that all of the executors can access the same directory. (If you are running locally as the hlink script does, this distinction does not really matter.)

While working on this, I have also removed the hlink.scripts.main.load_conf() function, since it had some confusing logic in it. That logic has moved into the main script's cli() function and should hopefully be more straightforward now. Instead of setting the "conf_path" key in the return dictionary, hlink.configs.load_conf.load_conf_file() now returns a tuple of (file path, config dictionary).

This eliminates the need to set a new "conf_path" attribute on the
configuration dictionary before returning it.
Instead of using this function to get the config and add attributes to it, we
now separately get the config with load_conf_file() and pass attributes to
Spark. I've translated some of the tests for load_conf() to tests for
load_conf_file().
Previously we always set the checkpoint directory to be the same as
spark.local.dir, which we call "tmp_dir". However, this doesn't make sense
because tmp_dir should be on a disk local to each executor, and the checkpoint
directory has to be on shared storage to work correctly.
@riley-harper riley-harper merged commit 3f0d62f into v4-dev Dec 13, 2024
6 checks passed
@riley-harper riley-harper deleted the checkpoint_directory_rework branch December 13, 2024 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant