Allow setting the checkpoint directory through SparkConnection #182

riley-harper · 2024-12-13T21:30:03Z

This work is for #181.

This PR adds a new checkpoint_dir argument to SparkConnection. This is a required argument, so this is a breaking change.

This fixes a bug where we always set Spark's checkpoint directory to tmp_dir, which we used to set the spark.local.dir configuration option. The problem is that these directories should be on separate disks! tmp_dir should be on a disk local to each executor. The checkpoint directory should be on shared storage so that all of the executors can access the same directory. (If you are running locally as the hlink script does, this distinction does not really matter.)

While working on this, I have also removed the hlink.scripts.main.load_conf() function, since it had some confusing logic in it. That logic has moved into the main script's cli() function and should hopefully be more straightforward now. Instead of setting the "conf_path" key in the return dictionary, hlink.configs.load_conf.load_conf_file() now returns a tuple of (file path, config dictionary).

This eliminates the need to set a new "conf_path" attribute on the configuration dictionary before returning it.

…tion dictionary

Instead of using this function to get the config and add attributes to it, we now separately get the config with load_conf_file() and pass attributes to Spark. I've translated some of the tests for load_conf() to tests for load_conf_file().

Previously we always set the checkpoint directory to be the same as spark.local.dir, which we call "tmp_dir". However, this doesn't make sense because tmp_dir should be on a disk local to each executor, and the checkpoint directory has to be on shared storage to work correctly.

…kFactory

riley-harper added 5 commits December 13, 2024 18:46

[#181] Return a tuple (path, config) from load_conf_file

4c6e602

This eliminates the need to set a new "conf_path" attribute on the configuration dictionary before returning it.

[#181] Don't use load_conf() to set extra attributes on the configura…

46f79e3

…tion dictionary

[#181] Implement checkpoint_dir behavior for SparkConnection and Spar…

3dbc75b

…kFactory

riley-harper merged commit 3f0d62f into v4-dev Dec 13, 2024
6 checks passed

riley-harper deleted the checkpoint_directory_rework branch December 13, 2024 21:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow setting the checkpoint directory through SparkConnection #182

Allow setting the checkpoint directory through SparkConnection #182

riley-harper commented Dec 13, 2024 •

edited

Loading

Allow setting the checkpoint directory through SparkConnection #182

Allow setting the checkpoint directory through SparkConnection #182

Conversation

riley-harper commented Dec 13, 2024 • edited Loading

riley-harper commented Dec 13, 2024 •

edited

Loading