Allow setting the checkpoint directory through SparkConnection #182
This work is for #181.
This PR adds a new `checkpoint_dir` argument to `SparkConnection`. This is a required argument, so this is a breaking change.
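As a rough sketch of what a call site might look like after this change (apart from `checkpoint_dir` and `tmp_dir`, the argument names and import path here are illustrative assumptions, not the verified signature):

```python
from hlink.spark.session import SparkConnection

# checkpoint_dir is now required alongside tmp_dir. The other arguments
# are illustrative placeholders; the real constructor may differ.
connection = SparkConnection(
    derby_dir="derby",
    warehouse_dir="spark_warehouse",
    checkpoint_dir="hdfs:///shared/checkpoints",  # shared storage, visible to all executors
    tmp_dir="/local/scratch/tmp",                 # executor-local disk
    python="python3",
    db_name="linking",
)
```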
This fixes a bug where we always set Spark's checkpoint directory to `tmp_dir`, which we used to set the `spark.local.dir` configuration option. The problem is that these directories should be on separate disks! `tmp_dir` should be on a disk local to each executor. The checkpoint directory should be on shared storage so that all of the executors can access the same directory. (If you are running locally, as the `hlink` script does, this distinction does not really matter.)
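For context, these are two distinct settings in Spark itself. A minimal PySpark sketch (the paths are made up for illustration):

```python
from pyspark.sql import SparkSession

# spark.local.dir is per-executor scratch space; it must be set before
# the Spark context starts, and should point at fast local disk.
spark = (
    SparkSession.builder
    .appName("hlink")
    .config("spark.local.dir", "/local/scratch/tmp")
    .getOrCreate()
)

# The checkpoint directory must be on shared storage (e.g. HDFS or a
# network mount) so that every executor sees the same directory.
spark.sparkContext.setCheckpointDir("/shared/storage/checkpoints")
```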
While working on this, I have also removed the `hlink.scripts.main.load_conf()` function, since it had some confusing logic in it. That logic has moved into the main script's `cli()` function and should hopefully be more straightforward now. Instead of setting the `"conf_path"` key in the return dictionary, `hlink.configs.load_conf.load_conf_file()` now returns a tuple of (file path, config dictionary).
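A usage sketch of the new return value (the config name is made up, and the exact path-resolution behavior of `load_conf_file()` is assumed):

```python
from hlink.configs.load_conf import load_conf_file

# load_conf_file() now returns the resolved file path alongside the
# parsed config, instead of storing the path under a "conf_path" key.
conf_path, config = load_conf_file("my_linking_config")
print(f"loaded config from {conf_path}")
```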