
Fix parallel buffered slice writer bug #922

Merged

markbader merged 20 commits into master from Fix_parallel_BufferedSliceWriter_bug on Aug 14, 2023

Conversation

@markbader (Contributor) commented Jul 11, 2023:

Description:

  • Hotfix implementation of the bbox extension to avoid parallel access to the properties JSON during Dataset.from_images

Issues:

Todos:

Make sure to delete unnecessary points or to check all before merging:

  • Updated Changelog

@markbader self-assigned this Jul 11, 2023
@markbader (Contributor, Author):

I added a boolean argument update_bbox to the named partial copy_to_view and propagated it to the actual write on the view or the mag view. This makes sure that the bounding box is not updated by multiple processes, so corruption of the properties JSON file should be avoided.
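For illustration, a minimal sketch of the pattern described above, using functools.partial as a stand-in for the lib's named partial; target_view, source_chunks, and executor are illustrative names, not the actual code:

from functools import partial

# Binding update_bbox=False into the per-chunk job means that no subjob
# touches the bounding box in the properties JSON; only the parent
# process updates it afterwards.
func_per_chunk = partial(copy_to_view, target_view, update_bbox=False)
executor.map(func_per_chunk, source_chunks)  # hypothetical executor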

@markbader requested a review from fm3 July 17, 2023 12:09
@fm3 (Member) left a comment:

Cool stuff! I added a few comments, mostly about naming things and defaults. Could you have a look?

@@ -34,6 +34,7 @@ def __init__(
     self,
     view: "View",
     offset: Optional[Vec3IntLike] = None,
+    update_bbox: bool = False,
@fm3 (Member):

I'd say the default should be True, and there could be a comment explaining what it does and encouraging setting it to False if parallel access is intended.

@@ -192,6 +192,7 @@ def write(
     self,
     data: np.ndarray,
     offset: Optional[Vec3IntLike] = None,  # deprecated, relative, in current mag
+    update_bbox: bool = True,
@fm3 (Member):

Maybe in this context it is rather something like allow_write_outside_view_bbox: bool = False? (This method does not update the bbox, right?)

@fm3 (Member):

Also, did you test what happens if there is a write outside of the existing bbox? Does it just work?

@markbader (Contributor, Author):

  1. I agree that the name allow_write_outside_view_bbox fits better for View.write(), but it seems a bit odd to have different signatures for View and MagView.

  2. Yes, it just worked for the files that I used to test it.

@fm3 (Member) commented Jul 17, 2023:

@philippotto if you have time, could you also have a look at these changes? I'm not super sure about the semantics of View vs. MagView, and I think another opinion would make sense.

@philippotto (Member) left a comment:

As a band-aid hotfix this seems okay to me (however, shouldn't the default for update_bbox be False everywhere to avoid the runtime errors?).

Is there a plan for a proper fix? If I understand it correctly, the PR could potentially cause bounding boxes to not be set correctly?

Comment on lines 132 to 147
if isinstance(self.view, MagView):
    self.view.write(
        data,
        offset=buffer_start.add_or_none(self.offset),
        relative_offset=buffer_start_mag1.add_or_none(self.relative_offset),
        absolute_offset=buffer_start_mag1.add_or_none(self.absolute_offset),
        update_bbox=self.update_bbox,
    )
else:
    self.view.write(
        data,
        offset=buffer_start.add_or_none(self.offset),
        relative_offset=buffer_start_mag1.add_or_none(self.relative_offset),
        absolute_offset=buffer_start_mag1.add_or_none(self.absolute_offset),
        allow_write_outside_bbox=not self.update_bbox,
    )
@philippotto (Member):

This can be DRYed with something like this, I think:

# Pick the keyword argument matching the respective write() signature.
if isinstance(self.view, MagView):
    kwargs = {"update_bbox": self.update_bbox}
else:
    kwargs = {"allow_write_outside_bbox": not self.update_bbox}

self.view.write(
    data,
    offset=buffer_start.add_or_none(self.offset),
    relative_offset=buffer_start_mag1.add_or_none(self.relative_offset),
    absolute_offset=buffer_start_mag1.add_or_none(self.absolute_offset),
    **kwargs,
)

    offset=buffer_start.add_or_none(self.offset),
    relative_offset=buffer_start_mag1.add_or_none(self.relative_offset),
    absolute_offset=buffer_start_mag1.add_or_none(self.absolute_offset),
    allow_write_outside_bbox=not self.update_bbox,
@philippotto (Member):

I don't understand this logic. If the bbox must not be updated, writing outside of the bbox is allowed? Shouldn't this be inverted?

@fm3 (Member) commented Jul 18, 2023:

Is there a plan for a proper fix? The PR would cause bounding boxes to be not set correctly potentially, if I understand it correctly?

I'm unsure how to go about this. The idea here was to move the responsibility of setting the full bbox from the subjobs (which do the write call) to the main process (afterwards). The benefit would be that the subjobs don't run into this parallel-editing-of-the-JSON problem, and yet the final bbox would be correct (as the caller, in this case add_layer_from_images, can then set it).

The idea was to have the defaults be True, which is the intuitive behavior (everything updates), and only set them to False explicitly in cases where concurrent usage is intended (creating a need to set the bbox in the caller code afterwards).

A problem that arose from this is that the write calls must now be able to write outside of the (previous, smaller) bbox, because it will be set only at the end. That is why there is the allow_write_outside_bbox param in this case.

I agree that this is all a bit hard to comprehend from the code alone. I am not super happy with this solution, but I also don’t have an idea for a “proper fix” – if you have ideas, please share them!

Another idea was to set the bbox to a huge one (millions of voxels per dimension) before calling the subjobs. In that case, the subjobs would not touch the JSON, as nothing is bigger than expected. That would alleviate the need for passing these booleans. However, it would introduce magic numbers for this huge bbox, and also not feel like a proper fix.

We could probably also implement a locking mechanism for the JSON, but I'm not sure how to do that (in a way that does not impede performance).

@philippotto (Member):

I see two possible solutions for this problem (I prefer (2)):

  1. Defer the mutation of the JSON. This is similar to what the PR is currently doing, but I'd like to see an automatic fix-up of the JSON, so that the users of this library don't need to take care of this. Also, they shouldn't need to pass special parameters if they use multiprocessing as this is complicated. The automated fix-up could be done by remembering all the written bboxes and passing them to the parent job in the end.

  2. Use a locking mechanism, as you also suggested. We have something similar in voxelytics and it looks relatively simple (however, I did not author this, maybe @normanrz wants to chime in?). It's file-system based, which should be okay since the JSON itself is FS-based, too. The lock file could be named datasource-properties.json.lock. Since the update of the JSON should take very little time (in comparison to writing image data), I don't think that performance will be a problem.
    We should take extra care to reduce the chance that the lock file is not cleaned up. That is, the locking phase should be as short as possible (i.e., only when updating the JSON and only when the JSON really changes). If the JSON update raises an exception for some reason, the lock file should also be cleaned up again. Then, the lock file would only be a problem when the Python process is terminated during the update of the JSON, in which case a following run would probably deadlock. A warning could be emitted if the lock is awaited for more than X seconds, with a hint on how it can be deleted manually.

I'm strongly in favor of (2) since it seems way simpler to me.

@fm3 (Member) commented Jul 18, 2023:

I would have assumed that file locks are not guaranteed to be multiprocessing-safe, but I may be wrong about that. Do we use it in voxelytics in the same way without issue? Then I’m ok with that approach. Maybe you could give @markbader an introduction on how it is done in voxelytics so that he can adapt it here?

@philippotto (Member):

I would have assumed that file locks are not guaranteed to be multiprocessing-safe, but I may be wrong about that.

I think it should be safe. The package describes the locking mechanism as "a simple way of inter-process communication".

Do we use it in voxelytics in the same way without issue? Then I’m ok with that approach.

In vx we use the locking to avoid running the same workflow task multiple times (e.g., due to user error). So I think this should work for the libs use case, too.

Maybe you could give @markbader an introduction on how it is done in voxelytics so that he can adapt it here?

@markbader See this code. I think it's as simple as importing the module and adding with SoftFileLock(lock_path, timeout=3): around the update of the JSON file. Let me know if you have questions :)
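For illustration, a minimal sketch of that suggestion, assuming a local filesystem; the lock-file name follows the earlier suggestion, and update_properties_json is a hypothetical stand-in for the actual JSON update:

from pathlib import Path
from filelock import SoftFileLock, Timeout

properties_path = Path("dataset/datasource-properties.json")  # illustrative path
lock_path = str(properties_path) + ".lock"  # datasource-properties.json.lock

try:
    # SoftFileLock takes the lock by creating the lock file; competing
    # processes block until it is released or the timeout expires.
    with SoftFileLock(lock_path, timeout=3):
        update_properties_json(properties_path)  # hypothetical update routine
except Timeout:
    # Not acquired within 3 seconds; a stale lock file may have to be
    # removed manually (see the warning idea discussed above).
    raise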

@markbader (Contributor, Author):

@philippotto Thanks for your feedback and implementation ideas. I tried to implement the SoftFileLock approach but ran into some issues:

  1. The filelock Python library does not support UPaths right now, and is therefore not usable for remote datasets.
  2. I adapted the filelock library to support UPaths and replaced the os.open calls with pathlib.Path(..).open('x'), as that is the Python way to atomically create a file and raise an exception if it already exists (see the sketch below). Sadly, the s3fs library does not support the mode "x" for open. I think it is out of scope to change the s3fs library as well to implement a file lock.
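
For illustration, the exclusive-create pattern from point 2 as a minimal sketch; the retry loop is illustrative, and the lock name follows the earlier suggestion:

import time
from pathlib import Path

lock = Path("datasource-properties.json.lock")

# Mode "x" atomically creates the file and raises FileExistsError if it
# already exists, which is exactly the primitive a soft file lock needs.
while True:
    try:
        with lock.open("x"):
            break  # lock acquired
    except FileExistsError:
        time.sleep(0.1)  # another process holds the lock; retry

try:
    ...  # update the properties JSON here
finally:
    lock.unlink()  # always release the lock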

I just discussed with @daniel-wer that under the given circumstances, the implementation of your first proposed solution would be better. @philippotto do you have another idea for an implementation of a locking mechanism, or would you agree that I should implement solution 1?

@philippotto (Member):

I just discussed with @daniel-wer that under the given circumstances, the implementation of your first proposed solution would be better. @philippotto do you have another idea for an implementation of a locking mechanism, or would you agree that I should implement solution 1?

I would hope that a filelock implementation would be possible with S3, but I also don't see an easy way to achieve it without changing the entire approach used by the filelock library. Alternatively, one could use a file lock somewhere else (e.g., in the current working directory) that is shared between all workers. For multiprocessing this would be easy (the only downside is that this wouldn't catch independent actors writing to the same DS). However, I don't see how this would generalize to slurm etc.

Therefore, I'm fine with doing (1) :)
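
For illustration, a minimal sketch of solution (1) as discussed above: each subjob reports the bbox it wrote, and only the parent process touches the properties JSON at the end. write_chunk, layer, and extended_by are illustrative stand-ins, not the lib's actual code:

from functools import reduce
from multiprocessing import Pool

def run_subjob(job_args):
    # Hypothetical: writes the data but performs no JSON update.
    return write_chunk(job_args)

with Pool() as pool:
    written_bboxes = pool.map(run_subjob, all_job_args)

# Merge the reported bboxes and update the JSON once, in the parent.
layer.bounding_box = reduce(lambda a, b: a.extended_by(b), written_bboxes)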

@markbader marked this pull request as draft July 26, 2023 13:54
@markbader marked this pull request as ready for review August 2, 2023 08:41
@philippotto (Member) left a comment:

Great, looks good to me :) Two things:

  • The documentation should state somewhere explicitly which restrictions exist for parallel access to wk datasets (see the sketch after this comment). Namely:
    • When writing shards in parallel, json_update_allowed should be set to False to disable the automatic update of the bounding box metadata; otherwise, race conditions may happen. The user is responsible for updating the bounding box manually.
    • When writing to chunks in shards, one chunk may only be written to by one actor at any time.
    • When writing to compressed shards, one shard may only be written to by one actor at any time.
    • Reading in parallel should be fine.
  • See my two review comments.

@normanrz could you double check my restrictions from above?
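
For illustration, a minimal sketch of the parallel-write pattern these restrictions imply; the exact signatures are assumptions based on this discussion, not confirmed API:

import webknossos as wk

dataset = wk.Dataset.open("path/to/dataset")  # illustrative path
mag_view = dataset.get_layer("color").get_mag("1")

def write_shard(args):
    data, offset = args
    # json_update_allowed=False avoids concurrent edits of
    # datasource-properties.json (see the restrictions above).
    mag_view.write(data, absolute_offset=offset, json_update_allowed=False)

# ... run write_shard in parallel (multiprocessing, slurm, ...) ...

# Afterwards, the caller is responsible for the bounding box:
dataset.get_layer("color").bounding_box = wk.BoundingBox(
    topleft=(0, 0, 0), size=(1024, 1024, 1024)  # illustrative extent
)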

@normanrz (Member):

  • When writing to chunks in shards, one chunk may only be written to by one actor at any time.
  • When writing to compressed shards, one shard may only be written to by one actor at any time.

Ok for now. With Zarr, parallel writes to a shard will no longer be allowed at all. Maybe we should already put such wording in.

  • Reading in parallel should be fine.

Reading in parallel without concurrent writes is ok.

@markbader (Contributor, Author):

@philippotto Thanks for your review! I added a paragraph about parallel access to Datasets in the Dataset Usage part of the docs.

@philippotto (Member) left a comment:

Excellent, looks good to me :)

@fm3 (Member) commented Aug 14, 2023:

With Zarr, parallel writes to a shard will no longer be allowed at all.

That sounds like trouble for some of our applications that do slice-wise operations. What would be the right way to deal with that? Shards with a very small z size so that we can do the parallelization? Or ditch the fine-grained parallelization and write each of the large chunk files sequentially? Or re-chunk in a second pass? All seem to have drawbacks compared to now.

@philippotto (Member):

With Zarr, parallel writes to a shard will no longer be allowed at all.

That sounds like trouble for some of our applications that do slice-wise operations. What would be the right way to deal with that? Shards with a very small z size so that we can do the parallelization? Or ditch the fine-grained parallelization and write each of the large chunk files sequentially? Or re-chunk in a second pass? All seem to have drawbacks compared to now.

For the vx align case, picking a small shard z dimension should be a good option imo, especially because Zarr supports non-cubic shards (so x and y can still have larger sizes). After materialization, a re-cubing would be necessary (not strictly, but good for WK), which is a drawback, but on the other hand a low shard z size would allow higher parallelization.

@markbader merged commit 1426ec1 into master Aug 14, 2023
18 checks passed
@markbader deleted the Fix_parallel_BufferedSliceWriter_bug branch August 14, 2023 15:10

Successfully merging this pull request may close these issues.

webknossos new CLI JSON error with multiprocessing futures