Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make largest_segment_id optional #925

Merged
merged 11 commits into from
Aug 8, 2023
Merged

Conversation

markbader
Copy link
Contributor

@markbader markbader commented Jul 13, 2023

Description:

  • Largest segment id was required in webknossos libs, as it raises an Assertion Error if absence. As it is no longer required in webknossos it is optional in webknossos libs now.

Issues:

Todos:

Make sure to delete unnecessary points or to check all before merging:

  • Updated Changelog
  • Updated Documentation
  • Added / Updated Tests
  • Considered adding this to the Examples

@markbader markbader self-assigned this Jul 13, 2023
@markbader markbader marked this pull request as ready for review July 17, 2023 08:29
@markbader
Copy link
Contributor Author

Most of this issue was already solved in #786. This PR removes an assertion that prevents the creation of a segmentation layer without the largest segment id and adapts the test to check the expected behavior. I assumed that a segmentation layer that contains no data has None as largest_segment_id.

@markbader markbader requested a review from fm3 July 17, 2023 11:29
Copy link
Member

@fm3 fm3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@normanrz do you have an opinion concerning None for newly created empty segmentation layers? I guess zero could also be a valid value in that case, but if the data write functions don’t update it, it may make more sense to leave it at None so that it is more explicitly unset?

Other than this question, the changes look good to me.

@normanrz
Copy link
Member

@normanrz do you have an opinion concerning None for newly created empty segmentation layers? I guess zero could also be a valid value in that case, but if the data write functions don’t update it, it may make more sense to leave it at None so that it is more explicitly unset?

I'd leave it at None

@normanrz
Copy link
Member

Please also add a method to determine+save the max_id.

@fm3
Copy link
Member

fm3 commented Jul 17, 2023

Please also add a method to determine+save the max_id.

You can probably take that (or inspiration) from https://github.com/scalableminds/voxelytics/blob/master/voxelytics/worker/jobs/find_largest_segment_id.py

@markbader
Copy link
Contributor Author

I added a function update_largest_segment_id that determines and updates the largest segment id. I thought, that parallelization might be useful for bigger datasets.
Otherwise it might be done with a one-liner like:

def _find_largest_segment_id(self):
    self.get_finest_mag().read().flatten().max()

Do you have thought on the current implementation? @normanrz @fm3

@fm3
Copy link
Member

fm3 commented Jul 18, 2023

I agree that parallelization is good to have for that :) Maybe you could make use of MagView.for_each_chunk to shorten the code?

I think it would be good to avoid flatten, as it creates a copy of the array. I used np.max(data, initial=0) in my implementation in voxelytics, which should work for nd-arrays over all axes. Also, the initial=0 prevents exceptions for empty arrays (though catch and set to None may also be valid here)

@normanrz
Copy link
Member

I agree that parallelization is good to have for that :) Maybe you could make use of MagView.for_each_chunk to shorten the code?

Definitely needs parallelization. These segmentations can be 100s of TVx.

Perhaps, we should add a View.map_chunk function because for_each_chunk doesn't collect return values of the functions.

@markbader
Copy link
Contributor Author

I refactored the update_largest_segment_id method as @fm3 suggested and implemented a View.map_chunk method. Additionally, I added a test for the update_largest_segment_id method.

Is it reasonable to assume that update_largest_segment_id is called by a user of the library if the segmentation data is changed manually, or should the segment id be updated automatically when data of a segmentation layer is edited?

Copy link
Member

@fm3 fm3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool stuff, looks good to me!

I’m not certain about auto-updating the largest_segment_id. It would be a nice feature, but it may be expensive, since we have to call max on all written data. Also, when opening an existing dataset, we can never be certain that the largest_segment_id in its metadata is actually correct, so we might not want to give the impression that we can guarantee it being correct and up to date.

Do you have opinions about this @philippotto @normanrz ?

webknossos/webknossos/dataset/layer.py Outdated Show resolved Hide resolved
webknossos/webknossos/dataset/layer.py Outdated Show resolved Hide resolved
@philippotto
Copy link
Member

I’m not certain about auto-updating the largest_segment_id. It would be a nice feature, but it may be expensive, since we have to call max on all written data. [...]

I'm leaning towards auto-updating the LSI, since it simplifies data handling for the user. However, it shouldn't add an additional read-pass across all written data chunks (instead, during each write, it should compute the maximum of the input data which is already in-memory).

Also, when opening an existing dataset, we can never be certain that the largest_segment_id in its metadata is actually correct, so we might not want to give the impression that we can guarantee it being correct and up to date.

If a dataset already has an LSI, that value should be trusted in my opinion (if existing metadata cant be trusted the user should be responsible for fixing it). If the LSI is None (and bbox > 0), an automatic update should not happen, as partial writes wouldn't be able to know the new LSI properly.

Does this make sense?

@fm3
Copy link
Member

fm3 commented Jul 20, 2023

Sounds fair, let’s wait if norman has an opinion as well :)

Also note that this may also need a file lock for the json during in parallel writes.

@normanrz
Copy link
Member

normanrz commented Aug 2, 2023

I don't think an auto-update is necessary. It would bring the same concurrency issues as the bbox changes.

In the future, Webknossos should have an "optimize" worker job for segmentations that

  • rechunks the data, if necessary
  • calculates largest_segment_id
  • calculates a segment index file

That job could run directly on import or user-triggered.

@markbader
Copy link
Contributor Author

Do you think an entry in the documentation or an example would be useful?

Otherwise, since there is no longer a need for an auto-update mechanism, I think the PR is in a state that it can be merged.

Copy link
Member

@normanrz normanrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think some documentation on the largest_segment_id would be nice in the write method.

webknossos/Changelog.md Outdated Show resolved Hide resolved
@@ -203,6 +204,25 @@ def write(
This parameter used to be relative for `View` and absolute for `MagView`,
and specified in the mag of the respective view.

Writing data to a segmentation layer manually does not automatically update the largest_segment_id. To set
the largest segment id properly run the `refresh_largest_segment_id` method on your layer.
Copy link
Member

@normanrz normanrz Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the largest segment id properly run the `refresh_largest_segment_id` method on your layer.
the largest segment id properly run the `refresh_largest_segment_id` method on your layer or set the `largest_segment_id` property manually.

@markbader markbader merged commit b2d156c into master Aug 8, 2023
18 checks passed
@markbader markbader deleted the optional_largest_segment_id branch August 8, 2023 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make largest_segment_id optional
4 participants