make largest_segment_id optional #925

markbader · 2023-07-13T14:54:40Z

Description:

Largest segment id was required in webknossos libs, as it raises an Assertion Error if absence. As it is no longer required in webknossos it is optional in webknossos libs now.

Issues:

fixes Make largest_segment_id optional #917

Todos:

Make sure to delete unnecessary points or to check all before merging:

Updated Changelog
Updated Documentation
Added / Updated Tests
Considered adding this to the Examples

markbader · 2023-07-17T11:28:24Z

Most of this issue was already solved in #786. This PR removes an assertion that prevents the creation of a segmentation layer without the largest segment id and adapts the test to check the expected behavior. I assumed that a segmentation layer that contains no data has None as largest_segment_id.

fm3

@normanrz do you have an opinion concerning None for newly created empty segmentation layers? I guess zero could also be a valid value in that case, but if the data write functions don’t update it, it may make more sense to leave it at None so that it is more explicitly unset?

Other than this question, the changes look good to me.

normanrz · 2023-07-17T16:11:38Z

@normanrz do you have an opinion concerning None for newly created empty segmentation layers? I guess zero could also be a valid value in that case, but if the data write functions don’t update it, it may make more sense to leave it at None so that it is more explicitly unset?

I'd leave it at None

normanrz · 2023-07-17T16:13:20Z

Please also add a method to determine+save the max_id.

fm3 · 2023-07-17T16:35:24Z

Please also add a method to determine+save the max_id.

You can probably take that (or inspiration) from https://github.com/scalableminds/voxelytics/blob/master/voxelytics/worker/jobs/find_largest_segment_id.py

markbader · 2023-07-18T11:19:52Z

I added a function update_largest_segment_id that determines and updates the largest segment id. I thought, that parallelization might be useful for bigger datasets.
Otherwise it might be done with a one-liner like:

def _find_largest_segment_id(self):
    self.get_finest_mag().read().flatten().max()

Do you have thought on the current implementation? @normanrz @fm3

fm3 · 2023-07-18T11:29:32Z

I agree that parallelization is good to have for that :) Maybe you could make use of MagView.for_each_chunk to shorten the code?

I think it would be good to avoid flatten, as it creates a copy of the array. I used np.max(data, initial=0) in my implementation in voxelytics, which should work for nd-arrays over all axes. Also, the initial=0 prevents exceptions for empty arrays (though catch and set to None may also be valid here)

normanrz · 2023-07-19T02:55:52Z

I agree that parallelization is good to have for that :) Maybe you could make use of MagView.for_each_chunk to shorten the code?

Definitely needs parallelization. These segmentations can be 100s of TVx.

Perhaps, we should add a View.map_chunk function because for_each_chunk doesn't collect return values of the functions.

markbader · 2023-07-20T09:55:10Z

I refactored the update_largest_segment_id method as @fm3 suggested and implemented a View.map_chunk method. Additionally, I added a test for the update_largest_segment_id method.

Is it reasonable to assume that update_largest_segment_id is called by a user of the library if the segmentation data is changed manually, or should the segment id be updated automatically when data of a segmentation layer is edited?

fm3

Cool stuff, looks good to me!

I’m not certain about auto-updating the largest_segment_id. It would be a nice feature, but it may be expensive, since we have to call max on all written data. Also, when opening an existing dataset, we can never be certain that the largest_segment_id in its metadata is actually correct, so we might not want to give the impression that we can guarantee it being correct and up to date.

Do you have opinions about this @philippotto @normanrz ?

webknossos/webknossos/dataset/layer.py

philippotto · 2023-07-20T11:57:19Z

I’m not certain about auto-updating the largest_segment_id. It would be a nice feature, but it may be expensive, since we have to call max on all written data. [...]

I'm leaning towards auto-updating the LSI, since it simplifies data handling for the user. However, it shouldn't add an additional read-pass across all written data chunks (instead, during each write, it should compute the maximum of the input data which is already in-memory).

Also, when opening an existing dataset, we can never be certain that the largest_segment_id in its metadata is actually correct, so we might not want to give the impression that we can guarantee it being correct and up to date.

If a dataset already has an LSI, that value should be trusted in my opinion (if existing metadata cant be trusted the user should be responsible for fixing it). If the LSI is None (and bbox > 0), an automatic update should not happen, as partial writes wouldn't be able to know the new LSI properly.

Does this make sense?

fm3 · 2023-07-20T11:59:02Z

Sounds fair, let’s wait if norman has an opinion as well :)

Also note that this may also need a file lock for the json during in parallel writes.

normanrz · 2023-08-02T11:28:54Z

I don't think an auto-update is necessary. It would bring the same concurrency issues as the bbox changes.

In the future, Webknossos should have an "optimize" worker job for segmentations that

rechunks the data, if necessary
calculates largest_segment_id
calculates a segment index file

That job could run directly on import or user-triggered.

markbader · 2023-08-02T12:25:10Z

Do you think an entry in the documentation or an example would be useful?

Otherwise, since there is no longer a need for an auto-update mechanism, I think the PR is in a state that it can be merged.

normanrz

Yes, I think some documentation on the largest_segment_id would be nice in the write method.

webknossos/Changelog.md

normanrz · 2023-08-03T11:42:58Z

webknossos/webknossos/dataset/view.py

@@ -203,6 +204,25 @@ def write(
        This parameter used to be relative for `View` and absolute for `MagView`,
        and specified in the mag of the respective view.

+        Writing data to a segmentation layer manually does not automatically update the largest_segment_id. To set
+        the largest segment id properly run the `refresh_largest_segment_id` method on your layer.


Suggested change

the largest segment id properly run the `refresh_largest_segment_id` method on your layer.

the largest segment id properly run the `refresh_largest_segment_id` method on your layer or set the `largest_segment_id` property manually.

Remove assertions to make largest_segment_id optional.

a002928

markbader self-assigned this Jul 13, 2023

Adapt tests to accept None for largest_segment_id.

4738843

markbader marked this pull request as ready for review July 17, 2023 08:29

Merge branch 'master' into optional_largest_segment_id

5b56f4c

markbader requested a review from fm3 July 17, 2023 11:29

Merge branch 'master' into optional_largest_segment_id

9e15a9c

fm3 reviewed Jul 17, 2023

View reviewed changes

Add update_largest_segment_id method to determine and save id.

cc60f27

Merge branch 'master' into optional_largest_segment_id

478a0bf

Refactor and add test.

6582132

fm3 reviewed Jul 20, 2023

View reviewed changes

webknossos/webknossos/dataset/layer.py Outdated Show resolved Hide resolved

webknossos/webknossos/dataset/layer.py Outdated Show resolved Hide resolved

Implement requested changes.

a06e9bb

Merge branch 'master' into optional_largest_segment_id

b86b347

markbader requested review from normanrz and philippotto August 2, 2023 12:25

normanrz approved these changes Aug 2, 2023

View reviewed changes

webknossos/Changelog.md Outdated Show resolved Hide resolved

Implement requested changes.

4fc7c17

normanrz reviewed Aug 3, 2023

View reviewed changes

Update docstring.

fc33d31

markbader merged commit b2d156c into master Aug 8, 2023
18 checks passed

markbader deleted the optional_largest_segment_id branch August 8, 2023 06:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make largest_segment_id optional #925

make largest_segment_id optional #925

markbader commented Jul 13, 2023 •

edited

Loading

markbader commented Jul 17, 2023

fm3 left a comment

normanrz commented Jul 17, 2023

normanrz commented Jul 17, 2023

fm3 commented Jul 17, 2023

markbader commented Jul 18, 2023

fm3 commented Jul 18, 2023 •

edited

Loading

normanrz commented Jul 19, 2023

markbader commented Jul 20, 2023

fm3 left a comment

philippotto commented Jul 20, 2023

fm3 commented Jul 20, 2023

normanrz commented Aug 2, 2023

markbader commented Aug 2, 2023

normanrz left a comment

normanrz Aug 3, 2023 •

edited

Loading

	the largest segment id properly run the `refresh_largest_segment_id` method on your layer.
	the largest segment id properly run the `refresh_largest_segment_id` method on your layer or set the `largest_segment_id` property manually.

make largest_segment_id optional #925

make largest_segment_id optional #925

Conversation

markbader commented Jul 13, 2023 • edited Loading

Description:

Issues:

Todos:

markbader commented Jul 17, 2023

fm3 left a comment

Choose a reason for hiding this comment

normanrz commented Jul 17, 2023

normanrz commented Jul 17, 2023

fm3 commented Jul 17, 2023

markbader commented Jul 18, 2023

fm3 commented Jul 18, 2023 • edited Loading

normanrz commented Jul 19, 2023

markbader commented Jul 20, 2023

fm3 left a comment

Choose a reason for hiding this comment

philippotto commented Jul 20, 2023

fm3 commented Jul 20, 2023

normanrz commented Aug 2, 2023

markbader commented Aug 2, 2023

normanrz left a comment

Choose a reason for hiding this comment

normanrz Aug 3, 2023 • edited Loading

Choose a reason for hiding this comment

markbader commented Jul 13, 2023 •

edited

Loading

fm3 commented Jul 18, 2023 •

edited

Loading

normanrz Aug 3, 2023 •

edited

Loading