Design doc - Publish Dandisets that contain Zarr archives #1833

Closed
wants to merge 27 commits into from
Changes from 11 commits
Commits
e71ee11
Update gitignore
kabilar Jan 26, 2024
c0c9d66
Add requirements doc
kabilar Jan 26, 2024
dd40a75
Update requirements doc
kabilar Jan 26, 2024
2987b81
Update requirements doc
kabilar Jan 26, 2024
aa684fa
Update current implementation
kabilar Jan 29, 2024
fc15230
Update requirements
kabilar Jan 29, 2024
b7d88ad
Fix text
kabilar Jan 29, 2024
95bb260
Add solutions section
kabilar Jan 31, 2024
9acf245
Update title
kabilar Jan 31, 2024
02b5dbc
Add technical specifications
kabilar Jan 31, 2024
acb1367
Add use case
kabilar Jan 31, 2024
7dcae93
Add use case 3
kabilar Jan 31, 2024
cfe20a2
Add TODO
kabilar Jan 31, 2024
881513e
Remove link
kabilar Jan 31, 2024
a138173
Update doc/design/zarr-publish-1.md
kabilar Feb 13, 2024
6b84ed0
Update doc/design/zarr-publish-1.md
kabilar Feb 13, 2024
2faa5b8
Revert "Update gitignore"
kabilar Feb 15, 2024
b062af1
Revert "Revert "Update gitignore""
kabilar Feb 15, 2024
22d014c
Merge branch 'zarr-doc' of https://github.com/kabilar/linc-archive in…
kabilar Feb 15, 2024
31af71d
Revert "Update gitignore"
kabilar Feb 15, 2024
57eaa62
Update doc/design/zarr-publish-1.md
kabilar Feb 15, 2024
41d9449
Reorder steps
kabilar Feb 16, 2024
f2be2dd
Merge branch 'zarr-doc' of https://github.com/kabilar/linc-archive in…
kabilar Feb 16, 2024
50163a2
Update potential solutions section
kabilar Feb 22, 2024
75977f6
Update requirements
kabilar Feb 22, 2024
585756c
Add details to requirement 2
kabilar Feb 22, 2024
58ad723
Update introduction section
kabilar Feb 22, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -125,3 +125,5 @@ dmypy.json

# Editor settings
.vscode

.DS_Store
65 changes: 65 additions & 0 deletions doc/design/zarr-publish-1.md
@@ -0,0 +1,65 @@
# Publish Dandisets that contain Zarr archives, and support updates to the Zarr archive after publishing the Dandiset

This document describes the current implementation of publishing Dandisets with Zarr archives, a desired use case, and the associated requirements of this use case.

## Current implementation

When a blob asset is updated, a new version (i.e. a copy) is uploaded to the S3 bucket. Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is updated in place. This design means that the Zarr archive is immutable once the Dandiset is published, so that the published Dandiset is immutable. Currently, a Dandiset cannot be published if it contains a Zarr asset. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md).
Member

Suggested change
When a blob asset is updated, a new version (i.e. a copy) is uploaded to the S3 bucket. Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is updated in place. This design means that the Zarr archive is immutable once the Dandiset is published, so that the published Dandiset is immutable. Currently, a Dandiset cannot be published if it contains a Zarr asset. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md).
When a non-Zarr asset blob is updated, a new copy of that file is uploaded to the S3 bucket. Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is updated in place. This design means that the Zarr archive is immutable once the Dandiset is published, so that the published Dandiset is immutable. Currently, a Dandiset cannot be published if it contains a Zarr asset. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md).

Updating to use more accurate domain language (there isn't such a thing as a "blob asset", and the word "version" has a specific meaning in the context of DANDI).

Member

Essentially, we disallow publishing of Zarr-containing Dandisets because we don't want to create copies of Zarrs if and when they are updated, and we're seeking a design that would allow us to do so. You might be able to condense this paragraph down to express that more directly; the sentence in the middle ("this design means...") seems a bit out of place in particular.

Member

I would add that there are a few requirements:

  • updating/publishing in a Dandiset
  • adding the asset to another Dandiset
  • being able to publish both

For implementation:

  • We currently don't have a read-only mode for an asset. Any modification of a Zarr that has been added to another Dandiset should fail, unless control is handed over (more complicated).
  • Adding should offer either a link (read-only) or a copy option. On the blob side we have copy-on-write, but for assets this could get complicated for large trees.


## Use case 1

Publish a Dandiset containing one or more Zarr archives, and subsequently update those Zarr archives.

The publishing procedure would follow the description in the [publish-1 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/publish-1.md). The modified procedure, which accommodates Zarr archives, is summarized below.

1. User uploads a new Dandiset that includes one or more Zarr archives.
2. User uploads updated Zarr archives to the `Draft` version of the Dandiset.
Member

This step seems out of place in the use case description. We can already do this part (update a Zarr that is in a draft version), and the important novelty is in step 3 and beyond.

Member Author

I reordered steps 2 and 3, which would reflect functionality that is currently not in place (i.e. publishing a Dandiset with a Zarr archive and subsequently updating the Draft version). Please let me know if I misunderstood your comment.

3. User publishes the Dandiset and thereby creates a new immutable version of the Dandiset.
4. User repeats steps 2 and 3.
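
The upload, publish, and update cycle in this use case can be illustrated with a minimal in-memory sketch. It assumes that a draft is a mutable mapping from chunk path to chunk bytes and that publishing freezes a read-only snapshot of it. The `Dandiset` class, its method names, and the 1-based version numbers are hypothetical and are not part of dandi-archive.

```python
from types import MappingProxyType


class Dandiset:
    """Hypothetical model: one mutable draft plus immutable published versions."""

    def __init__(self):
        self.draft = {}       # chunk path -> chunk bytes (mutable)
        self.published = []   # read-only snapshots, oldest first

    def upload(self, chunks):
        """Add or overwrite chunks in the draft version."""
        self.draft.update(chunks)

    def publish(self):
        """Freeze the current draft as a new immutable version."""
        # The shallow copy shares unchanged chunk objects between versions.
        snapshot = MappingProxyType(dict(self.draft))
        self.published.append(snapshot)
        return len(self.published)  # 1-based version number (an assumption)


ds = Dandiset()
ds.upload({"0/0": b"a", "0/1": b"b"})   # initial upload
v1 = ds.publish()                        # first published version
ds.upload({"0/1": b"B"})                 # update the draft after publishing
v2 = ds.publish()                        # second published version
```

In this sketch the first published version still returns the original bytes for `0/1` after the draft is updated, and assigning into a published snapshot raises `TypeError`. A real design would store references to chunks rather than the chunk bytes themselves.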

## Use case 2

Upload a Zarr archive to an embargoed Dandiset.


## MVP User requirements (Target date: April 30, 2024)

1. Publish Dandisets that contain Zarr archives.
1. The published Dandisets must be immutable and accessible.
1. The draft version of the Dandiset should be mutable.
1. Minimize storage costs in the design.

## MVP+1 User requirements

1. Support linking of a Zarr asset to multiple Dandisets - [dandi-archive/issues/1792](https://github.com/dandi/dandi-archive/issues/1792)

## MVP Technical specifications

1. Support versioning of Zarr archives.
1. Create a unique web address for each published version of the Zarr archive.
1. Provide access to the Zarr archive versions through the web app and command line interface.

## MVP+1 Technical specifications

1. TODO

## Potential solutions

1. Implement a Django backend for Zarr
    1. Stores data in a Postgres database that references the Zarr chunks in S3.

1. Earthmover's [Arraylake](https://earthmover.io/blog/arraylake-beta-launch)
    - Notes
        - Edits of the Zarr archive must happen through the Arraylake Python API, and thus the `dandi-cli` should be updated.
    - Questions
        - Egress costs?
        - Formal testing of Python API and infrastructure to ensure data integrity?

1. Create manifest file with paths and version IDs for each chunk for a specific version of the Zarr archive.
    1. Steps
        1. Initiate S3 bucket versioning
    1. Questions
        1. Store the manifest file in a database instead of S3 for improved performance?
    1. Constraints
        1. If the Zarr archive must be re-chunked then the user would need to upload the entire Zarr archive.
        1. Garbage collection would need to be updated.
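
The manifest-based option could work roughly as sketched below. This is a toy model under stated assumptions, not the boto3 API and not an agreed design: `VersionedBucket`, its methods, the `v1`, `v2`, ... version IDs, and the `zarr/abc/` prefix are all hypothetical. It relies only on the behavior that a versioned S3 bucket assigns a new version ID on every write to a key and allows reads by version ID.

```python
import itertools


class VersionedBucket:
    """Toy stand-in for an S3 bucket with versioning enabled."""

    def __init__(self):
        self._objects = {}              # (key, version_id) -> bytes
        self._latest = {}               # key -> latest version_id
        self._ids = itertools.count(1)

    def put(self, key, data):
        """Store a new version of `key` and return its version ID."""
        version_id = f"v{next(self._ids)}"
        self._objects[(key, version_id)] = data
        self._latest[key] = version_id
        return version_id

    def get(self, key, version_id):
        """Read one specific version of `key`."""
        return self._objects[(key, version_id)]

    def snapshot_manifest(self, prefix):
        """Map each chunk path under `prefix` to its current version ID."""
        return {k: v for k, v in self._latest.items() if k.startswith(prefix)}

    def garbage_collect(self, manifests):
        """Delete versions that are neither latest nor listed in a manifest."""
        keep = set(self._latest.items())
        for manifest in manifests:
            keep.update(manifest.items())
        stale = [ref for ref in self._objects if ref not in keep]
        for ref in stale:
            del self._objects[ref]
        return stale


bucket = VersionedBucket()
bucket.put("zarr/abc/0/0", b"a")
bucket.put("zarr/abc/0/1", b"b")
m1 = bucket.snapshot_manifest("zarr/abc/")   # manifest for published version 1
bucket.put("zarr/abc/0/1", b"B")             # draft edit that is never published
bucket.put("zarr/abc/0/1", b"C")             # later draft edit
m2 = bucket.snapshot_manifest("zarr/abc/")   # manifest for published version 2
stale = bucket.garbage_collect([m1, m2])     # only the unpublished edit is removed
```

Unchanged chunks keep the same version ID in both manifests, so publishing adds no extra storage for them. Re-chunking changes every chunk path, which is why the whole archive would have to be uploaded again, and garbage collection has to consult every published manifest before deleting an old version.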