From e71ee1114e9dcab08bff0fd055f6c543ac9db89b Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 25 Jan 2024 19:40:48 -0600 Subject: [PATCH 01/25] Update gitignore --- .gitignore | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.gitignore b/.gitignore index 301b915d1..bc01c7f9a 100644 --- a/.gitignore +++ b/.gitignore @@ -125,3 +125,5 @@ dmypy.json # Editor settings .vscode + +.DS_Store \ No newline at end of file From c0c9d662b1428b1a5f9a08de1397eee17a318429 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 25 Jan 2024 20:07:03 -0600 Subject: [PATCH 02/25] Add requirements doc --- doc/design/zarr-publish-1.md | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) create mode 100644 doc/design/zarr-publish-1.md diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md new file mode 100644 index 000000000..e5f6a6681 --- /dev/null +++ b/doc/design/zarr-publish-1.md @@ -0,0 +1,34 @@ +# Publish Dandisets with Zarr assets + +This document describes the current implementation of publishing Dandisets with Zarr assets, a future use case, and the associated requirements. + +## Current implementation + +Currently when a blob asset is updated, a new version (i.e. a copy) is uploaded to S3. Zarr assets are too large to create multiple copies, so the upload occurs once and it is edited in place. This design means that the Zarr archive would be immutable once published. So currently the system is designed to not publish Dandisets that contain Zarr asset(s). + +## Use case 1 + +Publish Dandisets with Zarr asset(s) following the general publishing procedure described in the [publish-1 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/publish-1.md). + +### Steps + +1. User uploads a new Dandiset which includes Zarr asset(s). +2. User uploads an updated Zarr asset in the `Draft` version of the Dandiset. +3. User publishes the Dandiset and thereby creates a new version of the Dandiset. +4. 
User repeats steps 2 and 3. + +## MVP Requirements (Target date: April 30, 2024) + +1. Permit versioning of Zarr archives without creating a copy of the entire Zarr archive +1. Permit publishing of Dandisets with Zarr assets +1. Minimize storage costs in the design + +## MVP+1 Requirements + +1. Permit linking of a Zarr asset to multiple Dandisets - [dandi-archive/issues/1792](https://github.com/dandi/dandi-archive/issues/1792) + + +- Zarrbargo - Zarr interacts with embargo. Hide and unhide for all subfiles. +- Copy on write happens because of publishing +- Zarr related copies, copies on writes + From dd40a7521c23e9603e50d85d7845124802de55e4 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 25 Jan 2024 22:32:28 -0600 Subject: [PATCH 03/25] Update requirements doc --- doc/design/zarr-publish-1.md | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index e5f6a6681..2195c4fdb 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -1,26 +1,24 @@ -# Publish Dandisets with Zarr assets +# Publish Dandisets with Zarr archives -This document describes the current implementation of publishing Dandisets with Zarr assets, a future use case, and the associated requirements. +This document describes the current implementation of publishing Dandisets with Zarr archives, a desired use case, and the associated requirements of this use case. ## Current implementation -Currently when a blob asset is updated, a new version (i.e. a copy) is uploaded to S3. Zarr assets are too large to create multiple copies, so the upload occurs once and it is edited in place. This design means that the Zarr archive would be immutable once published. So currently the system is designed to not publish Dandisets that contain Zarr asset(s). +Currently when a blob asset is updated, a new version (i.e. a copy) is uploaded to S3. 
Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is edited in place. This design means that the Zarr archive is immutable once the Dandiset is published. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md). ## Use case 1 -Publish Dandisets with Zarr asset(s) following the general publishing procedure described in the [publish-1 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/publish-1.md). +Publish Dandisets with Zarr archive(s) following the general publishing procedure described in the [publish-1 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/publish-1.md). A modified publishing procedure that includes Zarr archive(s) is summarized below. -### Steps - -1. User uploads a new Dandiset which includes Zarr asset(s). -2. User uploads an updated Zarr asset in the `Draft` version of the Dandiset. +1. User uploads a new Dandiset which includes Zarr archive(s). +2. User uploads an updated Zarr archive in the `Draft` version of the Dandiset. 3. User publishes the Dandiset and thereby creates a new version of the Dandiset. 4. User repeats steps 2 and 3. ## MVP Requirements (Target date: April 30, 2024) 1. Permit versioning of Zarr archives without creating a copy of the entire Zarr archive -1. Permit publishing of Dandisets with Zarr assets +1. Permit publishing of Dandisets with Zarr archives 1. Minimize storage costs in the design ## MVP+1 Requirements @@ -28,7 +26,4 @@ Publish Dandisets with Zarr asset(s) following the general publishing procedure 1. Permit linking of a Zarr asset to multiple Dandisets - [dandi-archive/issues/1792](https://github.com/dandi/dandi-archive/issues/1792) -- Zarrbargo - Zarr interacts with embargo. Hide and unhide for all subfiles. 
-- Copy on write happens because of publishing -- Zarr related copies, copies on writes From 2987b818771ef47c6a7f0b3a72db64aeeb3b3920 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 25 Jan 2024 23:32:59 -0600 Subject: [PATCH 04/25] Update requirements doc --- doc/design/zarr-publish-1.md | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 2195c4fdb..0676beaba 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -1,29 +1,31 @@ -# Publish Dandisets with Zarr archives +# Support updates to Zarr archives after publishing the corresponding Dandiset This document describes the current implementation of publishing Dandisets with Zarr archives, a desired use case, and the associated requirements of this use case. ## Current implementation -Currently when a blob asset is updated, a new version (i.e. a copy) is uploaded to S3. Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is edited in place. This design means that the Zarr archive is immutable once the Dandiset is published. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md). +When a blob asset is updated, a new version (i.e. a copy) is uploaded to the S3 bucket. Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is updated in place. This design means that the Zarr archive is immutable once the Dandiset is published, so that the published Dandiset is immutable. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md). 
## Use case 1 -Publish Dandisets with Zarr archive(s) following the general publishing procedure described in the [publish-1 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/publish-1.md). A modified publishing procedure that includes Zarr archive(s) is summarized below. +Publish a Dandiset containing a Zarr archive(s), and subsequently update the Zarr archive(s). -1. User uploads a new Dandiset which includes Zarr archive(s). -2. User uploads an updated Zarr archive in the `Draft` version of the Dandiset. -3. User publishes the Dandiset and thereby creates a new version of the Dandiset. +The publishing procedure would follow the description found in the [publish-1 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/publish-1.md). A modified publishing procedure that includes Zarr archive(s) is summarized below. + +1. User uploads a new Dandiset which includes a Zarr archive(s). +2. User uploads an updated Zarr archive(s) to the `Draft` version of the Dandiset. +3. User publishes the Dandiset and thereby creates a new immutable version of the Dandiset. 4. User repeats steps 2 and 3. ## MVP Requirements (Target date: April 30, 2024) -1. Permit versioning of Zarr archives without creating a copy of the entire Zarr archive -1. Permit publishing of Dandisets with Zarr archives -1. Minimize storage costs in the design +1. Support versioning of Zarr archives without creating a copy of the entire Zarr archive. +1. Support publishing immutable versions of Dandisets with Zarr archives, where the Zarr archives are potentially updated between versions. +1. Minimize storage costs in the design. ## MVP+1 Requirements -1. Permit linking of a Zarr asset to multiple Dandisets - [dandi-archive/issues/1792](https://github.com/dandi/dandi-archive/issues/1792) +1. 
Support linking of a Zarr asset to multiple Dandisets - [dandi-archive/issues/1792](https://github.com/dandi/dandi-archive/issues/1792) From aa684fa4895e7d376abbd056d16754060de774f4 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Sun, 28 Jan 2024 23:16:54 -0600 Subject: [PATCH 05/25] Update current implementation --- doc/design/zarr-publish-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 0676beaba..6163884da 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -4,7 +4,7 @@ This document describes the current implementation of publishing Dandisets with ## Current implementation -When a blob asset is updated, a new version (i.e. a copy) is uploaded to the S3 bucket. Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is updated in place. This design means that the Zarr archive is immutable once the Dandiset is published, so that the published Dandiset is immutable. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md). +When a blob asset is updated, a new version (i.e. a copy) is uploaded to the S3 bucket. Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is updated in place. This design means that the Zarr archive is immutable once the Dandiset is published, so that the published Dandiset is immutable. Currently, a Dandiset cannot be published if it contains a Zarr asset. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md). 
## Use case 1 From fc152308d5699c1c50438bbd163ebd80a27ef2da Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Sun, 28 Jan 2024 23:21:40 -0600 Subject: [PATCH 06/25] Update requirements --- doc/design/zarr-publish-1.md | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 6163884da..97e96b52b 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -17,15 +17,22 @@ The publishing procedure would follow the description found in the [publish-1 de 3. User publishes the Dandiset and thereby creates a new immutable version of the Dandiset. 4. User repeats steps 2 and 3. -## MVP Requirements (Target date: April 30, 2024) +## MVP User requirements (Target date: April 30, 2024) -1. Support versioning of Zarr archives without creating a copy of the entire Zarr archive. -1. Support publishing immutable versions of Dandisets with Zarr archives, where the Zarr archives are potentially updated between versions. +1. Publish Dandisets that contain Zarr archives. +1. The published Dandisets must be immutable and accesible. +1. The draft version of the Dandiset should be mutable. 1. Minimize storage costs in the design. -## MVP+1 Requirements +## MVP+1 User requirements 1. Support linking of a Zarr asset to multiple Dandisets - [dandi-archive/issues/1792](https://github.com/dandi/dandi-archive/issues/1792) +## MVP Technical specifications +1. TODO + +## MVP+1 Technical specifications + +1. 
TODO From b7d88adfc266e4bd0743f0526e0a39e752797b11 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Sun, 28 Jan 2024 23:34:50 -0600 Subject: [PATCH 07/25] Fix text --- doc/design/zarr-publish-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 97e96b52b..0c60a269e 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -20,7 +20,7 @@ The publishing procedure would follow the description found in the [publish-1 de ## MVP User requirements (Target date: April 30, 2024) 1. Publish Dandisets that contain Zarr archives. -1. The published Dandisets must be immutable and accesible. +1. The published Dandisets must be immutable and accessible. 1. The draft version of the Dandiset should be mutable. 1. Minimize storage costs in the design. From 95bb2605e04cfdac705bedcffdeed38c5c301432 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Tue, 30 Jan 2024 23:43:25 -0600 Subject: [PATCH 08/25] Add solutions section --- doc/design/zarr-publish-1.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 0c60a269e..b78b26452 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -36,3 +36,23 @@ The publishing procedure would follow the description found in the [publish-1 de 1. TODO +## Potential solutions + +1. Implement a Django backend for Zarr + 1. Stores data in a Postgres database that references the Zarr chunks in S3. + +1. Earthmover's [Arraylake](https://earthmover.io/blog/arraylake-beta-launch) + - Notes + - Edits of the Zarr archive must happen through the Arraylake Python API, and thus the `dandi-cli` should be updated. + - Questions + - Egress costs? + - Formal testing of Python API and infrastructure to ensure data integrity? + +1. Create manifest file with paths and version IDs for each chunk for a specific version of the Zarr archive. + 1. 
Steps + 1. Initiate S3 bucket versioning + 1. Questions + 1. Store the manifest file in a database instead of S3 for improved performance? + 1. Constraints + 1. If the Zarr archive must be re-chunked then the user would need to upload the entire Zarr archive. + 1. Garbage collection would need to be updated. From 9acf245417cdbb40d513f9669594daf3acc5dd36 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Tue, 30 Jan 2024 23:43:58 -0600 Subject: [PATCH 09/25] Update title --- doc/design/zarr-publish-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index b78b26452..5a43ff57c 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -1,4 +1,4 @@ -# Support updates to Zarr archives after publishing the corresponding Dandiset +# Publish Dandisets that contain Zarr archives, and support updates to the Zarr archive after publishing the Dandiset This document describes the current implementation of publishing Dandisets with Zarr archives, a desired use case, and the associated requirements of this use case. From 02b5dbc7b220d3c18a863c3af5d634e5aaee3ab3 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Tue, 30 Jan 2024 23:47:36 -0600 Subject: [PATCH 10/25] Add technical specifications --- doc/design/zarr-publish-1.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 5a43ff57c..fb54cef5c 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -30,7 +30,9 @@ The publishing procedure would follow the description found in the [publish-1 de ## MVP Technical specifications -1. TODO +1. Support versioning of Zarr archives. +1. Create a unique web address for each published version of the Zarr archive. +1. Provide access to the Zarr archive versions through the web app and command line interface. 
## MVP+1 Technical specifications From acb1367aa1a2e11911d3b94dce710ee88e171b3e Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Tue, 30 Jan 2024 23:52:04 -0600 Subject: [PATCH 11/25] Add use case --- doc/design/zarr-publish-1.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index fb54cef5c..5547ef6c4 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -17,6 +17,11 @@ The publishing procedure would follow the description found in the [publish-1 de 3. User publishes the Dandiset and thereby creates a new immutable version of the Dandiset. 4. User repeats steps 2 and 3. +## Use case 2 + +Upload a Zarr archive to an embargoed Dandiset. + + ## MVP User requirements (Target date: April 30, 2024) 1. Publish Dandisets that contain Zarr archives. From 7dcae933745c1bd7846f13a9222b5e842ab7e590 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Wed, 31 Jan 2024 13:16:18 -0600 Subject: [PATCH 12/25] Add use case 3 --- doc/design/zarr-publish-1.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 5547ef6c4..5d964ed64 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -1,6 +1,6 @@ # Publish Dandisets that contain Zarr archives, and support updates to the Zarr archive after publishing the Dandiset -This document describes the current implementation of publishing Dandisets with Zarr archives, a desired use case, and the associated requirements of this use case. +This document describes the current implementation of publishing Dandisets with Zarr archives, desired use cases, and the associated requirements. ## Current implementation @@ -21,6 +21,11 @@ The publishing procedure would follow the description found in the [publish-1 de Upload a Zarr archive to an embargoed Dandiset. +## Use case 3 + +Reuse a Zarr archive in more than one Dandiset. 
+ +Allow for a Zarr archive that is uploaded as part of an original Dandiset to be packaged in a new Dandiset without duplicating the Zarr archive. The new Dandiset could be created by potentially different authors and could contain additional raw and/or analyzed data. This feature has been previously implemented for other asset types with [add_asset_to_dandiset.py](https://gist.github.com/satra/29404d965226e4c99fb48e7502953503#file-add_asset_to_dandiset-py). Further details of this feature request have been previously documented in [dandi-archive #1792](https://github.com/dandi/dandi-archive/issues/1792). ## MVP User requirements (Target date: April 30, 2024) From cfe20a2fe0149e74cf1f07d69f88087f554fd1d2 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Wed, 31 Jan 2024 13:16:56 -0600 Subject: [PATCH 13/25] Add TODO --- doc/design/zarr-publish-1.md | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 5d964ed64..deb200d4b 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -43,6 +43,7 @@ Allow for a Zarr archive that is uploaded as part of an original Dandiset to be 1. Support versioning of Zarr archives. 1. Create a unique web address for each published version of the Zarr archive. 1. Provide access to the Zarr archive versions through the web app and command line interface. +1. 
TODO: add additional specifications ## MVP+1 Technical specifications From 881513ec502a35636f2a626e608f8b558deaa8f1 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Wed, 31 Jan 2024 16:23:48 -0600 Subject: [PATCH 14/25] Remove link --- doc/design/zarr-publish-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index deb200d4b..297fa7d20 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -36,7 +36,7 @@ Allow for a Zarr archive that is uploaded as part of an original Dandiset to be ## MVP+1 User requirements -1. Support linking of a Zarr asset to multiple Dandisets - [dandi-archive/issues/1792](https://github.com/dandi/dandi-archive/issues/1792) +1. Support linking of a Zarr asset to multiple Dandisets ## MVP Technical specifications From a13817375e4b50619f30d489e37a461dab4de6dc Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Tue, 13 Feb 2024 09:09:53 -0600 Subject: [PATCH 15/25] Update doc/design/zarr-publish-1.md Co-authored-by: Roni Choudhury <2903332+waxlamp@users.noreply.github.com> --- doc/design/zarr-publish-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 297fa7d20..e847eef8e 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -25,7 +25,7 @@ Upload a Zarr archive to an embargoed Dandiset. Reuse a Zarr archive in more than one Dandiset. -Allow for a Zarr archive that is uploaded as part of an original Dandiset to be packaged in a new Dandiset without duplicating the Zarr archive. The new Dandiset could be created by potentially different authors and could contain additional raw and/or analyzed data. This feature has been previously implemented for other asset types with [add_asset_to_dandiset.py](https://gist.github.com/satra/29404d965226e4c99fb48e7502953503#file-add_asset_to_dandiset-py). 
Further details of this feature request have been previously documented in [dandi-archive #1792](https://github.com/dandi/dandi-archive/issues/1792). +Allow for a Zarr archive that is uploaded as part of an original Dandiset to be packaged in a new Dandiset without duplicating the Zarr archive. The new Dandiset could be created by potentially different authors and could contain additional raw and/or analyzed data. This feature has been previously implemented for other asset types with [add_asset_to_dandiset.py](https://gist.github.com/satra/29404d965226e4c99fb48e7502953503#file-add_asset_to_dandiset-py). Further details of this feature request have been previously documented in #1792. ## MVP User requirements (Target date: April 30, 2024) From 6b84ed09b161affc89d472a6d94ca8f78589953a Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Tue, 13 Feb 2024 13:51:24 -0600 Subject: [PATCH 16/25] Update doc/design/zarr-publish-1.md Co-authored-by: Roni Choudhury <2903332+waxlamp@users.noreply.github.com> --- doc/design/zarr-publish-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index e847eef8e..354706714 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -1,4 +1,4 @@ -# Publish Dandisets that contain Zarr archives, and support updates to the Zarr archive after publishing the Dandiset +# Publishing Dandisets that contain Zarr archives This document describes the current implementation of publishing Dandisets with Zarr archives, desired use cases, and the associated requirements. From 2faa5b8dead7695eb343f26b2ccb8fc08b73fdce Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 15 Feb 2024 16:25:25 -0600 Subject: [PATCH 17/25] Revert "Update gitignore" This reverts commit e71ee1114e9dcab08bff0fd055f6c543ac9db89b. 
--- .gitignore | 2 -- 1 file changed, 2 deletions(-) diff --git a/.gitignore b/.gitignore index bc01c7f9a..301b915d1 100644 --- a/.gitignore +++ b/.gitignore @@ -125,5 +125,3 @@ dmypy.json # Editor settings .vscode - -.DS_Store \ No newline at end of file From b062af17f6abb842b9cbddbd08cf53209d6358d6 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 15 Feb 2024 16:27:32 -0600 Subject: [PATCH 18/25] Revert "Revert "Update gitignore"" This reverts commit 2faa5b8dead7695eb343f26b2ccb8fc08b73fdce. --- .gitignore | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.gitignore b/.gitignore index 301b915d1..bc01c7f9a 100644 --- a/.gitignore +++ b/.gitignore @@ -125,3 +125,5 @@ dmypy.json # Editor settings .vscode + +.DS_Store \ No newline at end of file From 31af71dd6638311495175e36dceade2b661bf39d Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 15 Feb 2024 16:29:28 -0600 Subject: [PATCH 19/25] Revert "Update gitignore" This reverts commit e71ee1114e9dcab08bff0fd055f6c543ac9db89b. --- .gitignore | 2 -- 1 file changed, 2 deletions(-) diff --git a/.gitignore b/.gitignore index bc01c7f9a..301b915d1 100644 --- a/.gitignore +++ b/.gitignore @@ -125,5 +125,3 @@ dmypy.json # Editor settings .vscode - -.DS_Store \ No newline at end of file From 57eaa62146ba082177f78fac2c42e04bdd64edf2 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 15 Feb 2024 16:55:25 -0600 Subject: [PATCH 20/25] Update doc/design/zarr-publish-1.md Co-authored-by: Roni Choudhury <2903332+waxlamp@users.noreply.github.com> --- doc/design/zarr-publish-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 354706714..4498b930a 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -4,7 +4,7 @@ This document describes the current implementation of publishing Dandisets with ## Current implementation -When a blob asset is updated, a new version (i.e. a copy) is uploaded to the S3 bucket. 
Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is updated in place. This design means that the Zarr archive is immutable once the Dandiset is published, so that the published Dandiset is immutable. Currently, a Dandiset cannot be published if it contains a Zarr asset. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md). +When a non-Zarr asset blob is updated, a new copy of that file is uploaded to the S3 bucket. Zarr archives are too large so multiple copies should not be created. A Zarr archive is uploaded once and it is updated in place. This design means that the Zarr archive is immutable once the Dandiset is published, so that the published Dandiset is immutable. Currently, a Dandiset cannot be published if it contains a Zarr asset. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md). ## Use case 1 From 41d9449b925ee7022a1ccb1f80216515adbffbc1 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 15 Feb 2024 19:08:02 -0600 Subject: [PATCH 21/25] Reorder steps --- doc/design/zarr-publish-1.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 354706714..3d3ea2ec6 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -13,9 +13,9 @@ Publish a Dandiset containing a Zarr archive(s), and subsequently update the Zar The publishing procedure would follow the description found in the [publish-1 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/publish-1.md). A modified publishing procedure that includes Zarr archive(s) is summarized below. 1. User uploads a new Dandiset which includes a Zarr archive(s). -2. User uploads an updated Zarr archive(s) to the `Draft` version of the Dandiset. -3. 
User publishes the Dandiset and thereby creates a new immutable version of the Dandiset. -4. User repeats steps 2 and 3. +1. User publishes the Dandiset and thereby creates a new immutable version of the Dandiset. +1. User uploads an updated Zarr archive(s) to the `Draft` version of the Dandiset. +1. User repeats steps 2 and 3. ## Use case 2 From 50163a27337d40e368bd6ae75dd2c3da6ee8b2ba Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 22 Feb 2024 12:23:26 -0600 Subject: [PATCH 22/25] Update potential solutions section --- doc/design/zarr-publish-1.md | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 3ff39a739..b6103730d 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -51,21 +51,22 @@ Allow for a Zarr archive that is uploaded as part of an original Dandiset to be ## Potential solutions -1. Implement a Django backend for Zarr - 1. Stores data in a Postgres database that references the Zarr chunks in S3. - 1. Earthmover's [Arraylake](https://earthmover.io/blog/arraylake-beta-launch) - - Notes - - Edits of the Zarr archive must happen through the Arraylake Python API, and thus the `dandi-cli` should be updated. - - Questions - - Egress costs? - - Formal testing of Python API and infrastructure to ensure data integrity? - -1. Create manifest file with paths and version IDs for each chunk for a specific version of the Zarr archive. - 1. Steps + 1. Notes + 1. Edits of the Zarr archive must happen through the Arraylake Python API, and thus the `dandi-cli` should be updated. + 2. Questions + 1. Egress costs? + 2. Formal testing of Python API and infrastructure to ensure data integrity? + +2. Create manifest file with paths and version IDs for each chunk for a specific version of the Zarr archive. + 1. Candidate implementation - https://github.com/dandi/zarr-manifests/ + 2. Steps 1. Initiate S3 bucket versioning - 1. 
Questions + 3. Questions 1. Store the manifest file in a database instead of S3 for improved performance? - 1. Constraints + 4. Constraints 1. If the Zarr archive must be re-chunked then the user would need to upload the entire Zarr archive. - 1. Garbage collection would need to be updated. + 2. Garbage collection would need to be updated. + +3. Implement a Django backend for Zarr + 1. Stores data in a Postgres database that references the Zarr chunks in S3. From 75977f671ca4db1cc228161b3d6de394db0f30f2 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 22 Feb 2024 12:53:36 -0600 Subject: [PATCH 23/25] Update requirements --- doc/design/zarr-publish-1.md | 23 ++++++----------------- 1 file changed, 6 insertions(+), 17 deletions(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index b6103730d..48dab4b1f 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -27,27 +27,16 @@ Reuse a Zarr archive in more than one Dandiset. Allow for a Zarr archive that is uploaded as part of an original Dandiset to be packaged in a new Dandiset without duplicating the Zarr archive. The new Dandiset could be created by potentially different authors and could contain additional raw and/or analyzed data. This feature has been previously implemented for other asset types with [add_asset_to_dandiset.py](https://gist.github.com/satra/29404d965226e4c99fb48e7502953503#file-add_asset_to_dandiset-py). Further details of this feature request have been previously documented in #1792. -## MVP User requirements (Target date: April 30, 2024) +## Requirements (Target date: April 30, 2024) 1. Publish Dandisets that contain Zarr archives. -1. The published Dandisets must be immutable and accessible. -1. The draft version of the Dandiset should be mutable. -1. Minimize storage costs in the design. +2. If the same Zarr archive is uploaded to multiple Dandisets, then the Zarr archive should not be re-uploaded. 
-## MVP+1 User requirements +## Implementation details -1. Support linking of a Zarr asset to multiple Dandisets - -## MVP Technical specifications - -1. Support versioning of Zarr archives. -1. Create a unique web address for each published version of the Zarr archive. -1. Provide access to the Zarr archive versions through the web app and command line interface. -1. TODO: add additional specifications - -## MVP+1 Technical specifications - -1. TODO +1. Design a lightweight object to version Zarr archives. + 1. For a candidate implementation see https://github.com/dandi/zarr-manifests/. +2. Minimize storage costs in the design. ## Potential solutions From 585756cbbf7e7488823cb9f487a2ee5f57c23a4f Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 22 Feb 2024 13:06:45 -0600 Subject: [PATCH 24/25] Add details to requirement 2 --- doc/design/zarr-publish-1.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index 48dab4b1f..e41f4f9fe 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -30,7 +30,7 @@ Allow for a Zarr archive that is uploaded as part of an original Dandiset to be ## Requirements (Target date: April 30, 2024) 1. Publish Dandisets that contain Zarr archives. -2. If the same Zarr archive is uploaded to multiple Dandisets, then the Zarr archive should not be re-uploaded. +2. If the same Zarr archive is uploaded to multiple Dandisets, then the Zarr archive should not be re-uploaded. This requirement would mirror the behavior of non-Zarr asset blobs. 
## Implementation details From 58ad723367a45da93bf8080c492d31b3c2788ed6 Mon Sep 17 00:00:00 2001 From: Kabilar Gunalan Date: Thu, 22 Feb 2024 13:27:07 -0600 Subject: [PATCH 25/25] Update introduction section --- doc/design/zarr-publish-1.md | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/design/zarr-publish-1.md b/doc/design/zarr-publish-1.md index e41f4f9fe..53cd48a0c 100644 --- a/doc/design/zarr-publish-1.md +++ b/doc/design/zarr-publish-1.md @@ -1,6 +1,7 @@ # Publishing Dandisets that contain Zarr archives This document describes the current implementation of publishing Dandisets with Zarr archives, desired use cases, and the associated requirements. +Note that once the requirements for `Use case 1` are implemented, then `Use cases 2-3` will be capable. ## Current implementation
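
Note on the "Potential solutions" item that proposes a manifest file with paths and version IDs for each chunk: the idea can be illustrated with a minimal sketch. This is not the format used by https://github.com/dandi/zarr-manifests/ and not part of this patch series. The function names, the manifest keys (`zarr_id`, `version`, `chunks`), and all identifiers and version IDs below are hypothetical placeholders. The sketch assumes S3 bucket versioning is enabled, so that overwriting a chunk in place yields a new object version ID while older versions stay retrievable.

```python
import json


def build_manifest(zarr_id, version, chunk_versions):
    """Snapshot chunk path -> S3 object version ID for one published Zarr version.

    All field names here are illustrative assumptions, not an existing schema.
    """
    return {
        "zarr_id": zarr_id,
        "version": version,
        # Sort paths so the serialized manifest is deterministic.
        "chunks": dict(sorted(chunk_versions.items())),
    }


def changed_chunks(old, new):
    """Return chunk paths whose version ID differs or that exist in only one manifest."""
    old_chunks, new_chunks = old["chunks"], new["chunks"]
    paths = set(old_chunks) | set(new_chunks)
    return sorted(p for p in paths if old_chunks.get(p) != new_chunks.get(p))


# State of the Zarr archive at the first publish (placeholder version IDs).
v1 = build_manifest(
    "zarr-abc",
    "published-1",
    {"0/0/0": "vid-a1", "0/0/1": "vid-b1", ".zarray": "vid-m1"},
)

# One chunk is later overwritten in the draft; only that chunk gets a new version ID.
v2 = build_manifest(
    "zarr-abc",
    "published-2",
    {"0/0/0": "vid-a1", "0/0/1": "vid-b2", ".zarray": "vid-m1"},
)

print(changed_chunks(v1, v2))  # -> ['0/0/1']
print(json.dumps(v2, indent=2))
```

Under these assumptions, each published version costs one small manifest plus storage for the chunks that actually changed, which is in line with the requirement to version Zarr archives without copying the entire archive. Re-chunking would change every path, which matches the constraint noted in the doc that the whole archive would then need to be uploaded again.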