Support multiple storage locations for datasets #1405

Open
sbliven opened this issue Sep 2, 2024 · 5 comments
Labels: enhancement (New feature or request)

@sbliven (Contributor) commented Sep 2, 2024

Summary

We would like to add the ability to store data at different locations. This is primarily driven by our plan to let users from multiple institutes share a SciCat instance. The physical location in which data is stored will change based on the affiliation of the dataset owner. While data transfer is handled outside of SciCat, SciCat needs to track the storage location so that the archive system can retrieve data from the correct storage. I could also imagine this feature being used by single facilities that want to support multiple archive systems (e.g. disk and tape storage).

Suggested changes

  1. Add storageLocation as a top-level field in Dataset. All DataBlocks associated with that dataset are assumed to be in the same storage location.
  2. Allow Job actions to be filtered by storage location. This can be implemented as a FilterAction with no changes to the job data model (see the sketch after this list).
  3. Ensure jobs involve datasets from only a single storageLocation. This can probably be implemented on the backend as a ValidateAction without code changes, but may require some changes in the frontend.
  4. The permission model might need to be updated, depending on use cases.
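
As a rough illustration of point 2, the job configuration might gain a location matcher so that each archive system only sees its own datasets. This is only a sketch; the actionType and datasets keys below are illustrative, not the current jobConfig schema:

{
  "jobType": "archive",
  "actions": [
    {
      "actionType": "filter",
      // hypothetical matcher: only dispatch datasets stored at this location
      "datasets": { "storageLocation": "pb-archive.psi.ch" }
    }
  ]
}
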
sbliven added the enhancement label Sep 2, 2024
sbliven self-assigned this Sep 2, 2024
@bpedersen2 (Contributor) commented

This would also be welcome at MLZ, where different storage systems will likely be in operation.

@dylanmcreynolds (Contributor) commented

We are very interested in this as well.

@sbliven (Contributor, Author) commented Sep 4, 2024

There were a couple of good points discussed at the meeting.

  1. Moving data between different storage systems at one facility is an interesting use case. For instance, moving datasets between S3-accessible storage and parallel filesystems. Datasets might be in multiple places at once; this could be represented by several DataBlocks with the same files but different storageLocations.

  2. (Martin) SciCat is not a data management system. Managing multiple storage locations, tiering, redundancy, etc. is better handled by a dedicated storage management system. Such systems abstract the physical storage location away from the file identifier. A SciCat instance backed by such a system would store the file identifiers in DataBlock; changing the physical storage location could then be done without updating the SciCat database.

    • I agree with this in theory, but replacing legacy archive systems with a completely new architecture might be impractical. Pragmatic solutions based on the existing architecture should also be considered.
  3. (@minottic) Since the storageLocation is associated with the archived version, it might make sense to add it to DataBlock rather than Dataset. The desired storageLocation would first be added to the archive job (in jobParams), and the archive system would then be responsible for setting the storageLocation when the DataBlocks are created after archiving. Retrieve jobs would then request the datablocks from the relevant location (see the sketch after this list).

  4. Several string fields could be repurposed to include the storageLocation. For instance, facilities could choose to make sourceFolder contain a URI-like object with both the location and the path. Or sourceFolderHost could be redefined as the storage location (if facilities don't need the originating host).

    • I dislike this suggestion because it changes well-established meanings of important fields, and because it requires string parsing to separate out the original meanings.
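
To make point 3 concrete, the flow could look roughly as follows. storageLocation on DataBlock is the proposed new field, and the endpoint paths, jobParams key, and payload shapes are written informally rather than copied from the current API:

// 1. the archive job carries the desired target location
POST /jobs
{
  "type": "archive",
  "jobParams": { "storageLocation": "CSCS Petabyte Archive" },
  "datasetList": [{ "pid": "20.500.12345/abc" }]
}

// 2. after archiving, the archive system records where the copy landed
POST /datablocks
{
  "datasetId": "20.500.12345/abc",
  "archiveId": "tape-0042",
  "storageLocation": "CSCS Petabyte Archive"
}

// 3. a retrieve job later reads storageLocation from the DataBlocks
//    to decide which archive system to contact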

@sbliven (Contributor, Author) commented Oct 4, 2024

Leonardo opines that all DataBlocks should be available together at each location. This suggests that storageLocation belongs at the Dataset level, with multiple storage locations supported by making it an array.

The Dataset.datasetLifecycle object already has a number of fields relating to archiving state. Maybe this would be a reasonable place for it?

As an example workflow:

  1. The dataset is created, with data rsync'd to our staging server:
POST /dataset
{
  "datasetlifecycle": {
    "archivable": true,
    "storageLocations": [
      "pb-archive.psi.ch"
    ]
  }
}

  2. The archive system moves it to tape and deletes it from staging:

PATCH /dataset/$PID
{
  "datasetlifecycle": {
    "archivable": true,
    "retrievable": true,
    "storageLocations": [
      "CSCS Petabyte Archive"
    ]
  }
}

  3. Later, the data gets retrieved. Now it's on tape and also on a public S3 instance:

PATCH /dataset/$PID
{
  "datasetlifecycle": {
    "archivable": true,
    "retrievable": true,
    "storageLocations": [
      "CSCS Petabyte Archive",
      "http://doi.psi.ch/scicat/PID"
    ]
  }
}

This would still leave the interpretation of the storageLocations values up to each facility.
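
A side benefit of the array form: finding every dataset with a copy at a given location stays a simple equality match, since MongoDB matches individual array elements directly. Assuming the field sits under datasetlifecycle as above (query syntax abbreviated):

GET /datasets?filter={"where":{"datasetlifecycle.storageLocations":"CSCS Petabyte Archive"}}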

@bpedersen2 (Contributor) commented

It would be nice if clients (e.g. scitacean) got good information on how to fetch data, either directly as part of the storage info or via a separate endpoint.
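
For example, a hypothetical endpoint (not part of the current API; the path and payload here are made up for illustration) could resolve a dataset's storage locations into access hints:

GET /datasets/$PID/storagelocations
[
  {
    "location": "CSCS Petabyte Archive",
    "online": false,
    "access": "retrieve-job"    // must be staged via a retrieve job
  },
  {
    "location": "http://doi.psi.ch/scicat/PID",
    "online": true,
    "access": "https"           // directly downloadable
  }
]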
