Support multiple storage locations for datasets #1405

Open
sbliven opened this issue Sep 2, 2024 · 5 comments
Labels: enhancement (New feature or request)

@sbliven (Contributor) commented Sep 2, 2024

Summary

We would like to add the ability to store data at different locations. This is primarily driven by our plan to let users from multiple institutes share a SciCat instance. The physical location in which data is stored will change based on the affiliation of the dataset owner. While data transfer is handled outside of SciCat, SciCat needs to track the storage location so that the archive system can retrieve data from the correct storage. I could also imagine this feature being used by single facilities that want to support multiple archive systems (e.g. disk and tape storage).

Suggested changes

  1. Add storageLocation as a top-level field in Dataset. All DataBlocks associated with that dataset are assumed to be in the same storage location.
  2. Allow Job actions to be filtered by storage location. This can be implemented as a FilterAction with no changes to the job data model (see the sketch after this list).
  3. Ensure jobs involve datasets from only a single storageLocation. This can probably be implemented on the backend as a ValidateAction without code changes, but may require some changes in the frontend.
  4. The permission model might need to be updated, depending on use cases.
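
As a rough illustration of point 2, the job configuration might gain a location matcher so that each archive system only sees its own datasets. This is only a sketch; the actionType and datasets keys below are illustrative, not the current jobConfig schema:

{
  "jobType": "archive",
  "actions": [
    {
      "actionType": "filter",
      // hypothetical matcher: only dispatch datasets stored at this location
      "datasets": { "storageLocation": "pb-archive.psi.ch" }
    }
  ]
}
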
sbliven added the enhancement label Sep 2, 2024
sbliven self-assigned this Sep 2, 2024
@bpedersen2 (Contributor) commented

This would also be welcome at MLZ, where different storage systems will likely be in operation.

@dylanmcreynolds (Contributor) commented

We are very interested in this as well.

@sbliven (Contributor, Author) commented Sep 4, 2024

There were a couple of good points discussed at the meeting.

  1. Moving data between different storage systems at one facility is an interesting use case. For instance, moving datasets between S3-accessible storage and parallel filesystems. Datasets might be in multiple places at once; this could be represented by several DataBlocks with the same files but different storageLocations.

  2. (Martin) SciCat is not a data management system. Managing multiple storage locations, tiering, redundancy, etc. is better handled by a dedicated storage management system. Such systems abstract the physical storage location away from the file identifier. A SciCat instance backed by such a system would store the file identifiers in DataBlock; changing the physical storage location could then be done without updating the SciCat database.

    • I agree with this in theory, but replacing legacy archive systems with a completely new architecture might be impractical. Pragmatic solutions based on the existing architecture should also be considered.
  3. (@minottic) Since the storageLocation is associated with the archived version, it might make sense to add it to DataBlock rather than Dataset. The desired storageLocation would first be added to the archive job (in jobParams), and the archive system would then be responsible for setting the storageLocation when the DataBlocks are created after archiving. Retrieve jobs would then request the datablocks from the relevant location (see the sketch after this list).

  4. Several string fields could be repurposed to include the storageLocation. For instance, facilities could choose to make sourceFolder contain a URI-like object with both the location and the path. Or sourceFolderHost could be redefined as the storage location (if facilities don't need the originating host).

    • I dislike this suggestion because it changes well-established meanings of important fields, and because it requires string parsing to separate out the original meanings.
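
To make point 3 concrete, the flow could look roughly as follows. storageLocation on DataBlock is the proposed new field, and the endpoint paths, jobParams key, and payload shapes are written informally rather than copied from the current API:

// 1. the archive job carries the desired target location
POST /jobs
{
  "type": "archive",
  "jobParams": { "storageLocation": "CSCS Petabyte Archive" },
  "datasetList": [{ "pid": "20.500.12345/abc" }]
}

// 2. after archiving, the archive system records where the copy landed
POST /datablocks
{
  "datasetId": "20.500.12345/abc",
  "archiveId": "tape-0042",
  "storageLocation": "CSCS Petabyte Archive"
}

// 3. a retrieve job later reads storageLocation from the DataBlocks
//    to decide which archive system to contact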

@sbliven (Contributor, Author) commented Oct 4, 2024

Leonardo opines that all DataBlocks should be available together at each location. This suggests that storageLocation belongs at the Dataset level, with multiple storage locations supported by making it an array.

The Dataset.datasetLifecycle object already has a number of fields relating to archiving state. Maybe this would be a reasonable place for it?

As an example workflow:

  1. The dataset is created, with data rsync'd to our staging server:
POST /dataset
{
  "datasetlifecycle": {
    "archivable": true,
    "storageLocations": [
      "pb-archive.psi.ch"
    ]
  }
}

  2. The archive system moves it to tape and deletes it from staging:

PATCH /dataset/$PID
{
  "datasetlifecycle": {
    "archivable": true,
    "retrievable": true,
    "storageLocations": [
      "CSCS Petabyte Archive"
    ]
  }
}

  3. Later, the data gets retrieved. Now it's on tape and also on a public S3 instance:

PATCH /dataset/$PID
{
  "datasetlifecycle": {
    "archivable": true,
    "retrievable": true,
    "storageLocations": [
      "CSCS Petabyte Archive",
      "http://doi.psi.ch/scicat/PID"
    ]
  }
}

This would still leave the interpretation of the storageLocations values up to each facility.
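
A side benefit of the array form: finding every dataset with a copy at a given location stays a simple equality match, since MongoDB matches individual array elements directly. Assuming the field sits under datasetlifecycle as above (query syntax abbreviated):

GET /datasets?filter={"where":{"datasetlifecycle.storageLocations":"CSCS Petabyte Archive"}}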

@bpedersen2 (Contributor) commented

It would be nice if clients (e.g. scitacean) got good information on how to fetch data, either directly as part of the storage info or via a separate endpoint.
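
For example, a hypothetical endpoint (not part of the current API; the path and payload here are made up for illustration) could resolve a dataset's storage locations into access hints:

GET /datasets/$PID/storagelocations
[
  {
    "location": "CSCS Petabyte Archive",
    "online": false,
    "access": "retrieve-job"    // must be staged via a retrieve job
  },
  {
    "location": "http://doi.psi.ch/scicat/PID",
    "online": true,
    "access": "https"           // directly downloadable
  }
]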
