Support multiple storage locations for datasets #1405
Comments
Would also be welcome at MLZ, where different storage systems will likely be in operation.
We are very interested in this as well.
There were a couple of good points discussed at the meeting.
Leonardo opines that all Datablocks should be available together at each location. This would suggest that storageLocation should be at the Dataset level, with multiple storage locations implemented by making it an array. As an example workflow:
POST /dataset
{
  "datasetlifecycle": {
    "archivable": true,
    "storageLocations": [
      "pb-archive.psi.ch"
    ]
  }
}

Then the archive system moves it to tape and deletes it from staging:

PATCH /dataset/$PID
{
  "datasetlifecycle": {
    "archivable": true,
    "retrievable": true,
    "storageLocations": [
      "CSCS Petabyte Archive"
    ]
  }
}

Later the data gets retrieved. Now it's on tape and also on a public S3 instance:

PATCH /dataset/$PID
{
  "datasetlifecycle": {
    "archivable": true,
    "retrievable": true,
    "storageLocations": [
      "CSCS Petabyte Archive",
      "http://doi.psi.ch/scicat/PID"
    ]
  }
}

This would still leave the interpretation of the
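The three requests above can be sketched as plain payload builders. This is only a minimal sketch, assuming the proposed `storageLocations` array on `datasetlifecycle`; the helper names `lifecycle_payload` and `add_location` are hypothetical, and nothing is actually sent to a SciCat backend.

```python
import json


def lifecycle_payload(locations, archivable=True, retrievable=None):
    """Build a request body using the proposed storageLocations array."""
    lifecycle = {"archivable": archivable}
    if retrievable is not None:
        lifecycle["retrievable"] = retrievable
    # Copy the list so the payload does not alias the caller's list
    lifecycle["storageLocations"] = list(locations)
    return {"datasetlifecycle": lifecycle}


def add_location(payload, location):
    """Return a new payload with one more storage location (no duplicates)."""
    lifecycle = dict(payload["datasetlifecycle"])
    current = lifecycle.get("storageLocations", [])
    if location not in current:
        current = current + [location]  # new list, input stays untouched
    lifecycle["storageLocations"] = current
    return {"datasetlifecycle": lifecycle}


# 1. Body for POST /dataset: the dataset is registered with its first location
created = lifecycle_payload(["pb-archive.psi.ch"])
# 2. Body for PATCH /dataset/$PID: the data has been moved to tape
archived = lifecycle_payload(["CSCS Petabyte Archive"], retrievable=True)
# 3. Body for PATCH /dataset/$PID: the data is on tape and on a public S3 instance
retrieved = add_location(archived, "http://doi.psi.ch/scicat/PID")

print(json.dumps(retrieved, indent=2))
```

Treating the array as a set of locations (no duplicates, order of insertion kept) is a design choice of this sketch, not something the proposal specifies.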
It would be nice if clients (e.g. scitacean) got good information on how to fetch data, either directly as part of the storage info or via a separate endpoint.
Summary
We would like to add the ability to store data at different locations. This is primarily driven by our plan to allow users from multiple institutes to share a SciCat instance. The physical location in which data is stored will change based on the affiliation of the dataset owner. While data transfer is handled outside of SciCat, SciCat needs to track the storage location so that the archive system can retrieve data from the correct storage. I could also imagine this feature being used by single facilities that want to support multiple archive systems (e.g. for disk and tape storage).
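As a rough illustration of the idea, the storage location could be derived from the owner's affiliation when a dataset is ingested. This is only a sketch: the mapping, the affiliation keys, the example hostnames and the function name `storage_location_for` are all assumptions made for the example, not part of SciCat.

```python
# Hypothetical mapping from owner affiliation to storage location.
# Both the keys and the hostnames are placeholders.
STORAGE_BY_AFFILIATION = {
    "institute-a": "archive-a.example.org",
    "institute-b": "archive-b.example.org",
}


def storage_location_for(affiliation, default=None):
    """Return the storage location configured for an owner affiliation.

    Falls back to `default` if given, otherwise raises ValueError so that
    a dataset is never registered without a known location.
    """
    try:
        return STORAGE_BY_AFFILIATION[affiliation]
    except KeyError:
        if default is None:
            raise ValueError(
                f"no storage location configured for {affiliation!r}"
            ) from None
        return default
```

The archive system would then read the recorded location back from the dataset rather than repeating this lookup, since the owner's affiliation may change after ingestion.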
Suggested changes
Add storageLocation as a top-level field in Dataset. All DataBlocks associated with that dataset are assumed to be in the same storage location.
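A minimal sketch of what the dataset-level field implies, assuming heavily simplified stand-ins for the real models: the `Dataset` and `DataBlock` classes below are illustrative only, their field names are assumptions, and `location_of` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataBlock:
    """Simplified stand-in: carries no storage location of its own."""
    archive_id: str
    size: int


@dataclass
class Dataset:
    """Simplified stand-in with the proposed top-level storage location."""
    pid: str
    storage_location: str
    datablocks: List[DataBlock] = field(default_factory=list)

    def location_of(self, block: DataBlock) -> str:
        """Every DataBlock of this dataset shares the dataset's location."""
        if block not in self.datablocks:
            raise ValueError("datablock does not belong to this dataset")
        return self.storage_location


block = DataBlock(archive_id="block-1", size=1024)
dataset = Dataset(pid="example-pid", storage_location="pb-archive.psi.ch",
                  datablocks=[block])
print(dataset.location_of(block))
```

Keeping the field on the dataset means a client never has to reconcile conflicting locations across DataBlocks, at the cost of not being able to describe a partially migrated dataset.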