This set of scripts simulates an archival system where datasets get archived/retrieved on demand.
It listens to jobs posted to RabbitMQ by the Backend; these instruct the Archive Mock to run the script that simulates the archive system, for both archival and retrieval. After a job is retrieved from RabbitMQ, it is executed. An archival job is executed as follows:
- Mark job as being handled (`jobStatusMessage: "inProgress"`)
- Mark dataset(s) as being archived (`archivable: false`, `archiveStatusMessage: "started"`)
- Add DataBlocks
- Mark dataset(s) as archived (`retrievable: true`, `archiveStatusMessage: "datasetOnArchiveDisk"`)
- Mark job as done (`jobStatusMessage: "finishedSuccessful"`)
Note: the way the DataBlocks (not OrigDatablocks) are generated in the mock does not necessarily correspond to any specific implementation of an archival system; it is just an example.
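As a rough illustration of the steps above, an archival handler could look like the following sketch. The endpoint paths, payload shapes, the `datasetList` field name and the `patch_job`/`patch_dataset` helpers are assumptions for illustration, not the mock's actual implementation.

```python
import urllib.parse
import requests

BASE_URL = "http://localhost/api/v3"   # assumed backend base URL (--scicat-url)
TOKEN = "<access token>"               # from --scicat-token or a prior login
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def patch_job(job_id, fields):
    # Hypothetical helper: update status fields of a job on the backend.
    requests.patch(f"{BASE_URL}/Jobs/{job_id}", json=fields,
                   headers=HEADERS).raise_for_status()

def patch_dataset(pid, fields):
    # Hypothetical helper: update lifecycle fields of a dataset.
    # PIDs contain slashes, so they are URL-encoded for the endpoint.
    encoded = urllib.parse.quote_plus(pid)
    requests.patch(f"{BASE_URL}/Datasets/{encoded}", json=fields,
                   headers=HEADERS).raise_for_status()

def handle_archival_job(job):
    # 1. Mark the job as being handled.
    patch_job(job["id"], {"jobStatusMessage": "inProgress"})
    for pid in job["datasetList"]:  # name of the dataset list field is assumed
        # 2. Mark the dataset as being archived.
        patch_dataset(pid, {"datasetlifecycle": {
            "archivable": False, "archiveStatusMessage": "started"}})
        # 3. Add example DataBlocks (mock contents, see the note above;
        #    endpoint and payload are assumptions).
        requests.post(f"{BASE_URL}/Datasets/{urllib.parse.quote_plus(pid)}/datablocks",
                      json={"size": 0, "packedSize": 0, "dataFileList": []},
                      headers=HEADERS).raise_for_status()
        # 4. Mark the dataset as archived and retrievable.
        patch_dataset(pid, {"datasetlifecycle": {
            "retrievable": True, "archiveStatusMessage": "datasetOnArchiveDisk"}})
    # 5. Mark the job as done.
    patch_job(job["id"], {"jobStatusMessage": "finishedSuccessful"})
```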
A similar script handles retrieval:
- Mark job as being handled (`jobStatusMessage: "inProgress"`)
- Mark dataset(s) as being retrieved (retrieve status message: `"started"`)
- Mark dataset(s) as retrieved (retrieve status message: `"datasetReceived"`)
- Mark job as done (`jobStatusMessage: "finishedSuccessful"`)
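A corresponding retrieval handler could look like this sketch, reusing the hypothetical `patch_job`/`patch_dataset` helpers from above. The exact lifecycle field name used for the retrieve status message is an assumption.

```python
def handle_retrieval_job(job):
    # 1. Mark the job as being handled.
    patch_job(job["id"], {"jobStatusMessage": "inProgress"})
    for pid in job["datasetList"]:
        # 2. Mark the dataset as being retrieved (field name assumed).
        patch_dataset(pid, {"datasetlifecycle": {"retrieveStatusMessage": "started"}})
        # 3. Mark the dataset as retrieved.
        patch_dataset(pid, {"datasetlifecycle": {"retrieveStatusMessage": "datasetReceived"}})
    # 4. Mark the job as done.
    patch_job(job["id"], {"jobStatusMessage": "finishedSuccessful"})
```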
It requires access to a RabbitMQ and a SciCat Backend instance, with the latter configured to post jobs to the former. It depends on `pika` for RabbitMQ communication.
```
python3 job_handler_mq_client_mock.py [PARAMS]
```
After starting, the script waits for any archival/retrieval jobs posted to RabbitMQ and attempts to handle them. Datasets can get stuck in a non-archivable and non-retrievable state if the Job or Dataset contains unexpected (as in, out-of-spec) elements.
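On the message-queue side, a pika consumer along these lines could receive the job messages. The queue name, the message schema and the dispatch to the `handle_archival_job`/`handle_retrieval_job` sketches above are assumptions rather than the mock's actual code.

```python
import json
import pika  # RabbitMQ client library the mock depends on

# Connection details would come from --rabbitmq-url / --rabbitmq-user / --rabbitmq-password.
credentials = pika.PlainCredentials("guest", "guest")
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="localhost", credentials=credentials))
channel = connection.channel()

QUEUE = "client.jobs.write"  # assumed queue name; check your backend configuration
channel.queue_declare(queue=QUEUE, durable=True)

def on_message(ch, method, properties, body):
    job = json.loads(body)
    # Dispatch based on the job type posted by the backend (field name assumed).
    if job.get("type") == "archive":
        handle_archival_job(job)
    elif job.get("type") == "retrieve":
        handle_retrieval_job(job)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
channel.start_consuming()
```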
List of parameters:
Parameter | Description | Required |
---|---|---|
--scicat-url | The base URL of the backend (usually ends in /api/v_ ) | Yes |
--scicat-user | The user to use for connecting to SciCat | No (can use token instead) |
--scicat-password | The user's password for connecting to SciCat | No (can use token instead) |
--scicat-token | The token to use for connecting to SciCat | No (can use user and password instead) |
--rabbitmq-url | RabbitMQ API URL | Yes |
--rabbitmq-user | The username used for accessing RabbitMQ | Yes |
--rabbitmq-password | The password used for accessing RabbitMQ | Yes |
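A minimal argument-parsing sketch for these flags might look like the following. It is an illustration of the interface described in the table, not the script's actual parser.

```python
import argparse

parser = argparse.ArgumentParser(
    description="Mock archive system handling SciCat archival/retrieval jobs")
parser.add_argument("--scicat-url", required=True,
                    help="Base URL of the backend, usually ending in /api/v_")
parser.add_argument("--scicat-user", help="SciCat user (alternative to --scicat-token)")
parser.add_argument("--scicat-password", help="Password for the SciCat user")
parser.add_argument("--scicat-token", help="SciCat token (alternative to user/password)")
parser.add_argument("--rabbitmq-url", required=True, help="RabbitMQ API URL")
parser.add_argument("--rabbitmq-user", required=True, help="Username for RabbitMQ")
parser.add_argument("--rabbitmq-password", required=True, help="Password for RabbitMQ")
args = parser.parse_args()
```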
It still requires some kind of access to a SciCat Backend and a RabbitMQ instance; you can use scicatlive if you do not have instances of those two to test with. The following environment variables can be set instead of passing the corresponding parameters:
Variable | Corresponding Parameter |
---|---|
SCI_URL | --scicat-url |
SCI_USER | --scicat-user |
SCI_PW | --scicat-password |
TOKEN | --scicat-token |
RMQ_URL | --rabbitmq-url |
RMQ_USER | --rabbitmq-user |
RMQ_PW | --rabbitmq-password |
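When running in an environment where these variables are set (for example in a container), the configuration could be read roughly as in this sketch; the exact precedence between variables and command-line parameters is an assumption.

```python
import os

# Environment variables mirror the command-line parameters listed above.
config = {
    "scicat_url": os.environ.get("SCI_URL"),
    "scicat_user": os.environ.get("SCI_USER"),
    "scicat_password": os.environ.get("SCI_PW"),
    "scicat_token": os.environ.get("TOKEN"),
    "rabbitmq_url": os.environ.get("RMQ_URL"),
    "rabbitmq_user": os.environ.get("RMQ_USER"),
    "rabbitmq_password": os.environ.get("RMQ_PW"),
}
```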
- Use `python scicat_ingestion.py -b "http://catamel.localhost/api/v3/" -u admin -p 2jf70TPNZsS --path [PATH TO DATASET]` to ingest a dataset. NOTE: You must add a `transfer.yaml` file (an example is included in this repository) to the dataset, which will serve as a metadata source for the ingestor.
- Use `python job_create.py -b "http://catamel.localhost/api/v3/" -u admin -p 2jf70TPNZsS --dataset-id [PID of the previously created dataset] --job-type [archival / retrieval]` to create an archival/retrieval job (new datasets can only be archived initially, and archival can only happen once).
- The job will almost immediately be handled by the archival mock script.
- You can check the `datasetlifecycle` of your dataset to see that it has been archived/retrieved, using the backend's Swagger UI with the `GET /Datasets/{id}` endpoint (see the sketch below).
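If you prefer the command line over the Swagger UI, a request along these lines would fetch the dataset and print its lifecycle. The URL encoding of the PID and the token handling are assumptions about the backend's API.

```python
import urllib.parse
import requests

BASE_URL = "http://catamel.localhost/api/v3"
TOKEN = "<access token>"
pid = "<PID of the previously created dataset>"

# PIDs contain slashes, so they must be URL-encoded for the /Datasets/{id} endpoint.
encoded_pid = urllib.parse.quote_plus(pid)
r = requests.get(f"{BASE_URL}/Datasets/{encoded_pid}",
                 params={"access_token": TOKEN})  # newer backends may expect an Authorization: Bearer header instead
r.raise_for_status()
print(r.json().get("datasetlifecycle"))
```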
You can use the `dataset_ingest_job_mock.py` script to ingest and archive datasets by providing it a source folder of datasets (each dataset in its own subfolder); each dataset must have a `transfer.yaml` as a source of metadata (an example is provided with this repository). This is mostly useful as an API test: it does not replicate the job message queue, but it also removes the dependency on RabbitMQ (SciCat is obviously still needed). The command to use it is the following:
```
python dataset_ingest_job_mock.py -b [backend address] -u [username] -p [password] --path [path to dataset collection]
```
Optionally, one can add `--check-dataset-duplication` to avoid ingesting the same datasets multiple times (see the sketch below).
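For orientation, the expected folder layout and ingestion loop might look roughly like this sketch. The `transfer.yaml` handling, the duplication check and the placeholder helpers are assumptions for illustration, not the script's actual logic.

```python
import os
import yaml  # PyYAML, used here to read the transfer.yaml metadata files

def dataset_already_ingested(dataset_dir):
    # Placeholder for --check-dataset-duplication: the real script would ask
    # the backend whether this dataset already exists (assumption).
    return False

def ingest_and_archive(dataset_dir, metadata):
    # Placeholder: the real script would create the dataset in SciCat and
    # run the archival steps for it.
    print(f"Would ingest {dataset_dir} with metadata keys: {list(metadata or {})}")

def ingest_collection(collection_path, check_duplication=False):
    # Each dataset lives in its own subfolder of the collection folder.
    for name in sorted(os.listdir(collection_path)):
        dataset_dir = os.path.join(collection_path, name)
        if not os.path.isdir(dataset_dir):
            continue
        transfer_file = os.path.join(dataset_dir, "transfer.yaml")
        if not os.path.isfile(transfer_file):
            print(f"Skipping {name}: no transfer.yaml found")
            continue
        with open(transfer_file) as f:
            metadata = yaml.safe_load(f)  # metadata source for the ingestor
        if check_duplication and dataset_already_ingested(dataset_dir):
            print(f"Skipping {name}: already ingested")
            continue
        ingest_and_archive(dataset_dir, metadata)
```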