Allow full reprocessing of all existing resources #98

Ben-Hodgkiss · 2024-10-18T13:42:19Z

Overview
Currently, the nightly process downloads the previous log.csv and resource.csv files from S3, runs the collection, and updates the CSV files, which are then pushed back to S3 along with any collected resources.

If the log.csv / resource.csv files are not found in S3, the process generates fresh ones, but only from that night’s downloaded files. This is fine for bootstrapping a new collection (where the first night’s download IS all the data at that point) but does not allow for a full reprocessing of historical resources.

Update the process to optionally sync all the existing resources, so a full rebuild can be run.
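One way the optional behaviour could be exposed is as a command-line flag on the collection entry point. This is only a sketch: the flag name `--full-rebuild` and the `parse_args` helper are hypothetical and do not come from the existing code.

```python
import argparse


def parse_args(argv=None):
    """Parse options for a collection run (hypothetical entry point)."""
    parser = argparse.ArgumentParser(description="Run the collection process.")
    # Hypothetical flag: off by default so the normal nightly run is unchanged.
    parser.add_argument(
        "--full-rebuild",
        action="store_true",
        help="sync all existing resources from S3 and rebuild the CSV files",
    )
    return parser.parse_args(argv)
```

Defaulting the flag to off keeps the normal run on its current path, so it should not take any longer than it does today.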

Pull Request(PR):

Tech Approach

The resource and log directories are in S3
For a full rebuild, we need to pull them from S3
Ensure the log.csv and resource.csv files are NOT present
Run the pipeline, which should reprocess all resources
Push the resulting new log.csv and resource.csv files back to S3
Pulling all existing log/resource directory entries is only done for a full rebuild (not as part of a normal run)
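
The steps above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: `run_collection`, the `pull` / `push` / `run_pipeline` callables, and the `resource/` and `log/` directory names are all assumptions made for the example. The S3 transfer is injected as plain callables so the control flow can be read (and tested) without committing to a particular S3 client.

```python
from pathlib import Path
from typing import Callable, List

# File names taken from the ticket.
INDEX_FILES = ("log.csv", "resource.csv")
# Assumed directory names for the historical entries in S3.
HISTORY_DIRS = ("resource/", "log/")


def run_collection(
    collection_dir: Path,
    pull: Callable[[str], None],
    push: Callable[[str], None],
    run_pipeline: Callable[[], None],
    full_rebuild: bool = False,
) -> List[str]:
    """Run one collection and return the steps performed, in order."""
    steps = []
    if full_rebuild:
        # Pull every historical entry; this is skipped on a normal run.
        for name in HISTORY_DIRS:
            pull(name)
            steps.append("pull " + name)
        # Make sure the CSV files are absent so the pipeline regenerates them.
        for name in INDEX_FILES:
            (collection_dir / name).unlink(missing_ok=True)
        steps.append("remove csv files")
    else:
        # Normal run: only fetch the previous CSV files.
        for name in INDEX_FILES:
            pull(name)
            steps.append("pull " + name)
    run_pipeline()
    steps.append("run pipeline")
    # Push the resulting CSV files back (pushing collected resources omitted).
    for name in INDEX_FILES:
        push(name)
        steps.append("push " + name)
    return steps
```

Because the directory sync sits entirely inside the `full_rebuild` branch, a normal run performs the same pulls and pushes as it does today.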

Acceptance Criteria/Tests

It is possible to run a full rebuild from all resources
This is NOT done as part of a normal collection run, which is not affected (in particular, it should not take any longer to run).
It’s possible to run this from Airflow

Resourcing & Dependencies

This ticket assumes the pipelines will be running in Airflow
Can be completed by a developer with Airflow access

Ben-Hodgkiss added this to Infrastructure Oct 18, 2024

Ben-Hodgkiss converted this from a draft issue Oct 18, 2024

Ben-Hodgkiss removed this from Infrastructure Oct 22, 2024
