Overview
Currently, the nightly process downloads the previous log.csv and resource.csv files from S3, runs the collection, and updates the csv files, which are then pushed back to S3 along with any collected resources.
If the log.csv / resource.csv files are not found in S3, the process generates fresh ones, but only from that night’s downloaded files. This is fine for bootstrapping a new collection (where the first night’s download IS all the data at that point), but it does not allow a full reprocessing of historical resources.
Update the process so that it can optionally sync all existing resources, allowing a full rebuild to be run.
Pull Request(PR):
Tech Approach
The resource and log directories are in S3
For a full rebuild, we need to pull them from S3
Ensure the log.csv and resource.csv files are NOT present
Run the pipeline, which should reprocess all resources
Push the resulting new log.csv and resource.csv files back to S3
Pulling all existing log/resource directory entries is done only for a full rebuild (not as part of a normal run)
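The steps above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: a local directory stands in for the S3 bucket (a real implementation would list and download objects from S3 instead), the directory names `log` and `resource` are assumed, and the helpers `prepare_workdir` and `push_csv_files` are hypothetical names introduced only for this example.

```python
import shutil
from pathlib import Path

STATE_FILES = ("log.csv", "resource.csv")  # file names from the ticket
HISTORY_DIRS = ("log", "resource")         # assumed directory names


def prepare_workdir(bucket: Path, workdir: Path, full_rebuild: bool = False) -> None:
    """Populate workdir before the pipeline runs.

    `bucket` is a local directory standing in for the S3 collection prefix.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    if full_rebuild:
        # Full rebuild: pull every historical log/resource entry.
        for name in HISTORY_DIRS:
            src = bucket / name
            if src.is_dir():
                shutil.copytree(src, workdir / name, dirs_exist_ok=True)
        # Ensure the csv files are NOT present so the pipeline regenerates them.
        for name in STATE_FILES:
            (workdir / name).unlink(missing_ok=True)
    else:
        # Normal run: fetch only the previous csv files, if they exist.
        for name in STATE_FILES:
            src = bucket / name
            if src.is_file():
                shutil.copy2(src, workdir / name)


def push_csv_files(bucket: Path, workdir: Path) -> None:
    """Push the resulting csv files back (stand-in for an S3 upload)."""
    for name in STATE_FILES:
        src = workdir / name
        if src.is_file():
            shutil.copy2(src, bucket / name)
```

Keeping the bulk download behind the `full_rebuild` flag means a normal run still transfers only the two csv files, so its runtime should not change.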
Acceptance Criteria/Tests
It is possible to run a full rebuild from all resources
This is NOT done as part of a normal collection run, which must be unaffected (in particular, it should not take any longer to run).
It’s possible to run this from Airflow
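One way to meet these criteria might be to make the rebuild opt-in through the run configuration supplied when a DAG run is triggered manually. The sketch below assumes a hypothetical `full_rebuild` key in that configuration; the key name and the helper `is_full_rebuild` are illustrative only and are not part of the existing pipeline.

```python
from typing import Any, Mapping, Optional


def is_full_rebuild(conf: Optional[Mapping[str, Any]]) -> bool:
    """Return True only when the run config explicitly asks for a full rebuild.

    `conf` is the dictionary passed when triggering the run; scheduled
    nightly runs normally have no config, so they default to False.
    """
    if not conf:
        return False
    value = conf.get("full_rebuild", False)  # hypothetical key name
    if isinstance(value, str):
        # Tolerate string values such as "true" coming from a JSON/CLI trigger.
        return value.strip().lower() in {"1", "true", "yes"}
    return bool(value)
```

Because the flag defaults to `False`, a scheduled nightly run would behave exactly as it does today unless someone triggers the run with the flag set.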
Resourcing & Dependencies
This ticket assumes the pipelines will be running in Airflow
Can be completed by a developer with Airflow access