Allow full reprocessing of all existing resources #98

Open
Ben-Hodgkiss opened this issue Oct 18, 2024 · 0 comments

Overview
Currently, the nightly process downloads the previous log.csv and resource.csv files from S3, runs the collection, and updates both CSV files. The updated files are then pushed back to S3, along with any newly collected resources.

If log.csv and resource.csv are not found in S3, the process generates fresh ones, but only from that night's downloaded files. This is fine for bootstrapping a new collection (where the first night's download IS all the data at that point), but it does not allow a full reprocessing of historical resources.

Update the process to optionally sync all the existing resources, so a full rebuild can be run.

Pull Request(PR):

Tech Approach

  • The resource and log directories are in S3
  • For a full rebuild, we need to pull them from S3
  • Ensure the log.csv and resource.csv files are NOT present
  • Run the pipeline, which should reprocess all resources
  • Push the resulting new log.csv and resource.csv files back to S3
  • Pulling all existing log/resource directory entries is done only for a full rebuild (not as part of a normal run)
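
The steps above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `InMemoryStore` is a hypothetical stand-in for the S3 bucket, the `log/` and `resource/` key prefixes are assumed names for the two directories, and `prepare_workdir` / `publish_index_files` are hypothetical helpers that would run before and after the existing pipeline.

```python
from pathlib import Path
from typing import Dict, List, Optional
import tempfile

# Assumed names: the two index files from the ticket, and hypothetical
# key prefixes for the log and resource directories in the bucket.
INDEX_FILES = ("log.csv", "resource.csv")
HISTORY_PREFIXES = ("log/", "resource/")


class InMemoryStore:
    """Hypothetical stand-in for the S3 bucket: maps object keys to bytes."""

    def __init__(self, objects: Optional[Dict[str, bytes]] = None) -> None:
        self.objects: Dict[str, bytes] = dict(objects or {})

    def keys(self, prefix: str = "") -> List[str]:
        return sorted(k for k in self.objects if k.startswith(prefix))

    def download(self, key: str, dest: Path) -> None:
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(self.objects[key])

    def upload(self, src: Path, key: str) -> None:
        self.objects[key] = src.read_bytes()


def prepare_workdir(store: InMemoryStore, workdir: Path, full_rebuild: bool) -> int:
    """Populate workdir before the pipeline runs; return the number of objects pulled."""
    pulled = 0
    if full_rebuild:
        # Full rebuild only: pull every historical log/resource entry.
        for prefix in HISTORY_PREFIXES:
            for key in store.keys(prefix):
                store.download(key, workdir / key)
                pulled += 1
        # Make sure the index files are NOT present, so the pipeline
        # regenerates them from everything in the working directory.
        for name in INDEX_FILES:
            (workdir / name).unlink(missing_ok=True)
    else:
        # Normal run: fetch only the previous index files, if they exist.
        for name in INDEX_FILES:
            if name in store.keys(name):
                store.download(name, workdir / name)
                pulled += 1
    return pulled


def publish_index_files(store: InMemoryStore, workdir: Path) -> None:
    """Push whichever index files the pipeline produced back to the store."""
    for name in INDEX_FILES:
        path = workdir / name
        if path.exists():
            store.upload(path, name)
```

Keeping the history download behind the `full_rebuild` branch means the normal nightly path does no extra work, which is the point of the last bullet.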

Acceptance Criteria/Tests

  • It is possible to run a full rebuild from all resources
  • This is NOT done as part of a normal collection run, which must be unaffected (in particular, it should not take any longer to run).
  • It is possible to run this from Airflow
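
One way to satisfy these criteria together is to make the full rebuild opt-in through the run configuration, so that a run triggered without any configuration behaves exactly as before. The helper below is a sketch under that assumption: the `full_rebuild` key and the `is_full_rebuild` function are hypothetical, and `conf` is expected to be a dict-like object such as the one Airflow exposes as `dag_run.conf` for manually triggered runs.

```python
from typing import Any, Mapping, Optional

# Strings treated as "true" when the flag arrives as text rather than a boolean.
TRUE_STRINGS = {"1", "true", "yes", "on"}


def is_full_rebuild(conf: Optional[Mapping[str, Any]]) -> bool:
    """Return True only when the run configuration explicitly asks for a full rebuild.

    A missing or empty configuration (the normal scheduled run) yields False,
    so the nightly collection is unaffected unless someone opts in.
    """
    if not conf:
        return False
    value = conf.get("full_rebuild", False)  # hypothetical key name
    if isinstance(value, str):
        return value.strip().lower() in TRUE_STRINGS
    return bool(value)
```

With a helper like this, a manual trigger that passes `{"full_rebuild": true}` as the run configuration would select the rebuild path, while scheduled runs keep the default.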

Resourcing & Dependencies

  • This ticket assumes the pipelines will be running in Airflow
  • Can be completed by a developer with Airflow access
@Ben-Hodgkiss Ben-Hodgkiss converted this from a draft issue Oct 18, 2024