Make jobs that rely on TrackDB last-modified dates more robust #120

Open
anjackson opened this issue Aug 24, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@anjackson
Contributor

The document harvester has been missing documents, and after some analysis it became clear that some recent crawl logs had not been processed.

The Airflow task that processes logs chooses which files to analyse based on their last-modified dates in TrackDB, so today's task processes yesterday's log file(s). However, for whatever reason, the upload of the crawl logs is now happening close to midnight, so TrackDB has not yet been updated when the log analysis job runs. This means the job completes successfully but processes only some of the logs, or none at all, because the rest only appear in TrackDB later.
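To make the failure mode concrete, here is a minimal sketch of the timing problem. It is not the actual task code: `LogRecord`, `indexed_at`, `visible_logs_for` and the file names are all hypothetical, and the real lookup is done against TrackDB rather than an in-memory list.

```python
from datetime import date, datetime
from typing import List, NamedTuple


class LogRecord(NamedTuple):
    """Hypothetical stand-in for a TrackDB record of one crawl log file."""
    path: str
    last_modified: datetime  # when the crawl log file was last written
    indexed_at: datetime     # when the record became visible in the index


def visible_logs_for(records: List[LogRecord], log_day: date, run_time: datetime) -> List[str]:
    """Return the logs for log_day that a job starting at run_time can see."""
    return [
        r.path
        for r in records
        if r.last_modified.date() == log_day and r.indexed_at <= run_time
    ]


records = [
    LogRecord("crawl.log.a", datetime(2023, 8, 22, 12, 0), datetime(2023, 8, 22, 13, 0)),
    # Uploaded just before midnight, so it is only indexed after midnight.
    LogRecord("crawl.log.b", datetime(2023, 8, 22, 23, 58), datetime(2023, 8, 23, 1, 30)),
]

# A run at midnight misses the late log without raising any error.
at_midnight = visible_logs_for(records, date(2023, 8, 22), datetime(2023, 8, 23, 0, 0))
# A later run sees both, but only because the index happened to catch up in time.
at_4am = visible_logs_for(records, date(2023, 8, 22), datetime(2023, 8, 23, 4, 0))
```

In this sketch the midnight run succeeds while silently skipping `crawl.log.b`, which matches the symptom described above.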

In the short term, this has been dealt with by re-running all recent, relevant DAGs (via the Airflow UI) and by changing the DAG schedule so that future runs start at 4am (d3f5c45).

However, this will still fail if there is an extended TrackDB outage. A more robust solution would ensure that TrackDB has been updated first, perhaps using a daily Airflow Dataset to keep tabs on when TrackDB is up to date.
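As a rough illustration of the "ensure TrackDB has been updated first" idea, the sketch below gates processing on the time of the last TrackDB refresh rather than on the clock. It is plain Python, not Airflow code, and everything in it is an assumption: the function names are hypothetical, timestamps are assumed to be in UTC, and how the last refresh time would be obtained (for example from an Airflow Dataset event) is left open.

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import List, Optional


def trackdb_covers_day(last_refresh: Optional[datetime], log_day: date) -> bool:
    """Return True if TrackDB was refreshed at or after the end of log_day (UTC)."""
    if last_refresh is None:
        # No refresh known (e.g. during an outage): nothing is safe to process.
        return False
    end_of_day = datetime.combine(log_day + timedelta(days=1), time.min, tzinfo=timezone.utc)
    return last_refresh >= end_of_day


def days_to_process(last_processed: date, last_refresh: Optional[datetime]) -> List[date]:
    """List every unprocessed day that TrackDB now fully covers, oldest first."""
    days = []
    day = last_processed + timedelta(days=1)
    while trackdb_covers_day(last_refresh, day):
        days.append(day)
        day += timedelta(days=1)
    return days


# Logs were last processed for 20 Aug; TrackDB was refreshed at 02:00 on 24 Aug.
refresh = datetime(2023, 8, 24, 2, 0, tzinfo=timezone.utc)
pending = days_to_process(date(2023, 8, 20), refresh)
```

Because the job asks "which days are now fully covered?" instead of "what was yesterday?", a run after an extended outage would pick up every skipped day, and a run before TrackDB has caught up would do nothing rather than succeed with partial data.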

@anjackson anjackson added the enhancement New feature or request label Aug 24, 2023