Finding the best order to run the rake tasks #378
Comments
@andrew any opinions?
I've gone through and added descriptions to all the existing rake tasks in 1d6e73d to make it a little easier to understand why they exist. There definitely need to be some more setup tasks in addition to the existing ones. I'll put together a list of steps and manual bits of data that need to be added.
Initial setup steps, off the top of my head, before trying to automate them. There are a few manual records that need to be created before the automated collection can start (currently done via the Rails console):
From there we can start pulling in data related to internal orgs:
Then for each repository:
Then for each package:
Then for more fine grained updates:
Most of these functions don't enqueue async tasks and instead run sequentially, so this can take quite a long time if there are large (or many) internal orgs to begin with. The main issue you'll have running those tasks is that you'll hit the GitHub rate limits on your personal API key pretty quickly. The code currently doesn't have much in the way of back-offs; it tends to just stop when there are rate limit errors. (This is why lots of tasks run on an hourly basis checking the least recently synced records: they can always pick up where they left off once the rate limit is reset.)
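For illustration, here is a rough sketch of what a back-off around a sequential sync step could look like. It is not code from the project: `RateLimitError`, `with_rate_limit_backoff`, and the `sleeper`/`clock` parameters are all hypothetical names, and the error class stands in for whatever the API client actually raises when the limit is exhausted.

```ruby
# Hypothetical error class standing in for whatever the API client raises
# when the rate limit is exhausted.
class RateLimitError < StandardError
  attr_reader :reset_at

  def initialize(reset_at)
    super("rate limit exceeded")
    @reset_at = reset_at # Time at which the limit is expected to reset
  end
end

# Runs the block. When it raises RateLimitError, waits until the reported
# reset time and tries again, for at most max_attempts attempts in total.
# `sleeper` and `clock` are injectable so the helper can be exercised
# without actually waiting.
def with_rate_limit_backoff(max_attempts: 3,
                            sleeper: ->(seconds) { sleep(seconds) },
                            clock: -> { Time.now })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue RateLimitError => e
    raise if attempts >= max_attempts # give up and surface the error

    wait = [e.reset_at - clock.call, 0].max
    sleeper.call(wait)
    retry
  end
end
```

A sequential step would then be wrapped as `with_rate_limit_backoff { sync_one_record }`, where `sync_one_record` is again a placeholder, so that the run pauses instead of stopping at the first rate limit error.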
@SgtPooki I'm working on making a single
Pushed 3278e8f if you'd like to give it a try, either as a rake task:
or in the console:
and then make sure Sidekiq is running, as it will generate a lot of jobs. One thing that may need tweaking is how the jobs fail/retry on GitHub errors, perhaps delaying jobs until after the rate limit has been reset.
I had to fix some Ruby issues locally, but I was finally able to test this, and Sidekiq won't run for me anymore. I'm not sure what happened, but my entire setup got borked somehow. I won't have the chance to look into this for a while.
Ping me on Slack if you'd like any help debugging, or feel free to drop CLI output with errors here and I'll see if I can help.
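As a sketch of the "delay until the rate limit has been reset" idea: GitHub's REST API reports the reset time as UTC epoch seconds in the `x-ratelimit-reset` response header, so a delay can be computed from it. The helper name `seconds_until_reset` and the `padding` margin are assumptions for illustration, not part of the project.

```ruby
# Returns how many seconds to wait before retrying, given a hash of
# response headers. `padding` is an arbitrary safety margin (an assumption)
# added so the retry does not land exactly on the reset boundary.
def seconds_until_reset(headers, now: Time.now, padding: 5)
  reset = headers["x-ratelimit-reset"]
  return padding if reset.nil? # header missing: fall back to the margin only

  [reset.to_i - now.to_i, 0].max + padding
end
```

The result could then be handed to Sidekiq's `perform_in` to reschedule the job (for example `SomeSyncWorker.perform_in(delay, record_id)`, with a hypothetical worker name) rather than letting it fail and retry on the default schedule.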
@andrew supplied me with an image of the job schedule for the published ecosystem-research site:
but I am wondering if there is a better way to determine dependencies between each of the rake tasks? I was able to find relevant rake tasks by running the following commands:
bundle exec rake -AT | grep 'repo\|issue\|package\|org\|discover\|contribut' | awk '{print $2}' | xargs -I% sh -c 'echo bundle exec rake % --trace'
and
bundle exec rake -AT | grep 'sync\|search\|find\|calc' | awk '{print $2}' | grep -v 'pg_' | xargs -I% sh -c 'echo bundle exec rake % --trace'
I'm wondering if there's a better way to document, or discover, the order in which these should be run. I know that we can't simply make these tasks depend upon one another, because their dependency is asynchronous. However, would it be possible to create a set of rake tasks that more explicitly call out the order in which these need to be run?
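To make the question concrete, here is a minimal sketch of a wrapper task that spells out an order through Rake prerequisites, which Rake invokes in the order listed. The task names are placeholders, not the project's real tasks, and this only fixes the order in which tasks are invoked: if a task merely enqueues background jobs, the next one will still start before those jobs finish.

```ruby
require "rake"
extend Rake::DSL # makes `task` and `desc` available outside a Rakefile

run_log = [] # records the order in which the placeholder tasks ran

# Placeholder task names: the real tasks in the project differ.
steps = %w[bootstrap:orgs bootstrap:repositories bootstrap:packages]

steps.each do |name|
  task(name) { run_log << name }
end

desc "Run the placeholder setup tasks in a fixed order"
task "bootstrap:all" => steps
```

Running `bundle exec rake bootstrap:all` would then invoke the three steps in sequence, and `rake -T` would show the wrapper with its description.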
Proposal:
For those of us not looking to run our own version of ecosystem-research full-time, with scheduled tasks etc., maybe we could have a set of tasks that would get us to a usable state:
and maybe support for a dynamic timeframe?