Finding the best order to run the rake tasks #378

SgtPooki · 2022-02-23T22:34:20Z

@andrew supplied me with an image of the job schedule for the published ecosystem-research site:

but I am wondering if there is a better way to determine dependencies between each of the rake tasks? I was able to find relevant rake tasks by running the following commands:

bundle exec rake -AT | grep 'repo\|issue\|package\|org\|discover\|contribut' | awk '{print $2}' | xargs -I% sh -c 'echo bundle exec rake % --trace'

and

bundle exec rake -AT | grep 'sync\|search\|find\|calc' | awk '{print $2}' | grep -v 'pg_' | xargs -I% sh -c 'echo bundle exec rake % --trace'

I'm wondering if there's a better way to document, or discover, the order in which these should be run. I know that we can't simply make these tasks depend upon one another, because their dependency is asynchronous. However, would it be possible to create a set of rake tasks that more explicitly call out the order in which these need to be run?

Proposal:

For those of us not looking to run our own version of ecosystem-research full-time, with scheduled tasks, etc., maybe we could have a set of tasks that would get us to a usable state:

bundle exec rake schedule:setup:all           # set up minimum requirements for navigating all metadata in the UI
bundle exec rake schedule:setup:repos         # set up minimum requirements for navigating repo metadata in the UI
bundle exec rake schedule:setup:issues        # set up minimum requirements for navigating issue metadata in the UI
bundle exec rake schedule:setup:packages      # set up minimum requirements for navigating package metadata in the UI
bundle exec rake schedule:setup:contributors  # set up minimum requirements for navigating contributor metadata in the UI
bundle exec rake schedule:daily:all           # if run daily, all metadata would be updated daily via this task
bundle exec rake schedule:daily:repos         # if run daily, repo metadata would be updated daily via this task
bundle exec rake schedule:daily:issues        # if run daily, issue metadata would be updated daily via this task
bundle exec rake schedule:daily:packages      # if run daily, package metadata would be updated daily via this task
bundle exec rake schedule:daily:contributors  # if run daily, contributor metadata would be updated daily via this task

and maybe support for a dynamic timeframe?
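To make the proposal concrete, here is a minimal sketch of how such umbrella tasks could be laid out with Rake namespaces. The task bodies are placeholders (they only record that they ran), and the `schedule:*` names come from the proposal above, not from the existing codebase. Since the real dependencies are asynchronous, an `all` task like this would only fix the order in which steps are kicked off; it would not wait for background jobs to finish.

```ruby
require "rake"
extend Rake::DSL # a real Rakefile already has the DSL; only needed when run as a plain script

# Placeholder log so the call order can be inspected; the real tasks would
# call into the application instead.
run_log = []
areas = %w[repos issues packages contributors]

namespace :schedule do
  %w[setup daily].each do |phase|
    namespace phase do
      areas.each do |area|
        desc "#{phase} steps for #{area} metadata (placeholder body)"
        task area do
          run_log << "#{phase}:#{area}"
        end
      end

      # Rake invokes prerequisites in the order they are listed.
      desc "run every #{phase} task in a fixed order"
      task all: areas
    end
  end
end
```

With this in place, `bundle exec rake schedule:setup:all` would run the four setup tasks in the listed order, and each one could still be invoked on its own.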


SgtPooki · 2022-02-28T16:42:30Z

@andrew any opinions?

Related #378

andrew · 2022-03-01T16:10:36Z

I've gone through and added descriptions to all the existing rake tasks in 1d6e73d to make it a little easier to understand why they exist.

There definitely need to be some more setup tasks in addition to all of the existing ones. I'll put together a list of steps and manual bits of data that need to be added.

andrew · 2022-03-01T17:08:03Z

Initial setup steps off the top of my head before trying to automate them:

There are a few manual records that need to be created before the automated collection can start (currently done via the Rails console):

create auth tokens (from GitHub personal access tokens)

AuthToken.create(token: 'ABCDE12345')

create internal organizations (i.e. Protocol Labs orgs like ipfs, ipfs-shipyard, filecoin-project)

Organization.create(name: 'ipfs', internal: true)
Organization.create(name: 'ipfs-shipyard', internal: true)
Organization.create(name: 'filecoin-project', internal: true)

create search queries (for discovering repos from GitHub search)
This step is optional; it is only needed if you want to discover related repos on a regular basis

SearchQuery.bootstrap('ipfs')
SearchQuery.bootstrap('filecoin')

From there we can start pulling in data related to internal orgs:

Organization.find_each(&:sync)
Organization.find_each(&:import)
Organization.find_each(&:sync_recently_active_repos)

Then for each repository:

Repository.find_each(&:download_issues)
Repository.find_each(&:sync)

Then for each package:

Package.find_each(&:sync)
Package.find_dependent_github_repos

Then for more fine-grained updates:

SearchQuery.run_all
Repository.discover_from_search_results
Issue.find_each(&:sync)
Contributor.find_each(&:sync)

Most of these functions are not enqueuing async tasks, instead running sequentially, so this can take quite a long time if there are large (or a lot of) internal orgs to begin with.

The main issue you'll have with running those tasks is that you'll hit the GitHub rate limits on your personal API key pretty quickly. The code currently doesn't have much in the way of back-offs; it tends to just stop when there are rate limit errors. (This is why there are lots of tasks that run on an hourly basis checking the least recently synced records: each run can pick up where the last one left off once the rate limit is reset.)
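The "pick up where it left off" pattern can be sketched in plain Ruby as follows. This is an illustration only: `Record`, `RateLimited`, and `sync_least_recent` are hypothetical names, not the models or error classes used in the codebase, and the block stands in for the real API call.

```ruby
# Hypothetical error class standing in for whatever the GitHub client raises.
class RateLimited < StandardError; end

# Minimal stand-in for a database record with a last-synced timestamp.
Record = Struct.new(:name, :last_synced_at)

# Sync the least recently synced records first and stop quietly at the first
# rate-limit error. Records that were not reached keep their old timestamp,
# so they sort to the front on the next scheduled run.
def sync_least_recent(records, limit:, now: Time.now)
  synced = []
  # Never-synced records (nil timestamp) are treated as the oldest.
  batch = records.sort_by { |r| r.last_synced_at || Time.at(0) }.first(limit)
  batch.each do |record|
    begin
      yield record # the real API call would go here
    rescue RateLimited
      break # stop for now; the next run resumes with the oldest records
    end
    record.last_synced_at = now
    synced << record.name
  end
  synced
end
```

Because a run that is cut short leaves the unreached records with their old timestamps, an hourly schedule keeps making progress without tracking any explicit cursor.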

andrew · 2022-03-02T12:35:49Z

@SgtPooki I'm working on making a single setup method on Organization that first does the things that need to run synchronously, then enqueues lots of async jobs for everything that doesn't need any tidying up once it's done (and can be retried if the rate limit is hit). It might take a couple of days to rework some of the existing tasks so that they fail more gracefully.

andrew · 2022-03-09T19:11:02Z

Pushed 3278e8f if you'd like to give it a try, either as a rake task:

rake org:setup

or in the console:

Organization.internal.each(&:setup_async)

and then make sure Sidekiq is running, as it will generate a lot of jobs.

One thing that may need tweaking is how the jobs fail/retry on GitHub errors, perhaps delaying jobs until after the rate limit has been reset.
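As a rough sketch of the "delay until the rate limit has been reset" idea: GitHub's REST API reports the reset time in the `x-ratelimit-reset` response header as UTC epoch seconds, so the wait can be computed from that value. The helper name `seconds_until_reset` and the `buffer` margin are assumptions for illustration, not something in the codebase.

```ruby
# Seconds to wait before retrying, given the value of GitHub's
# `x-ratelimit-reset` header (UTC epoch seconds, as a string or integer).
# `buffer` is an assumed safety margin for clock skew between machines.
def seconds_until_reset(reset_epoch, now: Time.now, buffer: 5)
  wait = Integer(reset_epoch) - now.to_i
  wait.positive? ? wait + buffer : 0
end
```

A job that hits the limit could then re-enqueue itself with that delay, for example via Sidekiq's `perform_in`, instead of failing and relying on the default retry schedule.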

SgtPooki · 2022-05-12T20:23:27Z

I had to fix some Ruby issues locally, but I was finally able to test this, and Sidekiq won't run for me anymore. I'm not sure what happened, but my entire setup got borked somehow. I won't have the chance to look into this for a while.

andrew · 2022-05-16T08:50:40Z

Ping me on Slack if you'd like any help debugging, or feel free to drop CLI outputs with errors here and I'll see if I can help.

andrew added a commit that referenced this issue Mar 1, 2022

Add descriptions to all rake tasks

1d6e73d

Related #378

andrew added a commit that referenced this issue Mar 9, 2022

async setup methods and rake task for orgs and repos #378

3278e8f
