Finding the best order to run the rake tasks #378

Open
SgtPooki opened this issue Feb 23, 2022 · 7 comments

Comments

@SgtPooki
Member

@andrew supplied me with an image of the job schedule for the published ecosystem-research site:
[Image: Screenshot 2022-02-18 at 18 45 32]

but I am wondering if there is a better way to determine dependencies between each of the rake tasks? I was able to find relevant rake tasks by running the following commands:

bundle exec rake -AT | grep 'repo\|issue\|package\|org\|discover\|contribut' | awk '{print $2}' | xargs -I% sh -c 'echo bundle exec rake % --trace'

and

bundle exec rake -AT | grep 'sync\|search\|find\|calc' | awk '{print $2}' | grep -v 'pg_' | xargs -I% sh -c 'echo bundle exec rake % --trace'

Is there a better way to document, or discover, the order in which these should be run? I know that we can't simply make these tasks depend on one another, because their dependencies are asynchronous. However, would it be possible to create a set of rake tasks that more explicitly call out the order in which they need to be run?
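For the tasks that do declare prerequisites, Rake already knows the dependency graph (`rake -P` prints each task with its prerequisites). Below is a minimal sketch of turning that graph into a run order. The task names and their prerequisites are hypothetical, defined only to make the example runnable; they are not the project's real tasks, and asynchronous dependencies would not show up here.

```ruby
require "rake"

# Hypothetical task names and prerequisites, defined only for this sketch.
Rake::Task.define_task("orgs:sync")
Rake::Task.define_task("repos:sync" => ["orgs:sync"])
Rake::Task.define_task("issues:sync" => ["repos:sync"])
Rake::Task.define_task("contributors:sync" => ["repos:sync"])

# Return task names ordered so that every task comes after its prerequisites.
# Depth-first walk; assumes the graph has no cycles.
def run_order(tasks)
  prereqs = tasks.to_h { |t| [t.name, t.prerequisites] }
  ordered = []
  visit = lambda do |name|
    next if ordered.include?(name)
    prereqs.fetch(name, []).each { |dep| visit.call(dep) }
    ordered << name
  end
  prereqs.each_key { |name| visit.call(name) }
  ordered
end

puts run_order(Rake::Task.tasks)
```

This only documents the order for synchronous prerequisites; anything that waits on background jobs would still need to be written down separately.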

Proposal:

For those of us not looking to run our own version of ecosystem-research full-time with scheduled tasks, etc., maybe we could have a set of tasks that would get us to a usable state:

bundle exec rake schedule:setup:all            # set up minimum requirements for navigating all metadata in the UI
bundle exec rake schedule:setup:repos          # set up minimum requirements for navigating repo metadata in the UI
bundle exec rake schedule:setup:issues         # set up minimum requirements for navigating issues metadata in the UI
bundle exec rake schedule:setup:packages       # set up minimum requirements for navigating packages metadata in the UI
bundle exec rake schedule:setup:contributors   # set up minimum requirements for navigating contributors metadata in the UI
bundle exec rake schedule:daily:all            # if run daily, all metadata would be updated daily via this task
bundle exec rake schedule:daily:repos          # if run daily, repo metadata would be updated daily via this task
bundle exec rake schedule:daily:issues         # if run daily, issues metadata would be updated daily via this task
bundle exec rake schedule:daily:packages       # if run daily, package metadata would be updated daily via this task
bundle exec rake schedule:daily:contributors   # if run daily, contributors metadata would be updated daily via this task

and maybe support for a dynamic timeframe?
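A rough sketch of how the `schedule:daily:*` part of this proposal could be wired up, with the timeframe passed as a task argument. Everything here is an assumption for illustration: the `days` argument, the `ScheduleSketch` module, and the task bodies (which only record what they were asked to do) are hypothetical, and none of these tasks exist in the repo.

```ruby
require "rake"

module ScheduleSketch
  extend Rake::DSL

  RAN = []   # records [area, days] so the sketch has something observable
  AREAS = %w[repos issues packages contributors]

  namespace :schedule do
    namespace :daily do
      AREAS.each do |area|
        desc "Update #{area} metadata for the last N days (default 1)"
        task area, [:days] do |_task, args|
          days = (args[:days] || 1).to_i
          # A real task would call the relevant sync code here.
          RAN << [area, days]
        end
      end

      desc "Update all metadata for the last N days (default 1)"
      # Prerequisites with a matching argument name receive the same value.
      task :all, [:days] => AREAS
    end
  end
end

# Equivalent to: bundle exec rake "schedule:daily:all[7]"
Rake::Task["schedule:daily:all"].invoke(7)
p ScheduleSketch::RAN
```

The `schedule:setup:*` tasks could follow the same shape, with `all` depending on the per-area tasks.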

@SgtPooki
Member Author

@andrew any opinions?

andrew added a commit that referenced this issue Mar 1, 2022
@andrew
Collaborator

andrew commented Mar 1, 2022

I've gone through and added descriptions to all the existing rake tasks in 1d6e73d, to make it a little easier to understand why they exist.

There definitely need to be some more setup tasks alongside the existing ones. I'll put together a list of steps and manual bits of data that need to be added.

@andrew
Collaborator

andrew commented Mar 1, 2022

Initial setup steps off the top of my head before trying to automate them:

There are a few manual records that need to be created before the automated collection can start (currently done via the Rails console):

  1. Create auth tokens (from GitHub personal access tokens):
AuthToken.create(token: 'ABCDE12345')
  2. Create internal organizations (i.e. Protocol Labs orgs like ipfs, ipfs-shipyard, filecoin-project):
Organization.create(name: 'ipfs', internal: true)
Organization.create(name: 'ipfs-shipyard', internal: true)
Organization.create(name: 'filecoin-project', internal: true)
  3. Create search queries (for discovering repos from GitHub search).
    This step is optional; it is only needed if you want to discover related repos on a regular basis:
SearchQuery.bootstrap('ipfs')
SearchQuery.bootstrap('filecoin')

From there we can start pulling in data related to internal orgs:

Organization.find_each(&:sync)
Organization.find_each(&:import)
Organization.find_each(&:sync_recently_active_repos)

Then for each repository:

Repository.find_each(&:download_issues)
Repository.find_each(&:sync)

Then for each package:

Package.find_each(&:sync)
Package.find_dependent_github_repos

Then for more fine grained updates:

SearchQuery.run_all
Repository.discover_from_search_results
Issue.find_each(&:sync)
Contributor.find_each(&:sync)
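The sequence above could be captured as an ordered list that stops at the first failure and reports how far it got. This is only a sketch: the lambdas are stand-ins for the real model calls, and `run_in_order` is a hypothetical helper, not something in the codebase.

```ruby
# Run labelled steps in order; stop at the first error and report progress.
def run_in_order(steps)
  completed = []
  steps.each do |label, work|
    begin
      work.call
    rescue StandardError => e
      return { completed: completed, failed_at: label, error: e.message }
    end
    completed << label
  end
  { completed: completed, failed_at: nil, error: nil }
end

# Stand-in steps; in the app each lambda would wrap one of the calls above,
# e.g. -> { Repository.find_each(&:sync) }.
steps = [
  ["organizations", -> { :ok }],
  ["repositories",  -> { :ok }],
  ["packages",      -> { raise "rate limit hit" }],
  ["fine-grained",  -> { :ok }]
]

p run_in_order(steps)
```

Knowing which step failed makes it easier to resume from that point once the rate limit has reset.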

Most of these methods run sequentially rather than enqueuing async jobs, so this can take quite a long time if there are large (or many) internal orgs to begin with.

The main issue you'll have with running those tasks is that you'll hit the GitHub rate limits on your personal API key pretty quickly. The code currently doesn't have much in the way of back-offs, so it tends to just stop when there are rate limit errors. (This is why there are lots of tasks that run on an hourly basis checking the least recently synced records: they can always pick up where they left off once the rate limit is reset.)
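For reference, a back-off could look roughly like the sketch below: retry with exponentially growing waits when a rate limit error is raised. `RateLimitError` and `with_backoff` are hypothetical names for illustration; the real code would rescue whatever error its GitHub client raises.

```ruby
# Hypothetical error class standing in for the GitHub client's rate limit error.
class RateLimitError < StandardError; end

# Retry the block with exponential back-off (base_delay, 2x, 4x, ...).
# The sleeper is injectable so the waits can be observed or skipped in tests.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue RateLimitError
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage: fails twice with a rate limit error, then succeeds on the third attempt.
waits = []
result = with_backoff(sleeper: ->(seconds) { waits << seconds }) do |attempt|
  raise RateLimitError if attempt < 3
  "synced on attempt #{attempt}"
end
puts result
p waits
```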

@andrew
Collaborator

andrew commented Mar 2, 2022

@SgtPooki I'm working on making a single setup method on Organization that first does the things that need to run synchronously, then enqueues lots of async jobs for everything that doesn't need any tidy-up once it's done (and can be retried if the rate limit is hit). It might take a couple of days to rework some of the existing tasks so they fail more gracefully.
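The shape being described is roughly "synchronous steps first, then fan out independent jobs". A toy sketch of that shape, using an in-memory array in place of a real job queue; `SetupSketch`, `setup_sketch`, and the repo names are all made up for illustration and are not the actual implementation.

```ruby
# Toy model: do the blocking work first, then enqueue independent follow-ups.
class SetupSketch
  attr_reader :log, :queue

  def initialize
    @log = []     # synchronous steps, in the order they ran
    @queue = []   # stand-in for a background job queue
  end

  def setup_sketch(org_name)
    # Synchronous part: later jobs need the org and its repo list to exist.
    @log << "sync #{org_name}"
    repos = ["#{org_name}/repo-a", "#{org_name}/repo-b"] # stand-in data

    # Async part: each job is independent, so it can fail and be retried alone.
    repos.each do |repo|
      @queue << [:download_issues, repo]
      @queue << [:sync_repo, repo]
    end
    @queue.size
  end
end

sketch = SetupSketch.new
sketch.setup_sketch("ipfs")
p sketch.log
p sketch.queue
```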

@andrew
Collaborator

andrew commented Mar 9, 2022

Pushed 3278e8f if you'd like to give it a try, either as a rake task:

rake org:setup

or in the console:

Organization.internal.each(&:setup_async)

and then make sure sidekiq is running as it will generate a lot of jobs.

One thing that may need tweaking is how the jobs fail/retry on GitHub errors, perhaps by delaying jobs until after the rate limit has been reset.
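On delaying until the reset: the GitHub REST API reports the reset time as epoch seconds in the `X-RateLimit-Reset` response header, so the wait can be computed directly. A small sketch, where `seconds_until_reset` and the `padding` parameter are hypothetical and the worker name in the comment is made up:

```ruby
# Seconds to wait before retrying, given the X-RateLimit-Reset value
# (epoch seconds). A small padding avoids retrying a moment too early.
def seconds_until_reset(reset_epoch, now: Time.now, padding: 5)
  remaining = reset_epoch.to_i - now.to_i
  remaining.positive? ? remaining + padding : 0
end

# Usage: the limit resets 60 seconds from "now".
now = Time.at(1_646_000_000)
delay = seconds_until_reset(1_646_000_060, now: now)
puts delay

# With Sidekiq, a job could then be rescheduled with something like
#   SomeSyncWorker.perform_in(delay, record_id)   # hypothetical worker name
```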

@SgtPooki
Member Author

I had to fix some Ruby issues locally before I was finally able to test this, and now sidekiq won't run for me anymore. I'm not sure what happened, but my entire setup got borked somehow. I won't have the chance to look into this for a while.

@andrew
Collaborator

andrew commented May 16, 2022

Ping me on Slack if you'd like any help debugging, or feel free to drop CLI output with errors here and I'll see if I can help.
