
Harvest

Harris Tzovanakis edited this page Jan 9, 2019 · 8 revisions

A celerybeat task runs every day at:

  • EST Timezone (Eastern Standard Time) UTC -5
  • EDT Timezone (Eastern Daylight Time) UTC -4

More details.

How to check harvest?

The easiest way to check the harvest is to open https://inspire-prod-grafana.web.cern.ch. There is also a Grafana alert that posts a message to Zulip in the ops/harvest topic.

How to harvest?

We harvest many collections from arXiv, but a single paper can be harvested as well. The collections relevant to INSPIRE are the following:

  • cs
  • econ
  • eess
  • math
  • physics
  • physics:astro-ph
  • physics:cond-mat
  • physics:gr-qc
  • physics:hep-ex
  • physics:hep-lat
  • physics:hep-ph
  • physics:hep-th
  • physics:math-ph
  • physics:nlin
  • physics:nucl-ex
  • physics:nucl-th
  • physics:physics
  • physics:quant-ph
  • q-bio
  • q-fin
  • stat

Harvest by collection

$ ssh inspire-prod-crawler1
$ inspirehep crawler schedule arXiv article --kwarg 'from_date=2018-12-06' --kwarg 'until_date=2018-12-07' --kwarg 'sets=cs,econ,eess,math,physics,physics:astro-ph,physics:cond-mat,physics:gr-qc,physics:hep-ex,physics:hep-lat,physics:hep-ph,physics:hep-th,physics:math-ph,physics:nlin,physics:nucl-ex,physics:nucl-th,physics:physics,physics:quant-ph,q-bio,q-fin,stat'

Note: from_date and until_date are very important, since they bound the date range that is harvested.
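
Typing the long sets value by hand is error-prone, so the command can be assembled from variables first. The following is a minimal sketch, not part of the official tooling: the variable names (SETS, sets_csv, from_date, until_date, cmd) are illustrative assumptions, and the dates are the ones from the example above. It only prints the command so that it can be reviewed before being run on the crawler host.

```shell
# Sketch: build the harvest-by-collection command from variables.
# All variable names here are illustrative, not part of inspirehep.

# Space-separated list of the arXiv sets listed on this page.
SETS="cs econ eess math physics \
physics:astro-ph physics:cond-mat physics:gr-qc \
physics:hep-ex physics:hep-lat physics:hep-ph physics:hep-th \
physics:math-ph physics:nlin physics:nucl-ex physics:nucl-th \
physics:physics physics:quant-ph q-bio q-fin stat"

# Join the names with commas, as expected by --kwarg 'sets=...'.
sets_csv=$(printf '%s' "$SETS" | tr ' ' ',')

# Date range to harvest (example values taken from the command above).
from_date="2018-12-06"
until_date="2018-12-07"

# Assemble the command as a string and print it for review.
cmd="inspirehep crawler schedule arXiv article --kwarg 'from_date=${from_date}' --kwarg 'until_date=${until_date}' --kwarg 'sets=${sets_csv}'"
echo "$cmd"
```

After checking the printed command, copy it into the shell on inspire-prod-crawler1.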

This command triggers a harvest. You can check the tasks in the queue (RabbitMQ) at any time with the following command:

$ ssh inspire-prod-broker1
$ rabbitmqctl -p inspire list_queues | grep harvests
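
To get a single number instead of raw queue lines, the output can be summed with awk. This is a sketch under assumptions: it relies on the default list_queues output of queue name followed by message count, the helper name pending_harvests is hypothetical, and the sample input below is made up for illustration rather than taken from a real broker.

```shell
# Sketch: sum the message counts of all queues whose name contains "harvests".
# pending_harvests is a hypothetical helper, not an existing command.
pending_harvests() {
  # $1 is the queue name, $2 the message count; header lines do not match.
  awk '$1 ~ /harvests/ { total += $2 } END { print total + 0 }'
}

# Made-up sample in the default "name<TAB>messages" layout.
sample=$(printf 'harvests\t3\ndefault\t0\n')
count=$(printf '%s\n' "$sample" | pending_harvests)
echo "$count"
```

On the broker host the same helper would be used as `rabbitmqctl -p inspire list_queues | pending_harvests`.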

Harvest a single paper

$ inspirehep crawler schedule arXiv_single article --kwarg 'identifier=oai:arXiv.org:1604.05726'
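
The identifier is the arXiv number prefixed with oai:arXiv.org:, as in the example above. A small sketch for deriving it from an arXiv number follows; the variable names are illustrative assumptions, and the command is only printed, not executed.

```shell
# Sketch: derive the OAI identifier from an arXiv number.
# Variable names are illustrative, not part of inspirehep.
arxiv_id="arXiv:1604.05726"

# Drop a leading "arXiv:" prefix if one is present.
arxiv_id="${arxiv_id#arXiv:}"

identifier="oai:arXiv.org:${arxiv_id}"

# Print the single-paper harvest command for review.
echo "inspirehep crawler schedule arXiv_single article --kwarg 'identifier=${identifier}'"
```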

You can check the logs by running:

$ inspirehep crawler job list
$ inspirehep crawler job logs <JOB_ID>