
Heroshi programming interface

Heroshi can be seen as a single point of control (the queue manager) coordinating multiple controlled instances (workers) for continuous web crawling. A human administrator directly controls only the queue manager, which, in turn, may delay handing out URLs for crawling to all or to particular workers. By design, the manager has no means to connect to workers directly; it must wait for a worker to connect to it.

APIs are planned for the administrator and for workers.

Update: the worker API is being deprecated; AMQP will be used between the manager and workers instead of the custom HTTP/JSON protocol.

Authentication

Authentication is done via API keys. An API key is simply a sufficiently long string. The manager maintains one administrator key and a set of worker keys. Keys are stored in the manager's memory; they will probably need to be dumped to persistent storage in case the manager goes down.

The administrator key is passed to the manager at deploy/configuration time. It allows the admin to connect to the manager and do maintenance (see the API for administrator below).

The admin can issue a new worker key and send it to his friend Boris by email. Alternatively, Boris can set up a worker instance, generate a random API key himself, and send it to his friend the admin.
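
For illustration, such a random key could be produced with a few lines of Python (a sketch assuming Python 3.6+ for the secrets module; not part of Heroshi itself; the name: prefix convention is described below):

    import secrets

    def make_api_key(worker_name):
        # "name:random-part" convention; 32 random bytes, URL-safe encoded.
        return "%s:%s" % (worker_name, secrets.token_urlsafe(32))

    print(make_api_key("boris"))  # e.g. boris:kW3x...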

Technically, authentication is done by passing the X-Heroshi-Auth header with the API key on every request.

Sample: X-Heroshi-Auth: Boris:bhi&FG45ef^^oyNIbJ\fdp]ax%c4s&F^">~0=z3mddYb_d

Since an API key is just a string, there is no hard limit on its format, but for convenience the part up to the first : (colon) is treated as a unique worker name, which is used in admin reports. It is best set to the name of the friend who provided the bandwidth to run the web crawler instance.
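
As a minimal sketch of attaching this header from Python (using the third-party requests library; the manager address is an assumption, as it is not specified here):

    import requests

    MANAGER = "http://manager.example.com"  # assumed manager address
    AUTH = {"X-Heroshi-Auth": "boris:foo-bar-long-key"}

    # Every request to the manager carries the API key header.
    response = requests.get(MANAGER + "/statistics", headers=AUTH)
    print(response.status_code)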

Authorization

Workers are limited to the worker API (see below).

The administrator may do everything else.

API for administrator

Allows the administrator to (a combined client sketch follows the list):

  • get all kinds of statistics, current or historical

    Sample request: GET /statistics?date-from=2009-03-29&date-to=2009-04-02
    Sample request: GET /statistics
    Sample response (JSON decoded): {'workers': [
    {'key': 'boris:foo-bar-long-key', 'uptime': 39392, 'crawled-urls': 229939,
    'crawled-sites': 2234}
    ], 'uptime': 39347, 'crawled-urls': 237897, 'received-KB': 9298451}
  • get the current in-memory new-links list

    Sample request: GET /temp-new-links
    Sample response (JSON decoded): [{'url': 'http://github.com/temoto/configs',
    'given-to-crawling': None}]
  • add another URL list to the in-memory new-links list

    Sample request: POST /temp-new-links
    POST data (JSON decoded): [{'url': 'http://github.com/temoto/configs'}]
    Sample response: 200 OK
  • add API key to trusted workers list

    Sample request: POST /api-keys
    POST data (not JSON): inga:foo-bar-long-key
    Sample response: 201 Created
  • revoke API key completely

    Sample request: DELETE /api-keys/boris:foo-bar-long-key
    Sample response: 200 OK
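
Put together, an admin session against these endpoints might look like the following Python sketch (the manager address and admin key are illustrative assumptions):

    import json
    import requests

    MANAGER = "http://manager.example.com"             # assumed manager address
    AUTH = {"X-Heroshi-Auth": "admin:long-admin-key"}  # hypothetical admin key

    # Statistics for a date range.
    stats = requests.get(MANAGER + "/statistics",
                         params={"date-from": "2009-03-29",
                                 "date-to": "2009-04-02"},
                         headers=AUTH).json()

    # Seed the in-memory new-links list.
    requests.post(MANAGER + "/temp-new-links",
                  data=json.dumps([{"url": "http://github.com/temoto/configs"}]),
                  headers=AUTH)

    # Trust a new worker key, then revoke it.
    requests.post(MANAGER + "/api-keys", data="inga:foo-bar-long-key",
                  headers=AUTH)
    requests.delete(MANAGER + "/api-keys/inga:foo-bar-long-key", headers=AUTH)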

API for worker

Allows a worker to (a worker-loop sketch follows the list):

  • get URLs for crawling

    Sample request: POST /crawl-queue
    POST data: limit=200
    Sample response (JSON decoded): [ {'url': 'http://github.com/temoto/heroshi',
    'last-visited': None, 'given-to-crawling': '2009-03-29 13:55:21'} ]
  • report result

    Sample request: PUT /report
    Sample body (JSON decoded): {'url': 'http://github.com/temoto/heroshi',
    'status': '200 OK',
    'headers': {'Content-Type': 'text/html; charset=utf-8', 'Connection': 'keep-alive',
    'X-Runtime': '91ms'},
    'total_time': 0.58, 'content-body': '<!DOCTYPE html…//30KB cut//'}
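
Combining the two calls, a minimal worker loop might look like the sketch below (the manager address and the fetch handling are assumptions; note the deprecation in favour of AMQP mentioned above):

    import json
    import requests

    MANAGER = "http://manager.example.com"  # assumed manager address
    AUTH = {"X-Heroshi-Auth": "boris:foo-bar-long-key"}

    # Ask the manager for a batch of URLs to crawl.
    batch = requests.post(MANAGER + "/crawl-queue",
                          data={"limit": 200}, headers=AUTH).json()

    for item in batch:
        # Fetch the page; a real worker would also handle errors,
        # timeouts, redirects and robots.txt.
        page = requests.get(item["url"], timeout=30)

        # Report the result back to the manager.
        report = {
            "url": item["url"],
            "status": "%d %s" % (page.status_code, page.reason),
            "headers": dict(page.headers),
            "total_time": page.elapsed.total_seconds(),
            "content-body": page.text,
        }
        requests.put(MANAGER + "/report", data=json.dumps(report), headers=AUTH)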