
Heroshi programming interface

Heroshi can be seen as a single point of control (the queue manager) coordinating multiple controlled instances (workers) for continuous web crawling. A human administrator directly controls only the queue manager, which, in turn, may delay handing out URLs for crawling to all or to particular workers. By design, the manager has no means to connect to workers directly; it must wait for a worker to connect to it.

APIs are planned for the administrator and for workers.

Update: the worker API is being deprecated; AMQP will be used between the manager and workers instead of the custom HTTP/JSON protocol.

Authentication

Authentication is done via API keys. An API key is simply a sufficiently long string. The manager maintains one administrator key and a set of worker keys. Keys are stored in the manager's memory; they will probably need to be dumped to persistent storage in case the manager goes down.

The administrator key is passed to the manager at deploy/configuration time. It allows the admin to connect to the manager and do maintenance (see the API for administrator below).

The admin can issue a new worker key and send it to his friend Boris by email. Alternatively, Boris can set up a worker instance, generate a random API key himself, and send it to his friend the admin.
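
For illustration, such a random key could be produced with a few lines of Python (a sketch assuming Python 3.6+ for the secrets module; not part of Heroshi itself; the name: prefix convention is described below):

    import secrets

    def make_api_key(worker_name):
        # "name:random-part" convention; 32 random bytes, URL-safe encoded.
        return "%s:%s" % (worker_name, secrets.token_urlsafe(32))

    print(make_api_key("boris"))  # e.g. boris:kW3x...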

Technically, authentication is done by passing the X-Heroshi-Auth header with the API key on every request.

Sample: X-Heroshi-Auth: Boris:bhi&FG45ef^^oyNIbJ\fdp]ax%c4s&F^">~0=z3mddYb_d

Since an API key is just a string, there is no hard limit on its format, but for convenience the part up to the first : (colon) is treated as a unique worker name, which is used in admin reports. It is best set to the name of the friend who provided the bandwidth to run the web crawler instance.
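
As a minimal sketch of attaching this header from Python (using the third-party requests library; the manager address is an assumption, as it is not specified here):

    import requests

    MANAGER = "http://manager.example.com"  # assumed manager address
    AUTH = {"X-Heroshi-Auth": "boris:foo-bar-long-key"}

    # Every request to the manager carries the API key header.
    response = requests.get(MANAGER + "/statistics", headers=AUTH)
    print(response.status_code)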

Authorization

Workers are limited to the worker API (see below).

The administrator may do everything else.

API for administrator

Allows the administrator to (a combined client sketch follows the list):

  • get all kinds of statistics, current or historical

    Sample request: GET /statistics?date-from=2009-03-29&date-to=2009-04-02
    Sample request: GET /statistics
    Sample response (JSON decoded): {'workers': [
    {'key': 'boris:foo-bar-long-key', 'uptime': 39392, 'crawled-urls': 229939,
    'crawled-sites': 2234}
    ], 'uptime': 39347, 'crawled-urls': 237897, 'received-KB': 9298451}
  • get the current in-memory new-links list

    Sample request: GET /temp-new-links
    Sample response (JSON decoded): [{'url': 'http://github.com/temoto/configs',
    'given-to-crawling': None}]
  • add another URL list to the in-memory new-links list

    Sample request: POST /temp-new-links
    POST data (JSON decoded): [{'url': 'http://github.com/temoto/configs'}]
    Sample response: 200 OK
  • add API key to trusted workers list

    Sample request: POST /api-keys
    POST data (not JSON): inga:foo-bar-long-key
    Sample response: 201 Created
  • revoke API key completely

    Sample request: DELETE /api-keys/boris:foo-bar-long-key
    Sample response: 200 OK
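
Put together, an admin session against these endpoints might look like the following Python sketch (the manager address and admin key are illustrative assumptions):

    import json
    import requests

    MANAGER = "http://manager.example.com"             # assumed manager address
    AUTH = {"X-Heroshi-Auth": "admin:long-admin-key"}  # hypothetical admin key

    # Statistics for a date range.
    stats = requests.get(MANAGER + "/statistics",
                         params={"date-from": "2009-03-29",
                                 "date-to": "2009-04-02"},
                         headers=AUTH).json()

    # Seed the in-memory new-links list.
    requests.post(MANAGER + "/temp-new-links",
                  data=json.dumps([{"url": "http://github.com/temoto/configs"}]),
                  headers=AUTH)

    # Trust a new worker key, then revoke it.
    requests.post(MANAGER + "/api-keys", data="inga:foo-bar-long-key",
                  headers=AUTH)
    requests.delete(MANAGER + "/api-keys/inga:foo-bar-long-key", headers=AUTH)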

API for worker

Allows a worker to (a worker-loop sketch follows the list):

  • get URLs for crawling

    Sample request: POST /crawl-queue
    POST data: limit=200
    Sample response (JSON decoded): [ {'url': 'http://github.com/temoto/heroshi',
    'last-visited': None, 'given-to-crawling': '2009-03-29 13:55:21'} ]
  • report result

    Sample request: PUT /report
    Sample body (JSON decoded): {'url': 'http://github.com/temoto/heroshi',
    'status': '200 OK',
    'headers': {'Content-Type': 'text/html; charset=utf-8', 'Connection': 'keep-alive',
    'X-Runtime': '91ms'},
    'total_time': 0.58, 'content-body': '<!DOCTYPE html…//30KB cut//'}
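
Combining the two calls, a minimal worker loop might look like the sketch below (the manager address and the fetch handling are assumptions; note the deprecation in favour of AMQP mentioned above):

    import json
    import requests

    MANAGER = "http://manager.example.com"  # assumed manager address
    AUTH = {"X-Heroshi-Auth": "boris:foo-bar-long-key"}

    # Ask the manager for a batch of URLs to crawl.
    batch = requests.post(MANAGER + "/crawl-queue",
                          data={"limit": 200}, headers=AUTH).json()

    for item in batch:
        # Fetch the page; a real worker would also handle errors,
        # timeouts, redirects and robots.txt.
        page = requests.get(item["url"], timeout=30)

        # Report the result back to the manager.
        report = {
            "url": item["url"],
            "status": "%d %s" % (page.status_code, page.reason),
            "headers": dict(page.headers),
            "total_time": page.elapsed.total_seconds(),
            "content-body": page.text,
        }
        requests.put(MANAGER + "/report", data=json.dumps(report), headers=AUTH)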