Simple Node.js worker that crawls sitemaps in order to keep an Algolia index up-to-date.
It uses simple CSS selectors to find the actual text content to index.
This app uses Algolia's library.
- Usage
- Pre-requisites
- Installation
- Running
- Configuration file
- Configuration options
- Stored Object
- Indexing
- License
This script should be run via crontab in order to crawl the entire website at regular intervals.
- Have at least one valid sitemap.xml URL that contains all the URLs you want indexed.
- The sitemap(s) must contain at least the <loc> node, i.e. urlset/url/loc (see the example after this list).
- An empty Algolia index.
- An Algolia Credential that can create objects and set settings on the index, i.e. search, addObject, settings, browse, deleteObject, editSettings, deleteIndex
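For reference, here is a minimal, illustrative sitemap that satisfies this requirement (the URLs are placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
  </url>
</urlset>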
- Get the latest version
- npm
npm i algolia-webcrawler -g
- git
- ssh+git:
git clone git@github.com:DeuxHuitHuit/algolia-webcrawler.git
- https:
git clone https://github.com/DeuxHuitHuit/algolia-webcrawler.git
- https: download the latest tarball
- npm
- create a config.json file
algolia-webcrawler --config config.json
cd to the root of the project and run node app.
Configuration is done via the config.json file.
You can choose a config.json file stored elsewhere using the --config flag.
node app --config my-config.json
At the bare minimum, you can edit config.json to set values for the following options: 'app', 'cred', 'indexname' and at least one 'sitemap' object; a minimal example is shown below. If you have multiple sitemaps, please list them all: sub-sitemaps will not be crawled.
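As a rough sketch, a minimal config.json could look like the following. The key names for the index object and the sitemap list are inferred from the descriptions below, so treat them as assumptions rather than authoritative names:
{
  "app": "my-website",
  "cred": {
    "appid": "YourAlgoliaAppID",
    "apikey": "YourAlgoliaApiKey"
  },
  "index": {
    "name": "my-index"
  },
  "sitemaps": [
    { "url": "https://example.com/sitemap.xml", "lang": "en" }
  ]
}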
Most options are required. No defaults are provided, unless stated otherwise.
The name of your app.
Algolia credentials object. See 'cred.appid' and 'cred.apikey'.
Your Algolia App ID.
Your generated Algolia API key.
Simple delay, in milliseconds, between each request made to the website.
The maximum number of milliseconds an entry can live without being updated. After each run, the app will search for old entries and delete them. If you do not wish to get rid of old entries, set this value to 0.
A filter string that will be applied when deleting old entries. Useful when you want to keep old records that won't get updated. Only records that are old and match the filter will be deleted.
The maximum size in bytes of a record to be sent to Algolia. The default is 10,000, but it may vary depending on your Algolia plan.
When the record is too big (based on maxRecordSize), the crawler will remove values from the text key. Use this attribute to configure which keys should be pruned when the record is too big.
An object containing various values related to your index.
Your index name.
An object that will act as the argument to Algolia's Index#setSettings method.
Please read Algolia's documentation on that subject. Any valid attribute documented for this method can be used.
An array of strings that defines which attributes are indexable, meaning full text search will be performed against them. For a complete list of possible attributes, see the Stored Object section.
An array of strings that defines which attributes are filterable, meaning you can use them to exclude some records from being returned. For a complete list of possible attributes, see the Stored Object section.
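As an illustration, assuming these two arrays map onto Algolia's standard searchableAttributes and attributesForFaceting settings, the index object could be sketched as follows (the attribute names come from the Stored Object section):
"index": {
  "name": "my-index",
  "settings": {
    "searchableAttributes": ["title", "description", "text"],
    "attributesForFaceting": ["lang"]
  }
}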
This array should contain a list of sitemap objects.
A sitemap is a really simple object with two String properties: url and lang. The 'url' property is the exact URL for this sitemap. The 'lang' property should state the main language of the URLs found in the sitemap.
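For example, a site published in English and French could list both sitemaps (the sitemaps key name is an assumption and the URLs are placeholders):
"sitemaps": [
  { "url": "https://example.com/sitemap.en.xml", "lang": "en" },
  { "url": "https://example.com/sitemap.fr.xml", "lang": "fr" }
]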
An object containing different http options.
The auth string, in node's username:password form.
If you do not need auth, you still need to specify an empty String.
An object containing CSS selectors used to find the content in the page's HTML.
CSS selector for the title of the page.
CSS selector for the description of the page.
CSS selector for the image of the page.
CSS selector for the text of the page.
CSS selector for the "key" property. You can add custom keys as you wish.
Selectors can also be defined using the long form (i.e. as an object), which allows specifying custom properties on them.
Names of the attributes in which to look for values. The default is ['content', 'value'].
The actual CSS selector to use.
The maximum number of nodes to check.
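For instance, a long-form selector could look like the sketch below; selector and attributes follow the descriptions above, while maxNodes is a hypothetical name for the node-limit property:
"image": {
  "selector": "meta[property='og:image']",
  "attributes": ["content"],
  "maxNodes": 1
}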
An object containing CSS selectors that identify elements that must not be indexed. Those CSS selectors are matched for each node and checked against all of its parents, to make sure none of its parents are excluded.
CSS selector of excluded elements for the text of the page.
CSS selector of excluded elements for "key" property. The key must match the one used in selectors[key].
An object containing formatter strings. Their values are removed from the original result obtained with the associated CSS selector.
The string to remove from the title of the page. Can also be an array of strings.
The string to remove from the specified key. Can also be an array of strings.
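For example, to strip a site-name suffix from every title (assuming the option is named formatters):
"formatters": {
  "title": " | Example Site"
}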
The parse function used to format the value. Supported types are "integer", "float", "boolean" and "json".
The default value inserted for the specified key. Will be set if the value is falsy.
A list of JavaScript files to load custom code before saving the record. The only requirement is to implement the following interface, where record is the object to be saved and data is the HTML.
module.exports = (record, data) => {
record.value_from_plugin = 'Yay!';
};
All URLs are checked against all items in the blacklist. If the complete URL or its path component is in the blacklist, it won't get indexed.
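For example (assuming the option is named blacklist; the entries are placeholders):
"blacklist": [
  "/admin",
  "https://example.com/private"
]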
The stored object on Algolia's server is as follows
{
date: new Date(),
url: 'http://...',
objectID: shasum.digest('base64'),
lang: sitemap.lang,
http: {},
title: '',
description: '',
image: '',
text: ['...']
}
One thing to notice is that text is an array, since we tried to preserve the original text node -> actual value relationship. Algolia handles this just fine.
One URL can be set to post a ping back to a web server after every URL saved in Algolia. The web server will receive a POST with this information:
result=[success|error]
action=[update|delete]
url=the url inserted
last-modified=[the http header value]
source=algolia-crawler
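As a minimal sketch, a Node receiver could look like this, assuming the ping arrives as a form-encoded POST body (an assumption, not stated above):
const http = require('http');
const { parse } = require('querystring');

http.createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    // Fields posted by the crawler: result, action, url, last-modified, source
    const ping = parse(body);
    console.log(`${ping.source}: ${ping.action} ${ping.url} -> ${ping.result}`);
    res.end('ok');
  });
}).listen(8080);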
Indexing is done automatically, at each run. To tweak how indexing works, please see the index.settings configuration option.
MIT
Made with love in Montréal by Deux Huit Huit
Copyright (c) 2014-2019