Search

The search functionality is provided by algolia and is known as algolia DocSearch. We are running docsearch v3.

Configuration

"mdanalysis" dashboard (application ID Y8HJT3NO22)
crawler configuration

Documentation

Hosted search

We are using the hosted search option where Algolia runs the docsearch-scraper.

~~docsearch v2 (legacy) docs — we are still using v2 but they are migrating v2 to v3~~
- FAQ v2
- update: we migrated to v3 in Novemeber 2021
docsearch v3 docs
- FAQ v3
- migrating from legacy
docsearch Discourse forum

specific issues

Experience and advice for dealing with indexing of code in software documentation?: indexing pre and code tags? faq Advice: don't but you could add selectors for valuable occurrences (but we cannot easily do it in sphinx-generated docs).

docsearch-scraper

One can run the scraper by oneself and then serve that index. That's also recommended for debugging. If we do this, here are links to get started:

repo https://github.com/algolia/docsearch-scraper

Relevant issues

For details, look through the issue comments

add search box #73
restrict DocSearch to relevant parts of the site #77
sitemapindex #79
update to v3 #211

Configuration

For v3, use the crawler interface https://crawler.algolia.com/

~~To change the configuration, make a PR against https://github.com/algolia/docsearch-configs/blob/master/configs/mdanalysis.json~~. The syntax is explained at https://docsearch.algolia.com/docs/config-file/

Selectors

In order for anything to be indexed it must match one of the CSS selectors

levels are mapped to heading tags
text is mapped to p, li, and similar tags
examine the produced documentation with the Firefox Web Developer Tool or similar to see which CSS elements apply to the content that should be indexed

Example selectors

selectors": {
    "lvl0": "[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1",
    "lvl1": "[itemprop='articleBody'] > .section h2, .page h2, .post h2, .body > .section h2",
    "lvl2": "[itemprop='articleBody'] > .section h3, .page h3, .post h3, .body > .section h3",
    "lvl3": "[itemprop='articleBody'] > .section h4, .page h4, .post h4, .body > .section h4",
    "lvl4": "[itemprop='articleBody'] > .section h5, .page h5, .post h5, .body > .section h5",
    "text": "[itemprop='articleBody'] > .section p, .page p, .post p, .body > .section p, [itemprop='articleBody'] > .section li, .page li, .post li, .body > .section li"
  },

mdanalysis.json

Snap shot of mdanalysis.json

{
  "index_name": "mdanalysis",
  "sitemap_urls": [
    "https://www.mdanalysis.org/sitemapindex.xml"
  ],
  "start_urls": [
    "https://docs.mdanalysis.org",
    "https://userguide.mdanalysis.org",
    "https://www.mdanalysis.org"
  ],
  "stop_urls": [
    "https://www.mdanalysis.org/.*?//.*?",
    "https://www.mdanalysis.org/blog",
    "https://www.mdanalysis.org/mdanalysis",
    "https://www.mdanalysis.org/docs",
    "https://docs.mdanalysis.org/stable/.*",
    "https://docs.mdanalysis.org/.*index.html$",
    "https://userguide.mdanalysis.org/stable/.*",
    "https://userguide.mdanalysis.org/.*-dev.*/.*",
    "https://www.mdanalysis.org/.*index.html$",
    "\\/_"
  ],
  "selectors": {
    "lvl0": "[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1",
    "lvl1": "[itemprop='articleBody'] > .section h2, .page h2, .post h2, .body > .section h2",
    "lvl2": "[itemprop='articleBody'] > .section h3, .page h3, .post h3, .body > .section h3",
    "lvl3": "[itemprop='articleBody'] > .section h4, .page h4, .post h4, .body > .section h4",
    "lvl4": "[itemprop='articleBody'] > .section h5, .page h5, .post h5, .body > .section h5",
    "text": "[itemprop='articleBody'] > .section p, .page p, .post p, .body > .section p, [itemprop='articleBody'] > .section li, .page li, .post li, .body > .section li, [itemprop='articleBody'] > .section dt, .body > .section dt"
  },
  "conversation_id": [
    "569445928"
  ],
  "nb_hits": 18529
}

Working with sitemaps

sitemap.org protocol definition (defines sitemap)
validators
- https://www.xml-sitemaps.com/validate-xml-sitemap.html

When making a PR

Please:

provide enough information so that others can review your pull request.
double check the dedicated documentation available here
try to implement the recommendations
please feature a sitemap, it will be the most complete source of truth for our crawling.
Allow edits from maintainer

Debugging search (v2)

Run a local version of the scraper that has index submission to algolia disabled (to avoid running in limits for the free plan). For example, install https://github.com/orbeckst/docsearch-scraper/tree/dryrun

Have the config file handy (e.g., by cloning https://github.com/algolia/docsearch-configs).

Run the scraper and check the output

./docsearch run ../docsearch-configs/configs/mdanalysis.json 2>&1 | tee RUN.log
less RUN.log

Example output

> DocSearch: https://www.mdanalysis.org 0 records)
> Ignored: from start url https://userguide.mdanalysis.org/stable/index.html
> Ignored: from start url https://docs.mdanalysis.org/stable/index.html
> DocSearch: https://www.mdanalysis.org/pages/privacy/ 12 records)
> DocSearch: https://www.mdanalysis.org/pages/used-by/ 30 records)
...
...
> DocSearch: https://www.mdanalysis.org/2015/12/15/The_benefit_of_social_coding/ 6 records)
> DocSearch: https://www.mdanalysis.org/distopia/search.html 0 records)
> Ignored from sitemap: https://www.mdanalysis.org/distopia/genindex.html
> Ignored from sitemap: https://www.mdanalysis.org/distopia/index.html
> DocSearch: https://www.mdanalysis.org/distopia/api/vector_triple.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/api/helper_functions.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/api/distopia.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/building_distopia.html 0 records)

Interpretation of results

lines with N records where N > 0: this is desired and shows that the scraper collected data records for the index
lines with 0 records: the rules do not seem to correctly catch elements on the page for scraping
Ignored: from start url: started scraping by following but then hit a stop_url
Ignored from sitemap: : started scraping from sitemap (which is good!) and then hit a stop_url
Missing pages (e.g., nothing on the User Guide): check the sitemap file!!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Search

Configuration

Documentation

Hosted search

specific issues

docsearch-scraper

Relevant issues

Configuration

Selectors

mdanalysis.json

Working with sitemaps

When making a PR

Debugging search (v2)

Example output

Interpretation of results

Clone this wiki locally