
User Guide

Concept

The application allows searching the web for data sets on demand, based on keyword queries.

Queries are organized into crawl jobs ("Crawls"), each defining initial search terms, countries & languages, and further options. Once a Crawl is started, resulting web pages ("Results") can be explored in the Search Results UI as they are discovered.

Starting a Crawl

A Crawl can be started through the New Crawl UI. Crawls should be focussed on specific topics to allow for better filtering of the results.

  1. definition of keywords

    • Each keyword group represents a search query as known from a normal search engine.
    • Multiple keyword groups can be defined to broaden the search space of a crawl.
    • Each keyword group can be excluded from translation.
    • Keywords in the common keywords field are appended to all keyword groups (see the example after this list).
  2. selection of languages

    • Keywords can be translated into multiple languages.
    • To allow a geographically focussed crawl, languages can be selected by country.
    • Only languages that are supported by the translation engine are selectable.
  3. crawl options

    • crawl depth: The crawler fetches web pages and follows most links they contain. This parameter defines how many links deep the crawler follows from the initial pages; e.g. with a depth of 2, pages up to two links away from an initial page are still fetched, but their links are not followed any further.
    • seed urls per keyword group: The crawl is started with initial pages by querying a search engine with the keyword groups. This parameter defines how many results from the search engine are used as initial pages.
    • termination conditions: The crawler automatically stops fetching more pages once one of these conditions is met.
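
As a made-up illustration of how keyword groups combine: a crawl looking for hydrology data could define two keyword groups, "river discharge measurements" and "water level time series", with "dataset download" in the common keywords field. The effective search queries are then "river discharge measurements dataset download" and "water level time series dataset download", each translated into and issued for every selected language.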

Considerations

  • With the free tier of Google search, only 100 search queries are allowed per day. The tool submits one query per keyword group per language per 10 seed URLs, and warns when a crawl would submit a high number of queries.
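    For example (hypothetical numbers): a crawl with 3 keyword groups, 4 languages and 20 seed URLs per keyword group submits 3 × 4 × (20 / 10) = 24 queries, so about four crawls of that size fit into the free daily quota.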

Viewing Results

Fetched pages are indexed with metadata and extracted content. They are shown in the Search Results UI, sorted by their assessed relevance, considering

  • their manual & automatic classification,
  • their extracted content types (see Tags).

Each result has action buttons:

  • manual classification (positive, slightly-positive, negative): see Classifying results manually
  • open page in new tab
  • open translated page (using Google Translate)

Clicking on a result shows details with the extracted content, if any:

  • data source type if any
  • direct links to data
  • contact & licensing information

Filtering Results

Results can be filtered in multiple ways, e.g. to

  • search by keyword
  • filter by page language
  • filter by host
  • filter by automatic or manual classification (confidence)
  • ...

Search bar

Here you can enter keywords, which are matched against all indexed fields. The search field supports Lucene syntax, which allows querying specific metadata fields. The help (question mark button) shows which fields are available and how to query them.
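
For illustration, a Lucene-style query could look like language:de AND "water level", which restricts the keyword search to results whose language field is German; the actual field names are the ones listed in the help dialog.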

Advanced search

Clicking the caret reveals additional controls to further filter the results:

  • filter by Crawl: show only Results of the selected Crawls
  • filter by language: show only Results in the languages originally selected for the Crawl

Classifying Results manually

The crawler classifies results automatically using a machine learning algorithm. This currently only works for English pages, as training data has to be collected separately for each language.

To collect training data for other languages, you can manually label the results through the action buttons. It is important that the same criteria are applied to all classifications; for the existing classifier the following criteria were applied:

  • label "data" 👍: pages containing datasets or direct links to data access.
  • label "related": pages not containing data, but on the same site were some data is provided.
  • label "unrelated" 👎: everything else.

Manual classification also has the benefit of tuning the relevance of results, hiding irrelevant pages from the list.

There is not yet a guide on how to train a new model with the manually labeled results, but you can check out this wiki page or this README.

Assessing crawl performance

In the Crawler Metrics UI, various metrics are visualized. Three panels show

  • distributions of Result metadata values per Crawl; useful to analyse the outcome of different keywords
  • metrics for assessing classifier performance by comparing manual to automatic classification (see the example below)
  • metrics for assessing crawler throughput; mainly useful for debugging.
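
As a made-up illustration of the manual-vs-automatic comparison (the exact metrics shown may differ): if the classifier marked 50 pages as "data" and 40 of those were also labeled "data" manually, the precision for that label would be 40 / 50 = 0.8. Such comparisons only become meaningful once a sufficient number of results has been classified manually.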