Tech Stack Research

Translation APIs

EU eTranslation Service
- free, but for EU institutions only (TODO: confirm exact terms)
Azure Text Translator
- free up to 2 million characters, which should be more than enough
Yandex.Translate
- $15 per 1 million characters
Google Translation
- $20 per 1 million characters
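
To compare the paid options against the Azure free allowance, a rough estimate like the following may help. It is only a sketch: the class and function names are hypothetical, the prices are the list prices noted above, and the example volume (24 languages, 20 queries of about 50 characters each) is an assumption, not a measured figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationPricing:
    """List price of a paid translation API (figures taken from the notes above)."""
    name: str
    usd_per_million_chars: float


YANDEX = TranslationPricing("Yandex.Translate", 15.0)
GOOGLE = TranslationPricing("Google Translation", 20.0)

# Free allowance noted above for Azure Text Translator.
AZURE_FREE_CHARS = 2_000_000


def translation_cost(pricing: TranslationPricing, chars: int) -> float:
    """Cost in USD for translating `chars` characters at the list price."""
    return chars / 1_000_000 * pricing.usd_per_million_chars


def fits_azure_free_tier(chars: int) -> bool:
    """True if the volume stays within the Azure free allowance."""
    return chars <= AZURE_FREE_CHARS


# Assumed volume per crawl request: 24 languages x 20 queries x ~50 characters.
chars_per_crawl = 24 * 20 * 50
for pricing in (YANDEX, GOOGLE):
    print(pricing.name, translation_cost(pricing, chars_per_crawl))
print("within Azure free allowance:", fits_azure_free_tier(chars_per_crawl))
```

Under these assumptions a single crawl request translates only a few thousand characters, so translation cost is unlikely to be the deciding factor.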

Search APIs

This is important to get right, as the initial search defines the result set of the crawl. Search could become costly: the expected search volume is 20 queries per crawl request and language.

Google Custom Search
- 100 requests / day free, then $5 / 1000 requests
- 👍 allows result localization, emphasis
Azure Bing Search
- €2.53 / 1000 requests
- 👍 allows result localization, emphasis
FAROO
- free
- 👎 bad results?
ChatNoir
- free? API key on request
- based on CommonCrawl data
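
The expected search volume can be turned into a rough cost comparison for the two paid options. The sketch below uses the prices listed above and assumes that each query costs exactly one API request (no paging through further result pages); the function names and the example of 24 languages are hypothetical.

```python
QUERIES_PER_LANGUAGE = 20  # expected volume per crawl request and language


def requests_per_crawl(languages: int, queries_per_language: int = QUERIES_PER_LANGUAGE) -> int:
    """Number of search requests for one crawl request (assumes 1 request per query)."""
    return languages * queries_per_language


def google_daily_cost_usd(requests_per_day: int) -> float:
    """Google Custom Search: 100 requests / day free, then $5 / 1000 requests."""
    billable = max(0, requests_per_day - 100)
    return billable / 1000 * 5.0


def bing_cost_eur(requests: int) -> float:
    """Azure Bing Search: EUR 2.53 / 1000 requests, no free allowance assumed."""
    return requests / 1000 * 2.53


# Example: one crawl request per day covering 24 languages (assumed figure).
n = requests_per_crawl(languages=24)
print("requests:", n)
print("Google, USD per day:", round(google_daily_cost_usd(n), 2))
print("Bing, EUR per crawl:", round(bing_cost_eur(n), 2))
```

Note that the two results are in different currencies and that the Google free allowance applies per day, so the comparison shifts as the number of crawl requests per day grows.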

Crawling

Apache Nutch
- 👍 feature-complete web crawling application
- 👎 operates in batch mode, slow
- 👎 old, community rather inactive
StormCrawler
- web crawling SDK based on Apache Storm
- 👍 stream-based, very efficient
- 👎 not as feature-complete, so more work is required, but probably friendlier

Indexing

Elasticsearch
Solr

Both work well, but there is more existing know-how with Elasticsearch.

Content Analysis

Two-step process for finding a dataset:

1. Score pages via method “X” for tags:
   - Dataset: “probability of a dataset in this page”
   - Link: “probability of a link to a dataset in this page”
   - Portal: “probability of a data portal in this page”
   - …
2. Filter scored pages for keywords:
   - on demand via index, not at crawl time
     - REVIEW: → No specific keywords used at crawl time? → Low latency (streaming) not required?
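
The separation of the two steps could look roughly like the sketch below. Method “X” is not decided yet, so `score_page` uses a trivial marker-matching heuristic purely as a stand-in; the marker lists, names, and threshold are all hypothetical. In the real setup the keyword filter would run as a query against the index rather than over an in-memory list.

```python
from dataclasses import dataclass, field

# Stand-in for method "X": marker strings per tag (purely illustrative).
TAG_MARKERS = {
    "dataset": ("dataset", "download", "csv"),
    "link": ("href", "dataset"),
    "portal": ("data portal", "catalog"),
}


@dataclass
class ScoredPage:
    url: str
    text: str
    scores: dict[str, float] = field(default_factory=dict)


def score_page(url: str, text: str) -> ScoredPage:
    """Step 1 (crawl time): attach a score in [0, 1] per tag, independent of keywords."""
    lowered = text.lower()
    scores = {
        tag: sum(marker in lowered for marker in markers) / len(markers)
        for tag, markers in TAG_MARKERS.items()
    }
    return ScoredPage(url=url, text=text, scores=scores)


def filter_pages(pages, keywords, tag="dataset", min_score=0.5):
    """Step 2 (on demand): keep pages above the tag threshold that mention any keyword."""
    wanted = [k.lower() for k in keywords]
    return [
        page
        for page in pages
        if page.scores.get(tag, 0.0) >= min_score
        and any(k in page.text.lower() for k in wanted)
    ]


pages = [
    score_page("https://example.org/a", "Download the dataset as CSV"),
    score_page("https://example.org/b", "Welcome to our blog"),
]
hits = filter_pages(pages, keywords=["csv"])
print([page.url for page in hits])
```

Because scoring does not depend on the keywords, pages only need to be scored once at crawl time, and different keyword filters can be applied later without re-crawling.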

UI / Result Presentation

Vue.js 2?
Views:
- "Launch Crawl"
- "View Crawls (completed/in progress)"
- "Search Results"

Deployment

all dockerized
orchestration with Docker Compose for now?
