This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

Tech Stack Research

Norwin edited this page May 11, 2018 · 7 revisions

Translation APIs

Search APIs

This is important to get right, as the initial search defines the result set of the crawl. Search could become costly. Expected search volume: 20 queries per crawl request and language.
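
To get a feel for how quickly that adds up, the estimate above can be turned into a small calculation. This is only a sketch: the function name and the example numbers of crawl requests and languages are hypothetical, and only the factor of 20 comes from the note above.

```python
# Figure taken from the note above: 20 queries per crawl request and language.
QUERIES_PER_REQUEST_AND_LANGUAGE = 20


def estimated_search_queries(crawl_requests: int, languages: int) -> int:
    """Return the expected number of search API queries (hypothetical helper)."""
    if crawl_requests < 0 or languages < 0:
        raise ValueError("crawl_requests and languages must be non-negative")
    return crawl_requests * languages * QUERIES_PER_REQUEST_AND_LANGUAGE


if __name__ == "__main__":
    # Example input (made up): 10 crawl requests, each in 3 languages.
    print(estimated_search_queries(crawl_requests=10, languages=3))  # 600
```

Multiplying the result by a provider's per-query price gives a first cost estimate for comparing search APIs.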

Crawling

  • Apache Nutch

    • 👍 feature-complete web crawling application
    • 👎 operates in batch mode, slow
    • 👎 old, community is rather inactive
  • Storm Crawler

    • web crawling SDK based on Apache Storm
    • 👍 stream-based, very efficient
    • 👎 not as feature-complete, more work required, but probably friendlier

Indexing

  • Elasticsearch
  • Solr

Both work well; we have more know-how with Elasticsearch.

Content Analysis

2-step process of finding a dataset:

  1. score pages via method “X” for tags:
    • Dataset: “probability of a dataset in this page”
    • Link: “probability of a link to a dataset in this page”
    • Portal: “probability of a data portal in this page”
  2. filter scored pages for keywords
    • on demand via index, not at crawl time
      • REVIEW: → No specific keywords used at crawl time? → Low latency (streaming) not required?
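
The two steps above could be sketched roughly as follows. Everything here is an assumption for illustration: method “X” is not yet chosen, so `placeholder_scorer` is a crude cue-word stand-in and not a real classifier, the `ScoredPage` type and function names are hypothetical, and a plain in-memory list stands in for the index.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ScoredPage:
    url: str
    text: str
    scores: Dict[str, float]  # tag ("dataset", "link", "portal") -> probability


def placeholder_scorer(text: str) -> Dict[str, float]:
    """Stand-in for method "X": crude cue-word checks, NOT a real classifier."""
    lowered = text.lower()
    return {
        "dataset": 1.0 if "download csv" in lowered else 0.0,
        "link": 1.0 if "href" in lowered else 0.0,
        "portal": 1.0 if "data portal" in lowered else 0.0,
    }


def score_page(
    url: str,
    text: str,
    scorer: Callable[[str], Dict[str, float]] = placeholder_scorer,
) -> ScoredPage:
    """Step 1 (crawl time): attach tag scores to a page; no keywords involved."""
    return ScoredPage(url=url, text=text, scores=scorer(text))


def filter_pages(
    pages: Iterable[ScoredPage],
    keywords: Iterable[str],
    tag: str = "dataset",
    min_score: float = 0.5,
) -> List[ScoredPage]:
    """Step 2 (on demand): keep pages above min_score that mention any keyword."""
    wanted = [k.lower() for k in keywords]
    return [
        page
        for page in pages
        if page.scores.get(tag, 0.0) >= min_score
        and any(k in page.text.lower() for k in wanted)
    ]


if __name__ == "__main__":
    pages = [
        score_page("https://example.org/a", "Download CSV of air quality data"),
        score_page("https://example.org/b", "Blog post about air quality"),
    ]
    for hit in filter_pages(pages, ["air quality"]):
        print(hit.url)
```

Because scoring does not depend on the keywords, it can run once at crawl time, while the keyword filter runs later against the stored results.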

UI / Result Presentation

  • Vue.js 2?
  • Views:
    • "Launch Crawl"
    • "View Crawls (completed/in progress)"
    • "Search Results"

Deployment

  • all dockerized
  • orchestration with Docker Compose for now?