technologyNutch
Lukas Schmelzeisen edited this page Aug 9, 2013 · 13 revisions
See simpleNutchSolrSetup for a sample setup of Nutch.
See setupZookeeperHadoopHbaseTomcatSolrNutch for an advanced setup.
-
batchId
When generating URLs to be fetched later, a batchId can be assigned to each batch of generated URLs. This lets you generate several batches of URLs first and then fetch them one after another, instead of waiting for one big fetch to finish.
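The idea can be sketched as a small, self-contained Python model. This is only a conceptual illustration and not Nutch's API: `generate_batches`, `fetch_batch`, the `batch-N` id scheme, and the stand-in fetch function are all hypothetical names chosen for this example.

```python
from itertools import islice


def generate_batches(urls, batch_size, make_batch_id):
    """Split URLs into batches and tag each batch with its own batchId."""
    batches = {}
    remaining = iter(urls)
    index = 0
    while True:
        chunk = list(islice(remaining, batch_size))
        if not chunk:
            break
        batches[make_batch_id(index)] = chunk
        index += 1
    return batches


def fetch_batch(batches, batch_id, fetch_url):
    """Fetch only the URLs that belong to the given batchId."""
    return {url: fetch_url(url) for url in batches[batch_id]}


# Generate all batches up front (hypothetical URLs and id scheme).
urls = [f"http://example.org/page{i}" for i in range(5)]
batches = generate_batches(urls, 2, lambda i: f"batch-{i}")

# Later, fetch one batch at a time; the lambda stands in for a real fetcher.
first = fetch_batch(batches, "batch-0", lambda url: "fetched")
```

Because each batch is addressed by its id, a later batch can be fetched, retried, or skipped without touching the others.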
-
crawlId
An identifier for a crawl. Open question: would it be useful to simply generate crawlIds from timestamps?
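As one possible answer to that question, a timestamp-based crawlId could be built as in the sketch below. The function name `make_crawl_id`, the `crawl` prefix, and the exact timestamp layout are assumptions made for this example; whether Nutch places restrictions on the characters in a crawlId has not been checked here, so the sketch sticks to letters, digits, and underscores.

```python
from datetime import datetime, timezone


def make_crawl_id(now=None, prefix="crawl"):
    """Build a crawlId from a UTC timestamp (hypothetical naming scheme)."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalize to UTC so ids sort chronologically regardless of local time zone.
    stamp = now.astimezone(timezone.utc)
    return f"{prefix}_{stamp:%Y%m%dT%H%M%SZ}"


# A fixed time is used here so the result is reproducible.
example_id = make_crawl_id(datetime(2013, 8, 9, 12, 0, 0, tzinfo=timezone.utc))
```

Ids built this way sort in chronological order, but two crawls started within the same second would collide, so a counter or random suffix may be needed if that can happen.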
- Nutch2Crawling: describes the Nutch crawl jobs: generate, fetch, parse, updatedb.
- Understanding the columns/fields in Nutch 2.0 Webpage
- How to re-crawl with Nutch
- NewScoring (Seems to be outdated, despite its name)