technologyNutch
See simpleNutchSolrSetup for a sample setup of Nutch.
See setupZookeeperHadoopHbaseTomcatSolrNutch for an advanced setup.
The default configuration resides in conf/nutch-default.xml, but you shouldn't change that file. Rather, copy the relevant settings to conf/nutch-site.xml and change them there.
conf/regex-urlfilter.txt filters URLs based on regular expressions. Allow URLs with +<regex> and disallow them with -<regex>. Careful: the default configuration allows anything that isn't disallowed (+.).
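To make the +/- semantics concrete, here is a minimal Python sketch of how such a rule file could be evaluated. It assumes that the first matching rule decides and that a URL matching no rule is rejected, which is how the comments in the stock regex-urlfilter.txt describe it. The helper names (load_rules, is_allowed) and the example.org rule are made up for illustration; this is not Nutch code.

```python
import re

def load_rules(text):
    """Parse '+<regex>' / '-<regex>' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def is_allowed(url, rules):
    """The first rule whose regex is found in the URL decides (assumed semantics).
    A URL that matches no rule at all is rejected."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False

# Hypothetical rule set: one deny rule, then the catch-all "+." rule.
RULES = load_rules(r"""
# skip one site, allow the rest
-^https?://([a-z0-9-]+\.)*example\.org/
+.
""")

print(is_allowed("http://www.example.org/page", RULES))
print(is_allowed("http://example.com/page", RULES))
```

Under these assumptions the order of the rules matters: a deny rule placed after +. would never be reached.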
Since Nutch 2.x is only provided as a source distribution, configuration can be done either in nutchdir/conf or in nutchdir/runtime/local/conf. I'd recommend doing configuration in the former, because otherwise every recompile overwrites the settings. In turn, we have to recompile every time the configuration changes:
$ cd nutchdir/
$ ant runtime
There is a list of Nutch commands for the command line in the official Nutch wiki:
Sites we don't crawl:
- wikipedia
- myspace
- youtube
- last.fm
- reverbnation
- bandcamp
- bandzone.cz
- soundcloud
- tape.tv
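One way to turn the list above into deny rules for conf/regex-urlfilter.txt is sketched below. The names are taken exactly as listed; entries without a top-level domain are matched as a host label, so the real domains are not guessed. The pattern shape and the helper names (deny_rule, blocked) are assumptions for illustration, and the generated regexes are checked here with Python's re only, so they should be verified against Nutch before use.

```python
import re

# Names exactly as listed above.
SITES = ["wikipedia", "myspace", "youtube", "last.fm", "reverbnation",
         "bandcamp", "bandzone.cz", "soundcloud", "tape.tv"]

def deny_rule(site):
    """Build a '-<regex>' line that matches the site name in the host part of a URL."""
    return r"-^https?://([^/]*\.)?" + re.escape(site) + r"([./:]|$)"

rules = [deny_rule(site) for site in SITES]
print("\n".join(rules))

def blocked(url):
    """Check a URL against the generated rules (leading '-' stripped)."""
    return any(re.search(rule[1:], url) for rule in rules)
```

The printed lines would have to go before the catch-all +. rule, assuming the first matching rule decides.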
There is a nice tutorial at:
- batchId: When generating URLs to be fetched later, a batchId can be assigned to a batch of generated URLs. This allows you to first generate multiple batches of URLs, and then fetch them later one after another without having to wait for one big fetch to finish.
- crawlId: Identifier that describes a crawl. Might it be useful to just use timestamps to generate crawlIds?
- Nutch2Crawling: describes the nutch crawl jobs: generate, fetch, parse, updatedb.
- Understanding the columns/fields in Nutch 2.0 Webpage
- How to re-crawl with Nutch
- NewScoring (seems to be outdated, despite its name)
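Regarding the crawlId question above, a timestamp-based id could be generated as in the following sketch. The function name make_crawl_id, the "crawl" prefix, and the timestamp format are all assumptions. The id is restricted to letters, digits, and underscores on the assumption that it may end up in storage table names.

```python
from datetime import datetime, timezone

def make_crawl_id(prefix="crawl", now=None):
    """Return an id such as 'crawl_20240131T120000Z' built from a UTC timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalise to UTC so ids generated on different machines sort consistently.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}_{stamp}"

print(make_crawl_id())
```

Ids built this way sort chronologically as plain strings, which makes it easy to tell crawls apart later.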