technologyNutch

See simpleNutchSolrSetup for a sample setup of Nutch.

See setupZookeeperHadoopHbaseTomcatSolrNutch for an advanced setup.

Configuration

The default configuration resides in conf/nutch-default.xml, but you shouldn't change that file. Instead, copy the relevant settings to conf/nutch-site.xml and override them there.
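
As a rough illustration of what such an override file looks like, the snippet below prints a minimal nutch-site.xml. http.agent.name is an actual Nutch property, but the value MetalconCrawler is only a placeholder, and the function name print_site_config is made up for this example.

```shell
#!/bin/sh
# Print a minimal nutch-site.xml that overrides a single setting
# from nutch-default.xml.
# Assumption: "MetalconCrawler" is a placeholder agent name, pick your own.
print_site_config() {
    cat <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MetalconCrawler</value>
  </property>
</configuration>
EOF
}

print_site_config
```

Only redirect the output to conf/nutch-site.xml if that file doesn't hold settings yet, since a redirect would overwrite them.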

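conf/regex-urlfilter.txt filters URLs based on regular expressions. Allow URLs with +<regex> and disallow them with -<regex>. Rules are checked from top to bottom and the first matching rule decides. Careful: the default configuration ends with +., which allows anything that isn't disallowed by an earlier rule.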
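
To get a feeling for how a rule file decides about a URL, here is a small sketch that imitates the filter in plain shell. It assumes that the first matching rule wins and that a URL matching no rule is rejected. The function name url_decision is made up, and grep -E only approximates the Java regular expressions Nutch really uses, so treat the result as a hint, not as proof.

```shell
#!/bin/sh
# url_decision RULES_FILE URL
# Prints "accept" or "reject" for URL, using the first rule that matches.
# Approximation only: Nutch evaluates Java regexes, grep -E uses POSIX ERE.
url_decision() {
    rules_file=$1
    url=$2
    while IFS= read -r rule || [ -n "$rule" ]; do
        case $rule in
            '+'*) verdict=accept ;;
            '-'*) verdict=reject ;;
            *) continue ;;            # blank lines, comments, anything else
        esac
        pattern=${rule#?}             # strip the leading + or -
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            echo "$verdict"
            return 0
        fi
    done < "$rules_file"
    echo reject                       # no rule matched
}

# Usage with a tiny rule file in the style of the default one
rules=$(mktemp)
cat > "$rules" <<'EOF'
# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):
# accept anything else
+.
EOF
url_decision "$rules" "http://example.com/"       # prints: accept
url_decision "$rules" "ftp://example.com/file"    # prints: reject
rm -f "$rules"
```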

Since Nutch 2.x is only provided as a source distribution, configuration can be done either in nutchdir/conf or in nutchdir/runtime/local/conf. I'd recommend the former, because otherwise every recompile overwrites the settings in runtime/local/conf. In turn, we have to recompile every time the configuration changes:

$ cd nutchdir/
$ ant runtime

Operating Nutch

There is a list of Nutch command-line commands in the official Nutch wiki:

http://wiki.apache.org/nutch/CommandLineOptions

Metalcon-specific crawling

Sites we don't crawl:

wikipedia
facebook
myspace
youtube
last.fm
reverbnation
bandcamp
bandzone.cz
soundcloud
tape.tv
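
One way to keep these sites out of a crawl would be exclusion rules in conf/regex-urlfilter.txt. The sketch below generates one -<regex> rule per domain. The full domain names are my assumption (the list above only gives site names), and EXCLUDED_DOMAINS and exclusion_rules are made-up names.

```shell
#!/bin/sh
# Assumption: these are the domains meant by the site names in the list.
EXCLUDED_DOMAINS="wikipedia.org facebook.com myspace.com youtube.com last.fm
reverbnation.com bandcamp.com bandzone.cz soundcloud.com tape.tv"

# exclusion_rules DOMAIN...
# Prints one "-<regex>" rule per domain that matches the domain itself
# and any of its subdomains, over http or https.
exclusion_rules() {
    for domain in "$@"; do
        # escape the dots so they only match a literal "."
        escaped=$(printf '%s\n' "$domain" | sed 's/\./\\./g')
        printf '%s\n' "-^https?://([a-zA-Z0-9-]+\\.)*${escaped}([:/]|\$)"
    done
}

# word splitting of the domain list is intended here
exclusion_rules $EXCLUDED_DOMAINS
```

The generated lines would have to go above the final +. rule of regex-urlfilter.txt, otherwise that rule accepts the URLs before the exclusions are reached.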

Tutorial

There is a nice tutorial at:

https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html

Glossary

batchId

When generating URLs to be fetched later, a batchId can be assigned to a batch of generated URLs. This allows you to first generate multiple batches of URLs, and then fetch them later one after another without having to wait for one big fetch to finish.

crawlId

Identifier that describes a crawl. Open question: would it be useful to simply generate crawlIds from timestamps?
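
To make the two identifiers more concrete, here is a sketch of how several batches could be generated first and then processed one after another. It is not a tested crawl script. The flags (-topN, -crawlId, -batchId and the batchId argument of fetch, parse and updatedb) are assumed to match the bin/crawl script of recent Nutch 2.x releases and differ between versions, so check the usage output of your bin/nutch first. The timestamp-based naming of CRAWL_ID and the helpers run and make_batch_id are made up for this example.

```shell
#!/bin/sh
# Sketch only, prints the commands by default (DRY_RUN=1).
# Assumption: flag names as in the Nutch 2.x bin/crawl script,
# verify them against your version before setting DRY_RUN=0.

NUTCH=${NUTCH:-bin/nutch}                             # path to the launcher
CRAWL_ID=${CRAWL_ID:-crawl_$(date +%Y%m%d_%H%M%S)}    # hypothetical naming scheme
DRY_RUN=${DRY_RUN:-1}                                 # 1 = only print commands

# Print a command and execute it unless we are in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != 1 ]; then "$@"; fi
}

# Timestamp plus counter, so ids from the same second stay distinct.
make_batch_id() {
    printf '%s-%s\n' "$(date +%s)" "$1"
}

# Step 1: generate three batches up front.
batch_ids=""
for i in 1 2 3; do
    id=$(make_batch_id "$i")
    run "$NUTCH" generate -topN 1000 -crawlId "$CRAWL_ID" -batchId "$id"
    batch_ids="$batch_ids $id"
done

# Step 2: fetch, parse and update the batches one after another.
for id in $batch_ids; do
    run "$NUTCH" fetch "$id" -crawlId "$CRAWL_ID"
    run "$NUTCH" parse "$id" -crawlId "$CRAWL_ID"
    run "$NUTCH" updatedb "$id" -crawlId "$CRAWL_ID"
done
```

Run it from nutchdir/runtime/local with DRY_RUN=0 once the printed commands look right for your Nutch version.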

Sources

Nutch2Crawling: describes the Nutch crawl jobs generate, fetch, parse and updatedb.
Understanding the columns/fields in Nutch 2.0 Webpage
How to re-crawl with Nutch
NewScoring (seems to be outdated, despite its name)
