Lukas Schmelzeisen edited this page Aug 13, 2013 · 15 revisions

See setupZookeeperHadoopHbaseTomcatSolrNutch for an advanced setup.

Documentation

Starting with release 4.4.0, there will be a Solr Reference Guide for every minor release. We are currently using 4.3.1, so the guide does not match our version exactly, but it should be helpful anyway.

Overview

Solr's basic unit of information is a document. Documents are composed of fields. Each field has a field type, which defines what kind of data the field can contain. The field type also tells Solr how to interpret the field and how it can be queried [Solr Ref Guide: Overview of Documents, Fields and Schema Design].

schema.xml

This is the basic structure of a Solr schema.xml file [Solr Ref Guide: Putting the Pieces Together]:

<schema>
  <types>
  <fields>
  <uniqueKey>
  <defaultSearchField>
  <solrQueryParser defaultOperator>
  <copyField>
</schema>

A schema starts off by defining a set of custom types that fields can use. Each type definition consists of an identifier and the name of the Solr class that implements the type. The definition is completed by default attributes that are specific to the implementing class.

After that, the definitions of all fields of the schema follow. A field definition consists of a name, a type, and attributes specific to that field. For each field you can specify whether you want to index it (its content is checked when searching) and whether you want to store it (its content can be retrieved).
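The relationship between types, fields, and the indexed/stored attributes can be sketched with a small script. The schema fragment below is a hypothetical, abbreviated example (analyzer definitions are omitted and the field choices are assumptions, not copied from Nutch's schema.xml). The helper `describe_fields` is likewise only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal schema fragment for illustration only.
# It is NOT Nutch's schema.xml; analyzers are left out for brevity.
SCHEMA_XML = """
<schema name="example" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"/>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="content" type="text_general" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
""".strip()


def describe_fields(schema_xml):
    """Return a (name, type, indexed, stored) tuple for every <field> element."""
    root = ET.fromstring(schema_xml)
    fields = []
    for field in root.iter("field"):
        fields.append((
            field.get("name"),
            field.get("type"),
            field.get("indexed") == "true",  # searchable?
            field.get("stored") == "true",   # retrievable?
        ))
    return fields


for name, ftype, indexed, stored in describe_fields(SCHEMA_XML):
    print(name, ftype,
          "searchable" if indexed else "not searchable",
          "retrievable" if stored else "not retrievable")
```

In this sketch, `content` can be searched but its text would not be returned in results, because it is indexed without being stored.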

Nutch's "Solr schema.xml"

| Field | Type | Description |
| --- | --- | --- |
| id | string | A unique ID for each document. This is a URL with a reversed domain, e.g. com.example:http/about.html. |
| digest | string | Nutch internal data used to remove duplicates. |
| boost | float | Nutch internal data used to calculate document score. |
| host | url | The host part of the URL of the document, e.g. http://example.com/. |
| url | url | The URL of the document, e.g. http://example.com/about.html. |
| content | text_general | The parsed contents of the document. |
| title | text_general | The title of the document. |
| cache | string | Probably Nutch internal. |
| tstamp | date | The timestamp of when the document was last fetched by Nutch. |
| text | text_general | Aggregation field of content, url, and title to optimize searching. |

Query Syntax

Debug information can be appended to a query by adding &debugQuery=on to the request.
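As a minimal sketch, a request URL with debug output enabled could be built like this. The base URL (a local Solr on port 8983 with a `/solr/select` handler) and the `wt=json` response format are assumptions about the setup. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed location of the Solr select handler; adjust to the actual deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_debug_query_url(query, base_url=SOLR_SELECT_URL):
    """Build a select URL that asks Solr to append debug information."""
    params = {
        "q": query,
        "wt": "json",         # response format, chosen here for readability
        "debugQuery": "on",   # the parameter described above
    }
    return base_url + "?" + urlencode(params)


print(build_debug_query_url("title:solr"))
```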

Solr can parse queries with several query parsers:

  • Dismax Query Parser: Features a query syntax similar to the one Google uses. It is relatively simple and highly fault tolerant, and it is designed so that user queries can be forwarded directly to Solr.
  • Standard Query Parser: A query syntax that enables more complicated and precise queries. However, user input would have to be escaped or parsed heavily before it could be forwarded to Solr.

Currently I favor the Dismax Query Parser because it is probably easier to implement. An implementation based on the Standard Query Parser would most likely end up trying to emulate Dismax's syntax without reaching its quality.
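The difference in effort can be sketched as follows. With Dismax the raw user input is passed through, while for the Standard Query Parser the characters that carry syntactic meaning have to be escaped first. The function names are hypothetical, and the field choices (`title content` for `qf`, `text` for `df`) are assumptions based on the Nutch schema fields listed above. This is a rough illustration, not a complete sanitizer.

```python
import re

# Characters and operators with special meaning in the standard
# (Lucene) query syntax: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_for_standard_parser(user_input):
    """Prefix every special character or operator with a backslash."""
    return _SPECIAL.sub(r"\\\1", user_input)


def dismax_params(user_input):
    """Request parameters for Dismax: the user input is forwarded unchanged."""
    return {"defType": "dismax", "q": user_input, "qf": "title content"}


def standard_params(user_input):
    """Request parameters for the standard parser: input must be escaped."""
    return {"defType": "lucene", "q": escape_for_standard_parser(user_input), "df": "text"}


user_input = "C++ (intro)"
print(dismax_params(user_input)["q"])
print(standard_params(user_input)["q"])
```

Escaping like this only makes the input safe to forward. It does not give users any query features, which is why emulating a Dismax-like syntax on top of the standard parser would take considerably more work.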

Open Questions

Solr supports a feature called MoreLikeThis, which can return documents similar to a given one. This is different from fuzzy searching. Note: I don't know yet where we would use this query, but it seems interesting.
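For experimenting, a MoreLikeThis request could look roughly like the sketch below. It uses the `mlt`, `mlt.fl`, and `mlt.count` request parameters. The base URL, the function name, and the choice of `title` and `content` as similarity fields are assumptions, and whether those fields are suitable depends on how they are indexed and stored in the actual schema.

```python
from urllib.parse import urlencode


def build_more_like_this_url(doc_id, count=5,
                             base_url="http://localhost:8983/solr/select"):
    """Build a select URL that also asks for documents similar to doc_id."""
    # Quote the id as a phrase so that ':' and '/' are not parsed as syntax.
    quoted_id = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": 'id:"{}"'.format(quoted_id),
        "mlt": "true",              # enable MoreLikeThis for this request
        "mlt.fl": "title,content",  # assumed fields to compare on
        "mlt.count": count,         # number of similar documents per result
    }
    return base_url + "?" + urlencode(params)


print(build_more_like_this_url("com.example:http/about.html"))
```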
