Technology: Solr

See setupZookeeperHadoopHbaseTomcatSolrNutch for an advanced setup.

Documentation

Starting from release 4.4.0 there will be a Solr Reference Guide for every minor release.

Apache Solr 4.4.0 Reference Guide

Overview

Solr's basic unit of information is a document. Documents are composed of fields. Each field has a field type which defines what kind of data a field can contain. It also tells Solr how to interpret the field and how it can be queried. [Solr Ref Guide: Overview of Documents, Fields and Schema Design].

Configuration files

Tomcat must be restarted after every change to the configuration files.

solrconfig.xml

requestHandlers set up functionality that is accessible via HTTP. For the requestHandler "/select", default search parameters can be set and forced. searchComponents configure specific aspects of search.

schema.xml

This is the basic structure of a Solr schema.xml file [Solr Ref Guide: Putting the Pieces Together]:

<schema>
  <types>
  <fields>
  <uniqueKey>
  <defaultSearchField>
  <solrQueryParser defaultOperator>
  <copyField>
</schema>

A schema starts off by defining a set of custom types to be used for fields. Each type definition consists of an identifier and the name of the Solr class that implements the type. A type definition is completed by default attributes that are specific to the implementing class.

After that, the definitions of all fields of the schema follow. A field definition consists of a name, a type, and attributes specific to that field. For each field you can specify whether to index it (make its content searchable) and whether to store it (make its content retrievable).
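The indexed/stored distinction can be illustrated with a small script that reads a schema fragment and reports, per field, whether it is searchable and whether it is retrievable. This is only a sketch: the schema fragment, its field names, and the helper `summarize_fields` are made up for illustration, and the script only looks at attributes written explicitly on each field (it does not apply defaults inherited from the field type).

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; names and attributes are illustrative only.
SCHEMA = """
<schema name="example" version="1.5">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"/>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="content" type="text_general" indexed="true" stored="false"/>
    <field name="digest" type="string" indexed="false" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""


def summarize_fields(schema_xml):
    """Return {field name: (type, searchable, retrievable)} for each <field>.

    Only explicit indexed/stored attributes are considered; a missing
    attribute is reported as False here, which is a simplification.
    """
    root = ET.fromstring(schema_xml)
    summary = {}
    for field in root.iter("field"):
        summary[field.get("name")] = (
            field.get("type"),
            field.get("indexed") == "true",
            field.get("stored") == "true",
        )
    return summary


if __name__ == "__main__":
    for name, (ftype, searchable, retrievable) in summarize_fields(SCHEMA).items():
        print(f"{name}: type={ftype} searchable={searchable} retrievable={retrievable}")
```

In this fragment, `content` can be searched but not returned, while `digest` can be returned but not searched.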

Nutch's "Solr schema.xml"

Field	Type	Description
id	string	A unique ID for each document. This is a URL with a reversed domain, e.g. `com.example:http/about.html`.
digest	string	Nutch internal data used to remove duplicates.
boost	float	Nutch internal data used to calculate document score.
host	url	The host part of the URL of the document, e.g. `http://example.com/`.
url	url	The URL of the document, e.g. `http://example.com/about.html`.
content	text_general	The parsed contents of the document.
title	text_general	The title of the document.
cache	string	Probably Nutch internal.
tstamp	date	The timestamp at which the document was last fetched by Nutch.
text	text_general	Aggregation field of content, url, title to optimize searching.
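The reversed-domain format of the `id` field can be reproduced with a few lines of code. The following is a sketch that mimics the format shown in the table, not Nutch's own implementation; the function name `reversed_url_id` is hypothetical, and the handling of ports and query strings is an assumption.

```python
from urllib.parse import urlsplit


def reversed_url_id(url):
    """Build an id like 'com.example:http/about.html' from a URL.

    Assumption: an explicit port is appended after the scheme and a query
    string is kept verbatim. Nutch's real key format may differ in details.
    """
    parts = urlsplit(url)
    # Reverse the host labels: example.com -> com.example
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"
    key += parts.path
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    print(reversed_url_id("http://example.com/about.html"))
```

Reversing the host keeps documents from the same domain next to each other when keys are sorted.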

Query Syntax

Debug information can be appended to a query by adding &debugQuery=on to the request.

Solr can parse queries with several query parsers:

Dismax Query Parser: Features a query syntax similar to the one Google uses. It is relatively simple and highly fault tolerant, and it is designed so that user queries can be forwarded directly to Solr.
eDismax Query Parser: Like Dismax but with some extended functionality.
Standard Query Parser: A query syntax enabling more complicated and precise queries. However, user input would have to be escaped or parsed heavily before being forwarded to Solr.

Currently I favor the Dismax Query Parser because it is probably easier to implement. An implementation based on the Standard Query Parser would most likely end up trying to emulate Dismax's syntax without reaching its quality.
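To give an idea of the escaping the Standard Query Parser would require, here is a minimal sketch that backslash-escapes the characters with special meaning in the standard (Lucene) query syntax. The helper name `escape_query` is hypothetical, and escaping everything this way also disables all operators, which is exactly the trade-off described above.

```python
import re

# Characters and operators with special meaning in the standard query syntax:
# + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_query(user_input):
    """Prefix every special character or operator with a backslash."""
    return _SPECIAL.sub(r"\\\1", user_input)


if __name__ == "__main__":
    print(escape_query("title:(solr + nutch)"))
```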

Interesting/relevant parameters for search.

Parameter	Description
q	Search query term.
omitHeader	Omits the response header, which is probably not needed for production.
echoParams	Controls which parameters are echoed in the response; can probably be set to "none" for production.
defType	Query parser: standard (`lucene`), dismax, edismax.
fl	Which fields from Solr schema to return.
rows	Number of search results returned.
hl	Highlighting of query terms.
hl.fl	Which fields to highlight.
hl.simple.pre/post	Controls wrapping code around highlighted parts.
hl.snippets	Count of snippets.
hl.mergeContiguous	Merge contiguous snippets.
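Several of these parameters combine into a single request URL. The following is a minimal sketch, assuming Solr is reachable at `http://localhost:8080/solr/select` (host, port, and path are hypothetical) and that the `qf` setting and the chosen fields suit the Nutch schema. The function `build_search_url` is illustrative and only builds the URL; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: adjust host, port, and path to the actual Tomcat/Solr setup.
SOLR_SELECT_URL = "http://localhost:8080/solr/select"


def build_search_url(user_query, rows=10):
    """Build a /select URL that forwards the user query to the Dismax parser."""
    params = {
        "q": user_query,
        "defType": "dismax",
        "qf": "text",              # assumption: search the aggregated text field
        "fl": "id,url,title",      # return only the fields we display
        "rows": rows,
        "echoParams": "none",
        "hl": "true",
        "hl.fl": "content",
        "hl.simple.pre": "<em>",
        "hl.simple.post": "</em>",
        "hl.snippets": 3,
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_search_url("solr nutch", rows=5)
    print(url)
    # Decode the query string again to check what Solr would receive.
    print(parse_qs(urlsplit(url).query)["q"])
```

`urlencode` takes care of percent-encoding, so the raw user query can be passed in unchanged.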

Reindexing

After some operations, the documentation advises you to "reindex" your data. Reindexing is not a special Solr feature. What is meant is to restart Solr, clear all data, and feed all data into Solr again as you did before. More information: http://wiki.apache.org/solr/HowToReindex
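The "clear all data" and "refeed" steps can be expressed as XML update messages posted to Solr's update handler. The following is a minimal sketch, assuming the handler lives at `http://localhost:8080/solr/update` (hypothetical host, port, and path); the helper names and the sample document are made up. The code only prepares the requests and does not send them.

```python
from urllib.request import Request
from xml.sax.saxutils import escape, quoteattr

# Assumption: adjust to the actual setup. commit=true makes changes visible.
SOLR_UPDATE_URL = "http://localhost:8080/solr/update?commit=true"


def _xml_post(body):
    """Wrap an XML update message in a POST request (not sent here)."""
    return Request(
        SOLR_UPDATE_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def clear_index_request():
    """Delete every document by matching all documents with *:*."""
    return _xml_post("<delete><query>*:*</query></delete>")


def add_documents_request(docs):
    """Build an <add> message from a list of {field name: value} dicts."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(f"<field name={quoteattr(name)}>{escape(str(value))}</field>")
        parts.append("</doc>")
    parts.append("</add>")
    return _xml_post("".join(parts))


if __name__ == "__main__":
    print(clear_index_request().data.decode("utf-8"))
    sample = [{"id": "com.example:http/about.html", "title": "About"}]
    print(add_documents_request(sample).data.decode("utf-8"))
```

Passing one of these requests to `urllib.request.urlopen` would execute it against a running Solr instance.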

Open Questions

Solr supports a feature called MoreLikeThis, which can return all documents similar to a given one. This is different from fuzzy searching. Note: I don't know where we would use this yet, but it seems interesting.

It is not possible to truncate the content of fields returned from Solr. This might turn out to be a huge bottleneck, since Nutch puts the whole content of pages into the Solr index, so the whole content of each page is returned for every search result. A possible solution is to mark such fields in the Solr schema as index-only and make them non-readable. But this would require us to use our own datastore and do all highlighting ourselves.

What we need to find out / research

How does SolrCloud / clustering work? We want to do this with an existing ZooKeeper instance. We do not want Solr managing our ZooKeeper.

How does stemming work in Solr? More precisely, how can we use these configuration files?
- stopwords.txt
- protwords.txt
- synonyms.txt

How does highlighting on stemmed words work? For example, if "bands" stems to "band" and we query for "band", what is highlighted? http://wiki.apache.org/solr/HighlightingParameters

Solr currently doesn't highlight inside words. For example, if we query for "kill", the word "killzone" will be matched, but it won't be highlighted, not even in part.
It is probably worth looking at <searchComponent name="highlight" /> in solrconfig.xml.

How does Spellchecking / Query Suggesting work? http://wiki.apache.org/solr/SpellCheckComponent

How can we insert data into Solr?

How do we efficiently rank our search results from Solr?

Rank by document score.
Rank by query matching.
Score from Nutch or from Solr?

To-do before release

Check all settings in solrconfig.xml.
- Lucene optimization settings in <indexConfig/>.
- Query cache optimization settings in <query/>.

Links

http://www.solrtutorial.com: Seems like a good Solr tutorial for people with zero prior knowledge.
