ElasticSearch Indexer for Plone content
- Indexes full content in ElasticSearch on a field base
- uses serializers of
plone.restapi
to get the JSON for indexing - configuration of ElasticSearch via
zope.conf
(buildout) - flexible drop-in replacement proxy-index for the catalog (optional)
- default profile installs SearchableText drop-in (optional)
This addon needs ElasticSearch 6.2 with ingest-attachment plugin installed.
Install collective.es.index by adding it to your buildout:
[buildout] ... eggs = collective.es.index
also there, configure the connection to ElasticSearch:
[instance] ... zope-conf-additional = %import collective.es.index <elasticsearch> query 127.0.0.1:9200 ingest 127.0.0.1:9200 </elasticsearch>
and then running bin/buildout
.
To install the default drop-in proxy-index for SearchableText
,
go to the Site-Setup Add-Ons section and install ElasticSearch SearchableText Proxy Index
.
New content will be indexed in ElasticSearch.
To index existing content, a full Clear and Rebuild
is needed (via ZMI/portal_catalog
/Tab Advanced
).
This product can be used in Plone 4. but it requires the collective.indexing product. Just add this product to your Plone 4 buildout:
[buildout] ... eggs = collective.indexing
Current version of this package requires elasticsearch_dsl. It is necessary to add the 'elasticsearch-dsl' egg to the buildout eggs. Alternatively, it can be added to the eggs in the celery part. Run the buildout again to get that dependency on existing installations.
The elasticsearch directive supports the following keys:
- max_blobsize
- Max length of files to index, in bytes. If a file is larger than this size, it will not be indexed, and this will be logged. Default is zero, which means index everything.
- request_timeout
- The default connection timeout is 10 seconds. Using this key it can be set to any number of seconds.
- use_celery
- If true, indexing will be done in async celery tasks. This requires that celery is correctly configured.
- indexed_chars
- Maximum number of characters to extract from attachments. Default is 100000. Use -1 for infinite.
- search_fields
- The search fields and their weights for searching, separated by spaces.
Example:
zope-conf-additional = %import collective.es.index <elasticsearch> query 127.0.0.1:92000 ingest 127.0.0.1:92000 request_timeout 20 max_blobsize 10000000 # 10 MB indexed_chars 200000 use_celery true search_fields title^1.2 description^1.1 subjects^2 extracted_text </elasticsearch>
It is necessary to add this configuration to the buildout and rerun it whenever a change is made to these parameters.
NOTE: The configuration here uses collective.celery, so it has changed.
The collective.celery package requires adding the celery and collective.celery eggs to the mian buildout section eggs. Example:
eggs = celery Plone elasticsearch elasticsearch-dsl collective.es.index collective.celery
We still use the celery-broker part, for clarity. The celery part is still required, but is simpler:
[celery-broker] host = 127.0.0.1 port = 6379 [celery] recipe = zc.recipe.egg environment-vars = ${buildout:environment-vars} eggs = ${buildout:eggs} flower scripts = pcelery flower
The celery part depends on having some variables added to the main environment-vars section:
environment-vars = CELERY_BROKER_URL redis://${celery-broker:host}:${celery-broker:port} CELERY_RESULT_BACKEND redis://${celery-broker:host}:${celery-broker:port} CELERY_TASKS collective.es.index.tasks
To get the b64 attribute removal working on an existing elasticsearch install, it's necessary to clear the old ingest pipeline, so that collective.es.index can install the new one. To do this, you can use a Python prompt, like this:
>>> from elasticsearch import Elasticsearch >>> es = Elasticsearch() >>> es.ingest.delete_pipeline('attachment_ingest_plone_plone')
For every search result, a list of highlights from extracted text is saved as a dictionary in the current request annotations. The dictionary is keyed by object UID.
To get the annotations from Python code:
from collective.es.index.esproxyindex import HIGHLIGHT_KEY from zope.annotation.interfaces import IAnnotations annotations = IAnnotations(REQUEST) highlights = annotations[HIGHLIGHT_KEY] obj_highlights = highlights[OBJ_UID] highlight_text = '<br/>'.join(obj_highlights)
Highlights are just lists of HTML text fragments with the query term enclosed in <em> tags.
In addition to the elastic search index, this package includes support for faceted search, as implemented in the elasticsearch_dsl library. There is a @@faceted-search view, which will allow you to filter search results using facets.
Note that collective.es.index used a mapping that was incompatible with faceted search, wo it's necessary to completely remove the previous index from elastic search and reindex it again.
The quickest way to remove the index is from the command line:
>>> from elasticsearch import Elasticsearch >>> es = Elasticsearch() >>> es.indices.delete('plone_plone')
Once this is done, the full catalog must be reindexed from the ZMI.
By default, review_state, subjects, and modified fields are used as facets. The elastic search zope configuration supports changing them and adding custom facets. For regular keyword fields, just use the name of the field. For date fields, add an interval (month, week, day, hour). For integer fields, an integer interval is allowed:
zope-conf-additional = %import collective.es.index <elasticsearch> query 127.0.0.1:92000 facets department created,month subjects </elasticsearch>
The facets key expects one or more facets separated by spaces. In this example there is a custom facet (department), a date facet using monthly intervals, and a regular plone facet. Do not leave any spaces between the field and the interval for date and integer facets, or they will not be interpreted correctly.
Although elasticsearch_dsl supports month, week, day, and hour intervals, in practice, month is the best for plone, since the others result in a large number of options.
The sources are in a GIT DVCS with its main branches at github. There you can report issue too.
We'd be happy to see many forks and pull-requests to make this addon even better.
Maintainers are Jens Klein, Peter Holzer and the BlueDynamics Alliance developer team. We appreciate any contribution and if a release is needed to be done on pypi, please just contact one of us. We also offer commercial support if any training, coaching, integration or adaptions are needed.
Initial implementation was made possible by Evangelisch-reformierte Landeskirche des Kantons Zürich.
Idea and testing: Peter Holzer
Concept & initial code by Jens W. Klein
Authors:
- Enfold Systems
The project is licensed under the GPLv2.