Commit
add stemmer_override token filter docs opensearch-project#8445
Signed-off-by: Anton Rubin <[email protected]>
AntonEliatra committed Oct 2, 2024
1 parent 76486a4 commit 93c4c41
Showing 2 changed files with 138 additions and 1 deletion.
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -53,7 +53,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache
`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
`snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
[`stemmer_override`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer_override/) | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
`synonym` | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
`synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process.
137 changes: 137 additions & 0 deletions _analyzers/token-filters/stemmer-override.md
@@ -0,0 +1,137 @@
---
layout: default
title: Stemmer override
parent: Token filters
nav_order: 400
---

# Stemmer override token filter

The `stemmer_override` token filter allows you to define custom stemming rules that override the behavior of default stemmers like Porter or Snowball. This is useful when you want to apply specific stemming behavior to certain words that might not be handled correctly by the standard stemming algorithms.

## Parameters

The `stemmer_override` token filter must be configured with exactly one of the following parameters:

- `rules`: Defines the override rules inline in the filter settings.
- `rules_path`: Specifies the path to a file containing custom rules (mappings). The path can be either absolute or relative to the OpenSearch `config` directory.
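
If you maintain many mappings, you can store them in a file and reference it using `rules_path`. Each line of the file contains one rule in the same `source => target` format used by the `rules` parameter. The index name and file path in the following request are illustrative:

```json
PUT /my-override-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer_override_filter": {
          "type": "stemmer_override",
          "rules_path": "analysis/stemmer_override_rules.txt"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}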

## Example

The following example request creates a new index named `my-index` and configures an analyzer with a `stemmer_override` filter:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer_override_filter": {
          "type": "stemmer_override",
          "rules": [
            "running, runner => run",
            "bought => buy",
            "best => good"
          ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer_override_filter"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /my-index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am a runner and bought the best shoes"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "am",
      "start_offset": 2,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "run",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "and",
      "start_offset": 14,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "buy",
      "start_offset": 18,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "the",
      "start_offset": 25,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "good",
      "start_offset": 29,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "shoes",
      "start_offset": 34,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 8
    }
  ]
}
```
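
Conceptually, the filter checks each token against the override map before any stemmer runs: a mapped token is replaced with its target and bypasses the stemmer, while unmapped tokens are stemmed normally. The following Python sketch illustrates this behavior. It is a simplified conceptual illustration, not the actual Lucene implementation, and `naive_stem` is a made-up stand-in for a real algorithmic stemmer such as Porter:

```python
# Override rules equivalent to the "rules" setting in the example above.
overrides = {
    "running": "run",
    "runner": "run",
    "bought": "buy",
    "best": "good",
}

def naive_stem(token):
    # Stand-in for an algorithmic stemmer: strip a few common suffixes.
    for suffix in ("ing", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = text.lower().split()
    # Tokens with an explicit override are emitted as-is and skip the stemmer.
    return [overrides.get(t, naive_stem(t)) for t in tokens]

print(analyze("I am a runner and bought the best shoes"))
```

Note that `shoes` still passes through the stand-in stemmer because it has no override rule, while `runner`, `bought`, and `best` are rewritten directly from the map.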
