Commit
add stemmer_override token filter docs opensearch-project#8445
Signed-off-by: Anton Rubin <[email protected]>
AntonEliatra committed Oct 2, 2024
1 parent 76486a4 commit 93c4c41
Showing 2 changed files with 138 additions and 1 deletion.
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -53,7 +53,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache
`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
`snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
`stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`.
`stemmer_override` | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
[`stemmer_override`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/stemmer_override/) | N/A | Overrides stemming algorithms by applying a custom mapping so that the provided terms are not stemmed.
`stop` | [StopFilter](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/analysis/StopFilter.html) | Removes stop words from a token stream.
`synonym` | N/A | Supplies a synonym list for the analysis process. The synonym list is provided using a configuration file.
`synonym_graph` | N/A | Supplies a synonym list, including multiword synonyms, for the analysis process.
137 changes: 137 additions & 0 deletions _analyzers/token-filters/stemmer-override.md
@@ -0,0 +1,137 @@
---
layout: default
title: Stemmer override
parent: Token filters
nav_order: 400
---

# Stemmer override token filter

The `stemmer_override` token filter allows you to define custom stemming rules that override the behavior of default stemmers like Porter or Snowball. This is useful when you want to apply specific stemming behavior to certain words that might not be handled correctly by the standard stemming algorithms.

## Parameters

The `stemmer_override` token filter must be configured with exactly one of the following parameters:

- `rules`: Defines the override rules inline in the filter settings.
- `rules_path`: Specifies the path to a file containing custom rules (mappings). The path can be either absolute or relative to the OpenSearch `config` directory.
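
If you maintain many mappings, you can store them in a file and reference it using `rules_path`. Each line of the file contains one rule in the same `source => target` format used by the `rules` parameter. The index name and file path in the following request are illustrative:

```json
PUT /my-override-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer_override_filter": {
          "type": "stemmer_override",
          "rules_path": "analysis/stemmer_override_rules.txt"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}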

## Example

The following example request creates a new index named `my-index` and configures an analyzer with a `stemmer_override` filter:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stemmer_override_filter": {
          "type": "stemmer_override",
          "rules": [
            "running, runner => run",
            "bought => buy",
            "best => good"
          ]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer_override_filter"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
GET /my-index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am a runner and bought the best shoes"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "am",
      "start_offset": 2,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "run",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "and",
      "start_offset": 14,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "buy",
      "start_offset": 18,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "the",
      "start_offset": 25,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "good",
      "start_offset": 29,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "shoes",
      "start_offset": 34,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 8
    }
  ]
}
```
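
Conceptually, the filter checks each token against the override map before any stemmer runs: a mapped token is replaced with its target and bypasses the stemmer, while unmapped tokens are stemmed normally. The following Python sketch illustrates this behavior. It is a simplified conceptual illustration, not the actual Lucene implementation, and `naive_stem` is a made-up stand-in for a real algorithmic stemmer such as Porter:

```python
# Override rules equivalent to the "rules" setting in the example above.
overrides = {
    "running": "run",
    "runner": "run",
    "bought": "buy",
    "best": "good",
}

def naive_stem(token):
    # Stand-in for an algorithmic stemmer: strip a few common suffixes.
    for suffix in ("ing", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = text.lower().split()
    # Tokens with an explicit override are emitted as-is and skip the stemmer.
    return [overrides.get(t, naive_stem(t)) for t in tokens]

print(analyze("I am a runner and bought the best shoes"))
```

Note that `shoes` still passes through the stand-in stemmer because it has no override rule, while `runner`, `bought`, and `best` are rewritten directly from the map.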
