diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md index 543f935f36509..9aac8ed219d76 100644 --- a/docs/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md +++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_match_any -### Description -#### Syntax +## Description -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +Returns whether the string matches any of the given regular expressions. +## Syntax -Checks whether the string `haystack` matches the regular expressions `patterns` in re2 syntax. returns 0 if none of the regular expressions are matched and 1 if any of the patterns matches. +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` -### example +## Parameters -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns 1 if the string `haystack` matches any of the regular expressions in the `patterns` array, otherwise returns 0. 
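The return value described in this hunk for `multi_match_any` can be illustrated with a minimal Python sketch. It is not the Doris implementation: Python's `re` module stands in for the re2 syntax named in the removed text, and unanchored (search) matching is an assumption that happens to agree with both documented examples.

```python
import re


def multi_match_any(haystack, patterns):
    """Return 1 if any pattern matches somewhere in haystack, else 0.

    Assumption: matching is an unanchored, case-sensitive search,
    approximated here with Python's re module rather than re2.
    """
    return 1 if any(re.search(p, haystack) for p in patterns) else 0


print(multi_match_any("Hello, World!", ["hello", "!", "world"]))  # 1
print(multi_match_any("abc", ["A", "bcd"]))  # 0
```

In the first call only `'!'` matches, which is enough for a result of 1; in the second call neither `'A'` nor `'bcd'` occurs in `'abc'`.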
+ +## Examples + +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md index c2d72c41d0e40..715385c7b9410 100644 --- a/docs/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md +++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_search_all_positions -### Description -#### Syntax +## Description -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +Returns the positions of the first occurrence of a set of regular expressions in a string. -Returns an `ARRAY` where the `i`-th element is the position of the `i`-th element in `needles`(i.e. `needle`)'s **first** occurrence in the string `haystack`. Positions are counted from 1, with 0 meaning the element was not found. **Case-sensitive**. 
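The position semantics of `multi_search_all_positions` (1-based position of the first occurrence, 0 when nothing is found) can be sketched in Python as follows. This is an illustration only, not Doris code. The removed text calls the array elements needles, while the added text calls them regular expressions; the sketch uses `re.search` as an assumption, and for the documented inputs a plain substring search would give the same positions.

```python
import re


def multi_search_all_positions(haystack, patterns):
    """Return, for each pattern, the 1-based position of its first match.

    A value of 0 means the pattern was not found. Matching is
    case-sensitive, as the removed description states.
    """
    positions = []
    for p in patterns:
        m = re.search(p, haystack)
        # m.start() is 0-based, so shift by one; 0 is reserved for "not found".
        positions.append(m.start() + 1 if m else 0)
    return positions


print(multi_search_all_positions(
    "Hello, World!", ["hello", "!", "world", "Hello", "World"]))
# [0, 13, 0, 1, 8]
```

The lowercase `'hello'` and `'world'` yield 0 because the search is case-sensitive, which matches the example output shown in the patch.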
- -### example +## Syntax +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) ``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); + +## Parameters + +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns an `ARRAY` where the `i`-th element represents the position of the first occurrence of the `i`-th element (regular expression) in the `patterns` array within the string `haystack`. Positions are counted starting from 1, and 0 indicates that the element was not found. + +## Examples + +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md index ae42731b9904d..fdff7703b8b22 100644 --- 
a/docs/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md +++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md @@ -26,42 +26,52 @@ under the License. ## Description -Calculate the N-gram similarity between `text` and `pattern`. The similarity ranges from 0 to 1, where a higher similarity indicates greater similarity between the two strings. +Calculates the N-gram similarity between two strings. -Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, return 0. +N-gram similarity is a text similarity calculation method based on N-grams (N-gram sequences). N-gram similarity ranges from 0 to 1, where a higher value indicates greater similarity between the two strings. -N-gram similarity is a method for calculating text similarity based on N-grams. An N-gram is a set of continuous N characters or words extracted from a text string. For example, for the string "text" with N=2 (bigram), the bigrams are: {"te", "ex", "xt"}. +An N-gram is a contiguous sequence of N characters or words from a text. For example, for the string 'text', when N=2, its bi-grams are: {"te", "ex", "xt"}. -The N-gram similarity is calculated as: +The N-gram similarity is calculated as: +**2 * |Intersection| / (|haystack set| + |pattern set|)** -2 * |Intersection| / (|text set| + |pattern set|) +Where |haystack set| and |pattern set| are the N-grams of `haystack` and `pattern`, respectively, and `Intersection` is the intersection of the two sets. -where |text set| and |pattern set| are the N-grams of `text` and `pattern`, and `Intersection` is the intersection of the two sets. +Note that, by definition, a similarity of 1 does not mean the two strings are identical. -Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical. 
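The N-gram similarity formula in this hunk can be checked against the documented examples with a short Python sketch. The function names `ngrams` and `ngram_similarity` are hypothetical, and the code assumes character-level N-grams collected into sets and a positive `gram_num`; it is a reading of the formula as written, not the Doris implementation.

```python
def ngrams(s, n):
    """Return the set of contiguous n-character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(haystack, pattern, gram_num):
    """Compute 2 * |Intersection| / (|haystack set| + |pattern set|)."""
    # Documented special case: an input shorter than gram_num gives 0.
    if len(haystack) < gram_num or len(pattern) < gram_num:
        return 0.0
    h = ngrams(haystack, gram_num)
    p = ngrams(pattern, gram_num)
    return 2 * len(h & p) / (len(h) + len(p))


print(ngram_similarity("123456789", "12345", 3))    # 0.6
print(ngram_similarity("abababab", "babababa", 2))  # 1.0
```

The second call shows why a similarity of 1 does not imply identical strings: both inputs reduce to the same bigram set {"ab", "ba"}, so the intersection equals each set even though the strings differ.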
+## Syntax + +```sql +DOUBLE ngram_search(VARCHAR haystack, VARCHAR pattern, INT gram_num) +``` -Only supports ASCII encoding. +## Parameters -## Syntax +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked, supports only ASCII encoding | +| `pattern` | The string used for similarity comparison, must be a constant, supports only ASCII encoding | +| `gram_num` | The `N` in N-gram, must be a constant | + +## Return Value -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` +Returns the N-gram similarity between `haystack` and `pattern`. +Special case: If the length of `haystack` or `pattern` is less than `gram_num`, returns 0. -## Example +## Examples ```sql -mysql> select ngram_search('123456789' , '12345' , 3); +mysql> SELECT ngram_search('123456789' , '12345' , 3); +---------------------------------------+ | ngram_search('123456789', '12345', 3) | +---------------------------------------+ | 0.6 | +---------------------------------------+ -mysql> select ngram_search("abababab","babababa",2); +mysql> SELECT ngram_search('abababab', 'babababa', 2); +-----------------------------------------+ | ngram_search('abababab', 'babababa', 2) | +-----------------------------------------+ | 1 | +-----------------------------------------+ ``` -## keywords - NGRAM_SEARCH,NGRAM,SEARCH diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md index 54e4f8ff31212..6b97dce846cd5 100644 --- a/docs/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md +++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md @@ -1,6 +1,6 @@ --- { - "title": "tokenize", + "title": "TOKENIZE", "language": "en" } --- @@ -23,3 +23,36 @@ KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
--> + +## Description + +Returns the result of text tokenization. Tokenization is the process of splitting text into a set of tokens. + +## Syntax + +```sql +ARRAY tokenize(VARCHAR txt, VARCHAR tokenizer_args) +``` + +## Parameters + +| Parameter | Description | +| -- | -- | +| `txt` | The text to be tokenized | +| `tokenizer_args` | Tokenizer arguments, a Doris PROPERTIES format string. For detailed information, refer to the inverted index documentation. | + +## Return Value + +Returns the tokenization result of the text `txt` based on the tokenizer arguments `tokenizer_args`. + +## Examples + +```sql +mysql> SELECT tokenize('I love Doris', '"parser"="english"'); ++------------------------------------------------+ +| tokenize('I love Doris', '"parser"="english"') | ++------------------------------------------------+ +| ["i", "love", "doris"] | ++------------------------------------------------+ +1 row in set (0.02 sec) +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md index a2436824ef5b6..57233299d385b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md @@ -24,31 +24,46 @@ specific language governing permissions and limitations under the License. 
--> -## multi_match_any ## 描述 + +返回字符串是否与给定的一组正则表达式匹配。 + + ## 语法 -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +如果字符串 `haystack` 匹配 `patterns` 数组中的任意一个正则表达式返回 1,否则返回 0。 -检查字符串 `haystack` 是否与 re2 语法中的正则表达式 `patterns` 相匹配。如果都没有匹配的正则表达式返回 0,否则返回 1。 ## 举例 -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md index 020d8cae8ba5e..091716cd86c09 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md @@ -24,31 +24,45 @@ specific language governing permissions and limitations under the License. 
--> -## multi_search_all_positions ## 描述 + +返回一组正则表达式在一个字符串中首次出现的位置。 + + ## 语法 -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +返回一个 `ARRAY`,其中第 `i` 个元素为 `patterns` 数组中第 `i` 个元素(正则表达式),在字符串 `haystack` 中**首次**出现的位置,位置从 1 开始计数,0 代表未找到该元素。 -返回一个 `ARRAY`,其中第 `i` 个元素为 `needles` 中第 `i` 个元素 `needle`,在字符串 `haystack` 中**首次**出现的位置。位置从1开始计数,0代表未找到该元素。**大小写敏感**。 ## 举例 -``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md index 
1a2eecc3cb20b..267a9132fb19b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md @@ -24,44 +24,57 @@ specific language governing permissions and limitations under the License. --> -## Description +## 描述 -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` +计算两个字符串的 N-gram 相似度。 -计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1,相似度越高证明两个字符串越相似。 -其中`pattern`,`gram_num`必须为常量。 -如果`text`或者`pattern`的长度小于`gram_num`,返回 0。 +N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 相似度从 0 到 1,相似度越高证明两个字符串越相似。 -N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串“text”,当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。 +N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串 'text',当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。 -N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|) +N-gram 相似度的计算为 2 * |Intersection| / (|haystack set| + |pattern set|) -其中|text set|,|pattern set|为 text 和 pattern 的 N-gram,`Intersection`为两个集合的交集。 +其中 |haystack set| 和 |pattern set| 分别是 `haystack` 和 `pattern` 的 N-gram,`Intersection` 是两个集合的交集。 注意,根据定义,相似度为 1 不代表两个字符串相同。 -仅支持 ASCII 编码。 -## Syntax +## 语法 + +```sql +DOUBLE ngram_search(VARCHAR haystack, VARCHAR pattern, INT gram_num) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串,仅支持 ASCII 编码 | +| `pattern` | 用于对比相似度的字符串,必须是常量,仅支持 ASCII 编码 | +| `gram_num` | N-gram 的 `N`,必须是常量 | + + +## 返回值 + +返回 `haystack` 和 `pattern` 的 N-gram 相似度。 +特殊情况:如果 `haystack` 或者 `pattern` 的长度小于 `gram_num`,返回 0。 -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` -## Example +## 举例 ```sql -mysql> select ngram_search('123456789' , '12345' , 3); +mysql> SELECT ngram_search('123456789' , '12345' , 3); +---------------------------------------+ | ngram_search('123456789', '12345', 3) | 
+---------------------------------------+ | 0.6 | +---------------------------------------+ -mysql> select ngram_search("abababab","babababa",2); +mysql> SELECT ngram_search('abababab', 'babababa', 2); +-----------------------------------------+ | ngram_search('abababab', 'babababa', 2) | +-----------------------------------------+ | 1 | +-----------------------------------------+ ``` -## keywords - NGRAM_SEARCH,NGRAM,SEARCH diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md index ede21be2a4e65..61a8b710b5d77 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md @@ -1,6 +1,6 @@ --- { - "title": "tokenize", + "title": "TOKENIZE", "language": "zh-CN" } --- @@ -23,3 +23,72 @@ KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
--> + +## 描述 + +返回文本分词的结果。分词是将文本分一组词的过程。 + + +## 语法 + +```sql +ARRAY tokenize(VARCHAR txt, VARCHAR tokenizer_args) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `txt` | 待分词的文本 | +| `tokenizer_args` | 分词器参数,是一个 Doris PROPERTIES 格式的字符串,详细说明参考倒排索引的文档 | + + +## 返回值 + +返回对文本 `txt` 按照分词器参数 `tokenizer_args` 进行分词的结果。 + + +## 举例 + +```sql +mysql> SELECT tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"'); ++-----------------------------------------------------------------------------------+ +| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | ++-----------------------------------------------------------------------------------+ +| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] | ++-----------------------------------------------------------------------------------+ +1 row in set (0.02 sec) + +mysql> SELECT tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"'); ++--------------------------------------------------------------------------------------+ +| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | ++--------------------------------------------------------------------------------------+ +| ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] | ++--------------------------------------------------------------------------------------+ +1 row in set (0.02 sec) + +mysql> SELECT tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"'); ++----------------------------------------------------------------------------------------+ +| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') | ++----------------------------------------------------------------------------------------+ +| ["武汉市", "长江大桥"] | ++----------------------------------------------------------------------------------------+ +1 row in set (0.02 sec) + +mysql> SELECT tokenize('I love Doris', '"parser"="english"'); ++------------------------------------------------+ +| tokenize('I love Doris', '"parser"="english"') | 
++------------------------------------------------+ +| ["i", "love", "doris"] | ++------------------------------------------------+ +1 row in set (0.02 sec) + +mysql> SELECT tokenize('I love CHINA 我爱我的祖国','"parser"="unicode"'); ++-------------------------------------------------------------------+ +| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') | ++-------------------------------------------------------------------+ +| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] | ++-------------------------------------------------------------------+ +1 row in set (0.02 sec) +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-match-any.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-match-any.md index 9137347ba01f5..57233299d385b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-match-any.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-match-any.md @@ -1,6 +1,6 @@ --- { - "title": "multi_match_any", + "title": "MULTI_MATCH_ANY", "language": "zh-CN" } --- @@ -24,31 +24,46 @@ specific language governing permissions and limitations under the License. 
--> -## multi_match_any ## 描述 + +返回字符串是否与给定的一组正则表达式匹配。 + + ## 语法 -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +如果字符串 `haystack` 匹配 `patterns` 数组中的任意一个正则表达式返回 1,否则返回 0。 -检查字符串 `haystack` 是否与 re2 语法中的正则表达式 `patterns` 相匹配。如果都没有匹配的正则表达式返回 0,否则返回 1。 ## 举例 -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md index 9b30d33b9d76f..091716cd86c09 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md @@ -1,6 +1,6 @@ --- { - "title": "multi_search_all_positions", + "title": "MULTI_SEARCH_ALL_POSITIONS", "language": "zh-CN" } --- @@ -24,31 +24,45 @@ specific language governing permissions and limitations under the 
License. --> -## multi_search_all_positions ## 描述 + +返回一组正则表达式在一个字符串中首次出现的位置。 + + ## 语法 -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +返回一个 `ARRAY`,其中第 `i` 个元素为 `patterns` 数组中第 `i` 个元素(正则表达式),在字符串 `haystack` 中**首次**出现的位置,位置从 1 开始计数,0 代表未找到该元素。 -返回一个 `ARRAY`,其中第 `i` 个元素为 `needles` 中第 `i` 个元素 `needle`,在字符串 `haystack` 中**首次**出现的位置。位置从1开始计数,0代表未找到该元素。**大小写敏感**。 ## 举例 -``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-match-any.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-match-any.md 
index a2436824ef5b6..57233299d385b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-match-any.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-match-any.md @@ -24,31 +24,46 @@ specific language governing permissions and limitations under the License. --> -## multi_match_any ## 描述 + +返回字符串是否与给定的一组正则表达式匹配。 + + ## 语法 -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +如果字符串 `haystack` 匹配 `patterns` 数组中的任意一个正则表达式返回 1,否则返回 0。 -检查字符串 `haystack` 是否与 re2 语法中的正则表达式 `patterns` 相匹配。如果都没有匹配的正则表达式返回 0,否则返回 1。 ## 举例 -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md index 020d8cae8ba5e..091716cd86c09 100644 --- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md @@ -24,31 +24,45 @@ specific language governing permissions and limitations under the License. --> -## multi_search_all_positions ## 描述 + +返回一组正则表达式在一个字符串中首次出现的位置。 + + ## 语法 -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +返回一个 `ARRAY`,其中第 `i` 个元素为 `patterns` 数组中第 `i` 个元素(正则表达式),在字符串 `haystack` 中**首次**出现的位置,位置从 1 开始计数,0 代表未找到该元素。 -返回一个 `ARRAY`,其中第 `i` 个元素为 `needles` 中第 `i` 个元素 `needle`,在字符串 `haystack` 中**首次**出现的位置。位置从1开始计数,0代表未找到该元素。**大小写敏感**。 ## 举例 -``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | 
+---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/tokenize.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/tokenize.md new file mode 100644 index 0000000000000..61a8b710b5d77 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/tokenize.md @@ -0,0 +1,94 @@ +--- +{ + "title": "TOKENIZE", + "language": "zh-CN" +} +--- + + + +## 描述 + +返回文本分词的结果。分词是将文本分一组词的过程。 + + +## 语法 + +```sql +ARRAY tokenize(VARCHAR txt, VARCHAR tokenizer_args) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `txt` | 待分词的文本 | +| `tokenizer_args` | 分词器参数,是一个 Doris PROPERTIES 格式的字符串,详细说明参考倒排索引的文档 | + + +## 返回值 + +返回对文本 `txt` 按照分词器参数 `tokenizer_args` 进行分词的结果。 + + +## 举例 + +```sql +mysql> SELECT tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"'); ++-----------------------------------------------------------------------------------+ +| tokenize('武汉长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | ++-----------------------------------------------------------------------------------+ +| ["武汉", "武汉长江大桥", "长江", "长江大桥", "大桥"] | ++-----------------------------------------------------------------------------------+ +1 row in set (0.02 sec) + +mysql> SELECT tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"'); ++--------------------------------------------------------------------------------------+ +| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="fine_grained"') | ++--------------------------------------------------------------------------------------+ +| ["武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"] | ++--------------------------------------------------------------------------------------+ +1 row in set (0.02 sec) + +mysql> SELECT 
tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"'); ++----------------------------------------------------------------------------------------+ +| tokenize('武汉市长江大桥', '"parser"="chinese","parser_mode"="coarse_grained"') | ++----------------------------------------------------------------------------------------+ +| ["武汉市", "长江大桥"] | ++----------------------------------------------------------------------------------------+ +1 row in set (0.02 sec) + +mysql> SELECT tokenize('I love Doris', '"parser"="english"'); ++------------------------------------------------+ +| tokenize('I love Doris', '"parser"="english"') | ++------------------------------------------------+ +| ["i", "love", "doris"] | ++------------------------------------------------+ +1 row in set (0.02 sec) + +mysql> SELECT tokenize('I love CHINA 我爱我的祖国','"parser"="unicode"'); ++-------------------------------------------------------------------+ +| tokenize('I love CHINA 我爱我的祖国', '"parser"="unicode"') | ++-------------------------------------------------------------------+ +| ["i", "love", "china", "我", "爱", "我", "的", "祖", "国"] | ++-------------------------------------------------------------------+ +1 row in set (0.02 sec) +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md index a2436824ef5b6..57233299d385b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md @@ -24,31 +24,46 @@ specific language governing permissions and limitations under the License. 
--> -## multi_match_any ## 描述 + +返回字符串是否与给定的一组正则表达式匹配。 + + ## 语法 -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +如果字符串 `haystack` 匹配 `patterns` 数组中的任意一个正则表达式返回 1,否则返回 0。 -检查字符串 `haystack` 是否与 re2 语法中的正则表达式 `patterns` 相匹配。如果都没有匹配的正则表达式返回 0,否则返回 1。 ## 举例 -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md index 020d8cae8ba5e..091716cd86c09 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md @@ -24,31 +24,45 @@ specific language governing permissions and limitations under the License. 
--> -## multi_search_all_positions ## 描述 + +返回一组正则表达式在一个字符串中首次出现的位置。 + + ## 语法 -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +返回一个 `ARRAY`,其中第 `i` 个元素为 `patterns` 数组中第 `i` 个元素(正则表达式),在字符串 `haystack` 中**首次**出现的位置,位置从 1 开始计数,0 代表未找到该元素。 -返回一个 `ARRAY`,其中第 `i` 个元素为 `needles` 中第 `i` 个元素 `needle`,在字符串 `haystack` 中**首次**出现的位置。位置从1开始计数,0代表未找到该元素。**大小写敏感**。 ## 举例 -``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md 
index 1a2eecc3cb20b..267a9132fb19b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md @@ -24,44 +24,57 @@ specific language governing permissions and limitations under the License. --> -## Description +## 描述 -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` +计算两个字符串的 N-gram 相似度。 -计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1,相似度越高证明两个字符串越相似。 -其中`pattern`,`gram_num`必须为常量。 -如果`text`或者`pattern`的长度小于`gram_num`,返回 0。 +N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 相似度从 0 到 1,相似度越高证明两个字符串越相似。 -N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串“text”,当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。 +N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串 'text',当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。 -N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|) +N-gram 相似度的计算为 2 * |Intersection| / (|haystack set| + |pattern set|) -其中|text set|,|pattern set|为 text 和 pattern 的 N-gram,`Intersection`为两个集合的交集。 +其中 |haystack set| 和 |pattern set| 分别是 `haystack` 和 `pattern` 的 N-gram,`Intersection` 是两个集合的交集。 注意,根据定义,相似度为 1 不代表两个字符串相同。 -仅支持 ASCII 编码。 -## Syntax +## 语法 + +```sql +DOUBLE ngram_search(VARCHAR haystack, VARCHAR pattern, INT gram_num) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串,仅支持 ASCII 编码 | +| `pattern` | 用于对比相似度的字符串,必须是常量,仅支持 ASCII 编码 | +| `gram_num` | N-gram 的 `N`,必须是常量 | + + +## 返回值 + +返回 `haystack` 和 `pattern` 的 N-gram 相似度。 +特殊情况:如果 `haystack` 或者 `pattern` 的长度小于 `gram_num`,返回 0。 -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` -## Example +## 举例 ```sql -mysql> select ngram_search('123456789' , '12345' , 3); +mysql> SELECT ngram_search('123456789' , '12345' , 3); +---------------------------------------+ | ngram_search('123456789', 
'12345', 3) | +---------------------------------------+ | 0.6 | +---------------------------------------+ -mysql> select ngram_search("abababab","babababa",2); +mysql> SELECT ngram_search('abababab', 'babababa', 2); +-----------------------------------------+ | ngram_search('abababab', 'babababa', 2) | +-----------------------------------------+ | 1 | +-----------------------------------------+ ``` -## keywords - NGRAM_SEARCH,NGRAM,SEARCH diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md index a2436824ef5b6..57233299d385b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md @@ -24,31 +24,46 @@ specific language governing permissions and limitations under the License. 
--> -## multi_match_any ## 描述 + +返回字符串是否与给定的一组正则表达式匹配。 + + ## 语法 -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +如果字符串 `haystack` 匹配 `patterns` 数组中的任意一个正则表达式返回 1,否则返回 0。 -检查字符串 `haystack` 是否与 re2 语法中的正则表达式 `patterns` 相匹配。如果都没有匹配的正则表达式返回 0,否则返回 1。 ## 举例 -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY + diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md index 020d8cae8ba5e..091716cd86c09 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md @@ -24,31 +24,45 @@ specific language governing permissions and limitations under the License. 
--> -## multi_search_all_positions ## 描述 + +返回一组正则表达式在一个字符串中首次出现的位置。 + + ## 语法 -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串 | +| `patterns` | 正则表达式数组 | + + +## 返回值 + +返回一个 `ARRAY`,其中第 `i` 个元素为 `patterns` 数组中第 `i` 个元素(正则表达式),在字符串 `haystack` 中**首次**出现的位置,位置从 1 开始计数,0 代表未找到该元素。 -返回一个 `ARRAY`,其中第 `i` 个元素为 `needles` 中第 `i` 个元素 `needle`,在字符串 `haystack` 中**首次**出现的位置。位置从1开始计数,0代表未找到该元素。**大小写敏感**。 ## 举例 -``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md 
index 1a2eecc3cb20b..267a9132fb19b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md @@ -24,44 +24,57 @@ specific language governing permissions and limitations under the License. --> -## Description +## 描述 -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` +计算两个字符串的 N-gram 相似度。 -计算 text 和 pattern 的 N-gram 相似度。相似度从 0 到 1,相似度越高证明两个字符串越相似。 -其中`pattern`,`gram_num`必须为常量。 -如果`text`或者`pattern`的长度小于`gram_num`,返回 0。 +N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 相似度从 0 到 1,相似度越高证明两个字符串越相似。 -N-gram 相似度(N-gram similarity)是一种基于 N-gram(N 元语法)的文本相似度计算方法。N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串“text”,当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。 +N-gram 是指将一个文本串分成连续的 N 个字符或词语的集合。例如,对于字符串 'text',当 N=2 时,其二元组(bi-gram)为:{“te”, “ex”, “xt”}。 -N-gram 相似度的计算为 2 * |Intersection| / (|text set| + |pattern set|) +N-gram 相似度的计算为 2 * |Intersection| / (|haystack set| + |pattern set|) -其中|text set|,|pattern set|为 text 和 pattern 的 N-gram,`Intersection`为两个集合的交集。 +其中 |haystack set| 和 |pattern set| 分别是 `haystack` 和 `pattern` 的 N-gram,`Intersection` 是两个集合的交集。 注意,根据定义,相似度为 1 不代表两个字符串相同。 -仅支持 ASCII 编码。 -## Syntax +## 语法 + +```sql +DOUBLE ngram_search(VARCHAR haystack, VARCHAR pattern, INT gram_num) +``` + + +## 参数 + +| 参数 | 说明 | +| -- | -- | +| `haystack` | 被检查的字符串,仅支持 ASCII 编码 | +| `pattern` | 用于对比相似度的字符串,必须是常量,仅支持 ASCII 编码 | +| `gram_num` | N-gram 的 `N`,必须是常量 | + + +## 返回值 + +返回 `haystack` 和 `pattern` 的 N-gram 相似度。 +特殊情况:如果 `haystack` 或者 `pattern` 的长度小于 `gram_num`,返回 0。 -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` -## Example +## 举例 ```sql -mysql> select ngram_search('123456789' , '12345' , 3); +mysql> SELECT ngram_search('123456789' , '12345' , 3); +---------------------------------------+ | ngram_search('123456789', 
'12345', 3) | +---------------------------------------+ | 0.6 | +---------------------------------------+ -mysql> select ngram_search("abababab","babababa",2); +mysql> SELECT ngram_search('abababab', 'babababa', 2); +-----------------------------------------+ | ngram_search('abababab', 'babababa', 2) | +-----------------------------------------+ | 1 | +-----------------------------------------+ ``` -## keywords - NGRAM_SEARCH,NGRAM,SEARCH diff --git a/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-match-any.md b/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-match-any.md index ea41ef7207838..9aac8ed219d76 100644 --- a/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-match-any.md +++ b/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-match-any.md @@ -1,6 +1,6 @@ --- { - "title": "multi_match_any", + "title": "MULTI_MATCH_ANY", "language": "en" } --- @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_match_any -### Description -#### Syntax +## Description -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +Returns whether the string matches any of the given regular expressions. +## Syntax -Checks whether the string `haystack` matches the regular expressions `patterns` in re2 syntax. returns 0 if none of the regular expressions are matched and 1 if any of the patterns matches. +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` -### example +## Parameters -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns 1 if the string `haystack` matches any of the regular expressions in the `patterns` array, otherwise returns 0. 
+ +## Examples + +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY diff --git a/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md b/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md index db52923b6adc3..715385c7b9410 100644 --- a/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md +++ b/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md @@ -1,6 +1,6 @@ --- { - "title": "multi_search_all_positions", + "title": "MULTI_SEARCH_ALL_POSITIONS", "language": "en" } --- @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_search_all_positions -### Description -#### Syntax +## Description -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +Returns the positions of the first occurrence of a set of regular expressions in a string. -Returns an `ARRAY` where the `i`-th element is the position of the `i`-th element in `needles`(i.e. `needle`)'s **first** occurrence in the string `haystack`. Positions are counted from 1, with 0 meaning the element was not found. **Case-sensitive**. 
- -### example +## Syntax +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) ``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); + +## Parameters + +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns an `ARRAY` where the `i`-th element represents the position of the first occurrence of the `i`-th element (regular expression) in the `patterns` array within the string `haystack`. Positions are counted starting from 1, and 0 indicates that the element was not found. + +## Examples + +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/ngram-search.md b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/ngram-search.md new file mode 100644 index 0000000000000..fdff7703b8b22 --- /dev/null +++ 
b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/ngram-search.md @@ -0,0 +1,77 @@ +--- +{ + "title": "NGRAM_SEARCH", + "language": "en" +} +--- + + + +## Description + +Calculates the N-gram similarity between two strings. + +N-gram similarity is a text similarity calculation method based on N-grams (N-gram sequences). N-gram similarity ranges from 0 to 1, where a higher value indicates greater similarity between the two strings. + +An N-gram is a contiguous sequence of N characters or words from a text. For example, for the string 'text', when N=2, its bi-grams are: {"te", "ex", "xt"}. + +The N-gram similarity is calculated as: +**2 * |Intersection| / (|haystack set| + |pattern set|)** + +Where |haystack set| and |pattern set| are the N-grams of `haystack` and `pattern`, respectively, and `Intersection` is the intersection of the two sets. + +Note that, by definition, a similarity of 1 does not mean the two strings are identical. + +## Syntax + +```sql +DOUBLE ngram_search(VARCHAR haystack, VARCHAR pattern, INT gram_num) +``` + +## Parameters + +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked, supports only ASCII encoding | +| `pattern` | The string used for similarity comparison, must be a constant, supports only ASCII encoding | +| `gram_num` | The `N` in N-gram, must be a constant | + +## Return Value + +Returns the N-gram similarity between `haystack` and `pattern`. +Special case: If the length of `haystack` or `pattern` is less than `gram_num`, returns 0. 
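The similarity formula described above can be sketched in a few lines of Python. This is only an illustration of the set-based reading of `2 * |Intersection| / (|haystack set| + |pattern set|)`, not Doris's actual implementation; the helper names `ngram_set` and `ngram_similarity` are hypothetical, and the sketch assumes ASCII input and a positive `gram_num`.

```python
def ngram_set(s, n):
    """Return the set of all contiguous n-character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(haystack, pattern, gram_num):
    """Set-based N-gram similarity, following the formula in the text.

    Hypothetical helper for illustration; assumes gram_num > 0.
    """
    # Special case from the Return Value section: a string shorter
    # than gram_num yields a similarity of 0.
    if len(haystack) < gram_num or len(pattern) < gram_num:
        return 0.0
    h = ngram_set(haystack, gram_num)
    p = ngram_set(pattern, gram_num)
    return 2 * len(h & p) / (len(h) + len(p))


if __name__ == "__main__":
    print(ngram_similarity("123456789", "12345", 3))
    print(ngram_similarity("abababab", "babababa", 2))
```

Under this reading, `'123456789'` has 7 distinct tri-grams, `'12345'` has 3, and all 3 are shared, giving 2 * 3 / (7 + 3). `'abababab'` and `'babababa'` both reduce to the bi-gram set {"ab", "ba"}, which is why a similarity of 1 does not imply that the strings are identical.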
+ +## Examples + +```sql +mysql> SELECT ngram_search('123456789' , '12345' , 3); ++---------------------------------------+ +| ngram_search('123456789', '12345', 3) | ++---------------------------------------+ +| 0.6 | ++---------------------------------------+ + +mysql> SELECT ngram_search('abababab', 'babababa', 2); ++-----------------------------------------+ +| ngram_search('abababab', 'babababa', 2) | ++-----------------------------------------+ +| 1 | ++-----------------------------------------+ +``` diff --git a/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-match-any.md b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-match-any.md index 543f935f36509..9aac8ed219d76 100644 --- a/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-match-any.md +++ b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-match-any.md @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_match_any -### Description -#### Syntax +## Description -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +Returns whether the string matches any of the given regular expressions. +## Syntax -Checks whether the string `haystack` matches the regular expressions `patterns` in re2 syntax. returns 0 if none of the regular expressions are matched and 1 if any of the patterns matches. +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` -### example +## Parameters -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns 1 if the string `haystack` matches any of the regular expressions in the `patterns` array, otherwise returns 0. 
+ +## Examples + +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY diff --git a/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md index c2d72c41d0e40..715385c7b9410 100644 --- a/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md +++ b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/search/multi-search-all-positions.md @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_search_all_positions -### Description -#### Syntax +## Description -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +Returns the positions of the first occurrence of a set of regular expressions in a string. -Returns an `ARRAY` where the `i`-th element is the position of the `i`-th element in `needles`(i.e. `needle`)'s **first** occurrence in the string `haystack`. Positions are counted from 1, with 0 meaning the element was not found. **Case-sensitive**. 
- -### example +## Syntax +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) ``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); + +## Parameters + +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns an `ARRAY` where the `i`-th element represents the position of the first occurrence of the `i`-th element (regular expression) in the `patterns` array within the string `haystack`. Positions are counted starting from 1, and 0 indicates that the element was not found. + +## Examples + +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/tokenize.md b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/tokenize.md new file mode 100644 index 0000000000000..6b97dce846cd5 --- /dev/null +++ 
b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/tokenize.md @@ -0,0 +1,58 @@ +--- +{ + "title": "TOKENIZE", + "language": "en" +} +--- + + + +## Description + +Returns the result of text tokenization. Tokenization is the process of splitting text into a set of tokens. + +## Syntax + +```sql +ARRAY tokenize(VARCHAR txt, VARCHAR tokenizer_args) +``` + +## Parameters + +| Parameter | Description | +| -- | -- | +| `txt` | The text to be tokenized | +| `tokenizer_args` | Tokenizer arguments, a Doris PROPERTIES format string. For detailed information, refer to the inverted index documentation. | + +## Return Value + +Returns the tokenization result of the text `txt` based on the tokenizer arguments `tokenizer_args`. + +## Examples + +```sql +mysql> SELECT tokenize('I love Doris', '"parser"="english"'); ++------------------------------------------------+ +| tokenize('I love Doris', '"parser"="english"') | ++------------------------------------------------+ +| ["i", "love", "doris"] | ++------------------------------------------------+ +1 row in set (0.02 sec) +``` diff --git a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md index 543f935f36509..9aac8ed219d76 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md +++ b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_match_any -### Description -#### Syntax +## Description -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +Returns whether the string matches any of the given regular expressions. +## Syntax -Checks whether the string `haystack` matches the regular expressions `patterns` in re2 syntax. 
returns 0 if none of the regular expressions are matched and 1 if any of the patterns matches. +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` -### example +## Parameters -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns 1 if the string `haystack` matches any of the regular expressions in the `patterns` array, otherwise returns 0. + +## Examples + +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY diff --git a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md index c2d72c41d0e40..715385c7b9410 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md +++ b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. 
--> -## multi_search_all_positions -### Description -#### Syntax +## Description -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +Returns the positions of the first occurrence of a set of regular expressions in a string. -Returns an `ARRAY` where the `i`-th element is the position of the `i`-th element in `needles`(i.e. `needle`)'s **first** occurrence in the string `haystack`. Positions are counted from 1, with 0 meaning the element was not found. **Case-sensitive**. - -### example +## Syntax +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) ``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); + +## Parameters + +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns an `ARRAY` where the `i`-th element represents the position of the first occurrence of the `i`-th element (regular expression) in the `patterns` array within the string `haystack`. Positions are counted starting from 1, and 0 indicates that the element was not found. 
+ +## Examples + +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md index ae42731b9904d..fdff7703b8b22 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md +++ b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md @@ -26,42 +26,52 @@ under the License. ## Description -Calculate the N-gram similarity between `text` and `pattern`. The similarity ranges from 0 to 1, where a higher similarity indicates greater similarity between the two strings. +Calculates the N-gram similarity between two strings. -Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, return 0. 
+N-gram similarity is a text similarity calculation method based on N-grams (N-gram sequences). N-gram similarity ranges from 0 to 1, where a higher value indicates greater similarity between the two strings. -N-gram similarity is a method for calculating text similarity based on N-grams. An N-gram is a set of continuous N characters or words extracted from a text string. For example, for the string "text" with N=2 (bigram), the bigrams are: {"te", "ex", "xt"}. +An N-gram is a contiguous sequence of N characters or words from a text. For example, for the string 'text', when N=2, its bi-grams are: {"te", "ex", "xt"}. -The N-gram similarity is calculated as: +The N-gram similarity is calculated as: +**2 * |Intersection| / (|haystack set| + |pattern set|)** -2 * |Intersection| / (|text set| + |pattern set|) +Where |haystack set| and |pattern set| are the N-grams of `haystack` and `pattern`, respectively, and `Intersection` is the intersection of the two sets. -where |text set| and |pattern set| are the N-grams of `text` and `pattern`, and `Intersection` is the intersection of the two sets. +Note that, by definition, a similarity of 1 does not mean the two strings are identical. -Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical. +## Syntax + +```sql +DOUBLE ngram_search(VARCHAR haystack, VARCHAR pattern, INT gram_num) +``` -Only supports ASCII encoding. +## Parameters -## Syntax +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked, supports only ASCII encoding | +| `pattern` | The string used for similarity comparison, must be a constant, supports only ASCII encoding | +| `gram_num` | The `N` in N-gram, must be a constant | + +## Return Value -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` +Returns the N-gram similarity between `haystack` and `pattern`. +Special case: If the length of `haystack` or `pattern` is less than `gram_num`, returns 0. 
-## Example +## Examples ```sql -mysql> select ngram_search('123456789' , '12345' , 3); +mysql> SELECT ngram_search('123456789' , '12345' , 3); +---------------------------------------+ | ngram_search('123456789', '12345', 3) | +---------------------------------------+ | 0.6 | +---------------------------------------+ -mysql> select ngram_search("abababab","babababa",2); +mysql> SELECT ngram_search('abababab', 'babababa', 2); +-----------------------------------------+ | ngram_search('abababab', 'babababa', 2) | +-----------------------------------------+ | 1 | +-----------------------------------------+ ``` -## keywords - NGRAM_SEARCH,NGRAM,SEARCH diff --git a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md index 54e4f8ff31212..6b97dce846cd5 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md +++ b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md @@ -1,6 +1,6 @@ --- { - "title": "tokenize", + "title": "TOKENIZE", "language": "en" } --- @@ -23,3 +23,36 @@ KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. --> + +## Description + +Returns the result of text tokenization. Tokenization is the process of splitting text into a set of tokens. + +## Syntax + +```sql +ARRAY tokenize(VARCHAR txt, VARCHAR tokenizer_args) +``` + +## Parameters + +| Parameter | Description | +| -- | -- | +| `txt` | The text to be tokenized | +| `tokenizer_args` | Tokenizer arguments, a Doris PROPERTIES format string. For detailed information, refer to the inverted index documentation. | + +## Return Value + +Returns the tokenization result of the text `txt` based on the tokenizer arguments `tokenizer_args`. 
+ +## Examples + +```sql +mysql> SELECT tokenize('I love Doris', '"parser"="english"'); ++------------------------------------------------+ +| tokenize('I love Doris', '"parser"="english"') | ++------------------------------------------------+ +| ["i", "love", "doris"] | ++------------------------------------------------+ +1 row in set (0.02 sec) +``` diff --git a/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md b/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md index 543f935f36509..9aac8ed219d76 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md +++ b/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-match-any.md @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_match_any -### Description -#### Syntax +## Description -`TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns)` +Returns whether the string matches any of the given regular expressions. +## Syntax -Checks whether the string `haystack` matches the regular expressions `patterns` in re2 syntax. returns 0 if none of the regular expressions are matched and 1 if any of the patterns matches. +```sql +TINYINT multi_match_any(VARCHAR haystack, ARRAY patterns) +``` -### example +## Parameters -``` -mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']); +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns 1 if the string `haystack` matches any of the regular expressions in the `patterns` array, otherwise returns 0. 
+ +## Examples + +```sql +mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']); +-----------------------------------------------------------+ | multi_match_any('Hello, World!', ['hello', '!', 'world']) | +-----------------------------------------------------------+ | 1 | +-----------------------------------------------------------+ -mysql> select multi_match_any('abc', ['A', 'bcd']); +mysql> SELECT multi_match_any('abc', ['A', 'bcd']); +--------------------------------------+ | multi_match_any('abc', ['A', 'bcd']) | +--------------------------------------+ | 0 | +--------------------------------------+ ``` -### keywords - MULTI_MATCH,MATCH,ANY diff --git a/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md b/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md index c2d72c41d0e40..715385c7b9410 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md +++ b/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/multi-search-all-positions.md @@ -24,31 +24,41 @@ specific language governing permissions and limitations under the License. --> -## multi_search_all_positions -### Description -#### Syntax +## Description -`ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY needles)` +Returns the positions of the first occurrence of a set of regular expressions in a string. -Returns an `ARRAY` where the `i`-th element is the position of the `i`-th element in `needles`(i.e. `needle`)'s **first** occurrence in the string `haystack`. Positions are counted from 1, with 0 meaning the element was not found. **Case-sensitive**. 
- -### example +## Syntax +```sql +ARRAY multi_search_all_positions(VARCHAR haystack, ARRAY patterns) ``` -mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); + +## Parameters + +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked | +| `patterns` | Array of regular expressions | + +## Return Value + +Returns an `ARRAY` where the `i`-th element represents the position of the first occurrence of the `i`-th element (regular expression) in the `patterns` array within the string `haystack`. Positions are counted starting from 1, and 0 indicates that the element was not found. + +## Examples + +```sql +mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']); +----------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) | +----------------------------------------------------------------------+ -| [0,13,0] | +| [0, 13, 0] | +----------------------------------------------------------------------+ -select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']); +---------------------------------------------------------------------------------------------+ | multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) | +---------------------------------------------------------------------------------------------+ | [0, 13, 0, 1, 8] | +---------------------------------------------------------------------------------------------+ ``` - -### keywords - MULTI_SEARCH,SEARCH,POSITIONS diff --git a/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md b/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md index ae42731b9904d..fdff7703b8b22 100644 --- 
a/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md +++ b/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/ngram-search.md @@ -26,42 +26,52 @@ under the License. ## Description -Calculate the N-gram similarity between `text` and `pattern`. The similarity ranges from 0 to 1, where a higher similarity indicates greater similarity between the two strings. +Calculates the N-gram similarity between two strings. -Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, return 0. +N-gram similarity is a text similarity calculation method based on N-grams (N-gram sequences). N-gram similarity ranges from 0 to 1, where a higher value indicates greater similarity between the two strings. -N-gram similarity is a method for calculating text similarity based on N-grams. An N-gram is a set of continuous N characters or words extracted from a text string. For example, for the string "text" with N=2 (bigram), the bigrams are: {"te", "ex", "xt"}. +An N-gram is a contiguous sequence of N characters or words from a text. For example, for the string 'text', when N=2, its bi-grams are: {"te", "ex", "xt"}. -The N-gram similarity is calculated as: +The N-gram similarity is calculated as: +**2 * |Intersection| / (|haystack set| + |pattern set|)** -2 * |Intersection| / (|text set| + |pattern set|) +Where |haystack set| and |pattern set| are the N-grams of `haystack` and `pattern`, respectively, and `Intersection` is the intersection of the two sets. -where |text set| and |pattern set| are the N-grams of `text` and `pattern`, and `Intersection` is the intersection of the two sets. +Note that, by definition, a similarity of 1 does not mean the two strings are identical. -Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical. 
+## Syntax + +```sql +DOUBLE ngram_search(VARCHAR haystack, VARCHAR pattern, INT gram_num) +``` -Only supports ASCII encoding. +## Parameters -## Syntax +| Parameter | Description | +| -- | -- | +| `haystack` | The string to be checked, supports only ASCII encoding | +| `pattern` | The string used for similarity comparison, must be a constant, supports only ASCII encoding | +| `gram_num` | The `N` in N-gram, must be a constant | + +## Return Value -`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)` +Returns the N-gram similarity between `haystack` and `pattern`. +Special case: If the length of `haystack` or `pattern` is less than `gram_num`, returns 0. -## Example +## Examples ```sql -mysql> select ngram_search('123456789' , '12345' , 3); +mysql> SELECT ngram_search('123456789' , '12345' , 3); +---------------------------------------+ | ngram_search('123456789', '12345', 3) | +---------------------------------------+ | 0.6 | +---------------------------------------+ -mysql> select ngram_search("abababab","babababa",2); +mysql> SELECT ngram_search('abababab', 'babababa', 2); +-----------------------------------------+ | ngram_search('abababab', 'babababa', 2) | +-----------------------------------------+ | 1 | +-----------------------------------------+ ``` -## keywords - NGRAM_SEARCH,NGRAM,SEARCH diff --git a/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md b/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md index 54e4f8ff31212..6b97dce846cd5 100644 --- a/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md +++ b/versioned_docs/version-3.0/sql-manual/sql-functions/scalar-functions/string-functions/tokenize.md @@ -1,6 +1,6 @@ --- { - "title": "tokenize", + "title": "TOKENIZE", "language": "en" } --- @@ -23,3 +23,36 @@ KIND, either express or implied. 
See the License for the specific language governing permissions and limitations under the License. --> + +## Description + +Returns the result of text tokenization. Tokenization is the process of splitting text into a set of tokens. + +## Syntax + +```sql +ARRAY tokenize(VARCHAR txt, VARCHAR tokenizer_args) +``` + +## Parameters + +| Parameter | Description | +| -- | -- | +| `txt` | The text to be tokenized | +| `tokenizer_args` | Tokenizer arguments, a Doris PROPERTIES format string. For detailed information, refer to the inverted index documentation. | + +## Return Value + +Returns the tokenization result of the text `txt` based on the tokenizer arguments `tokenizer_args`. + +## Examples + +```sql +mysql> SELECT tokenize('I love Doris', '"parser"="english"'); ++------------------------------------------------+ +| tokenize('I love Doris', '"parser"="english"') | ++------------------------------------------------+ +| ["i", "love", "doris"] | ++------------------------------------------------+ +1 row in set (0.02 sec) +```
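
The `ngram-search.md` hunks above define the similarity as 2 * |Intersection| / (|haystack set| + |pattern set|). As a quick check of that formula against the two documented examples, here is a minimal Python sketch. It is not the Doris implementation: the helper names `ngram_set` and `ngram_similarity` are hypothetical, and it assumes the N-grams are character-level and treated as sets (duplicates collapsed), which is the reading that reproduces both documented results.

```python
def ngram_set(s, n):
    """Return the set of contiguous n-character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(haystack, pattern, gram_num):
    """Set-based N-gram similarity, following the formula in the docs.

    Hypothetical helper for illustration only; Doris may compute
    this differently internally.
    """
    if gram_num < 1:
        raise ValueError("gram_num must be a positive integer")
    # Documented special case: either input shorter than gram_num gives 0.
    if len(haystack) < gram_num or len(pattern) < gram_num:
        return 0.0
    h = ngram_set(haystack, gram_num)
    p = ngram_set(pattern, gram_num)
    return 2 * len(h & p) / (len(h) + len(p))


if __name__ == "__main__":
    # Same inputs as the documented ngram_search examples.
    print(ngram_similarity("123456789", "12345", 3))  # 0.6
    print(ngram_similarity("abababab", "babababa", 2))  # 1.0
```

The second call shows why a similarity of 1 does not imply identical strings: `abababab` and `babababa` differ, yet both reduce to the same bigram set {"ab", "ba"}.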