update docs for string functions multi-match-any, multi-search-all-positions, ngram-search and tokenize #1948

Open: wants to merge 2 commits into base: master
Original file line number Diff line number Diff line change
@@ -24,31 +24,41 @@ specific language governing permissions and limitations
under the License.
-->

## multi_match_any
### Description
#### Syntax
## Description

`TINYINT multi_match_any(VARCHAR haystack, ARRAY<VARCHAR> patterns)`
Returns whether the string matches any of the given regular expressions.

## Syntax

Checks whether the string `haystack` matches the regular expressions `patterns` in re2 syntax. Returns 0 if none of the regular expressions match and 1 if any of them matches.
```sql
TINYINT multi_match_any(VARCHAR haystack, ARRAY<VARCHAR> patterns)
Contributor


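  1. Do not include the parameter types or the return type
  2. Write the function name in uppercase
  3. Wrap the parameters in angle brackets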
Suggested change
TINYINT multi_match_any(VARCHAR haystack, ARRAY<VARCHAR> patterns)
MULTI_MATCH_ANY(<haystack>, <patterns>)

```

### example
## Parameters

```
mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']);
| Parameter | Description |
| -- | -- |
| `haystack` | The string to be checked |
| `patterns` | Array of regular expressions |
Comment on lines +41 to +42
Contributor


Suggested change
| `haystack` | The string to be checked |
| `patterns` | Array of regular expressions |
| `<haystack>` | The string to be checked |
| `<patterns>` | Array of regular expressions |


## Return Value

Returns 1 if the string `haystack` matches any of the regular expressions in the `patterns` array, otherwise returns 0.
Contributor


Suggested change
Returns 1 if the string `haystack` matches any of the regular expressions in the `patterns` array, otherwise returns 0.
Returns 1 if the string `<haystack>` matches any of the regular expressions in the `<patterns>` array, otherwise returns 0.
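
To make the return value concrete, here is a minimal Python sketch of the semantics described above. It is an illustration only, not the Doris implementation: it assumes unanchored matching (a pattern may match anywhere in the string) and uses Python's `re` module as a stand-in for re2, whose syntax differs in some details. The Python function name simply mirrors the SQL function.

```python
import re


def multi_match_any(haystack: str, patterns: list[str]) -> int:
    """Return 1 if any pattern matches somewhere in haystack, else 0.

    Assumption: matching is unanchored and case-sensitive, as the
    documented examples suggest. Python's `re` stands in for re2.
    """
    return int(any(re.search(p, haystack) is not None for p in patterns))


# The two documented examples:
print(multi_match_any("Hello, World!", ["hello", "!", "world"]))  # 1 ('!' matches)
print(multi_match_any("abc", ["A", "bcd"]))                       # 0 (nothing matches)
```

In the first call only `'!'` matches, because `'hello'` and `'world'` differ in case from the text; one match is enough to return 1.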


## Examples

```sql
mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']);
Contributor


Do not include the prompt

Suggested change
mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']);
SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']);

+-----------------------------------------------------------+
| multi_match_any('Hello, World!', ['hello', '!', 'world']) |
+-----------------------------------------------------------+
| 1 |
+-----------------------------------------------------------+
Comment on lines 52 to 56
Contributor


Put the query and the result in two separate code blocks, and use the text format for the result

```text
+-----------------------------------------------------------+
| multi_match_any('Hello, World!', ['hello', '!', 'world']) |
+-----------------------------------------------------------+
| 1                                                         |
+-----------------------------------------------------------+
```


mysql> select multi_match_any('abc', ['A', 'bcd']);
mysql> SELECT multi_match_any('abc', ['A', 'bcd']);
+--------------------------------------+
| multi_match_any('abc', ['A', 'bcd']) |
+--------------------------------------+
| 0 |
+--------------------------------------+
```
### keywords
MULTI_MATCH,MATCH,ANY
@@ -24,31 +24,41 @@ specific language governing permissions and limitations
under the License.
-->

## multi_search_all_positions
### Description
#### Syntax
## Description

`ARRAY<INT> multi_search_all_positions(VARCHAR haystack, ARRAY<VARCHAR> needles)`
Returns, for each regular expression in a set, the position of its first occurrence in a string.

Returns an `ARRAY` where the `i`-th element is the position of the **first** occurrence of the `i`-th element of `needles` (i.e. `needle`) in the string `haystack`. Positions are counted from 1, with 0 meaning the element was not found. **Case-sensitive**.

### example
## Syntax

```sql
ARRAY<INT> multi_search_all_positions(VARCHAR haystack, ARRAY<VARCHAR> patterns)
```
mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']);

## Parameters

| Parameter | Description |
| -- | -- |
| `haystack` | The string to be checked |
| `patterns` | Array of regular expressions |

## Return Value

Returns an `ARRAY` where the `i`-th element represents the position of the first occurrence of the `i`-th element (regular expression) in the `patterns` array within the string `haystack`. Positions are counted starting from 1, and 0 indicates that the element was not found.
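
The position convention can be sketched in Python as follows. This is a hedged illustration, not the Doris implementation: it follows the wording above and treats each element of `patterns` as a regular expression (with Python's `re` as a stand-in), whereas the earlier wording of this page spoke of plain `needles`, so literal substring search may be the actual behavior. For the documented examples both readings give the same result.

```python
import re


def multi_search_all_positions(haystack: str, patterns: list[str]) -> list[int]:
    """For each pattern, return the 1-based position of its first
    occurrence in haystack, or 0 if it does not occur.

    Assumption: patterns are regular expressions, matched case-sensitively.
    """
    positions = []
    for pattern in patterns:
        match = re.search(pattern, haystack)
        # match.start() is 0-based, so shift by one; 0 is reserved for "not found".
        positions.append(match.start() + 1 if match else 0)
    return positions


print(multi_search_all_positions("Hello, World!", ["hello", "!", "world"]))
# [0, 13, 0]
print(multi_search_all_positions("Hello, World!", ["hello", "!", "world", "Hello", "World"]))
# [0, 13, 0, 1, 8]
```

Reserving 0 for "not found" is why positions start at 1 rather than 0.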

## Examples

```sql
mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']);
+----------------------------------------------------------------------+
| multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) |
+----------------------------------------------------------------------+
| [0,13,0] |
| [0, 13, 0] |
+----------------------------------------------------------------------+

select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']);
mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']);
+---------------------------------------------------------------------------------------------+
| multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) |
+---------------------------------------------------------------------------------------------+
| [0, 13, 0, 1, 8] |
+---------------------------------------------------------------------------------------------+
```

### keywords
MULTI_SEARCH,SEARCH,POSITIONS
@@ -26,42 +26,52 @@ under the License.

## Description

Calculate the N-gram similarity between `text` and `pattern`. The similarity ranges from 0 to 1, where a higher value indicates greater similarity between the two strings.
Calculates the N-gram similarity between two strings.

Both `pattern` and `gram_num` must be constants. If the length of either `text` or `pattern` is less than `gram_num`, return 0.
N-gram similarity is a text similarity measure based on N-grams. It ranges from 0 to 1, where a higher value indicates greater similarity between the two strings.

N-gram similarity is a method for calculating text similarity based on N-grams. An N-gram is a set of continuous N characters or words extracted from a text string. For example, for the string "text" with N=2 (bigram), the bigrams are: {"te", "ex", "xt"}.
An N-gram is a contiguous sequence of N characters or words from a text. For example, for the string 'text', when N=2, its bi-grams are: {"te", "ex", "xt"}.

The N-gram similarity is calculated as:
The N-gram similarity is calculated as:
**2 * |Intersection| / (|haystack set| + |pattern set|)**

2 * |Intersection| / (|text set| + |pattern set|)
Where |haystack set| and |pattern set| are the sizes of the N-gram sets of `haystack` and `pattern`, respectively, and |Intersection| is the size of the intersection of the two sets.

where |text set| and |pattern set| are the N-grams of `text` and `pattern`, and `Intersection` is the intersection of the two sets.
Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical: it only means they share the same set of N-grams.

Note that, by definition, a similarity of 1 does not necessarily mean the two strings are identical.
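
The formula can be checked with a short Python sketch. It is a minimal illustration under stated assumptions, not the Doris implementation: N-grams are taken over characters, duplicates are collapsed into a set, and the helper name `ngrams` is hypothetical. The parameter names follow the syntax in this page.

```python
def ngrams(s: str, n: int) -> set[str]:
    """Return the set of contiguous n-character substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_search(haystack: str, pattern: str, gram_num: int) -> float:
    """2 * |Intersection| / (|haystack set| + |pattern set|)."""
    # Special case from the description: too-short input yields 0.
    if len(haystack) < gram_num or len(pattern) < gram_num:
        return 0.0
    haystack_set = ngrams(haystack, gram_num)
    pattern_set = ngrams(pattern, gram_num)
    shared = haystack_set & pattern_set
    return 2 * len(shared) / (len(haystack_set) + len(pattern_set))


print(ngram_search("123456789", "12345", 3))    # 0.6 -> 2 * 3 / (7 + 3)
print(ngram_search("abababab", "babababa", 2))  # 1.0 -> both sets are {"ab", "ba"}
```

The second call shows why a similarity of 1 does not imply equal strings: `'abababab'` and `'babababa'` differ, yet both reduce to the same two bi-grams.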
## Syntax

```sql
DOUBLE ngram_search(VARCHAR haystack, VARCHAR pattern, INT gram_num)
```

Only supports ASCII encoding.
## Parameters

## Syntax
| Parameter | Description |
| -- | -- |
| `haystack` | The string to be checked, supports only ASCII encoding |
| `pattern` | The string used for similarity comparison, must be a constant, supports only ASCII encoding |
| `gram_num` | The `N` in N-gram, must be a constant |

## Return Value

`DOUBLE ngram_search(VARCHAR text,VARCHAR pattern,INT gram_num)`
Returns the N-gram similarity between `haystack` and `pattern`.
Special case: If the length of `haystack` or `pattern` is less than `gram_num`, returns 0.

## Example
## Examples

```sql
mysql> select ngram_search('123456789' , '12345' , 3);
mysql> SELECT ngram_search('123456789' , '12345' , 3);
+---------------------------------------+
| ngram_search('123456789', '12345', 3) |
+---------------------------------------+
| 0.6 |
+---------------------------------------+

mysql> select ngram_search("abababab","babababa",2);
mysql> SELECT ngram_search('abababab', 'babababa', 2);
+-----------------------------------------+
| ngram_search('abababab', 'babababa', 2) |
+-----------------------------------------+
| 1 |
+-----------------------------------------+
```
## keywords
NGRAM_SEARCH,NGRAM,SEARCH
@@ -1,6 +1,6 @@
---
{
"title": "tokenize",
"title": "TOKENIZE",
"language": "en"
}
---
@@ -23,3 +23,36 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

## Description

Returns the result of text tokenization. Tokenization is the process of splitting text into a set of tokens.

## Syntax

```sql
ARRAY<VARCHAR> tokenize(VARCHAR txt, VARCHAR tokenizer_args)
```

## Parameters

| Parameter | Description |
| -- | -- |
| `txt` | The text to be tokenized |
| `tokenizer_args` | Tokenizer arguments, a Doris PROPERTIES format string. For detailed information, refer to the inverted index documentation. |

## Return Value

Returns the tokenization result of the text `txt` based on the tokenizer arguments `tokenizer_args`.

## Examples

```sql
mysql> SELECT tokenize('I love Doris', '"parser"="english"');
+------------------------------------------------+
| tokenize('I love Doris', '"parser"="english"') |
+------------------------------------------------+
| ["i", "love", "doris"] |
+------------------------------------------------+
1 row in set (0.02 sec)
```
@@ -24,31 +24,46 @@ specific language governing permissions and limitations
under the License.
-->

## multi_match_any
## Description

Returns whether the string matches any of a given set of regular expressions.


## Syntax

`TINYINT multi_match_any(VARCHAR haystack, ARRAY<VARCHAR> patterns)`
```sql
TINYINT multi_match_any(VARCHAR haystack, ARRAY<VARCHAR> patterns)
```


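## Parameters

| Parameter | Description |
| -- | -- |
| `haystack` | The string to be checked |
| `patterns` | Array of regular expressions |


## Return Value

Returns 1 if the string `haystack` matches any of the regular expressions in the `patterns` array, otherwise returns 0.

Checks whether the string `haystack` matches the regular expressions `patterns` in re2 syntax. Returns 0 if none of the regular expressions match, otherwise returns 1.

## Examples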

```
mysql> select multi_match_any('Hello, World!', ['hello', '!', 'world']);
```sql
mysql> SELECT multi_match_any('Hello, World!', ['hello', '!', 'world']);
+-----------------------------------------------------------+
| multi_match_any('Hello, World!', ['hello', '!', 'world']) |
+-----------------------------------------------------------+
| 1 |
+-----------------------------------------------------------+

mysql> select multi_match_any('abc', ['A', 'bcd']);
mysql> SELECT multi_match_any('abc', ['A', 'bcd']);
+--------------------------------------+
| multi_match_any('abc', ['A', 'bcd']) |
+--------------------------------------+
| 0 |
+--------------------------------------+
```
### keywords
MULTI_MATCH,MATCH,ANY

@@ -24,31 +24,45 @@ specific language governing permissions and limitations
under the License.
-->

## multi_search_all_positions
## Description

Returns, for each regular expression in a set, the position of its first occurrence in a string.


## Syntax

`ARRAY<INT> multi_search_all_positions(VARCHAR haystack, ARRAY<VARCHAR> needles)`
```sql
ARRAY<INT> multi_search_all_positions(VARCHAR haystack, ARRAY<VARCHAR> patterns)
```


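## Parameters

| Parameter | Description |
| -- | -- |
| `haystack` | The string to be checked |
| `patterns` | Array of regular expressions |


## Return Value

Returns an `ARRAY` where the `i`-th element is the position of the **first** occurrence of the `i`-th element (a regular expression) of the `patterns` array in the string `haystack`. Positions are counted from 1, and 0 means the element was not found.

Returns an `ARRAY` where the `i`-th element is the position of the **first** occurrence of `needle`, the `i`-th element of `needles`, in the string `haystack`. Positions are counted from 1, and 0 means the element was not found. **Case-sensitive**.

## Examples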

```
mysql> select multi_search_all_positions('Hello, World!', ['hello', '!', 'world']);
```sql
mysql> SELECT multi_search_all_positions('Hello, World!', ['hello', '!', 'world']);
+----------------------------------------------------------------------+
| multi_search_all_positions('Hello, World!', ['hello', '!', 'world']) |
+----------------------------------------------------------------------+
| [0,13,0] |
| [0, 13, 0] |
+----------------------------------------------------------------------+

select multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']);
mysql> SELECT multi_search_all_positions("Hello, World!", ['hello', '!', 'world', 'Hello', 'World']);
+---------------------------------------------------------------------------------------------+
| multi_search_all_positions('Hello, World!', ARRAY('hello', '!', 'world', 'Hello', 'World')) |
+---------------------------------------------------------------------------------------------+
| [0, 13, 0, 1, 8] |
+---------------------------------------------------------------------------------------------+
```

### keywords
MULTI_SEARCH,SEARCH,POSITIONS