Skip to content

Commit

Permalink
Fix YQL knn udf docs (#8425)
Browse files Browse the repository at this point in the history
  • Loading branch information
MBkkt authored Sep 7, 2024
1 parent de9b991 commit 16a986d
Show file tree
Hide file tree
Showing 4 changed files with 167 additions and 17 deletions.
Original file line number Diff line number Diff line change
@@ -1,14 +1,14 @@
* [DateTime](../../datetime.md)
* [Digest](../../digest.md)
* [Histogram](../../histogram.md)
* [Hyperscan](../../hyperscan.md)
* [Ip](../../ip.md)
* [Knn](../../knn.md)
* [Math](../../math.md)
* [Pcre](../../pcre.md)
* [Pire](../../pire.md)
* [Re2](../../re2.md)
* [String](../../string.md)
* [Unicode](../../unicode.md)
* [DateTime](../../datetime.md)
* [Url](../../url.md)
* [Ip](../../ip.md)
* [Knn](../../knn.md)
* [Yson](../../yson.md)
* [Digest](../../digest.md)
* [Math](../../math.md)
* [Histogram](../../histogram.md)
* [Yson](../../yson.md)
79 changes: 77 additions & 2 deletions ydb/docs/en/core/yql/reference/yql-core/udf/list/knn.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ Approximate methods do not perform a complete search of the source data. Due to
This document provides an [example of approximate search](#approximate-search-examples) using scalar quantization. This example does not require the creation of a secondary vector index.

**Scalar quantization** is a method to compress vectors by mapping coordinates to a smaller space.
{{ ydb-short-name }} support exact search for `Float`, `Int8`, `Uint8`, `Bit` vectors.
This module supports exact search for `Float`, `Int8`, `Uint8`, `Bit` vectors.
So, it's possible to apply scalar quantization from `Float` to one of these other types.

Scalar quantization decreases read/write times by reducing vector size in bytes. For example, after quantization from `Float` to `Bit,` each vector becomes 32 times smaller.
Expand All @@ -45,7 +45,7 @@ It is recommended to measure if such quantization provides sufficient accuracy/r
## Data types

In mathematics, a vector of real or integer numbers is used to store points.
In {{ ydb-short-name }}, vectors are stored in the `String` data type, which is a binary serialized representation of a vector.
In this module, vectors are stored in the `String` data type, which is a binary serialized representation of a vector.

## Functions

Expand All @@ -57,7 +57,9 @@ Conversion functions are needed to serialize vectors into an internal binary rep

All serialization functions wrap returned `String` data into [Tagged](../../types/special.md) types.

{% if backend_name == "YDB" %}
The binary representation of the vector can be stored in the {{ ydb-short-name }} table column. Currently {{ ydb-short-name }} does not support storing `Tagged`, so before storing binary representation vectors you must call [Untag](../../builtins/basic#as-tagged).
{% endif %}

#### Function signatures

Expand Down Expand Up @@ -123,6 +125,7 @@ Error: Failed to find UDF function: Knn.CosineDistance, reason: Error: Module: K

## Еxact search examples

{% if backend_name == "YDB" %}
### Creating a table

```sql
Expand All @@ -142,9 +145,25 @@ $vector = [1.f, 2.f, 3.f, 4.f];
UPSERT INTO Facts (id, user, fact, embedding)
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"));
```
{% else %}
### Data declaration

```sql
$vector = [1.f, 2.f, 3.f, 4.f];
$facts = AsList(
AsStruct(
123 AS id, -- Id of fact
"Williams" AS user, -- User name
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
),
);
```
{% endif %}

### Exact search of K nearest vectors

{% if backend_name == "YDB" %}
```sql
$K = 10;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
Expand All @@ -154,23 +173,45 @@ WHERE user="Williams"
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;
```
{% else %}
```sql
$K = 10;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM AS_TABLE($facts)
WHERE user="Williams"
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;
```
{% endif %}

### Exact search of vectors in radius R

{% if backend_name == "YDB" %}
```sql
$R = 0.1f;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM Facts
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
```
{% else %}
```sql
$R = 0.1f;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM AS_TABLE($facts)
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
```
{% endif %}

## Approximate search examples

This example differs from the [exact search example](#еxact-search-examples) by using bit quantization.

This allows to first do a approximate preliminary search by the `embedding_bit` column, and then refine the results by the original vector column `embegging`.

{% if backend_name == "YDB" %}
### Creating a table

```sql
Expand All @@ -191,6 +232,22 @@ $vector = [1.f, 2.f, 3.f, 4.f];
UPSERT INTO Facts (id, user, fact, embedding, embedding_bit)
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"), Untag(Knn::ToBinaryStringBit($vector), "BitVector"));
```
{% else %}
### Data declaration

```sql
$vector = [1.f, 2.f, 3.f, 4.f];
$facts = AsList(
AsStruct(
123 AS id, -- Id of fact
"Williams" AS user, -- User name
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
Knn::ToBinaryStringBit($vector) AS embedding_bit, -- Binary representation of embedding vector
),
);
```
{% endif %}

### Scalar quantization

Expand Down Expand Up @@ -219,6 +276,7 @@ Approximate search algorithm:
* an approximate list of vectors is obtained;
* we search this list without using quantization.

{% if backend_name == "YDB" %}
```sql
$K = 10;
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
Expand All @@ -234,3 +292,20 @@ WHERE id IN $Ids
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
LIMIT $K;
```
{% else %}
```sql
$K = 10;
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
$TargetEmbeddingBit = Knn::ToBinaryStringBit($Target);
$TargetEmbeddingFloat = Knn::ToBinaryStringFloat($Target);

$Ids = SELECT id FROM AS_TABLE($facts)
ORDER BY Knn::CosineDistance(embedding_bit, $TargetEmbeddingBit)
LIMIT $K * 10;

SELECT * FROM AS_TABLE($facts)
WHERE id IN $Ids
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
LIMIT $K;
```
{% endif %}
12 changes: 6 additions & 6 deletions ydb/docs/en/core/yql/reference/yql-core/udf/list/toc_base.yaml
Original file line number Diff line number Diff line change
@@ -1,17 +1,17 @@
items:
- name: Overview
href: index.md
- { name: DateTime, href: datetime.md }
- { name: Digest, href: digest.md }
- { name: Histogram, href: histogram.md }
- { name: Hyperscan, href: hyperscan.md }
- { name: Ip, href: ip.md }
- { name: Knn, href: knn.md }
- { name: Math, href: math.md }
- { name: Pcre, href: pcre.md }
- { name: Pire, href: pire.md }
- { name: Re2, href: re2.md }
- { name: String, href: string.md }
- { name: Unicode, href: unicode.md }
- { name: DateTime, href: datetime.md }
- { name: Url, href: url.md }
- { name: Ip, href: ip.md }
- { name: Knn, href: knn.md }
- { name: Yson, href: yson.md }
- { name: Digest, href: digest.md }
- { name: Math, href: math.md }
- { name: Histogram, href: histogram.md }
79 changes: 77 additions & 2 deletions ydb/docs/ru/core/yql/reference/yql-core/udf/list/knn.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ LIMIT 10;
В данном документе приведен [пример приближенного поиска](#примеры-приближенного-поиска) с помощью скалярного квантования, не требущий построения вторичного векторного индекса.

**Скалярное квантование** это метод сжатия векторов, когда множество координат отображаются в множество меньшей размерности.
{{ ydb-short-name }} поддерживает точный поиск по `Float`, `Int8`, `Uint8`, `Bit` векторам.
Этот модуль поддерживает точный поиск по `Float`, `Int8`, `Uint8`, `Bit` векторам.
Соответственно, возможно скалярное квантование из `Float` в один из этих типов.

Скалярное квантование уменьшает время необходимое для чтения/записи, поскольку число байт сокращается в разы.
Expand All @@ -46,7 +46,7 @@ LIMIT 10;
## Типы данных

В математике для хранения точек используется вектор вещественных или целых чисел.
В {{ ydb-short-name }} вектора хранятся в строковом типе данных `String`, который является бинарным сериализованным представлением вектора.
В этом модуле вектора представлены типом данных `String`, который является бинарным сериализованным представлением вектора.

## Функции

Expand All @@ -58,8 +58,10 @@ LIMIT 10;

Все функции сериализации упаковывают возвращаемые данные типа `String` в [Tagged](../../types/special.md) тип.

{% if backend_name == "YDB" %}
Бинарное представление вектора можно сохранить в {{ ydb-short-name }} колонку.
В настоящий момент {{ ydb-short-name }} не поддерживает хранение `Tagged` типов и поэтому перед сохранением бинарного представления векторов нужно извлечь `String` с помощью функции [Untag](../../builtins/basic#as-tagged).
{% endif %}

#### Сигнатуры функций

Expand Down Expand Up @@ -125,6 +127,7 @@ Error: Failed to find UDF function: Knn.CosineDistance, reason: Error: Module: K

## Примеры точного поиска

{% if backend_name == "YDB" %}
### Создание таблицы

```sql
Expand All @@ -144,9 +147,25 @@ $vector = [1.f, 2.f, 3.f, 4.f];
UPSERT INTO Facts (id, user, fact, embedding)
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"));
```
{% else %}
### Декларация данных

```sql
$vector = [1.f, 2.f, 3.f, 4.f];
$facts = AsList(
AsStruct(
123 AS id, -- Id of fact
"Williams" AS user, -- User name
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
),
);
```
{% endif %}

### Точный поиск K ближайших векторов

{% if backend_name == "YDB" %}
```sql
$K = 10;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
Expand All @@ -156,22 +175,44 @@ WHERE user="Williams"
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;
```
{% else %}
```sql
$K = 10;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM AS_TABLE($facts)
WHERE user="Williams"
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
LIMIT $K;
```
{% endif %}

### Точный поиск векторов, находящихся в радиусе R

{% if backend_name == "YDB" %}
```sql
$R = 0.1f;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM Facts
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
```
{% else %}
```sql
$R = 0.1f;
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);

SELECT * FROM AS_TABLE($facts)
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
```
{% endif %}

## Примеры приближенного поиска

Данный пример отличается от [примера с точным поиском](#примеры-точного-поиска) использованием битового квантования.
Это позволяет сначала делать грубый предварительный поиск по колонке `embedding_bit`, а затем уточнять результаты по основной колонке с векторами `embedding`.

{% if backend_name == "YDB" %}
### Создание таблицы

```sql
Expand All @@ -192,6 +233,22 @@ $vector = [1.f, 2.f, 3.f, 4.f];
UPSERT INTO Facts (id, user, fact, embedding, embedding_bit)
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"), Untag(Knn::ToBinaryStringBit($vector), "BitVector"));
```
{% else %}
### Декларация данных

```sql
$vector = [1.f, 2.f, 3.f, 4.f];
$facts = AsList(
AsStruct(
123 AS id, -- Id of fact
"Williams" AS user, -- User name
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
Knn::ToBinaryStringBit($vector) AS embedding_bit, -- Binary representation of embedding vector
),
);
```
{% endif %}

### Скалярное квантование

Expand Down Expand Up @@ -220,6 +277,7 @@ SELECT ListMap($FloatList, $MapInt8);
* получается приближенный список векторов;
* в этом списке производим поиск без использования квантования.

{% if backend_name == "YDB" %}
```sql
$K = 10;
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
Expand All @@ -235,3 +293,20 @@ WHERE id IN $Ids
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
LIMIT $K;
```
{% else %}
```sql
$K = 10;
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
$TargetEmbeddingBit = Knn::ToBinaryStringBit($Target);
$TargetEmbeddingFloat = Knn::ToBinaryStringFloat($Target);

$Ids = SELECT id FROM AS_TABLE($facts)
ORDER BY Knn::CosineDistance(embedding_bit, $TargetEmbeddingBit)
LIMIT $K * 10;

SELECT * FROM AS_TABLE($facts)
WHERE id IN $Ids
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
LIMIT $K;
```
{% endif %}

0 comments on commit 16a986d

Please sign in to comment.