Update README.md
louisbrulenaudet committed Aug 11, 2024
1 parent 6908093 commit 1d2425f
Showing 1 changed file with 88 additions and 0 deletions.
88 changes: 88 additions & 0 deletions README.md
@@ -75,6 +75,94 @@ embedding_result = loader.batch_encode(text)
print(embedding_result)
```

### Similarity search and index creation

The `SimilaritySearch` class is instantiated with parameters that configure the embedding model and the search infrastructure. The chosen model, `louisbrulenaudet/tsdae-lemone-mbert-base`, is likely a multilingual BERT model fine-tuned with TSDAE (Transformer-based Sequential Denoising Auto-Encoder) on a custom dataset. This choice suggests a focus on multilingual capabilities and improved semantic representations.

The `cuda` device setting enables GPU acceleration, which matters when encoding large datasets. The embedding dimension of `768` is the hidden size of BERT-base models and balances expressiveness against computational cost. The `ip` (inner product) metric is used for similarity comparisons; on L2-normalized vectors it yields the same scores as cosine similarity while skipping the norm computation for every comparison. The `i8` dtype selects 8-bit integer quantization, which stores each dimension in one byte instead of the four bytes a 32-bit float needs and speeds up similarity search at the cost of a small accuracy trade-off.
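
The relationship between the inner product and cosine similarity, and the storage effect of 8-bit quantization, can be checked with a small standard-library sketch. The toy vectors, the helper names (`normalize`, `dot`, `cosine`, `quantize_int8`), and the symmetric scaling scheme are illustrative assumptions, not RAGoon's implementation.

```python
import math
from array import array


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: the inner product divided by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def quantize_int8(vector):
    """Map floats to integers in [-127, 127] (hypothetical symmetric scheme)."""
    scale = 127 / max(abs(x) for x in vector)
    return array("b", (round(x * scale) for x in vector))


a = normalize([0.2, -1.3, 0.7, 2.1])
b = normalize([1.0, 0.4, -0.5, 1.8])

# On unit-length vectors the inner product and cosine similarity coincide,
# so the "ip" metric avoids recomputing the norms for every comparison.
ip_score = dot(a, b)
cos_score = cosine(a, b)

float_vector = array("f", a)    # 32-bit floats
int8_vector = quantize_int8(a)  # 8-bit signed integers

print(f"inner product: {ip_score:.6f}, cosine: {cos_score:.6f}")
print(f"bytes per dimension: float32={float_vector.itemsize}, int8={int8_vector.itemsize}")
```

At 768 dimensions, that difference amounts to 3072 bytes per vector in 32-bit floats against 768 bytes in 8-bit integers.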

```python
import polars as pl
from ragoon import (
    dataset_loader,
    SimilaritySearch,
    EmbeddingsVisualizer
)

# Load the dataset and keep a local copy on disk.
dataset = dataset_loader(
    name="louisbrulenaudet/dac6-instruct",
    streaming=False,
    split="train"
)

dataset.save_to_disk("dataset.hf")

instance = SimilaritySearch(
    model_name="louisbrulenaudet/tsdae-lemone-mbert-base",
    device="cuda",
    ndim=768,
    metric="ip",
    dtype="i8"
)

# Encode the corpus, then derive binary and int8 versions of the embeddings.
embeddings = instance.encode(corpus=dataset["output"])

ubinary_embeddings = instance.quantize_embeddings(
    embeddings=embeddings,
    quantization_type="ubinary"
)

int8_embeddings = instance.quantize_embeddings(
    embeddings=embeddings,
    quantization_type="int8"
)

# Build and save one index per quantized representation.
instance.create_usearch_index(
    int8_embeddings=int8_embeddings,
    index_path="./usearch_int8.index",
    save=True
)

instance.create_faiss_index(
    ubinary_embeddings=ubinary_embeddings,
    index_path="./faiss_ubinary.index",
    save=True
)

# Retrieve the 10 best matches for a French query ("Define the role of a
# designing intermediary under article 1649 AE of the French General Tax Code").
top_k_scores, top_k_indices = instance.search(
    query="Définir le rôle d'un intermédiaire concepteur conformément à l'article 1649 AE du Code général des Impôts.",
    top_k=10,
    rescore_multiplier=4
)

# Add a row index column named "index". Older Polars releases lack
# `with_row_index` and provide `with_row_count` instead.
try:
    dataframe = pl.from_arrow(dataset.data.table).with_row_index()

except AttributeError:
    dataframe = pl.from_arrow(dataset.data.table).with_row_count(
        name="index"
    )

# Cast the indices to UInt32 so they match the row index column in the join.
scores_df = pl.DataFrame(
    {
        "index": top_k_indices,
        "score": top_k_scores
    }
).with_columns(
    pl.col("index").cast(pl.UInt32)
)

# Keep only the matched rows and attach their scores.
search_results = dataframe.filter(
    pl.col("index").is_in(top_k_indices)
).join(
    scores_df,
    how="inner",
    on="index"
)

print(search_results)
```
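
The `rescore_multiplier` argument suggests a two-stage search: a fast pass over the binary index retrieves `top_k * rescore_multiplier` candidates, which are then re-ranked with a more precise representation. The sketch below illustrates that general idea in plain Python on toy data. It is an assumption about the strategy rather than RAGoon's actual code, and `binarize`, `hamming`, `dot`, and `two_stage_search` are hypothetical helpers.

```python
def binarize(vector):
    """Keep only the sign of each dimension (1 if positive, else 0)."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, corpus, top_k, rescore_multiplier):
    """Coarse Hamming search, then re-rank the candidates by inner product."""
    query_bits = binarize(query)
    corpus_bits = [binarize(v) for v in corpus]

    # Stage 1: oversample candidates using the cheap binary representation.
    n_candidates = min(len(corpus), top_k * rescore_multiplier)
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: hamming(query_bits, corpus_bits[i]),
    )[:n_candidates]

    # Stage 2: re-rank only those candidates with the more precise vectors.
    rescored = sorted(
        ((dot(query, corpus[i]), i) for i in candidates),
        reverse=True,
    )[:top_k]
    scores = [score for score, _ in rescored]
    indices = [i for _, i in rescored]
    return scores, indices


corpus = [
    [0.9, 0.1, -0.2, 0.4],
    [-0.5, 0.8, 0.3, -0.1],
    [0.7, 0.2, -0.1, 0.6],
    [-0.3, -0.9, 0.5, 0.2],
    [0.8, -0.4, -0.6, 0.3],
]
query = [1.0, 0.1, -0.3, 0.5]

top_k_scores, top_k_indices = two_stage_search(
    query, corpus, top_k=2, rescore_multiplier=2
)
print(top_k_indices, top_k_scores)
```

In this toy corpus, row 2 ranks ahead of row 4 in the binary pass but falls behind it after rescoring, which is the kind of ordering error the oversampling step is meant to correct.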

### Embeddings visualization

This class provides functionality to load embeddings from a FAISS index, reduce their dimensionality using PCA and/or t-SNE, and visualize them in an interactive 3D plot.
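
As a rough illustration of the dimensionality reduction step, the sketch below projects 768-dimensional vectors onto their first three principal components with NumPy. The random stand-in data and the `pca_reduce` helper are assumptions made for the example; this is not the implementation used by `EmbeddingsVisualizer`, and it leaves out t-SNE and plotting.

```python
import numpy as np


def pca_reduce(embeddings, n_components=3):
    """Project embeddings onto their top principal components using SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(200, 768))  # stand-in for real embeddings

points_3d = pca_reduce(embeddings, n_components=3)
print(points_3d.shape)
```

Each row of `points_3d` can then be used as the x, y, and z coordinates of one embedding in a 3D scatter plot.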