Update README.md

louisbrulenaudet · Aug 11, 2024 · 1d2425f
1 parent 6908093
commit 1d2425f
Showing 1 changed file with 88 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -75,6 +75,94 @@ embedding_result = loader.batch_encode(text)
 print(embedding_result)
 ```
 
+### Similarity search and index creation
+
+The `SimilaritySearch` class is instantiated with parameters that configure the embedding model and the search infrastructure. The chosen model, `louisbrulenaudet/tsdae-lemone-mbert-base`, is likely a multilingual BERT model fine-tuned with TSDAE (Transformer-based Sequential Denoising Auto-Encoder) on a custom dataset. This choice suggests a focus on multilingual capabilities and improved semantic representations.
+
+The `cuda` device specification enables GPU acceleration, which matters for efficient processing of large datasets. The embedding dimension of `768` matches the hidden size of BERT-base models. The `ip` (inner product) metric is selected for similarity comparisons: when vectors are normalized to unit length, it gives the same result as cosine similarity while skipping the norm computation. The `i8` dtype indicates 8-bit integer quantization, a technique that significantly reduces memory usage and speeds up similarity search at the cost of a small loss in accuracy.
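+
+To see why the inner product can stand in for cosine similarity, recall the definition for two embeddings $u$ and $v$ (generic symbols used here for illustration, not names from the library):
+
+$$
+\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
+$$
+
+When both vectors are normalized, $\lVert u \rVert = \lVert v \rVert = 1$, so $\cos(u, v) = u \cdot v$ and the division by the norms disappears. This identity is exact for normalized float vectors; quantized vectors only approximate it. On the storage side, assuming the unquantized embeddings are 32-bit floats and that `ubinary` packs one bit per dimension, a 768-dimensional vector takes $768 \times 4 = 3072$ bytes as floats, $768 \times 1 = 768$ bytes as `int8`, and $768 / 8 = 96$ bytes as `ubinary`.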
+
+```python
+import polars as pl
+from ragoon import (
+    dataset_loader,
+    SimilaritySearch,
+    EmbeddingsVisualizer
+)
+
+dataset = dataset_loader(
+    name="louisbrulenaudet/dac6-instruct",
+    streaming=False,
+    split="train"
+)
+
+dataset.save_to_disk("dataset.hf")
+
+instance = SimilaritySearch(
+    model_name="louisbrulenaudet/tsdae-lemone-mbert-base",
+    device="cuda",
+    ndim=768,
+    metric="ip",
+    dtype="i8"
+)
+
+embeddings = instance.encode(corpus=dataset["output"])
+
+ubinary_embeddings = instance.quantize_embeddings(
+    embeddings=embeddings,
+    quantization_type="ubinary"
+)
+
+int8_embeddings = instance.quantize_embeddings(
+    embeddings=embeddings,
+    quantization_type="int8"
+)
+
+instance.create_usearch_index(
+    int8_embeddings=int8_embeddings,
+    index_path="./usearch_int8.index",
+    save=True
+)
+
+instance.create_faiss_index(
+    ubinary_embeddings=ubinary_embeddings,
+    index_path="./faiss_ubinary.index",
+    save=True
+)
+
+top_k_scores, top_k_indices = instance.search(
+    query="Définir le rôle d'un intermédiaire concepteur conformément à l'article 1649 AE du Code général des Impôts.",
+    top_k=10,
+    rescore_multiplier=4
+)
+
+try:
+    dataframe = pl.from_arrow(dataset.data.table).with_row_index()
+
+except AttributeError:  # older Polars releases lack `with_row_index`
+    dataframe = pl.from_arrow(dataset.data.table).with_row_count(
+        name="index"
+    )
+
+scores_df = pl.DataFrame(
+    {
+        "index": top_k_indices,
+        "score": top_k_scores
+    }
+).with_columns(
+    pl.col("index").cast(pl.UInt32)
+)
+
+search_results = dataframe.filter(
+    pl.col("index").is_in(top_k_indices)
+).join(
+    scores_df,
+    how="inner",
+    on="index"
+)
+
+print(search_results)
+```
+
 ### Embeddings visualization
 
 This class provides functionality to load embeddings from a FAISS index, reduce their dimensionality using PCA and/or t-SNE, and visualize them in an interactive 3D plot.