Add basic implementation of `VectorSearch` `Step` and `KnowledgeBases` #1006

davidberenstein1957 · 2024-09-29T08:30:47Z

I went with LanceDB because it works in memory.
@frascuchon as a follow-up, we can consider adding Argilla based on your vector search PR :)

Run a vector search using a `KnowledgeBase` and an integrated `Embeddings` model:

from distilabel.embeddings import SentenceTransformerEmbeddings
from distilabel.knowledge_bases.lancedb import LanceDB
from distilabel.steps.knowledge_bases.vector_search import VectorSearch

embedding = SentenceTransformerEmbeddings(
    model="mixedbread-ai/mxbai-embed-large-v1",
)

knowledge_base = LanceDB(
    uri="data/sample-lancedb",
    table_name="my_table",
)

vector_search = VectorSearch(
    knowledge_base=knowledge_base,
    embeddings=embedding,
    n_retrieved_documents=5
)

vector_search.load()
result = next(vector_search.process([{"text": "Hello, how are you?"}]))
# [{
#   'text': 'Hello, how are you?',
#   'embedding': [0.06209656596183777, -0.015797119587659836, ...],
#   'knowledge_base_col_1': [10.0],
#   'knowledge_base_col_2': ['foo']
# }]
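
For readers unfamiliar with what the retrieval step does conceptually, here is a minimal pure-Python sketch of a top-k similarity search. It is not the code in this PR and does not use LanceDB: `cosine_similarity`, `top_k_search`, the `vector` key and the toy table are hypothetical names chosen for illustration. It only mirrors the output shape shown above, where each retrieved knowledge base column comes back as a list.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_search(query, rows, k, vector_key="vector"):
    """Return the k rows most similar to `query`, merged column-wise into lists."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query, row[vector_key]),
        reverse=True,
    )[:k]
    # Every non-vector column of the retrieved rows becomes a list of values.
    columns = [key for key in (rows[0] if rows else {}) if key != vector_key]
    return {col: [row[col] for row in ranked] for col in columns}


# Hypothetical toy table standing in for a knowledge base.
table = [
    {"vector": [1.0, 0.0], "knowledge_base_col_1": 10.0, "knowledge_base_col_2": "foo"},
    {"vector": [0.0, 1.0], "knowledge_base_col_1": 20.0, "knowledge_base_col_2": "bar"},
    {"vector": [0.7, 0.7], "knowledge_base_col_1": 30.0, "knowledge_base_col_2": "baz"},
]

retrieved = top_k_search(query=[0.9, 0.1], rows=table, k=2)
print(retrieved)
# {'knowledge_base_col_1': [10.0, 30.0], 'knowledge_base_col_2': ['foo', 'baz']}
```

A real vector database would typically use an index rather than this linear scan; the sketch only shows the ranking and the column-wise result layout.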

Run a vector search using a `KnowledgeBase` and a pre-computed query column (`embedding`):

from distilabel.knowledge_bases.lancedb import LanceDB
from distilabel.steps.knowledge_bases.vector_search import VectorSearch

knowledge_base = LanceDB(
    uri="data/sample-lancedb",
    table_name="my_table",
)

vector_search = VectorSearch(
    knowledge_base=knowledge_base,
    n_retrieved_documents=5
)

vector_search.load()
result = next(vector_search.process([{'embedding': [0.06209656596183777, -0.015797119587659836, ...]}]))
# [{'embedding': [0.06209656596183777, -0.015797119587659836, ...], "knowledge_base_col_1": [10.0], "knowledge_base_col_2": ["foo"]}]
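
The two examples above differ only in where the query vector comes from. The sketch below illustrates that choice under an assumption drawn from the examples, not from the PR's code: when an embeddings model is supplied the `text` column is embedded, otherwise an existing `embedding` column is used. `resolve_query_vector` and `fake_embed` are hypothetical helpers, not part of distilabel.

```python
from typing import Callable, Dict, List, Optional


def resolve_query_vector(
    row: Dict[str, object],
    embed: Optional[Callable[[str], List[float]]] = None,
) -> List[float]:
    """Pick the query vector for one input row (illustrative assumption only)."""
    if embed is not None:
        # An embeddings model is available: embed the text column.
        return [float(v) for v in embed(str(row["text"]))]
    if "embedding" not in row:
        raise KeyError("row needs an 'embedding' column when no model is given")
    # No model: fall back to the pre-computed embedding column.
    return [float(v) for v in row["embedding"]]


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embeddings model: text length and space count."""
    return [float(len(text)), float(text.count(" "))]


with_model = resolve_query_vector({"text": "Hello, how are you?"}, embed=fake_embed)
precomputed = resolve_query_vector({"embedding": [0.5, -0.25]})
print(with_model)   # [19.0, 3.0]
print(precomputed)  # [0.5, -0.25]
```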

Or with Argilla as the knowledge base:

import os

from distilabel.knowledge_bases.argilla import ArgillaKnowledgeBase
from distilabel.steps.knowledge_bases.vector_search import VectorSearch

knowledge_base = ArgillaKnowledgeBase(
    dataset_name="ag_news_with_suggestions",
    dataset_workspace="argilla",
    vector_field="mini-lm-sentence-transformers",
    api_url=os.environ["ARGILLA_API_URL_DEV"],
    api_key=os.environ["ARGILLA_API_KEY_DEV"],
)

vector_search = VectorSearch(knowledge_base=knowledge_base, n_retrieved_documents=5)

vector_search.load()
result = next(
    vector_search.process([{"text": "Hello, how are you?", "embedding": [1] * 384}])
)
print(result)
# [{
#   'text': ["Italy's Pennetta Wins Idea Prokom Open (AP) AP - Italy's Flavia Pennetta won the Idea Prokom Open for her first WTA Tour title, beating Klara Koukalova of the Czech Republic 7-5, 3-6, 6-3 Saturday after French Open champion Anastasia Myskina withdrew before the semifinals because of a rib injury."],
#   'embedding': [1, 1, 1, ...],
#   'id': ['65a57d53-1d1d-4acb-9f12-456d6989e905'],
#   'status': ['completed'],
#   '_server_id': ['b0a29605-21ed-48d1-b2fe-163243947c2d'],
#   'split': ['unlabelled'],
#   'class.responses': [['Sci/Tech']],
#   'class.responses.users': [['3b1a58ff-6213-4365-880b-17532d13978c']],
#   'class.responses.status': [['submitted']],
#   'class.suggestion': ['Sports'],
#   'class.suggestion.score': [0.3421393299797776],
#   'class.suggestion.agent': ['setfit']
# }]

github-actions · 2024-09-29T08:41:30Z

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1006/

tests: Add tests lancedb

Add basic implementation of VectorSearch

8167fdc

davidberenstein1957 changed the title ~~Add basic implementation of VectorSearch~~ Add basic implementation of VectorSearch Step KnowledgeBases Sep 29, 2024

davidberenstein1957 requested review from gabrielmbmb and plaguss September 29, 2024 08:37

fix: revert unrequired changes

cda8db2

davidberenstein1957 changed the title ~~Add basic implementation of VectorSearch Step KnowledgeBases~~ Add basic implementation of VectorSearch Step and KnowledgeBases Sep 29, 2024

plaguss mentioned this pull request Oct 1, 2024

[EXAMPLE] Add CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation example #953

Open

davidberenstein1957 added 4 commits October 3, 2024 11:29

feat: shared argilla base

cd8d1ee

feat: Add support for ArgillaKnowledgeBase

1d864d8

tests: Add tests argilla

2210975

tests: Add tests lancedb

tests: Add tests for vector_search step

ac31d38

davidberenstein1957 marked this pull request as ready for review October 5, 2024 12:22

davidberenstein1957 added 2 commits October 7, 2024 13:05

chore: remove tool.pdm section

6798abf

Merge branch 'develop' into feat/knowledge-base

5aa0456

davidberenstein1957 added this to the 1.5.0 milestone Oct 14, 2024
