MostSimilarDocumentsPipeline for Retrieving Similar Sentences #3299

sankalp-acl · 2022-09-29T22:50:27Z


Use Case: retrieve the top_k most similar sentences for a given query sentence

I'm thinking of using ElasticsearchDocumentStore to store all sentences and their embeddings using EmbeddingRetriever.
Can I use the ready-made MostSimilarDocumentsPipeline to retrieve the top_k most similar sentences from the document store?

For my use case, I may have entirely new query sentences that are not part of the document store. And my understanding is that MostSimilarDocumentsPipeline expects a list of document IDs already in the document store. So how do I work around sentences that are not in the document store? Updating the document store with query sentences is not an option.

Any help is appreciated, thanks!

Answered by mwade-noetic

mwade-noetic · 2022-09-30T00:08:01Z


Hi,

The MostSimilarDocumentsPipeline just uses the already calculated embeddings of a document in the store to find documents similar to that document vector. You can achieve the same thing using the DocumentSearchPipeline and the EmbeddingRetriever.

You simply pass in the text of the sentence for which you want to find similar documents. The retriever creates the embedding for that sentence and then runs the same query_by_embedding method that the MostSimilarDocumentsPipeline uses.

Here is a simple outline of the code:

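from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline
# text_of_sentence is the query sentence; it does not need to be in the document store
document_store = ElasticsearchDocumentStore(similarity='cosine')
retriever = EmbeddingRetriever(document_store=document_store, embedding_model='sentence-transformers/all-mpnet-base-v2')
search_pipeline = DocumentSearchPipeline(retriever=retriever)
similar_docs = search_pipeline.run(query=text_of_sentence, params={"Retriever": {"top_k": 10}})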

Note, you may need to use a PreProcessor() to clean/split the text unless it is fairly short in length. I just kept this as a simple example.
This gets you good similarity estimates without having to index the query sentences.
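
To make the mechanism concrete, here is a minimal, library-independent sketch of what an embedding query does conceptually: score every stored vector against the query vector by cosine similarity and keep the best top_k. It is not Haystack's implementation. The helper names (cosine_similarity, top_k_by_cosine), the IDs, and the 3-dimensional toy vectors are all made up for illustration; in the real pipeline the EmbeddingRetriever produces the query vector and the document store does the ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_embedding, stored_embeddings, top_k=10):
    """Return (doc_id, score) pairs for the top_k stored vectors, best first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in stored_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy "embeddings"; a real model produces hundreds of dimensions.
stored_embeddings = {
    "s1": [1.0, 0.0, 0.0],
    "s2": [0.8, 0.6, 0.0],
    "s3": [0.0, 0.0, 1.0],
}

# Embedding of a query sentence that is NOT in the store. It is only compared
# against the stored vectors and is never added to them.
query_embedding = [0.9, 0.1, 0.0]

results = top_k_by_cosine(query_embedding, stored_embeddings, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The point of the sketch is that the query only needs an embedding, not an entry in the store, which is why new query sentences work without updating the index.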

2 replies

sankalp-acl Sep 30, 2022
Author

@mwade-noetic thanks a lot!

Using the code you mentioned, similar_docs has all the similar sentences from the document store along with their cosine similarity scores. That's great!

A couple of follow-up questions:

Is there a haystack util module to simplify the post-processing of retrieved sentences? I'm using the following code but it does not have a way to print the similarity scores.

from haystack.utils import print_documents
print_documents(similar_docs)

I also want to evaluate the retrieved sentences using the mAP and Recall metrics mentioned here. What's the easiest way to do it? Could you please point me to an example? Thanks!

mwade-noetic Sep 30, 2022

The result returned by run() for a single query is a dictionary, and the retrieved documents are in a list under its "documents" key (if you send multiple queries at the same time, you get a list of documents for each query). So if you are sending just ONE sentence as the query, you can get the list of documents as shown in the first line of the code.

The following code is a simple way to iterate over the results and display the similarity score and the text of each document.


# run() returns a dict; the retrieved Document objects are under "documents"
list_of_similar_docs = similar_docs["documents"]

# Iterate over the list of similar docs:
for similar_doc in list_of_similar_docs:
    # Print the similarity score for each document
    print(f'Score = {similar_doc.score}')
    # Print the text of the document.
    print(f'Text: {similar_doc.content}')

The package defines a "Document" class, and each item in the returned list of documents is an instance of it. Hopefully that works for you. FYI, I just copied some code that I have and modified it a bit without running it, but it should be close if not exact.
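
On the mAP and Recall question: as a rough, library-independent sketch (not Haystack's own evaluation tooling), both metrics can be computed from the IDs of the retrieved documents and a set of IDs you have labeled as relevant for each query. The function names and the example IDs here are hypothetical. In practice the retrieved IDs would come from the id attribute of each returned Document, in ranked order. Average precision is normalized here by the total number of relevant items, which is one common convention; other tools may define it slightly differently.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k retrieved items."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def average_precision(retrieved_ids, relevant_ids):
    """Sum of precision at each rank where a relevant item appears,
    divided by the total number of relevant items.
    Assumes retrieved_ids contains no duplicates."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)


# Hypothetical results for two queries: ranked retrieved IDs and labeled relevant IDs.
runs = [
    (["s1", "s4", "s2"], ["s1", "s2"]),
    (["s7", "s5", "s9"], ["s5"]),
]

map_score = mean_average_precision(runs)
recall_scores = [recall_at_k(retrieved, relevant, k=2) for retrieved, relevant in runs]
print(f"mAP = {map_score:.3f}")
print(f"Recall@2 per query = {recall_scores}")
```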
