Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solr-based similarity search #1753

Merged
merged 23 commits into from
Feb 9, 2024
Merged

Solr-based similarity search #1753

merged 23 commits into from
Feb 9, 2024

Conversation

ffont
Copy link
Member

@ffont ffont commented Jan 24, 2024

Issue(s)
#1714

Description
This PR implements Solr-based similarity search for the web interface (not for the API!). It does not remove any of the previous functionality, and old/new similarity can be toggled with the setting USE_SEARCH_ENGINE_SIMILARITY.

This PR is to investigate different ways in which we can use Solr for similarity search. I have successfully implemented it, but there are some decisions to be made which have important implications (EDIT: I think after #12531aa it is clear how to proceed). To implement what is in this PR, I had to gain a good understanding of how Solr documents work, so it is possible that the explanations below are not very clear if you're not familiar with such concepts.

Essentially this PR adds similarity support by adding a new type of field in the Solr schema which can store "similarity vectors". Currently we only support 100-dimension vectors, but more types of vectors could be added. Then, add index time, we add vector information (embeddings data) to the Solr document just like any other field (or almost like that, see below). We retrieve embeddings from a SoundAnalysis object of a specific type of analyzer (for example, fsd-sinet_v1). All this can be configured using global Django settings.

The index is updated with the same strategy as normal Solr index updates. When a sound gets analyzed with the analyzer used for similarity, then it is marked as "index dirty", so the next time the job post_dirty_sounds_to_search_engine runs, the sound will be updated in the index with the similarity data. Also, I had to add new methods for sounds to get their "similarity state" and decide whether similarity buttons should be shown. I do that by checking the existence of SoundAnalysis objects of the designed analyzer (and I already took care of some optimizations to reduce number of queries). Therefore the similarity_state boolean field in the sound model is not needed. Nevertheless, I'd consider using it again and setting it at index time to avoid having to do some complex queries for knowing if the sound is ready for similarity (and also that method would be a bit more accurate for a number of reasons I won't detail here).

I experimented with two methods of adding the similarity data to the sound index:

  • The first method is straightforward: we simply add a similarity vector as a field to a Solr document. Using this method, search queries work basically like any other query, but the q parameter is used to define a point of the embedding space as a target for the similarity search. I call this method the "parent" method.
  • The second method involves the use of Solr "child" documents. This "child" method supports relating several similarity vectors with one single sound, so that we can store similarity vectors computed using different analyzers (i.e. different embedding extractors) and also we an even store different similarity vectors that correspond to different timestamps of the sound. This method therefore potentially allows to find similar sounds "within" a sound (so we can say sound X is similar to sound Y at 1:35), and also allows to easily switch between analyzers (embeddings) for getting similarity results. The downside of this method is that there are a number of query options like grouping and faceting that will not work because the documents that Solr returns are "child" documents and these include the id of the parent (and some other stuff) but not all parent fields (EDIT: this is no longer true, thanks to help from Solr users mailing list I could fix that).

A way to obtain the best of both methods, would be to index child documents also with the fields of the parent documents (or most of them at least), but this would mean having a lot of redundant information in the search index (EDIT: this is no longer true, no redundant information needed). Also, if we include several vectors per sound, it remains to be seen how this would scale in our production environment. Therefore, there are some things to investigate here...

So here are some next steps for this PR:

  • Add tests! Done
  • Choose a method (parent or child) for similarity queries We'll use child method
  • Check that NN returned by Solr are consistent with our experiments This will be better done once we actually have some experiments to compare to. For now we can merge without really activating solr-based similarity
  • Performance analysis This will be better done once deployed with a full index
  • Populate similarity_state Sound field at index time This optimization we can add when completely removing old similarity code, as otherwise there would be conflicts if using the same field.

Deployment steps:
This needs a full reindex of sounds, including removing the freesound Solr core data folder as the schema has changed. Also, the setting USE_SEARCH_ENGINE_SIMILARITY must be set to True for the Solr-based similarity to take over gaia similarity.

ffont added 21 commits January 23, 2024 12:30
This means that now documents can be partially updated instead of always being completely replaced. This features is not used yet anywhere, but it will be useful when including similarity data to the search engine.

#1714
…bled

We used to store similarity_state field in the sound model, but this is no longer needed when using the search engine based similarity.
This mode complements get_child mode, and will match parent documents instead of child documents
Also fixed an issue with grouping by pack in similarity search
@ffont ffont marked this pull request as ready for review February 9, 2024 09:19
@ffont ffont merged commit 20d47d9 into master Feb 9, 2024
1 check passed
@ffont ffont deleted the similarity-solr branch February 9, 2024 09:24
Copy link
Contributor

@Bomme Bomme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, for commenting so late. I started this review and wanted to take another look, but then never got around to it.
Here are some comments from my first round of review

README.md Show resolved Hide resolved
@@ -92,6 +92,8 @@ def display_facet(context, flt, facet, facet_type, title=""):
context['sort'] if context['sort'] is not None else '',
context['weights'] or ''
)
if context['similar_to'] is not None:
element['add_filter_url'] += '&similar_to={}'.format(context['similar_to'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it was not started in this PR but we should really use something like urllib.parse.urlencode to build the query args to avoid bugs introduced by typos etc.

For example, drag along a filter_url_args dict and only at the very end:

element['add_filter_url'] = urllib.parse.urlencode(filter_url_args)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree, I did not update stuff here because this will be part of the refactoring of the whole search views/template code that I've been planning to do. All URL that we build (filter urls, pagination, etc.) should be built by some object that knows how to handle this and then that object should do the proper encoding.

sounds/views.py Show resolved Hide resolved
utils/search/__init__.py Show resolved Hide resolved
@@ -358,10 +149,302 @@ def get_forum_index(self):
always_commit=True
)
return self.forum_index

# Util functions
def transform_document_into_update_document(self, document):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be function and not a method of this class since we do not need anything from the class

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that we don't use self but it made sense to have it as a method because in this way it is easy to override methods in subclasses. This method does actually not need to be overwritten, but there are other similar methods that don't use self either but that need to be overwritten (e.g. search_process_filter). I decided to keep them all as methods for consistency.

new_document.update({key: {'set': value} for key, value in document.items() if key != 'id'})
return new_document

def convert_sound_to_search_engine_document(self, sound, fields_to_include=[]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be function and not a method of this class since we do not need anything from the class

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explained in previous comment.


return document

def add_similarity_vectors_to_documents(self, sound_objects, documents):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually all this function seem to better suited as functions :D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explained in previous comment.

@ffont
Copy link
Member Author

ffont commented Feb 13, 2024

Thanks for the comments @Bomme!
I'll re-implement some of them in the master branch and commit there. The methods/functions one I'll keep it like it is right know, but we can discuss this further if you want. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants