Solr-based similarity search #1753
Conversation
This means that now documents can be partially updated instead of always being completely replaced. This feature is not used anywhere yet, but it will be useful when adding similarity data to the search engine. #1714
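For reference, Solr atomic ("partial") updates work by wrapping each field in an update operation; below is a minimal sketch of what such an update document can look like (field names other than `id` are made up for the example):

```python
# Sketch of converting a full document into a Solr atomic update document.
full_document = {
    'id': 1234,
    'tags': ['field-recording', 'birds'],
    'num_downloads': 42,
}

# Every field except the id is wrapped in an operation such as {'set': ...},
# so Solr only replaces those fields instead of the whole document.
update_document = {'id': full_document['id']}
update_document.update(
    {key: {'set': value} for key, value in full_document.items() if key != 'id'}
)
# -> {'id': 1234, 'tags': {'set': [...]}, 'num_downloads': {'set': 42}}
```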
…bled We used to store the similarity_state field in the sound model, but this is no longer needed when using the search-engine-based similarity.
This mode complements the get_child mode and will match parent documents instead of child documents.
Also fixed an issue with grouping by pack in similarity search
Sorry for commenting so late. I started this review and wanted to take another look, but then never got around to it.
Here are some comments from my first round of review.
```diff
@@ -92,6 +92,8 @@ def display_facet(context, flt, facet, facet_type, title=""):
         context['sort'] if context['sort'] is not None else '',
         context['weights'] or ''
     )
+    if context['similar_to'] is not None:
+        element['add_filter_url'] += '&similar_to={}'.format(context['similar_to'])
```
I know it was not started in this PR, but we should really use something like `urllib.parse.urlencode` to build the query args, to avoid bugs introduced by typos etc. For example, drag along a `filter_url_args` dict and only at the very end do: `element['add_filter_url'] = urllib.parse.urlencode(filter_url_args)`
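A minimal sketch of that idea; the `context`/`element` dicts stand in for the objects available inside `display_facet`, and the parameter names (`s`, `w`, `similar_to`) are examples, not necessarily the exact query parameters used by the search views:

```python
from urllib.parse import urlencode

# Illustrative stand-ins for the objects used in display_facet().
context = {'sort': 'score desc', 'weights': '', 'similar_to': 123456}
element = {}

# Collect query arguments in a dict while building the facet element...
filter_url_args = {'s': context['sort'] or ''}
if context['weights']:
    filter_url_args['w'] = context['weights']
if context['similar_to'] is not None:
    filter_url_args['similar_to'] = context['similar_to']

# ...and only serialize them at the very end, letting urlencode handle escaping.
element['add_filter_url'] = '?' + urlencode(filter_url_args)
print(element['add_filter_url'])  # ?s=score+desc&similar_to=123456
```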
Yes, I agree. I did not update this here because it will be part of the refactoring of the whole search views/template code that I've been planning to do. All URLs that we build (filter URLs, pagination, etc.) should be built by some object that knows how to handle this, and that object should do the proper encoding.
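A possible shape for such an object, purely as a sketch; the class and method names are hypothetical and do not exist in the codebase:

```python
from urllib.parse import urlencode

class SearchURLBuilder:
    """Hypothetical helper: accumulates query parameters and encodes them in one place."""

    def __init__(self, base_path='/search/', params=None):
        self.base_path = base_path
        self.params = dict(params or {})

    def with_param(self, key, value):
        # Return a new builder so one base query can derive many URLs
        # (facet filters, pagination links, similarity links, ...).
        return SearchURLBuilder(self.base_path, {**self.params, key: value})

    def url(self):
        return self.base_path + '?' + urlencode(self.params)

# Usage sketch: derive a filter URL and a "similar to" URL from the same base query.
base = SearchURLBuilder().with_param('q', 'birds')
print(base.with_param('f', 'tag:"field-recording"').url())
print(base.with_param('similar_to', 123456).url())
```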
```diff
@@ -358,10 +149,302 @@ def get_forum_index(self):
             always_commit=True
         )
         return self.forum_index
 
+    # Util functions
+    def transform_document_into_update_document(self, document):
```
This should be a function and not a method of this class, since we do not need anything from the class.
It is true that we don't use `self`, but it made sense to have it as a method because this way it is easy to override methods in subclasses. This particular method does not actually need to be overridden, but there are other similar methods that don't use `self` either and do need to be overridden (e.g. `search_process_filter`). I decided to keep them all as methods for consistency.
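A tiny sketch of that consistency argument; the class names here are illustrative, not the actual search-engine backend classes in the codebase:

```python
class BaseSearchEngine:
    # Helpers that don't use self are kept as methods, so subclasses override them
    # with the same mechanism as the helpers that do need overriding.
    def search_process_filter(self, query_filter):
        return query_filter

    def transform_document_into_update_document(self, document):
        new_document = {'id': document['id']}
        new_document.update({k: {'set': v} for k, v in document.items() if k != 'id'})
        return new_document


class CustomSearchEngine(BaseSearchEngine):
    # Overriding works uniformly, whether or not the base implementation used self.
    def search_process_filter(self, query_filter):
        return query_filter.replace('samplerate', 'sample_rate')
```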
```diff
+        new_document.update({key: {'set': value} for key, value in document.items() if key != 'id'})
+        return new_document
+
+    def convert_sound_to_search_engine_document(self, sound, fields_to_include=[]):
```
This should be a function and not a method of this class, since we do not need anything from the class.
Explained in previous comment.
```diff
+        return document
+
+    def add_similarity_vectors_to_documents(self, sound_objects, documents):
```
Actually, all of these seem to be better suited as functions :D
Explained in previous comment.
Thanks for the comments @Bomme!
Issue(s)
#1714
Description
This PR implements Solr-based similarity search for the web interface (not for the API!). It does not remove any of the previous functionality, and old/new similarity can be toggled with the setting `USE_SEARCH_ENGINE_SIMILARITY`.

This PR is to investigate different ways in which we can use Solr for similarity search. I have successfully implemented it, but there are some decisions to be made which have important implications (EDIT: I think after #12531aa it is clear how to proceed). To implement what is in this PR, I had to gain a good understanding of how Solr documents work, so it is possible that the explanations below are not very clear if you're not familiar with such concepts.

Essentially, this PR adds similarity support by adding a new type of field to the Solr schema which can store "similarity vectors". Currently we only support 100-dimension vectors, but more types of vectors could be added. Then, at index time, we add the vector information (embeddings data) to the Solr document just like any other field (or almost like that, see below). We retrieve embeddings from a `SoundAnalysis` object of a specific type of analyzer (for example, `fsd-sinet_v1`). All this can be configured using global Django settings.

The index is updated with the same strategy as normal Solr index updates. When a sound gets analyzed with the analyzer used for similarity, it is marked as "index dirty", so the next time the `post_dirty_sounds_to_search_engine` job runs, the sound will be updated in the index with the similarity data. Also, I had to add new methods for sounds to get their "similarity state" and decide whether similarity buttons should be shown. I do that by checking the existence of `SoundAnalysis` objects of the designated analyzer (and I already took care of some optimizations to reduce the number of queries). Therefore the `similarity_state` boolean field in the sound model is not needed. Nevertheless, I'd consider using it again and setting it at index time to avoid having to do some complex queries to know whether a sound is ready for similarity (and that method would also be a bit more accurate for a number of reasons I won't detail here).
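A rough sketch of that kind of check; the import path, the field names on `SoundAnalysis`, and the setting name are assumptions for illustration (only the model name and the example analyzer come from the description above):

```python
from django.conf import settings
from sounds.models import Sound, SoundAnalysis  # assumed import path

# Hypothetical setting holding the analyzer whose embeddings power similarity,
# e.g. "fsd-sinet_v1" as mentioned above.
SIMILARITY_ANALYZER = getattr(settings, 'SEARCH_ENGINE_SIMILARITY_ANALYZER', 'fsd-sinet_v1')

def sound_is_ready_for_similarity(sound: Sound) -> bool:
    # Similarity buttons can be shown if an analysis from the similarity analyzer
    # exists and finished successfully (field names are illustrative).
    return SoundAnalysis.objects.filter(
        sound=sound,
        analyzer=SIMILARITY_ANALYZER,
        analysis_status='OK',
    ).exists()
```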
I experimented with two methods of adding the similarity data to the sound index: […] the `q` parameter is used to define a point of the embedding space as a target for the similarity search. I call this method the "parent" method. The downside of this method is that there are a number of query options, like grouping and faceting, that will not work because the documents that Solr returns are "child" documents, and these include the `id` of the parent (and some other stuff) but not all parent fields (EDIT: this is no longer true; thanks to help from the Solr users mailing list I could fix that).

A way to obtain the best of both methods would be to index child documents also with the fields of the parent documents (or most of them at least), but this would mean having a lot of redundant information in the search index (EDIT: this is no longer true, no redundant information needed). Also, if we include several vectors per sound, it remains to be seen how this would scale in our production environment. Therefore, there are some things to investigate here...
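For background on the parent/child distinction above, here is a rough illustration using Solr's standard block-join query parsers via `pysolr`. This is not necessarily the exact query syntax used in this PR, it does not show the vector scoring itself, and the core URL and field names (`content_type`, `similarity_space`) are made up:

```python
import pysolr

# Hypothetical core URL, just to show the two query styles.
solr = pysolr.Solr('http://localhost:8983/solr/freesound/')

# "Child" style: the inner query matches top-level sound documents, and Solr returns
# their nested child documents (e.g. the similarity-vector documents). Child documents
# only carry their own fields plus a reference to the parent, which is why grouping
# and faceting on sound fields was initially problematic.
child_hits = solr.search('{!child of="content_type:sound"}tags:birds')

# "Parent" style: the inner query matches child documents and the block-join parser
# maps the hits back to the top-level sound documents, so the usual sound fields
# (for grouping by pack, faceting, etc.) are available in the results.
parent_hits = solr.search('{!parent which="content_type:sound"}similarity_space:fsd-sinet_v1')
```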
So here are some next steps for this PR:

- ~~Add tests!~~ Done
- ~~Choose a method (parent or child) for similarity queries~~ We'll use the child method
- ~~Check that the NN returned by Solr are consistent with our experiments~~ This will be better done once we actually have some experiments to compare to. For now we can merge without really activating Solr-based similarity
- ~~Performance analysis~~ This will be better done once deployed with a full index
- ~~Populate the `similarity_state` `Sound` field at index time~~ This optimization we can add when completely removing the old similarity code, as otherwise there would be conflicts if using the same field.

Deployment steps:
This needs a full reindex of sounds, including removing the freesound Solr core data folder, as the schema has changed. Also, the setting `USE_SEARCH_ENGINE_SIMILARITY` must be set to `True` for the Solr-based similarity to take over the gaia similarity.
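As a reference, the toggle described above would look roughly like this in the Django settings. Only `USE_SEARCH_ENGINE_SIMILARITY` is named in this PR; the other setting names are hypothetical placeholders for the analyzer/vector configuration mentioned in the description:

```python
# settings.py (sketch)

# Toggle between the old gaia-based similarity and the new Solr-based one.
USE_SEARCH_ENGINE_SIMILARITY = True

# Hypothetical names for the configuration described above: which analyzer's
# embeddings are indexed, and the dimensionality of the similarity vectors.
SEARCH_ENGINE_SIMILARITY_ANALYZER = 'fsd-sinet_v1'
SEARCH_ENGINE_SIMILARITY_VECTOR_SIZE = 100
```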