Reimplement similarity search using Solr #1714

Open
4 of 14 tasks
ffont opened this issue Dec 13, 2023 · 0 comments
Labels
Improvement A functional improvement to an existing feature, that isn't urgently a bug New feature Something that doesn't yet exist in Freesound

ffont commented Dec 13, 2023

We've been discussing for a long time how we could get rid of our custom similarity server (based on gaia) and move its current functionality to another service. Since recent versions of Solr support nearest-neighbour (NN) search, we decided to move the similarity functionality to Solr. For this to happen, there are a number of considerations to take into account and steps to follow, some of which have already been done.

We mainly use gaia for similarity search, but we also use it for some advanced API functionality: users can implement complex search filters based on low-level audio descriptors (e.g., filter by pitch variance or bpm), and can also specify a set of descriptors that defines a custom similarity metric, so that the results of a query are sorted taking only those descriptors into account. When moving to Solr, some of these features will be lost, and users will need to change their app implementations to achieve similar results. Therefore, we will need to notify users about these changes and provide alternative ways to achieve similar results.

Our current implementation of similarity search also uses gaia to apply some transformations to the audio descriptors that we calculate, converting them into a 100-dimension normalised vector used for the NN queries. If we get rid of gaia, we will need to move this functionality somewhere else so that we can continue generating these sound vectors. We also want to take this opportunity to introduce audio embeddings computed with pre-trained deep learning models into our similarity system, so we can test state-of-the-art approaches to similarity search.
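As a rough illustration of what "moving this functionality somewhere else" could look like, here is a minimal sketch of applying a stored normalisation and PCA projection to a descriptor vector. The function name, the parameter layout (`mean`, `std`, `components`) and the final unit-length scaling are assumptions made for the example; the actual chain of transformations stored in the gaia dataset is not described here and may differ.

```python
from math import sqrt


def project_descriptors(values, mean, std, components):
    """Standardise a descriptor vector and project it onto PCA axes.

    Hypothetical helper: `mean` and `std` are per-descriptor statistics and
    `components` is a list of principal axes (one row per output dimension),
    all assumed to have been exported from the existing dataset beforehand.
    """
    # Standardise each descriptor; constant descriptors (std == 0) map to 0.
    standardised = [
        (v - m) / s if s else 0.0 for v, m, s in zip(values, mean, std)
    ]
    # Project onto each principal axis (dot product per output dimension).
    projected = [
        sum(c * x for c, x in zip(axis, standardised)) for axis in components
    ]
    # Assumption: scale to unit length so vectors are comparable for NN search.
    norm = sqrt(sum(p * p for p in projected))
    return [p / norm for p in projected] if norm else projected
```

With real data, `components` would have 100 rows so that every sound ends up as a 100-dimension vector in the same space as the existing ones.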

Here are some steps to carry out for the reimplementation of the similarity service:

  • Deprecate content search and combined search functionality from the API. Add a note in the API documentation and provide examples of how to implement an alternative solution. Send email to the API mailing list and also specifically notify users who use these API resources.
  • Choose some audio descriptors that were previously only indexed in gaia and add them to our current Solr index, so that complex queries involving audio descriptors will still be possible. We already index some descriptors generated by the AudioCommons analyzer, so we basically just need to add some configuration listing which descriptors from the FreesoundExtractor analyzer we want indexed in Solr. A full list of descriptors is here: https://freesound.org/apiv2/descriptors/ (note we should only choose one-dimensional descriptors).
  • Also choose some audio descriptors that should be stored in the DB so that they can be retrieved through the API. These do not necessarily need to be in the search engine, so they might not be usable in queries, but it may still be relevant to have them available. For example, the tristimulus descriptor has been used for mapping to RGB in several experiments, and mfcc mean is another candidate. Note that after removing gaia, only the descriptors that we store in the DB will be available as metadata for sounds, so all the relevant documentation will also need to be updated.
  • Update the current FreesoundExtractor audio analyzer so that it generates a 100-dimension vector compatible with the current ones being generated by gaia. We need to "extract" the current PCA and normalisation rules from our gaia dataset and reimplement them in the extractor so we can project points in the same space.
  • Add new Freesound analyzers that calculate embeddings, and define a way to configure which ones are to be loaded and ready for similarity search.
  • Add a parameter to the similarity search endpoints (both in the web and the API) to choose a preset that selects which embeddings to use for the NN search.
  • Add an option to the API search endpoint to provide an embedding vector as the target for a similarity search. This will allow searching inside Freesound for sounds similar to the embedding vector of a sound which is not part of Freesound. This used to be a (broken) feature of the old similarity service; now we could make it work again. Of course, end users need to run some sort of extractor to get the embedding first, but it still makes this possible.
  • Update the Solr schema to support "NN" fields; if possible, make them dynamic like the other fields used to load audio descriptors.
  • Update similarity utils code to interface with the new similarity system instead of the old one.
  • Remove content search and combined search functionality from the API (only after the whole process has been finished and we have been able to provide working examples in the documentation about how to implement alternatives). Also remove this functionality from the official API clients.
  • Implement Solr-based similarity search which indexes sound search vectors - Solr-based similarity search #1753
  • Get rid of all old gaia/similarity related code; basically, remove the whole "similarity" folder and related things. Also update things such as the code that checks gaia index consistency so that it now uses Solr, etc. The similarity update command will also no longer be needed.
  • Use Sound.similarity_state to specify when a sound is ready for similarity (using the new similarity). In Solr-based similarity search #1753 we did not implement this because this field is used by the old similarity. Once the old similarity is completely gone, we can do it.
  • Run some performance evaluation comparing gaia-based vs Solr-based similarity.
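
To make the schema and query steps above more concrete, here is a minimal sketch of the two Solr pieces involved: a Schema API payload that declares a dense vector field type plus a dynamic field, and the request parameters for a nearest-neighbour query using Solr's `{!knn}` query parser. The helper names, the field names (`sim_vector`, `bpm`) and the choice of cosine similarity are assumptions for illustration, not the configuration actually used in Freesound.

```python
def dense_vector_schema_commands(type_name, dimension, similarity="cosine"):
    """Build a Solr Schema API payload for a dense vector type and dynamic field.

    Hypothetical helper. The returned dict would be POSTed as JSON to the
    collection's /schema endpoint; both commands can go in a single request.
    """
    return {
        "add-field-type": {
            "name": type_name,
            "class": "solr.DenseVectorField",
            "vectorDimension": dimension,
            # Solr accepts "euclidean", "dot_product" or "cosine" here.
            "similarityFunction": similarity,
        },
        "add-dynamic-field": {
            # Any field named like "<something>_<type_name>" gets this type.
            "name": f"*_{type_name}",
            "type": type_name,
            "indexed": True,
            "stored": True,
        },
    }


def knn_query_params(field, vector, top_k=10, filter_queries=()):
    """Build request parameters for a Solr {!knn} nearest-neighbour query."""
    vector_str = "[" + ", ".join(repr(float(v)) for v in vector) + "]"
    params = {
        "q": f"{{!knn f={field} topK={top_k}}}{vector_str}",
        "fl": "id,score",
        "rows": top_k,
    }
    if filter_queries:
        # Assumption: descriptor filters are passed as ordinary fq clauses.
        params["fq"] = list(filter_queries)
    return params


# Example: 3 nearest neighbours of a (toy) target vector, restricted by bpm.
example = knn_query_params("sim_vector", [0.5, 1, -2.25], top_k=3,
                           filter_queries=["bpm:[110 TO 130]"])
```

Accepting an arbitrary `vector` argument is also what would allow the API to take an externally computed embedding as the similarity target. How `fq` filters interact with the kNN candidate selection depends on the Solr version, so that part would need checking against the version in use.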
ffont added a commit that referenced this issue Jan 23, 2024
This means that documents can now be partially updated instead of always being completely replaced. This feature is not used anywhere yet, but it will be useful when adding similarity data to the search engine.

#1714
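
For context, a partial update in Solr is expressed as an atomic update: the document sent to the update handler carries the unique key plus modifiers such as `set` for the fields to change. The sketch below only builds such a payload; the helper name and the `sim_vector` field are hypothetical, and it is not the code from the referenced commit.

```python
import json


def atomic_update_doc(doc_id, fields):
    """Build a Solr atomic-update document that sets only the given fields.

    Hypothetical helper; assumes the collection's unique key field is "id".
    """
    doc = {"id": doc_id}
    for name, value in fields.items():
        # "set" replaces the field value and leaves other fields untouched.
        doc[name] = {"set": value}
    return doc


# JSON body that would be POSTed to the collection's /update handler.
payload = json.dumps([atomic_update_doc("1234", {"sim_vector": [0.1, 0.2]})])
```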