Search "frontend" refactor #1756

Merged
merged 45 commits into from
Apr 10, 2024
Commits
e1d88b3
Add basic implementation of SearchQueryProcessor
ffont Feb 16, 2024
a4a2363
Fix "search in" option parsing
ffont Feb 16, 2024
513ff85
WIP generate query_params dict from SearchQueryProcessor
ffont Feb 16, 2024
b41c446
Further work in query processor
ffont Feb 16, 2024
e0fa707
Implement more search query options, add comprehensive tests
ffont Feb 19, 2024
f8a0e01
Add method to make URLs and tests for it
ffont Feb 19, 2024
b003670
Add method to test for active advanced search options
ffont Feb 19, 2024
0a0576a
Add test for contains_active_advanced_search_options
ffont Feb 20, 2024
3d1c865
Fix for boolean options in filters
ffont Feb 20, 2024
0e0de1d
Make visible select elements follow disabled status
ffont Feb 20, 2024
6f50adb
WIP re-implementing search view with sqp
ffont Feb 20, 2024
9b06b2e
WIP re-implementing search view with sqp, results list section
ffont Feb 21, 2024
3719876
Working facets with SearchQueryProcessor
ffont Feb 21, 2024
0de8774
Working SearchQueryProcessor-based search page
ffont Feb 21, 2024
ff180cc
Fix failing tests
ffont Feb 21, 2024
cf8172f
Make custering work with python 10
ffont Feb 22, 2024
11a4c54
WIP make clustering work with SearchQueryProcessor and similarity vec…
ffont Feb 22, 2024
c4c7c4d
Make clustering work in new UI
ffont Feb 23, 2024
096504b
Tidy up methods to get clusters graph data
ffont Feb 23, 2024
a0ed755
Show/hide beta search options with django perm
ffont Feb 23, 2024
5b58b6d
User random numbers for cluster IDs
ffont Feb 23, 2024
621fbc6
Fix issue with grouping pack when no packs
ffont Feb 23, 2024
8d7abfc
Fix bug in facet filter quotation
ffont Feb 23, 2024
ddf2db7
Update networkx version in web image
ffont Feb 26, 2024
6a62309
Make clustering task run on web_worker image
ffont Feb 26, 2024
5ef76c8
Cancel clustering task if timeout
ffont Feb 26, 2024
7851d8b
Makes tests pass again
ffont Feb 26, 2024
f6d6566
Move some tests around and add some extra
ffont Feb 29, 2024
9d6be4a
Remove print
ffont Feb 29, 2024
8af3d74
Add comment
ffont Feb 29, 2024
3172bdc
Remove no longer neede files
ffont Feb 29, 2024
c5512c7
Properly document SearchQueryProcessor class
ffont Mar 1, 2024
cfeaa88
Move SearchOption base classes to new file
ffont Mar 1, 2024
e0fde6d
WIP properly documenting and refactoring SearchOption(s)
ffont Mar 4, 2024
efc6785
More WIP refactoring SearchOption
ffont Mar 4, 2024
645be9c
Refactor SearchOption class and the way it is handled in SearchQueryP…
ffont Mar 5, 2024
9a5901f
Add some property shortcuts to SearchQueryProcessor
ffont Mar 5, 2024
4c835d3
Make "advanced" parameter True by default
ffont Mar 5, 2024
4b45c95
Display clusters in a nicer way
ffont Mar 5, 2024
388a186
Add some documentation about adding new search options
ffont Mar 5, 2024
ae4d40b
Fix bug with empty clusters
ffont Mar 5, 2024
e6fbc2a
Use templatetag to help displaying search options in template
ffont Mar 5, 2024
e26fcc7
Trigger cluster selection submit from search.js
ffont Mar 5, 2024
caf1d8f
Merge branch 'master' into search-refactor2
ffont Apr 10, 2024
83e6fbc
Fix failing tickets tests
ffont Apr 10, 2024
29 changes: 28 additions & 1 deletion DEVELOPERS.md
@@ -71,6 +71,7 @@ Currently, we only use the following custom permissions:
* `tickets.can_moderate` (in `Ticket` model, used to allow sound moderation)
* `forum.can_moderate_forum` (in `Post` model, used to allow forum moderation)
* `sounds.can_describe_in_bulk` (in `BulkUploadProgress` model, used to allow bulk upload for users who don't meet the other common requirements)
* `profile.show_beta_search_options` (in `Profile` model, used to allow using beta search features)


### URLs that include a username
@@ -131,6 +132,33 @@ creating `DeletedSound` objects in the `sounds-models.on_delete_sound` function
signal of the `Sound` model.


### Adding new search options in the search page

The available options for searching and filtering sounds in the search page are managed using a `SearchQueryProcessor`
object (implemented in `/utils/search/search_query_processor.py`). The `SearchQueryProcessor` class is used to parse and
process search query information from a Django `request` object, and compute a number of useful items for displaying search
information in templates, constructing search URLs, and preparing search options to be passed to the backend search engine.

To add a new option to the search page, a new member of a specific `SearchOption` class should be added to the `SearchQueryProcessor`
class (see the `SearchQueryProcessor` definition for examples). There are a number of already existing `SearchOption` types,
as you can see by looking at the search options already implemented in `SearchQueryProcessor`. If the newly added search
option requires some calculations to determine the `query_params` to be sent to the `search_sounds` function of the search
engine backend, these should be done in the `SearchQueryProcessor.as_query_params` method.

Adding a new search option to `SearchQueryProcessor` will make the option work with the search engine backend and with search URLs,
but it will NOT automatically add the option to the form in the search page. This will need to be done manually by adding the
search option in the desired place in `templates/search/search.html` (see how other search options are implemented for inspiration;
there is a `display_search_option` templatetag which will facilitate things in most cases).

All this will add the search option to the user interface and send the corresponding information to the search backend. For example,
if the new search option should apply a filter on some `new_property` in the search backend, this will be handled by the `SearchQueryProcessor`.
However, it is expected that this `new_property` has been added to the search engine schema and indexed properly; otherwise there
will be errors when running the queries.

Please have a look at the documentation of `SearchQueryProcessor` and the various `SearchOption` classes to get a better
understanding of how all this works.
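The pattern described above can be sketched in miniature. The code below is illustrative only: `BoolSearchOption`, `ToyQueryProcessor`, and all parameter names are hypothetical stand-ins, not Freesound's actual API.

```python
# Illustrative sketch of the SearchOption/SearchQueryProcessor design described
# above. All class and field names here are hypothetical stand-ins.

class BoolSearchOption:
    """A boolean option parsed from request GET parameters."""

    def __init__(self, query_param_name, default=False):
        self.query_param_name = query_param_name
        self.default = default
        self.value = default

    def load_from_request(self, get_params):
        raw = get_params.get(self.query_param_name)
        if raw is not None:
            self.value = raw in ('1', 'true', 'on')


class ToyQueryProcessor:
    """Parses options from a request-like dict of GET parameters."""

    def __init__(self, get_params):
        self.options = {'grouping_packs': BoolSearchOption('g', default=True)}
        for option in self.options.values():
            option.load_from_request(get_params)

    def as_query_params(self):
        # Option-specific calculations for the search backend go here.
        return {'group_by_pack': self.options['grouping_packs'].value}


sqp = ToyQueryProcessor({'g': '0'})
print(sqp.as_query_params())  # {'group_by_pack': False}
```

A template tag analogous to `display_search_option` would then read `self.options` to render each option's widget, keeping the template free of per-option parsing logic.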


### Search Engine Backends

The way in which Freesound communicates with a search engine to search for sounds and forum posts is abstracted through
@@ -149,7 +177,6 @@ the implementation of a search backend. You can run it like:
Please read carefully the documentation of the management command to better understand how it works and how it does
the testing.


### Freesound analysis pipeline

In February 2022 we released a refactoring of the analysis pipeline that allows us to more easily incorporate new audio
17 changes: 17 additions & 0 deletions accounts/migrations/0041_alter_profile_options.py
@@ -0,0 +1,17 @@
# Generated by Django 3.2.23 on 2024-02-23 22:08

from django.db import migrations


class Migration(migrations.Migration):

dependencies = [
('accounts', '0040_auto_20230328_1205'),
]

operations = [
migrations.AlterModelOptions(
name='profile',
options={'ordering': ('-user__date_joined',), 'permissions': (('can_beta_test', 'Show beta features to that user.'),)},
),
]
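For context, Django exposes custom permissions declared in `Meta.permissions` as `"<app_label>.<codename>"` strings. A minimal sketch of the string a permission check would use (assuming `accounts` is the app label, as the migration path suggests):

```python
# Django checks custom permissions as "<app_label>.<codename>". The migration
# above registers codename "can_beta_test"; "accounts" is assumed here from
# the migration's location.
def full_permission_name(app_label, codename):
    return f"{app_label}.{codename}"

perm = full_permission_name("accounts", "can_beta_test")
print(perm)  # accounts.can_beta_test
# In a view one would typically gate on: request.user.has_perm(perm)
```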
5 changes: 4 additions & 1 deletion accounts/models.py
@@ -226,7 +226,7 @@ def get_user_sounds_in_search_url(self):
return f'{reverse("sounds-search")}?f=username:"{ self.user.username }"&s=Date+added+(newest+first)&g=0'

def get_user_packs_in_search_url(self):
return f'{reverse("sounds-search")}?f=username:"{ self.user.username }"&s=Date+added+(newest+first)&g=1&only_p=1'
return f'{reverse("sounds-search")}?f=username:"{ self.user.username }"&s=Date+added+(newest+first)&g=1&dp=1'

def get_latest_packs_for_profile_page(self):
latest_pack_ids = Pack.objects.select_related().filter(user=self.user, num_sounds__gt=0).exclude(is_deleted=True) \
@@ -649,6 +649,9 @@ def get_stats_for_profile_page(self):

class Meta:
ordering = ('-user__date_joined', )
permissions = (
("can_beta_test", "Show beta features to that user."),
)


class GdprAcceptance(models.Model):
4 changes: 2 additions & 2 deletions accounts/tests/test_views.py
@@ -262,14 +262,14 @@ def test_sounds_response(self):
reverse('pack-downloaders', kwargs={'username': user.username, "pack_id": self.pack.id}) + '?ajax=1')
self.assertEqual(resp.status_code, 200)

@mock.patch('search.views.perform_search_engine_query')
@mock.patch('tags.views.perform_search_engine_query')
def test_tags_response(self, perform_search_engine_query):
perform_search_engine_query.return_value = (create_fake_perform_search_engine_query_results_tags_mode(), None)

# 200 response on tags page access
resp = self.client.get(reverse('tags'))
self.assertEqual(resp.status_code, 200)
self.assertEqual(resp.context['tags_mode'], True)
self.assertEqual(resp.context['sqp'].tags_mode_active(), True)

def test_packs_response(self):
# 302 response (note that since BW, there will be a redirect to the search page in between)
2 changes: 1 addition & 1 deletion accounts/urls.py
@@ -27,7 +27,7 @@
import bookmarks.views as bookmarks
import follow.views as follow
import apiv2.views as api
from utils.urlpatterns import redirect_inline
from utils.url import redirect_inline



23 changes: 0 additions & 23 deletions clustering/__init__.py
@@ -1,23 +0,0 @@
#
# Freesound is (c) MUSIC TECHNOLOGY GROUP, UNIVERSITAT POMPEU FABRA
#
# Freesound is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# Freesound is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# Authors:
# See AUTHORS file.
#

# strings used for communicating the state of the clustering process
CLUSTERING_RESULT_STATUS_PENDING = "pending"
CLUSTERING_RESULT_STATUS_FAILED = "failed"
150 changes: 32 additions & 118 deletions clustering/clustering.py
@@ -33,21 +33,15 @@
import six
from time import time

from . import clustering_settings as clust_settings

# The following packages are only needed if the running process is configured to be a Celery worker.
# We avoid importing them in appservers to avoid having to install unneeded dependencies.
if settings.IS_CELERY_WORKER:
import community as com
import numpy as np
import networkx as nx
from networkx.readwrite import json_graph
from networkx.algorithms.community import k_clique_communities, greedy_modularity_communities
from sklearn import metrics
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import kneighbors_graph

from .features_store import FeaturesStore
import community as com
import numpy as np
import networkx as nx
from networkx.readwrite import json_graph
from networkx.algorithms.community import k_clique_communities, greedy_modularity_communities
from sklearn import metrics
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import kneighbors_graph


logger = logging.getLogger('clustering')

@@ -65,8 +59,6 @@ class ClusteringEngine(object):
method. Moreover, a few unused alternative methods for performing some intermediate steps are left
here for development and research purposes.
"""
def __init__(self):
self.feature_store = FeaturesStore()

def _prepare_clustering_result_and_reference_features_for_evaluation(self, partition):
"""Formats the clustering classes and some reference features in order to then estimate how good is the
@@ -157,6 +149,9 @@ def _evaluation_metrics(self, partition):
"""
# we compute the evaluation metrics only if some reference features are available for evaluation
# we return None when they are not available not to break the following part of the code
'''
# NOTE: the following code is commented because the reference features are not available in the current version of the code
# If in the future we want to perform further evaluation, we should re-implement some of these functions
if clust_settings.REFERENCE_FEATURES in clust_settings.AVAILABLE_FEATURES:
reference_features, clusters = self._prepare_clustering_result_and_reference_features_for_evaluation(partition)
ami = np.average(mutual_info_classif(reference_features, clusters, discrete_features=True))
@@ -165,6 +160,8 @@
return ami, ss, ci
else:
return None, None, None
'''
return None, None, None

def _ratio_intra_community_edges(self, graph, communities):
"""Computes the ratio of the number of intra-community (cluster) edges to the total number of edges in the cluster.
@@ -212,55 +209,13 @@ def _point_centralities(self, graph, communities):
node_community_centralities = {k: old_div(v,max(d.values())) for d in communities_centralities for k, v in d.items()}

return node_community_centralities

def _save_results_to_file(self, query_params, features, graph_json, sound_ids, modularity,
num_communities, ratio_intra_community_edges, ami, ss, ci, communities):
"""Saves a json file to disk containing the clustering results information listed below.

This is used when developing the clustering method. The results and the evaluation metrics are made accessible
for post-analysis.

Args:
query_params (str): string representing the query parameters submited by the user to the search engine.
features (str): name of the features used for clustering.
graph_json: (dict) NetworkX graph representation of sounds data in node-link format that is suitable for JSON
serialization.
sound_ids (List[Int]): list of the sound ids.
modularity (float): modularity of the graph partition.
num_communities (Int): number of communities (clusters).
ratio_intra_community_edges (List[Float]): intra-community edges ratio.
ami (Numpy.float): Average Mutual Information score.
ss (Numpy.float): Silhouette Coefficient score.
ci (Numpy.float): Calinski and Harabaz Index score.
communities (List[List[Int]]): List storing Lists containing the Sound ids that are in each community (cluster).
"""
if clust_settings.SAVE_RESULTS_FOLDER:
result = {
'query_params' : query_params,
'sound_ids': sound_ids,
'num_clusters': num_communities,
'graph': graph_json,
'features': features,
'modularity': modularity,
'ratio_intra_community_edges': ratio_intra_community_edges,
'average_mutual_information': ami,
'silouhette_coeff': ss,
'calinski_harabaz_score': ci,
'communities': communities
}
with open(os.path.join(
clust_settings.SAVE_RESULTS_FOLDER,
f'{query_params}.json'
), 'w') as f:
json.dump(result, f)

def create_knn_graph(self, sound_ids_list, features=clust_settings.DEFAULT_FEATURES):
def create_knn_graph(self, sound_ids_list, similarity_vectors_map):
"""Creates a K-Nearest Neighbors Graph representation of the given sounds.

Args:
sound_ids_list (List[str]): list of sound ids.
features (str): name of the features to be used for nearest neighbors computation.
Available features are listed in the clustering settings file.
similarity_vectors_map (Dict{int:List[float]}): dictionary with the similarity feature vectors for each sound.

Returns:
(nx.Graph): NetworkX graph representation of sounds.
Expand All @@ -272,58 +227,21 @@ def create_knn_graph(self, sound_ids_list, features=clust_settings.DEFAULT_FEATU
# neighbors for small collections, while limiting it for larger collections, which ensures low-computational complexity.
k = int(np.ceil(np.log2(len(sound_ids_list))))

sound_features, sound_ids_out = self.feature_store.return_features(sound_ids_list)
features = []
sound_ids_out = []
for sound_id, feature_vector in similarity_vectors_map.items():
features.append(feature_vector)
sound_ids_out.append(sound_id)
sound_features = np.array(features).astype('float32')

A = kneighbors_graph(sound_features, k)
for idx_from, (idx_to, distance) in enumerate(zip(A.indices, A.data)):
idx_from = int(idx_from / k)
if distance < clust_settings.MAX_NEIGHBORS_DISTANCE:
if distance < settings.CLUSTERING_MAX_NEIGHBORS_DISTANCE:
graph.add_edge(sound_ids_out[idx_from], sound_ids_out[idx_to])

# Remove isolated nodes
graph.remove_nodes_from(list(nx.isolates(graph)))

return graph
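The construction above can be sketched without scikit-learn or NetworkX. A stdlib-only, brute-force version of the same idea (logarithmically scaled `k`, distance-thresholded undirected edges, isolated nodes dropped); the `MAX_DISTANCE` value is arbitrary, standing in for the `CLUSTERING_MAX_NEIGHBORS_DISTANCE` setting:

```python
import math

# Illustrative, stdlib-only sketch of the k-NN graph construction above:
# brute-force neighbor search stands in for sklearn's kneighbors_graph, and a
# dict of neighbor sets stands in for the NetworkX graph.
MAX_DISTANCE = 0.5  # arbitrary stand-in for CLUSTERING_MAX_NEIGHBORS_DISTANCE

def build_knn_graph(vectors_by_id):
    ids = list(vectors_by_id)
    # Limit the number of neighbors logarithmically, as in the code above.
    k = int(math.ceil(math.log2(len(ids))))
    edges = {sid: set() for sid in ids}
    for sid in ids:
        # Brute-force k nearest neighbors by Euclidean distance.
        distances = sorted(
            (math.dist(vectors_by_id[sid], vectors_by_id[other]), other)
            for other in ids if other != sid
        )
        for distance, other in distances[:k]:
            if distance < MAX_DISTANCE:
                edges[sid].add(other)
                edges[other].add(sid)  # undirected, as in nx.Graph
    # Remove isolated nodes (no edges survived the distance threshold).
    return {sid: nbrs for sid, nbrs in edges.items() if nbrs}

graph = build_knn_graph({
    'a': [0.0, 0.0], 'b': [0.1, 0.0], 'c': [0.0, 0.1], 'd': [9.0, 9.0],
})
print(sorted(graph))  # ['a', 'b', 'c']  ('d' is isolated and removed)
```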

def create_common_nn_graph(self, sound_ids_list, features=clust_settings.DEFAULT_FEATURES):
"""Creates a Common Nearest Neighbors Graph representation of the given sounds.

Args:
sound_ids_list (List[str]): list of sound ids.
features (str): name of the features to be used for nearest neighbors computation.
Available features are listed in the clustering settings file.

Returns:
(nx.Graph): NetworkX graph representation of sounds.
"""
# first create a knn graph
knn_graph = self.create_knn_graph(sound_ids_list, features=features)

# create the common nn graph
graph = nx.Graph()
graph.add_nodes_from(knn_graph.nodes)

for i, node_i in enumerate(knn_graph.nodes):
for j, node_j in enumerate(knn_graph.nodes):
if j > i:
num_common_neighbors = len(set(knn_graph.neighbors(node_i)).intersection(knn_graph.neighbors(node_j)))
if num_common_neighbors > 0:
graph.add_edge(node_i, node_j, weight=num_common_neighbors)

# keep only k most weighted edges
k = int(np.ceil(np.log2(len(graph.nodes))))
# we iterate through the node ids and get all its corresponding edges using graph[node]
# there seem to be no way to get node_id & edges in the for loop.
for node in graph.nodes:
ordered_neighbors = sorted(list(six.iteritems(graph[node])), key=lambda x: x[1]['weight'], reverse=True)
try:
neighbors_to_remove = [neighbor_distance[0] for neighbor_distance in ordered_neighbors[k:]]
graph.remove_edges_from([(node, neighbor) for neighbor in neighbors_to_remove])
except IndexError:
pass

# Remove isolated nodes
graph.remove_nodes_from(list(nx.isolates(graph)))

return graph

def cluster_graph(self, graph):
@@ -349,7 +267,7 @@ def cluster_graph(self, graph):
modularity = com.modularity(partition , graph)

return partition, num_communities, communities, modularity
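For reference, the modularity score that `cluster_graph` obtains from `com.modularity` follows Newman's formula, Q = sum over clusters of L_c/m - (d_c/(2m))^2, where L_c counts intra-cluster edges, d_c sums cluster degrees, and m is the total edge count. A stdlib sketch on a toy edge list:

```python
# Stdlib sketch of the modularity score reported by community.modularity:
# Q = sum_c (L_c/m - (d_c/(2m))**2).
def modularity(edges, partition):
    m = len(edges)
    q = 0.0
    for c in set(partition.values()):
        members = {node for node, cl in partition.items() if cl == c}
        # Edges with both endpoints inside the cluster.
        intra = sum(1 for u, v in edges if u in members and v in members)
        # Total degree contributed by cluster members.
        degree = sum((u in members) + (v in members) for u, v in edges)
        q += intra / m - (degree / (2 * m)) ** 2
    return q

# Two triangles joined by one bridge edge: a clearly modular graph.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
print(round(modularity(edges, partition), 3))  # 0.357
```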

def cluster_graph_overlap(self, graph, k=5):
"""Applies overlapping community detection in the given graph.

@@ -371,7 +289,7 @@ def cluster_graph_overlap(self, graph, k=5):
partition = {sound_id: cluster_id for cluster_id, cluster in enumerate(communities) for sound_id in cluster}

return partition, num_communities, communities, None

def remove_lowest_quality_cluster(self, graph, partition, communities, ratio_intra_community_edges):
"""Removes the lowest quality cluster in the given graph.

@@ -404,13 +322,13 @@ def remove_lowest_quality_cluster(self, graph, partition, communities, ratio_int
partition[snd] -= 1
return graph, partition, communities, ratio_intra_community_edges

def cluster_points(self, query_params, features, sound_ids):
def cluster_points(self, query_params, sound_ids, similarity_vectors_map):
"""Applies clustering on the requested sounds using the given similarity vectors.

Args:
query_params (str): string representing the query parameters submitted by the user to the search engine.
features (str): name of the features used for clustering the sounds.
sound_ids (List[int]): list containing the ids of the sound to cluster.
similarity_vectors_map (Dict{int:List[float]}): dictionary with the similarity feature vectors for each sound.

Returns:
Dict: contains the resulting clustering classes and the graph in node-link format suitable for JSON serialization.
@@ -420,17 +338,17 @@
logger.info('Request clustering of {} points: {} ... from the query "{}"'
.format(len(sound_ids), ', '.join(sound_ids[:20]), json.dumps(query_params)))

graph = self.create_knn_graph(sound_ids, features=features)
graph = self.create_knn_graph(sound_ids, similarity_vectors_map=similarity_vectors_map)

if len(graph.nodes) == 0: # the graph does not contain any node
return {'error': False, 'result': None, 'graph': None}
return {'clusters': None, 'graph': None}

partition, num_communities, communities, modularity = self.cluster_graph(graph)

ratio_intra_community_edges = self._ratio_intra_community_edges(graph, communities)

# Discard low quality cluster if there are more than NUM_MAX_CLUSTERS clusters
num_exceeding_clusters = num_communities - clust_settings.NUM_MAX_CLUSTERS
num_exceeding_clusters = num_communities - settings.CLUSTERING_NUM_MAX_CLUSTERS
if num_exceeding_clusters > 0:
for _ in range(num_exceeding_clusters):
graph, partition, communities, ratio_intra_community_edges = self.remove_lowest_quality_cluster(
@@ -459,8 +377,4 @@ def cluster_points(self, query_params, features, sound_ids):
# Export graph as json
graph_json = json_graph.node_link_data(graph)

# Save results to file if SAVE_RESULTS_FOLDER is configured in clustering settings
self._save_results_to_file(query_params, features, graph_json, sound_ids, modularity,
num_communities, ratio_intra_community_edges, ami, ss, ci, communities)

return {'error': False, 'result': communities, 'graph': graph_json}
return {'clusters': communities, 'graph': graph_json}
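Note the return shape changes here from `{'error': ..., 'result': ..., 'graph': ...}` to `{'clusters': ..., 'graph': ...}`, where `clusters` is a list of lists of sound ids. A small sketch of how that list relates to the `{sound_id: cluster_id}` partition dict used by the community-detection steps above (sound ids below are made up for illustration):

```python
# The clusters list and the partition dict are inverses of each other.
def partition_to_clusters(partition):
    clusters = {}
    for sound_id, cluster_id in partition.items():
        clusters.setdefault(cluster_id, []).append(sound_id)
    return [clusters[c] for c in sorted(clusters)]

def clusters_to_partition(clusters):
    # Mirrors the comprehension used in cluster_graph_overlap above.
    return {sound_id: cluster_id
            for cluster_id, cluster in enumerate(clusters)
            for sound_id in cluster}

partition = {'101': 0, '102': 0, '103': 1}
clusters = partition_to_clusters(partition)
print(clusters)  # [['101', '102'], ['103']]
assert clusters_to_partition(clusters) == partition
```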