[MNT, ENH, DOC] Rework similarity search #2473

baraline · 2024-12-26T19:00:13Z

Reference Issues/PRs

Fixes #2341, #2236, #2028, #2020, #1806, #2475

What does this implement/fix? Explain your changes.

The previous structure for similarity search was not in line with what we would expect from other aeon modules. The lack of distinct base classes for some tasks, together with the initial design choices (made without much practical experience of using and expanding the module), led to some really complex code when working on #2341 to make everything work together. Expanding the module further would have made things worse.

To make the module more flexible and comprehensible, the following rework is proposed in this PR (AEP to be updated accordingly):

The module structure is now:

|-similarity_search
|------series
|------------neighbors # NN on subsequences
|------------motifs # Motif extraction methods on subsequences
|------collection
|------------neighbors # Series methods for subsequence NN adapted to the collection case, or approximate NN on whole series
|------------motifs # Series methods adapted to the collection case

Base classes are BaseSimilaritySearch, BaseSeriesSimilaritySearch, BaseCollectionSimilaritySearch
Implemented estimators are :

(series/neighbors) MassSNN : subsequence nearest neighbors and distance profile computation
(series/neighbors) DummySNN : brute force subsequence nearest neighbors
(series/motifs) StompMotifs : top k motifs extraction (supports motifs pairs, k-motif or r-motifs)
(collection/neighbors) RandomProjectionIndexANN: Approximate nearest neighbors on whole series using a random projection LSH method.
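To make the random projection LSH idea concrete, here is a minimal sketch of sign-based random projection hashing on whole series. It is not the RandomProjectionIndexANN code from this PR: the class name, the `n_bits` parameter and the Hamming-distance ranking are assumptions chosen for illustration.

```python
import numpy as np


class RandomProjectionLSHSketch:
    """Hypothetical illustration; not the estimator implemented in this PR."""

    def __init__(self, n_bits=8, random_state=0):
        self.n_bits = n_bits
        self.random_state = random_state

    def fit(self, X):
        # X: equal-length univariate series, shape (n_cases, n_timepoints).
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.random_state)
        # One random hyperplane per hash bit.
        self.planes_ = rng.standard_normal((self.n_bits, X.shape[1]))
        # The sign of each projection gives one bit of the hash.
        self.hashes_ = (X @ self.planes_.T) > 0
        return self

    def predict(self, x, k=1):
        x = np.asarray(x, dtype=float)
        query_hash = (self.planes_ @ x) > 0
        # Hamming distance between the query hash and every stored hash.
        hamming = np.count_nonzero(self.hashes_ != query_hash, axis=1)
        idx = np.argsort(hamming, kind="stable")[:k]
        return idx, hamming[idx]


rng = np.random.default_rng(42)
X = rng.standard_normal((20, 50))
index = RandomProjectionLSHSketch(n_bits=16).fit(X)
idx, dist = index.predict(X[3], k=3)
```

The returned distances are Hamming distances between hashes, so the result is approximate: two series with the same hash are indistinguishable at this stage.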

The suffix of the estimators (SNN/ANN/Motifs) remains open for discussion; I'm not sure it's the right way to go.

I removed collection support from Stomp and Mass for now to focus on the "expected and well known" use cases; I'll add the collection versions in another PR.

All similarity search estimators now use a fit/predict interface, with predict returning two arrays (NN/motif indexes and NN/motif distances).
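As a rough illustration of that interface, the sketch below shows a brute force subsequence nearest neighbors estimator whose predict returns an index array and a distance array. The class name and parameters (`length`, `k`) are hypothetical and the real estimators in this PR may differ.

```python
import numpy as np


class BruteForceSNNSketch:
    """Hypothetical stand-in for a series/neighbors estimator (not aeon code)."""

    def __init__(self, length, k=1):
        self.length = length  # subsequence length
        self.k = k  # number of neighbors to return

    def fit(self, X):
        # X: a single series of shape (n_channels, n_timepoints).
        self.X_ = np.asarray(X, dtype=float)
        return self

    def predict(self, q):
        # q: query subsequence of shape (n_channels, length).
        q = np.asarray(q, dtype=float)
        n_candidates = self.X_.shape[1] - self.length + 1
        # Distance profile: Euclidean distance from q to every candidate.
        profile = np.array(
            [
                np.linalg.norm(self.X_[:, i : i + self.length] - q)
                for i in range(n_candidates)
            ]
        )
        indexes = np.argsort(profile, kind="stable")[: self.k]
        return indexes, profile[indexes]


X = np.arange(20, dtype=float).reshape(1, 20) ** 2
snn = BruteForceSNNSketch(length=4, k=2).fit(X)
indexes, distances = snn.predict(X[:, 5:9])
```

Here the query is taken from the fitted series itself, so the best match is its own position with distance 0.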

Does your contribution introduce a new dependency? If yes, which one?

No.

Any other comments?

As this is still a WIP, I would love some input on the structure (notably from @patrickzib!) to make the module easier to use and more robust to future additions.

TODO list :

Finish including the testing suite for base estimators in the testing module for the SubsequenceSearch part and fix them
Implement LSH index as a simple first case for BaseCollectionSimilaritySearch
Implement tests for base classes and estimators
Update API docs / doc pages
Update notebooks
Check docstrings
Cleanup TODOs in the code
updated aeon's CODEOWNERS to receive notifications about future changes to these files.


aeon-actions-bot · 2024-12-26T19:00:36Z

Thank you for contributing to `aeon`

I have added the following labels to this PR based on the title: [ documentation, enhancement, maintenance ].
I have added the following labels to this PR based on the changes made: [ similarity search ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

Run pre-commit checks for all files
Run mypy typecheck tests
Run all pytest tests and configurations
Run all notebook example tests
Run numba-disabled codecov tests
Stop automatic pre-commit fixes (always disabled for drafts)
Disable numba cache loading
Push an empty commit to re-run CI checks


patrickzib · 2025-01-02T13:53:45Z

Thank you very much for working on this.

Some thoughts:

> Focus the module on two distinct tasks: find_neighbors and find_motifs, for all types of similarity search estimators. Similarly to the fit/predict interface we already know well, here we first fit and then call either find_motifs or find_neighbors (the "predict" keyword doesn't make much sense here). We give a collection to use as the database in fit, and a single series in find_neighbors or find_motifs to use as the query for the search.

That is an interesting problem. Here is my view:

For whole series similarity search fit requires a dataset of time series of equal length, and find_neighbors would get one or many query series of this length.

For subsequence similarity search fit requires a single time series, and find_neighbors commonly gets a single query sequence whose length is shorter than the single series length. It would be fine, however, to extend it to multiple short sequences.

There is only one whole series consensus motif search paper, which would be the use case of whole matching and motif discovery. The input to fit would be the whole dataset, and find_motif has no input series X. I'm not sure what an input series X should trigger.

Most papers solve the problem of motif discovery in a single long time series, defined as subsequences of the time series. Here, fit gets a single series, and find_motif has no input series X.

> Distinguish between two kinds of similarity search tasks with the two submodules, SeriesSearch and SubsequencesSearch. The SubsequencesSearch focuses on tasks for which the goal is to find motifs or neighbors in subsequences of time series (e.g. Matrix Profiles, Motiflets, etc.). The SeriesSearch focuses on tasks using whole series (e.g. indexes such as LSH, iSAX, etc.).

> Have base classes for families of methods to limit code duplication (e.g. BaseMatrixProfile, and STOMP, where most existing code was ported), so that we can focus on implementing the computational logic when adding new estimators.

What is the difference between BaseMatrixProfile and STOMP?

At least for Motiflets, we cannot use STOMP/MP, as it only gives a 1-NN profile, but we need k-NN profiles. The same problem would arise if you wanted to solve k-nearest neighbors similarity search.
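To illustrate the difference between a 1-NN profile and a k-NN profile, here is a brute force sketch for a univariate series. The function name and the `exclusion` parameter are hypothetical, and the exclusion only masks trivial matches around each subsequence itself (the k neighbors of one subsequence may still overlap each other), which is a simplification.

```python
import numpy as np


def knn_profile(x, length, k, exclusion=None):
    """k-NN distance and index profiles of a univariate series (brute force)."""
    x = np.asarray(x, dtype=float)
    if exclusion is None:
        exclusion = length // 2
    # All subsequences, shape (n_subsequences, length).
    subs = np.lib.stride_tricks.sliding_window_view(x, length)
    n = subs.shape[0]
    # Pairwise Euclidean distances between subsequences.
    dists = np.linalg.norm(subs[:, None, :] - subs[None, :, :], axis=2)
    # Mask trivial matches close to the diagonal.
    offsets = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    dists[offsets <= exclusion] = np.inf
    idx = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dists, idx, axis=1), idx


x = np.sin(np.linspace(0, 8 * np.pi, 80))
d, i = knn_profile(x, length=10, k=3)
one_nn_profile = d[:, 0]  # what a 1-NN (matrix) profile stores
```

Column 0 of the result is the usual 1-NN profile; the remaining columns are the extra information that a k-NN based method needs.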

> Questions to answer for motif search:
>
> Do we want to make providing X optional in find_motifs? Providing X means that we search for subsequences in X that are motifs in the collection given in fit. Not providing X would mean that we search for motifs in the collection given in fit only. I think it would make sense to make it optional, but would love some comments from people actually doing motif discovery.

I think that X is not meaningful for motif discovery.

baraline · 2025-01-02T14:20:37Z

Thanks for the input, @patrickzib 😄

> For whole series similarity search fit requires a dataset of time series of equal length, and find_neighbors would get one or many query series of this length.

Completely in line with this, but what about the case of unequal length series, with, for example, elastic distance measures? Wouldn't that be a plausible use case? (Not all whole series estimators would have to support it.)

> For subsequence similarity search fit requires a single time series, and find_neighbors commonly gets a single query sequence whose length is shorter than the single series length. It would be fine, however, to extend it to multiple short sequences.

For this case, I'm defining a length parameter during __init__, and accepting a 2D subsequence of shape (n_channels, length) for find_neighbors, ensuring that length is not bigger than the length of any series given in fit. However, is there any reason why you think we should restrict ourselves to a single series for fit?

> There is only one whole series consensus motif search paper, which would be the use case of whole matching and motif discovery. The input to fit would be the whole dataset, and find_motif has no input series X. I'm not sure what an input series X should trigger.

This is the tricky one for me too. I'm not sure how giving X during find_motifs on whole series would fit any use case.

> Most papers solve the problem of motif discovery in a single long time series, defined as subsequences of the time series. Here, fit gets a single series, and find_motif has no input series X.

I've been kinda frustrated by this limitation for practical use cases. Wouldn't it be fine to loop over the series of a collection with the motif discovery methods and then merge the results? That's how I implemented STOMP for now, for example. For each subsequence in X, it computes the distance profile to all series in a collection, and keeps the top k among all of them (also storing the sample ID and timepoint ID).

> What is the difference between BaseMatrixProfile and STOMP?
>
> At least for Motiflets, we cannot use STOMP/MP, as it only gives a 1-NN profile, but we need k-NN profiles. The same problem would arise if you wanted to solve k-nearest neighbors similarity search.

BaseMatrixProfile (which inherits from BaseSubsequenceSearch) is simply a base class that defines abstract compute_matrix_profile and compute_distance_profile methods to be implemented by child classes such as STOMP and the like (STUMP, etc.). The logic for finding neighbors / motifs is then handled in BaseMatrixProfile. I wanted to leave the door open to alternative methods and not just focus on matrix profiles, hence the split.

As stated above, I already extended STOMP to support k-NN profiles for collections (multivariate and unequal length compatible).

I suppose that in this context, motiflets would inherit from BaseMatrixProfile if you need to implement methods like compute_matrix_profile and compute_distance_profile. Otherwise, it would inherit from BaseSubsequenceSearch and define its own methods to answer the find_neighbors/find_motifs tasks. (I would need to read the paper again!)

Note that it's possible to simply raise a "NotImplementedError" or something similar if an estimator would only support neighbors or motifs search.
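A rough sketch of the split described above, with shared neighbor extraction in the base class and only the profile computation in the child. All class names here are hypothetical, the series is assumed univariate, and compute_matrix_profile is left as a NotImplementedError fallback rather than an abstract method so that a neighbors-only child can skip it.

```python
from abc import ABC, abstractmethod

import numpy as np


class BaseProfileSearchSketch(ABC):
    """Hypothetical base: children compute profiles, the base extracts results."""

    def __init__(self, length):
        self.length = length

    def fit(self, X):
        self.X_ = np.asarray(X, dtype=float)  # single univariate series
        return self

    @abstractmethod
    def compute_distance_profile(self, q):
        """Distance from query q to every subsequence of the fitted series."""

    def compute_matrix_profile(self):
        # Children that only support neighbors search can leave this as is.
        raise NotImplementedError(
            f"{type(self).__name__} does not support motif search"
        )

    def find_neighbors(self, q, k=1):
        # Shared logic lives in the base class.
        profile = self.compute_distance_profile(np.asarray(q, dtype=float))
        idx = np.argsort(profile, kind="stable")[:k]
        return idx, profile[idx]


class NaiveProfile(BaseProfileSearchSketch):
    def compute_distance_profile(self, q):
        subs = np.lib.stride_tricks.sliding_window_view(self.X_, self.length)
        return np.linalg.norm(subs - q, axis=1)


est = NaiveProfile(length=3).fit([0, 1, 2, 3, 10, 11, 12])
idx, dist = est.find_neighbors([10, 11, 12])
```

A faster child (for instance one using FFT-based distance profiles) would only override compute_distance_profile and reuse find_neighbors unchanged.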

My goal here is to find a base class structure that enables us to move most common code to there and focus on the computational optimisations of each method in the child classes.

> I think that X is not meaningful for motif discovery.

In the context of motif search in a single series I agree, but wouldn't there be some interest when dealing with a collection? For example, finding motifs in the collection on the condition that they are similar to a subsequence in X? (This is pure speculation.)

patrickzib · 2025-01-02T16:13:42Z

> Completely in line with this, but what about the case of unequal length series, with, for example, elastic distance measures? Wouldn't that be a plausible use case? (Not all whole series estimators would have to support it.)

Sure. I did not think of this.

> > For subsequence similarity search fit requires a single time series, and find_neighbors commonly gets a single query sequence whose length is shorter than the single series length. It would be fine, however, to extend it to multiple short sequences.
>
> For this case, I'm defining a length parameter during __init__, and accepting a 2D subsequence of shape (n_channels, length) for find_neighbors, ensuring that length is not bigger than the length of any series given in fit. However, is there any reason why you think we should restrict ourselves to a single series for fit?

Simplicity :) But I agree that you could have multiple series in fit, too - this would mimic the Shapelet use case, I suppose?

> > Most papers solve the problem of motif discovery in a single long time series, defined as subsequences of the time series. Here, fit gets a single series, and find_motif has no input series X.
>
> I've been kinda frustrated by this limitation for practical use cases. Wouldn't it be fine to loop over the series of a collection with the motif discovery methods and then merge the results? That's how I implemented STOMP for now, for example. For each subsequence in X, it computes the distance profile to all series in a collection, and keeps the top k among all of them (also storing the sample ID and timepoint ID).

Sorry, yes, that is what the authors refer to as consensus motif:
https://www.cs.ucr.edu/~eamonn/consensus_Motif_ICDM_Long_version.pdf

> BaseMatrixProfile (which inherits from BaseSubsequenceSearch) is simply a base class that defines abstract compute_matrix_profile and compute_distance_profile methods to be implemented by child classes such as STOMP and the like (STUMP, etc.).

I see. I personally do not like to use the term matrix profile for simple k-NN distances or k-NN indices, though. It was a brilliant re-framing by EK, such that all 1-NN algorithms are now suddenly an instance of matrix profile. Yet the concept is much older.

> As stated above, I already extended STOMP to support k-NN profiles for collections (multivariate and unequal length compatible).

Great.

> > I think that X is not meaningful for motif discovery.
>
> In the context of motif search in a single series I agree, but wouldn't there be some interest when dealing with a collection? For example, finding motifs in the collection on the condition that they are similar to a subsequence in X? (This is pure speculation.)

I would not say that this is impossible, but I have not seen it. :)

baraline · 2025-01-02T16:23:36Z

> Simplicity :) But I agree that you could have multiple series in fit, too - this would mimic the Shapelet use case, I suppose?

I'm not 100% sure what you mean, but in a sense yes? For example, with a brute force neighbour search, just compute the distance of the subsequence given in find_neighbors to all candidate subsequences in all series of the collection given in fit, and take the k best overall (taking neighbouring matches/self matches into account if specified by parameters).
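A minimal sketch of that brute force search over a collection, returning (sample ID, timepoint ID) pairs for the k best candidates overall. The function name is hypothetical, series may have unequal lengths, and the handling of neighbouring/self matches is deliberately left out.

```python
import numpy as np


def collection_neighbors(collection, q, k=1):
    """Top-k matches of q over all series; returns (sample id, timepoint) pairs."""
    q = np.asarray(q, dtype=float)  # shape (n_channels, length)
    length = q.shape[-1]
    ids, dists = [], []
    for i_case, series in enumerate(collection):
        series = np.asarray(series, dtype=float)  # (n_channels, n_timepoints)
        for t in range(series.shape[1] - length + 1):
            ids.append((i_case, t))
            dists.append(np.linalg.norm(series[:, t : t + length] - q))
    dists = np.asarray(dists)
    # k best candidates across the whole collection.
    best = np.argsort(dists, kind="stable")[:k]
    return np.asarray(ids)[best], dists[best]


collection = [
    np.arange(10.0).reshape(1, 10),
    np.arange(6.0).reshape(1, 6) * 2,
]
ids, d = collection_neighbors(collection, [[4.0, 6.0, 8.0]], k=2)
```

In this toy collection the exact match sits in the second series at timepoint 2, and the runner-up comes from the first series.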

> I see. I personally do not like to use the term matrix profile for simple k-NN distances or k-NN indices, though. It was a brilliant re-framing by EK, such that all 1-NN algorithms are now suddenly an instance of matrix profile. Yet the concept is much older.

I'm not against the idea of a different naming, especially if methods labelled differently from MPs would fit in the base class without much change of parameter/interface. Would you have any proposal? Something like BaseNeighborhoodSearch ?

patrickzib · 2025-01-06T11:51:45Z

> I'm not against the idea of a different naming, especially if methods labelled differently from MPs would fit in the base class without much change of parameter/interface. Would you have any proposal? Something like BaseNeighborhoodSearch ?

In sklearn it is simply NearestNeighbors ? :) And it returns indices and distances.

https://scikit-learn.org/1.5/modules/neighbors.html

review-notebook-app · 2025-01-13T09:50:56Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB


baraline · 2025-02-01T13:07:35Z

I'll leave the implementation of k-motiflets for later as a separate estimator so I can focus on getting the testing, tags and docs right for this PR to have the structure for the module set.


baraline added 4 commits December 4, 2024 22:47

WIP remake module structure

1646895

Update _brute_force.py

52b0692

Update test__commons.py

7973a30

Merge remote-tracking branch 'origin/main' into 2341-enh-indexsearch-…

7abe221

…class-with-attimo-algorithm

baraline linked an issue Dec 26, 2024 that may be closed by this pull request

[ENH] IndexSearch class with attimo or LL algorithm #2341

Open

aeon-actions-bot bot added documentation Improvements or additions to documentation enhancement New feature, improvement request or other non-bug code enhancement maintenance Continuous integration, unit testing & package distribution similarity search Similarity search package labels Dec 26, 2024

baraline mentioned this pull request Dec 26, 2024

adds examples for querysearch #2472

Closed

WIP mock and test

ad02b84

This was linked to issues Dec 27, 2024

[ENH] Rethink the mask parameter in similarity search #2020

Open

[DOC] Update similarity search notebook #2028

Open

[BUG] Test fail with numpy2 for stomp_squared_matrix_profile #2236

Open

This was referenced Dec 30, 2024

[ENH] Improved the test coverage of similarity search module #2476

Closed

[BUG] extract_top_k_and_threshold_from_distance_profiles_one_series Function Errors Related to Numba and n_candidates Parameter #2475

Open

baraline added 3 commits January 1, 2025 09:36

Add test for base subsequence

bb2aa33

Merge remote-tracking branch 'origin/main' into 2341-enh-indexsearch-…

c5c9c28

…class-with-attimo-algorithm

Fix subsequence_search tests

f23c720

baraline linked an issue Jan 2, 2025 that may be closed by this pull request

[BUG] extract_top_k_and_threshold_from_distance_profiles_one_series Function Errors Related to Numba and n_candidates Parameter #2475

Open

baraline added 2 commits January 2, 2025 13:56

debug brute force mp

c372969

more debug of subsequence tests

d7da68b

more debug of subsequence tests

da2758c

Add functional LSH neighbors

2191ac2

add notebook for sim search tasks

cd33d0a

baraline added 4 commits January 16, 2025 09:55

Updated series similarity search

b841b79

Merge remote-tracking branch 'origin/main' into 2341-enh-indexsearch-…

dbe9494

…class-with-attimo-algorithm

Fix mistake addition in transformers and fix base classes

57e5e7b

Fix registry and api reference

2078086

baraline mentioned this pull request Jan 17, 2025

[DOC] Add examples for QuerySearch in SimilaritySearch #1806

Closed

baraline added 6 commits January 17, 2025 22:40

Update documentation and fix some leftover bugs

9effbd9

Update documentation and add default test params

f51d66a

Fix identifiers and test data shape for all_estimators tests

763bdcf

Fix missing params

85c7174

Merge remote-tracking branch 'origin/main' into 2341-enh-indexsearch-…

038f844

…class-with-attimo-algorithm

Fix n_jobs params and tags, add some docs

fd7caad

baraline added 6 commits February 2, 2025 12:46

Fix numba test bug and update testing data for sim search

6e3157b

Fix imports, testing data tests, and impose predict/_predict interfac…

e3ccb3f

…e to all sim search estimators

Fix args

ee7aa58

Fix extract test

bf0c5e8

update docs api and notebooks

0c2d763

remove notes

db10499

baraline marked this pull request as ready for review February 2, 2025 15:08

baraline requested a review from MatthewMiddlehurst as a code owner February 2, 2025 15:08

Merge branch 'main' into 2341-enh-indexsearch-class-with-attimo-algor…

3587de1

…ithm

TonyBagnall requested a review from patrickzib February 3, 2025 17:30
