Addition of a SEARCH operator #533

Closed
wants to merge 1 commit into from

ml-evs
Member

@ml-evs ml-evs commented Nov 9, 2024

Following discussion with @merkys and others in the thread at #398 (comment), I thought I'd attempt to draft something for a loosely defined SEARCH operator. I think this would be pretty useful for quite a few applications where a database wants to enable queries that may not fall directly under the strict semantics we expect for, e.g., string matching or arithmetic.

I've tried to keep it necessarily vague here, but motivate it via examples.

Outstanding issues:

  • Although we don't have any mechanism for reporting it, it is straightforward to "discover" whether a database supports a particular filter construct on a field, e.g., whether a string field supports ENDS WITH -- you just try the query and wait for the error. It will be much harder to discover whether a database supports SEARCH on a field, or how it is interpreted. Should we a) add a specific info metadata field for this for the moment, along the lines of sortable (searchable?), or b) require that the database describes the search semantics on a field in full at its info endpoint? This is a bit tricky, as this implementation-specific definition would overlap with the field definition itself, especially in cases where a database wants to enable search on an already standardized OPTIMADE field (like the chemical_formula_reduced example in this draft).
  • There are cases where it might be helpful to also return the search_score indicating the amount to which an entry fits the SEARCH (writing this with compositional/structural/substructural similarity in mind) -- should we define a reserved keyword for this in the new entry-level metadata?
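
The "try the query and wait for the error" approach for existing operators could look roughly like the sketch below. It assumes that a server answers an unsupported filter construct with HTTP 501 and a rejected filter with HTTP 400; the `fetch_status` callable, the helper names, and the fake client are hypothetical stand-ins for a real HTTP client and provider.

```python
from typing import Callable
from urllib.parse import quote


def build_probe_url(base_url: str, endpoint: str, filter_expr: str) -> str:
    """Build a one-entry query URL used only to test whether a filter is accepted."""
    return f"{base_url.rstrip('/')}/{endpoint}?filter={quote(filter_expr)}&page_limit=1"


def probe_filter_support(
    fetch_status: Callable[[str], int],
    base_url: str,
    endpoint: str,
    filter_expr: str,
) -> str:
    """Classify a filter as 'supported', 'unsupported' or 'unknown' from the status code.

    Assumption: 501 means the construct is not implemented and 400 means the
    server rejected the filter; any other non-200 code is treated as inconclusive.
    """
    status = fetch_status(build_probe_url(base_url, endpoint, filter_expr))
    if status == 200:
        return "supported"
    if status in (400, 501):
        return "unsupported"
    return "unknown"


def fake_fetch_status(url: str) -> int:
    """Hypothetical stand-in for an HTTP client: pretends ENDS WITH is not implemented."""
    return 501 if "ENDS%20WITH" in url else 200
```

A 200 response to a SEARCH query would only show that the operator was accepted, not how the value was interpreted, which is the discoverability gap raised in the first bullet.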

@merkys
Member

merkys commented Nov 10, 2024

Personally, I do not see many use cases for a search operator for which every implementation is free to choose the search method. When replying in #398 (comment) I was thinking more about a SEARCH operator which still has standardized behavior for a given data type, but whose behavior may differ between data types (e.g., for SMILES this is a SMARTS query; for a string this is a regular expression, and so on).

When speaking about SMILES and SMARTS, I think users need to know what the underlying implementation does. As the PR is written now, implementation A is free to implement SEARCH as a SMARTS query, whereas implementation B may implement it as a substring search on the SMILES string. Clearly then CaaO will return meaningful results from A and no results from B (invalid SMILES).
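
The divergence between two implementations can be illustrated with a toy sketch. A real SMARTS matcher needs a cheminformatics toolkit, so a regular expression is used here purely as a stand-in for "pattern-based" semantics; the function names and the three SMILES strings are illustrative assumptions, not part of the PR.

```python
import re
from typing import List

# Illustrative data: ethanol, acetic acid, benzene.
SMILES = ["CCO", "CC(=O)O", "c1ccccc1"]


def search_as_substring(query: str, values: List[str]) -> List[str]:
    """Implementation B: treat the SEARCH value as a literal substring."""
    return [v for v in values if query in v]


def search_as_pattern(query: str, values: List[str]) -> List[str]:
    """Implementation A (stand-in): treat the SEARCH value as a regular expression."""
    return [v for v in values if re.search(query, v)]


# The same query value gives different results under the two interpretations.
pattern_hits = search_as_pattern("C.O", SMILES)
substring_hits = search_as_substring("C.O", SMILES)
```

With this data the pattern interpretation matches `CCO`, while the substring interpretation matches nothing, and a client has no way to tell which behavior it is getting.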

@ml-evs
Member Author

ml-evs commented Nov 10, 2024

> Personally, I do not see many use cases for a search operator for which every implementation is free to choose the search method. When replying in #398 (comment) I was thinking more about a SEARCH operator which still has standardized behavior for a given data type, but whose behavior may differ between data types (e.g., for SMILES this is a SMARTS query; for a string this is a regular expression, and so on).
>
> When speaking about SMILES and SMARTS, I think users need to know what the underlying implementation does. As the PR is written now, implementation A is free to implement SEARCH as a SMARTS query, whereas implementation B may implement it as a substring search on the SMILES string. Clearly then CaaO will return meaningful results from A and no results from B (invalid SMILES).

Yes, this kicks the can down the road a bit... But I was imagining the cheminfo namespace would define the search semantics on the relevant fields/types, if we don't get to the point of standardizing it in the core of OPTIMADE. In that sense, this PR basically just reserves the keyword in the filter language so that we can drill down on any more specific semantics we need to define.

where this is not supported, the API should respond with a clear error message.
It is RECOMMENDED that providers do not allow chaining together multiple :filter-fragment:`SEARCH` operations, but MAY allow e.g., a list value for the :filter-fragment:`SEARCH` which can be considered the equivalent :filter-fragment:`<property> SEARCH [x, y] === <property> SEARCH x AND property SEARCH y`

**Examples**:
Suggested change
**Examples**:
**Examples**:

Comment on lines +1996 to +1997
This operator can act on any field and value type, and can be interpreted by the
database provider in any way they desire.
Suggested change
This operator can act on any field and value type, and can be interpreted by the
database provider in any way they desire.
This operator can act on any field and value type, and can be interpreted by the database provider in any way they desire.

Comment on lines +2000 to +2002
The cutoff for 'relevance' can be entirely decided by the database; it is not
necessary to rank and return all entries in the database according to the search
criteria.
Suggested change
The cutoff for 'relevance' can be entirely decided by the database; it is not
necessary to rank and return all entries in the database according to the search
criteria.
The cutoff for 'relevance' can be entirely decided by the database; it is not necessary to rank and return all entries in the database according to the search criteria.

Comment on lines +2006 to +2007
Where implemented, it MAY be used in conjunction with other filters; in cases
where this is not supported, the API should respond with a clear error message.
Suggested change
Where implemented, it MAY be used in conjunction with other filters; in cases
where this is not supported, the API should respond with a clear error message.
Where implemented, it MAY be used in conjunction with other filters; in cases where this is not supported, the API should respond with a clear error message.
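
The draft text under review also allows a list value for SEARCH as shorthand for a conjunction of single-value searches. A minimal sketch of that rewriting is below; the SEARCH syntax is taken from the draft (it was never merged), and `expand_search` is a hypothetical helper name.

```python
import json
from typing import List, Union


def expand_search(prop: str, value: Union[str, List[str]]) -> str:
    """Rewrite a list-valued SEARCH as a conjunction of single-value SEARCH clauses."""
    values = [value] if isinstance(value, str) else list(value)
    if not values:
        raise ValueError("SEARCH needs at least one value")
    # json.dumps yields a double-quoted, escaped string literal for each value.
    return " AND ".join(f"{prop} SEARCH {json.dumps(v)}" for v in values)


expanded = expand_search("chemical_formula_reduced", ["Ti", "O2"])
```

Rewriting on the client or server side like this keeps each clause a single-value SEARCH while still giving users the list shorthand.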

@merkys
Member

merkys commented Nov 14, 2024

> Yes, this kicks the can down the road a bit... But I was imagining the cheminfo namespace would define the search semantics on the relevant fields/types, if we don't get to the point of standardizing it in the core of OPTIMADE. In that sense, this PR basically just reserves the keyword in the filter language so that we can drill down on any more specific semantics we need to define.

I think having a reserved operator for type-specific queries is a nice idea. Having such an operator would shift the responsibility for query standardization from the main specification to the type-governing namespaces. However, I feel that the current draft should be rewritten to convey this precise meaning.

I like the idea of introducing the relevance metadata field. It makes a lot of sense to define it as an implementation-specific value.

Edit: I have now noticed that you were talking about property-specific search semantics while I was talking about type-specific ones. Yours (property-specific) would allow greater granularity, which I like. However, my concern about the need for standardization (in namespaces) still stands.

@rartino
Contributor

rartino commented Nov 15, 2024

Maybe I'm missing a bigger picture behind suggesting this feature, but I prefer the related feature that has been suggested previously and I think overlaps with this idea: custom data types that are allowed to provide their own definitions of all operators included in the OPTIMADE grammar. So, for example for SMILES - if not already standardized - a SMILES data type could be defined by an implementation or prefix organization with some chosen filter semantics.

But, then, what operator should one define for searching, e.g., SMILES? I think the answer is the string regex operator "MATCH" (or possibly "MATCHES"), which is not yet implemented/merged and which IMO grammatically fits the current structure of the OPTIMADE filter language better than "SEARCH".

The danger with defining a "SEARCH" operator as "implementation-specific search" is that it can easily lead to confusion about things working differently without it being clear why.
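
A rough sketch of the "custom data types define their own operator semantics" idea is below. The type names `string` and `formula`, the registry, and the element-based override are all illustrative assumptions; the override is a toy and not a proposal for how any OPTIMADE type should behave.

```python
import re
from typing import Callable, Dict, Set, Tuple

Predicate = Callable[[str, str], bool]


def elements(formula: str) -> Set[str]:
    """Extract element symbols (capital letter plus optional lowercase letter)."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def string_contains(stored: str, query: str) -> bool:
    """Default string semantics: literal substring test."""
    return query in stored


def formula_contains(stored: str, query: str) -> bool:
    """Hypothetical override: true if every element in the query occurs in the stored formula."""
    return elements(query) <= elements(stored)


# Operator definitions keyed by (data type, operator); both type names are illustrative.
OPERATORS: Dict[Tuple[str, str], Predicate] = {
    ("string", "CONTAINS"): string_contains,
    ("formula", "CONTAINS"): formula_contains,
}


def evaluate(data_type: str, operator: str, stored: str, query: str) -> bool:
    """Dispatch an operator to the definition registered for the property's data type."""
    try:
        predicate = OPERATORS[(data_type, operator)]
    except KeyError:
        raise NotImplementedError(f"{operator} is not defined for type {data_type!r}")
    return predicate(stored, query)
```

Here `CONTAINS` with the value `OH` is false for `H2O` under plain string semantics but true under the custom type, and the difference is traceable to the declared data type rather than to an unspecified search method.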

@ml-evs
Member Author

ml-evs commented Nov 16, 2024

I'm going to close this to avoid polluting the discussions. In summary, I think the preference was to overload our current filter operations (and allow them to break semantics on custom data types). The only bit that this leaves out is the kind of search that returns a ranking of matches, rather than a simple enumeration of boolean matches.
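
For reference, the ranked kind of search that remains uncovered might look like the sketch below. It assumes a per-entry `search_score` reported under entry-level metadata (the keyword floated earlier in this thread, not a standardized field), a provider-chosen cutoff, and `difflib` string similarity as a placeholder for whatever compositional or structural similarity a database would really use.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def ranked_search(entries: Dict[str, str], query: str, cutoff: float = 0.5) -> List[dict]:
    """Return entries whose similarity to the query meets the cutoff, best match first."""
    results = []
    for entry_id, value in entries.items():
        # Placeholder similarity in [0, 1]; a real provider would use its own measure.
        score = SequenceMatcher(None, query, value).ratio()
        if score >= cutoff:
            results.append({"id": entry_id, "meta": {"search_score": round(score, 3)}})
    return sorted(results, key=lambda r: r["meta"]["search_score"], reverse=True)


# Illustrative data keyed by hypothetical entry IDs.
hits = ranked_search({"a": "TiO2", "b": "Ti2O3", "c": "NaCl"}, "TiO2")
```

Unlike a boolean filter, the result depends on both the similarity measure and the cutoff, so neither can be inferred from the response alone unless the provider documents them.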

@ml-evs ml-evs closed this Nov 16, 2024
4 participants