[FEATURE] Add support for Neural and Hybrid queries via the DSL builder API #735

Open
MikeyCymantix opened this issue Apr 25, 2024 · 4 comments
Labels
enhancement New feature or request

MikeyCymantix commented Apr 25, 2024

Is your feature request related to a problem?

There seems to be a gap in the functionality of the OpenSearch-Python high-level Search client, specifically with respect to the 'Hybrid' and 'Neural' search queries. These query mechanisms are definitely advanced, and are mainly used with the ml_common/NLP functionality that has been rolled out.
It would be great to support these kinds of search queries in the high-level Python search client. Attached is a commit where I added support for neural query types.
Without support for these types, we would be forced to construct the DSL manually, which is exactly the kind of thing the high-level search client should abstract away. Not only is this confusing (why aren't these queries supported?), but it also increases the surface area for bugs.

cls = <class 'opensearchpy.helpers.query.Query'>, name = 'neural'
default = None

    @classmethod
    def get_dsl_class(cls: Any, name: Any, default: Optional[bool] = None) -> Any:
        try:
            return cls._classes[name]
        except KeyError:
            if default is not None:
                return cls._classes[default]
>           raise UnknownDslObject(
                "DSL class `{}` does not exist in {}.".format(name, cls._type_name)
            )

E           opensearchpy.exceptions.UnknownDslObject: DSL class `neural` does not exist in query.
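
For context, the traceback above comes from a name-based class lookup. The following is a minimal, self-contained sketch of that pattern, assuming a simplified registry; it is not opensearch-py's actual implementation. The stand-in `Query` and `UnknownDslObject` classes, the `__init_subclass__` registration, and the `lookup` helper are all assumptions made for illustration. In this sketch, a name with no registered class fails with the same message, and defining a subclass with `name = "neural"` is enough for the lookup to succeed.

```python
class UnknownDslObject(Exception):
    """Stand-in for opensearchpy.exceptions.UnknownDslObject (illustration only)."""


class Query:
    """Simplified stand-in: subclasses register themselves under their `name`."""

    _type_name = "query"
    _classes = {}
    name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name is not None:
            Query._classes[cls.name] = cls

    @classmethod
    def get_dsl_class(cls, name):
        try:
            return cls._classes[name]
        except KeyError:
            raise UnknownDslObject(
                "DSL class `{}` does not exist in {}.".format(name, cls._type_name)
            )


class MatchAll(Query):
    name = "match_all"


def lookup(name):
    """Hypothetical helper: return the class name, or the error text if unknown."""
    try:
        return Query.get_dsl_class(name).__name__
    except UnknownDslObject as exc:
        return str(exc)


before = lookup("neural")  # nothing is registered under "neural" yet


class Neural(Query):
    name = "neural"


after = lookup("neural")  # the lookup now resolves to the new class
```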

What solution would you like?

Support for Advanced Query Types
The addition of 'Neural' and 'Hybrid' query types to the OpenSearch Python client's high-level Search API would be great.

Implementation Details
Neural search is unusual because it involves a dynamically specified embedding field rather than a static field such as the "passage_embedding" often cited in the documentation. This flexibility is pretty important, and I wasn't quite sure how to reflect it in the code. However, I included a working prototype. It's based largely on the FunctionScore query, which also has its own init method.

# Query is the DSL base class from opensearchpy.helpers.query
class Neural(Query):
    name = "neural"

    def __init__(self, **kwargs):
        super().__init__()

        # Use a default so a missing argument raises ValueError, not KeyError
        embedding_field = kwargs.pop("embedding_field", None)
        if not embedding_field:
            raise ValueError("Missing embedding_field argument")

        # A tuple keeps the key order of the generated query deterministic
        required_keys = ("query_text", "model_id", "k")
        missing_keys = [key for key in required_keys if key not in kwargs]
        if missing_keys:
            raise ValueError(f"Missing required fields: {missing_keys}")

        # Nest all required keys under the specified embedding field
        self._params[embedding_field] = {key: kwargs[key] for key in required_keys}

main...MikeyCymantix:opensearch-py:Cymantix_MichaelAlmeida/neural_query

What alternatives have you considered?

We can construct the DSL manually for these types of queries, and that would work fine.
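
As a rough illustration of that manual alternative, here is a minimal sketch that builds the query bodies as plain dicts. The helper names `neural_query` and `hybrid_query` are hypothetical and not part of opensearch-py. The neural shape follows the expected query shown elsewhere in this issue. The hybrid shape (`{"hybrid": {"queries": [...]}}`) is my understanding of the OpenSearch query DSL and should be checked against the docs. `model123` is a placeholder model ID.

```python
from typing import Any, Dict


def neural_query(
    embedding_field: str, query_text: str, model_id: str, k: int = 10
) -> Dict[str, Any]:
    """Hypothetical helper: nest the neural parameters under the embedding field."""
    if not embedding_field:
        raise ValueError("embedding_field must be provided")
    if not model_id:
        raise ValueError("model_id must be provided")
    return {
        "neural": {
            embedding_field: {
                "query_text": query_text,
                "model_id": model_id,
                "k": k,
            }
        }
    }


def hybrid_query(*queries: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: wrap several sub-queries in a hybrid query."""
    if not queries:
        raise ValueError("at least one sub-query is required")
    return {"hybrid": {"queries": list(queries)}}


# Combine a lexical match with a neural clause in one request body.
body = {
    "query": hybrid_query(
        {"match": {"title": "deep learning"}},
        neural_query("passage_embedding", "deep learning", "model123", k=5),
    )
}
```

A dict like `body` could then be handed to the low-level client's search call, but every caller has to get the nesting right by hand, which is the bug surface this request is trying to remove.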

Do you have any additional context?

The error I originally encountered appeared in this code.

**Note: this test is fine; it's just part of my dev workflow when building out libraries. I'm including it to show a working test.**
def test_keyword_search():
    search = Search('movies')
    weights = {"title": 2, "description": 1}
    search.add_keyword_search(query="action", weights=weights)

    expected_query = {
        'multi_match': {
            'query': 'action',
            'fields': ['title^2', 'description^1'],
            'operator': 'or'
        }
    }
    assert search.build() == expected_query, "Keyword search query does not match expected"


**Broken**
def test_neural_search():
    search_instance = Search('articles')
    search_instance.add_neural_search(query="deep learning", model_id="bert_or_something", k=5)
    built_query = search_instance.build()
    print(built_query)
MikeyCymantix added the "enhancement" (New feature or request) and "untriaged" (Need triage) labels Apr 25, 2024

MikeyCymantix commented Apr 25, 2024

Here's some additional information about the commit I referenced.
Working code for constructing neural queries:

    def add_neural_search(self, embedding_field, query_text, model_id, k=10):
        """
        Add a neural search condition to the query.

        Args:
        embedding_field (str): The field under which neural search parameters are placed.
        query_text (str): The search query string for neural search.
        model_id (str): The ID of the neural model used for generating embeddings.
        k (int): The number of nearest neighbors (k) to return.
        """
        if not model_id:
            raise ValueError("Model ID must be provided for neural search.")

        neural_query = Q("neural",
                         embedding_field=embedding_field,
                         query_text=query_text,
                         model_id=model_id,
                         k=k)

        # A neural query is not combined with anything else here; it simply
        # overwrites any existing query.
        self.query = neural_query

Working test:

def test_neural_search():
    # prepare fixtures
    search_instance_fixture = Search('movies')
    search_instance_fixture.add_neural_search(
        embedding_field='passage_embedding',
        query_text="find similar movies",
        model_id="model123",
        k=5
    )
    expected_query_fixture = {
        "neural": {
            "passage_embedding": {
                "query_text": "find similar movies",
                "model_id": "model123",
                "k": 5
            }
        }
    }

    # execute
    built_query = search_instance_fixture.build()

    assert built_query == expected_query_fixture, "Neural search not matching"


dblock commented Apr 25, 2024

Thanks! This looks great.

At a high level, we want as much code as possible generated from https://github.com/opensearch-project/opensearch-api-specification and all the interesting stuff to be hand-rolled here, like you're proposing. Could you check whether some of these request objects can be expressed in the API spec, produce auto-generated code, and then be used by the high-level constructs? In either case, could you make a PR with your proposal, update the user guides, etc.?

MikeyCymantix changed the title from "[FEATURE] Add support for Neural and Hybrid the DSL builder API" to "[FEATURE] Add support for Neural and Hybrid queries via the DSL builder API" Apr 25, 2024
saimedhi removed the "untriaged" (Need triage) label Apr 29, 2024

MikeyCymantix commented Apr 30, 2024

@dblock

Just got approval from my company to work on this. We've added it to our next sprint, which starts Friday, so I'll begin working on this feature request on Monday of next week.

Thank you!

@MikeyCymantix

Just an update:
I'm actively working on this. I've implemented a version that reuses all of the codegen tools, plays well with the other queries, and can be serialized to and from dicts. I have also written tests.

I'm not 100% confident in my implementation yet, and it still needs a large refactor, but the functionality is working; it's just messy.

Once I'm confident, I'll work on creating user guides for use with the higher-level Search client.

A link to the current diff: main...MikeyCymantix:opensearch-py:main
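
For readers following along, the dict round-trip mentioned in this update can be sketched independently of the library. This is a minimal illustration under stated assumptions, not the implementation in the linked diff (which is not shown here). The `NeuralQuery` dataclass and its `to_dict`/`from_dict` methods are hypothetical names. The only detail taken from this issue is the query shape, with the parameters nested under a dynamically named embedding field.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class NeuralQuery:
    """Hypothetical stand-in for a neural query object (illustration only)."""

    embedding_field: str
    query_text: str
    model_id: str
    k: int = 10

    def to_dict(self) -> Dict[str, Any]:
        # The embedding field name becomes the single key under "neural".
        return {
            "neural": {
                self.embedding_field: {
                    "query_text": self.query_text,
                    "model_id": self.model_id,
                    "k": self.k,
                }
            }
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "NeuralQuery":
        body = data["neural"]
        if len(body) != 1:
            raise ValueError("expected exactly one embedding field under 'neural'")
        # Recover the dynamic field name from the single key.
        embedding_field, params = next(iter(body.items()))
        return cls(
            embedding_field=embedding_field,
            query_text=params["query_text"],
            model_id=params["model_id"],
            k=params["k"],
        )


original = NeuralQuery("passage_embedding", "find similar movies", "model123", k=5)
restored = NeuralQuery.from_dict(original.to_dict())
```

The point of the sketch is that the embedding field name has to be recovered from the dict key on the way back in, which is what makes this query type slightly different from ones with fixed parameter names.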
