
Allow setting the prefixes for stanford-nlp models #55

Merged: 3 commits into main on Sep 13, 2024
Conversation

@NohTow (Collaborator) commented Sep 13, 2024

After discussions with @bwanglzu, I realized the outputs of jina-colbert-v2 were not identical to those produced with stanford-nlp.

The problem was twofold:

  1. Because it is a stanford-nlp repository, the prefixes used were [unused0] and [unused1], whereas the model actually uses [QueryMarker] and [DocumentMarker]. Since this parameter cannot be read directly from the repositories, my proposed solution is to let the user define the prefixes when loading the model. This PR adds the ability to set the prefixes for stanford-nlp models, defaulting to the unused tokens when they are not set. It still defaults to [Q] and [D] when the prefixes are not set and the repository is not a stanford-nlp one.
  2. They actually attend to the expansion tokens when encoding queries. Since this functionality is already available in PyLate, the user just has to set attend_to_expansion_tokens to True. I do not have a way to read this from the repository either, but these parameters are stored in the PyLate configuration when saving the model.

Thus, the loading of Jina-colbert-v2 looks like this:

from pylate import models

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    # stanford-nlp prefixes used by jina-colbert-v2 instead of [unused0]/[unused1]
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    # jina-colbert-v2 attends to the expansion (mask) tokens when encoding queries
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)
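
Once loaded, the model can be used like any other PyLate ColBERT model. As a minimal sketch (assuming PyLate's encode API and its is_query flag; the example texts are hypothetical):

# is_query controls which prefix ([QueryMarker] or [DocumentMarker]) is prepended.
queries_embeddings = model.encode(
    ["What is late interaction retrieval?"],  # hypothetical query
    is_query=True,
)
documents_embeddings = model.encode(
    ["ColBERT produces token-level embeddings for late interaction."],  # hypothetical document
    is_query=False,
)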

@bwanglzu

Now that I'm on the branch, I still get slightly different results (although the embeddings are close); it might be related to precision.

@bwanglzu

The mixed-precision manager introduced the minor difference; disabling it yields identical results. LGTM!

@NohTow (Collaborator, Author) commented Sep 13, 2024

The outputs are equivalent to RAGatouille's encode_index_free_queries and encode_index_free_documents, so I think we are good.
You can also pass model_kwargs={"torch_dtype": torch.float16} to handle mixed precision in PyLate.
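
For example, a minimal sketch of loading the model in half precision (model_kwargs is forwarded to the underlying transformers loader, as in sentence-transformers):

import torch
from pylate import models

# Load the weights in float16 to mirror the stanford-nlp mixed-precision setup.
model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},
)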

@bwanglzu

Quick follow-up on my end (just to keep things transparent): PR with sample usage in the jina-colbert repo: https://huggingface.co/jinaai/jina-colbert-v2/discussions/8

@raphaelsty (Collaborator) commented Sep 13, 2024

@NohTow Would it be possible to update the models documentation and add a note on how to load the Jina model?

Otherwise everything looks good to me; it's great to see we support the Jina model:

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)

@NohTow (Collaborator, Author) commented Sep 13, 2024

I added a tip saying that we handle stanford-nlp models and added documentation for the Jina model (and added it to the BEIR tab as well).

@NohTow merged commit d526d98 into main on Sep 13, 2024; 2 checks passed.
@NohTow deleted the jina_fixes branch on October 13, 2024.