
Allow setting the prefixes for stanford-nlp models #55

Merged: 3 commits into main on Sep 13, 2024
Conversation

@NohTow (Collaborator) commented Sep 13, 2024

After discussions with @bwanglzu, I realized the outputs of jina-colbert-v2 were not identical to those produced with stanford-nlp.

The problem was twofold:

  1. Because it is a stanford-nlp repository, the prefixes used were [unused0] and [unused1], whereas the model actually uses [QueryMarker] and [DocumentMarker]. Since this parameter cannot be read directly from the repositories, my proposed solution is to let the user define the prefixes when loading the model. This PR adds the ability to set the prefixes for stanford-nlp models, defaulting to the unused tokens when they are not set. It still defaults to [Q] and [D] when the prefixes are not set and the repository is not a stanford-nlp one.
  2. They actually attend to the expansion tokens when encoding queries. Since this functionality is already available in PyLate, the user just has to set attend_to_expansion_tokens to True. I do not have a way to read this from the repository either, but these parameters are stored in the PyLate configuration when saving the model.

Thus, the loading of Jina-colbert-v2 looks like this:

from pylate import models

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    # stanford-nlp prefixes used by jina-colbert-v2 instead of [unused0]/[unused1]
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    # jina-colbert-v2 attends to the expansion (mask) tokens when encoding queries
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)
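
Once loaded, the model can be used like any other PyLate ColBERT model. As a minimal sketch (assuming PyLate's encode API and its is_query flag; the example texts are hypothetical):

# is_query controls which prefix ([QueryMarker] or [DocumentMarker]) is prepended.
queries_embeddings = model.encode(
    ["What is late interaction retrieval?"],  # hypothetical query
    is_query=True,
)
documents_embeddings = model.encode(
    ["ColBERT produces token-level embeddings for late interaction."],  # hypothetical document
    is_query=False,
)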

@bwanglzu

Now that I'm on the branch, I still get slightly different results (although the embeddings are close); it might be related to precision.

@bwanglzu

The mixed-precision manager introduced the minor difference; disabling it yields identical results. LGTM!

@NohTow (Collaborator, Author) commented Sep 13, 2024

The outputs are equivalent to RAGatouille's encode_index_free_queries and encode_index_free_documents, so I think we are good.
You can also pass model_kwargs={"torch_dtype": torch.float16} to handle mixed precision in PyLate.
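
For example, a minimal sketch of loading the model in half precision (model_kwargs is forwarded to the underlying transformers loader, as in sentence-transformers):

import torch
from pylate import models

# Load the weights in float16 to mirror the stanford-nlp mixed-precision setup.
model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.float16},
)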

@bwanglzu

Quick follow-up on my end (just to keep things transparent): PR with sample usage in the jina-colbert repo: https://huggingface.co/jinaai/jina-colbert-v2/discussions/8

@raphaelsty (Collaborator) commented Sep 13, 2024

@NohTow Would it be possible to update the models documentation and add a note on how to load the Jina model?

Otherwise everything looks good to me; it's great to see we support the Jina model:

model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)

@NohTow (Collaborator, Author) commented Sep 13, 2024

I added a tip saying that we handle stanford-nlp models and added documentation for the Jina model (and added it to the BEIR tab as well).

@NohTow merged commit d526d98 into main on Sep 13, 2024; 2 checks passed.
@NohTow deleted the jina_fixes branch on October 13, 2024.