Allow to set the prefixes for stanford-nlp models #55
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
After discussions with @bwanglzu, I realized the output of jina-colbert-v2 were not identical to the ones using stanford-nlp.
The problem was two-folds:
[unused0]
and[unused1]
, whereas they actually use[QueryMarker]
and[DocumentMarker]
. As this parameter is not directly readable from the repositories, my proposed solution is to let the user define the prefixes when loading the model. The PR had the ability to set the prefixes for stanford-nlp models and only default to unused if not set. It still default to[Q]
and[D]
if not set and not a stanford repo.attend_to_expansion_tokens
to True. I do not have a way to read this from the repository either. These parameters are stored in the PyLate configurations when saving the model though.Thus, the loading of Jina-colbert-v2 looks like this: