Add an option to not encode sentencepiece during training/decoding allowing passing of spmIDs directly #1003
Conversation
If we prefer to produce output in SPM pieces, then we could use this for mapping.
I was initially of the opinion that token ids would be safer, but looking at how byte-fallback pieces look, I'd say pieces are fine. Maybe even better, because they're somewhat human-readable and you can see what's going on.

>>> spm.encode('🤣', out_type=str)
['▁', '<0xF0>', '<0x9F>', '<0xA4>', '<0xA3>']
>>> spm.encode('🤣', out_type=int)
[275, 247, 166, 171, 170]
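Those byte-fallback pieces are also mechanically recoverable without the SentencePiece library itself. A minimal sketch in plain Python (the `<0xNN>` piece format and the `▁` word-boundary marker are taken from the output above; the helper name is ours):

```python
import re

def decode_byte_fallback(pieces):
    """Reassemble byte-fallback pieces like '<0xF0>' into text.

    Consecutive <0xNN> pieces are collected into a byte buffer and
    decoded as UTF-8; the '▁' marker piece maps back to a space.
    """
    buf = bytearray()
    out = []
    for piece in pieces:
        m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", piece)
        if m:
            buf.append(int(m.group(1), 16))
        else:
            if buf:
                out.append(buf.decode("utf-8"))
                buf = bytearray()
            out.append(piece.replace("\u2581", " "))
    if buf:
        out.append(buf.decode("utf-8"))
    return "".join(out)

pieces = ["\u2581", "<0xF0>", "<0x9F>", "<0xA4>", "<0xA3>"]
print(decode_byte_fallback(pieces))  # → " 🤣" (leading space from '▁')
```

This is one reason pieces are arguably friendlier than bare ids: the byte-fallback encoding is self-describing, whereas the integer ids are meaningless without the model's vocabulary.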
Updated to use SPM pieces as opposed to SPM vocab ids, so that the input can also be somewhat human-readable.
Careful: in SP models without byte fallback, unknown characters are left as they are, instead of using the unk token, when tokenizing into pieces:

>>> spm.encode('ç', out_type=int)
[25, 0]
>>> spm.encode('ç', out_type=str)
['▁', 'ç']
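A downstream guard against this pass-through behaviour can be sketched without the library: compare the emitted pieces against the model's vocabulary, since (as the ids above show) anything the id-based encoding would map to id 0 (`<unk>`) survives verbatim in the piece-based encoding. The vocab here is a toy stand-in; a real check would read the `.vocab` file produced alongside the SentencePiece model:

```python
def find_passthrough_pieces(pieces, vocab):
    """Return pieces that are absent from the vocabulary.

    Without byte fallback, piece-level encoding leaves unknown
    characters verbatim, while id-level encoding maps them to
    <unk>; membership in the vocab distinguishes the two cases.
    """
    return [p for p in pieces if p not in vocab]

# Hypothetical toy vocabulary, for illustration only.
vocab = {"\u2581", "<unk>", "en", "g", "lish", "text"}
print(find_passthrough_pieces(["\u2581", "ç"], vocab))  # → ['ç']
```

Pieces flagged this way could then be replaced with `<unk>` (or rejected) before being fed to the decoder.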
There is basically already a way to do that. If you use …
Blocking for now until comment about using existing capabilities is resolved.
Hi, so the goal of this change is to allow training to be done on SPM'd corpora, while translation/validation still happens on de-SPM'd corpora, so you can get accurate BLEU scores. It also brings parity to the …
@ZJaume, I just tested with a sentencepiece model and vocab that doesn't have byte fallback, and it seems that it does indeed pass through unks:

$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm
▁en g lish ▁text ▁ бг ▁ текст ▁ 靐
$ cat test.bg
english text бг текст 靐
$ cat test.bg | ~/marian-dev/build/spm_encode --model vocab.esen.spm | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --no-spm-encode --quiet --quiet-translation
texto english
$ cat test.bg | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.esen.spm vocab.esen.spm --mini-batch 1 --maxi-batch 1 --cpu-threads 1 --quiet --quiet-translation
texto english

I also looked at the source, and in light of this, I think this is ready to merge.
@XapaJIaMnu Will you resolve the conflicts (they seem simple) and update the patch number in the VERSION file, or would you prefer me to do that? I can then merge.
I think I fixed it, @snukky.
Description
This PR adds the ability to train or decode with a sentence that has already had
spm_encode --model model.spm
applied to it. The benefit is that we can apply SPM modifications prior to feeding the data to Marian, giving us more flexibility than what SPM allows.
The code is minimally intrusive and doesn't change behavior unless the flag is toggled on.
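The guarded-flag shape described above can be sketched as follows. This is illustrative Python, not Marian's actual C++ internals: the names `no_spm_encode` and `spm_encode_fn`, and splitting pre-encoded input on whitespace, are assumptions for the sketch.

```python
def load_input_line(line, no_spm_encode, spm_encode_fn):
    """Sketch of a guarded code path: with the flag off, behavior is
    unchanged (the line is encoded internally); with the flag on, the
    line is assumed to already be SPM pieces and is split on spaces."""
    if no_spm_encode:
        return line.split()        # pieces were produced upstream
    return spm_encode_fn(line)     # default path, untouched

# Stand-in for the real encoder, for illustration only.
fake_encoder = lambda s: ["\u2581" + s]

print(load_input_line("hello", False, fake_encoder))        # → ['▁hello']
print(load_input_line("\u2581he llo", True, fake_encoder))  # → ['▁he', 'llo']
```

Keeping the new behavior behind a single branch like this is what makes the change minimally intrusive: the default path never sees the flag.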
How to test
Checklist