[text] refine tokenizer #2165
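A note on the whisper tokenizer: in wenet we chose to add a function on the outside, so the whisper tokenizer here needs a fix for what it pads with.
Confirmed: pos_init does not affect the other functions, so the current implementation should be fine as it stands.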
Regarding
Also: if jit can directly trace the tokenizer's class methods, the conclusion above would have to be overturned.
Build the sp model lazily, see https://github.com/espnet/espnet/blob/master/espnet2/text/sentencepiece_tokenizer.py#L14-L17
The whisper tiktoken tokenizer has problems in a multi-process environment, so it has also been changed to a lazy build.
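The lazy-build idea in the two comments above can be sketched roughly as follows. This is only an illustration and not the code in this PR: `LazyTokenizer`, `build_char_model`, and the `_model` attribute are hypothetical names, and a plain character splitter stands in for the real sentencepiece or tiktoken model. The assumption is that the heavy model object should not be created in `__init__`, so the tokenizer can be handed to worker processes before anything is built.

```python
from typing import Callable, List, Optional


def build_char_model() -> Callable[[str], List[str]]:
    # Hypothetical stand-in for loading a real model (sentencepiece, tiktoken, ...).
    def split_chars(text: str) -> List[str]:
        return list(text)

    return split_chars


class LazyTokenizer:
    """Hypothetical tokenizer wrapper that builds its model on first use."""

    def __init__(self, build_fn: Callable[[], Callable[[str], List[str]]]):
        self._build_fn = build_fn
        # Nothing heavy is created here; the model is built on demand.
        self._model: Optional[Callable[[str], List[str]]] = None

    def _ensure_built(self) -> None:
        if self._model is None:
            self._model = self._build_fn()

    def tokenize(self, text: str) -> List[str]:
        self._ensure_built()
        return self._model(text)

    def __getstate__(self) -> dict:
        # Drop the built model when the object is pickled, so each
        # process that receives a copy rebuilds its own model lazily.
        state = self.__dict__.copy()
        state["_model"] = None
        return state
```

With this shape, each worker process ends up building its own model the first time it calls `tokenize`, rather than receiving a model object created in the parent process.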
@robin1001 @xingchensong everything works now, it's time to merge.
Really a lot of work, and it all looks great! Just one question, seems
Split WenetTokenizer into char and bpe. WenetTokenizer is now just a tool to align the results of the char and bpe tokenizers.
Will delete it in a future PR.
issues: #2160
in this pr:
next pr