[text] refine tokenizer #2165

Mddct · 2023-11-25T12:39:30Z

issues: #2160

in this pr:

next pr:

- paraformer tokenizer for fine-tuning: [paraformer] add paraformer tokenizer #2219
- huggingface tokenizer for LLMs: [text] huggingface tokenizer #2186
- [text] fix whisper tokens and others #2179

Mddct · 2023-11-27T05:00:56Z

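A note on the whisper tokenizer:
In the original official implementation, the tokenizer sets up the multitask tokens (sot, language, etc.) in __post_init__.

In wenet we chose to add a separate function outside the tokenizer instead:
https://github.com/wenet-e2e/wenet/blob/main/wenet/utils/common.py#L150-L213

So the whisper tokenizer here needs a fix for what it prepends.
~~TODO~~
~~- [ ] fix whisper prepend~~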

xingchensong · 2023-11-27T05:05:32Z

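Ignore __post_init__; it is inflexible, so we drop it. For example, in multitask learning a sample may be a transcribe task or a translate task. If we insisted on using self.sot_sequence, we would have to modify a member attribute.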

Mddct · 2023-11-27T05:11:20Z

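> Ignore __post_init__; it is inflexible, so we drop it. For example, in multitask learning a sample may be a transcribe task or a translate task. If we insisted on using self.sot_sequence, we would have to modify a member attribute.

I checked: __post_init__ does not affect the other functions, so the current implementation looks fine as it is.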

wenet/dataset/processor.py

wenet/utils/common.py

xingchensong · 2023-11-27T10:10:41Z

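On whether add_special_tokens (e.g. sos, eos) should be implemented as a method of the tokenizer class or as a standalone function, the discussion concluded:

add_xxx_tokens() only uses the ids of the special tokens and involves no tokenize or detokenize step, so it is relatively independent.
If add_xxx_tokens() were a method of the tokenizer class, every model would need to own a tokenizer, but what a model actually needs to own is the special token ids. So "model owns the special ids + standalone add_xxx_tokens()" is probably the better fit, and it also makes it easy for C++ to read the ids from jit-traced.zip at inference time.

Also: if jit can trace the tokenizer's class methods directly, the conclusion above would have to be overturned.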
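The conclusion above can be illustrated with a minimal sketch. It is not the code in wenet/utils/common.py (which works on tensors); it uses plain Python lists, and the names `ToyModel`, `prepare_targets`, and `IGNORE_ID`, as well as the padding convention (decoder input padded with eos, target padded with an ignore id), are assumptions made for illustration.

```python
from typing import List, Tuple

IGNORE_ID = -1  # assumed padding label that a loss function would skip


def add_sos_eos(
    ys: List[List[int]],
    sos: int,
    eos: int,
    ignore_id: int = IGNORE_ID,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Standalone helper: needs only special token ids, never a tokenizer."""
    ys_in = [[sos] + y for y in ys]   # decoder input: <sos> y1 y2 ...
    ys_out = [y + [eos] for y in ys]  # decoder target: y1 y2 ... <eos>
    max_len = max((len(y) for y in ys_in), default=0)
    # Pad both batches to the same length.
    ys_in = [y + [eos] * (max_len - len(y)) for y in ys_in]
    ys_out = [y + [ignore_id] * (max_len - len(y)) for y in ys_out]
    return ys_in, ys_out


class ToyModel:
    """Hypothetical model that owns the special ids, not a tokenizer."""

    def __init__(self, sos: int, eos: int):
        self.sos = sos
        self.eos = eos

    def prepare_targets(self, ys: List[List[int]]):
        return add_sos_eos(ys, self.sos, self.eos)


model = ToyModel(sos=10, eos=11)
ys_in, ys_out = model.prepare_targets([[1, 2, 3], [4, 5]])
```

Because the helper depends only on integers, the same ids could be stored with an exported model and read back without shipping any tokenizer code.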

Mddct · 2023-11-27T16:31:12Z

Build the sp (SentencePiece) model lazily, as espnet does:

https://github.com/espnet/espnet/blob/master/espnet2/text/sentencepiece_tokenizer.py#L14-L17

Mddct · 2023-11-28T04:03:12Z

The whisper tiktoken tokenizer has problems in a multiprocess environment, so it was also changed to a lazy build.
issue: huggingface/datasets#5769
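A rough sketch of the lazy-build pattern discussed in the two comments above, assuming the underlying problem is that the native tokenizer handle cannot be pickled when the tokenizer is sent to worker processes. `FakeNativeModel` and `LazyTokenizer` are hypothetical stand-ins, not wenet, SentencePiece, or tiktoken classes; a `threading.Lock` attribute is used only to make the stand-in unpicklable.

```python
import pickle
import threading
from typing import Dict, List, Optional


class FakeNativeModel:
    """Hypothetical stand-in for a native handle that cannot be pickled."""

    def __init__(self, vocab: Dict[str, int]):
        self.vocab = vocab
        self._lock = threading.Lock()  # lock objects are not picklable

    def encode(self, text: str) -> List[int]:
        return [self.vocab.get(ch, 0) for ch in text]


class LazyTokenizer:
    def __init__(self, vocab: Dict[str, int]):
        # Keep only picklable configuration here; do not build the model yet.
        self.vocab = vocab
        self.model: Optional[FakeNativeModel] = None

    def _build(self) -> None:
        # Build on first use, inside whichever process calls tokenize().
        if self.model is None:
            self.model = FakeNativeModel(self.vocab)

    def tokenize(self, text: str) -> List[int]:
        self._build()
        return self.model.encode(text)

    def __getstate__(self):
        # Drop the built model so the object stays picklable after first use.
        state = self.__dict__.copy()
        state["model"] = None
        return state


tok = LazyTokenizer({"a": 1, "b": 2})
ids = tok.tokenize("abz")                  # builds the model in this process
clone = pickle.loads(pickle.dumps(tok))    # what a worker process would receive
clone_ids = clone.tokenize("ba")           # the clone rebuilds its own model
```

The design choice is that each process pays the build cost once on its first call, instead of the parent trying to serialize a handle that was never meant to cross process boundaries.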

Mddct · 2023-11-28T11:42:15Z

recognize.py works!

Mddct · 2023-11-28T13:03:51Z

training works

Mddct · 2023-11-28T13:06:53Z

@robin1001 @xingchensong all work, it's time to merge

robin1001 · 2023-11-28T13:27:42Z

Really a lot of work, and it all looks great! Just one question: it seems WenetTokenizer is not initialized in init_tokenizer.py.

Mddct · 2023-11-28T13:34:46Z

> Really a lot of work, and it all looks great! Just one question: it seems WenetTokenizer is not initialized in init_tokenizer.py.

WenetTokenizer has been split into char and bpe tokenizers. WenetTokenizer is now just a tool to align the results of the char and bpe tokenizers.

Mddct · 2023-11-28T13:35:17Z

> Really a lot of work, and it all looks great! Just one question: it seems WenetTokenizer is not initialized in init_tokenizer.py.

WenetTokenizer has been split into char and bpe tokenizers. WenetTokenizer is now just a tool to align the results of the char and bpe tokenizers.

It will be deleted in a future PR.
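To make the char/bpe split concrete, here is a minimal sketch of a shared tokenizer interface with a char implementation. The class and method names (`BaseTokenizer`, `text2tokens`, `tokens2text`, `tokenize`, `detokenize`) and the `<unk>` handling are assumptions for illustration and may not match the interface merged in this PR. A bpe tokenizer would, under the same assumptions, override only the two abstract methods.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class BaseTokenizer(ABC):
    """Hypothetical shared interface for char and bpe tokenizers."""

    def __init__(self, symbol_table: Dict[str, int], unk: str = "<unk>"):
        self.symbol_table = symbol_table
        self.id2token = {i: t for t, i in symbol_table.items()}
        self.unk = unk

    @abstractmethod
    def text2tokens(self, text: str) -> List[str]:
        ...

    @abstractmethod
    def tokens2text(self, tokens: List[str]) -> str:
        ...

    def tokenize(self, text: str) -> Tuple[List[str], List[int]]:
        tokens = self.text2tokens(text)
        unk_id = self.symbol_table[self.unk]
        ids = [self.symbol_table.get(t, unk_id) for t in tokens]
        return tokens, ids

    def detokenize(self, ids: List[int]) -> Tuple[str, List[str]]:
        tokens = [self.id2token.get(i, self.unk) for i in ids]
        return self.tokens2text(tokens), tokens


class CharTokenizer(BaseTokenizer):
    """Splits text into single characters."""

    def text2tokens(self, text: str) -> List[str]:
        return list(text.strip())

    def tokens2text(self, tokens: List[str]) -> str:
        return "".join(tokens)


tok = CharTokenizer({"<unk>": 0, "你": 1, "好": 2})
tokens, ids = tok.tokenize("你好吗")   # "吗" is not in the table, so it maps to <unk>
text, back = tok.detokenize([1, 2])
```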

Mddct added 4 commits November 25, 2023 20:31

[text] refine tokenizer

8144a2d

[text] fix flake8

99cf7d7

[text] fix lint

5694565

[text] fix unit

2418d79

Mddct mentioned this pull request Nov 26, 2023

[refine/tokenzier] refactor the tokenizer interface #2160

Closed

5 tasks

Mddct added 3 commits November 26, 2023 23:37

[text] add bpe tokenizer and char tokenizer

3552b94

[text] add char tokenizer unit test

9912df9

[text] add bpe tokenizer unit test

266a4fa

xingchensong requested review from xingchensong and robin1001 November 27, 2023 02:11

Mddct force-pushed the Mddct-refine-tokenzier branch 2 times, most recently from 23459fa to af58893 Compare November 27, 2023 05:55

[text] add WhisperTokenizer for test_whisper.py

c2ecc7c

Mddct force-pushed the Mddct-refine-tokenzier branch from af58893 to c2ecc7c Compare November 27, 2023 08:41

xingchensong reviewed Nov 27, 2023

wenet/dataset/processor.py (outdated)

[text] revert wenet/utils/file_utils.py

55da48a

Mddct force-pushed the Mddct-refine-tokenzier branch from 3acdd59 to 55da48a Compare November 27, 2023 09:20

xingchensong reviewed Nov 27, 2023

wenet/utils/common.py (outdated)

[text] add consistency for char and bpe unit

50422fc

xingchensong mentioned this pull request Nov 27, 2023

refactor(whisper): remove tokenizer from WhisperModel #2172

Merged

Mddct added 2 commits November 27, 2023 20:05

[text] merge main

a479de7

[text] merge main

cf754ff

xingchensong mentioned this pull request Nov 27, 2023

feat(whisper): support whisper arch #2141

Merged

9 tasks

[text] add symbol table

bd24277

Mddct force-pushed the Mddct-refine-tokenzier branch 2 times, most recently from 6d21b0f to 75b1e78 Compare November 27, 2023 13:43

[text] add init_tokenizer unit test

49994bf

Mddct force-pushed the Mddct-refine-tokenzier branch from 75b1e78 to 49994bf Compare November 27, 2023 14:00

Mddct added 2 commits November 27, 2023 23:42

[text] uncomment

301af9e

[text] fix bpe model in multiprocess env

a13dedf

Mddct force-pushed the Mddct-refine-tokenzier branch from aab5fde to a13dedf Compare November 27, 2023 16:29

Mddct marked this pull request as ready for review November 27, 2023 16:35

[text] fix whisper tokenzier in multiprocess env

ec2d838

[text] add test unit parallel for bpe and whisper

51a10fa

Mddct force-pushed the Mddct-refine-tokenzier branch from c70192b to 51a10fa Compare November 28, 2023 04:15

[text] fix none type in test_whisper.py

1dc2d79

Mddct requested a review from xingchensong November 28, 2023 13:06

[text] all work

f1099a8

xingchensong approved these changes Nov 28, 2023

robin1001 merged commit 3ab6718 into main Nov 28, 2023
6 checks passed

robin1001 deleted the Mddct-refine-tokenzier branch November 28, 2023 13:46

Mddct mentioned this pull request Nov 28, 2023

[text] fix whisper tokens and others #2179

Merged

6 tasks

This was referenced Dec 7, 2023

[WIP][text] add tokens #2201

Closed

[text] rm WenetTokenizer #2218

Merged

[paraformer] add paraformer tokenizer #2219

Merged
