[refine/tokenzier] Refactor the tokenizer interface #2160
I wrote a code snippet like the one below, any suggestions? @xingchensong @robin1001

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable
from typing import List, Tuple


class BaseTokenizer(ABC):

    def tokenize(self, line: str) -> Tuple[List[str], List[int]]:
        tokens = self.text2tokens(line)
        ids = self.tokens2ids(tokens)
        return tokens, ids

    def detokenize(self, ids: List[int]) -> Tuple[str, List[str]]:
        tokens = self.ids2tokens(ids)
        text = self.tokens2text(tokens)
        return text, tokens

    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError("abstract method")

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError("abstract method")

    @abstractmethod
    def tokens2ids(self, tokens: List[str]) -> List[int]:
        raise NotImplementedError("abstract method")

    @abstractmethod
    def ids2tokens(self, ids: List[int]) -> List[str]:
        raise NotImplementedError("abstract method")
```
```python
import re
from collections.abc import Iterable
from os import PathLike
from typing import List, Optional

from wenet.utils.file_utils import read_symbol_table, read_non_lang_symbols
from wenet.utils.text.base_tokenzier import BaseTokenizer
from wenet.utils.text.tokenize_utils import tokenize_by_bpe_model


class WenetTokenizer(BaseTokenizer):
    """Wrapper for the original wenet tokenize implementation."""

    def __init__(
        self,
        symbol_table: PathLike,
        bpe_model: Optional[PathLike] = None,
        non_lang_syms: Optional[PathLike] = None,
        split_with_space: bool = False,
        connect_symbol: str = '',
    ) -> None:
        self.non_lang_syms_pattern = None
        if non_lang_syms is not None:
            self.non_lang_syms_pattern = re.compile(
                r"(\[[^\[\]]+\]|<[^<>]+>|{[^{}]+})")
        self.symbol_table = read_symbol_table(symbol_table)
        self.non_lang_syms = read_non_lang_symbols(non_lang_syms)
        self.bpe_model = None
        if bpe_model is not None:
            import sentencepiece as spm
            self.bpe_model = spm.SentencePieceProcessor()
            self.bpe_model.load(bpe_model)
        self.char_dict = {v: k for k, v in self.symbol_table.items()}
        self.split_with_space = split_with_space
        self.connect_symbol = connect_symbol

    def text2tokens(self, line: str) -> List[str]:
        if self.non_lang_syms_pattern is not None:
            parts = self.non_lang_syms_pattern.split(line.upper())
            parts = [w for w in parts if len(w.strip()) > 0]
        else:
            parts = [line]

        tokens = []
        for part in parts:
            if part in self.non_lang_syms:
                tokens.append(part)
            else:
                if self.bpe_model is not None:
                    tokens.extend(tokenize_by_bpe_model(self.bpe_model, part))
                else:
                    if self.split_with_space:
                        part = part.split(" ")
                    for ch in part:
                        if ch == ' ':
                            ch = "▁"
                        tokens.append(ch)
        return tokens

    def tokens2text(self, tokens: Iterable[str]) -> str:
        return self.connect_symbol.join(tokens)

    def tokens2ids(self, tokens: List[str]) -> List[int]:
        ids = []
        for ch in tokens:
            if ch in self.symbol_table:
                ids.append(self.symbol_table[ch])
            elif '<unk>' in self.symbol_table:
                ids.append(self.symbol_table['<unk>'])
        return ids

    def ids2tokens(self, ids: List[int]) -> List[str]:
        content = [self.char_dict[w] for w in ids]
        return content


class WhisperTokenizer(BaseTokenizer):
    ....
```
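For illustration, a minimal concrete subclass of the proposed interface might look like the following. This is a hypothetical toy char-level tokenizer, not part of the proposal; it only shows how the two template methods in BaseTokenizer compose the four abstract ones.

```python
from collections.abc import Iterable
from typing import Dict, List


class ToyCharTokenizer(BaseTokenizer):
    """Hypothetical char-level tokenizer used only to illustrate the interface."""

    def __init__(self, symbol_table: Dict[str, int], unk: str = '<unk>'):
        self.symbol_table = symbol_table
        self.char_dict = {v: k for k, v in symbol_table.items()}
        self.unk = unk

    def text2tokens(self, line: str) -> List[str]:
        return list(line.strip())

    def tokens2text(self, tokens: Iterable[str]) -> str:
        return ''.join(tokens)

    def tokens2ids(self, tokens: List[str]) -> List[int]:
        unk_id = self.symbol_table[self.unk]
        return [self.symbol_table.get(t, unk_id) for t in tokens]

    def ids2tokens(self, ids: List[int]) -> List[str]:
        return [self.char_dict[i] for i in ids]


# Round trip through the two template methods defined in BaseTokenizer.
toy = ToyCharTokenizer({'<unk>': 0, 'a': 1, 'b': 2})
tokens, ids = toy.tokenize('ab')
text, _ = toy.detokenize(ids)
assert text == 'ab'
```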
Great, I like it ♥
Later we can further split this into a char_tokenizer and a bpe_tokenizer.
Looks great, I like it too. Later we should also take unit tests for the tokenizer into consideration.
I see the proposal already covers that. Great!
When writing the BPETokenizer, we can do the "▁" replacement directly, covering the sed -e "s/▁/ /g" post-processing in the BPE-related recipes, so the results stay unchanged.
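Concretely, keeping the recipe output unchanged might just mean doing the sed replacement inside the detokenization step; a minimal sketch under that assumption (the function name is illustrative):

```python
def bpe_tokens2text(tokens):
    # Equivalent to the recipes' `sed -e "s/▁/ /g"`: join the BPE pieces and
    # turn the sentencepiece word-boundary marker back into spaces.
    return ''.join(tokens).replace('▁', ' ').strip()


assert bpe_tokens2text(['▁HELLO', '▁WO', 'RLD']) == 'HELLO WORLD'
```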
OpenAI tiktoken and other implementations all provide something like encode_batch, but those implementations use a thread pool for concurrency. Considering that we will refactor the dataset later, and to keep the tokenizer interface simple, we do not need features like tokenize_batch:

```python
# example
tokenizer = whisper_tokenizer(....)
dataset = ....
dataset = dataset.parallel_map(tokenizer)
```
Yes. Compared with WenetTokenizer, CharTokenizer and BpeTokenizer are clearer.
There is another requirement to consider: a tokenizer should be allowed to contain two different sub-tokenizers. For example, when the CTC vocabulary and the decoder vocabulary are not shared, CTC needs one tokenizer (e.g. a char tokenizer) and the decoder needs another (e.g. the whisper tokenizer).
Maybe it is more convenient to split this and create two tokenizers?

1 old dataset io

```python
whisper_tokenizer = WhisperTokenizer(...)
ctc_tokenizer = BpeTokenizer | CustomTokenizer  # either one

def multi_tokenize(sample, *tokenizers):
    for data in sample:
        for tokenizer in tokenizers:
            name = tokenizer.name
            tokens, label = tokenizer.tokenize(data['txt'])
            data[name]['label'] = label
            data[name]['tokens'] = tokens
        yield data
```

2 new dataset io

This way, I personally think it would be more flexible; we can just extend things directly in https://github.com/wenet-e2e/wenet/blob/2418d79bcc4496187922f95c4ad30d4aa8cda768/wenet/utils/init_tokenizer.py
Or, option 2: put this logic into a separate class:

```python
class MultiNamedTokenizer:

    def __init__(self, *tokenizers):
        self.tokenizers = []
        for tokenizer in tokenizers:
            self.tokenizers.append(tokenizer)

    def tokenize(self, line: str):
        labels = {}
        tokens = {}
        for tokenizer in self.tokenizers:
            name = tokenizer.name
            token_list, label = tokenizer.tokenize(line)
            labels[name] = label
            tokens[name] = token_list
        return labels, tokens


ctc_tokenizer = CharTokenizer(..., name='ctc')
whisper_tokenizer = WhisperTokenizer(...., name='whisper')
task_tokenizer = MultiNamedTokenizer(ctc_tokenizer, whisper_tokenizer)

# old io, the logic here stays unchanged
def tokenize(sample, tokenizer):
    for data in sample:
        labels, tokens = tokenizer(data['txt'])
        data['label'] = labels
        data['tokens'] = tokens
        yield data

# here, and anywhere else that operates on labels and tokens,
# we need to check the type first and then operate on it
def padded_batch():
    if labels is dict:  # i.e. isinstance(labels, dict)
        # iterate and pad with 0
        ....

# for the ctc and whisper task, the logic here stays unchanged
for batch in dataloader:
    feats, feats_lens, labels, tokens = batch

# whisper forward
def forward(...):
    ctc_tokens = tokens['ctc']
    whisper_tokens = tokens['whisper']
    ....

## new io
def multi_tokenizer(*tokenizers):
    tokens = {}
    labels = {}
    .....

dataset = wenet_dataset.parallel_map(lambda sample: multi_tokenizer(ctc_tokenizer, WenetTokenizer))
```
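For the padded_batch pseudocode above, handling dict-valued labels could look roughly like the sketch below; the function name pad_dict_labels and the 0 padding value are assumptions for illustration, not part of the proposal.

```python
from typing import Dict, List

import torch


def pad_dict_labels(label_dicts: List[Dict[str, List[int]]],
                    pad_value: int = 0) -> Dict[str, torch.Tensor]:
    # Each sample carries one label list per task name
    # (e.g. {'ctc': [...], 'whisper': [...]}); pad each task separately.
    padded = {}
    for name in label_dicts[0].keys():
        seqs = [torch.tensor(d[name], dtype=torch.long) for d in label_dicts]
        padded[name] = torch.nn.utils.rnn.pad_sequence(
            seqs, batch_first=True, padding_value=pad_value)
    return padded


batch = pad_dict_labels([{'ctc': [1, 2], 'whisper': [3]},
                         {'ctc': [4], 'whisper': [5, 6, 7]}])
assert batch['ctc'].shape == (2, 2) and batch['whisper'].shape == (2, 3)
```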
@robin1001 @xingchensong I have split it up simply like this for now; please take a look at #2165 first.
OK, no problem.
Besides the whisper scenario, what other potential scenarios are there for this requirement?
Anything related to LLMs; CTC will probably need a separate vocabulary in those cases as well.

Pros:

Cons:
@xingchensong

```python
tokenizer = ...
tokenizer.add_tokens("[wenet]")
# automatically handle the pretrained parameters and the newly added ones
# (torch.cat((embed, tail), ...))
model.resize_embedding(tokenizer.vocab_size)
```

We need to think about this logic.
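A rough sketch of the resize_embedding logic hinted at above; attribute names like model.embed are assumptions, and the only point is that pretrained rows are kept while new rows are appended with torch.cat.

```python
import torch


def resize_embedding(model, new_vocab_size: int) -> None:
    # Assumes the token embedding lives at `model.embed` (hypothetical name).
    old_weight = model.embed.weight.data          # (old_vocab, dim)
    old_vocab, dim = old_weight.shape
    assert new_vocab_size >= old_vocab
    # Randomly initialised rows for the newly added tokens.
    tail = torch.empty(new_vocab_size - old_vocab, dim)
    torch.nn.init.normal_(tail, mean=0.0, std=0.02)
    new_embed = torch.nn.Embedding(new_vocab_size, dim)
    # Pretrained rows first, new rows appended: torch.cat((embed, tail)).
    new_embed.weight.data.copy_(torch.cat((old_weight, tail), dim=0))
    model.embed = new_embed
```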
Could you paste the Hugging Face issues and PRs about extending the vocabulary?
https://discuss.huggingface.co/t/how-to-train-the-embedding-of-special-token/10837
I took a look, and it seems this is not the same meaning as the add_special we discussed before 😂. What we meant before was adding sos/eos around the target during training; the add_special here means adding tokens that do not yet exist to the vocabulary.
That said, we need this feature too (for example, when fine-tuning whisper, the CTC vocabulary needs an extra blank), so I agree with adding it.
The tokenizer could be designed like this:

```yaml
ctc_conf:
    special_tokens: <blank>
whisper_conf:
    special_tokens: <speech>  # for example
```

```python
tokenizer = WhisperTokenizer()
vocab_size = tokenizer.vocab_size
tokenizer.add_special_tokens(special_tokens)
# note: Hugging Face handles this by having vocab_size return the size of the initial dict
assert tokenizer.vocab_size == vocab_size + len(special_tokens)
```

For the model part:
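A minimal sketch of what the add_special_tokens contract above could look like; the class and its internals are hypothetical and exist only to make the proposed assert pass (note that, as the comment points out, Hugging Face instead keeps vocab_size at the size of the original table).

```python
class SketchTokenizer:
    """Hypothetical tokenizer showing the proposed add_special_tokens contract."""

    def __init__(self, symbol_table):
        self.symbol_table = dict(symbol_table)    # token -> id

    @property
    def vocab_size(self) -> int:
        return len(self.symbol_table)

    def add_special_tokens(self, tokens):
        # Append each unseen special token at the end of the vocabulary.
        for tok in tokens:
            if tok not in self.symbol_table:
                self.symbol_table[tok] = len(self.symbol_table)


tokenizer = SketchTokenizer({'a': 0, 'b': 1})
vocab_size = tokenizer.vocab_size
special_tokens = ['<blank>']
tokenizer.add_special_tokens(special_tokens)
assert tokenizer.vocab_size == vocab_size + len(special_tokens)
```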
As mentioned in the related PRs:

ref:

Files involved:

TODO: