
[refine/tokenzier] Refactor the tokenizer interface #2160

Closed · 5 tasks done
Mddct opened this issue Nov 24, 2023 · 21 comments
Labels: enhancement (New feature or request)
Mddct (Collaborator) commented Nov 24, 2023

As mentioned in the related PR.

ref:

Files involved:

  • wenet/dataset/dataset.py
  • wenet/dataset/process.py
  • wenet/bin/train.py
  • wenet/bin/recognize.py
  • wenet/utils/tokenize_utils.py
  • wenet/utils/context_graph.py
  • wenet/bin/alignment.py
  • wenet/tools/text2token.py
  • wenet/test/test_tokenize.py
  • wenet/test/whisper/test_whisper.py
  • wenet/cli/model.py

TODO:

  • tokenizer interface
  • bpe
  • paraformer
  • char
  • unit test
Mddct added the enhancement (New feature or request) label and self-assigned this issue on Nov 24, 2023.
Mddct (Collaborator, Author) commented Nov 24, 2023

I wrote a code snippet like the one below; any suggestions? @xingchensong @robin1001

from abc import ABC, abstractmethod
from collections.abc import Iterable
from typing import List, Tuple


class BaseTokenizer(ABC):

    def tokenize(self, line: str) -> Tuple[List[str], List[int]]:
        tokens = self.text2tokens(line)
        ids = self.tokens2ids(tokens)
        return tokens, ids

    def detokenize(self, ids: List[int]) -> Tuple[str, List[str]]:
        tokens = self.ids2tokens(ids)
        text = self.tokens2text(tokens)
        return text, tokens

    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError("abstract method")

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError("abstract method")

    @abstractmethod
    def tokens2ids(self, tokens: List[str]) -> List[int]:
        raise NotImplementedError("abstract method")

    @abstractmethod
    def ids2tokens(self, ids: List[int]) -> List[str]:
        raise NotImplementedError("abstract method")

import re

from collections.abc import Iterable
from os import PathLike
from typing import List, Optional
from wenet.utils.file_utils import read_symbol_table, read_non_lang_symbols
from wenet.utils.text.base_tokenizer import BaseTokenizer
from wenet.utils.text.tokenize_utils import tokenize_by_bpe_model


class WenetTokenizer(BaseTokenizer):
    """Wrapper for original wenet tokenize implementation
    """

    def __init__(
        self,
        symbol_table: PathLike,
        bpe_model: Optional[PathLike] = None,
        non_lang_syms: Optional[PathLike] = None,
        split_with_space: bool = False,
        connect_symbol: str = '',
    ) -> None:
        self.non_lang_syms_pattern = None
        if non_lang_syms is not None:
            self.non_lang_syms_pattern = re.compile(
                r"(\[[^\[\]]+\]|<[^<>]+>|{[^{}]+})")

        self.symbol_table = read_symbol_table(symbol_table)
        self.non_lang_syms = read_non_lang_symbols(non_lang_syms)
        self.bpe_model = None
        if bpe_model is not None:
            import sentencepiece as spm
            self.bpe_model = spm.SentencePieceProcessor()
            self.bpe_model.load(bpe_model)
        self.char_dict = {v: k for k, v in self.symbol_table.items()}
        self.split_with_space = split_with_space
        self.connect_symbol = connect_symbol

    def text2tokens(self, line: str) -> List[str]:
        if self.non_lang_syms_pattern is not None:
            parts = self.non_lang_syms_pattern.split(line.upper())
            parts = [w for w in parts if len(w.strip()) > 0]
        else:
            parts = [line]

        tokens = []
        for part in parts:
            if self.non_lang_syms is not None and part in self.non_lang_syms:
                tokens.append(part)
            else:
                if self.bpe_model is not None:
                    tokens.extend(tokenize_by_bpe_model(self.bpe_model, part))
                else:
                    if self.split_with_space:
                        part = part.split(" ")
                    for ch in part:
                        if ch == ' ':
                            ch = "▁"
                        tokens.append(ch)
        return tokens

    def tokens2text(self, tokens: Iterable[str]) -> str:
        return self.connect_symbol.join(tokens)

    def tokens2ids(self, tokens: List[str]) -> List[int]:
        ids = []
        for ch in tokens:
            if ch in self.symbol_table:
                ids.append(self.symbol_table[ch])
            elif '<unk>' in self.symbol_table:
                ids.append(self.symbol_table['<unk>'])
        return ids

    def ids2tokens(self, ids: List[int]) -> List[str]:
        content = [self.char_dict[w] for w in ids]
        return content
class WhisperTokenizer(BaseTokenizer):
....
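
For reference, a minimal usage sketch of the proposed interface (the 'units.txt' path is only an assumed example of a symbol-table file in the usual token-to-id format):

tokenizer = WenetTokenizer(symbol_table='units.txt')

# tokenize: text -> (tokens, ids); detokenize: ids -> (text, tokens)
tokens, ids = tokenizer.tokenize('wenet is great')
text, tokens = tokenizer.detokenize(ids)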

xingchensong (Member) commented Nov 24, 2023

Great, I like it ♥

Mddct (Collaborator, Author) commented Nov 24, 2023

Great, I like it ♥ It would be even better if the BPE logic and the char logic were split into two separate tokenizers.

We can split it into a char_tokenizer and a bpe_tokenizer later. For now let's keep them together; once testing shows no problems, we can split them apart.

robin1001 (Collaborator) commented:

Looks great, I like it too. Later we can also take unit tests for the tokenizer into account.

robin1001 (Collaborator) commented:

I see the proposal already covers that. Great!

Mddct (Collaborator, Author) commented Nov 24, 2023

When writing the BPETokenizer, we can handle the '▁' replacement directly, the same treatment as the sed -e "s/▁/ /g" step in the BPE-related recipes, so the results stay unchanged.
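
For illustration, a minimal sketch of how that could look, assuming a BpeTokenizer built on top of the WenetTokenizer above (the class name and the inheritance are assumptions for this example only):

from collections.abc import Iterable


class BpeTokenizer(WenetTokenizer):

    def tokens2text(self, tokens: Iterable[str]) -> str:
        # join the BPE pieces, then map '▁' back to spaces,
        # mirroring the recipes' sed -e "s/▁/ /g" post-processing
        return ''.join(tokens).replace('▁', ' ').strip()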

Mddct (Collaborator, Author) commented Nov 25, 2023

openai tiktoken and other implementations provide an encode_batch, e.g.
https://github.com/openai/tiktoken/blob/39f29cecdb6fc38d9a3434e5dd15e4de58cf3c80/tiktoken/core.py#L145

However, those implementations use a thread pool for concurrency.

Considering that we will refactor the dataset later, and to keep the tokenizer interface simple, we don't need tokenize_batch-style functionality:

# example 
tokenizer = whisper_tokenizer(....)
dataset = ....
dataset = dataset.parallel_map(tokenizer)
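
A rough sketch of what the per-sample mapping could look like (the field names 'txt', 'tokens' and 'label' follow the convention used later in this thread; the helper name is purely illustrative):

def tokenize_map(samples, tokenizer: BaseTokenizer):
    # apply the tokenizer sample by sample; any concurrency lives in the
    # dataset's parallel_map, so the tokenizer itself needs no batch API
    for sample in samples:
        tokens, ids = tokenizer.tokenize(sample['txt'])
        sample['tokens'] = tokens
        sample['label'] = ids
        yield sample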

Mddct mentioned this issue Nov 25, 2023
robin1001 (Collaborator) commented:

Yes. Compared with WenetTokenizer, CharTokenizer and BpeTokenizer are clearer.

xingchensong (Member) commented:

One more requirement to consider: a tokenizer should be allowed to contain two different sub-tokenizers. For example, when the CTC vocabulary and the decoder vocabulary are not shared, CTC needs one tokenizer (e.g. a char tokenizer) and the decoder needs another (e.g. the whisper tokenizer).

Mddct (Collaborator, Author) commented Nov 26, 2023

One more requirement to consider: a tokenizer should be allowed to contain two different sub-tokenizers. For example, when the CTC vocabulary and the decoder vocabulary are not shared, CTC needs one tokenizer (e.g. a char tokenizer) and the decoder needs another (e.g. the whisper tokenizer).

Maybe it would be more convenient to split it and generate the two separately?

1. Old dataset IO:

whisper_tokenizer = WhisperTokenizer(...)
ctc_tokenizer = BpeTokenizer | CustomTokenizer

def multi_tokenize(sample, *tokenizers):
    for data in sample:
        for tokenizer in tokenizers:
            name = tokenizer.name
            tokens, label = tokenizer.tokenize(data['txt'])
            data[name]['label'] = label
            data[name]['tokens'] = tokens
        yield data
            

2. New dataset IO:

whisper_tokenizer = WhisperTokenizer(...)
ctc_tokenizer = BpeTokenizer | CustomTokenizer

wenet_dataset = get_dataset(...)

whisper_dataset = wenet_dataset.parallel_map(whisper_tokenizer)
ctc_dataset = wenet_dataset.parallel_map(ctc_tokenizer)

# now we have per-task tokens and labels
wenet_dataset = dataset.zip(whisper_dataset, ctc_dataset)

Personally I think this would be more flexible; we can just extend it inside https://github.com/wenet-e2e/wenet/blob/2418d79bcc4496187922f95c4ad30d4aa8cda768/wenet/utils/init_tokenizer.py
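
A rough sketch of what such an extension point could look like (the config keys 'tokenizer' and 'tokenizer_conf' are assumptions for illustration, not the actual wenet config schema):

def init_tokenizer(configs: dict) -> BaseTokenizer:
    # dispatch on the config to build the matching tokenizer
    tokenizer_type = configs.get('tokenizer', 'wenet')
    conf = configs.get('tokenizer_conf', {})
    if tokenizer_type == 'whisper':
        return WhisperTokenizer(**conf)
    return WenetTokenizer(**conf)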

Mddct (Collaborator, Author) commented Nov 26, 2023

Or, as option 2, put this logic into a separate class:

class MultiNamedTokenizer:

    def __init__(self, *tokenizers):
        self.tokenizers = list(tokenizers)

    def tokenize(self, line: str):
        labels = {}
        tokens = {}
        for tokenizer in self.tokenizers:
            name = tokenizer.name
            toks, label = tokenizer.tokenize(line)
            labels[name] = label
            tokens[name] = toks
        return labels, tokens


ctc_tokenizer = CharTokenizer(..., name='ctc')
whisper_tokenizer = WhisperTokenizer(..., name='whisper')

task_tokenizer = MultiNamedTokenizer(ctc_tokenizer, whisper_tokenizer)

# old io, the logic here stays the same
def tokenize(sample, tokenizer):
    for data in sample:
        labels, tokens = tokenizer.tokenize(data['txt'])
        data['label'] = labels
        data['tokens'] = tokens
        yield data

# anything here (or elsewhere) that operates on labels and tokens needs to check their type first, then act
def padded_batch():
    if isinstance(labels, dict):
        # iterate over the dict and pad with zeros
        ...
    ....

# for the ctc and whisper tasks, the logic here stays the same
for batch in dataloader:
    feats, feats_lens, labels, tokens = batch

# whisper forward
def forward(...):
    ctc_tokens = tokens['ctc']
    whisper_tokens = tokens['whisper']
    ....



## new io
def multi_tokenizer(*tokenizers):
    tokens = {}
    labels = {}
    .....


dataset = wenet_dataset.parallel_map(lambda sample: multi_tokenizer(ctc_tokenizer, WenetTokenizer))

Mddct (Collaborator, Author) commented Nov 26, 2023

Yes. Compared with WenetTokenizer, CharTokenizer and BpeTokenizer are clearer.

@robin1001 @xingchensong I've split it up in this simple way for now; please take a look at #2165 first.

robin1001 (Collaborator) commented:

Sure, no problem.

robin1001 (Collaborator) commented:

One more requirement to consider: a tokenizer should be allowed to contain two different sub-tokenizers. For example, when the CTC vocabulary and the decoder vocabulary are not shared, CTC needs one tokenizer (e.g. a char tokenizer) and the decoder needs another (e.g. the whisper tokenizer).

Besides the whisper scenario, what other potential scenarios are there for this requirement?

xingchensong (Member) commented Nov 27, 2023

One more requirement to consider: a tokenizer should be allowed to contain two different sub-tokenizers. For example, when the CTC vocabulary and the decoder vocabulary are not shared, CTC needs one tokenizer (e.g. a char tokenizer) and the decoder needs another (e.g. the whisper tokenizer).

Besides the whisper scenario, what other potential scenarios are there for this requirement?

Anything LLM-related: there, CTC would probably also need a separate vocabulary.

Pros:

  1. The LLM vocabulary is quite large.
  2. To cover many languages it inevitably loses some "efficiency" on certain specific languages.

Cons:

  1. Deployment becomes more troublesome, since two sets of vocabulary are required.

Mddct (Collaborator, Author) commented Nov 30, 2023

@xingchensong
#2165 (comment)
That PR mentions add_special_tokens. Fine-tuning sometimes needs to extend the vocabulary; add_tokens together with the model's resize_embedding covers the following kind of scenario:

tokenizer = ...
tokenizer.add_tokens("[wenet]")

model.resize_embedding(tokenizer.vocab_size)  # automatically handle the pretrained parameters plus the newly added rows (torch.cat((embed, tail), ...))

We need to think through this logic.
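
For illustration, a minimal sketch of what a resize_embedding helper could do, assuming the text embedding is a plain torch.nn.Embedding (the helper name and signature are assumptions, not an existing wenet API):

import torch


def resize_embedding(embed: torch.nn.Embedding, new_vocab_size: int) -> torch.nn.Embedding:
    old_vocab_size, dim = embed.weight.shape
    assert new_vocab_size >= old_vocab_size
    # keep the pretrained rows, append freshly initialized rows for the new tokens
    tail = torch.nn.init.normal_(torch.empty(new_vocab_size - old_vocab_size, dim), std=0.02)
    new_embed = torch.nn.Embedding(new_vocab_size, dim)
    new_embed.weight.data.copy_(torch.cat((embed.weight.data, tail), dim=0))
    return new_embed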

xingchensong (Member) commented:

Could you paste the huggingface issue/PR about extending the vocabulary?

Mddct (Collaborator, Author) commented Nov 30, 2023

Could you paste the huggingface issue/PR about extending the vocabulary?

https://discuss.huggingface.co/t/how-to-train-the-embedding-of-special-token/10837

xingchensong (Member) commented:

I took a look, and it seems to mean something different from the "add special" we talked about earlier 😂 What we meant was adding sos/eos around the target during training, whereas add special here means adding tokens that don't yet exist to the vocabulary.

xingchensong (Member) commented:

That said, we need this feature as well (for example, when fine-tuning whisper the CTC vocabulary needs an extra blank), so I agree with adding it.

Mddct (Collaborator, Author) commented Nov 30, 2023

The tokenizer would be designed like this:

1. In the yaml:

ctc_conf:
    special_tokens: <blank>

whisper_conf:
    special_tokens: <speech> # for example

tokenizer = WhisperTokenizer()
vocab_size = tokenizer.vocab_size
tokenizer.add_special_tokens(special_tokens)
assert tokenizer.vocab_size == vocab_size + len(special_tokens)  # note: huggingface instead has vocab_size return the size of the original dict

The model part:

model.resize_embedding(tokenizer.vocab_size)
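
As a rough sketch of how add_special_tokens could behave for a symbol-table based tokenizer (written here against the WenetTokenizer above; this is only an assumption, not the merged implementation):

from typing import List


def add_special_tokens(tokenizer: WenetTokenizer, special_tokens: List[str]) -> int:
    """Append unseen special tokens to the end of the symbol table; return how many were added."""
    added = 0
    for token in special_tokens:
        if token not in tokenizer.symbol_table:
            new_id = len(tokenizer.symbol_table)
            tokenizer.symbol_table[token] = new_id
            tokenizer.char_dict[new_id] = token
            added += 1
    return added

The model would then call resize_embedding with the enlarged vocabulary size, as in the line above.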

This was referenced Nov 30, 2023
Mddct closed this as completed Jan 11, 2024