About masked LM #1

Open
andiShan11 opened this issue Mar 17, 2021 · 1 comment

Comments


andiShan11 commented Mar 17, 2021

Your implementation for masking Chinese words is as follows:

for index in index_set:
  covered_indexes.add(index)
  masked_token = None
  # 80% of the time, replace with [MASK]
  if rng.random() < 0.8:
    masked_token = "[MASK]"
  else:
    # 10% of the time, keep original
    if rng.random() < 0.5:
      if FLAGS.non_chinese == False:  # the text is Chinese, so try to strip the "##" prefix that was added earlier
        masked_token = tokens[index][2:] if len(re.findall('##[\u4E00-\u9FA5]', tokens[index])) > 0 else tokens[index]  # strip "##"
      else:
        masked_token = tokens[index]
    # 10% of the time, replace with random word
    else:
      masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

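index_set holds all the token indexes of a single Chinese word, but inside this loop the random choice is made separately for each character, which looks the same as the masking strategy in the original BERT. Shouldn't WWM treat a word as a unit, so that it is either masked entirely or not masked at all?
With this implementation, a complete Chinese word can end up only partially masked.
Am I misunderstanding something?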
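
To put a number on the concern, here is a small sketch of the arithmetic implied by the loop quoted above. It assumes the two `rng.random()` draws are independent and uniform, so each token independently gets `[MASK]` with probability 0.8, is kept with probability 0.2 × 0.5 = 0.1, and is replaced by a random word with probability 0.1. The function name `mixed_probability` is made up for this illustration and is not part of the repository.

```python
def mixed_probability(n_tokens: int) -> float:
    """Probability that the tokens of one word do NOT all receive the same treatment.

    Assumes the per-token probabilities read off the quoted loop:
    0.8 for [MASK], 0.1 for keep-original, 0.1 for random replacement.
    """
    p_mask, p_keep, p_random = 0.8, 0.1, 0.1
    # The word is treated uniformly only if every token lands in the same branch.
    all_same = p_mask ** n_tokens + p_keep ** n_tokens + p_random ** n_tokens
    return 1.0 - all_same


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, round(mixed_probability(n), 4))
```

Under these assumptions a two-character word receives mixed treatment about a third of the time (1 - 0.64 - 0.01 - 0.01 = 0.34), which is the partial masking described above.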

@ysyfrank

In WWM, "mask" is meant in the broad sense: it covers three cases.
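
One way to read this reply, assuming the three cases are the three branches of the quoted loop (`[MASK]`, keep original, random word), is that the whole-word unit applies to which positions are selected as prediction targets, not to which replacement each position receives. The sketch below is a simplified restatement of the quoted loop under that assumption, not the repository's actual function: the name `mask_whole_word` and the example tokens are hypothetical, and the "##" stripping is left out for brevity.

```python
import random


def mask_whole_word(tokens, index_set, vocab_words, rng):
    """Select every token of one word, then pick one of three treatments per token."""
    output = list(tokens)
    covered_indexes = set()
    for index in index_set:
        # Every token of the word is selected as a prediction target.
        covered_indexes.add(index)
        if rng.random() < 0.8:
            output[index] = "[MASK]"          # case 1: replace with [MASK]
        elif rng.random() < 0.5:
            output[index] = tokens[index]     # case 2: keep the original token
        else:
            # case 3: replace with a random vocabulary word
            output[index] = vocab_words[rng.randint(0, len(vocab_words) - 1)]
    return output, covered_indexes


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["我", "喜", "##欢", "你"]        # hypothetical example; "喜 ##欢" forms one word
    vocab = ["的", "是", "不", "在"]            # hypothetical toy vocabulary
    masked, covered = mask_whole_word(tokens, [1, 2], vocab, rng)
    print(masked, sorted(covered))
```

Whatever the random draws turn out to be, `covered_indexes` always contains both positions of the word, so the word is selected as a whole even when the visible replacements differ from token to token.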
