About mask LM #1

andiShan11 · 2021-03-17T13:18:56Z

Your implementation for masking Chinese words is as follows:

for index in index_set:
   covered_indexes.add(index)
   masked_token = None
   # 80% of the time, replace with [MASK]
   if rng.random() < 0.8:
     masked_token = "[MASK]"
   else:
     # 10% of the time, keep original
     if rng.random() < 0.5:
       if FLAGS.non_chinese == False:  # Chinese text: try to strip the "##" prefix that was added earlier
         masked_token = tokens[index][2:] if len(re.findall('##[\u4E00-\u9FA5]', tokens[index])) > 0 else tokens[index]  # strip "##"
       else:
         masked_token = tokens[index]
     # 10% of the time, replace with random word
     else:
       masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

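index_set holds all the token indices of one Chinese word, yet inside this loop you make a separate random masking decision for each Chinese character. This looks the same as the masking strategy in the original BERT. Isn't WWM supposed to either mask a whole word entirely or leave it entirely unmasked?
With this implementation, a complete Chinese word may end up only partially masked.
Am I misunderstanding something?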
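
To make the concern concrete: if each token of a selected word draws independently, as in the loop quoted above, a single token becomes [MASK] with probability 0.8, so a word of n tokens is replaced by [MASK] in full with probability 0.8^n. A minimal sketch of that arithmetic follows; the function name `prob_all_mask` is illustrative and not part of the repository.

```python
def prob_all_mask(n_tokens: int, p_mask: float = 0.8) -> float:
    """Probability that every token of a selected word becomes [MASK]
    when each token makes its own independent 80/10/10 draw."""
    return p_mask ** n_tokens


if __name__ == "__main__":
    # Longer words are less likely to be replaced by [MASK] in every position.
    for n in (1, 2, 3, 4):
        print(n, round(prob_all_mask(n), 4))
```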


ysyfrank · 2021-07-21T02:14:46Z

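In WWM, "mask" is meant in a broad sense and covers three cases: replacing the token with [MASK], keeping the original token, and replacing it with a random word.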
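
The distinction discussed in this thread can be sketched as two small functions. This is a simplified illustration under stated assumptions, not the repository's code: the function names are hypothetical, the "##" stripping from the quoted snippet is omitted, and `rng` is assumed to be a `random.Random` instance. In both versions the whole word is selected as a unit (every index in `index_set` is processed); they differ only in whether the 80/10/10 choice is drawn once per token or once per word.

```python
import random


def mask_word_per_token(tokens, index_set, vocab_words, rng):
    """Per-token draws, mirroring the quoted loop: each token of the selected
    word independently becomes [MASK] (80%), stays unchanged (10%),
    or is replaced by a random vocabulary word (10%)."""
    out = list(tokens)
    for index in index_set:
        if rng.random() < 0.8:
            out[index] = "[MASK]"
        elif rng.random() < 0.5:
            out[index] = tokens[index]  # keep original
        else:
            out[index] = rng.choice(vocab_words)  # random word
    return out


def mask_word_single_draw(tokens, index_set, vocab_words, rng):
    """Hypothetical alternative: one draw per word, so all tokens of the
    word share the same kind of replacement."""
    out = list(tokens)
    r = rng.random()  # single decision for the whole word
    for index in index_set:
        if r < 0.8:
            out[index] = "[MASK]"
        elif r < 0.9:
            out[index] = tokens[index]
        else:
            out[index] = rng.choice(vocab_words)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["我", "喜", "##欢", "你"]
    print(mask_word_per_token(tokens, [1, 2], ["的", "是"], rng))
    print(mask_word_single_draw(tokens, [1, 2], ["的", "是"], rng))
```

With per-token draws a word can come out as a mix of the three cases, which is the "broad sense" of masking described in the reply; the single-draw variant is shown only for comparison.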
