for index in index_set:
    covered_indexes.add(index)
    masked_token = None
    # 80% of the time, replace with [MASK]
    if rng.random() < 0.8:
        masked_token = "[MASK]"
    else:
        # 10% of the time, keep original
        if rng.random() < 0.5:
            # non_chinese is False, so the text is Chinese: strip the "##" prefix that was added earlier
            if FLAGS.non_chinese == False:
                masked_token = tokens[index][2:] if len(re.findall('##[\u4E00-\u9FA5]', tokens[index])) > 0 else tokens[index]  # strip "##"
            else:
                masked_token = tokens[index]
        # 10% of the time, replace with random word
        else:
            masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
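The code above is your implementation for masking Chinese words.
index_set holds all the indices of a single Chinese word, but inside this loop the random masking choice is made separately for each Chinese character. That looks the same as the masking strategy in the original BERT. Isn't WWM supposed to treat a whole word as a unit, so that it is either masked in full or not masked at all?
With this implementation, a complete Chinese word may end up only partially masked.
Am I misunderstanding something?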
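
To make the question concrete, here is a minimal sketch of the all-or-nothing behaviour described above. It is not the repository's code: mask_whole_word, its parameters, and the sample tokens are hypothetical names chosen for illustration. The only change relative to the quoted loop is that the 80/10/10 choice is drawn once per word and then applied to every index in index_set. The "##" check is also simplified to match only at the start of the token.

```python
import random
import re

# Matches a "##" prefix followed by a CJK character (start of token only).
CJK_SUBWORD = re.compile(r"##[\u4E00-\u9FA5]")


def mask_whole_word(tokens, index_set, vocab_words, rng, strip_cjk_prefix=True):
    """Hypothetical helper: pick ONE action per word, apply it to all its indices."""
    # Draw the 80/10/10 decision a single time for the whole word.
    draw = rng.random()
    if draw < 0.8:
        action = "mask"
    elif draw < 0.9:
        action = "keep"
    else:
        action = "random"

    replacements = {}
    for index in index_set:
        token = tokens[index]
        if action == "mask":
            replacements[index] = "[MASK]"
        elif action == "keep":
            # Assumption: drop the "##" prefix from Chinese sub-tokens, as the quoted code does.
            if strip_cjk_prefix and CJK_SUBWORD.match(token):
                token = token[2:]
            replacements[index] = token
        else:
            # Each position still gets its own random vocabulary word.
            replacements[index] = vocab_words[rng.randint(0, len(vocab_words) - 1)]
    return action, replacements


# Example call with made-up data; indices 3 and 4 form one word.
rng = random.Random(12345)
tokens = ["[CLS]", "使", "##用", "语", "##言", "模", "##型", "[SEP]"]
vocab_words = ["的", "是", "在", "了"]
action, replacements = mask_whole_word(tokens, [3, 4], vocab_words, rng)
```

With this variant every character of a word shares the same action, so a word can never be half replaced by [MASK] and half kept.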