Update 2.MoE经典论文简牍.md #8

Open · wants to merge 1 commit into base: main
@@ -313,7 +313,7 @@ $$

### 4.2 How do experts learn?

- The ST-MoE researchers found that **different experts in the Encorder tend to specialize in particular token types or shallow concepts**. For example, some experts may focus mainly on punctuation, while others specialize in proper nouns. In contrast, experts in the Decorder generally show a lower degree of specialization. The researchers also trained the model on multilingual data. Although one might expect each expert to handle one particular language, this is not what happens in practice: because of token routing and load balancing, no expert ends up dedicated to any single language.
+ The ST-MoE researchers found that **different experts in the Encoder tend to specialize in particular token types or shallow concepts**. For example, some experts may focus mainly on punctuation, while others specialize in proper nouns. In contrast, experts in the Decoder generally show a lower degree of specialization. The researchers also trained the model on multilingual data. Although one might expect each expert to handle one particular language, this is not what happens in practice: because of token routing and load balancing, no expert ends up dedicated to any single language.
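Since the passage attributes the absence of per-language specialization to token routing and load balancing, a minimal sketch of that mechanism may help. This assumes a Switch-Transformer-style top-1 router with the common auxiliary load-balancing loss, not ST-MoE's exact implementation; the function and variable names below are illustrative only.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, num_experts):
    """Illustrative top-1 token routing with a load-balancing auxiliary loss.

    hidden:        (num_tokens, d_model) token representations
    router_weight: (d_model, num_experts) learned router projection
    """
    # The router scores every token against every expert and normalizes
    # the scores into a probability distribution over experts.
    logits = hidden @ router_weight                  # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)

    # Top-1 routing: each token is dispatched to its highest-probability expert.
    expert_index = probs.argmax(dim=-1)              # (num_tokens,)

    # Switch-Transformer-style auxiliary loss: penalize the dot product of
    # (fraction of tokens dispatched to each expert) and (mean router
    # probability per expert), which is minimized by a uniform spread of
    # tokens across experts.
    tokens_per_expert = F.one_hot(expert_index, num_experts).float().mean(dim=0)
    mean_prob_per_expert = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(tokens_per_expert * mean_prob_per_expert)

    return expert_index, probs, aux_loss

# Toy usage: 8 tokens of width 16 routed across 4 experts.
hidden = torch.randn(8, 16)
router_weight = torch.randn(16, 4)
expert_index, probs, aux_loss = route_tokens(hidden, router_weight, 4)
```

Because the auxiliary loss pushes traffic toward a uniform spread across experts regardless of what the tokens are, an expert cannot monopolize all tokens of one language, which is consistent with the observation above that no expert becomes dedicated to a single language.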

### 4.3 How does the number of experts affect pretraining?
