diff --git a/posts/2023-12-12-moe.md b/posts/2023-12-12-moe.md
index 77ca7b0..409c725 100644
--- a/posts/2023-12-12-moe.md
+++ b/posts/2023-12-12-moe.md
@@ -22,3 +22,11 @@ Google was one the first to blend large scale Transformers with MoEs in a framew
 - **Random routing**. The top expert is always picked but the second expert is sampled according to the gating weight probabilities.
 - **Expert capacity**. A threshold for how many tokens can be processed by one expert. If both experts are at capacity, the token is considered overflowed and is sent to the next layer via a skip connection.
+
+# Mixtral
+
+Mixtral uses concepts inspired by the Switch Transformer. Its architecture is similar to that of Mistral 7B, with the difference that each Transformer block replaces the FFN with a Switch Transformer block. Below is an illustration from the Switch Transformer [paper](https://arxiv.org/abs/2006.16668):
+
+![](/public/images/switchtransformer.png)
+
+For every token, at each layer, a router network (gate) selects two experts to process the current state and combines their outputs. Mixtral uses 8 experts with top-2 gating. Even though each token only sees two experts, the selected experts can be different at each timestep. In practice, this means that Mixtral decodes at the speed of a 12B model while having access to 45B parameters. The requirements to run the model are still quite hefty: you are looking at upwards of 90GB of memory. Fortunately, quantized versions of Mixtral have already been released and are available through popular frameworks such as llama.cpp, vLLM and HF Transformers. MoE as an architecture is interesting because of how the experts have to be handled in terms of batching, data parallelism and model parallelism.
diff --git a/public/images/switchtransformer.png b/public/images/switchtransformer.png
new file mode 100644
index 0000000..707d9ac
Binary files /dev/null and b/public/images/switchtransformer.png differ
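
To make the top-2 gating described in the added section concrete, here is a minimal PyTorch sketch of a router that picks 2 of 8 expert FFNs per token and mixes their outputs by the gating weights. It is an illustration under simplifying assumptions (no expert capacity, no load-balancing loss, and a naive per-expert loop rather than a batched dispatch), not Mixtral's actual implementation; the names `Top2MoELayer`, `dim`, `hidden_dim` and `num_experts` are made up for this example.

```python
# Hypothetical sketch of a top-2 gated MoE layer -- not Mixtral's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoELayer(nn.Module):
    """Sparse MoE block: a linear router (gate) picks top_k of num_experts FFNs per token."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)  # the gate
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim) -- flatten batch and sequence dimensions beforehand.
        logits = self.router(x)                                   # (tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1) # top-2 experts per token
        weights = F.softmax(weights, dim=-1)                      # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                         # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: 16 tokens of width 64; each token is processed by only 2 of the 8 experts.
tokens = torch.randn(16, 64)
layer = Top2MoELayer(dim=64, hidden_dim=128, num_experts=8, top_k=2)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

The per-expert loop is what makes the batching question interesting in practice: an efficient implementation instead groups all tokens assigned to the same expert and runs each expert's FFN once per step, which is where the data- and model-parallelism considerations mentioned above come in.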