From 008328b473fc1c4983f31e6b93e597aad6e3c479 Mon Sep 17 00:00:00 2001 From: nerdai <92402603+nerdai@users.noreply.github.com> Date: Fri, 7 Feb 2025 18:35:10 +0000 Subject: [PATCH] =?UTF-8?q?Deploying=20to=20gh-pages=20from=20@=20VectorIn?= =?UTF-8?q?stitute/ai-pocket-reference@7b3f6907cf1a350826e0418fc129e0122bf?= =?UTF-8?q?59603=20=F0=9F=9A=80?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- nlp/llms/agents/index.html | 4 +- nlp/llms/architecture/attention.html | 4 +- nlp/llms/architecture/feedforward.html | 209 +++++++++++++++++++++++++ nlp/llms/architecture/index.html | 4 +- nlp/llms/fine_tuning/dora.html | 4 +- nlp/llms/fine_tuning/yarn.html | 209 +++++++++++++++++++++++++ nlp/models/deepseek_r1.html | 19 ++- nlp/models/deepseek_v3.html | 179 ++++++++++++++++++++- nlp/print.html | 200 ++++++++++++++++++++++- nlp/searchindex.js | 2 +- nlp/searchindex.json | 2 +- nlp/toc.html | 2 +- nlp/toc.js | 2 +- 13 files changed, 814 insertions(+), 26 deletions(-) create mode 100644 nlp/llms/architecture/feedforward.html create mode 100644 nlp/llms/fine_tuning/yarn.html diff --git a/nlp/llms/agents/index.html b/nlp/llms/agents/index.html index 7b7aa59..ea1b8ff 100644 --- a/nlp/llms/agents/index.html +++ b/nlp/llms/agents/index.html @@ -160,7 +160,7 @@
-The DeepSeek-R1 model was introduced by DeepSeek in January of 2024. It is
+
+(Reading time: 6 minutes)
+
+The DeepSeek-R1 model was introduced by DeepSeek in January of 2025. It is
 derived from an earlier checkpoint of DeepSeek-V3. In particular, starting
 with DeepSeek-V3-base, four stages of fine-tuning were performed in order
 to arrive at the checkpoint known as DeepSeek-R1: (i) Reasoning
@@ -251,7 +255,7 @@ Key Results
 Table: Comparison between DeepSeek-R1 and other representative models.
-(Copied from Table 4 of Guo, Daya, et al (2024).)
+(Copied from Table 4 of Guo, Daya, et al (2025).)
Contributors:
The DeepSeek-V3 model was introduced by DeepSeek in December of 2024. It is an LLM that leverages MoE in its design.
The training pipeline for DeepSeek-V3 consists of the two typical stages: pre-training and post-training. As depicted in the Figure above, the pre-training stage comprises pre-training on 14.8T tokens followed by long-context extension using the YaRN methodology. Post-training of DeepSeek-V3 utilizes SFT as well as Reinforcement Learning methods.
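To make the long-context extension step more concrete, the sketch below shows the simpler position-interpolation idea that YaRN builds on: rescaling the RoPE inverse frequencies so that positions beyond the original training window map back into the trained range (YaRN refines this with wavelength-dependent interpolation and an attention temperature, which are omitted here). The head dimension and scale factor are illustrative, not DeepSeek-V3's actual values.

```python
import numpy as np

def rope_inv_freq(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies for a head dimension `dim`."""
    return 1.0 / (base ** (np.arange(0, dim, 2) / dim))

def interpolated_inv_freq(dim: int, scale: float, base: float = 10000.0) -> np.ndarray:
    """Naive context extension: shrink every frequency by `scale` so that a
    sequence `scale` times longer spans the same range of rotation angles."""
    return rope_inv_freq(dim, base) / scale

# Illustrative numbers: pretend the base model saw 4K tokens and we target 128K.
inv_freq = interpolated_inv_freq(dim=128, scale=128_000 / 4_000)
positions = np.arange(128_000)
angles = np.outer(positions, inv_freq)  # rotation angles fed into sin/cos in RoPE
print(angles.shape)                     # (128000, 64)
```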
At the time of its release, open-source models had already been narrowing the performance gap with their closed-source counterparts. DeepSeek-V3 was yet another open-source model to achieve a high level of performance, beating other open-source alternatives as well as some closed-source models on various benchmarks. What made DeepSeek-V3's achievement even more intriguing was that it was reportedly trained using less compute than its closest counterparts.
DeepSeek-V3 is a transformer-based model that swaps out nearly all of its dense feedforward layers for MoE layers. The model has a total of 671B parameters, but through its specialized variant of MoE (referred to as DeepSeekMoE), only 37B parameters are activated per token in both training and inference. Through a series of long-context extension fine-tuning steps, the maximum context length of this model was extended to 128K tokens.
DeepSeekMoE: Used to carry out training more efficiently, this MoE design consists of two sets of experts, namely: shared and routed. The former set of experts is used for every token in the input sequence, whereas the usage of the routed ones is determined according to their affinity with the input token.
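A minimal PyTorch sketch of this shared-plus-routed layout is given below. The layer sizes, number of experts, top-k value, and the plain softmax gating are placeholders for illustration; DeepSeek-V3's actual gating, normalization, and routing constraints differ.

```python
import torch
import torch.nn as nn

class SharedRoutedMoE(nn.Module):
    """Toy MoE layer: shared experts see every token, routed experts are
    selected per token by affinity scores. All sizes are illustrative."""

    def __init__(self, d_model=64, d_ff=128, n_shared=2, n_routed=8, top_k=2):
        super().__init__()
        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                                 nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                         # x: (num_tokens, d_model)
        out = sum(e(x) for e in self.shared)      # shared experts process every token
        scores = self.router(x).softmax(dim=-1)   # token-to-expert affinity scores
        weights, idx = scores.topk(self.top_k, dim=-1)
        for t in range(x.size(0)):                # simple (slow) per-token routing loop
            for w, i in zip(weights[t].tolist(), idx[t].tolist()):
                out[t] = out[t] + w * self.routed[i](x[t])
        return out

moe = SharedRoutedMoE()
tokens = torch.randn(4, 64)
print(moe(tokens).shape)  # torch.Size([4, 64])
```

The key point is simply that every token passes through the shared experts, while only its top-k routed experts (ranked by affinity) contribute to its output.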
Auxiliary-Loss-Free Load Balancing: When using an MoE architecture, one must consider load balancing across the experts to prevent routing collapse. This has typically been addressed via the introduction of an auxiliary loss. However, if this loss has too great an influence, it can lead to model degradation. DeepSeek-V3 instead adopts a technique that requires no auxiliary loss, relying on a new bias term that dynamically changes its value according to each expert's current workload.
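The sketch below illustrates the auxiliary-loss-free idea: a per-expert bias is added to the affinity scores only when selecting the top-k experts, and after each step the bias is nudged down for overloaded experts and up for underloaded ones. The step size (`gamma`) and the use of simple selection counts are assumptions for illustration, not DeepSeek-V3's exact update rule.

```python
import numpy as np

def route_with_bias(scores, bias, top_k=2):
    """Select experts using biased scores; the bias only steers selection,
    while the original (unbiased) scores would still serve as gating weights."""
    biased = scores + bias
    return np.argsort(-biased, axis=-1)[:, :top_k]

def update_bias(bias, selected, n_experts, gamma=0.001):
    """Nudge each expert's bias down if it was over-loaded this step,
    up if it was under-loaded (sign update with step size gamma)."""
    load = np.bincount(selected.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - load.mean())

n_tokens, n_experts = 16, 8
bias = np.zeros(n_experts)
for _ in range(100):                       # simulate a few routing steps
    scores = np.random.rand(n_tokens, n_experts)
    selected = route_with_bias(scores, bias)
    bias = update_bias(bias, selected, n_experts)
print(bias.round(3))                       # experts that were popular end up with lower bias
```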
Multi-Head Latent Attention (MLA): Used to make inference more efficient by jointly compressing attention keys and values into a lower dimension. The compression involves one linear projection matrix that compresses keys and values down as well as another linear projection matrix that projects them back up. Only the compressed joint representation of keys and values needs to be cached during inference. For more details see MLA.
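A simplified sketch of the key/value compression is shown below: a down-projection produces a small shared latent per token, which is the only thing cached, and keys and values are re-expanded from it at attention time. Dimensions are placeholders, and details such as query compression and MLA's decoupled RoPE keys are omitted.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 256, 64, 8, 32   # illustrative sizes

W_down  = nn.Linear(d_model, d_latent, bias=False)          # compress to joint KV latent
W_up_k  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> keys
W_up_v  = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> values

h = torch.randn(10, d_model)        # hidden states for 10 cached tokens
latent_cache = W_down(h)            # (10, 64): only this goes into the KV cache

# At attention time, keys and values are reconstructed from the cached latent.
k = W_up_k(latent_cache).view(10, n_heads, d_head)
v = W_up_v(latent_cache).view(10, n_heads, d_head)
print(latent_cache.shape, k.shape, v.shape)
```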
Multi-Token Prediction: In an effort to improve the training signal, DeepSeek-V3 expands the prediction scope to additional future tokens at every token position of the sequence. In other words, instead of only predicting the next immediate token and training the model on this signal, $D$ future tokens are predicted. These tokens are predicted by $D$ sequential multi-token prediction modules in order to maintain the causal chain. For more details see MTP.
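The toy sketch below mimics this chaining for $D = 2$: each module combines the previous depth's representation with the embedding of the next already-known token and predicts one token further ahead. A single linear layer stands in for each MTP module, token embeddings stand in for the main model's hidden states, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab, D = 64, 1000, 2            # D = number of extra future tokens predicted

embed = nn.Embedding(vocab, d_model)
head = nn.Linear(d_model, vocab)           # output head (shared across depths here)
mtp_modules = nn.ModuleList(nn.Linear(2 * d_model, d_model) for _ in range(D))

tokens = torch.randint(0, vocab, (1, 8))   # a toy batch of one 8-token sequence
h_prev = embed(tokens)                     # stand-in for the main model's hidden states
T = tokens.size(1)

losses = []
for k, module in enumerate(mtp_modules, start=1):
    # At depth k, position i predicts token i + k + 1 (one step further ahead).
    prev = h_prev[:, : T - k - 1]                  # representation chained from depth k-1
    known = embed(tokens[:, k : T - 1])            # embedding of the already-seen token i + k
    h_k = module(torch.cat([prev, known], dim=-1))
    logits = head(h_k)
    targets = tokens[:, k + 1 :]
    losses.append(F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1)))
    h_prev = h_k                                   # sequential chaining keeps the causal chain

print([round(float(l), 3) for l in losses])        # one loss term per prediction depth
```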
The pre-training corpus is a revised version of the one used to train an earlier version of the model, DeepSeek-V2. In this revision, more samples pertaining to mathematics and programming were included. Ultimately, the dataset comprised 14.8T tokens.
DeepSeek-V3 was trained on a cluster with 2048 NVIDIA H800 GPUs. Each node within the cluster consists of 8 H800 GPUs interconnected via NVLink and NVSwitch. In total, it was reported that only 2.664M H800 GPU hours were used for pre-training, while subsequent training stages required only 0.1M GPU hours. One of the main reasons for this training efficiency was the application of an FP8 mixed-precision training framework.
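As a rough sanity check on these figures (assuming, as an idealization, that all 2048 GPUs ran concurrently and continuously), the reported pre-training budget corresponds to under two months of wall-clock time:

```python
gpu_hours_pretraining = 2.664e6   # reported H800 GPU hours for pre-training
num_gpus = 2048                   # reported cluster size

wall_clock_hours = gpu_hours_pretraining / num_gpus
print(f"{wall_clock_hours:.0f} hours ≈ {wall_clock_hours / 24:.1f} days")
# -> roughly 1301 hours, i.e. about 54 days of continuous pre-training
```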
Superior Open-Source Model: DeepSeek-V3 outperformed all other open-source models on educational benchmarks (MMLU, MMLU-Pro, GPQA), achieving performance levels that rival those of closed-source models such as GPT-4o and Claude-Sonnet-3.5. DeepSeek-V3 also achieved SOTA on math-related benchmarks (GSM8K, MATH, MGSM, CMath).
Efficient Training: DeepSeek-V3 was pre-trained using only 2.664M H800 GPU hours, leveraging an FP8 mixed-precision training framework. As reported by the authors, this marked the first successful use of an FP8 scheme to train a model of this scale.
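FP8 mixed-precision training keeps selected tensors and matrix multiplications in 8-bit floating point, with scaling factors chosen so values fit the narrow FP8 range (DeepSeek-V3 reportedly uses fine-grained, block-wise scaling). The snippet below only simulates a per-tensor E4M3-style quantize/dequantize round trip to show why scaling is needed; it is not DeepSeek's actual training framework.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3

def fp8_quantize_dequantize(x: np.ndarray):
    """Simulate per-tensor FP8 (E4M3) quantization: scale into the representable
    range, round to a coarse mantissa, then scale back. Illustrative only."""
    scale = np.abs(x).max() / E4M3_MAX
    x_scaled = x / scale
    # Crude stand-in for E4M3 rounding: keep roughly 3 explicit mantissa bits.
    mant, exp = np.frexp(x_scaled)
    x_fp8 = np.ldexp(np.round(mant * 16) / 16, exp)
    return x_fp8 * scale, scale

x = np.random.randn(4, 4).astype(np.float32)
x_deq, scale = fp8_quantize_dequantize(x)
print("per-tensor scale:", scale, "max abs error:", np.abs(x - x_deq).max())
```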
Reasoning Distillation: As part of the post-training step, the creators of DeepSeek-V3 distilled reasoning capabilities into the model via long CoT passages generated by DeepSeek-R1. The authors noted that this pipeline improved reasoning performance while still maintaining the ability to produce desired outputs and efficient response lengths.
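Schematically, such a distillation pipeline can be thought of as generating candidate responses from the reasoning model and filtering them into SFT examples, as in the hypothetical sketch below. The function names (`r1_generate`, `acceptable`) and the filtering criteria are made up for illustration and do not reflect DeepSeek's actual procedure.

```python
def r1_generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling a long chain-of-thought answer
    from the teacher reasoning model (DeepSeek-R1)."""
    return f"<think>step-by-step reasoning for: {prompt}</think> final answer"

def acceptable(response: str, max_len: int = 4096) -> bool:
    """Hypothetical filter: keep responses that are well-formed and not overly
    long, reflecting the stated goal of efficient response lengths."""
    return "</think>" in response and len(response) <= max_len

prompts = ["Prove that the sum of two even numbers is even.",
           "Write a function that reverses a linked list."]

sft_examples = []
for p in prompts:
    response = r1_generate(p)
    if acceptable(response):
        sft_examples.append({"prompt": p, "response": response})

print(len(sft_examples), "distillation SFT examples")
```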
DeepSeek-V3 requires a significant amount of compute infrastructure to ensure efficient inference.
-The DeepSeek-R1 model was introduced by DeepSeek in January of 2024. It is
+The DeepSeek-R1 model was introduced by DeepSeek in January of 2025. It is
 derived from an earlier checkpoint of DeepSeek-V3. In particular, starting
 with DeepSeek-V3-base, four stages of fine-tuning were performed in order
 to arrive at the checkpoint known as DeepSeek-R1: (i) Reasoning
@@ -307,7 +313,7 @@ Key Results
 Table: Comparison between DeepSeek-R1 and other representative models.
-(Copied from Table 4 of Guo, Daya, et al (2024).)
+(Copied from Table 4 of Guo, Daya, et al (2025).)