diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
index da12476..6efb7b5 100644
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -23,7 +23,10 @@ jobs:
           toolchain: stable

       - name: Install mdbook if needed
-        run: (test -x $HOME/.cargo/bin/mdbook || cargo install --vers "^0.4" mdbook)
+        run: |
+          (test -x $HOME/.cargo/bin/mdbook || cargo install --vers "^0.4" mdbook)
+          cargo install mdbook-reading-time
+          cargo install mdbook-github-authors --version 0.1.0-a0

       - name: Build books
         run: |
diff --git a/books/nlp/book.toml b/books/nlp/book.toml
index 84b2f8e..b6700e9 100644
--- a/books/nlp/book.toml
+++ b/books/nlp/book.toml
@@ -16,6 +16,12 @@
 create-missing = true            # whether or not to create missing pages
 use-default-preprocessors = true # use the default preprocessors
 extra-watch-dirs = []            # directories to watch for triggering builds

+# preprocessors
+[preprocessor.github-authors]
+command = "mdbook-github-authors"
+
+[preprocessor.reading-time]
+
+# renderer options
 [output.html]
 mathjax-support = true
diff --git a/books/nlp/src/SUMMARY.md b/books/nlp/src/SUMMARY.md
index 3cffd96..a8778f4 100644
--- a/books/nlp/src/SUMMARY.md
+++ b/books/nlp/src/SUMMARY.md
@@ -8,6 +8,7 @@
 - [LLMs](llms/README.md)
   - [Architecture](llms/architecture/README.md)
+    - [FeedForward](llms/architecture/feedforward.md)
     - [Attention](llms/architecture/attention.md)
     - [Transformer](llms/architecture/transformer.md)
     - [Mixture of Experts](llms/architecture/moe.md)
@@ -32,6 +33,7 @@
     - [LoRA](llms/fine_tuning/lora.md)
     - [QLoRA](llms/fine_tuning/qlora.md)
     - [DoRA](llms/fine_tuning/dora.md)
+    - [YaRN](llms/fine_tuning/yarn.md)
   - [Agents](llms/agents/README.md)
     - [Tool Use](llms/agents/tool_use.md)
     - [Reflection](llms/agents/reflection.md)
diff --git a/books/nlp/src/llms/architecture/feedforward.md b/books/nlp/src/llms/architecture/feedforward.md
new file mode 100644
index 0000000..e85be0c
--- /dev/null
+++ b/books/nlp/src/llms/architecture/feedforward.md
@@ -0,0 +1 @@
+# FeedForward
diff --git a/books/nlp/src/llms/fine_tuning/yarn.md b/books/nlp/src/llms/fine_tuning/yarn.md
new file mode 100644
index 0000000..2d03c9c
--- /dev/null
+++ b/books/nlp/src/llms/fine_tuning/yarn.md
@@ -0,0 +1 @@
+# YaRN
diff --git a/books/nlp/src/models/deepseek_r1.md b/books/nlp/src/models/deepseek_r1.md
index 80b3aba..e37a4f3 100644
--- a/books/nlp/src/models/deepseek_r1.md
+++ b/books/nlp/src/models/deepseek_r1.md
@@ -1,6 +1,12 @@
+
+
 # DeepSeek-R1

-The DeepSeek-R1 model was introduced by DeepSeek in January of 2024. It is
+
+(Reading time: {{ #reading_time }})
+
+
+The DeepSeek-R1 model was introduced by DeepSeek in January of 2025. It is
 derived from an earlier checkpoint of [DeepSeek-V3](../models/deepseek_v3.md).
 In particular, starting with DeepSeek-V3-base, four stages of fine-tuning were
 performed in order to arrive at the checkpoint known as DeepSeek-R1: (i) **Reasoning
@@ -121,7 +127,7 @@ Below are three key results of DeepSeek-R1 and its development:

 > Table: Comparison between DeepSeek-R1 and other representative models.
-(Copied from Table 4 of Guo, Daya, et al (2024).)
+(Copied from Table 4 of Guo, Daya, et al (2025).)
@@ -167,21 +173,9 @@ such as software-engineering tasks.
    _(appearing in fortune.com)_
 4. [_Open-R1: a fully open reproduction of DeepSeek-R1_](https://huggingface.co/blog/open-r1)
    _(by HuggingFace)_
+5. [_DeepSeek-R1 is available on HuggingFace_](https://huggingface.co/deepseek-ai/DeepSeek-R1)
----
-
-
-
-
-**Contributors:**
-
-
-Contributor 1
-
-
-
-
+
+{{#author nerdai}}
diff --git a/books/nlp/src/models/deepseek_v3.md b/books/nlp/src/models/deepseek_v3.md
index 5443818..ee19591 100644
--- a/books/nlp/src/models/deepseek_v3.md
+++ b/books/nlp/src/models/deepseek_v3.md
@@ -1 +1,191 @@
+
+
 # DeepSeek-v3
+
+
+(Reading time: {{ #reading_time }})
+
+
+The DeepSeek-V3 model was introduced by DeepSeek in December of 2024. It is an
+LLM that leverages [MoE](../llms/architecture/moe.md) in its design.
+
+DeepSeek-V3 Model Lineage
+
+Figure: Illustrating DeepSeek-V3 training evolution.
+
+The training pipeline for DeepSeek-V3 consists of the two typical stages: pre-training
+and post-training. As depicted in the Figure above, the pre-training stage involves
+pre-training on 14.8T tokens followed by long-context extension using the
+[YaRN](../llms/fine_tuning/yarn.md) methodology. Post-training of DeepSeek-V3
+utilizes [SFT](../llms/fine_tuning/sft.md) as well as Reinforcement Learning methods.
+
+## Historical Significance
+
+At the time of its release, open-source models had already been narrowing the
+performance gap with their closed-source counterparts. DeepSeek-V3 was yet another
+open-source model that achieved high levels of performance, beating other open-source
+alternatives as well as some closed-source models on various benchmarks. What made
+DeepSeek-V3's achievement even more intriguing was that it was reportedly trained
+using less compute than its closest counterparts.
+
+## Architectural Highlights
+
+DeepSeek-V3 is a transformer-based model that swaps out nearly all dense
+[feedforward](../llms/architecture/feedforward.md) layers for
+[MoE](../llms/architecture/moe.md) layers. The model has a total of 671B parameters,
+but through its specialized variant of MoE (referred to as DeepSeekMoE), only
+37B parameters are activated in both training and inference. Through a series of
+long-context extension fine-tuning steps, the maximum context length of the model
+was extended to 128K tokens.
+
+**DeepSeekMoE:** Used to carry out training more efficiently, this MoE design
+consists of two sets of experts: shared and routed. The shared experts are applied
+to every token in the input sequence, whereas the routed experts are selected for
+each token according to their affinity to that token (a toy sketch follows at the
+end of this section).
+
+**Auxiliary-Loss-Free Load Balancing:** When using an MoE architecture, one must
+balance the load across experts to prevent routing collapse. This has typically
+been addressed by introducing an auxiliary loss. However, if this loss has too
+great an influence, it can degrade model performance. DeepSeek-V3 instead uses a
+technique that requires no auxiliary loss, relying on a new bias term that
+dynamically changes its value according to each expert's current workload.
+
+**Multi-Head Latent Attention (MLA):** Used to make inference more efficient by
+jointly compressing attention keys and values into a lower-dimensional latent
+representation. The compression involves one linear projection matrix that maps
+keys and values down and another linear projection matrix that maps them back up.
+Only the compressed joint representation of keys and values needs to be cached
+during inference (sketched below). For more details see
+[MLA](../llms/architecture/mla.md).
+
+**Multi-Token Prediction:** In an effort to improve the training signal, DeepSeek-V3
+expands the prediction scope to additional future tokens at every token position
+of the sequence. In other words, instead of only predicting the next immediate token
+and training the model on this signal alone, $D$ additional future tokens are
+predicted. These tokens are predicted sequentially by $D$ multi-token prediction
+modules in order to maintain the causal chain. For more details see
+[MTP](../llms/decoding/multi_token_prediction.md) and the sketch following Table 1.
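+
+To make the two routing ideas above concrete, here is a minimal PyTorch sketch of
+an MoE layer with one shared expert plus top-k routed experts, where a per-expert
+bias on the routing scores stands in for auxiliary-loss-free load balancing. The
+layer sizes, the sigmoid affinity, and the sign-based bias update are illustrative
+assumptions for this book, not DeepSeek-V3's actual implementation.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ToySharedRoutedMoE(nn.Module):
+    """Illustrative MoE layer: 1 shared expert + top-k of n routed experts."""
+
+    def __init__(self, d_model=64, d_expert=128, n_routed=8, top_k=2, bias_lr=0.01):
+        super().__init__()
+
+        def make_expert():
+            return nn.Sequential(
+                nn.Linear(d_model, d_expert), nn.GELU(), nn.Linear(d_expert, d_model)
+            )
+
+        self.shared = make_expert()                                # applied to every token
+        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
+        self.router = nn.Linear(d_model, n_routed, bias=False)     # token-expert affinity
+        self.register_buffer("load_bias", torch.zeros(n_routed))   # updated outside SGD
+        self.top_k, self.bias_lr = top_k, bias_lr
+
+    def forward(self, x):                                 # x: (n_tokens, d_model)
+        affinity = torch.sigmoid(self.router(x))          # routing scores
+        # The bias only influences *which* experts are picked, not the mixing weights.
+        topk = torch.topk(affinity + self.load_bias, self.top_k, dim=-1).indices
+        routed_out = torch.zeros_like(x)
+        load = torch.zeros_like(self.load_bias)
+        for e, expert in enumerate(self.routed):
+            mask = (topk == e).any(dim=-1)                # tokens routed to expert e
+            if mask.any():
+                w = affinity[mask, e:e + 1]               # gating weight for kept tokens
+                routed_out[mask] = routed_out[mask] + w * expert(x[mask])
+            load[e] = mask.float().mean()
+        # Auxiliary-loss-free balancing: nudge future routing away from busy experts.
+        with torch.no_grad():
+            self.load_bias -= self.bias_lr * torch.sign(load - load.mean())
+        return self.shared(x) + routed_out
+
+
+layer = ToySharedRoutedMoE()
+print(layer(torch.randn(16, 64)).shape)                   # torch.Size([16, 64])
+```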
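+
+The KV-compression idea behind MLA can also be sketched in a few lines: keys and
+values are reconstructed from a small shared latent, and only that latent would be
+cached at inference time. The dimensions below are toy values (DeepSeek-V3's
+reported sizes from Table 1 are noted in a comment), and RoPE handling as well as
+query compression are omitted, so this is an illustration under stated assumptions
+rather than the actual MLA module.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ToyLatentKV(nn.Module):
+    """Toy MLA-style compression: cache one small latent instead of full K and V."""
+
+    def __init__(self, d_model=256, d_latent=32, n_heads=4, d_head=16):
+        # DeepSeek-V3 (Table 1) reports d_model=7168, d_latent=512, n_heads=128.
+        super().__init__()
+        self.down = nn.Linear(d_model, d_latent, bias=False)           # joint down-projection
+        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection for keys
+        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # up-projection for values
+        self.n_heads, self.d_head = n_heads, d_head
+
+    def forward(self, h):              # h: (batch, seq, d_model)
+        latent = self.down(h)          # (batch, seq, d_latent) -- the only KV-cache entry
+        b, s, _ = latent.shape
+        k = self.up_k(latent).view(b, s, self.n_heads, self.d_head)
+        v = self.up_v(latent).view(b, s, self.n_heads, self.d_head)
+        return latent, k, v
+
+
+latent, k, v = ToyLatentKV()(torch.randn(1, 8, 256))
+print(latent.shape, k.shape, v.shape)  # caches 32 values per token vs 2 * 4 * 16 = 128
+```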
+
+| Parameter                            | Value                     |
+| ------------------------------------ | ------------------------- |
+| Total parameters                     | 671B                      |
+| Activated parameters                 | 37B                       |
+| Maximum context length               | 128K tokens               |
+| Number of Transformer layers         | 61                        |
+| Hidden dimension size                | 7168                      |
+| Number of attention heads            | 128                       |
+| Number of experts (MoE)              | 1 (shared) & 256 (routed) |
+| Hidden dimension of experts          | 2048                      |
+| KV compression dimension size (MLA)  | 512                       |
+| Multi-token depth (MTP)              | 1                         |
+
+Table 1: Summary of DeepSeek-V3 architecture and hyperparameters.
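+
+As a rough illustration of the multi-token prediction objective described above
+(Table 1 reports an MTP depth of 1), the sketch below adds a single extra
+prediction module that combines the backbone's hidden state with the embedding of
+the ground-truth next token, so every position is trained to predict the tokens at
+offsets +1 and +2. The tiny recurrent backbone and all names are stand-ins chosen
+for brevity, not the module structure used in the paper.
+
+```python
+import torch
+import torch.nn as nn
+
+
+class ToyMTP(nn.Module):
+    """Next-token head plus one multi-token prediction module (depth D = 1)."""
+
+    def __init__(self, vocab=1000, d=64):
+        super().__init__()
+        self.embed = nn.Embedding(vocab, d)
+        self.backbone = nn.GRU(d, d, batch_first=True)  # stand-in for the transformer stack
+        self.next_head = nn.Linear(d, vocab)            # predicts token t+1
+        self.mtp_mix = nn.Linear(2 * d, d)               # fuses h_t with embed(token t+1)
+        self.mtp_head = nn.Linear(d, vocab)              # predicts token t+2
+
+    def forward(self, tokens):                           # tokens: (batch, seq)
+        x = self.embed(tokens)
+        h, _ = self.backbone(x)                          # causal hidden states
+        logits_next = self.next_head(h[:, :-2])          # targets are tokens[:, 1:-1]
+        # The MTP module conditions on the embedding of the true next token together
+        # with the current hidden state, mirroring the sequential-module idea.
+        fused = torch.tanh(self.mtp_mix(torch.cat([h[:, :-2], x[:, 1:-1]], dim=-1)))
+        logits_mtp = self.mtp_head(fused)                # targets are tokens[:, 2:]
+        return logits_next, logits_mtp
+
+
+l1, l2 = ToyMTP()(torch.randint(0, 1000, (2, 16)))
+print(l1.shape, l2.shape)                                # both torch.Size([2, 14, 1000])
+```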
+
+## Training Data
+
+The pre-training corpus is a revised version of the one used to train an earlier
+version of the model, DeepSeek-V2. In this revision, more samples pertaining to
+mathematics and programming were included. Ultimately, the dataset comprised
+14.8T tokens.
+
+## Compute Details
+
+DeepSeek-V3 was trained on a cluster with 2048 NVIDIA H800 GPUs. Each node within
+the cluster consists of 8 H800 GPUs interconnected via NVLink and NVSwitch. In total,
+it was reported that only 2.664M H800 GPU hours were used for pre-training, while
+subsequent training stages required only 0.1M GPU hours. One of the main reasons
+for this training efficiency was their application of an FP8 mixed-precision
+training framework.
+
+## Key Results
+
+| Benchmark (Metric)          | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base |
+| --------------------------- | ------- | ---------------- | ---------------- | ------------------- | ---------------- |
+| Pile-test (BPB)             | -       | 0.606            | 0.638            | **0.542**           | 0.548            |
+| BBH (EM)                    | 3-shot  | 78.8             | 79.8             | 82.9                | **87.5**         |
+| MMLU (EM)                   | 5-shot  | 78.4             | 85.0             | 84.4                | **87.1**         |
+| MMLU-Redux (EM)             | 5-shot  | 75.6             | 83.2             | 81.3                | **86.2**         |
+| MMLU-Pro (EM)               | 5-shot  | 51.4             | 58.3             | 52.8                | **64.4**         |
+| DROP (F1)                   | 3-shot  | 80.4             | 80.6             | 86.0                | **89.0**         |
+| ARC-Easy (EM)               | 25-shot | 97.6             | 98.4             | 98.4                | **98.9**         |
+| ARC-Challenge (EM)          | 25-shot | 92.2             | 94.5             | **95.3**            | **95.3**         |
+| HellaSwag (EM)              | 10-shot | 87.1             | 84.8             | **89.2**            | 88.9             |
+| PIQA (EM)                   | 0-shot  | 83.9             | 82.1             | **85.9**            | 84.7             |
+| WinoGrande (EM)             | 5-shot  | **86.3**         | 82.3             | 85.2                | 84.9             |
+| RACE-Middle (EM)            | 3-shot  | 73.1             | 68.1             | **74.2**            | 74.9             |
+| RACE-High (EM)              | 5-shot  | 52.6             | 50.3             | **56.8**            | 51.3             |
+| TriviaQA (EM)               | 5-shot  | 80.0             | 71.9             | **82.7**            | 82.9             |
+| NaturalQuestions (EM)       | 5-shot  | 38.6             | 33.2             | **41.5**            | 40.0             |
+| AGIEval (EM)                | 0-shot  | 57.5             | 75.8             | 60.6                | **79.6**         |
+| HumanEval (Pass@1)          | 0-shot  | 43.3             | 53.0             | 54.9                | **65.2**         |
+| MBPP (Pass@1)               | 3-shot  | 65.0             | 72.6             | 68.4                | **75.4**         |
+| LiveCodeBench-Base (Pass@1) | 3-shot  | 11.6             | 12.9             | 15.1                | **19.4**         |
+| CRUXEval-I (EM)             | 2-shot  | 52.5             | 59.1             | 58.5                | **67.3**         |
+| CRUXEval-O (EM)             | 2-shot  | 49.8             | 59.9             | 59.9                | **69.8**         |
+| GSM8K (EM)                  | 8-shot  | 81.6             | 88.3             | 89.3                | **89.3**         |
+| MATH (EM)                   | 4-shot  | 43.4             | 54.4             | 49.0                | **61.6**         |
+| MGSM (EM)                   | 8-shot  | 63.6             | 76.2             | 69.9                | **79.8**         |
+| CMath (EM)                  | 3-shot  | 78.7             | 84.5             | 77.3                | **90.7**         |
+| CLUEWSC (EM)                | 5-shot  | 82.0             | 82.5             | **83.0**            | 82.7             |
+| C-Eval (EM)                 | 0-shot  | 81.4             | 72.5             | 72.5                | **90.1**         |
+| CMMLU (EM)                  | 5-shot  | 84.0             | **89.5**         | 73.7                | 88.8             |
+| CMRC (EM)                   | 1-shot  | **77.4**         | 75.8             | 76.0                | 76.3             |
+| C3 (EM)                     | 0-shot  | 77.4             | 76.7             | **79.7**            | 78.6             |
+| CCPM (EM)                   | 0-shot  | **93.0**         | 88.5             | 78.6                | 92.0             |
+| MMLU-non-English (EM)       | 5-shot  | 64.0             | 74.8             | 73.8                | **79.4**         |
+
+Table 2: Comparison between DeepSeek-V3 and other representative models.
+(Copied from Table 3 of Liu, Aixin, et al (2024).)
+
+1. **Superior Open-Source Model:** DeepSeek-V3 outperformed all other open-source
+   models on educational benchmarks (MMLU, MMLU-Pro, GPQA), achieving performance
+   levels that rival those of closed-source models such as GPT-4o and Claude-Sonnet-3.5.
+   DeepSeek-V3 also achieved SOTA on math-related benchmarks (GSM8K, MATH, MGSM,
+   CMath).
+
+2. **Efficient Training:** DeepSeek-V3 was trained using only 2.664M H800 GPU hours,
+   leveraging an FP8 mixed-precision training framework. This marked, as reported
+   by the authors, the first successful use of an FP8 scheme to train a large-scale
+   model.
+
+3. **Reasoning Distillation:** As part of the post-training step, DeepSeek-V3's
+   creators were able to distill reasoning capabilities from long
+   [CoT](../llms/prompting/cot.md) passages generated by
+   [DeepSeek-R1](../models/deepseek_r1.md). The authors noted that this pipeline
+   improved reasoning performance while still maintaining the desired output style
+   and efficient response lengths.
+
+## Limitations
+
+DeepSeek-V3 requires a significant amount of compute infrastructure to ensure
+efficient inference.
+
+#### References & Useful Links
+
+1. [_Liu, Aixin, et al. "Deepseek-v3 technical report." arXiv preprint
+   arXiv:2412.19437 (2024)._](https://arxiv.org/pdf/2412.19437)
+2. [_DeepSeek sparks AI stock selloff; Nvidia posts record market-cap loss_](https://www.reuters.com/technology/chinas-deepseek-sets-off-ai-market-rout-2025-01-27/)
+   (_appearing in reuters.com_)
+
+{{#author nerdai}}