DeepSeek-V3 #8

Merged · 13 commits · Feb 7, 2025

Changes from all commits
5 changes: 4 additions & 1 deletion .github/workflows/publish.yml
@@ -23,7 +23,10 @@ jobs:
toolchain: stable

- name: Install mdbook if needed
run: (test -x $HOME/.cargo/bin/mdbook || cargo install --vers "^0.4" mdbook)
run: |
(test -x $HOME/.cargo/bin/mdbook || cargo install --vers "^0.4" mdbook)
cargo install mdbook-reading-time
cargo install mdbook-github-authors --version 0.1.0-a0

- name: Build books
run: |
6 changes: 6 additions & 0 deletions books/nlp/book.toml
@@ -16,6 +16,12 @@ create-missing = true # whether or not to create missing pages
use-default-preprocessors = true # use the default preprocessors
extra-watch-dirs = [] # directories to watch for triggering builds

# preprocessors
[preprocessor.github-authors]
command = "mdbook-github-authors"

[preprocessor.reading-time]

# renderer options
[output.html]
mathjax-support = true
2 changes: 2 additions & 0 deletions books/nlp/src/SUMMARY.md
@@ -8,6 +8,7 @@

- [LLMs](llms/README.md)
- [Architecture](llms/architecture/README.md)
- [FeedForward](llms/architecture/feedforward.md)
- [Attention](llms/architecture/attention.md)
- [Transformer](llms/architecture/transformer.md)
- [Mixture of Experts](llms/architecture/moe.md)
@@ -32,6 +33,7 @@
- [LoRA](llms/fine_tuning/lora.md)
- [QLoRA](llms/fine_tuning/qlora.md)
- [DoRA](llms/fine_tuning/dora.md)
- [YaRN](llms/fine_tuning/yarn.md)
- [Agents](llms/agents/README.md)
- [Tool Use](llms/agents/tool_use.md)
- [Reflection](llms/agents/reflection.md)
1 change: 1 addition & 0 deletions books/nlp/src/llms/architecture/feedforward.md
@@ -0,0 +1 @@
# FeedForward
1 change: 1 addition & 0 deletions books/nlp/src/llms/fine_tuning/yarn.md
@@ -0,0 +1 @@
# YaRN
26 changes: 10 additions & 16 deletions books/nlp/src/models/deepseek_r1.md
@@ -1,6 +1,12 @@
<!-- markdownlint-disable-file MD033 -->

# DeepSeek-R1

The DeepSeek-R1 model was introduced by DeepSeek in January of 2024. It is
<p align="left"><small>
(Reading time: {{ #reading_time }})
</small></p>

The DeepSeek-R1 model was introduced by DeepSeek in January of 2025. It is
derived from an earlier checkpoint of [DeepSeek-V3](../models/deepseek_v3.md).
In particular, starting with DeepSeek-V3-base, four stages of fine-tuning were
performed in order to arrive at the checkpoint known as DeepSeek-R1: (i) **Reasoning
@@ -121,7 +127,7 @@ Below are three key results of DeepSeek-R1 and its development:
>

Table: Comparison between DeepSeek-R1 and other representative models.
(Copied from Table 4 of Guo, Daya, et al (2024).)
(Copied from Table 4 of Guo, Daya, et al (2025).)

</div>

@@ -167,21 +173,9 @@ such as software-engineering tasks.
_(appearing in fortune.com)_
4. [_Open-R1: a fully open reproduction of DeepSeek-R1_](https://huggingface.co/blog/open-r1)
_(by HuggingFace)_
5. [_DeepSeek-R1 is available on HuggingFace_](https://huggingface.co/deepseek-ai/DeepSeek-R1)

<!-- TODO: mdBook preprocessor with custom mustache handler {{ #author }} -->
<!-- markdownlint-disable-file MD033 -->

---

<div class="contributor-footnotes">
<small>

**Contributors:**

<a href="https://github.com/nerdai">
<img src="https://github.com/nerdai.png"
width="32px" alt="Contributor 1" style="border-radius: 50%">
</a>
</small>

</div>
{{#author nerdai}}
190 changes: 190 additions & 0 deletions books/nlp/src/models/deepseek_v3.md
@@ -1 +1,191 @@
<!-- markdownlint-disable-file MD033 -->

# DeepSeek-V3

<p align="left"><small>
(Reading time: {{ #reading_time }})
</small></p>

The DeepSeek-V3 model was introduced by DeepSeek in December of 2024. It is an
LLM that leverages [MoE](../llms/architecture/moe.md) in its design.

<center>
<img src="https://d3ddy8balm3goa.cloudfront.net/vector-ai-pocket-refs/deepseek-v3-lineage-v2.excalidraw.svg" alt="DeepSeek-V3 Model Lineage"> <!-- markdownlint-disable-line MD013 -->
</center>

<div
class="figure-caption"
style="text-align: center; font-size: 0.8em; margin-top: 10px;"
>
Figure: Illustrating DeepSeek-V3 training evolution.
</div>

The training pipeline for DeepSeek-V3 consists of the two typical stages: pre-training
and post-training. As depicted in the Figure above, the pre-training stage involves
training on 14.8T tokens followed by long-context extension using the [YaRN](../llms/fine_tuning/yarn.md)
methodology. Post-training of DeepSeek-V3 utilizes [SFT](../llms/fine_tuning/sft.md)
as well as Reinforcement Learning methods.

## Historical Significance

At the time of its release, open-source models had already been narrowing the
performance gap with their closed-source counterparts. DeepSeek-V3 was yet another
open-source model that achieved high levels of performance, beating other open-source
alternatives as well as some closed-source models on various benchmarks. What made
DeepSeek-V3's achievement even more intriguing was that it was reportedly trained
using less compute than its closest counterparts.

## Architectural Highlights

DeepSeek-V3 is a transformer-based model that swaps out nearly all dense [feedforward](../llms/architecture/feedforward.md)
layers for [MoE](../llms/architecture/moe.md) layers. The model has a total of 671B
parameters, but through its specialized variant of MoE (referred to as DeepSeekMoE),
only 37B parameters are activated per token during both training and inference.
Through a series of long-context extension fine-tuning steps, the maximum context
length of the model was extended to 128K tokens.

**DeepSeekMoE:** Used to carry out training more efficiently, this MoE design
consists of two sets of experts, namely: shared and routed. The shared experts are
applied to every token in the input sequence, whereas the routed experts are selected
per token according to their affinity with that token.
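
To make the shared/routed split concrete, below is a minimal PyTorch-style sketch
of an MoE layer with one shared expert and top-k routed experts selected by
token-expert affinity. The class and parameter names (`SharedRoutedMoE`, `n_routed`,
`top_k`) and the sigmoid affinity scoring are illustrative choices for this sketch,
not DeepSeek-V3's actual implementation.

```python
import torch
import torch.nn as nn


class Expert(nn.Module):
    """A small feedforward expert (illustrative sizes, not DeepSeek-V3's)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class SharedRoutedMoE(nn.Module):
    """Hypothetical sketch: a shared expert applied to every token, plus
    top-k routed experts chosen per token by affinity scores."""

    def __init__(self, d_model=512, d_hidden=1024, n_routed=8, top_k=2):
        super().__init__()
        self.shared = Expert(d_model, d_hidden)
        self.routed = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed, bias=False)  # token-expert affinities
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        out = self.shared(x)                     # shared expert sees every token
        scores = torch.sigmoid(self.router(x))  # affinity of each token to each expert
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]            # chosen expert index per token
            w = weights[..., slot].unsqueeze(-1)  # normalized gating weight
            for e, expert in enumerate(self.routed):
                mask = (idx == e).unsqueeze(-1)  # tokens routed to expert e
                out = out + mask * w * expert(x)
        return out
```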

**Auxiliary-Loss-Free Load Balancing:** When using an MoE architecture, one must
consider load balancing across the experts to prevent routing collapse. This has
typically been addressed via the introduction of an auxiliary loss. However, if
this loss has too great an influence, it can lead to model degradation. DeepSeek-V3
instead uses a technique that requires no auxiliary loss, relying on a bias term
that dynamically changes its value according to each expert's current workload.
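
As a rough sketch of the auxiliary-loss-free idea, the snippet below adds a per-expert
bias to the affinity scores only when selecting the top-k experts (the gating weights
still use the raw affinities), and then nudges the bias down for over-loaded experts
and up for under-loaded ones. The update rule and the `gamma` step size are
illustrative assumptions, not the exact procedure used by DeepSeek-V3.

```python
import torch


def biased_topk_routing(scores: torch.Tensor, bias: torch.Tensor, top_k: int):
    """Select experts with bias-adjusted scores; gate with the raw scores.

    scores: (num_tokens, num_experts) token-to-expert affinities
    bias:   (num_experts,) load-balancing bias, used for selection only
    """
    _, topk_idx = (scores + bias).topk(top_k, dim=-1)  # selection uses the bias
    gate = scores.gather(-1, topk_idx)                 # gating uses raw affinities
    return topk_idx, gate / gate.sum(dim=-1, keepdim=True)


def update_bias(bias: torch.Tensor, topk_idx: torch.Tensor,
                num_experts: int, gamma: float = 1e-3) -> torch.Tensor:
    """Illustrative update: lower the bias of over-loaded experts and raise it
    for under-loaded ones, so the load evens out without any auxiliary loss."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    return bias - gamma * torch.sign(load - load.mean())
```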

**Multi-Head Latent Attention (MLA):** Used for making inference more efficient
by jointly compressing attention keys and values into a lower-dimensional latent.
The compression involves one linear projection matrix that compresses keys and values
down and another linear projection matrix that expands them back up. Only the
compressed joint representation of keys and values needs to be cached during inference.
For more details see [MLA](../llms/architecture/mla.md).
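
The sketch below illustrates the caching benefit: the hidden state is compressed into
a small latent `c_kv` (dimension 512 per Table 1 below) from which per-head keys and
values are reconstructed, so only `c_kv` needs to be cached. The per-head dimension of
128 is an assumption for this sketch, and details such as the decoupled positional
keys are omitted; see [MLA](../llms/architecture/mla.md) for the full design.

```python
import torch
import torch.nn as nn


class SimplifiedMLAKV(nn.Module):
    """Simplified key/value path of Multi-Head Latent Attention: cache a
    low-dimensional joint latent instead of full per-head keys and values."""

    def __init__(self, d_model=7168, d_latent=512, n_heads=128, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress down
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, d_model)
        c_kv = self.down(h)  # (batch, seq, d_latent) -- the only tensor that is cached
        k = self.up_k(c_kv).view(*h.shape[:2], self.n_heads, self.d_head)
        v = self.up_v(c_kv).view(*h.shape[:2], self.n_heads, self.d_head)
        return c_kv, k, v
```

With these illustrative sizes, each cached token costs 512 values per layer instead
of the 128 heads × 128 dims × 2 (keys and values) = 32,768 values a standard KV cache
would store.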

**Multi-Token Prediction:** In an effort to improve the training signal, DeepSeek-V3
expands the prediction scope to additional future tokens at every position in the
sequence. In other words, instead of only predicting the next immediate token
and training the model on this signal, $D$ additional future tokens are predicted.
These tokens are predicted by $D$ sequential multi-token prediction modules in order
to maintain the causal chain. For more details see [MTP](../llms/decoding/multi_token_prediction.md).
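
A toy sketch of the idea with a single extra prediction module (MTP depth $D = 1$, as
in Table 1 below): the module combines the main model's hidden state at position $t$
with the embedding of the known token at position $t+1$ and predicts the token at
position $t+2$. The module name, the concatenate-and-project combination, and the
single transformer block are simplifying assumptions for this sketch.

```python
import torch
import torch.nn as nn


class MTPModule(nn.Module):
    """Illustrative multi-token-prediction head: given the backbone's hidden
    states and embeddings of the next tokens, predict one step further ahead,
    preserving the causal chain."""

    def __init__(self, d_model: int, shared_embedding: nn.Embedding, shared_head: nn.Linear):
        super().__init__()
        self.embed = shared_embedding  # embedding table shared with the main model
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.head = shared_head        # output head shared with the main model

    def forward(self, hidden: torch.Tensor, next_tokens: torch.Tensor) -> torch.Tensor:
        # hidden:      (batch, seq, d_model) from the main model at positions t
        # next_tokens: (batch, seq) ground-truth tokens at positions t + 1
        causal_mask = nn.Transformer.generate_square_subsequent_mask(hidden.size(1)).to(hidden.device)
        merged = self.proj(torch.cat([hidden, self.embed(next_tokens)], dim=-1))
        return self.head(self.block(merged, src_mask=causal_mask))  # logits for positions t + 2
```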

| Parameter | Value |
| ----------------------------------- | ------------------------- |
| Total parameters | 671B |
| Activated parameters | 37B |
| Maximum context length | 128K tokens |
| Number of Transformer layers | 61 |
| Hidden dimension size | 7168 |
| Number of attention heads | 128 |
| Number of experts (MoE) | 1 (shared) & 256 (routed) |
| Hidden dimension of experts | 2048 |
| KV compression dimension size (MLA) | 512 |
| Multi-token depth (MTP) | 1 |

<div
class="table-caption"
style="text-align: center; font-size: 0.8em; margin-top: 10px;"
>
Table 1: Summary of DeepSeek-V3 architecture and hyperparameters.
</div>
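
For quick reference, the same values can be collected into a single configuration
object; the field names below are this book's own shorthand, not identifiers from
DeepSeek-V3's code or configuration files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepSeekV3Spec:
    """Architecture figures from Table 1 (field names are illustrative)."""

    total_params: str = "671B"
    activated_params_per_token: str = "37B"
    max_context_tokens: str = "128K"
    num_layers: int = 61
    hidden_dim: int = 7168
    num_attention_heads: int = 128
    num_shared_experts: int = 1
    num_routed_experts: int = 256
    expert_hidden_dim: int = 2048
    kv_compression_dim: int = 512
    mtp_depth: int = 1
```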

## Training Data

The pre-training corpus is a revised version of the one used to train an earlier
version of the model, DeepSeek-V2. In this revision, more samples pertaining to
mathematics and programming were included. Ultimately, the dataset comprised
14.8T tokens.

## Compute Details

DeepSeek-V3 was trained on a cluster of 2048 NVIDIA H800 GPUs. Each node within
the cluster consists of 8 H800 GPUs interconnected via NVLink and NVSwitch. In total,
it was reported that only 2.664M H800 GPU hours were used for pre-training, while
subsequent training stages required only 0.1M GPU hours. One of the main reasons
for this training efficiency was the application of an FP8 mixed-precision
training framework.
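
As a back-of-the-envelope check (ours, not from the report): 2.664M GPU hours spread
over 2048 GPUs is roughly 1,300 hours per GPU, or about 54 days of wall-clock
pre-training if all 2048 GPUs ran continuously.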

## Key Results

<!-- markdownlint-disable MD013 -->

| Benchmark (Metric) | # Shots | DeepSeek-V2 Base | Qwen2.5 72B Base | LLaMA-3.1 405B Base | DeepSeek-V3 Base |
| --------------------------- | ------- | ---------------- | ---------------- | ------------------- | ---------------- |
| Pile-test (BPB) | - | 0.606 | 0.638 | **0.542** | 0.548 |
| BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | **87.5** |
| MMLU (EM) | 5-shot | 78.4 | 85.0 | 84.4 | **87.1** |
| MMLU-Redux (EM) | 5-shot | 75.6 | 83.2 | 81.3 | **86.2** |
| MMLU-Pro (EM) | 5-shot | 51.4 | 58.3 | 52.8 | **64.4** |
| DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | **89.0** |
| ARC-Easy (EM) | 25-shot | 97.6 | 98.4 | 98.4 | **98.9** |
| ARC-Challenge (EM) | 25-shot | 92.2 | 94.5 | **95.3** | **95.3** |
| HellaSwag (EM) | 10-shot | 87.1 | 84.8 | **89.2** | 88.9 |
| PIQA (EM) | 0-shot | 83.9 | 82.1 | **85.9** | 84.7 |
| WinoGrande (EM) | 5-shot | **86.3** | 82.3 | 85.2 | 84.9 |
| RACE-Middle (EM) | 3-shot | 73.1 | 68.1 | **74.2** | 74.9 |
| RACE-High (EM) | 5-shot | 52.6 | 50.3 | **56.8** | 51.3 |
| TriviaQA (EM) | 5-shot | 80.0 | 71.9 | **82.7** | 82.9 |
| NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | **41.5** | 40.0 |
| AGIEval (EM) | 0-shot | 57.5 | 75.8 | 60.6 | **79.6** |
| HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | **65.2** |
| MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | **75.4** |
| LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.1 | **19.4** |
| CRUXEval-I (EM)             | 2-shot  | 52.5             | 59.1             | 58.5                | **67.3**         |
| CRUXEval-O (EM) | 2-shot | 49.8 | 59.9 | 59.9 | **69.8** |
| GSM8K (EM)                  | 8-shot  | 81.6             | 88.3             | 89.3                | **89.3**         |
| MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | **61.6** |
| MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | **79.8** |
| CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | **90.7** |
| CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | **83.0** | 82.7 |
| C-Eval (EM) | 0-shot | 81.4 | 72.5 | 72.5 | **90.1** |
| CMMLU (EM) | 5-shot | 84.0 | **89.5** | 73.7 | 88.8 |
| CMRC (EM) | 1-shot | **77.4** | 75.8 | 76.0 | 76.3 |
| C3 (EM) | 0-shot | 77.4 | 76.7 | **79.7** | 78.6 |
| CCPM (EM) | 0-shot | **93.0** | 88.5 | 78.6 | 92.0 |
| MMLU-non-English (EM) | 5-shot | 64.0 | 74.8 | 73.8 | **79.4** |

<!-- markdownlint-enable MD013 -->

<div
class="table-caption"
style="text-align: center; font-size: 0.8em; margin-top: 10px;"
>
Table 2: Comparison between DeepSeek-V3 and other representative models.
(Copied from Table 3 of Liu, Aixin, et al (2024).)
</div>

1. **Superior Open-Source Model:** DeepSeek-V3 outperformed all other open-source
models on educational benchmarks (MMLU, MMLU-Pro, GPQA), achieving performance
levels that rival those of closed-source models such as GPT-4o and Claude-Sonnet-3.5.
DeepSeek-V3 also achieved SOTA on math-related benchmarks (GSM8K, MATH, MGSM,
CMath).

2. **Efficient Training:** DeepSeek-V3 was trained using only 2.664M H800 GPU hours,
leveraging an FP8 mixed precision training framework. This marked, as reported
by the authors, the first successful use of an FP8 scheme to train a large-scale
model.

3. **Reasoning Distillation:** As part of the post-training step, DeepSeek-V3 creators
were able to distill reasoning capabilities via long [CoT](../llms/prompting/cot.md)
passages generated by [DeepSeek-R1](../models/deepseek_r1.md). The authors noted
that this pipeline improved reasoning performance while still maintaining the
ability to produce desired outputs and efficient response lengths.

## Limitations

DeepSeek-V3 requires significant compute infrastructure to ensure efficient
inference.

#### References & Useful Links <!-- markdownlint-disable-line MD001 -->

1. [_Liu, Aixin, et al. "Deepseek-v3 technical report." arXiv preprint
arXiv:2412.19437 (2024)._](https://arxiv.org/pdf/2412.19437)
2. [DeepSeek sparks AI stock selloff; Nvidia posts record market-cap loss](https://www.reuters.com/technology/chinas-deepseek-sets-off-ai-market-rout-2025-01-27/)
(_appearing in reuters.com_)

<!-- TODO: mdBook preprocessor with custom mustache handler {{ #author }} -->
<!-- markdownlint-disable-file MD033 -->

{{#author nerdai}}