+++
title = "SU-CS224N MAY092024"
author = ["Houjun Liu"]
draft = false
+++

## Floating Point {#floating-point}

A single-precision float occupies 4 bytes:

\begin{equation}
(-1)^{B} \times 2^{E-127} \times \qty(1 + \sum\_{i=1}^{23} b\_{23-i}2^{-i})
\end{equation}

usually \\(E\\) is 8 bits, and there are 23 bits of \\(b\\).

With more \\(E\\), we will have more range; with more \\(b\\), we will have more precision.

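The formula above can be checked by hand against the bit layout of a float32. A minimal sketch using only the standard library (normal numbers only; subnormals, infinities, and NaN follow different rules):

```python
import struct

def decode_float32(x: float) -> float:
    """Recompute a (normal) float32's value from its sign, exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    B = (bits >> 31) & 0x1      # sign bit
    E = (bits >> 23) & 0xFF     # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF  # 23 fraction bits b_1..b_23
    frac = 1 + sum(((mantissa >> (23 - i)) & 1) * 2 ** (-i) for i in range(1, 24))
    return (-1) ** B * 2 ** (E - 127) * frac

print(decode_float32(-6.25))    # reproduces the input exactly
```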
## Mixed Precision Training {#mixed-precision-training}

1. Keep a master copy of the model in FP32
2. Run the forward pass in FP16
3. Scale the loss to be large enough to not be rounded away
4. Compute gradients in FP16
5. Convert the gradients into FP32
6. Scale the gradients back down
7. Apply them to the FP32 model

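Why step 3 is needed can be seen directly from FP16's limited range. A minimal NumPy sketch (the gradient value and the scale factor `2**14` are illustrative choices):

```python
import numpy as np

grad = 1e-8                                # a small gradient, typical late in training
scale = 2.0 ** 14                          # illustrative loss-scale factor

underflowed = np.float16(grad)             # rounds away to zero in FP16
scaled = np.float16(grad * scale)          # survives in FP16 after scaling
recovered = np.float32(scaled) / scale     # step 6: unscale in FP32

print(underflowed, scaled, recovered)
```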
### BFloat16 {#bfloat16}

To avoid loss scaling altogether, we can use a scheme that has less precision but the same dynamic range as FP32 (i.e., allocate the same 8 bits to \\(E\\) and chop \\(b\\) down): with the full FP32 exponent range, small gradients no longer underflow, so there is no need to scale.

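Since bfloat16 is just the top 16 bits of a float32, a rough conversion can be sketched by truncation (real hardware rounds rather than truncates, but the range/precision trade-off is the same):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bfloat16(1e-8))        # tiny values survive: same exponent range as FP32
print(to_bfloat16(1 + 2**-8))   # but precision is coarse: only 7 mantissa bits remain
```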
## Distributed Data Parallel {#distributed-data-parallel}

- every GPU has a copy of the model
- each GPU runs the forward and backward pass on its own slice of the batch, producing its own gradients


### all-reduce {#all-reduce}

reduce (sum or average) each GPU's copy of the gradients down so that every GPU ends up with the same result

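A minimal single-process simulation of all-reduce over per-GPU gradient lists (toy values; real systems use a collective-communication library such as NCCL):

```python
# Each inner list stands in for one GPU's local gradients.
per_gpu_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

def all_reduce_mean(shards):
    """Average elementwise across workers, then give every worker the full result."""
    n = len(shards)
    reduced = [sum(vals) / n for vals in zip(*shards)]
    return [list(reduced) for _ in shards]  # every GPU gets the same copy

print(all_reduce_mean(per_gpu_grads))  # every worker now holds [3.0, 4.0]
```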
## Deepspeed Zero {#deepspeed-zero}


### reduce-scatter {#reduce-scatter}

squish the copies down and send each part to the right GPU


### all gather {#all-gather}

- send everything over to everybody else

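The two collectives can be sketched in the same toy single-process style; note that an all-reduce is exactly a reduce-scatter followed by an all-gather:

```python
def reduce_scatter(shards):
    """Sum elementwise across workers; worker i keeps only chunk i of the result."""
    n = len(shards)
    reduced = [sum(vals) for vals in zip(*shards)]
    chunk = len(reduced) // n
    return [reduced[i * chunk:(i + 1) * chunk] for i in range(n)]

def all_gather(parts):
    """Every worker receives the concatenation of all workers' parts."""
    full = [x for part in parts for x in part]
    return [list(full) for _ in parts]

per_gpu = [[1.0, 2.0], [10.0, 20.0]]  # 2 workers, 2 gradient entries each
parts = reduce_scatter(per_gpu)       # worker 0 keeps the sum's first half, worker 1 the second
print(all_gather(parts))              # both workers end up holding [11.0, 22.0]
```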
### Stage 1 {#stage-1}

We cache only a slice of the optimizer state on each GPU.


### Stage 2 {#stage-2}

- perform a backward pass
- at each layer, compute the gradient
- look up who in the cluster is responsible for that layer


### Stage 3 {#stage-3}

- divide the model parameters into FSDP units
- shard each unit across multiple GPUs
- run the forward pass
- run the backward pass
- each GPU updates its own shard using the full gradient from earlier

(unlike stages 1 and 2, you need to stream in your parameters---more communication overhead!)

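The Stage 1 idea can be sketched as a toy partitioning of optimizer state across ranks, so each GPU stores only its own slice (hypothetical entry names and rank count):

```python
def shard_state(optimizer_state, world_size):
    """Split a flat list of per-parameter optimizer entries round-robin across ranks."""
    return [optimizer_state[rank::world_size] for rank in range(world_size)]

# 8 momentum entries partitioned across 4 GPUs: each rank stores 2 instead of 8.
state = [f"m{i}" for i in range(8)]
shards = shard_state(state, world_size=4)
print(shards)  # [['m0', 'm4'], ['m1', 'm5'], ['m2', 'm6'], ['m3', 'm7']]
```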
## Lessons {#lessons}

- always use mixed precision training
- always use bfloat16

## PEFT {#peft}

See [PEFT]({{< relref "KBhpeft.md" >}})
+++
title = "SU-ENGR76 MAY092024"
author = ["Houjun Liu"]
draft = false
+++

## [digital encoding]({{< relref "KBhsu_engr76_may072024.md#digital-encoding" >}}) {#digital-encoding--kbhsu-engr76-may072024-dot-md}

We allocate different systems in the same environment different frequency bands; by doing this, we can pack information more effectively and prevent interference.

"how do we take a sequence of bits 10100... and map it to a continuous-time signal \\(X(t)\\) such that the spectrum of this signal is limited to \\([0, B]\\)"?

### sinc digital encoding {#sinc-digital-encoding}

IDEA: recall the [sinc sampling theorem]({{< relref "KBhsu_engr76_may022024.md#shannon-s-nyquist-theorem" >}}): sinc interpolation recovers the source points exactly. As such, we can write:

\begin{equation}
X(t) = \sum\_{m=1}^{\infty} X[m] \text{sinc} \qty( \frac{t-mT}{T})
\end{equation}

for your choice of period \\(T > 0\\), where \\(X[m] = V \cdot b\_{m}\\) and \\(b\_{m}\\) is the binary ([Huffman Coding]({{< relref "KBhhuffman_coding.md" >}})) encoding of your data.

By the [sinc sampling theorem]({{< relref "KBhsu_engr76_may022024.md#shannon-s-nyquist-theorem" >}}), we know that the spectrum of the resulting signal is limited to at most \\(\frac{1}{2T}\\). This means that, to limit the transmission to \\([0, B]\\), we should transmit our signal using sinc interpolation with \\(T = \frac{1}{2B}\\).

This makes the rate of our communication \\(\frac{1}{T}\\) bits per second; plugging in \\(T = \frac{1}{2B}\\), our rate of communication is \\(2B\\) bits per second (the higher the bandwidth, the higher the speed).
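The construction can be checked numerically over a finite bit sequence. A small NumPy sketch (`np.sinc(x)` computes \\(\sin(\pi x)/(\pi x)\\), matching the sinc above; `V`, `T`, and the bits are illustrative):

```python
import numpy as np

V, T = 1.0, 0.5                      # illustrative amplitude and period (T = 1/2B)
bits = np.array([1, 0, 1, 0, 0, 1])  # b_1 .. b_6
X_m = V * bits                       # samples X[m] = V * b_m

def X(t):
    """Finite version of X(t) = sum_m X[m] sinc((t - mT)/T)."""
    m = np.arange(1, len(bits) + 1)
    return np.sum(X_m * np.sinc((t - m * T) / T))

# Sampling the waveform back at t = mT recovers the encoded bits
# (up to floating-point error), since sinc(k - m) vanishes for k != m.
print([round(float(X(m * T)), 6) for m in range(1, 7)])
```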