kb autocommit

Jemoka committed May 10, 2024 · commit f233cb7 (1 parent 0775021)
Showing 5 changed files with 171 additions and 2 deletions.
content/posts/KBhpeft.md
title = "PEFT"
author = ["Houjun Liu"]
draft = false
+++

[PEFT]({{< relref "KBhpeft.md" >}}) stands for **parameter-efficient fine-tuning**: adapting a pre-trained model by training only a small subset of (or a small number of additional) parameters.


## LoRA {#lora}

Consider some pre-trained weight matrix:

\begin{equation}
W\_0 \in \mathbb{R}^{d \times k}
\end{equation}

Key intuition: **gradient matrices have low intrinsic rank**. We therefore consider the following low-rank update:

\begin{equation}
W\_0 + \Delta W = W\_0 + \alpha BA
\end{equation}

where \\(B \in \mathbb{R}^{d \times r}\\), \\(A \in \mathbb{R}^{r \times k}\\), and \\(r \ll \min(d,k)\\); \\(\alpha\\) trades off pre-trained knowledge against task-specific knowledge.
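A minimal numpy sketch of the update (sizes and \\(\alpha\\) here are illustrative; LoRA conventionally initializes \\(B = 0\\) so training starts exactly from the pre-trained weights):

```python
import numpy as np

# Hypothetical sizes: d x k weight, rank r << min(d, k).
d, k, r, alpha = 64, 32, 4, 1.0
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))   # frozen pre-trained weight
B = np.zeros((d, r))           # B starts at zero ...
A = rng.normal(size=(r, k))    # ... so the initial update is zero

W = W0 + alpha * B @ A

# Before any training, the adapted weight equals the pre-trained one.
assert np.allclose(W, W0)

# After training moves B away from zero, the update has rank at most r.
B_trained = rng.normal(size=(d, r))
assert np.linalg.matrix_rank(alpha * B_trained @ A) <= r
```

Only \\(B\\) and \\(A\\) (that is, \\(r(d+k)\\) numbers) are trained, versus \\(dk\\) for full fine-tuning.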
content/posts/KBhsu_cs224n_may092024.md
+++
title = "SU-CS224N MAY092024"
author = ["Houjun Liu"]
draft = false
+++

## Floating Point {#floating-point}

FP32 uses 4 bytes (32 bits): a sign bit \\(S\\), a biased exponent \\(E\\), and fraction bits \\(b\\):

\begin{equation}
(-1)^{S} \times 2^{E-127} \times \qty(1 + \sum\_{i=1}^{23} b\_{23-i}2^{-i})
\end{equation}

usually \\(E\\) is 8 bits, and there are 23 fraction bits \\(b\\).

With more \\(E\\) bits we get more range; with more \\(b\\) bits, more precision.
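A quick way to check the formula is to decode a float's bit fields directly (a sketch for normal numbers only, ignoring subnormals, infinities, and NaN):

```python
import struct

def decode_float32(x: float) -> float:
    # Pack as IEEE 754 single precision and pull out the bit fields.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    S = bits >> 31                    # 1 sign bit
    E = (bits >> 23) & 0xFF           # 8 exponent bits, biased by 127
    frac = bits & 0x7FFFFF            # 23 fraction bits
    mantissa = 1 + frac / 2**23       # implicit leading 1 (normal numbers)
    return (-1) ** S * 2 ** (E - 127) * mantissa

print(decode_float32(-6.5))  # reconstructs -6.5 from its bit fields
```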


## Mixed Precision Training {#mixed-precision-training}

1. Keep a master copy of the model in FP32
2. Run the forward pass in FP16
3. Scale the loss up so small gradient values are not rounded away in FP16
4. Compute gradients in FP16
5. Convert the gradients to FP32
6. Scale the gradients back down
7. Apply them to the FP32 master copy
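The loss-scaling logic (steps 3, 5, and 6) can be simulated in numpy; the scale factor here is an illustrative choice, and real systems tune it dynamically:

```python
import numpy as np

scale = 2.0 ** 16   # loss-scale factor (assumed; tuned dynamically in practice)
tiny_grad = 1e-8    # a gradient value too small for FP16

# Without scaling, the FP16 gradient underflows to zero and is lost.
assert np.float16(tiny_grad) == 0.0

# Scale up first (step 3), represent in FP16 (step 4), then convert
# to FP32 and unscale (steps 5-6): the value survives.
scaled = np.float16(tiny_grad * scale)
recovered = np.float32(scaled) / scale
assert recovered != 0.0
assert abs(recovered - tiny_grad) / tiny_grad < 0.01
```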


### BFloat16 {#bfloat16}

To avoid loss scaling, we can use a scheme with less precision but the same dynamic range as FP32 (i.e., keep the same number of \\(E\\) bits and chop off \\(b\\) bits): with the full dynamic range, no scaling is needed.
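A sketch of the idea: bfloat16 is FP32 with the low 16 fraction bits dropped (simple truncation here for illustration; hardware typically rounds):

```python
import struct
import numpy as np

def to_bfloat16(x: float) -> float:
    # Keep FP32's sign + 8 exponent bits + top 7 fraction bits;
    # zero out the low 16 bits. Same range as FP32, less precision.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# 1e-30 is far below FP16's dynamic range but fine for bfloat16.
assert np.float16(1e-30) == 0.0
assert to_bfloat16(1e-30) != 0.0
# Values whose fraction fits in 7 bits survive exactly.
assert to_bfloat16(1.5) == 1.5
```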


## Distributed Data Parallel {#distributed-data-parallel}

- every GPU has a full copy of the model
- each GPU runs the forward and backward pass on a different slice of the batch, then the gradients are combined across GPUs


### all-reduce {#all-reduce}

combine (sum or average) each GPU's copy of the gradients, and give every GPU the combined result


## Deepspeed Zero {#deepspeed-zero}


### reduce-scatter {#reduce-scatter}

reduce the values across GPUs, then scatter the result so each GPU receives only its own part


### all gather {#all-gather}

- every GPU sends its piece to everybody else, so everyone ends up with the full copy
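A toy numpy simulation of these collectives on simulated "GPUs" (sizes illustrative; a real reduce-scatter never materializes the full reduced tensor on one device):

```python
import numpy as np

# Simulated gradients on 4 "GPUs", 8 values each (2 values per shard).
n_gpus = 4
grads = [np.arange(8, dtype=float) + g for g in range(n_gpus)]

# all-reduce: every GPU ends up with the full summed gradient.
full = sum(grads)

# reduce-scatter: GPU g ends up with only shard g of the summed gradient.
shards = [np.split(full, n_gpus)[g] for g in range(n_gpus)]

# all-gather: every GPU sends its shard to everyone, reassembling the whole.
reassembled = np.concatenate(shards)
assert np.array_equal(reassembled, full)
```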


### Stage 1 {#stage-1}

We cache a slice of the optimizer state on each GPU.


### Stage 2 {#stage-2}

- perform a backwards pass
- at each layer, compute the gradient
- look up who in the cluster is responsible for that layer, and reduce the gradient to them so they can update their shard


### Stage 3 {#stage-3}

- divide the model parameters into FSDP units
- shard each unit across multiple GPUs
- run forward pass
- run backward pass
- each GPU updates its own shard using the full gradient from earlier

(unlike stages 1 and 2, you need to stream in your parameters---more communication overhead!)
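A toy numpy sketch of the sharded update in this stage (sizes and learning rate are illustrative):

```python
import numpy as np

n_gpus, lr = 4, 0.1
params = np.ones(8)                        # full parameter vector
shards = np.split(params.copy(), n_gpus)   # each GPU owns one shard
full_grad = np.full(8, 0.5)                # gradient from the backward pass

# Each GPU updates only its own shard with the matching gradient slice...
for g in range(n_gpus):
    shards[g] -= lr * np.split(full_grad, n_gpus)[g]

# ...and an all-gather reassembles the updated parameters
# (streamed back in for the next forward pass).
updated = np.concatenate(shards)
assert np.allclose(updated, params - lr * full_grad)
```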


## Lessons {#lessons}

- always use mixed-precision training
- always use bfloat16


## PEFT {#peft}

See [PEFT]({{< relref "KBhpeft.md" >}})
content/posts/KBhsu_engr76_may072024.md

Tx and Rx map **boolean [signal]({{< relref "KBhsu_engr76_apr162024.md#signal" >}})s** to continuous-time signals.
"how do we map a sequence of bits 0100100.... and map it to a continuous time signal \\(X(t)\\)?"


### [sinc digital encoding]({{< relref "KBhsu_engr76_may092024.md#sinc-digital-encoding" >}}) {#sinc-digital-encoding--kbhsu-engr76-may092024-dot-md}

see [sinc digital encoding]({{< relref "KBhsu_engr76_may092024.md#sinc-digital-encoding" >}})


### on-off keying {#on-off-keying}

in brief: it's like [sinc digital encoding]({{< relref "KBhsu_engr76_may092024.md#sinc-digital-encoding" >}}), but we interpolate using the indicator (rectangular) function:

\begin{equation}
X(t) = \sum\_{m=1}^{\infty} X[m] F \qty( \frac{t-mT}{T})
\end{equation}

where:

\begin{equation}
F(x) = \begin{cases}
1, |x| < \frac{1}{2} \\\\
0, \text{otherwise}
\end{cases}
\end{equation}

The spectrum of this type of signal, as a function of frequency \\(f\\), is:

\begin{equation}
\left| \text{sinc}\qty(\pi f T) \right|
\end{equation}

We consider this signal "approximately bandwidth limited" to roughly \\(\frac{1}{T}\\), which is usually fine. The other concern is that, unlike the [sinc function]({{< relref "KBhsinc_function.md#sinc-function" >}}) case, where you can signal at twice the bandwidth, here you can only signal at a rate equal to the bandwidth, meaning you communicate less information for the same bandwidth.

---

choose some voltage \\(V\\). Assign 1-bit voltage \\(V\\), assign 0-bit voltage \\(0\\), and simply play a voltage for a set amount of time \\(t\\) and move on to the next symbol for encoding.
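A small numpy sketch of this encoding (voltage, period, and bit pattern are illustrative); sampling at the pulse centers \\(t = mT\\) recovers the bits, since neighboring rectangles do not overlap:

```python
import numpy as np

def rect(x):
    # Indicator pulse F: 1 on |x| < 1/2, else 0.
    return (np.abs(x) < 0.5).astype(float)

def ook_signal(bits, V, T, t):
    # X(t) = sum_m X[m] * F((t - mT) / T), with X[m] = V * b_m.
    return sum(V * b * rect((t - (m + 1) * T) / T)
               for m, b in enumerate(bits))

bits, V, T = [1, 0, 1, 1], 1.0, 1e-3

# Sample at the pulse centers t = mT: the bits come back exactly.
t = np.array([(m + 1) * T for m in range(len(bits))])
assert np.allclose(ook_signal(bits, V, T, t), V * np.array(bits))
```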

content/posts/KBhsu_engr76_may092024.md
+++
title = "SU-ENGR76 MAY092024"
author = ["Houjun Liu"]
draft = false
+++

## [digital encoding]({{< relref "KBhsu_engr76_may072024.md#digital-encoding" >}}) {#digital-encoding--kbhsu-engr76-may072024-dot-md}

We allocate different systems in the same environment different frequency bands; by doing this, we can pack information more effectively and prevent interference.

"how do we take a sequence of bits 10100.... and map it to a continuous-time signal \\(X(t)\\) such that the spectrum of this system is limited to \\([0, B]\\)"?


### sinc digital encoding {#sinc-digital-encoding}

IDEA: recall the [sinc sampling theorem]({{< relref "KBhsu_engr76_may022024.md#shannon-s-nyquist-theorem" >}}): sinc interpolation recovers the source points exactly (even if undersampled). As such, we can write:

\begin{equation}
X(t) = \sum\_{m=1}^{\infty} X[m] \text{sinc} \qty( \frac{t-mT}{T})
\end{equation}

for your choice of period \\(T > 0\\), where \\(X[m] = V \cdot b\_{m}\\) and \\(b\_{m}\\) is the binary ([Huffman Coding]({{< relref "KBhhuffman_coding.md" >}})) encoding of your data.

By the [sinc sampling theorem]({{< relref "KBhsu_engr76_may022024.md#shannon-s-nyquist-theorem" >}}), we know that the spectrum of the recovered signal would be at most \\(\frac{1}{2T}\\). This means that, to limit the transmission to \\([0,B]\\), we should transmit our signal by using sinc interpolation using \\(T = \frac{1}{2B}\\).

This makes our communication rate \\(\frac{1}{T}\\) bits per second; plugging in \\(T = 1/2B\\) from above, our rate is \\(2B\\) bits per second (the higher the bandwidth, the higher the speed).
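A small numpy sketch of sinc encoding, checking that sampling at \\(t = mT\\) recovers the bits (values illustrative; `np.sinc(x)` is the normalized \\(\sin(\pi x)/(\pi x)\\), which is 1 at 0 and 0 at every other integer):

```python
import numpy as np

def sinc_signal(bits, V, T, t):
    # X(t) = sum_m X[m] * sinc((t - mT) / T), with X[m] = V * b_m.
    return sum(V * b * np.sinc((t - (m + 1) * T) / T)
               for m, b in enumerate(bits))

bits, V = [1, 0, 1, 1], 1.0
B = 1000.0           # target bandwidth in Hz (illustrative)
T = 1 / (2 * B)      # T = 1/2B limits the spectrum to [0, B]

# At t = kT, every term but the k-th vanishes, so X(kT) = V * b_k.
t = np.array([(m + 1) * T for m in range(len(bits))])
assert np.allclose(sinc_signal(bits, V, T, t), V * np.array(bits))
```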
content/posts/KBhsu_engr76_unit_2_index.md

Communication System Design!
- [Analog Communication]({{< relref "KBhanalog_vs_digital_signal.md#analog-communication" >}})
- [Digital Communication]({{< relref "KBhanalog_vs_digital_signal.md#digital-communication" >}})
- [digital encoding]({{< relref "KBhsu_engr76_may072024.md#digital-encoding" >}})
- [on-off keying]({{< relref "KBhsu_engr76_may072024.md#on-off-keying" >}})
- [sinc digital encoding]({{< relref "KBhsu_engr76_may092024.md#sinc-digital-encoding" >}})
