+++
title = "SU-CS224N MAY092024"
author = ["Houjun Liu"]
draft = false
+++

## Floating Point {#floating-point}

A single-precision float occupies 4 bytes:

\begin{equation}
(-1)^{B} \times 2^{E-127} \times \qty(1 + \sum\_{i=1}^{23} b\_{23-i}2^{-i})
\end{equation}

usually \\(E\\) is 8 bits, and there are 23 bits of \\(b\\).

With more \\(E\\), we will have more range; with more \\(b\\), we will have more precision.

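The formula above can be checked by hand against the bit layout of a float32. A minimal sketch using only the standard library (normal numbers only; subnormals, infinities, and NaN follow different rules):

```python
import struct

def decode_float32(x: float) -> float:
    """Recompute a (normal) float32's value from its sign, exponent, and mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    B = (bits >> 31) & 0x1      # sign bit
    E = (bits >> 23) & 0xFF     # 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF  # 23 fraction bits b_1..b_23
    frac = 1 + sum(((mantissa >> (23 - i)) & 1) * 2 ** (-i) for i in range(1, 24))
    return (-1) ** B * 2 ** (E - 127) * frac

print(decode_float32(-6.25))    # reproduces the input exactly
```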
## Mixed Precision Training {#mixed-precision-training}

1. Keep a master copy of the model in FP32
2. Run the forward pass in FP16
3. Scale the loss to be large enough to not be rounded away
4. Compute gradients in FP16
5. Convert the gradients into FP32
6. Scale the gradients back down
7. Apply them to the FP32 model

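Why step 3 is needed can be seen directly from FP16's limited range. A minimal NumPy sketch (the gradient value and the scale factor `2**14` are illustrative choices):

```python
import numpy as np

grad = 1e-8                                # a small gradient, typical late in training
scale = 2.0 ** 14                          # illustrative loss-scale factor

underflowed = np.float16(grad)             # rounds away to zero in FP16
scaled = np.float16(grad * scale)          # survives in FP16 after scaling
recovered = np.float32(scaled) / scale     # step 6: unscale in FP32

print(underflowed, scaled, recovered)
```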
### BFloat16 {#bfloat16}

To avoid loss scaling altogether, we can use a scheme that has less precision but the same dynamic range as FP32 (i.e., allocate the same 8 bits to \\(E\\) and chop \\(b\\) down): with the full FP32 exponent range, small gradients no longer underflow, so there is no need to scale.

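Since bfloat16 is just the top 16 bits of a float32, a rough conversion can be sketched by truncation (real hardware rounds rather than truncates, but the range/precision trade-off is the same):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Approximate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bfloat16(1e-8))        # tiny values survive: same exponent range as FP32
print(to_bfloat16(1 + 2**-8))   # but precision is coarse: only 7 mantissa bits remain
```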
## Distributed Data Parallel {#distributed-data-parallel}

- every GPU has a copy of the model
- each GPU runs the forward and backward pass on its own slice of the batch, producing its own gradients


### all-reduce {#all-reduce}

reduce (sum or average) each GPU's copy of the gradients down so that every GPU ends up with the same result

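A minimal single-process simulation of all-reduce over per-GPU gradient lists (toy values; real systems use a collective-communication library such as NCCL):

```python
# Each inner list stands in for one GPU's local gradients.
per_gpu_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

def all_reduce_mean(shards):
    """Average elementwise across workers, then give every worker the full result."""
    n = len(shards)
    reduced = [sum(vals) / n for vals in zip(*shards)]
    return [list(reduced) for _ in shards]  # every GPU gets the same copy

print(all_reduce_mean(per_gpu_grads))  # every worker now holds [3.0, 4.0]
```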
## Deepspeed Zero {#deepspeed-zero}


### reduce-scatter {#reduce-scatter}

squish the copies down and send each part to the right GPU


### all gather {#all-gather}

- send everything over to everybody else

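The two collectives can be sketched in the same toy single-process style; note that an all-reduce is exactly a reduce-scatter followed by an all-gather:

```python
def reduce_scatter(shards):
    """Sum elementwise across workers; worker i keeps only chunk i of the result."""
    n = len(shards)
    reduced = [sum(vals) for vals in zip(*shards)]
    chunk = len(reduced) // n
    return [reduced[i * chunk:(i + 1) * chunk] for i in range(n)]

def all_gather(parts):
    """Every worker receives the concatenation of all workers' parts."""
    full = [x for part in parts for x in part]
    return [list(full) for _ in parts]

per_gpu = [[1.0, 2.0], [10.0, 20.0]]  # 2 workers, 2 gradient entries each
parts = reduce_scatter(per_gpu)       # worker 0 keeps the sum's first half, worker 1 the second
print(all_gather(parts))              # both workers end up holding [11.0, 22.0]
```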
### Stage 1 {#stage-1}

We cache only a slice of the optimizer state on each GPU.


### Stage 2 {#stage-2}

- perform a backward pass
- at each layer, compute the gradient
- look up who in the cluster is responsible for that layer


### Stage 3 {#stage-3}

- divide the model parameters into FSDP units
- shard each unit across multiple GPUs
- run the forward pass
- run the backward pass
- each GPU updates its own shard using the full gradient from earlier

(unlike stages 1 and 2, you need to stream in your parameters---more communication overhead!)

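The Stage 1 idea can be sketched as a toy partitioning of optimizer state across ranks, so each GPU stores only its own slice (hypothetical entry names and rank count):

```python
def shard_state(optimizer_state, world_size):
    """Split a flat list of per-parameter optimizer entries round-robin across ranks."""
    return [optimizer_state[rank::world_size] for rank in range(world_size)]

# 8 momentum entries partitioned across 4 GPUs: each rank stores 2 instead of 8.
state = [f"m{i}" for i in range(8)]
shards = shard_state(state, world_size=4)
print(shards)  # [['m0', 'm4'], ['m1', 'm5'], ['m2', 'm6'], ['m3', 'm7']]
```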
## Lessons {#lessons}

- always use mixed precision training
- always use bfloat16

## PEFT {#peft}

See [PEFT]({{< relref "KBhpeft.md" >}})
+++
title = "SU-ENGR76 MAY092024"
author = ["Houjun Liu"]
draft = false
+++

## [digital encoding]({{< relref "KBhsu_engr76_may072024.md#digital-encoding" >}}) {#digital-encoding--kbhsu-engr76-may072024-dot-md}

We allocate different systems in the same environment different frequency bands; by doing this, we can pack information more effectively and prevent interference.

"how do we take a sequence of bits 10100... and map it to a continuous-time signal \\(X(t)\\) such that the spectrum of this signal is limited to \\([0, B]\\)"?

### sinc digital encoding {#sinc-digital-encoding}

IDEA: recall the [sinc sampling theorem]({{< relref "KBhsu_engr76_may022024.md#shannon-s-nyquist-theorem" >}}): sinc interpolation recovers the source points exactly. As such, we can write:

\begin{equation}
X(t) = \sum\_{m=1}^{\infty} X[m] \text{sinc} \qty( \frac{t-mT}{T})
\end{equation}

for your choice of period \\(T > 0\\), where \\(X[m] = V \cdot b\_{m}\\) and \\(b\_{m}\\) is the binary ([Huffman Coding]({{< relref "KBhhuffman_coding.md" >}})) encoding of your data.

By the [sinc sampling theorem]({{< relref "KBhsu_engr76_may022024.md#shannon-s-nyquist-theorem" >}}), we know that the spectrum of the resulting signal is limited to at most \\(\frac{1}{2T}\\). This means that, to limit the transmission to \\([0, B]\\), we should transmit our signal using sinc interpolation with \\(T = \frac{1}{2B}\\).

This makes the rate of our communication \\(\frac{1}{T}\\) bits per second; plugging in \\(T = \frac{1}{2B}\\), our rate of communication is \\(2B\\) bits per second (the higher the bandwidth, the higher the speed).
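The construction can be checked numerically over a finite bit sequence. A small NumPy sketch (`np.sinc(x)` computes \\(\sin(\pi x)/(\pi x)\\), matching the sinc above; `V`, `T`, and the bits are illustrative):

```python
import numpy as np

V, T = 1.0, 0.5                      # illustrative amplitude and period (T = 1/2B)
bits = np.array([1, 0, 1, 0, 0, 1])  # b_1 .. b_6
X_m = V * bits                       # samples X[m] = V * b_m

def X(t):
    """Finite version of X(t) = sum_m X[m] sinc((t - mT)/T)."""
    m = np.arange(1, len(bits) + 1)
    return np.sum(X_m * np.sinc((t - m * T) / T))

# Sampling the waveform back at t = mT recovers the encoded bits
# (up to floating-point error), since sinc(k - m) vanishes for k != m.
print([round(float(X(m * T)), 6) for m in range(1, 7)])
```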