From 9fb068a87dfe8801af85b5648fd0d020dd9492b5 Mon Sep 17 00:00:00 2001
From: Houjun Liu
Date: Tue, 23 Apr 2024 17:08:34 -0700
Subject: [PATCH] kb autocommit

---
 content/posts/KBhsu_cs224n_apr232024.md | 86 +++++++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 content/posts/KBhsu_cs224n_apr232024.md

diff --git a/content/posts/KBhsu_cs224n_apr232024.md b/content/posts/KBhsu_cs224n_apr232024.md
new file mode 100644
index 000000000..b610a6db8
--- /dev/null
+++ b/content/posts/KBhsu_cs224n_apr232024.md
@@ -0,0 +1,86 @@
+++
title = "SU-CS224N APR232024"
author = ["Houjun Liu"]
draft = false
+++

## Evaluating Machine Translation {#evaluating-machine-translation}


### BLEU {#bleu}

Compare the machine translation against (ideally multiple) human reference translations. BLEU uses a geometric mean of [N-Gram]({{< relref "KBhn_grams.md" >}}) precisions---the exact n-gram sizes used aren't particularly special.

The original idea was to score against **multiple reference translations**; in practice people often use only a single reference translation, which still gives a reasonable score **in expectation**.


#### Limitations {#limitations}

- a good translation can get a bad BLEU score because it happens to have low n-gram overlap with the references
- a brevity penalty is applied to too-short system translations (otherwise translating only the easy parts of a sentence would score well)
- you realistically can't reach 100 BLEU, because valid translations naturally vary in wording


## attention {#attention}

Given a set of vector **values** and a vector **query**, attention is a technique to compute a weighted sum of the values, where the weights depend on the query.


### motivation {#motivation}

In the machine translation problem, a naive [LSTM]({{< relref "KBhsu_cs224n_apr182024.md#lstm" >}}) encoder-decoder has to stuff all of the information about the source sentence into a single final encoder vector. Attention helps because it:

- improves performance
- provides a more human-like model of the MT process
- solves the bottleneck problem
- helps with [Vanishing Gradients]({{< relref "KBhsu_cs224n_apr182024.md#vanishing-gradients" >}})
- adds interpretability --- it provides soft phrase-level alignments, so we can see what is being translated


### implementation {#implementation}

**At each step of the decoder, we insert direct connections to the encoder so that the decoder can look at particular parts of the input source sequence.**

Dot each decoder state against every encoder state, softmax the resulting scores, and use them as weights for a sum over the source-sequence encodings.

With encoder states \\(h\_{i}\\) and decoder state \\(s\\), the main scoring variants are:


#### dot product attention {#dot-product-attention}

\begin{equation}
e\_{i} = s^{T} h\_{i}
\end{equation}

**limitation**: the LSTM hidden states are a little too busy---some of their dimensions carry information that isn't useful for attention---and the dot product forces the encoder and decoder states to match dimension for dimension.


#### multiplicative attention {#multiplicative-attention}

"learn a map from encoder vectors to decoder vectors---working out the right place to pay attention by learning it"

\begin{equation}
e\_{i} = s^{T} W h\_{i}
\end{equation}

**limitation**: \\(W\\) has lots of parameters to learn, for no good reason


#### reduced-rank multiplicative attention {#reduced-rank-multiplicative-attention}

\begin{equation}
e\_{i} = s^{T} Q^{T} R h\_{i} = (Q s)^{T} (R h\_{i})
\end{equation}

Essentially: why don't we project \\(s\\) and \\(h\\) down to smaller dimensions before taking the dot product?

This idea also motivates transformers.


#### additive attention {#additive-attention}

\begin{equation}
e\_{i} = v^{T} \text{tanh} \qty(W\_1 h\_{i} + W\_{2} s)
\end{equation}

where \\(v\\) and \\(W\_{j}\\) are learned.
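

## code sketches {#code-sketches}

Some rough sketches of the ideas above. These are my own illustrations rather than anything from lecture: the function names, variable names, and toy dimensions are all made up, and the parameter matrices are random stand-ins for what would actually be learned.


### a toy BLEU computation {#a-toy-bleu-computation}

A minimal single-sentence, single-reference sketch of the BLEU idea (geometric mean of clipped n-gram precisions times a brevity penalty). Real BLEU is corpus-level, supports multiple references, and uses different smoothing; this is only meant to show the geometric-mean-of-n-gram-precisions structure.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams appearing in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=4):
    """Single-sentence, single-reference BLEU-like score: geometric mean of
    clipped n-gram precisions, times a brevity penalty."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # clip each candidate n-gram count by how often it appears in the reference
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # crude smoothing so log() is defined
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty: a too-short candidate is penalized even if its n-grams all match
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

print(toy_bleu("the cat sat on the mat".split(), "the cat is on the mat".split()))
```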
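

### dot-product attention and the weighted sum {#dot-product-attention-and-the-weighted-sum}

A small numpy sketch of the basic recipe from the implementation section: score the decoder state against every encoder state, softmax, then take a weighted sum of the encoder states. In a real seq2seq model the resulting context vector would then be combined with the decoder state before predicting the next word; that part is omitted here.

```python
import numpy as np

def softmax(x):
    x = x - x.max()              # shift for numerical stability
    e = np.exp(x)
    return e / e.sum()

def dot_product_attention(s, H):
    """s: decoder state, shape (d,); H: encoder states, shape (T, d).
    Returns the attention distribution (T,) and the context vector (d,)."""
    scores = H @ s               # e_i = s^T h_i for every source position
    alpha = softmax(scores)      # attention distribution over the source
    context = alpha @ H          # weighted sum of encoder states
    return alpha, context

# toy usage: 5 source positions, hidden size 8 (note the dimensions must match, as discussed above)
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
s = rng.normal(size=(8,))
alpha, context = dot_product_attention(s, H)
print(alpha.round(3), context.shape)
```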
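

### multiplicative vs. reduced-rank scoring {#multiplicative-vs-reduced-rank-scoring}

A quick numerical check of the multiplicative and reduced-rank forms, under the convention assumed above that \\(W\\) factors as \\(Q^{T} R\\) with both projections mapping into a shared \\(k\\)-dimensional space. The matrices here are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d_dec, d_enc, k, T = 6, 8, 3, 5          # decoder dim, encoder dim, reduced rank, source length
s = rng.normal(size=(d_dec,))            # decoder state
H = rng.normal(size=(T, d_enc))          # encoder states, one per source position

# multiplicative attention: a full d_dec x d_enc matrix W of parameters
W = rng.normal(size=(d_dec, d_enc))
e_mult = H @ W.T @ s                     # e_i = s^T W h_i

# reduced-rank version: factor W as Q^T R, with Q (k x d_dec) and R (k x d_enc),
# so each score is just a dot product of two small k-dimensional projections
Q = rng.normal(size=(k, d_dec))
R = rng.normal(size=(k, d_enc))
e_lowrank = (H @ R.T) @ (Q @ s)          # e_i = (Q s)^T (R h_i)

# the same scores written as s^T (Q^T R) h_i, confirming the factorization
e_check = H @ (Q.T @ R).T @ s
print(np.allclose(e_lowrank, e_check))   # True
```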
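

### additive attention scoring {#additive-attention-scoring}

The same style of sketch for the additive form \\(e\_{i} = v^{T} \text{tanh}(W\_1 h\_{i} + W\_{2} s)\\): both states get projected into an attention space whose size is a free hyperparameter, then scored against a learned vector. Again, the parameters here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
d_dec, d_enc, d_attn, T = 6, 8, 4, 5     # decoder dim, encoder dim, attention dim, source length
s = rng.normal(size=(d_dec,))
H = rng.normal(size=(T, d_enc))

# learned parameters (random stand-ins here)
W1 = rng.normal(size=(d_attn, d_enc))
W2 = rng.normal(size=(d_attn, d_dec))
v = rng.normal(size=(d_attn,))

# e_i = v^T tanh(W1 h_i + W2 s), computed for all source positions at once
scores = np.tanh(H @ W1.T + W2 @ s) @ v  # shape (T,)
print(scores.round(3))
```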