kb autocommit
Jemoka committed Feb 8, 2024
1 parent 8595bd2 commit 3dc4973
Showing 3 changed files with 114 additions and 0 deletions.
47 changes: 47 additions & 0 deletions content/posts/KBhlm_alignment.md
@@ -0,0 +1,47 @@
+++
title = "LM Alignment"
author = ["Houjun Liu"]
draft = false
+++

Been Kim

the alignment problem involves "aligning" the representation spaces between a machine's model of the world and that of the human. alternative perspective: **teach <span class="underline">humans</span> new concepts to understand/communicate better**


## feature attribution doesn't work {#feature-attribution-doesn-t-work}

We take that perspective because many existing interpretability techniques (feature permutation, etc.) don't work well---feature-attribution-style analyses ("Impossibility Theorems for Feature Attribution", Been Kim et al.) actually have no correlation with predictive results.
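
To make "feature permutation" concrete, below is a minimal sketch of permutation-based attribution on a toy model; the model, data, and feature set are illustrative assumptions, not the setup from the paper.

```python
# Minimal sketch of permutation-style feature attribution (illustrative only;
# the toy model and data are assumptions, not the paper's experimental setup).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # four toy features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # only features 0 and 1 matter

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permute each feature in turn and measure the drop in held-out accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx, importance in enumerate(result.importances_mean):
    print(f"feature {idx}: importance {importance:.3f}")
```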


## feature information storage in models is unrelated to model-edit success {#feature-information-store-in-models-is-unrelated-to-model-edit-success}

i.e.: the knowledge-storage location identified using the ROME technique, though it gives you a sense of where information is stored, doesn't correlate with the success of model editing.


## can we use ML to teach people? {#can-we-use-ml-to-teach-people}

for instance, we can teach grandmasters new chess concepts using AlphaZero, and see if we can make a quantitative impact.


### concept {#concept}

A [concept](#concept) is a unit of knowledge that's **useful for a task**. Two properties:

1. **minimality**: irrelevant information has been removed
2. **transferability**: it can be taught atomically


#### filtering for good [concept](#concept)s {#filtering-for-good-concept--orgcd99e8f--s}

Represent a concept as a sparse vector in the latent space. We can check whether a concept is transferable by teaching it to a student agent and measuring the resulting KL divergence.
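
A minimal sketch of how such a check might look, treating a concept as a sparse direction in the latent space and scoring transferability by the KL divergence between a teacher's and a taught student's action distributions; the dimensions, toy policies, and "teaching" rule are all illustrative assumptions.

```python
# Hypothetical sketch: score a concept's transferability via KL divergence between
# a teacher's and a student's action distributions. The dimensions, policies, and
# the "teaching" rule are illustrative assumptions, not the actual method.
import numpy as np
from scipy.special import softmax
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

latent_dim, n_actions = 64, 8
rng = np.random.default_rng(0)

# A concept as a sparse vector in the latent space (minimality: few nonzero entries).
concept = np.zeros(latent_dim)
concept[rng.choice(latent_dim, size=4, replace=False)] = 1.0

def policy(latent, head):
    """Toy linear policy head mapping a latent state to action probabilities."""
    return softmax(head @ latent)

teacher_head = rng.normal(size=(n_actions, latent_dim))
student_head = np.zeros((n_actions, latent_dim))  # naive, untrained student

# "Teach" the concept: copy the teacher's weights only along the concept dimensions.
student_head += teacher_head * concept

# Transferability score: average KL between teacher and student over sampled states.
states = rng.normal(size=(100, latent_dim))
mean_kl = np.mean([entropy(policy(s, teacher_head), policy(s, student_head)) for s in states])
print(f"mean KL(teacher || student) after teaching the concept: {mean_kl:.3f}")
```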


#### demonstration learning {#demonstration-learning}

instead of doing demonstration learning on machines, do it on **humans**. Filter for the [concept](#concept)s that are well operationalized.


## alpha-zero {#alpha-zero}

recap: using a dense network to embed the board state, and then [MCTS]({{< relref "KBhmonte_carlo_tree_search.md" >}}).
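
A very compressed sketch of that recipe, under heavy assumptions: a stub policy/value network guides a PUCT-style search over a toy game, and the move is picked by visit count. None of the constants or structures below are AlphaZero's actual implementation.

```python
# Highly simplified sketch of the AlphaZero recipe: a (stub) policy/value network
# guides a PUCT-style Monte Carlo tree search over a toy game. All constants and
# structures here are assumptions for illustration, not the real system.
import math
import random

def network(state):
    """Stub for the dense network: returns (policy over 2 moves, value in [-1, 1])."""
    random.seed(hash(state))
    p = random.random()
    return [p, 1 - p], random.uniform(-1, 1)

class Node:
    def __init__(self, state, prior):
        self.state, self.prior = state, prior
        self.children, self.visits, self.value_sum = {}, 0, 0.0

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts(root_state, n_simulations=50, c_puct=1.5):
    root = Node(root_state, prior=1.0)
    for _ in range(n_simulations):
        node, path = root, [root]
        # Selection: descend by the PUCT score until reaching a leaf.
        while node.children:
            node = max(
                node.children.values(),
                key=lambda child: child.value()
                + c_puct * child.prior * math.sqrt(node.visits + 1) / (1 + child.visits),
            )
            path.append(node)
        # Expansion + evaluation: the network replaces a random rollout.
        policy, value = network(node.state)
        for move, prior in enumerate(policy):
            node.children[move] = Node((node.state, move), prior)
        # Backup the network's value along the path (two-player sign flips omitted).
        for visited in path:
            visited.visits += 1
            visited.value_sum += value
    # Act by visit count, as AlphaZero does at the root.
    return max(root.children, key=lambda move: root.children[move].visits)

print("chosen move:", mcts(root_state=0))
```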
66 changes: 66 additions & 0 deletions content/posts/KBhpset_5.md
@@ -533,3 +533,69 @@ a(1-y)x - a(1-y) + a((x-1)y-(x-1)) = 0
\end{equation}

as desired.


### Problem 14.6 {#problem-14-dot-6}


#### Part a {#part-a}

We have:

\begin{equation}
s = 1-i
\end{equation}

and that:

\begin{equation}
i'(t) = \tau s(t) i(t) - \gamma i(t)
\end{equation}

Substituting the former into the latter:

\begin{equation}
i'(t) = \tau (1-i(t))i(t) - \gamma i(t)
\end{equation}

which means, distributing:

\begin{equation}
i'(t) = \tau i(t) -\tau i^{2}(t) - \gamma i(t)
\end{equation}

which finally means that:

\begin{equation}
i'(t) = (\tau -\gamma )i(t) - \tau i^{2}(t)
\end{equation}

Now, factoring \\(i(t)\\) out, we obtain:

\begin{equation}
i'(t) = i(t) ((\tau-\gamma) - \tau i(t))
\end{equation}

which yields two stationary solutions: either \\(i(t) = 0\\), or \\((\tau - \gamma) - \tau i(t) = 0\\), meaning \\(i(t) = 1 - \frac{\gamma}{\tau}\\).
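
As an optional sanity check (not part of the assigned solution), the two stationary points can be recovered symbolically; the symbol names below are mine.

```python
# Optional sanity check of the stationary points with sympy (symbol names are mine).
import sympy as sp

i = sp.symbols("i")
tau, gamma = sp.symbols("tau gamma", positive=True)
di_dt = (tau - gamma) * i - tau * i**2

# The stationary points solve i'(t) = 0.
print(sp.solve(sp.Eq(di_dt, 0), i))  # [0, (tau - gamma)/tau], i.e. 0 and 1 - gamma/tau
```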


#### Part b {#part-b}

If \\(R\_0 = \frac{\tau}{\gamma} < 1\\), then we have that \\(\tau < \gamma\\).

Meaning, \\(\frac{\gamma}{\tau} > 1\\), which gives \\(1 - \frac{\gamma}{\tau} < 0\\); so there is no positive stationary value.

As for \\(i(t) = 0\\), differentiating the right-hand side \\(f(i) = (\tau - \gamma) i - \tau i^{2}\\) with respect to \\(i\\) yields \\(f'(i) = (\tau - \gamma) - 2\tau i\\). The latter term is \\(0\\) when \\(i = 0\\), and the former term \\(\tau - \gamma\\) is negative. Meaning, the stationary point \\(i(t) = 0\\) is stable.


#### Part c {#part-c}

A similar argument as in part (b) applies, but in the opposite direction.

If \\(R\_0 = \frac{\tau}{\gamma} > 1\\), then we have that \\(\tau > \gamma\\).

For \\(i(t) = 1 - \frac{\gamma}{\tau}\\), the same derivative \\(f'(i) = (\tau - \gamma) - 2\tau i\\) simplifies to \\((\tau - \gamma) - 2\tau + 2\gamma = \gamma - \tau < 0\\), so this stationary point is stable.

As for \\(i(t) = 0\\), the latter term of \\(f'(i)\\) is still \\(0\\), while the former term \\(\tau - \gamma\\) is now positive. Meaning, the stationary point \\(i(t) = 0\\) is unstable.
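
A quick numerical illustration of parts (b) and (c); the solver, step size, and parameter values are arbitrary choices, not part of the assigned solution.

```python
# Numerical illustration of the stability claims (parameter values are arbitrary).
def simulate(tau, gamma, i0=0.01, dt=0.01, steps=20_000):
    """Forward-Euler integration of i'(t) = (tau - gamma) i - tau i^2."""
    i = i0
    for _ in range(steps):
        i += dt * ((tau - gamma) * i - tau * i**2)
    return i

# R0 = tau/gamma < 1: the infection dies out, i(t) -> 0.
print(simulate(tau=0.5, gamma=1.0))

# R0 = tau/gamma > 1: i(t) approaches 1 - gamma/tau (here 0.5).
print(simulate(tau=1.0, gamma=0.5))
```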
1 change: 1 addition & 0 deletions content/posts/KBhstanford_courses_index.md
@@ -44,6 +44,7 @@ Here is a list of random indices which may end up being helpful!
| <span class="timestamp-wrapper"><span class="timestamp">&lt;2023-11-16 Thu&gt;</span></span> | Dissociating Language and Thought | Anna Ivanova | [Dissociating Language and Thought]({{< relref "KBhdissociating_language_and_thought.md" >}}) |
| <span class="timestamp-wrapper"><span class="timestamp">&lt;2024-01-11 Thu&gt;</span></span> | Language Agents | Karthik Narasimhan | [Language Agents with Karthik]({{< relref "KBhlanguage_agents.md" >}}) |
| <span class="timestamp-wrapper"><span class="timestamp">&lt;2024-02-01 Thu&gt;</span></span> | | | [Pretraining Data]({{< relref "KBhpretraining_data.md" >}}) |
| <span class="timestamp-wrapper"><span class="timestamp">&lt;2024-02-08 Thu&gt;</span></span> | value alignment | Been Kim | [LM Alignment]({{< relref "KBhlm_alignment.md" >}}) |


## Contacts {#contacts}
