diff --git a/content/posts/KBhltrdp.md b/content/posts/KBhltrdp.md
new file mode 100644
index 000000000..d0286d83d
--- /dev/null
+++ b/content/posts/KBhltrdp.md
@@ -0,0 +1,32 @@
++++
+title = "LRTDP"
+author = ["Houjun Liu"]
+draft = false
++++
+
+## Real-Time Dynamic Programming {#real-time-dynamic-programming}
+
+[RTDP](#real-time-dynamic-programming) is an asynchronous value iteration scheme. Each [RTDP](#real-time-dynamic-programming) trial repeatedly applies the backup:
+
+\begin{equation}
+V(s) = \min\_{a \in A(s)} c(a,s) + \sum\_{s' \in S}^{} P\_{a}(s'|s)V(s')
+\end{equation}
+
+The algorithm halts when the residuals are sufficiently small.
+
+
+## Labeled [RTDP](#real-time-dynamic-programming) {#labeled-rtdp--org9a279ff}
+
+We want to label converged states so we don't need to keep investigating them.
+
+a state is **solved** if:
+
+- the state's residual is less than \\(\epsilon\\)
+- all states \\(s'\\) reachable from this state have residuals lower than \\(\epsilon\\)
+
+
+### Labelled RTDP {#labelled-rtdp}
+
+{{< figure src="/ox-hugo/2024-02-13_10-11-32_screenshot.png" >}}
+
+We stochastically simulate one step forward at a time, performing a Bellman backup at each visited state, and stop the trial once we reach a state already marked as "solved".
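+
+A minimal, illustrative sketch of one labeled RTDP trial in Python. It assumes a tabular value function stored in a dict; `actions`, `cost`, `transition`, and `goal` are placeholder callables rather than any specific library's API, and the labeling step is a simplification of LRTDP's full CheckSolved procedure.
+
+```python
+import random
+
+def bellman_backup(V, s, actions, cost, transition):
+    """Q(s,a) = c(a,s) + sum_s' P_a(s'|s) V(s'); return (min_a Q, argmin_a Q)."""
+    def q(a):
+        # unvisited successors default to an optimistic value of 0
+        return cost(a, s) + sum(p * V.get(sp, 0.0) for sp, p in transition(s, a))
+    best = min(actions(s), key=q)
+    return q(best), best
+
+def lrtdp_trial(V, solved, s0, actions, cost, transition, goal, eps=1e-3):
+    """One trial: greedily simulate forward, backing up V along the way,
+    and stop once a goal state or an already-"solved" state is reached."""
+    visited, s = [], s0
+    while s not in solved and not goal(s):
+        visited.append(s)
+        V[s], a = bellman_backup(V, s, actions, cost, transition)
+        # stochastically simulate one step forward under the greedy action
+        succs, probs = zip(*transition(s, a))
+        s = random.choices(succs, weights=probs)[0]
+    # simplified labeling: mark trailing visited states whose residual is below eps
+    for s in reversed(visited):
+        q, _ = bellman_backup(V, s, actions, cost, transition)
+        if abs(q - V[s]) < eps:
+            solved.add(s)
+        else:
+            break
+```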
diff --git a/content/posts/KBhmaxq.md b/content/posts/KBhmaxq.md
new file mode 100644
index 000000000..90db24636
--- /dev/null
+++ b/content/posts/KBhmaxq.md
@@ -0,0 +1,55 @@
++++
+title = "MaxQ"
+author = ["Houjun Liu"]
+draft = false
++++
+
+## Two Abstractions {#two-abstractions}
+
+- "temporal abstractions": making decisions without explicit consideration of time / abstracting away time ([MDP]({{< relref "KBhmarkov_decision_process.md" >}}))
+- "state abstractions": making decisions about groups of states at once
+
+
+## Graph {#graph}
+
+[MaxQ]({{< relref "KBhmaxq.md" >}}) formulates a policy as a graph, which represents a set of \\(n\\) policies
+
+{{< figure src="/ox-hugo/2024-02-13_09-50-20_screenshot.png" >}}
+
+
+### Max Node {#max-node}
+
+This is a "policy node", connected to a series of \\(Q\\) nodes from which it takes the max and propagates it down. If we are at a leaf max-node, the actual action is taken and control is passed back to the top of the graph
+
+
+### Q Node {#q-node}
+
+each node computes \\(Q(s,a)\\), the value of taking that action
+
+
+## Hierarchical Value Function {#hierachical-value-function}
+
+{{< figure src="/ox-hugo/2024-02-13_09-51-27_screenshot.png" >}}
+
+\begin{equation}
+Q\_{i}(s,a) = V\_{a}(s) + C\_{i}(s,a)
+\end{equation}
+
+the value function of the root node is the value obtained by recursing this decomposition over all nodes in the graph
+
+where:
+
+\begin{equation}
+C\_{i}(s,a) = \sum\_{s'}^{} P(s'|s,a) V(s')
+\end{equation}
+
+
+## Learning MaxQ {#learning-maxq}
+
+1. maintain two tables \\(C\_{i}(s,a)\\) and \\(\tilde{C}\_{i}(s,a)\\) (the latter is a completion function for a special internal reward \\(\tilde{R}\\) that discourages ending a subtask in an undesirable state)
+2. choose \\(a\\) according to exploration strategy
+3. execute \\(a\\), observe \\(s'\\), and compute \\(R(s'|s,a)\\)
+
+Then, update:
+
+{{< figure src="/ox-hugo/2024-02-13_09-54-38_screenshot.png" >}}
diff --git a/content/posts/KBhoption.md b/content/posts/KBhoption.md
new file mode 100644
index 000000000..d7ab5f013
--- /dev/null
+++ b/content/posts/KBhoption.md
@@ -0,0 +1,60 @@
++++
+title = "Option (MDP)"
+author = ["Houjun Liu"]
+draft = false
++++
+
+an [Option (MDP)]({{< relref "KBhoption.md" >}}) represents a high-level collection of actions.
+
+Big Picture: abstract away your big policy into \\(n\\) small policies, and value-iterate over the expected values of those small policies.
+
+
+## Markov Option {#markov-option}
+
+A [Markov Option](#markov-option) is given by a triple \\((I, \pi, \beta)\\)
+
+- \\(I \subset S\\), the states from which the option may be started
+- \\(\pi : S \times A \to [0,1]\\), the policy followed while the option runs
+- \\(\beta(s)\\), the probability of the option terminating at state \\(s\\)
+
+
+### one-step options {#one-step-options}
+
+You can define one-step options, which terminate immediately after one action of the underlying MDP:
+
+- \\(I = \\{s:a \in A\_{s}\\}\\)
+- \\(\pi(s,a) = 1\\)
+- \\(\beta(s) = 1\\)
+
+
+### option value function {#option-value-fuction}
+
+\begin{equation}
+Q^{\mu}(s,o) = \mathbb{E}\qty[r\_{t} + \gamma r\_{t+1} + \dots]
+\end{equation}
+
+where \\(\mu\\) is some option selection process
+
+
+### semi-markov decision process {#semi-markov-decision-process}
+
+a [semi-markov decision process](#semi-markov-decision-process) is a system over a bunch of [option]({{< relref "KBhoption.md" >}})s, with time being a factor in option transitions, but the underlying policies still being [MDP]({{< relref "KBhmarkov_decision_process.md" >}})s.
+
+\begin{equation}
+T(s', \tau | s,o)
+\end{equation}
+
+where \\(\tau\\) is the time elapsed.
+
+because option-level termination induces jumps between large-scale states, one backup can propagate to many states.
+
+
+### intra option q-learning {#intra-option-q-learning}
+
+\begin{equation}
+Q\_{k+1}(s\_{t},o) = (1-\alpha\_{k})Q\_{k}(s\_{t}, o) + \alpha\_{k} \qty(r\_{t+1} + \gamma U\_{k}(s\_{t+1}, o))
+\end{equation}
+
+where:
+
+\begin{equation}
+U\_{k}(s,o) = (1-\beta(s))Q\_{k}(s,o) + \beta(s) \max\_{o' \in O} Q\_{k}(s,o')
+\end{equation}
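+
+A minimal sketch of this update in Python, assuming a tabular \\(Q\\) stored as a dict keyed by (state, option) pairs; `beta` and `options` are placeholder names rather than a particular library's API.
+
+```python
+def intra_option_backup(Q, s, o, r, s_next, beta, options, alpha=0.1, gamma=0.95):
+    """Q_{k+1}(s,o) = (1 - alpha) Q_k(s,o) + alpha * (r + gamma * U_k(s',o)), where
+    U_k(s',o) = (1 - beta(s')) Q_k(s',o) + beta(s') * max_{o'} Q_k(s',o')."""
+    # U blends continuing the current option o with terminating (w.p. beta(s'))
+    # and re-selecting the best option at s'
+    u = (1 - beta(s_next)) * Q.get((s_next, o), 0.0) \
+        + beta(s_next) * max(Q.get((s_next, op), 0.0) for op in options)
+    Q[(s, o)] = (1 - alpha) * Q.get((s, o), 0.0) + alpha * (r + gamma * u)
+```
+
+In full intra-option learning this backup is applied to every option whose policy is consistent with the action actually executed, which is what lets a single transition update many options at once.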
diff --git a/content/posts/KBhpomdps_index.md b/content/posts/KBhpomdps_index.md
index 1d24aa7ee..054285916 100644
--- a/content/posts/KBhpomdps_index.md
+++ b/content/posts/KBhpomdps_index.md
@@ -17,8 +17,12 @@ a class about [POMDP]({{< relref "KBhpartially_observable_markov_decision_proces
 | Moar Online Methods | [IS-DESPOT]({{< relref "KBhis_despot.md" >}}), [POMCPOW]({{< relref "KBhpomcpow.md" >}}), [AdaOPS]({{< relref "KBhadaops.md" >}}) |
 | POMDPish | [MOMDP]({{< relref "KBhmomdp.md" >}}), [POMDP-lite]({{< relref "KBhpomdp_lite.md" >}}), [rho-POMDPs]({{< relref "KBhrho_pomdps.md" >}}) |
 | Memoryless + Policy Search | [Sarsa (Lambda)]({{< relref "KBhsarsa_lambda.md" >}}), [JSJ]({{< relref "KBhjsj.md" >}}), [Pegasus]({{< relref "KBhpegasus.md" >}}) |
+| Hierarchical Decomposition | [Option]({{< relref "KBhoption.md" >}}), [MaxQ]({{< relref "KBhmaxq.md" >}}), [LRTDP]({{< relref "KBhltrdp.md" >}}) |


 ## Other Content {#other-content}

 [Research Tips]({{< relref "KBhresearch_tips.md" >}})
+
+- [STRIPS-style planning]({{< relref "KBhstrips_style_planning.md" >}})
+- [Temporal Abstraction]({{< relref "KBhtemperal_abstraction.md" >}})
diff --git a/content/posts/KBhresearch_tips.md b/content/posts/KBhresearch_tips.md
index b89c3a834..f724584c4 100644
--- a/content/posts/KBhresearch_tips.md
+++ b/content/posts/KBhresearch_tips.md
@@ -114,3 +114,35 @@ Overview **AFTER** the motivation.
 - biblatex: bibtex with postprocessing the .tex
 - sislstrings.bib: mykel's conference list for .bib
 - JabRef
+
+
+## PhD Thesis {#phd-thesis}
+
+
+
+- "Cool Theorems and New Methods"
+- "Cool Methods and Predictions"
+- "Beautiful Demonstrations"
+- "Cool engineering ideas"
+
+
+## "How to Write a Paper" {#how-to-write-a-paper}
+
+
+
+1. what's the problem?
+2. why is it interesting and important?
+3. why is it hard?
+4. why hasn't it been solved before / what's wrong with previous solutions?
+5. what are the key components of my approach and results?
+
+You want the intro to end near the end of the first or second page. **Always lead with the problem.**
+
+
+## Mathematical Writing {#mathematical-writing}
+
+"CS209 mathematical writing"
+
+Don't start a sentence with a symbol.
+
+Don't use "utilize".
diff --git a/content/posts/KBhstrips_style_planning.md b/content/posts/KBhstrips_style_planning.md
new file mode 100644
index 000000000..7ea3b38fe
--- /dev/null
+++ b/content/posts/KBhstrips_style_planning.md
@@ -0,0 +1,23 @@
++++
+title = "STRIPS-style planning"
+author = ["Houjun Liu"]
+draft = false
++++
+
+This is a precursor to [MDP]({{< relref "KBhmarkov_decision_process.md" >}}) planning:
+
+- states: a conjunction of "fluents" (propositions describing the state)
+- actions: transitions between fluents
+- transitions: delete the older, changed fluents and add the new ones
+
+
+## Planning Domain Definition Language {#planning-domain-definition-language}
+
+A LISP-like language used to specify a [STRIPS-style planning]({{< relref "KBhstrips_style_planning.md" >}}) problem.
+
+
+## Hierarchical Task Network {#hierarchical-task-network}
+
+1. Decompose classical planning into a hierarchy of actions
+2. Leverage high-level actions to generate a coarse plan
+3. Refine to smaller problems
diff --git a/content/posts/KBhtemperal_abstraction.md b/content/posts/KBhtemperal_abstraction.md
new file mode 100644
index 000000000..0ea4953a3
--- /dev/null
+++ b/content/posts/KBhtemperal_abstraction.md
@@ -0,0 +1,5 @@
++++
+title = "Temporal Abstraction"
+author = ["Houjun Liu"]
+draft = false
++++
diff --git a/static/ox-hugo/2024-02-13_09-50-20_screenshot.png b/static/ox-hugo/2024-02-13_09-50-20_screenshot.png
new file mode 100644
index 000000000..90db6c91c
Binary files /dev/null and b/static/ox-hugo/2024-02-13_09-50-20_screenshot.png differ
diff --git a/static/ox-hugo/2024-02-13_09-51-27_screenshot.png b/static/ox-hugo/2024-02-13_09-51-27_screenshot.png
new file mode 100644
index 000000000..d0e86c32b
Binary files /dev/null and b/static/ox-hugo/2024-02-13_09-51-27_screenshot.png differ
diff --git a/static/ox-hugo/2024-02-13_09-54-38_screenshot.png b/static/ox-hugo/2024-02-13_09-54-38_screenshot.png
new file mode 100644
index 000000000..926700648
Binary files /dev/null and b/static/ox-hugo/2024-02-13_09-54-38_screenshot.png differ
diff --git a/static/ox-hugo/2024-02-13_10-11-32_screenshot.png b/static/ox-hugo/2024-02-13_10-11-32_screenshot.png
new file mode 100644
index 000000000..bd9335195
Binary files /dev/null and b/static/ox-hugo/2024-02-13_10-11-32_screenshot.png differ