+++
title = "LRTDP"
author = ["Houjun Liu"]
draft = false
+++

## Real-Time Dynamic Programming {#real-time-dynamic-programming}

[RTDP](#real-time-dynamic-programming) is an asynchronous value iteration scheme. Each [RTDP](#real-time-dynamic-programming) trial performs backups of the form:

\begin{equation}
V(s) = \min\_{a \in A(s)} c(a,s) + \sum\_{s' \in S}^{} P\_{a}(s'|s)V(s')
\end{equation}

The algorithm halts when the residuals are sufficiently small.
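To make the backup concrete, here is a minimal Python sketch of one RTDP trial over a tabular shortest-path MDP. The dictionary-based `A`, `cost`, and `P` structures and all names are illustrative assumptions, not the note's own code.

```python
import random

def rtdp_trial(s0, goal, V, A, cost, P, max_depth=100):
    """One RTDP trial: greedy action selection with Bellman backups
    along the visited trajectory. A toy sketch, not the full algorithm."""
    s = s0
    for _ in range(max_depth):
        if s in goal:
            break
        # Bellman backup: V(s) = min_a [ c(a,s) + sum_s' P_a(s'|s) V(s') ]
        def q(a):
            return cost[(a, s)] + sum(p * V[s2] for s2, p in P[(a, s)].items())
        best = min(A[s], key=q)
        V[s] = q(best)
        # stochastically sample the next state under the greedy action
        states, probs = zip(*P[(best, s)].items())
        s = random.choices(states, weights=probs)[0]
    return V
```

Because backups happen only along simulated trajectories, value information concentrates on states the greedy policy actually visits.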


## Labeled [RTDP](#real-time-dynamic-programming) {#labeled-rtdp--org9a279ff}

We want to label converged states so we don't need to keep investigating them.

A state is **solved** if:

- the state's residual is less than \\(\epsilon\\)
- all states \\(s'\\) reachable from this state have residual lower than \\(\epsilon\\)


### Labelled RTDP {#labelled-rtdp}

{{< figure src="/ox-hugo/2024-02-13_10-11-32_screenshot.png" >}}

We stochastically simulate one step forward at a time; while we keep meeting states we haven't marked as "solved", we continue simulating forward and value iterating.
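The labeling test above can be sketched as follows. This is a simplified version of the solved-check (it does not relabel descendants, unlike the full CheckSolved routine); the data structures are hypothetical.

```python
def residual(s, V, A, cost, P):
    """Bellman residual at s: gap between V(s) and its one-step backup."""
    q = lambda a: cost[(a, s)] + sum(p * V[s2] for s2, p in P[(a, s)].items())
    return abs(V[s] - min(q(a) for a in A[s]))

def check_solved(s, solved, V, A, cost, P, goal, eps=1e-3):
    """Label s solved if its residual, and the residuals of all states
    reachable under the greedy policy, are below eps (simplified)."""
    if s in goal or s in solved:
        return True
    open_, seen = [s], {s}
    while open_:
        u = open_.pop()
        if u in goal or u in solved:
            continue
        if residual(u, V, A, cost, P) > eps:
            return False
        greedy = min(A[u], key=lambda a: cost[(a, u)] +
                     sum(p * V[v] for v, p in P[(a, u)].items()))
        for v in P[(greedy, u)]:
            if v not in seen:
                seen.add(v)
                open_.append(v)
    solved.add(s)
    return True
```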
+++
title = "MaxQ"
author = ["Houjun Liu"]
draft = false
+++

## Two Abstractions {#two-abstractions}

- "temporal abstractions": making decisions without considering every individual timestep / abstracting away time ([MDP]({{< relref "KBhmarkov_decision_process.md" >}}))
- "state abstractions": making decisions about groups of states at once

## Graph {#graph}

[MaxQ]({{< relref "KBhmaxq.md" >}}) formulates a policy as a graph, which represents a set of \\(n\\) policies

{{< figure src="/ox-hugo/2024-02-13_09-50-20_screenshot.png" >}}

### Max Node {#max-node}

This is a "policy node", connected to a series of \\(Q\\) nodes from which it takes the max and propagates down. If we are at a leaf max-node, the actual action is taken and control is passed back to the top of the graph.

### Q Node {#q-node}

Each node computes \\(Q(s,a)\\), the value of taking that action.

## Hierarchical Value Function {#hierachical-value-function}

{{< figure src="/ox-hugo/2024-02-13_09-51-27_screenshot.png" >}}

\begin{equation}
Q(s,a) = V\_{a}(s) + C\_{i}(s,a)
\end{equation}

The value function of the root node is the value obtained over all nodes in the graph,

where:

\begin{equation}
C\_{i}(s,a) = \sum\_{s'}^{} P(s'|s,a) V(s')
\end{equation}

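The decomposition \\(Q(s,a) = V\_{a}(s) + C\_{i}(s,a)\\) can be evaluated by recursing down the MaxQ graph. A toy sketch under assumed data structures (`pi`, `V_prim`, and `C` are hypothetical tables, not from the note):

```python
def maxq_q(node, s, pi, V_prim, C):
    """Q_i(s, a) = V_a(s) + C_i(s, a), with a = pi[node] the child
    this max node's policy picks. All structures are illustrative."""
    a = pi[node]
    return maxq_v(a, s, pi, V_prim, C) + C[(node, s, a)]

def maxq_v(node, s, pi, V_prim, C):
    """V of a node: its expected reward if primitive, else its Q value."""
    if (node, s) in V_prim:
        return V_prim[(node, s)]
    return maxq_q(node, s, pi, V_prim, C)
```

So the value of the root decomposes into the value of the primitive action eventually taken plus the completion values accumulated at every composite node along the way.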
## Learning MaxQ {#learning-maxq}

1. maintain two tables \\(C\_{i}\\) and \\(\tilde{C}\_{i}(s,a)\\) (the latter is a special completion function which corresponds to a special reward \\(\tilde{R}\\) that prevents the model from taking egregious terminating actions)
2. choose \\(a\\) according to an exploration strategy
3. execute \\(a\\), observe \\(s'\\), and compute \\(R(s'|s,a)\\)

Then, update:

{{< figure src="/ox-hugo/2024-02-13_09-54-38_screenshot.png" >}}
+++
title = "Option (MDP)"
author = ["Houjun Liu"]
draft = false
+++

an [Option (MDP)]({{< relref "KBhoption.md" >}}) represents a high-level collection of actions. Big picture: abstract away your big policy into \\(n\\) small policies, and value-iterate over the expected values of the big policies.

## Markov Option {#markov-option}

A [Markov Option](#markov-option) is given by a triple \\((I, \pi, \beta)\\)

- \\(I \subset S\\), the states from which the option may be started
- \\(\pi: S \times A \to [0,1]\\), the policy followed during that option
- \\(\beta(s)\\), the probability of the option terminating at state \\(s\\)

### one-step options {#one-step-options}

You can develop one-step options, which terminate immediately after one action:

- \\(I = \\{s:a \in A\_{s}\\}\\)
- \\(\pi(s,a) = 1\\)
- \\(\beta(s) = 1\\)

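The three conditions above can be written down directly. A minimal sketch, assuming actions are stored per state in a dict `A` (an illustrative structure, not from the note):

```python
def one_step_option(a, A):
    """Build the one-step option for a primitive action a: initiate
    wherever a is applicable, always pick a, terminate immediately."""
    return {
        "I": {s for s, acts in A.items() if a in acts},  # I = {s : a in A_s}
        "pi": lambda s, act: 1.0 if act == a else 0.0,   # pi(s, a) = 1
        "beta": lambda s: 1.0,                           # beta(s) = 1
    }
```

One-step options make primitive actions a special case of options, so a single framework covers both.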
### option value function {#option-value-fuction}

\begin{equation}
Q^{\mu}(s,o) = \mathbb{E}\qty[r\_{t} + \gamma r\_{t+1} + \dots]
\end{equation}

where \\(\mu\\) is some option selection process

### semi-markov decision process {#semi-markov-decision-process}

a [semi-markov decision process](#semi-markov-decision-process) is a system over a set of [option]({{< relref "KBhoptions.md" >}})s, with time being a factor in option transitions, but the underlying policies still being [MDP]({{< relref "KBhmarkov_decision_process.md" >}})s:

\begin{equation}
T(s', \tau | s,o)
\end{equation}

where \\(\tau\\) is the time elapsed.

Because option-level termination induces jumps between large-scale states, one backup can propagate to many states.

### intra option q-learning {#intra-option-q-learning}

\begin{equation}
Q\_{k+1} (s\_{t},o) = (1-\alpha\_{k})Q\_{k}(s\_{t}, o) + \alpha\_{k} \qty(r\_{t+1} + \gamma U\_{k}(s\_{t+1}, o))
\end{equation}

where:

\begin{equation}
U\_{k}(s,o) = (1-\beta(s))Q\_{k}(s,o) + \beta(s) \max\_{o' \in O} Q\_{k}(s,o')
\end{equation}
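A minimal sketch of the update above, with `Q` a dict keyed on (state, option) pairs and `beta` a termination-probability function; the names and structures are assumptions for illustration:

```python
def intra_option_update(Q, beta, s, o, r, s_next, options,
                        alpha=0.1, gamma=0.95):
    """One intra-option Q-learning backup:
    Q(s,o) <- (1-alpha) Q(s,o) + alpha (r + gamma U(s',o)), where
    U(s',o) = (1-beta(s')) Q(s',o) + beta(s') max_{o'} Q(s',o')."""
    cont = (1 - beta(s_next)) * Q[(s_next, o)]           # option continues
    stop = beta(s_next) * max(Q[(s_next, o2)] for o2 in options)  # terminates
    U = cont + stop
    Q[(s, o)] = (1 - alpha) * Q[(s, o)] + alpha * (r + gamma * U)
    return Q
```

Note that the same one-step transition can update every option consistent with the action taken, which is what makes the method "intra-option".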
+++
title = "STRIPS-style planning"
author = ["Houjun Liu"]
draft = false
+++

This is a precursor to [MDP]({{< relref "KBhmarkov_decision_process.md" >}}) planning:

- states: conjunctions of "fluents" (which are state variables)
- actions: transitions between fluents
- transitions: deleting the older, changed parts of fluents and adding new parts

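The add/delete transition semantics above can be sketched with sets of fluents; the operator representation below is an assumption for illustration:

```python
def applicable(state, action):
    """A state is a set of fluents; an action carries preconditions,
    a delete list, and an add list (classic STRIPS operators)."""
    return action["pre"] <= state

def apply_action(state, action):
    """Transition = remove the fluents the action deletes, add new ones."""
    if not applicable(state, action):
        raise ValueError("preconditions not satisfied")
    return (state - action["del"]) | action["add"]
```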
## Planning Domain Definition Language {#planning-domain-definition-language}

A LISP-like language used to specify a [STRIPS-style planning]({{< relref "KBhstrips_style_planning.md" >}}) problem.

## Hierarchical Task Network {#hierarchical-task-network}

1. Decompose classical planning into a hierarchy of actions
2. Leverage high-level actions to generate a coarse plan
3. Refine into smaller problems
+++
title = "Temporal Abstraction"
author = ["Houjun Liu"]
draft = false
+++