From 4c0e6b34e72e6a849e926dc29b88a8f5cf4ad15c Mon Sep 17 00:00:00 2001
From: Diederik Huige <31347265+Dhuige@users.noreply.github.com>
Date: Mon, 18 Dec 2023 12:08:56 +0100
Subject: [PATCH] Update sac.rst

The sentence-final "though" in the note about closed-form
log-probabilities reads unclearly; rephrase the parenthetical around
"although". (Depending on the intended reading, "through" may have been
meant instead.)

---
 docs/algorithms/sac.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/algorithms/sac.rst b/docs/algorithms/sac.rst
index 6df7ff501..96f44d6a9 100644
--- a/docs/algorithms/sac.rst
+++ b/docs/algorithms/sac.rst
@@ -159,7 +159,7 @@ The way we optimize the policy makes use of the **reparameterization trick**, in
 
 This policy has two key differences from the policies we use in the other policy optimization algorithms:
 
-   **1. The squashing function.** The :math:`\tanh` in the SAC policy ensures that actions are bounded to a finite range. This is absent in the VPG, TRPO, and PPO policies. It also changes the distribution: before the :math:`\tanh` the SAC policy is a factored Gaussian like the other algorithms' policies, but after the :math:`\tanh` it is not. (You can still compute the log-probabilities of actions in closed form, though: see the paper appendix for details.)
+   **1. The squashing function.** The :math:`\tanh` in the SAC policy ensures that actions are bounded to a finite range. This is absent in the VPG, TRPO, and PPO policies. It also changes the distribution: before the :math:`\tanh` the SAC policy is a factored Gaussian like the other algorithms' policies, but after the :math:`\tanh` it is not. (Although the distribution changes, you can still compute the log-probabilities of actions in closed form: see the paper appendix for details.)
 
    **2. The way standard deviations are parameterized.** In VPG, TRPO, and PPO, we represent the log std devs with state-independent parameter vectors. In SAC, we represent the log std devs as outputs from the neural network, meaning that they depend on state in a complex way. SAC with state-independent log std devs, in our experience, did not work. (Can you think of why? Or better yet: run an experiment to verify?)
 
@@ -318,4 +318,4 @@ Other Public Implementations
 
 .. _`SAC release repo`: https://github.com/haarnoja/sac
 .. _`Softlearning repo`: https://github.com/rail-berkeley/softlearning
-.. _`Yarats and Kostrikov repo`: https://github.com/denisyarats/pytorch_sac
\ No newline at end of file
+.. _`Yarats and Kostrikov repo`: https://github.com/denisyarats/pytorch_sac
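
Reviewer context, not part of the patch: the paragraph this hunk touches
describes the two properties that set the SAC policy apart, the tanh
squashing (with its closed-form log-probability correction) and the
state-dependent log std devs. A minimal PyTorch sketch of both follows;
the module name, hidden sizes, and clamp bounds are illustrative
assumptions, not the Spinning Up source.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class SquashedGaussianPolicy(nn.Module):
    """Sketch of a SAC-style policy: tanh-squashed Gaussian with
    state-dependent log std devs. Names and sizes are illustrative."""

    def __init__(self, obs_dim, act_dim, hidden=256, act_limit=1.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Difference 2: both mu and log_std are network outputs, so the
        # std dev depends on state (unlike VPG/TRPO/PPO, where the log
        # std devs are state-independent parameter vectors).
        self.mu_layer = nn.Linear(hidden, act_dim)
        self.log_std_layer = nn.Linear(hidden, act_dim)
        self.act_limit = act_limit

    def forward(self, obs):
        h = self.body(obs)
        mu = self.mu_layer(h)
        log_std = torch.clamp(self.log_std_layer(h), -20, 2)
        pi = torch.distributions.Normal(mu, log_std.exp())
        # Reparameterization trick: u = mu + std * eps, eps ~ N(0, 1),
        # so gradients flow through the sample.
        u = pi.rsample()
        # Gaussian log-prob, then the tanh change-of-variables
        # correction (subtracting sum_i log(1 - tanh(u_i)^2)), written
        # in a numerically stable form.
        logp = pi.log_prob(u).sum(-1)
        logp -= (2 * (math.log(2) - u - F.softplus(-2 * u))).sum(-1)
        # Difference 1: the squashing function bounds actions to a
        # finite range.
        a = self.act_limit * torch.tanh(u)
        return a, logp

For example, a, logp = SquashedGaussianPolicy(obs_dim=11, act_dim=3)(torch.randn(8, 11))
yields actions bounded to [-act_limit, act_limit] along with their
corrected log-probabilities, which is exactly the closed-form
computation the parenthetical in the edited paragraph refers to.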