
Wrong softmax temperature when choosing the action for the next step #1

Open · andste97 opened this issue May 8, 2024 · 3 comments


andste97 commented May 8, 2024

Dear maintainers, I have found that the softmax temperature used in the code on the master branch when expanding the next state differs from what is stated in your publication, as well as from the original code.

Instead of setting temperature = 0 to use the deterministic softmax version, the parameter is set to the same value as the one used by the planner, which led to the policy not learning in the Atari environment in my experiments, as random actions would get inserted into the replay buffer.

The issue can be fixed by changing the temp parameter on line 94 of piIW_alphazero.py.
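To illustrate the difference, here is a minimal sketch (the `softmax_temp` helper and the example returns are made up for illustration and are not the code from piIW_alphazero.py) of how the temperature shapes the distribution over actions, and how temperature = 0 degenerates into the deterministic (greedy) case:

```python
import numpy as np

def softmax_temp(values, temp):
    """Softmax over `values` with temperature `temp`.
    With temp == 0 it degenerates to a deterministic distribution over the argmax set."""
    values = np.asarray(values, dtype=np.float64)
    if temp == 0:
        p = (values == values.max()).astype(np.float64)
        return p / p.sum()
    e = np.exp((values - values.max()) / temp)  # subtract the max for numerical stability
    return e / e.sum()

returns = [0.2, 1.3, 0.9]                 # made-up returns at the root node
print(softmax_temp(returns, temp=1.0))    # smoothed distribution over actions
print(softmax_temp(returns, temp=0))      # all mass on the best action
```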

@mjunyentb (Member)

Hi Andreas,

Thanks for the interest in the paper, and for finding this mismatch between the original implementation and this one. I had a quick look, and if I'm not mistaken (which could be since I wrote this code a while ago), the fix you propose will lead to the agent always taking the greedy action, which is not what we state in the paper. See this line:

```python
a = sample_pmf(tree_policy)
```

Instead, we should apply a deterministic softmax before adding the sample to the experience replay.
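To make that distinction concrete, here is a rough sketch of what I mean, with placeholder names (`softmax_temp`, `node_returns`, `replay_buffer`, and the stand-in `sample_pmf` are illustrative, not the actual code in piIW_alphazero.py): sample the action to act with the planner's non-zero temperature, but compute the stored policy target with temperature 0.

```python
import numpy as np

def softmax_temp(values, temp):
    # same illustrative helper as above: temp == 0 gives the deterministic softmax
    values = np.asarray(values, dtype=np.float64)
    if temp == 0:
        p = (values == values.max()).astype(np.float64)
        return p / p.sum()
    e = np.exp((values - values.max()) / temp)
    return e / e.sum()

def sample_pmf(p):
    # stand-in for the repo's sample_pmf: draw an index from a probability mass function
    return int(np.random.choice(len(p), p=p))

node_returns = np.array([0.2, 1.3, 0.9])  # made-up returns at the root node
planner_temp = 1.0                        # non-zero temperature used by the planner

# Act with the (possibly stochastic) tree policy ...
tree_policy = softmax_temp(node_returns, planner_temp)
a = sample_pmf(tree_policy)

# ... but store the deterministic (temperature 0) softmax as the policy target.
target_policy = softmax_temp(node_returns, temp=0)
replay_buffer = []                        # placeholder for the actual experience replay
replay_buffer.append((a, target_policy))
```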

As you can see from my late reply, I don't have much time to address this at the moment. Feel free to open a PR with a proposed fix. I'll try to have a look. Thanks!

@mjunyentb (Member)

> which led to the policy not learning in the Atari environment in my experiments, as random actions would get inserted into the replay buffer.

About this, note that random actions would only get inserted into the replay buffer if you set the temperature to a very high value. In that case, your agent would also take random actions. With a reasonable non-zero value, you will reduce the variance of the target policies in the replay buffer, which may be beneficial depending on the problem, at the cost of training speed.
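As a rough numerical illustration (reusing the illustrative `softmax_temp` helper from above with made-up returns, not values from the actual code), only a very high temperature makes the distribution close to uniform, whereas a moderate one still favors the best action:

```python
import numpy as np

def softmax_temp(values, temp):
    values = np.asarray(values, dtype=np.float64)
    e = np.exp((values - values.max()) / temp)
    return e / e.sum()

returns = np.array([0.2, 1.3, 0.9])
print(softmax_temp(returns, 100.0))  # ~uniform: sampled actions become essentially random
print(softmax_temp(returns, 1.0))    # ~[0.17, 0.50, 0.33]: favors the best action, keeps some exploration
```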


andste97 commented Jul 5, 2024

Dear Miquel,

While it may be the case that a non-zero temperature decreases variance, I could not observe that. Then again, the issue may also have been somewhere else at the time I ran the experiments.

In any case, I still believe that my proposed change matches what is stated in the associated paper. In the latest version of the paper on arXiv (https://arxiv.org/abs/1904.07091), you specifically mention that the softmax temperature parameter was 0 when applying the softmax function to the returns at the root node:
[screenshot of the relevant passage from the paper]

I will create a PR for this later so you can check the exact proposed changes when you have the time :)
