Wrong softmax temperature when choosing the action for the next step #1
Comments
Hi Andreas,
Thanks for the interest in the paper, and for finding this mismatch between the original implementation and this one. I had a quick look, and if I'm not mistaken (which is possible, since I wrote this code a while ago), the fix you propose will lead to the agent always taking the greedy action, which is not what we state in the paper. See this line: Line 109 in 9bfd15e
Instead, we should apply a deterministic softmax before adding the sample to the experience replay. As you can see from my late reply, I don't have much time to address this at the moment. Feel free to open a PR with a proposed fix, and I'll try to have a look. Thanks!
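To make the distinction concrete, here is a minimal sketch of a softmax with a temperature parameter whose temperature-0 limit is the deterministic (one-hot argmax) case. The function name, the example returns, and the way the target policy is built are illustrative assumptions, not code from this repository:

```python
import numpy as np

def softmax(values, temp):
    """Softmax over `values` with temperature `temp`.
    temp == 0 is treated as the deterministic limit: a one-hot
    distribution at the argmax (ties broken by the first maximum)."""
    values = np.asarray(values, dtype=np.float64)
    if temp == 0:
        policy = np.zeros_like(values)
        policy[np.argmax(values)] = 1.0
        return policy
    scaled = values / temp
    scaled -= scaled.max()        # subtract max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

# Illustrative returns at the root of the search tree.
returns = np.array([1.3, 0.7, 0.2])

# Deterministic softmax (temp = 0): the target policy that would be added
# to the experience replay.
target_policy = softmax(returns, temp=0)   # -> [1., 0., 0.]
```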
About this, note that random actions would only get inserted into the replay buffer if you set the temperature to a very high value. In that case, your agent would also take random actions. With a reasonable non-zero value, you will reduce the variance of the target policies in the replay buffer, which may be beneficial depending on the problem, at the cost of training speed.
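As a small, self-contained illustration of this trade-off (the returns and temperatures below are made-up numbers, not taken from the repository), the temperature controls how far the stored policy is from uniform:

```python
import numpy as np

def softmax(values, temp):
    x = np.asarray(values, dtype=np.float64) / temp
    x -= x.max()
    e = np.exp(x)
    return e / e.sum()

returns = [1.3, 0.7, 0.2]                 # illustrative returns at the root node
for temp in (0.1, 1.0, 100.0):
    print(temp, softmax(returns, temp).round(3))
# 0.1   -> [0.998 0.002 0.   ]  close to the deterministic target
# 1.0   -> [0.531 0.292 0.177]  softer target, less variance across samples
# 100.0 -> [0.335 0.333 0.332]  nearly uniform: sampled actions become random
```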
Dear Miquel, while a non-zero temperature may indeed decrease variance, I could not observe that effect. Then again, the issue may also have been elsewhere at the time I ran the experiments. In any case, I still believe that my proposed change matches what is stated in the associated paper. In the latest version of the paper on arXiv (https://arxiv.org/abs/1904.07091) you specifically mention that the softmax temperature parameter was 0 when you applied the softmax function to the returns at the root node. I will create a PR for this later so you can check the exact proposed changes when you have the time :)
Dear maintainers, I have found that the softmax temperature used in the master branch when expanding the next state differs from what is stated in your publication, as well as from the original code.
Instead of setting temperature = 0 to use the deterministic softmax version, the parameter is set to the same value as the one used by the planner, which led to the policy not learning in the Atari environment in my experiments, as random actions get inserted into the replay buffer.
The issue can be fixed by changing the temp parameter on line 94 of piIW_alphazero.py.
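For reference, one way to reconcile the planner's exploration with the deterministic target described in the paper is sketched below: keep a non-zero temperature for choosing the next action, while storing the temperature-0 policy in the replay buffer. The variable names and values are illustrative and are not taken from piIW_alphazero.py:

```python
import numpy as np
from scipy.special import softmax   # plain softmax; temperature applied by scaling

rng = np.random.default_rng(0)
returns = np.array([1.3, 0.7, 0.2])      # illustrative returns at the root node
planner_temp = 1.0                        # non-zero temperature used by the planner

# Action for the next step: sampled with the planner's temperature,
# so the agent does not always act greedily.
action = rng.choice(len(returns), p=softmax(returns / planner_temp))

# Target policy stored in the replay buffer: the deterministic (temperature-0)
# limit of the softmax, i.e. a one-hot vector at the best return.
target_policy = np.zeros_like(returns)
target_policy[np.argmax(returns)] = 1.0

replay_buffer = []                        # stand-in for the real experience replay
replay_buffer.append((returns, action, target_policy))
```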