
Wrong softmax temperature when choosing the action for the next step #1

Open · andste97 opened this issue May 8, 2024 · 3 comments


andste97 commented May 8, 2024

Dear maintainers, I have found that the softmax temperature used in the code on the master branch when expanding the next state differs from what is stated in your publication, as well as from the original code.

Instead of setting temperature = 0 to use the deterministic softmax version, the parameter is set to the same value as the one used by the planner, which led to the policy not learning in the Atari environment in my experiments, as random actions would get inserted into the replay buffer.

The issue can be fixed by changing the temp parameter on line 94 of piIW_alphazero.py.
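To illustrate the difference, here is a minimal sketch (the `softmax_temp` helper and the example returns are made up for illustration and are not the code from piIW_alphazero.py) of how the temperature shapes the distribution over actions, and how temperature = 0 degenerates into the deterministic (greedy) case:

```python
import numpy as np

def softmax_temp(values, temp):
    """Softmax over `values` with temperature `temp`.
    With temp == 0 it degenerates to a deterministic distribution over the argmax set."""
    values = np.asarray(values, dtype=np.float64)
    if temp == 0:
        p = (values == values.max()).astype(np.float64)
        return p / p.sum()
    e = np.exp((values - values.max()) / temp)  # subtract the max for numerical stability
    return e / e.sum()

returns = [0.2, 1.3, 0.9]                 # made-up returns at the root node
print(softmax_temp(returns, temp=1.0))    # smoothed distribution over actions
print(softmax_temp(returns, temp=0))      # all mass on the best action
```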

@mjunyentb (Member)

Hi Andreas,

Thanks for the interest in the paper, and for finding this mismatch between the original implementation and this one. I had a quick look, and if I'm not mistaken (which could be since I wrote this code a while ago), the fix you propose will lead to the agent always taking the greedy action, which is not what we state in the paper. See this line:

```python
a = sample_pmf(tree_policy)
```

Instead, we should apply a deterministic softmax before adding the sample to the experience replay.
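To make that distinction concrete, here is a rough sketch of what I mean, with placeholder names (`softmax_temp`, `node_returns`, `replay_buffer`, and the stand-in `sample_pmf` are illustrative, not the actual code in piIW_alphazero.py): sample the action to act with the planner's non-zero temperature, but compute the stored policy target with temperature 0.

```python
import numpy as np

def softmax_temp(values, temp):
    # same illustrative helper as above: temp == 0 gives the deterministic softmax
    values = np.asarray(values, dtype=np.float64)
    if temp == 0:
        p = (values == values.max()).astype(np.float64)
        return p / p.sum()
    e = np.exp((values - values.max()) / temp)
    return e / e.sum()

def sample_pmf(p):
    # stand-in for the repo's sample_pmf: draw an index from a probability mass function
    return int(np.random.choice(len(p), p=p))

node_returns = np.array([0.2, 1.3, 0.9])  # made-up returns at the root node
planner_temp = 1.0                        # non-zero temperature used by the planner

# Act with the (possibly stochastic) tree policy ...
tree_policy = softmax_temp(node_returns, planner_temp)
a = sample_pmf(tree_policy)

# ... but store the deterministic (temperature 0) softmax as the policy target.
target_policy = softmax_temp(node_returns, temp=0)
replay_buffer = []                        # placeholder for the actual experience replay
replay_buffer.append((a, target_policy))
```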

As you can see from my late reply, I don't have much time to address this at the moment. Feel free to open a PR with a proposed fix. I'll try to have a look. Thanks!

@mjunyentb (Member)

> which led to the policy not learning in the Atari environment in my experiments, as random actions would get inserted into the replay buffer.

About this, note that random actions would only get inserted into the replay buffer if you set the temperature to a very high value. In that case, your agent would also take random actions. With a reasonable non-zero value, you will reduce the variance of the target policies in the replay buffer, which may be beneficial depending on the problem, at the cost of training speed.
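As a rough numerical illustration (reusing the illustrative `softmax_temp` helper from above with made-up returns, not values from the actual code), only a very high temperature makes the distribution close to uniform, whereas a moderate one still favors the best action:

```python
import numpy as np

def softmax_temp(values, temp):
    values = np.asarray(values, dtype=np.float64)
    e = np.exp((values - values.max()) / temp)
    return e / e.sum()

returns = np.array([0.2, 1.3, 0.9])
print(softmax_temp(returns, 100.0))  # ~uniform: sampled actions become essentially random
print(softmax_temp(returns, 1.0))    # ~[0.17, 0.50, 0.33]: favors the best action, keeps some exploration
```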


andste97 commented Jul 5, 2024

Dear Miquel,

While it may be the case that a non-zero temperature decreases variance, I could not observe that. Then again, the issue may also have been somewhere else at the time I ran the experiments.

In any case, I still believe that my proposed change matches what is stated in the associated paper. In the latest version of the paper on arXiv (https://arxiv.org/abs/1904.07091), you specifically mention that the softmax temperature parameter was 0 when applying the softmax function to the returns at the root node:
[screenshot of the relevant passage from the paper]

I will create a PR for this later so you can check the exact proposed changes when you have the time :)
