
Perform fire on reset after doing no-ops, make fire_on_reset configurable #158

Merged 3 commits into astooke:master on Jun 30, 2020

Conversation

@ankeshanand (Contributor) commented May 22, 2020

Fire on reset should probably happen after doing the no-ops, not before. Firing and then doing no-ops for 30 steps is probably harmful in some cases. OpenAI baselines follow this order as well: https://github.com/openai/baselines/blob/8c2aea2addc9f3ba36d4a0c937e6a2d09830afc7/baselines/ppo1/run_atari.py

I have also made fire_on_reset configurable since repositories that reproduce DeepMind numbers don't use it (dopamine and @Kaixhin's rainbow). Setting it to False is probably a better default but I didn't change that yet.

Edit: Switched to False by default at Kai's suggestion; my own experiments with data-efficient Rainbow also suggest that using FireOnReset hurts.
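
To make the change concrete, here is a minimal sketch of the reset order this PR argues for, written against a generic Gym-style Atari interface; the wrapper class, its names, and the action-meaning lookup are illustrative assumptions rather than rlpyt's actual AtariEnv code.

```python
import numpy as np


class NoopThenFireReset:
    """Illustrative wrapper (not rlpyt's actual AtariEnv): random no-ops first,
    then optionally press FIRE, using a generic Gym-style 4-tuple step API."""

    def __init__(self, env, noop_max=30, fire_on_reset=False):
        self.env = env
        self.noop_max = noop_max
        self.fire_on_reset = fire_on_reset
        meanings = env.unwrapped.get_action_meanings()
        self.noop_action = meanings.index("NOOP")
        self.fire_action = meanings.index("FIRE") if "FIRE" in meanings else None

    def reset(self):
        obs = self.env.reset()
        # 1) Random number of no-ops to randomize the starting state.
        for _ in range(np.random.randint(1, self.noop_max + 1)):
            obs, _, done, _ = self.env.step(self.noop_action)
            if done:
                obs = self.env.reset()
        # 2) Only then press FIRE (if enabled), so the episode effectively
        #    starts here; firing before the no-ops can, e.g. in Breakout,
        #    launch the ball before the agent is in control.
        if self.fire_on_reset and self.fire_action is not None:
            obs, _, done, _ = self.env.step(self.fire_action)
            if done:
                obs = self.env.reset()
        return obs
```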

@codecov-commenter commented May 22, 2020

Codecov Report

Merging #158 into master will decrease coverage by 0.02%.
The diff coverage is 75.00%.


@@            Coverage Diff             @@
##           master     #158      +/-   ##
==========================================
- Coverage   22.53%   22.51%   -0.03%     
==========================================
  Files         128      128              
  Lines        8002     8005       +3     
==========================================
- Hits         1803     1802       -1     
- Misses       6199     6203       +4     
Flag: #unittests, coverage 22.51% <75.00%> (-0.03%) ⬇️
Impacted file: rlpyt/envs/atari/atari_env.py, coverage 80.64% <75.00%> (-2.83%) ⬇️

Continue to review full report at Codecov.
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85d4e01...aecb175.

@Kaixhin commented May 22, 2020

Check this: openai/baselines#240 - I confirmed with DeepMind that they don't use it.

@ankeshanand (Contributor, Author)

Thanks a lot Kai; my own experiments with DER also suggest that not using fire on reset is better! While we have you here, could you clarify the following settings as well?

  • Did you find clipping the grad norm not useful? I saw your repo doesn't use it.
  • What's the correct exploration strategy to use during eval with NoisyNets? Does DM still use a small epsilon?
  • When prioritizing with C51, is it useful to use cross-entropy over KL? I saw both Dopamine and your repo use CE, but the Rainbow paper mentions KL.

@Kaixhin commented May 22, 2020

  1. I think it's not mentioned in the Rainbow paper, so I presume it's unused?
  2. Evaluation is ε-greedy with ε = 0.001, but no noise in the noisy layers (see the sketch after this list).
  3. The original distributional RL paper's algorithm specifies the cross-entropy loss.
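
A minimal DQN-style PyTorch sketch of that evaluation setting (point 2): noise in the noisy layers disabled and ε-greedy with ε = 0.001. The `network` object and its `remove_noise()` hook are illustrative assumptions, not the actual API of Kaixhin/Rainbow or rlpyt.

```python
import random

import torch

EVAL_EPSILON = 0.001  # small epsilon used only at evaluation time


@torch.no_grad()
def act_eval(network, state, num_actions):
    """Greedy action with eps = 0.001 exploration and the noisy layers' noise
    disabled; remove_noise() is a hypothetical hook, not an existing API."""
    network.eval()
    network.remove_noise()  # assumption: zeroes the noise in NoisyLinear layers
    if random.random() < EVAL_EPSILON:
        return random.randrange(num_actions)
    q_values = network(state.unsqueeze(0))  # shape: (1, num_actions)
    return int(q_values.argmax(dim=1).item())
```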

@ankeshanand (Contributor, Author)

Thanks, super helpful! For 1, I was just looking at the DER paper, so maybe it only helps there.

@Kaixhin commented May 22, 2020

Oh - that specifies that it was used in both. I found the DER paper useful for specifying hyperparameters, so I'll add it to my repo...

@ankeshanand (Contributor, Author)

Good catch! If you want to be super consistent, both versions don't seem to be using Dueling either.

@Kaixhin commented May 22, 2020

Dueling is one of the main components of Rainbow (both)?

@ankeshanand (Contributor, Author) commented May 22, 2020

I don't think so.

From the Rainbow paper:

In aggregate, we did not observe a significant difference when removing the dueling network from the full Rainbow.

And the DER paper mentions, for both variants:

Update: Distributional Double Q

which I assume implies not using dueling.

@Kaixhin commented May 22, 2020

Seems like that means there's no significant difference, but they do keep it. As for the update rule: distributional RL and double Q-learning both change the form of the update, but dueling is an architectural change that is agnostic to the update rule.

@ankeshanand (Contributor, Author)

I see, that might be the case!

@Kaixhin commented May 25, 2020

Confirmed with Matteo Hessel that gradient clipping is used in Rainbow.
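
For anyone implementing this, a minimal PyTorch sketch of gradient-norm clipping in the optimisation step; `model`, `optimiser`, and `loss` are placeholders, and the max norm of 10 is an assumption taken from commonly reported Rainbow/DER hyperparameter tables rather than from this thread.

```python
import torch

MAX_GRAD_NORM = 10.0  # assumption: value commonly listed in Rainbow/DER tables


def training_step(model, optimiser, loss):
    """One optimisation step with gradient-norm clipping (placeholder names)."""
    optimiser.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimiser.step()
```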

@astooke merged commit 4acbe31 into astooke:master on Jun 30, 2020
@astooke (Owner) commented Jun 30, 2020

OK looks good, thanks for digging into the details!

I thought fire was needed in some games to start the action after a lost life? I guess you're right though: on Breakout, if you press fire and then wait up to 30 steps, the ball might have already gone too low to reach... could be why I'm getting low scores in that game sometimes, hahah, hmmm...

@astooke mentioned this pull request on Jul 7, 2020
@DanielTakeshi (Contributor)

@astooke and everyone else here, I am just curious, but are there benchmarks on Breakout with and without this change, to see if reward is better? I also notice (at least anecdotally) that before this pull request, my Breakout scores are OK sometimes but can be lower than expected. I have not tried with this change.

jordan-schneider pushed a commit to jordan-schneider/rlpyt that referenced this pull request Jan 4, 2021
Perform fire on reset after doing no-ops, make fire_on_reset configurable (astooke#158)

* Perform fire on reset after doing no-ops, make fire_on_reset configurable

* Fix typo

* Set fire_on_reset to be False by default
@rfali commented May 16, 2023

@Kaixhin
From your comment above

  1. I think it's not mentioned in the Rainbow paper so presume unused?
  2. Evaluation is ε-greedy with ε = 0.001, but no noise in the noisy layers.
  3. The original distributional RL paper's algorithm specifies the cross-entropy loss.

Can you please share the source of "2. Evaluation is ε-greedy with ε = 0.001, but no noise in the noisy layers"? I thought the Rainbow paper (https://arxiv.org/pdf/1710.02298.pdf) on page 4 says to set ε = 0.

[screenshot of the highlighted passage from page 4 of the Rainbow paper]

@Kaixhin commented May 16, 2023

@rfali the part you highlighted from the Rainbow paper seems to refer to training settings, rather than evaluation settings. Unfortunately this bit of info was a bit hard to find, and I'm afraid I can't remember where I picked it up (it's been several years since I worked on Atari), so if you do find a primary source that says otherwise you should trust that instead.

@rfali commented May 16, 2023

Thanks @Kaixhin. The only other source I have found that does something similar is the RLlib benchmark experiments, which used ε = 0.01: https://github.com/ray-project/rl-experiments#dqn--rainbow

Note that RLlib evaluation scores include the 1% random actions of epsilon-greedy exploration. You can expect slightly higher rewards when rolling out the policies without any exploration at all.

@rfali commented May 16, 2023

Update: I did find that Dopamine also sets epsilon=0.001 during evaluation. See the Full Rainbow config in the Dopamine repository.

However, the above config also sets epsilon_train=0.01; since the config also sets noisy=True, that epsilon is set to 0 later. See google/dopamine#201 (comment) and the associated commit for more details.
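
A tiny sketch of the effective-epsilon behaviour described above, as I read the linked discussion; this is not Dopamine code, and the function and parameter names are made up for illustration.

```python
def effective_epsilon(training, noisy, epsilon_train=0.01, epsilon_eval=0.001):
    """Illustrative reading of the linked discussion (not Dopamine code):
    with noisy networks enabled, the training epsilon is forced to 0, while
    a small epsilon (0.001) is still used at evaluation time."""
    if training:
        return 0.0 if noisy else epsilon_train
    return epsilon_eval


# Values follow the settings quoted in this thread.
assert effective_epsilon(training=True, noisy=True) == 0.0
assert effective_epsilon(training=True, noisy=False) == 0.01
assert effective_epsilon(training=False, noisy=True) == 0.001
```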
