Normalize rewards by standard deviation of discounted return in MuJoCo #149

Open. vzhuang wants to merge 1 commit into master.
Conversation

@vzhuang commented Apr 21, 2020

Averaged results over 10 runs for PPO on Walker2d-v3:

[Figure: walker2dv3normtest, results averaged over 10 runs for PPO on Walker2d-v3]
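For context, here is a minimal sketch of the normalization scheme the PR title describes: maintain a running estimate of the standard deviation of the (non-bootstrapped) discounted return and divide rewards by it before computing advantages. The class and function names below are illustrative stand-ins, not the PR's actual code:

```python
import numpy as np

class RunningReturnStd:
    """Running estimate of the variance of discounted returns."""
    def __init__(self, epsilon=1e-8):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon  # avoids division by zero before the first update

    def update(self, x):
        # Parallel-variance (Chan et al.) update from a batch of return samples.
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean += delta * batch_count / total
        self.var = m2 / total
        self.count = total

def discounted_returns(rewards, dones, discount):
    """Non-bootstrapped discounted return at every timestep."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * (1.0 - dones[t]) * running
        returns[t] = running
    return returns

# Per batch: update the running statistics from the discounted returns,
# then scale the rewards by the current return std before computing advantages.
rms = RunningReturnStd()
rewards, dones = np.random.randn(128), np.zeros(128)
rms.update(discounted_returns(rewards, dones, discount=0.99))
rewards = rewards / np.sqrt(rms.var + 1e-8)
```

Only the scale of the rewards changes; their signs and relative magnitudes within a batch are preserved.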

@vzhuang (Author) commented Apr 21, 2020

#115

@codecov-io commented Apr 21, 2020

Codecov Report

Merging #149 into master will decrease coverage by 0.01%.
The diff coverage is 20.58%.


@@            Coverage Diff             @@
##           master     #149      +/-   ##
==========================================
- Coverage   22.56%   22.56%   -0.01%     
==========================================
  Files         128      128              
  Lines        7987     8014      +27     
==========================================
+ Hits         1802     1808       +6     
- Misses       6185     6206      +21     
| Flag | Coverage Δ |
|---|---|
| #unittests | 22.56% <20.58%> (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| rlpyt/algos/pg/a2c.py | 0.00% <0.00%> (ø) |
| rlpyt/algos/pg/base.py | 0.00% <0.00%> (ø) |
| rlpyt/algos/pg/ppo.py | 0.00% <0.00%> (ø) |
| rlpyt/experiments/configs/mujoco/pg/mujoco_a2c.py | 0.00% <ø> (ø) |
| rlpyt/experiments/configs/mujoco/pg/mujoco_ppo.py | 0.00% <ø> (ø) |
| rlpyt/samplers/base.py | 80.00% <ø> (ø) |
| rlpyt/samplers/collections.py | 96.29% <ø> (ø) |
| rlpyt/samplers/collectors.py | 81.03% <ø> (ø) |
| rlpyt/samplers/parallel/gpu/collectors.py | 0.00% <0.00%> (ø) |
| rlpyt/samplers/serial/sampler.py | 97.72% <ø> (ø) |

... and 3 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 668290d...a15a93b. Read the comment docs.

@astooke (Owner) commented Jun 30, 2020

OK, this is interesting, and I think it can be made a lot simpler. As far as I can tell from the PR, the same could be achieved by changing process_returns() in the policy gradient base class, right after the following lines:

```python
reward, done, value, bv = (samples.env.reward, samples.env.done,
    samples.agent.agent_info.value, samples.agent.bootstrap_value)
done = done.type(reward.dtype)
```

by inserting:

```python
if self.normalize_reward:
    return_ = discount_return(reward, done, 0., self.discount)  # NO bootstrapping of value
    self.rets_rms.update(return_.view(-1, 1))  # matching the shape you used, not sure if the extra dim is needed?
    std_dev = torch.sqrt(self.rets_rms.var)
    reward = torch.div(reward, std_dev)

# proceed with computing discounted returns or GAE returns using the scaled reward
```

I think that accomplishes the same math? And doesn't need to change any files in the sampler :)
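One way to sanity-check the "same math" point above: with a zero bootstrap value, dividing rewards by a constant scales every discounted return by the same constant, so the advantages computed downstream pick up one common scale factor regardless of whether the division happens in the sampler or inside process_returns(). A toy check of that identity, using a minimal stand-in for rlpyt's discount_return rather than the library function itself:

```python
import torch

def discount_return(reward, done, bootstrap_value, discount):
    """Minimal stand-in for rlpyt's discount_return (time-major, 1-D case)."""
    ret = torch.zeros_like(reward)
    running = bootstrap_value
    for t in reversed(range(reward.shape[0])):
        running = reward[t] + discount * (1.0 - done[t]) * running
        ret[t] = running
    return ret

torch.manual_seed(0)
reward, done = torch.randn(64), torch.zeros(64)
sigma = 3.7  # stand-in for the running std of discounted returns

ret_raw = discount_return(reward, done, torch.tensor(0.0), 0.99)
ret_scaled = discount_return(reward / sigma, done, torch.tensor(0.0), 0.99)

# Scaling rewards by 1/sigma scales every (zero-bootstrap) discounted return
# by 1/sigma as well, so the normalization can live in either place.
assert torch.allclose(ret_scaled, ret_raw / sigma, atol=1e-6)
```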

@astooke (Owner) commented Sep 5, 2020

Any more comments? Has anyone else used this?
