Normalize rewards by standard deviation of discounted return in MuJoCo #149

Open. vzhuang wants to merge 1 commit into master.
Conversation

@vzhuang commented Apr 21, 2020

Averaged results over 10 runs for PPO on Walker2d-v3:

[Figure: walker2dv3normtest, results averaged over 10 runs for PPO on Walker2d-v3]
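For context, here is a minimal sketch of the normalization scheme the PR title describes: maintain a running estimate of the standard deviation of the (non-bootstrapped) discounted return and divide rewards by it before computing advantages. The class and function names below are illustrative stand-ins, not the PR's actual code:

```python
import numpy as np

class RunningReturnStd:
    """Running estimate of the variance of discounted returns."""
    def __init__(self, epsilon=1e-8):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon  # avoids division by zero before the first update

    def update(self, x):
        # Parallel-variance (Chan et al.) update from a batch of return samples.
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean += delta * batch_count / total
        self.var = m2 / total
        self.count = total

def discounted_returns(rewards, dones, discount):
    """Non-bootstrapped discounted return at every timestep."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * (1.0 - dones[t]) * running
        returns[t] = running
    return returns

# Per batch: update the running statistics from the discounted returns,
# then scale the rewards by the current return std before computing advantages.
rms = RunningReturnStd()
rewards, dones = np.random.randn(128), np.zeros(128)
rms.update(discounted_returns(rewards, dones, discount=0.99))
rewards = rewards / np.sqrt(rms.var + 1e-8)
```

Only the scale of the rewards changes; their signs and relative magnitudes within a batch are preserved.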

@vzhuang (Author) commented Apr 21, 2020

#115

@codecov-io commented Apr 21, 2020

Codecov Report

Merging #149 into master will decrease coverage by 0.01%.
The diff coverage is 20.58%.


@@            Coverage Diff             @@
##           master     #149      +/-   ##
==========================================
- Coverage   22.56%   22.56%   -0.01%     
==========================================
  Files         128      128              
  Lines        7987     8014      +27     
==========================================
+ Hits         1802     1808       +6     
- Misses       6185     6206      +21     
| Flag | Coverage Δ |
|---|---|
| #unittests | 22.56% <20.58%> (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| rlpyt/algos/pg/a2c.py | 0.00% <0.00%> (ø) |
| rlpyt/algos/pg/base.py | 0.00% <0.00%> (ø) |
| rlpyt/algos/pg/ppo.py | 0.00% <0.00%> (ø) |
| rlpyt/experiments/configs/mujoco/pg/mujoco_a2c.py | 0.00% <ø> (ø) |
| rlpyt/experiments/configs/mujoco/pg/mujoco_ppo.py | 0.00% <ø> (ø) |
| rlpyt/samplers/base.py | 80.00% <ø> (ø) |
| rlpyt/samplers/collections.py | 96.29% <ø> (ø) |
| rlpyt/samplers/collectors.py | 81.03% <ø> (ø) |
| rlpyt/samplers/parallel/gpu/collectors.py | 0.00% <0.00%> (ø) |
| rlpyt/samplers/serial/sampler.py | 97.72% <ø> (ø) |

... and 3 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 668290d...a15a93b. Read the comment docs.

@astooke (Owner) commented Jun 30, 2020

OK, this is interesting, and I think it can be made a lot simpler. As far as I can tell from the PR, the same could be achieved by changing process_returns() in the policy gradient base class, right after the following lines:

```python
reward, done, value, bv = (samples.env.reward, samples.env.done,
    samples.agent.agent_info.value, samples.agent.bootstrap_value)
done = done.type(reward.dtype)
```

by inserting:

```python
if self.normalize_reward:
    return_ = discount_return(reward, done, 0., self.discount)  # NO bootstrapping of value
    self.rets_rms.update(return_.view(-1, 1))  # matching the shape you used, not sure if the extra dim is needed?
    std_dev = torch.sqrt(self.rets_rms.var)
    reward = torch.div(reward, std_dev)

# proceed with computing discounted returns or GAE returns using the scaled reward
```

I think that accomplishes the same math? And doesn't need to change any files in the sampler :)
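One way to sanity-check the "same math" point above: with a zero bootstrap value, dividing rewards by a constant scales every discounted return by the same constant, so the advantages computed downstream pick up one common scale factor regardless of whether the division happens in the sampler or inside process_returns(). A toy check of that identity, using a minimal stand-in for rlpyt's discount_return rather than the library function itself:

```python
import torch

def discount_return(reward, done, bootstrap_value, discount):
    """Minimal stand-in for rlpyt's discount_return (time-major, 1-D case)."""
    ret = torch.zeros_like(reward)
    running = bootstrap_value
    for t in reversed(range(reward.shape[0])):
        running = reward[t] + discount * (1.0 - done[t]) * running
        ret[t] = running
    return ret

torch.manual_seed(0)
reward, done = torch.randn(64), torch.zeros(64)
sigma = 3.7  # stand-in for the running std of discounted returns

ret_raw = discount_return(reward, done, torch.tensor(0.0), 0.99)
ret_scaled = discount_return(reward / sigma, done, torch.tensor(0.0), 0.99)

# Scaling rewards by 1/sigma scales every (zero-bootstrap) discounted return
# by 1/sigma as well, so the normalization can live in either place.
assert torch.allclose(ret_scaled, ret_raw / sigma, atol=1e-6)
```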

@astooke (Owner) commented Sep 5, 2020

Any more comments? Has anyone else used this?
