add continuous actions option in maddpg (#828)
* add continuous actions option in maddpg, update mpe package source

* format

* fix act_reg when continuous

* from parl.env.pettingzoo_mpe import MAenv_v2

* use concat instead of convert list to tensor

* format code,add args of max_episodes

* fix args

* Update README.md

* Update README.md

* Update README.md

* fix kl and logp of DiagGaussianDistribution

* update comment

* update comment

* Update README.md

* Fix maddpg (#836)

* fix_maddpg_torch

* fix_maddpg_benchmark_torch

* fix_torch_api_bug

* fix_simple_spread_local_rate

* simplify

* env_into_core

* add_argument

* delete_argument

* delete_argument

* fix_torch_normal_sample

* fix_guassion_sample

* fix_guassion_sample

* remove_tmpfile

* align_torch_with_paddle

* tensor_gpu_bug

* api_fix

* rm-tmp

* gitignore

* fix_readme

* align-torch-paddle

* reformate&torchclip

* reformate

* fix_logp_gitignore

* reformate

* fix_comment

* fix_readme_logdir

* reformate-trainpy-torch

* reformate-trainpy-torch

Co-authored-by: yixin617 <[email protected]>

* yapf

* from parl.env.multiagent_env import MAenv

* update deprecated comment

* Update multiagent_simple_env.py

* update comment

Co-authored-by: liuyixin-louis <[email protected]>
Co-authored-by: yixin617 <[email protected]>
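
The commit titles above mention fixing the `kl` and `logp` of `DiagGaussianDistribution`. For reference, below is a minimal sketch of the standard diagonal-Gaussian log-probability and KL formulas such a distribution class implements; the function names and shapes here are illustrative assumptions, not the PARL implementation.

```python
import math
import torch

def diag_gaussian_logp(x, mean, std):
    # log N(x; mean, diag(std^2)), summed over the action dimension
    var = std.pow(2)
    logp = -0.5 * ((x - mean).pow(2) / var + 2.0 * std.log() + math.log(2.0 * math.pi))
    return logp.sum(dim=-1)

def diag_gaussian_kl(mean_p, std_p, mean_q, std_q):
    # KL(N_p || N_q) for diagonal Gaussians, summed over the action dimension
    kl = (std_q.log() - std_p.log()
          + (std_p.pow(2) + (mean_p - mean_q).pow(2)) / (2.0 * std_q.pow(2))
          - 0.5)
    return kl.sum(dim=-1)

# quick sanity check against torch.distributions
p = torch.distributions.Normal(torch.zeros(3), torch.ones(3))
x = p.sample()
assert torch.allclose(diag_gaussian_logp(x, torch.zeros(3), torch.ones(3)),
                      p.log_prob(x).sum(-1))
```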
3 people authored Apr 20, 2022
1 parent 88e43d3 commit 46ceefd
Showing 17 changed files with 532 additions and 138 deletions.
3 changes: 1 addition & 2 deletions .gitignore
@@ -90,7 +90,6 @@ celerybeat-schedule
# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
@@ -103,4 +102,4 @@ ENV/
/site

# mypy
.mypy_cache/
.mypy_cache/
Binary file removed benchmark/torch/maddpg/.benchmark/maddpg_torch.png
Binary file not shown.
34 changes: 19 additions & 15 deletions benchmark/torch/maddpg/README.md
@@ -10,7 +10,7 @@ A simple multi-agent particle world based on gym. Please see [here](https://gith
Mean episode reward (every 1000 episodes) in training process (totally 25000 episodes).

<p align="center">
<img src=".benchmark/maddpg_torch.png" alt="result"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/torch/result.png" alt="result"/>
</p>

### Experiments result
@@ -19,37 +19,37 @@ Mean episode reward (every 1000 episodes) in training process (totally 25000 epi
<tr>
<td>
simple<br>
<img src="../../fluid/MADDPG/.benchmark/MADDPG_simple.gif" width = "170" height = "170" alt="MADDPG_simple"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple.gif" width = "170" height = "170" alt="MADDPG_simple"/>
</td>
<td>
simple_adversary<br>
<img src="../../fluid/MADDPG/.benchmark/MADDPG_simple_adversary.gif" width = "170" height = "170" alt="MADDPG_simple_adversary"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_adversary.gif" width = "170" height = "170" alt="MADDPG_simple_adversary"/>
</td>
<td>
simple_push<br>
<img src="../../fluid/MADDPG/.benchmark/MADDPG_simple_push.gif" width = "170" height = "170" alt="MADDPG_simple_push"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_push.gif" width = "170" height = "170" alt="MADDPG_simple_push"/>
</td>
<td>
simple_reference<br>
<img src="../../fluid/MADDPG/.benchmark/MADDPG_simple_reference.gif" width = "170" height = "170" alt="MADDPG_simple_reference"/>
simple_crypto<br>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_crypto.gif" width = "170" height = "170" alt="MADDPG_simple_crypto"/>
</td>
</tr>
<tr>
<td>
simple_speaker_listener<br>
<img src="../../fluid/MADDPG/.benchmark/MADDPG_simple_speaker_listener.gif" width = "170" height = "170" alt="MADDPG_simple_speaker_listener"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_speaker_listener.gif" width = "170" height = "170" alt="MADDPG_simple_speaker_listener"/>
</td>
<td>
simple_spread<br>
<img src="../../fluid/MADDPG/.benchmark/MADDPG_simple_spread.gif" width = "170" height = "170" alt="MADDPG_simple_spread"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_spread.gif" width = "170" height = "170" alt="MADDPG_simple_spread"/>
</td>
<td>
simple_tag<br>
<img src="../../fluid/MADDPG/.benchmark/MADDPG_simple_tag.gif" width = "170" height = "170" alt="MADDPG_simple_tag"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_tag.gif" width = "170" height = "170" alt="MADDPG_simple_tag"/>
</td>
<td>
simple_world_comm<br>
<img src="../../fluid/MADDPG/.benchmark/MADDPG_simple_world_comm.gif" width = "170" height = "170" alt="MADDPG_simple_world_comm"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_world_comm.gif" width = "170" height = "170" alt="MADDPG_simple_world_comm"/>
</td>
</tr>
</table>
@@ -58,17 +58,21 @@ simple_world_comm<br>
### Dependencies:
+ python3.5+
+ torch
+ [parl>=2.0.2](https://github.com/PaddlePaddle/PARL)
+ [multiagent-particle-envs](https://github.com/openai/multiagent-particle-envs)
+ gym==0.10.5
+ [parl>=2.0.4](https://github.com/PaddlePaddle/PARL)
+ PettingZoo==1.17.0
+ gym==0.23.1

### Start Training:
```
# To train an agent for simple_speaker_listener scenario
python train.py
# To train for other scenario, model is automatically saved every 1000 episodes
# python train.py --env [ENV_NAME]
python train.py --env [ENV_NAME]
# To show animation effects after training
# python train.py --env [ENV_NAME] --show --restore
python train.py --env [ENV_NAME] --show --restore
# To train and evaluate scenarios with continuous action spaces
python train.py --env [ENV_NAME] --continuous_actions
python train.py --env [ENV_NAME] --continuous_actions --show --restore
2 changes: 1 addition & 1 deletion benchmark/torch/maddpg/simple_agent.py
@@ -1,4 +1,4 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
19 changes: 15 additions & 4 deletions benchmark/torch/maddpg/simple_model.py
@@ -1,4 +1,4 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -26,9 +26,13 @@ def weights_init_(m):


class MAModel(parl.Model):
    def __init__(self, obs_dim, act_dim, critic_in_dim):
    def __init__(self,
                 obs_dim,
                 act_dim,
                 critic_in_dim,
                 continuous_actions=False):
        super(MAModel, self).__init__()
        self.actor_model = ActorModel(obs_dim, act_dim)
        self.actor_model = ActorModel(obs_dim, act_dim, continuous_actions)
        self.critic_model = CriticModel(critic_in_dim)

    def policy(self, obs):
@@ -45,19 +49,26 @@ def get_critic_params(self):


class ActorModel(parl.Model):
    def __init__(self, obs_dim, act_dim):
    def __init__(self, obs_dim, act_dim, continuous_actions=False):
        super(ActorModel, self).__init__()
        self.continuous_actions = continuous_actions
        hid1_size = 64
        hid2_size = 64
        self.fc1 = nn.Linear(obs_dim, hid1_size)
        self.fc2 = nn.Linear(hid1_size, hid2_size)
        self.fc3 = nn.Linear(hid2_size, act_dim)
        if self.continuous_actions:
            std_hid_size = 64
            self.std_fc = nn.Linear(std_hid_size, act_dim)
        self.apply(weights_init_)

    def forward(self, obs):
        hid1 = F.relu(self.fc1(obs))
        hid2 = F.relu(self.fc2(hid1))
        means = self.fc3(hid2)
        if self.continuous_actions:
            act_std = self.std_fc(hid2)
            return (means, act_std)
        return means


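
With `continuous_actions=True`, `ActorModel.forward` now returns a `(means, act_std)` pair instead of a single tensor. The sketch below shows one way a caller might consume that pair; the softplus squashing, the Gaussian sampling, and the clipping bounds are assumptions for illustration and are not taken from this diff.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal
from simple_model import ActorModel

obs_dim, act_dim = 10, 2                 # made-up example sizes
actor = ActorModel(obs_dim, act_dim, continuous_actions=True)

obs = torch.randn(1, obs_dim)
means, act_std = actor(obs)              # both shaped [1, act_dim]
std = F.softplus(act_std) + 1e-5         # assumption: keep the std strictly positive
action = Normal(means, std).sample()
action = action.clamp(-1.0, 1.0)         # assumption: clip to the env's action bounds
```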
44 changes: 27 additions & 17 deletions benchmark/torch/maddpg/train.py
@@ -1,4 +1,4 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -19,15 +19,15 @@
from simple_model import MAModel
from simple_agent import MAAgent
from parl.algorithms import MADDPG
from parl.env.multiagent_simple_env import MAenv
from parl.env.multiagent_env import MAenv
from parl.utils import logger, summary
from gym import spaces

CRITIC_LR = 0.01 # learning rate for the critic model
ACTOR_LR = 0.01 # learning rate of the actor model
GAMMA = 0.95 # reward discount factor
TAU = 0.01 # soft update
BATCH_SIZE = 1024
MAX_EPISODES = 25000 # stop condition:number of episodes
MAX_STEP_PER_EPISODE = 25 # maximum step per episode
STAT_RATE = 1000 # statistical interval of save model or count reward

@@ -79,36 +79,34 @@ def run_episode(env, agents):


def train_agent():
    env = MAenv(args.env)
    env = MAenv(args.env, args.continuous_actions)
    if args.continuous_actions:
        assert isinstance(env.action_space[0], spaces.Box)

    # print env info
    logger.info('agent num: {}'.format(env.n))
    logger.info('observation_space: {}'.format(env.observation_space))
    logger.info('action_space: {}'.format(env.action_space))
    logger.info('obs_shape_n: {}'.format(env.obs_shape_n))
    logger.info('act_shape_n: {}'.format(env.act_shape_n))
    logger.info('observation_space: {}'.format(env.observation_space))
    logger.info('action_space: {}'.format(env.action_space))

    for i in range(env.n):
        logger.info('agent {} obs_low:{} obs_high:{}'.format(
            i, env.observation_space[i].low, env.observation_space[i].high))
        logger.info('agent {} act_n:{}'.format(i, env.act_shape_n[i]))
        if ('low' in dir(env.action_space[i])):
        if (isinstance(env.action_space[i], spaces.Box)):
            logger.info('agent {} act_low:{} act_high:{} act_shape:{}'.format(
                i, env.action_space[i].low, env.action_space[i].high,
                env.action_space[i].shape))
            logger.info('num_discrete_space:{}'.format(
                env.action_space[i].num_discrete_space))

    from gym import spaces
    from multiagent.multi_discrete import MultiDiscrete
    for space in env.action_space:
        assert (isinstance(space, spaces.Discrete)
                or isinstance(space, MultiDiscrete))

    critic_in_dim = sum(env.obs_shape_n) + sum(env.act_shape_n)
    logger.info('critic_in_dim: {}'.format(critic_in_dim))

    # build agents
    agents = []
    for i in range(env.n):
        model = MAModel(env.obs_shape_n[i], env.act_shape_n[i], critic_in_dim)
        model = MAModel(env.obs_shape_n[i], env.act_shape_n[i], critic_in_dim,
                        args.continuous_actions)
        algorithm = MADDPG(
            model,
            agent_index=i,
@@ -142,7 +140,7 @@ def train_agent():

    t_start = time.time()
    logger.info('Starting...')
    while total_episodes <= MAX_EPISODES:
    while total_episodes <= args.max_episodes:
        # run an episode
        ep_reward, ep_agent_rewards, steps = run_episode(env, agents)
        summary.add_scalar('train_reward/episode', ep_reward, total_episodes)
@@ -208,8 +206,20 @@ def train_agent():
        type=str,
        default='./model',
        help='directory for saving model')
    parser.add_argument(
        '--continuous_actions',
        action='store_true',
        default=False,
        help='use continuous action mode or not')
    parser.add_argument(
        '--max_episodes',
        type=int,
        default=25000,
        help='the maximum number of episodes')
    parser.add_argument('--seed', type=int, default=0)

    args = parser.parse_args()
    print('========== args: ', args)
    logger.set_dir('./train_log/' + str(args.env))

    train_agent()
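
To tie the `train.py` changes together: MADDPG's centralized critic consumes every agent's observation and action, which is why `critic_in_dim` is the sum of all observation and action sizes. A hedged sketch with made-up shapes (the positional `MAenv` call mirrors the form used in the diff; the scenario name and numbers are illustrative only):

```python
from parl.env.multiagent_env import MAenv
from simple_model import MAModel

# Illustrative: a 3-agent scenario where each agent sees a 4-dim observation
# and emits a 2-dim continuous action would give
# critic_in_dim = (4 + 4 + 4) + (2 + 2 + 2) = 18.

env = MAenv('simple_spread', True)       # MAenv(env_name, continuous_actions)
critic_in_dim = sum(env.obs_shape_n) + sum(env.act_shape_n)

models = [
    MAModel(env.obs_shape_n[i], env.act_shape_n[i], critic_in_dim,
            continuous_actions=True)
    for i in range(env.n)
]
```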
Binary file removed examples/MADDPG/.benchmark/maddpg_paddle.png
Binary file not shown.
36 changes: 21 additions & 15 deletions examples/MADDPG/README.md
@@ -10,7 +10,7 @@ A simple multi-agent particle world based on gym. Please see [here](https://gith
Mean episode reward (every 1000 episodes) in training process (totally 25000 episodes).

<p align="center">
<img src=".benchmark/maddpg_paddle.png" alt="result"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/result.png" alt="result"/>
</p>

### Experiments result
@@ -19,37 +19,37 @@ Mean episode reward (every 1000 episodes) in training process (totally 25000 epi
<tr>
<td>
simple<br>
<img src="../../benchmark/fluid/MADDPG/.benchmark/MADDPG_simple.gif" width = "170" height = "170" alt="MADDPG_simple"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple.gif" width = "170" height = "170" alt="MADDPG_simple"/>
</td>
<td>
simple_adversary<br>
<img src="../../benchmark/fluid/MADDPG/.benchmark/MADDPG_simple_adversary.gif" width = "170" height = "170" alt="MADDPG_simple_adversary"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_adversary.gif" width = "170" height = "170" alt="MADDPG_simple_adversary"/>
</td>
<td>
simple_push<br>
<img src="../../benchmark/fluid/MADDPG/.benchmark/MADDPG_simple_push.gif" width = "170" height = "170" alt="MADDPG_simple_push"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_push.gif" width = "170" height = "170" alt="MADDPG_simple_push"/>
</td>
<td>
simple_reference<br>
<img src="../../benchmark/fluid/MADDPG/.benchmark/MADDPG_simple_reference.gif" width = "170" height = "170" alt="MADDPG_simple_reference"/>
simple_crypto<br>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_crypto.gif" width = "170" height = "170" alt="MADDPG_simple_crypto"/>
</td>
</tr>
<tr>
<td>
simple_speaker_listener<br>
<img src="../../benchmark/fluid/MADDPG/.benchmark/MADDPG_simple_speaker_listener.gif" width = "170" height = "170" alt="MADDPG_simple_speaker_listener"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_speaker_listener.gif" width = "170" height = "170" alt="MADDPG_simple_speaker_listener"/>
</td>
<td>
simple_spread<br>
<img src="../../benchmark/fluid/MADDPG/.benchmark/MADDPG_simple_spread.gif" width = "170" height = "170" alt="MADDPG_simple_spread"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_spread.gif" width = "170" height = "170" alt="MADDPG_simple_spread"/>
</td>
<td>
simple_tag<br>
<img src="../../benchmark/fluid/MADDPG/.benchmark/MADDPG_simple_tag.gif" width = "170" height = "170" alt="MADDPG_simple_tag"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_tag.gif" width = "170" height = "170" alt="MADDPG_simple_tag"/>
</td>
<td>
simple_world_comm<br>
<img src="../../benchmark/fluid/MADDPG/.benchmark/MADDPG_simple_world_comm.gif" width = "170" height = "170" alt="MADDPG_simple_world_comm"/>
<img src="https://github.com/benchmarking-rl/PARL-experiments/blob/master/MADDPG/paddle/.benchmark/MADDPG_simple_world_comm.gif" width = "170" height = "170" alt="MADDPG_simple_world_comm"/>
</td>
</tr>
</table>
@@ -58,17 +58,23 @@ simple_world_comm<br>
### Dependencies:
+ python3.5+
+ [paddlepaddle>=2.0.0](https://github.com/PaddlePaddle/Paddle)
+ [parl>=2.0.2](https://github.com/PaddlePaddle/PARL)
+ [multiagent-particle-envs](https://github.com/openai/multiagent-particle-envs)
+ gym==0.10.5
+ [parl>=2.0.4](https://github.com/PaddlePaddle/PARL)
+ PettingZoo==1.17.0
+ gym==0.23.1


### Start Training:
```
# To train an agent for simple_speaker_listener scenario
python train.py
# To train for other scenario, model is automatically saved every 1000 episodes
# python train.py --env [ENV_NAME]
python train.py --env [ENV_NAME]
# To show animation effects after training
# python train.py --env [ENV_NAME] --show --restore
python train.py --env [ENV_NAME] --show --restore
# To train and evaluate scenarios with continuous action spaces
python train.py --env [ENV_NAME] --continuous_actions
python train.py --env [ENV_NAME] --continuous_actions --show --restore
```
3 changes: 1 addition & 2 deletions examples/MADDPG/simple_agent.py
@@ -1,4 +1,4 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -16,7 +16,6 @@
import paddle
import numpy as np
from parl.utils import ReplayMemory
from parl.utils import machine_info, get_gpu_count


class MAAgent(parl.Agent):