MARLToolkit is a Multi-Agent Reinforcement Learning Toolkit based on PyTorch. It provides the MARL research community with a unified platform for developing and evaluating new ideas in various multi-agent environments. MARLToolkit has four core features:
- It collects most of the existing MARL algorithms widely acknowledged by the community and unifies them under one framework.
- It lets different multi-agent environments interact with agents through the same interface.
- It offers high efficiency in both the training and sampling processes.
- It provides trained results, including learning curves and pretrained models, for each task-algorithm combination, with fine-tuned hyperparameters to ensure credibility.
We collected most of the existing multi-agent environments and multi-agent reinforcement learning algorithms and unified them under one framework based on PyTorch to boost MARL research.
The implemented MARL baselines cover independent learning (IQL, A2C, DDPG, TRPO, PPO), centralized critic learning (COMA, MADDPG, MAPPO, HATRPO), and value decomposition (QMIX, VDN, FACMAC, VDA2C).
Popular environments like SMAC, MaMujoco, and Google Research Football are provided with a unified interface.
The algorithm code and environment code are fully separated. Changing the environment needs no modification on the algorithm side and vice versa.
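This separation rests on a single multi-agent interface that every environment wrapper implements. Below is a minimal sketch of such an interface; the class and method names (`MultiAgentEnv`, `reset`, `step`) are illustrative assumptions, not MARLToolkit's actual API.

```python
# Minimal sketch of a unified multi-agent environment interface.
# The names (MultiAgentEnv, reset, step) are illustrative assumptions,
# not MARLToolkit's actual classes.
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple


class MultiAgentEnv(ABC):
    """Every wrapped environment exposes dict-keyed observations, rewards,
    and dones, so the algorithm side never needs to know which concrete
    environment (SMAC, MaMujoco, GRF, ...) it is running on."""

    @abstractmethod
    def reset(self) -> Dict[str, Any]:
        """Return {agent_id: observation} for all agents."""

    @abstractmethod
    def step(
        self, actions: Dict[str, Any]
    ) -> Tuple[Dict[str, Any], Dict[str, float], Dict[str, bool], Dict[str, Any]]:
        """Take {agent_id: action}; return per-agent obs, rewards, dones, infos."""
```

With such a contract in place, swapping SMAC for MaMujoco only changes the wrapper that implements it, while training loops and algorithms stay untouched. The following table compares existing MARL benchmarks along these dimensions.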
Benchmark | Learning Mode | Available Env | Algorithm Type | Algorithm Number | Continuous Control | Asynchronous Interact | Distributed Training | Framework |
---|---|---|---|---|---|---|---|---|
PyMARL | CP | 1 | VD | 5 | | | | * |
PyMARL2 | CP | 1 | VD | 12 | | | | PyMARL |
off-policy | CP | 4 | IL+VD+CC | 4 | | | | off-policy |
on-policy | CP | 4 | IL+VD+CC | 1 | | | | on-policy |
MARL-Algorithms | CP | 1 | VD+Comm | 9 | | | | * |
EPyMARL | CP | 4 | IL+VD+CC | 10 | | | | PyMARL |
Marlbenchmark | CP+CL | 4 | VD+CC | 5 | ✔️ | | | pytorch-a2c-ppo-acktr-gail |
MAlib | SP | 8 | SP | 9 | ✔️ | | | * |
MARLlib | CP+CL+CM+MI | 10 | IL+VD+CC | 18 | ✔️ | ✔️ | ✔️ | Ray/RLlib |
CP, CL, CM, and MI represent cooperative, collaborative, competitive, and mixed task learning modes. IL, VD, and CC represent independent learning, value decomposition, and centralized critic categorization. SP represents self-play. Comm represents communication-based learning. An asterisk denotes that the benchmark uses its own framework.
Most of the popular environments in MARL research have been incorporated into this benchmark:
Env Name | Learning Mode | Observability | Action Space | Observation Space |
---|---|---|---|---|
LBF | Mixed | Both | Discrete | Discrete |
RWARE | Collaborative | Partial | Discrete | Discrete |
MPE | Mixed | Both | Both | Continuous |
SMAC | Cooperative | Partial | Discrete | Continuous |
MetaDrive | Collaborative | Partial | Continuous | Continuous |
MAgent | Mixed | Partial | Discrete | Discrete |
Pommerman | Mixed | Both | Discrete | Discrete |
MaMujoco | Cooperative | Partial | Continuous | Continuous |
GRF | Collaborative | Full | Discrete | Continuous |
Hanabi | Cooperative | Partial | Discrete | Discrete |
Each environment has a README file that serves as the instruction for the task, covering environment settings, installation, and important notes.
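Because every environment above implements the same reset/step contract, a single rollout loop works across all of them. The loop below is a minimal illustration against the interface sketched earlier, not MARLToolkit's actual sampler; `env` and `policy` are supplied by the caller.

```python
# Illustrative rollout loop against the unified interface sketched earlier;
# env and policy are passed in by the caller, so no MARLToolkit-specific
# classes are assumed here.
from typing import Any, Callable, Dict


def rollout(env, policy: Callable[[str, Any], Any], max_steps: int = 200) -> float:
    """Run one episode with an arbitrary per-agent policy and return the
    summed team reward."""
    obs: Dict[str, Any] = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        actions = {aid: policy(aid, ob) for aid, ob in obs.items()}
        obs, rewards, dones, _ = env.step(actions)
        total_reward += sum(rewards.values())
        if all(dones.values()):
            break
    return total_reward
```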
We provide three types of MARL algorithms as our baselines (a short sketch contrasting the independent and centralized critic setups follows the list):
- Independent Learning: IQL, DDPG, PG, A2C, TRPO, PPO
- Centralized Critic: COMA, MADDPG, MAA2C, MAPPO, MATRPO, HATRPO, HAPPO
- Value Decomposition: VDN, QMIX, FACMAC, VDAC, VDPPO
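The practical difference between the first two categories lies in what the critic conditions on during training. The snippet below is a minimal sketch, assuming simple MLP critics and illustrative dimensions, not MARLToolkit's actual networks (value decomposition is illustrated after the table below).

```python
import torch
import torch.nn as nn

# Illustrative only: an independent critic conditions on a single agent's
# local observation, while a centralized critic conditions on the global
# state (or all agents' observations concatenated) during training.
obs_dim, state_dim = 16, 48  # illustrative sizes

independent_critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
centralized_critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

obs = torch.randn(8, obs_dim)            # one agent's local observation
state = torch.randn(8, state_dim)        # global state shared by all agents
v_local = independent_critic(obs)        # independent A2C/TRPO/PPO-style baselines
v_central = centralized_critic(state)    # MAA2C/MAPPO-style centralized critics
```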
Here is a chart describing the characteristics of each algorithm:
Algorithm | Support Task Mode | Need Global State | Action | Learning Mode | Type |
---|---|---|---|---|---|
IQL | Mixed | No | Discrete | Independent Learning | Off Policy |
PG | Mixed | No | Both | Independent Learning | On Policy |
A2C | Mixed | No | Both | Independent Learning | On Policy |
DDPG | Mixed | No | Continuous | Independent Learning | Off Policy |
TRPO | Mixed | No | Both | Independent Learning | On Policy |
PPO | Mixed | No | Both | Independent Learning | On Policy |
COMA | Mixed | Yes | Both | Centralized Critic | On Policy |
MADDPG | Mixed | Yes | Continuous | Centralized Critic | Off Policy |
MAA2C | Mixed | Yes | Both | Centralized Critic | On Policy |
MATRPO | Mixed | Yes | Both | Centralized Critic | On Policy |
MAPPO | Mixed | Yes | Both | Centralized Critic | On Policy |
HATRPO | Cooperative | Yes | Both | Centralized Critic | On Policy |
HAPPO | Cooperative | Yes | Both | Centralized Critic | On Policy |
VDN | Cooperative | No | Discrete | Value Decomposition | Off Policy |
QMIX | Cooperative | Yes | Discrete | Value Decomposition | Off Policy |
FACMAC | Cooperative | Yes | Continuous | Value Decomposition | Off Policy |
VDAC | Cooperative | Yes | Both | Value Decomposition | On Policy |
VDPPO* | Cooperative | Yes | Both | Value Decomposition | On Policy |
IQL is the multi-agent version of Q-learning. MAA2C and MATRPO are the centralized critic versions of A2C and TRPO. VDPPO is the value decomposition version of PPO.
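To make the value decomposition category concrete, the sketch below shows the two mixing strategies referenced in the table: VDN sums per-agent Q-values, while QMIX mixes them through a network whose weights are generated from the global state and kept non-negative, so the joint Q-value is monotonic in each agent's Q-value. This is a simplified PyTorch illustration, not MARLToolkit's implementation.

```python
import torch
import torch.nn as nn

# Simplified illustration of the two value decomposition mixers named in
# the table above; not MARLToolkit's implementation.

def vdn_mix(agent_qs: torch.Tensor) -> torch.Tensor:
    """VDN: the joint Q is the plain sum of per-agent Qs (no global state needed)."""
    # agent_qs: (batch, n_agents)
    return agent_qs.sum(dim=1, keepdim=True)


class QMixer(nn.Module):
    """QMIX: per-agent Qs are mixed by a state-conditioned network whose
    weights are forced non-negative, keeping Q_tot monotonic in each Q_i."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks produce mixing weights/biases from the global state.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        bs = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2   # (batch, 1, 1)
        return q_tot.view(bs, 1)


# Usage: combine 3 agents' Qs under a 48-dim global state (illustrative sizes).
mixer = QMixer(n_agents=3, state_dim=48)
q_tot = mixer(torch.randn(8, 3), torch.randn(8, 48))  # -> (8, 1)
```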