MADDPG¶
Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is a multi-agent reinforcement learning algorithm for continuous action spaces:
The implementation is based on DDPG ✔️
MADDPG initializes n DDPG agents, one per agent in the environment (see the sketch below) ✔️
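A minimal sketch of that structure, where AgentDDPG stands in for the underlying DDPG learner (the class and attribute names here are illustrative, not ElegantRL's exact code):

import torch.nn as nn

class AgentDDPG:  # illustrative stand-in for the underlying DDPG learner
    def __init__(self, net_dim, state_dim, action_dim):
        # actor: state -> action in (-1, +1)
        self.act = nn.Sequential(nn.Linear(state_dim, net_dim), nn.ReLU(),
                                 nn.Linear(net_dim, action_dim), nn.Tanh())
        # critic: (state, action) -> Q-value
        self.cri = nn.Sequential(nn.Linear(state_dim + action_dim, net_dim),
                                 nn.ReLU(), nn.Linear(net_dim, 1))

class MADDPG:  # holds one DDPG learner per agent
    def __init__(self, n_agents, net_dim, state_dim, action_dim):
        self.n_agents = n_agents
        self.agents = [AgentDDPG(net_dim, state_dim, action_dim)
                       for _ in range(n_agents)]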
Code Snippet¶
def update_net(self, buffer, batch_size, repeat_times, soft_update_tau):
    buffer.update_now_len()
    self.batch_size = batch_size
    self.update_tau = soft_update_tau
    # sample one shared batch of joint transitions for all agents
    rewards, dones, actions, observations, next_obs = buffer.sample_batch(self.batch_size)
    # update each agent's actor and critic with the shared batch
    for index in range(self.n_agents):
        self.update_agent(rewards, dones, actions, observations, next_obs, index)
    # soft-update every agent's target networks
    for agent in self.agents:
        self.soft_update(agent.cri_target, agent.cri, self.update_tau)
        self.soft_update(agent.act_target, agent.act, self.update_tau)
    return
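For reference, the soft_update call above typically performs a Polyak average of the network parameters; a generic sketch, not ElegantRL's exact helper:

import torch

def soft_update(target_net, current_net, tau):
    # target <- tau * current + (1 - tau) * target, parameter by parameter
    with torch.no_grad():
        for tgt, cur in zip(target_net.parameters(), current_net.parameters()):
            tgt.data.copy_(tau * cur.data + (1.0 - tau) * tgt.data)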
Parameters¶
- class elegantrl.agents.AgentMADDPG.AgentMADDPG[source]¶
Bases:
AgentBase
Multi-Agent DDPG algorithm. “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments”. R. Lowe et al. 2017.
- Parameters
net_dim[int] – the dimension of networks (the width of neural networks)
state_dim[int] – the dimension of state (the length of the state vector)
action_dim[int] – the dimension of action (the length of the continuous action vector)
learning_rate[float] – learning rate of optimizer
gamma[float] – the discount factor of future rewards
n_agents[int] – number of agents
if_per_or_gae[bool] – use PER (off-policy) or GAE (on-policy) to handle sparse rewards
env_num[int] – the env number of VectorEnv. env_num == 1 means don’t use VectorEnv
agent_id[int] – if the visible_gpu is ‘1,9,3,4’, agent_id=1 means (1,9,3,4)[agent_id] == 9
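A hypothetical construction call using the parameters above (argument names mirror this list, but verify against the source, since the exact init signature may differ):

from elegantrl.agents.AgentMADDPG import AgentMADDPG

agent = AgentMADDPG()
# sketch only: values are illustrative defaults, not recommended settings
agent.init(net_dim=256, state_dim=16, action_dim=4,
           learning_rate=1e-4, gamma=0.99, n_agents=2)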
- explore_one_env(env, target_step) list [source]¶
Explore the environment for target_step steps.
- Parameters
env – the Environment instance to be explored.
target_step – target steps to explore.
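A hypothetical call, assuming env is an already-constructed environment instance:

trajectory = agent.explore_one_env(env, target_step=1024)  # sketch: collect 1024 steps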
- save_or_load_agent(cwd, if_save)[source]¶
save or load training files for Agent
- Parameters
cwd – Current Working Directory. ElegantRL save training files in CWD.
if_save – True: save files. False: load files.
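Usage follows directly from the two parameters; for example:

agent.save_or_load_agent(cwd='./MADDPG_demo', if_save=True)   # save networks to cwd
agent.save_or_load_agent(cwd='./MADDPG_demo', if_save=False)  # load networks from cwd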
- select_actions(states)[source]¶
Select continuous actions for exploration
- Parameters
states – states.shape == (n_agents, batch_size, state_dim)
- Returns
actions.shape == (n_agents, batch_size, action_dim), with -1 < action < +1
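To make the shapes concrete, a small standalone check with dummy arrays (independent of ElegantRL):

import numpy as np

n_agents, batch_size, state_dim, action_dim = 2, 4, 16, 4
states = np.random.randn(n_agents, batch_size, state_dim).astype(np.float32)
# any conforming output satisfies the documented shape and range
actions = np.tanh(np.random.randn(n_agents, batch_size, action_dim))
assert actions.shape == (n_agents, batch_size, action_dim)
assert (actions > -1.0).all() and (actions < 1.0).all()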
- update_agent(rewards, dones, actions, observations, next_obs, index)[source]¶
Update a single agent's neural networks; called by update_net.
- Parameters
rewards – reward list of the sampled buffer
dones – done list of the sampled buffer
actions – action list of the sampled buffer
observations – observation list of the sampled buffer
next_obs – next-observation list of the sampled buffer
index – ID of the agent
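Conceptually, each call implements the standard MADDPG update: agent index trains a centralized critic on the joint observations and actions of all agents, while its actor conditions only on its own observation. A simplified sketch of the critic's TD target for agent index, assuming agents expose act_target and cri_target networks as in the snippet above:

import torch

def critic_td_target(rewards, dones, next_obs, agents, index, gamma=0.99):
    # next_obs: (batch, n_agents, obs_dim); rewards, dones: (batch, n_agents)
    # joint next action from every agent's target actor (centralized critic input)
    next_actions = torch.cat([agents[i].act_target(next_obs[:, i])
                              for i in range(len(agents))], dim=1)
    next_obs_flat = next_obs.reshape(next_obs.shape[0], -1)
    next_q = agents[index].cri_target(torch.cat((next_obs_flat, next_actions), dim=1))
    # standard one-step TD target for agent `index`
    return rewards[:, index] + gamma * (1.0 - dones[:, index]) * next_q.squeeze(1)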
- update_net(buffer, batch_size, repeat_times, soft_update_tau)[source]¶
Update the neural networks by sampling batch data from the ReplayBuffer.
- Parameters
buffer – the ReplayBuffer instance that stores the trajectories.
batch_size – the size of batch data for Stochastic Gradient Descent (SGD).
repeat_times – the re-using times of each trajectory.
soft_update_tau – the soft update parameter.
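A hypothetical call site after experience collection, assuming buffer is a filled ReplayBuffer instance:

# sketch: one training update per collected rollout
agent.update_net(buffer, batch_size=256, repeat_times=1, soft_update_tau=2 ** -8)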