Structured Multi-Agent RL

MAGNet

Deep Multi-Agent Reinforcement Learning with Relevance Graphs (NIPS2018)

  1. Build the graph with agents and environment objects as nodes.

  2. The edge structure of the graph is fixed; the edge weights are learned by a graph generation network (GGN) (sketched below).
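A minimal sketch of that idea, assuming the weight of each (fixed) edge is predicted from the two endpoint node features by a small MLP; the class name, feature shapes, and sigmoid output are illustrative assumptions, not the exact GGN from the paper:

```python
import torch
import torch.nn as nn

class EdgeWeightGenerator(nn.Module):
    """Predicts a weight for each edge of a fixed topology from its endpoint node features."""
    def __init__(self, node_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * node_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, node_feats, edge_index):
        # node_feats: [N, node_dim]; edge_index: [E, 2] (sender, receiver) with fixed topology
        senders, receivers = edge_index[:, 0], edge_index[:, 1]
        pair = torch.cat([node_feats[senders], node_feats[receivers]], dim=-1)
        return torch.sigmoid(self.mlp(pair)).squeeze(-1)   # [E] edge weights in (0, 1)
```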

Mean-field MARL

Mean Field Multi-Agent Reinforcement Learning (ICML2018)

Mean Field Theory (Stanley, 1971): the interactions within the population of agents are approximated by the interaction between a single agent and the average effect of the overall (local) population.

Virtual Central Agent

Optimizing the virtual central agent approximates optimizing the neighboring agents.


However, the mean action eliminates the differences among agents and thus loses important information that could help cooperation.


Iterate between the two; the mean action and the policy eventually converge (sketched below).
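A hedged sketch of the core quantities (notation simplified; the exact equations in the paper differ in details such as how the next-step mean action is estimated): agent $j$'s Q-function takes the mean action of its neighbors as an extra input, the policy is a Boltzmann policy over that Q-function (with temperature parameter $\beta$), and the Q-function is updated with a mean-field value of the next state (learning rate $\eta$):

$$\bar a_j = \frac{1}{|\mathcal N(j)|} \sum_{k \in \mathcal N(j)} a_k, \qquad \pi_j(a_j \mid s, \bar a_j) \propto \exp\big(\beta\, Q_j(s, a_j, \bar a_j)\big)$$

$$Q_j(s, a_j, \bar a_j) \leftarrow (1-\eta)\, Q_j(s, a_j, \bar a_j) + \eta \Big[ r_j + \gamma \sum_{a_j'} \pi_j(a_j' \mid s', \bar a_j')\, Q_j(s', a_j', \bar a_j') \Big]$$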

RFM

RELATIONAL FORWARD MODELS FOR MULTI-AGENT LEARNING (ICLR2019)


The output of a GN is also a graph, with the same connectivity structure as the input graph (i.e., same number of vertices and edges, as well as same sender and receiver for each edge), but updated global, vertex, and edge attributes.
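A minimal sketch of one such GN pass under the generic "edge update → node update → global update" formulation, with MLP update functions and sum aggregation; this illustrates the quoted property (same topology, updated attributes), not the exact RFM architecture:

```python
import torch
import torch.nn as nn

class GNBlock(nn.Module):
    """One graph-network pass: returns a graph with the same topology but updated attributes."""
    def __init__(self, node_dim, edge_dim, global_dim, hidden=64):
        super().__init__()
        self.edge_fn = nn.Sequential(nn.Linear(edge_dim + 2 * node_dim + global_dim, hidden),
                                     nn.ReLU(), nn.Linear(hidden, edge_dim))
        self.node_fn = nn.Sequential(nn.Linear(node_dim + edge_dim + global_dim, hidden),
                                     nn.ReLU(), nn.Linear(hidden, node_dim))
        self.global_fn = nn.Sequential(nn.Linear(global_dim + node_dim + edge_dim, hidden),
                                       nn.ReLU(), nn.Linear(hidden, global_dim))

    def forward(self, nodes, edges, edge_index, u):
        # nodes: [N, node_dim]; edges: [E, edge_dim]; edge_index: [E, 2] (sender, receiver); u: [global_dim]
        s, r = edge_index[:, 0], edge_index[:, 1]
        # 1) Edge update: each edge sees its attribute, both endpoints, and the global attribute.
        edges = self.edge_fn(torch.cat([edges, nodes[s], nodes[r], u.expand(edges.size(0), -1)], dim=-1))
        # 2) Node update: sum-aggregate incoming edge messages per receiver node.
        agg = torch.zeros(nodes.size(0), edges.size(1), device=nodes.device).index_add(0, r, edges)
        nodes = self.node_fn(torch.cat([nodes, agg, u.expand(nodes.size(0), -1)], dim=-1))
        # 3) Global update from pooled node and edge attributes.
        u = self.global_fn(torch.cat([u, nodes.mean(0), edges.mean(0)], dim=-1))
        return nodes, edges, edge_index, u
```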

NeurComm

MULTI-AGENT REINFORCEMENT LEARNING FOR NETWORKED SYSTEM CONTROL (ICLR2020)

Communication mechanisms in MARL:

  1. non-communicative: MADDPG, COMA
  2. heuristic communication protocols or direct information sharing
  3. learnable communication protocols: DIAL, BiCNet
  4. attention-based communication to selectively send messages: ATOC, IC3Net

Spatio-Temporal RL


The major assumption in Definition 3.2 is that the Markovian property holds both temporally and spatially, so that the next local state depends on the neighborhood states and policies only.

This assumption is valid in most networked control systems such as traffic and wireless networks, as well as the power grid, where the impact of each agent is spread over the entire system via controlled flows, or chained local transitions.

The scaling factor in Eq. 2 scales down reward signals from agents farther away, which are harder to fit using only local information (and communicating with farther agents is also more expensive).

When $\alpha \to 0$, each agent performs local greedy control; when $\alpha \to 1$, each agent performs global coordination.

The immediate local reward of each agent is only affected by controls within its closed neighborhood.
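Paraphrasing the spatially discounted reward of Eq. 2 (the paper groups agents by neighborhood depth; here $d_{ij}$ denotes the graph distance between agents $i$ and $j$):

$$\tilde r_{i,t} = \sum_{j \in \mathcal V} \alpha^{d_{ij}}\, r_{j,t}$$

so $\alpha \to 0$ recovers the purely local reward $r_{i,t}$, and $\alpha \to 1$ recovers the unweighted global reward.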

Neural Communication

$h_t = f(h_{t-1}, \text{policy}, \text{state}, h_{\text{neighbor}})$

The inputs are the hidden state (belief), the policy, and the local state. The update can be run in multiple passes over the graph.
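A minimal sketch of this update, assuming a GRU cell whose input concatenates the local state, a policy fingerprint, and the summed hidden states of the neighbors; the names and the sum aggregation are assumptions, and running the loop several times gives the multi-pass variant:

```python
import torch
import torch.nn as nn

class NeurCommCell(nn.Module):
    """h_t = f(h_{t-1}, policy, state, h_neighbor) for every agent, one communication pass per loop."""
    def __init__(self, state_dim, policy_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(state_dim + policy_dim + hidden_dim, hidden_dim)

    def forward(self, h, states, policies, adj, n_passes: int = 1):
        # h: [N, hidden]; states: [N, state_dim]; policies: [N, policy_dim]; adj: [N, N] 0/1 adjacency
        for _ in range(n_passes):
            neighbor_msg = adj @ h                              # sum of neighbors' hidden states
            x = torch.cat([states, policies, neighbor_msg], dim=-1)
            h = self.gru(x, h)
        return h
```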

DGN

GRAPH CONVOLUTIONAL REINFORCEMENT LEARNING (ICLR2020)

  1. The graph is dynamic: agents move, enter, and leave the environment.

  2. For training stability, the adjacency matrix is kept unchanged across two consecutive steps.

  3. Relation kernel: multi-head attention.

  4. Temporal relation regularization: the KL divergence between the attention distributions at two consecutive timesteps (see the sketch below).
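A minimal sketch of that regularizer: the KL divergence between the attention distributions over neighbors at two consecutive timesteps (in the paper the next-step distribution comes from the target network; the KL direction shown here is one possible choice):

```python
import torch

def temporal_relation_loss(attn_t, attn_tp1, eps: float = 1e-8):
    """KL( attn_{t+1} || attn_t ), averaged over agents and attention heads.

    attn_t, attn_tp1: [n_agents, n_heads, n_neighbors] attention distributions
    (each row sums to 1 over the neighbor dimension).
    """
    p = attn_tp1.clamp_min(eps)
    q = attn_t.clamp_min(eps)
    kl = (p * (p.log() - q.log())).sum(dim=-1)   # KL per agent and per head
    return kl.mean()
```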

DCG

Deep Coordination Graphs (ICML2020)

  1. A higher-order value factorization can be expressed as an undirected coordination graph (written out below).

  2. Specialized behavior between agents can be represented by conditioning on the agent's role, or more generally on the agent's ID (a learned embedding of the participating agents' histories).

  3. Low-rank approximation of the payoff functions: roughly analogous to splitting the agents into K groups.
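Paraphrasing the coordination-graph factorization (item 1) and the rank-$K$ payoff approximation (item 3); the normalization by $|\mathcal V|$ and $|\mathcal E|$ and the exact conditioning follow my reading of the paper and may differ in detail:

$$Q(s, \mathbf a) \;=\; \frac{1}{|\mathcal V|}\sum_{i \in \mathcal V} f_i(a_i \mid s) \;+\; \frac{1}{|\mathcal E|}\sum_{(i,j) \in \mathcal E} f_{ij}(a_i, a_j \mid s)$$

$$f_{ij}(a_i, a_j \mid s) \;\approx\; \sum_{k=1}^{K} \hat f^{\,k}_{ij}(a_i \mid s)\, \bar f^{\,k}_{ij}(a_j \mid s)$$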

G2ANet

Multi-Agent Game Abstraction via Graph Attention Neural Network (AAAI2020)

  1. First build a fully connected graph; hard attention cuts edges and soft attention learns the weights of the remaining edges (the graph is rebuilt at every step; sketched below).

  2. Build a subgraph for each agent.
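A minimal sketch of the two-stage attention: a Gumbel-softmax "hard" gate decides whether each directed edge is kept, and a scaled dot-product "soft" attention then weights only the surviving edges. The per-pair linear encoder is a simplification (the paper encodes agent pairs with a BiLSTM), and all names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GameAbstractionAttention(nn.Module):
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.hard = nn.Linear(2 * feat_dim, 2)   # logits for keep / cut per directed edge
        self.q = nn.Linear(feat_dim, hidden)
        self.k = nn.Linear(feat_dim, hidden)

    def forward(self, h):
        # h: [N, feat_dim] agent embeddings; returns [N, N] edge weights of each agent's subgraph
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        # Hard attention: differentiable binary gate per directed edge (index 0 = keep).
        gate = F.gumbel_softmax(self.hard(pairs), tau=1.0, hard=True)[..., 0]    # [N, N]
        gate = gate * (1 - torch.eye(n, device=h.device))                        # no self-edges
        # Soft attention over the surviving edges only.
        scores = self.q(h) @ self.k(h).t() / self.k(h).size(-1) ** 0.5           # [N, N]
        scores = scores.masked_fill(gate == 0, float('-inf'))
        weights = torch.softmax(scores, dim=-1)
        return torch.nan_to_num(weights)   # rows with no kept edge become all zeros
```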

GC

MULTI-AGENT REINFORCEMENT LEARNING WITH GRAPH CLUSTERING

  1. Compute the adjacency matrix with KNN (sketched below).

  2. Group feature: a GRU outputs a mean and a variance, a group feature is sampled from the Gaussian and then fed into a GAT.

  3. Enforce separation between different clusters.

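A minimal sketch of the KNN adjacency construction, assuming Euclidean distance over some per-agent feature vector; the choice of metric and feature is an assumption:

```python
import torch

def knn_adjacency(feats: torch.Tensor, k: int) -> torch.Tensor:
    """Symmetric 0/1 adjacency connecting each agent to its k nearest neighbors."""
    dist = torch.cdist(feats, feats)                 # [N, N] pairwise distances
    dist.fill_diagonal_(float('inf'))                # exclude self-loops
    idx = dist.topk(k, largest=False).indices        # k nearest neighbors per agent
    adj = torch.zeros_like(dist)
    adj.scatter_(1, idx, 1.0)
    return ((adj + adj.t()) > 0).float()             # symmetrize
```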

ROMA

ROMA: Multi-Agent Reinforcement Learning with Emergent Roles (ICML2020)

  1. Agents with similar roles have both similar policies and similar responsibilities.

  2. Dynamic
  • local observation → (mean, std) → sampled role → hypernetwork → utility function (sketched below)

  3. Identifiable
  • Keeps the roles temporally stable.

  • Maximize the mutual information between the observation and the (trajectory, role).

  4. Specialized
  • Such a formulation makes sure that the dissimilarity d is high only when the mutual information I is low, so that the set of learned roles is compact but diverse.
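A minimal sketch of the pipeline in item 2 (local observation → Gaussian role → hypernetwork → utility head); layer sizes and names are arbitrary assumptions, and the mutual-information and diversity losses from items 3 and 4 are omitted:

```python
import torch
import torch.nn as nn

class RoleUtility(nn.Module):
    def __init__(self, obs_dim, role_dim, hidden, n_actions):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, 2 * role_dim)        # obs -> (mean, log_std)
        self.hyper = nn.Linear(role_dim, hidden * n_actions)   # role -> weights of the utility head
        self.obs_embed = nn.Linear(obs_dim, hidden)
        self.hidden, self.n_actions = hidden, n_actions

    def forward(self, obs):
        # obs: [B, obs_dim]
        mean, log_std = self.encoder(obs).chunk(2, dim=-1)
        role = mean + torch.randn_like(mean) * log_std.exp()           # reparameterized role sample
        w = self.hyper(role).view(-1, self.hidden, self.n_actions)     # per-sample utility weights
        q = torch.bmm(self.obs_embed(obs).unsqueeze(1), w).squeeze(1)  # [B, n_actions] utilities
        return q, role
```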

RODE

RODE: LEARNING ROLES TO DECOMPOSE MULTI-AGENT TASKS (ICLR2021)

  1. First cluster the action representation vectors with k-means (similar in spirit to parameter sharing; sketched below).

  2. Each role vector is the mean pooling of the action representations in its cluster.

  3. Assign a role by the cosine similarity between the role vector and the agent's history.

  4. Once a role is selected, it is kept for c timesteps.

  5. Role selector: TD loss.

  6. Role policy: TD loss.

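A minimal sketch of steps 1–3 as described above: k-means over action representation vectors, mean pooling per cluster to get role vectors, and role assignment by cosine similarity with the agent's history embedding; array names and shapes are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_roles(action_reprs: np.ndarray, n_roles: int) -> np.ndarray:
    """action_reprs: [n_actions, d] learned action representations -> [n_roles, d] role vectors."""
    labels = KMeans(n_clusters=n_roles, n_init=10).fit_predict(action_reprs)
    return np.stack([action_reprs[labels == r].mean(axis=0) for r in range(n_roles)])

def assign_role(history_embed: np.ndarray, role_vectors: np.ndarray) -> int:
    """Pick the role whose vector has the highest cosine similarity with the history embedding."""
    sims = (role_vectors @ history_embed) / (
        np.linalg.norm(role_vectors, axis=1) * np.linalg.norm(history_embed) + 1e-8)
    return int(sims.argmax())
```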

ROCHICO

Structured Diversification Emergence via Reinforced Organization Control and Hierarchical Consensus Learning (AAMAS2021)

  1. Organization control module
  • Each agent connects to at most m nearest-neighbor nodes.

  • The graph edit distance between the subgraphs of two consecutive steps must not be too large.

  2. Hierarchical consensus module
  • DeepSets fuses the representations of the agents within a subgraph (sketched below).

  • Find weakly connected components; contrastive learning ensures diversity across them.

  • A team is a fully connected subgraph.

  3. Decision module
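A minimal sketch of the DeepSets fusion used in the hierarchical consensus module: a shared per-agent encoder, a permutation-invariant sum, and a post-aggregation network; layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class DeepSetFusion(nn.Module):
    """Permutation-invariant fusion of the agent representations inside one subgraph."""
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, agent_feats):
        # agent_feats: [n_agents_in_subgraph, in_dim] -> one consensus vector for the subgraph
        return self.rho(self.phi(agent_feats).sum(dim=0))
```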

Summary (common themes):

  1. Keep the structure temporally stable
  • Keep the graph/role fixed for a period of time.

  • The KL divergence or graph edit distance between consecutive steps must not be too large.

  • Maximize the mutual information between the observation and the role/trajectory.

  2. Agents within a group should be as similar as possible
  • Low-rank approximation

  • Mean-field, DeepSets

  • Parameter sharing

  3. Different groups should be diverse
  • The KL divergence between groups should not be too small.

  • The mutual information between groups should be small.

  • Adversarial training