state and is usually omitted). There are studies [40, 41] where a neural network is adopted to model the conditional mixture term². However, a more straightforward and convenient implementation is the opponent-sampling-based Monte Carlo method. Denote by θ the parameter of an agent’s policy. Construct a pool M = {θ_1, θ_2, ...}. On each episode beginning during training, an opponent, denoted by its parameter φ, is selected by sampling from the pool, φ ∼ Q(M). Various sampling distributions Q have been reported in the literature, including uniform [4], a probabilistic mixture of the current and the historical opponents [5], probabilistic Elo score matching [7], a function of win-rate [8], etc. The opponent sampling can be viewed as a stochastic approximation of the opponent policy mixture.
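As a purely illustrative sketch of such a pool and of two possible sampling rules Q (the class and method names below are our own and not part of any particular framework):

import copy
import random

class OpponentPool:
    """A pool M = {theta_1, theta_2, ...} of historical policy parameters."""

    def __init__(self, seed_param):
        # The pool starts from a single "seed" parameter theta_1.
        self.params = [copy.deepcopy(seed_param)]

    def add(self, param):
        # Pool update: M <- M ∪ {theta}.
        self.params.append(copy.deepcopy(param))

    def sample_uniform(self):
        # Q = uniform over the whole pool, in the spirit of [4].
        return random.choice(self.params)

    def sample_mixture(self, p_latest=0.5):
        # Q = probabilistic mixture of the current (latest) parameter and a
        # uniformly drawn historical one, in the spirit of [5].
        if random.random() < p_latest:
            return self.params[-1]
        return random.choice(self.params)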
Once an opponent φ is selected, the parameter φ is kept fixed and the learning agent tries to maximize the return by updating its own parameter θ. In the Game Theory community, this procedure is referred to as Best Response, which, from a Machine Learning point of view, is simply an RL problem [9]. Note that the fixed φ leads to a stationary opponent policy π_φ, which is then absorbed into the environment, so the dynamics remain stationary for the learning agent.
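To make the stationarity argument concrete, the sketch below absorbs a frozen opponent policy into the environment so that the learning agent interacts with an ordinary single-agent interface; the two-player structure and the act/reset/step names are simplifying assumptions for illustration only.

class FixedOpponentEnv:
    """Wraps a two-agent environment so that a frozen opponent pi_phi is
    stepped internally; the learner then faces a stationary single-agent MDP."""

    def __init__(self, multi_agent_env, opponent_policy):
        self.env = multi_agent_env
        self.opponent = opponent_policy  # parameter phi is fixed, never updated

    def reset(self):
        obs_agent, self._obs_opp = self.env.reset()
        return obs_agent

    def step(self, action_agent):
        # The opponent acts with its frozen policy; from the learner's view,
        # its behavior is just part of the (stationary) environment dynamics.
        action_opp = self.opponent.act(self._obs_opp)
        (obs_agent, self._obs_opp), (rew_agent, _), done, info = \
            self.env.step((action_agent, action_opp))
        return obs_agent, rew_agent, done, info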
To this extent, one can employ any favorite RL algorithm (e.g., PPO [15], V-trace [13], etc.) as the “proxy algorithm”. Modern RL is able to learn from trajectory segments, defined as tuples of observation-reward-action over contiguous time steps:

τ = (o_t, r_t, a_t, o_{t+1}, r_{t+1}, a_{t+1}, ..., o_{t+L}, r_{t+L}, a_{t+L})    (1)

where L is the segment length and we’ve omitted the superscript for the learning agent. This permits a mini-batch style SGD for RL, which is more compatible with the Deep Learning paradigm.
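A minimal sketch of collecting one length-L segment of Eq. (1), assuming the single-agent interface of the wrapper above, could read:

def collect_segment(env, agent, segment_len):
    """Roll out one segment tau = (o_t, r_t, a_t, ..., o_{t+L}, r_{t+L}, a_{t+L})."""
    segment = []
    obs, reward = env.reset(), 0.0
    for _ in range(segment_len + 1):
        action = agent.act(obs)
        segment.append((obs, reward, action))
        obs, reward, done, _ = env.step(action)
        if done:
            break
    return segment  # consumed by the RL learner as one mini-batch element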
Every once in a while, the pool is updated by M ← M ∪ {θ}. This way, the learning agent still plays stochastically against a mixture of historical opponents. The initial size of the pool is one, i.e., M = {θ_1}, where the “seed” policy parameter θ_1 can be either randomly initialized or learned from Imitation Learning.
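Putting the pieces together, a minimal (illustrative) training loop interleaves opponent sampling, best-response learning, and the periodic pool update; learner, make_policy, and the update interval are hypothetical names introduced here:

def self_play_training(pool, learner, make_policy, n_episodes, pool_update_every):
    for episode in range(n_episodes):
        phi = pool.sample_uniform()      # phi ~ Q(M), drawn at episode beginning
        opponent = make_policy(phi)      # frozen opponent policy pi_phi
        learner.run_episode(opponent)    # best response: only theta is updated
        if (episode + 1) % pool_update_every == 0:
            pool.add(learner.theta)      # pool update: M <- M ∪ {theta}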
Finally, we note that FSP is easy to extend to multiple opponents (≥ 2). For example, perform the sampling φ ∼ Q(M) for each of the opponents, respectively, on each episode beginning, as in [7].
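Reusing the hypothetical pool and make_policy from the sketches above, the multi-opponent case simply repeats the draw per opponent slot:

# One independent draw phi ~ Q(M) per opponent slot at each episode beginning.
opponents = [make_policy(pool.sample_uniform()) for _ in range(num_opponents)]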
3.2. Design
To implement the CSP-MARL algorithm described in Section 3.1 and allow it to be
scalable, we adopt a modular design for our distributed training framework. Fig. 1 gives an
overview. In the following, we describe each of the modules and explain how they correspond
to CSP-MARL.
Actor. The Actor module produces the trajectory for the learning agent. It embeds two secondary modules, Env (environment) and Agt (agent). We require Env to be OpenAI gym [49] compatible for the Multi-Agent case, that is, it should implement the two methods:
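In the gym convention these are presumably reset and step; the dummy sketch below, including its tuple-per-agent signature, is only our assumption of what a Multi-Agent-compatible Env could look like:

import gym
import numpy as np

class TwoAgentEnv(gym.Env):
    """A dummy gym-compatible two-agent environment (illustrative only)."""

    def reset(self):
        self._t = 0
        # One observation per agent.
        return np.zeros(4), np.zeros(4)

    def step(self, joint_action):
        # joint_action holds one action per agent; per-agent observations and
        # rewards come back, together with a shared done flag and an info dict.
        self._t += 1
        obs = (np.zeros(4), np.zeros(4))
        rewards = (0.0, 0.0)
        done = self._t >= 10
        return obs, rewards, done, {}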
² In a more general setting, it is possible to perform no-regret learning for NE finding [42, 43, 44, 45, 46, 47]. The corresponding discussion is beyond the scope of this manuscript.