on Windows and Mac OS, but we also provide a limited headless build that runs on Linux, intended especially
for machine learning and distributed use cases.
Using this API we built PySC2 (https://github.com/deepmind/pysc2), an open source environment that is optimised for RL agents. PySC2
is a Python environment that wraps the StarCraft II API to ease the interaction between Python reinforcement
learning agents and StarCraft II. PySC2 defines an action and observation specification,
and includes a random agent and a handful of scripted agents as examples. It also includes some
mini-games as challenges and visualisation tools to understand what the agent can see and do.
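To make this concrete, the sketch below shows roughly how an environment and the bundled random agent can be wired together with PySC2's run loop. It is a minimal sketch only: the map name, player specification, and interface-format arguments follow recent PySC2 releases and should be treated as assumptions, since the exact constructor arguments differ between versions.

    # Minimal sketch: running the bundled random agent against a built-in bot.
    # Constructor arguments follow recent PySC2 releases and may differ
    # between versions; treat the exact names as assumptions.
    from absl import app

    from pysc2.agents import random_agent
    from pysc2.env import run_loop, sc2_env
    from pysc2.lib import features


    def main(unused_argv):
        agent = random_agent.RandomAgent()
        with sc2_env.SC2Env(
            map_name="AbyssalReef",  # any ladder or mini-game map
            players=[sc2_env.Agent(sc2_env.Race.terran),
                     sc2_env.Bot(sc2_env.Race.zerg, sc2_env.Difficulty.easy)],
            agent_interface_format=features.AgentInterfaceFormat(
                feature_dimensions=features.Dimensions(screen=84, minimap=64)),
            step_mul=8,        # the agent acts once every 8 game steps
            visualize=True,    # open the bundled visualisation tools
        ) as env:
            run_loop.run_loop([agent], env, max_episodes=1)


    if __name__ == "__main__":
        app.run(main)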
StarCraft II updates the simulation 16 (at “normal speed”) or 22.4 (at “fast speed”) times per second.
The game is mostly deterministic, but it does have some randomness mainly for cosmetic reasons;
the two main random elements are weapon speed and update order. These sources of randomness
can be removed/mitigated by setting a random seed.
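To give a sense of how these rates translate into agent behaviour, the short snippet below computes the agent's decision frequency for a given step multiplier and indicates where a fixed seed would be supplied; the step_mul and random_seed keyword argument names are taken from recent PySC2 releases and are assumptions here.

    # At "normal speed" the game simulates 16 steps per second, so an agent
    # that acts every `step_mul` game steps acts 16 / step_mul times per
    # second of game time (22.4 / step_mul at "fast speed").
    GAME_STEPS_PER_SECOND = 16.0   # 22.4 at "fast speed"
    STEP_MUL = 8                   # game steps per agent action

    actions_per_second = GAME_STEPS_PER_SECOND / STEP_MUL
    print(f"agent acts {actions_per_second:.1f} times per game second")  # 2.0

    # A fixed seed removes/mitigates the cosmetic randomness described above
    # (keyword names assumed from recent PySC2 releases):
    # env = sc2_env.SC2Env(..., step_mul=STEP_MUL, random_seed=42)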
We now describe the environment which was used for all of the experiments in this paper.
3.1 Full Game Description and Reward Structure
In the full 1v1 game of StarCraft II, two opponents spawn on a map which contains resources and
other elements such as ramps, bottlenecks, and islands. To win a game, a player must: 1. accumulate
resources (minerals and vespene gas), 2. construct production buildings, 3. amass an army, and 4.
eliminate all of the opponent’s buildings. A game typically lasts from a few minutes to one hour,
and early actions taken in the game (e.g., which buildings and units are built) have long-term
consequences. Players have imperfect information since they can only see the portion of the map where
they have units. If they want to understand and react to their opponents’ strategy they must send units
to scout. As we describe later in this section, the action space is also unusual and challenging.
Most people play online against other human players. The most common games are 1v1, but team
games are possible too (2v2, 3v3 or 4v4), as are more complicated games with unbalanced teams
or more than two teams. Here we focus on the 1v1 format, the most popular form of competitive
StarCraft, but may consider more complicated situations in the future.
StarCraft II includes a built-in AI which is based on a set of handcrafted rules and comes with
10 levels of difficulty (the three strongest of which cheat by getting extra resources or privileged
vision). Unfortunately, because these built-in bots are scripted, their strategies are fairly narrow and
easily exploitable, so human players tend to lose interest in them fairly quickly.
Nevertheless, they are a reasonable first challenge for a purely learned approach like the baselines
we investigate in sections 4 and 5; they play far better than random, play very quickly with little
compute, and offer consistent baselines to compare against.
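For completeness, the built-in bots are selected through the environment's player specification, as sketched below; the Race and Difficulty enum names follow recent PySC2 releases and are assumptions here, with the three cheating levels appearing as the cheat_* entries.

    # Sketch: pitting a learning agent against a built-in bot.
    # Enum names are taken from recent PySC2 releases (an assumption); the
    # ten difficulty levels range from Difficulty.very_easy up to the three
    # cheating levels (cheat_vision, cheat_money, cheat_insane).
    from pysc2.env import sc2_env

    players = [
        sc2_env.Agent(sc2_env.Race.protoss),
        sc2_env.Bot(sc2_env.Race.terran, sc2_env.Difficulty.very_hard),
    ]
    # env = sc2_env.SC2Env(map_name="AbyssalReef", players=players, ...)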
We define two different reward structures: ternary 1 (win) / 0 (tie) / −1 (loss) received at the end
of a game (with all-zero rewards during the game), and Blizzard score. The ternary win/tie/loss
score is the real reward that we care about. The Blizzard score is the score seen by players on the
victory screen at the end of the game. While players can only see this score at the end of the game, we
provide access to the running Blizzard score at every step during the game so that the change in score
can be used as a reward for reinforcement learning. It is computed as the sum of current resources
and upgrades researched, as well as units and buildings currently alive and being built. This means
that the player's cumulative reward increases as more resources are mined and decreases when units
or buildings are lost, while all other actions (training units, constructing buildings, and researching
upgrades) leave it unchanged, since the resources they consume are offset by the value of what they
produce. The Blizzard score is not zero-sum since it is player-centric; it is far less sparse than the
ternary reward signal, and it correlates to some extent with winning or losing.
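One simple way to use the running Blizzard score for reinforcement learning is to difference it between successive steps, as in the sketch below. The score_cumulative observation field (whose first entry is the overall score) and the score_index environment argument are taken from recent PySC2 releases and should be read as assumptions.

    # Sketch: turning the running Blizzard score into a dense per-step reward.
    # Assumes the `score_cumulative` observation field from recent PySC2
    # releases, whose first entry holds the overall Blizzard score.
    def score_delta_reward(previous_score, timestep):
        """Return (reward, new_score), where reward is the change in score."""
        current_score = timestep.observation["score_cumulative"][0]
        return current_score - previous_score, current_score

    # Inside the interaction loop:
    #   reward, last_score = score_delta_reward(last_score, timestep)
    # Alternatively, recent SC2Env versions expose `score_index` and
    # `score_multiplier` arguments that make the environment emit this shaped
    # reward directly; score_index=-1 selects the ternary win/tie/loss reward.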
3.2 Observations
StarCraft II uses a game engine which renders graphics in 3D. Although the StarCraft II API drives this
full underlying engine, which simulates the whole environment, it does not currently render RGB
pixels. Instead, it generates a set of "feature layers", which abstract away from the RGB images seen
during human play while maintaining the core spatial and graphical concepts of StarCraft II (see
Figure 2).
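The sketch below indicates roughly how these feature layers surface through PySC2; the feature_screen/feature_minimap field names and the features.SCREEN_FEATURES listing follow recent PySC2 releases and are assumptions relative to the exact version described here.

    # Sketch: inspecting the feature layers exposed by PySC2.
    # Field names follow recent releases; older versions exposed the stacked
    # layers under the keys "screen" and "minimap" instead.
    from pysc2.lib import features

    # Each entry describes one layer (e.g. unit_type, player_relative,
    # height_map) and whether it is categorical or scalar.
    for layer in features.SCREEN_FEATURES:
        print(layer.name, layer.type)

    # Inside the interaction loop, the stacked layers arrive as integer arrays:
    #   screen = timestep.observation["feature_screen"]    # [num_layers, H, W]
    #   minimap = timestep.observation["feature_minimap"]  # [num_layers, H, W]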