2 Related Work
Our work belongs to system-level AI development for strategy video game playing, so we mainly
discuss representative works along this line, covering RTS and MOBA games.
General RTS games
StarCraft has been used as the testbed for Game AI research in RTS for many years. Methods adopted by existing studies include rule-based, supervised learning, reinforcement learning, and their combinations [23, 34]. For rule-based methods, a representative is SAIDA, the champion of the StarCraft AI Competition 2018 (see https://github.com/TeamSAIDA/SAIDA).
For learning-based methods, AlphaStar recently combined supervised learning and multi-agent reinforcement learning and achieved grandmaster level in playing StarCraft 2 [33]. Our value estimation (Section 3.2) is similar to AlphaStar's in that it uses the invisible opponent's information.
MOBA games
Recently, a macro strategy model, named Tencent HMS, was proposed for MOBA Game AI [36]. Specifically, HMS is a functional component for guiding where to go on the map during the game, without considering the action execution of agents, i.e., micro control or micro-management in esports, and is thus not a complete AI solution. The most relevant works are Tencent Solo [37] and OpenAI Five [2]. Ye et al. [37] performed a thorough and systematic study of the playing mechanics of different MOBA heroes and developed an RL system that masters micro control of agents in MOBA combats. However, only 1v1 games were studied, leaving out the much more sophisticated multi-agent 5v5 games. On the other hand, the similarities between this work and Ye et al. [37] include the modeling of action heads (the value heads are different) and off-policy correction (adaptation). In 2019, OpenAI introduced an AI for playing 5v5 games in Dota 2, called OpenAI Five,
with the ability to defeat professional human players [2]. OpenAI Five is based on deep reinforcement learning via self-play and is trained with Proximal Policy Optimization (PPO) [28]. The major difference between our work and OpenAI Five is that the goal of this paper is to develop AI programs towards playing full MOBA games. Hence, methodologically, we introduce a set of techniques, namely off-policy adaptation, curriculum self-play learning, value estimation, and tree-search, that address the scalability issue in training and playing with a large pool of heroes. On the other hand, the similarities between this work and OpenAI Five include the design of the action space for modeling a MOBA hero's actions, the use of a recurrent neural network (LSTM) for handling partial observability, and the use of one model with shared weights to control all heroes.
3 Learning System
To address the complexity of MOBA game-playing, we use a combination of novel and existing
learning techniques for neural network architecture, distributed system, reinforcement learning,
multi-agent training, curriculum learning, and Monte-Carlo tree search. Although we use Honor of
Kings as a case study, these proposed techniques are also applicable to other MOBA games, as the
playing mechanics across MOBA games are similar.
3.1 Architecture
MOBA can be considered as a multi-agent Markov game with partial observations. Central to our AI is a policy $\pi_\theta(a_t \mid s_t)$ represented by a deep neural network with parameters $\theta$. It receives previous observations and actions $s_t = o_{1:t}, a_{1:t-1}$ from the game as inputs, and selects actions $a_t$ as outputs. Internally, observations $o_t$ are encoded via convolutions and fully-connected layers, then combined as vector representations, processed by a deep sequential network, and finally mapped to a probability distribution over actions. The overall architecture is shown in Fig. 1.
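To make this data flow concrete, the following is a minimal PyTorch-style sketch of such an observation-to-action pipeline. It is our own illustration rather than the actual network: the module name, feature shapes, and layer sizes (e.g., PolicySketch, spatial_channels, hidden_dim, num_actions) are assumptions chosen only to show the flow of Fig. 1.

```python
import torch
import torch.nn as nn

class PolicySketch(nn.Module):
    """Illustrative observation-to-action pipeline: conv + FC encoders -> LSTM -> action distribution."""

    def __init__(self, spatial_channels=8, scalar_dim=64, hidden_dim=256, num_actions=12):
        super().__init__()
        # Spatial features (channels from the hero's local-view map) -> convolutions.
        self.conv = nn.Sequential(
            nn.Conv2d(spatial_channels, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((3, 3)), nn.Flatten(),        # -> [B, 32 * 3 * 3]
        )
        # Scalar features (unit attributes, in-game statistics) -> fully-connected layers.
        self.fc_scalar = nn.Sequential(nn.Linear(scalar_dim, 128), nn.ReLU())
        # Combined vector representation -> deep sequential network (LSTM).
        self.lstm = nn.LSTM(input_size=32 * 3 * 3 + 128, hidden_size=hidden_dim, batch_first=True)
        # Final mapping to a probability distribution over actions.
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, spatial, scalar, hidden=None):
        # spatial: [B, C, H, W]; scalar: [B, scalar_dim]; one game step per call.
        x = torch.cat([self.conv(spatial), self.fc_scalar(scalar)], dim=-1)
        out, hidden = self.lstm(x.unsqueeze(1), hidden)  # carry the LSTM state across steps
        logits = self.action_head(out.squeeze(1))
        return torch.softmax(logits, dim=-1), hidden
```

The full architecture in Fig. 1 contains more encoders and multiple action heads; the sketch only reflects the overall flow described above.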
The architecture consists of general-purpose network components that model the raw complexity of MOBA games. To provide informative observations to agents, we develop multi-modal features, consisting of a comprehensive list of both scalar and spatial features. Scalar features are made up of observable units' attributes, in-game statistics, and invisible opponent information, e.g., health points (hp), skill cooldowns, gold, level, etc. Spatial features consist of convolution channels extracted from the hero's local-view map. To handle partial observability, we resort to an LSTM [14] to maintain memories between steps. To help target selection, we use target attention [37, 2], which treats the encodings after the LSTM as the query, and the stack of game unit encodings as attention keys.
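As a rough illustration of this attention step (our own sketch; the tensor shapes and the plain dot-product scoring are assumptions, not the exact formulation used in the system), the post-LSTM encoding serves as the query against the stacked per-unit key encodings, and a softmax over the scores gives a distribution over candidate target units:

```python
import torch

def target_attention(lstm_out, unit_encodings):
    """lstm_out: [B, D] query; unit_encodings: [B, N, D] keys for N candidate game units."""
    # Dot-product score between the query and every unit key: [B, N].
    scores = torch.einsum('bd,bnd->bn', lstm_out, unit_encodings)
    # Probability of selecting each game unit as the skill/attack target.
    return torch.softmax(scores, dim=-1)
```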
To eliminate unnecessary RL explorations, we design an action mask, similar to [37].
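A minimal sketch of such masking follows (our own illustration; which actions count as invalid in a given state is game-specific and assumed here): logits of illegal actions are set to negative infinity before the softmax, so they receive zero probability and are never sampled during exploration.

```python
import torch

def masked_action_probs(logits, valid_mask):
    """logits: [B, A] raw action logits; valid_mask: [B, A] boolean, True where the action is legal."""
    # Illegal actions (e.g., a skill still on cooldown) are pushed to -inf before the softmax.
    logits = logits.masked_fill(~valid_mask, float('-inf'))
    return torch.softmax(logits, dim=-1)
```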
To manage the combinatorial