state and is usually omitted). There are studies [40, 41] where a neural network is adopted to model the conditional mixture term². However, a more straightforward and convenient implementation is the opponent-sampling-based Monte Carlo method. Denote by θ the parameter of an agent’s policy. Construct a pool M = {θ_1, θ_2, ...}. On each episode beginning during training, an opponent, denoted by its parameter φ, is selected by sampling from the pool, φ ∼ Q(M). Various sampling distributions Q have been reported in the literature, including uniform [4], a probabilistic mixture of the current and the historical opponents [5], probabilistic Elo score matching [7], a function of win-rate [8], etc. The opponent sampling can be viewed as a stochastic approximation of the opponent policy mixture.
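As a purely illustrative sketch of such a pool and of two possible sampling rules Q (the class and method names below are our own and not part of any particular framework):

import copy
import random

class OpponentPool:
    """A pool M = {theta_1, theta_2, ...} of historical policy parameters."""

    def __init__(self, seed_param):
        # The pool starts from a single "seed" parameter theta_1.
        self.params = [copy.deepcopy(seed_param)]

    def add(self, param):
        # Pool update: M <- M ∪ {theta}.
        self.params.append(copy.deepcopy(param))

    def sample_uniform(self):
        # Q = uniform over the whole pool, in the spirit of [4].
        return random.choice(self.params)

    def sample_mixture(self, p_latest=0.5):
        # Q = probabilistic mixture of the current (latest) parameter and a
        # uniformly drawn historical one, in the spirit of [5].
        if random.random() < p_latest:
            return self.params[-1]
        return random.choice(self.params)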
Once an opponent φ is selected, the parameter φ is kept fixed and the learning agent tries to maximize the return by updating its own parameter θ. In the Game Theory community, this procedure is referred to as Best Response, which, from a Machine Learning point of view, is simply an RL problem [9]. Note that the fixed φ leads to a stationary opponent policy π_φ, which is then absorbed into the environment, so the dynamics remain stationary for the learning agent.
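To make the stationarity argument concrete, the sketch below absorbs a frozen opponent policy into the environment so that the learning agent interacts with an ordinary single-agent interface; the two-player structure and the act/reset/step names are simplifying assumptions for illustration only.

class FixedOpponentEnv:
    """Wraps a two-agent environment so that a frozen opponent pi_phi is
    stepped internally; the learner then faces a stationary single-agent MDP."""

    def __init__(self, multi_agent_env, opponent_policy):
        self.env = multi_agent_env
        self.opponent = opponent_policy  # parameter phi is fixed, never updated

    def reset(self):
        obs_agent, self._obs_opp = self.env.reset()
        return obs_agent

    def step(self, action_agent):
        # The opponent acts with its frozen policy; from the learner's view,
        # its behavior is just part of the (stationary) environment dynamics.
        action_opp = self.opponent.act(self._obs_opp)
        (obs_agent, self._obs_opp), (rew_agent, _), done, info = \
            self.env.step((action_agent, action_opp))
        return obs_agent, rew_agent, done, info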
To this extent, one can employ any favorite RL algorithm (e.g., PPO [15], V-trace [13], etc.) as the “proxy algorithm”. Modern RL is able to learn from trajectory segments, defined as tuples of observation-reward-action over contiguous time steps:

τ = (o_t, r_t, a_t, o_{t+1}, r_{t+1}, a_{t+1}, ..., o_{t+L}, r_{t+L}, a_{t+L})    (1)

where L is the segment length and we’ve omitted the superscript for the learning agent. This permits a mini-batch style SGD for RL, which is more compatible with the Deep Learning paradigm.
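A minimal sketch of collecting one length-L segment of Eq. (1), assuming the single-agent interface of the wrapper above, could read:

def collect_segment(env, agent, segment_len):
    """Roll out one segment tau = (o_t, r_t, a_t, ..., o_{t+L}, r_{t+L}, a_{t+L})."""
    segment = []
    obs, reward = env.reset(), 0.0
    for _ in range(segment_len + 1):
        action = agent.act(obs)
        segment.append((obs, reward, action))
        obs, reward, done, _ = env.step(action)
        if done:
            break
    return segment  # consumed by the RL learner as one mini-batch element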
Every once in a while, the pool is updated by M ← M ∪ {θ}. This way, the learning agent still plays stochastically against a mixture of historical opponents. The initial size of the pool is one, i.e., M = {θ_1}, where the “seed” policy parameter θ_1 can be either randomly initialized or learned from Imitation Learning.
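Putting the pieces together, a minimal (illustrative) training loop interleaves opponent sampling, best-response learning, and the periodic pool update; learner, make_policy, and the update interval are hypothetical names introduced here:

def self_play_training(pool, learner, make_policy, n_episodes, pool_update_every):
    for episode in range(n_episodes):
        phi = pool.sample_uniform()      # phi ~ Q(M), drawn at episode beginning
        opponent = make_policy(phi)      # frozen opponent policy pi_phi
        learner.run_episode(opponent)    # best response: only theta is updated
        if (episode + 1) % pool_update_every == 0:
            pool.add(learner.theta)      # pool update: M <- M ∪ {theta}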
Finally, we note that FSP is easy to extend to multiple opponents (≥ 2). For example, perform the sampling φ ∼ Q(M) for each of the opponents, respectively, on each episode beginning, as in [7].
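Reusing the hypothetical pool and make_policy from the sketches above, the multi-opponent case simply repeats the draw per opponent slot:

# One independent draw phi ~ Q(M) per opponent slot at each episode beginning.
opponents = [make_policy(pool.sample_uniform()) for _ in range(num_opponents)]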
3.2. Design
To implement the CSP-MARL algorithm described in Section 3.1 and allow it to be
scalable, we adopt a modular design for our distributed training framework. Fig. 1 gives an
overview. In the following, we describe each of the modules and explain how they correspond
to CSP-MARL.
Actor. The Actor module produces the trajectory for the learning agent. It embeds two secondary modules, Env (environment) and Agt (agent). We require Env to be OpenAI gym [49] compatible for the Multi-Agent case, that is, it should implement the two methods:
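In the gym convention these are presumably reset and step; the dummy sketch below, including its tuple-per-agent signature, is only our assumption of what a Multi-Agent-compatible Env could look like:

import gym
import numpy as np

class TwoAgentEnv(gym.Env):
    """A dummy gym-compatible two-agent environment (illustrative only)."""

    def reset(self):
        self._t = 0
        # One observation per agent.
        return np.zeros(4), np.zeros(4)

    def step(self, joint_action):
        # joint_action holds one action per agent; per-agent observations and
        # rewards come back, together with a shared done flag and an info dict.
        self._t += 1
        obs = (np.zeros(4), np.zeros(4))
        rewards = (0.0, 0.0)
        done = self._t >= 10
        return obs, rewards, done, {}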
² In a more general setting, it is possible to perform no-regret learning for NE finding [42, 43, 44, 45, 46, 47]. The corresponding discussion is beyond the scope of this manuscript.