A Framework for Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform via on-chip Dynamic Tree Management
FPGA '23, February 12–14, 2023, Monterey, CA, USA
Figure 1: MCTS system performance on CPUs (actual IPS achieved). Note that the IPS for a specific p spans a range, as it depends on the specific execution model (details discussed in Section 3.1).
2.3.2 FPGA acceleration of MCTS. A key aspect of efficiently balancing the exploration-exploitation tradeoff in MCTS is the dynamic construction of its tree policy. The pattern of tree growth is determined at runtime by the random simulations. While it is simple to perform dynamic tree management using runtime dynamic memory allocation on CPUs, this is a challenging task on FPGAs, because an FPGA bitstream assigns memory statically. A naive method of dynamically re-allocating memory blocks for the growing tree at runtime is hardware reconfiguration, which incurs a large and unnecessary time overhead in the end-to-end MCTS execution. We are motivated to address this challenge by proposing the first dynamic MCTS accelerator design that requires no hardware reconfiguration.
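The contrast between the two allocation styles can be illustrated with a minimal sketch: a fixed-capacity node pool with a free list, the software analogue of growing a tree inside statically sized memory (as on-chip SRAM) without reallocating or reconfiguring anything. All names here are illustrative, not taken from this paper's design.

```python
# Minimal free-list node pool: grows a tree inside storage whose size
# is fixed up front, the way an accelerator must inside static SRAM.
class NodePool:
    def __init__(self, capacity):
        # Storage is fixed at construction time, like an on-chip memory block.
        self.parent = [-1] * capacity
        self.visits = [0] * capacity
        self.reward = [0.0] * capacity
        # The free list threads all unused slots together.
        self.free_head = 0
        self.next_free = list(range(1, capacity)) + [-1]

    def insert(self, parent_id):
        """Claim a free slot for a new child of parent_id; -1 if pool is full."""
        slot = self.free_head
        if slot == -1:
            return -1
        self.free_head = self.next_free[slot]
        self.parent[slot] = parent_id
        self.visits[slot] = 0
        self.reward[slot] = 0.0
        return slot

pool = NodePool(capacity=4)
root = pool.insert(-1)     # slot 0 becomes the root
child = pool.insert(root)  # slot 1 becomes a child of the root
```

A CPU would simply `malloc` each node; the point of the sketch is that the pool never resizes, so the same structure maps onto statically allocated hardware memory.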
3 RELATED WORK
3.1 Parallel MCTS: General-Purpose Processors
Several parallel MCTS algorithms have been developed to increase throughput while reducing the negative impact on algorithm performance in terms of obtained rewards [4, 5, 11, 12]. Tree-Parallel MCTS and its variants benefit significantly from their superior algorithm performance compared with the other parallel methods [5, 13–15]. It has been adopted in various successful applications such as Go [20], Dou-di-zhu [23], and Atari games [13]. Therefore, Tree-Parallel MCTS is the target parallel approach for this work.
Existing Tree-Parallel MCTS on CPUs can be categorized into two parallel execution models: multi-threaded tree traversal [5] and single-threaded tree traversal [13].
• In multi-threaded tree traversal, each worker accessing the tree is assigned a separate thread, and a local mutex at each tree node guards access to the shared tree. The main disadvantage of this method is that multiple threads communicate through DDR memory, which leads to a high Itv dominated by DDR access time (hundreds of CPU cycles [1, 7]).
• In single-threaded tree traversal, a single master thread exclusively performs the in-tree operations, while multiple worker threads exclusively perform simulations. It has the advantage of low-latency memory access, since the tree can be managed in local memory (e.g., the last-level cache). It also achieves higher IPS than multi-threaded tree traversal, because the in-tree operations can be overlapped with simulations. However, Itv between workers is still large because all the workers are serialized, and system performance does not scale well even with a small number of workers (as shown in Figure 1, the master thread for in-tree operations becomes the bottleneck at p = 16).
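The single-threaded tree traversal model can be sketched with ordinary threads and queues: one master owns the tree and all in-tree operations, and workers only consume simulation requests. The queue names and the placeholder rollout are illustrative assumptions, not this paper's implementation.

```python
# Sketch of single-threaded tree traversal: the master alone touches the
# tree (Selection, Insertion, BackUp); workers only run simulations.
import queue
import random
import threading

sim_requests = queue.Queue()  # master -> workers: leaf ids to simulate
sim_results = queue.Queue()   # workers -> master: (leaf id, reward)

def worker():
    while True:
        leaf = sim_requests.get()
        if leaf is None:             # shutdown sentinel
            break
        reward = random.random()     # placeholder for a real rollout
        sim_results.put((leaf, reward))

NUM_WORKERS, ITERATIONS = 4, 32
threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

visits, value = {}, {}
for i in range(ITERATIONS):           # master: Selection/Insertion pick a leaf
    sim_requests.put(i)               # simulations overlap with in-tree work
for _ in range(ITERATIONS):
    leaf, reward = sim_results.get()  # BackUp: only the master updates stats
    visits[leaf] = visits.get(leaf, 0) + 1
    value[leaf] = reward

for _ in threads:
    sim_requests.put(None)
for t in threads:
    t.join()
```

The sketch also exposes the scaling limit described above: every request and result funnels through the single master loop, so adding workers eventually just lengthens its queue.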
In this work, we are motivated to achieve better system through-
put and scalability compared with the existing implementations
discussed above.
3.2 Hardware-Accelerated MCTS
[8, 17] design Blokus Duo game solvers on FPGA that use MCTS. Their accelerators target the Blokus Duo game only and implement the simulator circuit on the FPGA. Their designs are difficult to generalize to various applications because they lack the general-purpose simulators provided by CPU processors. [14] proposed accelerating MCTS on CPU-FPGA heterogeneous systems and developed an FPGA accelerator for the in-tree operations. However, the accelerator design in [14] requires static memory allocation for a full tree at compile time, because it assumes a static one-to-one association between the topological ordering of tree nodes and the on-chip memory addresses. As the memory requirement for the full tree increases exponentially with the tree height, the supported tree height is extremely limited on FPGAs, which typically have limited on-chip resources. This constrains the asymptotic growth of the tree, thus affecting the domain-specific algorithm performance of MCTS algorithms. In summary, none of the existing FPGA designs supports dynamic tree management, which is critical for achieving high algorithm performance. In this work, we aim to bridge this gap by supporting dynamic tree management while maintaining high system throughput.
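The exponential cost of allocating a full tree at compile time follows directly from the geometric-series node count of a complete b-ary tree; a few lines make the growth rate concrete (the branching factor 8 is just an example value):

```python
# A complete b-ary tree of height D has (b**(D+1) - 1) // (b - 1) nodes,
# so a static one-to-one node-to-address mapping needs on-chip storage
# that grows exponentially in the supported tree height D.
def full_tree_nodes(branching, height):
    return (branching ** (height + 1) - 1) // (branching - 1)

# With branching factor 8, each extra level multiplies storage by ~8x:
sizes = [full_tree_nodes(8, d) for d in range(1, 5)]  # [9, 73, 585, 4681]
```

Since MCTS typically visits only a sparse, asymmetric subtree of this full tree, most of that statically reserved storage goes unused, which is exactly the waste that dynamic tree management avoids.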
4 ACCELERATOR DESIGN
4.1 Overview
4.1.1 Data Structure and Operations. The MCTS tree is maintained in the FPGA accelerator's on-chip memory. In the MCTS tree data structure, each node is associated with an ID based on insertion order, its number of visits, and the average reward gained by visiting it. Each edge has a parent ID, a child ID, and a weight (UCT value). Assuming there are p workers, the accelerator performs all their in-tree operations (BackUp, Selection, and Node Insertion, as listed in Section 2.2). Note that the application-specific environmental states are stored in CPU memory rather than FPGA memory, and the rest of the Expansion phase, including the 1-step simulation and environmental state management, is also performed on the CPU instead of the FPGA (further discussed in Sec. 5.1).
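The node and edge records just described can be sketched as plain data structures. The edge weight uses the standard UCB1 form of the UCT score; the exploration constant `c` and all identifier names are assumptions for illustration, not values fixed by this design.

```python
# Sketch of the tree records: nodes carry an insertion-order ID, a visit
# count, and an average reward; edges carry parent/child IDs and a UCT weight.
import math
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int          # assigned in insertion order
    visits: int = 0
    avg_reward: float = 0.0

@dataclass
class Edge:
    parent_id: int
    child_id: int
    weight: float = 0.0   # UCT value of traversing this edge

def uct(parent: Node, child: Node, c: float = 1.414) -> float:
    """Standard UCB1-style UCT score; c trades off exploration vs exploitation."""
    if child.visits == 0:
        return float("inf")  # unvisited children are selected first
    return child.avg_reward + c * math.sqrt(math.log(parent.visits) / child.visits)

root = Node(0, visits=10, avg_reward=0.5)
child = Node(1, visits=2, avg_reward=0.7)
edge = Edge(0, 1, weight=uct(root, child))
```

Selection walks down edges of maximal weight, Node Insertion appends a record with the next insertion-order ID, and BackUp re-averages `avg_reward` and increments `visits` along the traversed path.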
4.1.2 Accelerator Overview. The accelerator is depicted in Fig. 2. The key idea of the accelerator design is to exploit pipeline parallelism among the workers, which propagate through multiple stages, each stage operating on one tree level stored in on-chip SRAM. Assuming the maximum tree height is D, the pipeline is allocated D stages, each equipped with an Inserter, a Selector, and an Updater corresponding to operations on one tree level. Worker requests for the in-tree operations are streamed into the compute units (Inserter, Selector, or Updater) from the PCIe interface. Upon the completion of Selection and Node Insertion requests, the pipeline outputs requests for simulation back to the