A Framework for Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform via on-chip Dynamic Tree Management
FPGA '23, February 12–14, 2023, Monterey, CA, USA
Figure 1: MCTS system performance on CPUs (actual IPS achieved). Note that the IPS for a specific p spans a range, as it depends on the specific execution model (details discussed in Section 3.1).
2.3.2 FPGA acceleration of MCTS. A key aspect of efficiently balancing the exploration-exploitation tradeoff in MCTS is the dynamic construction of its tree policy. The pattern of tree growth is determined at runtime by the random simulations. While it is simple to perform dynamic tree management using runtime dynamic memory allocation on CPUs, this is a challenging task on FPGAs, because an FPGA bitstream assigns memory statically. A naive method of dynamically re-allocating memory blocks for the growing tree at runtime is hardware reconfiguration, which incurs a large and unnecessary time overhead in the end-to-end MCTS execution. We are motivated to address this challenge by proposing the first dynamic MCTS accelerator design that requires no hardware reconfiguration.
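The contrast between the two allocation styles can be illustrated with a minimal sketch: a fixed-capacity node pool with a free list, the software analogue of growing a tree inside statically sized memory (as on-chip SRAM) without reallocating or reconfiguring anything. All names here are illustrative, not taken from this paper's design.

```python
# Minimal free-list node pool: grows a tree inside storage whose size
# is fixed up front, the way an accelerator must inside static SRAM.
class NodePool:
    def __init__(self, capacity):
        # Storage is fixed at construction time, like an on-chip memory block.
        self.parent = [-1] * capacity
        self.visits = [0] * capacity
        self.reward = [0.0] * capacity
        # The free list threads all unused slots together.
        self.free_head = 0
        self.next_free = list(range(1, capacity)) + [-1]

    def insert(self, parent_id):
        """Claim a free slot for a new child of parent_id; -1 if pool is full."""
        slot = self.free_head
        if slot == -1:
            return -1
        self.free_head = self.next_free[slot]
        self.parent[slot] = parent_id
        self.visits[slot] = 0
        self.reward[slot] = 0.0
        return slot

pool = NodePool(capacity=4)
root = pool.insert(-1)     # slot 0 becomes the root
child = pool.insert(root)  # slot 1 becomes a child of the root
```

A CPU would simply `malloc` each node; the point of the sketch is that the pool never resizes, so the same structure maps onto statically allocated hardware memory.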
3 RELATED WORK
3.1 Parallel MCTS: General-Purpose Processors
Several parallel MCTS algorithms have been developed to increase throughput while reducing the negative impact on algorithm performance in terms of obtained rewards [4, 5, 11, 12]. Tree-Parallel MCTS and its variants benefit significantly from their superior algorithm performance compared with the other parallel methods [5, 13–15]. It has been adopted in various successful applications such as Go [20], Dou-di-zhu [23], and Atari games [13]. Therefore, Tree-Parallel MCTS is the target parallel approach for this work.
Existing Tree-Parallel MCTS on CPUs can be categorized into two parallel execution models: multi-threaded tree traversal [5] and single-threaded tree traversal [13].
• In multi-threaded tree traversal, each worker accessing the tree is assigned a separate thread, and a local mutex at each tree node guards access to the shared tree. The main disadvantage of this method is that multiple threads communicate through DDR memory, which leads to a high Itv dominated by DDR access time (hundreds of CPU cycles [1, 7]).
• In single-threaded tree traversal, a single master thread exclusively performs the in-tree operations, while multiple worker threads exclusively perform simulations. It has the advantage of low-latency memory access, since the tree can be managed in local memory (e.g., the last-level cache). It also achieves higher IPS than multi-threaded tree traversal, because the in-tree operations can be overlapped with simulations. However, Itv between workers is still large because all the workers are serialized, and system performance does not scale well even with a small number of workers (as shown in Figure 1, the master thread for in-tree operations becomes the bottleneck at p = 16).
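The single-threaded tree traversal model can be sketched with ordinary threads and queues: one master owns the tree and all in-tree operations, and workers only consume simulation requests. The queue names and the placeholder rollout are illustrative assumptions, not this paper's implementation.

```python
# Sketch of single-threaded tree traversal: the master alone touches the
# tree (Selection, Insertion, BackUp); workers only run simulations.
import queue
import random
import threading

sim_requests = queue.Queue()  # master -> workers: leaf ids to simulate
sim_results = queue.Queue()   # workers -> master: (leaf id, reward)

def worker():
    while True:
        leaf = sim_requests.get()
        if leaf is None:             # shutdown sentinel
            break
        reward = random.random()     # placeholder for a real rollout
        sim_results.put((leaf, reward))

NUM_WORKERS, ITERATIONS = 4, 32
threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

visits, value = {}, {}
for i in range(ITERATIONS):           # master: Selection/Insertion pick a leaf
    sim_requests.put(i)               # simulations overlap with in-tree work
for _ in range(ITERATIONS):
    leaf, reward = sim_results.get()  # BackUp: only the master updates stats
    visits[leaf] = visits.get(leaf, 0) + 1
    value[leaf] = reward

for _ in threads:
    sim_requests.put(None)
for t in threads:
    t.join()
```

The sketch also exposes the scaling limit described above: every request and result funnels through the single master loop, so adding workers eventually just lengthens its queue.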
In this work, we are motivated to achieve better system through-
put and scalability compared with the existing implementations
discussed above.
3.2 Hardware-Accelerated MCTS
[8, 17] design Blokus Duo game solvers on FPGA that use MCTS. Their accelerators target the Blokus Duo game only and implement the simulator circuit on the FPGA. Their designs are difficult to generalize to various applications because they lack the general-purpose simulators provided by CPU processors. [14] proposed accelerating MCTS on CPU-FPGA heterogeneous systems and developed an FPGA accelerator for the in-tree operations. However, the accelerator design in [14] requires static memory allocation for a full tree at compile time, because it assumes a static one-to-one association between the topological ordering of tree nodes and the on-chip memory addresses. As the memory requirement for the full tree increases exponentially with the tree height, the supported tree height is extremely limited on FPGAs, which typically have limited on-chip resources. This constrains the asymptotic growth of the tree, thus affecting the domain-specific algorithm performance of MCTS algorithms. In summary, none of the existing FPGA designs supports dynamic tree management, which is critical for achieving high algorithm performance. In this work, we aim to bridge this gap by supporting dynamic tree management while maintaining high system throughput.
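The exponential cost of allocating a full tree at compile time follows directly from the geometric-series node count of a complete b-ary tree; a few lines make the growth rate concrete (the branching factor 8 is just an example value):

```python
# A complete b-ary tree of height D has (b**(D+1) - 1) // (b - 1) nodes,
# so a static one-to-one node-to-address mapping needs on-chip storage
# that grows exponentially in the supported tree height D.
def full_tree_nodes(branching, height):
    return (branching ** (height + 1) - 1) // (branching - 1)

# With branching factor 8, each extra level multiplies storage by ~8x:
sizes = [full_tree_nodes(8, d) for d in range(1, 5)]  # [9, 73, 585, 4681]
```

Since MCTS typically visits only a sparse, asymmetric subtree of this full tree, most of that statically reserved storage goes unused, which is exactly the waste that dynamic tree management avoids.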
4 ACCELERATOR DESIGN
4.1 Overview
4.1.1 Data Structure and Operations. The MCTS tree is maintained in the FPGA accelerator's on-chip memory. In the MCTS tree data structure, each node is associated with an ID based on insertion order, its number of visits, and the average reward gained by visiting it. Each edge has a parent ID, a child ID, and a weight (UCT value). Assuming there are p workers, the accelerator performs all their in-tree operations (BackUp, Selection, and Node Insertion, as listed in Section 2.2). Note that the application-specific environmental states are stored in CPU memory rather than FPGA memory, and the rest of the Expansion phase, including the 1-step simulation and environmental state management, is also performed on the CPU instead of the FPGA (further discussed in Sec. 5.1).
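The node and edge records just described can be sketched as plain data structures. The edge weight uses the standard UCB1 form of the UCT score; the exploration constant `c` and all identifier names are assumptions for illustration, not values fixed by this design.

```python
# Sketch of the tree records: nodes carry an insertion-order ID, a visit
# count, and an average reward; edges carry parent/child IDs and a UCT weight.
import math
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int          # assigned in insertion order
    visits: int = 0
    avg_reward: float = 0.0

@dataclass
class Edge:
    parent_id: int
    child_id: int
    weight: float = 0.0   # UCT value of traversing this edge

def uct(parent: Node, child: Node, c: float = 1.414) -> float:
    """Standard UCB1-style UCT score; c trades off exploration vs exploitation."""
    if child.visits == 0:
        return float("inf")  # unvisited children are selected first
    return child.avg_reward + c * math.sqrt(math.log(parent.visits) / child.visits)

root = Node(0, visits=10, avg_reward=0.5)
child = Node(1, visits=2, avg_reward=0.7)
edge = Edge(0, 1, weight=uct(root, child))
```

Selection walks down edges of maximal weight, Node Insertion appends a record with the next insertion-order ID, and BackUp re-averages `avg_reward` and increments `visits` along the traversed path.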
4.1.2 Accelerator Overview. The accelerator is depicted in Fig. 2. The key idea of the accelerator design is to exploit pipeline parallelism among the workers, which propagate through multiple stages, each stage operating on one tree level stored in on-chip SRAM. Assuming the maximum tree height is D, the pipeline is allocated D stages, each equipped with an Inserter, a Selector, and an Updater corresponding to operations on one tree level. Worker requests for the in-tree operations are streamed into the compute units (Inserter, Selector, or Updater) from the PCIe interface. Upon the completion of Selection and Node Insertion requests, the pipeline outputs requests for simulation back to the