A Quantitative Analysis on Microarchitectures of
Modern CPU-FPGA Platforms
Young-kyu Choi Jason Cong Zhenman Fang
Yuchen Hao Glenn Reinman Peng Wei
Center for Domain-Specific Computing, University of California, Los Angeles
{ykchoi, cong, zhenman, haoyc, reinman, peng.wei.prc}@cs.ucla.edu
ABSTRACT
CPU-FPGA heterogeneous acceleration platforms have shown
great potential for continued performance and energy efficiency im-
provement for modern data centers, and have attracted great atten-
tion from both academia and industry. However, it is nontrivial for
users to choose the right platform among the various PCIe- and QPI-
based CPU-FPGA platforms from different vendors. This paper
aims to find out what microarchitectural characteristics affect the
performance, and how. We conduct our quantitative comparison
and in-depth analysis on two representative platforms: QPI-based
Intel-Altera HARP with coherent shared memory, and PCIe-based
Alpha Data board with private device memory. We provide multi-
ple insights for both application developers and platform designers.
1. INTRODUCTION
In today’s data center design, power and energy efficiency have
become two of the primary constraints. The increasing demand
for energy-efficient high-performance computing has stimulated a
growing number of heterogeneous architectures that feature hard-
ware accelerators or coprocessors, such as GPUs (Graphics Pro-
cessing Units), FPGAs (Field Programmable Gate Arrays), and
ASICs (Application Specific Integrated Circuits). Among various
heterogeneous acceleration platforms, the FPGA-based approach
is considered to be one of the most promising directions, since
FPGAs provide low power and high energy efficiency, and can be
reprogrammed to accelerate different applications. For example,
Microsoft has designed a customized FPGA board called Catapult
and deployed it in its data center [16], which improved the rank-
ing throughput of the Bing search engine by 2x. In other words, the
number of servers can be halved, with each new CPU-FPGA
server consuming only 10% more power.
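A quick back-of-the-envelope reading of these figures, sketched below, assumes that the 2x per-server throughput gain translates directly into halving the server count at equal total throughput; [16] reports the 2x and 10% numbers, but this derivation is ours:

    % N: number of CPU-only servers in the baseline, p: power per server.
    % With 2x throughput per server, N/2 CPU-FPGA servers match the
    % baseline throughput, each drawing 1.1p:
    \frac{P_{\mathrm{CPU+FPGA}}}{P_{\mathrm{CPU\,only}}}
      = \frac{(N/2)\cdot 1.1\,p}{N\cdot p} = 0.55

That is, roughly a 45% reduction in total power for the same ranking throughput.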
Motivated by FPGAs’ high energy efficiency and reprogramma-
bility, industry vendors, such as Microsoft, Intel/Altera, Xilinx,
IBM and Convey, are providing various ways to integrate high-
performance FPGAs into data center servers. We have classified
modern CPU-FPGA acceleration platforms in Table 1 according
to their physical integration and memory models. Traditionally,
the most widely used integration is to connect an FPGA to a CPU
via PCIe, with both components equipped with private memory.
Many FPGA boards built on top of Xilinx or Altera FPGAs use
this integration approach because of its extensibility. The customized
Microsoft Catapult board is one such example. Another
example is the Alpha Data FPGA board [1], built on Xilinx FPGA
fabric, which can leverage the Xilinx SDAccel development environ-
ment [2] to support efficient accelerator design using high-level
programming languages, including C/C++ and OpenCL. More-
over, vendors like IBM tend to support a PCIe connection with
a coherent shared memory for easier programming. For example,
IBM has been developing the Coherent Accelerator Processor
Interface (CAPI) on POWER8 [19] for such an integration, and
has used this platform in the IBM data engine for NoSQL [3].

Table 1: Classification of modern CPU-FPGA platforms

  Interconnect type              | Separate Private Memory  | Coherent Shared Memory
  PCIe Peripheral Interconnect   | Alpha Data board [2],    | IBM CAPI [19]
                                 | Microsoft Catapult [16]  |
  Processor Interconnect         | N/A                      | Intel-Altera HARP (QPI) [13],
                                 |                          | Convey HC-1 (FSB) [4]

More recently, closer integration has become available through a new
class of processor interconnects, such as the Front-Side Bus (FSB) and
the newer QuickPath Interconnect (QPI), which provide a coherent
shared memory; examples include the FSB-based Convey machine [4]
and the latest Intel-Altera Heterogeneous Architecture Research Plat-
form (HARP) [13], which targets data centers.
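To make the difference between the two memory models concrete, the following sketch shows the typical shape of host code for the separate private-memory case, using standard OpenCL host APIs of the kind exposed by SDAccel-style flows. It is only an illustrative sketch, not code from any vendor SDK or from this work: the function name run_on_private_memory_fpga and the variables ctx, q, and krnl are made up here, and the usual context/queue/kernel setup is assumed to have happened already.

    #include <CL/cl.h>

    /* Host-side flow for a PCIe-attached FPGA with private on-board DDR:
     * data must be explicitly copied across PCIe before and after the
     * kernel runs. ctx, q, and krnl are assumed to be set up already.  */
    void run_on_private_memory_fpga(cl_context ctx, cl_command_queue q,
                                    cl_kernel krnl, const float *in,
                                    float *out, size_t n) {
        cl_int err;
        /* 1. Allocate buffers in the board's private device memory.    */
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                      n * sizeof(float), NULL, &err);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                      n * sizeof(float), NULL, &err);
        /* 2. Explicit CPU-to-FPGA copy over PCIe (DMA).                 */
        clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, n * sizeof(float),
                             in, 0, NULL, NULL);
        /* 3. Launch the accelerator kernel.                             */
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_out);
        clEnqueueTask(q, krnl, 0, NULL, NULL);
        /* 4. Explicit FPGA-to-CPU copy of the results.                  */
        clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, n * sizeof(float),
                            out, 0, NULL, NULL);
        clFinish(q);
        clReleaseMemObject(d_in);
        clReleaseMemObject(d_out);
    }

Under a coherent shared-memory integration such as QPI-based HARP or CAPI, the explicit copy steps (2 and 4) are in principle unnecessary: the accelerator can access host memory directly, with the interconnect keeping the data coherent with the CPU caches.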
The evolution of these various CPU-FPGA platforms raises two
challenging questions: 1) which platform should we choose to gain
better performance and energy efficiency? and 2) how can we de-
sign an optimal accelerator on a given platform? There are nu-
merous factors that can affect the choice and optimization, such
as platform cost, programming models and effort, the logic resources
and operating frequency of the FPGA fabric, and the CPU-FPGA communication latency
and bandwidth, to name just a few. While some of them are easy
to figure out, others are nontrivial, especially the communication
latency and bandwidth between the CPU and FPGA under different
integration schemes. One reason is that there are few publicly available
documents for newly announced platforms like HARP^1. More impor-
tantly, those architectural parameters in the datasheets are often ad-
vertised values, which are usually difficult to achieve in practice.
In fact, there can be a large gap between the advertised numbers
and the numbers achievable in practice. For example, the adver-
tised bandwidth of the PCIe Gen3 x8 interface is 8GB/s; however,
our experimental results show that the PCIe-equipped Alpha Data
platform can only provide 1.6GB/s PCIe-DMA bandwidth using
OpenCL APIs implemented by Xilinx (see Section 3.2.1). Quan-
titative evaluation and in-depth analysis of such microarchitectural
characteristics can help CPU-FPGA platform users accurately predict
(and optimize) the performance of a candidate accelerator design on a
given platform and make the right choice. Furthermore, it can also
help CPU-FPGA platform designers identify performance bottlenecks
and provide better hardware and software support.
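As an illustration of how such achieved (rather than advertised) numbers can be obtained, the sketch below times one large blocking OpenCL host-to-device transfer and divides the payload size by the elapsed wall-clock time. It is only a minimal sketch: the helper name measure_h2d_bandwidth is made up here, the cl_context ctx and cl_command_queue q are assumed to have been created beforehand, and this is not necessarily the measurement methodology used in Section 3.2.1.

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Time one large blocking host-to-device transfer and report the
     * effective bandwidth in GB/s. ctx and q are assumed to exist.     */
    double measure_h2d_bandwidth(cl_context ctx, cl_command_queue q,
                                 size_t bytes) {
        cl_int err;
        void  *src = malloc(bytes);                 /* host source buffer  */
        cl_mem dst = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes,
                                    NULL, &err);    /* device destination  */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* Blocking write returns only after the DMA transfer completes. */
        clEnqueueWriteBuffer(q, dst, CL_TRUE, 0, bytes, src, 0, NULL, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gbps = (double)bytes / sec / 1e9;
        printf("%zu bytes in %.4f s -> %.2f GB/s\n", bytes, sec, gbps);

        clReleaseMemObject(dst);
        free(src);
        return gbps;
    }

Sweeping the transfer size from kilobytes to hundreds of megabytes also separates the latency-dominated region for small payloads from the bandwidth-dominated region for large ones.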
Motivated by those potential benefits to both platform users and
designers, this paper aims to discover what microarchitectural char-
acteristics affect the performance of modern CPU-FPGA platforms,
and evaluate how they will affect that performance. We conduct our
quantitative comparison on two representative modern CPU-FPGA
platforms to cover both integration dimensions, i.e., PCIe-based vs.
QPI-based, and private vs. shared memory model. One is the re-
cently announced QPI-based HARP with coherent shared memory,
and the other is the commonly used PCIe-based Alpha Data with
separate private memory. We quantify each platform’s CPU-FPGA
communication latency and bandwidth with in-depth analysis, and
^1 For simplicity, in the rest of this paper, we will use HARP for the
Intel-Altera HARP platform, and Alpha Data for the Alpha Data
board integrated with a Xeon CPU.