Automated Accelerator Optimization Aided by Graph Neural Networks

Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong
Computer Science Department, University of California - Los Angeles, Los Angeles, CA, USA
{atefehsz,yba,yzsun,cong}@cs.ucla.edu

Abstract
Using High-Level Synthesis (HLS), hardware designers must describe only a high-level behavioral flow of the design. However, it can still take weeks to develop a high-performance architecture, mainly because there are many design choices at a higher level to explore. Besides, it takes several minutes to hours to evaluate a design with the HLS tool. To solve this problem, we model the HLS tool with a graph neural network that is trained to be used for a wide range of applications. The experimental results demonstrate that our model can estimate the quality of a design in milliseconds with high accuracy, resulting in up to 79× speedup (with an average of 48×) for optimizing the design compared to the previous state-of-the-art work relying on the HLS tool.

ACM Reference Format:
Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2022. Automated Accelerator Optimization Aided by Graph Neural Networks. In Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC) (DAC '22), July 10–14, 2022, San Francisco, CA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3489517.3530409

This work is licensed under a Creative Commons Attribution International 4.0 License. DAC '22, July 10–14, 2022, San Francisco, CA, USA. © 2022 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-9142-9/22/07. https://doi.org/10.1145/3489517.3530409

1 Introduction
High-Level Synthesis (HLS) was introduced to simplify FPGA programming by raising the abstraction level in design, and it was soon embraced by both academia and industry [4, 16]. This is because HLS tools let designers optimize their microarchitecture quickly by inserting a few synthesis directives in the form of pragmas. This feature can potentially help shorten the design development cycle. However, not every HLS design has a good quality of results [17]. Thus, one often has to explore many design choices for each new application, since the solution space grows exponentially with the number of candidate pragmas. This can negatively impact design turn-around times.

To speed up the design optimization, a new line of research has been created with the focus on automating the design space exploration (DSE) for optimizing the microarchitecture. As summarized in [14], the previous studies either use the HLS tool directly [17, 24] or develop a model to mimic the HLS tool [11, 26] for evaluating a design point. Relying on the HLS tool to evaluate a solution can increase the DSE time significantly, as each design candidate would have a long evaluation time (minutes to hours), which forces us to explore a reduced set of the solution space. While utilizing a model can potentially speed up the process, a simple analytical model cannot capture the different heuristics used by the tool [14]. Adopting a learning algorithm can help with increasing the accuracy. However, the related works build a separate learning model per application, and the results from one application are not transferred to another. A nice effort was made by Kwon et al. [7] for transfer learning using a Multi-Layer Perceptron (MLP) network. Nonetheless, they only use the pragma configurations as the input to the model, which can result in considerable loss since the program semantics are missing (see Section 5.2).

A few very recent works have proposed to use a Graph Neural Network (GNN) for predicting the design's quality [18, 21]. Ustun et al. [18] propose a GNN-based model to learn the operation mapping to the FPGA's resources for delay prediction in HLS. IronMan [21] uses a GNN to predict the performance of the program under different resource allocations (DSP or LUT) to the computation nodes. Although their studies clearly demonstrate the value and power of using GNNs, none of these works include the pragmas in their input representation, so their models cannot be used for finding the best design configuration.

In this paper, we aim to automate the design optimization using GNNs, with support for model generalization, by developing a framework called GNN-DSE. We first build a model to evaluate a design quickly, in milliseconds, without invoking the HLS tool. Since the HLS tools employ many heuristics to optimize a design and the design parameters affect each other, we let a deep learning model learn their impact. Furthermore, as the current HLS tools optimize the design based on specific code patterns, it is important to identify the different code patterns and learn their effect so that the knowledge gained from one application can be transferred to another. As such, we represent the program as a graph which includes the program information in the form of control, data, call, and pragma flows, and we exploit a GNN to extract the required features of the graph for predicting the objectives. We propose several techniques for improving the accuracy of the model, including the Jumping Knowledge Network (JKN) [23], node attention [9], and multi-head objective prediction. To demonstrate the effectiveness of our model, we build a DSE on top of it to find the Pareto-optimal design points. We show that not only can GNN-DSE find the Pareto-optimal design points for the kernels that were included in its training set, it can also generalize to kernels outside of its database and detect their Pareto-optimal design points. This paper is the first work to employ a graph representation that captures both the program semantics and the pragmas, and to build a single predictive model for several applications with transfer learning capability. In this paper, we target Xilinx FPGAs as an example, but our approach is tool-independent and extendable to Intel FPGAs as well.

In summary, this paper makes the following contributions:
• We propose a graph-based program representation for optimizing FPGA designs which includes both the program context and the pragma flow.
• We develop a learning model based on a Graph Neural Network (GNN) as a surrogate of the HLS tool for assessing a design point's quality in milliseconds, and we propose several techniques for improving its accuracy.
• We build an automated framework, GNN-DSE¹, to gather a database of FPGA designs, train a learning model for predicting the design's objectives, and run a design space exploration based on the model to close in on a high-performance design point.
• The experimental results demonstrate that not only can GNN-DSE find the Pareto-optimal design points for the kernels in its database, but it can also optimize unseen kernels by generalizing the knowledge it learned from its training set.

¹ The code is open-sourced at https://github.com/UCLA-VAST/GNN-DSE
2 Background
2.1 Representing Programs as Graphs
A common way to represent a program as a graph is to extract its control and data flow graph (CDFG) from its intermediate representation (IR) in LLVM. Hence, instead of focusing on the syntax of the code, the semantics of the program flow are captured. In a CDFG, the nodes denote the LLVM instructions of the program, and they are connected according to the control flow of the program. For the data flow of the program, a second type of edge is added between the nodes based on the operands of the instructions. Note that the CDFG includes many low-level operations (e.g., memory management), which makes it very useful for FPGA kernels.

2.2 Graph Neural Networks
A Graph Neural Network (GNN) [22] extracts graph information by learning a feature vector (embedding) for each node of the graph through a series of layers. Each layer aggregates (AGG) the information of the neighboring nodes $\mathcal{N}(i)$ and applies a transformation function (TF) on the aggregated result. The Graph Convolutional Network (GCN) [6], a popular form of GNN, adopts a simple AGG function, a sum of the embeddings of $\mathcal{N}(i)$ weighted by the node degrees ($d_i$, $d_j$):

    $h'_i = \sigma\Big( W \sum_{j \in \mathcal{N}(i)} \frac{1}{\sqrt{d_i\, d_j}}\, h_j \Big)$    (1)

where $h_i \in \mathbb{R}^{D}$ ($h'_i \in \mathbb{R}^{D'}$) denotes the input (output) embedding of node $i$, a vector with $D$ ($D'$) features. $W$ is a trainable weight matrix acting as a filter for the TF step, and $\sigma$ is an activation function introducing nonlinearity to the model. Here, since the AGG step uses a fixed set of weights (determined by the node degrees), the model cannot prioritize any of the neighbors to learn a better embedding. The Graph Attention Network (GAT) [20] was introduced to learn the importance (attention) of a node's neighbors so that the embeddings can be updated accordingly. The computation of a GAT layer can be seen as:

    $h'_i = \sigma\Big( W \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\, h_j \Big)$    (2)

where the $\alpha_{i,j}$'s are attention coefficients computed by a multi-head attention mechanism. Each head computes:

    $\alpha_{i,j} = a^{\top} \left[ W h_i \,\|\, W h_j \right]$    (3)

where $\|$ denotes the concatenation operation and $a$ is a learnable vector that controls the attention node $i$ receives from node $j$. Note that compared to GCN, only the AGG step has changed.

2.3 Merlin Compiler
The Merlin Compiler [1, 2], open-sourced by Xilinx, aims to ease FPGA programming by introducing a small set of high-level pragmas for optimizing the design. Based on these pragmas, it performs source-to-source code transformations and automatically generates the corresponding HLS code along with the required HLS pragmas. It also automatically applies code transformations to implement memory coalescing, apply memory bursting, and cache the data required by the architecture. Although it exposes only three pragmas (pipeline, parallel, and tile), it generates several HLS pragmas based on them, including pipeline, unroll, array_partition, inline, dependence, and loop_flatten. We build our tool on top of the Merlin Compiler, which not only greatly reduces the size of the solution space but also applies its automated code transformations to achieve better designs. The GNN model, however, must work harder to learn when (and where) the Merlin Compiler applies its automated optimizations.

3 Problem Formulation
In this work, we aim to accelerate the DSE problem for HLS. To this end, we target the following problems:

Problem 1: Building a predictive model. Let P be the C program of an FPGA accelerator kernel with a design configuration $\theta$. Let H be the vendor's HLS tool, which outputs the true execution cycle count $cycle(H, \mathcal{P}(\theta))$ and the true resource utilizations $util(H, \mathcal{P}(\theta))$:

    $Q_{H}(\mathcal{P}(\theta)) = \big\langle\, cycle(H, \mathcal{P}(\theta)),\; util(H, \mathcal{P}(\theta)) \,\big\rangle$    (4)

Find a predictive function F that approximates the results of H for the given program P and any design configuration $\theta$:

    $\min_{F} \big\| Q_{F}(\mathcal{P}(\theta)) - Q_{H}(\mathcal{P}(\theta)) \big\|$    (5)

Problem 2: Determining the best design configuration. For the program P defined above, find a configuration $\theta \in \mathbb{R}^{P}$ within a given search time limit such that the generated design $\mathcal{P}(\theta)$ fits on the FPGA and its execution cycle count is minimized. Formally, our goal is:

    $\min_{\theta} \; cycle(F, \mathcal{P}(\theta)) \quad \text{s.t.} \quad \forall u \in util(F, \mathcal{P}(\theta)),\; u < T_u$    (6)

where $u$ is the utilization of one type of the FPGA on-chip resources and $T_u$ is a user-defined threshold for that type on the FPGA.
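To make Problem 2 concrete, the following is a minimal sketch of Eq. (6) as a constrained search over a small, finite pragma space, assuming a hypothetical surrogate predict() standing in for F. It is not GNN-DSE's actual explorer; all names and numbers are illustrative.

    from itertools import product

    # Hypothetical surrogate for F: maps a pragma configuration theta to
    # (predicted cycle count, predicted utilization per resource type).
    def predict(theta):
        pipe, para = theta
        cycles = 1e6 / (para * (2 if pipe == "cg" else 1))
        util = {"LUT": 0.02 * para, "DSP": 0.03 * para, "BRAM": 0.10}
        return cycles, util

    T_u = 0.8  # user-defined utilization threshold per resource type

    # Finite design space: one pipeline pragma and one parallel factor.
    space = {"_PIPE_L1": ["off", "cg", "fg"], "_PARA_L1": [1, 2, 4, 8, 16]}

    best_theta, best_cycles = None, float("inf")
    for theta in product(space["_PIPE_L1"], space["_PARA_L1"]):
        cycles, util = predict(theta)
        # Eq. (6): minimize cycles subject to every utilization u < T_u.
        if all(u < T_u for u in util.values()) and cycles < best_cycles:
            best_theta, best_cycles = theta, cycles

    print(best_theta, best_cycles)

Because predict() runs in milliseconds rather than the minutes to hours an HLS run takes, such a surrogate lets the search visit far more configurations within the same time limit.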
4 Our Proposed Methodology
Fig. 1(a) depicts a high-level overview of GNN-DSE, which operates in three modes: training, inference, and DSE. We first collect a database from various applications (Section 4.1) and represent each design in the database as a graph (Section 4.2). Then, we train a predictive model for estimating the design's objectives (Section 4.3). Finally, the predictive model can be used as a surrogate of the HLS tool to run the inference and DSE stages (Section 4.4).

[Figure 1: (a) High-level overview of the GNN-DSE framework; (b) The graph representation of Code 1.]

4.1 Database Generation
We adapt a related prior work, AutoDSE [17], which reported superior results over previous studies (e.g., [24]), to generate the initial database for each of the applications. Fig. 2 demonstrates our approach. Each for loop can take up to three pragmas: pipeline, parallel, and tile (Section 2.3). We also exploit AutoDSE's rules for pruning a design configuration (e.g., when fine-grained pipelining is applied on a loop, the inner loops would not take any pragma). Since the model needs to see a variety of design points, from "bad" to "good", to learn to distinguish them, GNN-DSE extends AutoDSE with three types of explorers for building the training set:
• The existing explorer of AutoDSE, the bottleneck-based optimizer, which can find high-quality designs.
• A hybrid explorer combining the bottleneck-based optimizer with a local search, which evaluates up to P neighbors of the best design point after an X% improvement in its quality. Thus, the model can see the effect of modifying only one of the pragmas.
• A random explorer, which may consider those configurations skipped by the previous two explorers.

[Figure 2: Database generator of GNN-DSE.]

Once an explorer picks a design point, it is passed to the Merlin Compiler for evaluation. The result is committed to a common database along with the program's graph representation (Section 4.2). GNN-DSE gradually collects results from different applications in a shared space to be used for training the model.
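The collection loop can be pictured as follows. This is a minimal sketch with hypothetical stand-ins for the bottleneck-based optimizer, the neighbor enumeration, and the Merlin evaluation; all helper names and returned numbers are illustrative, not GNN-DSE's code.

    import random

    def bottleneck_explorer(space):            # stand-in: high-quality points
        return [{"_PIPE_L1": "cg", "_PARA_L1": 8}]

    def neighbors(theta, space):               # points differing in ONE pragma
        for name, options in space.items():
            for opt in options:
                if opt != theta[name]:
                    yield {**theta, name: opt}

    def evaluate_with_merlin(theta):           # stub for the Merlin/HLS run
        return {"cycles": random.randint(10**4, 10**6), "util": {"DSP": 0.3}}

    space = {"_PIPE_L1": ["off", "cg", "fg"], "_PARA_L1": [1, 2, 4, 8]}
    database = []

    # 1) bottleneck-based explorer; 2) hybrid explorer: local search around
    # its best points; 3) random explorer for configurations they skipped.
    for theta in bottleneck_explorer(space):
        database.append((theta, evaluate_with_merlin(theta)))
        for nb in neighbors(theta, space):     # one-pragma perturbations
            database.append((nb, evaluate_with_merlin(nb)))
    for _ in range(5):                         # random explorer
        theta = {k: random.choice(v) for k, v in space.items()}
        database.append((theta, evaluate_with_merlin(theta)))

The one-pragma perturbations of the hybrid explorer are what give the model labeled pairs that isolate the effect of a single pragma.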
4.2 Program Representation
As mentioned in Section 2.1, the CDFG is a popular choice for representing FPGA kernels. On the downside, CDFGs ignore the precision of the operands and their values, which are crucial in determining the design's objectives. Recently, a more convenient program representation was proposed, ProGraML [3], which extends the CDFG by explicitly assigning separate nodes to operands to retrieve the missing information. It also keeps the function hierarchies by including the design's call flow. As such, we adapt ProGraML and extend it by including the pragma flow to represent a program. Each of the candidate pragmas is defined in one of the following forms:

    #pragma ACCEL pipeline auto{pragma_name}
    #pragma ACCEL parallel factor=auto{pragma_name}
    #pragma ACCEL tile factor=auto{pragma_name}

where pragma_name is a placeholder for the pragma's option, which can be (off|cg|fg) for pipeline and a numerical value for the other two. cg (fg) refers to coarse-grained (fine-grained) pipelining [17].

Code 1: Code snippet of an input toy example to GNN-DSE.

    void foo(int input[N]) {
    #pragma ACCEL pipeline auto{_PIPE_L1}
    #pragma ACCEL parallel factor=auto{_PARA_L1}
      for (int i = 0; i < N; i++) { input[i] += 1; }
    }

We assign a node for storing the placeholder pragma for each candidate pragma. Since the pragmas are applied to the loops, we connect this node to one of the instruction nodes corresponding to the loop: icmp. Code 1 shows a toy example having a simple for loop with two candidate pragmas. Fig. 1(b) depicts its graph representation; we only show the relevant nodes there for illustration purposes.
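As a concrete stand-in for Fig. 1(b), here is a minimal sketch of the pragma-related part of Code 1's graph, assuming networkx. The node IDs, block numbers, and the value attribute are illustrative assumptions; the type/flow/position encodings follow the schema listed below.

    import networkx as nx

    # Hypothetical reconstruction of (part of) Fig. 1(b) for Code 1,
    # following the node/edge schema of Section 4.2.
    G = nx.MultiDiGraph()

    # Loop-condition instruction node (type 0) and two pragma nodes (type 3).
    G.add_node("icmp", block=1, key_text="icmp", function=0, type=0)
    G.add_node("pipe_L1", block=1, key_text="PIPELINE", function=0, type=3)
    G.add_node("para_L1", block=1, key_text="PARALLEL", function=0, type=3)

    # Pragma-flow edges (flow 3); position encodes the pragma kind
    # (1: pipeline, 2: parallel).
    G.add_edge("pipe_L1", "icmp", flow=3, position=1)
    G.add_edge("para_L1", "icmp", flow=3, position=2)

    # For a concrete design point, the auto{...} placeholders are filled in;
    # storing them as a node attribute (name 'value' assumed here) means only
    # pragma-node attributes differ across configurations of the same kernel.
    G.nodes["pipe_L1"]["value"] = "cg"
    G.nodes["para_L1"]["value"] = 4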
As Fig. 1(b) demonstrates, there are three kinds of nodes in each graph. The first kind (in blue) is for the LLVM instructions, which together demonstrate the control flow of the program. The second kind (in red) exhibits the constant values and variables that capture the data flow of the program. The pragma nodes (third kind) are presented as purple boxes connecting to the respective icmp node. The edges also have different kinds, which show the different flows of the graph: control (blue), data (red), call (green), and pragma (purple). When two or more edges of the same type connect to a node, they are numbered to further distinguish them (see the edges connecting the pragma nodes to the icmp node).

To distinguish the different candidate pragmas of a nested loop, one must know the loop level of each pragma. As a rule of thumb, the HLS tools perform better when the pragmas are applied to inner loops, since they can implement fine-grained optimizations more easily [17]. Besides capturing the control flow, we explicitly encode this information in each node via the LLVM block ID of the for loop. More specifically, each node/edge has the following attributes:

    Node = {'block': LLVM block ID, 'key_text': Node key task,
            'function': Function ID, 'type': Node type}
    Edge = (Src node ID, Dst node ID, {'flow': Flow type, 'position': Position ID})

The type, flow, and position attributes encode this information as follows (for non-pragma edges, position denotes their ordering):

    type:     0: instruction   1: variable    2: constant value   3: pragma
    flow:     0: control       1: data        2: call             3: pragma
    position: 0: tile          1: pipeline    2: parallel

The key_text attribute shows the keyword corresponding to the node; for example, it is PIPELINE, load, and i32* for the pragma, control, and data node types, respectively. For each design point, the auto variables in the pragma placeholders are replaced with their corresponding values. Consequently, among the graphs of the different design configurations of the same application, only the attributes of the pragma nodes differ. Fig. 3 illustrates the graph generation process.

[Figure 3: Graph generator of GNN-DSE — C/C++ code, candidate pragmas, Clang(++), LLVM IR, graph builder, pragma placeholders, design configuration.]

4.3 Predictive Model
Fig. 4 depicts the architecture of our model for predicting the design's objectives. It takes the graph representation of the program as input and creates the initial node/edge embeddings by concatenating the one-hot encoding of their attributes (Section 4.2) with the pragma options. This encoding helps the model assign higher weights to the attributes that contribute more to the final prediction. To this end, the model exploits a GNN encoder (Section 4.3.1) to update the embeddings. The GNN encoder then passes the graph embedding to a set of MLPs to estimate the outputs (Section 4.3.2).

4.3.1 GNN Encoder: It assigns an embedding $h_G \in \mathbb{R}^{D}$ to graph $G$ through three stages: (1) stacked TransformerConv layers to generate the node embeddings, with attention coefficients that extend Eq. (3) to include the edge embeddings $e_{ij}$:

    $\alpha_{i,j} = \mathrm{softmax}\big( (W_1 h_i)^{\top} (W_2 h_j + W_3 e_{ij}) \,/\, \sqrt{d} \big)$
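For illustration, a minimal single-head sketch of this attention for one node $i$ over its neighbors, in plain PyTorch. The width d, the bias-free projections, and the reuse of $W_2$, $W_3$ for the aggregated value are assumptions of this sketch, not necessarily the paper's exact configuration.

    import math
    import torch

    d = 16                                  # embedding width (illustrative)
    W1, W2, W3 = (torch.nn.Linear(d, d, bias=False) for _ in range(3))

    h_i = torch.randn(d)                    # embedding of node i
    h_j = torch.randn(5, d)                 # embeddings of its 5 neighbors
    e_ij = torch.randn(5, d)                # embeddings of the connecting edges

    # alpha_{i,j} = softmax_j( (W1 h_i)^T (W2 h_j + W3 e_ij) / sqrt(d) )
    scores = (W2(h_j) + W3(e_ij)) @ W1(h_i) / math.sqrt(d)
    alpha = torch.softmax(scores, dim=0)    # one weight per neighbor

    h_i_new = alpha @ (W2(h_j) + W3(e_ij))  # aggregated message for node i

Unlike GAT's Eq. (3), the edge embedding $e_{ij}$ enters the score directly, which lets the pragma and position information carried on the edges steer the aggregation.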