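… potential in the atomic-scale simulation of materials under […] conditions; an efficient SNAP model was designed on the basis of the ML-IAP approach and used to predict the properties of carbon under extreme conditions.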


Billion atom molecular dynamics simulations of carbon at extreme conditions and experimental time and length scales

Kien Nguyen-Cong∗ (nguyencong@usf.edu), University of South Florida, Tampa, FL, USA
Jonathan T. Willman∗ (jwillma2@usf.edu), University of South Florida, Tampa, FL, USA
Stan G. Moore (stamoor@sandia.gov), Sandia National Laboratories, Albuquerque, NM, USA
Anatoly B. Belonoshko (anatoly@kth.se), Royal Institute of Technology (KTH), Stockholm, Sweden
Rahulkumar Gayatri (rgayatri@lbl.gov), NERSC, Berkeley, CA, USA
Evan Weinberg (eweinberg@nvidia.com), NVIDIA Corporation, Santa Clara, CA, USA
Mitchell A. Wood (mitwood@sandia.gov), Sandia National Laboratories, Albuquerque, NM, USA
Aidan P. Thompson (athomps@sandia.gov), Sandia National Laboratories, Albuquerque, NM, USA
Ivan I. Oleynik (oleynik@usf.edu), University of South Florida, Tampa, FL, USA

ABSTRACT
Billion atom molecular dynamics (MD) using quantum-accurate machine-learning Spectral Neighbor Analysis Potential (SNAP) observed long-sought high pressure BC8 phase of carbon at extreme pressure (12 Mbar) and temperature (5,000 K). 24-hour, 4650 node production simulation on OLCF Summit demonstrated an unprecedented scaling and unmatched real-world performance of SNAP MD while sampling 1 nanosecond of physical time. Efficient implementation of SNAP force kernel in LAMMPS using the Kokkos CUDA backend on NVIDIA GPUs combined with excellent strong scaling (better than 97% parallel efficiency) enabled a peak computing rate of 50.0 PFLOPs (24.9% of theoretical peak) for a 20 billion atom MD simulation on the full Summit machine (27,900 GPUs). The peak MD performance of 6.21 Matom-steps/node-s is 22.9 times greater than a previous record for quantum-accurate MD.
Near perfect weak scaling of SNAP MD highlights its excellent potential to advance the frontier of quantum-accurate MD to trillion atom simulations on upcoming exascale platforms.

KEYWORDS
molecular dynamics, machine-learning interatomic potentials, carbon, extreme conditions

1 JUSTIFICATION FOR ACM GORDON BELL PRIZE
Peak 50.0 PFLOPS rate in quantum-accurate 20 billion atom molecular dynamics simulation, 6.21 Matom-steps/node-s MD performance - 22.9x improvement over previous record for quantum-accurate MD. Sustained real-world simulation of 1 billion carbon atoms for 1 nanosecond of physical time on 4,650 nodes of Summit during 24 hours of wall clock time.

∗K. Nguyen-Cong and J. T. Willman contributed equally to this work.
ACM acknowledges that this contribution was authored or co-authored by an employee, contractor, or affiliate of the United States government. As such, the United States government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for government purposes only.
SC ’21, November 14–19, 2021, St. Louis, MO, USA
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8442-1/21/11...$15.00
https://doi.org/10.1145/3458817.3487400

2 PERFORMANCE ATTRIBUTES

Performance Attribute          Our Submission
Category of achievement        Time to solution, scalability
Type of method used            SNAP/Kokkos via LAMMPS MD
Results reported on basis of   Whole application including I/O
Precision reported             Double precision
System scale                   Measured on full system
Measurement mechanism          Timers, FLOP count

3 OVERVIEW OF THE PROBLEM: CLASSICAL SIMULATIONS OF MATERIALS AT EXTREME CONDITIONS WITH QUANTUM ACCURACY
Recent exciting discoveries of thousands of exoplanets beyond our solar system have advanced the research on planetary materials at extreme pressures and temperatures to the forefront of physical sciences [1, 2].
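A fundamental requirement for understanding the composition and the structure of exoplanetary interiors is an accurate knowledge of crystal structure, high pressure-temperature (PT) equations of state (EOS) and melting behavior of key geological materials. The advent of powerful laser [3] and pulsed-power [4] compressions, and in-situ X-ray free electron laser diffraction experiments [5] have made it possible to recreate and probe the high-PT environment of exoplanetary cores in the laboratory. However, a lack of theoretical and simulation guidance of experimental efforts, including comprehensive atomic-scale understanding of the complex physics of a material’s response to extreme PT conditions, substantially limits the science return from these sophisticated but very expensive experiments. Such meaningful predictions require billion-atom MD simulations at experimental nanosecond and micrometer time and length scales.

Corrected version. V.1.1. Published December 14, 2021.

Traditionally, quantum molecular dynamics (QMD) based on density functional theory (DFT) has been used to simulate matter at extreme PT conditions [6, 7]. Because of their high computational cost, these simulations are limited to samples of at most a few thousand atoms and simulation times of at most tens of picoseconds, and therefore yield mainly equilibrium properties of materials (e.g., the EOS and static phase diagrams). Such equilibrium models have been found to fail in regimes where non-equilibrium, time-dependent physics becomes important [8]. In principle, classical MD simulations with empirical interatomic potentials can overcome the inherent time and length scale limitations of QMD [9]. However, their inherent inability to cover a wide range of pressures and temperatures with sufficient accuracy is a serious obstacle to high-fidelity predictive simulation of experiments.

Recently, exciting new opportunities for atomic-scale simulation have emerged with the advent of machine-learning interatomic potentials (ML-IAPs) [10-13], which aim to provide a classical description of the underlying interatomic interactions with DFT accuracy. Although several applications of ML-IAPs to materials modeling have appeared [14-17], their outstanding potential for describing the diverse and complex atomic environments that arise in materials at extreme PT conditions has not yet been realized.

Our team has recently made significant progress on this extremely challenging but fundamentally important problem: predictive atomic-scale simulation of materials at extreme PT conditions at experimental time and length scales. In particular, we designed a quantum-accurate Spectral Neighbor Analysis Potential (SNAP) ML-IAP for carbon [18] that describes the properties of carbon at extreme conditions, at pressures from 0 to 50 Mbar and temperatures up to 20,000 K, including the phase diagram and the melting curves of the diamond, BC8, and simple cubic phases, to within 5% of very demanding QMD results. Although such impressive results come at a substantial computational cost, the linear scaling of SNAP with the number of atoms and its efficient implementation in the LAMMPS MD simulation package [19] allow us to run massively parallel billion-atom simulations on the leadership-class DOE Summit, DOE Perlmutter, and NSF Frontera systems.

The present work is designed specifically to highlight the important algorithmic innovations in implementing SNAP MD on heterogeneous multi-core processors and many-core GPU accelerators, together with the important goal of production simulations of carbon at extreme conditions. Carbon, which exists as graphite and diamond at ambient conditions, is predicted to transform at pressures of tens of millions of atmospheres (Mbar) and temperatures of tens of thousands of kelvin into a new crystalline form, the so-called BC8 structure [20]. To date, multiple experimental attempts to discover new forms of carbon at extreme conditions have been unsuccessful [21-23]. This scientific challenge is a perfect candidate for demonstrating the transformative impact of quantum-accurate MD simulations at experimental time and length scales. Simulations of this accuracy involving on the order of a billion atoms have never been attempted before.

This became possible thanks to our team's recent breakthroughs in the efficient implementation of SNAP on high-performance computing platforms and to exclusive access to the OLCF Summit supercomputer. By making effective use of all 4,650 nodes of Summit (27,900 GPUs) in this large-scale simulation, we were able to uncover a new route to synthesizing the elusive BC8 phase of carbon while demonstrating the unprecedented scaling and unmatched real-world performance of SNAP MD.

4 CURRENT STATE OF THE ART
Given appropriate expressions for the forces, obtained as derivatives of an interatomic potential (IAP) energy function, the equilibrium and non-equilibrium dynamics of any biological, chemical, or materials system can be simulated by numerical integration with a finite timestep $\Delta t \approx 1 \cdot 10^{-15}$ s. Traditionally, the atomic forces arising from interactions with nearby atoms are obtained from empirical IAPs that give a simplified description of covalent, metallic, or ionic bonding. The drive toward machine-learning potentials (ML-IAPs) comes from the need to approach the accuracy of quantum electronic structure methods while retaining the computational cost, linear scaling, and parallel efficiency of empirical potentials. In the relatively short time since their inception [25, 26], ML-IAPs have been shown to achieve accuracy comparable to electronic structure methods such as DFT.

In general, an ML-IAP consists of three distinct but not entirely separable parts: the set of descriptors, the regression technique, and the model form. Descriptors (synonymous with features) encode the bonding environment around each atom (see Figure 1) and play a key role in both the absolute accuracy of the model (relative to DFT) and its computational cost. The most successful ML potentials to date fall into two classes according to the choice of model form: kernel-based models and neural network (NN) models. Examples of the latter include Behler-Parrinello NNP [25], ANI [27, 28], HIP-NN [29], and DeepMD [30], which differ slightly in their choice of descriptors and neural network architecture. The SNAP work presented here belongs to the kernel-based class of ML-IAPs, which also includes GAP [26], ChIMES [31], and MTP [32]; the differences among them are defined mainly by the choice of descriptors.

Kernel-based methods compute a similarity measure between each feature vector and a database built from ground-truth training structures. The most commonly used similarity kernel is the squared-exponential kernel of Gaussian process regression models, as used in GAP [26]. It can be shown that SNAP, MTP, ChIMES, and other linear models are equivalent to GAP when the squared-exponential kernel is replaced by a dot-product kernel [33]. Recent theoretical work by Drautz [34] shows that SNAP, MTP, and several other successful descriptors are members of the atomic cluster expansion (ACE) family of descriptors, each with a particular choice of radial basis [35]. A recent comparison of the leading ML-IAP methods (both NN and kernel-based) by an independent group found that SNAP, GAP, and MTP (that is, all the kernel-based methods) offer the best balance between computational cost and accuracy [12].

Figure 1: Schematic of an ML descriptor encoding the local environment of an atom. All atoms within the radial cutoff (dashed line, $R$) are used to generate the descriptor, represented here as a fingerprint. The atomic energy is expressed as a linear or nonlinear function of the descriptors, with parameters adjusted during training to minimize the error relative to DFT data.

The Spectral Neighbor Analysis Potential (SNAP) method pioneered by our team uses as descriptors the bispectrum components of the local neighbor density projected onto a basis of four-dimensional hyperspherical harmonics, as illustrated in Figure 1. We use the quadratic form of SNAP to describe carbon, in which the energy $E_i^{SNAP}$ of atom $i$ is expressed as a sum of that atom's bispectrum components $\mathbf{B}_i$ (see Section 5 for details) and of quadratic products of these descriptors, weighted by regression coefficients:

\[ E_i^{SNAP}(\mathbf{r}^N) = \boldsymbol{\beta} \cdot \mathbf{B}_i + \tfrac{1}{2}\, \mathbf{B}_i \cdot \boldsymbol{\alpha} \cdot \mathbf{B}_i \tag{1} \]

where the symmetric matrix $\boldsymbol{\alpha}$ and the vector $\boldsymbol{\beta}$ are constant linear coefficients whose values are trained to reproduce the energies and forces obtained from DFT training structures. Similarly, the force on each atom $k$ is expressed in terms of derivatives of the atomic energies with respect to the position of atom $k$, where $N$ is the total number of atoms in the structure:

\[ \mathbf{F}_k^{SNAP} = -\nabla_k \sum_{i=1}^{N} E_i^{SNAP} = -\sum_{i=1}^{N} \left( \boldsymbol{\beta} + \mathbf{B}_i \cdot \boldsymbol{\alpha} \right) \cdot \frac{\partial \mathbf{B}_i}{\partial \mathbf{r}_k} \tag{2} \]

The SNAP ML-IAP for carbon was trained iteratively using the DAKOTA optimization package [36], minimizing the SNAP prediction error relative to the DFT data. The $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ coefficients were determined by weighted linear regression so as to minimize the error of the SNAP-predicted energies and atomic forces relative to the database of DFT calculations. The result is an IAP that is robust over a remarkable range of pressures and temperatures (0-50 Mbar and 300-20,000 K), far beyond the capabilities of any empirical IAP.

In any MD simulation, the force calculation is the computational bottleneck. For SNAP (Eq. 2), this cost is dominated by the evaluation of the bispectrum components $\mathbf{B}_i$ of each atom and of their derivatives with respect to neighbor-atom positions, $\partial \mathbf{B}_i / \partial \mathbf{r}_k$. Nearly all ML-IAPs are computationally more expensive than empirical IAPs because of the complexity of their descriptor definitions, so the gain in accuracy comes at the expense of total atom count and simulated time.

In previous work, we showed that kernel-based methods such as SNAP can take advantage of accelerator devices by exposing multiple levels of parallelism in the compute kernels that evaluate the descriptor gradients needed for MD force calculations. Trott et al. developed an early CUDA implementation of SNAP that achieved good computational efficiency on NVIDIA K20X GPUs. That work also demonstrated the excellent scalability of machine-learning potentials, allowing MD simulations to run on the full Titan machine (18,688 GPUs) with as few as 13 atoms per GPU [37].

When comparing with other state-of-the-art ML-IAPs, a common normalized metric of MD simulation throughput must be used in addition to the FLOP rate. Namely, the performance of an MD simulation of $N_{atoms}$ atoms that runs on $N_{nodes}$ nodes and completes $N_{steps}$ timesteps in $T_{wall}$ seconds is defined as

\[ \text{MD performance} = \frac{1}{10^{6}} \, \frac{N_{atoms} \times N_{steps}}{N_{nodes} \times T_{wall}} \tag{3} \]

and is reported here in units of Matom-steps/node-s. When comparing MD simulations that differ in system size, hardware type, node count, simulated time, and IAP, it is important to have both metrics: computational intensity and simulation performance. Recently, the NN-based deep-learning IAP DeepMD [38] reported a computational performance of $8.1 \times 10^{-10}$ s/step/atom on 4,560 Summit nodes for ∼12.7 billion Cu atoms simulated over 100 timesteps. The equivalent MD performance as defined by Eq. 3 is 0.271 Matom-steps/node-s, which is currently the best computational performance of any NN ML-IAP at this scale [39]. By comparison, the SNAP MD simulation reported here, of ∼20 billion carbon atoms on the full Summit machine (4,650 nodes), achieves an MD performance of 6.21 Matom-steps/node-s, 22.9 times higher than DeepMD. The next section details the algorithmic improvements that deliver this performance gain.

5 INNOVATIONS REALIZED
Here we describe the algorithmic and architecture-specific optimizations made to improve SNAP throughput on the latest generation of CPUs and GPUs. The SNAP energy and forces are expressed as a basis expansion in bispectrum components (Eqs. 1 and 2), truncated at an upper limit $J$ on the angular momentum quantum number, as defined below. Exploiting symmetry relations among the bispectrum components reduced the computational complexity from $O(J^7)$ to $O(J^5)$, yielding an order-of-magnitude speedup on CPUs [13]. This version of SNAP was ported to run on GPUs in LAMMPS, and it is the version we took as our starting point [40]. As the equations below show, SNAP consists of many irregularly structured, deeply nested loops with small and varying extents, which makes it considerably harder to optimize than regularly structured linear algebra kernels such as GEMM. The calculation of the SNAP potential and the forces derived from it follows this pattern:

• ComputeUi: Evaluate the local neighbor density of an atom $i$ in terms of a four-dimensional hyperspherical harmonic basis,

\[ U_j = \sum_{r_{ik} < R} f_c(r_{ik})\, u_j(a, b), \tag{4} \]

  where $u_j$ are Wigner U-matrices, each rank $2j+1$, and $a, b$ are the Cayley-Klein parameters, mappings of $\mathbf{r}_{ik}$ to the 3-sphere, and the index $j$ takes half-integer values $\{0, \tfrac{1}{2}, 1, \tfrac{3}{2}, \ldots\}$. The $u_j$ are efficiently calculated by a recursion relation

\[ u_j = \mathcal{F}\left(u_{j-\frac{1}{2}}\right), \tag{5} \]

  where $\mathcal{F}$ is a linear operator mapping two adjacent elements of $u_{j-1/2}$ to each element of $u_j$. $f_c(r_{ik})$ is a smooth cutoff function.
• ComputeBi and ComputeZi: The $U_j$ are not basis invariant and thus not useful as descriptors.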
  We form real, scalar, basis-invariant triple products [26]:

\[ B_{j_1 j_2 j} = U_{j_1} \otimes^{j}_{j_1 j_2} U_{j_2} : U_j^{*} \tag{6} \]
\[ \phantom{B_{j_1 j_2 j}} = Z^{j}_{j_1 j_2} : U_j^{*}. \tag{7} \]

  The symbol $\otimes^{j}_{j_1 j_2}$ indicates a Clebsch-Gordan product of matrices, an $O(j^4)$ operation. The $:$ corresponds to an element-wise scalar product of two matrices of equal rank, an $O(j^2)$ operation. The vector of descriptors $\mathbf{B}_i$ for atom $i$ introduced in Eq. 1 is a flattened list of elements $B_{j_1 j_2 j}$ restricted to $0 \le 2j_2 \le 2j_1 \le 2j \le 2J$, so that the number of unique bispectrum components scales as $O(J^3)$. In the current work, $2J$ is set to 8, yielding a descriptor vector $\mathbf{B}_i$ of length 55.
• ComputeDuidrj and ComputeDeidrj: Compute derivatives of the descriptors,

\[ \frac{\partial B_{j_1 j_2 j}}{\partial \mathbf{r}_k} = Z^{j}_{j_1 j_2} : \frac{\partial U_j^{*}}{\partial \mathbf{r}_k} + Z^{j_1}_{j j_2} : \frac{\partial U_{j_1}^{*}}{\partial \mathbf{r}_k} + Z^{j_2}_{j j_1} : \frac{\partial U_{j_2}^{*}}{\partial \mathbf{r}_k}, \tag{8} \]

  and accumulate into the force via Eq. 2.

This section describes the implementation and optimizations of the quadratic SNAP ML-IAP given above that uses the Kokkos performance-portability library [41]. Kokkos provides a framework for decomposing work into discrete, independent pieces that are written in C++ and then mapped onto backend languages (such as CUDA) and dispatched in parallel, hiding the architecture-specific details of executing work. Kokkos provides constructs to exploit hierarchical parallelism. The most relevant here are multi-dimensional, tiled launches, which conceptually map onto cache blocking on the CPU and multi-dimensional thread and block indices on the GPU. Of special note, Kokkos provides an abstraction of “scratchpad memory”, which conceptually maps onto small memory segments on the CPU which stay resident in cache, and maps onto shared memory on the GPU.

The first set of optimizations below describes the systematic extraction of hierarchies of parallelism in the SNAP ML-IAP. These are complemented by optimizations to memory layouts enabled by the Kokkos performance-portable “view” abstraction for multi-dimensional data structures.
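The descriptor length quoted in the ComputeBi step (55 components for 2J = 8) can be checked by direct enumeration. The sketch below counts index triples under the stated ordering 0 ≤ 2j2 ≤ 2j1 ≤ 2j ≤ 2J; it additionally assumes the usual Clebsch-Gordan triangle rule and the requirement that j1 + j2 + j be an integer, which the text does not spell out. The function name is illustrative only.

```python
def count_bispectrum_components(twojmax: int) -> int:
    """Count index triples (2*j1, 2*j2, 2*j) that label unique bispectrum components.

    Indices are stored doubled so that half-integer j values stay integers.
    """
    count = 0
    for tj1 in range(twojmax + 1):          # 2*j1
        for tj2 in range(tj1 + 1):          # 2*j2 <= 2*j1
            # Assumed selection rules: |j1 - j2| <= j <= j1 + j2, and a step
            # of 2 in the doubled index keeps j1 + j2 + j an integer.
            for tj in range(tj1 - tj2, min(twojmax, tj1 + tj2) + 1, 2):
                if tj >= tj1:               # ordering 2*j1 <= 2*j
                    count += 1
    return count


if __name__ == "__main__":
    for twojmax in (2, 4, 6, 8):
        print(twojmax, count_bispectrum_components(twojmax))
```

With `twojmax = 8` the enumeration gives 55, matching the length of the descriptor vector stated above, and the cubic growth with J is visible from the smaller cases.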
The latter set of optimizations describes where the ideals of performance portability break down, and we diverge the implementations for the CPU and GPU. This is necessary because GPUs, compared to CPUs, require a far higher arithmetic intensity, or ratio of FLOPS to memory transactions, to take full advantage of hardware accelerators.

The implementation of the SNAP potential described here is publicly available (see footnote 1) with the LAMMPS molecular dynamics package [19, 42]. The work we describe below was performed over the past ∼3 years, starting with the baseline GPU implementation [40] in LAMMPS.

5.1 Kernel Fission and Reduction of Computational Complexity
Despite taking advantage of the Kokkos features of hierarchical parallelism and scratchpad memory, the initial implementation of the SNAP potential had lackluster performance on GPUs. The original implementation mirrored the baseline CPU version by using one large, fused kernel, which caused high register usage, throttling occupancy.

Our first change was kernel fission, splitting the large kernel into multiple small kernels. This reduced register pressure across separate kernels, but greatly increased memory usage since intermediate quantities for all pairs of atom-neighbors needed to be explicitly stored between kernel launches.

These memory overheads became prohibitive and motivated several important optimizations. First, they motivated index flattening in both $U_j$ and $Z^{j}_{j_1 j_2}$, replacing jagged arrays with compressed indices. This innovation reduced the memory for $U_j$ by a factor of 1/3 and considerably more for $Z^{j}_{j_1 j_2}$.

More importantly, this motivated the development of the adjoint refactorization, which combines Eqs. 1 and 8 to define a new quantity $Y$,

\[ Y_j = \sum_{j_1 j_2} \left( \boldsymbol{\beta} + \mathbf{B} \cdot \boldsymbol{\alpha} \right)^{j}_{j_1 j_2} Z^{j}_{j_1 j_2}. \tag{9} \]

This adjoint refactorization simplifies the final force evaluation to

\[ \mathbf{F}_k^{SNAP} = -\sum_{i=1}^{N} \sum_{j=0}^{J} Y_j : \frac{\partial U_j^{*}}{\partial \mathbf{r}_k}. \tag{10} \]

$Y$ can be identified as the adjoint of $dB$ with respect to $dU$.
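The reordering behind Eqs. 9 and 10 can be illustrated with a toy model. The sketch below is not the SNAP kernel: it assumes, purely for illustration, that the descriptor derivative for one neighbor is a single dense linear map, dB_l = Σ_m Z[l][m] · dU[m], whereas Eq. 8 has three structured terms. All function names and the sizes `n_u` and `n_neigh` are hypothetical; only the descriptor count of 55 comes from the text.

```python
import math
import random


def energy_gradient(beta, alpha, B):
    """dE/dB for E = beta.B + 0.5 * B.alpha.B with symmetric alpha (Eq. 1)."""
    n = len(B)
    return [beta[l] + sum(alpha[l][m] * B[m] for m in range(n)) for l in range(n)]


def force_direct(c, Z, dU):
    """Eq. 8 style: build dB for every neighbor, then contract with c."""
    forces = []
    for dU_k in dU:
        dB_k = [sum(z * du for z, du in zip(Z_l, dU_k)) for Z_l in Z]
        forces.append(-sum(c_l * dB_l for c_l, dB_l in zip(c, dB_k)))
    return forces


def force_adjoint(c, Z, dU):
    """Eq. 9 and 10 style: fold c into Z once, then one dot product per neighbor."""
    n_u = len(Z[0])
    Y = [sum(c[l] * Z[l][m] for l in range(len(c))) for m in range(n_u)]
    return [-sum(y * du for y, du in zip(Y, dU_k)) for dU_k in dU]


def demo(n_b=55, n_u=40, n_neigh=26, seed=0):
    """Random toy data; n_u and n_neigh are arbitrary illustrative sizes."""
    rng = random.Random(seed)
    beta = [rng.uniform(-1, 1) for _ in range(n_b)]
    half = [[rng.uniform(-1, 1) for _ in range(n_b)] for _ in range(n_b)]
    # Symmetrize so that alpha matches the symmetric matrix assumed in Eq. 1.
    alpha = [[0.5 * (half[l][m] + half[m][l]) for m in range(n_b)] for l in range(n_b)]
    B = [rng.uniform(-1, 1) for _ in range(n_b)]
    Z = [[rng.uniform(-1, 1) for _ in range(n_u)] for _ in range(n_b)]
    dU = [[rng.uniform(-1, 1) for _ in range(n_u)] for _ in range(n_neigh)]
    c = energy_gradient(beta, alpha, B)
    return force_direct(c, Z, dU), force_adjoint(c, Z, dU)


if __name__ == "__main__":
    direct, adjoint = demo()
    print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
              for a, b in zip(direct, adjoint)))
```

In this toy setting the two routines agree to rounding error, while the work per neighbor drops from a matrix-vector product to a single dot product because the contraction with the energy gradient is done once per atom.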
The adjoint refactorization reduces the computational complexity from $O(J^5)$ to $O(J^3)$: evaluating Eq. 10 instead of Eq. 8 removes a factor of $O(J^2)$ computation. This reformulation also enables a factor of 3 reduction of flops due to a $j \leftrightarrow j_1 \leftrightarrow j_2$ symmetry in $Z^{j}_{j_1 j_2}$. As part of the development of this method, we optimized away a factor of $O(N_{neigh})$ storage in $Z$. The calculation of $Y$ was implemented in a new kernel ComputeYi.

Footnote 1: Production simulations in this work used the version of SNAP in [42], while scaling simulations were rerun using a slightly modified version of [42] optimized for large atom counts per GPU. These new optimizations have been recently released publicly in LAMMPS [43].

5.2 Extraction of Parallelism and Data Layout Optimizations
The acts of kernel fission and implementing the adjoint refactorization simplified identifying the parallelism available in each kernel. All four kernels noted below have trivial atom parallelism. Of further note:

• ComputeUi: Eq. 4 offers additional neighbor parallelism if the sum over neighbors is performed atomically.
• ComputeYi: Eq. 9 offers additional quantum number parallelism if the sum over $j_1, j_2$ is performed atomically.
• ComputeDuidrj: Trivial neighbor parallelism across independent $\partial U_j / \partial \mathbf{r}_k$.
• ComputeDeidrj: Additional neighbor parallelism if force accumulations are performed atomically, exploiting Newton’s Laws.

Each source of parallelism obeys linearity, meaning we are free to reorder the per-atom, per-neighbor, and per-quantum-number parallelism as appropriate to maximize performance. In the evaluation of $U_j$, for example, we can choose to evaluate the contribution from all atoms one neighbor at a time, or compute the contribution from all neighbors one atom at a time. The hierarchical parallelism abstractions in Kokkos make it easy to rearrange the order of parallelism.
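The freedom to reorder loops in the evaluation of U_j can be seen in a small serial sketch. It accumulates a cutoff-weighted sum over neighbors in two loop orders and checks that the results agree. The cosine cutoff and the three-element `toy_u` are assumptions made for illustration; the text only states that f_c is smooth, and the real u_j are Wigner matrices built by the recursion in Eq. 5.

```python
import math


def cutoff(r, r_cut):
    """Cosine switching function: an assumed form, the text only requires f_c to be smooth."""
    if r >= r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)


def toy_u(r):
    """Hypothetical three-element stand-in for the Wigner matrix elements u_j(a, b)."""
    return (1.0, r, r * r)


def accumulate_atom_major(neighbor_dists, r_cut):
    """Outer loop over atoms, inner loop over each atom's neighbors."""
    U = [[0.0, 0.0, 0.0] for _ in neighbor_dists]
    for i, dists in enumerate(neighbor_dists):
        for r in dists:
            w = cutoff(r, r_cut)
            for m, u in enumerate(toy_u(r)):
                U[i][m] += w * u
    return U


def accumulate_neighbor_major(neighbor_dists, r_cut):
    """Outer loop over neighbor slots, inner loop over all atoms that have that slot."""
    U = [[0.0, 0.0, 0.0] for _ in neighbor_dists]
    n_slots = max((len(d) for d in neighbor_dists), default=0)
    for k in range(n_slots):
        for i, dists in enumerate(neighbor_dists):
            if k < len(dists):
                w = cutoff(dists[k], r_cut)
                for m, u in enumerate(toy_u(dists[k])):
                    U[i][m] += w * u
    return U


if __name__ == "__main__":
    # Jagged neighbor lists: distances from each atom to its neighbors.
    dists = [[1.0, 1.5, 2.9], [0.8, 3.5], [2.0]]
    a = accumulate_atom_major(dists, r_cut=3.0)
    b = accumulate_neighbor_major(dists, r_cut=3.0)
    print(all(math.isclose(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

In a parallel setting the neighbor-major order lets many threads contribute to the same atom at once, which is why the ComputeUi item above requires the sum over neighbors to be performed atomically.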
The Kokkos abstractions also simplify changing data layouts to promote good memory access on the CPU and GPU: Array-of-Structures on the CPU to promote spatial/temporal cache locality, Structure-of-Arrays on the GPU to promote memory alignment and coalescing.

The trivial parallelism over atom number across all kernels enables a
