LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation


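LLVM, short for Low Level Virtual Machine, is a compiler framework developed by Chris Lattner and Vikram Adve at the University of Illinois at Urbana-Champaign. The paper introduces this framework, which is designed to support transparent, lifelong program analysis and transformation for arbitrary programs by providing high-level information to compiler transformations at compile time, link time, run time, and in idle time between runs. Its core ideas include:

1. **Static Single Assignment (SSA) form**: LLVM represents code in SSA form, in which each virtual register is assigned exactly once. This simplifies dataflow analysis and optimization.
2. **A simple, language-independent type system**: LLVM defines a general type system that exposes the low-level operations needed to implement high-level language features (such as object models and function pointers), so the compiler can analyze code from different source languages in a uniform way.
3. **An instruction for typed address arithmetic**: a dedicated instruction computes addresses while preserving type information, so that address arithmetic remains visible to analyses and optimizations.
4. **Exception handling and C's setjmp/longjmp**: LLVM provides a uniform, efficient mechanism for implementing the exception handling features of high-level languages as well as setjmp/longjmp in C.
5. **A combination of key capabilities**: the code representation and the framework together provide a combination of capabilities that matters for the practical analysis and transformation of programs throughout their lifetime.

LLVM's design goal is a flexible and powerful compiler infrastructure that lets optimizations be applied across the whole lifetime of a program. It has since become the foundation of many open-source and high-performance projects, such as Clang (a C/C++ front end) and Swift (Apple's programming language).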

system, the memory model, exception handling mechanisms, and the offline and in-memory representations. The detailed syntax and semantics of the representation are defined in the LLVM reference manual [29].

2.1 Overview of the LLVM Instruction Set

The LLVM instruction set captures the key operations of ordinary processors but avoids machine-specific constraints such as physical registers, pipelines, and low-level calling conventions. LLVM provides an infinite set of typed virtual registers which can hold values of primitive types (Boolean, integer, floating point, and pointer). The virtual registers are in Static Single Assignment (SSA) form [15]. LLVM is a load/store architecture: programs transfer values between registers and memory solely via load and store operations using typed pointers. The LLVM memory model is described in Section 2.3.

The entire LLVM instruction set consists of only 31 opcodes. This is possible because, first, we avoid multiple opcodes for the same operations. Second, most opcodes in LLVM are overloaded (for example, the add instruction can operate on operands of any integer or floating point operand type). Most instructions, including all arithmetic and logical operations, are in three-address form: they take one or two operands and produce a single result.
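
The point about avoiding multiple opcodes for the same operation can be checked with ordinary fixed-width arithmetic. The sketch below is Python, not LLVM, and assumes 32-bit two's-complement values; the helper names `not32` and `neg32` are hypothetical. It shows a bitwise NOT written as an XOR with an all-ones constant and a negation written as a subtraction from zero, so neither needs a dedicated unary opcode.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit two's-complement registers


def not32(x: int) -> int:
    """Bitwise NOT expressed as XOR with an all-ones constant."""
    return (x ^ MASK32) & MASK32


def neg32(x: int) -> int:
    """Arithmetic negation expressed as 0 - x, wrapped to 32 bits."""
    return (0 - x) & MASK32


if __name__ == "__main__":
    print(hex(not32(0x0000FFFF)))
    print(hex(neg32(1)))
```

Both helpers reuse binary operations that an instruction set needs anyway, which is the kind of economy that keeps the opcode count small.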

LLVM uses SSA form as its primary code representation, i.e., each virtual register is written in exactly one instruction, and each use of a register is dominated by its definition. Memory locations in LLVM are not in SSA form because many possible locations may be modified at a single store through a pointer, making it difficult to construct a reasonably compact, explicit SSA code representation for such locations. The LLVM instruction set includes an explicit phi instruction, which corresponds directly to the standard (non-gated) φ function of SSA form. SSA form provides a compact def-use graph that simplifies many dataflow optimizations and enables fast, flow-insensitive algorithms to achieve many of the benefits of flow-sensitive algorithms without expensive dataflow analysis. Non-loop transformations in SSA form are further simplified because they do not encounter anti- or output dependences on SSA registers. Non-memory transformations are also greatly simplified because (unrelated to SSA) registers cannot have aliases.
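
The two SSA conditions stated above (one write per register, every use dominated by its definition) can be illustrated on a single basic block, where "dominated by" reduces to "defined earlier in the block". The following is a minimal Python sketch under that simplifying assumption; the tuple encoding of instructions and the function name `check_straight_line_ssa` are hypothetical and are not part of LLVM.

```python
from typing import List, Tuple

# Hypothetical encoding: (result_register, opcode, operand_registers).
Instr = Tuple[str, str, List[str]]


def check_straight_line_ssa(params: List[str], code: List[Instr]) -> List[str]:
    """Return the SSA violations found in one basic block.

    Within a single block, a definition dominates a use exactly when it
    appears earlier, so one forward pass is enough.
    """
    defined = set(params)
    errors = []
    for dest, _opcode, operands in code:
        for reg in operands:
            if reg not in defined:
                errors.append(f"{reg} used before its definition")
        if dest in defined:
            errors.append(f"{dest} written more than once")
        defined.add(dest)
    return errors


if __name__ == "__main__":
    good = [("%t1", "add", ["%a", "%b"]), ("%t2", "mul", ["%t1", "%a"])]
    bad = [("%x", "add", ["%a", "%b"]), ("%x", "mul", ["%x", "%a"])]
    print(check_straight_line_ssa(["%a", "%b"], good))
    print(check_straight_line_ssa(["%a", "%b"], bad))
```

A checker for whole functions would need a dominator tree and would have to treat phi operands specially; this sketch deliberately leaves both out.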

LLVM also makes the Control Flow Graph (CFG) of every function explicit in the representation. A function is a set of basic blocks, and each basic block is a sequence of LLVM instructions, ending in exactly one terminator instruction (branches, return, unwind, or invoke; the latter two are explained later below). Each terminator explicitly specifies its successor basic blocks.
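
Because every block ends in one terminator that names its successors, the CFG can be read directly off the representation. The Python sketch below models that idea; the `Block` class, the helper names, and the opcode spellings in `TERMINATORS` are assumptions made for illustration, chosen to mirror the terminator kinds listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed spellings for the terminator kinds named in the text.
TERMINATORS = {"br", "ret", "unwind", "invoke"}


@dataclass
class Block:
    name: str
    instrs: List[str]  # opcode names, in program order
    successors: List[str] = field(default_factory=list)


def well_formed(block: Block) -> bool:
    """True if the block ends in a terminator and contains no other one."""
    if not block.instrs:
        return False
    *body, last = block.instrs
    return last in TERMINATORS and not any(op in TERMINATORS for op in body)


def predecessors(blocks: List[Block]) -> Dict[str, List[str]]:
    """Invert the explicit successor lists to obtain predecessor lists."""
    preds: Dict[str, List[str]] = {b.name: [] for b in blocks}
    for b in blocks:
        for succ in b.successors:
            preds[succ].append(b.name)
    return preds


if __name__ == "__main__":
    cfg = [
        Block("entry", ["load", "br"], ["then", "else"]),
        Block("then", ["add", "br"], ["exit"]),
        Block("else", ["sub", "br"], ["exit"]),
        Block("exit", ["phi", "ret"]),
    ]
    print(all(well_formed(b) for b in cfg))
    print(predecessors(cfg)["exit"])
```

The `exit` block has two predecessors, which is exactly the situation in which a phi instruction selects between incoming values.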

2.2 Language-independent Type Information, Cast, and GetElementPtr

One of the fundamental design features of LLVM is the inclusion of a language-independent type system. Every SSA register and explicit memory object has an associated type, and all operations obey strict type rules. This type information is used in conjunction with the instruction opcode to determine the exact semantics of an instruction (e.g. floating point vs. integer add). This type information enables a broad class of high-level transformations on low-level code (for example, see Section 4.1.1). In addition, type mismatches are useful for detecting optimizer bugs.

Footnote to Section 2.1 (on avoiding multiple opcodes for the same operations): For example, there are no unary operators: not and neg are implemented in terms of xor and sub, respectively.

The LLVM type system includes source-language-independent primitive types with predefined sizes (void, bool, signed/unsigned integers from 8 to 64 bits, and single- and double-precision floating-point types). This makes it possible to write portable code using these types, though non-portable code can be expressed directly as well. LLVM also includes (only) four derived types: pointers, arrays, structures, and functions. We believe that most high-level language data types are eventually represented using some combination of these four types in terms of their operational behavior. For example, C++ classes with inheritance are implemented using structures, functions, and arrays of function pointers, as described in Section 4.1.2.
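
To make the last example concrete, here is a small Python sketch of how a class with a virtual method can be reduced to the three ingredients named above: a structure, plain functions, and an array of function pointers. It illustrates a generic virtual-table layout and is not the lowering described in Section 4.1.2, which is outside this excerpt; all names in it are hypothetical.

```python
# "Functions": plain callables that take the object as an explicit argument.
def base_describe(obj):
    return "base"


def derived_describe(obj):
    return "derived:" + str(obj["fields"][0])


# "Arrays of function pointers": one table per class; slot 0 is describe.
BASE_VTABLE = [base_describe]
DERIVED_VTABLE = [derived_describe]


def make_object(vtable, *fields):
    # "Structure": a record whose first member refers to the class's table.
    return {"vtable": vtable, "fields": list(fields)}


def virtual_call(obj, slot):
    # Load the table reference, index the array, call through the entry.
    return obj["vtable"][slot](obj)


if __name__ == "__main__":
    print(virtual_call(make_object(BASE_VTABLE), 0))
    print(virtual_call(make_object(DERIVED_VTABLE, 7), 0))
```

Nothing in `virtual_call` depends on the source language, which is why an analysis that understands only structures, functions, and arrays can still reason about such calls.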

Equally important, the four derived types above capture the type information used even by sophisticated language-independent analyses and optimizations. For example, field-sensitive points-to analyses [25, 31], call graph construction (including for object-oriented languages like C++), scalar promotion of aggregates, and structure field reordering transformations [12], only use pointers, structures, functions, and primitive data types, while array dependence analysis and loop transformations use all those plus array types.

Because LLVM is language independent and must support weakly-typed languages, declared type information in a legal LLVM program may not be reliable. Instead, some pointer analysis algorithm must be used to distinguish memory accesses for which the type of the pointer target is reliably known from those for which it is not. LLVM includes such an analysis described in Section 4.1.1. Our results show that despite allowing values to be arbitrarily cast to other types, reliable type information is available for a large fraction of memory accesses in C programs compiled to LLVM.

The LLVM ‘cast’ instruction is used to convert a value of one type to another arbitrary type, and is the only way to perform such conversions. Casts thus make all type conversions explicit, including type coercion (there are no mixed-type operations in LLVM), explicit casts for physical subtyping, and reinterpreting casts for non-type-safe code. A program without casts is necessarily type-safe (in the absence of memory access errors, e.g., array overflow [19]).
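
The difference between a type coercion and a reinterpreting cast is easy to blur, so a short illustration may help. The sketch below is Python, not LLVM, and its function names are hypothetical; it assumes IEEE-754 single precision for the reinterpretation. It contrasts converting the value of a float to an integer with reinterpreting the same 32 bits as an integer.

```python
import struct


def convert_float_to_int(x: float) -> int:
    """Value conversion: keep the numeric value, truncating toward zero."""
    return int(x)


def reinterpret_float_bits(x: float) -> int:
    """Reinterpretation: keep the 32 bits of an IEEE-754 single as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


if __name__ == "__main__":
    print(convert_float_to_int(1.0))
    print(hex(reinterpret_float_bits(1.0)))
```

The first function preserves meaning and the second preserves representation; making both kinds of conversion explicit in the code is what lets an analysis tell them apart.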

A critical difficulty in preserving type information for low-level code is implementing address arithmetic. The getelementptr instruction is used by the LLVM system to perform pointer arithmetic in a way that both preserves type information and has machine-independent semantics. Given a typed pointer to an object of some aggregate type, this instruction calculates the address of a sub-element of the object in a type-preserving manner (effectively a combined ‘.’ and ‘[ ]’ operator for LLVM). For example, the C statement “X[i].a = 1;” could be translated into the pair of LLVM instructions:

    %p = getelementptr %xty* %X, long %i, ubyte 3;
    store int 1, int* %p;

where we assume a is field number 3 within the structure X[i], and the structure is of type %xty. Making all address arithmetic explicit is important so that it is exposed to all LLVM optimizations (most importantly, reassociation and redundancy elimination); getelementptr achieves this without obscuring the type information. Load and store instructions take a single pointer and do not perform any indexing,
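
The address computed for “X[i].a” in the getelementptr example can be reproduced with Python's `ctypes`, which exposes structure sizes and field offsets. This is a sketch under stated assumptions: the layout of `XTy` is invented (the source only says that a is field number 3 of %xty), and `gep_offset` is a hypothetical helper, not an LLVM API.

```python
import ctypes


class XTy(ctypes.Structure):
    # Invented layout standing in for %xty; only "a is field 3" is from the text.
    _fields_ = [
        ("f0", ctypes.c_int),
        ("f1", ctypes.c_double),
        ("f2", ctypes.c_byte),
        ("a", ctypes.c_int),
    ]


def gep_offset(elem_type, index: int, field_number: int) -> int:
    """Byte offset of element `index`, field `field_number`, from the base pointer."""
    field_name = elem_type._fields_[field_number][0]
    field_offset = getattr(elem_type, field_name).offset
    return index * ctypes.sizeof(elem_type) + field_offset


if __name__ == "__main__":
    X = (XTy * 4)()
    i = 2
    # Analogue of: %p = getelementptr %xty* %X, long %i, ubyte 3
    p = ctypes.addressof(X) + gep_offset(XTy, i, 3)
    # Analogue of: store int 1, int* %p
    ctypes.c_int.from_address(p).value = 1
    print(X[i].a)
```

The offset depends on the target's sizes and alignment, which is why the instruction itself names an index and a field number rather than a byte count.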
