system, the memory model, exception handling mechanisms,
and the offline and in-memory representations. The detailed
syntax and semantics of the representation are defined in the
LLVM reference manual [29].
2.1 Overview of the LLVM Instruction Set
The LLVM instruction set captures t he key operations of
ordinary processors but avoids machine-specific constraints
such as physical regis t ers , pip elines, and low-level calling
conventions. LLVM provides an infinite set of typed virtual
registers which can hold values of primitive types (Boolean,
integer, floating point, and pointer). The virtual registers
are in Static Single Assignment (SSA) form [15]. LLVM
is a load/store architecture: programs tra ns fer values be-
tween registers and memory solely via load and store op-
erations using typed pointers. The LLVM memory model is
described in Section 2.3.
The entire LLVM instruction set consists of only 31 op-
codes. T his is possible because, first, we avoid multiple op-
codes for the same operations
3
. Second, most opcodes i n
LLVM are overloaded (for example, the add instruction can
operate on operands of any integer or floating po int operand
type). Most instructions, including all arithmetic and log i-
cal operations, are in three-address form: they take one or
two operands and produce a single result.
LLVM uses SSA form as its prima ry code representation,
i.e., each virtual register is written in exactly one instruc-
tion, and each use of a register is dominated by its definition .
Memory locations in LLVM are not in SSA form because
many possible locations may be modified at a single store
through a pointer, making it difficult to construct a rea-
sonably compact, explicit SSA code representation for such
locations. The LLVM instruction s et includes an explicit
phi instruction, which corresponds directly to the standard
(non-gated) φ function of SSA form. SSA form provides a
compact def-use graph that simplifies many dataflow opti-
mizations and enables fast, flow-insensitive algorithms to
achieve many of the benefits of flow -s ensi tive algorithms
without expensive dataflow analysis. Non-loop transforma-
tions in SSA form are further simplified because they do
not encounter anti- or o utput dependences on SSA registers.
Non-memory transformations are also greatly simplified be-
cause (unrelated to SSA) registers cannot have aliases.
LLVM also makes the Control Flow Graph (CFG) of every
function explicit in the representation. A function is a set
of basic blocks, and each basic block is a sequence of LLVM
instructions, ending in exactly o ne terminator instruction
(branches, return, unwind, or invoke; the latter two are
explained later below). Each terminator explicitly specifies
its successor basic blocks.
2.2 Language-independent Type Information,
Cast, and GetElementPtr
One of the fundamental design features of LLVM is the in-
clusion of a language-independent type system. Every SSA
register and explicit memory object has an associated type,
and all operations obey strict type rules. This type informa-
tion is used in conjunction with the ins truction op code to
determine the exact semantics of an instruction (e.g. float-
ing point vs. integer add). This type information enables a
broad class of high-level transformations on low-level code
3
For example, there are no unary operators: not and neg
are implemented in terms of xor and sub, respectively.
(for example, see Section 4.1.1). In addition, type mis-
matches are useful for detecting optimizer bugs.
The LLVM type system includes source-language-indep-
endent primitive types with predefined sizes (void, bool,
signed/unsigned integers from 8 to 64 bit s, and single- and
double-precision floating-point types). This makes it possi-
ble to write portable code using these types, though non-
porta ble code can be expressed directly as well. LLVM also
includes (only) four derived types: pointers, arrays, struc-
tures, and functions. We believe that most high-level l a n-
guage data types are eventually represented using some com-
bination of these four types in terms of their operational
behavior. For example, C++ classes with inheritance are
implemented using structures, functions, and arrays of func-
tion pointers, as described i n Section 4.1.2.
Equally important, the four derived types above capture
the type information used even by s o phis ticated language-
independent analyses and optimizations. For example, field-
sensitive points-to analyses [25, 31], call graph construc-
tion (including for object-oriented languages like C++),
scalar promotion of aggregates, and structure field reorder-
ing transformations [12], only use pointers, structures, func-
tions, and primitive data types, while array dependence
analysis and loop transformations use all those plus array
types.
Because LLVM is language independent and must support
weakly-typed languages, declared type information in a legal
LLVM program may not be reliable. Instead, some p o inter
analysis algo rithm must be used to distinguish memory ac-
cesses for which the type of the pointer target is reliably
known from those for which it is not. LLVM includes such
an analysis described in Section 4.1.1. Our results show that
despite allowing values to be arbitrarily cast to other types,
reliable type information is avai lable for a large fraction of
memory accesses in C programs compiled to LLVM.
The LLVM ‘cast’ instruction is used to convert a value of
one type to another arbitrary type, and is the only way to
perform such conversions. Casts thus make all type conver-
sions explicit, including typ e coercion (there are no mixed-
type operations in LLVM), explicit casts for physical sub-
typing, and reinterpreting casts for non-type-safe code. A
program without casts is necessarily type-safe (in the a b-
sence of memory access errors, e.g., array overflow [19]).
A critical difficulty in preserving type information for
low-level code is implementi ng address arithmetic. The
getelementptr instruction is used by the LLVM system to
perform pointer arithmetic in a way that both preserves type
information and has ma chine-independent semantics. Given
a typed pointer to an object of s ome aggregate type, this in-
struction calculates the address of a sub-element of the ob-
ject in a type-preserving manner (effectively a combined ‘.’
and ‘[ ]’ operator for LLVM). For example, the C statement
“X[i].a = 1;” could be translat ed into the pair of LLVM
instructions:
%p = getelementptr %xty* %X, long %i, ubyte 3;
store int 1, int* %p;
where we assume a is field number 3 within the structure
X[i], and the structure is of type %xty. Ma king all address
arithmetic explicit is important so that it is exposed to al l
LLVM optimizations (most importantly, reassociation and
redundancy elimination); getelementptr achieves this with-
out obscuring the type information. Load and store instruc-
tions take a single pointer and do not perform a ny i ndexing,