push %ebx                  ; IDENT
mov %eax,%edx
and $0xfd,%gs:vcpu.flags   ; PRIV
mov $1,%ecx                ; IDENT
xor %ebx,%ebx
jmp [doTest]               ; JMP
Each translator invocation consumes one TU and produces
one compiled code fragment (CCF). Although we show CCFs
in textual form with labels like vcpu.flags, in reality the
translator produces binary code directly.
After producing the above CCF, the VMM executes it; the code ends with a call to the translator to produce the translation for doTest. This second TU is all IDENT except for the final conditional jnz branch, for which the translator emits two continuations (one for each successor):
jnz [spin]
jmp [done]
To speed up inter-CCF transfers, our translator, like pre-
vious ones [9], employs a “chaining” optimization, allowing
one CCF to jump directly to another without calling out of
the translation cache (TC). These chaining jumps replace
the continuation jumps, which therefore are “execute once.”
Moreover, it is often possible to elide chaining jumps and
fall through from one CCF into the next.
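The patch-on-first-use behavior of chaining can be sketched as a small Python model (the names CCF, transfer, and continuation below are illustrative, not the VMM's own):

```python
# Toy model of the chaining optimization. A CCF's successor slot starts
# empty, standing in for an "execute once" continuation that calls out
# of the TC into the translator; the first transfer patches the slot so
# later executions jump CCF-to-CCF without leaving the TC.

class CCF:
    def __init__(self, label):
        self.label = label        # guest address this CCF translates
        self.succ = None          # chained successor, None until patched

translator_calls = 0

def continuation(frm, target):
    """Leave the TC: translate the target TU, then chain the jump."""
    global translator_calls
    translator_calls += 1         # one TU translated
    frm.succ = target             # chaining: rewrite the jump in place
    return target

def transfer(frm, target):
    """Inter-CCF transfer: chained jumps never leave the TC."""
    return frm.succ if frm.succ is not None else continuation(frm, target)

entry, do_test = CCF("entry"), CCF("doTest")
transfer(entry, do_test)          # first pass: continuation runs once
transfer(entry, do_test)          # second pass: patched jump, no translator
print("translator calls:", translator_calls)  # → translator calls: 1
```

The point of the model is the asymmetry: the translator is invoked once per CCF no matter how many times the transfer executes afterward.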
For conditional branches, at most one of the two successors can use fall-through. The other must remain in the translated code as a conditional branch, initially invoking the
continuation, but, once the translated target is produced,
redirected to this target. (Sometimes, to avoid code dupli-
cation, no successor can use fall-through, so the final transla-
tion uses a jcc/jmp pair of instructions to connect to each of
the successors.) Since translation and execution interleave,
the first of the two continuations to execute is most likely
to receive the beneficial fall-through treatment. If the first
and subsequent executions follow similar paths, this tends
to straighten code for good i-cache performance. In effect,
the translator builds execution traces in the TC, even as it
works through guest code in smaller TU chunks.
This interleaving of translation and execution continues for
as long as the guest runs kernel code, with a decreasing
proportion of translation as the TC gradually captures the
guest’s working set. For the spin lock example, after one spin-free acquisition, translation results in this code in the TC:
* push %ebx                 ; IDENT
  mov %eax,%edx
  and $0xfd,%gs:vcpu.flags  ; PRIV
  mov $1,%ecx               ; IDENT
  xor %ebx,%ebx
* mov %ebx,%eax             ; IDENT
  lock
  cmpxchg %eax,%ecx,(%edx)
  test %eax,%eax
  jnz [spin]                ; JCC
* pop %ebx                  ; IDENT
* mov %eax,%gs:scratchEAX   ; RET_LAUNCH
  mov %ecx,%gs:scratchECX
  pop %eax
  movzx %al,%ecx
  jmp %gs:rtc(4*%ecx)
Above, there are four CCFs with the leading instruction
in each one marked with an asterisk. The continuation to
the spin label remains untranslated as it has not executed
yet. The code that was executed now sits in a straight line
without jumping about as the original code did.
The last CCF above terminates with a “launch” sequence for
a return translation, the details of which have been described
previously [2].
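One plausible reading of that launch sequence, sketched below as a guess rather than the actual mechanism (which [2] describes): the movzx %al,%ecx and jmp %gs:rtc(4*%ecx) pair suggest a 256-entry table of 4-byte slots indexed by the low byte of the guest return address. The table name rtc comes from the listing; the miss-handling convention here is an assumption.

```python
# Hypothetical model of the RET_LAUNCH dispatch, inferred from the
# listing above; the real return translation is described in [2].

MISS = "call_translator"          # assumed slot contents on a cache miss

rtc = [MISS] * 256                # %gs:rtc in the listing: 256 slots

def launch_return(guest_ret_addr):
    slot = guest_ret_addr & 0xFF  # movzx %al,%ecx: index by low byte
    return rtc[slot]              # jmp %gs:rtc(4*%ecx): indirect dispatch

rtc[0x34] = "translated@0x1234"   # pretend this target was translated
print(launch_return(0x1234))      # hits the cached translation
print(launch_return(0x5678))      # misses: falls back to the translator
```

Indexing by only the low byte keeps the dispatch to a few instructions at the cost of occasional collisions, which the miss path must resolve.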
For a bigger example than the spin lock, but nevertheless one
that runs in exactly the same manner, booting Windows XP
Professional and then immediately shutting it down trans-
lates 933,444 32-bit TUs and 28,339 16-bit TUs. While this
may seem like a lot, translating each unit takes just 3 mi-
croseconds for a total translation time of about 3 seconds.
Against a background of a one minute boot/halt, and keep-
ing in mind that a boot workload has an unusually high
proportion of cold code, the cost of running the translator
is acceptable.
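The arithmetic behind that estimate can be checked directly:

```python
# Sanity check of the translation-cost figures quoted above.
tus = 933_444 + 28_339          # 32-bit plus 16-bit TUs
per_tu_us = 3                   # ~3 microseconds per translated unit
total_s = tus * per_tu_us / 1e6
print(f"{tus} TUs -> {total_s:.2f} s")  # → 961783 TUs -> 2.89 s
```

Roughly 2.9 seconds of translation against a one-minute boot/halt is under 5% of wall-clock time, even before amortization over longer runs.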
The translator does not attempt to “improve” the translated
code. We assume that if guest code is performance critical,
the OS developers have optimized it and a simple binary
translator would find few remaining opportunities. Thus,
instead of applying deep analysis to support manipulation
of guest code, we disturb it minimally.
Most virtual registers bind to their physical counterparts
during execution of TC code to facilitate IDENT transla-
tion. One exception is the segment register %gs. It provides
an escape into VMM-level data structures; see Section 3.3.
The ret translation above uses %gs overrides to spill %eax
and %ecx into VMM memory so that they can be used as
working registers in the translation of ret. Later, of course,
the guest’s %eax and %ecx values must be reloaded into the
hardware registers.
As with registers, the translator binds guest ALU-flags (CF,
PF, AF, ZF, SF, OF) to their physical counterparts. Since
many x86 ALU instructions modify flags, nontrivial trans-
lations often must save and restore guest flags around flags-
clobbering operations. For example, this applies to the cli
translation, where the use of and clobbers guest flags. How-
ever, in the above example, the translator avoided flags
save/restore code by looking ahead to see that the guest
soon will execute an xor, which (re)defines all flags. To en-
sure that even the guest’s interrupt handler has a consistent
view of flags, the VMM defers virtual interrupts until the
xor that terminates the flags-optimized region.
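The lookahead that justifies omitting the save/restore can be sketched as a small flags-liveness check; the instruction model and function name below are illustrative, not the translator's actual representation:

```python
# Guest flags are dead after instruction i if, within a bounded
# lookahead window, some instruction redefines all flags (like xor)
# before any instruction reads them. Dead flags need no save/restore
# around a flags-clobbering translation such as the one for cli.

def flags_dead_after(instrs, i, window=8):
    """Return True if guest ALU flags are dead right after instrs[i]."""
    for ins in instrs[i + 1 : i + 1 + window]:
        if ins["reads_flags"]:
            return False          # someone consumes the guest flags
        if ins["defines_all_flags"]:
            return True           # e.g. xor: all flags redefined
    return False                  # unknown beyond the window: be safe

code = [
    {"op": "cli", "reads_flags": False, "defines_all_flags": False},
    {"op": "mov", "reads_flags": False, "defines_all_flags": False},
    {"op": "xor", "reads_flags": False, "defines_all_flags": True},
]
print(flags_dead_after(code, 0))  # → True: cli needs no flags save/restore
```

The conservative default when the window runs out mirrors the safety requirement in the text: the optimization applies only when a full redefinition is provably reached first.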
3.2 Virtualized Memory: Shadow Page Tables
The x86 architecture has supported virtual memory since
the 80386 with an MMU consisting of a TLB and a hardware
page table walker. The walker fills the TLB by traversing hi-
erarchical page tables in physical memory. Originally, these
page tables were two levels deep but were extended to three
and later four levels (see Section 5). The walker may spec-