mented. For example, in Figure 1, there are two instances
of INST_PUSH. In the context of vPC=0, the dispatch at
the end of the INST_PUSH body results in a native indirect
branch back to the start of the INST_PUSH body (since the
next virtual instruction at vPC=2 is also an INST_PUSH).
However, the target of the same native indirect branch in
the context of vPC=2 is determined by the address stored
at vPC=4, which in this example is an INST_MUL opcode.
Thus, the target of the indirect branch depends on the vir-
tual context—the vPC—rather than the hardware pc of the
branch, causing the hardware to speculate incorrectly or not
at all. We refer to this lack of correlation between the native
PC and the vPC as the context problem.
3 Related Work
Much of the work on interpreters has focused on the dis-
patch problem. Kogge [12] remains a definitive description
of many threaded code dispatch techniques. These can be
divided into two broad classes: those which refine the dis-
patch itself, and those which alter the bodies so that there
are more efficient or simply fewer dispatches. Switch and
direct threading belong to the first class, as does subroutine
threading, discussed next. Later, we will discuss superin-
structions and replication, which are in the second class.
We are particularly interested in subroutine threading and
replication because they both provide context to the branch
prediction hardware.
Some Forth interpreters use subroutine-threaded dis-
patch. Here, the program is not represented as a list of
body addresses, but instead as a sequence of native calls
to the bodies, which are then constructed to end with na-
tive returns. Curley [3, 4] describes a subroutine-threaded
Forth for the 68000 CPU. He improves the resulting code by
inlining small opcode bodies, and converts virtual branch
opcodes to single native branch instructions. He cred-
its Charles Moore, the inventor of Forth, with discovering
these ideas much earlier. Outside of Forth, there is lit-
tle thorough literature on subroutine threading. In partic-
ular, few authors address the problem of where to store vir-
tual instruction operands. In Section 4, we document how
operands are handled in our implementation of subroutine
threading.
The choice of optimal dispatch technique depends on the
hardware platform, because dispatch is highly dependent on
micro-architectural features. On earlier hardware, call and
return were both expensive and hence subroutine thread-
ing required two costly branches, versus one in the case of
direct threading. Rodriguez [17] presents the tradeoffs for
various dispatch types on several 8- and 16-bit CPUs. For
example, he finds direct threading is faster than subroutine
threading on a 6809 CPU, because the JSR and RET instructions
require extra cycles to push and pop the return address
stack. On the other hand, Curley found subroutine thread-
ing faster on the 68000 [3]. On modern hardware the cost
of the call and return is much lower, due to return branch
prediction hardware, while the cost of direct threading has
increased due to misprediction. In Section 5 we demon-
strate this effect on several modern CPUs.
Superinstructions reduce the number of dispatches. Consider
the code to add a constant integer to a variable. This
may require loading the variable onto the stack, loading the
constant, adding, and storing back to the variable. VM de-
signers can instead extend the virtual instruction set with a
single superinstruction that performs the work of all four
instructions. This technique is limited, however, because
the virtual instruction encoding (often one byte per opcode)
may allow only a limited number of instructions, and the
number of desirable superinstructions grows exponentially
in the number of subsumed atomic instructions. Further-
more, the optimal superinstruction set may change based
on the workload. One approach uses profile-feedback to
select and create the superinstructions statically (when the
interpreter is compiled [8]).
Piumarta [15] presents selective inlining. It constructs
superinstructions when the virtual program is loaded. They
are created in a relatively portable way, by memcpy’ing the
native code in the bodies, again using GNU C labels-as-
values. This technique was first documented earlier [19],
but Piumarta’s independent discovery inspired many other
projects to exploit selective inlining. Like us, he applied his
optimization to OCaml, and reports significant speedup on
several microbenchmarks. As we discuss in Section 5.4, our
technique is separate from, but supports and indeed facili-
tates, inlining optimizations.
Only certain classes of opcode bodies can be relocated
using memcpy alone—the body must contain no pc-relative
instructions (typically this excludes C function calls). Se-
lective inlining requires that the superinstruction starts at
a virtual basic block, and ends at or before the end of
the block. Ertl’s dynamic superinstructions [6] also use
memcpy, but are applied to effect a simple native compi-
lation by inlining bodies for nearly every virtual instruc-
tion. Ertl shows how to avoid the virtual basic block con-
straints, so dispatch to interpreter code is only required for
virtual branches and un-relocatable bodies. Catenation [24]
patches Sparc native code so that all implementations can be
moved, specializes operands, and converts virtual branches
to native, thereby eliminating the virtual program counter.
Replication—creating multiple copies of the opcode
body—decreases the number of contexts in which it is exe-
cuted, and hence increases the chances of successfully pre-
dicting the successor [6]. Replication implemented by in-
lining opcode bodies reduces the number of dispatches, and
therefore, the average dispatch overhead [15]. In the ex-
treme, one could create a copy for each instruction, elimi-