mented. For example, in Figure 1, there are two instances
of INST_PUSH. In the context of vPC=0, the dispatch at
the end of the INST_PUSH body results in a native indirect
branch back to the start of the INST_PUSH body (since the
next virtual instruction at vPC=2 is also an INST_PUSH).
However, the target of the same native indirect branch in
the context of vPC=2 is determined by the address stored
at vPC=4, which in this example is an INST_MUL opcode.
Thus, the target of the indirect branch depends on the vir-
tual context—the vPC—rather than the hardware pc of the
branch, causing the hardware to speculate incorrectly or not
at all. We refer to this lack of correlation between the native
PC and the vPC as the context problem.
3 Related Work
Much of the work on interpreters has focused on the dis-
patch problem. Kogge [12] remains a definitive description
of many threaded code dispatch techniques. These can be
divided into two broad classes: those which refine the dis-
patch itself, and those which alter the bodies so that there
are more efficient or simply fewer dispatches. Switch and
direct threading belong to the first class, as does subroutine
threading, discussed next. Later, we will discuss superin-
structions and replication, which are in the second class.
We are particularly interested in subroutine threading and
replication because they both provide context to the branch
prediction hardware.
Some Forth interpreters use subroutine-threaded dis-
patch. Here, the program is not represented as a list of
body addresses, but instead as a sequence of native calls
to the bodies, which are then constructed to end with na-
tive returns. Curley [3, 4] describes a subroutine-threaded
Forth for the 68000 CPU. He improves the resulting code by
inlining small opcode bodies, and converts virtual branch
opcodes to single native branch instructions. He cred-
its Charles Moore, the inventor of Forth, with discovering
these ideas much earlier. Outside of Forth, there is lit-
tle thorough literature on subroutine threading. In partic-
ular, few authors address the problem of where to store vir-
tual instruction operands. In Section 4, we document how
operands are handled in our implementation of subroutine
threading.
The choice of optimal dispatch technique depends on the
hardware platform, because dispatch is highly dependent on
micro-architectural features. On earlier hardware, call and
return were both expensive and hence subroutine thread-
ing required two costly branches, versus one in the case of
direct threading. Rodriguez [17] presents the tradeoffs for
various dispatch types on several 8- and 16-bit CPUs. For
example, he finds direct threading is faster than subroutine
threading on a 6809 CPU, because the JSR and RET instructions
require extra cycles to push and pop the return address
stack. On the other hand, Curley found subroutine thread-
ing faster on the 68000 [3]. On modern hardware the cost
of the call and return is much lower, due to return branch
prediction hardware, while the cost of direct threading has
increased due to misprediction. In Section 5 we demon-
strate this effect on several modern CPUs.
Superinstructions reduce the number of dispatches. Consider
the code to add a constant integer to a variable. This
may require loading the variable onto the stack, loading the
constant, adding, and storing back to the variable. VM de-
signers can instead extend the virtual instruction set with a
single superinstruction that performs the work of all four
instructions. This technique is limited, however, because
the virtual instruction encoding (often one byte per opcode)
may allow only a limited number of instructions, and the
number of desirable superinstructions grows exponentially
in the number of subsumed atomic instructions. Further-
more, the optimal superinstruction set may change based
on the workload. One approach uses profile-feedback to
select and create the superinstructions statically (when the
interpreter is compiled [8]).
Piumarta [15] presents selective inlining. It constructs
superinstructions when the virtual program is loaded. They
are created in a relatively portable way, by memcpy’ing the
native code in the bodies, again using GNU C labels-as-
values. This technique was first documented earlier [19],
but Piumarta’s independent discovery inspired many other
projects to exploit selective inlining. Like us, he applied his
optimization to OCaml, and reports significant speedup on
several microbenchmarks. As we discuss in Section 5.4, our
technique is separate from, but supports and indeed facili-
tates, inlining optimizations.
Only certain classes of opcode bodies can be relocated
using memcpy alone—the body must contain no pc-relative
instructions (typically this excludes C function calls). Se-
lective inlining requires that the superinstruction starts at
a virtual basic block, and ends at or before the end of
the block. Ertl’s dynamic superinstructions [6] also use
memcpy, but are applied to effect a simple native compi-
lation by inlining bodies for nearly every virtual instruc-
tion. Ertl shows how to avoid the virtual basic block con-
straints, so dispatch to interpreter code is only required for
virtual branches and un-relocatable bodies. Catenation [24]
patches Sparc native code so that all implementations can be
moved, specializes operands, and converts virtual branches
to native, thereby eliminating the virtual program counter.
Replication—creating multiple copies of the opcode
body—decreases the number of contexts in which it is exe-
cuted, and hence increases the chances of successfully pre-
dicting the successor [6]. Replication implemented by in-
lining opcode bodies reduces the number of dispatches, and
therefore, the average dispatch overhead [15]. In the ex-
treme, one could create a copy for each instruction, elimi-