push %ebx                  ; IDENT
mov %eax,%edx
and $0xfd,%gs:vcpu.flags   ; PRIV
mov $1,%ecx                ; IDENT
xor %ebx,%ebx
jmp [doTest]               ; JMP
Each translator invocation consumes one TU and produces
one compiled code fragment (CCF). Although we show CCFs
in textual form with labels like vcpu.flags, in reality the
translator produces binary code directly.
After producing the above CCF, the VMM executes it; the code ends with a call to the translator to produce the translation for doTest. This second TU is all IDENT except for the final conditional jnz branch, for which the translator emits two continuations (one for each successor):
jnz [spin]
jmp [done]
To speed up inter-CCF transfers, our translator, like pre-
vious ones [9], employs a “chaining” optimization, allowing
one CCF to jump directly to another without calling out of
the translation cache (TC). These chaining jumps replace
the continuation jumps, which therefore are “execute once.”
Moreover, it is often possible to elide chaining jumps and
fall through from one CCF into the next.
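The patch-on-first-use behavior of chaining can be sketched as a small Python model (the names CCF, transfer, and continuation below are illustrative, not the VMM's own):

```python
# Toy model of the chaining optimization. A CCF's successor slot starts
# empty, standing in for an "execute once" continuation that calls out
# of the TC into the translator; the first transfer patches the slot so
# later executions jump CCF-to-CCF without leaving the TC.

class CCF:
    def __init__(self, label):
        self.label = label        # guest address this CCF translates
        self.succ = None          # chained successor, None until patched

translator_calls = 0

def continuation(frm, target):
    """Leave the TC: translate the target TU, then chain the jump."""
    global translator_calls
    translator_calls += 1         # one TU translated
    frm.succ = target             # chaining: rewrite the jump in place
    return target

def transfer(frm, target):
    """Inter-CCF transfer: chained jumps never leave the TC."""
    return frm.succ if frm.succ is not None else continuation(frm, target)

entry, do_test = CCF("entry"), CCF("doTest")
transfer(entry, do_test)          # first pass: continuation runs once
transfer(entry, do_test)          # second pass: patched jump, no translator
print("translator calls:", translator_calls)  # → translator calls: 1
```

The point of the model is the asymmetry: the translator is invoked once per CCF no matter how many times the transfer executes afterward.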
For conditional branches, at most one of the two successors can use fall-through. The other must remain in the translated code as a conditional branch, initially invoking the
continuation, but, once the translated target is produced,
redirected to this target. (Sometimes, to avoid code dupli-
cation, no successor can use fall-through, so the final transla-
tion uses a jcc/jmp pair of instructions to connect to each of
the successors.) Since translation and execution interleave,
the first of the two continuations to execute is most likely
to receive the beneficial fall-through treatment. If the first
and subsequent executions follow similar paths, this tends
to straighten code for good i-cache performance. In effect,
the translator builds execution traces in the TC, even as it
works through guest code in smaller TU chunks.
This interleaving of translation and execution continues for
as long as the guest runs kernel code, with a decreasing
proportion of translation as the TC gradually captures the
guest’s working set. For the spin lock example, after one spin-free acquisition, translation results in this code in the TC:
* push %ebx                 ; IDENT
  mov %eax,%edx
  and $0xfd,%gs:vcpu.flags  ; PRIV
  mov $1,%ecx               ; IDENT
  xor %ebx,%ebx
* mov %ebx,%eax             ; IDENT
  lock
  cmpxchg %eax,%ecx,(%edx)
  test %eax,%eax
  jnz [spin]                ; JCC
* pop %ebx                  ; IDENT
* mov %eax,%gs:scratchEAX   ; RET_LAUNCH
  mov %ecx,%gs:scratchECX
  pop %eax
  movzx %al,%ecx
  jmp %gs:rtc(4*%ecx)
Above, there are four CCFs with the leading instruction
in each one marked with an asterisk. The continuation to
the spin label remains untranslated as it has not executed
yet. The code that was executed now sits in a straight line
without jumping about as the original code did.
The last CCF above terminates with a “launch” sequence for
a return translation, the details of which have been described
previously [2].
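One plausible reading of that launch sequence, sketched below as a guess rather than the actual mechanism (which [2] describes): the movzx %al,%ecx and jmp %gs:rtc(4*%ecx) pair suggest a 256-entry table of 4-byte slots indexed by the low byte of the guest return address. The table name rtc comes from the listing; the miss-handling convention here is an assumption.

```python
# Hypothetical model of the RET_LAUNCH dispatch, inferred from the
# listing above; the real return translation is described in [2].

MISS = "call_translator"          # assumed slot contents on a cache miss

rtc = [MISS] * 256                # %gs:rtc in the listing: 256 slots

def launch_return(guest_ret_addr):
    slot = guest_ret_addr & 0xFF  # movzx %al,%ecx: index by low byte
    return rtc[slot]              # jmp %gs:rtc(4*%ecx): indirect dispatch

rtc[0x34] = "translated@0x1234"   # pretend this target was translated
print(launch_return(0x1234))      # hits the cached translation
print(launch_return(0x5678))      # misses: falls back to the translator
```

Indexing by only the low byte keeps the dispatch to a few instructions at the cost of occasional collisions, which the miss path must resolve.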
For a bigger example than the spin lock, but nevertheless one
that runs in exactly the same manner, booting Windows XP
Professional and then immediately shutting it down trans-
lates 933,444 32-bit TUs and 28,339 16-bit TUs. While this
may seem like a lot, translating each unit takes just 3 mi-
croseconds for a total translation time of about 3 seconds.
Against a background of a one minute boot/halt, and keep-
ing in mind that a boot workload has an unusually high
proportion of cold code, the cost of running the translator
is acceptable.
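The arithmetic behind that estimate can be checked directly:

```python
# Sanity check of the translation-cost figures quoted above.
tus = 933_444 + 28_339          # 32-bit plus 16-bit TUs
per_tu_us = 3                   # ~3 microseconds per translated unit
total_s = tus * per_tu_us / 1e6
print(f"{tus} TUs -> {total_s:.2f} s")  # → 961783 TUs -> 2.89 s
```

Roughly 2.9 seconds of translation against a one-minute boot/halt is under 5% of wall-clock time, even before amortization over longer runs.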
The translator does not attempt to “improve” the translated
code. We assume that if guest code is performance critical,
the OS developers have optimized it and a simple binary
translator would find few remaining opportunities. Thus,
instead of applying deep analysis to support manipulation
of guest code, we disturb it minimally.
Most virtual registers bind to their physical counterparts
during execution of TC code to facilitate IDENT transla-
tion. One exception is the segment register %gs. It provides
an escape into VMM-level data structures; see Section 3.3.
The ret translation above uses %gs overrides to spill %eax
and %ecx into VMM memory so that they can be used as
working registers in the translation of ret. Later, of course,
the guest’s %eax and %ecx values must be reloaded into the
hardware registers.
As with registers, the translator binds guest ALU-flags (CF,
PF, AF, ZF, SF, OF) to their physical counterparts. Since
many x86 ALU instructions modify flags, nontrivial trans-
lations often must save and restore guest flags around flags-
clobbering operations. For example, this applies to the cli
translation, where the use of and clobbers guest flags. How-
ever, in the above example, the translator avoided flags
save/restore code by looking ahead to see that the guest
soon will execute an xor, which (re)defines all flags. To en-
sure that even the guest’s interrupt handler has a consistent
view of flags, the VMM defers virtual interrupts until the
xor that terminates the flags-optimized region.
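The lookahead that justifies omitting the save/restore can be sketched as a small flags-liveness check; the instruction model and function name below are illustrative, not the translator's actual representation:

```python
# Guest flags are dead after instruction i if, within a bounded
# lookahead window, some instruction redefines all flags (like xor)
# before any instruction reads them. Dead flags need no save/restore
# around a flags-clobbering translation such as the one for cli.

def flags_dead_after(instrs, i, window=8):
    """Return True if guest ALU flags are dead right after instrs[i]."""
    for ins in instrs[i + 1 : i + 1 + window]:
        if ins["reads_flags"]:
            return False          # someone consumes the guest flags
        if ins["defines_all_flags"]:
            return True           # e.g. xor: all flags redefined
    return False                  # unknown beyond the window: be safe

code = [
    {"op": "cli", "reads_flags": False, "defines_all_flags": False},
    {"op": "mov", "reads_flags": False, "defines_all_flags": False},
    {"op": "xor", "reads_flags": False, "defines_all_flags": True},
]
print(flags_dead_after(code, 0))  # → True: cli needs no flags save/restore
```

The conservative default when the window runs out mirrors the safety requirement in the text: the optimization applies only when a full redefinition is provably reached first.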
3.2 Virtualized Memory: Shadow Page Tables
The x86 architecture has supported virtual memory since
the 80386 with an MMU consisting of a TLB and a hardware
page table walker. The walker fills the TLB by traversing hi-
erarchical page tables in physical memory. Originally, these
page tables were two levels deep but were extended to three
and later four levels (see Section 5). The walker may spec-