interrupts, such as for I/O intensive workloads with SR-IOV, such
software techniques do not alleviate the overhead.
Dong et al. [15] discuss a framework for implementing SR-IOV
support in the Xen hypervisor. Their results show that SR-IOV can
achieve line rate with a 10Gbps network interface controller (NIC).
However, the CPU utilization is 148% of bare metal. In addition,
this result is achieved using adaptive interrupt coalescing, which
increases I/O latency.
Like ELI, several studies attempted to reduce the aforementioned
extra overhead of interrupts in virtual environments. vIC [4] discusses a method for interrupt coalescing in virtual storage devices
and shows an improvement of up to 5% in a macro benchmark.
Their method decides how much to coalesce based on the number of
“commands in flight”. Therefore, as the authors say, this approach
cannot be used for network devices due to the lack of information
on commands (or packets) in flight. Furthermore, no comparison is
made with bare-metal performance. Dong et al. [14] use virtual interrupt coalescing via polling in the guest and receive-side scaling to
reduce network overhead in a paravirtual environment. But polling
has its drawbacks, as discussed above, and ELI improves the more
performance-oriented device assignment environment.
In CDNA [51], the authors propose a method for concurrent and
direct network access for virtual machines. This method requires
physical changes to NICs akin to SR-IOV. With CDNA, the NIC
and the hypervisor split the work of multiplexing several guests’
network flows onto a single NIC. In the CDNA model the hypervisor
is still involved in the I/O path. While CDNA significantly increases
throughput compared to the standard paravirtual driver in Xen, it is
still 2x–3x slower than bare metal.
SplitX [26] proposes hardware extensions for running virtual
machines on dedicated cores, with the hypervisor running in parallel
on a different set of cores. Interrupts arrive only at the hypervisor
cores and are then sent to the appropriate guests via an exitless
inter-core communication mechanism. In contrast, with ELI the hypervisor can share cores with its guests and, instead of injecting interrupts into guests, it programs the interrupts to arrive at them directly.
Moreover, ELI does not require any hardware modifications and runs
on current hardware.
NoHype [24, 48] argues that modern hypervisors are prone to
attacks by their guests. In the NoHype model, the hypervisor is a
thin layer that starts, stops, and performs other administrative actions
on guests, but is not otherwise involved. Guests use assigned devices
and interrupts are delivered directly to guests. No details of the
implementation or performance results are provided. Instead, the
authors focus on describing the security and other benefits of the
model. In addition, NoHype requires a modified and trusted guest.
In Following the White Rabbit [52], the authors show several
interrupt-based attacks on hypervisors, which can be addressed
through the use of interrupt remapping [1]. Interrupt remapping
can stop the guest from sending arbitrary interrupts to the host; it
does not, as its name might imply, provide a mechanism for secure
and direct delivery of interrupts to the guest. Since ELI delivers
interrupts directly to guests, bypassing the host, the hypervisor is
immune to certain interrupt-related attacks.
3. x86 Interrupt Handling
ELI gives untrusted and unmodified guests direct access to the
architectural interrupt handling mechanisms in such a way that
the host and other guests remain protected. To put ELI’s design
in context, we begin with a short overview of how interrupt handling
works on x86 today.
3.1 Interrupts in Bare-Metal Environments
x86 processors use interrupts and exceptions to notify system
software about incoming events. Interrupts are asynchronous events
generated by external entities such as I/O devices; exceptions are
synchronous events—such as page faults—caused by the code being
executed. In both cases, the currently executing code is interrupted
and execution jumps to a pre-specified interrupt or exception handler.
x86 operating systems specify handlers for each interrupt and exception using an architected in-memory table, the Interrupt Descriptor Table (IDT). This table contains up to 256 entries, each entry containing a pointer to a handler. Each architecturally-defined exception or interrupt has a numeric identifier—an exception number or interrupt vector—which is used as an index into the table. The operating system can use one IDT for all of the cores or a separate
IDT per core. The operating system notifies the processor where
each core’s IDT is located in memory by writing the IDT’s virtual
memory address into the Interrupt Descriptor Table Register (IDTR).
Since the IDTR holds the virtual (not physical) address of the IDT,
the OS must always keep the corresponding address mapped in
the active set of page tables. In addition to the table’s location in
memory, the IDTR also holds the table’s size.
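To make the table layout concrete, the following is a minimal C sketch of how a 64-bit x86 OS might populate an IDT entry and load the IDTR with the lidt instruction. The struct and helper names are ours, not from any particular kernel, and the kernel code-segment selector (0x08) is an assumption for illustration.

#include <stdint.h>

/* One 64-bit IDT gate descriptor (16 bytes), per the x86-64 layout. */
struct idt_entry {
    uint16_t offset_low;    /* handler address bits 0..15  */
    uint16_t selector;      /* code segment selector       */
    uint8_t  ist;           /* interrupt stack table index */
    uint8_t  type_attr;     /* gate type, DPL, present bit */
    uint16_t offset_mid;    /* handler address bits 16..31 */
    uint32_t offset_high;   /* handler address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

/* The value loaded into the IDTR: the table's size (limit) and its
 * virtual address, as described above. */
struct idtr {
    uint16_t limit;
    uint64_t base;
} __attribute__((packed));

static struct idt_entry idt[256];   /* up to 256 entries */

static void set_gate(int vector, void (*handler)(void))
{
    uint64_t addr = (uint64_t)handler;
    idt[vector].offset_low  = addr & 0xffff;
    idt[vector].selector    = 0x08;   /* kernel code segment (assumed) */
    idt[vector].ist         = 0;
    idt[vector].type_attr   = 0x8e;   /* present, DPL 0, interrupt gate */
    idt[vector].offset_mid  = (addr >> 16) & 0xffff;
    idt[vector].offset_high = addr >> 32;
    idt[vector].reserved    = 0;
}

static void load_idt(void)
{
    struct idtr r = { sizeof(idt) - 1, (uint64_t)idt };
    asm volatile("lidt %0" : : "m"(r));  /* point this core's IDTR at the table */
}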
When an external I/O device raises an interrupt, the processor
reads the current value of the IDTR to find the IDT. Then, using
the interrupt vector as an index into the IDT, the CPU obtains the
virtual address of the corresponding handler and invokes it. Further
interrupts may or may not be blocked while an interrupt handler
runs.
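The hardware side of this dispatch can be modeled roughly as the C fragment below, reusing the idt_entry and idtr definitions from the sketch above. This is a simplification: real hardware additionally performs privilege checks and stack switching, and pushes an interrupt frame before invoking the handler.

/* Rough software model of the CPU's dispatch step, for illustration only. */
void cpu_dispatch_interrupt(uint8_t vector)
{
    struct idtr r;
    asm volatile("sidt %0" : "=m"(r));      /* read the current IDTR */
    struct idt_entry *idt = (struct idt_entry *)r.base;
    struct idt_entry *e = &idt[vector];     /* the vector indexes the IDT */
    uint64_t addr = (uint64_t)e->offset_low |
                    ((uint64_t)e->offset_mid  << 16) |
                    ((uint64_t)e->offset_high << 32);
    ((void (*)(void))addr)();               /* invoke the handler */
}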
System software needs to perform operations such as enabling
and disabling interrupts, signaling the completion of interrupt handlers, configuring the timer interrupt, and sending inter-processor interrupts (IPIs). Software performs these operations through the Local Advanced Programmable Interrupt Controller (LAPIC) interface. The LAPIC has multiple registers used to configure, deliver, and signal completion of interrupts. Signaling the completion of interrupts,
which is of particular importance to ELI, is done by writing to the
end-of-interrupt (EOI) LAPIC register. The newest LAPIC interface,
x2APIC [20], exposes its registers using model-specific registers
(MSRs), which are accessed through “read MSR” and “write MSR”
instructions. Previous LAPIC interfaces exposed the registers only
in a pre-defined memory area which is accessed through regular
load and store instructions.
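As a concrete illustration of the difference, the C fragment below signals EOI first through the x2APIC MSR interface and then through the older memory-mapped interface. The register numbers follow the Intel manuals; the mapping of the legacy LAPIC page (by default at physical address 0xFEE00000) is assumed to be already established.

#include <stdint.h>

#define IA32_X2APIC_EOI  0x80B  /* x2APIC exposes EOI as MSR 0x80B */
#define XAPIC_EOI_OFFSET 0xB0   /* EOI offset in the legacy LAPIC MMIO page */

/* x2APIC: signal end-of-interrupt with a single "write MSR" instruction. */
static inline void x2apic_eoi(void)
{
    asm volatile("wrmsr" : : "c"(IA32_X2APIC_EOI), "a"(0), "d"(0));
}

/* Legacy LAPIC: signal end-of-interrupt with a regular store to the
 * memory-mapped EOI register. */
static inline void xapic_eoi(volatile uint32_t *lapic_base)
{
    lapic_base[XAPIC_EOI_OFFSET / 4] = 0;  /* the written value is ignored */
}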
3.2 Interrupts in Virtual Environments
x86 hardware virtualization [5, 50] provides two modes of operation, guest mode and host mode. The host, running in host mode, uses
guest mode to create new contexts for running guest virtual machines.
Once the processor starts running a guest, execution continues in
guest mode until some sensitive event [36] forces an exit back
to host mode. The host handles any necessary events and then
resumes the execution of the guest, causing an entry into guest
mode. These exits and entries are the primary cause of virtualization overhead [2, 9, 26, 37]. The overhead is particularly pronounced in I/O intensive workloads [26, 31, 38, 46]. It comes from the cycles spent by the processor switching between contexts, the time spent in host mode to handle the exit, and the resulting cache pollution [2, 9, 19, 26].
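For readers unfamiliar with this exit/entry cycle, the skeleton below shows its shape using Linux's KVM API. Guest memory and register setup are elided for brevity (so KVM_RUN would fail as written); the point is only that every sensitive event returns control to this host-mode loop before the guest can run again, and each pass pays the switching and cache-pollution costs discussed above.

#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int run_guest(void)
{
    int kvm  = open("/dev/kvm", O_RDWR);
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(0, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* ...map guest memory and initialize vCPU registers here... */

    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);     /* entry: switch to guest mode */
        switch (run->exit_reason) {  /* exit: we are back in host mode */
        case KVM_EXIT_IO:            /* guest touched an I/O port */
            /* emulate the access, then loop to re-enter the guest */
            break;
        case KVM_EXIT_HLT:           /* guest executed HLT */
            return 0;
        default:                     /* other sensitive events */
            break;
        }
    }
}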
This work focuses on running unmodified and untrusted operating systems. On the one hand, unmodified guests are not aware they
run in a virtual machine, and they expect to control the IDT exactly
as they do on bare metal. On the other hand, the host cannot easily
give untrusted and unmodified guests control of each core’s IDT.
This is because having full control over the physical IDT implies
total control of the core. Therefore, x86 hardware virtualization extensions use a different IDT for each mode. Guest mode execution
on each core is controlled by the guest IDT and host mode execution
is controlled by the host IDT. An I/O device can raise a physical
interrupt when the CPU is executing either in host mode or in guest
mode. If the interrupt arrives while the CPU is in guest mode, the