interrupts, such as for I/O intensive workloads with SR-IOV, such
software techniques do not alleviate the overhead.
Dong et al. [15] discuss a framework for implementing SR-IOV
support in the Xen hypervisor. Their results show that SR-IOV can
achieve line rate with a 10Gbps network interface controller (NIC).
However, the CPU utilization is 148% of bare metal. In addition,
this result is achieved using adaptive interrupt coalescing, which
increases I/O latency.
Like ELI, several studies attempted to reduce the aforementioned
extra overhead of interrupts in virtual environments. vIC [4] discusses a method for interrupt coalescing in virtual storage devices
and shows an improvement of up to 5% in a macro benchmark.
Their method decides how much to coalesce based on the number of
“commands in flight”. Therefore, as the authors say, this approach
cannot be used for network devices due to the lack of information
on commands (or packets) in flight. Furthermore, no comparison is
made with bare-metal performance. Dong et al. [14] use virtual interrupt coalescing via polling in the guest and receive-side scaling to
reduce network overhead in a paravirtual environment. But polling
has its drawbacks, as discussed above, and ELI improves the more
performance-oriented device assignment environment.
In CDNA [51], the authors propose a method for concurrent and
direct network access for virtual machines. This method requires
physical changes to NICs akin to SR-IOV. With CDNA, the NIC
and the hypervisor split the work of multiplexing several guests’
network flows onto a single NIC. In the CDNA model the hypervisor
is still involved in the I/O path. While CDNA significantly increases
throughput compared to the standard paravirtual driver in Xen, it is
still 2x–3x slower than bare metal.
SplitX [26] proposes hardware extensions for running virtual
machines on dedicated cores, with the hypervisor running in parallel
on a different set of cores. Interrupts arrive only at the hypervisor
cores and are then sent to the appropriate guests via an exitless
inter-core communication mechanism. In contrast, with ELI the hypervisor can share cores with its guests and, instead of injecting interrupts into guests, it programs the interrupts to arrive at them directly.
Moreover, ELI does not require any hardware modifications and runs
on current hardware.
NoHype [24, 48] argues that modern hypervisors are prone to
attacks by their guests. In the NoHype model, the hypervisor is a
thin layer that starts, stops, and performs other administrative actions
on guests, but is not otherwise involved. Guests use assigned devices
and interrupts are delivered directly to guests. No details of the
implementation or performance results are provided. Instead, the
authors focus on describing the security and other benefits of the
model. In addition, NoHype requires a modified and trusted guest.
In Following the White Rabbit [52], the authors show several
interrupt-based attacks on hypervisors, which can be addressed
through the use of interrupt remapping [1]. Interrupt remapping
can stop the guest from sending arbitrary interrupts to the host; it
does not, as its name might imply, provide a mechanism for secure
and direct delivery of interrupts to the guest. Since ELI delivers
interrupts directly to guests, bypassing the host, the hypervisor is
immune to certain interrupt-related attacks.
3. x86 Interrupt Handling
ELI gives untrusted and unmodified guests direct access to the
architectural interrupt handling mechanisms in such a way that
the host and other guests remain protected. To put ELI’s design
in context, we begin with a short overview of how interrupt handling
works on x86 today.
3.1 Interrupts in Bare-Metal Environments
x86 processors use interrupts and exceptions to notify system
software about incoming events. Interrupts are asynchronous events
generated by external entities such as I/O devices; exceptions are
synchronous events—such as page faults—caused by the code being
executed. In both cases, the currently executing code is interrupted
and execution jumps to a pre-specified interrupt or exception handler.
x86 operating systems specify handlers for each interrupt and exception using an architected in-memory table, the Interrupt Descriptor Table (IDT). This table contains up to 256 entries, each entry containing a pointer to a handler. Each architecturally-defined exception or interrupt has a numeric identifier—an exception number or interrupt vector—which is used as an index into the table. The operating system can use one IDT for all of the cores or a separate
IDT per core. The operating system notifies the processor where
each core’s IDT is located in memory by writing the IDT’s virtual
memory address into the Interrupt Descriptor Table Register (IDTR).
Since the IDTR holds the virtual (not physical) address of the IDT,
the OS must always keep the corresponding address mapped in
the active set of page tables. In addition to the table’s location in
memory, the IDTR also holds the table’s size.
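To make the table layout concrete, the following is a minimal C sketch of how a 64-bit x86 OS might populate an IDT entry and load the IDTR with the lidt instruction. The struct and helper names are ours, not from any particular kernel, and the kernel code-segment selector (0x08) is an assumption for illustration.

#include <stdint.h>

/* One 64-bit IDT gate descriptor (16 bytes), per the x86-64 layout. */
struct idt_entry {
    uint16_t offset_low;    /* handler address bits 0..15  */
    uint16_t selector;      /* code segment selector       */
    uint8_t  ist;           /* interrupt stack table index */
    uint8_t  type_attr;     /* gate type, DPL, present bit */
    uint16_t offset_mid;    /* handler address bits 16..31 */
    uint32_t offset_high;   /* handler address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

/* The value loaded into the IDTR: the table's size (limit) and its
 * virtual address, as described above. */
struct idtr {
    uint16_t limit;
    uint64_t base;
} __attribute__((packed));

static struct idt_entry idt[256];   /* up to 256 entries */

static void set_gate(int vector, void (*handler)(void))
{
    uint64_t addr = (uint64_t)handler;
    idt[vector].offset_low  = addr & 0xffff;
    idt[vector].selector    = 0x08;   /* kernel code segment (assumed) */
    idt[vector].ist         = 0;
    idt[vector].type_attr   = 0x8e;   /* present, DPL 0, interrupt gate */
    idt[vector].offset_mid  = (addr >> 16) & 0xffff;
    idt[vector].offset_high = addr >> 32;
    idt[vector].reserved    = 0;
}

static void load_idt(void)
{
    struct idtr r = { sizeof(idt) - 1, (uint64_t)idt };
    asm volatile("lidt %0" : : "m"(r));  /* point this core's IDTR at the table */
}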
When an external I/O device raises an interrupt, the processor
reads the current value of the IDTR to find the IDT. Then, using
the interrupt vector as an index into the IDT, the CPU obtains the
virtual address of the corresponding handler and invokes it. Further
interrupts may or may not be blocked while an interrupt handler
runs.
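The hardware side of this dispatch can be modeled roughly as the C fragment below, reusing the idt_entry and idtr definitions from the sketch above. This is a simplification: real hardware additionally performs privilege checks and stack switching, and pushes an interrupt frame before invoking the handler.

/* Rough software model of the CPU's dispatch step, for illustration only. */
void cpu_dispatch_interrupt(uint8_t vector)
{
    struct idtr r;
    asm volatile("sidt %0" : "=m"(r));      /* read the current IDTR */
    struct idt_entry *idt = (struct idt_entry *)r.base;
    struct idt_entry *e = &idt[vector];     /* the vector indexes the IDT */
    uint64_t addr = (uint64_t)e->offset_low |
                    ((uint64_t)e->offset_mid  << 16) |
                    ((uint64_t)e->offset_high << 32);
    ((void (*)(void))addr)();               /* invoke the handler */
}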
System software needs to perform operations such as enabling
and disabling interrupts, signaling the completion of interrupt handlers, configuring the timer interrupt, and sending inter-processor interrupts (IPIs). Software performs these operations through the Local Advanced Programmable Interrupt Controller (LAPIC) interface. The LAPIC has multiple registers used to configure, deliver, and signal completion of interrupts. Signaling the completion of interrupts,
which is of particular importance to ELI, is done by writing to the
end-of-interrupt (EOI) LAPIC register. The newest LAPIC interface,
x2APIC [20], exposes its registers using model-specific registers
(MSRs), which are accessed through “read MSR” and “write MSR”
instructions. Previous LAPIC interfaces exposed the registers only
in a pre-defined memory area which is accessed through regular
load and store instructions.
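As a concrete illustration of the difference, the C fragment below signals EOI first through the x2APIC MSR interface and then through the older memory-mapped interface. The register numbers follow the Intel manuals; the mapping of the legacy LAPIC page (by default at physical address 0xFEE00000) is assumed to be already established.

#include <stdint.h>

#define IA32_X2APIC_EOI  0x80B  /* x2APIC exposes EOI as MSR 0x80B */
#define XAPIC_EOI_OFFSET 0xB0   /* EOI offset in the legacy LAPIC MMIO page */

/* x2APIC: signal end-of-interrupt with a single "write MSR" instruction. */
static inline void x2apic_eoi(void)
{
    asm volatile("wrmsr" : : "c"(IA32_X2APIC_EOI), "a"(0), "d"(0));
}

/* Legacy LAPIC: signal end-of-interrupt with a regular store to the
 * memory-mapped EOI register. */
static inline void xapic_eoi(volatile uint32_t *lapic_base)
{
    lapic_base[XAPIC_EOI_OFFSET / 4] = 0;  /* the written value is ignored */
}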
3.2 Interrupts in Virtual Environments
x86 hardware virtualization [5, 50] provides two modes of operation, guest mode and host mode. The host, running in host mode, uses
guest mode to create new contexts for running guest virtual machines.
Once the processor starts running a guest, execution continues in
guest mode until some sensitive event [36] forces an exit back
to host mode. The host handles any necessary events and then
resumes the execution of the guest, causing an entry into guest
mode. These exits and entries are the primary cause of virtualization overhead [2, 9, 26, 37]. The overhead is particularly pronounced in I/O intensive workloads [26, 31, 38, 46]. It comes from the cycles spent by the processor switching between contexts, the time spent in host mode to handle the exit, and the resulting cache pollution [2, 9, 19, 26].
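For readers unfamiliar with this exit/entry cycle, the skeleton below shows its shape using Linux's KVM API. Guest memory and register setup are elided for brevity (so KVM_RUN would fail as written); the point is only that every sensitive event returns control to this host-mode loop before the guest can run again, and each pass pays the switching and cache-pollution costs discussed above.

#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int run_guest(void)
{
    int kvm  = open("/dev/kvm", O_RDWR);
    int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(0, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* ...map guest memory and initialize vCPU registers here... */

    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);     /* entry: switch to guest mode */
        switch (run->exit_reason) {  /* exit: we are back in host mode */
        case KVM_EXIT_IO:            /* guest touched an I/O port */
            /* emulate the access, then loop to re-enter the guest */
            break;
        case KVM_EXIT_HLT:           /* guest executed HLT */
            return 0;
        default:                     /* other sensitive events */
            break;
        }
    }
}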
This work focuses on running unmodified and untrusted operating systems. On the one hand, unmodified guests are not aware they
run in a virtual machine, and they expect to control the IDT exactly
as they do on bare metal. On the other hand, the host cannot easily
give untrusted and unmodified guests control of each core’s IDT.
This is because having full control over the physical IDT implies
total control of the core. Therefore, x86 hardware virtualization extensions use a different IDT for each mode. Guest mode execution
on each core is controlled by the guest IDT and host mode execution
is controlled by the host IDT. An I/O device can raise a physical
interrupt when the CPU is executing either in host mode or in guest
mode. If the interrupt arrives while the CPU is in guest mode, the