We performed the following characterization experiments:
—Cores: As we discussed in Section 2, we cannot share a logical core (a single HyperThread)
between an LC and a BE task because OS scheduling can introduce latency
spikes on the order of tens of milliseconds [Leverich and Kozyrakis 2014]. Hence, we
focus on the potential of using separate HyperThreads that run pinned on the same
physical core. We characterize the impact on the LC task of a colocated HyperThread
that implements a tight spinloop (sketched after this list). This experiment captures
a lower bound of HyperThread interference. A more compute- or memory-intensive
microbenchmark would antagonize the LC HyperThread for more core resources (e.g.,
execution units) and for space in the private caches (L1 and L2). Hence, if this
experiment shows a high impact on tail latency, we can conclude that core sharing
through HyperThreads is not a practical option.
—LLC: The interference impact of LLC antagonists is measured by pinning the LC
workload to enough cores to satisfy its SLO at the specific load and pinning a cache
antagonist that streams through a large data array on the remaining cores of the
socket (see the streaming sketch after this list). We use several array sizes that take
up a quarter, half, and almost all of the LLC, and denote these configurations as LLC
small, medium, and big, respectively.
—DRAM bandwidth: The impact of DRAM bandwidth interference is characterized in
a similar fashion to LLC interference, using a significantly larger array for streaming.
We use numactl to ensure that the DRAM antagonist and the LC task are placed on
the same socket(s) and that all memory channels are stressed.
—Network traffic: We use iperf, an open-source TCP streaming benchmark [iperf
2011], to saturate the network transmit (outgoing) bandwidth. All cores except for
one are given to the LC workload. Since the LC workloads we consider serve requests
from multiple clients connecting to the service they provide, we generate interference
in the form of many low-bandwidth “mice” flows (a simplified flow generator is
sketched after this list). Network interference can also be generated using a few
“elephant” flows. However, such flows can be effectively throttled by TCP congestion
control [Briscoe 2007], whereas the many “mice” flows of the LC workload will not be
impacted.
—Power: To characterize the latency impact of a power antagonist, we use the same
division of cores as when generating LLC and DRAM interference. Instead of running
a memory access antagonist, we run a CPU power virus (sketched after this list) that
is designed to stress all the components of the core, leading to high power draw and
lower CPU core frequencies.
—OS Isolation: For completeness, we evaluate the overall impact of running a BE
task along with an LC workload using only the isolation mechanisms available in
the OS. Namely, we execute the two workloads in separate Linux containers and
set the BE workload to be low priority. The scheduling policy is enforced by CFS
using the shares parameter, where the BE task receives very few shares compared
to the LC workload (a minimal sketch of this setup follows the list). No other
isolation mechanisms are used in this case. The BE
task is the Google brain workload [Le et al. 2012; Rosenberg 2013], which we will
describe further in Section 5.1.
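
A minimal sketch of the HyperThread spinloop antagonist is shown below. It is purely
illustrative (not necessarily the exact code used in the experiments) and assumes the
target logical CPU id is passed as a command-line argument; the LC task would be pinned
to the sibling HyperThread of the same physical core.

    /* Tight spinloop pinned to one logical CPU (HyperThread). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <logical-cpu-id>\n", argv[0]);
            return 1;
        }
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(atoi(argv[1]), &set);          /* pin to the given HyperThread */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        volatile unsigned long spins = 0;
        for (;;)       /* tight spinloop: occupies issue slots on the shared core */
            spins++;   /* while touching almost no cache or memory bandwidth      */
    }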
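
The cache and memory antagonists follow the sketch below, which assumes the array size
is given in MiB on the command line. Sizing the array to roughly a quarter, half, or
almost all of the LLC yields the LLC small, medium, and big configurations; the DRAM
antagonist is the same loop with a multi-GiB array, launched under numactl as described
above.

    /* Streaming antagonist: repeatedly walks a large array at cache-line stride. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv) {
        size_t mib = (argc > 1) ? strtoull(argv[1], NULL, 10) : 8;  /* size in MiB */
        size_t n = mib * 1024 * 1024 / sizeof(long);
        long *a = malloc(n * sizeof(long));
        if (a == NULL) { perror("malloc"); return 1; }
        memset(a, 1, n * sizeof(long));
        volatile long sink = 0;
        for (;;) {
            /* Stride by one 64-byte cache line (8 longs) so that, once the array
             * exceeds the private caches, every access competes for LLC or DRAM. */
            for (size_t i = 0; i < n; i += 8)
                sink += a[i];
        }
    }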
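
The network antagonist in the experiments is iperf itself; the sketch below is only a
simplified stand-in that illustrates the “mice” pattern: many concurrent TCP flows, each
sending a small buffer at a low rate, so that no single connection builds up a large
congestion window. The sink address, port, and flow count are placeholders.

    /* Simplified "mice" flow generator: many low-bandwidth TCP streams. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static struct sockaddr_in sink;

    static void *mouse_flow(void *arg) {
        (void)arg;
        char buf[1024];
        memset(buf, 'x', sizeof(buf));
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&sink, sizeof(sink)) < 0)
            return NULL;
        for (;;) {                          /* ~1 KiB every 10 ms per flow */
            if (write(fd, buf, sizeof(buf)) < 0) break;
            usleep(10000);
        }
        close(fd);
        return NULL;
    }

    int main(void) {
        sink.sin_family = AF_INET;
        sink.sin_port = htons(5001);                      /* placeholder sink port    */
        inet_pton(AF_INET, "10.0.0.2", &sink.sin_addr);   /* placeholder sink address */
        for (int i = 0; i < 512; i++) {                   /* 512 concurrent mice flows */
            pthread_t t;
            pthread_create(&t, NULL, mouse_flow, NULL);
        }
        pause();                                          /* flows run until killed */
    }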
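
The power virus can be approximated by the sketch below, which is illustrative rather
than carefully tuned: one thread per logical core mixes floating-point and integer work
to keep many execution units busy, raising package power and, under a power limit,
lowering core frequency. In the actual experiment the virus threads would be pinned to
the cores not used by the LC workload, as with the memory antagonists.

    /* Simple CPU power virus: keep FP and integer units busy on every core. */
    #include <math.h>
    #include <pthread.h>
    #include <unistd.h>

    static void *burn(void *arg) {
        (void)arg;
        double x = 1.000001, y = 0.999999;
        unsigned long k = 1;
        for (;;) {
            x = x * y + 1e-9;                 /* floating-point multiply-add   */
            y = sqrt(x) * 0.5 + 0.5;          /* exercises the FP divider/sqrt */
            k = k * 2654435761u + 1;          /* integer ALU work              */
            if ((k | (unsigned long)x) == 0)  /* never true; defeats dead-code */
                break;                        /* elimination                   */
        }
        return NULL;
    }

    int main(void) {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);   /* one thread per logical core */
        for (long i = 1; i < ncpus; i++) {
            pthread_t t;
            pthread_create(&t, NULL, burn, NULL);
        }
        burn(NULL);                                   /* main thread burns as well */
    }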
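
The OS-only isolation setup can be sketched as follows, assuming cgroup v1 with the cpu
controller mounted at /sys/fs/cgroup/cpu and pre-created lc and be groups (the group
names and the PID are placeholders). The BE group receives very few CFS shares relative
to the LC group, and a process is attached to a group by writing its PID to cgroup.procs.

    /* Give the BE container very few CFS shares relative to the LC container. */
    #include <stdio.h>

    static int write_str(const char *path, const char *val) {
        FILE *f = fopen(path, "w");
        if (f == NULL) { perror(path); return -1; }
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void) {
        /* LC keeps the default weight; BE gets roughly 1/100 of it. */
        write_str("/sys/fs/cgroup/cpu/lc/cpu.shares", "1024");
        write_str("/sys/fs/cgroup/cpu/be/cpu.shares", "10");
        /* Attach an already-running BE process (placeholder PID). */
        write_str("/sys/fs/cgroup/cpu/be/cgroup.procs", "12345");
        return 0;
    }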
3.3. Interference Analysis
Figure 1 presents the impact of the interference microbenchmarks on the tail latency
of the three LC workloads. Each row shows the tail latency of an LC workload at a
given load when it is colocated with the corresponding microbenchmark. The
interference impact is acceptable if and only if the tail latency is less than 100% of the
target SLO. We color-code in red/yellow all cases where the SLO latency is violated.
By observing the rows for brain, we immediately notice that current OS isolation
mechanisms are inadequate for colocating LC tasks with BE tasks. Even at low loads,