just on its own load, but also on the intensity of any BE task
running on the same socket. In other words, the performance of
LC tasks can suffer from unexpected drops in frequency due to
colocated tasks. This interference can be mitigated with per-core
dynamic voltage and frequency scaling (DVFS), as cores running BE
tasks can have their frequency decreased to ensure that the LC jobs
maintain a guaranteed frequency. A static policy would run all
BE jobs at minimum frequency, thus ensuring that the LC tasks
are not power-limited. However, this approach severely penalizes
the vast majority of BE tasks. Most BE jobs do not have the profile
of a power virus (a computation that maximizes the activity and
power consumption of a core), and LC tasks only need the additional
frequency boost during periods of high load. Thus, a dynamic
solution that adjusts the allocation of power between cores is needed
to ensure that LC cores run at a guaranteed minimum frequency
while maximizing the frequency of cores running BE tasks.
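To make this concrete, the sketch below shows the shape of such a dynamic
per-core DVFS policy on a Linux host that exposes the cpufreq sysfs
interface. The core IDs, load thresholds, and the LC load probe are
illustrative assumptions, not part of the system described here.

```python
# Sketch of a dynamic per-core DVFS policy: when the LC workload is under
# high load, cap the frequency of the cores running BE tasks so the LC cores
# keep their guaranteed frequency within the socket power budget; when LC
# load is low, let the BE cores run fast again.
# Assumes the Linux cpufreq sysfs interface; core IDs, thresholds, and
# get_lc_load() are illustrative placeholders.

CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/{knob}"

def read_khz(cpu, knob):
    with open(CPUFREQ.format(cpu=cpu, knob=knob)) as f:
        return int(f.read())

def set_max_khz(cpu, khz):
    with open(CPUFREQ.format(cpu=cpu, knob="scaling_max_freq"), "w") as f:
        f.write(str(khz))

def adjust_be_frequency(be_cores, lc_load, high=0.7, low=0.4):
    """Throttle or unthrottle the BE cores based on the current LC load."""
    for cpu in be_cores:
        if lc_load > high:
            set_max_khz(cpu, read_khz(cpu, "cpuinfo_min_freq"))   # throttle BE
        elif lc_load < low:
            set_max_khz(cpu, read_khz(cpu, "cpuinfo_max_freq"))   # restore BE

# Example control loop (get_lc_load() is a hypothetical probe of LC load):
#   while True:
#       adjust_be_frequency(be_cores=[8, 9, 10, 11], lc_load=get_lc_load())
#       time.sleep(1)
```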
A major challenge with colocation is cross-resource interactions.
A BE task can cause interference in all the shared resources
discussed. Similarly, many LC tasks are sensitive to interference
on multiple resources. Therefore, it is not sufficient to manage one
source of interference: all potential sources need to be monitored
and carefully isolated if need be. In addition, interference sources
interact with each other. For example, LLC contention causes both
types of tasks to require more DRAM bandwidth, in turn creating
a DRAM bandwidth bottleneck. Similarly, a task that notices network
congestion may attempt to use compression, causing core and power
contention. In theory, the number of possible interactions scales
with the square of the number of interference sources, making this
a very difficult problem.
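To see why, note that with $n$ distinct interference sources there are already
$$\binom{n}{2} = \frac{n(n-1)}{2} = O(n^2)$$
pairwise interactions to consider (six sources already yield fifteen pairs),
before accounting for higher-order combinations.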
3 Interference Characterization & Analysis
This section characterizes the impact of interference on
shared resources for latency-critical services.
3.1 Latency-critical Workloads
We use three Google production latency-critical workloads.
websearch is the query serving portion of a production web
search service. It is a scale-out workload that provides high
throughput with a strict latency SLO by using a large fan-out
to thousands of leaf nodes that process each query on their shard
of the search index. The SLO for leaf nodes is in the tens of
milliseconds for the 99%-ile latency. Load for websearch is generated
using an anonymized trace of real user queries.
websearch has a high memory footprint, as it serves shards of
the search index stored in DRAM. It also has moderate DRAM
bandwidth requirements (40% of available bandwidth at 100%
load), as most index accesses miss in the LLC. However, there
is a small but significant working set of instructions and data in
the hot path. websearch is also fairly compute intensive, as it
needs to score and sort search hits, but it does not consume a
significant amount of network bandwidth. For this study, we
reserve a small fraction of DRAM on search servers to enable
colocation of BE workloads with websearch.
ml_cluster is a standalone service that performs real-time text
clustering using machine-learning techniques. Several Google
services use ml_cluster to assign a cluster to a snippet of text.
ml_cluster performs this task by locating the closest clusters for
the text in a model that was previously learned offline. This
model is kept in main memory for performance reasons. The
SLO for ml_cluster is a 95%-ile latency guarantee of tens of
milliseconds. ml_cluster is exercised using an anonymized trace of
requests captured from production services.
Compared to websearch, ml_cluster is more memory bandwidth
intensive (with 60% DRAM bandwidth usage at peak) but slightly
less compute intensive (lower CPU power usage overall). It has
low network bandwidth requirements. An interesting
property of ml_cluster is that each request has a very small cache
footprint, but, in the presence of many outstanding requests, this
translates into a large amount of cache pressure that spills over to
DRAM. This is reflected in our analysis as a super-linear growth
in DRAM bandwidth use for ml_cluster versus load.
memkeyval is an in-memory key-value store, similar to
memcached [2]. memkeyval is used as a caching service in the
backends of several Google web services. Other large-scale web
services, such as those of Facebook and Twitter, use memcached
extensively. memkeyval has significantly less processing per request
compared to websearch, leading to extremely high throughput on
the order of hundreds of thousands of requests per second at peak.
Since each request is processed quickly, the SLO latency is very
low, at a few hundred microseconds for the 99%-ile latency.
Load generation for memkeyval uses an anonymized
trace of requests captured from production services.
At peak load, memkeyval is network bandwidth limited. Despite
the small amount of network protocol processing done per request,
the high request rate makes memkeyval compute-bound. In contrast,
DRAM bandwidth requirements are low (20% DRAM bandwidth
utilization at maximum load), as requests simply retrieve values
from DRAM and put the response on the wire. memkeyval has both
a static working set in the LLC for instructions and a per-request
data working set.
3.2 Characterization Methodology
To understand their sensitivity to interference on shared resources,
we ran each of the three LC workloads with a synthetic benchmark
that stresses each shared resource in isolation. While these are
single-node experiments, there can still be significant network
traffic, as the load is generated remotely. We repeated the
characterization at various load points for the LC jobs and recorded
the impact of the colocation on tail latency. We used production
Google servers with dual-socket Intel Xeons based on the Haswell
architecture. Each CPU has a high core count, a nominal frequency
of 2.3GHz, and 2.5MB of LLC per core. The chips have hardware
support for way-partitioning of the LLC.
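On recent Linux kernels, way-partitioning hardware of this kind is
typically driven through the resctrl filesystem (Intel Cache Allocation
Technology). The sketch below is illustrative only: the group names,
core ranges, and way masks are assumptions, not the configuration used
in these experiments, and resctrl is assumed to be already mounted.

```python
# Sketch of LLC way-partitioning via the Linux resctrl interface (Intel CAT).
# Assumes resctrl is mounted at /sys/fs/resctrl; the group names, core
# ranges, and way masks below are illustrative placeholders.

import os

RESCTRL = "/sys/fs/resctrl"

def make_group(name, cpu_list, l3_schemata):
    """Create a resctrl group, assign cores to it, and restrict its L3 ways."""
    path = os.path.join(RESCTRL, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpus_list"), "w") as f:
        f.write(cpu_list)              # e.g. "0-17": cores running the LC task
    with open(os.path.join(path, "schemata"), "w") as f:
        f.write(l3_schemata + "\n")    # contiguous bitmask of allowed LLC ways

# Hypothetical split of a 20-way LLC on a dual-socket machine:
# 16 ways reserved for the LC workload, 4 non-overlapping ways for BE tasks,
# applied to both cache domains (sockets) 0 and 1.
make_group("lc", cpu_list="0-17",  l3_schemata="L3:0=ffff0;1=ffff0")
make_group("be", cpu_list="18-21", l3_schemata="L3:0=0000f;1=0000f")
```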
We performed the following characterization experiments:
Cores: As we discussed in §2, we cannot share a logical core (a
single HyperThread) between an LC and a BE task, because OS
scheduling can introduce latency spikes on the order of tens of
milliseconds [39]. Hence, we focus on the potential of using separate
HyperThreads that run pinned on the same physical core. We
characterize the impact on the LC task of a colocated HyperThread
that implements a tight spinloop. This experiment captures a lower
bound of HyperThread interference. A more compute- or
memory-intensive microbenchmark would antagonize the LC
HyperThread for more core resources (e.g., execution units)
and space in the private caches (L1 and L2). Hence, if even this
experiment shows a high impact on tail latency, we can conclude that
sharing a physical core between LC and BE tasks is not practical.
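For reference, a minimal sketch of such a HyperThread antagonist follows.
The logical CPU id is a placeholder; on Linux, the sibling of the LC
HyperThread's CPU can be read from
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```python
# Sketch of the spinloop antagonist used in the HyperThread experiment:
# a process pinned to the logical CPU that shares a physical core with the
# LC task's HyperThread, burning cycles without touching much memory.
# The CPU id below is a placeholder for the actual sibling HyperThread.

import os

def spin_on(cpu):
    """Pin the calling process to one logical CPU and spin indefinitely."""
    os.sched_setaffinity(0, {cpu})   # 0 = this process
    while True:
        pass                         # tight spinloop: contends for the core's
                                     # execution resources, not for LLC/DRAM

if __name__ == "__main__":
    spin_on(cpu=24)                  # hypothetical sibling of the LC HyperThread
```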