We performed the following characterization experiments:
—Cores: As we discussed in Section 2, we cannot share a logical core (a single HyperThread)
between an LC and a BE task because OS scheduling can introduce latency
spikes on the order of tens of milliseconds [Leverich and Kozyrakis 2014]. Hence, we
focus on the potential of using separate HyperThreads that run pinned on the same
physical core. We characterize the impact on the LC task of a colocated HyperThread
that implements a tight spinloop (sketched after this list). This experiment captures
a lower bound of HyperThread interference. A more compute- or memory-intensive
microbenchmark would antagonize the LC HyperThread for more core resources (e.g.,
execution units) and for space in the private caches (L1 and L2). Hence, if this
experiment shows a high impact on tail latency, we can conclude that core sharing
through HyperThreads is not a practical option.
—LLC: The interference impact of LLC antagonists is measured by pinning the LC
workload to enough cores to satisfy its SLO at the specific load and pinning a cache
antagonist that streams through a large data array on the remaining cores of the
socket (see the streaming sketch after this list). We use several array sizes that take
up a quarter, half, and almost all of the LLC, and denote these configurations as LLC
small, medium, and big, respectively.
—DRAM bandwidth: The impact of DRAM bandwidth interference is characterized in
a similar fashion to LLC interference, using a significantly larger array for streaming.
We use numactl to ensure that the DRAM antagonist and the LC task are placed on
the same socket(s) and that all memory channels are stressed.
—Network traffic: We use iperf, an open-source TCP streaming benchmark [iperf
2011], to saturate the network transmit (outgoing) bandwidth. All cores except for
one are given to the LC workload. Since the LC workloads we consider serve requests
from multiple clients connecting to the service they provide, we generate interference
in the form of many low-bandwidth “mice” flows (a simplified flow generator is
sketched after this list). Network interference can also be generated using a few
“elephant” flows. However, such flows can be effectively throttled by TCP congestion
control [Briscoe 2007], whereas the many “mice” flows of the LC workload will not be
impacted.
—Power: To characterize the latency impact of a power antagonist, we use the same
division of cores as when generating LLC and DRAM interference. Instead of running
a memory access antagonist, we run a CPU power virus (sketched after this list) that
is designed to stress all the components of the core, leading to high power draw and
lower CPU core frequencies.
—OS Isolation: For completeness, we evaluate the overall impact of running a BE
task along with an LC workload using only the isolation mechanisms available in
the OS. Namely, we execute the two workloads in separate Linux containers and
set the BE workload to be low priority. The scheduling policy is enforced by CFS
using the shares parameter, where the BE task receives very few shares compared
to the LC workload (a minimal sketch of this setup follows the list). No other
isolation mechanisms are used in this case. The BE
task is the Google brain workload [Le et al. 2012; Rosenberg 2013], which we will
describe further in Section 5.1.
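
A minimal sketch of the HyperThread spinloop antagonist is shown below. It is purely
illustrative (not necessarily the exact code used in the experiments) and assumes the
target logical CPU id is passed as a command-line argument; the LC task would be pinned
to the sibling HyperThread of the same physical core.

    /* Tight spinloop pinned to one logical CPU (HyperThread). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <logical-cpu-id>\n", argv[0]);
            return 1;
        }
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(atoi(argv[1]), &set);          /* pin to the given HyperThread */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        volatile unsigned long spins = 0;
        for (;;)       /* tight spinloop: occupies issue slots on the shared core */
            spins++;   /* while touching almost no cache or memory bandwidth      */
    }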
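
The cache and memory antagonists follow the sketch below, which assumes the array size
is given in MiB on the command line. Sizing the array to roughly a quarter, half, or
almost all of the LLC yields the LLC small, medium, and big configurations; the DRAM
antagonist is the same loop with a multi-GiB array, launched under numactl as described
above.

    /* Streaming antagonist: repeatedly walks a large array at cache-line stride. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv) {
        size_t mib = (argc > 1) ? strtoull(argv[1], NULL, 10) : 8;  /* size in MiB */
        size_t n = mib * 1024 * 1024 / sizeof(long);
        long *a = malloc(n * sizeof(long));
        if (a == NULL) { perror("malloc"); return 1; }
        memset(a, 1, n * sizeof(long));
        volatile long sink = 0;
        for (;;) {
            /* Stride by one 64-byte cache line (8 longs) so that, once the array
             * exceeds the private caches, every access competes for LLC or DRAM. */
            for (size_t i = 0; i < n; i += 8)
                sink += a[i];
        }
    }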
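
The network antagonist in the experiments is iperf itself; the sketch below is only a
simplified stand-in that illustrates the “mice” pattern: many concurrent TCP flows, each
sending a small buffer at a low rate, so that no single connection builds up a large
congestion window. The sink address, port, and flow count are placeholders.

    /* Simplified "mice" flow generator: many low-bandwidth TCP streams. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static struct sockaddr_in sink;

    static void *mouse_flow(void *arg) {
        (void)arg;
        char buf[1024];
        memset(buf, 'x', sizeof(buf));
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&sink, sizeof(sink)) < 0)
            return NULL;
        for (;;) {                          /* ~1 KiB every 10 ms per flow */
            if (write(fd, buf, sizeof(buf)) < 0) break;
            usleep(10000);
        }
        close(fd);
        return NULL;
    }

    int main(void) {
        sink.sin_family = AF_INET;
        sink.sin_port = htons(5001);                      /* placeholder sink port    */
        inet_pton(AF_INET, "10.0.0.2", &sink.sin_addr);   /* placeholder sink address */
        for (int i = 0; i < 512; i++) {                   /* 512 concurrent mice flows */
            pthread_t t;
            pthread_create(&t, NULL, mouse_flow, NULL);
        }
        pause();                                          /* flows run until killed */
    }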
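
The power virus can be approximated by the sketch below, which is illustrative rather
than carefully tuned: one thread per logical core mixes floating-point and integer work
to keep many execution units busy, raising package power and, under a power limit,
lowering core frequency. In the actual experiment the virus threads would be pinned to
the cores not used by the LC workload, as with the memory antagonists.

    /* Simple CPU power virus: keep FP and integer units busy on every core. */
    #include <math.h>
    #include <pthread.h>
    #include <unistd.h>

    static void *burn(void *arg) {
        (void)arg;
        double x = 1.000001, y = 0.999999;
        unsigned long k = 1;
        for (;;) {
            x = x * y + 1e-9;                 /* floating-point multiply-add   */
            y = sqrt(x) * 0.5 + 0.5;          /* exercises the FP divider/sqrt */
            k = k * 2654435761u + 1;          /* integer ALU work              */
            if ((k | (unsigned long)x) == 0)  /* never true; defeats dead-code */
                break;                        /* elimination                   */
        }
        return NULL;
    }

    int main(void) {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);   /* one thread per logical core */
        for (long i = 1; i < ncpus; i++) {
            pthread_t t;
            pthread_create(&t, NULL, burn, NULL);
        }
        burn(NULL);                                   /* main thread burns as well */
    }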
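
The OS-only isolation setup can be sketched as follows, assuming cgroup v1 with the cpu
controller mounted at /sys/fs/cgroup/cpu and pre-created lc and be groups (the group
names and the PID are placeholders). The BE group receives very few CFS shares relative
to the LC group, and a process is attached to a group by writing its PID to cgroup.procs.

    /* Give the BE container very few CFS shares relative to the LC container. */
    #include <stdio.h>

    static int write_str(const char *path, const char *val) {
        FILE *f = fopen(path, "w");
        if (f == NULL) { perror(path); return -1; }
        fprintf(f, "%s\n", val);
        return fclose(f);
    }

    int main(void) {
        /* LC keeps the default weight; BE gets roughly 1/100 of it. */
        write_str("/sys/fs/cgroup/cpu/lc/cpu.shares", "1024");
        write_str("/sys/fs/cgroup/cpu/be/cpu.shares", "10");
        /* Attach an already-running BE process (placeholder PID). */
        write_str("/sys/fs/cgroup/cpu/be/cgroup.procs", "12345");
        return 0;
    }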
3.3. Interference Analysis
Figure 1 presents the impact of the interference microbenchmarks on the tail latency
of the three LC workloads. Each row shows the tail latency of an LC workload at a
given load when it is colocated with the corresponding microbenchmark. The
interference impact is acceptable if and only if the tail latency is less than 100% of the
target SLO. We color-code in red/yellow all cases where the SLO latency is violated.
By observing the rows for brain, we immediately notice that current OS isolation
mechanisms are inadequate for colocating LC tasks with BE tasks. Even at low loads,