just on its own load, but also on the intensity of any BE task
running on the same socket. In other words, the performance of
LC tasks can suffer from unexpected drops in frequency due to
colocated tasks. This interference can be mitigated with per-core
dynamic voltage and frequency scaling (DVFS), as cores running BE
tasks can have their frequency decreased to ensure that the LC jobs
maintain a guaranteed frequency. A static policy would run all
BE jobs at minimum frequency, thus ensuring that the LC tasks
are not power-limited. However, this approach severely penalizes
the vast majority of BE tasks. Most BE jobs do not have the profile
of a power virus (a computation that maximizes the activity and
power consumption of a core), and LC tasks only need the additional
frequency boost during periods of high load. Thus, a dynamic
solution that adjusts the allocation of power between cores is needed
to ensure that LC cores run at a guaranteed minimum frequency
while maximizing the frequency of cores running BE tasks.
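To make this concrete, the sketch below shows the shape of such a dynamic
per-core DVFS policy on a Linux host that exposes the cpufreq sysfs
interface. The core IDs, load thresholds, and the LC load probe are
illustrative assumptions, not part of the system described here.

```python
# Sketch of a dynamic per-core DVFS policy: when the LC workload is under
# high load, cap the frequency of the cores running BE tasks so the LC cores
# keep their guaranteed frequency within the socket power budget; when LC
# load is low, let the BE cores run fast again.
# Assumes the Linux cpufreq sysfs interface; core IDs, thresholds, and
# get_lc_load() are illustrative placeholders.

CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/{knob}"

def read_khz(cpu, knob):
    with open(CPUFREQ.format(cpu=cpu, knob=knob)) as f:
        return int(f.read())

def set_max_khz(cpu, khz):
    with open(CPUFREQ.format(cpu=cpu, knob="scaling_max_freq"), "w") as f:
        f.write(str(khz))

def adjust_be_frequency(be_cores, lc_load, high=0.7, low=0.4):
    """Throttle or unthrottle the BE cores based on the current LC load."""
    for cpu in be_cores:
        if lc_load > high:
            set_max_khz(cpu, read_khz(cpu, "cpuinfo_min_freq"))   # throttle BE
        elif lc_load < low:
            set_max_khz(cpu, read_khz(cpu, "cpuinfo_max_freq"))   # restore BE

# Example control loop (get_lc_load() is a hypothetical probe of LC load):
#   while True:
#       adjust_be_frequency(be_cores=[8, 9, 10, 11], lc_load=get_lc_load())
#       time.sleep(1)
```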
A major challenge with colocation is cross-resource interactions.
A BE task can cause interference in all the shared resources
discussed. Similarly, many LC tasks are sensitive to interference
on multiple resources. Therefore, it is not sufficient to manage one
source of interference: all potential sources need to be monitored
and carefully isolated if need be. In addition, interference sources
interact with each other. For example, LLC contention causes both
types of tasks to require more DRAM bandwidth, in turn creating
a DRAM bandwidth bottleneck. Similarly, a task that notices network
congestion may attempt to use compression, causing core and power
contention. In theory, the number of possible interactions scales
with the square of the number of interference sources, making this
a very difficult problem.
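To see why, note that with $n$ distinct interference sources there are already
$$\binom{n}{2} = \frac{n(n-1)}{2} = O(n^2)$$
pairwise interactions to consider (six sources already yield fifteen pairs),
before accounting for higher-order combinations.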
3 Interference Characterization & Analysis
This section characterizes the impact of interference on
shared resources for latency-critical services.
3.1 Latency-critical Workloads
We use three Google production latency-critical workloads.
websearch is the query serving portion of a production web
search service. It is a scale-out workload that provides high
throughput with a strict latency SLO by using a large fan-out
to thousands of leaf nodes that process each query on their shard
of the search index. The SLO for leaf nodes is in the tens of
milliseconds for the 99%-ile latency. Load for websearch is generated
using an anonymized trace of real user queries.
websearch has a high memory footprint, as it serves shards of
the search index stored in DRAM. It also has moderate DRAM
bandwidth requirements (40% of available bandwidth at 100%
load), as most index accesses miss in the LLC. However, there
is a small but significant working set of instructions and data in
the hot path. websearch is also fairly compute intensive, as it
needs to score and sort search hits, but it does not consume a
significant amount of network bandwidth. For this study, we
reserve a small fraction of DRAM on search servers to enable
colocation of BE workloads with websearch.
ml_cluster is a standalone service that performs real-time text
clustering using machine-learning techniques. Several Google
services use ml_cluster to assign a cluster to a snippet of text.
ml_cluster performs this task by locating the closest clusters for
the text in a model that was previously learned offline. This
model is kept in main memory for performance reasons. The
SLO for ml_cluster is a 95%-ile latency guarantee of tens of
milliseconds. ml_cluster is exercised using an anonymized trace of
requests captured from production services.
Compared to websearch, ml_cluster is more memory bandwidth
intensive (with 60% DRAM bandwidth usage at peak) but slightly
less compute intensive (lower CPU power usage overall). It has
low network bandwidth requirements. An interesting
property of ml_cluster is that each request has a very small cache
footprint, but, in the presence of many outstanding requests, this
translates into a large amount of cache pressure that spills over to
DRAM. This is reflected in our analysis as a super-linear growth
in DRAM bandwidth use for ml_cluster versus load.
memkeyval is an in-memory key-value store, similar to
memcached [2]. memkeyval is used as a caching service in the
backends of several Google web services. Other large-scale web
services, such as those of Facebook and Twitter, use memcached
extensively. memkeyval has significantly less processing per request
compared to websearch, leading to extremely high throughput on
the order of hundreds of thousands of requests per second at peak.
Since each request is processed quickly, the SLO latency is very
low, at a few hundred microseconds for the 99%-ile latency.
Load generation for memkeyval uses an anonymized
trace of requests captured from production services.
At peak load, memkeyval is network bandwidth limited. Despite
the small amount of network protocol processing done per request,
the high request rate makes memkeyval compute-bound. In contrast,
DRAM bandwidth requirements are low (20% DRAM bandwidth
utilization at maximum load), as requests simply retrieve values
from DRAM and put the response on the wire. memkeyval has both
a static working set in the LLC for instructions and a per-request
data working set.
3.2 Characterization Methodology
To understand their sensitivity to interference on shared resources,
we ran each of the three LC workloads with a synthetic benchmark
that stresses each shared resource in isolation. While these are
single-node experiments, there can still be significant network
traffic, as the load is generated remotely. We repeated the
characterization at various load points for the LC jobs and recorded
the impact of the colocation on tail latency. We used production
Google servers with dual-socket Intel Xeons based on the Haswell
architecture. Each CPU has a high core count, a nominal frequency
of 2.3GHz, and 2.5MB of LLC per core. The chips have hardware
support for way-partitioning of the LLC.
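On recent Linux kernels, way-partitioning hardware of this kind is
typically driven through the resctrl filesystem (Intel Cache Allocation
Technology). The sketch below is illustrative only: the group names,
core ranges, and way masks are assumptions, not the configuration used
in these experiments, and resctrl is assumed to be already mounted.

```python
# Sketch of LLC way-partitioning via the Linux resctrl interface (Intel CAT).
# Assumes resctrl is mounted at /sys/fs/resctrl; the group names, core
# ranges, and way masks below are illustrative placeholders.

import os

RESCTRL = "/sys/fs/resctrl"

def make_group(name, cpu_list, l3_schemata):
    """Create a resctrl group, assign cores to it, and restrict its L3 ways."""
    path = os.path.join(RESCTRL, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpus_list"), "w") as f:
        f.write(cpu_list)              # e.g. "0-17": cores running the LC task
    with open(os.path.join(path, "schemata"), "w") as f:
        f.write(l3_schemata + "\n")    # contiguous bitmask of allowed LLC ways

# Hypothetical split of a 20-way LLC on a dual-socket machine:
# 16 ways reserved for the LC workload, 4 non-overlapping ways for BE tasks,
# applied to both cache domains (sockets) 0 and 1.
make_group("lc", cpu_list="0-17",  l3_schemata="L3:0=ffff0;1=ffff0")
make_group("be", cpu_list="18-21", l3_schemata="L3:0=0000f;1=0000f")
```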
We performed the following characterization experiments:
Cores: As we discussed in §2, we cannot share a logical core (a
single HyperThread) between an LC and a BE task, because OS
scheduling can introduce latency spikes on the order of tens of
milliseconds [39]. Hence, we focus on the potential of using separate
HyperThreads that run pinned on the same physical core. We
characterize the impact on the LC task of a colocated HyperThread
that implements a tight spinloop. This experiment captures a lower
bound of HyperThread interference. A more compute- or
memory-intensive microbenchmark would antagonize the LC
HyperThread for more core resources (e.g., execution units)
and space in the private caches (L1 and L2). Hence, if even this
experiment shows a high impact on tail latency, we can conclude that
sharing a physical core between LC and BE tasks is not practical.
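For reference, a minimal sketch of such a HyperThread antagonist follows.
The logical CPU id is a placeholder; on Linux, the sibling of the LC
HyperThread's CPU can be read from
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

```python
# Sketch of the spinloop antagonist used in the HyperThread experiment:
# a process pinned to the logical CPU that shares a physical core with the
# LC task's HyperThread, burning cycles without touching much memory.
# The CPU id below is a placeholder for the actual sibling HyperThread.

import os

def spin_on(cpu):
    """Pin the calling process to one logical CPU and spin indefinitely."""
    os.sched_setaffinity(0, {cpu})   # 0 = this process
    while True:
        pass                         # tight spinloop: contends for the core's
                                     # execution resources, not for LLC/DRAM

if __name__ == "__main__":
    spin_on(cpu=24)                  # hypothetical sibling of the LC HyperThread
```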