Figure 3: Mix of jobs in an example cluster with 12 blocks of servers (left). Fraction of traffic in each block destined for remote blocks (right).
width applications had to fit under a single ToR to
avoid the heavily oversubscribed ToR uplinks. Deploy-
ing large clusters was important to our services because
there were many affiliated applications that benefited
from high-bandwidth communication; consider, for example, large-scale data processing
to produce and continuously refresh a search index, web search, and ads
serving. Larger clusters also substantially improve bin-packing
efficiency for job scheduling
by reducing stranding from cases where a job cannot
be scheduled in any one cluster despite the aggregate
availability of sufficient resources across multiple small
clusters.
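To make the stranding point concrete, below is a minimal sketch with hypothetical numbers (not our production scheduler): a job that fits within the aggregate free capacity of several small clusters but cannot be placed on any one of them.

```python
# Hypothetical illustration of stranding: a job that fits in aggregate
# capacity but not in any single small cluster.
small_clusters = [60, 70, 50]          # free machine slots per small cluster
large_cluster = sum(small_clusters)    # one large cluster, same total slots

job_demand = 100                       # machines required by the job

fits_some_small = any(free >= job_demand for free in small_clusters)
fits_large = large_cluster >= job_demand

print(f"aggregate free slots: {large_cluster}")               # 180
print(f"schedulable on a small cluster: {fits_some_small}")   # False (stranded)
print(f"schedulable on the large cluster: {fits_large}")      # True
```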
Maximum cluster scale is important for a more sub-
tle reason. Power is distributed hierarchically at the
granularity of the building, multi-megawatt power gen-
erators, and physical datacenter rows. Each level of hi-
erarchy represents a unit of failure and maintenance.
For availability, cluster scheduling purposely spreads
jobs across multiple rows. Similarly, the required re-
dundancy in storage systems is in part determined by
the fraction of a cluster that may simultaneously fail as
a result of a power event. Hence, larger clusters lead to
lower storage overhead and more efficient job scheduling
while meeting diversity requirements.
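As an illustration of the storage-overhead claim (a simple model for exposition, not our production placement policy): suppose data is erasure-coded into n chunks, any k of which suffice to reconstruct it, and chunks are spread evenly across failure domains. The minimum overhead n/k that survives the loss of one whole power domain shrinks as the number of domains grows:

```python
import math

def min_overhead(k: int, domains: int) -> float:
    """Smallest n/k surviving the loss of one of `domains` failure domains,
    assuming n chunks spread evenly (a domain holds ceil(n/domains) chunks)."""
    n = k
    while True:
        n += 1
        lost = math.ceil(n / domains)   # chunks lost when one domain fails
        if n - lost >= k:               # enough chunks survive to reconstruct
            return n / k

for d in (4, 8, 16, 32):
    print(f"{d:2d} failure domains -> {min_overhead(10, d):.2f}x overhead")
# 4 -> 1.40x, 8 -> 1.20x, 16 -> 1.10x, 32 -> 1.10x
```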
Running storage across a cluster requires both rack
and power diversity to avoid correlated failures, so
cluster data is spread across the cluster's failure
domains for resilience. However, such spreading natu-
rally eliminates locality and drives the need for uni-
form bandwidth across the cluster. Consequently, stor-
age placement and job scheduling have little locality in
our cluster traffic, as shown in Figure 3. For a rep-
resentative cluster with 12 blocks (groups of racks) of
servers, we show the fraction of traffic destined for re-
mote blocks. If traffic were spread uniformly across the
cluster, we would expect 11/12 of the traffic (92%) to
be destined for other blocks. Figure 3 shows approxi-
mately this distribution for the median block, with only
moderate deviation.
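The 11/12 expectation is simply the uniform-spreading fraction (b-1)/b for b blocks, as the short sketch below computes:

```python
# Expected remote-traffic fraction under uniform spreading across b blocks.
for b in (4, 8, 12):
    print(f"{b:2d} blocks -> {100 * (b - 1) / b:.0f}% remote")
# 4 -> 75%, 8 -> 88%, 12 -> 92%
```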
While our traditional cluster network architecture
largely met our scale needs, it fell short in terms of
overall performance and cost. Bandwidth per host was
severely limited to an average of 100Mbps. Packet drops
associated with incast [8] and outcast [21] were severe
pain points.

Figure 4: A generic 3-tier Clos architecture with edge switches (ToRs), aggregation blocks, and spine blocks. All generations of Clos fabrics deployed in our datacenters follow variants of this architecture.

Increasing bandwidth per server would
have substantially increased cost per server and reduced
cluster scale.
We realized that existing commercial solutions could
not meet our scale, management, and cost requirements.
Hence, we decided to build our own custom data center
network hardware and software. We started with the
key insight that we could scale cluster fabrics to near
arbitrary size by leveraging Clos topologies (Figure 4)
and the then-emerging (ca. 2003) merchant switching
silicon industry [12]. Table 1 summarizes a number of
the top-level challenges we faced in constructing and
managing building-scale network fabrics. The following
sections explain these challenges and the rationale for
our approach in detail.
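To illustrate why Clos topologies admit near-arbitrary scale from fixed-radix merchant silicon, consider the textbook nonblocking 3-tier fat-tree bound in the spirit of [2] (these are generic parameters, not those of our fabrics): radix-k switches support k^3/4 hosts, so host count grows cubically with switch radix.

```python
def fat_tree_hosts(k: int) -> int:
    """Hosts in a nonblocking 3-tier fat-tree built from radix-k switches."""
    assert k % 2 == 0, "even radix needed to split ports between up/down links"
    return k ** 3 // 4

for k in (24, 48, 64):
    print(f"radix {k:2d} -> {fat_tree_hosts(k):,} hosts")
# radix 24 -> 3,456; radix 48 -> 27,648; radix 64 -> 65,536
```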
For brevity, we omit detailed discussion of related
work in this paper. However, our topological approach,
reliance on merchant silicon, and load balancing across
multipath are substantially similar to contemporaneous
research [2,15]. In addition to outlining the evolution of
our network, we further describe inter-cluster network-
ing and network management, and detail our control
protocols. Our centralized control protocols running on
switch embedded processors are also related to subse-
quent substantial efforts in Software Defined Network-
ing (SDN) [13]. Based on our experience in the dat-
acenter, we later applied SDN to our Wide Area Net-
work [19]. For the WAN, more CPU-intensive traffic
engineering and BGP routing led us to move control
protocols from the embedded CPU controllers of our
initial datacenter deployments onto external servers
with more plentiful CPU.
Recent work on alternate network topologies such as
HyperX [1], DCell [17], BCube [16], and Jellyfish [22]
delivers more efficient bandwidth for uniform random
communication patterns. However, to date, we have
found that the benefits of these topologies do not make
up for the cabling, management, and routing challenges
and complexity.
3. NETWORK EVOLUTION
3.1 Firehose 1.0
Table 2 summarizes the multiple generations of our