can be easily partitioned both within a request and across different requests.
Similarly, whereas Web email transactions do modify user data, requests from
different users are essentially independent of each other, creating natural units of
data partitioning and concurrency.
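This independence can be sketched concretely. A minimal illustration (the function name and hashing scheme here are hypothetical, not from the text): because requests from different users touch disjoint data, user data can be spread across shards with a simple hash, and requests for different users can then be served concurrently with no cross-shard coordination.

```python
import hashlib

def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user to one of num_shards partitions.

    Requests from different users are independent, so each request can be
    routed to its user's shard and processed without coordinating with
    other shards.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

Because the mapping is deterministic, any front-end server can route any request to the right shard without consulting shared state.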
● Workload churn—Users of Internet services are isolated from the service’s
implementation details by relatively well-defined and stable high-level APIs (e.g.,
simple URLs), making it much easier to deploy new software quickly. Key pieces of
Google’s services have release cycles on the order of a couple of weeks compared to
months or years for desktop software products. Google’s front-end Web server
binaries, for example, are released on a weekly cycle, with nearly a thousand
independent code changes checked in by hundreds of developers. The core of
Google’s search services has been reimplemented nearly from scratch every 2 to 3
years. This environment creates significant incentives for rapid product innovation
but makes it hard for a system designer to extract useful benchmarks even from
established applications. Moreover, because Internet services are still a relatively
new field, new products and services frequently emerge, and their success with users
directly affects the resulting workload mix in the datacenter. For example, video
services such as YouTube have flourished in relatively short periods and may present
a very different set of requirements from the existing large customers of computing
cycles in the datacenter, potentially affecting the optimal design point of WSCs in
unexpected ways. A beneficial side effect of this aggressive software deployment
environment is that hardware architects are not necessarily burdened with having to
provide good performance for immutable pieces of code. Instead, architects can
consider the possibility of significant software rewrites to take advantage of new
hardware capabilities or devices.
● Platform homogeneity—The datacenter is generally a more homogeneous
environment than the desktop as a target platform for software development. Large
Internet services operations typically deploy a small number of hardware and system
software configurations at any given time. Significant heterogeneity arises primarily
from the incentives to deploy more cost-efficient components that become available
over time. Homogeneity within a platform generation simplifies cluster-level
scheduling and load balancing, and it reduces the maintenance burden for platform
software (kernels, drivers, etc.). Similarly, homogeneity allows more efficient
supply chains and repair processes, because both automated and manual
repairs benefit from accumulating experience with fewer types of systems. In contrast,
software for desktop systems can make few assumptions about the hardware or
software platform on which it is deployed, and its complexity and performance
characteristics may suffer from the need to support thousands or even millions of
hardware and system software configurations.
● Fault-free operation—Because Internet service applications run on clusters of
thousands of machines—each of them not dramatically more reliable than PC-class
hardware—the multiplicative effect of individual failure rates means that some type
of fault is expected every few hours or less (more details are provided in Chapter 6).
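The multiplicative effect can be made concrete with a back-of-the-envelope calculation (the per-machine reliability figure and cluster size below are illustrative assumptions, not numbers from the text): under independent failures, the expected time between failures anywhere in the cluster shrinks in proportion to the number of machines.

```python
def cluster_mtbf_hours(machine_mtbf_years: float, num_machines: int) -> float:
    """Expected hours between failures somewhere in the cluster,
    assuming independent, identically reliable machines."""
    hours_per_year = 365 * 24  # 8,760
    return machine_mtbf_years * hours_per_year / num_machines

# A server that fails once every 4 years looks reliable in isolation,
# but 10,000 such machines together see a fault every few hours:
print(round(cluster_mtbf_hours(4.0, 10_000), 1))  # -> 3.5
```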
As a result, although it may be reasonable for desktop-class software to assume
fault-free hardware operation for months or years, this is not true for datacenter-