were co-mingled in the same physical log file. One ap-
proach would be for each new tablet server to read this
full commit log file and apply just the entries needed for
the tablets it needs to recover. However, under such a
scheme, if 100 machines were each assigned a single
tablet from a failed tablet server, then the log file would
be read 100 times (once by each server).
We avoid duplicating log reads by first sorting the commit log entries in order of the keys ⟨table, row name, log sequence number⟩. In the
sorted output, all mutations for a particular tablet are
contiguous and can therefore be read efficiently with one
disk seek followed by a sequential read. To parallelize
the sorting, we partition the log file into 64 MB seg-
ments, and sort each segment in parallel on different
tablet servers. This sorting process is coordinated by the
master and is initiated when a tablet server indicates that
it needs to recover mutations from some commit log file.
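The per-segment sort can be illustrated with the following sketch (in Go, with hypothetical types; this is not the actual Bigtable code): ordering entries by ⟨table, row name, log sequence number⟩ makes each tablet's mutations a contiguous run in the sorted output.

    package sketch

    import "sort"

    // LogEntry is a hypothetical stand-in for one commit log record.
    type LogEntry struct {
        Table string
        Row   string
        Seq   uint64
        Data  []byte
    }

    // sortSegment sorts one 64 MB log segment in place by
    // (table, row name, log sequence number).
    func sortSegment(entries []LogEntry) {
        sort.Slice(entries, func(i, j int) bool {
            a, b := entries[i], entries[j]
            if a.Table != b.Table {
                return a.Table < b.Table
            }
            if a.Row != b.Row {
                return a.Row < b.Row
            }
            return a.Seq < b.Seq
        })
    }

Once each segment is sorted, a recovering tablet server only has to read the contiguous runs belonging to its newly assigned tablets.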
Writing commit logs to GFS sometimes causes perfor-
mance hiccups for a variety of reasons (e.g., a GFS server
machine involved in the write crashes, or the network
paths traversed to reach the particular set of three GFS
servers are suffering network congestion, or are heavily
loaded). To protect mutations from GFS latency spikes,
each tablet server actually has two log writing threads,
each writing to its own log file; only one of these two
threads is actively in use at a time. If writes to the ac-
tive log file are performing poorly, the log file writing is
switched to the other thread, and mutations that are in
the commit log queue are written by the newly active log
writing thread. Log entries contain sequence numbers
to allow the recovery process to elide duplicated entries
resulting from this log switching process.
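As a rough illustration of this elision step (hypothetical types, not the real recovery code), recovery can apply each sequence number at most once, which discards the copies written to both log files around a switch:

    package sketch

    // entry is a hypothetical commit log record; only the sequence
    // number matters for duplicate elision.
    type entry struct {
        seq uint64
        // mutation payload omitted
    }

    // applyOnce applies each distinct sequence number exactly once,
    // skipping duplicates introduced by log switching.
    func applyOnce(entries []entry, apply func(entry)) {
        applied := make(map[uint64]bool)
        for _, e := range entries {
            if applied[e.seq] {
                continue // duplicate from the other log file
            }
            applied[e.seq] = true
            apply(e)
        }
    }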
Speeding up tablet recovery
If the master moves a tablet from one tablet server to
another, the source tablet server first does a minor com-
paction on that tablet. This compaction reduces recov-
ery time by reducing the amount of uncompacted state in
the tablet server’s commit log. After finishing this com-
paction, the tablet server stops serving the tablet. Before
it actually unloads the tablet, the tablet server does an-
other (usually very fast) minor compaction to eliminate
any remaining uncompacted state in the tablet server’s
log that arrived while the first minor compaction was
being performed. After this second minor compaction
is complete, the tablet can be loaded on another tablet
server without requiring any recovery of log entries.
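The unload sequence can be summarized as the sketch below, where the Tablet methods are hypothetical stand-ins for internal interfaces that the paper does not describe:

    package sketch

    // Tablet is a hypothetical interface over a tablet being served.
    type Tablet interface {
        MinorCompact() // flush uncompacted state from the commit log
        StopServing()  // stop accepting new writes
        Unload()       // release the tablet so another server can load it
    }

    // migrateSource is the source server's side of a tablet move.
    func migrateSource(t Tablet) {
        t.MinorCompact() // first pass: drains most log state while still serving
        t.StopServing()  // no new mutations arrive after this point
        t.MinorCompact() // second, usually fast pass: covers writes made during the first
        t.Unload()       // destination can now load without any log recovery
    }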
Exploiting immutability
Besides the SSTable caches, various other parts of the
Bigtable system have been simplified by the fact that all
of the SSTables that we generate are immutable. For ex-
ample, we do not need any synchronization of accesses
to the file system when reading from SSTables. As a re-
sult, concurrency control over rows can be implemented
very efficiently. The only mutable data structure that is
accessed by both reads and writes is the memtable. To re-
duce contention during reads of the memtable, we make
each memtable row copy-on-write and allow reads and
writes to proceed in parallel.
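A minimal copy-on-write sketch for a single memtable row, assuming hypothetical types (the paper does not describe the memtable at this level of detail): readers follow an atomically published pointer without locking, while a writer copies the current version, applies its mutation, and swaps the pointer.

    package sketch

    import (
        "sync"
        "sync/atomic"
    )

    type rowVersion struct {
        cells map[string]string // column -> value
    }

    type Row struct {
        current atomic.Pointer[rowVersion] // readers load this lock-free
        mu      sync.Mutex                 // serializes writers only
    }

    func NewRow() *Row {
        r := &Row{}
        r.current.Store(&rowVersion{cells: map[string]string{}})
        return r
    }

    func (r *Row) Get(col string) (string, bool) {
        v, ok := r.current.Load().cells[col]
        return v, ok
    }

    func (r *Row) Set(col, val string) {
        r.mu.Lock()
        defer r.mu.Unlock()
        old := r.current.Load()
        next := &rowVersion{cells: make(map[string]string, len(old.cells)+1)}
        for k, v := range old.cells {
            next.cells[k] = v
        }
        next.cells[col] = val
        r.current.Store(next) // a concurrent reader sees either the old or the new version
    }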
Since SSTables are immutable, the problem of perma-
nently removing deleted data is transformed to garbage
collecting obsolete SSTables. Each tablet’s SSTables are
registered in the METADATA table. The master removes
obsolete SSTables as a mark-and-sweep garbage collec-
tion [25] over the set of SSTables, where the METADATA
table contains the set of roots.
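Conceptually (hypothetical names, not the actual garbage collector), the sweep keeps only the SSTable files that are still registered in the METADATA table and treats everything else as obsolete:

    package sketch

    // obsoleteSSTables returns the on-disk SSTable files that no tablet's
    // METADATA entry references; these are safe to delete.
    func obsoleteSSTables(onDisk []string, registered map[string]bool) []string {
        var garbage []string
        for _, f := range onDisk {
            if !registered[f] { // not reachable from the METADATA roots
                garbage = append(garbage, f)
            }
        }
        return garbage
    }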
Finally, the immutability of SSTables enables us to
split tablets quickly. Instead of generating a new set of
SSTables for each child tablet, we let the child tablets
share the SSTables of the parent tablet.
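A split can therefore be recorded purely as metadata; the sketch below (hypothetical types) shows two children referencing the parent's SSTable files with narrower key ranges, without rewriting any SSTable data:

    package sketch

    // TabletMeta is a hypothetical view of a tablet's METADATA entry.
    type TabletMeta struct {
        StartKey, EndKey string
        SSTables         []string // SSTable file names, shared with siblings after a split
    }

    // split produces two child tablets that share the parent's SSTables.
    func split(parent TabletMeta, splitKey string) (TabletMeta, TabletMeta) {
        left := TabletMeta{StartKey: parent.StartKey, EndKey: splitKey, SSTables: parent.SSTables}
        right := TabletMeta{StartKey: splitKey, EndKey: parent.EndKey, SSTables: parent.SSTables}
        return left, right
    }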
7 Performance Evaluation
We set up a Bigtable cluster with N tablet servers to
measure the performance and scalability of Bigtable as
N is varied. The tablet servers were configured to use 1
GB of memory and to write to a GFS cell consisting of
1786 machines with two 400 GB IDE hard drives each.
N client machines generated the Bigtable load used for
these tests. (We used the same number of clients as tablet
servers to ensure that clients were never a bottleneck.)
Each machine had two dual-core Opteron 2 GHz chips,
enough physical memory to hold the working set of all
running processes, and a single gigabit Ethernet link.
The machines were arranged in a two-level tree-shaped
switched network with approximately 100-200 Gbps of
aggregate bandwidth available at the root. All of the ma-
chines were in the same hosting facility and therefore the
round-trip time between any pair of machines was less
than a millisecond.
The tablet servers and master, test clients, and GFS
servers all ran on the same set of machines. Every ma-
chine ran a GFS server. Some of the machines also ran
a tablet server, a client process, or processes
from other jobs that were using the pool at the same time
as these experiments.
R is the number of distinct Bigtable row keys involved
in the test. R was chosen so that each benchmark read or
wrote approximately 1 GB of data per tablet server.
The sequential write benchmark used row keys with
names 0 to R − 1. This space of row keys was parti-
tioned into 10N equal-sized ranges. These ranges were
assigned to the N clients by a central scheduler that as-