or financial stability of large organizations depends on the reliable operation of complex
distributed applications, the inevitable result will be that some of these will require exactly
the same reliability mechanisms needed in those hospital and air traffic control systems. It
is time to tackle distributed systems reliability in a serious manner. To fail to do so today is
to invite catastrophic computer systems failures tomorrow.
Web Services are likely to amplify this new concern about reliability. The service-
oriented computing movement makes it much easier to reuse existing functionality in new
ways: to build a new application that talks to one or more old applications. As this trend plays
out, we are seeing a slow evolution towards elaborate, poorly understood interdependencies
between applications. Developers are discovering that the failure of a component that they
did not even know was “involved” in their application may have performance or availability
implications in remote parts of the system. Even when a dependency is evident, the developer
may have no clue as to what some components do. Thus, finding ways to guarantee the
availability of critical subsystems and components may be the only way for platform vendors
to avoid some very unhappy customer experiences.
Perhaps with this in mind, many vendors insist that reliability is a high priority for their
internal development teams. The constraint, they explain, is that they are uncomfortable
with any approach to reliability that is visible to the developer or end user. In effect,
reliability is important, but the mechanisms must be highly “transparent.” Moreover, they
insist that only generally accepted, best-of-breed solutions can be considered for inclusion
in their standards.
Unfortunately, for three decades, the computing industry has tried (and failed) to make
the mechanisms of reliable, secure distributed computing transparent. Moreover, as noted
earlier, there is considerable confusion about just what the best-of-breed solutions actually
are in this field; we will see why in Part III of the book.
Now, perhaps the experts who have worked in this field for almost three decades have
missed some elegant, simple insight ... if so, it would not be the first time. On the other
hand, perhaps we cannot achieve transparency and are faced with deep tradeoffs, so that
there simply isn’t any clear best-of-breed technology that can make all users happy.
Thus while these distributed computing experts have all kinds of interesting technolo-
gies in their “tool kits,” right now the techniques that make reliability, security, and
stability possible in massive settings can’t be swept under a rug or concealed from the
developer. On the contrary, to succeed in this new world, developers need to master a
diversity of powerful new ideas and techniques, and in many cases may need to implement
variants of those techniques specialized to the requirements of their specific application.
There just isn’t any sensible way to hide the structure of a massive, and massively complex,
distributed system scattered over multiple data centers.
We need to learn to expose system structure, to manage it intelligently but explicitly, and
to embed intelligence right into the application, so that each application can sense problems,
develop a suitable application-specific strategy for reacting to those problems, and ride out
the disruption. One can certainly wish that greater transparency were an option, and maybe
someday we will gain such a sophisticated understanding of the matter that we will be able to