Introduction to Distributed System Design
A Google tutorial on distributed system design.
Table of Contents
Audience and Pre-Requisites
The Basics
So How Is It Done?
Remote Procedure Calls
Distributed Design Principles
Exercises
References
Audience and Pre-Requisites
This tutorial covers the basics of distributed systems design. The pre-requisites are significant programming experience
with a language such as C++ or Java, a basic understanding of networking, and data structures & algorithms.
The Basics
What is a distributed system? It's one of those things that's hard to define without first defining many other things. Here is
a "cascading" definition of a distributed system:
A program is the code you write.
A process is what you get when you run it.
A message is used to communicate between processes.
A packet is a fragment of a message that might travel on a wire.
A protocol is a formal description of message formats and the rules that two processes must follow in order to exchange those
messages.
A network is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are
connected by communication links.
A component can be a process or any piece of hardware required to run a process, support communications between processes,
store data, etc.
A distributed system is an application that executes a collection of protocols to coordinate the actions of multiple processes on a
network, such that all components cooperate to perform a single task or a small set of related tasks.
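To make the message, packet and protocol definitions concrete, here is a minimal sketch in Python. It is not taken from the tutorial. The functions `fragment` and `reassemble` are hypothetical names, and the "protocol" is an assumed toy rule: every packet carries a sequence number so the receiver can rebuild the message even if packets arrive out of order.

```python
from typing import List, Tuple

# A packet in this toy protocol is (sequence_number, payload_bytes).
Packet = Tuple[int, bytes]


def fragment(message: bytes, size: int) -> List[Packet]:
    """Split a message into numbered packets of at most `size` bytes."""
    return [
        (seq, message[start:start + size])
        for seq, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message by ordering packets by sequence number."""
    return b"".join(payload for _, payload in sorted(packets))


message = b"hello distributed world"
packets = fragment(message, 5)

# Simulate out-of-order arrival by reversing the packet list.
arrived = list(reversed(packets))
print(reassemble(arrived) == message)  # prints True
```

Both sides must agree on the packet layout and the ordering rule for this to work, and that agreement is exactly what a protocol specifies.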
Why build a distributed system? There are lots of advantages including the ability to connect remote users with remote
resources in an open and scalable way. When we say open, we mean each component is continually open to interaction
with other components. When we say scalable, we mean the system can easily be altered to accommodate changes in
the number of users, resources and computing entities.
Thus, given the combined capabilities of its components, a distributed system can be much larger and more powerful than
combinations of stand-alone systems. But it's not easy: for a distributed system to be useful, it must be
reliable. This is a difficult goal to achieve because of the complexity of the interactions between simultaneously running
components.
To be truly reliable, a distributed system must have the following characteristics:
• Fault-Tolerant: It can recover from component failures without performing incorrect actions.
• Highly Available: It can restore operations, permitting it to resume providing services even when some components
have failed.
• Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been
repaired.
• Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and
failure. This underlies the ability of a distributed system to act like a non-distributed system.
• Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we
might increase the size of the network on which the system is running. This increases the frequency of network
outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or
overall load on the system. In a scalable system, this should not have a significant effect.
• Predictable Performance: The ability to provide desired responsiveness in a timely manner.
• Secure: The system authenticates access to data and services.
These are high standards, which are challenging to achieve. Probably the most difficult challenge is that a distributed system
must be able to continue operating correctly even when components fail. This issue is discussed in the following excerpt
of an interview with Ken Arnold. Ken is a research scientist at Sun; he is one of the original architects of Jini and was a
member of the architectural team that designed CORBA.
Failure is the defining difference between distributed and local programming, so you have to design distributed systems
with the expectation of failure. Imagine asking people, "If the probability of something happening is one in 10^13, how often
would it happen?" Common sense would be to answer, "Never." That is an infinitely large number in human terms. But if
you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time."
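The physicist's answer is a matter of scale: a tiny per-event probability multiplied by a very large number of events gives a sizeable expected count. The following rough sketch illustrates the arithmetic. Both the per-operation probability (1e-13) and the operation rate (one billion per second) are assumed values chosen only for illustration, and `expected_events_per_day` is a hypothetical helper.

```python
def expected_events_per_day(p_per_op: float, ops_per_second: float) -> float:
    """Expected rare events per day = probability per operation x operations per day."""
    seconds_per_day = 24 * 60 * 60
    return p_per_op * ops_per_second * seconds_per_day


# Assumed values: failure probability 1e-13 per operation, 1e9 operations per second.
print(expected_events_per_day(1e-13, 1e9))
```

Under these assumptions the "never" event is expected between eight and nine times a day.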
When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design
for failure. It is your number one concern. What does designing for failure mean? One classic problem is partial failure. If I
send a message to you and then a network failure occurs, there are two possible outcomes. One is that the message got
to you, and then the network broke, and I just didn't get the response. The other is the message never got to you because
the network broke before it arrived.
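The two outcomes described above can be sketched as a small simulation. This is an illustrative model only, not code from the tutorial or the interview: `Receiver`, `send` and the `network_breaks` parameter are hypothetical names, and a timeout is modeled simply as `send` returning `None`.

```python
from typing import List, Optional


class Receiver:
    """Hypothetical receiver that records every message it actually gets."""

    def __init__(self) -> None:
        self.delivered: List[str] = []

    def handle(self, message: str) -> str:
        self.delivered.append(message)
        return "ack: " + message


def send(receiver: Receiver, message: str, network_breaks: Optional[str]) -> Optional[str]:
    """Return the reply, or None when the sender gives up waiting.

    network_breaks is None (no failure), "before_delivery" or "after_delivery".
    """
    if network_breaks == "before_delivery":
        return None  # the message never reached the receiver
    reply = receiver.handle(message)
    if network_breaks == "after_delivery":
        return None  # the message arrived, but the reply was lost
    return reply


lost_request, lost_reply = Receiver(), Receiver()
seen_a = send(lost_request, "debit $10", "before_delivery")
seen_b = send(lost_reply, "debit $10", "after_delivery")

# The sender observes the same thing in both cases.
print(seen_a, seen_b)  # prints: None None
# The receivers, however, ended up in different states.
print(lost_request.delivered, lost_reply.delivered)  # prints: [] ['debit $10']
```

From the sender's side the two runs are indistinguishable, which is why blindly resending a message such as "debit $10" can be unsafe.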
So if I never receive a response, how do I know which of those two results happened? I cannot determine that without
eventually finding you. The network has to be repaired or you have to come up, because maybe what happened was not