or financial stability of large organizations depends on the reliable operation of complex
distributed applications, the inevitable result will be that some of these will require exactly
the same reliability mechanisms needed in those hospital and air traffic control systems. It
is time to tackle distributed systems reliability in a serious manner. To fail to do so today is
to invite catastrophic computer systems failures tomorrow.
Web Services are likely to amplify this new concern about reliability. The service-
oriented computing movement makes it much easier to reuse existing functionality in new
ways: to build a new application that talks to one or more old applications. As this trend plays
out, we are seeing a slow evolution towards elaborate, poorly understood interdependencies
between applications. Developers are discovering that the failure of a component that they
did not even know was “involved” in their application may have performance or availability
implications in remote parts of the system. Even when a dependency is evident, the developer
may have no clue as to what some components do. Thus, finding ways to guarantee the
availability of critical subsystems and components may be the only way for platform vendors
to avoid some very unhappy customer experiences.
Perhaps with this in mind, many vendors insist that reliability is a high priority for their
internal development teams. The constraint, they explain, is that they are uncomfortable
with any approach to reliability that is visible to the developer or end user. In effect,
reliability is important, but the mechanisms must be highly “transparent.” Moreover, they
insist that only generally accepted, best-of-breed solutions can be considered for inclusion
in their standards.
Unfortunately, for three decades, the computing industry has tried (and failed) to make
the mechanisms of reliable, secure distributed computing transparent. Moreover, as noted
earlier, there is considerable confusion about just what the best-of-breed solutions actually
are in this field; we will see why in Part III of the book.
Now, perhaps the experts who have worked in this field for almost three decades have
missed some elegant, simple insight ... if so, it would not be the first time. On the other
hand, perhaps we cannot achieve transparency and are faced with deep tradeoffs, so that
there simply isn’t any clear best-of-breed technology that can make all users happy.
Thus while these distributed computing experts have all kinds of interesting technolo-
gies in their “tool kits,” right now the techniques that make reliability, security, and
stability possible in massive settings can’t be swept under a rug or concealed from the
developer. On the contrary, to succeed in this new world, developers need to master a
diversity of powerful new ideas and techniques, and in many cases may need to implement
variants of those techniques specialized to the requirements of their specific application.
There just isn’t any sensible way to hide the structure of a massive, and massively complex,
distributed system scattered over multiple data centers.
We need to learn to expose system structure, to manage it intelligently but explicitly, and
to embed intelligence right into the application, so that each application can sense problems,
develop a suitable application-specific strategy for reacting to those problems, and ride out
the disruption. One can certainly wish that greater transparency were an option, and maybe
someday we will gain such a sophisticated understanding of the matter that we will be able to