A Guide to Designing Data-Intensive Applications

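"Designing Data-Intensive Applications" by Martin Kleppmann examines the key concepts and techniques behind building reliable, scalable, and maintainable data-intensive applications. It covers a range of data system components, including NoSQL databases, message queues, caches, search indexes, and batch and stream processing frameworks, and discusses how to choose among them for a given set of requirements.

Applications today are challenged by the volume of data, its complexity, and the speed at which it changes. Unlike a compute-intensive application, a data-intensive application is limited mainly by how well it manages and processes data rather than by CPU power. Many new database systems have emerged, NoSQL among them, that aim at large data volumes and high concurrency. Message queues support asynchronous processing and decouple systems, caches speed up data access, and search indexes make retrieval more efficient. Batch and stream processing frameworks such as Apache Hadoop and Apache Spark make large-scale data processing more practical.

The book discusses data models in depth, since they are the basis for understanding and designing storage systems. Relational databases queried with SQL, such as MySQL, offer strong transaction support and a normalized data model that help keep data consistent and intact. For some workloads, such as highly concurrent reads and writes or unstructured data, NoSQL systems such as MongoDB, Cassandra, and Redis may be a better fit: they tend to offer more scalability and flexibility, sometimes at the cost of weaker consistency guarantees.

The book also covers consistency, distributed systems, fault tolerance, and replication, all of which must be considered when building large distributed data systems. Kleppmann discusses the trade-offs among availability, consistency, and partition tolerance that arise in practice, which matter when designing applications that handle large volumes of data while staying highly available.

The book is an in-depth guide for software developers, architects, and data engineers. It helps readers understand how to design systems that cope with large-scale data and offers practical insight into choosing and using data processing technologies, whether you are new to data storage or already have experience.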

which it is changing—as opposed to compute-intensive, where CPU cycles are the

bottleneck.

The tools and technologies that help data-intensive applications store and process

data have been rapidly adapting to these changes. New types of database systems

(“NoSQL”) have been getting lots of attention, but message queues, caches, search

indexes, frameworks for batch and stream processing, and related technologies are

very important too. Many applications use some combination of these.
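
As a minimal sketch of one such combination (not taken from the book), the Python below puts an in-memory cache in front of a database using a cache-aside read path. The `UserStore` class, the `users` table, and the `db_reads` counter are hypothetical names chosen for illustration; SQLite stands in for any database and a plain dict stands in for a real cache.

```python
import sqlite3


class UserStore:
    """Cache-aside reads over a SQLite table (all names are illustrative)."""

    def __init__(self):
        # In-memory SQLite database standing in for the system of record.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        # Plain dict standing in for a cache such as an external key-value store.
        self.cache = {}
        # Counts how many reads actually reached the database.
        self.db_reads = 0

    def put(self, user_id, name):
        self.db.execute(
            "INSERT OR REPLACE INTO users (id, name) VALUES (?, ?)",
            (user_id, name),
        )
        self.db.commit()
        # Invalidate instead of updating, so the next read repopulates the cache.
        self.cache.pop(user_id, None)

    def get(self, user_id):
        # Serve from the cache when possible.
        if user_id in self.cache:
            return self.cache[user_id]
        # Cache miss: fall back to the database and remember the result.
        self.db_reads += 1
        row = self.db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        name = row[0] if row else None
        if name is not None:
            self.cache[user_id] = name
        return name
```

Invalidating on write keeps the sketch simple, but it is only one of several possible policies. Choosing between invalidation, write-through, and expiry is the kind of trade-off the book sets out to explain.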

The buzzwords that fill this space are a sign of enthusiasm for the new possibilities,

which is a great thing. However, as software engineers and architects, we also need to

have a technically accurate and precise understanding of the various technologies and

their trade-offs if we want to build good applications. For that understanding, we

have to dig deeper than buzzwords.

Fortunately, behind the rapid changes in technology, there are enduring principles

that remain true, no matter which version of a particular tool you are using. If you

understand those principles, you’re in a position to see where each tool fits in, how to

make good use of it, and how to avoid its pitfalls. That’s where this book comes in.

The goal of this book is to help you navigate the diverse and fast-changing landscape

of technologies for processing and storing data. This book is not a tutorial for one

particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples

of successful data systems: technologies that form the foundation of many popular

applications and that have to meet scalability, performance, and reliability

requirements in production every day.

We will dig into the internals of those systems, tease apart their key algorithms,

discuss their principles and the trade-offs they have to make. On this journey, we will try

to find useful ways of thinking about data systems—not just how they work, but also

why they work that way, and what questions we need to ask.

After reading this book, you will be in a great position to decide which kind of

technology is appropriate for which purpose, and understand how tools can be combined

to form the foundation of a good application architecture. You won’t be ready to

build your own database storage engine from scratch, but fortunately that is rarely

necessary. You will, however, develop a good intuition for what your systems are

doing under the hood so that you can reason about their behavior, make good design

decisions, and track down any problems that may arise.

Who Should Read This Book?

If you develop applications that have some kind of server/backend for storing or

processing data, and your applications use the internet (e.g., web applications, mobile

apps, or internet-connected sensors), then this book is for you.

xiv | Preface

This book is for software engineers, software architects, and technical managers who

love to code. It is especially relevant if you need to make decisions about the

architecture of the systems you work on, for example if you need to choose tools for solving

a given problem and figure out how best to apply them. But even if you have no

choice over your tools, this book will help you better understand their strengths and

weaknesses.

You should have some experience building web-based applications or network

services, and you should be familiar with relational databases and SQL. Any

non-relational databases and other data-related tools you know are a bonus, but not

required. A general understanding of common network protocols like TCP and

HTTP is helpful. Your choice of programming language or framework makes no

difference for this book.

If any of the following are true for you, you’ll find this book valuable:

• You want to learn how to make data systems scalable, for example, to support

web or mobile apps with millions of users.

• You need to make applications highly available (minimizing downtime) and

operationally robust.

• You are looking for ways of making systems easier to maintain in the long run,

even as they grow and as requirements and technologies change.

• You have a natural curiosity for the way things work and want to know what

goes on inside major websites and online services. This book breaks down the

internals of various databases and data processing systems, and it’s great fun to

explore the bright thinking that went into their design.

Sometimes, when discussing scalable data systems, people make comments along the

lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a

relational database.” There is truth in that statement: building for scale that you don’t

need is wasted effort and may lock you into an inflexible design. In effect, it is a form

of premature optimization. However, it’s also important to choose the right tool for

the job, and different technologies each have their own strengths and weaknesses. As

we shall see, relational databases are important but not the final word on dealing with

data.

Scope of This Book

This book does not attempt to give detailed instructions on how to install or use

specific software packages or APIs, since there is already plenty of documentation for

those things. Instead we discuss the various principles and trade-offs that are

fundamental to data systems, and we explore the different design decisions taken by

different products.

In the ebook editions we have included links to the full text of online resources. All

links were verified at the time of publication, but unfortunately links tend to break

frequently due to the nature of the web. If you come across a broken link, or if you

are reading a print copy of this book, you can look up references using a search

engine. For academic papers, you can search for the title in Google Scholar to find

open-access PDF files. Alternatively, you can find all of the references at

https://github.com/ept/ddia-references, where we maintain up-to-date links.

We look primarily at the architecture of data systems and the ways they are integrated

into data-intensive applications. This book doesn’t have space to cover deployment,

operations, security, management, and other areas; those are complex and

important topics, and we wouldn’t do them justice by making them superficial side notes in

this book. They deserve books of their own.

Many of the technologies described in this book fall within the realm of the Big Data

buzzword. However, the term “Big Data” is so overused and underdefined that it is

not useful in a serious engineering discussion. This book uses less ambiguous terms,

such as single-node versus distributed systems, or online/interactive versus offline/

batch processing systems.

This book has a bias toward free and open source software (FOSS), because reading,

modifying, and executing source code is a great way to understand how something

works in detail. Open platforms also reduce the risk of vendor lock-in. However,

where appropriate, we also discuss proprietary software (closed-source software,

software as a service, or companies’ in-house software that is only described in literature

but not released publicly).

Outline of This Book

This book is arranged into three parts:

1. In Part I, we discuss the fundamental ideas that underpin the design of

data-intensive applications. We start in Chapter 1 by discussing what we’re actually

trying to achieve: reliability, scalability, and maintainability; how we need to

think about them; and how we can achieve them. In Chapter 2 we compare

several different data models and query languages, and see how they are appropriate

to different situations. In Chapter 3 we talk about storage engines: how databases

arrange data on disk so that we can find it again efficiently. Chapter 4 turns to

formats for data encoding (serialization) and evolution of schemas over time.

2. In Part II, we move from data stored on one machine to data that is distributed

across multiple machines. This is often necessary for scalability, but brings with

it a variety of unique challenges. We first discuss replication (Chapter 5),

partitioning/sharding (Chapter 6), and transactions (Chapter 7). We then go into

more detail on the problems with distributed systems (Chapter 8) and what it

means to achieve consistency and consensus in a distributed system (Chapter 9).

3. In Part III, we discuss systems that derive some datasets from other datasets.

Derived data often occurs in heterogeneous systems: when there is no one

database that can do everything well, applications need to integrate several different

databases, caches, indexes, and so on. In Chapter 10 we start with a batch

processing approach to derived data, and we build upon it with stream processing in

Chapter 11. Finally, in Chapter 12 we put everything together and discuss

approaches for building reliable, scalable, and maintainable applications in the

future.

References and Further Reading

Most of what we discuss in this book has already been said elsewhere in some form or

another—in conference presentations, research papers, blog posts, code, bug trackers,

mailing lists, and engineering folklore. This book summarizes the most important

ideas from many different sources, and it includes pointers to the original literature

throughout the text. The references at the end of each chapter are a great resource if

you want to explore an area in more depth, and most of them are freely available

online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional

information. You can access this page at http://bit.ly/designing-data-intensive-apps.

To comment or ask technical questions about this book, send email to

bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our

website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

This book is an amalgamation and systematization of a large number of other

people’s ideas and knowledge, combining experience from both academic research and

industrial practice. In computing we tend to be attracted to things that are new and

shiny, but I think we have a huge amount to learn from things that have been done

before. This book has over 800 references to articles, blog posts, talks,

documentation, and more, and they have been an invaluable learning resource for me. I am very

grateful to the authors of this material for sharing their knowledge.

I have also learned a lot from personal conversations, thanks to a large number of

people who have taken the time to discuss ideas or patiently explain things to me. In

particular, I would like to thank Joe Adler, Ross Anderson, Peter Bailis, Márton

Balassi, Alastair Beresford, Mark Callaghan, Mat Clayton, Patrick Collison, Sean

Cribbs, Shirshanka Das, Niklas Ekström, Stephan Ewen, Alan Fekete, Gyula Fóra,

Camille Fournier, Andres Freund, John Garbutt, Seth Gilbert, Tom Haggett, Pat

Helland, Joe Hellerstein, Jakob Homan, Heidi Howard, John Hugg, Julian Hyde, Conrad

Irwin, Evan Jones, Flavio Junqueira, Jessica Kerr, Kyle Kingsbury, Jay Kreps, Carl

Lerche, Nicolas Liochon, Steve Loughran, Lee Mallabone, Nathan Marz, Caitie
