The Cornerstone of Building Big Data Applications

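Updated 2024-07-17 · PDF, 23.82 MB

"Designing Data-Intensive Applications" is a book by Martin Kleppmann, widely regarded as one of the best technical books in the field of distributed systems. It focuses on building reliable, scalable, and maintainable data-intensive applications, and it explores the key concepts and techniques behind the design of large-scale data processing systems. It covers data models, data storage, data processing, data consistency, and the foundations of distributed systems, with the aim of helping readers understand and solve the challenges that arise when handling large volumes of data.

1. **Data models**: The book discusses different kinds of database models, such as relational databases (RDBMS), NoSQL databases, and time-series databases, and their suitability for different scenarios. It also covers the principles of data modeling and how to choose data structures that optimize query performance.
2. **Data storage**: It examines a range of storage systems, including distributed file systems, key-value stores, document databases, and column-family databases. The book analyzes the strengths and weaknesses of these systems and how to choose a storage solution that fits the business requirements.
3. **Data processing**: It covers the concepts of batch and stream processing, compares tools such as Hadoop and Spark, and explains the difference between real-time and offline data processing. This part also explains the design of data pipelines and ETL (extract, transform, load) workflows.
4. **Data consistency**: It discusses consistency models in distributed systems, such as strong consistency, eventual consistency, the CAP theorem, and the BASE principle. The book uses examples to explain how to trade off availability against consistency.
5. **Distributed systems fundamentals**: It introduces the basic principles of distributed systems, such as replication, partitioning, failure recovery, and fault-tolerance mechanisms, and how to design robust distributed services. It also covers key topics such as load balancing, service discovery, and network communication.
6. **Best practices and case studies**: Through real-world case analysis, the book shows how to apply this theory in large projects and offers advice on avoiding common mistakes in practice.
7. **Future trends**: Kleppmann also discusses recent developments in data processing, such as new technologies for big data processing and the role of machine learning and artificial intelligence in data-intensive applications.

This book is a valuable resource for anyone working on big data processing, cloud computing, or distributed systems design. It offers both theoretical knowledge and practical experience, helping readers improve their skills in designing data-driven applications.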

which it is changing, as opposed to compute-intensive, where CPU cycles are the bottleneck.

The tools and technologies that help data-intensive applications store and process data have been rapidly adapting to these changes. New types of database systems (“NoSQL”) have been getting lots of attention, but message queues, caches, search indexes, frameworks for batch and stream processing, and related technologies are very important too. Many applications use some combination of these.

The buzzwords that fill this space are a sign of enthusiasm for the new possibilities, which is a great thing. However, as software engineers and architects, we also need to have a technically accurate and precise understanding of the various technologies and their trade-offs if we want to build good applications. For that understanding, we have to dig deeper than buzzwords.

Fortunately, behind the rapid changes in technology, there are enduring principles that remain true, no matter which version of a particular tool you are using. If you understand those principles, you’re in a position to see where each tool fits in, how to make good use of it, and how to avoid its pitfalls. That’s where this book comes in.

The goal of this book is to help you navigate the diverse and fast-changing landscape of technologies for processing and storing data. This book is not a tutorial for one particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples of successful data systems: technologies that form the foundation of many popular applications and that have to meet scalability, performance, and reliability requirements in production every day.

We will dig into the internals of those systems, tease apart their key algorithms, and discuss their principles and the trade-offs they have to make. On this journey, we will try to find useful ways of thinking about data systems: not just how they work, but also why they work that way, and what questions we need to ask.

After reading this book, you will be in a great position to decide which kind of technology is appropriate for which purpose, and understand how tools can be combined to form the foundation of a good application architecture. You won’t be ready to build your own database storage engine from scratch, but fortunately that is rarely necessary. You will, however, develop a good intuition for what your systems are doing under the hood so that you can reason about their behavior, make good design decisions, and track down any problems that may arise.

Who Should Read This Book?

If you develop applications that have some kind of server/backend for storing or processing data, and your applications use the internet (e.g., web applications, mobile apps, or internet-connected sensors), then this book is for you.

xiv | Preface

This book is for software engineers, software architects, and technical managers who love to code. It is especially relevant if you need to make decisions about the architecture of the systems you work on: for example, if you need to choose tools for solving a given problem and figure out how best to apply them. But even if you have no choice over your tools, this book will help you better understand their strengths and weaknesses.

You should have some experience building web-based applications or network services, and you should be familiar with relational databases and SQL. Any non-relational databases and other data-related tools you know are a bonus, but not required. A general understanding of common network protocols like TCP and HTTP is helpful. Your choice of programming language or framework makes no difference for this book.

If any of the following are true for you, you’ll find this book valuable:

• You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.

• You need to make applications highly available (minimizing downtime) and operationally robust.

• You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.

• You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it’s great fun to explore the bright thinking that went into their design.

Sometimes, when discussing scalable data systems, people make comments along the lines of, “You’re not Google or Amazon. Stop worrying about scale and just use a relational database.” There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization. However, it’s also important to choose the right tool for the job, and different technologies each have their own strengths and weaknesses. As we shall see, relational databases are important but not the final word on dealing with data.

Scope of This Book

This book does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation for those things. Instead we discuss the various principles and trade-offs that are fundamental to data systems, and we explore the different design decisions taken by different products.


In the ebook editions we have included links to the full text of online resources. All links were verified at the time of publication, but unfortunately links tend to break frequently due to the nature of the web. If you come across a broken link, or if you are reading a print copy of this book, you can look up references using a search engine. For academic papers, you can search for the title in Google Scholar to find open-access PDF files. Alternatively, you can find all of the references at https://github.com/ept/ddia-references, where we maintain up-to-date links.

We look primarily at the architecture of data systems and the ways they are integrated into data-intensive applications. This book doesn’t have space to cover deployment, operations, security, management, and other areas. Those are complex and important topics, and we wouldn’t do them justice by making them superficial side notes in this book. They deserve books of their own.

Many of the technologies described in this book fall within the realm of the Big Data buzzword. However, the term “Big Data” is so overused and underdefined that it is not useful in a serious engineering discussion. This book uses less ambiguous terms, such as single-node versus distributed systems, or online/interactive versus offline/batch processing systems.

This book has a bias toward free and open source software (FOSS), because reading, modifying, and executing source code is a great way to understand how something works in detail. Open platforms also reduce the risk of vendor lock-in. However, where appropriate, we also discuss proprietary software (closed-source software, software as a service, or companies’ in-house software that is only described in literature but not released publicly).

Outline of This Book

This book is arranged into three parts:

1. In Part I, we discuss the fundamental ideas that underpin the design of data-intensive applications. We start in Chapter 1 by discussing what we’re actually trying to achieve: reliability, scalability, and maintainability; how we need to think about them; and how we can achieve them. In Chapter 2 we compare several different data models and query languages, and see how they are appropriate to different situations. In Chapter 3 we talk about storage engines: how databases arrange data on disk so that we can find it again efficiently. Chapter 4 turns to formats for data encoding (serialization) and evolution of schemas over time.

2. In Part II, we move from data stored on one machine to data that is distributed across multiple machines. This is often necessary for scalability, but brings with it a variety of unique challenges. We first discuss replication (Chapter 5), partitioning/sharding (Chapter 6), and transactions (Chapter 7). We then go into more detail on the problems with distributed systems (Chapter 8) and what it means to achieve consistency and consensus in a distributed system (Chapter 9).

3. In Part III, we discuss systems that derive some datasets from other datasets. Derived data often occurs in heterogeneous systems: when there is no one database that can do everything well, applications need to integrate several different databases, caches, indexes, and so on. In Chapter 10 we start with a batch processing approach to derived data, and we build upon it with stream processing in Chapter 11. Finally, in Chapter 12 we put everything together and discuss approaches for building reliable, scalable, and maintainable applications in the future.

References and Further Reading

Most of what we discuss in this book has already been said elsewhere in some form or another: in conference presentations, research papers, blog posts, code, bug trackers, mailing lists, and engineering folklore. This book summarizes the most important ideas from many different sources, and it includes pointers to the original literature throughout the text. The references at the end of each chapter are a great resource if you want to explore an area in more depth, and most of them are freely available online.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.


How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/designing-data-intensive-apps.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

This book is an amalgamation and systematization of a large number of other people’s ideas and knowledge, combining experience from both academic research and industrial practice. In computing we tend to be attracted to things that are new and shiny, but I think we have a huge amount to learn from things that have been done before. This book has over 800 references to articles, blog posts, talks, documentation, and more, and they have been an invaluable learning resource for me. I am very grateful to the authors of this material for sharing their knowledge.

I have also learned a lot from personal conversations, thanks to a large number of people who have taken the time to discuss ideas or patiently explain things to me. In particular, I would like to thank Joe Adler, Ross Anderson, Peter Bailis, Márton Balassi, Alastair Beresford, Mark Callaghan, Mat Clayton, Patrick Collison, Sean Cribbs, Shirshanka Das, Niklas Ekström, Stephan Ewen, Alan Fekete, Gyula Fóra, Camille Fournier, Andres Freund, John Garbutt, Seth Gilbert, Tom Haggett, Pat Helland, Joe Hellerstein, Jakob Homan, Heidi Howard, John Hugg, Julian Hyde, Conrad Irwin, Evan Jones, Flavio Junqueira, Jessica Kerr, Kyle Kingsbury, Jay Kreps, Carl Lerche, Nicolas Liochon, Steve Loughran, Lee Mallabone, Nathan Marz, Caitie
