Introduction to Hadoop and Spark: Principles and Practice of Data-Intensive Systems


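Data-intensive Systems: Principles and Fundamentals using Hadoop and Spark is a book by Tomasz Wiktorski that gives an in-depth introduction to the core concepts behind big data and data science applications. It is aimed at beginners and helps them build a basic understanding of data-intensive systems, so that they can work independently and make use of advanced reference material on current technologies as they study further. The book follows a problem-based learning approach: each chapter revolves around a simplified but realistic problem that is solved with data-intensive technologies. Readers follow a reference scenario based on an open Apache dataset and learn step by step how Hadoop is applied. The book originates from the master's course on data-intensive systems at the University of Stavanger, and some chapters have also been used in guest lectures at Purdue University and Lodz University of Technology.

The contents include:
1. Introduction: an overview of why data-intensive systems matter and of the roles Hadoop and Spark play in them.
2. Hadoop 101 and the reference scenario: an entry-level explanation of the basic Hadoop concepts, including its distributed computing model, HDFS (Hadoop Distributed File System), and how MapReduce works.
3. Functional abstraction: how Hadoop can be understood and used through layers of abstraction that simplify development.
4. MapReduce: a closer study of the programming model, including algorithms and patterns such as the shuffle operation and combiner optimization.
5. Hadoop architecture: a detailed look at the components of a Hadoop cluster, such as NameNode, DataNode, and JobTracker.
6. NoSQL databases: how non-relational databases such as Cassandra and HBase work together with Hadoop to process large-scale data.
7. Spark: a comparison with Hadoop, covering Spark's in-memory computing model (Resilient Distributed Dataset, RDD) and higher-level features such as DataFrame and Spark Streaming.

SpringerBriefs in Advanced Information and Knowledge Processing is a concise, up-to-date academic series published by Springer. As part of this series, the book is intended to give researchers a channel for publishing results that are not yet fully mature but go beyond the level of a workshop paper or journal article. Topics covered by the series include big data analytics, big data knowledge, bioinformatics, business intelligence, computer security, data mining and knowledge discovery, and information quality and privacy. Data-intensive Systems: Principles and Fundamentals using Hadoop and Spark is both a practical technical guide and a useful resource for exploring the core theory and technology of data-intensive systems, suited to professionals who want to go deeper in this field.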

2 Introduction

2.1 Growing Datasets

Data growth is one of the most important trends in industry today. The same applies to science and even to our everyday lives. Eric Schmidt famously said in 2010, as Google’s CEO at the time, that every 2 days we create as much information as we did up to 2003. This statement underlines the fact that the data growth we observe is not just linear; it is accelerating.

Let us look at a few examples of big datasets in Fig. 2.1. You are most probably familiar with the concept of trending articles on Wikipedia, which presents hourly updated Wikipedia trends. Have you ever wondered how these trends are determined? Wikipedia servers store page traffic data aggregated per hour; this leads quite directly to the calculation of trends. One caveat, however, is the amount of data required. One month of such data requires 50 GB of storage space and the related computational capabilities. While this is not much for a single month, the moment you want to analyze data from several months, both storage and computation become a considerable challenge. In one of the earlier chapters, we already discussed the analysis of genomic information as one of the major use cases for data-intensive systems, of which the 1000 Genomes Project is a famous example. Data collected as a part of this project require 200 TB of storage.

Fig. 2.1 Examples of big datasets. Source Troester (2012); European Organization for Nuclear

Research (2015); The Internet Archive (2015); Amazon Web Services (2015)


Walmart is one of the major global retailers, so it comes as no surprise that they collect a significant amount of data related to customer transactions; a few years ago this was estimated at 2.5 PB. Scientific experiments on the Large Hadron Collider (LHC) at CERN generate around 30 PB of data a year. It is important to notice that this amount of new data has to be stored, backed up, and processed every single year. The last example I would like to offer is the Internet Archive, which you are most probably familiar with thanks to the Wayback Machine it offers. It gives you a chance to explore how websites have changed over time. Currently, the total storage used by the Internet Archive has reached 50 PB.

To put these numbers in context, let us consider the largest hard drive available now, which holds 8 TB. It would take over 6000 HDDs just to store the Internet Archive; this does not include extra storage to provide redundancy in case of failure. It would take almost 4000 new HDDs every year for LHC data, again in just one copy, without any redundancy.
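The drive counts quoted here can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the function name `drives_needed` is introduced only for this illustration.

```python
TB = 10**12  # decimal terabyte in bytes (assumption: decimal, not binary, units)
PB = 10**15  # decimal petabyte in bytes


def drives_needed(dataset_bytes: int, drive_bytes: int) -> int:
    """Smallest number of drives that holds a single copy of the dataset."""
    # Integer ceiling division avoids floating-point rounding.
    return (dataset_bytes + drive_bytes - 1) // drive_bytes


drive = 8 * TB  # largest drive size mentioned in the text

internet_archive_drives = drives_needed(50 * PB, drive)
lhc_drives_per_year = drives_needed(30 * PB, drive)

print("Internet Archive:", internet_archive_drives, "drives")
print("LHC, per year:   ", lhc_drives_per_year, "drives")
```

Both results refer to a single copy of the data; any replication factor multiplies them accordingly.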

2.2 Hardware Trends

All these data need to be stored and processed. Storage capacity has been growing quite rapidly in recent years. At the same time, processing capacity has not been growing as fast. This is most easily seen by comparing the growth in hard drive capacity with hard drive throughput in Fig. 2.2. You can notice that the capacity of a regular hard drive grows exponentially, while throughput grows only linearly. Typically, the pace of data generation corresponds closely with the increase in storage capabilities, which keeps the available storage occupied. The conclusion is that every year it takes longer and longer to read all data from a hard drive. Price and capacity improvements in SSD and memory technologies enable more complex algorithms and frameworks for data processing, but they do not change the general trend.
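To see why a full read takes longer every year, it is enough to divide capacity by throughput. The sketch below uses made-up drive generations in which capacity doubles while throughput rises by a fixed step; the numbers are illustrative assumptions, not measurements taken from Fig. 2.2.

```python
def full_scan_hours(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to read a full drive sequentially at a constant throughput."""
    return capacity_bytes / throughput_bytes_per_s / 3600


# Hypothetical generations: (capacity in TB, throughput in MB/s).
# Capacity doubles each step (exponential), throughput adds 20 MB/s (linear).
generations = [(1, 100), (2, 120), (4, 140), (8, 160)]

scan_times = [
    full_scan_hours(capacity_tb * 1e12, throughput_mb * 1e6)
    for capacity_tb, throughput_mb in generations
]

for (capacity_tb, throughput_mb), hours in zip(generations, scan_times):
    print(f"{capacity_tb} TB at {throughput_mb} MB/s: {hours:.1f} h")
```

Whatever the exact figures, the scan time keeps growing as long as capacity grows faster than throughput.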

This brings us to the core of the problem. How can we process all this information within a reasonable time if capacity grows faster than throughput? One possibility is some form of parallel processing. However, traditional parallel programming techniques separate data processing from data storage, so reading through the data remains a bottleneck. The MapReduce paradigm, which is dominant in data-intensive systems, solves this problem by combining map- and reduce-based functional programming routines with a distributed file system. Hadoop is the most common example of such a system. The focus is shifted from moving data to computation, to moving computation to data. This general idea is prevalent in modern data-intensive systems; it is a defining feature.
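The map and reduce routines mentioned here can be illustrated with a single-process word count. This is only a sketch of the programming model in plain Python, not Hadoop's API; the function names and the two sample "blocks" are assumptions made for the example. In a real cluster, each block would be mapped on the node that stores it.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


# Hypothetical input, split into two blocks as if stored on two nodes.
blocks = [["big data big"], ["data systems"]]

mapped = chain.from_iterable(
    map_phase(line) for block in blocks for line in block
)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, both steps can run independently on many machines, which is what makes it possible to move computation to the data.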


Fig. 2.2 Historical capacity versus throughput for HDDs. Source Leventhal (2009)

2.3 The V’s of Big Data

The definition of big data through a combination of three characteristics (volume, variety, and velocity) goes back to 2001. In a short paper, Doug Laney (2001) from Gartner introduced for the first time the combination of these three elements as an approach to defining big data. It was later widely adopted and with time extended with many additional adjectives starting with v.

The most commonly adopted definition now comes from the NIST Big Data Public Working Group, which in one of its documents (SP NIST 2015) defined big data as

consisting of “extensive datasets—primarily in the characteristics of volume, variety,

velocity, and/or variability—that require a scalable architecture for efﬁcient storage,

manipulation, and analysis.”

Volume describes the sheer amount of data, variety refers to different types and sources of data, velocity represents the speed at which data arrive, and variability relates to any change in the aforementioned characteristics. For instance, average velocity might be low, but it might have peaks, which are important to account for. In some cases, just one of these four characteristics is strongly pronounced in your data, most typically volume. In other cases, none of the characteristics is challenging on its own, but a combination of them leads to a big data challenge.
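The remark about peaks can be made concrete by comparing the average arrival rate with the maximum one. The event counts below are invented purely for illustration.

```python
from statistics import mean

# Made-up number of events arriving in each of eight consecutive minutes.
events_per_minute = [10, 12, 9, 11, 240, 13, 10, 8]

average_rate = mean(events_per_minute)
peak_rate = max(events_per_minute)
peak_to_average = peak_rate / average_rate

print(f"average: {average_rate:.1f} events/min")
print(f"peak:    {peak_rate} events/min")
print(f"peak is {peak_to_average:.1f}x the average")
```

A system sized only for the average rate would fall behind during the fifth minute, which is why variability is treated as a characteristic in its own right.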

There are two general technological trends that are associated with big data. The first is horizontal scaling. The traditional way to meet growing demand for computing power was to replace existing machines with new, more powerful ones; this is called vertical scaling. Due to changes in hardware development, this approach became unsustainable. Instead, the new approach is to build a cluster of ordinary machines and to add more machines to the cluster as processing needs grow. Such an approach is called horizontal scaling.
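A rough way to reason about horizontal scaling is to ask how many ordinary machines are needed to read a dataset within a deadline. The sketch below assumes the data are spread evenly across machines and ignores coordination overhead; the per-machine throughput of 100 MB/s is a hypothetical figure, while the 200 TB dataset size is the one quoted earlier for the 1000 Genomes Project.

```python
def machines_needed(dataset_bytes: int, bytes_per_s_per_machine: int, deadline_s: int) -> int:
    """Machines required to read the dataset in parallel within the deadline."""
    readable_per_machine = bytes_per_s_per_machine * deadline_s
    # Integer ceiling division: a partly used machine still counts as one.
    return (dataset_bytes + readable_per_machine - 1) // readable_per_machine


dataset = 200 * 10**12     # 200 TB, decimal units
throughput = 100 * 10**6   # hypothetical 100 MB/s per machine

within_one_hour = machines_needed(dataset, throughput, 3600)
within_one_day = machines_needed(dataset, throughput, 24 * 3600)

print("machines for a 1-hour scan:", within_one_hour)
print("machines for a 1-day scan: ", within_one_day)
```

Tightening the deadline is met by adding machines rather than by buying a faster one, which is the essence of scaling horizontally.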

The second trend concerns the software side of big data. Storage and computation used to be separated, so that the first step in data processing was sending data


You can observe a departure from a universal database in favor of specialized databases suited narrowly to data characteristics and application goals. Sometimes, the actual schema is applied only when reading the data, which allows for even greater flexibility in the data model.
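Applying the schema only at read time can be sketched as follows. Records are stored as raw JSON lines without any enforced structure, and each reader decides which fields it needs and how to type them. The record contents, the helper `read_with_schema`, and the two schemas are all hypothetical and serve only to illustrate the idea.

```python
import json

# Raw records stored as-is; no schema was enforced when they were written.
raw_records = [
    '{"user": "ann", "amount": "12.50", "ts": "2015-03-01"}',
    '{"user": "bob", "amount": "7", "coupon": "SPRING"}',
]


def read_with_schema(lines, schema):
    """Parse raw lines and apply a schema (field name -> type cast) while reading."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two readers interpret the same stored data in different ways.
billing_schema = {"user": str, "amount": float}
marketing_schema = {"user": str, "coupon": str}

billing = list(read_with_schema(raw_records, billing_schema))
marketing = list(read_with_schema(raw_records, marketing_schema))

print(billing)
print(marketing)
```

Fields a reader does not ask for are ignored and missing fields become `None`, so new kinds of records can be stored without migrating the data already written.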

2.5 Data as the Fourth Paradigm of Science

In January 2007, Jim Gray gave a talk to the Computer Science and Telecommunications Board of the (American) National Research Council on the transformations he observed in the scientific method. This talk later led to the book The Fourth Paradigm: Data-Intensive Scientific Discovery (Hey et al. 2009), which develops the ideas presented in the talk.

Gray groups the history of science into three paradigms. In the beginning, science was empirical (or experimental), and it revolved around describing and testing observed phenomena. It was followed by theoretical science, in which generalized models were created in the form of laws and theories. With time, the phenomena we wished to describe became too complex to be contained in generalized models. This gave rise to computational science, focused on simulation, which was assisted by exponential growth in computing power (Moore’s Law). Relying on the growth of computing power of individual desktop machines, and also supercomputers, is referred to as vertical scaling. The timescale of the science paradigms is presented in Fig. 2.4.

Currently, we observe the emergence of a new, fourth paradigm in science: data-intensive science. Exponentially increasing amounts of data are collected, for instance, from advanced instruments (such as the Large Hadron Collider at CERN), everyday sensors (Internet of Things), or wearable devices (such as activity bracelets). Data become central to the scientific process. They require new methodologies, tools, infrastructures, and differently trained scientists.

Gray also observes that in data-intensive science, problems and methods can be abstracted from a specific context. Usually, two general approaches are common: “looking for needles in haystacks or looking for the haystacks themselves.” There are several generic tools for data collection and analysis that support these approaches.

Fig. 2.4 The timescale of science paradigms
