by adding more resources is called scale-up, or vertical scaling. The same approach
has been used for years to improve performance: add more capable hardware, and
performance goes up. But this approach can only go so far; at some point, data or query
processing will overwhelm the hardware and you have to upgrade it again. As you scale
up, hardware costs rise, and eventually it is no longer cost effective to upgrade.
Think of a hotdog stand, where replacing a slow hotdog maker with a more
experienced person, one who prepares hotdogs in less time but for higher wages,
improves efficiency. Yet efficiency can be improved only up to a certain point, because
the worker needs time to prepare each hotdog no matter how long the queue is, and
cannot serve the next customer until the current one is served. There is also no control
over customer behavior: customers can customize their orders, and payment takes each
customer a different amount of time. So scaling up takes you only so far before it starts
to bottleneck.
So when your existing resource is fully occupied, add another person to the job at the
same wage instead. That should roughly double performance, scaling throughput linearly
by distributing the work across multiple resources.
The same approach is taken in large-scale data storage and processing scenarios:
you add more commodity hardware to the network to improve performance. But adding
hardware to a network is a bit more complicated than adding more workers to a hotdog
stand. The software has to be aware of the new units of hardware and must support
dividing the processing load across multiple machines. If you let only a single system
process all the data, even when that data is stored on multiple machines, you will
eventually hit a processing-power cap. This means there has to be a way to distribute not
only the data to the new hardware on the network, but also the instructions for processing
that data and returning the results. Generally, a master node instructs all the other nodes
to do the processing and then aggregates the results from each of them. The scale-out
approach is very common in real life; from overcrowded hotdog stands to grocery store
queues, everyone uses it. So in a way, big data problems and their solutions are not so new.
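To make the scatter-gather idea concrete, here is a minimal sketch in Python, assuming a
simple word-count task: a coordinator splits the input into chunks, worker processes handle
their chunks in parallel, and the coordinator merges the partial results. This is only an
illustration of the pattern, not Hadoop code; the task, the chunking strategy, and the worker
count are assumptions chosen for brevity.

# A minimal scatter-gather sketch of the scale-out idea described above:
# a coordinator ("master") splits the data into chunks, hands each chunk
# to a worker process, and then aggregates the partial results.
# This is NOT Hadoop; the word-count task, chunk sizing, and worker count
# are illustrative assumptions only.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk_of_lines):
    """Worker: process one chunk of the data and return a partial result."""
    counts = Counter()
    for line in chunk_of_lines:
        counts.update(line.split())
    return counts

def scale_out_word_count(lines, workers=4):
    """Coordinator: distribute chunks to workers, then merge their results."""
    # Split the input into roughly equal chunks, one per worker.
    chunk_size = max(1, len(lines) // workers)
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    # Scatter: each worker processes its chunk in parallel.
    with Pool(processes=workers) as pool:
        partial_results = pool.map(count_words, chunks)

    # Gather: the coordinator aggregates the partial results.
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total

if __name__ == "__main__":
    sample = ["to be or not to be", "that is the question"]
    print(scale_out_word_count(sample, workers=2))

Hadoop applies the same pattern at a much larger scale: the data lives on many machines,
and the processing instructions are shipped to the nodes where the data resides.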
Apache Hadoop
Apache Hadoop is an open source project and undoubtedly the most widely used framework
for big data solutions. It is a flexible, scalable, and fault-tolerant framework for
handling massive amounts of data. It is called a framework because it is made up of many
components and evolves at a rapid pace. Its components can work together or independently,
as you choose. This book discusses Hadoop and its components in the context of HDInsight,
but all the fundamentals apply to Hadoop in general, too.
A Brief History of Hadoop
In 2003, Google published a paper, “The Google File System,” describing a scalable
distributed file system for large, data-intensive applications. It was followed in
December 2004 by “MapReduce: Simplified Data Processing on Large Clusters.” The ideas
from these papers were put into practice in an open source project, Apache Nutch. Soon
thereafter, a Hadoop subproject was started