Large-Scale Data Processing and Management: Tackling Big Data Challenges

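"Large Scale and Big Data: Processing and Management.pdf" is a document about large-scale data and big data processing, edited by Sherif Sakr and Mohamed Medhat Gaber. As the number of Internet users grows rapidly, massive amounts of data are generated every day: tweets on Twitter, log data at Facebook, trading information from the New York Stock Exchange, and data produced by countless RFID tags, GPS devices, and networked sensors. This data volume is expected to double every two years and to keep growing over the next decade. Enterprises face large amounts of semi-structured or unstructured data, whose characteristics are summarized as the 3Vs of big data: Volume, Velocity, and Variety. Volume refers to the scale of the data, from terabytes to zettabytes; velocity reflects real-time data streams and large-scale data movement; variety covers the different structural forms of data, from relational databases to logs to raw text.

Faced with these challenges, enterprises want to analyze and understand Internet-scale information as easily as they handle small, structured data sets. The goal of big data technology is to help enterprises analyze and understand this massive data quickly, so that they become more agile in their operations and, through innovation in data analysis and decision making, avoid missing business opportunities. The book may cover many aspects of big data processing and management, including data collection, storage, cleaning, analysis, and visualization. It may also discuss tools and techniques such as Hadoop, Spark, NoSQL databases, stream processing frameworks (such as Apache Kafka), machine learning algorithms, and data mining methods, all of which are designed to cope with the 3V characteristics of big data. In addition, the book may touch on data quality assurance, data security, privacy protection, and compliance, which are important factors to consider when processing big data.

For data scientists, data engineers, and IT professionals, the book may offer key insights into understanding and applying big data solutions, helping them seize opportunities in an era of information explosion and improve the competitiveness of their organizations. "Large Scale and Big Data: Processing and Management" is a specialist work that explores the challenges and solutions of the big data field in depth and aims to help readers manage and use big data effectively to drive business growth and innovation.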

Distributed Programming for the Cloud: Models, Challenges, and Analytics Engines

Mohammad Hammoud and Majd F. Sakr

CONTENTS

1.1 Introduction
1.2 Taxonomy of Programs
1.3 Tasks and Jobs in Distributed Programs
1.4 Motivations for Distributed Programming
1.5 Models of Distributed Programs
1.5.1 Distributed Systems and the Cloud
1.5.2 Traditional Programming Models and Distributed Analytics Engines
1.5.2.1 The Shared-Memory Programming Model
1.5.2.2 The Message-Passing Programming Model
1.5.3 Synchronous and Asynchronous Distributed Programs
1.5.4 Data Parallel and Graph Parallel Computations
1.5.5 Symmetrical and Asymmetrical Architectural Models
1.6 Main Challenges in Building Cloud Programs
1.6.1 Heterogeneity
1.6.2 Scalability
1.6.3 Communication
1.6.4 Synchronization
1.6.5 Fault Tolerance
1.6.6 Scheduling
1.7 Summary
References

1.1 INTRODUCTION

The effectiveness of cloud programs hinges on the manner in which they are designed, implemented, and executed. Designing and implementing programs for the cloud requires several considerations. First, they involve specifying the underlying programming model, whether message passing or shared memory. Second, they entail developing a synchronous or asynchronous computation model. Third, cloud programs can be tailored for data or graph parallelism, which requires employing either data striping and distribution or graph partitioning and mapping. Lastly, from architectural and management perspectives, a cloud program can typically be organized in one of two ways, master/slave or peer-to-peer. Such organizations define the program's complexity, efficiency, and scalability.

In addition to the above design considerations, constructing cloud programs requires special attention to several challenges: scalability, communication, heterogeneity, synchronization, fault tolerance, and scheduling. First, scalability is hard to achieve in large-scale systems (e.g., clouds) for several reasons, such as the inability to parallelize all parts of an algorithm, the high probability of load imbalance, and the inevitability of synchronization and communication overheads. Second, exploiting locality and minimizing network traffic are not easy to accomplish on (public) clouds, since network topologies are usually unexposed. Third, heterogeneity caused by two common realities on clouds, virtualization environments and variety in datacenter components, imposes difficulties in scheduling tasks and masking hardware and software differences across cloud nodes. Fourth, synchronization mechanisms must guarantee mutually exclusive accesses as well as properties like avoiding deadlocks and transitive closures, which are highly likely in distributed settings. Fifth, fault-tolerance mechanisms, including task resiliency, distributed checkpointing, and message logging, should be incorporated, since the likelihood of failures increases on large-scale (public) clouds. Finally, task locality, high parallelism, task elasticity, and service level objectives (SLOs) need to be addressed in task and job schedulers for programs to execute effectively.

Although designing, addressing, and implementing the requirements and challenges of cloud programs are crucial, they are difficult, require time and resource investments, and pose correctness and performance issues. Recently, distributed analytics engines such as MapReduce, Pregel, and GraphLab were developed to relieve programmers of most of the work needed to construct cloud programs and to let them focus mainly on the sequential parts of their algorithms. Typically, these analytics engines automatically parallelize sequential algorithms provided by users in high-level programming languages like Java and C++, synchronize and schedule constituent tasks and jobs, and handle failures, all without any involvement from users/developers. In this chapter, we first define some common terms in the theory of distributed programming, draw a requisite relationship between distributed systems and clouds, and discuss the main requirements and challenges for building distributed programs for clouds. While discussing the main requirements for building cloud programs, we indicate how MapReduce, Pregel, and GraphLab address each requirement. Finally, we close with a summary of the chapter and a comparison among MapReduce, Pregel, and GraphLab.
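To make the division of labor concrete, the following is a minimal sketch of the idea that the user supplies only sequential functions while an engine drives them. It is a toy, single-process illustration and not the API of MapReduce, Pregel, or GraphLab; the names `map_fn`, `reduce_fn`, and `run_toy_engine` are hypothetical, and a real engine would additionally distribute the calls across machines, schedule them, and recover from failures.

```python
from collections import defaultdict


def map_fn(line):
    """User-provided sequential code: emit (word, 1) for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """User-provided sequential code: combine all values emitted for one key."""
    return word, sum(counts)


def run_toy_engine(lines, map_fn, reduce_fn):
    """Hypothetical stand-in for an analytics engine.

    It applies map_fn to every input record, groups the emitted values by
    key, and applies reduce_fn to each group. Everything runs sequentially
    here; only the structure mirrors what an engine automates.
    """
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


counts = run_toy_engine(
    ["the cloud", "the cloud scales", "Cloud programs"], map_fn, reduce_fn
)
print(counts)
```

Note that neither `map_fn` nor `reduce_fn` contains any threading, messaging, or failure handling; in this sketch, as in the engines described above, those concerns live entirely outside the user's code.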

1.2 TAXONOMY OF PROGRAMS

A computer program consists of variable declarations, variable assignments, expressions, and flow control statements, typically written in a high-level programming language such as Java or C++. Computer programs are compiled before being executed on machines. After compilation, they are converted to machine instructions (code) that run over computer processors either sequentially or concurrently, in an in-order or out-of-order manner, respectively. A sequential program is a program that runs in the program order. The program order is the original order of statements in a program as specified by a programmer. A concurrent program is a set of sequential programs that share in time a certain processor when executed. Sharing in time (or timesharing) allows sequential programs to take turns in using a certain resource component. For instance, with a single CPU and multiple sequential programs, the operating system (OS) can allocate the CPU to each program for a specific time interval, given that only one program can run at a time on the CPU. This can be achieved using a specific CPU scheduler such as the round-robin scheduler [69].
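The timesharing idea can be illustrated with a small simulation. This is only a sketch under simplifying assumptions (no I/O, no priorities, no context-switch cost, all programs ready at time 0); the function name `round_robin` and the burst times are made up for illustration and do not come from the text.

```python
from collections import deque


def round_robin(burst_times, quantum):
    """Simulate round-robin timesharing of a single CPU.

    burst_times: dict mapping a program name to the CPU time it needs.
    quantum: length of the time interval granted on each turn.
    Returns the execution trace as a list of (name, start, end) tuples.
    """
    ready = deque(burst_times.items())  # programs waiting for the CPU
    clock = 0
    trace = []
    while ready:
        name, remaining = ready.popleft()
        run = min(quantum, remaining)  # run for one quantum or until done
        trace.append((name, clock, clock + run))
        clock += run
        if remaining > run:
            # Not finished: go to the back of the queue and wait for a turn.
            ready.append((name, remaining - run))
    return trace


trace = round_robin({"P1": 5, "P2": 3, "P3": 1}, quantum=2)
for name, start, end in trace:
    print(f"{name}: {start}-{end}")
```

At any instant only one program occupies the CPU, yet all three make progress in interleaved slices, which is what makes the set a concurrent program in the sense defined above.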

Programs, whether sequential or concurrent, are often referred to interchangeably as applications. A different term that is also frequently used alongside concurrent programs is parallel programs. Parallel programs are technically different from concurrent programs. A parallel program is a set of sequential programs that overlap in time by running on separate CPUs. In multiprocessor systems such as chip multicore machines, related sequential programs that are executed on different cores represent a parallel program, while related sequential programs that share the same CPU in time represent a concurrent program. To this end, we refer to a parallel program with multiple sequential programs that run on different networked machines (not on different cores of the same machine) as a distributed program. Consequently, a distributed program can essentially include all types of programs. In particular, a distributed program can consist of multiple parallel programs, which in turn can consist of multiple concurrent programs, which in turn can consist of multiple sequential programs. For example, assume a set S that includes 4 sequential programs, P1, P2, P3, and P4 (i.e., S = {P1, P2, P3, P4}). A concurrent program, P′, can encompass P1 and P2 (i.e., P′ = {P1, P2}), whereby P1 and P2 share in time a single core. Furthermore, a parallel program, P″, can encompass P′ and P3 (i.e., P″ = {P′, P3}), whereby P′ and P3 overlap in time over multiple cores on the same machine. Lastly, a distributed program, P‴, can encompass P″ and P4 (i.e., P‴ = {P″, P4}), whereby P″ runs on different cores on the same machine and P4 runs on a different machine as opposed to P″. In this chapter, we are mostly concerned with distributed programs. Figure 1.1 shows our program taxonomy.

[Figure: a Program is one of four kinds. Sequential program (runs on a single core); Concurrent program (shares in time a core on a single machine); Parallel program (runs on separate cores on a single machine); Distributed program (runs on separate cores on different machines).]

FIGURE 1.1 Our taxonomy of programs.
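The nesting in the P1 to P4 example of this section can be written down as a small data structure. The sketch below only models the containment relationships; it does not execute anything on cores or machines, and the `Program` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    """A program in the taxonomy; composite kinds hold their constituents."""
    name: str
    kind: str = "sequential"
    parts: tuple = ()

    def sequential_parts(self):
        """Return the names of all sequential programs nested inside this one."""
        if not self.parts:
            return [self.name]
        names = []
        for part in self.parts:
            names.extend(part.sequential_parts())
        return names


# S = {P1, P2, P3, P4}
p1, p2, p3, p4 = (Program(f"P{i}") for i in range(1, 5))

# P' = {P1, P2}: P1 and P2 share one core in time.
concurrent = Program("P'", "concurrent", (p1, p2))
# P'' = {P', P3}: P' and P3 overlap in time on different cores of one machine.
parallel = Program("P''", "parallel", (concurrent, p3))
# P''' = {P'', P4}: P4 runs on a different machine than P''.
distributed = Program("P'''", "distributed", (parallel, p4))

print(distributed.sequential_parts())
```

Flattening the distributed program recovers the original set S, which reflects the statement that a distributed program can include all the other types of programs.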


1.3 TASKS AND JOBS IN DISTRIBUTED PROGRAMS

Another common term in the theory of parallel/distributed programming is multitasking. Multitasking refers to overlapping the computation of one program with that of another. Multitasking is central to all modern operating systems (OSs), whereby an OS can overlap computations of multiple programs by means of a scheduler. Multitasking has become so useful that almost all modern programming languages now support it by providing constructs for multithreading. A thread of execution is the smallest sequence of instructions that an OS can manage through its scheduler. The term thread was popularized by Pthreads (POSIX threads [59]), a specification of concurrency constructs that has been widely adopted, especially in UNIX systems [8]. A technical distinction is often made between processes and threads. A process runs using its own address space, while a thread runs within the address space of a process (i.e., threads are parts of processes and not standalone sequences of instructions). A process can contain one or many threads. In principle, processes do not share address spaces with each other, while the threads in a process do share the process's address space. The term task is also used to refer to a small unit of work. In this chapter, we use the term task to denote a process, which can include multiple threads. In addition, we refer to a group of tasks (which can be only one task) that belong to the same program/application as a job. An application can encompass multiple jobs. For instance, a fluid dynamics application typically consists of three jobs, one responsible for structural analysis, one for fluid analysis, and one for thermal analysis. Each of these jobs can in turn have multiple tasks to carry out the pertaining analysis. Figure 1.2 demonstrates the concepts of processes, threads, tasks, jobs, and applications.
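The relationship among threads, tasks, jobs, and applications can be sketched in a few lines. As a simplifying assumption, each task is modeled here as a function call inside one Python process rather than as a separate OS process, so the sketch shows only the thread-level sharing and the grouping hierarchy; the names `run_task`, `Task1`, `Job1`, and so on are illustrative.

```python
import threading


def run_task(task_name, thread_count):
    """Run one task made up of several threads.

    All threads append to the same list. This works without any copying
    because threads share the address space of the task they belong to.
    """
    results = []             # shared by every thread of this task
    lock = threading.Lock()  # guards concurrent appends to the shared list

    def worker(thread_id):
        with lock:
            results.append(f"{task_name}/Thread{thread_id}")

    threads = [
        threading.Thread(target=worker, args=(i,))
        for i in range(1, thread_count + 1)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)  # sort because thread completion order may vary


# A job groups one or more tasks; an application groups one or more jobs.
job1 = {"Task1": run_task("Task1", 3)}
job2 = {"Task2": run_task("Task2", 2)}
application = {"Job1": job1, "Job2": job2}
print(application)
```

With real processes in place of the function calls, the `results` list of one task would not be visible to another task, since processes do not share address spaces.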

1.4 MOTIVATIONS FOR DISTRIBUTED PROGRAMMING

In principle, every sequential program can be parallelized by identifying sources of parallelism in it. Various analysis techniques at the algorithm and code levels can be applied to identify parallelism in sequential programs [67]. Once sources of parallelism are detected, a program can be split into serial and parallel parts as shown in

read1 read2 read

Process1/Task1 Process2/Task2

Distributed application/program

Process/Task

Job2Job1

read2

read1 read3

FIGURE 1.2 A demonstration of the concepts of processes, threads, tasks, jobs, and

applications.
