the open source MR implementation. These tools (especially Mahout) are maturing fast and are open source. Mahout has a set of algorithms for clustering and classification, as well as a
very good recommendation algorithm (Konstan and Riedl 2012). Mahout can thus be said to work on big data, with a number of production use cases, mainly for its recommendation system. I have also used Mahout in a production system for realizing recommendation algorithms in the financial domain and found it to be scalable, though not without issues. (I had to
tweak the source significantly.) One observation about Mahout is that it implements only a small subset of ML algorithms over Hadoop: only 25 algorithms are of production quality, and only 8 or 9 of those are usable over Hadoop, meaning scalable over large data sets. These include linear regression, the linear SVM, K-means clustering, and so forth. It does provide a fast sequential implementation of logistic regression with parallelized training (see the sketch below). However, as several others have also noted (see Quora.com, for instance), it does not have implementations of nonlinear SVMs or multinomial logistic regression (otherwise known as the discrete choice model).
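As a concrete illustration of the sequential trainer just mentioned, the following is a minimal sketch of training a binary classifier with Mahout's SGD-based logistic regression (the org.apache.mahout.classifier.sgd package). The toy data, class name, and hyperparameter values here are made up for illustration, and the exact fluent setters can vary across Mahout versions.

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;

    public class MahoutSgdSketch {
        public static void main(String[] args) {
            // Binary classifier (2 categories) over 2 features with an L1 prior;
            // the learning rate and regularization values are illustrative only.
            OnlineLogisticRegression lr =
                new OnlineLogisticRegression(2, 2, new L1())
                    .learningRate(0.1)
                    .lambda(1e-4);

            // Toy training set: label is 1 when the two features sum to more than 1.
            double[][] xs = {{0.1, 0.2}, {0.9, 0.8}, {0.3, 0.1}, {0.7, 0.9}};
            int[] ys = {0, 1, 0, 1};
            for (int epoch = 0; epoch < 100; epoch++) {
                for (int i = 0; i < xs.length; i++) {
                    lr.train(ys[i], new DenseVector(xs[i]));  // one SGD step per example
                }
            }

            // classifyScalar returns the estimated P(label == 1) in the binary case.
            System.out.println(lr.classifyScalar(new DenseVector(new double[]{0.8, 0.9})));
        }
    }

Note that this training loop runs on a single machine; it is sequential code, not a Hadoop job, which is precisely the distinction drawn above.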
Overall, this book is not intended to bash Mahout. However, my point is that it is quite hard to implement certain ML algorithms, including the kernel SVM and conjugate gradient descent (CGD; note that Mahout has an implementation of stochastic gradient descent), over Hadoop. This has been pointed out by several others as well; for instance, see the paper by Professor Srirama (Srirama et al. 2012). This paper makes detailed comparisons between Hadoop and Twister MR (Ekanayake et al. 2010) with regard to iterative algorithms such as CGD and shows that the overheads can be significant for Hadoop. What do I mean by iterative? A computation in which a set of entities perform some work, wait for results from neighbors or other entities, and then start the next iteration. The
CGD is a perfect example of an iterative ML algorithm: each CGD iteration can be broken down into three primitives, daxpy, ddot, and matmul. I will explain these three primitives: daxpy is an operation that takes a vector x, multiplies it by a constant a, and adds another vector y to it; ddot computes the dot product of two vectors x and y; matmul multiplies a matrix by a vector and produces a vector output. Each primitive invocation translates to 1 MR, and because a single CG iteration involves one matmul, two ddots, and three daxpys, this leads to 6 MRs per iteration and eventually hundreds of MRs per CG computation, as well as a few gigabytes (GB) of communication even for small matrices (see the sketch following this paragraph). In essence, the setup cost per iteration (which includes reading from HDFS into memory) overwhelms the computation for that iteration, leading to performance degradation in Hadoop MR. In contrast, Twister distinguishes between static and variable data, allowing static data to remain in memory across MR iterations, and it adds a combine phase for collecting all reduce-phase outputs; hence, it performs significantly better.
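To make the per-iteration cost concrete, here is a minimal, self-contained Java sketch (my own illustration, not Mahout or Twister code) of one CG iteration for solving Ax = b, expressed in terms of the three primitives. Counting the calls shows where the 6 MRs per iteration come from: in a naive Hadoop port, each primitive invocation would become at least one MR job over data re-read from HDFS.

    public final class CgIterationSketch {

        // daxpy: y <- a*x + y (scale x by the constant a and add it to y)
        static void daxpy(double a, double[] x, double[] y) {
            for (int i = 0; i < x.length; i++) y[i] += a * x[i];
        }

        // ddot: the dot product of x and y
        static double ddot(double[] x, double[] y) {
            double s = 0.0;
            for (int i = 0; i < x.length; i++) s += x[i] * y[i];
            return s;
        }

        // matmul: matrix-vector product m * v, producing a vector
        static double[] matmul(double[][] m, double[] v) {
            double[] out = new double[m.length];
            for (int i = 0; i < m.length; i++) out[i] = ddot(m[i], v);
            return out;
        }

        // One CG iteration for Ax = b: 1 matmul + 2 ddots + 3 daxpys = 6 primitive
        // calls. x is the current solution, r the residual, p the search direction;
        // rr = ddot(r, r) is carried over from the previous iteration. Returns the
        // updated rr so the caller can loop until convergence.
        static double iterate(double[][] a, double[] x, double[] r, double[] p, double rr) {
            double[] ap = matmul(a, p);           // primitive 1: matmul
            double alpha = rr / ddot(p, ap);      // primitive 2: ddot
            daxpy(alpha, p, x);                   // primitive 3: x <- x + alpha*p
            daxpy(-alpha, ap, r);                 // primitive 4: r <- r - alpha*A*p
            double rrNew = ddot(r, r);            // primitive 5: ddot
            double[] pNew = r.clone();
            daxpy(rrNew / rr, p, pNew);           // primitive 6: p <- r + beta*p
            System.arraycopy(pNew, 0, p, 0, p.length);
            return rrNew;
        }
    }

In memory, each of these primitives is a trivial loop; the degradation in Hadoop MR comes from the job setup and HDFS I/O wrapped around every such call, which is exactly the overhead Twister avoids by keeping static data (the matrix A) resident across iterations.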
The other second-generation tools are traditional tools that have been scaled to work over Hadoop. The choices in this space include the work done by Revolution Analytics, among others, to scale R over Hadoop, as well as the work to implement a scalable runtime over Hadoop for R programs (Venkataraman et al. 2012). SAS in-memory analytics, part of the High-Performance Analytics toolkit from SAS, is another attempt at scaling a traditional tool by using a Hadoop cluster. However, the recently released version works over Greenplum/Teradata in