Apache Mahout: Distributed Algorithm Design and Implementation

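PDF | 1.43 MB | Updated 2024-07-20

"Apache Mahout: Beyond MapReduce" is a book by Dmitriy Lyubimov and Andrew Palumbo on using the Apache Mahout "Samsara" platform to design distributed mathematical and machine learning algorithms. It is aimed at machine learning practitioners, algorithm designers, applied researchers, and experimenters interested in math-driven algorithms. The book covers Apache Mahout 0.10 and 0.11 and discusses programming practices and conceptual approaches for solving machine learning problems on large datasets.

The chapters are organized as follows:

Part I: First Steps with Mahout
- Chapter 1, Meet Mahout: introduces Mahout, its core concepts, and its uses.
- Chapter 2, Setting Things Up: describes the steps and tools needed to set up and start using Mahout.

Part II: Coding with Mahout
- Chapter 3, In-core Algebra: discusses in-memory mathematical operations, with code examples that build an understanding of Mahout's foundations.
- Chapter 4, Distributed Algebra: goes further and shows how to carry out these computations in a distributed environment.

Part III: Approximating Distributed Problems
- Chapter 5, Stochastic SVD (singular value decomposition): explains how to use randomized methods to solve SVD problems on large datasets.
- Chapter 6, Stochastic PCA (principal component analysis): similarly, explores distributed strategies for PCA.
- Chapter 7, Data Sketching with Bahmani Sketch: introduces a fast, approximate statistical method for big data.

Part IV: Samsara Tutorials
- Chapter 8, Naive Bayes Example: uses a practical machine learning task to show how to implement a naive Bayes classifier with Mahout Samsara.

The appendices provide a guide to the conventions used in the book, along with reference material for in-core algebra and distributed algebra.

The book stresses practice as well as theory: by explaining the mathematical principles and giving code examples, it enables readers to design and implement distributed machine learning algorithms and also to use the ready-made algorithms in Mahout "Samsara". It is a valuable resource for anyone who wants to understand and master machine learning techniques on large datasets.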

The Mahout team is building the environment dialect in the image of R. The new Mahout is a Scala-based beast, and all algebraic expressions are now in Scala with an R-like Scala DSL layered on top.

Initially, Samsara also had a DSL (enabled via a separate import) for a MATLAB-like dialect, but unfortunately Scala operator support posed issues for implementing the entire MATLAB operator set verbatim. As a result, this work received much less attention. Instead, we focused on the R side of things.

The goal is for the Mahout DSL to be easily readable by R programmers. For example, A %*% B is matrix multiplication, A * B is the element-wise (Hadamard) product, and methods such as colMeans and colSums follow R naming.
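The operator conventions named here can be made concrete with a small, self-contained Scala sketch. It does not use Mahout at all: `Mat` and `RLikeSketch` are hypothetical names, written only to show how Scala method names such as `%*%` allow an R-like notation to be layered over ordinary code. Only the operator and method names (`%*%`, `*`, `colSums`, `colMeans`) come from the text; everything else is an assumption made for illustration.

```scala
// Toy dense matrix. NOT Mahout's implementation: a hypothetical class that
// only mirrors the R-like operator names mentioned in the text.
final case class Mat(rows: Vector[Vector[Double]]) {
  require(rows.nonEmpty && rows.forall(_.length == rows.head.length),
    "rows must be non-empty and rectangular")

  val nrow: Int = rows.length
  val ncol: Int = rows.head.length

  // Matrix multiplication, spelled as in R.
  def %*%(that: Mat): Mat = {
    require(ncol == that.nrow, "inner dimensions must agree")
    Mat(Vector.tabulate(nrow, that.ncol) { (i, j) =>
      (0 until ncol).map(k => rows(i)(k) * that.rows(k)(j)).sum
    })
  }

  // Element-wise (Hadamard) product.
  def *(that: Mat): Mat = {
    require(nrow == that.nrow && ncol == that.ncol, "shapes must match")
    Mat(Vector.tabulate(nrow, ncol)((i, j) => rows(i)(j) * that.rows(i)(j)))
  }

  // Column-wise reductions named after their R counterparts.
  def colSums: Vector[Double] = rows.transpose.map(_.sum)
  def colMeans: Vector[Double] = colSums.map(_ / nrow)
}

object RLikeSketch {
  val a: Mat = Mat(Vector(Vector(1.0, 2.0), Vector(3.0, 4.0)))
  val b: Mat = Mat(Vector(Vector(5.0, 6.0), Vector(7.0, 8.0)))

  def main(args: Array[String]): Unit = {
    println(a %*% b)    // matrix product: rows (19, 22) and (43, 50)
    println(a * b)      // Hadamard product: rows (5, 12) and (21, 32)
    println(a.colSums)  // column sums: 4 and 6
    println(a.colMeans) // column means: 2 and 3
  }
}
```

One detail worth keeping in mind when reading such expressions: Scala derives an operator's precedence from its first character, so `%*%` and `*` bind equally tightly and associate left to right, whereas R gives `%*%` higher precedence than `*`. Parenthesizing mixed expressions avoids surprises.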

Among other things, arguably, math written in an R-like fashion is easier to understand and maintain than the same things written in other basic procedural or functional environments.

Mahout Samsara is backend-agnostic.

Indeed, Mahout is not positioning itself as Spark-specific. You can think of it that way if you use Spark, but if you use H2O, you could think of it as H2O-specific (or, hopefully, "Apache Flink-specific" in the future) just as easily.

Neither of the above examples contains a single Spark (or H2O) imported dependency. They are written once but run on any of the supported backends.
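The "written once, run on any backend" idea can be illustrated with a toy sketch. The names `Backend`, `SequentialBackend`, `ChunkedBackend`, and `ColSums` are hypothetical and are not part of Mahout; how Mahout actually dispatches work to Spark or H2O is not described in this excerpt. The sketch only shows an algorithm coded against an abstract interface, so that the algorithm itself needs no engine-specific import.

```scala
// Hypothetical names throughout; this is not Mahout's API.
trait Backend {
  def name: String
  // Apply f to every row, then merge the partial results with combine.
  // combine is assumed to be associative; rows must be non-empty.
  def mapReduce[A](rows: Seq[Vector[Double]])(f: Vector[Double] => A)(combine: (A, A) => A): A
}

// Processes all rows in a single pass.
object SequentialBackend extends Backend {
  val name = "sequential"
  def mapReduce[A](rows: Seq[Vector[Double]])(f: Vector[Double] => A)(combine: (A, A) => A): A =
    rows.map(f).reduce(combine)
}

// Splits rows into chunks to mimic partitions, reduces each chunk,
// then merges the per-chunk results.
final class ChunkedBackend(chunkSize: Int) extends Backend {
  require(chunkSize > 0, "chunkSize must be positive")
  val name = s"chunked($chunkSize)"
  def mapReduce[A](rows: Seq[Vector[Double]])(f: Vector[Double] => A)(combine: (A, A) => A): A =
    rows.grouped(chunkSize).map(chunk => chunk.map(f).reduce(combine)).reduce(combine)
}

object ColSums {
  // The algorithm is written once, against Backend only.
  def colSums(rows: Seq[Vector[Double]], backend: Backend): Vector[Double] =
    backend.mapReduce[Vector[Double]](rows)(row => row) { (x, y) =>
      x.zip(y).map { case (p, q) => p + q }
    }
}

object BackendDemo {
  def main(args: Array[String]): Unit = {
    val rows = Seq(Vector(1.0, 2.0), Vector(3.0, 4.0), Vector(5.0, 6.0))
    for (backend <- Seq[Backend](SequentialBackend, new ChunkedBackend(2)))
      println(s"${backend.name}: ${ColSums.colSums(rows, backend)}")
  }
}
```

Both toy backends produce the same column sums because the combining step is associative, which is the property that lets the same logic run unchanged whether the rows are processed in one pass or in partitions.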

Not every algorithm can be written with this set of backend-independent techniques, of course; there is more on that below. But quite a few can, and the majority can leverage at least some of these techniques as the backbone. For example, imagine that the dataset above is the result of an embarrassingly parallel statistical Monte Carlo technique (which is also backend-independent), and just like that, perhaps, we get a backend-agnostic Gibbs sampler.

Mahout is an add-on to backend functionality.

Mahout is not taking away any capabilities of the backend. Instead, one can think of it as an "add-on" over, e.g., Spark and all its technologies. The same is true for H2O.

In truth, algebra and statistics alone are not enough to make ends meet. Access to the Spark RDD API, streaming, functional programming, external libraries, and many other wonderful things is desirable. In the case of Apache Spark one can embed algebraic pipelines by importing Spark-specific capabilities. Import MLlib or GraphX and all the goodies are available. Import DataFrame (or SchemaRDD) and use the language-integrated QL, and so on.

But if we want to draw any parallels, MLlib is “off-the-shelf code.” Mahout 0.10+ is about that, too; but we hope that it is more about “off-the-shelf math” rather than code. In other words, Mahout 0.10+ is for people who like to experiment and research at scale using known mathematical constructs, exercise more control over an algorithm, pay much less attention to the specifics of distributed engines, and potentially share the outcomes across different operational backends.
