Large-Scale Data Mining: MapReduce in Practice and Network Analysis


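"Great book on MapReduce with a lot of examples for Mining datasets"

This book is a strong text on MapReduce and is particularly well suited to data-mining practice. Its many examples aim to help readers understand and apply MapReduce to large-scale datasets. MapReduce is a distributed computing model introduced by Google. It is widely used for big-data processing and is especially important for processing and analyzing Internet data.

The content derives from several Stanford University courses. It was originally designed by Anand Rajaraman and Jeff Ullman for a graduate course named "Web Mining" (CS345A). After Jure Leskovec joined, the material was reorganized and expanded: a network-analysis course (CS224W) was added, and the "Web Mining" course was renumbered CS246. The three authors also introduced a large-scale data-mining project course (CS341). The materials from these courses form the basis of the book.

The book covers the basic concepts of data mining, with an emphasis on handling very large amounts of data. Because of the data volumes involved, many examples concern the Web or data derived from the Web: Internet data is typically both complex and massive, which makes it a natural setting for MapReduce. The core idea of MapReduce is to break a large job into small tasks (the Map phase), process them in parallel on many machines, and then combine the results (the Reduce phase).

Specifically, the book may cover the following topics:

1. The MapReduce model: how MapReduce works, including the responsibilities of the Mapper and the Reducer and the partitioning and sorting of intermediate data.
2. Data preprocessing: how to clean, transform, and format large-scale data before running MapReduce jobs.
3. Distributed file systems: for example the Hadoop Distributed File System (HDFS), which underlies MapReduce and stores and manages the data.
4. Data-mining algorithms: the book may cover algorithms suited to big data, such as clustering, classification, and association-rule learning, and show how to implement them in the MapReduce framework.
5. Practical case studies: real Web-mining cases, such as link analysis of Web pages, search-engine index construction, and user-behavior analysis, used to demonstrate MapReduce.
6. Performance tuning: how to adjust the parameters of a MapReduce job to improve throughput and resource utilization.
7. Fault recovery and fault tolerance: how MapReduce handles node failures to keep the system available and the data intact.

For readers who want a deeper understanding of MapReduce and big-data processing, the book is a valuable resource: through examples and realistic scenarios it helps readers master the key techniques for processing large-scale datasets.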
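The Map and Reduce phases described above can be illustrated with a minimal single-process sketch. It is not how Hadoop or any real framework is implemented, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, chosen only for illustration. The sketch counts words: the map step emits (word, 1) pairs, the shuffle step groups pairs by key, and the reduce step sums each group.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share a key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    intermediate = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(intermediate)
    # Sorting the keys mimics the sorted order in which reducers see them.
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


counts = word_count(["the web is large", "the data is large"])
print(counts)
```

In a real cluster the map calls would run on different machines and the shuffle would move data over the network; here everything runs in one process so the data flow is easy to follow.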

CONTENTS

10.6.5 Using Fewer Reduce Tasks
10.6.6 Exercises for Section 10.6
10.7 Neighborhood Properties of Graphs
10.7.1 Directed Graphs and Neighborhoods
10.7.2 The Diameter of a Graph
10.7.3 Transitive Closure and Reachability
10.7.4 Transitive Closure Via Map-Reduce
10.7.5 Smart Transitive Closure
10.7.6 Transitive Closure by Graph Reduction
10.7.7 Approximating the Sizes of Neighborhoods
10.7.8 Exercises for Section 10.7
10.8 Summary of Chapter 10
10.9 References for Chapter 10

11 Dimensionality Reduction
11.1 Eigenvalues and Eigenvectors
11.1.1 Definitions
11.1.2 Computing Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 The Matrix of Eigenvectors
11.1.5 Exercises for Section 11.1
11.2 Principal-Component Analysis
11.2.1 An Illustrative Example
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 The Matrix of Distances
11.2.4 Exercises for Section 11.2
11.3 Singular-Value Decomposition
11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Querying Using Concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Exercises for Section 11.3
11.4 CUR Decomposition
11.4.1 Definition of CUR
11.4.2 Choosing Rows and Columns Properly
11.4.3 Constructing the Middle Matrix
11.4.4 The Complete CUR Decomposition
11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
11.6 References for Chapter 11

CHAPTER 1. DATA MINING

[...] and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the "Netflix challenge" to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people's resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
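As a small illustration of treating the model as the answer to a query, the sketch below computes the average and standard deviation of a list of numbers. The sample values and the function name `summarize` are hypothetical. The sketch divides by n (the population form); dividing by n - 1 instead gives the sample standard deviation, and the two differ noticeably only for small data sets.

```python
import math


def summarize(numbers):
    """Return (mean, standard deviation) of a non-empty list of numbers.

    The standard deviation here divides by n (population form).
    """
    n = len(numbers)
    mean = sum(numbers) / n
    variance = sum((x - mean) ** 2 for x in numbers) / n
    return mean, math.sqrt(variance)


# Hypothetical data standing in for "the set of numbers" in the text.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = summarize(data)
print(mean, std)
```

For this data the squared deviations sum to 32, so the variance is 4 and the standard deviation is 2.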

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote on WhizBang! Labs: This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of Cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of "baskets" of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these "frequent itemsets" are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an online store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for "similar" customers and recommend something many of these customers have bought. This process is called "collaborative filtering." If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
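The two kinds of features in the list above can be sketched on a toy data set. This is a naive illustration, not the algorithms the book develops for massive data: it counts every pair of items in memory, which would not scale. The basket contents, the function names, and the `min_support` threshold are hypothetical. For "fraction of elements in common" the sketch assumes Jaccard similarity (intersection size divided by union size), which is one common choice.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort the distinct items so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Fraction of elements two sets have in common: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # Convention chosen here for two empty sets.
    return len(a & b) / len(a | b)


# Hypothetical market baskets.
baskets = [
    ["hamburger", "ketchup", "buns"],
    ["hamburger", "ketchup"],
    ["milk", "buns"],
    ["hamburger", "ketchup", "milk"],
]

pairs = frequent_pairs(baskets, min_support=3)
similarity = jaccard(baskets[0], baskets[1])
print(pairs, similarity)
```

Here only the pair (hamburger, ketchup) reaches the support threshold of 3, and the first two baskets share 2 of their 3 distinct items.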

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including "Bonferroni's Principle," a warning against overzealous use of data mining.
