Big Data Mining: Distributed Systems and MapReduce

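"Mining of Massive Datasets - Anand Rajaraman & Jeffrey D. Ullman"

Mining of Massive Datasets, written jointly by Anand Rajaraman and Jeffrey D. Ullman, covers the techniques and applications of large-scale data mining. The authors based it on their Stanford University course "Web Mining" and position it as a text for an advanced graduate course that is also suitable for interested undergraduates. The focus is on data so large that it cannot be loaded into main memory at once, so many of the examples involve the Web or data derived from it. Overall, the book is about data mining, and in particular about mining very large datasets. It takes an algorithm-centered view: data mining means applying algorithms to data, rather than using data to train a machine-learning engine of some kind. The book covers the following main topics:

1. Distributed file systems: the distributed file systems used to handle large-scale data, such as systems similar to Google's GFS (Google File System), and how to use them to implement parallel algorithms for very large datasets.

2. The MapReduce framework: a programming model for processing and generating large datasets in a distributed computing environment. The Map phase transforms each input record into key-value pairs, and the Reduce phase aggregates the values that share a key.

3. Similarity search: key techniques for finding similar items in massive data, including cosine similarity and Jaccard similarity, and efficient approximate nearest-neighbor search such as Locality Sensitive Hashing (LSH).

4. Mining data streams: how to discover patterns in real time or near real time in data that arrives continuously, which matters for data that changes dynamically.

5. Web advertising: auction mechanisms for online ads, click-through-rate prediction, and ad-targeting strategies, all examples of big data applied in commercial settings.

6. Recommendation systems: collaborative filtering, content-based recommendation, and hybrid recommendation algorithms, which are widely used in e-commerce and media recommendation.

7. Social-network analysis: feature extraction for social networks, community detection, and models of influence propagation, which help in understanding user behavior and network structure.

Through these topics, readers learn how to design and carry out data-mining strategies for big data and become familiar with the core tools and techniques for processing and analyzing large-scale data. The book is a useful reference for practitioners in big data analytics, cloud computing, data science, and machine learning.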

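The Map and Reduce phases named in the summary above can be illustrated with a word count. This is a minimal single-process sketch, not how a distributed framework is implemented: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) and the two sample documents are assumptions made for illustration, and the grouping step that a real system performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    """Run mapper over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            grouped[key].append(value)  # simulated shuffle: group values by key
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input documents, chosen only to exercise the sketch.
docs = ["big data mining", "mining massive data"]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
print(word_counts)
```

Because the mapper looks at one document at a time and the reducer at one key at a time, both steps could in principle run in parallel on separate machines, which is the property the programming model relies on.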

2 CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
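
Treating the model as the answer to a query can be sketched as a direct computation of the average and standard deviation. The numbers of Example 1.1 are not reproduced in this excerpt, so the `sample` list below is a made-up stand-in, and dividing by n (rather than n - 1) is an assumption of this sketch rather than a choice stated in the text.

```python
import math


def mean_and_std(values):
    """Return the average and the standard deviation (dividing by n) of values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, math.sqrt(variance)


# Hypothetical numbers standing in for the data of Example 1.1.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = mean_and_std(sample)
print(mu, sigma)
```

The two returned numbers are the whole "model" in this view: a short summary computed by one pass of arithmetic over the data, with no training step.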

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote on WhizBang! Labs: This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered.
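
The reasoning in this example, that cases group around particular wells, can be sketched by assigning each case to its nearest well and counting. The coordinates and well names below are entirely hypothetical and are not historical data; they only show the shape of the computation.

```python
import math
from collections import Counter

# Hypothetical well locations and case locations on a flat map (made-up values).
wells = {"well_A": (0.0, 0.0), "well_B": (10.0, 0.0)}
cases = [(1.0, 1.0), (0.5, -0.5), (-1.0, 0.0), (9.0, 1.0)]


def nearest_well(point, wells):
    """Return the name of the well with the smallest Euclidean distance to point."""
    return min(wells, key=lambda name: math.dist(point, wells[name]))


# Count how many cases fall closest to each well.
cases_per_well = Counter(nearest_well(case, wells) for case in cases)
print(cases_per_well)
```

A well that attracts far more cases than the others would stand out in such a count, which is the kind of summary that clustering provides.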

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an online store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
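
The two kinds of features in the list above can be sketched on toy data. This is a brute-force illustration under stated assumptions: the baskets, customer names, the `min_support` threshold, and the helper names are all hypothetical; counting every pair is used here in place of the scalable algorithms the book develops later; and Jaccard similarity (intersection size over union size) is one assumed way to measure the "fraction of their elements in common."

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(customer, purchases, k):
    """Return the k other customers whose purchase sets are most similar."""
    others = [c for c in purchases if c != customer]
    others.sort(key=lambda c: jaccard(purchases[customer], purchases[c]),
                reverse=True)
    return others[:k]


# Hypothetical market baskets.
baskets = [
    {"hamburger", "ketchup", "buns"},
    {"hamburger", "ketchup"},
    {"ketchup", "mustard"},
    {"hamburger", "buns"},
]
pairs = frequent_pairs(baskets, min_support=2)

# Hypothetical customers, each represented by the set of items bought.
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen"},
    "carol": {"lamp", "sofa"},
    "dave": {"tent"},
}
neighbors = most_similar("alice", purchases, k=2)
print(pairs, neighbors)
```

Keeping only a few nearest neighbors per customer, rather than forcing every customer into one cluster, mirrors the point made in the second item: customers with varied interests are better described by their connections to similar customers.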

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni’s Principle,” a warning against overzealous use of data mining.
