Mining Massive Datasets: Hadoop, LSH, and Network Analysis

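"Mining of Massive Datasets" is a book by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman of Stanford University on techniques and methods for data mining at big-data scale. It grew out of the authors' years of teaching and draws on the courses CS345A (Web Mining), CS224W (network analysis), and CS246.

The book focuses on datasets that are too large to fit in main memory all at once, so it pays particular attention to distributed computing frameworks such as Hadoop. Hadoop is an Apache open-source project that aims to provide reliable, scalable data processing; its MapReduce programming model makes processing at this scale practical. The book also covers Locality Sensitive Hashing (LSH) in depth, a technique for approximate nearest-neighbor search that is well suited to efficient retrieval over large datasets. Mining stream data and graph data are further key topics: stream data arrives continuously and must be processed as it arrives, while graph data involves analyzing relationships between nodes and edges, as in social network analysis.

In distinguishing data mining from machine learning, the authors describe data mining as leaning toward discovering patterns and structure in data, while machine learning focuses on building predictive models. They warn readers to avoid statistical pitfalls when mining data, such as overfitting, mishandling the bias-variance trade-off, and misusing statistical assumptions.

The case studies center on Internet and Web data, which comes from many sources, is enormous in volume, and is a typical example of big data. Through them, readers learn how to extract useful information from massive collections of Web pages, for example through link analysis and user-behavior modeling. The book covers the basic theory, practical tools, and techniques of large-scale data mining. It is a valuable textbook and reference for graduate students and advanced undergraduates who want a deeper understanding of large-scale data processing and analysis, and it also suits IT professionals interested in big data who want to strengthen their practical data science skills.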

CONTENTS

10.7.7 Approximating the Sizes of Neighborhoods

10.7.8 Exercises for Section 10.7

10.8 Summary of Chapter 10

10.9 References for Chapter 10

2 CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike it. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
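
As a minimal sketch of this computational view, the Python snippet below treats the "model" as the answer to a query: the mean and standard deviation of a list of numbers. The function name `summarize` and the synthetic data are hypothetical, since the actual numbers of Example 1.1 do not appear in this excerpt. The snippet also compares the sample standard deviation (divisor n - 1) with the population form (divisor n), which is the maximum-likelihood estimate for a Gaussian. The gap between the two shrinks as n grows, which is one possible reading of the remark that the computed values may differ slightly from the best-fit parameters.

```python
import random
import statistics


def summarize(values):
    """Return (mean, sample_std, population_std) for a sequence of at least two numbers."""
    mean = statistics.fmean(values)
    sample_std = statistics.stdev(values)       # divides by n - 1
    population_std = statistics.pstdev(values)  # divides by n (Gaussian maximum-likelihood form)
    return mean, sample_std, population_std


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs use the same synthetic data
    for n in (10, 1_000, 100_000):
        # Hypothetical data: n draws from a Gaussian with mean 5 and standard deviation 2.
        data = [rng.gauss(5.0, 2.0) for _ in range(n)]
        mean, sample_std, population_std = summarize(data)
        gap = sample_std - population_std
        print(f"n={n}: mean={mean:.3f} sample_std={sample_std:.3f} "
              f"population_std={population_std:.3f} gap={gap:.6f}")
```

Running the script prints the three summaries for increasing n; the `gap` column should approach zero as the dataset grows.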

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote on WhizBang! Labs: This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an on-line store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
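
The two kinds of feature extraction listed above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the algorithms the book develops in Chapters 3 and 6: the function names, baskets, customers, and support threshold are all hypothetical, and brute-force pair counting of this kind would not scale to data that does not fit in memory. Similarity is measured here with Jaccard similarity (size of the intersection divided by size of the union), which is one common way to formalize "a relatively large fraction of their elements in common."

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count item pairs that share a basket; keep pairs found in >= min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each unordered pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(customers, target, k=1):
    """Return the k other customers whose purchase sets are most similar to target's."""
    scores = [
        (jaccard(customers[target], items), name)
        for name, items in customers.items()
        if name != target
    ]
    return [name for _, name in sorted(scores, reverse=True)[:k]]


if __name__ == "__main__":
    # Hypothetical market baskets.
    baskets = [
        {"hamburger", "ketchup", "buns"},
        {"hamburger", "ketchup"},
        {"milk", "bread"},
        {"hamburger", "ketchup", "milk"},
    ]
    print(frequent_pairs(baskets, min_support=3))

    # Hypothetical customers, each represented as the set of items bought.
    customers = {
        "alice": {"book", "lamp", "kettle"},
        "bob": {"book", "lamp", "phone"},
        "carol": {"shoes"},
    }
    print(most_similar(customers, "alice"))
```

Representing each customer by a short list of most-similar customers, rather than by a single cluster label, mirrors the point made in item 2: customers with varied interests are described better by their nearest neighbors than by one cluster.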

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni’s Principle,” a warning against overzealous use of data mining.
