Big Data Mining and Distributed Processing in Practice: A Detailed Look at Massive Internet Data

PDF, 2.4 MB · updated 2024-07-26

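"Big Data: Large-Scale Internet Data Mining and Distributed Processing" (《大數據：互聯網大規模數據挖掘與分佈式處理》) is an English-language original work by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, with copyright dates of 2010 to 2012.

The book grew out of courses developed over several years at Stanford University. It began as material for CS345A (Web Mining), an advanced graduate course, but as the content grew it also attracted advanced undergraduates. After Jure Leskovec joined the Stanford faculty, he further organized and expanded the course material.

The central topic is large-scale data mining, with particular attention to data too large to fit in main memory at once. Most of the examples concern the Internet and data derived from it, since the Internet is a natural source of big data. The authors combine theory with practice, and the book covers material from three courses: CS224W (network analysis), CS345A/CS246 (further data mining), and CS341, a large-scale data-mining project course. These courses teach students how to process and analyze large datasets, including but not limited to data collection, preprocessing, pattern recognition, and predictive modeling.

Beyond basic data-mining concepts and techniques, the book covers distributed processing, because handling massive data usually requires a distributed system that computes in parallel. It may draw on distributed computing frameworks such as Hadoop and Spark, and on the key role of the MapReduce model in big-data processing. It may also touch on the ethical and social implications of data mining, stressing privacy protection and data security when large amounts of user data are handled.

In summary, the book is an in-depth guide to large-scale data mining and distributed processing in the Internet era. It suits graduate students and advanced undergraduates who want to study the field in depth, and it offers practical techniques and theoretical grounding to data scientists, engineers, and researchers. It walks readers through the whole process from data collection to analysis.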

CONTENTS

10.6.5 Using Fewer Reduce Tasks
10.6.6 Exercises for Section 10.6
10.7 Neighborhood Properties of Graphs
10.7.1 Directed Graphs and Neighborhoods
10.7.2 The Diameter of a Graph
10.7.3 Transitive Closure and Reachability
10.7.4 Transitive Closure Via Map-Reduce
10.7.5 Smart Transitive Closure
10.7.6 Transitive Closure by Graph Reduction
10.7.7 Approximating the Sizes of Neighborhoods
10.7.8 Exercises for Section 10.7
10.8 Summary of Chapter 10
10.9 References for Chapter 10

CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.
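As a rough illustration of what a directly designed resume detector might look like, the sketch below flags a page when it contains enough of the section words that resumes commonly use. The cue list, the threshold, and the function name are assumptions made for illustration; the text does not describe the algorithms that were actually used.

```python
import re

# Hypothetical cue words; the source does not name any specific words.
RESUME_CUES = ("education", "experience", "skills", "references", "objective")


def looks_like_resume(page_text, min_cues=3):
    """Return True if at least `min_cues` distinct cue words occur in the page."""
    # Lowercase the page and split it into alphabetic words.
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    hits = sum(1 for cue in RESUME_CUES if cue in words)
    return hits >= min_cues


resume_page = (
    "Jane Doe\nObjective: data engineer\nEducation: BS\n"
    "Experience: 5 years\nSkills: SQL"
)
news_page = "The city council met to discuss education funding."

print(looks_like_resume(resume_page))
print(looks_like_resume(news_page))
```

A rule like this needs no training data, which is the point of the passage: when people already know what the target looks like, writing the rule down can be as effective as learning it.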

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
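The following sketch shows one way the computed values can differ slightly from the best-fitting Gaussian parameters. It assumes the "standard deviation" is the usual sample estimate with an n - 1 divisor, while the maximum-likelihood Gaussian fit divides by n. That reading is an assumption, since the text does not say where the discrepancy comes from, and the synthetic data is made up for the demonstration.

```python
import math
import random


def summarize(values):
    """Return the mean and the sample standard deviation (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    sample_var = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, math.sqrt(sample_var)


def gaussian_mle(values):
    """Return the maximum-likelihood Gaussian parameters (n divisor)."""
    n = len(values)
    mean = sum(values) / n
    mle_var = sum((x - mean) ** 2 for x in values) / n
    return mean, math.sqrt(mle_var)


# Synthetic data drawn from a Gaussian; the parameters are arbitrary.
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(100_000)]

mean, sample_sd = summarize(data)
_, mle_sd = gaussian_mle(data)
print(f"mean={mean:.4f} sample_sd={sample_sd:.6f} mle_sd={mle_sd:.6f}")
```

The two standard deviations differ by a factor of sqrt(n / (n - 1)), which approaches 1 as n grows, consistent with the remark that the values are very close for large data.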

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote on WhizBang! Labs: This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an online store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
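Both kinds of feature extraction in the list above can be sketched in a few lines for data small enough to fit in memory. The code below counts pairs of items that co-occur in at least `min_support` baskets, and ranks other customers by the fraction of items they share with a given customer. Using Jaccard similarity (intersection size over union size) as that "fraction in common" is an assumption here, and the baskets, customer names, and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical order, e.g. ("bun", "ketchup").
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Fraction of elements two sets share: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(purchases, customer, k=2):
    """Return up to k other customers with the highest nonzero similarity."""
    target = purchases[customer]
    scored = [
        (jaccard(target, items), other)
        for other, items in purchases.items()
        if other != customer
    ]
    # Highest similarity first; break ties by name for a stable order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [other for score, other in scored[:k] if score > 0]


baskets = [
    {"hamburger", "ketchup", "bun"},
    {"hamburger", "ketchup"},
    {"ketchup", "mustard"},
    {"hamburger", "bun"},
]
purchases = {
    "ann": {"book", "lamp", "tea"},
    "bob": {"book", "tea"},
    "cat": {"lamp", "rug"},
    "dan": {"phone"},
}

print(frequent_pairs(baskets, min_support=2))
print(most_similar(purchases, "ann"))
```

This brute-force version compares every pair, which is only practical for small inputs; the point of the chapters the list refers to is how to get the same answers when the data is far too large for that.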

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni’s Principle,” a warning against overzealous use of data mining.

