Essentials of Mining Massive Datasets


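Mining of Massive Datasets was written jointly by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, all associated with Stanford University. Its copyright dates from 2010 and it has been revised several times. The content draws on teaching material from several courses the authors have offered at Stanford.

The book grew out of a graduate course called "Web Mining" (CS345A), designed by Anand Rajaraman and Jeff Ullman. After Jure Leskovec joined, the material was substantially reorganized: coverage of network analysis was added, and CS345A was expanded and renumbered CS246. The authors also introduced a large-scale data-mining project course, CS341, to strengthen the practical side of the teaching. The material therefore draws on three strands: Web mining, network analysis, and large-scale data-mining projects.

The core theme of the book is data mining, specifically the mining of very large datasets. Because the focus is on scale, many of the examples concern the Web or data derived from the Web. The central question is how to process and mine data that is too large to fit in main memory, and which techniques and algorithms can find patterns, associations, and trends under that constraint.

Topics the book may touch on include, but are not limited to:

1. Data preprocessing: methods for cleaning, transforming, and integrating large-scale data.
2. Data storage and management: distributed storage and processing frameworks such as Hadoop and Spark, and the use of NoSQL databases for big data.
3. The MapReduce programming model: understanding and implementing MapReduce as a parallel computing model for large-scale data processing.
4. Sampling and approximation algorithms: analyzing data efficiently through sampling and approximation when the data volume is too large.
5. Data visualization: presenting results from large-scale data visually so that they are easier to understand and interpret.
6. Graph theory and network analysis: understanding network structures such as social networks and Web link graphs, and computing network properties such as degree centrality and clustering coefficient.
7. Distributed algorithms: for example PageRank, a method for assessing the importance of Web pages that can be computed in a distributed fashion.
8. Social-network analysis: identifying user behavior patterns, community detection, and models of influence propagation.
9. Prediction and classification: applying machine-learning algorithms such as decision trees, random forests, and support-vector machines to big data.
10. Topic modeling: for example Latent Dirichlet Allocation (LDA), used to discover hidden topics in text data.
11. Recommendation systems: building collaborative-filtering, content-based, and hybrid recommenders.
12. Real-time stream processing: handling continuously generated data, for example with Apache Storm and Flink.
13. Security and privacy: protecting user privacy and data security in a big-data setting.

Mining of Massive Datasets is a comprehensive textbook on effective data mining over very large datasets. It covers theory and also emphasizes practical application, which makes it a valuable resource for readers who want to understand and practice large-scale data analysis in depth.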

CONTENTS

10.6.5 Using Fewer Reduce Tasks
10.6.6 Exercises for Section 10.6
10.7 Neighborhood Properties of Graphs
10.7.1 Directed Graphs and Neighborhoods
10.7.2 The Diameter of a Graph
10.7.3 Transitive Closure and Reachability
10.7.4 Transitive Closure Via Map-Reduce
10.7.5 Smart Transitive Closure
10.7.6 Transitive Closure by Graph Reduction
10.7.7 Approximating the Sizes of Neighborhoods
10.7.8 Exercises for Section 10.7
10.8 Summary of Chapter 10
10.9 References for Chapter 10

11 Dimensionality Reduction
11.1 Eigenvalues and Eigenvectors
11.1.1 Definitions
11.1.2 Computing Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 The Matrix of Eigenvectors
11.1.5 Exercises for Section 11.1
11.2 Principal-Component Analysis
11.2.1 An Illustrative Example
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 The Matrix of Distances
11.2.4 Exercises for Section 11.2
11.3 Singular-Value Decomposition
11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Querying Using Concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Exercises for Section 11.3
11.4 CUR Decomposition
11.4.1 Definition of CUR
11.4.2 Choosing Rows and Columns Properly
11.4.3 Constructing the Middle Matrix
11.4.4 The Complete CUR Decomposition
11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
11.6 References for Chapter 11

CHAPTER 1. DATA MINING

... and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the "Netflix challenge" to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people's resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
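The remark about the average and standard deviation can be illustrated with a small sketch. It rests on an assumption that the text does not spell out: that the gap in question is the one between the maximum-likelihood standard deviation of a Gaussian, which divides by n, and the usual sample standard deviation, which divides by n - 1. The function name `summarize` and the data values are hypothetical.

```python
import math


def summarize(numbers):
    """Return (mean, std dividing by n, std dividing by n - 1) for a non-empty list."""
    n = len(numbers)
    mean = sum(numbers) / n
    squared_dev = sum((x - mean) ** 2 for x in numbers)
    # Maximum-likelihood estimate for a Gaussian: divide by n.
    mle_std = math.sqrt(squared_dev / n)
    # Common "sample" standard deviation: divide by n - 1 (undefined for n = 1).
    sample_std = math.sqrt(squared_dev / (n - 1)) if n > 1 else 0.0
    return mean, mle_std, sample_std


# Hypothetical data standing in for "the set of numbers" in the example.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, mle_std, sample_std = summarize(data)
print(mean, mle_std, sample_std)

# The two estimates differ by a factor of sqrt(n / (n - 1)),
# which approaches 1 as the number of data points grows.
for n in (10, 1_000, 1_000_000):
    print(n, math.sqrt(n / (n - 1)))
```

Under this reading, the two standard deviations are nearly indistinguishable once the dataset is large, which matches the claim that the computed values will be very close to the best-fitting parameters.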

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote (WhizBang! Labs): This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of "baskets" of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these "frequent itemsets" are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an online store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for "similar" customers and recommend something many of these customers have bought. This process is called "collaborative filtering." If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
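The two kinds of feature extraction listed above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the algorithms the book develops in Chapters 3 and 6: pairs are counted by brute force, Jaccard similarity (intersection size divided by union size) is assumed as the measure of the "fraction of elements in common", and every name and data value below is hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, support):
    """Return {pair: count} for item pairs found in at least `support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order; set() drops repeated items.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= support}


def jaccard(a, b):
    """Fraction of elements two sets share: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(customer, purchases, k):
    """Return the k other customers most similar to `customer`, best first."""
    target = purchases[customer]
    scored = [
        (other, jaccard(target, items))
        for other, items in purchases.items()
        if other != customer
    ]
    # Highest similarity first; ties broken by name for a stable result.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical market baskets.
baskets = [
    {"hamburger", "ketchup", "buns"},
    {"hamburger", "ketchup"},
    {"ketchup", "fries"},
    {"hamburger", "buns", "ketchup"},
]
print(frequent_pairs(baskets, support=3))

# Hypothetical customers, each represented as the set of items bought.
purchases = {
    "ann": {"book", "lamp", "tent"},
    "bob": {"book", "lamp", "kettle"},
    "cat": {"tent", "stove"},
    "dan": {"phone"},
}
print(most_similar("ann", purchases, k=2))
```

Counting every pair in every basket is only workable for small inputs; the point of the sketch is the shape of the two models (sets of items that co-occur, and a short list of nearest neighbors per customer), not scalability.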

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including "Bonferroni's Principle," a warning against overzealous use of data mining.

