Mining Massive Datasets: A Stanford Textbook

Updated 2024-07-24 · PDF, 2.58 MB

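"Mining of Massive Datasets" is a textbook by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman of Stanford University, focused on mining very large datasets. It grew out of courses they taught at Stanford: CS345A (Web Mining), the later CS224W (network analysis), and CS341, a large-scale data-mining project course. After Jure Leskovec joined, the course content was substantially revised and extended, so that the courses appeal to advanced graduate students and even to strong undergraduates.

The core subject of the book is data mining, specifically the mining of very large datasets. Because it concentrates on data too large to fit in main memory, many of its examples involve the Web or data derived from the Web. Data mining extracts useful knowledge from large volumes of information, which in turn supports decision-making, scientific research, business operations, and user experience.

This description lists the following key topics:

1. Big data overview: the characteristics of big data, such as high volume, high velocity, and variety, and the challenges of processing it.
2. Data storage and management: storage systems suited to large-scale data, such as distributed file systems (for example, Hadoop's HDFS) and NoSQL databases.
3. Data preprocessing: data cleaning, data integration, and data transformation, the groundwork done before mining.
4. Sampling and approximation algorithms: processing the full data is often infeasible at this scale, so effective sampling and approximation algorithms are essential.
5. Data-mining techniques: association-rule learning, clustering, classification, regression, and other machine-learning methods, along with graph mining and network analysis.
6. Social-network analysis: user behavior, community structure, and information propagation in networks, used to understand network dynamics and predict user behavior.
7. Search engines and recommendation systems: page-ranking algorithms (such as PageRank) and the principles behind personalized recommendation.
8. Real-time and streaming data analysis: how to analyze and respond to continuously growing data streams in real time.
9. Security and privacy: how to protect data security and user privacy when mining big data.
10. Practical projects: the book may include real project case studies that let students or readers apply what they have learned.

The book combines theory with practice and is a valuable resource for anyone who wants to understand and master large-scale data-mining techniques. Readers learn methods and strategies for handling large datasets that they can apply to real problems in research, engineering, or business.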

CONTENTS

10.6.5 Using Fewer Reduce Tasks
10.6.6 Exercises for Section 10.6
10.7 Neighborhood Properties of Graphs
10.7.1 Directed Graphs and Neighborhoods
10.7.2 The Diameter of a Graph
10.7.3 Transitive Closure and Reachability
10.7.4 Transitive Closure Via Map-Reduce
10.7.5 Smart Transitive Closure
10.7.6 Transitive Closure by Graph Reduction
10.7.7 Approximating the Sizes of Neighborhoods
10.7.8 Exercises for Section 10.7
10.8 Summary of Chapter 10
10.9 References for Chapter 10

11 Dimensionality Reduction
11.1 Eigenvalues and Eigenvectors
11.1.1 Definitions
11.1.2 Computing Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 The Matrix of Eigenvectors
11.1.5 Exercises for Section 11.1
11.2 Principal-Component Analysis
11.2.1 An Illustrative Example
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 The Matrix of Distances
11.2.4 Exercises for Section 11.2
11.3 Singular-Value Decomposition
11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Querying Using Concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Exercises for Section 11.3
11.4 CUR Decomposition
11.4.1 Definition of CUR
11.4.2 Choosing Rows and Columns Properly
11.4.3 Constructing the Middle Matrix
11.4.4 The Complete CUR Decomposition
11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
11.6 References for Chapter 11

2 CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
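The computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function name `summarize` and the generated data are assumptions (the numbers of Example 1.1 are not reproduced in this excerpt). It also shows one way the computed values can differ slightly from a best-fit Gaussian: the maximum-likelihood standard deviation divides by n, while the usual sample standard deviation divides by n - 1, and the two converge as n grows.

```python
import random
import statistics


def summarize(values):
    """Return (mean, population std dev, sample std dev) for a list of numbers."""
    mean = statistics.fmean(values)
    pop_sd = statistics.pstdev(values, mu=mean)       # divides by n
    sample_sd = statistics.stdev(values, xbar=mean)   # divides by n - 1
    return mean, pop_sd, sample_sd


# Hypothetical data: draws from a Gaussian with mean 5 and std dev 2.
random.seed(0)
for n in (10, 1_000, 100_000):
    data = [random.gauss(5.0, 2.0) for _ in range(n)]
    mean, pop_sd, sample_sd = summarize(data)
    print(f"n={n:>6}  mean={mean:.3f}  pstdev={pop_sd:.3f}  stdev={sample_sd:.3f}")
```

On small inputs the two standard deviations differ by a factor of sqrt(n / (n - 1)); on large inputs the difference is negligible, which matches the remark that the values will be very close when the data is large.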

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Note on WhizBang! Labs: This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered.

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an online store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
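The two kinds of feature extraction listed above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the function names (`frequent_pairs`, `jaccard`, `most_similar`), the toy baskets and customers, and the choice of Jaccard similarity (intersection size divided by union size) as the measure of "fraction of elements in common". The pair counting is a brute-force pass that only suits small data; it is not one of the scalable algorithms the book develops in Chapter 6.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count each item pair per basket; keep pairs found in >= min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical (alphabetical) representation.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


def jaccard(a, b):
    """Fraction of elements two sets share: |a & b| / |a | b| (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(customers, name, k=2):
    """Return the k other customers whose purchase sets are most similar to name's."""
    target = customers[name]
    scored = [(jaccard(target, items), other)
              for other, items in customers.items() if other != name]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first, ties by name
    return [(other, score) for score, other in scored[:k]]


# Hypothetical market baskets.
baskets = [
    {"hamburger", "ketchup", "buns"},
    {"hamburger", "ketchup"},
    {"ketchup", "fries"},
    {"hamburger", "buns"},
]
print(frequent_pairs(baskets, min_support=2))

# Hypothetical customers, each represented as the set of items bought.
customers = {
    "ann": {"book", "lamp", "kettle"},
    "bob": {"book", "lamp"},
    "cat": {"kettle", "mug"},
    "dan": {"phone"},
}
print(most_similar(customers, "ann"))
```

Keeping only a few nearest neighbors per customer, rather than assigning each customer to a single cluster, mirrors the point made in item 2: customers with varied interests are better represented by their connections to similar customers.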

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni’s Principle,” a warning against overzealous use of data mining.
