Mining of Massive Datasets: A Stanford University Textbook

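Updated 2024-07-23. PDF, 2.85 MB.

"Mining of Massive Datasets", by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, is a textbook on mining very large datasets. It covers material from courses on Web mining, network analysis, and large-scale data-mining projects.

The book grew out of lecture notes for the Stanford courses CS345A (Web Mining), CS224W (network analysis), CS246, and CS341 (a large-scale data-mining project course). It is written for graduate students but is also valuable to advanced undergraduates. After Jure Leskovec joined the Stanford faculty, the course material was considerably expanded and reorganized: more content on network analysis was added, and the original course was renumbered CS246.

The core subject is data mining, specifically the mining of very large datasets. "Large" here means that the data does not fit in main memory, and because of this emphasis on size many of the book's examples involve the Web or data derived from the Web. The authors focus on how to process and analyze such data in order to reveal hidden patterns, trends, and associations that can support decisions.

Topics the book may cover include, but are not limited to:

1. Basic concepts of data mining: definitions, goals, process, and main techniques.
2. Big-data storage and processing: distributed computing frameworks such as MapReduce, and distributed file systems such as HDFS (the Hadoop Distributed File System).
3. Browsing history, link structure, and page-ranking algorithms such as Google's PageRank.
4. Search and recommendation systems on the Internet: query processing, ranking algorithms, collaborative filtering, and so on.
5. Social-network analysis: community detection, influence propagation, and user-behavior modeling.
6. Graph data structures and algorithms: graph-theory basics, shortest paths, and clustering algorithms.
7. Time-series analysis: trend analysis, seasonal models, and anomaly detection.
8. Text mining and natural-language processing: word-frequency statistics, sentiment analysis, and topic modeling.
9. Machine learning and classification: supervised learning, unsupervised learning, and applications of deep learning.
10. Managing large-scale data projects: data preprocessing, feature engineering, and experimental design.

Through these topics, readers learn how to process and analyze large-scale data in practical settings, why data mining matters in the modern Internet and in data analysis, and how to apply these tools and techniques to real-world problems. The book may also include case studies and project work that help readers put the theory into practice.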

CONTENTS

    10.6.5 Using Fewer Reduce Tasks
    10.6.6 Exercises for Section 10.6
  10.7 Neighborhood Properties of Graphs
    10.7.1 Directed Graphs and Neighborhoods
    10.7.2 The Diameter of a Graph
    10.7.3 Transitive Closure and Reachability
    10.7.4 Transitive Closure Via MapReduce
    10.7.5 Smart Transitive Closure
    10.7.6 Transitive Closure by Graph Reduction
    10.7.7 Approximating the Sizes of Neighborhoods
    10.7.8 Exercises for Section 10.7
  10.8 Summary of Chapter 10
  10.9 References for Chapter 10

11 Dimensionality Reduction
  11.1 Eigenvalues and Eigenvectors
    11.1.1 Definitions
    11.1.2 Computing Eigenvalues and Eigenvectors
    11.1.3 Finding Eigenpairs by Power Iteration
    11.1.4 The Matrix of Eigenvectors
    11.1.5 Exercises for Section 11.1
  11.2 Principal-Component Analysis
    11.2.1 An Illustrative Example
    11.2.2 Using Eigenvectors for Dimensionality Reduction
    11.2.3 The Matrix of Distances
    11.2.4 Exercises for Section 11.2
  11.3 Singular-Value Decomposition
    11.3.1 Definition of SVD
    11.3.2 Interpretation of SVD
    11.3.3 Dimensionality Reduction Using SVD
    11.3.4 Why Zeroing Low Singular Values Works
    11.3.5 Querying Using Concepts
    11.3.6 Computing the SVD of a Matrix
    11.3.7 Exercises for Section 11.3
  11.4 CUR Decomposition
    11.4.1 Definition of CUR
    11.4.2 Choosing Rows and Columns Properly
    11.4.3 Constructing the Middle Matrix
    11.4.4 The Complete CUR Decomposition
    11.4.5 Eliminating Duplicate Rows and Columns
    11.4.6 Exercises for Section 11.4
  11.5 Summary of Chapter 11
  11.6 References for Chapter 11

12 Large-Scale Machine Learning
  12.1 The Machine-Learning Model
    12.1.1 Training Sets
    12.1.2 Some Illustrative Examples
    12.1.3 Approaches to Machine Learning
    12.1.4 Machine-Learning Architecture
    12.1.5 Exercises for Section 12.1
  12.2 Perceptrons
    12.2.1 Training a Perceptron with Zero Threshold
    12.2.2 Convergence of Perceptrons
    12.2.3 The Winnow Algorithm
    12.2.4 Allowing the Threshold to Vary
    12.2.5 Multiclass Perceptrons
    12.2.6 Transforming the Training Set
    12.2.7 Problems With Perceptrons
    12.2.8 Parallel Implementation of Perceptrons
    12.2.9 Exercises for Section 12.2
  12.3 Support-Vector Machines
    12.3.1 The Mechanics of an SVM
    12.3.2 Normalizing the Hyperplane
    12.3.3 Finding Optimal Approximate Separators
    12.3.4 SVM Solutions by Gradient Descent
    12.3.5 Stochastic Gradient Descent
    12.3.6 Parallel Implementation of SVM
    12.3.7 Exercises for Section 12.3
  12.4 Learning from Nearest Neighbors
    12.4.1 The Framework for Nearest-Neighbor Calculations
    12.4.2 Learning with One Nearest Neighbor
    12.4.3 Learning One-Dimensional Functions
    12.4.4 Kernel Regression
    12.4.5 Dealing with High-Dimensional Euclidean Data
    12.4.6 Dealing with Non-Euclidean Distances
    12.4.7 Exercises for Section 12.4
  12.5 Comparison of Learning Methods
  12.6 Summary of Chapter 12
  12.7 References for Chapter 12

CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.
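
A hand-designed resume detector of the kind contrasted with machine learning here could be as simple as counting words and phrases that typically appear in resumes. The following is only a minimal sketch under stated assumptions: the cue list `RESUME_CUES`, the function names, and the threshold of 3 are all hypothetical and are not taken from the book or from any real system.

```python
import re

# Hypothetical cue phrases; a real hand-built detector would tune this list.
RESUME_CUES = (
    "curriculum vitae",
    "resume",
    "objective",
    "education",
    "work experience",
    "skills",
    "references",
)


def resume_score(page_text: str) -> int:
    """Count how many distinct cue phrases occur as whole words in the text."""
    text = page_text.lower()
    return sum(
        1
        for cue in RESUME_CUES
        if re.search(r"\b" + re.escape(cue) + r"\b", text)
    )


def looks_like_resume(page_text: str, threshold: int = 3) -> bool:
    """Flag a page as a resume when enough cue phrases are present.

    The threshold is an assumed value chosen only for illustration.
    """
    return resume_score(page_text) >= threshold
```

For example, `looks_like_resume("Objective: data analyst. Education: BSc. Work Experience: 3 years. Skills: SQL")` matches four cues and is flagged, while a page of movie reviews matches none. No training data is involved: the goal can be described directly, so the rule can be written directly.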

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
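
The point about computed values versus best-fit parameters can be illustrated with a small sketch. It assumes one particular reading that the text does not spell out: the query computes the sample standard deviation (dividing by n - 1), while the maximum-likelihood Gaussian fit uses the population form (dividing by n). The function name `summarize` and the synthetic data are hypothetical.

```python
import random
import statistics


def summarize(numbers):
    """Return (mean, sample standard deviation, maximum-likelihood standard deviation)."""
    mean = statistics.fmean(numbers)
    sample_sd = statistics.stdev(numbers)   # divides by n - 1
    mle_sd = statistics.pstdev(numbers)     # divides by n (Gaussian maximum likelihood)
    return mean, sample_sd, mle_sd


# Synthetic data drawn from an assumed Gaussian with mean 10 and deviation 2.
random.seed(0)
for n in (5, 50_000):
    data = [random.gauss(10.0, 2.0) for _ in range(n)]
    mean, sample_sd, mle_sd = summarize(data)
    print(f"n={n}: mean={mean:.3f} sample_sd={sample_sd:.4f} mle_sd={mle_sd:.4f}")
```

The two standard deviations differ by a factor of sqrt(n / (n - 1)), which is noticeable for n = 5 and negligible for n = 50,000, in line with the remark that the values are very close when the data is large.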

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote: WhizBang! Labs attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.
