Mining of Massive Datasets: Stanford University Textbook, Second Edition

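"Mining of Massive Datasets, Second Edition"

This is the second edition of Mining of Massive Datasets, written by Jure Leskovec (Stanford University), Anand Rajaraman (Milliway Labs), and Jeffrey D. Ullman (Stanford University). The book grew out of CS345A, a course titled "Web Mining" that Anand Rajaraman and Jeff Ullman taught at Stanford. When Jure Leskovec joined the Stanford faculty, the material was reorganized: a new course on network analysis, CS224W, was introduced, and CS345A was renumbered CS246. The authors also introduced a large-scale data-mining project course, CS341. The book draws on material from all three courses.

The core subject is data mining, and in particular the mining of data sets so large that they do not fit in main memory. Many of the examples therefore concern the Web or other very large data sources. At this scale, traditional data-mining methods often stop being practical, and different techniques and strategies are needed. The listing describes the book as addressing the following topics:

1. Big-data storage and management: storing and processing large-scale data in distributed systems such as Google's Bigtable or Hadoop's HDFS, and applying the MapReduce programming model to large-scale processing.
2. Data preprocessing: cleaning, transformation, and normalization, which underpin any large-scale analysis by ensuring data quality before mining begins.
3. Sampling and approximation algorithms: because processing the full data set is often impractical, obtaining representative samples and designing approximate algorithms that estimate statistics quickly.
4. Graph data structures and network analysis: graph-theory basics, community detection, node clustering, and path finding.
5. Distributed computing frameworks: platforms such as Spark and Flink for efficient large-scale data processing.
6. Recommendation systems: collaborative filtering, content-based recommendation, and matrix factorization, as used in e-commerce and streaming media.
7. Social-network analysis: user behavior, relationship patterns, and information diffusion in social networks, including generative network models and influence maximization.
8. Search engines and page ranking: for example PageRank, the algorithm Google introduced to estimate the importance of Web pages.
9. Text mining and information extraction: natural-language-processing techniques for extracting useful information from large text collections, such as keyword extraction and sentiment analysis.
10. Anomaly detection and cluster analysis: identifying unusual patterns and group structure in large data sets, for uses such as security monitoring and market segmentation.
11. Time-series analysis: handling large data with a time dimension, for example trend forecasting and periodicity analysis.
12. Probabilistic models: naive Bayes, Markov chains, and hidden Markov models, used in tasks such as classification and sequence prediction.
13. Deep learning and neural networks: according to the listing, the book may touch on models such as convolutional and recurrent neural networks.
14. Hands-on project experience: practical cases that let students and readers apply the theory to real problems.

The listing presents the second edition of Mining of Massive Datasets as an accessible yet rigorous textbook on large-scale data mining, suitable for graduate students and advanced undergraduates and useful as a reference for practitioners. It aims to give readers the key skills for processing large-scale data and an understanding of the principles and practice behind modern data science.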

xvi CONTENTS

10.7.2 An Algorithm for Finding Triangles . . . . . . . . . . . . 381

10.7.3 Optimality of the Triangle-Finding Algorithm . . . . . . . 382

10.7.4 Finding Triangles Using MapReduce . . . . . . . . . . . . 383

10.7.5 Using Fewer Reduce Tasks . . . . . . . . . . . . . . . . . . 384

10.7.6 Exercises for Section 10.7 . . . . . . . . . . . . . . . . . . 385

10.8 Neighborhood Properties of Graphs . . . . . . . . . . . . . . . . . 386

10.8.1 Directed Graphs and Neighborhoods . . . . . . . . . . . . 386

10.8.2 The Diameter of a Graph . . . . . . . . . . . . . . . . . . 388

10.8.3 Transitive Closure and Reachability . . . . . . . . . . . . 388

10.8.4 Transitive Closure Via MapReduce . . . . . . . . . . . . . 390

10.8.5 Smart Transitive Closure . . . . . . . . . . . . . . . . . . 392

10.8.6 Transitive Closure by Graph Reduction . . . . . . . . . . 394

10.8.7 Approximating the Sizes of Neighborhoods . . . . . . . . 395

10.8.8 Exercises for Section 10.8 . . . . . . . . . . . . . . . . . . 397

10.9 Summary of Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . 398

10.10 References for Chapter 10 . . . . . . . . . . . . . . . . . . . . . 402

11 Dimensionality Reduction 405

11.1 Eigenvalues and Eigenvectors of Symmetric Matrices . . . . . . . 406

11.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 406

11.1.2 Computing Eigenvalues and Eigenvectors . . . . . . . . . 407

11.1.3 Finding Eigenpairs by Power Iteration . . . . . . . . . . . 408

11.1.4 The Matrix of Eigenvectors . . . . . . . . . . . . . . . . . 411

11.1.5 Exercises for Section 11.1 . . . . . . . . . . . . . . . . . . 411

11.2 Principal-Component Analysis . . . . . . . . . . . . . . . . . . . 412

11.2.1 An Illustrative Example . . . . . . . . . . . . . . . . . . . 413

11.2.2 Using Eigenvectors for Dimensionality Reduction . . . . . 416

11.2.3 The Matrix of Distances . . . . . . . . . . . . . . . . . . . 417

11.2.4 Exercises for Section 11.2 . . . . . . . . . . . . . . . . . . 418

11.3 Singular-Value Decomposition . . . . . . . . . . . . . . . . . . . . 418

11.3.1 Definition of SVD . . . . . . . . . . . . . . . . . . . . . . 418

11.3.2 Interpretation of SVD . . . . . . . . . . . . . . . . . . . . 420

11.3.3 Dimensionality Reduction Using SVD . . . . . . . . . . . 422

11.3.4 Why Zeroing Low Singular Values Works . . . . . . . . . 423

11.3.5 Querying Using Concepts . . . . . . . . . . . . . . . . . . 425

11.3.6 Computing the SVD of a Matrix . . . . . . . . . . . . . . 426

11.3.7 Exercises for Section 11.3 . . . . . . . . . . . . . . . . . . 427

11.4 CUR Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 428

11.4.1 Definition of CUR . . . . . . . . . . . . . . . . . . . . . . 429

11.4.2 Choosing Rows and Columns Properly . . . . . . . . . . . 430

11.4.3 Constructing the Middle Matrix . . . . . . . . . . . . . . 431

11.4.4 The Complete CUR Decomposition . . . . . . . . . . . . 432

11.4.5 Eliminating Duplicate Rows and Columns . . . . . . . . . 433

11.4.6 Exercises for Section 11.4 . . . . . . . . . . . . . . . . . . 434

11.5 Summary of Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . 434

11.6 References for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . 436

12 Large-Scale Machine Learning 439

12.1 The Machine-Learning Model . . . . . . . . . . . . . . . . . . . . 440

12.1.1 Training Sets . . . . . . . . . . . . . . . . . . . . . . . . . 440

12.1.2 Some Illustrative Examples . . . . . . . . . . . . . . . . . 440

12.1.3 Approaches to Machine Learning . . . . . . . . . . . . . . 443

12.1.4 Machine-Learning Architecture . . . . . . . . . . . . . . . 444

12.1.5 Exercises for Section 12.1 . . . . . . . . . . . . . . . . . . 447

12.2 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447

12.2.1 Training a Perceptron with Zero Threshold . . . . . . . . 447

12.2.2 Convergence of Perceptrons . . . . . . . . . . . . . . . . . 451

12.2.3 The Winnow Algorithm . . . . . . . . . . . . . . . . . . . 451

12.2.4 Allowing the Threshold to Vary . . . . . . . . . . . . . . . 453

12.2.5 Multiclass Perceptrons . . . . . . . . . . . . . . . . . . . . 455

12.2.6 Transforming the Training Set . . . . . . . . . . . . . . . 456

12.2.7 Problems With Perceptrons . . . . . . . . . . . . . . . . . 457

12.2.8 Parallel Implementation of Perceptrons . . . . . . . . . . 458

12.2.9 Exercises for Section 12.2 . . . . . . . . . . . . . . . . . . 459

12.3 Support-Vector Machines . . . . . . . . . . . . . . . . . . . . . . 461

12.3.1 The Mechanics of an SVM . . . . . . . . . . . . . . . . . . 461

12.3.2 Normalizing the Hyperplane . . . . . . . . . . . . . . . . . 462

12.3.3 Finding Optimal Approximate Separators . . . . . . . . . 464

12.3.4 SVM Solutions by Gradient Descent . . . . . . . . . . . . 467

12.3.5 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . 470

12.3.6 Parallel Implementation of SVM . . . . . . . . . . . . . . 471

12.3.7 Exercises for Section 12.3 . . . . . . . . . . . . . . . . . . 471

12.4 Learning from Nearest Neighbors . . . . . . . . . . . . . . . . . . 472

12.4.1 The Framework for Nearest-Neighbor Calculations . . . . 472

12.4.2 Learning with One Nearest Neighbor . . . . . . . . . . . . 473

12.4.3 Learning One-Dimensional Functions . . . . . . . . . . . . 474

12.4.4 Kernel Regression . . . . . . . . . . . . . . . . . . . . . . 476

12.4.5 Dealing with High-Dimensional Euclidean Data . . . . . . 477

12.4.6 Dealing with Non-Euclidean Distances . . . . . . . . . . . 478

12.4.7 Exercises for Section 12.4 . . . . . . . . . . . . . . . . . . 479

12.5 Comparison of Learning Methods . . . . . . . . . . . . . . . . . . 480

12.6 Summary of Chapter 12 . . . . . . . . . . . . . . . . . . . . . . . 481

12.7 References for Chapter 12 . . . . . . . . . . . . . . . . . . . . . . 483

2 CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike it. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs [1] to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine-learning over the direct design of an algorithm to discover resumes.
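As a rough illustration of what the "direct design of an algorithm to discover resumes" could look like, here is a minimal Python sketch that scores a page by counting typical resume phrases. The phrase list, the threshold, and the names `RESUME_PHRASES`, `resume_score`, and `looks_like_resume` are all hypothetical; the text does not say which words or rules were actually used.

```python
import re

# Hypothetical phrase list (an assumption, not taken from the text).
RESUME_PHRASES = ("education", "work experience", "skills", "references", "objective")


def resume_score(page_text, phrases=RESUME_PHRASES):
    """Count how many of the phrases occur in the page text as whole words."""
    text = page_text.lower()
    return sum(
        1 for phrase in phrases
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    )


def looks_like_resume(page_text, threshold=3):
    """Flag a page as a likely resume when enough phrases are present.

    The threshold of 3 is an arbitrary illustrative choice.
    """
    return resume_score(page_text) >= threshold


sample_page = "Objective: data engineer. Education: BSc. Work Experience: 5 years. Skills: Python."
other_page = "Buy cheap movies online today."
```

The point of the sketch is only that such a rule can be written down directly from common knowledge of what resumes contain, with no training set involved.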

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
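One possible reading of this remark, sketched below in Python, is the gap between the usual sample standard deviation (divisor n - 1) and the maximum-likelihood Gaussian standard deviation (divisor n). The data values are made up for illustration and are not the numbers of Example 1.1; the function name `summarize` is likewise an assumption.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample_std, mle_std) for a list of at least two numbers."""
    mean = statistics.fmean(values)
    sample_std = statistics.stdev(values)   # divides the squared deviations by n - 1
    mle_std = statistics.pstdev(values)     # divides by n: the Gaussian maximum-likelihood estimate
    return mean, sample_std, mle_std


# Illustrative data only (an assumption, not from the text).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, sample_std, mle_std = summarize(data)

# The two estimates differ by the factor sqrt(n / (n - 1)),
# which approaches 1 as the number of data points n grows.
ratio = sample_std / mle_std
```

For these eight values the ratio is sqrt(8/7), about 1.07; with a million values it would be indistinguishable from 1 for most purposes, which matches the claim that the two answers are very close when the data is large.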

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

[1] This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.
