Big Data Mining: An Overview of Stanford University's Mining of Massive Datasets Textbook

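"Big Data: Internet-Scale Data Mining and Distributed Processing"

Mining of Massive Datasets v2.0 was written jointly by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman, three professors at Stanford University, and is a classic textbook on big-data mining and distributed processing. Its material grew out of several courses they taught at Stanford: the graduate course "Web Mining" (CS345A), the later course on network analysis (CS224W), and the large-scale data-mining project course (CS341). Content from all three courses is integrated into the book.

The book is about mining big data, in particular data too large to be loaded into main memory at once. Because the emphasis is on scale, many of its examples involve Web data or data derived from the Web, typically search-engine logs, social media, and the link structure of the Web.

The book covers the following key topics:

1. Data mining fundamentals: the basic concepts, methods, and goals of data mining, and how to find valuable information and patterns in large volumes of unstructured or semi-structured data.
2. Web data and Web structure: the topology of the Web, including hyperlinks and the PageRank algorithm, which is the basis for understanding the properties of Web data.
3. Distributed computing frameworks: an in-depth discussion of the MapReduce model and the Hadoop framework, the core tools that make it possible to process large-scale data in a distributed environment.
4. Graph mining: network-analysis techniques such as community detection, path finding, and clustering, which are essential for understanding complex network structure.
5. Data visualization: how to turn large amounts of data into graphical representations that make patterns and trends easy to see.
6. Recommendation systems: collaborative filtering and content-based recommendation, the core of personalized recommendation on modern e-commerce and media platforms.
7. Social media analysis: the properties of social media data, such as user behavior patterns, information diffusion, and the measurement of influence.
8. Machine learning: supervised and unsupervised learning, and learning algorithms for big-data settings such as random forests and deep learning.
9. Real-time and streaming data processing: how to handle continuously arriving data, for example with Apache Storm and Spark Streaming.
10. Big-data projects in practice: real large-scale data-mining project cases that help readers apply the theory to actual problems.

Mining of Massive Datasets v2.0 is an accessible yet thorough textbook on big-data processing and mining. It suits advanced undergraduates and graduate students interested in big data, as well as professionals working in related fields. By reading it, readers can acquire the techniques and ways of thinking needed to process and analyze large-scale data and to prepare for ever-growing data challenges.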

xvi CONTENTS

10.6.5 Using Fewer Reduce Tasks . . . . . . . . . . . . . . . . . . 375

10.6.6 Exercises for Section 10.6 . . . . . . . . . . . . . . . . . . 375

10.7 Neighborhood Properties of Graphs . . . . . . . . . . . . . . . . . 377

10.7.1 Directed Graphs and Neighborhoods . . . . . . . . . . . . 377

10.7.2 The Diameter of a Graph . . . . . . . . . . . . . . . . . . 378

10.7.3 Transitive Closure and Reachability . . . . . . . . . . . . 379

10.7.4 Transitive Closure Via MapReduce . . . . . . . . . . . . . 380

10.7.5 Smart Transitive Closure . . . . . . . . . . . . . . . . . . 382

10.7.6 Transitive Closure by Graph Reduction . . . . . . . . . . 384

10.7.7 Approximating the Sizes of Neighborhoods . . . . . . . . 386

10.7.8 Exercises for Section 10.7 . . . . . . . . . . . . . . . . . . 387

10.8 Summary of Chapter 10 . . . . . . . . . . . . . . . . . . . . . . . 388

10.9 References for Chapter 10 . . . . . . . . . . . . . . . . . . . . . . 391

11 Dimensionality Reduction 395

11.1 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . 395

11.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . 396

11.1.2 Computing Eigenvalues and Eigenvectors . . . . . . . . . 396

11.1.3 Finding Eigenpairs by Power Iteration . . . . . . . . . . . 398

11.1.4 The Matrix of Eigenvectors . . . . . . . . . . . . . . . . . 401

11.1.5 Exercises for Section 11.1 . . . . . . . . . . . . . . . . . . 401

11.2 Principal-Component Analysis . . . . . . . . . . . . . . . . . . . 402

11.2.1 An Illustrative Example . . . . . . . . . . . . . . . . . . . 403

11.2.2 Using Eigenvectors for Dimensionality Reduction . . . . . 406

11.2.3 The Matrix of Distances . . . . . . . . . . . . . . . . . . . 406

11.2.4 Exercises for Section 11.2 . . . . . . . . . . . . . . . . . . 408

11.3 Singular-Value Decomposition . . . . . . . . . . . . . . . . . . . . 408

11.3.1 Definition of SVD . . . . . . . . . . . . . . . . . . . . . . 408

11.3.2 Interpretation of SVD . . . . . . . . . . . . . . . . . . . . 410

11.3.3 Dimensionality Reduction Using SVD . . . . . . . . . . . 412

11.3.4 Why Zeroing Low Singular Values Works . . . . . . . . . 413

11.3.5 Querying Using Concepts . . . . . . . . . . . . . . . . . . 415

11.3.6 Computing the SVD of a Matrix . . . . . . . . . . . . . . 416

11.3.7 Exercises for Section 11.3 . . . . . . . . . . . . . . . . . . 417

11.4 CUR Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 418

11.4.1 Definition of CUR . . . . . . . . . . . . . . . . . . . . . . 418

11.4.2 Choosing Rows and Columns Properly . . . . . . . . . . . 419

11.4.3 Constructing the Middle Matrix . . . . . . . . . . . . . . 421

11.4.4 The Complete CUR Decomposition . . . . . . . . . . . . 422

11.4.5 Eliminating Duplicate Rows and Columns . . . . . . . . . 423

11.4.6 Exercises for Section 11.4 . . . . . . . . . . . . . . . . . . 424

11.5 Summary of Chapter 11 . . . . . . . . . . . . . . . . . . . . . . . 424

11.6 References for Chapter 11 . . . . . . . . . . . . . . . . . . . . . . 426


12 Large-Scale Machine Learning 429

12.1 The Machine-Learning Model . . . . . . . . . . . . . . . . . . . . 430

12.1.1 Training Sets . . . . . . . . . . . . . . . . . . . . . . . . . 430

12.1.2 Some Illustrative Examples . . . . . . . . . . . . . . . . . 430

12.1.3 Approaches to Machine Learning . . . . . . . . . . . . . . 433

12.1.4 Machine-Learning Architecture . . . . . . . . . . . . . . . 434

12.1.5 Exercises for Section 12.1 . . . . . . . . . . . . . . . . . . 437

12.2 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437

12.2.1 Training a Perceptron with Zero Threshold . . . . . . . . 437

12.2.2 Convergence of Perceptrons . . . . . . . . . . . . . . . . . 441

12.2.3 The Winnow Algorithm . . . . . . . . . . . . . . . . . . . 441

12.2.4 Allowing the Threshold to Vary . . . . . . . . . . . . . . . 443

12.2.5 Multiclass Perceptrons . . . . . . . . . . . . . . . . . . . . 445

12.2.6 Transforming the Training Set . . . . . . . . . . . . . . . 446

12.2.7 Problems With Perceptrons . . . . . . . . . . . . . . . . . 447

12.2.8 Parallel Implementation of Perceptrons . . . . . . . . . . 448

12.2.9 Exercises for Section 12.2 . . . . . . . . . . . . . . . . . . 449

12.3 Support-Vector Machines . . . . . . . . . . . . . . . . . . . . . . 451

12.3.1 The Mechanics of an SVM . . . . . . . . . . . . . . . . . . 451

12.3.2 Normalizing the Hyperplane . . . . . . . . . . . . . . . . . 452

12.3.3 Finding Optimal Approximate Separators . . . . . . . . . 454

12.3.4 SVM Solutions by Gradient Descent . . . . . . . . . . . . 457

12.3.5 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . 461

12.3.6 Parallel Implementation of SVM . . . . . . . . . . . . . . 461

12.3.7 Exercises for Section 12.3 . . . . . . . . . . . . . . . . . . 462

12.4 Learning from Nearest Neighbors . . . . . . . . . . . . . . . . . . 462

12.4.1 The Framework for Nearest-Neighbor Calculations . . . . 463

12.4.2 Learning with One Nearest Neighbor . . . . . . . . . . . . 463

12.4.3 Learning One-Dimensional Functions . . . . . . . . . . . . 464

12.4.4 Kernel Regression . . . . . . . . . . . . . . . . . . . . . . 467

12.4.5 Dealing with High-Dimensional Euclidean Data . . . . . . 467

12.4.6 Dealing with Non-Euclidean Distances . . . . . . . . . . . 469

12.4.7 Exercises for Section 12.4 . . . . . . . . . . . . . . . . . . 469

12.5 Comparison of Learning Methods . . . . . . . . . . . . . . . . . . 470

12.6 Summary of Chapter 12 . . . . . . . . . . . . . . . . . . . . . . . 471

12.7 References for Chapter 12 . . . . . . . . . . . . . . . . . . . . . . 473

2 CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.
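To make "the direct design of an algorithm to discover resumes" concrete, here is a minimal Python sketch of a hand-written detector. It is an illustration only, not the method of any actual system: the cue phrases in `RESUME_CUES`, the function names, and the threshold of 3 are all hypothetical choices.

```python
import re

# Hypothetical cue phrases; a real list would be tuned by inspecting resumes.
RESUME_CUES = (
    "resume", "curriculum vitae", "work experience", "education",
    "skills", "references", "objective", "employment history",
)


def resume_score(page_text):
    """Count how many distinct cue phrases occur in the page text."""
    text = page_text.lower()
    return sum(
        1 for cue in RESUME_CUES
        if re.search(r"\b" + re.escape(cue) + r"\b", text)
    )


def looks_like_resume(page_text, threshold=3):
    """Flag a page as a likely resume if enough cue phrases appear."""
    return resume_score(page_text) >= threshold


if __name__ == "__main__":
    page = ("Jane Doe. Objective: data engineer. Education: BSc. "
            "Work Experience: analyst. Skills: SQL.")
    print(resume_score(page), looks_like_resume(page))
```

The design is deliberately simple: because the target is easy to describe, a fixed phrase list and a count suffice, and no training set is needed.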

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
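As a rough illustration of treating the model as the answer to a query, the sketch below computes the average and standard deviation of a list of numbers with Python's standard `statistics` module. The data is randomly generated for the example and is not the data of Example 1.1; the function name `summarize` is likewise an assumption. One way a computed value can differ slightly from a fitted parameter is the choice of divisor: the population standard deviation divides by n, while the sample standard deviation divides by n - 1. The two converge as the data grows.

```python
import random
import statistics


def summarize(values):
    """Return (mean, population std dev, sample std dev) of the values."""
    mean = statistics.fmean(values)
    pop_sd = statistics.pstdev(values)    # divides by n
    sample_sd = statistics.stdev(values)  # divides by n - 1
    return mean, pop_sd, sample_sd


if __name__ == "__main__":
    random.seed(0)  # fixed seed so repeated runs give the same numbers
    for n in (10, 100_000):
        # Synthetic data drawn from a Gaussian with mean 5 and std dev 2.
        data = [random.gauss(5.0, 2.0) for _ in range(n)]
        mean, pop_sd, sample_sd = summarize(data)
        print(n, round(mean, 3), round(pop_sd, 3), round(sample_sd, 3))
```

For the small sample, the two standard deviations visibly disagree; for the large one they agree to several decimal places.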

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote (WhizBang! Labs): This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.
