Large-Scale Data Mining: The Classic Stanford Textbook


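Mining of Massive Datasets is a classic textbook by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, based on courses taught at Stanford University. The book grew out of material that Rajaraman and Ullman developed over several years for CS345A, "Web Mining," an advanced graduate course at Stanford. Although designed for graduate students, the material has also become accessible and interesting to advanced undergraduates. When Leskovec joined the Stanford faculty, the material was reorganized considerably: he introduced a new course on network analysis, CS224W, and added material to CS345A, which was renumbered CS246. The three authors also introduced a large-scale data-mining project course, CS341. Material from these courses is incorporated into the book.

The book focuses on mining data at very large scale, that is, data so large that it does not fit in main memory. Because of this emphasis, many of its examples concern the Web or other very large data sources, and they show how to mine and analyze data effectively at that scale. Topics include basic data-mining concepts, algorithms, and techniques such as association-rule learning, clustering, classification, anomaly detection, and network analysis. The book also discusses how distributed systems are used to process large data sets, for example with the Hadoop MapReduce framework, and how to design and optimize algorithms for data streams. It also touches on privacy concerns raised by large-scale data mining.

The book is not purely theoretical: it includes many worked examples and exercises, so readers can learn the theory by solving concrete problems. It is suited to students and researchers in computer science, data science, statistics, and business analytics who want a deeper understanding of how to process very large data sets and extract useful information from them.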

CONTENTS

10.6.3 Exercises for Section 10.6
10.7 Counting Triangles
10.7.1 Why Count Triangles?
10.7.2 An Algorithm for Finding Triangles
10.7.3 Optimality of the Triangle-Finding Algorithm
10.7.4 Finding Triangles Using MapReduce
10.7.5 Using Fewer Reduce Tasks
10.7.6 Exercises for Section 10.7
10.8 Neighborhood Properties of Graphs
10.8.1 Directed Graphs and Neighborhoods
10.8.2 The Diameter of a Graph
10.8.3 Transitive Closure and Reachability
10.8.4 Transitive Closure Via MapReduce
10.8.5 Smart Transitive Closure
10.8.6 Transitive Closure by Graph Reduction
10.8.7 Approximating the Sizes of Neighborhoods
10.8.8 Exercises for Section 10.8
10.9 Summary of Chapter 10
10.10 References for Chapter 10

11 Dimensionality Reduction
11.1 Eigenvalues and Eigenvectors
11.1.1 Definitions
11.1.2 Computing Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 The Matrix of Eigenvectors
11.1.5 Exercises for Section 11.1
11.2 Principal-Component Analysis
11.2.1 An Illustrative Example
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 The Matrix of Distances
11.2.4 Exercises for Section 11.2
11.3 Singular-Value Decomposition
11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Querying Using Concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Exercises for Section 11.3
11.4 CUR Decomposition
11.4.1 Definition of CUR
11.4.2 Choosing Rows and Columns Properly
11.4.3 Constructing the Middle Matrix
11.4.4 The Complete CUR Decomposition
11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
11.6 References for Chapter 11

12 Large-Scale Machine Learning
12.1 The Machine-Learning Model
12.1.1 Training Sets
12.1.2 Some Illustrative Examples
12.1.3 Approaches to Machine Learning
12.1.4 Machine-Learning Architecture
12.1.5 Exercises for Section 12.1
12.2 Perceptrons
12.2.1 Training a Perceptron with Zero Threshold
12.2.2 Convergence of Perceptrons
12.2.3 The Winnow Algorithm
12.2.4 Allowing the Threshold to Vary
12.2.5 Multiclass Perceptrons
12.2.6 Transforming the Training Set
12.2.7 Problems With Perceptrons
12.2.8 Parallel Implementation of Perceptrons
12.2.9 Exercises for Section 12.2
12.3 Support-Vector Machines
12.3.1 The Mechanics of an SVM
12.3.2 Normalizing the Hyperplane
12.3.3 Finding Optimal Approximate Separators
12.3.4 SVM Solutions by Gradient Descent
12.3.5 Stochastic Gradient Descent
12.3.6 Parallel Implementation of SVM
12.3.7 Exercises for Section 12.3
12.4 Learning from Nearest Neighbors
12.4.1 The Framework for Nearest-Neighbor Calculations
12.4.2 Learning with One Nearest Neighbor
12.4.3 Learning One-Dimensional Functions
12.4.4 Kernel Regression
12.4.5 Dealing with High-Dimensional Euclidean Data
12.4.6 Dealing with Non-Euclidean Distances
12.4.7 Exercises for Section 12.4
12.5 Comparison of Learning Methods
12.6 Summary of Chapter 12
12.7 References for Chapter 12

CHAPTER 1. DATA MINING

[The mean] and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the "Netflix challenge" to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people's resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.
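The kind of hand-designed algorithm described here can be sketched in a few lines. The following is only an illustration, not the method anyone actually used: the phrase list `RESUME_PHRASES`, the function names, and the threshold of 3 are all hypothetical choices made for this example.

```python
import re

# Hypothetical words and phrases that tend to appear in a typical resume.
RESUME_PHRASES = (
    "resume", "curriculum vitae", "work experience", "education",
    "skills", "references", "objective", "employment history",
)

def resume_score(page_text, phrases=RESUME_PHRASES):
    """Count how many distinct phrases occur in the page text.

    Matching is case-insensitive and respects word boundaries.
    """
    text = page_text.lower()
    return sum(
        1 for phrase in phrases
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    )

def looks_like_resume(page_text, threshold=3):
    """Flag a page when at least `threshold` distinct phrases appear.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return resume_score(page_text) >= threshold

page = "Jane Doe\nObjective: data engineer\nEducation: BS\nWork Experience: 5 years\nSkills: Python"
print(resume_score(page), looks_like_resume(page))
```

Because the rule is written down directly, it needs no training data, and it is easy to see why a page was or was not flagged.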

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
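One way to see why the computed values and the best-fitting parameters can differ, yet agree for large data, is to compare two common definitions of standard deviation. This sketch assumes that the query computes the sample standard deviation (dividing by n - 1), while the maximum-likelihood Gaussian fit divides by n. That interpretation is an assumption made for illustration, and the helper name `summarize` and the synthetic data are hypothetical.

```python
import random
import statistics

def summarize(numbers):
    """Return (mean, sample_std, mle_std) for a sequence of numbers."""
    mean = statistics.fmean(numbers)
    sample_std = statistics.stdev(numbers)   # divides by n - 1
    mle_std = statistics.pstdev(numbers)     # divides by n (Gaussian maximum-likelihood estimate)
    return mean, sample_std, mle_std

rng = random.Random(0)  # fixed seed so repeated runs use the same data
for n in (10, 1_000, 100_000):
    data = [rng.gauss(50.0, 10.0) for _ in range(n)]
    mean, sample_std, mle_std = summarize(data)
    # The ratio equals sqrt(n / (n - 1)), which approaches 1 as n grows.
    print(n, round(sample_std / mle_std, 6))
```

The two estimates differ by a factor of sqrt(n / (n - 1)), so the gap is noticeable for ten points and negligible for a hundred thousand.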

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote (WhizBang! Labs): This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.
