Mining of Massive Datasets: Web and Network Analysis


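Mining of Massive Datasets is a book by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman, with copyright dates spanning 2010 to 2014. It grew out of CS345A, an advanced graduate course at Stanford University originally titled "Web Mining". As the authors expanded and reorganized the material, the course became accessible to advanced undergraduates and took on more topics in network analysis and large-scale data mining. The material is now spread across CS224W (network analysis), CS246 (formerly CS345A), and CS341, a dedicated large-scale data-mining project course.

The book focuses on mining data at very large scale, that is, datasets too large to fit in a conventional computer's main memory. Because of this emphasis, many of its examples are drawn from Web data and show how to extract useful information and knowledge from massive datasets. Its topics include data preprocessing, data structures, algorithm design, pattern recognition, association-rule learning, cluster analysis, and social-network analysis. Reflecting the hands-on nature of the courses, it also discusses how to carry out and optimize large data projects.

The book emphasizes the following core topics:

1. **Data mining fundamentals**: the basic concepts of data mining, including its goals, methods, and steps, and its relationship to machine learning, statistics, and database systems.
2. **Big-data processing techniques**: distributed computing, parallel processing, and stream processing, and how to manage and process data that exceeds the storage limits of a single machine.
3. **Data structures and algorithms**: how to design and implement efficient data structures for large-scale data, along with algorithms for searching, sorting, and filtering.
4. **Pattern recognition and association rules**: frequent-pattern mining (such as the Apriori algorithm) and association-rule discovery, which are central to market-basket analysis and recommender systems.
5. **Cluster analysis**: hierarchical clustering, K-means, and other clustering methods applied to large-scale data, used to understand the internal structure and similarity of the data.
6. **Social-network analysis**: network-mining techniques for studying user behavior, community structure, and information propagation.
7. **Project cases and practice**: real cases that show how to apply the theory to real-world large-scale data-mining problems and build practical skills.

Mining of Massive Datasets is an accessible yet thorough guide. It suits students in data science, computer science, and related fields, and it is also a valuable reference for data engineers, analysts, and researchers who need practical tools and techniques for processing and analyzing massive data.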

CONTENTS

10.7.2 An Algorithm for Finding Triangles

10.7.3 Optimality of the Triangle-Finding Algorithm

10.7.4 Finding Triangles Using MapReduce

10.7.5 Using Fewer Reduce Tasks

10.7.6 Exercises for Section 10.7

10.8 Neighborhood Properties of Graphs

10.8.1 Directed Graphs and Neighborhoods

10.8.2 The Diameter of a Graph

10.8.3 Transitive Closure and Reachability

10.8.4 Transitive Closure Via MapReduce

10.8.5 Smart Transitive Closure

10.8.6 Transitive Closure by Graph Reduction

10.8.7 Approximating the Sizes of Neighborhoods

10.8.8 Exercises for Section 10.8

10.9 Summary of Chapter 10

10.10 References for Chapter 10

11 Dimensionality Reduction

11.1 Eigenvalues and Eigenvectors of Symmetric Matrices

11.1.1 Definitions

11.1.2 Computing Eigenvalues and Eigenvectors

11.1.3 Finding Eigenpairs by Power Iteration

11.1.4 The Matrix of Eigenvectors

11.1.5 Exercises for Section 11.1

11.2 Principal-Component Analysis

11.2.1 An Illustrative Example

11.2.2 Using Eigenvectors for Dimensionality Reduction

11.2.3 The Matrix of Distances

11.2.4 Exercises for Section 11.2

11.3 Singular-Value Decomposition

11.3.1 Definition of SVD

11.3.2 Interpretation of SVD

11.3.3 Dimensionality Reduction Using SVD

11.3.4 Why Zeroing Low Singular Values Works

11.3.5 Querying Using Concepts

11.3.6 Computing the SVD of a Matrix

11.3.7 Exercises for Section 11.3

11.4 CUR Decomposition

11.4.1 Definition of CUR

11.4.2 Choosing Rows and Columns Properly

11.4.3 Constructing the Middle Matrix

11.4.4 The Complete CUR Decomposition

11.4.5 Eliminating Duplicate Rows and Columns

11.4.6 Exercises for Section 11.4

11.5 Summary of Chapter 11


11.6 References for Chapter 11

12 Large-Scale Machine Learning

12.1 The Machine-Learning Model

12.1.1 Training Sets

12.1.2 Some Illustrative Examples

12.1.3 Approaches to Machine Learning

12.1.4 Machine-Learning Architecture

12.1.5 Exercises for Section 12.1

12.2 Perceptrons

12.2.1 Training a Perceptron with Zero Threshold

12.2.2 Convergence of Perceptrons

12.2.3 The Winnow Algorithm

12.2.4 Allowing the Threshold to Vary

12.2.5 Multiclass Perceptrons

12.2.6 Transforming the Training Set

12.2.7 Problems With Perceptrons

12.2.8 Parallel Implementation of Perceptrons

12.2.9 Exercises for Section 12.2

12.3 Support-Vector Machines

12.3.1 The Mechanics of an SVM

12.3.2 Normalizing the Hyperplane

12.3.3 Finding Optimal Approximate Separators

12.3.4 SVM Solutions by Gradient Descent

12.3.5 Stochastic Gradient Descent

12.3.6 Parallel Implementation of SVM

12.3.7 Exercises for Section 12.3

12.4 Learning from Nearest Neighbors

12.4.1 The Framework for Nearest-Neighbor Calculations

12.4.2 Learning with One Nearest Neighbor

12.4.3 Learning One-Dimensional Functions

12.4.4 Kernel Regression

12.4.5 Dealing with High-Dimensional Euclidean Data

12.4.6 Dealing with Non-Euclidean Distances

12.4.7 Exercises for Section 12.4

12.5 Comparison of Learning Methods

12.6 Summary of Chapter 12

12.7 References for Chapter 12

CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike it. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine-learning over the direct design of an algorithm to discover resumes.
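To make "the direct design of an algorithm to discover resumes" concrete, here is a minimal Python sketch of a hand-written detector. It is only an illustration: the cue phrases in `RESUME_CUES`, the function name `looks_like_resume`, and the `min_cues` threshold are all hypothetical, since the text does not say which words or rules any real system used.

```python
import re

# Hypothetical cue phrases; the text does not list the words actually used.
RESUME_CUES = ("objective", "education", "work experience", "skills", "references")

def looks_like_resume(page_text, min_cues=3):
    """Return True if at least `min_cues` distinct cue phrases occur in the text."""
    text = page_text.lower()
    hits = sum(
        1
        for cue in RESUME_CUES
        if re.search(r"\b" + re.escape(cue) + r"\b", text)
    )
    return hits >= min_cues

page = "Objective: data engineer. Education: BS. Work Experience: 3 years. Skills: SQL"
print(looks_like_resume(page))
```

The point of the sketch is that every rule is written down by a person who already knows what a resume looks like; nothing is learned from a training set.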

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
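As a rough illustration of treating the model as the answer to a query, the Python sketch below computes the average and standard deviation of a small list. The data values and the function name `mean_and_std` are made up for this example and are not the numbers of Example 1.1. The sketch also assumes one possible reading of the caveat above: that the computed standard deviation may use the n - 1 divisor, whereas the maximum-likelihood Gaussian fit divides by n.

```python
import math

def mean_and_std(values, sample=False):
    """Return (mean, standard deviation) for a non-empty sequence of numbers.

    sample=False divides by n (the maximum-likelihood estimate for a Gaussian);
    sample=True divides by n - 1 (the usual sample standard deviation).
    """
    n = len(values)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data points")
    mean = sum(values) / n
    squared_dev = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return mean, math.sqrt(squared_dev / divisor)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up numbers, not the data of Example 1.1
mu, sigma_n = mean_and_std(data)
_, sigma_n1 = mean_and_std(data, sample=True)
print(mu, sigma_n, sigma_n1)
```

Under this reading, the two standard deviations differ by a factor of sqrt(n / (n - 1)), which approaches 1 as n grows. That is consistent with the remark that the values will be very close when the data is large.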

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

[Footnote on WhizBang! Labs] This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.
