Introduction to Big Data Mining: Methods for Handling Massive Datasets

Updated 2024-07-18, 3.66 MB, PDF

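Mining of Massive Datasets is a classic text by Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman of Stanford University. It systematically covers methods and techniques for data mining at massive scale. The book grew out of years of teaching material at Stanford. It began as part of the advanced graduate course CS345A (Web Mining). After Jure Leskovec joined, the material was expanded considerably to include CS224W (network analysis) and CS246 (mining massive datasets), and the three authors also introduced a large-scale data-mining project course, CS341.

The book focuses on datasets too large to fit in main memory, a scale at which traditional data-mining methods no longer apply. Many of its examples draw on Internet data such as web pages, social-media data, and network traffic, which are both voluminous and continuously updated. The book therefore covers how to store, process, analyze, and mine such data efficiently. Topics include:

1. Data collection and storage: how to design and implement efficient distributed storage systems, and how to handle streaming data so that it stays timely and available.
2. Data preprocessing and cleaning: how to apply statistical methods and algorithms to clean, transform, and integrate noisy, inconsistent data for later analysis.
3. Distributed computing frameworks: because a single machine cannot process data at this scale, the book examines distributed frameworks such as MapReduce and Spark, and big-data platforms such as Hadoop.
4. Clustering and classification: clustering algorithms (such as K-means and hierarchical clustering) and classification algorithms (such as decision trees, random forests, and support-vector machines) for pattern recognition, along with strategies for scaling them to large datasets.
5. Association-rule learning: how to find frequent itemsets and association rules in large datasets with methods such as the Apriori algorithm, which matters for applications like market-basket analysis.
6. Network analysis: social networks, information propagation, and community detection, using graph theory and complex-network theory to analyze the structure and dynamics of large networks.
7. Real-time recommendation systems: collaborative filtering, content-based recommendation, and related methods for personalized recommendation in online services, including how to process user-behavior data in real time.
8. Stream metrics and time-series analysis: how to handle time-series data, for example anomaly detection and trend analysis, in order to understand and predict the time dependence of large-scale data.
9. High-performance data-mining tools: open-source tools and technologies such as Apache Mahout, Pig, and Hive that help readers apply data mining in real projects.

Mining of Massive Datasets is both a theoretical textbook and a practical guide. It offers a complete framework for processing massive data and extracting valuable knowledge from it. For researchers, engineers, and data analysts alike, it is an essential reference for understanding the big-data field in depth.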


CONTENTS

10.7.2 An Algorithm for Finding Triangles
10.7.3 Optimality of the Triangle-Finding Algorithm
10.7.4 Finding Triangles Using MapReduce
10.7.5 Using Fewer Reduce Tasks
10.7.6 Exercises for Section 10.7
10.8 Neighborhood Properties of Graphs
10.8.1 Directed Graphs and Neighborhoods
10.8.2 The Diameter of a Graph
10.8.3 Transitive Closure and Reachability
10.8.4 Transitive Closure Via MapReduce
10.8.5 Smart Transitive Closure
10.8.6 Transitive Closure by Graph Reduction
10.8.7 Approximating the Sizes of Neighborhoods
10.8.8 Exercises for Section 10.8
10.9 Summary of Chapter 10
10.10 References for Chapter 10

11 Dimensionality Reduction
11.1 Eigenvalues and Eigenvectors of Symmetric Matrices
11.1.1 Definitions
11.1.2 Computing Eigenvalues and Eigenvectors
11.1.3 Finding Eigenpairs by Power Iteration
11.1.4 The Matrix of Eigenvectors
11.1.5 Exercises for Section 11.1
11.2 Principal-Component Analysis
11.2.1 An Illustrative Example
11.2.2 Using Eigenvectors for Dimensionality Reduction
11.2.3 The Matrix of Distances
11.2.4 Exercises for Section 11.2
11.3 Singular-Value Decomposition
11.3.1 Definition of SVD
11.3.2 Interpretation of SVD
11.3.3 Dimensionality Reduction Using SVD
11.3.4 Why Zeroing Low Singular Values Works
11.3.5 Querying Using Concepts
11.3.6 Computing the SVD of a Matrix
11.3.7 Exercises for Section 11.3
11.4 CUR Decomposition
11.4.1 Definition of CUR
11.4.2 Choosing Rows and Columns Properly
11.4.3 Constructing the Middle Matrix
11.4.4 The Complete CUR Decomposition
11.4.5 Eliminating Duplicate Rows and Columns
11.4.6 Exercises for Section 11.4
11.5 Summary of Chapter 11
11.6 References for Chapter 11

12 Large-Scale Machine Learning
12.1 The Machine-Learning Model
12.1.1 Training Sets
12.1.2 Some Illustrative Examples
12.1.3 Approaches to Machine Learning
12.1.4 Machine-Learning Architecture
12.1.5 Exercises for Section 12.1
12.2 Perceptrons
12.2.1 Training a Perceptron with Zero Threshold
12.2.2 Convergence of Perceptrons
12.2.3 The Winnow Algorithm
12.2.4 Allowing the Threshold to Vary
12.2.5 Multiclass Perceptrons
12.2.6 Transforming the Training Set
12.2.7 Problems With Perceptrons
12.2.8 Parallel Implementation of Perceptrons
12.2.9 Exercises for Section 12.2
12.3 Support-Vector Machines
12.3.1 The Mechanics of an SVM
12.3.2 Normalizing the Hyperplane
12.3.3 Finding Optimal Approximate Separators
12.3.4 SVM Solutions by Gradient Descent
12.3.5 Stochastic Gradient Descent
12.3.6 Parallel Implementation of SVM
12.3.7 Exercises for Section 12.3
12.4 Learning from Nearest Neighbors
12.4.1 The Framework for Nearest-Neighbor Calculations
12.4.2 Learning with One Nearest Neighbor
12.4.3 Learning One-Dimensional Functions
12.4.4 Kernel Regression
12.4.5 Dealing with High-Dimensional Euclidean Data
12.4.6 Dealing with Non-Euclidean Distances
12.4.7 Exercises for Section 12.4
12.5 Comparison of Learning Methods
12.6 Summary of Chapter 12
12.7 References for Chapter 12

CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data.

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine-learning over the direct design of an algorithm to discover resumes.
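As a rough illustration of what a directly designed resume detector might look like, the sketch below counts how many resume-like phrases appear on a page and applies a threshold. The phrase list, the threshold, and the function names are all hypothetical choices made for this example; the text does not say which words or rules any real system used.

```python
import re

# Hypothetical phrase list, chosen only for illustration (an assumption, not from the text).
RESUME_PHRASES = (
    "work experience",
    "education",
    "skills",
    "references",
    "objective",
    "curriculum vitae",
    "resume",
)


def resume_score(page_text, phrases=RESUME_PHRASES):
    """Count how many distinct resume-like phrases occur in the page text."""
    text = page_text.lower()
    return sum(
        1
        for phrase in phrases
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    )


def looks_like_resume(page_text, threshold=3):
    """Flag a page as a resume if enough distinct phrases are present.

    The threshold of 3 is an arbitrary assumption for this sketch.
    """
    return resume_score(page_text) >= threshold


if __name__ == "__main__":
    page = "Jane Doe\nObjective: data engineer\nEducation: BSc\nWork Experience: 5 years\nSkills: Python"
    print(looks_like_resume(page))
```

A rule of this kind needs no training data, which fits the case the passage describes: the target is already well understood, so there is little for a learning algorithm to discover.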

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
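One possible reading of this remark is the difference between the sample standard deviation (which divides by n - 1) and the maximum-likelihood estimate for a Gaussian (which divides by n). The sketch below, using only Python's standard `statistics` module, computes both for synthetic data and shows that the gap shrinks as the data grows. The function name `summarize` and the generated data are assumptions made for illustration.

```python
import random
import statistics


def summarize(values):
    """Return (mean, sample std dev, maximum-likelihood std dev) for a list of numbers."""
    mean = statistics.fmean(values)
    sample_sd = statistics.stdev(values)   # divides the squared deviations by n - 1
    mle_sd = statistics.pstdev(values)     # divides by n: the Gaussian maximum-likelihood estimate
    return mean, sample_sd, mle_sd


if __name__ == "__main__":
    random.seed(0)  # fixed seed so repeated runs draw the same synthetic data
    for n in (5, 50, 50_000):
        # Synthetic data drawn from a Gaussian with mean 10 and std dev 2 (illustrative values).
        data = [random.gauss(10.0, 2.0) for _ in range(n)]
        mean, sample_sd, mle_sd = summarize(data)
        print(f"n={n:>6}  mean={mean:.4f}  sample_sd={sample_sd:.4f}  "
              f"mle_sd={mle_sd:.4f}  gap={sample_sd - mle_sd:.6f}")
```

The two estimates differ by a factor of sqrt(n / (n - 1)), so for large n they are nearly identical, consistent with the statement that the computed values are very close to the best-fit parameters when the data is large.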

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote on WhizBang! Labs: This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.
