Big Data Mining: Key Algorithms and Applications for Processing Massive Data

Updated 2024-07-19 · PDF, 5.13 MB

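Mining of Massive Datasets, by Anand Rajaraman and Jeffrey D. Ullman, is rated 8.7 on Douban and is well regarded by readers. The book addresses data mining at Internet scale and emphasizes practical algorithms for datasets too large to fit in main memory. It covers the following core topics:

1. Distributed file systems and the MapReduce framework: The authors first introduce MapReduce, a parallel-computing framework that automatically breaks an algorithm into small tasks that can run over large amounts of data, which addresses the efficiency problem in big-data processing. MapReduce simplifies the programming model for large-scale data processing so that developers can write parallel code more easily.

2. Locality-sensitive hashing (LSH) and stream-processing algorithms: For very large data with real-time requirements, the book discusses how to use LSH (hash functions designed so that similar items are likely to collide) to find potentially related data quickly, and how stream-processing algorithms handle data that arrives continuously without analyzing all of it exhaustively.

3. PageRank and the organization of the Web: The authors explain PageRank in detail. It is an algorithm for assessing the importance of Web pages and is central to search-engine ranking and the organization of Web information. The book also discusses other techniques related to page ranking and link analysis.

4. Frequent-pattern mining and clustering: The book studies how to find combinations of items that occur together frequently (frequent itemsets), which is the basis of market-basket analysis and association-rule learning, and how to use clustering to group large-scale data in order to identify structure and patterns in it.

5. Recommendation systems and Web advertising: The final two chapters focus on two important e-commerce applications: recommendation systems, which use data-mining techniques to provide personalized product or service recommendations, and Web advertising, including ad targeting and click-through-rate optimization, both of which bear directly on the profitability of online businesses.

As an authoritative work in the database and Web-technology fields, Mining of Massive Datasets is suitable for graduate study and is also a valuable reference for industry practitioners. It provides the theoretical foundation and practical experience that readers need to master the core methods for processing and analyzing massive data.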


2 CHAPTER 1. DATA MINING

and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data. □

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine learning over the direct design of an algorithm to discover resumes.
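To make the idea of a directly designed resume finder concrete, here is a minimal Python sketch. It is not the algorithm that WhizBang! Labs was compared against; the cue phrases in `RESUME_CUES`, the function names, and the threshold of 3 are all assumptions chosen for illustration.

```python
import re

# Hypothetical cue phrases; a real list would come from inspecting actual resumes.
RESUME_CUES = (
    "resume",
    "curriculum vitae",
    "work experience",
    "education",
    "skills",
    "references",
)


def resume_score(page_text: str) -> int:
    """Count how many distinct cue phrases occur in the page text."""
    text = page_text.lower()
    return sum(
        1
        for cue in RESUME_CUES
        if re.search(r"\b" + re.escape(cue) + r"\b", text)
    )


def looks_like_resume(page_text: str, threshold: int = 3) -> bool:
    """Flag a page as a likely resume when enough cue phrases are present."""
    return resume_score(page_text) >= threshold


page = "Jane Doe - Resume\nEducation: B.S.\nWork Experience: analyst\nSkills: Python"
print(resume_score(page), looks_like_resume(page))
```

A rule like this needs no training data, which is the point of the passage: when the target is already well understood, a hand-written test may be hard for a learned model to beat.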

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
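One way to see how a computed standard deviation can differ from the best-fit Gaussian parameter is sketched below, using Python's `statistics` module. It rests on an assumption the text does not state: that the "computed" value uses the sample formula (dividing by n - 1), while the maximum-likelihood Gaussian fit divides by n. The numbers in `data` are made up and are not those of Example 1.1.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample std using n - 1, maximum-likelihood std using n)."""
    mean = statistics.mean(values)
    sample_std = statistics.stdev(values)  # divides by n - 1
    mle_std = statistics.pstdev(values)    # divides by n (maximum-likelihood fit)
    return mean, sample_std, mle_std


# Made-up numbers for illustration only.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, sample_std, mle_std = summarize(data)
n = len(data)

print(mean, sample_std, mle_std)
# The two estimates differ by a factor of sqrt(n / (n - 1)), which tends to 1 as n grows.
print(sample_std / mle_std, math.sqrt(n / (n - 1)))
```

Under that assumption the gap is only a factor of sqrt(n / (n - 1)), so for a large dataset the two values are nearly identical.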

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote (on WhizBang! Labs): This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of cholera would not have been discovered. □
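The observation that cases group around particular wells can be sketched as a nearest-center assignment in Python. This is only the assignment step of a clustering method, with the centers given in advance, and every coordinate below is an invented toy value rather than historical data.

```python
import math
from collections import Counter


def nearest(point, centers):
    """Return the name of the center closest to point (Euclidean distance)."""
    return min(centers, key=lambda name: math.dist(point, centers[name]))


# Invented toy coordinates, not historical data.
wells = {"well_A": (0.0, 0.0), "well_B": (10.0, 0.0), "well_C": (5.0, 8.0)}
cases = [(0.5, 0.2), (-0.3, 0.8), (0.1, -0.6), (1.0, 0.4), (9.6, 0.3), (5.2, 7.9)]

# Count how many cases fall nearest each well.
cases_per_well = Counter(nearest(case, wells) for case in cases)
print(cases_per_well.most_common())
```

A well with a disproportionate share of nearby cases would stand out in such a tally. A full clustering algorithm would also have to discover the centers rather than assume them.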

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an on-line store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter 3.
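Both kinds of feature extraction listed above can be illustrated with a small brute-force Python sketch. It ignores the scale concerns that motivate the book, and several details are assumptions: the function names, the `min_support` threshold, the toy baskets and purchase sets, and the use of Jaccard similarity (intersection size divided by union size) as one common way to measure "fraction of elements in common", which the passage itself does not name.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def jaccard(a, b):
    """Fraction of elements the two sets share: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(customer, purchases, k=1):
    """Return the k other customers whose purchase sets are most similar."""
    others = [c for c in purchases if c != customer]
    others.sort(key=lambda c: jaccard(purchases[customer], purchases[c]), reverse=True)
    return others[:k]


# Toy data for illustration only.
baskets = [
    {"hamburger", "ketchup", "buns"},
    {"hamburger", "ketchup"},
    {"ketchup", "mustard"},
    {"hamburger", "ketchup", "mustard"},
]
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen", "desk"},
    "carol": {"kettle", "mug"},
}

print(frequent_pairs(baskets, min_support=3))
print(most_similar("alice", purchases))
```

Counting every pair in memory and comparing every pair of customers is only workable for small inputs like these.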

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni’s Principle,” a warning against overzealous use of data mining.

