Mining Massive Data: Key Points of Anand Rajaraman's "Mining of Massive Datasets"

Updated 2024-07-20 · PDF, 2.94 MB

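"Mining of Massive Datasets" (MMDS) is a classic text co-authored by Anand Rajaraman, Jure Leskovec, and Jeffrey D. Ullman that focuses on mining big data. The book grew out of the Stanford graduate course CS345A, originally titled "Web Mining." The course was aimed at advanced graduate students, but its material gradually became accessible to, and attracted, advanced undergraduates interested in data science. The emphasis is on mining data at very large scale, that is, datasets too large to fit in the main memory of an ordinary computer. The authors substantially reorganized and expanded the course material: they introduced the network-analysis course CS224W and added material to the original CS345A (later renumbered CS246). They also created a large-scale data-mining project course, CS341, so the book now contains the core content of three courses.

The central theme of the book is data mining, in particular the techniques and methodology needed at massive scale. The main content may include, but is not limited to, the following:

1. **Foundations of large-scale data processing**: how to design and implement effective algorithms and techniques for datasets that exceed memory limits, possibly covering distributed computing, data partitioning, and storage and access strategies.
2. **Network data analysis**: an in-depth look at how to analyze network data, such as social networks, Web link structure, and search-engine ranking models.
3. **Web mining**: practical cases showing how to extract valuable information from Internet data, such as user-behavior analysis, recommendation systems, and content mining.
4. **Data-mining projects in practice**: the book may include real large-scale data-mining project cases that show students and readers how to apply theory to real problems.
5. **Technology and trends**: with the arrival of the big-data era, the book may discuss the frontier technology of its time, such as the use of open-source tools like Hadoop and Spark for large-scale data processing.
6. **Privacy and ethics**: given the sensitivity of big data, the book may also touch on data-privacy protection, ethics, and the legal rules governing data use.
7. **Theory combined with practice**: the material is not limited to theory; it stresses combining theoretical knowledge with practical skill, helping readers understand how to handle large-scale data in real work.

"Mining of Massive Datasets" is a highly practical textbook. It is suitable for academic research, and it also gives data engineers, analysts, and students a window onto the key techniques and challenges of data mining at scale. By reading the book and working through its material, readers can acquire the skills needed for intelligent analysis of massive data.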

Chapter 1

Data Mining

In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover “Bonferroni’s Principle,” which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few useful ideas that are not data mining but are useful in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.

1.1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.

1.1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn’t in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.

Example 1.1: Suppose our data is a set of numbers. This data is much simpler than data that would be data-mined, but it will serve as an example. A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian. The mean and standard deviation of this Gaussian distribution completely characterize the distribution and would become the model of the data. □
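As a minimal sketch of the "formula" mentioned in Example 1.1, the standard maximum-likelihood estimates for a Gaussian can be computed directly. The function name `fit_gaussian` and the sample values are hypothetical, chosen only for illustration; the text does not say which formula the statistician would use.

```python
import math

def fit_gaussian(xs):
    """Return the maximum-likelihood mean and standard deviation of xs."""
    n = len(xs)
    mean = sum(xs) / n
    # The maximum-likelihood variance divides by n, not by n - 1.
    variance = sum((x - mean) ** 2 for x in xs) / n
    return mean, math.sqrt(variance)

# Hypothetical data: a small set of numbers standing in for "our data".
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_gaussian(data)
print(mean, std)  # prints 5.0 2.0
```

The pair `(mean, std)` is then the entire model: two numbers summarize the whole data set.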

1.1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning. Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

There are situations where using data in this way makes sense. The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data. For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike it. Thus, in answering the “Netflix challenge” to devise an algorithm that predicts the ratings of movies by users, based on a sample of their responses, machine-learning algorithms have proved quite successful. We shall discuss a simple form of this type of algorithm in Section 9.4.

On the other hand, machine learning has not proved successful in situations where we can describe the goals of the mining more directly. An interesting case in point is the attempt by WhizBang! Labs to use machine learning to locate people’s resumes on the Web. It was not able to do better than algorithms designed by hand to look for some of the obvious words and phrases that appear in the typical resume. Since everyone who has looked at or written a resume has a pretty good idea of what resumes contain, there was no mystery about what makes a Web page a resume. Thus, there was no advantage to machine-learning over the direct design of an algorithm to discover resumes.

1.1.3 Computational Approaches to Modeling

More recently, computer scientists have looked at data mining as an algorithmic problem. In this case, the model of the data is simply the answer to a complex query about it. For instance, given the set of numbers of Example 1.1, we might compute their average and standard deviation. Note that these values might not be the parameters of the Gaussian that best fits the data, although they will almost certainly be very close if the size of the data is large.
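One way to see why the computed values can differ slightly from the best-fitting parameters is sketched below. It assumes, as an illustration only, that "standard deviation" means the usual sample standard deviation (dividing by n - 1), while the maximum-likelihood Gaussian divides by n; the text itself does not specify which is meant. The helper name `std_gap` and the sample sizes are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def std_gap(n):
    """Sample standard deviation minus maximum-likelihood standard deviation
    for n values drawn from a standard Gaussian."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    sample_std = statistics.stdev(xs)   # divides by n - 1
    mle_std = statistics.pstdev(xs)     # divides by n
    return sample_std - mle_std

# The gap is always positive, and it shrinks as the data grows.
for n in (10, 1000, 100000):
    print(n, std_gap(n))
```

Under this assumption the two estimates differ by a factor of sqrt(n / (n - 1)), which approaches 1 as n grows, consistent with the remark that the values will be very close for large data.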

There are many different approaches to modeling data. We have already mentioned the possibility of constructing a statistical process whereby the data could have been generated. Most other approaches to modeling can be described as either

1. Summarizing the data succinctly and approximately, or

Footnote (WhizBang! Labs): This startup attempted to use machine learning to mine large-scale data, and hired many of the top machine-learning people to do so. Unfortunately, it was not able to survive.


The cases clustered around some of the intersections of roads. These intersections were the locations of wells that had become contaminated; people who lived nearest these wells got sick, while people who lived nearer to wells that had not been contaminated did not get sick. Without the ability to cluster the data, the cause of Cholera would not have been discovered. □

1.1.5 Feature Extraction

The typical feature-based model looks for the most extreme examples of a phenomenon and represents the data by these examples. If you are familiar with Bayes nets, a branch of machine learning and a topic we do not cover in this book, you know how a complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections. Some of the important kinds of feature extraction from large-scale data that we shall study are:

1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items, as in the market-basket problem that we shall discuss in Chapter 6. We look for small sets of items that appear together in many baskets, and these “frequent itemsets” are the characterization of the data that we seek. The original application of this sort of mining was true market baskets: the sets of items, such as hamburger and ketchup, that people tend to buy together when checking out at the cash register of a store or supermarket.

2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. An example is treating customers at an on-line store like Amazon as the set of items they have bought. In order for Amazon to recommend something else they might like, Amazon can look for “similar” customers and recommend something many of these customers have bought. This process is called “collaborative filtering.” If customers were single-minded, that is, they bought only one kind of thing, then clustering customers might work. However, since customers tend to have interests in many different things, it is more useful to find, for each customer, a small number of other customers who are similar in their tastes, and represent the data by these connections. We discuss similarity in Chapter
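The two kinds of feature extraction listed above can be sketched on toy data. This is a brute-force illustration, not the algorithms the book develops for large data: the baskets, customers, and function names are hypothetical, and Jaccard similarity (shared elements divided by combined elements) is assumed here as one common way to measure "a relatively large fraction of their elements in common."

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each is the set of items in one checkout.
baskets = [
    {"hamburger", "ketchup", "buns"},
    {"hamburger", "ketchup"},
    {"hamburger", "ketchup", "soda"},
    {"soda", "chips"},
]

def frequent_pairs(baskets, support):
    """Return item pairs that appear together in at least `support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= support}

def jaccard(a, b):
    """Fraction of the combined elements of a and b that the two sets share."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical customers, each represented by the set of items bought.
customers = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen", "desk"},
    "carol": {"tent", "stove"},
}

def most_similar(name, customers):
    """Return the other customer whose item set is most similar to `name`'s."""
    others = (c for c in customers if c != name)
    return max(others, key=lambda c: jaccard(customers[name], customers[c]))

print(frequent_pairs(baskets, support=3))  # prints {('hamburger', 'ketchup'): 3}
print(most_similar("alice", customers))    # prints bob
```

Counting every pair in every basket is only feasible for small data; it is shown here to make the definitions concrete.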

1.2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data. This section is a discussion of the problem, including “Bonferroni’s Principle,” a warning against overzealous use of data mining.


1.2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity. This idea naturally caused great concern among privacy advocates, and the project, called TIA, or Total Information Awareness, was eventually killed by Congress, although it is unclear whether the project in fact exists under another name. It is not the purpose of this book to discuss the difficult issue of the privacy-security tradeoff. However, the prospect of TIA or a system like it does raise technical questions about its feasibility and the realism of its assumptions.

The concern raised by many is that if you look at so much data, and you try to find within it activities that look like terrorist behavior, are you not going to find many innocent activities – or even illicit activities that are not terrorism – that will result in visits from the police and maybe worse than just a visit? The answer is that it all depends on how narrowly you define the activities that you look for. Statisticians have seen this problem in many guises and have a theory, which we introduce in the next section.

1.2.2 Bonferroni’s Principle

Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur, even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. These occurrences are “bogus,” in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren’t. A theorem of statistics, known as the Bonferroni correction, gives a statistically sound way to avoid most of these bogus positive responses to a search through the data.

Without going into the statistical details, we offer an informal version, Bonferroni’s principle, that helps us avoid treating random occurrences as if they were real. Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for. This observation is the informal statement of Bonferroni’s principle.

In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time, Bonferroni’s principle says that we may only detect terrorists by looking for events that are so rare that they are unlikely to occur in random data. We shall give an extended example in the next section.
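The calculation behind Bonferroni’s principle can be sketched in a few lines. All numbers and names below are hypothetical, chosen only for illustration, and the threshold `factor` for "significantly larger" is an assumption, since the text gives no numeric cutoff. The last function shows the standard Bonferroni correction, which divides the significance level by the number of tests performed.

```python
def expected_bogus(num_candidates, p_by_chance):
    """Expected number of candidates that look like a hit in purely random data."""
    return num_candidates * p_by_chance

def mostly_bogus(num_candidates, p_by_chance, expected_real, factor=10.0):
    """True if chance hits outnumber the hoped-for real hits by at least `factor`.

    `factor` is a hypothetical stand-in for "significantly larger".
    """
    return expected_bogus(num_candidates, p_by_chance) >= factor * expected_real

def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / num_tests

# Hypothetical scenario: a billion candidate events, each with a one-in-a-million
# chance of looking suspicious at random, and about 10 real instances hoped for.
candidates = 1_000_000_000
p_by_chance = 1e-6
expected_real = 10

print(expected_bogus(candidates, p_by_chance))               # about 1000
print(mostly_bogus(candidates, p_by_chance, expected_real))  # prints True
print(bonferroni_threshold(0.05, 1000))
```

With roughly a thousand chance hits against ten real ones, nearly everything the search returns would be a statistical artifact, which is the situation the principle warns about. Narrowing the definition of the event lowers `p_by_chance` and with it the expected number of bogus hits.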
