Mahout in Action: Exploring Recommendation, Clustering, and Classification

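"Mahout in Action is a hands-on book focused on the Apache Mahout framework, suited to readers interested in machine learning and big-data processing. It explains, in accessible terms, how to use Mahout in a Hadoop environment to implement recommender systems, clustering, and classification."

Apache Mahout is an open-source machine learning library built on Hadoop. It provides a range of algorithms for machine learning tasks on large data sets. The book "Mahout in Action" is divided into three main parts, covering recommender systems, clustering, and classification.

Part 1: Recommender systems
1. Introducing recommender systems: explains the basic concepts and working principles of recommender systems and their value in e-commerce, media recommendation, and similar fields.
2. How recommenders work: discusses the core components of a recommender system, such as user and item similarity computation, and how personalized recommendations are generated from those similarities.
3. Representing data: describes how to convert user behavior and preference data into a format suitable for machine learning.
4. Making recommendations: covers Mahout's recommendation algorithms in detail, such as user-based and item-based collaborative filtering.
5. Taking recommenders to production: covers strategies and challenges in deploying a recommender system to a production environment.

Part 2: Clustering
1. Introduction to clustering: lays out the basic goals and use cases of clustering, such as market segmentation and document categorization.
2. Representing data: discusses how to handle different kinds of data in clustering, such as preprocessing numeric and text data.
3. Clustering algorithms in Mahout: introduces common clustering algorithms such as K-means, Fuzzy K-means, and Canopy Clustering.
4. Evaluating clustering quality: explains how to measure how good a clustering is, for example with the silhouette coefficient and the Calinski-Harabasz index.
5. Taking clustering to production: discusses issues to consider when running clustering algorithms in real environments, such as performance optimization and stability of results.
6. Real-world applications of clustering: presents concrete real-world cases of clustering techniques.

Part 3: Classification
1. Introduction to classification: outlines the basic concepts of classification, including supervised learning and the importance of feature selection.
2. The power of the naive classifier: emphasizes the simplicity and effectiveness of the naive Bayes classifier.
3. Multi-class classification: explores strategies for multi-class problems, such as one-vs-all and decision trees.
4. Classifier evaluation: introduces methods for evaluating classification model performance, such as cross-validation and the confusion matrix.
5. Tuning a classifier for better accuracy: discusses parameter tuning and feature engineering to improve prediction accuracy.

With this book, readers can learn the fundamentals of Mahout and also how to implement and optimize these machine learning algorithms in a distributed Hadoop environment to solve real problems. For anyone who wants to practice machine learning in a big-data setting, "Mahout in Action" is a valuable reference.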

©Manning Publications Co. Please post comments or corrections to the Author Online forum:

http://www.manning-sandbox.com/forum.jspa?forumID=623

Figure 2.1 Relationships between users 1 to 5 and items 101 to 107. Dashed lines represent associations that seem negative: the user does not seem to like the item much, but expresses a relationship to the item.

2.2.2 Creating a Recommender

So what book might we recommend to user 1? Not 101, 102 or 103: he already knows about these books, apparently, and recommendation is about discovering new things. Intuition suggests that because users 4 and 5 seem similar to 1, we should recommend something that user 4 or user 5 likes. That leaves 104, 105 and 106 as possible recommendations. On the whole, 104 seems to be the most liked of these possibilities, judging by the preference values of 4.5 and 4.0 for item 104. Now, run the following code:

Listing 2.2 A simple user-based recommender program with Mahout

package mia.recommender.ch02;

import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;
import java.io.*;
import java.util.*;

class RecommenderIntro {

  public static void main(String[] args) throws Exception {

    DataModel model = new FileDataModel(new File("intro.csv"));   // A

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(
        model, neighborhood, similarity);                         // B

    List<RecommendedItem> recommendations =
        recommender.recommend(1, 1);                              // C

    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}

A Load the data file

B Create the recommender engine

C For user 1, recommend 1 item

Licensed to nancy chen <amigo4u2009@gmail.com>


For brevity, in the examples throughout the next several chapters, we will omit the imports, class declaration, and method declaration, and instead repeat only the program statements themselves. To help visualize the relationship between these basic components, see figure 2.2. Not all Mahout-based recommenders will look like this; some will employ different components with different relationships. But this gives a sense of what’s going on in our example.

Figure 2.2 Simplified illustration of component interaction in a Mahout user-based recommender

While we will discuss each of these components in much more detail in the next two chapters, we can summarize the role of each component now. A DataModel implementation stores and provides access to all the preference, user and item data needed in the computation. A UserSimilarity implementation provides some notion of how similar two users are; this could be based on one of many possible metrics or calculations. A UserNeighborhood implementation defines a notion of a group of users that are most similar to a given user. Finally, a Recommender implementation pulls all these components together to recommend items to users and to provide related functionality.
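To make the idea of a user similarity metric concrete, here is a minimal, self-contained sketch of the Pearson correlation between two users' preference values for the items both have rated. It is an illustration only, not Mahout's PearsonCorrelationSimilarity; the class name PearsonSketch and the sample ratings are made up for this example.

```java
// Hypothetical sketch: Pearson correlation over co-rated items.
// Not Mahout's implementation; names and data are illustrative only.
class PearsonSketch {

    // x[i] and y[i] are the two users' preference values for the same item i.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0.0, sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n;
        double meanY = sumY / n;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        double denom = Math.sqrt(varX) * Math.sqrt(varY);
        // The correlation is undefined if either user rates every item identically.
        return denom == 0.0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        double[] userA = {5.0, 3.0, 2.5};  // made-up ratings
        double[] userB = {4.0, 2.0, 1.5};  // same pattern as userA, shifted down by 1
        double[] userC = {1.0, 3.0, 5.0};  // opposite trend to userA
        System.out.println("A vs B: " + pearson(userA, userB));
        System.out.println("A vs C: " + pearson(userA, userC));
    }
}
```

Values near 1 indicate users whose ratings rise and fall together, and values near -1 indicate opposite tastes. A neighborhood such as the nearest-2 neighborhood in the listing would then keep the users with the highest similarity.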

2.2.3 Analyzing the output

Compile and run this using your favorite IDE. The output of running the program in your terminal or IDE should be:

RecommendedItem[item:104, value:4.257081]

We asked for one top recommendation, and got one. The recommender engine recommended book 104 to user 1. Further, it says that the recommender engine did so because it estimated user 1’s preference for book 104 to be about 4.3, and that was the highest among all the items eligible for recommendations.

That’s not bad. We didn't get 107, which was also recommendable, but only associated with a user with different tastes. We picked 104 over 106, and this makes sense when you note that 104 is a bit more highly rated overall. Further, we got a reasonable estimate of how much user 1 likes item 104: something between the 4.0 and 4.5 that users 4 and 5 expressed.
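One plausible way to arrive at an estimate that falls between the neighbors' ratings is a similarity-weighted average of those ratings. The sketch below illustrates that idea under stated assumptions: the similarity weights (1.0 and 0.94) are made up, the pairing of each rating with a neighbor is arbitrary, and WeightedEstimateSketch is a hypothetical name. It is not presented as the exact computation inside GenericUserBasedRecommender.

```java
// Hypothetical sketch: estimate a preference as a similarity-weighted
// average of the neighbors' preferences for the same item.
class WeightedEstimateSketch {

    // similarities[i] is the (assumed positive) similarity to neighbor i,
    // preferences[i] is that neighbor's preference for the item.
    static double estimate(double[] similarities, double[] preferences) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            weightedSum += similarities[i] * preferences[i];
            totalWeight += similarities[i];
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        double[] similarities = {1.0, 0.94};  // made-up weights for the two neighbors
        double[] preferences = {4.5, 4.0};    // the two neighbors' ratings of item 104
        // The result lies between 4.0 and 4.5, pulled toward the more similar neighbor.
        System.out.println(estimate(similarities, preferences));
    }
}
```

Because the weights are positive, the estimate can never fall outside the range of the neighbors' ratings, which is consistent with the value of about 4.3 reported above.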

The right answer isn't obvious from looking at the data, but the recommender engine made some decent sense of it and returned a defensible answer. If you got a pleasant tingle out of seeing this simple program give a useful and non-obvious result from a small pile of data, then the world of machine learning is for you!

For clear, small data sets, producing recommendations is as trivial as it appears above. In real life, data sets are huge, and they are noisy. For example, imagine a popular news site recommending news articles to readers. Preferences are inferred from article clicks. But many of these “preferences” may be bogus: maybe a reader clicked an article but didn't like it, or clicked the wrong story. Perhaps many of the clicks occurred while the reader was not logged in, so they can’t be associated with a user. And imagine the size of the data set: perhaps billions of clicks in a month.

Producing the right recommendations from this data, and producing them quickly, is not trivial. Later we will present, by way of case studies, the tools Mahout provides to attack a range of such problems. The case studies will show how standard approaches can produce poor recommendations or take a great deal of memory and CPU time, and how to configure and customize Mahout to improve performance.

2.3 Evaluating a Recommender

A recommender engine is a tool, a means to answer the question, “what are the best recommendations for a user?” Before investigating the answers, we should investigate the question. What exactly is a good recommendation? And how will we know when a recommender is producing them? The remainder of this chapter pauses to explore evaluation of a recommender, because this is a tool that will be useful when we begin looking at specific recommender systems.

The best possible recommender would be a sort of psychic that could somehow know, before you do, exactly how much you would like every possible item that you've not yet seen or expressed any preference for. A recommender that could predict all your preferences exactly would merely present all other items ranked by your future preference and be done. These would be the best possible recommendations.

And indeed most recommender engines operate by trying to do just this, estimating ratings for some or all other items. So, one way of evaluating a recommender's recommendations is to evaluate the quality of its estimated preference values, that is, to evaluate how closely the estimated preferences match the actual preferences.

2.3.1 Training data and scoring

Those “actual preferences” don't exist though. Nobody knows for sure how you'll like some new item in the future (including you). This can be simulated for a recommender engine by setting aside a small part of the real data set as test data. These test preferences are not present in the training data fed into the recommender engine under evaluation, which is all data except the test data. Instead, the recommender is asked to estimate preferences for the missing test data, and the estimates are compared to the actual values.
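The hold-out idea described above can be sketched in a few lines: each known preference is randomly assigned either to the training set or to the test set. This is a simplified illustration, not the splitting logic used inside Mahout's evaluators; the class HoldOutSplitSketch, its Preference type, and the fixed seed are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a random training/test split of preference data.
class HoldOutSplitSketch {

    // A single (user, item, value) preference; illustrative type only.
    static class Preference {
        final long userID;
        final long itemID;
        final double value;

        Preference(long userID, long itemID, double value) {
            this.userID = userID;
            this.itemID = itemID;
            this.value = value;
        }
    }

    // Returns two lists: index 0 is the training data, index 1 is the test data.
    // Each preference goes to training with probability trainingFraction.
    static List<List<Preference>> split(List<Preference> all,
                                        double trainingFraction,
                                        long seed) {
        Random random = new Random(seed);
        List<Preference> training = new ArrayList<>();
        List<Preference> test = new ArrayList<>();
        for (Preference p : all) {
            if (random.nextDouble() < trainingFraction) {
                training.add(p);
            } else {
                test.add(p);
            }
        }
        List<List<Preference>> result = new ArrayList<>();
        result.add(training);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        List<Preference> all = new ArrayList<>();
        all.add(new Preference(1, 101, 5.0));  // made-up sample data
        all.add(new Preference(1, 102, 3.0));
        all.add(new Preference(2, 101, 2.0));
        all.add(new Preference(2, 103, 4.5));
        List<List<Preference>> parts = split(all, 0.7, 42L);
        System.out.println("training: " + parts.get(0).size()
            + ", test: " + parts.get(1).size());
    }
}
```

The recommender under evaluation would be built from the training list only, then asked to estimate the value of every preference in the test list.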

From there, it is fairly simple to produce a kind of “score” for the recommender. For example, we could compute the average absolute difference between estimated and actual preference. With a score of this type, lower is better, because that would mean the estimates differed from the actual preference values by less. A score of 0.0 would mean perfect estimation: no difference at all between estimates and actual values. Sometimes the root-mean-square of the differences is used: this is the square root of the average of the squares of the differences between actual and estimated preference values. Again, lower is better.


              Item 1    Item 2    Item 3
Actual         3.0       5.0       4.0
Estimate       3.5       2.0       5.0
Difference     0.5       3.0       1.0

Average Difference = (0.5 + 3.0 + 1.0) / 3 = 1.5
Root Mean Square   = √((0.5² + 3.0² + 1.0²) / 3) = 1.8484

Table 2.1 An illustration of the average difference, and root mean square calculation

The table above shows the difference between a set of actual and estimated preferences, and how they are translated into scores. Root-mean-square more heavily penalizes estimates that are way off, as with item 2 here, and that is considered desirable by some. For example, an estimate that’s off by 2 whole stars is probably more than twice as “bad” as one off by just 1 star. Because the simple average of differences is perhaps more intuitive and easier to understand, we’ll use it in upcoming examples.
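The two scores in table 2.1 can be reproduced with a few lines of plain Java. This is a standalone sketch for checking the arithmetic, not Mahout's evaluator code; the class and method names are made up for the example, and the input arrays are the actual and estimated values from the table.

```java
// Hypothetical sketch: the two scores from table 2.1 computed by hand.
class ScoreSketch {

    // Mean of |actual - estimate| over all items.
    static double averageAbsoluteDifference(double[] actual, double[] estimate) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - estimate[i]);
        }
        return sum / actual.length;
    }

    // Square root of the mean of squared differences.
    static double rootMeanSquare(double[] actual, double[] estimate) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - estimate[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {3.0, 5.0, 4.0};    // "Actual" row of table 2.1
        double[] estimate = {3.5, 2.0, 5.0};  // "Estimate" row of table 2.1
        System.out.println(averageAbsoluteDifference(actual, estimate)); // 1.5
        System.out.println(rootMeanSquare(actual, estimate));            // about 1.8484
    }
}
```

The large miss on item 2 contributes 3.0 of the 4.5 total to the average difference, but 9.0 of the 10.25 total to the sum of squares, which is why the root-mean-square score is the higher of the two.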

2.3.2 Running RecommenderEvaluator

Let's revisit the example code and instead evaluate the simple recommender we created, on our simple data set:

Listing 2.3 Configuring and running an evaluation of a Recommender

RandomUtils.useTestSeed();                                        // A
DataModel model = new FileDataModel(new File("intro.csv"));

RecommenderEvaluator evaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();

RecommenderBuilder builder = new RecommenderBuilder() {           // B
  @Override
  public Recommender buildRecommender(DataModel model)
      throws TasteException {
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);
    return
        new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};

double score = evaluator.evaluate(
    builder, null, model, 0.7, 1.0);                              // C
System.out.println(score);

A Used only in examples for repeatable result

B Builds the same Recommender as above

C Use 70% of data to train; test with other 30%


Most of the action happens in evaluate(). Inside, the RecommenderEvaluator handles splitting the data into a training and test set, builds a new training DataModel and Recommender to test, and compares its estimated preferences to the actual test data.

Note that we don’t pass a Recommender to this method. This is because, inside, the method will need to build a Recommender around a newly created training DataModel. So we must provide an object that can build a Recommender from a DataModel: a RecommenderBuilder. Here, it builds the same implementation that we tried earlier in this chapter.

2.3.3 Assessing the result

This program prints the result of the evaluation: a score indicating how well the Recommender performed. In this case you should simply see: 1.0. Even though a lot of randomness is used inside the evaluator to choose test data, the result should be consistent because of the call to RandomUtils.useTestSeed(), which forces the same random choices each time. This is only used in such examples, and unit tests, to guarantee repeatable results. Don’t use it in your real code.

What this value means depends on the implementation we used; here, that is AverageAbsoluteDifferenceRecommenderEvaluator. A result of 1.0 from this implementation means that, on average, the recommender estimates a preference that deviates from the actual preference by 1.0.

A value of 1.0 is not great, on a scale of 1 to 5, but there is so little data here to begin with. Without the fixed test seed, your results may differ from run to run, because the data set is split randomly and hence the training and test sets may differ with each run.

This technique can be applied to any Recommender and DataModel. To use root-mean-square scoring, replace AverageAbsoluteDifferenceRecommenderEvaluator with the implementation RMSRecommenderEvaluator.

Also, the null parameter to evaluate() could instead be an instance of DataModelBuilder, which can be used to control how the training DataModel is created from training data. Normally the default is fine; it may not be if you are using a specialized implementation of DataModel in your deployment. A DataModelBuilder is how you would inject it into the evaluation process.

The 1.0 parameter at the end controls how much of the overall input data is used. Here it means “100%.” This can be used to produce a quicker, if less accurate, evaluation by using only a little of a potentially huge data set. For example, 0.1 would mean 10% of the data is used and 90% is ignored. This is quite useful when rapidly testing small changes to a Recommender.

2.4 Evaluating precision and recall

We could also take a broader view of the recommender problem: we don't have to estimate preference values to produce recommendations. It’s not always necessary to present estimated preference values to users. In many cases, all we want is an ordered list of recommendations, from best to worst. In fact, in some cases we don't care much about the exact ordering of the list; a set of a few good recommendations is fine.

Taking this more general view, we could also apply classic information retrieval metrics to evaluate recommenders: precision and recall. These terms are typically applied to things like search engines, which return some set of best results for a query out of many possible results.
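As a rough illustration of the two metrics in their standard information retrieval sense: precision is the fraction of returned results that are relevant, and recall is the fraction of all relevant results that were returned. The sketch below computes both for a list of recommended item IDs against a set of items assumed to be "good" for the user. The class name, the item IDs, and the choice of which items count as relevant are all made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: precision and recall of a recommendation list.
class PrecisionRecallSketch {

    // Number of recommended items that are also in the relevant set.
    static int hits(List<Long> recommended, Set<Long> relevant) {
        int count = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) {
                count++;
            }
        }
        return count;
    }

    // Fraction of recommended items that are relevant.
    static double precision(List<Long> recommended, Set<Long> relevant) {
        if (recommended.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / recommended.size();
    }

    // Fraction of relevant items that were recommended.
    static double recall(List<Long> recommended, Set<Long> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        List<Long> recommended = Arrays.asList(104L, 106L, 107L);              // made-up top-3 list
        Set<Long> relevant = new HashSet<>(Arrays.asList(104L, 105L, 106L, 108L)); // made-up "good" items
        System.out.println("precision: " + precision(recommended, relevant));  // 2 of 3 recommended are relevant
        System.out.println("recall: " + recall(recommended, relevant));        // 2 of 4 relevant were recommended
    }
}
```

Neither metric looks at estimated preference values or at the order of the list, which matches the broader view taken in this section.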
