    builder, null, model, 0.7, 1.0);   C
System.out.println(score);

A Used only in examples to get repeatable results
B Builds the same Recommender as above
C Use 70% of the data to train; test with the other 30%
Most of the action happens in evaluate(). Inside, the RecommenderEvaluator handles splitting the data into a training set and a test set, builds a new training DataModel and Recommender to test, and compares its estimated preferences to the actual test data. Note that no Recommender is passed to this method. That's because, internally, the method needs to build a Recommender around a newly created training DataModel. So the caller must instead provide an object that can build a Recommender from a DataModel: a RecommenderBuilder. Here, it builds the same implementation that was tried earlier in this chapter.
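To make that concrete, here is a minimal sketch of how such a builder and the evaluation call fit together. It assumes model is the DataModel built earlier in the chapter and rebuilds the same user-based Recommender tried there; adjust the similarity and neighborhood size to match your own setup.

RecommenderBuilder builder = new RecommenderBuilder() {
  @Override
  public Recommender buildRecommender(DataModel model) throws TasteException {
    // Rebuild the same user-based recommender used earlier in the chapter
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);
    return new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};

RecommenderEvaluator evaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();
double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);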
2.3.3 Assessing the result
This program prints the result of the evaluation: a score indicating how well the Recommender performed. In this case you should simply see: 1.0. Even though a lot of randomness is used inside the evaluator to choose test data, the result should be consistent because of the call to RandomUtils.useTestSeed(), which forces the same random choices each time. This is used only in such examples, and in unit tests, to guarantee repeatable results. Don't use it in your real code.
What this value means depends on the evaluator implementation used, here AverageAbsoluteDifferenceRecommenderEvaluator. A result of 1.0 from this implementation means that, on average, the recommender's estimated preference deviates from the actual preference by 1.0.

A value of 1.0 isn't great on a scale of 1 to 5, but there is very little data here to begin with. Your results may differ because the data set is split randomly, and hence the training and test sets may differ with each run.
This technique can be applied to any Recommender and DataModel. To use root-mean-square scoring instead, replace AverageAbsoluteDifferenceRecommenderEvaluator with the implementation RMSRecommenderEvaluator.
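The swap is a one-line change; assuming the same builder and model as before, the rest of the evaluate() call stays as it is:

// Root-mean-square scoring penalizes large estimation errors more heavily
RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
double rmse = evaluator.evaluate(builder, null, model, 0.7, 1.0);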
Also, the null parameter to evaluate() could instead be an instance of DataModelBuilder, which can be used to control how the training DataModel is created from the training data. Normally the default is fine; it may not be if you are using a specialized implementation of DataModel in your deployment. A DataModelBuilder is how you would inject it into the evaluation process.
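Such a builder is small. A sketch might look like the following, where MyDataModel is a hypothetical stand-in for your specialized DataModel implementation:

DataModelBuilder modelBuilder = new DataModelBuilder() {
  @Override
  public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
    // Wrap the training preferences in the specialized DataModel;
    // by default the evaluator would use a GenericDataModel here
    return new MyDataModel(trainingData);  // hypothetical implementation
  }
};

double score = evaluator.evaluate(builder, modelBuilder, model, 0.7, 1.0);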
The 1.0 parameter at the end controls how much of the overall input data is used. Here it means 100%. This can be used to produce a quicker, if less accurate, evaluation by using only a little of a potentially huge data set. For example, 0.1 would mean 10% of the data is used and 90% is ignored. This is quite useful when rapidly testing small changes to a Recommender.
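For instance, a quick check on a tenth of the data only changes that last argument; the 70/30 training/test split within the sampled data stays the same:

// Use only 10% of the data for a faster, rougher evaluation
double quickScore = evaluator.evaluate(builder, null, model, 0.7, 0.1);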
2.4 Evaluating precision and recall
We could also take a broader view of the recommender problem: it's not strictly necessary to estimate preference values in order to produce recommendations. It's not always essential to present estimated preference values to users. In many cases, just an ordered list of recommendations, from best to worst, is sufficient. In fact, in some cases the exact ordering of the list doesn't matter much; a set of a few good recommendations is fine.
Taking this more general view, we could also apply classic information retrieval metrics to evaluate recommenders: precision and recall. These terms are typically applied to things like search engines, which return some set of best results for a query out of many possible results.

A search engine should not return irrelevant results among its top results, although it should strive to return as many relevant results as possible. "Precision" is the proportion of top results that are relevant, for some definition of relevant. "Precision at 10" would be this proportion judged from the top 10 results. "Recall" is the proportion of all relevant results included in the top results. See figure 2.3 for a visualization of these ideas.
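Stated as simple ratios over the top N results, for whatever definition of relevance you choose:

precision at N = (relevant results among the top N) / N
recall at N = (relevant results among the top N) / (all relevant results)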