"MahoutinAction完整版本"
MahoutinAction完整版本是 Apache Mahout 的一个详细指南,涵盖了推荐算法、数据挖掘和个性化推荐等方面的知识。
**推荐算法**
推荐算法是 Mahout 的核心组件之一,MahoutinAction完整版本中详细介绍了推荐算法的基本概念和实现方式。推荐算法的主要目标是根据用户的行为和喜好,推荐用户可能感兴趣的物品或服务。Mahout 提供了多种推荐算法,包括基于用户的协同过滤、基于项目的协同过滤、矩阵分解等。
**数据挖掘**
数据挖掘是指从大量数据中提取有价值的信息和模式的过程。MahoutinAction完整版本中介绍了数据挖掘的基本概念和技术,包括数据预处理、数据转换、数据挖掘算法等。
**个性化推荐**
个性化推荐是指根据用户的行为和喜好,提供个性化的推荐结果。MahoutinAction完整版本中详细介绍了个性化推荐的概念和实现方式,包括基于用户的协同过滤、基于项目的协同过滤、矩阵分解等。
**Mahout clustering**
Mahout clustering 是 Mahout 的一个重要组件,用于对数据进行聚类分析。MahoutinAction完整版本中详细介绍了 Mahout clustering 的基本概念和实现方式,包括 K-均值聚类、 Hierarchical 聚类、 DBSCAN 聚类等。
**Mahout recommendations**
Mahout recommendations 是 Mahout 的一个核心组件,用于提供个性化推荐。MahoutinAction完整版本中详细介绍了 Mahout recommendations 的基本概念和实现方式,包括基于用户的协同过滤、基于项目的协同过滤、矩阵分解等。
**数据表示**
数据表示是指将数据转换为机器学习算法可以使用的格式。MahoutinAction完整版本中介绍了数据表示的基本概念和技术,包括向量空间模型、矩阵分解等。
**分布式计算**
分布式计算是指将计算任务分布到多个节点上,以提高计算速度和可扩展性。MahoutinAction完整版本中介绍了分布式计算的基本概念和技术,包括 Hadoop、MapReduce 等。
MahoutinAction完整版本是一个详细的指南,涵盖了 Mahout 的各个方面,包括推荐算法、数据挖掘、个性化推荐、Mahout clustering、Mahout recommendations、数据表示、分布式计算等。
Before discussing each of these components in more detail in the next two chapters, we can summarize the role of each component now. A DataModel implementation stores and provides access to all the preference, user, and item data needed in the computation. A UserSimilarity implementation provides some notion of how similar two users are; this could be based on one of many possible metrics or calculations. A UserNeighborhood implementation defines a notion of a group of users that are most similar to a given user. Finally, a Recommender implementation pulls all these components together to recommend items to users, and provides related functionality.
2.2.3 Analyzing the output
Compile and run this using your favorite IDE. The output of running the program in your terminal or IDE should be:
RecommendedItem[item:104, value:4.257081]
The request asked for one top recommendation, and got one. The recommender engine recommended
book 104 to user 1. Further, it says that the recommender engine did so because it estimated user 1’s
preference for book 104 to be about 4.3, and that was the highest among all the items eligible for
recommendations.
That’s not bad. Item 107 did not appear, even though it was also recommendable; it is only associated with a user whose tastes differ. The engine picked 104 over 106, and this makes sense after noting that 104 is a bit more highly rated overall. Further, the output contained a reasonable estimate of how much user 1 likes item 104 – something between the 4.0 and 4.5 that users 4 and 5 expressed.
The right answer isn't obvious from looking at the data, but the recommender engine made some
decent sense of it and returned a defensible answer. If you got a pleasant tingle out of seeing this simple
program give a useful and non-obvious result from a small pile of data, then the world of machine learning
is for you!
For clear, small data sets, producing recommendations is as trivial as it appears above. In real life,
data sets are huge, and they are noisy. For example, imagine a popular news site recommending news
articles to readers. Preferences are inferred from article clicks. But, many of these “preferences” may be
bogus – maybe a reader clicked an article but didn't like it, or, had clicked the wrong story. Perhaps many
of the clicks occurred while not logged in, so can’t be associated to a user. And, imagine the size of the
data set – perhaps billions of clicks in a month.
Producing the right recommendations from this data and producing them quickly are not trivial. Later
we will present the tools Mahout provides to attack a range of such problems by way of case studies. They
will show how standard approaches can produce poor recommendations or take a great deal of memory
and CPU time, and, how to configure and customize Mahout to improve performance.
2.3 Evaluating a Recommender
A recommender engine is a tool, a means to answer the question, “what are the best recommendations
for a user?” Before investigating the answers, it’s best to investigate the question. What exactly is a good
recommendation? And how does one know when a recommender is producing them? The remainder of
this chapter pauses to explore evaluation of a recommender, because this is a tool that will be useful
when looking at specific recommender systems.
The best possible recommender would be a sort of psychic that could somehow know, before you do,
exactly how much you would like every possible item that you've not yet seen or expressed any
preference for. A recommender that could predict all your preferences exactly would merely present all
other items ranked by your future preference and be done. These would be the best possible
recommendations.
And indeed most recommender engines operate by trying to do just this, estimating ratings for some
or all other items. So, one way of evaluating a recommender's recommendations is to evaluate the quality
of its estimated preference values – that is, evaluating how closely the estimated preferences match the
actual preferences.
2.3.1 Training data and scoring
Those “actual preferences” don't exist though. Nobody knows for sure how you'll like some new item in
the future (including you). This can be simulated to a recommender engine by setting aside a small part of
the real data set as test data. These test preferences are not present in the training data fed into a
recommender engine under evaluation -- which is all data except the test data. Instead, the recommender
is asked to estimate preference for the missing test data, and estimates are compared to the actual
values.
From there, it is fairly simple to produce a kind of “score” for the recommender. For example it’s
possible to compute the average difference between estimate and actual preference. With a score of this
type, lower is better, because that would mean the estimates differed from the actual preference values
by less. 0.0 would mean perfect estimation -- no difference at all between estimates and actual values.
Sometimes the root-mean-square of the differences is used: this is the square root of the average of
the squares of the differences between actual and estimated preference values. Again, lower is better.
              Item 1   Item 2   Item 3
Actual          3.0      5.0      4.0
Estimate        3.5      2.0      5.0
Difference      0.5      3.0      1.0

Average difference = (0.5 + 3.0 + 1.0) / 3 = 1.5
Root mean square   = √((0.5² + 3.0² + 1.0²) / 3) ≈ 1.8484

Table 2.1 An illustration of the average difference and root-mean-square calculations
Above, the table shows the difference between a set of actual and estimated preferences, and how
they are translated into scores. Root-mean-square more heavily penalizes estimates that are way off, as
with item 2 here, and that is considered desirable by some. For example, an estimate that’s off by 2 whole
stars is probably more than twice as “bad” as one off by just 1 star. Because the simple average of
differences is perhaps more intuitive and easy to understand, upcoming examples will use it.
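As a quick sanity check of the arithmetic in table 2.1, here is a small standalone sketch (not Mahout code) that reproduces both scores for those actual and estimated values:

// A standalone sketch (not Mahout code) reproducing the scores in table 2.1.
double[] actual   = {3.0, 5.0, 4.0};
double[] estimate = {3.5, 2.0, 5.0};

double sumAbs = 0.0;
double sumSq  = 0.0;
for (int i = 0; i < actual.length; i++) {
  double diff = Math.abs(actual[i] - estimate[i]);
  sumAbs += diff;          // for the average absolute difference
  sumSq  += diff * diff;   // for the root-mean-square score
}
System.out.println(sumAbs / actual.length);            // 1.5
System.out.println(Math.sqrt(sumSq / actual.length));  // ≈ 1.8484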
2.3.2 Running RecommenderEvaluator
Let's revisit the example code, and instead evaluate the simple recommender on this simple data set:

Listing 2.3 Configuring and running an evaluation of a Recommender

RandomUtils.useTestSeed();                                          #A
DataModel model = new FileDataModel(new File("intro.csv"));
RecommenderEvaluator evaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder builder = new RecommenderBuilder() {             #B
  @Override
  public Recommender buildRecommender(DataModel model)
      throws TasteException {
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);
    return new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};
double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);  #C
System.out.println(score);

#A Used only in examples to get a repeatable result
#B Builds the same Recommender as above
#C Use 70% of the data to train; test with the other 30%
Most of the action happens in evaluate(). Inside, the RecommenderEvaluator handles splitting the data into a training and test set, builds a new training DataModel and Recommender to test, and compares its estimated preferences to the actual test data.
Note that there is no Recommender passed to this method. This is because, inside, the method will need to build a Recommender around a newly created training DataModel. So the caller must provide an object that can build a Recommender from a DataModel – a RecommenderBuilder. Here, it builds the same implementation that was tried earlier in this chapter.
2.3.3 Assessing the result
This program prints the result of the evaluation: a score indicating how well the Recommender performed. In this case you should simply see: 1.0. Even though a lot of randomness is used inside the evaluator to choose test data, the result should be consistent because of the call to RandomUtils.useTestSeed(), which forces the same random choices each time. This is only used in such examples, and in unit tests, to guarantee repeatable results. Don’t use it in your real code.
What this value means depends on the implementation used – here, AverageAbsoluteDifferenceRecommenderEvaluator. A result of 1.0 from this implementation means that, on average, the recommender's estimated preference deviates from the actual preference by 1.0.
A value of 1.0 is not great on a scale of 1 to 5, but there is very little data here to begin with. Your results may differ because the data set is split randomly, and hence the training and test sets may differ with each run.
This technique can be applied to any Recommender and DataModel. To use root-mean-square scoring, replace AverageAbsoluteDifferenceRecommenderEvaluator with the implementation RMSRecommenderEvaluator.
Also, the null parameter to evaluate() could instead be an instance of DataModelBuilder, which can be used to control how the training DataModel is created from training data. Normally the default is fine; it may not be if you are using a specialized implementation of DataModel in your deployment. A DataModelBuilder is how you would inject it into the evaluation process.
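For illustration, here is a minimal sketch of such a DataModelBuilder, assuming the standard in-memory GenericDataModel is what you want the evaluator to build from the training preferences (reusing evaluator, builder, and model from listing 2.3):

// A minimal sketch of a DataModelBuilder: it simply wraps the training
// preferences in the standard in-memory GenericDataModel, which is
// essentially what the framework does by default when null is passed.
DataModelBuilder modelBuilder = new DataModelBuilder() {
  @Override
  public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
    return new GenericDataModel(trainingData);
  }
};
double score = evaluator.evaluate(builder, modelBuilder, model, 0.7, 1.0);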
The 1.0 parameter at the end controls how much of the overall input data is used. Here it means “100%.” This can be used to produce a quicker, if less accurate, evaluation by using only a little of a potentially huge data set. For example, 0.1 would mean 10% of the data is used and 90% is ignored. This is quite useful when rapidly testing small changes to a Recommender.
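Putting these two variations together, a quick root-mean-square evaluation over only 10% of the data might look like the following sketch (again reusing model and builder from listing 2.3):

// A sketch combining the two variations above: root-mean-square scoring
// over only 10% of the data, reusing model and builder from listing 2.3.
RecommenderEvaluator rmsEvaluator = new RMSRecommenderEvaluator();
double rmsScore = rmsEvaluator.evaluate(builder, null, model, 0.7, 0.1);
System.out.println(rmsScore);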
2.4 Evaluating precision and recall
We could also take a broader view of the recommender problem: it’s not strictly necessary to estimate
preference values in order to produce recommendations. It’s not always essential to present estimated
preference values to users. In many cases, just an ordered list of recommendations, from best to worst, is
sufficient. In fact, in some cases the exact ordering of the list doesn’t matter much – a set of a few good
recommendations is fine.
Taking this more general view, we could also apply classic information retrieval metrics to evaluate
recommenders: precision and recall. These terms are typically applied to things like search engines, which
return some set of best results for a query out of many possible results.
A search engine should not return irrelevant results in the top results, although it should strive to
return as many relevant results as possible. “Precision” is the proportion of top results that are relevant,
for some definition of relevant. “Precision at 10” would be this proportion judged from the top 10 results.
“Recall” is the proportion of all relevant results included in the top results. See figure 2.3 for a
visualization of these ideas.
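For example, if a query returns 10 results, 4 of which are relevant, while 20 relevant documents exist in total, then precision at 10 is 4 / 10 = 0.4 and recall at 10 is 4 / 20 = 0.2.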
Figure 2.3 An illustration of precision and recall in the context of search results
These terms can easily be adapted to recommenders: precision is the proportion of top
recommendations that are good recommendations, and recall is the proportion of good recommendations
that appear in top recommendations. The next section will define “good”.
2.4.1 Running RecommenderIRStatsEvaluator
Again, Mahout provides a fairly simple way to compute these values for a Recommender:

Listing 2.4 Configuring and running a precision and recall evaluation

RandomUtils.useTestSeed();
DataModel model = new FileDataModel(new File("intro.csv"));
RecommenderIRStatsEvaluator evaluator =
    new GenericRecommenderIRStatsEvaluator();
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
  @Override
  public Recommender buildRecommender(DataModel model)
      throws TasteException {
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);
    return new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};
IRStatistics stats = evaluator.evaluate(
    recommenderBuilder, null, model, null, 2,
    GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
    1.0);                                                           #A
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());

#A Evaluate precision and recall at 2
Without the call to RandomUtils.useTestSeed(), the result you see would vary significantly due to the random selection of training and test data, and because the data set is so small here. But with the call, the result ought to be:

0.75
1.0

Precision at 2 is 0.75; on average, about three-quarters of the recommendations were “good.” Recall at 2 is 1.0; all good recommendations are among those recommended.
But what exactly is a “good” recommendation here? The framework was asked to decide; it didn’t receive a definition. Intuitively, the most highly preferred items in the test set are the good recommendations, and the rest aren’t.
Listing 2.5 User 5’s preferences in the data set
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
Look at user 5 in this simple data set again. Let’s imagine the preferences for items 101, 102 and 103
were withheld as test data. The preference values for these are 4.0, 3.0 and 2.0. With these values
missing from the training data, a recommender engine ought to recommend 101 before 102, and 102
before 103, because this is the order in which user 5 prefers these items. But would it be a good idea to
recommend 103? It’s last on the list; user 5 doesn’t seem to like it much. Book 102 is just average. Book
101 looks reasonable as its preference value is well above average. Maybe 101 is a good
recommendation; 102 and 103 are valid, but not good recommendations.
And this is the thinking that the RecommenderIRStatsEvaluator employs. When not given an explicit threshold that divides good recommendations from bad, the framework will pick a threshold, per user, that is equal to the user's average preference value µ plus one standard deviation σ:

threshold = µ + σ
If you’ve forgotten your statistics, don’t worry. This takes items whose preference value is not merely
a little more than average (µ), but above average by a significant amount (σ). In practice this means that
about the 16% of items that are most highly preferred are considered “good” recommendations to make
back to the user. The other arguments to this method are similar to those discussed before and are more
fully documented in the project javadoc.
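To make the threshold concrete, here is a small standalone sketch (not Mahout code) of that per-user computation; whether the population or the sample standard deviation is used is an implementation detail, and the population form is assumed here:

// A standalone sketch (not Mahout code) of the per-user relevance threshold:
// the mean of the user's preference values plus one standard deviation.
static double relevanceThreshold(double[] prefs) {
  double mean = 0.0;
  for (double p : prefs) {
    mean += p;
  }
  mean /= prefs.length;
  double variance = 0.0;
  for (double p : prefs) {
    variance += (p - mean) * (p - mean);
  }
  variance /= prefs.length;            // population variance; a simplifying assumption
  return mean + Math.sqrt(variance);   // threshold = µ + σ
}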
2.4.2 Problems with precision and recall
The usefulness of precision and recall tests in the context of recommenders depends entirely on how well
a “good” recommendation can be defined. Above, the threshold was given, or defined by the framework. A
poor choice will hurt the usefulness of the resulting score.
There’s a subtler problem with these tests, though. Here, they necessarily pick the set of good
recommendations from among items for which the user has already expressed some preference. But the
best recommendations are of course not necessarily among those the user already knows about!
Imagine running such a test for a user who would absolutely love the little-known French cult film, “My
Brother The Armoire”. Let’s say it’s objectively a great recommendation for this user. But, the user has
never heard of this movie. If a recommender actually returned this film when recommending movies, it
would be penalized; the test framework can only pick good recommendations from among those in the
user’s set of preferences already.
The issue is further complicated when the preferences are ‘boolean’ and contain no preference value.
There is not even a notion of relative preference on which to select a subset of good items. The best the
test can do is to randomly select some preferred items as the good ones.
The test nevertheless has some use. The items a user prefers are a reasonable proxy for the best
recommendations for the user, but by no means a perfect one. In the case of boolean preference data,
only a precision-recall test is available anyway. It is worth understanding the test’s limitations in this
context.
2.5 Evaluating the GroupLens data set
With these tools in hand, we will be able to discuss not only the speed, but also the quality of
recommender engines. Although examples with large amounts of real data are still a couple of chapters away,
it’s already possible to quickly evaluate performance on a small data set.