Most of the action happens in evaluate(). Inside, the RecommenderEvaluator handles splitting the data into a training and test set, builds a new training DataModel and Recommender to test, and compares its estimated preferences to the actual test data.
Note that we don't pass a Recommender to this method. This is because, inside, the method will need to build a Recommender around a newly created training DataModel. So we must provide an object that can build a Recommender from a DataModel: a RecommenderBuilder. Here, it builds the same implementation that we tried in the first chapter.
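For reference, the following is a minimal sketch of what such a RecommenderBuilder and the surrounding evaluation call can look like, assuming the user-based recommender and intro.csv data file from the first chapter (import statements are omitted; the complete listing appears earlier in this chapter):

RandomUtils.useTestSeed();  // force repeatable random choices (examples and tests only)
DataModel model = new FileDataModel(new File("intro.csv"));
RecommenderEvaluator evaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();

RecommenderBuilder builder = new RecommenderBuilder() {
  public Recommender buildRecommender(DataModel model) throws TasteException {
    // Same user-based recommender used in chapter 1
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);
    return new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};

// Train on 70% of the data, test against the remaining 30%,
// and use 100% of the input data
double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);
System.out.println(score);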
2.3.3 Assessing the result
This program prints the result of the evaluation: a score indicating how well the Recommender performed. In this case you should simply see 1.0. Even though a lot of randomness is used inside the evaluator to choose test data, the result should be consistent because of the call to RandomUtils.useTestSeed(), which forces the same random choices each time. This is only used in examples like this one, and in unit tests, to guarantee repeatable results. Don't use it in your real code.
What this value means depends on the evaluator implementation we used: here, AverageAbsoluteDifferenceRecommenderEvaluator. A result of 1.0 from this implementation means that, on average, the recommender estimates a preference that deviates from the actual preference by 1.0.
A value of 1.0 is not great, on a scale of 1 to 5, but there is very little data here to begin with. Without the fixed seed from RandomUtils.useTestSeed(), your results would differ from run to run, because the data set is split randomly, and so the training and test sets would differ with each run.
This technique can be applied to any Recommender and DataModel. To use root-mean-square scoring instead, replace AverageAbsoluteDifferenceRecommenderEvaluator with the implementation RMSRecommenderEvaluator.
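Only the evaluator construction needs to change; as a sketch, reusing the builder and model variables from the earlier listing:

RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
// Same call as before, now scored with root-mean-square error
double score = evaluator.evaluate(builder, null, model, 0.7, 1.0);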
Also, the null parameter to evaluate() could instead be an instance of DataModelBuilder, which can be used to control how the training DataModel is created from the training data. Normally the default is fine; it may not be if you are using a specialized implementation of DataModel in your deployment. A DataModelBuilder is how you would inject it into the evaluation process.
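As a rough sketch, a DataModelBuilder that simply wraps the training preferences in a GenericDataModel, much like the default behavior, would look like the following; a specialized deployment would return its own DataModel implementation instead:

DataModelBuilder modelBuilder = new DataModelBuilder() {
  public DataModel buildDataModel(FastByIDMap<PreferenceArray> trainingData) {
    // Substitute a specialized DataModel implementation here if needed
    return new GenericDataModel(trainingData);
  }
};

double score = evaluator.evaluate(builder, modelBuilder, model, 0.7, 1.0);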
The 1.0 parameter at the end controls how much of the overall input data is used. Here it means "100%." This can be used to produce a quicker, if less accurate, evaluation by using only a little of a potentially huge data set. For example, 0.1 would mean 10% of the data is used and 90% is ignored. This is quite useful when rapidly testing small changes to a Recommender.
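For instance, this sketch, again reusing the variables from the earlier listing, evaluates against only 10% of the data:

// Quicker, rougher evaluation: use only 10% of the input data
double quickScore = evaluator.evaluate(builder, null, model, 0.7, 0.1);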
2.4 Evaluating precision and recall
We could also take a broader view of the recommender problem: we don't have to estimate preference values in order to produce recommendations. It's not always necessary to present estimated preference values to users; in many cases, all we want is an ordered list of recommendations, from best to worst. In fact, in some cases we don't care much about the exact ordering of the list: a set of a few good recommendations is fine.
Taking this more general view, we could also apply classic information retrieval metrics to evaluate recommenders: precision and recall. These terms are typically applied to things like search engines, which return some set of best results for a query out of many possible results.