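Mahout in Action: Recommendation, Clustering, and Classification Explained

The book Mahout in Action explores in depth how the Apache Mahout framework is applied to machine learning and data mining. Apache Mahout is an open-source project that provides scalable machine learning algorithms, used mainly for recommender systems, clustering, and classification. The book is divided into three parts that cover these core capabilities in detail.

**Part 1 - Recommendations**

1. **Introducing Recommenders** - Introduces the basic concepts of recommender systems, explains why they matter for understanding and predicting user behavior, and discusses their use in e-commerce, media, and social networks.
2. **Representing Data** - Covers how to turn data into a form that machine learning algorithms can work with, such as user-item matrices and similarity metrics.
3. **Making Recommendations** - Explains in detail how to generate personalized suggestions with collaborative filtering, content-based recommendation, and hybrid strategies.
4. **Taking Recommenders to production** - Discusses the challenges and strategies of moving a recommender from experimentation into a production environment, including performance tuning and real-time recommendation.

**Part 2 - Clustering**

5. **Introduction to Clustering** - Clustering is an important part of data mining; this chapter introduces its basic principles and its purpose, which is to discover natural groups or patterns in a data set.
6. **Representing Data** - Clustering also needs data in a suitable representation. This chapter looks at different representations, such as feature vectors and sparse data structures.
7. **Clustering algorithms in Mahout** - Describes the clustering algorithms implemented in Mahout, such as K-means, Fuzzy K-means, and Canopy Clustering.
8. **Evaluating cluster quality** - Evaluating the quality of clustering results is essential; this chapter introduces internal and external evaluation measures, such as the silhouette coefficient and the Calinski-Harabasz index.
9. **Taking clustering to production** - Discusses how to apply clustering models to real problems, and strategies for monitoring and adjusting the clustering process.
10. **Real-world applications of clustering** - Shows practical applications of clustering in market segmentation, text analysis, and image recognition.

**Part 3 - Classification**

11. **Introduction to classification** - Gives an overview of classification problems and explains how to build a classification model from training data in order to predict class labels.
12. **Power of the naive classifier** - Introduces the principles and strengths of the naive Bayes classifier, which underlies many classification tasks.
13. **Multiclass classification** - For problems with more than two possible output classes, discusses multiclass techniques and strategies such as One-vs-All and Error-Correcting Output Codes (ECOC).
14. **Classifier evaluation** - Describes methods for evaluating classifier performance, including the confusion matrix, precision, recall, and F1 score.
15. **Tuning your classifier for greater accuracy and performance** - Describes how to improve a classifier's accuracy and efficiency through parameter tuning and feature selection.

Besides theory, the book includes a large number of code examples that help readers apply Mahout in real projects. Both newcomers to machine learning and experienced data scientists can use it to build their skills in recommender systems, clustering, and classification.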

©Manning Publications Co. Please post comments or corrections to the Author Online forum:

http://www.manning-sandbox.com/forum.jspa?forumID=623

Figure 2.1 Relationships between users 1 to 5 and items 101 to 107. Dashed lines represent associations that seem

negative -- the user does not seem to like the item much, but expresses a relationship to the item.

2.2.2 Creating a Recommender

So what book might we recommend to user 1? Not 101, 102 or 103 – he already knows about these

books, apparently, and recommendation is about discovering new things. Intuition suggests that

because users 4 and 5 seem similar to 1, we should recommend something that user 4 or user 5 likes.

That leaves 104, 105 and 106 as possible recommendations. On the whole, 104 seems to be the most

liked of these possibilities, judging by the preference values of 4.5 and 4.0 for item 104. Now, run the

following code:

Listing 2.2 A simple user-based recommender program with Mahout

package mia.recommender.ch02;

import org.apache.mahout.cf.taste.impl.model.file.*;
import org.apache.mahout.cf.taste.impl.neighborhood.*;
import org.apache.mahout.cf.taste.impl.recommender.*;
import org.apache.mahout.cf.taste.impl.similarity.*;
import org.apache.mahout.cf.taste.model.*;
import org.apache.mahout.cf.taste.neighborhood.*;
import org.apache.mahout.cf.taste.recommender.*;
import org.apache.mahout.cf.taste.similarity.*;
import java.io.*;
import java.util.*;

class RecommenderIntro {

  public static void main(String[] args) throws Exception {

    DataModel model = new FileDataModel(new File("intro.csv")); // A

    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(
        model, neighborhood, similarity); // B

    List<RecommendedItem> recommendations =
        recommender.recommend(1, 1); // C

    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}

A Load the data file
B Create the recommender engine
C For user 1, recommend 1 item

Licensed to nancy chen <amigo4u2009@gmail.com>


For brevity, through several more chapters of examples that follow, we will omit the imports, class

declaration, and method declaration, and instead repeat only the program statements themselves. To

help visualize the relationship between these basic components, see figure 2.2. Not all Mahout-based

recommenders will look like this -- some will employ different components with different relationships.

But this gives a sense of what’s going on in our example.

Figure 2.2 Simplified illustration of component interaction in a Mahout user-based recommender

While we will discuss each of these components in much more detail in the next two chapters, we can summarize the role of each component now. A DataModel implementation stores and provides access to all the preference, user and item data needed in the computation. A UserSimilarity implementation provides some notion of how similar two users are; this could be based on one of many possible metrics or calculations. A UserNeighborhood implementation defines a notion of a group of users that are most similar to a given user. Finally, a Recommender implementation pulls all these components together to recommend items to users, and provides related functionality.
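To make "some notion of how similar two users are" concrete, here is a minimal sketch of one such metric, the Pearson correlation computed over the items two users have both rated. It is written in plain Java and is not Mahout's PearsonCorrelationSimilarity implementation; the class name, method name, and preference values are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not Mahout's PearsonCorrelationSimilarity.
final class PearsonSketch {

    // Pearson correlation over the items that both users have rated.
    // Returns NaN when fewer than two items are shared or when either
    // user's shared ratings have no variance.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0.0, sumB = 0.0, sumAA = 0.0, sumBB = 0.0, sumAB = 0.0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = entry.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical preferences keyed by item ID.
        Map<Long, Double> u = new HashMap<>();
        u.put(101L, 1.0);
        u.put(102L, 2.0);
        u.put(103L, 3.0);
        Map<Long, Double> v = new HashMap<>();
        v.put(101L, 2.0);
        v.put(102L, 4.0);
        v.put(103L, 6.0);
        v.put(104L, 5.0); // ignored: u has no preference for item 104
        System.out.println(pearson(u, v));
    }
}
```

A value near 1 means the two users' ratings rise and fall together, a value near -1 means they move in opposite directions, and items rated by only one of the users do not contribute.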

2.2.3 Analyzing the output

Compile and run this using your favorite IDE. The output of running the program in your terminal or IDE

should be: RecommendedItem[item:104, value:4.257081]

We asked for one top recommendation, and got one. The recommender engine recommended book

104 to user 1. Further, it says that the recommender engine did so because it estimated user 1’s

preference for book 104 to be about 4.3, and that was the highest among all the items eligible for

recommendations.

That’s not bad. We didn't get 107, which was also recommendable, but is only associated with a user who has different tastes. We picked 104 over 106, and this makes sense when you note that 104 is a bit more highly rated overall. Further, we got a reasonable estimate of how much user 1 likes item 104 – something between the 4.0 and 4.5 that users 4 and 5 expressed.
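One common way to arrive at an estimate that falls between the neighbors' preferences is a similarity-weighted average. The sketch below assumes that approach; whether the recommender above computes exactly this is an assumption, and the similarity weights are hypothetical values, not output from Mahout. Only the preference values 4.5 and 4.0 come from the text.

```java
// Illustrative sketch only; the weights are made up for the example.
final class WeightedEstimateSketch {

    // Similarity-weighted average of the neighbors' preferences for one item.
    // Returns NaN when the weights sum to zero.
    static double estimate(double[] neighborPrefs, double[] similarities) {
        if (neighborPrefs.length != similarities.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < neighborPrefs.length; i++) {
            weightedSum += similarities[i] * neighborPrefs[i];
            totalWeight += similarities[i];
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        double[] prefsFor104 = {4.5, 4.0};     // the two neighbors' preferences for item 104
        double[] hypotheticalSims = {0.5, 0.5}; // assumed similarity weights
        System.out.println(estimate(prefsFor104, hypotheticalSims));
    }
}
```

With positive weights the result always lies between the smallest and largest neighbor preference, and it moves toward the preference of whichever neighbor is more similar.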

The right answer isn't obvious from looking at the data, but the recommender engine made some

decent sense of it and returned a defensible answer. If you got a pleasant tingle out of seeing this

simple program give a useful and non-obvious result from a small pile of data, then the world of

machine learning is for you!

For clear, small data sets, producing recommendations is as trivial as it appears above. In real life,

data sets are huge, and they are noisy. For example, imagine a popular news site recommending news


articles to readers. Preferences are inferred from article clicks, but many of these “preferences” may be bogus: maybe a reader clicked an article but didn't like it, or clicked the wrong story. Perhaps many of the clicks occurred while the reader was not logged in, so they can’t be associated with a user. And imagine the size of the data set: perhaps billions of clicks in a month.

Producing the right recommendations from this data, and producing them quickly, is not trivial. Later we will present, by way of case studies, the tools Mahout provides to attack a range of such problems. They will show how standard approaches can produce poor recommendations or take a great deal of memory and CPU time, and how to configure and customize Mahout to improve performance.

2.3 Evaluating a Recommender

A recommender engine is a tool, a means to answer the question, “what are the best recommendations

for a user?” Before investigating the answers, we should investigate the question. What exactly is a

good recommendation? And how will we know when a recommender is producing them? The remainder

of this chapter pauses to explore evaluation of a recommender, because this is a tool that will be useful

when we begin looking at specific recommender systems.

The best possible recommender would be a sort of psychic that could somehow know, before you do,

exactly how much you would like every possible item that you've not yet seen or expressed any

preference for. A recommender that could predict all your preferences exactly would merely present all

other items ranked by your future preference and be done. These would be the best possible

recommendations.

And indeed most recommender engines operate by trying to do just this, estimating ratings for some

or all other items. So, one way of evaluating a recommender's recommendations is to evaluate the

quality of its estimated preference values – that is, evaluating how closely the estimated preferences

match the actual preferences.

2.3.1 Training data and scoring

Those “actual preferences” don't exist though. Nobody knows for sure how you'll like some new item in

the future (including you). This can be simulated for a recommender engine by setting aside a small part

of the real data set as test data. These test preferences are not present in the training data fed into a

recommender engine under evaluation -- which is all data except the test data. Instead, the

recommender is asked to estimate preference for the missing test data, and estimates are compared to

the actual values.
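The hold-out idea described above can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions: each preference is assigned independently to the training or test set at random. The Pref type, the split method, and the 0.7 fraction are hypothetical and are not part of Mahout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; not how any particular library splits its data.
final class HoldOutSplitSketch {

    // A single (user, item, value) preference; a hypothetical helper type.
    static final class Pref {
        final long userId;
        final long itemId;
        final double value;

        Pref(long userId, long itemId, double value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    // Returns two lists: index 0 is the training set, index 1 is the test set.
    // Each preference goes to training with probability trainingFraction.
    static List<List<Pref>> split(List<Pref> all, double trainingFraction, long seed) {
        Random random = new Random(seed); // fixed seed makes the split repeatable
        List<Pref> training = new ArrayList<>();
        List<Pref> test = new ArrayList<>();
        for (Pref pref : all) {
            if (random.nextDouble() < trainingFraction) {
                training.add(pref);
            } else {
                test.add(pref);
            }
        }
        List<List<Pref>> result = new ArrayList<>();
        result.add(training);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        List<Pref> all = new ArrayList<>();
        for (long item = 101; item <= 110; item++) {
            all.add(new Pref(1L, item, 3.0)); // made-up preferences
        }
        List<List<Pref>> parts = split(all, 0.7, 42L);
        System.out.println("training: " + parts.get(0).size()
            + ", test: " + parts.get(1).size());
    }
}
```

The recommender under evaluation would see only the training list; its estimates for the (user, item) pairs in the test list are then compared with the held-out values.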

From there, it is fairly simple to produce a kind of “score” for the recommender. For example we

could compute the average difference between estimate and actual preference. With a score of this

type, lower is better, because that would mean the estimates differed from the actual preference values

by less. 0.0 would mean perfect estimation -- no difference at all between estimates and actual values.

Sometimes the root-mean-square of the differences is used: this is the square root of the average of

the squares of the differences between actual and estimated preference values. Again, lower is better.


              Item 1   Item 2   Item 3
Actual          3.0      5.0      4.0
Estimate        3.5      2.0      5.0
Difference      0.5      3.0      1.0

Average difference = (0.5 + 3.0 + 1.0) / 3 = 1.5
Root mean square   = √((0.5² + 3.0² + 1.0²) / 3) = 1.8484

Table 2.1 An illustration of the average difference, and root mean square calculation

Above, the table shows the difference between a set of actual and estimated preferences, and how

they are translated into scores. Root-mean-square more heavily penalizes estimates that are way off, as

with item 2 here, and that is considered desirable by some. For example, an estimate that’s off by 2

whole stars is probably more than twice as “bad” as one off by just 1 star. Because the simple average

of differences is perhaps more intuitive and easy to understand, we’ll use it in upcoming examples.
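The two scores can be reproduced with a short plain-Java sketch using the actual and estimated values from table 2.1. The class and method names are hypothetical; this is not the code Mahout uses internally.

```java
// Illustrative sketch only; names are made up for this example.
final class ScoreSketch {

    // Mean of |actual - estimate| over all items.
    static double averageAbsoluteDifference(double[] actual, double[] estimate) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - estimate[i]);
        }
        return sum / actual.length;
    }

    // Square root of the mean of squared differences.
    static double rootMeanSquare(double[] actual, double[] estimate) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double difference = actual[i] - estimate[i];
            sumOfSquares += difference * difference;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = {3.0, 5.0, 4.0};   // values from table 2.1
        double[] estimate = {3.5, 2.0, 5.0}; // values from table 2.1
        System.out.println(averageAbsoluteDifference(actual, estimate)); // about 1.5
        System.out.println(rootMeanSquare(actual, estimate));            // about 1.8484
    }
}
```

The single large miss on item 2 contributes 9.0 of the 10.25 total squared error, which is why the root-mean-square score ends up noticeably higher than the simple average.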

2.3.2 Running RecommenderEvaluator

Let's revisit the example code and instead evaluate the simple recommender we created, on our simple

data set:

Listing 2.3 Configuring and running an evaluation of a Recommender

RandomUtils.useTestSeed(); // A

DataModel model = new FileDataModel(new File("intro.csv"));

RecommenderEvaluator evaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();

RecommenderBuilder builder = new RecommenderBuilder() { // B
  @Override
  public Recommender buildRecommender(DataModel model)
      throws TasteException {
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(2, similarity, model);
    return
        new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
};

double score = evaluator.evaluate(
    builder, null, model, 0.7, 1.0); // C

System.out.println(score);

A Used only in examples for repeatable result
B Builds the same Recommender as above
C Use 70% of data to train; test with other 30%


Most of the action happens in evaluate(). Inside, the RecommenderEvaluator handles splitting the data into a training and test set, builds a new training DataModel and Recommender to test, and compares its estimated preferences to the actual test data.

Note that we don’t pass a Recommender to this method. This is because, inside, the method will need to build a Recommender around a newly created training DataModel. So we must provide an object that can build a Recommender from a DataModel: a RecommenderBuilder. Here, it builds the same implementation that we tried earlier in this chapter.

2.3.3 Assessing the result

This program prints the result of the evaluation: a score indicating how well the Recommender performed. In this case you should simply see: 1.0. Even though a lot of randomness is used inside the evaluator to choose test data, the result should be consistent because of the call to RandomUtils.useTestSeed(), which forces the same random choices each time. This is only used in such examples, and unit tests, to guarantee repeatable results. Don’t use it in your real code.

What this value means depends on the implementation we used – here, AverageAbsoluteDifferenceRecommenderEvaluator. A result of 1.0 from this implementation means that, on average, the recommender estimates a preference that deviates from the actual preference by 1.0.

A value of 1.0 is not great, on a scale of 1 to 5, but there is so little data here to begin with. Without the fixed seed, your results may differ, because the data set is split randomly and hence the training and test sets may differ with each run.

This technique can be applied to any Recommender and DataModel. To use root-mean-square scoring, replace AverageAbsoluteDifferenceRecommenderEvaluator with the implementation RMSRecommenderEvaluator.

Also, the null parameter to evaluate() could instead be an instance of DataModelBuilder, which can be used to control how the training DataModel is created from training data. Normally the default is fine; it may not be if you are using a specialized implementation of DataModel in your deployment. A DataModelBuilder is how you would inject it into the evaluation process.

The 1.0 parameter at the end controls how much of the overall input data is used. Here it means “100%.” This can be used to produce a quicker, if less accurate, evaluation by using only a little of a potentially huge data set. For example, 0.1 would mean 10% of the data is used and 90% is ignored. This is quite useful when rapidly testing small changes to a Recommender.

2.4 Evaluating precision and recall

We could also take a broader view of the recommender problem: we don't have to estimate preference

values to produce recommendations. It’s not always necessary to present estimated preference values

to users. In many cases, all we want is an ordered list of recommendations, from best to worst. In fact,

in some cases we don't care much about the exact ordering of the list – a set of a few good

recommendations is fine.

Taking this more general view, we could also apply classic information retrieval metrics to evaluate

recommenders: precision and recall. These terms are typically applied to things like search engines,

which return some set of best results for a query out of many possible results.
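In their standard information retrieval sense, precision is the fraction of returned results that are relevant, and recall is the fraction of all relevant results that were returned. The sketch below applies those definitions to a list of recommended item IDs in plain Java. The class name, the item IDs, and the choice of which items count as "relevant" are hypothetical, and the sketch assumes the recommended list contains no duplicates.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; data and names are made up for this example.
final class PrecisionRecallSketch {

    // Number of recommended items that are also in the relevant set.
    static int hits(List<Long> recommended, Set<Long> relevant) {
        int count = 0;
        for (Long itemId : recommended) {
            if (relevant.contains(itemId)) {
                count++;
            }
        }
        return count;
    }

    // Fraction of recommended items that are relevant; NaN if nothing was recommended.
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty()
            ? Double.NaN
            : (double) hits(recommended, relevant) / recommended.size();
    }

    // Fraction of relevant items that were recommended; NaN if nothing is relevant.
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty()
            ? Double.NaN
            : (double) hits(recommended, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        List<Long> recommended = Arrays.asList(104L, 106L, 107L); // hypothetical top 3
        Set<Long> relevant = new HashSet<>(Arrays.asList(104L, 105L)); // hypothetical "good" items
        System.out.println("precision = " + precision(recommended, relevant));
        System.out.println("recall    = " + recall(recommended, relevant));
    }
}
```

In this made-up case one of the three recommendations is relevant, so precision is 1/3, and one of the two relevant items was found, so recall is 1/2.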
