Besides the traditional heuristics that are based mostly
on information retrieval methods, other techniques for
content-based recommendation have also been used, such
as Bayesian classifiers [70], [77] and various machine
learning techniques, including clustering, decision trees,
and artificial neural networks [77]. These techniques differ
from information retrieval-based approaches in that they
calculate utility predictions based not on a heuristic
formula, such as a cosine similarity measure, but rather
on a model learned from the underlying data using
statistical learning and machine learning techniques.
For example, based on a set of Web pages that were
rated as “relevant” or “irrelevant” by the user, [77] uses
the naive Bayesian classifier [31] to classify unrated Web
pages. More specifically, the naive Bayesian classifier is
used to estimate the following probability that page $p_j$
belongs to a certain class $C_i$ (e.g., relevant or irrelevant)
given the set of keywords $k_{1,j}, \ldots, k_{n,j}$ on that page:
$$P(C_i \mid k_{1,j} \,\&\, \ldots \,\&\, k_{n,j}). \tag{7}$$
Moreover, [77] uses the assumption that keywords are
independent and, therefore, the above probability is
proportional to
$$P(C_i) \prod_{x} P(k_{x,j} \mid C_i). \tag{8}$$
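The proportionality above follows from Bayes' rule combined with the independence assumption. The intermediate steps below are a sketch in the same notation; they are not numbered or stated in the original, and independence is read here as holding within each class $C_i$:

```latex
\begin{align*}
P(C_i \mid k_{1,j} \,\&\, \ldots \,\&\, k_{n,j})
  &= \frac{P(k_{1,j} \,\&\, \ldots \,\&\, k_{n,j} \mid C_i)\, P(C_i)}
          {P(k_{1,j} \,\&\, \ldots \,\&\, k_{n,j})} \\
  &\propto P(C_i)\, P(k_{1,j} \,\&\, \ldots \,\&\, k_{n,j} \mid C_i) \\
  &= P(C_i) \prod_{x} P(k_{x,j} \mid C_i).
\end{align*}
```

The denominator does not depend on $C_i$, so it can be dropped when comparing classes for a fixed page $p_j$.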
While the keyword independence assumption does not
necessarily apply in many applications, experimental results
demonstrate that naive Bayesian classifiers still produce
high classification accuracy [77]. Furthermore, both
$P(k_{x,j} \mid C_i)$ and $P(C_i)$ can be estimated from the underlying
training data. Therefore, for each page $p_j$, the probability
$P(C_i \mid k_{1,j} \,\&\, \ldots \,\&\, k_{n,j})$ is computed for each class $C_i$,
and page $p_j$ is assigned to the class $C_i$ having the highest probability [77].
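The classification procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used in [77]: the function names (`train`, `classify`), the toy pages, the add-one smoothing, and the use of log-probabilities are all assumptions made here for illustration.

```python
import math
from collections import Counter

def train(rated_pages):
    """Estimate counts for P(C_i) and P(k_x | C_i) from (keywords, label) pairs."""
    class_counts = Counter(label for _, label in rated_pages)
    keyword_counts = {label: Counter() for label in class_counts}
    vocabulary = set()
    for keywords, label in rated_pages:
        keyword_counts[label].update(keywords)
        vocabulary.update(keywords)
    return class_counts, keyword_counts, vocabulary

def classify(keywords, model):
    """Return the class C_i that maximizes P(C_i) * prod_x P(k_x | C_i)."""
    class_counts, keyword_counts, vocabulary = model
    total_pages = sum(class_counts.values())
    scores = {}
    for label, n_pages in class_counts.items():
        total_keywords = sum(keyword_counts[label].values())
        # log P(C_i), estimated as the fraction of rated pages in the class
        score = math.log(n_pages / total_pages)
        for k in keywords:
            if k not in vocabulary:
                continue  # keywords never seen in training carry no evidence
            # log P(k_x | C_i) with add-one smoothing (an assumption of this sketch)
            score += math.log(
                (keyword_counts[label][k] + 1) / (total_keywords + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical pages rated by the user
rated_pages = [
    (["python", "code", "tutorial"], "relevant"),
    (["python", "library", "code"], "relevant"),
    (["celebrity", "gossip", "photos"], "irrelevant"),
    (["gossip", "fashion", "photos"], "irrelevant"),
]
model = train(rated_pages)
print(classify(["python", "code"], model))
print(classify(["gossip", "photos"], model))
```

Working in log space avoids numerical underflow when a page has many keywords, and it leaves the ranking of classes unchanged because the logarithm is monotonic.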
While not explicitly dealing with providing recommendations,
the text retrieval community has contributed several
techniques that are being used in content-based recommender
systems. One example of such a technique would be the
research on adaptive filtering [101], [112], which focuses on
becoming more accurate at identifying relevant documents
incrementally by observing the documents one-by-one in a
continuous document stream. Another example would be
the work on threshold setting [84], [111], which focuses on
determining the extent to which documents should match a
given query in order to be relevant to the user. Other text
retrieval methods are described in [50] and can also be
found in the proceedings of the Text Retrieval Conference
(TREC) (http://trec.nist.gov).
As was observed in [8], [97], content-based recommender
systems have several limitations that are described in the
rest of this section.
2.1.1 Limited Content Analysis
Content-based techniques are limited by the features that
are explicitly associated with the objects that these systems
recommend. Therefore, in order to have a sufficient set of
features, the content must either be in a form that can be
parsed automatically by a computer (e.g., text) or the
features should be assigned to items manually. While
information retrieval techniques work well in extracting
features from text documents, some other domains have an
inherent problem with automatic feature extraction. For
example, automatic feature extraction methods are much
harder to apply to multimedia data, e.g., graphical images,
audio streams, and video streams. Moreover, it is often not
practical to assign attributes by hand due to limitations of
resources [97].
Another problem with limited content analysis is that, if
two different items are represented by the same set of
features, they are indistinguishable. Therefore, since text-based
documents are usually represented by their most
important keywords, content-based systems cannot distinguish
between a well-written article and a badly written
one, if they happen to use the same terms [97].
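The indistinguishability problem can be seen in a small Python sketch. The two texts, the keyword list, and the helper names (`keyword_vector`, `cosine`) are hypothetical and chosen only to illustrate the point; real systems typically use weighted keyword vectors rather than raw counts.

```python
import math
from collections import Counter

def keyword_vector(text, vocabulary):
    """Represent a document by the counts of its keywords, ignoring word order."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine similarity between two keyword vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical "most important keywords" for the domain
vocabulary = ["markets", "inflation", "rally", "cools"]

well_written = "markets rally as inflation cools"
badly_written = "inflation markets cools as rally"  # same terms, scrambled

vec_good = keyword_vector(well_written, vocabulary)
vec_bad = keyword_vector(badly_written, vocabulary)

print(vec_good == vec_bad)        # identical feature vectors
print(cosine(vec_good, vec_bad))  # maximal similarity
```

Because both documents map to the same vector, any utility function defined on that representation must assign them the same score.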
2.1.2 Overspecialization
When the system can only recommend items that score
highly against a user’s profile, the user is limited to being
recommended items that are similar to those already rated.
For example, a person with no experience with Greek
cuisine would never receive a recommendation for even the
greatest Greek restaurant in town. This problem, which has
also been studied in other domains, is often addressed by
introducing some randomness. For example, the use of
genetic algorithms has been proposed as a possible solution
in the context of information filtering [98]. In addition, the
problem with overspecialization is not only that the
content-based systems cannot recommend items that are
different from anything the user has seen before. In certain
cases, items should not be recommended if they are too
similar to something the user has already seen, such as a
different news article describing the same event. Therefore,
some content-based recommender systems, such as DailyLearner [13],
filter out items not only if they are too different
from the user’s preferences, but also if they are too similar
to something the user has seen before. Furthermore, Zhang
et al. [112] provide a set of five redundancy measures to
evaluate whether a document that is deemed to be relevant
contains some novel information as well. In summary, the
diversity of recommendations is often a desirable feature in
recommender systems. Ideally, the user should be presented
with a range of options and not with a homogeneous
set of alternatives. For example, it is not necessarily a good
idea to recommend all movies by Woody Allen to a user
who liked one of them.
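The idea of filtering out items that are either too different from the user's preferences or too similar to something already seen can be sketched as follows. This is not the method of DailyLearner [13] nor one of the redundancy measures of [112]; the Jaccard similarity, the thresholds `min_relevance` and `max_redundancy`, and the sample data are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def filter_candidates(candidates, profile, seen,
                      min_relevance=0.2, max_redundancy=0.8):
    """Keep items relevant to the profile but not near-duplicates of seen items."""
    kept = []
    for name, keywords in candidates.items():
        relevance = jaccard(keywords, profile)
        # Redundancy is the similarity to the closest item already seen
        redundancy = max((jaccard(keywords, s) for s in seen), default=0.0)
        if relevance >= min_relevance and redundancy <= max_redundancy:
            kept.append(name)
    return kept

# Hypothetical user profile and reading history
profile = {"election", "senate", "budget", "vote"}
seen = [{"senate", "budget", "vote", "deadline"}]

candidates = {
    "same_event_rewrite": {"senate", "budget", "vote", "deadline"},
    "related_story": {"election", "senate", "poll"},
    "gardening_tips": {"tomato", "soil", "compost"},
}

print(filter_candidates(candidates, profile, seen))
```

In this toy run, the rewrite of an already-seen story is dropped as redundant and the gardening article is dropped as irrelevant, leaving only the related but novel story.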
2.1.3 New User Problem
The user has to rate a sufficient number of items before a
content-based recommender system can really understand
the user’s preferences and present the user with reliable
recommendations. Therefore, a new user, having very few
ratings, would not be able to get accurate recommendations.
2.2 Collaborative Methods
Unlike content-based recommendation methods, collaborative
recommender systems (or collaborative filtering systems)
try to predict the utility of items for a particular user based
on the items previously rated by other users. More formally,
the utility $u(c, s)$ of item $s$ for user $c$ is estimated based on
the utilities $u(c_j, s)$ assigned to item $s$ by those users $c_j \in C$
who are “similar” to user $c$. For example, in a movie
ADOMAVICIUS AND TUZHILIN: TOWARD THE NEXT GENERATION OF RECOMMENDER SYSTEMS: A SURVEY OF THE STATE-OF-THE-ART... 737