The BellKor Solution to the Netflix Grand Prize
Yehuda Koren
Yahoo! Research, Haifa, Israel (yehuda@yahoo-inc.com)
August 2009
I. INTRODUCTION
This article describes part of our contribution to the “BellKor’s Pragmatic Chaos” final solution, which won the Netflix Grand Prize. The other portion of the contribution was created while working at AT&T with Robert Bell and Chris Volinsky, as reported in our 2008 Progress Prize report [3]. The final solution includes all the predictors described there. In this article we describe only the newer predictors.
So what is new over last year’s solution? First, we further improved the baseline predictors (Sec. III). This in turn improves our other models, which incorporate those predictors, like the matrix factorization model (Sec. IV). In addition, an extension of the neighborhood model that addresses temporal dynamics was introduced (Sec. V). On the Restricted Boltzmann Machines (RBM) front, we use a new RBM model with superior accuracy by conditioning the visible units (Sec. VI). The final addition is a new blending algorithm based on gradient boosted decision trees (GBDT) (Sec. VII).
II. PRELIMINARIES
The Netflix dataset contains more than 100 million date-stamped movie ratings performed by anonymous Netflix customers between Dec 31, 1999 and Dec 31, 2005 [4]. The dataset spans m = 480,189 users and n = 17,770 movies (a.k.a. items).
The contest was designed in a training-test set format. A Hold-out set of about 4.2 million ratings was created, consisting of the last nine movies rated by each user (or fewer if a user had not rated at least 18 movies over the entire period). The remaining data made up the training set. The Hold-out set was randomly split three ways, into subsets called Probe, Quiz, and Test. The Probe set was attached to the training set together with its labels (the rating that the user gave the movie). The Quiz and Test sets made up an evaluation set, known as the Qualifying set, for which competitors were required to predict ratings. Once a competitor submits predictions, the prizemaster returns the root mean squared error (RMSE) achieved on the Quiz set, which is posted on a public leaderboard (www.netflixprize.com/leaderboard). RMSE values mentioned in this article correspond to the Quiz set. Ultimately, the winner of the prize is the one who scores best on the Test set; those scores were never disclosed by Netflix. This precludes clever systems that might “game” the competition by learning about the Quiz set through repeated submissions.
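Since this quantity recurs throughout the article, here is a minimal sketch of computing RMSE over paired lists of predictions and true ratings; the function and variable names are ours for illustration, not part of the contest tooling.

import math

def rmse(predictions, targets):
    # Root mean squared error over paired lists of predicted and true ratings.
    assert len(predictions) == len(targets) and len(predictions) > 0
    squared_error = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared_error / len(predictions))

# Example: three predictions against 1-5 star ratings -> about 0.61
print(rmse([3.8, 2.5, 4.1], [4, 3, 5]))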
Compared with the training data, the Hold-out set contains many more ratings by users who do not rate much and are therefore harder to predict. In a way, this represents real requirements for a collaborative filtering (CF) system, which needs to predict new ratings from older ones, and to address all users equally, not just the heavy raters.
We reserve special indexing letters to distinguish users from movies: for users u, v, and for movies i, j. A rating $r_{ui}$ indicates the preference by user u of movie i. Values range from 1 (star), indicating no interest, to 5 (stars), indicating a strong interest. We distinguish predicted ratings from known ones by using the notation $\hat{r}_{ui}$ for the predicted value of $r_{ui}$. The scalar $t_{ui}$ denotes the time of rating $r_{ui}$. Here, time is measured in days, so $t_{ui}$ counts the number of days elapsed since some early time point. About 99% of the possible ratings are missing, because a user typically rates only a small portion of the movies. The (u, i) pairs for which $r_{ui}$ is known are stored in the training set $K = \{(u,i) \mid r_{ui} \text{ is known}\}$. Notice that K also includes the Probe set. Each user u is associated with a set of items denoted by R(u), which contains all the items for which ratings by u are available. Likewise, R(i) denotes the set of users who rated item i. Sometimes we also use a set denoted by N(u), which contains all items for which u provided a rating, even if the rating value is unknown. Thus, N(u) extends R(u) by also considering the ratings in the Qualifying set.
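To make these sets concrete, the sketch below builds K, R(u), R(i), and N(u) from rating tuples; the tuple layout and all names here are illustrative assumptions rather than the official dataset format.

from collections import defaultdict

def build_index_sets(training_ratings, qualifying_pairs):
    # training_ratings: iterable of (u, i, r_ui, t_ui) with known rating values
    # qualifying_pairs: iterable of (u, i, t_ui) whose rating values are withheld
    K = set()                # (u, i) pairs with a known rating
    R_u = defaultdict(set)   # R(u): items rated by u with a known value
    R_i = defaultdict(set)   # R(i): users who rated item i
    N_u = defaultdict(set)   # N(u): items u rated, even if the value is unknown
    for u, i, r, t in training_ratings:
        K.add((u, i))
        R_u[u].add(i)
        R_i[i].add(u)
        N_u[u].add(i)
    for u, i, t in qualifying_pairs:
        N_u[u].add(i)        # rating value unknown, but the rating event itself is informative
    return K, R_u, R_i, N_u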
Models for the rating data are learned by fitting the previously observed ratings (the training set). However, our goal is to generalize from those ratings in a way that allows us to predict future, unknown ratings (the Qualifying set). Thus, caution should be exercised to avoid overfitting the observed data. We achieve this by regularizing the learned parameters, penalizing their magnitudes. The extent of regularization is controlled by tunable constants. Unless otherwise stated, we use L2 regularization.
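As an illustration of how such constants enter the learning procedure, a single stochastic-gradient update on a model parameter typically takes the form sketched below, where gamma (the step size) and lam (the L2 regularization constant) stand for the tunable constants mentioned above; the names are ours, and the sketch is generic rather than any specific predictor of ours.

def sgd_step(theta, e_ui, grad, gamma, lam):
    # One stochastic-gradient update with L2 regularization (weight decay).
    # theta: a learned parameter entering the prediction of r_ui
    # e_ui:  the prediction error r_ui - r_hat_ui on one training example
    # grad:  the derivative of r_hat_ui with respect to theta
    # A larger lam shrinks parameters more strongly toward zero, trading
    # training-set fit for better generalization to the Qualifying set.
    return theta + gamma * (e_ui * grad - lam * theta)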
This is a good place to add some words on the constants controlling our algorithms (including step sizes, regularization, and number of iterations). Exact values of these constants are determined by validation on the Probe set. In all cases but one (to be mentioned below), such validation is done in a manual, greedy manner. That is, when a newly introduced constant needs to be tuned, we execute multiple runs of the algorithm and pick the value that yields the best RMSE on the Netflix Probe set [4].
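In pseudocode terms, this manual procedure resembles the greedy sweep sketched below, with all previously fixed constants baked into the training routine and only the new constant varying; every name here is hypothetical.

def tune_new_constant(candidate_values, train_with, rmse_on_probe):
    # Greedy tuning of a single newly introduced constant: run the algorithm
    # once per candidate value and keep whichever minimizes Probe-set RMSE.
    best_value, best_rmse = None, float("inf")
    for value in candidate_values:
        model = train_with(value)          # full training run at this setting
        current = rmse_on_probe(model)     # evaluate on the Probe set
        if current < best_rmse:
            best_value, best_rmse = value, current
    return best_value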
This scheme does not result in optimal settings for several reasons. First, once a constant is set we do not revisit its value, even though the later introduction of other constants may call for modifying earlier settings. Second, we use the same constants under multiple variants of the same algorithm (e.g., multiple dimensionalities of a factorization model), whereas a more delicate tuning would require a different setting for each variant. We chose this convenient, but less accurate, method because our experience showed that over-tuning the accuracy of a single predictor does