Field-aware Factorization Machines for CTR Prediction
Yuchin Juan∗
Criteo Research
Palo Alto, CA
yc.juan@criteo.com

Yong Zhuang∗
Dept. of ECE
Carnegie Mellon Univ., USA
yong.zhuang22@gmail.com

Wei-Sheng Chin
Dept. of Computer Science
National Taiwan Univ., Taiwan
d01944006@csie.ntu.edu.tw

Chih-Jen Lin
Dept. of Computer Science
National Taiwan Univ., Taiwan
cjlin@csie.ntu.edu.tw
ABSTRACT
Click-through rate (CTR) prediction plays an important role in computational advertising. Models based on degree-2 polynomial mappings and factorization machines (FMs) are widely used for this task. Recently, a variant of FMs, field-aware factorization machines (FFMs), has outperformed existing models in several world-wide CTR-prediction competitions. Based on our experience winning two of them, in this paper we establish FFMs as an effective method for classifying large sparse data, including data from CTR prediction. First, we propose efficient implementations for training FFMs. Then we comprehensively analyze FFMs and compare this approach with competing models. Experiments show that FFMs are very useful for certain classification problems. Finally, we have released a package of FFMs for public use.
Keywords
Machine learning; Click-through rate prediction; Computational advertising; Factorization machines
1. INTRODUCTION
Click-through rate (CTR) prediction plays an important role in the advertising industry [1, 2, 3]. Logistic regression is probably the most widely used model for this task [3]. Given a data set with $m$ instances $(y_i, \mathbf{x}_i)$, $i = 1, \ldots, m$, where $y_i$ is the label and $\mathbf{x}_i$ is an $n$-dimensional feature vector, the model $\mathbf{w}$ is obtained by solving the following optimization problem:
$$\min_{\mathbf{w}} \quad \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^{m} \log\bigl(1 + \exp(-y_i\,\phi_{\text{LM}}(\mathbf{w}, \mathbf{x}_i))\bigr). \qquad (1)$$
∗Part of the work was done when these authors were in National Taiwan University.
Publisher  Advertiser    +    −
ESPN       Nike         80   20
ESPN       Gucci        10   90
ESPN       Adidas        0    1
Vogue      Nike         15   85
Vogue      Gucci        90   10
Vogue      Adidas       10   90
NBC        Nike         85   15
NBC        Gucci         0    0
NBC        Adidas       90   10

Table 1: An artificial CTR data set, where + (−) represents the number of clicked (unclicked) impressions.
In problem (1), $\lambda$ is the regularization parameter, and in the loss function we consider the linear model:
$$\phi_{\text{LM}}(\mathbf{w}, \mathbf{x}) = \mathbf{w} \cdot \mathbf{x}.$$
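To make (1) concrete, the sketch below evaluates the regularized logistic loss of a linear model on a tiny sparse data set. This is a minimal illustration assuming a NumPy/SciPy setting; the function name and data are our own, not part of the released package.

```python
import numpy as np
from scipy import sparse

def logistic_loss_linear(w, X, y, lam):
    """Objective (1) with the linear model phi_LM(w, x) = w . x.

    w   : (n,) weight vector
    X   : (m, n) sparse feature matrix, one row per instance
    y   : (m,) labels in {-1, +1}
    lam : regularization parameter lambda
    """
    margins = y * (X @ w)                       # y_i * phi_LM(w, x_i)
    loss = np.sum(np.logaddexp(0.0, -margins))  # sum_i log(1 + exp(-y_i * phi))
    return 0.5 * lam * (w @ w) + loss

# Tiny example: 3 instances, 4 binary features.
X = sparse.csr_matrix(np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [1, 1, 0, 0]], dtype=float))
y = np.array([1.0, -1.0, 1.0])
w = np.zeros(4)
print(logistic_loss_linear(w, X, y, lam=0.1))  # 3 * log(2) at w = 0
```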
Learning the effect of feature conjunctions seems to be crucial for CTR prediction; see, for example, [1]. To better understand feature conjunctions, consider the artificial data set in Table 1. Ads from Gucci have a particularly high CTR on Vogue (90/(90+10) = 90%). This information is difficult for linear models to learn, because they learn the weights of Gucci and Vogue separately. To address this problem, two models have been used to learn the effect of feature conjunctions. The first, degree-2 polynomial mappings (Poly2) [4, 5], learns a dedicated weight for each feature conjunction. The second, factorization machines (FMs) [6], learns the effect of each feature conjunction by factorizing it into a product of two latent vectors. We will discuss details about Poly2 and FMs in Section 2; a sketch of their prediction functions follows below.
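As a preview of Section 2, the following sketch contrasts the two prediction functions on sparse features. The function names, the dictionary-of-weights representation for Poly2, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from itertools import combinations

def phi_poly2(W, x):
    """Poly2: a dedicated weight W[(j1, j2)] for every feature pair.

    W : dict mapping a feature-index pair to a scalar weight
    x : dict mapping feature index -> value (sparse representation)
    """
    return sum(W.get((j1, j2), 0.0) * x[j1] * x[j2]
               for j1, j2 in combinations(sorted(x), 2))

def phi_fm(V, x):
    """FM: the pair weight is factorized into the dot product of two
    k-dimensional latent vectors, V[j1] . V[j2]."""
    return sum((V[j1] @ V[j2]) * x[j1] * x[j2]
               for j1, j2 in combinations(sorted(x), 2))

# An impression with Publisher=Vogue and Advertiser=Gucci,
# encoded as feature indices 0 and 1 with binary values.
x = {0: 1.0, 1: 1.0}
W = {(0, 1): 0.5}                 # Poly2: one weight for this pair
V = np.array([[0.1, 0.2],         # FM: latent vectors with k = 2
              [0.3, 0.4]])
print(phi_poly2(W, x))  # 0.5
print(phi_fm(V, x))     # 0.1*0.3 + 0.2*0.4 = 0.11
```

The contrast drives the rest of the paper: Poly2 needs data for every individual pair, while FM shares latent vectors across pairs, which helps when conjunctions are rarely or never observed.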
A variant of FM called pairwise interaction tensor factorization (PITF) [7] was proposed for personalized tag recommendation. In KDD Cup 2012, a generalization of PITF called "factor model" was proposed by "Team Opera Solutions" [8]. Because this term is too general and may easily be confused with factorization machines, we refer to it as "field-aware factorization machines" (FFMs) in this paper. The difference between PITF and FFM is that PITF considers three special fields, "user," "item," and "tag," while FFM is more general. Because [8] is about the overall solution for the competition, its discussion of FFM is limited. We can conclude the following results from [8]: