GLMix: Generalized Linear Mixed Models For Large-Scale
Response Prediction
Xianxing Zhang, Yitong Zhou, Yiming Ma, Bee-Chung Chen, Liang Zhang, Deepak Agarwal
LinkedIn
Mountain View, CA, USA
{xazhang, yizhou, yma, bchen, lizhang, dagarwal}@linkedin.com
ABSTRACT
Generalized linear models (GLMs) are a widely used class of
models for statistical inference and response prediction prob-
lems. For instance, in order to recommend relevant content
to a user or optimize for revenue, many web companies use
logistic regression models to predict the probability of the
user’s clicking on an item (e.g., ad, news article, job). In
scenarios where the data is abundant, having a more fine-
grained model at the user or item level would potentially
lead to more accurate prediction, as the user’s personal pref-
erences on items and the item’s specific attraction for users
can be better captured. One common approach is to
introduce ID-level regression coefficients in addition to the
global regression coefficients in a GLM setting, and such
models are called generalized linear mixed models (GLMix)
in the statistical literature. However, for big data sets with a
large number of ID-level coefficients, fitting a GLMix model
can be computationally challenging. In this paper, we
report how we successfully overcame the scalability bottleneck
by applying parallelized block coordinate descent under the
Bulk Synchronous Parallel (BSP) paradigm. We deployed
the model in the LinkedIn job recommender system, and
generated 20% to 40% more job applications for job seekers
on LinkedIn.
1. INTRODUCTION
Accurate prediction of users’ responses to items is one
of the core functions of many recommendation applications.
Examples include recommending movies, news articles, songs,
jobs, advertisements, and so forth. Given a set of features,
a common approach is to apply generalized linear models
(GLM). For example, when the response is numeric (e.g.,
rating of a user to a movie), a linear regression on the fea-
tures is commonly used to predict the response. For the
binary response scenario (e.g., whether to click or not when
a user sees a job recommendation), logistic regression is of-
ten used. Sometimes the response is a count (e.g., number
of times a user listens to a song), and Poisson regression be-
KDD ’16 Aug 13–17, 2016, San Francisco, CA, USA
© 2016 ACM. ISBN 978-1-4503-4232-2/16/08 . . . $15.00
DOI: http://dx.doi.org/10.1145/2939672.2939684
comes a natural choice. All of the above models are special
cases of GLM.
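To make the connection concrete, the three models above share the same linear predictor and differ only in the link function applied to it. The following sketch is illustrative only (the feature vectors and coefficients are made up, not taken from the paper):

```python
import math

def linear_predictor(x, w):
    """Inner product of features and coefficients, shared by all GLMs."""
    return sum(xi * wi for xi, wi in zip(x, w))

def predict_rating(x, w):
    # Linear regression: identity link, predicts a real-valued response
    # such as a user's rating of a movie.
    return linear_predictor(x, w)

def predict_click_probability(x, w):
    # Logistic regression: logit link, predicts P(click) in [0, 1].
    return 1.0 / (1.0 + math.exp(-linear_predictor(x, w)))

def predict_count(x, w):
    # Poisson regression: log link, predicts a non-negative expected
    # count such as the number of times a user listens to a song.
    return math.exp(linear_predictor(x, w))
```

With all coefficients at zero, the logistic model predicts a click probability of 0.5 and the Poisson model an expected count of 1, which is one way to sanity-check an implementation.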
The features available in recommender systems often in-
clude user features (e.g., age, gender, industry, job func-
tion) and item features (e.g., title and skills for jobs, title
and named entities for news articles). An approach that is
widely adopted in industry to model interactions between
users and items is to form the outer (cross) product of user
and item features, followed by feature selection to reduce the
dimensionality and mitigate the problem of overfitting. In
reality, we often observe substantial heterogeneity in the amount
of data per user or item that cannot be sufficiently modeled
by user/item features alone, which provides an opportunity
to improve model accuracy by adding more granularity to
the model. Specifically, for a user who has interacted with
many items in the past, we should have sufficient data to fit
regression coefficients that are specific to that user to cap-
ture his/her personal interests. Similarly, for an item that
has received many users’ responses, it is beneficial to model
its popularity and interactions with user features through
regression coefficients that are specific to the item.
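The outer (cross) product construction mentioned above can be sketched as follows, assuming a sparse dictionary representation of features (the feature names and the `__` separator are hypothetical; in practice this crossed space is then pruned by feature selection):

```python
def cross_features(user_features, item_features):
    """Outer product of two sparse feature dicts: one crossed feature
    per (user feature, item feature) pair, with multiplied values."""
    return {
        f"{uf}__{itf}": uv * iv
        for uf, uv in user_features.items()
        for itf, iv in item_features.items()
    }

# Hypothetical user and item feature vectors.
user = {"industry=software": 1.0, "seniority": 0.8}
item = {"title=engineer": 1.0}
crossed = cross_features(user, item)
# Yields 2 crossed features, e.g. "seniority__title=engineer" -> 0.8.
```

The dimensionality of the crossed space is the product of the two input dimensionalities, which is why feature selection is needed afterwards.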
One common approach to capture such behavior of each
individual user and item is to use ID-level features, i.e., the
outer product of user IDs and item features, and the outer
product of item IDs and user features. Models with ID-level
features are usually referred to as generalized linear mixed
models (GLMix) in Statistics [15]. Although conceptually
simple, this approach can generate a very large number of
regression coefficients to be learned. For example, for a data set of
10 million users, and each user with 1,000 non-zero coeffi-
cients on item features, the total number of regression coef-
ficients can easily go beyond 10^10. Therefore, fitting GLMix
models for big data is computationally challenging. Dimen-
sion reduction methods such as feature hashing [1] or princi-
pal component analysis can reduce the number of features.
However, they also reduce our ability to interpret the model
or explain the predictions in the original feature space (e.g.,
at the user’s ID level), making it difficult to debug or inves-
tigate system issues or user complaints.
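The ID-level setup described above can be summarized in a small sketch: the GLMix linear predictor adds a per-user term on item features and a per-item term on user features to the global GLM term. The helper names and feature values below are hypothetical, not the paper's notation:

```python
def dot(x, w):
    """Sparse inner product of a feature dict and a coefficient dict."""
    return sum(v * w.get(k, 0.0) for k, v in x.items())

def glmix_score(features, item_features, user_features,
                global_coef, per_user_coef, per_item_coef):
    """GLMix linear predictor: a global GLM term, plus user-specific
    coefficients on item features (personal preferences), plus
    item-specific coefficients on user features (item attraction)."""
    return (dot(features, global_coef)
            + dot(item_features, per_user_coef)
            + dot(user_features, per_item_coef))

# With 10 million users and ~1,000 non-zero coefficients per user, the
# per-user part alone holds 10_000_000 * 1_000 = 10**10 coefficients,
# which is the scalability challenge discussed above.
score = glmix_score(
    features={"bias": 1.0},
    item_features={"title=engineer": 1.0},
    user_features={"industry=software": 1.0},
    global_coef={"bias": 1.0},
    per_user_coef={"title=engineer": 0.5},
    per_item_coef={"industry=software": -0.2},
)
```

Because the per-user and per-item coefficient dicts are keyed by interpretable feature names, a prediction can still be explained in the original feature space, unlike with hashed or projected features.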
1.1 Our contributions
In this paper we develop a parallel block-wise coordinate
descent (PBCD) algorithm under the Bulk Synchronous Par-
allel (BSP) paradigm [21] for the GLMix model, with a novel
model score movement scheme that allows both the coeffi-
cients and the data to stay in local nodes. We show that
our algorithm successfully scales model fitting of GLMix for
very large data sets. We empirically demonstrate that this
conceptually simple class of models achieves high accuracy