GLMix: Generalized Linear Mixed Models For Large-Scale
Response Prediction
Xianxing Zhang, Yitong Zhou, Yiming Ma, Bee-Chung Chen, Liang Zhang, Deepak Agarwal
LinkedIn
Mountain View, CA, USA
{xazhang, yizhou, yma, bchen, lizhang, dagarwal}@linkedin.com
ABSTRACT
Generalized linear models (GLMs) are a widely used class of
models for statistical inference and response prediction prob-
lems. For instance, in order to recommend relevant content
to a user or optimize for revenue, many web companies use
logistic regression models to predict the probability of the
user’s clicking on an item (e.g., ad, news article, job). In
scenarios where the data is abundant, having a more fine-
grained model at the user or item level would potentially
lead to more accurate prediction, as the user’s personal pref-
erences on items and the item’s specific attraction for users
can be better captured. One common approach is to
introduce ID-level regression coefficients in addition to the
global regression coefficients in a GLM setting, and such
models are called generalized linear mixed models (GLMix)
in the statistical literature. However, for big data sets with a
large number of ID-level coefficients, fitting a GLMix model
can be computationally challenging. In this paper, we
report how we successfully overcame the scalability bottleneck
by applying parallelized block coordinate descent under the
Bulk Synchronous Parallel (BSP) paradigm. We deployed
the model in the LinkedIn job recommender system, and
generated 20% to 40% more job applications for job seekers
on LinkedIn.
1. INTRODUCTION
Accurate prediction of users’ responses to items is one
of the core functions of many recommendation applications.
Examples include recommending movies, news articles, songs,
jobs, advertisements, and so forth. Given a set of features,
a common approach is to apply generalized linear models
(GLM). For example, when the response is numeric (e.g.,
rating of a user to a movie), a linear regression on the fea-
tures is commonly used to predict the response. For the
binary response scenario (e.g., whether to click or not when
a user sees a job recommendation), logistic regression is of-
ten used. Sometimes the response is a count (e.g., number
of times a user listens to a song), and Poisson regression be-
KDD ’16 Aug 13–17, 2016, San Francisco, CA, USA
© 2016 ACM. ISBN 978-1-4503-4232-2/16/08 . . . $15.00
DOI: http://dx.doi.org/10.1145/2939672.2939684
comes a natural choice. All of the above models are special
cases of GLM.
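To make the connection concrete, the three models above share the same linear predictor and differ only in the link function applied to it. The following sketch is illustrative only (the feature vectors and coefficients are made up, not taken from the paper):

```python
import math

def linear_predictor(x, w):
    """Inner product of features and coefficients, shared by all GLMs."""
    return sum(xi * wi for xi, wi in zip(x, w))

def predict_rating(x, w):
    # Linear regression: identity link, predicts a real-valued response
    # such as a user's rating of a movie.
    return linear_predictor(x, w)

def predict_click_probability(x, w):
    # Logistic regression: logit link, predicts P(click) in [0, 1].
    return 1.0 / (1.0 + math.exp(-linear_predictor(x, w)))

def predict_count(x, w):
    # Poisson regression: log link, predicts a non-negative expected
    # count such as the number of times a user listens to a song.
    return math.exp(linear_predictor(x, w))
```

With all coefficients at zero, the logistic model predicts a click probability of 0.5 and the Poisson model an expected count of 1, which is one way to sanity-check an implementation.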
The features available in recommender systems often in-
clude user features (e.g., age, gender, industry, job func-
tion) and item features (e.g., title and skills for jobs, title
and named entities for news articles). An approach that is
widely adopted in industry to model interactions between
users and items is to form the outer (cross) product of user
and item features, followed by feature selection to reduce the
dimensionality and mitigate the problem of overfitting. In
reality, we often observe substantial heterogeneity in the amount
of data per user or item that cannot be sufficiently modeled
by user/item features alone, which provides an opportunity
to improve model accuracy by adding more granularity to
the model. Specifically, for a user who has interacted with
many items in the past, we should have sufficient data to fit
regression coefficients that are specific to that user to cap-
ture his/her personal interests. Similarly, for an item that
has received many users’ responses, it is beneficial to model
its popularity and interactions with user features through
regression coefficients that are specific to the item.
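The outer (cross) product construction mentioned above can be sketched as follows, assuming a sparse dictionary representation of features (the feature names and the `__` separator are hypothetical; in practice this crossed space is then pruned by feature selection):

```python
def cross_features(user_features, item_features):
    """Outer product of two sparse feature dicts: one crossed feature
    per (user feature, item feature) pair, with multiplied values."""
    return {
        f"{uf}__{itf}": uv * iv
        for uf, uv in user_features.items()
        for itf, iv in item_features.items()
    }

# Hypothetical user and item feature vectors.
user = {"industry=software": 1.0, "seniority": 0.8}
item = {"title=engineer": 1.0}
crossed = cross_features(user, item)
# Yields 2 crossed features, e.g. "seniority__title=engineer" -> 0.8.
```

The dimensionality of the crossed space is the product of the two input dimensionalities, which is why feature selection is needed afterwards.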
One common approach to capture such behavior of each
individual user and item is to use ID-level features, i.e., the
outer product of user IDs and item features, and the outer
product of item IDs and user features. Models with ID-level
features are usually referred to as generalized linear mixed
models (GLMix) in Statistics [15]. Although conceptually
simple, this approach can generate a very large number of
regression coefficients to be learned. For example, for a data set of
10 million users, and each user with 1,000 non-zero coeffi-
cients on item features, the total number of regression coef-
ficients can easily go beyond 10^10. Therefore, fitting GLMix
models for big data is computationally challenging. Dimen-
sion reduction methods such as feature hashing [1] or princi-
pal component analysis can reduce the number of features.
However, they also reduce our ability to interpret the model
or explain the predictions in the original feature space (e.g.,
at the user’s ID level), making it difficult to debug or inves-
tigate system issues or user complaints.
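The ID-level setup described above can be summarized in a small sketch: the GLMix linear predictor adds a per-user term on item features and a per-item term on user features to the global GLM term. The helper names and feature values below are hypothetical, not the paper's notation:

```python
def dot(x, w):
    """Sparse inner product of a feature dict and a coefficient dict."""
    return sum(v * w.get(k, 0.0) for k, v in x.items())

def glmix_score(features, item_features, user_features,
                global_coef, per_user_coef, per_item_coef):
    """GLMix linear predictor: a global GLM term, plus user-specific
    coefficients on item features (personal preferences), plus
    item-specific coefficients on user features (item attraction)."""
    return (dot(features, global_coef)
            + dot(item_features, per_user_coef)
            + dot(user_features, per_item_coef))

# With 10 million users and ~1,000 non-zero coefficients per user, the
# per-user part alone holds 10_000_000 * 1_000 = 10**10 coefficients,
# which is the scalability challenge discussed above.
score = glmix_score(
    features={"bias": 1.0},
    item_features={"title=engineer": 1.0},
    user_features={"industry=software": 1.0},
    global_coef={"bias": 1.0},
    per_user_coef={"title=engineer": 0.5},
    per_item_coef={"industry=software": -0.2},
)
```

Because the per-user and per-item coefficient dicts are keyed by interpretable feature names, a prediction can still be explained in the original feature space, unlike with hashed or projected features.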
1.1 Our contributions
In this paper we develop a parallel block-wise coordinate
descent (PBCD) algorithm under the Bulk Synchronous Par-
allel (BSP) paradigm [21] for the GLMix model, with a novel
model score movement scheme that allows both the coeffi-
cients and the data to stay in local nodes. We show that
our algorithm successfully scales model fitting of GLMix for
very large data sets. We empirically demonstrate that this
conceptually simple class of models achieves high accuracy