IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 4
researchers argued that these neural graph based CF models
differ from the classical GNNs as CF models do not contain
any user or item features, and directly borrowing complex
steps such as embedding transformation, and non-linear
activations in GNNs may not be a good choice. Simplified
neural graph CF models, including LR-GCCF [32], and
LightGCN [33] have been proposed, which eliminate
unnecessary deep learning operations. These simplified
neural graph based models show superior performance in
practice without the need of carefully chosen activation
functions.
2.2 Interaction Modeling
Let $\mathbf{p}_u$ and $\mathbf{q}_i$ denote the learned embeddings of
user $u$ and item $i$ from the representation models. This component
aims at interaction function modeling, which estimates the
user’s preference towards the target item based on their
representations. In the following, we describe how to
model the users’ predicted preference, denoted as $\hat{r}_{ui}$, based
on the learned embeddings. For ease of explanation, as
shown in Table 2, we summarize three main categories
for interaction modeling: classical inner product based
approaches, distance based modeling and neural network
based approaches.
Most previous recommendation models relied on
the inner product between user embedding and item
embedding to estimate the user-item pair score as:
$\hat{r}_{ui} = \mathbf{p}_u^\top \mathbf{q}_i = \sum_{f=1}^{d} p_{uf} q_{if}$.
Despite its great success and
simplicity, prior work suggests that relying solely on the
inner product has two major limitations. First, the
triangle inequality is violated [38]. That is, the inner product
only encourages the representations of users and their historical
items to be similar, but lacks guarantees for similarity
propagation across user-user and item-item relationships.
Second, it models only linear interactions, and may fail
to capture the complex relationships between users and
items [41].
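As a minimal sketch of the inner product score above, the following snippet ranks a few items for one user. The embedding values and the names `inner_product`, `items`, and `ranking` are illustrative assumptions, not taken from any cited model.

```python
def inner_product(p_u, q_i):
    """Predicted preference r_hat_ui = sum over f of p_uf * q_if."""
    if len(p_u) != len(q_i):
        raise ValueError("embeddings must share the same dimension d")
    return sum(p * q for p, q in zip(p_u, q_i))

# Hypothetical toy embeddings with d = 2 (values chosen for illustration).
p_u = [0.5, 2.0]
items = {"i1": [2.0, 1.0], "i2": [0.0, 3.0], "i3": [1.0, 1.0]}

# Score every candidate item, then rank items by descending score.
scores = {name: inner_product(p_u, q) for name, q in items.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the score is a plain sum of products over the embedding dimensions, it is linear in each embedding, which is the second limitation noted above.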
2.2.1 Distance based Metrics
To address the first issue, a line of research [38], [39],
[40] borrows ideas from translation principles and uses
a distance metric as the interaction function. The inherent
triangle inequality assumption plays an important role in
helping capture the underlying relationships among users
and items. For instance, if user u tends to purchase items i
and j, the representations of i and j should be close in the
latent space.
Towards this end, CML [38] minimizes the distance $d_{ui}$
between each user-item interaction $\langle u, i \rangle$ in Euclidean
space as: $d_{ui} = \|\mathbf{p}_u - \mathbf{q}_i\|_2^2$. Instead of minimizing the
distance between each observed user-item pair, TransRec
exploits the translation principle to model the sequential
behaviors of users [39]. In particular, the representation
of user $u$ is treated as the translation vector from the
representation of the current item $i$ to that of the item $j$
to visit next, namely, $\mathbf{q}_i + \mathbf{p}_u \approx \mathbf{q}_j$.
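The two distance based scores described above can be sketched as follows. This is an illustrative sketch only: the function names and toy embeddings are assumptions, and the translation distance is written generically in terms of a previous item and a next item rather than as the exact objective of [39].

```python
def squared_euclidean(x, y):
    """Squared L2 distance ||x - y||_2^2 between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def cml_distance(p_u, q_i):
    """CML-style distance d_ui; a smaller value means a stronger preference."""
    return squared_euclidean(p_u, q_i)

def translation_distance(q_prev, p_u, q_next):
    """Distance between the translated previous item (q_prev + p_u) and q_next."""
    translated = [a + b for a, b in zip(q_prev, p_u)]
    return squared_euclidean(translated, q_next)

# Hypothetical toy embeddings with d = 2.
p_u = [1.0, 1.0]
q_i = [1.0, 2.0]
q_j = [2.0, 3.0]

d_ui = cml_distance(p_u, q_i)                   # 0^2 + (-1)^2
d_trans = translation_distance(q_i, p_u, q_j)   # q_i + p_u lands exactly on q_j
```

In both cases items are ranked by ascending distance, the opposite direction from inner product scores.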
Unlike CML, whose simple metric learning assumes that
each user’s embedding is equally close to every
item embedding she likes, LRML introduces a relation
vector $\mathbf{e}$ to capture the relationship between each
user-item pair [40]. More formally, the score function is defined as:
$$s_{ui} = \|\mathbf{p}_u + \mathbf{e} - \mathbf{q}_i\|_F^2, \tag{6}$$
where the relation vector $\mathbf{e} \in \mathbb{R}^{d}$ is constructed using
a neural attention mechanism over a memory matrix $\mathbf{M}$.
$\mathbf{M} \in \mathbb{R}^{m \times d}$ is the trainable memory module, hence $\mathbf{e}$ is the
attentive sum of the $m$ memory slots. As a result, the relation
vectors not only ensure the triangle inequality, but also
achieve better representation ability.
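To make the attentive sum in Eq. (6) concrete, here is a small sketch. The text above only states that the relation vector is an attention-weighted sum of memory slots; the specific attention form used here (dot products between the element-wise product of the two embeddings and one key vector per slot, followed by a softmax) and the names `keys`, `relation_vector`, and `lrml_score` are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [v / total for v in exps]

def relation_vector(p_u, q_i, keys, memory):
    """Return (e, weights), where e is the attention-weighted sum of memory slots."""
    # Assumed attention input: element-wise product of user and item embeddings.
    joint = [p * q for p, q in zip(p_u, q_i)]
    logits = [sum(j * k for j, k in zip(joint, key)) for key in keys]
    weights = softmax(logits)
    d = len(memory[0])
    e = [sum(w * slot[f] for w, slot in zip(weights, memory)) for f in range(d)]
    return e, weights

def lrml_score(p_u, q_i, e):
    """Score s_ui = ||p_u + e - q_i||^2 as in Eq. (6)."""
    return sum((p + r - q) ** 2 for p, r, q in zip(p_u, e, q_i))

# Hypothetical toy setup: d = 2, m = 2 memory slots.
p_u = [1.0, 0.0]
q_i = [1.0, 1.0]
keys = [[0.0, 0.0], [0.0, 0.0]]     # equal logits, so attention is uniform
memory = [[0.0, 2.0], [0.0, 0.0]]

e, weights = relation_vector(p_u, q_i, keys, memory)
s_ui = lrml_score(p_u, q_i, e)
```

With uniform attention the relation vector is the mean of the two slots, which here happens to close the gap between the user and item embeddings exactly.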
2.2.2 Neural network based Metrics
Distinct from the foregoing approaches, which employ linear
metrics, recent works adopt a diverse array of neural
architectures, including MLPs, Convolutional Neural
Networks (CNNs), and AEs, as the main building blocks to mine
complex and nonlinear patterns of user-item interactions.
Researchers have attempted to replace similarity
modeling between users and items with MLPs, as MLPs
are general function approximators that can model any
complex continuous function. NCF models
the interaction function between each user-item pair
with MLPs as: $\hat{r}_{ui} = f_{MLP}(\mathbf{p}_u \,\|\, \mathbf{q}_i)$, where $\|$
denotes vector concatenation. Besides, NCF also
incorporates a generic MF component into the interaction
modeling, thereby making use of both linearity of MF and
non-linearity of MLP to enhance recommendation quality.
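The combination of an MLP branch with an MF-style branch can be sketched as a single forward pass. This is a simplified illustration under stated assumptions: one hidden layer, randomly initialized (untrained) weights, a sigmoid output, and embeddings shared between the two branches for brevity. The names `init_params` and `ncf_score` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(d, hidden, seed=0):
    """Random, untrained weights for a one-hidden-layer sketch."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(hidden + d)],
        "b_out": 0.0,
    }

def ncf_score(p_u, q_i, params):
    # MLP branch: ReLU hidden layer over the concatenation p_u || q_i.
    concat = p_u + q_i
    hidden = [max(0.0, sum(w * x for w, x in zip(row, concat)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    # MF-style branch: element-wise product of the two embeddings.
    gmf = [p * q for p, q in zip(p_u, q_i)]
    # Fuse both branches and map to a score in (0, 1).
    fused = hidden + gmf
    logit = sum(w * x for w, x in zip(params["w_out"], fused)) + params["b_out"]
    return sigmoid(logit)

# Hypothetical toy embeddings with d = 2.
p_u = [0.2, -0.1]
q_i = [0.4, 0.3]
params = init_params(d=2, hidden=3)
r_hat = ncf_score(p_u, q_i, params)
```

If the output weights on the hidden units are zero and those on the product branch are one, the logit reduces to the inner product, which shows how the linear model remains a special case.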
Researchers have also proposed leveraging CNN based
architectures for interaction modeling. These
models first generate interaction maps via the outer product
of user and item embeddings, explicitly capturing the
pairwise correlations between embedding dimensions [42],
[43]. These CNN based CF models focus on high-order
correlations among representation dimensions. However,
such performance improvements come at the cost of
increased model complexity and time cost.
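The interaction map itself is easy to illustrate. The sketch below only builds the outer product; the convolutional layers that the cited models apply on top of it are omitted, and the function name and values are assumptions for illustration.

```python
def interaction_map(p_u, q_i):
    """Outer product: entry [a][b] equals p_u[a] * q_i[b], giving a d x d map."""
    return [[a * b for b in q_i] for a in p_u]

# Hypothetical toy embeddings with d = 2.
p_u = [1.0, 2.0]
q_i = [3.0, 4.0]

E = interaction_map(p_u, q_i)
# The diagonal holds the per-dimension products used by the inner product;
# off-diagonal entries pair different embedding dimensions.
diagonal_sum = sum(E[f][f] for f in range(len(p_u)))
```

The map has d squared entries per user-item pair, compared with d terms for the inner product, which is one source of the added cost mentioned above.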
Besides, a line of research exploits AEs to fill in the
blanks of the user-item interaction matrix directly in the
decoder part [20], [21], [22], [23], [44], [45], [46]. As
the encoder and decoder can be implemented via deep
neural networks, such stacks of nonlinear transformations
give the recommenders more capacity to model the user
representation from complex combinations of all historically
interacted items.
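As a rough sketch of this idea, the snippet below encodes one user's interaction vector and decodes it back into a score for every item, so that unobserved entries receive predictions. The weights are random and untrained, and the layer sizes, sigmoid activations, and names (`init_layer`, `reconstruct`) are assumptions rather than the design of any cited model.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_out, n_in, rng):
    """Random, untrained weights for one dense layer."""
    return {
        "W": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
        "b": [0.0] * n_out,
    }

def dense(layer, x):
    """Dense layer with sigmoid activation."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(layer["W"], layer["b"])]

def reconstruct(x_u, encoder, decoder):
    """Encode the interaction vector, then decode scores for all items."""
    hidden = dense(encoder, x_u)
    return dense(decoder, hidden)

rng = random.Random(0)
n_items, k = 5, 2
encoder = init_layer(k, n_items, rng)
decoder = init_layer(n_items, k, rng)

# Hypothetical user who interacted with items 0 and 2.
x_u = [1.0, 0.0, 1.0, 0.0, 0.0]
r_hat = reconstruct(x_u, encoder, decoder)

# Recommend the unobserved item with the highest reconstructed score.
unseen = [i for i, x in enumerate(x_u) if x == 0.0]
top_unseen = max(unseen, key=lambda i: r_hat[i])
```

In a trained model the weights would be fitted to reconstruct the observed entries, so the ranking here is meaningless beyond showing the data flow.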
3 CONTENT-ENRICHED RECOMMENDATION
Besides the general user-item interaction information,
recommendation problems are often accompanied by
auxiliary data. The auxiliary data can be classified into
two categories: content based information and context-aware
data. Specifically, the first category, content
information, is associated with users and items, including
general user and item features, textual content (e.g., item
tags, item textual descriptions, and users’ reviews for items),
multimedia descriptions (e.g., images, videos, and audio
information), user social networks, and knowledge graphs.
In contrast, contextual information describes the environment
in which users make item decisions, and usually denotes
descriptions that go beyond users and items [2]. Contextual
information includes time, location, and specific data
that are collected from sensors (such as speed, and