Advances and Challenges in Conversational Recommender Systems: A Survey
propose to integrate a knowledge graph into the interactive
recommendation to solve these problems.
However, directly asking about items is inefficient for building the user profile, because the candidate item set is large. In
real-world CRS applications, users will get bored as the num-
ber of conversation turns increases. It is more practical to
ask attribute-centric questions, i.e., to ask users whether they
like an attribute (or topic/category in some works), and then
make recommendations based on these attributes [207, 88].
Therefore, the estimation and utilization of a user’s prefer-
ences towards attributes become a key research issue.
2.2. Asking about Attributes
Asking about attributes is more efficient because whether
users like or dislike an attribute can significantly reduce the
recommendation candidates. The challenge is to determine
a sequence of attributes to ask so as to minimize the uncer-
tainty of current user needs [119, 164]. The aforementioned
critiquing-based methods fall into this category. Besides,
there are other kinds of methods; we introduce some mainstream branches below.
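To make the uncertainty-reduction idea concrete, here is a minimal sketch (ours, not any cited system's implementation): it greedily asks about the attribute whose answers split the remaining candidates most evenly (maximum entropy over the candidate set), then filters the candidates by the user's answer. All item data and function names are hypothetical.

```python
import math
from collections import Counter

def pick_attribute(candidates, attributes):
    """Pick the attribute whose value distribution over the remaining
    candidates has the highest entropy, i.e., whose answer is expected
    to shrink the candidate set the most (a common uncertainty heuristic)."""
    def entropy(attr):
        counts = Counter(item.get(attr) for item in candidates)
        total = len(candidates)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())
    return max(attributes, key=entropy)

def filter_candidates(candidates, attr, liked_value):
    """Keep only items consistent with the user's stated preference."""
    return [item for item in candidates if item.get(attr) == liked_value]

items = [
    {"color": "red", "price": "low"},
    {"color": "red", "price": "high"},
    {"color": "blue", "price": "low"},
    {"color": "black", "price": "high"},
]
attr = pick_attribute(items, ["color", "price"])   # "color" splits 2/1/1
remaining = filter_candidates(items, "color", "red")
```

A full CRS would interleave such questions with recommendations and stop once the candidate set, or the expected gain of another question, is small enough.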
2.2.1. Fitting Patterns from Historical Interaction
A conversation can be deemed as a sequence of entities
including consumed items and mentioned attributes, and the
objective is to learn to predict the next attribute to ask or the
next item to recommend. Therefore, sequential neural networks such as the gated recurrent unit (GRU) model [29] and the long short-term memory (LSTM) model [62] can be naturally adopted in this setting, thanks to their ability to capture long- and short-term dependencies in user behavioral patterns.
An exemplar work is the question & recommendation
(Q&R) model proposed by Christakopoulou et al. [31], where
the interaction between the system and a user is implemented
as a selection system. In each turn, the system asks the user
to choose one or more distinct topics (e.g., NBA, Comics, or
Cooking) from the given list, and then recommends items in
these topics to the user. It contains a trigger module to de-
cide whether to ask a question about attributes or to make
a recommendation. The triggering mechanism can be as simple as a random mechanism or more sophisticated, e.g., using criteria that capture the user's state, or it can even
be user-initiated. At the t-th time step, the next topic q that the user clicks can be predicted from the user's watching history e_1, …, e_T as P(q ∣ e_1, …, e_T). After the user clicks a topic q, the model can recommend an item r based on the conditional probability P(r ∣ e_1, …, e_T, q).
Both conditional probabilities are implemented with the GRU architecture [29]. This algorithm is deployed on YouTube to obtain preferences from cold-start users.
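The two conditional probabilities can be sketched as follows, assuming a single GRU encoder over the watched-entity sequence and two softmax heads. This is an illustrative NumPy toy with randomly initialized, untrained parameters, not the deployed Q&R model; all parameter names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOPICS, N_ITEMS, N_ENT = 16, 10, 50, 100

# Hypothetical, untrained parameters (names are ours, not from Q&R).
E = rng.normal(size=(N_ENT, DIM))         # watched-entity embeddings
T_emb = rng.normal(size=(N_TOPICS, DIM))  # topic embeddings
Wz, Uz = rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM))
Wr, Ur = rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM))
Wh, Uh = rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM))
W_topic = rng.normal(size=(DIM, N_TOPICS))
W_item = rng.normal(size=(2 * DIM, N_ITEMS))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gru_encode(history):
    """Run a GRU cell over the watched-entity sequence e_1, ..., e_T."""
    h = np.zeros(DIM)
    for idx in history:
        x = E[idx]
        z = sigmoid(Wz @ x + Uz @ h)             # update gate
        r = sigmoid(Wr @ x + Ur @ h)             # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
        h = (1 - z) * h + z * h_cand
    return h

history = [3, 17, 42, 8]                  # indices of watched entities
h = gru_encode(history)
p_topic = softmax(h @ W_topic)            # P(q | e_1, ..., e_T)
q = int(p_topic.argmax())                 # the topic the user picks (toy choice)
p_item = softmax(np.concatenate([h, T_emb[q]]) @ W_item)  # P(r | e_1..e_T, q)
```

In training, both heads would be fit with cross-entropy against observed clicks; here the point is only the shared history encoding and the topic-conditioned item distribution.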
Zhang et al. [207] propose a “System Ask User Response”
(SAUR) paradigm. For each item, they utilize the rich re-
view information and convert a sentence containing an aspect-
value pair to a latent vector via the GRU model. Then they
adopt a memory module with attention mechanism [158, 83,
118] to perform both the next question generation task (determining which attribute to ask) and the next item recommendation task. Like Q&R, they also develop a heuristic trigger to decide whether it is time to display the top-n recommended items to users or to keep asking questions about attributes. One limitation of this work is the assumption that all the information in reviews supports the purchasing behavior; this does not always hold, since users may complain about certain aspects of the purchased items, e.g., a user may write "64 Gigabytes is not enough". Using such information indiscriminately will mislead the model and deteriorate the performance.
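The memory-attention read at the heart of this design can be sketched as follows: review sentences, each encoded as an aspect-value vector, are stored in a memory, and a query derived from the conversation state attends over them. This is a generic dot-product attention sketch with hypothetical dimensions, not SAUR's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory):
    """Dot-product attention over a memory of sentence vectors:
    weights = softmax(memory @ query); read-out = weighted sum of memory rows."""
    weights = softmax(memory @ query)
    return weights @ memory, weights

rng = np.random.default_rng(1)
memory = rng.normal(size=(6, 8))  # 6 review sentences, each an 8-dim aspect-value vector
query = rng.normal(size=8)        # current conversation state
read, weights = attend(query, memory)
# `read` would feed both heads: which attribute to ask next and which items to rank.
```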
The utterances produced by the system, i.e., the questions, are constructed with predefined language patterns or templates, so the system only needs to attend to the aspect and the value. This is a common setting in state-of-the-art CRS studies, because the core task here is recommendation rather than language generation [31, 88, 89].
Note that these kinds of methods share a common disadvantage: learning from historical user behaviors does not help the model understand the logic behind the interaction. As interactive systems, these models do not consider how to react when users reject a recommendation; they merely fit the preferences observed in historical interactions, without an explicit strategy for handling different kinds of feedback.
2.2.2. Reducing Uncertainty
Unlike sequential neural network-based methods that do
not have an explicit strategy to handle all kinds of user feed-
back, some studies try to build a straightforward logic to nar-
row down item candidates.
Critiquing-based Methods. The aforementioned critiquing models are typically equipped with a heuristic tactic to elicit user preferences on attributes [23, 187, 107, 106]. In traditional critiquing models, the critique on an attribute value (e.g., "not red" for color or "less expensive" for price) is used to reconstruct the candidate set by removing the items with unsatisfied attributes [23, 116, 154, 171, 12, 153].
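A minimal sketch of this candidate-set reconstruction (hypothetical item data, not a specific cited system): an equality critique drops items matching the rejected value, and a comparative critique such as "less expensive" keeps items under a bound.

```python
def apply_critique(candidates, attr, disliked_value):
    """Equality critique, e.g., "not red": remove every candidate whose
    `attr` matches the value the user rejected."""
    return [item for item in candidates if item.get(attr) != disliked_value]

def apply_unit_critique(candidates, attr, bound):
    """Comparative critique, e.g., "less expensive": keep items below the bound."""
    return [item for item in candidates if item[attr] < bound]

items = [
    {"name": "A", "color": "red", "price": 120},
    {"name": "B", "color": "blue", "price": 80},
    {"name": "C", "color": "red", "price": 60},
]
remaining = apply_critique(items, "color", "red")     # user says "not red"
cheaper = apply_unit_critique(items, "price", 100)    # user says "less expensive"
```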
Neural vector-based methods instead take the critique into a latent vector, which is responsible for generating both the recommended items and the explained attributes. For example, Wu et al. [187] propose a critiquable-explainable neural collaborative filtering (CE-NCF) model. They use the
neural collaborative filtering model [60] to encode the preference of a user i for an item j as a latent vector z_{i,j}; then z_{i,j} is used to produce the rating score r_{i,j} as well as the explained attribute vector s_{i,j}. The attributes are composed of a set of key-phrases such as "golden, copper, orange, black, yellow," and each dimension of s_{i,j} corresponds to a certain attribute. When a user dislikes an attribute and critiques it in real-time feedback, the system updates the explained attribute vector s_{i,j} by setting the corresponding dimension to zero. The updated vector s̃_{i,j} is then used to update the latent vector z_{i,j} to z̃_{i,j}; consequently, the recommendation score is updated to r̃_{i,j}. Following this setting, Luo et al.
[107] change the base NCF model to be a variational autoen-
coder (VAE) model, and this generative model can help the
Gao et al.: Preprint submitted to Elsevier Page 6 of 30