Learning Continuous Word Embedding with Metadata for Question Retrieval in Community Question Answering
Guangyou Zhou¹, Tingting He¹, Jun Zhao², and Po Hu¹
¹School of Computer, Central China Normal University, Wuhan 430079, China
²National Laboratory of Pattern Recognition, CASIA, Beijing 100190, China
{gyzhou,tthe,phu}@mail.ccnu.edu.cn, jzhao@nlpr.ia.ac.cn
Abstract
Community question answering (cQA) has become an important research topic due to the popularity of cQA archives on the web. This paper is concerned with the problem of question retrieval. Question retrieval in cQA archives aims to find existing questions that are semantically equivalent or relevant to the queried questions. However, the lexical gap problem poses a new challenge for question retrieval in cQA. In this paper, we propose to learn continuous word embeddings with metadata of category information within cQA pages for question retrieval. To deal with the variable size of word embedding vectors, we employ the Fisher kernel framework to aggregate them into fixed-length vectors. Experimental results on a large-scale real-world cQA data set show that our approach significantly outperforms state-of-the-art translation models and topic-based models for question retrieval in cQA.
1 Introduction
Over the past few years, a large amount of user-generated content has become an important information resource on the web. This includes the traditional Frequently Asked Questions (FAQ) archives and the emerging community question answering (cQA) services, such as Yahoo! Answers (http://answers.yahoo.com/), Live QnA (http://qna.live.com/), and Baidu Zhidao (http://zhidao.baidu.com/). The content on these websites is usually organized as questions and lists of answers, associated with metadata like user-chosen categories for questions and askers' awards to the best answers. This data makes cQA archives valuable resources for various tasks like question answering (Jeon et al., 2005; Xue et al., 2008) and knowledge mining (Adamic et al., 2008).
One fundamental task for reusing content in
cQA is finding similar questions for queried ques-
tions, as questions are the keys to accessing the
knowledge in cQA. Then the best answers of
these similar questions will be used to answer the
queried questions. Many studies have been done
along this line (Jeon et al., 2005; Xue et al., 2008;
Duan et al., 2008; Lee et al., 2008; Bernhard and
Gurevych, 2009; Cao et al., 2010; Zhou et al.,
2011; Singh, 2012; Zhang et al., 2014a). One big
challenge for question retrieval in cQA is the lexi-
cal gap between the queried questions and the ex-
isting questions in the archives. Lexical gap means
that the queried questions may contain words that
are different from, but related to, the words in the
existing questions. For example, as shown in (Zhang et al., 2014a), for the queried question "how do I get knots out of my cats fur?", there
are good answers under an existing question “how
can I remove a tangle in my cat’s fur?” in Yahoo!
Answers. Although the two questions share few words in common, they have very similar meanings, making it hard for traditional retrieval models (e.g., BM25 (Robertson et al., 1994)) to determine their similarity. This lexical gap has become a major barrier preventing traditional IR models from retrieving similar questions in cQA.
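As a toy illustration (not part of the original paper), the lexical gap in this example pair can be made concrete with a simple token-overlap measure. Jaccard similarity over token sets is used here as a stand-in for the term-matching behavior of lexical models like BM25, not as the actual BM25 formula; the only content word the two questions share is "fur".

```python
import re

def tokenize(text):
    # Lowercase and split into word tokens, keeping apostrophes ("cat's")
    return set(re.findall(r"[a-z']+", text.lower()))

queried = tokenize("how do I get knots out of my cats fur?")
existing = tokenize("how can I remove a tangle in my cat's fur?")

common = queried & existing                      # shared tokens
jaccard = len(common) / len(queried | existing)  # overlap ratio

print(sorted(common))  # ['fur', 'how', 'i', 'my']
print(jaccard)         # 0.25
```

Apart from "fur", the overlap consists entirely of stopwords, so a purely lexical matcher has almost no signal that these two questions are near-paraphrases.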
Previous work addressing the lexical gap problem in cQA can be divided into two groups. The first group is the translation models,
which leverage the question-answer pairs to learn
the semantically related words to improve tradi-
tional IR models (Jeon et al., 2005; Xue et al.,
2008; Zhou et al., 2011). The basic assumption is
that question-answer pairs are “parallel texts” and
relationships between words (or phrases) can be established through word-to-word (or phrase-to-phrase)