Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4946–4951
Brussels, Belgium, October 31 - November 4, 2018.
©2018 Association for Computational Linguistics
The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification
Jing Chen†, Qingcai Chen#∗, Xin Liu†, Haijun Yang‡, Daohe Lu‡, Buzhou Tang†
†# Shenzhen Calligraphy Digital Simulation Technology Lab, Harbin Institute of Technology, Shenzhen, China
‡ WeBank Inc.
† {mcdh.chenjing,hit.liuxin,tangbuzhou}@gmail.com
# qingcai.chen@hit.edu.cn
‡ {navyyang,leslielu}@webank.com
Abstract
This paper introduces the Bank Question (BQ) corpus, a Chinese corpus for sentence semantic equivalence identification (SSEI). The BQ corpus contains 120,000 question pairs drawn from one year of online bank customer service logs. To efficiently process and annotate questions from such large-scale logs, this paper proposes a clustering-based annotation method to group questions with the same intent. First, the de-duplicated questions with the same answer are clustered into stacks by the Word Mover's Distance (WMD) based Affinity Propagation (AP) algorithm. Then, annotators are asked to assign the clustered questions to different intent categories. Finally, the positive and negative question pairs for SSEI are selected from within the same intent category and between different intent categories, respectively. We also present the benchmark performance of six SSEI methods on our corpus, including state-of-the-art algorithms. As the largest manually annotated public Chinese SSEI corpus in the bank domain, the BQ corpus is not only useful for Chinese question semantic matching research, but also a significant resource for cross-lingual and cross-domain SSEI research. The corpus is publicly available¹.
1 Introduction
As the semantic matching task, sentence semantic
equivalence identification (SSEI) is a fundamen-
tal task of natural language processing (NLP) in
question answering (QA), automatic customer ser-
vice and chat-bots. In customer service systems,
two questions are defined as semantically equiva-
lent if they convey the same intent or they could
be answered by the same answer. Because of rich
expressions in natural languages, SSEI is really a
challenging NLP task.
∗ Corresponding author
¹ http://icrc.hitsz.edu.cn/Article/show/175.html
Compared with other NLP tasks, the lack of large-scale SSEI corpora is one of the biggest obstacles to SSEI algorithm development. To address this issue, several corpora have been provided in recent years, including the Microsoft Research Paraphrase (MSRP) Corpus (Dolan et al., 2004; Dolan and Brockett, 2005), the Twitter Paraphrase Corpus (PIT-2015 corpus) (Xu et al., 2014, 2015), the Twitter URL corpus (Lan et al., 2017), and the Quora dataset².
In the early stage, the MSRP corpus was used to validate paraphrase identification algorithms based on sets of linguistic features (Kozareva and Montoyo, 2006; Mihalcea et al., 2006; Rus et al., 2008). MSRP was then also used to validate deep models for a long period. Deep convolutional neural networks (DCNNs), recurrent neural networks (RNNs), and their variants, such as Arc-I, Arc-II, and BiMPM, have been developed and verified on it, even though it contains only thousands of sentence pairs (Hu et al., 2014; Yin and Schütze, 2015; Wang et al., 2016, 2017). In 2015, SemEval released a larger corpus, the PIT-2015 corpus, for paraphrase and semantic similarity identification tasks. On this corpus, participants adopted SVM classifiers, logistic regression models, referential translation machines (RTM), and neural networks (Xu et al., 2015). In 2017, a large-scale SSEI corpus named Quora was released, which greatly boosted the development of deep matching algorithms. Tomar et al. (2017) proposed a variant of the decomposable attention model. Gong et al. (2018) proposed a Densely Interactive Inference Network (DIIN) that hierarchically extracts semantic features from the interaction space. However, the Quora corpus comes from social network sites. Consider-
² https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
³ 57,037 of these pairs are manually labeled.