Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4946–4951
Brussels, Belgium, October 31 - November 4, 2018.
©2018 Association for Computational Linguistics
The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification
Jing Chen†, Qingcai Chen#∗, Xin Liu†, Haijun Yang‡, Daohe Lu‡, Buzhou Tang†
†# Shenzhen Calligraphy Digital Simulation Technology Lab, Harbin Institute of Technology, Shenzhen, China
‡ WeBank Inc.
† {mcdh.chenjing,hit.liuxin,tangbuzhou}@gmail.com
# qingcai.chen@hit.edu.cn
‡ {navyyang,leslielu}@webank.com
Abstract
This paper introduces the Bank Question (BQ) corpus, a Chinese corpus for sentence semantic equivalence identification (SSEI). The BQ corpus contains 120,000 question pairs drawn from one year of online bank customer service logs. To efficiently process and annotate questions from such large-scale logs, this paper proposes a clustering-based annotation method to group questions with the same intent. First, the de-duplicated questions with the same answer are clustered into stacks by the Word Mover's Distance (WMD) based Affinity Propagation (AP) algorithm. Then, annotators are asked to assign the clustered questions to different intent categories. Finally, the positive and negative question pairs for SSEI are selected from within the same intent category and between different intent categories, respectively. We also present the benchmark performance of six SSEI methods on our corpus, including state-of-the-art algorithms. As the largest manually annotated public Chinese SSEI corpus in the bank domain, the BQ corpus is not only useful for Chinese question semantic matching research, but also a significant resource for cross-lingual and cross-domain SSEI research. The corpus is publicly available¹.
1 Introduction
As the semantic matching task, sentence semantic
equivalence identification (SSEI) is a fundamen-
tal task of natural language processing (NLP) in
question answering (QA), automatic customer ser-
vice and chat-bots. In customer service systems,
two questions are defined as semantically equiva-
lent if they convey the same intent or they could
be answered by the same answer. Because of rich
expressions in natural languages, SSEI is really a
challenging NLP task.
∗ Corresponding author
¹ http://icrc.hitsz.edu.cn/Article/show/175.html
Compared with other NLP tasks, the lack of large-scale SSEI corpora is one of the biggest obstacles to SSEI algorithm development. To address this issue, several corpora have been provided in recent years, including the Microsoft Research Paraphrase (MSRP) Corpus (Dolan et al., 2004; Dolan and Brockett, 2005), the Twitter Paraphrase Corpus (PIT-2015 corpus) (Xu et al., 2014, 2015), the Twitter URL corpus (Lan et al., 2017), and the Quora dataset².
In the early stage, the MSRP corpus was used to validate paraphrase identification algorithms based on sets of linguistic features (Kozareva and Montoyo, 2006; Mihalcea et al., 2006; Rus et al., 2008). MSRP was then also used to validate deep models for a long period. Deep convolutional neural networks (DCNNs), recurrent neural networks (RNNs), and their variants, such as Arc-I, Arc-II, and BiMPM, have been developed and verified on it, even though it contains only thousands of sentence pairs (Hu et al., 2014; Yin and Schütze, 2015; Wang et al., 2016, 2017). In 2015, SemEval released a larger corpus, the PIT-2015 corpus, for paraphrase and semantic similarity identification tasks. On this corpus, participants adopted SVM classifiers, logistic regression models, referential translation machines (RTM), and neural networks (Xu et al., 2015). In 2017, a large-scale SSEI corpus named Quora was released, which greatly boosted the development of deep matching algorithms. Tomar et al. (2017) proposed a variant of the decomposable attention model. Gong et al. (2018) proposed a Densely Interactive Inference Network (DIIN) that hierarchically extracts semantic features from the interaction space. However, the Quora corpus comes from social network sites. Consider-
² https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
³ 57,037 of these pairs are manually labeled.