A Large-Scale Chinese Natural Language Inference and
Semantic Similarity Calculation Dataset
Abstract. Natural language inference (NLI) and semantic similarity calculation
are fundamental research tasks in the field of natural language processing (NLP).
In recent years, NLP technology based on deep learning has achieved great success.
However, because of the complex structure of deep neural networks, a large
amount of training data is needed to avoid overfitting. To address the limited scale
of existing Chinese datasets for these two tasks, this paper constructs a Chinese
Natural language inference and Sentence similarity calculation Dataset (CNSD).
CNSD is built from four datasets with different characteristics and contains
2,195,000 sentence pairs, making it the first million-scale Chinese dataset in the
field of NLI and semantic similarity calculation. In this paper, the deep neural
network model BERT is applied to these tasks on the dataset, and the results are
reported as its baseline. This baseline will serve as a reference for future NLP
research based on CNSD, which will be made available for download to support
Chinese NLP research.
Keywords: Natural language inference, Sentence similarity, Chinese dataset,
BERT for classification.
1 Introduction
Natural language inference (NLI) is one of the four basic tasks of natural language processing
(NLP) and belongs to sentence relation judgment [1]. NLI uses models to determine
whether a sentence pair (premise, hypothesis) exhibits an entailment, contradiction, or neutral
relation. Semantic similarity calculation measures the resemblance between
texts [2]. A growing number of deep learning models possess a certain degree of semantic
learning ability. Both NLI and semantic similarity calculation need to capture low-level
syntactic information and high-level semantic information of a text and then judge the relation
between sentence pairs. The difference is that NLI concerns a one-way implication relation,
whereas semantic similarity calculation concerns a two-way similarity relation. An implication
relation can, however, be converted into a similarity relation: entailment in NLI corresponds to label 1
in semantic similarity calculation, i.e., the two sentences have similar meanings. This conversion
alleviates the lack of annotated data for semantic similarity calculation to
some extent, and NLI datasets have accordingly been applied to semantic similarity calculation.
It should be noted that a similarity relation cannot be converted into an implication relation.
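To make this label conversion concrete, the sketch below maps three-way NLI labels to binary similarity labels (entailment to 1, contradiction and neutral to 0). The label names and data layout here are illustrative assumptions, not the actual CNSD file format.

# Minimal sketch (illustrative, not the CNSD format): converting three-way
# NLI labels into binary similarity labels. The mapping is one-way only;
# similarity labels cannot be mapped back to NLI labels.
NLI_TO_SIMILARITY = {
    "entailment": 1,      # entailed pairs are treated as similar
    "contradiction": 0,   # contradicting pairs are treated as dissimilar
    "neutral": 0,         # neutral pairs are treated as dissimilar
}

def nli_to_similarity_pairs(nli_examples):
    """Convert (premise, hypothesis, nli_label) triples into
    (sentence1, sentence2, similarity_label) triples."""
    return [(p, h, NLI_TO_SIMILARITY[label]) for p, h, label in nli_examples]

# Example usage with two hypothetical sentence pairs.
examples = [
    ("A man is playing a guitar.", "A person is playing an instrument.", "entailment"),
    ("A man is playing a guitar.", "Nobody is playing music.", "contradiction"),
]
print(nli_to_similarity_pairs(examples))  # similarity labels: 1 and 0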
As is well known, data is the basis of research. The NLP community has been dedicated
to preserving and standardizing existing information in the world. Clearly, more real data
helps to enhance the robustness and accuracy of models. However, we
find that existing large-scale datasets are mainly written in English. Chinese datasets