A Large-Scale Chinese Natural Language Inference and
Semantic Similarity Calculation Dataset
Abstract. Natural language inference (NLI) and semantic similarity calculation
are fundamental research tasks in the field of natural language processing (NLP).
In recent years, NLP technology based on deep learning has achieved great success.
However, because of the complex structure of deep neural networks, a large
amount of training data is needed to avoid overfitting. To address the limited scale
of existing Chinese datasets for these two tasks, this paper constructs a Chinese
Natural language inference and Sentence similarity calculation Dataset (CNSD).
CNSD is built from four datasets with different characteristics and contains
2,195,000 sentence pairs, making it the first million-scale Chinese dataset in the
field of NLI and semantic similarity calculation. In this paper, the deep neural
network model BERT is applied to these tasks on the dataset, and the results are
reported as its baseline. This baseline will serve as a reference for future NLP
research based on CNSD, which will be made available for download to support
Chinese NLP research.
Keywords: Natural language inference, Sentence similarity, Chinese dataset,
BERT for classification.
1 Introduction
Natural language inference (NLI) is one of the four basic tasks of natural language processing
(NLP) and belongs to sentence relation judgment [1]. NLI uses models to determine
whether a sentence pair (premise, hypothesis) exhibits an entailment, contradiction, or neutral
relation. Semantic similarity calculation measures the resemblance between
texts [2]. A growing number of deep learning models possess a certain degree of semantic
learning ability. Both NLI and semantic similarity calculation need to capture low-level
syntactic information and high-level semantic information of a text and then judge the relation
between sentence pairs. The difference is that NLI concerns a one-way implication relation,
whereas semantic similarity calculation concerns a two-way similarity relation. An implication
relation can, however, be converted into a similarity relation: entailment in NLI corresponds to label 1
in semantic similarity calculation, i.e., the two sentences have similar meanings. This conversion
alleviates the lack of annotated data for semantic similarity calculation to
some extent, and NLI datasets have accordingly been applied to semantic similarity calculation.
It should be noted that a similarity relation cannot be converted into an implication relation.
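To make this label conversion concrete, the sketch below maps three-way NLI labels to binary similarity labels (entailment to 1, contradiction and neutral to 0). The label names and data layout here are illustrative assumptions, not the actual CNSD file format.

# Minimal sketch (illustrative, not the CNSD format): converting three-way
# NLI labels into binary similarity labels. The mapping is one-way only;
# similarity labels cannot be mapped back to NLI labels.
NLI_TO_SIMILARITY = {
    "entailment": 1,      # entailed pairs are treated as similar
    "contradiction": 0,   # contradicting pairs are treated as dissimilar
    "neutral": 0,         # neutral pairs are treated as dissimilar
}

def nli_to_similarity_pairs(nli_examples):
    """Convert (premise, hypothesis, nli_label) triples into
    (sentence1, sentence2, similarity_label) triples."""
    return [(p, h, NLI_TO_SIMILARITY[label]) for p, h, label in nli_examples]

# Example usage with two hypothetical sentence pairs.
examples = [
    ("A man is playing a guitar.", "A person is playing an instrument.", "entailment"),
    ("A man is playing a guitar.", "Nobody is playing music.", "contradiction"),
]
print(nli_to_similarity_pairs(examples))  # similarity labels: 1 and 0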
As is well known, data is the basis of research. The NLP community has been dedicated
to preserving and standardizing existing information in the world. Clearly, more real data
helps to enhance the robustness and accuracy of models. However, we
find that existing large-scale datasets are mainly written in English. Chinese datasets