Discrimination of Chinese Quantitative Style Features
Based on Text Clustering
Hou Renkui, Jiang Minghu
Lab. of Computational Linguistics, School of Humanities and Social Sciences,
Tsinghua University, Beijing 100084, China
hourk0917@163.com, jiang.mh@tsinghua.edu.cn
Abstract—The styles of “News Broadcast” and “Qiang Qiang
Conversation between Three Individuals” differ: the former is a
broadcast style, while the latter is conversational. This paper
collects a corpus of both programs and selects sentence length,
word length and sentence-initial word POS as the features used
to generate text vectors. The texts are then clustered using
the Euclidean distance and Ward's algorithm. The analysis shows
that sentence length, word length and sentence-initial word POS
can serve as quantitative stylistic features of Chinese.
Keywords- text clustering, type of writing, sentence length,
word length, sentence-initial word POS
I. INTRODUCTION
Style is both the starting point and the result of linguistic
performance, and it is formed in a specific context. It is a
functional variety of speech that reflects its object in a
particular way, using linguistic means chosen according to the
context [1]. In communication, a speaker constructs a discourse
genre by choosing stylistic means suited to the context, using
specific expressions together with a large amount of
stylistically neutral language material. The use of linguistic
means involves both necessity and chance, and this combination
can be described by probability. Quantitative analysis allows
the linguistic features of a style to be explained more
objectively and scientifically. A style is formed by the
frequencies of language units, and the regularities of those
units are the basis of stylistic analysis [2]. Stylistic means
are reflected in the statistics of language units, so the
distribution of linguistic features can be regarded as the
basis of a language style [2].
Text clustering is an unsupervised text mining method in which
similar elements are placed in the same group and dissimilar
elements in different groups [3]. As a form of cluster
analysis, it shares the characteristics of that statistical
technique: the number and structure of the categories are not
known in advance, and clustering is based on the similarity or
dissimilarity between objects. This similarity is treated as a
"distance" measure between objects. Objects that are close to
each other are assigned to the same class, and objects that are
far apart are assigned to different classes.
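The grouping-by-distance idea above can be illustrated with a minimal agglomerative sketch. It uses the Ward criterion named in the abstract, merging at each step the pair of clusters whose union gives the smallest increase in within-cluster sum of squares. The function names and the toy vectors are hypothetical and are not taken from the paper's implementation; the sketch also stops at a requested number of clusters for brevity, whereas a full hierarchical analysis would keep the whole merge tree.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clustering(vectors, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of indices
    into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest Ward merge cost.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost(
                [vectors[k] for k in clusters[pair[0]]],
                [vectors[k] for k in clusters[pair[1]]],
            ),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical 2-dimensional text vectors forming two tight groups.
toy_vectors = [[0.10, 0.90], [0.12, 0.88], [0.80, 0.20], [0.82, 0.18]]
groups = ward_clustering(toy_vectors, n_clusters=2)
```

This version recomputes centroids on every comparison, which is easy to read but slow on large corpora. For a corpus of a few dozen texts, as here, that cost is negligible.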
“News Broadcasting” belongs to the broadcast style [1, 4, 5],
in which there is no interaction between the hosts. “Qiang
Qiang” belongs to the conversational style, in which the host
and guests discuss hot social issues.
This paper selects sentence length, word length and
sentence-initial word POS as the feature representation of the
texts. Text clustering is then used to determine whether these
features can distinguish texts of the two styles, and therefore
whether they can serve as quantitative stylistic features.
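As a rough illustration of the three features, the sketch below counts them from text that has already been segmented and POS-tagged. The input layout (a list of sentences, each a list of word and tag pairs), the example sentences and the tag names are assumptions made for illustration. Measuring sentence length in words and word length in characters is also an assumption, since the units are not stated here.

```python
from collections import Counter

# Hypothetical input: each sentence is a list of (word, POS tag) pairs,
# as a segmenter and tagger might produce. The tags are illustrative.
tagged_text = [
    [("今天", "t"), ("天气", "n"), ("很", "d"), ("好", "a")],
    [("我们", "r"), ("讨论", "v"), ("社会", "n"), ("问题", "n")],
    [("他", "r"), ("来", "v")],
]


def extract_features(sentences):
    """Count sentence lengths, word lengths and sentence-initial POS tags."""
    # Sentence length measured in words (assumed unit).
    sentence_lengths = Counter(len(s) for s in sentences)
    # Word length measured in characters (assumed unit).
    word_lengths = Counter(len(w) for s in sentences for w, _ in s)
    # POS tag of the first word of each non-empty sentence.
    initial_pos = Counter(s[0][1] for s in sentences if s)
    return sentence_lengths, word_lengths, initial_pos


sent_len, word_len, first_pos = extract_features(tagged_text)
```

Each returned `Counter` is a raw frequency distribution. Turning these counts into comparable text vectors requires a normalization step.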
II. CORPUS COLLECTION, PREPROCESSING, TEXT
REPRESENTATION AND CLUSTERING ALGORITHM
The "News Broadcasting" corpus was collected from the
language resource monitoring and research center and covers
30 days; the "Qiang Qiang" corpus was collected from the ifeng
website and covers 31 days¹. Both are raw corpora.
Some tags in the corpus are not part of the linguistic
performance and need to be removed, such as the time stamps,
titles and blank lines in "News Broadcasting" and the speaker
marks in "Qiang Qiang". The cleaned texts are then segmented
into words and POS-tagged with the Chinese lexical analysis
system developed by the Institute of Computing Technology.
To represent a text, we choose certain language features,
compute their distributions, normalize them and generate a text
vector. The Euclidean distance between two text vectors is
calculated as in formula (1), where X = [x_1, x_2, ..., x_p]
and Y = [y_1, y_2, ..., y_p] represent two texts, and x_i and
y_i are their feature values.
$$Ed(X, Y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2} \qquad (1)$$
The smaller the Euclidean distance between two vectors, the
higher their similarity and the more likely they are to be
clustered together.
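The representation and distance steps can be sketched as follows. The text does not specify the normalization, so the sketch assumes each feature distribution is converted to relative frequencies. The function names and the two count tables are hypothetical.

```python
import math


def normalize(counts, keys):
    """Turn raw counts into relative frequencies over a fixed key order."""
    total = sum(counts.get(k, 0) for k in keys)
    if total == 0:
        return [0.0 for _ in keys]
    return [counts.get(k, 0) / total for k in keys]


def euclidean_distance(x, y):
    """Formula (1): square root of the summed squared differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension p")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical sentence-length counts for two texts (length -> occurrences).
text_a = {1: 2, 2: 6, 3: 2}
text_b = {1: 5, 2: 5}
keys = [1, 2, 3]  # shared feature order, so both vectors have dimension p = 3

x = normalize(text_a, keys)  # [0.2, 0.6, 0.2]
y = normalize(text_b, keys)  # [0.5, 0.5, 0.0]
distance = euclidean_distance(x, y)
```

Using one shared key order for every text matters: formula (1) compares x_i with y_i position by position, so both vectors must list the same features in the same order.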
¹ http://phtv.ifeng.com/program/qqsrx
___________________________________
978-1-4673-2197-6/12/$31.00 ©2012 IEEE