Auto-Clustering of Conversation Corpus Based on
Syntactic, Semantic and Pragmatic Features
Baojian Chen, Minghu Jiang
Lab of Computational Linguistics, School of Humanities, Tsinghua University, Beijing, 100084, China
cbjchina@126.com, jiang.mh@tsinghua.edu.cn
Abstract—To understand natural language accurately, we not
only need to perform morphological and syntactic
analysis, but also need to combine semantic knowledge and
pragmatic information with a specific context. Because
conversation corpora offer little pragmatics-related knowledge
and background information, computers are still a long way
from fully understanding natural language. In this paper,
pragmatic features are added to the text vector space model
of a spoken conversation corpus, and hierarchical clustering is
performed. Our experimental results show that clustering
with pragmatic features outperforms clustering without
them: precision, recall and F value
increased by 6.67%, 6.34% and 6.6%, respectively. This
indicates that pragmatic information plays an important
role in improving the effect of text clustering.
Key words: text vector space model; pragmatic features;
hierarchical clustering
I. INTRODUCTION
To better understand natural language, morphological and
syntactic knowledge alone is not enough; semantic knowledge
and pragmatic information must also be combined
with a specific context. Pragmatics studies the
relationship between language and the environment in which
it is used, including the participants in a conversation, the
context, the effect of symbol usage in context and its
practical role. Although
pragmatic knowledge is an integral part of human language
understanding, over the past few decades natural language
understanding has relied almost entirely on syntactic and
semantic information and has largely ignored pragmatic
information. The bottleneck for pragmatic information is
how to extract features effectively and represent the
knowledge in a form that a computer can process.
Currently, describing pragmatic features for natural
language understanding is relatively difficult, mainly
because large-scale corpora suitable for pragmatics
research are lacking. In 1990, the University of Southern
California first used pragmatic information for natural
language generation. Subsequently, the Enron email corpus
(www.cs.cmu.edu/~enron/) was used for the extraction of
pragmatic features. It consists of 619446 e-mails
involving 158 Enron executives. It contains valuable and
rich internal communication records of a large, real business
organization, including a great deal of communication
between individuals and groups, and it also conveys
knowledge, perception, resources, tasks, events,
relationships and other social network data [6]. The Enron
corpus thus provides a valuable data resource for pragmatics
research.
McCallum et al. proposed the Author-Recipient-Topic (ART)
model. Through statistical learning on the Enron corpus, they
characterized which people usually communicate about which
topics, studying the relationship between message contents
and their senders and recipients in order to
construct a social network relating message
contents to their writers [7]. In fact, the understanding of
descriptive words that carry pragmatic information depends
on the context, the atmosphere between the two speakers, the
time, place and identity of the participants, and the
background knowledge shared during the conversation. Mining
pragmatic knowledge about the topic of a session, the
interactive atmosphere and the relationship between speakers
not only improves the effectiveness of feature extraction
but is itself an important research topic in text mining.
Clustering divides a data set into different classes
according to the characteristics of the data, and its
purpose is to place individuals that share the same features
in one category. There are many clustering methods,
including statistical methods, machine learning methods,
neural network methods, and database-oriented methods.
Text clustering is mainly based on the well-known clustering
hypothesis: documents of the same class are more similar to
one another than to documents of other classes. As an
unsupervised machine learning method, it groups a set of
objects according to a similarity measure and assigns
similar objects to the same group. Text clustering can
organize texts in an orderly way based on the connections and
relevance between documents, which makes it easy for people
to focus on related information in the documents.
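The grouping-by-similarity idea can be illustrated with a
minimal sketch. It is not the implementation used in this
paper: the term-frequency vectors, cosine similarity and
average linkage are assumptions chosen for illustration, the
function names (tf_vector, cosine, average_link,
agglomerative) are hypothetical, and the utterances are
invented toy data.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (whitespace tokens, an assumption)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_link(a, b, vectors):
    """Mean pairwise similarity between the members of clusters a and b."""
    total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(texts, k):
    """Bottom-up clustering: merge the two most similar clusters until k remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    vectors = [tf_vector(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with singletons
    while len(clusters) > k:
        best_pair, best_sim = None, -1.0
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_link(clusters[x], clusters[y], vectors)
                if sim > best_sim:
                    best_pair, best_sim = (x, y), sim
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy utterances, invented for illustration (not from the paper's corpus).
utterances = [
    "what time does the train leave",
    "the train will leave at noon",
    "i would like a cup of tea",
    "a cup of green tea please",
]
print(agglomerative(utterances, 2))  # prints [[0, 1], [2, 3]]
```

On this toy input the two train-related utterances and the
two tea-related utterances end up in separate clusters,
because utterances within each pair share more terms with
each other than with the other pair.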
In this paper, we use a spoken conversation corpus to
study the description and extraction of syntactic, semantic
and pragmatic features, and then use clustering
to demonstrate that pragmatic features play an
important role in language understanding.
II. TEXT CLUSTERING OF THE CONVERSATION CORPUS
BASED ON PRAGMATIC FEATURES
Text clustering first extracts text features and
then constructs a text vector space model. However, most
current methods for feature description and
extraction involve only syntactic and semantic