J. Cent. South Univ. (2015) 22: 1833−1840
DOI: 10.1007/s11771-015-2702-8
WordNet-based lexical semantic classification for text corpus analysis
LONG Jun(龙军)1, WANG Lu-da(王鲁达)1, LI Zu-de(李祖德)1, ZHANG Zu-ping(张祖平)1, YANG Liu(杨柳)2
1. School of Information Science and Engineering, Central South University, Changsha 410075, China;
2. School of Software, Central South University, Changsha 410075, China
© Central South University Press and Springer-Verlag Berlin Heidelberg 2015
Abstract: Many text classification methods depend on statistical term measures for document representation. Such document
representations ignore the lexical semantic content of terms and the distilled mutual information, which leads to text classification errors.
This work proposes a document representation method, the WordNet-based lexical semantic VSM, to solve the problem. Using WordNet,
the method constructs a data structure of semantic-element information to characterize lexical semantic content, and adjusts EM
modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, a lexical-semantic eigenvector for document
representation is built by calculating the weight of each synset, and is applied to a widely recognized algorithm, NWKNN. On the text
corpus Reuters-21578 and an adjusted version of it with lexical replacement, the experimental results show that the lexical-semantic
eigenvector outperforms the term-statistic eigenvector based on TF-IDF in both F1 measure and scale of dimensions. Because the
document representation takes the form of an eigenvector, the method has broad prospects for classification applications in text corpus analysis.
Key words: document representation; lexical semantic content; classification; eigenvector
1 Introduction
Text corpus analysis is an important task, and
clustering and classification are its key procedures.
Text classification is also an active research area in
information retrieval, machine learning and natural
language processing. Classification algorithms based on
eigenvectors, such as KNN, SVM and ELM, prevail in
this field. Eigenvector-based document classification is a
widely used technology for text corpus analysis, and the
relevant classification algorithms and experiments are
typically based on the eigenvector of the document
representation. The key issue, moreover, is that
eigenvector-based classification algorithms depend on
the VSM [1].
TF-IDF (term frequency–inverse document
frequency) [2] is a prevalent method for characterizing
documents, and it is in essence a statistical term measure.
Many document representation methods based on
TF-IDF can construct a vector space model (VSM) of a
text corpus. Similarly, many other document
representation methods exploit statistical term measures,
such as Bag-of-Words [3] and Minwise hashing [4]. For
document representation, these methods are regarded as
statistical methods of feature extraction.
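As a minimal sketch of such a statistical term measure, the following Python fragment computes one common TF-IDF variant, in which the term frequency is the relative count of a term in a document and the inverse document frequency is log(N/df). The function name `tf_idf`, the weighting variant and the three-document toy corpus are assumptions made for illustration only; they are not taken from Refs. [1−4].

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical toy corpus, already tokenized by whitespace.
corpus = [
    "stocks rose as oil prices fell".split(),
    "oil prices rose sharply".split(),
    "the team won the match".split(),
]
vectors = tf_idf(corpus)
```

Each weight depends only on how often a term string occurs; nothing in the computation refers to what the term means.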
However, in the information retrieval field,
statistical term measures neglect lexical semantic content.
As a result, corpus analysis operates essentially at the
level of term strings and disregards lexical replacement
in the original document, so lexical replacement can
easily deceive text corpus analysis.
The semantic approach is an effective technology
for document analysis. It captures the semantic
features of the words under analysis and, based on them,
characterizes and classifies the document. The close
relationship between the syntax and the lexical semantic
content of words has attracted considerable interest in
both linguistics and computational linguistics.
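The contrast between term-string matching and a semantic view can be illustrated with a toy Python sketch. The hand-written table `TOY_SYNSETS` is a hypothetical stand-in for WordNet synsets, and the helper names are likewise assumptions; the sketch ignores polysemy and word-stem disambiguation, so it only illustrates the idea of lexical replacement and is not the method developed in this work.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical synonym groups standing in for WordNet synsets.
TOY_SYNSETS = {
    "buy": "S_acquire", "purchase": "S_acquire",
    "car": "S_automobile", "automobile": "S_automobile",
    "big": "S_large", "large": "S_large",
}

def term_vector(tokens):
    # Term-string representation: counts of the surface words.
    return Counter(tokens)

def concept_vector(tokens):
    # Map each word to its toy synset label; unknown words stay as-is.
    return Counter(TOY_SYNSETS.get(t, t) for t in tokens)

original = "buy big car".split()
replaced = "purchase large automobile".split()  # lexical replacement

term_sim = cosine(term_vector(original), term_vector(replaced))
concept_sim = cosine(concept_vector(original), concept_vector(replaced))
```

Here the two token lists share no term strings, so `term_sim` is zero, whereas `concept_sim` is approximately one because every word maps to the same toy synset label as its replacement.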
The design and implementation of WordNet-based
lexical semantic classification take particular account of
lexical semantic content. Unlike traditional statistical
methods of feature extraction, our work develops a new
term measure that characterizes lexical semantic
content, and provides a practical method of document
representation that can handle the impact of lexical
replacement. The document representation is normalized
as an eigenvector; consequently, it can be applied to
current VSM-dependent classification algorithms.
Theoretical analysis and a large number of experiments
Foundation item: Project(2012AA011205) supported by National High-Tech Research and Development Program (863 Program) of China;
Projects(61272150, 61379109, M1321007, 61301136, 61103034) supported by the National Natural Science Foundation of China;
Project(20120162110077) supported by Research Fund for the Doctoral Program of Higher Education of China; Project(11JJ1012)
supported by Excellent Youth Foundation of Hunan Scientific Committee, China
Received date: 2014−03−21; Accepted date: 2014−10−11
Corresponding author: WANG Lu-da, PhD Candidate, Lecturer; Tel: +86−18613082443; E-mail: wang_luda@csu.edu.cn