WordNet-Driven Lexical Semantic Classification: Improving Text Corpus Analysis


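"This research paper explores a WordNet-based lexical semantic vector space model (VSM) for analyzing and classifying text corpora. By drawing on WordNet's lexical semantic network, the method captures the semantic content of words more effectively and addresses errors in text classification."

In natural language processing, text classification is one of the key tasks: it automatically assigns texts to predefined categories. Traditional statistical measures such as term frequency (TF) and inverse document frequency (IDF) are commonly used to build document representations, but they consider only how often words occur and ignore what the words mean, which can lead to inaccurate classification.

The paper proposes a new method, the WordNet-based Lexical Semantic VSM, to address this problem. WordNet is a widely used English lexical database that records semantic relations between words, such as synonym sets (synsets) and hypernym/hyponym relations. With WordNet, the method builds a data structure of semantic-element information that captures the semantic content of words rather than only their occurrences in text documents.

The researchers first use WordNet to construct the semantic-element information that represents the semantic features of words. They then apply the expectation-maximization (EM) algorithm to disambiguate word stems, determining the most likely sense in a given context. On the basis of this disambiguation, they build a lexical-semantic eigenvector for each document in the lexical-semantic space of the corpus. This vector reflects the semantic nature of a document more faithfully and helps capture deeper semantic associations between words, which improves classification accuracy.

The advantage of the method is that it reduces classification errors caused by polysemy and captures latent semantic relations in the text. By incorporating lexical semantic information, the work offers a new perspective on text analysis and classification and improves model performance. The approach may also be useful for other NLP tasks such as information retrieval, sentiment analysis and topic modeling.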

J. Cent. South Univ. (2015) 22: 1833−1840

DOI: 10.1007/s11771-015-2702-8

WordNet-based lexical semantic classification for text corpus analysis

LONG Jun(龙军), WANG Lu-da(王鲁达), LI Zu-de(李祖德), ZHANG Zu-ping(张祖平), YANG Liu(杨柳)

1. School of Information Science and Engineering, Central South University, Changsha 410075, China;

2. School of Software, Central South University, Changsha 410075, China

Abstract: Many text classification methods depend on statistical term measures for document representation. Such representations ignore the lexical semantic content of terms and the distilled mutual information, which leads to text classification errors. This work proposes a document representation method, the WordNet-based lexical semantic VSM, to solve the problem. Using WordNet, the method constructs a data structure of semantic-element information to characterize lexical semantic content, and adjusts EM modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, the lexical-semantic eigenvector of the document representation is built by calculating the weight of each synset, and is applied to a widely recognized algorithm, NWKNN. On the text corpus Reuters-21578 and its adjusted version with lexical replacement, the experimental results show that the lexical-semantic eigenvector achieves a better F1 measure and a better scale of dimension than the term-statistic eigenvector based on TF-IDF. Because the document representation takes the form of an eigenvector, the method has broad prospects for classification applications in text corpus analysis.

Key words: document representation; lexical semantic content; classification; eigenvector

1 Introduction

Text corpus analysis is an important task, and clustering and classification are its key procedures. Text classification is also an active research area in information retrieval, machine learning and natural language processing. Eigenvector-based classification algorithms, such as KNN, SVM and ELM, prevail in this field, and eigenvector-based document classification is a widely used technology for text corpus analysis. The relevant classification algorithms and experiments are typically based on the eigenvector of the document representation. The key issue, moreover, is that eigenvector-based classification algorithms depend on the VSM [1].
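To illustrate what classifying documents from their eigenvectors can look like, the following is a minimal sketch of a plain cosine-similarity KNN over sparse document vectors. It is not the NWKNN algorithm used in the paper; the function names, the toy training vectors and the labels are all assumptions made for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(query, labeled_vectors, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(
        labeled_vectors,
        key=lambda item: cosine(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (document vector, category label).
train = [
    ({"oil": 1.0, "barrel": 0.5}, "energy"),
    ({"oil": 0.8, "crude": 0.7}, "energy"),
    ({"wheat": 1.0, "harvest": 0.4}, "grain"),
    ({"corn": 0.9, "harvest": 0.6}, "grain"),
]

query = {"crude": 0.6, "oil": 0.9}
label = knn_classify(query, train, k=3)
print(label)
```

Because the classifier only sees vectors, any document representation that yields a vector per document can be plugged in without changing the classification step.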

TF-IDF (term frequency–inverse document frequency) [2] is a prevalent method for characterizing documents, and in essence it is a statistical term measure. Many TF-IDF-based methods of document representation can construct a vector space model (VSM) of a text corpus. Similarly, many other methods of document representation exploit statistical term measures, such as Bag-of-Words [3] and Minwise hashing [4]. For document representation, these methods are regarded as statistical methods of feature extraction.
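As a rough illustration of such a statistical term measure, the sketch below computes one common TF-IDF variant (relative term frequency multiplied by the logarithm of inverse document frequency). The paper does not state which variant it uses, so the exact weighting, the function name `tf_idf` and the toy corpus are assumptions for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (count / document length) * log(N / document frequency).
    This is one common variant, chosen here for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Toy corpus, already tokenized.
docs = [
    "oil prices rise".split(),
    "oil exports fall".split(),
    "wheat prices fall".split(),
]
vectors = tf_idf(docs)
for vec in vectors:
    print(vec)
```

Each weight depends only on how often a term string occurs, which is why two documents that express the same content with different words receive unrelated vectors.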

However, in the information retrieval field, statistical term measures neglect lexical semantic content. As a result, corpus analysis operates essentially at the level of term strings and disregards lexical replacement in the original document, so such replacement can easily deceive text corpus analysis.
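The effect of lexical replacement can be seen in a small sketch. Here a hand-made dictionary stands in for a lexical resource such as WordNet: `TOY_SYNSETS` and its concept labels are hypothetical and are not real WordNet identifiers, and the mapping is a simplification of the paper's method, which additionally disambiguates word senses.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: each word maps to a made-up concept label.
TOY_SYNSETS = {
    "car": "vehicle.n", "automobile": "vehicle.n",
    "buy": "purchase.v", "purchase": "purchase.v",
    "cheap": "inexpensive.a", "inexpensive": "inexpensive.a",
}

def cosine(a, b):
    """Cosine similarity between two Counter-based sparse vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def term_vector(tokens):
    """Represent a document by raw term-string counts."""
    return Counter(tokens)

def concept_vector(tokens):
    """Represent a document by concept counts; unknown words are kept as-is."""
    return Counter(TOY_SYNSETS.get(token, token) for token in tokens)

original = "buy cheap car".split()
replaced = "purchase inexpensive automobile".split()

term_similarity = cosine(term_vector(original), term_vector(replaced))
concept_similarity = cosine(concept_vector(original), concept_vector(replaced))
print(term_similarity, concept_similarity)
```

The two documents share no term strings, so the term-level similarity is zero, while the concept-level similarity is 1 up to floating-point rounding because every word was replaced by a synonym in the toy lexicon.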

The semantic approach is an effective technology for document analysis. It captures the semantic features of the words under analysis and, on that basis, characterizes and classifies the document. The close relationship between the syntax and the lexical semantic content of words has attracted considerable interest in both linguistics and computational linguistics.

The design and implementation of WordNet-based lexical semantic classification take particular account of lexical semantic content. Unlike traditional statistical methods of feature extraction, our work develops a new term measure that characterizes lexical semantic content and provides a practical method of document representation that can handle the impact of lexical replacement. The document representation is normalized as an eigenvector; consequently, it can be applied to current VSM-dependent classification algorithms.

Theoretical analysis and a large number of experiments

Foundation item: Project(2012AA011205) supported by National High-Tech Research and Development Program (863 Program) of China; Projects(61272150, 61379109, M1321007, 61301136, 61103034) supported by the National Natural Science Foundation of China; Project(20120162110077) supported by Research Fund for the Doctoral Program of Higher Education of China; Project(11JJ1012) supported by Excellent Youth Foundation of Hunan Scientific Committee, China

Received date: 2014−03−21; Accepted date: 2014−10−11

Corresponding author: WANG Lu-da, PhD Candidate, Lecturer; Tel: +86−18613082443; E-mail: wang_luda@csu.edu.cn
