LCYS_TEAM's Topic-Based Chinese Message Sentiment Classification System at SIGHAN8-Task2

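"This research paper describes the topic-based Chinese message polarity classification system that LCYS_TEAM submitted to SIGHAN8-Task2. The system consists of two main parts: 1) a graph-based ranking model that fuses local and global information represents how well each word discriminates between topics, with a new weighting method and a PMI-based strategy for selecting random jump probabilities; 2) for sentiment features, word embeddings are used to obtain expanded topic words, and syntactic dependencies are used to extract topic-related sentiment words."

The paper opens with SIGHAN8-Task2, a shared task at SIGHAN-8, a workshop devoted to Chinese language processing that aims to advance Chinese natural language processing. The system targets polarity classification of Chinese messages, that is, deciding whether a message is positive, negative, or neutral. At its core is a graph ranking model that integrates local and global information about words. Local information may refer to the frequency of individual words, while global information concerns how words are distributed across the whole corpus. Building a graph model makes the associations and interactions between words easier to capture.

For the graph construction, the paper proposes a new weight assignment method that measures more accurately how important a word is to a given topic. It also introduces a method for choosing random jump probabilities based on Pointwise Mutual Information (PMI), a statistic that quantifies how strongly two events are associated. This helps the model skip uninformative content and focus on genuinely relevant words.

To capture sentiment features, the paper uses word embeddings, which map words to multi-dimensional vectors that encode semantic information. This lets the system obtain an expanded set of topic-related words and so improve classification accuracy. Syntactic dependency analysis is used to identify sentiment words that are closely tied to the topic, since it exposes the structural relations between words and thereby surfaces latent expressions of sentiment.

Overall, the paper shows how topic modeling, word embeddings, and syntactic analysis can be combined to improve polarity classification of Chinese messages. Such a system is valuable for social media monitoring and public opinion analysis, where it helps in understanding the sentiment carried by large volumes of Chinese text.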

Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing (SIGHAN-8), pages 158–163,

Beijing, China, July 30-31, 2015.

2015 Association for Computational Linguistics and Asian Federation of Natural Language Processing

Topic-Based Chinese Message Polarity Classification System at SIGHAN8-Task2

Chun Liao, Chong Feng, Sen Yang, Heyan Huang

School of Computer Science and Technology, Beijing Institute of Technology

{cliao, fengchong, syang, hhy63}@bit.edu.cn

Abstract

This paper describes the topic-based Chinese message polarity classification system submitted by LCYS_TEAM at SIGHAN8-Task2. The system mainly includes two parts: 1) A graph-based ranking model integrating local and global information is adopted to represent the classification ability of words towards different topics. In constructing the graph model, a new weighting approach and a PMI-based random jumping probability selection method are proposed. 2) For sentimental features, word embedding is employed to acquire expanded topical words, and syntactic dependency is adopted to obtain topic-related sentimental words. Experimental results demonstrate the effectiveness of our system.
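The two components named in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the excerpt does not give their weighting scheme or their rule for choosing jump probabilities, so the code below assumes a standard personalized PageRank whose random-jump distribution is built from clipped, normalized PMI scores, and a plain cosine-similarity search for expanding topic words. All function names, the toy probabilities, the edge weights, and the two-dimensional vectors are hypothetical.

```python
import math


def pmi(p_joint, p_word, p_topic):
    """Pointwise mutual information: log2(p(w, t) / (p(w) * p(t)))."""
    return math.log2(p_joint / (p_word * p_topic))


def jump_distribution(pmi_scores):
    """Map PMI scores to random-jump probabilities.

    Assumption (not from the paper): negative PMI is clipped to zero and
    the remaining scores are normalized to sum to one.
    """
    clipped = {w: max(s, 0.0) for w, s in pmi_scores.items()}
    total = sum(clipped.values())
    if total == 0.0:
        return {w: 1.0 / len(clipped) for w in clipped}
    return {w: s / total for w, s in clipped.items()}


def rank_words(edges, jump, damping=0.85, iterations=50):
    """Personalized PageRank on a weighted, undirected word graph.

    Every node is assumed to have at least one edge.
    """
    strength = {n: 0.0 for n in jump}
    for (a, b), w in edges.items():
        strength[a] += w
        strength[b] += w
    scores = {n: 1.0 / len(jump) for n in jump}
    for _ in range(iterations):
        # Each round: jump with probability (1 - damping), else follow an edge.
        updated = {n: (1.0 - damping) * jump[n] for n in jump}
        for (a, b), w in edges.items():
            updated[b] += damping * scores[a] * w / strength[a]
            updated[a] += damping * scores[b] * w / strength[b]
        scores = updated
    return scores


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def expand_topic(topic_word, embeddings, top_k=2):
    """Return the top_k words whose vectors are closest to the topic word."""
    target = embeddings[topic_word]
    scored = [(w, cosine(target, vec))
              for w, vec in embeddings.items() if w != topic_word]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [w for w, _ in scored[:top_k]]


# Hypothetical probabilities of three words co-occurring with one topic.
pmi_scores = {
    "screen": pmi(0.08, 0.1, 0.4),
    "battery": pmi(0.06, 0.1, 0.4),
    "today": pmi(0.04, 0.2, 0.4),
}
# Hypothetical co-occurrence weights between the words.
edges = {
    ("screen", "battery"): 2.0,
    ("battery", "today"): 1.0,
    ("screen", "today"): 1.0,
}
ranks = rank_words(edges, jump_distribution(pmi_scores))

# Hypothetical two-dimensional word vectors.
embeddings = {
    "三星": [0.9, 0.1],
    "galaxy": [0.8, 0.2],
    "手机": [0.7, 0.3],
    "股票": [0.1, 0.9],
}
expanded = expand_topic("三星", embeddings, top_k=2)

print(sorted(ranks, key=ranks.get, reverse=True))
print(expanded)
```

Clipping negative PMI to zero means the walk never jumps to words that are negatively associated with the topic, which is one simple way to bias the ranking towards topical words; other mappings from PMI to probabilities are equally plausible.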

1 Introduction

Sentiment analysis, which identifies the emotional orientation, attitude, and opinion implied when people express themselves, is becoming increasingly important for network monitoring as it is applied to microblogs. In traditional sentiment analysis, unsupervised methods were adopted in Ku (2005), Shen (2009), Vasileios (2000), and Turney (2002); the main limitation of such semantic-dictionary-based approaches is that they cannot handle out-of-vocabulary words. Supervised methods were employed with machine learning models such as Naive Bayes, Maximum Entropy, and Support Vector Machines in Pang (2002), Dasgupta (2009), and Li (2011).

Hashtags, in the form of “#topic#”, are widely used as topics in Chinese microblogs. For topic-related work, Wang (2011) and Jakob (2010) studied hashtag-level sentiment classification on Twitter. Traditional sentiment analysis does not take into account the object that the sentiment is expressed towards. Such methods mostly ignore the topic and cannot analyze sentiment accurately in many topic-related messages. We summarize these difficult cases into two categories.

1) Microblogs with multiple candidate topics

For example, “#三星 galaxy s6##华为 P8##mate8#三星 galaxy s6 真没什么亮点，华为 P8 就可以秒它了，更不用说 mate8[拜拜]” (roughly, “The Samsung Galaxy S6 really has nothing special; the Huawei P8 alone beats it easily, not to mention the Mate8 [bye-bye]”). This sentence conveys negative sentiment towards the topic “三星 galaxy s6” but positive sentiment towards the topics “华为 P8” and “mate8”.

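2) Microblogs with topic-specific sentimental words

For example, “#股票#前天刚入手一支股票，一直在升，股价越来越高” (roughly, “I bought a stock the day before yesterday; it keeps rising and the price is getting higher and higher”) and “#三星#三星手机电量明显不够用，耗能高” (roughly, “The Samsung phone's battery clearly does not last; its power consumption is high”). The word “高” (“high”) carries a positive sentiment orientation towards the topic “股票” in the first sentence and a negative sentiment orientation towards the topic “三星” in the second.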
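Both difficult cases can be made concrete with a short sketch. It assumes a simple regular expression for the “#topic#” hashtag form and a hand-written lookup table keyed by (topic, word); neither is taken from the paper, and `extract_topics`, `TOPIC_LEXICON`, and `word_polarity` are hypothetical names used only for illustration.

```python
import re

# Matches one "#topic#" hashtag; accepts ASCII and full-width number signs.
HASHTAG = re.compile(r"[#＃]\s*([^#＃]+?)\s*[#＃]")


def extract_topics(message):
    """Return the candidate topics of a message, in order of appearance."""
    return [topic.strip() for topic in HASHTAG.findall(message)]


# Hypothetical topic-conditioned lexicon: the same word maps to different
# polarities depending on the topic it is used with.
TOPIC_LEXICON = {
    ("股票", "高"): "positive",
    ("三星", "高"): "negative",
}


def word_polarity(topic, word, default="neutral"):
    """Look up the polarity of a word with respect to a given topic."""
    return TOPIC_LEXICON.get((topic, word), default)


message = "#三星 galaxy s6##华为 P8##mate8#三星 galaxy s6 真没什么亮点"
print(extract_topics(message))
print(word_polarity("股票", "高"), word_polarity("三星", "高"))
```

The regular expression pairs number signs greedily from left to right, so a stray “#” in the message body would be mispaired; a production tokenizer would need a stricter rule.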

Considering the importance of topical information in microblogs, this paper studies topic-based Chinese message polarity classification. Given a message from a Chinese Weibo platform (such as Sina, Tencent, or NetEase) and a topic, the task is to classify whether the message expresses positive, negative, or neutral sentiment towards the given topic. For messages conveying both positive and negative sentiment towards the topic, whichever sentiment is stronger should be chosen.
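The decision rule in this task statement can be written down as a small sketch. It assumes that some upstream component has already produced non-negative strengths for the positive and negative sentiment expressed towards the topic; how those strengths are measured is not specified here, and the tie-breaking choice below is an assumption rather than part of the task definition.

```python
def classify_polarity(positive_strength, negative_strength):
    """Pick the label of the stronger sentiment towards the topic.

    Both strengths are hypothetical non-negative scores. A message with
    no sentiment towards the topic is neutral. Ties are also mapped to
    neutral, which is an assumption: the task statement does not say
    how equal strengths should be resolved.
    """
    if positive_strength > negative_strength:
        return "positive"
    if negative_strength > positive_strength:
        return "negative"
    return "neutral"


print(classify_polarity(0.7, 0.2))
print(classify_polarity(0.1, 0.6))
print(classify_polarity(0.0, 0.0))
```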
