Semi-supervised Nonnegative Matrix Factorization Algorithm for Microblog Clustering Based on Term Correlation


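Microblogs are a major part of social media, and the volume and variety of their content make automatic clustering important for many web applications, such as user behavior analysis, sentiment mining, and topic discovery. The paper summarized here, "Semi-supervised Nonnegative Matrix Factorization for Microblog Clustering Based on Term Correlation", focuses on how semantic information can improve clustering quality on large microblog collections.

The key idea is to exploit term correlation data. In text, and especially in unstructured sources such as microblogs, the relationships between terms (for example synonymy, antonymy, or topical relatedness) carry rich semantic information. Mining these relationships yields a term weighting that better reflects the core concepts and topics of each microblog, which gives the clustering step a more informative feature representation.

The authors combine this with semi-supervised learning, because in practice only part of the data is labeled and the rest carries no labels; semi-supervised methods can use the patterns hidden in the unlabeled data to improve clustering accuracy. Nonnegative Matrix Factorization (NMF), an effective dimensionality-reduction technique, serves as the underlying model. NMF constrains all factors to be nonnegative, which matches term weights (which are inherently nonnegative) and yields additive, interpretable factors. The authors formulate microblog clustering as a nonnegative matrix factorization with word-level constraints, so that the model captures both term-frequency information and the semantic links between terms. This is intended to let the model identify latent topics and groups even in short, noisy microblogs that contain misspellings, abbreviations, or internet slang.

In the experiments, the authors evaluate the method on a real-world microblog dataset and report superior performance on noisy, short, and incomplete data. Compared with traditional clustering algorithms, the summary credits the approach with higher clustering accuracy and a reduced need for manual annotation. Overall, the work offers a framework for microblog clustering that combines term correlation, semi-supervised learning, and nonnegative matrix factorization. Future work may examine how to improve the robustness of the model and how to handle more complex data structures as social networks evolve.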

L. Chen et al. (Eds.): APWeb 2014, LNCS 8709, pp. 511–516, 2014.

Semi-supervised Nonnegative Matrix Factorization for Microblog Clustering Based on Term Correlation

Huifang Ma, Meihuizi Jia, YaKai Shi, and Zhanjun Hao

College of Computer Science and Engineering, Northwest Normal University, Gansu Lanzhou 730070, China

mahuifang@yeah.net

Abstract. Clustering microblogs is very important in many web applications. In this paper, we propose a semi-supervised Nonnegative Matrix Factorization clustering method based on term correlation. The key idea is to explore term correlation data, which captures semantic information well, for term weighting. We then formulate the microblog clustering problem as a non-negative matrix factorization using word-level constraints. An empirical study on a real-world dataset shows the superior performance of our framework in handling noisy and short microblogs.

Keywords: Semi-supervised Clustering, Microblogs, Term correlation matrix, Nonnegative Matrix Factorization.

1 Introduction

Clustering microblogs is of great use for analyzing such a tremendous amount of up-to-date information [1,2]. An intuitive way to cluster microblogs is Non-negative Matrix Factorization (NMF) [3], which has already been applied successfully to document clustering. However, experiments on short texts, such as microblogs, Q&A documents, and news titles, suggest that NMF performs unsatisfactorily. One possible reason is that, compared with documents, microblogs are in general much shorter, noisier, and sparser. Clustering such sparse and noisy data is therefore challenging.
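As a point of reference for the NMF baseline mentioned above, the following is a minimal sketch of plain NMF clustering, not the method proposed in this paper. It assumes a term-by-document count matrix, the Frobenius-norm objective, and the standard multiplicative update rules; the function names `nmf` and `cluster_labels` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def nmf(X, k, iters=500, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix X (terms x docs) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = X.shape
    W = rng.random((n_terms, k)) + eps   # term-by-topic factor
    H = rng.random((k, n_docs)) + eps    # topic-by-document factor
    for _ in range(iters):
        # Multiplicative updates for ||X - W H||_F^2; they keep W and H nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def cluster_labels(H):
    """Assign each document to the topic with the largest coefficient."""
    return H.argmax(axis=0)

# Toy data (assumption): 4 terms x 4 short texts forming two obvious groups.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

W, H = nmf(X, k=2)
labels = cluster_labels(H)
print(labels)
```

Each column of `H` is read as a soft cluster membership for one text. Real microblog matrices are far sparser than this toy example, which is the difficulty the paragraph above describes.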

Researchers have presented various ways to aid the clustering process for microblogs. Some researchers [4] introduce semi-supervised priors and explore their effect on clustering accuracy. They try to enrich the representation of a microblog using additional semantics.

In this paper, however, we take advantage of term correlation to enrich the semantics of microblogs internally. First, term similarity based on term-term information is calculated. Then, a non-negative matrix factorization embedded with word-level constraints is performed to obtain the clustering results. Experiments performed on a microblog dataset demonstrate the superior performance of the proposed method.
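The two steps described above can be illustrated with a rough sketch. The exact similarity measure and constraint form are not given in this excerpt, so the code below makes two loudly labeled assumptions: term similarity is taken to be the cosine similarity between term rows of the term-by-document matrix, and the word-level constraint is stood in for by a graph-regularization penalty on the term factor (in the style of graph-regularized NMF). The names `term_similarity` and `term_regularized_nmf`, the weight `lam`, and the toy data are hypothetical.

```python
import numpy as np

def term_similarity(X, eps=1e-9):
    """Cosine similarity between term rows of X (terms x docs). Assumed measure."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / (norms + eps)
    S = Xn @ Xn.T
    np.fill_diagonal(S, 0.0)  # drop self-similarity
    return S

def term_regularized_nmf(X, S, k, lam=0.5, iters=500, seed=0, eps=1e-9):
    """Minimize ||X - W H||_F^2 + lam * tr(W^T (D - S) W) with W, H >= 0.

    The penalty pulls the rows of W for similar terms toward each other.
    This is a stand-in for the paper's word-level constraint, not its formula.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    D = np.diag(S.sum(axis=1))  # degree matrix of the term-similarity graph
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T + lam * S @ W) / (W @ H @ H.T + lam * D @ W + eps)
    return W, H

# Toy data (assumption): 6 terms x 4 short texts on two unrelated topics.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

S = term_similarity(X)
W, H = term_regularized_nmf(X, S, k=2)
doc_labels = H.argmax(axis=0)   # cluster per microblog
term_labels = W.argmax(axis=1)  # dominant topic per term
print(doc_labels, term_labels)
```

Placing the penalty on the term factor `W` rather than on the document factor `H` reflects the word-level nature of the supervision described in the text; a document-level variant would regularize `H` instead.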

The rest of this paper is organized as follows. Section 2 presents the details of our approach. The experiments and results are given in Section 3. We conclude the paper in Section 4.
