$$\text{s.t.}\quad \boldsymbol{Z}^{\top}\mathbf{1} = \mathbf{1},\quad \boldsymbol{Z}_{ij} \ge 0. \tag{2}$$
In Eq. (2), $\boldsymbol{A}_v \in \Re^{d_v \times k}$ is the anchor set of the $v$-th view ($k$ is the number of anchors, $k \ll n$), which can be generated by random selection or by k-means.
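For concreteness, the following minimal Python/NumPy sketch shows both options for pre-defining the anchor set of a single view; the helper name `generate_anchors` and the column-wise data layout are our own assumptions for illustration, not part of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_anchors(X_v, k, method="kmeans", seed=0):
    """Pre-define k anchors A_v of shape (d_v, k) for the v-th view, cf. Eq. (2).

    X_v : (d_v, n) feature matrix of the v-th view; columns are samples.
    """
    rng = np.random.default_rng(seed)
    if method == "random":
        # Random selection: pick k observed samples as anchors.
        idx = rng.choice(X_v.shape[1], size=k, replace=False)
        return X_v[:, idx]
    # k-means: the k cluster centroids serve as anchors.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_v.T)
    return km.cluster_centers_.T
```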
Although AIMSC achieves approximately linear time complexity, its clustering result is affected by the randomness introduced by anchor selection. Moreover, it does not fully exploit a missing-data recovery scheme for multi-view interaction.
3. Method
3.1. The proposed method
Traditional IMC methods typically measure every pairwise relationship and construct $n \times n$ similarity matrices, which results in high computational complexity, high memory consumption, and poor scalability. Anchor-based methods avoid these shortcomings by reducing the number of relationships to be measured: the relationships between the raw data and a small set of anchor points are used to represent the overall structure. However, current anchor-based methods still share some common defects: 1) most existing works ignore the recovery of missing information, which leads to insufficient use of the hidden information; 2) the selected anchors are sub-optimal and fragile to noise.
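To make the scale argument concrete, the toy sketch below contrasts the memory footprint of a full $n \times n$ similarity matrix with that of a $k \times n$ anchor graph; the Gaussian affinity, the toy sizes, and the variable names are our own illustrative choices, not the construction used by the method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 20_000, 500, 64                   # samples, anchors, feature dimension (toy sizes)
X = rng.standard_normal((d, n))             # columns are samples
A = X[:, rng.choice(n, k, replace=False)]   # k anchors drawn from the data

# A full n x n similarity matrix would take ~3.2 GB here; the k x n anchor graph needs ~80 MB.
d2 = (A ** 2).sum(0)[:, None] + (X ** 2).sum(0)[None, :] - 2.0 * A.T @ X  # (k, n) squared distances
Z = np.exp(-d2 / d2.mean())                 # Gaussian affinities between anchors and samples
Z /= Z.sum(axis=0, keepdims=True)           # column-normalize so that Z^T 1 = 1
print(Z.shape)                              # (500, 20000)
```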
Traditional IMC methods usually handle missing instances with a simple one-time filling strategy, which fills the missing instances with zeros or the average values and never updates them during training. However, such a strategy pulls all the missing instances together, since it forces them to take the same values, and thus introduces noise into the clustering result. Orthogonal to the filling strategy mentioned above, we propose to iteratively recover the missing instances in each view:
$$\begin{aligned}
\min_{\boldsymbol{E}_i,\,\boldsymbol{Z}_i,\,\boldsymbol{Z}}\ & \sum_{i=1}^{v} \alpha_i \left\| \hat{\boldsymbol{X}}_i - \boldsymbol{A}_i \boldsymbol{Z}_i \right\|_F^2 + \beta \left\| \boldsymbol{E}_i \right\|_F^2 + \gamma \left\| \boldsymbol{R}_i \boldsymbol{Z}_i - \boldsymbol{Z} \right\|_F^2 \\
\text{s.t.}\ & \hat{\boldsymbol{X}}_i = \boldsymbol{X}_i + \boldsymbol{E}_i \boldsymbol{N}_i,\quad (\boldsymbol{R}_i)^{\top} \boldsymbol{R}_i = \boldsymbol{I}_k.
\end{aligned}
\tag{3}$$
In Eq. (3), each view learns a bipartite graph $\boldsymbol{Z}_i \in \Re^{k \times n}$, which represents the relationship between the recovered data $\hat{\boldsymbol{X}}_i \in \Re^{d_i \times n}$ and the pre-defined anchors $\boldsymbol{A}_i \in \Re^{d_i \times k}$.
Specifically, $\hat{\boldsymbol{X}}_i$ is the recovered data obtained by adding the learned hidden information $\boldsymbol{E}_i \boldsymbol{N}_i$ to the raw incomplete data $\boldsymbol{X}_i \in \Re^{d_i \times n}$. The recovery of missing instances $\boldsymbol{E}_i \in \Re^{d_i \times n_i}$ is updated as a variable, learned from the view-specific bipartite graph $\boldsymbol{Z}_i$ as well as the consensus graph $\boldsymbol{Z}$, so as to utilize complementary information from all views. $\boldsymbol{N}_i \in \Re^{n_i \times n}$ ($n_i$ is the number of missing instances in the $i$-th view) is an indicator matrix that marks the position of each missing instance, defined by:
$$(\boldsymbol{N}_i)_{m,n} = \begin{cases} 1, & \text{if the } m\text{-th missing instance in } \boldsymbol{X}_i \text{ is the } n\text{-th instance in } \hat{\boldsymbol{X}}_i, \\ 0, & \text{otherwise.} \end{cases}
\tag{4}$$
By multiplying $\boldsymbol{E}_i$ by $\boldsymbol{N}_i$, the recovered information is placed at the positions of the corresponding missing instances while the observed positions are left zero (see the sketch below), so no noise is introduced into the observed instances. Additionally, since the descriptions of an instance across different views are semantically consistent, we project the bipartite graph of each view $\boldsymbol{Z}_i$ onto a consensus graph $\boldsymbol{Z}$ to represent the overall relationship between the given data and the learned anchors.
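The placement performed by $\boldsymbol{E}_i \boldsymbol{N}_i$ can be sketched in a few lines of Python/NumPy; the function name `recover_view` and the zero-filled input layout are hypothetical, introduced only to illustrate Eqs. (3)-(4).

```python
import numpy as np

def recover_view(X_i, E_i, missing_idx):
    """Apply the constraint X_hat_i = X_i + E_i N_i of Eq. (3).

    X_i         : (d_i, n) raw view with missing columns filled by zeros.
    E_i         : (d_i, n_i) recovered values, one column per missing instance.
    missing_idx : length-n_i sequence; missing_idx[m] is the position of the
                  m-th missing instance among the n samples.
    """
    n, n_i = X_i.shape[1], len(missing_idx)
    # Indicator matrix N_i (n_i x n) of Eq. (4): N_i[m, missing_idx[m]] = 1.
    N_i = np.zeros((n_i, n))
    N_i[np.arange(n_i), missing_idx] = 1.0
    # E_i N_i scatters the recovery into the missing columns and leaves
    # the observed columns untouched, so no noise enters the observed data.
    return X_i + E_i @ N_i

# Toy usage: one-dimensional view with 5 samples, samples 1 and 3 missing.
X_i = np.array([[1., 0., 2., 0., 3.]])
E_i = np.array([[9., 8.]])
print(recover_view(X_i, E_i, [1, 3]))   # [[1. 9. 2. 8. 3.]]
```

Since each row of $\boldsymbol{N}_i$ contains a single nonzero entry, the product $\boldsymbol{E}_i \boldsymbol{N}_i$ can also be implemented as a simple column scatter.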
In Eq. (3), the anchor set $\boldsymbol{A}_i$ is pre-defined by k-means or random selection, which is the prevalent approach taken by current anchor-based methods. In other words, anchor selection and bipartite graph construction are separated into two stages. However, the anchors of each view are selected independently, which weakens the interpretability of $\boldsymbol{Z}$, since each $\boldsymbol{Z}_i$ then denotes the similarity of the data points to a different set of anchors. Additionally, the pre-defined anchor sets are generally not representative enough for the training process, and the randomness introduced by anchor selection is hard to eliminate.
In contrast to such fixed-anchor strategies, our method updates the anchor sets during training. Specifically, SIMC_ADC learns a central consensus anchor set together with a series of projection matrices. By projecting the central anchors onto each view, the view-specific anchors are closely connected to one another. In this way, the anchors of each view and the similarity matrices are updated mutually, leading to a better solution. Also, since we have no prior knowledge of how important each view is, it is unreasonable to set the weights of anchor learning and view-specific bipartite graph learning manually. Instead, we introduce an adaptive weighting parameter that automatically balances the importance of each view. The objective function is therefore formulated as:
$$\begin{aligned}
\min_{\boldsymbol{W}_i,\,\boldsymbol{A},\,\boldsymbol{Z}_i,\,\boldsymbol{E}_i,\,\boldsymbol{Z}}\ & \sum_{i=1}^{v} \left( \alpha_i \left\| \hat{\boldsymbol{X}}_i - \boldsymbol{W}_i \boldsymbol{A} \boldsymbol{Z}_i \right\|_F^2 + \gamma \left\| \boldsymbol{R}_i \boldsymbol{Z}_i - \boldsymbol{Z} \right\|_F^2 + \beta \left\| \boldsymbol{E}_i \right\|_F^2 \right) + \left\| \boldsymbol{Z} \right\|_F^2 \\
\text{s.t.}\ & \hat{\boldsymbol{X}}_i = \boldsymbol{X}_i + \boldsymbol{E}_i \boldsymbol{N}_i,\quad (\boldsymbol{R}_i)^{\top} \boldsymbol{R}_i = \boldsymbol{I}_k,\quad (\boldsymbol{W}_i)^{\top} \boldsymbol{W}_i = \boldsymbol{I}_d, \\
& \boldsymbol{A}^{\top} \boldsymbol{A} = \boldsymbol{I}_k,\quad \boldsymbol{Z} \ge 0,\quad \boldsymbol{Z}^{\top}\mathbf{1} = \mathbf{1},\quad \boldsymbol{Z}_i \ge 0,\quad (\boldsymbol{Z}_i)^{\top}\mathbf{1} = \mathbf{1}, \\
& \alpha_i = \frac{1}{\left\| \hat{\boldsymbol{X}}_i - \boldsymbol{W}_i \boldsymbol{A} \boldsymbol{Z}_i \right\|_F}.
\end{aligned}
\tag{5}$$
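As a sketch of how the adaptive weights in Eq. (5) can be evaluated once the other variables are fixed (the function name `adaptive_weights` and the small numerical floor are our own assumptions), each $\alpha_i$ is simply the reciprocal of the Frobenius reconstruction residual of its view:

```python
import numpy as np

def adaptive_weights(X_hat, W, A, Z):
    """Update the adaptive view weights alpha_i of Eq. (5).

    X_hat : list of recovered views; X_hat[i] has shape (d_i, n).
    W     : list of projection matrices; W[i] has shape (d_i, d) with orthonormal columns.
    A     : (d, k) consensus anchor set shared by all views.
    Z     : list of view-specific bipartite graphs; Z[i] has shape (k, n).
    """
    alphas = []
    for X_i, W_i, Z_i in zip(X_hat, W, Z):
        # Residual of reconstructing the i-th view from the projected consensus anchors.
        res = np.linalg.norm(X_i - W_i @ A @ Z_i, ord="fro")
        alphas.append(1.0 / max(res, 1e-12))   # alpha_i = 1 / ||X_hat_i - W_i A Z_i||_F
    return np.array(alphas)
```

Views that are well reconstructed from the projected consensus anchors thus automatically receive larger weights, without any manual per-view tuning.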