B. Relations Between Segmentation and Row Space
Let X_0, with skinny SVD U_0Σ_0V_0^T, be a collection of data samples strictly drawn from a union of multiple subspaces (i.e., X_0 is clean); then the subspace membership of the samples is determined by the row space of X_0. Indeed, as shown in [12], when the subspaces are independent, V_0V_0^T forms a block-diagonal matrix: the (i, j)-th entry of V_0V_0^T can be nonzero only if the i-th and j-th samples are from the same subspace. Hence, this matrix, termed the Shape Interaction Matrix (SIM) [12], has been widely used for subspace segmentation. Previous approaches simply compute the SVD of the data matrix X = U_X Σ_X V_X^T and then use |V_X V_X^T|¹ for subspace segmentation. However, in the presence of outliers and corruptions, V_X can be far from V_0, and thus the segmentation produced by such approaches is inaccurate. In contrast, we show that LRR can recover V_0V_0^T even when the data matrix X is contaminated by outliers.
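As a quick numerical illustration (a sketch, not part of the paper's method), the block-diagonal structure of the SIM can be verified on synthetic clean data drawn from two independent subspaces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 2-dimensional subspaces in R^10 (random bases are independent
# with probability one), with 5 clean samples drawn from each.
n1, n2, r = 5, 5, 2
B1, B2 = rng.standard_normal((10, r)), rng.standard_normal((10, r))
X0 = np.hstack([B1 @ rng.standard_normal((r, n1)),
                B2 @ rng.standard_normal((r, n2))])

# Skinny SVD of the clean data: keep only the nonzero singular values.
U, s, Vt = np.linalg.svd(X0, full_matrices=False)
rank = int((s > 1e-10 * s[0]).sum())
V0 = Vt[:rank].T                      # n x rank

sim = V0 @ V0.T                       # Shape Interaction Matrix V0 V0^T

# Entries linking samples from different subspaces vanish.
print(np.abs(sim[:n1, n1:]).max())    # numerically ~0
```

The off-diagonal block of `sim` is zero up to floating-point error, while the two diagonal blocks carry the segmentation.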
If the subspaces are not independent, V_0V_0^T may not be strictly block-diagonal. This is indeed well expected, since when the subspaces have nonzero (nonempty) intersections, some samples may belong to multiple subspaces simultaneously. When the subspaces are pairwise disjoint (but not independent), our extensive numerical experiments show that V_0V_0^T may still be close to block-diagonal, as exemplified in Fig. 3. Hence, recovering V_0V_0^T is still of interest for subspace segmentation.
C. Problem Statement
Problem 1.1 only roughly describes what we want to study.
More precisely, this paper addresses the following problem.
Problem 3.1 (Subspace Clustering): Let X_0 ∈ R^{d×n}, with skinny SVD U_0Σ_0V_0^T, store a set of n d-dimensional samples (vectors) strictly drawn from a union of k subspaces {S_i}_{i=1}^{k} of unknown dimensions (k is also unknown). Given a set of observation vectors X generated by

X = X_0 + E_0,

the goal is to recover the row space of X_0, or, equivalently, to recover the true SIM V_0V_0^T.
The recovery of the row space can guarantee high segmentation accuracy, as analyzed in Section III-B. Also, the recovery of the row space naturally implies success in error correction. So it is sufficient to set the goal of subspace clustering as the recovery of the row space identified by V_0V_0^T. For ease of exploration, we consider the problem under three assumptions of increasing practicality and difficulty.
Assumption 1: The data is clean, i.e., E_0 = 0.
Assumption 2: A fraction of the data samples are grossly corrupted and the others are clean, i.e., E_0 has sparse column supports, as shown in Fig. 2(c).
Assumption 3: A fraction of the data samples are grossly corrupted and the others are contaminated by small Gaussian noise, i.e., E_0 is characterized by a combination of the models shown in Fig. 2(a) and Fig. 2(c).
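For concreteness, the error model of Assumption 2 can be simulated as follows (a sketch with made-up sizes and corruption fraction):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, r = 20, 50, 3

# Clean samples from a single r-dimensional subspace (for simplicity).
X0 = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))

# Assumption 2: grossly corrupt a few whole columns; E0 then has
# sparse column supports (most of its columns are exactly zero).
E0 = np.zeros((d, n))
corrupted = rng.choice(n, size=5, replace=False)
E0[:, corrupted] = 10 * rng.standard_normal((d, len(corrupted)))

X = X0 + E0
print((np.abs(E0).sum(axis=0) > 0).sum())   # 5 nonzero columns
```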
¹For a matrix M, |M| denotes the matrix whose (i, j)-th entry is the absolute value of [M]_ij.
Unlike [14], the independence assumption on the subspaces is not highlighted in this paper, because the analysis in this work focuses on recovering V_0V_0^T rather than pursuing a block-diagonal matrix.
IV. LOW-RANK REPRESENTATION FOR MATRIX RECOVERY
In this section we abstractly present the LRR method
for recovering a matrix from corrupted observations. The
basic theorems and optimization algorithms will be presented.
The specific methods and theories for handling the subspace
clustering problem are deferred until Section V.
A. Low-Rank Representation
In order to recover the low-rank matrix X_0 from the given observation matrix X corrupted by errors E_0 (X = X_0 + E_0), it is straightforward to consider the following regularized rank minimization problem:

min_{D,E} rank(D) + λ‖E‖_ℓ,  s.t.  X = D + E,   (2)
where λ > 0 is a parameter and ‖·‖_ℓ indicates a certain regularization strategy, such as the squared Frobenius norm (i.e., ‖·‖_F²) used for modeling the noise as shown in Fig. 2(a) [6], the ℓ_0 norm adopted by [7] for characterizing the random corruptions as shown in Fig. 2(b), and the ℓ_{2,0} norm adopted by [14], [16] for dealing with sample-specific corruptions and outliers. Suppose D* is a minimizer with respect to the variable D; then it gives a low-rank recovery of the original data X_0.
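Problem (2) itself is intractable in general; in practice, RPCA [7] replaces rank(·) and the ℓ_0 norm with their convex surrogates, the nuclear norm and the ℓ_1 norm. A minimal sketch of a solver for that convex surrogate via the inexact augmented Lagrange multiplier scheme (the parameter choices below are common defaults, not prescribed by this paper):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ (np.maximum(s - tau, 0)[:, None] * Vt)

def shrink(M, tau):
    """Soft thresholding: prox of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def rpca(X, lam=None, iters=500, tol=1e-7):
    """min ||D||_* + lam * ||E||_1  s.t.  X = D + E  (inexact ALM)."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(X, 2)   # penalty weight, grown each step
    Y = np.zeros_like(X)               # Lagrange multiplier
    E = np.zeros_like(X)
    for _ in range(iters):
        D = svt(X - E + Y / mu, 1.0 / mu)
        E = shrink(X - D + Y / mu, lam / mu)
        Y = Y + mu * (X - D - E)
        mu *= 1.5
        if np.linalg.norm(X - D - E) < tol * np.linalg.norm(X):
            break
    return D, E
```

On an easy synthetic instance (low rank, a few percent of entries corrupted), this sketch recovers the low-rank component to high accuracy.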
The above formulation is adopted by the recently established Robust PCA (RPCA) method [7], which has been used to achieve state-of-the-art performance in several applications (e.g., [35]). However, this formulation implicitly assumes that the underlying data structure is a single low-rank subspace. When the data is drawn from a union of multiple subspaces, denoted as S_1, S_2, ⋯, S_k, it actually treats the data as being sampled from a single subspace defined by S = Σ_{i=1}^{k} S_i. Since the sum Σ_{i=1}^{k} S_i can be much larger than the union ∪_{i=1}^{k} S_i, the specifics of the individual subspaces are not well considered, and so the recovery may be inaccurate.
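The gap between the sum and the union is easy to see numerically (a toy check with made-up dimensions): every sample below lies in some 2-dimensional subspace, yet the smallest single subspace containing all of them has dimension 4:

```python
import numpy as np

rng = np.random.default_rng(3)

# 6 samples from each of two generic 2-dimensional subspaces of R^10.
X1 = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 6))
X2 = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 6))

# Each column lies in a 2-dimensional subspace, but treating the data
# as a single subspace (as RPCA implicitly does) requires dim(S1 + S2):
print(np.linalg.matrix_rank(np.hstack([X1, X2])))   # 4
```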
To better handle the mixed data, here we suggest a more general rank minimization problem defined as follows:

min_{Z,E} rank(Z) + λ‖E‖_ℓ,  s.t.  X = AZ + E,   (3)
where A is a “dictionary” that linearly spans the data space. We call the minimizer Z* (with regard to the variable Z) the “lowest-rank representation” of data X with respect to a dictionary A. After obtaining an optimal solution (Z*, E*), we could recover the original data by using AZ* (or X − E*). Since rank(AZ*) ≤ rank(Z*), AZ* is also a low-rank recovery of the original data X_0. By setting A = I, formulation (3) falls back to (2). So LRR could be regarded as a generalization of RPCA that essentially uses the standard bases as the dictionary. By choosing an appropriate dictionary A, as we will see, the lowest-rank representation can recover the underlying row space so as to reveal the true segmentation of the data. So, LRR could handle well data drawn from a union of multiple subspaces.
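A quick numerical check of the last claim in the clean case (a sketch; the supporting theory appears later in the paper): take A = X itself and E = 0. Then one minimum-rank solution of X = XZ is obtained from the pseudo-inverse, and it coincides exactly with the true SIM V_0V_0^T:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, r = 20, 15, 3

# Clean rank-r data, so X = X0.
X = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))

# Row-space projector V0 V0^T from the skinny SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V0 = Vt[:r].T
sim = V0 @ V0.T

# With dictionary A = X and clean data, the pseudo-inverse solution of
# X = XZ is Z = pinv(X) X = V0 V0^T, i.e., the row space is recovered.
Z = np.linalg.pinv(X) @ X
print(np.abs(Z - sim).max())        # numerically ~0
```

This is only the trivial clean setting; the point of LRR is that, via (3), the same object V_0V_0^T can still be recovered when X is corrupted.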