in which \(\|x_i - x_j\|\) is the Euclidean distance between the high-dimensional datapoints \(x_i\) and \(x_j\), and \(\|y_i - y_j\|\) is the Euclidean distance between the low-dimensional datapoints \(y_i\) and \(y_j\). The Sammon cost function is given by
\[
\phi(Y) = \frac{1}{\sum_{ij} \|x_i - x_j\|} \sum_{ij} \frac{\left( \|x_i - x_j\| - \|y_i - y_j\| \right)^2}{\|x_i - x_j\|} \qquad (7)
\]
The Sammon cost function differs from the raw stress function in that it puts more emphasis on retaining distances
that were originally small. The minimization of the stress function can be performed using various methods, such
as the eigendecomposition of a pairwise dissimilarity matrix, the conjugate gradient method, or a pseudo-Newton
method [19].
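As an illustration of Equation 7, the Sammon cost can be evaluated directly from the two sets of pairwise distances. The sketch below is a minimal NumPy implementation; the helper names are illustrative and not part of the original text.

```python
import numpy as np

def pairwise_dists(Z):
    """Euclidean distance matrix for the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def sammon_cost(X, Y, eps=1e-12):
    """Sammon stress (Eq. 7): weights each squared residual by
    1 / ||x_i - x_j||, emphasizing originally small distances.

    X: (n, D) high-dimensional datapoints; Y: (n, d) embedding.
    """
    dX = pairwise_dists(X)
    dY = pairwise_dists(Y)
    iu = np.triu_indices_from(dX, k=1)   # count each pair (i, j) once
    dx, dy = dX[iu], dY[iu]
    # (1 / sum ||x_i - x_j||) * sum (||x_i - x_j|| - ||y_i - y_j||)^2 / ||x_i - x_j||
    return ((dx - dy) ** 2 / (dx + eps)).sum() / dx.sum()

# A distance-preserving embedding (here trivially Y = X) gives zero stress:
X = np.random.RandomState(0).rand(20, 3)
print(sammon_cost(X, X))  # -> 0.0
```

Dropping coordinates (e.g., `sammon_cost(X, X[:, :2])`) shrinks the embedded distances and yields a strictly positive stress, as expected.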
MDS is widely used for the visualization of data, e.g., in fMRI analysis [77] and in molecular modelling [88]. The
popularity of MDS has led to the proposal of variants such as SPE [2] and FastMap [27].
3.2.2 SPE
Stochastic Proximity Embedding (SPE) is an iterative algorithm that minimizes the MDS raw stress function. SPE
differs from MDS in the efficient rule it employs to update the current estimate of the low-dimensional data representation. In addition, SPE can readily be applied to retain only the distances in a neighborhood graph defined on the data, leading to behavior that is comparable to, e.g., Isomap (see subsubsection 3.2.3).
SPE minimizes the MDS raw stress function that was given in Equation 6. For convenience, we rewrite the raw stress
function as
\[
\phi(Y) = \sum_{ij} (d_{ij} - r_{ij})^2 \qquad (8)
\]
where \(r_{ij}\) is the proximity between the high-dimensional datapoints \(x_i\) and \(x_j\) (computed as \(r_{ij} = \frac{\|x_i - x_j\|}{\max R}\)), and \(d_{ij}\) is the Euclidean distance between their low-dimensional counterparts \(y_i\) and \(y_j\) in the current approximation of the embedded space. SPE can readily be restricted to retain only the distances in a neighborhood graph \(G\) defined on the data, by setting \(d_{ij}\) and \(r_{ij}\) to 0 if \((i, j) \notin G\). With this setting, SPE behaves comparably to techniques based on neighborhood graphs. Nonlinear dimensionality reduction techniques based on neighborhood graphs are discussed in more detail later.
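Restricting the stress to a neighborhood graph amounts to masking the proximity matrix. A minimal sketch using a symmetrized k-nearest-neighbor graph (the helper names and the choice of k are illustrative, not from the original text):

```python
import numpy as np

def knn_mask(X, k):
    """Boolean matrix G with G[i, j] = True iff j is among the k nearest
    neighbors of i, or vice versa (symmetrized), with i != j."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)              # a point is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]        # k nearest neighbors per point
    G = np.zeros(D.shape, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    G[rows, nn.ravel()] = True
    return G | G.T                           # symmetrize the graph

X = np.random.RandomState(1).rand(30, 3)
G = knn_mask(X, k=5)
R = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
R[~G] = 0.0   # r_ij = 0 for (i, j) not in G, as described in the text
```

With the masked proximities, only distances between neighboring points contribute to the stress, which is what makes the behavior comparable to graph-based techniques.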
SPE uses an iterative algorithm to minimize the raw stress function defined above. The initial positions of the points \(y_i\) are selected randomly from \([0, 1]\). An update of the embedding coordinates \(y_i\) is performed by randomly selecting \(s\) pairs of points \((y_i, y_j)\). For each pair of points, the Euclidean distance in the low-dimensional data representation \(Y\) is computed. Subsequently, the coordinates of \(y_i\) and \(y_j\) are updated in order to decrease the difference between the distance \(r_{ij}\) in the original space and the distance \(d_{ij}\) in the embedded space. The updating is performed using the following update rules
\[
y_i = y_i + \lambda \, \frac{r_{ij} - d_{ij}}{2 d_{ij} + \epsilon} \, (y_i - y_j) \qquad (9)
\]
\[
y_j = y_j + \lambda \, \frac{r_{ij} - d_{ij}}{2 d_{ij} + \epsilon} \, (y_j - y_i) \qquad (10)
\]
where \(\lambda\) is a learning parameter that decreases with the number of iterations, and \(\epsilon\) is a regularization parameter that prevents division by zero. The updating of the embedded coordinates is performed for a large number of iterations (e.g., \(10^5\) iterations). The high number of iterations is feasible because of the low computational cost of the update procedure.
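Putting Equations 8 through 10 together, the SPE iteration can be sketched as follows. This is a minimal illustration, not the reference implementation; the pair count s, the linearly decaying learning-rate schedule, and the iteration count are placeholder choices.

```python
import numpy as np

def spe(X, d=2, n_iter=10_000, s=1, lam0=1.0, eps=1e-8, seed=0):
    """Stochastic Proximity Embedding: iteratively shrink the gap between
    high-dimensional proximities r_ij and embedded distances d_ij."""
    rng = np.random.RandomState(seed)
    n = len(X)
    R = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    R /= R.max()                     # proximities r_ij = ||x_i - x_j|| / max R
    Y = rng.rand(n, d)               # initial coordinates drawn from [0, 1]
    for t in range(n_iter):
        lam = lam0 * (1.0 - t / n_iter)        # learning rate decays over time
        for _ in range(s):                     # s randomly selected pairs
            i, j = rng.choice(n, size=2, replace=False)
            dij = np.linalg.norm(Y[i] - Y[j])
            # Update rules (9) and (10); both use the pre-update coordinates.
            step = lam * (R[i, j] - dij) / (2 * dij + eps)
            Y[i], Y[j] = Y[i] + step * (Y[i] - Y[j]), Y[j] + step * (Y[j] - Y[i])
    return Y

Y = spe(np.random.RandomState(2).rand(50, 5))
print(Y.shape)  # -> (50, 2)
```

Each update touches only two points, which is why a large number of iterations remains cheap; restricting to a neighborhood graph would simply skip (or zero out) pairs outside the graph.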
3.2.3 Isomap
Multidimensional scaling has proven to be successful in many applications, but it suffers from the fact that it is based
on Euclidean distances, and does not take into account the distribution of the neighboring datapoints. If the high-
dimensional data lies on or near a curved manifold, such as in the Swiss roll dataset [79], MDS might consider two
datapoints as near points, whereas their distance over the manifold is much larger than the typical interpoint distance.
Isomap [79] is a technique that resolves this problem by attempting to preserve pairwise geodesic (or curvilinear)