2.2 Principal components analysis (PCA)
Principal components analysis (PCA) is an important limiting case of factor analysis (FA). One can derive
PCA by making two modifications to FA. First, the noise is assumed to be isotropic, in other words each element of the noise has equal variance: Ψ = σ²I, where I is the D×D identity matrix. This model is called probabilistic PCA [67, 78]. Second, if we take the limit of σ → 0 in probabilistic PCA, we obtain standard PCA (which also goes by the names Karhunen-Loève expansion, and singular value decomposition; SVD). Given a data
set with covariance matrix Σ, for maximum likelihood factor analysis the goal is to find parameters Λ and Ψ for which the model ΛΛ⊤ + Ψ has highest likelihood. In PCA, the goal is to find Λ so that the likelihood is highest for ΛΛ⊤. Note that this matrix is singular unless K = D, so the standard PCA model is not a sensible model. However, taking the limiting case, and further constraining the columns of Λ to be orthogonal, one can show that the principal components correspond to the K eigenvectors of Σ with the largest eigenvalues. PCA is thus attractive because the solution can be found immediately after eigendecomposition of the covariance. Taking the limit σ → 0 of p(x|y, Λ, σ), we find that it is a delta function at x = Λ⊤y, which is the projection of y onto the principal components.
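As a minimal sketch of this recipe (assuming NumPy; the function name pca and the toy data are illustrative rather than taken from the text), Λ is obtained from the K leading eigenvectors of the sample covariance, and each centred data point is projected as x = Λ⊤y:

    import numpy as np

    def pca(Y, K):
        # Standard PCA of data Y (N x D), keeping K components.
        # Returns Lambda (D x K): the K eigenvectors of the covariance with
        # largest eigenvalues, and X (N x K): the projections x = Lambda^T y.
        Yc = Y - Y.mean(axis=0)                   # centre the data
        Sigma = np.cov(Yc, rowvar=False)          # D x D sample covariance
        eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1][:K]     # indices of the top-K eigenvalues
        Lambda = eigvecs[:, order]                # orthonormal columns
        X = Yc @ Lambda                           # projection onto the principal components
        return Lambda, X

    # Toy usage: 500 points in D = 5 dimensions, reduced to K = 2.
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))
    Lambda, X = pca(Y, K=2)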
2.3 Independent components analysis (ICA)
Independent components analysis (ICA) extends factor analysis to the case where the factors are non-
Gaussian. This is an interesting extension because many real-world data sets have structure which can be
modelled as linear combinations of sparse sources. This includes auditory data, images, biological signals
such as EEG, etc. Sparsity simply corresponds to the assumption that the factors have distributions with higher kurtosis than the Gaussian. For example, p(x) = (λ/2) exp{−λ|x|} has a higher peak at zero and heavier tails than a Gaussian with corresponding mean and variance, so it would be considered sparse (strictly speaking, one would like a distribution which has non-zero probability mass at 0 to get true sparsity).
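A small sketch of this generative picture (NumPy assumed; the mixing matrix, sample sizes, and variable names are arbitrary choices for illustration) draws Laplace-distributed sources, mixes them linearly, and checks that their excess kurtosis is positive, unlike a Gaussian's:

    import numpy as np

    rng = np.random.default_rng(0)

    # Generative picture assumed by ICA: K sparse (heavy-tailed) sources,
    # mixed linearly to give the observed data y = Lambda x.
    N, K, D = 10000, 3, 5
    X = rng.laplace(scale=1.0, size=(N, K))        # p(x) = (lambda/2) exp(-lambda |x|), lambda = 1
    Lambda = rng.standard_normal((D, K))           # arbitrary mixing matrix
    Y = X @ Lambda.T                               # observed data, N x D

    def excess_kurtosis(z):
        z = z - z.mean()
        return np.mean(z**4) / np.mean(z**2)**2 - 3.0   # zero for a Gaussian

    print(excess_kurtosis(X[:, 0]))                 # about 3 for Laplace sources (super-Gaussian)
    print(excess_kurtosis(rng.standard_normal(N)))  # about 0 for a Gaussian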
Models like PCA, FA and ICA can all be implemented using neural networks (multilayer perceptrons)
trained using various cost functions. It is not clear what advantage this implementation/interpretation has
from a machine learning perspective, although it provides interesting ties to biological information processing.
Rather than ML estimation, one can also do Bayesian inference for the parameters of probabilistic PCA,
FA, and ICA.
2.4 Mixture of Gaussians
The densities modelled by PCA, FA and ICA are all relatively simple in that they are unimodal and have
fairly restricted parametric forms (Gaussian, in the case of PCA and FA). To model data with more complex
structure such as clusters, it is very useful to consider mixture models. Although it is straightforward to
consider mixtures of arbitrary densities, we will focus on Gaussians as a common special case. The density
of each data point in a mixture model can be written:
p(y|θ) = ∑_{k=1}^{K} π_k p(y|θ_k)    (13)
where each of the K components of the mixture is, for example, a Gaussian with differing means and covariances, θ_k = (µ_k, Σ_k), and π_k is the mixing proportion for component k, such that ∑_{k=1}^{K} π_k = 1 and π_k > 0, ∀k.
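For concreteness, a minimal sketch of evaluating Equation (13) for a mixture of Gaussians, assuming SciPy's multivariate_normal for the component densities and arbitrarily chosen illustrative parameters:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mog_density(y, pis, mus, Sigmas):
        # p(y|theta) = sum_k pi_k N(y; mu_k, Sigma_k), as in Equation (13).
        return sum(pi_k * multivariate_normal.pdf(y, mean=mu_k, cov=Sigma_k)
                   for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))

    # Two components in D = 2 dimensions (parameters chosen arbitrarily).
    pis = [0.3, 0.7]                                  # mixing proportions, sum to 1
    mus = [np.zeros(2), np.array([3.0, 3.0])]
    Sigmas = [np.eye(2), 0.5 * np.eye(2)]
    print(mog_density(np.array([1.0, 1.0]), pis, mus, Sigmas))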
A different way to think about mixture models is to consider them as latent variable models, where
associated with each data point is a K-ary discrete latent (i.e. hidden) variable s which has the interpretation
that s = k if the data point was generated by component k. This can be written
p(y|θ) = ∑_{k=1}^{K} P(s = k|π) p(y|s = k, θ)    (14)

where P(s = k|π) = π_k is the prior for the latent variable taking on value k, and p(y|s = k, θ) = p(y|θ_k) is the density under component k, recovering Equation (13).
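A short sketch of this latent-variable view (NumPy assumed, parameters illustrative): draw s = k with probability π_k, then draw y from component k; marginalizing over s recovers the mixture density in Equation (13):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_mog(N, pis, mus, Sigmas):
        # Latent-variable view: draw s = k with probability pi_k,
        # then draw y from the k-th Gaussian component.
        K = len(pis)
        s = rng.choice(K, size=N, p=pis)          # hidden component assignments
        y = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in s])
        return y, s

    pis = [0.3, 0.7]
    mus = [np.zeros(2), np.array([3.0, 3.0])]
    Sigmas = [np.eye(2), 0.5 * np.eye(2)]
    Y, S = sample_mog(1000, pis, mus, Sigmas)     # S records which component generated each point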