2.2 Principal components analysis (PCA)
Principal components analysis (PCA) is an important limiting case of factor analysis (FA). One can derive
PCA by making two modifications to FA. First, the noise is assumed to be isotropic, in other words each element of the noise has equal variance: Ψ = σ²I, where I is the D×D identity matrix. This model is called probabilistic PCA [67, 78]. Second, if we take the limit of σ → 0 in probabilistic PCA, we obtain standard PCA (which also goes by the names Karhunen-Loève expansion, and singular value decomposition; SVD). Given a data
set with covariance matrix Σ, for maximum likelihood factor analysis the goal is to find parameters Λ and Ψ for which the model ΛΛ⊤ + Ψ has highest likelihood. In PCA, the goal is to find Λ so that the likelihood is highest for ΛΛ⊤. Note that this matrix is singular unless K = D, so the standard PCA model is not a sensible model. However, taking the limiting case, and further constraining the columns of Λ to be orthogonal, one can show that the principal components correspond to the K eigenvectors of Σ with the largest eigenvalues. PCA is thus attractive because the solution can be found immediately after eigendecomposition of the covariance. Taking the limit σ → 0 of p(x|y, Λ, σ), we find that it is a delta function at x = Λ⊤y, which is the projection of y onto the principal components.
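As a minimal sketch of this recipe (assuming NumPy; the function name pca and the toy data are illustrative rather than taken from the text), Λ is obtained from the K leading eigenvectors of the sample covariance, and each centred data point is projected as x = Λ⊤y:

    import numpy as np

    def pca(Y, K):
        # Standard PCA of data Y (N x D), keeping K components.
        # Returns Lambda (D x K): the K eigenvectors of the covariance with
        # largest eigenvalues, and X (N x K): the projections x = Lambda^T y.
        Yc = Y - Y.mean(axis=0)                   # centre the data
        Sigma = np.cov(Yc, rowvar=False)          # D x D sample covariance
        eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1][:K]     # indices of the top-K eigenvalues
        Lambda = eigvecs[:, order]                # orthonormal columns
        X = Yc @ Lambda                           # projection onto the principal components
        return Lambda, X

    # Toy usage: 500 points in D = 5 dimensions, reduced to K = 2.
    rng = np.random.default_rng(0)
    Y = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))
    Lambda, X = pca(Y, K=2)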
2.3 Independent components analysis (ICA)
Independent components analysis (ICA) extends factor analysis to the case where the factors are non-
Gaussian. This is an interesting extension because many real-world data sets have structure which can be
modelled as linear combinations of sparse sources. This includes auditory data, images, biological signals
such as EEG, etc. Sparsity simply corresponds to the assumption that the factors have distributions with higher kurtosis than the Gaussian. For example, p(x) = (λ/2) exp{−λ|x|} has a higher peak at zero and heavier tails than a Gaussian with corresponding mean and variance, so it would be considered sparse (strictly speaking, one would like a distribution which has non-zero probability mass at 0 to get true sparsity).
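A small sketch of this generative picture (NumPy assumed; the mixing matrix, sample sizes, and variable names are arbitrary choices for illustration) draws Laplace-distributed sources, mixes them linearly, and checks that their excess kurtosis is positive, unlike a Gaussian's:

    import numpy as np

    rng = np.random.default_rng(0)

    # Generative picture assumed by ICA: K sparse (heavy-tailed) sources,
    # mixed linearly to give the observed data y = Lambda x.
    N, K, D = 10000, 3, 5
    X = rng.laplace(scale=1.0, size=(N, K))        # p(x) = (lambda/2) exp(-lambda |x|), lambda = 1
    Lambda = rng.standard_normal((D, K))           # arbitrary mixing matrix
    Y = X @ Lambda.T                               # observed data, N x D

    def excess_kurtosis(z):
        z = z - z.mean()
        return np.mean(z**4) / np.mean(z**2)**2 - 3.0   # zero for a Gaussian

    print(excess_kurtosis(X[:, 0]))                 # about 3 for Laplace sources (super-Gaussian)
    print(excess_kurtosis(rng.standard_normal(N)))  # about 0 for a Gaussian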
Models like PCA, FA and ICA can all be implemented using neural networks (multilayer perceptrons)
trained using various cost functions. It is not clear what advantage this implementation/interpretation has
from a machine learning perspective, although it provides interesting ties to biological information processing.
Rather than ML estimation, one can also do Bayesian inference for the parameters of probabilistic PCA,
FA, and ICA.
2.4 Mixture of Gaussians
The densities modelled by PCA, FA and ICA are all relatively simple in that they are unimodal and have
fairly restricted parametric forms (Gaussian, in the case of PCA and FA). To model data with more complex
structure such as clusters, it is very useful to consider mixture models. Although it is straightforward to
consider mixtures of arbitrary densities, we will focus on Gaussians as a common special case. The density
of each data point in a mixture model can be written:
p(y|θ) = ∑_{k=1}^{K} π_k p(y|θ_k)    (13)
where each of the K components of the mixture is, for example, a Gaussian with differing means and covariances, θ_k = (µ_k, Σ_k), and π_k is the mixing proportion for component k, such that ∑_{k=1}^{K} π_k = 1 and π_k > 0, ∀k.
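For concreteness, a minimal sketch of evaluating Equation (13) for a mixture of Gaussians, assuming SciPy's multivariate_normal for the component densities and arbitrarily chosen illustrative parameters:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mog_density(y, pis, mus, Sigmas):
        # p(y|theta) = sum_k pi_k N(y; mu_k, Sigma_k), as in Equation (13).
        return sum(pi_k * multivariate_normal.pdf(y, mean=mu_k, cov=Sigma_k)
                   for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))

    # Two components in D = 2 dimensions (parameters chosen arbitrarily).
    pis = [0.3, 0.7]                                  # mixing proportions, sum to 1
    mus = [np.zeros(2), np.array([3.0, 3.0])]
    Sigmas = [np.eye(2), 0.5 * np.eye(2)]
    print(mog_density(np.array([1.0, 1.0]), pis, mus, Sigmas))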
A different way to think about mixture models is to consider them as latent variable models, where
associated with each data point is a K-ary discrete latent (i.e. hidden) variable s which has the interpretation
that s = k if the data point was generated by component k. This can be written
p(y|θ) = ∑_{k=1}^{K} P(s = k|π) p(y|s = k, θ)    (14)

where P(s = k|π) = π_k is the prior for the latent variable taking on value k, and p(y|s = k, θ) = p(y|θ_k) is the density under component k, recovering Equation (13).
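A short sketch of this latent-variable view (NumPy assumed, parameters illustrative): draw s = k with probability π_k, then draw y from component k; marginalizing over s recovers the mixture density in Equation (13):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_mog(N, pis, mus, Sigmas):
        # Latent-variable view: draw s = k with probability pi_k,
        # then draw y from the k-th Gaussian component.
        K = len(pis)
        s = rng.choice(K, size=N, p=pis)          # hidden component assignments
        y = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in s])
        return y, s

    pis = [0.3, 0.7]
    mus = [np.zeros(2), np.array([3.0, 3.0])]
    Sigmas = [np.eye(2), 0.5 * np.eye(2)]
    Y, S = sample_mog(1000, pis, mus, Sigmas)     # S records which component generated each point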