loading matrix for all the components. Moreover, the factors
belonging to different components in the tFMA are described by
the Student’s t-distributions with different means and covariance
matrices. The tFMA has two main properties. First, when the
dimension of the observed data is high and/or the number of
components is not small, the number of parameters in the tFMA
does not increase as fast as in the MtFA, providing preferable
performance at the expense of stronger distributional restrictions on
the observed data. Second, since the latent factors are distributed
according to a mixture model, the estimated factors correspond-
ing to the observed data in the tFMA can be visualized in a low-
dimensional space [16], thus enabling us to perform clustering or
classification in this low-dimensional space. This property is not
shared by the MtFA and the MODtFA. Moreover, compared to the
FMA based on normal distributions [17,19], the tFMA is more
robust to outliers in the observed data, owing to the heavy tails of
the Student’s t-distribution. In [16], the tFMA was successfully
used for clustering high-dimensional microarray data.
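The robustness claim can be illustrated numerically. The following sketch (assuming SciPy is available; the 6-sigma outlier and the choice of 3 degrees of freedom are illustrative) compares how surprising an outlier is under a normal density versus a heavy-tailed Student’s t density:

```python
import numpy as np
from scipy.stats import norm, t

# Log-density of a point six standard deviations out, under a standard
# normal and under a Student's t with 3 degrees of freedom.
outlier = 6.0
logp_normal = norm.logpdf(outlier)       # roughly -18.9
logp_student = t.logpdf(outlier, df=3)   # roughly -6.1

# The t-distribution's heavier tails make the outlier far less
# "surprising", so it exerts less influence on fitted parameters.
gap = logp_student - logp_normal
```

Because the t log-density decays polynomially rather than quadratically, a single aberrant observation contributes far less to the fitted likelihood, which is the mechanism behind the robustness of t-based mixtures.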
As the tFMA belongs to the family of finite mixture models [7], one important issue
is the choice of the number of components. If this number is not
set properly, the model will underfit or overfit the observed data.
Many approaches handling this issue, such as parsimony-based
approaches and testing-based approaches, have been suggested
[20]. Parsimony-based approaches choose this number to mini-
mize the negative log likelihood function augmented by some
penalty functions to reflect its complexity. A widely used penalty
function is based on the Bayesian information criterion (BIC) [21].
However, parsimony-based approaches entail training multiple
models. It is also difficult to obtain a meaningful
comparison of model fit from one situation to another. Test-based
approaches use a likelihood ratio test to select the number of
components. However, these approaches are time-consuming to
implement and suffer from boundary problems [20]. Moreover,
the parameter estimation algorithm of the tFMA is based on
the maximum likelihood criterion, so the model inevitably
suffers from estimated singular covariance matrices [22]. Such
singular covariance matrices arise from degenerate components
containing only a single observation, and they make the
likelihood function unbounded.
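As a concrete illustration of the parsimony-based approach, the following sketch trains one model per candidate size and keeps the BIC minimizer. It uses scikit-learn’s Gaussian mixtures as a hypothetical stand-in for t-based mixtures, and the synthetic two-cluster data are an illustrative assumption; the point is that multiple models must be trained:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated clusters: the "true" number of components is 2.
X = np.vstack([rng.normal(-5.0, 1.0, (200, 2)),
               rng.normal(5.0, 1.0, (200, 2))])

# Parsimony-based selection: train one model per candidate size and
# keep the minimizer of BIC (penalized negative log-likelihood).
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
```

The loop over candidate sizes is exactly the computational burden criticized above: every candidate requires a full EM fit before any comparison can be made.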
To solve the model size selection problem more effectively,
another class of approaches, based on nonparametric Bayesian
statistics, has been proposed. Rather than comparing models that
vary in complexity, the nonparametric Bayesian approach fits a
single model that can adapt its complexity to the data [23,24].
Concretely, a model with a countably infinite number of
substructures (e.g., components in a mixture model) is assumed
in advance, and the proper model structure is determined once
the observed data arrive. Among the family of nonparametric
Bayesian methods, the Dirichlet process mixture model [25]
has received the most attention and has been used in many
applications [26,27].
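A minimal sketch of the stick-breaking construction underlying the Dirichlet process may clarify how a countably infinite collection of mixing weights is generated; the truncation level and concentration value below are illustrative assumptions:

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Truncated stick-breaking draw of Dirichlet process mixing weights.

    Each beta_i ~ Beta(1, alpha) breaks off a fraction of the stick that
    remains; smaller alpha concentrates mass on fewer components.
    """
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

rng = np.random.default_rng(42)
weights = stick_breaking(alpha=1.0, truncation=1000, rng=rng)
```

In the untruncated limit the weights sum to one almost surely, yet only a data-dependent handful carry appreciable mass, which is how a single model can adapt its effective complexity.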
Motivated by these considerations, in this work we introduce
nonparametric Bayesian statistics into the tFMA, proposing a novel
formulation of the tFMA with a countably infinite number
of mixing components. We refer to this new model as
the infinite Student’s t-factor mixture analyzer (itFMA). Specifically,
prior distributions over the relevant stochastic variables in the
itFMA are first assumed. Bayes’ rule is then applied to
obtain the corresponding posteriors. As the integrals arising in
the computation of these posteriors are intractable, a variational
algorithm, originating from mean field theory [28–30], is derived
for the itFMA. With this algorithm we obtain computationally
tractable expressions for the posteriors. When the itFMA is used for clustering
or classifying high-dimensional observed data, it can effectively
solve the model size selection problem and avoid the singularities
in the tFMA listed above.
The rest of this paper is organized as follows. In Section 2, a
brief overview of the tFMA is provided. In Section 3, the itFMA is
proposed, and in Section 4, a variational inference algorithm for
the itFMA is derived. In Section 5, the procedures of clustering and
classification using the proposed itFMA and the related varia-
tional inference algorithm are given. In Section 6, some experi-
ments are performed to evaluate the performance of the itFMA.
Finally, the conclusion is presented in Section 7.
2. Student’s t-factor mixture analyzer
Here we introduce the definition of the tFMA [16]. In the tFMA,
the $p$-dimensional data vector $y_n$ ($n = 1, \ldots, N$) is modeled as
$$y_n = A u_{ni} + e_{ni} \quad \text{with prob. } \pi_i \quad (i = 1, \ldots, I), \tag{1}$$
where $I$ is the number of mixing components. The corresponding
$q$-dimensional ($q < p$) factor $u_{ni}$ is distributed independently as
$t(u_{ni} \mid \xi_i, \Omega_i, \nu_i)$, where $\nu_i$ is called the degrees of freedom. The
component error $e_{ni}$ is distributed independently as $t(e_{ni} \mid 0, D, \nu_i)$,
where $D$ is a $p \times p$ diagonal matrix. The $p \times q$ matrix $A$ is shared
by all components, so it is called the common factor loading matrix.
The parameter $\xi_i$ is a $q$-dimensional vector and $\Omega_i$ is a $q \times q$
positive definite symmetric matrix, representing the mean and the
covariance of the $i$th component in the low-dimensional factor space,
respectively.
respectively. The Student’s t-distributions tðu
ni
9
n
i
,
X
i
,
n
i
Þ and
tðe
ni
90, D,
n
i
Þ can be respectively regarded as average normal scale
distributions N ðu
ni
9
n
i
,
X
i
=w
ni
Þ and N ðe
ni
90, D=w
ni
Þ with the preci-
sion scalar w
ni
Gamma distributions, that is
tðu
ni
9
n
i
,
X
i
,
n
i
Þ¼
Z
N ðu
ni
9
n
i
,
X
i
=w
ni
ÞGðw
ni
9
n
i
=2,
n
i
=2Þ dw
ni
,
tðe
ni
90, D,
n
i
Þ¼
Z
N ðe
ni
90, D=w
ni
ÞGðw
ni
9
n
i
=2,
n
i
=2Þ dw
ni
: ð2Þ
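The Gamma-mixture representation in Eq. (2) can be checked empirically. The following sketch (the values of the degrees of freedom and the sample size are illustrative) draws Gamma precisions, then conditionally normal variates, and compares the empirical variance with the known t-distribution variance $\nu/(\nu-2)$:

```python
import numpy as np

# A Student's t variate arises by drawing a Gamma precision
# w ~ Gamma(nu/2, rate = nu/2) and then a normal N(0, 1/w) variate.
nu = 5.0
n = 50_000
rng = np.random.default_rng(1)
w = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)  # NumPy uses scale = 1/rate
x = rng.standard_normal(n) / np.sqrt(w)                # N(0, 1/w) samples

# Sanity check against the known t(nu) variance nu / (nu - 2).
empirical_var = x.var()
theoretical_var = nu / (nu - 2.0)
```

The same two-stage draw, applied per component to $u_{ni}$ and $e_{ni}$, is what makes EM and variational treatments of t-mixtures tractable: conditioned on $w_{ni}$, the model is Gaussian.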
The parameter set of the tFMA, $\Theta$, consists of $\xi_i$, $\Omega_i$, $\nu_i$
($i = 1, \ldots, I$), $A$, $D$, and $\pi_i$ ($i = 1, \ldots, I-1$), on putting
$\pi_I = 1 - \sum_{i=1}^{I-1} \pi_i$. It is worth noting that in the limit
$\nu_i \to \infty$ ($i = 1, \ldots, I$), the tFMA reduces to the FMA [19].
From the above description, we can obtain the total number of
free parameters in the tFMA, which is
$(2I-1) + p + q(p+I) + 0.5\,I\,q(q+1) - q^2$. In the MtFA, this number is
$(2I-1) + 2Ip + Ipq - 0.5\,I\,q(q-1)$, while in the MODtFA with a common
factor loading matrix, it is $(2I-1) + Ip + pq - 0.5\,q(q-1) + 1 + I(p-1)$.
Therefore, in most cases, the tFMA has the smallest
number of free parameters. When $p$ is large and/or $I$ is not small,
the tFMA is a more feasible tool for modeling high-dimensional
data. Moreover, the estimated means $\xi_i$, covariances $\Omega_i$ and
degrees of freedom $\nu_i$ ($i = 1, \ldots, I$) in the tFMA can be used to
visualize the distributions of the factors in a low-dimensional
latent space, enabling this approach to perform clustering or
classification in the latent space.
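The free-parameter counts can be tabulated for a concrete setting. The functions below transcribe the expressions given in this section (as reconstructed here), and the dimensions $p = 2000$, $q = 3$, $I = 10$ are illustrative assumptions in the spirit of the microarray example:

```python
# Free-parameter counts for the three models (p: data dimension,
# q: factor dimension, I: number of components).
def n_params_tfma(p, q, I):
    return (2 * I - 1) + p + q * (p + I) + 0.5 * I * q * (q + 1) - q ** 2

def n_params_mtfa(p, q, I):
    return (2 * I - 1) + 2 * I * p + I * p * q - 0.5 * I * q * (q - 1)

def n_params_modtfa(p, q, I):
    return (2 * I - 1) + I * p + p * q - 0.5 * q * (q - 1) + 1 + I * (p - 1)

# Illustrative microarray-like setting: p = 2000, q = 3, I = 10.
counts = {name: fn(2000, 3, 10)
          for name, fn in [("tFMA", n_params_tfma),
                           ("MtFA", n_params_mtfa),
                           ("MODtFA", n_params_modtfa)]}
```

In this setting the tFMA count is an order of magnitude smaller than the MtFA count, because the dominant $I \cdot p \cdot q$ and $I \cdot p$ terms of the MtFA are replaced by a single shared $p \times q$ loading matrix.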
3. Infinite Student’s t-factor mixture analyzer (itFMA)
Let us now consider the problem of modeling the high-dimensional
observed data $Y = \{y_1, \ldots, y_N\}$ with the itFMA, which
is a tFMA with a countably infinite number of components.
An infinite-dimensional latent stochastic variable $z_n$ is introduced,
associated with the observed datum $y_n$. Its elements
satisfy $z_{ni} \in \{0, 1\}$ and $\sum_i z_{ni} = 1$. If $y_n$ originates from
the $i$th component of the mixture, the particular element $z_{ni}$
equals 1 and all the other elements equal 0. The whole
latent variable set is represented as $Z = \{z_1, \ldots, z_N\}$.
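The constraint on $z_n$ can be made concrete with a small sketch; the truncation level and the component assignments below are illustrative stand-ins for the infinite index set and for inferred responsibilities:

```python
import numpy as np

# One-hot indicators z_n: exactly one entry equals 1, marking the
# component that generated y_n. The truncation level and the
# assignments are illustrative stand-ins.
truncation = 8
rng = np.random.default_rng(0)
assignments = rng.integers(0, 3, size=5)   # hypothetical component labels
Z = np.eye(truncation, dtype=int)[assignments]

row_sums = Z.sum(axis=1)   # each row of Z sums to exactly 1
```

Although each $z_n$ is formally infinite-dimensional, only finitely many components are ever occupied by a finite data set, which is what makes inference over $Z$ feasible.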
X. Wei, Z. Yang / Pattern Recognition 45 (2012) 4346–4357 4347