loading matrix for all the components. Moreover, the factors
belonging to different components in the tFMA are described by
the Student’s t-distributions with different means and covariance
matrices. The tFMA has two main properties. First, when the
dimension of the observed data is high and/or the number of
components is not small, the number of parameters in the tFMA
does not increase as fast as in the MtFA, providing preferable
performance at the expense of stronger distributional restrictions on
the observed data. Second, since the latent factors are distributed
according to a mixture model, the estimated factors correspond-
ing to the observed data in the tFMA can be visualized in a low-
dimensional space [16], thus enabling us to perform clustering or
classification in this low-dimensional space. This property is not
shared by the MtFA and the MODtFA. Moreover, compared to the
FMA based on normal distributions [17,19], the tFMA is more
robust to outliers in the observed data, owing to the heavy tails of
the Student’s t-distribution. In [16], the tFMA was successfully
used for clustering high-dimensional microarray data.
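The robustness claim can be illustrated numerically. The following sketch (assuming SciPy is available; the 6-sigma outlier and the choice of 3 degrees of freedom are illustrative) compares how surprising an outlier is under a normal density versus a heavy-tailed Student’s t density:

```python
import numpy as np
from scipy.stats import norm, t

# Log-density of a point six standard deviations out, under a standard
# normal and under a Student's t with 3 degrees of freedom.
outlier = 6.0
logp_normal = norm.logpdf(outlier)       # roughly -18.9
logp_student = t.logpdf(outlier, df=3)   # roughly -6.1

# The t-distribution's heavier tails make the outlier far less
# "surprising", so it exerts less influence on fitted parameters.
gap = logp_student - logp_normal
```

Because the t log-density decays polynomially rather than quadratically, a single aberrant observation contributes far less to the fitted likelihood, which is the mechanism behind the robustness of t-based mixtures.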
As the tFMA belongs to the family of finite mixture models [7], one important issue
is the choice of the number of components. If this number is not
set properly, the model will underfit or overfit the observed data.
Many approaches handling this issue, such as parsimony-based
approaches and testing-based approaches, have been suggested
[20]. Parsimony-based approaches choose this number to mini-
mize the negative log likelihood function augmented by some
penalty functions to reflect its complexity. A widely used penalty
function is based on the Bayesian information criterion (BIC) [21].
However, parsimony-based approaches entail training multiple
models. It is also difficult to obtain a meaningful
comparison of model fit from one situation to another. Test-based
approaches use a likelihood ratio test to select the number of
components. However, these approaches are time-consuming to
implement and suffer from boundary problems [20]. Moreover,
the parameter estimation algorithm of the tFMA is based on
the maximum likelihood criterion, so the model inevitably
suffers from estimated singular covariance matrices [22]. Such
singular covariance matrices arise from degenerate components
containing only a single observation, and they make the
likelihood function unbounded.
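As a concrete illustration of the parsimony-based approach, the following sketch trains one model per candidate size and keeps the BIC minimizer. It uses scikit-learn’s Gaussian mixtures as a hypothetical stand-in for t-based mixtures, and the synthetic two-cluster data are an illustrative assumption; the point is that multiple models must be trained:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated clusters: the "true" number of components is 2.
X = np.vstack([rng.normal(-5.0, 1.0, (200, 2)),
               rng.normal(5.0, 1.0, (200, 2))])

# Parsimony-based selection: train one model per candidate size and
# keep the minimizer of BIC (penalized negative log-likelihood).
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
```

The loop over candidate sizes is exactly the computational burden criticized above: every candidate requires a full EM fit before any comparison can be made.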
To solve the model size selection problem more effectively,
another class of approaches, based on nonparametric Bayesian
statistics, has been proposed. Rather than comparing models that
vary in complexity, the nonparametric Bayesian approach fits a
single model that can adapt its complexity to the data [23,24].
Concretely, a model with a countably infinite number of
substructures (e.g., components in a mixture model) is assumed
in advance, and the proper model structure is determined once
the observed data arrive. Among the family of nonparametric
Bayesian methods, the Dirichlet process mixture model [25]
has received the most attention and has been used in many
applications [26,27].
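A minimal sketch of the stick-breaking construction underlying the Dirichlet process may clarify how a countably infinite collection of mixing weights is generated; the truncation level and concentration value below are illustrative assumptions:

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    """Truncated stick-breaking draw of Dirichlet process mixing weights.

    Each beta_i ~ Beta(1, alpha) breaks off a fraction of the stick that
    remains; smaller alpha concentrates mass on fewer components.
    """
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

rng = np.random.default_rng(42)
weights = stick_breaking(alpha=1.0, truncation=1000, rng=rng)
```

In the untruncated limit the weights sum to one almost surely, yet only a data-dependent handful carry appreciable mass, which is how a single model can adapt its effective complexity.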
Motivated by these considerations, in this work we introduce
nonparametric Bayesian statistics into the tFMA, proposing a novel
formulation of the tFMA with a countably infinite number
of mixing components. We refer to this new model as
the infinite Student’s t-factor mixture analyzer (itFMA). Specifically,
prior distributions over the relevant stochastic variables in the
itFMA are first assumed. Bayes’ rule is then applied to
obtain the corresponding posteriors. As the integrals arising in
the computation of these posteriors are intractable, a variational
algorithm, originating from mean field theory [28–30], is derived
for the itFMA. With this algorithm we obtain computationally
tractable expressions for the posteriors. When the itFMA is used for clustering
or classifying high-dimensional observed data, it can effectively
solve the model size selection problem and avoid the singularities
in the tFMA listed above.
The rest of this paper is organized as follows. In Section 2, a
brief overview of the tFMA is provided. In Section 3, the itFMA is
proposed, and in Section 4, a variational inference algorithm for
the itFMA is derived. In Section 5, the procedures of clustering and
classification using the proposed itFMA and the related varia-
tional inference algorithm are given. In Section 6, some experi-
ments are performed to evaluate the performance of the itFMA.
Finally, the conclusion is presented in Section 7.
2. Student’s t-factor mixture analyzer
Here we introduce the definition of the tFMA [16]. In the tFMA,
the $p$-dimensional data vector $y_n$ ($n = 1, \ldots, N$) is modeled as
$$y_n = A u_{ni} + e_{ni} \quad \text{with prob. } \pi_i \quad (i = 1, \ldots, I), \tag{1}$$
where $I$ is the number of mixing components. The corresponding
$q$-dimensional ($q < p$) factor $u_{ni}$ is distributed independently as
$t(u_{ni} \mid \xi_i, \Omega_i, \nu_i)$, where $\nu_i$ is called the degrees of freedom. The
component error $e_{ni}$ is distributed independently as $t(e_{ni} \mid 0, D, \nu_i)$,
where $D$ is a $p \times p$ diagonal matrix. The $p \times q$ matrix $A$ is shared
by all components, so it is called the common factor loading matrix.
The parameter $\xi_i$ is a $q$-dimensional vector and $\Omega_i$ is a $q \times q$
positive definite symmetric matrix, representing the mean and the
covariance of the $i$th component in the low-dimensional factor space,
respectively.
respectively. The Student’s t-distributions tðu
ni
9
n
i
,
X
i
,
n
i
Þ and
tðe
ni
90, D,
n
i
Þ can be respectively regarded as average normal scale
distributions N ðu
ni
9
n
i
,
X
i
=w
ni
Þ and N ðe
ni
90, D=w
ni
Þ with the preci-
sion scalar w
ni
Gamma distributions, that is
tðu
ni
9
n
i
,
X
i
,
n
i
Þ¼
Z
N ðu
ni
9
n
i
,
X
i
=w
ni
ÞGðw
ni
9
n
i
=2,
n
i
=2Þ dw
ni
,
tðe
ni
90, D,
n
i
Þ¼
Z
N ðe
ni
90, D=w
ni
ÞGðw
ni
9
n
i
=2,
n
i
=2Þ dw
ni
: ð2Þ
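The Gamma-mixture representation in Eq. (2) can be checked empirically. The following sketch (the values of the degrees of freedom and the sample size are illustrative) draws Gamma precisions, then conditionally normal variates, and compares the empirical variance with the known t-distribution variance $\nu/(\nu-2)$:

```python
import numpy as np

# A Student's t variate arises by drawing a Gamma precision
# w ~ Gamma(nu/2, rate = nu/2) and then a normal N(0, 1/w) variate.
nu = 5.0
n = 50_000
rng = np.random.default_rng(1)
w = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)  # NumPy uses scale = 1/rate
x = rng.standard_normal(n) / np.sqrt(w)                # N(0, 1/w) samples

# Sanity check against the known t(nu) variance nu / (nu - 2).
empirical_var = x.var()
theoretical_var = nu / (nu - 2.0)
```

The same two-stage draw, applied per component to $u_{ni}$ and $e_{ni}$, is what makes EM and variational treatments of t-mixtures tractable: conditioned on $w_{ni}$, the model is Gaussian.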
The parameter set of the tFMA, $\Theta$, consists of $\xi_i$, $\Omega_i$, $\nu_i$
($i = 1, \ldots, I$), $A$, $D$, and $\pi_i$ ($i = 1, \ldots, I-1$), on putting
$\pi_I = 1 - \sum_{i=1}^{I-1} \pi_i$. It is worth noting that in the limit
$\nu_i \to \infty$ ($i = 1, \ldots, I$), the tFMA reduces to the FMA [19].
From the above description, we can obtain the total number of
free parameters in the tFMA, which is
$(2I-1) + p + q(p+I) + 0.5\,I\,q(q+1) - q^2$. In the MtFA, this number is
$(2I-1) + 2Ip + Ipq - 0.5\,I\,q(q-1)$, while in the MODtFA with a common
factor loading matrix, it is $(2I-1) + Ip + pq - 0.5\,q(q-1) + 1 + I(p-1)$.
Therefore, in most cases, the tFMA has the smallest
number of free parameters. When $p$ is large and/or $I$ is not small,
the tFMA is a more feasible tool for modeling high-dimensional
data. Moreover, the estimated means $\xi_i$, covariances $\Omega_i$ and
degrees of freedom $\nu_i$ ($i = 1, \ldots, I$) in the tFMA can be used to
visualize the distributions of the factors in a low-dimensional
latent space, enabling this approach to perform clustering or
classification in the latent space.
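The free-parameter counts can be tabulated for a concrete setting. The functions below transcribe the expressions given in this section (as reconstructed here), and the dimensions $p = 2000$, $q = 3$, $I = 10$ are illustrative assumptions in the spirit of the microarray example:

```python
# Free-parameter counts for the three models (p: data dimension,
# q: factor dimension, I: number of components).
def n_params_tfma(p, q, I):
    return (2 * I - 1) + p + q * (p + I) + 0.5 * I * q * (q + 1) - q ** 2

def n_params_mtfa(p, q, I):
    return (2 * I - 1) + 2 * I * p + I * p * q - 0.5 * I * q * (q - 1)

def n_params_modtfa(p, q, I):
    return (2 * I - 1) + I * p + p * q - 0.5 * q * (q - 1) + 1 + I * (p - 1)

# Illustrative microarray-like setting: p = 2000, q = 3, I = 10.
counts = {name: fn(2000, 3, 10)
          for name, fn in [("tFMA", n_params_tfma),
                           ("MtFA", n_params_mtfa),
                           ("MODtFA", n_params_modtfa)]}
```

In this setting the tFMA count is an order of magnitude smaller than the MtFA count, because the dominant $I \cdot p \cdot q$ and $I \cdot p$ terms of the MtFA are replaced by a single shared $p \times q$ loading matrix.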
3. Infinite Student’s t-factor mixture analyzer (itFMA)
Let us now consider the problem of modeling the high-dimensional
observed data $Y = \{y_1, \ldots, y_N\}$ with the itFMA, which
is a tFMA with a countably infinite number of components.
An infinite-dimensional latent stochastic variable $z_n$ is introduced,
associated with the observed datum $y_n$. Its elements
satisfy $z_{ni} \in \{0, 1\}$ and $\sum_i z_{ni} = 1$. If $y_n$ originates from
the $i$th component of the mixture, the particular element $z_{ni}$
equals 1 and all the other elements equal 0. The whole
latent variable set is represented as $Z = \{z_1, \ldots, z_N\}$.
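The constraint on $z_n$ can be made concrete with a small sketch; the truncation level and the component assignments below are illustrative stand-ins for the infinite index set and for inferred responsibilities:

```python
import numpy as np

# One-hot indicators z_n: exactly one entry equals 1, marking the
# component that generated y_n. The truncation level and the
# assignments are illustrative stand-ins.
truncation = 8
rng = np.random.default_rng(0)
assignments = rng.integers(0, 3, size=5)   # hypothetical component labels
Z = np.eye(truncation, dtype=int)[assignments]

row_sums = Z.sum(axis=1)   # each row of Z sums to exactly 1
```

Although each $z_n$ is formally infinite-dimensional, only finitely many components are ever occupied by a finite data set, which is what makes inference over $Z$ feasible.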
X. Wei, Z. Yang / Pattern Recognition 45 (2012) 4346–4357 4347