Multiple Factor Analysis (MFA): Principal Component Analysis for Multitable Data

Updated 2024-07-26 · PDF, 632 KB

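"Multiple factor analysis (MFA) is an extension of principal component analysis (PCA) for multitable or multiblock data sets. It is tailored to data tables that measure sets of variables on the same observations or, in dual-MFA, to multiple data tables in which the same variables are measured on different sets of observations. MFA proceeds in two steps. First, it computes a PCA of each data table and normalizes the table by dividing all its elements by the first singular value obtained from that PCA. Second, all the normalized data tables are aggregated into one grand data table, which is then analyzed with a non-normalized PCA that gives factor scores for the observations and loadings for the variables. In addition, MFA provides, for each data table, a set of partial factor scores that reflects the specific 'viewpoint' of that table. Interestingly, the common factor scores could also be obtained by replacing the original normalized data tables with the normalized factor scores obtained from the PCA of each table."

Multiple factor analysis (MFA) is a statistical method that builds on principal component analysis to handle several related data tables at once. It is mainly used to integrate data from multiple sources: in the social sciences, market research, or bioinformatics, for example, we may need to analyze several data sets that come from different sources or different time points. The main goal of MFA is to identify the patterns and structure shared by the data tables and to provide a unified framework for interpreting them.

In the first step of MFA, a PCA is performed on each data table separately. PCA is a dimension-reduction technique: it compresses the data by finding the main directions of variation (the principal components). The result is a new set of orthogonal variables (factors) that are linear combinations of the original variables and that retain most of the variance. Each data table is then normalized by dividing it by the first singular value of its own PCA. This step makes the tables comparable, because the first singular value reflects the amount of variance along a table's first principal component, so that after normalization every table has a first singular value equal to one.

In the second step, all the normalized data tables are merged into one "grand" data table, to which a non-normalized PCA is applied. The factor scores from this analysis give the position of the observations in the overall data structure, and the loadings show how the variables relate to the factors.

A key feature of MFA is the partial factor scores, which reflect the specific viewpoint of each data table. For example, if one table describes consumer behavior and another describes product characteristics, the partial factor scores show how the observations are positioned according to each of these two aspects. Finally, MFA also helps identify the patterns common to all data tables, which are captured by the common factor scores. These scores help researchers understand the key factors at work in all the data sets, which is valuable for interpreting and integrating information from multiple data sets. MFA thus provides a powerful tool for processing and analyzing complex multi-source data, letting researchers interpret the data from several angles and uncover hidden structure and patterns. It is particularly useful for large data sets that span disciplines or domains.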

Overview

wires.wiley.com/compstats

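Eq. (13) can be re-expressed as:

\[
\begin{aligned}
\mathbf{X} &= \left[\mathbf{X}_{[1]} \mid \cdots \mid \mathbf{X}_{[k]} \mid \cdots \mid \mathbf{X}_{[K]}\right] = \mathbf{P}\boldsymbol{\Delta}\mathbf{Q}^{\mathsf{T}} \\
&= \mathbf{P}\boldsymbol{\Delta}\left[\mathbf{Q}_{[1]}^{\mathsf{T}} \mid \cdots \mid \mathbf{Q}_{[k]}^{\mathsf{T}} \mid \cdots \mid \mathbf{Q}_{[K]}^{\mathsf{T}}\right] \\
&= \left[\mathbf{P}\boldsymbol{\Delta}\mathbf{Q}_{[1]}^{\mathsf{T}} \mid \cdots \mid \mathbf{P}\boldsymbol{\Delta}\mathbf{Q}_{[k]}^{\mathsf{T}} \mid \cdots \mid \mathbf{P}\boldsymbol{\Delta}\mathbf{Q}_{[K]}^{\mathsf{T}}\right].
\end{aligned}
\tag{16}
\]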

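Note that the pattern in Eq. (13) does not completely generalize to Eq. (16) because, if we define $\mathbf{A}_{[k]}$ as

\[
\mathbf{A}_{[k]} = \alpha_k \mathbf{I}, \tag{17}
\]

we have, in general, $\mathbf{Q}_{[k]}^{\mathsf{T}}\mathbf{A}_{[k]}\mathbf{Q}_{[k]} \neq \mathbf{I}$.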

Factor Scores

The factor scores for $\mathbf{X}$ represent a compromise (i.e., a common representation) for the set of the $K$ matrices. Recall that these compromise factor scores are computed (cf., Eqs (13) and (14)) as

\[
\mathbf{F} = \mathbf{P}\boldsymbol{\Delta}. \tag{18}
\]

Factor scores can be used to plot the observations as done in standard PCA, for which each column of $\mathbf{F}$ represents a dimension. Note that the variance of the factor scores of the observations is computed using their masses (stored in matrix $\mathbf{M}$) and can be found as the diagonal of the matrix $\mathbf{F}^{\mathsf{T}}\mathbf{M}\mathbf{F}$. This variance is equal, for each dimension, to the square of the singular value of this dimension, as shown by

\[
\mathbf{F}^{\mathsf{T}}\mathbf{M}\mathbf{F} = \boldsymbol{\Delta}\mathbf{P}^{\mathsf{T}}\mathbf{M}\mathbf{P}\boldsymbol{\Delta} = \boldsymbol{\Delta}^{2}. \tag{19}
\]

As in standard PCA, $\mathbf{F}$ can be obtained from $\mathbf{X}$ by combining Eqs (13) and (18) to get:

\[
\mathbf{F} = \mathbf{P}\boldsymbol{\Delta} = \mathbf{X}\mathbf{A}\mathbf{Q}. \tag{20}
\]

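Taking into account the block structure of $\mathbf{X}$, $\mathbf{A}$, and $\mathbf{Q}$, Eq. (13) can also be rewritten as (cf., Eq. (17)):

\[
\begin{aligned}
\mathbf{F} = \mathbf{X}\mathbf{A}\mathbf{Q} &= \left[\mathbf{X}_{[1]} \mid \cdots \mid \mathbf{X}_{[k]} \mid \cdots \mid \mathbf{X}_{[K]}\right] \times \mathbf{A} \times
\begin{bmatrix} \mathbf{Q}_{[1]} \\ \vdots \\ \mathbf{Q}_{[k]} \\ \vdots \\ \mathbf{Q}_{[K]} \end{bmatrix} \\
&= \sum_{k} \mathbf{X}_{[k]}\mathbf{A}_{[k]}\mathbf{Q}_{[k]} = \sum_{k} \alpha_k \mathbf{X}_{[k]}\mathbf{Q}_{[k]}.
\end{aligned}
\tag{21}
\]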

This equation suggests that the partial factor scores for a table can be defined from the projection of this table onto its right singular vectors (i.e., $\mathbf{Q}_{[k]}$). Specifically, the partial factor scores for the $k$th table are stored in a matrix denoted by $\mathbf{F}_{[k]}$, computed as

\[
\mathbf{F}_{[k]} = K \times \alpha_k \times \mathbf{X}_{[k]}\mathbf{Q}_{[k]}. \tag{22}
\]

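Note that the compromise factor scores matrix is the barycenter (also called centroid or center of gravity; see Ref 86) of the partial factor scores because it is the average of all $K$ partial factor scores (cf., Eq. (20)):

\[
\frac{1}{K}\sum_{k} \mathbf{F}_{[k]} = \frac{1}{K}\sum_{k} K\alpha_k \mathbf{X}_{[k]}\mathbf{Q}_{[k]} = \sum_{k} \alpha_k \mathbf{X}_{[k]}\mathbf{Q}_{[k]} = \mathbf{F}. \tag{23}
\]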
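The barycenter property of Eq. (23) can be checked numerically. The sketch below is only an illustration under stated assumptions: the three random tables, the equal masses, the α weights taken as the inverse squared first singular value of each table, and the helper name `gsvd` are all hypothetical choices and not part of the original text.

```python
import numpy as np

def gsvd(X, m, a):
    """Generalized SVD X = P diag(d) Q^T with P^T M P = Q^T A Q = I,
    where m and a hold the diagonals of M and A (hypothetical helper)."""
    U, d, Vt = np.linalg.svd(np.sqrt(m)[:, None] * X * np.sqrt(a),
                             full_matrices=False)
    return U / np.sqrt(m)[:, None], d, Vt.T / np.sqrt(a)[:, None]

# Hypothetical data: K = 3 centered tables on the same I = 5 observations.
rng = np.random.default_rng(1)
I, sizes = 5, [2, 3, 2]
tables = [rng.normal(size=(I, J)) for J in sizes]
tables = [T - T.mean(axis=0) for T in tables]
K = len(tables)

# Assumed alpha weights, repeated for every variable of a table.
alphas = [1.0 / np.linalg.svd(T, compute_uv=False)[0] ** 2 for T in tables]
X = np.hstack(tables)
m = np.full(I, 1.0 / I)
a = np.repeat(alphas, sizes)

P, d, Q = gsvd(X, m, a)
F = P * d                                            # compromise scores, Eq. (18)

# Split the loadings into one block of rows per table: Q_[1], ..., Q_[K].
Q_blocks = np.split(Q, np.cumsum(sizes)[:-1], axis=0)

# Partial factor scores, Eq. (22): F_[k] = K * alpha_k * X_[k] Q_[k].
partial = [K * al * T @ Qk for al, T, Qk in zip(alphas, tables, Q_blocks)]

# Eq. (23): the average of the partial factor scores is the compromise.
barycenter = sum(partial) / K
print(np.allclose(barycenter, F))
```

The check should print True up to floating-point tolerance. In a plot, each observation's K partial scores would therefore be scattered around its compromise score, which sits at their center of gravity.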

Also as in standard PCA, the elements of $\mathbf{Q}$ are loadings and can be plotted either on their own or along with the factor scores as a biplot (see Refs 87,88). As the loadings come in blocks (i.e., the loadings correspond to the variables of a table), it makes sense to create a biplot with the partial factor scores (i.e., $\mathbf{F}_{[k]}$) for a block and the loadings (i.e., $\mathbf{Q}_{[k]}$) for this block. In doing so, it is often practical to normalize the loadings such that their variance is commensurable with the variance of the factor scores. This can be achieved, for example, by normalizing, for each dimension, the loadings of a block such that their variance is equal to the square of the singular value of the dimension or even to the singular value itself (as illustrated in the example that we present in a following section). These biplots are helpful for understanding the statistical structure of each block, even though the relative positions of the factor scores and the loadings are not directly interpretable because only the projections of observations on the loading vectors can be meaningfully interpreted in a biplot (cf., Refs 87,88).

An alternative pictorial representation of the variables and the components plots the correlations between the original variables of $\mathbf{X}$ and the factor scores. These correlations are plotted as two-dimensional maps in which a circle of radius one (called the circle of correlation; see Refs 75,89) is also plotted. The closer to the circle a variable is, the better this variable is 'explained' by the components used to create the plot (see Refs 23,24 for examples). Loadings and correlations are often used interchangeably because these two concepts are very similar and, sometimes, the name loading is used for both concepts (see Ref 75). In fact, loadings and correlations differ only by a normalization factor: the sum of the squared loadings of all the variables for a given dimension is equal to one, whereas the sum of the squared correlations of all the dimensions for a given variable is equal to one (and therefore it is always possible to transform one set into the other).

WIREs Computational Statistics Multiple factor analysis

HOW TO FIND THE IMPORTANT ELEMENTS: CONTRIBUTIONS, ETC.

Contributions of Observations, Variables, and Tables to a Dimension

In MFA, just like in standard PCA, the importance of a dimension (i.e., principal component) is reflected by its eigenvalue, which indicates how much of the total inertia (i.e., variance) of the data is explained by this component.

To better understand the relationships between components, observations, variables, and tables, and also to help interpret a component, we can evaluate how much an observation, a variable, or a whole table contributes to the inertia extracted by a component. In order to do so, we compute descriptive statistics, called contributions (see Refs 78,89–91 and Ref 75, p. 437ff.). The stability of these descriptive statistics can be assessed by cross-validation techniques such as the bootstrap, whose results can be used to select the relevant elements for a dimension.

Contribution of an Observation to a Dimension

As stated in Eq. (19), the variance of the factor scores for a given dimension is equal to the eigenvalue (i.e., the square of the singular value) associated with this dimension. If we denote by $\lambda_\ell$ the eigenvalue of a given dimension, we can rewrite Eq. (19) as

\[
\lambda_\ell = \sum_{i} m_i \times f_{i,\ell}^{2}, \tag{24}
\]

where $m_i$ and $f_{i,\ell}$ are, respectively, the mass of the $i$th observation and the factor score of the $i$th observation for the $\ell$th dimension. As all the terms $m_i \times f_{i,\ell}^{2}$ are positive or null, we can evaluate the contribution of an observation to a dimension as the ratio of the squared weighted factor score to the dimension eigenvalue. Formally, the contribution of observation $i$ to component $\ell$, denoted $\mathrm{ctr}_{i,\ell}$, is computed as

\[
\mathrm{ctr}_{i,\ell} = \frac{m_i \times f_{i,\ell}^{2}}{\lambda_\ell}. \tag{25}
\]

Contributions take values between 0 and 1, and for a given component, the sum of the contributions of all observations is equal to 1. The larger a contribution, the more the observation contributes to the component. A useful heuristic is to base the interpretation of a component on the observations that have contributions larger than the average contribution. Observations with high contributions and whose factor scores have different signs can then be contrasted to help interpret the component. Alternatively (as described in a later section), we can derive pseudo t statistics (called bootstrap ratios) in order to find the observations important for a given dimension.
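Eqs (24) and (25), together with the "larger than average" heuristic, can be sketched in a few lines of plain Python. The masses and factor scores below are made-up numbers for a single dimension, and the function name is hypothetical.

```python
def observation_contributions(masses, scores):
    """Return (eigenvalue, contributions) for one dimension.

    masses: the m_i; scores: the factor scores f_{i,l} of that dimension.
    """
    weighted = [m * f ** 2 for m, f in zip(masses, scores)]
    eigenvalue = sum(weighted)                              # Eq. (24)
    return eigenvalue, [w / eigenvalue for w in weighted]   # Eq. (25)

# Made-up example: four observations with equal masses.
masses = [0.25, 0.25, 0.25, 0.25]
scores = [2.0, -1.0, 0.0, -1.0]

eigenvalue, ctr = observation_contributions(masses, scores)

# Heuristic from the text: keep observations above the average contribution.
average = 1.0 / len(ctr)
important = [i for i, c in enumerate(ctr) if c > average]

print(eigenvalue)   # 1.5
print(important)    # [0]
```

Here the first observation alone accounts for two thirds of the inertia of the dimension, so it is the only one retained by the heuristic.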

Contributions of a Variable to a Dimension

As we did for the observations, we can find the important variables for a given dimension by computing variable contributions. The variance of the loadings for the variables is equal to one when the α weights are taken into account (cf., Eq. (13)). So if we denote by $a_j$ the α weight for the $j$th variable (recall that all variables from the same table share the same α weight; cf., Eq. (11)), we have

\[
1 = \sum_{j} a_j \times q_{j,\ell}^{2}, \tag{26}
\]

where $q_{j,\ell}$ is the loading of the $j$th variable for the $\ell$th dimension. As all terms $a_j \times q_{j,\ell}^{2}$ are positive or null, we can evaluate the contribution of a variable to a dimension as its squared weighted loading for this dimension. Formally, the contribution of variable $j$ to component $\ell$, denoted $\mathrm{ctr}_{j,\ell}$, is computed as

\[
\mathrm{ctr}_{j,\ell} = a_j \times q_{j,\ell}^{2}. \tag{27}
\]

Variable contributions take values between 0 and 1, and for a given component, the contributions of all variables sum to 1. The larger the contribution of a variable to a component, the more this variable contributes to this component. Variables with high contributions and whose loadings have different signs can then be contrasted to help interpret the component.
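A minimal sketch of Eqs (26) and (27) follows. The α weights and the raw loading vector are invented for illustration, and the loadings are rescaled by hand so that the constraint of Eq. (26) holds; in a real analysis they would come out of the generalized SVD already normalized. The function name is hypothetical.

```python
import math

def variable_contributions(weights, loadings):
    """ctr_j = a_j * q_j**2 (Eq. 27); requires sum_j a_j * q_j**2 == 1 (Eq. 26)."""
    ctr = [a * q ** 2 for a, q in zip(weights, loadings)]
    if not math.isclose(sum(ctr), 1.0, rel_tol=1e-9):
        raise ValueError("loadings do not satisfy the constraint of Eq. (26)")
    return ctr

# Made-up example: table 1 has two variables (alpha = 0.5),
# table 2 has three variables (alpha = 0.25).
weights = [0.5, 0.5, 0.25, 0.25, 0.25]
raw = [1.0, -1.0, 2.0, 0.0, -2.0]

# Rescale the raw vector so that the weighted sum of squares equals one.
norm = math.sqrt(sum(a * v ** 2 for a, v in zip(weights, raw)))
loadings = [v / norm for v in raw]

ctr = variable_contributions(weights, loadings)

# Variables above the average contribution; their signs suggest a contrast.
average = 1.0 / len(ctr)
important = [j for j, c in enumerate(ctr) if c > average]
signs = [1 if loadings[j] > 0 else -1 for j in important]

print(important)   # [2, 4]
print(signs)       # [1, -1]
```

In this toy case the third and fifth variables each contribute one third of the dimension and load with opposite signs, so the dimension would be read as opposing these two variables.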

Contribution of a Table to a Dimension

Specific to multiblock analysis is the notion of a table contribution. As a table comprises several variables, the contribution of a table can simply be defined as the sum of the contributions of its variables (a simple consequence of the Pythagorean theorem, which states that squared lengths are additive). So the contribution of table $k$ to component $\ell$ is denoted $\mathrm{ctr}_{k,\ell}$ and is defined as

\[
\mathrm{ctr}_{k,\ell} = \sum_{j}^{J_{[k]}} \mathrm{ctr}_{j,\ell}, \tag{28}
\]

where the sum runs over the $J_{[k]}$ variables of table $k$.
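Because a table contribution is just a sum of variable contributions, Eq. (28) amounts to a grouped sum. The sketch below assumes the variable contributions for one dimension are already available; the numbers, the table labels, and the function name are made up for illustration.

```python
def table_contributions(variable_ctr, table_of_variable):
    """Sum the variable contributions of one dimension within each table (Eq. 28)."""
    totals = {}
    for ctr, table in zip(variable_ctr, table_of_variable):
        totals[table] = totals.get(table, 0.0) + ctr
    return totals

# Made-up contributions of five variables to one dimension (they sum to 1),
# with the label of the table each variable belongs to.
variable_ctr = [0.10, 0.25, 0.05, 0.40, 0.20]
table_of_variable = [1, 1, 2, 2, 2]

ctr_tables = table_contributions(variable_ctr, table_of_variable)
for table, ctr in sorted(ctr_tables.items()):
    print(f"table {table}: {ctr:.2f}")
# table 1: 0.35
# table 2: 0.65
```

Since the variable contributions of a dimension sum to 1, the table contributions do too, so they can be read directly as the share of the dimension attributable to each table.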
