A Powerful Tool for Dimensionality Reduction: Principal Component Analysis Applied to Survey Data Mining

Updated 2024-08-04. PDF, 250KB.

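Principal Components Analysis (PCA) is a statistical method widely used in data analysis, and it features prominently in machine learning courses such as MATH1900: Machine Learning. Its core goal is to simplify a large data set by finding a set of mutually orthogonal (that is, perpendicular) basis vectors that capture the key trends in the data. Consider a large survey: suppose 1,000 people each answer 50 questions. Every questionnaire may differ, yet clear patterns may exist that relate to gender, age, or political leaning. PCA identifies this latent structure, so that the raw data can be compressed into a few principal components while retaining most of the information.

In practice, the survey data is treated as a large, rectangular, numerical matrix in which each row is one sample (a respondent) and each column is one feature (a question). First, compute the mean of each feature, which gives a mean vector. Then, for each sample, compute its deviation from the mean in every feature; these deviations are the centered values that PCA works with. PCA then applies a linear transformation that expresses the data in a new coordinate system whose axes (the principal components) are ordered by how much of the variance in the data they account for: the first principal component explains the most variation, the second explains the next most, and so on.

Projecting a complex data set onto a few principal components achieves dimensionality reduction. For example, if the first two principal components already cover most of the variation in the data, we can report just those two components rather than the answers to all 50 questions, which helps with data visualization, feature selection, and model building. PCA can also be used for anomaly detection, because samples that lie far from the main trends stand out more clearly in the low-dimensional representation.

In summary, Principal Components Analysis is a powerful tool for data mining, for preprocessing, and for understanding the relationships among the key variables of a complex data set. By finding the orthogonal basis that best summarizes the data, PCA simplifies the analysis and reveals structure underlying the data, which in turn supports better decisions and predictions.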
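The steps summarized above (mean vector, deviations from the mean, ordered components, projection) can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the survey data is synthetic, the names `trends`, `loadings`, `explained`, and `scores` are hypothetical, and the components are obtained from the SVD of the centered matrix, which is one common way to compute PCA.

```python
import numpy as np

# Synthetic stand-in for survey data (an assumption, not real results):
# 1000 respondents, 50 questions, driven by two hidden trends plus noise.
rng = np.random.default_rng(0)
n_people, n_questions = 1000, 50
trends = rng.normal(size=(n_people, 2))
loadings = rng.normal(size=(2, n_questions))
A = trends @ loadings + 0.1 * rng.normal(size=(n_people, n_questions))

# Step 1: mean of each question (column) and deviations from that mean.
mean = A.mean(axis=0)
X = A - mean

# Step 2: SVD of the centered data. The rows of V are the principal
# directions, ordered from largest to smallest singular value.
U, s, V = np.linalg.svd(X, full_matrices=False)

# Step 3: fraction of the total variance explained by each component.
explained = s**2 / np.sum(s**2)

# Step 4: project every respondent onto the first k components.
k = 2
scores = X @ V[:k].T  # shape (n_people, k)

print("fraction of variance in first two components:", explained[:k].sum())
```

Because the synthetic data is built from two trends, nearly all of the variance lands in the first two components; real survey data would usually show a slower decay.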

Principal Components

MATH1900: Machine Learning

Location: http://people.sc.fsu.edu/~jburkardt/classes/ml_2019/principal_components/principal_components.pdf

What are the top two trends in my data?

Principal Component Analysis

Find a small set of orthogonal basis vectors that approximate a large data set.

If we ask 1,000 people to fill out a survey of 50 questions, it's likely that every survey will be different. However, it may also be the case that distinct patterns can be observed, corresponding to differences between men and women, young and old, conservative and liberal. Identifying these patterns will help us to understand the data and to replace the raw data with a simpler model that explains much of the results.

We can use principal component analysis (PCA) to search for these patterns. We try to boil down our data to reveal the most information in the fewest number of components.

If there were only one question on the survey, it would make sense to compute the average answer, and then the variance to report how much answers can deviate from it. Now, however, our task is more complicated. We will think of our data as a big, rectangular, numerical matrix, and we will see that the singular value decomposition (SVD) can give us useful answers.
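To make the matrix view concrete, here is a small sketch, assuming NumPy and a made-up survey in which every answer is an integer from 1 to 5. It computes the single-question summary (average and variance) for every question at once. The names `question_mean` and `question_var` are illustrative only.

```python
import numpy as np

# Hypothetical survey matrix: one row per respondent, one column per question.
rng = np.random.default_rng(1)
A = rng.integers(1, 6, size=(1000, 50)).astype(float)  # answers 1..5

# The one-question summary, applied to each column of the matrix.
question_mean = A.mean(axis=0)  # average answer per question
question_var = A.var(axis=0)    # spread of answers around that average

print(A.shape, question_mean.shape, question_var.shape)
```

These 50 means and 50 variances describe each question in isolation; they say nothing about how answers to different questions move together, which is the part the SVD is used for.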

1 A = U * S * V

The SVD factorization of an m × n matrix A (as computed in Python) has the form A = U S V, where the matrix S has the shape of A but is nonzero only on the diagonal, while the m × m matrix U and the n × n matrix V are orthogonal.

In Python, if A is an np.array(), then we can request the SVD factorization by:

import numpy as np
U, s, V = np.linalg.svd(A)

Instead of returning the matrix S, we get back a vector s, of length mn = min(m, n), containing the diagonal entries of S. We can build S by
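One way to do this, given as a minimal sketch rather than the notes' own code, is to start from an m × n array of zeros and copy s onto its leading diagonal. The small random matrix A below is a made-up example; the final checks confirm that A = U S V and that U and V are orthogonal.

```python
import numpy as np

# Made-up 5 x 3 example matrix, for illustration only.
rng = np.random.default_rng(2)
m, n = 5, 3
A = rng.normal(size=(m, n))

# U is m x m, s has length min(m, n), V is n x n.
U, s, V = np.linalg.svd(A)

# Build the m x n matrix S: zero everywhere except the leading diagonal.
mn = min(m, n)
S = np.zeros((m, n))
S[:mn, :mn] = np.diag(s)

# Check the factorization and the orthogonality of U and V.
print(np.allclose(A, U @ S @ V))
print(np.allclose(U.T @ U, np.eye(m)), np.allclose(V @ V.T, np.eye(n)))
```

Note that `np.linalg.svd` returns the third factor already in the form that multiplies on the right, so the product is `U @ S @ V` with no extra transpose.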
