Feature Selection for High-Dimensional Data Classification Using PLS Supervised Feature Extraction Combined with FNN

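Updated 2024-08-28, 3.9 MB PDF

"Feature selection is critical for data classification, especially with high-dimensional data, where multicollinearity, redundant features, and noise can degrade classifier performance and raise computational cost. This paper proposes a method that combines partial least squares (PLS) supervised feature extraction with false nearest neighbors (FNN) to improve feature selection. First, PLS extracts principal components from the original high-dimensional data, eliminating multicollinearity among features and producing an independent principal component space that carries supervisory information. Next, by computing the correlation before and after feature selection, FNN is used to build a feature similarity measure that determines which features better explain the class variable. Finally, features with weak explanatory power are removed step by step to build a series of classification models, and the classification recognition rate of a support vector machine (SVM) serves as the evaluation criterion for finding the model with the highest recognition rate and the fewest features, which identifies the best feature subset. Experiments show that the method effectively selects a best feature subset that matches the intrinsic classification features of the data, offering a new and effective approach to feature selection for data classification."

Feature selection is a key step in data classification: it reduces model complexity and improves classification accuracy and efficiency. Partial least squares (PLS) is a statistical method for handling variables with multicollinearity. It builds principal components that capture the main variability in the data and that are closely related to the response variable (here, the class variable). In the feature selection strategy proposed in the paper, PLS is applied first to extract the components most closely tied to the classification target, which reduces the mutual influence among features.

False nearest neighbors (FNN) is an algorithm for detecting nonlinear structure among data points in a reduced-dimension space. During feature selection, FNN measures how the data change in the PLS principal component space before and after a feature is selected, and this change is used to judge how well the feature explains the class variable. If selecting a feature significantly changes the relationship between the other features and the class variable, that feature probably has an important effect on classification.

Combining the two methods yields a ranking system based on a feature similarity measure, from which the features with the greatest effect on classification are selected. A series of SVM-based classification models is then built and evaluated to find the optimal feature subset: the one with the fewest features and the best classification performance.

Experimental results on three different datasets validate the method. The best feature subset found for each dataset agrees closely with that dataset's intrinsic classification features, which demonstrates the method's strength in feature selection. The method provides a new and effective feature selection strategy for high-dimensional data classification, helping to improve classifier performance and reduce the demand for computing resources.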

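The PLS extraction and feature-ranking steps described above can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names `pls1_projection` and `rank_features`, the synthetic data, and in particular the importance score (the mean displacement of each sample in PLS score space when one feature is zeroed out) are assumptions standing in for the FNN-based similarity measure, whose exact formula the summary does not give.

```python
import numpy as np


def pls1_projection(X, y, n_components):
    """NIPALS PLS1. Returns R such that scores T = (X - X.mean(0)) @ R."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P = [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # score vector of this component
        tt = t @ t
        p = Xk.T @ t / tt             # X loading
        q = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - q * t               # deflate y
        W.append(w)
        P.append(p)
    W, P = np.column_stack(W), np.column_stack(P)
    return W @ np.linalg.inv(P.T @ W)


def rank_features(X, y, n_components=2):
    """Rank features by how far samples move in PLS space when a feature is removed.

    Assumed stand-in for the FNN-based measure: a larger mean displacement
    is read as stronger explanatory power for the class variable.
    """
    Xc = X - X.mean(axis=0)
    R = pls1_projection(X, y, n_components)
    T_full = Xc @ R                   # projection with all features present
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_drop = Xc.copy()
        X_drop[:, j] = 0.0            # remove feature j (zero after centering)
        T_drop = X_drop @ R
        scores[j] = np.mean(np.linalg.norm(T_full - T_drop, axis=1))
    return np.argsort(scores)[::-1], scores


# Synthetic two-class data (hypothetical): features 0 and 1 carry the class
# signal, features 2 to 5 are pure noise.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 100)
X = rng.normal(size=(200, 6))
X[:, 0] += 3.0 * y
X[:, 1] -= 3.0 * y

order, scores = rank_features(X, y, n_components=2)
print("features ranked from most to least influential:", order)
```

The final step of the method, stepwise removal of the lowest-ranked features while tracking SVM recognition rate, is left out here. One way to add it would be to drop features from the end of `order` one at a time and score each remaining subset with a cross-validated SVM, keeping the smallest subset that reaches the highest accuracy.
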
Condense the following text: Existing protein function prediction methods integrate PPI networks with multivariate bioinformatics data to improve prediction performance. Combining multivariate information makes the interactions between proteins more diverse, and different interactions play different roles in function prediction. Simply combining multiple interactions between two proteins can effectively reduce the effect of false negatives and increase the number of predicted functions, but it can also increase the number of false positive functions, so the overall gain in prediction performance is not obvious. In this article, we have presented a framework for protein function prediction algorithms based on the PPI network and semantic similarity, extended with protein hierarchical functions. The framework relies on diverse clustering algorithms and the calculation of protein semantic similarity. Classification and similarity calculations for protein pairs clustered by functional feature are more accurate and reliable, which allows protein function to be predicted at different functional levels across different proteomes and gives biological applications greater flexibility. The proposed method performs well on protein data from wine yeast cells, but how well it generalizes to other data remains to be verified. Until now, the function of most unknown proteins could only be predicted by calculating similarity to their homologues. Predictions for unknown proteins without homologues are unstable because such proteins are relatively isolated in the protein interaction network, which makes it difficult to find a protein with high similarity. In the proposed framework, the number of features selected after clustering and the number of protein features selected for each functional layer have a significant impact on the accuracy of subsequent function predictions.
Therefore, feature selection should retain as many as possible of the functional features that are important for the whole interaction network. When an incorrect feature is selected, the prediction results deviate from the actual function. Overall, the proposed method improves the accuracy of PPI-network-based protein function prediction to a certain extent and reduces the probability of false positive predictions.
