Low-Rank Approximation for Fast Dimension Reduction and Integrative Clustering of Cancer Multi-Omics Data

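This is a research article titled "Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification". Molecular classification of cancer is a major goal of large-scale multi-omics studies: identifying molecular subtypes should make diagnosis and treatment more precise. Multi-omics data such as gene expression, protein expression and epigenetic profiles are high-dimensional, so the challenge is to reduce the dimension effectively while integrating different data types to uncover potential biomarkers and group structure.

The study proposes a novel low-rank approximation method built on an integrative probabilistic model. The model applies a low-rank constraint when modelling the joint probability distribution of multiple data types, in order to find the low-dimensional structure they share. Because the low-rank regularized likelihood function is convex, the model can be fitted efficiently and stably. The approach substantially reduces the data dimension while preserving the key features of the data, which simplifies downstream analysis and clustering. The shared principal subspace it recovers across data types helps reveal relationships among cancer samples and may expose molecular subtypes that conventional methods miss. The advantage of this integrative strategy is that it scales to large datasets while accounting for heterogeneity between data types, which supports a more comprehensive and accurate cancer classification.

The workflow covers data preprocessing, model construction, parameter estimation and cluster validation, and may also involve statistical inference and model selection to assess model performance. Applying the method to real cancer data, the authors demonstrate its effectiveness and practicality, giving researchers in cancer molecular classification a useful tool and technical framework. The work advances multi-omics data analysis and provides a more precise basis for clinical decision-making.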
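The idea of finding a shared low-dimensional subspace and then clustering samples in it can be illustrated with a minimal Python sketch. It is not the authors' LRAcluster implementation (which is an R package) and covers only a simplified special case: all data blocks are treated as Gaussian, and the low-rank regularizer is assumed to be the nuclear norm, in which case the regularized fit has a closed-form solution by soft-thresholding singular values. The function names (`soft_threshold_svd`, `reduce_samples`, `kmeans`), the threshold `mu` and the simulated data are all hypothetical and chosen for illustration only.

```python
import numpy as np

def soft_threshold_svd(X, mu):
    """Solve min_T 0.5*||X - T||_F^2 + mu*||T||_* by shrinking singular values.
    Assumption: Gaussian data and a nuclear-norm penalty (not confirmed by the source)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - mu, 0.0)          # soft-threshold each singular value
    rank = int(np.count_nonzero(s_shrunk))      # components that survive the shrinkage
    return U[:, :rank], s_shrunk[:rank], Vt[:rank]

def reduce_samples(blocks, mu):
    """Stack feature-by-sample blocks, centre each feature across samples,
    and return a sample-by-rank embedding in the shared subspace."""
    X = np.vstack(blocks)
    X = X - X.mean(axis=1, keepdims=True)
    _, s, Vt = soft_threshold_svd(X, mu)
    return (Vt * s[:, None]).T                  # rows are samples

def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on the rows of Z (illustrative, no restarts)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):             # skip empty clusters
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Simulated toy data: 40 samples in two groups, measured on two hypothetical
# "omics" blocks (200 and 100 features) that share the same group signal.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 20)
signal = np.where(truth == 0, -1.0, 1.0)
expr = 3 * np.outer(rng.normal(size=200), signal) + rng.normal(size=(200, 40))
meth = 3 * np.outer(rng.normal(size=100), signal) + rng.normal(size=(100, 40))

Z = reduce_samples([expr, meth], mu=30.0)       # low-dimensional sample embedding
labels = kmeans(Z, k=2)                         # cluster samples in the reduced space
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(Z.shape, agreement)
```

In this sketch the threshold `mu` plays the role of the regularization strength: a larger value removes more singular values and gives a lower-dimensional embedding. Non-Gaussian data types such as counts or binary variants would need a different likelihood and an iterative solver, which this sketch does not attempt.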

METHODOLOGY ARTICLE Open Access

Fast dimension reduction and integrative clustering of multi-omics data using low-rank approximation: application to cancer molecular classification

Dingming Wu, Dongfang Wang, Michael Q. Zhang 1,2* and Jin Gu

Abstract

Background: One major goal of large-scale cancer omics studies is to identify molecular subtypes for more accurate cancer diagnoses and treatments. To deal with high-dimensional cancer multi-omics data, a promising strategy is to find an effective low-dimensional subspace of the original data and then cluster cancer samples in the reduced subspace. However, due to data-type diversity and big data volume, few methods can integratively and efficiently find the principal low-dimensional manifold of the high-dimensional cancer multi-omics data.

Results: In this study, we proposed a novel low-rank approximation based integrative probabilistic model to quickly find the shared principal subspace across multiple data types: the convexity of the low-rank regularized likelihood function of the probabilistic model ensures efficient and stable model fitting. Candidate molecular subtypes can be identified by unsupervised clustering of hundreds of cancer samples in the reduced low-dimensional subspace. On testing datasets, our method LRAcluster (low-rank approximation based multi-omics data clustering) runs much faster with better clustering performance than the existing method. Then, we applied LRAcluster to large-scale cancer multi-omics data from TCGA. The pan-cancer analysis results show that cancers of different tissue origins are generally grouped as independent clusters, except squamous-like carcinomas, while the single cancer type analysis suggests that the omics data have different subtyping abilities for different cancer types.

Conclusions: LRAcluster is a very useful method for fast dimension reduction and unsupervised clustering of large-scale multi-omics data. LRAcluster is implemented in R and freely available via http://bioinfo.au.tsinghua.edu.cn/software/lracluster/.

Keywords: Multi-omics, Cancer, Low-rank approximation, Clustering, Dimension reduction, Algorithm

Background

Cancer is a large family of lethal diseases that kill millions of people each year [1, 2]. High genetic heterogeneity makes it hard to develop general and effective treatments against cancer [3, 4]. One of the major goals of cancer multi-omics studies is to discover possible cancer subtypes using molecule-level signatures, which can be used for more accurate diagnoses and treatments [5–8]. Several international collaborative projects, such as TCGA [9], ICGC [10], and CCLE [11], have generated large volumes of cancer multi-omics data. However, we still face several challenges in analyzing such large-scale cancer multi-omics data: 1) the need to handle different data types from different platforms at the same time, such as count-based data from sequencing, continuous data from microarrays and binary data of genetic variations; 2) the data dimension (the number of molecular features) is much higher than the sample number; and 3) the big data volumes require efficient and robust computational algorithms.
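One way to write the kind of objective that the abstract calls a low-rank regularized likelihood, covering the three data types listed above, is sketched here. The notation is an assumption made for illustration and is not taken from this page: $X^{(k)}$ is the $k$-th data block, $\Theta^{(k)}$ its matrix of natural parameters, $\mu$ the regularization strength, and the nuclear norm $\lVert\cdot\rVert_{*}$ is assumed as the low-rank regularizer.

```latex
\hat{\Theta}
  = \arg\min_{\Theta}\;
    \sum_{k=1}^{K} \sum_{i,j} \ell_k\!\left(x^{(k)}_{ij},\, \theta^{(k)}_{ij}\right)
    + \mu \,\lVert \Theta \rVert_{*},
\qquad
\Theta = \begin{bmatrix} \Theta^{(1)} \\ \vdots \\ \Theta^{(K)} \end{bmatrix}

% Per-entry negative log-likelihoods, up to terms constant in \theta:
\ell_{\text{Gaussian}}(x,\theta)  = \tfrac{1}{2}(x-\theta)^{2}            % continuous data
\ell_{\text{Poisson}}(x,\theta)   = e^{\theta} - x\,\theta                % count data, \theta = \log \text{rate}
\ell_{\text{Bernoulli}}(x,\theta) = \log\!\left(1+e^{\theta}\right) - x\,\theta  % binary data, \theta = \text{logit}
```

Each of these per-entry losses is convex in $\theta$ and the nuclear norm is convex in $\Theta$, so under these assumptions the whole objective is convex, which is consistent with the convexity property the abstract relies on for efficient and stable fitting.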

The molecules involved in the same biological processes are usually highly correlated. It is commonly

* Correspondence: michael.zhang@utdallas.edu; jgu@tsinghua.edu.cn

MOE Key Laboratory of Bioinformatics, TNLIST Bioinformatics Division & Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing 100084, China

Full list of author information is available at the end of the article

International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Wu et al. BMC Genomics (2015) 16:1022

DOI 10.1186/s12864-015-2223-8
