FaST-LMM: An Efficient Linear Mixed Model for Genome-Wide Association Studies


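Genome-wide association studies (GWAS) are a central tool in genetics for finding associations between single-nucleotide polymorphisms (SNPs) and complex diseases or traits. Standard linear mixed models (LMMs) struggle with large datasets because their run time scales as the cube, and their memory footprint as the square, of the number of individuals, which limits the cohort sizes that can be analyzed.

FaST-LMM (factored spectrally transformed linear mixed model), described in a 2011 Brief Communication in Nature Methods, addresses this bottleneck: it produces exactly the same results as a standard LMM while greatly reducing run time and memory use. It builds on the observation that the maximum likelihood (ML) or restricted maximum likelihood (REML) of an LMM can be rewritten as a function of a single parameter δ, the ratio of the genetic variance to the residual variance, so that parameter estimation reduces to an optimization over δ alone.

The earlier EMMA (efficient mixed model association) algorithm uses spectral decompositions to speed up evaluation of the log likelihood, which is ordinarily cubic in the cohort size, but it needs a new spectral decomposition (a cubic operation) for every SNP tested. FaST-LMM needs only a single spectral decomposition to test all SNPs. When the number of SNPs used to build the genetic similarity matrix is smaller than the cohort size, its run time and memory footprint are linear in the cohort size.

FaST-LMM therefore extends LMM analysis to cohorts that were previously out of reach and makes analyses that were already feasible much faster.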

BRIEF COMMUNICATIONS

Nature Methods, vol. 8, no. 10, October 2011, p. 833

We show that, provided (i) the number of single-nucleotide polymorphisms (SNPs) used to estimate genetic similarity is less than the cohort size (regardless of how many SNPs are to be tested) and (ii) the RRM is used to determine these similarities, then FaST-LMM produces exactly the same results as a standard LMM but with a run time and memory footprint that is only linear in the cohort size. FaST-LMM thus dramatically increases the size of datasets that can be analyzed with LMMs and additionally makes currently feasible analyses much faster.

Our FaST-LMM algorithm builds on the insight that the maximum likelihood (or the restricted maximum likelihood (REML)) of an LMM can be rewritten as a function of just a single parameter, δ, the ratio of the genetic variance to the residual variance [3,13]. Consequently, the identification of the maximum likelihood (or REML) parameters becomes an optimization problem over δ only. The algorithm ‘efficient mixed model association’ (EMMA) speeds up the evaluation of the log likelihood for any value of δ, which is ordinarily cubic in the cohort size, by clever use of spectral decompositions. However, the approach requires a new spectral decomposition for each SNP tested (a cubic operation). The algorithms ‘EMMA expedited’ (called EMMAX) and ‘population parameters previously determined’ (called P3D) [4,5] provide additional computational savings by assuming that variance parameters for each tested SNP are the same, removing the expensive cubic computation per SNP.
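The single-parameter view described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes the model y ~ N(Xβ, σ_e²(δK + I)) with δ = σ_g²/σ_e², following the wording in the text (parameterizing by the inverse ratio is equivalent). It profiles β and σ_e² out in closed form and searches δ on a simple grid. The function name `profile_log_likelihood`, the toy data, and the grid range are all assumptions made for the example.

```python
import numpy as np

def profile_log_likelihood(delta, S, y_rot, X_rot):
    """ML log likelihood at a fixed delta, with beta and sigma_e^2 profiled out.

    Assumes y ~ N(X beta, sigma_e^2 * (delta * K + I)) with K = U diag(S) U^T,
    and that y_rot = U^T y and X_rot = U^T X were computed once beforehand.
    """
    n = y_rot.shape[0]
    d = delta * S + 1.0                      # eigenvalues of delta*K + I
    Xw = X_rot / d[:, None]                  # D^{-1} X_rot
    beta = np.linalg.solve(X_rot.T @ Xw, Xw.T @ y_rot)
    resid = y_rot - X_rot @ beta
    sigma_e2 = np.sum(resid ** 2 / d) / n    # closed-form ML estimate
    ll = -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(d))
                 + n + n * np.log(sigma_e2))
    return ll, beta, sigma_e2

# Toy data; every size and effect value here is an arbitrary illustration choice.
rng = np.random.default_rng(0)
n, s = 150, 400
W = rng.standard_normal((n, s))              # stand-in for standardized SNPs
K = W @ W.T / s                              # genetic similarity matrix
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
g = W @ rng.standard_normal(s) / np.sqrt(s)  # random effect with covariance K
y = X @ np.array([1.0, 0.5]) + g + rng.standard_normal(n)

S, U = np.linalg.eigh(K)                     # the one cubic-time step
y_rot, X_rot = U.T @ y, U.T @ X

# One-dimensional search over delta on a log-spaced grid.
grid = np.exp(np.linspace(-5.0, 5.0, 201))
lls = np.array([profile_log_likelihood(dl, S, y_rot, X_rot)[0] for dl in grid])
delta_hat = grid[np.argmax(lls)]
```

Once the decomposition is available, each call to `profile_log_likelihood` touches only vectors of length n and a small matrix of covariates, which is why the search over δ is cheap compared with the decomposition itself.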

In contrast to these methods, FaST-LMM requires only a single spectral decomposition to test all SNPs, even without assuming variance parameters to be the same across SNPs, and offers a decrease in memory footprint and additional speedups. A key insight behind our approach is that the spectral decomposition of the genetic similarity matrix makes it possible to transform (rotate) the phenotypes, SNPs to be tested and covariates in such a way that the rotated data become uncorrelated. These data are then amenable to analysis with a linear regression model, which has a run time and memory footprint linear in the cohort size.
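The rotation argument can be checked numerically. The sketch below is an illustration under stated assumptions, not the published code. It fixes δ at an arbitrary value, takes the covariance of the phenotype to be proportional to H = δK + I, rotates the phenotype, covariates and test SNPs once with the eigenvectors of K, and fits each SNP by weighted least squares on the rotated data. The helper names `weighted_ls` and `gls` and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, m = 120, 300, 5           # individuals, similarity SNPs, test SNPs (arbitrary)
W = rng.standard_normal((n, s))
K = W @ W.T / s                 # genetic similarity matrix
delta = 0.8                     # assumed fixed here; in practice it is estimated
H = delta * K + np.eye(n)       # covariance of y up to the factor sigma_e^2

covars = np.ones((n, 1))        # intercept-only covariates
snps = rng.standard_normal((n, m))
y = rng.standard_normal(n)

S, U = np.linalg.eigh(K)        # single spectral decomposition, reused for all SNPs
d = delta * S + 1.0             # variances of the rotated observations

# Rotate everything once; U^T H U is diagonal, so the rotated data are uncorrelated.
y_rot = U.T @ y
covars_rot = U.T @ covars
snps_rot = U.T @ snps

def weighted_ls(X_rot, y_rot, d):
    """Least squares for independent observations with variances d."""
    Xw = X_rot / d[:, None]
    return np.linalg.solve(X_rot.T @ Xw, Xw.T @ y_rot)

# Effect estimate for each test SNP from the rotated data.
betas_rot = np.array([
    weighted_ls(np.column_stack([covars_rot, snps_rot[:, j]]), y_rot, d)[-1]
    for j in range(m)
])

# Reference: generalized least squares on the original, correlated data.
H_inv = np.linalg.inv(H)

def gls(X, y):
    return np.linalg.solve(X.T @ H_inv @ X, X.T @ H_inv @ y)

betas_gls = np.array([
    gls(np.column_stack([covars, snps[:, j]]), y)[-1] for j in range(m)
])
```

The two sets of estimates agree to numerical precision, which is the sense in which the rotated problem is an ordinary (weighted) linear regression.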

In general, the number of entries in the required rotation matrix is quadratic in the cohort size, and computing this matrix by way of a spectral decomposition has a cubic run time in the cohort size. When the number of SNPs used to construct the genetic similarity matrix is less than the cohort size, however, the number of entries in the matrix required to perform the rotations is linear in the cohort size (and linear in the number of SNPs), and the time required to compute the matrix is linear in the cohort size (and quadratic in the number of SNPs). Intuitively, these savings can be achieved because the intrinsic dimensionality of the space spanned by the SNPs used to construct the similarity matrix can never be higher than the smaller of the number of such SNPs and the cohort size. Thus, we can always perform operations in the smaller space without any loss of information, and the computations remain exact. This basic idea has been exploited
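The low-rank case described in the paragraph above can be illustrated as follows. This is a sketch under simplifying assumptions rather than the authors' code. The similarity matrix is taken to be K = WWᵀ for an n × s matrix W with s < n (the exact normalization of the RRM is not given in this excerpt, so the scaling below is an assumption), δ is fixed at an arbitrary value, and the helper `h_inv_product` is hypothetical. A thin SVD of W replaces the full eigendecomposition of K, and no n × n matrix is ever formed.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s = 500, 40                       # more individuals than similarity SNPs
W = rng.standard_normal((n, s)) / np.sqrt(s)   # scaled so that K = W W^T
delta = 0.8                          # assumed fixed for the illustration
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)

# Thin SVD of the n x s matrix: time linear in n and quadratic in s.
U1, sv, _ = np.linalg.svd(W, full_matrices=False)   # U1 is n x s
S1 = sv ** 2                         # nonzero eigenvalues of K = W W^T
d1 = delta * S1 + 1.0                # eigenvalues of H = delta*K + I on span(U1)

def h_inv_product(A, B):
    """Compute A^T H^{-1} B for H = delta*K + I without any n x n matrix.

    On the column space of U1 the eigenvalues of H are d1; on the orthogonal
    complement they all equal 1, so that part reduces to plain residuals.
    """
    A1, B1 = U1.T @ A, U1.T @ B                  # s-dimensional coordinates
    A2, B2 = A - U1 @ A1, B - U1 @ B1            # components outside span(U1)
    return (A1 / d1[:, None]).T @ B1 + A2.T @ B2

Y = y[:, None]                                   # column vector for the helper
beta = np.linalg.solve(h_inv_product(X, X), h_inv_product(X, Y)).ravel()
resid = Y - X @ beta[:, None]
sigma_e2 = h_inv_product(resid, resid).item() / n
log_det_H = np.sum(np.log(d1))       # the other n - s eigenvalues contribute log(1) = 0
ll = -0.5 * (n * np.log(2 * np.pi) + log_det_H + n + n * np.log(sigma_e2))
```

The only large objects kept in memory are W and U1, both n × s, and the estimates match those obtained from the full n × n covariance, consistent with the statement that the computations remain exact.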

FaST linear mixed models for genome-wide association studies

Christoph Lippert (1–3), Jennifer Listgarten (1,3), Ying Liu, Carl M Kadie, Robert I Davidson, David Heckerman (1,3)

We describe factored spectrally transformed linear mixed models (FaST-LMM), an algorithm for genome-wide association studies (GWAS) that scales linearly with cohort size in both run time and memory use. On Wellcome Trust data for 15,000 individuals, FaST-LMM ran an order of magnitude faster than current efficient algorithms. Our algorithm can analyze data for 120,000 individuals in just a few hours, whereas current algorithms fail on data for even 20,000 individuals (http://mscompbio.codeplex.com/).

The problem of confounding by population structure, family structure and cryptic relatedness in genome-wide association studies (GWAS) is widely appreciated [1–7]. Statistical methods for correcting these confounders include linear mixed models (LMMs) [2–10], genomic control, family-based association tests, structured association and Eigenstrat. In contrast to other methods, LMMs can capture all of these confounders simultaneously, without knowledge of which are present and without the need to tease them apart. Unfortunately, LMMs are computationally expensive relative to simpler models. In particular, the run time and memory footprint required by these models scale as the cube and square of the cohort size (the number of individuals represented in the dataset), respectively. This bottleneck means that LMMs run slowly or not at all on currently or soon to be available large datasets.
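A rough calculation makes the quadratic memory cost concrete. The sketch below only counts the bytes of one dense similarity matrix stored as 8-byte floats, using the cohort sizes quoted in the abstract. The helper name `matrix_gigabytes` and the choice of 2,000 similarity SNPs for comparison are assumptions, and the real memory use of any particular implementation will differ.

```python
def matrix_gigabytes(rows, cols, bytes_per_entry=8):
    """Memory of a dense rows x cols matrix of 8-byte floats, in GB (10^9 bytes)."""
    return rows * cols * bytes_per_entry / 1e9

s = 2_000  # illustrative number of SNPs used for the similarity matrix (assumption)
for n in (15_000, 20_000, 120_000):
    full = matrix_gigabytes(n, n)      # n x n similarity matrix
    low_rank = matrix_gigabytes(n, s)  # n x s matrix, linear in the cohort size
    print(f"n={n}: n x n needs {full:.1f} GB, n x s needs {low_rank:.2f} GB")
```

Under these assumptions a dense 120,000 × 120,000 matrix alone takes about 115 GB, while an n × s matrix with s = 2,000 takes about 2 GB.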

Roughly speaking, LMMs tackle confounders by using measures of genetic similarity to capture the probabilities that pairs of individuals have causative alleles in common. Such measures include those based on identity by descent [10,11] and the realized relationship matrix (RRM) [9,10,12], and have been estimated with a small sample of markers (200–2,000 markers) [2,4]. Here we take advantage of such sampling to make LMM analysis applicable to extremely large datasets, introducing a reformulation of LMMs called factored spectrally transformed LMM (FaST-LMM). We show that, provided (i) the number of single-nucleotide polymorphisms (SNPs) used to estimate genetic similarity is less than

(1) Microsoft Research, Los Angeles, California, USA. (2) Max Planck Institutes Tübingen, Tübingen, Germany. (3) These authors contributed equally to this work. Correspondence should be addressed to C.L. (christoph.lippert@tuebingen.mpg.de), J.L. (jennl@microsoft.com) or D.H. (heckerma@microsoft.com).

Received 5 April; accepted 2 August; published online 4 September 2011; doi:10.1038/nmeth.1681
