BRIEF COMMUNICATIONS
NATURE METHODS
|
VOL.8 NO.10
|
OCTOBER 2011
|
833
the cohort size (regardless of how many SNPs are to be tested) and
(ii) the RRM is used to determine these similarities, then FaST-
LMM produces exactly the same results as a standard LMM but
with a run time and memory footprint that is only linear in the
cohort size. FaST-LMM thus dramatically increases the size of
datasets that can be analyzed with LMMs and additionally makes
currently feasible analyses much faster.
Our FaST-LMM algorithm builds on the insight that the
maximum likelihood (or the restricted maximum likelihood
(REML)) of an LMM can be rewritten as a function of just a single
parameter,
δ
, the ratio of the genetic variance to the residual vari-
ance
3,13
. Consequently, the identification of the maximum like-
lihood (or REML) parameters becomes an optimization problem
over
δ
only. The algorithm ‘efficient mixed model association’
(EMMA)
3
speeds up the evaluation of the log likelihood for any
value of
δ
, which is ordinarily cubic in the cohort size, by clever
use of spectral decompositions. However, the approach requires
a new spectral decomposition for each SNP tested (a cubic opera-
tion). The algorithms ‘EMMA expedited’ (called EMMAX) and
‘population parameters previously determined’ (called P3D)
4,5
provide additional computational savings by assuming that vari-
ance parameters for each tested SNP are the same, removing the
expensive cubic computation per SNP.
In contrast to these methods, FaST-LMM requires only a single
spectral decomposition to test all SNPs, even without assuming
variance parameters to be the same across SNPs, and offers a
decrease in memory footprint and additional speedups. A key
insight behind our approach is that the spectral decomposition
of the genetic similarity matrix makes it possible to transform
(rotate) the phenotypes, SNPs to be tested and covariates in such
a way that the rotated data become uncorrelated. These data are
then amenable to analysis with a linear regression model, which
has a run time and memory footprint linear in the cohort size.
In general, the number of entries in the required rotation matrix
is quadratic in the cohort size, and computing this matrix by way
of a spectral decomposition has a cubic run time in the cohort size.
When the number of SNPs used to construct the genetic similarity
matrix is less than the cohort size, however, the number of entries
in the matrix required to perform the rotations is linear in the
cohort size (and linear in the number of SNPs), and the time
required to compute the matrix is linear in the cohort size (and
quadratic in the number of SNPs). Intuitively, these savings can
be achieved because the intrinsic dimensionality of the space
spanned by the SNPs used to construct the similarity matrix can
never be higher than the smaller of the number of such SNPs
and the cohort size. Thus, we can always perform operations
in the smaller space without any loss of information, and the
computations remain exact. This basic idea has been exploited
FaST linear mixed
models for genome-wide
association studies
Christoph Lippert
1–3
, Jennifer Listgarten
1,3
,
Ying Liu
1
, Carl M Kadie
1
, Robert I Davidson
1
&
David Heckerman
1,3
We describe factored spectrally transformed linear mixed
models (FaST-LMM), an algorithm for genome-wide association
studies (GWAS) that scales linearly with cohort size in both
run time and memory use. On Wellcome Trust data for 15,000
individuals, FaST-LMM ran an order of magnitude faster than
current efficient algorithms. Our algorithm can analyze data
for 120,000 individuals in just a few hours, whereas current
algorithms fail on data for even 20,000 individuals
(http://mscompbio.codeplex.com/).
The problem of confounding by population structure, family
structure and cryptic relatedness in genome-wide association
studies (GWAS) is widely appreciated
1–7
. Statistical methods
for correcting these confounders include linear mixed models
(LMMs)
2–10
, genomic control, family-based association tests,
structured association and Eigenstrat
7
. In contrast to other
methods, LMMs can capture all of these confounders simultane-
ously, without knowledge of which are present and without the
need to tease them apart
7
. Unfortunately, LMMs are computation-
ally expensive relative to simpler models. In particular, the run
time and memory footprint required by these models scale as the
cube and square of the cohort size (the number of individuals
represented in the dataset), respectively. This bottleneck means
that LMMs run slowly or not at all on currently or soon to be
available large datasets.
Roughly speaking, LMMs tackle confounders by using mea-
sures of genetic similarity to capture the probabilities that pairs
of individuals have causative alleles in common. Such measures
include those based on identity by descent
10,11
and the realized
relationship matrix (RRM)
9,10,12
, and have been estimated with
a small sample of markers (200–2,000 markers)
2,4
. Here we take
advantage of such sampling to make LMM analysis applicable to
extremely large datasets, introducing a reformulation of LMMs
called factored spectrally transformed LMM (FaST-LMM). We
show that, provided (i) the number of single-nucleotide poly-
morphisms (SNPs) used to estimate genetic similarity is less than
1
Microsoft Research, Los Angeles, California, USA.
2
Max Planck Institutes Tübingen, Tübingen, Germany.
3
These authors contributed equally to this work.
Correspondence should be addressed to C.L. (christoph.lippert@tuebingen.mpg.de), J.L. (jennl@microsoft.com) or D.H. (heckerma@microsoft.com).
Received 5 ApRil; Accepted 2 August; published online 4 septembeR 2011; doi:10.1038/nmeth.1681
© 2011 Nature America, Inc. All rights reserved.© 2011 Nature America, Inc. All rights reserved.