ANNOVAR: functional annotation of genetic variants
from high-throughput sequencing data
Kai Wang 1,*, Mingyao Li 2 and Hakon Hakonarson 1,3
1 Center for Applied Genomics, Children’s Hospital of Philadelphia, 2 Department of Biostatistics and
Epidemiology and 3 Department of Pediatrics, University of Pennsylvania, Philadelphia, PA 19104, USA
Received March 27, 2010; Revised June 2, 2010; Accepted June 18, 2010
ABSTRACT
High-throughput sequencing platforms are generating
massive amounts of genetic variation data for
diverse genomes, but it remains a challenge to
pinpoint a small subset of functionally important
variants. To fill these unmet needs, we developed
the ANNOVAR tool to annotate single nucleotide
variants (SNVs) and insertions/deletions, such as
examining their functional consequence on genes,
inferring cytogenetic bands, reporting functional
importance scores, finding variants in conserved
regions, or identifying variants reported in the 1000
Genomes Project and dbSNP. ANNOVAR can utilize
annotation databases from the UCSC Genome
Browser or any annotation data set conforming to
Generic Feature Format version 3 (GFF3). We also
illustrate a ‘variants reduction’ protocol on
4.7 million SNVs and indels from a human genome,
including two causal mutations for Miller syndrome,
a rare recessive disease. Through a stepwise
procedure, we excluded variants that are unlikely to be
causal, and identified 20 candidate genes including
the causal gene. Using a desktop computer,
ANNOVAR requires 4 min to perform gene-based
annotation and 15 min to perform variants
reduction on 4.7 million variants, making it practical to
handle hundreds of human genomes in a day.
ANNOVAR is freely available at
http://www.openbioinformatics.org/annovar/.
INTRODUCTION
High-throughput sequencing data have been produced at
unprecedented rates for diverse genomes. There is a strong
need for novel informatics and analytical strategies,
including methods for sequencing reads alignment,
variant identification, genotype calling and association
tests, in order to take advantage of the massive amounts
of sequencing data. Dozens of short-read
alignment tools with different functionalities are now
available (1), as well as several single nucleotide variant
(SNV) and copy number variant (CNV) calling algorithms
(2). However, there is a paucity of methods that can
simultaneously handle a large number of called variants
(typically >3 million variants for a given human genome) and
annotate their functional impacts, despite the fact that this
is an important task in many sequencing applications.
Even when sequencing only exonic regions for
Mendelian diseases such as Freeman–Sheldon syndrome,
each subject still carries a total of 20 000 variants, but
only two variants in trans are the true disease causal
mutations (3). Therefore, identifying a small subset of
functionally important variants from large amounts of
sequencing data is important to pinpoint potential
disease causal genes and causal mutations.
Several reasons motivated us to develop a functional
annotation pipeline for genetic variants. First, although
companies that manufacture sequencing machines or
provide sequencing services typically offer software for
functional annotation, such software is usually
specific to a sequencing platform and cannot be extended to
handle users’ specific needs (such as using different
genome builds or gene annotations). Second, although
several databases have been developed for the functional
annotation of SNPs or CNVs (4–6), most of them are
limited to known variants, typically those reported in
dbSNP or CNV databases. We note that some exceptions
exist (7); for example, the F-SNP tool (8) and the
Seattle Seq tool (http://gvs.gs.washington.edu/SeattleSeqAnnotation/)
can be used for annotation of novel SNPs.
Third, several previously developed mutation prediction
algorithms, such as SIFT (9) and PolyPhen (10), require
building multiple alignments on sequence databases, can
only handle non-synonymous mutations, and are difficult
to scale up to many model organism genomes.
Nevertheless, for human genomes, SIFT/PolyPhen scores
for all possible non-synonymous mutations can be
*To whom correspondence should be addressed. Tel: +1 215 426 1256; Fax: +1 267 426 0363; Email: kai@openbioinformatics.org
Published online 3 July 2010
Nucleic Acids Research, 2010, Vol. 38, No. 16 e164
doi:10.1093/nar/gkq603
© The Author(s) 2010. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.