Elements (ENCODE) project [15]. Combined Annotation Dependent Depletion
(CADD) [22] is a score that can be used to prioritize functional variants,
including both coding and regulatory variants. The CADD tool constructs support
vector machine classifiers to integrate diverse genomic and epigenomic annotations
into a single measure (the C score) for each sequence variant. Recently, deep learning
algorithms have been applied to the interpretation of regulatory variants. DeepSEA [62]
is a deep learning-based tool for predicting the effects of noncoding variants and
prioritizing regulatory variants. The software learns a
regulatory sequence code from large-scale chromatin-profiling data and predicts the
effects of noncoding variants on chromatin features such as DNase I sensitivity,
transcription factor binding, and histone marks at regulatory elements.
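The idea of scoring a noncoding variant by comparing model predictions for the reference and alternate sequences can be sketched as follows. This is a minimal illustration, not DeepSEA's actual model or API: the function names, the single linear filter with a sigmoid standing in for a trained deep network, and the weight values are all hypothetical and chosen only to make the example runnable.

```python
import math

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors (order: A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def toy_predict(seq, weights, bias=0.0):
    """Hypothetical stand-in for a trained network: one linear filter over
    the one-hot sequence, followed by a sigmoid to give a probability."""
    encoded = one_hot(seq)
    z = bias + sum(w * x
                   for w_row, x_row in zip(weights, encoded)
                   for w, x in zip(w_row, x_row))
    return 1.0 / (1.0 + math.exp(-z))

def variant_effect(ref_seq, pos, alt_base, weights):
    """Return (p_ref, p_alt, p_alt - p_ref) for a single-nucleotide change."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    p_ref = toy_predict(ref_seq, weights)
    p_alt = toy_predict(alt_seq, weights)
    return p_ref, p_alt, p_alt - p_ref

# Made-up weights that favour the motif "TGAC" (assumption for illustration).
TOY_WEIGHTS = [
    [-1.0, -1.0, -1.0, 2.0],  # position 0 prefers T
    [-1.0, -1.0, 2.0, -1.0],  # position 1 prefers G
    [2.0, -1.0, -1.0, -1.0],  # position 2 prefers A
    [-1.0, 2.0, -1.0, -1.0],  # position 3 prefers C
]

# Substituting A with G at position 2 disrupts the toy motif.
p_ref, p_alt, delta = variant_effect("TGAC", 2, "G", TOY_WEIGHTS)
print(f"ref={p_ref:.4f} alt={p_alt:.4f} delta={delta:.4f}")
```

In a real setting the toy predictor would be replaced by a network trained on chromatin-profiling data, and the difference would be computed for each predicted chromatin feature rather than for a single output.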
General Variant Annotation. Multiple annotation toolkits have been developed to
determine the impact of sequence variants on genes and proteins and to retrieve
functional-effect predictions from the predictors described above. ANNOVAR [58] is a command-line Perl
tool for annotating SNVs and INDELs based on genes, regions, or filters. In
gene-based annotation, it reports whether a sequence variant alters the protein
amino acid sequence (nonsense, missense, splice site, etc.). In region-based annotation,
it can identify variants located in ENCODE-annotated regions such as
transcribed regions, enhancer regions, DNase I hypersensitivity sites, transcription
factor binding sites, and regions covered by transcription factor ChIP-Seq data. In filter-based annotation,
ANNOVAR can retrieve information about a sequence variant (such as allele
frequency and identifier) from public databases such as dbSNP [53], ClinVar [25], the 1000
Genomes Project [1], and the Exome Variant Server
(http://evs.gs.washington.edu/EVS/). In addition, it can be used to access annotations from damaging mutation
predictors (SIFT, PolyPhen2, LRT, MutationTaster, etc.) for nonsynonymous mutations
and CADD scores for regulatory variants. SnpEff [14] is another popular annotation
package that estimates the functional effects of SNVs, INDELs, and multiple-nucleotide
polymorphisms. Based on the functional impact of each sequence variant, SnpEff
assigns it to one of four classes: high, moderate, low, or modifier. It also
provides annotations for regulatory variants. SnpEff generates a summary
HTML page that displays overall statistics for sequences and variants (Table 1.3).
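As an illustration of the four impact classes, the sketch below tallies the most severe impact per variant from SnpEff-annotated VCF lines. It assumes the annotations are stored in an `ANN` INFO field whose pipe-separated subfields begin with allele, effect, and impact, in that order; this layout should be checked against the SnpEff version in use. The function names, gene names, and example records are made up for the illustration.

```python
from collections import Counter

# Impact classes ordered from most to least severe.
IMPACT_ORDER = ("HIGH", "MODERATE", "LOW", "MODIFIER")

def impacts_from_info(info):
    """Return the impact label of every annotation in a VCF INFO string."""
    for field in info.split(";"):
        if field.startswith("ANN="):
            annotations = field[len("ANN="):].split(",")
            # Assumed subfield layout: Allele|Effect|Impact|Gene_Name|...
            return [a.split("|")[2] for a in annotations if a.count("|") >= 2]
    return []

def most_severe(impacts):
    """Pick the most severe impact class present, or None if there is none."""
    for level in IMPACT_ORDER:
        if level in impacts:
            return level
    return None

def tally_impacts(vcf_lines):
    """Count variants by their most severe annotated impact."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # INFO is the eighth VCF column
        level = most_severe(impacts_from_info(columns[7]))
        if level is not None:
            counts[level] += 1
    return counts

# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "chr1\t1000\t.\tA\tG\t.\tPASS\tANN=G|missense_variant|MODERATE|GENE_A",
    "chr1\t2000\t.\tC\tT\t.\tPASS\t"
    "ANN=T|stop_gained|HIGH|GENE_B,T|upstream_gene_variant|MODIFIER|GENE_C",
    "chr2\t3000\t.\tG\tC\t.\tPASS\tANN=C|intergenic_region|MODIFIER|",
]

counts = tally_impacts(example_vcf)
for level in IMPACT_ORDER:
    print(level, counts[level])
```

Taking the most severe class per variant is one possible design choice; counting every annotation separately would instead reflect the number of affected transcripts.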
1.3.2 Variant Association Testing
Understanding how genetic variants contribute to diseases is a key challenge in
precision medicine. Two hypotheses have been proposed to explain the genetic
contribution of sequence variants to complex diseases such as cancers and mental
disorders [50]. The “common disease–common variant” hypothesis states that a
few common variants, usually defined as variants with an allele frequency greater than 1% in the
population, account for most of the genetic variance in complex
disease susceptibility. In contrast, the “common disease–rare variant” hypothesis
argues that multiple risk variants, each of which has a low frequency (e.g., an allele
frequency of less than 1%) in the population, are the major contributors to the genetic