CIGAR field. Each candidate variant is marked with a
quality metric, “qual”, which is set to 1 or 0 according to whether
the candidate variant is in dbSNP. Then, a decision-tree
model is trained using the feature vectors of candidate
variants as the training set. After the model is trained,
candidate variants with similar feature values are
grouped into the same leaf node and treated as a unit.
For all the candidates in a leaf, if their average qual is
higher than the threshold, they are called out; otherwise,
they are identified as false positives. Finally, a simple and
effective genotyper is applied.
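To make the leaf-level decision concrete, the following is a minimal sketch assuming scikit-learn's CART implementation; the feature matrix, hyperparameters and threshold are illustrative stand-ins, not Fuwa's actual settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def call_variants(X, qual, threshold=0.5):
    """Train a CART model on candidate feature vectors, then accept or
    reject candidates leaf by leaf according to the mean dbSNP label."""
    qual = np.asarray(qual)
    tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=50)  # illustrative settings
    tree.fit(X, qual)

    leaf_ids = tree.apply(X)                  # leaf index of every candidate
    calls = np.zeros(len(X), dtype=bool)
    for leaf in np.unique(leaf_ids):
        in_leaf = leaf_ids == leaf
        if qual[in_leaf].mean() > threshold:  # average qual of the leaf
            calls[in_leaf] = True             # the whole leaf is called out
        # otherwise every candidate in the leaf is treated as a false positive
    return calls
```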
Generating and labelling candidate variants
Fuwa walks through the whole-genome sequence, gener-
ating candidate variants at each locus. Designed for high
sensitivity, Fuwa considers all 6 possible candidate
variants (i.e., A, T, G, C, insertion, deletion), and excludes
only those that account for too low a proportion of the read
depth at their loci. Feature values of these candidates are
also calculated. At the same time, the programme
searches dbSNP and labels each candidate with dbSNP
quality, or “qual” in short. Qual is set to 1 if the candi-
date exists in dbSNP and 0 if not. To improve search
speed, Fuwa preloads dbSNP into RAM and transforms
it into a hash table so that any search can be finished
in constant time. After this step, all candidate variants
are obtained and labelled.
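A minimal sketch of this labelling step is given below, assuming dbSNP has already been parsed into records with chrom, pos and alt attributes; the record and candidate types are hypothetical stand-ins, not Fuwa's data structures:

```python
def build_dbsnp_index(dbsnp_records):
    """Preload dbSNP into an in-memory hash set so each lookup is O(1) on average."""
    return {(rec.chrom, rec.pos, rec.alt) for rec in dbsnp_records}

def label_candidates(candidates, dbsnp_index):
    """Set qual to 1 if the candidate exists in dbSNP, 0 otherwise."""
    for cand in candidates:
        cand.qual = 1 if (cand.chrom, cand.pos, cand.alt) in dbsnp_index else 0
    return candidates
```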
To date, most common human variants have already
been catalogued in dbSNP. The high coverage rate of
SNVs and short indels qualifies dbSNP as a powerful
benchmark in alignment result recalibration [7] and final
call set quality assessment [5, 7, 11] as well as in training
machine learning models.
Decision tree and feature selection
Classification and regression tree (CART) [12] is a
widely used decision-tree training algorithm that can
be applied to either classification or regression problems.
It assumes the decision tree is binary: each non-leaf node
tests a Boolean expression, and input samples are routed
into two branches, the left branch if the expression is
true and the right branch otherwise. We chose CART
because it is simple and fast, and the decision procedure
can be easily understood.
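As a toy illustration of these binary splits (assuming scikit-learn's CART implementation; the feature values and names below are placeholders rather than Fuwa's real features):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[30, 0.9], [28, 0.1], [5, 0.8], [40, 0.05]]  # e.g. [EBD, EBD ratio]
y = [1, 0, 0, 0]                                   # dbSNP labels (qual)

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Each non-leaf node tests one Boolean expression (feature <= threshold):
# samples for which it is true go to the left branch, the rest go right.
print(export_text(clf, feature_names=["EBD", "EBD_ratio"]))
```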
Twelve features were selected to train the CART model;
they are divided into four categories, as follows.
Category I. Read depth
Features in this category measure the absolute depth
and the depth ratio of reads that are “effective” for a
specific candidate variant. “Effective” means that the read
shares the same base as the candidate variant at the
candidate’s locus.
Feature 1: effective base depth Effective Base Depth
(EBD) is the sum of the depths of the effective reads,
where each indel read contributes its mapping quality
and each SNV read contributes its mapping quality
multiplied by its base quality.
Feature 2: effective base depth ratio The EBD ratio is
the EBD of one candidate variant divided by the sum
of the EBDs of all candidate variants at that locus. If this
indicator is very low, the candidate variant tends
to be a random error.
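A minimal sketch of Features 1 and 2 is shown below, assuming each effective read exposes mapping_quality and base_quality attributes; the attribute and function names are illustrative rather than Fuwa's internals:

```python
def effective_base_depth(effective_reads, is_indel):
    """Feature 1 (EBD): indel reads contribute their mapping quality;
    SNV reads contribute mapping quality multiplied by base quality."""
    if is_indel:
        return sum(r.mapping_quality for r in effective_reads)
    return sum(r.mapping_quality * r.base_quality for r in effective_reads)

def ebd_ratio(candidate_ebd, all_ebds_at_locus):
    """Feature 2: EBD of one candidate divided by the sum of the EBDs of
    all candidate variants at the same locus."""
    total = sum(all_ebds_at_locus)
    return candidate_ebd / total if total > 0 else 0.0
```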
Feature 3: DeltaL DeltaL is a statistic describing the
difference between optimal and suboptimal genotypes.
Fuwa first hypothesizes that the variant is true, so the
reads covering this locus obey an almost ideal variant
model: 0/1 or 1/1. The logarithms of likelihood under
these two ideal models are calculated separately, and the
bigger one is selected as L1. Then, Fuwa calculates the
second likelihood logarithm, L2, under another hypothesis
that the variant is false and that reads covering this
locus follow the binomial distribution model. Thus, L1 - L2,
or DeltaL, is the logarithm of the ratio of the first
and second likelihoods. If DeltaL is close to 0, which
means the likelihoods of the ideal model and the binomial
model are nearly equal, we empirically judged the
variant to be false positive; otherwise, the variant tends
to be true.
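Under a simple binomial formulation (an assumption for illustration; the exact likelihoods and error rate used by Fuwa are not reproduced here), DeltaL can be sketched as follows, where k is the number of effective reads and n the total depth at the locus:

```python
from scipy.stats import binom

def delta_l(k, n, err=0.01):
    """Log-likelihood ratio between the best ideal variant model (0/1 or 1/1)
    and the false-variant model in which the effective reads are errors."""
    l_het = binom.logpmf(k, n, 0.5)        # ideal heterozygous (0/1) model
    l_hom = binom.logpmf(k, n, 1.0 - err)  # ideal homozygous (1/1) model
    l1 = max(l_het, l_hom)                 # likelihood logarithm of the better ideal model
    l2 = binom.logpmf(k, n, err)           # false variant: effective reads are sequencing errors
    return l1 - l2                         # close to 0 suggests a false positive
```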
Category II. Base quality
This category focuses on the accuracy of a base
sequenced by the sequencing machine, which has
considerable impact on variant calling.
Feature 4: Sum of Base Quality (SumBQ) This feature
is the sum of the base qualities of the effective reads for
one candidate variant. For indel reads, the base quality is
set to 30 empirically.
Feature 5: Average Base Quality (AveBQ) By dividing
SumBQ by the number of effective reads, we obtain
the average base quality.
Feature 6: Variance of Position (VarPos) Here, “pos-
ition” means the offset of the pile-up site from the 3′
end of a read. We use this statistic because, in general,
sequencing quality declines towards the end of a
read; thus, candidate variants that are close to the 3′
end are more likely to be sequencing errors.
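A minimal sketch of Features 4-6 for one candidate is given below, assuming each effective read exposes is_indel, base_quality and the offset of the pile-up site from its 3′ end (dist_from_3prime); the attribute names are illustrative:

```python
def base_quality_features(effective_reads):
    """Compute SumBQ, AveBQ and VarPos for the effective reads of one candidate."""
    bqs = [30 if r.is_indel else r.base_quality for r in effective_reads]  # indel reads: 30 empirically
    sum_bq = sum(bqs)                                   # Feature 4: SumBQ
    ave_bq = sum_bq / len(bqs) if bqs else 0.0          # Feature 5: AveBQ
    pos = [r.dist_from_3prime for r in effective_reads] # offsets from the 3' end
    mean_pos = sum(pos) / len(pos) if pos else 0.0
    var_pos = sum((p - mean_pos) ** 2 for p in pos) / len(pos) if pos else 0.0  # Feature 6: VarPos
    return sum_bq, ave_bq, var_pos
```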