
Grammatical Evolution Support Vector Machines for
Predicting Human Genetic Disease Association
Skylar Marvel
North Carolina State University
Bioinformatics Research Center
Raleigh, NC 27695
swmarvel@ncsu.edu
Alison Motsinger-Reif
North Carolina State University
Bioinformatics Research Center
Raleigh, NC 27695
aamotsin@ncsu.edu
ABSTRACT
Identifying genes that predict common, complex human diseases is a major goal of human genetics. This is made difficult by the effect of epistatic interactions and the need to analyze datasets with high-dimensional feature spaces. Many classification methods have been applied to this problem, one of the more recent being Support Vector Machines (SVMs). Selection of which features to include in the SVM model and what parameters or kernels to use can often be a difficult task. This work uses Grammatical Evolution (GE) as a way to choose features and parameters. Initial results look promising and encourage further development and testing of this new approach.
Categories and Subject Descriptors
I.2.m [Artificial Intelligence]: Miscellaneous—Genetic-
Based Machine Learning and Learning Classifier Systems
General Terms
Algorithms
Keywords
Support vector machine, grammatical evolution, Single Nucleotide Polymorphism (SNP), epistasis
1. INTRODUCTION
The ability to identify genes that predict common, complex human diseases is an intense area of research. Such diseases are often caused by the combination of many genetic and environmental factors, each contributing a small effect [8]. Identification of genetic factors is made difficult by the interactions between different genes, referred to as epistasis [3]. Traditional parametric statistical methods used to characterize gene-gene or gene-environment interactions fail when applied to large datasets [4], which has stimulated the development of novel computational approaches that are
GECCO’12 Companion, July 7–11, 2012, Philadelphia, PA, USA.
Copyright 2012 ACM 978-1-4503-1178-6/12/07 ...$10.00.
able to extract information from data obtained during this 'omics' era.
One popular approach for detecting disease association involves the use of machine-learning classification methods [1, 2, 9, 14]. A few of the most common methods are Artificial Neural Networks (ANNs), Decision Trees (DTs), and Support Vector Machines (SVMs), the latter of which has been steadily gaining popularity. Due to the enormous size of the datasets that are being analyzed, feature selection is an extremely important aspect of these classification methods [11]. In addition, properties innate to the classification technique also influence performance, e.g., the architecture of an artificial neural network or the kernel parameter(s) of a support vector machine.
To address these issues, many techniques are being developed that combine machine-learning classification methods with algorithms that select features and classifier architecture [2, 7, 9, 12]. Genetic programming algorithms are often used for this purpose [2, 7, 12]; however, application of Grammatical Evolution (GE) has been shown to outperform the genetic programming counterpart for ANNs [9]. Motivated by this result and the increasing use of SVMs, this work begins the process of combining GE and SVMs for the purpose of predicting human genetic disease associations.
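As an illustration of the idea behind this combination (a minimal sketch, not the grammar or implementation used in this work), GE's standard genotype-to-phenotype mapping can be used to turn a list of integer codons into an SVM configuration, with each codon taken modulo the number of available choices. The grammar choices below (feature-count rule, kernel menu, candidate C values) are hypothetical, chosen only for the example:

```python
def map_genotype(codons, n_features):
    """Map integer codons to a hypothetical SVM configuration.

    Each decision consumes one codon, taken modulo the number of
    available choices, as in standard GE genotype->phenotype mapping.
    (A real GE mapper would also wrap around the codon list when it
    runs out; that is omitted here for brevity.)
    """
    it = iter(codons)

    # Choose how many feature slots to fill (1 .. n_features).
    k = next(it) % n_features + 1

    # Choose feature indices, skipping duplicates.
    features = []
    for _ in range(k):
        idx = next(it) % n_features
        if idx not in features:
            features.append(idx)

    # Choose a kernel from an illustrative menu.
    kernels = ["linear", "poly", "rbf"]
    kernel = kernels[next(it) % len(kernels)]

    # Choose the misclassification penalty C from candidate values.
    c_values = [0.1, 1.0, 10.0, 100.0]
    C = c_values[next(it) % len(c_values)]
    return features, kernel, C

print(map_genotype([3, 0, 2, 5, 1, 7, 2], n_features=4))
# -> ([0, 2, 1], 'poly', 10.0)
```

The evolutionary loop would then train an SVM restricted to the selected features with the selected kernel and C, and use its classification accuracy as the fitness of the genotype.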
2. METHODS
2.1 Support Vector Machines
SVMs are non-probabilistic binary classifiers that can be used to construct a hyperplane to separate data into one of two classes [13]. Consider a set of $n$ data points, each consisting of $p$ features, $x \in \mathbb{R}^p$, and a class label, $y \in \{-1, 1\}$, i.e., $(x_i, y_i)$ for $i = 1, \ldots, n$. A hyperplane can be defined by a normal vector, $w$, and offset, $b$. In addition, slack variables, $\xi_i$, can be introduced to represent the degree of misclassification when data points are not linearly separable.
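To make the slack variables concrete (an illustrative sketch, not part of the paper; the hyperplane $(w, b)$ and points below are invented for the example), $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$ measures how far a point falls on the wrong side of its margin:

```python
def slack(w, b, x, y):
    """Slack for one point under a fixed linear hyperplane (w, b):
    zero when the point is correctly classified with margin >= 1,
    positive when it violates the margin or is misclassified."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

w, b = [1.0, -1.0], 0.0
print(slack(w, b, [2.0, 0.0], +1))   # 0.0: correct, outside the margin
print(slack(w, b, [0.5, 0.0], +1))   # 0.5: correct, but inside the margin
print(slack(w, b, [-1.0, 0.0], +1))  # 2.0: misclassified
```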
The objective function of the SVM is then
\[
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0,
\tag{1}
\]
where $C$ is a linear misclassification penalty and $\phi$ is a nonlinear transformation function that projects $x \in \mathbb{R}^p$ into a higher-dimensional feature space. Using the relationship