
Grammatical Evolution Support Vector Machines for
Predicting Human Genetic Disease Association
Skylar Marvel
North Carolina State University
Bioinformatics Research Center
Raleigh, NC 27695
swmarvel@ncsu.edu
Alison Motsinger-Reif
North Carolina State University
Bioinformatics Research Center
Raleigh, NC 27695
aamotsin@ncsu.edu
ABSTRACT
Identifying genes that predict common, complex human diseases is a major goal of human genetics. This is made difficult by the effect of epistatic interactions and the need to analyze datasets with high-dimensional feature spaces. Many classification methods have been applied to this problem, one of the more recent being Support Vector Machines (SVMs). Selection of which features to include in the SVM model and what parameters or kernels to use can often be a difficult task. This work uses Grammatical Evolution (GE) as a way to choose features and parameters. Initial results look promising and encourage further development and testing of this new approach.
Categories and Subject Descriptors
I.2.m [Artificial Intelligence]: Miscellaneous—Genetic-
Based Machine Learning and Learning Classifier Systems
General Terms
Algorithms
Keywords
Support vector machine, grammatical evolution, Single Nucleotide Polymorphism (SNP), epistasis
1. INTRODUCTION
The ability to identify genes that predict common, complex human diseases is an intense area of research. Such diseases are often caused by the combination of many genetic and environmental factors, each contributing a small effect [8]. Identification of genetic factors is made difficult by the interactions between different genes, referred to as epistasis [3]. Traditional parametric statistical methods used to characterize gene-gene or gene-environment interactions fail when applied to large datasets [4], which has stimulated the development of novel computational approaches that are
GECCO’12 Companion, July 7–11, 2012, Philadelphia, PA, USA.
Copyright 2012 ACM 978-1-4503-1178-6/12/07 ...$10.00.
able to extract information from data obtained during this 'omics' era.
One popular approach for detecting disease association involves the use of machine-learning classification methods [1, 2, 9, 14]. A few of the most common methods are Artificial Neural Networks (ANNs), Decision Trees (DTs), and Support Vector Machines (SVMs), the latter of which has been steadily gaining popularity. Due to the enormous size of the datasets that are being analyzed, feature selection is an extremely important aspect of these classification methods [11]. In addition, properties innate to the classification technique also influence performance, e.g., the architecture of an artificial neural network or the kernel parameter(s) of a support vector machine.
To address these issues, many techniques are being developed that combine machine-learning classification methods with algorithms that select features and classifier architecture [2, 7, 9, 12]. Genetic programming algorithms are often used for this purpose [2, 7, 12]; however, application of Grammatical Evolution (GE) has been shown to outperform the genetic programming counterpart for ANNs [9]. Motivated by this result and the increasing use of SVMs, this work begins the process of combining GE and SVMs for the purpose of predicting human genetic disease associations.
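As an illustration of the idea behind this combination (a minimal sketch, not the grammar or implementation used in this work), GE's standard genotype-to-phenotype mapping can be used to turn a list of integer codons into an SVM configuration, with each codon taken modulo the number of available choices. The grammar choices below (feature-count rule, kernel menu, candidate C values) are hypothetical, chosen only for the example:

```python
def map_genotype(codons, n_features):
    """Map integer codons to a hypothetical SVM configuration.

    Each decision consumes one codon, taken modulo the number of
    available choices, as in standard GE genotype->phenotype mapping.
    (A real GE mapper would also wrap around the codon list when it
    runs out; that is omitted here for brevity.)
    """
    it = iter(codons)

    # Choose how many feature slots to fill (1 .. n_features).
    k = next(it) % n_features + 1

    # Choose feature indices, skipping duplicates.
    features = []
    for _ in range(k):
        idx = next(it) % n_features
        if idx not in features:
            features.append(idx)

    # Choose a kernel from an illustrative menu.
    kernels = ["linear", "poly", "rbf"]
    kernel = kernels[next(it) % len(kernels)]

    # Choose the misclassification penalty C from candidate values.
    c_values = [0.1, 1.0, 10.0, 100.0]
    C = c_values[next(it) % len(c_values)]
    return features, kernel, C

print(map_genotype([3, 0, 2, 5, 1, 7, 2], n_features=4))
# -> ([0, 2, 1], 'poly', 10.0)
```

The evolutionary loop would then train an SVM restricted to the selected features with the selected kernel and C, and use its classification accuracy as the fitness of the genotype.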
2. METHODS
2.1 Support Vector Machines
SVMs are non-probabilistic binary classifiers that can be used to construct a hyperplane to separate data into one of two classes [13]. Consider a set of $n$ data points, each consisting of $p$ features, $x \in \mathbb{R}^p$, and a class label, $y \in \{-1, 1\}$, i.e., $(x_i, y_i)$ for $i = 1, \ldots, n$. A hyperplane can be defined by a normal vector, $w$, and offset, $b$. In addition, slack variables, $\xi_i$, can be introduced to represent the degree of misclassification when data points are not linearly separable.
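To make the slack variables concrete (an illustrative sketch, not part of the paper; the hyperplane $(w, b)$ and points below are invented for the example), $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$ measures how far a point falls on the wrong side of its margin:

```python
def slack(w, b, x, y):
    """Slack for one point under a fixed linear hyperplane (w, b):
    zero when the point is correctly classified with margin >= 1,
    positive when it violates the margin or is misclassified."""
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

w, b = [1.0, -1.0], 0.0
print(slack(w, b, [2.0, 0.0], +1))   # 0.0: correct, outside the margin
print(slack(w, b, [0.5, 0.0], +1))   # 0.5: correct, but inside the margin
print(slack(w, b, [-1.0, 0.0], +1))  # 2.0: misclassified
```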
The objective function of the SVM is then
\[
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0,
\tag{1}
\]
where $C$ is a linear misclassification penalty and $\phi$ is a nonlinear transformation function that projects $x \in \mathbb{R}^p$ into a higher-dimensional feature space. Using the relationship