Gene Selection and Tumor Sample Classification for Microarray Data Based on Rough Set Techniques


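"Interval-valued analysis for discriminative gene selection and tissue sample classification using microarray data". This paper addresses two main problems in microarray data classification: overfitting and sensitivity to noise. Microarray data are high-dimensional with few samples, which limits the accuracy of classification methods, and conventional classifiers such as support vector machines (SVM) and random forests (RF) suffer from both problems. To address them, we propose an interval-valued analysis method based on rough set techniques that selects discriminative genes and classifies tissue samples. Rough set theory is a mathematical tool for handling uncertain and incomplete information; it describes a set through its lower approximation, its upper approximation, and its boundary region. In our method, the microarray data are first converted into an interval-valued decision table, rough set techniques are then used to select discriminative genes, and the selected genes are used to classify tissue samples. Experimental results show that the method selects discriminative genes and classifies tissue samples effectively. Its advantages are that it handles high-dimensional, small-sample microarray data and is robust to noise. The method can also be applied to other types of bioinformatics data, such as protein expression data and DNA sequence data. Gene selection and tissue sample classification are two important problems in bioinformatics, and the method offers an effective solution to both.

Keywords: Microarray; Gene selection; Classification; Rough sets; Interval-valued decision table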


label indicates the class to which each row belongs. The class label is termed a decision attribute; the remaining attributes are termed condition attributes. Rough set theory distinguishes itself from other machine learning and pattern recognition methods through three notions: indiscernibility, approximation, and reduction of attributes (introduced in Sections 2.2 and 3.4). The first defines a relationship under which two objects are equivalent only with respect to a selected set of attributes. The second makes it possible to describe a set whose boundaries are unknown by analyzing how that set relates to the objects in the universe. The third allows irrelevant information to be removed, thus saving valuable resources. These three concepts give rough set theory an advantage over other classical methods, as it does not need any preliminary or additional information about the data; by contrast, probability in statistics, and grade of membership or possibility value in fuzzy set theory, all require such information. The characteristics of microarray data, namely small sample size and very large dimensionality, create new challenges in obtaining preliminary information.

In practice, discretization is a common preprocessing step before rough set based mining of gene expression data; it transforms continuous gene expression levels into categorical item sets [26,27]. If a particular gene's expression level is higher than the discretization threshold, the gene is considered expressed; otherwise it is considered unexpressed. Much information is lost in this transformation, particularly given the noise that is inherent in microarray data [28]. Previous research has shown that handling uncertainty in such applications by representing the data as intervals leads to accurate learning algorithms [29,30].

In this study, we propose an interval-valued analysis method to select discriminative genes and to use these genes to classify tissue samples of microarray data. We first select a small subset of genes based on an interval-valued rough set by considering the preference-ordered domains of the gene expression data, and then classify a test sample into a class according to a similarity degree.

To summarize the process:

• The interval-valued decision table of the microarray is generated. In the decision table, each row corresponds to a class of tissue samples, and each column (condition attribute) corresponds to a gene's expression value over all classes of samples. To generate the decision table, the decision attribute is the average gene expression value of a class, and the condition attribute is given by the 1st quartile and the 3rd quartile of the gene expression value within a class.

• In the gene selection step, our objective is to determine the reducts that discern between objects belonging to different classes. The reduct, from rough set theory, corresponds to a minimal subset of discriminative genes. The steps of this algorithm are described in Section 4.1.

• The tissue sample classification is based on the selected genes. The proposed interval-valued classification method assigns a sample to the class with the maximum similarity degree.
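The first step of the process can be illustrated with a minimal sketch. It only builds the condition-attribute intervals, one [Q1, Q3] pair per class and gene, and makes several assumptions that are not stated in the text: the function name `interval_decision_table`, the toy data, and the use of NumPy's default (linear interpolation) quartile convention are all hypothetical choices for illustration.

```python
import numpy as np

def interval_decision_table(X, labels):
    """Map each class label to an (n_genes, 2) array of [Q1, Q3] intervals.

    X      : samples-by-genes expression matrix
    labels : one class label per sample (row of X)
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    table = {}
    for c in np.unique(labels):
        rows = X[labels == c]                       # samples of class c
        # Quartiles per gene; NumPy's default interpolation is an assumption.
        q1, q3 = np.percentile(rows, [25, 75], axis=0)
        table[c.item()] = np.column_stack([q1, q3])
    return table

# Toy data: 6 samples x 2 genes, two classes (illustrative values only).
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0],
     [7.0, 10.5], [8.0, 11.5], [9.0, 12.5]]
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]

table = interval_decision_table(X, labels)
for cls, intervals in table.items():
    print(cls, intervals.tolist())
```

In this toy table the intervals of the first gene are disjoint between the two classes, while those of the second gene overlap.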

To facilitate our discussion, we first present the basic notions in Section 2. Section 3 presents the discernibility approach to compute reducts from the compared dominance relationships. In Section 4, we describe our gene selection and tissue sample classification method. In Section 5, we apply our approach to the analysis of real microarray data. In that section, we also discuss RNA-sequencing data, that is, data from next-generation sequencing technologies, and their analysis using the proposed method. Finally, Section 6 summarizes our approach and presents our conclusions.

2. Preliminaries

2.1. Microarray dataset

A microarray dataset is a gene expression matrix, in which each column represents a gene and each row represents a sample (or experiment) with a class label. Let $G = \{g_1, \cdots, g_n\}$ be a set of genes and $U = \{s_1, \cdots, s_m\}$ be a set of samples. The corresponding gene expression matrix can be represented as $X = (x_{i,j})_{m \times n}$, where $x_{i,j}$ is the expression level of gene $g_j$ in sample $s_i$, and usually $n \gg m$. Here $m$ is the number of samples, and $n$ is the number of genes. The matrix $X$ is composed of $m$ row vectors $s_i \in \mathbb{R}^n$, $i = 1, 2, \cdots, m$. Each vector $s_i$ in the gene expression matrix may be regarded as a point in $n$-dimensional space, and each of the $n$ columns consists of an $m$-element expression vector for a single gene.

A microarray dataset can be regarded as a decision table $S = \langle U, AT \cup d, V, f \rangle$, where $U$ denotes the set of samples, $AT$ denotes the set of the condition attributes (genes), $d$ denotes the decision attribute (class label), $V$ is the domain of $AT \cup d$, and $x_{i,j} = f(s_i, g_j)$.

2.2. Rough set

An information system is a 4-tuple $S = \langle U, A, V, f \rangle$, where $U$ is a non-empty and finite set of objects, called the universe; $A$ is a non-empty and finite set of attributes, such that $\forall a \in A: U \to V_a$, where $V_a$ is the domain of attribute $a$; $V$ is regarded as the domain of all attributes, such that $V = V_A = \cup_{a \in A} V_a$; and $f(x, a)$ is the value that $x$ holds on $a$ ($\forall x \in U, a \in A$).

A decision table is an information system $S = \langle U, AT \cup d, V, f \rangle$, where $d \notin AT$. $d$ is a complete attribute called a decision, and $AT$ is the condition attribute set.

For an information system $S$, it is possible to describe relationships between objects through their attribute values. With respect to a subset of attributes $A \subseteq AT$, an indiscernibility relationship $IND(A)$ [31] may be defined as:

$$IND(A) = \{(x, y) \in U^2 : \forall a \in A,\ f(x, a) = f(y, a)\}$$

$IND(A)$ is an equivalence relationship because it is reflexive, symmetric, and transitive. With the relationship $IND(A)$, two objects are considered to be indiscernible if, and only if, they have the same value on each $a \in A$.
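As an illustration, the equivalence classes induced by IND(A) can be computed by grouping objects on their tuple of values over A. The sketch below assumes a small categorical table; the function name `indiscernibility_classes`, the sample names, and the "high"/"low" values are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def indiscernibility_classes(table, attrs):
    """Partition the objects of `table` into the equivalence classes of IND(attrs).

    table : {object: {attribute: value}}
    attrs : attributes A on which objects are compared
    """
    blocks = defaultdict(set)
    for obj, row in table.items():
        # Two objects share a block iff they agree on every attribute in attrs.
        key = tuple(row[a] for a in attrs)
        blocks[key].add(obj)
    return list(blocks.values())

# Toy information system with four samples and two discretized genes.
table = {
    "s1": {"g1": "high", "g2": "low"},
    "s2": {"g1": "high", "g2": "low"},
    "s3": {"g1": "high", "g2": "high"},
    "s4": {"g1": "low", "g2": "high"},
}

classes_g1 = indiscernibility_classes(table, ["g1"])
classes_both = indiscernibility_classes(table, ["g1", "g2"])
print(classes_g1)    # s1, s2, s3 are indiscernible on g1 alone
print(classes_both)  # adding g2 separates s3 from s1 and s2
```

Enlarging the attribute subset can only refine the partition, which is why adding g2 splits a block but never merges two.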

Based on the indiscernibility relationship $IND(A)$, it is possible to derive the lower and upper approximations of an arbitrary subset $X$ of $U$, which are defined as [31]:

$$\underline{A}(X) = \{x \in U : [x]_A \subseteq X\} \quad \text{and} \quad \overline{A}(X) = \{x \in U : [x]_A \cap X \neq \emptyset\}$$

respectively, where $[x]_A = \{y \in U : (x, y) \in IND(A)\}$ is the $A$-equivalence class containing $x$. The pair $(\underline{A}(X), \overline{A}(X))$ is referred to as the Pawlak rough set of $X$ with respect to the subset of attributes $A$.
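The two approximations can be sketched directly from these definitions: a block of the partition goes into the lower approximation when it lies entirely inside the target set, and into the upper approximation when it intersects the target set. The function name `approximations` and the toy table below are assumptions made for illustration only.

```python
def approximations(table, attrs, target):
    """Return (lower, upper) Pawlak approximations of `target` w.r.t. `attrs`.

    table  : {object: {attribute: value}}
    attrs  : attribute subset A
    target : set of objects X, a subset of the universe
    """
    # Build the equivalence classes [x]_A by grouping on attribute values.
    blocks = {}
    for obj, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(obj)

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # [x]_A is contained in X
            lower |= block
        if block & target:       # [x]_A intersects X
            upper |= block
    return lower, upper

# Toy information system (hypothetical values).
table = {
    "s1": {"g1": "high", "g2": "low"},
    "s2": {"g1": "high", "g2": "low"},
    "s3": {"g1": "high", "g2": "high"},
    "s4": {"g1": "low", "g2": "high"},
}
X = {"s1", "s3"}

lower, upper = approximations(table, ["g1", "g2"], X)
boundary = upper - lower
print(sorted(lower), sorted(upper), sorted(boundary))
```

Here s1 cannot be told apart from s2, which lies outside X, so the block {s1, s2} ends up in the upper approximation only.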

2.3. Inclusion degree

A partial order on a set $X$ is a binary relationship $\preceq$ with the following properties: $x \preceq x$ (reflexive); $x \preceq y$ and $y \preceq x$ imply $x = y$ (antisymmetric); $x \preceq y$ and $y \preceq z$ imply $x \preceq z$ (transitive).

Definition 1. [32,33] Let $(X, \preceq)$ be a partially ordered set. If for any $x, y \in X$, there is a real number $I(y/x)$ with the following properties: (1) $0 \le I(y/x) \le 1$; (2) $x \preceq y$ implies $I(y/x) = 1$; (3) $x \preceq y \preceq z$ implies $I(x/z) \le I(x/y)$; then $I$ is called an inclusion degree on $X$.
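A familiar instance of Definition 1, given here as an illustration rather than as the measure used in the paper, takes the subsets of a finite set ordered by inclusion and sets I(y/x) = |x ∩ y| / |x|, with the convention I(y/x) = 1 when x is empty. The function name `inclusion_degree` and the example sets are assumptions.

```python
def inclusion_degree(y, x):
    """I(y/x) = |x & y| / |x| on finite sets ordered by inclusion.

    By convention (an assumption of this sketch) the empty set is
    included in everything with degree 1.
    """
    x, y = set(x), set(y)
    if not x:
        return 1.0
    return len(x & y) / len(x)

# A chain x <= y <= z under set inclusion.
x, y, z = {1}, {1, 2}, {1, 2, 3, 4}

full = inclusion_degree(y, x)     # property (2): x is a subset of y
i_x_z = inclusion_degree(x, z)    # I(x/z)
i_x_y = inclusion_degree(x, y)    # I(x/y)
print(full, i_x_z, i_x_y)
```

For this chain the values satisfy property (3), since I(x/z) = 1/4 does not exceed I(x/y) = 1/2.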

For an information system $S$ with universe $U$, the collection of all normal fuzzy subsets of $U$ is denoted by $\mathcal{F}(U)$. Let $F_1, F_2 \in \mathcal{F}(U)$; if $\mu_{F_1}(x) \le \mu_{F_2}(x)$ for all $x \in U$, then $F_1 \preceq F_2$. It is well known that $(\mathcal{F}(U), \preceq)$ is a partially ordered set.

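Definition 2. [34] Suppose that $(\mathcal{F}(U), \preceq)$ is a partially ordered set; then $I$ is an inclusion degree on $\mathcal{F}(U)$ if the following conditions hold: (1) $0 \le I(F_2/F_1) \le 1$; (2) $F_1 \preceq F_2 \Rightarrow I(F_2/F_1) = 1$; (3) $F_1 \preceq F_2 \preceq F_3 \Rightarrow I(F_1/F_3) \le I(F_1/F_2)$, where $F_1, F_2, F_3 \in \mathcal{F}(U)$.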
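For fuzzy subsets of a finite universe, one measure that satisfies the three conditions is the ratio of the summed pointwise minimum to the summed membership of the included set. The sketch below is an illustrative choice, not necessarily the inclusion degree adopted later in the paper; the function name `fuzzy_inclusion_degree` and the membership vectors are assumptions.

```python
import numpy as np

def fuzzy_inclusion_degree(mu2, mu1):
    """I(F2/F1) = sum_x min(mu1(x), mu2(x)) / sum_x mu1(x).

    mu1, mu2 : membership vectors of F1 and F2 over the same finite universe.
    F1 is assumed normal (some membership equals 1), so the denominator is > 0.
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    return float(np.minimum(mu1, mu2).sum() / mu1.sum())

# A chain F1 <= F2 <= F3 under the pointwise order on memberships.
F1 = [0.2, 1.0, 0.4]
F2 = [0.5, 1.0, 0.6]
F3 = [1.0, 1.0, 0.9]

i_21 = fuzzy_inclusion_degree(F2, F1)   # condition (2): equals 1
i_13 = fuzzy_inclusion_degree(F1, F3)   # I(F1/F3)
i_12 = fuzzy_inclusion_degree(F1, F2)   # I(F1/F2)
print(i_21, i_13, i_12)
```

With this choice, condition (3) follows because the numerator stays at the total membership of F1 while the denominator grows along the chain.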

Y. Qi, X. Yang / Genomics 101 (2013) 38–48
