label indicates the class to which each row belongs. The class label is
termed a decision attribute; the remaining attributes are termed
condition attributes. Rough set theory distinguishes itself from other
machine learning and pattern recognition methods through three notions:
indiscernibility, approximation, and reduction of attributes (introduced
in Sections 2.2 and 3.4). The first defines a relationship under which two
objects are equivalent only with respect to a selected set of attributes.
The second makes it possible to describe a set with unknown boundaries by
analyzing how that set relates to the objects in the universe. The third
allows irrelevant information to be removed, thus saving valuable
resources. These three concepts give rough set theory an advantage over
other classical methods because it does not need any preliminary or
additional information about the data. By contrast, probability in
statistics, and grade of membership or the value of possibility in fuzzy
set theory, all require further information. The characteristics of
microarray data, namely small sample size and very large dimensionality,
create new challenges in obtaining such preliminary information.
In practice, discretization is a common preprocessing step before rough
set based mining of gene expression data; it transforms continuous
gene expression levels into categorical item sets [26,27]. If a particular
gene's expression level is higher than the discretization threshold,
the gene is considered expressed; otherwise it is considered
unexpressed. Clearly, a lot of information is lost in this transformation,
particularly given the noise that is inherent in microarray data [28].
Previous research has shown that handling uncertainty in such
applications by representing the data as intervals leads to accurate
learning algorithms [29,30].
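The information loss caused by threshold discretization can be illustrated with a minimal sketch. The function name `discretize`, the threshold of 0.5, and the expression levels are hypothetical values chosen for illustration; they are not taken from the data analyzed in this study.

```python
def discretize(levels, threshold):
    """Map continuous expression levels to 1 (expressed) or 0 (unexpressed)."""
    return [1 if level > threshold else 0 for level in levels]


# Hypothetical expression levels of one gene across five samples.
levels = [0.49, 0.51, 3.20, 0.50, 7.80]
binary = discretize(levels, threshold=0.5)
print(binary)  # [0, 1, 1, 0, 1]
```

Here 0.51 and 7.80 fall into the same category, while 0.49 and 0.51 are separated even though they may differ only by noise.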
In this study, we propose an interval-valued analysis method to
select discriminative genes and to use these genes to classify tissue
samples of microarray data. We first select a small subset of genes
based on the interval-valued rough set by considering the
preference-ordered domains of the gene expression data, and then classify
a test sample into a class by means of a similar degree.
To summarize the process:
• The interval-valued decision table of the microarray data is generated.
In the decision table, each row corresponds to a class of tissue samples,
and each column (condition attribute) corresponds to a gene's
expression value over all classes of samples. In this table, the
decision attribute is the average gene expression value of a class, and
each condition attribute is the interval bounded by the 1st quartile and
the 3rd quartile of the gene expression value within a class.
• In the gene selection step, our objective is to determine the reducts
that discern between objects belonging to different classes. A
reduct, in rough set theory, corresponds to a minimal subset of
discriminative genes. The steps of this algorithm are described
in Section 4.1.
• The tissue sample classification is based on the selected genes. The
proposed interval-valued classification method assigns a sample
to the class with the maximum similar degree.
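The first step above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `interval_decision_table`, the toy data, and the "inclusive" quartile convention of Python's `statistics.quantiles` are all assumptions, since the quartile method is not specified here. The sketch covers the condition attributes only and uses the class label as the row key.

```python
from statistics import quantiles


def interval_decision_table(expression, labels):
    """Return {class: [(q1, q3) for each gene]} from a samples-by-genes matrix."""
    table = {}
    for cls in sorted(set(labels)):
        # Keep only the samples (rows) that belong to this class.
        rows = [row for row, lab in zip(expression, labels) if lab == cls]
        intervals = []
        for gene_values in zip(*rows):  # iterate over gene columns
            # Assumed quartile convention: "inclusive" interpolation.
            q1, _, q3 = quantiles(gene_values, n=4, method="inclusive")
            intervals.append((q1, q3))
        table[cls] = intervals
    return table


# Hypothetical data: 10 samples (rows), 2 genes (columns), 2 classes.
expression = [
    [1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0],
    [6.0, 5.0], [7.0, 4.0], [8.0, 3.0], [9.0, 2.0], [10.0, 1.0],
]
labels = ["tumor"] * 5 + ["normal"] * 5

table = interval_decision_table(expression, labels)
print(table["tumor"])   # [(2.0, 4.0), (20.0, 40.0)]
print(table["normal"])  # [(7.0, 9.0), (2.0, 4.0)]
```

Each class becomes one row whose entries are intervals rather than single values, so no discretization threshold is needed.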
To facilitate our discussion, we first present the basic notions in
Section 2. Section 3 presents the discernibility approach to compute
reducts from the compared dominance relationships. In Section 4, we
describe our gene selection and tissue sample classification method. In
Section 5, we apply our approach to the analysis of real microarray
data. In that section, we also discuss RNA-sequencing data, the data
from next-generation sequencing technologies, and their analysis using
the proposed method. Finally, Section 6 summarizes our approach and
presents our conclusions.
2. Preliminaries
2.1. Microarray dataset
A microarray dataset is a gene expression matrix, in which each
column represents a gene and each row represents a sample (or
experiment) with a class label. Let $G = \{g_1, \cdots, g_n\}$ be a set
of genes and $U = \{s_1, \cdots, s_m\}$ be a set of samples. The
corresponding gene expression matrix can be represented as
$X = (x_{i,j})_{m \times n}$, where $x_{i,j}$ is the expression level of
gene $g_j$ in sample $s_i$, and usually $n \gg m$. Here $m$ is the
number of samples, and $n$ is the number of genes. The matrix $X$ is
composed of $m$ row vectors $s_i \in \mathbb{R}^n$, $i = 1, 2, \cdots, m$.
Each vector $s_i$ in the gene expression matrix may be regarded as a
point in $n$-dimensional space, and each of the $n$ columns consists of
an $m$-element expression vector for a single gene.
A microarray dataset can be regarded as a decision table
$S = \langle U, AT \cup d, V, f \rangle$, where $U$ denotes the set of
samples, $AT$ denotes the set of the condition attributes (genes), $d$
denotes the decision attribute (class label), $V$ is the domain of
$AT \cup d$, and $x_{i,j} = f(s_i, g_j)$.
2.2. Rough set
An information system is a 4-tuple $S = \langle U, A, V, f \rangle$,
where $U$ is a non-empty and finite set of objects, called the universe;
$A$ is a non-empty and finite set of attributes, such that
$\forall a \in A: U \to V_a$, where $V_a$ is the domain of attribute $a$;
$V$ is regarded as the domain of all attributes such that
$V = V_A = \bigcup_{a \in A} V_a$; and $f(x, a)$ is the value that $x$
holds on $a$ ($\forall x \in U, a \in A$).
A decision table is an information system
$S = \langle U, AT \cup d, V, f \rangle$, where $d \notin AT$. $d$ is a
complete attribute called a decision, and $AT$ is the condition
attribute set.
For an information system $S$, it is possible to describe relationships
between objects through their attribute values. With respect to a subset
of attributes $A \subseteq AT$, an indiscernibility relationship
$\mathrm{IND}(A)$ [31] may be defined as:
$$\mathrm{IND}(A) = \{(x, y) \in U^2 : \forall a \in A,\ f(x, a) = f(y, a)\}.$$
$\mathrm{IND}(A)$ is an equivalence relationship because it is reflexive,
symmetric and transitive. With the relationship $\mathrm{IND}(A)$, two
objects are considered to be indiscernible if, and only if, they have the
same value on each $a \in A$.
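As a small illustration of this definition, the sketch below groups objects into the equivalence classes of the indiscernibility relationship for a chosen attribute subset. The function name `ind_classes` and the toy table are hypothetical and serve only to show how the choice of attributes changes the partition.

```python
from collections import defaultdict


def ind_classes(table, attrs):
    """Return the equivalence classes of IND(attrs) as sorted lists of objects."""
    classes = defaultdict(list)
    for obj, row in table.items():
        # Objects with identical values on every attribute share one key.
        classes[tuple(row[a] for a in attrs)].append(obj)
    return sorted(sorted(block) for block in classes.values())


# Hypothetical information system: four samples described by two attributes.
table = {
    "s1": {"g1": 1, "g2": 0},
    "s2": {"g1": 1, "g2": 1},
    "s3": {"g1": 0, "g2": 1},
    "s4": {"g1": 1, "g2": 0},
}

print(ind_classes(table, ["g1"]))        # [['s1', 's2', 's4'], ['s3']]
print(ind_classes(table, ["g1", "g2"]))  # [['s1', 's4'], ['s2'], ['s3']]
```

With only `g1`, samples s1, s2 and s4 are indiscernible; adding `g2` separates s2 from the other two.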
Based on the indiscernibility relationship $\mathrm{IND}(A)$, it is
possible to derive the lower and upper approximations of an arbitrary
subset $X$ of $U$, which are defined as [31]:
$$\underline{A}(X) = \{x \in U : [x]_A \subseteq X\} \quad \text{and} \quad \overline{A}(X) = \{x \in U : [x]_A \cap X \neq \emptyset\}$$
respectively, where $[x]_A = \{y \in U : (x, y) \in \mathrm{IND}(A)\}$ is
the $A$-equivalence class containing $x$. The pair
$\langle \underline{A}(X), \overline{A}(X) \rangle$ is referred to as the
Pawlak rough set of $X$ with respect to the subset of attributes $A$.
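The two approximations can be computed directly from the equivalence classes. The following is a minimal sketch; the function name `approximations`, the toy table, and the target set are assumptions made for illustration.

```python
def approximations(table, attrs, target):
    """Return the (lower, upper) approximations of `target` under IND(attrs)."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # class entirely inside X
            lower |= block
        if block & target:    # class overlaps X
            upper |= block
    return lower, upper


# Hypothetical information system and target set X = {s1, s2}.
table = {
    "s1": {"g1": 1, "g2": 0},
    "s2": {"g1": 1, "g2": 1},
    "s3": {"g1": 0, "g2": 1},
    "s4": {"g1": 1, "g2": 0},
}
lower, upper = approximations(table, ["g1", "g2"], {"s1", "s2"})
print(sorted(lower))  # ['s2']
print(sorted(upper))  # ['s1', 's2', 's4']
```

Sample s1 is excluded from the lower approximation because it is indiscernible from s4, which lies outside the target set; both appear in the upper approximation.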
2.3. Inclusion degree
A partial order on a set $X$ is a binary relationship $\preceq$ with the
following properties: $x \preceq x$ (reflexive), $x \preceq y$ and
$y \preceq x$ imply $x = y$ (antisymmetric), $x \preceq y$ and
$y \preceq z$ imply $x \preceq z$ (transitive).
Definition 1. [32,33] Let $(X, \preceq)$ be a partially ordered set. If
for any $x, y \in X$, there is a real number $I(y/x)$ with the following
properties: (1) $0 \le I(y/x) \le 1$; (2) $x \preceq y$ implies
$I(y/x) = 1$; (3) $x \preceq y \preceq z$ implies $I(x/z) \le I(x/y)$;
then $I$ is called an inclusion degree on $X$.
For an information system $S$ with universe $U$, the collection of all
normal fuzzy subsets of $U$ is denoted by $F_0(U)$. Let
$F_1, F_2 \in F_0(U)$; if $\mu_{F_1}(x) \le \mu_{F_2}(x)$ for all
$x \in U$, then $F_1 \subseteq F_2$. It is well known that
$(F_0(U), \subseteq)$ is a partially ordered set.
Definition 2. [34] Suppose that $(F_0(U), \subseteq)$ is a partially
ordered set, then $I$ is an inclusion degree on $F_0(U)$, if the
following conditions hold: (1) $0 \le I(F_2/F_1) \le 1$;
(2) $F_1 \subseteq F_2 \Rightarrow I(F_2/F_1) = 1$;
(3) $F_1 \subseteq F_2 \subseteq F_3 \Rightarrow I(F_1/F_3) \le I(F_1/F_2)$,
where $F_1, F_2, F_3 \in F_0(U)$.
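To make the three conditions concrete, the sketch below checks them for one common choice of inclusion degree on fuzzy sets, the ratio of the summed pointwise minimum to the summed membership of the included set. This particular formula, the function name `inclusion_degree`, and the membership values are assumptions for illustration; the definition above does not fix any specific formula.

```python
def inclusion_degree(f2, f1):
    """I(F2/F1): degree to which fuzzy set f1 is included in fuzzy set f2.

    Assumed formula: sum(min(mu_F1, mu_F2)) / sum(mu_F1). The denominator is
    positive because a normal fuzzy set attains membership 1 somewhere.
    """
    return sum(min(a, b) for a, b in zip(f1, f2)) / sum(f1)


# Hypothetical membership vectors over a three-object universe,
# chosen so that F1 is included in F2, and F2 in F3.
F1 = [0.25, 0.5, 1.0]
F2 = [0.5, 0.5, 1.0]
F3 = [1.0, 0.75, 1.0]

print(inclusion_degree(F2, F1))  # 1.0, condition (2)
print(inclusion_degree(F1, F2))  # 0.875
# Condition (3): I(F1/F3) <= I(F1/F2)
print(inclusion_degree(F1, F3) <= inclusion_degree(F1, F2))  # True
```

The degree equals 1 whenever the first set is contained in the second, and it shrinks as the containing set grows relative to the contained one.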
Y. Qi, X. Yang / Genomics 101 (2013) 38–48