Data Mining in the Bioinformatics Domain
Shalom Tsur
SurroMed, Inc.
Palo Alto, CA, USA
tsur@surromed.com
Abstract
Bioinformatics, the study and application of
computational methods to life sciences data,
is presently enjoying a surge of interest. The
main reason for this welcome publicity is the
nearing completion of the sequencing of the
human genome and the anticipation that the
knowledge derived from this process will have
a great impact on modern medicine. The
pharmaceutical industry, which expects to utilize
the knowledge for new drug design, has a
particular interest in bioinformatics.
The structure of data in this domain has its
own characteristics which set it apart from
data in other domains. While genomic data
have a well-known representation as sequences
taken from the {A,C,G,T} alphabet, there
is no clear model for data representing the
expression products of genes: proteins and
higher forms of organisms, e.g., cells and the
multitude of forms they assume in response
to environmental challenges.
Data collected at these levels of information
can often be thought of as "broad", meaning
that for a relatively small number of records
representing biological samples, a very large
number of attributes, representing measurements
or observations, is collected per sample.
In contrast, typical data used for mining are
"long", i.e., they consist of a large number of records
in which each record is characterized by a
relatively small number of attributes.
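The distinction between "broad" and "long" data can be illustrated with a minimal sketch. The table sizes below (40 samples with 5000 attributes, 10000 records with 12 attributes) and the helper name shape_kind are hypothetical, chosen only for illustration, and the simple comparison of the two dimensions is an assumed rule of thumb rather than a definition given in the text.

```python
def shape_kind(n_records, n_attributes):
    """Label a table 'broad' when it has more attributes than records,
    and 'long' otherwise (an assumed rule of thumb, for illustration)."""
    return "broad" if n_attributes > n_records else "long"


# Hypothetical broad table: few biological samples, many measurements each.
broad_table = [[0.0] * 5000 for _ in range(40)]

# Hypothetical long table: many records, few attributes each.
long_table = [[0.0] * 12 for _ in range(10000)]

for name, table in (("broad_table", broad_table), ("long_table", long_table)):
    n_records, n_attributes = len(table), len(table[0])
    print(name, n_records, "x", n_attributes, "->", shape_kind(n_records, n_attributes))
```

In the broad case the number of attributes far exceeds the number of records, which is the situation the text describes for data collected from biological samples.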
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for
direct commercial advantage, the VLDB copyright notice and
the title of the publication and its date appear, and notice is
given that copying is by permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires a fee
and/or special permission from the Endowment.
Proceedings of the 26th VLDB Conference,
Cairo, Egypt, 2000.
Mining broad data presents a new and unique
challenge. The presentation will elaborate on
some of the issues in this domain.
1 Introduction
In the biological enterprise, biological samples, e.g.,
blood, are collected from donors or study subjects and
are subjected to an array of different measurements.
These measurements can be quantitative, to determine
the purity or concentration of some substance such
as a protein in the sample, or qualitative, to
merely detect the presence of some substance.
Measurements of the former type are referred to as assays.
The process and conditions under which these samples
are processed, the timing, and the characterization of
the participating subjects are specified in a study or
clinical protocol.
Biology draws a distinction between the genotype
and the phenotype of an organism. The genotype is
determined by its genetic makeup and is invariant over the
organism's life. The phenotype, on the other hand, is
determined by a set of observable characteristics of the
organism that, in turn, are determined by its genotype
and by the environment. Thus, a certain protein is the
expressed product of a gene. The measured concentration
of this protein in the blood may be the result of a
disease burden, taking a certain drug, a diet, exposure,
etc. The phenotype is thus a set of time-varying
quantities. Tracing a phenotype over time may provide a
longitudinal record of, e.g., the evolution of a disease
and the response to a therapeutic intervention. By
analogy, the code making up a software system
(assuming we do not change it) would be its genotype.
The dynamic execution behavior of the system, which
is dependent on the code, the operating system, the
input data, and the user interaction with it, would be
its phenotype. It is worth noting that portions of this
code may never be executed and hence will not
contribute to the dynamic behavior. Likewise, the
biological genome contains large portions of DNA that
are considered "junk" and seemingly do not serve any
purpose.
From the clinical perspective, subjects interact with