September 25, 2011
INTRODUCTION TO COMPUTATIONAL BIOLOGY 7
and experiments are still under way to refine the draft genome further.
In spite of its enormous length, it appears that there is a great deal of
redundancy in the human genome. The overlap between the human genome
and the mouse genome is about 80%, whereas that between the human genome
and the chimpanzee genome (the chimpanzee being our nearest neighbour in
the animal kingdom) is about 98%. Between two humans the overlap is still
more striking. It is estimated that the genome sequences of two humans
will agree in about 99.9% of the locations, and differ in only about 0.1%,
or about 3 million base pairs. All that distinguishes one human from
another, be it height, weight, colour of eyes, colour of hair, etc., can
presumably be attributed to this tiny variation in
the genome (aside from environmental factors of course). These variations
from the ‘consensus’ human genome (see below) are called Single Nucleotide
Polymorphisms (SNPs), often pronounced as ‘snips’. Even ‘identical’ twins
will not have identical genomes; rather, the overlap in such a case will be
about 99.99%, as opposed to 99.9% in the case of two unrelated humans. As
stated earlier, a 100% replica of a genome is known as a ‘clone’. The
cloning of life forms is both a fascinating and a controversial subject.
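
The percentages quoted above translate into base-pair counts by simple
arithmetic. The following Python sketch is illustrative only. It assumes a
genome length of roughly 3 billion base pairs (implied by the statement that
0.1% corresponds to about 3 million base pairs), and it treats ‘overlap’ as
position-by-position agreement between two aligned sequences of equal length,
which is a simplification. The function names are hypothetical.

```python
def percent_agreement(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two aligned sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def expected_differences(genome_length: int, agreement_percent: float) -> int:
    """Approximate number of differing positions for a given agreement level."""
    return round(genome_length * (100.0 - agreement_percent) / 100.0)


# Assumed genome length: about 3 billion base pairs (not an exact figure).
GENOME_LENGTH = 3_000_000_000

print(expected_differences(GENOME_LENGTH, 99.9))   # two unrelated humans: 3000000
print(expected_differences(GENOME_LENGTH, 99.99))  # 'identical' twins: 300000
```

Under these assumptions, the 99.99% figure quoted for ‘identical’ twins would
correspond to roughly 300,000 differing locations, a tenth of the number for
two unrelated humans.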
It is widely accepted that there is indeed a ‘consensus’ human genome.
In other words, it is believed that at any given location, an overwhelming
majority of humans will have just one of the four nucleotides. Moreover,
in case there is a deviation from the consensus genome, even though there
are three variations possible in theory, in reality only one variation seems to
occur in an overwhelming majority of cases. It is not the case, for example,
that at a particular location about half of the population has a T,
another half has a G, while A or C occurs in a tiny fraction of the
population. It is also not the case that virtually all humans have, say, a T at a
particular location, whereas the symbols A, C, G occur among the remaining
small minority with roughly equal frequency. If there is a variation at all,
only one of the remaining three symbols will occur in almost all the rest of
humanity. By contrast, there is no ‘consensus blood group’: while
type O blood is the most common, the percentage of the
population that has other blood types is still significant. The draft human
genome published in February 2001 by Celera is actually the (approximate)
sequence of the DNA of no fewer than six different individuals, not that of
just one person. Since the estimated error in the published draft (2% or so)
is considerably more than the variations amongst individuals (0.1% or so,
as mentioned above), this mixing up of DNA from different individuals did
not matter. The above remarks about the existence of a consensus genome
apply also to other organisms.
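
The notion of a consensus genome, and of an individual's variations from it,
can be illustrated with a small Python sketch. This is a toy example under
stated assumptions, not a description of how any sequencing project computes
a consensus: the sequences are assumed to be already aligned and of equal
length, the five six-letter ‘genomes’ are invented for illustration, and the
function names are hypothetical. Positions are counted from zero.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Most common nucleotide at each position across aligned sequences."""
    lengths = {len(seq) for seq in aligned_sequences}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    # zip(*...) walks the sequences column by column, one location at a time.
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_sequences)
    )


def variations(individual, reference):
    """List (position, consensus symbol, individual's symbol) where they differ."""
    if len(individual) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, individual))
        if ref != alt
    ]


# Invented toy population: four individuals agree, one differs at position 2.
population = ["ACGTAC", "ACGTAC", "ACGTAC", "ACTTAC", "ACGTAC"]

reference = consensus(population)
print(reference)                           # ACGTAC
print(variations("ACTTAC", reference))     # [(2, 'G', 'T')]
```

In this sketch an individual is described by a short list of deviations
rather than by a full sequence, which is the sense in which the consensus
genome serves as a common reference.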
As technology improves, we can aspire to a situation whereby it will be
both quick and inexpensive to determine the genome of every human on the
planet, or at least a large number of them. As stated above, it would be
wasteful to capture the DNA of a specific individual and sequence it from
scratch. It would be more efficient to (i) determine the consensus genome very
accurately, and (ii) determine the SNPs of an individual, that is, the variations from
the consensus genome. It appears reasonable that Step (ii) above should be,