curve fitting to machine learning. Theory-loaded and model-driven research areas
like physical chemistry or biophysics often prefer situation 2: A scientific quantity
of interest is studied as a function of another quantity, where the structural form
of a model function f that describes the desired dependency is known but the
values of its parameters are not. In general the parameters may be purely empirical or may
have a theoretically well-defined meaning. An example of the latter is commonly
encountered in chemical kinetics, where phenomenological rate equations are used to
describe the temporal progress of the chemical reactions but the values of the rate
constants, the crucial information, are unknown and cannot be calculated by
a more fundamental theoretical treatment [Grant 1998]. In this case experimental
measurements are indispensable. They lead to xy-error data triples (x_i, y_i, σ_i) with an
argument value x_i, the corresponding dependent value y_i and the statistical error
σ_i of the y_i value (compare below). Then optimum estimates of the unknown parameter
values can be statistically deduced on the basis of these data triples by curve
fitting methods. In practice a successful model function may at first be only empirically
constructed, like the quantitative description of the temperature dependence of
a liquid’s viscosity (illustrated in chapter 2), and then later be motivated by more
theoretical lines of argument. Or curve fitting is used to validate the value of a specific
theoretical model parameter by experiment (like the critical exponents in chapter 2).
Last but not least curve fitting may play a purely supporting role: The energy values of
the potential energy surface of hydrogen fluoride could be directly calculated by a
quantum-chemical ab-initio method for every distance between the two atoms. But
a restriction to a limited number of distinct calculated values that span the range of
interest in combination with the construction of a suitable smoothing function for
interpolation (shown in chapter 2) may save considerable time and enhance practical
usability without any relevant loss of precision.
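The curve fitting situation described above can be made concrete with a minimal sketch. It assumes, purely for illustration, a first-order rate law c(t) = c0 * exp(-k*t); the text itself only speaks of phenomenological rate equations in general. The data triples (t_i, c_i, σ_i) are synthetic and noise-free, and all function and variable names are hypothetical. The fit linearises the model to ln c = ln c0 - k*t and minimises the weighted sum of squared deviations in closed form, which approximates a full nonlinear fit when the errors are small.

```python
import math

def weighted_line_fit(xs, ys, sigmas):
    """Fit y = a + b*x by minimising chi^2 = sum(((y_i - a - b*x_i) / sigma_i)**2)."""
    w = [1.0 / s ** 2 for s in sigmas]          # statistical weights 1/sigma_i^2
    s0 = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    delta = s0 * sxx - sx ** 2
    a = (sxx * sy - sx * sxy) / delta           # intercept
    b = (s0 * sxy - sx * sy) / delta            # slope
    return a, b, math.sqrt(sxx / delta), math.sqrt(s0 / delta)

def fit_first_order_decay(ts, cs, sigmas):
    """Estimate c0 and k of c(t) = c0*exp(-k*t) from triples (t_i, c_i, sigma_i)."""
    ln_c = [math.log(c) for c in cs]
    # Error propagation for the logarithm: sigma(ln c) is approximately sigma / c.
    ln_sigma = [s / c for s, c in zip(sigmas, cs)]
    a, b, _, b_err = weighted_line_fit(ts, ln_c, ln_sigma)
    return {"c0": math.exp(a), "k": -b, "k_error": b_err}

# Synthetic, noise-free illustration data (assumed values, not measurements).
true_c0, true_k = 2.0, 0.5
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
concs = [true_c0 * math.exp(-true_k * t) for t in times]
errors = [0.05] * len(times)

result = fit_first_order_decay(times, concs, errors)
print(result)
```

Because the synthetic data contain no noise, the estimated rate constant reproduces the assumed value of k. With real measurements the estimates scatter around the true values, and k_error gives a first impression of how well the data determine the rate constant.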
With increasing complexity of the natural system under investigation a quantitative
theoretical treatment becomes more and more difficult. As already mentioned,
a quantitative theory-based prediction of a biological effect of a new molecular
entity or of the properties of a new material’s composition is in general beyond the scope
of current science. Thus situation 3 takes over, where a model function f is simply
unknown or too complex. To still achieve at least an approximate quantitative
description of the relationships in question, one may try to construct a model function
solely from the available data - a task that is at the heart of machine learning.
Especially quantitative relationships between chemical structures and their biological
activities or physico-chemical and material properties draw a lot of attention:
Thus QSAR (Quantitative Structure Activity Relationship) and QSPR (Quantitative
Structure Property Relationship) studies are active fields of research in the life,
materials and nano sciences (see [Zupan 1999], [Gasteiger 2003], [Leach 2007] or
[Schneider 2008]). Cheminformatics and structural bioinformatics provide many
ways to represent a chemical structure in the form of a list of numbers (which
mathematically form a vector or an input in terms of machine learning, see below).
Each number or sequence of numbers is a specific structural descriptor that describes
a specific feature of a chemical structure in question, e.g. its molecular weight, its
topological connections and branches or electronic properties like its dipole mo-