Introduction to Scientific Data Analysis and Computational Intelligence: From Curve Fitting to Machine Learning (2nd Edition)

Updated 2024-07-19 · PDF, 35.62 MB

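From Curve Fitting to Machine Learning (2nd Edition) is a detailed guide to scientific data analysis and computational intelligence, 509 pages long and provided here as a high-quality original PDF. The book belongs to the Intelligent Systems Reference Library series, edited by Janusz Kacprzyk (Polish Academy of Sciences) and Lakhmi C. Jain (University of Canberra, Australia, and Bournemouth University, UK). The series aims to publish a comprehensive set of reference works on recent advances and developments in intelligent systems, including handbooks, encyclopedias and textbooks, covering theory, applications and design methods across engineering, computer science, avionics, business, e-commerce and other fields.

In this book the author, Achim Zielesny, leads the reader step by step from basic curve fitting techniques to machine learning. Curve fitting is a basic step in data analysis: a mathematical model describes the relation between data so that unknown data points can be predicted and understood. This part may include polynomial fitting, regression analysis and nonlinear fitting methods, on which machine learning algorithms build.

According to this description, the machine learning chapters cover core concepts such as supervised, unsupervised, semi-supervised and reinforcement learning, ranging from simple linear and logistic regression to neural networks and deep learning, and show how statistical learning theory is used to train models so that a computer learns from data and improves automatically. Feature selection and model evaluation and optimization are also said to be covered. The description names Python's scikit-learn and TensorFlow as typical machine learning tools, but the excerpt below works with Mathematica and the CIP packages. The book may further discuss ensemble learning, which combines the strengths of several models to improve predictive performance, and transfer learning, which carries experience from one problem over to another, as well as the challenges and solutions of distributed machine learning and cloud platforms in big-data settings.

The book is both a reference for researchers and engineers and a practical tutorial for machine learning beginners. It offers a clear path to the core techniques of data analysis and intelligent computing and, in both its theoretical explanations and its practical cases, aims to present complex computational intelligence techniques in an accessible way.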

1 Introduction

chapter ends with a note on the reproducibility of calculations reported throughout the book (section 1.9).

1.1 Motivation: Data, models and molecular sciences

Essentially, all models are wrong, but some are useful.

G.E.P. Box

Science is an endeavor to understand and describe the real world out there to (at best) alleviate and enrich human existence. But the structures and dynamics of the real world are very intricate and complex. A humble chemical reaction in the laboratory may already involve perhaps 10^20 molecules surrounded by 10^24 solvent molecules, in contact with a glass surface and interacting with gases ... in the atmosphere. The whole system will be exposed to a flux of photons of different frequency (light) and a magnetic field (from the earth), and possibly also a temperature gradient from external heating. The dynamics of all the particles (nuclei and electrons) is determined by relativistic quantum mechanics, and the interaction between particles is governed by quantum electrodynamics. In principle the gravitational and strong (nuclear) forces should also be considered. For chemical reactions in biological systems, the number of different chemical components will be large, involving various ions and assemblies of molecules behaving intermediately between solution and solid state (e.g. lipids in cell walls) [Jensen 2007]. Thus, to describe nature, there is the inevitable necessity to set up limitations and approximations in the form of simplifying and idealized models - based on the known laws of nature. Adequate models neglect almost everything (i.e. they are, strictly speaking, wrong) but they may keep some of those essential real-world features that are of specific interest (i.e. they may be useful).

The dialectical interplay of experiment and theory is a key driving force of modern science. Experimental data only have meaning in the light of a particular model or at least a theoretical background. Conversely, theoretical considerations may be logically consistent as well as intellectually elegant: Without experimental evidence they are a mere exercise of thought, no matter how difficult they are. Data analysis is a connector between experiment and theory: Its techniques suggest possibilities for model extraction as well as model testing with experimental data.

Model functions have several practical advantages in comparison to mere enumerated data: They are a comprehensive representation of the relation between the quantities of interest which may be stored in a database in a very compact manner with minimum memory consumption. A good model allows interpolating or extrapolating calculations to generate new data and thus may support (or even replace) expensive lab work. Last but not least a suitable model may be heuristically used to explore interesting optimum properties (i.e. minima or maxima of the model function) which could otherwise be missed. Within a market economy a good model is simply a competitive advantage.

The ultimate goal of all sciences is to arrive at quantitative models that describe nature with sufficient accuracy - or, to put it briefly, to calculate nature. These calculations have the general form

answer = f(question) or output = f(input)

where input denotes a question and output the corresponding answer generated by a model function f. Unfortunately the number of interesting quantities which can be directly calculated by application of theoretical ab-initio techniques solely based on the known laws of nature is rather limited (although expanding). For the overwhelming number of questions about nature the model functions f are unknown or too difficult to be evaluated. This is the daily trouble of chemists, materials scientists, engineers or biologists who want to ask questions about, for example, the biological effect of a new molecular entity or the properties of a new material's composition. So in current science there are three situations that may be sensibly distinguished according to our knowledge of nature:

• Situation 1: The model function f is theoretically or empirically known. Then the output quantity of interest may be calculated directly.

• Situation 2: The structural form of the function f is known but not the values of its parameters. Then these parameter values may be statistically estimated on the basis of experimental data by curve fitting methods.

• Situation 3: Even the structural form of the function f is unknown. As an approximation the function f may be modelled by a machine learning technique on the basis of experimental data.

A simple example for situation 2 is the case that the relation between input and output is known to be linear. If there is only one input variable of interest, denoted x, and one output variable of interest, denoted y, the structural form of the function f is a straight line

y = f(x) = a_1 + a_2 x

where a_1 and a_2 are the unknown parameters of the function which may be statistically estimated by curve fitting of experimental data. In situation 3 it is not only the values of the parameters that are unknown but in addition the structural form of the model function f itself. This is obviously the worst possible case, which is addressed by data smoothing or machine learning approaches that try to construct a model function with experimental data only.
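The straight-line case of situation 2 can be sketched in a few lines. The following is a minimal illustration in Python, not the book's Mathematica/CIP code: it estimates the intercept and slope of a line by ordinary least squares. The function name `fit_line`, the return order (intercept, slope) and the sample data are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates (intercept, slope) for a straight line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and the x-y cross term
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical measurements scattered around y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
intercept, slope = fit_line(xs, ys)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The structural form (a straight line) is fixed in advance and only the two parameter values are taken from the data, which is exactly what distinguishes situation 2 from situation 3.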

Situations 1 to 3 are widely encountered by the contemporary molecular sciences. Since the scientific revolution of the early 20th century the molecular sciences have a thorough theoretical basis in modern physics: Quantum theory is able to (at least in principle) quantitatively explain and calculate the structure, stability and reactivity of matter. It provides a fundamental understanding of chemical bonding and molecular interactions. This foundational feat was summarized in 1929 by Paul A. M. Dirac with famous words: "The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known ..." - it became possible to submit molecular research and development (R&D) problems to a theoretical framework to achieve correct and satisfactory solutions - but unfortunately Dirac had to continue "... and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble." The humble "only" means a severe practical restriction: It is in fact only the smallest quantum-mechanical systems like the hydrogen atom with one single proton in the nucleus and one single electron in the surrounding shell that can be treated by pure analytical means to come to an exact mathematical solution, i.e. by solving the Schroedinger equation of this mechanical system with pencil and paper. Nonetheless Dirac added an optimistic prospect: "It therefore becomes desirable that approximate practical methods of applying quantum mechanics should be developed, which can lead to an explanation of the main features of complex atomic systems without too much computation" [Dirac 1929]. A few decades later this hope began to turn into reality with the emergence of digital computers and their exponentially increasing computational speed: Iterative methods were developed that allowed an approximate quantum-mechanical treatment of molecules and molecular ensembles of growing size (see [Leach 2001], [Frenkel 2002] or [Jensen 2007]). The methods which are ab-initio approximations to the true solution of the Schroedinger equation (i.e. they only use the experimental values of natural constants) are still very limited in applicability, so they are restricted to chemical ensembles with just a few hundred atoms to stay within tolerable calculation periods. If these methods are combined with experimental data in a suitable manner so that they become semi-empirical, the range of applicability can be extended to molecular systems with several thousands of atoms (up to more than a hundred thousand atoms at the time of writing of this book [Clark 2010/2015]). The size of the molecular systems and the time frames for their simulation can be expanded even further by orders of magnitude with mechanical force fields that are constructed to mimic the quantum-mechanical molecular interactions so that an atomistic description of matter exceeds the million-atoms threshold. In 1998 and 2013 the Royal Swedish Academy of Sciences honored these scientific achievements by awarding the Nobel prize in chemistry with the prudent comment in 1998 that "Chemistry is no longer a purely experimental science" (see [Nobel Prize 1998/2013]). This atomistic theory-based treatment of molecular R&D problems corresponds to situation 1 where a theoretical technique provides a model function f to "simply calculate" the desired solution in a direct manner.

Despite these impressive improvements (and more is to come) the overwhelming majority of molecular R&D problems is (and will be) out of scope of these atomistic computational methods due to their complexity in space and time. This is especially true for the life and the nano sciences that deal with the most complex natural and artificial systems known today - with the human brain at the top. Thus the molecular sciences are mainly faced with situations 2 and 3: They are a predominant area of application of the methods to be discussed on the road from curve fitting to machine learning. Theory-loaded and model-driven research areas like physical chemistry or biophysics often prefer situation 2: A scientific quantity of interest is studied in dependence on another quantity where the structural form of a model function f that describes the desired dependency is known but not the values of its parameters. In general the parameters may be purely empirical or may have a theoretically well-defined meaning. An example of the latter is usually encountered in chemical kinetics where phenomenological rate equations are used to describe the temporal progress of the chemical reactions but the values of the rate constants - the crucial information - are unknown and cannot be calculated by a more fundamental theoretical treatment [Grant 1998]. In this case experimental measurements are indispensable that lead to xy-error data triples (x_i, y_i, σ_i) with an argument value x_i, the corresponding dependent value y_i and the statistical error σ_i of the y_i value (compare below). Then optimum estimates of the unknown parameter values can be statistically deduced on the basis of these data triples by curve fitting methods. In practice a successful model function may at first be only empirically constructed like the quantitative description of the temperature dependence of a liquid's viscosity (illustrated in chapter 2) and then later be motivated by more theoretical lines of argument. Or curve fitting is used to validate the value of a specific theoretical model parameter by experiment (like the critical exponents in chapter 2). Last but not least curve fitting may play a pure support role: The energy values of the potential energy surface of hydrogen fluoride could be directly calculated by a quantum-chemical ab-initio method for every distance between the two atoms. But a restriction to a limited number of distinct calculated values that span the range of interest in combination with the construction of a suitable smoothing function for interpolation (shown in chapter 2) may save considerable time and enhance practical usability without any relevant loss of precision.
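How data triples of argument value, dependent value and statistical error enter a fit can be illustrated with a small Python sketch. It assumes a first-order rate law c(t) = c0 * exp(-k t), which the text does not specify, and it swaps in a simplification: the model is linearized as ln c = ln c0 - k t and fitted by weighted linear least squares with propagated errors sigma / c, rather than by a direct nonlinear fit. All function names and data are hypothetical.

```python
import math


def weighted_line_fit(xs, ys, sigmas):
    """Minimize chi^2 = sum(((y - a1 - a2*x) / sigma)^2) and return (a1, a2)."""
    ws = [1.0 / s ** 2 for s in sigmas]  # weights from the statistical errors
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = sw * swxx - swx ** 2
    if det == 0.0:
        raise ValueError("x values do not determine a line")
    a2 = (sw * swxy - swx * swy) / det
    a1 = (swxx * swy - swx * swxy) / det
    return a1, a2


def fit_first_order_rate(triples):
    """Estimate (c0, k) of c(t) = c0 * exp(-k t) from (t, c, sigma) triples.

    Assumption: the fit uses ln c = ln c0 - k t with errors sigma / c.
    """
    ts = [t for t, _, _ in triples]
    ln_cs = [math.log(c) for _, c, _ in triples]
    ln_sigmas = [s / c for _, c, s in triples]
    a1, a2 = weighted_line_fit(ts, ln_cs, ln_sigmas)
    return math.exp(a1), -a2


# Hypothetical noise-free triples generated with c0 = 2.0 and k = 0.5
triples = [(t, 2.0 * math.exp(-0.5 * t), 0.05) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
c0, k = fit_first_order_rate(triples)
print(f"c0 = {c0:.4f}, k = {k:.4f}")
```

Points with a small statistical error get a large weight, so precise measurements dominate the estimate of the rate constant. The linearization is convenient but distorts the error structure for noisy data, which is one reason a direct nonlinear fit is often preferred in practice.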

With increasing complexity of the natural system under investigation a quantitative theoretical treatment becomes more and more difficult. As already mentioned, a quantitative theory-based prediction of a biological effect of a new molecular entity or the properties of a new material's composition is in general out of scope of current science. Thus situation 3 takes over where a model function f is simply unknown or too complex. To still achieve at least an approximate quantitative description of the relationships in question one may try to construct a model function from the available data only - a task that is at the heart of machine learning. Especially quantitative relationships between chemical structures and their biological activities or physico-chemical and material properties draw a lot of attention: Thus QSAR (Quantitative Structure Activity Relationship) and QSPR (Quantitative Structure Property Relationship) studies are active fields of research in the life, materials and nano sciences (see [Zupan 1999], [Gasteiger 2003], [Leach 2007] or [Schneider 2008]). Cheminformatics and structural bioinformatics provide many possibilities to represent a chemical structure in the form of a list of numbers (which mathematically form a vector or an input in terms of machine learning, see below). Each number or sequence of numbers is a specific structural descriptor that describes a specific feature of a chemical structure in question, e.g. its molecular weight, its topological connections and branches or electronic properties like its dipole moments or its correlation of surface charges. These structure-representing inputs alone may be analyzed by clustering methods (discussed in chapter 3) for their chemical diversity. The results may be used to generate a reduced but representative subset of structures with a similar chemical diversity in comparison to the original larger set (e.g. to be used in combinatorial chemistry approaches for a targeted structure library design). Alternatively different sets of structures could be compared in terms of their similarity or dissimilarity as well as their mutual white spots (these topics are discussed in chapter 3). A structural descriptor based QSAR/QSPR approach takes the form

activity/property = f(descriptor1, descriptor2, descriptor3, ...)

with the model function f as the final target to become able to make model-based predictions (the methods used for the construction of an approximate model function f are outlined in chapter 4). The extensive volume of data that is necessary for this line of research is often obtained by modern high-throughput (HT) techniques like the biological assay-based high-throughput screening (HTS) of thousands of chemical compounds in the pharmaceutical industry or HT approaches in materials science, all performed with automated robotic lab systems. Among others these HT methods lead to the so-called BioTech data explosion that may be thoroughly exploited for model construction. In fact HT experiments and model construction via machine learning are mutually dependent on each other: Models need data for their creation, just as the mere heaps of data produced by HT methods need models for their comprehension.
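The idea of constructing a model function from descriptor vectors and measured outputs alone can be sketched as follows. This Python example uses k-nearest-neighbour averaging purely as a simple stand-in for a data-driven model; it is not claimed to be one of the methods of chapter 4. The descriptor values, the activities and the function name `knn_predict` are hypothetical, and the descriptors are assumed to be scaled to comparable ranges beforehand.

```python
import math


def knn_predict(train_inputs, train_outputs, query, k=3):
    """Predict an output as the mean output of the k nearest training inputs."""
    if not 1 <= k <= len(train_inputs):
        raise ValueError("k must be between 1 and the number of training items")
    # Pair each Euclidean distance to the query with the known output, then sort
    distances = sorted(
        (math.dist(vec, query), out)
        for vec, out in zip(train_inputs, train_outputs)
    )
    nearest = distances[:k]
    return sum(out for _, out in nearest) / k


# Hypothetical structures, each described by three scaled descriptors
train_inputs = [
    (0.1, 0.2, 0.0),
    (0.2, 0.1, 0.1),
    (0.9, 0.8, 1.0),
    (0.8, 0.9, 0.9),
    (0.5, 0.5, 0.5),
]
# Hypothetical measured activities for these structures
train_outputs = [1.0, 1.2, 5.0, 5.4, 3.0]

prediction = knn_predict(train_inputs, train_outputs, (0.85, 0.85, 0.95), k=2)
print(f"predicted activity = {prediction:.2f}")
```

No structural form of f is assumed here: the prediction is built only from the stored data, so its quality depends entirely on how well the available structures cover the region around the query.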

With these few statements about the needs of the molecular sciences in mind the motivation of this book is to show how situations 2 (model function f known, its parameters unknown) and 3 (model function f itself unknown) may be tackled on the road from curve fitting to machine learning: How can we proceed from experimental data to models? What conceptual and technical problems occur along this path? What new insights can we expect?

1.2 Optimization

Clear["Global`*"];
<<CIP`Graphics`

At the beginning of each section or subsection the global Clear command clears all earlier variables and definitions and thus ensures a proper initialization. Then the necessary CIP packages are loaded, e.g. the Graphics package for this section. A proper initialization prevents possible code interferences due to earlier definitions. Note that Mathematica has a top-down programming style: Once a variable is assigned it keeps its value.
