Automated Machine Learning Methods and Challenges: Competition Highlights Series


4 M. Feurer and F. Hutter

• improve the performance of machine learning algorithms (by tailoring them to the problem at hand); this has led to new state-of-the-art performances for important machine learning benchmarks in several studies (e.g., [105, 140]).

• improve the reproducibility and fairness of scientific studies. Automated HPO is clearly more reproducible than manual search. It facilitates fair comparisons since different methods can only be compared fairly if they all receive the same level of tuning for the problem at hand [14, 133].

The problem of HPO has a long history, dating back to the 1990s (e.g., [77, 82, 107, 126]), and it was also established early that different hyperparameter configurations tend to work best for different datasets [82]. In contrast, it is a rather new insight that HPO can be used to adapt general-purpose pipelines to specific application domains [30]. Nowadays, it is also widely acknowledged that tuned hyperparameters improve over the default setting provided by common machine learning libraries [100, 116, 130, 149].

Because of the increased usage of machine learning in companies, HPO is also of substantial commercial interest and plays an ever larger role there, be it in company-internal tools [45], as part of machine learning cloud services [6, 89], or as a service by itself [137].

HPO faces several challenges which make it a hard problem in practice:

• Function evaluations can be extremely expensive for large models (e.g., in deep learning), complex machine learning pipelines, or large datasets.

• The configuration space is often complex (comprising a mix of continuous, categorical and conditional hyperparameters) and high-dimensional. Furthermore, it is not always clear which of an algorithm’s hyperparameters need to be optimized, and in which ranges.

• We usually don’t have access to a gradient of the loss function with respect to the hyperparameters. Furthermore, other properties of the target function often used in classical optimization do not typically apply, such as convexity and smoothness.

• One cannot directly optimize for generalization performance as training datasets are of limited size.

We refer the interested reader to other reviews of HPO for further discussions on

this topic [64, 94].

This chapter is structured as follows. First, we define the HPO problem formally and discuss its variants (Sect. 1.2). Then, we discuss blackbox optimization algorithms for solving HPO (Sect. 1.3). Next, we focus on modern multi-fidelity methods that enable the use of HPO even for very expensive models, by exploiting approximate performance measures that are cheaper than full model evaluations (Sect. 1.4). We then provide an overview of the most important hyperparameter optimization systems and applications to AutoML (Sect. 1.5) and end the chapter with a discussion of open problems (Sect. 1.6).

1 Hyperparameter Optimization 5

1.2 Problem Statement

Let $A$ denote a machine learning algorithm with $N$ hyperparameters. We denote the domain of the $n$-th hyperparameter by $\Lambda_n$ and the overall hyperparameter configuration space as $\Lambda = \Lambda_1 \times \Lambda_2 \times \dots \times \Lambda_N$. A vector of hyperparameters is denoted by $\lambda \in \Lambda$, and $A$ with its hyperparameters instantiated to $\lambda$ is denoted by $A_\lambda$.

The domain of a hyperparameter can be real-valued (e.g., learning rate), integer-valued (e.g., number of layers), binary (e.g., whether to use early stopping or not), or categorical (e.g., choice of optimizer). For integer and real-valued hyperparameters, the domains are mostly bounded for practical reasons, with only a few exceptions [12, 113, 136].

Furthermore, the configuration space can contain conditionality, i.e., a hyperparameter may only be relevant if another hyperparameter (or some combination of hyperparameters) takes on a certain value. Conditional spaces take the form of directed acyclic graphs. Such conditional spaces occur, e.g., in the automated tuning of machine learning pipelines, where the choice between different preprocessing and machine learning algorithms is modeled as a categorical hyperparameter, a problem known as Full Model Selection (FMS) or Combined Algorithm Selection and Hyperparameter optimization problem (CASH) [30, 34, 83, 149]. They also occur when optimizing the architecture of a neural network: e.g., the number of layers can be an integer hyperparameter and the per-layer hyperparameters of layer i are only active if the network depth is at least i [12, 14, 33].
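The mix of domain types and the conditionality described above can be illustrated with a minimal sketch. The search space, the function name `sample_configuration`, the key names (`n_layers`, `units_layer_i`, `momentum`) and all ranges are hypothetical choices made for this example; they are not taken from the chapter or from any particular library.

```python
import random


def sample_configuration(rng, max_layers=4):
    """Draw one configuration from a small, hypothetical conditional space."""
    config = {
        # Real-valued and bounded, sampled on a log scale.
        "learning_rate": 10 ** rng.uniform(-5, -1),
        # Binary hyperparameter.
        "early_stopping": rng.random() < 0.5,
        # Categorical hyperparameter.
        "optimizer": rng.choice(["sgd", "adam", "rmsprop"]),
        # Integer-valued hyperparameter that gates the per-layer children.
        "n_layers": rng.randint(1, max_layers),
    }
    # Conditional hyperparameters: units_layer_i is active only if the
    # network depth is at least i.
    for i in range(1, config["n_layers"] + 1):
        config[f"units_layer_{i}"] = rng.choice([16, 32, 64, 128])
    # Child of a categorical parent: momentum only exists for SGD (assumption).
    if config["optimizer"] == "sgd":
        config["momentum"] = rng.uniform(0.0, 0.99)
    return config


if __name__ == "__main__":
    print(sample_configuration(random.Random(0)))
```

Inactive hyperparameters are simply absent from the returned dictionary, which mirrors the directed-acyclic-graph view: a child appears only when its parent takes an enabling value.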

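Given a data set $\mathcal{D}$, our goal is to find

$$\lambda^* = \operatorname*{argmin}_{\lambda \in \Lambda} \; \mathbb{E}_{(D_{\text{train}}, D_{\text{valid}}) \sim \mathcal{D}} \; V(L, A_\lambda, D_{\text{train}}, D_{\text{valid}}), \tag{1.1}$$

where $V(L, A_\lambda, D_{\text{train}}, D_{\text{valid}})$ measures the loss of a model generated by algorithm $A$ with hyperparameters $\lambda$ on training data $D_{\text{train}}$ and evaluated on validation data $D_{\text{valid}}$. In practice, we only have access to finite data $D \sim \mathcal{D}$ and thus need to approximate the expectation in Eq. 1.1.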

Popular choices for the validation protocol V(·, ·, ·, ·) are the holdout and cross-validation error for a user-given loss function (such as misclassification rate); see Bischl et al. [16] for an overview of validation protocols. Several strategies for reducing the evaluation time have been proposed: It is possible to only test machine learning algorithms on a subset of folds [149], only on a subset of data [78, 102, 147], or for a small number of iterations; we will discuss some of these strategies in more detail in Sect. 1.4. Recent work on multi-task [147] and multi-source [121] optimization introduced further cheap, auxiliary tasks, which can be queried instead of Eq. 1.1. These can provide cheap information to help HPO, but do not necessarily train a machine learning model on the dataset of interest and therefore do not yield a usable model as a side product.
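A minimal sketch of how a validation protocol might approximate the expectation in Eq. 1.1 is given below. It assumes a k-fold cross-validation error with an optional `max_folds` argument to evaluate only a subset of folds. All names (`k_fold_indices`, `cross_validation_error`, `make_knn_learner`, `misclassification_rate`) and the one-dimensional nearest-neighbour learner are hypothetical and serve only to keep the example self-contained.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k disjoint, shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_error(fit, loss, data, k=5, max_folds=None, seed=0):
    """Average validation loss over folds (a finite-data stand-in for Eq. 1.1).

    fit(train) must return a callable model; loss(model, valid) a float.
    Setting max_folds < k evaluates only a subset of folds to save time.
    """
    folds = k_fold_indices(len(data), k, seed)
    used = folds if max_folds is None else folds[:max_folds]
    errors = []
    for valid_idx in used:
        held_out = set(valid_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        valid = [data[i] for i in valid_idx]
        errors.append(loss(fit(train), valid))
    return sum(errors) / len(errors)


def make_knn_learner(n_neighbors):
    """Hypothetical A_lambda: 1-D k-nearest-neighbour classifier, lambda = k."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:n_neighbors]
            votes = sum(label for _, label in nearest)
            return int(2 * votes > len(nearest))  # majority vote over 0/1 labels
        return predict
    return fit


def misclassification_rate(model, valid):
    """Fraction of validation points the model labels incorrectly."""
    return sum(model(x) != y for x, y in valid) / len(valid)
```

With a list `data` of `(x, label)` pairs, a configuration could then be picked with `min([1, 3, 5], key=lambda k: cross_validation_error(make_knn_learner(k), misclassification_rate, data))`, which is the argmin of Eq. 1.1 restricted to three candidate values.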


1.2.1 Alternatives to Optimization: Ensembling and Marginalization

Solving Eq. 1.1 with one of the techniques described in the rest of this chapter usually requires fitting the machine learning algorithm A with multiple hyperparameter vectors $\lambda_t$. Instead of using the argmin-operator over these, it is possible to either construct an ensemble (which aims to minimize the loss for a given validation protocol) or to integrate out all the hyperparameters (if the model under consideration is a probabilistic model). We refer to Guyon et al. [50] and the references therein for a comparison of frequentist and Bayesian model selection.

Only choosing a single hyperparameter configuration can be wasteful when many good configurations have been identified by HPO, and combining them in an ensemble can improve performance [109]. This is particularly useful in AutoML systems with a large configuration space (e.g., in FMS or CASH), where good configurations can be very diverse, which increases the potential gains from ensembling [4, 19, 31, 34]. To further improve performance, Automatic Frankensteining [155] uses HPO to train a stacking model [156] on the outputs of the models found with HPO; the 2nd level models are then combined using a traditional ensembling strategy.
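One way such a post-hoc ensemble could be built is greedy forward selection on stored validation predictions. This is an illustrative choice, not a strategy the chapter prescribes. The function names `ensemble_loss` and `greedy_ensemble`, the squared-error loss and the uniform averaging are assumptions made for this sketch.

```python
def ensemble_loss(members, predictions, targets):
    """Mean squared error of the uniform average of the selected models.

    predictions[m][j] is the validation prediction of model m on example j.
    """
    total = 0.0
    for j, target in enumerate(targets):
        avg = sum(predictions[m][j] for m in members) / len(members)
        total += (avg - target) ** 2
    return total / len(targets)


def greedy_ensemble(predictions, targets, size):
    """Repeatedly add (with replacement) the model that most lowers the loss."""
    members = []
    for _ in range(size):
        best = min(
            range(len(predictions)),
            key=lambda m: ensemble_loss(members + [m], predictions, targets),
        )
        members.append(best)
    return members


if __name__ == "__main__":
    # Three hypothetical configurations evaluated on four validation points.
    targets = [0.0, 0.0, 0.0, 0.0]
    predictions = [[1.0] * 4, [-1.0] * 4, [2.0] * 4]
    chosen = greedy_ensemble(predictions, targets, size=2)
    print(chosen, ensemble_loss(chosen, predictions, targets))
```

In this toy case the two configurations with opposite errors cancel when averaged, so the ensemble beats the single best configuration that the argmin-operator would return.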

The methods discussed so far applied ensembling after the HPO procedure.

While they improve performance in practice, the base models are not optimized

for ensembling. It is, however, also possible to directly optimize for models which

would maximally improve an existing ensemble [97].

Finally, when dealing with Bayesian models it is often possible to integrate

out the hyperparameters of the machine learning algorithm, for example using

evidence maximization [98], Bayesian model averaging [56], slice sampling [111]

or empirical Bayes [103].

1.2.2 Optimizing for Multiple Objectives

In practical applications it is often necessary to trade off two or more objectives, such as the performance of a model and resource consumption [65] (see also Chap. 3) or multiple loss functions [57]. Potential solutions can be obtained in two ways.

First, if a limit on a secondary performance measure is known (such as the

maximal memory consumption), the problem can be formulated as a constrained

optimization problem. We will discuss constraint handling in Bayesian optimization

in Sect. 1.3.2.4.

Second, and more generally, one can apply multi-objective optimization to search for the Pareto front, a set of configurations which are optimal tradeoffs between the objectives in the sense that, for each configuration on the Pareto front, there is no other configuration which performs better for at least one and at least as well for all other objectives. The user can then choose a configuration from the Pareto front. We refer the interested reader to further literature on this topic [53, 57, 65, 134].
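The definition of the Pareto front above translates directly into a dominance check. The following is a small sketch, assuming every objective is to be minimized and each configuration has already been evaluated. The names `dominates` and `pareto_front` and the example numbers are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on all objectives and better on one.

    Objectives are tuples of values to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # (validation error, memory in MB) for five hypothetical configurations.
    evaluated = [(0.10, 500), (0.12, 200), (0.20, 100), (0.15, 300), (0.25, 150)]
    print(pareto_front(evaluated))
```

This brute-force filter needs a quadratic number of comparisons, which is usually negligible next to the cost of training the models themselves.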


Further advantages over grid search include easier parallelization (since workers do not need to communicate with each other and failing workers do not leave holes in the design) and flexible resource allocation (since one can add an arbitrary number of random points to a random search design to still yield a random search design; the equivalent does not hold for grid search).

Random search is a useful baseline because it makes no assumptions on the machine learning algorithm being optimized, and, given enough resources, will, in expectation, achieve performance arbitrarily close to the optimum. Interleaving random search with more complex optimization strategies therefore makes it possible to guarantee a minimal rate of convergence and also adds exploration that can improve model-based search [3, 59]. Random search is also a useful method for initializing the search process, as it explores the entire configuration space and thus often finds settings with reasonable performance. However, it is no silver bullet and often takes far longer than guided search methods to identify one of the best performing hyperparameter configurations: e.g., when sampling without replacement from a configuration space with N Boolean hyperparameters with a good and a bad setting each and no interaction effects, it will require an expected $2^{N-1}$ function evaluations to find the optimum, whereas a guided search could find the optimum in N + 1 function evaluations as follows: starting from an arbitrary configuration, loop over the hyperparameters and change one at a time, keeping the resulting configuration if performance improves and reverting the change if it doesn’t. Accordingly, the guided search methods we discuss in the following sections usually outperform random search [12, 14, 33, 90, 153].
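The Boolean example above can be reproduced with a short sketch. It assumes an additive toy loss in which each hyperparameter set to its bad value (`False`) costs one unit. The function names are hypothetical, and the random search is allowed to stop as soon as it hits the known optimum, which a real run could not do.

```python
import itertools
import random


def loss(config):
    """Hypothetical additive loss: each hyperparameter set to False costs 1."""
    return sum(1 for bit in config if not bit)


def guided_search(n, rng):
    """Flip one hyperparameter at a time, keeping only improving changes."""
    current = [rng.random() < 0.5 for _ in range(n)]
    best_loss, evaluations = loss(current), 1
    for i in range(n):
        candidate = current.copy()
        candidate[i] = not candidate[i]
        candidate_loss = loss(candidate)
        evaluations += 1
        if candidate_loss < best_loss:  # otherwise the change is reverted
            current, best_loss = candidate, candidate_loss
    return current, evaluations


def random_search_without_replacement(n, rng):
    """Evaluate all 2**n configurations in random order until the optimum."""
    configs = list(itertools.product([False, True], repeat=n))
    rng.shuffle(configs)
    for evaluations, config in enumerate(configs, start=1):
        if loss(config) == 0:  # uses knowledge of the optimum, for counting only
            return list(config), evaluations


if __name__ == "__main__":
    rng = random.Random(0)
    print(guided_search(8, rng))
    print(random_search_without_replacement(8, rng))
```

The guided search always uses exactly N + 1 evaluations. For random search the optimum is equally likely to sit at any position of the shuffled list, so the exact expectation is $(2^N + 1)/2$, which is approximately the $2^{N-1}$ quoted in the text.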

Population-based methods, such as genetic algorithms, evolutionary algorithms, evolutionary strategies, and particle swarm optimization are optimization algorithms that maintain a population, i.e., a set of configurations, and improve this population by applying local perturbations (so-called mutations) and combinations of different members (so-called crossover) to obtain a new generation of better configurations. These methods are conceptually simple, can handle different data types, and are embarrassingly parallel [91] since a population of N members can be evaluated in parallel on N machines.
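A basic evolutionary loop with mutation and crossover might look as follows. This is a generic sketch under assumed settings (real-valued configurations, truncation selection, uniform crossover, Gaussian mutation) and does not correspond to any specific algorithm cited in the chapter. The function name `evolve` and all default values are hypothetical.

```python
import random


def evolve(loss, n_dims, pop_size=20, generations=30, mutation_scale=0.3, seed=0):
    """Minimize `loss` over real vectors with a basic evolutionary loop."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(-5, 5) for _ in range(n_dims)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Evaluation: members are independent, so this could run in parallel.
        ranked = sorted(population, key=loss)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: take each coordinate from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: small local Gaussian perturbation.
            child = [x + rng.gauss(0.0, mutation_scale) for x in child]
            children.append(child)
        # Elitism: the better half survives unchanged into the next generation.
        population = parents + children
    return min(population, key=loss)


if __name__ == "__main__":
    sphere = lambda v: sum(x * x for x in v)
    best = evolve(sphere, n_dims=3)
    print(best, sphere(best))
```

Because the better half of each generation is carried over unchanged, the best loss in the population can never get worse from one generation to the next.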

One of the best known population-based methods is the covariance matrix adaptation evolution strategy (CMA-ES [51]); this simple evolutionary strategy samples configurations from a multivariate Gaussian whose mean and covariance are updated in each generation based on the success of the population’s individuals. CMA-ES is one of the most competitive blackbox optimization algorithms, regularly dominating the Black-Box Optimization Benchmarking (BBOB) challenge [11].
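The sample, select, refit cycle described above can be sketched with a much simpler stand-in. The code below is not CMA-ES: it is a cross-entropy-method-style Gaussian search with a diagonal covariance, without evolution paths, step-size control or a full covariance matrix. The name `gaussian_es` and all defaults are assumptions made for illustration.

```python
import random
import statistics


def gaussian_es(loss, mean, sigma=1.0, pop_size=20, elite=5, generations=40, seed=0):
    """Simplified Gaussian evolution strategy with a diagonal covariance.

    NOT CMA-ES: only the sample-refit-resample loop is shown.
    """
    rng = random.Random(seed)
    mean = list(mean)
    std = [sigma] * len(mean)
    best = list(mean)
    for _ in range(generations):
        # Sample a generation from the current search distribution.
        population = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop_size)
        ]
        # Keep the most successful individuals of this generation.
        winners = sorted(population, key=loss)[:elite]
        best = min(best, winners[0], key=loss)
        # Refit the mean and the per-dimension spread to the winners.
        mean = [statistics.fmean(col) for col in zip(*winners)]
        std = [statistics.pstdev(col) + 1e-12 for col in zip(*winners)]
    return best


if __name__ == "__main__":
    shifted = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2
    result = gaussian_es(shifted, mean=[0.0, 0.0])
    print(result, shifted(result))
```

Refitting only per-dimension standard deviations ignores correlations between hyperparameters. Capturing those correlations, along with principled step-size adaptation, is what the full covariance update of CMA-ES adds.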

For further details on population-based methods, we refer to [28, 138]; we discuss

applications to hyperparameter optimization in Sect. 1.5, applications to neural

architecture search in Chap. 3, and genetic programming for AutoML pipelines in

Chap. 8.
