AutoML: Methods, Systems, Challenges (new book, with bookmarks)

A 221-page draft of "AutoML: Methods, Systems, Challenges" by Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren has recently been released. It explains the fundamentals behind all AutoML systems, gives in-depth descriptions of current AutoML systems such as Auto-WEKA, Hyperopt-Sklearn, and Auto-sklearn, and closes with the open challenges of AutoML. The authors are currently finishing editing the new book, which will be published for NIPS 2018.
Chapter 1
Hyperparameter Optimization
Matthias Feurer and Frank Hutter
Abstract
Recent interest in complex and computationally expensive machine learning models with many hyperparameters, such as automated machine learning (AutoML) frameworks and deep neural networks, has resulted in a resurgence of research on hyperparameter optimization (HPO). In this chapter, we give an overview of the most prominent approaches for HPO. We first discuss blackbox function optimization methods based on model-free methods and Bayesian optimization. Since the high computational demand of many modern machine learning applications renders pure blackbox optimization extremely costly, we next focus on modern multi-fidelity methods that use (much) cheaper variants of the blackbox function to approximately assess the quality of hyperparameter settings. Lastly, we point to open problems and future research directions.
1.1 Introduction
Every machine learning system has hyperparameters, and the most basic task in automated machine learning (AutoML) is to automatically set these hyperparameters to optimize performance. Especially recent deep neural networks crucially depend on a wide range of hyperparameter choices about the neural network's architecture, regularization, and optimization. Automated hyperparameter optimization (HPO) has several important use cases; it can

• reduce the human effort necessary for applying machine learning. This is particularly important in the context of AutoML.
• improve the performance of machine learning algorithms (by tailoring them to the problem at hand); this has led to new state-of-the-art performances for important machine learning benchmarks in several studies (e.g. [137, 102]).
• improve the reproducibility and fairness of scientific studies. Automated
HPO is clearly more reproducible than manual search. It facilitates fair
comparisons since different methods can only be compared fairly if they
all receive the same level of tuning for the problem at hand [12, 130].
The problem of HPO has a long history, dating back to the 1990s (e.g., [123,
104, 74, 79]), and it was also established early that different hyperparameter
configurations tend to work best for different datasets [79]. In contrast, it is a
rather new insight that HPO can be used to adapt general-purpose pipelines to
specific application domains [28]. Nowadays, it is also widely acknowledged that
tuned hyperparameters improve over the default setting provided by common
machine learning libraries [146, 97, 127, 113].
Because of the increased usage of machine learning in companies, HPO is
also of substantial commercial interest and plays an ever larger role there, be it
in company-internal tools [42], as part of machine learning cloud services [86, 5],
or as a service by itself [134].
HPO faces several challenges which make it a hard problem in practice:
• Function evaluations can be extremely expensive for large models (e.g., in deep learning), complex machine learning pipelines, or large datasets.
• The configuration space is often complex (comprising a mix of continuous, categorical and conditional hyperparameters) and high-dimensional. Furthermore, it is not always clear which of an algorithm's hyperparameters need to be optimized, and in which ranges.
• We usually don't have access to a gradient of the loss function with respect to the hyperparameters. Furthermore, other properties of the target function often used in classical optimization do not typically apply, such as convexity and smoothness.
• One cannot directly optimize for generalization performance as training
datasets are of limited size.
We refer the interested reader to other reviews of HPO for further discussions
on this topic [61, 91].
This chapter is structured as follows. First, we define the HPO problem formally and discuss its variants (Section 1.2). Then, we discuss blackbox optimization algorithms for solving HPO (Section 1.3). Next, we focus on modern multi-fidelity methods that enable the use of HPO even for very expensive models, by exploiting approximate performance measures that are cheaper than full model evaluations (Section 1.4). We then provide an overview of the most important hyperparameter optimization systems and applications to AutoML (Section 1.5) and end the chapter with a discussion of open problems (Section 1.6).
1.2 Problem Statement
Let A denote a machine learning algorithm with N hyperparameters. We denote the domain of the n-th hyperparameter by Λ_n and the overall hyperparameter configuration space as Λ = Λ_1 × Λ_2 × · · · × Λ_N. A vector of hyperparameters is denoted by λ ∈ Λ, and A with its hyperparameters instantiated to λ is denoted by A_λ.
The domain of a hyperparameter can be real-valued (e.g., learning rate), integer-valued (e.g., number of layers), binary (e.g., whether to use early stopping or not), or categorical (e.g., choice of optimizer). For integer and real-valued hyperparameters, the domains are mostly bounded for practical reasons, with only a few exceptions [10, 133, 110].
Furthermore, the configuration space can contain conditionality, i.e., a hyperparameter may only be relevant if another hyperparameter (or some combination of hyperparameters) takes on a certain value. Conditional spaces take the form of directed acyclic graphs. Such conditional spaces occur, e.g., in the automated tuning of machine learning pipelines, where the choice between different preprocessing and machine learning algorithms is modeled as a categorical hyperparameter, a problem known as Full Model Selection (FMS) or Combined Algorithm Selection and Hyperparameter optimization (CASH) [28, 146, 80, 32]. They also occur when optimizing the architecture of a neural network: e.g., the number of layers can be an integer hyperparameter and the per-layer hyperparameters of layer i are only active if the network depth is at least i [10, 12, 31].
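To make the structure of such a conditional space concrete, here is a minimal Python sketch (the hyperparameter names and value sets are illustrative only, not taken from any of the cited systems): the per-layer child hyperparameters are sampled only once the parent hyperparameter n_layers makes them active.

```python
import random

def sample_conditional_config(max_layers=5):
    """Sample from a conditional space: the per-layer hyperparameters of
    layer i exist only if n_layers >= i (a DAG of parent-child dependencies)."""
    config = {"n_layers": random.randint(1, max_layers)}
    for i in range(1, config["n_layers"] + 1):
        # Child hyperparameters, active only because layer i exists.
        config[f"units_layer_{i}"] = random.choice([32, 64, 128, 256])
        config[f"activation_layer_{i}"] = random.choice(["relu", "tanh"])
    return config
```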
Given a data set D, our goal is to find

    λ* = argmin_{λ∈Λ} E_{(D_train, D_valid) ∼ 𝒟} V(L, A_λ, D_train, D_valid),    (1.1)

where V(L, A_λ, D_train, D_valid) measures the loss of a model generated by algorithm A with hyperparameters λ on training data D_train and evaluated on validation data D_valid. In practice, we only have access to finite data D ∼ 𝒟 and thus need to approximate the expectation in Equation 1.1.
Popular choices for the validation protocol V(·, ·, ·, ·) are the holdout and
cross-validation error for a user-given loss function (such as misclassification
rate); see Bischl et al. [14] for an overview of validation protocols. Several
strategies for reducing the evaluation time have been proposed: It is possible
to only test machine learning algorithms on a subset of folds [146], only on
a subset of data [99, 144, 75], or for a small number of iterations; we will
discuss some of these strategies in more detail in Section 1.4. Recent work on
multi-task [144] and multi-source [118] optimization introduced further cheap,
auxiliary tasks, which can be queried instead of Equation 1.1. These can provide
cheap information to help HPO, but do not necessarily train a machine learning
model on the dataset of interest and therefore do not yield a usable model as a
side product.
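As a concrete illustration, the following sketch (using scikit-learn; the helper name, loss function, and split ratio are our own choices, not prescribed by the chapter) computes one holdout-based sample of the objective in Equation 1.1 for a single configuration λ:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import zero_one_loss

def holdout_loss(algorithm_cls, lam, X, y, seed=0):
    """One Monte Carlo sample of the expectation in Equation 1.1: draw a
    (train, valid) split from the finite data D, train A_lambda on the
    training part, and measure the misclassification loss on the rest."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    model = algorithm_cls(**lam).fit(X_train, y_train)  # A instantiated to lambda
    return zero_one_loss(y_valid, model.predict(X_valid))
```

Averaging this quantity over several random seeds, or replacing the single split with cross-validation folds, approximates the expectation in Equation 1.1.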
1.2.1 Alternatives to Optimization: Ensembling and Marginalization
Solving Equation 1.1 with one of the techniques described in the rest of this chapter usually requires fitting the machine learning algorithm A with multiple hyperparameter vectors λ_t. Instead of using the argmin-operator over these,
it is possible to either construct an ensemble (which aims to minimize the loss
for a given validation protocol) or to integrate out all the hyperparameters (if
the model under consideration is a probabilistic model). We refer to Guyon et
al. [47] and the references therein for a comparison of frequentist and Bayesian
model selection.
Only choosing a single hyperparameter configuration can be wasteful when many good configurations have been identified by HPO, and combining them in an ensemble can improve performance [106]. This is particularly useful in AutoML systems with a large configuration space (e.g., in FMS or CASH), where good configurations can be very diverse, which increases the potential gains from ensembling [29, 17, 32, 4]. To further improve performance, Automatic Frankensteining [152] uses HPO to train a stacking model [153] on the outputs of the models found with HPO; the 2nd-level models are then combined using a traditional ensembling strategy.
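As a minimal sketch of such post-hoc ensembling, the following code performs a greedy ensemble selection in the spirit of the work cited above [17], here simplified to binary classification with averaged predicted probabilities (the interface is our own, not the cited authors' implementation):

```python
import numpy as np

def greedy_ensemble_selection(val_probs, y_valid, ensemble_size=10):
    """val_probs: list of 1-d arrays of predicted positive-class
    probabilities on D_valid, one per model found during HPO.
    y_valid: 1-d array of 0/1 labels. Repeatedly adds, with replacement,
    the model whose inclusion minimizes the validation error of the
    averaged ensemble prediction."""
    chosen = []
    avg = np.zeros(len(y_valid))
    for _ in range(ensemble_size):
        k = len(chosen)
        # Validation error of the ensemble if model m were added next.
        errors = [np.mean((((avg * k + p) / (k + 1)) > 0.5) != y_valid)
                  for p in val_probs]
        best = int(np.argmin(errors))
        avg = (avg * k + val_probs[best]) / (k + 1)
        chosen.append(best)  # selection is with replacement
    return chosen  # indices into val_probs, possibly with repeats
```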
The methods discussed so far applied ensembling after the HPO procedure.
While they improve performance in practice, the base models are not optimized
for ensembling. It is, however, also possible to directly optimize for models
which would maximally improve an existing ensemble [94].
Finally, when dealing with Bayesian models it is often possible to integrate
out the hyperparameters of the machine learning algorithm, for example using
evidence maximization [95], Bayesian model averaging [53], slice sampling [108]
or empirical Bayes [100].
1.2.2 Optimizing for Multiple Objectives
In practical applications it is often necessary to trade off two or more objectives,
such as the performance of a model and resource consumption [62] (see also
Chapter 3) or multiple loss functions [54]. Potential solutions can be obtained
in two ways.
First, if a limit on a secondary performance measure is known (such as the maximal memory consumption), the problem can be formulated as a constrained optimization problem. We will discuss constraint handling in Bayesian optimization in Section 1.3.2.
Second, and more generally, one can apply multi-objective optimization to
search for the Pareto front, a set of configurations which are optimal tradeoffs
between the objectives in the sense that, for each configuration on the Pareto
front, there is no other configuration which performs better for at least one and
at least as well for all other objectives. The user can then choose a configuration
from the Pareto front. We refer the interested reader to further literature on
this topic [62, 131, 50, 54].
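For illustration, a minimal sketch of extracting the Pareto front from a finished set of evaluations could look as follows (our own helper; all objectives are assumed to be minimized, e.g. validation error and memory consumption):

```python
def pareto_front(evaluated):
    """Filter a list of (config, objectives) pairs down to the Pareto front.
    A configuration is dominated if another one is at least as good in
    every objective and strictly better in at least one."""
    front = []
    for cfg, obj in evaluated:
        dominated = any(
            all(o2 <= o1 for o1, o2 in zip(obj, other)) and tuple(other) != tuple(obj)
            for _, other in evaluated)
        if not dominated:
            front.append((cfg, obj))
    return front

# e.g. pareto_front([("a", (0.08, 512)), ("b", (0.10, 128)), ("c", (0.12, 256))])
# keeps "a" and "b": "c" is dominated by "b" in both error and memory.
```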
Figure 1.1: Comparison of grid search and random search. Figure reproduced
from Bergstra and Bengio [11].
1.3 Blackbox Hyperparameter Optimization
In general, every blackbox optimization method can be applied to HPO. Due
to the non-convex nature of the problem, global optimization algorithms are
usually preferred, but some locality in the optimization process is useful in order
to make progress within the few function evaluations that are usually available.
We first discuss model-free blackbox HPO methods and then describe blackbox
Bayesian optimization methods.
1.3.1 Model-Free Blackbox Optimization Methods
Grid search is the most basic HPO method, also known as full factorial design [107]. The user specifies a finite set of values for each hyperparameter, and grid search evaluates the Cartesian product of these sets. This suffers from the curse of dimensionality since the required number of function evaluations grows exponentially with the dimensionality of the configuration space. An additional problem of grid search is that increasing the resolution of discretization substantially increases the required number of function evaluations.
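A minimal sketch of grid search (the interface is our own; evaluate stands in for the validation protocol V of Section 1.2):

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustively evaluate the Cartesian product of the per-hyperparameter
    value sets; the cost grows exponentially in the number of hyperparameters."""
    best_lam, best_loss = None, float("inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        lam = dict(zip(keys, values))
        loss = evaluate(lam)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss

# Example: 3 x 2 = 6 evaluations, the Cartesian product of the value sets.
grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
```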
A simple alternative to grid search is random search [11].¹ As the name suggests, random search samples configurations at random until a certain budget for the search is exhausted. This works better than grid search when some hyperparameters are much more important than others (a property that holds in many cases [11, 58]). Intuitively, when run with a fixed budget of B function evaluations, the number of different values grid search can afford to evaluate for each of the N hyperparameters is only B^{1/N}, whereas random search will explore B different values for each; see Figure 1.1 for an illustration.

¹ In some disciplines this is also known as pure random search [155].
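A corresponding sketch of random search (again with our own interface): each hyperparameter is paired with a sampling function, so with a budget of B evaluations every hyperparameter can receive up to B distinct values rather than B^{1/N}:

```python
import random

def random_search(evaluate, space, budget=100):
    """Sample configurations independently at random from the given
    per-hyperparameter distributions until the budget is exhausted."""
    best_lam, best_loss = None, float("inf")
    for _ in range(budget):
        lam = {name: sample() for name, sample in space.items()}
        loss = evaluate(lam)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss

# Example space: log-uniform learning rate, uniform integer depth.
space = {"learning_rate": lambda: 10 ** random.uniform(-5, -1),
         "n_layers": lambda: random.randint(1, 8)}
```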