AutoML: Methods, Systems, Challenges (new book, with bookmarks)

A 221-page draft of "AutoML: Methods, Systems, Challenges" by Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren has recently been released. It explains the fundamentals behind all AutoML systems, gives in-depth descriptions of current AutoML systems such as Auto-WEKA, Hyperopt-Sklearn, and Auto-sklearn, and closes with the open challenges of AutoML. The authors are currently finishing editing the new book, which will be published for NIPS 2018.
Chapter 1
Hyperparameter Optimization
Matthias Feurer and Frank Hutter
Abstract
Recent interest in complex and computationally expensive machine learning models with many hyperparameters, such as automated machine learning (AutoML) frameworks and deep neural networks, has resulted in a resurgence of research on hyperparameter optimization (HPO). In this chapter, we give an overview of the most prominent approaches for HPO. We first discuss blackbox function optimization methods based on model-free methods and Bayesian optimization. Since the high computational demand of many modern machine learning applications renders pure blackbox optimization extremely costly, we next focus on modern multi-fidelity methods that use (much) cheaper variants of the blackbox function to approximately assess the quality of hyperparameter settings. Lastly, we point to open problems and future research directions.
1.1 Introduction
Every machine learning system has hyperparameters, and the most basic task in automated machine learning (AutoML) is to automatically set these hyperparameters to optimize performance. Especially recent deep neural networks crucially depend on a wide range of hyperparameter choices about the neural network's architecture, regularization, and optimization. Automated hyperparameter optimization (HPO) has several important use cases; it can

• reduce the human effort necessary for applying machine learning. This is particularly important in the context of AutoML.
• improve the performance of machine learning algorithms (by tailoring them to the problem at hand); this has led to new state-of-the-art performances for important machine learning benchmarks in several studies (e.g. [137, 102]).
• improve the reproducibility and fairness of scientific studies. Automated
HPO is clearly more reproducible than manual search. It facilitates fair
comparisons since different methods can only be compared fairly if they
all receive the same level of tuning for the problem at hand [12, 130].
The problem of HPO has a long history, dating back to the 1990s (e.g., [123,
104, 74, 79]), and it was also established early that different hyperparameter
configurations tend to work best for different datasets [79]. In contrast, it is a
rather new insight that HPO can be used to adapt general-purpose pipelines to
specific application domains [28]. Nowadays, it is also widely acknowledged that
tuned hyperparameters improve over the default setting provided by common
machine learning libraries [146, 97, 127, 113].
Because of the increased usage of machine learning in companies, HPO is
also of substantial commercial interest and plays an ever larger role there, be it
in company-internal tools [42], as part of machine learning cloud services [86, 5],
or as a service by itself [134].
HPO faces several challenges which make it a hard problem in practice:
• Function evaluations can be extremely expensive for large models (e.g., in deep learning), complex machine learning pipelines, or large datasets.
• The configuration space is often complex (comprising a mix of continuous, categorical and conditional hyperparameters) and high-dimensional. Furthermore, it is not always clear which of an algorithm's hyperparameters need to be optimized, and in which ranges.
• We usually don't have access to a gradient of the loss function with respect to the hyperparameters. Furthermore, other properties of the target function often used in classical optimization do not typically apply, such as convexity and smoothness.
• One cannot directly optimize for generalization performance as training
datasets are of limited size.
We refer the interested reader to other reviews of HPO for further discussions
on this topic [61, 91].
This chapter is structured as follows. First, we define the HPO problem formally and discuss its variants (Section 1.2). Then, we discuss blackbox optimization algorithms for solving HPO (Section 1.3). Next, we focus on modern multi-fidelity methods that enable the use of HPO even for very expensive models, by exploiting approximate performance measures that are cheaper than full model evaluations (Section 1.4). We then provide an overview of the most important hyperparameter optimization systems and applications to AutoML (Section 1.5) and end the chapter with a discussion of open problems (Section 1.6).
1.2 Problem Statement
Let A denote a machine learning algorithm with N hyperparameters. We denote the domain of the n-th hyperparameter by Λ_n and the overall hyperparameter configuration space as Λ = Λ_1 × Λ_2 × · · · × Λ_N. A vector of hyperparameters is denoted by λ ∈ Λ, and A with its hyperparameters instantiated to λ is denoted by A_λ.
The domain of a hyperparameter can be real-valued (e.g., learning rate), integer-valued (e.g., number of layers), binary (e.g., whether to use early stopping or not), or categorical (e.g., choice of optimizer). For integer and real-valued hyperparameters, the domains are mostly bounded for practical reasons, with only a few exceptions [10, 133, 110].
Furthermore, the configuration space can contain conditionality, i.e., a hyperparameter may only be relevant if another hyperparameter (or some combination of hyperparameters) takes on a certain value. Conditional spaces take the form of directed acyclic graphs. Such conditional spaces occur, e.g., in the automated tuning of machine learning pipelines, where the choice between different preprocessing and machine learning algorithms is modeled as a categorical hyperparameter, a problem known as Full Model Selection (FMS) or Combined Algorithm Selection and Hyperparameter optimization (CASH) [28, 146, 80, 32]. They also occur when optimizing the architecture of a neural network: e.g., the number of layers can be an integer hyperparameter and the per-layer hyperparameters of layer i are only active if the network depth is at least i [10, 12, 31].
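To make the structure of such a conditional space concrete, here is a minimal Python sketch (the hyperparameter names and value sets are illustrative only, not taken from any of the cited systems): the per-layer child hyperparameters are sampled only once the parent hyperparameter n_layers makes them active.

```python
import random

def sample_conditional_config(max_layers=5):
    """Sample from a conditional space: the per-layer hyperparameters of
    layer i exist only if n_layers >= i (a DAG of parent-child dependencies)."""
    config = {"n_layers": random.randint(1, max_layers)}
    for i in range(1, config["n_layers"] + 1):
        # Child hyperparameters, active only because layer i exists.
        config[f"units_layer_{i}"] = random.choice([32, 64, 128, 256])
        config[f"activation_layer_{i}"] = random.choice(["relu", "tanh"])
    return config
```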
Given a data set D, our goal is to find

    λ* = argmin_{λ∈Λ} E_{(D_train, D_valid) ∼ 𝒟} V(L, A_λ, D_train, D_valid),    (1.1)

where V(L, A_λ, D_train, D_valid) measures the loss of a model generated by algorithm A with hyperparameters λ on training data D_train and evaluated on validation data D_valid. In practice, we only have access to finite data D ∼ 𝒟 and thus need to approximate the expectation in Equation 1.1.
Popular choices for the validation protocol V(·, ·, ·, ·) are the holdout and
cross-validation error for a user-given loss function (such as misclassification
rate); see Bischl et al. [14] for an overview of validation protocols. Several
strategies for reducing the evaluation time have been proposed: It is possible
to only test machine learning algorithms on a subset of folds [146], only on
a subset of data [99, 144, 75], or for a small number of iterations; we will
discuss some of these strategies in more detail in Section 1.4. Recent work on
multi-task [144] and multi-source [118] optimization introduced further cheap,
auxiliary tasks, which can be queried instead of Equation 1.1. These can provide
cheap information to help HPO, but do not necessarily train a machine learning
model on the dataset of interest and therefore do not yield a usable model as a
side product.
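As a concrete illustration, the following sketch (using scikit-learn; the helper name, loss function, and split ratio are our own choices, not prescribed by the chapter) computes one holdout-based sample of the objective in Equation 1.1 for a single configuration λ:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import zero_one_loss

def holdout_loss(algorithm_cls, lam, X, y, seed=0):
    """One Monte Carlo sample of the expectation in Equation 1.1: draw a
    (train, valid) split from the finite data D, train A_lambda on the
    training part, and measure the misclassification loss on the rest."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    model = algorithm_cls(**lam).fit(X_train, y_train)  # A instantiated to lambda
    return zero_one_loss(y_valid, model.predict(X_valid))
```

Averaging this quantity over several random seeds, or replacing the single split with cross-validation folds, approximates the expectation in Equation 1.1.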
1.2.1 Alternatives to Optimization: Ensembling and Marginalization
Solving Equation 1.1 with one of the techniques described in the rest of this chapter usually requires fitting the machine learning algorithm A with multiple hyperparameter vectors λ_t. Instead of using the argmin-operator over these,
it is possible to either construct an ensemble (which aims to minimize the loss
for a given validation protocol) or to integrate out all the hyperparameters (if
the model under consideration is a probabilistic model). We refer to Guyon et
al. [47] and the references therein for a comparison of frequentist and Bayesian
model selection.
Only choosing a single hyperparameter configuration can be wasteful when many good configurations have been identified by HPO, and combining them in an ensemble can improve performance [106]. This is particularly useful in AutoML systems with a large configuration space (e.g., in FMS or CASH), where good configurations can be very diverse, which increases the potential gains from ensembling [29, 17, 32, 4]. To further improve performance, Automatic Frankensteining [152] uses HPO to train a stacking model [153] on the outputs of the models found with HPO; the 2nd-level models are then combined using a traditional ensembling strategy.
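As a minimal sketch of such post-hoc ensembling, the following code performs a greedy ensemble selection in the spirit of the work cited above [17], here simplified to binary classification with averaged predicted probabilities (the interface is our own, not the cited authors' implementation):

```python
import numpy as np

def greedy_ensemble_selection(val_probs, y_valid, ensemble_size=10):
    """val_probs: list of 1-d arrays of predicted positive-class
    probabilities on D_valid, one per model found during HPO.
    y_valid: 1-d array of 0/1 labels. Repeatedly adds, with replacement,
    the model whose inclusion minimizes the validation error of the
    averaged ensemble prediction."""
    chosen = []
    avg = np.zeros(len(y_valid))
    for _ in range(ensemble_size):
        k = len(chosen)
        # Validation error of the ensemble if model m were added next.
        errors = [np.mean((((avg * k + p) / (k + 1)) > 0.5) != y_valid)
                  for p in val_probs]
        best = int(np.argmin(errors))
        avg = (avg * k + val_probs[best]) / (k + 1)
        chosen.append(best)  # selection is with replacement
    return chosen  # indices into val_probs, possibly with repeats
```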
The methods discussed so far applied ensembling after the HPO procedure.
While they improve performance in practice, the base models are not optimized
for ensembling. It is, however, also possible to directly optimize for models
which would maximally improve an existing ensemble [94].
Finally, when dealing with Bayesian models it is often possible to integrate
out the hyperparameters of the machine learning algorithm, for example using
evidence maximization [95], Bayesian model averaging [53], slice sampling [108]
or empirical Bayes [100].
1.2.2 Optimizing for Multiple Objectives
In practical applications it is often necessary to trade off two or more objectives,
such as the performance of a model and resource consumption [62] (see also
Chapter 3) or multiple loss functions [54]. Potential solutions can be obtained
in two ways.
First, if a limit on a secondary performance measure is known (such as the maximal memory consumption), the problem can be formulated as a constrained optimization problem. We will discuss constraint handling in Bayesian optimization in Section 1.3.2.
Second, and more generally, one can apply multi-objective optimization to
search for the Pareto front, a set of configurations which are optimal tradeoffs
between the objectives in the sense that, for each configuration on the Pareto
front, there is no other configuration which performs better for at least one and
at least as well for all other objectives. The user can then choose a configuration
from the Pareto front. We refer the interested reader to further literature on
this topic [62, 131, 50, 54].
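For illustration, a minimal sketch of extracting the Pareto front from a finished set of evaluations could look as follows (our own helper; all objectives are assumed to be minimized, e.g. validation error and memory consumption):

```python
def pareto_front(evaluated):
    """Filter a list of (config, objectives) pairs down to the Pareto front.
    A configuration is dominated if another one is at least as good in
    every objective and strictly better in at least one."""
    front = []
    for cfg, obj in evaluated:
        dominated = any(
            all(o2 <= o1 for o1, o2 in zip(obj, other)) and tuple(other) != tuple(obj)
            for _, other in evaluated)
        if not dominated:
            front.append((cfg, obj))
    return front

# e.g. pareto_front([("a", (0.08, 512)), ("b", (0.10, 128)), ("c", (0.12, 256))])
# keeps "a" and "b": "c" is dominated by "b" in both error and memory.
```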
Figure 1.1: Comparison of grid search and random search. Figure reproduced
from Bergstra and Bengio [11].
1.3 Blackbox Hyperparameter Optimization
In general, every blackbox optimization method can be applied to HPO. Due
to the non-convex nature of the problem, global optimization algorithms are
usually preferred, but some locality in the optimization process is useful in order
to make progress within the few function evaluations that are usually available.
We first discuss model-free blackbox HPO methods and then describe blackbox
Bayesian optimization methods.
1.3.1 Model-Free Blackbox Optimization Methods
Grid search is the most basic HPO method, also known as full factorial design [107]. The user specifies a finite set of values for each hyperparameter, and grid search evaluates the Cartesian product of these sets. This suffers from the curse of dimensionality since the required number of function evaluations grows exponentially with the dimensionality of the configuration space. An additional problem of grid search is that increasing the resolution of discretization substantially increases the required number of function evaluations.
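A minimal sketch of grid search (the interface is our own; evaluate stands in for the validation protocol V of Section 1.2):

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustively evaluate the Cartesian product of the per-hyperparameter
    value sets; the cost grows exponentially in the number of hyperparameters."""
    best_lam, best_loss = None, float("inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        lam = dict(zip(keys, values))
        loss = evaluate(lam)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss

# Example: 3 x 2 = 6 evaluations, the Cartesian product of the value sets.
grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
```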
A simple alternative to grid search is random search [11].¹ As the name suggests, random search samples configurations at random until a certain budget for the search is exhausted. This works better than grid search when some hyperparameters are much more important than others (a property that holds in many cases [11, 58]). Intuitively, when run with a fixed budget of B function evaluations, the number of different values grid search can afford to evaluate for each of the N hyperparameters is only B^{1/N}, whereas random search will explore B different values for each; see Figure 1.1 for an illustration.

¹ In some disciplines this is also known as pure random search [155].
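A corresponding sketch of random search (again with our own interface): each hyperparameter is paired with a sampling function, so with a budget of B evaluations every hyperparameter can receive up to B distinct values rather than B^{1/N}:

```python
import random

def random_search(evaluate, space, budget=100):
    """Sample configurations independently at random from the given
    per-hyperparameter distributions until the budget is exhausted."""
    best_lam, best_loss = None, float("inf")
    for _ in range(budget):
        lam = {name: sample() for name, sample in space.items()}
        loss = evaluate(lam)
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss

# Example space: log-uniform learning rate, uniform integer depth.
space = {"learning_rate": lambda: 10 ** random.uniform(-5, -1),
         "n_layers": lambda: random.randint(1, 8)}
```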