AUTOML: METHODS, SYSTEMS, CHALLENGES
Editors: Frank Hutter, Lars Kotthoff, Joaquin Vanschoren
We’re in the process of finishing this edited book, and it will be ready for sale by NIPS 2018. In addition to the print edition, we will keep the book open access. Below, we share preliminary versions of the chapters; at this point, these are all drafts that have not yet been copy-edited.
Contents
Part 1: AutoML Methods
This part comprises highly up-to-date overview chapters on the common foundations behind all
AutoML systems.
Chapter 1: Hyperparameter Optimization. By Matthias Feurer and Frank Hutter
Chapter 2: Meta Learning. By Joaquin Vanschoren
Chapter 3: Neural Architecture Search. By Thomas Elsken, Jan-Hendrik Metzen and Frank Hutter
Part 2: AutoML Systems
This part comprises in-depth descriptions of a broad range of available AutoML systems that can
be used for effective machine learning out of the box.
Chapter 4: Auto-WEKA. By Lars Kotthoff and Chris Thornton and Holger H. Hoos and Frank Hutter
and Kevin Leyton-Brown
Chapter 5: Hyperopt-Sklearn. By Brent Komer and James Bergstra and Chris Eliasmith
Chapter 6: Auto-sklearn: Efficient and Robust Automated Machine Learning. By Matthias Feurer
and Aaron Klein and Katharina Eggensperger and Jost Tobias Springenberg and Manuel Blum and
Frank Hutter
Chapter 7: Auto-Net: Towards Automatically-Tuned Neural Networks. By Hector Mendoza and
Aaron Klein and Matthias Feurer and Jost Tobias Springenberg and Matthias Urban and Michael
Burkart and Max Dippel and Marius Lindauer and Frank Hutter
Chapter 8: TPOT: A Tool for Automating Machine Learning. By Randal S. Olson and Jason H. Moore
Chapter 9: The Automatic Statistician. By Christian Steinruecken and Emma Smith and David Janz
and James Lloyd and Zoubin Ghahramani
Part 3: AutoML Challenges
This part provides an in-depth analysis of all AutoML challenges held to date.
Chapter 10: Analysis of the AutoML Challenge series 2015-2018. By Isabelle Guyon and Lisheng
Sun-Hosoya and Marc Boullé and Hugo Jair Escalante and Sergio Escalera and Zhengying Liu and
Damir Jajetic and Bisakha Ray and Mehreen Saeed and Michele Sebag and Alexander Statnikov and
Wei-Wei Tu and Evelyne Viegas
Chapter 1
Hyperparameter Optimization
Matthias Feurer and Frank Hutter
Abstract
Recent interest in complex and computationally expensive machine learning models with many hyperparameters, such as automated machine learning (AutoML) frameworks and deep neural networks, has resulted in a resurgence of research on hyperparameter optimization (HPO). In this chapter, we give an overview of the most prominent approaches for HPO. We first discuss blackbox function optimization methods based on model-free methods and Bayesian optimization. Since the high computational demand of many modern machine learning applications renders pure blackbox optimization extremely costly, we next focus on modern multi-fidelity methods that use (much) cheaper variants of the blackbox function to approximately assess the quality of hyperparameter settings. Lastly, we point to open problems and future research directions.
1.1 Introduction
Every machine learning system has hyperparameters, and the most basic task in automated machine learning (AutoML) is to automatically set these hyperparameters to optimize performance. Especially recent deep neural networks crucially depend on a wide range of hyperparameter choices about the neural network’s architecture, regularization, and optimization. Automated hyperparameter optimization (HPO) has several important use cases; it can
• reduce the human effort necessary for applying machine learning. This is
particularly important in the context of AutoML.
• improve the performance of machine learning algorithms (by tailoring them to the problem at hand); this has led to new state-of-the-art performances for important machine learning benchmarks in several studies (e.g. [137, 102]).
• improve the reproducibility and fairness of scientific studies. Automated
HPO is clearly more reproducible than manual search. It facilitates fair
comparisons since different methods can only be compared fairly if they
all receive the same level of tuning for the problem at hand [12, 130].
The problem of HPO has a long history, dating back to the 1990s (e.g., [123,
104, 74, 79]), and it was also established early that different hyperparameter
configurations tend to work best for different datasets [79]. In contrast, it is a
rather new insight that HPO can be used to adapt general-purpose pipelines to
specific application domains [28]. Nowadays, it is also widely acknowledged that
tuned hyperparameters improve over the default setting provided by common
machine learning libraries [146, 97, 127, 113].
Because of the increased usage of machine learning in companies, HPO is
also of substantial commercial interest and plays an ever larger role there, be it
in company-internal tools [42], as part of machine learning cloud services [86, 5],
or as a service by itself [134].
HPO faces several challenges which make it a hard problem in practice:
• Function evaluations can be extremely expensive for large models (e.g., in deep learning), complex machine learning pipelines, or large datasets.
• The configuration space is often complex (comprising a mix of continuous, categorical and conditional hyperparameters) and high-dimensional. Furthermore, it is not always clear which of an algorithm’s hyperparameters need to be optimized, and in which ranges.
• We usually don’t have access to a gradient of the loss function with respect to the hyperparameters. Furthermore, other properties of the target function often used in classical optimization do not typically apply, such as convexity and smoothness.
• One cannot directly optimize for generalization performance as training
datasets are of limited size.
We refer the interested reader to other reviews of HPO for further discussions
on this topic [61, 91].
This chapter is structured as follows. First, we define the HPO problem formally and discuss its variants (Section 1.2). Then, we discuss blackbox optimization algorithms for solving HPO (Section 1.3). Next, we focus on modern multi-fidelity methods that enable the use of HPO even for very expensive models, by exploiting approximate performance measures that are cheaper than full model evaluations (Section 1.4). We then provide an overview of the most important hyperparameter optimization systems and applications to AutoML (Section 1.5) and end the chapter with a discussion of open problems (Section 1.6).
1.2 Problem Statement
Let A denote a machine learning algorithm with N hyperparameters. We denote the domain of the n-th hyperparameter by Λ_n and the overall hyperparameter configuration space as Λ = Λ_1 × Λ_2 × ⋯ × Λ_N. A vector of hyperparameters is denoted by λ ∈ Λ, and A with its hyperparameters instantiated to λ is denoted by A_λ.
The domain of a hyperparameter can be real-valued (e.g., learning rate), integer-valued (e.g., number of layers), binary (e.g., whether to use early stopping or not), or categorical (e.g., choice of optimizer). For integer and real-valued hyperparameters, the domains are mostly bounded for practical reasons, with only a few exceptions [10, 133, 110].
Furthermore, the configuration space can contain conditionality, i.e., a hyperparameter may only be relevant if another hyperparameter (or some combination of hyperparameters) takes on a certain value. Conditional spaces take the form of directed acyclic graphs. Such conditional spaces occur, e.g., in the automated tuning of machine learning pipelines, where the choice between different preprocessing and machine learning algorithms is modeled as a categorical hyperparameter, a problem known as Full Model Selection (FMS) or Combined Algorithm Selection and Hyperparameter optimization (CASH) [28, 146, 80, 32]. They also occur when optimizing the architecture of a neural network: e.g., the number of layers can be an integer hyperparameter and the per-layer hyperparameters of layer i are only active if the network depth is at least i [10, 12, 31].
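Such a conditional space can be written down directly. The sketch below (plain Python with made-up hyperparameter names and ranges; it is not the API of any particular library) encodes child hyperparameters under their (parent, activating value) pair and samples only the active ones:

```python
import math
import random

# Top-level categorical choice; its value conditions which children are active.
SPACE = {
    "classifier": ["svm", "random_forest"],
    # Children keyed by (parent name, activating parent value).
    "children": {
        ("classifier", "svm"): {
            "C": ("log_uniform", 1e-3, 1e3),
            "kernel": ("choice", ["rbf", "linear"]),
        },
        ("classifier", "random_forest"): {
            "n_estimators": ("int_uniform", 10, 500),
        },
    },
}

def sample_hp(spec):
    """Draw one value from a hyperparameter domain specification."""
    kind = spec[0]
    if kind == "choice":
        return random.choice(spec[1])
    if kind == "log_uniform":  # uniform on a log scale, common for e.g. C
        return 10 ** random.uniform(math.log10(spec[1]), math.log10(spec[2]))
    if kind == "int_uniform":
        return random.randint(spec[1], spec[2])
    raise ValueError(kind)

def sample_configuration(space):
    """Sample a random configuration, activating children conditionally."""
    cfg = {"classifier": random.choice(space["classifier"])}
    active = space["children"].get(("classifier", cfg["classifier"]), {})
    for name, spec in active.items():
        cfg[name] = sample_hp(spec)
    return cfg
```

A sampled configuration thus never contains inactive hyperparameters, which is exactly the DAG structure described above.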
Given a data set D, our goal is to find

λ* = argmin_{λ∈Λ} 𝔼_{(D_train, D_valid)∼𝒟} V(L, A_λ, D_train, D_valid),    (1.1)

where V(L, A_λ, D_train, D_valid) measures the loss of a model generated by algorithm A with hyperparameters λ on training data D_train and evaluated on validation data D_valid. In practice, we only have access to finite data D ∼ 𝒟 and thus need to approximate the expectation in Equation 1.1.
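As a concrete (and deliberately toy) instance of Equation 1.1, the following sketch uses a one-hyperparameter "learning algorithm" (a shrinkage estimator that predicts the training mean shrunk toward zero by a hyperparameter alpha) and minimizes its holdout validation loss V over a grid of candidate values; all names and numbers are illustrative, not from the chapter:

```python
def train(alpha, train_ys):
    """Toy algorithm A_lambda: predict the training mean, shrunk toward 0."""
    mean = sum(train_ys) / len(train_ys)
    return (1.0 - alpha) * mean  # the "model" is a single constant prediction

def validation_loss(alpha, train_ys, valid_ys):
    """V(L, A_lambda, D_train, D_valid) with L = mean squared error."""
    pred = train(alpha, train_ys)
    return sum((y - pred) ** 2 for y in valid_ys) / len(valid_ys)

# Approximate the argmin in Equation 1.1 by grid search over alpha.
train_ys = [2.1, 1.9, 2.0, 2.2]
valid_ys = [1.8, 2.0, 2.1]
grid = [i / 10 for i in range(11)]  # alpha in {0.0, 0.1, ..., 1.0}
best_alpha = min(grid, key=lambda a: validation_loss(a, train_ys, valid_ys))
```

Real HPO replaces the toy model by an expensive training run and the naive grid by the search strategies discussed in Section 1.3.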
Popular choices for the validation protocol V(·, ·, ·, ·) are the holdout and
cross-validation error for a user-given loss function (such as misclassification
rate); see Bischl et al. [14] for an overview of validation protocols. Several
strategies for reducing the evaluation time have been proposed: It is possible
to only test machine learning algorithms on a subset of folds [146], only on
a subset of data [99, 144, 75], or for a small amount of iterations; we will
discuss some of these strategies in more detail in Section 1.4. Recent work on
multi-task [144] and multi-source [118] optimization introduced further cheap,
auxiliary tasks, which can be queried instead of Equation 1.1. These can provide
cheap information to help HPO, but do not necessarily train a machine learning
model on the dataset of interest and therefore do not yield a usable model as a
side product.
1.2.1 Alternatives to Optimization: Ensembling and Marginalization
Solving Equation 1.1 with one of the techniques described in the rest of this
chapter usually requires fitting the machine learning algorithm A with multiple
hyperparameter vectors λ_t. Instead of using the argmin operator over these,
it is possible to either construct an ensemble (which aims to minimize the loss
for a given validation protocol) or to integrate out all the hyperparameters (if
the model under consideration is a probabilistic model). We refer to Guyon et
al. [47] and the references therein for a comparison of frequentist and Bayesian
model selection.
Only choosing a single hyperparameter configuration can be wasteful when many good configurations have been identified by HPO, and combining them in an ensemble can improve performance [106]. This is particularly useful in AutoML systems with a large configuration space (e.g., in FMS or CASH), where good configurations can be very diverse, which increases the potential gains from ensembling [29, 17, 32, 4]. To further improve performance, Automatic
Frankensteining [152] uses HPO to train a stacking model [153] on the outputs
of the models found with HPO; the second-level models are then combined using a traditional ensembling strategy.
The methods discussed so far applied ensembling after the HPO procedure.
While they improve performance in practice, the base models are not optimized
for ensembling. It is, however, also possible to directly optimize for models
which would maximally improve an existing ensemble [94].
Finally, when dealing with Bayesian models it is often possible to integrate
out the hyperparameters of the machine learning algorithm, for example using
evidence maximization [95], Bayesian model averaging [53], slice sampling [108]
or empirical Bayes [100].
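For a probabilistic model with parameters θ and hyperparameters λ, marginalization replaces the single argmin of Equation 1.1 by an average over the posterior; schematically (a standard Bayesian model-averaging identity, written here for a predictive distribution, not a formula from this chapter):

```latex
p(y_* \mid x_*, D) = \int p(y_* \mid x_*, \theta, \lambda)\, p(\theta, \lambda \mid D)\, \mathrm{d}\theta\, \mathrm{d}\lambda
```

Evidence maximization (empirical Bayes) instead picks the λ maximizing the marginal likelihood p(D | λ) and only integrates over θ, while slice sampling approximates the full integral over λ by Monte Carlo samples.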
1.2.2 Optimizing for Multiple Objectives
In practical applications it is often necessary to trade off two or more objectives,
such as the performance of a model and resource consumption [62] (see also
Chapter 3) or multiple loss functions [54]. Potential solutions can be obtained
in two ways.
First, if a limit on a secondary performance measure is known (such as the maximal memory consumption), the problem can be formulated as a constrained optimization problem. We will discuss constraint handling in Bayesian optimization in Section 1.3.2.
Second, and more generally, one can apply multi-objective optimization to
search for the Pareto front, a set of configurations which are optimal tradeoffs
between the objectives in the sense that, for each configuration on the Pareto
front, there is no other configuration which performs better for at least one and
at least as well for all other objectives. The user can then choose a configuration
from the Pareto front. We refer the interested reader to further literature on
this topic [62, 131, 50, 54].
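The Pareto-front definition above translates directly into code. A minimal sketch, assuming two objectives that are both minimized (say, validation error and memory consumption in GB; the numbers are made up):

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    A point p dominates q if p is no worse in every objective and strictly
    better in at least one (all objectives are minimized here)."""
    def dominates(p, q):
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (validation error, memory in GB) of four hypothetical configurations.
configs = [(0.10, 4.0), (0.12, 1.0), (0.10, 2.0), (0.20, 3.0)]
front = pareto_front(configs)  # (0.10, 4.0) and (0.20, 3.0) are dominated
```

The user would then pick one configuration from `front` according to their preferred tradeoff between the two objectives.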