Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
Lisha Li [email protected]
Carnegie Mellon University, Pittsburgh, PA 15213
Kevin Jamieson [email protected]
University of Washington, Seattle, WA 98195
Giulia DeSalvo [email protected]
Abstract
Performance of machine learning algorithms depends critically on identifying a good set of
hyperparameters. While recent approaches use Bayesian optimization to adaptively select
configurations, we focus on speeding up random search through adaptive resource allocation
and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-
stochastic infinite-armed bandit problem where a predefined resource like iterations, data
samples, or features is allocated to randomly sampled configurations. We introduce a novel
algorithm, Hyperband, for this framework and analyze its theoretical properties, providing
several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian
optimization methods on a suite of hyperparameter optimization problems. We observe
that Hyperband can provide over an order-of-magnitude speedup over our competitor set
on a variety of deep-learning and kernel-based learning problems.
Keywords: hyperparameter optimization, model selection, infinite-armed bandits, online
optimization, deep learning
1. Introduction
In recent years, machine learning models have exploded in complexity and expressibility at the
price of staggering computational costs. Moreover, the growing number of tuning parameters
associated with these models is difficult to set by standard optimization techniques. These
“hyperparameters” are inputs to a machine learning algorithm that govern how the algorithm’s
performance generalizes to new, unseen data; examples of hyperparameters include those
that impact model architecture, amount of regularization, and learning rates. The quality of
a predictive model critically depends on its hyperparameter configuration, but it is poorly
understood how these hyperparameters interact with each other to affect the resulting model.
Figure 1: (a) The heatmap shows the validation error over a two-dimensional search space
with red corresponding to areas with lower validation error. Configuration selection
methods adaptively choose new configurations to train, proceeding in a sequential
manner as indicated by the numbers. (b) The plot shows the validation error as
a function of the resources allocated to each configuration (i.e. each line in the
plot). Configuration evaluation methods allocate more resources to promising
configurations.
Consequently, practitioners often default to brute-force methods like random search and
grid search (Bergstra and Bengio, 2012).
2. Related Work
In Section 1, we briefly discussed related work in the hyperparameter optimization literature.
Here, we provide a more thorough coverage of the prior work, and also summarize significant
related work on bandit problems.
drastically suffer when they are violated. Krueger et al. (2015) proposed a heuristic based
on sequential analysis to determine stopping times for training configurations on increasing
subsets of the data. However, the theoretical correctness and empirical performance of this
method are highly dependent on a user-defined “safety zone.”
Several hybrid methods combining adaptive configuration selection and evaluation have
also been introduced (Swersky et al., 2013, 2014; Domhan et al., 2015; Kandasamy et al.,
2016; Klein et al., 2017a; Golovin et al., 2017). The algorithm proposed by Swersky et al.
(2013) uses a GP to learn correlation between related tasks and requires the subtasks as
input, but efficient subtasks with high informativeness for the target task are unknown
without prior knowledge. Similar to the work by Swersky et al. (2013), Klein et al. (2017a)
modeled the conditional validation error as a Gaussian process using a kernel that captures
the covariance with downsampling rate to allow for adaptive evaluation. Swersky et al.
(2014), Domhan et al. (2015), and Klein et al. (2017a) made parametric assumptions on the
convergence of learning curves to perform early-stopping. In contrast, Golovin et al. (2017)
devised an early-stopping rule based on predicted performance from a nonparametric GP
model with a kernel designed to measure the similarity between performance curves. Finally,
Kandasamy et al. (2016) extended GP-UCB to allow for adaptive configuration evaluation
by defining subtasks that monotonically improve with more resources.
In another line of work, Sparks et al. (2015) proposed a halving style bandit algorithm
that did not require explicit convergence behavior, and Jamieson and Talwalkar (2015)
analyzed a similar algorithm originally proposed by Karnin et al. (2013) for a different
setting, providing theoretical guarantees and encouraging empirical results. Unfortunately,
these halving style algorithms suffer from the “n versus B/n” problem, which we will
discuss in Section 3.1. Hyperband addresses this issue and provides a robust, theoretically
principled early-stopping algorithm for hyperparameter optimization.
We note that Hyperband can be combined with any hyperparameter sampling approach
and does not depend on random sampling; the theoretical results only assume the validation
losses of sampled hyperparameter configurations are drawn from some stationary distribution.
In fact, subsequent to our submission, Klein et al. (2017b) combined adaptive configuration
selection with Hyperband by using a Bayesian neural network to model learning curves and
only selecting configurations with high predicted performance to input into Hyperband.
Section 5.3.2. However, their algorithm was derived specifically for the β-parameterization
of F , and furthermore, they must estimate β before running the algorithm, limiting the
algorithm’s practical applicability. Also, the algorithm assumes stochastic losses from the
arms and thus the convergence behavior is known; consequently, it does not apply in our
hyperparameter optimization setting. Two related lines of work that both make use of an
underlying metric space are Gaussian process optimization (Srinivas et al., 2010) and
X-armed bandits (Bubeck et al., 2011), or bandits defined over a metric space. However, these
works either assume stochastic rewards or need to know something about the underlying
function (e.g. an appropriate kernel or level of smoothness).
In contrast, Hyperband is devised for the non-stochastic setting and automatically
adapts to unknown F without making any parametric assumptions. Hence, we believe our
work to be a generally applicable pure exploration algorithm for infinite-armed bandits. To
the best of our knowledge, this is also the first work to test out such an algorithm on a real
application.
3. Hyperband Algorithm
In this section, we present the Hyperband algorithm. We provide intuition for the algorithm,
highlight the main ideas via a simple example that uses iterations as the adaptively allocated
resource, and present a few guidelines on how to deploy Hyperband in practice.
Figure 2: The validation loss as a function of total resources allocated for two configurations
is shown. ν1 and ν2 represent the terminal validation losses at convergence. The
shaded areas bound the maximum distance of the intermediate losses from the
terminal validation loss and monotonically decrease with the resource.
terminal losses. There are two takeaways from this observation: more resources are needed
to differentiate between the two configurations when either (1) the envelope functions are
wider or (2) the terminal losses are closer together.
However, in practice, the optimal allocation strategy is unknown because we do not
have knowledge of the envelope functions nor the distribution of terminal losses. Hence, if
more resources are required before configurations can differentiate themselves in terms of
quality (e.g., if an iterative training method converges very slowly for a given data set or if
randomly selected hyperparameter configurations perform similarly well), then it would be
reasonable to work with a small number of configurations. In contrast, if the quality of a
configuration is typically revealed after a small number of resources (e.g., if iterative training
methods converge very quickly for a given data set or if randomly selected hyperparameter
configurations are of low-quality with high probability), then n is the bottleneck and we
should choose n to be large.
Certainly, if meta-data or previous experience suggests that a certain tradeoff is likely
to work well in practice, one should exploit that information and allocate the majority of
resources to that tradeoff. However, without this supplementary information, practitioners
are forced to make this tradeoff, severely hindering the applicability of existing configuration
evaluation methods.
3.2 Hyperband
Hyperband, shown in Algorithm 1, addresses this “n versus B/n” problem by considering
several possible values of n for a fixed B, in essence performing a grid search over feasible
value of n. Associated with each value of n is a minimum resource r that is allocated to all
configurations before some are discarded; a larger value of n corresponds to a smaller r and
hence more aggressive early-stopping. There are two components to Hyperband: (1) the
inner loop invokes SuccessiveHalving for fixed values of n and r (lines 3–9), and (2) the
outer loop iterates over different values of n and r (lines 1–2). We will refer to each such
run of SuccessiveHalving within Hyperband as a “bracket.” Each bracket is designed
to use approximately B total resources and corresponds to a different tradeoff between n
and B/n. Hence, a single execution of Hyperband takes a finite budget of (smax + 1)B;
we recommend repeating it indefinitely.
Hyperband requires two inputs: (1) R, the maximum amount of resource that can
be allocated to a single configuration, and (2) η, an input that controls the proportion of
configurations discarded in each round of SuccessiveHalving. The two inputs dictate
how many different brackets are considered; specifically, smax + 1 different values for n are
considered, with smax = ⌊logη(R)⌋. Hyperband begins with the most aggressive bracket
s = smax , which sets n to maximize exploration, subject to the constraint that at least one
configuration is allocated R resources. Each subsequent bracket reduces n by a factor of
approximately η until the final bracket, s = 0, in which every configuration is allocated
R resources (this bracket simply performs classical random search). Hence, Hyperband
performs a geometric search in the average budget per configuration and removes the need
to select n for a fixed budget at the cost of approximately smax + 1 times more work than
running SuccessiveHalving for a single value of n. By doing so, Hyperband is able to
exploit situations in which adaptive allocation works well, while protecting itself in situations
where more conservative allocations are required.
Hyperband requires the following methods to be defined for any given learning problem:
Table 1: The values of ni and ri for the brackets of Hyperband corresponding to various
values of s, when R = 81 and η = 3.

            s = 4        s = 3        s = 2        s = 1        s = 0
    i     ni    ri     ni    ri     ni    ri     ni    ri     ni    ri
    0     81     1     34     3     15     9      8    27      5    81
    1     27     3     11     9      5    27      2    81
    2      9     9      3    27      1    81
    3      3    27      1    81
    4      1    81
• run_then_return_val_loss(t, r) – a function that takes a hyperparameter configuration t and resource allocation r as input and returns the validation loss after training the configuration for the allocated resources.
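Putting these pieces together, the full procedure can be sketched in a few lines of Python. This is a minimal sketch of the procedure described in Section 3.2 rather than a verbatim transcription of Algorithm 1; it assumes a helper get_hyperparameter_configuration(n) that returns n configurations sampled from the search space, alongside run_then_return_val_loss defined above, and it returns the configuration with the smallest intermediate loss seen so far.

    import math

    def hyperband(get_hyperparameter_configuration, run_then_return_val_loss, R, eta=3):
        # s_max + 1 brackets; integer arithmetic avoids floating-point log issues
        s_max = 0
        while eta ** (s_max + 1) <= R:
            s_max += 1
        B = (s_max + 1) * R                     # approximate budget per bracket
        best_config, best_loss = None, float("inf")
        for s in range(s_max, -1, -1):          # outer loop: one bracket per value of s
            n = math.ceil((B / R) * eta ** s / (s + 1))   # initial number of configurations
            r = R / eta ** s                              # initial resource per configuration
            T = get_hyperparameter_configuration(n)
            for i in range(s + 1):              # inner loop: SuccessiveHalving
                n_i = n // eta ** i             # configurations alive this round
                r_i = r * eta ** i              # resource allocated to each of them
                losses = [run_then_return_val_loss(t, r_i) for t in T]
                for t, loss in zip(T, losses):  # track smallest intermediate loss seen
                    if loss < best_loss:
                        best_config, best_loss = t, loss
                order = sorted(range(len(T)), key=lambda j: losses[j])
                T = [T[j] for j in order[: n_i // eta]]   # keep the best 1/eta fraction
        return best_config, best_loss

With R = 81 and η = 3, this schedule reproduces the five brackets of Table 1 at a total cost of (smax + 1)B.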
3.5 Setting R
The resource R and η (which we address next) are the only required inputs to Hyperband.
As mentioned in Section 3.2, R represents the maximum amount of resources that can be
allocated to any given configuration. In most cases, there is a natural upper bound on
the maximum budget per configuration that is often dictated by the resource type (e.g.,
training set size for data set downsampling; limitations based on memory constraint for
feature downsampling; rule of thumb regarding number of epochs when iteratively training
neural networks). If there is a range of possible values for R, a smaller R will give a result
faster (since the budget B for each bracket is a multiple of R), but a larger R will give a
better guarantee of successfully differentiating between the configurations.
Moreover, for settings in which either R is unknown or not desired, we provide an
infinite horizon version of Hyperband in Section 5. This version of the algorithm doubles
10
Bandit-Based Approach to Hyperparameter Optimization
the budget over time, B ∈ {2, 4, 8, 16, . . .}, and for each B, tries all possible values of
n ∈ {2^k : k ∈ {1, . . . , ⌊log2(B)⌋}}. For each combination of B and n, the algorithm runs an instance of SuccessiveHalving.
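A minimal sketch of this doubling scheme, assuming a successive_halving(n, budget) subroutine (the name and signature here are illustrative):

    import itertools

    def infinite_horizon_outer_loop(successive_halving):
        # Runs forever by design; interrupt externally once satisfied.
        for k in itertools.count(1):          # budgets B = 2, 4, 8, 16, ...
            B = 2 ** k
            for j in range(1, k + 1):         # all n = 2^j with j <= log2(B) = k
                successive_halving(n=2 ** j, budget=B)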
3.6 Setting η
The value of η is a knob that can be tuned based on practical user constraints. Larger
values of η correspond to more aggressive elimination schedules and thus fewer rounds of
elimination; specifically, each round retains a 1/η fraction of the configurations, for a total
of ⌊logη(n)⌋ + 1 rounds of elimination with n configurations. If one wishes to receive a result
faster at the cost of a sub-optimal asymptotic constant, one can increase η to reduce the
budget per bracket, B = (⌊logη(R)⌋ + 1)R. We stress that results are not very sensitive to the choice of η.
If our theoretical bounds are optimized (see Section 5), they suggest choosing η = e ≈ 2.718,
but in practice we suggest taking η to be equal to 3 or 4.
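For example, with R = 81, choosing η = 3 gives a per-bracket budget of B = (⌊log3(81)⌋ + 1)R = 5R, while η = 4 gives B = (⌊log4(81)⌋ + 1)R = 4R, so the more aggressive choice of η returns its first result sooner.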
Tuning η will also change the number of brackets and consequently the number of different
tradeoffs that Hyperband tries. Usually, the possible range of brackets is fairly constrained,
since the number of brackets is logarithmic in R; namely, there are ⌊logη(R)⌋ + 1 = smax + 1
brackets. For our experiments in Section 4, we chose η to provide 5 brackets for the specified
R; for most problems, 5 is a reasonable number of n versus B/n tradeoffs to explore. However,
for large R, using η = 3 or 4 can give more brackets than desired. The number of brackets
can be controlled in a few ways. First, as mentioned in the previous section, if R is too
large and overhead is an issue, then one may want to control the overhead by limiting the
maximum number of configurations to nmax , thereby also limiting smax . If overhead is not a
concern and aggressive exploration is desired, one can (1) increase η to reduce the number
of brackets while maintaining R as the maximum number of configurations in the most
exploratory bracket, or (2) still use η = 3 or 4 but only try brackets that do a baseline
level of exploration, i.e., set nmin and only try brackets from smax to s = ⌊logη(nmin)⌋. For
computationally intensive problems that have long training times and high-dimensional
search spaces, we recommend the latter. Intuitively, if the number of configurations that can
be trained to completion (i.e., trained using R resources) in a reasonable amount of time is
on the order of the dimension of the search space and not exponential in the dimension, then
6. This is not done for the experiments in Section 4.2.1, since the most aggressive bracket varies from dataset
to dataset with the number of training points.
all speedups, we consider random search and “random 2×,” a variant of random search with
twice the budget of other methods. Of the hybrid methods described in Section 2, we compare
to a variant of SMAC using the early termination criterion proposed by Domhan et al.
(2015) in the deep learning experiments described in Section 4.1. We think a comparison
of Hyperband to more sophisticated hybrid methods introduced recently by Klein et al.
(2017a) and Kandasamy et al. (2017) is a fruitful direction for future work.
In the experiments below, we followed these loose guidelines when determining how to
configure Hyperband:
1. The maximum resource R should be reasonable given the problem, but ideally large
enough so that early-stopping is beneficial.
(Figure 4 plots: average test error versus multiple of R used on panels (a) CIFAR-10, (b) MRBI, and (c) SVHN, for hyperband (finite), hyperband (infinite), SMAC, SMAC (early), spearmint, random, random 2x, and bracket s = 4.)
Figure 4: Average test error across 10 trials. Label “SMAC (early)” corresponds to SMAC
with the early-stopping criterion proposed in Domhan et al. (2015) and label
“bracket s = 4” corresponds to repeating the most exploratory bracket of Hyperband.
8. Most trials were run on Amazon EC2 g2.8xlarge instances but a few trials were run on different machines
due to the large computational demand of these experiments.
magnitude faster than standard configuration selection approaches and 5× faster than
SMAC (early). For SVHN, while Hyperband finds a good configuration faster, Bayesian
optimization methods are competitive and SMAC (early) outperforms Hyperband. The
performance of SMAC (early) demonstrates there is merit to combining early-stopping and
adaptive configuration selection.
Across the three data sets, Hyperband and SMAC (early) are the only two methods
that consistently outperform random 2×. On these data sets, Hyperband is over 20×
faster than random search while SMAC (early) is ≤ 7× faster than random search within
the evaluation window. In fact, the first result returned by Hyperband after using a
budget of 5R is often competitive with results returned by other searchers after using 50R.
Additionally, Hyperband is less variable than other searchers across trials, which is highly
desirable in practice (see Appendix A for plots with error bars).
As discussed in Section 3.6, for computationally expensive problems in high-dimensional
search spaces, it may make sense to just repeat the most exploratory brackets. Similarly, if
meta-data is available about a problem or it is known that the quality of a configuration is
evident after allocating a small amount of resource, then one should just repeat the most
exploratory bracket. Indeed, for these experiments, bracket s = 4 vastly outperforms all
other methods on CIFAR-10 and MRBI and is nearly tied with SMAC (early) for first on
SVHN.
While we set R for these experiments to facilitate comparison to Bayesian methods
and random search, it is also reasonable to use infinite horizon Hyperband to grow the
maximum resource until a desired level of performance is reached. We evaluate infinite
horizon Hyperband on CIFAR-10 using η = 4 and a starting budget of B = 2R. Figure 4(a)
shows that infinite horizon Hyperband is competitive with other methods but does not
perform as well as finite horizon Hyperband within the 50R budget limit. The infinite
horizon algorithm underperforms initially because it has to tune the maximum resource R as
well and starts with a less aggressive early-stopping rate. This demonstrates that in scenarios
where a max resource is known, it is better to use the finite horizon algorithm. Hence, we
focus on the finite horizon version of Hyperband for the remainder of our empirical studies.
Finally, CIFAR-10 is a very popular data set and state-of-the-art models achieve much
lower error rates than what is shown in Figure 4. The difference in performance is mainly
attributable to higher model complexities and data manipulation (i.e. using reflection or
random cropping to artificially increase the data set size). If we limit the comparison to
published results that use the same architecture and exclude data manipulation, the best
human expert result for the data set is 18% error and the best hyperparameter optimized
results are 15.0% for Snoek et al. (2012) and 17.2% for Domhan et al. (2015). These results
exceed ours on CIFAR-10 because they train on 25% more data, by including the validation
set, and also train for more epochs. When we train the best model found by Hyperband on
the combined training and validation data for 300 epochs, the model achieved a test error of
17.0%.
9. We were unable to reproduce this result even after receiving the optimal hyperparameters from the
authors through a personal communication.
(Figure 5 plots: average rank versus time in seconds for SMAC, TPE, hyperband, random, and random 2x, with panels (a) Validation Error on 117 Data Sets, (b) Test Error on 117 Data Sets, and (c) Test Error on 21 Data Sets.)
Figure 5: Average rank across all data sets for each searcher. For each data set, the searchers
are ranked according to the average validation/test error across 20 trials.
high relative to total training time, while for larger data sets, only a handful of configurations
could be trained within the hour window.
We note that while average rank plots like those in Figure 5 are an effective way to
aggregate information across many searchers and data sets, they provide no indication about
the magnitude of the differences between the performance of the methods. Figure 6, which
charts the difference between the test error for each searcher and that of random search
across all 117 datasets, highlights the small difference in the magnitude of the test errors
across searchers.
These results are not surprising; as mentioned in Section 2.1, vanilla Bayesian optimization
methods perform similarly to random search in high-dimensional search spaces. Feurer
et al. (2015) showed that using meta-learning to warmstart Bayesian optimization methods
improved performance in this high-dimensional setting. Using meta-learning to identify a
Figure 6: Each line plots, for a single data set, the difference in test error versus random
search for each searcher, where lower is better. Nearly all the lines fall within the
−0.5% to 0.5% band and, with the exception of a few outliers, the lines are mostly
flat.
10. The default SVM method in Scikit-learn is single core and takes hours to train on CIFAR-10, whereas a
block coordinate descent least squares solver takes less than 10 minutes on an 8 core machine.
(Figures 7 and 8 plots: test error versus minutes on CIFAR-10 for each searcher, namely hyperband (finite), SMAC, TPE, random, random 2x, bracket s = 4, and, in Figure 8, spearmint.)

Figure 7: Average test error of the best kernel regularized least square classification model found by each searcher on CIFAR-10. The color coded dashed lines indicate when the last trial of a given searcher finished.

Figure 8: Average test error of the best random features model found by each searcher on CIFAR-10. The test error for Hyperband and bracket s = 4 is calculated in every evaluation instead of at the end of a bracket.
performance for the two algorithms is the same. Random 2× is competitive with SMAC
and TPE.
1. How training time scales with the given resource. In cases where training time is
superlinear as a function of the resource, Hyperband can offer higher speedups. For
instance, if training scales like a polynomial of degree p > 1, the maximum speedup for
Hyperband over random search is approximately ((η^{p−1} − 1)/η^{p−1}) R. In the kernel least square
classifier experiment discussed in Section 4.2.2, the training time scaled quadratically
as a function of the resource, which explains why the realized speedup of 70× is higher
than the maximum expected speedup given linear scaling.
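To see where this factor comes from, here is a back-of-the-envelope calculation of ours, under the assumption that round i of the most exploratory bracket trains η^{s−i} configurations for ri = Rη^{i−s} resources each, at cost ri^p per configuration:

    % ratio of random search cost to SuccessiveHalving cost (geometric series)
    \frac{n R^p}{R^p \sum_{i=0}^{s} \eta^{-(p-1)(s-i)}}
      \approx n \left(1 - \eta^{-(p-1)}\right)
      = \frac{\eta^{p-1} - 1}{\eta^{p-1}} \, n ,

with n ≈ R in the most exploratory bracket.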
2. Overhead costs associated with training. Total evaluation time also depends on fixed
overhead costs associated with evaluating each hyperparameter configuration, e.g.,
initializing a model, resuming previously trained models, and calculating validation
error. For example, in the downsampling experiments on 117 data sets presented in
Section 4.2.1, Hyperband did not provide significant speedup because many data
sets could be trained in a matter of a few seconds and the initialization cost was high
relative to training time.
only 6× faster than random search. In contrast, for the neural network experiments in
Section 4.1, we hypothesize that faster speedups are observed for Hyperband because
the dimension of the search space is higher.
With the exception of the LeNet experiment (Section 3.3) and the 117 Datasets experi-
ment (Section 4.2.1), the most aggressive bracket of SuccessiveHalving outperformed
Hyperband in all of our experiments. In hindsight, we should have just run bracket s = 4,
since aggressive early-stopping provides massive speedups on many of these benchmarking
tasks. However, as previously mentioned, it was unknown a priori that bracket s = 4
would perform the best and that is why we have to cycle through all possible brackets
with Hyperband. Another question is what happens when one increases s even further,
i.e. instead of 4 rounds of elimination, why not 5 or even more with the same maximum
resource R? In our case, s = 4 was the most aggressive bracket we could run given the
minimum resource per configuration limits imposed for the previous experiments. However,
for larger data sets, it is possible to extend the range of possible values for s, in which case,
Hyperband may either provide even faster speedups if more aggressive early-stopping helps
or be slower by a small factor if the most aggressive brackets are essentially throwaways.
We believe prior knowledge about a task can be particularly useful for limiting the
range of brackets explored by Hyperband. In our experience, aggressive early-stopping
is generally safe for neural network tasks and even more aggressive early-stopping may be
reasonable for larger data sets and longer training horizons. However, when pushing the
degree of early-stopping by increasing s, one has to consider the additional overhead cost
associated with examining more models. Hence, one way to leverage meta-learning would
be to use learning curve convergence rate, difficulty of different search spaces, and overhead
costs of related tasks to determine the brackets considered by Hyperband.
In certain cases, the setting for a given hyperparameter should depend on the allocated
resource. For example, the maximum tree depth regularization hyperparameter for random
forests should be higher with more data and more features. However, the optimal tradeoff
between maximum tree depth and the resource is unknown and can be data set specific.
In these situations, the rate of convergence to the true loss is usually slow because the
performance on a smaller resource is not indicative of that on a larger resource. Hence,
these problems are particularly difficult for Hyperband, since the benefit of early-stopping
can be muted. Again, while Hyperband will only be a small factor slower than that
of SuccessiveHalving with the optimal early-stopping rate, we recommend removing
the dependence of the hyperparameter on the resource if possible. For the random forest
example, an alternative regularization hyperparameter is minimum samples per leaf, which
is less dependent on the training set size. Additionally, the dependence can oftentimes be
removed with simple normalization. For example, the regularization term for our kernel
least squares experiments was normalized by the training set size to maintain a constant
tradeoff between the mean-squared error and the regularization term.
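As a concrete illustration (our notation; a generic regularized least squares objective rather than the exact experimental setup), scaling the penalty with the number of training examples n keeps the per-example tradeoff fixed as the resource grows:

    \min_{w} \sum_{i=1}^{n} \left( y_i - w^\top \phi(x_i) \right)^2 + n \lambda \lVert w \rVert^2
    \quad\Longleftrightarrow\quad
    \min_{w} \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w^\top \phi(x_i) \right)^2 + \lambda \lVert w \rVert^2 .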
5. Theory
Let X denote the space of valid hyperparameter configurations, which could include contin-
uous, discrete, or categorical variables that can be constrained with respect to each other
in arbitrary ways (i.e. X need not be limited to a subset of [0, 1]^d). For k = 1, 2, . . . let
`k : X → [0, 1] be a sequence of loss functions defined over X . For any hyperparameter
configuration x ∈ X , `k (x) represents the validation error of the model trained using x with k
units of resources (e.g. iterations). In addition, for some R ∈ N ∪ {∞}, define `∗ = limk→R `k
and ν∗ = inf x∈X `∗ (x). Note that `k (·) for all k ∈ N, `∗ (·), and ν∗ are all unknown to the
algorithm a priori. In particular, it is uncertain how quickly `k (x) varies as a function of x
for any fixed k, and how quickly `k (x) → `∗ (x) as a function of k for any fixed x ∈ X .
We assume hyperparameter configurations are sampled randomly from a known probabil-
ity distribution p(x) : X → [0, ∞), with support on X . In our experiments, p(x) is simply the
uniform distribution, but the algorithm can be used with any sampling method. If X ∈ X
is a random sample from this probability distribution, then `∗ (X) is a random variable
whose distribution is unknown since `∗ (·) is unknown. Additionally, since it is unknown how
`k (x) varies as a function of x or k, one cannot necessarily infer anything about `k (x) given
knowledge of `j (y) for any j ∈ N, y ∈ X . As a consequence, we reduce the hyperparameter
optimization problem down to a much simpler problem that ignores all underlying structure
of the hyperparameters: we only interact with some x ∈ X through its loss sequence `k (x)
for k = 1, 2, . . . . With this reduction, the particular value of x ∈ X does nothing more than
index or uniquely identify the loss sequence.
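The bandit game defined next interacts with configurations in exactly this way. As a toy instance (purely illustrative: the uniform sampler and the 1/k envelope below are our assumptions, not part of the formal setup):

    import random

    class ToyNIAB:
        # Illustrative non-stochastic infinite-armed bandit instance.
        def __init__(self, seed=0):
            self.rng = random.Random(seed)
            self.arms = []                  # per arm: [terminal loss nu_i, pull count k_i]

        def draw_arm(self):
            nu = self.rng.random()          # nu_i drawn i.i.d. (uniform here, so F(x) = x)
            self.arms.append([nu, 0])
            return len(self.arms) - 1       # arms are identified only by their index

        def pull(self, i):
            self.arms[i][1] += 1
            nu, k = self.arms[i]
            return nu + 1.0 / k             # loss l_{i,k} -> nu_i with envelope gamma(j) = 1/j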
Without knowledge of how fast `k (·) → `∗ (·) or how `∗ (X) is distributed, the goal of
Hyperband is to identify a hyperparameter configuration x ∈ X that minimizes `∗ (x) − ν∗
by drawing as many random configurations as desired while using as few total resources as
possible.
We now formally define the bandit problem of interest, and relate it to the problem of
hyperparameter optimization. Each “arm” in the NIAB game is associated with a sequence
that is drawn randomly from a distribution over sequences. If we “pull” the ith drawn arm
exactly k times, we observe a loss `i,k . At each time, the player can either draw a new arm
(sequence) or pull a previously drawn arm an additional time. There is no limit on the
number of arms that can be drawn. We assume the arms are identifiable only by their index
i (i.e. we have no side-knowledge or feature representation of an arm), and we also make the
following two additional assumptions:
Assumption 1 For each i ∈ N the limit limk→∞ `i,k exists and is equal to νi.11

Assumption 2 The limits νi are i.i.d. samples from a fixed cumulative distribution function F, i.e., P(νi ≤ x) = F(x).
The objective of the NIAB problem is to identify an arm ı̂ with small νı̂ using as few total
pulls as possible. We are interested in characterizing νı̂ as a function of the total number of
pulls from all the arms. Clearly, the hyperparameter optimization problem described above
is an instance of the NIAB problem. In this case, arm i corresponds to a configuration
xi ∈ X , with `i,k = `k (xi ); Assumption 1 is equivalent to requiring that νi = `∗ (xi ) exists;
and Assumption 2 follows from the fact that the arms are drawn i.i.d. from X according
to distribution function p(x). F is simply the cumulative distribution function of `∗ (X),
where X is a random variable drawn from the distribution p(x) over X . Note that since the
arm draws are independent, the νi ’s are also independent. Again, this is not to say that the
validation losses do not depend on the settings of the hyperparameters; the validation loss
could well be correlated with certain hyperparameters, but this is not used in the algorithm
and no assumptions are made regarding the correlation structure.
In order to analyze the behavior of Hyperband in the NIAB setting, we must define
a few additional objects. Let ν∗ = inf{m : P(ν ≤ m) > 0} > −∞, which is well defined since the domain of the
distribution F is bounded. Hence, the cumulative distribution function F satisfies
F(ν∗ + ε) > 0 for all ε > 0, and let F −1(y) = inf{x : F(x) ≥ y}. Define γ : N → R as the pointwise smallest,
monotonically decreasing function satisfying

    sup_i |`i,j − νi| ≤ γ(j) ,  ∀j ∈ N.
The function γ is guaranteed to exist by Assumption 1 and bounds the deviation from
the limit value as the sequence of iterates j increases. For hyperparameter optimization,
this follows from the fact that `k converges uniformly to `∗ on X . In addition, γ
can be interpreted as the deviation of the validation error of a configuration trained on a
subset of resources versus the maximum number of allocatable resources. Finally, define
R as the first index such that γ(R) = 0 if it exists, otherwise set R = ∞. For y ≥ 0 let
γ−1(y) = min{j ∈ N : γ(j) ≤ y}, using the convention that γ−1(0) := R, which we recall can
be infinite.
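For intuition (our example): under the polynomial envelope of Section 5.3.2,

    \gamma(j) = j^{-1/\alpha} \quad\Longrightarrow\quad \gamma^{-1}(y) = \lceil y^{-\alpha} \rceil ,

so halving the accuracy target y multiplies the number of required pulls by a factor of 2^α.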
As previously discussed, there are many real-world scenarios in which R is finite and
known. For instance, if increasing subsets of the full data set is used as a resource, then the
maximum number of resources cannot exceed the full data set size, and thus γ(k) = 0 for
all k ≥ R where R is the (known) full size of the data set. In other cases such as iterative
training problems, one might not want to or know how to bound R. We separate these two
settings into the finite horizon setting where R is finite and known, and the infinite horizon
11. We can always define `i,k so that convergence is guaranteed, i.e. taking the infimum of a sequence.
Figure 9: The infinite horizon SuccessiveHalving and Hyperband algorithms. Each round of SuccessiveHalving keeps the best ⌊|Sk|/2⌋ arms in terms of the rk-th observed loss as Sk+1, and the algorithm outputs ı̂, `ı̂,⌊B/(2⌈log2(n)⌉)⌋ where ı̂ = S⌈log2(n)⌉.
setting where no bound on R is known and it is assumed to be infinite. While our empirical
results suggest that the finite horizon may be more practically relevant for the problem
of hyperparameter optimization, the infinite horizon case has natural connections to the
literature, and we begin by analyzing this setting.
Theorem 1 Fix n arms. Let νi = limτ→∞ `i,τ and assume ν1 ≤ · · · ≤ νn. For any ε > 0 let

    zSH = 2⌈log2(n)⌉ max_{i=2,...,n} i (1 + γ−1(max{ε/4, (νi − ν1)/2})).

If the SuccessiveHalving algorithm of Figure 9 is run with any budget B > zSH then an
arm ı̂ is returned that satisfies νı̂ − ν1 ≤ ε/2. Moreover, |`ı̂,⌊B/(2⌈log2(n)⌉)⌋ − ν1| ≤ ε/2.
The next technical lemma will be used to characterize the problem dependent term
Σ_{i=1,...,n} γ−1(max{ε/4, (νi − ν1)/2}) when the sequences are drawn from a probability distribution.
Setting ε = 4(F −1(pn) − ν∗) in Theorem 1 and using the result of Lemma 2 that
ν∗ ≤ ν1 ≤ ν∗ + (F −1(pn) − ν∗), we immediately obtain the following corollary.
Corollary 3 Fix δ ∈ (0, 1) and ε ≥ 4(F −1(log(2/δ)/n) − ν∗). Let B = 4⌈log2(n)⌉H(F, γ, n, δ, ε),
where H(F, γ, n, δ, ε) is defined in Lemma 2. If the SuccessiveHalving algorithm of
Figure 9 is run with the specified B and n arm configurations drawn randomly according
to F, then an arm ı̂ ∈ [n] is returned such that with probability at least 1 − δ we have
νı̂ − ν∗ ≤ F −1(log(2/δ)/n) − ν∗ + ε/2. In particular, if B = 4⌈log2(n)⌉H(F, γ, n, δ) and
ε = 4(F −1(log(2/δ)/n) − ν∗) then νı̂ − ν∗ ≤ 3(F −1(log(2/δ)/n) − ν∗) with probability at least 1 − δ.
Note that for any fixed n ∈ N we have, for any ∆ > 0,

    P( min_{i=1,...,n} νi − ν∗ ≥ ∆) = (1 − F(ν∗ + ∆))^n ≈ e^{−nF(ν∗+∆)} .
then with probability at least δ, we have νı̂ − ν∗ ≥ 2(F −1(log(c/δ)/(n + log(c/δ))) − ν∗), where c is a
constant that depends on the regularity of F.
For any fixed n and sufficiently large B, Corollary 3 shows that SuccessiveHalving
outputs an ı̂ ∈ [n] that satisfies νı̂ − ν∗ ≲ F −1(log(2/δ)/n) − ν∗ with probability at least 1 − δ.
This guarantee is similar to the result in Proposition 4. However, SuccessiveHalving
achieves its guarantee as long as

    B ≳ log2(n) [ log(1/δ) γ−1(F −1(log(1/δ)/n) − ν∗) + n ∫_{log(1/δ)/n}^{1} γ−1(F −1(t) − ν∗) dt ] ,   (4)
and this sample complexity may be substantially smaller than the budget required by uniform
allocation shown in Eq. (3) of Proposition 4. Essentially, the first term in Eq. (4) represents
the budget allocated to the constant number of arms with limits νi ≈ F −1(log(1/δ)/n), while
the second term describes the number of times the sub-optimal arms are sampled before being
discarded. The next section uses a particular parameterization for F and γ to better
illustrate the difference between the sample complexity of uniform allocation (Equation 3)
and that of SuccessiveHalving (Equation 4).
Assume the envelope function satisfies γ(j) ≃ j^{−1/α} for some α > 0 (Equation 5). Note that a large value of α implies that the convergence of `i,k → νi is very slow.
We will consider two possible parameterizations of F. First, assume there exists a positive
constant β such that

    F(x) ≃ (x − ν∗)^β if x ≥ ν∗ , and F(x) = 0 if x < ν∗ .   (6)
Here, a large value of β implies that it is very rare to draw a limit close to the optimal
value ν∗. The same model was studied in Carpentier and Valko (2015). Fix some ∆ > 0.
As discussed in the preceding section, if n = log(1/δ)/F(ν∗ + ∆) ≃ ∆^{−β} log(1/δ) arms are drawn from
F and the budgets are sufficiently large, then both methods satisfy νı̂ − ν∗ ≲ ∆ with probability at least 1 − δ.13 SuccessiveHalving's
budget scales like ∆^{−max{α,β}}, which can be significantly smaller than the uniform allocation's
budget of ∆^{−(α+β)}. However, because α and β are unknown in practice, neither method
knows how to choose the optimal n or B to achieve this ∆ accuracy. In Section 5.3.3, we
show how Hyperband addresses this issue.
The second parameterization of F is the following discrete distribution:

    F(x) = (1/K) Σ_{j=1}^{K} 1{µj ≤ x} , with ∆j := µj − µ1 ,   (7)
for some set of unique scalars µ1 < µ2 < · · · < µK . Note that by letting K → ∞ this discrete
CDF can approximate any piecewise-continuous CDF to arbitrary accuracy. In particular,
this model can have multiple means take the same value so that α mass is on µ1 and 1 − α
mass is on µ2 > µ1 , capturing the stochastic infinite-armed bandit model of Jamieson et al.
(2016). In this setting, both uniform allocation and SuccessiveHalving output a νı̂ that
is within the top log(1/δ)/n fraction of the K arms with probability at least 1 − δ if their
budgets are sufficiently large. Thus, let q > 0 be such that n ≃ q−1 log(1/δ). Then, if the
measurement budgets of the uniform allocation (Equation 3) and SuccessiveHalving
13. These quantities are intermediate results in the proofs of the theorems of Section 5.3.3.
(Equation 4) satisfy

    Uniform allocation:   B ≃ log(1/δ) ·  { K max_{j=2,...,K} ∆j^{−α}          if q = 1/K
                                          { q−1 ∆_{⌈qK⌉}^{−α}                  if q > 1/K

    SuccessiveHalving:    B ≃ log(q−1 log(1/δ)) log(1/δ) ·  { ∆2^{−α} + Σ_{j=2}^{K} ∆j^{−α}                           if q = 1/K
                                                            { ∆_{⌈qK⌉}^{−α} + (1/(qK)) Σ_{j=⌈qK⌉}^{K} ∆j^{−α}         if q > 1/K ,
an arm that is in the best q-fraction of arms is returned, i.e. ı̂/K ≈ q and νı̂ − ν∗ ≲
∆_{⌈max{2,qK}⌉}, with probability at least 1 − δ. This shows that the average resource per
arm for uniform allocation is that required to distinguish the top q-fraction from the best,
while that for SuccessiveHalving is a small multiple of the average resource required to
distinguish an arm from the best; the difference between the max and the average can be
very large in practice. We remark that the value of ε in Corollary 3 is carefully chosen to
make the SuccessiveHalving budget and guarantee work out. Also note that one would
never take q < 1/K because q = 1/K is sufficient to return the best arm.
Ek,s := {Bk,s > 4⌈log2(nk,s)⌉H(F, γ, nk,s, δk,s)} = {2^k > 4sH(F, γ, 2^s, δ/(2k^3))}

    P( ∪_{k=1}^{∞} ∪_{s=1}^{k} { νı̂k,s − ν∗ > 3(F −1(log(4k^3/δ)/2^s) − ν∗) } ∩ Ek,s ) ≤ Σ_{k=1}^{∞} Σ_{s=1}^{k} δk,s = Σ_{k=1}^{∞} δ/(2k^2) ≤ δ.
Also note that on stage k at most Σ_{i=1}^{k} iBi,1 ≤ k Σ_{i=1}^{k} Bi,1 ≤ 2kBk,s = 2 log2(Bk,s)Bk,s
total samples have been taken. While this guarantee holds for general F, γ, the value of
sB, and consequently the resulting bound, is difficult to interpret. The following corollary
considers the β, α parameterizations of F and γ, respectively, of Section 5.3.2 for better
interpretation.
Theorem 5 Assume that Assumptions 1 and 2 of Section 5.2 hold and that the sampled
loss sequences obey the parametric assumptions of Equations 5 and 6. Fix δ ∈ (0, 1). For
any T ∈ N, let ı̂T be the empirically best-performing arm output from SuccessiveHalving
from the last round k of Hyperband of Figure 9 after exhausting a total budget of T from
all rounds. Then

    νı̂T − ν∗ ≤ c ( log(T)^3 log(log(T)/δ) / T )^{1/max{α,β}}

for some constant c = exp(O(max{α, β})), where log(x) = log(x) log log(x).
By a straightforward modification of the proof, one can show that if uniform allocation
is used in place of SuccessiveHalving in Hyperband, the uniform allocation version
achieves νı̂T − ν∗ ≤ c ( log(T) log(log(T)/δ) / T )^{1/(α+β)}. We apply the above theorem to the stochastic
infinite-armed bandit setting in the following corollary.
Corollary 6 [Stochastic Infinite-armed Bandits] For any step k, s in the infinite horizon
Hyperband algorithm with nk,s arms drawn, consider the setting where the jth pull of the ith
arm results in a stochastic loss Yi,j ∈ [0, 1] such that E[Yi,j] = νi and P(νi − ν∗ ≤ ε) = c1^{−1} ε^β.
If `j(i) = (1/j) Σ_{s=1}^{j} Yi,s, then with probability at least 1 − δ/2 we have, for all k ≥ 1, 0 ≤ s ≤ k, 1 ≤ i ≤ nk,s, 1 ≤ j ≤ Bk,

    |νi − `i,j| ≤ sqrt( log(Bk nk,s/δk,s) / (2j) ) ≤ ( log(16Bk/δ) · 2/j )^{1/2} .

Consequently, if after B total pulls we define νbB as the mean of the empirically best arm
output from the last fully completed round k, then with probability at least 1 − δ
The result of this corollary matches the anytime result of Section 4.3 of Carpentier and
Valko (2015) whose algorithm was built specifically for the case of stochastic arms and the β
parameterization of F defined in Eq. (6). Notably, this result also matches the lower bounds
shown in that work up to poly-logarithmic factors, revealing that Hyperband is nearly
tight for this important special case. However, we note that this earlier work has a more
careful analysis for the fixed budget setting.
Theorem 7 Assume that Assumptions 1 and 2 of Section 5.2 hold and that the sampled loss
sequences obey the parametric assumptions of Equations 5 and 7. For any T ∈ N, let ı̂T be the
empirically best-performing arm output from SuccessiveHalving from the last round k of
Hyperband of Figure 9 after exhausting a total budget of T from all rounds. Fix δ ∈ (0, 1)
and q ∈ (1/K, 1), and let

    zq = log(q−1) ( ∆_{⌈max{2,qK}⌉}^{−α} + (1/(qK)) Σ_{i=⌈max{2,qK}⌉}^{K} ∆i^{−α} ).

Once T = Ω̃(zq log(zq) log(1/δ)) total pulls have been made by Hyperband, we have νbT − ν∗ ≤
∆_{⌈max{2,qK}⌉} with probability at least 1 − δ, where Ω̃(·) hides log log(·) factors.
Theorem 8 Fix n arms. Let νi = `i,R and assume ν1 ≤ · · · ≤ νn. For any ε > 0 let

    zSH = η(logη(R) + 1) [ n + Σ_{i=1}^{n} min{R, γ−1(max{ε/4, (νi − ν1)/2})} ].

If the SuccessiveHalving algorithm of Figure 10 is run with any budget B ≥ zSH then an
arm ı̂ is returned that satisfies νı̂ − ν1 ≤ ε/2.
Recall that γ(R) = 0 in this setting and by definition supy≥0 γ −1 (y) ≤ R. Note that
Lemma 2 still applies in this setting and just like above we obtain the following corollary.
Figure 10: The finite horizon SuccessiveHalving and Hyperband algorithms are inspired
by their infinite horizon counterparts of Figure 9 to handle practical constraints.
Hyperband calls SuccessiveHalving as a subroutine.
according to F, then an arm ı̂ ∈ [n] is returned such that with probability at least 1 − δ we
have νı̂ − ν∗ ≤ F −1(log(2/δ)/n) − ν∗ + ε/2. In particular, if B = 4⌈log2(n)⌉H(F, γ, n, δ) and
ε = 4(F −1(log(2/δ)/n) − ν∗) then νı̂ − ν∗ ≤ 3(F −1(log(2/δ)/n) − ν∗) with probability at least 1 − δ.
then both also satisfy νı̂ − ν∗ ≲ ∆ with probability at least 1 − δ. Recalling that a larger
α means slower convergence and that a larger β means a greater difficulty of sampling
a good limit, note that when α/β < 1 the budget of SuccessiveHalving behaves like
R + ∆^{−β} log(1/δ), but as α/β → ∞ the budget asymptotes to R∆^{−β} log(1/δ).
We can also apply the discrete-CDF parameterization of Eq. (7). For any q ∈ (0, 1), if
n ≃ q−1 log(1/δ) and the measurement budgets of the uniform allocation (Equation 3) and
SuccessiveHalving (Equation 4) satisfy

    Uniform allocation:   B ≃ log(1/δ) ·  { K min{R, max_{j=2,...,K} ∆j^{−α}}        if q = 1/K
                                          { q−1 min{R, ∆_{⌈qK⌉}^{−α}}                if q > 1/K

    SuccessiveHalving:    B ≃ log(q−1 log(1/δ)) log(1/δ) ·  { min{R, ∆2^{−α}} + Σ_{j=2}^{K} min{R, ∆j^{−α}}                         if q = 1/K
                                                            { min{R, ∆_{⌈qK⌉}^{−α}} + (1/(qK)) Σ_{j=⌈qK⌉}^{K} min{R, ∆j^{−α}}       if q > 1/K ,
then an arm that is in the best q-fraction of arms is returned, i.e. ı̂/K ≈ q and νı̂ − ν∗ ≲
∆_{⌈max{2,qK}⌉}, with probability at least 1 − δ. Once again we observe a stark difference
between uniform allocation and SuccessiveHalving, particularly when ∆j^{−α} ≪ R for many
values of j ∈ {1, . . . , n}.
Armed with Corollary 9, all of the discussion of Section 5.3.3 preceding Theorem 5 holds
for the finite case (R < ∞) as well. Predictably analogous theorems also hold for the finite
horizon setting, but their specific forms (with the polylog factors) provide no additional
insights beyond the sample complexities sufficient for SuccessiveHalving to succeed, given
immediately above.
It is important to note that in the finite horizon setting, for all sufficiently large B (e.g.
B > 3R) and all distributions F, the budget B of SuccessiveHalving should scale linearly
with n ≃ ∆^{−β} log(1/δ) as ∆ → 0. Contrast this with the infinite horizon setting, in which
the ratio of B to n can become unbounded based on the values of α, β as ∆ → 0. One
consequence of this observation is that in the finite horizon setting it suffices to set B large
enough to identify a ∆-good arm with just constant probability, say 1/10, and then repeat
SuccessiveHalving m times to boost this constant probability to probability 1 − (9/10)^m.
While in this theoretical treatment of Hyperband we grow B over time, in practice we
recommend fixing B as a multiple of R as we have done in Section 3. The fixed budget
version of finite horizon Hyperband is more suitable for practical application due to the
constant time, instead of exponential time, between configurations trained to completion in
each outer loop.
6. Conclusion
We conclude by discussing three potential extensions related to parallelizing Hyperband
for distributed computing, adjusting for training methods with different convergence rates,
and combining Hyperband with non-random sampling methods.
Distributed implementations. Hyperband has the potential to be parallelized since
arms are independent and sampled randomly. The most straightforward parallelization
scheme is to distribute individual brackets of SuccessiveHalving to different machines.
This can be done asynchronously and as machines free up, new brackets can be launched
with a different set of arms. One can also parallelize a single bracket so that each round of
SuccessiveHalving runs faster. One drawback of this method is that if R can be computed
on one machine, the number of tasks decreases exponentially as arms are whittled down so a
more sophisticated job priority queue must be managed. Devising parallel generalizations of
Hyperband that efficiently leverage massive distributed clusters while minimizing overhead
costs is an interesting avenue for future work.
Adjusting for different convergence rates. A second open challenge involves gen-
eralizing the ideas behind Hyperband to settings where configurations have drastically
differing convergence rates. Configurations can have different convergence rates if they
have hyperparameters that impact convergence (e.g., learning rate decay for SGD or neural
networks with differing numbers of layers or hidden units), and/or if they correspond to
different model families (e.g., deep networks versus decision trees). The core issue arises
when configurations with drastically slower convergence rates ultimately result in better
models. To address these issues, we should be able to adjust the resources allocated to each
configuration so that a fair comparison can be made at the time of elimination.
Incorporating non-random sampling. Finally, Hyperband can benefit from sampling
schemes other than simple random search. Quasi-random methods like Sobol sequences or
Latin hypercube sampling, which were studied in Bergstra and Bengio (2012), may improve the
performance of Hyperband by giving better coverage of the search space. Alternatively,
meta-learning can be used to define intelligent priors informed by previous experimenta-
tion (Feurer et al., 2015). Finally, as mentioned in Section 2, exploring ways to combine
Hyperband with adaptive configuration selection strategies is a very promising future
direction.
Acknowledgments
Table 2: Hyperparameter space for the LeNet application of Section 3.3. Note that the
number of kernels in Layer-1 is upper bounded by the number of kernels in Layer-2.
Table 3: Hyperparameters and associated ranges for the three-layer convolutional network.
If a configuration is terminated early, the predicted terminal value from the estimated
learning curves is used as the validation error passed to the hyperparameter optimization
algorithm. Hence, if the learning curve fit is poor, it could impact the performance of the
configuration selection algorithm. While this approach is heuristic in nature, it could work
well in practice so we compare Hyperband to SMAC with early termination (labeled SMAC
(early) in Figure 11). We used the conservative termination criterion with default parameters
and recorded the validation loss every 400 iterations and evaluated the termination criterion
3 times within the training period (every 8k iterations for CIFAR-10 and MRBI and every
16k iterations for SVHN). Comparing the performance by the number of total iterations
as a multiple of R is conservative because it does not account for the time spent fitting the
learning curve in order to check the termination criterion.
(Figure 11 plots: average test error versus multiple of R used for each searcher, with panels (a) CIFAR-10, (b) MRBI, and (c) SVHN.)
Figure 11: Average test error across 10 trials is shown in all plots. Error bars indicate the
top and bottom quartiles of the test error corresponding to the model with the
best validation error.
20 trials of each (data set-searcher) pair, and as in Feurer et al. (2015) we kept the same
data splits across trials, while using a different random seed for each searcher in each trial.
Shortcomings of the Experimental Setup: The benchmark contains a large variety
of training set sizes and feature dimensions16 resulting in random search being able to test
600 configurations on some data sets but just dozens on others. Hyperband was designed
under the implicit assumption that computation scaled at least linearly with the data set size.
For very small data sets that are trained in seconds, the initialization overheads dominate
the computation and subsampling provides no computational benefit. In addition, many of
the classifiers and preprocessing methods under consideration return memory errors as they
require storage quadratic in the number of features (e.g., covariance matrix) or the number of
observations (e.g., kernel methods). These errors usually happen immediately (thus wasting
little time); however, they often occur on the full data set and not on subsampled data sets.
A searcher like Hyperband that uses a subsampled data set could spend significant time
training on a subsample only to error out when attempting to train it on the full data set.
Table 4: Hyperparameter space for kernel regularized least squares classification problem
discussed in Section 4.2.2.
The cost term C is divided by the number of samples so that the tradeoff between the
squared error and the L2 penalty would remain constant as the resource increased (squared
error is summed across observations and not averaged). The regularization term λ is equal
to the inverse of the scaled cost term C. Additionally, the average test error with the top
and bottom quartiles across 10 trials is shown in Figure 12.
Table 5 shows the hyperparameters and associated ranges considered in the random
features kernel approximation classification experiment discussed in Section 4.3. The
regularization term λ is divided by the number of features so that the tradeoff between
the squared error and the L2 penalty would remain constant as the resource increased.
Additionally, the average test error with the top and bottom quartiles across 10 trials is
shown in Figure 13.
16. Training set size ranges from 670 to 73,962 observations, and number of features ranges from 1 to 10,935.
(Figure 12 plot: test error versus minutes on CIFAR-10 for hyperband (finite), SMAC, TPE, random, random 2x, and bracket s = 4.)
Figure 12: Average test error of the best kernel regularized least square classification model
found by each searcher on CIFAR-10. The color coded dashed lines indicate
when the last trial of a given searcher finished. Error bars correspond to the top
and bottom quartiles of the test error across 10 trials.
(Figure 13 plot: test error versus minutes on CIFAR-10 for hyperband (finite), SMAC, TPE, spearmint, random, random 2x, and bracket s = 4.)
Figure 13: Average test error of the best random features model found by each searcher on
CIFAR-10. The test error for Hyperband and bracket s = 4 is calculated in
every evaluation instead of at the end of a bracket. Error bars correspond to the
top and bottom quartiles of the test error across 10 trials.
Appendix B. Proofs
In this section, we provide proofs for the theorems presented in Section 5.
For notational ease, let `i,j := `j(Xi). Again, for each i ∈ [n] := {1, . . . , n} we assume
the limit limk→∞ `i,k exists and is equal to νi. As a reminder, γ : N → R is defined as the
pointwise smallest, monotonically decreasing function satisfying

    max_{i∈[n]} |`i,j − νi| ≤ γ(j) ,  ∀j ∈ N.
Note γ is guaranteed to exist by the existence of νi and bounds the deviation from the limit
value as the sequence of iterates j increases.
Without loss of generality, order the terminal losses so that ν1 ≤ ν2 ≤ · · · ≤ νn. Assume
that B ≥ zSH. Then we have for each round k

    rk ≥ B / (|Sk|⌈log2(n)⌉) − 1
       ≥ (2/|Sk|) max_{i=2,...,n} i (1 + γ−1(max{ε/4, (νi − ν1)/2})) − 1
       ≥ (2/|Sk|) (⌊|Sk|/2⌋ + 1) (1 + γ−1(max{ε/4, (ν⌊|Sk|/2⌋+1 − ν1)/2})) − 1
       ≥ 1 + γ−1(max{ε/4, (ν⌊|Sk|/2⌋+1 − ν1)/2}) − 1
       = γ−1(max{ε/4, (ν⌊|Sk|/2⌋+1 − ν1)/2}),

where the fourth line follows from ⌊|Sk|/2⌋ ≥ |Sk|/2 − 1.
Under this scenario, we will eliminate arm i before arm 1, since on each round the arms are
sorted by their empirical losses and the top half are discarded. Note that by the assumption
the νi limits are non-decreasing in i, so that the τi values are non-increasing in i.
Fix a round k and assume 1 ∈ Sk (note, 1 ∈ S0). The above calculation shows that
rk ≥ γ−1(max{ε/4, (ν⌊|Sk|/2⌋+1 − ν1)/2}). Consequently,

    {1 ∈ Sk, 1 ∉ Sk+1} ⟺ Σ_{i∈Sk} 1{`i,rk < `1,rk} ≥ ⌊|Sk|/2⌋
                        ⟹ Σ_{i∈Sk} 1{rk < τi} ≥ ⌊|Sk|/2⌋
                        ⟹ Σ_{i=2}^{⌊|Sk|/2⌋+1} 1{rk < τi} ≥ ⌊|Sk|/2⌋
                        ⟺ rk < τ⌊|Sk|/2⌋+1 ,

where the first line follows by the definition of the algorithm, the second by Equation 9,
and the third by τi being non-increasing (for all i < j we have τi ≥ τj and consequently
1{rk < τi} ≥ 1{rk < τj}, so the first indicators of the sum not including arm 1 would be on
before any other i's in Sk ⊂ [n] sprinkled throughout [n]). This implies
• Case 3: 1 ∉ Sk
Since 1 ∈ S0, there exists some r < k such that 1 ∈ Sr and 1 ∉ Sr+1. For this r, only
case 2 is possible, since case 1 would imply 1 ∈ Sr+1. However, under case 2, if
1 ∉ Sr+1 then maxi∈Sr+1 νi ≤ ν1 + ε/2.
Because $1 \in S_0$, we either have that 1 remains in $S_k$ (possibly alternating between cases 1 and 2) for all $k$ until the algorithm exits with the best arm 1, or there exists some $k$ such that case 3 is true and the algorithm exits with an arm $\hat{\imath}$ such that $\nu_{\hat{\imath}} \le \nu_1 + \epsilon/2$. The proof is complete by noting that
\begin{align*}
\xi_1 &= \big\{\nu_1 \le F^{-1}(p_n)\big\} \\
\xi_2 &= \left\{\sum_{i=1}^{n} \min\Big\{M,\ \gamma^{-1}\Big(\tfrac{\nu_i - \nu_*}{4}\Big)\Big\} \le n\mu + \sqrt{2 n \mu M \log(2/\delta)} + \tfrac{2}{3} M \log(2/\delta)\right\}
\end{align*}
so that
\begin{align*}
\sum_{i=1}^{n} \gamma^{-1}\!\left(\max\Big\{\tfrac{\epsilon}{4}, \tfrac{\nu_i - \nu_1}{2}\Big\}\right) &\le 2 n \mu + \tfrac{4}{3} M \log(2/\delta) \\
&= 2n \int_{\nu_* + \epsilon/4}^{\infty} \gamma^{-1}\!\Big(\tfrac{t - \nu_*}{4}\Big)\, dF(t) + \Big(\tfrac{4}{3}\log(2/\delta) + 2 n F(\nu_* + \epsilon/4)\Big)\, \gamma^{-1}\!\Big(\tfrac{\epsilon}{16}\Big)
\end{align*}
Proof  Note that if we draw $n$ random configurations from $F$ and $i^* = \arg\min_{i=1,\dots,n} \ell_*(X_i)$, then
\begin{align*}
P\big(\ell_*(X_{i^*}) - \nu_* \le \epsilon\big) &= P\left(\bigcup_{i=1}^{n} \{\ell_*(X_i) - \nu_* \le \epsilon\}\right) \\
&= 1 - (1 - F(\nu_* + \epsilon))^{n} \ge 1 - e^{-n F(\nu_* + \epsilon)},
\end{align*}
which is equivalent to saying that with probability at least $1 - \delta$, $\ell_*(X_{i^*}) - \nu_* \le F^{-1}(\log(1/\delta)/n) - \nu_*$. Furthermore, if each configuration is trained for $j$ iterations then with probability at least $1 - \delta$
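Read as a sample-complexity statement, the display above says that $n \ge \log(1/\delta)/F(\nu_* + \epsilon)$ random draws suffice to find an $\epsilon$-good configuration with probability $1 - \delta$. A short sketch with hypothetical numbers:

import math

def draws_needed(F_eps, delta):
    # F_eps = F(nu_* + eps), the probability mass of eps-good configurations.
    return math.ceil(math.log(1 / delta) / F_eps)

# If 1% of configurations are eps-good, about 461 draws give 99% confidence.
print(draws_needed(F_eps=0.01, delta=0.01))  # 461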
The following proposition demonstrates that the upper bound on the error of the uniform allocation strategy in Proposition 4 is in fact tight. That is, for any distribution $F$ and function $\gamma$ there exists a loss sequence that requires the budget described in Eq. (3) in order to avoid a loss of more than $\epsilon$ with high probability.
Proposition 11  Fix any $\delta \in (0,1)$ and $n \in \mathbb{N}$. For any $c \in (0,1]$, let $\mathcal{F}_c$ denote the space of continuous cumulative distribution functions $F$ satisfying$^{18}$
\[
\inf_{x \in [\nu_*, 1-\nu_*]}\ \inf_{\Delta \in [0, 1-x]} \frac{F(x+\Delta) - F(x+\Delta/2)}{F(x+\Delta) - F(x)} \ge c.
\]
And let $\Gamma$ denote the space of monotonically decreasing functions over $\mathbb{N}$. For any $F \in \mathcal{F}_c$ and $\gamma \in \Gamma$ there exists a probability distribution $\mu$ over $\mathcal{X}$ and a sequence of functions $\ell_j : \mathcal{X} \to \mathbb{R}\ \forall j \in \mathbb{N}$ with $\ell_* := \lim_{j\to\infty} \ell_j$ and $\nu_* = \inf_{x \in \mathcal{X}} \ell_*(x)$ such that
$\sup_{x \in \mathcal{X}} |\ell_j(x) - \ell_*(x)| \le \gamma(j)$ and $P_\mu(\ell_*(X) - \nu_* \le \epsilon) = F(\epsilon)$. Moreover, if $n$ configurations $X_1, \dots, X_n$ are drawn from $\mu$ and $\hat{\imath} = \arg\min_{i=1,\dots,n} \ell_{B/n}(X_i)$, then with probability at least $\delta$,
\[
\ell_*(X_{\hat{\imath}}) - \nu_* \ge 2\Big(F^{-1}\big(\tfrac{\log(c/\delta)}{n + \log(c/\delta)}\big) - \nu_*\Big)
\]
whenever $B \le n\, \gamma^{-1}\Big(2\big(F^{-1}\big(\tfrac{\log(c/\delta)}{n + \log(c/\delta)}\big) - \nu_*\big)\Big)$.
Proof  Let $\mathcal{X} = [0,1]$, $\ell_*(x) = F^{-1}(x)$, and let $\mu$ be the uniform distribution over $[0,1]$. Define $\hat{\nu} = F^{-1}\big(\tfrac{\log(c/\delta)}{n + \log(c/\delta)}\big)$ and set
\[
\ell_j(x) =
\begin{cases}
\hat{\nu} + \tfrac{1}{2}\gamma(j) + \big(\hat{\nu} + \tfrac{1}{2}\gamma(j) - \ell_*(x)\big) & \text{if } |\hat{\nu} + \tfrac{1}{2}\gamma(j) - \ell_*(x)| \le \tfrac{1}{2}\gamma(j) \\
\ell_*(x) & \text{otherwise.}
\end{cases}
\]
Essentially, if $\ell_*(x)$ is within $\tfrac{1}{2}\gamma(j)$ of $\hat{\nu} + \tfrac{1}{2}\gamma(j)$ then we set $\ell_j(x)$ equal to $\ell_*(x)$ reflected across the value $\hat{\nu} + \tfrac{1}{2}\gamma(j)$, that is, $\ell_j(x) = 2\hat{\nu} + \gamma(j) - \ell_*(x)$. Clearly, $|\ell_j(x) - \ell_*(x)| \le \gamma(j)$ for all $x \in \mathcal{X}$.
Since each $\ell_*(X_i)$ is distributed according to $F$, we have
\[
P\left(\bigcap_{i=1}^{n} \{\ell_*(X_i) - \nu_* \ge \epsilon\}\right) = (1 - F(\nu_* + \epsilon))^{n} \ge e^{-n F(\nu_* + \epsilon)/(1 - F(\nu_* + \epsilon))}.
\]
Setting the right-hand side greater than or equal to $\delta/c$ and solving for $\epsilon$, we find $\nu_* + \epsilon \ge F^{-1}\big(\tfrac{\log(c/\delta)}{n + \log(c/\delta)}\big) = \hat{\nu}$.
Define $I_0 = [\nu_*, \hat{\nu})$, $I_1 = [\hat{\nu}, \hat{\nu} + \tfrac{1}{2}\gamma(B/n))$, and $I_2 = [\hat{\nu} + \tfrac{1}{2}\gamma(B/n), \hat{\nu} + \gamma(B/n)]$. Furthermore, for $j \in \{0,1,2\}$ define $N_j = \sum_{i=1}^{n} \mathbf{1}\{\ell_*(X_i) \in I_j\}$. Given $N_0 = 0$ (which occurs with probability at least $\delta/c$), if $N_1 = 0$ then $\ell_*(X_{\hat{\imath}}) - \nu_* \ge F^{-1}\big(\tfrac{\log(c/\delta)}{n + \log(c/\delta)}\big) - \nu_* + \tfrac{1}{2}\gamma(B/n)$ and the claim is true.
Below we will show that if $N_2 > 0$ whenever $N_1 > 0$, then the claim is also true. We now show that this happens with probability at least $c$ whenever $N_1 + N_2 = m$ for any $m > 0$. Observe that this holds since
\[
P(\nu_i \in I_2 \mid \nu_i \in I_1 \cup I_2) = \frac{P(\nu_i \in I_2)}{P(\nu_i \in I_1 \cup I_2)} = \frac{P\big(\nu_i \in [\hat{\nu} + \tfrac{1}{2}\gamma, \hat{\nu} + \gamma]\big)}{P\big(\nu_i \in [\hat{\nu}, \hat{\nu} + \gamma]\big)} = \frac{F(\hat{\nu} + \gamma) - F(\hat{\nu} + \tfrac{1}{2}\gamma)}{F(\hat{\nu} + \gamma) - F(\hat{\nu})} \ge c.
\]
Thus, the event that $N_0 = 0$ and that $N_2 > 0$ whenever $N_1 > 0$ occurs with probability at least $(\delta/c) \cdot c = \delta$, so assume this is the case in what follows.
Since $N_0 = 0$, for all $j \in \mathbb{N}$, each $X_i$ must fall into one of three cases:
The first case holds since within that regime we have $\ell_j(x) = \ell_*(x)$, while the last two cases hold since they consider the regime where $\ell_j(x) = 2\hat{\nu} + \gamma(j) - \ell_*(x)$. Thus, for any $i$ such that $\ell_*(X_i) \in I_2$ it must be the case that $\ell_j(X_i) \in I_1$, and vice versa. Because $N_2 \ge N_1 > 0$, we conclude that if $\hat{\imath} = \arg\min_i \ell_{B/n}(X_i)$ then $\ell_{B/n}(X_{\hat{\imath}}) \in I_1$ and $\ell_*(X_{\hat{\imath}}) \in I_2$. That is,
\[
\nu_{\hat{\imath}} - \nu_* \ge \hat{\nu} - \nu_* + \tfrac{1}{2}\gamma(j) = F^{-1}\big(\tfrac{\log(c/\delta)}{n + \log(c/\delta)}\big) - \nu_* + \tfrac{1}{2}\gamma(j).
\]
So if we wish $\nu_{\hat{\imath}} - \nu_* \le 2\big(F^{-1}\big(\tfrac{\log(c/\delta)}{n + \log(c/\delta)}\big) - \nu_*\big)$ with probability at least $\delta$, then we require
\[
B/n = j \ge \gamma^{-1}\Big(2\big(F^{-1}\big(\tfrac{\log(c/\delta)}{n + \log(c/\delta)}\big) - \nu_*\big)\Big).
\]
\[
\gamma^{-1}\!\Big(\tfrac{F^{-1}(p_n) - \nu_*}{4}\Big) \le c\big(F^{-1}(p_n) - \nu_*\big)^{-\alpha} \le c\, p_n^{-\alpha/\beta}
\]
and
\[
\int_{p_n}^{1} \gamma^{-1}\!\Big(\tfrac{F^{-1}(t) - \nu_*}{4}\Big)\, dt \le c \int_{p_n}^{1} t^{-\alpha/\beta}\, dt \le
\begin{cases}
c \log(1/p_n) & \text{if } \alpha = \beta \\
c\, \dfrac{1 - p_n^{1-\alpha/\beta}}{1 - \alpha/\beta} & \text{if } \alpha \ne \beta.
\end{cases}
\]
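(The case split is just the elementary antiderivative of $t^{-\alpha/\beta}$, spelled out for completeness: $\int t^{-a}\,dt = t^{1-a}/(1-a)$ for $a \ne 1$ and $\int t^{-1}\,dt = \log t$, so
\[
\int_{p_n}^{1} t^{-\alpha/\beta}\, dt =
\begin{cases}
\log(1/p_n) & \text{if } \alpha = \beta \\
\dfrac{1 - p_n^{1-\alpha/\beta}}{1 - \alpha/\beta} & \text{if } \alpha \ne \beta.)
\end{cases}
\]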
We conclude that
\begin{align*}
H(F, \gamma, n, \delta) &= 2n \int_{p_n}^{1} \gamma^{-1}\!\Big(\tfrac{F^{-1}(t) - \nu_*}{4}\Big)\, dt + \tfrac{10}{3} \log(2/\delta)\, \gamma^{-1}\!\Big(\tfrac{F^{-1}(p_n) - \nu_*}{4}\Big) \\
&\le c\, p_n^{-\alpha/\beta} \log(1/\delta) + c\, n
\begin{cases}
\log(1/p_n) & \text{if } \alpha = \beta \\
\dfrac{1 - p_n^{1-\alpha/\beta}}{1 - \alpha/\beta} & \text{if } \alpha \ne \beta.
\end{cases}
\end{align*}
Step 2: Solve for $(B_{k,l}, n_{k,l})$ in terms of $\Delta$.  Fix $\Delta > 0$. Our strategy is to describe $n_{k,l}$ in terms of $\Delta$. In particular, parameterize $n_{k,l}$ such that $p_{n_{k,l}} = \frac{c \log(4k^3/\delta)}{n_{k,l}} = \Delta^{\beta}$, so that $n_{k,l} = c\, \Delta^{-\beta} \log(4k^3/\delta)$ and
\begin{align*}
H(F, \gamma, n_{k,l}, \delta_{k,l}) &\le c\, p_{n_{k,l}}^{-\alpha/\beta} \log(1/\delta_{k,l}) + c\, n_{k,l}
\begin{cases}
\log(1/p_{n_{k,l}}) & \text{if } \alpha = \beta \\
\dfrac{1 - p_{n_{k,l}}^{1-\alpha/\beta}}{1 - \alpha/\beta} & \text{if } \alpha \ne \beta
\end{cases} \\
&\le c \log(k/\delta)
\begin{cases}
\Delta^{-\beta} \log(\Delta^{-1}) & \text{if } \alpha = \beta \\
\Delta^{-\alpha} + \dfrac{\Delta^{-\beta} - \Delta^{-\alpha}}{1 - \alpha/\beta} & \text{if } \alpha \ne \beta
\end{cases} \\
&\le c \log(k/\delta)\, \min\Big\{\tfrac{1}{|1 - \alpha/\beta|},\, \log(\Delta^{-1})\Big\}\, \Delta^{-\max\{\beta, \alpha\}}.
\end{align*}
Step 3: Count the total number of measurements.  Moreover, the total number of measurements before $\hat{\imath}_{k,l}$ is output is upper bounded by
\[
T = \sum_{i=1}^{k} \sum_{j=1}^{i} B_{i,j} \le k \sum_{i=1}^{k} B_{i,1} \le 2 k B_{k,1} = 2 B_{k,1} \log_2(B_{k,1}),
\]
where we have employed the so-called "doubling trick": $\sum_{i=1}^{k} B_{i,1} = \sum_{i=1}^{k} 2^{i} \le 2^{k+1} = 2 B_{k,1}$. Simplifying,
\[
T \le c\, z_\Delta \log(\log(z_\Delta)/\delta) \log\big(z_\Delta \log(\log(z_\Delta)/\delta)\big) \le c\, \Delta^{-\max\{\beta,\alpha\}} \log(\Delta^{-1})^{3} \log(\log(\Delta^{-1})/\delta).
\]
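The doubling trick is the only estimate used here; a short check over toy values of $k$ (assuming $B_{i,1} = 2^i$ as in the text):

# Doubling trick: the sum of all past geometric budgets is at most twice
# the most recent one, so total work stays within a constant of the last round.
for k in range(1, 20):
    assert sum(2**i for i in range(1, k + 1)) <= 2 * 2**k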
so that
\[
\int_{p_n}^{1} \gamma^{-1}\!\Big(\tfrac{F^{-1}(t) - \nu_*}{4}\Big)\, dt = \int_{F^{-1}(p_n)}^{1} \gamma^{-1}\!\Big(\tfrac{x - \nu_*}{4}\Big)\, dF(x) \le \frac{c}{K} \sum_{i=\lceil p_n K \rceil}^{K} \Delta_i^{-\alpha}
\]
so that
\begin{align*}
H(F, \gamma, n, \delta) &= 2n \int_{p_n}^{1} \gamma^{-1}\!\Big(\tfrac{F^{-1}(t) - \nu_*}{4}\Big)\, dt + \tfrac{10}{3} \log(2/\delta)\, \gamma^{-1}\!\Big(\tfrac{F^{-1}(p_n) - \nu_*}{4}\Big) \\
&\le c\, \Delta_{\lceil p_n K \rceil}^{-\alpha} \log(1/\delta) + \frac{c n}{K} \sum_{i=\lceil p_n K \rceil}^{K} \Delta_i^{-\alpha}.
\end{align*}
Now consider the case when $3\big(F^{-1}(p_n) - \nu_*\big) \le \Delta_2$. In this case $F(\nu_* + \epsilon/4) = 1/K$, $\gamma^{-1}(\epsilon/16) \le c\, \Delta_2^{-\alpha}$, and $\int_{\nu_* + \epsilon/4}^{\infty} \gamma^{-1}\big(\tfrac{t - \nu_*}{4}\big)\, dF(t) \le \frac{c}{K} \sum_{i=2}^{K} \Delta_i^{-\alpha}$, so that
\begin{align*}
H(F, \gamma, n, \delta, \epsilon) &= 2n \int_{\nu_* + \epsilon/4}^{\infty} \gamma^{-1}\!\Big(\tfrac{t - \nu_*}{4}\Big)\, dF(t) + \Big(\tfrac{4}{3}\log(2/\delta) + 2 n F(\nu_* + \epsilon/4)\Big)\, \gamma^{-1}\!\Big(\tfrac{\epsilon}{16}\Big) \\
&\le c\big(\log(1/\delta) + n/K\big)\, \Delta_2^{-\alpha} + \frac{c n}{K} \sum_{i=2}^{K} \Delta_i^{-\alpha}.
\end{align*}
Step 2: Solve for $(B_{k,l}, n_{k,l})$ in terms of $\Delta$.  Note that no improvement is possible once $p_{n_{k,l}} \le 1/K$, since $3\big(F^{-1}(1/K) - \nu_*\big) \le \Delta_2$. That is, when $p_{n_{k,l}} \le 1/K$ the algorithm has found the best arm but will continue to take samples indefinitely. Thus, we only consider the cases $q = 1/K$ and $q > 1/K$. Fix $\Delta > 0$. Our strategy is to describe $n_{k,l}$ in terms of $q$. In particular, parameterize $n_{k,l}$ such that $p_{n_{k,l}} = \frac{c \log(4k^3/\delta)}{n_{k,l}} = q$, so that $n_{k,l} = c\, q^{-1} \log(4k^3/\delta)$ and
\begin{align*}
H(F, \gamma, n_{k,l}, \delta_{k,l}, \epsilon_{k,l}) &\le c
\begin{cases}
\big(\log(1/\delta_{k,l}) + \tfrac{n_{k,l}}{K}\big)\, \Delta_2^{-\alpha} + \tfrac{n_{k,l}}{K} \sum_{i=2}^{K} \Delta_i^{-\alpha} & \text{if } 5\big(F^{-1}(p_{n_{k,l}}) - \nu_*\big) \le \Delta_2 \\
\Delta_{\lceil p_{n_{k,l}} K \rceil}^{-\alpha} \log(1/\delta_{k,l}) + \tfrac{n_{k,l}}{K} \sum_{i=\lceil p_{n_{k,l}} K \rceil}^{K} \Delta_i^{-\alpha} & \text{otherwise}
\end{cases} \\
&\le c \log(k/\delta)
\begin{cases}
\Delta_2^{-\alpha} + \sum_{i=2}^{K} \Delta_i^{-\alpha} & \text{if } q = 1/K \\
\Delta_{\lceil q K \rceil}^{-\alpha} + \tfrac{1}{q K} \sum_{i=\lceil q K \rceil}^{K} \Delta_i^{-\alpha} & \text{if } q > 1/K
\end{cases} \\
&\le c \log(k/\delta) \left( \Delta_{\lceil \max\{2,\, qK\} \rceil}^{-\alpha} + \frac{1}{q K} \sum_{i=\lceil \max\{2,\, qK\} \rceil}^{K} \Delta_i^{-\alpha} \right).
\end{align*}
Because the output arm is just the empirical best, there is some error associated with using the empirical estimate. The arm returned on round $(k,l)$ is pulled $\lceil 2^{k-1}/l \rceil \ge c\, B_{k,l}/\log(B_{k,l})$ times, so the possible error is bounded by
\[
\gamma\big(B_{k,l}/\log(B_{k,l})\big) \le c \left(\frac{\log(B_{k,l})}{B_{k,l}}\right)^{1/\alpha} \le c \left(\frac{\log(T)^{2} \log(\log(T))}{T}\right)^{1/\alpha}.
\]
This is dominated by $\Delta_{\lceil \max\{2,\, qK\} \rceil}$ for the value of $T$ prescribed by the above calculation, completing the proof.