Lec07 Bayesian Optimization
Intuition:
We want to find the peak of the true function (e.g., accuracy as a function of hyperparameters).
To find this peak, we fit a Gaussian process to the points observed so far and pick the next point to evaluate where we believe the maximum lies.
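A minimal sketch of that loop, assuming a scikit-learn GP and a simple upper-confidence-bound choice of the next point (the toy objective, candidate grid, and constants are illustrative, not from the slides):

```python
# Minimal Bayesian optimization loop: fit a GP to the points observed so far,
# then evaluate the objective where the acquisition (here UCB) is largest.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                                  # stand-in for e.g. validation accuracy
    return np.exp(-(x - 2.0) ** 2) + 0.1 * np.sin(5 * x)

candidates = np.linspace(0.0, 5.0, 500).reshape(-1, 1)   # hyperparameter grid
X = np.array([[0.5], [4.5]])                             # initial observations
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
for _ in range(10):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 2.0 * sigma                         # optimism in the face of uncertainty
    x_next = candidates[np.argmax(ucb)]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

print("best x:", X[np.argmax(y)], "best value:", y.max())
```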
Brochu et al., 2010, A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning
Bayesian Optimization Tutorial
Update acquisition function from new posterior and find the next best point
Acquisition Function Intuition
Acquisition Functions
- Guides the optimization by determining which point to observe next; the acquisition is much cheaper to evaluate and optimize than the true objective, so we maximize it to find the next sample point
Consider the case where N evaluations have completed, with data {x_n, y_n}, n = 1, ..., N, and J evaluations are pending, {x_j}, j = 1, ..., J
Parallelization Example
- We have completed 3 evaluations and 2 are pending, {x1, x2}
- Fit a model for each possible realization of {f(x1), f(x2)}
- Calculate the acquisition under each model
- Integrate (average) the acquisitions over these realizations to obtain a single acquisition over x (see the sketch below)
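A Monte Carlo sketch of this procedure, assuming scikit-learn GPs; the function name acquisition_over_pending and the acquisition passed in (e.g. a UCB lambda) are illustrative, and the integral over {f(x1), f(x2)} is approximated by averaging over posterior samples (fantasies) of the pending outcomes:

```python
# "Fit a model for each possible realization of the pending outcomes, compute
# the acquisition under each, and average" via Monte Carlo fantasies.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def acquisition_over_pending(X_obs, y_obs, X_pending, candidates,
                             acquisition, n_fantasies=20, seed=0):
    base = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                    normalize_y=True)
    base.fit(X_obs, y_obs)
    # Sample possible realizations {f(x_1), f(x_2), ...} of the pending points.
    fantasies = base.sample_y(X_pending, n_samples=n_fantasies,
                              random_state=seed)          # shape (J, n_fantasies)
    acq = np.zeros(len(candidates))
    for s in range(n_fantasies):
        gp_s = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                        normalize_y=True)
        gp_s.fit(np.vstack([X_obs, X_pending]),
                 np.concatenate([y_obs, fantasies[:, s]]))
        mu, sigma = gp_s.predict(candidates, return_std=True)
        acq += acquisition(mu, sigma)          # e.g. lambda m, s: m + 2.0 * s
    return acq / n_fantasies                   # maximize this to pick the next point
```

The next point is then the candidate maximizing the averaged acquisition, exactly as in the single-GP case.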
Results
● Branin-Hoo
● Logistic Regression MNIST
● Online LDA
● M3E
● CNN CIFAR-10
Logistic Regression - MNIST
CIFAR-10
● 3-layer conv-net
● Optimized over
○ Number of epochs
○ Learning rate
○ L2 weight regularization constants
● Achieved state of the art
○ 9.5% test error (with data augmentation)
GP Bayesian Optimization - Pros and Cons
● Advantages
○ Provides calibrated uncertainty: posterior mean and variance at any candidate point
● Disadvantages
○ Exact GP inference is cubic in the number of observations (fitting requires factorizing an n x n kernel matrix; see the sketch below)
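A minimal sketch of where that cost comes from: exact GP regression factorizes the n x n kernel matrix, and the Cholesky factorization is O(n^3). The kernel argument here is any callable returning a covariance matrix:

```python
# Exact GP fit + predict; the Cholesky factorization of the n x n kernel
# matrix dominates and scales as O(n^3) in the number of observations n.
import numpy as np
from scipy.linalg import cholesky, cho_solve

def gp_fit_predict(X, y, X_star, kernel, noise=1e-6):
    K = kernel(X, X)                                       # n x n covariance
    L = cholesky(K + noise * np.eye(len(X)), lower=True)   # O(n^3) step
    alpha = cho_solve((L, True), y)                        # K^{-1} y via the factor
    K_star = kernel(X_star, X)                             # cross-covariance
    mean = K_star @ alpha                                  # posterior mean
    v = np.linalg.solve(L, K_star.T)
    cov = kernel(X_star, X_star) - v.T @ v                 # posterior covariance
    return mean, cov
```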
Scalable Bayesian Optimization Using Deep Neural Networks
- Instantaneous regret (r_t): the gap between the value at the optimum and the value of the point chosen at step t:
r_t = f(x*) - f(x_t), where x* = argmax_{x ∈ D} f(x)
- Cumulative regret (R_T): the total loss in reward after T steps:
R_T = ∑_{t=1}^{T} r_t
Minimizing Regret: A Tradeoff
Average regret = R_T / T; a no-regret algorithm drives R_T / T → 0 as T → ∞
Credit: Russo et al., 2017
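A tiny sketch of these quantities, assuming we know the optimal value f(x*) (only possible for synthetic benchmarks); the example values are made up:

```python
# Instantaneous, cumulative, and average regret for a sequence of chosen points.
import numpy as np

def regret_curves(f_values_chosen, f_optimum):
    r = f_optimum - np.asarray(f_values_chosen)   # instantaneous regret r_t
    R = np.cumsum(r)                              # cumulative regret R_T
    avg = R / np.arange(1, len(r) + 1)            # average regret R_T / T
    return r, R, avg

# A no-regret method should drive the average regret toward zero.
r, R, avg = regret_curves([0.2, 0.5, 0.8, 0.95, 0.99], f_optimum=1.0)
print(avg)
```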
Asymptotic Regret
** Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In COLT, 2008.
Infinite-dimensional Functions
What happens when we move beyond finite-dimensional settings to infinite-dimensional functions, such as samples from a GP?
Using Information Gain to Derive Regret Bounds
Information Gain
Recall the MacKay (1992) paper: information gain can be quantified as the change (reduction) in entropy
In this context: for a set of sampling points A ⊆ D with noisy observations y_A, the information gain is the mutual information
I(y_A; f) = H(y_A) - H(y_A | f) = ½ log|I + σ⁻² K_A|, where K_A = [k(x, x')]_{x, x' ∈ A}
The GP-UCB rule x_t = argmax_{x ∈ D} μ_{t-1}(x) + √β_t σ_{t-1}(x) balances exploitation (the mean term) against exploration (the standard deviation term)
Definition: the maximum information gain after T rounds,
γ_T := max_{A ⊆ D, |A| ≤ T} I(y_A; f)
This term will be used to quantify the regret bound for the algorithm
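A small sketch computing the information gain of a candidate set A under Gaussian observation noise, using the formula above; the squared exponential kernel and the numbers are illustrative:

```python
# Information gain of observing the points in A for a GP with noise variance
# sigma2: I(y_A; f) = 0.5 * log det(I + sigma^{-2} K_A)
import numpy as np

def information_gain(K_A, sigma2):
    n = K_A.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K_A / sigma2)
    return 0.5 * logdet

def se_kernel(x, lengthscale=0.2):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

A = np.linspace(0, 1, 10)                 # a candidate set of 10 points in [0, 1]
print(information_gain(se_kernel(A), sigma2=0.025))
```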
Regret Bounds - Finite Domain
Assuming γ_T grows strictly sublinearly in T (we will verify later that this is achievable by choice of kernel), we can find a sublinear function of T bounding R_T from above.
Theorem 1:
Assumptions:
- D finite
- f a sample of a GP with mean 0
- k(x, x') of the GP s.t. k(x, x) (variance) not greater than 1
Then, running GP-UCB with β_t = 2 log(|D| t² π² / 6δ), the probability that the R_T curve lies below √(C1 T β_T γ_T) for all T ≥ 1, where C1 = 8 / log(1 + σ⁻²), is at least 1 - δ.
We obtain: R_T = O*(√(T γ_T log|D|))
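A sketch of GP-UCB over a finite candidate set D with the β_t schedule from Theorem 1; the RBF kernel, the noise level, and adding Gaussian noise to the objective are illustrative assumptions:

```python
# GP-UCB on a finite domain: at round t pick the candidate maximizing
# mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x), with beta_t as in Theorem 1.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_ucb(objective, D, T, delta=0.1, noise=0.05):
    # D: array of candidate points, shape (|D|, d)
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=noise ** 2)
    X, y = [], []
    for t in range(1, T + 1):
        beta_t = 2.0 * np.log(len(D) * (t * np.pi) ** 2 / (6.0 * delta))
        if X:
            gp.fit(np.array(X), np.array(y))
            mu, sigma = gp.predict(D, return_std=True)
        else:
            mu, sigma = np.zeros(len(D)), np.ones(len(D))   # prior: mean 0, k(x, x) <= 1
        x_t = D[np.argmax(mu + np.sqrt(beta_t) * sigma)]
        X.append(x_t)
        y.append(objective(x_t) + noise * np.random.randn())  # noisy observation
    return np.array(X), np.array(y)
```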
Regret Bounds II - General Compact+Convex Space
Theorem 2:
Assumptions:
- D compact and convex in [0, r]^d,
- f sample of a GP with mean 0,
- k(x, x’) of GP s.t. k(x,x) (variance) not greater than 1
- k(x,x’) s.t. f fulfills smoothness condition -- discussed next
We obtain (for a suitable choice of β_t): with probability at least 1 - δ, R_T ≤ √(C1 T β_T γ_T) + 2 for all T ≥ 1, i.e. R_T = O*(√(d T γ_T)) up to polylog factors
Regret Bounds II Continued
This smoothness condition holds for stationary kernels k(x, x') = k(x - x') which are four times differentiable: samples f then satisfy Pr{sup_{x ∈ D} |∂f/∂x_j| > L} ≤ a e^{-(L/b)²} for j = 1, ..., d and some constants a, b > 0
Submodularity
-- The information gain F(A) = I(y_A; f) is a submodular function
⇒ greedily adding one point at a time achieves at least a (1 - 1/e) fraction of the maximum, γ_T
We can bound γ_T by considering the worst-case allocation of the T samples under a relaxed greedy procedure (see Appendix C of the paper).
In a finite space D, this eventually gives a bound on γ_T in terms of the eigenvalues of the covariance matrix over all |D| points: the faster the spectrum decays, the slower the growth of the bound (see the sketch below).
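A quick illustration of the spectrum-decay point, comparing eigenvalue decay of a smooth squared exponential kernel with a rougher Matérn kernel on the same grid (the lengthscales are arbitrary):

```python
# The smoother kernel's covariance matrix has a much faster-decaying spectrum,
# which corresponds to smaller information gain gamma_T and hence lower regret.
import numpy as np
from sklearn.gaussian_process.kernels import RBF, Matern

X = np.linspace(0, 1, 200).reshape(-1, 1)
for name, kern in [("squared exponential", RBF(length_scale=0.2)),
                   ("Matern nu=3/2", Matern(length_scale=0.2, nu=1.5))]:
    eig = np.linalg.eigvalsh(kern(X))[::-1]          # eigenvalues, descending
    print(f"{name:>20}: 10th/1st eigenvalue ratio = {eig[9] / eig[0]:.2e}")
```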
Bounding Information Gain Continued
Theorem 5: Assume a general compact and convex set D ⊂ R^d and a kernel with k(x, x') ≤ 1. Then:
- Linear kernel: γ_T = O(d log T)
- Squared exponential kernel: γ_T = O((log T)^{d+1})
- Matérn kernel (ν > 1): γ_T = O(T^{d(d+1)/(2ν + d(d+1))} log T)
Combining the two theorems, we obtain the following (1 - δ) upper confidence bounds on the total regret R_T (up to polylog factors):
- Linear kernel: R_T = O*(d √T)
- Squared exponential kernel: R_T = O*(√(T (log T)^{d+1}))
- Matérn kernel (ν > 1): R_T = O*(T^{(ν + d(d+1)) / (2ν + d(d+1))})
Figure: Functions drawn from a GP with squared exponential kernel (lengthscale=0.2) Credit: Srinivas et al. 2010
Experimental Setup
GP-UCB Results
Proofs / Mathematical Analysis:
- The concepts of information gain and regret bounds are analyzed, and their relations are captured in the theorems above:
● Regret bounds for a finite domain
● Regret bounds for a general compact + convex space
● Bounding the information gain
- The paper uses these tools (regret and information gain) to derive a convergence rate for the GP-UCB algorithm
Experimental Design:
- Synthetic and real experimental data are used to test the algorithm
- GP-UCB is found to perform at least on par with existing approaches, which do not come with regret bounds
- The results are encouraging, as they illustrate exploration/exploitation trade-offs for complex functions
Exploiting Structure for Bayesian Optimization
Freeze-Thaw Bayesian Optimization
K. Swersky, J. Snoek, R.P. Adams (2014)
Presentation by: Shu Jian (Eddie) Du, Romina Abachi, William Saunders
- Human experts tend to stop model training halfway if the loss curve looks bad.
- As Snoek et al. (2012) alluded to, we would like to leverage partial information (before a model finishes training) to determine which points to evaluate next.
Big Idea
Demo:
https://github.com/esdu/misc/raw/master/csc2541/demo1.pdf
Code:
https://github.com/esdu/misc/blob/master/csc2541/csc2541_ftbo_pres_demo.ipynb
Are we done?
- We could model all N training curves over all T time steps jointly using a single GP with the exponential decay kernel.
- However, since a GP takes cubic time to fit and we have N*T data points, this would run in O((N*T)^3) time. That is way too slow!
- The paper proposes a generative model to speed this up (a sketch of the kernel follows below).
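A sketch of the exponential decay kernel the paper places over training iterations, k(t, t') = β^α / (t + t' + β)^α; the α, β values below are arbitrary:

```python
# Exponential decay kernel over training-curve time steps: it encodes the prior
# belief that training losses decay roughly exponentially toward an asymptote.
import numpy as np

def exp_decay_kernel(t, t_prime, alpha=1.0, beta=0.5):
    t = np.asarray(t, dtype=float)[:, None]
    t_prime = np.asarray(t_prime, dtype=float)[None, :]
    return beta ** alpha / (t + t_prime + beta) ** alpha

# Covariance between the first 5 epochs of a training curve.
steps = np.arange(1, 6)
print(np.round(exp_decay_kernel(steps, steps), 3))
```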
A more efficient way
Global GP
Joint distribution
Figure: block structure of the joint covariance matrix. The N latent per-hyperparameter values and the (at most N*T) training-curve observations are jointly Gaussian: an N x N block for the global GP over hyperparameters, N x (N*T) cross-covariance blocks, and a structured (at most N*T) x (at most N*T) block over the individual curves, whose form the model exploits for efficient inference.
Marginal likelihood
Posterior distribution
Posterior predictive distribution
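As a much simplified concrete instance of the last item, here is plain GP conditioning used to extrapolate a single training curve under the exponential decay kernel above; this is only the per-curve piece with a hypothetical constant prior mean, not the paper's full joint model:

```python
# Posterior predictive for one training curve: condition on losses observed at
# early epochs and extrapolate later epochs under the exponential decay kernel.
import numpy as np

def k_decay(a, b, alpha=1.0, beta=0.5):
    return beta ** alpha / (np.asarray(a, float)[:, None]
                            + np.asarray(b, float)[None, :] + beta) ** alpha

def curve_posterior(t_obs, y_obs, t_new, mean=0.0, noise=1e-3):
    K = k_decay(t_obs, t_obs) + noise * np.eye(len(t_obs))
    K_star = k_decay(t_new, t_obs)
    post_mean = mean + K_star @ np.linalg.solve(K, y_obs - mean)
    post_cov = k_decay(t_new, t_new) - K_star @ np.linalg.solve(K, K_star.T)
    return post_mean, post_cov

# Observe epochs 1-5 of a synthetic decaying loss, then extrapolate epochs 6-20.
t_obs = np.arange(1, 6)
y_obs = 0.3 + 0.7 * np.exp(-0.5 * t_obs)
mu, cov = curve_posterior(t_obs, y_obs, np.arange(6, 21), mean=0.3)
```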
Aside: To Derive These...
Demo:
https://github.com/esdu/misc/raw/master/csc2541/demo2.pdf
Code:
https://github.com/esdu/misc/blob/master/csc2541/csc2541_ftbo_pres_demo.ipynb
Which acquisition function to use?
Expected Improvement:
- μ(x) and v(x): the posterior mean and variance of the probabilistic model evaluated at x
- Choose the point that, in expectation, will have the greatest improvement over the best known point:
a_EI(x) = √v(x) (γ(x) Φ(γ(x)) + φ(γ(x))), where γ(x) = (f(x_best) - μ(x)) / √v(x) for minimization
- Assumes that after querying, either the best known point or the queried point will be the optimum
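A sketch of this acquisition for the minimization case (loss curves), computed from the posterior mean μ(x) and variance v(x) with the standard normal CDF and PDF; y_best stands for the best (lowest) value observed so far:

```python
# Expected Improvement for minimization: expected amount by which a candidate
# improves on the best observed value, under the model's posterior at x.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, v, y_best):
    sigma = np.sqrt(np.maximum(v, 1e-12))
    gamma = (y_best - mu) / sigma                      # standardized improvement
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

# Candidates with lower predicted mean or higher uncertainty score higher.
mu = np.array([0.30, 0.25, 0.40])
v = np.array([0.01, 0.04, 0.09])
print(expected_improvement(mu, v, y_best=0.28))
```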
Acquisition Function: Entropy Search
- Idea: how much information does evaluating a new point give us about the location of the minimum?
Observing a point on a related task can never reveal more information than
sampling the same point on the target task
But, it can be better when information per unit cost is taken into account
Acquisition Function: Taking Cost Into Account