PAC-Bayesian Learning: an overview
Benjamin Guedj
Inria Lille - Nord Europe
https://bguedj.github.io
6 PAC: Making PAC Learning great again
1. Active
2. Sequential
3. Structure-aware
4. Efficient
5. Ideal
6. Safe
A mathematical theory of learning: towards AI
{Statistical,Machine} learning: devise automatic procedures to
infer general rules from data.
In the (rather not so?) long term: mimic the inductive functioning
of the human brain to develop an artificial intelligence.
Learning in a nutshell
Collect data $D_n = (X_i, Y_i)_{i=1}^n$, distributed as a random variable
$(X, Y) \in \mathcal{X} \times \mathcal{Y}$. Data may be incomplete (unsupervised setting,
missing inputs), collected sequentially / actively, etc.
Bayesian learning in a nutshell
Let F be a set of candidate functions equipped with a probability
measure π (the prior). Let f be the (known) density of the (assumed)
distribution of (X, Y), and define the posterior
$$\hat\rho(\mathrm{d}\phi) \propto \pi(\mathrm{d}\phi) \prod_{i=1}^{n} f(X_i, Y_i \mid \phi)$$
(prior × likelihood). Predict with, e.g.,
- the mean $\hat\phi = \mathbb{E}_{\hat\rho}\,\phi = \int_F \phi\,\hat\rho(\mathrm{d}\phi)$,
- a realization $\hat\phi \sim \hat\rho$,
- ...
Quasi-Bayesian learning in a nutshell
A.k.a. generalized Bayes.
Let F be a set of candidate functions equipped with a probability
measure π (the prior). Let λ > 0, and define the quasi-posterior
$$\hat\rho_\lambda(\mathrm{d}\phi) \propto \exp\left(-\lambda r_n(\phi)\right)\pi(\mathrm{d}\phi),$$
where $r_n$ is an empirical risk: no likelihood is required. Model-free learning!
- ...
Why quasi-Bayes?
The quasi-posterior minimizes a risk / complexity trade-off:
$$\hat\rho_\lambda \in \operatorname*{arg\,inf}_{\rho \ll \pi} \left\{ \int_F r_n(\phi)\,\rho(\mathrm{d}\phi) + \frac{\mathcal{K}(\rho, \pi)}{\lambda} \right\},$$
where $\mathcal{K}(\rho, \pi)$ denotes the Kullback–Leibler divergence between ρ and π.
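This variational characterization is classical (Donsker–Varadhan; see also Catoni, 2007). A short derivation sketch in LaTeX, stated under the assumption ρ ≪ π:

```latex
% For any \rho \ll \pi, with Z := \int_F e^{-\lambda r_n}\,\mathrm{d}\pi,
\int_F r_n\,\mathrm{d}\rho + \frac{\mathcal{K}(\rho,\pi)}{\lambda}
  = \frac{\mathcal{K}(\rho,\hat\rho_\lambda)}{\lambda} - \frac{\log Z}{\lambda},
\qquad
\hat\rho_\lambda(\mathrm{d}\phi) = \frac{e^{-\lambda r_n(\phi)}\,\pi(\mathrm{d}\phi)}{Z}.
% Since \mathcal{K}(\rho,\hat\rho_\lambda) \ge 0, with equality iff \rho = \hat\rho_\lambda,
% the Gibbs measure \hat\rho_\lambda is the unique minimizer.
```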
Statistical aggregation revisited
$$\hat\phi_\lambda := \mathbb{E}_{\hat\rho_\lambda}\,\phi = \int_F \phi\,\hat\rho_\lambda(\mathrm{d}\phi) = \frac{\int_F \phi \exp\left(-\lambda r_n(\phi)\right)\pi(\mathrm{d}\phi)}{\int_F \exp\left(-\lambda r_n(\phi)\right)\pi(\mathrm{d}\phi)} = \sum_{i=1}^{\#F} \underbrace{\frac{\exp(-\lambda r_n(\phi_i))\,\pi(\phi_i)}{\sum_{j=1}^{\#F} \exp(-\lambda r_n(\phi_j))\,\pi(\phi_j)}}_{\omega_{\lambda,i}} \phi_i, \quad \text{if } |F| < +\infty.$$
G. (2013). Agrégation d'estimateurs et de classificateurs : théorie et méthodes, Ph.D. thesis, Université Pierre et Marie Curie.
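In the finite case the weights $\omega_{\lambda,i}$ are a softmax of the empirical risks. A minimal numerical sketch (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def aggregate(predictions, empirical_risks, prior, lam):
    """Exponential-weights aggregation: returns sum_i omega_i * phi_i.

    predictions: (|F|, m) array, candidate predictions at m points
    empirical_risks: (|F|,) array, r_n(phi_i)
    prior: (|F|,) array, pi(phi_i), summing to 1
    lam: float, temperature lambda > 0
    """
    log_w = -lam * empirical_risks + np.log(prior)
    log_w -= log_w.max()          # shift for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                  # omega_{lambda, i}
    return w @ predictions        # hat{phi}_lambda

# Toy usage: three constant predictors of a noisy target around 0.3.
rng = np.random.default_rng(0)
y = 0.3 + 0.1 * rng.standard_normal(50)
candidates = np.array([[0.0], [0.3], [1.0]]) * np.ones((3, 50))
risks = ((candidates - y) ** 2).mean(axis=1)
phi_hat = aggregate(candidates, risks, np.ones(3) / 3, lam=20.0)
print(phi_hat[:3])                # close to the best candidate, 0.3
```

As λ grows the aggregate concentrates on the empirical risk minimizer; as λ → 0 it returns the prior mean.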
PAC learning in a nutshell
Probably Approximately Correct (PAC) oracle inequalities /
generalization bounds and empirical bounds: statements that hold
with probability at least 1 − δ over the draw of the sample.
Valiant (1984). A theory of the learnable, Communications of the ACM.
The PAC-Bayesian theory
...consists in producing PAC bounds for quasi-Bayesian learning
algorithms.
While PAC bounds focus on estimators θ̂n that are obtained as
functionals of the sample and for which the risk R is small, the
PAC-Bayesian approach studies an aggregation distribution ρ̂n that
depends on the sample, for which $\int R(\theta)\,\hat\rho_n(\mathrm{d}\theta)$ is small.
Shawe-Taylor and Williamson (1997). A PAC analysis of a Bayes estimator, COLT.
Audibert (2004). Une approche PAC-bayésienne de la théorie statistique de l'apprentissage, Ph.D. thesis, Université Paris 6.
Catoni (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning, IMS Lecture Notes.
Dalalyan and Tsybakov (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity, Machine Learning.
A flexible and powerful framework (1/2)
Alquier and Wintenberger (2012). Model selection for weakly dependent time series forecasting, Bernoulli.
Seldin, Laviolette, Cesa-Bianchi, Shawe-Taylor and Auer (2012). PAC-Bayesian inequalities for martingales, IEEE Transactions on Information Theory.
Alquier and Biau (2013). Sparse Single-Index Model, Journal of Machine Learning Research.
G. and Alquier (2013). PAC-Bayesian Estimation and Prediction in Sparse Additive Models, Electronic Journal of Statistics.
Alquier and G. (2017). An Oracle Inequality for Quasi-Bayesian Non-Negative Matrix Factorization, Mathematical Methods of Statistics.
Dziugaite and Roy (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, UAI.
Dziugaite and Roy (2018). Data-dependent PAC-Bayes priors via differential privacy, NIPS.
A flexible and powerful framework (2/2)
Rivasplata, Parrado-Hernandez, Shawe-Taylor, Sun and Szepesvari (2018). PAC-Bayes bounds for stable algorithms with instance-dependent priors, NIPS.
G. and Robbiano (2018). PAC-Bayesian High Dimensional Bipartite Ranking, Journal of Statistical Planning and Inference.
Li, G. and Loustau (2018). A Quasi-Bayesian perspective to Online Clustering, Electronic Journal of Statistics.
G. and Li (2018). Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly, arXiv preprint.
Bégin, Germain, Laviolette and Roy (2016). PAC-Bayesian bounds based on the Rényi divergence, AISTATS.
Alquier and G. (2018). Simpler PAC-Bayesian bounds for hostile data, Machine Learning.
Existing implementation: PAC-Bayes in the real world
- (Transdimensional) MCMC (see the sketch after this list)
  G. and Alquier (2013). PAC-Bayesian Estimation and Prediction in Sparse Additive Models, Electronic Journal of Statistics.
  Alquier and Biau (2013). Sparse Single-Index Model, Journal of Machine Learning Research.
  Li, G. and Loustau (2018). A Quasi-Bayesian perspective to Online Clustering, Electronic Journal of Statistics.
  G. and Robbiano (2018). PAC-Bayesian High Dimensional Bipartite Ranking, Journal of Statistical Planning and Inference.
- Stochastic optimization
  Alquier and G. (2017). An Oracle Inequality for Quasi-Bayesian Non-Negative Matrix Factorization, Mathematical Methods of Statistics.
  G. and Li (2018). Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly, arXiv preprint.
- Variational Bayes
  Alquier, Ridgway and Chopin (2016). On the properties of variational approximations of Gibbs posteriors, Journal of Machine Learning Research.
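To make the MCMC route concrete, here is a minimal random-walk Metropolis sketch targeting a Gibbs quasi-posterior ρ̂λ ∝ exp(−λ rn) π. It is a toy illustration under assumptions of my choosing (Gaussian prior, squared loss); names are not from the cited papers.

```python
import numpy as np

def gibbs_mh(r_n, log_prior, lam, theta0, n_iter=5000, step=0.1, seed=0):
    """Random-walk Metropolis for rho_lambda(theta) ∝ exp(-lam*r_n(theta)) * pi(theta)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    log_target = lambda t: -lam * r_n(t) + log_prior(t)
    cur, samples = log_target(theta), []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.shape)
        new = log_target(prop)
        if np.log(rng.uniform()) < new - cur:   # Metropolis accept/reject
            theta, cur = prop, new
        samples.append(theta.copy())
    return np.array(samples)

# Toy usage: linear regression with squared loss and a standard Gaussian prior.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.standard_normal(100)
r_n = lambda t: np.mean((y - X @ t) ** 2)
log_prior = lambda t: -0.5 * t @ t
draws = gibbs_mh(r_n, log_prior, lam=50.0, theta0=np.zeros(2))
print(draws[2500:].mean(axis=0))   # aggregated (mean) predictor, near (1.0, -0.5)
```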
(intermediary) take-home message
A unified PAC-Bayesian framework
Alquier and G. (2018). Simpler PAC-Bayesian bounds for hostile data, Machine Learning.
Motivation: towards an agnostic learning theory
PAC-Bayesian bounds are a key justification in stat/ML for using
Bayesian-flavored learning algorithms in several settings:
high-dimensional bipartite ranking, non-negative matrix factorization, sequential learning of principal curves, online
clustering, single-index models, high-dimensional additive regression, domain adaptation, neural networks, ...
Context: PAC bounds for heavy-tailed random variables
The next big thing (≥ 2015):
- PAC bounds for the (penalized) ERM, without an exponential-moment assumption, via the small-ball property:
  Mendelson (2015). Learning without concentration, Journal of the ACM.
  Lecué and Mendelson (2016). Regularization and the small-ball method, The Annals of Statistics.
  Grünwald and Mehta (2016). Fast rates for general unbounded loss functions, arXiv preprint.
  Lugosi and Mendelson (2018). Risk minimization by median-of-means tournaments, Journal of the European Mathematical Society.
  Lugosi and Mendelson (2017). Regularization, sparse recovery, and median-of-means tournaments, arXiv preprint.
  Lecué and Lerasle (2017). Learning from MoM's principles, arXiv preprint.
- PAC(-Bayesian) bounds for non-i.i.d. / dependent data:
  Ralaivola, Szafranski and Stempfel (2010). Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary β-mixing processes, Journal of Machine Learning Research.
  Seldin, Laviolette, Cesa-Bianchi, Shawe-Taylor and Auer (2012). PAC-Bayesian inequalities for martingales, IEEE Transactions on Information Theory.
  Alquier and Wintenberger (2012). Model selection for weakly dependent time series forecasting, Bernoulli.
  Agarwal and Duchi (2013). The generalization ability of online algorithms for dependent data, IEEE Transactions on Information Theory.
  Kuznetsov and Mohri (2014). Generalization bounds for time series prediction with non-stationary processes, ALT.
Disclaimer
The strategy I'm about to describe yields, at best, the same rates
as those existing in known settings. Its merit is generality: a single
argument covers heavy-tailed and dependent data.
Notation
Θ is a set of candidate parameters, equipped with a prior π. For θ ∈ Θ,
R(θ) denotes the (unknown) risk, $\ell_i(\theta)$ the loss on the i-th observation,
$r_n = \frac{1}{n}\sum_{i=1}^n \ell_i$ the empirical risk, and ρ a generic
(possibly data-dependent) distribution over Θ.
Key quantities
Definition
For any p ≥ 1, with $\phi_p : x \mapsto x^p$, let
$$\mathcal{M}_{\phi_p,n} = \int \mathbb{E}\left(|r_n(\theta) - R(\theta)|^p\right)\pi(\mathrm{d}\theta).$$
Definition
Let f be a convex function with f(1) = 0. Csiszár's f-divergence
between ρ and π is defined by
$$D_f(\rho, \pi) = \int f\left(\frac{\mathrm{d}\rho}{\mathrm{d}\pi}\right)\mathrm{d}\pi$$
when ρ ≪ π (and +∞ otherwise).
Theorem (Alquier and G., 2018)
Fix p > 1, $q = \frac{p}{p-1}$ and δ ∈ (0, 1). With probability at least 1 − δ
we have, for any distribution ρ,
$$\left|\int R\,\mathrm{d}\rho - \int r_n\,\mathrm{d}\rho\right| \le \left(\frac{\mathcal{M}_{\phi_q,n}}{\delta}\right)^{1/q}\left(D_{\phi_p - 1}(\rho, \pi) + 1\right)^{1/p}.$$
Proof
Let $\Delta_n(\theta) := |r_n(\theta) - R(\theta)|$. Then
$$\left|\int R\,\mathrm{d}\rho - \int r_n\,\mathrm{d}\rho\right| \le \int \Delta_n\,\mathrm{d}\rho = \int \Delta_n \frac{\mathrm{d}\rho}{\mathrm{d}\pi}\,\mathrm{d}\pi$$
$$\le \left(\int \Delta_n^q\,\mathrm{d}\pi\right)^{\frac{1}{q}} \left(\int \left(\frac{\mathrm{d}\rho}{\mathrm{d}\pi}\right)^p \mathrm{d}\pi\right)^{\frac{1}{p}} \quad \text{(Hölder ineq.)}$$
$$\le \left(\frac{\mathbb{E}\int \Delta_n^q\,\mathrm{d}\pi}{\delta}\right)^{\frac{1}{q}} \left(\int \left(\frac{\mathrm{d}\rho}{\mathrm{d}\pi}\right)^p \mathrm{d}\pi\right)^{\frac{1}{p}} \quad \text{(Markov, w.p. } 1 - \delta\text{)}$$
$$= \left(\frac{\mathcal{M}_{\phi_q,n}}{\delta}\right)^{\frac{1}{q}} \left(D_{\phi_p - 1}(\rho, \pi) + 1\right)^{\frac{1}{p}}.$$
Inspired by:
Bégin, Germain, Laviolette and Roy (2016). PAC-Bayesian bounds based on the Rényi divergence, AISTATS.
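The p = q = 2 instance (where $D_{\phi_2-1}$ is the χ² divergence) can be sanity-checked numerically. A toy sketch under assumptions of my choosing (finite Θ, i.i.d. Bernoulli data, quadratic loss); nothing here comes from the paper:

```python
import numpy as np

# Check: with prob. >= 1 - delta,
# |int R drho - int r_n drho| <= sqrt(M_{phi_2,n}/delta) * sqrt(chi2(rho,pi) + 1).
rng = np.random.default_rng(0)
theta = np.linspace(0.0, 1.0, 21)                   # finite Theta: candidate means
pi = np.ones(21) / 21                               # uniform prior
n, delta, trials, p_true = 200, 0.1, 2000, 0.3
R = (theta - p_true) ** 2 + p_true * (1 - p_true)   # true risk E[(Y - theta)^2]

sims = rng.binomial(1, p_true, size=(trials, n))    # i.i.d. samples
r_n = ((sims[:, :, None] - theta) ** 2).mean(axis=1)        # (trials, 21)
M = ((r_n - R) ** 2).mean(axis=0) @ pi                      # M_{phi_2, n}

# A data-dependent rho: point mass on the empirical risk minimizer.
rho = np.zeros((trials, 21))
rho[np.arange(trials), r_n.argmin(axis=1)] = 1.0
chi2 = ((rho - pi) ** 2 / pi).sum(axis=1)                   # D_{phi_2 - 1}(rho, pi)
gap = np.abs(rho @ R - (rho * r_n).sum(axis=1))
bound = np.sqrt(M / delta * (chi2 + 1))
print("coverage:", (gap <= bound).mean(), ">= 1 - delta =", 1 - delta)
```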
We can compare $\int r_n\,\mathrm{d}\rho$ (observable) to $\int R\,\mathrm{d}\rho$ (unknown, the
objective) in terms of
- the moment $\mathcal{M}_{\phi_q,n}$ (which depends on the distribution of the data),
- and the divergence $D_{\phi_p - 1}(\rho, \pi)$ (which is a measure of the complexity of the set Θ).
Computing the divergence term (discrete case)
Computing the divergence term (continuous case)
Assume that there exists d > 0 such that for any γ > 0,
$$\pi\left\{\theta \in \Theta : r_n(\theta) \le \inf_{\theta' \in \Theta} r_n(\theta') + \gamma\right\} \ge \gamma^d.$$
Fix p > 1, $q = \frac{p}{p-1}$, δ ∈ (0, 1) and
$$\pi_\gamma(\mathrm{d}\theta) \propto \pi(\mathrm{d}\theta)\,\mathbf{1}\left[r_n(\theta) - r_n(\hat\theta_{\mathrm{ERM}}) \le \gamma\right].$$
Writing $A_\gamma$ for the event above, $\frac{\mathrm{d}\pi_\gamma}{\mathrm{d}\pi} = \frac{\mathbf{1}_{A_\gamma}}{\pi(A_\gamma)}$, so that
$D_{\phi_p - 1}(\pi_\gamma, \pi) + 1 = \pi(A_\gamma)^{1-p} \le \gamma^{-d(p-1)}$.
Bounding the moment $\mathcal{M}_{\phi_q,n}$: the i.i.d. case
Assume that
$$s^2 = \int \mathrm{Var}\left[\ell_1(\theta)\right] \pi(\mathrm{d}\theta) < +\infty;$$
then (for $1 < q \le 2$, by Jensen)
$$\mathcal{M}_{\phi_q,n} \le \left(\frac{s^2}{n}\right)^{q/2}.$$
So, with probability at least 1 − δ,
$$\int R\,\mathrm{d}\rho \le \int r_n\,\mathrm{d}\rho + \frac{\left(D_{\phi_p - 1}(\rho, \pi) + 1\right)^{1/p}}{\delta^{1/q}} \sqrt{\frac{s^2}{n}}.$$
This rate cannot be improved without further assumptions.
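For q = 2 the moment bound is a direct variance computation (i.i.d. sample, $r_n = \frac{1}{n}\sum_{i=1}^n \ell_i$, Fubini):

```latex
\mathcal{M}_{\phi_2,n}
  = \int \mathbb{E}\!\left[\left(r_n(\theta) - R(\theta)\right)^2\right]\pi(\mathrm{d}\theta)
  = \int \operatorname{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n}\ell_i(\theta)\right]\pi(\mathrm{d}\theta)
  = \frac{1}{n}\int \operatorname{Var}\left[\ell_1(\theta)\right]\pi(\mathrm{d}\theta)
  = \frac{s^2}{n}.
```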
Bounding the moment $\mathcal{M}_{\phi_q,n}$: the dependent case
Definition
The α-mixing coefficients between two σ-algebras $\mathcal{F}$ and $\mathcal{G}$ are
defined by
$$\alpha(\mathcal{F}, \mathcal{G}) = \sup_{A \in \mathcal{F},\, B \in \mathcal{G}} \left| P(A \cap B) - P(A)P(B) \right|.$$
Define
$$\alpha_j = \alpha\left[\sigma(X_0, Y_0), \sigma(X_j, Y_j)\right].$$
When the future of the series depends strongly on the past, the $\alpha_j$
remain constant or decay slowly. When the near future is almost
independent of the past, the $\alpha_j$ decay quickly to 0.
Bounding the moment $\mathcal{M}_{\phi_q,n}$: the dependent case
Bounded case: assume $0 \le \ell \le 1$ and $(X_i, Y_i)_{i \in \mathbb{Z}}$ is a stationary
process which satisfies $\sum_{j \in \mathbb{Z}} \alpha_j < \infty$. Then
$$\mathcal{M}_{\phi_2,n} \le \frac{1}{n} \sum_{j \in \mathbb{Z}} \alpha_j.$$
In the unbounded case (for suitable exponents r and s), one then gets
$$\mathcal{M}_{\phi_2,n} \le \frac{1}{n} \int \left\{\mathbb{E}\left[\ell_i^s(\theta)\right]\right\}^{2/s} \pi(\mathrm{d}\theta) \sum_{j \in \mathbb{Z}} \alpha_j^{1/r}.$$
Example
Consider autoregression with quadratic loss and linear predictors:
$X_i = (1, Y_{i-1}) \in \mathbb{R}^2$, $\Theta = \mathbb{R}^2$ and $f_\theta(\cdot) = \langle \theta, \cdot \rangle$. Let
$$\nu^2 = 32\, \mathbb{E}\left[Y_i^6\right]^{2/3} \left(\sum_{j \in \mathbb{Z}} \alpha_j^{1/3}\right) \int \left(1 + 4\|\theta\|^6\right) \pi(\mathrm{d}\theta).$$
PAC-Bayesian bounds to elicit new learning algorithms
Definition
We define $\bar{r}_n = \bar{r}_n(\delta, p)$ as
$$\bar{r}_n = \min\left\{ u \in \mathbb{R} : \int \left[u - r_n(\theta)\right]_+^q \pi(\mathrm{d}\theta) = \frac{\mathcal{M}_{\phi_q,n}}{\delta} \right\}.$$
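For a finite Θ, $\bar{r}_n$ can be computed by bisection, since $u \mapsto \int [u - r_n(\theta)]_+^q \pi(\mathrm{d}\theta)$ is continuous and nondecreasing. A minimal sketch (the finite-Θ setting and all names are my own choices):

```python
import numpy as np

def r_bar(risks, pi, M, delta, q, tol=1e-10):
    """Smallest u with sum_i [u - r_n(theta_i)]_+^q * pi_i = M / delta."""
    target = M / delta
    g = lambda u: ((np.clip(u - risks, 0.0, None) ** q) * pi).sum()
    lo = risks.min()                               # g(lo) = 0 < target
    hi = risks.max() + target ** (1.0 / q) + 1.0   # large enough: g(hi) >= target
    while hi - lo > tol:                           # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if g(mid) >= target else (mid, hi)
    return hi

# Toy usage: four candidates under a uniform prior.
risks = np.array([0.10, 0.12, 0.30, 0.45])
print(r_bar(risks, pi=np.full(4, 0.25), M=1e-3, delta=0.1, q=2.0))
```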
With probability at least 1 − δ,
$$\int R\,\mathrm{d}\hat\rho_n \le \bar{r}_n \le \inf_\rho \left\{ \int R\,\mathrm{d}\rho + 2\left(\frac{\mathcal{M}_{\phi_q,n}}{\delta}\right)^{1/q} \left(D_{\phi_p - 1}(\rho, \pi) + 1\right)^{1/p} \right\}.$$
Assume that there exists d > 0 such that for any γ > 0,
$$\pi\left\{\theta \in \Theta : r_n(\theta) \le \inf_{\theta' \in \Theta} r_n(\theta') + \gamma\right\} \ge \gamma^d.$$
As before, this margin assumption allows to control the divergence term explicitly.
Highlights
- 6 PAC
- 2 ANR-funded projects for the period 2019–2023:
  - APRIORI: representation learning and deep neural networks, with PAC-Bayes
  - BEAGLE (PI): agnostic learning, with PAC-Bayes
- H2020 European Commission project PERF-AI: machine learning algorithms (including PAC-Bayes) applied to aviation
We are hiring!