Cheat Sheet 2
Probability basics
• Bayes rule: $p(x\mid y)\,p(y)=p(y\mid x)\,p(x)$
• Marginalization: $p(y)=\int p(x,y)\,dx\;(=\mathbb{E}_x[p(y\mid x)])$
• Odds: $\mathrm{odds}(x)=P(x)/(1-P(x))$
• Entropy: $H(X)=-\int p(x)\log p(x)\,dx$; Gaussian $X\sim\mathcal N(\mu,\Sigma)$: $H(X)=\tfrac12\log\big((2\pi e)^d|\Sigma|\big)$
• Mutual information: $I(X;Y)=H(X)-H(X\mid Y)=I(Y;X)$; $H(X\mid Y)=H(X)$ if $X\perp Y$

Bayesian Linear Regression
• Model: $y=w^\top x+\varepsilon$, $\varepsilon\sim\mathcal N(0,\sigma_n^2)$; prior $p(w)=\mathcal N(0,\sigma_p^2 I)$ (gives ridge); a Laplace prior $\mathrm{Lap}(0,b)$ gives the lasso
• OLS/MLE: $\hat w_{\mathrm{MLE}}=\arg\max_w p(y\mid X,w)=(X^\top X)^{-1}X^\top y$ (exists if the features are not collinear, i.e. linearly independent)
• Posterior: $p(w\mid X,y)=\mathcal N(\bar\mu,\bar\Sigma)$ with $\bar\Sigma=(\sigma_n^{-2}X^\top X+\sigma_p^{-2}I)^{-1}$, $\bar\mu=\sigma_n^{-2}\bar\Sigma X^\top y$ (MAP = ridge regression)
• Predict: $p(y^*\mid X,y,x^*)=\mathcal N\big(\bar\mu^\top x^*,\;x^{*\top}\bar\Sigma x^*+\sigma_n^2\big)$; $x^{*\top}\bar\Sigma x^*$ is the epistemic part, $\sigma_n^2$ the aleatoric ("irreducible noise") part
• Noise hyperparameter: 1) find $\hat w$ (MAP), 2) set $\hat\sigma_n^2=\tfrac1n\sum_i(y_i-\hat w^\top x_i)^2$

Bayesian Logistic Regression
• Sigmoid: $\sigma(z)=1/(1+\exp(-z))$, $\sigma(-z)=1-\sigma(z)$, $\sigma'(z)=\sigma(z)(1-\sigma(z))$
• MAP: $\hat w=\arg\min_w\sum_i\log\big(1+\exp(-y_i w^\top x_i)\big)+\lambda\lVert w\rVert_2^2$ (use $\lVert w\rVert_1$ for a Laplace prior)
• Laplace approximation: $q(w)=\mathcal N(\hat w,\Lambda^{-1})$ with $\Lambda=\sum_i x_i x_i^\top\,\pi_i(1-\pi_i)=X^\top\mathrm{diag}\big(\pi_1(1-\pi_1),\dots,\pi_n(1-\pi_n)\big)X$ (plus the prior term), $\pi_i=\sigma(\hat w^\top x_i)$
• Predict: $p(y^*\mid x^*,X,y)=\int\sigma(y^*f^*)\,\mathcal N\big(f^*;\hat w^\top x^*,\,x^{*\top}\Lambda^{-1}x^*\big)\,df^*$ (1D integral; solve numerically or by sampling)

Laplace method (general): $p(\theta\mid x,y)=\exp\big(\log p(\theta\mid x,y)\big)\approx\exp\big(\log p(\hat\theta\mid x,y)-\tfrac12(\theta-\hat\theta)^\top\Lambda(\theta-\hat\theta)\big)$, i.e. $q(\theta)=\mathcal N(\hat\theta,\Lambda^{-1})$ with $\hat\theta=\arg\max_\theta p(\theta\mid x,y)$ and $\Lambda=-\nabla^2\log p(\theta\mid x,y)\big|_{\hat\theta}$ (Hessian); only the unnormalized log-joint $\log p(\theta,y\mid x)$ is needed, which is much easier than the normalizer.
Approximate prediction: $p(y^*\mid x^*,X,y)=\int p(y^*\mid x^*,w)\,p(w\mid X,y)\,dw=\mathbb{E}_{w\mid X,y}\big[p(y^*\mid x^*,w)\big]\approx\int p(y^*\mid x^*,w)\,q(w)\,dw$

Active Learning
• Greedy selection: $x_{t+1}=\arg\max_x F(S_t\cup\{x\})$, $S_t=\{x_1,\dots,x_t\}$, with $F(S)=I(f;y_S)=\tfrac12\log\big|I+\sigma_n^{-2}K_S\big|$
• $F$ is monotone submodular: $F(A\cup\{x\})-F(A)\ge F(B\cup\{x\})-F(B)$ for $A\subseteq B$, so greedy is a constant-factor $(1-1/e)$ approximation
• Regression (homoscedastic noise) reduces to uncertainty sampling: $x_{t+1}=\arg\max_x\tfrac12\log\big(1+\sigma_t^2(x)/\sigma_n^2\big)=\arg\max_x\sigma_t(x)$ (purely explorative)
• Classification: $x_{t+1}=\arg\max_x H(Y_x\mid x_{1:t},y_{1:t})$ is misled by label noise; use BALD instead: $x_{t+1}=\arg\max_x I(Y_x;\theta\mid x_{1:t},y_{1:t})=H(Y_x\mid x_{1:t},y_{1:t})-\mathbb{E}_{\theta\sim p(\cdot\mid x_{1:t},y_{1:t})}\big[H(Y_x\mid\theta,x)\big]$ (approximate with variational inference)

Sampling / MCMC
• Continuous RV: e.g. $p(x)=\tfrac1Z\exp(-f(x))$ with energy function $f(x)$; $Z$ is intractable (e.g. the posterior $p(w\mid X,y)\propto p(w)\,p(y\mid X,w)$)
• Metropolis-Hastings: 1) propose $x'\sim R(x'\mid x)$, e.g. $x'=x+\varepsilon$ with $\varepsilon\sim\mathcal N(0,\tau I)$; 2) accept with probability $\alpha=\min\big\{1,\tfrac{Q(x')\,R(x\mid x')}{Q(x)\,R(x'\mid x)}\big\}$ and set $x\leftarrow x'$, otherwise stay at $x$ (only the unnormalized $Q\propto p$ is needed)
• With proposals from a transition matrix $R(x_i\mid x_j)=p_{ij}$ and acceptance probabilities $\alpha_{ij}=\alpha(x_i\mid x_j)$, the resulting transition matrix is $t_{ij}=p_{ij}\,\alpha_{ij}$ for $i\ne j$ and $t_{jj}=1-\sum_{i\ne j}t_{ij}$
• Gibbs sampling (an efficient proposal): $x_i^{(t+1)}\sim p\big(x_i\mid x_1^{(t+1)},\dots,x_{i-1}^{(t+1)},x_{i+1}^{(t)},\dots,x_n^{(t)}\big)$
• Burn-in: discard the initial samples, as they are not yet sampled from the stationary distribution $\pi(x)$
• Gradient-informed proposals: MALA, SGLD
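A minimal random-walk Metropolis-Hastings sketch matching the recipe above, for an unnormalized density $p(x)\propto\exp(-f(x))$; the toy target, proposal scale, chain length, and burn-in value are illustrative assumptions, not part of the sheet.

```python
# Hedged sketch: random-walk Metropolis-Hastings for p(x) ∝ exp(-f(x)).
# The Gaussian proposal is symmetric, so the acceptance ratio reduces to Q(x')/Q(x).
import numpy as np

def metropolis_hastings(f, x0, n_samples=5000, tau=0.5, burn_in=500,
                        rng=np.random.default_rng(0)):
    x = np.asarray(x0, dtype=float)
    samples = []
    for t in range(n_samples + burn_in):
        x_prop = x + tau * rng.normal(size=x.shape)    # propose x' = x + eps
        log_alpha = f(x) - f(x_prop)                   # log[Q(x')/Q(x)] = f(x) - f(x')
        if np.log(rng.uniform()) < log_alpha:          # accept with prob min{1, alpha}
            x = x_prop                                 # otherwise stay at x
        if t >= burn_in:                               # discard burn-in samples
            samples.append(x.copy())
    return np.array(samples)

# toy target: standard 2D Gaussian, f(x) = ||x||^2 / 2
samples = metropolis_hastings(lambda x: 0.5 * np.sum(x**2), x0=np.array([3.0, -3.0]))
print(samples.mean(axis=0), samples.var(axis=0))       # both roughly [0, 0] and [1, 1]
```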
More probability facts
• $\mathbb{E}[aX+bY]=a\,\mathbb{E}[X]+b\,\mathbb{E}[Y]$; $\mathbb{E}[X^\top]=\mathbb{E}[X]^\top$
• $\mathrm{Var}(X)=\mathbb{E}\big[(X-\mathbb{E}[X])(X-\mathbb{E}[X])^\top\big]=\mathbb{E}[XX^\top]-\mathbb{E}[X]\mathbb{E}[X]^\top$ (scalar: $\mathbb{E}[X^2]-\mathbb{E}[X]^2$); $\mathrm{Var}(AX)=A\,\mathrm{Var}(X)\,A^\top$
• Law of the unconscious statistician: $\mathbb{E}_x[g(x)]=\int g(x)\,p(x)\,dx$
• Law of total variance: $\mathrm{Var}(y)=\mathbb{E}_x[\mathrm{Var}(y\mid x)]+\mathrm{Var}_x\big(\mathbb{E}[y\mid x]\big)$
• Jensen's inequality: $g(\mathbb{E}[X])\le\mathbb{E}[g(X)]$ for convex $g$ (reversed for concave $g$)
• Hoeffding's inequality: $P\big(\big|\mathbb{E}[f(X)]-\tfrac1N\sum_i f(x_i)\big|>\varepsilon\big)\le 2\exp(-2N\varepsilon^2/C^2)$ for $f$ bounded in $[0,C]$
• Gauss-Markov theorem: for $y_i=w^\top x_i+\varepsilon_i$ with $\mathbb{E}[\varepsilon_i]=0$, $\mathrm{Var}(\varepsilon_i)=\sigma^2$ (uncorrelated), the OLS estimator is the best linear unbiased estimator

Kernels
• $k(x,x')=\phi(x)^\top\phi(x')$: inner product in feature space (kernel trick: everything depends on the data only through $k$)
• Closure: $k_1+k_2$, $k_1\cdot k_2$, $c\cdot k$ for $c>0$, and $f(k(x,x'))$ for $f$ a polynomial with positive coefficients or $f=\exp$ are again kernels
• Stationary if $k(x,x')=k(x-x')$; isotropic if $k(x,x')=k(\lVert x-x'\rVert)$

Variational Inference
• Goal: approximate the intractable posterior by $\hat q=\arg\min_{q\in\mathcal Q}\mathrm{KL}\big(q\,\Vert\,p(\cdot\mid x,y)\big)$
• KL is asymmetric: $\mathrm{KL}(q\Vert p)\ne\mathrm{KL}(p\Vert q)$; the forward KL $\mathrm{KL}(p\Vert q)$ tends to cover the full pdf but is costly to evaluate, so VI uses the reverse (mode-seeking) direction
• ELBO: $\log p(y)=\log\int p(y,\theta)\,d\theta\ge\mathbb{E}_{\theta\sim q}\big[\log\tfrac{p(y,\theta)}{q(\theta)}\big]$ (Jensen, $\log$ concave) $=\mathbb{E}_{\theta\sim q}[\log p(y,\theta)]+H(q)=\mathbb{E}_{\theta\sim q}[\log p(y\mid\theta)]-\mathrm{KL}\big(q\,\Vert\,p(\theta)\big)=:\mathrm{ELBO}(q)$
• $\arg\min_q\mathrm{KL}\big(q\Vert p(\cdot\mid y)\big)=\arg\max_q\mathrm{ELBO}(q)$
• Reparameterization trick: if $\theta=g(\varepsilon;\lambda)$ with $\varepsilon\sim\phi$ fixed (Gaussian: $\theta=\mu+\Sigma^{1/2}\varepsilon$, $\varepsilon\sim\mathcal N(0,I)$; also per component for Gaussian mixtures), then $\mathbb{E}_{\theta\sim q_\lambda}[f(\theta)]=\mathbb{E}_{\varepsilon\sim\phi}\big[f(g(\varepsilon;\lambda))\big]$, giving low-variance pathwise gradients w.r.t. $\lambda$
• Doubly stochastic optimization: estimate the ELBO gradient with minibatches $i\sim\mathrm{Unif}\{1,\dots,n\}$ ($p_i=\tfrac1n$) and Monte Carlo samples of $\varepsilon$

Bayesian decision theory: $a^*=\arg\min_a\mathbb{E}_{y\mid x}[C(y,a)]$, where $C(y,a)$ is the cost of the final prediction/action $a$ given model prediction $y$.

Bayesian Deep Learning
• Model: $y=f_\theta(x)+\varepsilon$; homoscedastic $\varepsilon\sim\mathcal N(0,\sigma_n^2)$; heteroscedastic $\varepsilon(x)\sim\mathcal N(0,\sigma_\theta^2(x))$ with likelihood $p(y\mid x,\theta)=\mathcal N\big(f_{\theta,1}(x),\exp(f_{\theta,2}(x))\big)$ (second output models the log-variance); classification: $\hat y=\mathrm{sign}(f_\theta(x))$ (or softmax); prior e.g. $p(\theta)=\mathcal N(0,I)$
• MAP: $\hat\theta=\arg\min_\theta-\log p(\theta)-\sum_i\log p(y_i\mid x_i,\theta)=\arg\min_\theta\ \lambda\lVert\theta\rVert_2^2+\sum_i\Big[\tfrac{(y_i-\mu_\theta(x_i))^2}{2\sigma_\theta^2(x_i)}+\tfrac12\log\sigma_\theta^2(x_i)\Big]$; the learned $\sigma_\theta(x)$ attenuates the loss on noisy points (aleatoric uncertainty)
• Hyperparameters: select via a validation performance metric, e.g. $\log p(y_{\mathrm{val}}\mid x_{\mathrm{val}})$
• The weight posterior is intractable; approximate inference options (VI / Bayes by backprop, dropout, ensembles, MCMC) are listed further below
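A small numpy illustration of the reparameterization trick and the ELBO from the Variational Inference section above, on a toy conjugate model (unknown Gaussian mean $\theta$ with prior $\mathcal N(0,1)$, likelihood $y_i\sim\mathcal N(\theta,1)$, variational family $q=\mathcal N(\mu,s^2)$); the toy model and sample sizes are assumptions for illustration only.

```python
# Hedged sketch: Monte Carlo ELBO via the reparameterization trick, for a toy
# model  theta ~ N(0,1),  y_i | theta ~ N(theta,1),  q(theta) = N(mu, s^2).
import numpy as np

def log_joint(theta, y):
    """log p(y, theta) = log p(theta) + sum_i log p(y_i | theta), up to constants."""
    return -0.5 * theta**2 - 0.5 * np.sum((y[:, None] - theta) ** 2, axis=0)

def elbo_mc(mu, s, y, n_mc=1000, rng=np.random.default_rng(0)):
    eps = rng.normal(size=n_mc)
    theta = mu + s * eps                                # reparameterization theta = g(eps; mu, s)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)     # H(q) for a Gaussian
    return log_joint(theta, y).mean() + entropy         # E_q[log p(y,theta)] + H(q)

y = np.random.default_rng(1).normal(loc=2.0, size=50)
# For this conjugate model the exact posterior is N(sum(y)/(n+1), 1/(n+1)):
mu_post, s_post = y.sum() / (len(y) + 1), (1.0 / (len(y) + 1)) ** 0.5
print(elbo_mc(mu_post, s_post, y), elbo_mc(0.0, 1.0, y))  # ELBO is larger at the posterior
```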
Gaussian Processes
• Definition: $f\sim\mathcal{GP}(\mu,k)$ iff every finite marginal $[f(x_1),\dots,f(x_n)]$ is jointly Gaussian; typically $\mu(x)=0$; weight-space view: $p(w)=\mathcal N(0,I)$ with $f(x)=w^\top\phi(x)$
• Joint of training outputs $y=[y_1,\dots,y_n]^\top$ and test value $f^*$: $p\big([f;f^*]\mid X,x^*\big)=\mathcal N\Big(\begin{bmatrix}\mu\\\mu^*\end{bmatrix},\begin{bmatrix}K & k_{x^*}\\ k_{x^*}^\top & k(x^*,x^*)\end{bmatrix}\Big)$ with $K_{ij}=k(x_i,x_j)$, $k_{x^*}=[k(x^*,x_1),\dots,k(x^*,x_n)]^\top$
• Predict: $p(y^*\mid x_{1:n},y_{1:n},x^*)=\mathcal N\big(\mu_t(x^*),\,k_t(x^*,x^*)+\sigma_n^2\big)$ with $\mu_t(x^*)=k_{x^*}^\top(K+\sigma_n^2 I)^{-1}y$ and $k_t(x^*,x^*)=k(x^*,x^*)-k_{x^*}^\top(K+\sigma_n^2 I)^{-1}k_{x^*}$
• Kernel hyperparameters: $\hat\theta=\arg\max_\theta p(y_{1:n}\mid x_{1:n},\theta)$ (marginal likelihood)
• Computational cost: $O(n^3)$ for $(K_{AA}+\sigma^2 I)^{-1}$, $K_{AA}\in\mathbb{R}^{|A|\times|A|}$
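A minimal numpy sketch of the GP predictive equations above with an RBF kernel; the lengthscale, noise level, and toy data are illustrative assumptions.

```python
# Hedged sketch: GP posterior mean/variance at test points Xs, given data (X, y).
import numpy as np

def rbf(A, B, lengthscale=0.5, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xs, sigma_n=0.1):
    """Return posterior mean and variance of f* (add sigma_n^2 for y*)."""
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    k_s = rbf(X, Xs)                                   # k_{x*} for every test point, shape (n, m)
    mean = k_s.T @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xs, Xs)) - np.einsum("ij,ij->j", k_s, np.linalg.solve(K, k_s))
    return mean, var

X = np.linspace(0, 1, 8)[:, None]
y = np.sin(6 * X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=8)
Xs = np.linspace(0, 1, 50)[:, None]
mu, var = gp_posterior(X, y, Xs)
print(mu[:3], var[:3])
```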
Bayesian Optimization
• Goal: maximize an expensive-to-evaluate $f$; model it with a surrogate $g(x)\sim\mathcal{GP}(\mu,k)$ conditioned on the evaluations so far
• Algorithm: repeat 1) pick a candidate $x_t=\arg\max_x a(x)$ (acquisition function), 2) evaluate $f(x_t)$, 3) update the GP posterior
• UCB: $a(x)=\mu_t(x)+\beta_t\,\sigma_t(x)$ (optimism in the face of uncertainty); $\beta_t$ is a hyperparameter controlling exploration
• Theory: (instantaneous) regret $r_t=\max_x f(x)-f(x_t)$; sublinear cumulative regret $R_T=\sum_t r_t$ means the average regret $R_T/T\to 0$, i.e. convergence to $f(x^*)$
• Theorem (GP-UCB): for suitably chosen $\beta_t$, $\tfrac1T\sum_t\big(\max_x f(x)-f(x_t)\big)=O\big(\sqrt{\gamma_T/T}\big)$, where $\gamma_T=\max_{|S|\le T}I(f;y_S)$ is the maximum information gain
• Linear kernel: $\gamma_T=O(d\log T)$; RBF kernel: $\gamma_T=O\big((\log T)^{d+1}\big)$
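A minimal GP-UCB loop over a 1D candidate grid, illustrating the algorithm above; the kernel hyperparameters, objective, $\beta$, and iteration count are toy assumptions, and a compact GP helper is repeated here so the sketch is self-contained.

```python
# Hedged sketch: Bayesian optimization with the UCB acquisition a(x) = mu_t + beta*sigma_t.
import numpy as np

def rbf(A, B, ls=0.2):
    return np.exp(-0.5 * ((A[:, None, 0] - B[None, :, 0]) / ls) ** 2)

def gp_post(X, y, Xs, sigma_n=0.1):
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    k_s = rbf(X, Xs)
    mu = k_s.T @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ij->j", k_s, np.linalg.solve(K, k_s))   # prior variance = 1
    return mu, np.maximum(var, 1e-12)

def bayes_opt_ucb(f, grid, n_iter=15, beta=2.0):
    X, y = [grid[0]], [f(grid[0])]                     # initial evaluation
    for _ in range(n_iter):
        mu, var = gp_post(np.array(X), np.array(y), grid)
        x_next = grid[np.argmax(mu + beta * np.sqrt(var))]   # maximize the acquisition
        X.append(x_next); y.append(f(x_next))                # evaluate f and update data
    return np.array(X), np.array(y)

grid = np.linspace(0.0, 1.0, 200)[:, None]
f = lambda x: float(-(x[0] - 0.3) ** 2)                # toy objective, maximum at 0.3
X_eval, y_eval = bayes_opt_ucb(f, grid)
print(X_eval[np.argmax(y_eval)], y_eval.max())
```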
Gaussians
• Multivariate pdf: $\mathcal N(x;\mu,\Sigma)=\tfrac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\big(-\tfrac12(x-\mu)^\top\Sigma^{-1}(x-\mu)\big)$; 1D: $\tfrac{1}{\sqrt{2\pi\sigma^2}}\exp\big(-\tfrac{(x-\mu)^2}{2\sigma^2}\big)$
• Sum: if $X\perp Y$, then $X+Y\sim\mathcal N(\mu_X+\mu_Y,\Sigma_X+\Sigma_Y)$
• Marginalization: drop the corresponding blocks of $\mu$ and $\Sigma$
• Conditional: $X_1\mid X_2=x_2\sim\mathcal N(\mu_{1\mid2},\Sigma_{1\mid2})$ with $\mu_{1\mid2}=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)$, $\Sigma_{1\mid2}=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$
• Product of pdfs: $(\mathcal N_1\cdot\mathcal N_2)(z)=\tilde Z\,\mathcal N_z$ (not normalized); 1D mean $\mu_z=(\mu_1\sigma_2^2+\mu_2\sigma_1^2)/(\sigma_1^2+\sigma_2^2)$

Rejection sampling: choose a proposal $g$ and scaling factor $M$ with $M\,g(x)\ge p(x)\ \forall x$; sample $x\sim g$ and accept with probability $\tfrac{p(x)}{M\,g(x)}$; accepted samples are exact samples from $p(x)$; $\ominus$ sample inefficient for $M\gg1$

Markov chains (MCMC theory)
• Stationary distribution: $\pi^\top=\pi^\top P$ with $\lVert\pi\rVert_1=1$, where $P_{ij}=P(X_{t+1}=j\mid X_t=i)$
• Ergodicity: $\exists\,T$ s.t. every state can be reached from every state in exactly $T$ steps (equivalently, all entries of $P^n$ are $>0$ for some $n$; ensured e.g. if all $P_{ij}>0$); an ergodic MC converges to its unique stationary distribution irrespective of $P(X_0)$
• Detailed balance: $\tfrac1Z Q(x)\,P(x'\mid x)=\tfrac1Z Q(x')\,P(x\mid x')\ \forall x,x'$ $\Rightarrow$ $Q(\cdot)/Z=\pi(\cdot)$ is stationary; Metropolis-Hastings satisfies detailed balance w.r.t. $p=Q/Z$, so the normalizer is never needed; still discard a burn-in before collecting samples

Bayesian Optimization: further acquisition functions (with $f^*=\max_i f(x_i)$ the current best)
• Probability of improvement: $\mathrm{PI}(x)=P(f(x)>f^*)=\Phi\big(\tfrac{\mu(x)-f^*}{\sigma(x)}\big)$ (optionally improve over $f^*+\xi$ with hyperparameter $\xi$)
• Expected improvement: $\mathrm{EI}(x)=\mathbb{E}\big[(f(x)-f^*)_+\big]$ with $f(x)\sim\mathcal N(\mu(x),\sigma(x)^2)$: $\mathrm{EI}(x)=(\mu(x)-f^*)\,\Phi(Z)+\sigma(x)\,\phi(Z)$, $Z=\tfrac{\mu(x)-f^*}{\sigma(x)}$, $\phi=\mathcal N(0,1)$ pdf
• Thompson sampling: draw $\tilde f\sim$ posterior GP and pick $x_t=\arg\max_x\tilde f(x)$

Bayesian Deep Learning: approximate inference and prediction
• Predict: $p(y^*\mid x^*,D)=\mathbb{E}_{\theta\sim p(\theta\mid D)}\big[p(y^*\mid x^*,\theta)\big]\approx\tfrac1m\sum_j p(y^*\mid x^*,\theta_j)$ with sampled weights $\theta_j$; by the law of total variance $\mathrm{Var}(y^*)\approx\mathbb{E}_\theta[\sigma_\theta^2(x^*)]+\mathrm{Var}_\theta\big(\mu_\theta(x^*)\big)$ (aleatoric + epistemic)
1) Variational inference (Bayes by backprop): Gaussian $q(\theta)$, maximize the ELBO with the reparameterization trick and minibatch subsampling
2) Dropout as VI: $q_j(\theta_j)=p\,\delta_0(\theta_j)+(1-p)\,\delta_{\lambda_j}(\theta_j)$ (set a weight to $0$ with prob. $p$, otherwise keep $\lambda_j$); keep dropout active at test time and average $m$ forward passes, $\approx\tfrac1m\sum_j p(y^*\mid x^*,\theta_j)$; $\ominus$ sample inefficient for large $m$
3) Probabilistic ensembles: train $m$ NNs on bootstrap datasets of $D=\{(x_1,y_1),\dots,(x_n,y_n)\}$; $p(y^*\mid x^*)\approx\tfrac1m\sum_j p(y^*\mid x^*,\theta_j)$ where $\theta_j$ is the MAP estimate of NN $j$
4) MCMC (e.g. SGLD, MALA): sample weights directly from $p(\theta\mid x,y)$; $\ominus$ need to store or summarize many samples
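A small sketch of item 3) above, using bootstrapped linear-Gaussian models instead of neural networks for brevity; the data, ensemble size, and model family are illustrative assumptions.

```python
# Hedged sketch: probabilistic ensemble by bootstrapping, with the predictive
# variance split into aleatoric (mean of member noise) + epistemic (spread of member means).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = 1.5 * X[:, 0] + 0.2 * rng.normal(size=40)

def fit_member(X, y, rng):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample of D
    Xb = np.c_[np.ones(len(idx)), X[idx, 0]]
    w = np.linalg.lstsq(Xb, y[idx], rcond=None)[0]    # MAP/MLE estimate of member j
    sigma2 = np.mean((Xb @ w - y[idx]) ** 2)          # its (aleatoric) noise level
    return w, sigma2

members = [fit_member(X, y, rng) for _ in range(10)]  # m = 10 members

def predict(x_star):
    mus = np.array([w[0] + w[1] * x_star for w, _ in members])
    sig2 = np.array([s2 for _, s2 in members])
    return mus.mean(), sig2.mean() + mus.var()        # mixture mean, aleatoric + epistemic var

print(predict(0.5))
```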
Markov Decision Processes
• MDP: $M=(S,A,P,r,\gamma)$ with Markovian transitions $P=p(s'\mid s,a)$, reward $r(s,a)$, discount $\gamma\in[0,1]$
• Policy: deterministic $\pi(s)=a$; stochastic $\pi(a\mid s)\in[0,1]$ with $\sum_a\pi(a\mid s)=1$ for all $s\in S$, $a\in A$
• Objective: $J(\pi)=\mathbb{E}\big[\sum_t\gamma^t r_t\mid s_0\sim\mu,\pi\big]$
• Value function: $V^\pi(s)=\mathbb{E}\big[\sum_t\gamma^t r_t\mid s_0=s\big]$, with return $G_t=r_t+\gamma r_{t+1}+\gamma^2 r_{t+2}+\dots$
• Bellman expectation equation: $V^\pi(s)=\sum_a\pi(a\mid s)\big[r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,V^\pi(s')\big]$

Monte Carlo policy evaluation
1) Generate a trajectory $\tau=(s_0,a_0,r_0,s_1,\dots)$ following $\pi$
2) At each step $t$, compute the return $G_t$ and update $V(s_t)\leftarrow V(s_t)+\alpha_t\big(G_t-V(s_t)\big)$
$\oplus$ unbiased $\ominus$ high variance, needs full trajectories
Robbins-Monro conditions on the learning rate $\alpha_t$: $\sum_t\alpha_t=\infty$, $\sum_t\alpha_t^2<\infty$ (needed for convergence)

Temporal difference (TD) learning (on-policy)
• $V(s_t)\leftarrow V(s_t)+\alpha_t\big(r_t+\gamma V(s_{t+1})-V(s_t)\big)$: bootstrapped target $r_t+\gamma V(s_{t+1})$ instead of the full return
$\oplus$ lower variance, online $\ominus$ biased

Deep Q-learning (DQN)
• Parameterize $Q(s,a;\theta)$ with a NN and minimize $\big\lVert r+\gamma\max_{a'}Q^{\theta^-}(s',a')-Q^\theta(s,a)\big\rVert^2$ (or a Huber loss, since the targets are noisy)
• Experience replay: instead of pure online learning, store transitions $(s,a,r,s')$ and train on random minibatches; increases sample efficiency and decorrelates updates
• Target network $\theta^-$, updated slowly: stabilizes learning
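A tiny tabular TD(0) sketch matching the update rule above, on a made-up 3-state Markov reward process; the chain, rewards, learning rate, and episode count are illustrative assumptions.

```python
# Hedged sketch: tabular TD(0), V(s) += alpha * (r + gamma * V(s') - V(s)),
# on a toy chain 0 -> 1 -> 2 (terminal), with reward 1 on reaching the terminal state.
import numpy as np

def td0(n_episodes=2000, alpha=0.05, gamma=0.9, rng=np.random.default_rng(0)):
    V = np.zeros(3)                        # state 2 is terminal, V[2] stays 0
    for _ in range(n_episodes):
        s = 0
        while s != 2:
            s_next = s + 1 if rng.uniform() < 0.9 else s    # advance w.p. 0.9, else stay
            r = 1.0 if s_next == 2 else 0.0
            V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update
            s = s_next
    return V

print(td0())   # V[1] close to 1 (one step from the reward), V[0] somewhat lower
```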
Classification metrics (deterministic metrics, from the confusion matrix TP/FP/TN/FN)
• accuracy: $(TP+TN)/n$
• precision: $TP/(TP+FP)$: how reliable the positive predictions are, but the model may rarely predict positive
• recall / TPR / sensitivity: $TP/(TP+FN)=1-\mathrm{FNR}$: how many positives are caught, but can be gamed by always predicting positive
• TNR / specificity / selectivity: $TN/(TN+FP)=1-\mathrm{FPR}$
• F1 score: $2\,TP/(2\,TP+FP+FN)$

Solving MDPs exactly
• Q-function: $Q^\pi(s,a)=\mathbb{E}\big[\sum_t\gamma^t r_t\mid s_0=s,a_0=a,\pi\big]$
• Bellman optimality: $V^*(s)=\max_\pi V^\pi(s)=\max_a Q^*(s,a)$; $Q^*(s,a)=r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\max_{a'}Q^*(s',a')$
• Optimal policy: $\pi^*(s)=\arg\max_a Q^*(s,a)$; for infinite-horizon discounted MDPs an optimal stationary, deterministic policy always exists (it is not necessarily unique)
• Closed-form policy evaluation: $V^\pi=R^\pi+\gamma P^\pi V^\pi\;\Rightarrow\;V^\pi=(I-\gamma P^\pi)^{-1}R^\pi$ with $R^\pi_s=\sum_a\pi(a\mid s)\,r(s,a)$ and $P^\pi\in\mathbb{R}^{|S|\times|S|}$, $P^\pi_{s,s'}=\sum_a\pi(a\mid s)\,P(s'\mid s,a)$
• Policy iteration: evaluate $V^\pi$ (closed form or iteratively), then improve $\pi(s)\leftarrow\arg\max_a Q^\pi(s,a)$; Value iteration: initialize $V_0(s)\ \forall s\in S$, iterate $V_{t+1}(s)=\max_a\big[r(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\,V_t(s')\big]$; both converge
• Double DQN: select the $\arg\max$ action with the online network and evaluate it with the target network; solves the overestimation bias of DQN

Model-based planning (MPC)
• Problem: cannot optimize over an infinite horizon; instead optimize a finite horizon $H$: $\max_{a_{t:t+H-1}}\mathbb{E}\big[\sum_{\tau=t}^{t+H-1}\gamma^{\tau-t}r_\tau\big]$, optionally adding a terminal value $\gamma^H V(s_{t+H})$; replan after each step; $H$ trades off performance vs. computation
• Random shooting: generate $m$ random action sequences $a_{t:t+H-1}$ (optionally evaluated with Monte Carlo rollouts of the learned model), then pick the best
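A small numpy check of the closed-form evaluation $V^\pi=(I-\gamma P^\pi)^{-1}R^\pi$ against iterated Bellman expectation updates, for a made-up 2-state MDP under a fixed policy.

```python
# Hedged sketch: closed-form policy evaluation vs. fixed-point iteration on a toy chain.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2],       # P^pi_{s,s'}: transition probabilities under the policy
                 [0.3, 0.7]])
R_pi = np.array([1.0, 0.0])        # R^pi_s: expected one-step reward under the policy

V_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)   # (I - gamma*P)^{-1} R

V_iter = np.zeros(2)
for _ in range(500):               # V <- R + gamma * P V
    V_iter = R_pi + gamma * P_pi @ V_iter

print(V_closed, V_iter)            # the two agree
```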
Calibration (stochastic metrics)
• Well calibrated: $P(\hat y=y\mid\hat p=p)=p$
• Reliability diagrams: bin predictions by confidence $\hat p_i$ and plot, per bin $B_m$, $\mathrm{freq}(B_m)=\tfrac{1}{|B_m|}\sum_{i\in B_m}\mathbf 1\{\hat y_i=y_i\}$ against $\mathrm{conf}(B_m)=\tfrac{1}{|B_m|}\sum_{i\in B_m}\hat p_i$; well calibrated iff $\mathrm{freq}(B_m)=\mathrm{conf}(B_m)$; $\mathrm{conf}>\mathrm{freq}$ means the model overstates its confidence
• ECE $=\sum_m\tfrac{|B_m|}{n}\big|\mathrm{freq}(B_m)-\mathrm{conf}(B_m)\big|$

RL value targets and Bellman operators
• $n$-step return: $G_t^{(n)}=r_t+\gamma r_{t+1}+\dots+\gamma^{n-1}r_{t+n-1}+\gamma^n V(s_{t+n})$; $G_t^{(n)}-V(s_t)$ is the $n$-step TD error / advantage estimate
• TD($\lambda$): $G_t^\lambda=(1-\lambda)\sum_{n\ge1}\lambda^{n-1}G_t^{(n)}$, update $V(s_t)\leftarrow V(s_t)+\alpha_t\big(G_t^\lambda-V(s_t)\big)$; interpolates between TD(0) ($\oplus$ lower variance, $\ominus$ biased) and Monte Carlo ($\oplus$ lower bias, $\ominus$ higher variance)
• Bellman operators: $B^\pi V=R^\pi+\gamma P^\pi V$ with fixpoint $V^\pi$; $B^*$ (with the max over actions) has fixpoint $V^*$
• Discounted state visitation: $d^\pi_\mu(s)=(1-\gamma)\sum_t\gamma^t P(s_t=s\mid s_0\sim\mu,\pi)$, so $J(\pi)=\tfrac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^\pi_\mu,\,a\sim\pi(\cdot\mid s)}[r(s,a)]$

Policy Gradient (policy-based RL, vs. value-based and model-based)
• Parameterize $\pi_\theta$ and do gradient ascent: $\theta_{t+1}\leftarrow\theta_t+\eta\,\nabla_\theta J(\pi_{\theta_t})$
• $J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$ with $\tau=(s_0,a_0,r_0,\dots)$, $R(\tau)=\sum_t\gamma^t r_t$, and $\pi_\theta(\tau)=\mu(s_0)\prod_t\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t)$
• Score-function gradient: $\nabla_\theta J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\big[R(\tau)\,\nabla_\theta\log\pi_\theta(\tau)\big]=\mathbb{E}_{\tau\sim\pi_\theta}\big[R(\tau)\sum_t\nabla_\theta\log\pi_\theta(a_t\mid s_t)\big]$ (the dynamics terms do not depend on $\theta$)
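A short sketch of the binned reliability statistics and ECE defined above; the bin count and the fake predictions are illustrative assumptions.

```python
# Hedged sketch: expected calibration error from predicted confidences p_hat and
# correctness indicators, using equal-width confidence bins.
import numpy as np

def ece(p_hat, correct, n_bins=10):
    """p_hat: confidence of the predicted label; correct: 1{y_hat == y}."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total, n = 0.0, len(p_hat)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (p_hat > lo) & (p_hat <= hi)
        if mask.any():
            freq = correct[mask].mean()           # freq(B_m): empirical accuracy in the bin
            conf = p_hat[mask].mean()             # conf(B_m): mean confidence in the bin
            total += mask.sum() / n * abs(freq - conf)
    return total

rng = np.random.default_rng(0)
p_hat = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.uniform(size=1000) < p_hat**2).astype(float)   # a deliberately overconfident model
print(ece(p_hat, correct))
```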
SARSA and Q-learning
• SARSA (on-policy): $Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha_t\big(r_t+\gamma\,Q(s_{t+1},a_{t+1})-Q(s_t,a_t)\big)$
• Q-learning (off-policy): $Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha_t\big(r_t+\gamma\max_a Q(s_{t+1},a)-Q(s_t,a_t)\big)$
• Algorithm: follow any policy with sufficient exploration and apply the update for each step $(s_t,a_t,r_t,s_{t+1})$; converges if every $(s,a)$ is visited infinitely often and the Robbins-Monro conditions hold
• Exploration: $\varepsilon$-greedy $\pi(s)=\arg\max_a Q(s,a)$ with prob. $1-\varepsilon$, $\mathrm{Unif}(A)$ with prob. $\varepsilon$; softmax $\pi(a\mid s)\propto\exp(\beta\,Q(s,a))$ with temperature parameter $\beta$ (the smaller $\beta$, the more uniform / explorative)
• GLIE: greedy in the limit with infinite exploration (e.g. $\varepsilon_t\to0$) $\Rightarrow$ convergence to the optimal policy

Policy gradient estimators
1) REINFORCE: $\nabla_\theta J(\pi_\theta)\approx R(\tau)\sum_t\nabla_\theta\log\pi_\theta(a_t\mid s_t)$ for $\tau\sim\pi_\theta$, then take a gradient step $\theta\leftarrow\theta+\eta\,\widehat\nabla_\theta J$; unbiased but high variance (very noisy)
2) Reward-to-go: $\nabla_\theta J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\big[\sum_t\gamma^t\big(\sum_{t'\ge t}\gamma^{t'-t}r_{t'}\big)\nabla_\theta\log\pi_\theta(a_t\mid s_t)\big]$ (rewards before time $t$ do not depend on $a_t$)
3) Baseline: $\nabla_\theta J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\big[\sum_t\gamma^t\big(Q^{\pi_\theta}(s_t,a_t)-b(s_t)\big)\nabla_\theta\log\pi_\theta(a_t\mid s_t)\big]$; even less variance, still unbiased

POMDPs
• Belief states: $b:S\to[0,1]$ with $\sum_s b(s)=1$; $b_t(s)$ is the probability of being in state $s$ given the observations so far; a POMDP is an MDP over belief states
• Belief update: $b_{t+1}(s)=P(s_{t+1}=s\mid y_{1:t+1})=\tfrac1Z\,P(y_{t+1}\mid s)\sum_{s'}P(s\mid s',a_t)\,b_t(s')$
• Reward on beliefs: $r(b,a)=\sum_s b(s)\,r(s,a)$

Kalman Filter
• Model: $x_k\mid x_{k-1}\sim\mathcal N(F_k x_{k-1},Q_k)$ (motion), $z_k\mid x_k\sim\mathcal N(H_k x_k,R_k)$ (observation)
1) Predict: $p(x_k\mid z_{1:k-1})=\int p(x_k\mid x_{k-1})\,p(x_{k-1}\mid z_{1:k-1})\,dx_{k-1}$ (sum of Gaussians), closed form $\mathcal N(\hat x_{k\mid k-1},P_{k\mid k-1})$ with $\hat x_{k\mid k-1}=F_k\hat x_{k-1\mid k-1}$ and $P_{k\mid k-1}=F_k P_{k-1\mid k-1}F_k^\top+Q_k$
2) Update (Bayes rule): $p(x_k\mid z_{1:k})=\tfrac1Z\,p(z_k\mid x_k)\,p(x_k\mid z_{1:k-1})$; Kalman gain $K_k=P_{k\mid k-1}H_k^\top\big(H_k P_{k\mid k-1}H_k^\top+R_k\big)^{-1}$, then $\hat x_{k\mid k}=\hat x_{k\mid k-1}+K_k\big(z_k-H_k\hat x_{k\mid k-1}\big)$ and $P_{k\mid k}=(I-K_k H_k)P_{k\mid k-1}$
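One predict/update cycle of the Kalman filter equations above, for a toy 1D constant-velocity model; the $F$, $H$, $Q$, $R$ values and measurements are made up for illustration.

```python
# Hedged sketch: Kalman filter predict + update steps.
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    # predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # update
    S = H @ P_pred @ H.T + R                       # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)          # x_{k|k}
    P_new = (np.eye(len(x)) - K @ H) @ P_pred      # P_{k|k}
    return x_new, P_new

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])              # state = [position, velocity]
H = np.array([[1.0, 0.0]])                         # observe position only
Q = 0.01 * np.eye(2)
R = np.array([[0.5]])

x, P = np.zeros(2), np.eye(2)
for z in [np.array([1.1]), np.array([2.0]), np.array([2.9])]:
    x, P = kalman_step(x, P, z, F, H, Q, R)
print(x)    # roughly position near 3, velocity near 1
```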
Exploration via optimism (tabular): $a_t=\arg\max_a Q(s,a)+c\sqrt{\ln t/N(s,a)}$, a UCB bonus from a Hoeffding-style confidence interval on the value estimate.

Model-based RL
1) MLE model estimation: $\hat P(s'\mid s,a)=\tfrac{N(s,a,s')}{N(s,a)}$, $\hat r(s,a)=\tfrac{R(s,a)}{N(s,a)}$, then plan in the estimated model; problem: only implicit exploration (the agent always tries to exploit), so combine with e.g. $\varepsilon$-greedy
2) R$_{\max}$ (optimism in the face of uncertainty):
• Initialize $N(s,a)=0$, $N(s'\mid s,a)=0$, $R(s,a)=0$, $\hat r(s,a)=R_{\max}$, $\hat P(s^*\mid s,a)=1$ for all $s,a$, where $s^*$ is a fairytale (additional, absorbing, maximally rewarding) state
• For each step: $N(s,a)\mathrel{+}=1$, $N(s'\mid s,a)\mathrel{+}=1$, $R(s,a)\mathrel{+}=r(s,a)$
• If $N(s,a)>m$: update $\hat P(s'\mid s,a)=\tfrac{N(s'\mid s,a)}{N(s,a)}$ and $\hat r(s,a)=\tfrac{R(s,a)}{N(s,a)}$, i.e. the state-action pair is now assumed known; replan with the current model
• Theorem: with probability $1-\delta$, R$_{\max}$ reaches an $\varepsilon$-optimal policy in a number of steps polynomial in $|S|,|A|,T,\tfrac1\varepsilon,\tfrac1\delta$

Value function approximation
• Model: $V(s)\approx V_w(s)=\phi(s)^\top w$ (or $Q(s,a;w)$), e.g. by a NN
• Loss: $L(w)=\tfrac12\,\mathbb{E}_{s\sim d}\big[(V_w(s)-V^\pi(s))^2\big]$ with gradient $\nabla L(w)=\mathbb{E}_{s\sim d}\big[(V_w(s)-V^\pi(s))\,\nabla V_w(s)\big]$; SGD uses $(V_w(s)-V^\pi(s))\,\nabla V_w(s)$ on single states, with $V^\pi(s)$ replaced by a bootstrapped TD target (can also use TD($\lambda$))
• Deadly triad: divergence is likely when combining bootstrapping, off-policy learning, and function approximation

Actor-Critic
• Actor $\pi_\theta$: updated by the policy gradient using the critic (usually on a slower timescale than the critic); critic $Q_w$ or $V_w$: updated by TD on the current actor
• Use the critic to estimate the advantage $A^\pi(s,a)$, here via the 1-step TD error $\delta_t=r_t+\gamma V_w(s_{t+1})-V_w(s_t)\;(\approx A^\pi(s_t,a_t))$, so $\nabla_\theta J(\pi_\theta)\approx\sum_t\delta_t\,\nabla_\theta\log\pi_\theta(a_t\mid s_t)$
• Algorithm: initialize $\theta_0,w_0$; for each step $(s_t,a_t,r_t,s_{t+1})$ compute $\delta_t$, update the critic $w_{t+1}\leftarrow w_t+\beta_t\,\delta_t\,\nabla_w Q_{w_t}(s_t,a_t)$ (or $V_{w_t}$) and the actor $\theta_{t+1}\leftarrow\theta_t+\alpha_t\,\widehat\nabla_\theta J(\pi_{\theta_t})$; online, i.e. update after each step
• $\oplus$ lower variance than REINFORCE $\ominus$ biased
• Theorem (compatible function approximation): the policy gradient with the critic is exact if $\nabla_w Q_w\propto\nabla_\theta\log\pi_\theta$ and the critic minimizes $\mathrm{MSE}(Q_w,Q^\pi)$

Calculus
• $\nabla_x(a^\top x)=\nabla_x(x^\top a)=a$; $\nabla_x(x^\top A x)=(A+A^\top)x$; $\nabla_x(b^\top A x)=A^\top b$
• $\nabla_X(a^\top X b)=ab^\top$; $\nabla_X(a^\top X^\top b)=ba^\top$; $\nabla_X\log|X|=X^{-\top}$
• $\nabla_x\tfrac12\lVert x\rVert_2^2=x$; $\nabla_x\lVert x\rVert_1=\mathrm{sign}(x)$; $\partial|x|=\mathrm{sign}(x)$
• $\nabla\sigma(x)=\sigma(x)(1-\sigma(x))$ (element-wise); $\partial_j\,\mathrm{softmax}(x)_i=\mathrm{softmax}(x)_i\big(\mathbf 1\{i=j\}-\mathrm{softmax}(x)_j\big)$
• Integration by parts: $\int_a^b u\,v'\,dx=[uv]_a^b-\int_a^b u'\,v\,dx$
• Taylor (2nd order): $f(x)\approx f(x_0)+\nabla f(x_0)^\top(x-x_0)+\tfrac12(x-x_0)^\top H_f(x_0)(x-x_0)$
• Norm: $\lVert x\rVert_p=\big(\sum_i|x_i|^p\big)^{1/p}$

Linear algebra
• $X^\top X=\sum_i x_i^\top x_i$ with $x_i$ the rows of $X$
• $D=\mathrm{diag}(\sigma_1,\dots,\sigma_n)\Rightarrow D^{-1}=\mathrm{diag}(\sigma_1^{-1},\dots,\sigma_n^{-1})$
• Trace trick: $x^\top A x=\mathrm{Tr}(x^\top A x)=\mathrm{Tr}(xx^\top A)=\mathrm{Tr}(Axx^\top)$
• $A$ invertible $\iff$ $Ax=0$ only for $x=0$ $\iff$ columns (rows) of $A$ linearly independent $\iff$ $\mathrm{Ker}(A)=\{0\}$ $\iff$ $|A|\ne0$
• $A^\top A\succeq0$ and $AA^\top\succeq0$, since $x^\top A^\top Ax=(Ax)^\top(Ax)=\lVert Ax\rVert_2^2\ge0$
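A quick numpy finite-difference check of two of the identities above ($\nabla_x(x^\top Ax)=(A+A^\top)x$ and the trace trick); the matrix and vector values are arbitrary.

```python
# Hedged sketch: numerically verify grad_x (x^T A x) = (A + A^T) x and the trace trick.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)

# central finite-difference gradient of f(x) = x^T A x
eps = 1e-6
num_grad = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(np.allclose(num_grad, (A + A.T) @ x, atol=1e-4))         # True

# trace trick: x^T A x = Tr(x x^T A) = Tr(A x x^T)
print(np.allclose(x @ A @ x, np.trace(np.outer(x, x) @ A)),
      np.allclose(x @ A @ x, np.trace(A @ np.outer(x, x))))    # True True
```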