Cheat Sheet 2

Probability Theory

- Product rule: P(x, y) = P(y | x) P(x)
- Bayes' rule: P(x | y) P(y) = P(y | x) P(x)
- Marginalization (law of total probability): p(y) = ∫ p(x, y) dx = E_x[p(y | x)]
- Odds: odds(x) = P(x) / (1 - P(x))
- Continuous RV via an energy function: p(x) = (1/Z) exp(-f(x)), with f the energy
- Expectation: E[aX + bY] = aE[X] + bE[Y]; E[X^T] = E[X]^T; tower rule E_X[X] = E_Y[E_{X|Y}[X | Y]]; law of the unconscious statistician E[g(X)] = ∫ g(x) p(x) dx
- Variance: Var(X) = E[(X - E[X])(X - E[X])^T] = E[X²] - E[X]²; Cov(X, Y) = E[(X - E[X])(Y - E[Y])^T] = E[XY] - E[X]E[Y]
- Var(aX ± bY) = a² Var(X) + b² Var(Y) ± 2ab Cov(X, Y); Var(Σ_i a_i X_i) = Σ_i a_i² Var(X_i) + 2 Σ_{i<j} a_i a_j Cov(X_i, X_j); Var(AX) = A Var(X) A^T
- Law of total variance: Var(Y) = E_X[Var(Y | X)] + Var_X(E[Y | X])
- Jensen's inequality: g(E[X]) ≤ E[g(X)] for convex g; Cauchy-Schwarz: E[XY]² ≤ E[X²] E[Y²]
- Convergence in probability: X_n → X iff P(|X_n - X| ≥ ε) → 0; convergence in distribution: F_n(x) → F(x); convergence in probability is the stronger notion (same limit, not just the same distribution)
- Monte Carlo estimate: E[f(X)] ≈ (1/N) Σ_i f(x_i); Hoeffding's inequality for f(·) ∈ [0, C]: P(|E[f(X)] - (1/N) Σ_i f(x_i)| > ε) ≤ 2 exp(-2Nε²/C²)
- Entropy: H(X) = -∫ p(x) log p(x) dx; H(X, Y) = H(X | Y) + H(Y) = H(Y | X) + H(X); H(X | Y) ≤ H(X), with H(X | Y) = H(X) iff X ⊥ Y
- Gaussian entropy: H(N(μ, Σ)) = ½ ln |2πe Σ|
- Mutual information: I(X; Y) = H(X) - H(X | Y) = I(Y; X)
- Multivariate Gaussian: X ~ N(μ, Σ) iff every linear combination Σ_i a_i X_i is Gaussian; density N(x | μ, Σ) = (2π)^{-d/2} |Σ|^{-1/2} exp(-½ (x - μ)^T Σ^{-1} (x - μ))
- Linear map: MX ~ N(Mμ, M Σ M^T); sum of independent Gaussians: X + Y ~ N(μ_x + μ_y, Σ_x + Σ_y)
- For [X_1; X_2] ~ N([μ_1; μ_2], [Σ_11 Σ_12; Σ_21 Σ_22]): marginal of X_1 is obtained by dropping the other block; conditional X_1 | X_2 = x_2 ~ N(μ_1 + Σ_12 Σ_22^{-1}(x_2 - μ_2), Σ_11 - Σ_12 Σ_22^{-1} Σ_21)
- Product of Gaussian pdfs: (N_x · N_y)(z) = Z N_z (unnormalized) with μ_z = (μ_1 σ_2² + μ_2 σ_1²)/(σ_1² + σ_2²) and σ_z² = σ_1² σ_2² / (σ_1² + σ_2²)
- Bayesian decision theory: a* = argmin_a E_{y|x}[C(y, a)], where C(y, a) is the cost of the final prediction a given the model prediction y (e.g. some labels matter more, or false positives are punished)

Bayesian Linear Regression

- Model: y = w^T x + ε, ε ~ N(0, σ_n²)
- Likelihood: p(y | x, w) = N(w^T x, σ_n²)
- Prior: p(w) = N(0, σ_p² I); a Laplace(0, σ_p² I) prior gives Lasso instead of Ridge
- Posterior (inference): p(w | X, y) = N(μ̄, Σ̄) with μ̄ = (X^T X + (σ_n²/σ_p²) I)^{-1} X^T y and Σ̄ = (σ_n^{-2} X^T X + σ_p^{-2} I)^{-1}
- Prediction: p(y* | X, y, x*) = ∫ p(y* | x*, w) p(w | X, y) dw = N(μ̄^T x*, x*^T Σ̄ x* + σ_n²)
  ① epistemic uncertainty x*^T Σ̄ x*: lack of data; ② aleatoric uncertainty σ_n²: irreducible noise
- OLS / MLE: w_MLE = argmax_w p(y | X, w) = (X^T X)^{-1} X^T y; requires linearly independent (uncorrelated) features
- Ridge regression: w_MAP = argmax_w p(w | X, y) = μ̄
- Noise estimate (MAP): σ̂_n² = (1/n) Σ_i (y_i - μ̄^T x_i)²; this homoscedastic model does not work in the heteroscedastic case
- Gauss-Markov theorem: for y_i = w^T x_i + ε_i with E[ε_i] = 0, Var(ε_i) = σ², Cov(ε_i, ε_j) = 0, OLS satisfies Var(w_OLS) ≤ Var(w̃) for every unbiased linear estimator w̃
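A minimal NumPy sketch of the posterior and predictive formulas above (function and variable names are illustrative, not from the original sheet):

```python
import numpy as np

def blr_posterior(X, y, sigma_n=0.1, sigma_p=1.0):
    """Posterior p(w | X, y) = N(mu_bar, Sigma_bar) for y = w^T x + eps."""
    d = X.shape[1]
    Sigma_bar = np.linalg.inv(X.T @ X / sigma_n**2 + np.eye(d) / sigma_p**2)
    mu_bar = Sigma_bar @ X.T @ y / sigma_n**2
    return mu_bar, Sigma_bar

def blr_predict(x_star, mu_bar, Sigma_bar, sigma_n=0.1):
    """Predictive mean and variance: epistemic + aleatoric parts."""
    mean = x_star @ mu_bar
    var_epistemic = x_star @ Sigma_bar @ x_star
    return mean, var_epistemic + sigma_n**2

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
mu_bar, Sigma_bar = blr_posterior(X, y)
print(blr_predict(np.array([1.0, 0.0, 0.0]), mu_bar, Sigma_bar))
```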
Kernels

- With nonlinear features, y = w^T φ(x) + ε; predictions depend on the data only through inner products in feature space, so the kernel trick replaces them by k(x, x') = φ(x)^T φ(x')
- Valid kernels are symmetric, k(x, x') = k(x', x), and positive semidefinite: K ⪰ 0, i.e. v^T K v ≥ 0 for all v (equivalently ∫∫ k(x, x') f(x) f(x') dx dx' ≥ 0)
- Composition rules: k_1 + k_2, k_1 · k_2 and c k (c > 0) are kernels; f(k(x, x')) is a kernel for f a polynomial with positive coefficients or exp; φ(x)^T φ(x') is a kernel for any explicit feature map
- Stationary if k(x, x') = k(x - x'); isotropic if k(x, x') = k(‖x - x'‖₂)

Gaussian Processes

- A GP is an infinite-dimensional multivariate Gaussian: f ~ GP(μ, k) means every finite marginal [f(x_1), …, f(x_n)] ∈ R^n is jointly Gaussian with means μ(x_i) and covariances k(x_i, x_j)
- Model: y_i = f(x_i) + ε, ε ~ N(0, σ_n²); noisy observations y_A at the training inputs A = {x_1, …, x_n}
- Joint of training values and a test point: [f_A; f*] ~ N([μ_A; μ(x*)], K) with K built from k on A ∪ {x*}
- The posterior is again a GP, p(f | x_{1:n}, y_{1:n}) = GP(μ', k'), with
  μ'(x) = μ(x) + k_{x,A} (K_AA + σ_n² I)^{-1} (y_A - μ_A)
  k'(x, x') = k(x, x') - k_{x,A} (K_AA + σ_n² I)^{-1} k_{A,x'}
  where k_{x,A} = [k(x, x_1), …, k(x, x_n)] and K_AA ∈ R^{n×n}
- Prediction: p(y* | x_{1:n}, y_{1:n}, x*) = N(μ'(x*), k'(x*, x*) + σ_n²); k'(x*, x*) is the epistemic part, σ_n² the aleatoric (model) noise
- Kernel hyperparameters: θ̂ = argmax_θ p(y_{1:n} | x_{1:n}, θ) (marginal likelihood)
- Computational cost: inverting (K_AA + σ_n² I) is O(n³); cheaper alternatives are explicit feature maps φ(x) and local methods (exploit k(x, x') ≈ 0 when ‖x - x'‖ is large)
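A small NumPy sketch of the GP posterior equations above with an RBF kernel (names and toy data are illustrative):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, sigma_n=0.1):
    """Posterior mean and covariance at test inputs X_star (zero prior mean)."""
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    K_s = rbf(X_star, X)            # k_{x*,A}
    K_ss = rbf(X_star, X_star)
    K_inv = np.linalg.inv(K)        # O(n^3); use a Cholesky solve in practice
    mu = K_s @ K_inv @ y
    cov = K_ss - K_s @ K_inv @ K_s.T
    return mu, cov

X = np.linspace(0, 5, 20).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=20)
mu, cov = gp_posterior(X, y, np.array([[2.5], [6.0]]))
print(mu, np.diag(cov) + 0.1**2)    # predictive mean, variance (epistemic + aleatoric)
```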
Bayesian Logistic Regression

- Model: ŷ = sign(w^T x), with σ(z) = 1/(1 + exp(-z)) and σ(-z) = 1 - σ(z)
- Likelihood: p(y | x, w) = Ber(σ(y w^T x)); prior: p(w) = N(0, σ_p² I)
- MAP estimate: ŵ = argmin_w Σ_i log(1 + exp(-y_i w^T x_i)) + λ‖w‖² (regularized logistic regression; a Laplace prior yields an ‖w‖₁ penalty instead)
- The posterior has no closed form; the Laplace approximation gives q(w) = N(ŵ, Λ^{-1}) with Λ = Σ_i π_i (1 - π_i) x_i x_i^T = X^T diag(π_i(1 - π_i)) X (plus the prior precision), where π_i = σ(ŵ^T x_i)
- Prediction: p(y* | x*, X, y) = ∫ σ(y* f*) N(f* | ŵ^T x*, x*^T Λ^{-1} x*) df*

Variational Inference

- Problem: the posterior p(θ | x_{1:n}, y_{1:n}) = (1/Z) p(θ) Π_i p(y_i | x_i, θ) has an intractable normalizer Z = ∫ p(θ) Π_i p(y_i | x_i, θ) dθ
- 1) Laplace method: find the mode θ̂ = argmax_θ log p(θ | x, y); a second-order Taylor expansion log p(θ | x, y) ≈ log p(θ̂ | x, y) - ½ (θ - θ̂)^T H (θ - θ̂), with H the Hessian of the negative log posterior at θ̂, gives q(θ) = N(θ̂, H^{-1})
  Insight: often overconfident, since it only approximates the posterior around the mode
- 2) Variational inference: choose q* = argmin_{q∈Q} KL(q ‖ p(· | x, y)) = argmax_{q∈Q} ELBO(q)
- KL divergence: KL(q‖p) = ∫ q(θ) log (q(θ)/p(θ)) dθ ≥ 0, equal to 0 iff q = p; not symmetric, KL(q‖p) ≠ KL(p‖q). Forward KL(p‖q) tends to cover the full posterior; reverse KL(q‖p), the one we can actually optimize, tends to concentrate on a mode
- (Log-)evidence: log p(y) = log ∫ p(y, θ) dθ = log E_{θ~q}[p(y, θ)/q(θ)] ≥ E_{θ~q}[log p(y, θ)/q(θ)] (Jensen; log is concave); the log evidence log p(y_val | x_val) also serves as a performance metric of the model
- ELBO: ELBO(q) = E_{θ~q}[log p(y, θ)] + H(q) = E_{θ~q}[log p(y | θ)] - KL(q(θ) ‖ p(θ)); log p(y) = ELBO(q) + KL(q ‖ p(· | y))
- Reparameterization trick: write θ = g(ε; λ) with ε from a fixed distribution, so that ∇_λ E_{θ~q_λ}[f(θ)] = E_ε[∇_λ f(g(ε; λ))]; Gaussian case q(θ) = N(μ, Σ): θ = μ + Cε with ε ~ N(0, I) and Σ = CC^T
- Score-function estimator (no reparameterization needed): ∇_λ E_{q_λ}[f(θ)] = E_{q_λ}[f(θ) ∇_λ log q_λ(θ)]; unbiased but higher variance
- In practice the ELBO is maximized with stochastic gradients over mini-batches of the data ("Bayes by backprop")
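A minimal PyTorch sketch of maximizing a Monte Carlo ELBO estimate with the reparameterization trick, for a toy 1-D Gaussian-mean model (the model, data and hyperparameters are illustrative assumptions, not from the sheet):

```python
import torch

# Toy model: y_i ~ N(theta, 1), prior theta ~ N(0, 1); fit q(theta) = N(mu, sigma^2).
y = torch.tensor([0.8, 1.2, 0.9, 1.1])
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(500):
    eps = torch.randn(64, 1)                      # eps ~ N(0, I)
    theta = mu + torch.exp(log_sigma) * eps       # reparameterization theta = g(eps; mu, sigma)
    log_lik = torch.distributions.Normal(theta, 1.0).log_prob(y).sum(dim=1)
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta).squeeze(1)
    log_q = torch.distributions.Normal(mu, torch.exp(log_sigma)).log_prob(theta).squeeze(1)
    elbo = (log_lik + log_prior - log_q).mean()   # Monte Carlo ELBO estimate
    loss = -elbo
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item(), torch.exp(log_sigma).item())     # approaches the exact posterior N(0.8, 0.2) (std ≈ 0.45)
```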
Markov Chain Monte Carlo (MCMC)

- Goal: sample from p(θ | x, y) = Q(θ)/Z when only the unnormalized Q (prior × likelihood) can be evaluated; build (implicitly) a Markov chain whose stationary distribution π is the posterior
- Metropolis-Hastings: with proposal distribution R(x' | x), repeat
  1) propose x' ~ R(· | x)
  2) accept with probability α = min{1, Q(x') R(x | x') / (Q(x) R(x' | x))}; if accepted set x_{t+1} = x', otherwise stay at x
- Metropolis: symmetric proposal R(x' | x) (e.g. N(x, I)), so α = min{1, Q(x')/Q(x)}, a ratio of unnormalized posteriors in which Z cancels
- Gibbs sampling: an efficient proposal that is always accepted; for i ∈ {1, 2, …, n} sample x_i^{(t+1)} ~ p(x_i | x_1^{(t+1)}, …, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, …, x_n^{(t)}), keeping the other coordinates fixed
- Gradient-guided proposals move toward high-density regions (MALA, SGLD)
- Finite chains: proposals from a transition matrix R(x_i | x_j) = P_ij with acceptance probabilities α_ij give the effective transition t_ij = P_ij α_ij for i ≠ j, with the diagonal adjusted so that columns sum to 1 when x' is rejected
- Stationary distribution: π^T = π^T P, i.e. P(X_{t+1} = x) = Σ_{x'} P(x | x') P(X_t = x'), with ‖π‖₁ = 1
- Detailed balance: (1/Z) Q(x) P(x' | x) = (1/Z) Q(x') P(x | x') for all x, x' implies that Q/Z = π is stationary
- Ergodicity: there exists t < ∞ such that every state can be reached from every state in exactly t steps (equivalently P^n has all entries > 0 for some n); ergodic chains have a unique stationary distribution π(x) > 0 with lim_t P(X_t = x) = π(x) irrespective of P(X_0) (any chain can be made ergodic by adding self-loops P_ii > 0)
- Ergodic theorem: for an ergodic chain, (1/N) Σ_i f(x_i) → Σ_x π(x) f(x), the law of large numbers for Markov chains
- Burn-in: discard the initial samples, as they are not yet sampled from π(x)
- Rejection sampling (no chain): given a proposal g(x) and a scaling factor M with M g(x) ≥ Q(x) for all x, draw x' ~ g and accept with probability Q(x')/(M g(x')); accepted draws are exact samples from p(x), but the method is sample-inefficient when M is large (high dimensions)
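A compact NumPy sketch of random-walk Metropolis as described above; with a symmetric proposal the acceptance ratio reduces to Q(x')/Q(x). The target density here is an arbitrary illustrative choice:

```python
import numpy as np

def log_Q(x):
    """Unnormalized log target, e.g. a 1-D mixture of two Gaussians."""
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis(n_samples=20000, step=1.0, burn_in=2000, seed=0):
    rng = np.random.default_rng(seed)
    x, samples = 0.0, []
    for t in range(n_samples):
        x_prop = x + step * rng.normal()          # symmetric proposal N(x, step^2)
        log_alpha = log_Q(x_prop) - log_Q(x)      # log Q(x') - log Q(x)
        if np.log(rng.uniform()) < log_alpha:     # accept with prob min{1, Q(x')/Q(x)}
            x = x_prop
        samples.append(x)
    return np.array(samples[burn_in:])            # discard burn-in

print(metropolis().mean())   # ~ 0 by symmetry of the target
```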
Bayesian Deep Learning

- Heteroscedastic regression model: y = f_θ(x) + ε(x) with ε(x) ~ N(0, σ_θ(x)²); likelihood p(y | x, θ) = N(f_θ(x), exp(f_θ'(x))), where a second network output models the log-variance and the exp ensures σ² ≥ 0. Classification: ŷ = sign(f_θ(x)) with p(y | x, θ) = Ber(σ(f_θ(x)))
- Prior: p(θ) = N(0, σ_p² I) (weight decay)
- MAP: θ̂ = argmin_θ -log p(θ) - Σ_i log p(y_i | x_i, θ) = argmin_θ λ‖θ‖² + ½ Σ_i [ (y_i - μ_θ(x_i))² / σ_θ(x_i)² + log σ_θ(x_i)² ]; the model can attribute error either to the mean fit or to the predicted noise
- Approximate inference over the weights:
  1) Variational inference (Bayes by backprop): Gaussian q_λ(θ), reparameterization trick, mini-batch ELBO gradients
  2) MCMC: sample θ_j directly from p(θ | x, y) (e.g. with SGLD) and predict with samples drawn after burn-in; to store fewer samples, summarize them by a Gaussian p(θ | x, y) ≈ N(μ, diag(σ²)) with μ_i = (1/m) Σ_j θ_{ij} and σ_i² = (1/m) Σ_j (θ_{ij} - μ_i)²
  3) Dropout as variational inference: q_λ(θ) = Π_j q_λj(θ_j) with q_λj(θ_j) = p δ_0(θ_j) + (1 - p) δ_{λ_j}(θ_j); keep dropout active at prediction time and average over sampled masks
  4) Probabilistic ensembles: train m networks on bootstrap resamples of D = {(x_1, y_1), …, (x_n, y_n)} and treat their MAP estimates θ_j as posterior samples
- Prediction by subsampling: p(y* | x, y, x*) = ∫ p(y* | x*, θ) p(θ | x, y) dθ ≈ (1/m) Σ_j p(y* | x*, θ_j) with θ_j ~ q (or ensemble members)
- Predictive mean and variance from samples: Var(y* | x*) ≈ (1/m) Σ_j σ²(x*, θ_j) + (1/m) Σ_j (μ(x*, θ_j) - μ̄(x*))², the aleatoric plus the epistemic part (law of total variance)
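A tiny NumPy sketch of combining per-member predictions of an ensemble (or MC-dropout samples) into a predictive mean and variance via the decomposition above; the per-member numbers are placeholders:

```python
import numpy as np

# Per-member predictive means mu_j(x*) and noise variances sigma_j^2(x*),
# e.g. the two heads of m heteroscedastic networks (placeholder values).
mu = np.array([1.02, 0.95, 1.10, 0.99])       # shape (m,)
sigma2 = np.array([0.04, 0.05, 0.03, 0.06])   # shape (m,)

mean = mu.mean()
aleatoric = sigma2.mean()                      # average predicted noise
epistemic = ((mu - mean) ** 2).mean()          # disagreement between members
variance = aleatoric + epistemic               # law of total variance
print(mean, aleatoric, epistemic, variance)
```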
Active Learning

- Goal: choose where to request labels so the model learns the most, i.e. where it is most unsure (a wider range of informative candidates is better)
- Mutual-information criterion: F(S) = H(f) - H(f | y_S) = I(f; y_S) for S ⊆ D; for a GP, I(f; y_S) = ½ log |I + σ_n^{-2} K_S|
- Greedy selection: x_{t+1} = argmax_x F(S_t ∪ {x}) with S_t = {x_1, …, x_t}; constant terms cancel, so this is uncertainty sampling x_{t+1} = argmax_x ½ log(1 + σ_f²(x)/σ_n²) = argmax_x σ_f²(x), which is purely explorative
- F is monotone submodular: F(A ∪ {x}) - F(A) ≥ F(B ∪ {x}) - F(B) for all A ⊆ B and all x; hence greedy is near-optimal, F(S_greedy) ≥ (1 - 1/e) max_{|S|=T} F(S)
- Heteroscedastic case: uncertainty sampling fails (it keeps querying where the noise is large); maximize σ_f²(x)/σ_n²(x) instead, epistemic over aleatoric uncertainty
- Classification, BALD (Bayesian active learning by disagreement): x_{t+1} = argmax_x I(y_x; θ | x_{1:t}, y_{1:t}) = H(y | x, D) - E_{θ~p(·|D)}[H(y | x, θ)] (approximate with variational inference); predictive entropy minus the expected entropy under the current model

Bayesian Optimization

- Goal: maximize an expensive black-box f; model it with a surrogate g ~ GP(μ, k) conditioned on D_t = {(x_1, y_1), …, (x_t, y_t)} and choose queries with an acquisition function a(x) that trades off exploration and exploitation
- Algorithm (repeat): 1) pick a candidate x̂ = argmax_x a(x); 2) evaluate f(x̂); 3) update D_{t+1} = D_t ∪ {(x̂, f(x̂))} and recondition the GP
- Acquisition functions, with f* = max_i f(x_i) the current best (a code sketch follows this list; Thompson sampling and regret bounds continue after it):
  • UCB: a(x) = μ_t(x) + β_t σ_t(x), optimism in the face of uncertainty; the hyperparameter β_t controls exploration
  • Probability of improvement: PI(x) = P(f(x) > f*) = Φ(Z) with Z = (μ(x) - f*)/σ(x); tends to query very close to D_t (exploitive); often used with an extra margin ξ
  • Expected improvement: with improvement I(x) = max{0, f(x) - f*} and f(x) ~ N(μ(x), σ(x)²), EI(x) = E[I(x)] = (μ(x) - f*) Φ(Z) + σ(x) φ(Z)
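A short sketch of the EI and UCB acquisition functions above, given the GP's predictive mean and standard deviation at candidate points (SciPy normal cdf/pdf; the candidate values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI(x) = (mu - f* - xi) * Phi(Z) + sigma * phi(Z)."""
    z = (mu - f_best - xi) / np.maximum(sigma, 1e-12)
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def ucb(mu, sigma, beta=2.0):
    """UCB(x) = mu + beta * sigma (optimism in the face of uncertainty)."""
    return mu + beta * sigma

# mu, sigma would come from the GP posterior over a candidate grid
mu = np.array([0.1, 0.4, 0.35])
sigma = np.array([0.5, 0.05, 0.3])
f_best = 0.4
print(np.argmax(expected_improvement(mu, sigma, f_best)))  # index of the next query
print(np.argmax(ucb(mu, sigma)))
```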
- Thompson sampling: in each round draw a function f̃ ~ p(f | x_{1:t}, y_{1:t}) from the posterior, query x̂ = argmax_x f̃(x), then update the posterior; the randomness automatically trades off exploration and exploitation
- Theory: instantaneous regret r_t = f(x*) - f(x_t); cumulative regret R_T = Σ_t r_t; sublinear regret R_T/T → 0 means the average regret vanishes, i.e. max_t f(x_t) converges to f(x*)
- GP-UCB theorem: with suitably chosen β_t, the average regret satisfies (1/T) Σ_t (f(x*) - f(x_t)) = O*(√(β_T γ_T / T)), where γ_T = max_{|S|=T} I(f; y_S) is the maximum information gain; sublinear regret (convergence) is guaranteed whenever γ_T grows sublinearly
- Information-gain bounds: linear kernel γ_T = O(d log T); RBF kernel γ_T = O((log T)^{d+1}); Matérn kernel (ν > ½) γ_T = O(T^{d(d+1)/(2ν + d(d+1))} log T)
Reinforcement Learning

Markov Decision Processes
- MDP: M = (S, A, P, r, γ) with Markovian transition model P = p(s' | s, a), reward r(s, a), discount γ ∈ [0, 1]
- Policy π: deterministic π(s) = a, or stochastic π(a | s) ∈ [0, 1] with Σ_a π(a | s) = 1 for all s ∈ S
- Return: G_t = r_t + γ r_{t+1} + γ² r_{t+2} + …; objective J(π) = E[Σ_t γ^t r_t | s_0 ~ μ, π]
- Value function: V^π(s) = E[Σ_t γ^t r_t | s_0 = s] = Σ_a π(a | s) [r(s, a) + γ Σ_{s'} P(s' | s, a) V^π(s')]
- Action-value: Q^π(s, a) = E[Σ_t γ^t r_t | s_0 = s, a_0 = a, π]; advantage A^π(s, a) = Q^π(s, a) - V^π(s)
- Closed-form policy evaluation: with V^π = [V^π(s_1), …, V^π(s_n)]^T and P^π_{s,s'} = Σ_a π(a | s) P(s' | s, a): V^π = R^π + γ P^π V^π, so V^π = (I - γ P^π)^{-1} R^π
- Bellman expectation operator: J^π V = R^π + γ P^π V, with fixed point J^π V^π = V^π
- Bellman optimality operator: (J V)(s) = max_a [r(s, a) + γ Σ_{s'} P(s' | s, a) V(s')], with fixed point J V* = V*
- Optimality: V*(s) = max_π V^π(s) = max_a Q*(s, a), Q*(s, a) = max_π Q^π(s, a); greedy policy π*(s) = argmax_a Q*(s, a)
- Theorem: every infinite-horizon discounted MDP admits an optimal stationary, deterministic policy; V* is unique, π* is not necessarily unique
- Value iteration: initialize V_0(s) for all s ∈ S; repeat V_{t+1} = J V_t until convergence
- Policy iteration: initialize π_0; repeat until convergence: 1) policy evaluation: compute V^{π_t} (fixed point of J^{π_t}); 2) policy improvement: π_{t+1}(s) = argmax_a Q^{π_t}(s, a) with Q^{π_t}(s, a) = r(s, a) + γ Σ_{s'} P(s' | s, a) V^{π_t}(s')
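A minimal NumPy sketch of tabular value iteration V_{t+1} = J V_t; the random MDP is only a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition probabilities
r = rng.uniform(size=(S, A))                 # r(s, a)

V = np.zeros(S)
for _ in range(1000):
    Q = r + gamma * P @ V                    # Q[s, a] = r(s, a) + gamma * sum_s' P V
    V_new = Q.max(axis=1)                    # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

pi = Q.argmax(axis=1)                        # greedy (optimal) policy
print(V, pi)
```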
Model-free RL

- Estimate V̂^π(s) ≈ V^π(s) (or Q̂) directly from experience, without access to P and r
- Exploration: ε-greedy, π(s) = argmax_a Q(s, a) with probability 1 - ε and uniform over A with probability ε; softmax, π(a | s) = exp(β Q(s, a)) / Σ_{a'} exp(β Q(s, a')) with temperature parameter β (the smaller β, the more uniform/exploratory); count-based bonus argmax_a Q(s, a) + c √(ln t / N(s, a)) (UCB-style, via Hoeffding's inequality)
- Robbins-Monro conditions on the learning rates: Σ_t α_t = ∞ and Σ_t α_t² < ∞ (e.g. α_t = 1/t)
- 1) Monte Carlo (on-policy): generate a full trajectory and use the return as the target, V(s_t) ← V(s_t) + α_t (G_t - V(s_t)); ⊕ unbiased, ⊖ high variance (randomness of whole trajectories)
- 2) TD(0) (on-policy, bootstrapping: the update depends on the estimate itself): V(s_t) ← V(s_t) + α_t δ_t with temporal-difference error δ_t = r_t + γ V(s_{t+1}) - V(s_t); ⊕ lower variance, ⊖ biased; converges to V^π if every state is visited infinitely often and the Robbins-Monro conditions hold
- n-step targets: G_t^{(n)} = r_t + γ r_{t+1} + … + γ^{n-1} r_{t+n-1} + γ^n V(s_{t+n}); larger n gives lower bias but higher variance
- 3) TD(λ): G_t^λ = (1 - λ) Σ_n λ^{n-1} G_t^{(n)}, a weighted average of the n-step targets; λ = 0 recovers TD(0), λ = 1 recovers Monte Carlo
- 4) SARSA (on-policy): for each (s_t, a_t, r_t, s_{t+1}, a_{t+1}): Q(s_t, a_t) ← Q(s_t, a_t) + α_t (r_t + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)); converges if every state-action pair is visited infinitely often and the behavior policies are GLIE (greedy in the limit with infinite exploration)
- 5) Q-learning (off-policy): follow any sufficiently exploratory policy and update Q(s_t, a_t) ← Q(s_t, a_t) + α_t (r_t + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)); ⊖ the max over noisy estimates tends to overestimate Q(s, a)
- Double Q-learning: reduce overestimation by decoupling maximization and estimation; keep Q_1 and Q_2 and, with probability ½, update Q_1(s_t, a_t) ← Q_1(s_t, a_t) + α_t (r_t + γ Q_2(s_{t+1}, argmax_a Q_1(s_{t+1}, a)) - Q_1(s_t, a_t)), otherwise the symmetric update of Q_2
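A small self-contained NumPy sketch of ε-greedy tabular Q-learning on a toy chain MDP (environment and hyperparameters are illustrative):

```python
import numpy as np

# Toy chain MDP: states 0..4, actions 0 (left) / 1 (right), reward 1 on reaching the right end.
n_states, n_actions, gamma, eps, alpha = 5, 2, 0.9, 0.1, 0.1
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

def greedy(q_row):                       # argmax with random tie-breaking
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))

Q = np.zeros((n_states, n_actions))
for episode in range(300):
    s, done = 0, False
    while not done:
        a = rng.integers(n_actions) if rng.uniform() < eps else greedy(Q[s])   # eps-greedy behavior
        s_next, r, done = step(s, a)
        target = r + gamma * (0.0 if done else Q[s_next].max())                # off-policy target
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))    # learned greedy policy (should prefer action 1, "right")
```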
Value Function Approximation

- For large |S| we cannot store tables of V(s) or Q(s, a); approximate V(s) ≈ V_w(s) = φ(s)^T w, or with a neural network (likewise for Q)
- Loss: L(w) = ½ E_{s~d}[(V_w(s) - V^π(s))²] = ½ ‖V_w - V^π‖²_d, with ‖x‖²_D = x^T D x, D = diag(d), d the state distribution
- Gradient / SGD: ∇L(w) = E_{s~d}[(V_w(s) - V^π(s)) ∇V_w(s)]; replace the unknown V^π(s_t) by a bootstrapped TD target, giving the semi-gradient update w ← w + α_t δ_t ∇V_w(s_t) with δ_t = r_t + γ V_w(s_{t+1}) - V_w(s_t)
- Deadly triad: divergence becomes likely when bootstrapping, off-policy data and function approximation are combined

Deep Q-learning (DQN)

- Q_θ(s, a) is a neural network trained on the squared TD error; purely online updates after each step are unstable and sample-inefficient
- Experience replay: store transitions in a buffer (periodically throwing out old ones) and train on mini-batches sampled from it; increases sample efficiency and stabilizes learning
- Target network: L(θ) = Σ_{(s,a,r,s')∈B} (r + γ max_{a'} Q_{θ⁻}(s', a') - Q_θ(s, a))² with B a batch from the replay buffer and target parameters θ⁻ updated only infrequently (θ⁻ ← θ every n steps); a Huber loss is often used since the targets are noisy
- Double DQN: counteracts DQN's overestimation by selecting the next action with the online network and evaluating it with the target network (performs much better)

Model-based RL

- Model estimation (MLE from counts): p̂(s' | s, a) = N(s, a, s')/N(s, a), r̂(s, a) = R(s, a)/N(s, a); used naively this gives only implicit exploration, since the agent always tries to exploit its current model
- R_max (optimism for unknown pairs): initialize N(s, a) = 0, N(s' | s, a) = 0, R(s, a) = 0, set r̂(s, a) = R_max and p̂(s* | s, a) = 1 for an additional "fairytale" absorbing state s*; repeat: 1) compute π_t for the current p̂, r̂; 2) follow π_t and update the counts for encountered (s, a, s'); 3) once N(s, a) > m (threshold), switch that pair to its empirical estimates
  Theorem: with probability 1 - δ, R_max reaches an ε-optimal policy within a number of steps polynomial in |S|, |A|, 1/ε, 1/δ and the mixing time of the MDP
- Planning with a learned model: the infinite horizon cannot be optimized directly, so use model predictive control (MPC) / receding horizon: at each step solve max_{a_{t:t+H-1}} E[Σ_{τ=t}^{t+H-1} γ^{τ-t} r_τ | a_{t:t+H-1}], execute the first action, then replan after each step (H trades off performance against computation)
- Optimizing the H-step objective: 1) gradient-based, differentiating through the model analytically; 2) random shooting, sample m random action sequences a_{t:t+H-1} (optionally with Monte Carlo rollouts) and pick the best
- Value-based tail correction: J(a_{t:t+H-1}) = E[(Σ_τ γ^{τ-t} r_τ) + γ^H V(s_{t+H})] (or γ^H Q(s_{t+H}, π(s_{t+H}))) to account for rewards beyond the horizon
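A sketch of the random-shooting planner above: sample action sequences, roll them out in a (placeholder) learned model, and execute the first action of the best sequence, MPC-style:

```python
import numpy as np

def model_step(s, a):
    """Placeholder learned dynamics/reward model: s' = f(s, a), r(s, a)."""
    s_next = 0.9 * s + a
    reward = -(s_next ** 2)          # e.g. drive the state toward 0
    return s_next, reward

def random_shooting(s0, horizon=10, n_seq=256, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    best_return, best_first_action = -np.inf, 0.0
    for _ in range(n_seq):
        actions = rng.uniform(-1.0, 1.0, size=horizon)   # random action sequence
        s, ret = s0, 0.0
        for t, a in enumerate(actions):
            s, r = model_step(s, a)
            ret += gamma**t * r
        if ret > best_return:
            best_return, best_first_action = ret, actions[0]
    return best_first_action          # MPC: execute this action, then replan

print(random_shooting(s0=2.0))
```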
Policy Gradient

- Parameterize the policy π_θ directly and ascend J(π_θ) = E[Σ_t γ^t r_t | s_0 ~ μ, π_θ] = E_{s~μ}[V^{π_θ}(s)]: θ_{t+1} ← θ_t + α_t ∇_θ J(π_θ_t)
- Trajectory form: J(π_θ) = E_{τ~p_θ}[R(τ)] with τ = (s_0, a_0, r_0, …), R(τ) = Σ_t γ^t r_t and p_θ(τ) = μ(s_0) Π_t π_θ(a_t | s_t) P(s_{t+1} | s_t, a_t)
- Score-function (log-derivative) trick, no reparameterization needed: ∇_θ J(π_θ) = E_{τ~p_θ}[R(τ) ∇_θ log p_θ(τ)] = E_{τ~p_θ}[R(τ) Σ_t ∇_θ log π_θ(a_t | s_t)] (the dynamics terms do not depend on θ)
- 1) REINFORCE: estimate this gradient from sampled trajectories; unbiased but high variance
- 2) Reward-to-go: rewards r_{t'} with t' < t do not depend on a_t, so ∇_θ J(π_θ) = E[Σ_t γ^t G_t ∇_θ log π_θ(a_t | s_t)]
- 3) Baseline: subtracting a state-dependent baseline b(s_t) keeps the estimator unbiased and lowers its variance, ∇_θ J(π_θ) = E[Σ_t γ^t (Q^{π_θ}(s_t, a_t) - b(s_t)) ∇_θ log π_θ(a_t | s_t)]; with b = V^{π_θ} the weight becomes the advantage A^{π_θ}(s_t, a_t), giving even less variance
- Discounted state visitation d^π(s) = (1 - γ) Σ_t γ^t P(s_t = s | s_0 ~ μ, π); policy gradient theorem: ∇_θ J = E_{s~d^π, a~π_θ(·|s)}[Q^{π_θ}(s, a) ∇_θ log π_θ(a | s)]
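A compact NumPy sketch of the REINFORCE estimator for a softmax policy on a one-step (bandit-style) toy problem, so the score term is just ∇_θ log π_θ(a); all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                 # logits of a softmax policy over 2 actions
alpha = 0.1
mean_reward = np.array([0.2, 0.8])  # expected rewards (unknown to the agent)

for episode in range(2000):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()      # softmax policy
    a = rng.choice(2, p=pi)
    r = rng.normal(mean_reward[a], 0.1)                    # sampled return R(tau)
    grad_log_pi = -pi                                      # d/dtheta log pi(a) = onehot(a) - pi
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi                       # REINFORCE update

print(np.exp(theta) / np.exp(theta).sum())                 # most mass ends up on action 1
```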
Actor-Critic

- 4) Actor-critic (with or without baseline): the actor π_θ is updated by the policy gradient using a critic; the critic Q_w or V_w is updated by TD on the current actor (the actor usually changes more slowly than the critic)
- Online algorithm: initialize θ_0, w_0; follow π_θ and for each step (s_t, a_t, r_t, s_{t+1}, a_{t+1}):
  actor: ∇_θ J(π_θ) ≈ Q_w(s_t, a_t) ∇_θ log π_θ(a_t | s_t), θ_{t+1} ← θ_t + α_t ∇_θ J(π_θ_t)
  critic: δ_t = r_t + γ Q_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t), w_{t+1} ← w_t + β_t δ_t ∇_w Q_w(s_t, a_t)
- A2C (advantage actor-critic): use a critic on the value function to estimate A^{π_θ}(s_t, a_t); with a 1-step TD critic, δ_t = r_t + γ V_w(s_{t+1}) - V_w(s_t) ≈ Â(s_t, a_t), actor update ∇_θ J ≈ Σ_t δ_t ∇_θ log π_θ(a_t | s_t), critic update w_{t+1} ← w_t + β_t δ_t ∇_w V_w(s_t); ⊕ lower variance, ⊖ biased; TD(λ) targets can be used as well
- Generalized advantage estimation (GAE): Â^GAE(s_t, a_t) = (1 - λ) Σ_k λ^{k-1} Â^{(k)}(s_t, a_t), with k-step advantages Â^{(k)}(s_t, a_t) = r_t + γ r_{t+1} + … + γ^k V(s_{t+k}) - V(s_t)
- Entropy regularization: maximize J(π_θ) + λ H(π_θ(· | s_t)) with H(π_θ(· | s)) = E_{a~π_θ(·|s)}[-log π_θ(a | s)] to keep the policy exploratory and numerically stable
- Compatible function approximation: the policy gradient with a critic is exact if ∇_w Q_w ∝ ∇_θ log π_θ and the critic minimizes the MSE between Q_w and Q^{π_θ}
POMDPs

- The agent cannot observe s_t directly, only an observation y_t
- Belief states b_t : {1, …, n} → [0, 1], with b_t(s) the probability that s_t = s, turn the POMDP into an MDP over beliefs
- Belief update: b_{t+1}(s) = P(s_{t+1} = s | y_{1:t+1}) = (1/Z) P(y_{t+1} | s) Σ_{s'} P(s | s', a_t) b_t(s'); reward r(b, a) = Σ_s b(s) r(s, a)

Kalman Filter

- Linear-Gaussian state-space model:
  dynamics (true state): x_k = F_k x_{k-1} + w_k, w_k ~ N(0, Q_k), i.e. x_k | x_{k-1} ~ N(F_k x_{k-1}, Q_k)
  measurement: z_k = H_k x_k + v_k, v_k ~ N(0, R_k), i.e. z_k | x_k ~ N(H_k x_k, R_k)
- Filtering alternates two steps, combining the dynamics with Bayes' rule p(x_k | z_{1:k}) = (1/Z) p(z_k | x_k) p(x_k | z_{1:k-1})
- 1) Prediction: p(x_k | z_{1:k-1}) = ∫ p(x_k | x_{k-1}) p(x_{k-1} | z_{1:k-1}) dx_{k-1} (the previous filtering distribution is known from the last step; sums of Gaussians stay Gaussian). Closed form N(x̂_{k|k-1}, P_{k|k-1}):
  x̂_{k|k-1} = F_k x̂_{k-1|k-1}
  P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_k
- 2) Update (correction): p(x_k | z_{1:k}) = N(x̂_{k|k}, P_{k|k}):
  x̂_{k|k} = x̂_{k|k-1} + K_k (z_k - H_k x̂_{k|k-1})
  P_{k|k} = (I - K_k H_k) P_{k|k-1}
  with Kalman gain K_k = P_{k|k-1} H_k^T (H_k P_{k|k-1} H_k^T + R_k)^{-1}, which measures how much we trust the measurement
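A direct NumPy transcription of the predict/update equations above for a small constant-velocity example; F, H, Q, R are illustrative choices:

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # position-velocity dynamics
H = np.array([[1.0, 0.0]])               # we only measure the position
Q = 0.01 * np.eye(2)                     # process noise
R = np.array([[0.5]])                    # measurement noise

def kalman_step(x, P, z):
    # 1) prediction
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # 2) update / correction
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.zeros(2), np.eye(2)
for z in [np.array([1.0]), np.array([2.1]), np.array([2.9])]:
    x, P = kalman_step(x, P, z)
print(x)   # filtered position and velocity estimate
```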
Classification Metrics & Calibration

- Deterministic metrics from the confusion matrix (TP, FP, FN, TN):
  accuracy = (TP + TN)/n
  precision = TP/(TP + FP): how reliable positive predictions are, but can be high while the model almost never predicts positive
  recall / TPR / sensitivity = TP/(TP + FN) = 1 - FNR: how many positives are caught, but can be gamed by always predicting positive
  F1 = 2 TP / (2 TP + FP + FN)
  TNR / specificity / selectivity = TN/(TN + FP) = 1 - FPR; FPR = FP/(FP + TN)
  ROC curve: TPR against FPR across decision thresholds
- Probabilistic metrics: calibration; a model is well calibrated if P(ŷ = y | p̂ = p) = p
- Reliability diagrams: bin the predictions by confidence; freq(B_m) = (1/|B_m|) Σ_{i∈B_m} 1{y_i = 1} (empirical frequency for binary p̂ = P̂(y = 1 | x)), conf(B_m) = (1/|B_m|) Σ_{i∈B_m} p̂_i; the model overstates its confidence where conf(B_m) > freq(B_m); well calibrated ⇔ freq(B_m) = conf(B_m) for all bins
- ECE = Σ_m (|B_m|/n) |freq(B_m) - conf(B_m)|; MCE = max_m |freq(B_m) - conf(B_m)|
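A short NumPy sketch of the reliability-diagram quantities and ECE/MCE above for binary probabilities (the number of bins is a free choice):

```python
import numpy as np

def ece_mce(p_hat, y, n_bins=10):
    """p_hat: predicted P(y=1); y: true labels in {0, 1}."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (p_hat > lo) & (p_hat <= hi)
        if not mask.any():
            continue
        freq = y[mask].mean()          # empirical frequency in the bin
        conf = p_hat[mask].mean()      # average confidence in the bin
        gap = abs(freq - conf)
        ece += mask.mean() * gap       # |B_m| / n weighting
        mce = max(mce, gap)
    return ece, mce

rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(int)   # perfectly calibrated toy data
print(ece_mce(p, y))                            # ECE and MCE close to 0
```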
Extras

Statistics
- Robbins-Monro step sizes: Σ_t α_t = ∞ and Σ_t α_t² < ∞ (e.g. α_t = 1/t), required by the stochastic-approximation convergence results used above

Calculus
- Integration by parts: ∫_a^b u v' dx = [u v]_a^b - ∫_a^b u' v dx
- Matrix calculus: ∇_x(a^T x) = a; ∇_X(a^T X b) = a b^T; ∇_X(a^T X^T b) = b a^T; ∇_x(x^T A x) = (A + A^T) x; ∇_x(b^T A x) = A^T b; ∇_X log|X| = X^{-T}
- Norms: ‖x‖_p = (Σ_i |x_i|^p)^{1/p}; Euclidean norm ‖x‖₂² = ⟨x, x⟩ = x^T x with ∂‖x‖₂²/∂x = 2x; ∂‖x‖₁/∂x = sign(x) element-wise; ∂|x|/∂x = sign(x)
- Sigmoid and softmax derivatives: σ'(x) = σ(x)(1 - σ(x)) element-wise; ∂ softmax(x, i)/∂x_j = softmax(x, i)(1{i = j} - softmax(x, j))
- Taylor series: f(x) ≈ f(x_0) + (x - x_0)^T ∇f(x_0) + ½ (x - x_0)^T Hf(x_0)(x - x_0) + …

Linear Algebra
- ‖x‖²_D = x^T D x for D = diag(d); diag(σ_1, …, σ_n)^{-1} = diag(σ_1^{-1}, …, σ_n^{-1}); block-diagonal inverse: diag(A, B)^{-1} = diag(A^{-1}, B^{-1})
- X^T X = Σ_i x_i x_i^T for rows x_i; trace trick: x^T A x = Tr(x^T A x) = Tr(x x^T A) = Tr(A x x^T)
- A invertible ⇔ |A| ≠ 0 ⇔ columns (and rows) linearly independent ⇔ Ker(A) = {0}; the inverse is then unique
- Push-through identity: U(I + VU)^{-1} = (I + UV)^{-1} U
- A^T A ⪰ 0 and A A^T ⪰ 0, since x^T A^T A x = ‖Ax‖² ≥ 0
- Cholesky decomposition: A = L L^T for symmetric A ≻ 0 (L lower triangular)
- SVD: A = U Σ V^T (not unique); rank(A) = |{i : σ_i ≠ 0}|