Probabilistic Machine Learning: Advanced Topics
Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors
Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola
Introduction to Machine Learning, Ethem Alpaydin
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K.I. Williams
Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, Eds.
The Minimum Description Length Principle, Peter D. Grünwald
Introduction to Statistical Relational Learning, Lise Getoor and Ben Taskar, Eds.
Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman
Introduction to Machine Learning, second edition, Ethem Alpaydin
Boosting: Foundations and Algorithms, Robert E. Schapire and Yoav Freund
Machine Learning: A Probabilistic Perspective, Kevin P. Murphy
Foundations of Machine Learning, Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar
Probabilistic Machine Learning:
Advanced Topics
Kevin P. Murphy
The MIT Press would like to thank the anonymous peer reviewers who provided comments on drafts of this
book. The generous work of academic experts is essential for establishing the authority and quality of our
publications. We acknowledge with gratitude the contributions of these otherwise uncredited readers.
ISBN:
10 9 8 7 6 5 4 3 2 1
This book is dedicated to my wife Margaret,
who has been the love of my life for 20+ years.
Brief Contents
1 Introduction 1
I Fundamentals 3
2 Probability 5
3 Statistics 63
4 Probabilistic graphical models 123
5 Information theory 183
6 Optimization 219
II Inference 299
7 Inference algorithms: an overview 301
8 State-space inference 311
9 Message passing inference 365
10 Variational inference 397
11 Monte Carlo inference 447
12 Markov Chain Monte Carlo inference 463
13 Sequential Monte Carlo inference 513
III Prediction 547
14 Predictive models: an overview 549
15 Generalized linear models 565
16 Deep neural networks 597
17 Bayesian neural networks 619
18 Gaussian processes 653
19 Structured prediction 711
20 Beyond the iid assumption 733
IV Generation 769
21 Generative models: an overview 771
22 Variational autoencoders 785
23 Auto-regressive models 829
24 Normalizing Flows 837
25 Energy-based models 857
26 Denoising diffusion models 877
27 Generative adversarial networks 885
V Discovery 917
28 Discovery methods: an overview 919
29 Latent variable models 921
30 Hidden Markov models 961
31 State-space models 991
32 Graph learning 1019
33 Non-parametric Bayesian models 1029
34 Representation learning (Unfinished) 1061
35 Interpretability 1063
VI Decision making 1097
36 Multi-step decision problems 1099
37 Reinforcement learning 1125
38 Causality 1163
Contents
Preface xxxv
1 Introduction 1
I Fundamentals 3
2 Probability 5
2.1 Introduction 5
2.2 Some common univariate distributions 5
2.2.1 Some common discrete distributions 5
2.2.2 Some common continuous distributions 8
2.2.3 Pareto distribution 14
2.3 The multivariate Gaussian (normal) distribution 16
2.3.1 Definition 16
2.3.2 Moment form and canonical form 17
2.3.3 Marginals and conditionals of a MVN 17
2.3.4 Bayes’ rule for Gaussians 18
2.3.5 Example: sensor fusion with known measurement noise 19
2.3.6 Handling missing data 19
2.3.7 A calculus for linear Gaussian models 20
2.4 Some other multivariate continuous distributions 23
2.4.1 Multivariate Student distribution 23
2.4.2 Circular normal (von Mises Fisher) distribution 24
2.4.3 Matrix-variate Gaussian (MVG) distribution 24
2.4.4 Wishart distribution 24
2.4.5 Dirichlet distribution 27
2.5 The exponential family 28
2.5.1 Definition 29
2.5.2 Examples 30
2.5.3 Log partition function is cumulant generating function 34
2.5.4 Canonical (natural) vs mean (moment) parameters 36
2.5.5 MLE for the exponential family 37
2.5.6 Exponential dispersion family 38
2.5.7 Maximum entropy derivation of the exponential family 38
2.6 Fisher information matrix (FIM) 39
2.6.1 Definition 39
2.6.2 Equivalence between the FIM and the Hessian of the NLL 39
2.6.3 Examples 41
2.6.4 Approximating KL divergence using FIM 42
2.6.5 Fisher information matrix for exponential family 42
2.7 Transformations of random variables 44
2.7.1 Invertible transformations (bijections) 44
2.7.2 Monte Carlo approximation 44
2.7.3 Probability integral transform 44
2.8 Markov chains 46
2.8.1 Parameterization 46
2.8.2 Application: Language modeling 48
2.8.3 Parameter estimation 49
2.8.4 Stationary distribution of a Markov chain 51
2.9 Divergence measures between probability distributions 54
2.9.1 f-divergence 55
2.9.2 Integral probability metrics 56
2.9.3 Maximum mean discrepancy (MMD) 57
2.9.4 Total variation distance 60
2.9.5 Comparing distributions using binary classifiers 60
3 Statistics 63
3.1 Introduction 63
3.1.1 Frequentist statistics 63
3.1.2 Bayesian statistics 63
3.1.3 Arguments for the Bayesian approach 64
3.1.4 Arguments against the Bayesian approach 64
3.1.5 Why not just use MAP estimation? 65
3.2 Closed-form analysis using conjugate priors 69
3.2.1 The binomial model 69
3.2.2 The multinomial model 77
3.2.3 The univariate Gaussian model 79
3.2.4 The multivariate Gaussian model 84
3.2.5 Conjugate-exponential models 90
3.3 Beyond conjugate priors 92
3.3.1 Robust (heavy-tailed) priors 92
3.3.2 Priors for variance parameters 93
3.4 Noninformative priors 94
3.4.1 Maximum entropy priors 94
3.4.2 Jeffreys priors 95
3.4.3 Invariant priors 98
3.4.4 Reference priors 99
3.5 Hierarchical priors 100
3.5.1 A hierarchical binomial model 100
3.5.2 A hierarchical Gaussian model 102
3.6 Empirical Bayes 105
3.6.1 A hierarchical binomial model 106
3.6.2 A hierarchical Gaussian model 107
3.6.3 Hierarchical Bayes for n-gram smoothing 108
3.7 Model selection and evaluation 110
3.7.1 Bayesian model selection 110
3.7.2 Estimating the marginal likelihood 111
3.7.3 Connection between cross validation and marginal likelihood 112
3.7.4 Pareto-Smoothed Importance Sampling LOO estimate 113
3.7.5 Information criteria 114
3.7.6 Posterior predictive checks 116
3.7.7 Bayesian p-values 117
3.8 Bayesian decision theory 119
3.8.1 Basics 120
3.8.2 Example: COVID-19 120
3.8.3 One-shot decision problems 121
3.8.4 Multi-stage decision problems 122
4 Probabilistic graphical models 123
4.1 Introduction 123
4.2 Directed graphical models (Bayes nets) 123
4.2.1 Representing the joint distribution 123
4.2.2 Examples 124
4.2.3 Conditional independence properties 129
4.2.4 Generation (sampling) 134
4.2.5 Inference 134
4.2.6 Learning 136
4.2.7 Plate notation 141
4.3 Undirected graphical models (Markov random fields) 144
4.3.1 Representing the joint distribution 144
4.3.2 Examples 146
4.3.3 Conditional independence properties 153
4.3.4 Generation (sampling) 155
4.3.5 Inference 155
4.3.6 Learning 156
4.4 Comparing directed and undirected PGMs 160
4.4.1 CI properties 160
4.4.2 Converting between a directed and undirected model 162
4.4.3 Combining directed and undirected graphs 163
4.4.4 Comparing directed and undirected Gaussian PGMs 165
4.4.5 Factor graphs 167
4.5 Extensions of Bayes nets 170
4.5.1 Probabilistic circuits 170
4.5.2 Relational probability models 171
4.5.3 Open-universe probability models 173
4.5.4 Programs as probability models 175
4.6 Structural causal models 175
4.6.1 Example: causal impact of education on wealth 176
4.6.2 Structural equation models 177
4.6.3 Do operator and augmented DAGs 177
4.6.4 Estimating average treatment effect using path analysis 178
4.6.5 Counterfactuals 179
5 Information theory 183
5.1 KL divergence 183
5.1.1 Desiderata 184
5.1.2 The KL divergence uniquely satisfies the desiderata 185
5.1.3 Thinking about KL 188
5.1.4 Properties of KL 190
5.1.5 KL divergence and MLE 192
5.1.6 KL divergence and Bayesian Inference 193
5.1.7 KL divergence and Exponential Families 194
5.2 Entropy 195
5.2.1 Definition 195
5.2.2 Differential entropy for continuous random variables 196
5.2.3 Typical sets 197
5.2.4 Cross entropy and perplexity 199
5.3 Mutual information 200
5.3.1 Definition 200
5.3.2 Interpretation 200
5.3.3 Data processing inequality 201
5.3.4 Sufficient Statistics 202
5.3.5 Multivariate mutual information 202
5.3.6 Variational bounds on mutual information 205
5.4 Data compression (source coding) 208
5.4.1 Lossless compression 208
5.4.2 Lossy compression and the rate-distortion tradeoff 208
5.4.3 Bits back coding 211
5.5 Error-correcting codes (channel coding) 211
5.6 The information bottleneck 213
5.6.1 Vanilla IB 213
5.6.2 Variational IB 214
5.6.3 Conditional entropy bottleneck 215
6 Optimization 219
6.1 Introduction 219
6.2 Automatic differentiation 219
6.2.1 Differentiation in functional form 219
6.2.2 Differentiating chains, circuits, and programs 224
6.3 Stochastic gradient descent 229
6.4 Natural gradient descent 230
6.4.1 Defining the natural gradient 230
6.4.2 Interpretations of NGD 231
6.4.3 Benefits of NGD 232
6.4.4 Approximating the natural gradient 233
6.4.5 Natural gradients for the exponential family 234
6.5 Mirror descent 236
6.5.1 Bregman divergence 237
6.5.2 Proximal point method 238
6.5.3 PPM using Bregman divergence 238
6.6 Gradients of stochastic functions 238
6.6.1 Minibatch approximation to finite-sum objectives 239
6.6.2 Optimizing parameters of a distribution 239
6.6.3 Score function estimator (likelihood ratio trick) 240
6.6.4 Reparameterization trick 241
6.6.5 The delta method 243
6.6.6 Gumbel softmax trick 243
6.6.7 Stochastic computation graphs 244
6.6.8 Straight-through estimator 244
6.7 Bound optimization (MM) algorithms 245
6.7.1 The general algorithm 245
6.7.2 Example: logistic regression 246
6.7.3 The EM algorithm 248
6.7.4 Example: EM for an MVN with missing data 250
6.7.5 Example: robust linear regression using Student-t likelihood 252
6.7.6 Extensions to EM 253
6.8 The Bayesian learning rule 255
6.8.1 Deriving inference algorithms from BLR 256
6.8.2 Deriving optimization algorithms from BLR 258
6.8.3 Variational optimization 261
6.9 Bayesian optimization 262
6.9.1 Sequential model-based optimization 263
6.9.2 Surrogate functions 263
6.9.3 Acquisition functions 265
6.9.4 Other issues 268
6.10 Optimal Transport 269
6.10.1 Warm-up: Matching optimally two families of points 269
6.10.2 From Optimal Matchings to Kantorovich and Monge formulations 270
6.10.3 Solving optimal transport 273
6.11 Submodular optimization 277
6.11.1 Intuition, Examples, and Background 278
6.11.2 Submodular Basic Definitions 280
6.11.3 Example Submodular Functions 281
6.11.4 Submodular Optimization 284
6.11.5 Applications of Submodularity in Machine Learning and AI 288
6.11.6 Sketching, CoreSets, Distillation, and Data Subset & Feature Selection 288
6.11.7 Combinatorial Information Functions 292
6.11.8 Clustering, Data Partitioning, and Parallel Machine Learning 293
6.11.9 Active and Semi-Supervised Learning 294
6.11.10 Probabilistic Modeling 295
6.11.11 Structured Norms and Loss Functions 296
6.11.12 Conclusions 297
6.12 Derivative free optimization 297
II Inference 299
7 Inference algorithms: an overview 301
7.1 Introduction 301
7.2 Common inference patterns 301
7.2.1 Global latents 302
7.2.2 Local latents 302
7.2.3 Global and local latents 303
7.3 Exact inference algorithms 303
7.4 Approximate inference algorithms 304
7.4.1 MAP estimation 304
7.4.2 Grid approximation 304
7.4.3 Laplace (quadratic) approximation 305
7.4.4 Variational inference 306
7.4.5 Markov Chain Monte Carlo (MCMC) 308
7.4.6 Sequential Monte Carlo 309
7.5 Evaluating approximate inference algorithms 309
8 State-space inference 311
8.1 Introduction 311
8.1.1 State space models 311
8.1.2 Example: casino HMM 313
8.1.3 Example: linear-Gaussian SSM for tracking in 2d 314
8.1.4 Inferential goals 314
8.2 Bayesian filtering and smoothing 317
8.2.1 The filtering equations 318
8.2.2 The smoothing equations 318
8.3 Inference for discrete SSMs 319
8.3.1 Forwards filtering 319
8.3.2 Backwards smoothing 321
8.3.3 The forwards-backwards algorithm 321
8.3.4 Two-slice smoothed marginals 323
8.3.5 Time and space complexity 324
8.3.6 The Viterbi algorithm 325
8.3.7 Forwards filtering, backwards sampling 328
8.3.8 Application to discretized state spaces 328
8.4 Inference for linear-Gaussian SSMs 329
8.4.1 The Kalman filter 329
8.4.2 Kalman filtering for linear regression (recursive least squares) 334
8.4.3 Predictive coding as Kalman filtering 336
8.4.4 The Kalman (RTS) smoother 338
8.5 Inference based on local linearization 339
8.5.1 Taylor series expansion 339
8.5.2 The extended Kalman filter (EKF) 342
8.5.3 The extended Kalman smoother 345
8.5.4 Exponential-family EKF 345
8.6 Inference based on the unscented transform 347
8.6.1 The unscented transform 348
8.6.2 The unscented Kalman filter (UKF) 349
8.6.3 The unscented Kalman smoother 351
8.7 Other variants of the Kalman filter 352
8.7.1 Ensemble Kalman filter 352
8.7.2 Robust Kalman filters 353
8.7.3 Gaussian filtering 353
8.8 Assumed density filtering 356
8.8.1 The ADF algorithm 356
8.8.2 Connection with Gaussian filtering 357
8.8.3 The Gaussian sum filter for switching SSMs 357
8.8.4 ADF for training logistic regression 360
9 Message passing inference 365
9.1 Introduction 365
9.2 Belief propagation on trees 366
9.2.1 BP for polytrees 366
9.2.2 BP for undirected graphs with pairwise potentials 369
9.2.3 BP for factor graphs 370
9.2.4 Max product belief propagation 371
9.2.5 Gaussian and non-Gaussian belief propagation 373
9.3 Loopy belief propagation 373
9.3.1 Convergence 374
9.3.2 Accuracy 376
9.3.3 Connection with variational inference 377
9.3.4 Generalized belief propagation 377
9.3.5 Application: error correcting codes 377
9.3.6 Application: Affinity propagation 379
9.3.7 Emulating BP with graph neural nets 380
9.4 The variable elimination (VE) algorithm 381
9.4.1 Derivation of the algorithm 381
9.4.2 Computational complexity of VE 382
9.4.3 Computational complexity of exact inference 384
9.4.4 Drawbacks of VE 385
9.5 The junction tree algorithm (JTA) 386
9.5.1 Creating a junction tree 386
9.5.2 Running belief propagation on a junction tree 391
9.5.3 The generalized distributive law 392
9.5.4 Other applications of the JTA 393
9.6 Inference as backpropagation 393
10 Variational inference 397
10.1 Introduction 397
10.1.1 Variational free energy 397
10.1.2 Evidence lower bound (ELBO) 398
10.2 Mean field VI 399
10.2.1 Coordinate ascent variational inference (CAVI) 399
10.2.2 Example: CAVI for the Ising model 400
10.2.3 Variational Bayes 402
10.2.4 Example: VB for a univariate Gaussian 403
10.2.5 Variational Bayes EM 406
10.2.6 Example: VBEM for a GMM 407
10.2.7 Variational message passing (VMP) 413
10.2.8 Autoconj 414
10.3 Fixed-form VI 414
10.3.1 Black-box variational inference 414
10.3.2 Stochastic variational inference 416
10.3.3 Reparameterization VI 417
10.3.4 Gaussian VI 418
10.3.5 Automatic differentiation VI 422
10.3.6 Beyond Gaussian posteriors 423
10.3.7 Amortized inference 425
10.3.8 Exploiting partial conjugacy 426
10.3.9 Online variational inference 430
10.4 More accurate variational posteriors 433
10.4.1 Structured mean field 434
10.4.2 Hierarchical (auxiliary variable) posteriors 434
10.4.3 Normalizing flow posteriors 434
10.4.4 Implicit posteriors 436
10.4.5 Combining VI with MCMC inference 437
10.5 Lower bounds 437
10.5.1 Multi-sample ELBO (IWAE bound) 437
10.5.2 The thermodynamic variational objective (TVO) 438
10.6 Upper bounds 438
10.6.1 Minimizing the χ-divergence upper bound 439
10.6.2 Minimizing the evidence upper bound 440
10.7 Expectation propagation (EP) 441
10.7.1 Minimizing forwards vs reverse KL 441
10.7.2 EP as generalized ADF 443
10.7.3 Algorithm 443
10.7.4 Example 444
10.7.5 Optimization issues 444
10.7.6 Power EP and α-divergence 445
10.7.7 Stochastic EP 445
10.7.8 Applications 446
11 Monte Carlo inference 447
11.1 Introduction 447
11.2 Monte Carlo integration 447
11.2.1 Example: estimating π by Monte Carlo integration 448
11.2.2 Accuracy of Monte Carlo integration 448
11.3 Generating random samples from simple distributions 450
11.3.1 Sampling using the inverse cdf 450
11.3.2 Sampling from a Gaussian (Box-Muller method) 451
11.4 Rejection sampling 451
11.4.1 Basic idea 452
11.4.2 Example 453
11.4.3 Adaptive rejection sampling 453
11.4.4 Rejection sampling in high dimensions 454
11.5 Importance sampling 454
11.5.1 Direct importance sampling 455
11.5.2 Self-normalized importance sampling 455
11.5.3 Choosing the proposal 456
11.5.4 Annealed importance sampling (AIS) 456
11.6 Controlling Monte Carlo variance 458
11.6.1 Rao-Blackwellisation 458
11.6.2 Control variates 459
11.6.3 Antithetic sampling 460
11.6.4 Quasi Monte Carlo (QMC) 461
12 Markov Chain Monte Carlo inference 463
12.1 Introduction 463
12.2 Metropolis Hastings algorithm 463
12.2.1 Basic idea 464
12.2.2 Why MH works 465
12.2.3 Proposal distributions 466
12.2.4 Initialization 469
12.2.5 Simulated annealing 469
12.3 Gibbs sampling 471
12.3.1 Basic idea 472
12.3.2 Gibbs sampling is a special case of MH 472
12.3.3 Example: Gibbs sampling for Ising models 473
12.3.4 Example: Gibbs sampling for Potts models 474
12.3.5 Example: Gibbs sampling for GMMs 475
12.3.6 Sampling from the full conditionals 477
12.3.7 Blocked Gibbs sampling 477
12.3.8 Collapsed Gibbs sampling 478
12.4 Auxiliary variable MCMC 480
12.4.1 Slice sampling 481
12.4.2 Swendsen Wang 483
12.5 Hamiltonian Monte Carlo (HMC) 484
12.5.1 Hamiltonian mechanics 484
12.5.2 Integrating Hamilton’s equations 485
12.5.3 The HMC algorithm 487
12.5.4 Tuning HMC 488
12.5.5 Riemann Manifold HMC 489
12.5.6 Langevin Monte Carlo (MALA) 489
12.5.7 Connection between SGD and Langevin sampling 490
12.5.8 Applying HMC to constrained parameters 492
12.5.9 Speeding up HMC 493
12.6 MCMC convergence 493
12.6.1 Mixing rates of Markov chains 494
12.6.2 Practical convergence diagnostics 495
12.6.3 Improving speed of convergence 502
12.6.4 Non-centered parameterizations and Neal’s funnel 502
12.7 Stochastic gradient MCMC 504
12.7.1 Stochastic Gradient Langevin Dynamics (SGLD) 504
12.7.2 Preconditioning 505
12.7.3 Reducing the variance of the gradient estimate 506
12.7.4 SG-HMC 507
12.7.5 Underdamped Langevin Dynamics 507
12.8 Reversible jump (trans-dimensional) MCMC 509
12.8.1 Basic idea 510
12.8.2 Example 511
12.8.3 Discussion 512
12.9 Annealing methods 512
12.9.1 Parallel tempering 512
13 Sequential Monte Carlo inference 513
13.1 Introduction 513
13.1.1 Problem statement 513
13.1.2 Particle filtering for state-space models 513
13.1.3 SMC samplers for static parameter estimation 515
13.2 Basics of SMC 515
13.2.1 Importance sampling 515
13.2.2 Sequential importance sampling 516
13.2.3 Sequential importance sampling with resampling 517
13.2.4 Resampling methods 520
13.2.5 Adaptive resampling 522
13.3 Some applications of particle filtering 523
13.3.1 1d pendulum model with outliers 523
13.3.2 Visual object tracking 525
13.3.3 Robot localization 525
13.3.4 Online parameter estimation 527
13.4 Proposal distributions 527
13.4.1 Locally optimal proposal 528
13.4.2 Proposals based on the Laplace approximation 528
13.4.3 Proposals based on the extended and unscented Kalman filter 530
13.4.4 Proposals based on SMC 530
13.4.5 Neural adaptive SMC 531
13.4.6 Amortized adaptive SMC 531
13.4.7 Variational SMC 532
13.5 Rao-Blackwellised particle filtering (RBPF) 533
13.5.1 Mixture of Kalman filters 533
13.5.2 FastSLAM 535
13.6 SMC samplers 537
13.6.1 Ingredients of an SMC sampler 537
13.6.2 Likelihood tempering (geometric path) 538
13.6.3 Data tempering 541
13.6.4 Sampling rare events and extrema 542
13.6.5 SMC-ABC and likelihood-free inference 543
13.6.6 SMC2 544
13.7 Particle MCMC methods 544
13.7.1 Particle Marginal Metropolis Hastings 544
13.7.2 Particle Independent Metropolis Hastings 545
13.7.3 Particle Gibbs 546
III Prediction 547
14 Predictive models: an overview 549
14.1 Introduction 549
14.1.1 Types of model 549
14.1.2 Model fitting using ERM, MLE and MAP 550
14.1.3 Model fitting using Bayes, VI and generalized Bayes 551
14.2 Evaluating predictive models 552
14.2.1 Proper scoring rules 552
14.2.2 Calibration 552
14.2.3 Beyond evaluating marginal probabilities 556
14.3 Conformal prediction 559
14.3.1 Conformalizing classification 560
14.3.2 Conformalizing regression 561
14.3.3 Conformalizing Bayes 562
14.3.4 What do we do if we don’t have a calibration set? 563
15 Generalized linear models 565
15.1 Introduction 565
15.1.1 Examples 565
15.1.2 GLMs with non-canonical link functions 568
15.1.3 Maximum likelihood estimation 568
15.1.4 Bayesian inference 569
15.2 Linear regression 570
15.2.1 Conjugate priors 570
15.2.2 Uninformative priors 572
15.2.3 Informative priors 574
15.2.4 Spike and slab prior 576
15.2.5 Laplace prior (Bayesian lasso) 577
15.2.6 Horseshoe prior 578
15.2.7 Automatic relevancy determination 579
15.3 Logistic regression 581
15.3.1 Binary logistic regression 582
15.3.2 Multinomial logistic regression 582
15.3.3 Priors 583
15.3.4 Posteriors 584
15.3.5 Laplace approximation 584
15.3.6 MCMC inference 587
15.3.7 Variational inference 588
15.4 Probit regression 588
15.4.1 Latent variable interpretation 588
15.4.2 Maximum likelihood estimation 589
15.4.3 Bayesian inference 590
15.4.4 Ordinal probit regression 591
15.4.5 Multinomial probit models 592
15.5 Multi-level GLMs 592
15.5.1 Generalized linear mixed models (GLMMs) 592
15.5.2 Model fitting 593
15.5.3 Example: radon regression 593
16 Deep neural networks 597
16.1 Introduction 597
16.2 Building blocks of differentiable circuits 597
16.2.1 Linear layers 598
16.2.2 Non-linearities 598
16.2.3 Convolutional layers 599
16.2.4 Residual (skip) connections 600
16.2.5 Normalization layers 601
16.2.6 Dropout layers 601
16.2.7 Attention layers 602
16.2.8 Recurrent layers 605
16.2.9 Multiplicative layers 605
16.2.10 Implicit layers 606
16.3 Canonical examples of neural networks 606
16.3.1 Multi-layer perceptrons (MLP) 607
16.3.2 Convolutional neural networks (CNN) 607
16.3.3 Recurrent neural networks (RNN) 607
16.3.4 Transformers 609
16.3.5 Graph neural networks (GNNs) 612
17 Bayesian neural networks 619
17.1 Introduction 619
17.2 Priors for BNNs 619
17.2.1 Gaussian priors 620
17.2.2 Sparsity-promoting priors 621
17.2.3 Learning the prior 622
17.2.4 Priors in function space 622
17.2.5 Architectural priors 622
17.3 Likelihoods for BNNs 623
17.4 Posteriors for BNNs 624
17.4.1 Laplace approximation 624
17.4.2 Variational inference 625
17.4.3 Expectation propagation 626
17.4.4 Last layer methods 626
17.4.5 Dropout 626
17.4.6 MCMC methods 627
17.4.7 Methods based on the SGD trajectory 627
17.4.8 Deep ensembles 628
17.4.9 Approximating the posterior predictive distribution 632
17.5 Generalization in Bayesian deep learning 633
17.5.1 Sharp vs flat minima 633
17.5.2 Effective dimensionality of a model 634
17.5.3 The hypothesis space of DNNs 636
17.5.4 Double descent 637
17.5.5 A Bayesian Resolution to Double Descent 638
17.5.6 PAC-Bayes 640
17.5.7 Out-of-Distribution Generalization for BNNs 641
17.6 Online inference 644
17.6.1 Extended Kalman Filtering for DNNs 644
17.6.2 Assumed Density Filtering for DNNs 646
17.6.3 Sequential Laplace for DNNs 648
17.6.4 Variational methods 648
17.7 Hierarchical Bayesian neural networks 648
17.7.1 Solving multiple related classification problems 649
18 Gaussian processes 653
18.1 Introduction 653
18.2 Mercer kernels 655
18.2.1 Some popular Mercer kernels 656
18.2.2 Mercer’s theorem 661
18.2.3 Kernels from Spectral Densities 662
18.3 GPs with Gaussian likelihoods 664
18.3.1 Predictions using noise-free observations 664
18.3.2 Predictions using noisy observations 665
18.3.3 Weight space vs function space 666
18.3.4 Semi-parametric GPs 667
18.3.5 Marginal likelihood 668
18.3.6 Computational and numerical issues 668
18.3.7 Kernel ridge regression 669
18.4 GPs with non-Gaussian likelihoods 672
18.4.1 Binary classification 672
18.4.2 Multi-class classification 674
18.4.3 GPs for Poisson regression (Cox process) 674
18.5 Scaling GP inference to large datasets 675
18.5.1 Subset of data 676
18.5.2 Nyström approximation 677
18.5.3 Inducing point methods 678
18.5.4 Sparse variational methods 681
18.5.5 Exploiting parallelization and structure via kernel matrix multiplies 684
18.6 Learning the kernel 687
18.6.1 Empirical Bayes for the kernel parameters 687
18.6.2 Bayesian inference for the kernel parameters 690
18.6.3 Multiple kernel learning for additive kernels 691
18.6.4 Automatic search for compositional kernels 693
18.6.5 Spectral mixture kernel learning 695
18.6.6 Deep kernel learning 697
18.6.7 Functional kernel learning 698
18.7 GPs and DNNs 699
18.7.1 Kernels derived from random DNNs (NN-GP) 700
18.7.2 Kernels derived from trained DNNs (neural tangent kernel) 703
18.7.3 Deep GPs 705
19 Structured prediction 711
19.1 Introduction 711
19.2 Conditional random fields (CRFs) 711
19.2.1 1d CRFs 711
19.2.2 2d CRFs 715
19.2.3 Parameter estimation 717
19.2.4 Other approaches 718
19.3 Time series forecasting 719
19.3.1 Structural time series models 719
19.3.2 Prophet 725
19.3.3 Gaussian processes for timeseries forecasting 726
19.3.4 Neural forecasting methods 727
19.3.5 Causal impact of a time series intervention 728
20 Beyond the iid assumption 733
20.1 Introduction 733
20.2 Distribution shift 733
20.2.1 Motivating examples 733
20.2.2 A causal view of distribution shift 735
20.2.3 Covariate shift 736
20.2.4 Domain shift 736
20.2.5 Label / prior shift 737
20.2.6 Concept shift 737
20.2.7 Manifestation shift 737
20.2.8 Selection bias 737
20.3 Training-time techniques for distribution shift 738
20.3.1 Importance weighting for covariate shift 738
20.3.2 Domain adaptation 740
20.3.3 Domain randomization 740
20.3.4 Data augmentation 741
20.3.5 Unsupervised label shift estimation 741
20.3.6 Distributionally robust optimization 741
20.4 Test-time techniques for distribution shift 742
20.4.1 Detecting shifts using two-sample testing 742
20.4.2 Detecting single out-of-distribution (OOD) inputs 742
20.4.3 Selective prediction 745
20.4.4 Open world recognition 747
20.4.5 Online adaptation 747
20.5 Learning from multiple distributions 748
20.5.1 Transfer learning 748
20.5.2 Few-shot learning 749
20.5.3 Prompt tuning 749
20.5.4 Zero-shot learning 750
20.5.5 Multi-task learning 750
20.5.6 Domain generalization 751
20.5.7 Invariant risk minimization 752
20.6 Meta-learning 753
20.6.1 Meta-learning as probabilistic inference for prediction 754
20.6.2 Gradient-based meta-learning 755
20.6.3 Metric-based few-shot learning 755
20.6.4 VERSA 755
20.6.5 Neural processes 756
20.7 Continual learning 756
20.7.1 Domain drift 756
20.7.2 Concept drift 756
20.7.3 Task incremental learning 758
20.7.4 Catastrophic forgetting 759
20.7.5 Online learning 761
20.8 Adversarial examples 762
20.8.1 Whitebox (gradient-based) attacks 764
20.8.2 Blackbox (gradient-free) attacks 764
20.8.3 Real world adversarial attacks 766
20.8.4 Defenses based on robust optimization 766
20.8.5 Why models have adversarial examples 767
IV Generation 769
21 Generative models: an overview 771
21.1 Introduction 771
21.2 Types of generative model 771
21.3 Goals of generative modeling 773
21.3.1 Generating data 773
21.3.2 Density estimation 775
21.3.3 Imputation 775
21.3.4 Structure discovery 776
21.3.5 Latent space interpolation 776
21.3.6 Representation learning 777
21.4 Evaluating generative models 777
21.4.1 Likelihood 778
21.4.2 Distances and divergences in feature space 780
21.4.3 Precision and recall metrics 781
21.4.4 Statistical tests 782
21.4.5 Challenges with using pretrained classifiers 782
21.4.6 Using model samples to train classifiers 783
21.4.7 Assessing overfitting 783
21.4.8 Human evaluation 784
22 Variational autoencoders 785
22.1 Introduction 785
22.2 VAE basics 785
22.2.1 Modeling assumptions 786
22.2.2 Evidence lower bound 787
22.2.3 Optimization 788
22.2.4 The reparameterization trick 788
22.2.5 Computing the reparameterized ELBO 790
22.2.6 Comparison of VAEs and autoencoders 792
22.2.7 VAEs optimize in an augmented space 793
22.3 VAE generalizations 795
22.3.1 σ-VAE 795
22.3.2 β-VAE 796
22.3.3 InfoVAE 798
22.3.4 Multi-modal VAEs 800
22.3.5 VAEs with missing data 803
22.3.6 Semi-supervised VAEs 805
22.3.7 VAEs with sequential encoders/decoders 806
22.4 Avoiding posterior collapse 809
22.4.1 KL annealing 810
22.4.2 Lower bounding the rate 810
22.4.3 Free bits 810
22.4.4 Adding skip connections 811
22.4.5 Improved variational inference 811
22.4.6 Alternative objectives 811
22.4.7 Enforcing identifiability 812
22.5 VAEs with hierarchical structure 813
22.5.1 Bottom-up vs top-down inference 813
22.5.2 Example: Very deep VAE 814
22.5.3 Connection with autoregressive models 815
22.5.4 Variational pruning 817
22.5.5 Other optimization difficulties 818
22.6 Vector quantization VAE 818
22.6.1 Autoencoder with binary code 818
22.6.2 VQ-VAE model 819
22.6.3 Learning the prior 821
22.6.4 Hierarchical extension (VQ-VAE-2) 821
22.6.5 Discrete VAE 822
22.6.6 VQ-GAN 824
22.7 Wake-sleep algorithm 824
22.7.1 Wake phase 825
22.7.2 Sleep phase 825
22.7.3 Daydream phase 826
22.7.4 Summary of algorithm 827
23 Auto-regressive models 829
23.1 Introduction 829
23.2 Neural autoregressive density estimators (NADE) 830
23.3 Causal CNNs 830
23.3.1 1d causal CNN (Convolutional Markov models) 831
23.3.2 2d causal CNN (PixelCNN) 831
23.4 Transformer decoders 832
23.4.1 Text generation (GPT) 833
23.4.2 Music generation 833
23.4.3 Text-to-image generation (DALL-E) 834
24 Normalizing Flows 837
24.1 Introduction 837
24.1.1 Preliminaries 837
24.1.2 Example 839
24.1.3 How to train a flow model 840
24.2 Constructing Flows 841
24.2.1 Affine flows 841
24.2.2 Elementwise flows 842
24.2.3 Coupling flows 844
24.2.4 Autoregressive flows 846
24.2.5 Residual flows 851
24.2.6 Continuous-time flows 853
24.3 Applications 854
24.3.1 Density estimation 854
24.3.2 Generative Modeling 855
24.3.3 Inference 855
25 Energy-based models 857
25.1 Introduction 857
25.1.1 Example: Products of experts (PoE) 858
25.1.2 Computational difficulties 858
25.2 Maximum Likelihood Training 859
25.2.1 Gradient-based MCMC methods 860
25.2.2 Contrastive divergence 860
25.3 Score Matching (SM) 863
25.3.1 Basic score matching 864
25.3.2 Denoising Score Matching (DSM) 865
25.3.3 Sliced Score Matching (SSM) 866
25.3.4 Connection to Contrastive Divergence 867
25.3.5 Score-Based Generative Models 868
25.4 Noise Contrastive Estimation 871
25.4.1 Connection to Score Matching 872
25.5 Other Methods 873
25.5.1 Minimizing Differences/Derivatives of KL Divergences 873
25.5.2 Minimizing the Stein Discrepancy 874
25.5.3 Adversarial Training 874
26 Denoising diffusion models 877
26.1 Model definition 877
26.2 Examples 879
26.3 Model training 880
26.4 Connections with other generative models 882
26.4.1 Connection with score matching 882
26.4.2 Connection with VAEs 883
26.4.3 Connection with flow models 883
27 Generative adversarial networks 885
27.1 Introduction 885
27.2 Learning by Comparison 886
27.2.1 Guiding principles 887
27.2.2 Class probability estimation 888
27.2.3 Bounds on f-divergences 891
27.2.4 Integral probability metrics 892
27.2.5 Moment matching 894
27.2.6 On density ratios and differences 895
27.3 Generative Adversarial Networks 896
27.3.1 From learning principles to loss functions 897
27.3.2 Gradient Descent 898
27.3.3 Challenges with GAN training 899
27.3.4 Improving GAN optimization 901
27.3.5 Convergence of GAN training 901
27.4 Conditional GANs 904
27.5 Inference with GANs 906
27.6 Neural architectures in GANs 906
27.6.1 The importance of discriminator architectures 907
27.6.2 Architectural inductive biases 907
27.6.3 Attention in GANs 907
27.6.4 Progressive generation 908
27.6.5 Regularization 909
27.6.6 Scaling up GAN models 910
27.7 Applications 910
27.7.1 GANs for image generation 911
27.7.2 Video generation 913
27.7.3 Audio generation 914
27.7.4 Text generation 914
27.7.5 Imitation Learning 915
27.7.6 Domain Adaptation 916
27.7.7 Design, Art and Creativity 916
V Discovery 917
28 Discovery methods: an overview 919
28.1 Introduction 919
28.2 Overview of Part V 920
29 Latent variable models 921
29.1 Introduction 921
29.2 Mixture models 921
29.2.1 Gaussian mixture models (GMMs) 922
29.2.2 Bernoulli mixture models 924
29.2.3 Gaussian scale mixtures 924
29.2.4 Using GMMs as a prior for inverse imaging problems 926
29.3 Factor analysis 929
29.3.1 Vanilla factor analysis 929
29.3.2 Probabilistic PCA 933
29.3.3 Factor analysis models for paired data 936
29.3.4 Factor analysis with exponential family likelihoods 939
29.3.5 Factor analysis with DNN likelihoods 940
29.3.6 Factor analysis with GP likelihoods (GP-LVM) 941
29.4 Mixture of factor analysers 943
29.4.1 Model definition 943
29.4.2 Model fitting 944
29.4.3 MixFA for image generation 945
29.5 LVMs with non-Gaussian priors 949
29.5.1 Non-negative matrix factorization (NMF) 949
29.5.2 Multinomial PCA 950
29.5.3 Latent Dirichlet Allocation (LDA) 952
29.6 Independent components analysis (ICA) 953
29.6.1 Noiseless ICA model 954
29.6.2 The need for non-Gaussian priors 954
29.6.3 Maximum likelihood estimation 955
29.6.4 Alternatives to MLE 956
29.6.5 Sparse coding 958
29.6.6 Nonlinear ICA 959
30 Hidden Markov models 961
30.1 Introduction 961
30.2 HMMs: parameterization 961
30.2.1 Transition model 961
30.2.2 Observation model 962
30.3 HMMs: Applications 965
30.3.1 Segmentation of time series data 965
30.3.2 Spelling correction 967
30.3.3 Protein sequence alignment 970
30.4 HMMs: parameter learning 971
30.4.1 The Baum-Welch (EM) algorithm 971
30.4.2 Parameter estimation using SGD 975
30.4.3 Parameter estimation using spectral methods 977
30.4.4 Bayesian parameter inference 978
30.5 HMMs: Generalizations 978
30.5.1 Hidden semi-Markov model (HSMM) 979
30.5.2 HSMMs for changepoint detection 981
30.5.3 Hierarchical HMMs 984
30.5.4 Factorial HMMs 986
30.5.5 Coupled HMMs 988
30.5.6 Dynamic Bayes nets (DBN) 990
31 State-space models 991
31.1 Introduction 991
31.2 Linear dynamical systems 991
31.2.1 Example: Noiseless 1d spring-mass system 992
31.2.2 Example: Noisy 2d tracking problem 993
31.2.3 Example: Online linear regression 996
31.2.4 Example: structural time series forecasting 998
31.2.5 Parameter estimation 998
31.3 Non-linear dynamical systems 1000
31.3.1 Example: nonlinear 2d tracking problem 1001
31.3.2 Example: Simultaneous localization and mapping (SLAM) 1001
31.3.3 Example: stochastic volatility models 1003
31.3.4 Example: Multi-target tracking 1004
31.4 Other kinds of SSM 1006
31.4.1 Exponential family SSM 1006
31.4.2 Bayesian SSM 1010
31.4.3 GP-SSM 1010
31.5 Deep state space models 1011
31.5.1 Deep Markov models 1011
31.5.2 Recurrent SSM 1012
31.5.3 Improving multi-step predictions 1013
31.5.4 Variational RNNs 1014
31.5.5 Structured State Space Sequence model (S4) 1015
32 Graph learning 1019
32.1 Introduction 1019
32.2 Latent variable models for graphs 1019
32.2.1 Stochastic block model 1019
32.2.2 Mixed membership stochastic block model 1021
32.2.3 Infinite relational model 1023
32.3 Graphical model structure learning 1025
32.3.1 Applications 1025
32.3.2 Relevance networks 1027
32.3.3 Learning sparse PGMs 1028
33 Non-parametric Bayesian models 1029
33.1 Introduction 1029
33.2 Dirichlet process 1030
33.2.1 Definition 1030
33.2.2 Stick breaking construction of the DP 1032
33.2.3 The Chinese restaurant process (CRP) 1033
33.2.4 Dirichlet process mixture models 1035
33.3 Generalizations of the Dirichlet process 1040
33.3.1 Pitman-Yor process 1041
33.3.2 Dependent random probability measures 1042
33.4 The Indian buffet process and the Beta process 1044
33.5 Small-variance asymptotics 1047
33.6 Completely random measures 1050
33.7 Lévy processes 1051
33.8 Point processes with repulsion and reinforcement 1053
33.8.1 Poisson process 1053
33.8.2 Renewal process 1054
33.8.3 Hawkes process 1055
33.8.4 Gibbs point process 1057
33.8.5 Determinantal point process 1058
34 Representation learning (Unfinished) 1061
34.1 CLIP 1061
35 Interpretability 1063
35.1 Introduction 1063
35.1.1 The Role of Interpretability 1064
35.1.2 Terminology and Framework 1065
35.2 Methods for Interpretable Machine Learning 1069
35.2.1 Inherently Interpretable Models: The Model is its Explanation 1069
35.2.2 Semi-Inherently Interpretable Models: Example-Based Methods 1071
35.2.3 Post-hoc or Joint training: The Explanation gives a Partial View of the Model 1072
35.2.4 Transparency and Visualization 1076
35.3 Properties: The Abstraction Between Context and Method 1077
35.3.1 Properties of Explanations from Interpretable Machine Learning 1077
35.3.2 Properties of Explanations from Cognitive Science 1080
35.4 Evaluation of Interpretable Machine Learning Models 1081
35.4.1 Computational Evaluation: Does the Method have Desired Properties? 1082
35.4.2 User Study-based Evaluation: Does the Method Help a User Perform a Task? 1086
35.5 Discussion: How to Think about Interpretable Machine Learning 1090
VI Decision making 1097
36 Multi-step decision problems 1099
36.1 Introduction 1099
36.2 Decision (influence) diagrams 1099
36.2.1 Example: oil wildcatter 1099
36.2.2 Information arcs 1100
36.2.3 Value of information 1101
36.2.4 Computing the optimal policy 1102
36.3 A/B testing 1102
36.3.1 A Bayesian approach 1103
36.3.2 Example 1106
36.4 Contextual bandits 1107
36.4.1 Types of bandit 1107
36.4.2 Applications 1109
36.4.3 Exploration-exploitation tradeoff 1109
36.4.4 The optimal solution 1109
36.4.5 Upper confidence bounds (UCB) 1111
36.4.6 Thompson sampling 1113
36.4.7 Regret 1114
36.5 Markov decision problems 1115
36.5.1 Basics 1116
36.5.2 Partially observed MDPs 1117
36.5.3 Episodes and returns 1118
36.5.4 Value functions 1118
36.5.5 Optimal value functions and policies 1119
36.6 Planning in an MDP 1120
36.6.1 Value iteration 1121
36.6.2 Policy iteration 1122
36.6.3 Linear programming 1123
37 Reinforcement learning 1125
37.1 Introduction 1125
37.1.1 Overview of methods 1125
37.1.2 Value based methods 1126
37.1.3 Policy search methods 1126
37.1.4 Model-based RL 1127
37.1.5 Exploration-exploitation tradeoff 1127
37.2 Value-based RL 1129
37.2.1 Monte Carlo RL 1130
37.2.2 Temporal difference (TD) learning 1130
37.2.3 TD learning with eligibility traces 1131
37.2.4 SARSA: on-policy TD control 1132
37.2.5 Q-learning: off-policy TD control 1132
37.2.6 Deep Q-network (DQN) 1135
37.3 Policy-based RL 1136
37.3.1 The policy gradient theorem 1136
37.3.2 REINFORCE 1137
37.3.3 Actor-critic methods 1137
37.3.4 Bound optimization methods 1140
37.3.5 Deterministic policy gradient methods 1141
37.3.6 Gradient-free methods 1142
37.4 Model-based RL 1142
37.4.1 Model predictive control (MPC) 1143
37.4.2 Combining model-based and model-free 1144
37.4.3 MBRL using Gaussian processes 1145
37.4.4 MBRL using DNNs 1146
37.4.5 MBRL using latent-variable models 1147
37.4.6 Robustness to model errors 1149
37.5 Off-policy learning 1149
37.5.1 Basic techniques 1150
37.5.2 The curse of horizon 1153
37.5.3 The deadly triad 1154
37.6 Control as inference 1156
37.6.1 Maximum entropy reinforcement learning 1156
37.6.2 Active inference 1158
37.6.3 Other approaches 1159
37.6.4 Imitation learning 1160
38 Causality 1163
38.1 Introduction 1163
38.1.1 Why is causality different than other forms of ML? 1163
38.2 Causal Formalism 1165
38.2.1 Structural Causal Models 1165
38.2.2 Causal DAGs 1167
38.2.3 Identification 1169
38.2.4 Counterfactuals and the Causal Hierarchy 1170
38.3 Randomized Control Trials 1172
38.4 Confounder Adjustment 1173
38.4.1 Causal Estimand, Statistical Estimand, and Identification 1173
38.4.2 ATE Estimation with Observed Confounders 1176
38.4.3 Uncertainty Quantification 1181
38.4.4 Matching 1182
38.4.5 Practical Considerations and Procedures 1183
38.4.6 Summary and Practical Advice 1186
38.5 Instrumental Variable Strategies 1187
38.5.1 Additive Unobserved Confounding 1189
38.5.2 Instrument Monotonicity and Local Average Treatment Effect 1190
38.5.3 Two Stage Least Squares 1194
38.6 Difference in Differences 1194
38.6.1 Estimation 1198
38.7 Credibility Checks 1198
38.7.1 Placebo Checks 1199
38.7.2 Sensitivity Analysis to Unobserved Confounding 1199
38.8 The Do Calculus 1207
38.8.1 The three rules 1207
38.8.2 Revisiting Backdoor Adjustment 1208
38.8.3 Frontdoor Adjustment 1209
38.9 Further Reading 1211
Bibliography 1225
Preface
This book is a sequel to [Mur22]. That book mostly focused on techniques for learning functions
f : X → Y, where X is the set of possible inputs (typically X = R^D), Y represents the set of labels
(for classification problems) or real values (for regression problems), and f is some nonlinear model,
such as a deep neural network. We assumed that the training data consists of iid labeled samples,
D = {(x_n, y_n) ∼ p(x, y) : n = 1 : N}, and that the test distribution is the same as the training
distribution.
Judea Pearl, a well known AI researcher, has called this kind of ML a form of “glorified curve
fitting” (quoted in [Har18]). In this book, we expand the scope of ML to encompass more challenging
problems. For example, we consider learning and testing under multiple different distributions; we
consider generation of high dimensional outputs, such as images, text and graphs; we discuss methods
for discovering “insights” about data, based on latent variable models; and we discuss how to use
probabilistic models and inference for decision making and control tasks.
We assume the reader has some prior exposure to (supervised) ML and other relevant mathematical
topics (e.g., probability, statistics, linear algebra, optimization). This background material is covered
in the prequel to this book, [Mur22], although the current book is self-contained, and does not require
that you read [Mur22] first.
Since this book covers so many topics, it was not possible to fit all of the content into these pages.
Some of the extra material can be found in an online supplement at probml.ai. This site also contains
Python code for reproducing most of the figures in the book. In addition, because of the broad scope
of the book, about one third of the chapters are written, or co-written, with guest authors, who are
domain experts (see the full list of contributors below). I hope that by collecting all this material
in one place, new ML researchers will find it easier to “see the wood for the trees”, so that we can
collectively advance the field using a larger step size.
Contributing authors
I would like to thank the following people who co-wrote various parts of this book:
• Roy Frostig (Google), who wrote Section 6.2 (Automatic differentiation).
• Andrew Wilson (NYU), who helped write Chapter 17 (Bayesian neural networks) and Chapter 18 (Gaussian processes).
• George Papamakarios (Deepmind) and Balaji Lakshminarayanan (Google), who wrote Chapter 24 (Normalizing Flows).
• Yang Song (Stanford) and Durk Kingma (Google), who helped write Chapter 25 (Energy-based models).
• Mihaela Rosca (Deepmind / UCL), Shakir Mohamed (Deepmind) and Balaji Lakshminarayanan (Google), who wrote Chapter 27 (Generative adversarial networks).
• Vinayak Rao (Purdue), who wrote Chapter 33 (Non-parametric Bayesian models).
• Ben Poole (Google) and Simon Kornblith (Google), who wrote Chapter 34 (Representation learning (Unfinished)).
• Been Kim (Google) and Finale Doshi-Velez (Harvard), who wrote Chapter 35 (Interpretability).
• Lihong Li (Amazon, work done at Google), who helped write Section 36.4 (Contextual bandits) and Chapter 37 (Reinforcement learning).
• Victor Veitch (Google / U. Chicago) and Alexander D’Amour (Google), who wrote Chapter 38 (Causality).

Acknowledgements

I would like to thank the following people who helped in various ways:

• Mahmoud Soliman, for help with many issues related to latex, python, GCP, TPUs, etc.
• Aleyna Kara, for help with many figures and software examples.
• Gerardo Duran-Martin, for help with many software examples.
• Participants in the Google Summer of Code for 2021 and 2022.
• Numerous people who proofread parts of the book, including: Kay Brodersen (Section 19.3.5), Krzysztof Choromanski (Chapter 6, Chapter 11), John Fearns (an earlier version of the whole book), Lehman Pavasovic Krunoslav (Chapter 10, Chapter 12), Amir Globerson (Chapter 4), Ravin Kumar (Chapter 2), Scott Linderman (Section 31.4.1), Simon Prince (??), Hal Varian (Section 19.3.5), Chris Williams (Part IV), Raymond Yeh (Chapter 6, Chapter 14, Chapter 16, Chapter 19, Chapter 20, Chapter 23), et al.
• Numerous people who contributed figures (acknowledged in the captions).
• Numerous people who made their open source code available (acknowledged in the online code).

About the cover

The cover illustrates a variational autoencoder (Chapter 22) being used to map from a 2d Gaussian to image space.

Kevin Patrick Murphy
Palo Alto, California
March 2022.
1 Introduction
This book focuses on probabilistic modeling and inference, for solving four main kinds of task:
prediction (e.g., classification and regression), generation (e.g., image and text generation), discovery
(e.g., clustering, dimensionality reduction and state estimation), and control (decision making).
In more detail, in Part I, we cover some of the fundamentals of the field, filling in some details
that were missing from the prequel to this book, [Mur22].
In Part II, we discuss algorithms for Bayesian inference in various kinds of probabilistic model.
These different algorithms make different tradeoffs between speed, accuracy, generality, etc. The
resulting methods can be applied to many different problems.
In Part III, we discuss prediction methods, for fitting conditional distributions of the form p(y|x),
where x ∈ X is some input (often high dimensional), and y ∈ Y is the desired output (often low
dimensional). In this part of the book, we assume there is one right answer that we want to predict,
although we may be uncertain about it.
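As a minimal sketch of what "fitting a conditional distribution" produces in practice, the snippet below evaluates p(y|x) for a binary classifier; the logistic (sigmoid) form and the hand-picked weights are illustrative assumptions only, not a method prescribed in this book. The point is simply that a probabilistic predictor returns a full distribution over labels rather than a single answer.

```python
import numpy as np

def predict_proba(x, w, b):
    """Return p(y=1|x) for a binary logistic model with parameters (w, b)."""
    logit = np.dot(w, x) + b               # linear score
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid maps the score to a probability

# Hypothetical, hand-picked parameters, purely for illustration.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([0.3, 0.8])

p1 = predict_proba(x, w, b)
print({"p(y=1|x)": p1, "p(y=0|x)": 1.0 - p1})  # a distribution over labels, not a point prediction
```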
In Part IV, we discuss generative models, which are models of the form p(y) or p(y|x) where
there may be multiple valid outputs. For example, given a text prompt x, we may want to generate
a diverse set of images y that “match” the caption. Evaluating such models is harder than in the
prediction setting, since it is less clear what the desired output should be.
In Part V, we turn our attention to the analysis of data, using methods that aim to uncover some
meaningful underlying state or patterns. Our focus is mostly on latent variable models, which are
joint models of the form p(z, y) = p(z)p(y|z), where z is the hidden state and y are the observations;
the goal is to infer z from y. (The model can optionally be conditioned on fixed inputs, to get
p(z, y|x).) We also consider methods for trying to discover patterns learned implicitly by predictive
models of the form p(y|x), without relying on an explicit generative model.
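To make the inference step concrete, here is a small sketch of Bayes' rule, p(z|y) ∝ p(z) p(y|z), for a toy latent variable model with a three-valued hidden state; the prior, the emission means, and the Gaussian observation model are all invented purely for illustration, and any realistic discovery method would of course use a far richer model.

```python
import numpy as np

# Toy latent variable model: z in {0, 1, 2} is hidden, y is a scalar observation.
prior = np.array([0.5, 0.3, 0.2])          # p(z), made-up values
means = np.array([-2.0, 0.0, 3.0])         # hypothetical per-state emission means

def likelihood(y, means, sigma=1.0):
    """p(y|z): Gaussian observation model, one mean per hidden state."""
    return np.exp(-0.5 * ((y - means) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

y_obs = 0.5
unnorm = prior * likelihood(y_obs, means)  # p(z) p(y|z) for each value of z
posterior = unnorm / unnorm.sum()          # p(z|y), normalized over the hidden states
print(posterior)                           # inferred beliefs about the hidden state
```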
Finally, in Part VI, we discuss how to use probabilistic models and inference to make decisions
under uncertainty. This naturally leads into the very important topic of causality, with which we
close the book.
Part I
Fundamentals
faßten sie Mut und liefen auf die andere Seite der Straße, als sie
merkten, daß Kalenik nicht allzu flink auf den Beinen war.
„Da ist dein Haus!“ schrien sie ihm beim Fortgehen zu und zeigten
auf ein Haus, das größer war als die übrigen und dem Dorfamtmann
gehörte. Kalenik wankte gehorsam auf jene Seite hinüber und
begann dann von neuem auf den Amtmann zu schimpfen.
Wer aber ist denn eigentlich dieser Amtmann, der so böses
Gerede über sich erregt? O, dieser Amtmann ist eine wichtige Person
auf dem Lande. Bis Kalenik das Ende seines Weges erreicht hat,
werden wir wohl Zeit finden, einiges über ihn zu sagen. Alle im Dorfe
greifen bei seinem Anblick an die Mütze, und selbst die allerjüngsten
Mädchen sagen ihm Guten Tag. Wer im Dorfe möchte nicht
Amtmann sein? Dem Amtmann ist der Weg zu allen Tabaksdosen
offen, und der kräftige Bauer steht die ganze Zeit über ehrfurchtsvoll
mit der Mütze in der Hand da, solange jener seine dicken und
groben Finger in seine Tabatiere von Bast steckt. Im Gemeinderat
hat der Amtmann immer die Oberhand, obgleich seine Macht noch
durch andere Stimmen beschränkt wird, und er heißt fast ganz nach
seiner Willkür jeden, der ihm gerade paßt, den Weg ebnen oder
einen Graben anlegen. Der Amtmann ist mürrisch, von plumpem
Äußeren und redet nicht gern. Vor langer, langer Zeit, als noch die
große Zarin Katharina seligen Angedenkens einmal in die Krim reiste,
war er auserwählt worden, an ihrem Gefolge teilzunehmen; er
bekleidete dieses Amt ganze zwei Tage und hatte sogar die Ehre, auf
dem Bock neben dem Kutscher der Zarin sitzen zu dürfen. Seit
dieser Zeit weiß der Amtmann würdevoll und sinnend den Kopf zu
senken, seinen langen und an der Spitze etwas krausen Schnurrbart
zu glätten und drohende Falkenblicke um sich zu werfen. Seit dieser
Zeit weiß er auch, worüber man immer mit ihm sprechen mag, stets
die Rede darauf zu bringen, daß er die Zarin begleitet und auf dem
Kutschbock des kaiserlichen Wagens gesessen habe. Der Amtmann
beliebt nur manchmal, sich taub zu stellen, besonders wenn er
etwas hören muß, was er nicht gerne hört. Er liebt es nicht, Staat zu
machen, trägt stets einen Kittel aus schwarzem Haustuch, umgürtet
sich mit einem bunten Wollgürtel, und noch nie hat ihn jemand in
einem anderen Kostüm gesehen, ausgenommen vielleicht in der Zeit,
wo die Zarin in die Krim reiste, und wo er einen blauen Kosakenrock,
den Schupan, trug. Aber auf diese Zeit kann sich wohl kaum jemand
aus dem ganzen Dorfe besinnen; den Schupan aber bewahrt er in
einem Kasten unter Schloß und Riegel. Der Amtmann ist Witwer;
aber in seinem hause lebt eine Schwägerin, die ihm Mittag- und
Abendbrot kocht, die Bänke scheuert, die Stube weißt, ihm
Hemdentuch webt und sein ganzes Hauswesen leitet. Im Dorfe heißt
es, sie sei nicht richtig mit ihm verwandt; aber wir haben ja schon
gesehen, daß der Amtmann viele Feinde hat, die ihn gern ein wenig
verleumden. Übrigens hat vielleicht der Umstand Anlaß dazu
gegeben, daß es der Schwägerin immer mißfiel, wenn der Amtmann
aufs Feld ging, wo die Schnitterinnen an der Arbeit waren, oder zu
einem Kosaken, der ein junges Töchterchen hatte. Der Amtmann ist
einäugig, dafür aber ist sein einsames Auge ein Schelm und kann
schon von fern ein hübsches Bauernmädchen erkennen. Doch bevor
er sein Auge auf ein niedliches Gesichtchen richtet, sieht er sich erst
sorgfältig um, ob ihm die Schwägerin auch nicht zuschaut.
Nun haben wir schon fast alles Notwendige vom Amtmann erzählt,
und der betrunkene Kalenik hat noch nicht die Hälfte des Weges
zurückgelegt. Noch lange traktierte er den Amtmann mit den
ausgesuchtesten Worten, die ihm auf seine faule und
zusammenhangloses Zeug lallende Zunge kamen.
III.
Ein unerwarteter Nebenbuhler
Die Verschwörung
IV.
Die Burschen bummeln
V.
Die Ertrunkene
Ein Fenster tat sich leise auf, und dasselbe Köpfchen, dessen
Spiegelbild er im Teiche gesehen hatte, guckte heraus und lauschte
aufmerksam dem Sang. Ihre schweren Lider waren halb über die