Modeling Mindsets
Christoph Molnar
Copyright © 2022 Christoph Molnar, Munich, Germany
christophmolnar.com
All rights reserved. For more information about permission
to reproduce selections from this book, write to
[email protected].
2022, First Edition
ISBN 9798358729339 (paperback)
Christoph Molnar
c/o MUCBOOK, Heidi Seibold
Elsenheimerstraße 48
80687 München, Germany
Preface
1 Introduction
    Models Have Variables And Learnable Functions
    Models Are Embedded In Mindsets
    A Mindset Is A Perspective Of The World
    Mindsets Are Cultural
    Mindsets Are Archetypes
    Mindsets Covered In This Book
13 Acknowledgments
References
Preface
Is a linear regression model statistical modeling or machine
learning? This is a question that I have heard more than
once, but it starts from the false premise that models some-
how belong to one approach or the other. Rather, we should
ask: What’s the mindset of the modeler?
Differences in mindsets can be subtle yet substantial. It takes
years to absorb a modeling mindset because, usually, the fo-
cus is on the methods and math rather than the mindset.
Sometimes modelers aren’t even aware of their modeling lim-
itations due to their mindset. I studied statistics, which gave
me an almost pure frequentist mindset. Because of that nar-
row lens on modeling, I often hit a wall in my modeling
projects, ranging from a lack of causal reasoning to crappy
predictive models. I couldn’t solve these problems by dou-
bling down on my current mindset. Instead, I made the most
progress when I embraced new modeling mindsets. Modeling
Mindsets is the book I wish I had read earlier to save myself
time and headaches.
The inspiration for Modeling Mindsets was the article “Sta-
tistical Modeling: The Two Cultures” by the statistician Leo
Breiman. His article was the first to show me that modeling
is not only about math and methods but also about the
mindset through which you see the world. Leo Breiman's
paper is over 21 years old and hasn't lost any of its relevance
even today. Modeling Mindsets builds on the same principle
of making mindsets explicit and embraces the multiplicity of
mindsets. I kept Modeling Mindsets short and (mostly)
math-free, and poured over a decade of modeling experience into
it. My hope is that this book will still be relevant 21 years
from now and a great investment for you.
Who This Book Is For
This book is for everyone who builds models from data: data
scientists, statisticians, machine learners, and quantitative
researchers.
To get the most out of this book:
• You should already have experience with modeling and
working with data.
• You should feel comfortable with at least one of the
mindsets in this book.
Don’t read this book if:
• You are completely new to working with data and models.
• You cling to the mindset you already know and aren’t
open to other mindsets.
You will get the most out of Modeling Mindsets if you keep
an open mind. You have to challenge the rigid assumptions
of the mindset that feels natural to you.
1 Introduction
Every day, people use data for prediction, automation, sci-
ence, and making decisions. For example:
• Identifying patients prone to drug side effects.
• Finding out how climate change affects bee hives.
• Predicting which products will be out-of-stock.
Each data point has details to contribute:
• Patient with ID 124 got acne.
• Bee colony 27 shrank during the drought in 2018.
• On that one Tuesday, the flour was sold out.
Data alone aren’t enough to solve the above tasks. Data are
noisy and high-dimensional – most of the information will be
irrelevant to the task. No matter how long a human analyzes
the data, it’s difficult to gain insights just by sheer human
willpower.
Therefore, people rely on models to interpret and use data. A
model simplifies and represents an aspect of the world. Mod-
els learned from data – the focus of this book – glue together
the raw data and the world. With a model, the modeler can
make predictions, learn about the world, test theories, make
decisions and communicate results to others.
Figure 3.1: The line shows how the relative frequency of tails
changes as the number of coin tosses increases
from 1 to 100 (left to right).
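The behavior the caption describes can be reproduced with a short simulation. This sketch is mine, not from the book; the random seed and the sample size of 100 are arbitrary choices:

```python
import random

random.seed(1)  # arbitrary seed, only so the run is reproducible

# Simulate 100 fair coin tosses; True stands for tails.
tosses = [random.random() < 0.5 for _ in range(100)]

# Relative frequency of tails after each toss.
relative_freq = [sum(tosses[:n + 1]) / (n + 1) for n in range(len(tosses))]

# Early values jump around; later values settle near the true
# probability of 0.5 as the number of tosses grows.
print(relative_freq[0], relative_freq[9], relative_freq[99])
```

This long-run behavior of relative frequencies is exactly what the frequentist interpretation of probability rests on.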
I’ll spare you the math behind this example.2 The coin flip
scenario is not completely artificial: Imagine a domain expert
asks a statistician to perform an analysis with 1000 data
points. For the frequentist, it matters whether the domain
expert had a rule to stop after 1000 data points or whether
the expert would continue collecting data depending on the
outcome of the analysis.
1 The likelihood principle is violated when data is used to inform the
prior. For example, empirical priors violate the likelihood principle.
2 https://en.wikipedia.org/wiki/Likelihood_principle
Likelihoodism – Likelihood As Evidence
Likelihood ratio tests are typically used for nested models, where the
variables of the first model are a subset of variables of the
second model. Bayesians may use the Bayes factor, the ratio
of the marginal likelihoods of two models; for simple point
hypotheses, the Bayes factor reduces to the likelihood ratio.
But the likelihood ratio has a big problem. While a likeli-
hoodist may interpret it as evidential favoring, the mindset
lacks a mechanism to provide a definitive answer. Frequen-
tists can use tests to decide on one hypothesis or the other.
But testing is not allowed for the likelihoodist; otherwise,
they would violate the likelihood principle. The likelihood
ratio is only an estimate and has some estimation error. How
does the modeler know that a ratio of three is not just noise
and that with a fresh data sample, it would be 0.9? At-
tempts to find a cut-off lead to frequentist ideas (which are
rejected). Likelihoodism also rejects the idea of Bayesian
model interpretation. Bayesian inference estimates posterior
probabilities, which may be interpreted as updated beliefs
about the parameter values and conclusions may be drawn
from them. Likelihoodism doesn’t provide the same type of
guidance.
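The likelihood ratio itself is simple to compute. A minimal sketch for two point hypotheses about a coin's heads probability; the data (7 heads in 10 tosses) and the two hypotheses are invented for illustration:

```python
from math import comb

heads, n = 7, 10  # invented data: 7 heads in 10 tosses

def binom_likelihood(p):
    """Binomial likelihood of the observed data for heads probability p."""
    return comb(n, heads) * p**heads * (1 - p) ** (n - heads)

# Evidential favoring of p = 0.7 over p = 0.5, as a likelihood ratio.
lr = binom_likelihood(0.7) / binom_likelihood(0.5)
print(round(lr, 2))  # 2.28: the data favor the biased coin
```

Nothing in the likelihoodist mindset, however, tells the modeler whether 2.28 is strong enough evidence to act on; fixing a cut-off would smuggle in frequentist ideas.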
The effect of the drug can be divided into direct and indi-
rect effects. The total effect is the direct effect of the drug
plus any indirect effects, in this case via reducing inflammation.
We were interested in the total effect of the
drug, but the way we had specified the model meant that the
coefficient for the drug had to be interpreted as a direct effect. The
indirect effect was fully reflected by the coefficient for the
inflammation level (measured after the start of medication).
We therefore removed the inflammation variable1 and did a
1 Doesn’t inflammation confound the relation between decision to treat
and ossification? Or is it a mediator? It depends on the time of the
measurement. In the faulty model, inflammation was considered
after treatment started, making it a mediator of the drug. Later,
we adjusted the model to include inflammation measured before
treatment start, making it a confounder.
Causality Is Often Ignored
How does the modeler place the arrows? There are several
guides here:
• Good old common sense, such as knowing that park visi-
tors can’t control the sun.
• Domain expertise.
• Direction of time: We know that the elevator comes be-
cause you pressed the button, not the other way around.
• Causal structure learning: To some extent, we can learn
causal structures automatically. But this usually leads to
multiple, ambiguous DAGs.
• …
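Once the arrows are placed, a DAG can be written down directly. The variables in this sketch (sunshine, park visits, ice cream sales) extend the park example above and are my own illustration, not the book's:

```python
# A causal DAG encoded as a mapping from each variable to its direct causes.
dag = {
    "sunshine": [],                                 # no modeled cause
    "park_visits": ["sunshine"],                    # sun draws visitors, not vice versa
    "ice_cream_sales": ["sunshine", "park_visits"],
}

def causes(node, dag):
    """All direct and indirect causes (ancestors) of a node."""
    found = set()
    for parent in dag[node]:
        found.add(parent)
        found |= causes(parent, dag)
    return found

print(causes("ice_cream_sales", dag))  # sunshine and park_visits
```

Writing the graph down makes the modeler's causal assumptions explicit and criticizable, which is the point of the mindset.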
Causal modelers say that you can estimate causal effects even
from observational data. I am willing to reveal their secret:
Causal modelers use high-energy particle accelerators to cre-
ate black holes. Each black hole contains a parallel universe
in which they can study a different what-if scenario. All
jokes aside, there is no magic ingredient for estimating causal
effects. Causal modeling is mainly a recipe for translating
causal models into statistical estimators in the following four
steps (Pearl 2009):
– Models that solve tasks well are not necessarily the best
for insights. Methods of interpretable machine learning4 can
alleviate this problem.
– Often requires a lot of data and is computationally inten-
sive.
4 https://christophm.github.io/interpretable-ml-book/
8 Supervised Learning –
Predict New Data
Premise: The world is best approached by making predic-
tions.
Consequence: Learn predictive models from data and eval-
uate them with unseen data.
Everything Pete touched became a prediction problem. He
could no longer play sports, only bet on the outcome. When
he read books, he skipped to the end to check if his predictions
were correct. Pete’s social life suffered as he was always
trying to predict the next words in conversations. His life
began to improve only when he discovered supervised learning
as an outlet for his obsessions.
It was 2012, and I had just developed a statistical model
to predict whether a patient would develop type II diabetes
and now it was time to win some money with that model. I
uploaded the file with the model predictions to the competi-
tion’s website, which evaluated the results. Fingers crossed.
But then the disappointment: the predictions of my model
were terrible.
At the time, I was a master’s student in statistics. I mod-
eled diabetes risk using a generalized additive model, a model
often used in statistical modeling. More importantly, I ap-
proached the problem with a frequentist modeling mindset.
I thought a lot about the data-generating process, manually
added and removed variables, and evaluated the model based
on goodness of fit on the training data. But this approach
completely failed me in this prediction competition. I was
confused: statistical models can be used for prediction, many
models are used in both mindsets, and machine learning is
Figure 8.2: For honest evaluation, data are usually split into
training, validation and test data.
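One common way to produce such a split; the 60/20/20 proportions below are a convention of mine, not a rule from the book:

```python
import random

random.seed(0)  # arbitrary seed for a reproducible split

examples = list(range(100))  # stand-in for 100 labeled examples
random.shuffle(examples)     # shuffle so the split is random, not ordered

n = len(examples)
train = examples[:int(0.6 * n)]                   # fit the model here
validation = examples[int(0.6 * n):int(0.8 * n)]  # tune hyperparameters here
test = examples[int(0.8 * n):]                    # touch only once, for the final estimate

print(len(train), len(validation), len(test))  # 60 20 20
```

Keeping the test data untouched until the very end is what makes the final performance estimate honest.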
Tip Top Travel, a travel agency I just made up, offers a wide
range of trips, from all-inclusive holidays in Spain to hiking
in Norway and weekend city trips to Paris. They have a huge
database with the booking history of their customers. And
yet, they know surprisingly little about the general patterns
in their data: Are there certain types of customers? For
example, do customers who travel to Norway also like Swe-
den? Our imaginary travel company’s big data is a dream
for unsupervised learners. They might start with a cluster
Unsupervised Learning – Find Hidden Patterns
I love grocery shopping. Many people hate it, but only be-
cause they are unaware of its magnificence: Supermarkets
are incredible places that let you live like royals. Exotic
fruits, spices from all over the world and products that take
months or even years to make, like soy sauce, wine and cheese
– all just a visit to the supermarket away. Sorry, I digress;
let's talk about association rule learning, which is usually
introduced with shopping baskets as an example.
The baskets might look like this: {yeast, flour, salt}, {beer,
chips}, {sandwich, chips}, or {onions, tomatoes, cheese, beer,
chips}. The goal of association rule learning is to identify pat-
terns: Do people who buy flour often buy yeast? Association
rule mining is again a case of describing the data distribution.
An association rule might be {beer} ⇒ {chips} and would
mean that people who buy beer frequently buy chips.
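Two standard measures quantify such a rule: support (how often the items co-occur) and confidence (how often the right-hand side accompanies the left). A minimal sketch over the baskets above; the helper functions are my own:

```python
baskets = [
    {"yeast", "flour", "salt"},
    {"beer", "chips"},
    {"sandwich", "chips"},
    {"onions", "tomatoes", "cheese", "beer", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)

print(support({"beer", "chips"}, baskets))       # 0.5: half the baskets
print(confidence({"beer"}, {"chips"}, baskets))  # 1.0: every beer buyer took chips
```

Algorithms like Apriori do essentially this, but prune the exponential space of itemsets efficiently.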
Dimensionality Reduction
winning a boat race, but the score (aka reward) was defined
by collecting objects on the race course. The agent learned
not to finish the race. Instead, it learned to go around in
circles and collect the same reappearing objects over and
over again. The greedy agent scored, on average, 20% more
points than humans.3
For me, this was the most confusing aspect of getting into re-
inforcement learning: What function(s) are actually learned?
In supervised learning, it’s clear: the model is a function that
maps the features to the label. But there is more than one
function to (possibly) learn in reinforcement learning:
• Learn a complete model of the environment. The agent
can query such a model to simulate the best action at each
time.
• Learn the state value function. If an agent has access to
a value function, it can choose actions that maximize the
value.
• Learn the action-value function, which takes as input not
just the state but state and action.
• Learn the policy of the agent directly.
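The action-value option, for example, can be sketched as tabular Q-learning. The environment below is a trivial chain invented for illustration, and the hyperparameters are common defaults, not canonical values:

```python
import random

random.seed(0)  # arbitrary seed for reproducibility

# Invented environment: states 0..3 in a row; moving right past state 3
# ends the episode with reward 1, every other step gives reward 0.
N_STATES, ACTIONS = 4, (+1, -1)

def step(state, action):
    nxt = max(0, state + action)
    if nxt >= N_STATES:
        return None, 1.0  # terminal state reached
    return nxt, 0.0

q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for _ in range(500):  # episodes
    state = 0
    while state is not None:
        if random.random() < epsilon:  # explore occasionally
            action = random.choice(ACTIONS)
        else:                          # otherwise exploit current estimates
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        nxt, reward = step(state, action)
        best_next = 0.0 if nxt is None else max(q[(nxt, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = nxt

# After training, moving right has the higher action-value in every state.
print(all(q[(s, +1)] > q[(s, -1)] for s in range(N_STATES)))
```

Model-based and policy-gradient methods replace the table with learned functions, but the interaction loop of acting, observing reward, and updating stays the same.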
These approaches are not mutually exclusive but can be combined.
There are also many different ways to learn these
functions, and the choice depends on the dimensionality
of the environment and the action space. For example,
Tic-tac-toe and Go are pretty similar games. I imagine all
the Go players reading this book will object, but hear me
out. Two players face off in a fierce turn-based strategy game.
The battlefield is a rectangular board, and each player places
markers on the grid. The winner is determined by the con-
stellations of the markers.
Despite some similarities, the games differ in their complexity
for both humans and reinforcement learning. Tic-tac-toe
3 https://openai.com/blog/faulty-reward-functions/
4 https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
5 https://www.alexirpan.com/2018/02/14/rl-hard.html
11 Deep Learning – Learn End-To-End Networks
If you’re looking for a book that will help you make ma-
chine learning models explainable, look no further than In-
terpretable Machine Learning. This book provides a clear
and concise explanation of the methods and mathematics
behind the most important approaches to making machine
learning models interpretable.
Available from leanpub.com and amazon.com.
References
Athey S, Imbens G (2016) Recursive partitioning for het-
erogeneous causal effects. Proceedings of the National
Academy of Sciences 113:7353–7360
Blei DM, Kucukelbir A, McAuliffe JD (2017) Variational in-
ference: A review for statisticians. Journal of the Ameri-
can Statistical Association 112:859–877
Breiman L (2001) Statistical modeling: The two cultures
(with comments and a rejoinder by the author). Statistical
Science 16:199–231
Caruana R, Lou Y, Gehrke J, et al (2015) Intelligible models
for healthcare: Predicting pneumonia risk and hospital
30-day readmission. In: Proceedings of the 21st ACM
SIGKDD international conference on knowledge discov-
ery and data mining. pp 1721–1730
Chen T, Wang Z, Li G, Lin L (2018) Recurrent attentional
reinforcement learning for multi-label image recognition.
In: Proceedings of the AAAI conference on artificial in-
telligence
Cireşan D, Meier U, Masci J, Schmidhuber J (2011) A com-
mittee of neural networks for traffic sign classification.
In: The 2011 international joint conference on neural net-
works. IEEE, pp 1918–1921
Freiesleben T, König G, Molnar C, Tejero-Cantero A (2022)
Scientific inference with interpretable machine learning:
Analyzing models to learn about real-world phenomena.
arXiv preprint arXiv:2206.05487
Grinsztajn L, Oyallon E, Varoquaux G (2022) Why do tree-
based models still outperform deep learning on tabular
data? arXiv preprint arXiv:2207.08815
Gu S, Holly E, Lillicrap T, Levine S (2017) Deep re-
inforcement learning for robotic manipulation with
asynchronous off-policy updates. In: 2017 IEEE inter-
swiss clinical quality management cohort. Annals of the
rheumatic diseases 77:63–69
Olah C, Mordvintsev A, Schubert L (2017) Feature visual-
ization. Distill. https://doi.org/10.23915/distill.00007
Pearl J (2009) Causal inference in statistics: An overview.
Statistics Surveys 3:96–146
Pearl J (2012) The do-calculus revisited. arXiv preprint
arXiv:1210.4852
Perezgonzalez JD (2015) Fisher, Neyman-Pearson or NHST?
A tutorial for teaching data testing. Frontiers in Psychology 223
Royall R (2017) Statistical evidence: A likelihood
paradigm. Routledge
Russakovsky O, Deng J, Su H, et al (2015) ImageNet Large
Scale Visual Recognition Challenge. International Jour-
nal of Computer Vision (IJCV) 115:211–252.
https://doi.org/10.1007/s11263-015-0816-y
Silver D, Huang A, Maddison CJ, et al (2016) Mastering the
game of go with deep neural networks and tree search.
Nature 529:484–489
Tiao GC, Box GE (1973) Some comments on “Bayes”
estimators. The American Statistician 27:12–14
Wang Z, Sarcar S, Liu J, et al (2018) Outline objects
using deep reinforcement learning. arXiv preprint
arXiv:180404603
Weisberg M (2007) Who is a modeler? The British Journal
for the Philosophy of Science 58:207–233
Weisberg M (2012) Simulation and similarity: Using models
to understand the world. Oxford University Press
Yang R, Berger JO (1996) A catalog of noninformative priors.
Institute of Statistics; Decision Sciences, Duke University
Durham, NC, USA