Bayesian Statistics and Inductive Inference
$$p(\Theta_k \mid y) = \frac{p(\Theta_k)\, p(y \mid \Theta_k)}{\sum_{k'} p(\Theta_{k'})\, p(y \mid \Theta_{k'})} = \frac{p(\Theta_k) \int p(y, \theta_k \mid \Theta_k)\, d\theta_k}{\sum_{k'} p(\Theta_{k'}) \int p(y, \theta_{k'} \mid \Theta_{k'})\, d\theta_{k'}}.$$
These posterior probabilities of hypotheses can be used for Bayesian model selection or
Bayesian model averaging (topics to which we return below). Scientific progress, in this view, consists of gathering data, perhaps through well-designed experiments intended to distinguish among interesting competing scientific hypotheses (cf. Atkinson & Donev, 1992; Paninski, 2005), and then plotting the $p(\Theta_k \mid y)$ over time and watching the system learn (as sketched in Figure 1).
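To make the arithmetic concrete, here is a minimal numerical sketch of this updating scheme; the hypotheses, prior and data are invented for illustration, and the computation is nothing more than Bayes' rule over a discrete hypothesis space.

```python
import numpy as np

# Three candidate hypotheses for a coin's probability of heads,
# with a uniform prior p(Theta_k) over them (all numbers invented).
theta = np.array([0.3, 0.5, 0.7])
prior = np.array([1/3, 1/3, 1/3])

# Observed data y: 7 heads in 10 flips.
heads, n = 7, 10
likelihood = theta**heads * (1 - theta)**(n - heads)   # p(y | Theta_k)

# Bayes' rule: posterior proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)   # mass shifts toward the hypothesis theta = 0.7
```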
In our view, the account just sketched is crucially mistaken. The data-analysis process, Bayesian or otherwise, does not end with calculating parameter estimates or posterior distributions. Rather, the model can then be checked, by comparing the implications of the fitted model to the empirical evidence. One asks questions such as whether simulations from the fitted model resemble the original data, whether the fitted model is consistent with other data not used in the fitting of the model, and whether variables that the model says are noise ('error terms') in fact display readily detectable patterns. Discrepancies between the model and data can be used to learn about the ways in which the model is inadequate for the scientific purposes at hand, and thus to motivate expansions and changes to the model (Section 4).
2.1. Example: Estimating voting patterns in subsets of the population
We demonstrate the hypothetico-deductive Bayesian modelling process with an example from our recent applied research (Gelman, Lee, & Ghitza, 2010). In recent years, American political scientists have been increasingly interested in the connections between politics and income inequality (see, for example, McCarty, Poole, & Rosenthal, 2006). In our own contribution to this literature, we estimated the attitudes of rich, middle-income and poor voters in each of the 50 states (Gelman, Park, Shor, Bafumi, & Cortina, 2008). As we described in our paper on the topic (Gelman, Shor, Park, & Bafumi, 2008), we began by fitting a varying-intercept logistic regression: modelling votes (coded as y = 1 for votes for the Republican presidential candidate and y = 0 for Democratic votes) given family income (coded in five categories from low to high as x = -2, -1, 0, 1, 2), using a model of the form $\Pr(y = 1) = \mathrm{logit}^{-1}(a_s + bx)$, where s indexes state of residence (the model is fitted to survey responses) and the varying intercepts $a_s$ correspond to some states being more Republican-leaning than others. Thus, for example, $a_s$ has a positive value in a conservative state such as Utah and a negative value in a liberal state such as California. The coefficient b represents the slope of income, and its positive value indicates that, within any state, richer voters are more likely to vote Republican.
It turned out that this varying-intercept model did not fit our data, as we learned by making graphs of the average survey response and fitted curves for the different income categories within each state. We had to expand to a varying-intercept, varying-slope model, $\Pr(y = 1) = \mathrm{logit}^{-1}(a_s + b_s x)$, in which the slopes $b_s$ varied by state as well. This model expansion led to a corresponding expansion in our understanding: we learned that the gap in voting between rich and poor is much greater in poor states such as Mississippi than in rich states such as Connecticut. Thus, the polarization between rich and poor voters varied in important ways geographically.
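As a rough illustration of the two specifications, the following sketch fits both models to simulated data by maximum likelihood; the states, sample sizes and coefficients are invented, and the partial pooling of the real hierarchical analysis is omitted.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit   # the inverse-logit function

rng = np.random.default_rng(0)

# Simulated stand-in for the survey data (all numbers invented):
# 8 "states", income coded -2..2, true slopes differing by state.
S, n = 8, 4000
state = rng.integers(0, S, size=n)
x = rng.integers(-2, 3, size=n).astype(float)
a_true = rng.normal(0.0, 0.8, size=S)    # state intercepts
b_true = rng.normal(0.3, 0.2, size=S)    # state-specific income slopes
y = rng.binomial(1, expit(a_true[state] + b_true[state] * x))

def neg_loglik(params, varying_slope):
    a = params[:S]
    b = params[S:] if varying_slope else np.repeat(params[S], S)
    p = np.clip(expit(a[state] + b[state] * x), 1e-10, 1 - 1e-10)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Model 1: varying intercepts, common slope.  Model 2: slopes vary too.
fit1 = minimize(neg_loglik, np.zeros(S + 1), args=(False,))
fit2 = minimize(neg_loglik, np.zeros(2 * S), args=(True,))

# Check in the spirit of the text: compare, state by state, the observed
# Republican share in each income category with each model's fitted value
# (the real check was graphical; printed numbers stand in for the plot).
for s in range(S):
    for xv in (-2.0, 0.0, 2.0):
        m = (state == s) & (x == xv)
        p1 = expit(fit1.x[s] + fit1.x[S] * xv)
        p2 = expit(fit2.x[s] + fit2.x[S + s] * xv)
        print(f"state {s}, x={xv:+.0f}: data {y[m].mean():.2f}  "
              f"common-slope {p1:.2f}  varying-slope {p2:.2f}")
```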
We found this not through any process of Bayesian induction but rather through model checking. Bayesian inference was crucial, not for computing the posterior probability that any particular model was true (we never actually did that) but in allowing us to fit rich enough models in the first place that we could study state-to-state variation, incorporating in our analysis relatively small states such as Mississippi and Connecticut that did not have large samples in our survey.⁵
⁵ Gelman and Hill (2006) review the hierarchical models that allow such partial pooling.
Life continues, though, and so do our statistical struggles. After the 2008 election, we wanted to make similar plots, but this time we found that even our more complicated logistic regression model did not fit the data, especially when we wanted to expand our model to estimate voting patterns for different ethnic groups. Comparison of data to fit led to further model expansions, leading to our current specification, which uses a varying-intercept, varying-slope logistic regression as a baseline but allows for non-linear and even non-monotonic patterns on top of that. Figure 2 shows some of our inferences in map form, while Figure 3 shows one of our diagnostics of data and model fit.
The power of Bayesian inference here is deductive: given the data and some model assumptions, it allows us to make lots of inferences, many of which can be checked and potentially falsified. For example, look at New York state (in the bottom row of Figure 3): apparently, voters in the second income category supported John McCain much more than did voters in neighbouring income groups in that state. This pattern is theoretically possible but it arouses suspicion. A careful look at the graph reveals that this is a pattern in the raw data which was moderated but not entirely smoothed away by our model. The natural next step would be to examine data from other surveys. We may have exhausted what we can learn from this particular data set, and Bayesian inference was a key tool in allowing us to do so.
[Figure 2 appears here: maps answering 'Did you vote for McCain in 2008?', with one row per ethnic group (all voters, white, black, Hispanic, other races) and one column per income category (<$20,000; $20-40,000; $40-75,000; $75-150,000; >$150,000), coloured on a 0%-100% scale; when a category represents less than 1% of the voters in a state, the state is left blank.]
Figure 2. [Colour online]. States won by John McCain and Barack Obama among different ethnic and income categories, based on a model fitted to survey data. States coloured deep red and deep blue indicate clear McCain and Obama wins; pink and light blue represent wins by narrower margins, with a continuous range of shades going to grey for states estimated at exactly 50-50. The estimates shown here represent the culmination of months of effort, in which we fitted increasingly complex models, at each stage checking the fit by comparing to data and then modifying aspects of the prior distribution and likelihood as appropriate. This figure is reproduced from Ghitza and Gelman (2012) with the permission of the authors.
3. The Bayesian principal-agent problem
Before returning to discussions of induction and falsification, we briefly discuss some findings relating to Bayesian inference under misspecified models. The key idea is that Bayesian inference for model selection (statements about the posterior probabilities of candidate models) does not solve the problem of learning from data about problems with existing models.
[Figure 3 appears here: '2008 election: McCain share of the two-party vote in each income category within each state, among all voters (black) and non-Hispanic whites (green)'; one panel per state, with McCain's vote share (0-100%) plotted against income category (poor, mid, rich).]
Figure 3. [Colour online]. Some of the data and fitted model used to make the maps shown in Figure 2. Dots are weighted averages from pooled June-November Pew surveys; error bars show ±1 standard error bounds. Curves are estimated using multilevel models and have a standard error of about 3% at each point. States are ordered in decreasing order of McCain vote (Alaska, Hawaii and the District of Columbia excluded). We fitted a series of models to these data; only this last model fitted the data well enough that we were satisfied. In working with larger data sets and studying more complex questions, we encounter increasing opportunities to check model fit and thus to falsify in a way that is helpful for our research goals. This figure is reproduced from Ghitza and Gelman (2012) with the permission of the authors.
In economics, the principal-agent problem refers to the difficulty of designing contracts or institutions which ensure that one selfish actor, the 'agent', will act in the interests of another, the 'principal', who cannot monitor and sanction their agent without cost or error. The problem is one of aligning incentives, so that the agent serves itself by serving the principal (Eggertsson, 1990). There is, as it were, a Bayesian principal-agent problem as well. The Bayesian agent is the methodological fiction (now often approximated in software) of a creature with a prior distribution over a well-defined hypothesis space $\Theta$, a likelihood function $p(y \mid \theta)$, and conditioning as its sole mechanism of learning and belief revision. The principal is the actual statistician or scientist.
The ideas of the Bayesian agent are much more precise than those of the actual scientist; in particular, the Bayesian (in this formulation, with which we disagree) is certain that some $\theta$ is the exact and complete truth, whereas the scientist is not.⁶ At some point in history, a statistician may well write down a model which he or she believes contains all the systematic influences among properly defined variables for the system of interest, with correct functional forms and distributions of noise terms. This could happen, but we have never seen it, and in social science we have never seen anything that comes close. If nothing else, our own experience suggests that however many different specifications we thought of, there are always others which did not occur to us, but cannot be immediately dismissed a priori, if only because they can be seen as alternative approximations to the ones we made. Yet the Bayesian agent is required to start with a prior distribution whose support covers all alternatives that could be considered.⁷
⁶ In claiming that the Bayesian is certain that some $\theta$ is the exact and complete truth, we are not claiming that actual Bayesian scientists or statisticians hold this view. Rather, we are saying that this is implied by the philosophy we are attacking here. All statisticians, Bayesian and otherwise, recognize that the philosophical position which ignores this approximation is problematic.
⁷ It is also not at all clear that Savage and other founders of Bayesian decision theory ever thought that this principle should apply outside of the 'small worlds' of artificially simplified and stylized problems; see Binmore (2007). But as scientists we care about the real, large world.
This is not a small technical problem to be handled by adding a special value of $\theta$, say $\theta_0$, standing for 'none of the above'; even if one could calculate $p(y \mid \theta_0)$, the likelihood of the data under this catch-all hypothesis, this in general would not lead to just a small correction to the posterior, but rather would have substantial effects (Fitelson & Thomason, 2008). Fundamentally, the Bayesian agent is limited by the fact that its beliefs always remain within the support of its prior. For the Bayesian agent the truth must, so to speak, be always already partially believed before it can become known. This point is less than clear in the usual treatments of Bayesian convergence, and so worth some attention.
Classical results (Doob, 1949; Schervish, 1995; Lijoi, Prünster, & Walker, 2007) show that the Bayesian agent's posterior distribution will concentrate on the truth with prior probability 1, provided some regularity conditions are met. Without diving into the measure-theoretic technicalities, the conditions amount to: (i) the truth is in the support of the prior; and (ii) the information set is rich enough that some consistent estimator exists (see the discussion in Schervish, 1995, Section 7.4.1). When the truth is not in the support of the prior, the Bayesian agent still thinks that Doob's theorem applies and assigns zero prior probability to the set of data under which it does not converge on the truth.
The convergence behaviour of Bayesian updating with a misspecified model can be understood as follows (Berk, 1966, 1970; Kleijn & van der Vaart, 2006; Shalizi, 2009). If the data are actually coming from a distribution q, then the Kullback-Leibler divergence rate, or relative entropy rate, of the parameter value $\theta$ is
$$d(\theta) = \lim_{n \to \infty} -\frac{1}{n}\, E\!\left[\log \frac{p(y_1, y_2, \ldots, y_n \mid \theta)}{q(y_1, y_2, \ldots, y_n)}\right],$$
with the expectation being taken under q. (For details on when the limit exists, see Gray, 1990.) Then, under not-too-onerous regularity conditions, one can show (Shalizi, 2009) that
$$p(\theta \mid y_1, y_2, \ldots, y_n) \approx p(\theta)\, \exp\{-n\,(d(\theta) - d^*)\},$$
with $d^*$ the essential infimum of $d(\theta)$ over the support of the prior; that is,
$$-\frac{1}{n} \log p(\theta \mid y_1, y_2, \ldots, y_n) \to d(\theta) - d^*,$$
q-almost surely. Thus the posterior distribution comes to concentrate on the parts of the prior support which have the lowest values of $d(\theta)$ and the highest expected likelihood.⁸ There is a geometric sense in which these parts of the parameter space are closest approaches to the truth within the support of the prior (Kass & Vos, 1997), but they may or may not be close to the truth in the sense of giving accurate values for parameters of scientific interest. They may not even be the parameter values which give the best predictions (Grünwald & Langford, 2007; Müller, 2011). In fact, one cannot even guarantee that the posterior will concentrate on a single value of $\theta$ at all; if $d(\theta)$ has multiple global minima, the posterior can alternate between (concentrating around) them forever (Berk, 1966).
To sum up, what Bayesian updating does when the model is false (i.e., in reality, always) is to try to concentrate the posterior on the best attainable approximations to the distribution of the data, 'best' being measured by likelihood. But depending on how the model is misspecified, and how $\theta$ represents the parameters of scientific interest, the impact of misspecification on inferring the latter can range from non-existent to profound.⁹ Since we are quite sure our models are wrong, we need to check whether the misspecification is so bad that inferences regarding the scientific parameters are in trouble. It is by this non-Bayesian checking of Bayesian models that we solve our principal-agent problem.
⁸ More precisely, regions of $\Theta$ where $d(\theta) > d^*$ have exponentially small posterior probability.
4. Model checking
In our view, a key part of Bayesian data analysis is model checking, which is where there are links to falsificationism. In particular, we emphasize the role of posterior predictive checks, creating simulations and comparing the simulated and actual data. Again, we are following the lead of Box (1980), Rubin (1984) and others, also mixing in a bit of Tukey (1977) in that we generally focus on visual comparisons (Gelman et al., 2004, Chapter 6).
Here is how this works. A Bayesian model gives us a joint distribution for the parameters $\theta$ and the observables y. This implies a marginal distribution for the data,
$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$$
If we have observed data y, the prior distribution $p(\theta)$ shifts to the posterior distribution $p(\theta \mid y)$, and so a different distribution of observables,
$$p(y^{\mathrm{rep}} \mid y) = \int p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta,$$
where we use $y^{\mathrm{rep}}$ to denote hypothetical alternative or future data, a replicated data set of the same size and shape as the original y, generated under the assumption that the fitted model is true.
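Here is a minimal sketch of such a posterior predictive check, using a deliberately simple conjugate normal model and invented data; the checks described above are typically graphical rather than reduced to a single number, but the logic is the same.

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed data (simulated stand-in): skewed, but we fit a normal
# model y ~ N(mu, 1) with prior mu ~ N(0, 10^2), which is conjugate.
y = rng.exponential(scale=1.0, size=100)
n, s2, prior_var = len(y), 1.0, 100.0

post_var = 1.0 / (1.0 / prior_var + n / s2)
post_mean = post_var * (y.sum() / s2)

# Posterior predictive check: simulate replicated data sets and compare
# a test statistic T (here the sample maximum) to its observed value.
T_obs = y.max()
T_rep = np.empty(1000)
for i in range(1000):
    mu = rng.normal(post_mean, np.sqrt(post_var))
    y_rep = rng.normal(mu, np.sqrt(s2), size=n)
    T_rep[i] = y_rep.max()

p_value = (T_rep >= T_obs).mean()
print(f"T(y) = {T_obs:.2f}, posterior predictive p-value = {p_value:.3f}")
# A p-value near 0 or 1 flags a systematic discrepancy: here the normal
# model typically cannot reproduce the long right tail of the data.
```

Replacing the posterior draw of mu with a draw from the prior would give the corresponding check of the prior predictive distribution discussed next.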
The prior distribution, too, implies a distribution for replicated data,
$$p(y^{\mathrm{rep}}) = \int p(y^{\mathrm{rep}} \mid \theta)\, p(\theta)\, d\theta,$$
and so can be checked using the type of tests we have described and illustrated above.¹³ When it makes sense to think of further data coming from the same source, as in certain kinds of sampling, time-series or longitudinal problems, the prior also has implications for these new data (through the same formula as above, changing the interpretation of $y^{\mathrm{rep}}$), and so becomes testable in a second way. There is thus a connection between the model-checking aspect of Bayesian data analysis and 'prequentialism' (Dawid & Vovk, 1999; Grünwald, 2007), but exploring that would take us too far afield.
One advantage of recognizing that the prior distribution is a testable part of a Bayesian model is that it clarifies the role of the prior in inference, and where it comes from. To reiterate, it is hard to claim that the prior distributions used in applied work represent statisticians' states of knowledge and belief before examining their data, if only because most statisticians do not believe their models are true, so their prior degree of belief in all of $\Theta$ is not 1 but 0. The prior distribution is more like a regularization device, akin to the penalization terms added to the sum of squared errors when doing ridge regression and the lasso (Hastie, Tibshirani, & Friedman, 2009) or spline smoothing (Wahba, 1990). All such devices exploit a sensitivity-stability trade-off: they stabilize estimates and predictions by making fitted models less sensitive to certain details of the data. Using an informative prior distribution (even if only weakly informative, as in Gelman, Jakulin, Pittau, & Su, 2008) makes our estimates less sensitive to the data than, say, maximum-likelihood estimates would be, which can be a net gain.
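The analogy is exact in the simplest case: under independent Gaussian priors on regression coefficients, the posterior mode is the ridge-regression estimate. A sketch with simulated data (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated regression with more predictors than is comfortable.
n, p = 50, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(size=n)

# MAP estimate under beta_j ~ N(0, tau^2), noise variance sigma^2:
# identical to ridge regression with penalty lam = sigma^2 / tau^2.
sigma2, tau2 = 1.0, 0.25
lam = sigma2 / tau2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_mle = np.linalg.solve(X.T @ X, X.T @ y)

print("error of MLE:", np.linalg.norm(beta_mle - beta_true))
print("error of MAP:", np.linalg.norm(beta_map - beta_true))
# The informative prior stabilizes the estimate, making it less
# sensitive to noise in the data at the cost of some shrinkage bias.
```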
Because we see the prior distribution as a testable part of the Bayesian model, we do not need to follow Jaynes in trying to devise a unique, objectively correct prior distribution for each situation, an enterprise with an uninspiring track record (Kass & Wasserman, 1996), even leaving aside doubts about Jaynes's specific proposal (Seidenfeld, 1979, 1987; Csiszár, 1995; Uffink, 1995, 1996). To put it even more succinctly, the model, for a Bayesian, is the combination of the prior distribution and the likelihood, each of which represents some compromise among scientific knowledge, mathematical convenience and computational tractability.
This gives us a lot of flexibility in modelling. We do not have to worry about making our prior distributions match our subjective beliefs, still less about our model containing all possible truths. Instead we make some assumptions, state them clearly, see what they imply, and check the implications. This applies just as much to the prior distribution as it does to the parts of the model showing up in the likelihood function.
¹² A similar point was expressed by the sociologist and social historian Charles Tilly (2004, p. 597), writing from a very different disciplinary background: 'Most social researchers learn more from being wrong than from being right, provided they then recognize that they were wrong, see why they were wrong, and go on to improve their arguments. Post hoc interpretation of data minimizes the opportunity to recognize contradictions between arguments and evidence, while adoption of formalisms increases that opportunity. Formalisms blindly followed induce blindness. Intelligently adopted, however, they improve vision. Being obliged to spell out the argument, check its logical implications, and examine whether the evidence conforms to the argument promotes both visual acuity and intellectual responsibility.'
¹³ Admittedly, the prior only has observable implications in conjunction with the likelihood, but for a Bayesian the reverse is also true.
4.1. Testing to reveal problems with a model
We are not interested in falsifying our model for its own sake: among other things, having built it ourselves, we know all the shortcuts taken in doing so, and can already be morally certain it is false. With enough data, we can certainly detect departures from the model; this is why, for example, statistical folklore says that the chi-squared statistic is ultimately a measure of sample size (cf. Lindsay & Liu, 2009). As writers such as Giere (1988, Chapter 3) explain, the hypothesis linking mathematical models to empirical data is not that the data-generating process is exactly isomorphic to the model, but that the data source resembles the model closely enough, in the respects which matter to us, that reasoning based on the model will be reliable. Such reliability does not require complete fidelity to the model.
The goal of model checking, then, is not to demonstrate the foregone conclusion of falsity as such, but rather to learn how, in particular, this model fails (Gelman, 2003).¹⁴ When we find such particular failures, they tell us how the model must be improved; when severe tests cannot find them, the inferences we draw about those aspects of the real world from our fitted model become more credible. In designing a good test for model checking, we are interested in finding particular errors which, if present, would mess up particular inferences, and devise a test statistic which is sensitive to this sort of misspecification. This process of examining, and ruling out, possible errors or misspecifications is of course very much in line with the 'eliminative induction' advocated by Kitcher (1993, Chapter 7).¹⁵
All models will have errors of approximation. Statistical models, however, typically assert that their errors of approximation will be unsystematic and patternless 'noise' (Spanos, 2007). Testing this can be valuable in revising the model. In looking at the red-state/blue-state example, for instance, we concluded that the varying slopes mattered not just because of the magnitudes of departures from the equal-slope assumption, but also because there was a pattern, with richer states tending to have shallower slopes.
What we are advocating, then, is what Cox and Hinkley (1974) call 'pure significance testing', in which certain of the model's implications are compared directly to the data, rather than entering into a contest with some alternative model. This is, we think, more in line with what actually happens in science, where it can become clear that even large-scale theories are in serious trouble and cannot be accepted unmodified even if there is no alternative available yet. A classical instance is the status of Newtonian physics at the beginning of the twentieth century, where there were enough difficulties (the Michelson-Morley effect, anomalies in the orbit of Mercury, the photoelectric effect, the black-body paradox, the stability of charged matter, etc.) that it was clear, even before relativity and quantum mechanics, that something would have to give. Even today, our current best theories of fundamental physics, namely general relativity and the standard model of particle physics, an instance of quantum field theory, are universally agreed to be ultimately wrong, not least because they are mutually incompatible, and recognizing this does not require that one have a replacement theory (Weinberg, 1999).
¹⁴ In addition, no model is safe from criticism, even if it passes all possible checks. Modern Bayesian models in particular are full of unobserved, latent and unobservable variables, and non-identifiability is an inevitable concern in assessing such models; see, for example, Gustafson (2005), Vansteelandt, Goetghebeur, Kenward, & Molenberghs (2006) and Greenland (2009). We find it somewhat dubious to claim that simply putting a prior distribution on non-identified quantities somehow resolves the problem; the bounds or partial identification approach, pioneered by Manski (2007), seems to be in better accord with scientific norms of explicitly acknowledging uncertainty (see also Vansteelandt et al., 2006; Greenland, 2009).
¹⁵ Despite the name, this is, as Kitcher notes, actually a deductive argument.
4.2. Connection to non-Bayesian model checking
Many of these ideas about model checking are not unique to Bayesian data analysis and are used more or less explicitly by many communities of practitioners working with complex stochastic models (Ripley, 1988; Guttorp, 1995). The reasoning is the same: a model is a story of how the data could have been generated; the fitted model should therefore be able to generate synthetic data that look like the real data; failures to do so in important ways indicate faults in the model.
For instance, simulation-based model checking is now widely accepted for assessing the goodness of fit of statistical models of social networks (Hunter, Goodreau, & Handcock, 2008). That community was pushed toward predictive model checking by the observation that many model specifications were 'degenerate' in various ways (Handcock, 2003). For example, under certain exponential-family network models, the maximum likelihood estimate gave a distribution over networks which was bimodal, with both modes being very different from observed networks, but located so that the expected value of the sufficient statistics matched observations. It was thus clear that these specifications could not be right even before more adequate specifications were developed (Snijders, Pattison, Robins, & Handcock, 2006).
At a more philosophical level, the idea that a central task of statistical analysis is the search for specific, consequential errors has been forcefully advocated by Mayo (1996), Mayo and Cox (2006), Mayo and Spanos (2004), and Mayo and Spanos (2006). Mayo has placed a special emphasis on the idea of severe testing, a model being severely tested if it passes a probe which had a high probability of detecting an error if it is present. (The exact definition of a test's severity is related to, but not quite, that of its power; see Mayo, 1996, or Mayo & Spanos, 2006, for extensive discussions.) Something like this is implicit in discussions about the relative merits of particular posterior predictive checks (which can also be framed in a non-Bayesian manner as graphical hypothesis tests based on the parametric bootstrap).
Our contribution here is to connect this hypothetico-deductive philosophy to Bayesian data analysis, going beyond the evaluation of Bayesian methods based on their frequency properties (as recommended by Rubin, 1984, and Wasserman, 2006, among others) to emphasize the learning that comes from the discovery of systematic differences between model and data. At the very least, we hope this paper will motivate philosophers of hypothetico-deductive inference to take a more serious look at Bayesian data analysis (as distinct from Bayesian theory) and, conversely, motivate philosophically minded Bayesian statisticians to consider alternatives to the inductive interpretation of Bayesian learning.
4.3. Why not just compare the posterior probabilities of different models?
As mentioned above, the standard view of scientific learning in the Bayesian community is, roughly, that the posterior odds of the models under consideration are compared, given the current data.¹⁶ When Bayesian data analysis is understood as simply 'getting the posterior distribution', it is held that 'pure significance tests have no role to play in the Bayesian framework' (Schervish, 1995, p. 218). The dismissal rests on the idea that the prior distribution can accurately reflect our actual knowledge and beliefs.¹⁷ At the risk of boring the reader by repetition, there is just no way we can ever have any hope of making $\Theta$ include all the probability distributions which might be correct, let alone getting $p(\theta \mid y)$ if we did so, so this is deeply unhelpful advice. The main point where we disagree with many Bayesians is that we do not see Bayesian methods as generally useful for giving the posterior probability that a model is true, or the probability for preferring model A over model B, or whatever.¹⁸ Beyond the philosophical difficulties, there are technical problems with methods that purport to determine the posterior probability of models, most notably that in models with continuous parameters, aspects of the model that have essentially no effect on posterior inferences within a model can have huge effects on the comparison of posterior probability among models.¹⁹ Bayesian inference is good for deductive inference within a model; we prefer to evaluate a model by comparing it to data.
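A standard toy calculation shows the technical problem; the two models and all numbers below are invented. Model A fixes theta = 0 exactly; model B gives theta a N(0, tau^2) prior. The Bayes factor swings wildly with the essentially arbitrary choice of tau, while the within-model posterior barely moves.

```python
import numpy as np
from scipy.stats import norm

# Data summary: sample mean 0.1 from n = 100 observations with unit
# variance (numbers invented for illustration).
n, ybar = 100, 0.1
se = 1 / np.sqrt(n)

for tau in (1.0, 10.0, 100.0):
    # Marginal likelihood of model B: ybar ~ N(0, tau^2 + se^2).
    mB = norm.pdf(ybar, loc=0, scale=np.sqrt(tau**2 + se**2))
    # Marginal likelihood of model A: ybar ~ N(0, se^2).
    mA = norm.pdf(ybar, loc=0, scale=se)
    # Within model B, the posterior for theta is essentially N(ybar, se^2)
    # once tau >> se, no matter how large tau is.
    post_var = 1 / (1 / tau**2 + n)
    post_mean = post_var * n * ybar
    print(f"tau={tau:6.1f}: Bayes factor B/A = {mB/mA:8.4f}, "
          f"posterior for theta = N({post_mean:.3f}, {np.sqrt(post_var):.3f}^2)")
# The Bayes factor changes by orders of magnitude as tau grows, while
# the within-model inference for theta is essentially unchanged.
```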
In rehashing the well-known problems with computing Bayesian posterior probabilities of models, we are not claiming that classical p-values are the answer. As is indicated by the literature on the Jeffreys-Lindley paradox (notably Berger & Sellke, 1987), p-values can drastically overstate the evidence against a null hypothesis. From our model-building Bayesian perspective, the purpose of p-values (and model checking more generally) is not to reject a null hypothesis but rather to explore aspects of a model's misfit to data.
In practice, if we are in a setting where model A or model B might be true, we are inclined not to do model selection among these specified options, or even to perform model averaging over them (perhaps with a statement such as 'we assign 40% of our posterior belief to A and 60% to B') but rather to do continuous model expansion by forming a larger model that includes both A and B as special cases. For example, Merrill (1994) used electoral and survey data from Norway and Sweden to compare two models of political ideology and voting: the proximity model (in which you prefer the political party that is closest to you in some space of issues and ideology) and the directional model (in which you like the parties that are in the same direction as you in issue space, but with a stronger preference to parties further from the centre). Rather than using the data to pick one model or the other, we would prefer to think of a model in which voters consider both proximity and directionality in forming their preferences (Gelman, 1994).
¹⁶ Some would prefer to compare the modification of those odds called the Bayes factor (Kass & Raftery, 1995). Everything we have to say about posterior odds carries over to Bayes factors with few changes.
¹⁷ As Schervish (1995) continues: 'If the [parameter space $\Theta$] describes all of the probability distributions one is willing to entertain, then one cannot reject [$\Theta$] without rejecting probability models altogether. If one is willing to entertain models not in [$\Theta$], then one needs to take them into account by enlarging [$\Theta$], and computing the posterior distribution over the enlarged space.'
¹⁸ There is a vast literature on Bayes factors, model comparison, model averaging, and the evaluation of posterior probabilities of models, and although we believe most of this work to be philosophically unsound (to the extent that it is designed to be a direct vehicle for scientific learning), we recognize that these can be useful techniques. Like all statistical methods, Bayesian and otherwise, these methods are summaries of available information that can be important data-analytic tools. Even if none of a class of models is plausible as truth, and even if we are not comfortable accepting posterior model probabilities as degrees of belief in alternative models, these probabilities can still be useful as tools for prediction and for understanding structure in data, as long as these probabilities are not taken too seriously. See Raftery (1995) for a discussion of the value of posterior model probabilities in social science research, Gelman and Rubin (1995) for a discussion of their limitations, and Claeskens and Hjort (2008) for a general review of model selection. (Some of the work on model-selection tests in econometrics (e.g., Vuong, 1989; Rivers & Vuong, 2002) is exempt from our strictures, as it tries to find which model is closest to the data-generating process, while allowing that all of the models may be misspecified, but it would take us too far afield to discuss this work in detail.)
¹⁹ This problem has been called the Jeffreys-Lindley paradox and is the subject of a large literature. Unfortunately (from our perspective) the problem has usually been studied by Bayesians with an eye on 'solving' it, that is, coming up with reasonable definitions that allow the computation of non-degenerate posterior probabilities for continuously parameterized models, but we think that this is really a problem without a solution; see Gelman et al. (2004, Section 6.7).
In the social sciences, it is rare for there to be an underlying theory that can provide meaningful constraints on the functional form of the expected relationships among variables, let alone the distribution of noise terms.²⁰ Taken to its limit, then, the idea of continuous model expansion counsels social scientists pretty much to give up using parametric statistical models in favour of non-parametric, infinite-dimensional models, advice which the ongoing rapid development of Bayesian non-parametrics (Ghosh & Ramamoorthi, 2003; Hjort, Holmes, Müller, & Walker, 2010) makes increasingly practical. While we are certainly sympathetic to this, and believe a greater use of non-parametric models in empirical research is desirable on its own merits (cf. Li & Racine, 2007), it is worth sounding a few notes of caution.
A technical, but important, point concerns the representation of uncertainty in Bayesian non-parametrics. In finite-dimensional problems, the use of the posterior distribution to represent uncertainty is in part supported by the Bernstein-von Mises phenomenon, which ensures that large-sample credible regions are also confidence regions. This simply fails in infinite-dimensional situations (Cox, 1993; Freedman, 1999), so that a naive use of the posterior distribution becomes unwise.²¹ (Since we regard the prior and posterior distributions as regularization devices, this is not especially troublesome for us.) Relatedly, the prior distribution in a Bayesian non-parametric model is a stochastic process, always chosen for tractability (Ghosh & Ramamoorthi, 2003; Hjort et al., 2010), and any pretense of representing an actual inquirer's beliefs abandoned.
Most fundamentally, switching to non-parametric models does not really resolve the issue of needing to make approximations and check their adequacy. All non-parametric models themselves embody assumptions such as conditional independence which are hard to defend except as approximations. Expanding our prior distribution to embrace all the models which are actually compatible with our prior knowledge would result in a mess we simply could not work with, nor interpret if we could. This being the case, we feel there is no contradiction between our preference for continuous model expansion and our use of adequately checked parametric models.²²
²⁰ See Manski (2007) for a critique of the econometric practice of making modelling assumptions (such as linearity) with no support in economic theory, simply to get identifiability.
²¹ Even in parametric problems, Müller (2011) shows that misspecification can lead credible intervals to have sub-optimal coverage properties, which, however, can be fixed by a modification to their usual calculation.
²² A different perspective, common in econometrics (e.g., Wooldridge, 2002) and machine learning (e.g., Hastie et al., 2009), reduces the importance of models of the data source, either by using robust procedures that are valid under departures from modelling assumptions, or by focusing on prediction and external validation. We recognize the theoretical and practical appeal of both these approaches, which can be relevant to Bayesian inference. (For example, Rubin, 1978, justifies random assignment from a Bayesian perspective as a tool for obtaining robust inferences.) But it is not possible to work with all possible models when considering fully probabilistic methods, that is, Bayesian inferences that are summarized by joint posterior distributions rather than point estimates or predictions. This difficulty may well be a motivation for shifting the foundations of statistics away from probability and scientific inference, and towards developing a technology of robust prediction. (Even when prediction is the only goal, with limited data bias-variance considerations can make even misspecified parametric models superior to non-parametric models.) This, however, goes far beyond the scope of the present paper, which aims merely to explicate the implicit philosophy guiding current practice.
[Figure 4 appears here: a sketch plotting the 'after' measurement, y, against the 'before' measurement, x, with parallel fitted lines for the treatment and control groups.]
Figure 4. Sketch of the usual statistical model for before-after data. The difference between the fitted lines for the two groups is the estimated treatment effect. The default is to regress the 'after' measurement on the treatment indicator and the 'before' measurement, thus implicitly assuming parallel lines.
4.4. Example: Estimating the effects of legislative redistricting
We use one of our own experiences (Gelman & King, 1994) to illustrate scientific progress through model rejection. We began by fitting a model comparing treated and control units (state legislatures, immediately after redistricting or not), following the usual practice of assuming a constant treatment effect (parallel regression lines in before-after plots, with the treatment effect representing the difference between the lines). In this example, the outcome was a measure of partisan bias, with positive values representing state legislatures where the Democrats were overrepresented (compared to how we estimated the Republicans would have done with comparable vote shares) and negative values in states where the Republicans were overrepresented. A positive treatment effect here would correspond to a redrawing of the district lines that favoured the Democrats.
Figure 4 shows the default model that we (and others) typically use for estimating causal effects in before-after data. We fitted such a no-interaction model in our example too, but then we made some graphs and realized that the model did not fit the data. The line for the control units actually had a much steeper slope than the treated units. We fitted a new model, and it had a completely different story about what the treatment effects meant.
The graph for the new model with interactions is shown in Figure 5. The largest effect of the treatment was not to benefit the Democrats or Republicans (i.e., to change the intercept in the regression, shifting the fitted line up or down) but rather to change the slope of the line, to reduce partisan bias.
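The contrast between the two specifications can be sketched in a few lines of code; the data below are simulated to mimic the qualitative pattern just described, not taken from the actual study.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated stand-in for the redistricting data (all numbers invented):
# treated units get pulled towards zero, so their "after" values depend
# only weakly on the "before" partisan bias x.
n = 200
treated = rng.integers(0, 2, size=n)
x = rng.normal(0, 0.05, size=n)
y = np.where(treated, 0.2 * x, 0.9 * x) + rng.normal(0, 0.01, size=n)

def ols(design, y):
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Default model: parallel lines (constant treatment effect).
X1 = np.column_stack([np.ones(n), treated, x])
b1 = ols(X1, y)

# Expanded model: an interaction lets the slope differ by group.
X2 = np.column_stack([np.ones(n), treated, x, treated * x])
b2 = ols(X2, y)

print("parallel-lines model, common slope:", round(b1[2], 2))
print("interaction model, control slope  :", round(b2[2], 2))
print("interaction model, treated slope  :", round(b2[2] + b2[3], 2))
# The no-interaction model averages two very different slopes; plotting
# data and fitted lines (as in the text) is what reveals the misfit.
```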
Rejecting the constant-treatment-effect model and replacing it with the interaction model was, in retrospect, a crucial step in this research project. This pattern of higher before-after correlation in the control group than in the treated group is quite general (Gelman, 2004), but at the time we did this study we discovered it only through the graph of model and data, which falsified the original model and motivated us to think of something better. In our experience, falsification is about plots and predictive checks, not about Bayes factors or posterior probabilities of candidate models.
[Figure 5 appears here: estimated partisan bias, adjusted for state, plotted against estimated partisan bias in the previous election, both on a scale from -0.05 ('favours Republicans') to 0.05 ('favours Democrats'), with separate symbols for no redistricting and for Democratic, bipartisan and Republican redistricting, and fitted lines for control and treated cases.]
Figure 5. Effect of redistricting on partisan bias. Each symbol represents a state election year, with dots indicating controls (years with no redistricting) and the other symbols corresponding to different types of redistricting. As indicated by the fitted lines, the 'before' value is much more predictive of the 'after' value for the control cases than for the treated (redistricting) cases. The dominant effect of the treatment is to bring the expected value of partisan bias towards zero, and this effect would not be discovered with the usual approach (pictured in Figure 4), which is to fit a model assuming parallel regression lines for treated and control cases. This figure is re-drawn after Gelman and King (1994), with the permission of the authors.
The relevance of this example to the philosophy of statistics is that we began by fitting the usual regression model with no interactions. Only after visually checking the model fit, and thus falsifying it in a useful way without the specification of any alternative, did we take the crucial next step of including an interaction, which changed the whole direction of our research. The shift was induced by a falsification, a bit of deductive inference from the data and the earlier version of our model. In this case the falsification came from a graph rather than a p-value, which in one way is just a technical issue, but in a larger sense is important in that the graph revealed not just a lack of fit but also a sense of the direction of the misfit, a refutation that sent us usefully in a direction of substantive model improvement.
5. The question of induction
As we mentioned at the beginning, Bayesian inference is often held to be inductive in a way that classical statistics (following the Fisher or Neyman-Pearson traditions) is not. We need to address this, as we are arguing that all these forms of statistical reasoning are better seen as hypothetico-deductive.
The common core of various conceptions of induction is some form of inference from particulars to the general; in the statistical context, presumably, inference from the observations y to parameters $\theta$ describing the data-generating process. But if that were all that was meant, then not only is frequentist statistics a theory of inductive inference (Mayo & Cox, 2006), but the whole range of guess-and-test behaviors engaged in by animals (Holland, Holyoak, Nisbett, & Thagard, 1986), including those formalized in the hypothetico-deductive method, are also inductive. Even the unpromising-sounding procedure, 'pick a model at random and keep it until its accumulated error gets too big, then pick another model completely at random', would qualify (and could work surprisingly well under some circumstances; cf. Ashby, 1960; Foster & Young, 2003). So would utterly irrational procedures ('pick a new random $\theta$ when the sum of the least significant digits in y is 13'). Clearly something more is required, or at least implied, by those claiming that Bayesian updating is inductive.
One possibility for that 'something more' is to generalize the truth-preserving property of valid deductive inferences: just as valid deductions from true premises are themselves true, good inductions from true observations should also be true, at least in the limit of increasing evidence.²³ This, however, is just the requirement that our inferential procedures be consistent. As discussed above, using Bayes's rule is not sufficient to ensure consistency, nor is it necessary. In fact, every proof of Bayesian consistency known to us either posits that there is a consistent non-Bayesian procedure for the same problem, or makes other assumptions which entail the existence of such a procedure. In any case, theorems establishing consistency of statistical procedures make deductively valid guarantees about these procedures (they are theorems, after all) but do so on the basis of probabilistic assumptions linking future events to past data.
It is also no good to say that what makes Bayesian updating inductive is its conformity to some axiomatization of rationality. If one accepts the Kolmogorov axioms for probability, and the Savage axioms (or something like them) for decision-making,²⁴ then updating by conditioning follows, and a prior belief state $p(\theta)$ plus data y deductively entail that the new belief state is $p(\theta \mid y)$. In any case, lots of learning procedures can be axiomatized (all those which can be implemented algorithmically, to start with). To pick this system, we would need to know that it produces good results (cf. Manski, 2011), and this returns us to previous problems. To know that this axiom system leads us to approach the truth rather than become convinced of falsehoods, for instance, is just the question of consistency again.
Karl Popper, the leading advocate of hypothetico-deductivism in the last century, denied that induction was even possible; his attitude is well paraphrased by Greenland (1998) as: 'we never use any argument based on observed repetition of instances that does not also involve a hypothesis that predicts both those repetitions and the unobserved instances of interest'. This is a recent instantiation of a tradition of anti-inductive arguments that goes back to Hume, but also beyond him to al Ghazali (1100/1997) in the Middle Ages, and indeed to the ancient Sceptics (Kolakowski, 1968). As forcefully put by Stove (1982, 1986), many apparent arguments against this view of induction can be viewed as statements of abstract premises linking both the observed data and unobserved instances; various versions of 'the uniformity of nature' thesis have been popular, sometimes resolved into a set of more detailed postulates, as in Russell (1948, Part VI, Chapter 9), though Stove rather maliciously crafted a parallel argument for the existence of angels, or something very much like them.²⁵ As Norton (2003) argues, these highly abstract premises are both dubious and often superfluous for supporting the sort of actual inferences scientists make; inductions are supported not by their matching certain formal criteria (as deductions are), but rather by material facts. To generalize about the melting point of bismuth (to use one of Norton's examples) requires very few samples, provided we accept certain facts about the homogeneity of the physical properties of elemental substances; whether nature in general is uniform is not really at issue.²⁶
²³ We owe this suggestion to conversation with Kevin Kelly; cf. Kelly (1996, especially Chapter 13).
²⁴ Despite his ideas on testing, Jaynes (2003) was a prominent and emphatic advocate of the claim that Bayesian inference is the logic of inductive inference as such, but preferred to follow Cox (1946, 1961) rather than Savage. See Halpern (1999) on the formal invalidity of Cox's proofs.
Simply put, we think the anti-inductivist view is pretty much right, but that statistical models are tools that let us draw inductive inferences on a deductive background. Most directly, random sampling allows us to learn about unsampled people (unobserved balls in an urn, as it were), but such inference, however inductive it may appear, relies not on any axiom of induction but rather on deductions from the statistical properties of random samples, and the ability to actually conduct such sampling. The appropriate design depends on many contingent material facts about the system we are studying, exactly as Norton argues.
Some results in statistical learning theory establish that certain procedures are 'probably approximately correct' in what is called a 'distribution-free' manner (Bousquet, Boucheron, & Lugosi, 2004; Vidyasagar, 2003); some of these results embrace Bayesian updating (McAllister, 1999). But here 'distribution-free' just means 'holding uniformly over all distributions in a very large class', for example requiring the data to be independent and identically distributed, or from a stationary, mixing stochastic process. Another branch of learning theory does avoid making any probabilistic assumptions, getting results which hold universally across all possible data sets, and again these results apply to Bayesian updating, at least over some parameter spaces (Cesa-Bianchi & Lugosi, 2006). However, these results are all of the form 'in retrospect, the posterior predictive distribution will have predicted almost as well as the best individual model could have done', speaking entirely about performance on the past training data and revealing nothing about extrapolation to hitherto unobserved cases.
To sum up, one is free to describe statistical inference as a theory of inductive logic, but these would be inductions which are deductively guaranteed by the probabilistic assumptions of stochastic models. We can see no interesting and correct sense in which Bayesian statistics is a logic of induction which does not equally imply that frequentist statistics is also a theory of inductive inference (cf. Mayo & Cox, 2006), which is to say, not very inductive at all.
²⁵ Stove (1986) further argues that induction by simple enumeration is reliable without making such assumptions, at least sometimes. However, his calculations make no sense unless his data are independent and identically distributed.
²⁶ Within environments where such premises hold, it may of course be adaptive for organisms to develop inductive propensities, whose scope would be more or less tied to the domain of the relevant material premises. Barkow, Cosmides, and Tooby (1992) develop this theme with reference to the evolution of domain-specific mechanisms of learning and induction; Gigerenzer (2000) and Gigerenzer, Todd, and the ABC Research Group (1999) consider proximate mechanisms and ecological aspects, and Holland et al. (1986) propose a unified framework for modelling such inductive propensities in terms of generate-and-test processes. All of this, however, is more within the field of psychology than either statistics or philosophy, as (to paraphrase the philosopher Ian Hacking, 2001) it does not so much solve the problem of induction as evade it.
6. What about Popper and Kuhn?
The two most famous modern philosophers of science are undoubtedly Karl Popper
(1934/1959) and Thomas Kuhn (1970), and if statisticians (like other non-philosophers)
know about philosophy of science at all, it is generally some version of their ideas. It may
therefore help readers to see how our ideas relate to theirs. We do not pretend that our
sketch fully portrays these figures, let alone the literatures of exegesis and controversy
they inspired, or even how the philosophy of science has moved on since 1970.
Popper's key idea was that of 'falsification', or 'conjectures and refutations'. The inspiring example, for Popper, was the replacement of classical physics, after several centuries as the core of the best-established science, by modern physics, especially the replacement of Newtonian gravitation by Einstein's general relativity. Science, for Popper, advances by scientists advancing theories which make strong, wide-ranging predictions capable of being refuted by observations. A good experiment or observational study is one which tests a specific theory (or theories) by confronting their predictions with data in such a way that a match is not automatically assured; good studies are designed with theories in mind, to give them a chance to fail. Theories which conflict with any evidence must be rejected, since a single counter-example implies that a generalization is false. Theories which are not falsifiable by any conceivable evidence are, for Popper, simply not scientific, though they may have other virtues.²⁷ Even those falsifiable theories which have survived contact with data so far must be regarded as more or less provisional, since no finite amount of data can ever establish a generalization, nor is there any non-circular principle of induction which could let us regard theories which are compatible with lots of evidence as probably true.²⁸ Since people are fallible, and often obstinate and overly fond of their own ideas, the objectivity of the process which tests conjectures lies not in the emotional detachment and impartiality of individual scientists, but rather in the scientific community being organized in certain ways, with certain institutions, norms and traditions, so that individuals' prejudices more or less wash out (Popper, 1945, Chapters 23-24).
Clearly, we find much here to agree with, especially the general hypothetico-deductive view of scientific method and the anti-inductivist stance. On the other hand, Popper's specific ideas about testing require, at the least, substantial modification. His idea of a test comes down to the rule of deduction which says that if p implies q, and q is false, then p must be false, with the roles of p and q being played by hypotheses and data, respectively. This is plainly inadequate for statistical hypotheses, yet, as critics have noted since Braithwaite (1953) at least, he oddly ignored the theory of statistical hypothesis testing.²⁹ It is possible to do better, both through standard hypothesis tests and the kind of predictive checks we have described. In particular, as Mayo (1996) has emphasized, it is vital to consider the severity of tests, their capacity to detect violations of hypotheses when they are present.
Popper tried to say how science ought to work, supplemented by arguments that his ideals could at least be approximated and often had been. Kuhn's work, in contrast, was much more an attempt to describe how science had, in point of historical fact, developed, supported by arguments that alternatives were infeasible, from which some morals might be drawn. His central idea was that of a 'paradigm', a scientific problem and its solution which served as a model or exemplar, so that solutions to other problems could be developed in imitation of it.³⁰
²⁷ This demarcation criterion has received a lot of criticism, much of it justified. The question of what makes something 'scientific' is fortunately not one we have to answer; cf. Laudan (1996, Chapters 11-12) and Ziman (2000).
²⁸ Popper tried to work out notions of 'corroboration' and increasing truth content, or 'verisimilitude', to fit with these stances, but these are generally regarded as failures.
²⁹ We have generally found Popper's ideas on probability and statistics to be of little use and will not discuss them here.
Paradigms come along with presuppositions about the terms available for describing problems and their solutions, what counts as a valid problem, what counts as a solution, background assumptions which can be taken as a matter of course, etc. Once a scientific community accepts a paradigm and all that goes with it, its members can communicate with one another and get on with the business of solving puzzles, rather than arguing about what they should be doing. Such 'normal science' includes a certain amount of developing and testing of hypotheses but leaves the central presuppositions of the paradigm unquestioned.
During periods of normal science, according to Kuhn, there will always be some anomalies, things within the domain of the paradigm which it currently cannot explain, or which even seem to refute its assumptions. These are generally ignored, or at most regarded as problems which somebody ought to investigate eventually. (Is a special adjustment for odd local circumstances called for? Might there be some clever calculational trick which fixes things? How sound are those anomalous observations?) More formally, Kuhn invokes the Quine-Duhem thesis (Quine, 1961; Duhem, 1914/1954). A paradigm only makes predictions about observations in conjunction with 'auxiliary' hypotheses about specific circumstances, measurement procedures, etc. If the predictions are wrong, Quine and Duhem claimed that one is always free to fix the blame on the auxiliary hypotheses, and preserve belief in the core assumptions of the paradigm 'come what may'.³¹ The Quine-Duhem thesis was also used by Lakatos (1978) as part of his 'methodology of scientific research programmes', a falsificationism more historically oriented than Popper's, distinguishing between progressive development of auxiliary hypotheses and degenerate research programmes where auxiliaries become ad hoc devices for saving core assumptions from data.
According to Kuhn, however, anomalies can accumulate, becoming so serious as to create a 'crisis' for the paradigm, beginning a period of 'revolutionary' science. It is then that a new paradigm can form, one which is generally 'incommensurable' with the old: it makes different presuppositions, takes a different problem and its solution as exemplars, redefines the meaning of terms. Kuhn insisted that scientists who retain the old paradigm are not being irrational, because (by the Quine-Duhem thesis) they can always explain away the anomalies somehow; but neither are the scientists who embrace and develop the new paradigm being irrational. Switching to the new paradigm is more like a bistable illusion flipping (the apparent duck becomes an obvious rabbit) than any process of ratiocination governed by sound rules of method.³²
[30] Examples are Newton's deduction of Kepler's laws of planetary motion and other facts of astronomy from the inverse square law of gravitation, and Planck's derivation of the black-body radiation distribution from Boltzmann's statistical mechanics and the quantization of the electromagnetic field. An internal example for statistics might be the way the Neyman–Pearson lemma inspired the search for uniformly most powerful tests in a variety of complicated situations.
[31] This thesis can be attacked from many directions, perhaps the most vulnerable being that one can often find multiple lines of evidence which bear on either the main principles or the auxiliary hypotheses separately, thereby localizing the problems (Glymour, 1980; Kitcher, 1993; Laudan, 1996; Mayo, 1996).
[32] Salmon (1990) proposed a connection between Kuhn and Bayesian reasoning, suggesting that the choice between paradigms could be made rationally by using Bayes's rule to compute their posterior probabilities, with the prior probabilities for the paradigms encoding such things as preferences for parsimony. This has at least three big problems. First, all our earlier objections to using posterior probabilities to choose between theories apply, with all the more force because every paradigm is compatible with a broad range of specific theories. Second, devising priors encoding those methodological preferences, particularly a non-vacuous preference for parsimony, is hard or impossible in practice (Kelly, 2010). Third, it implies a truly remarkable form of Platonism: for scientists to give a paradigm positive posterior probability, they must, by Bayes's rule, have always given it strictly positive prior probability, even before having encountered a statement of the paradigm.
In some way, Kuhn's distinction between normal and revolutionary science is
analogous to the distinction between learning within a Bayesian model, and checking
the model in preparation for discarding or expanding it. Just as the work of normal
science proceeds within the presuppositions of the paradigm, updating a posterior
distribution by conditioning on new data takes the assumptions embodied in the prior
distribution and the likelihood function as unchallengeable truths. Model checking, on
the other hand, corresponds to the identification of anomalies, with a switch to a new
model when they become intolerable. Even the problems with translations between
paradigms have something of a counterpart in statistical practice; for example, the
intercept coefficients in a varying-intercept, constant-slope regression model have a
somewhat different meaning than do the intercepts in a varying-slope model. We do
not want to push the analogy too far, however, since most model checking and model
reformulation would by Kuhn have been regarded as puzzle-solving within a single
paradigm, and his views of how people switch between paradigms are, as we just saw,
rather different.
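To make the translation problem concrete in this small example (a sketch in our notation, with two hypothetical states s and s'): under a common slope, the comparison of the two states on the logit scale,

    (a_s + bx) - (a_s' + bx) = a_s - a_s',

is the same at every income level x, so the intercepts by themselves summarize how the states differ. Once the slopes vary, the comparison becomes

    (a_s + b_s x) - (a_s' + b_s' x) = (a_s - a_s') + (b_s - b_s') x,

which depends on the income level, and an intercept a_s now describes only a state's position at the reference income x = 0; the same symbol no longer carries the same meaning.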
Kuhn's ideas about scientific revolutions are famous because they raise so many
disturbing questions about the scientic enterprise. For instance, there has been
considerable controversy over whether Kuhn believed in any notion of scientific
progress, and over whether or not he should have, given his theory. Yet detailed historical
case studies (Donovan, Laudan, & Laudan, 1988) have shown that Kuhn's picture of
sharp breaks between normal and revolutionary science is hard to sustain.[33] This leads
to a tendency, already remarked by Toulmin (1972, pp. 112–117), either to expand
paradigms or to shrink them. Expanding paradigms into persistent and all-embracing,
because abstract and vague, bodies of ideas lets one preserve the idea of abrupt
breaks in thought, but makes them rare and leaves almost everything to puzzle-solving
normal science. (In the limit, there has only been one paradigm in astronomy since
the Mesopotamians, something like 'many lights in the night sky are objects which are
very large but very far away, and they move in interrelated, mathematically describable,
discernible patterns'.) This corresponds, we might say, to relentlessly enlarging the
support of the prior. The other alternative is to shrink paradigms into increasingly
concrete, specific theories and even models, making the standard for a revolutionary
change very small indeed, in the limit reaching any kind of conceptual change
whatsoever.

[33] Arguably this is true even of Kuhn (1957).
We suggest that there is actually some validity to both moves, that there is a sort of
(weak) self-similarity involved in scientific change. Every scale of size and complexity,
from local problem-solving to big-picture science, features progress of the normal
science type, punctuated by occasional revolutions. For example, in working on an
applied research or consulting problem, one typically will start in a certain direction,
then suddenly realize one was thinking about it incorrectly, then move forward, and so
forth. In a consulting setting, this re-evaluation can happen several times in a couple of
hours. At a slightly longer time scale, we commonly reassess any approach to an applied
problem after a few months, realizing there was some key feature of the problem we
were misunderstanding, and so forth. There is a link between the size and the typical
time scales of these changes, with small revolutions occurring fairly frequently (every
few minutes for an exam-type problem), up to every few decades for a major scientific
consensus. (This is related to but somewhat different from the recursive subject-matter
divisions discussed by Abbott, 2001.) The big changes are more exciting, even glamorous,
but they rest on the hard work of extending the implications of theories far enough that
they can be decisively refuted.
To sum up, our views are much closer to Popper's than to Kuhn's. The latter
encouraged a close attention to the history of science and to explaining the process
of scientific change, as well as putting on the agenda many genuinely deep
questions, such as when and how scientific fields achieve consensus. There are even
analogies between Kuhn's ideas and what happens in good data-analytic practice.
Fundamentally, however, we feel that deductive model checking is central to statistical
and scientific progress, and that it is the threat of such checks that motivates us
to perform inferences within complex models that we know ahead of time to be
false.
7. Why does this matter?
Philosophy matters to practitioners because they use it to guide their practice; even those
who believe themselves quite exempt from any philosophical influences are usually the
slaves of some defunct methodologist. The idea of Bayesian inference as inductive,
culminating in the computation of the posterior probability of scientific hypotheses, has
had malign effects on statistical practice. At best, the inductivist view has encouraged
researchers to fit and compare models without checking them; at worst, theorists have
actively discouraged practitioners from performing model checking because it does not
fit into their framework.
In our hypothetico-deductive view of data analysis, we build a statistical model out
of available parts and drive it as far as it can take us, and then a little farther. When the
model breaks down, we dissect it and gure out what went wrong. For Bayesian models,
the most useful way of figuring out how the model breaks down is through posterior
predictive checks, creating simulations of the data and comparing them to the actual
data. The comparison can often be done visually; see Gelman et al. (2004, Chapter 6)
for a range of examples. Once we have an idea about where the problem lies, we can
tinker with the model, or perhaps try a radically new design. Either way, we are using
deductive reasoning as a tool to get the most out of a model, and we test the model: it
is falsifiable, and when it is consequentially falsified, we alter or abandon it. None of this
is especially subjective, or at least no more so than any other kind of scientific inquiry,
which likewise requires choices as to the problem to study, the data to use, the models
to employ, etc.; but these choices are by no means arbitrary whims, uncontrolled by
objective conditions.
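As a concrete illustration of the kind of check described above, here is a minimal sketch of a simulation-based posterior predictive check (our illustration, not code from any work cited here; the data, the normal model with its standard noninformative prior, and the tail-discrepancy measure are all hypothetical choices):

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical 'observed' data: heavier-tailed than the normal model assumes.
    y = rng.standard_t(df=3, size=100)
    n = len(y)
    ybar, ss = y.mean(), np.sum((y - y.mean()) ** 2)

    # Posterior draws under a normal model with the usual noninformative prior:
    # sigma^2 | y ~ Inv-chi^2(n - 1, s^2), then mu | sigma, y ~ N(ybar, sigma^2 / n).
    n_sims = 1000
    y_rep = np.empty((n_sims, n))
    for s in range(n_sims):
        sigma = np.sqrt(ss / rng.chisquare(n - 1))
        mu = rng.normal(ybar, sigma / np.sqrt(n))
        y_rep[s] = rng.normal(mu, sigma, size=n)  # one replicated data set per draw

    # Discrepancy: proportion of points more than 3 sample SDs from the mean,
    # a crude summary of tail behaviour.
    def tail_frac(x):
        return np.mean(np.abs(x - x.mean()) > 3 * x.std())

    p_value = np.mean([tail_frac(y_rep[s]) >= tail_frac(y) for s in range(n_sims)])
    print('posterior predictive p-value for the tail discrepancy:', round(p_value, 3))

In practice one would also plot the replicated data sets next to the observed data, since the graphical comparison is usually more informative than any single p-value.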
Conversely, a problem with the inductive philosophy of Bayesian statistics, in which
science learns by updating the probabilities that various competing models are true, is
that it assumes that the true model (or, at least, the models among which we will choose
or over which we will average) is one of the possibilities being considered. This does
not fit our own experiences of learning by finding that a model does not fit and needing
to expand beyond the existing class of models to fix the problem.
Our methodological suggestions are to construct large models that are capable of
incorporating diverse sources of data, to use Bayesian inference to summarize uncertainty
about parameters in the models, to use graphical model checks to understand the
limitations of the models, and to move forward via continuous model expansion rather
than model selection or discrete model averaging. Again, we do not claim any novelty in
these ideas, which we and others have presented in many publications and which reflect
decades of statistical practice, expressed particularly forcefully in recent times by Box
(1980) and Jaynes (2003). These ideas, important as they are, are hardly ground-breaking
advances in statistical methodology. Rather, the point of this paper is to demonstrate
that our commonplace (if not universally accepted) approach to the practice of Bayesian
statistics is compatible with a hypothetico-deductive framework for the philosophy of
science.
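As an illustration of what we mean by continuous model expansion (a standard textbook example, not one drawn from the works just cited): rather than choosing discretely between a normal error model and a heavier-tailed alternative, one can embed both in a single Student-t family,

    y_i ~ t_nu(mu, sigma^2),

treating the degrees of freedom nu as an unknown parameter. As nu goes to infinity this recovers the normal model, so the original model survives inside the expansion as a special case to be checked, rather than as an option to be voted out.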
We fear that a philosophy of Bayesian statistics as subjective, inductive inference
can encourage a complacency about picking or averaging over existing models rather
than trying to falsify and go further.[34] Likelihood and Bayesian inference are powerful,
and with great power comes great responsibility. Complex models can and should be
checked and falsified. This is how we can learn from our mistakes.

[34] Ghosh and Ramamoorthi (2003, p. 112) see a similar attitude as discouraging inquiries into consistency: 'the prior and the posterior given by Bayes theorem [sic] are imperatives arising out of axioms of rational behavior and since we are already rational why worry about one more criterion, namely convergence to the truth?'
Acknowledgements
We thank the National Security Agency for grant H98230-10-1-0184, the Department of Energy
for grant DE-SC0002099, the Institute of Education Sciences for grants ED-GRANTS-032309-
005 and R305D090006-09A, and the National Science Foundation for grants ATM-0934516,
SES-1023176 and SES-1023189. We thank Wolfgang Beirl, Chris Genovese, Clark Glymour,
Mark Handcock, Jay Kadane, Rob Kass, Kevin Kelly, Kristina Klinkner, Deborah Mayo, Martina
Morris, Scott Page, Aris Spanos, Erik van Nimwegen, Larry Wasserman, Chris Wiggins, and
two anonymous reviewers for helpful conversations and suggestions.
References
Abbott, A. (2001). Chaos of disciplines. Chicago: University of Chicago Press.
al Ghazali, Abu Hamid Muhammad ibn Muhammad at-Tusi (1100/1997). The incoherence of the
philosophers = Tahafut al-falasifah: A parallel English-Arabic text, trans. M. E. Marmura.
Provo, UT: Brigham Young University Press.
Ashby, W. R. (1960). Design for a brain: The origin of adaptive behaviour (2nd ed.). London:
Chapman & Hall.
Atkinson, A. C., & Donev, A. N. (1992). Optimum experimental designs. Oxford: Clarendon
Press.
Barkow, J. H., Cosmides, L., & Tooby, J. (Eds.) (1992). The adapted mind: Evolutionary psychology
and the generation of culture. Oxford: Oxford University Press.
Bartlett, M. S. (1967). Inference and stochastic processes. Journal of the Royal Statistical Society,
Series A, 130, 457–478.
Bayarri, M. J., & Berger, J. O. (2000). P values for composite null models. Journal of the American
Statistical Association, 95, 1127–1142.
Bayarri, M. J., & Berger, J. O. (2004). The interplay of Bayesian and frequentist analysis. Statistical
Science, 19, 58–80. doi:10.1214/088342304000000116
Bayarri, M. J., & Castellanos, M. E. (2007). Bayesian checking of the second levels of hierarchical
models. Statistical Science, 22, 322–343.
Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: Irreconcilability of p-values and
evidence. Journal of the American Statistical Association, 82, 112–122.
Berk, R. H. (1966). Limiting behavior of posterior distributions when the model is incorrect.
Annals of Mathematical Statistics, 37, 51–58. doi:10.1214/aoms/1177699597 Correction: 37
(1966), 745–746.
Berk, R. H. (1970). Consistency a posteriori. Annals of Mathematical Statistics, 41, 894–906.
doi:10.1214/aoms/1177696967
Bernard, C. (1865/1927). Introduction to the study of experimental medicine, trans. H. C. Greene.
New York: Macmillan. First published as Introduction à l'étude de la médecine expérimentale,
Paris: J. B. Baillière. Reprinted New York: Dover, 1957.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.
Binmore, K. (2007). Making decisions in large worlds. Technical Report 266, ESRC Centre for
Economic Learning and Social Evolution, University College London. Retrieved from http://
[Link]/papers/uploaded/[Link]
Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In O.
Bousquet, U. von Luxburg, & G. Rätsch (Eds.), Advanced lectures in machine learning (pp.
169–207). Berlin: Springer.
Box, G. E. P. (1980). Sampling and Bayes' inference in scientific modelling and robustness. Journal
of the Royal Statistical Society, Series A, 143, 383–430.
Box, G. E. P. (1983). An apology for ecumenism in statistics. In G. E. P. Box, T. Leonard & C.-F. Wu
(Eds.), Scientific inference, data analysis, and robustness (pp. 51–84). New York: Academic
Press.
Box, G. E. P. (1990). Comment on 'The unity and diversity of probability' by Glen Shafer. Statistical
Science, 5, 448–449. doi:10.1214/ss/1177012024
Braithwaite, R. B. (1953). Scientific explanation: A study of the function of theory, probability
and law in science. Cambridge: Cambridge University Press.
Brown, R. Z., Sallow, W., Davis, D. E., & Cochran, W. G. (1955). The rat population of Baltimore,
1952. American Journal of Epidemiology, 61, 89–102.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge
University Press.
Claeskens, G., & Hjort, N. L. (2008). Model selection and model averaging. Cambridge: Cambridge
University Press.
Cox, D. D. (1993). An analysis of Bayesian inference for nonparametric regression. Annals of
Statistics, 21, 903–923. doi:10.1214/aos/1176349157
Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman & Hall.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of
Physics, 14, 1–13.
Cox, R. T. (1961). The algebra of probable inference. Baltimore, MD: Johns Hopkins University
Press.
Csiszár, I. (1995). Maxent, mathematics, and information theory. In K. M. Hanson & R. N. Silver
(Eds.), Maximum entropy and Bayesian methods: Proceedings of the Fifteenth International
Workshop on Maximum Entropy and Bayesian Methods (pp. 35–50). Dordrecht: Kluwer
Academic.
Dawid, A. P., & Vovk, V. G. (1999). Prequential probability: Principles and properties. Bernoulli,
5, 125–162. Retrieved from: [Link]
Donovan, A., Laudan, L., & Laudan, R. (Eds.), (1988). Scrutinizing science: Empirical studies
of scientific change. Dordrecht: Kluwer Academic. Reprinted 1992 (Baltimore, MD: Johns
Hopkins University Press) with a new introduction.
Doob, J. L. (1949). Application of the theory of martingales. In Colloques internationaux du
Centre National de la Recherche Scientifique, Vol. 13 (pp. 23–27). Paris: Centre National de
la Recherche Scientifique.
Duhem, P. (1914/1954). The aim and structure of physical theory, trans. P. P. Wiener. Princeton,
NJ: Princeton University Press.
Earman, J. (1992). Bayes or bust? A critical account of Bayesian confirmation theory. Cambridge,
MA: MIT Press.
Eggertsson, T. (1990). Economic behavior and institutions. Cambridge: Cambridge University
Press.
Fitelson, B., & Thomason, N. (2008). Bayesians sometimes cannot ignore even very implausible
theories (even ones that have not yet been thought of). Australasian Journal of Logic, 6,
25–36. Retrieved from: [Link]
Foster, D. P., & Young, H. P. (2003). Learning, hypothesis testing and Nash equilibrium. Games
and Economic Behavior, 45, 73–96. doi:10.1016/S0899-8256(03)00025-3
Fraser, D. A. S., & Rousseau, J. (2008). Studentization and deriving accurate p-values. Biometrika,
95, 1–16. doi:10.1093/biomet/asm093
Freedman, D. A. (1999). On the Bernstein-von Mises theorem with infinite-dimensional parameters.
Annals of Statistics, 27, 1119–1140. doi:10.1214/aos/1017938917
Gelman, A. (1994). Discussion of A probabilistic model for the spatial distribution of party support
in multiparty elections by S. Merrill. Journal of the American Statistical Association, 89, 1198.
Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing.
International Statistical Review, 71, 369–382. doi:10.1111/j.1751-5823.2003.tb00203.x
Gelman, A. (2004). Treatment effects in before-after data. In A. Gelman & X.-L. Meng (Eds.), Applied
Bayesian modeling and causal inference from incomplete-data perspectives (pp. 191–198).
Chichester: Wiley.
Gelman, A. (2007). Comment: Bayesian checking of the second levels of hierarchical models.
Statistical Science, 22, 349–352. doi:10.1214/07-STS235A
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.).
Boca Raton, FL: CRC Press.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models.
Cambridge: Cambridge University Press.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior
distribution for logistic and other regression models. Annals of Applied Statistics, 2,
1360–1383. doi:10.1214/08-AOAS191
Gelman, A., & King, G. (1994). Enhancing democracy through legislative redistricting. American
Political Science Review, 88, 541–559.
Gelman, A., Lee, D., & Ghitza, Y. (2010). Public opinion on health care reform. The Forum, 8(1).
doi:10.2202/1540-8884.1355
Gelman, A., Meng, X.-L., & Stern, H. S. (1996). Posterior predictive assessment of model fitness
via realized discrepancies (with discussion). Statistica Sinica, 6, 733–807. Retrieved from:
[Link]
Gelman, A., Park, D., Shor, B., Bafumi, J., & Cortina, J. (2008). Red state, blue state, rich state,
poor state: Why Americans vote the way they do. Princeton, NJ: Princeton University Press.
Gelman, A., & Rubin, D. B. (1995). Avoiding model selection in Bayesian social research.
Sociological Methodology, 25, 165–173.
Gelman, A., Shor, B., Park, D., & Bafumi, J. (2008). Rich state, poor state, red state, blue state:
What's the matter with Connecticut? Quarterly Journal of Political Science, 2, 345–367.
doi:10.1561/100.00006026
Ghitza, Y., & Gelman, A. (2012). Deep interactions with MRP: presidential turnout and voting
patterns among small electoral subgroups. Technical report, Department of Political Science,
Columbia University.
Ghosh, J. K., & Ramamoorthi, R. V. (2003). Bayesian nonparametrics. New York: Springer.
Giere, R. N. (1988). Explaining science: A cognitive approach. Chicago: University of Chicago
Press.
Gigerenzer, G. (2000). Adaptive thinking: Rationality in the real world. Oxford: Oxford
University Press.
Gigerenzer, G., Todd, P. M., & ABC Research Group. (1999). Simple heuristics that make us
smart. Oxford: Oxford University Press.
Glymour, C. (1980). Theory and evidence. Princeton, NJ: Princeton University Press.
Good, I. J. (1983). Good thinking: The foundations of probability and its applications.
Minneapolis: University of Minnesota Press.
Good, I. J., & Crook, J. F. (1974). The Bayes/non-Bayes compromise and the multinomial
distribution. Journal of the American Statistical Association, 69, 711–720.
Gray, R. M. (1990). Entropy and information theory. New York: Springer.
Greenland, S. (1998). Induction versus Popper: Substance versus semantics. International Journal
of Epidemiology, 27, 543–548. doi:10.1093/ije/27.4.543
Greenland, S. (2009). Relaxation penalties and priors for plausible modeling of nonidentified bias
sources. Statistical Science, 24, 195–210. doi:10.1214/09-STS291
Grünwald, P. D. (2007). The minimum description length principle. Cambridge, MA: MIT Press.
Grünwald, P. D., & Langford, J. (2007). Suboptimal behavior of Bayes and MDL in classification
under misspecification. Machine Learning, 66, 119–149. doi:10.1007/s10994-007-0716-7
Gustafson, P. (2005). On model expansion, model contraction, identifiability and prior information:
Two illustrative scenarios involving mismeasured variables. Statistical Science, 20, 111–140.
doi:10.1214/088342305000000098
Guttorp, P. (1995). Stochastic modeling of scientific data. London: Chapman & Hall.
Haack, S. (1993). Evidence and inquiry: Towards reconstruction in epistemology. Oxford:
Blackwell.
Hacking, I. (2001). An introduction to probability and inductive logic. Cambridge: Cambridge
University Press.
Halpern, J. Y. (1999). Cox's theorem revisited. Journal of Artificial Intelligence Research, 11,
429–435. doi:10.1613/jair.644
Handcock, M. S. (2003). Assessing degeneracy in statistical models of social networks. Working
Paper no. 39, Center for Statistics and the Social Sciences, University of Washington. Retrieved
from [Link]
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining,
inference, and prediction (2nd ed.). Berlin: Springer.
Hempel, C. G. (1965). Aspects of scientific explanation. Glencoe, IL: Free Press.
Hill, J. R. (1990). A general framework for model-based statistics. Biometrika, 77, 115–126.
Hjort, N. L., Holmes, C., Müller, P., & Walker, S. G. (Eds.), (2010). Bayesian nonparametrics.
Cambridge: Cambridge University Press.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of
inference, learning, and discovery. Cambridge, MA: MIT Press.
Howson, C., & Urbach, P. (1989). Scientific reasoning: The Bayesian approach. La Salle, IL: Open
Court.
Hunter, D. R., Goodreau, S. M., & Handcock, M. S. (2008). Goodness of fit of social network
models. Journal of the American Statistical Association, 103, 248–258.
doi:10.1198/016214507000000446
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge: Cambridge University
Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association,
90, 773–795.
Kass, R. E., & Vos, P. W. (1997). Geometrical foundations of asymptotic inference. New York:
Wiley.
Kass, R. E., & Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal
of the American Statistical Association, 91, 1343–1370.
Kelly, K. T. (1996). The logic of reliable inquiry. Oxford: Oxford University Press.
Kelly, K. T. (2010). Simplicity, truth, and probability. In P. Bandyopadhyay & M. Forster (Eds.),
Handbook on the philosophy of statistics. Dordrecht: Elsevier.
Kitcher, P. (1993). The advancement of science: Science without legend, objectivity without
illusions. Oxford: Oxford University Press.
Kleijn, B. J. K., & van der Vaart, A. W. (2006). Misspecification in infinite-dimensional Bayesian
statistics. Annals of Statistics, 34, 837–877. doi:10.1214/009053606000000029
Kolakowski, L. (1968). The alienation of reason: A history of positivist thought, trans. N.
Guterman. Garden City, NY: Doubleday.
Kuhn, T. S. (1957). The Copernican revolution: Planetary astronomy in the development of
western thought. Cambridge, MA: Harvard University Press.
Kuhn, T. S. (1970). The structure of scientific revolutions (2nd ed.). Chicago: University of Chicago
Press.
Lakatos, I. (1978). Philosophical papers. Cambridge: Cambridge University Press.
Laudan, L. (1981). Science and hypothesis. Dordrecht: D. Reidel.
Laudan, L. (1996). Beyond positivism and relativism: Theory, method and evidence. Boulder,
CO: Westview Press.
Li, Q., & Racine, J. S. (2007). Nonparametric econometrics: Theory and practice. Princeton, NJ:
Princeton University Press.
Lijoi, A., Prünster, I., & Walker, S. G. (2007). Bayesian consistency for stationary models.
Econometric Theory, 23, 749–759. doi:10.1017/S0266466607070314
Lindsay, B., & Liu, L. (2009). Model assessment tools for a model false world. Statistical Science,
24, 303–318. doi:10.1214/09-STS302
Manski, C. F. (2007). Identification for prediction and decision. Cambridge, MA: Harvard
University Press.
Manski, C. F. (2011). Actualist rationality. Theory and Decision, 71. doi:10.1007/
s11238-009-9182-y
Mayo, D. G. (1996). Error and the growth of experimental knowledge. Chicago: University of
Chicago Press.
Mayo, D. G., & Cox, D. R. (2006). Frequentist statistics as a theory of inductive inference. In J.
Rojo (Ed.), Optimality: The Second Erich L. Lehmann Symposium (pp. 77–97). Bethesda, MD:
Institute of Mathematical Statistics.
Mayo, D. G., & Spanos, A. (2004). Methodology in practice: Statistical misspecification testing.
Philosophy of Science, 71, 1007–1025.
Mayo, D. G., & Spanos, A. (2006). Severe testing as a basic concept in a Neyman–Pearson philosophy
of induction. British Journal for the Philosophy of Science, 57, 323–357. doi:10.1093/bjps/axl003
McAllister, D. A. (1999). Some PAC-Bayesian theorems. Machine Learning, 37, 355–363.
doi:10.1023/A:1007618624809
McCarty, N., Poole, K. T., & Rosenthal, H. (2006). Polarized America: The dance of ideology and
unequal riches. Cambridge, MA: MIT Press.
Merrill III, S. (1994). A probabilistic model for the spatial distribution of party support in multiparty
electorates. Journal of the American Statistical Association, 89, 1190–1197.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equations of
state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.
doi:10.1063/1.1699114
Morris, C. N. (1986). Comment on 'Why isn't everyone a Bayesian?'. American Statistician, 40,
7–8.
Müller, U. K. (2011). Risk of Bayesian inference in misspecified models, and the sandwich
covariance matrix. Econometrica, submitted. Retrieved from [Link]umueller/[Link]
Newman, M. E. J., & Barkema, G. T. (1999). Monte Carlo methods in statistical physics. Oxford:
Clarendon Press.
Norton, J. D. (2003). A material theory of induction. Philosophy of Science, 70, 647–670.
doi:10.1086/378858
Paninski, L. (2005). Asymptotic theory of information-theoretic experimental design. Neural
Computation, 17, 1480–1507. doi:10.1162/0899766053723032
Popper, K. R. (1934/1959). The logic of scientific discovery. London: Hutchinson.
Popper, K. R. (1945). The open society and its enemies. London: Routledge.
Quine, W. V. O. (1961). From a logical point of view: Logico-philosophical essays (2nd ed.).
Cambridge, MA: Harvard University Press.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25,
111–196.
Ripley, B. D. (1988). Statistical inference for spatial processes. Cambridge: Cambridge University
Press.
Rivers, D., & Vuong, Q. H. (2002). Model selection tests for nonlinear dynamic models.
Econometrics Journal, 5, 1–39. doi:10.1111/1368-423X.t01-1-00071
Robins, J. M., van der Vaart, A., & Ventura, V. (2000). Asymptotic distribution of p values in
composite null models (with discussions and rejoinder). Journal of the American Statistical
Association, 95, 1143–1172.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of
Statistics, 6, 34–58. doi:10.1214/aos/1176344064
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied
statistician. Annals of Statistics, 12, 1151–1172. doi:10.1214/aos/1176346785
Russell, B. (1948). Human knowledge: Its scope and limits. New York: Simon and Schuster.
Salmon, W. C. (1990). The appraisal of theories: Kuhn meets Bayes. PSA: Proceedings of the
Biennial Meeting of the Philosophy of Science Association (Vol. 2, pp. 325–332). Chicago:
University of Chicago Press.
Savage, L. J. (1954). The foundations of statistics. New York: Wiley.
Schervish, M. J. (1995). Theory of statistics. Berlin: Springer.
Seidenfeld, T. (1979). Why I am not an objective Bayesian: Some reflections prompted by
Rosenkrantz. Theory and Decision, 11, 413–440. doi:10.1007/BF00139451
Seidenfeld, T. (1987). Entropy and uncertainty. In I. B. MacNeill & G. J. Umphrey (Eds.),
Foundations of statistical inference (pp. 259–287). Dordrecht: D. Reidel.
Shalizi, C. R. (2009). Dynamics of Bayesian updating with dependent data and misspecified models.
Electronic Journal of Statistics, 3, 1039–1074. doi:10.1214/09-EJS485
Snijders, T. A. B., Pattison, P. E., Robins, G. L., & Handcock, M. S. (2006). New specifications
for exponential random graph models. Sociological Methodology, 36, 99–153.
doi:10.1111/j.1467-9531.2006.00176.x
Spanos, A. (2007). Curve fitting, the reliability of inductive inference, and the error-statistical
approach. Philosophy of Science, 74, 1046–1066. doi:10.1086/525643
Stove, D. C. (1982). Popper and after: Four modern irrationalists. Oxford: Pergamon Press.
Stove, D. C. (1986). The rationality of induction. Oxford: Clarendon Press.
Tilly, C. (2004). Observations of social processes and their formal representations. Sociological
Theory, 22, 595–602. Reprinted in Tilly (2008). doi:10.1111/j.0735-2751.2004.00235.x
Tilly, C. (2008). Explaining social processes. Boulder, CO: Paradigm.
Toulmin, S. (1972). Human understanding: The collective use and evolution of concepts.
Princeton, NJ: Princeton University Press.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Uffink, J. (1995). Can the maximum entropy principle be explained as a consistency requirement?
Studies in the History and Philosophy of Modern Physics, 26B, 223–261.
doi:10.1016/1355-2198(95)00015-1
Uffink, J. (1996). The constraint rule of the maximum entropy principle. Studies in History and
Philosophy of Modern Physics, 27, 47–79. doi:10.1016/1355-2198(95)00022-4
Vansteelandt, S., Goetghebeur, E., Kenward, M. G., & Molenberghs, G. (2006). Ignorance and
uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica, 16, 953–980.
Vidyasagar, M. (2003). Learning and generalization: With applications to neural networks (2nd
ed.). Berlin: Springer.
Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses.
Econometrica, 57, 307–333.
Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and
Applied Mathematics.
Wasserman, L. (2006). Frequentist Bayes is objective. Bayesian Analysis, 1, 451–456.
doi:10.1214/06-BA116H
Weinberg, S. (1999). What is quantum field theory, and what did we think it was? In T. Y. Cao
(Ed.), Conceptual foundations of quantum field theory (pp. 241–251). Cambridge: Cambridge
University Press.
White, H. (1994). Estimation, inference and specification analysis. Cambridge: Cambridge
University Press.
Wooldridge, J. M. (2002). Econometric analysis of cross section and panel data. Cambridge, MA:
MIT Press.
Ziman, J. (2000). Real science: What it is, and what it means. Cambridge: Cambridge University
Press.
Received 28 June 2011; revised version received 6 December 2011