
Chapter 2

Point Estimation Basics

2.1 Introduction
The statistical inference problem starts with the identification of a population of interest,
about which something is unknown. For example, before introducing a law that homes be
equipped with radon detectors, government officials should first ascertain whether radon
levels in local homes are, indeed, too high. The most efficient (and, surprisingly, often the
most accurate) way to gather this information is to take a sample of local homes and record
the radon levels in each.¹ Now that the sample is obtained, how should this information be
used to answer the question of interest? Suppose that officials are interested in the mean
radon level for all homes in their community—this quantity is unknown, otherwise, there’d be
no reason to take the sample in the first place. After some careful exploratory data analysis,
the statistician working the project determines a statistical model, i.e., the functional form
of the PDF that characterizes radon levels in homes in the community. Now the statistician
has a model, which depends on the unknown mean radon level (and possibly other unknown
population characteristics), and a sample from that distribution. His/her charge is to use
these two pieces of information to make inference about the unknown mean radon level.
The simplest such inference is to estimate this mean. In this section we will
discuss some of the basic principles of statistical estimation. This will be an important
theme throughout the course.

2.2 Notation and terminology


The starting point is a statement of the model. Let X1, . . . , Xn be a sample from a distribution with CDF Fθ, depending on a parameter θ which is unknown. In some cases, it will be important to also know the parameter space Θ, the set of possible values of θ.² Point estimation is the problem of finding a function of the data that provides a "good" estimate of the unknown parameter θ.

¹ In this course, we will take the sample as given; that is, we will not consider the question of how the sample is obtained. In general it is not an easy task to obtain a bona fide completely random sample; careful planning of experimental/survey designs is necessary.
Definition 2.1. Let X1 , . . . , Xn be a sample from a distribution Fθ with θ ∈ Θ. A point
estimate of θ is a function θ̂ = θ̂(X1 , . . . , Xn ) taking values in Θ.
This (point) estimator θ̂ (read as “theta-hat”) is a special case of a statistic discussed
previously. What distinguishes an estimator from a general statistic is that it is required
to take values in the parameter space Θ, so that it makes sense to compare θ and θ̂. But
besides this, θ̂ can be anything, although some choices are better than others.
Example 2.1. Suppose that X1 , . . . , Xn are iid N(µ, σ 2 ). Then the following are all point
estimates of the mean µ:
$$\hat\mu_1 = \frac{1}{n}\sum_{i=1}^n X_i, \qquad \hat\mu_2 = M_n \ \text{(the sample median)}, \qquad \hat\mu_3 = \frac{X_{(1)} + X_{(n)}}{2}.$$

The sampling distribution of µ̂1 is known (what is it?), but not for the others. However,
some asymptotic theory is available that may be helpful for comparing these as estimators
of µ; more on this later.
Exercise 2.1. Modify the R code in Section 1.6.1 to get Monte Carlo approximations of
the sampling distributions of µ̂1 , µ̂2 , and µ̂3 in Example 2.1. Start with n = 10, µ = 0, and
σ = 1, and draw histograms to compare. What happens if you change n, µ, or σ?
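Since the R code from Section 1.6.1 is not reproduced in this chapter, here is an independent sketch of the kind of simulation Exercise 2.1 asks for; the settings (n = 10, µ = 0, σ = 1, 5000 Monte Carlo replications, and the seed) are arbitrary illustrative choices.

# Monte Carlo approximation of the sampling distributions of three
# estimators of the normal mean: sample mean, median, and midrange.
set.seed(411)
n <- 10; mu <- 0; sigma <- 1; reps <- 5000
est <- t(replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  c(mean = mean(x), median = median(x), midrange = (min(x) + max(x)) / 2)
}))
par(mfrow = c(1, 3))
for (j in colnames(est)) {
  hist(est[, j], breaks = 30, freq = FALSE, main = j, xlab = expression(hat(mu)))
}
apply(est, 2, var)  # compare Monte Carlo variances of the three estimators

Re-running this with different n, µ, or σ (as the exercise asks) changes the spread and center of the three histograms but not their relative ordering in variance.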
Example 2.2. Let θ denote the proportion of individuals in a population that favor a
particular piece of legislation. To estimate θ, sample X1 , . . . , Xn iid Ber(θ); that is, Xi = 1
if sampled individual i favors the legislation, and Xi = 0 otherwise. Then an estimate of θ is the sample mean, $\hat\theta = n^{-1}\sum_{i=1}^n X_i$. Since the sum $\sum_{i=1}^n X_i$ has a known sampling distribution (what is it?), many properties of θ̂ can be derived without too much trouble.
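As a quick illustration of that "known sampling distribution," the following R sketch (my own, with n = 25 and θ = 0.3 chosen arbitrarily) compares the empirical distribution of the sum of the Bernoulli observations with the Binomial(n, θ) PMF.

# Empirical distribution of sum(X_i) versus the Binomial(n, theta) PMF
set.seed(411)
n <- 25; theta <- 0.3; reps <- 10000
sums <- replicate(reps, sum(rbinom(n, size = 1, prob = theta)))
emp <- as.numeric(table(factor(sums, levels = 0:n))) / reps
round(cbind(x = 0:n, empirical = emp,
            binomial = dbinom(0:n, size = n, prob = theta)), 3)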
We will focus mostly on problems where there is only one unknown parameter. However,
there are also important problems where θ is actually a vector and Θ is a subset of Rᵈ, d > 1. One of the most important examples is the normal model where both the mean and variance are unknown. In this case θ = (µ, σ²) and Θ = {(µ, σ²) : µ ∈ R, σ² ∈ R₊} ⊂ R².
The properties of an estimator θ̂ will depend on its sampling distribution. Here I need to elaborate a bit on notation. Since the distribution of X1, . . . , Xn depends on θ, so does the sampling distribution of θ̂. So when we calculate probabilities, expected values, etc., it is often important to make clear under what parameter value these are being taken. Therefore, we
will highlight this dependence by adding a subscript to the familiar probability and expected
value operators P and E. That is, Pθ and Eθ will mean probability and expected value with
respect to the joint distribution of (X1 , . . . , Xn ) under Fθ .
² HMC uses "Ω" (Omega) for the parameter space, instead of Θ; however, I find it more convenient to use the same Greek letter, with the lower- and upper-case versions distinguishing the parameter from the parameter space.

In general (see Example 2.1) there can be more than one “reasonable” estimator of an
unknown parameter. One of the goals of mathematical statistics is to provide a theoretical
framework by which an “optimal” estimator can be identified in a given problem. But before
we can say anything about which estimator is best, we need to know something about the important properties an estimator should have.

2.3 Properties of estimators


Properties of estimators are all consequences of their sampling distributions. Most of the
time, the full sampling distribution of θ̂ is not available; therefore, we focus on properties
that do not require complete knowledge of the sampling distribution.

2.3.1 Unbiasedness
The first, and probably the simplest, property is called unbiasedness. In words, an estimator
θ̂ is unbiased if, when applied to many different samples from Fθ , θ̂ equals the true parameter
θ, on average. Equivalently, unbiasedness means the sampling distribution of θ̂ is, in some
sense, centered around θ.

Definition 2.2. The bias of an estimator is bθ(θ̂) = Eθ(θ̂) − θ. Then θ̂ is an unbiased estimator of θ if bθ(θ̂) = 0 for all θ.

That is, no matter the actual value of θ, if we apply θ̂ = θ̂(X1 , . . . , Xn ) to many data
sets X1 , . . . , Xn sampled from Fθ , then the average of these θ̂ values will equal θ—in other
words, Eθ (θ̂) = θ for all θ. This is clearly not an unreasonable property, and a lot of work in
mathematical statistics has focused on unbiased estimation.

Example 2.3. Let X1 , . . . , Xn be iid from some distribution having mean µ and variance σ 2 .
This distribution could be normal, but it need not be. Consider µ̂ = X̄, the sample mean,
and σ̂ 2 = S 2 , the sample variance. Then µ̂ and σ̂ 2 are unbiased estimators of µ and σ 2 ,
respectively. The proof for µ̂ is straightforward—try it yourself! For σ̂ 2 , recall the following
decomposition of the sample variance:
$$\hat\sigma^2 = S^2 = \frac{1}{n-1}\Bigl\{\sum_{i=1}^n X_i^2 - n\bar X^2\Bigr\}.$$

Drop the subscript (µ, σ²) on Eµ,σ² for simplicity. Recall the following two general facts:
$$E(X^2) = V(X) + E(X)^2 \qquad\text{and}\qquad V(\bar X) = n^{-1}V(X_1).$$

Then, using linearity of expectation,
$$
\begin{aligned}
E(\hat\sigma^2) &= \frac{1}{n-1}\Bigl\{\sum_{i=1}^n E(X_i^2) - nE(\bar X^2)\Bigr\}\\
&= \frac{1}{n-1}\bigl\{n\,V(X_1) + nE(X_1)^2 - nV(\bar X) - nE(\bar X)^2\bigr\}\\
&= \frac{1}{n-1}\bigl\{n\sigma^2 + n\mu^2 - \sigma^2 - n\mu^2\bigr\}\\
&= \sigma^2.
\end{aligned}
$$
Therefore, the sample variance is an unbiased estimator of the population variance, regardless
of the model.
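The unbiasedness claims in Example 2.3 are easy to check by simulation. The R sketch below is an illustration, not part of the original notes; it uses an Exponential(rate = 1/2) population, which is deliberately non-normal, so µ = 2 and σ² = 4.

# Monte Carlo check that E(Xbar) = mu and E(S^2) = sigma^2,
# here for an Exponential(rate = 1/2) population: mu = 2, sigma^2 = 4.
set.seed(411)
n <- 15; reps <- 100000
out <- replicate(reps, {
  x <- rexp(n, rate = 1/2)
  c(xbar = mean(x), s2 = var(x))
})
rowMeans(out)  # both averages should be close to c(2, 4)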
While unbiasedness is a nice property for an estimator to have, it doesn’t carry too much
weight. Specifically, an estimator can be unbiased but otherwise very poor. For an extreme
example, suppose that Pθ {θ̂ = θ + 10⁵} = 1/2 = Pθ {θ̂ = θ − 10⁵}. In this case, Eθ (θ̂) = θ, but θ̂ is always very far away from θ. There is also a well-known phenomenon (the bias–variance trade-off) which says that allowing the bias to be non-zero can often improve estimation accuracy; more on this below. The following example highlights some of the problems of
focusing primarily on the unbiasedness property.
Example 2.4. (See Remark 7.6.1 in HMC.) Let X be a sample from a Pois(θ) distribution.
Suppose the goal is to estimate η = e^{−2θ}, not θ itself. We know that θ̂ = X is an unbiased estimator of θ. However, the natural estimator e^{−2X} is not an unbiased estimator of e^{−2θ}. Consider instead η̂ = (−1)^X. This estimator is unbiased:
$$E_\theta[(-1)^X] = \sum_{x=0}^\infty e^{-\theta}\,\frac{(-1)^x\theta^x}{x!} = e^{-\theta}\sum_{x=0}^\infty \frac{(-\theta)^x}{x!} = e^{-\theta}\,e^{-\theta} = e^{-2\theta}.$$
In fact, it can even be shown that (−1)^X is the "best" of all unbiased estimators; cf. the Lehmann–Scheffé theorem. But even though it's unbiased, it can only take the values ±1, so, depending on θ, (−1)^X may never be close to e^{−2θ}.
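The point of Example 2.4 can be seen numerically with the following R sketch (θ = 1 and the replication count are arbitrary choices): the average of (−1)^X over many draws is close to e^{−2θ} ≈ 0.135, even though every individual estimate is either −1 or +1.

# Unbiased but useless: eta_hat = (-1)^X for estimating exp(-2*theta)
set.seed(411)
theta <- 1; reps <- 100000
x <- rpois(reps, lambda = theta)
eta_hat <- (-1)^x
mean(eta_hat)       # close to exp(-2 * theta)
exp(-2 * theta)     # the target, about 0.1353
table(eta_hat)      # but each individual estimate is -1 or +1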
Exercise 2.2. Prove the claim in the previous example that e^{−2X} is not an unbiased estimator of e^{−2θ}. (Hint: use the Poisson moment-generating function, p. 152 in HMC.)
In general, for a given function g, if θ̂ is an unbiased estimator of θ, then g(θ̂) is not an
unbiased estimator of g(θ). But there is a nice method by which an unbiased estimator of
g(θ) can often be constructed; see method of moments in Section 2.4. It is also possible that
certain (functions of) parameters may not be unbiasedly estimable.
Example 2.5. Let X1 , . . . , Xn be iid Ber(θ) and suppose we want to estimate η = θ/(1 − θ),
the so-called odds ratio. Suppose η̂ is an unbiased estimator of η, so that Eθ (η̂) = η =
θ/(1 − θ) for all θ or, equivalently,
(1 − θ)Eθ (η̂) − θ = 0 for all θ.

Here the joint PMF of (X1, . . . , Xn) is $f_\theta(x_1,\dots,x_n) = \theta^{x_1+\cdots+x_n}(1-\theta)^{n-(x_1+\cdots+x_n)}$. Writing out Eθ (η̂) as a weighted average with weights given by fθ (x1, . . . , xn), we get
$$(1-\theta)\sum_{\text{all}\ (x_1,\dots,x_n)} \hat\eta(x_1,\dots,x_n)\,\theta^{x_1+\cdots+x_n}(1-\theta)^{n-(x_1+\cdots+x_n)} \;-\; \theta = 0 \quad\text{for all } \theta.$$

The quantity on the left-hand side is a polynomial in θ of degree n + 1. From the Fundamental Theorem of Algebra, a nonzero polynomial of degree n + 1 can have at most n + 1 real roots. However, unbiasedness requires that there be infinitely many roots (one for every θ), which is possible only for the zero polynomial, and the left-hand side cannot be the zero polynomial since it equals −1 at θ = 1. This contradiction forces us to conclude that there are no unbiased estimators of η.

2.3.2 Consistency
Another reasonable property is that the estimator θ̂ = θ̂n , which depends on the sample size
n through the dependence on X1 , . . . , Xn , should get close to the true θ as n gets larger and
larger. To make this precise, recall the following definition (see Definition 5.1.1 in HMC).³
Definition 2.3. Let T and {Tn : n ≥ 1} be random variables in a common sample space.
Then Tn converges to T in probability if, for any ε > 0,
$$\lim_{n\to\infty} P\{|T_n - T| > \varepsilon\} = 0.$$

The law of large numbers (LLN, Theorem 5.1.1 in HMC) is an important result on con-
vergence in probability.
Theorem 2.1 (Law of Large Numbers, LLN). If X1, . . . , Xn are iid with mean µ and finite variance σ², then $\bar X_n = n^{-1}\sum_{i=1}^n X_i$ converges in probability to µ.⁴
The LLN is a powerful result and will be used throughout the course. Two useful tools
for proving convergence in probability are the inequalities of Markov and Chebyshev. (These
are presented in HMC, Theorems 1.10.2–1.10.3, but with different notation.)

• Markov’s inequality. Let X be a positive random variable, i.e., P(X > 0) = 1. Then,
for any ε > 0, P(X > ε) ≤ ε−1 E(X).
• Chebyshev’s inequality. Let X be a random variable with mean µ and variance σ 2 .
Then, for any ε > 0, P{|X − µ| > ε} ≤ ε−2 σ 2 .
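As a small numerical sanity check of Chebyshev's inequality (a sketch only; the Gamma(2, 1) population and ε = 3 are arbitrary choices), the empirical tail probability should fall below the bound σ²/ε².

# Chebyshev: P{|X - mu| > eps} <= sigma^2 / eps^2
set.seed(411)
x <- rgamma(100000, shape = 2, rate = 1)   # Gamma(2, 1): mu = 2, sigma^2 = 2
eps <- 3
mean(abs(x - 2) > eps)   # empirical tail probability
2 / eps^2                # Chebyshev upper bound sigma^2 / eps^2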

It is through convergence in probability that we can say that an estimator θ̂ = θ̂n gets
close to the estimand θ as n gets large.
³ To make things simple, here we shall focus on the real case, with distance measured by absolute difference. When θ̂ is vector-valued, we'll need to replace the absolute difference by a normed difference. More generally, the definition of convergence in probability can handle sequences of random elements in any space equipped with a metric.

⁴ We will not need this in Stat 411, but note that the assumption of finite variance can be removed and, simultaneously, the mode of convergence can be strengthened.

Definition 2.4. An estimator θ̂n of θ is consistent if θ̂n → θ in probability.

A rough way to understand consistency of an estimator θ̂n of θ is that the sampling distribution of θ̂n gets more and more concentrated around θ as n → ∞. The following example
demonstrates both a theoretical verification of consistency and a visual confirmation via
Monte Carlo.

Example 2.6. Recall the setup of Example 2.3. It follows immediately from the LLN that
µ̂n = X̄ is a consistent estimator of the mean µ. Moreover, the sample variance σ̂n2 = S 2 is
also a consistent estimator of the variance σ². To see this, recall that
$$\hat\sigma_n^2 = \frac{n}{n-1}\Bigl\{\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2\Bigr\}.$$
The factor n/(n − 1) converges to 1; the first term in the braces converges in probability to σ² + µ² by the LLN applied to the Xi²'s; the second term in the braces converges in probability to µ² by the LLN and Theorem 5.1.4 in HMC (see, also, the Continuous Mapping Theorem below). Putting everything together, we find that σ̂n² → σ² in probability, making
it a consistent estimator. To see this property visually, suppose that the sample originates
from a Poisson distribution with mean θ = 3. We can modify the R code in Example 7
in Notes 01 to simulate the sampling distribution of θ̂n = σ̂n2 for any n. The results for
n ∈ {10, 25, 50, 100} are summarized in Figure 2.1. Notice that as n increases, the sampling
distributions become more concentrated around θ = 3.
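The original code from Example 7 in Notes 01 is not reproduced here, but a figure like Figure 2.1 can be recreated along the following lines (a hedged sketch; the 10,000 Monte Carlo replications and the seed are arbitrary choices).

# Sampling distribution of the sample variance S^2 for Pois(theta), theta = 3,
# at several sample sizes; cf. Figure 2.1.
set.seed(411)
theta <- 3; reps <- 10000
par(mfrow = c(2, 2))
for (n in c(10, 25, 50, 100)) {
  s2 <- replicate(reps, var(rpois(n, lambda = theta)))
  hist(s2, breaks = 40, freq = FALSE, xlim = c(0, 12),
       main = paste0("n = ", n), xlab = expression(hat(theta)[n]))
  abline(v = theta, lty = 2)  # histograms concentrate around theta = 3 as n grows
}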

Unbiased estimators generally are not invariant under transformations [i.e., in general,
if θ̂ is unbiased for θ, then g(θ̂) is not unbiased for g(θ)], but consistent estimators do have
such a property, a consequence of the so-called Continuous Mapping Theorem (basically
Theorem 5.1.4 in HMC).

Theorem 2.2 (Continuous Mapping Theorem). Let g be a continuous function on Θ. If θ̂n is consistent for θ, then g(θ̂n) is consistent for g(θ).

Proof. Fix a particular θ value. Since g is a continuous function on Θ, it’s continuous at this
particular θ. For any ε > 0, there exists a δ > 0 (depending on ε and θ) such that

|g(θ̂n ) − g(θ)| > ε implies |θ̂n − θ| > δ.

Then the probability of the event on the left is no more than the probability of the event on
the right, and this latter probability vanishes as n → ∞ by assumption. Therefore

$$\lim_{n\to\infty} P_\theta\{|g(\hat\theta_n) - g(\theta)| > \varepsilon\} = 0.$$
Since ε was arbitrary, the proof is complete.

[Figure 2.1: Plots of the sampling distribution of θ̂n, the sample variance, for several values of n in the Pois(θ) problem with θ = 3. Panels (a)–(d) show density histograms of θ̂n for n = 10, 25, 50, and 100.]

Example 2.7. Let X1 , . . . , Xn be iid Pois(θ). Since θ is both the mean and the variance for
the Poisson distribution, it follows that both θ̂n = X̄ and θ̃n = S 2 are unbiased and consistent
for θ by the results in Examples 2.3 and 2.6. Another comparison of these two estimators is
given in Example 2.10. Here consider a new estimator θ̇n = (X̄S²)^{1/2}. Define the function g(x1, x2) = (x1 x2)^{1/2}. Clearly g is continuous (why?). Since the pair (θ̂n, θ̃n) is a consistent
estimator of (θ, θ), it follows from the continuous mapping theorem that θ̇n = g(θ̂n , θ̃n ) is a
consistent estimator of θ = g(θ, θ).
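A brief simulation check of Example 2.7 (again my own sketch, with θ = 3 and arbitrary sample sizes): the plug-in estimator (X̄S²)^{1/2} concentrates around θ as n grows, as the continuous mapping theorem predicts.

# Consistency of theta_dot = sqrt(Xbar * S^2) in the Pois(theta) model
set.seed(411)
theta <- 3; reps <- 5000
for (n in c(10, 100, 1000)) {
  td <- replicate(reps, { x <- rpois(n, theta); sqrt(mean(x) * var(x)) })
  cat("n =", n, " mean =", round(mean(td), 3), " sd =", round(sd(td), 3), "\n")
}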
Like with unbiasedness, consistency is a nice property for an estimator to have. But
consistency alone is not enough to make an estimator a good one. Next is an exaggerated
example that makes this point clear.
Example 2.8. Let X1, . . . , Xn be iid N(θ, 1). Consider the estimator
$$\hat\theta_n = \begin{cases} 10^7 & \text{if } n < 10^{750}, \\ \bar X_n & \text{otherwise.} \end{cases}$$
Let N = 10^{750}. Although N is very large, it is ultimately finite and can have no effect on the limit. To see this, fix ε > 0 and define
$$a_n = P_\theta\{|\hat\theta_n - \theta| > \varepsilon\} \quad\text{and}\quad b_n = P_\theta\{|\bar X_n - \theta| > \varepsilon\}.$$
Since bn → 0 by the LLN, and an = bn for all n ≥ N, it follows that an → 0 and, hence, θ̂n is consistent. However, for any reasonable application, where the sample size is finite, estimating θ by the constant 10⁷ is an absurd choice.

2.3.3 Mean-square error


Measuring closeness of an estimator θ̂ to its estimand θ via consistency assumes that the
sample size n is very large, actually infinite. As a consequence, many estimators which are
“bad” for any finite n (like that in Example 2.8) can be labelled as “good” according to the
consistency criterion. An alternative measure of closeness is called the mean-square error
(MSE), and is defined as
$$\mathrm{MSE}_\theta(\hat\theta) = E_\theta\{(\hat\theta - \theta)^2\}. \tag{2.1}$$
This measures the average (squared) distance between θ̂(X1 , . . . , Xn ) and θ as the data
X1 , . . . , Xn varies according to Fθ . So if θ̂ and θ̃ are two estimators of θ, we say that θ̂ is
better than θ̃ (in the mean-square error sense) if MSEθ (θ̂) < MSEθ (θ̃).
Next are some properties of the MSE. The first relates MSE to the variance and bias of
an estimator.

Proposition 2.1. MSEθ(θ̂) = Vθ(θ̂) + bθ(θ̂)². Consequently, if θ̂ is an unbiased estimator of θ, then MSEθ(θ̂) = Vθ(θ̂).

Proof. Let θ̄ = Eθ(θ̂). Then
$$\mathrm{MSE}_\theta(\hat\theta) = E_\theta\{(\hat\theta - \theta)^2\} = E_\theta\{[(\hat\theta - \bar\theta) + (\bar\theta - \theta)]^2\}.$$
Expanding the quadratic inside the expectation gives
$$\mathrm{MSE}_\theta(\hat\theta) = E_\theta\{(\hat\theta - \bar\theta)^2\} + 2(\bar\theta - \theta)\,E_\theta\{\hat\theta - \bar\theta\} + (\bar\theta - \theta)^2.$$
The first term is the variance of θ̂; the second term is zero by definition of θ̄; and the third term is the squared bias.
Often the goal is to find estimators with small MSEs. From Proposition 2.1, this can be
achieved by picking θ̂ to have small variance and small squared bias. But it turns out that,
in general, making bias small increases the variance, and vice versa. This is what is called
the bias–variance trade-off. In some cases, if minimizing MSE is the goal, it can be better to
allow a little bit of bias if it means a drastic decrease in the variance. In fact, many common
estimators are biased, at least partly because of this trade-off.

Example 2.9. Let X1, . . . , Xn be iid N(µ, σ²) and suppose the goal is to estimate σ². Define the statistic $T = \sum_{i=1}^n (X_i - \bar X)^2$. Consider a class of estimators σ̂² = aT, where a is a positive number. Reasonable choices of a include a = (n − 1)⁻¹ and a = n⁻¹. Let's find the value of a that minimizes the MSE.
First observe that (1/σ²)T is a chi-square random variable with degrees of freedom n − 1; see Theorem 3.6.1 in the text. It can then be shown, using Theorem 3.3.1 of the text, that Eσ²(T) = (n − 1)σ² and Vσ²(T) = 2(n − 1)σ⁴. Write R(a) for MSEσ²(aT). Using Proposition 2.1 we get
$$
\begin{aligned}
R(a) = E_{\sigma^2}\{(aT - \sigma^2)^2\} &= V_{\sigma^2}(aT) + b_{\sigma^2}(aT)^2 \\
&= 2a^2(n-1)\sigma^4 + [a(n-1)\sigma^2 - \sigma^2]^2 \\
&= \sigma^4\bigl\{2a^2(n-1) + [a(n-1) - 1]^2\bigr\}.
\end{aligned}
$$
To minimize R(a), set the derivative equal to zero and solve for a. That is,
$$0 \overset{\text{set}}{=} R'(a) = \sigma^4\bigl\{4(n-1)a + 2(n-1)^2a - 2(n-1)\bigr\}.$$

From here it’s easy to see that a = (n+1)−1 is the only solution (and this must
Pbe a minimum
n
since R(a) is a quadratic). Therefore, among estimators of the form σ̂ = a i=1 (Xi − X̄)2 ,
2
n
the one with smallest MSE is σ̂ 2 = (n + 1)−1 i=1 (Xi − X̄)2 . Note that this estimator is
P
not unbiased since a 6= (n − 1)−1 . To put this another way, the classical estimator S 2 pays
a price (larger MSE) for being unbiased.
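The conclusion of Example 2.9 can be confirmed by simulation. Below is a sketch (n = 10, µ = 0, σ² = 1, and the replication count are arbitrary choices) comparing the Monte Carlo MSE of aT for the three natural choices of a; the a = (n + 1)⁻¹ version should come out smallest.

# Monte Carlo MSE of sigma2_hat = a * T, with T = sum((X_i - Xbar)^2)
set.seed(411)
n <- 10; sigma2 <- 1; reps <- 100000
Tstat <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  sum((x - mean(x))^2)
})
a <- c("1/(n-1)" = 1/(n - 1), "1/n" = 1/n, "1/(n+1)" = 1/(n + 1))
sapply(a, function(aa) mean((aa * Tstat - sigma2)^2))  # smallest MSE at a = 1/(n+1)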

Proposition 2.2 below helps to justify the approach of choosing θ̂ to make the MSE small.
Indeed, if the choice is made so that the MSE vanishes as n → ∞, then the estimator turns
out to be consistent.

Proposition 2.2. If MSEθ (θ̂n ) → 0 as n → ∞, then θ̂n is a consistent estimator of θ.

Proof. Fix ε > 0 and note that Pθ{|θ̂n − θ| > ε} = Pθ{(θ̂n − θ)² > ε²}. Applying Markov's inequality to the latter term gives an upper bound of ε⁻²MSEθ(θ̂n). Since this goes to zero
by assumption, θ̂n is consistent.
The next example compares two unbiased and consistent estimators based on their re-
spective MSEs. The conclusion actually gives a preview of some of the important results to
be discussed later in Stat 411.

Example 2.10. Suppose X1, . . . , Xn are iid Pois(θ). We've looked at two estimators of θ in this context, namely, θ̂1 = X̄ and θ̂2 = S². Both of these are unbiased and consistent. To decide which we like better, suppose we prefer the one with the smallest variance.⁵ The variance of θ̂1 is an easy calculation: Vθ(θ̂1) = Vθ(X̄) = Vθ(X1)/n = θ/n. But the variance of θ̂2 is trickier, so we'll resort to an approximation, which relies on the following general fact.⁶
⁵ In Chapter 1, we looked at this same problem in an example on the Monte Carlo method.

⁶ A. DasGupta, Asymptotic Theory of Statistics and Probability, Springer, 2008, Theorem 3.8.

Let X1 , . . . , Xn be iid with mean µ. Define the sequence of population and sample
central moments:
$$\mu_k = E(X_1 - \mu)^k \quad\text{and}\quad M_k = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^k, \qquad k \ge 1.$$
Then, for large n, the following approximations hold:
$$E(M_k) \approx \mu_k, \qquad V(M_k) \approx n^{-1}\bigl\{\mu_{2k} - \mu_k^2 - 2k\,\mu_{k-1}\mu_{k+1} + k^2\mu_2\mu_{k-1}^2\bigr\}.$$

In the Poisson case, µ1 = 0, µ2 = θ, µ3 = θ, and µ4 = θ + 3θ²; these can be verified directly by using the Poisson moment-generating function. Plugging these values into the above approximation (with k = 2) gives Vθ(θ̂2) ≈ (θ + 2θ²)/n. This is more than Vθ(θ̂1) = θ/n, so we conclude that θ̂1 is better than θ̂2 (in the mean-square error sense). In fact, it can be shown (via the Lehmann–Scheffé theorem in Chapter 7 in HMC) that, among all unbiased estimators, θ̂1 is the best in the mean-square error sense.
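A Monte Carlo comparison of the two estimators in Example 2.10 (a sketch; θ = 3, n = 100, and the replication count are arbitrary) shows the sample mean with the smaller variance, and the large-n formula (θ + 2θ²)/n tracking the variance of S² reasonably well.

# Compare Var(Xbar) and Var(S^2) as estimators of theta in the Pois(theta) model
set.seed(411)
theta <- 3; n <- 100; reps <- 50000
est <- replicate(reps, { x <- rpois(n, theta); c(xbar = mean(x), s2 = var(x)) })
apply(est, 1, var)                                        # Monte Carlo variances
c(exact = theta / n, approx = (theta + 2 * theta^2) / n)  # theory: theta/n vs (theta + 2*theta^2)/n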

Exercise 2.3. (a) Verify the expressions for µ1, µ2, µ3, and µ4 in Example 2.10. (b) Look back to Example 7 in Notes 01 and compare the Monte Carlo approximation of V(θ̂2) to the large-n approximation V(θ̂2) ≈ (θ + 2θ²)/n used above. Recall that, in the Monte Carlo study, θ = 3 and n = 10. Do you think n = 10 is large enough to safely use a large-n approximation?

It turns out that, in a typical problem, there is no estimator which can minimize the MSE uniformly over all θ. If there were such an estimator, then it would clearly be the best. To see that such an ideal cannot be achieved, consider the silly estimator θ̂ ≡ 7. Clearly MSE₇(θ̂) = 0 and no other estimator can beat that at θ = 7; of course, there's nothing special about 7.
However, if we restrict ourselves to the class of estimators which are unbiased, then there is a
lower bound on the variance of such estimators and theory is available for finding estimators
that achieve this lower bound.

2.4 Where do estimators come from?


In the previous sections we’ve simply discussed properties of estimators—nothing so far has
been said about the origin of these estimators. In some cases, a reasonable choice is obvious,
like estimating a population mean by a sample mean. But there are situations where this
choice is not so obvious. There are some general methods for constructing estimators. Here
I simply list the various methods with a few comments.

• Perhaps the simplest method of constructing estimators is the method of moments. This approach is driven by the unbiasedness property. The idea is to start with some statistic T and calculate its expectation h(θ) = Eθ(T); now set T = h(θ) and use the solution θ̂ as an estimator. For example, if X1, . . . , Xn are iid N(θ, 1) and the goal is to estimate θ², a reasonable starting point is T = X̄². Since Eθ(T) = θ² + 1/n, an unbiased estimator of θ² is X̄² − 1/n; a quick numerical check of this example is sketched after this list.

• Perhaps the most common way to construct estimators is via the method of maximum likelihood. We will spend a considerable amount of time discussing this approach. There are other related approaches, such as M-estimation and least-squares estimation,⁷ which we will not discuss here.

• As alluded to above, one cannot, for example, find an estimator θ̂ that minimizes the MSE uniformly over all θ. But by restricting the class of estimators to those which are unbiased, a uniformly best estimator often exists. Such an estimator is called the uniformly minimum variance unbiased estimator (UMVUE), and we will spend a lot of time talking about this approach.

• Minimax estimation takes a measure of closeness of an estimator θ̂ to θ, such as MSEθ(θ̂), but rather than trying to minimize the MSE pointwise over all θ, as in the
previous point, one first maximizes over θ to give a pessimistic “worst case” measure
of the performance of θ̂. Then one tries to find the θ̂ that MINImizes the MAXimum
MSE. This approach is interesting, and relates to game theory and economics, but is
somewhat out of style in the statistics community.

• Bayes estimation is an altogether different approach. We will discuss the basics of Bayesian inference, including estimation, in Chapter 6.
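As promised in the method-of-moments bullet above, here is a quick numerical check (an illustrative sketch; θ = 2, n = 20, and the replication count are arbitrary) that X̄² − 1/n is approximately unbiased for θ² in the N(θ, 1) model, while the uncorrected X̄² overshoots by about 1/n.

# Method-of-moments correction: Xbar^2 - 1/n is unbiased for theta^2 when X_i ~ N(theta, 1)
set.seed(411)
theta <- 2; n <- 20; reps <- 100000
xbar <- replicate(reps, mean(rnorm(n, mean = theta, sd = 1)))
c(naive = mean(xbar^2),              # approx. theta^2 + 1/n
  corrected = mean(xbar^2 - 1/n),    # approx. theta^2
  target = theta^2)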

⁷ Students may have heard of the least-squares approach in other courses, such as applied statistics courses or linear algebra/numerical analysis.

