
Classical Estimation Topics

Namrata Vaswani, Iowa State University

February 25, 2014

This note fills in the gaps in the notes already provided (l0.pdf, l1.pdf, l2.pdf, l3.pdf,
LeastSquares.pdf).

1 Min classical Mean Squared Error (MSE) and Minimum Variance Unbiased Estimation (MVUE)

1. First, assume a scalar unknown parameter θ

(a) Min classical MSE estimator:

θ̂(X) = arg min_{θ̂} E_X[(θ − θ̂(X))²]

(b) Often the resulting estimator is not realizable, i.e. it depends on θ.


(c) MVUE estimator:

θ̂_MVUE(X) = arg min_{θ̂: E[θ̂(X)]=θ} E_X[(θ − θ̂(X))²] = arg min_{θ̂: E[θ̂(X)]=θ} var[θ̂(X)]

2. Vector parameter θ

(a) MVUE: θ̂_MVUE,i(X) = arg min_{θ̂i: E[θ̂i(X)]=θi} var[θ̂i(X)]

3. Sufficient statistic (ss)

(a) A stat Z := T(X) is a ss for θ if p_{X|Z}(x|T(x); θ) = p_{X|Z}(x|T(x)), i.e. the conditional
distribution does not depend on θ.

4. Minimal ss

(a) A stat T (X) is a minimal ss if it is a ss and it is a function of every other ss.

5. Complete ss:

(a) T(X) is a complete ss for θ iff it is a ss and E[v(T(X))] = 0 for all θ (expectation taken
w.r.t. the pdf/pmf p(x; θ)) implies that Pr_θ(v(T(X)) = 0) = 1 for all θ (if we can show
that v(t) = 0 for all t, this is satisfied).
(b) T (X) is a complete ss for θ iff it is a ss and there is at most one function g(t)
such that g(T (X)) is an unbiased estimate of θ, i.e. E[g(T (X))] = θ.

6. Factorization Theorem (Neyman) to find a ss

(a) A stat T (X) is a ss for θ iff the pmf/pdf pX (x; θ) can be factorized as

pX (x; θ) = g(T (x), θ)h(x)

for all x and for all θ ∈ Θ (Θ: parameter space).


(b) Proof: the proof for the discrete rv case is easy and illustrates the main point. Idea:
just use the definition of a ss.

7. Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem:

(a) RB theorem: given a ss, and some unbiased estimator for θ, find another unbiased
estimator with equal or lower variance. Statement: Let θ̌(x) be an unbiased
estimator for θ. Let T (X) be a ss for θ. Let Z := T (X). Define a function

θ̂(z) := EX|Z [θ̌(X)|z].

Then
i. θ̂(T (x)) is a realizable estimator
ii. θ̂(T (X)) is unbiased, i.e. EX [θ̂(T (X))] = θ
iii. var[θ̂(T (X))] ≤ var[θ̌(X)]
(b) Proof:
i. follows from the definition of a ss
ii. follows by using iterated expectation: EX [θ̂(T (X))] = EX [EX|Z [θ̌(X)|T (X)]] =
EX [θ̌(X)] = θ
iii. follows by using conditional variance identity.
(c) LS theorem: if T (X) is a complete ss for θ and if there is a function g(t) s.t.
E[g(T (X))] = θ, then g(T (X)) is the MVUE for θ.
(d) LS theorem (equivalent statement): if T (X) is a complete ss for θ, then θ̂(T (X))
defined above is MVUE for θ and in fact g(T (X)) = θ̂(T (X))
(e) Proof: follows from the fact that θ̂(T (X)) := EX|T (X) [θ̌(X)|T (X)] is a function of
T (X)

(f) Thus, the LS theorem implies that if I can find a function g(t) of a complete ss T(X)
that is an unbiased estimate of θ, then g(T(X)) is the MVUE. Equivalently, if I take any
unbiased estimator and compute its conditional expectation conditioned on the complete
ss, then I also get the MVUE (a small numerical illustration of this step follows).
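For a concrete feel of the Rao-Blackwell step, here is a minimal Monte Carlo sketch (my own
example, not from the notes): for Xi iid Poisson(θ), start with the crude unbiased estimator
θ̌(X) = X1 and condition on the ss T(X) = Σi Xi. By symmetry, E[X1 | T = t] = t/N, so the
Rao-Blackwellized estimator is the sample mean, whose variance is visibly smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N, trials = 3.0, 10, 50_000

X = rng.poisson(theta, size=(trials, N))
crude = X[:, 0]              # theta_check(X) = X_1: unbiased but high variance
rb = X.sum(axis=1) / N       # E[X_1 | sum_i X_i] = (sum_i X_i)/N: the RB estimator

print("crude: mean %.3f, var %.3f" % (crude.mean(), crude.var()))   # ~theta, ~theta
print("RB   : mean %.3f, var %.3f" % (rb.mean(), rb.var()))         # ~theta, ~theta/N
```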

8. Completeness Theorem for Exponential Families: see later

9. Examples

(a) Example proving completeness of a ss: Kay’s book


(b) MVUE computation

2 Information Inequality and Cramer Rao Lower Bound


1. l2.pdf is quite complete.

2. Poor’s book also talks about the scalar case CRLB.


3. score function: derivative of the log likelihood w.r.t. θ, score = ∂ log p(X; θ)/∂θ

4. Under “regularity”,

(a) expected value of score function is zero


(b) Fisher Information Matrix (or Number for a scalar) is defined as E[score scoreT ]
(c) under more “regularity” FIM is the negative expected value of the derivative of
score (Hessian of log likelihood w.r.t. θ)

5. Info inequality and CRLB (scalar case): Assume “regularity”.

(a) Consider a pdf/pmf family p(x; θ). Assume that 0 < I(θ) < ∞. The variance of
any statistic T (X) with finite variance and with E[T (X)] = ψ(θ) is lower bounded
as follows
var[T(X)] ≥ |ψ′(θ)|² / I(θ)
with equality occurring if and only if score is an affine function of T (X), i.e.

score = k(θ)(T (X) − b(θ))

(b) Under “regularity”, this is achieved if and only if p(x; θ) is a one parameter ex-
ponential family.
(c) Proof idea:
i. Write out expression for ψ ′ (θ) and re-arrange it as E[T (X) score].

ii. Use Cauchy-Schwarz and the fact that the score function is zero mean
(d) details: see page 6 of l2.pdf or see Kay’s book (Appendix of Chap 3) or see Poor’s
book

6. Info inequality and CRLB vector case: Assume “regularity”.

(a) Consider a pdf/pmf family p(x; θ). Assume that the FIM I(θ) is non-singular.
Consider any vector statistic T (X) with finite variance in all directions and with
E[T (X)] = ψ(θ). Then,

cov[T (X)] ≽ ψ ′ (θ)I(θ)−1 ψ ′ (θ)T

with equality occurring if and only if score is an affine function of T (X), i.e.

score = K(θ)(T (X) − b(θ))

(for matrices M1 ≽ M2 means a′ (M1 − M2 )a ≥ 0 for any vector a).


(b) CRLB: Special case where ψ(θ) = θ. In this case, cov[T (X)] ≽ I(θ)−1 with
equality if and only if
score = I(θ)(T (X) − θ)

(c) Here ψ′(θ) := ∂ψ(θ)/∂θᵀ, i.e. (ψ′(θ))_{i,j} = ∂ψi(θ)/∂θj. From this notation, notice that
∂ψ(θ)/∂θᵀ = (∂ψ(θ)ᵀ/∂θ)ᵀ.
(d) Proof idea:
i. Write out expression for ψ ′ (θ) and re-arrange it as ψ ′ (θ) = E[T (X) scoreT ]
ii. E[T(X) (∂ log p(X; θ)/∂θ)ᵀ] is now a matrix
iii. Apply C-S to a′ ψ ′ (θ)b = a′ E[T (X) scoreT ]b and use the fact that the score
function is zero mean to get (a′ ψ ′ (θ)b)2 ≤ var(a′ T (X))var(scoreT b)
iv. Notice that var(a′ T (X)) = a′ cov(T (X))a and var(scoreT b) = var(bT score) =
bT I(θ)b
v. Set b = I(θ)−1 ψ ′ (θ)T a to get the final result.
(e) Details: see (Appendix of Chap 3 of Kay’s book)

7. We say an estimator is efficient if it is unbiased and its variance is equal to the Cramer
Rao lower bound.

8. If more parameters are unknown, the CRLB is larger (or equal if the FIM is diagonal).
Consider a pdf/pmf with 2 parameters. First suppose that only one parameter is
unknown and suppose its CRB is c. Now for the same pdf/pmf if both the parameters
are unknown, the CRB will be greater than or equal to c. It is equal to c only if the FIM
for the 2-parameter case is diagonal. Same concept extends to multiple parameters.

(a) Denote the FIM for the 2-parameter case by I(θ). Recall that [I11(θ)]⁻¹ is the
CRLB for θ1 when θ2 is known. When both are unknown, [I(θ)⁻¹]11 is the CRLB.
We claim that
[I(θ)⁻¹]11 ≥ [I11(θ)]⁻¹
with equality if and only if I(θ) is a diagonal matrix.
(b) This, in turn, follows by using Cauchy-Schwarz for vectors on I(θ)^{1/2} e1 and I(θ)^{−1/2} e1.
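A tiny numerical check of this claim with a made-up non-diagonal 2 × 2 FIM:

```python
import numpy as np

I = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # hypothetical non-diagonal FIM
print(np.linalg.inv(I)[0, 0])   # CRLB for theta_1 when both are unknown: 0.6
print(1.0 / I[0, 0])            # CRLB for theta_1 when theta_2 is known : 0.5
```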

9. Gaussian CRB: see l2.pdf, Theorem 5, page 32.

10. Examples
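One standard scalar example (a sketch with my own numbers, not necessarily the example
intended here): for Xi iid N(A, σ²) with σ² known, I(A) = N/σ², so the CRLB is σ²/N; the
sample mean is unbiased and attains it, hence it is efficient.

```python
import numpy as np

rng = np.random.default_rng(1)
A, sigma2, N, trials = 2.0, 4.0, 25, 100_000

X = rng.normal(A, np.sqrt(sigma2), size=(trials, N))
A_hat = X.mean(axis=1)                       # sample mean, one estimate per trial

print("empirical var of sample mean:", A_hat.var())
print("CRLB sigma^2 / N            :", sigma2 / N)   # the two should agree closely
```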

3 Exponential Family
1. Multi-parameter expo family:

p(x; θ) = h(x)C(θ) exp[Σ_{i=1}^k ηi(θ)Ti(x)]

(a) single-parameter expo family: special case where k = 1.


(b) Examples: Gaussian, Poisson, Laplacian, binomial, geometric

2. By the factorization theorem, it is easy to see that T(X) = [T1(X), . . . , Tk(X)]′ is a ss for
the vector parameter θ.

3. Completeness Theorem: if the parameter space for the parameters ηi (θ)’s, contains a
k-dimensional hyper-rectangle, then T (X) is a complete ss for ηi (θ)’s.

(a) See Proposition IV.C.3 of Poor's book (which is stated after first re-parameterizing
p(x; θ) as p(x; ϕ) = h(x)C(ϕ) exp(Σ_{i=1}^k ϕi Ti(x)) to make things easier).

4. Example IV.C.3 of Poor’s book

5. Easy to see that the support of p(x; θ), i.e. the set {x : p(x; θ) > 0}, does not depend
on θ.

6. One parameter expo family and EE/MVUE: Under “regularity” (the partial derivative
w.r.t. θ can be moved inside or outside the integral sign when computing the expec-
tation of any statistic, and if E[|T1 (X)|] < ∞), T1 (X) is the efficient estimator (EE),
and hence MVUE, for its expected value.

(a) In fact, under regularity, an estimator T(X) achieves the CRLB for its expected value
Eθ[T(X)] if and only if p(x; θ) = h(x)C(θ) exp(η(θ)T(x)) for some function η(θ), i.e.
iff the family is a one-parameter expo family with T as its statistic.

(b) See Example IV.C.4 of Poor for a proof.
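As a quick sanity check of the equality condition (my own example): for Xi iid Poisson(θ),
log p(X; θ) = Σi (Xi log θ − θ − log Xi!), so score = (1/θ)(T(X) − Nθ) with T(X) = Σi Xi,
i.e. the score is affine in T(X); hence T(X) attains the CRLB for its mean Nθ.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, N, trials = 3.0, 20, 100_000

X = rng.poisson(theta, size=(trials, N))
T = X.sum(axis=1)                # natural sufficient statistic
score = T / theta - N            # (1/theta)(T - N*theta)

I_theta = N / theta              # Fisher information for theta
print("E[score]      ~", score.mean())        # ~ 0
print("var(T)        ~", T.var())             # ~ N*theta
print("CRLB for E[T]  =", N**2 / I_theta)     # psi'(theta)^2 / I(theta) = N*theta
```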

7. Multi-parameter expo family: The vector T (X) is an EE and hence MVUE for its
expected value.

(a) Proof:
i. Rewrite the expo family distribution as p(x; θ) = h(x)C(η(θ)) exp[Σ_{i=1}^k ηi(θ)Ti(x)].
This is always possible to do because C(θ) is given by

1/C(θ) = ∫ h(x) exp[Σ_{i=1}^k ηi(θ)Ti(x)] dx

and hence it is actually a function of η(θ).


ii. Thus the expo family distribution can always be re-parameterized in terms of η as

p(x; η) = h(x) exp{[Σ_{i=1}^k ηi Ti(x)] − A(η)}, A(η) := − log C(η)

iii. With this, clearly,

score = T(X) − ∂A(η)/∂η

iv. Since E[score] = 0, it follows that E[T(X)] = ∂A(η)/∂η
v. Thus, cov(T(X)) = E[score scoreᵀ] = I(η)
vi. Also, ψ(η) = E[T(X)] = ∂A(η)/∂η implies that ψ′(η) = ∂²A(η)/∂η∂ηᵀ
vii. But I(η) also satisfies I(η) = E[−∂score/∂ηᵀ] = ∂²A(η)/∂η∂ηᵀ = ψ′(η)
viii. Thus, ψ′(η)I(η)⁻¹ψ′(η)ᵀ = I(η)I(η)⁻¹I(η) = I(η) = cov(T(X))

ix. Thus, T (X) is EE and hence MVUE of its expected value.


(b) Example on page 30 of l2.pdf of applying this theorem.
(c) But to my knowledge there is no “if and only if” result, i.e. one cannot say that
an estimator achieves CRLB for its expected value only for multi-parameter expo
families. ??check

8. Some more properties and the FIM expression for single parameter expo families are in
l2.pdf, pages 13-17.

4 Linear Models
1. Linear model means the data X satisfies

X = Hθ + W

where X is an N × 1 data vector, θ is a p × 1 vector of unknown parameters and W is
the zero mean noise, i.e. E[W ] = 0.

2. The above model is identifiable iff H has rank p.

3. If W is Gaussian noise, then the MVUE exists. In fact the MVUE is also the efficient
estimator (EE).

4. If W ∼ N (0, C) then,

θ̂_MVUE(X) = (H′C⁻¹H)⁻¹H′C⁻¹X

Proof:

(a) Show unbiasedness


(b) Compute cov[θ̂_MVUE(X)] and show that it is equal to the CRLB (a numerical sketch follows).
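A minimal numerical sketch of this estimator (with made-up H and C): it is unbiased, and its
Monte Carlo covariance matches (H′C⁻¹H)⁻¹, which is the CRLB for this model.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, trials = 50, 2, 20_000

H = rng.standard_normal((N, p))                  # known, rank p
C = 0.5 * np.eye(N) + 0.1 * np.ones((N, N))      # known noise covariance (positive definite)
theta = np.array([1.0, -2.0])

L = np.linalg.cholesky(C)
W = rng.standard_normal((trials, N)) @ L.T       # W ~ N(0, C), one realization per row
X = theta @ H.T + W                              # X = H theta + W, row-wise

Ci_H = np.linalg.solve(C, H)                     # C^{-1} H
A = np.linalg.solve(H.T @ Ci_H, Ci_H.T)          # (H' C^{-1} H)^{-1} H' C^{-1}
theta_hat = X @ A.T                              # MVUE, one estimate per row

print("mean estimate:", theta_hat.mean(axis=0))              # ~ theta (unbiased)
print("empirical cov:\n", np.cov(theta_hat.T))
print("CRLB (H'C^-1 H)^-1:\n", np.linalg.inv(H.T @ Ci_H))    # should match the above
```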

5 Best Linear Unbiased Estimation (BLUE)


1. For any given pdf/pmf p(X; θ), find the “best” estimator among the class of linear and
unbiased estimators. Here “best” means minimum variance.

2. Scalar parameter θ: θ̂_BLUE(X) = a_B′ X where a_B is a vector obtained by solving

a_B = arg min_{a: a′E[X]=θ} a′ cov(X) a

(recall that the expectation of any linear estimator, a′X, is a′E[X] and its variance is
a′ cov(X) a).

3. Vector parameter θ: θ̂_BLUE(X) = A_B′ X where A_B is an n × p matrix obtained by solving

(A_B)_i = arg min_{A: A_i′E[X]=θi} A_i′ cov(X) A_i

here A_i refers to the ith column of the matrix A.

4. To prove that a given matrix A_B is the minimizer, the typical approach is as follows. Try
to show that:

(a) A_B′ E[X] = θ


(b) For all matrices A satisfying A′E[X] = θ, the following holds:

A′ cov(X) A − A_B′ cov(X) A_B ≽ 0

(here M ≽ 0 means that the matrix M is positive semi-definite, i.e. it satisfies
z′Mz ≥ 0 for any vector z).

(c) By letting z = ei where ei is a vector with 1 at the ith coordinate and zero
everywhere else, we can see that the above implies that AB indeed is the BLUE.

5. Example of finding a BLUE: l2.pdf
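A separate minimal sketch (my own example, not the one in l2.pdf): Xi = θ + Wi with zero-mean
correlated noise of known covariance C, so E[X] = θ·1 and the unbiasedness constraint becomes
a′1 = 1; minimizing a′Ca subject to a′1 = 1 gives the BLUE weights a_B = C⁻¹1/(1′C⁻¹1).

```python
import numpy as np

rng = np.random.default_rng(4)
theta, N, trials = 5.0, 4, 100_000

# Only the noise covariance (second-order statistics) is used by the BLUE.
C = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.5, 2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 0.0],
              [0.0, 0.0, 0.0, 4.0]])
ones = np.ones(N)
Ci_1 = np.linalg.solve(C, ones)
a_B = Ci_1 / (ones @ Ci_1)                  # argmin a'Ca subject to a'1 = 1

L = np.linalg.cholesky(C)
X = theta + rng.standard_normal((trials, N)) @ L.T

blue = X @ a_B
avg = X.mean(axis=1)                        # plain average: also linear and unbiased
print("BLUE   : mean %.3f, var %.4f" % (blue.mean(), blue.var()))
print("average: mean %.3f, var %.4f" % (avg.mean(), avg.var()))
print("predicted BLUE variance:", 1.0 / (ones @ Ci_1))
```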

6. Example situation where one cannot find even one linear estimator that is unbiased, and
so the BLUE does not exist: l2.pdf

(a) In the above case, if we transform the data in some way, it may be possible to find a
BLUE.

6 Maximum Likelihood Estimation


1. Define MLE.

2. We assume identifiability: p(x; θ1) = p(x; θ2) for all x if and only if θ1 = θ2

• Example: in case of a linear model of the form X = Hθ + W, W ∼ N (0, σ 2 I),


so that p(x; θ) = N (x; Hθ, σ 2 I), θ is identifiable if and only if H has full column
rank. This, in turn, means that H has to be a full rank square or tall matrix.

3. Given X1, . . . , XN iid with pdf or pmf p(x; θ), i.e. for the discrete case, Pr(Xi = x) =
p(x; θ); for the continuous case, Pr(Xi ∈ [x, x + dx]) ≈ p(x; θ)dx (or more precisely,
Pr(Xi ≤ x) = ∫_{t=−∞}^{x} p(t; θ) dt).
4. Then define θ̂N(X) := arg max_θ ∏_{i=1}^N p(Xi; θ).

5. Consistency and Asymptotic Normality of MLE: If X1 , X2 , . . . Xn iid p(x; θ), then under
certain “regularity conditions”, θ̂N (X) is consistent and asymptotically normal, i.e.

for any ϵ > 0, lim_{N→∞} Pr(|θ̂N(X) − θ| > ϵ) = 0, and

√N(θ̂N(X) − θ) → Z ∼ N(0, i1(θ)⁻¹), in distribution as N → ∞,

where

i1(θ) := E[(∂/∂θ log p(X1; θ))²] = E[−∂²/∂θ² log p(X1; θ)]

is the Fisher information number for X1.
Proof approach:

(a) Show consistency of θ̂N


i. See Poor’s book for a correct proof. See Appendix of Kay’s book for this
rough idea:

ii. Jensen's inequality (or non-negativity of the Kullback-Leibler divergence) tells us
that ∫ p(x; θ) log p(x; θ)dx ≥ ∫ p(x; θ) log p(x; θ̃)dx for every θ̃, or in other words,
arg max_{θ̃} ∫ p(x; θ) log p(x; θ̃)dx = θ.
iii. The MLE satisfies θ̂N(X) = arg max_{θ̃} (1/N) Σi log p(Xi; θ̃).
iv. By the WLLN, (1/N) Σi log p(Xi; θ̃) converges in probability to its expected value,
∫ p(x; θ) log p(x; θ̃)dx.
v. By using an appropriate “continuity argument”, its maximizer, θ̂N , also con-
verges in probability to the maximizer of the RHS, which is θ.
(b) (1/N) Σi ∂/∂θ log p(Xi; θ̂N) = 0 by definition of the MLE (first derivative equal to zero
at the maximizer): for differentiable functions with maximizer inside an open region.
(c) Using the above fact and the Mean Value Theorem on (1/N) Σi ∂/∂θ log p(Xi; θ̂N), we can
rewrite

√N(θ̂N − θ) = √N RN(θ̂N) / TN(θ̃N)
for some θ̃N lying in between θ̂N and θ
(d) Use consistency of θ̂N to show that TN (θ̃N ) converges to TN (θ) in probability and
hence in distribution.
(e) Use WLLN to show that TN (θ) converges in probability to i1 (θ).

(f) Use the Central Limit Theorem on √N RN(θ̂N) to show that it converges in distribution
to Z ∼ N(0, i1(θ)).
(g) Use Slutsky’s theorem to get the final result
(h) For details: see class notes or see Appendix of Chapter 7 of Kay’s book or see
Poor’s book.
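A quick Monte Carlo illustration of the asymptotic normality statement (my own example): for
Xi iid exponential with rate θ, p(x; θ) = θ e^{−θx}, the MLE is 1/X̄ and i1(θ) = 1/θ², so
√N(θ̂N − θ) should look approximately like N(0, θ²) for large N.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, N, trials = 2.0, 500, 10_000

X = rng.exponential(scale=1.0 / theta, size=(trials, N))   # rate theta <=> scale 1/theta
mle = 1.0 / X.mean(axis=1)                                  # theta_hat_N = 1 / sample mean
Z = np.sqrt(N) * (mle - theta)

print("empirical mean of sqrt(N)(mle - theta):", Z.mean())  # ~ 0
print("empirical var                         :", Z.var())   # ~ theta^2
print("i_1(theta)^{-1} = theta^2             :", theta**2)
```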

6. Define Asymptotic Efficiency: two ways that different books define it.

(a) First definition (used in Lehmann's book and mentioned in Poor's book): √N(θ̂N − θ)
converges to a random variable Z in distribution, and E[Z] = 0 and Var[Z] = i1(θ)⁻¹.
The asymptotic normality proof directly implies this.
(b) Second definition: θ̂N is asymptotically unbiased, i.e. lim_{N→∞} E[θ̂N(X)] = θ, and
its asymptotic variance is equal to the CRLB, i.e. lim_{N→∞} Var[√N θ̂N(X)] = i1(θ)⁻¹.
i. Under more regularity conditions, I believe that proofs of the above statement
do exist.
ii. Note though: I did not prove the above in class. Notice: neither convergence in
probability nor convergence in distribution implies convergence of the moments,
i.e. neither implies that lim_{N→∞} E[θ̂N(X)] = θ or that the variance converges.

7. Need for MLE: consider the example Xn ∼ N (A, A), iid. In this case, we showed that
we do not know how to compute either an efficient estimator (EE) or a minimum vari-
ance unbiased estimator (MVUE). But MLE is always computable, either analytically
or numerically. In this case, it is analytically computable.
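A sketch for this example (assuming the standard closed form obtained by setting the score to
zero, Â = −1/2 + sqrt(1/4 + (1/N) Σn Xn²)), checked against a brute-force maximization of the
log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(6)
A_true, N = 3.0, 200
X = rng.normal(A_true, np.sqrt(A_true), size=N)      # X_n ~ N(A, A), iid

# Closed-form MLE: setting the score to zero gives A^2 + A - mean(X^2) = 0 (positive root).
A_closed = -0.5 + np.sqrt(0.25 + np.mean(X**2))

# Brute-force check: maximize the log-likelihood over a grid of A > 0.
def loglik(A):
    return -0.5 * N * np.log(2 * np.pi * A) - np.sum((X - A) ** 2) / (2 * A)

grid = np.linspace(0.1, 10.0, 20_000)
A_grid = grid[np.argmax([loglik(a) for a in grid])]

print("closed-form MLE:", A_closed)
print("grid-search MLE:", A_grid)                    # the two should agree
```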

8. If an efficient estimator (EE) exists, then MLE is equal to it (Theorem 5 of l3.pdf)

• Proof: easy. If the EE exists, score(θ) = I(θ)(θ̂EE(X) − θ). If the MLE lies inside an
open interval of the parameter space, it satisfies score(θ̂ML(X)) = 0. Plugging θ = θ̂ML
into the score expression gives 0 = I(θ̂ML)(θ̂EE(X) − θ̂ML), so θ̂EE = θ̂ML.
• The converse is not true for finite N, but is true asymptotically under certain "regu-
larity": discussed above.

9. ML Invariance Principle: Theorem ? of l3.pdf

10. Examples showing the use of ML invariance principle: amplitude and phase estimation
of a sinusoid from a sequence of noisy measurements: l3.pdf

11. Example application in digital communications: ML bit decoding: l3.pdf

12. Newton-Raphson method and its variants: l3.pdf
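A minimal Newton-Raphson sketch (my own example: a logistic location family, where the MLE
has no closed form): iterate θ ← θ − ℓ′(θ)/ℓ″(θ), where ℓ is the log-likelihood, starting from
the sample median.

```python
import numpy as np

rng = np.random.default_rng(7)
theta_true, N = 1.5, 1_000
X = rng.logistic(loc=theta_true, scale=1.0, size=N)   # no closed-form MLE for the location

def d1(theta):      # first derivative of the log-likelihood (the score)
    return np.sum(1.0 - 2.0 / (1.0 + np.exp(X - theta)))

def d2(theta):      # second derivative of the log-likelihood
    e = np.exp(X - theta)
    return -2.0 * np.sum(e / (1.0 + e) ** 2)

theta = np.median(X)                        # reasonable starting point
for _ in range(20):                         # Newton-Raphson iterations
    theta = theta - d1(theta) / d2(theta)

print("MLE via Newton-Raphson:", theta)       # close to theta_true for large N
print("score at the solution :", d1(theta))   # ~ 0
```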

7 Least Squares Estimation


1. No probability model at all. Just a linear algebra technique that finds an estimate of
θ that minimizes the squared 2-norm of the error, ∥X − f(θ)∥₂².

2. Closed form solutions exist for the linear model case, i.e. the case where X = Hθ + E
and we want to find the θ that minimizes ∥E∥22 .

3. Assume that X is an n × 1 vector and θ is a p × 1 vector. So H is an n × p matrix.

4. LS:
θ̂ = arg min_θ ∥X − Hθ∥₂², ∥X − Hθ∥₂² := (X − Hθ)′(X − Hθ)

5. Weighted LS

θ̂ = arg min_θ ∥X − Hθ∥_W², ∥X − Hθ∥_W² := (X − Hθ)′W(X − Hθ)

6. Regularized LS

θ̂ = arg min_θ ∥θ − θ0∥_R² + ∥X − Hθ∥_W²

7. Notice that both weighted LS and LS are special cases of regularized LS: R = 0 gives
weighted LS, and R = 0, W = I gives LS.

8. Recursive LS algorithm: a recursive algorithm to compute the regularized LS estimate.
Derived in LeastSquares.pdf. A sketch is given below.
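A minimal sketch of one standard RLS recursion for the regularized criterion with W = I (the
exact form and notation in LeastSquares.pdf may differ): initialize θ̂0 = θ0 and P0 = R⁻¹, then
update with each new scalar measurement xi = hi′θ + ei.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 3
theta_true = np.array([1.0, -0.5, 2.0])
H = rng.standard_normal((n, p))
x = H @ theta_true + 0.1 * rng.standard_normal(n)

theta0 = np.zeros(p)
R = 0.01 * np.eye(p)                       # regularizer in ||theta - theta0||_R^2

# Recursive LS (W = I): theta_hat_0 = theta0, P_0 = R^{-1}
theta_hat = theta0.copy()
P = np.linalg.inv(R)
for i in range(n):
    h = H[i]
    K = P @ h / (1.0 + h @ P @ h)                       # gain vector
    theta_hat = theta_hat + K * (x[i] - h @ theta_hat)  # correct with the new residual
    P = P - np.outer(K, h) @ P                          # rank-one update of P

# Batch regularized LS for comparison: argmin ||theta - theta0||_R^2 + ||x - H theta||_2^2
theta_batch = np.linalg.solve(R + H.T @ H, R @ theta0 + H.T @ x)
print("recursive:", theta_hat)
print("batch    :", theta_batch)           # the two should match to numerical precision
```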

9. Consider basic LS. Recall that H is an n × p matrix.

θ̂ = arg min_θ ∥X − Hθ∥₂²

Two cases: rank(H) = p and rank(H) < p

(a) If rank(H) = p, the minimizer is unique and given by

θ̂ = (H′H)⁻¹H′X

(b) If rank(H) < p, there are infinitely many solutions. rank(H) < p can happen in
two ways:
i. If n < p (fat matrix), then definitely rank(H) < p.
ii. Even when n ≥ p (square or tall matrix), it could be that the columns of H
are linearly dependent, e.g. suppose p = 3 and H = [H1, H2, (H1 + H2)]; then
rank(H) ≤ 2 < p.
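A short sketch of the two cases (hypothetical H): with full column rank, (H′H)⁻¹H′X is the
unique minimizer; when rank(H) < p, np.linalg.lstsq still returns a minimizer (the minimum-norm
one), but adding any null-space direction of H gives another solution with the same residual.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 20
X = rng.standard_normal(n)

# Case 1: rank(H) = p, unique solution (H'H)^{-1} H'X
H1 = rng.standard_normal((n, 3))
theta1 = np.linalg.solve(H1.T @ H1, H1.T @ X)
print("unique LS solution:", theta1)
print("lstsq agrees      :", np.linalg.lstsq(H1, X, rcond=None)[0])

# Case 2: rank(H) < p, e.g. third column = sum of the first two; infinitely many solutions
H2 = np.column_stack([H1[:, 0], H1[:, 1], H1[:, 0] + H1[:, 1]])
theta_mn = np.linalg.lstsq(H2, X, rcond=None)[0]     # minimum-norm minimizer
theta_alt = theta_mn + np.array([1.0, 1.0, -1.0])    # add a null-space direction of H2
print("same residual norm?",
      np.allclose(np.linalg.norm(X - H2 @ theta_mn), np.linalg.norm(X - H2 @ theta_alt)))
```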

10. Nonlinear LS: θ̂ = arg min_θ ∥X − f(θ)∥₂².

(a) In general: no closed form solution, use Newton Raphson, or any numerical opti-
mization algorithm.
(b) If the model is partly linear, i.e. if θ = [α, β] and f(θ) = H(α)β, then
i. first compute the closed-form solution for β in terms of α, i.e. β̂(α) = [H(α)′H(α)]⁻¹H(α)′X;
ii. then solve for α numerically by minimizing ∥X − H(α)β̂(α)∥₂² over α.
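A minimal sketch of the partly linear case (my own example: a single exponential decay
Xi = β e^{−α ti} + noise, so H(α) has one column):

```python
import numpy as np

rng = np.random.default_rng(10)
t = np.linspace(0, 5, 100)
alpha_true, beta_true = 0.8, 2.0
X = beta_true * np.exp(-alpha_true * t) + 0.05 * rng.standard_normal(t.size)

def H(alpha):                           # H(alpha): 100 x 1 matrix (one column)
    return np.exp(-alpha * t)[:, None]

def cost(alpha):
    Ha = H(alpha)
    beta_hat = np.linalg.lstsq(Ha, X, rcond=None)[0]   # closed-form beta for this alpha
    return np.sum((X - Ha @ beta_hat) ** 2), beta_hat

alphas = np.linspace(0.01, 3.0, 3_000)                 # step ii: numerical search over alpha
costs = [cost(a)[0] for a in alphas]
alpha_hat = alphas[int(np.argmin(costs))]
beta_hat = cost(alpha_hat)[1][0]
print("alpha_hat, beta_hat:", alpha_hat, beta_hat)     # close to (0.8, 2.0)
```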
