
M2S2 - Statistical Modelling

Dr Axel Gandy, Imperial College London, Spring 2011

Overview
Statistical Inference
- Point estimators
- Properties of maximum likelihood estimators
- Interval estimation/confidence regions
- Hypothesis testing - likelihood ratio tests
- Bayesian statistics

Linear Models
- Least squares
- Distributional results
- Diagnostics

In several examples, the statistical program R (see http://www.r-project.org) will be used. It is freely available for Unix, Linux, Windows and MacOS. Number of extension packages: 1628 (Jan 2009), 2140 (Jan 2010), 2739 (Jan 2011).

References
P.J. Bickel and K.A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. Vol. 1. Prentice Hall, 2000.
B.P. Carlin and T.A. Louis. Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall/CRC, second edition, 2000.
A.C. Davison. Statistical Models. Cambridge University Press, 2003, 2008.
M.H. DeGroot. Probability and Statistics. Addison-Wesley, Boston, 1986.
D. Gamerman and H.F. Lopes. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman & Hall/CRC, 2006.
F.A. Graybill. An Introduction to Linear Statistical Models. Vol. 1. McGraw-Hill, 1961.
B.W. Lindgren. Statistical Theory. Chapman & Hall/CRC, fourth edition, 1993.
H. Scheffé. The Analysis of Variance. John Wiley & Sons, 1959.
S.R. Searle. Linear Models. Wiley, 1971.
G.A.F. Seber and A.J. Lee. Linear Regression Analysis. Second edition. John Wiley & Sons, Hoboken, New Jersey, 2003.
J.H. Stapleton. Linear Statistical Models. Wiley, 1995.
W.N. Venables and B.D. Ripley. Modern Applied Statistics with S. Springer, 2002.

Part 1: Lindgren [1993], DeGroot [1986], Bickel and Doksum [2000], Carlin and Louis [2000], Gamerman and Lopes [2006].
Part 2: Graybill [1961], Scheffé [1959], Searle [1971], Stapleton [1995], Seber and Lee [2003].
Parts 1+2: Davison [2003, 2008], Venables and Ripley [2002].

Prerequisites
M1S, M2S1

Assessment
Exam and two assessed courseworks.

Course Layout
Lecture notes. Problem sheets. Weekly problem classes (starting in the second week).

Other Information
I have designated an office hour on Thursdays at 12.00-13.00 and can be found in Huxley 530. I can also be reached via email ([email protected]). Material (lecture notes, problem sheets, solutions, ...) will be available at http://www2.imperial.ac.uk/~agandy/teaching/m2s2.

"I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding." (Hal Varian, chief economist at Google)

What is Statistics?
No clear-cut definition exists - some attempts:
- the technology of extracting meaning from data
- the technology of handling uncertainty
- the discipline used for predicting the future or for making inferences about the unknown
- the discipline of producing convenient summaries of data
- "greater statistics [is] everything related to learning from data, from the first planning or collection to the last presentation or report" (John Chambers, 1993)
- applied philosophy of science (Fisher 1935, Kempthorne 1976)

Typical steps in statistics (simplified):
- Interested in an unknown quantity (or a relationship) [e.g. clinical trial: effect of a medication]; want to make an informed decision.
- Collect/observe relevant data (Sampling/Design of Experiments - course by Lynda White).
- Model the relationship between the observations and the unknown quantity.
- Make inferences. According to what principles? How to calculate/approximate?

Example. Space Shuttle Challenger. Cause: failure of an O-ring. Number of faults in O-rings; data from Dalal et al. (1989, JASA).
d  0  1  0  0  0  0   0   0   1   1   1   0   0   2   0   0   0   0   0   0   2   0   1
T 66 70 69 68 67 72  73  70  57  63  70  78  67  53  67  75  70  81  76  79  75  76  58
P 50 50 50 50 50 50 100 100 200 200 200 200 200 200 200 200 200 200 200 200 200 200 200
d = # distressed O-rings, T = temperature [degrees F], P = pressure [psi].
For the Challenger launch: T = 31, P = 200. Could the Challenger catastrophe have been foreseen through proper statistical modelling?

1 Statistical Inference

1.1 Statistical Models

Often: data representable as y1, ..., yn, where yi is a realisation of a random variable Yi (i = 1, ..., n). Statistical model: specification of the joint distribution of Y1, ..., Yn; usually it depends on unknown parameters (often denoted by θ). The set Θ of all possible θ is called the parameter space. Simplest situation: Y1, ..., Yn iid (independent and identically distributed). In this case, Y1, ..., Yn are called a random sample.
Example.
- Pin (see introduction): Θ = [0, 1].
- Response to a question on the SOLE evaluation; Yi = response from 1 to 5. Model: Y1, ..., Yn iid, P(Yi = j) = θj, Θ = {θ ∈ [0, 1]^5 : Σ_{j=1}^5 θj = 1}.
Often: Yi depends on (nonrandom) quantities xi, so-called covariates.
Example. Relation between height and income: xi = height, Yi = income. Model: Yi = a + b xi + εi, i = 1, ..., n, where εi ~ N(0, σ²) iid, θ = (a, b, σ²)ᵀ, Θ = R² × [0, ∞) - a linear model, see the second part of the course.

Further examples: xi = time since leaving Imperial, Yi = income; or xi = 1 for the new treatment and 0 for the old treatment, Yi = survival time.

Example (Data set 1 - Faults in Rolls of Textile Fabric). Number of faults in rolls of textile fabric of different length; Bissell (1972), Biometrika.
Roll length   551 651 832 375 715 868 271 630 491 372 645 441 895 458 642 492
No. of faults   6   4  17   9  14   8   5   7   7   7   6   8  28   4  10   4
Roll length   543 842 905 542 522 122 657 170 738 371 735 749 495 716 952 417
No. of faults   8   9  23   9   6   1   9   4   9  14  17  10   7   3   9   2

[Scatter plot: No. of faults against roll length.]

Suitable model? Consider the fabric subdivided into n small pieces and suppose the events {fault in piece i}, i = 1, ..., n, are independent. Then the number of faults is Bin(n, p) with n large and p (hopefully) small, which is well approximated by a Poisson distribution with rate pn (see M2S1).

Possible model: the number of faults in a roll of length L follows a Poisson(λL) distribution. Let Yi = number of faults in a roll of length xi. We assume that the Yi are independent Poisson random variables. One might want to make inferences about λ, or ask e.g.: what is the probability that a roll of 500 metres length has more than 5 faults?

Example (Challenger). Scientific questions:
1. Does temperature influence the number of distressed O-rings?
2. What is the probability of at least one distressed O-ring on the day of the Challenger launch?
Observations: for the ith shuttle launch, i = 1, ..., 23, we observe di = # distressed O-rings [one of the values 0, ..., 6], ti = temperature [degrees F], pi = pressure [psi]. One possible model:
P(Di = j) = C(6, j) πi^j (1 - πi)^(6-j), j = 0, ..., 6,   where πi = exp(β0 + β1 ti + β2 pi) / (1 + exp(β0 + β1 ti + β2 pi)).

β0, β1, β2 ∈ R are the unknown parameters of the model. Of course, this is not the only possible model; πi could be modelled in different ways. To answer the above questions we need to
1. test the hypothesis β1 = 0;
2. derive estimates β̂0, β̂1, β̂2 of β0, β1, β2 and use these, together with the temperature and the pressure for the Challenger launch, to estimate P(D > 0 | Challenger conditions).
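To make this concrete, here is a rough R sketch of how such a fit could be carried out by direct maximisation of the likelihood, in the spirit of the numerical fitting used later in these notes. The data frame `orings` with columns d, t and p is an assumption for illustration; it is not part of the original notes.

## Sketch only: ML fit of the O-ring model via optim(); assumes a data frame
## `orings` with columns d (# distressed), t (temperature), p (pressure).
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * orings$t + beta[3] * orings$p
  pi <- exp(eta) / (1 + exp(eta))
  -sum(dbinom(orings$d, size = 6, prob = pi, log = TRUE))
}
fit <- optim(par = c(0, 0, 0), fn = negloglik, hessian = TRUE)
fit$par                                    # estimates of beta0, beta1, beta2
eta31 <- sum(fit$par * c(1, 31, 200))      # linear predictor at T = 31, P = 200
1 - (1 - exp(eta31) / (1 + exp(eta31)))^6  # estimated P(at least one distressed O-ring)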

Desirable properties of models:
- should agree with the observed data reasonably well;
- should be reasonably simple (not more parameters than necessary);
- should be easy to interpret, e.g. parameters should have a practical meaning.
Having formulated a model, we estimate any unknown parameters (fitting the model). Is the model adequate for the data (goodness of fit/model checks)? If not, one needs to refine the model (this often needs to be iterated).
Uses of models: prediction, decision making, testing hypotheses, confidence intervals, interpretation.

1.2 Point estimation

y1, ..., yn are observed values; yi is a realisation of Yi for i = 1, ..., n. Assume: Y1, ..., Yn have a joint distribution (pdf, pmf) of known functional form, but depending on an unknown parameter θ = (θ1, ..., θk)ᵀ. In many examples k = 1 and the Yi are independent.
Definition. A function of observable random variables is called a statistic.
How to estimate θ from y1, ..., yn? Any statistic could be used. If t(y1, ..., yn) has been suggested as an estimate of θ, its random-variable version T = t(Y1, ..., Yn) is called an estimator of θ. We judge how good t is by looking at the properties of T.
Example. Y1, ..., Yn ~ N(μ, 1) iid. Even in this simple situation μ can be estimated in several ways:

- sample mean: ȳ = (1/n) Σ yi;
- sample median: y((n+1)/2) if n is odd and (y(n/2) + y(n/2+1))/2 if n is even, where y(1) < ... < y(n) is the ordered sample;
- trimmed mean: discard the k highest and k lowest observed yi before computing the mean;
- ...
For the estimate t(y1, ..., yn) = ȳ the corresponding estimator is T(Y1, ..., Yn) = Ȳ = (1/n) Σ Yi. Note: T is a random variable. Its distribution may depend on θ = μ. Here: T ~ N(μ, 1/n).
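As an illustration (not part of the original notes), the following R sketch compares the three estimators on simulated N(μ, 1) data; the sample size and the value of μ are arbitrary choices.

## Sketch: comparing three estimators of mu on simulated N(mu, 1) samples.
set.seed(1)
mu <- 2; n <- 25
est <- replicate(10000, {
  y <- rnorm(n, mean = mu, sd = 1)
  c(mean = mean(y), median = median(y), trimmed = mean(y, trim = 0.1))
})
rowMeans(est)       # all three estimators are centred near mu
apply(est, 1, var)  # the sample mean has the smallest variance here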

1.3 Properties of Estimators

1.3.1 Bias

Definition. bias_θ(T) = E_θ(T) - g(θ). If bias_θ(T) = 0 for all θ, we say that T is unbiased for g(θ).
Extension to higher-dimensional parameters: componentwise interpretation.
Example. For a random sample Y1, ..., Yn with E Yi = μ and Var Yi = σ²:
- Ȳ = (1/n) Σ_{i=1}^n Yi is unbiased for μ = E Y. Indeed,
  E(Ȳ) = E((1/n) Σ_{i=1}^n Yi) = (1/n) Σ E Yi = E Y = μ.
- s² = (1/(n-1)) Σ_{i=1}^n (Yi - Ȳ)² is unbiased for σ² = Var Y. Indeed,
  Σ_i (Yi - Ȳ)² = Σ_i Yi² - (1/n) Σ_{i,j} Yi Yj = (1 - 1/n) Σ_i Yi² - (1/n) Σ_{i≠j} Yi Yj,
  and hence, using E(Yi Yj) = E Yi E Yj for i ≠ j,
  E s² = (1/(n-1)) [ (1 - 1/n) n E Y² - (1/n) n(n-1) (E Y)² ] = E Y² - (E Y)² = σ².
Thus (Ȳ, s²) is an unbiased estimator of (μ, σ²). However, in general Ȳ² is not unbiased for μ² and s is not unbiased for σ (see Problem Sheet).
Remark. T unbiased for θ does not imply that h(T) is unbiased for h(θ).
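A quick simulation (an illustration, not part of the original notes) shows the unbiasedness of s² and the downward bias of the divisor-n version:

## Sketch: E(s^2) is close to sigma^2, the divisor-n estimator is not.
set.seed(1)
n <- 10; sigma2 <- 4
vs <- replicate(20000, {
  y <- rnorm(n, sd = sqrt(sigma2))
  c(s2 = var(y), divisor.n = (n - 1) / n * var(y))
})
rowMeans(vs)   # approximately 4 and (n-1)/n * 4 = 3.6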

1.3.2 MSE

Definition. The mean squared error is MSE_θ(T) = E_θ(T - θ)². It satisfies MSE_θ(T) = Var_θ(T) + (bias_θ(T))². Proof: see example sheet.
Remark. The MSE is a good criterion for selecting an estimator (it includes both bias and variance). If the bias is 0, then MSE = Var. The following example shows that a biased estimator can outperform an unbiased estimator.
Example. X ~ Binomial(n, p), n known. We want to estimate p. Consider the two estimators S = X/n and T = (X+1)/(n+2). Thus X = 0 implies S = 0, T = 1/(n+2), and X = n implies S = 1, T = (n+1)/(n+2).
Then bias_p(S) = E_p(S - p) = (1/n) E_p X - p = 0 and Var_p(S) = (1/n²) Var_p X = p(1-p)/n. Thus MSE_p(S) = p(1-p)/n.
bias_p(T) = E_p(T - p) = (E_p X + 1)/(n+2) - p = (np+1)/(n+2) - p = (1-2p)/(n+2), and
Var_p(T) = Var_p X/(n+2)² = np(1-p)/(n+2)². Thus MSE_p(T) = [np(1-p) + (1-2p)²]/(n+2)².
For p = 0 and p = 1, MSE_p(T) = 1/(n+2)² > 0 = MSE_p(S).
However, for p = 1/2, MSE_{1/2}(T) = n/(4(n+2)²) < 1/(4n) = MSE_{1/2}(S).
Since both MSE_p(T) and MSE_p(S) are quadratic in p, this implies that there exist 0 < p1 < p2 < 1 such that MSE_p(T) < MSE_p(S) for p ∈ (p1, p2) and MSE_p(T) > MSE_p(S) for p ∈ [0, p1) ∪ (p2, 1].
Remark. On the problem sheet: an example in which a biased estimator has a smaller MSE than the unbiased sample mean.
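The comparison can be visualised with a short R sketch (illustrative only; n = 10 is an arbitrary choice):

## Sketch: MSE of S = X/n and T = (X+1)/(n+2) as functions of p, for n = 10.
n <- 10
p <- seq(0, 1, length.out = 200)
mse.S <- p * (1 - p) / n
mse.T <- (n * p * (1 - p) + (1 - 2 * p)^2) / (n + 2)^2
plot(p, mse.S, type = "l", ylab = "MSE")
lines(p, mse.T, lty = 2)
legend("top", legend = c("S = X/n", "T = (X+1)/(n+2)"), lty = 1:2)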

1.3.3 Asymptotic Properties

Performance of estimators as the sample size n increases.
Definition. A sequence of estimators (Tn)_{n∈N} for g(θ) is called (weakly) consistent if for all θ ∈ Θ: Tn →P g(θ) (n → ∞).
Recall from M2S1 (Chapter 6.3): →P denotes convergence in probability; Tn →P g(θ) (n → ∞) is defined by:
for all ε > 0: lim_{n→∞} P_θ(|Tn - g(θ)| < ε) = 1.
Usually Tn depends only on Y1, ..., Yn. Loosely speaking: a consistent estimator gets closer to the true value the more data you have. Showing consistency via the definition can be tedious! The following lemma gives a simple sufficient condition.
Definition. A sequence of estimators (Tn)_{n∈N} for g(θ) is called asymptotically unbiased if for all θ ∈ Θ: E_θ(Tn) → g(θ) (n → ∞).
Lemma 1. Suppose (Tn) is asymptotically unbiased for g(θ) and for all θ ∈ Θ: Var_θ(Tn) → 0 (n → ∞). Then (Tn) is consistent for g(θ).
Proof. Use Markov's inequality (M1S, Chapter 2.2): P(|X| ≥ a) ≤ E|X|/a for a > 0. [Proof: a I(|X| ≥ a) ≤ |X|; hence P(|X| ≥ a) = E I(|X| ≥ a) ≤ (1/a) E|X|.] Then
P_θ(|Tn - g(θ)| ≥ ε) = P_θ((Tn - g(θ))² ≥ ε²) ≤ E_θ(Tn - g(θ))²/ε² = (1/ε²)(Var_θ Tn + (E_θ Tn - g(θ))²) → 0.
Example (Pin). Xi ~ B(1, θ), Θ = [0, 1], Tn(x1, ..., xn) = (1/n) Σ_{i=1}^n xi. Then E_θ Tn(X1, ..., Xn) = θ and Var_θ(Tn) = (1/n) θ(1-θ) → 0 (n → ∞), so Tn is consistent.
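The convergence in the definition can also be checked by simulation; the following sketch (with illustrative values, not from the notes) estimates P_θ(|Tn - θ| < ε) for increasing n:

## Sketch: consistency of the sample proportion in the pin example.
set.seed(1)
theta <- 0.3; eps <- 0.05
for (n in c(10, 100, 1000)) {
  Tn <- replicate(10000, mean(rbinom(n, size = 1, prob = theta)))
  cat("n =", n, ": estimated P(|Tn - theta| < eps) =", mean(abs(Tn - theta) < eps), "\n")
}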

1.3.4 Information Inequality (Rao-Cramér Inequality)

In this section: a lower bound on the variance of an estimator. Regularity conditions will not be looked at in detail in this course - see the separate course on Statistical Theory.
Suppose T = T(X) is an unbiased estimator for θ (there is a generalisation to unbiased estimators of g(θ)). Let f(x; θ) denote the joint density of the sample X. Note: usually, X is a vector of random variables. Then 1 = ∫ f(x; θ) dx and thus
0 = (∂/∂θ) ∫ f(x; θ) dx = ∫ (∂/∂θ) f(x; θ) dx = ∫ ((∂/∂θ) log f(x; θ)) f(x; θ) dx = E_θ[(∂/∂θ) log f(X; θ)].   (1)
Note: the interchange of differentiation and integration works under broad assumptions, but is not true in general.
Furthermore, since T is unbiased, θ = E_θ T = ∫ T(x) f(x; θ) dx. Differentiating this wrt θ gives
1 = ∫ T(x) (∂/∂θ) f(x; θ) dx = ∫ T(x) ((∂/∂θ) log f(x; θ)) f(x; θ) dx = E_θ[T (∂/∂θ) log f(X; θ)].
Subtracting (1) multiplied by E_θ T from the previous equality we get
1 = E_θ[(T - E_θ T)(∂/∂θ) log f(X; θ)].
Hence, using the Cauchy-Schwarz inequality [(E YZ)² ≤ E Y² E Z² for square integrable random variables Y and Z],
1 = (E_θ[(T - E_θ T)(∂/∂θ) log f(X; θ)])² ≤ E_θ[(T - E_θ T)²] E_θ[((∂/∂θ) log f(X; θ))²] = Var_θ(T) E_θ[((∂/∂θ) log f(X; θ))²].
Thus
Var_θ(T) ≥ 1/I_f(θ),
where I_f(θ) = E_θ[((∂/∂θ) log f(X; θ))²] is the so-called Fisher information.
Alternative formulation: I_f(θ) = -E_θ[(∂²/∂θ²) log f(X; θ)]. Indeed, writing f' = (∂/∂θ) f and f'' = (∂²/∂θ²) f,
E_θ[(∂²/∂θ²) log f] = E_θ[f''/f - (f'/f)²] = ∫ f''(x; θ) dx - I_f(θ) = (∂²/∂θ²) ∫ f(x; θ) dx - I_f(θ) = -I_f(θ),
where the interchange of differentiation and integration is again assumed and ∫ f(x; θ) dx = 1.
iid case: f(x1, ..., xn; θ) = Π_{i=1}^n f(xi; θ). Then the alternative formulation implies I_f(θ) = n I_f^(1)(θ), where I_f^(1) is the Fisher information of a single observation.
Example (pin). X1, ..., Xn ~ Bin(1, θ) iid. We want to compute I_f^(1). Here f(x; θ) = θ^x (1-θ)^(1-x), so
(∂/∂θ) log f(x; θ) = (∂/∂θ)(x log θ + (1-x) log(1-θ)) = x/θ - (1-x)/(1-θ).
Hence I_f^(1)(θ) = E_θ[((∂/∂θ) log f(X; θ))²] = Σ_{x=0}^1 (x/θ - (1-x)/(1-θ))² θ^x (1-θ)^(1-x) = (1/(1-θ)²)(1-θ) + (1/θ²)θ = 1/(1-θ) + 1/θ = 1/(θ(1-θ)).
Thus I_f(θ) = n I_f^(1)(θ) = n/(θ(1-θ)). Hence, for any unbiased estimator T for θ,
Var_θ T ≥ θ(1-θ)/n.
Consider S = (1/n) Σ_{i=1}^n Xi. Since Var_θ(S) = (1/n²) n Var_θ X1 = θ(1-θ)/n, S attains this bound. Hence S has minimal variance among all unbiased estimators for θ.

1.4 Maximum Likelihood Estimation

A method of finding an estimator for θ; widely applicable (see also M2S1). The likelihood function is
L(θ) = L(θ; y) = P(Y = y; θ) for discrete data, and L(θ) = L(θ; y) = f_Y(y; θ) for absolutely continuous data.
Thus the likelihood is the joint pdf/pmf of the observed data. In the case when the Yi are iid (and Yi has pdf f(·; θ)),
L(θ) = Π_{i=1}^n f(yi; θ).
Definition. A maximum likelihood estimator (MLE) of θ is an estimator θ̂ such that L(θ̂) = sup_{θ∈Θ} L(θ).
(sup = least upper bound) The maximum likelihood estimator yields the parameter for which the observed data are most likely. Usually, the MLE is well defined. However, one can easily construct situations in which it does not exist or is not unique.
Example (Survival of Leukemia Patients (data set 2)). This data set contains the survival time yi (in weeks) and xi = log10(initial white blood cell count) for 17 leukemia patients.
xi 3.36 2.88 3.63 3.41 3.78 4.02 4.00 4.23 3.73 3.85 3.97 4.51 4.54 5.00 5.00 4.72 5.00
yi   65  156  100  134   16  108  121    4   39  143   56   26   22    1    1    5   65
[Scatter plot: survival time y_i against x_i = log10(initial white blood cell count).]

Model: Yi = β exp(γ(xi - x̄)) εi, i = 1, ..., n, where εi ~ Exp(1) iid.
Then Yi ~ Exp(θi) with mean θi = β exp(γ(xi - x̄)), i.e. the pdf of Yi is f_{Yi}(yi) = θi^{-1} e^{-yi/θi}. Here θ = (β, γ)ᵀ and
L(θ) = Π_i θi^{-1} e^{-yi/θi} = β^{-n} e^{-γ Σ(xi - x̄)} exp(-(1/β) Σ yi exp(-γ(xi - x̄))),
log L(θ) = -n log β - γ Σ(xi - x̄) - (1/β) Σ yi exp(-γ(xi - x̄)).
Differentiate wrt β, γ and solve numerically (or optimise numerically). Numerical fitting with R:
> dat <- read.csv("leuk.dat")
> g <- function(par) -sum(log(dexp(dat$y,
+     1/(par[1] * exp(par[2] * (dat$x - mean(dat$x)))))))
> fit <- optim(par = c(1, 1), fn = g, hessian = TRUE)
> fit$par
[1] 51.086326 -1.110557

Remark. What is unrealistic about this data set? Survival time is observed for all patients. Such a study may take a long time. In reality, at the time the study is evaluated, for many patients it will only be known that they did not die before a given time (censored observations) - see the course on Survival Analysis.

1.5 Properties of Maximum Likelihood Estimators

1.5.1 MLEs are not necessarily unbiased
Example. Y1, ..., Yn ~ N(μ, σ²) iid with μ and σ² unknown. Then the MLEs are
μ̂ = (1/n) Σ_{i=1}^n yi and σ̂² = (1/n) Σ_{i=1}^n (yi - ȳ)²
(check this). μ̂ is unbiased but σ̂² is not:
E(σ̂²) = (1/n) E(Σ (Yi - Ȳ)²) = ((n-1)/n) σ² ≠ σ².
Recall: the sample variance (1/(n-1)) Σ_i (Yi - Ȳ)² is an unbiased estimator of the variance. (σ̂² is asymptotically unbiased though.)

1.5.2 MLEs are functionally invariant

If g is bijective and θ̂ is an MLE of θ, then ψ̂ = g(θ̂) is an MLE of ψ = g(θ). This can be seen as follows. Since g is bijective, g has an inverse. The likelihood written as a function of ψ is L*(ψ) = L(g^{-1}(ψ)). Thus for all ψ:
L*(ψ̂) = L(g^{-1}(ψ̂)) = L(g^{-1}(g(θ̂))) = L(θ̂) ≥ L(g^{-1}(ψ)) = L*(ψ),
so ψ̂ maximises L*.
Example. Y1, ..., Yn ~ N(μ, 1) iid. The MLE of μ is μ̂ = (1/n) Σ_{i=1}^n Yi. What is the MLE of ψ = μ + 2? We have ψ = μ + 2 =: g(μ); clearly g is bijective. Hence ψ̂ := g(μ̂) = μ̂ + 2 = (1/n) Σ_{i=1}^n Yi + 2 is the MLE of ψ.


Remark. What if g is not bijective? If g is not surjective then there are impossible values of ψ. If g is not injective then knowing ψ does not uniquely identify the parameter or the model. Recall: let f: A → B be a function. It is called injective iff for all a1, a2 ∈ A: f(a1) = f(a2) implies a1 = a2. It is called surjective iff for every b ∈ B there exists a ∈ A with f(a) = b.
In the general case, ψ̂ = g(θ̂) maximises the induced likelihood function L*(ψ) = sup{L(θ) : g(θ) = ψ}.
Example (continued from the previous example). Consider g(μ) = μ² (not bijective). Then g(μ) = ψ iff μ ∈ {-√ψ, √ψ}. Thus L*(ψ) = max(L(-√ψ), L(√ψ)).

1.5.3 Large sample properties

What happens as n → ∞?
Theorem 2. Let X1, X2, ... be iid observations with pdf (or pmf) f(x; θ), where θ ∈ Θ and Θ is an open interval. Under regularity conditions (e.g. {x : f(x; θ) > 0} does not depend on θ), the following holds:
(i) There exists a consistent sequence (θ̂n) of maximum likelihood estimators.
(ii) Suppose (θ̂n) is a consistent sequence of MLEs. Then
√n(θ̂n - θ0) →d N(0, (I_f(θ0))^{-1}),
where θ0 denotes the true parameter and I_f(θ) = E_θ[((∂/∂θ) log f(X; θ))²] is the Fisher information of a sample of size 1.
Example (pin). X1, X2, ... ~ Bin(1, θ) iid. We already know I_f^(1)(θ) = 1/(θ(1-θ)). Thus
√n(θ̂ - θ0) →d N(0, θ0(1-θ0)), i.e. √n(θ̂ - θ0)/√(θ0(1-θ0)) →d N(0, 1)
(since θ̂ = x̄ one could show this directly using the central limit theorem). In other words, the distribution of θ̂ can be approximated by an N(θ0, θ0(1-θ0)/n) distribution.
Proof of Theorem 2: sketch of (ii) only. Since θ̂ maximises the log-likelihood, a Taylor expansion around θ0 gives
0 = (1/√n) Σ (∂/∂θ) log f(xi; θ̂) = (1/√n) Σ (∂/∂θ) log f(xi; θ0) + [(1/n) Σ (∂²/∂θ²) log f(xi; θ*)] √n(θ̂ - θ0) =: An + Bn √n(θ̂ - θ0)
for some θ* between θ̂ and θ0. Hence √n(θ̂ - θ0) = -An/Bn.
By the LLN: Cn := (1/n) Σ (∂²/∂θ²) log f(xi; θ0) →P E_{θ0}[(∂²/∂θ²) log f(X; θ0)] = -I_f(θ0).
Use θ̂ →P θ0 to show Bn - Cn →P 0, hence Bn →P -I_f(θ0).
We want to use the CLT for An: E_θ[(∂/∂θ) log f] = E_θ[f'/f] = ∫ f'(x; θ) dx = (∂/∂θ) ∫ f(x; θ) dx = 0 (interchanging differentiation and integration and using ∫ f = 1), and Var_θ((∂/∂θ) log f) = I_f(θ). Hence, by the CLT, An →d N(0, I_f(θ0)).
By Slutsky's lemma, -An/Bn →d N(0, I_f(θ0)^{-1}). Recall from M2S1:
Lemma 3 (Slutsky). If Xn, Yn, X are random variables and c is a constant such that Xn →d X and Yn →P c, then
Xn + Yn →d X + c, Xn Yn →d cX, and Xn/Yn →d X/c if c ≠ 0.
Remark. How to estimate I_f(θ0)? In an iid sample, I_f(θ0) can be estimated by I_f(θ̂), by -(1/n) Σ_{i=1}^n (∂²/∂θ²) log f(xi; θ)|_{θ=θ̂}, or by (1/n) Σ_{i=1}^n ((∂/∂θ) log f(xi; θ)|_{θ=θ̂})² (and θ̂ is often derived using numerical optimisation).
Example (Data set 1 - Faults in Rolls of Textile Fabric). Numerical computation of the MLE:
> g <- function(lambda) -sum(log(dpois(dat$faults, dat$length * lambda)))
> fit <- optim(par = c(1e-04), fn = g, hessian = TRUE)
> fit$par
[1] 0.0151
> fit$hessian
        [,1]
[1,] 1256614
> sqrt(1/fit$hessian)
             [,1]
[1,] 0.0008920701
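With the fitted rate, the question raised earlier about a 500 metre roll can be answered directly; a brief sketch using the estimate λ̂ = 0.0151 from the output above:

## Sketch: estimated probability that a roll of length 500 has more than 5 faults.
lambda.hat <- 0.0151
1 - ppois(5, lambda = 500 * lambda.hat)   # roughly 0.76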


Remark. Multivariate version:
√n(θ̂n - θ0) →d N(0, (I_f(θ0))^{-1}),
where θ0 denotes the true parameter and the Fisher information matrix is
I_f(θ) := E_θ[((∂/∂θ) log f(X; θ))((∂/∂θ) log f(X; θ))ᵀ] = -E_θ[(∂²/∂θ∂θᵀ) log f(X; θ)],
with ∂/∂θ denoting the gradient wrt θ.
Example (Survival of Leukemia Patients (data set 2)).
> solve(fit$hessian)
              [,1]         [,2]
[1,] 153.387342509 -0.003780545
[2,]  -0.003780545  0.170950947
> sqrt(diag(solve(fit$hessian)))
[1] 12.3849644  0.4134621

1.6 Confidence Regions

Point estimator: one number only. Confidence interval: a random interval that contains the true parameter with a certain probability.
Example. Y1, ..., Yn iid N(μ, σ0²), μ unknown, σ0² known. Want: a random interval that contains μ with probability 1 - α for some α > 0, e.g. α = 0.05.
Ȳ = (1/n) Σ Yi ~ N(μ, σ0²/n), hence (Ȳ - μ)/(σ0/√n) ~ N(0, 1).
Thus
1 - α = P(-c_{α/2} < (Ȳ - μ)/(σ0/√n) < c_{α/2}),
where 0 < α < 1 and Φ(c_{α/2}) = 1 - α/2, Φ being the cdf of N(0, 1). Rewrite this as
1 - α = P(Ȳ + c_{α/2} σ0/√n > μ > Ȳ - c_{α/2} σ0/√n),
where the endpoints are random and μ is non-random.
(Ȳ - c_{α/2} σ0/√n, Ȳ + c_{α/2} σ0/√n) is a random interval. It contains the true μ with probability 1 - α. The observed value of the random interval is (ȳ - c_{α/2} σ0/√n, ȳ + c_{α/2} σ0/√n). This is called a 1 - α confidence interval for μ.
Remarks:
- α is usually small, often α = 0.05 (this is the usual convention);
- the confidence interval is the observed value of the random interval;
- one could use asymmetrical values, but the symmetrical values (±c_{α/2}) give the shortest interval in this case.
Example. In an industrial process, past experience shows it gives components whose strengths are N(40, 1.21²). The process is modified, but the standard deviation (= 1.21) remains the same. After the modification, 12 components give an average of 41.125. New strength ~ N(μ, 1.21²). With n = 12, σ0 = 1.21, ȳ = 41.125, α = 0.05, c_{α/2} ≈ 1.96, a 95% CI for μ is (40.44, 41.81).
This does not mean that we are 95% confident that the true μ lies in (40.44, 41.81). It means that if we were to take an infinite number of (independent) samples, then in 95% of cases the calculated CI would contain the true value. Note that our CI does not include 40 - an indication that the modification seems to have increased the strength (see hypothesis testing).
Definition. A 1 - α confidence interval for θ is a random interval (L, U) that contains the true parameter with probability at least 1 - α, i.e. P_θ(L ≤ θ ≤ U) ≥ 1 - α for all θ.
Example. X ~ Bin(1, θ). Want: a 1 - α CI for θ (suppose 0 < α < 1/2). Let
[L, U] = [0, 1 - α] for X = 0 and [L, U] = [α, 1] for X = 1.
This is indeed a 1 - α CI, since
P_θ(θ ∈ [L, U]) = P_θ(X = 0) = 1 - θ ≥ 1 - α for θ < α;  = 1 for α ≤ θ ≤ 1 - α;  = P_θ(X = 1) = θ ≥ 1 - α for θ > 1 - α.
Remark. L = -∞ and U = ∞ are allowed.
Example (One-sided confidence interval). Suppose Y1, ..., Yn are independent measurements of a pollutant. We want a 1 - α CI of the form θ < h2(y), i.e. P(θ < h2(Y)) = 1 - α, because we want to be confident that θ is sufficiently small. The CI is then (-∞, h2(y)).
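For the industrial process example above, the interval can be reproduced with a few lines of R (a sketch using the numbers quoted in the example):

## Sketch: 95% CI for mu with known sigma0.
n <- 12; sigma0 <- 1.21; ybar <- 41.125; alpha <- 0.05
ybar + c(-1, 1) * qnorm(1 - alpha / 2) * sigma0 / sqrt(n)   # approximately (40.44, 41.81)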


1.6.1 Construction of Confidence Intervals


Features of (Ȳ - μ)/(σ0/√n) in the first example:
1. it is a function of the unknown μ and the data only (σ0 is known);
2. its distribution is completely known.
More generally, consider a situation where we are interested in a (scalar) unknown parameter θ. There may be nuisance parameters (i.e. other unknown parameters we are not interested in).
Definition. A pivotal quantity for θ is a function t(Y, θ) of the data and θ (and NOT any further nuisance parameters) such that the distribution of t(Y, θ) is known, i.e. does NOT depend on ANY unknown parameters.
Suppose t(Y, θ) is a pivotal quantity for θ. Then we can find constants a1, a2 such that
P(a1 ≤ t(Y, θ) ≤ a2) ≥ 1 - α,
because we know the distribution of t(Y, θ). (There may be many pairs (a1, a2); the "≥" is needed for discrete distributions.) In many cases (as above) we can rearrange terms to give
P(h1(Y) ≤ θ ≤ h2(Y)) ≥ 1 - α.
(h1(Y), h2(Y)) is a random interval. The observed interval (h1(y), h2(y)), with lower confidence limit h1(y) and upper confidence limit h2(y), is a 1 - α confidence interval for θ.
Example. Y1, ..., Yn iid N(μ, σ²), μ and σ² both unknown.
1. Want: a confidence interval for μ. σ is unknown, so we cannot use (Ȳ - μ)/(σ/√n) as a pivotal quantity. Replace σ by S, where S² = (1/(n-1)) Σ (Yi - Ȳ)² (the sample variance), to give
T = √n (Ȳ - μ)/S.
From M2S1 (chapter about tests, handout on the multivariate normal distribution), and also as a consequence of more general results in the second part of the course: T follows a Student-t distribution with n - 1 degrees of freedom. Hence
1 - α = P(-t_{n-1,α/2} ≤ T ≤ t_{n-1,α/2}) = P(Ȳ - t_{n-1,α/2} S/√n ≤ μ ≤ Ȳ + t_{n-1,α/2} S/√n),
so a 1 - α CI for μ is (ȳ - t_{n-1,α/2} s/√n, ȳ + t_{n-1,α/2} s/√n).
2. Want: a confidence interval for σ (or σ²). From M2S1:
Σ (Yi - Ȳ)²/σ² ~ χ²_{n-1}.
Choose c1 and c2 (e.g. the α/2 and 1 - α/2 quantiles of χ²_{n-1}) such that
P(c1 ≤ Σ (Yi - Ȳ)²/σ² ≤ c2) = 1 - α.
A 1 - α CI for σ² is (Σ (yi - ȳ)²/c2, Σ (yi - ȳ)²/c1), and a 1 - α CI for σ is (√(Σ (yi - ȳ)²/c2), √(Σ (yi - ȳ)²/c1)).
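Both intervals are easy to compute in R; the following sketch uses simulated data (the true values μ = 5, σ = 2 are arbitrary and serve only as an illustration):

## Sketch: t-based CI for mu and chi-square-based CI for sigma^2.
set.seed(1)
y <- rnorm(20, mean = 5, sd = 2)
n <- length(y); alpha <- 0.05
mean(y) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sd(y) / sqrt(n)           # CI for mu
ss <- sum((y - mean(y))^2)
c(ss / qchisq(1 - alpha / 2, df = n - 1), ss / qchisq(alpha / 2, df = n - 1))  # CI for sigma^2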

1.6.2 Asymptotic confidence intervals


Often, we only know that √n(Tn - θ) →d N(0, σ²(θ)) (e.g. the asymptotic distribution of the MLE). Then, approximately,
√n (Tn - θ)/σ(θ) ~ N(0, 1),
and we can use the left-hand side as a pivotal quantity. The resulting confidence interval is often called an asymptotic confidence interval. Since σ depends on θ, it may be difficult to solve the resulting inequalities for θ.
Simplification: suppose σ̂ is an estimator with σ̂/σ(θ) →P 1 for all θ. Then, by the Slutsky lemma,
√n (Tn - θ)/σ̂ = [√n (Tn - θ)/σ(θ)] · [σ(θ)/σ̂] →d N(0, 1).
Using the left-hand side as the pivotal quantity leads to the approximate confidence limits Tn ∓ c_{α/2} σ̂/√n, where Φ(c_{α/2}) = 1 - α/2.


Example. Y ~ Bin(n, θ). √n(Y/n - θ) →d N(0, θ(1-θ)) (for large n, see the CLT; alternatively, use the large sample properties of the MLE). Hence
√n (Y/n - θ)/√(θ(1-θ)) is approximately N(0, 1), and P(-c_{α/2} ≤ √n (Y/n - θ)/√(θ(1-θ)) ≤ c_{α/2}) ≈ 1 - α.
The (approximate) confidence limits are the roots of
(y - nθ)² = c²_{α/2} n θ(1-θ).
Solving this gives the confidence interval
( [2yn + c²n - √(4yn²c² + c⁴n² - 4y²c²n)] / (2n(n + c²)), [2yn + c²n + √(4yn²c² + c⁴n² - 4y²c²n)] / (2n(n + c²)) ),
where c = c_{α/2}.
Simplification: for σ̂² = (Y/n)(1 - Y/n) one can show σ̂²/(θ(1-θ)) →P 1 (LLN: Y/n →P θ, and rules for →P). Using the pivotal quantity
√n (Y/n - θ)/√((Y/n)(1 - Y/n))
leads to the confidence limits
y/n ∓ (c_{α/2}/√n) √((y/n)(1 - y/n)).
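The two approximate intervals can be compared numerically; a sketch with illustrative values y = 50, n = 100 (not from the notes):

## Sketch: CI from the quadratic versus the simplified (Wald-type) CI.
y <- 50; n <- 100; z <- qnorm(0.975)
disc <- sqrt(4 * y * n^2 * z^2 + z^4 * n^2 - 4 * y^2 * z^2 * n)
(2 * y * n + z^2 * n + c(-1, 1) * disc) / (2 * n * (n + z^2))   # roots of the quadratic
th <- y / n
th + c(-1, 1) * z / sqrt(n) * sqrt(th * (1 - th))               # simplified limits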

1.6.3 Simultaneous Confidence Intervals/Confidence Regions

Extension to more than one parameter. Suppose θ = (θ1, ..., θk)ᵀ ∈ Θ ⊆ R^k and suppose that we have random intervals (Li(Y), Ui(Y)) such that
P(Li(Y) < θi < Ui(Y) for i = 1, ..., k) ≥ 1 - α.
Then we call (Li(y), Ui(y)), i = 1, ..., k, 1 - α simultaneous confidence intervals for θ1, ..., θk.
Remark (Bonferroni correction). Suppose [Li, Ui] is a 1 - α/k confidence interval for θi, i = 1, ..., k. Then [(L1, ..., Lk)ᵀ, (U1, ..., Uk)ᵀ] is a 1 - α simultaneous confidence interval for (θ1, ..., θk)ᵀ. Indeed,
P(θi ∈ [Li, Ui], i = 1, ..., k) = 1 - P(∪_{i=1}^k {θi ∉ [Li, Ui]}) ≥ 1 - Σ_{i=1}^k P(θi ∉ [Li, Ui]) ≥ 1 - k·(α/k) = 1 - α.
If one uses a more complicated form than rectangles, i.e. a random set A(Y) such that for all θ: P_θ(θ ∈ A(Y)) ≥ 1 - α, one calls A(y) a 1 - α confidence region for θ. (Often the random set is an ellipse - second part of the course.)
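A small R sketch of the Bonferroni idea (illustrative simulated data; each interval is computed at level 1 - α/k):

## Sketch: Bonferroni-corrected simultaneous t-intervals for k = 3 means.
set.seed(1)
k <- 3; alpha <- 0.05; n <- 30
dat <- matrix(rnorm(k * n, mean = c(0, 1, 2)), ncol = k, byrow = TRUE)
for (i in 1:k) {
  y <- dat[, i]
  ci <- mean(y) + c(-1, 1) * qt(1 - alpha / (2 * k), df = n - 1) * sd(y) / sqrt(n)
  cat("theta_", i, ": (", round(ci[1], 2), ", ", round(ci[2], 2), ")\n", sep = "")
}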

1.7 Hypothesis Tests

"In God we trust; all others must bring data." (William Edwards Deming, 1900-1993)

(Already covered in M2S1.) Two hypotheses (usually about θ):
H0: θ ∈ Θ0 against H1: θ ∈ Θ1 := Θ \ Θ0


We talk about a test of H0 against H1. We are not trying to decide between H0 and H1; the roles of H0 and H1 are not symmetrical. H0 is regarded as the status quo, which we do not reject unless there is (considerable) evidence against it.
Example. Medical statistics: H0: the new treatment is not better; H1: the new treatment is better.
Two types of error:
                   H0 true        H0 false
do not reject H0   -              Type II error
reject H0          Type I error   -
A test is defined by the set of observations for which one rejects, called the critical region. A test is of level α (0 < α < 1) if P_θ(reject H0) ≤ α for all θ ∈ Θ0.
Usually α is small, e.g. 0.01 or 0.05. Loosely speaking: the probability of a type I error is at most α. There is no such bound for the probability of a type II error.

1.7.1 Connection between tests and confidence intervals

Suppose A(Y) is a 1 - α confidence region for θ, i.e. P_θ(θ ∈ A(Y)) ≥ 1 - α. Then one can define a test for H0: θ ∈ Θ0 with level α as follows: reject H0 if Θ0 ∩ A(y) = ∅, i.e. reject the null hypothesis if none of its elements are in the confidence region. Indeed, for all θ ∈ Θ0:
P_θ(reject) = P_θ(Θ0 ∩ A(Y) = ∅) ≤ P_θ(θ ∉ A(Y)) ≤ α.
On the other hand, suppose that for every θ0 ∈ Θ we have a level-α test φ_{θ0} for H0: θ = θ0. Then A := {θ : φ_θ does not reject} is a 1 - α confidence region for θ. Indeed, for all θ,
P_θ(θ ∈ A) = P_θ(φ_θ does not reject) = 1 - P_θ(φ_θ rejects) ≥ 1 - α.

1.7.2 Power
Power function: β(θ) = P_θ(reject H0).
- If θ ∈ Θ0 we want β(θ) to be small.
- If θ ∈ Θ1 we want β(θ) to be large.


Example. X ~ N(μ, 1). H0: μ ≤ 0 against H1: μ > 0. Critical region: R = [c, ∞), where we will choose the critical value c such that the test is of level α. For μ ≤ 0:
P_μ(X > c) = P_μ(X - μ > c - μ) = 1 - Φ(c - μ) ≤ 1 - Φ(c).
Choose c = c_α, where Φ(c_α) = 1 - α. In this case it was sufficient to construct a test of level α for the boundary case μ = 0.
[Sketch of the power function β(μ): below α for μ ≤ 0, increasing towards 1 as μ grows - a typical β(·) for a one-sided test.]
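The power function of this test is easy to evaluate numerically; a short sketch with α = 0.05 (illustrative values of μ):

## Sketch: power of the one-sided test of H0: mu <= 0 based on X ~ N(mu, 1).
alpha <- 0.05
c.alpha <- qnorm(1 - alpha)
power <- function(mu) 1 - pnorm(c.alpha - mu)
power(c(-1, 0, 0.5, 1, 2, 3))   # at most alpha for mu <= 0, increasing towards 1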

1.7.3 p-value

Often the so-called p-value is reported (instead of a test decision):
p = sup_{θ∈Θ0} P_θ(observing something at least as extreme as the observation).
Reject H0 iff p ≤ α for an α-level test. Advantage for computer packages: the user does not have to specify the level α. If the test is based on a statistic T with rejection for large values of T, then
p = sup_{θ∈Θ0} P_θ(T ≥ t),
where t is the observed value. In the above example (where X ~ N(μ, 1) and H0: μ ≤ 0 against H1: μ > 0) the p-value is
p = sup_{μ≤0} P_μ(X ≥ x) = P_0(X ≥ x) = 1 - Φ(x).
Example. Two-sided test with known variance. X1, ..., Xn ~ N(μ, 1) iid, μ unknown. H0: μ = μ0 against H1: μ ≠ μ0.
Under H0: T = √n(X̄ - μ0) ~ N(0, 1). Rejection region (based on T): (-∞, -c_{α/2}] ∪ [c_{α/2}, ∞), where Φ(c_{α/2}) = 1 - α/2. The test rejects for large values of |T|. Hence, for the observation t the p-value is
p = P_0(|T| ≥ |t|) = P(T ≤ -|t| or T ≥ |t|) = Φ(-|t|) + 1 - Φ(|t|) = 2 - 2Φ(|t|).
Power: note that T ~ N(√n(μ - μ0), 1) when the true mean is μ. Hence
β(μ) = P_μ(|T| ≥ c_{α/2}) = 1 - P_μ(-c_{α/2} ≤ T ≤ c_{α/2}) = 1 - Φ(√n(μ0 - μ) + c_{α/2}) + Φ(√n(μ0 - μ) - c_{α/2}).
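Curves such as those in the figure below can be produced directly from this formula; a sketch with μ0 = 5 and α = 0.05:

## Sketch: power curves of the two-sided test for n = 16 and n = 100.
mu0 <- 5; alpha <- 0.05; c.half <- qnorm(1 - alpha / 2)
power <- function(mu, n)
  1 - pnorm(sqrt(n) * (mu0 - mu) + c.half) + pnorm(sqrt(n) * (mu0 - mu) - c.half)
mu <- seq(4, 6, length.out = 200)
plot(mu, power(mu, 16), type = "l", ylim = c(0, 1), xlab = "true mean", ylab = "power")
lines(mu, power(mu, 100), lty = 2)
legend("bottomright", legend = c("n = 16", "n = 100"), lty = 1:2)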
[Two power plots ("test of mean, known variance"): power against the true mean for H0: μ = 5 with α = 0.05 and n = 16 vs n = 100 (left), and for n = 16 with α = 0.05 vs α = 0.1 (right).]
Example (Student's t-test; one-sample t-test). X1, ..., Xn ~ N(μ, σ²) iid, μ and σ unknown. H0: μ = μ0 against H1: μ ≠ μ0. Under H0:
T = √n (X̄ - μ0)/S ~ t_{n-1}.
Rejection region: (-∞, -c] ∪ [c, ∞), where c = t_{n-1,α/2} (chosen such that if Y ~ t_{n-1} then P(Y > t_{n-1,α/2}) = α/2). One gets similar plots for the power function.
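In R, the test statistic and p-value can be computed directly or via t.test; a sketch on simulated data (the numbers are illustrative only):

## Sketch: one-sample t-test of H0: mu = 5.
set.seed(1)
x <- rnorm(16, mean = 5.5, sd = 1)
n <- length(x)
t.stat <- sqrt(n) * (mean(x) - 5) / sd(x)
2 * (1 - pt(abs(t.stat), df = n - 1))   # p-value
t.test(x, mu = 5)                       # the built-in version of the same test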

1.8 Likelihood Ratio Tests

This method can be used for many problems and is intuitively appealing. H0: θ ∈ Θ0 against H1: θ ∈ Θ1 := Θ \ Θ0.
Definition. The likelihood ratio test statistic is
t(y) = sup_{θ∈Θ} L(θ; y) / sup_{θ∈Θ0} L(θ; y) = (max. likelihood under H0 + H1) / (max. likelihood under H0).
(Other equivalent definitions are possible.) If t(y) is large, this will indicate support for H1, so reject H0 when t(y) ≥ k, where k is chosen to make
sup_{θ∈Θ0} P_θ(t(Y) ≥ k) = α (or ≤ α)
(e.g. α = 0.05).
Example. Y1, ..., Yn iid N(μ, 1). H0: μ = μ0 against H1: μ ≠ μ0.
L(μ; y) = (2π)^{-n/2} exp(-(1/2) Σ (yi - μ)²).
MLE of μ: μ̂ = ȳ. Hence
sup_{μ} L(μ; y) = (2π)^{-n/2} exp(-(1/2) Σ (yi - ȳ)²) and sup_{μ∈Θ0} L(μ; y) = (2π)^{-n/2} exp(-(1/2) Σ (yi - μ0)²),
so that
t(y) = exp((1/2)[Σ (yi - μ0)² - Σ (yi - ȳ)²]) = ... = exp((n/2)(ȳ - μ0)²).

Reject H0 if |ȳ - μ0| ≥ k', where k' is chosen so that P_{μ0}(|Ȳ - μ0| ≥ k') = α.
Example. Yij = life-length of bulb j made in factory i; Yij independent ~ Exp(λi), i = 1, ..., m, j = 1, ..., n. H0: λ1 = ... = λm against H1: not H0.
θ = (λ1, ..., λm)ᵀ; Θ0 is not a single value here. Interpretation of H0: all factories produce bulbs of equal quality.
Likelihood:
L(θ; y) = Π_{i=1}^m λi^n e^{-λi Σ_j yij}.
MLE: λ̂i = 1/ȳi, where ȳi = (1/n) Σ_j yij. Hence
sup_Θ L(θ; y) = e^{-mn} / (Π_i ȳi)^n.
Under H0: L(λ; y) = λ^{mn} e^{-λ Σ_{i,j} yij}; the MLE is λ̂ = 1/ȳ. Hence
sup_{Θ0} L(θ; y) = e^{-mn} / ȳ^{mn},
so that t(y) = ȳ^{mn} / (Π_i ȳi)^n.
To construct a test we would need to know the distribution of t(Y) under H0. Not easy! Even if it were known, the distribution of t(Y) may depend on θ, and hence choosing k according to sup_{θ∈Θ0} P_θ(t(Y) ≥ k) = α may not be easy.


Theorem 4. Under certain regularity conditions,
2 log t(Y) →d χ²_r (n → ∞)
under H0, where r = the number of independent restrictions on θ needed to define H0. In the above examples: r = 1 and r = m - 1, respectively. An alternative way to derive the degrees of freedom r:
r = (# of independent parameters under the full model) - (# of independent parameters under H0).
Sketch of a proof for the case Θ0 = {θ0}. Suppose Θ ⊆ R^r. Then 2 log t(Y) = 2(log L(θ̂) - log L(θ0)), where θ̂ denotes the MLE of θ. Using a Taylor expansion,
log L(θ0) ≈ log L(θ̂) + (θ0 - θ̂)ᵀ (∂/∂θ) log L(θ̂) + (1/2)(θ0 - θ̂)ᵀ (∂²/∂θ∂θᵀ) log L(θ̂) (θ0 - θ̂),
where (∂/∂θ) log L(θ̂) = 0 (first-order condition at the maximum). Hence
2 log t(Y) ≈ (θ0 - θ̂)ᵀ [-(∂²/∂θ∂θᵀ) log L(θ̂)] (θ0 - θ̂).
By a multivariate version of Theorem 2,
√n(θ̂n - θ0) →d N(0, I_f(θ0)^{-1}),
where I_f(θ) = E_θ[((∂/∂θ) log f(X; θ))((∂/∂θ) log f(X; θ))ᵀ] is the Fisher information of a single observation. By the law of large numbers (and a few more arguments similar to the proof of the asymptotics of the MLE),
-(1/n) (∂²/∂θ∂θᵀ) log L(θ̂) →P I_f(θ0).
Hence 2 log t(Y) →d Zᵀ I_f(θ0) Z, where Z ~ N_r(0, I_f(θ0)^{-1}). Results on quadratic forms of normal random vectors (which will be derived in the second part of this course) imply Zᵀ I_f(θ0) Z ~ χ²_r.
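Applied to the light-bulb example, Theorem 4 suggests rejecting H0 when 2 log t(y) exceeds a χ²_{m-1} quantile. The following sketch simulates data under H0 and computes the statistic; the choices m = 4, n = 20 and the simulated data are illustrative, not from the notes.

## Sketch: likelihood ratio test for equal exponential rates across m factories.
set.seed(1)
m <- 4; n <- 20
y <- matrix(rexp(m * n, rate = 2), nrow = m)         # row i = bulbs from factory i
ybar.i <- rowMeans(y); ybar <- mean(y)
log.t <- m * n * log(ybar) - n * sum(log(ybar.i))    # log of t(y) = ybar^(mn) / prod(ybar.i)^n
2 * log.t                   # observed test statistic
qchisq(0.95, df = m - 1)    # approximate 5% critical value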

1.9 Bayesian Statistics

Setup so far: observed data D = (Y1, ..., Yn); model p(D; θ) [pdf or pmf]; the parameter θ is an unknown constant.
Bayesian statistics: the parameter is a realisation of a random variable (also denoted θ) with pdf π (usually the parameter distribution is absolutely continuous). Bayes' formula:
p(θ|D) = p(D|θ) π(θ) / p(D), where p(D) = ∫ p(D|θ) π(θ) dθ.
p(θ|D) is called the posterior (a-posteriori) distribution of θ.


Example. θ ~ N(0, 1), Xi | θ ~ N(θ, 1) iid, i = 1, ..., n. To compute p(θ|D) we can ignore multiplicative constants (we know that ∫ p(θ|D) dθ = 1):
π(θ) ∝ exp(-θ²/2) and p(x|θ) = Π_{i=1}^n (1/√(2π)) exp(-(xi - θ)²/2) ∝ exp(-(1/2)(nθ² - 2θ Σ xi)).
Hence
p(θ|x) ∝ p(x|θ) π(θ) ∝ exp(-(1/2)((n+1)θ² - 2θ Σ xi)) ∝ exp(-(θ - Σ xi/(n+1))² / (2/(n+1)))
and thus
θ | x1, ..., xn ~ N(Σ xi/(n+1), 1/(n+1)).
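The posterior can be visualised and compared with the prior; a small sketch using simulated data (the value of θ used for simulation is an arbitrary choice, not from the notes):

## Sketch: prior N(0,1) and posterior N(sum(x)/(n+1), 1/(n+1)) in the normal example.
set.seed(1)
x <- rnorm(10, mean = 1.5, sd = 1); n <- length(x)
theta <- seq(-3, 3, length.out = 400)
plot(theta, dnorm(theta, mean = sum(x) / (n + 1), sd = sqrt(1 / (n + 1))),
     type = "l", ylab = "density")
lines(theta, dnorm(theta, 0, 1), lty = 2)
legend("topleft", legend = c("posterior", "prior"), lty = 1:2)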

1.9.1 Summarising the Information

Often one cannot use the posterior as a final result, in particular if θ is high-dimensional (try visualising the density of a 3- or higher-dimensional random vector). To find a "best" summary θ̂ of the posterior we want to use the minimiser of
E(L(θ, θ̂)|D) = ∫ L(θ, θ̂) p(θ|D) dθ,
where L is some loss function. If L(θ, θ̂) = (θ - θ̂)², the optimum is given by the mean of the posterior distribution. If L(θ, θ̂) = |θ - θ̂|, the optimum is given by the median of the posterior distribution. The mean is the most common summary of the posterior distribution.

1.9.2 Conjugate Distributions

Computation of posterior distributions may not always be possible in closed form. When the prior and the observational distribution come from so-called conjugate distributions, this is possible. More formally, a family of (prior) distributions P is conjugate to a family of observational distributions F if for every prior π ∈ P and any observational distribution p(D|θ) ∈ F, the posterior p(θ|D) ∈ P. The example with a normal prior and a normal observation was such an example.
Example. Suppose X | θ ~ Poisson(θ) and θ ~ Gamma(α, β). Then
p(θ|D) ∝ e^{-θ} θ^x · θ^{α-1} e^{-βθ} = θ^{α+x-1} e^{-(β+1)θ},
which is proportional to the density of a Gamma(α + x, β + 1) distribution. Hence the Gamma family is conjugate to the Poisson observation model.
An advantage of conjugate distributions is that the posterior can be computed iteratively for independent observations: the posterior after the first observation can be used as the prior for the second observation.
Example. Suppose we have X1, X2, ..., Xn | θ ~ Poisson(θ) iid and θ ~ Gamma(α, β). Then
p(θ|D) ∝ π(θ) f(x1|θ) f(x2|θ) ··· f(xn|θ).
We already know that π(θ) f(x1|θ) is proportional to a Gamma(α + x1, β + 1) density. We can use the same result again to see that [π(θ) f(x1|θ)] f(x2|θ) is proportional to a Gamma(α + x1 + x2, β + 2) density. Iterating this gives
θ | X1, ..., Xn ~ Gamma(α + Σ_{i=1}^n Xi, β + n).
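The iterative updating is easy to check numerically; a sketch with arbitrary prior parameters and simulated counts (illustrative only):

## Sketch: sequential Gamma-Poisson updating agrees with the one-step posterior.
set.seed(1)
alpha <- 2; beta <- 1
x <- rpois(8, lambda = 3)
a <- alpha; b <- beta
for (xi in x) { a <- a + xi; b <- b + 1 }    # update one observation at a time
c(a, b)                                      # same as ...
c(alpha + sum(x), beta + length(x))          # ... the one-step update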

1.9.3 Credible Intervals

The analogue of confidence intervals in classical statistics. In classical statistics, a realisation of a confidence interval either contains the true parameter or not. In Bayesian statistics the parameter is random, so a particular interval can contain the random parameter with a given probability! We want an interval that contains the parameter with a given probability. Let l(x) and u(x) be functions of the observed data; then a γ credible interval (for some 0 < γ < 1) satisfies
P(l(x) < θ < u(x) | D) = ∫_{l(x)}^{u(x)} p(θ|D) dθ = γ.
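For a Gamma posterior such as the one in the previous example, an equal-tailed credible interval can be read off from the quantile function; a sketch with illustrative posterior parameters:

## Sketch: equal-tailed 95% credible interval when theta | D ~ Gamma(a, b).
a <- 26; b <- 9   # illustrative posterior shape and rate
qgamma(c(0.025, 0.975), shape = a, rate = b)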

1.9.4 MCMC

The posterior distribution can be computed explicitly only in special cases. The advance of computers and recent theoretical developments have overcome this problem: one can approximate the posterior by simulating a specific (so-called) Markov chain whose stationary distribution is the posterior. This approach is usually called Markov chain Monte Carlo (MCMC); a program implementing it is WinBUGS.

