
9. BOOTSTRAP AND JACKKNIFE METHODS
LECTURER: DR. NGUYEN THI THANH SANG
REFERENCES: [4], Ch. 14

IT142IU - ANALYTICS FOR OBSERVATIONAL DATA, 5/1/2024

OUTLINE

• Bootstrap
• Jackknife


What is the bootstrap?


What is the bootstrap? In Statistics…


• Randomly sampling, with replacement, from an original dataset for use in obtaining statistical estimates.
• A data-based simulation method for statistical inference.
• A computer-based method for assigning measures of accuracy to statistical estimates.
• The method requires modern computer power to simplify the intricate calculations of traditional statistical theory.

THE METHOD HAS THE FOLLOWING STEPS


WHY USE THE BOOTSTRAP?

Good question.
Small sample size.
Non-normal distribution of the sample.
A test of means for two samples.
Not as sensitive to N.


BOOTSTRAP IDEA

We avoid the task of taking many samples from the population by instead
taking many resamples from a single sample. The values of x̄ from these
resamples form the bootstrap distribution. We use the bootstrap distribution
rather than theory to learn about the sampling distribution.


• Bootstrap draws samples from the Empirical Distribution of the data {X1, X2, …, Xn} to replicate the statistic θ̂ and obtain its sampling distribution.
• The Empirical Distribution is just a uniform distribution over {X1, X2, …, Xn}. Therefore, the bootstrap simply draws i.i.d. samples from {X1, X2, …, Xn}. The procedure is illustrated by the following graph.

Note: i.i.d. = independent and identically distributed



THE NONPARAMETRIC BOOTSTRAP

Population parameter θ, with unknown distribution F, is estimated by θ̂ computed from an i.i.d. sample. Resample the sample with replacement, repeating B times (B ≥ 1000):
Xb1*, Xb2*, …, Xbn*,  b = 1, 2, …, B
and compute the statistic on each resample to obtain θ̂1*, θ̂2*, …, θ̂B*, which are used for inference about θ.

STEP 1: Population θ, with unknown distribution F. Draw an i.i.d. sample X1, X2, …, Xn from F.


STEP 2: i.i.d. resampling. Resample the data B times with replacement to get many resampled data sets Xb1*, Xb2*, …, Xbn*, and use these resamples in place of repeated real samples from the population.


STEP 3: Regard X1, X2, …, Xn as the new population and resample it B times with replacement, giving Xb1*, Xb2*, …, Xbn* for b = 1, 2, …, B; compute the statistic on each resample to obtain θ̂1*, θ̂2*, …, θ̂B*.
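The three steps can be sketched in Python (a minimal illustration; the simulated normal "population", n = 30, B = 1000, and the mean as the statistic are my choices, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(0)

# STEP 1: an i.i.d. sample X1, ..., Xn from the (unknown) population F.
# Here F is simulated as a normal distribution purely for illustration.
x = rng.normal(loc=10.0, scale=2.0, size=30)

# STEPS 2-3: treat x as the new population, resample it with replacement
# B times, and compute the statistic (here the mean) on each resample.
B = 1000
theta_star = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])

# theta_star now holds B bootstrap replicates of the statistic,
# approximating its sampling distribution.
print(theta_star.shape)  # → (1000,)
```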


BOOTSTRAP FOR ESTIMATING STANDARD ERROR OF A STATISTIC S(X)


BSE CALCULATION (CONTINUED)

• Bootstrap replicates s(x*1), s(x*2), …, s(x*B) are obtained by calculating the value of the statistic s(x) on each bootstrap sample.
• The standard deviation of the values s(x*1), s(x*2), …, s(x*B) is the estimate of the standard error of s(x).
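As a sketch, the bootstrap standard error is just the standard deviation of the replicates (the function name `bootstrap_se` and the test data are mine, not the slides'):

```python
import numpy as np

def bootstrap_se(x, stat, B=1000, seed=0):
    """Bootstrap estimate of the standard error of stat(x)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    # s(x*b) for b = 1, ..., B: the statistic on each bootstrap sample
    reps = np.array([stat(rng.choice(x, size=x.size, replace=True))
                     for _ in range(B)])
    # BSE = standard deviation of the bootstrap replicates
    return reps.std(ddof=1)

x = np.random.default_rng(1).normal(size=50)
# For the mean, the result should be close to x.std(ddof=1) / sqrt(50)
print(bootstrap_se(x, np.mean))
```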


NONPARAMETRIC CONFIDENCE INTERVALS USING BOOTSTRAPPING

• Many methods exist; the simplest is the percentile method.


THE PERCENTILE METHOD

• 1) Construct F̂, the empirical distribution function of the observed data; F̂ places probability 1/n on each observed data point X1, X2, …, Xn.
• 2) Draw a bootstrap sample X1*, X2*, …, Xn* of size n with replacement from F̂. Then calculate θ̂* = θ̂(X1*, X2*, …, Xn*).
• 3) Repeat Step (2) a large number of times, say 1000, and then rank the values θ̂*.

THE PERCENTILE METHOD (CONTINUED)

For a 95% confidence interval, after ranking the bootstrapped θ̂* values, simply take the 2.5% point as the lower confidence limit and the 97.5% point as the upper confidence limit.

The percentile (1−α)·100% confidence interval for a population mean is:
( θ̂*(α/2) , θ̂*(1−α/2) )
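A minimal percentile-method sketch (the function name and the toy data are mine):

```python
import numpy as np

def percentile_ci(x, stat, alpha=0.05, B=1000, seed=0):
    """Percentile (1 - alpha)*100% bootstrap confidence interval for stat(x)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    # bootstrap replicates of the statistic
    reps = np.array([stat(rng.choice(x, size=x.size, replace=True))
                     for _ in range(B)])
    # lower/upper limits: the alpha/2 and 1 - alpha/2 empirical quantiles
    return np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])

lo, hi = percentile_ci(np.arange(1, 21), np.mean)  # toy data: 1, 2, ..., 20
print(lo, hi)  # interval around the sample mean 10.5
```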


EXAMPLE

Suppose we are interested in the wireless network download speed at Stony Brook University. It is difficult for us to examine the entire population at SBU, so the idea of bootstrap resampling comes in. We take a sample of 10 observations from the population, then we resample from the sample we have.


Population Sample (Mbps): 5.55, 9.14, 9.15, 9.19, 9.25, 9.46, 9.55, 10.05, 20.69, 31.94

Resample #1: 5.55, 9.14, 9.15, 9.19, 9.19, 9.25, 9.25, 10.05, 10.05, 10.05
Resample #2: 9.14, 9.15, 9.19, 9.46, 9.46, 9.55, 10.05, 20.69, 20.69, 31.94
Resample #3: 5.55, 9.15, 9.15, 9.15, 9.25, 9.25, 9.25, 9.25, 9.46, 9.46
……
Repeat for N times




/* This is the original data */


5.55 9.46 9.25 9.14 9.15 9.19 31.94 9.55 10.05 20.69
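The bootstrap analysis of this data can be sketched in Python (my code, not the slides' SAS program; resampling noise means the numbers will not match the slides exactly):

```python
import numpy as np

# the original data from the slide, in Mbps
speeds = np.array([5.55, 9.46, 9.25, 9.14, 9.15,
                   9.19, 31.94, 9.55, 10.05, 20.69])

rng = np.random.default_rng(2024)
B = 1000  # 1000 bootstrap replications

# mean of each bootstrap resample
means = np.array([rng.choice(speeds, size=speeds.size, replace=True).mean()
                  for _ in range(B)])

print("mean of replicates:", means.mean())        # close to the sample mean
print("98% C.I.:", np.percentile(means, [1, 99])) # compare with the slides
```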


/* Create 1000 bootstrap replications */


/* SAVE THE VARIABLES MEAN, VAR AND N INTO A NEW DATA SET ENTITLED BOOT. */


98% C.I. is 8.289 to 18.9365


We run the code a second time, and we get the following result:


98% C.I. is 8.4825 to 18.4995


MEAN: 12.447 ≈ 12.457
C.I.: (8.289, 18.9365) ≈ (8.4825, 18.4995)


THE JACKKNIFE
Jackknife methods make use of systematic partitions
of a data set to estimate properties of an estimator
computed from the full sample.

Quenouille (1949, 1956) suggested the technique to estimate (and, hence, reduce) the bias of an estimator θ̂n.

Tukey (1958) coined the term jackknife to refer to the method, and also showed that the method is useful in estimating the variance of an estimator.


Why do we need the Jackknife?

For a data set X = (x1, x2, x3, x4, x5), the standard deviation of the average is:

$$SE(\bar{x}) = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
For measurements other than the mean,
there is no easy way to assess the accuracy.

Jackknife Method

Consider the problem of estimating the standard error of a statistic $t = t(x_1, \ldots, x_n)$ calculated from a random sample from distribution F. In the jackknife method, resampling is done by deleting one observation at a time. Thus, we calculate n values of the statistic, denoted by

$$t_i = t(x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n), \qquad \bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i.$$

Then the jackknife estimate of SE(t) is given by

$$\widehat{JSE}(t) = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\left(t_i - \bar{t}\right)^2} = \frac{n-1}{\sqrt{n}}\, s_{t^*} \qquad (1)$$

where $s_{t^*}$ is the sample standard deviation of $t_1, t_2, \ldots, t_n$.
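Formula (1) transcribes directly into Python (a sketch; `jackknife_se` is my name, and the data are the download speeds from the earlier example):

```python
import numpy as np

def jackknife_se(x, stat):
    """Jackknife estimate of SE(t), a direct transcription of formula (1)."""
    x = np.asarray(x)
    n = x.size
    # t_i: the statistic with observation i deleted
    t = np.array([stat(np.delete(x, i)) for i in range(n)])
    # JSE = sqrt((n-1)/n * sum((t_i - tbar)^2))
    return np.sqrt((n - 1) / n * np.sum((t - t.mean()) ** 2))

x = np.array([5.55, 9.14, 9.15, 9.19, 9.25, 9.46, 9.55, 10.05, 20.69, 31.94])
print(jackknife_se(x, np.mean))
```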

THE STEPS OF THE JACKKNIFE METHOD

• We consider a dataset Z of N independent measurements, either of a random variable X or of a pair of random variables.


THE NONPARAMETRIC JACKKNIFE

A statistic t of an unknown distribution F is estimated by t(x1, x2, …, xn) computed from a sample. Resample by deleting one observation at a time, repeating n times (once per observation):
t(x2, x3, …, xn), t(x1, x3, …, xn), …, t(x1, x2, …, xn−1)
These jackknife replicates t1, t2, …, tn are then used for inference about t.

The formula is not immediately evident, so let us look at the special case $t = \bar{x}$. Then

$$t_i = \bar{x}_i^* = \frac{1}{n-1}\sum_{j \neq i} x_j = \frac{n\bar{x} - x_i}{n-1}, \qquad \bar{t} = \frac{1}{n}\sum_{i=1}^{n} \bar{x}_i^* = \bar{x}.$$

Using simple algebra, it can be shown that

$$\widehat{JSE}(t) = \sqrt{\frac{n-1}{n}\sum_{i=1}^{n}\left(\bar{x}_i^* - \bar{x}\right)^2} = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n(n-1)}} = SE(\bar{x}) \qquad (2)$$

Thus, the jackknife estimate of the standard error (1) gives an exact result for $\bar{x}$.
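Identity (2) is easy to check numerically (a quick sketch on simulated data of my choosing):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=25)
n = x.size

# left-hand side of (2): jackknife SE built from the leave-one-out means
t = np.array([np.delete(x, i).mean() for i in range(n)])
jse = np.sqrt((n - 1) / n * np.sum((t - t.mean()) ** 2))

# right-hand side of (2): the classical standard error of the mean
se = np.sqrt(np.sum((x - x.mean()) ** 2) / (n * (n - 1)))

print(jse - se)  # zero up to floating-point rounding
```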

IT142IU - ANALYTICS FOR OBSERVATIONAL DATA 5/1/2024 41

41

Limitations of the Jackknife

• The jackknife method of estimation can fail if the statistic ti is not smooth. Smoothness implies that relatively small changes to data values will cause only a small change in the statistic.
• The jackknife is not a good estimation method for estimating percentiles (such as the median), or when using any other non-smooth estimator.
• An alternative to the jackknife method of deleting one observation at a time is to delete d observations at a time (d ≥ 2). This is known as the delete-d jackknife.
• In practice, if n is large and d is chosen such that √n < d < n, then the problems of non-smoothness are removed.
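The non-smoothness of the median is easy to see: for an even-sized ordered sample, deleting one observation at a time yields only two distinct leave-one-out medians (illustrative sketch, my toy data):

```python
import numpy as np

x = np.arange(1.0, 11.0)  # 10 ordered observations: 1.0, 2.0, ..., 10.0

# the 10 leave-one-out medians
loo_medians = np.array([np.median(np.delete(x, i)) for i in range(10)])

# only two distinct values appear, so small data changes move the
# statistic in jumps rather than smoothly
print(np.unique(loo_medians))  # → [5. 6.]
```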

EXAMPLE FOR JACKKNIFE

AN EXAMPLE IN PYTHON
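A minimal jackknife example in Python (a sketch consistent with the surrounding slides and with the output table shown later — replicates, bias, bias-corrected estimate, standard error; the lecture's own code may differ, and the function name is mine):

```python
import numpy as np

# the download-speed sample from the slides, in Mbps
speeds = np.array([5.55, 9.14, 9.15, 9.19, 9.25,
                   9.46, 9.55, 10.05, 20.69, 31.94])

def jackknife(x, stat):
    """Leave-one-out jackknife: replicates, bias, corrected estimate, SE."""
    x = np.asarray(x)
    n = x.size
    t_full = stat(x)
    # t_1, ..., t_n: the statistic with one observation deleted at a time
    t = np.array([stat(np.delete(x, i)) for i in range(n)])
    bias = (n - 1) * (t.mean() - t_full)      # jackknife bias estimate
    corrected = t_full - bias                 # bias-corrected estimate
    se = np.sqrt((n - 1) / n * np.sum((t - t.mean()) ** 2))  # formula (1)
    return t, bias, corrected, se

t, bias, corrected, se = jackknife(speeds, np.mean)
print("bias:", bias, "corrected:", corrected, "SE:", se)
```

For the mean, the bias estimate is zero (the mean is unbiased), so the corrected estimate equals the sample mean.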


EXERCISE

• Use Python to implement the jackknife process.


Population Sample (Mbps): 5.55, 9.14, 9.15, 9.19, 9.25, 9.46, 9.55, 10.05, 20.69, 31.94

Each jackknife resample deletes one observation (shown as 0 below):
Resample #1: 0, 9.14, 9.15, 9.19, 9.25, 9.46, 9.55, 10.05, 20.69, 31.94
Resample #2: 5.55, 0, 9.15, 9.19, 9.25, 9.46, 9.55, 10.05, 20.69, 31.94
Resample #3: 5.55, 9.14, 0, 9.19, 9.25, 9.46, 9.55, 10.05, 20.69, 31.94
……
Repeat for 10 times (one resample per deleted observation)


OUTPUT OF JACKKNIFE EXAMPLE


Columns of the output table: Jackknife outcomes for total · Outcome of original sample · Bias · Corrected estimates · Standard error

THANK YOU FOR LISTENING
