Methods in Sample Surveys: Cluster Sampling


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your
use of this material constitutes acceptance of that license and the conditions of use of materials on this site.

Copyright 2009, The Johns Hopkins University and Saifuddin Ahmed. All rights reserved. Use of these
materials permitted only in accordance with license rights granted. Materials provided “AS IS”; no
representations or warranties provided. User assumes all responsibility for use, and all liability related
thereto, and must independently review all materials for accuracy and efficacy. May contain materials
owned by others. User is responsible for obtaining permissions for use from third parties as needed.

Methods in Sample Surveys


140.640

Cluster Sampling

Saifuddin Ahmed
Dept. of Biostatistics
School of Hygiene and Public Health
Johns Hopkins University
Cluster Sampling
Suppose we want to estimate health insurance coverage in Baltimore city.
We could take a random sample of 100 households (HHs). In that case, we need a
sampling list of Baltimore HHs. If the list is not available, we need to conduct a
census of HHs: the whole city must be covered so that all HHs are listed, which
could be expensive. Furthermore, since the sample is small compared to the total
number of HHs, we would sample only a few HHs, say one or two, in each block
(subdivision). Alternatively, we could select 5 blocks (say the city is divided
into 200 blocks) and interview 20 HHs in each block. We then need to construct
the HH listing frame only for the 5 selected blocks, which takes less time and
money. Furthermore, by limiting the survey to a smaller area, additional costs
are saved during the interviews.

Such a sampling strategy is known as “cluster sampling.”

The blocks are “Primary Sampling Units” (PSU) – the clusters.


The households are “Secondary Sampling Units” (SSU).

Definition:

In cluster sampling, a cluster, i.e., a group of population elements, constitutes the
sampling unit, instead of a single element of the population.

The main reason for cluster sampling is “cost efficiency” (economy and
feasibility), but it comes at the expense of statistical efficiency: the
variance of the estimates is usually larger.

Advantages:
• Generating sampling frame for clusters is economical, and sampling
frame is often readily available at cluster level
• Most economical form of sampling
• Larger sample for a similar fixed cost
• Less time for listing and implementation
• Also suitable for survey of institutions

Disadvantages:
• May not reflect the diversity of the community.
• Other elements in the same cluster may share similar characteristics.
• Provides less information per observation than an SRS of the same
size (redundant information: similar information from the others in the
cluster).
• Standard errors of the estimates are higher than under other sampling
designs with the same sample size.

Need to consider the sampling order:

• Primary sampling units (PSU): clusters


• Secondary sampling units (SSU): households/individual elements

1. We may select the PSUs by using a specific element sampling technique,
such as simple random sampling, systematic sampling, or PPS sampling.

2. We may select all SSUs for convenience, or only a few by using a specific
element sampling technique (such as simple random sampling, systematic
sampling, or PPS sampling).

Simple one-stage cluster sample:

List all the clusters in the population, and from the list, select the clusters –
usually with simple random sampling (SRS) strategy. All units (elements)
in the sampled clusters are selected for the survey.

Simple two-stage cluster sample:

List all the clusters in the population. First, select the clusters, usually by
simple random sampling (SRS). The units (elements) in the selected
clusters of the first-stage are then sampled in the second-stage, usually by
simple random sampling (or often by systematic sampling).
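
The two designs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame (200 blocks with 50 households each), the function names, and the seed are hypothetical, and both stages use simple random sampling.

```python
import random

def one_stage_sample(frame, n_clusters, rng):
    """Select clusters by SRS and keep every element in them.

    frame: dict mapping cluster id -> list of element ids (hypothetical layout).
    """
    chosen = rng.sample(sorted(frame), n_clusters)
    return {c: list(frame[c]) for c in chosen}

def two_stage_sample(frame, n_clusters, take, rng):
    """Select clusters by SRS, then `take` elements by SRS within each one."""
    chosen = rng.sample(sorted(frame), n_clusters)
    return {c: rng.sample(frame[c], min(take, len(frame[c]))) for c in chosen}

# Hypothetical frame: 200 blocks, 50 households in each block.
frame = {b: [f"block{b}-hh{h}" for h in range(1, 51)] for b in range(1, 201)}

rng = random.Random(2009)  # fixed seed so the draw is reproducible
sample = two_stage_sample(frame, n_clusters=5, take=20, rng=rng)
for block, households in sample.items():
    print(block, len(households))
```

In a real survey the household list would be built only for the selected blocks; the full frame is generated here just to keep the sketch self-contained.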

Multi-stage sampling:
Sampling is done in more than one stage.
In practice, clusters are also stratified.

Question: Is sampling with probability proportional to size (PPS) a variant of
cluster sampling?

Theory:

1. It is assumed that the population elements are grouped into N clusters (PSUs).

2. Let $M_i$ be the size of the i-th cluster, i.e., the number of elements
(SSUs) in the i-th cluster.

3. The number of PSUs (clusters) in the sample is n, and the number of
elements sampled from the i-th PSU is $m_i$.

Estimation for cluster sampling

Let $y_{ij}$ = measurement for the j-th element (SSU) in the i-th cluster (PSU).

In the simple case of equal-sized clusters (although this may be unrealistic),
the total number of elements in the population is

$$K = N M, \quad \text{where } M_i = M \text{ (constant for all the clusters)}$$

If the clusters are of unequal sizes, the total number of elements in the
population is

$$K = \sum_{i=1}^{N} M_i$$

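Total in the i-th cluster:

$$t_i = \sum_{j=1}^{M_i} y_{ij}$$

Estimated total for the i-th PSU (from the sample $S_i$ of $m_i$ elements):

$$\hat{t}_i = \sum_{j \in S_i} \frac{M_i}{m_i}\, y_{ij} = M_i \bar{y}_i$$

Population total:

$$t = \sum_{i=1}^{N} t_i = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$$

Sample total (over the set $S$ of sampled PSUs):

$$\hat{t} = \sum_{i \in S} t_i$$

Estimated (unbiased) total for population:

$$\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i$$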

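Population mean in the i-th cluster:

$$\bar{Y}_{i,clu} = \sum_{j=1}^{M_i} \frac{y_{ij}}{M_i} = \frac{t_i}{M_i}$$

Sample mean for the i-th PSU:

$$\bar{y}_{i,clu} = \sum_{j \in S_i} \frac{y_{ij}}{m_i} = \frac{\hat{t}_i}{M_i}$$

Population mean:

$$\bar{y}_{clu} = \frac{1}{K} \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$$

Sample mean (unbiased):

$$\hat{\bar{y}}_{clu} = \frac{\sum_{i \in S} \sum_{j \in S_i} y_{ij}}{\sum_{i \in S} m_i}$$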

Variance estimation:

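$$\hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i = N \bar{y}_{total}$$

where $\bar{y}_{total} = \frac{1}{n} \sum_{i \in S} t_i$ is the mean "total" for the clusters.

Then, the variance is:

$$\operatorname{var}(\hat{t}_{unb}) = N^2 \left(1 - \frac{n}{N}\right) \frac{S_t^2}{n}$$

where

$$S_t^2 = \frac{\sum_{i=1}^{N} \left(t_i - \frac{t}{N}\right)^2}{N - 1}$$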

Note: Variance of total is likely to be larger with unequal cluster sizes.
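
As a rough numerical illustration of the two formulas above, the following Python sketch computes the unbiased total and an estimate of its variance. The function name and the three cluster totals are hypothetical, and the sample variance of the sampled cluster totals is plugged in for $S_t^2$, which is an assumption (the text gives only the population version).

```python
from statistics import variance

def estimate_total(sample_totals, N):
    """Unbiased estimate of the population total and its estimated variance.

    sample_totals: totals t_i of the n sampled clusters (one-stage sampling).
    N: number of clusters in the population.
    """
    n = len(sample_totals)
    t_hat = N / n * sum(sample_totals)          # (N/n) * sum of sampled totals
    s2_t = variance(sample_totals)              # sample variance, n-1 denominator
    var_hat = N**2 * (1 - n / N) * s2_t / n     # N^2 (1 - n/N) s_t^2 / n
    return t_hat, var_hat

# Hypothetical data: 3 clusters sampled out of N = 10.
t_hat, var_hat = estimate_total([20, 30, 25], N=10)
print(round(t_hat, 2), round(var_hat, 2))
```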

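The mean (with clusters of equal sizes):

$$\hat{\bar{y}}_{clu} = \frac{\hat{t}_{unb}}{NM} \quad (\text{because of the equal size, } M_i = m_i = M)$$

The variance of the mean is then:

$$\operatorname{var}(\hat{\bar{y}}_{clu}) = \frac{1}{N^2 M^2} \operatorname{var}(\hat{t}_{unb}) = \frac{N^2}{N^2 M^2} \frac{S_t^2}{n} \left(1 - \frac{n}{N}\right) = \frac{S_t^2}{n M^2} \left(1 - \frac{n}{N}\right)$$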

Intra-class Correlation

Intra-class correlation reflects the homogeneity of the elements within clusters.

We may decompose the variance into:

$$\sigma^2 = \sigma_w^2 + \sigma_b^2,$$

that is,

Total variance = variance within + variance between

Intra-class correlation is defined as:

$$\rho = 1 - \frac{\sigma_w^2}{\sigma^2} = \frac{\sigma_b^2}{\sigma^2} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}$$

More specifically:

$$\rho = 1 - \frac{n}{n-1} \frac{\sigma_w^2}{\sigma^2}$$

Minimum: when $\sigma_b^2 = 0$, $\rho = -1/(n-1)$

Maximum: when $\sigma_w^2 = 0$, $\rho = 1$
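
The two limits can be checked numerically with a short Python sketch. The function name is hypothetical, and n is read here as the number of elements per cluster, which is what the stated minimum of -1/(n-1) suggests.

```python
def intraclass_correlation(n, var_within, var_total):
    """rho = 1 - n/(n-1) * var_within / var_total (the more specific form)."""
    return 1 - (n / (n - 1)) * (var_within / var_total)

n = 11
# No between-cluster variance: all of the variance is within clusters.
rho_min = intraclass_correlation(n, var_within=1.0, var_total=1.0)
# No within-cluster variance: elements in a cluster are identical.
rho_max = intraclass_correlation(n, var_within=0.0, var_total=1.0)
print(rho_min, -1 / (n - 1), rho_max)
```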

Derivation of Variance for Cluster Sampling

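$$\rho = 1 - \frac{n}{n-1} \frac{\sigma_w^2}{\sigma^2}$$

$$\rho = \frac{(n-1)\sigma^2 - n\sigma_w^2}{(n-1)\sigma^2}$$

$$\Rightarrow \; n\sigma^2 - \sigma^2 - n(\sigma^2 - \sigma_b^2) = \sigma^2 (n-1)\rho$$

$$\Rightarrow \; n\sigma_b^2 = \sigma^2 + \sigma^2 (n-1)\rho$$

$$\Rightarrow \; \sigma_b^2 = \frac{\sigma^2}{n} [1 + (n-1)\rho]$$

$$\operatorname{var}(\bar{x}) = \frac{\sigma^2}{n} [1 + (n-1)\rho]$$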
________________________________________________________________________

Consider single-stage cluster sampling, where n clusters are sampled from N
clusters and the (average) cluster size is M. The variance of $\bar{y}$ is then:

$$\operatorname{Var}_{clu}(\bar{y}) = \left(\frac{\sigma^2}{nM}\right) [1 + (M-1)\rho]$$

and,

$$\text{Deff} = 1 + (M-1)\rho$$

In cluster sampling, ρ could be quite large, which may seriously affect the
precision of the estimates.

In general, ρ decreases as cluster size increases, but deff depends on both M and ρ,
so in cluster sampling an increase in cluster size usually makes the sampling less efficient.

As an example, for a cluster size of 20, if ρ = 0.1, then deff = 1+(20-1)*0.1 = 2.9,
suggesting that the actual variance is 2.9 times what it would have been under
SRS with the same sample size. However, if the cluster size is large, say
M = 200, then deff = 1+(200-1)*0.1 = 20.9!

When ρ = 0.0, deff=1.
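
The arithmetic above is easy to reproduce. A minimal Python sketch follows; the function name is hypothetical and ρ is held at 0.1 for both cluster sizes, as in the example.

```python
def design_effect(cluster_size, rho):
    """deff = 1 + (M - 1) * rho for clusters of (average) size M."""
    return 1 + (cluster_size - 1) * rho

for size in (20, 200):
    print(size, round(design_effect(size, 0.1), 1))
# prints:
# 20 2.9
# 200 20.9
```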

This relationship has important implications for cluster sampling strategies.

Consider a sampling scenario: we need a sample of 300 elements. We may draw 10 clusters
with 30 elements each, or 3 clusters with 100 elements each. As noted earlier, the
principal reason for cluster sampling is to reduce costs. Obviously, the second
option is cheaper, as we need to go to only 3 clusters. However, as shown above,
the larger the cluster size, the larger the deff. As a result, the first option should be
implemented (take more clusters with fewer elements) as a balance between “cost
efficiency” and “variance efficiency.”
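
To put rough numbers on the two options, the sketch below divides the sample size by deff, giving the size of an SRS with the same variance (often called the effective sample size, a term not used in these notes). It assumes ρ = 0.1 for both options, which is a simplification, since ρ tends to fall as clusters get larger; the function name is hypothetical.

```python
def effective_sample_size(n_clusters, take, rho):
    """Sample size divided by deff = 1 + (take - 1) * rho."""
    deff = 1 + (take - 1) * rho
    return n_clusters * take / deff

# Both options interview 300 elements; rho = 0.1 is an assumed value.
for n_clusters, take in [(10, 30), (3, 100)]:
    n_eff = effective_sample_size(n_clusters, take, rho=0.1)
    print(n_clusters, take, round(n_eff, 1))
# prints:
# 10 30 76.9
# 3 100 27.5
```

Under this assumption the 10-cluster design carries roughly the information of 77 independent observations, against about 28 for the 3-cluster design.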

Lessons for Cluster Sampling

• Use as many clusters as feasible.


• Use smaller cluster size in terms of number of households/individuals
selected in each cluster.
• Use a constant “take size” rather than a variable one (say 30 households
from each cluster).

Example:

. list area age, clean

area age
1. 1 15
2. 1 16
3. 1 17
4. 1 18
5. 1 19
6. 1 20
7. 1 21
8. 1 22
9. 1 23
10. 1 24
11. 1 25
12. 2 25
13. 2 26
14. 2 27
15. 2 28
16. 2 29
17. 2 30
18. 2 31
19. 2 32
20. 2 33
21. 2 34
22. 2 35

. sum age

Variable | Obs Mean Std. Dev. Min Max


---------+-----------------------------------------------------
age | 22 25 6.055301 15 35

. ci age

Variable | Obs Mean Std. Err. [95% Conf. Interval]


---------+-------------------------------------------------------------
age | 22 25 1.290994 22.31523 27.68477

. oneway age area

Analysis of Variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 550 1 550 50.00 0.0000
Within groups 220 20 11
------------------------------------------------------------------------
Total 770 21 36.6666667

*SE under SRS

. disp sqrt((770/21)/22)
1.2909944

UNDER CLUSTER SAMPLING:

. svyset, psu(area)
psu is area

. svymean age

Survey mean estimation

pweight: <none> Number of obs = 22


Strata: <one> Number of strata = 1
PSU: area Number of PSUs = 2
Population size = 22

------------------------------------------------------------------------------
Mean | Estimate Std. Err. [95% Conf. Interval] Deff
---------+--------------------------------------------------------------------
age | 25 5 -38.53102 88.53102 15
------------------------------------------------------------------------------
*Direct estimation of SE under cluster sampling design

. disp sqrt((550/1)/22)
5

*Estimation of deff:
. di 5^2/1.290994^2
15.00001
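
The same three numbers can be reproduced outside Stata. The Python sketch below rebuilds the 22 observations listed above and computes the SRS standard error, the cluster-sampling standard error, and deff. The cluster standard error uses the variance of the two cluster means divided by the number of clusters, a shortcut that relies on the clusters being of equal size; it is not claimed to be the algorithm Stata uses internally.

```python
from math import sqrt
from statistics import mean, variance

# Data from the listing: ages 15-25 in area 1 and ages 25-35 in area 2.
ages = {1: list(range(15, 26)), 2: list(range(25, 36))}
all_ages = [a for cluster in ages.values() for a in cluster]
n_obs = len(all_ages)

# SE under SRS: sqrt(s^2 / n), ignoring the clustering.
se_srs = sqrt(variance(all_ages) / n_obs)

# SE under cluster sampling (equal-sized clusters, no finite population
# correction): sqrt(variance of cluster means / number of clusters).
cluster_means = [mean(values) for values in ages.values()]
se_clu = sqrt(variance(cluster_means) / len(cluster_means))

deff = se_clu**2 / se_srs**2
print(round(se_srs, 6), round(se_clu, 6), round(deff, 4))
```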

Use of STATA to estimate intra-class correlation

1. loneway
. loneway age area

One-way Analysis of Variance for age:

Number of obs = 22
R-squared = 0.7143

Source SS df MS F Prob > F


-------------------------------------------------------------------------
Between area 550 1 550 50.00 0.0000
Within area 220 20 11
-------------------------------------------------------------------------
Total 770 21 36.666667

Intraclass Asy.
correlation S.E. [95% Conf. Interval]
------------------------------------------------
0.81667 0.22140 0.38274 1.25059

Estimated SD of area effect 7


Estimated SD within area 3.316625
Est. reliability of a area mean 0.98000
(evaluated at n=11.00)

In the loneway command, the ICC (ρ) is estimated by:

rho = (MSB-MSW)/(MSB+(m-1)*MSW)

MSB = mean square between
MSW = mean square within
m = (average) size of the cluster

. di (550-11)/(550+(11-1)*11)
.81666667
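
For readers who want to see where MSB and MSW come from, here is a Python sketch that computes them from the raw data and applies the formula above. The function name is hypothetical, and m is taken as the plain average cluster size, which is exact here because both clusters have 11 elements; for unequal cluster sizes a different adjustment of m may be appropriate.

```python
def anova_icc(groups):
    """Return (MSB, MSW, rho) from a list of clusters of measurements."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    m = n_total / k                                   # average cluster size
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-cluster and within-cluster sums of squares.
    ssb = sum(len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means))
    ssw = sum((y - gm) ** 2 for g, gm in zip(groups, group_means) for y in g)
    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    rho = (msb - msw) / (msb + (m - 1) * msw)
    return msb, msw, rho

# Same data as the example: ages 15-25 in area 1, ages 25-35 in area 2.
msb, msw, rho = anova_icc([list(range(15, 26)), list(range(25, 36))])
print(msb, msw, round(rho, 5))
```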

2. xt commands:
. xtreg age, i(area)

Random-effects GLS regression Number of obs = 22


Group variable (i): area Number of groups = 2

R-sq: within = . Obs per group: min = 11


between = . avg = 11.0
overall = 0.0000 max = 11

Random effects u_i ~ Gaussian Wald chi2(0) = 0.00


corr(u_i, X) = 0 (assumed) Prob > chi2 = .

------------------------------------------------------------------------------
age | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 25 5 5.00 0.000 15.20018 34.79982
-------------+----------------------------------------------------------------
sigma_u | 7
sigma_e | 3.3166248
rho | .81666667 (fraction of variance due to u_i)
------------------------------------------------------------------------------

*How icc (rho) is measured:


di 7^2/(3.3166248^2+7^2)
.81666667

However, the ICC for a binary outcome is estimated differently:
. ta area adversehealth, row
| adversehealth
area | 0 1 | Total
-----------+----------------------+----------
1 | 3 8 | 11
| 27.27 72.73 | 100.00
-----------+----------------------+----------
2 | 8 3 | 11
| 72.73 27.27 | 100.00
-----------+----------------------+----------
Total | 11 11 | 22

| 50.00 50.00 | 100.00

. xtlogit adverse, i(area)

Fitting comparison model:


Iteration 0: log likelihood = -15.249238

Fitting full model:

Random-effects logistic regression Number of obs = 22


Group variable (i): area Number of groups = 2

Random effects u_i ~ Gaussian Obs per group: min = 11


avg = 11.0
max = 11

Wald chi2(0) = 0.00


Log likelihood = -14.730665 Prob > chi2 = .

------------------------------------------------------------------------------
adversehea~h | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | -1.12e-15 .7128713 -0.00 1.000 -1.397202 1.397202
-------------+----------------------------------------------------------------
/lnsig2u | -.5081339 1.802657 -4.041277 3.025009
-------------+----------------------------------------------------------------
sigma_u | .7756399 .6991063 .1325708 4.538082
rho | .1545983 .2356031 .0053138 .8622567
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0: chibar2(01) = 1.04 Prob >= chibar2 = 0.154

If the error term is considered to have a standard logistic distribution, the variance of the error term is $\pi^2/3$.

So,

$$\rho = \frac{\sigma_u^2}{\sigma_u^2 + \frac{\pi^2}{3}}$$
di .7756399^2/(.7756399^2+_pi^2/3)
.15459836
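
The same calculation in Python, as a small sketch with a hypothetical function name. It also recovers sigma_u from the /lnsig2u row of the output, on the reading that this row is the log of the random-effect variance; the printed sigma_u agrees with that reading.

```python
from math import exp, pi

def logistic_icc(sigma_u):
    """rho = sigma_u^2 / (sigma_u^2 + pi^2 / 3) for a random-intercept logit model."""
    return sigma_u**2 / (sigma_u**2 + pi**2 / 3)

lnsig2u = -0.5081339             # /lnsig2u from the xtlogit output
sigma_u = exp(lnsig2u / 2)       # sigma_u = sqrt(exp(lnsig2u))
rho_logit = logistic_icc(sigma_u)
print(round(sigma_u, 5), round(rho_logit, 5))
```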

SAMPLE SIZE ESTIMATION under CLUSTER SAMPLING:

The major issue: DEFF >1.0

Solutions:

1. Increase the sample size estimated under SRS by multiplying it by an
estimated DEFF (from a published source, or estimated from the formula
stated below):

deff=1+(m-1)ρ

Consider the comparison between:

$$\frac{\sigma^2}{n} \quad \text{(variance under SRS)}$$

vs.

$$\frac{\sigma^2}{nm} [1 + (m-1)\rho] \quad \text{(variance under cluster sampling)}$$

So, transform the sample size estimation formula,

$$n = \frac{2 (z_{\alpha/2} + z_{\beta})^2 \sigma^2}{d^2}$$

to:

$$nm = \frac{2 (z_{\alpha/2} + z_{\beta})^2 \sigma^2}{d^2} [1 + (m-1)\rho]$$

which gives the total sample of individuals (n clusters of size m).

In practice, m is about 30 and ρ is kept very small. The deff values are available
from published reports (e.g., Demographic and Health Survey reports). Usually a
value of 1.5 to 2.0 for deff is considered for sample size estimation.
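
The inflated sample size formula can be sketched in Python as follows. The function name and all input values (σ = 1, d = 0.5, m = 30, ρ = 0.02) are hypothetical and chosen only for illustration; the factor 2 and the z values of 1.96 and 0.84 follow the formula and the note in this handout.

```python
from math import ceil

def cluster_sample_size(sigma, d, m, rho, z_alpha=1.96, z_beta=0.84):
    """Return (total individuals, number of clusters) under cluster sampling.

    sigma: standard deviation of the outcome; d: difference to detect;
    m: individuals taken per cluster; rho: intra-class correlation.
    """
    n_srs = 2 * (z_alpha + z_beta) ** 2 * sigma**2 / d**2   # size under SRS
    deff = 1 + (m - 1) * rho                                # design effect
    total = n_srs * deff
    return ceil(total), ceil(total / m)

# Hypothetical inputs, for illustration only.
total, clusters = cluster_sample_size(sigma=1.0, d=0.5, m=30, rho=0.02)
print(total, clusters)
```

With these inputs the SRS size of about 63 is inflated by deff = 1.58 to about 100 individuals, which means 4 clusters of 30 after rounding up.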

Source: Bangladesh DHS

Note that DHS (as shown above) reports “deft”, which is the square root of deff, i.e.,
deft = std.error(cluster)/std.error(srs).

2. You may also calculate the number of clusters required for the study by using
the above formulas:

$$n = \frac{2 (z_{\alpha/2} + z_{\beta})^2 \sigma^2}{m\, d^2} [1 + (m-1)\rho]$$

which gives the number of clusters.

Essentially, the same sample size formula is needed for a “randomized community
trial.” However, deff is called the “variance inflation factor” in randomized
community trials (essentially borrowed from survey statistics!).

3. Other methods:

Direct estimation of the number of clusters needed for a survey:

Exact:

$$m = \frac{Z_{1-\alpha/2}^2 \, M V^2}{Z_{1-\alpha/2}^2 \, V^2 + (M-1) d^2}$$

Example: M = 100 clusters in the population.

Research question: what is the average number of children in the population,
based on 100 clusters? The estimate is needed for designing a study of health
care facility needs, with the following information:

σ² = 0.5
Ȳ (mean) = 5.6

$$V^2 = \frac{\sigma^2}{\bar{Y}^2} = \frac{0.5}{(5.6)^2} = 0.01594388$$

STATA command:

. di "m = " (9*100*.01594388)/(9*.01594388+99*(.10^2))
m = 12.659512 ~ 13 clusters

Note: the multiplier 9 is a rough round-up of (1.96 + 0.84)^2 = 7.84, used for faster calculation.

Approximate method:

$$m = \frac{Z_{1-\alpha/2}^2 \, V^2}{d^2}$$

STATA command:

. di "m = " (9*.01594388)/(.10^2)
m = 14.349492 ~ 15 clusters
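
Both calculations can be wrapped in small Python functions, sketched below with hypothetical names. The default multiplier of 9 and d = 0.10 are the values used in the STATA commands above; V² is computed from σ² = 0.5 and a mean of 5.6 rather than typed in.

```python
from math import ceil

def clusters_exact(M, V2, d, z2=9):
    """m = z2*M*V2 / (z2*V2 + (M - 1)*d^2), with M clusters in the population."""
    return z2 * M * V2 / (z2 * V2 + (M - 1) * d**2)

def clusters_approx(V2, d, z2=9):
    """m = z2*V2 / d^2, which ignores the finite number of clusters."""
    return z2 * V2 / d**2

V2 = 0.5 / 5.6**2                      # squared coefficient of variation
m_exact = clusters_exact(M=100, V2=V2, d=0.10)
m_approx = clusters_approx(V2=V2, d=0.10)
print(round(m_exact, 4), ceil(m_exact), round(m_approx, 4), ceil(m_approx))
```

The approximate method always gives the larger answer, because it drops the correction for sampling from a finite set of M clusters.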
