Methods in Sample Surveys: Cluster Sampling
Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site.
Copyright 2009, The Johns Hopkins University and Saifuddin Ahmed. All rights reserved. Use of these
materials permitted only in accordance with license rights granted. Materials provided “AS IS”; no
representations or warranties provided. User assumes all responsibility for use, and all liability related
thereto, and must independently review all materials for accuracy and efficacy. May contain materials
owned by others. User is responsible for obtaining permissions for use from third parties as needed.
Cluster Sampling
Saifuddin Ahmed
Dept. of Biostatistics
School of Hygiene and Public Health
Johns Hopkins University
Suppose we want to estimate health insurance coverage in Baltimore city.
We could take a random sample of 100 households (HHs). In that case, we need a
sampling list of Baltimore HHs. If the list is not available, we need to conduct a
census of HHs. Complete coverage of Baltimore city is required so that all
HHs are listed, which could be expensive. Furthermore, since our sample size is
small compared to the total number of HHs, we would sample only a few HHs, say
one or two, in each block (subdivision). Alternatively, we could select 5 blocks
(say the city is divided into 200 blocks) and interview 20 HHs in each block. We
then need to construct the HH listing frame for only 5 blocks, which takes less
time and costs less. Furthermore, limiting the survey to a smaller area saves
additional costs during the execution of interviews.
Definition:
The main reason for cluster sampling is “cost efficiency” (economy and
feasibility), but this comes at the expense of variance efficiency: estimates
are less precise.
Advantages:
• Generating sampling frame for clusters is economical, and sampling
frame is often readily available at cluster level
• Most economical form of sampling
• Larger sample for a similar fixed cost
• Less time for listing and implementation
• Also suitable for survey of institutions
Disadvantages:
• May not reflect the diversity of the community.
• Other elements in the same cluster may share similar characteristics.
• Provides less information per observation than an SRS of the same
size (redundant information: similar information from the others in the
cluster).
• Standard errors of the estimates are higher than under other sampling
designs with the same sample size
Need to consider the sampling order:
Single-stage cluster sampling:
List all the clusters in the population and, from the list, select the clusters,
usually by simple random sampling (SRS). All units (elements)
in the sampled clusters are selected for the survey.
Two-stage cluster sampling:
List all the clusters in the population. First, select the clusters, usually by
simple random sampling (SRS). The units (elements) in the clusters selected
at the first stage are then sampled in the second stage, usually by
simple random sampling (or often by systematic sampling).
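The two selection schemes described above can be sketched in Python. This is only an illustration: the function name `two_stage_sample`, the household labels, and the frame of 200 blocks with 50 households each are assumptions made for the example, not part of the original notes.

```python
import random

def two_stage_sample(clusters, n_clusters, m_per_cluster, seed=None):
    """Select clusters by SRS, then elements within each selected cluster by SRS.

    clusters: dict mapping a cluster id to the list of its elements.
    m_per_cluster=None keeps every element (single-stage cluster sampling).
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of clusters (PSUs), without replacement.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for cid in chosen:
        elements = clusters[cid]
        if m_per_cluster is None or m_per_cluster >= len(elements):
            sample[cid] = list(elements)  # take all elements of the cluster
        else:
            # Stage 2: simple random sample of elements (SSUs) within the cluster.
            sample[cid] = rng.sample(elements, m_per_cluster)
    return sample

# Hypothetical frame: 200 blocks, each with 50 households.
frame = {b: [f"block{b}-hh{h}" for h in range(50)] for b in range(200)}
picked = two_stage_sample(frame, n_clusters=5, m_per_cluster=20, seed=1)
print({b: len(hhs) for b, hhs in picked.items()})
```

Only the selected blocks need a household listing, which is where the cost saving comes from.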
Multi-stage sampling:
when sampling is done in more than one stage.
In practice, clusters are also stratified.
Theory:
2. Let the size of the i-th cluster be M_i, i.e., the number of elements
(SSUs) in the i-th cluster is M_i.
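Estimation for cluster sampling
Let $y_{ij}$ = measurement for the j-th element (SSU) in the i-th cluster (PSU).
Let $S$ denote the set of sampled clusters and $S_i$ the set of sampled elements in the i-th cluster, with $m_i$ elements sampled.
Total number of elements in the population:
\[ K = \sum_{i=1}^{N} M_i \]
Total in the i-th PSU:
\[ t_i = \sum_{j=1}^{M_i} y_{ij} \]
Estimated sample total for the i-th PSU:
\[ \hat{t}_i = \sum_{j \in S_i} \frac{M_i}{m_i}\, y_{ij} = M_i \bar{y}_i \]
Population total:
\[ t = \sum_{i=1}^{N} t_i = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij} \]
Total over the sampled clusters:
\[ \hat{t} = \sum_{i \in S} t_i \]
Unbiased estimate of the population total:
\[ \hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i \]
Population mean in the i-th cluster:
\[ \bar{Y}_{i,clu} = \sum_{j=1}^{M_i} \frac{y_{ij}}{M_i} = \frac{t_i}{M_i} \]
Sample mean for the i-th PSU:
\[ \bar{y}_{i,clu} = \sum_{j \in S_i} \frac{y_{ij}}{m_i} = \frac{\hat{t}_i}{M_i} \]
Population mean:
\[ \bar{y}_{clu} = \frac{1}{K} \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij} \]
Sample estimate of the mean:
\[ \hat{\bar{y}}_{clu} = \frac{\hat{t}}{\sum_{i \in S} m_i} \]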
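Variance estimation:
\[ \hat{t}_{unb} = \frac{N}{n} \sum_{i \in S} t_i = N\, \frac{\sum_{i \in S} t_i}{n} = N \bar{y}_{total}, \]
where $\bar{y}_{total}$ is the mean "total" for the sampled clusters.
Then, variance:
\[ var(\hat{t}_{unb}) = N^2\, \frac{S_t^2}{n} \left(1 - \frac{n}{N}\right) \]
where,
\[ S_t^2 = \frac{\sum_{i=1}^{N} \left(t_i - \frac{t}{N}\right)^2}{N - 1} \]
With equal cluster sizes ($M_i = m_i = M$):
\[ \hat{\bar{y}}_{clu} = \frac{\hat{t}_{unb}}{NM} \]
\[ var(\hat{\bar{y}}) = \frac{1}{N^2 M^2}\, var(\hat{t}_{unb}) = \frac{N^2}{N^2 M^2}\, \frac{S_t^2}{n} \left(1 - \frac{n}{N}\right) = \frac{S_t^2}{n M^2} \left(1 - \frac{n}{N}\right) \]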
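As a numerical illustration of the formulas above, the following Python sketch computes the unbiased total, the mean for equal-sized clusters, and their variances. Two assumptions are made that are not in the notes: the population quantity S_t^2 is replaced by its sample analogue (the variance of the sampled cluster totals with divisor n - 1), and the cluster totals, N, and M are hypothetical values chosen for the example.

```python
from statistics import mean, variance

def cluster_total_estimate(sample_totals, N):
    """Single-stage cluster sampling with SRS of clusters.

    sample_totals: totals t_i of the n sampled clusters.
    N: number of clusters in the population.
    Returns (t_hat_unb, estimated variance of t_hat_unb).
    """
    n = len(sample_totals)
    t_hat = N * mean(sample_totals)   # (N/n) * sum of sampled t_i
    s_t2 = variance(sample_totals)    # assumption: sample analogue of S_t^2
    var_t_hat = N**2 * (1 - n / N) * s_t2 / n
    return t_hat, var_t_hat

def cluster_mean_estimate(sample_totals, N, M):
    """Mean per element and its variance when every cluster has size M."""
    t_hat, var_t_hat = cluster_total_estimate(sample_totals, N)
    return t_hat / (N * M), var_t_hat / (N * M) ** 2

# Hypothetical data: 4 clusters sampled out of N = 40, each of size M = 11.
totals = [220, 330, 275, 253]
t_hat, var_t_hat = cluster_total_estimate(totals, N=40)
y_hat, var_y_hat = cluster_mean_estimate(totals, N=40, M=11)
print(t_hat, var_t_hat, y_hat, var_y_hat)
```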
Intra-class Correlation
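\[ \sigma^2 = \sigma_w^2 + \sigma_b^2, \]
that is,
Total variance = variance within + variance between
\[ \rho = 1 - \frac{\sigma_w^2}{\sigma^2} = \frac{\sigma_b^2}{\sigma^2} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2} \]
More specifically (here n is the number of elements in a cluster):
\[ \rho = 1 - \frac{n}{n-1}\, \frac{\sigma_w^2}{\sigma^2} \]
\[ \rho = \frac{(n-1)\sigma^2 - n\sigma_w^2}{(n-1)\sigma^2} \]
\[ \Rightarrow\ n\sigma^2 - \sigma^2 - n(\sigma^2 - \sigma_b^2) = \sigma^2 (n-1)\rho \]
\[ \Rightarrow\ n\sigma_b^2 = \sigma^2 + \sigma^2 (n-1)\rho \]
\[ \Rightarrow\ \sigma_b^2 = \frac{\sigma^2}{n}\, [1 + (n-1)\rho] \]
\[ var(\bar{x}) = \frac{\sigma^2}{n}\, [1 + (n-1)\rho] \]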
Consider single-stage cluster sampling in which n clusters are selected from N
clusters and the (average) cluster size is M. The variance of $\bar{y}$ is then:
\[ Var_{clu}(\bar{y}) = \left(\frac{\sigma^2}{nM}\right) [1 + (M-1)\rho] \]
and,
\[ Deff = 1 + (M-1)\rho \]
In cluster sampling, ρ can be quite large, which may seriously affect the
precision of estimates.
In general, ρ decreases as cluster size increases, but deff depends on both M and ρ,
so in cluster sampling an increase in cluster size generally makes sampling less efficient.
As an example, for a cluster size of 20 and ρ = 0.1, deff = 1+(20-1)*0.1 = 2.9,
suggesting that the actual variance is 2.9 times what it would have been under
SRS with the same sample size. However, if the cluster size is large, say
M=200, deff=1+(200-1)*0.1=20.9!
Consider a sampling scenario: we need a sample of 300 elements. We may draw 10 clusters
of 30 elements each, or 3 clusters of 100 elements each. As noted earlier, the
principal reason for conducting cluster sampling is to reduce costs. Obviously, the 2nd
option is cheaper, as we need to go to only 3 clusters. However, as shown above,
the larger the cluster size, the larger the deff. As a result, the first option should be
implemented (take more clusters with fewer elements) as a balance between “cost
efficiency” and “variance efficiency.”
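The comparison of the two options can be put in numbers with a short Python sketch. It assumes ρ = 0.1 for both cluster sizes, borrowed from the example above; in practice ρ tends to be smaller for larger clusters. The function names are illustrative, and the "effective sample size" (total sample divided by deff) is a standard summary added here rather than something taken from the notes.

```python
def design_effect(m, rho):
    """deff = 1 + (m - 1) * rho for clusters of size m."""
    return 1 + (m - 1) * rho

def effective_sample_size(n_clusters, m, rho):
    """Size of an SRS that would give the same variance as the cluster sample."""
    return n_clusters * m / design_effect(m, rho)

rho = 0.1  # assumed equal for both options
for n_clusters, m in [(10, 30), (3, 100)]:
    deff = design_effect(m, rho)
    ess = effective_sample_size(n_clusters, m, rho)
    print(f"{n_clusters} clusters x {m}: deff = {deff:.1f}, effective n = {ess:.1f}")
```

Under these assumptions the 10-cluster design has a deff of about 3.9 against about 10.9 for the 3-cluster design, so the same 300 interviews carry far more information in the first option.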
Example:
area age
1. 1 15
2. 1 16
3. 1 17
4. 1 18
5. 1 19
6. 1 20
7. 1 21
8. 1 22
9. 1 23
10. 1 24
11. 1 25
12. 2 25
13. 2 26
14. 2 27
15. 2 28
16. 2 29
17. 2 30
18. 2 31
19. 2 32
20. 2 33
21. 2 34
22. 2 35
. sum age
. ci age
                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups            550      1          550     50.00     0.0000
 Within groups            220     20           11
------------------------------------------------------------------------
    Total                 770     21   36.6666667
*SE under SRS
. disp sqrt((770/21)/22)
1.2909944
svyset, psu(area)
psu is area
. svymean age
------------------------------------------------------------------------------
Mean | Estimate Std. Err. [95% Conf. Interval] Deff
---------+--------------------------------------------------------------------
age | 25 5 -38.53102 88.53102 15
------------------------------------------------------------------------------
*Direct estimation of SE under cluster sampling design
. disp sqrt((550/1)/22)
5
*Estimation of deff:
. di 5^2/1.290994^2
15.00001
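The Stata calculations above can be cross-checked with a short Python sketch that rebuilds the sums of squares from the 22 listed ages. It mirrors the hand calculations shown with `disp` and is not a reimplementation of `svymean`; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Ages by area, as listed in the example data.
ages = {1: list(range(15, 26)), 2: list(range(25, 36))}
all_ages = [a for group in ages.values() for a in group]
n_obs = len(all_ages)
grand_mean = mean(all_ages)

# One-way ANOVA sums of squares.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in ages.values())
ss_total = sum((a - grand_mean) ** 2 for a in all_ages)
ss_within = ss_total - ss_between
df_between = len(ages) - 1

# SE of the mean under SRS: total variance divided by the number of observations.
se_srs = sqrt(ss_total / (n_obs - 1) / n_obs)
# SE under the cluster design: between-cluster mean square divided by n_obs.
se_clu = sqrt(ss_between / df_between / n_obs)
deff = se_clu**2 / se_srs**2

print(ss_between, ss_within, ss_total)
print(se_srs, se_clu, deff)
```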
1. loneway
. loneway age area
Number of obs = 22
R-squared = 0.7143
Intraclass Asy.
correlation S.E. [95% Conf. Interval]
------------------------------------------------
0.81667 0.22140 0.38274 1.25059
Rho= (MSB-MSW)/(MSB+(m-1)MSW)
. di (550-11)/(550+(11-1)*11)
.81666667
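The same ANOVA-based formula can be written as a small Python function. This is a sketch that takes the mean squares from the ANOVA table above as inputs; the function name `anova_icc` is illustrative.

```python
def anova_icc(msb, msw, m):
    """Intraclass correlation from one-way ANOVA mean squares.

    msb: between-groups mean square
    msw: within-groups mean square
    m:   (common) cluster size
    """
    return (msb - msw) / (msb + (m - 1) * msw)

rho = anova_icc(msb=550, msw=11, m=11)
print(round(rho, 5))
```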
2. xt – command:
xtreg age, i(area)
------------------------------------------------------------------------------
age | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 25 5 5.00 0.000 15.20018 34.79982
-------------+----------------------------------------------------------------
sigma_u | 7
sigma_e | 3.3166248
rho | .81666667 (fraction of variance due to u_i)
------------------------------------------------------------------------------
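The `sigma_u`, `sigma_e`, and `rho` values in this output agree with the ANOVA (method-of-moments) variance components for this balanced data set, as the Python sketch below shows. It is a consistency check on the reported numbers only; no claim is made here about how `xtreg` computes them internally.

```python
from math import sqrt

msb, msw, m = 550, 11, 11      # mean squares and cluster size from the ANOVA table

sigma_u2 = (msb - msw) / m     # between-cluster variance component
sigma_e2 = msw                 # within-cluster (residual) variance component
rho = sigma_u2 / (sigma_u2 + sigma_e2)

print(sqrt(sigma_u2), sqrt(sigma_e2), rho)
```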
However, estimating ICC from binary outcome is done differently:
. ta area adversehealth, row
| adversehealth
area | 0 1 | Total
-----------+----------------------+----------
1 | 3 8 | 11
| 27.27 72.73 | 100.00
-----------+----------------------+----------
2 | 8 3 | 11
| 72.73 27.27 | 100.00
-----------+----------------------+----------
Total | 11 11 | 22
------------------------------------------------------------------------------
adversehea~h | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | -1.12e-15 .7128713 -0.00 1.000 -1.397202 1.397202
-------------+----------------------------------------------------------------
/lnsig2u | -.5081339 1.802657 -4.041277 3.025009
-------------+----------------------------------------------------------------
sigma_u | .7756399 .6991063 .1325708 4.538082
rho | .1545983 .2356031 .0053138 .8622567
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0: chibar2(01) = 1.04 Prob >= chibar2 = 0.154
If the error term is considered to have a standard logistic distribution, the variance of the error term is π²/3.
So,
\[ \rho = \frac{\sigma_u^2}{\sigma_u^2 + \frac{\pi^2}{3}} \]
di .7756399^2/(.7756399^2+_pi^2/3)
.15459836
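The same calculation in Python, as a minimal sketch: the function name `logistic_icc` is illustrative, and the input is the `sigma_u` value reported in the output above.

```python
from math import pi

def logistic_icc(sigma_u):
    """ICC on the latent scale, taking the level-1 error variance as pi^2 / 3."""
    return sigma_u**2 / (sigma_u**2 + pi**2 / 3)

rho = logistic_icc(0.7756399)
print(round(rho, 7))
```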
SAMPLE SIZE ESTIMATION under CLUSTER SAMPLING:
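Solutions:
deff = 1 + (m − 1)ρ
\[ \frac{\sigma^2}{n} \quad \text{... variance under SRS} \]
vs.
\[ \frac{\sigma^2}{nm}\, [1 + (m-1)\rho] \quad \text{... variance under cluster sampling} \]
The sample size formula changes from:
\[ n = \frac{(z_{\alpha/2} + z_\beta)^2\, \sigma^2}{d^2} \]
to:
\[ nm = \frac{2 (z_{\alpha/2} + z_\beta)^2\, \sigma^2}{d^2}\, [1 + (m-1)\rho] \quad \text{... total sample of individuals (n clusters of size m)} \]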
In practice, m is about 30 and ρ is kept very (very) small. Deff values are available
from published reports (e.g., Demographic and Health Survey reports). Usually a
deff value of 1.5 to 2.0 is assumed for sample size estimation.
Source: Bangladesh DHS
Note that DHS (as shown above) reports “deft”, which is the square root of deff, i.e.,
deft=std.error(cluster)/std.error(srs).
2. You may also calculate the number of clusters required for the study utilizing
the above formulas.
\[ n = \frac{2 (z_{\alpha/2} + z_\beta)^2\, \sigma^2}{m\, d^2}\, [1 + (m-1)\rho] \quad \text{... no. of clusters} \]
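The number-of-clusters formula above can be sketched in Python as follows. The input values (σ² = 1, d = 0.5, m = 30, ρ = 0.02, two-sided α = 0.05, power 0.80) are hypothetical and chosen only for illustration, as is the function name; the normal quantiles come from the standard library's `NormalDist`.

```python
from math import ceil
from statistics import NormalDist

def clusters_needed(sigma2, d, m, rho, alpha=0.05, power=0.80):
    """Number of clusters of size m, inflating the SRS sample size by deff."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_srs = 2 * (z_alpha + z_beta) ** 2 * sigma2 / d**2   # individuals, ignoring clustering
    deff = 1 + (m - 1) * rho
    n_individuals = n_srs * deff                          # individuals after inflation
    return ceil(n_individuals / m), n_individuals, deff

# Hypothetical inputs, for illustration only.
n_clusters, n_individuals, deff = clusters_needed(sigma2=1.0, d=0.5, m=30, rho=0.02)
print(n_clusters, round(n_individuals, 1), round(deff, 2))
```

The number of clusters is rounded up, since a fraction of a cluster cannot be sampled.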
Essentially, the same sample size formula is used for a “randomized community
trial.” However, deff is called the “variance inflation factor” in randomized
community trials (essentially borrowed from survey statistics!).
3. Other methods:
Exact:
\[ m = \frac{Z_{1-\alpha/2}^2\, M V^2}{Z_{1-\alpha/2}^2 V^2 + (M-1)\, d^2} \]
σ² = 0.5
Ȳ (mean): 5.6
\[ V^2 = \frac{\sigma^2}{\bar{Y}^2} = \frac{0.5}{(5.6)^2} = .01594388 \]
STATA command:
. di "m = " (9*100*.01594388)/(9*.01594388+99*(.10^2))
m = 12.659512 ~ 13 clusters
Approximate method:
\[ m = \frac{Z_{1-\alpha/2}^2\, V^2}{d^2} \]
STATA Command:
. di "m = " (9*.01594388)/(.10^2)
m = 14.349492 ~ 15 clusters
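Both calculations can be reproduced with a short Python sketch. The inputs Z = 3 (so Z² = 9), M = 100, and d = 0.10 are read off the Stata commands above; the function names are illustrative.

```python
from math import ceil

def clusters_exact(z, M, v2, d):
    """Exact formula: Z^2 * M * V^2 / (Z^2 * V^2 + (M - 1) * d^2)."""
    return z**2 * M * v2 / (z**2 * v2 + (M - 1) * d**2)

def clusters_approx(z, v2, d):
    """Approximate formula: Z^2 * V^2 / d^2."""
    return z**2 * v2 / d**2

v2 = 0.5 / 5.6**2   # V^2 = sigma^2 / mean^2
m_exact = clusters_exact(z=3, M=100, v2=v2, d=0.10)
m_approx = clusters_approx(z=3, v2=v2, d=0.10)
print(m_exact, ceil(m_exact))
print(m_approx, ceil(m_approx))
```

The approximate formula drops the correction involving M, so it gives a somewhat larger number of clusters than the exact one.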