ABSTRACT
In sample surveys, non-response is one of the most frequent and widespread problems, and its treatment requires suitable statistical techniques. Imputation is one such methodology, which uses the available data as a source of replacements for the missing observations. Two-phase sampling is useful when a population parameter of the auxiliary variable is unknown. This paper presents the use of imputation for dealing with non-responding units in the setup of two-phase sampling. Two different two-phase sampling strategies (sub-sample and independent sample) are compared under the imputed-data setup. Factor-Type (F-T) estimators are used as tools of imputation, and a simulation study over multiple samples shows the comparative strength of one over the other. The second imputation strategy is found better than the first, and the second sampling design proves more efficient than the first.
Key words: Estimation, Missing data, Imputation, Bias, Mean squared error (MSE), Factor Type (F-T) estimator, Two-phase sampling, Simple Random Sampling Without Replacement (SRSWOR), Compromised Imputation (C.I.).
1. Introduction
1 Diwakar Shukla, Narendra Singh Thakur, Sharad Pathak, Department of Mathematics and Statistics, H.S. Gour University of Sagar, Sagar (M.P.), INDIA, Pin-470003. E-mails: [email protected], [email protected], [email protected].
2 Dilip Singh Rajput, Govt. College, Rehli, Sagar (M.P.), INDIA.
Let $\bar{Y} = N^{-1}\sum_{i=1}^{N} Y_i$ and $\bar{X} = N^{-1}\sum_{i=1}^{N} X_i$ be the population means, where $\bar{X}$ is assumed known and $\bar{Y}$ is under investigation.
investigation. Singh and Shukla (1987) proposed Factor Type (F-T) estimator to
obtain the estimate of population mean under setup of SRSWOR. Some other
contributions on Factor-Type estimator, in similar setup, are due to Singh and
Shukla (1991) and Singh et al. (1993).
With X unknown, the two-phase sampling is used to obtain the estimate of
population mean and Shukla (2002) suggested F-T estimator under this case. But
when few of observations are missing in the sample, the F-T estimator fails to
estimate. This paper undertakes the problem of Shukla (2002) with suggested
imputation procedures for missing observations.
Rubin (1976) addressed three missing observation concepts: missing at
random (MAR), observed at random (OAR) and parameter distribution (PD).
Heitjan and Basu (1996) explained the concept of missing at random (MAR) and
introduced the missing completely at random (MCAR). The present discussion is
on MCAR wherever the non-response is quoted. Rao and Sitter (1995) discussed
a new linearization variance estimator that makes more complete use of the
sample data than a standard one. They have shown its application to ‘mass’
imputation under two-phase sampling and deterministic imputation for missing
data. Singh and Horn (2000) suggested a Compromised Imputation (C-I)
procedure in which the estimator of mean obtained through C-I remains better
than obtained from ratio method of imputation and mean method of imputation.
Ahmed et al. (2006) designed several generalized structure of imputation
procedures and their corresponding estimators of the population mean. Motivation
is derived from these and from Shukla (2002) to extend the content for the
imputation setup.
Consider a preliminary large sample $S'$ of size $n'$ drawn from population $\Omega$ by SRSWOR and a secondary sample $S$ of size $n$ $(n < n')$ drawn in either of the following manners:
Case I: as a sub-sample of $S'$ (denoted by design F1), as in fig. 1(a);
Case II: independently of $S'$, without replacing $S'$ (denoted by design F2), as in fig. 1(b).
[Fig. 1(a): design F1 — the second sample $S$, with responding part $R$ ($r$ units) and non-responding part $R^c$ ($n - r$ units), is drawn as a sub-sample of $S'$. Fig. 1(b): design F2 — $S$ is drawn from the population independently of $S'$.]
The two proposed strategies $d_1$ and $d_2$ for missing data, applicable under both cases, are:

$d_1$: $(y_{d_1})_i = y_i$ if $i \in R$; $(y_{d_1})_i = \dfrac{1}{n-r}\left[n\,\bar{y}_r\,\dfrac{(A+C)\bar{x}' + fB\bar{x}}{(A+fB)\bar{x}' + C\bar{x}} - r\,\bar{y}_r\right]$ if $i \in R^c$ (2.1)

$d_2$: $(y_{d_2})_i = y_i$ if $i \in R$; $(y_{d_2})_i = \dfrac{1}{n-r}\left[n\,\bar{y}_r\,\dfrac{(A+C)\bar{x}' + fB\bar{x}_r}{(A+fB)\bar{x}' + C\bar{x}_r} - r\,\bar{y}_r\right]$ if $i \in R^c$ (2.2)

where $A = (k-1)(k-2)$, $B = (k-1)(k-4)$, $C = (k-2)(k-3)(k-4)$ with $k > 0$, and $f = n/N$. The resulting point estimators of $\bar{Y}$ are

$\bar{y}_{d_1} = \bar{y}_r\,\dfrac{(A+C)\bar{x}' + fB\bar{x}}{(A+fB)\bar{x}' + C\bar{x}}$; (2.3)

$\bar{y}_{d_2} = \bar{y}_r\,\dfrac{(A+C)\bar{x}' + fB\bar{x}_r}{(A+fB)\bar{x}' + C\bar{x}_r}$. (2.4)
Some special cases follow (a numerical check of these appears after the list):

(i) At $k = 1$: $A = 0$, $B = 0$, $C = -6$;
$\bar{y}_{d_1} = \bar{y}_r\,\dfrac{\bar{x}'}{\bar{x}}$ (2.5)
$\bar{y}_{d_2} = \bar{y}_r\,\dfrac{\bar{x}'}{\bar{x}_r}$ (2.6)

(ii) At $k = 2$: $A = 0$, $B = -2$, $C = 0$;
$\bar{y}_{d_1} = \bar{y}_r\,\dfrac{\bar{x}}{\bar{x}'}$ (2.7)
$\bar{y}_{d_2} = \bar{y}_r\,\dfrac{\bar{x}_r}{\bar{x}'}$ (2.8)

(iii) At $k = 3$: $A = 2$, $B = -2$, $C = 0$;
$\bar{y}_{d_1} = \bar{y}_r\,\dfrac{\bar{x}' - f\bar{x}}{(1-f)\bar{x}'}$ (2.9)
$\bar{y}_{d_2} = \bar{y}_r\,\dfrac{\bar{x}' - f\bar{x}_r}{(1-f)\bar{x}'}$ (2.10)

(iv) At $k = 4$: $A = 6$, $B = 0$, $C = 0$;
$\bar{y}_{d_1} = \bar{y}_r$ (2.11)
$\bar{y}_{d_2} = \bar{y}_r$ (2.12)
Let $B(\cdot)_t$ and $M(\cdot)_t$ denote the bias and mean squared error (M.S.E.) of an estimator under sampling design $t = I, II$ (or F1, F2). Writing $e_1 = (\bar{y}_r - \bar{Y})/\bar{Y}$, $e_2 = (\bar{x}_r - \bar{X})/\bar{X}$, $e_3 = (\bar{x} - \bar{X})/\bar{X}$ and $e_3' = (\bar{x}' - \bar{X})/\bar{X}$, the large sample approximations are:

(i) Under design F1 [Case I]:
$E(e_1) = E(e_2) = E(e_3) = E(e_3') = 0$; $E(e_1^2) = \delta_1 C_Y^2$; $E(e_2^2) = \delta_1 C_X^2$; $E(e_3^2) = \delta_2 C_X^2$; $E(e_3'^2) = \delta_3 C_X^2$; $E(e_1e_2) = \delta_1 \rho C_Y C_X$; $E(e_1e_3) = \delta_2 \rho C_Y C_X$; $E(e_1e_3') = \delta_3 \rho C_Y C_X$; $E(e_2e_3) = \delta_2 C_X^2$; $E(e_2e_3') = \delta_3 C_X^2$; $E(e_3e_3') = \delta_3 C_X^2$.

(ii) Under design F2 [Case II]:
$E(e_1) = E(e_2) = E(e_3) = E(e_3') = 0$; $E(e_1^2) = \delta_4 C_Y^2$; $E(e_2^2) = \delta_4 C_X^2$; $E(e_3^2) = \delta_5 C_X^2$; $E(e_3'^2) = \delta_3 C_X^2$; $E(e_1e_2) = \delta_4 \rho C_Y C_X$; $E(e_1e_3) = \delta_5 \rho C_Y C_X$; $E(e_2e_3) = \delta_5 C_X^2$; and, by the independence of $S$ and $S'$ under F2, $E(e_1e_3') = E(e_2e_3') = E(e_3e_3') = 0$.
Theorem 3.1: The estimators $\bar{y}_{d_1}$ and $\bar{y}_{d_2}$, in terms of $e_i$ ($i = 1, 2, 3$) and $e_3'$, can be expressed as:

(i) $\bar{y}_{d_1} = \bar{Y}\big[1 + e_1 + P\{e_3 - e_3' - \theta_4 e_3^2 + \theta_3 e_3'^2 + e_1e_3 - e_1e_3' - (\theta_3 - \theta_4)e_3e_3'\}\big]$ (3.1)

(ii) $\bar{y}_{d_2} = \bar{Y}\big[1 + e_1 + P\{e_2 - e_3' - \theta_4 e_2^2 + \theta_3 e_3'^2 + e_1e_2 - e_1e_3' - (\theta_3 - \theta_4)e_2e_3'\}\big]$ (3.2)

while ignoring terms $E(e_i^r e_j^s)$ and $E(e_i^r e_j'^s)$ for $r + s > 2$; $r, s = 0, 1, 2, \ldots$. Here $\theta_1 = (A+C)/D$, $\theta_2 = fB/D$, $\theta_3 = (A+fB)/D$ and $\theta_4 = C/D$ with $D = A + fB + C$, so that $\theta_1 + \theta_2 = \theta_3 + \theta_4 = 1$ and $P = \theta_2 - \theta_4$.
Proof:
(i) $\bar{y}_{d_1} = \bar{y}_r\,\dfrac{(A+C)\bar{x}' + fB\bar{x}}{(A+fB)\bar{x}' + C\bar{x}}$
$= \bar{Y}(1 + e_1)\big(1 + \theta_1 e_3' + \theta_2 e_3\big)\big(1 + \theta_3 e_3' + \theta_4 e_3\big)^{-1}$
$= \bar{Y}\big[1 + e_1 + P\{e_3 - e_3' - \theta_4 e_3^2 + \theta_3 e_3'^2 + e_1e_3 - e_1e_3' - (\theta_3 - \theta_4)e_3e_3'\}\big]$

(ii) $\bar{y}_{d_2} = \bar{y}_r\,\dfrac{(A+C)\bar{x}' + fB\bar{x}_r}{(A+fB)\bar{x}' + C\bar{x}_r}$
$= \bar{Y}(1 + e_1)\big(1 + \theta_1 e_3' + \theta_2 e_2\big)\big(1 + \theta_3 e_3' + \theta_4 e_2\big)^{-1}$
$= \bar{Y}\big[1 + e_1 + P\{e_2 - e_3' - \theta_4 e_2^2 + \theta_3 e_3'^2 + e_1e_2 - e_1e_3' - (\theta_3 - \theta_4)e_2e_3'\}\big]$
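The second-order expansion in (3.1) can be checked symbolically. The sketch below (ours) verifies it using the relations $\theta_1 + \theta_2 = \theta_3 + \theta_4 = 1$ and $P = \theta_2 - \theta_4$:

```python
import sympy as sp

e1, e3, e3p, th2, th4, t = sp.symbols('e1 e3 e3p theta2 theta4 t')
th1, th3 = 1 - th2, 1 - th4          # theta1 + theta2 = theta3 + theta4 = 1
P = th2 - th4                        # P = theta2 - theta4

# Expand (1+e1)(1 + th1*e3' + th2*e3)/(1 + th3*e3' + th4*e3) to second
# order by scaling every e with t and truncating at t^2.
expr = (1 + t*e1) * (1 + th1*t*e3p + th2*t*e3) / (1 + th3*t*e3p + th4*t*e3)
approx = sp.series(expr, t, 0, 3).removeO()

target = 1 + t*e1 + P*(t*e3 - t*e3p - th4*(t*e3)**2 + th3*(t*e3p)**2
                       + (t*e1)*(t*e3) - (t*e1)*(t*e3p)
                       - (th3 - th4)*(t*e3)*(t*e3p))
print(sp.simplify(sp.expand(approx - target)))   # prints 0
```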
Theorem 3.2: The biases of $\bar{y}_{d_1}$ and $\bar{y}_{d_2}$ under designs F1 and F2, up to the first order of approximation, are:
(i) $B[\bar{y}_{d_1}]_I = -\bar{Y}P(\delta_2 - \delta_3)(\theta_4 C_X^2 - \rho C_Y C_X)$ (3.3)
(ii) $B[\bar{y}_{d_1}]_{II} = \bar{Y}P\big[(\theta_3\delta_3 - \theta_4\delta_5)C_X^2 + \delta_5 \rho C_Y C_X\big]$ (3.4)
(iii) $B[\bar{y}_{d_2}]_I = -\bar{Y}P(\delta_1 - \delta_3)(\theta_4 C_X^2 - \rho C_Y C_X)$ (3.5)
(iv) $B[\bar{y}_{d_2}]_{II} = \bar{Y}P\big[(\theta_3\delta_3 - \theta_4\delta_4)C_X^2 + \delta_4 \rho C_Y C_X\big]$ (3.6)

Proof:
(i) $B[\bar{y}_{d_1}]_I = E[\bar{y}_{d_1} - \bar{Y}]_I = \bar{Y}E\big[1 + e_1 + P\{e_3 - e_3' - \theta_4 e_3^2 + \theta_3 e_3'^2 + e_1e_3 - e_1e_3' - (\theta_3 - \theta_4)e_3e_3'\} - 1\big] = -\bar{Y}P(\delta_2 - \delta_3)(\theta_4 C_X^2 - \rho C_Y C_X)$.
(ii) $B[\bar{y}_{d_1}]_{II} = E[\bar{y}_{d_1} - \bar{Y}]_{II} = \bar{Y}P\big[(\theta_3\delta_3 - \theta_4\delta_5)C_X^2 + \delta_5 \rho C_Y C_X\big]$.
(iii) $B[\bar{y}_{d_2}]_I = E[\bar{y}_{d_2} - \bar{Y}]_I = -\bar{Y}P(\delta_1 - \delta_3)(\theta_4 C_X^2 - \rho C_Y C_X)$.
(iv) $B[\bar{y}_{d_2}]_{II} = E[\bar{y}_{d_2} - \bar{Y}]_{II} = \bar{Y}P\big[(\theta_3\delta_3 - \theta_4\delta_4)C_X^2 + \delta_4 \rho C_Y C_X\big]$.
Theorem 3.3: The mean squared errors of $\bar{y}_{d_1}$ and $\bar{y}_{d_2}$ under designs F1 and F2, up to the first order of approximation, are:
(i) $M[\bar{y}_{d_1}]_I = \bar{Y}^2\big[\delta_1 C_Y^2 + (\delta_2 - \delta_3)(P^2 C_X^2 + 2P\rho C_Y C_X)\big]$ (3.7)
(ii) $M[\bar{y}_{d_1}]_{II} = \bar{Y}^2\big[\delta_4 C_Y^2 + (\delta_3 + \delta_5)P^2 C_X^2 + 2P\delta_5 \rho C_Y C_X\big]$ (3.8)
(iii) $M[\bar{y}_{d_2}]_I = \bar{Y}^2\big[\delta_1 C_Y^2 + (\delta_1 - \delta_3)(P^2 C_X^2 + 2P\rho C_Y C_X)\big]$ (3.9)
(iv) $M[\bar{y}_{d_2}]_{II} = \bar{Y}^2\big[\delta_4 C_Y^2 + (\delta_3 + \delta_4)P^2 C_X^2 + 2P\delta_4 \rho C_Y C_X\big]$ (3.10)
Proof:
(i) $M[\bar{y}_{d_1}]_I = E[\bar{y}_{d_1} - \bar{Y}]_I^2 = \bar{Y}^2 E\big[e_1 + P(e_3 - e_3')\big]^2 = \bar{Y}^2\big[\delta_1 C_Y^2 + (\delta_2 - \delta_3)(P^2 C_X^2 + 2P\rho C_Y C_X)\big]$.
(ii) $M[\bar{y}_{d_1}]_{II} = E[\bar{y}_{d_1} - \bar{Y}]_{II}^2 = \bar{Y}^2 E\big[e_1 + P(e_3 - e_3')\big]^2 = \bar{Y}^2\big[\delta_4 C_Y^2 + (\delta_3 + \delta_5)P^2 C_X^2 + 2P\delta_5 \rho C_Y C_X\big]$.
(iii) $M[\bar{y}_{d_2}]_I = E[\bar{y}_{d_2} - \bar{Y}]_I^2 = \bar{Y}^2 E\big[e_1 + P(e_2 - e_3')\big]^2 = \bar{Y}^2\big[\delta_1 C_Y^2 + (\delta_1 - \delta_3)(P^2 C_X^2 + 2P\rho C_Y C_X)\big]$.
(iv) $M[\bar{y}_{d_2}]_{II} = E[\bar{y}_{d_2} - \bar{Y}]_{II}^2 = \bar{Y}^2 E\big[e_1 + P(e_2 - e_3')\big]^2 = \bar{Y}^2\big[\delta_4 C_Y^2 + (\delta_3 + \delta_4)P^2 C_X^2 + 2P\delta_4 \rho C_Y C_X\big]$.
Theorem 3.4: The minimum mean squared errors of $\bar{y}_{d_1}$ and $\bar{y}_{d_2}$ under designs F1 and F2 are:
(i) $\min M[\bar{y}_{d_1}]_I = \big[\delta_1 - (\delta_2 - \delta_3)\rho^2\big]S_Y^2$ when $P = -V$ (3.11)
(ii) $\min M[\bar{y}_{d_1}]_{II} = \big[\delta_4 - (\delta_3 + \delta_5)^{-1}\delta_5^2\rho^2\big]S_Y^2$ when $P = -\delta_5 V/(\delta_3 + \delta_5)$ (3.12)
(iii) $\min M[\bar{y}_{d_2}]_I = \big[\delta_1 - (\delta_1 - \delta_3)\rho^2\big]S_Y^2$ when $P = -V$ (3.13)
(iv) $\min M[\bar{y}_{d_2}]_{II} = \big[\delta_4 - (\delta_3 + \delta_4)^{-1}\delta_4^2\rho^2\big]S_Y^2$ when $P = -\delta_4 V/(\delta_3 + \delta_4)$ (3.14)
where $V = \rho C_Y/C_X$.
Proof:
(i) $\dfrac{d}{dP}M[\bar{y}_{d_1}]_I = 0 \Rightarrow P = -\rho\dfrac{C_Y}{C_X} = -V$; using this in (3.7), $\min M[\bar{y}_{d_1}]_I = \big[\delta_1 - (\delta_2 - \delta_3)\rho^2\big]S_Y^2$.
(ii) $\dfrac{d}{dP}M[\bar{y}_{d_1}]_{II} = 0 \Rightarrow P = -\delta_5 V/(\delta_3 + \delta_5)$; using this in (3.8), $\min M[\bar{y}_{d_1}]_{II} = \big[\delta_4 - (\delta_3 + \delta_5)^{-1}\delta_5^2\rho^2\big]S_Y^2$.
(iii) $\dfrac{d}{dP}M[\bar{y}_{d_2}]_I = 0 \Rightarrow P = -\rho\dfrac{C_Y}{C_X} = -V$; using this in (3.9), $\min M[\bar{y}_{d_2}]_I = \big[\delta_1 - (\delta_1 - \delta_3)\rho^2\big]S_Y^2$.
(iv) $\dfrac{d}{dP}M[\bar{y}_{d_2}]_{II} = 0 \Rightarrow P = -\delta_4 V/(\delta_3 + \delta_4)$; using this in (3.10), $\min M[\bar{y}_{d_2}]_{II} = \big[\delta_4 - (\delta_3 + \delta_4)^{-1}\delta_4^2\rho^2\big]S_Y^2$.
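As a numerical companion to Theorems 3.3-3.4, the sketch below (ours; the $\delta_i$ definitions are assumed to be $\delta_1 = 1/r - 1/N$, $\delta_2 = 1/n - 1/N$, $\delta_3 = 1/n' - 1/N$, as they fall in a part of the paper not reproduced here, and the population values are illustrative) evaluates (3.7) on a grid of $P$ and confirms that the minimum sits at $P = -V$ with the value (3.11):

```python
import numpy as np

N, n_dash, n, r = 200, 110, 50, 45
d1, d2, d3 = 1/r - 1/N, 1/n - 1/N, 1/n_dash - 1/N   # assumed definitions
Ybar, Cy, Cx, rho = 42.0, 0.3, 0.4, 0.8             # illustrative values
Sy2 = (Ybar * Cy) ** 2                              # S_Y^2 = Ybar^2 C_Y^2
V = rho * Cy / Cx

def M_d1_I(P):
    """MSE (3.7) as a function of P."""
    return Ybar**2 * (d1*Cy**2 + (d2 - d3)*(P**2*Cx**2 + 2*P*rho*Cy*Cx))

P_grid = np.linspace(-2.0, 2.0, 400001)
print(P_grid[np.argmin(M_d1_I(P_grid))], -V)        # both ~ -0.6
print(M_d1_I(-V), (d1 - (d2 - d3) * rho**2) * Sy2)  # (3.11): equal values
```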
For comparison, the corresponding F-T estimator without imputation, due to Shukla (2002), is
$\bar{y}_{d_w} = \bar{y}\,\dfrac{(A+C)\bar{x}' + fB\bar{x}}{(A+fB)\bar{x}' + C\bar{x}}$ (3.15)
with optimum mean squared errors under the two designs
$\mathrm{opt}\,M[\bar{y}_{d_w}]_I = \bar{Y}^2 V_{20}\big[1 - \rho^2(1 - \delta)\big]$ (3.18)
$\mathrm{opt}\,M[\bar{y}_{d_w}]_{II} = \bar{Y}^2 V_{20}\big[1 - \rho^2(1 + \delta)^{-1}\big]$ (3.19)
where $\delta = \left(\dfrac{1}{n'} - \dfrac{1}{N}\right)\left(\dfrac{1}{n} - \dfrac{1}{N}\right)^{-1}$ and $V_{ij} = E\left[\dfrac{(\bar{y} - \bar{Y})^i(\bar{x} - \bar{X})^j}{\bar{Y}^i\bar{X}^j}\right]$; $i = 0, 1, 2$; $j = 0, 1, 2$.
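With the sample sizes used later in the simulation ($N = 200$, $n' = 110$, $n = 50$), $\delta$ and the two bracketed factors of (3.18)-(3.19) are easy to evaluate; a small sketch (ours, with an assumed $\rho$):

```python
N, n_dash, n = 200, 110, 50
rho = 0.8                                  # assumed correlation
delta = (1/n_dash - 1/N) / (1/n - 1/N)     # delta of (3.18)-(3.19)
print(delta)                               # ~ 0.273
print(1 - rho**2 * (1 - delta))            # factor of (3.18), design I
print(1 - rho**2 / (1 + delta))            # factor of (3.19), design II
```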
The conditions under which the proposed estimators become unbiased follow by setting each bias expression of Theorem 3.2 to zero.

(i) $B[\bar{y}_{d_1}]_I = 0 \Rightarrow P(\theta_4 C_X^2 - \rho C_Y C_X) = 0$.
If $P = 0$: $(k - 4)\big[k^2 - (5 + f)k + (6 + f)\big] = 0$ (4.1)
and $k = k_1 = 4$, $k = k_2 = \frac{1}{2}\big[(5 + f) + (f^2 + 6f + 1)^{1/2}\big]$, $k = k_3 = \frac{1}{2}\big[(5 + f) - (f^2 + 6f + 1)^{1/2}\big]$ (4.2)
If $\theta_4 C_X^2 - \rho C_Y C_X = 0$:
$AV + fBV + (V - 1)C = 0$ (4.3)
which, in terms of $k$, is
$(V - 1)k^3 - \big[(8 - f)V - 9\big]k^2 + \big[(23 - 5f)V - 26\big]k - 2\big[(11 - 2f)V - 12\big] = 0$ (4.4)
(ii) $B[\bar{y}_{d_1}]_{II} = 0 \Rightarrow P\big[(\theta_3\delta_3 - \theta_4\delta_5)C_X^2 + \delta_5\rho C_Y C_X\big] = 0$.
If $P = 0$ we have the solutions (4.2), and if $(\theta_3\delta_3 - \theta_4\delta_5)C_X + \delta_5\rho C_Y = 0$, then
$(V - 1)k^3 + \left[\left(\dfrac{\delta_3}{\delta_5} + V\right)(1 + f) - 9(V - 1)\right]k^2 - \left[\left(\dfrac{\delta_3}{\delta_5} + V\right)(3 + 5f) - 26(V - 1)\right]k + 2\left[\left(\dfrac{\delta_3}{\delta_5} + V\right)(1 + 2f) - 12(V - 1)\right] = 0$ (4.5)

(iii) $B[\bar{y}_{d_2}]_I = 0$ provides a similar solution as in (i).

(iv) $B[\bar{y}_{d_2}]_{II} = 0 \Rightarrow P\big[(\theta_3\delta_3 - \theta_4\delta_4)C_X^2 + \delta_4\rho C_Y C_X\big] = 0$.
If $P = 0$ we have the solutions (4.2), and if $(\theta_3\delta_3 - \theta_4\delta_4)C_X^2 + \delta_4\rho C_Y C_X = 0$, then $\big[(A + fB)\delta_3 - C\delta_4\big] = -\delta_4 V(A + fB + C)$, which in terms of $k$ is
$(V - 1)k^3 + \left[\left(\dfrac{\delta_3}{\delta_4} + V\right)(1 + f) - 9(V - 1)\right]k^2 - \left[\left(\dfrac{\delta_3}{\delta_4} + V\right)(3 + 5f) - 26(V - 1)\right]k + 2\left[\left(\dfrac{\delta_3}{\delta_4} + V\right)(1 + 2f) - 12(V - 1)\right] = 0$ (4.6)
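The roots of these polynomials are straightforward to obtain numerically. A sketch (ours; the value of $V$ is an assumed placeholder, since the population parameters are not reproduced here):

```python
import numpy as np

f = 50 / 200        # f = n/N for the population of the empirical study
V = 0.6             # illustrative (assumed) value of V = rho*Cy/Cx

# (4.1): closed-form roots of (k - 4)[k^2 - (5 + f)k + (6 + f)] = 0
disc = np.sqrt(f**2 + 6*f + 1)
print(4.0, (5 + f + disc) / 2, (5 + f - disc) / 2)   # 4, 3.4254, 1.8246

# (4.4): cubic in k for the non-trivial unbiasedness condition
coeffs = [V - 1, -((8 - f)*V - 9), (23 - 5*f)*V - 26, -2*((11 - 2*f)*V - 12)]
print(np.roots(coeffs))             # real roots are the usable k values
```

Note that 1.8246 and 3.4254 (≈ 3.4253) and 4 appear among the almost-unbiased choices of $k$ quoted in the concluding remarks, consistent with $f = n/N = 0.25$.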
The suggested strategies can now be compared through differences of their minimum mean squared errors.

(i) $\Delta_1 = \min M[\bar{y}_{d_1}]_I - \min M[\bar{y}_{d_2}]_I = \left(\dfrac{1}{r} - \dfrac{1}{n}\right)\rho^2 S_Y^2$
$(\bar{y}_{d_2})_I$ is better than $(\bar{y}_{d_1})_I$ if $\Delta_1 > 0 \Rightarrow n > r$, which is always true.

(ii) $\Delta_2 = \min M[\bar{y}_{d_1}]_{II} - \min M[\bar{y}_{d_2}]_{II} = \left[\dfrac{\delta_4^2}{\delta_3 + \delta_4} - \dfrac{\delta_5^2}{\delta_3 + \delta_5}\right]\rho^2 S_Y^2$
$(\bar{y}_{d_2})_{II}$ is better than $(\bar{y}_{d_1})_{II}$ if $\Delta_2 > 0$, i.e.
$(n - r)\big[N^3 - (n'n + n'r + nr)N + 2n'nr\big] > 0$
which generates two options:
(A) $(n - r) > 0 \Rightarrow n > r$, and
(B) $N^3 - (n'n + n'r + nr)N + 2n'nr > 0$.
If $n' \approx N$ (i.e. $n' \to N$), then (B) reduces to $N\big[N^2 - (n + r)N + nr\big] > 0$; since $N > 0$ always, $(N - n)(N - r) > 0 \Rightarrow N > n$ and $N > r$. The ultimate condition is $N > n > r$, which is always true.

(iii) $\Delta_3 = \min M[\bar{y}_{d_2}]_I - \min M[\bar{y}_{d_2}]_{II} = \dfrac{(\delta_1 - \delta_4)(\delta_3 + \delta_4) + (\delta_4^2 + \delta_3^2 - \delta_1\delta_3 - \delta_1\delta_4 + \delta_3\delta_4)\rho^2}{\delta_3 + \delta_4}\,S_Y^2$
$(\bar{y}_{d_2})_{II}$ is better than $(\bar{y}_{d_2})_I$ if $\Delta_3 > 0$, which holds when
$\rho^2 > \dfrac{1 + m}{1 + 2m}$, where $m = \dfrac{r(N - n')}{n'(N - r)}$,
that is, when $-1 < \rho < -\sqrt{\dfrac{1 + m}{1 + 2m}}$ or $\sqrt{\dfrac{1 + m}{1 + 2m}} < \rho < 1$.
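For the sizes of the empirical study ($N = 200$, $n' = 110$, $r = 45$) the correlation threshold in (iii) is immediate; a small sketch (ours):

```python
import math

N, n_dash, r = 200, 110, 45
m = r * (N - n_dash) / (n_dash * (N - r))   # m = r(N - n')/(n'(N - r))
print(m)                                     # ~ 0.2375
print(math.sqrt((1 + m) / (1 + 2 * m)))      # ~ 0.916: |rho| must exceed this
```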
6. Empirical Study
A preliminary sample $S'$ of size $n' = 110$ is drawn by SRSWOR from the population of $N = 200$ units given in Appendix A to compute $\bar{x}'$, and a random sample $S$ of size $n = 50$ is further drawn such that …
Table 6.2. Optimum conditions and the corresponding values of k.

Estimator, design | Condition for optimum MSE | Three optimum values of k
$\bar{y}_{d_1}$, I  | $P = -V$ | $k_1 = 1.5206$; $k_2 = 2.4505$; $k_3 = 8.9456$
$\bar{y}_{d_1}$, II | $P = -\delta_5 V/(\delta_3 + \delta_5)$ | $k_4 = 1.5880$; $k_5 = 2.8768$; $k_6 = 6.4279$
$\bar{y}_{d_2}$, I  | $P = -V$ | $k_7 = k_1 = 1.5206$; $k_8 = k_2 = 2.4505$; $k_9 = k_3 = 8.9456$
$\bar{y}_{d_2}$, II | $P = -\delta_4 V/(\delta_3 + \delta_4)$ | $k_{10} = 1.5645$; $k_{11} = 2.8572$; $k_{12} = 6.7221$
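Each optimum condition translates into a cubic in $k$ through $P = (fB - C)/(A + fB + C)$, so the $k$ values of Table 6.2 can be reproduced by a root search. A sketch (ours; the $P^*$ shown is a placeholder, since the population parameters behind Table 6.2 are not reproduced here):

```python
import numpy as np

def k_for_P(P_star, f):
    """Roots k of P(k) = P*, with P = (fB - C)/(A + fB + C)."""
    # fB - C     = -k^3 + (9+f)k^2 - (26+5f)k + (24+4f)
    # A + fB + C =  k^3 + (f-8)k^2 + (23-5f)k + (4f-22)
    num = np.array([-1.0, 9 + f, -(26 + 5*f), 24 + 4*f])
    den = np.array([1.0, f - 8, 23 - 5*f, 4*f - 22])
    return np.roots(num - P_star * den)

f = 50 / 200
print(k_for_P(0.0, f))     # reproduces k = 4, 3.4254, 1.8246 of (4.2)
print(k_for_P(-0.6, f))    # k values for an assumed optimum P* = -V = -0.6
```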
7. Simulation

The bias and optimum m.s.e. of the proposed estimators under both designs are computed through 50,000 repeated samples $(n, n')$ drawn as per design. Computations are given in table 7.1, where the loss in efficiency due to imputation is measured as
$LI_t(\bar{y}_s) = \dfrac{\mathrm{Opt}[M(\bar{y}_s)]_t}{\mathrm{Opt}[M(\bar{y}_{d_w})]_t}$
with $\mathrm{Opt}[M(\bar{y}_s)]_t$ the optimum mean squared error of estimator $\bar{y}_s$; $s = d, d_1, d_2$; $t = I, II$; and subscript $w$ denoting "without imputation".
For designs I and II the simulation procedure has the following steps (a runnable sketch follows Step 8):
Step 1: Draw a random sample $S'$ of size $n' = 110$ from the population of $N = 200$ by SRSWOR.
Step 2: Draw a random sub-sample of size $n = 50$ from $S'$ for design I, and an independent random sample of size $n = 50$ from the remaining $(N - n')$ units for design II.
Step 3: Randomly drop 5 of the $Y$ values from each second sample, in both I and II.
Step 4: Impute the dropped $Y$ values by the proposed and the available methods and compute the relevant statistic.
Step 5: Repeat the above steps 50,000 times, which provides the multiple-sample estimates $(\hat{y}_{1s})_t, (\hat{y}_{2s})_t, (\hat{y}_{3s})_t, \ldots, (\hat{y}_{50000\,s})_t$ for the estimators $(\bar{y}_{d_1})_t$, $(\bar{y}_{d_2})_t$ and $(\bar{y}_d)_w$.
Step 6: The bias of $(\hat{y}_s)_t$ is $B(\hat{y}_s)_t = \dfrac{1}{50000}\sum_{i=1}^{50000}\big[(\hat{y}_{is})_t - \bar{Y}\big]$.
Step 7: The M.S.E. of $(\hat{y}_s)_t$ is $M(\hat{y}_s)_t = \dfrac{1}{50000}\sum_{i=1}^{50000}\big[(\hat{y}_{is})_t - \bar{Y}\big]^2$.
Step 8: The efficiency comparisons are
Design efficiency $E_1 = \dfrac{M(\bar{y}_{d_1})_I}{M(\bar{y}_{d_1})_{II}} \times 100$;
Design efficiency $E_2 = \dfrac{M(\bar{y}_{d_2})_I}{M(\bar{y}_{d_2})_{II}} \times 100$;
Estimator efficiency $E_3 = \dfrac{M(\bar{y}_{d_1})_I}{M(\bar{y}_{d_2})_I} \times 100$;
Estimator efficiency $E_4 = \dfrac{M(\bar{y}_{d_1})_{II}}{M(\bar{y}_{d_2})_{II}} \times 100$.
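The following self-contained Python sketch (our reconstruction, not the authors' code) mirrors the steps above for both designs, with a synthetic population standing in for the Appendix A data, a fixed $k$, and a reduced number of repetitions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in population (assumed); the study uses the N = 200
# (Y, X) pairs of Appendix A.
N, n_dash, n, drop = 200, 110, 50, 5
X = rng.uniform(7, 45, N)
Y = 2 * X + rng.normal(0, 5, N) + 10
Ybar, f = Y.mean(), n / N

def ft(ybar_r, xb_dash, xb, k):
    """Factor-Type ratio of (2.3)/(2.4)."""
    A, B, C = (k-1)*(k-2), (k-1)*(k-4), (k-2)*(k-3)*(k-4)
    return ybar_r * ((A + C)*xb_dash + f*B*xb) / ((A + f*B)*xb_dash + C*xb)

def one_rep(k, design):
    s_dash = rng.choice(N, n_dash, replace=False)            # Step 1
    if design == "I":                                        # Step 2
        s = rng.choice(s_dash, n, replace=False)
    else:
        s = rng.choice(np.setdiff1d(np.arange(N), s_dash), n, replace=False)
    resp = rng.choice(s, n - drop, replace=False)            # Step 3
    ybar_r = Y[resp].mean()                                  # Step 4
    xb_dash, xb, xb_r = X[s_dash].mean(), X[s].mean(), X[resp].mean()
    return ft(ybar_r, xb_dash, xb, k), ft(ybar_r, xb_dash, xb_r, k)

def mse(k, design, reps=5000):                               # Steps 5-7
    est = np.array([one_rep(k, design) for _ in range(reps)])
    return ((est - Ybar) ** 2).mean(axis=0)                  # (y_d1, y_d2)

m1_I, m2_I = mse(1.5206, "I")
m1_II, m2_II = mse(1.5206, "II")
print("E1 =", 100*m1_I/m1_II, "E2 =", 100*m2_I/m2_II)        # Step 8
print("E3 =", 100*m1_I/m2_I, "E4 =", 100*m1_II/m2_II)
```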
Bias and m.s.e. at the design-II optimum values of k (within each design, the two entries per estimator are bias and m.s.e.):

Opt (k) | $(\bar{y}_{d_1})_I$ | $(\bar{y}_{d_2})_I$ | $(\bar{y}_{d_1})_{II}$ | $(\bar{y}_{d_2})_{II}$
$k_{10} = 1.5645$ | −0.0953, 2.2743 | −0.2082, 1.9476 | 0.4044, 3.4592 | 0.4024, 3.3839
$k_{11} = 2.8572$ | −0.1287, 2.2961 | −0.2429, 1.9705 | 0.2473, 3.2895 | 0.2301, 3.1970
$k_{12} = 6.7221$ | −0.1015, 2.7780 | −0.2146, 1.9515 | 0.3756, 3.4273 | 0.3707, 3.3472
Opt (k) | $E_1$ | $E_2$ | $E_3$ | $E_4$
$k_1 = 1.3813$ | 92.67% | 72.15% | 119.42% | 92.97%
The proposed estimators are found useful for situations where some observations are missing from the sample. As per table 7.3, for $\bar{y}_{d_1}$ and $\bar{y}_{d_2}$ under design F1, the efficient performance of both is found at k = 1.5880, 1.5645, 6.4279 and 6.7221. At these specific choices the loss of efficiency with respect to the no-imputation estimator is very low. Similarly, for $\bar{y}_{d_1}$ and $\bar{y}_{d_2}$ under design F2, efficient performance is observed at k = 1.3813, 2.7576, 9.9538. It seems that even after adopting imputation, the suggested estimators lose only a little in relative m.s.e. against the usual F-T estimator without imputation.
As the mutual comparisons in table 7.4 show, design F2 is uniformly more efficient than F1 at all the optimum k-values, over both suggested F-T strategies. Within F1 the estimator $\bar{y}_{d_2}$ is more efficient than $\bar{y}_{d_1}$, whereas within F2 this does not hold uniformly over all optimal k. The $\bar{y}_{d_2}$ under F2 is found better when k = 1.2, 2.8, 6.4 and 6.7. One can also obtain almost unbiased estimators with the choices k = 0.1812, 1.4236, 1.4339, 1.8246, 2.4469, 2.4488, 3.4253, 4, 10.5426, 10.7864; the most suitable is the one with the lowest m.s.e. Thus the suggested strategies are almost unbiased while also keeping the m.s.e. under control.
Appendix A
Population (N = 200)
Yi 45 50 39 60 42 38 28 42 38 35
Xi 15 20 23 35 18 12 8 15 17 13
Yi 40 55 45 36 40 58 56 62 58 46
Xi 29 35 20 14 18 25 28 21 19 18
Yi 36 43 68 70 50 56 45 32 30 38
Xi 15 20 38 42 23 25 18 11 09 17
Yi 35 41 45 65 30 28 32 38 61 58
Xi 13 15 18 25 09 08 11 13 23 21
Yi 65 62 68 85 40 32 60 57 47 55
Xi 27 25 30 45 15 12 22 19 17 21
Yi 67 70 60 40 35 30 25 38 23 55
Xi 25 30 27 21 15 17 09 15 11 21
Yi 50 69 53 55 71 74 55 39 43 45
Xi 15 23 29 30 33 31 17 14 17 19
Yi 61 72 65 39 43 57 37 71 71 70
Xi 25 31 30 19 21 23 15 30 32 29
Yi 73 63 67 47 53 51 54 57 59 39
Xi 28 23 23 17 19 17 18 21 23 20
Yi 23 25 35 30 38 60 60 40 47 30
Xi 07 09 15 11 13 25 27 15 17 11
Yi 57 54 60 51 26 32 30 45 55 54
Xi 31 23 25 17 09 11 13 19 25 27
Yi 33 33 20 25 28 40 33 38 41 33
Xi 13 11 07 09 13 15 13 17 15 13
Yi 30 35 20 18 20 27 23 42 37 45
Xi 11 15 08 07 09 13 12 25 21 22
Yi 37 37 37 34 41 35 39 45 24 27
Xi 15 16 17 13 20 15 21 25 11 13
Yi 23 20 26 26 40 56 41 47 43 33
Xi 09 08 11 12 15 25 15 25 21 15
Yi 37 27 21 23 24 21 39 33 25 35
Xi 17 13 11 11 09 08 15 17 11 19
Yi 45 40 31 20 40 50 45 35 30 35
Xi 21 23 15 11 20 25 23 17 16 18
Yi 32 27 30 33 31 47 43 35 30 40
Xi 15 13 14 17 15 25 23 17 16 19
Yi 35 35 46 39 35 30 31 53 63 41
Xi 19 19 23 15 17 13 19 25 35 21
Yi 52 43 39 37 20 23 35 39 45 37
Xi 25 19 18 17 11 09 15 17 19 19
REFERENCES