
SPSS

This document discusses techniques for handling missing data, including traditional and modern methods. It outlines traditional techniques like listwise deletion, mean imputation, and last observation carried forward. It then introduces maximum likelihood and multiple imputation as modern techniques for handling missing data in SPSS. The document provides an example dataset and discusses testing for different missing data mechanisms like MCAR, MAR, and MNAR. It emphasizes the importance of properly handling missing data and outlines methods in SPSS.
Missing Data Analysis with SPSS

Meng-Ting Lo
Department of Educational Studies
Quantitative Research, Evaluation and Measurement Program (QREM)
Research Methodology Center (RMC)
Outline
• Missing Data Patterns and Mechanisms
• Traditional Techniques
   • Listwise and pairwise deletion
   • Mean substitution
   • Regression and stochastic regression
   • Hot deck imputation
   • Averaging the available items
   • Last observation carried forward
• Maximum Likelihood (ML) and Multiple Imputation (MI)
• SPSS with Multiple Imputation (demonstration and practice)
• Practical Issues/Myths

Data and Material
• High school longitudinal study of 2009: public-use data
   • NCES secondary longitudinal studies, more than 21,000 9th graders in 944 schools
   • Hsls09_MissingDataWorkshop_demo
   • Hsls09_MissingDataWorkshop_demo2_imputed5
   • Hsls09_MissingDataWorkshop_demo2_IterationHistory
   • Hsls09_MissingDataWorkshop_practice

• SPSS modules
   • Missing Value Analysis
   • Multiple Imputation

The importance of dealing with missing data
• Datasets are rarely complete and clean.
• Traditional techniques rely on strict assumptions about the missing data mechanism, and these assumptions are rarely met in the real world.
• The problems caused by missing data:
   • Handled inappropriately, missing data produce unreliable and biased estimates and lead to incorrect conclusions.
   • Missing data reduce the statistical power of a test to detect a significant effect (e.g., under listwise deletion).

Missing data patterns
“Where” are the missing data in your data set? A missing data pattern describes the location of the missing values (the shaded area in the figure).

In the past: specific missing data handling methods were developed to deal with different missing data patterns.

Now: MI and ML work well with any missing data pattern.

Figures from p. 4 in Enders, C. K. (2010). Applied missing data analysis. Guilford Press.

Missing data mechanisms (Donald Rubin, 1976)
“Describe the relationships between measured variables and the probability of missing data and essentially function as assumptions for missing data analysis” (Enders, 2010, p. 2).
• Missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR)
• “Why are data missing?” Propose a possible explanation for the missing data and find evidence to justify the claim.
• The missing data mechanism is much more important than the percentage of missing data.
   • The percentage of missing data indicates the scope of the missing data problem.
   • The mechanism governs the performance of the different analytic techniques.

Missing data mechanisms
Introduced by Rubin (1976): missingness is a binary variable, R, that has a probability distribution.

Race        DV: Reading Achievement   R
Asian       (missing)                 0
Asian       (missing)                 0
Caucasian   (missing)                 0
Asian       (missing)                 0
Asian       (missing)                 0
Caucasian   66                        1
Caucasian   88                        1
Caucasian   95                        1
Caucasian   100                       1
Asian       86                        1
Asian       56                        1
Caucasian   78                        1

• Race: completely observed
• DV: missing for some students
• R: missing data indicator (0 = missing, 1 = observed)
• Key question: is the probability of missing data on a variable (R) related to other variables in the dataset?

The relationship between the probability of missingness and other variables in the dataset is then used to determine the missing data mechanism.

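The indicator R described on this slide can be sketched in a few lines of Python. This is only an illustration using the slide's twelve rows; the function name `observed_rate_by_group` is an assumption made for this sketch and is not part of SPSS.

```python
# Slide data: None marks a missing reading achievement score.
race = ["Asian", "Asian", "Caucasian", "Asian", "Asian", "Caucasian",
        "Caucasian", "Caucasian", "Caucasian", "Asian", "Asian", "Caucasian"]
reading = [None, None, None, None, None, 66, 88, 95, 100, 86, 56, 78]

# R = 1 when the score is observed and 0 when it is missing (the slide's coding).
r = [0 if score is None else 1 for score in reading]

def observed_rate_by_group(groups, indicator):
    """Proportion of observed scores (R = 1) within each group (hypothetical helper)."""
    totals, observed = {}, {}
    for g, flag in zip(groups, indicator):
        totals[g] = totals.get(g, 0) + 1
        observed[g] = observed.get(g, 0) + flag
    return {g: observed[g] / totals[g] for g in totals}

rates = observed_rate_by_group(race, r)
print(r)
print(rates)
```

If the proportion of observed scores differs clearly between groups, missingness is related to race, which is the kind of relationship the mechanisms on the next slides describe.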
Missing not at random (MNAR)
“The probability of missing data on a variable Y is related to the values of Y itself, even after controlling for other variables” (Enders, 2010, p. 8).

Example:

• There is no way to verify whether data are MNAR without knowing the actual values of Y.
   • In some situations, you may have some sense of the actual values if you are in the field monitoring the data collection process.
• Other techniques are needed to handle MNAR data.

Missing at Random (MAR)
“The probability of missing data on a variable Y is related to some other measured variable(s), but not to the values of Y itself” (Enders, 2010, p. 6).

Example:

• Because we do not know the actual values of Y, MAR is a theoretical judgement that has to be supported with evidence.
• ML and MI assume MAR.

Missing Complete at Random (MCAR)
“The probability of missing data on a variable Y is unrelated to other measured variables and is unrelated to the values of Y itself” (Enders, 2010, p. 7).
Example:

• The observed data are a simple random sample of the hypothetically complete dataset.
• Look for evidence of MCAR.
   • For example, compare the cases with and without missing values on a variable in terms of the other measured variables; under MCAR the two groups should not differ!

Finding evidence for MCAR or MAR: t-test
Perform a series of independent-samples t-tests that compare the group with missing values and the group without missing values on the means of the other variables in the dataset (for categorical variables, use chi-square tests).

Self-efficacy   DV: Reading Achievement   R
5               (missing)                 0
1               (missing)                 0
2               (missing)                 0
4               (missing)                 0
2               (missing)                 0
5               66                        1
3               88                        1
4               95                        1
3               100                       1
2               86                        1
4               56                        1
5               78                        1

• Available in the SPSS Missing Values Analysis module
• No significant difference is consistent with MCAR
• A significant difference is evidence against MCAR and is read here as an indication of MAR (good)
• A good way to identify variables that are related to missingness, which can then be used in MI (they provide information for imputing the missing values)

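The comparison on this slide can be sketched with a separate-variance (Welch) t statistic computed from the slide's self-efficacy column. This is a minimal sketch using only the Python standard library; it is not the SPSS Missing Values Analysis output, and `welch_t` is a hypothetical helper name. It returns the statistic and approximate degrees of freedom only, without a p-value.

```python
from math import sqrt
from statistics import mean, variance

# Slide data: self-efficacy is fully observed; R = 0 marks a missing reading score.
self_efficacy = [5, 1, 2, 4, 2, 5, 3, 4, 3, 2, 4, 5]
r = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

def welch_t(a, b):
    """Separate-variance t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Split self-efficacy by whether reading achievement is missing or observed.
missing_group = [x for x, flag in zip(self_efficacy, r) if flag == 0]
observed_group = [x for x, flag in zip(self_efficacy, r) if flag == 1]

t_stat, df = welch_t(missing_group, observed_group)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A large absolute t value would suggest that self-efficacy is related to missingness on reading achievement and is therefore a useful variable for the imputation model.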
Testing MCAR: Little's (1988) MCAR Test
• Multivariate extension of the t-test approach: evaluates all the mean differences simultaneously.
• A global test of MCAR, available in the SPSS Missing Values Analysis module under the EM procedure.
• Null hypothesis: the data are MCAR.
   • A significant MCAR test and/or significant t-tests are evidence against MCAR; in this workshop they are read as an indication of MAR.
• Issues:
   • (1) It does not identify the variables that violate MCAR.
   • (2) It has low statistical power (risk of Type II error) when only a few variables violate MCAR or when the relationship between missingness and the data is weak.

Traditional methods for handling missing data
• Listwise deletion
• Pairwise deletion
• Mean substitution
• Regression and Stochastic regression
• Hot deck imputation
• Averaging available items
• Last observation carried forward


Listwise Deletion (complete-case analysis): include only cases with complete data
• Easy, convenient, and available in all statistical software
• Wastes data and resources
• Reduces sample size and statistical power
• Assumes MCAR (otherwise it produces biased estimates)

Listwise Deletion (complete-case analysis)

Assume MAR for this example data.

Problems:
1. The remaining cases do not represent the entire sample well
2. Higher mean estimate
3. Reduced variability of the data

GPA    Complete data   Listwise deletion
Mean   3.19            3.51
Var    0.76            0.67

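Pairwise Deletion (available-case analysis): analyses (e.g., correlation, regression) are conducted on different subsets of cases
• Assumes MCAR
• Correlation: $r = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2 \sigma_Y^2}}$
   1. The covariance $\sigma_{XY}$ uses the cases with complete data on both X and Y
   2. The variances $\sigma_X^2$ and $\sigma_Y^2$ use the cases having X or Y alone (separate subsamples)
• Estimation problem: because the numerator and denominator come from different subsamples, r can be > 1 or < -1
• Lack of a consistent sample size: different subsets of cases are used to estimate the parameters, which makes standard errors difficult to compute
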
Arithmetic Mean Imputation (mean substitution): using the mean of the available cases to fill in the missing values

[Figure: example data with columns X and Y, from Schafer & Graham (2002)]

• Y has some missing values; replace each missing value of Y with the mean of Y calculated from the cases with observed Y.
• Reduces the variability of the data and the correlations.
• Severely biases the parameter estimates, even under MCAR.

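The loss of variability under mean imputation can be checked with a small sketch. The Y scores below are hypothetical illustration values (not a dataset from the workshop), and `mean_impute` is a name made up for this example.

```python
from statistics import mean, variance

def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical Y scores; None marks a missing value.
y = [148, None, 169, None, 137, 155, None, 134, 112, 113]
observed = [v for v in y if v is not None]
imputed = mean_impute(y)

# The mean is preserved, but the sample variance shrinks because the
# imputed cases add nothing to the sum of squared deviations.
print(mean(observed), mean(imputed))
print(variance(observed), variance(imputed))
```

With n_obs observed values out of n, the variance after imputation equals the observed variance multiplied by (n_obs - 1)/(n - 1), which is why the shrinkage grows with the amount of missing data.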
Regression Imputation (conditional mean imputation): using the predicted scores from a regression equation estimated on the complete cases to fill in the missing values
• Predicted score: $Y_i^* = \beta_0 + \beta_1 X_i$

[Figure from Schafer & Graham (2002)]

• Reduces variability and overestimates the correlations between variables and $R^2$, even under MCAR.

Stochastic Regression Imputation: using the predicted scores from a regression equation estimated on the complete cases to fill in the missing values, plus a normally distributed error term $z_i \sim N(0, \sigma^2)$

[Figures from Schafer & Graham (2002)]

• Predicted score: $Y_i^* = \beta_0 + \beta_1 X_i + z_i$
• Adding residual terms to the predicted values restores the variability of the imputed data and eliminates the biases.
• Provides unbiased estimates under MAR, just like ML and MI!
• But it underestimates the standard errors, which inflates the Type I error rate.

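A minimal sketch of stochastic regression imputation with one predictor, assuming a simple linear regression fitted on the complete cases and a residual standard deviation estimated from those same cases. The data, the seed, and the function names are hypothetical; this is not the SPSS procedure.

```python
import random
from statistics import mean

def fit_simple_regression(x, y):
    """Least-squares intercept, slope, and residual SD from complete cases."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma = (sum(e ** 2 for e in resid) / (len(x) - 2)) ** 0.5
    return b0, b1, sigma

def stochastic_regression_impute(x, y, rng):
    """Fill each missing y with b0 + b1*x + z, where z ~ N(0, sigma^2)."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    b0, b1, sigma = fit_simple_regression(*zip(*complete))
    return [yi if yi is not None else b0 + b1 * xi + rng.gauss(0, sigma)
            for xi, yi in zip(x, y)]

# Hypothetical data: x is complete, y has two missing values.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
imputed = stochastic_regression_impute(x, y, random.Random(2009))
print(imputed)
```

Dropping the `rng.gauss` term turns this into plain regression imputation, where every imputed value falls exactly on the regression line.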
Hot-Deck imputation: impute the missing values from similar respondents
• Procedure: when some respondents do not report their income, classify the respondents into cells (groups) based on demographic information such as age, gender, and marital status, then randomly draw an income value from a similar respondent in the same cell.

[Figure from Schafer & Graham (2002)]

• Reduces variability to some extent and biases correlation estimates and regression coefficients.

Averaging the available items (multiple-item questionnaire)
• Researchers typically compute a scale score by summing or averaging the item responses that measure the same construct.
   • For example, with 5 items measuring well-being, a respondent who answered only 3 of the items receives a scale score equal to the average of those 3 items.
   • Also called “person mean substitution”
• Potential problems: Cronbach’s alpha is incorrect, and the variance and correlations may be biased.
• Use with caution, especially with a high rate of item nonresponse.
• ML and MI are better approaches.

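Averaging the available items can be sketched as follows. The `min_answered` threshold is an assumption added for illustration (the slide does not specify one), and the item responses are hypothetical.

```python
def person_mean_score(items, min_answered=1):
    """Average the answered items; return None if too few items were answered.

    `min_answered` is a hypothetical safeguard, not something the slide prescribes.
    """
    answered = [v for v in items if v is not None]
    if len(answered) < min_answered:
        return None
    return sum(answered) / len(answered)

# A respondent who answered 3 of 5 well-being items (None = not answered).
responses = [4, None, 3, None, 2]
print(person_mean_score(responses))
print(person_mean_score(responses, min_answered=4))
```

Requiring a minimum number of answered items is one way to follow the "use with caution" advice when item nonresponse is high.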
Last observation carried forward: longitudinal designs
Observed data (. = missing)
ID   W1   W2   W3   W4
1    50   51   .    .
2    46   .    48   50
3    24   55   56   .

Data after last observation carried forward
ID   W1   W2   W3   W4
1    50   51   51   51
2    46   46   48   50
3    24   55   56   56

• Replace each missing value with the observation that immediately precedes it (for dropouts, the last observation before dropout).
• Assumes that the scores do not change from the previous measurement.
• Likely to produce biased estimates, even when data are MCAR.

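Last observation carried forward is simple enough to sketch directly. The function below is an illustrative helper (the name `locf` is chosen here) applied to the three cases from the slide, with None marking a missing wave.

```python
def locf(waves):
    """Carry the most recent observed value forward into missing waves."""
    filled, last = [], None
    for value in waves:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# The three cases from the slide (W1 to W4); None marks a missing wave.
observed = {
    1: [50, 51, None, None],
    2: [46, None, 48, 50],
    3: [24, 55, 56, None],
}
for case_id, waves in observed.items():
    print(case_id, locf(waves))
```

A value that is missing at the first wave stays missing, because there is no earlier observation to carry forward.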
Recommended methods for handling
missing data
• Maximum likelihood method (full information
maximum likelihood, FIML)

• Multiple imputation

Why FIML or Multiple imputation (MI)?
• Traditional methods have their own limitations, and some of them rely on strict assumptions about the missing data mechanism.
• FIML and MI provide better and more trustworthy parameter estimates.
• They support more appropriate conclusions from your statistical tests.
• They add rigor to your study.

Full information maximum likelihood (FIML)
• Assumes MAR and multivariate normal data.
• Implemented in structural equation modeling programs such as Mplus (the default when the outcome is continuous).
• In the missing data context, it uses all the information in the dataset to estimate the parameters and standard errors directly; missing data are handled in one step.
   • Does not drop any cases with missing values.
   • Does not produce imputed datasets.
   • FIML reads in the raw data one case at a time and evaluates the likelihood function for one case at a time; the casewise contributions are combined and the total is maximized.

Full information maximum likelihood (FIML)
• “The computations for a case use the information only from the variables and the corresponding parameters for which the case has complete data” (Enders, 2010, p. 89).
   • This implies that the computations differ slightly depending on the missing data pattern for that case (the likelihood function is customized to each missing data pattern).
• Estimation is iterative: the program tries different parameter estimates at each step until it finds the set of parameter values that maximizes the likelihood function (Enders, 2010).
   • That is, it maximizes the likelihood of observing the data and finds the model that best fits the data.
   • ML has converged when the parameter estimates no longer change across successive iterations.

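The casewise idea can be illustrated with a small sketch for two variables, assuming a bivariate normal model. Each case contributes the log density of only the variables it has observed. The data, parameter values, and function names are hypothetical, and no optimizer is included; the sketch only evaluates the log-likelihood at two candidate mean vectors.

```python
from math import log, pi

def normal_logpdf(x, mu, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (log(2 * pi * var) + (x - mu) ** 2 / var)

def bivariate_logpdf(x, y, mu, cov):
    """Log density of a bivariate normal distribution."""
    (vx, cxy), (_, vy) = cov
    det = vx * vy - cxy ** 2
    dx, dy = x - mu[0], y - mu[1]
    quad = (vy * dx ** 2 - 2 * cxy * dx * dy + vx * dy ** 2) / det
    return -0.5 * (2 * log(2 * pi) + log(det) + quad)

def case_loglik(case, mu, cov):
    """Use only the variables (and parameters) observed for this case."""
    x, y = case
    if x is not None and y is not None:
        return bivariate_logpdf(x, y, mu, cov)
    if x is not None:
        return normal_logpdf(x, mu[0], cov[0][0])
    if y is not None:
        return normal_logpdf(y, mu[1], cov[1][1])
    return 0.0  # a case with no observed data contributes nothing

def fiml_loglik(data, mu, cov):
    """Sum the casewise log-likelihood contributions."""
    return sum(case_loglik(case, mu, cov) for case in data)

# Hypothetical scores; None marks a missing value.
data = [(50, 60), (70, None), (None, 80), (65, 75)]
cov = [[100.0, 50.0], [50.0, 100.0]]
print(fiml_loglik(data, (62, 72), cov))
print(fiml_loglik(data, (0, 0), cov))
```

An ML routine would search over the means and covariances for the values that make this sum as large as possible, which is the iterative process described above.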
Full information maximum likelihood (FIML)
An iterative process: the program tries the distribution in different locations until it finds the set of parameters for which the distribution best fits the data (i.e., has the highest probability/likelihood of producing the observed data).

[Figure: candidate distributions placed along the reading achievement scale, 0 to 100]

Multiple imputation (MI)
• Assumes MAR; also called multiple stochastic regression imputation (an iterative procedure).
• Available in Mplus, SAS, Stata, Blimp, SPSS, R, and other software.
• Involves three steps:
   1. Imputation phase: a dataset with missing data is turned into m imputed datasets (imputed dataset 1, imputed dataset 2, ..., imputed dataset m).
   2. Analysis phase: each imputed dataset is analyzed separately (results 1, results 2, ..., results m).
   3. Pooling phase: the m sets of results are combined into pooled (overall) results.

Multiple imputation- imputation phase
SPSS uses fully conditional specification (FCS), also known as chained equations imputation or multivariate imputation by chained equations (MICE), which is a Markov chain Monte Carlo algorithm.

• Does not rely on the assumption of multivariate normality.
• Flexible in handling different types of variables.
   • Scale: linear regression
   • Categorical: logistic regression

ID   Age   Income   Gender
1    35    .        0
2    .     5000     1
3    45    10000    0
4    20    .        1
5    18    4500     .
(. = missing)

• Specify the imputation model on a variable-by-variable basis.
   • “For each variable with missing data, a univariate (single dependent variable) imputation model is fitted using all other available variables in the model as predictors, then imputes missing values for the variable being fit” (IBM SPSS Missing Values 24).

Multiple imputation- imputation phase
• The imputation process cycles through all the variables with missing values iteratively, each time using the new/updated imputed values.

   Age -> Income -> Gender (this cycle is repeated several times)

• When the maximum number of iterations is reached (specified by the researcher or left at the default), the imputed values at that final iteration are saved (one imputed dataset is created).
   • Requesting 5 imputations with 200 maximum iterations means that SPSS runs the MCMC algorithm 5 times and saves the imputed values at the 200th iteration each time.
   • Generally, 5-10 iterations are sufficient, but it is recommended to be conservative.
   • You may need to increase the number of iterations if the model has not converged (save the iteration history data in SPSS and plot it to assess convergence).

Multiple imputation – imputation phase
What variables should be included in the imputation model?
• (1) “At least” the variables that you are going to use in the subsequent analysis should be included.
   • For example, if a regression model uses gender and SES to predict freshmen’s GPA, then gender, SES, and GPA should all be included in the imputation model.
• (2) Include auxiliary variables: “variables are either correlates of missingness or correlates of an incomplete variable” (Enders, 2010, p. 17). These variables may not be of substantive interest, but they help improve the imputation quality and increase the “plausibility of MAR”.
   • For example, parents’ education level, ACT, SAT, and other variables in the dataset that are correlated with the variables of interest or with their missingness.

Multiple imputation – imputation phase
How many imputed datasets are needed?
• The number of imputations is strongly associated with statistical power.
• Conventional wisdom: 3-5 imputed datasets; however, research has shown that with only 3 or 5 imputed datasets, power is below its optimal level (Graham et al., 2007).
• According to Enders (2011), “generating a minimum of 20 imputed datasets seems to be a good rule of thumb for many situations”.
• If the proportion of missing data is > 50%, increase the number of imputations to more than 40 and be thoughtful about the variables included in the imputation model.

Multiple imputation – analysis phase
• The imputation phase generates m imputed datasets.
• The analysis phase: analyze each imputed dataset using the usual analysis procedure.
   • For example, a researcher generates 20 datasets and would like to use multiple regression to analyze the data. She/he repeats the multiple regression analysis 20 times, once for each dataset.

Dataset 1
Parameter   β      SE
Intercept   2.62   3.41
SES         1.81   1.6

Dataset 2
Parameter   β      SE
Intercept   2.18   3.2
SES         1      1.9

Multiple imputation – pooling phase
• Pooling point estimates: take the average of the parameter estimates across the m datasets.
   $\bar{\theta} = \frac{1}{m} \sum_{t=1}^{m} \theta_t$
   • m = number of imputed datasets
   • $\theta_t$ = parameter estimate for dataset t
• Pooling standard errors:
   $V_T = V_W + V_B + \frac{V_B}{m}$;  $SE = \sqrt{V_T}$
   • $V_T$ = total sampling variance
   • $V_W$ = within-imputation variance (the mean of the squared SEs across the m datasets)
   • $V_B$ = between-imputation variance (variability of the parameter estimate across the m datasets; additional variance that is due to missing data)
   • $\frac{V_B}{m}$ = correction factor for a finite number of imputations
• The statistical significance of $\bar{\theta}$ can be calculated in the usual way from the ratio $\bar{\theta} / \sqrt{V_T}$.

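The pooling rules above can be sketched in a few lines. The example pools the SES coefficient from the two illustrative datasets shown on the analysis-phase slide (β = 1.81, SE = 1.6 and β = 1, SE = 1.9); the function name `pool_rubin` is made up for this sketch.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool m estimates: mean estimate and SE from V_T = V_W + V_B + V_B/m."""
    m = len(estimates)
    theta_bar = mean(estimates)
    v_w = mean([se ** 2 for se in std_errors])  # within-imputation variance
    v_b = variance(estimates)                   # between-imputation variance
    v_t = v_w + v_b + v_b / m                   # total sampling variance
    return theta_bar, sqrt(v_t)

# SES coefficient and SE from the two example datasets on the previous slide.
theta_bar, pooled_se = pool_rubin([1.81, 1.0], [1.6, 1.9])
print(round(theta_bar, 3), round(pooled_se, 3))
print("ratio:", round(theta_bar / pooled_se, 3))
```

Two datasets are far too few in practice; the sketch only shows how the pooled SE ends up larger than either individual SE would suggest once the between-imputation variance is added.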
Using SPSS to
Deal with Missing Data

The example data
• High school longitudinal study of 2009: public-use data
   • NCES secondary longitudinal studies, more than 21,000 9th graders in 944 schools
• Selected sample: subsample of 500 students who took math and science courses in 2009
• Selected measures:
   • 9th grade sex (0 = male), race/ethnicity (0 = white), socioeconomic status
   • 9th and 11th grade math IRT scores
   • 9th grade math interest (3 items; 4-point Likert scale)
   • 9th grade math self-efficacy (4 items; 4-point Likert scale)

Demonstration dataset: Hsls09_MissingDataWorkshop_demo

Using SPSS to deal with missing data
• Delete cases with no data on any of the variables.
• All missing values need to be displayed as system missing (a
blank cell) or user-defined missing (a value assigned by
researcher, such as 999 or -8888).

Using SPSS to deal with missing data
• Change all missing values (either system-missing or user-defined missing values) to a common value, -999.
• Transform -> click Recode into Same Variables -> move all of the variables into the selection box -> click Old and New Values ->

[Screenshot: Old and New Values dialog with numbered steps 1-4; the new value is -999]

Using SPSS to deal with missing data
• Assign missing values for all the variables: in Variable View -> click a cell in the Missing column and assign -999 as a discrete missing value -> click OK.
• Right-click that cell and choose Copy -> select the Missing cells of all the other numeric variables -> click Paste.

Using SPSS to deal with missing data
• Define variables: in Variable View -> under the Measure column -> assign the scale for each of the variables.

Using SPSS to deal with missing data
• Analyze the pattern of missing data: go to Analyze -> Multiple Imputation -> Analyze Patterns
• Move all the variables except the ID into Analyze Across Variables
• For Minimum percentage missing for variable to be displayed, change the value to 0 -> click OK (we would like to see everything that is missing)

Using SPSS to deal with missing data

• Variables: 9 of the 12 variables contain missing values (green).
• Cases: 409 cases have complete data (81.8%) (blue); 91 cases have at least one missing value on a variable.
• Values: 110 of the 6,000 individual values (12 × 500) are missing (1.83%) (green).

Only 1.83% of the individual values are missing.

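The three summaries (variables, cases, values) can be reproduced on any small dataset with a sketch like the one below. The three rows are hypothetical and are not the workshop data; `summarize_missing` is a helper name invented for this example, with None standing in for a missing value.

```python
def summarize_missing(rows, variables):
    """Count incomplete variables, complete cases, and missing values (None = missing)."""
    missing_values = sum(row[v] is None for row in rows for v in variables)
    incomplete_vars = sum(any(row[v] is None for row in rows) for v in variables)
    complete_cases = sum(all(row[v] is not None for v in variables) for row in rows)
    total_values = len(rows) * len(variables)
    return {
        "incomplete_variables": incomplete_vars,
        "complete_cases": complete_cases,
        "missing_values": missing_values,
        "percent_values_missing": 100 * missing_values / total_values,
    }

# Hypothetical three-case example (not the workshop data).
rows = [
    {"SES": 0.4, "MATH09": 52.0, "MATH11": None},
    {"SES": None, "MATH09": 47.5, "MATH11": 60.1},
    {"SES": -0.2, "MATH09": 55.3, "MATH11": 63.8},
]
summary = summarize_missing(rows, ["SES", "MATH09", "MATH11"])
print(summary)

# The slide's own figure follows from the same arithmetic: 110 of 12 * 500 values.
print(round(100 * 110 / (12 * 500), 2))
```

The three counts answer different questions, so a small percentage of missing values can still leave a much larger share of cases incomplete, as in the workshop data (1.83% of values but 18.2% of cases).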
Using SPSS to deal with missing data
The number and percentage of missing values for each variable.

Notice that the variables are ordered by the amount of missing values (i.e., the percentage missing).
Examine the percentage missing for each variable and make sure that each one makes sense based on your knowledge of this dataset!

Using SPSS to deal with missing data

The pattern here is arbitrary.

[Figures: missing value patterns chart (variables ordered from least to highest amount of missing values) and chart of the most common patterns]

• Each pattern (row) reflects a group of cases with the same pattern of missing values (15 patterns of missing and nonmissing data).
• The variables along the bottom (x-axis) are ordered by the amount of missing values each contains.
• The second chart shows the percentages for the 10 most common patterns.
   • Pattern 1 = no missing (81%) is the most prevalent pattern.
   • Pattern 10 = missing on MATH11 (10%).

Using SPSS to deal with missing data
• Request Little’s MCAR test and independent-samples t-tests for MAR: go to Analyze -> Missing Value Analysis ->
• Descriptive: report a Student t-test for each pair of continuous variables to examine MAR

Using SPSS to deal with missing data
• Request Little’s MCAR test and separate-variance t-tests: go to Analyze -> Missing Value Analysis

A note: if you get a warning message in the SPSS output that the EM algorithm failed to converge in 25 iterations, you can increase the maximum number of iterations by clicking the EM button.

Using SPSS to deal with missing data
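• Request Little’s MCAR test and separate-variance t-tests

Scroll down in the SPSS Output window to the EM Means table; the result of Little’s MCAR test appears under this table.

The non-significant result (p = .054) means that the null hypothesis of MCAR is not rejected, so the data are consistent with missing completely at random (MCAR). Because p is only just above .05, treat this as weak support rather than proof of MCAR.
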
Examine independent sample t-tests

A significant t-test indicates that the probability of missingness on one variable is a function of the values of another variable.

It’s an indication of MAR!

We have variables that can be used in the imputation model.

Analysis model
• Research question: Can students’ SES and math self-efficacy predict their 11th grade math score?
• Dependent variable: MATH11
• Independent variables: SES and EFF_total (sum of 4 items)
• Auxiliary variables (for imputation): SEX, RACE, MATH09, math interest items
   • Correlation analysis: these variables are correlated with the variables of interest to some extent
   • Independent-samples t-tests: some of them are related to missingness on the variables of interest

Before imputation, set a random seed
Transform -> Random Number Generators -> select Set Active Generator -> click Mersenne Twister -> select Set Starting Point and Fixed Value -> click OK.

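Why a fixed seed matters can be shown outside SPSS as well. Python's `random` module also uses the Mersenne Twister generator, so the sketch below illustrates the same idea: the same starting value reproduces the same random draws, and therefore the same imputed values. The seed values 2009 and 2010 and the helper name `draw_noise` are arbitrary choices for this illustration.

```python
import random

def draw_noise(seed, n=5):
    """Draw n standard-normal values from a generator started at a fixed seed."""
    rng = random.Random(seed)  # Mersenne Twister with a fixed starting point
    return [rng.gauss(0, 1) for _ in range(n)]

first_run = draw_noise(2009)
second_run = draw_noise(2009)
other_seed = draw_noise(2010)

print(first_run == second_run)   # same seed, same draws
print(first_run == other_seed)   # different seed, different draws
```

Reporting the seed alongside the imputation settings lets someone else regenerate exactly the same imputed datasets.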
Using SPSS to deal with missing data
• Conducting multiple imputation: Analyze -> Multiple Imputation -> Impute Missing Data Values -> move the variables of interest to the Variables in Model box.

Variables ->
• 5 imputations are requested for demonstration purposes
• Missing values will be imputed 5 times and stored
• Name the dataset below the Create a new dataset button

Method ->
• Since the missing data pattern is arbitrary, select FCS
• Specify the maximum number of iterations = 200 (default = 10); increase the number of iterations if the Markov chain Monte Carlo algorithm has not converged.
• PMM (predictive mean matching): still uses regression, but each imputed value is matched to the nearest actual value in the dataset (taken from observations with the closest predicted value and no missing value on that variable).
   • If the original variable is bounded by 0 and 40, the imputed values will also be bounded by 0 and 40.
   • According to Paul Allison, there are some drawbacks of PMM in SPSS. https://statisticalhorizons.com/predictive-mean-matching

Constraints ->
• Click Scan Data to examine the variable summary
• You can specify the role of a variable during the imputation and constrain the range of imputed values (min, max, rounding) so that they are plausible
• The Rounding column allows you to specify the smallest denomination to accept: to obtain integer values, specify 1 as the rounding denomination (6.648 -> 7); to obtain values rounded to the nearest cent, specify 0.01 (6.648 -> 6.65)

Constraints ->
• If you specify the Min and Max, the maximum draws procedure is activated: it attempts to draw values for a case until it finds a set of values that are within the specified ranges
• An error occurs if a set of values within the ranges is not obtained
   • Fix: increase the maximum draws
• Demonstration: no constraints on the range of variables

• Imputation model: univariate model type, model effects, and number of values imputed
• Descriptive statistics: basic information before and after imputation
• Iteration history: information on the convergence performance

Outputs

Hsls09_MissingDataWorkshop_demo2_imputed5

• Datasets with imputed values are numbered 1 through M, where M is the number of imputations.
• Select the imputation from the drop-down list in the edit bar in Data View.

You can distinguish imputed values from observed values by cell background color.

Create composite score: Transform-> Compute Variable

• Compute the scale score (composite score) for self-efficacy in the stacked dataset
• This applies to all the imputed datasets

Before the analysis: Data-> Split file

• Split the file by imputation number
• This invokes the analysis and pooling phases for the multiply imputed datasets

Analyze data as usual
• SPSS provides pooled estimates for some analyses but not all
• Analyses marked with this icon have a corresponding SPSS procedure that accommodates multiply imputed datasets
• Let’s perform a multiple regression

SPSS outputs for multiple regression-descriptive statistics

SPSS outputs for multiple regression- correlation matrix

SPSS outputs for multiple regression- coefficient estimates
Coefficients (a), Model 1

Imputation     Term                                 B       Std. Error  Beta  t       Sig.  Fraction Missing Info.  Relative Increase Variance  Relative Efficiency
Original data  (Constant)                           45.446  3.777             12.031  .000
Original data  X1 Socio-economic status composite   8.626   1.072       .356  8.046   .000
Original data  EFF_total                            1.879   .315        .264  5.967   .000
Pooled         (Constant)                           44.126  3.734             11.818  .000  .158                    .174                        .969
Pooled         X1 Socio-economic status composite   9.242   1.019             9.073   .000  .087                    .091                        .983
Pooled         EFF_total                            1.901   .309              6.146   .000  .130                    .141                        .975

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: X2 Mathematics IRT-estimated number right score

• Results differ slightly across imputed datasets
• SPSS provides pooled estimates for the unstandardized regression coefficients!

Imputation Diagnostics

SPSS outputs for multiple regression- coefficient estimates

• Fraction missing info: the proportion of the total sampling variance that is due to missing data, $(V_B + \frac{V_B}{m}) / V_T$, for a parameter estimate; it is related to the percentage missing for that variable.
   • 0.087 for SES: 8.7% of the sampling variance is due to missing data
• A measure of the impact of missing data on parameter estimates

SPSS outputs for multiple regression- coefficient estimates

• Relative Increase Variance: how much the sampling variance is increased (inflated) because of missingness, $(V_B + \frac{V_B}{m}) / V_W$.
   • 0.141 for EFF_total: compared with the sampling variance that EFF_total would have with complete data, the estimated sampling variance for EFF_total (with missing data) is 14.1% larger.
• Variables with a larger percentage of missing values tend to have a larger relative increase in variance.

SPSS outputs for multiple regression- coefficient estimates

• Relative efficiency: the efficiency of an estimate based on m imputations relative to one based on an infinite number of imputations, RE = 1/(1 + F/M), where F = fraction of missing information and M = number of imputations.
   • Example: RE = .983 for SES, i.e., the sampling variance expected from an infinite number of imputations is 98.3% of the sampling variance obtained with 5 imputations.
• Close to 1 = more efficient; produces proper SEs (they will not be too large)
• A large percentage of missing data needs more imputations to achieve sufficient efficiency for the parameter estimates

Sources: SAS documentation for multiple imputation; Horton & Lipsitz (2001, p. 246)

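The relative efficiency column of the coefficients table can be checked against the formula 1/(1 + F/M). The sketch below plugs in the three "Fraction Missing Info." values from the pooled rows with M = 5; the function name is chosen for this illustration.

```python
def relative_efficiency(fmi, m):
    """Efficiency of m imputations relative to infinitely many: 1 / (1 + F/M)."""
    return 1 / (1 + fmi / m)

# Fraction of missing information from the pooled rows of the coefficients table.
fmi_by_term = {"(Constant)": 0.158, "SES": 0.087, "EFF_total": 0.130}

for term, fmi in fmi_by_term.items():
    print(term, round(relative_efficiency(fmi, 5), 3))   # .969, .983, .975

# The same fraction of missing information with more imputations.
print(round(relative_efficiency(0.158, 20), 3))
```

The three results match the table's relative efficiency column, and the last line shows how the efficiency moves closer to 1 as the number of imputations grows.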
Iteration history: provides the mean and standard deviation by iteration and imputation for continuous imputed variables
• Build a plot to examine the convergence of the model

Assessing the performance of imputations
Graphs > Chart Builder> select line chart


[Screenshots: Chart Builder, numbered steps 1-4]

In the Element Properties, select Value as the statistic to display.

Mean and standard deviation of the imputed values of SES at each of the 200 iterations for each of the 5 requested imputations (can be requested for each continuous imputed variable).

The purpose of this plot is to look for trends or patterns.

Model convergence: the parameter values bounce around in a random fashion with no trend (here this phase is reached immediately), and the lines for the different imputations should mix with each other.

Assessing the performance of imputations using trace plots
(using Enders’s macro: http://www.appliedmissingdata.com/macro-programs.html):

• Plots of the mean and SD of the imputed continuous variables can be requested using Enders’s SPSS macro.
• They indicate how well the imputations performed.
• For this macro: use 1000 iterations with 2 imputed datasets.
• Provides an additional convergence criterion:
   • Potential scale reduction (PSR), reported every 100 iterations: the MCMC algorithm is regarded as converged when PSR < 1.05.

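The potential scale reduction statistic can be sketched in its basic Gelman-Rubin form, which compares the variability between chains with the variability within chains. This is an assumption-laden illustration: the macro's exact computation may differ, and the chains below are hypothetical values, not output from the workshop data.

```python
from math import sqrt
from statistics import mean, variance

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin PSR for equally long chains of one parameter."""
    n = len(chains[0])
    w = mean([variance(c) for c in chains])        # within-chain variance
    b_over_n = variance([mean(c) for c in chains])  # variance of the chain means
    var_hat = (n - 1) / n * w + b_over_n
    return sqrt(var_hat / w)

# Hypothetical chains: two that overlap and two that sit at different levels.
mixed = [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]]
separated = [[1, 2, 3], [11, 12, 13]]

print(potential_scale_reduction(mixed) < 1.05)
print(potential_scale_reduction(separated) < 1.05)
```

Chains that cover the same range give a PSR near or below 1, while chains stuck at different levels give a large PSR, which mirrors what the trace plots show visually.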
Problematic or pathological case of non-convergence:

Figure from Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68.

Healthy case of convergence:

Figure from Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68.

Practice time!

The practice data
• High school longitudinal study of 2009: public-use data
• Selected sample: subsample of 490 participants who took math and science courses in 2009
• Selected measures:
   • 9th grade sex (0 = male), race/ethnicity (0 = white), SES
   • 9th and 11th grade math and science GPA
   • 9th grade science utility (3 items; 4-point Likert scale)
   • 9th grade science self-efficacy (4 items; 4-point Likert scale)
• Nominal variables: SEX, RACE
• Scale variables: SES, MGPA12, SGPA12
• Ordinal variables: science utility and self-efficacy items

Analysis model
• Research question: Can students’ race, SES, and science self-efficacy predict their 12th grade science GPA?
• Dependent variable: SGPA12
• Independent variables: Race, SES, and SEFF_total (sum of 4 items)
• Auxiliary variables for the imputation model: Sex, MGPA12, science utility items
   • Examine the correlation analysis and univariate t-tests

TASKS : YOU CAN DO IT!
• Change all missing values (either system-missing or user-defined missing values) to a common value, e.g., 999
• Assign missing values for all the variables in Variable View
• Define variables: in Variable View -> under the Measure column -> assign the scale for each of the variables
• Analyze the pattern of missing data and examine the percentage of missing data (what percentage is missing?)
• Request Little’s MCAR test (EM) and separate-variance t-tests
• Conduct multiple imputation: 10 datasets, 100 iterations
• Remember to set the minimum and maximum values of science and math GPA to 0 and 4
• Create a composite score for science self-efficacy
• Run a regression model to answer the research question
• Examine the convergence of the model using the iteration history

Practical Issues/
Myths


Practical issues/Myths
• Is imputation making up the data?

Not really! The goal of imputation is not to produce individual values and treat them as real data, but to estimate the population parameters and “preserve important characteristics of the data set as a whole” (Graham, 2008).

Imputation accounts for the uncertainty associated with missing data, so unbiased estimates can be obtained.

Practical issues/Myths
• Should both independent variables and dependent variables be included in the imputation model (MI)?
At the very least, all the variables that you will use in your analysis should be included. Why?

When the DV is not included, the imputation model assumes that the correlations between it and the IVs are 0, so excluding it weakens its relationships with the other variables.

Take a liberal approach to variable selection in the imputation phase. The imputation program does not distinguish whether a variable is an IV or a DV!

Practical issues
• Why include auxiliary variables?
Inclusive analysis strategy:
ML and MI require MAR, and since there is no test for MAR, we need to find ways to make the MAR assumption more plausible.

Schafer and Graham (2002, p. 173): collecting data on the potential causes of missingness “may effectively convert an MNAR situation to MAR”.

Incorporate a number of “auxiliary variables”: they help increase statistical power and reduce bias in the parameter estimates.

Use as many as you can; the most useful are those with correlations ≥ .40.

Practical issues
• When working with multiple-item questionnaires, should you impute the individual items or the scale scores?

If feasible, impute the individual items: this maximizes the information used to create the imputations and provides more statistical power than imputing scale scores (Enders, 2010, pp. 269-270).

Practical issues
• What if my missing data are MNAR?
Use selection models and pattern mixture models (Chapter 10 in Enders’s Applied Missing Data Analysis).

These two models deal with the MNAR situation by statistically modeling the missing data mechanism.

Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychological methods, 16(1), 1.

What should I report when I write it up?
• Missing data mechanisms
• Percentage of missing for each variable & overall percentage
of missing
• Software for missing data imputation
• Imputation method & algorithm
• Number of imputed datasets
• The variables used in the imputation model

References
• Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
• Graham, J. W. (2012). Missing data: analysis and design. Springer.
• Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual review of psychology, 60, 549-576.
• Pigott, T. D. (2001). A review of methods for missing data. Educational research and evaluation, 7(4), 353-383.
• Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological methods, 7(2), 147.
• Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: what is it and how does it work? International journal of methods in psychiatric research, 20(1), 40-49.
• Puma, M. J., Olsen, R. B., Bell, S. H., & Price, C. (2009). What to Do when Data Are Missing in Group Randomized Controlled Trials. NCEE 2009-0049. National Center for Education Evaluation and Regional Assistance.
• IBM SPSS Missing Values 21 & 24 (user manual).
• Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68.

Recommended websites
• UCLA: idre
   • SAS: https://stats.idre.ucla.edu/sas/seminars/multiple-imputation-in-sas/mi_new_1/
   • Stata: https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt1_new/
• Craig Enders website:
   • Mplus: http://www.appliedmissingdata.com/additional-examples.html
   • Blimp: http://www.appliedmissingdata.com/multilevel-imputation.html

Thank you!
Don’t be afraid of missing data!