SPSS
Meng-Ting Lo
([email protected])
Department of Educational Studies
Quantitative Research, Evaluation and Measurement Program (QREM)
Research Methodology Center (RMC)
Outline
• Missing Data Patterns and Mechanisms
• Traditional Techniques
• Listwise and pairwise deletion
• Mean substitution
• Regression and stochastic regression
• Hot deck imputation
• Averaging the available items
• Last observation carried forward
• Maximum Likelihood (ML) and Multiple Imputation (MI)
• SPSS with Multiple Imputation (demonstration and practice)
• Practical Issues/Myths
Data and Material
• High school longitudinal study of 2009: public-use data
NCES secondary longitudinal studies, more than 21,000 9th graders in
944 schools
Hsls09_MissingDataWorkshop_demo
Hsls09_MissingDataWorkshop_demo2_imputed5
Hsls09_MissingDataWorkshop_demo2_IterationHistory
Hsls09_MissingDataWorkshop_practice
• SPSS modules
Missing Value Analysis
Multiple Imputation
The importance of dealing with missing data
• We rarely see a dataset that is complete and beautiful.
Missing data patterns
“Where” is the missing data in your data set? A missing data pattern describes the location of the missing data (the shaded area in the figure).
In the past, specific missing data handling methods were developed to deal with different missing data patterns.
• Issues:
(1) The test does not identify which variables violate MCAR.
(2) It has low statistical power (a high Type II error rate) when only a few variables violate MCAR or when the relationship between missingness and the data is weak.
Traditional methods for handling missing data
• Listwise deletion
• Pairwise deletion
• Mean substitution
• Regression and Stochastic regression
• Hot deck imputation
• Averaging available items
• Last observation carried forward
Listwise Deletion (complete-case analysis): include only cases with complete data
• Easy, convenient, available in all statistical software
• Wastes data and resources
• Reduces sample size and statistical power
• Assumes MCAR (otherwise produces biased estimates)
Mean substitution
• Y has some missing values; replace each missing value of Y with the mean of Y calculated from the cases with observed Y.
• Reduces the variability of the data and the correlations.
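The shrinkage in variability caused by mean substitution can be checked on a toy example. The sketch below uses a small set of hypothetical scores (not the workshop data); `None` marks a missing value.

```python
import statistics

# Hypothetical scores; None marks a missing value on Y.
y = [10, 12, None, 14, None, 16]

observed = [v for v in y if v is not None]
mean_y = statistics.mean(observed)

# Mean substitution: every missing value receives the same observed mean.
y_imputed = [mean_y if v is None else v for v in y]

var_before = statistics.variance(observed)   # sample variance of observed cases
var_after = statistics.variance(y_imputed)   # sample variance after substitution

print("mean used for imputation:", mean_y)
print("variance before:", var_before, "after:", var_after)
```

Because the filled-in cases add nothing to the sum of squared deviations but do increase the sample size, the variance after substitution is smaller than the variance of the observed cases.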
Regression Imputation (conditional mean imputation): uses the predicted scores from a regression equation estimated on the complete cases to fill in the missing values
• Predicted score: $Y_i^{*} = \hat{\beta}_0 + \hat{\beta}_1 X_i$
Schafer & Graham (2002)
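A minimal sketch of the difference between regression imputation and stochastic regression imputation, assuming one predictor X and hypothetical data. The stochastic version adds a random normal residual to each predicted score; the seed and variable names are illustrative choices only.

```python
import math
import random
import statistics

# Hypothetical data: X is complete, Y has two missing values (None).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 4.9, None, 9.0, None, 13.2]

# Fit the regression of Y on X using the complete cases only.
complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = statistics.mean(xi for xi, _ in complete)
my = statistics.mean(yi for _, yi in complete)
b1 = (sum((xi - mx) * (yi - my) for xi, yi in complete)
      / sum((xi - mx) ** 2 for xi, _ in complete))
b0 = my - b1 * mx

# Residual standard deviation from the complete-case fit.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in complete)
resid_sd = math.sqrt(sse / (len(complete) - 2))

rng = random.Random(2009)  # fixed seed so the sketch is reproducible

# Regression imputation: the predicted score falls exactly on the line.
regression_imputed = [b0 + b1 * xi if yi is None else yi
                      for xi, yi in zip(x, y)]

# Stochastic regression imputation: predicted score plus a random residual.
stochastic_imputed = [b0 + b1 * xi + rng.gauss(0, resid_sd) if yi is None else yi
                      for xi, yi in zip(x, y)]

print(regression_imputed)
print(stochastic_imputed)
```

The added residual is what restores some of the variability that plain regression imputation removes.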
Last observation carried forward: longitudinal designs
Observed data ("." = missing)
ID   W1   W2   W3   W4
1    50   51   .    .
2    46   .    48   50
3    24   55   56   .
After carrying the last observation forward
ID   W1   W2   W3   W4
1    50   51   51   51
2    46   46   48   50
3    24   55   56   56
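The carry-forward rule can be sketched in a few lines. The function name `locf` is a hypothetical helper; the three rows are the wave scores from the slide, with `None` marking a missing wave.

```python
def locf(row):
    """Fill each missing value with the most recent observed value in the row."""
    filled, last = [], None
    for value in row:
        if value is not None:
            last = value
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

# Waves W1..W4 for IDs 1-3; None marks a missing wave.
observed = {
    1: [50, 51, None, None],
    2: [46, None, 48, 50],
    3: [24, 55, 56, None],
}

for case_id, waves in observed.items():
    print(case_id, locf(waves))
```

Note that a value missing at the first wave has nothing to carry forward, so this sketch leaves it missing.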
• Multiple imputation
Why FIML or Multiple imputation (MI)?
• Traditional methods each have their own limitations, and some of them make strict assumptions about the missing data mechanism.
Full information maximum likelihood (FIML)
• Assumes MAR and multivariate normal data.
• In the missing data context, FIML uses all the information in the dataset to estimate the parameters and standard errors directly; missing data are handled in one step.
• Does not drop any cases with missing values.
• Does not produce imputed datasets.
• FIML reads in the raw data one case at a time and evaluates each case's contribution to the log-likelihood; the estimates maximize the sum of these contributions.
Full information maximum likelihood (FIML)
• “The computations for a case use the information only from the variables and the corresponding parameters for which the case has complete data” (Enders, 2010, p. 89).
• This implies that the computations differ slightly depending on the missing data pattern of each case (the log-likelihood function is customized to each missing data pattern).
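To make the casewise idea concrete, here is a minimal sketch for two variables under a bivariate normal model. The function names, the parameter values, and the data are all hypothetical; a real FIML routine would also search for the parameters that maximize the summed log-likelihood, which this sketch does not do.

```python
import math

def normal_logpdf(value, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def case_loglik(x, y, mu, cov):
    """Log-likelihood contribution of one case, using only its observed variables."""
    (mx, my), ((sxx, sxy), (_, syy)) = mu, cov
    if x is not None and y is not None:
        # Both observed: bivariate normal log density.
        det = sxx * syy - sxy ** 2
        dx, dy = x - mx, y - my
        quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)
    if x is not None:
        return normal_logpdf(x, mx, sxx)  # only X observed
    if y is not None:
        return normal_logpdf(y, my, syy)  # only Y observed
    return 0.0                            # nothing observed, no contribution

# Hypothetical parameter values and cases (None = missing).
mu = (50.0, 60.0)
cov = ((100.0, 40.0), (40.0, 90.0))
cases = [(48.0, 63.0), (55.0, None), (None, 58.0)]

total = sum(case_loglik(x, y, mu, cov) for x, y in cases)
print("sample log-likelihood at these parameter values:", total)
```

Each case contributes through whichever density matches its missing data pattern, and no case is dropped.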
[Figure: reading achievement, axis from 0 to 100]
Multiple imputation (MI)
• Assumes MAR; also called multiple stochastic regression imputation (an iterative procedure).
• Available in Mplus, SAS, Stata, Blimp, SPSS, R, and other software.
• Involves three steps:
Imputation Phase -> Analysis Phase -> Pooling Phase

Parameter    Dataset 1          Dataset 2
             β       SE         β       SE
Intercept    2.62    3.41       2.18    3.2
SES          1.81    1.6        1       1.9
Multiple imputation – pooling phase
• Pooling point estimate: take the average of the parameter estimates across the m datasets.
$$\bar{\theta} = \frac{1}{m}\sum_{t=1}^{m}\hat{\theta}_t$$
• Pooling standard errors:
$$V_T = V_W + V_B + \frac{V_B}{m}, \qquad SE = \sqrt{V_T}$$
$m$ = number of imputed datasets
$\hat{\theta}_t$ = parameter estimate for dataset $t$
$V_T$ = total sampling variance
$V_W$ = within-imputation variance (the mean of the squared SE across the m datasets)
$V_B$ = between-imputation variance (variability of the parameter estimate across the m datasets; additional variance that is due to missing data)
$V_B / m$ = correction factor for a finite number of imputations
• The statistical significance of $\bar{\theta}$ can be calculated in the usual way by calculating the ratio $\bar{\theta} / \sqrt{V_T}$.
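The pooling rules can be sketched as a small function. The helper name `pool` is hypothetical; the inputs are the intercept estimates and standard errors from the two example datasets shown earlier (2.62 and 2.18, with SEs 3.41 and 3.2).

```python
import math
import statistics

def pool(estimates, standard_errors):
    """Pool m estimates and their SEs with the rules described above."""
    m = len(estimates)
    theta_bar = statistics.mean(estimates)                   # pooled point estimate
    v_w = statistics.mean([se ** 2 for se in standard_errors])  # within-imputation variance
    v_b = statistics.variance(estimates)                     # between-imputation variance
    v_t = v_w + v_b + v_b / m                                # total sampling variance
    return theta_bar, math.sqrt(v_t)

# Intercept estimates and SEs from the two example datasets.
theta_bar, pooled_se = pool([2.62, 2.18], [3.41, 3.2])
print("pooled estimate:", round(theta_bar, 3))
print("pooled SE:", round(pooled_se, 3))
print("ratio estimate / SE:", round(theta_bar / pooled_se, 3))
```

With only two datasets the between-imputation variance is estimated very roughly; the example is meant to show the arithmetic, not a recommended number of imputations.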
Using SPSS to
Deal with Missing Data
The example data
• High school longitudinal study of 2009: public-use data
NCES secondary longitudinal studies, more than 21,000 9th graders in
944 schools
• Selected sample: subsample of 500 students who took math and science courses in 2009
• Selected measures:
9th grade sex (0=male), race/ethnicity (0=white), socioeconomic
status
9th and 11th grade math IRT scores
9th grade math interest (3 items; 4 point Likert scale)
9th grade math self-efficacy (4 items; 4 point Likert scale)
Using SPSS to deal with missing data
• Change all missing values (either system-missing or user-defined missing values) to a common value, -999.
• Transform -> Recode into Same Variables -> move all of the variables into the selection box -> click Old and New Values -> enter -999 as the new value.
Using SPSS to deal with missing data
• Assign missing values for all the variables: in Variable View -> click a cell in the Missing column and assign -999 as a discrete missing value -> click OK.
• Right-click that cell and choose Copy -> select the Missing cells of all the other numeric variables -> click Paste.
Using SPSS to deal with missing data
• Define variables: in Variable View -> under the Measure column -> assign the measurement level (nominal, ordinal, or scale) for each of the variables.
Using SPSS to deal with missing data
• Analyze the pattern of missing data: go to Analyze -> Multiple Imputation -> Analyze Patterns.
Using SPSS to deal with missing data
Only 1.83% of the individual values are missing.
Notice that the variables are ordered by the amount of values they are missing (i.e., the percentage missing).
Examine the percentage missing for each variable, and make sure that each percentage makes sense based on your knowledge of this dataset!
Using SPSS to deal with missing data
• Each pattern (row) reflects a group of cases with the same pattern of missing values (15 patterns of missing and nonmissing data).
• The variables along the bottom (x-axis) are ordered by the amount of missing values each contains, from least to highest.
• A second chart shows the percentage of cases in each of the 10 most common patterns.
• Pattern 1 = no missing (81%) is the most prevalent pattern.
• Pattern 10 = missing on MATH11 (10%).
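The idea of a missing data pattern can be sketched outside SPSS by grouping cases on which variables are missing. The rows below are hypothetical (the variable names only echo the workshop measures); `X` marks a missing value and `.` an observed one.

```python
from collections import Counter

# Hypothetical cases; None marks a missing value.
variables = ["SES", "MATH09", "MATH11"]
rows = [
    {"SES": 0.3, "MATH09": 41.2, "MATH11": 55.0},
    {"SES": -0.1, "MATH09": 38.5, "MATH11": None},
    {"SES": 1.2, "MATH09": 47.9, "MATH11": 61.3},
    {"SES": None, "MATH09": 35.0, "MATH11": None},
    {"SES": 0.8, "MATH09": 44.1, "MATH11": 58.7},
    {"SES": 0.0, "MATH09": 40.0, "MATH11": None},
]

def pattern(row):
    """Return one string per case, e.g. '..X' = missing on the third variable only."""
    return "".join("X" if row[v] is None else "." for v in variables)

counts = Counter(pattern(r) for r in rows)

# Patterns ordered from most to least common, with the percentage of cases.
for pat, n in counts.most_common():
    print(pat, n, f"{100 * n / len(rows):.0f}%")

# Percentage of individual values that are missing.
n_missing = sum(r[v] is None for r in rows for v in variables)
print("percent of values missing:", round(100 * n_missing / (len(rows) * len(variables)), 2))
```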
Using SPSS to deal with missing data
• Request Little’s MCAR test and independent-samples t-tests for MAR: go to Analyze -> Missing Value Analysis.
Using SPSS to deal with missing data
• Request Little’s MCAR test and separate-variance t-tests: go to Analyze -> Missing Value Analysis.
A note: if you get a warning message in the SPSS output that the EM algorithm failed to converge in 25 iterations, you can increase the maximum number of iterations by clicking the EM button.
Using SPSS to deal with missing data
• Request Little’s MCAR test and Separate Variance t-tests
A significant t-test indicates that the probability of missingness is a function of the values of another variable.
This indicates that the data are not MCAR and is consistent with MAR!
We have variables that can be used in the imputation model.
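The logic of a separate-variance t-test can be sketched as follows: split the cases by whether one variable is missing, then compare the group means on another variable without assuming equal variances. The data and the helper name `welch_t` are hypothetical, and the sketch stops at the t statistic and degrees of freedom rather than computing a p-value.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Separate-variance (Welch) t statistic and its approximate degrees of freedom."""
    va = statistics.variance(group_a) / len(group_a)
    vb = statistics.variance(group_b) / len(group_b)
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1))
    return t, df

# Hypothetical data: SES is complete, MATH11 has missing values (None).
ses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
math11 = [None, None, None, 52.0, 60.0, 58.0, 65.0]

# Missingness indicator on MATH11 defines the two groups compared on SES.
ses_when_missing = [s for s, m in zip(ses, math11) if m is None]
ses_when_observed = [s for s, m in zip(ses, math11) if m is not None]

t, df = welch_t(ses_when_missing, ses_when_observed)
print("t =", round(t, 3), "df =", round(df, 2))
```

A large absolute t value here would suggest that missingness on MATH11 is related to SES, which makes SES a useful variable for the imputation model.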
Analysis model
• Research Question: Can students’ SES and math self-efficacy predict their 11th grade math score?
Before imputation, set a random seed
Transform -> Random Number Generators -> select Set Active Generator -> click Mersenne Twister -> select Set Starting Point and Fixed Value -> click OK.
Using SPSS to deal with missing data
• Conducting multiple imputation: Analyze-> Multiple Imputation->
Impute Missing Data Values-> Move the variables of interest to the
Variables in Model box.
Variables->
• 5 imputations are requested for demonstration purposes.
Method->
• Since the missing data pattern is arbitrary, select fully conditional specification (FCS).
• Specify the maximum number of iterations = 200. The default is 10; increase the number of iterations if the Markov chain Monte Carlo algorithm hasn't converged.
• PMM (predictive mean matching): still uses regression, but each imputed value is taken from the nearest actual value in the dataset (from a case that is observed on that variable and has a similar predicted value).
• If the original variable is bounded by 0 and 40, the imputed values will also be bounded by 0 and 40.
• According to Paul Allison, there are some drawbacks of PMM in SPSS: https://statisticalhorizons.com/predictive-mean-matching
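A simplified sketch of the matching idea behind PMM, assuming one predictor and hypothetical data. It is not SPSS's implementation: a full PMM procedure also draws the regression parameters at random for each imputation, whereas this sketch uses fixed least-squares estimates and a hypothetical donor pool size `k`.

```python
import random

def pmm_impute(x, y, k=3, seed=1):
    """Fill missing y values with observed y values borrowed from similar cases."""
    rng = random.Random(seed)
    donors_pool = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]

    # Least-squares regression of y on x among the observed cases.
    mx = sum(xi for xi, _ in donors_pool) / len(donors_pool)
    my = sum(yi for _, yi in donors_pool) / len(donors_pool)
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in donors_pool)
          / sum((xi - mx) ** 2 for xi, _ in donors_pool))
    b0 = my - b1 * mx

    def predict(xi):
        return b0 + b1 * xi

    imputed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            imputed.append(yi)
            continue
        # The k observed cases whose predicted values are closest to this case's.
        nearest = sorted(donors_pool,
                         key=lambda d: abs(predict(d[0]) - predict(xi)))[:k]
        # Borrow the actual observed value of one randomly chosen donor.
        imputed.append(rng.choice(nearest)[1])
    return imputed

# Hypothetical data; None marks a missing value on y.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [12, 15, None, 22, 25, None, 31, 40]
y_pmm = pmm_impute(x, y)
print(y_pmm)
```

Because every imputed value is copied from an observed case, the imputed values cannot fall outside the range of the observed data.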
Constraints->
• Click Scan Data and examine the variable summary.
Constraints->
• If you specify the Min and Max, the maximum draws procedure is activated: it attempts to draw values for a case until it finds a set of values that are within the specified ranges.
• Demonstration: no constraints on the range of the variables.
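The redraw idea can be illustrated with a small sketch. This is only an assumption about the general logic (draw again until the value is in range, up to a limit), not a description of SPSS's internal algorithm; the function name, the normal draw, and the limit of 100 are hypothetical.

```python
import random

def draw_within_range(rng, mean, sd, low, high, max_draws=100):
    """Redraw a candidate value until it falls within [low, high]."""
    for _ in range(max_draws):
        value = rng.gauss(mean, sd)
        if low <= value <= high:
            return value
    return None  # no acceptable value found within the allowed number of draws

rng = random.Random(0)

# Hypothetical GPA-like variable restricted to the range 0 to 4.
gpa = draw_within_range(rng, mean=3.5, sd=1.0, low=0.0, high=4.0)
print(gpa)

# If the imputation model sits far outside the range, the draws can run out.
print(draw_within_range(rng, mean=100.0, sd=1.0, low=0.0, high=4.0, max_draws=10))
```

The second call shows why tight constraints combined with a poorly fitting model can cause the procedure to fail rather than produce a value.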
• Imputation model: univariate model type, model effects, and number of values imputed
• Descriptive statistics: basic information before and after imputation
• Iteration history: information on the convergence performance
Outputs
Hsls09_MissingDataWorkshop_demo2_imputed5
• Datasets with imputed values are numbered 1 through M, where M is the number of imputations.
• Select the imputation from the drop-down list in the edit bar in Data View.
You can distinguish imputed values from observed values by cell background color.
Create composite score: Transform-> Compute Variable
• Compute the scale score (composite score) for self-efficacy in the stacked dataset.
Before the analysis: Data-> Split file
Analyze data as usual
• SPSS provides pooled estimates for some analyses but not all.
• Let’s perform a multiple regression.
SPSS outputs for multiple regression: descriptive statistics
SPSS outputs for multiple regression: correlation matrix
SPSS outputs for multiple regression: coefficient estimates
Coefficients (a)
                                                                                     Fraction       Relative Increase  Relative
Imputation     Model 1 term                       B       Std. Error  Beta  t       Sig.  Missing Info.  Variance           Efficiency
Original data  (Constant)                         45.446  3.777             12.031  .000
               X1 Socio-economic status composite  8.626  1.072       .356   8.046  .000
               EFF_total                           1.879   .315       .264   5.967  .000
Pooled         (Constant)                         44.126  3.734             11.818  .000  .158           .174               .969
               X1 Socio-economic status composite  9.242  1.019              9.073  .000  .087           .091               .983
               EFF_total                           1.901   .309              6.146  .000  .130           .141               .975
a. Dependent Variable: X2 Mathematics IRT-estimated number right score
B = unstandardized coefficient; Beta = standardized coefficient.
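The Relative Efficiency column can be related to the Fraction Missing Info. column with the standard formula RE = 1 / (1 + FMI / m), where m is the number of imputations (5 in this demonstration). The sketch below assumes that formula; it reproduces the three values in the pooled output to three decimals.

```python
def relative_efficiency(fmi, m):
    """Efficiency of m imputations relative to an infinite number of imputations."""
    return 1.0 / (1.0 + fmi / m)

m = 5  # number of imputations requested in the demonstration

# Fraction of missing information reported for each pooled coefficient.
fmi_by_term = {"(Constant)": 0.158, "SES": 0.087, "EFF_total": 0.130}

for term, fmi in fmi_by_term.items():
    print(term, round(relative_efficiency(fmi, m), 3))
```

The same function can be used to see how little efficiency is gained by adding imputations once m is moderately large and the fraction of missing information is small.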
Imputation Diagnostics
Assessing the performance of imputations
Graphs -> Chart Builder -> select Line chart
Mean and standard deviation of the imputed values of SES
at each iteration (200) for each of the 5 requested imputations
(can be requested for each continuous imputed variable).
Assessing the performance of imputations using trace plots
(using Enders’s macro: http://www.appliedmissingdata.com/macro-programs.html):
• Plots of the mean and SD of the imputed values of continuous variables can be requested using Enders’s SPSS macro.
• The plots give an indication of the performance of the imputations.
• To use this macro: 1000 iterations with 2 imputed datasets.
• The macro provides an additional convergence criterion:
• Potential scale reduction (PSR), reported every 100 iterations: the MCMC algorithm is regarded as converged when PSR < 1.05.
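A sketch of one common formulation of the potential scale reduction (the Gelman-Rubin statistic), assuming equal-length chains of a summary such as the mean of the imputed values. The macro's exact computation is not shown in these slides, so treat this as an illustration of the idea; the chains below are hypothetical.

```python
import math
import statistics

def potential_scale_reduction(chains):
    """Compare between-chain and within-chain variability for equal-length chains."""
    n = len(chains[0])
    within = statistics.mean([statistics.variance(c) for c in chains])
    between_over_n = statistics.variance([statistics.mean(c) for c in chains])
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within)

# Hypothetical chains: the first pair overlaps, the second pair does not.
mixed = [[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0]]
separated = [[1.0, 2.0, 3.0, 4.0, 5.0], [11.0, 12.0, 13.0, 14.0, 15.0]]

print("overlapping chains:", round(potential_scale_reduction(mixed), 3))
print("separated chains:", round(potential_scale_reduction(separated), 3))
```

When the chains cover the same region the statistic stays below the 1.05 cutoff; when they sit in different regions it is far above it.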
Problematic or pathological case of non-convergence:
Figure from Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68.
Healthy case of convergence:
Figure from Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained equations in R. Journal of statistical software, 1-68.
Practice time!
The practice data
• High school longitudinal study of 2009: public-use data
• Selected sample: subsample of 490 participants who took math and science courses in 2009
• Selected measures:
9th grade sex (0=male), race/ethnicity (0=white), SES
9th and 11th grade math and science GPA
9th grade science utility (3 items; 4 point Likert scale)
9th grade science self-efficacy (4 items; 4 point Likert scale)
• Nominal Var: SEX, RACE
• Scale Var: SES, MGPA12, SGPA12
• Ordinal Var: Science utility and self-efficacy items
Analysis model
• Research Question: Can students’ race, SES, and science self-efficacy predict their 12th grade science GPA?
TASKS : YOU CAN DO IT!
• Change all missing values (either system-missing or user-defined missing values) to a common value, e.g., 999
• Assign missing values for all the variables in Variable View
• Define variables: in Variable View -> under the Measure column -> assign the measurement level for each of the variables
• Analyze the pattern of missing data and examine the percentage missing (what percentage of the values is missing?)
• Request Little’s MCAR test (EM) and separate-variance t-tests
• Conduct multiple imputation: 10 datasets, 100 iterations
• Remember to set the minimum and maximum values of science and math GPA to 0 and 4
• Create a composite score for science self-efficacy
• Run a regression model to answer the research question
• Examine the convergence of the model using the iteration history
Practical Issues/Myths
Practical issues/Myths
Is imputation making up the data?
Practical issues/Myths
Should both independent variables and dependent
variables be included in the imputation model (MI)?
At least, all the variables that you will use in your analysis
should be included. Why?
Use as many as you can; the most useful are those with correlations of .40 or higher.
Practical issues
When working with a multiple-item questionnaire, should you impute the individual items or the scale scores?
Practical issues
What if my missing data is MNAR?
Using Selection Modeling and Pattern Mixture Modeling
(Chapter 10 in Enders’s Applied Missing Data Analysis)
Enders, C. K. (2011). Missing not at random models for latent growth curve analyses. Psychological methods, 16(1), 1.
What should I report when I write it up?
• Missing data mechanisms
• Percentage of missing for each variable & overall percentage
of missing
• Software for missing data imputation
• Imputation method & algorithm
• Number of imputed datasets
• The variables used in the imputation model
References
• Enders, C. K. (2010). Applied missing data analysis. Guilford Press.
• Graham, J. W. (2012). Missing data: analysis and design. Springer.
• Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual review of
psychology, 60, 549-576.
• Pigott, T. D. (2001). A review of methods for missing data. Educational research and
evaluation, 7(4), 353-383.
• Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the
art. Psychological methods, 7(2), 147.
• Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained
equations: what is it and how does it work?. International journal of methods in psychiatric
research, 20(1), 40-49.
• Puma, M. J., Olsen, R. B., Bell, S. H., & Price, C. (2009). What to Do when Data Are Missing in
Group Randomized Controlled Trials. NCEE 2009-0049. National Center for Education
Evaluation and Regional Assistance.
• IBM SPSS Missing Values 21 & 24 (user manual).
• Buuren, S. V., & Groothuis-Oudshoorn, K. (2010). mice: Multivariate imputation by chained
equations in R. Journal of statistical software, 1-68.
Recommended websites
• UCLA: idre
• SAS: https://stats.idre.ucla.edu/sas/seminars/multiple-imputation-in-sas/mi_new_1/
• Stata: https://stats.idre.ucla.edu/stata/seminars/mi_in_stata_pt1_new/
• Craig Enders’s website:
• Mplus: http://www.appliedmissingdata.com/additional-examples.html
• Blimp: http://www.appliedmissingdata.com/multilevel-imputation.html
Thank you
Don’t be afraid of missing data!