Advanced Data Analysis
Session 3: Data preparation
Today’s session
Data preparation
• Dealing with missing values
• Dealing with outliers
• Compute new variables
• Research on a subset of observations
• Summated scales
Recoding variables
Reliability analysis
• Transforming metric variables into categorical variables
DEALING WITH MISSING VALUES
File: CLASS missing [Link]
Dealing with missing values
• Course website > Session 3
• Save on computer
Variable view
Name: Variable name
   Restrictions: no spaces, no underscore at end, no duplicates, no reserved keywords (ALL, AND, BY, EQ, GE, LE, LT, …)
Type: String data (~text) vs. numeric data
Label: Label of the variable; fewer restrictions than for the name
Values: Labels assigned to the levels of the variable
Missing: Assign missing values
Measure: Measurement level: scale, ordinal, or nominal
Dealing with missing values
• Missing value = No response or “Don’t know” “No opinion” (NOT: “Neutral”)
• Can significantly influence results!
• Need to be assigned explicitly
• Represented by a Dot (.) in the dataset
In large datasets: not easy to identify by sight!
Dealing with missing values
Detecting missing values
Analyze > Descriptive Statistics > Frequencies
Step 1: Select ALL variables and move them into the « variable box »
Step 2: Check whether missing values are spotted for any of the variables
Step 3: For missing values detected, check frequency table of the variable
Step 4: Go into DATA VIEW and scroll down to identify where the
value is missing for that specific variable
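Outside SPSS, steps 2 to 4 can be sketched in a few lines of Python. This is only an illustration: the dataset, the variable names and the `missing_report` helper are hypothetical, and an empty cell (the dot in SPSS) is represented by `None`.

```python
# Hypothetical mini-dataset: one dict per case (row); None marks an empty cell.
cases = [
    {"gender": 1, "weight": 70, "length": 173},
    {"gender": 2, "weight": None, "length": 165},
    {"gender": 1, "weight": 82, "length": None},
    {"gender": None, "weight": 64, "length": 180},
    {"gender": 2, "weight": None, "length": 158},
]


def missing_report(rows):
    """Return {variable: [case numbers with a missing value]}; cases are numbered from 1."""
    report = {}
    for case_number, row in enumerate(rows, start=1):
        for variable, value in row.items():
            if value is None:
                report.setdefault(variable, []).append(case_number)
    return report


report = missing_report(cases)
for variable, where in report.items():
    print(f"{variable}: {len(where)} missing (cases {where})")
```

The case numbers play the role of the line numbers you would scroll to in Data view.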
Dealing with missing values
Specify missing values in SPSS
Rule:
Assign a value that is
NOT an answer option
(e.g., 99 or -1)
Step 5: Tell SPSS that a value is missing
Having missing values for one or more participants doesn’t mean we have to ignore the
data we do have for those participants!!
But, we need to tell SPSS that a value is missing for those participants!
2 possibilities:
1) Leave the cell blank with a dot
→ Not 100% clear whether the answer still needs to be filled out or whether the value
is missing; sometimes SPSS cannot work with that
2) Assign a value that clearly indicates that the value is missing
→ Better option!
Dealing with missing values
Specify missing values in SPSS
Step 5.1: In Data view
(put -1)
Illustration: Calculate the mean weight
Analyze > Descriptive Statistics > Descriptives
Dealing with missing values
Specify missing values in SPSS
Step 5.1: In Data view (put -1)
Step 5.2: In Variable view, declare -1 as a missing value (column "Missing")
Without this specification, the mean could be very different!!
Do this for EVERY variable with missing values!
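A minimal Python sketch of why the missing code has to be declared. The weights are made-up numbers; only the code -1 comes from the slides.

```python
from statistics import mean

MISSING_CODE = -1  # code from the slides; it must not be a valid answer

# Hypothetical weights in kg; two participants did not answer.
weights = [70, 82, MISSING_CODE, 64, 90, MISSING_CODE]

# Without declaring -1 as missing, it is treated as a real weight.
naive_mean = mean(weights)

# Once -1 is declared missing, those cases are left out of the calculation.
valid = [w for w in weights if w != MISSING_CODE]
correct_mean = mean(valid)

print(f"mean with -1 counted as data:  {naive_mean:.2f}")
print(f"mean with -1 declared missing: {correct_mean:.2f} (n = {len(valid)})")
```

The participants with a missing weight stay in the dataset; only this one value is skipped.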
DEALING WITH OUTLIERS
Dealing with outliers
Outlier (reminder)
• A case that is very different from the rest of the data
• Can significantly influence results:
Bias the mean
Inflate the standard deviation
Dealing with outliers
Importance of screening your data!
... the mean increases (it increases by 0.4). This example shows how a single score, from some mean-spirited badger turd, can bias a parameter such as the mean: the first rating of 2 drags the average down. Based on this biased estimate new customers might erroneously conclude that my book is worse than the population actually thinks it is. Although I am consumed with bitterness about this whole affair, it has at least given me a great example of an outlier.
Figure 5.2: The first 7 customer ratings of this book on [Link] (in about 2002). The first score biases the mean.
Dealing with outliers
Screen your data for outliers
1) Graph the data with a frequency distribution
(histogram)
2) If an outlier appears to be present for a variable,
graph the data for that variable with a boxplot
Example
A biologist was worried about the potential health effects of music festivals.
He measured the hygiene of 810 concert-goers over the three days of the
festival. Hygiene was measured using a scale with scores ranging from 0 to 4:
• 0 = you smell like a corpse rotting up a skunk’s arse
• 4 = you smell like sweet roses on a fresh spring day
Dealing with outliers
Screen your data for outliers
Step 1: Analyze > Descriptive Statistics > Frequencies
Step 2: Drag all the variables in your dataset from the left box to the
Variable(s) Box.
Step 3: Click on Charts and select Histograms as the Chart Type
Step 4: Click on Continue and then OK.
Dealing with outliers
Reminder: Spotting outliers with graphs
Hygiene rating at a festival: interval scale of 0 - 4
Particularly odd because it has a score of 20 on a scale from 0 to 4!
TIP: Always check that there are no scores BELOW the minimum or ABOVE the maximum of the scale used!
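The range check from the tip can be sketched in Python as follows. The scores and the `out_of_range` helper are hypothetical; only the 0 to 4 scale and the impossible score of 20 come from the example.

```python
SCALE_MIN, SCALE_MAX = 0, 4  # hygiene scale from the example

# Hypothetical scores; the 20 mimics the impossible value in the histogram.
hygiene = [1.5, 2.0, 0.5, 20, 3.2, 4.0]


def out_of_range(values, low, high):
    """Return (case number, value) for every score outside [low, high]; cases start at 1."""
    return [
        (case, value)
        for case, value in enumerate(values, start=1)
        if value is not None and not (low <= value <= high)
    ]


print(out_of_range(hygiene, SCALE_MIN, SCALE_MAX))
```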
Dealing with outliers
Reminder: Spotting outliers with graphs
Boxplots tell you WHETHER there are extreme scores & WHERE to look in your dataset
Case 611 seems to be an extreme score!
(LOOK OUT: it is line 611 in data view, not a value of 611 for the variable!)
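As a rough illustration of what a boxplot flags, the sketch below applies the common 1.5 × IQR convention in Python. The data and the `boxplot_outliers` helper are hypothetical, and the quartiles are computed with Python's "inclusive" method, which is an assumption: SPSS may compute its box edges slightly differently.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return (case number, value) for scores beyond k * IQR from the quartiles."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [
        (case, value)
        for case, value in enumerate(values, start=1)
        if value < lower or value > upper
    ]


# Hypothetical hygiene scores; case 6 holds an extreme value.
scores = [1.2, 2.0, 1.8, 2.5, 3.0, 20.0, 1.5, 2.2]
print(boxplot_outliers(scores))
```

As on the slide, the first number in each pair is the case (line) number, not the value of the variable.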
Check some hints on outlier detection for the group work in a document online
Dealing with outliers
What to do with outliers once detected?
Discuss the outliers very specifically:
• Are they mistakes (e.g., typos)? Then, fix them!!
e.g., length = 317 cm → 173 cm
• Delete the person from the dataset
ONLY IF you think the entire person is an outlier:
1) not representative of the population you want to investigate
(e.g., student from Paris while testing students from Lille)
2) the same person is responsible for outliers in many questions in
the survey
Only delete when you are SURE that it is an outlier! Sometimes, people may have very
good reasons for responding with an extreme score (compared to the rest of the data)!
• Change the score: CAUTION!!
Dealing with outliers
What to do with outliers once detected?
CHANGING THE SCORE
« What did you say? CHEATING???!!! »
2 options
• Replace by next highest/lowest non-outlier value
(e.g., a score of 12 on a scale of 1-10 → change to 10)
• Replace that particular value by a missing code (e.g., -1)
REPORT!!! Don’t delete/change outliers automatically, without disclosure and discussion
Explain who was deleted and on what grounds!
We detected one outlier (case …) because […].
We left that value out of the analysis OR we
changed that value to… because …
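The two options for changing a score can be sketched in Python as below. This is a simplified illustration: `clamp_to_scale` pulls an impossible score back to the nearest end of the scale (matching the 12 → 10 example), which is narrower than "next highest/lowest non-outlier value" in general. The helper names and the scores are hypothetical. The change log supports the reporting requirement.

```python
MISSING_CODE = -1  # missing code used in the slides


def clamp_to_scale(value, low, high):
    """Option 1 (simplified): replace an out-of-range score by the nearest scale end."""
    return max(low, min(high, value))


def to_missing(value, low, high, missing_code=MISSING_CODE):
    """Option 2: replace an out-of-range score by the missing code."""
    return value if low <= value <= high else missing_code


# Hypothetical answers on a 1-10 scale; case 2 contains an impossible 12.
scores = [7, 12, 3, 10]

cleaned = []
change_log = []  # (case number, old value, new value), to be disclosed in the report
for case, old in enumerate(scores, start=1):
    new = clamp_to_scale(old, 1, 10)
    if new != old:
        change_log.append((case, old, new))
    cleaned.append(new)

print(cleaned)
print(change_log)
```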
File: [Link]
• Course website > Session 3
• Save on computer
COMPUTE NEW VARIABLES
Compute new variables
Exercise
Calculate a new variable BMI (Body Mass Index)
BMI = Weight (in kg)/(Length (in m))²
Do not do it yourself!
Cumbersome task
Prone to mistakes
Use SPSS to help you with that…
Compute new variables
Exercise
BMI = Weight (in kg)/(Length (in m))²
Transform > Compute Variable
Compute new variables
Exercise
BMI = Weight (in kg)/(Length (in m))²
???
Predefined formulas (e.g., to calculate squares, means, etc.)
Compute new variables
Exercise
Data view
Check quickly whether the right calculation is made!
DO NOT FORGET to specify characteristics for this new variable in the variable view!
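For a quick check of the computed values, the BMI formula can be reproduced in Python. The sketch assumes that length is stored in centimetres and that -1 is the missing code; both are assumptions about the dataset, and the `bmi` helper is hypothetical.

```python
def bmi(weight_kg, length_cm, missing_code=-1):
    """BMI = weight (kg) / (length (m))^2; returns None if either input is missing."""
    if weight_kg in (None, missing_code) or length_cm in (None, missing_code):
        return None
    length_m = length_cm / 100  # assumption: length is recorded in cm
    return weight_kg / length_m ** 2


print(round(bmi(70, 175), 2))
print(bmi(-1, 175))
```

If length is already recorded in metres, the division by 100 must be dropped.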
RESEARCH ON A SUBSET OF
OBSERVATIONS
Research on a subset of observations File: [Link]
Sometimes, you may be interested in performing an analysis for a specific group of
people only → Need to tell SPSS, otherwise, it takes everybody in the dataset!
Data > Select cases
Research on a subset of observations
e.g., you want to exclude group 1: Variable > 1
e.g., you want to include only group 1 and 3: Variable = 1 | Variable = 3
Research on a subset of observations
Data view
Check quickly whether the right selection is made!
Don’t forget to switch off the selection after your analysis!
Data > Select cases > Check ‘All cases’
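The two selection conditions correspond to simple filters. The Python sketch below uses a hypothetical dataset with a `group` variable coded 1, 2 or 3; the original list is never modified, which mirrors switching back to "All cases" afterwards.

```python
# Hypothetical cases with a grouping variable coded 1, 2 or 3.
cases = [
    {"id": 1, "group": 1},
    {"id": 2, "group": 2},
    {"id": 3, "group": 3},
    {"id": 4, "group": 1},
    {"id": 5, "group": 2},
]

# Exclude group 1 (condition: group > 1).
without_group_1 = [c for c in cases if c["group"] > 1]

# Include only group 1 and 3 (condition: group = 1 | group = 3).
only_1_and_3 = [c for c in cases if c["group"] == 1 or c["group"] == 3]

print([c["id"] for c in without_group_1])
print([c["id"] for c in only_1_and_3])
```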
REDUCING DATA:
SUMMATED SCALES
File: [Link]
Reducing data
• Course website > Session 3
• Save on computer
« Candy Preference Scale »
Candy Preference Scale
When I watch television in the evening, I eat candy on a regular basis.
If I'm hungry between meals, I will eat fruit more often than candy.
I always like to add extra sugar to my dessert.
When I take a snack, I prefer the sweetest one.
Four questions to measure the same preference for candy
Two options:
Repeat the same analyses for the four questions =
cumbersome task (e.g., 15 questions)
Reduce the data by making a SUMMATED SCALE
= one variable that combines several variables that
measure the same construct (here: candy preference)
Summated scales
3 requirements
1. All questions need to be measured on the same scale
(e.g., Likert scale from 1 to 5)
2. All questions need to be scaled in the same direction
3. The new variable should contain only variables that
measure the same construct (here: candy preference)
Summated scales
Three requirements
1. All questions need to be measured on the same scale
Candy Preference Scale
Please indicate the extent to which you agree with the following statements
from 1 « completely disagree » to 7 « completely agree »
When I watch television in the evening, I eat candy on a regular basis.
□1 □2 □3 □4 □5 □6 □7
If I'm hungry between meals, I will eat fruit more often than candy.
□1 □2 □3 □4 □5 □6 □7
I always like to add extra sugar to my dessert.
□1 □2 □3 □4 □5 □6 □7
When I take a snack, I prefer the sweetest one.
□1 □2 □3 □4 □5 □6 □7
Summated scales
Three requirements
2. All questions need to be scaled in the same direction
Candy Preference Scale
Please indicate the extent to which you agree with the following statements
from 1 « completely disagree » to 7 « completely agree »
When I watch television in the evening, I eat candy on a regular basis.
If I'm hungry between meals, I will eat fruit more often than candy.
I always like to add extra sugar to my dessert.
When I take a snack, I prefer the sweetest one.
Summated scales
Recoding variables
Transform > Recode into Different Variables
Helpful hint: Display the
variable name instead of label:
1. Right mouse click
on the variable of
interest
2. Select « Display
Variable Names »
Summated scales
Recoding variables
1. Click on the variable to recode (« CandyPref2 ») in the list of variables on the left and
click the arrow
2. Under Output Variable: enter the name of the new recoded variable
(‘CandyPref2_recoded’) + label
3. Click on ‘Old and New Values’
Summated scales
Recoding variables
1. Specify potential missing values
Summated scales
Recoding variables
2. Specify the old and new values for reverse coding
(e.g., OLD: 1 → NEW: 7)
OLD: System- or user-missing → NEW: System-missing
ATTENTION!! Different scales need different recodings (old-new values) (here: 7-point scale)
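Reverse coding follows one rule for any scale: new value = scale minimum + scale maximum - old value. A minimal Python sketch, in which the `reverse_code` helper and the missing code -1 are assumptions for illustration:

```python
def reverse_code(value, scale_min=1, scale_max=7, missing_code=-1):
    """Reverse a score on a scale_min..scale_max scale; missing stays missing (None)."""
    if value is None or value == missing_code:
        return None  # corresponds to OLD: system- or user-missing -> NEW: system-missing
    return scale_min + scale_max - value


# 7-point scale: 1 -> 7, 2 -> 6, ..., 7 -> 1
print([reverse_code(v) for v in range(1, 8)])

# A 5-point scale needs a different recoding: 1 -> 5, ..., 5 -> 1
print([reverse_code(v, 1, 5) for v in range(1, 6)])
```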
Summated scales
Recoding variables
Data view
Check quickly whether the right recoding is made!
Summated scales
Three requirements
3. Summated scale = one variable that combines several variables
that measure the same construct (here: candy preference)
Create your summated scale ONLY IF sufficiently reliable! (i.e., acceptable Cronbach’s α)
Summated scales
Reliability analysis
Internal consistency reliability
= a measure that indicates whether several items that purport
to measure the same general construct produce similar scores
When you have a large scale, first make groups of items that
logically fit together in terms of interpretation before you
test their internal consistency
Example for construct “attitude toward cycling”
• I like to ride bicycles: Totally agree
• I’ve enjoyed riding bicycles in the past: Totally agree
• I hate bicycles: Totally disagree
→ good internal consistency of the scale
Measured by Cronbach’s alpha
Usually 0 < α < 1 (a negative α signals a problem, e.g., an item that still needs to be reverse coded)
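For reference, Cronbach's alpha for k items is α = k/(k-1) × (1 - sum of item variances / variance of the total score). The Python sketch below implements this standard formula with sample variances. The `cronbach_alpha` helper and the scores are hypothetical, and the example also shows the effect of forgetting to reverse code an item.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, same respondents in the same order."""
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least 2 items")
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical answers of 7 respondents on a 7-point scale.
item_a = [1, 2, 3, 4, 5, 6, 7]
item_b = [2, 2, 3, 5, 5, 6, 7]
item_c_raw = [7, 6, 5, 4, 3, 2, 1]               # negatively worded item, not yet recoded
item_c_recoded = [1 + 7 - v for v in item_c_raw]  # reverse coded

alpha_raw = cronbach_alpha([item_a, item_b, item_c_raw])
alpha_recoded = cronbach_alpha([item_a, item_b, item_c_recoded])
print(f"alpha with the raw item:     {alpha_raw:.2f}")
print(f"alpha with the recoded item: {alpha_recoded:.2f}")
```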
Summated scales
Reliability analysis
Analyze > Scale > Reliability Analysis
Summated scales
Reliability analysis
If you have a RECODED variable: use it instead of the
original one, otherwise Cronbach’s alpha will produce a
strange score!!!
Summated scales
Reliability analysis
OUTPUT
Alpha could be increased by deleting CandyPref2_recoded
BUT is it necessary? Cronbach’s alpha is already very high
and the increase is only marginal!
Summated scales
Reliability analysis
Rule of thumb: Cronbach’s alpha > 0.7
• α ≥ 0.9: excellent
• 0.8 ≤ α < 0.9: very good
• 0.7 ≤ α < 0.8: good
• 0.6 ≤ α < 0.7: acceptable
• 0.5 ≤ α < 0.6: poor
• α < 0.5: unacceptable
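The rule of thumb can be written as a small lookup, shown here as a Python sketch. The `alpha_label` helper is hypothetical; the cut-offs are the ones listed above.

```python
def alpha_label(alpha):
    """Translate Cronbach's alpha into the verbal label from the rule of thumb."""
    if alpha >= 0.9:
        return "excellent"
    if alpha >= 0.8:
        return "very good"
    if alpha >= 0.7:
        return "good"
    if alpha >= 0.6:
        return "acceptable"
    if alpha >= 0.5:
        return "poor"
    return "unacceptable"


for value in (0.95, 0.72, 0.45):
    print(value, alpha_label(value))
```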
Warning!
• Items should logically match according to interpretation (garbage in, garbage out) → first fit items together based on their meaning
• If deleting an item makes only a marginal difference to α, opt for more items
• Min. 3 items
• Max. 10 items (α increases as the number of items increases)
Summated scales
Reliability analysis
What if only 2 items?
No Cronbach’s alpha
BUT Pearson correlation
Analyze > Correlate > Bivariate
Summated scales
Reliability analysis
OUTPUT
= correlation between the two variables
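The Pearson correlation between two items can be sketched in Python as follows. The `pearson_r` helper and the scores are hypothetical; the formula is the standard one (covariance divided by the product of the standard deviations).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y need the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical answers of 6 respondents to two items of the same scale.
item_1 = [1, 2, 4, 5, 6, 7]
item_2 = [2, 2, 3, 5, 7, 6]
print(f"r = {pearson_r(item_1, item_2):.2f}")
```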
Summated scales
How to report reliability analysis?
Report about Cronbach’s alpha by using the symbol α when
you write about your measures:
« Candy preference was measured with four items: “(1) When I watch
television in the evening, I eat candy on a regular basis; (2) If I'm hungry
between meals, I will eat fruit more often than candy; (3) I always like to
add extra sugar to my dessert; (4) When I take a snack, I prefer the
sweetest one.” The second item was reverse coded. After reversing the
second item, the candy preference scale had a high reliability (α = .95). »
When reporting statistics that cannot exceed 1 (e.g., α, r), always drop the 0 before the decimal point!!!
Same for Pearson correlation, but use the symbol r (r = .95)
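If you format such statistics with a script, the leading zero can be dropped as in this small Python sketch. The `format_stat` helper is hypothetical.

```python
def format_stat(value, decimals=2):
    """Format a statistic such as alpha or r without the leading zero."""
    text = f"{value:.{decimals}f}"
    if text.startswith("0."):
        return text[1:]        # "0.95" becomes ".95"
    if text.startswith("-0."):
        return "-" + text[2:]  # "-0.32" becomes "-.32"
    return text


print(f"α = {format_stat(0.95)}")
print(f"r = {format_stat(-0.321)}")
```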
Exercise on Cronbach’s Alpha
Open dataset ‘[Link]’
Do an appropriate reliability analysis on the items
Summated scales
Last stage: Creating the new, summated scale
3 requirements
1. All questions need to be measured on the same scale
(e.g., Likert scale from 1 to 5)
2. All questions need to be scaled in the same direction
3. The new variable should contain only variables that
measure the same construct (here: candy preference)
If those 3 requirements are met:
Create a new variable, which is the MEAN of the scores on
the different questions
1. Go in ‘Transform’ > ‘Compute variable’
2. Give a name to your new variable (« MEAN… » )
3. Compute the mean of all the scores on the different questions
Summated scales
Last stage: Creating the new, summated scale
MEAN_CandyPreference = (CandyPref1 + CandyPref3 + CandyPref4 + CandyPref2_recoded) / 4
The number 4 of course depends on the number of items you sum up
Again, you can check the newly created variable in the data view!
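The same computation as a Python sketch, useful for checking a few rows by hand. The `scale_mean` helper and the scores are hypothetical; a respondent with a missing item (here `None`) gets no scale score, which is one possible way to handle missing items.

```python
def scale_mean(*item_scores):
    """Mean of the item scores of one respondent; None if any item is missing."""
    if any(score is None for score in item_scores):
        return None
    return sum(item_scores) / len(item_scores)


# Order: CandyPref1, CandyPref3, CandyPref4, CandyPref2_recoded (hypothetical scores)
print(scale_mean(6, 7, 5, 6))
print(scale_mean(6, None, 5, 6))
```

Dividing by `len(item_scores)` keeps the divisor in line with the number of items automatically.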
TRANSFORMING METRIC
VARIABLES
Transforming metric variable into categorical variable
File: [Link]
• Course website > Session 3
• Save on computer
IMPORTANT QUESTION
How many categories do we want?
Two groups (= median split):
People are split into two equal categories based on the median:
1) <(=) the median
2) >(=) the median
Three groups:
People are split into three equal categories based on % of people in each group:
In each category, +/- 33% of the total sample should be present
Transforming metric variable
into 2 groups variable
Step 1: Ask for frequencies + median
Analyze > Descriptive Statistics > Frequencies
Transforming metric variable
into 2 groups variable
Group 1
Group 2
Look for the BEST split point to divide the sample in half (= split after the value whose cumulative percent is closest to 50)
HERE: Include the median in the highest group:
• Group 1: < median
• Group 2: ≥ median
Transforming metric variable
into 2 groups variable
Step 2: Make a new variable: Transform > Recode into Different Variables
Transforming metric variable
into 2 groups variable
Check quickly whether the right coding is made!
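A median split with the median placed in the highest group can be sketched in Python as below. The `median_split` helper and the values are hypothetical; with many tied values the two groups will not be exactly equal in size, which is why the frequency table is checked first.

```python
from statistics import median


def median_split(values):
    """Code each value as group 1 (< median) or group 2 (>= median)."""
    cut = median(values)
    return [1 if value < cut else 2 for value in values]


ages = [18, 19, 20, 21, 22]  # hypothetical metric variable, median = 20
print(median_split(ages))
```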
Transforming metric variable into categorical variable
IMPORTANT QUESTION
How many categories do we want?
Two categories (= median split):
People are split into two equal categories based on the median:
1) <(=) the median
2) >(=) the median
Three categories:
People are split into three equal categories based on % of people in each group:
In each category, +/- 33% of the total sample should be present
Transforming metric variable
into 3 groups variable
Step 1: Ask for frequencies
Analyze > Descriptive Statistics > Frequencies
Look at the cumulative percent:
Group 1: < ± 33%
Group 2: ± 33% < (…) < ± 66%
Group 3: > ± 66%
Transforming metric variable
into 3 groups variable
Step 2: Make a new variable: Transform > Recode into Different Variables
Transforming metric variable
into 3 groups variable
Check quickly whether the right coding is made! (in Data view)
Analyze > Descriptive Statistics > Frequencies (for the new variable)
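The three-group split based on the cumulative percent can be sketched in Python as follows. All helper names and the data are hypothetical. Each cut value is the value whose cumulative percent is closest to 33% or 66%, and the cut value itself belongs to the lower group. With very few distinct values the two cuts can coincide, in which case a three-way split is not sensible.

```python
from collections import Counter


def cumulative_percent(values):
    """Return [(value, cumulative percent)] sorted by value, as in a frequency table."""
    counts = Counter(values)
    total = len(values)
    running = 0
    table = []
    for value in sorted(counts):
        running += counts[value]
        table.append((value, 100 * running / total))
    return table


def best_cut(table, target):
    """Value whose cumulative percent is closest to the target percentage."""
    return min(table, key=lambda row: abs(row[1] - target))[0]


def three_groups(values):
    """Code each value as group 1, 2 or 3 using cuts near 33% and 66%."""
    table = cumulative_percent(values)
    cut_1 = best_cut(table, 100 / 3)
    cut_2 = best_cut(table, 200 / 3)
    return [1 if v <= cut_1 else 2 if v <= cut_2 else 3 for v in values]


scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6]  # hypothetical metric variable
groups = three_groups(scores)
print(groups)
print(Counter(groups))  # frequency check of the new variable
```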
Transforming metric variable into categorical variable
Exercise
Useful hint: the value that you mention is always included in that
particular category
CHECKLIST
Checklist Session 3
1) Missing values?
2) Outliers?
3) Scales are used ( = data that can be reduced)?
• Check requirements:
1) Same scale
2) Reversed items to be recoded?
3) Sufficiently reliable scale?
Cronbach’s alpha (more than 2 items)
Pearson correlation (for only 2 items)
• Make a summated scale (Transform > Compute > Mean)
4) Other variables that need to be created or transformed?
5) Do some cases need to be excluded for some analyses
(Data > Select cases)?
6) Do we need to transform a metric variable into a categorical
variable? If yes, how many categories do we need?
If you have not done so yet:
Clean up the dataset!!!
(cf. session 1 – Part II)
Prepare your dataset for further analyses
Go over checklist session 3
GROUP PROJECT