1. Data Cleaning and Screening
Data Screening
• So, I’ve got all this data…what now?
Why?
• Data screening – important to check for errors, assumptions, and
outliers.
• What’s the most important?
• Depends on the type of test because they have different assumptions.
The List – In Order
• Accuracy
• Missing Data
• Outliers
• It Depends:
• Correlations
• Normality
• Linearity
• Homogeneity
• Homoscedasticity
The List – In Order
• Why this order?
• Because if you fix something (accuracy)
• Or replace missing data
• Or take out outliers
• ALL THE REST OF THE ANALYSES CHANGE.
Accuracy
• Check for typos
• Frequencies – you can see if there are numbers that shouldn’t be in your data
set
• Check:
• Min
• Max
• Means
• SD
• Missing values
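The checks listed above can be sketched in base R. This is a minimal example, not the course's own R file: the data frame `survey`, its columns `q1` to `q3`, the 1-7 scale, and the helper `describe_vars` are all made up for illustration.

```r
# Hypothetical survey data: q1-q3 are meant to be 1-7 Likert items.
survey <- data.frame(
  q1 = c(4, 5, 3, 77, 6, 2),   # 77 is a deliberate typo, outside the 1-7 range
  q2 = c(1, 7, NA, 4, 5, 3),
  q3 = c(2, 2, 3, NA, NA, 4)
)

# One row per variable: min, max, mean, SD, and number of missing values.
describe_vars <- function(df) {
  data.frame(
    min     = sapply(df, min,  na.rm = TRUE),
    max     = sapply(df, max,  na.rm = TRUE),
    mean    = sapply(df, mean, na.rm = TRUE),
    sd      = sapply(df, sd,   na.rm = TRUE),
    missing = sapply(df, function(x) sum(is.na(x)))
  )
}

accuracy_check <- describe_vars(survey)
print(round(accuracy_check, 2))

# Flag variables whose observed range leaves the valid 1-7 scale.
out_of_range <- rownames(accuracy_check)[
  accuracy_check$min < 1 | accuracy_check$max > 7
]
print(out_of_range)
```

A maximum far above the scale, together with an inflated mean and SD, is the usual signature of a data entry typo.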
Open R file to check Accuracy
• Interpret the output
• Check for high and low values in minimum and maximum
• (You can also see the missing data).
• Are the standard deviations really high?
• Are the means strange looking?
• This output will also give you a zillion charts – great for examining Likert scale
data to see if you have all ceiling or floor effects.
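Frequencies can also be checked without charts. The sketch below assumes a hypothetical 1-5 Likert item called `item`; the valid scale points and the idea of reporting the share of top-of-scale answers are illustrative choices, not part of the original R file.

```r
# Hypothetical 1-5 Likert item; the 55 is a deliberate typo.
item <- c(5, 5, 4, 5, 5, 55, 5, 4, 5, NA)

# Frequency table, with missing values shown as their own category.
freq <- table(item, useNA = "ifany")
print(freq)

# Values that are not valid scale points.
valid_points <- 1:5
observed <- unique(item[!is.na(item)])
bad_values <- setdiff(observed, valid_points)
print(bad_values)

# Share of valid responses at the top of the scale (possible ceiling effect).
valid <- item[item %in% valid_points]
ceiling_share <- mean(valid == max(valid_points))
print(ceiling_share)
```

If most valid responses sit on the highest (or lowest) scale point, the item has little room to vary, which is what a ceiling (or floor) effect looks like in a frequency table.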
Missing Data
• With the output you already have you can see if you have missing
data in the variables.
• Go to the first summary box shown in the output.
• See the line that says missing?
• Check it out!
Missing Data
• Missing data is an important problem.
• First, ask yourself, “why is this data missing?”
• Because you forgot to enter it?
• Because there’s a typo?
• Because people skipped one question? Or the whole end of the scale?
Missing Data
• Two Types of Missing Data:
• MCAR – missing completely at random (you want this)
• MNAR – missing not at random (eek!)
• There are ways to test for the type, but usually you can see it
• Randomly missing data appears all across your dataset.
• If everyone missed question 7 – that’s not random.
Missing Data
• MCAR – probably caused by skipping a question or missing a trial.
• MNAR – may be the question that’s causing a problem.
• For instance, what if you surveyed campus about alcohol abuse? What does
it mean if everyone skips the same question?
Missing Data
• How much can I have?
• Depends on your sample size – in large datasets, less than 5% missing is usually OK.
• Small samples = you may need to collect more data.
• Please note: there is a difference between “missing data” and “did not
finish the experiment”.
Missing Data
• How do I check if it’s going to be a big deal?
• Frequencies – you can see which variables have the missing data.
• Sample test – code people into two groups (missing data vs. no missing data), then test the two groups against each other on your other variables.
• Regular analysis – run your analysis once with everyone and once after dropping the people with missing data, and see if you get the same results.
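The sample test can be sketched as follows. The variables `age` and `score` are hypothetical, and a Welch t-test via `t.test` is just one reasonable choice of test for comparing the two groups.

```r
# Hypothetical data: some participants are missing their score.
dat <- data.frame(
  age   = c(19, 20, 21, 22, 20, 19, 23, 21, 20, 22, 18, 24),
  score = c(50, NA, 47, 55, NA, 61, 52, NA, 49, 58, NA, 53)
)

# Code people into two groups: missing data vs. complete data.
dat$has_missing <- factor(
  !complete.cases(dat[, c("age", "score")]),
  levels = c(FALSE, TRUE),
  labels = c("complete", "missing")
)
print(table(dat$has_missing))

# Do the two groups differ on a variable everyone answered?
group_test <- t.test(age ~ has_missing, data = dat)
print(group_test)
```

A clear difference between the groups suggests the data are not missing at random, so simply dropping those people could bias the results.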
Missing Data
• Deleting people / variables
• You can exclude people “pairwise” or “listwise”
• Pairwise – excludes people from an analysis only when they are missing a value that analysis uses
• Listwise – excludes them from all analyses
• Variables – if it’s just an extraneous variable (like GPA) you can just
delete the variable
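The difference between the two deletion strategies shows up in how many cases each analysis uses. This is a small sketch with made-up variables `x`, `y`, and `z`; it relies on the `use` argument of base R's `cor`.

```r
# Hypothetical data with scattered missing values.
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 4, NA, 8, 10, 12),
  z = c(6, NA, 4, 3, 2, 1)
)

# Listwise: drop every row that has any missing value.
listwise <- na.omit(dat)
print(nrow(listwise))

# Correlations under each strategy.
cor_listwise <- cor(dat, use = "complete.obs")
cor_pairwise <- cor(dat, use = "pairwise.complete.obs")

# Number of cases each pair of variables actually uses under pairwise deletion.
observed <- (!is.na(as.matrix(dat))) * 1
n_pairwise <- crossprod(observed)
print(n_pairwise)
```

Pairwise deletion keeps more data, but each statistic can end up based on a different set of people, which is worth reporting.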
Missing Data
• What if you don’t want to delete people (e.g., they come from a special population, or you can’t recruit more participants)?
• Several estimation methods to “fill in” missing data
Missing Data
• Prior knowledge – use when there is an obvious value for the missing data
• Such as the median income when people don’t list it
• Works best when you have been working in the field for a while
• And when there is only a small number of missing cases
Missing Data
• Mean substitution – fairly popular way to fill in missing data
• Conservative – doesn’t change the mean values used to find significant differences
• Does shrink the variance, which may change significance tests when a lot of data is missing
• In R, this substitution is usually done with the grand mean (the mean of all observed values on that variable)
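A quick way to see both properties of mean substitution is to compare the mean and variance before and after. The `income` values below are made up for illustration.

```r
# Hypothetical incomes (in thousands) with two missing values.
income <- c(30, 45, NA, 52, 38, NA, 61, 44)

# Grand mean of the observed values.
observed_mean <- mean(income, na.rm = TRUE)

# Replace each missing value with the grand mean.
imputed <- ifelse(is.na(income), observed_mean, income)

# The mean is unchanged by the substitution...
print(c(before = observed_mean, after = mean(imputed)))

# ...but the variance shrinks, because the filled-in cases sit exactly on the mean.
print(c(before = var(income, na.rm = TRUE), after = var(imputed)))
```

The more values are filled in this way, the more the variance is understated, which is why heavy mean substitution can shift significance tests.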
Missing Data
• DO NOT mean replace categorical variables
• You can’t be 1.5 gender.
• So, either leave them out OR pairwise eliminate them (i.e., exclude those cases only from the analyses that use the variable).
• Continuous variables – mean replace, linear trend, etc.
• Or leave them out.
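One way to respect this rule in code is to mean replace only the numeric columns and leave categorical ones untouched. The helper `mean_replace_numeric` and the columns `gender`, `gpa`, and `hours` are hypothetical names for this sketch.

```r
# Hypothetical data: one categorical and two continuous variables.
dat <- data.frame(
  gender = factor(c("f", "m", NA, "f", "m")),
  gpa    = c(3.2, NA, 3.8, 2.9, 3.5),
  hours  = c(10, 12, 8, NA, 14)
)

# Mean replace numeric columns only; factors keep their missing values.
mean_replace_numeric <- function(df) {
  is_num <- sapply(df, is.numeric)
  df[is_num] <- lapply(df[is_num], function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })
  df
}

filled <- mean_replace_numeric(dat)
print(filled)
```

The missing `gender` value stays `NA`, so that person can still be excluded pairwise from any analysis that uses gender.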
Open R file to check Missing Values
• Interpret the output
• Check the percentage of missing values per row and per column
• Decide: remove the missing values or replace them with the mean?
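The percentages per row and column can be computed directly from `is.na`. This is a sketch with a made-up data frame; applying the 5% guideline to each row as a cutoff is an assumption for illustration, not a rule from the R file.

```r
# Hypothetical questionnaire data.
dat <- data.frame(
  q1 = c(1, NA, 3, 4),
  q2 = c(NA, NA, 2, 5),
  q3 = c(2, NA, 4, 1),
  q4 = c(3, 1, 5, 2)
)

# Percentage of missing values per column (variable) and per row (person).
pct_missing_col <- colMeans(is.na(dat)) * 100
pct_missing_row <- rowMeans(is.na(dat)) * 100
print(pct_missing_col)
print(pct_missing_row)

# Keep only people at or under the assumed 5% cutoff.
cutoff <- 5
kept <- dat[pct_missing_row <= cutoff, ]
print(kept)
```

With so few items, a single skipped question already exceeds 5% for that person, so the cutoff should be chosen with the number of variables in mind.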
Outliers
• Outlier – a case with an extreme value on one variable or on multiple variables
• Why?
• Data input error
• Missing values coded as “9999”
• Not a population you meant to sample
• From the intended population, but the distribution has really long tails and very extreme values
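Placeholder codes such as 9999 look like extreme outliers until they are recoded as missing. A minimal sketch, assuming a hypothetical reaction time variable `rt` in milliseconds:

```r
# Hypothetical reaction times (ms); 9999 was entered to mean "missing".
rt <- c(512, 478, 9999, 530, 495, 9999, 501)

# Left as-is, the placeholder inflates the mean and shows up as an outlier.
print(mean(rt))

# Recode the placeholder to NA so R treats it as missing.
rt_clean <- rt
rt_clean[rt_clean == 9999] <- NA

print(sum(is.na(rt_clean)))
print(mean(rt_clean, na.rm = TRUE))
```

Recoding first means these cases are handled by the missing data steps rather than being mistaken for real extreme scores.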
Outliers
• Outliers – Two Types
• Univariate – for basic univariate statistics
• Use these when you have ONE DV or Y variable.
• Multivariate – for some univariate statistics and all multivariate
statistics
• Use these when you have multiple continuous variables or lots of DVs.
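The slides do not name a detection method at this point, so the sketch below uses two common choices as assumptions: z-scores for the univariate case and Mahalanobis distance for the multivariate case. The data, the |z| > 3 cutoff, and the chi-square cutoff at p < .001 are all illustrative.

```r
# Hypothetical data: row 16 has an extreme value on x.
dat <- data.frame(
  x = c(10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 11, 10, 13, 12, 40),
  y = c(5, 6, 5, 7, 6, 6, 5, 6, 7, 5, 6, 6, 5, 7, 6, 6)
)

# Univariate: standardize one variable and flag |z| > 3 (assumed cutoff).
z_x <- (dat$x - mean(dat$x)) / sd(dat$x)
univariate_flags <- which(abs(z_x) > 3)
print(univariate_flags)

# Multivariate: squared Mahalanobis distance of each row from the centroid.
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))

# Assumed cutoff: chi-square critical value at p < .001, df = number of variables.
cutoff <- qchisq(0.999, df = ncol(dat))
multivariate_flags <- which(d2 > cutoff)
print(round(d2, 2))
print(cutoff)
print(multivariate_flags)
```

In small samples a single extreme case also inflates the SD and covariance it is judged against, so flagged cases are a prompt to look at the data, not an automatic deletion rule.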