1.data Cleaning Screening

This document provides guidance on screening data before conducting analyses. It recommends checking for accuracy, missing data, outliers, and assessing assumptions in that order. Accuracy should be checked by examining frequencies, means, standard deviations and missing values for errors. Missing data should then be analyzed for amount and potential non-random patterns. Outliers can be identified using univariate or multivariate methods. Finally, assumptions like normality, linearity and homoscedasticity should be evaluated depending on the intended statistical tests. Methods like mean substitution are described for handling missing data within limits. The overall goal is to clean the data of errors before finalizing analyses.

Uploaded by

Sukhmani Sandhu

Data Screening

Data Screening
• So, I’ve got all this data…what now?
Why?
• Data screening – important to check for errors, assumptions, and
outliers.
• What’s the most important?
• Depends on the type of test because they have different assumptions.
The List – In Order
• Accuracy
• Missing Data
• Outliers
• It Depends:
• Correlations
• Normality
• Linearity
• Homogeneity
• Homoscedasticity
The List – In Order
• Why this order?
• Because if you fix something (accuracy)
• Or replace missing data
• Or take out outliers
• ALL THE REST OF THE ANALYSES CHANGE.
Accuracy
• Check for typos
• Frequencies – you can see if there are numbers that shouldn’t be in your data
set
• Check:
• Min
• Max
• Means
• SD
• Missing values
Open R file to check Accuracy
• Interpret the output
• Check for high and low values in minimum and maximum
• (You can also see the missing data).
• Are the standard deviations really high?
• Are the means strange looking?
• This output will also give you a lot of charts – great for examining Likert-scale data to see whether responses pile up at the ceiling or the floor.
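A minimal sketch of the accuracy checks listed above. The slides' own R file is not reproduced here, so this uses Python's standard library instead; the `responses` data, the 1-7 scale, and the `describe` helper are made up for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical 1-7 Likert responses. None marks a missing answer,
# and 77 is a deliberate typo for 7.
responses = [4, 5, 7, None, 3, 77, 6, 5, 2, None, 4, 5]

def describe(values):
    """Return basic accuracy-check statistics, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "sd": stdev(present),  # sample standard deviation
    }

summary = describe(responses)

# Frequencies show every distinct value that was entered.
frequencies = Counter(v for v in responses if v is not None)

# Values outside the legal 1-7 range are likely entry errors.
out_of_range = [v for v in responses if v is not None and not 1 <= v <= 7]

print(summary)
print(sorted(frequencies.items()))
print(out_of_range)
```

The typo shows up three ways: as an impossible maximum, as an inflated standard deviation, and as a stray entry in the frequency table.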
Missing Data
• With the output you already have you can see if you have missing
data in the variables.
• Go to the first summary box shown in the output.
• See the line that says missing?
• Check it out!
Missing Data
• Missing data is an important problem.
• First, ask yourself, “why is this data missing?”
• Because you forgot to enter it?
• Because there’s a typo?
• Because people skipped one question? Or the whole end of the scale?
Missing Data
• Two Types of Missing Data:
• MCAR – missing completely at random (you want this)
• MNAR – missing not at random (eek!)
• There are ways to test for the type, but usually you can see it
• Randomly missing data appears all across your dataset.
• If everyone missed question 7 – that’s not random.
Missing Data
• MCAR – probably caused by skipping a question or missing a trial.
• MNAR – may be the question that’s causing a problem.
• For instance, what if you surveyed campus about alcohol abuse? What does
it mean if everyone skips the same question?
Missing Data
• How much can I have?
• Depends on your sample size – in large datasets <5% is ok.
• Small samples = you may need to collect more data.
• Please note: there is a difference between “missing data” and “did not
finish the experiment”.
Missing Data
• How do I check if it’s going to be a big deal?
• Frequencies – you can see which variables have the missing data.
• Sample test – code people into two groups (missing data vs. no missing data), then test whether the two groups differ on your other variables.
• Regular analysis – run the analysis again with the people who have missing data dropped, and see if you get the same results as your regular analysis with them included.
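One way to sketch the "sample test" idea, in Python's standard library rather than the slides' R file. The `records` data and the choice of Welch's t statistic are assumptions for illustration; the slides do not say which test to use.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical records: (age, income). None = income not reported.
records = [(23, 31000), (35, None), (41, 52000), (29, 40000),
           (52, None), (38, 47000), (61, None), (27, 36000),
           (45, 55000), (58, None)]

# Code people into two groups by whether income is missing.
age_missing = [age for age, income in records if income is None]
age_present = [age for age, income in records if income is not None]

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

t_stat = welch_t(age_missing, age_present)

print("mean age, income missing:", mean(age_missing))
print("mean age, income present:", mean(age_present))
print("Welch t:", round(t_stat, 2))
```

A large difference between the groups suggests the missingness is related to another variable (here, older people skipping the income question), so it is probably not completely random.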
Missing Data
• Deleting people / variables
• You can exclude people “pairwise” or “listwise”
• Pairwise – only excludes people when they have missing values for that
analysis
• Listwise – excludes them for all analyses
• Variables – if it’s just an extraneous variable (like GPA) you can just
delete the variable
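The difference between listwise deletion, pairwise deletion, and dropping a variable can be sketched as follows. This is Python's standard library, not the slides' R file; the `people` rows and the `complete_on` helper are hypothetical.

```python
# Hypothetical rows. None = missing.
people = [
    {"id": 1, "anxiety": 3.0, "sleep": 7.5, "gpa": 3.2},
    {"id": 2, "anxiety": None, "sleep": 6.0, "gpa": 3.8},
    {"id": 3, "anxiety": 4.5, "sleep": None, "gpa": 2.9},
    {"id": 4, "anxiety": 2.0, "sleep": 8.0, "gpa": None},
]

def complete_on(rows, variables):
    """Keep only rows with no missing value on the given variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

# Listwise: a person missing anything is dropped from every analysis.
listwise = complete_on(people, ["anxiety", "sleep", "gpa"])

# Pairwise: a person is dropped only from analyses that use
# the variable they are missing.
pairwise_anxiety_sleep = complete_on(people, ["anxiety", "sleep"])

# Deleting an extraneous variable (gpa here) instead of deleting people.
without_gpa = [{k: v for k, v in r.items() if k != "gpa"} for r in people]

print([r["id"] for r in listwise])
print([r["id"] for r in pairwise_anxiety_sleep])
print([r["id"] for r in complete_on(without_gpa, ["anxiety", "sleep"])])
```

Person 4 is lost under listwise deletion but kept for the anxiety and sleep analysis under pairwise deletion, because only their GPA is missing.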
Missing Data
• What if you don’t want to delete people (e.g., they come from a special population, or you can’t recruit more participants)?
• Several estimation methods to “fill in” missing data
Missing Data
• Prior knowledge – use it if there is an obvious value for the missing data
• Such as the median income when people don’t list it
• Works best when you have been working in the field for a while
• And when there is only a small number of missing cases
Missing Data
• Mean substitution – fairly popular way to enter missing data
• Conservative – doesn’t change the mean values used to find significant differences
• Does change (shrink) the variance, which may cause significance tests to change when a lot of data is missing
• In R, this substitution uses the grand mean (the mean of all observed values on that variable)
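A small sketch of what mean substitution does to the mean and the variance. It is written with Python's standard library rather than the slides' R file, and the `scores` are made up.

```python
from statistics import mean, variance

# Hypothetical scores. None = missing.
scores = [10.0, 12.0, None, 14.0, None, 16.0, 18.0]

observed = [s for s in scores if s is not None]
grand_mean = mean(observed)

# Replace every missing value with the grand mean of the observed values.
filled = [grand_mean if s is None else s for s in scores]

print("mean before / after:", grand_mean, mean(filled))
print("variance before / after:", variance(observed), variance(filled))
```

The mean is unchanged, but the variance drops because every filled-in value sits exactly at the mean while the sample size goes up. That is why heavy mean substitution can shift significance tests.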
Missing Data
• DO NOT mean replace categorical variables
• You can’t be 1.5 gender.
• So, either leave them out OR pairwise eliminate them (aka eliminate only for
the analysis they are used in).
• Continuous variables – mean replace, linear trend, etc.
• Or leave them out.
Open R file to check Missing Values
• Interpret the output
• Check the percentage of missing values per row and per column
• Remove the missing values, or replace them with the mean?
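The per-row and per-column percentages can be computed along these lines. This is a Python standard-library sketch, not the R file the slide refers to; the `survey` grid, the column names, and the `pct_missing` helper are hypothetical.

```python
# Hypothetical survey: one row per person, one column per question.
# None = missing.
columns = ["q1", "q2", "q3", "q4"]
survey = [
    [5, 3, None, 4],
    [4, None, None, 2],
    [3, 4, 5, 1],
    [None, 2, 4, 5],
    [2, 5, 3, 4],
]

def pct_missing(values):
    """Percentage of entries in `values` that are missing."""
    return 100.0 * sum(v is None for v in values) / len(values)

# Per person: who skipped a lot of questions?
by_row = [pct_missing(row) for row in survey]

# Per question: is one question skipped far more than the others?
by_column = {name: pct_missing([row[i] for row in survey])
             for i, name in enumerate(columns)}

overall = pct_missing([v for row in survey for v in row])

print(by_row)
print(by_column)
print(overall)
```

A column with a much higher percentage than the rest points to a problem with that question rather than random skipping.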
Outliers
• Outlier – a case with an extreme value on one variable or on multiple variables
• Why?
• Data input error
• Missing values coded as “9999”
• Not a population you meant to sample
• From the intended population, but the distribution has long tails and very extreme values
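The "9999" case is worth handling before any outlier check. A minimal Python standard-library sketch, assuming 9999 is the only missing-value code in the data (the `reaction_ms` values are made up):

```python
from statistics import mean

MISSING_CODE = 9999  # assumed missing-value code for this dataset

# Hypothetical reaction times in milliseconds.
reaction_ms = [512, 9999, 478, 530, 9999, 495]

# Recode the sentinel to a real missing value.
cleaned = [None if v == MISSING_CODE else v for v in reaction_ms]
valid = [v for v in cleaned if v is not None]

print("mean with codes left in:", mean(reaction_ms))
print("mean after recoding:", mean(valid))
```

Left in place, the codes look like extreme outliers and badly distort the mean; once recoded they are simply missing data.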
Outliers
• Outliers – Two Types
• Univariate – for basic univariate statistics
• Use these when you have ONE DV or Y variable.
• Multivariate – for some univariate statistics and all multivariate
statistics
• Use these when you have multiple continuous variables or lots of DVs.
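For the univariate case, one common screening approach is to flag standardized scores beyond a cutoff. The slides up to this point do not name a method, so the z-score rule and the cutoff of 3 below are assumptions, as is the made-up `values` list. The sketch uses Python's standard library.

```python
from statistics import mean, stdev

Z_CUTOFF = 3.0  # assumed cutoff, not taken from the slides

# Hypothetical scores on ONE dependent variable.
values = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54,
          50, 49, 51, 52, 48, 50, 47, 53, 51, 120]

m = mean(values)
sd = stdev(values)  # sample standard deviation

# Standardize each score and flag the ones beyond the cutoff.
z_scores = [(v - m) / sd for v in values]
outliers = [v for v, z in zip(values, z_scores) if abs(z) > Z_CUTOFF]

print(outliers)
```

Note that in very small samples no score can reach a z of 3, so this rule only makes sense with a reasonable number of cases.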
