Advanced Data Analysis - Students
Data Analysis: Why?
• Capture variability (variance) – how the
scores vary across persons
• Parsimony – data reduction technique,
how to describe many data points in
simple numbers
• Discover meaning and relationships
• Explore potential biases in data (sampling)
• Test hypotheses
Where to begin:
• After data is collected, we begin a long
process of data entry & cleaning
• Data entry requires that a codebook be
developed for the statistical program you
plan to use, such as SPSS.
• Data codebooks allow you to give your
variables names, values, and labels.
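A codebook can be sketched as a simple lookup structure. This is a minimal illustration in Python rather than SPSS; the variable names, labels, and ranges are hypothetical and only show how names, values, and labels fit together.

```python
# Hypothetical codebook: variable name -> label plus value labels or a valid range.
codebook = {
    "age": {"label": "Age in years", "valid_range": (0, 120)},
    "gender": {"label": "Gender", "values": {1: "male", 0: "female"}},
    "smoker": {"label": "Current smoker", "values": {1: "yes", 0: "no"}},
}


def describe(variable, value):
    """Return a readable description of one coded value."""
    entry = codebook[variable]
    value_labels = entry.get("values", {})
    # Fall back to the raw value when the variable has no value labels.
    return f"{entry['label']}: {value_labels.get(value, value)}"


print(describe("gender", 1))  # Gender: male
print(describe("age", 42))    # Age in years: 42
```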
Data Entry & Cleaning
• Data entry is a BIG source of error in data
• Double data entry is one strategy
• Cleaning data means looking for values outside
the valid ranges, e.g. an age of 154 is probably a
typo.
• We examine frequencies, high scores, low
scores, outliers, etc.
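The range and frequency checks above might look like the following sketch. The ages and the plausible range of 0 to 120 are assumptions chosen for illustration.

```python
from collections import Counter

ages = [34, 29, 154, 41, 29, 38]  # hypothetical entered values; 154 is a likely typo
VALID_AGE = (0, 120)              # assumed plausible range for age


def out_of_range(values, low, high):
    """Return (position, value) pairs that fall outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


frequencies = Counter(ages)
print("low:", min(ages), "high:", max(ages))
print("frequencies:", dict(frequencies))
print("suspect entries:", out_of_range(ages, *VALID_AGE))
```

A flagged entry is only a candidate error; it still has to be checked against the original data form.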
Coding Variables
Capture data in its most continuous form possible.
Dichotomous Variables
Do not do this:
1 = Male
2 = Female
Do this!
1 = male
0 = female
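One commonly cited reason for 0/1 coding (offered here as an assumption about the intent of this slide) is that the mean of a 0/1 variable is directly interpretable as a proportion. A small sketch with hypothetical data:

```python
from statistics import mean

# 1 = male, 0 = female (hypothetical sample of eight people)
gender_01 = [1, 0, 0, 1, 1, 0, 0, 0]
# The same eight people coded 1 = male, 2 = female
gender_12 = [1, 2, 2, 1, 1, 2, 2, 2]

print(mean(gender_01))  # 0.375, i.e. 37.5% of the sample is male
print(mean(gender_12))  # 1.625, which has no direct interpretation
```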
Dummy Coding
Ethnicity
1 = Black; 2 = White; 3 = Hispanic
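Dummy coding turns one categorical variable into several 0/1 indicators, leaving one category out as the reference. The sketch below uses the three codes above; the choice of White as the reference category and the indicator names are arbitrary assumptions made for illustration.

```python
ETHNICITY_LABELS = {1: "Black", 2: "White", 3: "Hispanic"}


def dummy_code(code, reference=2):
    """Turn one ethnicity code into 0/1 indicators, omitting the reference category."""
    return {
        f"is_{label.lower()}": int(code == value)
        for value, label in ETHNICITY_LABELS.items()
        if value != reference
    }


for code in (1, 2, 3):
    print(ETHNICITY_LABELS[code], dummy_code(code))
```

With three categories, two indicators are enough: a person in the reference category scores 0 on both.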
Missing Data
• SPSS assigns a dot “.” to missing data
• SPSS often gives you a choice of
pairwise or listwise deletion for missing
values.
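The difference between listwise and pairwise deletion can be sketched outside SPSS. In this illustration `None` stands in for a missing value, and the rows are hypothetical.

```python
# None stands in for a missing value (SPSS would show "."); data are hypothetical.
rows = [
    {"age": 34, "weight": 70, "score": 12},
    {"age": 29, "weight": None, "score": 15},
    {"age": None, "weight": 82, "score": 9},
    {"age": 41, "weight": 77, "score": 11},
]


def listwise(rows, variables):
    """Keep only cases that are complete on every variable in the analysis."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def pairwise(rows, x, y):
    """Keep cases that are complete on the two variables being paired."""
    return [r for r in rows if r[x] is not None and r[y] is not None]


print(len(listwise(rows, ["age", "weight", "score"])))  # 2 cases
print(len(pairwise(rows, "age", "score")))              # 3 cases
print(len(pairwise(rows, "weight", "score")))           # 3 cases
```

Listwise deletion keeps the sample the same for every statistic but discards more cases; pairwise deletion keeps more cases but each statistic may rest on a different subset.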
Measures:
• Central Tendency
• Relationships
• Effects
Measures of Central Tendency and Variability
• Mean – arithmetic average score
• Standard deviation (SD) – how widely the scores
spread around the mean
• Range – distance between the high and low scores.
Formulas
Mean: $m = \frac{\sum_{i=1}^{n} X_i}{n}$
SD: $SD = \sqrt{\frac{\sum_{i=1}^{n} (m - X_i)^2}{n - 1}}$
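The two formulas can be checked by hand on a small set of scores. The scores below are hypothetical; the calculation follows the mean and the n − 1 standard deviation directly, and Python's `statistics.stdev` uses the same n − 1 denominator.

```python
from math import sqrt
from statistics import stdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

n = len(scores)
m = sum(scores) / n                                      # mean
sd = sqrt(sum((m - x) ** 2 for x in scores) / (n - 1))   # sample SD

print(m)             # 5.0
print(round(sd, 3))  # 2.138
```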
Measures of Central Tendency
• Mean – arithmetic average
• Median – score which divides the
distribution in half (50% above and 50%
below)
• Mode – the most frequently occurring
value
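A short sketch shows how the three measures can disagree when one score is extreme. The values are hypothetical.

```python
from statistics import mean, median, mode

incomes = [20, 22, 22, 25, 30, 31, 200]  # hypothetical, with one extreme value

print(mean(incomes))    # 50
print(median(incomes))  # 25
print(mode(incomes))    # 22
```

The single extreme value pulls the mean well above the median and the mode, which is why the median is often reported for skewed data.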
Normal Curve: very robust!
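[Figure: normal curve with the horizontal axis marked -2, -1, M, +1, +2; about 34% of scores fall between M and 1 SD on each side, and about 2.5% fall in each tail beyond 2 SD.]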
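The areas under the normal curve can be computed rather than read from a chart. This sketch uses the standard normal distribution from Python's standard library; the 2.5% figure usually quoted for each tail is exact at about 1.96 SD, so the value at exactly 2 SD is slightly smaller.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

between_mean_and_1_sd = z.cdf(1) - z.cdf(0)  # area between M and +1 SD
beyond_2_sd_one_tail = 1 - z.cdf(2)          # area above +2 SD

print(round(between_mean_and_1_sd, 3))  # 0.341
print(round(beyond_2_sd_one_tail, 3))   # 0.023
```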
Normal Curves
Normal Curve
(Mean=Median=Mode)
[Figure: symmetric frequency curve with 50% of scores on each side of the center, where the Mean, Median, and Mode coincide.]
Non-Normal Curves
[Figure: two non-normal curves, each plotted on an X-Axis and a Y-Axis.]
Scaling
• Discrete (qualitative): Non-parametric tests (no assumptions required; Chi square)
– Nominal
– Ordinal
• Continuous (quantitative): Parametric tests (assume the normal curve, e.g. t and F tests)
– Interval
– Ratio
Degrees of Freedom
Estimates of statistical parameters can be based upon
different amounts of information or data. The number of
independent pieces of information that go into the estimate
of a parameter is called the degrees of freedom.
http://www.statsoft.com/textbook/sttable.html
Calculating t
• The formula for t is the same as that for z
except the standard deviation is estimated
– not known.
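Degrees of freedom example: the numbers in each column must sum to 20
5      100     Free to vary
10     200     Free to vary
7      300     Free to vary
X      X       Limited by the constraint that the sum of all the numbers must be 20
20     20      (sum)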
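The constraint in the example can be worked through in a few lines. This is a sketch of the idea only: once three of the four numbers are chosen and the total is fixed at 20, the last number X has no freedom left, so there are 4 − 1 = 3 degrees of freedom.

```python
def constrained_value(free_values, required_sum):
    """The last value is fixed once the others and the total are known."""
    return required_sum - sum(free_values)


for free in ([5, 10, 7], [100, 200, 300]):
    x = constrained_value(free, 20)
    degrees_of_freedom = len(free)  # n - 1: four numbers, one constraint
    print(free, "-> X =", x, "| df =", degrees_of_freedom)
```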
Relationships or Associations
Measures of Association: Correlations
• Range: -1 to 1
• Dimensions:
– Strength (0-1)
– Direction (+ or -)
• Definition: a change in X is accompanied by a
predictable change in Y; shared variation
or variance.
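Strength and direction can be seen in a small worked example. The sketch below computes a Pearson r by hand on hypothetical data; `pearson_r` is a helper written for this illustration, not a function from any statistics package.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


hours = [1, 2, 3, 4, 5]  # hypothetical X
score = [2, 4, 5, 4, 5]  # hypothetical Y

r = pearson_r(hours, score)
print(round(r, 3))       # 0.775: strength 0.775, direction positive
print(round(r ** 2, 3))  # 0.6: proportion of shared variance
```

Reversing the sign of every Y value gives the same strength with a negative direction.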
Correlations
• Sample specific (each sample is a subset
of the population)
• Unstable
• Dependent upon sample size
• Almost any correlation is statistically
significant with a very large sample size, yet
it may not be clinically significant.
• Expresses a relationship, not a causal statement
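The dependence on sample size can be illustrated with the usual t test for a correlation, t = r·sqrt(n − 2) / sqrt(1 − r²). In this sketch the correlation of 0.05 and the sample sizes are hypothetical, and 1.96 is used as a rough large-sample cutoff for a two-tailed test at the .05 level.

```python
from math import sqrt


def t_for_r(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


r = 0.05  # explains only 0.25% of the variance
for n in (30, 1000, 10000):
    t = t_for_r(r, n)
    # 1.96 is the large-sample two-tailed .05 cutoff, used here as a rough guide
    verdict = "significant" if abs(t) > 1.96 else "not significant"
    print(n, round(t, 2), verdict)
```

The same trivial correlation crosses the cutoff only at the largest sample size, which is the sense in which statistical significance need not imply clinical significance.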
Types of Correlations
• Pearson product moment r
– continuous by continuous variable
• Phi correlation
– discrete by discrete variable (Chi square)
• Rho rank order correlation
– discrete ranks by ranks
• Point-biserial
– discrete by continuous variable
• Eta Squared
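The rank-order (rho) correlation can be sketched as a Pearson r computed on ranks instead of raw scores. The helpers below are written for this illustration, the data are hypothetical, and the simple ranking function assumes there are no tied values.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) upward; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Rank-order correlation: Pearson r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 100]  # rises with x, but not in a straight line

print(round(pearson_r(x, y), 3))  # below 1 because the relation is not linear
print(spearman_rho(x, y))         # 1.0 because the rank orders agree perfectly
```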
Estimate the value of the correlation
[Figure: three scatterplots of Y-Axis against X-Axis, each labeled r = ?]
Types of Data Analyses
• Descriptive: X? Y? Z?
– Measures of central tendency
• Correlational: rx,y?
– Is there a relationship between X and Y?
– Measures of relationships (correlations)
• Causal: ΔX → ΔY?
– Does a change in X cause a change in Y?
– Testing group differences (t or F tests)
Testing Effects of Interventions
Testing Group Differences
• t tests
• F tests (Analysis of Variance or ANOVA)
Types of tests of group differences
• Between groups
– (unpaired)
• Within groups
– (paired or repeated measures; with two
measurement occasions it is also called test-retest)
– requires identified subjects so that each
person's scores can be matched
Tests of Significance
      3        4
1     O1   X   O2
2     O1       O2
Testing Group Differences
$F \text{ (or } t) = \frac{\text{Between Variance}}{\text{Within Variance}}$
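The ratio can be made concrete with a one-way analysis of variance computed by hand. This is a minimal sketch on hypothetical scores: "between variance" is taken as the between-groups mean square and "within variance" as the within-groups mean square.

```python
def f_ratio(groups):
    """One-way ANOVA F: between-groups mean square over within-groups mean square."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


control = [4, 5, 6]       # hypothetical control scores
experimental = [7, 8, 9]  # hypothetical experimental scores

print(round(f_ratio([control, experimental]), 2))  # 13.5
```

With exactly two groups, F equals the square of the unpaired t statistic, which is why the ratio is labeled "F (or t)".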
Examining Variance
[Figure: two score distributions with means Mc and Me, labeled with Between Variance and Within Variance.]
Examining Variance:
No difference between the means
[Figure: two score distributions with means Mc and Me.]
Examining Variance:
Big difference between means
[Figure: two score distributions with means Mc and Me.]
Examining Variance: Three groups
[Figure: three score distributions with means Mc, Me2, and Me1.]
• Standard Deviation
– how scores vary around a mean
Standard Error of the Mean:
Standard deviation of the sample means
[Figure: a population (n=1000) with a mean age of 36.5 years, from which samples (n=50) are drawn.]
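The idea can be simulated. In this sketch the population of 1000 ages is generated at random around 36.5 years (the spread of 10 years and the number of repeated samples are assumptions), many samples of 50 are drawn, and the standard deviation of their means is compared with the usual formula, population SD divided by the square root of n.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1000 ages centred near 36.5 years
population = [random.gauss(36.5, 10) for _ in range(1000)]

# Draw many samples of n = 50 and record each sample's mean
n = 50
sample_means = [mean(random.sample(population, n)) for _ in range(2000)]

empirical_se = stdev(sample_means)         # SD of the sample means
formula_se = pstdev(population) / sqrt(n)  # population SD / sqrt(n)

print(round(empirical_se, 2), round(formula_se, 2))
```

The two values should be close; the simulated one tends to be slightly smaller because the samples are drawn without replacement from a finite population.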
Conceptual:
$t = \frac{\text{Mean}_E - \text{Mean}_C}{\text{standard error of the mean}}$
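A worked version of the conceptual formula, under stated assumptions: the scores are hypothetical, and for two independent groups the denominator is taken to be the standard error of the difference between the means, computed from a pooled variance.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(experimental, control):
    """Two-group t using a pooled variance (assumes similar group variances)."""
    n_e, n_c = len(experimental), len(control)
    pooled_var = ((n_e - 1) * variance(experimental) +
                  (n_c - 1) * variance(control)) / (n_e + n_c - 2)
    se_difference = sqrt(pooled_var * (1 / n_e + 1 / n_c))
    return (mean(experimental) - mean(control)) / se_difference


t = unpaired_t([7, 8, 9], [4, 5, 6])
print(round(t, 3))  # 3.674
```

The result is compared with a critical t value for n_e + n_c − 2 degrees of freedom.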
Assumptions of ANOVA
• Normal distribution
• Independence of measures
• Continuous scaling
• Linear relationship between variables
3 X 2 ANOVA
    O1exp   X1   O2exp
R   O1exp   X2   O2exp
    O1con        O2con
Omnibus F Test
    O1exp   X1   O2exp
R   O1exp   X2   O2exp
    O1con        O2con
Tests of Significance
Two groups    Non-parametric          Parametric
Paired        Wilcoxon signed-rank    Paired t test
Unpaired      Mann-Whitney U          Unpaired t test
ANOVA
• ANOVA – analysis of variance
• ANCOVA – analysis of covariance,
includes covariate (Z) variable(s)
• MANOVA – multivariate analysis of
variance (more than one dependent
variable)
• MANCOVA – multivariate analysis of
covariance, includes covariate (Z) variable(s).
Multiple Regression Analysis
Correlational technique
– Unstable values
– Sample specific
– Reliability of measures very
important
– Requires large sample size
– Easy to get significance with large
sample size
Multiple Regression Analysis
Y = X1+X2+X3
Multiple Regression Questions:
• What is the contribution of age, gender, and
smoking to health status?
• How much of the variation in health status is
accounted for by variation in age, gender, and
smoking?
Multiple Regression Analysis
• Creates a correlation matrix.
• Selects first the independent variable most
highly correlated with the dependent variable.
• Extracts the variance in Y accounted for by that X
variable.
• Repeats the process (iterative) until no more of
the variance in Y is statistically explained by the
addition of another X variable.
Health Status = Age + Gender + Smoking
Correlation matrix (r, with r² as a percentage in parentheses):
                     Health Status (Y)   Age (X1)     Gender (X2)   Smoking (X3)
Health Status (Y)    1                   0.25 (6%)    0.04 (0%)     0.40 (16%)
Age (X1)                                 1            0.11 (1%)     0.05 (0%)
Gender (X2)                                           1             0.20 (4%)
Smoking (X3)                                                        1
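The percentages in the matrix are the squared correlations, rounded to whole percents. A short check using the r values from the table:

```python
# Correlations (r) taken from the matrix above
correlations = {
    ("Health Status", "Age"): 0.25,
    ("Health Status", "Gender"): 0.04,
    ("Health Status", "Smoking"): 0.40,
    ("Age", "Gender"): 0.11,
    ("Age", "Smoking"): 0.05,
    ("Gender", "Smoking"): 0.20,
}

# Squaring r gives the proportion of variance the two variables share
shared = {pair: f"{r ** 2:.0%}" for pair, r in correlations.items()}

for pair, percent in shared.items():
    print(pair, "r =", correlations[pair], "r squared =", percent)
```

Smoking, with r = 0.40, shares about 16% of its variance with health status, while gender, with r = 0.04, shares essentially none.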
Multiple Regression: Shared Variance
[Figure: Venn diagram in which Health Status overlaps with Smoking (40%), Age (25%), and Gender (4%).]
Multiple Regression
• Correlation results in an r
• Multiple regression results in an R²
• R² is the total proportion of the
variance in Y that is explained by the
predictors, removing the overlap among
the predictors.
Multiple Regression
Types
• Step-wise = the variable with the highest
correlation is entered first (the computer
makes the decision); used for theory
building
• Hierarchical = the researcher chooses the
order of entry (forced entry); used for
theory testing
Multiple Regression
• Allows one to cluster variables into Blocks.
• Block 1: Demographic variables
– (age, gender, SES)
• Block 2: Psychological Well-Being
– (depression, social support)
• Block 3: Severity of Illness
– (CD4 count, AIDS dx, viral load, OIs)
• Block 4: Treatment or control
– 1= treatment and 0 = control
Regression Analysis
• Multiple regression: one Y, multiple Xs.
• Logistic regression: Y is dichotomous
(e.g. disease or no disease); popular in
epidemiology; results are expressed as odds
or risk ratios, not explained variance.
• Canonical variate analysis: multiple Y and
multiple X variables: Y1+Y2+Y3 = X1+X2+X3
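The odds ratio and risk ratio mentioned for a dichotomous outcome can be illustrated from a simple two-by-two table. This sketch does not fit a logistic regression; it only shows what the two ratios mean, using hypothetical counts.

```python
# Hypothetical counts of disease by smoking status
smokers = {"disease": 30, "no_disease": 70}
non_smokers = {"disease": 10, "no_disease": 90}


def odds(group):
    """Odds of disease: cases divided by non-cases."""
    return group["disease"] / group["no_disease"]


def risk(group):
    """Risk of disease: cases divided by everyone in the group."""
    return group["disease"] / (group["disease"] + group["no_disease"])


odds_ratio = odds(smokers) / odds(non_smokers)
risk_ratio = risk(smokers) / risk(non_smokers)

print(round(odds_ratio, 2))  # 3.86
print(round(risk_ratio, 2))  # 3.0
```

The two ratios are close only when the outcome is rare; here the disease is common enough among smokers that the odds ratio is noticeably larger than the risk ratio.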
Multiple Dependent & Independent
Path Analysis Modeling
Relationships are based upon the literature review and then
potentially explored, discovered, tested, or validated in a study.
[Figure: path diagram linking Age, Gender, Severity of illness, Cognitive Ability, Social Support, Adherence to diet, and Diabetic Control.]
Structural Equation Modeling
[Figure: structural equation diagram in which Muscle ache and Fatigue, each measured at Month 0, Month 1, Month 3, and Month 6, each have an Intercept and a Slope.]
Factor Analysis
• Exploration of instrument construct validity
• Correlational technique
• Requires only one administration of an
instrument
• Data reduction technique
• A statistical procedure that requires artistic
skills
Conceptual Types of Factor Analysis
Factor Analysis
• Principal Components –
(principal factor or principal axes)
Correlation Matrix of Scale Items:
Which items are related?
Factor Analysis:
An iterative process
Factor extraction
Factor Analysis
             Factor I   Factor II   Factor III   Communality
Item 1       0.80       0.20        -0.30        0.77
Item 2       0.75       0.30        0.01         0.65
Item 3       0.30       0.80        0.05         0.63
Item 4       0.25       0.75        0.20         0.67
Eigenvalue   2.10       2.05        0.56
% var        34%        30%         10%
Definitions:
• Communality: Square the item's loadings on
each factor and sum across factors for each ITEM
• Eigenvalue: Square the item loadings down
each factor's column and sum for each
FACTOR
• Labeling Factors: figments of the author's
imagination. Items 1 & 2 = Factor I; Items
3 & 4 = Factor II.
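The two definitions can be applied to the loadings in the factor table. This is a sketch of the arithmetic only. Two caveats: by this calculation Item 3 comes out near 0.73 rather than the 0.63 listed in the table, and the column sums for the four items shown are smaller than the listed eigenvalues, presumably (an assumption) because the full scale has more items than the four displayed.

```python
# Loadings on Factors I, II, III, copied from the factor table
loadings = {
    "Item 1": [0.80, 0.20, -0.30],
    "Item 2": [0.75, 0.30, 0.01],
    "Item 3": [0.30, 0.80, 0.05],
    "Item 4": [0.25, 0.75, 0.20],
}

# Communality: squared loadings summed across the factors, one value per item
communality = {item: sum(l ** 2 for l in row) for item, row in loadings.items()}

# Eigenvalue-style column sums: squared loadings summed down each factor,
# here over only the four items shown
column_sums = [sum(row[k] ** 2 for row in loadings.values()) for k in range(3)]

for item, h2 in communality.items():
    print(item, f"{h2:.4f}")
print([f"{s:.4f}" for s in column_sums])
```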
Factor Rotation
Factors are mathematically rotated depending
upon the perspective of the author.
• Orthogonal – right angles, low inter-factor
correlations, creates more independence of
factors, good for multiple regression analysis,
may not reflect the actual data well. (varimax)
• Oblique – different types, lets factors
correlate with each other to the degree they
actually do correlate, some like this and
believe it better reflects the actual data,
harder to use in multiple regression because
of the multicollinearity. (oblimax)
Summary: Data Analysis
• Measures of Central Tendency
• Measures of Relationships
• Testing Group Differences
• Correlational
• Multiple regression as a predictive
(causal) technique.
• Factor analysis as a scale
development, construct validity
technique