
Advance Data Analysis - Students

This document discusses data analysis and statistical concepts. It covers topics such as data entry and cleaning, coding variables, measures of central tendency, the normal curve, scaling variables, degrees of freedom, t-tests, confidence intervals, assumptions of parametric tests, and measures of association like correlations. The goal of data analysis is to capture variability in data, reduce data, discover relationships, and test hypotheses.

Uploaded by

Adhy popz
Copyright
© All Rights Reserved

Data Analysis

1
Data Analysis: Why?
• Capture variability (variance) – how the
scores vary across persons
• Parsimony – data reduction technique,
how to describe many data points in
simple numbers
• Discover meaning and relationships
• Explore potential biases in data (sampling)
• Test hypotheses

2
Where to begin:
• After data is collected, we begin a long
process of data entry & cleaning
• Data entry requires a code book be
developed for the statistical program you
plan to use, such as SPSS.
• Data codebooks allow you to give your
variables names, values, and labels.

3
Data Entry & Cleaning
• Data entry is a BIG source of error in data
• Double data entry is one strategy
• Cleaning data means looking for values outside
the valid range, e.g. an age of 154 is probably a
typo.
• We examine frequencies, high score, low
scores, outliers, etc.
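
A range check like the one described above can be sketched in a few lines of Python. This is only an illustration: the function name `find_out_of_range`, the sample ages, and the limits 18 and 100 are assumptions, not part of the original slides.

```python
def find_out_of_range(values, low, high):
    """Return (index, value) pairs for entries outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]

# Hypothetical data-entry results; 154 is probably a typo.
ages = [35, 42, 154, 29, 7]
suspects = find_out_of_range(ages, low=18, high=100)
print(suspects)  # [(2, 154), (4, 7)]
```

Flagged entries would then be checked against the original forms rather than deleted automatically.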

4
Coding Variables
Capture data in its most continuous form possible.

Age: 35 years - get the actual value


vs.
Check one: _<25
_ 25-35
_ 36-45
_ >45

5
Dichotomous Variables
Do not do this:
1 = Male
2= Female

Do this!
1 = male
0 = female

Why? With 0/1 coding, adding up the values counts the
group coded 1 (here, males), and the mean is its proportion.

6
Dummy Coding
Ethnicity
1 = Black; 2 = White; 3 = Hispanic

N − 1 dummy variables: 3 − 1 = 2 (Hispanic, coded 0 on both, is the reference group)

Black: 1 = Black; 0 = White and Hispanic
White: 1 = White; 0 = Black and Hispanic
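
The N − 1 rule can be illustrated with a small Python sketch. The helper `dummy_code` and the five example cases are hypothetical; the choice of Hispanic as the reference group follows the coding shown above.

```python
def dummy_code(values, reference):
    """Create one 0/1 indicator per category, skipping the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical sample of five respondents.
ethnicity = ["Black", "White", "Hispanic", "White", "Black"]
dummies = dummy_code(ethnicity, reference="Hispanic")

print(dummies["Black"])   # [1, 0, 0, 0, 1]
print(dummies["White"])   # [0, 1, 0, 1, 0]

# With 0/1 coding, the sum is a count and the mean is a proportion.
n_black = sum(dummies["Black"])
prop_black = n_black / len(ethnicity)
print(n_black, prop_black)  # 2 0.4
```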

7
Missing Data
• SPSS assigns a dot “.” to missing data
• SPSS often gives you a choice of
pairwise or listwise deletion for missing
values.

Mean Substitution: give the variable the average score for the group, e.g. age;
this adds no variation to the data set.
8
Missing Data
Pairwise: a case is dropped only from the particular correlation that
uses its missing value; best choice to conserve power.

Listwise: removes the entire case (subject) from all analyses; required in
repeated measures designs.
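
The difference between the two strategies can be sketched in plain Python, independent of SPSS. The data and the helper names `listwise` and `pairwise` are made up for illustration; missing values are represented as `None`.

```python
# Hypothetical cases; None marks a missing value.
rows = [
    {"age": 35, "weight": 70, "score": 12},
    {"age": None, "weight": 82, "score": 15},
    {"age": 41, "weight": None, "score": 9},
    {"age": 29, "weight": 64, "score": 11},
]

def listwise(rows, variables):
    """Keep only cases with no missing value on any listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise(rows, x, y):
    """Keep cases that have both x and y, ignoring all other variables."""
    return [(r[x], r[y]) for r in rows if r[x] is not None and r[y] is not None]

complete = listwise(rows, ["age", "weight", "score"])
age_score = pairwise(rows, "age", "score")
print(len(complete), len(age_score))  # 2 3
```

Pairwise deletion keeps three cases for the age-by-score correlation, while listwise deletion keeps only the two fully complete cases.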

9
Measures:

• Central Tendency

• Relationships

• Effects

10
Measures of Central Tendency
• Mean – arithmetic average score
• Standard deviation (SD) – how the scores
spread around the mean (a measure of variability)
• Range – high and low score (also a measure of variability).

(Example: M = 36.4 years, SD = 4.2, Range: 22-45)

11
Formulas
$$M = \frac{\sum_{i=1}^{n} X_i}{n}$$

$$SD = \sqrt{\frac{\sum_{i=1}^{n} (M - X_i)^2}{n - 1}}$$
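
As a rough check on the two formulas, the sketch below computes the mean and the sample SD (with n − 1 in the denominator) by hand and compares the SD with Python's `statistics.stdev`. The five ages are invented for the example.

```python
import math
import statistics

def mean(xs):
    """Arithmetic average: sum of scores divided by n."""
    return sum(xs) / len(xs)

def sample_sd(xs):
    """Square root of the summed squared deviations divided by n - 1."""
    m = mean(xs)
    return math.sqrt(sum((m - x) ** 2 for x in xs) / (len(xs) - 1))

ages = [30, 34, 36, 38, 42]  # hypothetical ages
print(mean(ages))                 # 36.0
print(round(sample_sd(ages), 4))  # 4.4721
print(math.isclose(sample_sd(ages), statistics.stdev(ages)))  # True
```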

12
Measures of Central Tendency
• Mean – arithmetic average
• Median – score which divides the
distribution in half (50% above and 50%
below)
• Mode – the most frequently occurring
value

When does the mean=median=mode?

13
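Normal Curve: very robust!

[Figure: normal curve. About 34% of scores fall between the mean (M) and 1 SD on each side; about 2.5% fall beyond 2 SD in each tail. X-axis: -2, -1, M, +1, +2.]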

14
Normal Curves

15
Normal Curve
(Mean=Median=Mode)

[Figure: symmetric frequency curve with 50% of scores on each side of the center, where the mean, median, and mode coincide.]
16
Non-Normal Curves

[Figure: two non-normal curves, each with an X-axis and a Y-axis.]

17
Scaling
• Discrete (qualitative)
– Nominal
– Ordinal
→ Non-parametric tests (no assumptions required; Chi square)

• Continuous (quantitative)
– Interval
– Ratio
→ Parametric tests (assume the normal curve, e.g. t and F tests)

18
Degrees of Freedom
Estimates of statistical parameters can be based upon
different amounts of information or data. The number of
independent pieces of information that go into the estimate
of a parameter is called the degrees of freedom.

In general, the degrees of freedom of an estimate of a
parameter are equal to the number of
independent scores that go into the estimate minus the
number of parameters used as intermediate steps in the
estimation of the parameter itself (most of the time the
sample variance has N − 1 degrees of freedom, since it is
computed from N random scores minus the only 1
parameter estimated as an intermediate step, which is the
sample mean).

19
Student’s t-Test
• William Sealy Gosset published under the
name “Student” but was a chemist and
executive at the Guinness brewery until his death in 1937.
What is the t Distribution?
• The t distribution is the shape of the
sampling distribution of the mean when the population SD is
estimated from the sample; it matters most when n < 30.
• The shape changes slightly depending on
the number of subjects in the sample.
• The degrees of freedom (df) tell you which
t distribution should be used to test your
hypothesis:
– df = n - 1
Comparison to Normal
Distribution
• Both are symmetrical, unimodal, and bell-
shaped.
• When df are infinite, the t distribution is the
normal distribution.
• When df are greater than 30, the t
distribution closely approximates it.
• When df are less than 30, higher
frequencies occur in the tails for t.
The Shape Varies with the df (k)

Smaller df produce larger tails


Comparison of t Distribution and
Normal Distribution for df=4
Finding Critical Values of t
• Use the t-table NOT the z-table.
• Calculate the degrees of freedom.
• Select the significance level (e.g., .05,
.01).
• Look in the row corresponding to the df
and the column for the significance level.
• If the absolute value of t is greater than the critical value, then
the result is significant (reject the null
hypothesis).
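
The lookup steps above might be sketched as follows. The dictionary holds a few two-tailed critical values at the .05 level as they appear in standard t-tables; the names `T_CRIT_05` and `is_significant` are illustrative, and a real analysis would use a full table or statistical software.

```python
# Two-tailed critical values of t at alpha = .05 for a few df,
# taken from a standard t-table (three decimals).
T_CRIT_05 = {4: 2.776, 9: 2.262, 29: 2.045}

def is_significant(t_obs, df, table=T_CRIT_05):
    """Reject the null hypothesis when |t| exceeds the critical value."""
    return abs(t_obs) > table[df]

print(is_significant(2.5, df=9))    # True:  2.5 > 2.262
print(is_significant(-2.5, df=4))   # False: 2.5 < 2.776
```

The same observed t can be significant with 9 df but not with 4 df, because smaller df require a larger critical value.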
Link to t-Tables

http://www.statsoft.com/textbook/sttable.html
Calculating t
• The formula for t is the same as that for z
except the standard deviation is estimated,
not known.
• Sample standard deviation (s) is
calculated using (n – 1) in the
denominator, not n.
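
A minimal sketch of a one-sample t calculation, assuming a hypothetical sample of five ages and a hypothesized population mean of 32. `statistics.stdev` uses n − 1 in the denominator, matching the point above.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t, df) for testing the sample mean against mu0."""
    n = len(sample)
    s = statistics.stdev(sample)      # estimated SD, uses n - 1
    se = s / math.sqrt(n)             # estimated standard error of the mean
    t = (statistics.mean(sample) - mu0) / se
    return t, n - 1

ages = [30, 34, 36, 38, 42]  # hypothetical data
t, df = one_sample_t(ages, mu0=32)
print(round(t, 3), df)  # 2.0 4
```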
Confidence Intervals for t
• Use the same formula as for z but:
– Substitute the t value (from the t-table) in
place of z.
– Substitute the estimated standard error of the
mean in place of the calculated standard error
of the mean.
• $\bar{X} \pm (t_{conf})(s_{\bar{X}})$
• Get $t_{conf}$ from the t-table by selecting the df
and confidence level
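
The interval can be sketched as below. The sample is hypothetical, and the value 2.776 is the two-tailed .05 critical value for df = 4 from a standard t-table, passed in by hand rather than computed.

```python
import math
import statistics

def t_confidence_interval(sample, t_conf):
    """Mean plus or minus t_conf times the estimated standard error."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - t_conf * se, m + t_conf * se

ages = [30, 34, 36, 38, 42]                      # hypothetical data, df = 4
low, high = t_confidence_interval(ages, 2.776)   # 95% confidence
print(round(low, 3), round(high, 3))  # 30.448 41.552
```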
Assumptions
• Use t whenever the standard deviation is
unknown.
• The t test assumes the underlying
population is normal.
• The t test will produce valid results with
non-normal underlying populations when
sample size > 30.
Deciding between t and z
• Use z when the population is normal and σ
is known (e.g., given in the problem).
• Use t when the population is normal but σ is
unknown (use s in place of σ).
• If the population is not normal, consider
the sample size.
– Use either t or z if n > 30 (see above).
– If n < 30, not enough is known.
What are Degrees of Freedom?
• Degrees of freedom (df) are the number of
values free to vary given some
mathematical restriction.
• Example – if a set of numbers must add
up to a specific total, df are the number of
values that can vary and still produce that
total.
• In calculating s (std dev), one df is used up
calculating the mean.
Example
• What number must X be to make the total
20?

    5     100    Free to vary
   10     200    Free to vary
    7     300    Free to vary
    X       X    Limited by the constraint that the sum of all the numbers must be 20
   --     ---
   20      20

So there are 3 degrees of freedom in this example.
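
The constraint can be made concrete with a short sketch: once three values are chosen freely, the fourth is forced. The function name `constrained_value` is only illustrative.

```python
def constrained_value(free_values, total):
    """The last value is fixed once the others and the total are known."""
    return total - sum(free_values)

print(constrained_value([5, 10, 7], total=20))        # -2
print(constrained_value([100, 200, 300], total=20))   # -580
```

Whatever the three free values are, X has no freedom left, which is why four numbers with a fixed total carry 3 degrees of freedom.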


A More Accurate Estimate of s
• When calculating s for inferential statistics
(but not descriptive), an adjustment is
made.
• One degree of freedom is used up
calculating the mean in the numerator.
• One degree of freedom must also be
subtracted in the denominator to
accurately describe variability.
Within Subjects Designs
• Two t-tests, depending on design:
– t-test for independent groups is for Between
Subjects designs.
– t-test for paired samples is for Within Subjects
designs.
• Dependent samples are also called:
– Paired samples
– Repeated measures
– Matched samples
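
To show how the design changes the test, the sketch below runs both t-tests on the same made-up numbers: once as pre/post scores from the same five people (paired), and once as if they came from two separate groups (independent, pooled variance). Function names and data are assumptions for illustration.

```python
import math
import statistics

def paired_t(before, after):
    """t for paired samples: a one-sample t on the difference scores."""
    diffs = [a - b for b, a in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se, len(diffs) - 1

def independent_t(a, b):
    """t for independent groups using a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2

pre = [10, 12, 9, 14, 11]     # hypothetical pre-test scores
post = [12, 15, 10, 17, 11]   # hypothetical post-test scores

t_paired, df_paired = paired_t(pre, post)
t_indep, df_indep = independent_t(post, pre)
print(round(t_paired, 3), df_paired)  # 3.087 4
print(round(t_indep, 3), df_indep)    # 1.152 8
```

With these numbers the paired test gives a larger t because each person serves as their own control, removing between-person variability from the error term.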
Examples of Paired
Samples
• Within subject designs
• Pre-test/post-test
• Matched-pairs
Independent samples –
separate groups
Degrees of Freedom

• Sample size (n-1)
• Number of groups (k-1)
• Number of points in time (l-1)

37
Relationships or Associations

38
Measures of Association: Correlations

• Range: -1 to 1
• Dimensions:
– Strength (0-1)
– Direction (+ or -)
• Definition: a change in X is associated with a
predictable change in Y; shared variation
or variance.
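
A minimal sketch of the Pearson correlation, computed from deviation scores with made-up data. Squaring r gives the proportion of shared variance.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation from deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]   # hypothetical scores
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4), round(r ** 2, 2))  # 0.7746 0.6

# Direction: a perfectly inverse relationship gives r = -1.
print(pearson_r([1, 2, 3], [3, 2, 1]))  # -1.0
```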

39
Correlations
• Sample specific (each sample is a subset
of the population)
• Unstable
• Dependent upon sample size
• Everything is statistically significant with a
very large sample size; may not be
clinically significant.
• Expresses relation not a causal statement

40
Types of Correlations
• Pearson product moment r
– continuous by continuous variable
• Phi correlation
– discrete by discrete variable (Chi square)
• Rho rank order correlation
– discrete ranks by ranks
• Point-biserial
– discrete by continuous variable
• Eta Squared

41
Estimate the value of the correlation

[Figure: three scatterplots, each with an X-axis and a Y-axis, labeled r = ?]

42
Types of Data Analyses
Descriptive X? Y? Z?
Measures of central tendency

Correlational rx,y?
Is there a relationship between X and Y?
Measures of relationships (correlations)

Causal ΔX → ΔY?
• Does a change in X cause a change in Y?
Testing group differences (t or F tests)

43
Testing Effects of Interventions

44
Testing Group Differences
• t tests
• F tests (Analysis of Variance or ANOVA)

(t tests are F tests with two groups)

45
Types of tests of group differences

• Between groups
– (unpaired)

• Within groups
– (paired or repeated measures; if two groups it
is also test-retest)
– requires identified subjects

46
Tests of Significance
3 4

1 O1 X O2

2 O1 O2

47
Testing Group Differences

$$F \text{ (or } t\text{)} = \frac{\text{Between Variance}}{\text{Within Variance}}$$
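
The ratio can be worked through by hand for two groups. In this sketch (invented scores, illustrative function names) F is the between-groups mean square divided by the within-groups mean square, and the pooled-variance t for the same two groups satisfies t² = F.

```python
import math
import statistics

def one_way_f(groups):
    """F = between-groups mean square / within-groups mean square."""
    scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def pooled_t(a, b):
    """Independent-groups t with a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

control = [10, 12, 9, 14, 11]        # hypothetical scores
experimental = [12, 15, 10, 17, 11]

f_value = one_way_f([control, experimental])
t_value = pooled_t(experimental, control)
print(round(f_value, 4), round(t_value ** 2, 4))  # 1.3279 1.3279
```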

48
Examining Variance

Between
Variance
Within
Variance

Mc Me

49
Examining Variance:
No difference between the means

Mc Me

50
Examining Variance:
Big difference between means

Mc Me

51
Examining Variance: Three groups

Mc Me2 Me1

52
• Standard Deviation
– how scores vary around a mean

• Standard Error of the Mean
– how sample means vary around the population
mean

53
Standard Error of the Mean:
Standard deviation of the sample means

Population (n=1000)   Mean Age: 36.5 years

Samples (n=50)
            Sample 1   Sample 2   Sample 3
Mean Age    34.6 yrs   37.1 yrs   36.4 yrs
SD          3.4        3.8        4.1
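
In practice the standard error is usually estimated from a single sample as SD divided by the square root of n. The sketch below applies that estimate to the three sample SDs shown above (n = 50 each); the function name is illustrative.

```python
import math

def standard_error(sd, n):
    """Estimated standard error of the mean from one sample."""
    return sd / math.sqrt(n)

sample_sds = [3.4, 3.8, 4.1]
sems = [standard_error(sd, n=50) for sd in sample_sds]
print([round(s, 3) for s in sems])  # [0.481, 0.537, 0.58]
```

Each estimate is much smaller than the corresponding SD, since means of 50 people vary far less than individual scores do.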

54
Conceptual:

$$t = \frac{\text{Mean}_E - \text{Mean}_C}{\text{standard error of the mean}}$$

55
Assumptions of ANOVA
• Normal distribution
• Independence of measures
• Continuous scaling
• Linear relationship between variables

56
3 X 2 ANOVA

O1exp X1 O2exp

R O1exp X2 O2exp

O1con O2con

One between factor: group (3 levels)


One within factor: time (2 levels)

57
Omnibus F Test

O1exp X1 O2exp

R O1exp X2 O2exp

O1con O2con

F test group: Is there a difference among the three groups?
F test time: Is there a difference between time 1 and 2?
If yes to either question, where is the difference?
Interaction: Group by Time

58
Tests of Significance
                       Non-parametric          Parametric

Two groups
  Paired               Wilcoxon signed-rank    Paired t test
  Unpaired             Mann-Whitney U          Unpaired t test

More than two groups
  Repeated measures    Friedman test           Repeated measures ANOVA
  Independent groups   Kruskal-Wallis          ANOVA

59
ANOVA
• ANOVA – analysis of variance
• ANCOVA – analysis of covariance,
includes covariate (Z) variable(s)
• MANOVA – multivariate analysis of
variance (more than one dependent
variable)
• MANCOVA – multivariate analysis of
covariance, includes covariate (Z) variable(s).
60
Multiple Regression Analysis
Correlational technique
– Unstable values
– Sample specific
– Reliability of measures very
important
– Requires large sample size
– Easy to get significance with large
sample size

61
Multiple Regression Analysis

Attempts to make causal statements of relationship

Y = X1+X2+X3

Y = dependent variable (health status)
X1-3 = predictors or independent variables
Health Status = Age + Gender + Smoking

62
Multiple Regression Questions:
• What is the contribution of age, gender, and
smoking to health status?
• How much of the variation in health status is
accounted for by variation in age, gender, and
smoking?

63
Multiple Regression Analysis
• Creates a correlation matrix.
• Selects the most highly correlated independent
variable with the dependent variable first.
• Extracts the variance in Y accounted for by that X
variable.
• Repeats the process (iterative) until no more of
the variance in Y is statistically explained by the
addition of another X variable.

64
Health Status =
Age + Gender + Smoking

Correlations (r), with shared variance (r2) in parentheses:

                    Health Status (Y)   Age (X1)    Gender (X2)   Smoking (X3)
Health Status (Y)   1                   0.25 (6%)   0.04 (0%)     0.40 (16%)
Age (X1)                                1           0.11 (1%)     0.05 (0%)
Gender (X2)                                         1             0.20 (4%)
Smoking (X3)                                                      1
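
The percentages in the matrix are simply the squared correlations. A short sketch that reproduces them from the r values in the table (the dictionary layout is an illustrative choice):

```python
# Correlations taken from the matrix above.
correlations = {
    ("Health Status", "Age"): 0.25,
    ("Health Status", "Gender"): 0.04,
    ("Health Status", "Smoking"): 0.40,
    ("Age", "Gender"): 0.11,
    ("Age", "Smoking"): 0.05,
    ("Gender", "Smoking"): 0.20,
}

# Shared variance is r squared, shown here as a whole-number percentage.
shared = {pair: round(r ** 2 * 100) for pair, r in correlations.items()}
for pair, pct in shared.items():
    print(pair, f"{pct}%")
```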
65
Multiple Regression: Shared Variance

[Figure: Venn diagram of the Health Status circle overlapping the Age, Gender, and Smoking circles. Labels: Smoking 40%, Age 25%, Gender 4%.]

66
Multiple Regression
• Correlation results in an r
• Multiple regression results in an R2
• R squared is the total amount of the
variance in Y that is explained by the
predictors, removing the overlap among
the predictors.

67
Multiple Regression
Types
• Step-wise = based upon highest
correlation, that variable is entered first
(computer makes the decision), theory
building
• Hierarchical = choose the order of entry,
forced entry, theory testing

68
Multiple Regression
• Allows one to cluster variables into Blocks.
• Block 1: Demographic variables
– (age, gender, SES)
• Block 2: Psychological Well-Being
– (depression, social support)
• Block 3: Severity of Illness
– (CD4 count, AIDS dx, viral load, OIs)
• Block 4: Treatment or control
– 1= treatment and 0 = control
69
Regression Analysis
• Multiple regression: one Y, multiple Xs.
• Logistic regression: Y is dichotomous,
popular in epidemiology, Y=disease or no
disease; yields odds ratios as a measure of risk (not explained
variance)
• Canonical variate analysis: multiple Y and
multiple X variables: Y1+Y2+Y3=X1+X2+X3
– linking physiological variables with
psychosocial variables.
70
Multivariate Regression Models:
• Path Analysis and now Structural Equation
Modeling
• Software programs: SPSS, Minitab, RStudio
• Measurement model is combined with predictive
model
• Keep in the picture the multicollinearity of
variables (they are correlated!)
• Allows for mediating variables (direct and
indirect effects).

71
Multiple Dependent & Independent
Path Analysis Modeling
Relationships are based upon the literature review and then potentially explored, discovered, tested, or validated in a study.

[Figure: path diagram with the variables Age, Gender, Severity of illness, Cognitive Ability, Social Support, Adherence to diet, and Diabetic Control.]

72
Structural Equation Modeling

[Figure: model diagram in which Muscle ache and Fatigue are each measured at Month 0, Month 1, Month 3, and Month 6, and each is summarized by an Intercept and a Slope.]

73
Factor Analysis
• Exploration of instrument construct validity
• Correlational technique
• Requires only one administration of an
instrument
• Data reduction technique
• A statistical procedure that requires artistic
skills

74
Conceptual Types of Factor Analysis

• Exploratory – see what is in the data set

• Confirmatory – see if you can replicate the reported structure.

75
Factor Analysis

• Principal Components
(principal factor or principal axes)

76
Correlation Matrix of Scale Items:
Which items are related?

         Item 1   Item 2   Item 3   Item 4
Item 1   1        0.80     0.30     0.25
Item 2            1        0.40     0.25
Item 3                     1        0.70
Item 4                              1

77
Factor Analysis:

An iterative process

Factor extraction

78
Factor Analysis
             Factor I   Factor II   Factor III   Communality
Item 1       0.80       0.20        -0.30        0.77
Item 2       0.75       0.30        0.01         0.65
Item 3       0.30       0.80        0.05         0.73
Item 4       0.25       0.75        0.20         0.67
Eigenvalue   2.10       2.05        0.56
% var        34%        30%         10%

79
Definitions:
• Communality: square the item's loadings on
each factor and sum across factors for each ITEM
• Eigenvalue: square the item loadings down
the column for each factor and sum for each
FACTOR
• Labeling Factors: figments of the author's
imagination. Items 1 & 2 = Factor I; Items
3 & 4 = Factor II.
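
The two definitions can be sketched as row sums and column sums of squared loadings. The loadings below are the four items shown in the Factor Analysis table; the variable names are illustrative. Column sums computed from only these four rows need not match the eigenvalue row on the slide, which may reflect additional items not shown.

```python
# Loadings on Factors I, II, III for the four items shown.
loadings = {
    "Item 1": [0.80, 0.20, -0.30],
    "Item 2": [0.75, 0.30, 0.01],
    "Item 3": [0.30, 0.80, 0.05],
    "Item 4": [0.25, 0.75, 0.20],
}

# Communality: sum of squared loadings across factors, per ITEM (row).
communality = {item: sum(l ** 2 for l in row) for item, row in loadings.items()}

# Eigenvalue-style sum: squared loadings down each column, per FACTOR.
column_ss = [sum(row[j] ** 2 for row in loadings.values()) for j in range(3)]

for item, h2 in communality.items():
    print(item, round(h2, 2))
print([round(v, 3) for v in column_ss])
```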

80
Factor Rotation
Factors are mathematically rotated depending
upon the perspective of the author.
• Orthogonal – right angles, low inter-factor
correlations, creates more independence of
factors, good for multiple regression analysis,
may not reflect well the actual data. (varimax)
• Oblique – different types, lets factors
correlate with each other to the degree they
actually do correlate, some like this and
believe it better reflects the actual data,
harder to use in multiple regression because
of the multicollinearity. (oblimax)
81
Summary: Data Analysis
• Measures of Central Tendency
• Measures of Relationships
• Testing Group Differences
• Correlational
• Multiple regression as a predictive
(causal) technique.
• Factor analysis as a scale
development, construct validity
technique

82
