
SPSS

Chi-Square Test of Independence

The Chi-Square Test of Independence determines whether there is an association between
categorical variables (i.e., whether the variables are independent or related). It is a
nonparametric test.

This test is also known as:

• Chi-Square Test of Association.

This test utilizes a contingency table to analyze the data. A contingency table (also known as
a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified
according to two categorical variables. The categories for one variable appear in the rows,
and the categories for the other variable appear in columns. Each variable must have two or
more categories. Each cell reflects the total count of cases for a specific pair of categories.

There are several tests that go by the name "chi-square test" in addition to the Chi-Square Test of Independence. Look for context clues in the data and research question to determine which form of the chi-square test is being used.

Common Uses

The Chi-Square Test of Independence is commonly used to test the following:

• Statistical independence or association between two or more categorical variables.

The Chi-Square Test of Independence can only compare categorical variables. It cannot make
comparisons between continuous variables or between categorical and continuous variables.
Additionally, the Chi-Square Test of Independence only assesses associations between categorical variables; it cannot provide any inferences about causation.

If your categorical variables represent "pre-test" and "post-test" observations, then the chi-
square test of independence is not appropriate. This is because the assumption of the
independence of observations is violated. In this situation, McNemar's Test is appropriate.
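In SPSS, McNemar's Test for paired categorical data is available as one of the Crosstabs statistics. A minimal syntax sketch, assuming two paired yes/no variables named PreTest and PostTest (hypothetical names used only for illustration):

* McNemar's test for paired categorical responses (hypothetical variable names).
CROSSTABS
/TABLES=PreTest BY PostTest
/STATISTICS=MCNEMAR
/CELLS=COUNT.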

Data Requirements

Your data must meet the following requirements:


1. Two categorical variables.
2. Two or more categories (groups) for each variable.
3. Independence of observations.
   • There is no relationship between the subjects in each group.
   • The categorical variables are not "paired" in any way (e.g., pre-test/post-test observations).
4. Relatively large sample size.
   • Expected frequencies for each cell are at least 1.
   • Expected frequencies should be at least 5 for the majority (80%) of the cells.

Hypotheses

The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square Test of
Independence can be expressed in two different but equivalent ways:

H0: "[Variable 1] is independent of [Variable 2]"


H1: "[Variable 1] is not independent of [Variable 2]"

OR

H0: "[Variable 1] is not associated with [Variable 2]"


H1: "[Variable 1] is associated with [Variable 2]"

Test Statistic

The test statistic for the Chi-Square Test of Independence is denoted Χ², and is computed as:

\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}

where

o_ij is the observed cell count in the ith row and jth column of the table

e_ij is the expected cell count in the ith row and jth column of the table, computed as

e_{ij} = \frac{\text{row } i \text{ total} \times \text{col } j \text{ total}}{\text{grand total}}

The quantity (o_ij − e_ij) is sometimes referred to as the residual of cell (i, j), denoted r_ij.


The calculated Χ² value is then compared to the critical value from the Χ² distribution table with degrees of freedom df = (R - 1)(C - 1) and chosen confidence level. If the calculated Χ² value > critical Χ² value, then we reject the null hypothesis.
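As a small worked illustration with made-up numbers, suppose a 2x2 table has observed counts 70 and 30 in the first row and 50 and 50 in the second row (row totals 100 and 100, column totals 120 and 80, grand total 200). The expected counts and test statistic would be:

e_{11} = \frac{(100)(120)}{200} = 60, \quad e_{12} = 40, \quad e_{21} = 60, \quad e_{22} = 40

\chi^2 = \frac{(70-60)^2}{60} + \frac{(30-40)^2}{40} + \frac{(50-60)^2}{60} + \frac{(50-40)^2}{40} \approx 8.33

With df = (2 - 1)(2 - 1) = 1, the critical value at the 0.05 level is 3.841, so the null hypothesis of independence would be rejected for these hypothetical data.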

Data Set-Up

There are two different ways in which your data may be set up initially. The format of the
data will determine how to proceed with running the Chi-Square Test of Independence. At
minimum, your data should include two categorical variables (represented in columns) that
will be used in the analysis. The categorical variables must include at least two groups. Your
data may be formatted in either of the following ways:

IF YOU HAVE THE RAW DATA (EACH ROW IS A SUBJECT):

• Cases represent subjects, and each subject appears once in the dataset. That is, each row represents an observation from a unique subject.
• The dataset contains at least two nominal categorical variables (string or numeric). The categorical variables used in the test must have two or more categories.
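If your data are instead already summarized as cell counts (one row per combination of categories, with a frequency variable holding the count), a common approach is to weight the cases by the frequency variable before running Crosstabs. A minimal sketch, assuming a count variable named Freq (a hypothetical name):

* Weight cases by the cell count before running the chi-square test.
WEIGHT BY Freq.
CROSSTABS
/TABLES=Smoking BY Gender
/STATISTICS=CHISQ
/CELLS=COUNT.
WEIGHT OFF.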
Run a Chi-Square Test of Independence

In SPSS, the Chi-Square Test of Independence is an option within the Crosstabs procedure.
Recall that the Crosstabs procedure creates a contingency table or two-way table, which
summarizes the distribution of two categorical variables.
To create a crosstab and perform a chi-square test of independence, click Analyze >
Descriptive Statistics > Crosstabs.

A Row(s): One or more variables to use in the rows of the crosstab(s). You must enter at least one Row variable.

B Column(s): One or more variables to use in the columns of the crosstab(s). You must enter at least one Column variable.

Also note that if you specify one row variable and two or more column variables, SPSS will
print crosstabs for each pairing of the row variable with the column variables. The same is
true if you have one column variable and two or more row variables, or if you have multiple
row and column variables. A chi-square test will be produced for each table. Additionally, if
you include a layer variable, chi-square tests will be run for each pair of row and column
variables within each level of the layer variable.
C Layer: An optional "stratification" variable. If you have turned on the chi-square test
results and have specified a layer variable, SPSS will subset the data with respect to the
categories of the layer variable, then run chi-square tests between the row and column
variables. (This is not equivalent to testing for a three-way association, or testing for an
association between the row and column variable after controlling for the layer variable.)

D Statistics: Opens the Crosstabs: Statistics window, which contains fifteen different inferential statistics for comparing categorical variables. To run the Chi-Square Test of Independence, make sure that the Chi-square box is checked.

E Cells: Opens the Crosstabs: Cell Display window, which controls which output is
displayed in each cell of the crosstab. (Note: in a crosstab, the cells are the inner sections of
the table. They show the number of observations for a given combination of the row and
column categories.) There are three options in this window that are useful (but optional)
when performing a Chi-Square Test of Independence:
1 Observed: The actual number of observations for a given cell. This option is enabled by
default.

2 Expected: The expected number of observations for that cell (see the test statistic
formula).

3 Unstandardized Residuals: The "residual" value, computed as observed minus expected.

F Format: Opens the Crosstabs: Table Format window, which specifies how the rows of
the table are sorted.
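In syntax, the Statistics and Cells choices described above map onto the /STATISTICS and /CELLS subcommands. A sketch (using the Smoking and Gender variables from the example that follows) that requests the chi-square test plus observed counts, expected counts, and unstandardized residuals:

CROSSTABS
/TABLES=Smoking BY Gender
/STATISTICS=CHISQ
/CELLS=COUNT EXPECTED RESID.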


Example: Chi-square Test for 3x2 Table

PROBLEM STATEMENT
In the sample dataset, respondents were asked their gender and whether or not they were a
cigarette smoker. There were three answer choices: Nonsmoker, Past smoker, and Current
smoker. Suppose we want to test for an association between smoking behavior (nonsmoker,
current smoker, or past smoker) and gender (male or female) using a Chi-Square Test of
Independence (we'll use α = 0.05).

RUNNING THE TEST


1. Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).
2. Select Smoking as the row variable, and Gender as the column variable.
3. Click Statistics. Check Chi-square, then click Continue.
4. (Optional) Check the box for Display clustered bar charts.
5. Click OK.

SYNTAX

CROSSTABS
/TABLES=Smoking BY Gender
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT
/COUNT ROUND CELL
/BARCHART.

OUTPUT
TABLES
The first table is the Case Processing summary, which tells us the number of valid cases used
for analysis. Only cases with nonmissing values for both smoking behavior and gender can be
used in the test.

The next tables are the crosstabulation and chi-square test results.

The key result in the Chi-Square Tests table is the Pearson Chi-Square.

• The value of the test statistic is 3.171.
• The footnote for this statistic pertains to the expected cell count assumption (i.e., expected cell counts are all greater than 5): no cells had an expected count less than 5, so this assumption was met.
• Because the test statistic is based on a 3x2 crosstabulation table, the degrees of freedom (df) for the test statistic is df = (R − 1)(C − 1) = (3 − 1)(2 − 1) = 2.
• The corresponding p-value of the test statistic is p = 0.205.

DECISION AND CONCLUSIONS


Since the p-value is greater than our chosen significance level (α = 0.05), we do not reject the
null hypothesis. Rather, we conclude that there is not enough evidence to suggest an
association between gender and smoking.

Based on the results, we can state the following:


No association was found between gender and smoking behavior (Χ²(2) = 3.171, p = 0.205).

_________________________________________________________
ANOVA

One-Way ANOVA

One-Way ANOVA ("analysis of variance") compares the means of two or more independent
groups in order to determine whether there is statistical evidence that the associated
population means are significantly different. One-Way ANOVA is a parametric test.

This test is also known as:

• One-Factor ANOVA
• One-Way Analysis of Variance
• Between Subjects ANOVA

The variables used in this test are known as:

• Dependent variable
• Independent variable (also known as the grouping variable, or factor)
   • This variable divides cases into two or more mutually exclusive levels, or groups

Common Uses

The One-Way ANOVA is often used to analyze data from the following types of studies:

• Field studies
• Experiments
• Quasi-experiments

The One-Way ANOVA is commonly used to test the following:

• Statistical differences among the means of two or more groups
• Statistical differences among the means of two or more interventions
• Statistical differences among the means of two or more change scores

Note: Both the One-Way ANOVA and the Independent Samples t Test can compare the
means for two groups. However, only the One-Way ANOVA can compare the means across
three or more groups.

Note: If the grouping variable has only two groups, then the results of a one-way ANOVA
and the independent samples t test will be equivalent. In fact, if you run both an independent
samples t test and a one-way ANOVA in this situation, you should be able to confirm that t² = F.

Data Requirements

Your data must meet the following requirements:

1. Dependent variable that is continuous (i.e., interval or ratio level)


2. Independent variable that is categorical (i.e., two or more groups)
3. Cases that have values on both the dependent and independent variables
4. Independent samples/groups (i.e., independence of observations)
1. There is no relationship between the subjects in each sample. This means that:
1. subjects in the first group cannot also be in the second group
2. no subject in either group can influence subjects in the other group
3. no group can influence the other group
5. Random sample of data from the population
6. Normal distribution (approximately) of the dependent variable for each group (i.e., for
each level of the factor)
1. Non-normal population distributions, especially those that are thick-tailed or
heavily skewed, considerably reduce the power of the test
2. Among moderate or large samples, a violation of normality may yield fairly
accurate p values
7. Homogeneity of variances (i.e., variances approximately equal across groups)
1. When this assumption is violated and the sample sizes differ among groups,
the p value for the overall F test is not trustworthy. These conditions warrant
using alternative statistics that do not assume equal variances among
populations, such as the Brown-Forsythe or Welch statistics (available
via Options in the One-Way ANOVA dialog box).
2. When this assumption is violated, regardless of whether the group sample
sizes are fairly equal, the results may not be trustworthy for post hoc tests.
When variances are unequal, post hoc tests that do not assume equal variances
should be used (e.g., Dunnett’s C).
8. No outliers

Note: When the normality, homogeneity of variances, or outliers assumptions for One-Way ANOVA are not met, you may want to run the nonparametric Kruskal-Wallis test instead.
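A legacy-syntax sketch of the Kruskal-Wallis alternative, assuming the Sprint and Smoking variables used in the example later in this section (with Smoking coded 0 through 2):

* Kruskal-Wallis H test as a nonparametric alternative to one-way ANOVA.
NPAR TESTS
/K-W=Sprint BY Smoking(0 2)
/MISSING ANALYSIS.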

Researchers often follow several rules of thumb for one-way ANOVA:

• Each group should have at least 6 subjects (ideally more; inferences for the population will be more tenuous with too few subjects)
• Balanced designs (i.e., same number of subjects in each group) are ideal; extremely unbalanced designs increase the possibility that violating any of the requirements/assumptions will threaten the validity of the ANOVA F test

Hypotheses

The null and alternative hypotheses of one-way ANOVA can be expressed as:

H0: µ1 = µ2 = µ3 = ... = µk   ("all k population means are equal")
H1: At least one µi is different   ("at least one of the k population means is not equal to the others")

where


µi is the population mean of the ith group (i = 1, 2, ..., k)

Note: The One-Way ANOVA is considered an omnibus (Latin for “all”) test because
the F test indicates whether the model is significant overall—i.e., whether or not there
are any significant differences in the means between any of the groups. (Stated another way,
this says that at least one of the means is different from the others.) However, it does not
indicate which mean is different. Determining which specific pairs of means are significantly
different requires either contrasts or post hoc (Latin for “after this”) tests.

Test Statistic

The test statistic for a One-Way ANOVA is denoted as F. For an independent variable
with k  groups, the F statistic evaluates whether the group means are significantly different.
Because the computation of the F statistic is slightly more involved than computing the
paired or independent samples t test statistics, it's extremely common for all of the F statistic
components to be depicted in a table like the following:

            Sum of Squares   df    Mean Square   F
Treatment   SSR              dfr   MSR           MSR/MSE
Error       SSE              dfe   MSE
Total       SST              dfT

where

SSR = the regression sum of squares

SSE = the error sum of squares

SST = the total sum of squares (SST = SSR + SSE)

dfr = the model degrees of freedom (equal to dfr = k - 1)

dfe = the error degrees of freedom (equal to dfe = n - k)

k = the total number of groups (levels of the independent variable)

n = the total number of valid observations

dfT = the total degrees of freedom (equal to dfT = dfr + dfe = n - 1)

MSR = SSR/dfr = the regression mean square

MSE = SSE/dfe = the mean square error

Then the F statistic itself is computed as

F = MSR / MSE

Note: In some texts you may see the notation df1 or ν1 for the regression degrees of freedom,
and df2 or ν2 for the error degrees of freedom. The latter notation uses the Greek letter nu (ν)
for the degrees of freedom.
Some texts may use "SSTr" (Tr = "treatment") instead of SSR (R = "regression"), and may
use SSTo (To = "total") instead of SST.

The terms Treatment (or Model) and Error are the terms most commonly used in natural
sciences and in traditional experimental design texts. In the social sciences, it is more
common to see the terms Between groups instead of "Treatment", and Within groups instead
of "Error". The between/within terminology is what SPSS uses in the one-way ANOVA
procedure.

Data Set-Up

Your data should include at least two variables (represented in columns) that will be used in
the analysis. The independent variable should be categorical (nominal or ordinal) and include
at least two groups, and the dependent variable should be continuous (i.e., interval or ratio).
Each row of the dataset should represent a unique subject or experimental unit.

Note: SPSS restricts categorical indicators to numeric or short string values only.

Run a One-Way ANOVA

The following steps reflect SPSS’s dedicated One-Way ANOVA procedure. However, since
the One-Way ANOVA is also part of the General Linear Model (GLM) family of statistical
tests, it can also be conducted via the Univariate GLM procedure (“univariate” refers to one
dependent variable). This latter method may be beneficial if your analysis goes beyond the
simple One-Way ANOVA and involves multiple independent variables, fixed and random
factors, and/or weighting variables and covariates (e.g., One-Way ANCOVA). We proceed
by explaining how to run a One-Way ANOVA using SPSS’s dedicated procedure.

To run a One-Way ANOVA in SPSS, click Analyze > Compare Means > One-Way
ANOVA.
The One-Way ANOVA window opens, where you will specify the variables to be used in the
analysis. All of the variables in your dataset appear in the list on the left side. Move variables
to the right by selecting them in the list and clicking the blue arrow buttons. You can move a
variable(s) to either of two areas: Dependent List or Factor.

A Dependent List: The dependent variable(s). This is the variable whose means will be compared between the samples (groups). You may run multiple means comparisons simultaneously by selecting more than one dependent variable.

B Factor: The independent variable. The categories (or groups) of the independent variable will define which samples will be compared. The independent variable must have at least two categories (groups), but usually has three or more groups when used in a One-Way ANOVA.

C Contrasts: (Optional) Specify contrasts, or planned comparisons, to be conducted after
the overall ANOVA test.

When the initial F test indicates that significant differences exist between group means,
contrasts are useful for determining which specific means are significantly different when
you have specific hypotheses that you wish to test. Contrasts are decided before analyzing the
data (i.e., a priori). Contrasts break down the variance into component parts. They may
involve using weights, non-orthogonal comparisons, standard contrasts, and polynomial
contrasts (trend analysis).

Many online and print resources detail the distinctions among these options and will help
users select appropriate contrasts. For more information about contrasts, you can open the
IBM SPSS help manual from within SPSS by clicking the "Help" button at the bottom of the
One-Way ANOVA dialog window.
D Post Hoc: (Optional) Request post hoc (also known as multiple comparisons) tests.
Specific post hoc tests can be selected by checking the associated boxes.

1 Equal Variances Assumed: Multiple comparisons options that assume homogeneity of variance (each group has equal variance). For detailed information about the specific comparison methods, click the Help button in this window.

2 Test: By default, a 2-sided hypothesis test is selected. Alternatively, a directional, one-
sided hypothesis test can be specified if you choose to use a Dunnett post hoc test. Click the
box next to Dunnett and then specify whether the Control Category is the Last or First
group, numerically, of your grouping variable. In the Test area, click either < Control or >
Control. The one-tailed options require that you specify whether you predict that the mean
for the specified control group will be less than (> Control) or greater than (< Control)
another group.

3 Equal Variances Not Assumed: Multiple comparisons options that do not assume equal
variances. For detailed information about the specific comparison methods, click
the Help button in this window.
4 Significance level: The desired cutoff for statistical significance. By default, significance
is set to 0.05.

When the initial F test indicates that significant differences exist between group means, post
hoc tests are useful for determining which specific means are significantly different when you
do not have specific hypotheses that you wish to test. Post hoc tests compare each pair of
means (like t-tests), but unlike t-tests, they correct the significance estimate to account for the
multiple comparisons.

E Options: Clicking Options will produce a window where you can specify which Statistics to include in the output (Descriptive, Fixed and random effects, Homogeneity of variance test, Brown-Forsythe, Welch), whether to include a Means plot,
and how the analysis will address Missing Values (i.e., Exclude cases analysis by
analysis or Exclude cases listwise). Click Continue when you are finished making
specifications.

Click OK to run the One-Way ANOVA.
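For reference, the dialog choices described above correspond to subcommands on the ONEWAY command. A sketch (using the Sprint and Smoking variables from the example below) that requests descriptives, the homogeneity-of-variance test, the Welch and Brown-Forsythe statistics, a means plot, and Tukey and Games-Howell post hoc tests; adjust the subcommands to match the options you actually need:

ONEWAY Sprint BY Smoking
/STATISTICS DESCRIPTIVES HOMOGENEITY WELCH BROWNFORSYTHE
/PLOT MEANS
/MISSING ANALYSIS
/POSTHOC=TUKEY GH ALPHA(0.05).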

Example
To introduce one-way ANOVA, let's use an example with a relatively obvious conclusion.
The goal here is to show the thought process behind a one-way ANOVA.

PROBLEM STATEMENT
In the sample dataset, the variable Sprint is the respondent's time (in seconds) to sprint a
given distance, and Smoking is an indicator about whether or not the respondent smokes (0 =
Nonsmoker, 1 = Past smoker, 2 = Current smoker). Let's use ANOVA to test if there is a
statistically significant difference in sprint time with respect to smoking status. Sprint time
will serve as the dependent variable, and smoking status will act as the independent variable.

BEFORE THE TEST


Just like we did with the paired t test and the independent samples t test, we'll want to look at descriptive statistics and graphs to get a picture of the data before we run any inferential statistics.

The sprint times are a continuous measure of time to sprint a given distance in seconds. From
the Descriptives procedure (Analyze > Descriptive Statistics > Descriptives), we see that
the times exhibit a range of 4.5 to 9.6 seconds, with a mean of 6.6 seconds (based on n=374
valid cases). From the Compare Means procedure (Analyze > Compare Means > Means),
we see these statistics with respect to the groups of interest:

                 N     Mean    Std. Deviation

Nonsmoker        261   6.411   1.252
Past smoker      33    6.835   1.024
Current smoker   59    7.121   1.084
Total            353   6.569   1.234

Notice that, according to the Compare Means procedure, the valid sample size is actually
n=353. This is because Compare Means (and additionally, the one-way ANOVA procedure
itself) requires there to be nonmissing values for both the sprint time and the smoking
indicator.

Lastly, we'll also want to look at a comparative boxplot to get an idea of the distribution of
the data with respect to the groups:
From the boxplots, we see that there are no outliers; that the distributions are roughly symmetric; and that the centers of the distributions don't appear to be hugely different. The median sprint time for the nonsmokers is slightly faster than the median sprint time of the past and current smokers.

RUNNING THE PROCEDURE


1. Click Analyze > Compare Means > One-Way ANOVA.
2. Add the variable Sprint to the Dependent List box, and add the variable Smoking to
the Factor box.
3. Click Options. Check the box for Means plot, then click Continue.
4. Click OK when finished.

Output for the analysis will display in the Output Viewer window.

SYNTAX

ONEWAY Sprint BY Smoking
/PLOT MEANS
/MISSING ANALYSIS.

OUTPUT
The output displays a table entitled ANOVA.

                 Sum of Squares   df    Mean Square   F       Sig.

Between Groups   26.788           2     13.394        9.209   .000
Within Groups    509.082          350   1.455
Total            535.870          352
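As a quick arithmetic check against the table above, the mean squares are the sums of squares divided by their degrees of freedom, and F is their ratio:

MS_{between} = \frac{26.788}{2} = 13.394, \quad MS_{within} = \frac{509.082}{350} \approx 1.455, \quad F = \frac{13.394}{1.455} \approx 9.21

which matches the reported F = 9.209 up to rounding.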

After any table output, the Means plot is displayed.

The Means plot is a visual representation of what we saw in the Compare Means output. The
points on the chart are the average of each group. It's much easier to see from this graph that
the current smokers had the slowest mean sprint time, while the nonsmokers had the fastest
mean sprint time.

DISCUSSION AND CONCLUSIONS


We conclude that the mean sprint time is significantly different for at least one of the smoking groups (F(2, 350) = 9.209, p < 0.001). Note that the ANOVA alone does not tell us
specifically which means were different from one another. To determine that, we would need
to follow up with multiple comparisons (or post-hoc) tests.

INDEPENDENT SAMPLES T-TEST


Independent Samples t Test

The Independent Samples t Test compares the means of two independent groups in order to
determine whether there is statistical evidence that the associated population means are
significantly different. The Independent Samples t Test is a parametric test.

This test is also known as:

• Independent t Test
• Independent Measures t Test
• Independent Two-sample t Test
• Student t Test
• Two-Sample t Test
• Uncorrelated Scores t Test
• Unpaired t Test
• Unrelated t Test

The variables used in this test are known as:

• Dependent variable, or test variable
• Independent variable, or grouping variable

Common Uses

The Independent Samples t Test is commonly used to test the following:

• Statistical differences between the means of two groups
• Statistical differences between the means of two interventions
• Statistical differences between the means of two change scores

Note: The Independent Samples t Test can only compare the means for two (and only two)
groups. It cannot make comparisons among more than two groups. If you wish to compare
the means across more than two groups, you will likely want to run an ANOVA.

Data Requirements

Your data must meet the following requirements:

1. Dependent variable that is continuous (i.e., interval or ratio level)


2. Independent variable that is categorical (i.e., two groups)
3. Cases that have values on both the dependent and independent variables
4. Independent samples/groups (i.e., independence of observations)
   • There is no relationship between the subjects in each sample. This means that:
      • Subjects in the first group cannot also be in the second group
      • No subject in either group can influence subjects in the other group
      • No group can influence the other group
   • Violation of this assumption will yield an inaccurate p value
5. Random sample of data from the population
6. Normal distribution (approximately) of the dependent variable for each group
   • Non-normal population distributions, especially those that are thick-tailed or heavily skewed, considerably reduce the power of the test
   • Among moderate or large samples, a violation of normality may still yield accurate p values
7. Homogeneity of variances (i.e., variances approximately equal across groups)
   • When this assumption is violated and the sample sizes for each group differ, the p value is not trustworthy. However, the Independent Samples t Test output also includes an approximate t statistic that is not based on assuming equal population variances. This alternative statistic, called the Welch t Test statistic¹, may be used when equal variances among populations cannot be assumed. The Welch t Test is also known as an Unequal Variance t Test or Separate Variances t Test.
8. No outliers

Note: When one or more of the assumptions for the Independent Samples t Test are not met,
you may want to run the nonparametric Mann-Whitney U Test instead.
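A legacy-syntax sketch of the Mann-Whitney U alternative, assuming the Athlete and MileMinDur variables used in the example later in this section:

* Mann-Whitney U test as a nonparametric alternative to the independent samples t test.
NPAR TESTS
/M-W=MileMinDur BY Athlete(0 1)
/MISSING ANALYSIS.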

Researchers often follow several rules of thumb:

• Each group should have at least 6 subjects, ideally more. Inferences for the population will be more tenuous with too few subjects.
• A balanced design (i.e., same number of subjects in each group) is ideal. Extremely unbalanced designs increase the possibility that violating any of the requirements/assumptions will threaten the validity of the Independent Samples t Test.


¹ Welch, B. L. (1947). The generalization of "Student's" problem when several different population variances are involved. Biometrika, 34(1–2), 28–35.

Hypotheses

The null hypothesis (H0) and alternative hypothesis (H1) of the Independent Samples t Test
can be expressed in two different but equivalent ways:

H0: µ1 = µ2 ("the two population means are equal")


H1: µ1 ≠ µ2 ("the two population means are not equal")
OR

H0: µ1 - µ2 = 0 ("the difference between the two population means is equal to 0")
H1: µ1 - µ2 ≠ 0 ("the difference between the two population means is not 0")

where µ1 and µ2 are the population means for group 1 and group 2, respectively. Notice that
the second set of hypotheses can be derived from the first set by simply subtracting µ2 from
both sides of the equation.

Levene’s Test for Equality of Variances

Recall that the Independent Samples t Test requires the assumption of homogeneity of
variance  -- i.e., both groups have the same variance. SPSS conveniently includes a test for
the homogeneity of variance, called Levene's Test, whenever you run an independent
samples t test.

The hypotheses for Levene’s test are: 

H0: σ1² - σ2² = 0 ("the population variances of group 1 and 2 are equal")
H1: σ1² - σ2² ≠ 0 ("the population variances of group 1 and 2 are not equal")

Rejecting the null hypothesis of Levene's Test therefore suggests that the variances of the two groups are not equal, i.e., that the homogeneity of variances assumption is violated.

The output in the Independent Samples Test table includes two rows: Equal variances
assumed and Equal variances not assumed. If Levene’s test indicates that the variances are
equal across the two groups (i.e., p-value large), you will rely on the first row of
output, Equal variances assumed, when you look at the results for the actual Independent
Samples t Test (under the heading t-test for Equality of Means). If Levene’s test indicates that
the variances are not equal across the two groups (i.e., p-value small), you will need to rely
on the second row of output, Equal variances not assumed, when you look at the results of
the Independent Samples t Test (under the heading t-test for Equality of Means). 

The difference between these two rows of output lies in the way the independent
samples t test statistic is calculated. When equal variances are assumed, the calculation uses
pooled variances; when equal variances cannot be assumed, the calculation utilizes un-pooled
variances and a correction to the degrees of freedom.

Test Statistic

The test statistic for an Independent Samples t Test is denoted t. There are actually two forms
of the test statistic for this test, depending on whether or not equal variances are assumed.
SPSS produces both forms of the test, so both forms of the test are described here. Note that
the null and alternative hypotheses are identical for both forms of the test statistic.

EQUAL VARIANCES ASSUMED


When the two independent samples are assumed to be drawn from populations with identical population variances (i.e., σ1² = σ2²), the test statistic t is computed as:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

with

s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}

Where

x̄1 = Mean of first sample
x̄2 = Mean of second sample
n1 = Sample size (i.e., number of observations) of first sample
n2 = Sample size (i.e., number of observations) of second sample
s1 = Standard deviation of first sample
s2 = Standard deviation of second sample
sp = Pooled standard deviation

The calculated t value is then compared to the critical t value from the t distribution table
with degrees of freedom df = n1 + n2 - 2 and chosen confidence level. If the calculated t value
is greater than the critical t value, then we reject the null hypothesis.
Note that this form of the independent samples t test statistic assumes equal variances.

Because we assume equal population variances, it is OK to "pool" the sample variances (sp).
However, if this assumption is violated, the pooled variance estimate may not be accurate,
which would affect the accuracy of our test statistic (and hence, the p-value).

EQUAL VARIANCES NOT ASSUMED


When the two independent samples are assumed to be drawn from populations with unequal variances (i.e., σ1² ≠ σ2²), the test statistic t is computed as:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

where

x̄1 = Mean of first sample
x̄2 = Mean of second sample
n1 = Sample size (i.e., number of observations) of first sample
n2 = Sample size (i.e., number of observations) of second sample
s1 = Standard deviation of first sample
s2 = Standard deviation of second sample

The calculated t value is then compared to the critical t value from the t distribution table with degrees of freedom

df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{1}{n_1 - 1}\left( \frac{s_1^2}{n_1} \right)^2 + \frac{1}{n_2 - 1}\left( \frac{s_2^2}{n_2} \right)^2}

and chosen confidence level. If the calculated t value > critical t value, then we reject the null hypothesis.

Note that this form of the independent samples t test statistic does not assume equal
variances. This is why both the denominator of the test statistic and the degrees of freedom of
the critical value of t are different than the equal variances form of the test statistic.

Data Set-Up
Your data should include two variables (represented in columns) that will be used in the
analysis. The independent variable should be categorical and include exactly two groups.
(Note that SPSS restricts categorical indicators to numeric or short string values only.) The
dependent variable should be continuous (i.e., interval or ratio).

SPSS can only make use of cases that have nonmissing values for the independent and the
dependent variables, so if a case has a missing value for either variable, it cannot be included
in the test.

Run an Independent Samples t Test

To run an Independent Samples t Test in SPSS, click Analyze > Compare Means >
Independent-Samples T Test.

The Independent-Samples T Test window opens where you will specify the variables to be
used in the analysis. All of the variables in your dataset appear in the list on the left side.
Move variables to the right by selecting them in the list and clicking the blue arrow buttons.
You can move a variable(s) to either of two areas: Grouping Variable or Test Variable(s).

A Test Variable(s): The dependent variable(s). This is the continuous variable whose
means will be compared between the two groups. You may run multiple t tests
simultaneously by selecting more than one test variable.
B Grouping Variable: The independent variable. The categories (or groups) of the
independent variable will define which samples will be compared in the t test. The grouping
variable must have at least two categories (groups); it may have more than two categories but
a t test can only compare two groups, so you will need to specify which two groups to
compare. You can also use a continuous variable by specifying a cut point to create two
groups (i.e., values at or above the cut point and values below the cut point).

C Define Groups: Click Define Groups to define the category indicators (groups) to use in
the t test. If the button is not active, make sure that you have already moved your independent
variable to the right in the Grouping Variable field. You must define the categories of your
grouping variable before you can run the Independent Samples t Test procedure.

D Options: The Options section is where you can set your desired confidence level for the
confidence interval for the mean difference, and specify how SPSS should handle missing
values.

When finished, click OK to run the Independent Samples t Test, or click Paste to have the
syntax corresponding to your specified settings written to an open syntax window. (If you do
not have a syntax window open, a new window will open for you.)

DEFINE GROUPS
Clicking the Define Groups button (C) opens the Define Groups window:
1 Use specified values: If your grouping variable is categorical, select Use specified
values. Enter the values for the categories you wish to compare in the Group 1 and Group
2 fields. If your categories are numerically coded, you will enter the numeric codes. If your
group variable is string, you will enter the exact text strings representing the two categories.
If your grouping variable has more than two categories (e.g., takes on values of 1, 2, 3, 4),
you can specify two of the categories to be compared (SPSS will disregard the other
categories in this case).

Note that when computing the test statistic, SPSS will subtract the mean of Group 2 from the mean of Group 1. Changing the order of the subtraction affects the sign of the results, but
does not affect the magnitude of the results.

2 Cut point: If your grouping variable is numeric and continuous, you can designate a cut
point for dichotomizing the variable. This will separate the cases into two categories based on
the cut point. Specifically, for a given cut point x, the new categories will be:

• Group 1: All cases where grouping variable ≥ x
• Group 2: All cases where grouping variable < x

Note that this implies that cases where the grouping variable is equal to the cut point itself
will be included in the "greater than or equal to" category. (If you want your cut point to be
included in a "less than or equal to" group, then you will need to use Recode into Different
Variables or use DO IF syntax to create this grouping variable yourself.) Also note that while
you can use cut points on any variable that has a numeric type, it may not make practical
sense depending on the actual measurement level of the variable (e.g., nominal categorical
variables coded numerically). Additionally, using a dichotomized variable created via a cut
point generally reduces the power of the test compared to using a non-dichotomized variable.
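As an illustration of the DO IF approach mentioned above, the sketch below builds a grouping variable whose first group is "less than or equal to" the cut point. The variable name Age, the cut point 30, and the new variable AgeGroup are all hypothetical:

* Create a two-group variable with the cut point included in the lower group.
DO IF (Age <= 30).
COMPUTE AgeGroup = 1.
ELSE IF (Age > 30).
COMPUTE AgeGroup = 2.
END IF.
EXECUTE.

The new AgeGroup variable can then be entered as the Grouping Variable (with groups 1 and 2) in the Independent Samples t Test.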

OPTIONS
Clicking the Options button (D) opens the Options window:
The Confidence Interval Percentage box allows you to specify the confidence level for a
confidence interval. Note that this setting does NOT affect the test statistic or p-value or
standard error; it only affects the computed upper and lower bounds of the confidence
interval. You can enter any value between 1 and 99 in this box (although in practice, it only
makes sense to enter numbers between 90 and 99).

The Missing Values section allows you to choose if cases should be excluded "analysis by
analysis" (i.e. pairwise deletion) or excluded listwise. This setting is not relevant if you have
only specified one dependent variable; it only matters if you are entering more than one
dependent (continuous numeric) variable. In that case, excluding "analysis by analysis" will
use all nonmissing values for a given variable. If you exclude "listwise", it will only use the
cases with nonmissing values for all of the variables entered. Depending on the amount of
missing data you have, listwise deletion could greatly reduce your sample size.

Example: Independent samples T test when variances are not equal

PROBLEM STATEMENT
In our sample dataset, students reported their typical time to run a mile, and whether or not
they were an athlete. Suppose we want to know if the average time to run a mile is different
for athletes versus non-athletes. This involves testing whether the sample means for mile time
among athletes and non-athletes in your sample are statistically different (and by extension,
inferring whether the means for mile times in the population are significantly different
between these two groups). You can use an Independent Samples t Test to compare the mean
mile time for athletes and non-athletes.

The hypotheses for this example can be expressed as:


H0: µnon-athlete - µathlete  = 0 ("the difference of the means is equal to zero")
H1: µnon-athlete - µathlete  ≠ 0 ("the difference of the means is not equal to zero")

where µathlete and µnon-athlete are the population means for athletes and non-athletes, respectively.

In the sample data, we will use two variables: Athlete and MileMinDur. The
variable Athlete has values of either “0” (non-athlete) or "1" (athlete). It will function as the
independent variable in this T test. The variable MileMinDur is a numeric duration variable
(h:mm:ss), and it will function as the dependent variable. In SPSS, the first few rows of data
look like this:

BEFORE THE TEST


Before running the Independent Samples t Test, it is a good idea to look at descriptive
statistics and graphs to get an idea of what to expect. Running Compare Means (Analyze >
Compare Means > Means) to get descriptive statistics by group tells us that the standard
deviation in mile time for non-athletes is about 2 minutes; for athletes, it is about 49 seconds.
This corresponds to a variance of 14803 seconds² for non-athletes, and a variance of 2447 seconds² for athletes¹. Running the Explore procedure (Analyze > Descriptives > Explore) to
obtain a comparative boxplot yields the following graph:
If the variances were indeed equal, we would expect the total length of the boxplots to be
about the same for both groups. However, from this boxplot, it is clear that the spread of
observations for non-athletes is much greater than the spread of observations for athletes.
Already, we can estimate that the variances for these two groups are quite different. It should
not come as a surprise if we run the Independent Samples t Test and see that Levene's Test is
significant.
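A syntax sketch for the comparative boxplot described above, using the Explore procedure with the variables from this example:

EXAMINE VARIABLES=MileMinDur BY Athlete
/PLOT=BOXPLOT
/STATISTICS=NONE
/NOTOTAL.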

Additionally, we should also decide on a significance level (typically denoted using the Greek
letter alpha, α) before we perform our hypothesis tests. The significance level is the threshold
we use to decide whether a test result is significant. For this example, let's use α = 0.05.

¹ When computing the variance of a duration variable (formatted as hh:mm:ss or mm:ss or mm:ss.s), SPSS converts the standard deviation value to seconds before squaring.

RUNNING THE TEST


To run the Independent Samples t Test:

1. Click Analyze > Compare Means > Independent-Samples T Test.


2. Move the variable Athlete to the Grouping Variable field, and move the
variable MileMinDur to the Test Variable(s) area. Now Athlete is defined as the
independent variable and MileMinDur is defined as the dependent variable.
3. Click Define Groups, which opens a new window. Use specified values is selected
by default. Since our grouping variable is numerically coded (0 = "Non-athlete", 1 =
"Athlete"), type “0” in the first text box, and “1” in the second text box. This indicates
that we will compare groups 0 and 1, which correspond to non-athletes and athletes,
respectively. Click Continue when finished.
4. Click OK to run the Independent Samples t Test. Output for the analysis will display
in the Output Viewer window. 

SYNTAX

T-TEST GROUPS=Athlete(0 1)
/MISSING=ANALYSIS
/VARIABLES=MileMinDur
/CRITERIA=CI(.95).

OUTPUT
TABLES

Two sections (boxes) appear in the output: Group Statistics and Independent Samples
Test. The first section, Group Statistics, provides basic information about the group
comparisons, including the sample size (n), mean, standard deviation, and standard error for
mile times by group. In this example, there are 166 athletes and 226 non-athletes. The mean
mile time for athletes is 6 minutes 51 seconds, and the mean mile time for non-athletes is 9
minutes 6 seconds.

The second section, Independent Samples Test, displays the results most relevant to the
Independent Samples t Test. There are two parts that provide different pieces of information:
(A) Levene’s Test for Equality of Variances and (B) t-test for Equality of Means.
A Levene's Test for Equality of Variances: This section has the test results for Levene's Test. From left to right:

• F is the test statistic of Levene's test
• Sig. is the p-value corresponding to this test statistic.

The p-value of Levene's test is printed as ".000" (but should be read as p < 0.001 -- i.e., p very small), so we reject the null of Levene's test and conclude that the variance in
mile time of athletes is significantly different than that of non-athletes. This tells us that we
should look at the "Equal variances not assumed" row for the t test (and corresponding
confidence interval) results. (If this test result had not been significant -- that is, if we had
observed p > α -- then we would have used the "Equal variances assumed" output.)

B t-test for Equality of Means provides the results for the actual Independent
Samples t Test. From left to right:

• t is the computed test statistic
• df is the degrees of freedom
• Sig (2-tailed) is the p-value corresponding to the given test statistic and degrees of freedom
• Mean Difference is the difference between the sample means; it also corresponds to the numerator of the test statistic
• Std. Error Difference is the standard error; it also corresponds to the denominator of the test statistic

Note that the mean difference is calculated by subtracting the mean of the second group from
the mean of the first group. In this example, the mean mile time for athletes was subtracted
from the mean mile time for non-athletes (9:06 minus 6:51 = 02:14). The sign of the mean
difference corresponds to the sign of the t value. The positive t value in this example
indicates that the mean mile time for the first group, non-athletes, is significantly greater than
the mean for the second group, athletes.
The associated p value is printed as ".000"; double-clicking on the p-value will reveal the unrounded number. SPSS rounds p-values to three decimal places, so any p-value too small to round up to .001 will print as .000. (In this particular example, the p-values are on the order of 10⁻⁴⁰.)
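As a rough check on the degrees of freedom in the "Equal variances not assumed" row, plugging the approximate group variances reported earlier (about 14803 seconds² for the 226 non-athletes and about 2447 seconds² for the 166 athletes) into the Welch formula gives:

df \approx \frac{\left( \frac{14803}{226} + \frac{2447}{166} \right)^2}{\frac{1}{225}\left( \frac{14803}{226} \right)^2 + \frac{1}{165}\left( \frac{2447}{166} \right)^2} = \frac{(65.50 + 14.74)^2}{19.07 + 1.32} \approx 316

which is consistent with the df of 315.846 reported in the output.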

C Confidence Interval of the Difference: This part of the t-test output complements the
significance test results. Typically, if the CI for the mean difference contains 0, the results are
not significant at the chosen significance level. In this example, the 95% CI is [01:57, 02:32],
which does not contain zero; this agrees with the small p-value of the significance test.

DECISION AND CONCLUSIONS


Since p < .001 is less than our chosen significance level α = 0.05, we can reject the null hypothesis, and conclude that the mean mile time for athletes and non-athletes is significantly different.

Based on the results, we can state the following:

• There was a significant difference in mean mile time between non-athletes and athletes (t(315.846) = 15.047, p < .001).
• The average mile time for athletes was 2 minutes and 14 seconds faster than the average mile time for non-athletes.
_____________________________________________________________

Pearson Correlation

Pearson Correlation

The bivariate Pearson Correlation produces a sample correlation coefficient, r, which
measures the strength and direction of linear relationships between pairs of continuous
variables. By extension, the Pearson Correlation evaluates whether there is statistical
evidence for a linear relationship among the same pairs of variables in the population,
represented by a population correlation coefficient, ρ (“rho”). The Pearson Correlation is a
parametric measure.

This measure is also known as:


• Pearson’s correlation
• Pearson product-moment correlation (PPMC)

Common Uses

The bivariate Pearson Correlation is commonly used to measure the following:

• Correlations among pairs of variables
• Correlations within and between sets of variables

The bivariate Pearson correlation indicates the following:

• Whether a statistically significant linear relationship exists between two continuous variables
• The strength of a linear relationship (i.e., how close the relationship is to being a perfectly straight line)
• The direction of a linear relationship (increasing or decreasing)

Note: The bivariate Pearson Correlation cannot address non-linear relationships or
relationships among categorical variables. If you wish to understand relationships that
involve categorical variables and/or non-linear relationships, you will need to choose another
measure of association.

Note: The bivariate Pearson Correlation only reveals associations among continuous
variables. The bivariate Pearson Correlation does not provide any inferences about causation,
no matter how large the correlation coefficient is.

Data Requirements

Your data must meet the following requirements:

1. Two or more continuous variables (i.e., interval or ratio level)


2. Cases that have values on both variables
3. Linear relationship between the variables
4. Independent cases (i.e., independence of observations)
   • There is no relationship between the values of variables between cases. This means that:
      • the values for all variables across cases are unrelated
      • for any case, the value for any variable cannot influence the value of any variable for other cases
      • no case can influence another case on any variable
   • The bivariate Pearson correlation coefficient and corresponding significance test are not robust when independence is violated.
5. Bivariate normality
   • Each pair of variables is bivariately normally distributed
   • Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
   • This assumption ensures that the variables are linearly related; violations of this assumption may indicate that non-linear relationships among variables exist. Linearity can be assessed visually using a scatterplot of the data.
6. Random sample of data from the population
7. No outliers

Hypotheses

The null hypothesis (H0) and alternative hypothesis (H1) of the significance test for correlation
can be expressed in the following ways, depending on whether a one-tailed or two-tailed test
is requested:

Two-tailed significance test:

H0: ρ = 0 ("the population correlation coefficient is 0; there is no association")


H1: ρ ≠ 0 ("the population correlation coefficient is not 0; a nonzero correlation could exist")

One-tailed significance test:

H0: ρ = 0 ("the population correlation coefficient is 0; there is no association")


H1: ρ  > 0 ("the population correlation coefficient is greater than 0; a positive correlation
could exist")
     OR
H1: ρ  < 0 ("the population correlation coefficient is less than 0; a negative correlation could
exist")

where ρ is the population correlation coefficient.

Test Statistic

The sample correlation coefficient between two variables x and y is denoted r or rxy, and can
be computed as:

r_{xy} = \frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)}\,\sqrt{\mathrm{var}(y)}}

where cov(x, y) is the sample covariance of x  and y; var(x) is the sample variance of x; and
var(y) is the sample variance of y.
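For instance, with made-up values cov(x, y) = 15, var(x) = 25, and var(y) = 16, the formula gives:

r_{xy} = \frac{15}{\sqrt{25}\,\sqrt{16}} = \frac{15}{20} = 0.75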
Correlation can take on any value in the range [-1, 1]. The sign of the correlation coefficient
indicates the direction of the relationship, while the magnitude of the correlation (how close it
is to -1 or +1) indicates the strength of the relationship.

• −1 : perfectly negative linear relationship
•  0 : no relationship
• +1 : perfectly positive linear relationship

The strength can be assessed by these general guidelines [1] (which may vary by discipline):

• .1 < | r | < .3 … small / weak correlation
• .3 < | r | < .5 … medium / moderate correlation
• .5 < | r | ……… large / strong correlation

Note: The direction and strength of a correlation are two distinct properties. The scatterplots below [2] show correlations that are r = +0.90, r = 0.00, and r = -0.90, respectively. The strength of the nonzero correlations is the same: 0.90. But the direction of the correlations is different: a negative correlation corresponds to a decreasing relationship, while a positive correlation corresponds to an increasing relationship.

[Scatterplots: r = -0.90, r = 0.00, r = +0.90]

Note that the r = 0.00 correlation has no discernable increasing or decreasing linear pattern in
this particular graph. However, keep in mind that Pearson correlation is only capable of
detecting linear associations, so it is possible to have a pair of variables with a strong
nonlinear relationship and a small Pearson correlation coefficient. It is good practice to create
scatterplots of your variables to corroborate your correlation coefficients.

[1]  Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

[2]  Scatterplots created in R using ggplot2, ggthemes::theme_tufte(), and MASS::mvrnorm().

Data Set-Up

Your dataset should include two or more continuous numeric variables, each defined as scale,
which will be used in the analysis.

Each row in the dataset should represent one unique subject, person, or unit. All of the
measurements taken on that person or unit should appear in that row. If measurements for one
subject appear on multiple rows -- for example, if you have measurements from different
time points on separate rows -- you should reshape your data to "wide" format before you
compute the correlations.

Run a Bivariate Pearson Correlation

To run a bivariate Pearson Correlation in SPSS, click Analyze > Correlate > Bivariate.

The Bivariate Correlations window opens, where you will specify the variables to be used in
the analysis. All of the variables in your dataset appear in the list on the left side. To select
variables for the analysis, select the variables in the list on the left and click the blue arrow
button to move them to the right, in the Variables field.
A Variables: The variables to be used in the bivariate Pearson Correlation. You must select
at least two continuous variables, but may select more than two. The test will produce
correlation coefficients for each pair of variables in this list.

B Correlation Coefficients: There are multiple types of correlation coefficients. By
default, Pearson is selected. Selecting Pearson will produce the test statistics for a bivariate
Pearson Correlation.

C Test of Significance: Click Two-tailed or One-tailed, depending on your desired
significance test. SPSS uses a two-tailed test by default.

D Flag significant correlations: Checking this option will include asterisks (**) next to statistically significant correlations in the output. By default, SPSS marks statistical significance at the alpha = 0.05 and alpha = 0.01 levels, but not at the alpha = 0.001 level (which is treated as alpha = 0.01).

E Options: Clicking Options will open a window where you can specify which Statistics to include (i.e., Means and standard deviations, Cross-product deviations and covariances) and how to address Missing Values (i.e., Exclude cases pairwise or Exclude cases listwise). Note that the pairwise/listwise setting does not affect your computations if you are only entering two variables, but can make a very large difference if you are entering three or more variables into the correlation procedure.
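In syntax, the Options choices map onto additional subcommands. A sketch (using the Weight and Height variables from the example below) that also requests means, standard deviations, and cross-product deviations and covariances, with listwise exclusion of missing values:

CORRELATIONS
/VARIABLES=Weight Height
/PRINT=TWOTAIL NOSIG
/STATISTICS DESCRIPTIVES XPROD
/MISSING=LISTWISE.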

Example: Understanding the linear association between weight and height

PROBLEM STATEMENT
Perhaps you would like to test whether there is a statistically significant linear relationship
between two continuous variables, weight and height (and by extension, infer whether the
association is significant in the population). You can use a bivariate Pearson Correlation to
test whether there is a statistically significant linear relationship between height and weight,
and to determine the strength and direction of the association.

BEFORE THE TEST


In the sample data, we will use two variables: “Height” and “Weight.” The variable “Height”
is a continuous measure of height in inches and exhibits a range of values from 55.00 to
84.41 (Analyze > Descriptive Statistics > Descriptives). The variable “Weight” is a
continuous measure of weight in pounds and exhibits a range of values from 101.71 to
350.07.

Before we look at the Pearson correlations, we should look at the scatterplots of our variables
to get an idea of what to expect. In particular, we need to determine if it's reasonable to
assume that our variables have linear relationships. Click Graphs > Legacy Dialogs >
Scatter/Dot. In the Scatter/Dot window, click Simple Scatter, then click Define. Move
variable Height to the X Axis box, and move variable Weight to the Y Axis box. When
finished, click OK.

To add a linear fit like the one depicted, double-click on the plot in the Output Viewer to
open the Chart Editor. Click Elements > Fit Line at Total. In the Properties window, make
sure the Fit Method is set to Linear, then click Apply. (Notice that adding the linear
regression trend line will also add the R-squared value in the margin of the plot. If we take
the square root of this number, it should match the value of the Pearson correlation we
obtain.)
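The legacy syntax corresponding to the simple scatterplot described above is roughly:

GRAPH
/SCATTERPLOT(BIVAR)=Height WITH Weight
/MISSING=LISTWISE.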

From the scatterplot, we can see that as height increases, weight also tends to increase. There
does appear to be some linear relationship.

RUNNING THE TEST


To run the bivariate Pearson Correlation, click Analyze > Correlate > Bivariate. Select the
variables Height and Weight and move them to the Variables box. In the Correlation
Coefficients area, select Pearson. In the Test of Significance area, select your desired
significance test, two-tailed or one-tailed. We will select a two-tailed significance test in this
example. Check the box next to Flag significant correlations.
Click OK to run the bivariate Pearson Correlation. Output for the analysis will display in the
Output Viewer.

SYNTAX

CORRELATIONS
/VARIABLES=Weight Height
/PRINT=TWOTAIL NOSIG
/MISSING=PAIRWISE.

OUTPUT
TABLES

The results will display the correlations in a table, labeled Correlations.

A Correlation of Height with itself (r=1), and the number of nonmissing observations for height (n=408).

B Correlation of height and weight (r=0.513), based on n=354 observations with pairwise nonmissing values.

C Correlation of height and weight (r=0.513), based on n=354 observations with pairwise nonmissing values.

D Correlation of weight with itself (r=1), and the number of nonmissing observations for weight (n=376).
The important cells we want to look at are either B or C. (Cells B and C are identical, because
they include information about the same pair of variables.) Cells B and C contain the
correlation coefficient for the correlation between height and weight, its p-value, and the
number of complete pairwise observations that the calculation was based on.

The correlations in the main diagonal (cells A and D) are all equal to 1. This is because a
variable is always perfectly correlated with itself. Notice, however, that the sample sizes are
different in cell A (n=408) versus cell D (n=376). This is because of missing data -- there are
more missing observations for variable Weight than there are for variable Height.

If you have opted to flag significant correlations, SPSS will mark a 0.05 significance level with one asterisk (*) and a 0.01 significance level with two asterisks (**). In cell B (repeated in cell C), we can see that the Pearson correlation coefficient for height and weight is .513, which is significant (p < .001 for a two-tailed test), based on 354 complete observations (i.e., cases with nonmissing values for both height and weight).

DECISION AND CONCLUSIONS


Based on the results, we can state the following:

• Weight and height have a statistically significant linear relationship (p < .001).
• The direction of the relationship is positive (i.e., height and weight are positively correlated), meaning that these variables tend to increase together (i.e., greater height is associated with greater weight).
• The magnitude, or strength, of the association is approximately moderate (.3 < | r | < .5).

Regression

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate

1       .513a   .263       .261                36.51344

a. Predictors: (Constant), Height


This table provides the R and R² values. The R value represents the simple correlation and is 0.513 (the "R" column), which indicates a moderate degree of correlation. The R² value (the "R Square" column) indicates how much of the total variation in the dependent variable, weight, can be explained by the independent variable, height. In this case, 26.3% can be explained, which is relatively small.

The next table is the ANOVA table, which reports how well the regression
equation fits the data (i.e., predicts the dependent variable) and is shown below:

ANOVAa

Model           Sum of Squares   df    Mean Square   F         Sig.

1  Regression   167845.288       1     167845.288    125.894   .000b
   Residual     469297.463       352   1333.231
   Total        637142.751       353

a. Dependent Variable: Weight


b. Predictors: (Constant), Height

This table indicates that the regression model predicts the dependent variable
significantly well. How do we know this? Look at the "Regression" row and go to
the "Sig." column. This indicates the statistical significance of the regression
model that was run. Here, p < 0.0005, which is less than 0.05, and indicates that,
overall, the regression model statistically significantly predicts the outcome
variable (i.e., it is a good fit for the data).
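The tables in this section are the standard output of SPSS's linear regression procedure (Analyze > Regression > Linear). A syntax sketch that would produce them, assuming Weight as the dependent variable and Height as the predictor:

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Weight
/METHOD=ENTER Height.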

The Coefficients table provides us with the necessary information to predict weight from height, as well as determine whether height contributes statistically significantly to the model (by looking at the "Sig." column). Furthermore, we can use the values in the "B" column under the "Unstandardized Coefficients" column, as shown below:

Coefficientsa

Model          Unstandardized Coefficients    Standardized Coefficients   t        Sig.
               B          Std. Error          Beta

1  (Constant)  -96.696    24.817                                          -3.896   .000
   Height      4.087      .364                .513                        11.220   .000

a. Dependent Variable: Weight



to present the regression equation as:

Weight = -96.696 + 4.087(Height)
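As a worked illustration of using this equation (for a hypothetical height of 70 inches):

Weight = -96.696 + 4.087 × 70 = -96.696 + 286.09 ≈ 189.4 pounds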
