Correlation and Regression
Regression analysis allows us to predict one variable from information we have about other variables. In this chapter, linear regression is discussed. Linear regression is a type of analysis that is performed on interval and ratio variables (labeled “scale” variables in SPSS Statistics). However, it is possible to incorporate data from variables with lower levels of measurement (i.e., nominal and ordinal variables) through the use of dummy variables. We will begin with a bivariate regression example and then add some more detail to the analysis.
BIVARIATE REGRESSION
In the case of bivariate regression, researchers are interested in predicting the value of the dependent variable, Y, from the information they have about the independent variable, X. To run the analysis, use the following menus:

Analyze → Regression → Linear . . .
The “Linear Regression” dialog box will appear. Initially, select the variables
of interest and drag them into the appropriate areas for dependent and indepen-
dent variables. The variable “REALRINC,” respondent’s actual annual income,
should be moved to the “Dependent” area, and “EDUC,” respondent’s number
of years of education, should be moved to the “Independent(s)” area. Now, simply
click “OK.” The following SPSS Statistics output will be produced:
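If you prefer syntax to the menus, a minimal equivalent of these steps (a sketch, assuming the GSS variable names used above) is:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT realrinc
  /METHOD=ENTER educ.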
In the first column of the “Model Summary” box, the output will yield Pearson’s r (in the column labeled “R”), followed in the next column by r-square (r2). SPSS Statistics also computes an adjusted r2 for those interested in using that value. R-square, like lambda, gamma, Kendall’s tau-b, and Somers’ d, is a PRE (proportional reduction in error) statistic: it reveals the proportional reduction in error achieved by introducing the independent variable(s). In this case, r2 = .083, which means that 8.3% of the variation in real annual income is explained by the variation in years of education. Although this percentage might seem low, consider that years of education is only one factor among the many that contribute to income, including major field of study, schools attended, prior and continuing experience, region of the country, gender, race/ethnicity, and so on. We will examine gender (sex) later in this chapter to demonstrate multiple regression.
ANOVA (analysis of variance) values, including the F statistic, are given in the
above table of the linear regression output.
The coefficients table reveals the actual regression coefficients for the regression equation, as well as their statistical significance. The coefficients are given in the “B” column, under “Unstandardized Coefficients.” In this case, the b value for number of years of education completed is 2,933.597. The a value, or constant, is −17,734.68. By looking in the last column (“Sig.”), you can see that both values are statistically significant (p = .000). Remember, the p value refers to the probability that the result is due to chance, so smaller numbers are better. The standard in the social sciences is usually .05; a result is deemed statistically significant if the p value is less than .05. We would write the regression equation describing the model computed by SPSS Statistics as follows:

Ŷ = −17,734.68* + 2,933.597*(X)

*Statistically significant at the p ≤ .05 level.
The coefficient in the bivariate regression model above can be interpreted to mean that each additional year of education provides a $2,933.60 predicted increase in real annual income. The constant gives the predicted real annual income when years of education is zero; however, as is often the case with a regression equation, that value may lie beyond the range of the data for reasonable prediction. In other words, if no one in the sample had zero or near-zero years of education, then the range of the data upon which the prediction was calculated did not include such cases, and we should be cautious about making predictions at those levels.
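To illustrate the equation with the coefficients above: a respondent with 16 years of education would have a predicted real annual income of Ŷ = −17,734.68 + 2,933.597(16) = $29,202.87.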
CORRELATION
Information about correlation tells us the extent to which variables are related. Below, the Pearson method of computing correlation is requested through SPSS Statistics. To examine a basic correlation between two variables, use the following menus:

Analyze → Correlate → Bivariate . . .

Because Pearson’s correlation requires interval/ratio variables, a nominal variable such as sex must first be recoded into a dichotomous dummy variable; you can then treat the nominal dichotomy as an interval/ratio variable and use it in regression and correlation analysis. Use the following menus to create the male dummy variable:
Transform → Recode into Different Variables . . .
Select SEX, and then add the name and label, as above. Now click “Old and New Values . . .” and enter the recoding scheme: old value 1 (male) becomes new value 1, and old value 2 (female) becomes new value 0.
Now, click “Continue,” and then click “OK” in the first dialog box. The new
variable, “MALE,” will be created. Be sure to do the appropriate fine-tuning for
this new variable (e.g., eliminate decimal places, because there are only two pos-
sible values this variable can take: 0 and 1) in the Variable View window.
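If you prefer syntax, a minimal equivalent of the recode (a sketch, assuming the GSS coding of SEX, where 1 = male and 2 = female) is:

* Create the male dummy from SEX.
RECODE sex (1=1) (2=0) INTO male.
VARIABLE LABELS male 'Male dummy (1 = male, 0 = female)'.
EXECUTE.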
Returning to the correlation exercise, the output that results is shown in the
following table:
Note that in the output the correlation is extremely small, −.12, and is not statistically significant (p = .513). This tells us that being male is not correlated with having completed a greater number of years of education.
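For reference, a syntax version of this bivariate correlation (again a sketch, using the variable names assumed above) is:

CORRELATIONS
  /VARIABLES=male educ
  /PRINT=TWOTAIL
  /MISSING=PAIRWISE.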
It is also possible to produce partial correlations. Suppose you are interested in examining the correlation between occupational prestige and education. Further suppose you wish to determine the way that sex affects that correlation. Use the following menus:

Analyze → Correlate → Partial . . .

In the dialog box that appears, move the variables of interest into the “Variables” area and the control variable (here, the sex dummy) into the “Controlling for” area; then click “OK.”
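A syntax sketch of the same request (assuming the GSS occupational prestige variable is PRESTG10, as in recent GSS files, and using the MALE dummy created above) is:

PARTIAL CORR
  /VARIABLES=prestg10 educ BY male
  /SIGNIFICANCE=TWOTAIL
  /MISSING=LISTWISE.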
Here, the partial correlation is noteworthy, at .302, and is statistically significant (p = .000). This is indicative of a relationship between education and occupational prestige, even when controlling for sex. Information of this kind can be useful when selecting variables for use in regression analysis.
MULTIPLE REGRESSION
Because linear regression requires interval-ratio variables, one must take care
when incorporating variables such as sex, race/ethnicity, religion, and the like. By
creating dummy variables from the categories of these nominal variables, you can
add this information to the regression equation.
To do this, use the recode function (for more information about recod-
ing variables, see Chapter 2, “Transforming Variables”). Create a dichotomous
variable for all but one category, the “omitted” comparison category or attribute, and insert each of those dichotomies into the “Independent(s)” area. The variable SEX has only two categories, so the recoding just changes this to one variable: “MALE.” (Alternatively, you could have changed it to “FEMALE.”) The coding should be binary: 1 for affirmation of the attribute, 0 for respondents not possessing the attribute. Now, as was entered into the previous dialog box, just select the new recoded variable, “MALE,” from the variable bank on the left and drag it into the “Independent(s)” area on the right. You may need to set the variable property to scale in the Variable View tab of the Data Editor window so that SPSS Statistics will allow that variable to be included in the regression analysis. Newer versions of SPSS Statistics track variable types and often will not allow you to include variables with lower levels of measurement in analyses requiring variables with higher levels of measurement.
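For a nominal variable with more than two categories, the same logic yields one dummy per non-omitted category. A hypothetical syntax sketch for the GSS variable RACE (assuming its coding of 1 = white, 2 = black, 3 = other, with “white” as the omitted comparison category) is:

* Create one dummy for each non-omitted category.
RECODE race (2=1) (1,3=0) INTO black.
RECODE race (3=1) (1,2=0) INTO other.
EXECUTE.

Both new dummies would then be entered into the “Independent(s)” area, and their coefficients would be interpreted relative to the omitted category.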
After recoding as necessary and dragging your variables of interest into their
respective areas, click the “Plots . . .” button, and you will be shown the “Linear
Regression: Plots” dialog box:
Under “Standardized Residual Plots,” check “Histogram” and “Normal probability plot”; then click “Continue.”
When you are returned to the “Linear Regression” dialog box, select the
“Statistics . . .” button. The following dialog box will appear:
There are a number of options, including descriptive statistics, that you may select to be included in the SPSS Statistics linear regression output. For now, select “Descriptives” along with the defaults, as shown, and click “Continue” in this box; then click “OK” when returned to the “Linear Regression” dialog box.
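The full model can also be requested in syntax; a sketch under the same variable-name assumptions, including the descriptive statistics and residual plots selected above, is:

REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT realrinc
  /METHOD=ENTER educ male
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).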
Below you will find tables from the SPSS Statistics output that results. The first table reports the descriptive statistics that were requested. The next two tables give the same sort of information as before in the bivariate regression case: Pearson’s r (correlation coefficient), r2 (PRE), and ANOVA (analysis of variance) values.
In this case, r2 = .115, which means that 11.5% of the variation in respondents’ real annual income (“REALRINC”) is explained by the variation in the independent variables: years of education (“EDUC”) and sex (“MALE”).
The “Coefficients” table (below), again, provides the information that can be used to construct the regression model and equation. Note that the dummy variable, “MALE,” was not statistically significant.
The charts that follow, requested through the “Plots” dialog box, display the histogram of standardized residuals and the normal probability for the dependent variable, real annual income.
[Histogram of the regression standardized residual. Dependent Variable: R’s income in constant $. Mean = −2.21E−16, Std. Dev. = 0.999, N = 1,631.]
[Normal probability (P-P) plot of the regression standardized residual, plotting observed against expected cumulative probability (“Expected Cum Prob”).]
It is possible to add further variables to your linear regression model, such as those in the dialog box featured below. Interval-ratio variables may be included, as well as dummy variables, along with others such as interaction variables. Interaction variables may be computed using the compute function (in the “Transform” menu). More information about computing variables can be found in Chapter 2, “Transforming Variables.” The computation would consist of: Variable 1 × Variable 2 = Interaction Variable.
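A syntax sketch of such a computation (using a hypothetical interaction between the education variable and the male dummy created above) is:

* Compute the interaction term as the product of its components.
COMPUTE educ_male = educ * male.
EXECUTE.

The new variable EDUC_MALE could then be entered into the “Independent(s)” area alongside its component variables.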
Access the full 2016 data file and the 1972–2016 Cumulative Codebook at
the student study site: study.sagepub.com/wagner7e.