
The Regression Coefficient
The following is a discussion of the regression
coefficient without all the formulas
Correlation and Regression
• The goal of explanatory research is to tell us why things exist
as they do. The use of true experimental designs is a powerful
method for conducting causal research. There are however
many types of social phenomena for which it is not practical or
ethical to conduct social experiments.

• Thus we use statistical techniques such as regression to examine causality.
Why Am I Learning This?
• Relationships among variables are important because they are
fundamental to scientific theories and they have many
practical uses such as making predictions.

• We will learn about the common statistic for measuring the degree of linear relationship: the correlation coefficient, how to represent a relationship using a line, and how to use that information to make predictions.

• Before defining causality, we shall define the difference between simple or bivariate regression and multivariate regression.
Definitions
• Simple regression: only one explanatory variable (1 x)

• Multivariate regression: more than one explanatory variable (2 or more x’s)

• What is causality? Most conceptions of causality involve the idea that one event produces another or that certain occurrences follow from other occurrences. There is a sense of agency present in this notion and a sense of connection between events. Certain events are responsible for the occurrence of other phenomena.
Definitions
• The essence of causality may be captured by the notion of
manipulation. If one could intervene without changing the
surrounding circumstances and make a change in the first
thing, a change in the second thing would follow from the
original manipulation.
What are the criteria for causality?
• Association: The first criterion for causality is that an association must exist between the presumed cause and its effect. If two variables do not covary, neither can be considered a candidate to exert causal influence on the other.

• Variables are related to one another in varying degrees, which underscores the probabilistic nature of causality. A perfect positive relationship exists when we observe a one-to-one correspondence between the two variables being explored.
What are the criteria for causality?
• Temporal Order: For variable A to be considered a causal
candidate for the occurrence of B, it must occur before B in
time. Temporal order in the social and behavioral sciences is
often obvious, but because of the feedback nature of many of
the things we study, the order is not always easy to determine.

• Spuriousness: The third criterion is that the relationship must not statistically disappear when the influence of other variables is considered.

• Necessary Cause: A necessary cause or condition is one that must be present for an effect to follow.
What are the criteria for causality?
• Sufficient Cause: A sufficient cause is a cause or condition that
by itself is able to produce an event.

• Necessary and Sufficient Cause: A cause is a necessary and sufficient cause if, and only if, it must be present for an effect to occur and needs no help from other variables. Note: We can never satisfy the necessary and sufficient criterion of causality and we never will (Walsh and Ollenburger 2001).
Explanation and Interpretation
• An outcome is called an explanation when the original
relationship is explained away by an antecedent variable, that
is, a variable that precedes both the dependent and
independent variables in time.

• Interpretation occurs when the initial bivariate relationship is rendered insignificant by an intervening variable. Because the introduction of an intervening variable specifies a process by which the independent variable affects the dependent variable, an interpretation outcome is not a spurious one in the sense that an explanation outcome is. The independent variable “causes” the intervening control variable, which “causes” the dependent variable.
Explanation and Interpretation
• The difference between explanation and interpretation lies in the theoretical assumptions about the time ordering of the test variable. If a test variable exerts influence before both the independent and dependent variable, it is antecedent; if it exerts influence after the independent variable but before the dependent variable, it is intervening.

• To examine whether there are causal relationships, we use regression techniques.
Pearson correlation
coefficient
• The Pearson correlation coefficient--usually represented by
the symbol r--measures the linear relationship between two
variables.
• Values of the correlation coefficient are always between -1 and
+1, inclusive. The value r = 0 indicates no relationship between
the two variables.
• Positive values of r imply that higher values on one variable are
associated with higher values on the other variable.
• Negative values of r imply that higher values on one variable are
associated with lower values on the other variable.
• The value r= +1 indicates a perfect positive linear relationship and
r = -1 indicates a perfect negative linear relationship between the
two variables.
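As a concrete illustration, here is a minimal sketch (not from the slides; the height and weight values are made up) of computing Pearson's r in Python with SciPy:

```python
# Minimal sketch: Pearson's r for two small, hypothetical samples.
import numpy as np
from scipy import stats

height = np.array([160, 165, 170, 175, 180, 185])  # cm (made-up values)
weight = np.array([55, 60, 66, 70, 78, 83])         # kg (made-up values)

r, p_value = stats.pearsonr(height, weight)
print(f"r = {r:.3f}, p = {p_value:.4f}")  # r near +1: strong positive linear relationship
```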
Pearson correlation
coefficient
• If the scores on the two variables tend to go up together, there
is a positive relationship between the two variables.
• In positive relationships, if the score on one of the variables is
high, the corresponding score on the other variable tends to
be high too.
• For example, height and weight of people have a positive
relationship: taller people generally are heavier than shorter
people and heavier people are generally taller than lighter
people.
• When the relationship is positive, the correlation coefficient
has a positive sign.
Pearson correlation
coefficient
• If the scores on the two variables tend to go in opposite
directions, there is a negative relationship between the two
variables.
• In negative relationships, if the score on one of the variables is
high, the corresponding score on the other variable tends to
be low.
• For example, speed of doing a task and the accuracy with which the task is done have a negative relationship: at high speeds accuracy tends to be lower, and at low speeds accuracy tends to be higher.
• When the relationship is negative, the correlation coefficient
has a negative sign.
Pearson correlation
coefficient
• Above and Below the Mean
• In a positive relationship, if the score on one of the two
variables is above the mean, then the score on the other
variable is also likely to be above the mean. And of course, if
the score on one of the two variables is below the mean, then
the score on the other variable is also likely to be below the
mean.
• For example, someone who is taller than average is also likely to
be heavier than average and someone shorter than average is
also likely to be lighter than average.
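This "above and below the mean" idea can be made concrete: Pearson's r equals the average product of z-scores, so pairs that sit on the same side of their means push r toward +1. A minimal sketch, reusing the made-up height and weight values from the earlier example:

```python
# Minimal sketch: r as the mean product of z-scores (population SD, ddof=0).
import numpy as np

height = np.array([160, 165, 170, 175, 180, 185], dtype=float)  # made-up values
weight = np.array([55, 60, 66, 70, 78, 83], dtype=float)

z_height = (height - height.mean()) / height.std()
z_weight = (weight - weight.mean()) / weight.std()

r = np.mean(z_height * z_weight)  # equals the Pearson correlation coefficient
print(round(r, 3))
```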
Equation for a Line
• While the correlation coefficient is useful for telling us
whether two variables are correlated, it does not describe the
nature of the relationship between the two variables. Often
we know or strongly suspect that two variables are related;
what we want to know is precisely how they are related.
• For example, it is not surprising that there is a positive relationship between an automobile's speed and its stopping distance on dry pavement. What we want to know is how much stopping distance increases with each speed increase of, say, 10 mph.
Equation for a Line
• Lines are very useful for describing relationships between two
variables. Some relationships are much more complicated
than lines, but lines always are a useful starting point and are
often all we need for many relationships.
• Before we see how lines are used to model relationships
between variables, we will first review the basis of lines and
how they work. If you remember from an algebra or geometry
course how lines work, you may skip ahead to estimating
slopes.
• However, a quick review is good for everyone.
Equation for a Line
The equation for a line is: Y = a + bX

Two variables are related by the two parameters in the equation: a is the intercept and b is the slope. Changing these parameters moves the line up or down and changes its steepness.
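A minimal sketch (with hypothetical values of a and b) of how the two parameters determine the line:

```python
# Minimal sketch: evaluating Y = a + bX for a few values of X.
a, b = 2.0, 0.5            # hypothetical intercept and slope
for x in [0, 1, 2, 3, 4]:
    y = a + b * x
    print(f"x = {x}: y = {y}")   # a shifts the line up/down; b sets its steepness
```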
Equation for a Line
Relating Two Variables
• For our purposes, the most useful property of a line is that it relates each possible value of one variable with a particular value of the other variable: picking a value of x and reading up to the line gives the corresponding value of y.

Typical Line
• Earlier, when we were describing one variable, we wanted to
have a model that represented the typical value. That model
was a single value. When we have two variables, we want to
find the line that is typical of or best represents all the
observations.
• In the scatterplot on the next slide, does it appear that there is
a linear relationship?
Equation for a Line
Equation for a Line
• If you said yes, then you were correct.
B Std. Error t Sig.
(Constant) 1.395 .835 1.672 .101
Poverty Rate: 2008 .698 .063 11.034 .000
Dependent Variable: Births to Teenage Mothers as Percent of All
Births: 2007
• Looking at the effect of the independent variable, poverty, on the dependent variable, there is a positive relationship (.698), and according to the t-test (11.034) it is statistically significant (.000)
• The coefficient for poverty (.698) suggests that as
poverty increases so does teen pregnancy.
Equation for a Line
• If you said yes, then you were correct.
B Std. Error t Sig.
(Constant) 1.395 .835 1.672 .101
Poverty Rate: 2008 .698 .063 11.034 .000
Dependent Variable: Births to Teenage Mothers as Percent of All Births:
2007
• Back to the regression formula: Y=a + bx
• The constant is for a, or the intercept
• Y= a + b(x)
• Y= 1.395 + poverty rate (.698)
• So you could enter the values to calculate Y
• However, it is important to remember that there is error in these estimates: a standard error of .835 for the intercept and .063 for the independent variable
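A minimal sketch of plugging values into this estimated equation (the poverty rates below are hypothetical, chosen only to illustrate the calculation):

```python
# Minimal sketch: predicted teen-birth percentage = 1.395 + 0.698 * poverty rate.
intercept, slope = 1.395, 0.698
for poverty_rate in [5, 10, 15, 20]:      # hypothetical poverty rates (%)
    predicted = intercept + slope * poverty_rate
    print(f"poverty {poverty_rate}%: predicted teen births = {predicted:.1f}%")
# The standard errors (.835 for the intercept, .063 for the slope) mean these
# point predictions carry uncertainty.
```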
Equation for a Line
• Outliers: Describing data with a line--making a linear model of
the data--is a powerful and useful statistical technique.
However, there are important cautions to keep in mind when
calculating correlations and when using regression to estimate
slopes.
• Other Than Linear Relationships: As noted many times in this chapter, there is a link between the correlation and linear relationships. If there is a linear relationship--one that can be represented by a straight line--the correlation will be high. If the correlation is low, however, then there is no evidence of a linear relationship between the variables, although there may be other kinds of relationships.
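A minimal sketch (made-up data) of this caution: a perfect but U-shaped relationship can have a Pearson correlation of essentially zero.

```python
# Minimal sketch: a strong nonlinear (quadratic) relationship with r near 0.
import numpy as np

x = np.linspace(-3, 3, 61)
y = x ** 2                       # y is perfectly determined by x, but not linearly
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))               # approximately 0
```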
Caution Inferring Cause
• When we observe a relationship between two variables, it is
frequently tempting to infer that one variable is causing the other
one.
• For example, suppose a medical survey reports a negative
correlation between the amount of red meat eaten and age of
death--the more red meat eaten the younger the age of death, on
average.
• It is tempting to conclude that eating the red meat causes people to
die younger of various diseases. Although a causal link is one
possibility, there are other plausible alternative explanations. It may
be that some other variable, related to both, is actually the cause.
• For example, eating red meat might be associated with getting less
exercise and it might actually be the lack of exercise that is the
cause of the later health problems. The causal ambiguity is more
obvious in the following true example.
Caution Inferring Cause
• In any major city there is a positive correlation between the
monthly sales of ice cream cones and the monthly number of
suicides.
• It seems farfetched to propose that eating ice cream causes
suicides, but what is going on? We don't really believe that
banning ice cream sales would reduce suicide rates.
• Most likely, a third variable--average monthly temperature--is
responsible for the relationship between ice cream and
suicides. Hotter temperatures may cause all sorts of people to
eat more ice cream and, unfortunately, they also may cause
some depressed people to become even more desperate.
Caution Inferring Cause
• Many textbooks emphasize: Correlation does not imply
causation.
• Although following that rule will generally get you in less
trouble than its opposite, it is too strong a statement.
• If one of the variables is under the control of the researcher,
then correlation does imply causation.
• For example, if a researcher deliberately varies the stress applied to metal ingots and records how long they last, the correlation in that case does allow the researcher to conclude that increasing the stress level causes ingots to fail sooner. But if instead both variables are simply observed and not controlled, then remember: Correlation does not imply causation.
• Correlation and regression measure and describe the co-
relationship between two variables.
MAIN POINTS
• Control
• Although two variables may be associated, they are not
necessarily causally related. By using methods that control other
factors, researchers are able to obtain evidence about whether
an independent variable has a causal influence on a dependent
variable.

• In quasi-experimental designs, statistical techniques substitute for the experimental method of control. These techniques are employed during data analysis rather than at the data collection stage. There are three methods of statistical control: cross-tabulation, partial correlation, and multiple regression.
MAIN POINTS
• Elaboration
• Elaboration analysis involves considering the nature of the
effect of a third variable on a bivariate relationship. If the
third variable intervenes between the independent and
dependent variables, and the original relationship changes
under conditions of the third variable, the third variable
clarifies how the variables are related. If the third variable
precedes both the independent and the dependent variable,
and the original relationship changes under conditions of the
third variable, the result specifies the condition under which
the relationship exists.
MAIN POINTS
• Elaboration, cont.
• Partial correlation is a statistical method for controlling the
effects of a third variable on a bivariate relationship. The
partial correlation coefficient measures the extent to which
two interval variables are related. This method can be
extended to simultaneously remove the effects of several
variables if they have been measured and are interval‑level
variables.

• Multiple regression is a simple extension of bivariate regression allowing for an assessment of the relationship between two variables while controlling for the effect of others.
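The first-order partial correlation described above can be computed from the three pairwise correlations. A minimal sketch with simulated data (the variables and their dependence on z are made up for illustration):

```python
# Minimal sketch: partial correlation of x and y, controlling for z.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200)                  # the control (third) variable
x = z + rng.normal(scale=0.5, size=200)   # x and y both depend on z
y = z + rng.normal(scale=0.5, size=200)

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]

r_xy_given_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(round(r_xy, 3), round(r_xy_given_z, 3))  # the bivariate r is sizable; the partial is near zero
```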
MAIN POINTS
• Multivariate Analysis: Multiple Relationships
• Because there are usually several determinants for any
dependent variable, social scientists often use a method called
multiple regression analysis to specify how a set of independent
variables in combination influence a dependent variable. To
examine the combined effect of all the independent variables, the coefficient of determination, R-squared, is computed. The square root of R-squared indicates the correlation of all the independent variables taken together with the dependent variable; it is thus called the coefficient of multiple correlation.
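A minimal sketch (simulated data, two predictors) of computing R-squared from a fitted multiple regression and taking its square root to obtain the multiple correlation:

```python
# Minimal sketch: R-squared and the coefficient of multiple correlation.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(size=100)

X = np.column_stack([np.ones_like(x1), x1, x2])    # add an intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)       # ordinary least squares fit
y_hat = X @ coef

r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
multiple_r = np.sqrt(r_squared)
print(round(r_squared, 3), round(multiple_r, 3))
```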
On Regression
• To better understand regression, let us take the following hypothesis:
• States with direct democratic mechanisms,
such as the initiative, have higher voter
turnout than states without the initiative
• So the regression model for this would look
something like this: Y=a+bx
• Y = turnout; a = constant/intercept; and
b(x) = states with the initiative process
• How is each part measured?
• Turnout is readily available at the state level for
presidential and midterm elections going back for
decades.
• The intercept (or constant) is the predicted value of the dependent variable when the independent variable is 0.
• States with the initiative process is information that is available, but how should it be coded? Some states have the initiative and some do not, so this is either “yes” or “no.” Dummy variables are used in such cases, where 1 is “yes” and 0 is “no.”
• So we code the “initiative” as either a 1 or a 0
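A minimal sketch of this dummy coding (the handful of states listed is only an illustration, not the study's data):

```python
# Minimal sketch: coding the initiative dummy (1 = has the initiative, 0 = does not).
import pandas as pd

states = pd.DataFrame({
    "state": ["California", "Oregon", "New York", "Texas"],
    "has_initiative": ["yes", "yes", "no", "no"],
})
states["duminit"] = (states["has_initiative"] == "yes").astype(int)
print(states)
```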
This is a simple bivariate model of voter turnout in midterm election years (1970-1996) and whether the state has the initiative or not.
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .366   .134       .131                9.29317
a. Predictors: (Constant), duminit

So first, what do these numbers tell us? The important number for us is the adjusted R-squared. The adjusted R-squared tells us how much of the variance is explained by the independent variables. So, the adjusted R-squared is .131, indicating that 13% of the variance is explained by the regression model.
Coefficients
Model          B        Std. Error   Beta   t        Sig.
1 (Constant)   36.323   .680                53.449   .000
  duminit      7.301    .997         .366   7.320    .000
Dependent Variable: turnout

What does this tell us? The regression coefficient for the initiative is 7.301 with a standard error of .997; what would this suggest? First, we must check to see if the variable is statistically significant, because if it is not, then it does not matter what the regression coefficient is. All we would be able to say is that there is no statistical relationship between the two variables.
We know from the t statistic and Sig. that there is a statistically significant relationship and that it is a positive relationship: 7.301 is positive and, as noted above, statistically significant (.000). So, this would allow us to infer that states with the initiative have higher turnout than states that do not have the initiative. Further analysis would be necessary to be certain, but it appears that turnout would be between about 6.3 and 8.3 percentage points higher (the coefficient plus or minus one standard error).
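A minimal sketch of how this bivariate model could be estimated in Python with statsmodels; the variable names turnout and duminit follow the slides, but the data file here is only an assumption:

```python
# Minimal sketch: bivariate OLS model of turnout on the initiative dummy.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("turnout.csv")            # hypothetical file with 'turnout' and 'duminit' columns
X = sm.add_constant(df["duminit"])         # adds the intercept (constant) term
model = sm.OLS(df["turnout"], X).fit()
print(model.summary())                     # coefficients, std. errors, t, Sig., adjusted R-squared

# The rough range quoted above is the coefficient plus or minus one standard error:
# 7.301 - 0.997 = 6.3 and 7.301 + 0.997 = 8.3 percentage points (approximately).
```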
Should we also include number of initiatives on the ballot to see
if that makes a difference? Why not!
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .366   .134       .129                9.30567
a. Predictors: (Constant), initnumb, duminit

So first, what do these numbers tell us? The adjusted R-squared tells us how much of the variance is explained by the independent variables. So, the adjusted R-squared is .129, indicating that about 13% of the variance is explained by the regression model.
Coefficients
Model          B        Std. Error   Beta   t        Sig.
1 (Constant)   36.323   .680                53.377   .000
  duminit      7.154    1.146        .358   6.245    .000
  initnumb     .069     .262         .015   .262     .794
Dependent Variable: turnout

What does this tell us? The regression coefficient for the initiative is 7.154 with a standard error of 1.146, and the regression coefficient for the number of initiatives is .069 with a standard error of .262. What would this suggest? First, we must check to see if the variables are statistically significant, because if they are not, then it does not matter what the regression coefficients are. All we would be able to say is that there is no statistical relationship between those variables and the dependent variable.
Coefficients
Model          B        Std. Error   Beta   t        Sig.
1 (Constant)   36.323   .680                53.377   .000
  duminit      7.154    1.146        .358   6.245    .000
  initnumb     .069     .262         .015   .262     .794
Dependent Variable: turnout

We know from the t statistics and Sig. that there is a statistically significant relationship for the initiative and that it is a positive relationship: 7.154 is positive and, as noted above, statistically significant (.000). However, the number of initiatives is not statistically significant (.794). So again, this would allow us to infer that states with the initiative have higher turnout than states that do not have the initiative. Further analysis would be necessary to be certain, but it appears that turnout would be between about 6.0 and 8.3 percentage points higher.
Coefficients
Model          B        Std. Error   Beta   t        Sig.
1 (Constant)   36.323   .680                53.377   .000
  duminit      7.154    1.146        .358   6.245    .000
  initnumb     .069     .262         .015   .262     .794
Dependent Variable: turnout

We cannot say anything about the impact of the number of initiatives on voter turnout other than that, as noted above, it is not statistically significant.
Now, consider whether there are other
possible causes of increased turnout in a
state. This is important because when we
do research we must control for other
possible/alternative causes—that is to say
influences on the dependent variable. If
not, then we cannot really say what
influences what.
What might be alternative influences on voter turnout? Based on turnout research, the following might have a positive or negative impact on voter turnout: Gubernatorial
elections, Senatorial elections, whether the
state is a Southern State (the confederacy),
level of education in the state, level of racial
diversity in the state, and registration
requirements. Let’s see what happens when
we account for (in other words control for)
these other possible influences.
• NOTE: Dummy variables are used for gubernatorial elections (dumgub), senatorial elections (senatdum), and whether the state is a Southern State (southdum). Level of education in the state (hsgrad) is measured by the proportion of a state’s population with at least a high school diploma; level of racial diversity in the state (mindiv) is measured by an index of racial diversity; and registration requirements (Voter Registration) are measured by the number of days prior to the election that one needed to register to vote in that year’s election.
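A minimal sketch of the full specification with these controls, using statsmodels' formula interface; the column names (including voter_reg_closing as a stand-in for the registration closing-date variable) and the data file are assumptions, not the slides' actual data:

```python
# Minimal sketch: turnout regressed on the initiative dummy plus the controls listed above.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("turnout.csv")   # hypothetical file containing all of the columns below
model = smf.ols(
    "turnout ~ duminit + initnumb + dumgub + senatdum + southdum"
    " + voter_reg_closing + hsgrad + mindiv",
    data=df,
).fit()
print(model.summary())            # compare each coefficient's Sig. to the chosen significance level
```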
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .765   .585       .567                6.70929
a. Predictors: (Constant), mindiv, dumgub, senatdum, initnumb, hsgrad, Voter Registration Closing Date, duminit, southdum

• So what do these numbers tell us? The adjusted R-squared is .567, indicating that almost 57% of the variance is explained by the regression model.
Coefficients
Model                               B         Std. Error   Beta    t        Sig.
1 (Constant)                        53.216    3.910                13.610   .000
  Duminit                           1.300     1.192        .064    1.091    .277
  Initnumb                          -.103     .387         -.015   -.268    .789
  Dumgub                            -.699     1.100        -.031   -.635    .526
  Senatdum                          1.732     1.020        .080    1.698    .091
  Southdum                          -12.967   1.470        -.544   -8.819   .000
  Voter Registration Closing Date   -.052     .049         -.056   -1.052   .294
  hsgrad                            -.063     .058         -.066   -1.086   .279
  mindiv                            -23.384   3.886        -.354   -6.018   .000
Dependent Variable: turnout
Observe what happened! Looking at the regression model that contains a number of controls, we can see that the regression coefficients for the initiative, number of initiatives, gubernatorial election, voter registration closing date, and high school graduation rate were not significant. After controlling for other factors, the results did not confirm the hypothesis. This type of finding is just as important (sometimes more so) as having results that confirm the research hypothesis.
One could comment on whether the
regression coefficient is in the expected
direction or not but it is not necessary here.
The only three variables that are statistically significant (using a significance level of .1 or less) are whether there was a Senatorial race on the ballot, whether the state was a southern state, and the level of minority diversity.
Coefficients
Model                               B         Std. Error   Beta    t        Sig.
1 (Constant)                        53.216    3.910                13.610   .000
  Duminit                           1.300     1.192        .064    1.091    .277
  Initnumb                          -.103     .387         -.015   -.268    .789
  Dumgub                            -.699     1.100        -.031   -.635    .526
  Senatdum                          1.732     1.020        .080    1.698    .091
  Southdum                          -12.967   1.470        -.544   -8.819   .000
  Voter Registration Closing Date   -.052     .049         -.056   -1.052   .294
  hsgrad                            -.063     .058         -.066   -1.086   .279
  mindiv                            -23.384   3.886        -.354   -6.018   .000
Dependent Variable: turnout
• What we could say given these results is that, when controlling for other factors, states with the initiative do not have statistically significantly higher turnout than states that do not have the initiative.

• Controlling for other factors allows researchers to speak about their results with much more confidence, and that is precisely what you want to do.
