The Nature of Regression Analysis

The document provides an overview of regression analysis, including: 1) It discusses the historical origins and modern interpretation of regression analysis, how it differs from deterministic relationships, and its relationship to causation and correlation. 2) Key terms like dependent and explanatory variables are introduced, and the differences between simple and multiple regression are outlined. 3) The main types of data used in regression analysis are described: time series, cross-sectional, and pooled data. Examples of each type are provided.

Uploaded by Esaie Guemou

Chapter 1

The Nature of Regression Analysis
1.1 Historical Origin of the Term Regression

• The term regression was introduced by Francis Galton. In a famous paper, Galton found that, although there was a tendency for tall parents to have tall children and for short parents to have short children, the average height of children born of parents of a given height tended to move or “regress” toward the average height in the population as a whole.
• In other words, the height of the children of unusually tall or unusually short parents tends to move toward the average height of the population.
1.2 The Modern Interpretation of Regression

• Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, with a view to estimating and/or predicting the (population) mean or average value of the former in terms of the known or fixed (in repeated sampling) values of the latter.
• An example is finding out how the average height of sons changes given the father’s height; in other words, our concern is with predicting the average height of sons knowing the height of their fathers.

• This figure shows the distribution of heights of sons in a hypothetical population corresponding to the given or fixed values of the father’s height. Notice that corresponding to any given height of a father is a range or distribution of the heights of the sons. However, despite the variability of the height of sons for a given value of father’s height, the average height of sons generally increases as the height of the father increases.

Figure: Hypothetical distribution of sons’ heights corresponding to given heights of fathers.

To show this clearly, the circled crosses in the figure indicate the average height
of sons corresponding to a given height of the father. Connecting these
averages, we obtain the line shown in the figure. This line, as we shall see, is
known as the regression line. It shows how the average height of sons
increases with the father’s height.
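The construction just described can be sketched in code with hypothetical heights: group the sons’ heights by father’s height, average each group (the circled crosses), and the sequence of those averages traces the regression line.

```python
# A minimal sketch of conditional means; all heights (in inches) are made up.
from statistics import mean

# Hypothetical sample: (father_height, son_height) pairs.
pairs = [
    (65, 66), (65, 67), (65, 68),
    (68, 68), (68, 69), (68, 70),
    (71, 70), (71, 71), (71, 73),
]

# Collect sons' heights for each fixed father height.
groups = {}
for father, son in pairs:
    groups.setdefault(father, []).append(son)

# Conditional means: the average son height at each father height
# (the circled crosses in the figure).
conditional_means = {father: mean(sons) for father, sons in sorted(groups.items())}
print(conditional_means)
```

Connecting the points `(father, conditional_mean)` in order gives the regression line: here the averages rise as the father’s height rises, even though individual sons scatter around each average.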
1.3 Statistical Versus Deterministic
Relationships

• In regression analysis we are concerned with what is known as statistical, not functional or deterministic, dependence among variables; deterministic dependence is the kind found in classical physics.
• In statistical relationships among variables we essentially deal with random or stochastic variables, that is, variables that have probability distributions.
• In functional or deterministic dependence, on the other hand, we also deal with variables, but these variables are not random or stochastic.
Example of random variables

• The dependence of crop yield on temperature, rainfall, sunshine, and fertilizer, for example, is statistical in nature in the sense that the explanatory variables, although certainly important, will not enable the agronomist to predict crop yield exactly because of errors involved in measuring these variables as well as a host of other factors (variables) that collectively affect the yield but may be difficult to identify individually.
• Thus, there is bound to be some “intrinsic” or random variability in the dependent variable, crop yield, that cannot be fully explained no matter how many explanatory variables we consider.
Example of deterministic variables

• In deterministic phenomena, on the other hand, we deal with relationships of the type exhibited, say, by the relationship between interest and the money you have stored in the bank: if y = interest and x = money stored in the bank, then at a given interest rate, for example 5%, you can compute the interest exactly as y = 0.05x.
• But in our lessons, we are not concerned with such deterministic relationships.
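The contrast can be sketched directly (all numbers are hypothetical): the interest relation returns the same y for the same x every time, whereas a statistical relation such as crop yield on rainfall carries a random disturbance, so repeated observations at the same rainfall differ.

```python
import random

def interest(x, rate=0.05):
    """Deterministic: the same x always yields the same y."""
    return rate * x

def crop_yield(rainfall, rng):
    """Statistical: rainfall matters, but a random disturbance remains.
    The slope 2.0 and noise scale 5 are purely illustrative."""
    return 2.0 * rainfall + rng.gauss(0, 5)

rng = random.Random(0)  # seeded for reproducibility
y1, y2 = crop_yield(30, rng), crop_yield(30, rng)
print(interest(1000), interest(1000))  # identical every time
print(y1, y2)                          # same rainfall, different yields
```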
1.4 Regression Versus Causation

• In the crop-yield example cited previously, there is no statistical reason to assume that rainfall does not depend on crop yield. The fact that we treat crop yield as dependent on rainfall (among other things) is due to nonstatistical considerations: common sense suggests that the relationship cannot be reversed, for we cannot control rainfall by varying crop yield.
• The point to note is that a statistical relationship in itself cannot logically imply causation. To ascribe causality, one must appeal to a priori or theoretical considerations.
1.5 Regression Versus Correlation

• Closely related to but conceptually very much different from regression analysis is correlation analysis, where the primary objective is to measure the strength or degree of linear association between two variables.
• The correlation coefficient, which we shall study in detail in Chapter 3, measures this strength of (linear) association.
• For example, we may be interested in finding the correlation between smoking and lung cancer, between scores on statistics and mathematics examinations, between high school grades and college grades, and so on.
• In regression analysis, as already noted, we are not
primarily interested in such a measure. Instead, we try
to estimate or predict the average value of one
variable on the basis of the fixed values of other
variables.
• Thus, we may want to know whether we can predict
the average score on a statistics examination by
knowing a student’s score on a mathematics
examination.
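As a sketch of this prediction (with hypothetical exam scores), one can fit a two-variable least-squares regression of statistics scores on mathematics scores and read off the estimated average statistics score at a given math score:

```python
# Minimal two-variable regression sketch; the scores below are made up.
from statistics import mean

math_scores  = [60, 70, 80, 90]   # explanatory variable X
stats_scores = [65, 72, 78, 85]   # dependent variable Y

mx, my = mean(math_scores), mean(stats_scores)
# Least-squares slope and intercept for Y on X.
slope = sum((x - mx) * (y - my) for x, y in zip(math_scores, stats_scores)) \
        / sum((x - mx) ** 2 for x in math_scores)
intercept = my - slope * mx

def predict(math_score):
    """Estimated average statistics score given a math score."""
    return intercept + slope * math_score

print(predict(75))  # prediction at a math score of 75
```

Note the asymmetry: the regression treats the math scores as fixed and predicts the average of the (random) statistics scores, not the other way around.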
The Differences Between Regression and Correlation

• In regression analysis, there is an asymmetry in the way the dependent and explanatory variables are treated.
• The dependent variable is assumed to be statistical,
random, or stochastic, that is, to have a probability
distribution.
• The explanatory variables, on the other hand, are
assumed to have fixed values (in repeated sampling).
• In correlation analysis, we treat any (two) variables
symmetrically; there is no distinction between the
dependent and explanatory variables.
• After all, the correlation between scores on mathematics
and statistics examinations is the same as that between
scores on statistics and mathematics examinations.
• Moreover, both variables are assumed to be random. As we shall see, most of correlation theory is based on the assumption of randomness of variables, whereas most of regression theory is conditional upon the assumption that the dependent variable is stochastic but the explanatory variables are fixed or nonstochastic.
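The symmetry and asymmetry just contrasted can be checked on a small hypothetical data set: the correlation of X with Y equals that of Y with X, but the least-squares slope of Y on X generally differs from the slope of X on Y.

```python
# Sketch on made-up data: correlation is symmetric, regression is not.
from statistics import mean

X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]

def corr(a, b):
    """Sample correlation coefficient (symmetric in its arguments)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def slope(y, x):
    """Least-squares slope of y regressed on x (asymmetric)."""
    my_, mx_ = mean(y), mean(x)
    return sum((xi - mx_) * (yi - my_) for xi, yi in zip(x, y)) \
           / sum((xi - mx_) ** 2 for xi in x)

print(corr(X, Y) == corr(Y, X))      # True: order does not matter
print(slope(Y, X), slope(X, Y))      # two different slopes
```

The two slopes coincide only in special cases; in general their product equals the squared correlation coefficient.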
1.6 Terminology And Notation

• In the literature the terms dependent variable and explanatory variable are described variously. A representative list is:

  Dependent variable      Explanatory variable
  -------------------     --------------------
  Explained variable      Independent variable
  Predictand              Predictor
  Regressand              Regressor
  Response                Stimulus
  Endogenous              Exogenous
  Outcome                 Covariate
  Controlled variable     Control variable
• If we are studying the dependence of a variable on only a single explanatory variable, such a study is known as simple, or two-variable, regression analysis.
• However, if we are studying the dependence of one
variable on more than one explanatory variable, it is
known as multiple regression analysis.
• As noted earlier, a random or stochastic variable is a
variable that can take on any set of values, positive
or negative, with a given probability.
• Notation: Y = dependent variable, X = explanatory variable, Xk = kth explanatory variable.
1.7 Types of Data

• Three types of data may be available for empirical analysis: time series, cross-section, and pooled (i.e., a combination of time series and cross-section) data.
Time Series Data
• The data we use in the introduction are an example of time series data.
• A time series is a set of observations on the
values that a variable takes at different times.
Such data may be collected at regular time intervals,
such as daily (e.g., stock prices, weather reports),
weekly (e.g., money supply figures), monthly [e.g.,
the unemployment rate, the Consumer Price Index
(CPI)], quarterly (e.g., GDP), annually (e.g.,
government budgets), quinquennially, that is, every
5 years (e.g., the census of manufactures), or
decennially (e.g., the census of population).
• Sometimes data are available both quarterly and annually, as in the case of the data on GDP and consumer expenditure.
Cross-Section Data

• Cross-section data are data on one or more variables collected at the same point in time, such as the census of population conducted by the Census Bureau every 10 years (the latest being in year 2000), the surveys of consumer expenditures conducted by the University of Michigan, and, of course, the opinion polls by Gallup and umpteen other organizations.
• A concrete example of cross-sectional data is given in Table 1.1, which shows egg production and egg prices for the 50 states in the union for 1990 and 1991. For each year the data on the 50 states are cross-sectional data. Thus, in Table 1.1 we have two cross-sectional samples.
• Another example is a table giving data on China’s population, birth rate, death rate, and natural growth rate by region in 2011.
