Fdsa Unit 2
PART A
1. What is Statistics?
Statistics is a branch of applied mathematics that involves the collection,
description, analysis, and inference of conclusions from quantitative data.
Dependent Variable
o A dependent variable is a variable that is believed to have been
influenced by the independent variable.
23. What is a frequency distribution for ungrouped data and grouped data?
Frequency Distribution for Ungrouped Data
A frequency distribution produced whenever observations are sorted
into classes of single values.
Frequency Distribution for Grouped Data
A frequency distribution produced whenever observations are sorted
into classes of more than one value.
55. How will you find the values of b and a in the least squares regression
equation?
PART B
1. Explain in detail about types of Data.
Table 2.2
Finally, the Y and N replies of students in Table 2.2 are qualitative data,
since any single observation is a letter that represents a class of replies.
TYPES OF VARIABLES
Definition
A variable is a characteristic or property that can take on different values.
Accordingly, the weights can be described not only as quantitative data but also as
observations for a quantitative variable, since the various weights take on
different numerical values.
By the same token, the replies to the Facebook profile question can be described as
observations for a qualitative variable, since the replies take on
different values of either Yes or No.
Any single observation is a constant, since it takes on only one value.
Independent Variable
An independent variable is the treatment manipulated by the
investigator.
Dependent Variable
A variable that is believed to have been influenced by the independent
variable.
Observational Studies
An observational study focuses on detecting relationships between variables
not manipulated by the investigator.
Example:
The independent variable is qualitative, with nominal measurement,
whereas the dependent variable (number of communication breakdowns)
is quantitative.
Confounding Variable
An uncontrolled variable that compromises the interpretation of a study is
known as a confounding variable.
Definition
Frequency Distribution
o Ungrouped Data
o Grouped Data
Guidelines
Example Problems
Types of Frequency Distribution
o Relative FD
o Cumulative FD
Frequency Distributions For Qualitative (Nominal) Data
Definition
A frequency distribution is a collection of observations produced by
sorting observations into classes and showing their frequency (f) of
occurrence in each class.
GUIDELINES
The “Guidelines for Frequency Distributions” box lists seven rules for
producing a well-constructed frequency distribution.
The first three rules are essential and should not be violated. The last
four rules are optional and can be modified or ignored.
Optional
4. All classes should have both an upper boundary and a lower
boundary.
Example: 240–249. Less preferred would be 240–above, in which no
maximum value can be assigned to observations in this class.
5. Select the class interval from convenient numbers, such as 1, 2,
3, . . . 10, particularly 5 and 10 or multiples of 5 and 10.
Example: 130–139, 140–149, in which the class interval of 10 is a
convenient number. Less preferred would be 130–142, 143–155, etc.,
in which the class interval of 13 is not a convenient number.
6. The lower boundary of each class interval should be a multiple
of the class interval.
Example: 130–139, 140–149, in which the lower boundaries of 130,
140, are multiples of 10, the class interval. Less preferred would be
135–144, 145–154, etc., in which the lower boundaries of 135 and
145 are not multiples of 10, the class interval.
7. Aim for a total of approximately 10 classes.
Example 2.1
The IQ scores for a group of 35 high school dropouts are as follows:
Table 2.7
Solution
a. Construction of a Frequency Distribution
1. Find the range, that is, the difference between the largest and smallest
observations.
Example, from Table 2.7:
The range of IQ scores in the above table is 123 – 69 = 54.
2. Find the class interval required to span the range by dividing the range
by the desired number of classes (ordinarily 10).
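The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original solution: the rule of picking the convenient number closest to the raw interval, and the list of convenient numbers itself, are assumptions made for the sketch.

```python
# Largest and smallest IQ scores, as quoted in step 1 of the example.
largest, smallest = 123, 69
desired_classes = 10  # "ordinarily 10"

score_range = largest - smallest              # step 1: the range
raw_interval = score_range / desired_classes  # step 2: range / classes

# Assumption: choose the convenient number closest to the raw interval.
convenient_numbers = [1, 2, 3, 4, 5, 10, 15, 20, 25, 50]
class_interval = min(convenient_numbers, key=lambda w: abs(w - raw_interval))

# Guideline 6: the lowest class starts at a multiple of the class interval.
lowest_lower = (smallest // class_interval) * class_interval
lowest_class = (lowest_lower, lowest_lower + class_interval - 1)

print(score_range, raw_interval, class_interval, lowest_class)
```

With a class interval of 5, the lowest class comes out as 65–69, which agrees with the real limits of 64.5–69.5 given in Solution b.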
Solution b: the real limits for the lowest class interval in this frequency
distribution are 64.5–69.5.
Example 2.2
What are some possible poor features of the following frequency
distribution?
Solution
Not all observations can be assigned to one and only one class (because of
the gap between 20–22 and 25–30 and the overlap between 25–30 and 30–34).
Not all classes are equal in width (25–30 versus 30–34).
Not all classes have both boundaries (35–above).
The cumulative frequency for the class 130–139 is 3, since there are
no classes ranked lower.
The cumulative frequency for the class 140–149 is 4, since 1 is the
frequency for that class and 3 is the frequency of all lower-ranked
classes.
The cumulative frequency for the class 150–159 is 21, since 17 is the
frequency for that class and 4 is the sum of the frequencies of all
lower-ranked classes.
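The running totals described above can be reproduced with a short sketch. Only the three lowest classes and their frequencies (3, 1, 17) are quoted in these notes, so the example is limited to them.

```python
from itertools import accumulate

# Frequencies for the three lowest classes, as quoted in the text.
classes = ["130-139", "140-149", "150-159"]
frequencies = [3, 1, 17]

# Each cumulative frequency adds the class frequency to all lower-ranked ones.
cumulative = list(accumulate(frequencies))

for label, f, cf in zip(classes, frequencies, cumulative):
    print(f"{label}: f = {f}, cumulative f = {cf}")
```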
Cumulative Percentages
• To obtain this cumulative percentage (75%), the cumulative frequency
of 40 for the class 170–179 should be divided by the total frequency of
53 for the entire distribution.
Example 2.3
GRE scores for a group of graduate school applicants are
distributed as follows:
Percentile Ranks
Cumulative percentages are referred to as percentile ranks.
The percentile rank of a score indicates the percentage of scores in the
entire distribution with similar or smaller values than that score.
Types of Percentile Rank
The assignment of exact percentile ranks requires that cumulative
percentages be obtained from frequency distributions for ungrouped
data.
The assignment of approximate percentile ranks requires that
cumulative percentages be obtained from frequency distributions for
grouped data.
Example 2.4
Referring to Table 2.10, find the approximate percentile rank of
any weight in the class 200–209.
Solution –
The approximate percentile rank for weights between 200 and 209 lbs
is 92 (because 92 is the cumulative percent for this interval).
Example 2.5
Movie ratings reflect ordinal measurement because they can be
ordered from most to least restrictive: NC-17, R, PG-13, PG, and G.
The ratings of some films shown recently in San Francisco are as
follows:
Example 2.6:
Identify any outliers in each of the following sets of data collected
from nine college students
Solution
o Outliers are a summer income of $25,700; an age of 61; and a family
size of 18. No outliers for GPA.
Figure 2.1
2. Frequency Polygon
A frequency polygon, or line graph, for quantitative data emphasizes the
continuity of continuous variables.
Frequency polygons may be constructed directly from frequency
distributions.
Transformation steps from histogram to frequency polygon
1. This panel shows the histogram for the weight distribution.
2. Place dots at the midpoints of each bar top or, in the absence of bar tops,
at the midpoints for classes on the horizontal axis, and connect them with
straight lines.
3. First, extend the upper tail to the midpoint of the first unoccupied class
on the upper flank of the histogram. Then extend the lower tail to the
midpoint of the first unoccupied class on the lower flank of the histogram.
EXAMPLE
Constructing a Display
To construct the stem and leaf display for these data, first note that,
when counting by tens, the weights range from the 130s to the 240s.
Arrange a column of numbers, the stems, beginning with 13
(representing the 130s) and ending with 24 (representing the 240s).
Draw a vertical line to separate the stems, which represent multiples of
10, from the space to be occupied by the leaves, which represent
multiples of 1.
Next, enter each raw score into the stem and leaf display.
Interpretation
The weight data have been sorted by the stems. All weights in the 130s
are listed together; all of those in the 140s are listed together, and so on.
As suggested by the shaded coding in Table 2.11, the first raw score
of 160 reappears as a leaf of 0 on a stem of 16. The next raw score of
193 reappears as a leaf of 3 on a stem of 19, and the third raw score
of 226 reappears as a leaf of 6 on a stem of 22, and so on, until each
raw score reappears as a leaf on its appropriate stem.
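The construction described above can be sketched as follows. The first three raw scores (160, 193, 226) are the ones quoted in the text; the remaining weights are hypothetical values added only so that the display has more than three rows, and the function name `stem_and_leaf` is likewise an illustration.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Group each score's units digit (leaf) under its tens value (stem)."""
    rows = defaultdict(list)
    for score in scores:
        stem, leaf = divmod(score, 10)
        rows[stem].append(leaf)
    # List every stem from smallest to largest, even if it has no leaves.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}

# First three raw scores come from the text; the rest are hypothetical.
weights = [160, 193, 226, 152, 165, 135, 158, 245]
display = stem_and_leaf(weights)

for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Leaves appear in the order the raw scores were entered, which mirrors the instruction to enter each raw score one at a time.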
Example 2.7
Construct a stem and leaf display for the following IQ scores
obtained from a group of four-year-old children.
Solution
Stem Leaf
Bar Graph
Equal segments along the vertical axis reflect increases in frequency.
The body of the bar graph consists of a series of bars whose heights
reflect the frequencies for the various words or classes. Refer to Figure 2.7.
CONSTRUCTING GRAPHS
1. Decide on the appropriate type of graph, recalling that histograms and
frequency polygons are appropriate for quantitative data, while bar graphs
are appropriate for qualitative data.
2. Draw the horizontal axis, then the vertical axis, remembering that the
vertical axis should be about as tall as the horizontal axis is wide.
3. Identify the string of class intervals that eventually will be
superimposed on the horizontal axis. For qualitative data or ungrouped
quantitative data, just use the classes suggested by the data. For grouped
quantitative data, create a set of class intervals as you would for a frequency
distribution.
4. Superimpose the string of class intervals (with gaps for bar graphs) along
the entire length of the horizontal axis.
5. Along the entire length of the vertical axis, superimpose a progression
of convenient numbers, beginning at the bottom with 0 and ending at the
top with a number as large as or slightly larger than the maximum
observed frequency.
6. Using the scaled axes, construct bars (or dots and lines) to reflect the
frequency of observations within each class interval.
7. Supply labels for both axes and a title for the graph.
TYPICAL SHAPES
Whether expressed as a histogram, a frequency polygon, or a stem and
leaf display, an important characteristic of a frequency distribution is its
shape.
The more typical shapes for smoothed frequency polygons
Normal
The familiar bell-shaped silhouette of the normal curve can be
superimposed on many frequency distributions, including those for scores on
standardized tests and even the popping times of individual kernels in a batch
of popcorn. Figure 2.3 shows the normal curve.
Example 2.8
Describe the probable shape (normal, bimodal, positively skewed, or
negatively skewed) for each of the following distributions:
(a) female beauty contestants’ scores on a masculinity test, with a
higher score indicating a greater degree of masculinity
(b) scores on a standardized IQ test for a group of people selected
from the general population
(c) test scores for a group of high school students on a very difficult
college - level math exam
(d) reading achievement scores for a third-grade class consisting of
about equal numbers of regular students and learning-challenged
students
(e) scores of students at the Eastman School of Music on a test of
music aptitude (designed for use with the general population)
Solution
Example 2.9:
Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
Answer - mode = 63
Example 2.10:
The owner of a new car conducts six gas mileage tests and obtains
the following results, expressed in miles per gallon: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9. Find the mode for these data.
Answer - mode = 27.4
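Both answers can be checked with Python's standard library. This is only a verification sketch; `multimode` is used rather than `mode` so that a tie between values would be visible.

```python
from statistics import multimode

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# multimode returns every most-frequent value, so ties would show up too.
ages_mode = multimode(retirement_ages)
mileage_mode = multimode(gas_mileage)

print(ages_mode, mileage_mode)
```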
MEDIAN
The median reflects the middle value when observations are ordered from
least to most.
Example 2.11:
Find the median for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.
Example 2.12:
Find the median for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
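A quick way to check the median for these two data sets is sketched below. With an odd number of scores the median is the middle score after sorting; with an even number it is the average of the two middle scores.

```python
from statistics import median

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

ages_median = median(retirement_ages)   # 11 scores: the 6th score after sorting
mileage_median = median(gas_mileage)    # 6 scores: average of the 3rd and 4th

print(sorted(retirement_ages), ages_median)
print(sorted(gas_mileage), round(mileage_median, 2))
```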
MEAN
The mean is found by adding all scores and then dividing by the number of
scores.
Mean = sum of all scores / number of scores
Types of mean
population mean
sample mean
Sample Mean
Sample Size (n) - The total number of scores in the sample.
Sample Mean is obtained by dividing the sum of all scores in the sample by
the number of scores in the sample.
“X-bar equals the sum of the variable X divided by the sample size n.”
Example 2.13
Find the mean for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.
Find the mean for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
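The mean formula can be applied to both data sets as in the following sketch. The helper name `mean_of` is illustrative only.

```python
def mean_of(scores):
    """Mean = sum of all scores / number of scores."""
    return sum(scores) / len(scores)

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

ages_mean = mean_of(retirement_ages)    # 672 / 11
mileage_mean = mean_of(gas_mileage)     # 163.3 / 6

print(round(ages_mean, 2), round(mileage_mean, 2))
```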
If Distribution Is Skewed
Example 2.14
Indicate whether the following skewed distributions are positively
skewed because the mean exceeds the median or negatively
skewed because the median exceeds the mean.
a. a distribution of test scores on an easy test, with most
students scoring high and a few students scoring low
Solution
- negatively skewed because the median exceeds the mean
Example 2.15
College students were surveyed about where they would most
like to spend their spring break: Daytona Beach (DB), Cancun,
Mexico (C), South Padre Island (SP), Lake Havasu (LH), or other
(O). The results were as follows:
Measures of Variability
Variability for Quantitative Data
Range
Variance
Standard Deviation
Variability for Qualitative Data
Variability for Ranked Data
MEASURES OF VARIABILITY
Variability is the description of the amount by which scores are dispersed
or scattered in a distribution.
There are many ways to describe variability or spread including:
Range
Variance
Standard Deviation
RANGE
It is the difference between the largest and smallest scores.
FIGURE 2.11 - Three distributions with the same mean (10) but
different amounts of variability. Numbers in the boxes indicate
distances from the mean.
In Figure 2.11, distribution A, the least variable, has the smallest range
of 0 (from 10 to 10); distribution B, the moderately variable, has an
intermediate range of 2 (from 11 to 9); and distribution C, the most
variable, has the largest range of 6 (from 13 to 7).
VARIANCE
The variance is the mean of all squared deviation scores.
A deviation from the mean is how far a score lies from the mean.
Variance is the square of the standard deviation.
Example
To get variance, square the standard deviation.
s = 95.5
s² = 95.5 × 95.5 = 9120.25
The variance of your data is 9120.25.
Variance reflects the degree of spread in the data set.
The more spread the data, the larger the variance is in relation to the
mean.
STANDARD DEVIATION
The square root of the mean of all squared deviations from the mean, that is,
standard deviation = √variance
Example 2.16
Using the computation formula for the sum of squares,
Calculate the population standard deviation for the scores in (a) and
the sample standard deviation for the scores in (b).
(a) 1, 3, 7, 2, 0, 4, 7, 3
(b) 10, 8, 5, 0, 1, 1, 7, 9, 2
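One way to work this example is sketched below. The notes do not show the formulas, so the sketch assumes the usual computation formula for the sum of squares, SS = ΣX² − (ΣX)²/n, with SS divided by n for a population and by n − 1 for a sample.

```python
import math

def sum_of_squares(scores):
    """Computation formula: SS = sum(X^2) - (sum(X))^2 / n."""
    n = len(scores)
    return sum(x * x for x in scores) - sum(scores) ** 2 / n

def population_sd(scores):
    # Population: divide the sum of squares by n.
    return math.sqrt(sum_of_squares(scores) / len(scores))

def sample_sd(scores):
    # Sample: divide the sum of squares by n - 1 (degrees of freedom).
    return math.sqrt(sum_of_squares(scores) / (len(scores) - 1))

scores_a = [1, 3, 7, 2, 0, 4, 7, 3]
scores_b = [10, 8, 5, 0, 1, 1, 7, 9, 2]

sigma = population_sd(scores_a)
s = sample_sd(scores_b)
print(round(sigma, 2), round(s, 2))
```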
Example 2.17
Days absent from school for a sample of 10 first-grade children are:
8, 5, 7, 1, 4, 0, 5, 7, 2, 9.
a) Before calculating the standard deviation, decide whether the
definitional or computational formula would be more efficient.
Why?
Solution - computation formula since the mean is not a whole number.
b) Use the more efficient formula to calculate the sample standard
deviation.
Example 2.18
As a first step toward modifying his study habits, Phil keeps daily
records of his study time.
a. During the first two weeks, Phil’s mean study time equals 20 hours
per week. If he studied 22 hours during the first week, how many
hours did he study during the second week?
Solution - 18 hours
b. During the first four weeks, Phil’s mean study time equals 21
hours. If he studied 22, 18, and 21 hours during the first, second,
and third weeks, respectively, how many hours did he study during
the fourth week?
Solution - 23 hours
c. If the information in (a) and (b) is to be used to estimate some
unknown population characteristic, the notion of degrees of
freedom can be introduced. How many degrees of freedom are
associated with (a) and (b)?
Solution - df = 1 in (a) and df = 3 in (b)
d. Describe the mathematical restriction that causes a loss of degrees
of freedom in (a) and (b).
Solution - When all observations are expressed as deviations from
their mean, the sum of all deviations must equal zero.
Example 2.19
Determine the values of the range and the IQR for the following sets of
data.
a. Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63
range = 25; IQR = 65 – 60 = 5
b. Residence changes: 1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4
range = 11; IQR = 4 – 1 = 3
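The answers above can be reproduced with the sketch below. The notes do not state how the quartiles were located, so the rule used here is an assumption: take the position (n + 1)/4, round it to the nearest whole number, and count that many scores in from each end of the ordered data. Other quartile conventions can give slightly different IQR values.

```python
def range_and_iqr(scores):
    ordered = sorted(scores)
    n = len(ordered)
    # Assumed rule: quartile position = (n + 1) / 4, rounded to the nearest whole number.
    position = int((n + 1) / 4 + 0.5)
    q1 = ordered[position - 1]   # count in from the bottom
    q3 = ordered[-position]      # count in from the top
    return ordered[-1] - ordered[0], q3 - q1

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
residence_changes = [1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4]

print(range_and_iqr(retirement_ages))
print(range_and_iqr(residence_changes))
```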
Although there is an infinite number of different normal curves, each with its
own mean and standard deviation, there is only one standard normal curve,
with a mean of 0 and a standard deviation of 1.
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard
deviation, respectively, for the normal distribution of the original scores.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the mean;
2. a number indicating the size of its deviation from the mean in standard
deviation units.
Example
A z score of 2.00 always signifies that the original score is exactly two
standard deviations above its mean.
Similarly, a z score of –1.27 signifies that the original score is exactly 1.27
standard deviations below its mean.
A z score of 0 signifies that the original score coincides with the mean.
Example 2.20
Express each of the following scores as a z score:
a. Margaret’s IQ of 135, given a mean of 100 and a standard deviation
of 15
b. a score of 470 on the SAT math test, given a mean of 500 and a
standard deviation of 100
c. a daily production of 2100 loaves of bread by a bakery, given a
mean of 2180 and a standard deviation of 50
d. Sam’s height of 69 inches, given a mean of 69 and a standard
deviation of 3
e. a thermometer-reading error of –3 degrees, given a mean of 0
degrees and a standard deviation of 2 degrees
Solution: (a) 2.33 (b) –0.30 (c) –1.60 (d) 0.00 (e) –1.50
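The five conversions can be checked with a small sketch that subtracts the mean from each score and divides by the standard deviation. The function name `z_score` and the dictionary layout are illustrative only.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / sd

# (score, mean, standard deviation) for parts a to e of the example.
cases = {
    "a": (135, 100, 15),
    "b": (470, 500, 100),
    "c": (2100, 2180, 50),
    "d": (69, 69, 3),
    "e": (-3, 0, 2),
}

z = {label: round(z_score(*args), 2) for label, args in cases.items()}
print(z)
```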
TABLE 2.27 Proportions (Of Areas) Under The Standard Normal Curve For
Values Of Z
When using the top legend, all entries refer to the upper half of the standard
normal curve. The entries in column A are z scores, beginning with 0.00 and
ending with 4.00.
Given a z score of zero or more, columns B and C indicate how the z score
splits the area in the upper half of the normal curve.
As the shading in the top legend suggests, column B indicates the proportion
of area between the mean and the z score, and column C indicates the proportion
of area beyond the z score, in the upper tail of the standard normal curve.
Example 2.21
Using Table A in Appendix C, find the proportion of the total area
identified with the following statements:
(a) above a z score of 1.80
(b) between the mean and a z score of –0.43
(c) below a z score of –3.00
(d) between the mean and a z score of 1.65
(e) between z scores of 0 and –1.96
Solution : (a) .0359 (b) .1664 (c) .0013 (d) .4505 (e) .4750
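The table lookups can be cross-checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library in place of the printed table; the helper names are illustrative, with one helper playing the role of column B and the other of column C.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_between_mean_and(z):
    """Column B: proportion of area between the mean and z."""
    return abs(std_normal.cdf(z) - 0.5)

def area_beyond(z):
    """Column C: proportion of area in the tail beyond z."""
    return 1 - std_normal.cdf(abs(z))

answers = {
    "a": area_beyond(1.80),
    "b": area_between_mean_and(-0.43),
    "c": area_beyond(-3.00),
    "d": area_between_mean_and(1.65),
    "e": area_between_mean_and(-1.96),
}
print({label: round(p, 4) for label, p in answers.items()})
```

Because the normal curve is symmetric, a negative z score gives the same proportion as the corresponding positive z score.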
4. Find the target area, and note the corresponding proportion of .1587 in
column C′.
Example 2.22
Assume that GRE scores approximate a normal curve with a mean of
500 and a standard deviation of 100.
(a) Sketch a normal curve and shade in the target area described by
each of the following statements:
(i) less than 400
(ii) more than 650
(iii) less than 700
(b) Plan solutions (in terms of columns B, C, B′, or C′ of the standard
normal table, as well as the fact that the proportion for either the
entire upper half or lower half always equals .5000) for the target
areas in part (a).
(c) Convert to z scores and find the proportions that correspond to the
target areas in part (a)
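Part (c) can be checked with the sketch below, which again uses `statistics.NormalDist` as a stand-in for the printed table. The comments note which table column each target area would correspond to; the variable names are illustrative.

```python
from statistics import NormalDist

gre = NormalDist(mu=500, sigma=100)

below_400 = gre.cdf(400)        # z = -1.00, lower tail (column C')
above_650 = 1 - gre.cdf(650)    # z = 1.50, upper tail (column C)
below_700 = gre.cdf(700)        # z = 2.00, .5000 plus column B

print(round(below_400, 4), round(above_650, 4), round(below_700, 4))
```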
4. Find the target area. In Table A, locate a z score of 2.00 in column A, and
note the corresponding proportion of .0228 in column C. Because of the
symmetry of the normal curve, you need not enter the table again to find
the proportion below z score of –2.00. Instead, merely double the above
proportion of .0228 to obtain .0456, which represents the proportion of
students with IQs more than 30 points either above or below the mean.
Example 2.23:
Assume that SAT math scores approximate a normal curve with a
mean of 500 and a standard deviation of 100.
(a) Sketch a normal curve and shade in the target area(s) described
by each of the following statements:
(i) more than 570
(ii) less than 515
(iii) between 520 and 540
(iv) between 470 and 520
(v) more than 50 points above the mean
(vi) more than 100 points either above or below the mean
(vii) within 50 points either above or below the mean
(b) Plan solutions (in terms of columns B, C, B′, and C′) for the
target areas in part (a).
(c) Convert to z scores and find the target areas in part (a).
Solution:
CONTENTS:
Correlation
Types of relationship
o Positive Relationship
o Negative Relationship
o Little or no relationship
Correlation Coefficient r
o Key properties of r
Computational formula for
Correlation Coefficient
Computational Sequence
Data and Computation
Correlation
Two variables are related if pairs of scores show an orderliness that can
be depicted graphically with a scatter plot and numerically with a
correlation coefficient.
Table 2.8 Greeting Cards Sent and Received by five Friends
Types of relationship:
1. Positive Relationship
2. Negative Relationship
3. Little or no relationship
Positive relationship:
Two variables are positively related if pairs of scores tend to occupy
similar relative positions (high with high and low with low) in their
respective distributions.
Negative relationship:
Two variables are negatively related if pairs of scores tend to occupy
dissimilar relative positions (high with low and vice versa) in their
respective distributions.
Notice the pattern among the pairs in Table 2.10. Now there is a
pronounced tendency for pairs of scores to occupy dissimilar and
opposite relative positions in their respective distributions.
Because relatively low values are paired with relatively high values,
and relatively high values are paired with relatively low values,
the relationship is negative.
Table 2.10 Negative Relationship
Little or No Relationship
No regularity is apparent among the pairs of scores in Table 2.11.
For instance, although both Andrea and John sent relatively few cards
(5 and 1, respectively), Andrea received relatively few cards (6) and
John received relatively many cards (14).
Given this lack of regularity, little, if any, relationship exists between
the two variables.
Table 2.12 Little or No Relationship
Example 2.34:
Indicate whether the following statements suggest a positive or
negative relationship:
a. More densely populated areas have higher crime rates.
b. Schoolchildren who often watch TV perform more poorly on
academic achievement tests.
c. Heavier automobiles yield poorer gas mileage.
d. Better-educated people have higher incomes.
e. More anxious people voluntarily spend more time performing a
simple repetitive task.
Solution:
a. Positive. The crime rate is higher, square mile by square mile, in
densely populated cities than in sparsely populated rural areas.
b. Negative. As TV viewing increases, performance on academic
achievement tests tends to decline.
c. Negative. Increases in car weight are accompanied by decreases in
miles per gallon.
d. Positive. Increases in educational level—grade school, high school,
college—tend to be associated with increases in income.
e. Positive. Highly anxious people willingly spend more time performing a
simple repetitive task than do less anxious people.
CORRELATION COEFFICIENT R
Key Properties of r
Named in honor of the British scientist Karl Pearson, the Pearson
correlation coefficient, r, can equal any value between –1.00 and +1.00.
The more closely the value of r approaches 0, the weaker (less regular)
the relationship.
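r = SPxy / √(SSx · SSy)
where
SPxy, the sum of the products for each pair of deviation scores, is defined
as
SPxy = Σ(X − X̄)(Y − Ȳ) = ΣXY − (ΣX)(ΣY) / n
SSx and SSy, the sums of the squared deviation scores for X and for Y, are
defined as
SSx = Σ(X − X̄)² = ΣX² − (ΣX)² / n
SSy = Σ(Y − Ȳ)² = ΣY² − (ΣY)² / n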
Computational Sequence
Assign a value to n (1), representing the number of pairs of scores.
Sum all scores for X (2) and for Y (3).
Find the product of each pair of X and Y scores (4), one at a time, then
add all of these products (5).
Square each X score (6), one at a time, then add all squared X scores
(7).
Square each Y score (8), one at a time, then add all squared Y scores
(9).
Substitute numbers into formulas (10) and solve for SPxy, SSx, and SSy.
Substitute into formula (11) and solve for r.
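The computational sequence can be sketched for the greeting card data as follows. Most of the pairs are quoted elsewhere in these notes; Andrea's received count of 10 is not quoted and is inferred here from the stated mean of 12 for Y, so treat it as an assumption. The formulas are the usual computation formulas for SPxy, SSx, and SSy.

```python
import math

# (cards sent X, cards received Y). Andrea's Y of 10 is inferred from the
# stated mean of 12 for Y; the other values are quoted in the notes.
pairs = {"Doris": (13, 14), "Steve": (9, 18), "Mike": (7, 12),
         "Andrea": (5, 10), "John": (1, 6)}

xs = [x for x, _ in pairs.values()]
ys = [y for _, y in pairs.values()]

n = len(pairs)                                   # step 1
sum_x, sum_y = sum(xs), sum(ys)                  # steps 2 and 3
sum_xy = sum(x * y for x, y in zip(xs, ys))      # steps 4 and 5
sum_x2 = sum(x * x for x in xs)                  # steps 6 and 7
sum_y2 = sum(y * y for y in ys)                  # steps 8 and 9

sp_xy = sum_xy - sum_x * sum_y / n               # step 10
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n

r = sp_xy / math.sqrt(ss_x * ss_y)               # step 11
print(sp_xy, ss_x, ss_y, round(r, 2))
```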
Example 2.35:
Supply a verbal description for each of the following
correlations.
a. an r of –.84 between total mileage and automobile resale
value
b. an r of –.35 between the number of days absent from school
and performance on a math achievement test
c. an r of .03 between anxiety level and college GPA
d. an r of .56 between age of schoolchildren and reading
comprehension
Solution:
a. Cars with more total miles tend to have lower resale values.
b. Students with more absences from school tend to score lower on math
achievement tests.
c. Little or no relationship between anxiety level and college GPA.
d. Older school children tend to have better reading comprehension.
Example 2.36:
Couples who attend a clinic for first pregnancies are asked to estimate
(independently of each other) the ideal number of children. Given that
X and Y represent the estimates of females and males, respectively,
the results are as follows:
CONTENTS:
Scatterplot
Construction of Scatterplot
Types of relationship in Scatterplot
o Positive Relationship
o Negative Relationship
o Little or no Relationship
o Strong or Weaker Relationship
o Linear Relationship
o Perfect Relationship
o Curvilinear Relationship
Scatterplot
A scatterplot is a graph containing a cluster of dots that represents all
pairs of scores.
Construction of Scatterplot
To construct a scatterplot, scale each of the two variables along the
horizontal (X) and vertical (Y) axes, and use each pair of scores to locate a
dot within the scatterplot.
For example, the pair of numbers for Mike, 7 and 12, define points along
the X and Y axes, respectively.
Using these points to anchor lines perpendicular (at right angles) to each
axis, locate Mike’s dot where the two lines intersect.
Repeat this process, with imaginary lines, for each of the four remaining
pairs of scores to create the scatterplot (refer to Figure 3.1).
Figure 2.14 shows the basic idea of correlation and the construction
of a scatterplot.
Positive Relationship
A dot cluster that has a slope from the lower left to the upper right
reflects a positive relationship.
Small values of one variable are paired with small values of the other
variable, and large values are paired with large values.
In Figure 2.15, short people tend to be light, and tall people tend to be
heavy.
Negative Relationship
A dot cluster that has a slope from the upper left to the lower right
reflects a negative relationship.
Small values of one variable tend to be paired with large values of the
other variable, and vice versa.
Figure 2.16 reflects the relationship between life expectancy and heavy
smoking.
Little or No Relationship
A dot cluster that lacks any apparent slope reflects little or no
relationship.
Small values of one variable are just as likely to be paired with small,
medium, or large values of the other variable.
In Figure 2.17, dots are scattered about in an irregular fashion, suggesting
that there is little or no relationship between the height of young adults
and their life expectancies.
Linear Relationship
A relationship that can be described best with a straight line.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line
reflects a perfect relationship between two variables.
Curvilinear Relationship
Sometimes a dot cluster approximates a bent or curved line, as in Figure
2.19, and therefore reflects a curvilinear relationship.
Example 2.37:
Critical reading and math scores on the SAT test for students A, B, C,
D, E, F, G, and H are shown in the following scatterplot:
CONTENTS:
Regression Line
Example
Placement of Line
Predictive Errors
Total Predictive Errors
Regression Line
The regression line is a straight line rather than a curved line because
of the linear relationship between two variables.
Using the dots for Steve and Doris as guides, construct two strings of
arrows, one beginning at 9 and ending at 18 for Steve and the other
beginning at 13 and ending at 14 for Doris.
Focusing on the interval along the Y axis between the two strings of
arrows, we could predict that Emma’s return should be between 14 and
18 cards, the numbers received by Doris and Steve.
Predictive Errors
Figure 2.22 illustrates the predictive errors that would have occurred
if the regression line had been used to predict the number of cards
received by the five friends.
Solid dots reflect the actual number of cards received, and open dots,
always located along the regression line, reflect the predicted number
of cards received.
CONTENTS:
Least squares Regression Line
Least Squares Regression Equation
Computational Sequence
Computations
The regression line is often referred to as the least squares regression line
because it minimizes the total squared predictive error, thereby providing a
more favorable prognosis for the predictions.
Computational Sequence
Determine values of SSx, SSy, and r (1) by referring to the original
correlation analysis.
Substitute numbers into the formula (2) and solve for b.
Assign values to X̄ and Ȳ, the means of X and Y (3), by referring to the
original correlation analysis.
Substitute numbers into the formula (4) and solve for a.
Substitute numbers for b and a in the least squares regression equation
(5).
Computations
Y′ = bX + a = .80(X) + 6.40
where .80 and 6.40 represent the values computed for b and a,
respectively.
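The values of b and a can be reproduced with the sketch below. The formulas themselves are not printed in these notes, so the sketch assumes the usual forms b = r√(SSy / SSx) and a = Ȳ − bX̄. The inputs SSx = SSy = 80 and X̄ = 7 are computed from the five pairs of card counts rather than quoted in the notes.

```python
import math

# Inputs from the correlation analysis for the five friends.
# ss_x, ss_y, and mean_x are computed from the data, not quoted in the notes.
ss_x, ss_y, r = 80, 80, 0.80
mean_x, mean_y = 7, 12

b = r * math.sqrt(ss_y / ss_x)   # slope of the regression line
a = mean_y - b * mean_x          # Y intercept

def predict(x):
    """Least squares prediction Y' = bX + a."""
    return b * x + a

print(round(b, 2), round(a, 2), round(predict(1), 2))
```

The last value printed is the prediction for a friend who sent 1 card, which is the case of John discussed later in the notes.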
Example 2.37:
Assume that an r of .30 describes the relationship between
educational level (highest grade completed) and estimated
number of hours spent reading each week. More specifically:
Solution:
CONTENTS:
Standard Error of Estimate
Finding the Standard Error of Estimate
Definition Formula
Computation Formula
Computational Sequence
Computation
Importance of r
sy|x = √(SSy|x / (n − 2)) = √(Σ(Y − Y′)² / (n − 2))
where
SSy|x represents the sum of the squares for predictive errors,
Y − Y′, and the degrees of freedom term in the denominator,
n − 2, reflects the loss of two degrees of freedom.
Computation Formula
Computational Sequence
Assign values to SSy and r (1) by referring to previous work with the least
squares regression equation.
Substitute numbers into the formula (2) and solve for sy|x.
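The two steps can be sketched as follows for the five friends. The computation formula is not printed in these notes, so the sketch assumes the usual form sy|x = √(SSy(1 − r²) / (n − 2)); the value SSy = 80 is computed from the card counts rather than quoted.

```python
import math

# Step 1: values carried over from the regression work (ss_y is computed
# from the five Y scores, not quoted in the notes).
ss_y, r, n = 80, 0.80, 5

# Step 2: sum of squares for predictive errors, then the standard error.
ss_y_given_x = ss_y * (1 - r ** 2)
s_y_given_x = math.sqrt(ss_y_given_x / (n - 2))

print(round(ss_y_given_x, 1), round(s_y_given_x, 2))
```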
Computation
Importance of r
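Let’s substitute a few extreme values for r into SSy|x = SSy(1 − r²) and note
the effect on the sum of squares for predictive errors, SSy|x.
Substituting a value of 1 for r gives SSy|x = SSy(1 − 1²) = 0, so a perfect
relationship leaves no predictive error.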
CONTENTS:
Squared correlation coefficient, r²
Two kinds of predictive errors
Repetitive Prediction of the Mean
Predictive Errors
Error Variability (Sum of Squares)
Proportion of Predicted Variability
Figure 2.23 shows the predictive errors for all five friends when the
mean for all five friends, Ȳ, of 12 is always used to predict each of
their five Y scores.
Figure 3.12 shows the corresponding predictive errors for all five
friends when a series of different Y′ values, obtained from the least
squares equation, is used to predict each of their five Y scores.
For example, Figure 3.11 shows the error for John when the mean for
all five friends, Ȳ, of 12 is used to predict his Y score of 6. Shown as a
broken vertical line, the error of −6 for John (from Y − Ȳ = 6 − 12 = −6)
indicates that Ȳ overestimates John’s Y score by 6 cards.
Figure 2.24 shows a smaller error of −1.20 for John when a Y′ value of
7.20 is used to predict the same Y score of 6.
This Y′ value of 7.20 is obtained from the least squares equation,
Y′ = .80(X) + 6.40 = .80(1) + 6.40 = 7.20
Positive and negative errors indicate that Y scores are either above or
below their corresponding predicted scores.
Overall, errors are smaller for the customized predictions of Y′ from the
least squares equation than for the repetitive prediction of Ȳ.
Using the errors for the five friends shown in Figure 3.11, this becomes
The error variability for the customized predictions from the least squares
equation can be designated as SSy|x, since each Y score is expressed as a
squared deviation from its corresponding Y’ and then summed, that is
Using the errors for the five friends shown in Figure 3.12, we obtain:
where the sum of squares term, SSy′, is simply the variability explained by
or predictable from the regression equation, that is,
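The two kinds of error variability can be computed directly for the five friends, as in the sketch below. Andrea's received count of 10 is inferred from the stated mean of 12 and is therefore an assumption, and the final step assumes the usual definition r² = SSy′ / SSy with SSy′ = SSy − SSy|x.

```python
# (cards sent X, cards received Y); Andrea's Y of 10 is inferred from the
# stated mean of 12, the other values are quoted in the notes.
pairs = [(13, 14), (9, 18), (7, 12), (5, 10), (1, 6)]
b, a = 0.80, 6.40   # least squares values quoted in the notes
mean_y = 12

# Errors from repetitively predicting the mean for everyone.
ss_y = sum((y - mean_y) ** 2 for _, y in pairs)

# Errors from the customized least squares predictions Y' = bX + a.
ss_y_given_x = sum((y - (b * x + a)) ** 2 for x, y in pairs)

# Variability explained by the regression equation, and its proportion.
ss_y_pred = ss_y - ss_y_given_x
r_squared = ss_y_pred / ss_y

print(ss_y, round(ss_y_given_x, 2), round(ss_y_pred, 2), round(r_squared, 2))
```

The proportion printed last equals the square of the correlation coefficient of .80 for these data.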
Problem 2.38:
o Assume that an r of .30 describes the relationship between
educational level and estimated hours spent reading each week.
a. According to r², what percent of the variability in weekly
reading time is predictable from its relationship with
educational level?
b. What percent of variability in weekly reading time is not
predictable from this relationship?
c. Someone claims that 9 percent of each person’s estimated
reading time is predictable from the relationship. What is
wrong with this claim?
Solution:
(a) 9 percent predicted.
(b) 91 percent not predicted.
(c) 9 percent refers to the variability of all estimated reading
times.
Problem 2.39:
The correlation between the IQ scores of parents and children
is .50, and that between the IQ scores of foster parents and
foster children is .27.
a. Does this signify, therefore, that the relationship between
foster parents and foster children is about one-half as
strong as the relationship between parents and children?
b. Use r² to compare the strengths of these two correlations.
Solution:
(a) No
(b) The r² of .25 for parents and children is about four times
greater than the r² of .07 for foster parents and foster children.
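The comparison in (b) can be checked by squaring each correlation, as sketched below. The exact ratio depends on rounding: it is roughly 3.4 with unrounded values and roughly 3.6 when .07 is used, which the solution summarizes as about four times.

```python
r_parents, r_foster = 0.50, 0.27

r2_parents = r_parents ** 2   # proportion of predictable variability
r2_foster = r_foster ** 2

# Compare the two strengths on the r-squared scale, not the r scale.
ratio = r2_parents / r2_foster

print(round(r2_parents, 2), round(r2_foster, 2), round(ratio, 1))
```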
Table 2.13 lists the top 10 hitters in the major leagues during 2014
and shows how they fared during 2015.
Notice that 7 of the top 10 batting averages regressed downward,
toward 260s, the approximate mean for all hitters during 2015.
Hitters among the top 10 in 2014, who were not among the top 10 in
2015, were replaced by other mostly above-average hitters, who also
were very lucky during 2015.
Observed regression toward the mean occurs for individuals or
subsets of individuals, not for entire groups.
BOOK EXERCISES