Second Stats Packet 24
What are the possible values of this variable and how frequently does it take on those values?
What is the typical value of this variable? How much variation is there in the values of this variable?
Bivariate Data!
Key Questions
We also use graphs to learn more about bivariate data:
Response Variable: measures the outcome of a study (y-axis)
Explanatory Variable: attempts to explain changes in observed outcomes (x-axis)
If there is no obvious explanatory/response relationship, either variable may be graphed on the horizontal axis.
No relationship: if one variable
increases, there’s no clear indication
of what we would expect the other
variable to do.
[Figure: six scatterplots labeled A to F, arranged by direction (negative, positive) and by strength (strongest, moderate, weakest relationship)]
Outliers
How well does a child’s height at age 6 predict height at age 16? To find out, the heights of a large group
of children are measured at age 6 and again at age 16.
What are the explanatory and response variables here? Are they categorical or quantitative?
There may be a “gender gap” in political party preference in the US, with women more likely than men to
prefer Democratic candidates. A political scientist selects a large sample of registered voters. She asks each
whether they voted for the Democratic or Republican candidate in the last congressional election. What are
the explanatory and response variables in this scenario? Are they categorical or quantitative?
Process for Examining Relationships in Bivariate Data
1. Graph. Always graph!
2. Check for patterns and deviations.
3. Reference numerical descriptions.
4. Explain the overall pattern. Comment on direction, form, strength, and linearity.
Calories and Salt in Hot Dogs
Do hot dogs that are relatively higher in calories also contain more sodium?
Create a scatterplot showing calories and salt content (in milligrams of sodium) using the given information in the table.
Does the scatterplot show a clear positive or negative relationship between these variables?
What does this association mean about the relationship between calories and salt in hot dogs?
Brand                          Calories   Sodium (in mg)
Schneiders Beef                110        330
Applegate Uncured Beef         110        530
Trader Joes Uncured Turkey     90         580
Oscar Meyer Cheesy Hot Dog     140        540
Morningstar Plant Based        50         430
Deli Beef Franks               150        480
Oscar Meyer Turkey             100        510
Oscar Meyer Jumbo              170        590
Kirkland Signature             170        530
Jenni-O Turkey                 70         390
Nathan’s                       290        790
Oscar Meyer Beef               130        470
Generic Beef & Pork            170        630
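As a numerical companion to the scatterplot, the sketch below computes the correlation r for the hot dog data, then recomputes it without the single 290-calorie brand to show how much one unusual point can matter. The helper name `correlation` is ours, not part of the packet, and this is only one way to organize the arithmetic.

```python
from math import sqrt

# (calories, sodium in mg) pairs copied from the hot dog table
hot_dogs = [
    (110, 330), (110, 530), (90, 580), (140, 540), (50, 430),
    (150, 480), (100, 510), (170, 590), (170, 530), (70, 390),
    (290, 790), (130, 470), (170, 630),
]

def correlation(pairs):
    """Pearson correlation r for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

r_all = correlation(hot_dogs)
# Drop the highest-calorie brand (290 calories) to see its effect on r
r_without_max = correlation([p for p in hot_dogs if p[0] != 290])

print(f"r with all 13 brands: {r_all:.2f}")
print(f"r without the 290-calorie brand: {r_without_max:.2f}")
```

The second value comes out noticeably smaller than the first, which is a reminder to look at the graph before trusting a single number.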
Do You Know How Much You are Spending?
A consumer advocate group asked 5120 people to estimate how much they spend each month on entertainment,
including streaming service subscriptions. The group then used financial records to find the actual average
amount each person spent per month on this category.
Below is a table of the results for 20 randomly selected participants.
Participant Number   Estimate   Actual Amount
1                    180        208
2                    220        265
3                    200        302
4                    150        184
5                    175        193
6                    225        278
7                    125        205
8                    350        459
9                    145        160
10                   100        152
11                   200        190
12                   175        196
13                   150        193
14                   250        269
15                   200        233
16                   150        174
17                   200        217
18                   200        223
19                   250        330
20                   225        247
We think that people generally underestimate how much they spend but that they are aware if their spending is relatively low or high.
a) Make a scatterplot of the data. Use the same scale on both axes since both variables are measured in dollars per month.
b) Describe the relationship. Is there a positive or negative association? Is the relationship approximately linear? Are there any outliers? Are any of the outliers influential points?
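One quick way to check the "people underestimate" idea before graphing is to count how many participants have an actual amount above their estimate (these are the points that would sit above the line actual = estimate on a same-scale scatterplot). The variable names below are illustrative only.

```python
# (estimate, actual) in dollars per month for participants 1 through 20
spending = [
    (180, 208), (220, 265), (200, 302), (150, 184), (175, 193),
    (225, 278), (125, 205), (350, 459), (145, 160), (100, 152),
    (200, 190), (175, 196), (150, 193), (250, 269), (200, 233),
    (150, 174), (200, 217), (200, 223), (250, 330), (225, 247),
]

# A participant underestimated if the actual amount exceeds the estimate
underestimates = sum(1 for est, act in spending if act > est)

# Average size of the gap (actual minus estimate)
gaps = [act - est for est, act in spending]
mean_gap = sum(gaps) / len(gaps)

print(f"{underestimates} of {len(spending)} participants underestimated")
print(f"mean (actual - estimate): {mean_gap:.2f} dollars per month")
```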
Least Squares Regression Line
LSRL: ŷ = a + bx
“Least” means the sum of the squared residuals is minimized.
Calculator Mechanics
Method 1 for Linear Regression
Step One: Enter data into 4: lists and spreadsheets, then add a page 5: data and statistics and view the scatterplot.
Step Three: Linear Regression Window
When you have the option “save regeqn to”, the Nspire will save your regression equation at the location it gives here. That means that on any page in the Nspire document, f1 refers to this regression equation. On a calculator page, f1 can be used to find the predicted value for given values of the independent variable.
Note: Every time you run a linear regression, the variables are overwritten and updated with the most recent results.
Correlation measures the strength and direction of a linear relationship between two variables
(x, y). Correlation is denoted with the letter r.
It does not matter which variable is x (explanatory) and which is y (response), r will have
the same value either way.
r = -1: perfectly linear, negative
r = 0: no linear relationship
r = 1: perfectly linear, positive
Because r is computed from standardized values of the observations, r does not change when we change units.
Moderately strong to strong correlation generally starts at r = 0.8 or r = -0.8.
Like the mean and standard deviation, correlation is strongly affected by outliers (it is non-resistant).
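The properties of r listed here (same value when x and y are swapped, no change under a change of units, and sensitivity to outliers) can be checked numerically. The heights and weights below are made-up illustrative numbers, not data from this packet, and the function name `correlation` is our own.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: heights in inches, weights in pounds
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 125, 140, 155, 170, 180]

r_xy = correlation(heights_in, weights_lb)
r_yx = correlation(weights_lb, heights_in)       # swap explanatory and response
heights_cm = [h * 2.54 for h in heights_in]      # change units: inches to cm
r_cm = correlation(heights_cm, weights_lb)
# Add one unusual (hypothetical) point and recompute
r_outlier = correlation(heights_in + [61], weights_lb + [250])

print(f"r(x, y) = {r_xy:.4f}, r(y, x) = {r_yx:.4f}")
print(f"r after converting inches to cm = {r_cm:.4f}")
print(f"r after adding one outlier = {r_outlier:.4f}")
```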
Don't confuse correlation coefficient and slope of least-squares regression line.
A slope close to 1 or -1 doesn't mean strong correlation.
An r value close to 1 or -1 doesn't mean the slope of the linear regression line is close to 1 or -1.
b = r(s_y / s_x)
A college newspaper interviews a psychologist about student ratings of faculty members’ teaching. The psychologist
says “The evidence indicates that the correlation between the research productivity and teaching rating of faculty is
close to zero.” The paper reports this as the psychologist saying that “good researchers tend to be poor teachers”.
Is that a correct representation of the psychologist’s statement?
Sloppy Writing?
“We found a high correlation (r = 1.09) between students’ ratings of faculty teaching and final grades in the courses.”
“There is a high correlation between American workers’ college majors and their incomes.”
“The correlation between attending a social function and incubation period was 0.23 days.”
Least-Squares Regression Line (LSRL)
LSRL of y on x is the line that makes the sum of the squares of the vertical distances between the data points and
the line (these distances are called residuals, or errors) as small as possible.
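Written as a formula, with a the intercept and b the slope of the line ŷ = a + bx, the LSRL is the choice of a and b that makes the following sum as small as possible. The index i and the sample size n are notation added here for illustration.

```latex
\[
\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
  \;=\; \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
\]
```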
Facts about LSRL:
slope: b = r(s_y / s_x)
In statistics, the coefficient of determination R² is used in the context of statistical models whose main purpose is the prediction of future outcomes on the basis of other related information.
It is the proportion of variability in a data set that is accounted for by the statistical model.
It provides a measure of how well future outcomes are likely to be predicted by the model.
Remember that r² > 0 doesn't mean r > 0. For instance, if r² = 0.81, then r = 0.9 or r = -0.9. To decide the correct sign of r, we need to know the direction of the relationship between the variables. r and b have the same sign but not necessarily the same magnitude.
Example: Pizza!
$8 plain
$1.50 per topping
Number of Toppings   Price
0                    8
1                    9.50
2                    11
3                    12.50
4                    14
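Because every topping adds exactly $1.50 to the $8 base price, this table is perfectly linear. The short sketch below fits the least-squares line to the five rows and confirms that the slope matches the per-topping price, the intercept matches the plain price, and r equals 1. Variable names are ours.

```python
from math import sqrt

toppings = [0, 1, 2, 3, 4]
price = [8, 9.50, 11, 12.50, 14]

n = len(toppings)
mean_x = sum(toppings) / n
mean_y = sum(price) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(toppings, price))
sxx = sum((x - mean_x) ** 2 for x in toppings)
syy = sum((y - mean_y) ** 2 for y in price)

slope = sxy / sxx                    # dollars per topping
intercept = mean_y - slope * mean_x  # price of a plain pizza
r = sxy / sqrt(sxx * syy)

print(f"predicted price = {intercept:.2f} + {slope:.2f}(toppings), r = {r:.2f}")
```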
Pizza!
$8 plain
$1 per veggie topping (spinach, onion, mushrooms)
$2 per meat topping (pepperoni, sausage)
No double toppings.
Number of Toppings   Price
0                    8
Herpetologists measure, among other things, the snout vent length (SVL) of tropical
lizards and study how the SVL relates to clutch size.
Kiefer, Mara & Van Sluys, Monique & Rocha, Carlos. (2008). Clutch and egg size of the tropical
lizard Tropidurus torquatus (Tropiduridae) along its geographic range in coastal eastern Brazil.
Canadian Journal of Zoology. 86. 1376-1388. 10.1139/Z08-106.
Kelley Blue Book is a resource that provides value estimates for used cars in North America for individuals looking to buy
cars. Car dealers frequently use a “Red Book” that gives recent selling prices for used cars in order to help them decide
what to pay for a trade-in. Until recently, there were no instructions to dealers for how to use the odometer reading to
determine the trade-in value of the car. In order to determine how the mileage affects the value of the car, ten sales of
cars of the same make, condition and options are selected, and their odometer readings and sales price are collected. The
data is shown in the table:
Trade in Value (in 100s of dollars) 37 31 43 39 41 39 35 40 29 33
Odometer (in 1000s of miles) 59 92 61 72 52 67 88 62 95 83
Sketch the scatterplot of this data. Be sure to put the appropriate variable on the horizontal axis.
Does there appear to be an association between the odometer reading and trade-in value? If so, what is it?
Give the correlation coefficient for this linear model. Interpret the value of this number.
Software Regression Output
This output lets us know the regression equation for this data set would be: ŷ = 4.486 + 8.253
Researchers studying acid rain measured the acidity of precipitation in a Colorado wilderness area for 150 consecutive weeks.
Acidity is measured by pH. Lower pH values show higher acidity. The acid rain researchers observed a linear pattern over time.
They reported the least-squares regression model:
predicted pH = 5.43 − 0.0053(weeks)
Is this association positive or negative? Explain what the association means in context.
What was the pH at the beginning of the study (weeks = 1)? What was the pH by the end of the study (weeks = 150)?
A scatterplot of grade point average (GPA) against IQ test scores for 78 seventh-grade students is created.
Calculation shows the mean and standard deviation of the IQ scores are: x̄ = 108.9, s_x = 13.17
For the grade point averages, ȳ = 7.447, s_y = 2.10
The correlation between IQ and GPA is r = 0.6337.
a) Find the equation of the least-squares line for predicting GPA from IQ.
b) What percent of the observed variation in these students’ GPAs can be explained by the linear relationship
between GPA and IQ?
c) One student has an IQ of 103 and a GPA of 2.35. What is the predicted GPA for a student with IQ = 103?
What is the residual for this particular student?
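A sketch of how summary statistics alone determine the least-squares line: the slope comes from b = r(s_y / s_x), and the intercept follows from the fact that the LSRL passes through the point (x̄, ȳ), so a = ȳ − b·x̄. The variable names are ours; the numbers are the ones quoted in the problem.

```python
# Summary statistics quoted in the problem
mean_iq, sd_iq = 108.9, 13.17
mean_gpa, sd_gpa = 7.447, 2.10
r = 0.6337

slope = r * sd_gpa / sd_iq               # b = r * (s_y / s_x)
intercept = mean_gpa - slope * mean_iq   # the line passes through (x-bar, y-bar)
r_squared = r ** 2                       # proportion of variation explained

# Prediction and residual for the student with IQ 103 and GPA 2.35
predicted = intercept + slope * 103
residual = 2.35 - predicted              # observed minus predicted

print(f"predicted GPA = {intercept:.3f} + {slope:.4f}(IQ)")
print(f"r-squared = {r_squared:.3f}")
print(f"predicted GPA at IQ 103 = {predicted:.2f}, residual = {residual:.2f}")
```

A negative residual means the student's actual GPA falls below the line's prediction.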
The four data sets below share the same first five points and differ only in the sixth point.
Shared points:
x   1    2    3    4    5
y   15   19   16   26   22
Data set   Sixth point   LSRL                    ŷ(4.5)
1          (6, 30)       ŷ = 11.933 + 2.686x     24.02
2          (10, 60)      ŷ = 5.295 + 5.049x      28.01
3          (6, 60)       ŷ = 1.933 + 6.971x      33.3
4          (15, 30)      ŷ = 16.526 + 0.962x     20.86
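The four six-point data sets here (five shared points plus one of (6, 30), (10, 60), (6, 60), or (15, 30)) show how a single point can move the regression line. The sketch below refits the line for each version of the sixth point; `lsrl` is a helper name we introduce for illustration.

```python
def lsrl(points):
    """Return (intercept, slope) of the least-squares line for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

shared = [(1, 15), (2, 19), (3, 16), (4, 26), (5, 22)]
sixth_points = [(6, 30), (10, 60), (6, 60), (15, 30)]

fits = {}
for point in sixth_points:
    a, b = lsrl(shared + [point])
    fits[point] = (a, b)
    print(f"sixth point {point}: y-hat = {a:.3f} + {b:.3f}x, "
          f"y-hat(4.5) = {a + b * 4.5:.2f}")
```

Moving only the sixth point changes the slope from under 1 to nearly 7, which is the behavior of an influential point.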
Remember:
residual = y – y-hat, or the (observed/actual y) minus (the y value predicted for that x value)
We can graph residuals. The horizontal axis is the independent variable and the y-axis is RESID,
which results from linear regression in the calculator.
The graphs of residuals (residual plots) can tell us if our linear model is appropriate for the given
data. It is another way to determine the best way to analyze bivariate data.
Example 1
This data produces a favorable residual plot that indicates the line is a good model for the data.
Age (months)   18    19    20    21    22    23    24    25    26    27    28    29
Height (cm)    76.1  77.0  78.1  78.2  78.8  79.7  79.9  81.1  81.2  81.8  82.8  83.5
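To see what a "favorable" residual plot looks like in numbers, the sketch below fits the least-squares line to the age and height data of Example 1 and lists each residual (observed minus predicted). The residuals are small and switch sign with no obvious pattern. Variable names are ours.

```python
ages = list(range(18, 30))  # months: 18, 19, ..., 29
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
           79.9, 81.1, 81.2, 81.8, 82.8, 83.5]  # cm

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(heights) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, heights))
sxx = sum((x - mean_x) ** 2 for x in ages)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# residual = observed y minus predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(ages, heights)]

print(f"predicted height = {intercept:.2f} + {slope:.3f}(age)")
for x, e in zip(ages, residuals):
    print(f"age {x}: residual {e:+.2f}")
```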
Example 2
This data produces a residual plot that indicates the line is not a good model.
x   1   3      4      5      6       8       10      12      15
y   1   4.66   6.96   9.52   12.29   18.38   25.11   32.42   44.31
Interpret.
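For Example 2, the pattern shows up in the signs of the residuals. The sketch below fits a line to the data and prints one sign per point, ordered by x. Residuals that are positive at both ends and negative through the middle trace a curve, which is the kind of pattern that signals a line is not a good model. Variable names are ours.

```python
x = [1, 3, 4, 5, 6, 8, 10, 12, 15]
y = [1, 4.66, 6.96, 9.52, 12.29, 18.38, 25.11, 32.42, 44.31]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# residual = observed y minus predicted y, in order of increasing x
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"y-hat = {intercept:.3f} + {slope:.3f}x")
print("residual signs by increasing x:", signs)
```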
Cautions About Linear Regression
How to tell if a linear model is appropriate (should the relationship be modeled with a linear regression?)
1. check graph (is it clearly not linear?)
2. look at r and r-squared
3. check residual plot. A pattern in the residual plot means a linear model is not appropriate.
An outlier is an observation that lies beyond the range of x or y values of other observations in the
scatterplot.
An outlier can be an observation that lies out in the horizontal direction or in the vertical direction
or in both directions.
An observation is influential if removing it would markedly change the position or slope of the
regression line.
Points that are outliers in the x direction are often influential points.
Influential points often have small residuals because they pull the LSRL towards themselves.
Correlation and regression describe only linear relationships and are not resistant to the influence of outliers. r² is also influenced by outliers.
Extrapolation
Predicting for an x value that is well outside the known data values for x.
Such predictions cannot be trusted.
Example: a model that predicts height based on age, based on data from children aged 5 - 11 years,
cannot be used to predict height for someone who is 20 years old.
Example: if we create a model of average quiz scores and average test scores, that
model should not be used to predict a single student’s test score based on their quiz scores
Causation: Changes in explanatory variable cause changes in response variable.
Association Scenario I: Causation
We think: changes in x cause changes in y (x → y).
Example: x = more practice time, y = performance quality
Lurking Variable: A variable that is not among the explanatory or response variables in a study yet may influence the interpretation of relationships among the variables.
What’s really going on: Variations in both z and x are creating variation in y, but we cannot discern the effects.
Confounding variables can be lurking variables.
Changes in a confounding variable can also change x and y.
Nations with greater internet speeds have higher life expectancies.
Could we increase life expectancy in a particular country by improving their internet infrastructure?
Data show that married men (and men who are divorced or widowed) earn more than men who have
never been married. If you want to make more money, should you get married?
A study of elementary school children ages 6-11 finds a high positive correlation between shoe size (x) and score on a
common reading comprehension assessment (y).
What explains this correlation?
Members of a language club believe that early study of a foreign language by native English speakers improves a
student’s command of English. They obtain students’ scores on an English achievement test given to all 8th grade students
and find that the mean score of 8th graders who studied a foreign language in elementary or middle school is much
higher than the mean score of students who have not yet started learning a foreign language. Does this justify their
assertion about foreign language study and English skills? What lurking variables might be present here?
To test the health benefits of herbal tea, a group of college students make weekly visits to a local nursing home, where
they visit the residents and serve them herbal tea.
After several months of the twice-weekly visits with tea, the staff at the nursing home reported improvements in
qualitative health measures such as increased cheerfulness and lower overall anxiety.
What is the explanatory variable here? What is the response variable?
Can the college students conclude that the herbal tea caused the change in residents’ wellbeing?
If not, what could the causation relationship be?
A study shows that there is a positive correlation between the size of a hospital
(measured by the number of patient beds) and the median number of days that a
patient remains in the hospital. Does this mean you can shorten a hospital stay by
choosing a smaller hospital?
Number of hours studied   Final exam score   Number of hours studied   Final exam score
0.25                      58                 3                         80
0.5                       72                 3                         84
1                         70                 3                         86
1                         76                 3.5                       78
1.2                       70                 3.5                       86
1.5                       78                 3.5                       92
1.75                      75                 4                         82
2                         80                 4                         93
2                         75                 5                         97
2                         83                 5                         94
2                         80                 5.5                       100
2                         83                 6                         92
2.5                       84                 6                         98
What does the scatterplot reveal? Is the data obviously non-linear?
What do r, r² tell you about a linear model for these variables?
What about the residual plot?
What is the regression equation (in context)?
What is the predicted score for someone who ignores everything else in life and studies 10 hours?
Is that predicted score for 10 hours of studying a reasonable prediction? How would you explain that particular prediction to
your friend who needs that score to make an A in the class and has decided to study for 10 hours?
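The 10-hour question is an extrapolation: the data only cover study times from 0.25 to 6 hours. The sketch below fits the least-squares line to the 26 (hours, score) pairs and evaluates it at 10 hours so the prediction can be compared with the largest score that appears in the data (100). Variable names are ours.

```python
# (hours studied, final exam score) pairs from the table
study = [
    (0.25, 58), (0.5, 72), (1, 70), (1, 76), (1.2, 70), (1.5, 78),
    (1.75, 75), (2, 80), (2, 75), (2, 83), (2, 80), (2, 83), (2.5, 84),
    (3, 80), (3, 84), (3, 86), (3.5, 78), (3.5, 86), (3.5, 92),
    (4, 82), (4, 93), (5, 97), (5, 94), (5.5, 100), (6, 92), (6, 98),
]

n = len(study)
mean_x = sum(h for h, _ in study) / n
mean_y = sum(s for _, s in study) / n
sxy = sum((h - mean_x) * (s - mean_y) for h, s in study)
sxx = sum((h - mean_x) ** 2 for h, _ in study)

slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted_10 = intercept + slope * 10   # well outside the observed 0.25 to 6 hours

print(f"predicted score = {intercept:.2f} + {slope:.2f}(hours)")
print(f"predicted score for 10 hours: {predicted_10:.1f}")
print(f"largest observed score: {max(s for _, s in study)}")
```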
x     y
15    33
20    42
40    81
48    98
70    137
75    152
80    167
91    200
98    180
108   260
125   300
140   170
160   400
What does the scatterplot reveal? Is the data obviously non-linear?
What do r, r² tell you about a linear model for these variables?
What about the residual plot?
Linearizing Data with Transformations
Often a straight-line pattern is not the best model for depicting a relationship between
two variables. A clear indication of this problem is when the scatter plot shows a
distinctive curved pattern. Many times this happens when a variable is growing
exponentially instead of linearly. A variable grows exponentially if it is multiplied by a
fixed number greater than 1 in each equal x-interval (the relationship is said to be
exponential decay if the fixed number is less than 1).
If you have a nonlinear pattern, many times you can transform one or both of the
variables in order to uncover a linear relationship. If a variable is growing exponentially,
taking the log (common or natural) of that variable will uncover a linear pattern.
POPULATION (1000s): 50 67 91 122 165
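A sketch of the log transformation on the population values. The x-values for this table did not survive in the packet, so the code assumes five equally spaced time periods labeled 0 through 4; that labeling is our assumption. It compares the correlation of the raw values with the correlation after taking the natural log of population.

```python
from math import log, sqrt

population = [50, 67, 91, 122, 165]  # thousands, from the table
period = [0, 1, 2, 3, 4]             # ASSUMED equally spaced time periods

def correlation(xs, ys):
    """Pearson correlation r for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Exponential growth: each value is roughly a fixed multiple of the previous one
ratios = [b / a for a, b in zip(population, population[1:])]

r_raw = correlation(period, population)
r_log = correlation(period, [log(p) for p in population])

print("successive ratios:", [round(q, 2) for q in ratios])
print(f"r for (period, population):     {r_raw:.4f}")
print(f"r for (period, ln population):  {r_log:.4f}")
```

The nearly constant ratios are the signature of exponential growth, and the transformed data sit much closer to a straight line.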
1. The paper “Population Pressure and Agricultural Intensity” reported a positive association between population density and agricultural intensity. The accompanying graph consists of measures of population density and agricultural intensity for 18 different subtropical locations.
2. The following graph’s data uses the average radius of each planet’s orbit, in terms of the Earth’s orbit radius, to predict the length of the planet’s year in Earth years.
LSRL Practice and Applications
Show all of your work neatly and clearly on separate paper. Answer in context.
1. According to an article about secondary education there is a mild correlation (r = 0.55) between
high school GPA and college GPA. The high school GPAs in the sample have a mean of 3.7 and
standard deviation of 0.47. The college GPAs in the sample have a mean of 2.86 with standard
deviation of 0.85.
a) What is the explanatory variable?
b) What is the slope of the LSRL of college GPA on high school GPA? Interpret it in context
of the problem.
c) If Bob’s high school GPA is 3.2, what could we expect of him in college?
3. One measure of the success of knee surgery is postsurgical range of motion for the knee joint.
Postsurgical range of motion was recorded for 12 patients who had surgery following a knee
dislocation. The age of each patient was also recorded (“Reconstruction…” American Journal of
Sports Medicine). The average age was 25.83 years and standard deviation of 7.578 years. The
average range of motion was 130.1 degrees with a standard deviation of 11.927 degrees. The
correlation coefficient was r = .5534.
a) If we use age to try and predict the range of motion, what is the slope? What is the y-intercept?
Interpret the two (slope and y-intercept) in context of the problem.
b) Use the regression line to predict the range of motion of someone 32 years of age.
c) Use the regression line to predict the range of motion of someone 50 years of age. Do you feel
this is an accurate prediction? Explain your thoughts.
4. Greenlight gave the following 2023 average weekly earnings (in USD) from allowances and
payment for chores for children of ages 5 through 18.
Age 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Earnings 6 6.7 7.1 7.8 8.4 9.3 10.3 11.6 13 14.1 16.1 20.5 25 32
5. Success in hunting varies greatly among species of animals. Lions, who hunt singly, are rarely
successful in more than 10 percent of their hunts. Wild African dogs, who hunt in packs, are among
the most efficient of all hunters, succeeding at a rate of over 90 percent of their hunts.
In the early 1960’s, researcher Jane Goodall discovered that chimpanzees were not solely
vegetarian in their diets, as had previously been thought. This discovery spurred a tremendous
amount of primate research. Some of the latest primatology research has been done on chimpanzees
to find out if larger hunting parties increase the chances of a successful hunt. The results of one such
research project are summarized in the table for the number of chimpanzees in the hunting party
versus the percentage of successful hunts.
Number of Chimps 1 3 5 5 6 7 8 8 9 10 10 11 12
Percent of Success 20 28 42 40 58 45 62 65 63 75 78 75 82
6. The following output data from MINITAB shows the number of teachers (in thousands) for each
of the states plus the District of Columbia against the number of students (in thousands) enrolled in
grades K-12.
Predictor Coef Stdev t-ratio p
Constant 4.486 2.025 2.22 0.031
Enroll 0.053401 0.001692 31.57 0.000
s=2.589 R-sq=81.5%
a) What is the equation of the least squares line? Interpret the slope.
b) Predict the number of teachers if the number of students in the state is 35,700.
7. Shells of mollusks function as both part of the skeletal system and as protective armor. It has
been argued that many features of these shells were the result of natural selection in the constant
battle against predators. The paper “Postmortem Changes in Strength of Gastropod Shells” included a
scatter plot of data on x = shell height (cm) and y = breaking strength (newtons). The least squares
line for a sample of 38 hermit crab shells was ŷ = -275.1 + 244.9x.
a. What are the slope and intercept of this line?
b. When shell height increases by 1 cm, by how much does breaking strength tend to
change?
c. What breaking strength would you predict when shell height is 2 cm?
d. Does this approximate linear relationship appear to hold for shell heights as small as 1
cm? Explain your thoughts.
AP Statistics Review LSRL
PART I : CORRELATION. For each of the following three scatter plots, identify the correlation as either strong
positive, weak positive, strong negative, weak negative, or little-or-no correlation.
[Three scatterplots, each with both axes scaled from 0 to 50]
___________2b. The two coefficients (a and b) for the line of best fit have the same
sign.
___________2c. The correlation coefficient has the same sign as the slope of the least
squares line fitted to the same data.
___________2d. An r-value greater than zero indicates that ordered pairs with high x-values
will have low y-values.
___________2e. A correlation of -.41 and + .41 show the same degree of clustering around the
regression line.
___________2f. A correlation of .75 indicates a relationship that is 3 times as linear as one for which
the correlation is only .25.
___________2h. A definite pattern in the residual plot is an indication that a nonlinear model will
show a better fit to the data than the straight regression line.
___________2i. An x-value that is an outlier in the x-direction is more indicative that a point is
influential than a y-value that is an outlier in the y-direction.
PART III : MULTIPLE CHOICE
__________ 3a. In a random sample of older patients at a large medical practice, the age of a patient and a
measure of that patient’s hearing loss were recorded. The correlation between age and hearing
loss of the patients in the sample was found to be 0.7. Which one of the following would be a
correct statement if the age of a patient were used to predict the amount of hearing loss for a
patient?
(A) Forty-nine percent of the time, the least squares regression line accurately predicts hearing loss.
(B) Forty-nine percent of the variation in hearing loss can be explained by the least squares regression
line relating hearing loss and age.
(C) About 70% of a person’s hearing loss can be explained by age, according to the regression line
relating hearing loss and age.
(D) About 70% of the time, age will correctly predict the amount of hearing loss.
(E) The least squares regression line relating hearing loss to age will have a slope of
approximately 0.7.
__________ 3b. Suppose one has collected data on X the diameter of tree trunk and Y tree height. If the
Regression equation is ŷ = -3.6 + 3.1x, what is your estimate of the average height of all
trees having a trunk diameter of 7 inches?
(A) 18.1 (B) 19.1 (C) 20.1 (D) 21.1 (E) 22.1
__________ 3c. A correlation between college entrance exam grades and scholastic achievement was found to
be -1.08. On the basis of this you would tell the university that:
__________3d. A study of the effects of television measured how many hours of television each of 125
grade school children watched per week during a school year and their reading scores.
Which variable would you put on the horizontal axis of a scatterplot of the data?
____________3e. The study described in the previous question found that children who watch more television
tend to have lower reading scores than children who watch fewer hours of television. The
study report says that, “Hours of television watched explained 9% of the observed variation
in the reading scores of the 125 subjects.” The correlation between hours of TV and reading
score must be
(A) r = -0.3 (B) r = 0.3 (C) r = -0.09
(D) r = 0.09
(E) Can’t tell from the information given.
__________3g. A study of child development measures the age (in months) at which a child begins to talk
and also the child’s score on an ability test given several years later. The study asks
whether the age at which a child talks helps predict the later test score. The least-squares
regression line of test score y on age x is ŷ = 110 – 1.3x. According to this regression line,
what happens (on the average) when a child starts talking one month later?
____________3j A study utilizing a simple random sample of 40 college students studied their hours of part-
time work and grade point average. It was found that the correlation between the variables
was -.43. If the resulting linear regression equation is: predicted GPA= 3.75 - .05(hours),
which of the following is NOT a correct statement?
(A) The average GPA of students who don’t work is approximately 3.75
(B) If the correlation coefficient was -.60, the slope of the regression line would be
approximately -.07.
(C) Students who work 40 hours per week have a mean GPA of approximately 1.75.
(D) The value of the correlation coefficient and the steepness of the regression line are not
related.
(E) 18.5% of the variation in GPA scores can be explained by the hours of part-time work.
____________3l. An efficiency expert wanted to see if there is a relationship between the number of people
attending a meeting and the number of minutes late that the meeting started. The table shows
the results with the accompanying scatter plot (Figure 1).
[Figure 1: scatterplot of Minutes Late (vertical axis) versus Number attending (horizontal axis)]
Figure 2 represents the scatter plot with the point (10, 7) removed from the data.
[Figure 2: scatterplot of Minutes Late (vertical axis) versus Number attending (horizontal axis)]
Which one of the following is TRUE about the point (10, 7)?
____________3m. Most colleges have an end-of-course evaluation of the instructor. A random sample of
students at a large university are asked to rate their instructor on a scale of 1 (poor) to 4
(excellent) and to also rate the subject matter of the course on a scale of 1 (did not like) to 4
(liked a lot). The scatterplot below shows the results of one semester of these evaluations.
[Scatterplot: Instructor Rating (vertical axis, 1 to 4) versus Course Rating (horizontal axis, 1 to 4)]
(A) Students tended to rate their instructors more highly than they did their courses.
(B) Students who rated the course a 3 tended to give the instructor a lower rating.
(C) There is not much variation in the instructor ratings.
(D) There doesn’t seem to be any relationship between course rating and instructor rating.
(E) There appears to be a very strong linear relationship between course rating and instructor rating.
PART IV: SHORT ANSWERS.
4. A psychologist determines that a strong, positive, linear relationship exists between an individual’s IQ
score and their sense of humor. She randomly selected 45 adults and found the following results.
b) The value of r² is .81, which indicates a fairly strong relationship between IQ and humor. Is
this relationship causal or associative? Justify your answer.
LSRL Review part 2
True/False:
_____1. If the least-squares equation relating the independent variable x and the
dependent variable y for a given problem is y = 2x+5, then an increase of 1 unit in x is
associated with an increase of 2 units in y.
_____2. The coefficient of determination measures the variation in the dependent variable
that is explained by the regression model.
_____4. If your computed correlation coefficient r = +1.2, then you have better than a
perfect positive correlation.
_____5. A student might expect that there is a positive correlation between the age of
his or her computer and its resale value.
Multiple Choice:
Use the following set of observations for the independent variable x and the dependent variable y in
questions #6-7:
X -3 -1 1 3
Y 8 4 5 -1
_____8. The correlation between two scores X and Y equals 0.8. If both the X scores
and the Y scores are converted to z-scores, then the correlation between z-
scores for X and z-scores for Y would be
A) -0.8
B) -0.2
C) 0.0
D) 0.2
E) 0.8
_____9. A least-squares regression line was fitted to the weights (in pounds) versus age
(in months) of a group of many young children. The equation of the line is
ŷ = 16.6 + 0.65t, where ŷ is the predicted weight and t is the age of the child. A 20-month-old
child in this group has an actual weight of 25 pounds. Which of the following is the
residual weight, in pounds, for this child?
A) -7.85
B) -4.60
C) 4.60
D) 5.00
E) 7.85
_____11. A study of fuel economy for various automobiles plotted the fuel consumption
vs. speed. A LSRL was fit to the data. Here is the residual plot from this least-
squares fit. What does the pattern of the residuals tell you about the linear model?
Match the following scatter plots with the appropriate correlations from the list:
Free Response:
13. The equation of the least squares regression line for a set of given points is ŷ = 1.3 + 0.73x.
What is the residual for the point (4, 7)?
14. What is the explanatory variable and what is the response variable?
15. Carla collected data on temperature and number of chirps and found this:
x̄ = 166.8, s_x = 31.0, ȳ = 78.83, s_y = 9.11, and r = 0.461
Use this information to write the equation of the LSRL.
Inferential Statistics
These methods are used to take sample data and use it to draw a conclusion about a population.
We use sample means that we calculate to learn about population means that are unknowable.
We use sample proportions that we calculate to learn about population proportions that are unknowable.
Reminder: a statistic is from a sample, a parameter is from a population
A graduate student in an education program wants to know the ways HS teachers in her state have been
using AI tools in the classroom. She obtains a list of certified HS teachers in her state and from that list
randomly selects 200 teachers. She emails a survey to each of the 200 teachers and receives 132
returned and completed surveys.
The list she uses to choose her sample is called a sampling frame. Even though she wants to learn
about ALL HS teachers in her state, the only feasible data she can use is the available list of certified teachers.
For each of the following sampling situations, identify the population as precisely as possible. (What individuals are
included in the population?)
The Gallup organization in the U.S. questions a sample of about 1500 adult U.S. residents to collect opinions
about a variety of issues.
Every ten years, the U.S. census will collect basic information from every household in the United States. The census will
select a sample of households to receive the ‘long form’ of the census that asks for more detailed information. About
one in ten households are asked to complete the long form.
A manufacturer purchases fasteners from a supplier in South America. As part of its routine quality control, the
manufacturer will randomly select a sample of 20 fasteners from each day’s supply and test them for durability and
accurate measurements.
Simple Random Sampling (SRS)
Sample is chosen using a randomized method such that any set of n individuals has an
equal chance of being selected. (this is actually not that simple!)
For our sample of 40 CISH students, if we randomly choose 10 students from each grade in the high school, is this a simple random sample?
For our sample of 40 CISH students, if we randomly choose 4 home groups and include all ten students from each home group in our sample, would this be an SRS?
1. Alex
2. Justin
3. Arthur
4. Bao Nhi
5. Chaerin
6. Brad
7. Calvin
8. Michael
9. Daniel
10. Duc
11. Ella
12. Jesus
13. Hayle
14. Kate
15. Yoochan
16. Yoowon
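One way to draw an SRS from the numbered list above is sketched here. The sample size of 4 and the seed are assumptions chosen only for illustration, and `simple_random_sample` is a hypothetical helper name.

```python
import random

# Roster from the numbered list above
roster = ["Alex", "Justin", "Arthur", "Bao Nhi", "Chaerin", "Brad", "Calvin",
          "Michael", "Daniel", "Duc", "Ella", "Jesus", "Hayle", "Kate",
          "Yoochan", "Yoowon"]

def simple_random_sample(population, n, seed=None):
    """Return n distinct individuals; every set of n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)  # sampling without replacement

sample = simple_random_sample(roster, 4, seed=1)
print(sample)
```

Because the draw is made without replacement, nobody can be chosen twice, which is the same as skipping repeated labels when reading a random digits table.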
Stratified Random Sample
Divide the population into subgroups of similar individuals (similar in ways that are important to the topic of
interest), called strata. Choose a separate SRS within each stratum and combine these into a full sample
A probability-based sampling method.
Usually more representative than an SRS.
What strata should we use if we are interested in:
If CISH has 600 students, what would n be for our sample of 40 students?
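A stratified random sample can be sketched like this. The school below is hypothetical (four grades of 150 students each, identified by made-up ID labels), and `stratified_sample` is an illustrative helper, not a standard function.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a separate SRS of size n_per_stratum in each stratum, then combine."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

# Hypothetical population: grades 9-12 with 150 student IDs each (600 total)
strata = {grade: [f"G{grade}-{i:03d}" for i in range(1, 151)]
          for grade in (9, 10, 11, 12)}

sample = stratified_sample(strata, 10, seed=1)
print(len(sample))
```

Every sample produced this way has exactly 10 students per grade, so many possible groups of 40 can never occur. That is why a stratified sample is not an SRS.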
Multi-stage Sampling
probability-based method
Sample is chosen in stages, starting with larger groups, then eventually smaller groupings.
Cluster Sample
Population is split into subgroups that are convenient or already exist in the population.
These subgroups are called clusters.
Randomly select clusters.
All units within the randomly selected clusters are included in the sample.
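The cluster steps above can be sketched as follows. The clusters here are hypothetical (60 home groups of 10 students each), and `cluster_sample` is an illustrative helper name.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters; every unit in a chosen cluster is sampled."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for c in chosen for unit in clusters[c]]

# Hypothetical population: 60 home groups, 10 students in each
clusters = {g: [f"HG{g:02d}-{i}" for i in range(1, 11)] for g in range(1, 61)}

sample = cluster_sample(clusters, 4, seed=1)
print(len(sample))
```

Note the contrast with stratified sampling: here the randomness is in which groups are picked, not in who is picked inside each group.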
Bias in Sampling
A sampling method is biased if it systematically favors a particular outcome. In a biased sample, not all
viewpoints or situations that can affect the outcome have an equal or proportional chance of being
represented.
Inference methods account for random sampling error (chance variation from sample to sample); they do not correct for bias.
Bias can be avoided by a well-designed sampling method.
Sources of Bias in Sampling
Non-response
You are on the staff of an elected government representative who is interested in public support for
increased funding for elder care. You report that about 213 messages have been received from the
public via mail or email and of those messages 141 oppose the increased spending. The government
representative is surprised and says she expected stronger support for the initiative. Would you conclude
that the majority of voters oppose the increased funding?
Sampling Methods
Match each setting with the correct sampling method by writing the correct letter on each line.
_________ 1. A mid-size paper company wishes to take a random sample of its clients. Clients are divided into
Small (under $50k), Medium ($50k to $250k), and Large (over $250k). A random digits table is
used to select 30 small, 15 medium, and 10 large clients.
_________ 2. An ultimate frisbee tournament organizer wants to estimate the mean number of years of playing
experience for its participants. She uses a list of all participants, numbered 001 to 312, and a
random digits table to select 25 participants.
__________ 3. A restaurant manager wants to gauge the opinion of the restaurant’s customers. He walks around
and interviews 20 people who are eating there one Saturday night.
__________ 4. A comic book store wishes to take a sample of comic books. Their inventory is currently stored in
cardboard boxes. The manager numbers each box from 001 to 684 and uses a random digits table
to select 20 random boxes. All comics in these boxes are included in her sample.
__________ 5. An exit poller wants to estimate the proportion of voters who voted for certain candidates. He
stands outside a particular polling place at 10am and interviews the first 100 voters to exit.
__________ 6. Tobias wants to watch a random sample of Simpsons episodes from the first 10 seasons (which are
the best). He chooses an SRS of 3 episodes from each of these ten seasons, and watches these
episodes.
__________ 7. To choose a sample of fifty employees from a large corporate office, a list of employees is
obtained, alphabetized, and labeled numerically. A random digits table is read left to right, in sets
of threes, until fifty unique labels are found. These fifty employees are used in the sample.
__________ 8. A local newspaper wishes to estimate the years of graduate education that the teachers
in the local school district have obtained. An announcement is placed in the district newsletter
for all teachers in the district asking them to contact the newspaper with the number of years of
graduate education obtained.
AP Statistics
Methods of Sampling
Discussion Questions:
1. Which of the following sampling methods produce a random sample from a class of 36 students:
2. Describe how you would select a sample of 10 juniors from your school using the following methods:
a. SRS
b. convenience sampling
e. systematic sampling
f. multi-stage sampling
3. For each sampling method below, tell which groups in the population are likely to be
underrepresented.
• To obtain a sample of households, a television rating service dials numbers taken at random
from telephone directories.
• To determine the percentage of teenage girls with long hair, Teen magazine published a mail-
in questionnaire. Of the 500 respondents, 85% had hair shoulder length or longer.
• To evaluate the reliability of cars owned by its subscribers, Consumer Reports magazine
publishes a yearly list of automobiles and their frequency-of-repair records. The magazine
collects the information by mailing a questionnaire to subscribers and tabulating the results
from those who return it.
• A college psychology professor needs subjects for a research project to determine which colors
average American adults find restful. From the list of all 743 students taking introductory
psychology at her school, she selects 25 students using a random number table.
• For a survey of student opinions about school athletic programs, a member of the school board
obtains a sample of students by listing all students in the school and using a random number
table to select 30 of them. Six of the students say that they don’t have time to participate, and
they are eliminated from the sample.
Stat Homework 3.1
Homework Questions
1. Retailers at the local shopping mall want to survey their Saturday customers about their satisfaction
with the eating facilities within the mall. One merchant went to business school and learned about the
importance of statistics, so he wants to obtain a random sample. He proposes the following method:
Interviewers should stand at the center of the mall and select the first 100 people who walk by after
11:00 a.m. He believes this approach will provide a random sample because the interviewers will not
exercise any decision over whether or not to include specific individuals in the sample.
a. What kind of sample would the merchant really get?
b. In what way might this sampling method be biased?
c. Describe how the merchant could modify this approach to use a version of systematic
sampling.
d. If the retailer were to use stratified random sampling, what strata would you recommend that
he choose?
e. What method would you suggest to the merchant? Explain your choice.
2. The Educational Testing Service (ETS) needed a representative sample of college students. ETS first
divided all colleges into groups of similar ones (such as public colleges with more than 25,000
students, small private schools, etc.). Then they used their judgment to choose one representative
school from each group, thus obtaining the sample of schools. Each school in turn picked a sample of
students.
a. ETS divided the colleges into strata but did not perform stratified random sampling. Explain.
b. Suggest ways to improve this sampling scheme.
4. A newspaper article began, “Almost half of the USA’s secretaries would rather work for a man than a
woman, even though a male boss is more likely to ask them to clean the coffeepot, says a Working
Woman survey” (USA Today). This is the result of a “poll of 1,100 readers in the magazine’s May
issue.” Of these readers, 46% prefer to work for a man, 5% for a woman, and 49% say it doesn’t
matter.
Experimental Design
Sometimes an experiment is impossible or unethical and an observational study (or simulation) has to be used.
Example: one experiment wants to see the change in the amount
of food wasted by students when certain factors are changed.
Second Principle of Experimental Design: Randomization
Systematic differences among groups in a comparative experiment are a possible source of
bias. The remedy is to use randomization to make group or treatment assignments.
Replication refers to having an adequate number of experimental units or subjects in each group.
Experimental Design: Completely Randomized
All experimental units are allocated randomly among the treatments. This is done to produce groups
that are similar before treatments are applied.
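A completely randomized assignment can be sketched as below. The 30 numbered units and the three treatment labels are hypothetical, and `completely_randomized` is an illustrative helper name.

```python
import random

def completely_randomized(units, treatments, seed=None):
    """Shuffle all units, then deal them out to the treatments as evenly as possible."""
    rng = random.Random(seed)
    shuffled = units[:]          # copy so the original list is left untouched
    rng.shuffle(shuffled)
    k = len(treatments)
    return {t: shuffled[i::k] for i, t in enumerate(treatments)}

units = list(range(1, 31))       # 30 hypothetical experimental units
groups = completely_randomized(units, ["A", "B", "C"], seed=1)
for treatment, members in groups.items():
    print(treatment, sorted(members))
```

Chance alone decides which unit gets which treatment, so any differences between the groups before treatment are due to randomness rather than to the experimenter's choices.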
removes bias in self-reported information and removes bias in any subjective part of data collection
Block Design
A block design is like doing multiple experiments at the same time.
Used when subjects are of different types in a way that is expected to affect the outcome
(e.g. men and women respond differently to medication, people with higher blood pressure
may respond differently to diet changes than people with lower blood pressure...)
Essentially: block design allows us to compare apples to apples and oranges to oranges.
Blocking is used in an experiment.
Stratifying is used in sample selection or in survey selection.
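Random assignment within blocks can be sketched as follows. The subjects (20 men and 20 women) and the two treatment labels are hypothetical, and `randomized_block` is an illustrative helper name.

```python
import random

def randomized_block(blocks, treatments, seed=None):
    """Randomly assign treatments separately inside each block."""
    rng = random.Random(seed)
    k = len(treatments)
    plan = {}
    for block_name, units in blocks.items():
        shuffled = units[:]
        rng.shuffle(shuffled)    # randomization happens within the block only
        plan[block_name] = {t: shuffled[i::k] for i, t in enumerate(treatments)}
    return plan

# Hypothetical subjects, blocked by sex
blocks = {"men": [f"M{i}" for i in range(1, 21)],
          "women": [f"W{i}" for i in range(1, 21)]}

plan = randomized_block(blocks, ["drug", "placebo"], seed=1)
print(plan["men"]["drug"])
```

Each block is effectively its own small completely randomized experiment, which is what makes the apples-to-apples comparison possible.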
Example: Completely Randomized Design
Matched Pairs Design
Particular type of randomized block design.
Used when comparing only two treatments.
Subjects are compared (matched with) similar subjects OR each subject receives both treatments.
Randomization occurs either in assigning treatments or in the order of the treatments.
What would our response variable likely be?
[Figure label: Window (warm sun)]
What is happening in this experiment?
What kind of experiment is it?
A high school regularly offers a review course to prepare students for the SAT. This year budget cuts will
allow the school to offer only an online version of the course. The group of students who take the online
course earn an increase of 45 points in their math test from the pre-test to the actual SAT test.
As an experiment this would have a very simple design. A group of students (the subjects) were exposed to
a treatment (online course) and the outcome was observed (change in SAT math scores).
McDonald’s is giving away squishmallows in its Happy Meals. Right now 50% of them are decorated with music notes on them,
20% have toys on them and 30% are just wearing clothes.
If you want to collect one of each and toys are randomly placed into boxes, how many Happy Meals do you expect to buy before
you have at least one of each kind of toy?
To answer this question, we can use a simulation: the imitation of chance behavior based on a
model that accurately reflects the experiment under consideration.
For probability experiments, clearly describe the process you are using.
• How are you assigning digits?
• What will you do about digits that repeat? (if the same digit shows up more than once, what will you do?)
• What does a trial consist of?
• How many trials will you run?
• How will you interpret the results in the context of the question you are answering?
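The checklist above can also be carried out by computer instead of with a random digits table. The sketch below is one possible approach for the squishmallow question: the 50%, 20%, and 30% proportions come from the problem, while the number of trials and the seed are assumptions. A repeated design simply counts as another meal bought.

```python
import random

DESIGNS = ["music notes", "toys", "clothes"]
WEIGHTS = [0.50, 0.20, 0.30]     # proportions stated in the problem

def meals_until_full_set(rng):
    """One trial: buy meals until all three designs have been seen."""
    seen = set()
    meals = 0
    while len(seen) < len(DESIGNS):
        seen.add(rng.choices(DESIGNS, weights=WEIGHTS)[0])
        meals += 1
    return meals

def average_meals(trials=20000, seed=1):
    """Average number of meals per trial over many trials."""
    rng = random.Random(seed)
    return sum(meals_until_full_set(rng) for _ in range(trials)) / trials

print(round(average_meals(), 2))
```

The average over many trials estimates the expected number of meals; a handful of hand-run trials will vary much more from one run to the next.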
McDonald’s is giving away toys in its Happy Meals. Right now 50% of toys are cars, 20% are ponies and 30% are parachutes. If you want to
collect one of each and toys are randomly placed into boxes, how many Happy Meals do you expect to buy before you have at least one of each
kind of toy?
According to YOMA coffee, 13% of their holiday cups have gingerbread designs, 53% have
snowflake designs, 15% have birds, and 19% have winter flowers and berries. When you go to
pick up coffee for you and three of your friends, 3 of the 4 cups have snowflakes! What is the
likelihood of that happening just randomly? Perform ten trials of this probability experiment.
Let’s use the Nspire and answer this one again!
• how are you assigning digits?
• what will you do about repeats?
• what will be considered one trial?
• how many trials will you run?
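The same questions can be answered in code. This is a sketch for the YOMA coffee problem that assumes "3 of the 4 cups" means exactly 3 and that each cup's design is independent of the others; the seed and the larger trial count are assumptions.

```python
import random

def snowflake_count(rng, cups=4, p_snowflake=0.53):
    """One trial: number of snowflake cups among `cups` independent cups."""
    return sum(rng.random() < p_snowflake for _ in range(cups))

def estimate_three_of_four(trials=10, seed=1):
    """Proportion of trials in which exactly 3 of the 4 cups are snowflakes."""
    rng = random.Random(seed)
    hits = sum(snowflake_count(rng) == 3 for _ in range(trials))
    return hits / trials

print(estimate_three_of_four(trials=10))       # ten trials, as the problem asks
print(estimate_three_of_four(trials=100000))   # many trials for a steadier estimate
```

With only ten trials the estimate can only be a multiple of 0.1, so comparing it with the long run shows how rough a small simulation is.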
CISH has 35 Freshmen, 41 Sophomores, 38 Juniors and 48 Seniors. Dr. Sutherland randomly selects 10
students to take on a fun field trip. The 10 students include 4 seniors, 3 juniors, 2 sophomores and 1
freshman. The freshmen say that there are too many seniors selected for the process to have actually
been random. Are they right to be suspicious?
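One way to check the freshmen's suspicion is to simulate many random selections and see how often 4 or more seniors are chosen. The class sizes below come from the problem; the number of trials and the seed are assumptions.

```python
import random

# Class sizes from the problem: 35 + 41 + 38 + 48 = 162 students
population = ["Fr"] * 35 + ["So"] * 41 + ["Jr"] * 38 + ["Sr"] * 48

def prob_at_least_k_seniors(k=4, n=10, trials=20000, seed=1):
    """Estimate P(at least k seniors) when n students are drawn at random."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        trip = rng.sample(population, n)   # draw without replacement
        if trip.count("Sr") >= k:
            hits += 1
    return hits / trials

print(prob_at_least_k_seniors())
```

If the estimated probability is not small, then getting 4 seniors is consistent with a truly random selection.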
According to a statistician at SAS, the color distribution of M&Ms is approximately 24% blue, 20%
orange, 16% green, 14% yellow, 13% red, and 13% brown.
You randomly select 8 M&Ms from the bag and 3 are red. That makes you think those proportions
must be wrong, so you do a probability experiment to see how likely your result is if the proportions
are accurate.
Use the Nspire to answer this question. Seed your calculator with 4308 and perform at least 10 trials.
Remember to include all of the necessary information.
A basketball player has historically made 65% of the shots she attempted. In one game she
makes six shots in a row and the announcer says the player is “in the zone!”. Assume the player
attempts 20 shots per game. How unusual would it be for her to make 6 or more shots in a row?
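A streak question like this is awkward to answer with a formula but straightforward to simulate. The sketch below assumes each shot is independent with a 65% chance of success; the number of trials and the seed are assumptions.

```python
import random

def longest_streak(shots):
    """Length of the longest run of consecutive made shots."""
    best = current = 0
    for made in shots:
        current = current + 1 if made else 0
        best = max(best, current)
    return best

def prob_streak(min_streak=6, shots_per_game=20, p_make=0.65,
                trials=20000, seed=1):
    """Estimate P(some run of at least min_streak makes in one game)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        game = [rng.random() < p_make for _ in range(shots_per_game)]
        if longest_streak(game) >= min_streak:
            hits += 1
    return hits / trials

print(prob_streak())
```

One trial here is one simulated game of 20 shots, and a trial counts as a success when its longest streak is 6 or more.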
AP Statistics: Experimental Design
1. Some studies find an association between liver cancer and smoking. However, alcohol
consumption is a confounding variable. Explain what is meant by alcohol being a confounding
variable.
2. Another recent study that was reported in the Fall of 2009 found that people who used
sunscreen were more likely to develop skin cancer.
(a) What might be a confounding variable in this study?
(b) Design an experiment to determine whether sunscreen use increases the likelihood of
developing skin cancer.
3. On the television news in August, 2013, it was reported that young children who used hand
sanitizers reduced the number of illnesses they had by 20% over those children that did not use
these hand sanitizers.
(a) What might be a confounding variable in this study?
(b) Design an experiment to determine if the dry soaps prevent illness in young children.
4. According to a recent Daily Yomiuri article, Prof Takafumi Tezuka from Nagoya University
doubled the yield of green beans by twisting the vines counterclockwise around a pole:
"...researchers grew a total of 45 green bean plants in three ways letting the vine wind
clockwise, binding them straight with cord, and twisting them counterclockwise. They then
tallied the number of pods produced by plants in each category.
Plants bound straight produced 1.5 times as many pods as those allowed to grow naturally,
while those twisted in the opposite of their natural direction produced twice as many. The
pods' size and weight were generally the same."
The professor hypothesized that some stress (comfortable tension) on a plant might be good.
They expect the technique to also work on morning glories.
(a) In this experiment, what was the control group(s) and what was the experimental
group(s)?
(b) There was not much information given in this article. If you were trying to conduct this
experiment, what is one thing you would make sure you did to improve on the model
given above?
Name: ________________________________________ Experimental Design WS
2) Upon reconsidering the above problem, the psychologist decides that the age of the
child may affect the attention span. Consequently, the psychologist randomly assigns
fifteen 10-year-olds, fifteen 8-year-olds, fifteen 6-year-olds, and fifteen 4-year-olds to watch
one of the three commercials, and their attention spans are measured.
4) The editor of the student newspaper was in the process of making some major changes in
the newspaper’s layout. He was also contemplating changing the typeface of the print
used. To help him make a decision, he asked six individuals to read four newspaper pages,
with each page printed in a different typeface. If the reading speed differed, then the
typeface that was the fastest would be used. However, if there was not enough evidence
to allow the editor to conclude that such differences existed, the current typeface would be
continued. (Where should randomization be implemented?)
AP Stats Name _________________________________________________________
Chapter 5 Review
Part I - Multiple Choice (Questions 1-10) - Circle the answer of your choice.
(a) Each member of the population has an equal chance of being selected.
(b) Each member of the population is given an opportunity to respond to the survey.
(c) All samples of size n have the same chance of being selected.
(d) The probability of selecting any sample is known to be 7 ® rand .
(e) The sample is guaranteed to represent the entire population.
(a) Teacher
(b) Section of the Course
(c) Teaching Method
(d) Final Exam Score
(e) Student
5. In a study on the effect of reinforcement on learning from programmed text, two
experimental treatments are planned: reinforcement given after every frame of
programmed text or reinforcement given after every three frames. Which one of the
following control groups would serve best in this study?
(a) A group which does not read the programmed text material.
(b) A group that reads the programmed material in prose formats.
(c) A group which reads the programmed material but does not receive reinforcement.
(d) A group that reads the programmed text material and reinforcement is given at random.
(e) A group which watches the video of the programmed material.
6. We say that the design of a study is biased if which of the following is true?
I. Voluntary response samples often overrepresent people with strong opinions.
II. Convenience samples often lead to undercoverage bias.
III. Questionnaires with nonneutral wording are likely to have response bias.
(a) I and II
(b) I and III
(c) II and III
(d) I, II, and III
(e) None of the above gives the true set of responses.
8. To survey the opinions of bleacher fans at Wrigley Field, a surveyor plans to select every
one-hundredth fan entering the bleachers one afternoon. Will this result in a random
sample?
(a) Yes, because each bleacher fan has the same chance of being selected.
(b) Yes, but only if there is a single entrance to the bleachers.
(c) Yes, because the 99 out of 100 bleacher fans that are not selected will form a control
group.
(d) Yes, because this is an example of systematic sampling, which is a special case of
random sampling.
(e) No, because each fan does not have the same chance of being selected.
9. What fault do all these sampling designs have in common?
I. The Wall Street Journal plans to make a prediction for a presidential election
based on a survey of its readers.
II. A radio talk show asks people to phone in their views on whether the United States
should pay off its huge debt to the United Nations.
III. A police detective is interested in determining a sample of high school students
and interviews each one about any illegal drug use by the student during the past
year.
10. The following students are available to serve on the Student Procrastination Committee.
Using the randInt function on your calculator, select a simple random sample of size 4. Before you
select your sample, seed your random number generator by storing 7 into rand [7 → rand].
The students who were selected were:
(a) Ally, Ramone, Kyle, Olive
(b) Donald, Ramone, Kyle, Frannie
(c) Jan, Kyle, Kyle, Ramone
(d) Gina, Ivana, Patti, Eli
(e) Norm, Donald, Morris, Frannie
Part II – Free Response (Questions 11-14) – Show your work and explain your results clearly.
11. P.P. Pumpkineater, the renowned agricultural geneticist, has mutated previous varieties
of pumpkins and produced two new strains, Scary Face and Candle Breath. Because he
has limited marketing funds, he must decide which strain is the most “jack-o-lanternable”.
Having been in the jack-o-lantern business for a long period of time, he has developed
the PPPJOL Test to compare different strains. He is quite concerned about the effects of
sunlight and water on the growth of the pumpkins. He has 60 seeds of each variety
available for testing.
Design an experiment that will help P.P. determine which strain to market.
13. Is the right foot more powerful than the left? A researcher decides to measure foot
power by having subjects kick a large Styrofoam block and measure the depth of the
impression. Twenty subjects are available for the experiment.
(c) Comment on which experiment may be more appropriate and concerns you may have
about the experimental design.
14. You have been asked to investigate the attitudes of students in the Upper School about
the school’s uniform policy. You only have enough time and resources to contact 120
students. Describe your sample design clearly. Comment on any practical difficulties that
you anticipate.
Name _______________________ Period __________________
A. I only
B. II only
C. III only
D. None of the statements are true.
E. None of the above gives the complete set of true responses.
A. I and II
B. I and III
C. II and III
D. I, II, and III
E. None of the above gives the complete set of true responses.
A. Yes, because each player has the same chance of being selected.
B. Yes, because each team is equally represented.
C. Yes, because this is an example of stratified sampling, which is a special case of
simple random sampling.
D. No, because the teams are not chosen randomly.
E. No, because not each group of 58 players has the same chance of being selected.
4. In designing an experiment, blocking is used
A. To reduce bias.
B. To reduce variation.
C. As a substitute for a control group
D. As a first step in randomization
E. To control the level of the experiment.
5. A nutritionist believes that having each player take a vitamin pill before a game
enhances the performance of the football team. During the course of one season, each
player takes a vitamin pill before each game, and the team achieves a winning season
for the first time in several years. Is this an experiment or an observational study?
A. An experiment, but with no reasonable conclusion possible about cause and effect
B. An experiment, thus making cause and effect a reasonable conclusion.
C. An observational study, because there was no use of a control group.
D. An observational study, but a poorly designed one because randomization was not
used.
E. An observational study, thus allowing a reasonable conclusion of association but not
of cause and effect.
6. Which of the following are true about the design of matched-pair experiments?
I. Each subject might receive both treatments.
II. Each pair of subjects receives the identical treatment, and differences in their
responses are noted.
III. Blocking is one form of matched-pair design.
7. A consumer product agency tests miles per gallon for a sample of automobiles using
each of four different octane varieties of gasoline. Which of the following is true?
8. In a 1927-32 Western Electric Company study on the effect of lighting on worker
productivity, productivity increased with each increase in lighting but then also increased
with every decrease in lighting. If it is assumed that the workers knew a study was in
progress, this is an example of
9. Twenty men and 20 women with high blood pressure were subjects in an experiment to
determine the effectiveness of a new drug in lowering blood pressure. Ten of the 20 men
and 10 of the 20 women were chosen at random to receive the new drug. The
remaining men and women received the placebo. The change in blood pressure was
measured for each subject. The design of this experiment is:
Free Response
11. An equipment firm is trying out three new types of grease in the transmissions of its front-
end loaders. The maintenance manager is interested in whether any of the greases reduce
the time before the transmissions have to be repaired. The company has 30 identical new
front-end loaders to use in the test. How would you design the experiment and in what way
would you assign the front-end loaders? Be specific. Would you use a completely
randomized design or a block design? How many factors are there? How many
treatments? If it is randomized block, what characteristic identifies the blocks? Explain your
decisions.
Statistics AP Name: ___________________________
Sampling and Experimental Design
Newspaper advice columnist Ann Landers once asked her readers, “If you had it to do over
again, would you have children?” About 10,000 readers responded and approximately 7,000
said no.
True/False:
__________4. Voluntary response samples often underrepresent people with strong
opinions.
__________8. The entire group of individuals we want information about is called the
sample.
Use the following information to answer questions 14-16:
_______ 15. If the director includes only the employees in one department in her study, she
is performing a
a. simple random sample
b. quota sample
c. convenience sample
d. multi-stage sample
e. cluster sample
_______ 16. If the director selects 50 employees at random from throughout the company
and categorizes their lunchtime practices by gender, she is:
a. blocking for gender
b. testing for a lurking variable
c. promoting sexual harassment
d. testing for bias
e. none of these
19. How do we control for confounding variables in an experiment? Your answer can be
expressed in one word.
20. A medical researcher is interested in testing a new medicine for migraine headaches.
She decides to conduct a clinical trial on 100 randomly selected adults who get migraines at
a rate of one or more per week. Although age and gender are not of primary interest in the
trial, the researcher is concerned that these factors may impact the effectiveness of the
drug. Describe graphically how she would set up the experiment if:
a. she sets up her experiment for the 100 subjects without considerations of age and
gender.
b. she sets up her experiment for the 100 subjects and wants to control for gender.
c. she sets up her experiment for the 100 subjects and wants to control for age. She
decides on age categories of young (21-35), middle (36-55), and elderly (over 55).
d. she sets up her experiment for the 100 subjects and wishes to control for both age
and gender
AP Statistics Sampling and Experiments Questions
A chemical engineer is designing the production process for a new product. The chemical
reaction that produces the product may have a higher or lower yield depending on the
temperature and the stirring rate in the vessel in which the reaction takes place. The engineer
decided to investigate the effects of the combinations of the two temperatures (50 °C and
60 °C) and three stirring rates (60 rpm, 90 rpm, and 120 rpm) on the yields of feedstock. Ten
batches of feedstock will be processed at each combination of temperature and stirring rate.
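The structure of this experiment can be laid out by listing every combination of the two factors. This is a sketch using the levels stated in the scenario; the variable names are illustrative.

```python
from itertools import product

temperatures_c = [50, 60]             # factor 1: two levels
stirring_rates_rpm = [60, 90, 120]    # factor 2: three levels
batches_per_treatment = 10

# A treatment is one (temperature, stirring rate) combination
treatments = list(product(temperatures_c, stirring_rates_rpm))
total_batches = len(treatments) * batches_per_treatment

for temp, rate in treatments:
    print(f"{temp} °C at {rate} rpm")
print(len(treatments), "treatments,", total_batches, "batches in total")
```

Listing the combinations makes the distinction between factors, levels, and treatments concrete.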
A) 2
B) 3
C) 5
D) 6
E) None of the above. The answer is __________________________.
A) 2
B) 3
C) 5
D) 6
E) None of the above. The answer is __________________________.
6. In a survey of public opinion concerning state aid to a particular city, every 40th person
registered as a voter was interviewed, beginning with a person selected at random
from among the first 40 listed. This is an example of
9. We say that the design of a study is biased if which of the following is true?
10. Suppose that a number of crates of pencils are chosen at random from a boxcar of crates,
then a number of boxes of pencils is chosen at random from each selected crate. Our
goal is to estimate the number of defective pencils in a box. This is an example of
Free Response. Answer in complete sentences. Abbreviations will count as a wrong answer.
Suppose the Houston Chronicle asks a sample of 150 Houstonians their opinions on the quality of
life in Houston.
12. Identify the sample and the population in the opinion poll.
Bias is present in each of the following sampling designs. In each case, identify the type of bias
involved and state how you think the responses will be affected compared to those obtained
using better sampling techniques.
14. A political pollster seeks information about the proportion of American adults that oppose
gun control. He asks an SRS of 1000 American adults: “Do you agree or disagree with the
following statement: Americans should preserve their constitutional right to keep and bear
arms.” A total of 910 or 91% said “agree” (that is, 910 out of 1000 oppose gun control).
15. A flour company in Dallas wants to know what percentage of local households bake at
least twice a week. A company representative calls 500 households during the daytime
and finds that 50% of them bake at least twice a week.
You are participating in the design of a medical experiment to investigate whether or not a
calcium supplement in the diet will reduce the blood pressure of middle-aged men. Preliminary
research suggests that the supplement may have different effects on different races.
16. What sort of experimental design would you choose, and why?
Introduction to Random Variables
A Random Variable is a numerical value whose outcome depends on a chance experiment.
Random indicates that the variable is unknown from trial to trial, but the possible values are
known.
Types of Random Variables:
Discrete
Continuous
Gives the probabilities associated with each possible value of the variable.
Usually displayed in a table, may be displayed in a histogram or formula.
As with any probability distribution, each probability is between 0 and 1, and the sum
of probabilities for the entire distribution is 1.
Example: x = the number of heads that appear when four coins are tossed.
Create a probability distribution.
0 1 2 3 4
Consider the random variable X as the sum of two dice when rolled.
Construct a probability distribution.
P(x = 4)
P(x < 4)
P(x ≤ 4)
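The distribution for the sum of two dice can be built by counting the 36 equally likely outcomes. This is a sketch assuming two fair six-sided dice; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 (die 1, die 2) outcomes produce each sum
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
dist = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

p_eq_4 = dist[4]
p_lt_4 = sum(p for total, p in dist.items() if total < 4)
p_le_4 = sum(p for total, p in dist.items() if total <= 4)

for total, p in dist.items():
    print(total, p)
print(p_eq_4, p_lt_4, p_le_4)
```

Comparing the last two values shows why the difference between < and ≤ matters for a discrete random variable.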
Let x be the number of courses for which a randomly selected student at a certain
university is registered.
x 1 2 3 4 5 6 7
p(x) .02 .03 .09 .40 .16 .05
(a) What percent of the sons of lower-class fathers reach the highest class, Class 5?
(b) Check that this distribution satisfies the requirements for a discrete probability
distribution.
(e) Write the event "a son of a lower-class father reaches one of the two highest
classes" in terms of X. What is the probability of this event?
7.12 Car Ownership
Distributions
Discrete RV: Binomial, Geometric
Continuous RV: uniform distribution, normal distribution
1. Each observation must fall into one of two categories (we call these
'success' and 'failure').
4. The probability of success (p) is the same for every trial or observation.
The possible values of x are whole numbers only because they are a
count of the number of successes, so a binomial variable must also be a
discrete random variable.
Remember:
A normal distribution is defined by its mean and standard deviation.
A binomial distribution is defined by:
number of trials (n), probability of success on any given trial (p).
x ~ B(n,p)
Which of these fit the description of a binomial random variable? If it's
binomial, identify what "success" is as well as n and p.
We roll a die 5 times and see how many times we land on an even number.
Example: There are 18 students in block 1, and the probability that any one
student is tardy is 0.18. (We will assume independence and not a conspiracy.)
"success" =
n=
p=
Formula for Binomial Probability
P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}
n = number of trials
k = number of successes
p = probability of success on any trial
A free-throw shooter has a 70% average for making free throws. Out of 20 attempts, find the
following probabilities, where x = number of successful free throws.
P (x = 8)
P (x = 15)
P (x is at least 12)
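The free-throw questions can be checked numerically with the standard binomial probability formula. The sketch below assumes the 20 attempts are independent with p = 0.7 on each; `binom_pmf` is an illustrative helper built on the standard library's `math.comb`.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): (n choose k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.7  # 20 attempts, 70% chance of success on each

p_8 = binom_pmf(8, n, p)
p_15 = binom_pmf(15, n, p)
# "At least 12" means 12, 13, ..., 20 successes.
p_at_least_12 = sum(binom_pmf(k, n, p) for k in range(12, n + 1))

print(f"P(X = 8)   = {p_8:.4f}")
print(f"P(X = 15)  = {p_15:.4f}")
print(f"P(X >= 12) = {p_at_least_12:.4f}")
```

On a calculator, the first two correspond to a binomial pdf and the last to 1 minus a binomial cdf evaluated at 11.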
Mean and Standard Deviation of a Binomial Variable
Mean: \mu = np
Variance: \sigma^2 = np(1 - p) OR npq, where q = 1 - p
Standard Deviation: \sigma = \sqrt{np(1 - p)}
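As an illustration, the mean np and variance npq can be compared against the values computed directly from the distribution. The sketch below reuses the free-throw numbers (n = 20, p = 0.7) as an assumed example; the variable names are arbitrary.

```python
from math import comb, sqrt

n, p = 20, 0.7
q = 1 - p

# Shortcut formulas for a binomial variable.
mean_formula = n * p
var_formula = n * p * q
sd_formula = sqrt(var_formula)

# The same quantities from the definition: sum of x * P(x), and so on.
pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
mean_direct = sum(k * pk for k, pk in enumerate(pmf))
var_direct = sum((k - mean_direct) ** 2 * pk for k, pk in enumerate(pmf))

print(f"mean: formula {mean_formula:.3f}, direct {mean_direct:.3f}")
print(f"variance: formula {var_formula:.3f}, direct {var_direct:.3f}")
print(f"standard deviation: {sd_formula:.3f}")
```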
The possible values of a binomial variable start at 0, while the lowest
possible value of a geometric variable is 1.
Geometric Distribution G(p)
Probability Formula
P(X = x) = (1 - p)^{x - 1} p, for x = 1, 2, 3, ...
According to a recent Census Bureau report, 12.7% of Americans live below the poverty level.
Suppose you plan to survey randomly selected Americans until you find an American living
below the poverty level.
What is the probability the first such American you encounter is the 5th one you survey?
What is the probability the first such American you encounter is the 7th one or later that you
survey?
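Both survey questions follow from the geometric setting: the first success on trial x requires x - 1 failures followed by one success. The sketch below assumes independent selections with p = 0.127; `geom_pmf` and `geom_at_least` are illustrative helper names.

```python
def geom_pmf(x, p):
    """P(X = x): first success occurs on trial x (x = 1, 2, 3, ...)."""
    return (1 - p) ** (x - 1) * p

def geom_at_least(x, p):
    """P(X >= x): the first x - 1 trials are all failures."""
    return (1 - p) ** (x - 1)

p = 0.127  # probability a randomly selected American lives below the poverty level

p_fifth = geom_pmf(5, p)
p_seventh_or_later = geom_at_least(7, p)

print(f"P(X = 5)  = {p_fifth:.4f}")
print(f"P(X >= 7) = {p_seventh_or_later:.4f}")
```

The "7th or later" event is easiest to handle through its meaning: none of the first 6 people surveyed is below the poverty level.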
According to a recent Census Bureau report, 12.7% of Americans live below the poverty level.
Suppose you plan to sample at random 100 Americans and count the number of people who
live below the poverty level. What is the probability that you count 10 or fewer?
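With a fixed sample of 100 people, the count is binomial rather than geometric, so the question asks for a cumulative binomial probability. This is a minimal sketch assuming independent selections with n = 100 and p = 0.127; `binom_cdf` is a hypothetical helper.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summing the binomial formula from 0 to k."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

n, p = 100, 0.127

p_10_or_fewer = binom_cdf(10, n, p)
print(f"P(X <= 10) = {p_10_or_fewer:.4f}")
```

The contrast with the previous question is the point: there the number of trials was open-ended (geometric), while here n is fixed in advance (binomial).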
An Olympic archer is able to hit a bulls-eye 80% of the time. Assume each
shot is independent of the others. The variable of interest is the number of the
shot on which she makes her first bulls-eye.
a) Which attempt is the expected first success? What is the standard deviation?
b) What is the probability that her first success is on the 4th arrow shot?
c) What is the probability that her first success is earlier than the 3rd arrow shot?
d) What is the probability that her first success is on the 4th arrow shot or later?
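The archer questions can be worked through with the geometric formulas. The sketch below assumes p = 0.8 on every shot and uses the standard results that a geometric variable has mean 1/p and standard deviation sqrt(1 - p)/p; variable names are illustrative.

```python
from math import sqrt

p = 0.8      # probability of a bulls-eye on any one shot
q = 1 - p    # probability of a miss

# a) Expected shot number of the first bulls-eye, and its standard deviation.
mean = 1 / p
sd = sqrt(q) / p

# b) First success on the 4th shot: three misses, then a hit.
p_fourth = q**3 * p

# c) First success earlier than the 3rd shot: on shot 1 or shot 2.
p_before_third = p + q * p

# d) First success on the 4th shot or later: the first three shots all miss.
p_fourth_or_later = q**3

print(f"a) mean = {mean:.3f}, sd = {sd:.3f}")
print(f"b) P(X = 4)  = {p_fourth:.4f}")
print(f"c) P(X < 3)  = {p_before_third:.4f}")
print(f"d) P(X >= 4) = {p_fourth_or_later:.4f}")
```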