Second Stats Packet 24

AP statistics

Uploaded by

Thanh Nguyen
Copyright
© © All Rights Reserved

So Far: Univariate Data

Key Questions: Univariate Data

What is the variable? How is it measured? (Or is it counted?)

What are the possible values of this variable and how frequently does it take on those values?

What is the typical value of this variable? How much variation is there in the values of this variable?

We used graphs to learn more about or better describe univariate data:

Bivariate Data!

Key Questions

Are both variables quantitative, or is one categorical?

What individuals do the data describe?


What exactly are the variables?
How are they measured?

What is the relationship between the variables?


Does change in one variable cause change in the other variable?
Can we use change in one variable to predict change in another variable?

We also use graphs to learn more about bivariate data:

Scatterplots show the relationship between two quantitative variables measured on the same individuals.

For scatterplots, both variables must be quantitative.

Response Variable: measures the outcome of a study (y-axis).

Explanatory Variable: attempts to explain changes in observed outcomes (x-axis).

If there is no obvious explanatory/response relationship, either variable may be graphed on the horizontal axis.

Two variables are said to be positively related if larger values of one variable tend to be associated with larger values of the other.
For variables that are positively related, if one variable increases, the other typically increases, or could be expected to increase.

Two variables are said to be negatively related if larger values of one variable tend to be associated with smaller values of the other.
For variables that are negatively related, if one variable increases, the other typically decreases, or could be expected to decrease.

No relationship: if one variable increases, there’s no clear indication of what we would expect the other variable to do.

Sometimes there is a clear relationship between the variables, but it’s not linear. This is something to keep in mind when calculations indicate no linear relationship.

[Figure: example scatterplots labeled "Weakly Positive" and "Moderately/Weakly Negative".]

Pro Tip: Categorical differences among individuals can be indicated with different colors or markings.

[Figure: six scatterplots labeled A through F, to be sorted by direction (Negative or Positive) and by Strongest, Moderate, or Weakest Relationship.]

Outliers

Individual observations that fall outside the overall pattern of the graph.

A value that is an outlier in terms of the explanatory and response variables but not an outlier of the relationship will not be influential.

How well does a child’s height at age 6 predict height at age 16? To find out, the heights of a large group
of children are measured at age 6 and again at age 16.
What are the explanatory and response variables here? Are they categorical or quantitative?

There may be a “gender gap” in political party preference in the US, with women more likely than men to
prefer Democratic candidates. A political scientist selects a large sample of registered voters. She asks each
whether they voted for the Democratic or Republican candidate in the last congressional election. What are
the explanatory and response variables in this scenario? Are they categorical or quantitative?

Interpreting Univariate Graphs

Comment on center, spread, unusual characteristics, shape

Process for Examining Relationships in Bivariate Data
1. Graph. Always graph!
2. Check for patterns and deviations.
3. Reference numerical descriptions.
4. Explain the overall pattern.
   Comment on direction, form, strength and linearity.

Pair of Variables    Direction of Association    Strength of Association

Height and Armspan
Height and Shoe Size
Height and GPA
SAT score and College GPA
Latitude and Average January Temperature for North American Cities
Lifespan and Weekly Cigarette Consumption
Serving Size and Calories of Fast Food French Fries
Cost and Quality Rating of Peanut Butter Brands
Head Coach Salary and Team’s Winning Percentage

Calories and Salt in Hot Dogs
Do hot dogs that are relatively higher in calories also contain more sodium?
Create a scatterplot showing calories and salt content (in milligrams of sodium) using the given information in the table.

Brand                         Calories   Sodium (in mg)
Schneiders Beef               110        330
Applegate Uncured Beef        110        530
Trader Joe's Uncured Turkey   90         580
Oscar Mayer Cheesy Hot Dog    140        540
Morningstar Plant Based       50         430
Deli Beef Franks              150        480
Oscar Mayer Turkey            100        510
Oscar Mayer Jumbo             170        590
Kirkland Signature            170        530
Jennie-O Turkey               70         390
Nathan’s                      290        790
Oscar Mayer Beef              130        470
Generic Beef & Pork           170        630

Does the scatterplot show a clear positive or negative relationship between these variables?

What does this association mean about the relationship between calories and salt in hot dogs?

Do You Know How Much You are Spending?
A consumer advocate group asked 5120 people to estimate how much they spend each month on entertainment, including streaming service subscriptions. The group then used financial records to find the actual average amount each person spent per month on this category.
Below is a table of the results for 20 randomly selected participants.

Participant Number   Estimate   Actual Amount
1                    180        208
2                    220        265
3                    200        302
4                    150        184
5                    175        193
6                    225        278
7                    125        205
8                    350        459
9                    145        160
10                   100        152
11                   200        190
12                   175        196
13                   150        193
14                   250        269
15                   200        233
16                   150        174
17                   200        217
18                   200        223
19                   250        330
20                   225        247

We think that people generally underestimate how much they spend but that they are aware if their spending is relatively low or high.

a) Make a scatterplot of the data. Use the same scale on both axes since both variables are measured in dollars per month.

b) Describe the relationship. Is there a positive or negative association? Is the relationship approximately linear? Are there any outliers? Are any of the outliers influential points?

Least Squares Regression Line

LSRL: $\hat{y} = a + bx$

“least” names what is optimized: the line makes the sum as small as possible

“sum of squares” means sum of squared residuals

Calculator Mechanics

Method 1 for Linear Regression
Step One: Enter data into 4: lists and spreadsheets, then add a page 5: data and statistics and view the scatterplot.

Step Two: 6: Statistics
          1: Stat Calculations
          4: Linear Regression (a+bx)

Step Three: Linear Regression Window
When you have this option “save regeqn to” then the Nspire will save your regression equation at the location it gives here. That means on any page in the Nspire document, f1 is going to be referring to this regression equation. f1 on a calculator page can be used to find the predicted value for given values of the independent variable.

Results

Method 2 for Linear Regression: from the scatterplot….

Any time you run statistics procedures, everything the Nspire calculates is stored as a variable and can be found either by typing in the name of the variable or by selecting the variables key: var

Every time you run a linear regression, the variables are overwritten and updated with the most recent results.

Correlation measures the strength and direction of a linear relationship between two variables
(x, y). Correlation is denoted with the letter r.

The (Pearson) correlation coefficient, r, is the average of the products of standardized x's and standardized y's:

$r = \dfrac{1}{n-1} \sum \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

Standardizing removes the units and allows us to calculate r while combining unrelated variables.

Both variables must be quantitative to calculate r.

It does not matter which variable is x (explanatory) and which is y (response), r will have
the same value either way.

A positive r indicates a positive association between variables.


A negative r indicates a negative association between the variables.
r will always be between -1 and 1, inclusive.
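
As a rough illustration of the definition above, the sketch below computes r as the sum of the products of standardized values divided by n − 1, using the calories and sodium columns of the hot dog table from earlier in this packet. The helper names `mean`, `sample_sd`, and `pearson_r` are illustrative choices, not something the packet prescribes.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(xs, ys):
    """Sum of products of standardized x and y values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Calories and sodium (mg) copied from the hot dog table in this packet.
calories = [110, 110, 90, 140, 50, 150, 100, 170, 170, 70, 290, 130, 170]
sodium_mg = [330, 530, 580, 540, 430, 480, 510, 590, 530, 390, 790, 470, 630]

r_xy = pearson_r(calories, sodium_mg)   # calories as x, sodium as y
r_yx = pearson_r(sodium_mg, calories)   # roles swapped

print(f"r with calories as x: {r_xy:.3f}")
print(f"r with sodium as x:   {r_yx:.3f}")
```

Swapping which variable is called x leaves the value unchanged, and the result stays between -1 and 1, matching the properties listed above.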

r = -1: perfectly linear, negative
r = 0: no linear relationship
r = 1: perfectly linear, positive

Because r uses standardized values of the observations, r does not change when we change units.

Moderately strong to strong correlation generally starts at r = 0.8 or r = -0.8.

Proportional reasoning does not apply. Correlation is NOT measured on a linear scale, so a correlation of 0.8 is not twice as strong as a correlation of 0.4.

Like the mean and standard deviation, correlation is strongly affected by outliers (it is non-resistant).

Correlation only measures the strength of a linear relationship. It does not describe non-linear (curved) relationships.

r is not a complete description of two variable data.


Always check the graph!
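
To see why the graph matters, here is a minimal sketch with made-up data in which y is completely determined by x (y = x squared) and yet r comes out as zero, because the relationship is curved rather than linear. The function name `pearson_r` and the data are assumptions for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data: a perfect quadratic relationship, symmetric about x = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")  # r = 0.000
```

A scatterplot of these points would show an obvious U shape, which the single number r cannot report.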

This is also a graph of two variables that have no relation – no matter the value of x, the value of y does not change.

Don't confuse correlation coefficient and slope of least-squares regression line.
A slope close to 1 or -1 doesn't mean strong correlation.
An r value close to 1 or -1 doesn't mean the slope of the linear regression line is close to 1 or -1.

The relationship between b (slope of regression line) and r (coefficient of correlation) is

$b = r \dfrac{s_y}{s_x}$
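
The difference between slope and correlation can be sketched numerically. The example below uses made-up data and the relationship b = r(s_y / s_x): re-expressing y in units 100 times smaller multiplies the slope by 100 but leaves r untouched. The function name `summarize` and the data values are assumptions for illustration.

```python
import math

def summarize(xs, ys):
    """Return (r, b) where b = r * s_y / s_x is the LSRL slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    return r, r * s_y / s_x

# Made-up data: the same measurements recorded in meters and in centimeters.
x = [1, 2, 3, 4, 5]
y_meters = [2.1, 3.9, 6.2, 7.8, 10.1]
y_centimeters = [100 * value for value in y_meters]

r_m, b_m = summarize(x, y_meters)
r_cm, b_cm = summarize(x, y_centimeters)

print(f"meters:      r = {r_m:.4f}, slope = {b_m:.3f}")
print(f"centimeters: r = {r_cm:.4f}, slope = {b_cm:.3f}")
```

The slope depends on the units chosen for each variable; r does not.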

A college newspaper interviews a psychologist about student ratings of faculty members’ teaching. The psychologist
says “The evidence indicates that the correlation between the research productivity and teaching rating of faculty is
close to zero.” The paper reports this as the psychologist saying that “good researchers tend to be poor teachers”.
Is that a correct representation of the psychologist’s statement?

Sloppy Writing?

“We found a high correlation (r = 1.09) between students’ ratings of faculty teaching and final grades in the courses.”

“There is a high correlation between American workers’ college majors and their incomes.”

“The correlation between attending a social function and incubation period was 0.23 days.”

Least-Squares Regression Line (LSRL)
LSRL of y on x is the line that makes the sum of the squares of the vertical distances between the data points and
the line (these distances are called residuals, or errors) as small as possible.

error = observed value - predicted value
      = y - "y-hat"

We often use a regression line to predict the value of y, or the dependent/response variable. Prediction is truly the reason we care about linear regression.

[Figure: a data point and the regression line, with the observed $y$, the predicted $\hat{y}$, and the vertical distance $y - \hat{y}$ labeled.]

Facts about LSRL:

1. Explanatory and response variables must be clearly defined.

2. The LSRL always passes through the centroid $(\bar{x}, \bar{y})$.

3. The sum of the residuals from the LSRL is zero.

4. The sum of the squared residuals from the LSRL is an absolute minimum: no other line gives a smaller sum.

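In $\hat{y} = a + bx$, $\hat{y}$ is the predicted response, $a$ is the y-axis intercept, and $b$ is the slope.

slope: $b = r \dfrac{s_y}{s_x}$

y-axis intercept: $a = \bar{y} - b\bar{x}$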
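
The facts above can be checked on a small data set. The sketch below takes six points that also appear later in this packet, builds the line from b = r(s_y / s_x) and a = ȳ − b·x̄, and then confirms that the line passes through the centroid, that the residuals sum to zero, and that nudging the line makes the sum of squared residuals larger. The function and variable names are illustrative assumptions.

```python
import math

# Six points borrowed from a later page of this packet.
x = [1, 2, 3, 4, 5, 6]
y = [15, 19, 16, 26, 22, 30]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # y-axis intercept

def predict(x_value, intercept=a, slope=b):
    return intercept + slope * x_value

def sum_squared_residuals(intercept, slope):
    return sum((yi - predict(xi, intercept, slope)) ** 2 for xi, yi in zip(x, y))

residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

print(f"y-hat = {a:.3f} + {b:.3f}x")
print(f"prediction at x-bar: {predict(x_bar):.3f}, y-bar: {y_bar:.3f}")
print(f"sum of residuals: {sum(residuals):.6f}")
print(f"SSR for the LSRL: {sum_squared_residuals(a, b):.3f}")
print(f"SSR with intercept shifted by 0.5: {sum_squared_residuals(a + 0.5, b):.3f}")
```

Comparing against a few shifted lines is only a spot check, not a proof that the sum is minimal.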

In statistics, the coefficient of determination $R^2$ is used in the context of statistical models whose main purpose is the prediction of future outcomes on the basis of other related information.
It is the proportion of variability in a data set that is accounted for by the statistical model.
It provides a measure of how well future outcomes are likely to be predicted by the model.
Remember that $r^2 > 0$ doesn't mean $r > 0$. For instance, if $r^2 = 0.81$, then $r = 0.9$ or $r = -0.9$. In order to decide the correct sign of r, we need to know the relationship between the variables. r and b have the same sign but not the same magnitude.

Example: Pizza!
$8 plain
$1.50 per topping

Number of Toppings   Price
0                    8
1                    9.50
2                    11
3                    12.50
4                    14
Pizza!
$8 plain
$1 per veggie topping (spinach, onion, mushrooms)
$2 per meat topping (pepperoni, sausage)
No double toppings.

Number of Toppings   Price
0                    8
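
One way to read this second menu, offered here as an interpretation rather than the packet's stated point, is that the number of toppings no longer determines the price, so the points cannot all fall on a single line. The sketch below lists every possible price for each topping count; the constant and function names are illustrative.

```python
from itertools import combinations

BASE_PRICE = 8
# Prices per topping taken from the menu above; each topping can be chosen at most once.
TOPPING_PRICES = {
    "spinach": 1,
    "onion": 1,
    "mushrooms": 1,
    "pepperoni": 2,
    "sausage": 2,
}

def possible_prices(num_toppings):
    """All distinct pizza prices that use exactly num_toppings different toppings."""
    prices = set()
    for combo in combinations(TOPPING_PRICES, num_toppings):
        prices.add(BASE_PRICE + sum(TOPPING_PRICES[name] for name in combo))
    return sorted(prices)

prices_by_count = {k: possible_prices(k) for k in range(len(TOPPING_PRICES) + 1)}
for count, prices in prices_by_count.items():
    print(count, prices)
```

For example, a two-topping pizza can cost 10, 11, or 12 dollars depending on which toppings are picked.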

Herpetologists measure, among other things, the snout vent length (SVL) of tropical
lizards and study how the SVL relates to clutch size.

Data collected in the field were used to construct a linear model where SVL is used to predict clutch size.

The regression equation is: $\hat{y} = 0.0176 + 2.4538x$

Interpret the regression equation in context.

Kiefer, Mara & Van Sluys, Monique & Rocha, Carlos. (2008). Clutch and egg size of the tropical
lizard Tropidurus torquatus (Tropiduridae) along its geographic range in coastal eastern Brazil.
Canadian Journal of Zoology. 86. 1376-1388. 10.1139/Z08-106.

Kelley Blue Book is a resource that provides value estimates for used cars in North America for individuals looking to buy
cars. Car dealers frequently use a “Red Book” that gives recent selling prices for used cars in order to help them decide
what to pay for a trade-in. Until recently, there were no instructions to dealers for how to use the odometer reading to
determine the trade-in value of the car. In order to determine how the mileage affects the value of the car, ten sales of
cars of the same make, condition and options are selected, and their odometer readings and sales price are collected. The
data is shown in the table:
Trade in Value (in 100s of dollars) 37 31 43 39 41 39 35 40 29 33
Odometer (in 1000s of miles) 59 92 61 72 52 67 88 62 95 83
Sketch the scatterplot of this data. Be sure to put the appropriate variable on the horizontal axis.

Does there appear to be an association between the odometer reading and trade-in value? If so, what is it?

Calculate and plot the centroid $(\bar{x}, \bar{y})$.

Determine the equation of the LSRL. Write the equation in context.

Give the correlation coefficient for this linear model. Interpret the value of this number.

Find and interpret the coefficient of determination for the relationship

Predict the trade-in value of a car with 60,000 miles

Software Regression Output

Statistics packages like MINITAB might have regression output that looks like this:

Predictor   Coef    Stdev    t-ratio   p
Constant    4.486   2.025    2.22      0.031
Sleep       8.25    0.0167   31.57     0.000

s=2.589   R-sq=81.5%

“Constant” means this is the constant term, or the a value in the regression equation.

In this case “Sleep” must be the independent variable, and the coefficient on Sleep must be the slope, or b.

This output lets us know the regression equation for this data set would be: $\hat{y} = 4.486 + 8.25x$

R-sq=81.5% tells us the $r^2$ value is 0.815. To find the value of r, we need to decide if r should be positive or negative, then find the square root of $r^2$.
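
As a small worked sketch of that last step: r takes the sign of the slope, and its size is the square root of R-sq. The values 0.815 and 8.25 are read from the output above; the variable names are illustrative.

```python
import math

r_squared = 0.815   # R-sq = 81.5% from the output
slope = 8.25        # coefficient on Sleep from the output

# r has the same sign as the slope b, and its magnitude is sqrt(r^2).
r = math.copysign(math.sqrt(r_squared), slope)
print(f"r = {r:.3f}")  # r = 0.903
```

With a negative slope, the same code would return the negative root.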

Researchers studying acid rain measured the acidity of precipitation in a Colorado wilderness area for 150 consecutive weeks.
Acidity is measured by pH. Lower pH values show higher acidity. The acid rain researchers observed a linear pattern over time.
They reported the least-squares regression model:

$\widehat{\text{pH}} = 5.43 - (0.0053 \times \text{weeks})$

Is this association positive or negative? Explain what the association means in context.

What was the pH at the beginning of the study (weeks = 1)? What was the pH by the end of the study (weeks = 150)?

Interpret the slope of the regression line in context.

A scatterplot of grade point average (GPA) against IQ test scores for 78 seventh-grade students is created.

Calculation shows the mean and standard deviation of the IQ scores are: $\bar{x} = 108.9$, $s_x = 13.17$
For the grade point averages, $\bar{y} = 7.447$, $s_y = 2.10$
The correlation between IQ and GPA is r = 0.6337.

a) Find the equation of the least-squares line for predicting GPA from IQ.

b) What percent of the observed variation in these students’ GPAs can be explained by the linear relationship
between GPA and IQ?

c) One student has an IQ of 103 and a GPA of 2.35. What is the predicted GPA for a student with IQ = 103?
What is the residual for this particular student?

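[Figure: four scatterplots with their regression lines. The four data sets share their first five points; only the sixth point differs.]

x, y:    (1, 15), (2, 19), (3, 16), (4, 26), (5, 22), (6, 30)
         $\hat{y} = 11.933 + 2.686x$, $\hat{y}(4.5) = 24.02$

x2, y2:  (1, 15), (2, 19), (3, 16), (4, 26), (5, 22), (10, 60)
         $\hat{y} = 5.295 + 5.049x$, $\hat{y}(4.5) = 28.01$

x3, y3:  (1, 15), (2, 19), (3, 16), (4, 26), (5, 22), (6, 60)
         $\hat{y} = 1.933 + 6.971x$, $\hat{y}(4.5) = 33.3$

x, y:    (1, 15), (2, 19), (3, 16), (4, 26), (5, 22), (15, 30)
         $\hat{y} = 16.526 + 0.962x$, $\hat{y}(4.5) = 20.86$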
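
The four fits on this page can be reproduced with a short sketch, which makes it easier to see how much a single sixth point moves the line. The helper `fit_line` is an illustrative name; it uses the standard least-squares formulas, and the printed values can be compared with the equations shown on the page (small differences in the last digit come from rounding the coefficients before predicting).

```python
def fit_line(points):
    """Least-squares intercept a and slope b for a list of (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# The five points shared by all four data sets, then each version of the sixth point.
shared_points = [(1, 15), (2, 19), (3, 16), (4, 26), (5, 22)]
sixth_points = [(6, 30), (10, 60), (6, 60), (15, 30)]

fits = {}
for sixth in sixth_points:
    a, b = fit_line(shared_points + [sixth])
    fits[sixth] = (a, b, a + b * 4.5)
    print(f"sixth point {sixth}: y-hat = {a:.3f} + {b:.3f}x, y-hat(4.5) = {a + b * 4.5:.2f}")
```

The point (15, 30), far out in the x direction, pulls the slope well below the other three fits, while (6, 60) and (10, 60) raise it sharply.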

Remember:
residual = y – y-hat, or the (observed/actual y) minus (the y value predicted for that x value)

We can graph residuals. The horizontal axis is the independent variable and the y-axis is RESID,
which results from linear regression in the calculator.

The graphs of residuals (residual plots) can tell us if our linear model is appropriate for the given
data. It is another way to determine the best way to analyze bivariate data.

Example 1
This data produces a favorable residual plot that indicates the line is a good model for the data.

Age (months)   18     19     20     21     22     23     24     25     26     27     28     29
Height (cm)    76.1   77.0   78.1   78.2   78.8   79.7   79.9   81.1   81.2   81.8   82.8   83.5

Which of these months has the largest residual?

Sketch the residual plot. Interpret.
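
For readers working without the calculator's RESID list, the residuals for Example 1 could be computed along these lines. This is a sketch using the standard least-squares formulas; the variable names are illustrative, and the output is left for the reader to inspect rather than quoted here.

```python
# Age (months) and height (cm) from the Example 1 table.
ages = list(range(18, 30))
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

n = len(ages)
age_bar = sum(ages) / n
height_bar = sum(heights) / n
slope = (
    sum((x - age_bar) * (y - height_bar) for x, y in zip(ages, heights))
    / sum((x - age_bar) ** 2 for x in ages)
)
intercept = height_bar - slope * age_bar

# residual = observed y minus predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(ages, heights)]

for age, residual in zip(ages, residuals):
    print(f"age {age}: residual {residual:+.3f}")

# Age with the residual of largest size, ignoring sign.
largest_age, largest_residual = max(zip(ages, residuals), key=lambda pair: abs(pair[1]))
print(f"largest residual in size: age {largest_age} ({largest_residual:+.3f})")
```

Plotting `residuals` against `ages` gives the residual plot; a scatter with no clear pattern around zero supports the linear model.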

Example 2
This data produces a residual plot that indicates the line is not a good model.

x   1   3      4      5      6       8       10      12      15
y   1   4.66   6.96   9.52   12.29   18.38   25.11   32.42   44.31

Interpret.

Sketch the residual plot.


Note: r =

Cautions About Linear Regression

How to tell if a linear model is appropriate (should the relationship be modeled with a linear regression?)
1. check graph (is it clearly not linear?)
2. look at r and r-squared
3. check residual plot. A pattern in the residual plot means a linear model is not appropriate.

The points in this residual plot do not create a clear pattern. This lets us know a linear model is okay to use.

This residual plot has a quadratic pattern so LSRL is not appropriate. If we can ‘capture’ the pattern of errors with a function, we can change the model to reduce or eliminate those errors.

An outlier is an observation that lies beyond the range of x or y values of other observations in the
scatterplot.

An outlier can be an observation that lies out in the horizontal direction or in the vertical direction
or in both directions.
An observation is influential if removing it would markedly change the position or slope of the
regression line.
Points that are outliers in the x direction are often influential points.
Influential points often have small residuals because they pull the LSRL towards themselves.

Correlation and regression describe only linear relationships and are not resistant to the influence of outliers. $r^2$ is influenced by outliers.

Extrapolation
Predicting for an x value that is well outside the known data values for x.
Such predictions cannot be trusted.
Example: a model that predicts height based on age, based on data from children aged 5 - 11 years,
cannot be used to predict height for someone who is 20 years old.
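
A minimal sketch of guarding against extrapolation: the function below returns the prediction together with a flag saying whether x falls outside the range of the data used to build the model. The function name and the coefficients (height = 80 + 6 × age, in centimeters) are made up for illustration and are not taken from the packet.

```python
def predict_with_range_check(intercept, slope, x_value, x_min, x_max):
    """Return (prediction, is_extrapolation) for the line y-hat = intercept + slope * x."""
    prediction = intercept + slope * x_value
    is_extrapolation = not (x_min <= x_value <= x_max)
    return prediction, is_extrapolation

# Hypothetical model built from children aged 5 to 11: height (cm) = 80 + 6 * age.
INTERCEPT, SLOPE = 80.0, 6.0
AGE_MIN, AGE_MAX = 5, 11

inside = predict_with_range_check(INTERCEPT, SLOPE, 8, AGE_MIN, AGE_MAX)
outside = predict_with_range_check(INTERCEPT, SLOPE, 20, AGE_MIN, AGE_MAX)

print(inside)   # (128.0, False)
print(outside)  # (200.0, True)
```

The flagged prediction of 200 cm for a 20-year-old shows how a line that fits childhood growth gives an implausible answer once it is pushed past the data.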

Using Averaged Data


Correlations are usually too high as the usual variation from individual to individual is lost.
Correlations based on averages are usually higher than correlations based on individuals.
Avoid predictions for individuals based on models created from averages

Example: if we create a model of average quiz scores and average test scores, that
model should not be used to predict a single student’s test score based on their quiz scores
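
The effect of averaging can be sketched with made-up scores. Below, five groups of four students have quiz and test scores that vary around a shared group level; the individual scores have a correlation of about 0.67, while the five pairs of group averages line up perfectly. All names and numbers are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data: each group has a typical level, and each student deviates from it
# by (quiz offset, test offset). Within a group the offsets are unrelated.
group_levels = [50, 60, 70, 80, 90]
offsets = [(-10, 10), (10, -10), (10, 10), (-10, -10)]

quiz_scores = [level + dq for level in group_levels for dq, _ in offsets]
test_scores = [level + dt for level in group_levels for _, dt in offsets]

# Averages per group: the individual variation cancels out.
quiz_averages = [sum(level + dq for dq, _ in offsets) / len(offsets) for level in group_levels]
test_averages = [sum(level + dt for _, dt in offsets) / len(offsets) for level in group_levels]

r_individuals = pearson_r(quiz_scores, test_scores)
r_averages = pearson_r(quiz_averages, test_averages)

print(f"r for individual students: {r_individuals:.3f}")  # 0.667
print(f"r for group averages:      {r_averages:.3f}")     # 1.000
```

A model built on the averages would look far more reliable than it is for predicting any one student's test score.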

Causation: Changes in explanatory variable cause changes in response variable.

Association Scenario I: Causation
we think: changes in x cause changes in y
what’s really going on: changes in x cause changes in y
[Diagram: x → y]
Example:
x = more practice time,
y = performance quality

Lurking Variable: A variable that is not among the explanatory or response variables in a study yet may influence the interpretation of relationships among the variables.

Association Scenario II: Common Response to a Lurking Variable
we think: changes in x cause changes in y
what’s really going on: changes in lurking variable z are causing the changes in x and y
[Diagram: z → x and z → y]
Example:
x = increased confidence
y = performance quality
z = more practice time

Association Scenario III: Confounding Variable
we think: changes in x cause changes in y
what’s really going on: Variations in both z and x are creating variation in y, but we cannot discern the effects.
[Diagram: x → y and z → y]
Example:
x = medication
y = overall feeling during a cold
z = passage of time

Confounding variables can be lurking variables.
Changes in a confounding variable can also change x and y.

How do we decide on causation?

Statistics: controlled, randomized experiment with statistically significant results.

Nations with greater internet speeds have higher life expectancies.
Could we increase life expectancy in a particular country by improving their internet infrastructure?

Data show that married men (and men who are divorced or widowed) earn more than men who have
never been married. If you want to make more money, should you get married?

A study of elementary school children ages 6-11 finds a high positive correlation between shoe size (x) and score on a
common reading comprehension assessment (y).
What explains this correlation?

Members of a language club believe that early study of a foreign language by native English speakers improves a student’s command of English. They obtain students’ scores on an English achievement test given to all 8th grade students and find that the mean score of 8th graders who studied a foreign language in elementary or middle school is much higher than the mean score of students who have not yet started learning a foreign language. Does this justify their assertion about foreign language study and English skills? What lurking variables might be present here?

To test the health benefits of herbal tea, a group of college students make weekly visits to a local nursing home, where
they visit the residents and serve them herbal tea.
After several months of the twice-weekly visits with tea, the staff at the nursing home reported improvements in
qualitative health measures such as increased cheerfulness and lower overall anxiety.
What is the explanatory variable here? What is the response variable?
Can the college students conclude that the herbal tea caused the change in residents’ wellbeing?
If not, what could the causation relationship be?

A study shows that there is a positive correlation between the size of a hospital
(measured by the number of patient beds) and the median number of days that a
patient remains in the hospital. Does this mean you can shorten a hospital stay by
choosing a smaller hospital?

Is a linear model appropriate for this data?

number hours studied   final exam score   number hours studied   final exam score
0.25                   58                 3                      80
0.5                    72                 3                      84
1                      70                 3                      86
1                      76                 3.5                    78
1.2                    70                 3.5                    86
1.5                    78                 3.5                    92
1.75                   75                 4                      82
2                      80                 4                      93
2                      75                 5                      97
2                      83                 5                      94
2                      80                 5.5                    100
2                      83                 6                      92
2.5                    84                 6                      98

What does the scatterplot reveal? Is the data obviously non-linear?

What do r, $r^2$ tell you about a linear model for these variables?

What about the residual plot?

What is the regression equation (in context)?

Interpret the slope:

What is the predicted score for someone who ignores everything else in life and studies 10 hours?

Is that predicted score for 10 hours of studying a reasonable prediction? How would you explain that particular prediction to
your friend who needs that score to make an A in the class and has decided to study for 10 hours?

Is a linear model appropriate for the data? Why or why not?

x     y
15    33
20    42
40    81
48    98
70    137
75    152
80    167
91    200
98    180
108   260
125   300
140   170
160   400

What does the scatterplot reveal? Is the data obviously non-linear?

What do r, $r^2$ tell you about a linear model for these variables?

What about the residual plot?

Linearizing Data with Transformations

Often a straight-line pattern is not the best model for depicting a relationship between
two variables. A clear indication of this problem is when the scatter plot shows a
distinctive curved pattern. Many times this happens when a variable is growing
exponentially instead of linearly. A variable grows exponentially if it is multiplied by a
fixed number greater than 1 in each equal x-interval (the relationship is said to be
exponential decay if the fixed number is less than 1).

If you have a nonlinear pattern, many times you can transform one or both of the
variables in order to uncover a linear relationship. If a variable is growing exponentially,
taking the log (common or natural) of that variable will uncover a linear pattern.

Ex. Consider the following years and corresponding populations.

YEAR                 1950   1960   1970   1980   1990
POPULATION (1000s)   50     67     91     122    165

a) Draw the scatter plot for the original data.


b) Verify that the pattern is exponential by finding the common ratio.
c) Calculate both the linear regression and r-value for the original data and the
transformed data.
d) Draw the residual plot for the original and the transformed data. Which is the
best model? Explain.
e) Predict the population in 1993 using the transformed data model. Would you feel
comfortable predicting the population in 2025 using this model?
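
A sketch of the transformation workflow for this example follows. It computes the ratios between consecutive populations, fits a line to the original data and to the common log of the population, and compares the r-values. The helper names `fit_line` and `predict_population` are illustrative; printed results are left for the reader to compare with calculator output.

```python
import math

years = [1950, 1960, 1970, 1980, 1990]
population = [50, 67, 91, 122, 165]  # in 1000s

# Ratio of each population to the one ten years earlier.
ratios = [population[i + 1] / population[i] for i in range(len(population) - 1)]

def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope, s_xy / math.sqrt(s_xx * s_yy)

log_population = [math.log10(p) for p in population]

a_lin, b_lin, r_lin = fit_line(years, population)
a_log, b_log, r_log = fit_line(years, log_population)

def predict_population(year):
    """Undo the log: the transformed model predicts log10(population)."""
    return 10 ** (a_log + b_log * year)

print("ratios:", [round(q, 3) for q in ratios])
print(f"original data:    r = {r_lin:.4f}")
print(f"transformed data: r = {r_log:.5f}")
print(f"predicted population for 1993 (1000s): {predict_population(1993):.1f}")
```

Roughly constant ratios and an r much closer to 1 after the transformation both point toward the exponential model; the residual plots in part (d) remain the more direct check.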

1. The paper “Population Pressure and Agricultural Intensity” reported a positive association between population density and agricultural intensity. The accompanying graphs consist of measures of population density and agricultural intensity for 18 different subtropical locations.

a) Identify both models.


b) Write the prediction equations for both models.

c) Interpret the coefficient of determination for both models.

2. The following graphs use the average radius of a planet’s orbit (in terms of the Earth’s orbit radius) to predict the length of the planet’s year (in Earth years).

a. Write the prediction equations for both models.

b. Interpret the slope for both models.

LSRL Practice and Applications
Show all of your work neatly and clearly on separate paper. Answer in context.

1. According to an article about secondary education there is a mild correlation (r =.55) between
high school GPA and college GPA. The high school GPA’s in the sample have a mean of 3.7 and
standard deviation of 0.47. The college GPA’s in the sample have a mean of 2.86 with standard
deviation of 0.85.
a) What is the explanatory variable?
b) What is the slope of the LSRL of college GPA on high school GPA? Interpret these in context
of the problem.
c) If Bob’s high school GPA is 3.2, what could we expect of him in college?

2. The scatterplot shows the advertised prices (in thousands of dollars) plotted against the odometer reading (in thousands of miles) for a random sample of Jeep Cherokees listed on CarMax and located in several different states.
A computer printout shows the results of a least squares regression procedure, where OR = odometer reading:
Predicted Price = 28.426 – 0.14*OR
R-squared = 0.73

a) Find the correlation coefficient for the relationship between price and age of Cherokees based on these data.
b) What is the slope of the regression line? Interpret it in the context of these data.
c) If the price for a Jeep that has 38k miles is $25,000, what is the residual for that Jeep?
d) With an $r^2$ of 0.73, what else, other than the odometer reading, might explain variation in the price?

3. One measure of the success of knee surgery is postsurgical range of motion for the knee joint.
Postsurgical range of motion was recorded for 12 patients who had surgery following a knee
dislocation. The age of each patient was also recorded (“Reconstruction…” American Journal of
Sports Medicine). The average age was 25.83 years and standard deviation of 7.578 years. The
average range of motion was 130.1 degrees with a standard deviation of 11.927 degrees. The
correlation coefficient was r = .5534.

a) If we use age to try and predict the range of motion, what is the slope? What is the y-intercept?
Interpret the two (slope and y-intercept) in context of the problem.
b) Use the regression line to predict the range of motion of someone 32 years of age.
c) Use the regression line to predict the range of motion of someone 50 years of age. Do you feel
this is an accurate prediction? Explain your thoughts.

4. Greenlight gave the following 2023 average weekly earnings (in USD) from allowances and payment for chores for children of ages 5 through 18.

Age 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Earnings 6 6.7 7.1 7.8 8.4 9.3 10.3 11.6 13 14.1 16.1 20.5 25 32

a. Construct a scatter plot.


b. Interpret the slope in terms of the problem.
c. Find the coefficient of determination and interpret in terms of the problem.
d. Find the correlation coefficient and interpret in terms of the problem.
e. Sketch the residual plot. Interpret the residual plot.

5. Success in hunting varies greatly among species of animals. Lions, who hunt singly, are rarely
successful in more than 10 percent of their hunts. Wild African dogs, who hunt in packs, are among
the most efficient of all hunters, succeeding at a rate of over 90 percent of their hunts.
In the early 1960’s, researcher Jane Goodall discovered that chimpanzees were not solely
vegetarian in their diets, as had previously been thought. This discovery spurred a tremendous
amount of primate research. Some of the latest primatology research has been done on chimpanzees
to find out if larger hunting parties increase the chances of a successful hunt. The results of one such
research project are summarized in the table for the number of chimpanzees in the hunting party
versus the percentage of successful hunts.

Number of Chimps 1 3 5 5 6 7 8 8 9 10 10 11 12
Percent of Success 20 28 42 40 58 45 62 65 63 75 78 75 82

a. Based on the scatterplot alone, describe the relationship between the number of chimps in the hunting party and the success rate.
b. Determine the equation of the regression line from the given data.
c. Interpret the slope.
d. Find the correlation coefficient and interpret it in terms of the problem.
e. Find the coefficient of determination and interpret it in terms of the problem.
f. Sketch the residual plot. Interpret in terms of the problem.

6. The following output data from MINITAB shows the number of teachers (in thousands) for each
of the states plus the District of Columbia against the number of students (in thousands) enrolled in
grades K-12.
Predictor Coef Stdev t-ratio p
Constant 4.486 2.025 2.22 0.031
Enroll 0.053401 0.001692 31.57 0.000
s=2.589 R-sq=81.5%
a) What is the equation of the least squares line? Interpret the slope.

b) Predict the number of teachers if the number of students in the state is 35,700.

7. Shells of mollusks function as both part of the skeletal system and as protective armor. It has
been argued that many features of these shells were the result of natural selection in the constant
battle against predators. The paper “Postmortem Changes in Strength of Gastropod Shells” included a scatter plot of data on x = shell height (cm) and y = breaking strength (newtons). The least squares line for a sample of 38 hermit crab shells was y = -275.1 + 244.9x.
a. What are the slope and intercept of this line?
b. When shell height increases by 1 cm, by how much does breaking strength tend to
change?
c. What breaking strength would you predict when shell height is 2 cm?
d. Does this approximate linear relationship appear to hold for shell heights as small as 1
cm? Explain your thoughts.

AP Statistics Review LSRL

PART I : CORRELATION. For each of the following three scatter plots, identify the correlation as either strong
positive, weak positive, strong negative, weak negative, or little-or-no correlation.

[Figure: three scatterplots, 1a through 1c; both axes on each plot run from 0 to 50.]

1a. _______________________ 1b. _______________________ 1c._______________________

PART II : TRUE OR FALSE.


___________ 2a. A high correlation between x and y proves that x causes y.

___________2b. The two coefficients (a and b) for the line of best fit have the same
sign.

___________2c. The correlation coefficient has the same sign as the slope of the least
squares line fitted to the same data.

___________2d. An r-value greater than zero indicates that ordered pairs with high
x-values will have low y-values.

___________2e. Correlations of -.41 and +.41 show the same degree of clustering around the
regression line.

___________2f. A correlation of .75 indicates a relationship that is 3 times as linear as one for which
the correlation is only .25.

___________2g. The mean of the residuals is always zero.

___________2h. A definite pattern in the residual plot is an indication that a nonlinear model will
show a better fit to the data than the straight regression line.

___________2i. An x-value that is an outlier in the x-direction is more indicative that a point is
influential than a y-value that is an outlier in the y-direction.

PART III : MULTIPLE CHOICE

__________ 3a. In a random sample of older patients at a large medical practice, the age of a patient and a
measure of that patient’s hearing loss were recorded. The correlation between age and hearing
loss of the patients in the sample was found to be 0.7. Which one of the following would be a
correct statement if the age of a patient were used to predict the amount of hearing loss for a
patient?

(A) Forty-nine percent of the time, the least squares regression line accurately predicts hearing loss.
(B) Forty-nine percent of the variation in hearing loss can be explained by the least squares regression
line relating hearing loss and age.
(C) About 70% of a person’s hearing loss can be explained by age, according to the regression line
relating hearing loss and age.
(D) About 70% of the time, age will correctly predict the amount of hearing loss.
(E) The least squares regression line relating hearing loss to age will have a slope of
approximately 0.7.

__________ 3b. Suppose one has collected data on X, the diameter of a tree trunk, and Y, the tree height. If the
regression equation is ŷ = -3.6 + 3.1x, what is your estimate of the average height of all
trees having a trunk diameter of 7 inches?

(A) 18.1 (B) 19.1 (C) 20.1 (D) 21.1 (E) 22.1

__________ 3c. A correlation between college entrance exam grades and scholastic achievement was found to
be -1.08. On the basis of this you would tell the university that:

(A) The entrance exam is a good predictor of academic success.


(B) The exam is a poor predictor of academic success.
(C) Students who do best on this exam will make the worst students.
(D) Students at this school are underachieving.
(E) They should hire a new statistician.

__________3d. A study of the effects of television measured how many hours of television each of 125
grade school children watched per week during a school year and their reading scores.
Which variable would you put on the horizontal axis of a scatterplot of the data?

(A) Reading score, because it is the response variable.


(B) Reading score, because it is the explanatory variable.
(C) Hours of television, because it is the response variable.
(D) Hours of television, because it is the explanatory variable.
(E) It makes no difference, because there is no explanatory-response distinction in this
study.

____________3e. The study described in the previous question found that children who watch more television
tend to have lower reading scores than children who watch fewer hours of television. The
study report says that, “Hours of television watched explained 9% of the observed variation
in the reading scores of the 125 subjects.” The correlation between hours of TV and reading
score must be
(A) r = -0.3 (B) r = 0.3 (C) r = -0.09
(D) r = 0.09
(E) Can’t tell from the information given.

__________3g. A study of child development measures the age (in months) at which a child begins to talk
and also the child’s score on an ability test given several years later. The study asks
whether the age at which a child talks helps predict the later test score. The least-squares
regression line of test score y on age x is ŷ = 110 – 1.3x. According to this regression line,
what happens (on the average) when a child starts talking one month later?

(A) The test score goes down 110 points.


(B) The test score goes down 1.3 points.
(C) The test score goes up 110 points.
(D) The test score goes up 1.3 points.
(E) The test score is 108.7.

____________3j A study utilizing a simple random sample of 40 college students studied their hours of part-
time work and grade point average. It was found that the correlation between the variables
was -.43. If the resulting linear regression equation is: predicted GPA= 3.75 - .05(hours),
which of the following is NOT a correct statement?

(A) The average GPA of students who don’t work is approximately 3.75
(B) If the correlation coefficient was -.60, the slope of the regression line would be
approximately -.07.
(C) Students who work 40 hours per week have a mean GPA of approximately 1.75.
(D) The value of the correlation coefficient and the steepness of the regression line are not
related.
(E) 18.5% of the variation in GPA scores can be explained by the hours of part-time work.

____________3l. An efficiency expert wanted to see if there is a relationship between the number of people
attending a meeting and the number of minutes late that the meeting started. The table shows
the results with the accompanying scatter plot (Figure 1).

Number of people attending the meeting        2   3   4   5    6    10
Number of minutes late the meeting started    3   6   8   10   14   7

[Figure 1: Scatterplot of Minutes Late vs Number attending. Horizontal axis: Number attending (1 to 10); vertical axis: Minutes Late.]

Figure 2 represents the scatter plot with the point (10, 7) removed from the data.

[Figure 2: Scatterplot of Minutes Late vs Number attending. Horizontal axis: Number attending (2 to 6); vertical axis: Minutes Late.]

Which one of the following is TRUE about the point (10, 7)?

(A) It has the largest residual.


(B) It is an influential observation.
(C) It is not an outlier in the x-direction.
(D) The P-value of a Linear Regression t-test changes little when the point is removed.
(E) The correlation is lower with the removal of the point.

____________3m. Most colleges have an end-of-course evaluation of the instructor. A random sample of
students at a large university are asked to rate their instructor on a scale of 1 (poor) to 4
(excellent) and to also rate the subject matter of the course on a scale of 1 (did not like) to 4
(liked a lot). The scatterplot below shows the results of one semester of these evaluations.

[Scatterplot: Instructor Rating (vertical axis, 1 to 4) vs Course Rating (horizontal axis, 1 to 4). Plot not reproduced.]

Which of the statements below is an appropriate conclusion about this scatterplot?

(A) Students tended to rate their instructors more highly than they did their courses.
(B) Students who rated the course a 3 tended to give the instructor a lower rating.
(C) There is not much variation in the instructor ratings.
(D) There doesn’t seem to be any relationship between course rating and instructor rating.
(E) There appears to be a very strong linear relationship between course rating and instructor rating.

PART IV: SHORT ANSWERS.

4. A psychologist determines that a strong, positive, linear relationship exists between an individual’s IQ
score and their sense of humor. She randomly selected 45 adults and found the following results.

IQ: Mean = 105. Standard deviation = 12.

Durante Test of Relevant Humor: Mean = 140. Standard deviation = 24.


r² = 0.81.

a) What is the predicted humor score if the IQ score of an individual is 110?

b) The value of r² is .81, which indicates a fairly strong relationship between IQ and humor. Is
this relationship causal or associative? Justify your answer.

LSRL Review part 2

True/False:

_____1. If the least-squares equation relating the independent variable x and the
dependent variable y for a given problem is y = 2x+5, then an increase of 1 unit in x is
associated with an increase of 2 units in y.

_____2. The coefficient of determination measures the variation in the dependent variable
that is explained by the regression model.

_____3. The least-squares equation minimizes the error sum of squares.

_____4. If your computed correlation coefficient r = +1.2, then you have better than a
perfect positive correlation.

_____5. A student might expect that there is a positive correlation between the age of
his or her computer and its resale value.

Multiple Choice:

Use the following set of observations for the independent variable x and the dependent variable y in
questions #6-7:

X -3 -1 1 3
Y 8 4 5 -1

_____6. The correlation coefficient is:


A) -1.0
B) -.8971
C) +1
D) 0.8971

_____7. The coefficient of determination is:


A) -1.0
B) -0.8048
C) +1
D) 0.8048
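
For questions like #6 and #7, the correlation can be computed straight from its definition. The following is a minimal sketch using the four (x, y) pairs from the table; the helper name `correlation` is an illustrative choice, not part of the packet.

```python
from math import sqrt

x = [-3, -1, 1, 3]
y = [8, 4, 5, -1]

def correlation(xs, ys):
    """Pearson correlation: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(x, y)
print(round(r, 4), round(r ** 2, 4))  # correlation, then coefficient of determination
```

Squaring r always gives a value between 0 and 1, which is why a negative coefficient of determination cannot occur.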

_____8. The correlation between two scores X and Y equals 0.8. If both the X scores
and the Y scores are converted to z-scores, then the correlation between z-
scores for X and z-scores for Y would be
A) -0.8
B) -0.2
C) 0.0
D) 0.2
E) 0.8

_____9. A least-squares regression line was fitted to the weights (in pounds) versus age
(in months) of a group of many young children. The equation of the line is
ŷ = 16.6 + 0.65t, where ŷ is the predicted weight and t is the age of the child. A 20-month-old
child in this group has an actual weight of 25 pounds. Which of the following is the
residual weight, in pounds, for this child?
A) -7.85
B) -4.60
C) 4.60
D) 5.00
E) 7.85
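
A residual is always actual minus predicted, in that order. Here is a small sketch of that arithmetic for the line in question 9; the function name is hypothetical.

```python
def residual(actual, predicted):
    """Residual = observed value minus the value predicted by the line."""
    return actual - predicted

# Predicted weight for a 20-month-old from the line 16.6 + 0.65t.
predicted_weight = 16.6 + 0.65 * 20
res = residual(25, predicted_weight)
print(round(predicted_weight, 2), round(res, 2))
```

A negative residual means the point lies below the regression line, so the line overestimates this child's weight.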

_____10. A wildlife biologist is interested in the relationship between the number of
chirps per minute for crickets (y) and temperature. Based on the collected
data, the least-squares regression line is ŷ = 10.53 + 3.41x, where x is the number of
degrees Fahrenheit by which the temperature exceeds 50°. Which of the following best
describes the meaning of the slope of the least-squares regression line?
A) For each increase in temperature of 1°F, the estimated # of chirps per minute increases by
10.53.
B) For each increase in temperature of 1°F, the estimated # of chirps per minute
increases by 3.41.
C) For each increase of one chirp per minute, there is an estimated increase of temp. of
10.53°F.
D) For each increase of one chirp per minute, there is an estimated increase of temp. of
3.41°F.
E) The slope has no meaning because the units of measure for x and y are not the same.

_____11. A study of fuel economy for various automobiles plotted the fuel consumption
vs. speed. A LSRL was fit to the data. Here is the residual plot from this least-
squares fit. What does the pattern of the residuals tell you about the linear model?

A) The evidence is inconclusive


B) The residual plot confirms the linearity of the fuel economy data
C) The residual plot does not confirm the linearity of the data
D) The residual plot clearly contradicts the linearity of the data.

Match the following scatter plots with the appropriate correlations from the list:

r = -.48    r = .98    r = .82    r = -.17    r = 1    r = .17    r = -1

16._______ 17._______ 18._______ 19._______

Free Response:

13. The equation of the least squares regression line for a set of given points is ŷ = 1.3 + 0.73x.
What is the residual for the point (4, 7)?

For questions 14 and 15:


At summer camp, one of Carla’s counselors told her that you can determine air temperature by the
number of cricket chirps you hear.

14. What is the explanatory variable and what is the response variable?

15. Carla collected data on temperature and number of chirps and found this:
x̄ = 166.8, sx = 31.0, ȳ = 78.83, sy = 9.11, and r = 0.461
Use this information to write the equation of the LSRL.
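
When only summary statistics are given, the LSRL comes from slope b = r(sy/sx) and intercept a = ȳ - b·x̄. The sketch below applies those two formulas to Carla's numbers, assuming x is the number of chirps and y is the temperature (the packet does not state this directly); the function name is illustrative.

```python
def lsrl_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (intercept, slope) of the least squares line from summary statistics."""
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return intercept, slope

# Assumption: x = number of chirps, y = temperature.
intercept, slope = lsrl_from_summary(166.8, 31.0, 78.83, 9.11, 0.461)
print(f"y-hat = {intercept:.2f} + {slope:.4f} x")
```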

Inferential Statistics

These methods are used to take sample data and use it to draw a conclusion about a population.

We use sample means that we calculate to learn about population means that are unknowable.
We use sample proportions that we calculate to learn about population proportions that are unknowable.
Reminder: a statistic is from a sample, a parameter is from a population

Because sample calculations are so important, collecting the data must be done using a method that is
defensible. This almost always means that some step in selecting samples must be randomized.

[Diagram: All CISH Students (the population), with 40 students in our sample]

A graduate student in an education program wants to know the ways HS teachers in her state have been
using AI tools in the classroom. She obtains a list of certified HS teachers in her state and from that list
randomly selects 200 teachers. She emails a survey to each of the 200 teachers and receives 132
returned and completed surveys.

What is the population in this study?

What is the sample in this study?

The list she uses to choose her sample is called a sampling frame. Even though she wants to learn
about ALL HS teachers in her state, the only feasible data she can use is the available list of certified teachers.

For each of the following sampling situations, identify the population as precisely as possible. (What individuals are
included in the population?)

The Gallup organization in the U.S. questions a sample of about 1500 adult U.S. residents to collect opinions
about a variety of issues.

Every ten years, the U.S. census will collect basic information from every household in the United States. The census will
select a sample of households to receive the ‘long form’ of the census that asks for more detailed information. About
one in ten households are asked to complete the long form.

A manufacturer purchases fasteners from a supplier in South America. As part of its routine quality control, the
manufacturer will randomly select a sample of 20 fasteners from each day’s supply and test them for durability and
accurate measurements.

Census: a method of information gathering that attempts to contact/survey/measure every member of a population

[Diagram: All CISH Students (the population), with 40 students in our sample]

Convenience Sample: includes members that are the easiest to reach/ask/poll/measure

Voluntary Response Sample: the sample selects itself

Simple Random Sampling (SRS)
Sample is chosen using a randomized method such that any set of n individuals has an
equal chance of being selected. (this is actually not that simple!)

Acceptable Randomization Methods


• numbered members of population, numbers selected using random number generator or random digits table
• names on paper, put into a hat

For our sample of 40 CISH students, if we randomly choose 10 students from each grade in the high
school, is this a simple random sample?

For our sample of 40 CISH students, if we randomly choose 4 home groups and include all
ten students from each home group in our sample, would this be an SRS?

Let’s randomly choose 6 students from this list!

1. Alex
2. Justin
3. Arthur
4. Bao Nhi
5. Chaerin
6. Brad
7. Calvin
8. Michael
9. Daniel
10. Duc
11. Ella
12. Jesus
13. Hayle
14. Kate
15. Yoochan
16. Yoowon
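
The numbered list above can be sampled with a random number generator, one of the acceptable randomization methods. This is a minimal sketch using Python's standard `random` module in place of a calculator or random digits table; the seed value is arbitrary and only makes the draw repeatable.

```python
import random

students = [
    "Alex", "Justin", "Arthur", "Bao Nhi", "Chaerin", "Brad", "Calvin", "Michael",
    "Daniel", "Duc", "Ella", "Jesus", "Hayle", "Kate", "Yoochan", "Yoowon",
]

rng = random.Random(2024)            # arbitrary seed so the result can be reproduced
sample = rng.sample(students, k=6)   # draws without replacement, so no repeats
print(sample)
```

Because `sample` draws without replacement, every set of 6 names is equally likely, which is the defining property of an SRS.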

Stratified Random Sample
Divide the population into subgroups of similar individuals (similar in ways that are important to the topic of
interest), called strata. Choose a separate SRS within each stratum and combine these into a full sample
A probability-based sampling method.
Usually more representative than an SRS.

[Diagram: All CISH Students (the population), with 40 students in our sample]

What strata should we use if we are interested in:

- distance of daily commute to school?

- amount of time spent on homework?

- amount of money spent in Joma?
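
As a sketch of the mechanics, the code below takes a separate SRS inside each stratum and combines them. It assumes grade level is the stratum and uses a made-up roster of 150 student IDs per grade; neither the roster nor the function name comes from the packet.

```python
import random

def stratified_sample(strata, per_stratum, rng):
    """Take an SRS of the same size from every stratum and combine the results."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample

# Hypothetical roster: 150 student IDs in each of grades 9 to 12.
strata = {
    grade: [f"G{grade}-{i:03d}" for i in range(1, 151)]
    for grade in (9, 10, 11, 12)
}

rng = random.Random(1)  # arbitrary seed
sample = stratified_sample(strata, 10, rng)
print(len(sample), sample[:3])
```

Every grade is guaranteed exactly 10 students here, which an SRS of 40 from the whole school would not guarantee.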

Systematic Random Sampling

Select a random starting point, then select every nth individual.

n is determined by population size and desired sample size

If CISH has 600 students, what would n be for our sample of 40 students?
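
A systematic sample can be sketched in a few lines: divide the population size by the sample size to get the step n, pick a random start within the first n individuals, then keep stepping. The code assumes students are numbered 1 to 600; the function name is illustrative.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Random start in the first block, then every n-th individual after that."""
    step = population_size // sample_size   # n = 600 // 40
    start = rng.randint(1, step)            # random starting point from 1 to n
    return [start + i * step for i in range(sample_size)]

rng = random.Random(7)  # arbitrary seed
chosen = systematic_sample(600, 40, rng)
print(chosen[:5], "...", chosen[-1])
```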

Multi-stage Sampling
probability-based method

Sample is chosen in stages, starting with larger groups, then eventually smaller groupings.

Example: National Multi-Stage Sample

Cluster Sample
Population is split into subgroups that are convenient or already exist in the population.
These subgroups are called clusters.
Randomly select clusters.
All units within the randomly selected clusters are included in the sample.

Geographically: zip codes, counties, townships could all be clusters.

What kind of clusters exist at CISH?
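
To contrast with stratified sampling, here is a minimal cluster-sampling sketch: whole clusters are chosen at random and everyone inside them is included. It assumes 60 home groups of 10 students each, which is a made-up layout for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and include every member of each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical layout: 60 home groups with 10 students in each.
clusters = {
    f"HG{g:02d}": [f"HG{g:02d}-S{s}" for s in range(1, 11)]
    for g in range(1, 61)
}

rng = random.Random(3)  # arbitrary seed
sample = cluster_sample(clusters, 4, rng)
print(len(sample), sample[:3])
```

The randomness is applied to the clusters, not to individual students, so two students in the same home group are always selected together or not at all.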

Bias in Sampling
A sampling method is biased if it systematically favors a particular outcome. In a biased sample, not all
viewpoints or situations that can affect the outcome have an equal or proportional chance of being
represented.
Bias can be avoided by a well-designed sampling method.

“Error” due to variation cannot be avoided.
Inference methods account for this kind of error.

Sources of Bias in Sampling

Undercoverage
Voluntary Response
Non-response
Wording Effects
Response Bias

You are on the staff of an elected government representative who is interested in public support for
increased funding for elder care. You report that about 213 messages have been received from the
public via mail or email and of those messages 141 oppose the increased spending. The government
representative is surprised and says she expected stronger support for the initiative. Would you conclude
that the majority of voters oppose the increased funding?

Sampling Methods
Match each setting with the correct sampling method by writing the correct letter on each line.

A. Convenience    B. Voluntary Response    C. SRS
D. Stratified     E. Cluster               F. Multistage

_________ 1. A mid-size paper company wishes to take a random sample of its clients. Clients are divided into
Small (under $50k), Medium ($50k to $250k), and Large (over $250k). A random digits table is
used to select 30 small, 15 medium, and 10 large clients.

_________ 2. An ultimate frisbee tournament organizer wants to estimate the mean number of years of playing
experience for its participants. She uses a list of all participants, numbered 001 to 312, and a
random digits table to select 25 participants.

__________ 3. A restaurant manager wants to gauge the opinion of the restaurant’s customers. He walks around
and interviews 20 people who are eating there one Saturday night.

__________ 4. A comic book store wishes to take a sample of comic books. Their inventory is currently stored in
cardboard boxes. The manager numbers each box from 001 to 684 and uses a random digits table
to select 20 random boxes. All comics in these boxes are included in her sample.

__________ 5. An exit poller wants to estimate the proportion of voters who voted for certain candidates. He
stands outside a particular polling place at 10am and interviews the first 100 voters to exit.

__________ 6. Tobias wants to watch a random sample of Simpsons episodes from the first 10 seasons (which are
the best). He chooses an SRS of 3 episodes from each of these ten seasons, and watches these
episodes.

__________ 7. To choose a sample of fifty employees from a large corporate office, a list of employees is
obtained, alphabetized, and labeled numerically. A random digits table is read left to right, in sets
of threes, until fifty unique labels are found. These fifty employees are used in the sample.

__________ 8. A local newspaper wishes to estimate the years of graduate education that the teachers
in the local school district have obtained. An announcement is placed in the district newsletter
for all teachers in the district asking them to contact the newspaper with the number of years of
graduate education obtained.

AP Statistics
Methods of Sampling

Discussion Questions:
1. Which of the following sampling methods produce a random sample from a class of 36 students:

• Select the first six students to enter the room.


• Select those students whose phone numbers end with the digit 4.
• Suppose that the class has 18 boys and 18 girls. Select a sample of 6 students by using a
random number table to choose 1 of the 18 boys, then 1 of the 18 girls, then a boy, then a girl,
and so on until you have chosen 6 students.
• Suppose that the classroom has six rows of chairs with six chairs in each row. Assign the rows
the digits 1 through 6. Throw a die and place all the students in the row corresponding to the
number of the die in the sample.
• Assign each student a number from 1 to 36. The girls get the numbers 1 to 18 and the boys the
numbers from 19 to 36. Use a random number table to select six two-digit numbers between 1
and 36, and place the corresponding students in the sample.

2. Describe how you would select a sample of 10 juniors from your school using the following methods:

a. SRS

b. convenience sampling

c. voluntary response sampling

d. stratified random sampling

e. systematic sampling

f. multi-stage sampling

3. For each sampling method below, tell which groups in the population are likely to be
underrepresented.

• To obtain a sample of households, a television rating service dials numbers taken at random
from telephone directories.
• To determine the percentage of teenage girls with long hair, Teen magazine published a mail-in
questionnaire. Of the 500 respondents, 85% had hair shoulder length or longer.
• To evaluate the reliability of cars owned by its subscribers, Consumer Reports magazine
publishes a yearly list of automobiles and their frequency-of-repair records. The magazine
collects the information by mailing a questionnaire to subscribers and tabulating the results
from those who return it.
• A college psychology professor needs subjects for a research project to determine which colors
average American adults find restful. From the list of all 743 students taking introductory
psychology at her school, she selects 25 students using a random number table.
• For a survey of student opinions about school athletic programs, a member of the school board
obtains a sample of students by listing all students in the school and using a random number
table to select 30 of them. Six of the students say that they don’t have time to participate, and
they are eliminated from the sample.

Stat Homework 3.1
Homework Questions
1. Retailers at the local shopping mall want to survey their Saturday customers about their satisfaction
with the eating facilities within the mall. One merchant went to business school and learned about the
importance of statistics, so he wants to obtain a random sample. He proposes the following method:
Interviewers should stand at the center of the mall and select the first 100 people who walk by after
11:00 a.m. He believes this approach will provide a random sample because the interviewers will not
exercise any decision over whether or not to include specific individuals in the sample.
a. What kind of sample would the merchant really get?
b. In what way might this sampling method be biased?
c. Describe how the merchant could modify this approach to use a version of systematic
sampling.
d. If the retailer were to use stratified random sampling, what strata would you recommend that
he choose?
e. What method would you suggest to the merchant? Explain your choice.

2. The Educational Testing Service (ETS) needed a representative sample of college students. ETS first
divided all colleges into groups of similar ones (such as public colleges with more than 25,000
students, small private schools, etc.). Then they used their judgment to choose one representative
school from each group, thus obtaining the sample of schools. Each school in turn picked a sample of
students.
a. ETS divided the colleges into strata but did not perform stratified random sampling. Explain.
b. Suggest ways to improve this sampling scheme.

3. Researchers wanted a representative sample of Japanese-Americans living in San Francisco. The
procedure was as follows: After consultation with representative figures in the Japanese community,
the four blocks closest to the Japanese community center were chosen; all persons resident in those
four blocks were taken for the sample. However, a comparison with Census data shows that the
sample did not include a high-enough proportion of Japanese with college degrees.

a. What kind of sampling did this survey use?


b. Why do you suppose the sample did not have enough college graduates?
c. Can you think of a way to improve this sampling scheme?

4. A newspaper article began, “Almost half of the USA’s secretaries would rather work for a man than a
woman, even though a male boss is more likely to ask them to clean the coffeepot, says a Working
Woman survey” (USA Today). This is the result of a “poll of 1,100 readers in the magazine’s May
issue.” Of these readers, 46% prefer to work for a man, 5% for a woman, and 49% say it doesn’t
matter.

a. What kind of sampling do you think was used?


b. What population do the results apply to, according to the newspaper?
c. In what way might the sampling method be biased?
d. Discuss the issue of undercoverage.
e. If USA Today requested a random sample of bosses to interview their own secretaries, discuss
potential trouble with this method.

Experimental Design

Sometimes an experiment is impossible or unethical and an observation (or simulation) has to be used.

Some experiments include different levels of a single factor.

Some experiments include different factors and/or different levels of different factors.

Example: one experiment wants to see the change in the amount
of food wasted by students when certain factors are changed.

· regular plate vs. smaller plate


· eating in the cafeteria vs. eating in Joma vs. eating outside
· eating alone vs. eating with friends

How many factors are there in this experiment?

How many treatments?
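
A treatment is one combination of a level from every factor, so listing all combinations is a quick way to count them. The sketch below does this for the food waste example; the dictionary keys are labels chosen here for illustration.

```python
from itertools import product

# Each factor and its levels, taken from the food waste example.
factors = {
    "plate": ["regular", "smaller"],
    "location": ["cafeteria", "Joma", "outside"],
    "company": ["alone", "with friends"],
}

# One treatment = one level chosen from every factor.
treatments = list(product(*factors.values()))
print(len(factors), "factors,", len(treatments), "treatments")
for treatment in treatments[:3]:
    print(treatment)
```

The number of treatments is the product of the number of levels of each factor.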

First Principle of Experimental Design: Control

Control the effects of lurking variables. When we control lurking variables, it
makes it more likely that the effects observed can be attributed to the
experimental variables.
A control group is a group in an experiment that receives no treatment or
receives a neutral treatment (placebo).

It is not always possible to have a control group.

Second Principle of Experimental Design: Randomization
Systematic differences among groups in a comparative experiment are a possible source of
bias. The remedy is to use randomization to make group or treatment assignments.

Third Principle of Experimental Design: Replication


Replicate each treatment on many units to reduce chance variation in the results.

Replication refers to having an adequate number of experimental units or subjects in each group.

"how rare is rarely enough?"

ex: taking medicine with a lot of water

Experimental Design: Completely Randomized

All experimental units are allocated randomly among the treatments. This is done to produce groups
that are similar before treatments are applied.

why would this be important?

removes bias in self-reported information and removes bias in any subjective part of data collection
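
Random allocation in a completely randomized design can be sketched as "shuffle, then deal out". The example assumes 30 made-up experimental units and two treatments; the function and unit names are hypothetical.

```python
import random

def completely_randomized(units, treatments, rng):
    """Shuffle all units, then deal them out to the treatments in turn."""
    shuffled = units[:]          # copy so the original list is left unchanged
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

units = [f"unit {i}" for i in range(1, 31)]   # hypothetical experimental units
groups = completely_randomized(units, ["treatment", "control"], random.Random(11))
print({name: len(members) for name, members in groups.items()})
```

Because chance alone decides the groups, any lurking variable should be spread roughly evenly across them.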

A block design is like doing multiple experiments at the same time.

Block Design
Used when subjects are of different type in a way that is expected to affect the outcome
(e.g. men and women respond differently to medication, people with higher blood pressure
may respond differently to diet changes than people with lower blood pressure...)

Blocking is a form of control.

Essentially: block design allows us to compare apples to apples and oranges to oranges.

Blocking is used in an experiment.
Stratifying is used in sample selection or in survey selection.

Stratifying and blocking are not interchangeable terms.
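
In a randomized block design the random assignment happens separately inside each block. The sketch below assumes two blocks (men and women, as in the medication example) with 12 made-up subjects each; all names are hypothetical.

```python
import random

def randomized_block(units_by_block, treatments, rng):
    """Randomly assign treatments separately within each block."""
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = units[:]
        rng.shuffle(shuffled)                 # randomize only within this block
        for i, unit in enumerate(shuffled):
            assignment[unit] = (block, treatments[i % len(treatments)])
    return assignment

# Hypothetical subjects: 12 men and 12 women.
units_by_block = {
    "men": [f"M{i}" for i in range(1, 13)],
    "women": [f"W{i}" for i in range(1, 13)],
}
assignment = randomized_block(units_by_block, ["medication", "placebo"], random.Random(5))
print(list(assignment.items())[:3])
```

Each block ends up with the same number of subjects on each treatment, so comparisons are made men to men and women to women.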

Example: Completely Randomized Design

Example: Randomized Block Design

Subjects are assigned to blocks before any random assignment to treatment or control groups.

Matched Pairs Design
Particular type of randomized block design.
Used when comparing only two treatments.
Subjects are compared (matched with) similar subjects OR each subject receives both treatments.
Randomization occurs either in assigning treatments or in the order of the treatments.

• Each subject receives both treatments. The order is randomized.
  Example: Taste Test
• Two experimental units that naturally exist as a pair are randomly assigned to treatment or placebo.
• Subjects that share important characteristics (that may influence response) are paired and randomly
  assigned to treatment or placebo.

Example: Fish Tanks
We want to test a new fish food!

[Diagram: fish tanks; label: Window (warm sun)]

What would our response variable likely be?

How should we design the experiment?

What is happening in this experiment?
What kind of experiment is it?

A high school regularly offers a review course to prepare students for the SAT. This year budget cuts will
allow the school to offer only an online version of the course. The group of students who take the online
course earn an increase of 45 points in their math test from the pre-test to the actual SAT test.

As an experiment this would have a very simple design. A group of students (the subjects) were exposed to
a treatment (online course) and the outcome was observed (change in SAT math scores).

Students → Online Course → Change in scores (pre-test vs SAT score)

Would you conclude the online course is effective? Explain.

McDonald’s is giving away squishmallows in its Happy Meals. Right now 50% of them are decorated with music notes on them,
20% have toys on them and 30% are just wearing clothes.
If you want to collect one of each and toys are randomly placed into boxes, how many Happy Meals do you expect to buy before
you have at least one of each kind of toy?

To answer this question, we can use a simulation: the imitation of chance behavior based on a
model that accurately reflects the experiment under consideration.

For probability experiments, clearly describe the process you are using.
• How are you assigning digits?
• What will you do about digits that repeat? (if the same digit shows up more than once, what will you do?)
• What does a trial consist of?
• How many trials will you run?
• How will you interpret the results in the context of the question you are answering?
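
The checklist above can be turned into a small program. This sketch shows one possible digit assignment for the squishmallow question (digits 0-4 music notes, 5-6 toys, 7-9 clothes); repeated digits are allowed because each Happy Meal is an independent draw, and one trial ends when all three kinds have been collected. The function names and the number of trials are choices made here, not part of the packet.

```python
import random

def kind_for_digit(digit):
    """Map one random digit (0-9) to a kind of squishmallow."""
    if digit <= 4:        # digits 0-4: music notes (50%)
        return "music notes"
    if digit <= 6:        # digits 5-6: toys (20%)
        return "toys"
    return "clothes"      # digits 7-9: just wearing clothes (30%)

def meals_until_complete(rng):
    """One trial: buy meals until all three kinds have been seen."""
    seen = set()
    meals = 0
    while len(seen) < 3:
        seen.add(kind_for_digit(rng.randrange(10)))
        meals += 1
    return meals

def average_meals(trials, seed=0):
    """Average number of meals per trial over many trials."""
    rng = random.Random(seed)
    return sum(meals_until_complete(rng) for _ in range(trials)) / trials

print(average_meals(10_000))
```

The printed average estimates the expected number of Happy Meals; running more trials makes the estimate more stable.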


McDonald’s is giving away toys in its Happy Meals. Right now 50% of toys are cars, 20% are ponies and 30% are parachutes. If you want to
collect one of each and toys are randomly placed into boxes, how many Happy Meals do you expect to buy before you have at least one of each
kind of toy?

• how are you assigning digits?


• what will you do about repeats?
• what will be considered one trial?
• how many trials will you run?

According to YOMA coffee, 13% of their holiday cups have gingerbread designs, 53% have
snowflake designs, 15% have birds, and 19% have winter flowers and berries. When you go to
pick up coffee for you and three of your friends, 3 of the 4 cups have snowflakes! What is the
likelihood of that happening just randomly? Perform ten trials of this probability experiment.
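
Here is one way the coffee cup experiment could be simulated. The sketch assumes two-digit numbers 00-99 are assigned so that 00-52 means a snowflake cup (53%), that one trial is four cups, and that "3 of the 4" is read as "at least 3"; pass `at_least=3` together with your own counting rule if you prefer "exactly 3". The function names are hypothetical.

```python
import random

def count_snowflakes(rng, cups=4):
    """One trial: draw a two-digit number (00-99) per cup; 00-52 means snowflake."""
    return sum(1 for _ in range(cups) if rng.randrange(100) <= 52)

def estimate_probability(trials, at_least=3, seed=0):
    """Share of trials in which at least `at_least` of the cups are snowflakes."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if count_snowflakes(rng) >= at_least)
    return hits / trials

print(estimate_probability(10))       # ten trials, as the exercise asks
print(estimate_probability(20_000))   # many more trials give a steadier estimate
```

With only ten trials the estimate can move a lot from run to run, which is worth discussing when interpreting the result.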

Let’s use the Nspire and answer this one again!
• how are you assigning digits?
• what will you do about repeats?
• what will be considered one trial?
• how many trials will you run?
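
The same experiment can be sketched in Python as a cross-check on the calculator work. This is a hedged illustration, not the packet's method: it reads "3 of the 4 cups" as exactly three, it lumps the three non-snowflake designs together, and the function names and seed are made up for the example.

```python
import random
from math import comb

P_SNOWFLAKE = 0.53  # from the problem statement

def snowflake_cups(rng, cups=4):
    """One trial: four two-digit numbers 00-99; 00-52 stands for a snowflake cup."""
    return sum(1 for _ in range(cups) if rng.randrange(100) < 53)

def estimate_probability(trials, seed=1):
    """Fraction of trials in which exactly 3 of the 4 cups are snowflakes."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if snowflake_cups(rng) == 3)
    return hits / trials

# Exact binomial value for comparison: C(4,3) * 0.53^3 * 0.47
exact = comb(4, 3) * P_SNOWFLAKE**3 * (1 - P_SNOWFLAKE)

print(estimate_probability(10))      # ten trials, as the problem asks
print(estimate_probability(10_000))  # many trials should land near `exact`
print(round(exact, 4))
```

If the question is read as "at least 3 of the 4 cups", change `== 3` to `>= 3`.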

CISH has 35 Freshmen, 41 Sophomores, 38 Juniors and 48 Seniors. Dr. Sutherland randomly selects 10
students to take on a fun field trip. The 10 students include 4 seniors, 3 juniors, 2 sophomores and 1
freshman. The freshmen say that there are too many seniors selected for the process to have actually
been random. Are they right to be suspicious?
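
A possible way to investigate the freshmen's complaint is to simulate many random selections of 10 students and see how often 4 or more seniors appear. The sketch below assumes selection without replacement from all 162 students; the function names and seed are hypothetical. An exact count-based calculation is included for comparison.

```python
import random
from math import comb

CLASS_SIZES = {"freshman": 35, "sophomore": 41, "junior": 38, "senior": 48}

def seniors_in_one_draw(rng, roster, k=10):
    """One trial: pick k students without replacement and count the seniors."""
    return sum(1 for student in rng.sample(roster, k) if student == "senior")

def simulate(trials, seed=1):
    """Fraction of trials with 4 or more seniors among the 10 chosen."""
    rng = random.Random(seed)
    roster = [label for label, n in CLASS_SIZES.items() for _ in range(n)]
    hits = sum(1 for _ in range(trials) if seniors_in_one_draw(rng, roster) >= 4)
    return hits / trials

def exact_at_least(k_min, seniors=48, total=162, draw=10):
    """Exact P(at least k_min seniors) by counting samples."""
    others = total - seniors
    below = sum(comb(seniors, k) * comb(others, draw - k) for k in range(k_min))
    return 1 - below / comb(total, draw)

print(simulate(10_000))
print(exact_at_least(4))
```

A result that happens fairly often under random selection is not strong evidence against randomness; a result that almost never happens would be.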

According to a statistician at SAS, the color distribution of M&Ms is approximately 24% blue, 20%
orange, 16% green, 14% yellow, 13% red, and 13% brown.
You randomly select 8 M&Ms from the bag and 3 are red. That makes you think those proportions
must be wrong, so you do a probability experiment to see how likely your result is if the proportions
are accurate.
Use the Nspire to answer this question. Seed your calculator with 4308 and perform at least 10 trials.
Remember to include all of the necessary information.

A basketball player has historically made 65% of the shots she attempted. In one game she
makes six shots in a row and the announcer says the player is “in the zone!”. Assume the player
attempts 20 shots per game. How unusual would it be for her to make 6 or more shots in a row?
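
Streak questions are awkward to answer by formula, so simulation is a natural fit. The sketch below is one way to set it up, assuming every shot is independent with a 65% chance of going in; the function names and seed are illustrative only.

```python
import random

def longest_streak(shots):
    """Length of the longest run of made shots in a list of True/False values."""
    best = current = 0
    for made in shots:
        current = current + 1 if made else 0
        best = max(best, current)
    return best

def simulate_game(rng, attempts=20, p_make=0.65):
    """One trial: 20 independent shots, each made with probability 0.65."""
    return [rng.random() < p_make for _ in range(attempts)]

def estimate_streak_probability(trials, streak=6, seed=1):
    """Fraction of simulated games with a run of at least `streak` made shots."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if longest_streak(simulate_game(rng)) >= streak)
    return hits / trials

print(estimate_streak_probability(10_000))
```

Here one trial is one 20-shot game, and the response recorded for each trial is whether the longest streak reached six.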

AP Statistics: Experimental Design

1. Some studies find an association between liver cancer and smoking. However, alcohol
consumption is a confounding variable. Explain what is meant by alcohol being a confounding
variable.

2. Another recent study that was reported in the Fall of 2009 found that people who used
sunscreen were more likely to develop skin cancer.
(a) What might be a confounding variable in this study?
(b) Design an experiment to determine if sunscreen helps to increase the likelihood of
developing skin cancer.

3. On the television news in August, 2013, it was reported that young children who used hand
sanitizers reduced the number of illnesses they had by 20% over those children that did not use
these hand sanitizers.
(a) What might be a confounding variable in this study?
(b) Design an experiment to determine if the dry soaps prevent illness in young children.

4. According to a recent Daily Yomiuri article, Prof Takafumi Tezuka from Nagoya University
doubled the yield of green beans by twisting the vines counterclockwise around a pole:

"...researchers grew a total of 45 green bean plants in three ways letting the vine wind
clockwise, binding them straight with cord, and twisting them counterclockwise. They then
tallied the number of pods produced by plants in each category.

Plants bound straight produced 1.5 times as many pods as those allowed to grow naturally,
while those twisted in the opposite of their natural direction produced twice as many. The
pods' size and weight were generally the same."

The professor hypothesized that some stress (comfortable tension) on a plant might be good.
They expect the technique to also work on morning glories.

(a) In this experiment, what was the control group(s) and what was the experimental
group(s)?

(b) There was not much information given in this article. If you were trying to conduct this
experiment, what is one thing you would make sure you did to improve on the model
given above?

Name: ________________________________________ Experimental Design WS

For the following questions: Identify whether it is an experiment or an observational study/survey. If
experiment: identify experimental design, experimental units, factors & levels, response variable, and
blocks (if any). If study/survey, identify sampling method and one possible source of bias.

1) In marketing children’s products, it’s extremely important to produce television
commercials that hold the attention of the children who view them. A psychologist hired by
a marketing research firm wants to determine whether differences in attention span exist
among advertisements for different types of products. Fifteen children under 10 years of age
are randomly asked to watch one 60-second commercial for one of three types of products,
and their attention spans are measured in seconds.

2) Upon reconsidering the above problem, the psychologist decides that the age of the
child may affect the attention span. Consequently, the psychologist randomly assigns
fifteen 10-year-olds, fifteen 8-year-olds, fifteen 6-year-olds, and fifteen 4-year-olds to watch
one of the three commercials, and their attention spans are measured.

3) An economist wants to determine if differences exist among the salaries of university
professors in different departments. Data is collected from a random sample of six professors
from each of the departments of business, history, and psychology.

4) The editor of the student newspaper was in the process of making some major changes in
the newspaper’s layout. He was also contemplating changing the typeface of the print
used. To help him make a decision, he asked six individuals to read four newspaper pages,
with each page printed in a different typeface. If the reading speed differed, then the
typeface that was the fastest would be used. However, if there was not enough evidence
to allow the editor to conclude that such differences existed, the current typeface would be
continued. (Where should randomization be implemented?)

5) In a recent report, a group of scientists claimed that Americans are consuming an
excessive amount of selenium in their diets. The National Science Foundation has stated that
the safe upper limit is 200 micrograms per day. In order to determine the extent of the
problem in Plano, researchers numbered each city block and randomly selected 20 blocks.
They surveyed all the households on each of the selected blocks by interviewing an adult at
the residence and measured their daily consumption of selenium.

AP Stats Name _________________________________________________________
Chapter 5 Review

Part I - Multiple Choice (Questions 1-10) - Circle the answer of your choice.

1. Which one of the following is not a principle of experimentation?

(a) Randomly allocating experimental units to treatments.
(b) Stratifying the experimental units into groups of similar individuals and applying different
treatments to each stratum.
(c) Using double blindness to eliminate bias.
(d) Replicating to measure overall experimental error and increase precision.
(e) Using a control group to determine whether treatment really works.

2. A simple random sample of size n is selected in such a way that

(a) Each member of the population has an equal chance of being selected.
(b) Each member of the population is given an opportunity to respond to the survey.
(c) All samples of size n have the same chance of being selected.
(d) The probability of selecting any sample is known to be 7 ® rand .
(e) The sample is guaranteed to represent the entire population.

3. In sample surveys, bias can be controlled by all of the following except

(a) Using a probability or chance sampling procedure.


(b) Wording questions so they are not confusing or misleading.
(c) Carefully training and supervising interviewers.
(d) Prompting respondents so that they give correct responses.
(e) Reducing non-response and undercoverage.

4. A graduate student conducts a study to determine whether a new activity-based method
is better than the traditional lecture method of teaching statistics. He found two teachers to help him
in his study for one semester. Mr. Dull volunteered to continue teaching with traditional
lectures and Ms. Perky agreed to try the new activity-based method. Each teacher planned
to teach two sections of approximately forty students each for adequate replication. At the
end of the semester, all sections would take the same final exam and their scores would be
compared. What is the treatment variable in this study?

(a) Teacher
(b) Section of the Course
(c) Teaching Method
(d) Final Exam Score
(e) Student

5. In a study on the effect of reinforcement on learning from programmed text, two
experimental treatments are planned: reinforcement given after every frame of
programmed text or reinforcement given after every three frames. Which one of the
following control groups would serve best in this study?

(a) A group which does not read the programmed text material.
(b) A group that reads the programmed material in prose formats.
(c) A group which reads the programmed material but does not receive reinforcement.
(d) A group that reads the programmed text material and reinforcement is given at random.
(e) A group which watches the video of the programmed material.

6. We say that the design of a study is biased if which of the following is true?

(a) A racial or sexual preference is suspected.
(b) Random placebos have been used.
(c) The research designer has received a grant from a special interest group.
(d) The correlation is greater than 1 or less than –1.
(e) Certain outcomes are systematically favored.

7. Which of the following are true statements?

I. Voluntary response samples often over represent people with strong opinions.
II. Convenience samples often lead to undercoverage bias.
III. Questionnaires with nonneutral wording are likely to have response bias.

(a) I and II
(b) I and III
(c) II and III
(d) I, II, and III
(e) None of the above gives the true set of responses.

8. To survey the opinions of bleacher fans at Wrigley Field, a surveyor plans to select every
one-hundredth fan entering the bleachers one afternoon. Will this result in a random
sample?

(a) Yes, because each bleacher fan has the same chance of being selected.
(b) Yes, but only if there is a single entrance to the bleachers.
(c) Yes, because the 99 out of 100 bleacher fans that are not selected will form a control
group.
(d) Yes, because this is an example of systematic sampling, which is a special case of
random sampling.
(e) No, because each fan does not have the same chance of being selected.

9. What fault do all these sampling designs have in common?

I. The Wall Street Journal plans to make a prediction for a presidential election
based on a survey of its readers.
II. A radio talk show asks people to phone in their views on whether the United States
should pay off its huge debt to the United Nations.
III. A police detective is interested in determining a sample of high school students
and interviews each one about any illegal drug use by the student during the past
year.

(a) All the designs make improper use of stratification.
(b) All the designs have errors that can lead to strong bias.
(c) All the designs confuse association with cause and effect.
(d) None of the designs satisfactorily controls for sampling error.
(e) None of the designs makes use of chance in selecting a sample.

10. The following students are available to serve on the Student Procrastination Committee.

 1. Ally     2. Benji    3. Chad     4. Donald    5. Eli         6. Frannie
 7. Gina     8. Hank     9. Ivana   10. Jan      11. Kyle       12. Lana
13. Morris  14. Norm    15. Olive   16. Patti    17. Quasimodo  18. Ramone

Using the randInt function on your calculator, select a simple random sample of size 4. Before you
select your sample, seed your random number generator by storing 7 into rand [ 7 → rand ].
The students who were selected were:
(a) Ally, Ramone, Kyle, Olive
(b) Donald, Ramone, Kyle, Frannie
(c) Jan, Kyle, Kyle, Ramone
(d) Gina, Ivana, Patti, Eli
(e) Norm, Donald, Morris, Frannie
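
The selection procedure itself (label the students 1 to 18, generate random integers, skip repeats) can be sketched in Python. This is an illustration of the procedure only: Python's random number generator is not the calculator's, so seeding it with 7 will not reproduce the calculator's sequence and cannot be used to pick among the answer choices. The function name is hypothetical.

```python
import random

STUDENTS = [
    "Ally", "Benji", "Chad", "Donald", "Eli", "Frannie",
    "Gina", "Hank", "Ivana", "Jan", "Kyle", "Lana",
    "Morris", "Norm", "Olive", "Patti", "Quasimodo", "Ramone",
]

def select_srs(rng, size=4):
    """Draw labels 1-18 until `size` different students have been chosen."""
    chosen = []
    while len(chosen) < size:
        label = rng.randint(1, len(STUDENTS))  # plays the role of randInt(1,18)
        name = STUDENTS[label - 1]
        if name not in chosen:                 # a repeated label is skipped
            chosen.append(name)
    return chosen

print(select_srs(random.Random(7)))
```

Skipping repeated labels is what keeps the same student from being selected twice.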

Part II – Free Response (Questions 11-14) – Show your work and explain your results clearly.

11. P.P. Pumpkineater, the renowned agricultural geneticist, has mutated previous varieties
of pumpkins and produced two new strains, Scary Face and Candle Breath. Because he
has limited marketing funds, he must decide which strain is the most “jack-o-lanternable”.
Having been in the jack-o-lantern business for a long period of time, he has developed
the PPPJOL Test to compare different strains. He is quite concerned about the effects of
sunlight and water on the growth of the pumpkins. He has 60 seeds of each variety
available for testing.

Design an experiment that will help P.P. determine which strain to market.

13. Is the right foot more powerful than the left? A researcher decides to measure foot
power by having subjects kick a large Styrofoam block and measure the depth of the
impression. Twenty subjects are available for the experiment.

(a) Design a completely randomized experiment to test the hypothesis.

(b) Design a matched pairs experiment to test the hypothesis.

(c) Comment on which experiment may be more appropriate and concerns you may have
about the experimental design.

14. You have been asked to investigate the attitudes of students in the Upper School about
the school’s uniform policy. You only have enough time and resources to contact 120
students. Describe your sample design clearly. Comment on any practical difficulties that
you anticipate.

Name _______________________ Period __________________

Sampling and Experimental Design


Review Questions

1. Which of the following are true statements?

I. If bias is present in a sampling procedure, it can be overcome by dramatically
increasing the sample size.
II. There is no such thing as a “bad sample.”
III. Sampling techniques that use probability techniques effectively eliminate bias.

A. I only
B. II only
C. III only
D. None of the statements are true.
E. None of the above gives the complete set of true responses.

2. Which of the following are true statements?

I. Voluntary response samples often over-represent people with strong opinions.
II. Convenience samples often lead to undercoverage bias.
III. Questionnaires with non-neutral wording are likely to have response bias.

A. I and II
B. I and III
C. II and III
D. I, II, and III
E. None of the above gives the complete set of true responses.

3. Each of the 29 NBA teams has 12 players. A sample of 58 players is to be chosen as
follows. Each team will be asked to place 12 cards with their players’ names into a hat
and randomly draw out two names. The two names from each team will be combined
to make up the sample. Will this method result in a simple random sample of the 348
basketball players?

A. Yes, because each player has the same chance of being selected.
B. Yes, because each team is equally represented.
C. Yes, because this is an example of stratified sampling, which is a special case of
simple random sampling.
D. No, because the teams are not chosen randomly.
E. No, because not each group of 58 players has the same chance of being selected.

4. In designing an experiment, blocking is used
A. To reduce bias.
B. To reduce variation.
C. As a substitute for a control group.
D. As a first step in randomization.
E. To control the level of the experiment.

5. A nutritionist believes that having each player take a vitamin pill before a game
enhances the performance of the football team. During the course of one season, each
player takes a vitamin pill before each game, and the team achieves a winning season
for the first time in several years. Is this an experiment or an observational study?

A. An experiment, but with no reasonable conclusion possible about cause and effect
B. An experiment, thus making cause and effect a reasonable conclusion.
C. An observational study, because there was no use of a control group.
D. An observational study, but a poorly designed one because randomization was not
used.
E. An observational study, thus allowing a reasonable conclusion of association but not
of cause and effect.

6. Which of the following are true about the design of matched-pair experiments?
I. Each subject might receive both treatments.
II. Each pair of subjects receives the identical treatment, and differences in their
responses are noted.
III. Blocking is one form of matched-pair design.

A. I only B. II only C. III only D. I and III E. II and III

7. A consumer product agency tests miles per gallon for a sample of automobiles using
each of four different octane varieties of gasoline. Which of the following is true?

A. There are four explanatory variables and one response variable.
B. There is one explanatory variable with four levels of response.
C. Miles per gallon is the only explanatory variable, but there are four response variables
corresponding to the different octane varieties.
D. There are four levels of a single explanatory variable.
E. Each explanatory level has an associated level of response.

8. In a 1927-32 Western Electric Company study on the effect of lighting on worker
productivity, productivity increased with each increase in lighting but then also increased
with every decrease in lighting. If it is assumed that the workers knew a study was in
progress, this is an example of

A. the effect of a treatment unit
B. the placebo effect
C. the control group effect
D. lack of realism
E. voluntary response bias.

9. Twenty men and 20 women with high blood pressure were subjects in an experiment to
determine the effectiveness of a new drug in lowering blood pressure. Ten of the 20 men
and 10 of the 20 women were chosen at random to receive the new drug. The
remaining men and women received the placebo. The change in blood pressure was
measured for each subject. The design of this experiment is:

A. Randomized block, blocked by gender
B. Randomized block, blocked by drug
C. Randomized block, blocked by drug and gender
D. Completely randomized with one factor, drug
E. Completely randomized with one factor, gender

Free Response

11. An equipment firm is trying out three new types of grease in the transmissions of its front-
end loaders. The maintenance manager is interested in whether any of the greases reduce
the time before the transmissions have to be repaired. The company has 30 identical new
front-end loaders to use in the test. How would you design the experiment and in what way
would you assign the front-end loaders? Be specific. Would you use a completely
randomized design or a block design? How many factors are there? How many
treatments? If it is randomized block, what characteristic identifies the blocks? Explain your
decisions.

Statistics AP Name: ___________________________
Sampling and Experimental Design

Newspaper advice columnist Ann Landers once asked her readers, “If you had it to do over
again, would you have children?” About 10,000 readers responded and approximately 7,000
said no.

____________________1. What is the population?

____________________2. What is the sample?

_____________________3. What kind of sample is it?

True/False:

__________4. Voluntary response samples often underrepresent people with strong
opinions.

__________5. Convenience samples often lead to undercoverage bias.

__________6. Questionnaires with non-neutral wording are likely to have response
bias.

__________7. In an observational study we impose a treatment on the subjects.

__________8. The entire group of individuals we want information about is called the
sample.

9. A study is _______________ if it systematically favors certain outcomes.

10. _________________________ chooses the individuals easiest to reach to make up the
sample.

11. A __________________________ gives each member of the population a known chance
to be selected.

12. ___________________________ is when we divide the population into groups of
individuals that are similar in some way that is important to the response and then choose
an SRS for each group.

Use the following information to answer questions 14-16:

A personnel director at a large company studied the eating habits of employees by
watching the movements of a selected group of employees at lunchtime. The purpose
of the study was to determine the proportion of employees who buy lunch in the
cafeteria, bring their own lunches, or go out to lunch.

_______ 14. The study could best be categorized as:
a. a census
b. a survey sample
c. an observational study
d. a designed experiment
e. none of these

_______ 15. If the director includes only the employees in one department in her study, she
is performing a
a. simple random sample
b. quota sample
c. convenience sample
d. multi-stage sample
e. cluster sample

_______ 16. If the director selects 50 employees at random from throughout the company
and categorizes their lunchtime practices by gender, she is:
a. blocking for gender
b. testing for a lurking variable
c. promoting sexual harassment
d. testing for bias
e. none of these

17. What are the 3 principles of experimental design?

18. What does it mean for an experiment to be double-blind?

19. How do we control for confounding variables in an experiment? Your answer can be
expressed in one word.

20. A medical researcher is interested in testing a new medicine for migraine headaches.
She decides to conduct a clinical trial on 100 randomly selected adults who get migraines at
a rate of one or more per week. Although age and gender are not of primary interest in the
trial, the researcher is concerned that these factors may impact the effectiveness of the
drug. Describe graphically how she would set up the experiment if:

a. she sets up her experiment for the 100 subjects without considerations of age and
gender.

b. she sets up her experiment for the 100 subjects and wants to control for gender.

c. she sets up her experiment for the 100 subjects and wants to control for age. She
decides on age categories of young (21-35), middle (36-55), and elderly (over 55).

d. she sets up her experiment for the 100 subjects and wishes to control for both age
and gender
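
The four set-ups in parts a through d differ only in how the 100 subjects are grouped before treatments are randomly assigned. A rough sketch of that idea follows. Everything about the data is an assumption made for illustration: the subject ids, genders, and ages are invented, and the comparison group is assumed to receive a placebo, which the problem does not state.

```python
import random
from collections import defaultdict

def assign_within_blocks(subjects, block_of, rng,
                         treatments=("new medicine", "placebo")):
    """Randomly assign treatments separately inside each block."""
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # random order within the block
        for position, subject in enumerate(members):
            # alternate down the shuffled list so group sizes stay balanced
            assignment[subject["id"]] = treatments[position % len(treatments)]
    return assignment

def age_group(age):
    """Age categories from part c: young (21-35), middle (36-55), elderly (over 55)."""
    if age <= 35:
        return "young"
    if age <= 55:
        return "middle"
    return "elderly"

# Hypothetical subject list: ids, genders, and ages are made up.
rng = random.Random(2024)
subjects = [{"id": i, "gender": "F" if i % 2 else "M", "age": rng.randint(21, 80)}
            for i in range(100)]

no_blocks = assign_within_blocks(subjects, lambda s: "everyone", rng)        # part a
by_gender = assign_within_blocks(subjects, lambda s: s["gender"], rng)       # part b
by_age = assign_within_blocks(subjects, lambda s: age_group(s["age"]), rng)  # part c
by_both = assign_within_blocks(
    subjects, lambda s: (s["gender"], age_group(s["age"])), rng)             # part d
```

With no blocking the whole group is one block, which is a completely randomized design; blocking on both variables gives six blocks (2 genders by 3 age groups), each randomized on its own.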

AP Statistics Sampling and Experiments Questions

Multiple Choice: Circle the letter corresponding to the best answer.

A chemical engineer is designing the production process for a new product. The chemical
reaction that produces the product may have a higher or lower yield depending on the
temperature and the stirring rate in the vessel in which the reaction takes place. The engineer
decided to investigate the effects of the combinations of the two temperatures (50 °C and
60 °C) and three stirring rates (60 rpm, 90 rpm, and 120 rpm) on the yields of feedstock. Ten
batches of feedstock will be processed at each combination of temperature and stirring rate.

1. What are the experimental units?

A) The two temperatures (50 °C and 60 °C)
B) The three stirring rates (60 rpm, 90 rpm, 120 rpm)
C) The two temperatures and the three stirring rates
D) The batches of feedstock
E) None of the above. The answer is __________________________.

2. Identify all factors (explanatory variables)

A) The two temperatures (50 °C and 60 °C)
B) The three stirring rates (60 rpm, 90 rpm, 120 rpm)
C) The two temperatures and the three stirring rates
D) The batches of feedstock
E) None of the above. The answer is __________________________.

3. What is the response variable?

A) The two temperatures (50 °C and 60 °C)
B) The three stirring rates (60 rpm, 90 rpm, 120 rpm)
C) The two temperatures and the three stirring rates
D) The batches of feedstock
E) None of the above. The answer is __________________________.

4. How many treatments are there?

A) 2
B) 3
C) 5
D) 6
E) None of the above. The answer is __________________________.

5. How many experimental units are needed?

A) 2
B) 3
C) 5
D) 6
E) None of the above. The answer is __________________________.

6. In a survey of public opinion concerning state aid to a particular city, every 40th person
registered as a voter was interviewed, beginning with a person selected at random
from among the first 40 listed. This is an example of

A) Simple random sampling
B) Stratified random sampling
C) Systematic sampling
D) Single-stage cluster sampling
E) None of the above

7. Which of the following is not important in the design of an experiment?

A) Control of confounding variables
B) Randomization in assigning subjects to treatments
C) Use of lurking variables to control the placebo effect
D) Replication of the experiment to control the placebo effect
E) All of the above are important in the design of experiments

8. Which of the following is a method for improving the accuracy of a sample?

A) Use no more than 3 or 4 words in any question
B) When possible, avoid the use of human interviewers, relying on computerized
dialing instead
C) Use larger sample sizes
D) Use smaller sample sizes
E) None of the above. The answer is _____________________________.

9. We say that the design of a study is biased if which of the following is true?

A) A racial or sexual preference is suspected
B) Random placebos have been used
C) Certain outcomes are systematically favored.
D) The mean is larger than the median
E) None of the above. The answer is _____________________________.

10. Suppose that a number of crates of pencils are chosen at random from a boxcar of crates,
then a number of boxes of pencils is chosen at random from each selected crate. Our
goal is to estimate the number of defective pencils in a box. This is an example of

A) Simple random sampling
B) Multi-stage sampling
C) Stratified sampling
D) Systematic sampling
E) None of the above

Free Response. Answer in complete sentences. Abbreviations will count as a wrong answer.

Suppose the Houston Chronicle asks a sample of 150 Houstonians their opinions on the quality of
life in Houston.

11. Is this study an experiment? Explain why or why not.

12. Identify the sample and the population in the opinion poll.

Bias is present in each of the following sampling designs. In each case, identify the type of bias
involved and state how you think the responses will be affected compared to those obtained
using better sampling techniques.

14. A political pollster seeks information about the proportion of American adults that oppose
gun control. He asks an SRS of 1000 American adults: “Do you agree or disagree with the
following statement: Americans should preserve their constitutional right to keep and bear
arms.” A total of 910 or 91% said “agree” (that is, 910 out of 1000 oppose gun control).

15. A flour company in Dallas wants to know what percentage of local households bake at
least twice a week. A company representative calls 500 households during the daytime
and finds that 50% of them bake at least twice a week.

You are participating in the design of a medical experiment to investigate whether or not a
calcium supplement in the diet will reduce the blood pressure of middle-aged men. Preliminary
research suggests that the supplement may have different effects on different races.

16. What sort of experimental design would you choose, and why?

17. Assume that the experimental sample consists of 350 men.
Outline in a diagram the design of the experiment.

Introduction to Random Variables
A Random Variable is a variable whose numerical value depends on the outcome of a chance experiment.
Random indicates that the value of the variable is unknown from trial to trial, but the possible values are
known.
Types of Random Variables:
Discrete

Continuous

Discrete Probability Distribution

Gives the probabilities associated with each possible value of the variable.
Usually displayed in a table, may be displayed in a histogram or formula.

X is used when describing the variable.
x is used when describing particular values of the variable.

As with any probability distribution, each probability is between 0 and 1, and the sum
of probabilities for the entire distribution is 1.

Example: X = the number of heads that appear when four coins are tossed.
Create a probability distribution.

What is the probability of getting more than 2 heads?

What is the probability of getting at most 3 heads?

Construct a probability histogram.

0 1 2 3 4
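
The four-coin distribution can be checked by listing all 16 equally likely outcomes and counting heads. The short sketch below does that with exact fractions; it assumes fair coins, and the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes of tossing four fair coins
outcomes = list(product("HT", repeat=4))
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution of X = number of heads
dist = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

p_more_than_2 = sum(p for x, p in dist.items() if x > 2)  # P(X > 2)
p_at_most_3 = sum(p for x, p in dist.items() if x <= 3)   # P(X <= 3)

for x, p in dist.items():
    print(x, p)
print(p_more_than_2, p_at_most_3)
```

The printed probabilities are the bar heights for the probability histogram over the values 0 through 4.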

Consider the random variable X as the sum of two dice when rolled.
Construct a probability distribution.
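
For the two-dice sum, the distribution comes from counting how many of the 36 equally likely (first die, second die) pairs give each total. A minimal sketch, assuming two fair six-sided dice, with illustrative variable names:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) pairs
rolls = list(product(range(1, 7), repeat=2))
counts = Counter(a + b for a, b in rolls)

# Probability distribution of X = sum of the two dice
dice_dist = {total: Fraction(n, len(rolls)) for total, n in sorted(counts.items())}

for total, prob in dice_dist.items():
    print(total, prob)
```

Fractions are printed in lowest terms, so 6/36 appears as 1/6.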

Let x be the number of courses for which a randomly selected student at a
certain university is registered.
x 1 2 3 4 5 6 7
p(x) .02 .03 .09 .40 .16 .05

p ( x = 4)
p (x < 4)
p (x ≤ 4)

Mean (Expected Value) of a Discrete Random Variable

Variance of a Discrete Random Variable
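
The formulas for these two headings do not appear in this copy. The standard definitions are given below as a reference, with the subscripted symbols chosen here for clarity; the worked check uses the four-coin example (X = number of heads in four fair tosses) rather than the course-registration table.

```latex
\mu_X = E(X) = \sum_x x \, p(x)
\qquad
\sigma_X^2 = \sum_x (x - \mu_X)^2 \, p(x)
\qquad
\sigma_X = \sqrt{\sigma_X^2}

% Worked check: p(0) = 1/16, p(1) = 4/16, p(2) = 6/16, p(3) = 4/16, p(4) = 1/16
\mu_X = 0\cdot\tfrac{1}{16} + 1\cdot\tfrac{4}{16} + 2\cdot\tfrac{6}{16}
      + 3\cdot\tfrac{4}{16} + 4\cdot\tfrac{1}{16} = 2

\sigma_X^2 = (0-2)^2\cdot\tfrac{1}{16} + (1-2)^2\cdot\tfrac{4}{16} + (2-2)^2\cdot\tfrac{6}{16}
           + (3-2)^2\cdot\tfrac{4}{16} + (4-2)^2\cdot\tfrac{1}{16} = 1
```

So for four fair coins the expected number of heads is 2 and the standard deviation is 1.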

Let x be the number of courses for which a randomly selected student at a certain
university is registered.
x 1 2 3 4 5 6 7
p(x) .02 .03 .09 .40 .16 .05

Find the expected number of courses for a student at this university.

What is the standard deviation of this distribution?

(a) What percent of the sons of lower-class fathers reach the highest class, Class 5?

(b) Check that this distribution satisfies the requirements for a discrete probability
distribution.

(e) Write the event "a son of a lower-class father reaches one of the two highest
classes" in terms of X. What is the probability of this event?

7.12 Car Ownership

a) Verify that this is a legitimate discrete probability distribution. Display
the distribution in a probability histogram.

b) Write in words what the event {X > 1} represents. Find P(X > 1).

c) A housing company builds houses with two-car garages. What percent of
households have more cars than the garage can hold?

d) What is the expected number of cars for this community?

Distributions

Discrete RV     Continuous RV
Binomial        Uniform distribution
Geometric       Normal distribution

Four Rules of a Binomial Setting

1. Each observation must fall into one of two categories (we call these
'success' and 'failure').

2. There is a fixed number 'n' of observations or trials.

3. The 'n' observations must all be independent.

4. The probability of success (p) is the same for every trial or observation.

The possible values of x are whole numbers only because they are a
count of the number of successes, so a binomial variable must also be
a discrete random variable.

A binomial distribution refers to
x = number of successes with a given n and a given p.

The notation we use is x ~ B(n,p)

Remember:
A normal distribution is defined by mean, standard deviation.
A binomial distribution is defined by:
number of trials (n), probability of success on any given trial (p).

x ~ B(n,p)

Which of these fit the description of a binomial random variable? If it's
binomial, identify what "success" is as well as n and p.

We roll a die 5 times and see how many times we land on 2.

We roll a die and count the number of rolls until we land on 5.

We roll a die 5 times and see how many times we land on an even number.

We roll a die 5 times and record the numbers we get (1-6) .

Example: Out of the 18 students in block 1, the probability that any one
student is tardy is 0.18. (we will assume independence and not a conspiracy)

x = number of students tardy to any one class

"success" =
n=
p=

What is the probability that exactly 4 students are tardy?

Example: Out of the 18 students in block 1, the probability that any one student is
tardy is 0.18. (we will assume independence and not a conspiracy)

What is the probability that exactly 7 students are tardy?

What is the probability that fewer than 7 students are tardy?

Example: Out of the 18 students in block 1, the probability that any one student is
tardy is 0.18. (we will assume independence and not a conspiracy)

What is the probability that between 4 and 9 students are tardy?
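
The tardy questions can be checked with a few lines of Python using the standard binomial probability formula with n = 18 and p = 0.18. This is a sketch under stated assumptions: the helper names are hypothetical, and "between 4 and 9" is read as inclusive (4, 5, ..., 9), which the problem does not spell out.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

n, p = 18, 0.18

p_exactly_4 = binom_pmf(4, n, p)
p_exactly_7 = binom_pmf(7, n, p)
p_fewer_than_7 = binom_cdf(6, n, p)                 # X <= 6
p_4_to_9 = binom_cdf(9, n, p) - binom_cdf(3, n, p)  # 4 <= X <= 9 (inclusive reading)

print(p_exactly_4, p_exactly_7, p_fewer_than_7, p_4_to_9)
```

"Fewer than 7" stops at 6, so the cumulative probability is evaluated at 6 rather than 7.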

Formula for Binomial Probability
n = number of trials
k = number of successes
p = probability of success on any trial

The notation $\binom{n}{k}$ (read “n choose k”) means “the number of
ways you can choose k things from a
total of n things in a set”
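
The formula itself does not appear in this copy. Written with the symbols defined above, the standard binomial probability formula is:

```latex
P(X = k) = \binom{n}{k} \, p^{k} (1 - p)^{n - k},
\qquad
\binom{n}{k} = \frac{n!}{k! \, (n - k)!}
```

The factor $p^{k}(1-p)^{n-k}$ is the probability of any one particular sequence with k successes and n - k failures, and the binomial coefficient counts how many such sequences there are.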

A free-throw shooter has a 70% average for making free throws. Out of 20 attempts, find the
following probabilities, where x = number of successful free throws.

P (x = 8)

P (x = 15)

P (x is at least 12)

P (x is less than 10)

Mean and Standard Deviation of a Binomial Variable

Mean (expected value): $\mu = np$

Variance: $\sigma^2 = np(1 - p)$ OR $npq$, where $q = 1 - p$

Standard Deviation: $\sigma = \sqrt{np(1 - p)}$

Conditions of Geometric Distribution

Two mutually exclusive outcomes, "success" or "failure"
Each trial is independent of other trials

The probability of success remains constant for each trial.

The random variable X is defined as the number of trials required
until the first success occurs.

Binomial variables have distributions that start with 0 while the lowest
value for a geometric distribution is 1

Binomial distributions are finite, while geometric distributions are infinite.

Geometric Distribution G (p)

Probability Formula

$P(X = x) = (1 - p)^{x - 1} p$, for x = 1, 2, 3, ...

Expected number of trials until first success: $E(X) = 1/p$

According to a recent Census Bureau report, 12.7% of Americans live below the poverty level.
Suppose you plan to survey randomly selected Americans until you find an American living
below the poverty level.

What is the probability the first such American you encounter is the 5th one you survey?

What is the probability the first such American you encounter is the 7th one or later that you
survey?

According to a recent Census Bureau report, 12.7% of Americans live below the poverty level.
Suppose you plan to sample at random 100 Americans and count the number of people who
live below the poverty level. What is the probability that you count 10 or fewer?
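
These three questions mix the two settings: the first two are geometric (survey until the first success), and the third is binomial (a fixed sample of 100). A hedged sketch of the calculations with p = 0.127 follows; the helper names are hypothetical.

```python
from math import comb

p = 0.127  # proportion below the poverty level, from the problem

def geom_pmf(x, p):
    """P(first success occurs on trial x): x - 1 failures, then a success."""
    return (1 - p) ** (x - 1) * p

def geom_at_least(x, p):
    """P(first success occurs on trial x or later): the first x - 1 trials all fail."""
    return (1 - p) ** (x - 1)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

p_first_is_5th = geom_pmf(5, p)
p_first_is_7th_or_later = geom_at_least(7, p)
p_10_or_fewer_of_100 = binom_cdf(10, 100, p)

print(p_first_is_5th, p_first_is_7th_or_later, p_10_or_fewer_of_100)
```

"The 7th one or later" only requires that the first six people surveyed are all above the poverty level, which is why no sum is needed.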

An Olympic archer is able to hit a bulls-eye 80% of the time. Assume each
shot is independent of the others. The variable of interest is the number of the shot on which
she makes her first bulls-eye.

a) On which attempt is the first success expected? What is the standard deviation?

b) What is the probability that her first success is on the 4th arrow shot?

c) What is the probability that her first success is earlier than the 3rd arrow shot?

d) What is the probability that her first success is on the 4th arrow shot or later?
