
HL AA Statistics Notes RMS

The document provides notes on statistics topics including descriptive statistics, probability, and the binomial distribution. Key concepts covered are measures of center (mean, median, mode) and spread (range, interquartile range, variance, standard deviation) for both discrete and continuous data. Probability topics such as conditional probability, Bayes' theorem, and independent events are also discussed.

Uploaded by

Tanish Bengani
Copyright
© © All Rights Reserved

HL AA Statistics Notes

1. Summary (from Textbook)

NB: In assessment, both stratified and quota sampling will be taken as being proportional to the
categories. The difference is that stratified sampling uses a probability sampling method (such as
simple random or systematic sampling). Quota sampling does not; it is likely to use convenience sampling.

Note that for a small data set, with data listed in ascending order:
Median is the $\left(\frac{n+1}{2}\right)$th value. If $n$ is even, it is the average of the values either side.
Lower quartile is the $\left(\frac{n+1}{4}\right)$th value. Again, if it is not a whole number it is the average.
Upper quartile is the $\left(\frac{3(n+1)}{4}\right)$th value. Again, if it is not a whole number it is the average.
When $n$ is a large number, we tend to ignore the '+1'.

HL AA Statistics Notes RMS 2020 1


Variance is the standard deviation squared: $Var = \sigma^2$.
We can calculate it by squaring the above, or, if we have frequency:
$\sigma^2 = \dfrac{\sum f_i (x_i - \mu)^2}{n} = \dfrac{\sum f_i x_i^2}{n} - \mu^2$
Note that it is often the last version that is easiest to use.
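To see that the two forms of the variance formula agree, here is a minimal Python sketch. The frequency table (`values`, `freqs`) is made up purely for illustration and does not come from the notes.

```python
import math

# Hypothetical frequency table: values x_i with frequencies f_i
values = [1, 2, 3, 4]
freqs = [2, 5, 2, 1]

n = sum(freqs)                                        # n = sum of f_i
mean = sum(f * x for x, f in zip(values, freqs)) / n  # mu = sum(f_i x_i) / n

# First form: average squared distance from the mean
var_definition = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n

# Second form: mean of the squares minus the square of the mean
var_shortcut = sum(f * x ** 2 for x, f in zip(values, freqs)) / n - mean ** 2

print(mean, var_definition, var_shortcut, math.sqrt(var_shortcut))
```

Both forms give the same variance (up to floating-point rounding); the second needs only the totals of $f_i x_i$ and $f_i x_i^2$, which is why it is usually quicker by hand.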

Note well the notes above regarding changes to the original data:
If the mean of a set of data points is $\bar{x}$ (or $\mu$), and the standard deviation is $\sigma$:

• All data points increased by $b$
New mean $= \bar{x} + b$. Standard deviation and variance are unchanged. (The data has just been shifted, so the spread has not changed.)

• All data points multiplied by $a$
New mean $= a\bar{x}$. New standard deviation $= |a|\sigma$ (that is, $a\sigma$ when $a$ is positive). New variance $= a^2\sigma^2$.

This is summarised in the formula booklet p.10:

Topic 4: Statistics and probability – HL only

AHL 4.13 Bayes' theorem:
$P(B \mid A) = \dfrac{P(B)\,P(A \mid B)}{P(B)\,P(A \mid B) + P(B')\,P(A \mid B')}$
$P(B_i \mid A) = \dfrac{P(B_i)\,P(A \mid B_i)}{P(B_1)\,P(A \mid B_1) + P(B_2)\,P(A \mid B_2) + P(B_3)\,P(A \mid B_3)}$

AHL 4.14 Variance $\sigma^2$:
$\sigma^2 = \dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n} = \dfrac{\sum_{i=1}^{k} f_i x_i^2}{n} - \mu^2$

Standard deviation $\sigma$:
$\sigma = \sqrt{\dfrac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}}$

Linear transformation of a single random variable:
$E(aX + b) = aE(X) + b$
$Var(aX + b) = a^2\,Var(X)$

Expected value of a continuous random variable $X$:
$E(X) = \mu = \int_{-\infty}^{\infty} x f(x)\,dx$

Variance:
$Var(X) = E(X - \mu)^2 = E(X^2) - [E(X)]^2$

Variance of a discrete random variable $X$:
$Var(X) = \sum (x - \mu)^2 P(X = x) = \sum x^2 P(X = x) - \mu^2$

Variance of a continuous random variable $X$:
$Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$
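The effect of adding a constant or multiplying by a constant can be checked numerically. The sketch below uses Python's standard `statistics` module; the data set and the values of `a` and `b` are hypothetical and chosen only for illustration.

```python
import statistics

data = [3, 5, 8, 10, 14]   # hypothetical data set
a, b = 2, 7                # hypothetical multiplier and shift

shifted = [x + b for x in data]   # every data point increased by b
scaled = [a * x for x in data]    # every data point multiplied by a


def summary(values):
    """Mean, population standard deviation (sigma) and population variance."""
    return (statistics.fmean(values),
            statistics.pstdev(values),
            statistics.pvariance(values))


mean, sd, var = summary(data)
shift_mean, shift_sd, shift_var = summary(shifted)
scale_mean, scale_sd, scale_var = summary(scaled)

print("original:", mean, sd, var)
print("shifted: ", shift_mean, shift_sd, shift_var)   # mean + b, sd and variance unchanged
print("scaled:  ", scale_mean, scale_sd, scale_var)   # a * mean, |a| * sd, a^2 * variance
```

Note that `pstdev` and `pvariance` divide by $n$, matching $\sigma$ in the notes, whereas `stdev` and `variance` divide by $n - 1$.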



PMCC: Pearson's Product-Moment Correlation Coefficient, $r$, where $-1 \le r \le 1$.

In the Formula Booklet, p.9:

Topic 4: Statistics and probability – SL and HL

SL 4.2 Interquartile range: $IQR = Q_3 - Q_1$

SL 4.3 Mean, $\bar{x}$, of a set of data: $\bar{x} = \dfrac{\sum_{i=1}^{k} f_i x_i}{n}$, where $n = \sum_{i=1}^{k} f_i$

SL 4.5 Probability of an event $A$: $P(A) = \dfrac{n(A)}{n(U)}$
Complementary events: $P(A) + P(A') = 1$

SL 4.6 Combined events: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
Conditional probability: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
Independent events: $P(A \cap B) = P(A)\,P(B)$

SL 4.7 Expected value of a discrete random variable $X$: $E(X) = \sum x\,P(X = x)$

SL 4.8 Binomial distribution $X \sim B(n, p)$
Mean: $E(X) = np$
Variance: $Var(X) = np(1 - p)$

SL 4.12 Standardized normal variable: $z = \dfrac{x - \mu}{\sigma}$



2. One Variable Statistics

Calculator
stat EDIT 1: edit (L1: enter 𝑥-values, or midpoints for grouped data; L2: enter frequencies,
if you have them)
stat CALC 1: 1-Var Stats (List: L1; FreqList: blank if no frequencies, L2 if frequencies)

Note: Use 𝜎𝑥 for standard deviation. Square this value to get variance ($\sigma^2$). (𝑆𝑥 is the
unbiased estimator, $s_{n-1}$.)

Parameters for different types of data

| Parameter | Discrete data, listed | Discrete data, frequency table | Continuous (grouped) data |
|---|---|---|---|
| Mean, $\bar{x}$ | $\dfrac{\sum x}{n}$, where $n$ is the number of data points | $\dfrac{\sum f x}{\sum f}$ | $\dfrac{\sum f x}{\sum f}$, where $x$ is the middle value of the group (class) |
| Mode | Most common value | Value with the highest frequency | The "modal class": the group with the highest frequency |
| Median | With data in ascending order, the $\left(\frac{n+1}{2}\right)$th value | The $\left(\frac{\sum f + 1}{2}\right)$th value; use the frequency table to find it. (If $\sum f$ is large, use the $\left(\frac{\sum f}{2}\right)$th value.) | The $\left(\frac{\sum f}{2}\right)$th value, which will lie within a group. Typically read from a cumulative frequency graph |
| Variance, $\sigma^2$ | $\dfrac{\sum (x - \bar{x})^2}{n}$ | $\dfrac{\sum f (x - \bar{x})^2}{\sum f}$ | $\dfrac{\sum f (x - \bar{x})^2}{\sum f}$, where $x$ is the middle value of the group (class) |
| Standard deviation, $\sigma$ ($\sqrt{\text{Variance}}$) | $\sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} = \sqrt{\dfrac{\sum x^2}{n} - \mu^2}$ | $\sqrt{\dfrac{\sum f (x - \bar{x})^2}{\sum f}} = \sqrt{\dfrac{\sum f_i x_i^2}{n} - \mu^2}$ | $\sqrt{\dfrac{\sum f (x - \bar{x})^2}{\sum f}}$, where $x$ is the middle value of the group (class) |
| Range | Highest value − lowest value | Highest value − lowest value | Highest value in top group − lowest value in bottom group |
| IQR | Upper quartile − lower quartile | Upper quartile − lower quartile | Upper quartile − lower quartile. Typically read from a cumulative frequency graph |
| 5 point summary | Min, LQ, Median, UQ, Max | Min, LQ, Median, UQ, Max | Min, LQ, Median, UQ, Max |

• The Lower Quartile is the $\left(\frac{n+1}{4}\right)$th term. The Upper Quartile is the $\left(\frac{3(n+1)}{4}\right)$th term.
Again, if it is a decimal, it is the average of the terms on either side.
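The position rule above can be sketched in Python as follows. The helper names (`value_at_position`, `quartiles`) and the sample data are hypothetical, and the sketch assumes at least three data points. A GDC may use a slightly different quartile convention for some values of $n$, so treat this as an illustration of the rule in these notes rather than a replacement for the calculator.

```python
def value_at_position(sorted_data, position):
    """Return the value at a 1-based position; average the neighbours if it is not whole."""
    lower = int(position)                 # whole part of the position
    if position == lower:
        return sorted_data[lower - 1]
    # Fractional position: average the terms on either side
    return (sorted_data[lower - 1] + sorted_data[lower]) / 2


def quartiles(data):
    """Lower quartile, median and upper quartile via the (n+1)/4, (n+1)/2, 3(n+1)/4 rule."""
    ordered = sorted(data)
    n = len(ordered)
    lq = value_at_position(ordered, (n + 1) / 4)
    median = value_at_position(ordered, (n + 1) / 2)
    uq = value_at_position(ordered, 3 * (n + 1) / 4)
    return lq, median, uq


# Hypothetical data with n = 7, so the positions are the 2nd, 4th and 6th values
print(quartiles([2, 4, 5, 7, 8, 10, 13]))
# Hypothetical data with n = 8, so every position is fractional
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
```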



Example

1. Tabulated discrete data / tabulated continuous data

| $x$ | 3 | 4 | 5 | 7 |
|---|---|---|---|---|
| (OR class) | $2.5 \le x < 3.5$ | $3.5 \le x < 4.5$ | $4.5 \le x < 5.5$ | $5.5 \le x < 8.5$ |
| $f$ | 15 | $a$ | 21 | 18 |

a) Given that the mean of the data is [OR estimated to be] 5, find the value of $a$.

$\sum f x = n\bar{x}$
$3 \times 15 + 4a + 5 \times 21 + 7 \times 18 = (15 + a + 21 + 18) \times 5$
$276 + 4a = 270 + 5a$
$a = 6$

b) Another $b$ 7s are added to the data. The new standard deviation is [OR estimated to
be] 1.5664 (to 4 decimal places). Determine the value of $b$.

$\sigma^2 = \dfrac{\sum f_i x_i^2}{n} - \mu^2$

$1.5664^2 = \dfrac{15 \times 3^2 + 6 \times 4^2 + 21 \times 5^2 + (18 + b) \times 7^2}{60 + b} - \left(\dfrac{5 \times 60 + 7b}{60 + b}\right)^2$

$0 = \dfrac{15 \times 3^2 + 6 \times 4^2 + 21 \times 5^2 + (18 + b) \times 7^2}{60 + b} - \left(\dfrac{5 \times 60 + 7b}{60 + b}\right)^2 - 1.5664^2$

From GDC, $b = 9$
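As a check on this example, the following Python sketch recomputes $a$ from the mean condition and then searches whole-number values of $b$ for the one whose standard deviation matches 1.5664. The function name `std_dev_with_extra_sevens` and the search range are assumptions made for illustration; in an exam the equation would be solved on the GDC as the notes describe.

```python
import math

# Mean condition from part a): 276 + 4a = 270 + 5a, so a = 276 - 270
a = 276 - 270


def std_dev_with_extra_sevens(b):
    """Standard deviation after adding b extra 7s, using sigma^2 = sum(f x^2)/n - mu^2."""
    table = {3: 15, 4: a, 5: 21, 7: 18 + b}     # value -> frequency
    n = sum(table.values())
    mean = sum(x * f for x, f in table.items()) / n
    variance = sum(f * x ** 2 for x, f in table.items()) / n - mean ** 2
    return math.sqrt(variance)


# Try b = 0, 1, 2, ... and keep the first value that rounds to the target
target = 1.5664
b = next(k for k in range(100)
         if abs(std_dev_with_extra_sevens(k) - target) < 5e-5)

print(a, b, std_dev_with_extra_sevens(b))
```

The search agrees with the values $a = 6$ and $b = 9$ found above.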

Outliers

If a value is < 𝐿𝑄 − 1.5 𝐼𝑄𝑅 or > 𝑈𝑄 + 1.5 𝐼𝑄𝑅 then the value is an outlier.

Example
If a data set has 𝑄1 = 𝐿𝑄 = 16, 𝑄3 = 𝑈𝑄 = 21

𝐼𝑄𝑅 = 21 − 16 = 5

LQ−1.5 𝐼𝑄𝑅 = 16 − 1.5 × 5 = 8.5

UQ+1.5 𝐼𝑄𝑅 = 21 + 1.5 × 5 = 28.5

Therefore, outliers are all values that are <8.5 or >28.5
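The outlier rule can be written as a short Python sketch. The function names `outlier_fences` and `is_outlier` are hypothetical; the quartiles are the ones from the example above.

```python
def outlier_fences(lq, uq):
    """Return the lower and upper boundaries LQ - 1.5*IQR and UQ + 1.5*IQR."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr


def is_outlier(value, lq, uq):
    """A value is an outlier if it lies strictly outside the two boundaries."""
    lower, upper = outlier_fences(lq, uq)
    return value < lower or value > upper


print(outlier_fences(16, 21))     # boundaries for LQ = 16, UQ = 21
print(is_outlier(8, 16, 21))      # below the lower boundary
print(is_outlier(20, 16, 21))     # inside the boundaries
```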



Box and Whisker plots

• A box and whisker plot (sometimes called a box plot) is a visual display of some of the
descriptive statistics of a data set (called the five-number summary of the data set).
It shows:
1. The minimum value.
2. The lower quartile.
3. The median.
4. The upper quartile.
5. The maximum value.

Note: Any outliers are shown as separate crosses.

• In the box and whisker plot,


25% of the data lies between the minimum and the LQ
25% of the data lies between the LQ and the median
25% of the data lies between the median and the UQ
25% of the data lies between the UQ and the maximum

Skew

[Figure: examples of negatively skewed and positively skewed data]



Example:

The following is the five point summary of the heights of 40 school children.
Minimum value = 132

Lower quartile = 142

Median = 146.5

Upper quartile = 151

Maximum value = 163


a) Construct a box and whisker plot.

[Figure: box and whisker plot, "Heights of 40 Students"]

b) Determine the minimum number of students with a height less than or equal to 151cm.
151cm is the UQ. Therefore, as a minimum, 75% or ¾ of students have heights
less than or equal to 151cm.
$\frac{3}{4} \times 40 = 30$ students




Cumulative Frequency Graph

A cumulative frequency graph is useful for not only finding the median and upper and lower
quartiles, but also for finding the number of scores that lie above or below a particular value.
• A percentile is the score below which a certain percentage of the data lies;
e.g. The 85th percentile is the score below which 85% of the data lies.
e.g. If your score in a test is the 95th percentile, then 95% of the class have scored
less than you.
• The lower quartile (Q1 on GDC) is the 25th percentile.
• The median is the 50th percentile.
• The upper quartile (Q3 on GDC) is the 75th percentile.

Example:

The data shows the results of the women's marathon at the 2008 Olympics for all competitors that
finished the race.
Finishing Time (t hours and minutes) Frequency
2h26m ≤ 𝑡 < 2h28m 8
2h28m ≤ 𝑡 < 2h30m 3
2h30m ≤ 𝑡 < 2h32m 9
2h32m ≤ 𝑡 < 2h34m 11
2h34m ≤ 𝑡 < 2h36m 12
2h36m ≤ 𝑡 < 2h38m 7
2h38m ≤ 𝑡 < 2h40m 5
2h40m ≤ 𝑡 < 2h42m 8
2h42m ≤ 𝑡 < 2h44m 6

a) Construct a cumulative frequency distribution table.

Finishing Time (t hours and minutes) Cumulative Frequency


𝑡 < 2h28m 8
𝑡 < 2h30m 11
𝑡 < 2h32m 20
𝑡 < 2h34m 31
𝑡 < 2h36m 43
𝑡 < 2h38m 50
𝑡 < 2h40m 55
𝑡 < 2h42m 63
𝑡 < 2h44m 69
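The cumulative frequency column is just a running total of the frequency column. A minimal Python sketch using the marathon data above (the variable names are chosen for illustration):

```python
from itertools import accumulate

# Upper class boundaries and frequencies from the marathon table
upper_bounds = ["2h28m", "2h30m", "2h32m", "2h34m", "2h36m",
                "2h38m", "2h40m", "2h42m", "2h44m"]
frequencies = [8, 3, 9, 11, 12, 7, 5, 8, 6]

# Running total: each entry is the number of competitors below that boundary
cumulative = list(accumulate(frequencies))

for bound, total in zip(upper_bounds, cumulative):
    print(f"t < {bound}: {total}")
```

The final running total should equal the number of competitors, here 69.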



b) Represent the data on a cumulative frequency graph.

c) Use your graph to estimate:

(i) The median finishing time.

Median is the 50th percentile. Therefore, 50% of 69 = 34.5. Reading from the
graph, the median time is 2 hours and 34.5 minutes.

(ii) The number of competitors who finished in a time less than 2 hours 35minutes.

Reading from the graph, there are approx. 37 competitors who took less than
2 hours and 35 minutes to complete the race.

(iii) The percentage of competitors who took more than 2 hours 39 min to finish.

Reading from the graph, there are 69 − 52 = 17 competitors who took more
than 2 hours 39 minutes. Therefore, $\frac{17}{69} = 24.6\%$ (3 s.f.) took more than 2
hours 39 minutes.

(iv) The time taken by a competitor who finished in the top 20% of those who
completed the marathon.

20% of 69 = 13.8 ≈ 14. Reading from the graph, 20% of competitors took
less than 2 hours 30 minutes 30 seconds.
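The graph readings in part c) can be approximated numerically. The sketch below assumes the cumulative frequency points are joined by straight line segments, which is an assumption: a hand-drawn smooth curve will give slightly different readings. Times are measured in minutes past 2 hours, and the function names are hypothetical.

```python
# Class boundaries in minutes past 2 hours, and the cumulative frequency at each boundary
boundaries = [26, 28, 30, 32, 34, 36, 38, 40, 42, 44]
cumulative = [0, 8, 11, 20, 31, 43, 50, 55, 63, 69]


def count_below(t):
    """Estimated number of competitors finishing in under t minutes past 2 hours."""
    if t <= boundaries[0]:
        return 0.0
    for i in range(1, len(boundaries)):
        if t <= boundaries[i]:
            fraction = (t - boundaries[i - 1]) / (boundaries[i] - boundaries[i - 1])
            return cumulative[i - 1] + fraction * (cumulative[i] - cumulative[i - 1])
    return float(cumulative[-1])


def time_for_count(c):
    """Estimated time (minutes past 2 hours) by which c competitors have finished."""
    for i in range(1, len(cumulative)):
        if c <= cumulative[i]:
            fraction = (c - cumulative[i - 1]) / (cumulative[i] - cumulative[i - 1])
            return boundaries[i - 1] + fraction * (boundaries[i] - boundaries[i - 1])
    return float(boundaries[-1])


total = cumulative[-1]
print(time_for_count(0.5 * total))                 # (i) median, minutes past 2 hours
print(count_below(35))                             # (ii) number under 2h35m
print(100 * (total - count_below(39)) / total)     # (iii) percentage over 2h39m
print(time_for_count(0.2 * total))                 # (iv) time for the top 20%
```

Under this straight-line assumption the estimates are about 2h 34.6m, 37 competitors, 23.9% and 2h 30.6m, all close to the values read from the graph.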



3. Two Variable Statistics (Bivariate data)

Statisticians are often interested in how two variables are related. The independent variable is
placed on the horizontal axis and the dependent variable is placed on the vertical axis

For example, if we recorded the age of a number of athletes and how far they can throw a discus,
the independent variable would be age and the dependent variable would be the distance
thrown.

If one or more data points do not follow the trend of the others, they may be
considered outliers.
o An outlier is defined as more than 1.5 × IQR from the nearest quartile.
o If an outlier is the result of a recording error, it should be discarded.
o If the outlier proves to be a genuine piece of data, it should be kept.

For example:

It is important to note that correlation between two variables does not necessarily mean that one
variable causes the other.

If a change in one variable does cause a change in the other variable, then a causal relationship
exists between them.

Correlation refers to the relationship or association between two variables.

Correlation may be linear, quadratic, cubic or another function. Or there can be no visible
correlation. In this course, we will only consider Linear Correlation.



4. Linear Correlation
If a scatter plot of our data looks like the one on the left, we would have reason to assume there is
some linear correlation between our variables.

In that case, we would consider the following:

1. Direction

Upward trend: positive linear correlation
Downward trend: negative linear correlation
No upward or downward trend: no correlation

2. Strength



Pearson's Correlation Coefficient

To get a more precise measure of the strength of linear correlation between two variables,
we use Pearson's product-moment correlation coefficient, $r$. The formula is:

$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ data respectively.

Note that you will not be expected to calculate this by hand.

1. The sign of r indicates the direction of the correlation.

• A positive value for r indicates the variables are positively correlated;
i.e. as one of the variables increases, the other tends to increase.

• A negative value for r indicates the variables are negatively correlated;
i.e. as one of the variables increases, the other tends to decrease.

2. The size of r indicates the strength of the correlation.

• A value of r close to +1 or −1 indicates a strong correlation between the variables.

• A value of r close to 0 indicates a weak correlation between the variables.
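Although $r$ is normally found on the GDC, the formula can be evaluated directly. The Python sketch below does this for a made-up data set (ages and discus distances invented for illustration, not taken from the notes); `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: age (years) and discus distance (metres)
ages = [12, 13, 14, 15, 16, 17]
distances = [18.0, 21.5, 23.0, 27.5, 29.0, 33.0]

r = pearson_r(ages, distances)
print(r)   # close to +1: strong positive linear correlation
```

Data lying exactly on a line with positive gradient gives $r = 1$, and exactly on a line with negative gradient gives $r = -1$.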

Line of Best Fit


Having decided there is likely to be a linear correlation, we can model the data with a line of best
fit.

• Line of Best Fit by Eye

Steps: 1. Using your GDC, find the means of the 𝑥- and 𝑦-values, $\bar{x}$ and $\bar{y}$.

Calculator
stat EDIT 1: edit (L1: enter 𝑥- values; L2: enter 𝑦- values)

stat CALC 2: 2-Var Stats (Xlist: L1; Ylist: L2; FreqList: blank)
This gives the measures for the two variables individually.

2. Mark the mean point ($\bar{x}$, $\bar{y}$) on the scatter diagram.

3. Draw a line through the mean point which fits the trend of the data, so that
about the same number of data points are above the line as below it.

The line formed is called the line of best fit by eye. This line will vary from person to
person.



Note Well:
If you have two variables, 𝐴 and 𝐵, and the question says 𝑩 on 𝑨, then 𝐵 is on the 𝑦-axis and 𝐴 is on
the 𝑥-axis. You should only predict 𝐵 from a given 𝐴, not 𝐴 from 𝐵.
For 𝑨 on 𝑩, 𝐴 is on the 𝑦-axis and you should only predict 𝐴 from a given 𝐵.

Sum of square residuals

If a linear function, $f(x)$, is chosen to represent the data, then for each $x$-value there will be a
difference between the actual data point ($y$) and $f(x)$. If we square these differences (to get rid of
the effect of negatives) and add them up, we have the sum of square residuals, $SS_{res}$:

$SS_{res} = \sum_{i=1}^{n} (y_i - f(x_i))^2$
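As an illustration, the sum of square residuals can be computed for any candidate line. In this Python sketch the data and the two candidate lines are made up; only the formula comes from the notes.

```python
def sum_of_square_residuals(xs, ys, f):
    """Add up the squared vertical differences between each y and the model value f(x)."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data points
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]


def line_one(x):
    """Hypothetical candidate line y = 2x."""
    return 2 * x


def line_two(x):
    """Hypothetical candidate line y = x + 2."""
    return x + 2


print(sum_of_square_residuals(xs, ys, line_one))   # smaller total: better fit
print(sum_of_square_residuals(xs, ys, line_two))   # larger total: worse fit
```

Comparing the two totals shows which candidate line fits these points more closely.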

Least Squares Regression Line

The line that minimises the sum of square residuals is called the least squares regression
line, or the linear regression equation. It can be found using your calculator.

Calculator
stat EDIT 1: edit (L1: enter 𝑥- values; L2: enter 𝑦- values)

stat CALC 2: 2-Var Stats (Xlist: L1; Ylist: L2; FreqList: blank)
This gives the measures for the two variables individually.

stat CALC 4: LinReg(ax+b) (Xlist: L1; Ylist: L2; FreqList: blank)

This gives Pearson's Product Moment Correlation Coefficient, 𝒓, and the equation
of the linear regression line.
(Note that you must have STAT DIAGNOSTICS on in 'mode')
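For reference, the coefficients of the least squares line $y = ax + b$ can also be computed from the standard formulas $a = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ and $b = \bar{y} - a\bar{x}$. These formulas are not stated in the notes, and the data below are hypothetical, so treat this Python sketch as a cross-check on the GDC rather than the method expected in assessment.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    b = y_bar - a * x_bar          # forces the line through the mean point
    return a, b


def ss_res(xs, ys, a, b):
    """Sum of square residuals for the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data points
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

a, b = least_squares_line(xs, ys)
print(a, b)
print(ss_res(xs, ys, a, b))        # minimum possible sum of square residuals
print(ss_res(xs, ys, 2, 0))        # any other line, e.g. y = 2x, gives a larger total
```

The regression line always passes through the mean point $(\bar{x}, \bar{y})$, which is why the line of best fit by eye is also drawn through that point.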

Note Well again:


If you have two variables, 𝐴 and 𝐵, and the question says 𝑩 on 𝑨, then 𝐵 is on the 𝑦-axis and 𝐴 is on
the 𝑥-axis. You should only predict 𝐵 from a given 𝐴, not 𝐴 from 𝐵.
For 𝑨 on 𝑩, 𝐴 is on the 𝑦-axis and you should only predict 𝐴 from a given 𝐵.

Interpolation and Extrapolation

The trends we model are only valid over the range of our data. Determining values inside this
range is called Interpolation.

We cannot use the trends to draw conclusions about values outside the range of our data. This is
called Extrapolation, and the conclusions are not reliable or valid.
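One way to make this restriction concrete is to refuse to predict outside the observed range. In the Python sketch below, `predict` is a hypothetical helper, and the line coefficients and data range are made-up values used only for illustration.

```python
def predict(x, a, b, x_min, x_max):
    """Predict y = a*x + b, but only for x inside the observed data range (interpolation)."""
    if not (x_min <= x <= x_max):
        raise ValueError("x is outside the data range: this would be extrapolation")
    return a * x + b


# Hypothetical regression line y = 1.9x fitted to x-values between 1 and 4
print(predict(2.5, 1.9, 0.0, 1, 4))     # interpolation: allowed

try:
    predict(10, 1.9, 0.0, 1, 4)         # extrapolation: rejected
except ValueError as error:
    print(error)
```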
