HL AA Statistics Notes RMS
NB: In assessment, both Stratified and Quota sampling will be taken as being proportional to the
categories. The difference is that Stratified sampling uses a probability sampling method (such as
simple random or systematic). Quota sampling does not; it is likely to use convenience sampling.
Note that for a small data set of 𝑛 values, with data listed in ascending order:
Median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ value. If 𝑛 is even, it is the average of the values either side.
Lower quartile is the $\left(\frac{n+1}{4}\right)^{\text{th}}$ value. Again, if it is not a whole number it is the average of the values either side.
Upper quartile is the $\left(\frac{3(n+1)}{4}\right)^{\text{th}}$ value. Again, if it is not a whole number it is the average of the values either side.
When 𝑛 is a large number, we tend to ignore the ‘+1’.
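The position rules above can be sketched in Python. This is only an illustration under stated assumptions: the function names and the sample data are made up, the data set is assumed to have at least 3 values, and a GDC may use a slightly different quartile convention.

```python
import math

def value_at_position(sorted_data, pos):
    """Value at 1-based position pos; if pos is not whole, average the values either side."""
    lower = math.floor(pos)
    upper = math.ceil(pos)
    # The notes count positions from 1, Python lists count from 0.
    return (sorted_data[lower - 1] + sorted_data[upper - 1]) / 2

def quartiles(data):
    """Lower quartile, median and upper quartile using the (n+1) position rules."""
    values = sorted(data)
    n = len(values)
    q1 = value_at_position(values, (n + 1) / 4)
    median = value_at_position(values, (n + 1) / 2)
    q3 = value_at_position(values, 3 * (n + 1) / 4)
    return q1, median, q3

sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]  # hypothetical data, n = 10
print(quartiles(sample))
```

With 𝑛 = 10 the positions are 2.75, 5.5 and 8.25, so each result is the average of two neighbouring values.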
AHL 4.13  Bayes’ theorem
$P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(B)\,P(A \mid B) + P(B')\,P(A \mid B')}$
$P(B_i \mid A) = \frac{P(B_i)\,P(A \mid B_i)}{P(B_1)\,P(A \mid B_1) + P(B_2)\,P(A \mid B_2) + P(B_3)\,P(A \mid B_3)}$
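As a quick numerical check of the two-event form of Bayes’ theorem, here is a minimal Python sketch. The function name and the probabilities used are hypothetical and chosen only for illustration.

```python
def bayes_posterior(p_b, p_a_given_b, p_a_given_not_b):
    """P(B | A) from P(B), P(A | B) and P(A | B')."""
    numerator = p_b * p_a_given_b
    # P(B') = 1 - P(B), so the denominator is the total probability of A.
    denominator = numerator + (1 - p_b) * p_a_given_not_b
    return numerator / denominator

# Hypothetical values: P(B) = 0.01, P(A | B) = 0.95, P(A | B') = 0.05
print(round(bayes_posterior(0.01, 0.95, 0.05), 3))
```

Here the result is about 0.161, much smaller than P(A | B) = 0.95, because B itself is rare.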
Note well the notes above regarding changes to the original data:
If the mean of a set of data points is 𝑥̅ (or 𝜇), and the standard deviation is 𝜎, and
all data points are increased by ‘𝑏’:
New mean = 𝑥̅ + 𝑏
Standard deviation and variance are unchanged. (The data has just been slid along, so the spread has not changed.)
AHL 4.14  Variance $\sigma^2$
$\sigma^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n} = \frac{\sum_{i=1}^{k} f_i x_i^2}{n} - \mu^2$
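The effect of adding a constant to every data point can be checked with Python's standard `statistics` module. The data set and the value of 𝑏 below are hypothetical.

```python
from statistics import mean, pstdev

data = [3, 5, 6, 8, 13]  # hypothetical data set
b = 10                   # hypothetical amount added to every data point
shifted = [x + b for x in data]

# The mean moves by b, but the population standard deviation is unchanged.
print(mean(data), mean(shifted))
print(pstdev(data), pstdev(shifted))
```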
Expected value of a continuous random variable X: $E(X) = \mu = \int_{-\infty}^{\infty} x\,f(x)\,dx$
Variance: $\operatorname{Var}(X) = E\left[(X - \mu)^2\right] = E(X^2) - [E(X)]^2$
Variance of a continuous random variable X: $\operatorname{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$
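To see these integrals in action, the sketch below approximates them with a simple midpoint rule. The density 𝑓(𝑥) = 2𝑥 on 0 ≤ 𝑥 ≤ 1 is a hypothetical example, not one from these notes. In an exam you would integrate by hand or on the GDC.

```python
def midpoint_integral(g, a, b, steps=10_000):
    """Approximate the integral of g from a to b with the midpoint rule."""
    width = (b - a) / steps
    return sum(g(a + (i + 0.5) * width) for i in range(steps)) * width

def f(x):
    # Hypothetical probability density: f(x) = 2x on [0, 1], zero elsewhere.
    return 2 * x

mu = midpoint_integral(lambda x: x * f(x), 0, 1)                 # E(X)
second_moment = midpoint_integral(lambda x: x**2 * f(x), 0, 1)   # E(X^2)
variance = second_moment - mu**2                                 # E(X^2) - [E(X)]^2
print(mu, variance)
```

Integrating by hand gives E(X) = 2/3 and Var(X) = 1/2 − 4/9 = 1/18, which the approximation should match closely.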
Topic 4: Statistics and probability

SL 4.2    Interquartile range: $\text{IQR} = Q_3 - Q_1$
SL 4.3    Mean, $\bar{x}$, of a set of data: $\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$, where $n = \sum_{i=1}^{k} f_i$
SL 4.5    Probability of an event A: $P(A) = \frac{n(A)}{n(U)}$
SL 4.6    Combined events: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
          Mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
          Conditional probability: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
          Independent events: $P(A \cap B) = P(A)\,P(B)$
SL 4.7    Expected value of a discrete random variable X: $E(X) = \sum x\,P(X = x)$
SL 4.8    Binomial distribution: $X \sim B(n, p)$
          Mean: $E(X) = np$
          Variance: $\operatorname{Var}(X) = np(1 - p)$
SL 4.12   Standardized normal variable: $z = \frac{x - \mu}{\sigma}$

Topic 4: Statistics and probability (HL only)

AHL 4.13  Bayes’ theorem: $P(B \mid A) = \frac{P(B)\,P(A \mid B)}{P(B)\,P(A \mid B) + P(B')\,P(A \mid B')}$
          $P(B_i \mid A) = \frac{P(B_i)\,P(A \mid B_i)}{P(B_1)\,P(A \mid B_1) + P(B_2)\,P(A \mid B_2) + P(B_3)\,P(A \mid B_3)}$
AHL 4.14  Variance $\sigma^2$: $\sigma^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n} = \frac{\sum_{i=1}^{k} f_i x_i^2}{n} - \mu^2$
          Standard deviation $\sigma$: $\sigma = \sqrt{\frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{n}}$
          Linear transformation of a single random variable: $E(aX + b) = aE(X) + b$, $\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$
          Expected value of a continuous random variable X: $E(X) = \mu = \int_{-\infty}^{\infty} x\,f(x)\,dx$
          Variance: $\operatorname{Var}(X) = E\left[(X - \mu)^2\right] = E(X^2) - [E(X)]^2$
          Variance of a discrete random variable X: $\operatorname{Var}(X) = \sum (x - \mu)^2 P(X = x) = \sum x^2 P(X = x) - \mu^2$
          Variance of a continuous random variable X: $\operatorname{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2$
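The binomial mean and variance formulas can be checked against the general definitions for a discrete random variable. The sketch below does this in Python. The values 𝑛 = 10 and 𝑝 = 0.3 are hypothetical.

```python
from math import comb

def binomial_pmf(n, p):
    """List of P(X = x) for x = 0, 1, ..., n when X ~ B(n, p)."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

n, p = 10, 0.3  # hypothetical parameters
pmf = binomial_pmf(n, p)

# E(X) = sum of x P(X = x); Var(X) = sum of x^2 P(X = x) - mu^2
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum(x**2 * px for x, px in enumerate(pmf)) - mean**2

print(mean, n * p)                  # both should be close to 3
print(variance, n * p * (1 - p))    # both should be close to 2.1
```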
Calculator
stat EDIT 1: Edit (L1: enter 𝑥-values (or midpoints for grouped data); L2: enter frequencies,
if you have them)
stat CALC 1: 1-Var Stats (List: L1; FreqList: blank if no frequencies, L2 if frequencies)
Note: Use 𝜎𝑥 for standard deviation. Square this value to get the variance (𝜎²). (𝑆𝑥 is the
sample standard deviation, $s_{n-1}$, which comes from the unbiased estimate of the population variance.)
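The difference between 𝜎𝑥 and 𝑆𝑥 can be seen with Python's standard `statistics` module, where `pstdev` divides by 𝑛 and `stdev` divides by 𝑛 − 1. The values and frequencies below are hypothetical, and expanding the frequency table into a full list is just one simple way to handle frequencies.

```python
from statistics import pstdev, stdev

values = [3, 4, 5, 7]        # hypothetical x-values (like L1)
frequencies = [2, 1, 3, 2]   # hypothetical frequencies (like L2)

# Repeat each value according to its frequency.
expanded = [x for x, f in zip(values, frequencies) for _ in range(f)]

sigma_x = pstdev(expanded)   # divides by n: the value to use for standard deviation
s_x = stdev(expanded)        # divides by n - 1: always a little larger
variance = sigma_x ** 2

print(sigma_x, s_x, variance)
```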
• The Lower Quartile is the $\left(\frac{n+1}{4}\right)^{\text{th}}$ term. The Upper Quartile is the $\left(\frac{3(n+1)}{4}\right)^{\text{th}}$ term.
Again, if it is a decimal, it is the average of the terms on either side.
𝒙    (OR class interval)      Frequency 𝒇
3    (OR 2.5 ≤ 𝑥 < 3.5)       15
4    (OR 3.5 ≤ 𝑥 < 4.5)       𝑎
5    (OR 4.5 ≤ 𝑥 < 5.5)       21
7    (OR 5.5 ≤ 𝑥 < 8.5)       18
a) Given that the mean of the data is [OR estimated to be] 5, find the value of 𝑎.
∑ 𝑥 = 𝑛𝑥̅
3 × 15 + 4𝑎 + 5 × 21 + 7 × 18 = (15 + 𝑎 + 21 + 18) × 5
276 + 4𝑎 = 270 + 5𝑎
𝑎=6
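b) Another ‘𝑏’ 7’s are added to the data. The new standard deviation is [OR estimated to
be] 1.5664 (to 4 decimal places). Determine the value of 𝑏.
$\sigma^2 = \frac{\sum f_i x_i^2}{n} - \mu^2$
With 𝑎 = 6: 𝑛 = 60 + 𝑏, ∑𝑓𝑥 = 300 + 7𝑏 and ∑𝑓𝑥² = 1638 + 49𝑏, so
$1.5664^2 = \frac{1638 + 49b}{60 + b} - \left(\frac{300 + 7b}{60 + b}\right)^2$
From GDC, 𝑏 = 9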
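Both parts of this example can be checked by a simple search in Python, in the same spirit as solving on the GDC. This is only a sketch: the function name and the search range of 0 to 100 are assumptions made for illustration.

```python
from math import sqrt

values = [3, 4, 5, 7]

def mean_and_sd(freqs):
    """Mean and standard deviation from a frequency table, using sigma^2 = sum(f x^2)/n - mu^2."""
    n = sum(freqs)
    mu = sum(f * x for f, x in zip(freqs, values)) / n
    variance = sum(f * x**2 for f, x in zip(freqs, values)) / n - mu**2
    return mu, sqrt(variance)

# Part (a): try whole-number values of a until the mean is 5.
a = next(a for a in range(0, 101) if mean_and_sd([15, a, 21, 18])[0] == 5)

# Part (b): add b extra 7's until the standard deviation rounds to 1.5664.
b = next(b for b in range(0, 101)
         if round(mean_and_sd([15, a, 21, 18 + b])[1], 4) == 1.5664)

print(a, b)
```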
Outliers
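If a value is < 𝐿𝑄 − 1.5 × 𝐼𝑄𝑅 or > 𝑈𝑄 + 1.5 × 𝐼𝑄𝑅 then the value is an outlier.
Example
If a data set has 𝑄1 = 𝐿𝑄 = 16, 𝑄3 = 𝑈𝑄 = 21
𝐼𝑄𝑅 = 21 − 16 = 5
Lower boundary = 16 − 1.5 × 5 = 8.5
Upper boundary = 21 + 1.5 × 5 = 28.5
So any value less than 8.5 or greater than 28.5 is an outlier.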
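The outlier rule can be sketched in Python as follows. The quartiles 16 and 21 come from the example. The list of readings and the function names are hypothetical.

```python
def outlier_boundaries(q1, q3):
    """Lower and upper boundaries: Q1 - 1.5 x IQR and Q3 + 1.5 x IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Values lying below the lower boundary or above the upper boundary."""
    low, high = outlier_boundaries(q1, q3)
    return [x for x in data if x < low or x > high]

readings = [7, 12, 16, 18, 19, 21, 24, 30]  # hypothetical values
print(outlier_boundaries(16, 21))
print(find_outliers(readings, q1=16, q3=21))
```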
• A box and whisker plot (sometimes called a box plot) is a visual display of some of the
descriptive statistics of a data set (called the five-number summary of the data set).
It shows:
1. The minimum value.
2. The lower quartile.
3. The median.
4. The upper quartile.
5. The maximum value.
Skew
The following is the five-number summary of the heights (in cm) of 40 school children.
Minimum value = 132
Median = 146.5
[Figure: box plot titled "Heights of 40 Students"]
a) Determine the minimum number of students with a height less than or equal to 151 cm.
151 cm is the UQ. Therefore, as a minimum, 75% or ¾ of students have heights
less than or equal to 151 cm.
$\frac{3}{4} \times 40 = 30$ students
A cumulative frequency graph is useful for not only finding the median and upper and lower
quartiles, but also for finding the number of scores that lie above or below a particular value.
• A percentile is the score below which a certain percentage of the data lies;
e.g. The 85th percentile is the score below which 85% of the data lies.
e.g. If your score in a test is the 95th percentile, then 95% of the class have scored
less than you.
• The lower quartile (Q1 on GDC) is the 25th percentile.
• The median is the 50th percentile.
• The upper quartile (Q3 on GDC) is the 75th percentile.
Example:
The data shows the results of the women's marathon at the 2008 Olympics for all competitors that
finished the race.
Finishing time (𝑡, hours and minutes)    Frequency
2h26m ≤ 𝑡 < 2h28m                        8
2h28m ≤ 𝑡 < 2h30m                        3
2h30m ≤ 𝑡 < 2h32m                        9
2h32m ≤ 𝑡 < 2h34m                        11
2h34m ≤ 𝑡 < 2h36m                        12
2h36m ≤ 𝑡 < 2h38m                        7
2h38m ≤ 𝑡 < 2h40m                        5
2h40m ≤ 𝑡 < 2h42m                        8
2h42m ≤ 𝑡 < 2h44m                        6
(i) The median finishing time.
Median is the 50th percentile. Therefore, 50% of 69 = 34.5. Reading from the
graph, the median time is approximately 2 hours 34.5 minutes.
(ii) The number of competitors who finished in a time less than 2 hours 35 minutes.
Reading from the graph, there are approx. 37 competitors who took less than
2 hours 35 minutes to complete the race.
(iii) The percentage of competitors who took more than 2 hours 39 minutes to finish.
Reading from the graph, there are 69 − 52 = 17 competitors who took more
than 2 hours 39 minutes. Therefore, $\frac{17}{69}$ = 24.6% (3 s.f.) took more than 2
hours 39 minutes.
(iv) The time taken by a competitor who finished in the top 20% of those who
completed the marathon.
20% of 69 = 13.8 ≈ 14. Reading from the graph, 20% of competitors took
less than 2 hours 30 minutes 30 seconds.
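The graph readings in this example can be approximated in Python by joining the cumulative frequency points with straight lines. This is a sketch under that assumption: a hand-drawn smooth curve, or rounding when reading by eye, gives slightly different values (for example 52 rather than 52.5 competitors at 2h39m). Times are measured in minutes after 2h00m, and the function names are made up.

```python
from itertools import accumulate

# Upper class boundaries (minutes after 2h00m) and frequencies from the table.
upper_bounds = [28, 30, 32, 34, 36, 38, 40, 42, 44]
frequencies = [8, 3, 9, 11, 12, 7, 5, 8, 6]

# Points of the cumulative frequency graph, starting at (26, 0).
times = [26] + upper_bounds
cumulative = [0] + list(accumulate(frequencies))
points = list(zip(times, cumulative))

def cumulative_at(t):
    """Estimated number of competitors finishing in under t minutes past 2h."""
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return c0 + (c1 - c0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the range of the data")

def time_at(count):
    """Estimated time (minutes past 2h) by which `count` competitors had finished."""
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        if c0 <= count <= c1:
            return t0 + (t1 - t0) * (count - c0) / (c1 - c0)
    raise ValueError("count is outside the range of the data")

total = cumulative[-1]
print(total)                                       # number of competitors
print(time_at(0.5 * total))                        # median, minutes past 2h
print(cumulative_at(35))                           # finished in under 2h35m
print(100 * (total - cumulative_at(39)) / total)   # % slower than 2h39m
print(time_at(0.2 * total))                        # time for the fastest 20%
```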
Statisticians are often interested in how two variables are related. The independent variable is
placed on the horizontal axis and the dependent variable is placed on the vertical axis.
For example, if we recorded the age of a number of athletes and how far they can throw a discus,
the independent variable would be age and the dependent variable would be the distance
thrown.
If one or more data points do not follow the trend of the others, they may be
considered outliers.
o An outlier is defined as a value more than 1.5 × IQR from the nearest quartile.
o If an outlier is the result of a recording error, it should be discarded.
o If the outlier proves to be a genuine piece of data, it should be kept.
For example:
It is important to note that correlation between two variables does not necessarily mean that one
variable causes the other.
If a change in one variable does cause a change in the other variable, then a causal relationship
exists between them.
Correlation may be linear, quadratic, cubic or follow another function, or there may be no visible
correlation. In this course, we will only consider Linear Correlation.
Linear correlation is described by its:
1. Direction
2. Strength
To get a more precise measure of the strength of linear correlation between two variables,
we use Pearson's product-moment correlation coefficient, 𝑟. The formula is:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of the x and y data respectively.
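The formula for 𝑟 translates directly into Python. The sketch below uses hypothetical ages and discus distances, and the function name is made up. In practice you would read 𝑟 from the GDC.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired data."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

ages = [12, 14, 15, 17, 18]        # hypothetical ages (years)
distances = [20, 26, 27, 35, 36]   # hypothetical discus throws (metres)
print(round(pearson_r(ages, distances), 3))
```

A value close to 1 indicates strong positive linear correlation, and a value close to −1 indicates strong negative linear correlation.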
Steps:
1. Using your GDC, find the means of the 𝑥- and 𝑦-values, 𝑥̅ and 𝑦̅.
Calculator
stat EDIT 1: Edit (L1: enter 𝑥-values; L2: enter 𝑦-values)
stat CALC 2: 2-Var Stats (Xlist: L1; Ylist: L2; FreqList: blank)
This gives the measures for the two variables individually.
2. Plot the mean point (𝑥̅, 𝑦̅) on the scatter diagram.
3. Draw a line through the mean point which fits the trend of the data, so that
about the same number of data points are above the line as below it.
The line formed is called the line of best fit by eye. This line will vary from person to
person.
If a linear function, 𝑓(𝑥), is chosen to represent the data, then for each 𝑥-value there will be a
difference between the actual data point (𝑦) and 𝑓(𝑥). If we square these differences (to get rid of
the effect of negatives) and add them up, we have the sum of square residuals, 𝑺𝑺𝒓𝒆𝒔:
$SS_{res} = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$
The line that minimises the sum of square residuals is called the least squares regression
line, or the linear regression equation. It can be found using your calculator.
Calculator
stat EDIT 1: Edit (L1: enter 𝑥-values; L2: enter 𝑦-values)
stat CALC 2: 2-Var Stats (Xlist: L1; Ylist: L2; FreqList: blank)
This gives the measures for the two variables individually.
The linear regression (LinReg) command in the stat CALC menu gives Pearson’s Product Moment
Correlation Coefficient, 𝒓, and the equation of the linear regression line.
(Note that you must have STAT DIAGNOSTICS turned on in ‘mode’.)
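For comparison with the calculator output, a least squares line can also be worked out directly. The sketch below uses the standard results gradient = ∑(𝑥 − 𝑥̅)(𝑦 − 𝑦̅) / ∑(𝑥 − 𝑥̅)² and intercept = 𝑦̅ − gradient × 𝑥̅, which are not stated in these notes. The data and function names are hypothetical.

```python
def least_squares_line(xs, ys):
    """Gradient and intercept of the least squares regression line y = ax + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    gradient = s_xy / s_xx
    intercept = y_bar - gradient * x_bar  # the line passes through the mean point
    return gradient, intercept

def ss_res(xs, ys, gradient, intercept):
    """Sum of square residuals for the line y = gradient * x + intercept."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]               # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = least_squares_line(xs, ys)
print(a, b)

# Moving the line away from the least squares values makes SS_res larger.
print(ss_res(xs, ys, a, b), ss_res(xs, ys, a + 0.1, b), ss_res(xs, ys, a, b - 0.1))
```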
The trends we model are only valid over the range of our data. Determining values inside this
range is called Interpolation.
We cannot use the trends to draw conclusions about values OUTSIDE our range of data. This is
called Extrapolation, and its conclusions are not reliable or valid.