Formuleblad-statistiek (Statistics Formula Sheet)

The document provides an overview of statistical formulas, including calculations for sample and population data, measures of center and variability, and probability laws. It also covers formulas related to random variables, confidence intervals, and hypothesis testing for one-sample and two-sample scenarios. Key notations and calculations for various statistical measures are presented for easy reference.


Formula Overview

1 Standard Formulas
First we treat calculations on sample data, then on population data.

1.1 Calculations on sample data


A few notations:

• Here x1 , . . . , xn is the data from a sample.


• Here x(1) , . . . , x(n) represents the same data sorted: x(1) is the lowest value, x(2) the second lowest, . . ., x(n) the highest value.
• Here Q1 and Q3 represent the first and third quartiles of the data.

1.1.1 Measures of Center

Name                Notation / Calculation

Mean                x̄ = (x1 + . . . + xn)/n = (1/n) ∑ᵢ₌₁ⁿ xᵢ
Median              M = the 0.5(n + 1)th ranked observation
Percentile          q_p% = the [p(n + 1)/100]th ranked observation
Mid-range           average of minimum and maximum: (x(1) + x(n))/2
Mode                most frequently occurring data value (if any)
Mid-hinge           average of the Q1 and Q3 quartiles: (Q1 + Q3)/2
Geometric mean      G = ⁿ√(x1 · x2 · . . . · xn)
k% trimmed mean     discard the k% highest and k% lowest observations, then take the mean
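The measures of center above can be checked with a small plain-Python sketch (the helper names are my own, not part of the formula sheet):

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    # the 0.5(n + 1)th ranked observation: middle value,
    # or the average of the two middle values when n is even
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def midrange(xs):
    return (min(xs) + max(xs)) / 2

def trimmed_mean(xs, k):
    """k% trimmed mean: drop the k% lowest and k% highest observations."""
    s = sorted(xs)
    cut = int(len(s) * k / 100)
    return mean(s[cut:len(s) - cut])
```

For example, `trimmed_mean([0, 1, 2, 3, 100], 20)` drops 0 and 100 and averages the rest.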

1.1.2 Measures of Variability

Name                       Notation / Calculation

Sample variance            s² = ∑ᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
Standard deviation         s = √s²
Range                      x(n) − x(1)
Interquartile range        IQR = Q3 − Q1
Mean absolute deviation    MAD = ∑ᵢ₌₁ⁿ |xᵢ − x̄| / n
Coefficient of variation*  CV = s / x̄
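These variability measures translate directly into code; a minimal sketch (helper names my own):

```python
import math

def sample_variance(xs):
    xbar = sum(xs) / len(xs)
    # note the n - 1 denominator for sample data
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def sample_std(xs):
    return math.sqrt(sample_variance(xs))

def mad(xs):
    xbar = sum(xs) / len(xs)
    return sum(abs(x - xbar) for x in xs) / len(xs)

def cv(xs):
    return sample_std(xs) / (sum(xs) / len(xs))
```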

1
Formula Overview
1.1.3 Covariance and Correlation (sample data)

Name                           Notation & Calculation

Sample covariance              s_XY = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Sample (Pearson) correlation   r_XY = s_XY / (s_X s_Y)
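The covariance and correlation formulas can be sketched like this (function names are my own):

```python
import math

def sample_cov(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    # s_X is the square root of the covariance of X with itself
    sx = math.sqrt(sample_cov(xs, xs))
    sy = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (sx * sy)
```

A perfectly linear pair such as `xs = [1, 2, 3]`, `ys = [2, 4, 6]` gives r_XY = 1.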

1.2 Calculations on Population data


Almost all formulas for population data are the same, except:

Name          Notation & Calculation
Variance      σ² = ∑ᵢ₌₁ᴺ (xᵢ − µ)² / N
Covariance    σ_XY = ∑ᵢ₌₁ᴺ (xᵢ − µ_X)(yᵢ − µ_Y) / N

If you cannot find a formula above here, refer back to Section 1.1. The notation also changes a little bit. You can compare in the table below:

Sample                                                Population
Sample size      n                                    Population size   N
Mean             x̄ = ∑ᵢ₌₁ⁿ xᵢ / n                     Mean              µ = ∑ᵢ₌₁ᴺ xᵢ / N
Variance         s² = ∑ᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)       Variance          σ² = ∑ᵢ₌₁ᴺ (xᵢ − µ)² / N
Std. deviation   s = √s²                              Std. deviation    σ = √σ²
Covariance       s_XY = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)/(n − 1)  Covariance      σ_XY = ∑ᵢ₌₁ᴺ (xᵢ − µ_X)(yᵢ − µ_Y)/N
Correlation      r_XY = s_XY / (s_X s_Y)              Correlation       ρ_XY = σ_XY / (σ_X σ_Y)
Proportion       p                                    Proportion        π

2 Formulas about Probability/Distributions


A few notations:
• Here ∪ means “or” and ∩ means “and”.
• Here A | B means “A given B”.

2.1 General Laws of Probability


Name                      Formula
Addition                  P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Conditional probability   P(A | B) = P(A ∩ B) / P(B)
Multiplication            P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
Bayes'                    P(B | A) = P(A | B) P(B) / P(A)
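A small worked example of Bayes' rule combined with total probability (the numbers are an illustrative disease-testing scenario, not from the formula sheet):

```python
# B = "has the condition", A = "test is positive"
p_b = 0.01              # prior P(B)
p_a_given_b = 0.95      # sensitivity P(A | B)
p_a_given_not_b = 0.05  # false-positive rate P(A | not B)

# total probability: P(A) = P(A|B)P(B) + P(A|not B)P(not B)
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

# Bayes' rule: P(B|A) = P(A|B)P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
```

Despite the accurate test, P(B | A) is only about 0.16, because the condition is rare.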

A few handy mathematical notations:


• n! = 1 · 2 · 3 · . . . · (n − 1) · n, where 0! = 1.
• nPr = n! / (n − r)!
• nCr = (n choose r) = n! / (r!(n − r)!)

2
Formula Overview
3 Formulas about random variables / distributions
For the formulas of all the distributions (Bernoulli, normal etc.), we refer to AthenaDocs.

3.1 Expectation and Variance


Let X be a discrete random variable with possible outcomes denoted by xᵢ.

Name          Notation & Calculation
Expectation   µ_X = E(X) = ∑ᵢ xᵢ P(X = xᵢ)
Variance      σ²_X = Var(X) = ∑ᵢ (xᵢ − µ_X)² P(X = xᵢ)
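The expectation and variance sums can be sketched over a probability mass function stored as a dict (helper names my own; a fair die is used as the example):

```python
def expectation(pmf):
    """pmf: dict mapping outcome x_i -> P(X = x_i)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}  # fair six-sided die
```

For the die, E(X) = 3.5 and Var(X) = 35/12.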

Expectations and variances of common distributions:

Distribution     Notation       P(X = k)                                      P(X ≤ k)         E(X)   Var(X)
Binomial         Bin(n, π)      (n choose k) π^k (1 − π)^(n−k)                −                nπ     nπ(1 − π)
Poisson          Pois(λ)        λ^k e^(−λ) / k!                               −                λ      λ
Geometric        Geom(π)        (1 − π)^(k−1) π                               1 − (1 − π)^k    1/π    (1 − π)/π²
Hypergeometric   Hyp(π = S/N)   (S choose k)(N−S choose n−k) / (N choose n)   −                nπ     nπ(1 − π)(N − n)/(N − 1)
Normal           N(µ, σ²)       0                                             −                µ      σ²
Exponential      Exp(λ)         0                                             1 − e^(−λx)      1/λ    1/λ²

3.1.1 Handy Formulas


• µ_{X+Y} = E(X + Y) = µ_X + µ_Y = E(X) + E(Y)
• σ²_{X+Y} = Var(X + Y) = σ²_X + σ²_Y + 2σ_XY = Var(X) + Var(Y) + 2σ_XY
• X and Y independent ⇐⇒ P(X = x, Y = y) = P(X = x) P(Y = y)
• X and Y independent =⇒ Var(X + Y) = Var(X) + Var(Y), and also Cov(X, Y) = 0

3.1.2 Standardising Normal


If X ∼ N(µ, σ²), then

    P(X < k) = P(Z < (k − µ)/σ),

where Z ∼ N(0, 1).

3.2 Approximations
3.2.1 Approximating X̄
Note: The following is only true if the conditions in Table 1 hold!

Let X be a RV with mean µ and variance σ². Let X̄ be the average of n X's. Then

    X̄ ∼ N(µ, σ²/n).

The standard error of X̄ is σ/√n. Hence,

    P(X̄ < k) = P(Z < (k − µ)/(σ/√n)).

n < 15        We must know that the population is normally distributed.
15 ≤ n < 30   We only have to know that the population is symmetrically distributed.
n ≥ 30        The population may be skewed, but not severely.

Table 1: Assumptions
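The sampling-distribution result for X̄ can be sketched with the standard normal CDF, written via the error function so only the standard library is needed (helper names my own):

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_xbar_below(k, mu, sigma, n):
    se = sigma / math.sqrt(n)   # standard error of X-bar
    return phi((k - mu) / se)
```

For example, with µ = 100, σ = 10, n = 25, the probability P(X̄ < 100) is exactly 0.5.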

3.2.2 Approximating Binomial


Let X ∼ Bin(n, π ).

• If BOTH nπ ≥ 5 and nπ (1 − π ) ≥ 5:
We approximate X ∼ N(nπ, nπ (1 − π )). Do not forget the continuity correction:

Original Approximation
P( X > k) P( X > k + 0.5)
P( X ≥ k) P( X ≥ k − 0.5)
P( X < k) P( X < k − 0.5)
P( X ≤ k) P( X ≤ k + 0.5)

• If BOTH n ≥ 20 and π ≤ 0.05:


We approximate X ∼ Pois(nπ ).

3.2.3 Approximating Poisson


Let X ∼ Pois(λ).

If λ is too large for the table, use X ∼ N(λ, λ).

4
Formula Overview
4 Confidence Interval Formulas
4.1 One-Sample C.I.
Here, C.I. stands for “Confidence Interval”. For C.I. about regression, see Section 6.2.1.

When to Use                 Interval                                       Distribution   df         Assumptions

C.I. for µ, σ known         x̄ ± z_{α/2} · σ/√n                             t (or z)       ∞          See Table 1
C.I. for µ, σ unknown       x̄ ± t_{α/2} · s/√n                             t              n − 1      See Table 1
C.I. for π                  p ± z_{α/2} · √(p(1 − p)/n)                    t (or z)       ∞          np ≥ 10 and n(1 − p) ≥ 10
C.I. for σ²                 ((n − 1)s²/χ²_{α/2}, (n − 1)s²/χ²_{1−α/2})     χ²             n − 1      Normal population
C.I. for β_i (regression)   b_i ± t_{α/2} · s_{b_i}                        t              n − k − 1  Normal populations, equal variances
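As an illustration of the first interval (σ known), a small sketch; the helper name and the hard-coded z = 1.96 for a 95% level are my own choices:

```python
import math

def ci_mean_sigma_known(xbar, sigma, n, z=1.96):
    """C.I. for mu with sigma known: xbar +/- z * sigma / sqrt(n)."""
    half = z * sigma / math.sqrt(n)
    return (xbar - half, xbar + half)
```

With x̄ = 100, σ = 10, n = 25 this gives the 95% interval (96.08, 103.92).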

4.2 Two-Sample C.I.


When to Use                 Interval                                                    Distr.     df            Assumptions

C.I. for µ1 − µ2,           (x̄1 − x̄2) ± z_{α/2} · √(σ1²/n1 + σ2²/n2)                   t (or z)   ∞             See Table 1,
σ1, σ2 known                                                                                                    independent populations
C.I. for µ1 − µ2,           (x̄1 − x̄2) ± t_{α/2} · √(s_p²/n1 + s_p²/n2),                t          n1 + n2 − 2   See Table 1,
σ1 = σ2 unknown             where s_p² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)                               independent populations
C.I. for µ1 − µ2,           (x̄1 − x̄2) ± t_{α/2} · √(s1²/n1 + s2²/n2)                   t          df_special    See Table 1,
σ1 ≠ σ2 unknown                                                                                                  independent populations
C.I. for π1 − π2            (p1 − p2) ± z_{α/2} · √(p1(1 − p1)/n1 + p2(1 − p2)/n2)      t (or z)   ∞             n_i p_i ≥ 10 and n_i(1 − p_i) ≥ 10,
                                                                                                                 independent populations

In the case of "C.I. for µ1 − µ2, σ1 ≠ σ2 unknown", the df is usually given. A little note:

• In the odd case that you need to calculate the df after all, use the formula

    df_special = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
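The df_special (Welch) formula is mechanical to evaluate; a sketch with my own function name:

```python
def welch_df(s1_sq, n1, s2_sq, n2):
    """Welch degrees of freedom for unequal, unknown variances."""
    a, b = s1_sq / n1, s2_sq / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
```

In the balanced, equal-variance case (s1² = s2² = 4, n1 = n2 = 10) it reduces to n1 + n2 − 2 = 18.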

5 Hypothesis Testing
5.1 One-Sample Hypothesis Testing
Use this table when H0 and H1 concern only one parameter:

When to Use                  Test statistic                         Distribution   df         Requirements
Test for µ, σ known          z_calc = (X̄ − µ0)/(σ/√n)               t (or z)       ∞          See Table 1
Test for µ, σ unknown        t_calc = (X̄ − µ0)/(s/√n)               t              n − 1      See Table 1
Test for π                   z_calc = (p − π0)/√(π0(1 − π0)/n)      t (or z)       ∞          nπ0 ≥ 10 and n(1 − π0) ≥ 10
Test for σ²                  χ²_calc = (n − 1)s²/σ0²                χ²             n − 1      Normal population
Test for ρ                   t_calc = r √[(n − 2)/(1 − r²)]         t              n − 2      See Table 1
Test for β_i (regression)    t_calc = (b_i − c)/s_{b_i}             t              n − k − 1  Normal populations, equal variances
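As an example, the one-sample test for a proportion from the table above (function name is my own):

```python
import math

def z_test_proportion(p_hat, pi0, n):
    """z_calc = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)."""
    se = math.sqrt(pi0 * (1 - pi0) / n)
    return (p_hat - pi0) / se
```

With p = 0.6, π0 = 0.5, n = 100 (so the requirement nπ0 ≥ 10 holds), z_calc = 2.0.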

5.2 Two-Sample Hypothesis Testing


Use this table when H0 and H1 concern two parameters:

When to Use             Test statistic                                                  Distribution   df               Requirements

Test for µ1 − µ2,       z_calc = [(X̄1 − X̄2) − d] / √(σ1²/n1 + σ2²/n2)                  t (or z)       ∞                See Table 1,
σ1, σ2 known                                                                                                           independent populations
Test for µ1 − µ2,       t_calc = [(X̄1 − X̄2) − d] / √(s_p²/n1 + s_p²/n2),               t              n1 + n2 − 2      See Table 1,
σ1 = σ2 unknown         where s_p² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2)                                          independent populations
Test for µ1 − µ2,       t_calc = [(X̄1 − X̄2) − d] / √(s1²/n1 + s2²/n2)                  t              df_special       See Table 1,
σ1 ≠ σ2 unknown                                                                                                        independent populations
Test for π1 − π2,       z_calc = (p1 − p2) / √(p_c(1 − p_c)/n1 + p_c(1 − p_c)/n2),      t (or z)       ∞                n_i p_i ≥ 10 and n_i(1 − p_i) ≥ 10,
d = 0                   where p_c = (n1 p1 + n2 p2)/(n1 + n2)                                                           independent populations
Test for π1 − π2,       z_calc = [(p1 − p2) − d] / √(p1(1 − p1)/n1 + p2(1 − p2)/n2)     t (or z)       ∞                n_i p_i ≥ 10 and n_i(1 − p_i) ≥ 10,
d ≠ 0                                                                                                                  independent populations
Test for σ1 = σ2        F_calc = s1²/s2²                                                F              n1 − 1, n2 − 1   Normal populations,
(Levene)                                                                                                               independent populations

In the case of "Test for µ1 − µ2, σ1 ≠ σ2 unknown", the df is usually given. If you have two samples and none of these formulas work, do not forget about the paired-difference procedure! A little note:

• In the odd case that you need to calculate the df after all, use the formula

    df_special = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

• Sometimes p_c is written as p̄.

6 Linear Regression
6.1 Simple Linear Regression
Let's say you have a theoretical regression model of the form Y = β0 + β1 X + ε, and you test

    H0: β1 = c,   H0: β1 ≤ c,   or   H0: β1 ≥ c,

then use

    t = (b1 − c)/s_{b1} ∼ t_{n−k−1},

where

• b1 is the estimator of β1 (from the estimated regression model);
• s_{b1} is the standard error of b1;
• k is the number of predictor variables.

You can also use this formula for other b_i's if you have the model Y = β0 + β1 X1 + . . . + βk Xk + ε, of course.

• Requirements: Normal populations & equal variances

6.2 Multiple Linear Regression (testing overall significance)


If the regression model is of the form Y = β0 + β1 X1 + . . . + βk Xk + ε, and you test whether the model as a whole is significant:

    H0: β1 = β2 = . . . = βk = 0,   H1: at least one β_i non-zero,

then use

    F = [SSR/k] / [SSE/(n − k − 1)] = [(n − k − 1)/k] · R²/(1 − R²) ∼ F_{k, n−k−1}.
Usually a table of this form is given:

             Sum of Squares   df          Mean Squares            Statistic
Regression   SSR              k           MSR = SSR/k             F = MSR/MSE ∼ F_{k, n−k−1}
Residual     SSE              n − k − 1   MSE = SSE/(n − k − 1)   −
Total        SST              n − 1       −                       −

where all the numbers in Sum of Squares are given, and also the df. It is usually up to you to calculate the Mean Squares and the F-statistic using this information. R² is also usually given, but here is the formula to calculate it:

    R² = 1 − SSE/SST = SSR/SST.
To calculate the adjusted R², use the formula:

    R²_adj = 1 − (1 − R²) · (n − 1)/(n − k − 1)

• Requirements: Normal Populations & Equal Variances


• NOTE: Always right-tailed rejection region.
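The F-from-R² identity and the adjusted-R² formula above can be sketched numerically (function names my own):

```python
def overall_f(r_sq, n, k):
    """F = [(n - k - 1) / k] * R^2 / (1 - R^2)."""
    return (n - k - 1) / k * r_sq / (1 - r_sq)

def adjusted_r_sq(r_sq, n, k):
    """R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
```

For example, with R² = 0.5, n = 12, k = 1: F = 10 and R²_adj = 0.45, illustrating that the adjusted value is always below R².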

6.2.1 Confidence Intervals
If we have a regression model Y = β0 + β1 X1 + . . . + βk Xk, then a (1 − α)·100% confidence interval for a slope β_i is given by

    b_i ± t_{α/2; n−k−1} · s_{b_i},

where s_{b_i} is the standard error of b_i and k is the number of predictor variables.

7 Other Hypothesis Tests


7.1 One-Way Anova Tests
Let's say you have a test that looks like:

    H0: µ1 = µ2 = . . . = µc,   H1: not H0,

then you use the statistic F, found using this table:

Source of variation   Sum of Squares                                  Degrees of Freedom   Mean Square         F-statistic
Treatment             SSB = ∑_{j=1}^{c} n_j (ȳ_j − ȳ)²                c − 1                MSB = SSB/(c − 1)   F = MSB/MSE ∼ F_{c−1, n−c}
Error                 SSE = ∑_{j=1}^{c} ∑_{i=1}^{n_j} (y_ij − ȳ_j)²   n − c                MSE = SSE/(n − c)   −
Total                 SST = ∑_{j=1}^{c} ∑_{i=1}^{n_j} (y_ij − ȳ)²     n − 1                −                   −

Here n is the total number of observations across all samples and c is the number of samples compared.
• NOTE: Always right-tailed rejection region.
• Requirements: No extreme outliers & Equal Variances
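The one-way ANOVA table can be computed directly from the raw groups; a sketch with my own function name:

```python
def one_way_anova_f(groups):
    """groups: list of lists of observations, one list per treatment."""
    n = sum(len(g) for g in groups)
    c = len(groups)
    grand = sum(sum(g) for g in groups) / n
    # between-group (treatment) sum of squares
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within-group (error) sum of squares
    sse = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    msb = ssb / (c - 1)
    mse = sse / (n - c)
    return msb / mse
```

For two groups [1, 2, 3] and [4, 5, 6]: SSB = 13.5, SSE = 4, so F = 13.5.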

7.2 Two-Way Anova Tests


Let's say you have collected data that may be influenced by two factors. Then you can test three hypotheses:

    H0: Factor A does not influence the data;
    H0: Factor B does not influence the data;
    H0: The interaction/combination does not influence the data.

Each of these hypotheses has its own statistic, which you can calculate using the table below.

Variation source   Sum of Squares   df               Mean Square                     F-statistic
Factor A           SSA              r − 1            MSA = SSA/(r − 1)               F_A = MSA/MSE ∼ F_{r−1, rc(m−1)}
Factor B           SSB              c − 1            MSB = SSB/(c − 1)               F_B = MSB/MSE ∼ F_{c−1, rc(m−1)}
Interaction        SSAB             (r − 1)(c − 1)   MSAB = SSAB/[(r − 1)(c − 1)]    F_AB = MSAB/MSE ∼ F_{(r−1)(c−1), rc(m−1)}
Error              SSE              rc(m − 1)        MSE = SSE/[rc(m − 1)]           −
Total              SST              n − 1            −                               −

Here r is the number of categories of factor A, c the number of categories of factor B, and m the number of observations per cell.
• NOTE: Always right-tailed rejection region.

7.3 Contingency Tables
Let's say you have two (qualitative) variables for which you want to check whether they are independent (like Education Level and Place of Residence), with the hypothesis:

    H0: The variables are independent.

For this test you use the statistic

    χ²_calc = ∑ (f_jk − e_jk)² / e_jk ∼ χ²_{(r−1)(c−1)}.

The formula for the expected frequency of a cell:

    e_jk = (R_j · C_k)/n.

Here
- f_jk is the observed frequency in a cell,
- e_jk is the expected frequency of a cell,
- R_j is the total number of observations in a row,
- C_k is the total number of observations in a column,
- n is the total number of observations,
- r and c are the number of rows and columns respectively.

We require that every e_jk ≥ 5, or else you cannot use this test.


• NOTE: Always right-tailed rejection region.
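The expected frequencies and the χ² statistic follow directly from the row and column totals; a sketch (function name my own):

```python
def chi_sq_statistic(table):
    """table: list of rows of observed frequencies f_jk."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi = 0.0
    for j, row in enumerate(table):
        for k, f in enumerate(row):
            e = row_tot[j] * col_tot[k] / n   # expected frequency e_jk
            chi += (f - e) ** 2 / e
    return chi
```

A perfectly proportional table such as [[10, 20], [20, 40]] gives χ² = 0, as independence predicts.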

7.4 Tests for the Median (non-parametric)


The test will look like:

    H0: M ≤ M0,   H0: M = M0,   or   H0: M ≥ M0,

where M0 is the hypothesized median and M is the population median.

7.4.1 When to use which test


There are two tests to test the median:
• If we assume symmetric population, then use Wilcoxon Signed Ranked Test;
• If not, then use Sign Test.

7.4.2 Sign Test


The statistic is

    S = ∑ᵢ Sᵢ = "the number of observations above M0"

and has distribution

    S ∼ Bin(n, π = 0.5),

where n is the number of observations.
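A sketch of the sign test statistic, together with the Bin(n, 0.5) pmf needed to evaluate it (helper names my own; observations equal to M0 are dropped, as in the Wilcoxon procedure):

```python
import math

def sign_test(xs, m0):
    """Return (S, n): S counts observations above m0; ties with m0 are dropped."""
    s = sum(1 for x in xs if x > m0)
    n = sum(1 for x in xs if x != m0)
    return s, n

def binom_pmf(k, n, p=0.5):
    """P(X = k) for X ~ Bin(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)
```

For data [1, 2, 3, 5, 6] and M0 = 3: S = 2 out of n = 4 usable observations.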

7.4.3 Wilcoxon Signed Rank Test
You need to assume symmetry to use this test. The statistic is

    W = ∑ᵢ Rᵢ⁺.

We take the following steps to calculate this test statistic:

Step 1  Order the observations by distance from M0 (the median assumed in H0).
        Delete an observation if its distance to M0 is 0.
        Set n = the number of remaining observations.
Step 2  Number each observation in this list from 1 to n. These numbers we call "ranks".
Step 3  Add the ranks of all observations that lie above M0. This sum equals the test statistic W.
Step 4  If n ≤ 20, go to the Wilcoxon signed rank test table and find the corresponding interval (a, b).
        - If our W lies in the interval: keep H0.
        - If W lies outside the interval: reject H0.
Step 5  If n > 20, we use a z-test, with the test statistic

            Z = [W − n(n + 1)/4] / √[n(n + 1)(2n + 1)/24].
