AFM_Module 6

The document covers key concepts in statistics, including descriptive and inferential statistics, hypothesis testing, and various statistical methods such as t-tests and ANOVA. It explains the importance of measures like mean, median, variance, and standard deviation in analyzing data, as well as the differences between parametric and non-parametric tests. Additionally, it discusses the role of exploratory data analysis and diagrammatic representations in understanding data relationships.

Uploaded by

Kalsoom Khalid
Copyright
© © All Rights Reserved

Descriptive Statistics

Visualizing Single and Multiple Variables


Hypothesis testing
Parametric and non-parametric tests
Week 6
Dealing with uncertainty

Decision making process

Statistical methods

Descriptive statistics
▪ It is the discipline of quantitatively describing the main features of a
collection of data
▪ Collecting, summarizing, processing data into useful information
Inferential statistics
▪ Provides the basis for predictions and estimates that are used to
transform information about a sample into inferences about a population

Descriptive vs. inferential

Statistical analysis

▪ An analysis must have four elements

▪ Data/information (what)
▪ Scientific reasoning (Who? where? how? what happens?)
▪ Finding (what results?)
▪ Lesson/conclusion (so what? so how? )
Key definitions

Population vs. sample

Descriptive statistics

Why they are useful:


• Help us summarize the characteristics of our data/observations

• They form the initial step in a statistical analysis

• We get information about the centrality (e.g., mean, median)

• We get information about the variability (e.g., variance, standard deviation)

Descriptive statistics
The Mean:
• Mean is another word for average.
• Most commonly used statistic that tells us about the centre of a data
set.
• The mean is important because it is involved in many statistical tests.
• E.g., t-test, ANOVA, etc.
• It is the sum of all values divided by the number of observations.
Example (n = 4): the four values sum to 307, so the mean is 307 ÷ 4 = 76.75

Descriptive statistics

The Median:
• Is the middle point of your ordered data

Example 1: 1,3,10,5,8 (n=5)
• Step 1: Order the observations: 1,3,5,8,10
• The median is the number in the middle: median = 5

Example 2: 4,12,0,21,4,13 (n=6)
• Step 1: Order the observations: 0,4,4,12,13,21
• 6 observations, so there is not a single point in the middle
• Step 2: Average the two middle points: median = (4+12)/2 = 8

Descriptive statistics

Difference between Mean and Median


• Both are single values that represent the center of a data set
• But the median is more robust, i.e., less affected by outliers
Example 3: 3,1,4,2.5,50 (n=5)
• 50 is a value quite extreme compared to the rest (outlier)

• Mean = (3+1+4+2.5+50)/5 = 12.1
• Median: ordered data 1,2.5,3,4,50, so median = 3
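
The three median and mean examples from these slides can be checked with a short Python sketch. It uses only the standard `statistics` module; the variable names are illustrative and not part of the slides.

```python
from statistics import mean, median

# Data taken from the slide examples
example_1 = [1, 3, 10, 5, 8]        # odd n: one middle value
example_2 = [4, 12, 0, 21, 4, 13]   # even n: average the two middle values
example_3 = [3, 1, 4, 2.5, 50]      # contains the outlier 50

median_1 = median(example_1)
median_2 = median(example_2)
mean_3 = mean(example_3)
median_3 = median(example_3)

print("Example 1 median:", median_1)
print("Example 2 median:", median_2)
print("Example 3 mean:", mean_3, "median:", median_3)
```

The outlier pulls the mean of Example 3 up to 12.1, while the median stays at 3, which is the robustness point made above.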

Measures of spread

Descriptive statistics

The Variance (s²)
• A measure to show how spread out our observations are
• Based on the distance of each observation from the mean

Descriptive statistics

The Variance (s²)
• Squared
• The average of the squared differences from the mean

Dog 1: 10kg
Dog 2: 12kg
Dog 3: 13kg
Dog 4: 17kg
Dog 5: 20kg
Dog 6: 24kg
Mean = 16kg

Descriptive statistics
The Variance (s²)
1. Work out the difference from the mean and square each value
2. Add all the squared values together
        Difference   Squared
Dog 1:  -6           36
Dog 2:  -4           16
Dog 3:  -3           9
Dog 4:  1            1
Dog 5:  4            16
Dog 6:  8            64
Mean = 16kg; sum of squared differences = 142

Descriptive statistics

The Variance (s²)
3. Divide the sum of the squared values by the number of
observations minus one (n - 1), because the dogs are a sample.
n = 6

s² = 142 / (6 - 1) = 142 / 5
Variance value:
28.4kg²
Descriptive statistics

The Standard deviation (s) :


• It is the square root of variance
• The result is in the same measurement units as our data (e.g., kg)
• This makes it easier to interpret and understand the spread of the
data than variance alone

Descriptive statistics

The Standard deviation (s) :

s = √28.4 ≈ 5.3kg
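
The dog-weight example (10, 12, 13, 17, 20 and 24kg) can be retraced step by step in Python. This is a minimal sketch using only the standard library; it divides by n - 1, as the worked numbers on the slides do, and the variable names are illustrative.

```python
import math
from statistics import variance

# Dog weights in kg, from the slides
weights_kg = [10.0, 12.0, 13.0, 17.0, 20.0, 24.0]

n = len(weights_kg)
mean_kg = sum(weights_kg) / n

# Steps 1 and 2: squared differences from the mean, then their sum
squared_diffs = [(w - mean_kg) ** 2 for w in weights_kg]
sum_sq = sum(squared_diffs)

# Step 3: divide by n - 1 (sample variance), then take the square root
sample_variance = sum_sq / (n - 1)
sample_sd = math.sqrt(sample_variance)

print("mean:", mean_kg)
print("sum of squares:", sum_sq)
print("variance:", round(sample_variance, 1))
print("standard deviation:", round(sample_sd, 1))

# Cross-check against the library's sample variance
library_variance = variance(weights_kg)
```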

Descriptive statistics

The Standard deviation (s) :


• Often plotted as error bars above and below the mean (5.3kg
in our example).
• Small value indicates data are gathered close to the mean
• Large value indicates data are gathered far from the mean
[Figure: mean weight with error bars; y-axis: Weight (kg), 0 to 25]
Know the data and explore it

▪ Exploratory data analysis (EDA): Purpose and Benefits


▪ Size, Dimension, and Resolution of Data
▪ Types of Attributes
▪ Statistical EDA
▪ Measures of Central Tendencies and Spread
▪ Bivariate EDA: Correlation, Contingency Table
▪ Graphical EDA: Types of Diagrams

Types of data
Types of data based on number
of attributes
▪ Univariate Data
▪ Bivariate Data
▪ Multivariate Data

Types of data

Types of data

Normal Distribution (Bell-curve)

Bivariate measures

Contingency Table

Covariance and correlation

Covariance and correlation

Covariance and correlation

Diagrammatic Representations of Data
▪ Easy to understand:
▪ Numbers do not tell all the story.
▪ Diagrammatic representation of data makes it easier to understand
▪ Simplified Presentation:
▪ Large volumes of complex data can be represented in a simplified, diagrammatic form
▪ Reveals hidden facts:
▪ Diagrams help in bringing out the facts and relationships between data not
noticeable in raw/tabular form
▪ Easy to compare:
▪ Diagrams make it easier to compare data

Diagrammatic Representations of Data

▪ Bar Charts
▪ Histogram
▪ Box Plot
▪ Scatter Plot
▪ Heat map
▪ Line Graph

Parametric vs. Non-parametric tests

Hypothesis Testing, Student's t-test

▪ Field 1 and Field 2
▪ Took a sample from Field 1 and 2
▪ Can you tell which one has the higher yield?

Field 1   Field 2
15.2      15.9
15.3      15.9
16.0      15.2
15.8      16.6
15.6      15.2
14.9      15.8
15.0      15.8
15.4      16.2
15.6      15.6
15.7      15.6
15.5      15.8
15.2      15.5
15.5      15.5
15.1      15.5
15.3      14.9
15.0      15.9

▪ Null hypothesis:
There is no statistically significant difference between the samples.
▪ Decision:
▪ If |t-value| < t-critical >>>> Don’t reject Null
▪ If |t-value| > t-critical >>>> Reject Null
Student's t-test
t-table
▪ Degrees of freedom: Dof = n1 + n2 - 2 = 16 + 16 - 2 = 30
▪ t-critical = 2.04 (from the t-table at Dof = 30; this matches the two-tailed value at alpha = 0.05)

Null hypothesis:
There is no statistically significant difference between the samples.
▪ t-value > t-critical
▪ 2.3 > 2.04
▪ That means we reject the null hypothesis: there is a statistically significant
difference between the samples
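
The t-value of about 2.3 can be reproduced with a pooled-variance two-sample t-test. The sketch below uses only the Python standard library. Two things are assumptions rather than statements from the slides: the assignment of values to Field 1 and Field 2 is read off the slide table, and the pooled (equal-variance) form of the test is assumed because it matches Dof = n1 + n2 - 2. The function name `pooled_t` is illustrative.

```python
import math
from statistics import mean, variance

# Yield samples as read from the slide table (column assignment assumed)
field_1 = [15.2, 15.3, 16.0, 15.8, 15.6, 14.9, 15.0, 15.4,
           15.6, 15.7, 15.5, 15.2, 15.5, 15.1, 15.3, 15.0]
field_2 = [15.9, 15.9, 15.2, 16.6, 15.2, 15.8, 15.8, 16.2,
           15.6, 15.6, 15.8, 15.5, 15.5, 15.5, 14.9, 15.9]


def pooled_t(a, b):
    """Return (t-value, degrees of freedom) for a pooled-variance t-test."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / dof
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / std_error, dof


t_value, dof = pooled_t(field_1, field_2)

T_CRITICAL = 2.04  # value quoted on the slide
reject_null = abs(t_value) > T_CRITICAL

print("t-value:", round(t_value, 2), "Dof:", dof, "reject null:", reject_null)
```

The sign of the t-value only reflects which field is listed first, so the decision compares its absolute value with t-critical.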
Analysis of variance (ANOVA)

Rubber       Rubber       Rubber
supplier 1   supplier 2   supplier 3
1            2            2
2            4            3
5            2            4

Step 1:
Null hypothesis:
There is no difference between means
µ1 = µ2 = µ3
Alternative:
At least one mean differs from the others
Alpha = 0.05
Step 2:
Find critical F-value
- Dof (between-numerator) = k – 1 = 3 – 1 = 2,
k (number of conditions/groups)
- Dof (within-denominator) = N – k = 9 – 3 = 6,
N (total number of scores we have in the sample)
- Dof (total) = N – 1 = 8
- F-Critical = 5.14 (from table)
ANOVA
Step 3:
Analysis of sum of squares (total variability)
Mean for each condition/group
- Mean x1 = 2.67
- Mean x2 = 2.67
- Mean x3 = 3.00

Grand mean (G) = (sum of all scores)/N
= (1+2+5+2+4+2+2+3+4)/9
G = 2.78

SS (total) = Sum of squares (total)
= Sum (x - G)^2
= (1-2.78)^2 + (2-2.78)^2 + (5-2.78)^2
+ (2-2.78)^2 + (4-2.78)^2 + (2-2.78)^2
+ (2-2.78)^2 + (3-2.78)^2 + (4-2.78)^2
SS (total) = 13.56

SS (within) = Sum of squares (within)
= Sum (x1 – mean (x1))^2
+ Sum (x2 – mean (x2))^2
+ Sum (x3 – mean (x3))^2
= (1-2.67)^2 + (2-2.67)^2 + (5-2.67)^2
+ (2-2.67)^2 + (4-2.67)^2 + (2-2.67)^2
+ (2-3)^2 + (3-3)^2 + (4-3)^2
SS (within) = 13.33

SS (between) = SS (total) – SS (within)
≈ 0.22 (computed from the unrounded sums)

ANOVA recap
- Dof (between) = k – 1
- Dof (within) = N – k
- (t-test, for comparison: Dof = n1 + n2 - 2)
- SS (between) = SS (total) – SS (within)
- Variance (between) = Mean square = MS (between)
- Variance (within) = Mean square = MS (within)

Step 4:
Variance (between)
Mean square = MS (between)
= SS (between)/Dof (between)
= 0.22/2
= 0.11
Variance (within)
Mean square = MS (within)
= SS (within)/Dof (within)
= 13.33/6
= 2.22

Step 5:
F-value = MS (between)/MS (within)
= 0.11/2.22
= 0.05
F-critical = 5.14 Remember!!!
F-value < F-critical
0.05 < 5.14
Conclusion:
We fail to reject the null hypothesis (µ1 = µ2 = µ3)

▪ Decision:
▪ If F-value < F-critical >>>> Don’t reject Null
▪ If F-value > F-critical >>>> Reject Null
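
The ANOVA steps for the rubber supplier data (supplier 1: 1, 2, 5; supplier 2: 2, 4, 2; supplier 3: 2, 3, 4) can be sketched in plain Python. The function name `one_way_anova` and the dictionary keys are illustrative. The code keeps full precision, so the intermediate sums can differ in the second decimal from figures worked by hand with rounded means.

```python
def one_way_anova(samples):
    """Return sums of squares, mean squares and the F-value for k groups."""
    all_scores = [x for sample in samples for x in sample]
    n_total = len(all_scores)   # N
    k = len(samples)            # number of groups
    grand_mean = sum(all_scores) / n_total

    # Total variability around the grand mean
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

    # Variability of each score around its own group mean
    ss_within = 0.0
    for sample in samples:
        group_mean = sum(sample) / len(sample)
        ss_within += sum((x - group_mean) ** 2 for x in sample)

    ss_between = ss_total - ss_within
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return {
        "ss_total": ss_total,
        "ss_within": ss_within,
        "ss_between": ss_between,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_value": ms_between / ms_within,
    }


rubber = [[1, 2, 5], [2, 4, 2], [2, 3, 4]]
result = one_way_anova(rubber)

F_CRITICAL = 5.14  # from the F-table, Dof = (2, 6), alpha = 0.05
reject_null = result["f_value"] > F_CRITICAL

for name, value in result.items():
    print(name, round(value, 2))
print("reject null:", reject_null)
```

Because the function takes any list of groups, the same call can be reused on other small data sets with the matching F-critical value.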
Practicing ANOVA
A company is evaluating three different goods suppliers based on their delivery
performance. The goal is to determine whether there is a significant difference
in performance among the three suppliers.

Goods supplier 1   Goods supplier 2   Goods supplier 3
4                  5                  4
2                  2                  2
3                  1                  2
Hypothesis Testing-Wilcoxon Rank Sum Test
▪Wilcoxon with both n1 and n2 < 10 or n1 and n2 ≥ 10

▪ When we test a hypothesis about the difference between two
independent population means, we do so using the difference
between two sample means.

▪ When the two sample variances are tested and found not to be
equal:
• we cannot pool the sample variances
• thus we cannot use the t-test for independent samples. Instead, we use
the Wilcoxon Rank Sum Test
Wilcoxon Rank Sum Test
The Z test and the t test are “parametric tests” – that is, they answer a
question about the difference between populations by comparing
sample statistics (e.g., the sample means X̄1 and X̄2) and making an inference to the
population parameters (μ1 and μ2).

The Wilcoxon, in contrast, allows inferences about whole populations

Small samples, independent groups
Wilcoxon Rank Sum Test
• First, combine the two samples and rank order all the
observations
• The smallest number has rank 1, the largest number has rank
N (= sum of n1 and n2); tied values share the average of their ranks
• Separate the samples and add up the ranks for the smaller
sample (if n1 = n2, choose either one)
• Test statistic: rank sum T for the smaller sample
Small samples, independent groups
Wilcoxon – Rejection region:
(With the sample taken from Population A being smaller than the sample for
Population B) – reject H0 if
TA ≥ TU or TA ≤ TL
(TU and TL are the upper and lower critical values from the Wilcoxon rank sum table)
Example 1
A restaurant chain wants to compare the costs of
preparing Cajun and Creole dishes to determine if one
cuisine tends to be more expensive than the other.

- These are small samples, and they are independent (“random samples
of Cajun and Creole dishes”)
- Therefore, we must begin with the test of equality of variances
Cajun Creole
3500 3100
4200 4700
4100 2700
4700 3500
4200 2000
3705 3100
4100 1550
Test of hypothesis of equal variances
H0: σ1² = σ2²
HA: σ1² ≠ σ2²

Test statistic: F = S2² / S1²
(larger value in the numerator, smaller value in the denominator)

Rej. region: F > Fα/2 = F(6,6,.025) = 5.82
or F < (1/5.82) = .172
Use table or Excel
=F.INV.RT(prob,df1,df2)
Test of hypothesis of equal variances
S²Cajun = (385.27)² = 148432.14
S²Creole = (1027.54)² = 1055833.33

Fobt = 1055833.33 / 148432.14 = 7.11

▪ If F-value < F-critical >>>> Don’t reject Null
▪ If F-value > F-critical >>>> Reject Null

Fobt = 7.11 > 5.82: Reject H0 – variances are not equal, so we do the Wilcoxon rank sum test
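
The two sample variances and the F ratio for the Cajun and Creole costs can be checked with a few lines of Python. This sketch uses the standard `statistics` module and the critical value 5.82 quoted on the slide; the variable names are illustrative.

```python
from statistics import variance

# Preparation costs from the Example 1 table
cajun = [3500, 4200, 4100, 4700, 4200, 3705, 4100]
creole = [3100, 4700, 2700, 3500, 2000, 3100, 1550]

var_cajun = variance(cajun)    # sample variance (divides by n - 1)
var_creole = variance(creole)

# Larger variance in the numerator, smaller in the denominator
f_obt = max(var_cajun, var_creole) / min(var_cajun, var_creole)

F_CRITICAL = 5.82  # F(6, 6, .025) from the table
variances_differ = f_obt > F_CRITICAL

print("S2 Cajun:", round(var_cajun, 2))
print("S2 Creole:", round(var_creole, 2))
print("F obtained:", round(f_obt, 2), "reject equal variances:", variances_differ)
```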
Example 1 – Wilcoxon Rank Sum Test
H0 : The median cost of preparing Cajun dishes is the same as Creole dishes

H1: The median cost of preparing Cajun dishes is different from Creole dishes

Statistical test: T (rank sum)

Rejection region (given or will be provided):
Reject H0 if TCajun > 66 (or if TCreole < 39)

(Note: We shall give lower cost values lower rank values)

Example 1 – Wilcoxon Rank Sum Test
Cajun   Rank    Creole   Rank
3500    6.5     3100     4.5
4200    11.5    4700     13.5
4100    9.5     2700     3
4700    13.5    3500     6.5
4200    11.5    2000     2
3705    8       3100     4.5
4100    9.5     1550     1
Σ       70               35

Calculation check!!!
Sum of the ranks = (n)(n+1)/2
70 + 35 = 105 = (14)(15)/2
Example 1 – Wilcoxon Rank Sum Test
TCajun = 70 > 66 (and TCreole = 35 < 39)

Therefore, reject H0: the median cost of preparing Cajun dishes is
different from Creole dishes

Rejection region:
- TA ≥ TU or TA ≤ TL
- Reject H0 if TCajun > 66 (or if TCreole < 39)
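
The ranking procedure, including average ranks for ties, can be sketched in Python and checked against the rank sums of Example 1. The function name `rank_sums` is illustrative; the cut-off values 66 and 39 are the ones given on the slides, not computed here.

```python
def rank_sums(sample_a, sample_b):
    """Rank the combined samples (ties share the average rank) and
    return the rank sum of each sample."""
    combined = sorted(sample_a + sample_b)
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        # Advance j past every value tied with combined[i]
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Tied values occupy ranks i+1 .. j; they share the average
        rank_of[combined[i]] = (i + 1 + j) / 2
        i = j
    t_a = sum(rank_of[x] for x in sample_a)
    t_b = sum(rank_of[x] for x in sample_b)
    return t_a, t_b


cajun = [3500, 4200, 4100, 4700, 4200, 3705, 4100]
creole = [3100, 4700, 2700, 3500, 2000, 3100, 1550]

t_cajun, t_creole = rank_sums(cajun, creole)

# Calculation check: all ranks together must sum to n(n+1)/2
n = len(cajun) + len(creole)
expected_total = n * (n + 1) / 2

# Rejection region as given on the slide
reject_null = t_cajun > 66 or t_creole < 39

print("T Cajun:", t_cajun, "T Creole:", t_creole)
print("check:", t_cajun + t_creole, "=", expected_total)
print("reject null:", reject_null)
```

The same function applies to samples of unequal size; in that case the test statistic is the rank sum of the smaller sample.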
Example 2 – Wilcoxon Rank Sum Test
A company is analyzing the performance of employees based on gender to determine
whether there is a significant difference in their output (e.g., sales numbers, efficiency
scores, or productivity ratings). The goal is to assess whether males and females perform
differently in this specific metric.
H0: There is no significant difference in performance between male and female employees
H1: There is a significant difference in performance between male and female employees

Male   Female
6.4    2.7
1.7    3.9
3.2    4.6
5.9    3.0
2.0    3.4
3.6    4.1
5.4    3.4
7.2    4.7
       3.8

Test of hypothesis of equal variances
H0: σ1² = σ2²
HA: σ1² ≠ σ2²
Test statistic: F = larger S² / smaller S²
Rej. region: F > Fα/2 = F(7,8,.025) = 4.53
or F < (1/4.90) = .204

Example 2 – Wilcoxon Rank Sum Test
Fobt = 4.316 / .46 = 9.38
Reject H0 – do Wilcoxon

Male   Rank   Female   Rank
6.4    16     2.7      3
1.7    1      3.9      10
3.2    5      4.6      12
5.9    15     3.0      4
2.0    2      3.4      6.5
3.6    8      4.1      11
5.4    14     3.4      6.5
7.2    17     4.7      13
              3.8      9
Σ      78              75

Statistical test: T (rank sum of the smaller sample, here the male sample)
Rejection region:
T > TU = 90 (or T < TL = 54)
T = 78 < TU = 90 (and T = 78 > TL = 54)
Fail to reject H0. Hence, there is no significant difference in performance between male and female
employees
Example 3
To compare the delivery times of Supplier A and Supplier B to determine if one
supplier consistently delivers faster than the other.

H₀: There is no significant difference in the delivery times of Supplier A and
Supplier B
H₁: There is a significant difference in delivery times between the two suppliers

Supplier A   Supplier B
2            6
6            8
4            7
23           10
7            8
6            4

Test of hypothesis of equal variances
H0: σ1² = σ2²
HA: σ1² ≠ σ2²
Test statistic: F = S1² / S2²
Rej. region: F > Fα/2 = F(5,5,.025) = 7.15
or F < (1/7.15) = .140

Example 3 – Wilcoxon Rank Sum Test
Fobt = (7.563)² / (2.04)² = 57.20 / 4.16 = 13.74
Fobt = 13.74 > 7.15: Reject H0 – do Wilcoxon

Supplier A   Rank   Supplier B   Rank
2            1      6            5
6            5      8            9.5
4            2.5    7            7.5
23           12     10           11
7            7.5    8            9.5
6            5      4            2.5
Σ            33                  45

Rejection region: TSA > 52
TSA = 33 < 52: Do not reject H0 – no evidence for a significant difference
between suppliers.
