Data Mining: Statistical Methods
LECTURE 2
Statistical Methods
• Mean
• Median
• Mode
• Mean, median, and mode are three kinds of
"averages". There are many "averages" in
statistics, but these are, I think, the three most
common, and are certainly the three you are most
likely to encounter.
• The "mean" is the "average" you're used to,
where you add up all the numbers and then
divide by the number of numbers.
• The "median" is the "middle" value in the list of
numbers. To find the median, your numbers have
to be listed in numerical order from smallest to
largest, so you may have to rewrite your list
before you can find the median.
• The "mode" is the value that occurs most often. If
no number in the list is repeated, then there is no
mode for the list.
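The three definitions above can be sketched in Python as follows. This is a minimal illustration, not part of the lecture: the sample list and the helper name `mode_or_none` are made up for the example, and the helper follows the convention stated above that a list with no repeated value has no mode.

```python
from collections import Counter
from statistics import mean, median

def mode_or_none(values):
    """Return the most frequent value, or None if no value is repeated.

    If several values tie for most frequent, the one seen first is returned.
    """
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else None

# Hypothetical sample data, chosen only for illustration.
data = [13, 18, 13, 14, 13, 16, 14, 21, 13]

data_mean = mean(data)          # sum of the values divided by their count
data_median = median(data)      # middle value of the sorted list
data_mode = mode_or_none(data)  # most frequent value

print(data_mean, data_median, data_mode)
```

For this list the mean is 15, the median is 14, and the mode is 13, while `mode_or_none([1, 2, 3])` returns `None`.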
What is mean?
• The mean is the sum of the values divided by how many values there are.
Example:
3, 5, 6, 9, 8
Mean = (3 + 5 + 6 + 9 + 8) / 5
Mean = 31 / 5 = 6.2
How to calculate the mean for data with frequencies?
Age (X)      Frequency (F)   Age * Frequency (FX)
22           5               22 * 5 = 110
33           2               33 * 2 = 66
44           6               44 * 6 = 264
66           4               66 * 4 = 264
Total (∑)    ∑F = 17         ∑FX = 704
Mean = ∑FX / ∑F
Mean = 704 / 17
Mean ≈ 41.41
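The frequency-table calculation can be checked with a short Python sketch. The variable names are chosen for this example only; the numbers are the ones from the table.

```python
# Ages (X) and how often each age occurs (F), taken from the table.
ages = [22, 33, 44, 66]
freqs = [5, 2, 6, 4]

total_f = sum(freqs)                                 # sum of F
total_fx = sum(x * f for x, f in zip(ages, freqs))   # sum of F*X
mean_age = total_fx / total_f                        # sum of FX / sum of F

print(total_f, total_fx, round(mean_age, 2))  # 17 704 41.41
```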
What is Median?
• The median is the middle value once the values are sorted.
Example:
3, 5, 6, 9, 8
Sorted: 3, 5, 6, 8, 9
Median = 6
How to calculate median for an even number of values?
Example:
9, 8, 5, 6, 3, 4
Sorted: 3, 4, 5, 6, 8, 9
The two middle values are 5 and 6, so
Median = (5 + 6) / 2
Median = 5.5
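A hand-written version of the median rule, covering both the odd and the even case, might look like the sketch below. The function name `median_of` is hypothetical; Python's built-in `statistics.median` does the same job.

```python
def median_of(values):
    """Median of a list: sort first, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([9, 8, 5, 6, 3, 4]))  # 5.5
print(median_of([3, 5, 6, 9, 8]))     # 6
```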
What is Mode?
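• The mode is the value that occurs most often.
• For grouped data, the mode is estimated from the modal group (the group with the highest frequency):
Mode = L + ((FMG - FBMG) / ((FMG - FBMG) + (FMG - FAMG))) * GW
where L is the lower boundary of the modal group, FMG is the frequency of the modal group, FBMG is the frequency of the group before it, FAMG is the frequency of the group after it, and GW is the group width.
For the modal group 95.5-100.5 (FMG = 10, FBMG = 4, FAMG = 6, GW = 5):
Mode = 95.5 + ((10 - 4) / ((10 - 4) + (10 - 6))) * 5
     = 95.5 + (6 / 10) * 5
     = 98.5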
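The grouped-mode formula can be written as a small function. This is a sketch only: the function and parameter names are invented here, and the reading of the abbreviations (frequency of the modal group, of the group before it, of the group after it, and the group width) is inferred from the numbers plugged in above.

```python
def grouped_mode(lower, f_modal, f_before, f_after, width):
    """Estimate the mode of grouped data from the modal group.

    lower    -- lower boundary of the modal group (L)
    f_modal  -- frequency of the modal group (FMG)
    f_before -- frequency of the group before it (FBMG)
    f_after  -- frequency of the group after it (FAMG)
    width    -- group width (GW)
    """
    d_before = f_modal - f_before
    d_after = f_modal - f_after
    return lower + d_before / (d_before + d_after) * width

mode_estimate = grouped_mode(95.5, 10, 4, 6, 5)
print(round(mode_estimate, 3))  # 98.5
```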
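How to calculate the Quartile Q1
Groups         X      F     CF
85.5-90.5      88     6     6
90.5-95.5      93     4     10
95.5-100.5     98     10    20    (the 15th value falls in this group)
100.5-105.5    103    6     26
105.5-110.5    108    3     29
110.5-115.5    113    1     30
Total                 30
Q1 = L + (h / f) * ((n / 4) - c)
where n is the total frequency, L is the lower boundary of the group containing the (n/4)th value, h is the group width, f is that group's frequency, and c is the cumulative frequency before that group.
Here n / 4 = 30 / 4 = 7.5, which falls in the group 90.5-95.5, so
Q1 = 90.5 + (5 / 4) * (7.5 - 6)
   = 92.375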
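The Q1 calculation can be reproduced with a general grouped-quantile sketch. The function name `grouped_quantile` and the tuple layout are assumptions made for this example; the group boundaries and frequencies come from the table, and the function assumes a fraction between 0 and 1 (exclusive of 0).

```python
# Each group is (lower boundary, upper boundary, frequency), from the table.
groups = [
    (85.5, 90.5, 6),
    (90.5, 95.5, 4),
    (95.5, 100.5, 10),
    (100.5, 105.5, 6),
    (105.5, 110.5, 3),
    (110.5, 115.5, 1),
]

def grouped_quantile(groups, fraction):
    """Interpolated quantile for grouped data: L + (h / f) * (n * fraction - c)."""
    n = sum(f for _, _, f in groups)
    target = n * fraction   # position of the wanted value, e.g. n/4 for Q1
    cumulative = 0          # c: cumulative frequency before the current group
    for lower, upper, f in groups:
        if cumulative + f >= target:
            return lower + (upper - lower) / f * (target - cumulative)
        cumulative += f
    raise ValueError("fraction must be between 0 and 1")

q1 = grouped_quantile(groups, 0.25)
print(q1)  # 92.375
```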
V = 28/6
= 4.6666667
How to calculate the Standard Deviation & Variance for group data
Group    X       F      FX        (X-Mean)   (X-Mean)²   F(X-Mean)²
30-35    32.5    12     390       -12        144         1728
35-40    37.5    18     675       -7         49          882
40-45    42.5    29     1232.5    -2         4           116
45-50    47.5    32     1520      3          9           288
50-55    52.5    16     840       8          64          1024
55-60    57.5    8      460       13         169         1352
Total            115    5117.5               439         5390
Mean = 5117.5 / 115
     = 44.5
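The table stops at the mean and the column totals. A minimal Python sketch of the remaining steps is given below. Whether the lecture divides by the total frequency (population variance) or by the total frequency minus one (sample variance) is not stated, so both are computed here as an assumption; the variable names are chosen for the example.

```python
from math import sqrt

# Group midpoints (X) and frequencies (F) from the table.
midpoints = [32.5, 37.5, 42.5, 47.5, 52.5, 57.5]
freqs = [12, 18, 29, 32, 16, 8]

n = sum(freqs)                                               # 115
group_mean = sum(f * x for x, f in zip(midpoints, freqs)) / n  # 44.5

# Sum of F * (X - Mean)^2, the last column total in the table.
sum_sq = sum(f * (x - group_mean) ** 2 for x, f in zip(midpoints, freqs))

population_variance = sum_sq / n     # divide by N
sample_variance = sum_sq / (n - 1)   # divide by N - 1 (assumed alternative)

population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

print(n, group_mean, sum_sq)
print(round(population_variance, 2), round(sample_variance, 2))
```

The sum of squares matches the table total of 5390, which gives a variance of about 46.87 when dividing by 115 and about 47.28 when dividing by 114.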
Least Squares Regression
Example data:
"x"                  "y"
Hours of Sunshine    Ice Creams Sold
2                    4
3                    5
5                    7
7                    10
9                    15
Step 1: For each (x, y) calculate x² and xy:
x    y     x²    xy
2    4     4     8
3    5     9     15
5    7     25    35
7    10    49    70
9    15    81    135
Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):
Σx: 26    Σy: 41    Σx²: 168    Σxy: 263
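Steps 1 and 2 can be reproduced in Python, as sketched below. The slides shown here end at the sums, so the slope and intercept lines use the standard least-squares formulas m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²) and b = (Σy − m Σx) / N as an assumed continuation, not as something taken from the lecture text.

```python
# Hours of sunshine (x) and ice creams sold (y), from the table.
xs = [2, 3, 5, 7, 9]
ys = [4, 5, 7, 10, 15]
n = len(xs)

# Step 1 and Step 2: per-point products, then their sums.
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
print(sum_x, sum_y, sum_x2, sum_xy)  # 26 41 168 263

# Assumed next steps: slope and intercept of the fitted line y = m*x + b.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n
print(round(m, 3), round(b, 3))  # 1.518 0.305
```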
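Chi-Square Test
Observed counts:
         Cat    Dog    Total
Men      207    282    489
Women    231    242    473
Total    438    524    962
Expected counts (row total × column total / grand total):
         Cat            Dog            Total
Men      489×438/962    489×524/962    489
Women    473×438/962    473×524/962    473
Total    438            524            962
Which gives us:
         Cat       Dog       Total
Men      222.64    266.36    489
Women    215.36    257.64    473
Total    438       524       962
Chi-Square = Σ (Observed − Expected)² / Expected
Chi-Square is 4.102 (using the rounded expected values above)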
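The expected counts and the Chi-Square value can be recomputed with the sketch below, which sums (observed − expected)² / expected over the four cells. The helper name `chi_square` is made up for this example. One detail worth knowing: with the expected counts rounded to two decimals, as in the table, the result is 4.102, while carrying full precision gives about 4.104.

```python
# Observed counts: rows are Men, Women; columns are Cat, Dog.
observed = [[207, 282], [231, 242]]

row_totals = [sum(row) for row in observed]        # [489, 473]
col_totals = [sum(col) for col in zip(*observed)]  # [438, 524]
grand_total = sum(row_totals)                      # 962

# Expected count for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(obs, exp):
    """Sum of (O - E)^2 / E over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(obs, exp)
        for o, e in zip(obs_row, exp_row)
    )

chi2_exact = chi_square(observed, expected)

# Same calculation with expected counts rounded to two decimals, as in the table.
expected_rounded = [[round(e, 2) for e in row] for row in expected]
chi2_rounded = chi_square(observed, expected_rounded)

print(expected_rounded)
print(round(chi2_rounded, 3), round(chi2_exact, 3))
```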
• From Chi-Square to p
• To turn the Chi-Square value into a p-value, you first need the "Degrees of Freedom" (DF).
• To calculate the Degrees of Freedom, multiply (rows − 1) by (columns − 1):
DF = (2 − 1)(2 − 1) = 1×1 = 1
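The last step, going from Chi-Square and DF to p, can be sketched without extra libraries for this particular case. The shortcut below is an assumption of this example and is valid only when DF = 1: a Chi-Square variable with one degree of freedom is a squared standard normal, so its upper-tail probability equals erfc(sqrt(x / 2)). For other DF values a statistics library would be needed. The function names are hypothetical.

```python
from math import erfc, sqrt

def degrees_of_freedom(rows, columns):
    """DF for a contingency table: (rows - 1) * (columns - 1)."""
    return (rows - 1) * (columns - 1)

def p_value_df1(chi2):
    """Upper-tail p-value of a Chi-Square statistic, valid for DF = 1 only."""
    return erfc(sqrt(chi2 / 2))

df = degrees_of_freedom(2, 2)
p = p_value_df1(4.102)   # Chi-Square value from the example
print(df, round(p, 3))   # 1 0.043
```

A p-value of about 0.043 means that, if pet preference were independent of gender, a difference at least this large would be expected in roughly 4% of samples.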