Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU

This document discusses methods for measuring the dispersion of data, including quartiles, outliers, boxplots, variance, and standard deviation. It defines quartiles as the 25th and 75th percentiles (Q1 and Q3), and the interquartile range as Q3 - Q1. Variance measures the average squared deviation from the mean, while standard deviation is the square root of variance and indicates how spread out the data are from the mean. The empirical rule states that approximately 68% of the data lie within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations, assuming a normal distribution.

Uploaded by

Dipty Sarker

DATA MINING

CSE-443

Ayesha Aziz Prova


Lecturer,
Dept. of CSE
CWU
MEASURING THE DISPERSION OF DATA

• Quartiles, outliers and boxplots

– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– Inter-quartile range: IQR = Q3 – Q1
– Five-number summary: min, Q1, median (M), Q3, max
– Boxplot: the ends of the box are the quartiles Q1 and Q3, the median is marked inside the box, whiskers extend from the box toward the smallest and largest non-outlying values, and outliers are plotted individually
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

• Variance and standard deviation (sample: s, population: σ)


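– Variance (algebraic, scalable computation):

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right] $$

$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2 $$

– Standard deviation s (or σ) is the square root of the variance s² (or σ²)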
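
The quartile-based measures on this slide (five-number summary, IQR, and the 1.5 × IQR outlier rule) can be sketched in Python. This is a minimal illustration and not part of the original slides: the function names `five_number_summary` and `iqr_outliers` are hypothetical, the data are the seven metabolic rates from the worked example in this deck, and the quartiles come from `statistics.quantiles(..., method="inclusive")`, which is only one of several quartile conventions (others give slightly different Q1 and Q3).

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" interpolation rule; textbooks differ here.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [x for x in values if x < low_fence or x > high_fence]


# Metabolic rates of 7 men (cal./24hr.) from the worked example in the slides.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
summary = five_number_summary(rates)
print("min, Q1, median, Q3, max:", summary)
print("outliers:", iqr_outliers(rates))
```

Under this quartile convention none of the seven rates falls outside the fences, so the boxplot for these data would show no individually plotted points.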


VARIANCE AND STANDARD DEVIATION
• While the central value of a sample is important, the variations around that value are often equally important.
• Recall that variability exists when some values are different from (above or below) the mean.
• What is a typical deviation from the mean?
– Small values of this typical deviation indicate small variability in the data.
– Large values of this typical deviation indicate large variability in the data.
VARIANCE AND STANDARD DEVIATION
• Each data value has an associated deviation from the mean: $x_i - \bar{x}$

• The mean deviation is a measure of dispersion that
– takes the distance between each data point and the mean, $|x_i - \bar{x}|$, and
– finds the average of these distances: $\frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}|$
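
As a small sketch of the mean deviation described above (not from the original slides), the following Python computes it for the five test scores used later in this deck. The function name `mean_deviation` is hypothetical. The example also shows why absolute distances are needed: the signed deviations always sum to zero.

```python
def mean_deviation(values):
    """Average absolute distance between each value and the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


# Test scores from the worked example later in the slides.
scores = [92, 88, 80, 68, 52]
mean = sum(scores) / len(scores)

# Signed deviations cancel out, so their sum carries no information.
signed_sum = sum(x - mean for x in scores)

print("sum of signed deviations:", signed_sum)
print("mean deviation:", mean_deviation(scores))
```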
VARIANCE AND STANDARD DEVIATION
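• Variance is the average squared deviation from the mean of a set of data.
• It is used to find the standard deviation.
• Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$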
CALCULATING VARIANCE

1. Find the mean of the data.
Hint: the mean is the average, so add up the values and divide by the number of items.

2. Subtract the mean from each value. The result is called the deviation from the mean.

3. Square each deviation from the mean.

4. Find the sum of the squares.

5. Divide the total by the number of items. This gives the population variance; for the sample variance, divide by the number of items minus 1.
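
The five steps above can be sketched directly in Python. This is an illustrative sketch, not the lecturer's code: the function name `variance_by_steps` and its `sample` flag are assumptions. Dividing by the number of items, as in step 5, yields the population variance; the optional flag switches to the n − 1 divisor used for a sample.

```python
def variance_by_steps(values, sample=False):
    """Compute the variance by following the five steps one at a time."""
    n = len(values)
    mean = sum(values) / n                       # step 1: mean
    deviations = [x - mean for x in values]      # step 2: deviations from the mean
    squares = [d ** 2 for d in deviations]       # step 3: squared deviations
    total = sum(squares)                         # step 4: sum of the squares
    divisor = n - 1 if sample else n             # step 5: divide (n, or n - 1 for a sample)
    return total / divisor


# Test scores from the worked example later in the slides.
scores = [92, 88, 80, 68, 52]
print("population variance:", variance_by_steps(scores))
print("sample variance:", variance_by_steps(scores, sample=True))
```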


VARIANCE AND STANDARD DEVIATION

• The sample variance is defined as follows:

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 $$
VARIANCE AND STANDARD DEVIATION
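• The short-cut sample variance is defined as follows:

$$ s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right] $$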
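
A common form of the short-cut formula is s² = [Σxᵢ² − (Σxᵢ)²/n] / (n − 1). Assuming that is the form intended on this slide, the sketch below shows why it is called a short cut: it needs only one pass over the data, keeping a running count, sum, and sum of squares. The function name `shortcut_sample_variance` is hypothetical, and `statistics.variance` is used only as a cross-check.

```python
import statistics


def shortcut_sample_variance(values):
    """One-pass sample variance from the running count, sum, and sum of squares."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


# Metabolic rates of 7 men (cal./24hr.) from the worked example in the slides.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
shortcut = shortcut_sample_variance(rates)
reference = statistics.variance(rates)  # two-pass definition, for comparison
print(shortcut, reference)
```

The two results agree here. With floating-point data whose mean is very large relative to its spread, the subtraction in the short-cut form can lose precision, so the definitional form is the safer choice in that case.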
VARIANCE AND STANDARD DEVIATION

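• The population variance is defined as follows:

$$ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 $$

• where σ² is called the variance of the population,
• the population mean is represented by μ, and N is the population size.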
VARIANCE AND STANDARD DEVIATION

Metabolic rates of 7 men (cal./24hr.) :


1792 1666 1362 1614 1460 1867 1439
VARIANCE AND STANDARD DEVIATION

Observations    Deviations              Squared deviations

1792            1792 - 1600 = 192       (192)² = 36,864
1666            1666 - 1600 = 66        (66)² = 4,356
1362            1362 - 1600 = -238      (-238)² = 56,644
1614            1614 - 1600 = 14        (14)² = 196
1460            1460 - 1600 = -140      (-140)² = 19,600
1867            1867 - 1600 = 267       (267)² = 71,289
1439            1439 - 1600 = -161      (-161)² = 25,921
                sum = 0                 sum = 214,870
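
The table above can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic, not part of the original slides; the variable names are arbitrary, and the last line applies the sample-variance divisor n − 1 defined earlier in the deck.

```python
# Metabolic rates of 7 men (cal./24hr.) from the slide.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

mean = sum(rates) / len(rates)
deviations = [x - mean for x in rates]
squared = [d * d for d in deviations]

print(f"{'obs':>6} {'dev':>8} {'dev^2':>10}")
for x, d, sq in zip(rates, deviations, squared):
    print(f"{x:>6} {d:>8.0f} {sq:>10,.0f}")

sum_dev = sum(deviations)
sum_sq = sum(squared)
sample_variance = sum_sq / (len(rates) - 1)

print("mean:", mean)
print("sum of deviations:", sum_dev)
print("sum of squared deviations:", sum_sq)
print("sample variance:", round(sample_variance, 2))
```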
VARIANCE AND STANDARD DEVIATION
• Sample variance of the metabolic rates: $s^2 = \frac{214{,}870}{7-1} \approx 35{,}811.67$
VARIANCE AND STANDARD DEVIATION

• The standard deviation is the positive square root of the variance:

• Population standard deviation: $\sigma = \sqrt{\sigma^2}$
• Sample standard deviation: $s = \sqrt{s^2}$
• If the data are close together, the standard deviation will be small.
• If the data are spread out, the standard deviation will be large.
VARIANCE AND STANDARD DEVIATION
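• Standard deviation of the population: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$

• Standard deviation of the sample: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}$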


STANDARD DEVIATION
1. Find the mean of the data.
2. Subtract the mean from each value.
3. Square each deviation from the mean.
4. Find the sum of the squares.
5. Divide the total by the number of items (or by the number of items minus 1 for a sample).
6. Take the square root of the variance.
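
The steps for the standard deviation can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `standard_deviation` and its `sample` flag are hypothetical, and the data are the seven metabolic rates from the worked example in this deck. The only change from the variance calculation is the final square root.

```python
import math


def standard_deviation(values, sample=False):
    """Square root of the variance; divides by n - 1 when sample=True."""
    n = len(values)
    mean = sum(values) / n
    squared = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    variance = sum(squared) / divisor
    return math.sqrt(variance)


# Metabolic rates of 7 men (cal./24hr.) from the worked example in the slides.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
population_sd = standard_deviation(rates)
sample_sd = standard_deviation(rates, sample=True)
print("population standard deviation:", population_sd)
print("sample standard deviation:", sample_sd)
```

Because the square root undoes the squaring, the result is in the same units as the data (cal./24hr. here), which makes it easier to interpret than the variance.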
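VARIANCE AND STANDARD DEVIATION
• Sample standard deviation of the metabolic rates: $s = \sqrt{\frac{214{,}870}{7-1}} = \sqrt{35{,}811.67} \approx 189.24$ cal./24hr.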
VARIANCE AND STANDARD DEVIATION

The math test scores of five students are: 92, 88, 80, 68 and 52.

1) Find the mean: (92 + 88 + 80 + 68 + 52)/5 = 380/5 = 76.
2) Find the deviations from the mean:
92 - 76 = 16
88 - 76 = 12
80 - 76 = 4
68 - 76 = -8
52 - 76 = -24
VARIANCE AND STANDARD DEVIATION
3) Square the deviations from the mean:
16² = 256, 12² = 144, 4² = 16, (-8)² = 64, (-24)² = 576

4) Find the sum of the squares of the deviations from the mean:
256 + 144 + 16 + 64 + 576 = 1056
VARIANCE AND STANDARD DEVIATION
5) Divide by the number of data items to find the variance (the five scores are treated here as the whole population):
1056/5 = 211.2

6) Find the square root of the variance:
√211.2 ≈ 14.53

Thus the standard deviation of the test scores is about 14.53.
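
The result of this worked example can be checked with Python's standard `statistics` module. This sketch is not from the original slides. It assumes, as the division by 5 implies, that the scores are treated as a full population, and it also prints the sample version for comparison.

```python
import statistics

# Math test scores of the five students from the slide.
scores = [92, 88, 80, 68, 52]

pop_variance = statistics.pvariance(scores)   # divides by n
pop_sd = statistics.pstdev(scores)            # square root of the line above
sample_sd = statistics.stdev(scores)          # divides by n - 1 instead

print("mean:", statistics.mean(scores))
print("population variance:", pop_variance)
print("population standard deviation:", round(pop_sd, 2))
print("sample standard deviation:", round(sample_sd, 2))
```

If the five students were instead a sample from a larger class, the n − 1 divisor would apply and the standard deviation would come out larger, about 16.25 rather than 14.53.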
VARIANCE AND STANDARD DEVIATION
• Empirical Rule
• If the data have a bell-shaped (normal) distribution, then the bounds given by Chebyshev’s theorem can be improved.
• For such a data set, the proportion (or fraction) of the data lying
– within 1 standard deviation of the mean is about 68%.
– within 2 standard deviations of the mean is about 95%.
– within 3 standard deviations of the mean is about 99.7%.
VARIANCE AND STANDARD DEVIATION
PROPERTIES OF NORMAL DISTRIBUTION CURVE

• The normal (distribution) curve
– From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
– From μ–2σ to μ+2σ: contains about 95% of the measurements
– From μ–3σ to μ+3σ: contains about 99.7% of the measurements
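
The 68%, 95% and 99.7% figures can be recovered from the normal distribution itself. The sketch below is not part of the original slides and uses `statistics.NormalDist` from the Python standard library. It also prints Chebyshev's lower bound, 1 − 1/k², which holds for any distribution with finite variance, to show how much the normality assumption tightens the statement. The function names are hypothetical.

```python
from statistics import NormalDist


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


def chebyshev_lower_bound(k):
    """Chebyshev's guarantee: at least this fraction lies within k standard deviations."""
    return 1 - 1 / k ** 2


for k in (1, 2, 3):
    print(
        f"k={k}: normal {normal_coverage(k):.4f}, "
        f"Chebyshev at least {chebyshev_lower_bound(k):.4f}"
    )
```

For k = 1 Chebyshev's bound is 0, which says nothing, while the normal curve already places about 68% of the measurements in that interval.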
SUMMARY MEASURES
Describing Data Numerically

• Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
• Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
• Shape: Skewness
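
As a closing sketch (not from the original slides), the summary measures listed above can be computed together in Python. Several choices here are assumptions: the function name `summary_measures` is hypothetical, the data set is made up purely for illustration, quartiles use the "inclusive" convention, variance and standard deviation use the sample (n − 1) form, the coefficient of variation is taken as standard deviation divided by mean, and skewness is the moment coefficient m₃ / m₂^1.5, one of several definitions in use.

```python
import statistics


def summary_measures(values):
    """Return a dict of common central-tendency, variation, and shape measures."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Second and third central moments, used for the skewness coefficient.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "geometric_mean": statistics.geometric_mean(values),  # positive data only
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),
        "std_dev": sd,
        "coefficient_of_variation": sd / mean,
        "skewness": m3 / m2 ** 1.5,
    }


# Made-up illustrative data (not from the slides).
sample = [2, 4, 4, 5, 7, 9]
result = summary_measures(sample)
for name, value in result.items():
    print(f"{name}: {value}")
```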
THANKS

Any questions?