Distribution in Statistics
Table of Contents
1 Frequency Distribution
2 Normal Probability Distribution
2.1 Normal Distribution Assumptions
3 F-Distribution
4 Measures of Central Tendency
4.1 Statistical Mean
4.1.1 Arithmetic Mean
4.1.2 Geometric Mean
4.1.3 Calculate Median
4.2 Statistical Mode
4.3 Range (Statistics)
5 Statistical Variance
5.1 Measurement Of Uncertainty: Standard Deviation
5.1.1 Calculate Standard Deviation
6 Quartile
7 Trimean
Copyright Notice
Copyright © Explorable.com 2014. All rights reserved, including the right of reproduction in
whole or in part in any form. No parts of this book may be reproduced in any form without
written permission of the copyright owner.
Notice of Liability
The author(s) and publisher both used their best efforts in preparing this book and the
instructions contained herein. However, the author(s) and the publisher make no warranties of
any kind, either expressed or implied, with regards to the information contained in this book,
and especially disclaim, without limitation, any implied warranties of merchantability and
fitness for any particular purpose.
In no event shall the author(s) or the publisher be responsible or liable for any loss of profits or
other commercial or personal damages, including but not limited to special incidental,
consequential, or any other damages, in connection with or arising out of furnishing,
performance or use of this book.
Trademarks
Throughout this book, trademarks may be used. Rather than put a trademark symbol in every
occurrence of a trademarked name, we state that we are using the names in an editorial
fashion only and to the benefit of the trademark owner with no intention of infringement of the
trademarks. Thus, copyrights on individual photographic, trademarks and clip art images
reproduced in this book are retained by the respective owner.
Information
Published by Explorable.com.
1 Frequency Distribution
A frequency distribution gives us an idea about how frequently a given data point occurs and
how probable it is to occur.
For example, consider a fair coin that is tossed four times. We want to derive the frequency
distribution for the number of heads that can occur. There are different possibilities, through
which these heads might occur, which are summarized in the table below:
No. of Heads   Outcomes of the four coin flips                          Number of ways
0              T-T-T-T                                                  1
1              H-T-T-T, T-H-T-T, T-T-H-T, T-T-T-H                       4
2              H-H-T-T, H-T-H-T, H-T-T-H, T-H-H-T, T-H-T-H, T-T-H-H     6
3              T-H-H-H, H-T-H-H, H-H-T-H, H-H-H-T                       4
4              H-H-H-H                                                  1
The frequency distribution is easy to see. On average, if the number of flips is very high,
then out of every 16 coin flips, 1 will end up with 0 heads, 4 will end up with 1 head, 6 will
end up with 2 heads, 4 will end up with 3 heads and 1 will end up with all 4 heads. This of
course assumes that the coin used for the experiment is a fair coin, with an equal
probability of a head and a tail on any given flip.
In the above case, the coin is flipped only 4 times. If the coin is tossed many more times,
say 100 times, and the frequency distribution drawn, its shape will closely approximate a
normal probability distribution.
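The enumeration in the table above can be reproduced programmatically. The sketch below (Python, standard library only) lists all 16 equally likely outcomes of four fair coin flips and tallies how many produce each number of heads:

```python
from itertools import product
from collections import Counter

# All 2**4 = 16 equally likely outcomes of four flips of a fair coin
outcomes = list(product("HT", repeat=4))

# Tally how many outcomes produce each number of heads
frequency = Counter(seq.count("H") for seq in outcomes)

for heads in sorted(frequency):
    print(heads, frequency[heads])  # 0 1, 1 4, 2 6, 3 4, 4 1
```

The counts 1, 4, 6, 4, 1 match the table exactly.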
Siddharth Kalla (Feb 20, 2010). Frequency Distribution. Retrieved from Explorable.com:
https://explorable.com/frequency-distribution
2 Normal Probability Distribution
These are symmetric in nature and peak at the mean, with the probability distribution
decreasing away before and after this mean smoothly, as shown in the figure below.
The figure also shows a family of curves with different peaks centered about the same mean,
which differ in their spread and height.
Normal distribution occurs very frequently in statistics, economics, natural and social sciences
and can be used to approximate many distributions occurring in nature and in the manmade
world.
For example, the height of all people of a particular race, the length of all dogs of a particular
breed, IQ, memory and reading skills of people in a general population and income
distribution in an economy all approximately follow the normal probability distribution shaped
like a bell curve.
The theory of normal distribution also finds use in advanced sciences like astronomy,
photonics and quantum mechanics.
The normal distribution can be characterized by the mean and standard deviation. The mean
determines where the peak occurs, which is at 0 in our figure for all the curves. The standard
deviation is a measure of the spread of the normal probability distribution, which can be seen
as differing widths of the bell curves in our figure.
The Formula
The mean is generally represented by μ and the standard deviation by σ. For a perfect normal
distribution, the mean, median and mode are all equal. The normal distribution function can be
written in terms of the mean and standard deviation as follows:
p(x) = (1 / (σ√(2π))) · exp( −(x − μ)² / (2σ²) )
From this distribution, it can be inferred that about 68% of all values lie within one
standard deviation of the mean, 95.4% of all values lie within two standard deviations of
the mean, and 99.7% of all values lie within three standard deviations of the mean.
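These percentages can be verified numerically: for a normal distribution, P(|X − μ| < kσ) equals erf(k/√2), where erf is the error function available in Python's standard library. A quick check:

```python
import math

# P(|X - mu| < k*sigma) for a normal distribution equals erf(k / sqrt(2))
for k in (1, 2, 3):
    p = math.erf(k / math.sqrt(2))
    print(f"within {k} standard deviation(s): {p:.1%}")  # 68.3%, 95.4%, 99.7%
```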
From the basic bell curve, many special cases can be derived that become meaningful in
different situations. For example, a distribution may be skewed to the left or right, or it
may have long tails.
A basic study of the normal distribution therefore is necessary before a meaningful study can
be made into these special cases. This concept can be extended to 3-D normal distributions
as well, which are used for more advanced applications.
2.1 Normal Distribution Assumptions
However, there are important special scenarios when this is not the case. An understanding of
the normal distribution assumptions will help researchers know the limitations of their
experiment and also help them understand their own study and where it breaks down.
Normal distribution assumptions can be relaxed in some situations, but this leads to a more
complex analysis. If the physical process can be approximated by a normal distribution, it
yields the simplest analysis. However, some basic properties may be retained even when
distributions are not normal. For example, one might still assume symmetry, as in a
t-distribution, even if the distribution is not truly normal.
In fact, a number of different non-normal distributions are just variations of the normal
distribution. For example, a distribution might have a longer tail, which is a variation of the
normal distribution. Such distributions too are frequently encountered.
The reason for the normal distribution assumptions is that this is usually the simplest
mathematical model that can be used. In addition, it is surprisingly ubiquitous and it occurs in
most natural and social phenomena. This is why the assumption of normality is usually a good
first approximation.
Error Analysis
One of the most common uses of the assumption of normality is in error analysis: we usually
assume that random errors follow a normal distribution. This assumption can break down when
there are multiple, correlated sources of error. In addition, if the errors are not truly
random, the assumption might not be valid. If the error distribution is not normal but
normality is assumed anyway, the statistical analysis can be incorrect and lead to
erroneous conclusions.
Tests
There are statistical tests that a researcher can undertake to help determine whether the
normal distribution assumptions are valid. One quick check is to compare the sample means
to the population mean: for a normally distributed population, the sampling distribution of
the mean is also normal when the samples contain sufficiently many items.
The assumption of normality is valid in most cases, but when it is not, it can lead to
serious trouble. Also, since this assumption is often made implicitly, it is hard to spot
and sometimes difficult to question. Therefore care must be taken to ensure that the
researcher is aware not just of the assumption of normality but of all the assumptions that
go into a statistical analysis. This helps define the scope of the experiment, and if
something is not as expected, makes it easier to find the reason for the discrepancy.
Siddharth Kalla (Sep 11, 2011). Normal Distribution Assumptions. Retrieved from
Explorable.com: https://explorable.com/normal-distribution-assumptions
3 F-Distribution
Suppose we have two samples with n1 and n2 observations. The ratio F = s1² / s2², where s1²
and s2² are the sample variances, is distributed according to an F-distribution with
v1 = n1 − 1 numerator degrees of freedom and v2 = n2 − 1 denominator degrees of freedom.
For example, if F follows an F-distribution with 4 numerator degrees of freedom and 10
denominator degrees of freedom, then F ~ F4,10. For each combination of these degrees of
freedom there is a different F-distribution. The F-distribution is most spread out when the
degrees of freedom are small; as the degrees of freedom increase, the F-distribution
becomes less dispersed.
Properties
The F-distribution has the following properties:
The F-distribution is skewed to the right, and the F-values can be only positive. The curve
reaches a peak not far to the right of 0, and then gradually approaches the horizontal axis.
The F-distribution approaches, but never quite touches the horizontal axis.
Uses
The main use of the F-distribution is to test whether two independent samples have been
drawn from normal populations with the same variance, or whether two independent estimates
of the population variance are homogeneous, since it is often desirable to compare two
variances rather than two averages. For instance, college administrators would prefer two
professors grading the same exam to have similar variation in their grading. The F-test can
be used for this; after examining the p-value, an inference can be drawn about the variation.
Assumptions
In order to perform F-test of two variances, it is important that the following are true:
The populations from which the two samples are drawn are normally distributed.
The two populations are independent of each other.
If the two populations have equal variances, then s1² and s2² are close in value and F is
close to 1. But if the two population variances are very different, s1² and s2² tend to be
very different, too.
Choosing s1² as the larger sample variance causes the ratio to be greater than 1. If s1²
and s2² are far apart, then F is a large number. Therefore, if F is close to 1, the
evidence favours the null hypothesis (the two population variances are equal). But if F is
much larger than 1, then the evidence is against the null hypothesis, and we can infer that
the population variances possibly differ to a large extent.
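As a small illustration of this logic, the F statistic is just the ratio of the two sample variances with the larger one in the numerator. The grader scores below are made-up data for illustration, not figures from the text:

```python
import statistics

# Hypothetical exam scores for two graders (illustrative data only)
grader_a = [78, 82, 85, 88, 91, 74, 80, 86]   # consistent grader
grader_b = [60, 95, 72, 99, 55, 88, 70, 92]   # erratic grader

s2_a = statistics.variance(grader_a)  # sample variance, denominator n - 1
s2_b = statistics.variance(grader_b)

# Put the larger sample variance in the numerator so that F >= 1
F = max(s2_a, s2_b) / min(s2_a, s2_b)
print(round(F, 2))  # well above 1, suggesting the variances differ
```

Judging whether this F is significantly large requires comparing it against the F-distribution with the appropriate degrees of freedom (7 and 7 here), which is usually done with statistical tables or software.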
ANOVA and F
In the technique known as Analysis of Variance (ANOVA), which plays a very important role
in Design of Experiments, the variance ratio test is applied to test the significance of
different components of variation against error variation.
For example, a new drug for treating osteoporosis may need to be field-tested. Since the
severity of this disease is generally a function of age, the new drug could be administered
randomly to n patients in each age group. Put differently, this would be an experiment in m
age groups with n different dosage levels of the drug allocated randomly to the patients.
With figures provided from patients for each age group × dose combination, we can use the
variance ratio test (F-test) to test for differences between dose levels and whether this
variation can be attributed to chance.
The other uses include testing the significance of the correlation ratio between two random
variables, and to test the linearity of regression.
4 Measures of Central Tendency
Individual scores by themselves may mean little but when looked at from a group point of
view, they may reveal the whole picture.
For example, if you say you saw an insect 10 cm long, it doesn't mean anything by itself.
However, if you say that the normal length of the insect is about 6 cm and the maximum
recorded length ever is 10.4 cm, then it means you saw a particularly large insect.
Therefore it is important to be able to quantify the "normal length" as used above, and
this is what central tendency is all about.
The arithmetic mean is one of the most commonly used measures of central tendency. For a
set of numbers, the mean is simply the average, i.e. sum of all the numbers divided by the
number of numbers.
The median is another frequently used measure of central tendency. The median is simply the
midpoint of the distribution, i.e. there are as many numbers above it as below it.
If the number of data points is odd, then the median is simply the middle number. Therefore
the median of 3, 5, 6, 9, 15 is 6.
If the number of data points is even, then the median is the mean of the middle two numbers.
Therefore the median of 2, 7, 15, 20 is (7+15)/2 = 11.
The median is particularly useful when a few data points are vastly different. For example,
in calculating the central measure of the salaries obtained by a group of graduates, it may
happen that a couple of students have got extraordinarily high salaries. This will pull the
mean of the group's salaries to a very high value, but the median will truly reflect the
placement scenario as it is.
Another commonly used measure of central tendency in specific cases is the mode. The mode
is simply the most commonly occurring value. For example, in a class of 50 students graded
on a scale of 1-5, the distribution may be as shown in the figure. The mode of this data is 4.
Different types of data need different measures of central tendency to describe the distribution
of data. For highly skewed data, none of these may be sufficient, and we may need to go for
other specialized measures or simply report them all in a table.
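All three measures can be computed directly with Python's standard statistics module. The grade list below is illustrative (the figure mentioned above is not reproduced here):

```python
import statistics

# Illustrative grades on a 1-5 scale for a small class
grades = [2, 3, 3, 4, 4, 4, 4, 5, 5, 1]

print(statistics.mean(grades))    # 3.5
print(statistics.median(grades))  # 4.0
print(statistics.mode(grades))    # 4, the most common grade
```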
Siddharth Kalla (Nov 22, 2009). Measures of Central Tendency. Retrieved from
Explorable.com: https://explorable.com/measures-of-central-tendency
4.1 Statistical Mean
In statistics, the statistical mean, or statistical average, gives a very good idea about
the central tendency of the data being collected.
Statistical mean gives important information about the data set at hand, and as a single
number, can provide a lot of insights into the experiment and nature of the data.
Examples
The concept of statistical mean has a very wide range of applicability in statistics for a number
of different types of experimentation.
For example, if a simple pendulum is being used to measure the acceleration due to gravity, it
makes sense to take a set of values, and then average the final result. This eliminates the
random errors in the experiment and usually gives a more accurate value than a single
experiment carried out.
The statistical mean also gives a good idea about interpreting the statistical data.
For example, the mean life expectancy in Japan is higher than that of Brazil, which
suggests that on average, people in Japan are likely to live longer. There may be many
plausible explanations for this, such as better healthcare facilities in Japan, but the
truth is that we do not know until we measure it.
Similarly, the mean height of people in Russia is higher than that of China, which means
that on average, you will find Russians to be taller than Chinese.
Statistical mean is a measure of central tendency and gives us an idea about where the data
seems to cluster around.
For example, the mean marks obtained by students in a test are needed to correctly gauge a
student's performance in that test. If a student scores a low percentage but is well ahead
of the mean, then the test was difficult and the performance is therefore good, something
that the percentage alone cannot tell us.
There are different kinds of statistical means or measures of central tendency for the data
points. Each one has its own utility. The arithmetic mean, geometric mean, median and mode
are some of the most commonly used measures of statistical mean. They make sense in
different situations, and should be used according to the distribution and nature of the data.
For example, the arithmetic mean is frequently used in scientific experimentation, the
geometric mean is used in finance to calculate compounding quantities, the median is used as
a robust mean in case of skewed data with many outliers and the mode is frequently used in
determining the most frequently occurring data, like during an election.
Siddharth Kalla (Jan 13, 2009). Statistical Mean. Retrieved from Explorable.com:
https://explorable.com/statistical-mean
4.1.1 Arithmetic Mean
The arithmetic mean is perhaps the most commonly used statistical mean to measure
the central tendency of data.
The arithmetic mean is also called the "average". It is used in most scientific experiments.
arithmetic mean = (sum of all data points) / (total number of data points)
EXAMPLES
If there are three numbers in a data set, add them and divide by three:
arithmetic mean = (x1 + x2 + x3) / 3
With four numbers, add them and divide by four:
arithmetic mean = (x1 + x2 + x3 + x4) / 4
For example, the time in seconds taken for a particular chemical reaction under the same
laboratory conditions might give values of 11.6, 12.1, 11.8, 11.5 and 12.0.
arithmetic mean = (sum of data points) / (total number of data points)
arithmetic mean = (11.6 + 12.1 + 11.8 + 11.5 + 12.0) / 5
arithmetic mean = 11.8
This would be reported as the time for the reaction as measured by the experiment. The
differences between individual readings can be attributed to random errors, such as random
fluctuations in temperature and humidity in the laboratory.
The mean gives very useful information in cases where the data is relatively symmetric. For
example, if the data is nearly normally distributed, then the mean is the best measure of
central tendency. However, if the data is very skewed, then the arithmetic mean might
become misleading.
For example, a commonly quoted number in the placement statistics of business schools is
the average salary of the outgoing batch, even though a few unusually high salaries can
skew it upwards.
However, in some cases, even when the data is skewed, the arithmetic mean gives valuable
information about the data, provided it is interpreted in the right manner.
Per-capita Gross Domestic Product (GDP), used in economics to gauge the financial
well-being of a country, is an arithmetic mean: the total value of goods and services
produced in the country averaged over its total population. It tells us nothing about the
distribution of wealth inside the country, but it can be a good parameter for the country
to work with in improving the economic condition of its citizens.
Siddharth Kalla (Jul 7, 2009). Arithmetic Mean. Retrieved from Explorable.com:
https://explorable.com/arithmetic-mean
4.1.2 Geometric Mean
The geometric mean is relevant on certain sets of data, and is different from the
arithmetic mean. Mathematically, the geometric mean is the nth root of the product of n
numbers.
geometric mean = (a1 × a2 × … × aN)^(1/N)
Where:
N = the number of data points
a = the score of a data point
The geometric mean is relevant on those sets of data that are products or exponential in
nature. This includes a variety of branches of natural sciences and social sciences.
Examples
For example, if a strain of bacteria increases its population by 20% in the first hour, 30% in the
next hour and 50% in the next hour, we can find out an estimate of the mean percentage
growth in population.
In this case, it is the geometric mean, and not the arithmetic mean that is relevant. To see
this, start off with 100 bacteria.
After the first hour, they grow to 120 bacteria, a growth factor of 1.2 (100 × 1.2);
After the second hour, they grow to 156 bacteria, a growth factor of 1.3 (120 × 1.3);
After the third hour, they grow to 234 bacteria, a growth factor of 1.5 (156 × 1.5).
If we find the geometric mean of 1.2, 1.3 and 1.5, we get 1.3276. This should be interpreted
as the mean rate of growth of the bacteria over the period of 3 hours, which means if the
strain of bacteria grew by 32.76% uniformly over the 3 hour period, then starting with 100
bacteria, it would reach 234 bacteria in 3 hours.
Therefore whenever we have percentage growth over a period of time, it is the geometric
mean and not the arithmetic mean that makes sense.
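The bacteria example can be checked numerically. The sketch below computes the geometric mean of the three hourly growth factors and verifies that it reproduces the final population:

```python
import math

# Hourly growth factors from the example: +20%, +30%, +50%
factors = [1.2, 1.3, 1.5]

gm = math.prod(factors) ** (1 / len(factors))  # cube root of 2.34
print(round(gm, 4))  # 1.3276

# Growing by this mean factor each hour gives the same final population
print(round(100 * gm ** 3))  # 234
```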
Usages
In social sciences, we frequently encounter this in a number of ways. For example, the human
population growth is expressed as a percentage, and thus when population growth needs to
be averaged, it is the geometric mean that is most relevant.
In surveys and studies too, the geometric mean becomes relevant. For example, if a survey
found that over the years, the economic status of a poor neighborhood is getting better, they
need to quote the geometric mean of the development, averaged over the years in which the
survey was conducted. The arithmetic mean will not make sense in this case either.
In economics, we see the percentage growth in interest accumulation. Thus if you are starting
out with a sum of money that is compounded for interest, then the mean that you should look
for is the geometric mean. Many such financial instruments like bonds yield a fixed
percentage return, and while quoting their “average” return, it is the geometric mean that
should be quoted.
Siddharth Kalla (Aug 21, 2009). Geometric Mean. Retrieved from Explorable.com:
https://explorable.com/geometric-mean
4.1.3 Calculate Median
The median is central to many experimental data sets, and to calculate median in such
examples is important, by not falling into the trap of reporting the arithmetic mean.
The median can be seen to be the “middle value” of the distribution, i.e. it separates the upper
and lower halves.
Calculation - Examples
To calculate median, consider the following example. Suppose we have the heights of
different trees in a garden, and we need an “average” value for this. Say the heights in meters
are 1.5, 6.9, 2.8, 1.8 and 2.3.
1. Arrange order. Arrange the numbers in either ascending or descending order, such as
1.5, 1.8, 2.3, 2.8 and 6.9.
2. Choose middle value. Choose the value which is exactly in the middle of the order.
1.5, 1.8, 2.3, 2.8, 6.9
The median of this distribution is 2.3, which is the middle value, separating the lower half
{1.5, 1.8} from the upper half {2.8, 6.9}.
Median = 2.3
3. In case of two middle values - Use the Arithmetic Mean on the two middle values
Suppose there are six heights (New value: 1.2), rather than five:
1.2, 1.5, 1.8, 2.3, 2.8, 6.9
This leaves two middle values, {1.8, 2.3}. The median for this data set is (1.8+2.3)/2 =
2.05
Median = 2.05
One can immediately see that the data is skewed; the 6.9 meter tree makes it so. The
arithmetic mean of this data is 3.06 meters, which is greater than 4 out of 5 data points.
Thus the arithmetic mean doesn't make much sense in this case.
If the number of data points is even, unlike the example we first considered, then to
calculate the median we simply take the mean of the middle two elements. Thus if we have 10
numbers arranged in ascending order, the median is the average of the 5th and 6th numbers.
Another Example:
If the salaries of professionals graduating from college, in thousands of dollars per
annum, are 60, 64, 71, 73, 73, 77, 82, 85, 160 and 255, then their median salary is
(73+77)/2 = 75. The mean in this case is 100, which, as in the previous case, doesn't make
much sense and doesn't really tell us about the central tendency of the data.
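Python's statistics module implements exactly this rule; applied to the salary example above:

```python
import statistics

# Salaries (thousands of dollars per annum) from the example above
salaries = [60, 64, 71, 73, 73, 77, 82, 85, 160, 255]

print(statistics.median(salaries))  # (73 + 77) / 2 = 75.0
print(statistics.mean(salaries))    # 100, pulled up by the two outliers
```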
In cases where the data is skewed, it is the median that makes sense, not the mean. In
these cases, as an experimenter, you need to calculate the median, not the mean, for your
experiment. This is especially true when there are outliers. Many scientists report their
results in terms of both the median and the mean, to see whether the two agree.
The median is resistant to change with the discovery of outliers. For example, if we want to
know the mean weight of all the dinosaurs, it is a very difficult task because we do not yet
know all the types of dinosaurs that ever walked the earth. Therefore if a new type of dinosaur
bigger than all the others is discovered, it will significantly alter the mean.
However, the median remains almost unchanged in this case. Thus in many cases when the
end points of the data set are not known, you need to calculate median and not the mean for
that data set.
Siddharth Kalla (Oct 25, 2009). Calculate Median. Retrieved from Explorable.com:
https://explorable.com/calculate-median
4.2 Statistical Mode
Statistical mode tells us about the data point that is most frequently repeated in the
dataset.
For a symmetric data distribution, the statistical mode can be near the mean and median, but
for highly skewed data, the mode can be quite distinct.
Examples
For example, if the marks obtained by 15 students in a test are 81, 82, 85, 85, 89, 91, 91, 91,
91, 93, 93, 95, 96, 96 and 99, then the statistical mode for this data distribution is 91. This is
because 4 students have obtained this score, which is the highest number of students with the
same score.
It should be noted that unlike the mean and median, the mode doesn’t need to be unique.
This is because there can always be two or more data points with the same frequency.
Consider a revised example where the test scores of the 15 students are 81, 81, 85, 85, 89,
91, 91, 92, 92, 93, 93, 95, 96, 96 and 99. In this case, two students share each of the
marks 81, 85, 91, 92, 93 and 96, which are all modes of this data distribution.
Sometimes when the data is continuous, the usual definition of statistical mode is
inadequate. To see why, consider an experiment that measures the reaction times of
different subjects. Suppose the raw data gives times in milliseconds as 42.1, 48.3, 52.2,
52.6, 52.8, 52.9, 53.0, 53.1, 53.2, 53.7, 54.6, 55.8, 56.7, 58.0 and 60.9.
As can be seen, all the values are distinct, and therefore all the data points are the statistical
modes. However, intuitively, we can see that the data is clustered about the values in the
middle.
Therefore if we define an interval of 1 millisecond, then the interval from 52.5 to 53.5
forms the mode. This is a much more practical measure of the mode. The problem arises
because time is a continuous variable, so two measured times are usually not exactly equal
but only differ slightly from each other.
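Python's statistics.multimode returns every value that is tied for the highest frequency, which makes tied-mode cases like the revised score list above easy to check:

```python
import statistics

# Revised test scores from the example above
scores = [81, 81, 85, 85, 89, 91, 91, 92, 92, 93, 93, 95, 96, 96, 99]

print(statistics.multimode(scores))  # [81, 85, 91, 92, 93, 96]
```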
Advantage
A big advantage of statistical mode is that it is not restricted to numbers alone. For example,
among all the letters of the English alphabet, the mode is the letter ‘E’, which is the most
frequently encountered letter. However, we cannot define the median or mean letter, since
these can only be defined for numbers. This makes the scope of the mode quite broad in
nature.
4.3 Range (Statistics)
In statistics, range is defined simply as the difference between the maximum and
minimum observations. It is intuitively obvious why we define range in statistics this
way - range should suggest how diversely spread out the values are, and by computing
the difference between the maximum and minimum values, we can get an estimate of
the spread of the data.
For example, suppose an experiment involves finding out the weight of lab rats, and the
values in grams are 320, 367, 423, 471 and 480. In this case, the range is simply computed
as 480 − 320 = 160 grams.
However, the range is very sensitive to outliers. For example, in our previous case,
consider a small baby rat added to the data set that weighs only 50 grams. Now the range is
computed as 480 − 50 = 430 grams, which gives a false indication of the dispersion of the
data.
This limitation of range is to be expected primarily because range is computed taking only two
data points into consideration. Thus it cannot give a very good estimate of how the overall
data behaves.
Sometimes, we define range in such a way as to eliminate the outliers and extreme points in
the data set. For example, the interquartile range in statistics is defined as the
difference between the third and first quartiles. You can immediately see how this
definition of range is more robust than the previous one: here the outliers will not
matter, and the definition takes more of the distribution of the data into consideration
than just the maximum and minimum values.
It should be pointed out that in spite of several limitations, the range can be a useful
indicator in many cases. As a student of statistics, you should understand what kinds of
data are best described by the range. If there are too many outliers, it may not be a good
idea. But the range gives a quick and easy indication of the spread of the data.
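Both definitions of spread are easy to compute. The sketch below uses the rat weights from the example; for the interquartile range it relies on statistics.quantiles, whose default ("exclusive") method interpolates between sample points:

```python
import statistics

# Lab rat weights in grams from the example above
weights = [320, 367, 423, 471, 480]
print(max(weights) - min(weights))  # 160

# Adding the 50 g baby rat inflates the range dramatically
with_baby = sorted(weights + [50])
print(max(with_baby) - min(with_baby))  # 430

# Interquartile range: third quartile minus first quartile
q1, q2, q3 = statistics.quantiles(with_baby, n=4)
print(q3 - q1)  # much less sensitive to the single outlier
```

Note that the exact quartile values depend on the interpolation method used; different textbooks and libraries give slightly different numbers for small samples.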
Siddharth Kalla (Jun 10, 2011). Range (Statistics). Retrieved from Explorable.com:
https://explorable.com/range-in-statistics
5 Statistical Variance
Statistical variance gives a measure of how the data distribute themselves about the mean
or expected value. Unlike the range, which only looks at the extremes, the variance takes
all the data points into account before determining their distribution.
In many cases of statistics and experimentation, it is the variance that gives invaluable
information about the data distribution.

σ² = ∑ (X − µ)² / N

where:
σ² = the variance
∑ (X − µ)² = the sum of (X − µ)² over all data points
X = an individual data point
µ = the mean of the population
N = the number of data points

This means the variance is given by the average of the squared differences between the data
points and the mean.

As a worked example, take the five scores 67, 72, 85, 93 and 98:
1. Write down the formula for the variance:
σ² = ∑ (x − µ)² / N
2. Since there are five scores, N = 5:
σ² = ∑ (x − µ)² / 5
3. Compute the mean (µ) of the five scores (67, 72, 85, 93, 98): µ = 83.
σ² = ∑ (x − 83)² / 5
4. Now compare each score (x = 67, 72, 85, 93, 98) to the mean (µ = 83):
σ² = [ (67−83)² + (72−83)² + (85−83)² + (93−83)² + (98−83)² ] / 5
5. Work out each difference:
67 − 83 = −16
72 − 83 = −11
85 − 83 = 2
93 − 83 = 10
98 − 83 = 15
σ² = [ (−16)² + (−11)² + (2)² + (10)² + (15)² ] / 5
6. Then square each parenthesis, giving 256, 121, 4, 100 and 225:
σ² = [ 256 + 121 + 4 + 100 + 225 ] / 5
7. Sum the squares:
σ² = 706 / 5
8. Finally, divide the sum by 5 (because there were five scores). This is the variance of
the data set:
σ² = 141.2

For a sample rather than the whole population, the sample variance is computed as
s² = ∑ (x − x̄)² / (n − 1)
where s is the standard deviation of the sample. Note that the denominator is one less than
the sample size in this case.
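The step-by-step calculation above reduces to two lines of Python:

```python
# The five scores from the worked example above
scores = [67, 72, 85, 93, 98]

mu = sum(scores) / len(scores)                               # 83.0
variance = sum((x - mu) ** 2 for x in scores) / len(scores)  # 706 / 5
print(variance)  # 141.2
```

The standard library's statistics.pvariance gives the same result for the population variance.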
Usage
The concept of variance can be extended to continuous data sets too. In that case, instead
of summing the individual differences from the mean, we need to integrate them. This
approach is also useful when the number of data points is very large, like the population
of a country.
Variance is extensively used in probability theory, wherein from a given smaller sample set,
more generalized conclusions need to be drawn. This is because variance gives us an idea
about the distribution of data around the mean, and thus from this distribution, we can work
out where we can expect an unknown data point.
Siddharth Kalla (Mar 15, 2009). Statistical Variance. Retrieved from Explorable.com:
https://explorable.com/statistical-variance
5.1 Measurement Of Uncertainty: Standard Deviation
Examples
For example, the data points 50, 51, 52, 55, 56, 57, 59 and 60 have a mean at 55 (Blue).
Another data set is 12, 32, 43, 48, 64, 71, 83 and 87. This set too has a mean of 55 (Pink).
However, it can clearly be seen that the properties of these two sets are different. The first set
is much more closely packed than the second one. Through standard deviation, we can
measure this distribution of data about the mean.
The above example should make it clear that if the data points are values of the same
parameter measured in various experiments, then the first data set is a good fit, but the
second one is too uncertain. This is why standard deviation is important in the measurement
of uncertainty: the smaller the standard deviation, the lower the uncertainty, and thus the
greater the confidence in the experiment and the higher its reliability.
For example, if the mean household energy consumption is 200 units with a standard
deviation of 20 units, then about 68% of households consume between 180 and 220 units
(one standard deviation to either side of the mean). This assumes that the energy-consumption
data is normally distributed.
If a researcher considers three standard deviations to either side of the mean, this covers
about 99.7% of the data. Thus in the previous example, 99.7% of the households have their
energy consumption between 140 and 260 units. In most cases, this is considered to cover
essentially the whole data set, especially when the data can extend to infinity.
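For a normal distribution, the fraction of data within k standard deviations of the mean is erf(k/√2), which gives the familiar 68 / 95 / 99.7 rule. A quick check, using the mean of 200 and standard deviation of 20 from the example (the function name is illustrative):

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Mean 200, standard deviation 20, as in the energy-consumption example.
for k in (1, 2, 3):
    low, high = 200 - 20 * k, 200 + 20 * k
    print(f"{k} sd: {low}-{high} units, {fraction_within(k):.1%} of households")
```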
Usage
The measurement of uncertainty through standard deviation is used in many experiments of
social sciences and finances. For example, the more risky and volatile ventures have a higher
standard deviation. Also, a very high standard deviation of the results for the same survey, for
example, should make one rethink about the sample size and the survey as a whole.
5.1.1 Calculate Standard Deviation
The standard deviation is the square root of variance. Thus the way we calculate standard
deviation is very similar to the way we calculate variance.
In fact, to calculate standard deviation, we first need to calculate the variance, and then take
its square root.
σ = the standard deviation
xᵢ = each value in the dataset
x̄ (x-bar) = the arithmetic mean of the data (indicated simply as "mean" from now on)
N = the total number of data points
∑(xᵢ - mean)² = the sum of (xᵢ - mean)² over all data points
σ = √[ ∑(x - mean)² / N ]
Standard Deviation Calculation Example (for Population)
As an example, consider the IQ scores of a population given by 96, 104, 126, 134 and 140.
Try it yourself:
σ = √[ ∑(x - mean)² / 5 ]
3. What is the mean?
The mean of this data is (96+104+126+134+140)/5 = 120.
σ = √[ ∑(x - 120)² / 5 ]
4. What are the respective deviations from the mean?
The deviations from the mean are 96-120 = -24, 104-120 = -16, 126-120 = 6,
134-120 = 14 and 140-120 = 20.
σ = √[ ((96-120)² + (104-120)² + (126-120)² + (134-120)² + (140-120)²) / 5 ]
σ = √[ ((-24)² + (-16)² + (6)² + (14)² + (20)²) / 5 ]
5. Square and sum the deviations:
The sum of their squares is given by (-24)² + (-16)² + (6)² + (14)² + (20)² = 1464.
σ = √[ 1464 / 5 ]
6. Divide by the number of scores (minus one if it is a sample, not a population):
1464/5 = 292.8. The number inside the brackets is the variance of the data.
σ = √292.8
σ = 17.11
It can easily be seen (compare the sample calculation below) that the sample standard
deviation comes out larger than this population standard deviation for the same data.
Interpretation of Data
Calculation of standard deviation is important to correctly interpret the data. For example, in
physical sciences, a lower standard deviation for the same measurement implies higher
precision for the experiment.
Also, when the mean needs to be interpreted, it is important to quote the standard deviation
too. For example, the mean daily temperature in two cities might be 24 °C. However, if the
standard deviation is very large, it may mean extremes of temperature - too hot during the day
and too cold during the night (like a desert). On the other hand, if the standard deviation is
small, it means a fairly uniform temperature throughout the day (like a coastal region).
Standard Deviation Calculation Example (for a Sample)
For a sample, the standard deviation is given by:
σ = √[ ∑(x - mean)² / (N - 1) ]
where the denominator is N - 1 instead of the N of the previous case. This correction is
required to get an unbiased estimator for the standard deviation.
This follows the same calculation as the example above, for the standard deviation of a
population, with one exception: the division is by N - 1, not N. The example then proceeds as
before, except that there is a 4 where there was a 5:
σ = √[ ∑(x - mean)² / (5 - 1) ]
σ = √[ ∑(x - mean)² / 4 ]
3. What is the mean?
The mean of this data is (96+104+126+134+140)/5 = 120.
σ = √[ ∑(x - 120)² / 4 ]
4. What are the respective deviations from the mean?
The deviations from the mean are 96-120 = -24, 104-120 = -16, 126-120 = 6,
134-120 = 14 and 140-120 = 20.
σ = √[ ((96-120)² + (104-120)² + (126-120)² + (134-120)² + (140-120)²) / 4 ]
σ = √[ ((-24)² + (-16)² + (6)² + (14)² + (20)²) / 4 ]
5. Square and sum the deviations:
The sum of their squares is given by (-24)² + (-16)² + (6)² + (14)² + (20)² = 1464.
σ = √[ 1464 / 4 ]
6. Divide by the number of scores minus one (minus one since it is a sample, not a
population):
1464/4 = 366. The number inside the brackets is the variance of the data.
σ = √366
σ = 19.13
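Both calculations - population and sample - can be sketched in Python (the function name `std_dev` is illustrative):

```python
import math

def std_dev(data, sample=False):
    """Square root of the variance; divide by N - 1 for a sample, N for a population."""
    mean = sum(data) / len(data)
    sum_of_squares = sum((x - mean) ** 2 for x in data)
    denominator = len(data) - 1 if sample else len(data)
    return math.sqrt(sum_of_squares / denominator)

iq_scores = [96, 104, 126, 134, 140]
print(round(std_dev(iq_scores), 2))               # 17.11 (population)
print(round(std_dev(iq_scores, sample=True), 2))  # 19.13 (sample)
```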
Siddharth Kalla (Sep 27, 2009). Calculate Standard Deviation. Retrieved from
Explorable.com: https://explorable.com/calculate-standard-deviation
5.2 Standard Error of the Mean
The standard error of the mean, also called the standard deviation of the mean, is a
method used to estimate the standard deviation of a sampling distribution. To
understand this, first we need to understand why a sampling distribution is required.
As an example, consider an experiment that measures the speed of sound in a material along
the three directions (along x, y and z coordinates). By taking the mean of these values, we
can get the average speed of sound in this medium.
However, there are so many external factors that can influence the speed of sound, like small
temperature variations, reaction time of the stopwatch, pressure changes in the laboratory,
wind velocity changes, and other random errors. Thus, instead of taking the mean from one
set of measurements, we prefer to take several sets of measurements and compute a mean
each time. The resulting collection of means forms a sampling distribution. The standard error
of the mean then refers to how the mean changes from one repetition of the experiment to the
next.
The standard error of the mean is given by:
SEM = σ / √N
where σ is the standard deviation of the measurements and N is the number of measurements.
It can be seen from the formula that the standard error of the mean decreases as N increases.
This is expected, because if the mean at each step is calculated using many data points, then
a small deviation in one value will have less effect on the final mean.
The standard error of the mean tells us how the mean varies with different experiments
measuring the same quantity. Thus if the effect of random changes are significant, then the
standard error of the mean will be higher. If there is no change in the data points as
experiments are repeated, then the standard error of mean is zero.
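The standard error of the mean is the sample standard deviation divided by √N; a minimal sketch, with hypothetical measurement values purely for illustration:

```python
import math

def standard_error_of_mean(data):
    """SEM = s / sqrt(N), using the sample standard deviation s."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return s / math.sqrt(n)

# Hypothetical repeated speed-of-sound measurements (illustrative values only).
speeds = [340.1, 340.5, 339.8, 340.3, 340.0, 340.2]
print(round(standard_error_of_mean(speeds), 3))
```

With more measurements (larger N), the same function returns a smaller value, matching the point above.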
For a sample, the formula for the standard error of the estimate is given by:
σest = √[ ∑(Y - Y')² / (N - 2) ]
where Y refers to the individual data points, Y' to the corresponding estimated values, and N
is the sample size. Note that this is similar to the standard deviation formula, but has an N - 2
in the denominator instead of the N - 1 of the sample standard deviation.
Siddharth Kalla (Sep 21, 2009). Standard Error of the Mean. Retrieved from Explorable.com:
https://explorable.com/standard-error-of-the-mean
6 Quartile
Quartile is a useful concept in statistics and is conceptually similar to the median. The
first quartile is the data point at the 25th percentile, and the third quartile is the data
point at the 75th percentile. The 50th percentile is the median.
Median Revisited
To understand quartiles, let us revisit the median. To compute the median, we split the data
into two groups with an equal number of points; the middle value that separates these groups
is the median. In a similar fashion, if we divide the data into 4 equal groups instead of 2, the
first dividing point is the first quartile, the second dividing point is the second quartile (which is
the same as the median), and the third dividing point is the third quartile.
To see further what quartiles do: the first quartile lies at the 25th percentile. This means that
25% of the data is smaller than the first quartile and 75% of the data is larger than it.
Similarly, in the case of the third quartile, 25% of the data is larger than it while 75% is
smaller. For the second quartile, which is nothing but the median, 50% or half of the data is
smaller while half of the data is larger than this value.
Interpreting Quartiles
As you know, the median is a measure of the central tendency of the data but says nothing
about how the data is distributed in the two arms on either side of the median. Quartiles help
us measure this.
Thus if the first quartile is far away from the median while the third quartile is closer to it, it
means that the data points that are smaller than the median are spread far apart while the
data points that are greater than the median are closely packed together.
An Alternative View
Another way of understanding quartiles is to think of them as medians of the two sets of data
points separated by the median. In this case, the first quartile is the median of the data that is
smaller than the full median, while the third quartile is the median of the data that is larger
than the full median. Here "full median" means the median of the entire set of data.
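This "medians of halves" view can be sketched in Python. Note that this sketch uses the inclusive convention (for an odd number of points, the overall median is kept in both halves); other quartile conventions give slightly different values:

```python
def median(values):
    """Median of an already-sorted list."""
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

def quartiles(data):
    """Q1, Q2, Q3 via the 'median of each half' method (inclusive convention)."""
    values = sorted(data)
    n = len(values)
    mid = n // 2
    q2 = median(values)
    # For odd n, include the middle element in both halves (inclusive method).
    lower = values[: mid + 1] if n % 2 else values[:mid]
    upper = values[mid:]
    return median(lower), q2, median(upper)

heights = [155, 158, 161, 162, 166, 170, 171, 174, 179]
print(quartiles(heights))  # (161, 166, 171)
```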
It should be noted that quartiles are not limited to discrete variables but apply equally well to
continuous variables. In that case, you need to know the data distribution to work out the
quartiles. If the distribution is symmetric, like the normal distribution, then the first and third
quartiles are equidistant from the median, on either side.
7 Trimean
Trimean is a measure of central tendency, like the mean, median and mode. Its meaning is
sometimes confusing because it is defined in a manner different from these traditional
measures of central tendency.
Mathematical Formulation
The trimean is defined as the weighted average of the median and the two quartiles.
Mathematically it is written as:
TM = (Q1 + 2Q2 + Q3) / 4
where
TM = the trimean
Q2 = the median
Q1, Q3 = the first and third quartiles
Equivalently, TM = [Q2 + (Q1 + Q3)/2] / 2, which tells us that it is the average of the median
and the "quartile average", also called the midhinge.
The trimean takes not only the central tendency into account but also gives due importance to
the distribution of data. This is what makes the trimean a different statistical parameter than
the others, like median, that are frequently encountered.
A Sample Example
For example, consider the heights of students in a class, in cm, to be 155, 158, 161, 162, 166,
170, 171, 174 and 179. It is easy to see the median of this data is 166 cm.
Now consider another class where the heights of the students, again in cm, are 162, 162, 163,
165, 166, 175, 181, 186, and 192. It can be seen that the median height of the class is again
166 cm. However, a look at the two data distributions tells us that the distributions are quite
different in both these cases, even though they have the same median.
Now let us compute the trimean for the first case. The median as we saw was 166, the first
quartile is 161 and the third quartile is 171. Using the formula given above, the trimean is
computed as (161 + 2(166) + 171)/4 = 166.
In the second example, the median is the same 166, but the first quartile is 163 and the
third quartile is 181. Now the trimean is computed as (163 + 2(166) + 181)/4 = 169.
In the second case, the trimean is bigger than the median. As you can see, the third quartile is
farther away from the median than the first quartile, which essentially means that the data is
skewed toward the second half of the distribution. The trimean reflects this bias in the data
away from the median; this is how the quartiles enter the definition of the trimean.
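The two computations above can be sketched in Python (the function name is illustrative):

```python
def trimean(q1, q2, q3):
    """TM = (Q1 + 2*Q2 + Q3) / 4 - the average of the median and the midhinge."""
    return (q1 + 2 * q2 + q3) / 4

print(trimean(161, 166, 171))  # 166.0 (first class)
print(trimean(163, 166, 181))  # 169.0 (second class)
```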
Explorable.com - Copyright © 2008-2015 All Rights Reserved.