
Published on Explorable.com (https://explorable.com)

Distribution in Statistics

Table of Contents
1 Frequency Distribution
2 Normal Probability Distribution
2.1 Normal Distribution Assumptions

3 F-Distribution
4 Measures of Central Tendency
4.1 Statistical Mean
4.1.1 Arithmetic Mean
4.1.2 Geometric Mean
4.1.3 Calculate Median

4.2 Statistical Mode
4.3 Range (Statistics)

5 Statistical Variance
5.1 Measurement Of Uncertainty: Standard Deviation
5.1.1 Calculate Standard Deviation

5.2 Standard Error of the Mean

6 Quartile
7 Trimean

Copyright Notice

Copyright © Explorable.com 2014. All rights reserved, including the right of reproduction in
whole or in part in any form. No parts of this book may be reproduced in any form without
written permission of the copyright owner.

Notice of Liability
The author(s) and publisher both used their best efforts in preparing this book and the
instructions contained herein. However, the author(s) and the publisher make no warranties of
any kind, either expressed or implied, with regard to the information contained in this book,
and specifically disclaim, without limitation, any implied warranties of merchantability and
fitness for any particular purpose.

In no event shall the author(s) or the publisher be responsible or liable for any loss of profits or
other commercial or personal damages, including but not limited to special, incidental,
consequential, or any other damages, in connection with or arising out of the furnishing,
performance or use of this book.

Trademarks
Throughout this book, trademarks may be used. Rather than put a trademark symbol in every
occurrence of a trademarked name, we state that we are using the names in an editorial
fashion only and to the benefit of the trademark owner with no intention of infringement of the
trademarks. Thus, copyrights on individual photographic, trademark and clip art images
reproduced in this book are retained by their respective owners.

Information
Published by Explorable.com.

Cover design by Explorable / Gilli.me.

1 Frequency Distribution

A frequency distribution is a curve that gives the frequency of occurrence of each
data point in an experiment. It is usually the limit of a histogram of
frequencies when the number of data points is very large and the results can be treated as
varying continuously instead of taking on discrete values.

A frequency distribution gives us an idea about how frequently a given data point occurs and
how probable it is to occur.

A frequency distribution is related to a probability distribution. While a frequency distribution gives
the exact frequency or number of times a data point occurs, a probability distribution gives
the probability of occurrence of that data point. When the number of test cases is large,
the frequency distribution and the probability distribution are similar in shape.

For example, consider a fair coin that is tossed four times. We want to derive the frequency
distribution for the number of heads that can occur. There are different possibilities, through
which these heads might occur, which are summarized in the table below:

No. of Heads    Value of the four coin flips                           Total number of ways
0               T-T-T-T                                                1
1               H-T-T-T, T-H-T-T, T-T-H-T, T-T-T-H                     4
2               H-H-T-T, H-T-H-T, H-T-T-H, T-H-H-T, T-H-T-H, T-T-H-H   6
3               T-H-H-H, H-T-H-H, H-H-T-H, H-H-H-T                     4
4               H-H-H-H                                                1

The frequency distribution is easy to see. On average, if the number of flips is very high,
then out of every 16 coin flips, 1 will end up with 0 heads, 4 will end up with 1 head, 6 will
end up with 2 heads, 4 will end up with 3 heads and 1 will end up with all 4 heads. This of
course assumes that the coin used for the experiment is a fair coin, with an equal
probability of a head and a tail on any given flip.

In the above case, the coin is flipped only 4 times. If the coin is tossed many more times, say
100, and the frequency distribution drawn, it will closely resemble a normal probability
distribution in shape.
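The table above can be reproduced by exhaustively enumerating the 2^4 = 16 equally likely outcomes; a short sketch in Python:

```python
from itertools import product
from collections import Counter

# Enumerate all 2^4 = 16 equally likely outcomes of four fair coin flips
# and count how many outcomes produce each number of heads.
outcomes = product("HT", repeat=4)
frequency = Counter(flips.count("H") for flips in outcomes)

for heads in sorted(frequency):
    print(heads, frequency[heads])
# 0 heads: 1 outcome, 1 head: 4, 2 heads: 6, 3 heads: 4, 4 heads: 1
```

Dividing each count by 16 turns this frequency distribution into the corresponding probability distribution.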

How to cite this article:
Siddharth Kalla (Feb 20, 2010). Frequency Distribution. Retrieved from Explorable.com: https://explorable.com/frequency-distribution
2 Normal Probability Distribution

The normal probability distribution, also called the Gaussian distribution, refers to a family of
distributions that are bell shaped.

These are symmetric in nature and peak at the mean, with the probability density
decreasing smoothly on either side of the mean, as shown in the figure below.

The figure also shows a family of curves with different peaks centered about the same mean,
which differ in their spread and height.

μ = Mean of the Population
σ = Standard Deviation

The normal distribution occurs very frequently in statistics, economics, and the natural and
social sciences, and can be used to approximate many distributions occurring in nature and in
the man-made world.

For example, the height of all people of a particular race, the length of all dogs of a particular
breed, IQ, memory and reading skills of people in a general population and income
distribution in an economy all approximately follow the normal probability distribution shaped
like a bell curve.

The theory of normal distribution also finds use in advanced sciences like astronomy,
photonics and quantum mechanics.

The normal distribution can be characterized by the mean and standard deviation. The mean
determines where the peak occurs, which is at 0 in our figure for all the curves. The standard
deviation is a measure of the spread of the normal probability distribution, which can be seen
as differing widths of the bell curves in our figure.

The Formula
The mean is generally represented by μ and the standard deviation by σ. For a perfect normal
distribution, the mean, median and mode are all equal. The normal distribution function can be
written in terms of the mean and standard deviation as follows:

p(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

From the above formula for normal distribution, it can be inferred that about 68% of all values
lie within one standard deviation from the mean; 95.4% of all values lie within two standard
deviations from the mean and 99.7% of all values lie within three standard deviations from the
mean.
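These percentages can be checked numerically from the standard normal CDF; a small self-contained sketch using only the Python standard library (the function name is ours):

```python
import math

# P(|X - mu| < k*sigma) for a normal distribution equals erf(k / sqrt(2)),
# obtained by integrating the density above from mu - k*sigma to mu + k*sigma.
def fraction_within(k):
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"{k} sigma: {fraction_within(k) * 100:.1f}%")
# 1 sigma: 68.3%, 2 sigma: 95.4%, 3 sigma: 99.7%
```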

From the basic bell curve, there can be many special cases derived that become meaningful
under different situations.

For example, the left or right side of a normal distribution can
be skewed, or there could be long tails.

A basic study of the normal distribution therefore is necessary before a meaningful study can
be made into these special cases. This concept can be extended to 3-D normal distributions
as well, which are used for more advanced applications.

How to cite this article:
Siddharth Kalla (Nov 3, 2009). Normal Probability Distribution. Retrieved from Explorable.com: https://explorable.com/normal-probability-distribution
2.1 Normal Distribution Assumptions

Normal distribution assumptions are important to note because so many experiments
rely on assuming a distribution to be normal. In most cases, the assumption of
normality is a reasonable one to make.

However, there are important special scenarios when this is not the case. An understanding of
the normal distribution assumptions will help researchers know the limitations of their
experiment and also help them understand their own study and where it breaks down.

Normal distribution assumptions can be relaxed in some situations, but this requires a more
complex analysis. If the physical process can be approximated by a normal distribution, it will
yield the simplest analysis. However, some basic properties are retained even when
distributions are not normal. For example, one might assume symmetry, as in a t-distribution,
even if the distribution is not truly normal.

In fact, a number of different non-normal distributions are just variations of the normal
distribution. For example, a distribution might have a longer tail, which is a variation of the
normal distribution. Such distributions too are frequently encountered.

The reason for the normal distribution assumptions is that this is usually the simplest
mathematical model that can be used. In addition, it is surprisingly ubiquitous and it occurs in
most natural and social phenomena. This is why the assumption of normality is usually a good
first approximation.

Error Analysis
One of the most common uses of the assumption of normality is in error analysis. We usually
assume that random errors follow a normal distribution. This assumption can break down
when there are multiple sources of error and they are correlated. It may also be invalid if the
errors are not truly random. If the error distribution is not normal but the assumption of
normality is made, the statistical analysis can be incorrect and lead to erroneous conclusions.

Tests
There are statistical tests that a researcher can undertake to help determine whether the
normal distribution assumptions are valid. One quick way is to compare the sample means to
the true mean. For a normally distributed population, the sampling distribution of the mean is
also normal when there are sufficient test items in the samples.
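The quick check described above can be sketched as an informal simulation; this is an illustration rather than a formal normality test, and the data-generating parameters are made up:

```python
import random
import statistics

random.seed(42)

# Draw 2000 sample means, each from a sample of 30 normally distributed values.
sample_means = [statistics.mean(random.gauss(50, 10) for _ in range(30))
                for _ in range(2000)]

# If the sampling distribution is normal, roughly 68% of the sample means
# should fall within one standard deviation of their overall mean.
mu = statistics.mean(sample_means)
sigma = statistics.stdev(sample_means)
share = sum(abs(m - mu) < sigma for m in sample_means) / len(sample_means)
print(round(share, 2))  # close to 0.68 for normal data
```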

The assumption of normality is valid in most cases, but when it is not, it can lead to serious
trouble. Also, since this assumption is made so implicitly, it is hard to spot and sometimes
difficult to question. Therefore care must be taken to ensure that the researcher is aware of
not just the assumption of normality but all the assumptions that go into a statistical
analysis. This helps define the scope of the experiment, and if something is not as
expected, one can find the reason for the discrepancy.

How to cite this article:
Siddharth Kalla (Sep 11, 2011). Normal Distribution Assumptions. Retrieved from Explorable.com: https://explorable.com/normal-distribution-assumptions
3 F-Distribution

The F-distribution, also known as Snedecor's F-distribution or the Fisher–Snedecor
distribution (after R.A. Fisher and George W. Snedecor), is the distribution of the ratio of
two independent estimators of the population variance.

Suppose we have two samples with n1 and n2 observations. The ratio F = s1² / s2², where
s1² and s2² are the sample variances, is distributed according to an F-distribution with v1 =
n1 − 1 numerator degrees of freedom and v2 = n2 − 1 denominator degrees of freedom.

For example, if F follows an F-distribution and the degrees of freedom for the numerator are 4
and the degrees of freedom for the denominator are 10, then F ~ F(4,10). For each combination
of these degrees of freedom there is a different F-distribution. The F-distribution is most
spread out when the degrees of freedom are small. As the degrees of freedom increase, the
F-distribution becomes less dispersed.

Properties
The F-distribution has the following properties:

The mean of the distribution is equal to v2 / (v2 − 2), for v2 > 2. The variance is equal to
[2 · v2² · (v1 + v2 − 2)] / [v1 · (v2 − 2)² · (v2 − 4)], for v2 > 4.

The F-distribution is skewed to the right, and F-values can only be positive. The curve
reaches a peak not far to the right of 0, and then gradually approaches the horizontal axis.
The F-distribution approaches, but never quite touches, the horizontal axis.

Uses
The main use of the F-distribution is to test whether two independent samples have been
drawn from normal populations with the same variance, or whether two independent estimates
of the population variance are homogeneous, since it is often desirable to compare two
variances rather than two averages. For instance, college administrators would prefer two
college professors grading exams to have the same variation in their grading. For this, the
F-test can be used, and after examining the p-value, inference can be drawn on the variation.

Assumptions
In order to perform an F-test of two variances, it is important that the following are true:

The populations from which the two samples are drawn are normally distributed.
The two populations are independent of each other.

If the two populations have equal variances, then s1² and s2² are close in value and F is
close to 1. But if the two population variances are very different, s1² and s2² tend to be very
different, too.

Choosing s1² as the larger sample variance causes the ratio to be greater than 1. If s1² and
s2² are far apart, then F is a large number. Therefore, if F is close to 1, the evidence favours
the null hypothesis (the two population variances are equal). But if F is much larger than 1,
then the evidence is against the null hypothesis, and we can infer that the population
variances probably differ to a large extent.
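A minimal sketch of this variance-ratio logic in Python; the grading data for the two professors is hypothetical:

```python
import statistics

def f_statistic(sample_a, sample_b):
    """Ratio of the larger sample variance to the smaller, so F >= 1."""
    var_a = statistics.variance(sample_a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

prof_a = [78, 82, 80, 79, 81, 80]  # tightly clustered grades
prof_b = [60, 95, 70, 88, 55, 92]  # widely spread grades
F, df_num, df_den = f_statistic(prof_a, prof_b)
print(round(F, 1), df_num, df_den)  # F far above 1: variances likely differ
```

Turning F into a p-value requires the F-distribution's CDF, which is not in the standard library; scipy.stats.f is one common choice for that step.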

Anova and F
In the technique known as Analysis of Variance (ANOVA) which plays a very important role in
Design of Experiments, the variance ratio test is applied to test the significance of different
components of variation against error variation.

For example, a new drug for treating osteoporosis might need to be field tested. Since the
severity of this disease is generally a function of age, the new drug could be administered
randomly to n patients in each age group. Put differently, this would be an experiment with m
age groups and n different dosage levels of the drug allocated randomly to the patients. With
figures provided from patients for each age group × dose combination, we can use the
variance ratio test (F-test) to test for differences between dose levels and whether this
variation can be attributed to chance.

The other uses include testing the significance of the correlation ratio between two random
variables, and to test the linearity of regression.

How to cite this article:
Explorable.com (Jul 4, 2010). F-Distribution. Retrieved from Explorable.com: https://explorable.com/f-distribution
4 Measures of Central Tendency

In statistics we often compute measures of central tendency because it makes
sense to compare individual scores to the overall group of scores in order to
correctly interpret a result.

Individual scores by themselves may mean little, but when looked at from a group point of
view, they may reveal the whole picture.

For example, if you say you saw an insect of length 10cm, it doesn't
mean anything by itself. However, if you say that the normal length of
the insect is about 6cm and the maximum recorded length ever is
10.4cm, then it may mean you saw a particularly large insect. Therefore
it is important to be able to quantify the "normal length" as used above,
and this is what central tendency is all about.

The arithmetic mean is one of the most commonly used measures of central tendency. For a
set of numbers, the mean is simply the average, i.e. sum of all the numbers divided by the
number of numbers.

Therefore if you want to find the average length of a group of insects,
you simply take the length of each insect, add up all these lengths and
divide by the number of insects. If the lengths of 5 insects are 6.5mm,
5.4mm, 5.8mm, 6.2mm and 5.9mm, then the mean is
(6.5+5.4+5.8+6.2+5.9)mm/5 = 5.96mm.

The median is another frequently used measure of central tendency. The median is simply the
midpoint of the distribution, i.e. there are as many numbers above it as below it.

If the number of data points is odd, then the median is simply the middle number. Therefore
the median of 3, 5, 6, 9, 15 is 6.

If the number of data points is even, then the median is the mean of the middle two numbers.
Therefore the median of 2, 7, 15, 20 is (7+15)/2 = 11.

The median is particularly useful when a few data points are vastly different. For
example, in calculating the central measure of the salaries obtained by a group of graduates, it
may happen that a couple of students have got extraordinarily high salaries. This will pull the
mean of the salaries of the group to very high values, but the median will truly reflect the
placement scenario as it is.

Another commonly used measure of central tendency in specific cases is the mode. The mode
is simply the most commonly occurring value. For example, in a class of 50 students graded
on a scale of 1-5, the distribution may be as shown in the figure. The mode of this data is 4.

Different types of data need different measures of central tendency to describe the distribution
of data. For highly skewed data, none of these may be sufficient, and we may need to go for
other specialized measures or simply report them all in a table.
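The three measures can disagree on the same data, which is why reporting all of them is often useful; a quick sketch with hypothetical scores on the 1-5 scale used above:

```python
import statistics

# Hypothetical test scores on a 1-5 scale for a small class.
scores = [1, 2, 2, 3, 4, 4, 4, 4, 5, 5]

print(statistics.mean(scores))    # 3.4  (arithmetic mean)
print(statistics.median(scores))  # 4.0  (middle value)
print(statistics.mode(scores))    # 4    (most frequent value)
```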

How to cite this article:
Siddharth Kalla (Nov 22, 2009). Measures of Central Tendency. Retrieved from Explorable.com: https://explorable.com/measures-of-central-tendency
4.1 Statistical Mean

In Statistics, the statistical mean, or statistical average, gives a very good idea about
the central tendency of the data being collected.

The statistical mean gives important information about the data set at hand and, as a single
number, can provide a lot of insight into the experiment and the nature of the data.

Examples
The concept of statistical mean has a very wide range of applicability in statistics for a number
of different types of experimentation.

For example, if a simple pendulum is being used to measure the acceleration due to gravity, it
makes sense to take a set of values and then average them. This reduces the random errors
in the experiment and usually gives a more accurate value than a single measurement.

The statistical mean also gives a good idea about interpreting the statistical data.

For example, the mean life expectancy in Japan is higher than that of Brazil, which suggests
that, on average, people in Japan are likely to live longer. There may be many plausible
explanations for this, such as better healthcare facilities in Japan, but the truth is that we do
not know unless we measure it.

Similarly, the mean height of people in Russia is greater than that of China, which means that,
on average, Russians are taller than the Chinese.

The statistical mean is a measure of central tendency and gives us an idea of where the data
seems to cluster.

For example, the mean marks obtained by students in a test are needed to correctly gauge the
performance of a student in that test. If a student scores a low percentage but is well ahead
of the mean, then the test was difficult and the performance is therefore good,
something that the percentage alone will not be able to tell.

Different Statistical Means

There are different kinds of statistical means, or measures of central tendency, for the data
points. Each one has its own utility. The arithmetic mean, geometric mean, median and mode
are some of the most commonly used. They make sense in different situations, and should be
used according to the distribution and nature of the data.

For example, the arithmetic mean is frequently used in scientific experimentation, the
geometric mean is used in finance to calculate compounding quantities, the median is used as
a robust measure of the center for skewed data with many outliers, and the mode is frequently
used to determine the most frequently occurring data point, as in an election.

How to cite this article:
Siddharth Kalla (Jan 13, 2009). Statistical Mean. Retrieved from Explorable.com: https://explorable.com/statistical-mean
4.1.1 Arithmetic Mean

The arithmetic mean is perhaps the most commonly used statistical mean to measure
the central tendency of data.

The arithmetic mean is also called the "average". It is used in most scientific experiments.

Mathematically, the arithmetic mean is given by:

arithmetic mean = (sum of all data points) / (total number of data points)

Examples
If there are three numbers in a data set, add them and divide by three:

arithmetic mean = (x1 + x2 + x3) / 3

Or if there are four numbers, add them and divide by 4:

arithmetic mean = (x1 + x2 + x3 + x4) / 4

For example, the time in seconds taken for a particular chemical reaction under the same
laboratory conditions might give values of 11.6, 12.1, 11.8, 11.5 and 12.0.

arithmetic mean = (11.6 + 12.1 + 11.8 + 11.5 + 12.0) / 5 = 59.0 / 5 = 11.8

The arithmetic mean of these numbers is 11.8.

This would be the time as measured by the experiment. The differences can be attributed to 
random errors like random fluctuations in temperature and humidity in the laboratory.
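Using the reaction-time values above, the computation is a one-liner in Python:

```python
import statistics

# Reaction times (seconds) from the example above.
times = [11.6, 12.1, 11.8, 11.5, 12.0]

mean_time = sum(times) / len(times)  # definition: sum divided by count
print(round(mean_time, 1))           # 11.8
print(statistics.mean(times))        # stdlib equivalent
```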

When the word mean is used, it generally refers to the arithmetic mean.

The mean gives very useful information in cases where the data is relatively symmetric. For
example, if the data is nearly normally distributed, then the mean is the best measure of
central tendency. However, if the data is very skewed, then the arithmetic mean might
become misleading.

For example, a commonly quoted number in the placement of business schools is the
average salary of the outgoing batch.

Is the Arithmetic Mean Misleading?

When the arithmetic mean is used to calculate this, it can be misleading because the salaries
can be widely spread. It may happen that 10% of the class have excellent job offers while half
the class is without jobs. Even in such a scenario, the mean salary can look respectable, but it
hides the facts behind it. The median is a much better measure of the salaries of the students
in this case.

However, in some cases, even when the data is skewed, the arithmetic mean gives some
valuable information about the data but it needs to be interpreted in the right manner.

The Gross Domestic Product (GDP) per capita, used in economics to gauge the financial
well-being of a country, is an arithmetic mean: the total value of goods and services produced
in the country averaged over the total population. It tells us nothing about the distribution of
wealth inside the country, but can be a good parameter for the country to work with in
improving the economic condition of its citizens.

How to cite this article:
Siddharth Kalla (Jul 7, 2009). Arithmetic Mean. Retrieved from Explorable.com: https://explorable.com/arithmetic-mean
4.1.2 Geometric Mean

The geometric mean is relevant for certain sets of data, and is different from the
arithmetic mean. Mathematically, the geometric mean is the nth root of the product of n
numbers.

This can be written as:

Geometric Mean = (a1 × a2 × ... × an)^(1/n)

Where:

n = number of data points

a = value of each data point

The geometric mean is relevant for sets of data that are multiplicative or exponential in
nature. This includes a variety of branches of the natural and social sciences.

Examples
For example, if a strain of bacteria increases its population by 20% in the first hour, 30% in the
next hour and 50% in the next hour, we can find out an estimate of the mean percentage
growth in population.

In this case, it is the geometric mean, and not the arithmetic mean that is relevant. To see
this, start off with 100 bacteria.

After the first hour, they grow to 120 bacteria, a growth rate of 1.2 (100 × 1.2).
After the second hour, they grow to 156 bacteria, a growth rate of 1.3 (120 × 1.3).
After the third hour, they grow to 234 bacteria, a growth rate of 1.5 (156 × 1.5).

Now we would like to find the mean growth rate:

Geometric Mean = (1.2 × 1.3 × 1.5)^(1/3)
Geometric Mean = (2.34)^(1/3)
Geometric Mean ≈ 1.3276

If we find the geometric mean of 1.2, 1.3 and 1.5, we get 1.3276. This should be interpreted
as the mean rate of growth of the bacteria over the period of 3 hours, which means if the
strain of bacteria grew by 32.76% uniformly over the 3 hour period, then starting with 100
bacteria, it would reach 234 bacteria in 3 hours.

Therefore whenever we have percentage growth over a period of time, it is the geometric
mean and not the arithmetic mean that makes sense.
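The bacteria example above translates directly into code:

```python
import math
import statistics

# Hourly growth factors from the example: +20%, +30%, +50%.
rates = [1.2, 1.3, 1.5]

# nth root of the product of n numbers
geo_mean = math.prod(rates) ** (1 / len(rates))
print(round(geo_mean, 4))  # ~1.3276

# Python 3.8+ provides this directly.
print(round(statistics.geometric_mean(rates), 4))

# Sanity check: compounding the mean rate for 3 hours recovers the
# actual final population, starting from 100 bacteria.
print(round(100 * geo_mean ** 3))  # 234
```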

Usages
In social sciences, we frequently encounter this in a number of ways. For example, the human
population growth is expressed as a percentage, and thus when population growth needs to
be averaged, it is the geometric mean that is most relevant.

In surveys and studies too, the geometric mean becomes relevant. For example, if a survey
found that over the years, the economic status of a poor neighborhood is getting better, they
need to quote the geometric mean of the development, averaged over the years in which the
survey was conducted. The arithmetic mean will not make sense in this case either.

In economics, we see the percentage growth in interest accumulation. Thus if you are starting
out with a sum of money that is compounded for interest, then the mean that you should look
for is the geometric mean. Many such financial instruments like bonds yield a fixed
percentage return, and while quoting their “average” return, it is the geometric mean that
should be quoted.

How to cite this article:
Siddharth Kalla (Aug 21, 2009). Geometric Mean. Retrieved from Explorable.com: https://explorable.com/geometric-mean
4.1.3 Calculate Median

The median is central to many experimental data sets, and it is important to calculate the
median correctly in such cases rather than falling into the trap of reporting the arithmetic mean.

The median can be seen to be the “middle value” of the distribution, i.e. it separates the upper
and lower halves.

Calculation - Examples
To calculate median, consider the following example. Suppose we have the heights of
different trees in a garden, and we need an “average” value for this. Say the heights in meters
are 1.5, 6.9, 2.8, 1.8 and 2.3.

1. Arrange order. Arrange the numbers in either ascending or descending order, such as
1.5, 1.8, 2.3, 2.8 and 6.9.
2. Choose middle value. Choose the value which is exactly in the middle of the order.
1.5, 1.8, 2.3, 2.8, 6.9

The median of this distribution is 2.3, which is the middle value, separating the lower half
{1.5, 1.8} from the upper half {2.8, 6.9}.

Median = 2.3

3. In case of two middle values - Use the Arithmetic Mean on the two middle values

Suppose there are six heights (New value: 1.2), rather than five:
1.2, 1.5, 1.8, 2.3, 2.8, 6.9

This leaves two middle values, {1.8, 2.3}. The median for this data set is (1.8+2.3)/2 =
2.05

Median = 2.05

One can immediately see that the data is skewed - the 6.9 meter tree makes it so. The
arithmetic mean of this data is 3.06 meters, which is larger than 4 out of 5 data points. Thus
the arithmetic mean doesn't make much sense in this case.

If the number of data points is even, unlike the example we previously considered, then to
calculate the median we simply take the mean of the middle two elements. Thus if we have 10
numbers arranged in ascending order, the median is the average of the 5th and 6th numbers.
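The procedure is easy to implement; a sketch covering both the odd and even cases, using the tree heights from the earlier example:

```python
import statistics

def median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

heights = [1.5, 6.9, 2.8, 1.8, 2.3]  # odd count
print(median(heights))               # 2.3
print(median([1.2] + heights))       # 2.05 after adding the sixth tree
print(statistics.median(heights))    # stdlib equivalent
```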

Another Example:

If the salaries of professionals passing out of college, in thousands of dollars per annum, are
60, 64, 71, 73, 73, 77, 82, 85, 160 and 255, then their median salary is (73+77)/2 = 75. The
mean in this case is 100, which, like the previous case, doesn't make much sense and doesn't
really tell us about the central tendency of the data.

In cases where the data is skewed, it is the median that makes sense, and not the mean. In
these cases, as an experimenter, you need to calculate the median and not the mean for your
experiment. This is especially true when there are outliers. Many scientists calculate their
results in terms of both the median and the mean, to see whether the outcomes are the same.

The median is resistant to change with the discovery of outliers. For example, if we want to
know the mean weight of all the dinosaurs, it is a very difficult task because we do not yet
know all the types of dinosaurs that ever walked the earth. Therefore if a new type of dinosaur
bigger than all the others is discovered, it will significantly alter the mean.

However, the median remains almost unchanged in this case. Thus in many cases when the
end points of the data set are not known, you need to calculate median and not the mean for
that data set.

How to cite this article:
Siddharth Kalla (Oct 25, 2009). Calculate Median. Retrieved from Explorable.com: https://explorable.com/calculate-median
4.2 Statistical Mode

Statistical mode tells us about the data point that is most frequently repeated in the
dataset.

For a symmetric data distribution, the statistical mode can be near the mean and median, but
for highly skewed data, the mode can be quite distinct.

Examples
For example, if the marks obtained by 15 students in a test are 81, 82, 85, 85, 89, 91, 91, 91,
91, 93, 93, 95, 96, 96 and 99, then the statistical mode for this data distribution is 91. This is
because 4 students have obtained this score, which is the highest number of students with the
same score.

It should be noted that unlike the mean and median, the mode doesn’t need to be unique.
This is because there can always be two or more data points with the same frequency.

Consider a revised example for the test scores of the 15 students: 81, 81, 85, 85, 89, 91, 91,
92, 92, 93, 93, 95, 96, 96 and 99. In this case, 2 students each score the marks 81, 85, 91,
92, 93 and 96, which are all modes for this data distribution.
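Python's standard library handles both the unique-mode and multi-mode cases from the two examples above:

```python
import statistics

# Test scores from the two examples above.
first = [81, 82, 85, 85, 89, 91, 91, 91, 91, 93, 93, 95, 96, 96, 99]
revised = [81, 81, 85, 85, 89, 91, 91, 92, 92, 93, 93, 95, 96, 96, 99]

print(statistics.mode(first))         # 91, scored by four students
print(statistics.multimode(revised))  # all values tied for the highest frequency
```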

Sometimes when the data is continuous, the usual definition of the statistical mode is
inadequate. To see why, consider an experiment that measures the reaction times of different
subjects.

Suppose the raw data gives time in milliseconds as 42.1, 48.3, 52.2, 52.6, 52.8, 52.9, 53.0,
53.1, 53.2, 53.7, 54.6, 55.8, 56.7, 58.0 and 60.9.

As can be seen, all the values are distinct, and therefore all the data points are the statistical
modes. However, intuitively, we can see that the data is clustered about the values in the
middle.

Therefore if we can define an interval of 1 millisecond, then the interval from 52.5-53.5 will
form the mode. This is a much more practical measure of the mode. The problem arises
because time is a continuous variable, and thus two measured times are not usually exactly
equal in nature, but only slightly differ from each other.

Advantage
A big advantage of the statistical mode is that it is not restricted to numbers alone. For
example, among all the letters in typical English text, the mode is the letter 'E', the most
frequently encountered letter. However, we cannot define the median or mean letter, since
these can only be defined for numbers. This makes the scope of the mode quite broad in
nature.

How to cite this article: 

Siddharth Kalla (Sep 4, 2009). Statistical Mode. Retrieved from Explorable.com:  


https://explorable.com/statistical-mode

4.3 Range (Statistics)

In statistics, range is defined simply as the difference between the maximum and
minimum observations. It is intuitively obvious why we define range in statistics this
way - range should suggest how diversely spread out the values are, and by computing
the difference between the maximum and minimum values, we can get an estimate of
the spread of the data.

For example, suppose an experiment involves finding out the weight of lab rats and the values
in grams are 320, 367, 423, 471 and 480. In this case, the range is simply computed as 480-
320 = 160 grams.
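As a minimal sketch (the variable names are ours), the same computation in Python:

```python
# Weights of the lab rats in grams
weights = [320, 367, 423, 471, 480]

# Range = maximum observation minus minimum observation
data_range = max(weights) - min(weights)
print(data_range)  # -> 160
```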

Some Limitations of Range


Range is quite a useful indication of how spread out the data is, but it has some serious
limitations, because data can contain outliers that lie far away from the other data points. In
these cases, the range might not give a true indication of the spread of the data.

For example, in our previous case, consider a small baby rat added to the data set that
weighs only 50 grams. Now the range is computed as 480-50 = 430 grams, which gives a
false impression of the dispersion of the data.

This limitation of range is to be expected primarily because range is computed taking only two
data points into consideration. Thus it cannot give a very good estimate of how the overall
data behaves.

Practical Utility of Range


In many cases, however, the data is closely clustered, and if the number of observations is
very large, the range can give a good sense of the data distribution. For example, consider a huge survey
of the IQ levels of university students consisting of 10,000 students from different
backgrounds. In this case, the range can be a useful tool to measure the dispersion of IQ
values among university students.

Sometimes, we define range in such a way so as to eliminate the outliers and extreme points
in the data set. For example, the inter-quartile range in statistics is defined as the difference
between the third and first quartiles. You can immediately see how this new definition of range
is more robust than the previous one. Here the outliers will not matter and this definition takes
the whole distribution of data into consideration and not just the maximum and minimum
values.
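As a sketch of how the inter-quartile range might be computed, using statistics.quantiles from the Python standard library (the "inclusive" method and the variable names are our choices for illustration):

```python
from statistics import quantiles

# Rat weights in grams, including the 50 g outlier
weights = [50, 320, 367, 423, 471, 480]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
q1, q2, q3 = quantiles(weights, n=4, method="inclusive")

iqr = q3 - q1                             # inter-quartile range
full_range = max(weights) - min(weights)  # ordinary range

# The outlier inflates the full range far more than the IQR.
print(full_range, iqr)
```

Different quantile methods give slightly different cut points, but the robustness argument holds in any case: the IQR stays far smaller than the full range.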

It should be pointed out that in spite of these limitations, the range can be a useful indicator
in many cases. As a student of statistics, you should understand what kinds of data are best
described by the range. If there are too many outliers, it may not be a good choice, but the
range gives a quick and easy-to-compute indication of the spread of the data.

How to cite this article: 

Siddharth Kalla (Jun 10, 2011). Range (Statistics). Retrieved from Explorable.com:  
https://explorable.com/range-in-statistics

5 Statistical Variance

Statistical variance gives a measure of how the data distributes itself about the mean
or expected value. Unlike range that only looks at the extremes, the variance looks at
all the data points and then determines their distribution.

In many cases of statistics and experimentation, it is the variance that gives invaluable
information about the data distribution.

Variance Calculation (population of Scores)


The mathematical formula to calculate the variance is given by:

σ^2 = ∑ (X - µ)^2 / N

σ^2 = variance
∑ (X - µ)^2 = the sum of (X - µ)^2 for all data points
X = individual data points
µ = mean of the population
N = number of data points

This means the variance is given by the average of the squared differences between the data
points and the mean.

Step By Step Calculation


For example, suppose you want to find the variance of scores on a test. Suppose the scores
are 67, 72, 85, 93 and 98.

1. Write down the formula for variance:

σ^2 = ∑ (x - µ)^2 / N

2. There are five scores in total, so N = 5.

σ^2 = ∑ (x - µ)^2 / 5

3. The mean (µ) of the five scores (67, 72, 85, 93, 98) is (67+72+85+93+98)/5 = 83, so µ = 83.

σ^2 = ∑ (x - 83)^2 / 5

4. Now, compare each score (x = 67, 72, 85, 93, 98) to the mean (µ = 83)

σ^2 = [ (67-83)^2 + (72-83)^2 + (85-83)^2 + (93-83)^2 + (98-83)^2 ] / 5

5. Carry out the subtraction in each parenthesis.

67-83 = -16

72-83 = -11

85-83 = 2

93-83 = 10

98 - 83 = 15

The formula will look like this:

σ^2 = [ (-16)^2 + (-11)^2 + (2)^2 + (10)^2 + (15)^2 ] / 5

6. Then, square each parenthesis. We get 256, 121, 4, 100 and 225.

This is how:

σ^2 = [ (-16)x(-16) + (-11)x(-11) + (2)x(2) + (10)x(10) + (15)x(15) ] / 5

σ^2 = [ 16x16 + 11x11 + 2x2 + 10x10 + 15x15 ] / 5

which equals:

σ^2 = [ 256 + 121 + 4 + 100 + 225 ] / 5

7. Then sum the numbers inside the brackets:

σ^2 = 706 / 5

8. To get the final answer, divide the sum by 5 (because there were five scores). This is the
variance for the dataset:

σ^2 = 141.2

This is the variance of the population of scores.
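The whole calculation can be checked with Python's statistics module; pvariance computes exactly this population variance (the variable names below are ours):

```python
from statistics import mean, pvariance

scores = [67, 72, 85, 93, 98]

# Population variance: average of squared deviations from the mean
mu = mean(scores)  # 83
var = sum((x - mu) ** 2 for x in scores) / len(scores)
print(var)  # -> 141.2

# pvariance() performs the same computation in one call
print(pvariance(scores))  # -> 141.2
```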

Variance of a Sample


In many cases, instead of a population, we deal with samples.

In this case, we need to slightly change the formula for the variance to:

S^2 = ∑ (X - mean)^2 / (N - 1)

S^2 = the variance of the sample

Note that the denominator is one less than the sample size in this case.
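The population and sample formulas are both available in the statistics module: pvariance divides by N, while variance divides by N - 1. A minimal sketch, reusing the scores from the example above:

```python
from statistics import pvariance, variance

scores = [67, 72, 85, 93, 98]

# Population variance divides by N:      706 / 5
print(pvariance(scores))  # -> 141.2

# Sample variance divides by N - 1:      706 / 4
print(variance(scores))   # -> 176.5
```

The sample variance is always at least as large as the population variance for the same data, since the denominator is smaller.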

Usage
The concept of variance can be extended to continuous data sets too. In that case, instead of
summing up the individual differences from the mean, we need to integrate them. This
approach is also useful when the number of data points is very large, like the population of a
country.

Variance is extensively used in probability theory, wherein from a given smaller sample set,
more generalized conclusions need to be drawn. This is because variance gives us an idea
about the distribution of data around the mean, and thus from this distribution, we can work
out where we can expect an unknown data point.

How to cite this article: 

Siddharth Kalla (Mar 15, 2009). Statistical Variance. Retrieved from Explorable.com:  
https://explorable.com/statistical-variance

5.1 Measurement Of Uncertainty: Standard Deviation

Many experiments require measurement of uncertainty. Standard deviation is the best


way to accomplish this. Standard deviation tells us about how the data is distributed
about the mean value.

Examples
For example, the data points 50, 51, 52, 55, 56, 57, 59 and 60 have a mean of 55.

Another data set, 12, 32, 43, 48, 64, 71, 83 and 87, also has a mean of 55.

However, it can clearly be seen that the properties of these two sets are different. The first set
is much more closely packed than the second one. Through standard deviation, we can
measure this distribution of data about the mean.

The above example should make it clear that if the data points are values of the same
parameter in various experiments, then the first data set is a good fit, but the second one is
too uncertain. Therefore standard deviation is important in the measurement of uncertainty:
the smaller the standard deviation, the smaller the uncertainty, the greater the confidence in
the experiment, and the higher its reliability.

One Standard Deviation


In a normal distribution, 68.2% of the values fall within one standard deviation of the mean.
This means that if the mean energy consumption of the houses in a colony is 200 units with a
standard deviation of 20 units, then 68.2% of the households consume energy between 180
and 220 units. This is assuming that the data of energy consumption is
normally distributed.

If a researcher considers three standard deviations to either side of the mean, this covers
about 99.7% of the data. Thus in the previous example, about 99.7% of the households have
their energy consumption between 140 and 260 units. In most cases, this is considered as the
whole data set, especially when the data can extend to infinity.
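These coverage figures can be verified with Python's statistics.NormalDist, using the 200-unit mean and 20-unit standard deviation from the example above:

```python
from statistics import NormalDist

# Energy consumption modeled as a normal distribution
consumption = NormalDist(mu=200, sigma=20)

# Fraction of households within one standard deviation (180 to 220 units)
within_1sd = consumption.cdf(220) - consumption.cdf(180)
print(round(within_1sd, 3))  # -> 0.683

# Fraction within three standard deviations (140 to 260 units)
within_3sd = consumption.cdf(260) - consumption.cdf(140)
print(round(within_3sd, 4))  # -> 0.9973
```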

Usage
The measurement of uncertainty through standard deviation is used in many experiments of
social sciences and finances. For example, the more risky and volatile ventures have a higher
standard deviation. Also, a very high standard deviation of the results for the same survey, for
example, should make one rethink the sample size and the survey as a whole.

In physical experiments, it is important to have a measurement of uncertainty. Standard


deviation provides a way to check the results. Very large values of standard deviation can
mean the experiment is faulty - either there is too much noise from outside or there could be a
fault in the measuring instrument.

How to cite this article: 

Siddharth Kalla (Aug 2, 2009). Measurement Of Uncertainty: Standard Deviation. Retrieved


from Explorable.com:  https://explorable.com/measurement-of-uncertainty-standard-deviation

5.1.1 Calculate Standard Deviation

As an experimenter, it is important to be able to calculate standard deviation, because


it is a very important parameter that defines the way data is centered about the mean.

The standard deviation is the square root of variance. Thus the way we calculate standard
deviation is very similar to the way we calculate variance.

In fact, to calculate standard deviation, we first need to calculate the variance, and then take
its square root.

Standard Deviation Formula


The standard deviation formula is similar to the variance formula. It is given by:

σ = √[ ∑(xi - x̄)^2 / N ]

σ = standard deviation
xi = each value of the dataset
x̄ (x with a bar over it) = the arithmetic mean of the data (this symbol will be written as
"mean" from now on)
N = the total number of data points
∑(xi - mean)^2 = the sum of (xi - mean)^2 for all data points

For simplicity, we will rewrite the formula:

σ = √[ ∑(x-mean)^2 / N ]

This is to minimize the chance of confusion in the examples below.

Standard Deviation Calculation Example (for Population)
As an example to calculate standard deviation, consider a population of IQ scores given by
96, 104, 126, 134 and 140.

Try it yourself

1. Write the formula.


σ = √[ ∑(x-mean)^2 / N ]
2. How many numbers are there?
There are five numbers.
σ = √[ ∑(x-mean)^2 / 5 ]
3. What is the mean?
The mean of this data is (96+104+126+134+140)/5 = 120.
σ = √[ ∑(x-120)^2 / 5 ]
4. What are the respective deviations from the mean?
The deviation from the mean is given by 96-120 = -24, 104-120 = -16, 126-120 = 6, 134-
120 = 14, 140-120 = 20.

σ = √[ ((96-120)^2+(104-120)^2+(126-120)^2+(134-120)^2+(140-120)^2) / 5 ]

σ = √[ ((-24)^2+(-16)^2+(6)^2+(14)^2+(20)^2) / 5 ]
5. Square and sum the deviations:
The sum of their squares is given by (-24)^2 + (-16)^2 + (6)^2 + (14)^2 + (20)^2 = 1464.

σ = √[ ((-24)x(-24)+(-16)x(-16)+(6)x(6)+(14)x(14)+(20)x(20)) / 5 ]

σ = √[ (576 + 256 + 36 + 196 + 400) / 5 ]

σ = √[ 1464 / 5 ]
6. Divide by the number of scores (minus one if it is a sample, not a population):
The division gives 1464/5 = 292.8. The number in between the brackets
is the variance of the data.

σ = √[292.8]

7. Square root the total:


To calculate standard deviation, we take the square root √(292.8) = 17.11.

σ = 17.11

As shown further below, the sample standard deviation for the same data is larger than this
population standard deviation.
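The worked example above can be verified with statistics.pstdev, which computes the population standard deviation (the variable names are ours):

```python
from math import sqrt
from statistics import pstdev

iq_scores = [96, 104, 126, 134, 140]

# Population standard deviation: square root of the mean squared deviation
mu = sum(iq_scores) / len(iq_scores)  # 120.0
sigma = sqrt(sum((x - mu) ** 2 for x in iq_scores) / len(iq_scores))
print(round(sigma, 2))  # -> 17.11

# pstdev() gives the same result in one call
print(round(pstdev(iq_scores), 2))  # -> 17.11
```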

Interpretation of Data
Calculation of standard deviation is important to correctly interpret the data. For example, in
physical sciences, a lower standard deviation for the same measurement implies higher
precision for the experiment.

Also, when the mean needs to be interpreted, it is important to quote the standard deviation
too. For example, the mean temperature over a day in two cities might be 24 °C. However, if the
standard deviation is very large, it may mean extremes of temperature - too hot during the day
and too cold during the nights (like a desert). On the other hand, if the standard deviation is
small, it means a fairly uniform temperature throughout the day (like a coastal region).

Standard Deviation for Samples


Just like in the case of variance, we define a sample standard deviation when we are dealing
with samples rather than populations. This is given by a slightly modified equation:

S = √[ ∑(x - mean)^2 / (N - 1) ]

where the denominator is N - 1 instead of N in the previous case. This correction is required to
get an unbiased estimator for the standard deviation.

Example of Standard Deviation of Samples

This follows the same calculation as the example above, for standard deviation for population,
with one exception: The division should be "N - 1", not "N".

σ = √[ ∑(x-mean)^2 / (N - 1) ]

Then it follows the same example as above, except that there is a 4 where there was a 5:

1. Write the formula.


σ = √[ ∑(x-mean)^2 / (N-1) ]
2. How many numbers are there?
There are five numbers.

σ = √[ ∑(x-mean)^2 / (5-1) ]

σ = √[ ∑(x-mean)^2 / 4 ]
3. What is the mean?

The mean of this data is (96+104+126+134+140)/5 = 120.
σ = √[ ∑(x-120)^2 / 4 ]
4. What are the respective deviations from the mean?
The deviation from the mean is given by 96-120 = -24, 104-120 = -16, 126-120 = 6, 134-
120 = 14, 140-120 = 20.

σ = √[ ((96-120)^2+(104-120)^2+(126-120)^2+(134-120)^2+(140-120)^2) / 4 ]

σ = √[ ((-24)^2+(-16)^2+(6)^2+(14)^2+(20)^2) / 4 ]
5. Square and sum the deviations:
The sum of their squares is given by (-24)^2 + (-16)^2 + (6)^2 + (14)^2 + (20)^2 = 1464.

σ = √[ ((-24)x(-24)+(-16)x(-16)+(6)x(6)+(14)x(14)+(20)x(20)) / 4 ]

σ = √[ (576 + 256 + 36 + 196 + 400) / 4 ]

σ = √[ 1464 / 4 ]
6. Divide by the number of scores minus one (minus one since it is a sample, not a
population):
The division gives 1464/4 = 366. The number in between the brackets is
the variance of the data.

σ = √[366]

7. Square root the total:


To calculate standard deviation, we take the square root √(366) = 19.13.

σ = 19.13
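Both versions are available in the statistics module: pstdev divides by N, while stdev divides by N - 1. A minimal sketch, reusing the IQ scores from above:

```python
from statistics import pstdev, stdev

iq_scores = [96, 104, 126, 134, 140]

# Population standard deviation (divides by N = 5)
print(round(pstdev(iq_scores), 2))  # -> 17.11

# Sample standard deviation (divides by N - 1 = 4), always larger
print(round(stdev(iq_scores), 2))   # -> 19.13
```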

How to cite this article: 

Siddharth Kalla (Sep 27, 2009). Calculate Standard Deviation. Retrieved from
Explorable.com:  https://explorable.com/calculate-standard-deviation

5.2 Standard Error of the Mean

The standard error of the mean, also called the standard deviation of the mean, is a
method used to estimate the standard deviation of a sampling distribution. To
understand this, first we need to understand why a sampling distribution is required.

As an example, consider an experiment that measures the speed of sound in a material along
the three directions (along x, y and z coordinates). By taking the mean of these values, we
can get the average speed of sound in this medium.

However, there are so many external factors that can influence the speed of sound, like small
temperature variations, reaction time of the stopwatch, pressure changes in the laboratory,
wind velocity changes, and other random errors. Thus instead of taking the mean by one
measurement, we prefer to take several measurements and take a mean each time. This is a
sampling distribution. The standard error of the mean now refers to the change in mean with
different experiments conducted each time.

Mathematically, the standard error of the mean formula is given by:

σM = σ / √N

σM = standard error of the mean
σ = the standard deviation of the original distribution
N = the sample size
√N = the square root of the sample size

It can be seen from the formula that the standard error of the mean decreases as N increases.
This is expected because if the mean at each step is calculated using a lot of data points, then
a small deviation in one value will cause less effect on the final mean.

The standard error of the mean tells us how the mean varies with different experiments
measuring the same quantity. Thus if the effect of random changes is significant, then the
standard error of the mean will be higher. If there is no change in the data points as
experiments are repeated, then the standard error of mean is zero.
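As a sketch (the measurement values are ours for illustration), the formula σ / √N can be computed directly; here the sample standard deviation stands in for σ, as is usual when only a sample is available:

```python
from math import sqrt
from statistics import stdev

# Hypothetical repeated measurements of the same quantity
measurements = [96, 104, 126, 134, 140]

# Standard error of the mean: sigma / sqrt(N)
sem = stdev(measurements) / sqrt(len(measurements))
print(round(sem, 2))  # -> 8.56
```

Doubling the number of measurements would shrink the standard error by a factor of √2, which is why more data points give a more stable mean.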

Standard Error of the Estimate


A related and similar concept to standard error of the mean is the standard error of the
estimate. This refers to the deviation of any estimate from the intended values.

For a sample, the formula for the standard error of the estimate is given by:

σest = √[ ∑(Y - Y')^2 / (N - 2) ]

where Y refers to the individual data points, Y' is the mean of the data and N is the sample size.

Note that this is similar to the standard deviation formula, but has an N-2 in the denominator
instead of N-1 in case of sample standard deviation.

How to cite this article: 

Siddharth Kalla (Sep 21, 2009). Standard Error of the Mean. Retrieved from Explorable.com:  
https://explorable.com/standard-error-of-the-mean

6 Quartile

Quartile is a useful concept in statistics and is conceptually similar to the median. The
first quartile is the data point at the 25th percentile, and the third quartile is the data
point at the 75th percentile. The 50th percentile is the median.

Median Revisited
To understand a quartile, let us revisit the median. To compute the median, we divide the data
into two groups with an equal number of points. The middle value that separates these
groups is the median. In a similar fashion, if we divide the data into 4 equal groups now
instead of 2, the first differentiating point is the first quartile, the second differentiating point is
the second quartile which is the same as the median and the third such differentiating point is
the third quartile.

To further see what quartiles do, the first quartile is at the 25th percentile. This means that
25% of the data is smaller than the first quartile and 75% of the data is larger than this.
Similarly, in case of the third quartile, 25% of the data is larger than it while 75% of it is
smaller. For the second quartile, which is nothing but the median, 50% or half of the data is
smaller while half of the data is larger than this value.

Interpreting Quartiles
As you know, the median is a measure of the central tendency of the data but says nothing
about how the data is distributed in the two arms on either side of the median. Quartiles help
us measure this.

Thus if the first quartile is far away from the median while the third quartile is closer to it, it
means that the data points that are smaller than the median are spread far apart while the
data points that are greater than the median are closely packed together.

An Alternative View
Another way of understanding quartiles is by thinking of them as the medians of the two
sets of data points differentiated by the median. In this case, the first quartile is the median of
the data that is smaller than the full median, while the third quartile is the median of the data
that is larger than the full median. Here full median is used in the context of the median of the
entire set of data.

It should be noted that a quartile is not limited to discrete variables but also applies equally
well to continuous variables. In this case, you will need to know the data distribution to figure
out the quartiles. If the distribution is symmetric, like normal distribution, then the first and third
quartiles are equidistant from the median in either direction.
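For discrete data, the quartiles can be computed with statistics.quantiles from the Python standard library; a minimal sketch (the data set is ours):

```python
from statistics import median, quantiles

data = list(range(1, 16))  # 1, 2, ..., 15

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
q1, q2, q3 = quantiles(data, n=4)
print(q1, q2, q3)          # -> 4.0 8.0 12.0

# The second quartile is the median
print(q2 == median(data))  # -> True
```

Here 25% of the values lie below Q1 = 4 and 25% lie above Q3 = 12, matching the percentile description above. Note that quantiles supports both "exclusive" (the default) and "inclusive" methods, which can give slightly different cut points on small data sets.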

How to cite this article: 

Siddharth Kalla (Jan 20, 2011). Quartile. Retrieved from Explorable.com:  


https://explorable.com/quartile

7 Trimean

Trimean is a measure of central tendency, like mean, median and mode. Its meaning is
sometimes confusing because it is defined in a manner different from these traditional
measures of central tendencies.

Mathematical Formulation
The trimean is defined as the weighted average of the median and the two quartiles. Thus,
mathematically it is written as

TM = (Q1 + 2Q2 + Q3) / 4

TM = trimean
Q2 = the median
Q1, Q3 = the first and third quartiles

It can also be written as

TM = [ Q2 + (Q1 + Q3)/2 ] / 2

which tells us that it is the average of the median and the "quartile average", also called the
midhinge.

The trimean takes not only the central tendency into account but also gives due importance to
the distribution of data. This is what makes the trimean a different statistical parameter than
the others, like the median, that are frequently encountered.

A Sample Example
For example, consider the heights of students in a class, in cm, to be 155, 158, 161, 162, 166,
170, 171, 174 and 179. It is easy to see the median of this data is 166 cm.

Now consider another class where the heights of the students, again in cm, are 162, 162, 163,
165, 166, 175, 181, 186, and 192. It can be seen that the median height of the class is again
166 cm. However, a look at the two data distributions tells us that the distributions are quite
different in both these cases, even though they have the same median.

Now let us compute the trimean for the first case. The median as we saw was 166, the first
quartile is 161 and the third quartile is 171. Using the formula given above, the trimean is
computed as (161 + 2(166) + 171)/4 = 166.

In the second example, the median is the same 166, but the first quartile is 163 and the third
quartile is 181. Now the trimean is computed as (163 + 2(166) + 181)/4 = 169.

Interpreting the Results


In the first case, we see that the trimean is the same as the median. What this essentially
means is that the distribution is very even about the median: there are about as
many data points at a given distance from the median on either side (on average, of course).

In the second case, the trimean is bigger than the median. As you can see, the third quartile is
farther away from the median than the first quartile, which essentially means that the data is
biased toward the second half of the distribution. The trimean reflects this bias in the data
away from the median; this is how the quartiles enter into the definition of the trimean.
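Both worked examples can be reproduced with a small helper function; this is a sketch, and we use statistics.quantiles with method="inclusive" because it matches the median-of-halves quartiles (161, 166, 171 and 163, 166, 181) used above:

```python
from statistics import quantiles

def trimean(data):
    """Weighted average of the median and the two quartiles:
    (Q1 + 2*Q2 + Q3) / 4."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4

first_class = [155, 158, 161, 162, 166, 170, 171, 174, 179]
second_class = [162, 162, 163, 165, 166, 175, 181, 186, 192]

print(trimean(first_class))   # -> 166.0
print(trimean(second_class))  # -> 169.0
```

The two classes share a median of 166, but the trimean separates them because it also weighs the quartiles.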

How to cite this article: 

Siddharth Kalla (Jan 10, 2011). Trimean. Retrieved from Explorable.com:  


https://explorable.com/trimean

Thanks for reading!


Explorable.com Team

Explorable.com - Copyright © 2008-2015 All Rights Reserved.
