Stats Notes
In the study of a population with respect to a characteristic in which we are interested, we may get a large number
of observations. It is not possible to grasp the characteristic by looking at all the
observations at once, so it is better to summarise the group by one number. That number must be a good
representative of all the observations and give a clear picture of the characteristic. Such a
representative number can be a central value for all these observations. This central value is called a
measure of central tendency, an average, or a measure of location. There are five averages. Among
them, the mean, median and mode are called simple averages, and the other two, the geometric mean
and the harmonic mean, are called special averages.
The meaning of average is nicely given in the following definitions. “A measure of central tendency is a
typical value around which other figures congregate.”
“An average stands for the whole group of which it forms a part yet represents the whole.”
“One of the most widely used set of summary figures is known as measures of location.”
Besides the above requisites, a good average should represent the maximum characteristics of the data, and its
value should be nearest to most of the items in the given series.
Arithmetic mean, or simply the mean, of a variable is defined as the sum of the observations divided by
the number of observations. If the variable x assumes n values x1, x2, …, xn, then the mean, $\bar{x}$, is given by
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
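The definition above can be sketched in a few lines of Python. The function name `arithmetic_mean` and the sample marks are illustrative assumptions, not part of the notes.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their number."""
    if not values:
        raise ValueError("the mean of an empty series is undefined")
    return sum(values) / len(values)


# Illustrative data: five marks chosen only for this example.
marks = [68, 75, 65, 67, 70]
print(arithmetic_mean(marks))  # 345 / 5 = 69.0
```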
Merits:
1. It is rigidly defined.
3. If the number of items is sufficiently large, it is more accurate and more reliable.
5. It is possible to calculate even if some of the details of the data are lacking.
Demerits:
2. It cannot be used in the study of qualitative phenomena that are not capable of numerical measurement, e.g.
intelligence, beauty, honesty, etc.
3. It can ignore any single item only at the risk of losing its accuracy.
6. It may lead to fallacious conclusions, if the details of the data from which it is computed are not
given.
A weighted average is one whose component items are multiplied by certain values known as “weights”,
and the aggregate of the products is divided by the total sum of the weights.
If x1, x2, …, xn are the values of a variable x with respective weights w1, w2, …, wn assigned to them,
then
$$\text{Weighted A.M} = \bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum w_i x_i}{\sum w_i}$$
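A minimal sketch of the weighted arithmetic mean, assuming the values and weights are supplied as two lists of equal length. The function name and the sample figures are hypothetical.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical scores with weights 1, 2 and 2.
scores = [60, 80, 90]
weights = [1, 2, 2]
print(weighted_mean(scores, weights))  # (60 + 160 + 180) / 5 = 80.0
```

With equal weights the result reduces to the ordinary arithmetic mean.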
Merits of H.M :
1. It is rigidly defined.
4. It is the most suitable average when it is desired to give greater weight to smaller observations
and less weight to the larger ones.
Demerits of H.M :
2. It is difficult to compute.
3. It is only a summary figure and may not be the actual item in the series
4. It gives greater importance to small items and is, therefore, useful only when small items have to
be given greater weightage.
Merits of Geometric mean :
1. It is rigidly defined
Demerits of Geometric mean :
1. It cannot be used when the values are negative or if any of the observations is zero.
2. It is difficult to calculate particularly when the items are very large or when there is a frequency
distribution.
3. It brings out the property of the ratio of the change and not the absolute difference of change, as
is the case with the arithmetic mean.
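The notes list the merits and demerits of the two special averages without spelling out their formulas. The sketch below assumes the usual textbook definitions: the geometric mean is the n-th root of the product of n positive values, and the harmonic mean is the reciprocal of the mean of the reciprocals. The function names and data are illustrative.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    if any(x <= 0 for x in values):
        # Matches the demerit above: zero or negative values are not allowed.
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(x) for x in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if any(x <= 0 for x in values):
        raise ValueError("this sketch handles strictly positive values only")
    return len(values) / sum(1 / x for x in values)


data = [2, 8]  # illustrative values
am = sum(data) / len(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)
print(am, gm, hm)
```

For this data the harmonic mean is the smallest of the three, which illustrates the greater weight it gives to smaller observations.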
Grouped Data:
In a grouped distribution, values are associated with frequencies. Grouping can be in the form of a
discrete frequency distribution or a continuous frequency distribution. Whatever may be the type of
distribution, cumulative frequencies have to be calculated to know the total number of items.
The cumulative frequency of each class is the sum of the frequency of that class and the frequencies of the
previous classes, i.e. the frequencies are added successively, so that the last cumulative frequency gives the
total number of items.
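Successive addition of frequencies can be sketched with `itertools.accumulate` from the Python standard library. The class frequencies below are hypothetical.

```python
from itertools import accumulate

# Hypothetical frequencies of five classes, in order.
frequencies = [3, 7, 12, 5, 3]

# Each entry is the frequency of the class plus all earlier frequencies.
cumulative = list(accumulate(frequencies))
total_items = cumulative[-1]

print(cumulative)   # [3, 10, 22, 27, 30]
print(total_items)  # 30
```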
Discrete Series:
Positional Averages:
These averages are based on the position of the given observation in a series arranged in ascending
or descending order. The magnitude or size of the values does not matter as it did in the case of the
arithmetic mean. It is because of this basic difference
that the median and mode are called positional measures of an average.
Median :
The median is that value of the variate which divides the group into two equal parts, one part
comprising all values greater, and the other, all values less than median.
Arrange the given values in increasing or decreasing order. If the number of values is odd, the median
is the middle value. If the number of values is even, the median is the mean of the middle two values.
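The rule for ungrouped values can be sketched as follows. The function name and both data sets are illustrative assumptions.

```python
def median(values):
    """Middle value of the sorted series, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty series is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of the middle two


print(median([68, 75, 65, 67, 70]))  # sorted: 65 67 68 70 75 -> 68
print(median([85, 90, 80, 25]))      # sorted: 25 80 85 90 -> 82.5
```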
Merits of Median :
4. Median can be located even for qualitative factors such as ability, honesty etc.
Demerits of Median :
1. A slight change in the series may bring drastic change in median value.
2. In case of even number of items or continuous series, median is an estimated value other than
any value in the series.
3. It is not suitable for further mathematical treatment except its use in mean deviation.
The quartiles divide the distribution into four equal parts. There are three quartiles. The second quartile divides
the distribution into two halves and is therefore the same as the median. The first (lower) quartile (Q1)
marks off the first one-fourth, and the third (upper) quartile (Q3) marks off the first three-fourths.
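A short sketch of the three quartiles using `statistics.quantiles` (Python 3.8 or later). Quartile conventions differ between textbooks; the default "exclusive" method used here places Q1 at the (n + 1)/4-th ordered item, which is an assumption about the convention intended in these notes. The data are hypothetical.

```python
import statistics

data = [5, 8, 10, 12, 15, 18, 20]  # hypothetical observations, n = 7

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(q1, q2, q3)
print(statistics.median(data))  # the second quartile equals the median
```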
Mode :
The mode is the value in a distribution that occurs most frequently. It is an actual value, which has the highest concentration of items in and around
it.
“The mode of a distribution is the value at the point around which the items tend to be most heavily concentrated. It
may be regarded as the most typical of a series of values”.
It shows the centre of concentration of the frequency in and around a given value. Therefore, where the
purpose is to know the point of the highest concentration, it is preferred. It is, thus, a positional
measure.
Its importance is very great in marketing studies where a manager is interested in knowing about the
size which has the highest concentration of items. For example, in placing an order for shoes or
ready-made garments the modal size helps, because this size and the sizes around it are in common demand.
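The shoe-size illustration can be sketched with `collections.Counter`. The list of sizes sold is invented for the example; a series may have more than one mode, so the sketch returns every value that shares the highest frequency.

```python
from collections import Counter

# Hypothetical shoe sizes sold in one day.
shoe_sizes = [7, 8, 8, 9, 8, 7, 10, 8, 9, 6]

counts = Counter(shoe_sizes)
highest = max(counts.values())

# Keep every size whose frequency equals the highest frequency.
modes = sorted(size for size, count in counts.items() if count == highest)

print(modes)  # [8]
```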
Merits of Mode:
Demerits of mode:
MEASURES OF DISPERSION –
Introduction :
The measures of central tendency serve to locate the center of the distribution, but they do not reveal
how the items are spread out on either side of the center. This characteristic of a frequency distribution
is commonly referred to as dispersion. In a series all the items are not equal. There is difference or
variation among the values. The degree of variation is evaluated by various measures of dispersion.
Small dispersion indicates high uniformity of the items, while large dispersion indicates less uniformity.
For example, consider the following marks of two students.
Student I    Student II
68           85
75           90
65           80
67           25
70           65
Both students have the same total (345) and hence the same average (69), so when the averages alone are
considered, the two students are equal. The fact, however, is that the second student has failed in one paper.
The first student has less variation than the second student, and less variation is a
desirable characteristic.
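The comparison of the two students can be checked with a small sketch. The helper names are illustrative, and the spread is measured here simply as the gap between the highest and lowest marks.

```python
student_1 = [68, 75, 65, 67, 70]
student_2 = [85, 90, 80, 25, 65]


def mean(values):
    return sum(values) / len(values)


def spread(values):
    """Gap between the highest and lowest marks."""
    return max(values) - min(values)


print(mean(student_1), mean(student_2))      # 69.0 69.0
print(spread(student_1), spread(student_2))  # 10 65
```

The averages agree, but the marks of the second student are far more scattered.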
There are two kinds of measures of dispersion, namely 1. absolute measures of dispersion and 2. relative measures of dispersion.
Absolute measure of dispersion indicates the amount of variation in a set of values in terms of units of
observations. For example, when rainfalls on different days are available in mm, any absolute measure
of dispersion gives the variation in rainfall in mm. On the other hand, relative measures of dispersion are
free from the units of measurement of the observations. They are pure numbers. They are used to
compare the variation in two or more sets that have different units of measurement.
The various absolute and relative measures of dispersion are listed below.
Range:
This is the simplest possible measure of dispersion and is defined as the difference between the largest
and smallest values of the variable.
In symbols, Range = L – S,
where L = Largest value and S = Smallest value.
Merits:
1. It is simple to understand.
2. It is easy to calculate.
3. In certain types of problems like quality control, weather forecasts, share price analysis, etc.,
range is most widely used.
Demerits:
Definition: Quartile Deviation is half of the difference between the first and third quartiles. Hence, it is
called Semi Inter Quartile Range.
$$\text{Q.D} = \frac{Q_3 - Q_1}{2}$$
$$\text{Co-efficient of Q.D} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
3. It can be calculated for data with open end classes also.
Demerits:
1. It is not based on all the items. It is based on two positional values Q1 and Q3 and ignores the
extreme 50% of the items.
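The quartile deviation and its coefficient can be sketched directly from the two quartiles. The function names and the quartile values are hypothetical.

```python
def quartile_deviation(q1, q3):
    """Half of the difference between the third and first quartiles."""
    return (q3 - q1) / 2


def coefficient_of_qd(q1, q3):
    """Relative measure: (Q3 - Q1) / (Q3 + Q1)."""
    if q3 + q1 == 0:
        raise ValueError("Q3 + Q1 must not be zero")
    return (q3 - q1) / (q3 + q1)


q1, q3 = 40, 60  # hypothetical quartiles
print(quartile_deviation(q1, q3))  # (60 - 40) / 2 = 10.0
print(coefficient_of_qd(q1, q3))   # 20 / 100 = 0.2
```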
Mean Deviation:
The range and quartile deviation are not based on all observations. They are positional measures of
dispersion. They do not show any scatter of the observations from an average. The mean deviation is a
measure of dispersion based on all items in a distribution.
Definition:
Mean deviation is the arithmetic mean of the deviations of a series computed from any measure of
central tendency, i.e., the mean, median or mode. All the deviations are taken as positive, i.e., signs are
ignored. According to Clark and Schekade,
“Average deviation is the average amount of scatter of the items in a distribution from either the mean or
the median, ignoring the signs of the deviations”.
We usually compute mean deviation about any one of the three averages: mean, median or mode.
Sometimes the mode may be ill-defined, and as such mean deviation is computed from the mean or the median.
Between the mean and the median, the median is the preferred choice. But in general practice, and due to the wide
applications of the mean, the mean deviation is generally computed from the mean. M.D can be used to denote
mean deviation.
Mean deviation calculated by any measure of central tendency is an absolute measure. For the purpose
of comparing variation among different series, a relative mean deviation is required. The relative mean
deviation is obtained by dividing the mean deviation by the average used for calculating mean deviation.
$$\text{Relative mean deviation} = \frac{\text{Mean deviation}}{\text{Mean or Median or Mode}}$$
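A minimal sketch of the mean deviation and its relative form, taking the deviations about a chosen average and ignoring their signs. The function name and the marks are illustrative.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Arithmetic mean of the absolute deviations from `center`."""
    return sum(abs(x - center) for x in values) / len(values)


marks = [68, 75, 65, 67, 70]  # illustrative data

md_about_mean = mean_deviation(marks, mean(marks))      # deviations 1, 6, 4, 2, 1
md_about_median = mean_deviation(marks, median(marks))  # deviations 0, 7, 3, 1, 2

# Relative mean deviation: divide by the average used for the deviations.
relative_md = md_about_mean / mean(marks)

print(md_about_mean, md_about_median, relative_md)
```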
Merits:
2. It is rigidly defined.
Demerits:
4. Algebraic positive and negative signs are ignored, which makes it mathematically unsound and illogical.
Standard Deviation :
Karl Pearson introduced the concept of standard deviation in 1893. It is the most important measure of
dispersion and is widely used in many statistical formulae. Standard deviation is also called Root-Mean
Square Deviation, because it is the square root of the mean of the squared deviations from the
arithmetic mean. It provides an accurate result. The square of the standard deviation is called the variance.
Definition:
It is defined as the positive square root of the arithmetic mean of the squares of the deviations of the
given observations from their arithmetic mean.
Thus $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{n}}$ or $\sqrt{\frac{\sum x^2}{n}}$, where $x = X - \bar{X}$.
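The definition can be followed step by step in code. This is a sketch that divides by n, as the definition above does; the function names and data are illustrative.

```python
import math


def population_variance(values):
    """Mean of the squared deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Positive square root of the variance."""
    return math.sqrt(population_variance(values))


marks = [68, 75, 65, 67, 70]  # mean 69, squared deviations 1, 36, 16, 4, 1
print(population_variance(marks))  # 58 / 5
print(standard_deviation(marks))
```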
Taking deviations from a fractional mean would be a very difficult and tedious task. To save time and
labour, we apply the short-cut method, in which deviations are taken from an assumed mean. The formula is:
$$\sigma = \sqrt{\frac{\sum d^2}{N} - \left(\frac{\sum d}{N}\right)^2}$$
where d stands for the deviation from the assumed mean, d = (X − A).
Steps:
2. Find out the deviations from the assumed mean, i.e., X − A, denoted by d, and also the total of the
deviations, $\sum d$.
3. Square the deviations, i.e., $d^2$, and add up the squares of the deviations, i.e., $\sum d^2$.
$$\sigma = \sqrt{\frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2}$$
Note: We can also use the simplified formula for standard deviation.
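$$\sigma = \frac{1}{n} \sqrt{n \sum d^2 - \left(\sum d\right)^2}$$
For a frequency distribution with step deviations $d = (X - A)/c$:
$$\sigma = \frac{c}{N} \sqrt{N \sum f d^2 - \left(\sum f d\right)^2}$$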
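The short-cut method gives the same result as the direct definition whatever assumed mean is chosen. The sketch below checks this on illustrative data; the function names and the assumed mean A = 70 are hypothetical choices.

```python
import math


def sd_direct(values):
    """Standard deviation from deviations about the actual mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sd_assumed_mean(values, assumed_mean):
    """Short-cut method: sqrt(sum(d^2)/n - (sum(d)/n)^2) with d = X - A."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return math.sqrt(sum(di ** 2 for di in d) / n - (sum(d) / n) ** 2)


def sd_simplified(values, assumed_mean):
    """Simplified form: (1/n) * sqrt(n * sum(d^2) - (sum(d))^2)."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return math.sqrt(n * sum(di ** 2 for di in d) - sum(d) ** 2) / n


marks = [68, 75, 65, 67, 70]
print(sd_direct(marks))
print(sd_assumed_mean(marks, 70))  # d = -2, 5, -5, -3, 0
print(sd_simplified(marks, 70))
```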
Discrete Series:
There are three methods for calculating standard deviation in discrete series:
(a) Actual mean method (b) Assumed mean method (c) Step-deviation method.
(a) Actual mean method:
Steps:
$x - \bar{x} = d$.
3. Square the deviations ($d^2$) and multiply by the respective frequencies ($f$) to get $fd^2$.
$$\sigma = \sqrt{\frac{\sum f d^2}{\sum f}}$$
If the actual mean is a fraction, the calculation takes a lot of time and labour, and as such this method is
rarely used in practice.
Here deviations are taken not from the actual mean but from an assumed mean. This method is also used
if the given variable values are not in equal intervals.
Steps:
1. Assume any one of the items in the series as an assumed mean and denoted by A.
2. Find out the deviations from assumed mean, i.e, X-A and denote it by d.
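5. Multiply the squared deviations ($d^2$) by the respective frequencies ($f$) to get $fd^2$.
$$\sigma = \sqrt{\frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2}$$
where $d = X - A$ and $N = \sum f$.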
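For a discrete series the actual mean method and the assumed mean method can be compared in a short sketch. The values, frequencies and the assumed mean A = 2 are hypothetical, and the function names are illustrative.

```python
import math

values = [1, 2, 3, 4, 5]        # hypothetical variable values X
frequencies = [2, 3, 5, 3, 2]   # hypothetical frequencies f


def sd_actual_mean(xs, fs):
    """sqrt(sum(f * d^2) / sum(f)) with d = X - mean."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return math.sqrt(sum(f * (x - mean) ** 2 for x, f in zip(xs, fs)) / n)


def sd_assumed_mean(xs, fs, assumed_mean):
    """sqrt(sum(f*d^2)/N - (sum(f*d)/N)^2) with d = X - A and N = sum(f)."""
    n = sum(fs)
    sum_fd = sum(f * (x - assumed_mean) for x, f in zip(xs, fs))
    sum_fd2 = sum(f * (x - assumed_mean) ** 2 for x, f in zip(xs, fs))
    return math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2)


print(sd_actual_mean(values, frequencies))
print(sd_assumed_mean(values, frequencies, 2))
```

Both calls give the same standard deviation, since the choice of A only shifts the deviations.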
Merits:
1. It is rigidly defined and its value is always definite and based on all the observations and the
actual signs of deviations are used.
Demerits:
2. It gives more weight to extreme values because the values are squared up.
Coefficient of Variation :
The standard deviation is an absolute measure of dispersion. It is expressed in terms of the units in which
the original figures are collected and stated. The standard deviation of heights of students cannot be
compared with the standard deviation of weights of students, as both are expressed in different units,
i.e. heights in centimetres and weights in kilograms. Therefore the standard deviation must be converted
into a relative measure of dispersion for the purpose of comparison. The relative measure is known as
the coefficient of variation.
The coefficient of variation is obtained by dividing the standard deviation by the mean and multiplying
by 100. Symbolically,
$$\text{C.V} = \frac{\sigma}{\bar{X}} \times 100$$
If we want to compare the variability of two or more series, we can use the C.V. The series or group of data
for which the C.V. is greater is more variable, less stable, less uniform, less
consistent or less homogeneous. The series for which the C.V. is smaller is less variable, more
stable, more uniform, more consistent or more homogeneous.
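A short sketch of the comparison, using `statistics.pstdev` (the standard deviation with divisor n) and two illustrative series of marks. The function name is hypothetical, and the sketch assumes a non-zero mean.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean, multiplied by 100."""
    return pstdev(values) / mean(values) * 100


series_1 = [68, 75, 65, 67, 70]  # illustrative marks
series_2 = [85, 90, 80, 25, 65]

cv_1 = coefficient_of_variation(series_1)
cv_2 = coefficient_of_variation(series_2)

# The series with the smaller C.V. is the more consistent one.
more_consistent = "series_1" if cv_1 < cv_2 else "series_2"
print(cv_1, cv_2, more_consistent)
```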
Skewness:
Meaning:
Skewness means ‘lack of symmetry’. We study skewness to have an idea about the shape of the curve
which we can draw with the help of the given data. If in a distribution mean = median = mode, then that
distribution is known as a symmetrical distribution. If in a distribution mean ≠ median ≠ mode, then it is
not a symmetrical distribution; it is called a skewed distribution, and such a distribution could either
be positively skewed or negatively skewed.
a) Symmetrical distribution:
Mean = Median = Mode
It is clear from the above diagram that in a symmetrical distribution the values of mean, median and
mode coincide. The spread of the frequencies is the same on both sides of the center point of the curve.
b) Positively skewed distribution:
It is clear from the above diagram that, in a positively skewed distribution, the value of the mean is maximum
and that of the mode is least, the median lies in between the two. In the positively skewed distribution
the frequencies are spread out over a greater range of values on the right hand side than they are on
the left hand side.
c) Negatively skewed distribution:
It is clear from the above diagram that, in a negatively skewed distribution, the value of the mode is
maximum and that of the mean is least. The median lies in between the two. In the negatively skewed
distribution the frequencies are spread out over a greater range of values on the left hand side than
they are on the right hand side.
Measures of skewness:
According to Karl Pearson, the absolute measure of skewness = mean – mode. This measure is not
suitable for making valid comparisons of the skewness in two or more distributions because the unit of
measurement may be different in different series. To avoid this difficulty, use the relative measure of
skewness called Karl Pearson’s coefficient of skewness, given by:
$$\text{Coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}}$$
In case the mode is ill-defined, the coefficient can be determined by the formula:
$$\text{Coefficient of skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{S.D.}}$$
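Both forms of Karl Pearson's coefficient can be sketched from summary figures. The function names and the summary values (mean 50, median 48, mode 44, S.D. 10) are hypothetical.

```python
def pearson_skewness_mode(mean, mode, sd):
    """(Mean - Mode) / S.D."""
    if sd == 0:
        raise ValueError("the standard deviation must not be zero")
    return (mean - mode) / sd


def pearson_skewness_median(mean, median, sd):
    """3 * (Mean - Median) / S.D., used when the mode is ill-defined."""
    if sd == 0:
        raise ValueError("the standard deviation must not be zero")
    return 3 * (mean - median) / sd


# Hypothetical summary figures for one distribution.
print(pearson_skewness_mode(mean=50, mode=44, sd=10))      # 6 / 10 = 0.6
print(pearson_skewness_median(mean=50, median=48, sd=10))  # 6 / 10 = 0.6
```

A positive coefficient indicates positive skewness (mean greater than mode), and a negative one indicates negative skewness.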
Bowley’s Coefficient of skewness:
In Karl Pearson’s method of measuring skewness the whole of the series is needed. Prof. Bowley has
suggested a formula based on the relative position of the quartiles. In a symmetrical distribution, the quartiles
are equidistant from the value of the median, i.e.,
Median – Q1 = Q3 – Median. But in a skewed distribution, the quartiles will not be equidistant from the
median. Hence Bowley has suggested the following formula:
$$\text{Bowley's coefficient of skewness} = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}$$
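Bowley's formula needs only the three quartiles. In the sketch below the function name and the quartile values are hypothetical.

```python
def bowley_skewness(q1, median, q3):
    """(Q3 + Q1 - 2 * Median) / (Q3 - Q1)."""
    if q3 == q1:
        raise ValueError("Q3 and Q1 must differ")
    return (q3 + q1 - 2 * median) / (q3 - q1)


# Quartiles equidistant from the median: no skewness.
print(bowley_skewness(q1=40, median=50, q3=60))  # 0.0

# Median closer to Q1 than to Q3: positive skewness.
print(bowley_skewness(q1=40, median=45, q3=60))  # (100 - 90) / 20 = 0.5
```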
Kurtosis:
The three measures – central tendency, dispersion and skewness describe the characteristics of
frequency distributions. But these studies will not give us a clear picture of the characteristics of a
distribution.
As far as the measurement of shape is concerned, we have two characteristics: skewness, which refers
to the asymmetry of a series, and kurtosis, which measures the peakedness of a frequency curve relative to the normal curve. All
frequency curves show different degrees of flatness or peakedness. This characteristic of a frequency
curve is termed kurtosis. Measures of kurtosis denote the shape of the top of a frequency curve and
tell us the extent to which a distribution is more peaked or more flat-topped than the normal
curve. The normal curve, which is symmetrical and bell-shaped, is designated as mesokurtic. If a curve is relatively
narrower and more peaked at the top, it is designated as leptokurtic. If the frequency curve is flatter than the
normal curve, it is designated as platykurtic.
Measure of Kurtosis:
The measure of kurtosis of a frequency distribution based on moments is denoted by $\beta_2$ and is given by
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
where $\mu_2$ and $\mu_4$ are the second and fourth central moments.
If $\beta_2 = 3$, the distribution is said to be normal and the curve is mesokurtic.
If $\beta_2 > 3$, the distribution is said to be more peaked and the curve is leptokurtic.
If $\beta_2 < 3$, the distribution is said to be flat-topped and the curve is platykurtic.
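A moment-based measure of kurtosis can be sketched as below, assuming the usual definition of the ratio of the fourth central moment to the square of the second. The function names, the tolerance used for the "equal to 3" case and both data sets are illustrative assumptions.

```python
def central_moment(values, k):
    """k-th moment about the arithmetic mean, with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def beta_2(values):
    """Fourth central moment divided by the square of the second."""
    mu_2 = central_moment(values, 2)
    if mu_2 == 0:
        raise ValueError("all values are equal, so kurtosis is undefined")
    return central_moment(values, 4) / mu_2 ** 2


def classify(b2, tolerance=1e-9):
    if abs(b2 - 3) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"


flat = [1, 2, 3, 4, 5]                       # evenly spread values
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]   # most values at the centre

print(beta_2(flat), classify(beta_2(flat)))
print(beta_2(peaked), classify(beta_2(peaked)))
```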