notes-statistics
3A.1. Introduction
Statistics enters into almost every phase of life in some way. A daily news broadcast may
start with a weather forecast and end with an analysis of the stock market. Used
systematically, statistics provides a basis for investigation in many fields of knowledge, such as
the social, physical, biological and medical sciences, engineering, education, business and
management. Information on a topic is acquired in the form of numbers; these data are
analysed in order to obtain a better understanding of the phenomenon of interest,
and some conclusions may be drawn. Often generalizations are sought; their validity is
assessed by further investigation.
The totality of individuals under consideration is called the population, and each individual in
the population is called a unit. A particular aspect about which we require information is
called a characteristic. Sometimes values for all individuals in the relevant population
are obtained, but often only a set of individuals that can be considered representative
of that population is observed; such a set of individuals constitutes a sample.
The method of collecting data from all the units of the population is called a census survey
or complete enumeration, and the method of collecting data from a sample is called a sample
survey.
A researcher who has to deal with data will be more efficient if the data are presented
in a properly tabulated, easy-to-read form. This facilitates quick assimilation of the data
and decision-making. However, the data needed for analysis are generally not
available in a proper format. The analyst is therefore required to organize the data
into a proper format.
Raw Data: The data in the form originally collected, completely devoid of
arrangement by size or sequence are known as raw data. That is, the unorganized data
are called raw data. These raw data are not amenable even to simple reading and do
not highlight any characteristic or trend.
Frequency Distribution: A frequency distribution can be either grouped or
ungrouped. If the distinct values in a set of data are few, we can write the observed
values in one column and the corresponding numbers of repetitions (frequencies) in
another column. Such a frequency distribution is called an ungrouped frequency
distribution. If the distinct values in the set of data are many, we can group the
values into different classes and find the number of observations in each class.
The distribution is then called a grouped frequency distribution.
For example, the numbers of children in 20 households can be summarized as follows:
The first and second columns together form an ungrouped frequency table. The third
column gives the relative frequency, which is obtained on dividing each frequency by
the total frequency. The last column gives the cumulative frequency.
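The construction of an ungrouped frequency table with relative and cumulative frequencies can be sketched in Python as follows. This is only an illustration: the data list and the function name `frequency_table` are assumptions made for the example and do not reproduce the household table of these notes.

```python
from collections import Counter

def frequency_table(data):
    """Return rows of (value, frequency, relative frequency, cumulative frequency)."""
    counts = Counter(data)          # frequency of each distinct value
    total = len(data)               # total frequency
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value], counts[value] / total, cumulative))
    return rows

# Hypothetical data: number of children in 10 households.
children = [5, 2, 2, 3, 1, 4, 3, 2, 1, 2]
table = frequency_table(children)
for row in table:
    print(row)
```

The relative frequencies add up to 1 and the last cumulative frequency equals the total frequency, which is a quick check on any such table.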
The following table gives the marks of 50 students in an examination. The data is
summarized in the form of a grouped frequency table.
Marks Frequency
0 - 10 2
10 - 20 5
20 - 30 25
30 - 40 15
40 - 50 3
If the upper limit of a class is the same as the lower limit of the next class, then the
distribution is continuous. Here the upper limit of each class is not included in that class,
so this type of classification is called exclusive classification.
3A.3. MEASURES OF CENTRAL TENDENCY ( AVERAGES)
After a set of data has been collected, it must be organized and condensed or
categorized for purposes of analysis. In addition to graphical displays, numerical
indices can be computed that summarize the primary features of the data set. One is an
indicator of location or central tendency that specifies where the set of measurements
is “located”. That is, an average is a value that is relatively close to all the
observations and acts as their representative.
The commonly used averages are Arithmetic Mean (AM), Median and Mode.
1. Arithmetic Mean
Arithmetic mean is defined as the sum of the values divided by the total number of
values in the data.
If $x_1, x_2, \dots, x_n$ are the values, then their AM, denoted by $\bar{X}$, is
$\bar{X} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example 1: The numbers of children in 10 families are: 5, 2, 2, 3, 1, 4, 3, 2, 1, 2.
Solution: AM $= \frac{5 + 2 + 2 + 3 + 1 + 4 + 3 + 2 + 1 + 2}{10} = \frac{25}{10} = 2.5$
If the data are given as an ungrouped frequency table,
Values   : x1 x2 x3 … xn
Frequency: f1 f2 f3 … fn
then $x_1 f_1$ gives the sum of all observations with value $x_1$, $x_2 f_2$ gives the sum of all
observations with value $x_2$, …, and $x_n f_n$ gives the sum of all observations with value $x_n$.
Therefore, the sum of all observations is $x_1 f_1 + x_2 f_2 + \dots + x_n f_n$.
Arithmetic mean, $\bar{X} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_n f_n}{f_1 + f_2 + \dots + f_n} = \frac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i}$
x: 1 2 3 4 5 6 7
f: 5 9 12 17 14 10 6
Solution
x f fx
1 5 5
2 9 18
3 12 36
4 17 68
5 14 70
6 10 60
7 6 42
Total 73 299
$\bar{X} = \frac{\sum f x}{\sum f} = \frac{299}{73} = 4.096$
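The same calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `frequency_mean` is an assumption made for the example, and the data are the x and f values of the table above.

```python
def frequency_mean(values, freqs):
    """Arithmetic mean of an ungrouped frequency table: sum(f * x) / sum(f)."""
    total_fx = sum(x * f for x, f in zip(values, freqs))
    total_f = sum(freqs)
    return total_fx / total_f

x = [1, 2, 3, 4, 5, 6, 7]
f = [5, 9, 12, 17, 14, 10, 6]
mean = frequency_mean(x, f)
print(round(mean, 3))  # 4.096
```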
In a grouped frequency table we don’t know the actual values of the observations
falling in a class. We only know that the values of observations falling in a class lie
between its lower limit and upper limit. So, we cannot find out the exact AM. For
calculating AM(approximate) of a grouped table, we make an assumption that the
values of observations falling in a class are equal to the mid-value of that class. Then
we consider the class mid-values as x-values and make use of the formulae in the case
of ungrouped frequency table.
Solution:
Wages      Mid-value (x)   Frequency (f)   fx
50 - 60 55 2 110
60 - 70 65 5 325
70 - 80 75 7 525
80 - 90 85 6 510
90 - 100 95 5 475
Total 25 1945
$\bar{X} = \frac{\sum f x}{\sum f} = \frac{1945}{25} = 77.8$
Short-cut Method to find AM
If the values of x are very large, the calculation of AM becomes time consuming.
Let the mid-values of k classes be $x_1, x_2, \dots, x_k$ and $f_1, f_2, \dots, f_k$ be the corresponding
frequencies. We use the transformation $u_i = \frac{x_i - A}{C}$ for $i = 1, 2, \dots, k$.
Here A and C can be any two numbers. But it is better to take A as a number among the
middle part of the mid-values. If all the classes are of equal width, C can be taken as the
class width.
Then the AM is $\bar{X} = A + C\bar{u}$, where $\bar{u} = \frac{\sum f_i u_i}{\sum f_i}$.
Example 4: Find the AM of the following data:
Marks : 0 – 10 10 – 20 20 – 30 30 – 40 40 – 50 50 - 60
No. of students: 12 18 27 20 17 6
Solution:
With A = 35 and C = 10:
$\bar{u} = \frac{\sum f u}{\sum f} = \frac{-70}{100} = -0.7$
$\bar{X} = A + C\bar{u} = 35 + 10 \times (-0.7) = 35 - 7 = 28$
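The short-cut method can be sketched in Python and compared with the direct formula. The function name `shortcut_mean` is an assumption for this illustration; A = 35 and C = 10 are the values used in the solution above, and the mid-values are those of the classes in Example 4.

```python
def shortcut_mean(mids, freqs, A, C):
    """AM by the short-cut method: X = A + C * u_bar, with u = (x - A) / C."""
    u = [(x - A) / C for x in mids]
    u_bar = sum(fi * ui for fi, ui in zip(freqs, u)) / sum(freqs)
    return A + C * u_bar

mids = [5, 15, 25, 35, 45, 55]        # mid-values of the classes 0-10, ..., 50-60
freqs = [12, 18, 27, 20, 17, 6]

shortcut = shortcut_mean(mids, freqs, A=35, C=10)
direct = sum(x * f for x, f in zip(mids, freqs)) / sum(freqs)
print(round(shortcut, 2), round(direct, 2))  # 28.0 28.0
```

Both routes give the same mean for any choice of A and nonzero C; the transformation only makes the hand arithmetic lighter.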
Properties of AM
2. The sum of squares of deviations of a set of values is minimum when the
deviations are taken about the AM.
Let $\bar{x}_1$ and $\bar{x}_2$ be the means of two groups. Let there be $n_1$ observations in the
first group and $n_2$ observations in the second group. Then $\bar{x}$, the mean of the combined
group, can be obtained as
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
Example 5: Average daily wage of 60 male workers in a firm is Rs. 120 and that of
40 females is Rs.100. Find the mean wage of all the workers.
Combined mean $= \frac{60 \times 120 + 40 \times 100}{60 + 40} = \frac{11200}{100} = 112$
Weighted AM
When calculating AM we assume that all the observations have equal importance.
If some items are more important than others, proper weightage should be given in
accordance with their importance. Let w 1, w2, …, wn be the weights attached to the
items x1, x2, …, xn, then the weighted AM is defined as
Weighted mean $= \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$
Example 6: A teacher has decided to use a weighted average in figuring final grades
for his students. The midterm examination will count 40%, the final examination will
count 50% and quizzes 10%. Compute the average mark obtained for a student who
got 90 marks for midterm examination, 80 marks for final and 70 for quizzes.
Weighted mean $= \frac{40 \times 90 + 50 \times 80 + 10 \times 70}{40 + 50 + 10} = \frac{8300}{100} = 83$
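A weighted mean is easy to verify in Python. The sketch below uses an illustrative function name, `weighted_mean`, which is an assumption for the example. It also shows that the combined mean of Example 5 is a weighted mean in which the group sizes act as the weights.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Example 6: midterm 40%, final 50%, quizzes 10%.
final_grade = weighted_mean([90, 80, 70], [40, 50, 10])

# Example 5: group means weighted by the number of workers in each group.
combined_wage = weighted_mean([120, 100], [60, 40])

print(final_grade, combined_wage)  # 83.0 112.0
```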
2. Median
The median of a set of observations is a value that divides the set of observations in half,
so that the observations in one half are less than or equal to the median and the observations
in the other half are greater than or equal to the median value.
In finding the median of a set of data it is often convenient to put the observations in
ascending or descending order. If the number of observations is odd, the median is the
middle observation. For example, if the values are 52, 55, 61, 67, and 72, the median is 61.
If there were 4 values instead of 5, say 52, 55, 61, and 67, there would not be a middle
value. Here any number between 55 and 61 could serve as a median; but it is desirable to
use a specific number for the median, and we usually take the AM of the two middle values,
i.e., (55 + 61)/2 = 58.
Median is the primary measure of location for variables measured on ordinal scale because
it indicates which observation is central without attention to how far above or below the
median the other observations fall.
In a grouped frequency distribution, we do not know the exact values falling in each
class. So, the median can be approximated by interpolation. Let the total number of
observations be N. For calculating the median we assume that the observations in the median
class are uniformly distributed. The median class is the class to which the (N/2)th observation
belongs. Also assume that the median is the (N/2)th observation.
Here the frequency table must be continuous. If it is not, convert it into a continuous
table. Prepare a less-than cumulative frequency table and find the median class. Let l be
the lower limit of the median class, f the frequency of the median class, and c the
class width of the median class. By the assumption of uniform distribution, the f
observations in the median class are $l + \frac{c}{f}, l + \frac{2c}{f}, \dots, l + \frac{fc}{f}$. Let m be the cumulative
frequency of the class preceding the median class. Then the median will be the $(\frac{N}{2} - m)$th
observation in the median class.
That is, median $= l + \left(\frac{N}{2} - m\right)\frac{c}{f}$
Example 8: Calculate the median of the following data:
class frequency
1 - 10 4
11 - 20 12
21 - 30 24
31 - 40 36
41 - 50 20
51 - 60 16
61 - 70 8
71 - 80 5
Solution: Since the frequency table is of the inclusive type, convert it into the exclusive type
by subtracting 0.5 from the lower limits and adding 0.5 to the upper limits.
Here $\frac{N}{2} = \frac{125}{2} = 62.5$, which lies in the 30.5 - 40.5 class (the median class).
So, l = 30.5, f = 36, m = 40 and c = 10.
Median $= l + \left(\frac{N}{2} - m\right)\frac{c}{f} = 30.5 + (62.5 - 40) \times \frac{10}{36} = 30.5 + 6.25 = 36.75$
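The interpolation formula for the median of a grouped table can be sketched in Python as below. The function name `grouped_median` and its argument layout are assumptions made for the illustration; the class boundaries are the continuous (exclusive) ones obtained in the solution above, all of width 10.

```python
def grouped_median(lower_bounds, width, freqs):
    """Median of a continuous grouped table: l + (N/2 - m) * c / f."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the table has no observations")
    half = n / 2
    m = 0  # cumulative frequency of the classes before the current one
    for l, f in zip(lower_bounds, freqs):
        if m + f >= half:              # the (N/2)th observation falls in this class
            return l + (half - m) * width / f
        m += f
    raise ValueError("median class not found")

lower_bounds = [0.5, 10.5, 20.5, 30.5, 40.5, 50.5, 60.5, 70.5]
freqs = [4, 12, 24, 36, 20, 16, 8, 5]
median = grouped_median(lower_bounds, 10, freqs)
print(median)  # 36.75
```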
Property of Median: The sum of absolute deviations of a set of values is minimum when
the deviations are taken from the median.
3. Mode
The mode of a categorical or a discrete numerical variable is that category or value which
occurs with the greatest frequency.
In a grouped frequency distribution, to find the mode, first locate the modal class.
The modal class is the class with maximum frequency. Let l be the lower limit of the modal
class, c be the class interval, f1 be the frequency of the modal class, f0 be the frequency
of the class preceding and f2 be the frequency of the class succeeding the modal class.
Then, Mode $= l + \frac{c(f_1 - f_0)}{2f_1 - f_0 - f_2}$
class frequency
10 – 15 3
15 – 20 9
20 – 25 16
25 – 30 12
30 – 35 7
35 – 40 5
40 - 45 2
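As an illustration, the mode of the frequency table above can be computed in Python. This sketch assumes the usual interpolation formula, Mode = l + c(f1 - f0)/(2f1 - f0 - f2), where f0 and f2 are the frequencies of the classes before and after the modal class. The function name `grouped_mode` is an assumption made for the example.

```python
def grouped_mode(lower_bounds, width, freqs):
    """Mode of a grouped table: l + c * (f1 - f0) / (2*f1 - f0 - f2)."""
    i = freqs.index(max(freqs))                       # position of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                 # class preceding the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0    # class succeeding the modal class
    return lower_bounds[i] + width * (f1 - f0) / (2 * f1 - f0 - f2)

lower_bounds = [10, 15, 20, 25, 30, 35, 40]
freqs = [3, 9, 16, 12, 7, 5, 2]
mode = grouped_mode(lower_bounds, 5, freqs)
print(round(mode, 2))  # 23.18
```

Here the modal class is 20 – 25 with f1 = 16, f0 = 9 and f2 = 12, so the mode is 20 + 5 × 7/11.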
The median, as has been indicated, is a locational average which divides the frequency
distribution into two equal parts. Quartiles, deciles and percentiles are not averages. They
are partition values, which divide the distribution into a certain number of equal parts.
Quartiles
Quartiles are the values which divide a frequency distribution into four equal
parts so that 25% of the data fall below the first quartile (Q1), 50% below the second
quartile (Q2), and 75% below the third quartile (Q3). The values of Q1 and Q3 can be
found as in the case of Q2 (the median). For raw data arranged in ascending order, Q1 is
the (n/4)th observation and Q3 is the (3n/4)th observation.
For a grouped table, $Q_1 = l_1 + \left(\frac{N}{4} - m_1\right)\frac{c_1}{f_1}$
where N is the total frequency, l1 is the lower limit of the first quartile class
(the class to which the (N/4)th observation belongs), m1 is the cumulative frequency of the
class preceding the first quartile class, f1 is the frequency of the first quartile class and c1
is the width of the first quartile class.
$Q_3 = l_3 + \left(\frac{3N}{4} - m_3\right)\frac{c_3}{f_3}$
where l3 is the lower limit of the third quartile class (the class to which the (3N/4)th
observation belongs), m3 is the cumulative frequency of the class preceding the third
quartile class, f3 is the frequency of the third quartile class and c3 is the width of the
third quartile class.
Deciles are nine in number and divide the frequency distribution into 10 equal parts.
Percentiles are 99 in number and divide the frequency distribution into 100 equal parts.
Central tendency for interval data is generally represented by the A.M., which takes
into account the available information about distances between scores. For ranked
(ordinal) data, the median is generally most appropriate, and for nominal data, the
mode.
If the distribution is badly skewed, one may prefer the median to the mean, because
the median would not be affected as much by unusual extreme scores. For this reason,
for example, the median income of people is usually reported rather than the A.M.
If one is interested in prediction, the mode is the best value to predict if an exact
score in a group has to be picked.
3A.4. MEASURES OF DISPERSION
So far we have discussed averages as sample values used to represent data. But the average
cannot describe the data completely.
Consider two sets of data:
Set I : 5, 10, 15, 20, 25
Set II: 15, 15, 15, 15, 15
Here we observe that both sets have the same mean, 15. But in set I the observations are
more scattered about the mean. This shows that, even though they have the same mean, the
two sets differ. This reveals the necessity of introducing measures of dispersion.
Commonly used measures of dispersion are Range, Mean deviation, Standard deviation,
and quartile deviation.
1. Range
Range of a set of observations is the difference between the largest and the smallest
observations. In the case of grouped frequency table, range is the difference between the
upper bound of last class and the lower bound of the first class.
Example 1: The range of the set of data 9, 12, 25, 42, 45, 62, 65 is 65 – 9 = 56
Range is the simplest measure of dispersion but its demerit is that it depends only on the
extreme values.
You have seen that range is a measure of dispersion, which does not depend on all
observations. Let us think about another measure of dispersion, which will depend on all
observations.
One measure of dispersion that you may suggest now is the sum of the deviations of
observations from mean. But we know that the sum of deviations of observations from the
A.M is always zero. So we cannot take the sum of deviations of observations from the
mean as a measure.
One method to overcome this is to take the sum of absolute values of these deviations. But
if we have two sets with different numbers of observations this cannot be justified. To make
it meaningful we will take the average of the absolute deviations. Thus mean deviation
(MD) about the mean is the mean of the absolute deviations of observations from arithmetic
mean.
If $x_1, x_2, \dots, x_n$ are n observations, then MD $= \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
Example 2: Find the MD for the following data 12, 15, 21, 24, 28
Solution:
$\bar{X} = \frac{12 + 15 + 21 + 24 + 28}{5} = 20$
x       $|x_i - \bar{x}|$
12      8
15      5
21      1
24      4
28      8
Total   26
MD $= \frac{26}{5} = 5.2$
Let $x_1, x_2, \dots, x_n$ be the values and $f_1, f_2, \dots, f_n$ the corresponding frequencies. Let N
be the sum of the frequencies. Then, MD $= \frac{1}{N}\sum_{i=1}^{n} f_i |x_i - \bar{x}|$
In the case of a grouped frequency table, take the mid-values as x-values and use the same
method given above.
Example 3: Find the mean deviation of the heights of 100 students given below:
Height in cm   Frequency
160 – 162      5
163 – 165      18
166 – 168      42
169 – 171      27
172 – 174      8
Solution:
Height in cm   Mid-value (x)   Frequency (f)   fx      $|x_i - \bar{x}|$   $f_i|x_i - \bar{x}|$
160 – 162      161             5               805     6.45    32.25
163 – 165      164             18              2952    3.45    62.10
166 – 168      167             42              7014    0.45    18.90
169 – 171      170             27              4590    2.55    68.85
172 – 174      173             8               1384    5.55    44.40
Total                          100             16745           226.50
$\bar{X} = \frac{16745}{100} = 167.45$
MD $= \frac{1}{N}\sum_{i=1}^{n} f_i |x_i - \bar{x}| = \frac{226.5}{100} = 2.265$
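The mean deviation about the mean can be checked with a short Python sketch. The function name `mean_deviation` is an assumption for the illustration; raw data are handled by giving every observation a frequency of 1.

```python
def mean_deviation(values, freqs):
    """Mean deviation about the mean: sum(f * |x - mean|) / sum(f)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n

# Example 2: raw data, each value counted once.
raw = [12, 15, 21, 24, 28]
md_raw = mean_deviation(raw, [1] * len(raw))

# Example 3: class mid-values of the heights and their frequencies.
mids = [161, 164, 167, 170, 173]
freqs = [5, 18, 42, 27, 8]
md_heights = mean_deviation(mids, freqs)

print(round(md_raw, 3), round(md_heights, 3))  # 5.2 2.265
```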
When we take the deviations of the observations from their A.M., both positive and
negative values occur. For defining the mean deviation we took absolute values of the
deviations. Another method to avoid this problem is to take the squares of the deviations.
So, the variance is the mean of the squares of the deviations from the A.M. The positive
square root of the variance is called the standard deviation.
If $x_1, x_2, \dots, x_n$ are n observations, then the variance $= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ and the standard
deviation (SD) is defined as SD $= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Example 4: Find the variance and standard deviation of the following data:
42, 39, 44, 40, 36, 39, 30, 46, 48, 36
Solution: Arithmetic mean $\bar{X} = \frac{400}{10} = 40$
$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{10}[(42 - 40)^2 + (39 - 40)^2 + \dots + (36 - 40)^2] = \frac{254}{10} = 25.4$
Variance = 25.4
S.D $= \sqrt{25.4} = 5.04$
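The same result can be obtained with Python's standard `statistics` module. Note that the definition in these notes divides by n, so the population functions `pvariance` and `pstdev` are the ones that match; `variance` and `stdev` divide by n - 1 and would give different values.

```python
import statistics

data = [42, 39, 44, 40, 36, 39, 30, 46, 48, 36]

mean = statistics.mean(data)          # arithmetic mean
variance = statistics.pvariance(data) # mean of squared deviations (divides by n)
sd = statistics.pstdev(data)          # positive square root of the variance

print(mean, round(variance, 2), round(sd, 2))  # 40 25.4 5.04
```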
Let $x_1, x_2, \dots, x_n$ be the values and $f_1, f_2, \dots, f_n$ the corresponding frequencies. Let N
be the sum of the frequencies. Then, Variance $= \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^2$ and
Standard deviation $= \sqrt{\frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}$
The above formula for variance can be expressed as, variance $= \frac{1}{N}\sum f_i x_i^2 - \bar{X}^2$
In the case of a grouped frequency table, take the mid-values as x-values and use the same
method given above.
Example 5: Find the variance and standard deviation of the following data:
class frequency
0 – 10 3
10 – 20 4
20 - 30 6
30 – 40 10
40 - 50 7
Solution:
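Taking the class mid-values 5, 15, 25, 35 and 45 as the x-values:
Variance $= \frac{1}{N}\sum f_i x_i^2 - \bar{X}^2$
N = 30, $\bar{X} = \frac{890}{30} \approx 29.667$, $\sum f_i x_i^2 = 31150$
Variance $= \frac{31150}{30} - \left(\frac{890}{30}\right)^2 \approx 1038.33 - 880.11 = 158.22$
Standard deviation $= \sqrt{158.22} \approx 12.58$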
If the values of x are very large, the calculation of SD becomes time consuming.
Let the mid-values of k classes be $x_1, x_2, \dots, x_k$ and $f_1, f_2, \dots, f_k$ be the corresponding
frequencies. We use the transformation $u_i = \frac{x_i - A}{C}$ for $i = 1, 2, \dots, k$.
Here A and C can be any two numbers. But it is better to take A as a number among the
middle part of the mid-values. If all the classes are of equal width, C can be taken as the
class width.
Variance of the $u_i$'s, Var(u) $= \frac{1}{N}\sum f_i u_i^2 - \bar{u}^2$
Then the variance of the $x_i$'s, Var(x) $= C^2$ Var(u)
That is, SD(x) = C SD(u)
Example 6: Consider the problem in example 5, let us find out the SD using short-cut
method.
Solution:
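Taking A = 25 and C = 10, the u-values are -2, -1, 0, 1 and 2.
$\bar{u} = \frac{\sum f u}{N} = \frac{14}{30} \approx 0.4667$, $\sum f_i u_i^2 = 54$, N = 30
Variance(u) $= \frac{54}{30} - (0.4667)^2 = 1.8 - 0.2178 = 1.5822$
Variance(x) $= 10^2 \times 1.5822 = 158.22$
SD(x) $= \sqrt{158.22} \approx 12.58$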
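The direct formula and the short-cut method can be compared in Python. This is a sketch with assumed function names (`grouped_variance`, `shortcut_variance`); it uses the mid-values and frequencies of Example 5 with A = 25 and C = 10. Carried at full precision, both routes give about 158.22; small differences in hand calculations come from rounding the mean before squaring it.

```python
def grouped_variance(values, freqs):
    """Variance of a frequency table: sum(f * x^2) / N - mean^2."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * x * x for x, f in zip(values, freqs)) / n - mean ** 2

def shortcut_variance(values, freqs, A, C):
    """Variance by the short-cut method: Var(x) = C^2 * Var(u), u = (x - A) / C."""
    u = [(x - A) / C for x in values]
    return C ** 2 * grouped_variance(u, freqs)

mids = [5, 15, 25, 35, 45]
freqs = [3, 4, 6, 10, 7]

direct = grouped_variance(mids, freqs)
shortcut = shortcut_variance(mids, freqs, A=25, C=10)
print(round(direct, 2), round(shortcut, 2), round(direct ** 0.5, 2))  # 158.22 158.22 12.58
```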
Combined Variance
If there are two sets of data consisting of $n_1$ and $n_2$ observations with $s_1^2$ and $s_2^2$ as their
respective variances, then the variance of the combined set consisting of $n_1 + n_2$
observations is
$S^2 = \frac{n_1(s_1^2 + d_1^2) + n_2(s_2^2 + d_2^2)}{n_1 + n_2}$
where $d_1$ and $d_2$ are the differences of the means, $\bar{x}_1$ and $\bar{x}_2$, from the combined
mean $\bar{x}$ respectively.
Series A Series B
Mean 50 40
Standard deviation 5 6
No. of items 100 150
Solution:
Given $\bar{x}_1 = 50$ and $\bar{x}_2 = 40$, $s_1^2 = 25$ and $s_2^2 = 36$, $n_1 = 100$ and $n_2 = 150$
Combined mean $\bar{x} = \frac{100 \times 50 + 150 \times 40}{100 + 150} = 44$,
$d_1 = \bar{x}_1 - \bar{x} = 50 - 44 = 6$, and $d_2 = \bar{x}_2 - \bar{x} = 40 - 44 = -4$
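The combined-variance formula can be applied to the figures of Series A and Series B in a short Python sketch. The function name `combined_mean_variance` is an assumption made for the illustration; the inputs are the means, variances (squares of the given standard deviations) and sizes of the two series.

```python
def combined_mean_variance(n1, mean1, var1, n2, mean2, var2):
    """Return the mean and variance of two groups pooled into one set."""
    mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1 = mean1 - mean   # deviation of the first group mean from the combined mean
    d2 = mean2 - mean   # deviation of the second group mean from the combined mean
    var = (n1 * (var1 + d1 ** 2) + n2 * (var2 + d2 ** 2)) / (n1 + n2)
    return mean, var

mean, var = combined_mean_variance(100, 50, 5 ** 2, 150, 40, 6 ** 2)
print(mean, round(var, 2), round(var ** 0.5, 2))  # 44.0 55.6 7.46
```

With d1 = 6 and d2 = -4, the combined variance is [100(25 + 36) + 150(36 + 16)]/250 = 13900/250 = 55.6.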
4. Quartile Deviation
Quartile deviation (Semi inter-quartile range) is one-half of the difference between the
third quartile and first quartile.
That is, quartile deviation, Q.D $= \frac{Q_3 - Q_1}{2}$
Solution:
Since the data has open ends, Q.D would be a suitable measure.
$Q_1 = l_1 + \left(\frac{N}{4} - m_1\right)\frac{c_1}{f_1}$
$Q_3 = l_3 + \left(\frac{3N}{4} - m_3\right)\frac{c_3}{f_3}$
Here N = 1000, $\frac{N}{4} = 250$, $\frac{3N}{4} = 750$
The class 70 – 90 is the first quartile class and 110 – 130 is the third quartile class.
$Q_1 = 70 + (250 - 154) \times \frac{20}{140} = 83.71$
$Q_3 = 110 + (750 - 594) \times \frac{20}{230} = 123.57$
Q.D $= \frac{123.57 - 83.71}{2} = 19.93$, that is, about Rs. 19.9
Relative Measures
Firm A Firm B
Number of workers 586 648
Average monthly wage 52.5 47.5
Standard deviation 10 11
Solution: Coefficient of variation for firm A $= \frac{10}{52.5} \times 100 \approx 19\%$
Coefficient of variation for firm B $= \frac{11}{47.5} \times 100 \approx 23\%$
There is greater variability in wages in firm B.
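The comparison can be reproduced in Python. The sketch assumes the coefficient of variation used in the solution above, that is, the standard deviation expressed as a percentage of the mean; the function name is an assumption for the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean

cv_a = coefficient_of_variation(10, 52.5)   # firm A
cv_b = coefficient_of_variation(11, 47.5)   # firm B
print(round(cv_a, 2), round(cv_b, 2))       # 19.05 23.16
print("Firm B" if cv_b > cv_a else "Firm A", "has the greater relative variability")
```

Because the coefficient of variation is unit-free, it allows groups with different means to be compared, which the standard deviations alone would not.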
1. Skewness
Very often it becomes necessary to have a measure that reveals the direction of dispersion
about the center of the distribution. Measures of dispersion indicate only the extent to
which individual values are scattered about an average. These do not give information
about the direction of scatter. Skewness refers to the direction of dispersion leading
to departures from symmetry, that is, a lack of symmetry in one direction.
If the frequency curve of a distribution has longer tail to the right of the center of the
distribution, then the distribution is said to be positively skewed. On the other hand, if the
distribution has a longer tail to the left of the center of the distribution, then distribution is
said to be negatively skewed. Measures of skewness indicate the magnitude as well as the
direction of skewness in a distribution.
The relationship between these three measures depends on the shape of the frequency
distribution. In a symmetrical distribution the value of the mean, median and the mode is
the same. But as the distribution deviates from symmetry and tends to become skewed, the
extreme values in the data start affecting the mean.
In a positively skewed distribution, the presence of exceptionally high values affects the
mean more than the median and the mode. Consequently the mean is highest,
followed, in descending order, by the median and the mode. That is, for a positively
skewed distribution, Mean > Median > Mode. In a negatively skewed distribution, on the
other hand, the presence of exceptionally low values makes the mean the
least, followed, in ascending order, by the median and the mode. That is, for a negatively
skewed distribution, Mean < Median < Mode.
Empirically, if the number of observations in any set of data is large enough to make its
frequency distribution smooth and moderately skewed, then, Mean – Mode = 3(Mean –
Median)
Measures of Skewness
1. Karl Pearson’s measure of skewness: Prof. Karl Pearson developed this
measure from the fact that when a distribution drifts away from symmetry,
its mean, median and mode tend to deviate from each other.
Karl Pearson’s measure of skewness is defined as $Sk_P = \frac{\text{Mean} - \text{Mode}}{SD}$
2. Bowley’s measure of skewness: developed by Prof. Bowley, this measure
of skewness is derived from quartile values.
It is defined as $Sk_B = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$
3. Moment measure of skewness:
If $x_1, x_2, \dots, x_n$ are n observations, then the rth moment about the mean is defined
as $m_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$
The moment measure of skewness is defined as $\gamma_1 = m_3/(SD)^3$
In a perfectly symmetrical distribution $\gamma_1 = 0$, and the larger the magnitude
of $\gamma_1$, the greater the degree of skewness.
2.Kurtosis
Solution:
Moment measure of skewness, $\gamma_1 = m_3/(SD)^3 = \frac{-100}{(\sqrt{40})^3} = -0.4$
Hence, there is negative skewness
Example 2: The first four moments of a distribution about mean are 0, 2.5, 0.7, and
18.75. Comment on the Kurtosis of the distribution
Moment measure of kurtosis is $\beta_2 = \frac{m_4}{m_2^2} = \frac{18.75}{2.5^2} = 3$
So, the curve is normal.
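The moment measures can be sketched in Python. The function names are assumptions made for the illustration, and the symbols follow the common convention of writing the skewness measure as m3/(SD)^3 and the kurtosis measure as m4/m2^2. The second moment m2 = 40 in the skewness example is inferred from the denominator used in the solution above.

```python
def central_moment(data, r):
    """r-th moment about the mean: (1/n) * sum((x - mean)^r)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** r for x in data) / n

def skewness_from_moments(m2, m3):
    """Moment measure of skewness: m3 / SD^3, where SD = sqrt(m2)."""
    return m3 / m2 ** 1.5

def kurtosis_from_moments(m2, m4):
    """Moment measure of kurtosis: m4 / m2^2 (equal to 3 for a normal curve)."""
    return m4 / m2 ** 2

print(round(skewness_from_moments(40, -100), 2))   # -0.4
print(kurtosis_from_moments(2.5, 18.75))           # 3.0

# For raw data, compute the moments first (symmetric data have zero skewness).
sample = [1, 2, 3, 4, 5]
m2, m3, m4 = (central_moment(sample, r) for r in (2, 3, 4))
print(skewness_from_moments(m2, m3), round(kurtosis_from_moments(m2, m4), 2))  # 0.0 1.7
```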
3A.6. Exercises
1. Find the arithmetic mean, median, and mode of the following data: 38,
28,12, 18, 28, 44, 28, 19, 21.
3. From the following data of income distribution, calculate the AM. It is given
that i) the total income of persons in the highest group is Rs. 435, and ii)
none is earning less than Rs. 20.
5. The mean yearly salary of employees of a company was Rs. 20,000. The
mean yearly salaries of male and female employees were Rs. 20,800 and
Rs. 16,800 respectively. Find out the percentage of males employed.
6. The average wage of 100 male workers is Rs. 80 and that of 50 female workers
is Rs. 75. Find the mean wage of workers in the company.
10. Calculate the Mean deviation, Variance and Standard deviation of the
following data:
Class Frequency
10 – 15 3
15 – 20 7
20 – 25 16
25 – 30 12
30 – 35 9
35 – 40 5
40 - 45 2
11. Find the standard deviation of the values: 11, 18, 9, 17, 7, 6, 15, 6, 4, 1
13. Goals scored by two teams A and B in a football season were as follows:
No. of goals scored: 0 1 2 3 4
No. of matches A: 2 9 8 5 4
B: 1 7 6 5 3
Find which team may be considered more consistent?
14. The mean of two samples of sizes 50 and 100 respectively are 54.1 and
50.3 and the standard deviations are 19 and 8. Find the mean and the
standard deviation of the combined sample.
Class Frequency
< 15 5
15 – 20 12
20 – 25 22
25 – 30 31
30 – 35 19
35 – 40 9
>40 2
18. Find the Karl Pearson’s measure of skewness of the following data:
Class Frequency
< 15 5
15 – 20 12
20 – 25 22
25 – 30 31
30 – 35 19
35 – 40 9
>40 2
1C. PROBABILITY
1C.1. Introduction
Each of us has some intuitive notion of what “probability” is. Everyday conversation is full
of references to it: “He is likely to win the game”, “He will probably be selected for the
job”. The use of the words ‘likely’ and ‘probably’ indicates that there is an element of
uncertainty about these statements. The theory of probability provides a numerical measure
of the element of uncertainty. It enables us to take decisions under uncertainty with a certain
amount of risk.
In science we come across phenomena which follow a certain pattern without fail. A stone
dropped from a cliff follows Newton’s laws of motion. But there are experiments whose
results cannot be predicted in advance.
A random experiment is an experiment which does not necessarily give the same result when
it is conducted repeatedly under homogeneous conditions.
Examples:
1. Tossing a coin and observing the face that turns up
2. Rolling a die and observing the face that turns up
The set of all possible outcomes of a random experiment is called the sample space and
is usually denoted by S.
Examples:
1. Consider the random experiment of tossing a coin and observing the face
that turns up.
S = {H, T}, where H denotes head and T denotes tail
2. Rolling a die and observing the face that turns up.
S = {1, 2, 3, 4, 5, 6}
An outcome of the experiment is an element of S, which is also known as a sample point.
An event is any subset of the sample space. In the example of tossing a coin, H and T are
sample points, but $\emptyset$ (null event), {H}, {T} and {H, T} (sure event) are events. The event
$\emptyset$ is an impossible event because it can never occur. But the event {H, T} is a sure event,
which occurs in every trial. An event A is said to have occurred in a trial if the outcome is a
sample point which belongs to A.
The set consisting of exactly one sample point is called an elementary event. For example,
in the experiment of throwing a die, {1}, {2}, {3}, {4}, {5}, and {6} are elementary events,
but 1, 2, 3, 4, 5, and 6 are sample points. That is, elementary events are events which
cannot be further split up. Events which can be further split up are called compound events.
For example, {2, 4, 6} is a compound event.
If A and B are two events in the same experiment, the event which represents the
simultaneous occurrence of A and B is $A \cap B$.
Example: In a die rolling trial, let A be the event ‘a prime number occurs’ and
B be the event ‘an odd number occurs’. That is, A = {2, 3, 5} and B = {1, 3, 5}.
Then the event representing ‘the number is both prime and odd’ is
$A \cap B$ = {3, 5}.
If A and B are two events in the same experiment, the event which represents the
occurrence of at least one of A and B is $A \cup B$.
Example: In a die rolling trial, let A be the event ‘a prime number occurs’ and
B be the event ‘an odd number occurs’. That is, A = {2, 3, 5} and B = {1, 3, 5}.
Then the event ‘at least one of a prime number or an odd number occurs’ is
$A \cup B$ = {1, 2, 3, 5}.
3. A and not B (difference)
If A and B are two events in the same experiment, the event which represents A
and not B is $A \cap \bar{B}$.
Example: In a die rolling trial, let A be the event ‘a prime number occurs’ and
B be the event ‘an odd number occurs’. That is, A = {2, 3, 5} and B = {1, 3, 5}.
Then the event representing ‘the number is prime but not odd’ is $A \cap \bar{B}$
= {2}.
If A and B are two events in the same experiment, the event which represents the
happening of exactly one of them is $(A \cap \bar{B}) \cup (\bar{A} \cap B)$.
Example: In a die rolling trial, let A be the event ‘a prime number occurs’ and
B be the event ‘an odd number occurs’. That is, A = {2, 3, 5} and B = {1, 3, 5}.
Then the event representing ‘exactly one among A and B’ is $(A \cap \bar{B}) \cup (\bar{A} \cap B)$ =
{2} $\cup$ {1} = {1, 2}.
Two events are said to be disjoint (mutually exclusive) if the occurrence of one event
prevents the occurrence of the other. That is, if A and B are disjoint events, their
simultaneous occurrence is not possible. Therefore $A \cap B = \emptyset$.
Example: In a die rolling trial, let A be the event ‘an even number occurs’ and
B be the event ‘an odd number occurs’. That is, A = {2, 4, 6} and B = {1, 3, 5}.
Since the occurrence of ‘an even number’ prevents the occurrence of ‘an odd
number’ in the same trial, the events A and B are mutually exclusive. See that
$A \cap B = \emptyset$.
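The operations on events described above map directly onto Python's built-in `set` type. The following sketch uses the die-rolling events of the examples; the variable names are chosen for the illustration only.

```python
S = {1, 2, 3, 4, 5, 6}       # sample space for rolling a die
A = {2, 3, 5}                # a prime number occurs
B = {1, 3, 5}                # an odd number occurs
E = {2, 4, 6}                # an even number occurs

both = A & B                 # A and B (intersection)
at_least_one = A | B         # A or B (union)
a_not_b = A - B              # A and not B; same as A & (S - B)
exactly_one = A ^ B          # exactly one of A and B (symmetric difference)

print(sorted(both), sorted(at_least_one), sorted(a_not_b), sorted(exactly_one))
# [3, 5] [1, 2, 3, 5] [2] [1, 2]

print(E.isdisjoint(B))       # True: 'even' and 'odd' are mutually exclusive
```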
Generally speaking, probability is a measure of chance of happening of an uncertain
event. That is, probability is used to measure the uncertainty of an event. The value
of probability ranges between 0 and 1. If it is certain that an event will happen, then its
probability is 1, and if it is certain that the event will not happen, its
probability is 0.
There are three different conceptual approaches to the study of probability. They
are:
1. Classical approach.
2. Frequency approach.
3. Axiomatic approach.
1. Classical Definition
This is the earliest approach to the theory of probability. Laplace, the French
mathematician, gave this definition of probability. Using this definition, we can
determine the probability of an event even before the performance of a trial. So
classical probability is often called ‘a priori probability’.
Example 1: Consider the random experiment of tossing two coins and observing
the faces that turn up. Sample space, S = {(H,H), (H,T), (T,H), (T,T)}. Let A be
the event of ‘getting at least one tail’. Then P(A) $= \frac{3}{4}$ (in three of the four
outcomes there is at least one tail).
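The example can be checked by listing the sample space in Python and counting the outcomes in the event. The sketch assumes that all four outcomes are equally likely, so that the probability is the number of favourable outcomes divided by the total number of outcomes.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing two coins: all ordered pairs of H and T.
sample_space = list(product("HT", repeat=2))

# Event A: at least one tail among the two faces.
event_a = [outcome for outcome in sample_space if "T" in outcome]

p_a = Fraction(len(event_a), len(sample_space))
print(sample_space)
print(p_a)  # 3/4
```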
Definition: In the frequency approach, probability can be defined as P(A) $= \lim_{n \to \infty} \frac{f}{n}$,
where f is the frequency of A and n is the number of trials.
3. Axiomatic Definition
Definition: A function P from the class of events taking values in the real line is a
probability if it satisfies the following axioms:
Axiom 1 P(A) $\geq$ 0, for every event A
Axiom 2 P(S) = 1, where S is the sample space
Axiom 3 If A1, A2, …, are disjoint events, then,
$P(A_1 \cup A_2 \cup \dots) = P(A_1) + P(A_2) + \dots$
Suppose we had to elect two vice-presidents. Now, we are interested in which two
members are elected and the order is of no consequence. For instance, announcing
that AC or CA have been elected makes no difference since they are in the same
hierarchy. So when two persons have been elected without regard to their
arrangement, then this ‘unordered’ selection is called a combination.
1. Permutation
Example 1. Four persons enter a railway compartment in which there are six seats.
In how many ways can they take their place?
Solution: Let us consider the students from the same class as a group. Hence there
are 3 groups. The first group contains 3 students, the second contains 2 students
and the third contains one student. Three groups can be permuted in P(3,3) ways.
Then within the first group, 3 students can be permuted in P(3, 3) ways. Within the
second group, the students can be permuted in P(2, 2) ways.
So, the required number of arrangements = P(3, 3) × P(3, 3) × P(2, 2)
= 3! × 3! × 2! = 6 × 6 × 2
= 72
Example 3. In how many ways can a cricket team of 11 players choose a captain
and a vice captain from amongst themselves?
Result 2: The number of permutations of n objects taken r at a time, when each may
be repeated any number of times in any permutation, is given by $n^r$.
2. Combination
The number of combinations of n objects taken r at a time is $\frac{n!}{(n - r)!\,r!}$ and is
denoted by nCr or C(n, r).
Solution: Five students can be chosen out of 10 students in 10C5 = 10!/(5! 5!) = 252
ways.
Example 2. In how many ways can selection of 5 books be made from 12 books (a)
when one specified book is never included (b) when one specified book is always
included.
Solution:
(a) Here remove the specified book and select 5 books from the remaining
11 books. It can be done in 11C5 different ways.
(b) First select the specified book, which is to be included always, and then
select 4 books from the remaining 11 books. It can be done in 11C4 ways.
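The counts in the permutation and combination examples can be evaluated with Python's `math` module (`math.perm` and `math.comb` are available from Python 3.8). The variable names below are chosen for the illustration only.

```python
import math

# Permutations: ordered selections.
seatings = math.perm(6, 4)            # four persons choosing among six seats
captains = math.perm(11, 2)           # captain and vice captain from 11 players
groups = math.factorial(3) * math.factorial(3) * math.factorial(2)  # 3! x 3! x 2!

# Permutations with repetition: n objects, r places.
def permutations_with_repetition(n, r):
    return n ** r

# Combinations: unordered selections.
students = math.comb(10, 5)           # 5 students out of 10
books_excluding = math.comb(11, 5)    # one specified book never included
books_including = math.comb(11, 4)    # one specified book always included

print(seatings, captains, groups)                    # 360 110 72
print(students, books_excluding, books_including)    # 252 462 330
```

The two book counts add up to 12C5 = 792, since every selection of 5 books either contains the specified book or does not.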
It is defined as P(B/A) $= \frac{P(A \cap B)}{P(A)}$, if P(A) > 0,
and P(A/B) $= \frac{P(A \cap B)}{P(B)}$, if P(B) > 0.
Example: Suppose a card is selected at random from a pack of cards. The card
selected is an ace. What is the probability that the card selected is a red one?
Solution:
Let A be the event that the card selected is an ace and B be the event that
the card selected is a red one. The required probability is P(B/A).
By definition, P(B/A) = P(A ∩ B) / P(A).
P(A) = Probability that the card selected is an ace = 4/52 = 1/13 (since there are 4
aces in a pack of 52 cards).
P(A ∩ B) = Probability that the card selected is a red ace = 2/52 = 1/26.
Therefore P(B/A) = (1/26) / (1/13) = 1/2.
If A and B are two events in a sample space, then the multiplication theorem states
that P(A ∩ B) = P(A) P(B/A) if P(A) > 0 and
P(A ∩ B) = P(B) P(A/B) if P(B) > 0.
If two events A and B are independent, then P(B/A) = P(B) and P(A/B) = P(A).
Example: Two cards are drawn from a well-shuffled pack of cards. Find the
probability that they are both aces if the first card is (a) replaced (b) not replaced.
Solution:
Let A be the event “ace selected on the first draw” and B be the event
“ace selected on the second draw”.
Then we require P(A ∩ B). By the multiplication theorem, P(A ∩ B) = P(A) P(B/A).
(a) For the first draw there are 4 aces in 52 cards, so P(A) = 4/52.
The card is replaced before the second draw, so P(B/A) = 4/52.
P(A ∩ B) = (4/52) × (4/52) = 1/169
(b) If the card is not replaced after the first draw, only 3 aces remain
among 51 cards for the second draw.
P(A) is the same as in the first case, but P(B/A) = 3/51.
P(A ∩ B) = (4/52) × (3/51) = 1/221
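The two-card example is a direct application of the multiplication theorem. The sketch below uses exact fractions from the standard library; the variable names are illustrative.

```python
from fractions import Fraction

# P(A): an ace on the first draw
p_first_ace = Fraction(4, 52)

# P(A and B) = P(A) * P(B/A)
both_aces_with_replacement = p_first_ace * Fraction(4, 52)     # deck restored
both_aces_without_replacement = p_first_ace * Fraction(3, 51)  # one ace removed

print(both_aces_with_replacement, both_aces_without_replacement)
```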
Bayes’ Theorem is used to revise the probability of an event when new information
is available. The idea of revising probabilities is used by all of us in daily life even
though we may not know anything about probability. For example, a person while
going out may start without taking a raincoat, but as soon as he comes out of his
home and sees a large mass of cloud in the sky he may decide to take a raincoat
with him. So, by Bayes’ theorem, we find the posterior probabilities.
Statement: Let B1, B2, …, Bn be ‘n’ mutually exclusive events whose union is the
sample space. If A is any event with P(A) > 0, then
P(Bi/A) = P(Bi) P(A/Bi) / [P(B1) P(A/B1) + P(B2) P(A/B2) + … + P(Bn) P(A/Bn)].
1C.13. Solved Problems
1. Write down the sample space of the random experiment of throwing two dice
simultaneously and observing the face numbers.
2. If a box contains 10 red and 6 blue balls, what is the probability that a ball drawn
at random is red? Find also the probability that the ball drawn is blue.
3. A speaks truth in 60% cases and B in 70% cases. In what percentage of cases
are they likely to contradict each other in stating the same fact?
Solution:
Contradiction takes place only if one of them speaks the truth and the other tells
a lie.
The probability that A speaks the truth = 60/100 = 0.6
The probability that A tells a lie = 1 – 0.6 = 0.4
The probability that B speaks the truth = 70/100 = 0.7
The probability that B tells a lie = 1 – 0.7 = 0.3
Since A and B speak independently, the probability that A speaks the truth and B
tells a lie = (probability that A speaks the truth) × (probability that B tells a lie) = 0.6 × 0.3.
Similarly, the probability that A tells a lie and B speaks the truth = 0.4 × 0.7.
Thus the probability that A speaks the truth and B tells a lie, or A tells a lie and B
speaks the truth = 0.6 × 0.3 + 0.4 × 0.7 = 0.18 + 0.28 = 0.46.
That is, in 46% of cases they contradict each other.
4. The odds against A speaking the truth are 4 : 6 while the odds in favour of B
speaking the truth are 7:3. (i) What is the probability that A and B contradict
each other in stating the same fact? (ii) If A and B agree on a statement, what
is the probability that this statement is true?
Solution:
The probability that A speaks the truth = 6/10 = 0.6
The probability that A tells a lie = 1 – 0.6 = 0.4
The probability that B speaks the truth = 7/10 = 0.7
The probability that B tells a lie = 1 – 0.7 = 0.3
(i) A and B will contradict each other if one of them tells a lie and
the other speaks the truth.
The required probability = 0.6 × 0.3 + 0.4 × 0.7
= 0.18 + 0.28
= 0.46
(ii) A and B agree on a statement if both tell a lie or both speak the truth.
Probability that both speak the truth = 0.6 × 0.7 = 0.42
Probability that both tell a lie = 0.4 × 0.3 = 0.12
Probability that both agree on a statement = 0.42 + 0.12
= 0.54
Required probability = 0.42 / 0.54 = 7/9
5. Three light bulbs are chosen at random from 15 bulbs of which 5 are defectives.
Find the probability that (i) none is defective (ii) exactly one is defective, (iii)
at least one is defective.
Solution:
There are 15C3 = 455 ways to choose 3 bulbs from 15 bulbs.
(i) Since there are 10 non-defective bulbs, there are 10C3 = 120 ways to choose
3 non-defective bulbs. Thus, P(none is defective) = 120/455 = 0.26
(ii) Since there are 5 defective bulbs, one defective bulb can be chosen in 5
different ways, and there are 10C2 = 45 different ways to choose 2 non-defective bulbs.
Hence, there are 5 × 45 = 225 ways to choose 3 bulbs of which exactly one
is defective. Thus, P(exactly one is defective) = 225/455 = 0.49
(iii) The event that at least one is defective is the complement of the event ‘none
is defective’. By (i), P(none is defective) = 0.26
Hence, P(at least one is defective) = 1 – 0.26 = 0.74
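The counting in this problem can be reproduced with exact arithmetic. The following is a sketch only, assuming Python 3.8 or later; the helper name p_defective is hypothetical and simply counts the ways of drawing k defective bulbs among 3 drawn from 15 bulbs of which 5 are defective.

```python
import math
from fractions import Fraction

TOTAL, DEFECTIVE, DRAWN = 15, 5, 3


def p_defective(k):
    """Probability that exactly k of the DRAWN bulbs are defective."""
    favourable = math.comb(DEFECTIVE, k) * math.comb(TOTAL - DEFECTIVE, DRAWN - k)
    return Fraction(favourable, math.comb(TOTAL, DRAWN))


p_none = p_defective(0)
p_exactly_one = p_defective(1)
p_at_least_one = 1 - p_none   # complement of "none is defective"

print(float(p_none), float(p_exactly_one), float(p_at_least_one))
```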
6. A box contains 5 white and 7 black balls. If three balls are drawn at random,
what is the probability that one is white and two are black balls.
Solution:
One white ball can be chosen in 5 ways and 2 black balls can be chosen in 7C2 =
21 different ways. Also, 3 balls can be chosen in 12C3 = 220 different ways.
Thus, the required probability = (5 × 21) / 220 = 21/44
7. Box I contains 8 red and 7 blue balls. Another box II contains 6 red and 6
blue balls. One ball is selected at random from box I and transferred to
box II. Then one ball is drawn at random from box II. What is the probability
that it is a red ball?
Solution:
Let A be the event that the ball selected from box II is a red ball. Then
A can happen in the following ways: transfer a red ball from box I to box II and
then select a red ball from box II, or transfer a blue ball from box I to box II and
then select a red ball from box II.
P(transfer a red ball from box I to box II and then select a red ball from box
II) = (8/15) × (7/13) = 56/195
P(transfer a blue ball from box I to box II and then select a red ball from
box II) = (7/15) × (6/13) = 42/195
So, the required probability = 56/195 + 42/195 = 98/195
8. If P(A) = 0.4, P(B) = 0.7, and P(A ∩ B) = 0.3, what is the probability that
A or B happens?
Solution:
By the addition theorem on probability, P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
That is, P(A ∪ B) = 0.4 + 0.7 – 0.3 = 0.8
9. Given P(A) = 3/8, P(B) = 5/8 and P(A ∪ B) = 3/4, are A and B independent?
Solution:
Two events A and B are independent if P(A ∩ B) = P(A) P(B)
By the addition theorem on probability, P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
So, P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
= 3/8 + 5/8 − 3/4 = 1/4
P(A) P(B) = (3/8) × (5/8) = 15/64
Thus, P(A ∩ B) ≠ P(A) P(B), hence A and B are not independent.
10. The probability that a contractor will get a contract for road construction is 0.5
and the probability that he will get a contract for the construction of water tank
is 0.7. What is the probability of getting at least one contract?
Solution:
Let A be the event of getting the contract for road construction and B be the event
of getting the contract for construction of the water tank.
By the addition theorem on probability,
P(at least one) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Assuming A and B are independent, P(A ∩ B) = P(A) P(B)
= 0.5 × 0.7 = 0.35
Hence, P(A ∪ B) = 0.5 + 0.7 – 0.35 = 0.85
11. A company has two plants to manufacture scooters. Plant I manufactures 70%
of the scooters and plant II manufactures 30%. At plant I, 80% of scooters are
rated standard quality and at plant II, 90% of scooters are rated standard quality.
A scooter is selected at random and is found to be of standard quality. What is
the chance that it has come from (a) plant I (b) plant II.
Solution:
Let A be the event ‘scooter selected is of standard quality’.
Let B1 be the event ‘scooter manufactured at plant I’ and B2 be the event ‘scooter
manufactured at plant II.
P(B1) = 0.7, P(B2) = 0.3, P(A/B1) = 0.8, and P(A/B2) = 0.9
(a) Required probability = P(B1/A) = P(B1) P(A/B1) / [P(B1) P(A/B1) + P(B2) P(A/B2)]
= (0.7 × 0.8) / (0.7 × 0.8 + 0.3 × 0.9)
= 56/83
(b) Required probability = P(B2/A) = P(B2) P(A/B2) / [P(B1) P(A/B1) + P(B2) P(A/B2)]
= (0.3 × 0.9) / (0.7 × 0.8 + 0.3 × 0.9)
= 27/83
12. A box X contains 2 white and 3 red balls. Another box Y contains 4 white and
5 red balls. One ball is drawn at random from one of the boxes and is found to
be red. Find the probability that it was drawn from box Y.
Solution:
Let A be the event ‘the ball drawn is red’, B1 be the event ‘box X has been chosen’,
and B2 be the event ‘box Y has been chosen’
Required probability is P(B2/A) = P(B2) P(A/B2) / [P(B1) P(A/B1) + P(B2) P(A/B2)]
P(B1) = 1/2, P(B2) = 1/2, P(A/B1) = 3/5, and P(A/B2) = 5/9
P(B2/A) = (1/2 × 5/9) / (1/2 × 3/5 + 1/2 × 5/9)
= 25/52
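Problems 11 and 12 both apply Bayes’ theorem with two mutually exclusive causes. A minimal sketch of that calculation follows; the function name posterior is hypothetical, and exact fractions are used so that the results can be compared with the worked answers.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Return P(Bi/A) for each i, given P(Bi) and P(A/Bi)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(Bi) P(A/Bi)
    total = sum(joint)                                    # P(A)
    return [j / total for j in joint]


# Problem 11: plants I and II, event A = 'standard quality'
scooters = posterior([Fraction(7, 10), Fraction(3, 10)],
                     [Fraction(8, 10), Fraction(9, 10)])

# Problem 12: boxes X and Y chosen with equal chance, event A = 'red ball'
boxes = posterior([Fraction(1, 2), Fraction(1, 2)],
                  [Fraction(3, 5), Fraction(5, 9)])

print(scooters, boxes)
```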
1C.14. Exercises
(iv) At least 2 girls are selected
8. The probability that a boy will get a scholarship is 0.9, and a girl will get is 0.8.
What is the probability that at least one of them will get the scholarship?
9. Five men in a company of 20 are graduates. If 3 men are picked out of the 20
at random, what is the probability that they are all graduates? What is the
probability of at least one graduate?
10. A card is drawn at random from a well-shuffled pack of cards. What is the
probability that it is a heart or a queen?
11. A candidate is interviewed for 3 posts. For the first post there are 3 candidates,
for the second there are 4, and for the third there are 2. What are the chances
for his getting at least one post?
12. An urn contains 8 white and 3 red balls. If two balls are drawn at random find
the probability that (i) both are white (ii) both are red (iii) one is of each colour.
13. A can solve 80% of the problems given in statistics book and B can solve 60%.
What is the probability that at least one of them solve a problem selected at
random?
14. If P(A) = 0.5, P(B) = 0.3, and P(A ∩ B) = 0.2, obtain the probability that:
i) A occurs but not B
ii) At least one of A and B occurs
iii) Neither of A and B occurs
15. What is the probability that a leap year selected at random will contain 53
Sundays?
16. The probabilities that a husband and wife will be alive 20 years from now are
0.8 and 0.9 respectively. Find the probability that in 20 years (a) both alive (b)
neither alive (c) at least one alive.
17. The probability of hitting a target is 1/3 for A and 1/2 for B. If both fire at the
same target, find the probability that at least one of them hits the target.
18. A bag contains 6 black and 3 white balls. Another bag contains 5 black and 4
white balls. If a ball is drawn from each bag, find the probability that these two
balls are of the same colour.
19. The odds that A speaks truth are 3:2 and the odds that B does so are 5:3. In what
percentage of cases are they likely to contradict each other?
20. On the average 20% of persons going to a handicrafts emporium are foreigners
and the remaining 80% are local persons. 75% of such foreigners and 50% of
such local persons are found to make purchases. If a bundle of purchased items
is sent to cash counter, what is the probability that the purchaser is a foreigner?
21. In an examination 30% of the students have failed in Mathematics, 20% of the
students have failed in Chemistry and 10% have failed in both Mathematics and Chemistry. A
student is selected at random.
(i) What is the probability that the student has failed in Mathematics
when it is known that he has failed in Chemistry?
(ii) What is the probability that the student selected at random has failed
either in Mathematics or in Chemistry?
22. Two urns I and II contain 3 white, 7 black balls and 5 white, 7 black balls
respectively. A ball is transferred from urn I to urn II. Then a ball is drawn at
random from urn II and it is found black. What is the probability that the
transferred ball has been a black ball?
23. Urn I contains 4 white and 5 black balls. Urn II contains 5 white and 8 black
balls. A ball is transferred from urn I to urn II, then a ball is drawn from urn II.
Find the probability that it is white?
24. Box I contains 3 red and 2 blue marbles while box II contains 2 red and 8 blue
marbles. A fair coin is tossed. If the coin turns up head, a marble is chosen from
box I; if it turns up a tail, a marble is chosen from box II. Find the probability
that a red marble is chosen?
25. A box contains 5 red and 4 white marbles. Two marbles are drawn successively
from the box without replacement and it is noted that the second one is white.
What is the probability that the first is also white?
26. A manufacturing company produces steel pipes in three plants with daily
production volume of 500, 1000, and 2000 units respectively. According to past
experience it is known that the fraction of defective outputs produced by the
three plants are respectively 0.005, 0.008, and 0.010. A pipe is selected at
random from a day’s production and found to be defective. Find the
probability that it came from the first plant.
27. A company produces a product through three machines A, B, and C. Machine
A produces 45% of the product, B produces 35% of the product and C produces
20%. From past experience it is known that 4% of the items produced by
machine A is defective, 3% of the items produced by B is defective and 1% of
the items produced by C is defective. An item selected at random is found to be
defective. What is the probability that it was produced by machine B?
28. A die is thrown twice and the sum of the numbers appearing is observed to be
6. What is the probability that the number 5 has appeared at least once?
Suppose we wish to draw conclusions about a characteristic of a population. We draw a
random sample of size n and take measurements of the characteristic that we are
interested in studying. Let the sample values be x1, x2, x3, …, xn. Then any quantity which can
be determined as a function of the sample values x1, x2, x3, …, xn is called a statistic. Since
the sample values are the results of random selections, a statistic is a random variable.
Therefore, a statistic has a probability distribution, known as its sampling distribution.
The standard deviation of the sampling distribution is called the standard error.
The process of inferring certain facts about a population based on a sample is known as
statistical inference. Sample statistics and their distributions are the basis of all inferences
drawn about the population.
Suppose we have a sample of size n from a population. Let x1, x2, x3, …, xn be the values
of the characteristic under study corresponding to the selected units. Then the sample mean
x̄ is defined as x̄ = (x1 + x2 + x3 + … + xn) / n.
If we draw another sample of size n from the same population, we may end up with a
different set of sample values and so a different sample mean. Thus the value of the sample
mean is determined by chance causes. The distribution of the sample mean is called
sampling distribution of the sample mean.
If x1, x2, x3, …, xn constitute a random sample from an infinite population having the mean
μ and variance σ², then the distribution of the sample mean will be approximately normal with
mean μ and variance σ²/n, when n is large.
Required probability = P(x̄ − μ > 3)
= P((x̄ − μ) √n / σ > 3 √n / σ)
= P(z > 1.2)
= 0.1151 (from the N(0,1) table, since z ~ N(0,1))
Example 2: A random sample of size 64 is taken from an infinite population with mean
22 and variance 196. What is the probability that the mean of the sample will be greater than
23?
Solution: Given n = 64, μ = 22, σ = 14. Let x̄ be the sample mean.
We have to find P(x̄ > 23).
P(x̄ > 23) = P((x̄ − 22) √64 / 14 > (23 − 22) √64 / 14)
= P(z > 8/14) = P(z > 0.57) = 0.2843
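The table lookups in these examples can be approximated without a printed N(0,1) table. The following is a minimal sketch using the complementary error function from the standard library; the function names are illustrative. Because the code does not round z to two decimals before the lookup, its result can differ from the table value in the third decimal place.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def prob_sample_mean_exceeds(threshold, mu, sigma, n):
    """P(sample mean > threshold), using the large-sample normal approximation."""
    z = (threshold - mu) * math.sqrt(n) / sigma
    return upper_tail(z)


p_example_2 = prob_sample_mean_exceeds(23, mu=22, sigma=14, n=64)
print(round(upper_tail(1.2), 4), round(p_example_2, 4))
```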
If a random variable X has the standard normal distribution, then the distribution of X² is
called the chi-square (χ²) distribution with one degree of freedom. This distribution is
quite different from a normal distribution because X², being a square term, can assume
only non-negative values. The probability curve of χ² is higher near 0, because most
of the x-values are close to 0 in a standard normal distribution.
If X1, X2, …, Xn are independent standard normal variables, then X1² + X2² + … + Xn² has the
χ² distribution with n degrees of freedom. Here ‘n’ is the only parameter.
χ² – table
Since the χ²-distribution arises in many important applications, especially in statistical
inference, integrals of its density have been tabulated. The table gives the value χ²_{α,n} such
that the probability that χ² is greater than χ²_{α,n} is equal to α, for α = 0.005, 0.01, 0.025, 0.05
etc. and n = 1, 2, 3, … . That is, the table gives χ²_{α,n} with P(χ² > χ²_{α,n}) = α.
If X and Y are two independent random variables, X has the standard normal distribution
and Y has a chi-square distribution with ‘n’ degrees of freedom, then the distribution of
the statistic t = X / √(Y/n) is called Student’s ‘t’ distribution. The t-distribution was first obtained
by W.S. Gosset, who is known under the pen name ‘Student’.
An example of a t-statistic is t = (x̄ − μ) √n / s, which follows the t-distribution with (n-1) degrees
of freedom, where x̄ and s are the mean and standard deviation of a random sample of size n
from a normal population with mean μ and variance σ².
The Student ‘t’ table has many applications in statistical inference. The t-table gives the values
t_{α,n} for α = 0.25, 0.125, 0.10, 0.05 etc. and n = 1, 2, 3, …, where t_{α,n} is such that the area to
its right under the curve of the t-distribution with ‘n’ degrees of freedom is equal to α. That
is, t_{α,n} is such that P(t > t_{α,n}) = α. Also note that the t-distribution is a symmetric distribution.
1. To test the mean of a normal population when the sample size is small and
population variance is unknown.
2. To test the equality of means of two normal populations when the sample sizes are
small and population variances are unknown but same.
3. To test whether the correlation coefficient is zero.
4. To find the confidence interval of mean of normal population when sample size is
small and population variance is unknown.
The F- Distribution
If U and V are independent random variables having chi-square distributions with m and n
degrees of freedom, then the distribution of F = (U/m) / (V/n) is called the F-distribution with m and n
degrees of freedom.
For example, if S1² and S2² are the variances of independent random samples of sizes m
and n from normal populations with variances σ1² and σ2², then,
S
2 2
Table of F-distribution
The table of the F-distribution gives the values F_{α;m,n} for α = 0.05 and 0.01 for various values
of m and n, where F_{α;m,n} is such that the area to its right under the curve of the F-distribution
with m, n degrees of freedom is equal to α.
A function, T, used for estimating a parameter θ is called an estimator, and its value
for a given sample is known as an estimate.
2. The sample proportion is a point estimate of the population proportion.
3. The sample variance is a point estimator of population variance.
Statistical testing or testing hypotheses, is one of the most important aspects of the
theory of decision-making. Testing hypotheses consists of decision rules required for
drawing probabilistic inferences about the population parameters.
Definition: A statistical hypothesis is a statement concerning a probability distribution
or population parameters. The process by which a decision is reached on whether or
not a hypothesis is true is called testing the hypothesis.
For example, the statements ‘the mean of a normal population is 30’ and ‘the variance of a
population is greater than 12’ are statistical hypotheses.
The hypothesis under test is known as the null hypothesis and the hypothesis that will
be accepted when the null hypothesis is rejected is known as the alternate hypothesis.
The null hypothesis is usually denoted by H0 and the alternate hypothesis by H1. For
example, if the population mean is represented by μ, we can set up our hypotheses as
follows: H0: μ ≤ 30; H1: μ > 30.
The following are the steps in testing a statistical hypothesis. We draw a sample from
the concerned population. Then choose the appropriate test statistic. A test statistic is
a statistic on whose value we base the decision either to reject or to accept a hypothesis.
Divide the sample space of the test statistic into two regions, namely, rejection region
and acceptance region ( The set of sample points, which lead to the rejection of the null
hypothesis, is called the Critical Region or Rejection Region). Calculate the value of
the test statistic for our sampled data. If this value falls in the rejection region, reject
the hypothesis; otherwise accept it.
Since we have to depend on the sample, there is no way to know which of the two
hypotheses is actually true. The test procedure is to fix the rejection region; if the
observed value of the test statistic falls in it, the null hypothesis is rejected. The null
hypothesis may be true, but the test procedure may reject the null hypothesis. This error
is known as the first kind of error. It is also possible that the null hypothesis is actually
false but the test accepts it. This error is known as the second kind of error. Thus, the
error committed in rejecting a true null hypothesis is called type I error and the error
in accepting a false null hypothesis is called the type II error.
Significance Level
The probabilities of the two errors cannot be reduced simultaneously, since if we enlarge
the rejection region the probability of type I error will increase, whereas a reduction
of the rejection region will increase the probability of type II error. The procedure usually
adopted is to keep the probability of type I error below a pre-assigned number and, subject to
this condition, minimize the probability of type II error. A pre-assigned number α between 0
and 1 chosen as an upper bound of the probability of type I error is called the level of significance.
A test where the critical region is found to lie under one tail of the distribution of the
test statistic is called One-tailed test. In two-tailed tests the critical region lies under
both the tails of the distribution of the test statistic.
Example: Let μ be the mean of a population. Then,
1. H0: μ = 30; H1: μ ≠ 30 is a two-tailed test.
2. H0: μ = 30; H1: μ > 30 is a one-tailed test.
STATISTICAL METHODS
Objectives
5A.1. Introduction
Suppose we have to test the hypothesis that the population mean μ has a specified value
μ0. Then formulate the null hypothesis H0: μ = μ0. The alternative hypothesis is: 1) H1:
μ ≠ μ0 or 2) H1: μ > μ0 or 3) H1: μ < μ0.
A random sample of size n (n > 30) is to be taken; let x̄ be the sample mean. Since
n is large, the sampling distribution of x̄ is approximately normal.
If H0 is true, the test statistic z = (x̄ − μ0) √n / σ has approximately the standard normal distribution.
Case i: When σ is known. Use the above test statistic. The critical region for z
depending on the nature of H1 and the level of significance α is given below:
the test statistic z = (x̄ − μ0) √n / s, which follows standard normal. Use the above critical
regions.
Example: The mean life of a random sample of 100 tyres, drawn from a population
of tyres with a standard deviation of 1248 kms, is 15269 kms. It is claimed that the mean life
of tyres is 15200 kms. Test the validity of the claim.
Example: The manufacturers of a small car claim that on average the car is driven
2000 kms per month. A random sample of 100 owners of the car are asked to keep a
record of the kilometres they drive their cars. On the basis of these sample records, it was
found that on average the car was driven 2200 kms per month with a standard
deviation of 600 kms. Do the sample data support the hypothesis that the average
distance the car is driven has increased?
Solution: H0: μ = 2000 against H1: μ > 2000, where μ is the mean distance the car is
driven per month.
Given x̄ = 2200, n = 100, s = 600, μ0 = 2000
Test statistic z = (x̄ − μ0) √n / s
= (2200 − 2000) √100 / 600
= 3.33
Let α = 0.05. The critical region is z > 1.64.
Since z = 3.33 > 1.64, we reject H0. That is, the average distance a car is driven has
increased.
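The large-sample test of a mean reduces to one formula and one comparison. The sketch below is illustrative only: the function name z_statistic is hypothetical, and the critical value 1.64 is the one-tailed value at α = 0.05 quoted in the example.

```python
import math


def z_statistic(sample_mean, mu0, sd, n):
    """z = (sample_mean - mu0) * sqrt(n) / sd for a large sample (n > 30)."""
    return (sample_mean - mu0) * math.sqrt(n) / sd


# Car example: sample mean 2200, mu0 = 2000, s = 600, n = 100
z_car = z_statistic(2200, 2000, 600, 100)
reject_h0 = z_car > 1.64   # right-tailed test, H1: mu > 2000

print(round(z_car, 2), reject_h0)
```

The same function applies to the tyre example, with the comparison changed to |z| > 1.96 for a two-tailed test at α = 0.05.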
Solution: H0: p = 0.95 against H1: p < 0.95
Given x = 200 − 18 = 182, n = 200, p0 = 0.95, q0 = 1 – p0 = 1 – 0.95 = 0.05
Test statistic, z = (x − n p0) / √(n p0 q0)
= (182 − 200 × 0.95) / √(200 × 0.95 × 0.05)
= −2.596
Critical region is z < −1.64. Since z = −2.596 < −1.64, we reject H0. So, the claim of the
manufacturer is not justified.
Suppose we have to test whether two population proportions p1 and p2 are equal. We
take a sample of size n1 from the first population and a sample of size n2 from the
second population. Let x1 units possess a particular attribute in the first sample and x2
units from the second sample possess the attribute. Let p̂1 = x1/n1 and p̂2 = x2/n2 be the
respective sample proportions. The null hypothesis is H0: p1 = p2.
The alternative hypothesis is 1) H1: p1 ≠ p2 2) H1: p1 < p2 3) H1: p1 > p2
The test statistic is z = (x1/n1 − x2/n2) / √(p̂ q̂ (1/n1 + 1/n2)),
where p̂ = (x1 + x2) / (n1 + n2) = (n1 p̂1 + n2 p̂2) / (n1 + n2) and q̂ = 1 − p̂.
The critical regions are given below:
p̂ = (x1 + x2) / (n1 + n2) = (40 + 15) / (200 + 100) = 0.183, q̂ = 1 – p̂ = 0.817
The test statistic is z = (x1/n1 − x2/n2) / √(p̂ q̂ (1/n1 + 1/n2))
= (40/200 − 15/100) / √(0.183 × 0.817 × (1/200 + 1/100)) = 1.056
Since |z| = 1.056 < 1.96, we have to accept H0. That is, the defaulters’ rate is the same for the
two classes.
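The pooled two-proportion statistic can be computed from the four counts alone. The following is a sketch with a hypothetical function name; it keeps the pooled proportion unrounded, so the result may differ slightly in the third decimal from a hand calculation that rounds the pooled proportion to 0.183 first.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z for H0: p1 = p2, using the pooled proportion (x1 + x2) / (n1 + n2)."""
    p_pooled = (x1 + x2) / (n1 + n2)
    q_pooled = 1 - p_pooled
    std_error = math.sqrt(p_pooled * q_pooled * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / std_error


# Figures from the example: 40 of 200 in the first sample, 15 of 100 in the second
z_defaulters = two_proportion_z(40, 200, 15, 100)
accept_h0 = abs(z_defaulters) < 1.96   # two-tailed test at alpha = 0.05

print(round(z_defaulters, 3), accept_h0)
```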
Let there be two independent populations with means μ1, μ2 and variances σ1² and
σ2² respectively. The null hypothesis is H0: μ1 = μ2. The alternative hypothesis is: 1)
Example: A random sample of size 100 is taken from a population with mean μ1 and
variance 16 and a sample of size 50 is taken from another population with mean μ2 and
variance 25. The mean of the first sample is 40 and the mean of the second sample is
38. Test whether the samples are from populations with the same mean.
Solution: Given x̄1 = 40 and x̄2 = 38, n1 = 100 and n2 = 50, σ1² = 16 and σ2² = 25
H0: μ1 = μ2 against H1: μ1 ≠ μ2
Test statistic is z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
= (40 − 38) / √(16/100 + 25/50) = 2.46
Since |z| = 2.46 > 1.96, we reject H0. Thus the population means are not the same.
If the sample size is less than 30, then we need to make the assumption that the
population follows a normal distribution.
Then, the test statistic is t = (x̄ − μ0) √(n − 1) / s, where s² = (1/n) Σ (xi − x̄)², the sum
running over i = 1, …, n. Here the statistic
t follows Student’s ‘t’ distribution with n-1 degrees of freedom. The critical regions can
be found from the Student ‘t’ table.
Example: A consumer testing agency while examining a new automobile for gasoline
mileage performance found that 12 readings of miles covered per gallon under normal
conditions resulted in an average of 16 miles per gallon with a standard deviation of
1.8. Do the sample results support the manufacturer’s claim that the new automobile
gives a performance of more than 15 miles per gallon?
Solution: H0: μ = 15 against H1: μ > 15
Given x̄ = 16, n = 12, s = 1.8, μ0 = 15
Test statistic t = (x̄ − μ0) √(n − 1) / s
= (16 − 15) √11 / 1.8
= 1.84
Let α = 0.05. From the t-table with n − 1 = 11 degrees of freedom, t_{0.05,11} = 1.7959
Since t = 1.84 > 1.7959, H0 is to be rejected. So, the manufacturer’s claim can be
justified.
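The small-sample statistic differs from the z statistic only in the √(n − 1) factor, which goes with a standard deviation computed with divisor n as in these notes. A minimal sketch, with a hypothetical function name and the table value 1.7959 taken from the example:

```python
import math


def t_statistic(sample_mean, mu0, s, n):
    """t = (sample_mean - mu0) * sqrt(n - 1) / s, where s uses divisor n."""
    return (sample_mean - mu0) * math.sqrt(n - 1) / s


# Mileage example: sample mean 16, mu0 = 15, s = 1.8, n = 12
t_mileage = t_statistic(16, 15, 1.8, 12)
reject_h0 = t_mileage > 1.7959   # right-tailed test, 11 degrees of freedom

print(round(t_mileage, 2), reject_h0)
```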
Here we assume that the populations follow normal distributions, are independent, and
have population variances that are unknown but equal.
The null hypothesis is H0: μ1 = μ2. The alternative hypothesis is: 1) H1: μ1 ≠ μ2 or
2) H1: μ1 > μ2 or 3) H1: μ1 < μ2
We take a sample of size n1 from the first population and a sample of size n2 from the
second population. Let x̄1 and x̄2 be the sample means and s1² and s2² the sample
variances.
Test statistic is t = (x̄1 − x̄2) / √[ ((n1 s1² + n2 s2²) / (n1 + n2 − 2)) (1/n1 + 1/n2) ],
which follows Student’s ‘t’ distribution
with n1+n2-2 degrees of freedom.
Example: Given two independent random samples of sizes n1 = 12 and n2 = 20 from
two different normal populations, with x̄1 = 180 and x̄2 = 187, s1² = 40 and s2² = 60,
test the hypothesis at α = 0.10 that the population means are equal.
Solution: H0: μ1 = μ2 against H1: μ1 ≠ μ2
Test statistic is t = (x̄1 − x̄2) / √[ ((n1 s1² + n2 s2²) / (n1 + n2 − 2)) (1/n1 + 1/n2) ]
= (180 − 187) / √[ ((12 × 40 + 20 × 60) / (12 + 20 − 2)) (1/12 + 1/20) ] = −2.562
For a two-tailed test at α = 0.10 with 30 degrees of freedom, the critical value from the
t-table is t_{0.05,30} = 1.6973.
Since |t| = 2.562 > 1.6973, we reject H0.
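The pooled two-sample statistic can be evaluated step by step. The sketch below assumes, as in these notes, that the sample variances are computed with divisors n1 and n2, so the pooled variance is (n1 s1² + n2 s2²) / (n1 + n2 − 2); the function name is hypothetical.

```python
import math


def pooled_t(mean1, mean2, var1, var2, n1, n2):
    """Two-sample t with equal unknown variances; var1, var2 use divisor n."""
    pooled_variance = (n1 * var1 + n2 * var2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error


# Example figures: means 180 and 187, variances 40 and 60, sizes 12 and 20
t_two_sample = pooled_t(180, 187, 40, 60, 12, 20)
reject_h0 = abs(t_two_sample) > 1.6973   # table value with 30 degrees of freedom

print(round(t_two_sample, 3), reject_h0)
```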
Paired measurements arise when two measurements are made on one unit of
observation. For example, the severity of an illness measured before and after
medication.
When the difference of two measurements is the variable of interest, a test of the
hypothesis that the mean difference is zero in the population can be obtained from
the differences of pairs of measurements in the sample. This is a particularly useful
application because a mean difference of 0 signifies that the mean of one measure
is identical to the mean of the other measure.
Let (x1, y1), (x2, y2), …, (xn, yn) be the sample observations. Let di = yi - xi. Then to
test whether the means of X and Y are equal, it is sufficient to test whether the mean of the
differences μd = 0.
Thus H0: μd = 0. The alternative is 1) H1: μd ≠ 0 or 2) H1: μd > 0 or 3) H1: μd < 0
The test statistic is t = d̄ √(n − 1) / sd, which follows the t-distribution with n-1 degrees of
freedom, where d̄ and sd are the mean and standard deviation of the differences di.
Example: Two laboratories A and B carry out independent estimates of fat content
in ice-cream made by a firm. A sample is taken from each batch, halved, and the
separated halves sent to the two laboratories. The fat content obtained by the
laboratories is recorded below:
Batch no. 1 2 3 4 5 6 7 8 9 10
Lab A 7 8 7 3 8 6 9 4 7 8
Lab B 9 8 8 4 7 7 9 6 6 6
Is there a significant difference between the mean fat content obtained by the two
laboratories A and B?
Solution:
xi yi di=yi-xi di2
7 9 2 4
8 8 0 0
7 8 1 1
3 4 1 1
8 7 -1 1
6 7 1 1
9 9 0 0
4 6 2 4
7 6 -1 1
8 6 -2 4
H0: μd = 0 against H1: μd ≠ 0
Here, d̄ = (1/n) Σ di = 3/10 = 0.3
sd² = (1/n) Σ di² – (d̄)² = 17/10 − (0.3)² = 1.61
So, sd = 1.27
The test statistic is t = d̄ √(n − 1) / sd
= 0.3 × √(10 − 1) / 1.27 = 0.709
For a two-tailed test at α = 0.05 with 9 degrees of freedom, the critical value from the
t-table is 2.26.
Since |t| = 0.709 < 2.26, we accept H0. Thus there is no significant difference
between the mean fat content obtained by the two laboratories A and B.
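The paired calculation can be reproduced from the two rows of laboratory readings. This is a minimal sketch with illustrative variable names; it follows the convention of these notes, with the standard deviation of the differences computed with divisor n and the statistic multiplied by √(n − 1).

```python
import math

lab_a = [7, 8, 7, 3, 8, 6, 9, 4, 7, 8]
lab_b = [9, 8, 8, 4, 7, 7, 9, 6, 6, 6]

# Differences di = yi - xi for each batch
differences = [b - a for a, b in zip(lab_a, lab_b)]
n = len(differences)

d_bar = sum(differences) / n
s_d = math.sqrt(sum(d * d for d in differences) / n - d_bar ** 2)  # divisor n

t_paired = d_bar * math.sqrt(n - 1) / s_d
accept_h0 = abs(t_paired) < 2.26   # two-tailed, alpha = 0.05, 9 degrees of freedom

print(round(d_bar, 2), round(s_d, 2), round(t_paired, 3), accept_h0)
```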
5A.4 Analysis of Variance
The analysis of variance is a set of statistical techniques for studying variability from
different sources and comparing them to understand the relative importance of each of the
sources. It is also used to make inferences about the population through tests of
significance, including the very important comparison of the means of two or more separate
populations.
The technique of analysis of variance in the case of a single variable and in the case of two
variables is similar. In both cases a comparison is made between the variance of sample
means and the residual variance. However, in the case of a single variable the total variance is
divided into two parts only, viz. the variance between the samples and the variance within the
samples. The latter variance is called the residual variance. In the case of two variables, the
total variance is divided into three parts, viz. variance due to the first variable, variance due to
the second variable and residual variance.
In one-way classification we take into account only one variable – say the effect of
treatment. Let there be m treatments and ni sample observations on the ith
treatment. Let X be the dependent variable and xij be the jth observation of X for the ith
treatment. We start with the null hypothesis that the mean treatment effects are the same, or
H0: μ1 = μ2 = μ3 = … = μm, against the alternative that the mean treatment effects are not all the same.
51
The following are the steps in testing the above hypothesis.
i) Find the grand total T, which is the sum of the values of all the items of
all the samples.
ii) Calculate the correction factor, $CF = \frac{T^2}{N}$, where N is the
total number of observations ($N = n_1 + n_2 + \dots + n_m$).
iii) Find the sum of squares of all the items of all the samples, i.e. $\sum_i \sum_j x_{ij}^2$.
iv) Find the total sum of squares (TSS) by subtracting the correction
factor from the sum of squares of all the items: $TSS = \sum_i \sum_j x_{ij}^2 - \frac{T^2}{N}$.
v) Find the total of each sample ($x_{i.}$). Square each sample total,
divide it by the number of items in that sample, and add all these figures.
The between sum of squares is obtained by subtracting the correction factor
from this sum: $BSS = \frac{x_{1.}^2}{n_1} + \frac{x_{2.}^2}{n_2} + \dots + \frac{x_{m.}^2}{n_m} - \frac{T^2}{N}$.
vi) The within (error) sum of squares is ESS = TSS – BSS.
vii) The degrees of freedom of BSS is m – 1, the degrees of freedom of TSS
is N – 1, and the degrees of freedom of ESS is N – m.
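viii) Calculate the mean squares: $MSB = \frac{BSS}{m-1}$ and $MSE = \frac{ESS}{N-m}$.
ix) Calculate the F-ratio, $F = \frac{MSB}{MSE}$.
x) Find the table value from the F-table corresponding to degrees of
freedom m – 1 and N – m and significance level $\alpha$.
xi) If the calculated value is greater than the table value, reject $H_0$.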
In a one-way classification we take into account the effect of only one variable. In
a two-way classification the effects of two variables can be studied. In the two-way case, the total
variation is the sum of column variation, row variation and error variation. The variances
are calculated for both columns and rows and are compared with the residual or error
variation. Let there be r rows and c columns. The null hypotheses are: H01: column-wise
effects are not significant; H02: row-wise effects are not significant. The ANOVA
table is given below:
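Rows: F2 = MSR/MSE; table value F2 with r-1 and (c-1)(r-1) degrees of freedom; if the calculated F2 > table F2, reject H02.
Error: sum of squares ESS; degrees of freedom (c-1)(r-1); mean square MSE = ESS/((c-1)(r-1)).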
Example 1: From the data given below, set up a table of analysis of variance and find out
whether the means of the various samples differ significantly among themselves.
Sample 1: 9 11 13 9 8
Sample 2: 13 12 10 15 5
Sample 3: 19 13 17 7 9
Sample 4: 14 10 13 17 16
Sum of the squares of the sample totals, each divided by the number of observations in its sample
$= \frac{(50)^2}{5} + \frac{(55)^2}{5} + \frac{(65)^2}{5} + \frac{(70)^2}{5} = 2930$
ii) Between Sum of Squares, BSS = 2930 –CF = 2930 – 2880 = 50
Total: sum of squares 258, degrees of freedom 19
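The steps of the one-way procedure can be checked numerically on the four samples of Example 1. The following is a minimal sketch in plain Python; the function name `one_way_anova` and the dictionary keys are illustrative choices, not part of the notes.

```python
# One-way ANOVA following the computational steps listed in the notes.
# Data: the four samples of Example 1.
samples = [
    [9, 11, 13, 9, 8],      # Sample 1
    [13, 12, 10, 15, 5],    # Sample 2
    [19, 13, 17, 7, 9],     # Sample 3
    [14, 10, 13, 17, 16],   # Sample 4
]


def one_way_anova(groups):
    """Return the sums of squares, degrees of freedom and F-ratio."""
    m = len(groups)                                  # number of treatments
    n_total = sum(len(g) for g in groups)            # N
    grand_total = sum(sum(g) for g in groups)        # T
    cf = grand_total ** 2 / n_total                  # correction factor T^2 / N
    raw_ss = sum(v * v for g in groups for v in g)   # sum of all x_ij^2
    tss = raw_ss - cf                                # total sum of squares
    bss = sum(sum(g) ** 2 / len(g) for g in groups) - cf  # between samples
    ess = tss - bss                                  # within (error)
    msb = bss / (m - 1)
    mse = ess / (n_total - m)
    return {
        "T": grand_total, "CF": cf, "TSS": tss, "BSS": bss, "ESS": ess,
        "df_between": m - 1, "df_error": n_total - m, "df_total": n_total - 1,
        "MSB": msb, "MSE": mse, "F": msb / mse,
    }


result = one_way_anova(samples)
for name, value in result.items():
    print(f"{name:>10}: {value}")
```

The computed F-ratio is then compared with the F-table value for (m – 1, N – m) = (3, 16) degrees of freedom at the chosen significance level; the commonly tabulated 5% value is about 3.24.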
Towns
Quarters A B C D Total
I 60 50 60 50 220
II 50 40 65 50 205
III 45 35 45 50 175
IV 65 45 60 70 240
Total 220 170 230 220 840
Solution: Null Hypotheses H01 : Prices do not differ in the four towns
H02 : Prices do not differ in the four quarters
i) Correction Factor, $CF = \frac{T^2}{N} = \frac{(840)^2}{16} = 44100$
v) Error Sum of Squares, ESS = TSS – CSS – RSS = 1450 – 550 – 562.5 = 337.5
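The sums of squares for the towns-and-quarters table can be reproduced with a short script. This is a sketch in plain Python under the assumption that columns are towns A to D and rows are quarters I to IV, as in the table; all variable names are illustrative.

```python
# Two-way ANOVA (one observation per cell) for the price table:
# rows are quarters I-IV, columns are towns A-D.
prices = [
    [60, 50, 60, 50],   # Quarter I
    [50, 40, 65, 50],   # Quarter II
    [45, 35, 45, 50],   # Quarter III
    [65, 45, 60, 70],   # Quarter IV
]

r = len(prices)          # number of rows
c = len(prices[0])       # number of columns
grand_total = sum(sum(row) for row in prices)
cf = grand_total ** 2 / (r * c)                       # correction factor

tss = sum(v * v for row in prices for v in row) - cf  # total sum of squares
col_totals = [sum(row[j] for row in prices) for j in range(c)]
row_totals = [sum(row) for row in prices]
css = sum(t * t for t in col_totals) / r - cf         # column (town) SS
rss = sum(t * t for t in row_totals) / c - cf         # row (quarter) SS
ess = tss - css - rss                                 # error SS

msc = css / (c - 1)
msr = rss / (r - 1)
mse = ess / ((c - 1) * (r - 1))
f_columns = msc / mse    # F1: tests H01 (towns)
f_rows = msr / mse       # F2: tests H02 (quarters)

print("CF =", cf, "TSS =", tss, "CSS =", css, "RSS =", rss, "ESS =", ess)
print("F1 (towns) =", round(f_columns, 3), "F2 (quarters) =", round(f_rows, 3))
```

Each F-ratio is compared with the F-table value for its own numerator degrees of freedom and (c – 1)(r – 1) error degrees of freedom.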
5A.5. Exercises
1. The manufacturer of light bulbs claims that a light bulb lasts on an average 1600
hours. A sample of 100 light bulbs was taken at random and the average life of
bulbs was computed as 1570 hours with a standard deviation of 120 hours. At
α = 0.01, test the validity of the claim.
2. An insurance company claims that it takes 2 weeks (14 days), on an average, to
process an auto accident claim. The standard deviation is 6 days. To test the
validity of the claim, an investigator randomly selected 36 people who recently
filed claims. This sample revealed that it took the company an average of 16
days to process these claims. At 99% level of confidence, check if it takes the
company more than 14 days on an average to process the claim.
3. The sponsor of a television show believes that his studio audience is divided
equally between men and women. Out of 400 persons attending the show one
day, there were 230 men. At 5% significance level, test if the belief of the
sponsor is correct.
4. An airline claims that at most 8% of its lost luggage is never found. A consumer
advocate wants to test this claim. In a study of 200 random cases of lost
luggage, it was found that in 22 cases the lost luggage was never found. At the
95% confidence level, test the airline’s claim.
5. An advertising agency wants to find out if there is any difference in the degree
of loyalty for a given brand of cereal between men and women. A random
sample of 200 men and 200 women was taken and it was determined that 58%
of women and 65% of men showed brand loyalty. At the 5% level of significance,
test the null hypothesis that there is no significant difference between the
population proportions of men and women who are brand loyal.
6. An experiment has been conducted to compare the productivity of two
machines. Machine I was observed for 40 hours and machine II for 50 hours.
The average productivity of items produced per hour and the standard deviation
for each machine is recorded below:
Machine I Machine II
Salesmen
A 20 23 28 29
B 25 32 30 21
C 23 28 35 18
D 15 21 19 25
Based on this information, can it be concluded at the 0.05 level of significance
that there is a significant difference in the performance of these four salesmen?
11. The following table gives the number of refrigerators sold by 4 salesmen in
three months May, June, and July:
Month Salesmen
A B C D
May 50 40 48 39
June 46 48 50 45
July 39 44 40 39
Is there a significant difference in the sales made by the four salesmen?
Is there a significant difference in the sales made during different months?
5B.1. Introduction
The statistical techniques which we have discussed so far were concerned with univariate
data, that is, data on a single variable. There may, however, exist a relationship between
two or more variables, which can be gainfully utilized in taking decisions. For example, it
is worthwhile for the management of a business concern to know the relationship between
expenses on advertisement of a product and its sales. In order to study the joint
behavior of the variables we need to study their joint probability distribution.
Let X and Y be two random variables. The joint distribution of (X, Y) describes the
simultaneous occurrence of events defined by (X, Y). Since X and Y are random variables,
each also has its own distribution; these are called marginal distributions. The
individual distribution of X is called the marginal distribution of X and that of Y is called
the marginal distribution of Y.
Let X be the height and Y be the weight of students in a class. The weights of some students
may differ even though their heights are the same. So, it makes sense to find the probability
distribution of Y when the height is a particular value. The distribution of Y when X is
given is called the conditional distribution of Y given X. Similarly, the distribution of X when Y is
given is called the conditional distribution of X given Y.
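To make the three kinds of distribution concrete, the sketch below starts from a small joint probability table and derives the marginal and conditional distributions from it. The heights, weights and probabilities are hypothetical values invented for illustration, and the helper names are assumptions of this sketch.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = height in cm, Y = weight in kg).
joint = {
    (150, 50): F(2, 10), (150, 60): F(1, 10),
    (160, 50): F(1, 10), (160, 60): F(3, 10),
    (170, 60): F(1, 10), (170, 70): F(2, 10),
}


def marginal(joint_dist, axis):
    """Sum the joint probabilities over the other variable.

    axis = 0 gives the marginal distribution of X, axis = 1 that of Y.
    """
    out = {}
    for key, p in joint_dist.items():
        out[key[axis]] = out.get(key[axis], 0) + p
    return out


def conditional_y_given_x(joint_dist, x_value):
    """Distribution of Y among the cases where X equals x_value."""
    p_x = marginal(joint_dist, 0)[x_value]
    return {y: p / p_x for (x, y), p in joint_dist.items() if x == x_value}


marginal_x = marginal(joint, 0)
marginal_y = marginal(joint, 1)
weight_given_160 = conditional_y_given_x(joint, 160)

print("Marginal of X:", marginal_x)
print("Marginal of Y:", marginal_y)
print("Y given X = 160:", weight_given_160)
```

Exact fractions are used so that each distribution can be seen to sum to one without rounding error.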
In this topic, we study the statistical relationship between two quantitative variables. We
examine the directional relationship between two variables. In many instances one variable
may have a direct effect on the other or may be used to predict the other.
There are many instances where managers take decisions based on future events. For this,
they rely on observations of two or more variables which appear to be related to one
another.
Regression analysis is a set of statistical techniques for analyzing the relationship between
two numerical variables. One variable is viewed as the dependent variable and the other as
the independent variable. The purpose of regression analysis is to understand the direction
and extent to which values of dependent variable can be predicted by the corresponding
values of the independent variable. The regression gives the nature of relationship between
the variables.
Often the relationship between two variables x and y is not an exact mathematical
relationship; rather, the several y values corresponding to a given x value scatter about a
value that depends on x. For example, although not all persons of the same height
have exactly the same weight, their weights bear some relation to that height. On
average, people who are 6 feet tall are heavier than those who are 5 feet tall; the mean
weight in the population of 6-footers exceeds the mean weight in the population of
5-footers.
In conducting a regression analysis, we use a sample of data to estimate the values of these
parameters. The population of y values at a particular x value also has a variance; the usual
assumption is that the variance is the same for all values of x.
The principle of least squares is used to estimate the parameters of a linear regression. The
principle states that the best estimates of the parameters are those values
which minimize the sum of squares of the residual errors. The residual error is the difference
between the actual value of the dependent variable and its estimated
value.
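As a sketch of how the principle leads to the estimates used in the worked examples of these notes, assume the fitted line is written y = a + bx. The symbols a, b, $S_{xy}$ and $S_x^2$ match the later calculations; the derivation itself is the standard one and is supplied here only as an illustration.

```latex
\begin{aligned}
Q(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial Q}{\partial a} = 0 &\;\Rightarrow\; \sum y_i = n a + b \sum x_i \\
\frac{\partial Q}{\partial b} = 0 &\;\Rightarrow\; \sum x_i y_i = a \sum x_i + b \sum x_i^2 \\
\text{Solving:}\quad
b &= \frac{\frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{n}\sum x_i^2 - \bar{x}^2}
   = \frac{S_{xy}}{S_x^2},
\qquad a = \bar{y} - b\,\bar{x}
\end{aligned}
```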
And $S_x^2$ is the variance of x, that is, $S_x^2 = \frac{1}{n}\sum x_i^2 - (\bar{x})^2$
Y 3.5 4.3 5.2 5.8 6.4 7.3 7.2 7.5 7.8 8.3
X 6 8 9 12 10 15 17 20 18 24
Solution:
Y X XY X²
3.5 6 21 36
4.3 8 34.4 64
5.2 9 46.8 81
5.8 12 69.6 144
6.4 10 64 100
7.3 15 109.5 225
7.2 17 122.4 289
7.5 20 150 400
7.8 18 140.4 324
8.3 24 199.2 576
63.3 139 957.3 2239
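$\bar{x} = \frac{139}{10} = 13.9$, $\bar{y} = \frac{63.3}{10} = 6.33$
$S_{xy} = \frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y} = \frac{957.3}{10} - 13.9 \times 6.33 = 7.743$
$S_x^2 = \frac{1}{n}\sum x_i^2 - (\bar{x})^2 = \frac{2239}{10} - 13.9^2 = 30.69$
So, $b = \frac{S_{xy}}{S_x^2} = \frac{7.743}{30.69} = 0.252$
and $a = \bar{y} - b\,\bar{x} = 6.33 - 0.252 \times 13.9 = 2.8272$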
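The same fit can be reproduced in a few lines of plain Python. This is a minimal sketch; the function name `fit_line` is an illustrative choice. Because the script does not round b before computing a, the intercept comes out near 2.823 rather than the hand-calculated 2.8272.

```python
# Least-squares line y = a + b*x for the ten (X, Y) pairs of the example.
y = [3.5, 4.3, 5.2, 5.8, 6.4, 7.3, 7.2, 7.5, 7.8, 8.3]
x = [6, 8, 9, 12, 10, 15, 17, 20, 18, 24]


def fit_line(xs, ys):
    """Return (a, b) using b = S_xy / S_x^2 and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(xi * yi for xi, yi in zip(xs, ys)) / n - x_bar * y_bar
    s_x2 = sum(xi * xi for xi in xs) / n - x_bar ** 2
    b = s_xy / s_x2
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(x, y)
print(f"b = {b:.4f}, a = {a:.4f}")
```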
There are two regression lines; regression line of y on x and regression line of x on y. In
the regression line of y on x, y is the dependent variable and x is the independent variable
and it is used to predict the value of y for a given value of x. But in the regression line of x
on y, x is the dependent variable and y is the independent variable and it is used to predict
the value of x for a given value of y.
The regression line of y on x is given by
$y - \bar{y} = \frac{S_{xy}}{S_x^2}(x - \bar{x})$
and the regression line of x on y is given by
$x - \bar{x} = \frac{S_{xy}}{S_y^2}(y - \bar{y})$
Regression Coefficients
The quantity $\frac{S_{xy}}{S_x^2}$ is the regression coefficient of y on x and is denoted by $b_{yx}$; it gives
the slope of the regression line of y on x. That is, $b_{yx} = \frac{S_{xy}}{S_x^2}$ is the rate of change in y for a unit change in x.
The quantity $\frac{S_{xy}}{S_y^2}$ is the regression coefficient of x on y and is denoted by $b_{xy}$; it gives
the slope of the regression line of x on y. That is, $b_{xy} = \frac{S_{xy}}{S_y^2}$ is the rate of change in x for a unit change in y.
5B.3. Correlation
Correlation measures the degree of linear relation between the variables. The existence of
correlation between variables does not necessarily mean that one is the cause of the change
in the other. It should be noted that correlation analysis merely helps in determining the
degree of association between two variables; it does not tell anything about a cause
and effect relationship. While interpreting the correlation coefficient, it is necessary to see
whether there is any cause and effect relationship between the variables under study. If there
is no such relationship, the observed correlation is meaningless.
The first step in correlation and regression analysis is to visualize the relationship between
the variables. A scatter diagram is obtained by plotting the points (x1, y1), (x2, y2), …,
(xn, yn) on a two-dimensional plane. If the points are scattered around a straight line, we
may infer that there exists a linear relationship between the variables. If the points are
clustered around a straight line with negative slope, then there exists negative correlation, or
the variables are inversely related (i.e., when x increases y decreases and vice versa). If
the points are clustered around a straight line with positive slope, then there exists positive
correlation, or the variables are directly related (i.e., when x increases y also increases and
vice versa).
Karl Pearson’s Correlation Coefficient
If (x1, y1), (x2, y2), …, (xn, yn) are n given observations, then Karl Pearson’s correlation
coefficient is defined as $r = \frac{S_{xy}}{S_x S_y}$, where $S_{xy}$ is the covariance and $S_x$, $S_y$ are the standard
deviations of X and Y respectively.
That is, $r = \frac{\frac{1}{n}\sum xy - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n}\sum x^2 - \bar{x}^2}\;\sqrt{\frac{1}{n}\sum y^2 - \bar{y}^2}}$
The value of r lies between –1 and 1, that is, $-1 \le r \le 1$. When r = 1, there exists a perfect
positive linear relationship between x and y. When r = -1, there exists a perfect negative linear
relationship between x and y. When r = 0, there is no linear relationship between x and y.
Coefficient of Determination
The coefficient of determination is the square of the correlation coefficient and gives the
proportion of variation in y explained by x. That is, the coefficient of determination is the ratio
of explained variance to total variance. For example, $r^2 = 0.879$ means that 87.9% of
the total variation in y is explained by x. When $r^2 = 1$, all the points on the
scatter diagram fall on the regression line and the entire variation is explained by the
straight line. On the other hand, if $r^2 = 0$, the regression line explains none of the
variation in y, meaning that there is no linear relationship between the
variables.
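The statement that the coefficient of determination equals explained variance over total variance can be checked numerically. The sketch below reuses the ten (X, Y) pairs from the least-squares example earlier in these notes and computes the ratio both ways; variable names are illustrative.

```python
from math import sqrt

# The ten (X, Y) pairs from the least-squares example in these notes.
x = [6, 8, 9, 12, 10, 15, 17, 20, 18, 24]
y = [3.5, 4.3, 5.2, 5.8, 6.4, 7.3, 7.2, 7.5, 7.8, 8.3]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum(a * b for a, b in zip(x, y)) / n - x_bar * y_bar
s_x2 = sum(a * a for a in x) / n - x_bar ** 2
s_y2 = sum(b * b for b in y) / n - y_bar ** 2   # total variance of y

r = s_xy / sqrt(s_x2 * s_y2)                    # correlation coefficient

# Fitted values from the regression line of y on x.
slope = s_xy / s_x2
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

# Variance of the fitted values about y_bar = explained variance.
explained_variance = sum((f - y_bar) ** 2 for f in fitted) / n
ratio = explained_variance / s_y2

print(f"r^2 = {r ** 2:.4f}, explained / total = {ratio:.4f}")
```

The two printed figures coincide, which is the identity the paragraph describes.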
1. Fit both regression lines
2. Find the correlation coefficient
3. Verify the correlation coefficient is the geometric mean of the regression
coefficients
4. Find the value of y when x = 17.5
Solution:
X Y XY X² Y²
15 80 1200 225 6400
16 75 1200 256 5625
17 60 1020 289 3600
18 40 720 324 1600
19 30 570 361 900
20 20 400 400 400
105 305 5110 1855 18525
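$\bar{x} = \frac{\sum x}{n} = \frac{105}{6} = 17.5$, $\bar{y} = \frac{\sum y}{n} = \frac{305}{6} = 50.83$
$S_{xy} = \frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y} = \frac{5110}{6} - 17.5 \times 50.83 = -37.86$
$S_x^2 = \frac{1}{n}\sum x_i^2 - (\bar{x})^2 = \frac{1855}{6} - 17.5^2 = 2.92$
$S_y^2 = \frac{1}{n}\sum y_i^2 - (\bar{y})^2 = \frac{18525}{6} - 50.83^2 = 503.81$
$b_{yx} = \frac{S_{xy}}{S_x^2} = \frac{-37.86}{2.92} = -12.96$ and $b_{xy} = \frac{S_{xy}}{S_y^2} = \frac{-37.86}{503.81} = -0.075$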
1. Regression line of y on x is $y - \bar{y} = \frac{S_{xy}}{S_x^2}(x - \bar{x})$
i.e., y – 50.83 = -12.96(x – 17.5)
y = -12.96 x + 277.63
Regression line of x on y is $x - \bar{x} = \frac{S_{xy}}{S_y^2}(y - \bar{y})$
i.e., x – 17.5 = -0.075(y – 50.83)
x = -0.075 y + 21.31
2. Correlation coefficient, $r = \frac{S_{xy}}{S_x S_y} = \frac{-37.86}{1.71 \times 22.45} = -0.986$
3. $b_{yx} \times b_{xy} = -12.96 \times -0.075 = 0.972$
Then, $\sqrt{0.972} = 0.986$
So, r = -0.986, taking the negative root because both regression coefficients are negative.
4. To predict the value of y, use the regression line of y on x.
When x = 17.5, y = -12.96 × 17.5 + 277.63 = 50.83
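The four parts of this solution can be cross-checked with a short script. This is a sketch in plain Python with illustrative variable names. It carries full precision throughout, so it gives $b_{yx}$ = -13.0 and r of about -0.989, whereas the hand calculation above rounds $\bar{y}$ to 50.83 and therefore reports -12.96 and -0.986.

```python
from math import sqrt

x = [15, 16, 17, 18, 19, 20]
y = [80, 75, 60, 40, 30, 20]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum(a * b for a, b in zip(x, y)) / n - x_bar * y_bar
s_x2 = sum(a * a for a in x) / n - x_bar ** 2
s_y2 = sum(b * b for b in y) / n - y_bar ** 2

b_yx = s_xy / s_x2                 # slope of the regression line of y on x
b_xy = s_xy / s_y2                 # slope of the regression line of x on y
r = s_xy / sqrt(s_x2 * s_y2)       # Karl Pearson's correlation coefficient

# r is the geometric mean of the two regression coefficients,
# carrying their common sign.
geometric_mean = sqrt(b_yx * b_xy)
if b_yx < 0:
    geometric_mean = -geometric_mean


def predict_y(x_value):
    """Predict y from the regression line of y on x."""
    return y_bar + b_yx * (x_value - x_bar)


y_at_17_5 = predict_y(17.5)

print(f"b_yx = {b_yx:.3f}, b_xy = {b_xy:.4f}, r = {r:.4f}")
print(f"geometric mean = {geometric_mean:.4f}, y(17.5) = {y_at_17_5:.2f}")
```

Since x = 17.5 is the mean of x, the predicted y is simply the mean of y, which is why part 4 returns 50.83.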
Alternatively, take u = x – 18 and v = (y – 40)/10:
X Y u v uv u² v²
15 80 -3 4 -12 9 16
16 75 -2 3.5 -7 4 12.25
17 60 -1 2 -2 1 4
18 40 0 0 0 0 0
19 30 1 -1 -1 1 1
20 20 2 -2 -4 4 4
105 305 -3 6.5 -26 19 37.25
$\bar{u} = \frac{\sum u}{n} = \frac{-3}{6} = -0.5$, $\bar{v} = \frac{\sum v}{n} = \frac{6.5}{6} = 1.083$
$S_{uv} = \frac{1}{n}\sum u_i v_i - \bar{u}\,\bar{v} = \frac{-26}{6} - (-0.5)(1.083) = -3.79$
$S_u^2 = \frac{1}{n}\sum u_i^2 - (\bar{u})^2 = \frac{19}{6} - (-0.5)^2 = 2.92$
$S_v^2 = \frac{1}{n}\sum v_i^2 - (\bar{v})^2 = \frac{37.25}{6} - 1.083^2 = 5.035$
$b_{vu} = \frac{S_{uv}}{S_u^2} = \frac{-3.79}{2.92} = -1.297$ and $b_{uv} = \frac{S_{uv}}{S_v^2} = \frac{-3.79}{5.035} = -0.75$
1. Regression line of v on u is $v - \bar{v} = b_{vu}(u - \bar{u})$
i.e., v – 1.083 = -1.297(u + 0.5)
v = -1.297u + 0.4345
Therefore, the regression line of y on x is $\frac{y - 40}{10} = -1.297\,\frac{x - 18}{1} + 0.4345$
i.e., y = -12.97 x + 277.8
Regression line of u on v is $u - \bar{u} = b_{uv}(v - \bar{v})$
i.e., u + 0.5 = -0.75(v – 1.083)
u = -0.75 v + 0.31225
Therefore, the regression line of x on y is $\frac{x - 18}{1} = -0.75\,\frac{y - 40}{10} + 0.31225$
i.e., x = -0.075 y + 21.31
2. Correlation coefficient, $r = \frac{S_{uv}}{S_u S_v} = \frac{-3.79}{1.71 \times 2.244} = -0.988$,
which agrees with the value -0.986 obtained from the original variables up to rounding.
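The effect of the change of origin and scale can be verified directly: the correlation coefficient is unchanged, while the slope in the original units is the coded slope multiplied by the ratio of the scale factors (here 10/1). The sketch below assumes the coding u = x – 18 and v = (y – 40)/10 used in the table; the helper name `slope_and_r` is illustrative.

```python
from math import sqrt

x = [15, 16, 17, 18, 19, 20]
y = [80, 75, 60, 40, 30, 20]

# Coded variables: shift the origin and rescale (assumed coding from the table).
u = [xi - 18 for xi in x]
v = [(yi - 40) / 10 for yi in y]


def slope_and_r(xs, ys):
    """Return (regression coefficient of ys on xs, correlation coefficient)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(a * b for a, b in zip(xs, ys)) / n - x_bar * y_bar
    s_x2 = sum(a * a for a in xs) / n - x_bar ** 2
    s_y2 = sum(b * b for b in ys) / n - y_bar ** 2
    return s_xy / s_x2, s_xy / sqrt(s_x2 * s_y2)


b_vu, r_uv = slope_and_r(u, v)
b_yx, r_xy = slope_and_r(x, y)

print(f"coded:    b_vu = {b_vu:.4f}, r = {r_uv:.4f}")
print(f"original: b_yx = {b_yx:.4f}, r = {r_xy:.4f}")
```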
$r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$
First judge: 1 6 5 10 3 2 4 9 7 8
Second judge: 3 5 8 4 7 10 2 1 6 9
Find the correlation between the rankings.
Solution:
xi yi di = xi – yi di²
1 3 -2 4
6 5 1 1
5 8 -3 9
10 4 6 36
3 7 -4 16
2 10 -8 64
4 2 2 4
9 1 8 64
7 6 1 1
8 9 -1 1
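Here $\sum d_i^2 = 200$ and n = 10, so
$r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 200}{10(10^2 - 1)} = -0.212$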
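The rank correlation for the two judges can be computed from the rank differences with a few lines of Python. This is a minimal sketch for the untied case, using the formula $r = 1 - 6\sum d_i^2 / (n(n^2 - 1))$; the function name is an illustrative choice.

```python
# Rankings given by the two judges to the same ten items.
first_judge = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
second_judge = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]


def rank_correlation(ranks_1, ranks_2):
    """Spearman's rank correlation for rankings without ties."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


sum_d2 = sum((a - b) ** 2 for a, b in zip(first_judge, second_judge))
rho = rank_correlation(first_judge, second_judge)
print(f"sum of d^2 = {sum_d2}, r = {rho:.4f}")
```

A value close to zero and slightly negative suggests the two judges' rankings are largely unrelated, with a mild tendency to disagree.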
Tied Ranks
Sometimes, when more than one item has the same value, a common rank is given
to such items. This rank is the average of the ranks which these items would have got had
they differed slightly from each other. When this is done, the coefficient of rank correlation
needs a correction, because the above formula is based on the supposition that the ranks
of the various items are different.
If, in a series, $m_i$ is the number of items sharing the ith tied rank, then
$r = 1 - \frac{6\left[\sum d_i^2 + \frac{1}{12}\sum (m_i^3 - m_i)\right]}{n(n^2 - 1)}$
Example: Calculate the rank correlation coefficient from the sales and expenses of 10 firms
given below:
Sales(X): 50 50 55 60 65 65 65 60 60 50
Expenses(Y): 11 13 14 16 16 15 15 14 13 13
Solution:
x R1 y R2 d = R1 – R2 d²
50 9 11 10 -1 1
50 9 13 8 1 1
55 7 14 5.5 1.5 2.25
60 5 16 1.5 3.5 12.25
65 2 16 1.5 0.5 0.25
65 2 15 3.5 -1.5 2.25
65 2 15 3.5 -1.5 2.25
60 5 14 5.5 -0.5 0.25
60 5 13 8 -3 9
50 9 13 8 1 1
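$\sum d^2 = 31.5$
$r = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}\sum (m^3 - m)\right]}{n(n^2 - 1)}$
$= 1 - \frac{6\left[31.5 + \frac{1}{12}\left[(3^3 - 3) + (3^3 - 3) + (3^3 - 3) + (2^3 - 2) + (2^3 - 2) + (2^3 - 2) + (3^3 - 3)\right]\right]}{10(10^2 - 1)}$
= 0.75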
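The tied-rank calculation can be sketched in plain Python as follows. The helpers `average_ranks` and `tie_correction` are illustrative names; rank 1 is assigned to the largest value, as in the table, and tied values share the average of the positions they occupy.

```python
from collections import Counter

sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]


def average_ranks(values):
    """Rank 1 = largest value; tied values get the mean of their positions."""
    positions = {}
    for pos, val in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(val, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """(1/12) * sum of (m^3 - m) over groups of tied values."""
    return sum(m ** 3 - m for m in Counter(values).values()) / 12


def rank_correlation_with_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))


ranks_sales = average_ranks(sales)
ranks_expenses = average_ranks(expenses)
sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_sales, ranks_expenses))
r_tied = rank_correlation_with_ties(sales, expenses)
print("R1:", ranks_sales)
print("R2:", ranks_expenses)
print(f"sum of d^2 = {sum_d2}, r = {r_tied:.4f}")
```

Untied values contribute nothing to the correction because $m^3 - m = 0$ when m = 1, so the same function reduces to the ordinary formula when there are no ties.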
5B.4. Exercises
Region : 1 2 3 4 5 6
Expenditure(X): 40 45 80 20 15 50
Sales (Y): 25 30 45 20 20 40
Economics: 61 78 77 97 65 95 30 74 55
Finance: 84 70 93 93 77 99 43 80 67
a) Compute the correlation coefficient.
3. Calculate the rank correlation coefficient from the sales and expenses of 9
firms given below:
Sales(X): 42 40 54 62 55 65 65 66 62
Expenses(Y): 10 18 18 17 17 14 13 10 13
Y = B0 + B1 X1 + B2 X2 + … + Bk Xk + e,
provides a good estimate of an individual’s Y score based on his X scores. The least
squares-method is used to estimate the parameters in such a way that the sum of the squared
deviations of the actual values and the predicted values is kept as small as possible.
The maximum correlation between the dependent variable and a linear combination of the
independent variables is called the multiple correlation and is usually denoted by R. The
value of R lies between 0 and 1. If R = 0, there exists no linear relation between the dependent
variable and the independent variables taken. If R = 1, there exists a perfect linear relation
between the dependent variable and the independent variables taken. $R^2$ is then the
coefficient of determination, which gives the percentage of variation of the dependent
variable explained by the independent variables.
Where Y is the dependent variable; X1, X2, …, Xk are independent variables; B0,
B1, B2, …, Bk are regression coefficients.
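A least-squares fit of the multiple regression equation can be sketched without any external library by solving the normal equations. The data below are hypothetical, generated exactly from Y = 1 + 2 X1 + 3 X2 so that the error term is zero and the recovered coefficients are easy to check; the function names are assumptions of this sketch.

```python
def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[i][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in reversed(range(n)):                          # back substitution
        tail = sum(aug[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_multiple_regression(x_rows, y):
    """Return [B0, B1, ..., Bk] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in x_rows]        # intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)                  # normal equations


# Hypothetical data: Y = 1 + 2*X1 + 3*X2 exactly.
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9, 8, 19, 18, 26]

coefficients = fit_multiple_regression(x_rows, y)
fitted = [coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], row))
          for row in x_rows]
y_bar = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst                                 # coefficient of determination

print("B0, B1, B2 =", [round(b, 6) for b in coefficients])
print("R^2 =", round(r_squared, 6))
```

Because the hypothetical data contain no error, $R^2$ comes out as 1; with real data the residual sum of squares is positive and $R^2$ falls below 1.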
Cluster analysis is a multivariate procedure for detecting groupings in the data. The objects
in these groups may be cases or variables. A cluster analysis of cases resembles
discriminant analysis in one respect – the researcher seeks to classify a set of objects into
groups or categories, but, in cluster analysis, neither the number nor the members of the
groups are known.
Two methods of clustering objects into categories are a) hierarchical cluster analysis and
b) K-means cluster analysis.
Factor analysis is used in exploratory data analysis to a) study the correlations among a
large number of interrelated quantitative variables by grouping the variables into few
factors; after grouping, the variables within each factor are more highly correlated with
variables in that factor than with variables in other factors, b) interpret each factor
according to the meaning of the variables, c) summarize many variables by a few factors.