
notes-statistics

The document provides an overview of descriptive statistics and statistical inference, emphasizing the importance of statistics in various fields. It covers data collection methods, organization, and summarization techniques, including frequency distributions and measures of central tendency such as the arithmetic mean, median, and mode. Additionally, it discusses methods for calculating these measures from both raw and grouped data, along with examples to illustrate the concepts.

Uploaded by

Jeffry
Copyright
© © All Rights Reserved

DESCRIPTIVE STATISTICS, STATISTICAL INFERENCE

3A. DESCRIPTIVE STATISTICS

3A.1. Introduction

Statistics enters into almost every phase of life in some way. A daily news broadcast may
start with a weather forecast and end with an analysis of the stock market. Statistics
provides a systematic basis for investigations in many fields of knowledge, such as the
social, physical and biological sciences, engineering, medicine, education, business and
management. Information on a topic is acquired in the form of numbers; an analysis of
these data is made in order to obtain a better understanding of the phenomenon of interest,
and some conclusions may be drawn. Often generalizations are sought; their validity is
assessed by further investigation.

Statistics is a methodology for collecting, analyzing, interpreting, and drawing
conclusions from data. Data is the statistical information collected by the investigator.

3A.2. Data collection and Data Representation

The totality of individuals under consideration is called the population and each individual in
the population is called a unit. A particular aspect about which we require information is
called a characteristic. Sometimes values for all individuals in the population of relevance
are obtained, but often only a set of individuals that can be considered representative
of that population is observed; such a set of individuals constitutes a sample.

Census Survey and Sample Survey

The method of collecting data from all the units of the population is called census survey
or complete enumeration and the method of collecting data from a sample is called sample
survey.

Organising and Summarising Data

A researcher who has to deal with data would be more efficient if the data are presented to
him in a properly tabulated, easy-to-read form. This facilitates quick assimilation of data
and decision-making. However, the data needed for purpose of analysis are generally not
available in a proper format. The analyst is, therefore, required to undertake on his own,
the task of organizing the data into a proper format.

Raw Data: The data in the form originally collected, completely devoid of
arrangement by size or sequence, are known as raw data. That is, unorganized data
are called raw data. Raw data are not amenable even to simple reading and do
not highlight any characteristic or trend.

Frequency Distribution: A frequency distribution can be either grouped or
ungrouped. If the distinct values in a set of data are few, we can write the observed
values in one column and the corresponding numbers of repetitions (frequencies) in
another column. Such a frequency distribution is called an ungrouped frequency
distribution. If the distinct values in the set of data are many, we can group the
values into different classes and find the number of observations in each class.
The distribution is then called a grouped frequency distribution.

For example, the numbers of children in 20 households can be summarized as follows:

Number of    Frequency    Relative     Cumulative
Children                  Frequency    Frequency
0            2            0.1          2
1            4            0.2          6
2            7            0.35         13
3            5            0.25         18
4            2            0.1          20

The first and second columns together form an ungrouped frequency table. The third
column gives the relative frequency, which is obtained on dividing each frequency by
the total frequency. The last column gives the cumulative frequency.
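
The three derived columns can be computed mechanically from raw data. The following is a minimal Python sketch; the function name `frequency_table` and the reconstructed raw list `children` are illustrative assumptions (the raw household data are not given in these notes, only the summary table).

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative frequency, cumulative frequency)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]          # running total of frequencies
        rows.append((value, counts[value], counts[value] / total, cumulative))
    return rows

# Hypothetical raw data chosen to reproduce the table for 20 households.
children = [0] * 2 + [1] * 4 + [2] * 7 + [3] * 5 + [4] * 2

for row in frequency_table(children):
    print(row)
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the table.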

The following table gives the marks of 50 students in an examination. The data is
summarized in the form of a grouped frequency table.

Marks Frequency
0 - 10 2
10 - 20 5
20 - 30 25
30 - 40 15
40 - 50 3

If the upper limit of a class is the same as the lower limit of the next class, then the
distribution is continuous. Here the upper limits are not included in that class, so this
type of classification is called exclusive classification.

Classification of the form 0 - 10, 11 - 20, 21 - 30, … is called inclusive
classification. The difference between the upper limit and the lower limit of a class is known
as the class width. In the above table, the class width is 10.

An inclusive type of classification can be converted to an exclusive type by making
some minor adjustments:
i) Find the difference between the upper limit of a class and the lower limit of the
next class and divide it by 2.
ii) Subtract the resulting quantity from all the lower limits and add it to all the
upper limits.
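
The two adjustment steps can be sketched in Python as follows. The function name `to_exclusive` and the sample classes are illustrative assumptions; the sketch also assumes the gap between consecutive classes is the same throughout.

```python
def to_exclusive(classes):
    """Convert inclusive class limits to exclusive (continuous) ones.

    `classes` is a list of (lower, upper) pairs in ascending order.
    """
    # Step i: half the gap between one class's upper limit and the next lower limit.
    half_gap = (classes[1][0] - classes[0][1]) / 2
    # Step ii: subtract from every lower limit, add to every upper limit.
    return [(lower - half_gap, upper + half_gap) for lower, upper in classes]

print(to_exclusive([(1, 10), (11, 20), (21, 30)]))
# [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5)]
```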

3A.3. MEASURES OF CENTRAL TENDENCY ( AVERAGES)

After a set of data has been collected, it must be organized and condensed or
categorized for purposes of analysis. In addition to graphical displays, numerical
indices can be computed that summarize the primary features of the data set. One is an
indicator of location or central tendency that specifies where the set of measurements
is "located". That is, an average is a value which is relatively close to all the
observations and acts as their representative.
The commonly used averages are the Arithmetic Mean (AM), Median and Mode.

1. Arithmetic Mean

Arithmetic mean is defined as the sum of the values divided by the total number of
values in the data.
If x1, x2, …, xn are the values, then their AM, denoted by $\bar{X}$, is

$\bar{X} = \frac{x_1 + x_2 + \dots + x_n}{n} = \sum_{i=1}^{n} \frac{x_i}{n}$

Example 1: The numbers of children in 10 families are: 5, 2, 2, 3, 1, 4, 3, 2, 1, 2.

Solution: AM $= \frac{5 + 2 + 2 + 3 + 1 + 4 + 3 + 2 + 1 + 2}{10} = \frac{25}{10} = 2.5$

If the data is in the form of an ungrouped frequency table as follows:

Values:     x1   x2   x3   …   xn
Frequency:  f1   f2   f3   …   fn

then x1 f1 gives the sum of all observations with value x1, x2 f2 gives the sum of all
observations with value x2, …, xn fn gives the sum of all observations with value xn.
Therefore, the sum of all observations will be x1 f1 + x2 f2 + … + xn fn.

Arithmetic Mean, $\bar{X} = \frac{x_1 f_1 + x_2 f_2 + \dots + x_n f_n}{f_1 + f_2 + \dots + f_n} = \frac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i}$

Example 2: Find the arithmetic mean of the following distribution:

x: 1 2 3 4 5 6 7
f: 5 9 12 17 14 10 6

Solution

x f fx
1 5 5
2 9 18
3 12 36
4 17 68
5 14 70
6 10 60
7 6 42
Total 73 299

X =
f  x =
299
= 4.096
f 73
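
The same calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `frequency_mean` is an illustrative assumption, not part of the notes.

```python
def frequency_mean(values, freqs):
    """Arithmetic mean of a frequency table: sum(f * x) / sum(f)."""
    total = sum(f * x for x, f in zip(values, freqs))
    return total / sum(freqs)

x = [1, 2, 3, 4, 5, 6, 7]
f = [5, 9, 12, 17, 14, 10, 6]
print(round(frequency_mean(x, f), 3))  # 4.096
```

The same function applies to a grouped table if the class mid-values are passed as `values`.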

Calculating the AM from a grouped frequency table

In a grouped frequency table we don't know the actual values of the observations
falling in a class. We only know that the values of observations falling in a class lie
between its lower limit and upper limit. So, we cannot find the exact AM. To
calculate the (approximate) AM of a grouped table, we assume that the
values of the observations falling in a class are all equal to the mid-value of that class. Then
we treat the class mid-values as x-values and use the formula for the
ungrouped frequency table.

Example 3: Find the AM of the following data:

Daily wages (in Rs.):  50 - 60   60 - 70   70 - 80   80 - 90   90 - 100
No. of workers:        2         5         7         6         5

Solution:

Wages       Mid-value (x)   Frequency (f)   f.x
50 - 60     55              2               110
60 - 70     65              5               325
70 - 80     75              7               525
80 - 90     85              6               510
90 - 100    95              5               475
Total                       25              1945

$\bar{X} = \frac{\sum f x}{\sum f} = \frac{1945}{25} = 77.8$

Short-cut Method to find AM

If the values of x are very large, the calculation of AM becomes time consuming.
Let the mid-values of k classes be x1, x2, …, xk and f1, f2, …, fk be the corresponding
frequencies. We use the transformation $u_i = \frac{x_i - A}{C}$ for i = 1, 2, …, k.
Here A and C can be any two numbers (with C non-zero). But it is better to take A as a number among the
middle part of the mid-values. If all the classes are of equal width, C can be taken as the
class width.

Then AM, $\bar{X} = A + C\,\bar{u}$, where $\bar{u} = \frac{\sum f u}{\sum f}$

Example 4: Find the AM of the following data:

Marks : 0 – 10 10 – 20 20 – 30 30 – 40 40 – 50 50 - 60
No. of students: 12 18 27 20 17 6

Solution:

Marks      Mid-value (x)   u = (x - 35)/10   Frequency (f)   f.u
0 - 10     5               -3                12              -36
10 - 20    15              -2                18              -36
20 - 30    25              -1                27              -27
30 - 40    35              0                 20              0
40 - 50    45              1                 17              17
50 - 60    55              2                 6               12
Total                                        100             -70

$\bar{u} = \frac{\sum f u}{\sum f} = \frac{-70}{100} = -0.7$

$\bar{X} = A + C\,\bar{u} = 35 + 10 \times (-0.7) = 35 - 7 = 28$
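
As a check, the short-cut method can be compared with the direct formula in Python. This is a sketch under the assumptions of this example (A = 35, C = 10); the function name `coded_mean` is illustrative.

```python
def coded_mean(mid_values, freqs, A, C):
    """Mean via the transformation u = (x - A) / C, then X = A + C * u_bar."""
    u = [(x - A) / C for x in mid_values]
    u_bar = sum(f * ui for f, ui in zip(freqs, u)) / sum(freqs)
    return A + C * u_bar

mids = [5, 15, 25, 35, 45, 55]
freqs = [12, 18, 27, 20, 17, 6]

shortcut = coded_mean(mids, freqs, A=35, C=10)
direct = sum(f * x for x, f in zip(mids, freqs)) / sum(freqs)
print(round(shortcut, 6), round(direct, 6))
```

Both routes give the same mean for any choice of A and non-zero C; the transformation only makes the hand arithmetic smaller.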

Properties of AM

1. The algebraic sum of the deviations of a set of values from their AM is zero.
2. The sum of squares of the deviations of a set of values is minimum when the
deviations are taken about the AM.

Combined Mean of Two Groups

Let $\bar{x}_1$ and $\bar{x}_2$ be the means of two groups. Let there be n1 observations in the
first group and n2 observations in the second group. Then $\bar{x}$, the mean of the combined
group, can be obtained as

$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$

Example 5: Average daily wage of 60 male workers in a firm is Rs. 120 and that of
40 females is Rs.100. Find the mean wage of all the workers.

Solution: Here n1 = 60, $\bar{x}_1$ = 120 and n2 = 40, $\bar{x}_2$ = 100

Combined Mean $= \frac{60 \times 120 + 40 \times 100}{60 + 40} = \frac{11200}{100} = 112$

Weighted AM

When calculating the AM we assume that all the observations have equal importance.
If some items are more important than others, proper weightage should be given in
accordance with their importance. Let w1, w2, …, wn be the weights attached to the
items x1, x2, …, xn. Then the weighted AM is defined as

Weighted mean $= \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$

Example 6: A teacher has decided to use a weighted average in figuring final grades
for his students. The midterm examination will count 40%, the final examination will
count 50% and quizzes 10%. Compute the average mark obtained for a student who
got 90 marks for midterm examination, 80 marks for final and 70 for quizzes.

Solution: Here w1 = 40, x1 = 90
w2 = 50, x2 = 80
w3 = 10, x3 = 70

Weighted mean $= \frac{40 \times 90 + 50 \times 80 + 10 \times 70}{40 + 50 + 10} = \frac{8300}{100} = 83$
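
A weighted mean is easy to compute directly. The sketch below uses a hypothetical helper `weighted_mean`; it also shows that the combined mean of two groups is the same calculation with the group sizes as weights.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Example 6: midterm 40%, final 50%, quizzes 10%.
print(weighted_mean([90, 80, 70], [40, 50, 10]))  # 83.0

# Example 5: combined mean of 60 male and 40 female workers.
print(weighted_mean([120, 100], [60, 40]))        # 112.0
```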

2. Median

The median of a set of observations is a value that divides the set of observations in half,
so that the observations in one half are less than or equal to the median and the observations
in the other half are greater than or equal to the median value.

In finding the median of a set of data it is often convenient to put the observations in
ascending or descending order. If the number of observations is odd, the median is the
middle observation. For example, if the values are 52, 55, 61, 67, and 72, the median is 61.
If there were 4 values instead of 5, say 52, 55, 61, and 67, there would not be a middle
value. Here any number between 55 and 61 could serve as a median; but it is desirable to
use a specific number for the median and we usually take the AM of two middle values,
i.e, (55+61)/2 = 58.

Median is the primary measure of location for variables measured on ordinal scale because
it indicates which observation is central without attention to how far above or below the
median the other observations fall.

Example 7: Find the median of 10, 2, 4, 8, 5, 1, 7

Solution: Observations in ascending order of magnitude are 1, 2, 4, 5, 7, 8, 10


Here there are 7 observations, so median is the 4th observation.
That is, median = 5
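
The rule for odd and even numbers of observations can be written as a short Python function. This is a minimal sketch; the name `median` is chosen here for illustration (Python's standard `statistics.median` follows the same convention).

```python
def median(values):
    """Median of raw data: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: AM of two middle values

print(median([10, 2, 4, 8, 5, 1, 7]))  # 5
print(median([52, 55, 61, 67]))        # 58.0
```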

Median for a grouped frequency distribution

In a grouped frequency distribution, we do not know the exact values falling in each
class. So, the median can only be approximated by interpolation. Let the total number of
observations be N. For calculating the median we assume that the observations in the median
class are uniformly distributed. The median class is the class to which the (N/2)th observation
belongs. We also assume that the median is the (N/2)th observation.

Here the frequency table must be continuous. If it is not, convert it into a continuous
table. Prepare a less-than cumulative frequency table and find the median class. Let l be
the lower limit of the median class, f the frequency of the median class, and c the
class width of the median class. By the assumption of uniform distribution, the f
observations in the median class are $l + \frac{c}{f},\ l + \frac{2c}{f},\ \dots,\ l + \frac{fc}{f}$. Let m be the cumulative
frequency of the class preceding the median class. Then the median will be the $\left(\frac{N}{2} - m\right)$th
observation in the median class.

That is, median $= l + \left(\frac{N}{2} - m\right)\frac{c}{f}$

Example 8: Calculate the median of the following data:

class frequency
1 - 10 4
11 - 20 12
21 - 30 24
31 - 40 36
41 - 50 20
51 - 60 16
61 - 70 8
71 - 80 5

Solution: Since the frequency table is of the inclusive type, convert it into the exclusive type by subtracting
0.5 from the lower limits and adding 0.5 to the upper limits.

Class          Frequency   Cumulative frequency
0.5 - 10.5     4           4
10.5 - 20.5    12          16
20.5 - 30.5    24          40
30.5 - 40.5    36          76
40.5 - 50.5    20          96
50.5 - 60.5    16          112
60.5 - 70.5    8           120
70.5 - 80.5    5           125

Here $\frac{N}{2} = \frac{125}{2} = 62.5$, which lies in the 30.5 - 40.5 class (median class).
So, l = 30.5, f = 36, m = 40 and c = 10

Median $= l + \left(\frac{N}{2} - m\right)\frac{c}{f} = 30.5 + (62.5 - 40) \times \frac{10}{36} = 30.5 + 6.25 = 36.75$
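
The interpolation can be sketched in Python as below. The function name `grouped_median` and the representation of classes as (lower, upper) pairs are assumptions made for illustration; the table is assumed to be already in continuous (exclusive) form.

```python
def grouped_median(classes, freqs):
    """Median of a continuous grouped table by linear interpolation.

    classes: list of (lower, upper) limits; freqs: class frequencies.
    """
    target = sum(freqs) / 2      # position N/2
    m = 0                        # cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        if f > 0 and m + f >= target:
            # median = l + (N/2 - m) * c / f
            return lower + (target - m) * (upper - lower) / f
        m += f
    raise ValueError("frequency table has no observations")

classes = [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5), (30.5, 40.5),
           (40.5, 50.5), (50.5, 60.5), (60.5, 70.5), (70.5, 80.5)]
freqs = [4, 12, 24, 36, 20, 16, 8, 5]
print(grouped_median(classes, freqs))  # 36.75
```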

Property of Median: The sum of absolute deviations of a set of values is minimum when
the deviations are taken from the median.

3. Mode

The mode of a categorical or a discrete numerical variable is that category or value which
occurs with the greatest frequency.

Example 8: The mode of the data 2, 5, 4, 4, 7, 8, 3, 4, 6, 4, 3 is 4 because 4 occurs the
greatest number of times.

Mode of a grouped frequency distribution

In a grouped frequency distribution, to find the mode, first locate the modal class.
The modal class is the class with maximum frequency. Let l be the lower limit of the modal
class, c be the class width, f1 be the frequency of the modal class, f0 be the frequency
of the class preceding and f2 be the frequency of the class succeeding the modal class.

Then, Mode $= l + \frac{c\,(f_1 - f_0)}{2f_1 - f_0 - f_2}$

Example 9: Find the mode of the distribution given below

class      frequency
10 - 15    3
15 - 20    9
20 - 25    16
25 - 30    12
30 - 35    7
35 - 40    5
40 - 45    2

Solution: Here the modal class is the class 20 - 25.
That is, l = 20, c = 5, f0 = 9, f1 = 16 and f2 = 12

Mode $= l + \frac{c\,(f_1 - f_0)}{2f_1 - f_0 - f_2} = 20 + \frac{5\,(16 - 9)}{32 - 9 - 12} = 20 + \frac{35}{11} = 23.18$

Since the class following the modal class has a higher frequency than the class preceding it (12 against 9), the mode lies in the upper half of the modal class.
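
The modal-class interpolation can be sketched in Python as follows. The sketch uses the standard interpolation formula Mode = l + c(f1 - f0)/(2f1 - f0 - f2), in which the numerator is the excess of the modal frequency over the preceding class; the function name `grouped_mode` is an illustrative assumption.

```python
def grouped_mode(classes, freqs):
    """Mode of a grouped table: l + c * (f1 - f0) / (2*f1 - f0 - f2)."""
    k = max(range(len(freqs)), key=lambda i: freqs[i])   # index of the modal class
    lower, upper = classes[k]
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                    # preceding class, 0 if none
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0       # succeeding class, 0 if none
    # Note: the denominator is zero only if all three frequencies are equal.
    return lower + (upper - lower) * (f1 - f0) / (2 * f1 - f0 - f2)

classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45)]
freqs = [3, 9, 16, 12, 7, 5, 2]
print(round(grouped_mode(classes, freqs), 2))  # 23.18
```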

4. Quartiles, Deciles and Percentiles

Median, as has been indicated, is a locational average, which divides the frequency
distribution into two equal parts. Quartiles, deciles and percentiles are not averages. They
are partition values, which divide the distribution into a certain number of equal parts.

Quartiles

Quartiles are the values which divide a frequency distribution into four equal
parts so that 25% of the data fall below the first quartile (Q1), 50% below the second
quartile (Q2), and 75% below the third quartile (Q3). The values of Q1 and Q3 can be
found in the same way as Q2 (the median). For raw data arranged in ascending order,
Q1 is the (n/4)th observation and Q3 is the (3n/4)th observation.

For a grouped table, $Q_1 = l_1 + \left(\frac{N}{4} - m_1\right)\frac{c_1}{f_1}$

where N is the total frequency, l1 is the lower limit of the first quartile class
(the class to which the (N/4)th observation belongs), m1 is the cumulative frequency of the
class preceding the first quartile class, f1 is the frequency of the first quartile class and c1
is the width of the first quartile class.

$Q_3 = l_3 + \left(\frac{3N}{4} - m_3\right)\frac{c_3}{f_3}$

where l3 is the lower limit of the third quartile class (the class to which the (3N/4)th
observation belongs), m3 is the cumulative frequency of the class preceding the third
quartile class, f3 is the frequency of the third quartile class and c3 is the width of the
third quartile class.

Deciles and Percentiles

Deciles are nine in number and divide the frequency distribution into 10 equal parts.
Percentiles are 99 in number and divide the frequency distribution into 100 equal parts.

Selecting the Most Appropriate Measure of Central Tendency

Generally speaking, in analyzing the distribution of a variable only one of the
possible measures of central tendency would be used. Its selection is largely a matter
of judgment based upon the kind of data, the aspect of the data to be examined, and
the research question. Some of the points that must be considered are the following.

Central tendency for interval data is generally represented by the A.M., which takes
into account the available information about distances between scores. For ranked
(ordinal) data, the median is generally most appropriate, and for nominal data, the
mode.

If the distribution is badly skewed, one may prefer the median to the mean, because
the median would not be affected as much by unusual extreme scores. For this reason,
for example, the median income of people is usually reported rather than the A.M.
If one is interested in prediction, the mode is the best value to predict if an exact
score in a group has to be picked.
3A.4. MEASURES OF DISPERSION

So far we have discussed averages as sample values used to represent data. But the average
cannot describe the data completely.
Consider two sets of data:
Set I:   5, 10, 15, 20, 25
Set II:  15, 15, 15, 15, 15
Here we observe that both sets have the same mean, 15. But in set I, the observations are
more scattered about the mean. This shows that, even though they have the same mean, the
two sets differ. This reveals the necessity to introduce measures of dispersion.

A measure of dispersion is defined as a mean of the scatter of observations from an average.

Commonly used measures of dispersion are Range, Mean deviation, Standard deviation,
and quartile deviation.

1. Range

Range of a set of observations is the difference between the largest and the smallest
observations. In the case of grouped frequency table, range is the difference between the
upper bound of last class and the lower bound of the first class.

Example 1: The range of the set of data 9, 12, 25, 42, 45, 62, 65 is 65 – 9 = 56

Range is the simplest measure of dispersion but its demerit is that it depends only on the
extreme values.

2. Mean deviation about the Mean:

You have seen that range is a measure of dispersion, which does not depend on all
observations. Let us think about another measure of dispersion, which will depend on all
observations.

One measure of dispersion that you may suggest now is the sum of the deviations of
observations from mean. But we know that the sum of deviations of observations from the
A.M is always zero. So we cannot take the sum of deviations of observations from the
mean as a measure.

One method to overcome this is to take the sum of absolute values of these deviations. But
if we have two sets with different numbers of observations this cannot be justified. To make
it meaningful we will take the average of the absolute deviations. Thus mean deviation
(MD) about the mean is the mean of the absolute deviations of observations from arithmetic
mean.
If x1, x2, …, xn are n observations, then, MD $= \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$

Example 2: Find the MD for the following data 12, 15, 21, 24, 28
Solution:

$\bar{X} = \frac{12 + 15 + 21 + 24 + 28}{5} = \frac{100}{5} = 20$

x        |x - x̄|
12       8
15       5
21       1
24       4
28       8
Total    26

MD $= \frac{26}{5} = 5.2$
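
The mean deviation can be computed in Python with a short function. This is a sketch; the name `mean_deviation` and the optional `freqs` argument (used for frequency tables, with class mid-values as `values`) are illustrative assumptions.

```python
def mean_deviation(values, freqs=None):
    """Mean of absolute deviations from the arithmetic mean.

    If `freqs` is given, each value is weighted by its frequency.
    """
    if freqs is None:
        freqs = [1] * len(values)
    N = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / N
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / N

print(mean_deviation([12, 15, 21, 24, 28]))  # 5.2
```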

Mean deviation about mean for a frequency table

Let x1, x2, …, xn be the values and f1, f2, …, fn be the corresponding frequencies. Let N
be the sum of the frequencies. Then, MD $= \frac{1}{N}\sum_{i=1}^{n} |x_i - \bar{x}|\, f_i$
In the case of a grouped frequency table, take the mid-values as x-values and use the same
method given above.

Example 3: Find the mean deviation of the heights of 100 students given below:

Height in cm   frequency
160 - 162      5
163 - 165      18
166 - 168      42
169 - 171      27
172 - 174      8

Solution:

Height in cm   Mid-value (x)   Frequency (f)   fx       |x - x̄|   f|x - x̄|
160 - 162      161             5               805      6.45      32.25
163 - 165      164             18              2952     3.45      62.10
166 - 168      167             42              7014     0.45      18.90
169 - 171      170             27              4590     2.55      68.85
172 - 174      173             8               1384     5.55      44.40
Total                          100             16745              226.50

$\bar{X} = \frac{16745}{100} = 167.45$

MD $= \frac{1}{N}\sum_{i=1}^{n} |x_i - \bar{x}|\, f_i = \frac{226.5}{100} = 2.265$

3. Variance and Standard Deviation

When we take the deviations of the observations from their A.M., both positive and
negative values occur. For defining the mean deviation we took absolute values of the
deviations. Another method to avoid this problem is to take the squares of the deviations.
So, variance is the mean of the squares of deviations from the A.M. The positive square root of
the variance is called the standard deviation.

If x1, x2, …, xn are n observations, then the variance $= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ and the standard
deviation (SD) is defined as SD $= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

Example 4: Find the variance and standard deviation of the following data:
42, 39, 44, 40, 36, 39, 30, 46, 48, 36
Solution: Arithmetic mean $\bar{X} = \frac{400}{10} = 40$

$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{10}\left[(42 - 40)^2 + (39 - 40)^2 + \dots + (36 - 40)^2\right] = \frac{254}{10} = 25.4$

Variance = 25.4
S.D $= \sqrt{25.4} = 5.04$
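
The calculation can be reproduced in Python. This is a minimal sketch with illustrative function names; it uses the divisor n, as in the definition above (Python's standard `statistics.pvariance` and `statistics.pstdev` use the same divisor).

```python
import math

def population_variance(values):
    """Mean of squared deviations from the arithmetic mean (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def population_sd(values):
    """Positive square root of the variance."""
    return math.sqrt(population_variance(values))

data = [42, 39, 44, 40, 36, 39, 30, 46, 48, 36]
print(round(population_variance(data), 2))  # 25.4
print(round(population_sd(data), 2))        # 5.04
```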

Variance and Standard deviation for a frequency table

Let x1, x2, …, xn be the values and f1, f2, …, fn be the corresponding frequencies. Let N
be the sum of the frequencies. Then,

Variance $= \frac{1}{N}\sum_{i=1}^{n} (x_i - \bar{x})^2 f_i$ and

Standard deviation $= \sqrt{\frac{1}{N}\sum_{i=1}^{n} (x_i - \bar{x})^2 f_i}$

The above formula for variance can be expressed as, variance $= \frac{1}{N}\sum f_i x_i^2 - \bar{X}^2$
In the case of a grouped frequency table, take the mid-values as x-values and use the same
method given above.

Example 5: Find the variance and standard deviation of the following data:

class frequency
0 – 10 3
10 – 20 4
20 - 30 6
30 – 40 10
40 - 50 7
Solution:

class      mid-value (x)   frequency (f)   fx     fx²
0 - 10     5               3               15     75
10 - 20    15              4               60     900
20 - 30    25              6               150    3750
30 - 40    35              10              350    12250
40 - 50    45              7               315    14175
Total                      30              890    31150

Variance $= \frac{1}{N}\sum f_i x_i^2 - \bar{X}^2$

N = 30, $\bar{X} = \frac{890}{30} = 29.667$, $\sum f_i x_i^2 = 31150$

Variance $= \frac{31150}{30} - (29.667)^2 = 1038.33 - 880.11 = 158.22$

Standard deviation $= \sqrt{158.22} = 12.58$

Short-cut method to find standard deviation

If the values of x are very large, the calculation of SD becomes time consuming.
Let the mid-values of k classes be x1, x2, …, xk and f1, f2, …, fk be the corresponding
frequencies. We use the transformation $u_i = \frac{x_i - A}{C}$ for i = 1, 2, …, k.
Here A and C can be any two numbers (with C non-zero). But it is better to take A as a number among the
middle part of the mid-values. If all the classes are of equal width, C can be taken as the
class width.

Variance of the ui's, Var(u) $= \frac{1}{N}\sum f_i u_i^2 - \bar{u}^2$
Then the variance of the xi's, Var(x) $= C^2 \times$ Var(u)
That is, SD(x) $= C \times$ SD(u)

Example 6: Consider the problem in example 5, let us find out the SD using short-cut
method.
Solution:

class      mid-value (x)   u = (x - 25)/10   frequency (f)   fu    fu²
0 - 10     5               -2                3               -6    12
10 - 20    15              -1                4               -4    4
20 - 30    25              0                 6               0     0
30 - 40    35              1                 10              10    10
40 - 50    45              2                 7               14    28
Total                                        30              14    54

$\bar{u} = \frac{\sum f u}{N} = \frac{14}{30} = 0.4667$, $\sum f_i u_i^2 = 54$, N = 30

Variance(u) $= \frac{54}{30} - (0.4667)^2 = 1.8 - 0.2178 = 1.5822$

Variance(x) $= 10^2 \times 1.5822 = 158.22$

SD(x) $= \sqrt{158.22} = 12.58$
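
The direct formula and the short-cut method can be compared numerically. The Python sketch below uses illustrative function names and the data of Examples 5 and 6; carried at full precision, both routes give a variance of about 158.22, so any small difference between hand-worked answers comes only from rounding intermediate values.

```python
import math

def grouped_variance(mid_values, freqs):
    """Variance of a frequency table: sum(f * x^2) / N - mean^2."""
    N = sum(freqs)
    mean = sum(f * x for x, f in zip(mid_values, freqs)) / N
    return sum(f * x * x for x, f in zip(mid_values, freqs)) / N - mean ** 2

def coded_variance(mid_values, freqs, A, C):
    """Variance via u = (x - A) / C, using Var(x) = C^2 * Var(u)."""
    u = [(x - A) / C for x in mid_values]
    return C ** 2 * grouped_variance(u, freqs)

mids = [5, 15, 25, 35, 45]
freqs = [3, 4, 6, 10, 7]

direct = grouped_variance(mids, freqs)
shortcut = coded_variance(mids, freqs, A=25, C=10)
print(round(direct, 2), round(shortcut, 2))
print(round(math.sqrt(direct), 2))
```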

Combined Variance

If there are two sets of data consisting of n1 and n2 observations with $s_1^2$ and $s_2^2$ as their
respective variances, then the variance of the combined set consisting of n1 + n2
observations is

$S^2 = \frac{n_1(s_1^2 + d_1^2) + n_2(s_2^2 + d_2^2)}{n_1 + n_2}$

where d1 and d2 are the differences of the means $\bar{x}_1$ and $\bar{x}_2$, respectively, from the combined
mean $\bar{x}$.

Example 7: Find the combined standard deviation of two series A and B

                     Series A   Series B
Mean                 50         40
Standard deviation   5          6
No. of items         100        150

Solution:
Given $\bar{x}_1$ = 50 and $\bar{x}_2$ = 40, $s_1^2$ = 25 and $s_2^2$ = 36, n1 = 100 and n2 = 150

Combined mean $\bar{x} = \frac{100 \times 50 + 150 \times 40}{100 + 150} = 44$,

d1 = $\bar{x}_1 - \bar{x}$ = 50 - 44 = 6, and d2 = $\bar{x}_2 - \bar{x}$ = 40 - 44 = -4

Combined variance $= \frac{100(25 + 36) + 150(36 + 16)}{100 + 150} = \frac{13900}{250} = 55.6$

Therefore, combined SD $= \sqrt{55.6} = 7.46$
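
The combined mean and variance follow directly from the group summaries. The Python sketch below is illustrative; the function name `combine_groups` and its argument order are assumptions.

```python
import math

def combine_groups(n1, mean1, var1, n2, mean2, var2):
    """Return (combined mean, combined variance) of two groups."""
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    d1 = mean1 - mean            # deviation of group 1 mean from combined mean
    d2 = mean2 - mean            # deviation of group 2 mean from combined mean
    var = (n1 * (var1 + d1 ** 2) + n2 * (var2 + d2 ** 2)) / n
    return mean, var

# Series A: mean 50, SD 5, 100 items; Series B: mean 40, SD 6, 150 items.
mean, var = combine_groups(100, 50, 5 ** 2, 150, 40, 6 ** 2)
print(mean, round(var, 2), round(math.sqrt(var), 2))
```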

4. Quartile Deviation

Quartile deviation (semi inter-quartile range) is one-half of the difference between the
third quartile and the first quartile.

That is, Quartile deviation, Q.D $= \frac{Q_3 - Q_1}{2}$

Example 8: Estimate an appropriate measure of dispersion for the following data:


Income (Rs.)    No. of persons
Less than 50    54
50 - 70         100
70 - 90         140
90 - 110        300
110 - 130       230
130 - 150       125
Above 150       51
Total           1000

Solution:
Since the data has open ends, Q.D would be a suitable measure

Income (Rs.)    No. of persons (f)   Cumulative frequency
Less than 50    54                   54
50 - 70         100                  154
70 - 90         140                  294
90 - 110        300                  594
110 - 130       230                  824
130 - 150       125                  949
Above 150       51                   1000
Total           1000

$Q_1 = l_1 + \left(\frac{N}{4} - m_1\right)\frac{c_1}{f_1}$

$Q_3 = l_3 + \left(\frac{3N}{4} - m_3\right)\frac{c_3}{f_3}$

Here N = 1000, $\frac{N}{4}$ = 250, $\frac{3N}{4}$ = 750
The class 70 - 90 is the first quartile class and 110 - 130 is the third quartile class.

l1 = 70, m1 = 154, c1 = 20, f1 = 140

l3 = 110, m3 = 594, c3 = 20, f3 = 230

$Q_1 = 70 + (250 - 154) \times \frac{20}{140} = 83.71$

$Q_3 = 110 + (750 - 594) \times \frac{20}{230} = 123.57$

Q.D $= \frac{123.57 - 83.71}{2} = 19.93$ Rs.
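
The quartile interpolation generalizes the median formula, so one helper can serve for Q1, Q2 and Q3. The Python sketch below is illustrative: the name `grouped_quantile` and the use of `None` for a missing limit of an open-ended class are assumptions made here, and the function refuses to interpolate inside an open-ended class.

```python
def grouped_quantile(classes, freqs, p):
    """Value below which a fraction p of a continuous grouped table lies."""
    target = p * sum(freqs)      # position p*N, e.g. N/4 for Q1
    m = 0                        # cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        if f > 0 and m + f >= target:
            if lower is None or upper is None:
                raise ValueError("quantile falls in an open-ended class")
            return lower + (target - m) * (upper - lower) / f
        m += f
    raise ValueError("frequency table has no observations")

classes = [(None, 50), (50, 70), (70, 90), (90, 110),
           (110, 130), (130, 150), (150, None)]
freqs = [54, 100, 140, 300, 230, 125, 51]

q1 = grouped_quantile(classes, freqs, 0.25)
q3 = grouped_quantile(classes, freqs, 0.75)
print(round(q1, 2), round(q3, 2), round((q3 - q1) / 2, 2))
```

Because neither quartile falls in an open-ended class, the missing limits never enter the calculation, which is why the quartile deviation suits this kind of table.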

Relative Measures

The absolute measures of dispersion discussed above do not always facilitate comparison
of two or more data sets in terms of their variability. If the units of measurement of two or
more sets of data are the same, comparison between such sets of data is possible directly in
terms of absolute measures. But if the conditions for direct comparison are not met, the desired
comparison can be made in terms of relative measures.
Coefficient of Variation is a relative measure of dispersion which expresses the standard
deviation ($\sigma$) as a percentage of the mean. That is, Coefficient of variation, C.V $= (\sigma / \bar{x}) \times 100$.
Another relative measure, in terms of quartile deviation, is the Coefficient of quartile
deviation, defined as $Q_r = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100$.
Example 9: An analysis of the monthly wages paid to workers in two firms A and B,
belonging to the same industry, gives the following results:

Firm A Firm B
Number of workers 586 648
Average monthly wage 52.5 47.5
Standard deviation 10 11

In which firm, A or B, is there greater variability in individual wages?

Solution: Coefficient of variation for firm A $= \frac{10}{52.5} \times 100 = 19\%$

Coefficient of variation for firm B $= \frac{11}{47.5} \times 100 = 23\%$
There is greater variability in wages in firm B.
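
The comparison can be reproduced in Python. This is a minimal sketch; the function name `coefficient_of_variation` is an illustrative assumption.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

cv_a = coefficient_of_variation(10, 52.5)   # firm A
cv_b = coefficient_of_variation(11, 47.5)   # firm B
print(round(cv_a, 2), round(cv_b, 2))       # 19.05 23.16
print("B" if cv_b > cv_a else "A")          # firm with greater variability
```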

3A.5. SKEWNESS and KURTOSIS

1. Skewness

Very often it becomes necessary to have a measure that reveals the direction of dispersion
about the center of the distribution. Measures of dispersion indicate only the extent to
which individual values are scattered about an average. They do not give information
about the direction of scatter. Skewness refers to a departure from symmetry, that is,
a lack of symmetry in one direction.

If the frequency curve of a distribution has longer tail to the right of the center of the
distribution, then the distribution is said to be positively skewed. On the other hand, if the
distribution has a longer tail to the left of the center of the distribution, then distribution is
said to be negatively skewed. Measures of skewness indicate the magnitude as well as the
direction of skewness in a distribution.

Empirical Relationship between Mean, Median and Mode

The relationship between these three measures depends on the shape of the frequency
distribution. In a symmetrical distribution the value of the mean, median and the mode is
the same. But as the distribution deviates from symmetry and tends to become skewed, the
extreme values in the data start affecting the mean.

In a positively skewed distribution, the presence of exceptionally high values affects the
mean more than those of the median and the mode. Consequently the mean is highest,
followed, in a descending order, by the median and the mode. That is, for a positively
skewed distribution, Mean > Median> Mode. In a negatively skewed distribution, on the
other hand, the presence of exceptionally low values makes the values of the mean the
least, followed, in an ascending order, by the median and the mode. That is, for a negatively
skewed distribution, Mean < Median < Mode.

Empirically, if the number of observations in any set of data is large enough to make its
frequency distribution smooth and moderately skewed, then, Mean – Mode = 3(Mean –
Median)

Measures of Skewness

1. Karl Pearson's measure of skewness: Prof. Karl Pearson developed this
measure from the fact that when a distribution drifts away from symmetry,
its mean, median and mode tend to deviate from each other.
Karl Pearson's measure of skewness is defined as $Sk_P = \frac{\text{Mean} - \text{Mode}}{SD}$
2. Bowley's measure of skewness: Developed by Prof. Bowley, this measure
of skewness is derived from quartile values.
It is defined as $Sk_B = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$
3. Moment measure of skewness:
If x1, x2, …, xn are n observations, then the rth moment about the mean is defined
as $m_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$
The moment measure of skewness is defined as $\gamma_1 = m_3 / (SD)^3$
In a perfectly symmetrical distribution $\gamma_1 = 0$, and the larger the absolute value
of $\gamma_1$, the greater the degree of skewness; its sign gives the direction.

2. Kurtosis

Kurtosis refers to the degree of peakedness or flatness of the frequency curve.
If the curve is more peaked than the normal curve, the curve is said to be
leptokurtic. If the curve is flatter than the normal curve, the curve is said to be
platykurtic. The normal curve is also called mesokurtic. The moment measure
of kurtosis is $\beta_2 = \frac{m_4}{m_2^2}$. The value of $\beta_2$ is 3 if the distribution is normal; more than 3 if
the distribution is leptokurtic; and less than 3 if the distribution is platykurtic.

Example 1: Given m2(variance) = 40, m3 = -100. Find a measure of skewness.

Solution:
Moment measure of skewness, $\gamma_1 = \frac{m_3}{(SD)^3} = \frac{-100}{(\sqrt{40})^3} = -0.4$
Hence, there is negative skewness.

Example 2: The first four moments of a distribution about mean are 0, 2.5, 0.7, and
18.75. Comment on the Kurtosis of the distribution

Solution:
Moment measure of kurtosis is $\beta_2 = \frac{m_4}{m_2^2} = \frac{18.75}{2.5^2} = 3$

So, the curve is mesokurtic, that is, it has the same kurtosis as the normal curve.
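
The moment measures can be computed from raw data or from given moments. The Python sketch below is illustrative; the function names are assumptions, and the skewness measure is taken as m3/(SD)^3 with SD the square root of m2, as in the worked example above.

```python
def central_moment(values, r):
    """r-th moment about the mean: (1/n) * sum((x - mean)^r)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n

def skewness_from_moments(m2, m3):
    """Moment measure of skewness: m3 / SD^3, where SD = sqrt(m2)."""
    return m3 / m2 ** 1.5

def kurtosis_from_moments(m2, m4):
    """Moment measure of kurtosis: m4 / m2^2."""
    return m4 / m2 ** 2

print(round(skewness_from_moments(40, -100), 1))   # -0.4
print(kurtosis_from_moments(2.5, 18.75))           # 3.0

# For raw data, compute the moments first (hypothetical symmetric sample).
sample = [1, 2, 3, 4, 5]
m2, m3, m4 = (central_moment(sample, r) for r in (2, 3, 4))
print(skewness_from_moments(m2, m3), kurtosis_from_moments(m2, m4))
```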

3A.6. Exercises

1. Find the arithmetic mean, median, and mode of the following data: 38,
28,12, 18, 28, 44, 28, 19, 21.

2. Calculate the mean, median and mode of the following data:

Class:      10 - 20   20 - 30   30 - 40   40 - 50   50 - 60
Frequency:  25        52        73        40        10

3. From the following data of income distribution, calculate the AM. It is given
that i) the total income of persons in the highest group is Rs. 435, and ii)
none is earning less than Rs. 20.

Income (Rs)     No. of persons
Below 30        16
Below 40        36
Below 50        61
Below 60        76
Below 70        87
Below 80        95
80 and above    5

4. The mean of 20 values is 45. If one of these values is to be taken as 64 instead of
46, find the correct mean.

5. The mean yearly salary of employees of a company was Rs. 20,000. The
mean yearly salaries of male and female employees were Rs. 20,800 and
Rs. 16,800 respectively. Find out the percentage of males employed.

6. The average wage of 100 male workers is Rs. 80 and that of 50 female workers
is Rs. 75. Find the mean wage of workers in the company.

7. In the final examination of a course, the mark of the written examination is
weighted 3 times as much as the quiz, and a student has a final examination grade
of 85 and a quiz grade of 70. Find the mean grade.

8. Calculate the range of the prices of gold from Monday to Saturday of a
week.

Mon.   Tue.   Wed.   Thurs.   Fri.   Sat.
1200   1160   1214   1145     1187   1196

9. Compute the mean deviation about mean of the following data:
114, 108, 100, 96, 102, 108, 120, 121, 115, 112

10. Calculate the Mean deviation, Variance and Standard deviation of the
following data:

Class Frequency
10 – 15 3
15 – 20 7
20 – 25 16
25 – 30 12
30 – 35 9
35 – 40 5
40 - 45 2

11. Find the standard deviation of the values: 11, 18, 9, 17, 7, 6, 15, 6, 4, 1

12. Daily sales of a retail shop are given below:

Daily sales (Rs):  102  106  110  114  118  122  126
No. of days:       3    9    25   35   17   10   1

Calculate the mean and standard deviation of the above data and explain
what they indicate about the distribution of daily sales.

13. Goals scored by two teams A and B in a football season were as follows:
No. of goals scored:  0  1  2  3  4
No. of matches, A:    2  9  8  5  4
No. of matches, B:    1  7  6  5  3
Find which team may be considered more consistent.

14. The means of two samples of sizes 50 and 100 are 54.1 and 50.3 respectively,
and the standard deviations are 19 and 8. Find the mean and the
standard deviation of the combined sample.

15. Find the quartile deviation of the following data:

Class Frequency
< 15 5
15 – 20 12
20 – 25 22
25 – 30 31
30 – 35 19
35 – 40 9
>40 2

16. Find the skewness of the data 2, 3, 5, 8, 7, 6, 8, 7, 6, 5


17. Find the kurtosis of the data 7, 6, 9, 1, 0, 5, 5, 6, 5, 4

18. Find Karl Pearson’s measure of skewness for the following data:

Class Frequency
< 15 5
15 – 20 12
20 – 25 22
25 – 30 31
30 – 35 19
35 – 40 9
>40 2

1C. PROBABILITY

1C.1. Introduction

Each of us has some intuitive notion of what “probability” is. Everyday conversation is full
of references to it: “He is likely to win the game”, “He will probably be selected for the
job”. The use of the words ‘likely’ and ‘probably’ indicates that there is an element of uncertainty
about these statements. The theory of probability provides a numerical measure of the
element of uncertainty. It enables us to take decisions under uncertainty with a certain
amount of risk.

1C.2. Random Experiment

In science we come across phenomena which follow a certain pattern without fail. A stone
dropped from a cliff follows Newton’s laws of motion. But there are experiments whose
results cannot be predicted in advance.
A random experiment is an experiment which does not necessarily give the same result when it is
conducted repeatedly under homogeneous conditions.

Examples:
1. Tossing a coin and observing the face that turns up
2. Rolling a die and observing the face that turns up

Sample Space, Outcomes and Events

Set of all possible outcomes of a random experiment is called a sample space and
is usually denoted by S.

Examples:
1. Consider the random experiment of tossing a coin and observing the face
that turns up.
S = {H, T}, where H – Head, T – Tail
2. Rolling a die and observing the face that turns up.
S = {1, 2, 3, 4, 5, 6}
An outcome of the experiment is an element of S, which is also known as a sample point.

An event is any subset of the sample space. In the example of tossing a coin, H and T are
sample points, but ∅ (null event), {H}, {T}, {H, T} (sure event) are events. The event ∅ is
an impossible event because it can never occur. But the event {H, T} is a sure event, which
occurs in every trial. An event A is said to have occurred in a trial if the outcome is a
sample point which belongs to A.

The set consisting of exactly one sample point is called an elementary event. For example,
in the experiment of throwing a die, {1}, {2}, {3}, {4}, {5}, and {6} are elementary events,
but 1, 2, 3, 4, 5, and 6 are sample points. That is, elementary events are events, which
cannot be further split up. Events, which can be further split up are called compound events.
For example, {2, 4, 6} is a compound event.

1C.3. Algebra of Events

1. Event not A ( complement of A)

Corresponding to an event A, we can define another event which contains the
outcomes in the sample space that are not in A. It is called the complement of A and is
denoted by Ā, A', or A^c.
Example: In the random experiment of rolling a die and observing the number shown
up, let A = {2, 4, 6}. Then Ā = {1, 3, 5}. Here the event A is ‘even number shown
up’ and Ā is ‘odd number shown up’.

2. All events (intersection)

If A and B are two events in the same experiment, the event which represents the
simultaneous occurrence of A and B is A ∩ B.

Example: In a die rolling trial, let A be the event ‘a prime number happened’ and
B be the event ‘an odd number happened’. That is, A = {2, 3, 5} and B = {1, 3, 5}.
Then the event representing ‘the number happened is both prime and odd’ is A ∩
B = {3, 5}.

3. At least one among events (Union)

If A and B are two events in the same experiment, the event which gives at least
one among (A or B) is A ∪ B.
Example: In a die rolling trial, let A be the event ‘a prime number happened’ and
B be the event ‘an odd number happened’. That is, A = {2, 3, 5} and B = {1, 3, 5}.
Then the event at least one among (a prime number or an odd number) is A ∪ B = {1,
2, 3, 5}.

4. A and not B (difference)

If A and B are two events in the same experiment, the event which represents A
and not B is A ∩ B̄.

Example: In a die rolling trial, let A be the event ‘a prime number happened’ and
B be the event ‘an odd number happened’. That is, A = {2, 3, 5} and B = {1, 3, 5}.
Then the event representing ‘the number happened is prime but not odd’ is A ∩ B̄
= {2}.

5. Exactly one (symmetric difference)

If A and B are two events in the same experiment, the event which represents the
happening of exactly one is (A ∩ B̄) ∪ (Ā ∩ B).
Example: In a die rolling trial, let A be the event ‘a prime number happened’ and
B be the event ‘an odd number happened’. That is, A = {2, 3, 5} and B = {1, 3, 5}.
Then the event representing ‘exactly one among A and B’ is (A ∩ B̄) ∪ (Ā ∩ B) =
{2} ∪ {1} = {1, 2}.
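The operations of this section map directly onto set operations. As an illustrative sketch (the variable names are chosen here and are not part of the notes), the die-rolling examples can be reproduced with Python's built-in sets:

```python
# Sample space of a die roll and the two events used in the examples.
S = {1, 2, 3, 4, 5, 6}
A = {2, 3, 5}   # 'a prime number happened'
B = {1, 3, 5}   # 'an odd number happened'

not_A = S - A          # complement of A within S
both = A & B           # intersection: A and B together
at_least_one = A | B   # union: at least one of A, B
A_not_B = A - B        # difference: A and not B
exactly_one = A ^ B    # symmetric difference: exactly one of A, B

print(sorted(both))          # [3, 5]
print(sorted(at_least_one))  # [1, 2, 3, 5]
print(sorted(A_not_B))       # [2]
print(sorted(exactly_one))   # [1, 2]
```

Note that the complement has to be taken relative to S, since a Python set has no notion of a sample space on its own.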

1C.4. Mutually Exclusive Events ( Disjoint Events)

Two events are said to be disjoint if the occurrence of one event prevents the
occurrence of the other event. That is, if A and B are disjoint events, their simultaneous
occurrence is not possible. Therefore A ∩ B = ∅.
Example: In a die rolling trial, let A be the event ‘an even number happened’ and
B be the event ‘an odd number happened’. That is, A = {2, 4, 6} and B = {1, 3, 5}.
Since the occurrence of ‘an even number’ prevents the occurrence of ‘an odd
number’ in the same trial, the events A and B are mutually exclusive. See that A ∩
B = ∅.

1C.5. Exhaustive Events and Equally Likely Events

A list of elementary events of a random experiment is said to be exhaustive if their
union is the sample space. If every elementary event of a random experiment has
an equal chance of occurrence, then the elementary events are said to be equally
likely.
Example: In a die rolling trial, the events {1}, {2}, {3}, {4}, {5}, and {6} are
exhaustive events since their union is the sample space. Since there is no preference
for any one event over another, these events are also equally likely.

1C.6. Definitions of Probability

Generally speaking, probability is a measure of chance of happening of an uncertain
event. That is, probability is used to measure the uncertainty of an event. The value
of probability ranges between 0 and 1. If it is certain that an event will happen, then its
probability is 1, and if it is certain that the event will not happen, its
probability is 0.

There are three different conceptual approaches to the study of probability. They
are:
1. Classical approach.
2. Frequency approach.
3. Axiomatic approach.

1. Classical Definition

This is the earliest approach to the theory of probability. Laplace, the French
mathematician, gave this definition of probability. Using this definition, we can
determine the probability of an event even before the performance of a trial. So
classical probability is often called ‘a priori probability’.

Definition: If the elementary events of a random experiment with a finite sample
space are mutually exclusive, equally likely and exhaustive, then the
probability of an event A is the “ratio of the number of outcomes favourable to
A to the total number of possible outcomes”. That is, if an event A can occur in
‘m’ ways out of ‘n’ equally likely ways, then $P(A) = \dfrac{m}{n}$.
Note that the outcomes which result in the happening of a desired event are
called favourable outcomes.

Example 1: Consider the random experiment of tossing two coins and observing
the faces that turn up. The sample space is S = {(H,H), (H,T), (T,H), (T,T)}. Let A be
the event ‘getting at least one tail’. Then $P(A) = \dfrac{3}{4}$ (in three outcomes
there is at least one tail).
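The classical definition amounts to counting favourable outcomes in an enumerated sample space. A minimal sketch, assuming equally likely outcomes (the helper name classical_probability is hypothetical):

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, is_favourable):
    """P(A) = m / n: favourable outcomes over all equally likely outcomes."""
    m = sum(1 for outcome in sample_space if is_favourable(outcome))
    return Fraction(m, len(sample_space))

# Sample space for tossing two coins: (H,H), (H,T), (T,H), (T,T).
S = list(product("HT", repeat=2))

# Event A: at least one tail.
p_at_least_one_tail = classical_probability(S, lambda outcome: "T" in outcome)
print(p_at_least_one_tail)  # 3/4
```

Exact fractions are used so that the result matches the hand calculation without rounding.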

2. Frequency ( Empirical ) Definition

In many situations it is not possible to have equally likely events, on which
the classical definition of probability is based. In these situations, another approach
is to find the probability from past experience. That is, we may find
the probability on the basis of the relative frequency of the event in the past. However,
the relative frequency should always be estimated on the basis of a large number of
readings in the past. The larger the number of past readings, the greater will be the accuracy of
the result. Since in the relative frequency approach probabilities are calculated on the
basis of past experience, these probabilities are called a posteriori probabilities.

Definition: In the frequency approach, probability is defined as $P(A) = \lim_{n \to \infty} \dfrac{f}{n}$,
where f is the frequency of A and n is the number of trials.
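The limit in this definition cannot be computed exactly, but its behaviour can be illustrated by simulation. The sketch below assumes a fair coin simulated with Python's pseudo-random generator; the function names and the fixed seed are choices made here for repeatability, not part of the notes.

```python
import random

def relative_frequency(is_event, trial, n, seed=0):
    """Estimate P(A) as f / n, where f counts occurrences of A in n trials."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    f = sum(1 for _ in range(n) if is_event(trial(rng)))
    return f / n

def toss(rng):
    """One toss of a simulated fair coin."""
    return rng.choice("HT")

# The relative frequency of heads settles near 0.5 as n grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(lambda face: face == "H", toss, n))
```

For small n the estimate can be far from 0.5, which is why the definition insists on a large number of trials.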

3. Axiomatic Definition

In the classical and frequency definitions, probability is defined under certain
assumptions. There is another definition of probability, which we shall now discuss,
called the axiomatic definition of probability, where probability is defined as a
function whose domain is the class of events, taking values in the real line.

Definition: A function P from the class of events taking values in the real line is a
probability if it satisfies the following axioms:
Axiom 1  P(A) ≥ 0, for every event A
Axiom 2  P(S) = 1, where S is the sample space
Axiom 3  If A1, A2, … are disjoint events, then
         P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …

1C.7. Some Results in Probability

1. If A is an event and Ā its complement, then P(A) + P(Ā) = 1
2. For any event A, 0 ≤ P(A) ≤ 1
3. Addition Theorem on Probability
For any two events A and B, P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
If A and B are disjoint, then P(A ∪ B) = P(A) + P(B)

1C.8. Independence of Events


Two events A and B are said to be independent if and only if
P(A ∩ B) = P(A) × P(B)

1C.9. Permutations and Combinations

In an organization, a committee of 6 people has to elect 2 persons from amongst
themselves to fill the posts of president and secretary. For convenience, call the
committee members A, B, C, D, E and F. There are six ways of filling the
president’s position, as there are six members available for election. Once this post
has been filled, there are only five possible candidates left to choose from for the
post of secretary. By the fundamental principle of counting, the two posts can be
filled in 6 × 5 = 30 different ways. Suppose B is selected first and then E is
selected. That is, the order of selection is BE, where B is the president and E is the
secretary. If the order of selection were EB, E would be the president and B the
secretary. Since the two positions are different in hierarchy, the order in which the
two persons have been elected is important. Thus the number of arrangements of 6
persons taken 2 at a time is 30. Each such arrangement is a ‘permutation’ of 6 persons taken 2 at a
time. Permutations refer to the ways in which a set of objects can be
arranged ‘in order’.

Suppose we had to elect two vice-presidents. Now, we are interested in which two
members are elected and the order is of no consequence. For instance, announcing
that AC or CA have been elected makes no difference since they are in the same
hierarchy. So when two persons have been elected without regard to their
arrangement, then this ‘unordered’ selection is called a combination.

1. Permutation

The number of permutations of r objects out of ‘n’ distinguishable objects is
$\dfrac{n!}{(n-r)!}$ and is denoted by nPr or P(n, r).

Example 1. Four persons enter a railway compartment in which there are six seats.
In how many ways can they take their places?

Solution: The number of ways in which 6 seats can be occupied by 4 persons is the same as
the number of arrangements of 6 things taken 4 at a time.
Hence, the required number of ways = P(6, 4)
$= \dfrac{6!}{(6-4)!} = 6 \times 5 \times 4 \times 3 = 360$

Example 2. There are 6 students, of which 3 belong to the first year class, 2 belong
to the second year class and one is in the third year. In how many ways can they
stand in a line so that the students from the same class are together?

Solution: Let us consider the students from the same class as a group. Hence there
are 3 groups. The first group contains 3 students, the second contains 2 students
and the third contains one student. The three groups can be permuted in P(3, 3) ways.
Then within the first group, the 3 students can be permuted in P(3, 3) ways. Within the
second group, the students can be permuted in P(2, 2) ways. The single third-year
student can be placed in only P(1, 1) = 1 way.
So, the required number of arrangements = P(3, 3) × P(3, 3) × P(2, 2)
= 3! × 3! × 2! = 6 × 6 × 2
= 72

Example 3. In how many ways can a cricket team of 11 players choose a captain
and a vice captain from amongst themselves?

Solution: The number of possible ways is P(11, 2) = 11!/9! = 11 × 10 = 110

Result 1: The number of permutations of n distinguishable objects taken all at a
time is P(n, n) = n!

Result 2: The number of permutations of n objects taken r at a time, when each may
be repeated any number of times in any permutation, is given by $n^r$.
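The permutation counts in the examples above can be verified with Python's standard library, where math.perm(n, r) returns n!/(n-r)!. This is only a checking sketch; the variable names are illustrative, and the last line uses n = 6, r = 2 as a made-up instance of Result 2.

```python
from math import factorial, perm

seats = perm(6, 4)                                     # Example 1: P(6, 4)
line_up = factorial(3) * factorial(3) * factorial(2)   # Example 2: 3! x 3! x 2!
captains = perm(11, 2)                                 # Example 3: P(11, 2)

print(seats, line_up, captains)  # 360 72 110

# Result 1: P(n, n) = n!
assert perm(5, 5) == factorial(5)

# Result 2 (hypothetical instance): 6 objects, 2 places, repetition allowed.
with_repetition = 6 ** 2
print(with_repetition)  # 36
```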

2. Combination

A combination is an ‘unordered’ selection of objects. A selection of different objects
that can be made out of a number of objects, by taking some or all of them at a time,
is called a combination.

The number of combinations of ‘n’ objects taken ‘r’ at a time is $\dfrac{n!}{(n-r)!\,r!}$ and is
denoted by nCr or C(n, r).

The number of combinations of n things taken r at a time is the same as the number of
combinations of n things taken (n-r) at a time. That is, nCr = nC(n-r) (verify yourself).
There is only one way to select n things taken all at a time.

Example 1. In how many ways can 5 students be chosen out of 10 students?

Solution: Five students can be chosen out of 10 students in 10C5 = 10!/(5! 5!) = 252
ways.

Example 2. In how many ways can a selection of 5 books be made from 12 books (a)
when one specified book is never included (b) when one specified book is always
included?

Solution:
(a) Here remove the specified book and select 5 books from the remaining
11 books. It can be done in 11C5 different ways.
(b) First select the specified book, which is to be included always, and then
select 4 books from the remaining 11 books. It can be done in 11C4 ways.
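The combination counts above can be evaluated with math.comb(n, r), which returns n!/((n-r)! r!). The sketch below is a check of the two examples; the variable names are chosen here for illustration.

```python
from math import comb

students = comb(10, 5)          # Example 1: 10C5
never_included = comb(11, 5)    # Example 2(a): specified book left out
always_included = comb(11, 4)   # Example 2(b): specified book fixed, 4 more chosen

print(students, never_included, always_included)  # 252 462 330

# Every selection of 5 books from 12 either omits or contains the specified
# book, so the two cases together account for all 12C5 selections.
assert never_included + always_included == comb(12, 5)

# Symmetry property: nCr = nC(n-r).
assert comb(10, 3) == comb(10, 7)
```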

1C.10. Conditional Probability

We often face situations where the probability of an event A is influenced by the
information that another event B has already occurred. Probability being a
measure of chance, our assessment of the probability of an event will also change
if we know that another event has occurred. This reassessment of the probability of
one event, conditional on the occurrence or non-occurrence of another event, is
called conditional probability.

Let A and B be two events in a sample space. Then P(B/A) (read as
probability of B given A), the probability of the event B given that the event A
has occurred, is called the conditional probability of B given A.

It is defined as $P(B/A) = \dfrac{P(A \cap B)}{P(A)}$, if P(A) > 0,
and $P(A/B) = \dfrac{P(A \cap B)}{P(B)}$, if P(B) > 0.

Example: Suppose a card is selected at random from a pack of cards. The card
selected is an ace. What is the probability that the card selected is a red one?

Solution:
Let A be the event that the card selected is an ace and B be the event that
the card selected is a red one. The required probability is P(B/A).
By definition, $P(B/A) = \dfrac{P(A \cap B)}{P(A)}$
P(A) = Probability that the card selected is an ace $= \dfrac{4}{52} = \dfrac{1}{13}$ (since there are 4
aces in a pack of 52 cards).
P(A ∩ B) = Probability that the card selected is a red ace $= \dfrac{2}{52} = \dfrac{1}{26}$
Therefore $P(B/A) = \dfrac{1/26}{1/13} = \dfrac{1}{2}$
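The same conditional probability can be obtained by listing the pack and counting. This is a sketch under the usual assumption of a standard 52-card pack with hearts and diamonds as the red suits; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely cards

def prob(is_event):
    """Classical probability of an event over the whole pack."""
    return Fraction(sum(1 for card in deck if is_event(card)), len(deck))

def is_ace(card):
    return card[0] == "A"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

p_a = prob(is_ace)                                        # P(A)
p_a_and_b = prob(lambda card: is_ace(card) and is_red(card))  # P(A and B)
p_b_given_a = p_a_and_b / p_a                             # P(B/A)

print(p_a, p_a_and_b, p_b_given_a)  # 1/13 1/26 1/2
```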

1C.11. Multiplication Theorem on Probability

If A and B are two events in a sample space, then the multiplication theorem states
that P(A ∩ B) = P(A) P(B/A) if P(A) > 0, and
P(A ∩ B) = P(B) P(A/B) if P(B) > 0.

If two events A and B are independent, then P(B/A) = P(B) and P(A/B) = P(A).

Example: Two cards are drawn from a well-shuffled pack of cards. Find the
probability that they are both aces if the first card is (a) replaced (b) not replaced.

Solution:
Let A be the event that “ace selected on the first draw” and B be the event
that “ace selected on the second draw”.
Then we require P(A ∩ B). By the multiplication theorem, P(A ∩ B) = P(A) P(B/A).
(a) For the first draw, there are 4 aces in 52 cards, so $P(A) = \dfrac{4}{52}$.
The card is replaced before the second draw, so $P(B/A) = \dfrac{4}{52}$.
Therefore $P(A \cap B) = \dfrac{4}{52} \times \dfrac{4}{52} = \dfrac{1}{169}$
(b) If the card is not replaced after the first draw, there will be only 3 aces
among the 51 cards left for the second draw.
P(A) is the same as in the first case, but $P(B/A) = \dfrac{3}{51}$.
Therefore $P(A \cap B) = \dfrac{4}{52} \times \dfrac{3}{51} = \dfrac{1}{221}$
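As an independent check of the multiplication theorem, both cases can be counted by brute force over ordered pairs of cards. The sketch below assumes a pack reduced to 4 aces and 48 other cards, which is all that matters for this question; the names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

# Only "ace" versus "not ace" matters here: 4 aces and 48 other cards.
deck = ["A"] * 4 + ["x"] * 48

with_replacement = list(product(deck, repeat=2))    # 52 * 52 ordered pairs
without_replacement = list(permutations(deck, 2))   # 52 * 51 ordered pairs

def p_both_aces(pairs):
    """Fraction of equally likely ordered pairs in which both cards are aces."""
    hits = sum(1 for first, second in pairs if first == "A" and second == "A")
    return Fraction(hits, len(pairs))

p_replaced = p_both_aces(with_replacement)
p_not_replaced = p_both_aces(without_replacement)

print(p_replaced, p_not_replaced)  # 1/169 1/221
```

itertools.permutations treats the 52 list positions as distinct cards, so it models drawing without replacement even though the labels repeat.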

1C.12. Bayes’ Theorem

Bayes’ Theorem is used to revise the probability of an event when new information
is available. The idea of revising probabilities is used by all of us in daily life even
though we may not know anything about probability. For example, a person while
going out may start without taking a raincoat, but as soon as he comes out of his
home and sees a large mass of cloud in the sky he may decide to take a raincoat
with him. So, by Bayes’ theorem, we find the a posteriori probabilities.
Statement: Let B1, B2, …, Bn be ‘n’ mutually exclusive events whose union is the
sample space. If A is any event with P(A) > 0, then $P(B_i/A) = \dfrac{P(B_i)\,P(A/B_i)}{\sum_{j=1}^{n} P(B_j)\,P(A/B_j)}$

1C.13. Solved Problems

1. Write down the sample space of the random experiment of throwing two dice
simultaneously and observing the face numbers.

Solution: Sample space S is given by:

S = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6),
(2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6),
(4,1), (4,2), (4,3), (4,4), (4,5), (4,6),
(5,1), (5,2), (5,3), (5,4), (5,5), (5,6),
(6,1), (6,2), (6,3), (6,4), (6,5), (6,6)}

2. If a box contains 10 red and 6 blue balls, what is the probability that a ball drawn
at random is red? Find also the probability that the ball drawn is blue.

Solution: Number of red balls = 10
Number of blue balls = 6
Total number of balls = 16
By the classical definition, $P(\text{the ball drawn is red}) = \dfrac{10}{16} = \dfrac{5}{8}$
$P(\text{the ball drawn is blue}) = \dfrac{6}{16} = \dfrac{3}{8}$

3. A speaks the truth in 60% of cases and B in 70% of cases. In what percentage of cases
are they likely to contradict each other in stating the same fact?

Solution:
Contradiction takes place only if one of them speaks the truth and the other tells
a lie.
The probability that A speaks the truth $= \dfrac{60}{100} = 0.6$
The probability that A tells a lie = 1 – 0.6 = 0.4
The probability that B speaks the truth $= \dfrac{70}{100} = 0.7$
The probability that B tells a lie = 1 – 0.7 = 0.3
Since A and B speak independently, the probability that A speaks the truth and B
tells a lie = 0.6 × 0.3 = 0.18.
Similarly, the probability that A tells a lie and B speaks the truth = 0.4 × 0.7 = 0.28.
Thus the probability that A speaks the truth and B tells a lie, or A tells a lie and B
speaks the truth, = 0.18 + 0.28 = 0.46.
That is, in 46% of cases they contradict each other.

4. The odds against A speaking the truth are 4 : 6 while the odds in favour of B
speaking the truth are 7:3. (i) What is the probability that A and B contradict
each other in stating the same fact? (ii) If A and B agree on a statement, what
is the probability that this statement is true?

Solution:
The probability that A speaks the truth $= \dfrac{6}{10} = 0.6$
The probability that A tells a lie = 1 – 0.6 = 0.4
The probability that B speaks the truth $= \dfrac{7}{10} = 0.7$
The probability that B tells a lie = 1 – 0.7 = 0.3
(i) A and B will contradict each other if one of them tells a lie and
the other speaks the truth.
The required probability = 0.6 × 0.3 + 0.4 × 0.7
= 0.18 + 0.28
= 0.46
(ii) A and B agree on a statement if both tell a lie or both speak the truth.
Probability that both speak the truth = 0.6 × 0.7 = 0.42
Probability that both tell a lie = 0.4 × 0.3 = 0.12
Probability that both agree on a statement = 0.42 + 0.12
= 0.54
Required probability $= \dfrac{0.42}{0.54} = \dfrac{7}{9}$

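The arithmetic of problems 3 and 4 follows one pattern, sketched below with exact fractions. The function name is hypothetical, and the sketch keeps the solution's assumption that A and B speak independently.

```python
from fractions import Fraction

def contradiction_and_agreement(p_a, p_b):
    """Return (P(contradict), P(agree), P(statement true | agree)).

    p_a and p_b are the probabilities that A and B speak the truth;
    independence of the two speakers is assumed.
    """
    both_true = p_a * p_b
    both_false = (1 - p_a) * (1 - p_b)
    contradict = p_a * (1 - p_b) + (1 - p_a) * p_b
    agree = both_true + both_false
    return contradict, agree, both_true / agree

contradict, agree, true_given_agree = contradiction_and_agreement(
    Fraction(6, 10), Fraction(7, 10)
)
print(float(contradict))   # 0.46
print(float(agree))        # 0.54
print(true_given_agree)    # 7/9
```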
5. Three light bulbs are chosen at random from 15 bulbs of which 5 are defectives.
Find the probability that (i) none is defective (ii) exactly one is defective, (iii)
at least one is defective.

Solution:
There are 15C3 = 455 ways to choose 3 bulbs from 15 bulbs.
(i) Since there are 10 non-defective bulbs, there are 10C3 = 120 ways to choose
3 non-defective bulbs. Thus, $P(\text{none is defective}) = \dfrac{120}{455} \approx 0.26$
(ii) Since there are 5 defective bulbs, one defective bulb can be chosen in 5
different ways, and there are 10C2 = 45 different ways to choose 2 non-defective bulbs.
Hence, there are 5 × 45 = 225 ways to choose 3 bulbs of which exactly one
is defective. Thus, $P(\text{exactly one is defective}) = \dfrac{225}{455} \approx 0.49$
(iii) The event that at least one is defective is the complement of the event ‘none
is defective’. By (i), P(none is defective) = 0.26.
Hence, P(at least one is defective) = 1 – 0.26 = 0.74
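The three answers come from one counting formula: choose k of the defective bulbs and the rest from the non-defective ones. A minimal sketch, with a hypothetical helper p_defectives whose defaults are the figures of this problem:

```python
from fractions import Fraction
from math import comb

def p_defectives(k, drawn=3, defective=5, total=15):
    """P(exactly k defectives among `drawn` bulbs chosen without replacement)."""
    good = total - defective
    ways = comb(defective, k) * comb(good, drawn - k)
    return Fraction(ways, comb(total, drawn))

p_none = p_defectives(0)       # 120/455
p_one = p_defectives(1)        # 225/455
p_at_least_one = 1 - p_none    # complement of 'none is defective'

print(round(float(p_none), 2))          # 0.26
print(round(float(p_one), 2))           # 0.49
print(round(float(p_at_least_one), 2))  # 0.74
```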

6. A box contains 5 white and 7 black balls. If three balls are drawn at random,
what is the probability that one is white and two are black balls.

Solution:
One white ball can be chosen in 5 ways and 2 black balls can be chosen in 7C2 =
21 different ways. Also, 3 balls can be chosen in 12C3 = 220 different ways.
Thus, the required probability $= \dfrac{5 \times 21}{220} = \dfrac{21}{44}$

7. Box I contains 8 red and 7 blue balls. Another box II contains 6 red and 6
blue balls. One ball is selected at random from box I and transferred into
box II. Then one ball is drawn at random from box II. What is the probability
that it is a red ball?

Solution:
Let A be the event that the ball selected from box II is a red ball. Then
A can happen in the following ways: transfer a red ball from box I to box II and
then select a red ball from box II, or transfer a blue ball from box I to box II and
then select a red ball from box II.
P(transfer a red ball from box I to box II and then select a red ball from box
II) $= \dfrac{8}{15} \times \dfrac{7}{13} = \dfrac{56}{195}$
P(transfer a blue ball from box I to box II and then select a red ball from
box II) $= \dfrac{7}{15} \times \dfrac{6}{13} = \dfrac{42}{195}$
So, the required probability $= \dfrac{56}{195} + \dfrac{42}{195} = \dfrac{98}{195}$
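The solution splits the event according to the colour of the transferred ball and adds the two cases. The sketch below writes that step as a small function; the name p_red_after_transfer and its parameter names are assumptions made for illustration.

```python
from fractions import Fraction

def p_red_after_transfer(red1, blue1, red2, blue2):
    """P(red from box II) after moving one random ball from box I to box II."""
    n1 = red1 + blue1
    n2 = red2 + blue2 + 1  # box II holds one extra ball after the transfer
    via_red = Fraction(red1, n1) * Fraction(red2 + 1, n2)   # a red ball was moved
    via_blue = Fraction(blue1, n1) * Fraction(red2, n2)     # a blue ball was moved
    return via_red + via_blue

p_red = p_red_after_transfer(red1=8, blue1=7, red2=6, blue2=6)
print(p_red)  # 98/195
```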

8. If P(A) = 0.4, P(B) = 0.7, and P(A ∩ B) = 0.3, what is the probability that
A or B happens?

Solution:
By the addition theorem on probability, P(A or B) = P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
That is, P(A ∪ B) = 0.4 + 0.7 – 0.3 = 0.8

9. Given $P(A) = \dfrac{3}{8}$, $P(B) = \dfrac{5}{8}$ and $P(A \cup B) = \dfrac{3}{4}$, are A and B independent?

Solution:
Two events A and B are independent if P(A ∩ B) = P(A) × P(B).
By the addition theorem on probability, P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
So, P(A ∩ B) = P(A) + P(B) – P(A ∪ B)
$= \dfrac{3}{8} + \dfrac{5}{8} - \dfrac{3}{4} = \dfrac{1}{4}$
$P(A) \times P(B) = \dfrac{3}{8} \times \dfrac{5}{8} = \dfrac{15}{64}$
Thus, P(A ∩ B) ≠ P(A) × P(B); hence A and B are not independent.

10. The probability that a contractor will get a contract for road construction is 0.5
and the probability that he will get a contract for the construction of water tank
is 0.7. What is the probability of getting at least one contract?

Solution:
Let A be the event of getting the contract for road construction and B be the event
of getting the contract for construction of the water tank.
By the addition theorem on probability,
P(at least one) = P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
Assuming that A and B are independent, P(A ∩ B) = P(A) × P(B)
= 0.5 × 0.7 = 0.35
Hence, P(A ∪ B) = 0.5 + 0.7 – 0.35 = 0.85

11. A company has two plants to manufacture scooters. Plant I manufactures 70%
of the scooters and plant II manufactures 30%. At plant I, 80% of scooters are
rated standard quality and at plant II, 90% of scooters are rated standard quality.
A scooter is selected at random and is found to be of standard quality. What is
the chance that it has come from (a) plant I (b) plant II.

Solution:
Let A be the event ‘scooter selected is of standard quality’.
Let B1 be the event ‘scooter manufactured at plant I’ and B2 be the event ‘scooter
manufactured at plant II’.
P(B1) = 0.7, P(B2) = 0.3, P(A/B1) = 0.8, and P(A/B2) = 0.9
(a) Required probability $= P(B_1/A) = \dfrac{P(B_1)\,P(A/B_1)}{P(B_1)\,P(A/B_1) + P(B_2)\,P(A/B_2)}$
$= \dfrac{0.7 \times 0.8}{0.7 \times 0.8 + 0.3 \times 0.9} = \dfrac{56}{83}$
(b) Required probability $= P(B_2/A) = \dfrac{P(B_2)\,P(A/B_2)}{P(B_1)\,P(A/B_1) + P(B_2)\,P(A/B_2)}$
$= \dfrac{0.3 \times 0.9}{0.7 \times 0.8 + 0.3 \times 0.9} = \dfrac{27}{83}$

12. A box X contains 2 white and 3 red balls. Another box Y contains 4 white and
5 red balls. One ball is drawn at random from one of the boxes and is found to
be red. Find the probability that it was drawn from box Y.

Solution:
Let A be the event ‘the ball drawn is red’, B1 be the event ‘box X has been chosen’,
and B2 be the event ‘box Y has been chosen’.
Required probability is $P(B_2/A) = \dfrac{P(B_2)\,P(A/B_2)}{P(B_1)\,P(A/B_1) + P(B_2)\,P(A/B_2)}$
$P(B_1) = \dfrac{1}{2}$, $P(B_2) = \dfrac{1}{2}$, $P(A/B_1) = \dfrac{3}{5}$, and $P(A/B_2) = \dfrac{5}{9}$
$P(B_2/A) = \dfrac{\frac{1}{2} \times \frac{5}{9}}{\frac{1}{2} \times \frac{3}{5} + \frac{1}{2} \times \frac{5}{9}} = \dfrac{25}{52}$
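Problems 11 and 12 apply Bayes' theorem in the same way: multiply each prior by its likelihood and divide by the sum of these products. A minimal sketch, with a hypothetical helper bayes_posteriors and exact fractions so the results match the hand calculations:

```python
from fractions import Fraction

def bayes_posteriors(priors, likelihoods):
    """Return P(B_i / A) for each i, given P(B_i) and P(A / B_i).

    The B_i are assumed mutually exclusive and exhaustive, as in the
    statement of the theorem.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # P(A), the denominator of Bayes' formula
    return [j / total for j in joint]

# Problem 11: plants I and II, standard-quality scooters.
scooters = bayes_posteriors(
    [Fraction(7, 10), Fraction(3, 10)],
    [Fraction(8, 10), Fraction(9, 10)],
)
print(scooters[0], scooters[1])  # 56/83 27/83

# Problem 12: boxes X and Y chosen with equal chance, red ball drawn.
boxes = bayes_posteriors(
    [Fraction(1, 2), Fraction(1, 2)],
    [Fraction(3, 5), Fraction(5, 9)],
)
print(boxes[1])  # 25/52
```

The posteriors always add up to 1, which is a quick check on any hand calculation.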

1C.14. Exercises

1. Define random experiment and sample space.


3. What do you mean by elementary events? Give two examples.
4. Write down the sample space for the random experiment, tossing a coin until
the first head occurs.
5. State the classical definition of probability. What are the limitations of this
definition?
6. What are the axioms of probability?
7. A class consists of 6 girls and 10 boys. If a committee of 3 is chosen at random
from the class, find the probability that:
(i) 3 boys are selected
(ii) exactly 2 boys are selected
(iii) exactly 3 girls are selected
(iv) At least 2 girls are selected
8. The probability that a boy will get a scholarship is 0.9, and a girl will get is 0.8.
What is the probability that at least one of them will get the scholarship?
9. Five men in a company of 20 are graduates. If 3 men are picked out of the 20
at random, what is the probability that they are all graduates? What is the
probability of at least one graduate?
10. A card is drawn at random from a well-shuffled pack of cards. What is the
probability that it is a heart or a queen?
11. A candidate is interviewed for 3 posts. For the first post there are 3 candidates,
for the second there are 4, and for the third there are 2. What are the chances
for his getting at least one post?
12. An urn contains 8 white and 3 red balls. If two balls are drawn at random find
the probability that (i) both are white (ii) both are red (iii) one is of each colour.
13. A can solve 80% of the problems given in a statistics book and B can solve 60%.
What is the probability that at least one of them solves a problem selected at
random?
14. If P(A) =0.5, P(B) = 0.3, and P(AB) = 0.2, obtain the probability that:
i) A occurs but not B
ii) At least one of A and B occurs
iii) Neither of A and B occurs
15. What is the probability that a leap year selected at random will contain 53
Sundays?
16. The probabilities that a husband and wife will be alive 20 years from now are
0.8 and 0.9 respectively. Find the probability that in 20 years (a) both alive (b)
neither alive (c) at least one alive.
17. The probability of hitting a target is $\dfrac{1}{3}$ for A and $\dfrac{1}{2}$ for B. If both fire at the
same target, find the probability that at least one of them hits the target.
18. A bag contains 6 black and 3 white balls. Another bag contains 5 black and 4
white balls. If a ball is drawn from each bag, find the probability that these two
balls are of the same colour.
19. The odds that A speaks truth are 3:2 and the odds that B does so are 5:3. In what
percentage of cases are they likely to contradict each other?

20. On the average 20% of persons going to a handicrafts emporium are foreigners
and the remaining 80% are local persons. 75% of such foreigners and 50% of
such local persons are found to make purchases. If a bundle of purchased items
is sent to cash counter, what is the probability that the purchaser is a foreigner?
21. In an examination 30% of the students have failed in Mathematics, 20% of the
students have failed in Chemistry and 10% have failed in both Mathematics and Chemistry. A
student is selected at random.
(i) What is the probability that the student has failed in Mathematics
when it is known that he has failed in Chemistry?
(ii) What is the probability that the student selected at random has failed
either in Mathematics or in Chemistry?

22. Two urns I and II contain 3 white, 7 black balls and 5 white, 7 black balls
respectively. A ball is transferred from urn I to urn II. Then a ball is drawn at
random from urn II and it is found black. What is the probability that the
transferred ball has been a black ball?
23. Urn I contains 4 white and 5 black balls. Urn II contains 5 white and 8 black
balls. A ball is transferred from urn I to urn II, then a ball is drawn from urn II.
Find the probability that it is white?
24. Box I contains 3 red and 2 blue marbles while box II contains 2 red and 8 blue
marbles. A fair coin is tossed. If the coin turns up head, a marble is chosen from
box I; if it turns up a tail, a marble is chosen from box II. Find the probability
that a red marble is chosen?
25. A box contains 5 red and 4 white marbles. Two marbles are drawn successively
from the box without replacement and it is noted that the second one is white.
What is the probability that the first is also white?
26. A manufacturing company produces steel pipes in three plants with daily
production volume of 500, 1000, and 2000 units respectively. According to past
experience it is known that the fraction of defective outputs produced by the
three plants are respectively 0.005, 0.008, and 0.010. A pipe is selected at
random from a day’s production and found to be defective. Find the
probability that it came from the first plant.
27. A company produces a product through three machines A, B, and C. Machine
A produces 45% of the product, B produces 35% of the product and C produces
20%. From past experience it is known that 4% of the items produced by
machine A are defective, 3% of the items produced by B are defective and 1% of
the items produced by C are defective. An item selected at random is found to be
defective. What is the probability that it was produced by machine B?
28. A die is thrown twice and the sum of the numbers appearing is observed to be
6. What is the probability that the number 5 has appeared at least once?

3B. STATISTICAL INFERENCE

3B.1. Sampling Distributions

Suppose we wish to draw conclusions about a characteristic of a population. We draw a
random sample of size n and take measurements of the characteristic which we are
interested in studying. Let the sample values be x1, x2, x3, …, xn. Then any quantity which can
be determined as a function of the sample values x1, x2, x3, …, xn is called a statistic. Since
the sample values are the results of random selections, a statistic is a random variable.
Therefore, a statistic has a probability distribution, known as its sampling distribution.
The standard deviation of the sampling distribution is called the standard error.

The process of inferring certain facts about a population based on a sample is known as
statistical inference. Sample statistics and their distributions are the basis of all inferences
drawn about the population.

Sampling Distribution of the Sample Mean

Suppose we have a sample of size n from a population. Let x1, x2, x3, …, xn be the values
of the characteristic under study corresponding to the selected units. Then the sample mean
$\bar{X}$ is defined as $\bar{X} = \dfrac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$.
If we draw another sample of size n from the same population, we may end up with a
different set of sample values and so a different sample mean. Thus the value of the sample
mean is determined by chance causes. The distribution of the sample mean is called
sampling distribution of the sample mean.

Distribution of Sample mean

1. Distribution of sample mean of sample taken from any infinite population

If x1, x2, x3, …, xn constitute a random sample from an infinite population having the mean
μ and variance σ², then the distribution of the sample mean will be approximately normal with mean μ and
variance $\dfrac{\sigma^2}{n}$, when n is large.

2. Distribution of sample mean of sample taken from the normal population


If $\bar{X}$ is the mean of a random sample of size n from a normal population with mean μ
and variance σ², its sampling distribution is a normal distribution with mean μ and
variance σ²/n.

Example 1: A random sample of size 100 is taken from a normal population with σ = 25.
What is the probability that the mean of the sample will exceed the mean of the
population by at least 3?

Solution: Let μ be the population mean and $\bar{x}$ the sample mean. Given that n = 100,
σ = 25.

Required probability = $P(\bar{x} - \mu > 3)$
$$= P\left(\frac{\bar{x} - \mu}{\sigma}\sqrt{n} > \frac{3}{\sigma}\sqrt{n}\right)$$
= P(z > 1.2)
= 0.1151 (from the N(0,1) table, since z ~ N(0,1))

Example 2: A random sample of size 64 is taken from an infinite population with mean
22 and variance 196. What is the probability that the mean of the sample will be greater
than 23?

Solution: Given n = 64, μ = 22, σ = 14. Let $\bar{x}$ be the sample mean.
We have to find $P(\bar{x} > 23)$.

$$P(\bar{x} > 23) = P\left(\frac{\bar{x} - 22}{14}\sqrt{64} > \frac{23 - 22}{14}\sqrt{64}\right)$$
$$= P\left(z > \frac{8}{14}\right) = P(z > 0.57) = 0.2843$$
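
The two examples above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name prob_mean_exceeds is an illustrative choice, not part of the notes. It keeps z unrounded, so the result for Example 2 differs slightly from the table value obtained with z = 0.57.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_exceeds(threshold, mu, sigma, n):
    """P(sample mean > threshold) when the sample mean is N(mu, sigma^2 / n)."""
    z = (threshold - mu) * sqrt(n) / sigma
    return 1.0 - NormalDist().cdf(z)


# Example 1: sigma = 25, n = 100; the sample mean exceeds mu by at least 3.
# Only the difference (x_bar - mu) matters, so mu is set to 0 here.
p1 = prob_mean_exceeds(threshold=3, mu=0, sigma=25, n=100)

# Example 2: mu = 22, sigma = 14, n = 64; P(sample mean > 23).
p2 = prob_mean_exceeds(threshold=23, mu=22, sigma=14, n=64)

print(round(p1, 4), round(p2, 4))
```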

Some Uses of Sampling distribution of Mean

1. To test the mean of a normal population when the population standard deviation is
known.
2. To test the mean of any population when the sample size is large (usually n > 30).
3. To test the equality of means of two populations when the sample sizes are large.
4. To test the equality of means of two normal populations when the population standard
deviations are known.
5. To find the confidence interval for a population mean, or for the difference of the
means of two populations (in both cases the sample sizes are large).

The Chi- Square Distribution

If a random variable X has the standard normal distribution, then the distribution of X²
is called the chi-square (χ²) distribution with one degree of freedom. This distribution
is quite different from a normal distribution because X², being a square term, can assume
only non-negative values. The probability curve of χ² is higher near 0, because most
of the x-values are close to 0 in a standard normal distribution.

If X1, X2, …, Xn are independent standard normal variables, then X1² + X2² + … + Xn² has
the χ² distribution with n degrees of freedom. Here n is the only parameter.

χ² – table

Since the χ²-distribution arises in many important applications, especially in statistical
inference, integrals of its density have been tabulated. The table gives the value χ²α,n
such that the probability that χ² is greater than χ²α,n is equal to α, for α = 0.005, 0.01,
0.025, 0.05 etc. and n = 1, 2, 3, … . That is, the table gives χ²α,n with P(χ² > χ²α,n) = α.

Some Uses of Chi – Square Distribution


1. To test the variance of a normal population.
2. To test the independence of two attributes.
3. To test the homogeneity of two attributes.
4. To find the confidence interval for the variance of a normal population.
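
The construction of the chi-square distribution as a sum of squared independent standard normal variables can be illustrated by simulation. This is a rough sketch with a fixed seed; the helper name chi_square_draw and the choice n = 5 are assumptions made for illustration. It uses the standard fact, not stated in the notes, that a chi-square variable with n degrees of freedom has mean n.

```python
import random


def chi_square_draw(n, rng):
    """One draw from chi-square with n d.f.: sum of n squared N(0, 1) values."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5                   # degrees of freedom (illustrative choice)
draws = [chi_square_draw(n, rng) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)

# Every draw is non-negative, and the average should be close to n.
print(min(draws) >= 0, round(mean_draw, 2))
```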

The Student – t Distribution

If X and Y are two independent random variables, where X has the standard normal
distribution and Y has a chi-square distribution with n degrees of freedom, then the
distribution of the statistic
$$t = \frac{X}{\sqrt{Y/n}}$$
is called the Student t-distribution with n degrees of freedom. The t-distribution was
first obtained by W.S. Gosset, who is known under the pen name ‘Student’.

An example of a t-statistic is
$$t = \frac{\bar{x} - \mu}{s}\sqrt{n},$$
which follows the t-distribution with (n − 1) degrees of freedom, where $\bar{x}$ and s are
the mean and standard deviation of a random sample of size n from a normal population
with mean μ and variance σ². Here s is computed with divisor n − 1; if the divisor n is
used instead, as in the small sample tests of Unit 5, √n is replaced by √(n − 1).

Student ‘t’ table

The Student t-table has many applications in statistical inference. The t-table gives the
values tα,n for α = 0.25, 0.125, 0.10, 0.05 etc. and n = 1, 2, 3, …, where tα,n is such that
the area to its right under the curve of the t-distribution with n degrees of freedom is
equal to α. That is, tα,n is such that P(t > tα,n) = α. Also note that the t-distribution is
a symmetric distribution.

Some Uses of t-distribution

1. To test the mean of a normal population when the sample size is small and
population variance is unknown.
2. To test the equality of means of two normal populations when the sample sizes are
small and population variances are unknown but same.
3. To test whether the correlation coefficient is zero.
4. To find the confidence interval of mean of normal population when sample size is
small and population variance is unknown.

The F- Distribution

If U and V are independent random variables having chi-square distributions with m and n
degrees of freedom, then the distribution of
$$F = \frac{U/m}{V/n}$$
is called the F-distribution with m and n degrees of freedom.

For example, if S1² and S2² are the variances of independent random samples of sizes m
and n from normal populations with variances σ1² and σ2², then
$$F = \frac{S_1^2\,\sigma_2^2}{S_2^2\,\sigma_1^2}$$
has an F-distribution with m − 1 and n − 1 degrees of freedom.

Table of F-distribution
The table of the F-distribution gives the values Fα;m,n for α = 0.05 and 0.01 for various
values of m and n, where Fα;m,n is such that the area to its right under the curve of the
F-distribution with m, n degrees of freedom is equal to α.

That is, Fα;m,n is such that P(F > Fα;m,n) = α.

Some Uses of F-distribution

1. To test the equality of variances of two normal populations.


2. F-distribution is used in analysis of variance.

3B.2. Estimation of Parameters

The problem of estimation is that of finding, as precisely as possible, a value for an
unknown population parameter that we cannot observe directly. Managers deal with
this problem frequently, and they make quick estimates too. Since our estimates are
based only on a sample, they are not likely to be exactly equal to the value we
are looking for. Still, we will be able to obtain estimates whose possible values lie
around the true but unknown value. The difference between the true value and the
estimate is the error in estimation.

There are two types of estimates: 1. Point Estimate 2. Interval Estimate

If an estimate of a population parameter is given by a single value, then the estimate is
called a point estimate of the parameter. If instead the estimate is given by two distinct
numbers between which the parameter may be considered to lie, then the
estimate is called an interval estimate of the parameter.

A function T of the sample values used for estimating a parameter θ is called an
estimator, and its value for a given sample is known as an estimate.

Required Properties of an Estimator

1. Unbiasedness: An estimator must be an unbiased estimator of the parameter.
That is, an estimator T is said to be unbiased for a parameter θ if E(T) = θ.
2. Efficiency: Efficiency refers to the size of the standard error of the estimator.
That is, an estimator T1 is said to be more efficient than another estimator T2 if
the standard error of T1 is less than the standard error of T2.
3. Consistency: As the sample size increases, the value of the estimator must get
close to the parameter.
4. Sufficiency: An estimator T is said to be sufficient for a parameter θ if T contains
all the information which the sample furnishes about θ.

Some Point Estimators

1. The sample mean $\bar{X}$ is a point estimator of the population mean μ.
2. The sample proportion is a point estimator of the population proportion.
3. The sample variance is a point estimator of the population variance.

3B.3. Testing Hypotheses

Statistical testing, or testing hypotheses, is one of the most important aspects of the
theory of decision-making. Testing hypotheses consists of decision rules required for
drawing probabilistic inferences about the population parameters.
Definition: A Statistical Hypothesis is a statement concerning a probability distribution
or population parameters. The process by which a decision is arrived at on whether or
not a hypothesis is true is called Testing of Hypothesis.
For example, the statements "the mean of a normal population is 30" and "the variance
of a population is greater than 12" are statistical hypotheses.

Null Hypothesis and Alternate Hypothesis

The hypothesis under test is known as the null hypothesis and the hypothesis that will
be accepted when the null hypothesis is rejected is known as the alternate hypothesis.
The null hypothesis is usually denoted by H0 and the alternate hypothesis by H1. For
example, if the population mean is represented by μ, we can set up our hypotheses as
follows: H0: μ ≤ 30; H1: μ > 30.

The following are the steps in testing a statistical hypothesis. We draw a sample from
the population concerned and choose the appropriate test statistic. A test statistic is
a statistic on whose value we base the decision to reject or accept a hypothesis.
We divide the sample space of the test statistic into two regions, namely the rejection
region and the acceptance region. (The set of sample points which lead to the rejection
of the null hypothesis is called the Critical Region or Rejection Region.) We then
calculate the value of the test statistic for our sampled data. If this value falls in the
rejection region, we reject the hypothesis; otherwise we accept it.

Type I Error and Type II Error

Since we have to depend on the sample, there is no way to know which of the two
hypotheses is actually true. The test procedure fixes a rejection region such that the
null hypothesis is rejected whenever the observed value of the test statistic falls in it.
The null hypothesis may be true, and yet the test procedure may reject it. This error
is known as the first kind of error. It is also possible that the null hypothesis is actually
false but the test accepts it. This error is known as the second kind of error. Thus, the
error committed in rejecting a true null hypothesis is called the type I error and the error
in accepting a false null hypothesis is called the type II error.

Significance Level

The probabilities of the two errors cannot be reduced simultaneously: if we enlarge
the rejection region, the probability of type I error increases, whereas shrinking the
rejection region increases the probability of type II error. The procedure usually
adopted is to keep the probability of type I error below a pre-assigned number and,
subject to this condition, minimize the probability of type II error. A pre-assigned
number α between 0 and 1 chosen as an upper bound of the probability of type I error
is called the level of significance.
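
The meaning of the level of significance can be illustrated by simulation: when H0 is true, a test carried out at level α should reject in roughly a fraction α of repeated samples. The sketch below is an illustration under assumed values (μ0 = 30, σ = 5, n = 25, a two-tailed z-test with known σ); none of these numbers come from the notes.

```python
import random
from math import sqrt
from statistics import NormalDist


def rejects_h0(sample, mu0, sigma, critical):
    """Two-tailed z-test of H0: mu = mu0 with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) * sqrt(n) / sigma
    return abs(z) > critical


rng = random.Random(1)  # fixed seed so the run is repeatable
mu0, sigma, n, alpha, trials = 30.0, 5.0, 25, 0.05, 20000
critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05

# Samples are generated with H0 true, so every rejection is a type I error.
rejections = sum(
    rejects_h0([rng.gauss(mu0, sigma) for _ in range(n)], mu0, sigma, critical)
    for _ in range(trials)
)
type1_rate = rejections / trials
print(type1_rate)  # should be close to alpha
```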

Two-tailed and One-tailed Tests

A test where the critical region is found to lie under one tail of the distribution of the
test statistic is called One-tailed test. In two-tailed tests the critical region lies under
both the tails of the distribution of the test statistic.
Example: Let μ be the mean of a population. Then,
1. H0: μ = 30; H1: μ ≠ 30 is a two-tailed test.
2. H0: μ = 30; H1: μ > 30 is a one-tailed test.

STATISTICAL METHODS

Objectives

After reading the unit, you will understand:

● different statistical tests


● concepts and methods of analysis of variance
● concepts of correlation and regression
● methods in correlation and regression and interpretation
● the meaning of some multivariate techniques

5A. TESTING OF HYPOTHESES-II

5A.1. Introduction

In Unit 3, we discussed the basic concepts of testing a statistical hypothesis. Here we
discuss some important statistical tests that are frequently used.

5A.2. Large Sample Tests

1. Large Sample Tests for Mean

Suppose we have to test the hypothesis that the population mean μ has a specified value
μ0. We formulate the null hypothesis H0: μ = μ0. The alternative hypothesis is one of:
1) H1: μ ≠ μ0, 2) H1: μ < μ0, or 3) H1: μ > μ0.

A random sample of size n (n > 30) is taken; let $\bar{x}$ be the sample mean. Since
n is large, the sampling distribution of $\bar{x}$ is approximately normal.

If H0 is true, the test statistic
$$z = \frac{\bar{x} - \mu_0}{\sigma}\sqrt{n}$$
has approximately the standard normal distribution.

Case i): When σ is Known

Use the above test statistic. The critical region for z, depending on the nature of H1
and the level of significance α, is given below:

Level of Significance α            0.10 (10%)    0.05 (5%)     0.01 (1%)
Critical Region for H1: μ ≠ μ0     |z| > 1.64    |z| > 1.96    |z| > 2.58
Critical Region for H1: μ < μ0     z < -1.28     z < -1.64     z < -2.33
Critical Region for H1: μ > μ0     z > 1.28      z > 1.64      z > 2.33
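
The critical values in the table come from the standard normal distribution and can be reproduced, to two decimals, with the Python standard library. This is a small sketch; the function name critical_values is an illustrative choice.

```python
from statistics import NormalDist


def critical_values(alpha):
    """Standard normal critical values for a test at significance level alpha."""
    std = NormalDist()
    return {
        "two_tailed": std.inv_cdf(1 - alpha / 2),  # reject when |z| exceeds this
        "one_tailed": std.inv_cdf(1 - alpha),      # reject when z (or -z) exceeds this
    }


for alpha in (0.10, 0.05, 0.01):
    cv = critical_values(alpha)
    print(alpha, round(cv["two_tailed"], 2), round(cv["one_tailed"], 2))
```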

Case ii): When σ is Unknown

If n > 30, use the sample variance
$$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
as an estimate of σ², and use the test statistic
$$z = \frac{\bar{x} - \mu_0}{s}\sqrt{n},$$
which approximately follows the standard normal distribution. Use the above critical
regions.

Example: A random sample of 100 tyres, drawn from a population of tyres with a
standard deviation of 1248 km, has a mean life of 15269 km. It is claimed that the mean
life of tyres is 15200 km. Test the validity of the claim.

Solution: Let X be the life of tyres.

Given that $\bar{x}$ = 15269, n = 100, σ = 1248, μ0 = 15200
H0: μ = 15200 against H1: μ ≠ 15200

Test statistic
$$z = \frac{\bar{x} - \mu_0}{\sigma}\sqrt{n} = \frac{15269 - 15200}{1248}\sqrt{100} = 0.55$$
Let α be 0.05; then the critical region is |z| > 1.96. Since |z| = 0.55 < 1.96, there is no
reason to reject H0. So the sample does not contradict the claim.

Example: The manufacturers of a small car claim that on average the car is driven
2000 km per month. A random sample of 100 owners of the car are asked to keep a
record of the kilometres they drive their cars. On the basis of these sample records, it
was found that on average the car was driven 2200 km per month with a standard
deviation of 600 km. Do the sample data support the hypothesis that the average
distance the car is driven has increased?

Solution: H0: μ = 2000 against H1: μ > 2000, where μ is the mean distance the car is
driven per month.

Given $\bar{x}$ = 2200, n = 100, s = 600, μ0 = 2000

Test statistic
$$z = \frac{\bar{x} - \mu_0}{s}\sqrt{n} = \frac{2200 - 2000}{600}\sqrt{100} = 3.33$$
Let α = 0.05. The critical region is z > 1.64.

Since z = 3.33 > 1.64, we reject H0. That is, the average distance a car is driven has
increased.
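
The arithmetic of the two large sample examples can be reproduced with a short script. This is a minimal sketch; the function name z_statistic is an illustrative choice, and the critical values 1.96 and 1.64 are taken from the table above.

```python
from math import sqrt


def z_statistic(sample_mean, mu0, sd, n):
    """Large sample test statistic z = (x_bar - mu0) * sqrt(n) / sd."""
    return (sample_mean - mu0) * sqrt(n) / sd


# Tyre example: sigma known, two-tailed test at alpha = 0.05.
z_tyres = z_statistic(15269, 15200, 1248, 100)
reject_tyres = abs(z_tyres) > 1.96

# Car example: sigma replaced by s, right-tailed test at alpha = 0.05.
z_cars = z_statistic(2200, 2000, 600, 100)
reject_cars = z_cars > 1.64

print(round(z_tyres, 2), reject_tyres)
print(round(z_cars, 2), reject_cars)
```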

2. Large Sample Test for Proportion



A random sample of size n (n > 30) has a sample proportion $\bar{p}$ of members possessing
a certain attribute (say, success). We wish to test the hypothesis that the proportion p
in the population has a specified value p0.

The null hypothesis is H0: p = p0.

The alternative hypothesis is 1) H1: p ≠ p0, 2) H1: p < p0, or 3) H1: p > p0.

The distribution of $\bar{p}$ is approximately normal. Also $\bar{p} = x/n$, where x is the
number of successes in a sample of size n.

The test statistic is
$$z = \frac{\bar{p} - p_0}{\sqrt{p_0 q_0 / n}} = \frac{x - n p_0}{\sqrt{n p_0 q_0}}, \quad \text{where } q_0 = 1 - p_0.$$
The critical regions are given below:

Level of Significance α           0.10 (10%)    0.05 (5%)     0.01 (1%)
Critical Region for H1: p ≠ p0    |z| > 1.64    |z| > 1.96    |z| > 2.58
Critical Region for H1: p < p0    z < -1.28     z < -1.64     z < -2.33
Critical Region for H1: p > p0    z > 1.28      z > 1.64      z > 2.33

Example: A manufacturer claimed that at least 95% of the equipment which he
supplied to a factory conformed to specifications. An examination of a sample of 200
pieces of equipment revealed that 18 were faulty. Test his claim at a significance level
of 0.05.

Solution: H0: p = 0.95 against H1: p < 0.95
Given x = 200 − 18 = 182, n = 200, p0 = 0.95, q0 = 1 − p0 = 1 − 0.95 = 0.05
Test statistic
$$z = \frac{x - n p_0}{\sqrt{n p_0 q_0}} = \frac{182 - 200 \times 0.95}{\sqrt{200 \times 0.95 \times 0.05}} = -2.596$$
The critical region is z < -1.64. Since z = -2.596 < -1.64, we reject H0. So the claim of
the manufacturer is not justified.
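
The proportion test in this example can be checked as follows. This is a minimal sketch; the function name z_proportion is an illustrative choice, and the critical value -1.64 is the left-tailed 5% value from the table above.

```python
from math import sqrt


def z_proportion(x, n, p0):
    """Test statistic z = (x - n*p0) / sqrt(n*p0*q0), with q0 = 1 - p0."""
    q0 = 1 - p0
    return (x - n * p0) / sqrt(n * p0 * q0)


# 200 pieces examined, 18 faulty, so 182 conform; claimed proportion is 0.95.
z_equipment = z_proportion(x=200 - 18, n=200, p0=0.95)
reject_claim = z_equipment < -1.64  # left-tailed test at alpha = 0.05

print(round(z_equipment, 3), reject_claim)
```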

3. Test for the Equality of Two Proportions

Suppose we have to test whether two population proportions p1 and p2 are equal. We
take a sample of size n1 from the first population and a sample of size n2 from the
second population. Let x1 units in the first sample and x2 units in the second sample
possess a particular attribute. Let $\bar{p}_1 = x_1/n_1$ and $\bar{p}_2 = x_2/n_2$ be the
respective sample proportions. The null hypothesis is H0: p1 = p2.
The alternative hypothesis is 1) H1: p1 ≠ p2, 2) H1: p1 < p2, or 3) H1: p1 > p2.
The test statistic is
$$z = \frac{\dfrac{x_1}{n_1} - \dfrac{x_2}{n_2}}{\sqrt{\bar{p}\,\bar{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \quad \text{where } \bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1\bar{p}_1 + n_2\bar{p}_2}{n_1 + n_2} \text{ and } \bar{q} = 1 - \bar{p}.$$
The critical regions are given below:

Level of Significance α            0.10 (10%)    0.05 (5%)     0.01 (1%)
Critical Region for H1: p1 ≠ p2    |z| > 1.64    |z| > 1.96    |z| > 2.58
Critical Region for H1: p1 < p2    z < -1.28     z < -1.64     z < -2.33
Critical Region for H1: p1 > p2    z > 1.28      z > 1.64      z > 2.33

Example: A sample survey of tax-payers belonging to the business class and the
professional class yielded the following results.

                             Business class    Professional class
Sample size                  200               100
Defaulters in tax payment    40                15

Test whether the defaulter rate is the same for the two classes.

Solution: Given n1 = 200, n2 = 100, x1 = 40, x2 = 15
H0: p1 = p2 against H1: p1 ≠ p2

$$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{40 + 15}{200 + 100} = 0.183, \quad \bar{q} = 1 - \bar{p} = 0.817$$

The test statistic is
$$z = \frac{\dfrac{x_1}{n_1} - \dfrac{x_2}{n_2}}{\sqrt{\bar{p}\,\bar{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{\dfrac{40}{200} - \dfrac{15}{100}}{\sqrt{0.183 \times 0.817\left(\dfrac{1}{200} + \dfrac{1}{100}\right)}} = 1.06$$

The critical region is |z| > 1.96 at significance level 0.05.

Since |z| = 1.06 < 1.96, we accept H0. That is, the data are consistent with the defaulter
rate being the same for the two classes.
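
The two-proportion statistic can be recomputed without rounding the pooled proportion. This is a minimal sketch; the function name z_two_proportions is an illustrative choice.

```python
from math import sqrt


def z_two_proportions(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2 using the pooled proportion."""
    p = (x1 + x2) / (n1 + n2)  # pooled estimate of the common proportion
    q = 1 - p
    return (x1 / n1 - x2 / n2) / sqrt(p * q * (1 / n1 + 1 / n2))


# Business class: 40 defaulters out of 200; professional class: 15 out of 100.
z_defaulters = z_two_proportions(40, 200, 15, 100)
reject_equal_rates = abs(z_defaulters) > 1.96  # two-tailed test at alpha = 0.05

print(round(z_defaulters, 3), reject_equal_rates)
```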

4. Test for the Equality of Means

1. When sample sizes are large and population variances are known

Let there be two independent populations with means μ1, μ2 and variances σ1² and
σ2² respectively. The null hypothesis is H0: μ1 = μ2. The alternative hypothesis is:
1) H1: μ1 ≠ μ2, 2) H1: μ1 < μ2, or 3) H1: μ1 > μ2.

We take a sample of size n1 from the first population and a sample of size n2 from the
second population. Let $\bar{x}_1$ and $\bar{x}_2$ be the sample means.
The test statistic is
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}},$$
which is approximately standard normal.
The critical regions are as follows:

Level of Significance α            0.10 (10%)    0.05 (5%)     0.01 (1%)
Critical Region for H1: μ1 ≠ μ2    |z| > 1.64    |z| > 1.96    |z| > 2.58
Critical Region for H1: μ1 < μ2    z < -1.28     z < -1.64     z < -2.33
Critical Region for H1: μ1 > μ2    z > 1.28      z > 1.64      z > 2.33

Example: A random sample of size 100 is taken from a population with mean μ1 and
variance 16, and a sample of size 50 is taken from another population with mean μ2 and
variance 25. The mean of the first sample is 40 and the mean of the second sample is
38. Test whether the samples are from populations with the same mean.

Solution: Given $\bar{x}_1$ = 40 and $\bar{x}_2$ = 38, n1 = 100 and n2 = 50, σ1² = 16 and σ2² = 25
H0: μ1 = μ2 against H1: μ1 ≠ μ2

The test statistic is
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} = \frac{40 - 38}{\sqrt{\dfrac{16}{100} + \dfrac{25}{50}}} = 2.46$$

The critical region is |z| > 1.96 at significance level 0.05.

Since |z| = 2.46 > 1.96, we reject H0. Thus the population means are not the same.
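
The same calculation in code, as a minimal sketch (the function name z_two_means is an illustrative choice):

```python
from math import sqrt


def z_two_means(mean1, mean2, var1, var2, n1, n2):
    """z statistic for H0: mu1 = mu2 with known (or large sample) variances."""
    return (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)


z_means = z_two_means(mean1=40, mean2=38, var1=16, var2=25, n1=100, n2=50)
reject_equal_means = abs(z_means) > 1.96  # two-tailed test at alpha = 0.05

print(round(z_means, 2), reject_equal_means)
```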

2. When sample sizes are large and population variances are unknown

When the sample sizes are large, σ1² and σ2² are replaced by the sample variances s1² and
s2². Then
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
and the test proceeds as above.

5A.3. Small Sample Tests

1. Small Sample Test for Mean

If the sample size is less than 30, then we need to make the assumption that the
population follows a normal distribution.

Then the test statistic is
$$t = \frac{\bar{x} - \mu_0}{s}\sqrt{n - 1}, \quad \text{where } s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
Here the statistic t follows the Student t-distribution with n − 1 degrees of freedom.
The critical regions can be found from the Student t-table.

Example: A consumer testing agency, while examining a new automobile for gasoline
mileage performance, found that 12 readings of miles covered per gallon under normal
conditions resulted in an average of 16 miles per gallon with a standard deviation of
1.8. Do the sample results support the manufacturer’s claim that the new automobile
gives a performance of more than 15 miles per gallon?

Solution: H0: μ = 15 against H1: μ > 15
Given $\bar{x}$ = 16, n = 12, s = 1.8, μ0 = 15

Test statistic
$$t = \frac{\bar{x} - \mu_0}{s}\sqrt{n - 1} = \frac{16 - 15}{1.8}\sqrt{11} = 1.84$$
Let α = 0.05. From the t-table with 11 degrees of freedom, t0.05,11 = 1.7959.

Since t = 1.84 > 1.7959, H0 is to be rejected. So the manufacturer’s claim can be
justified.
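
The statistic for this example can be recomputed as below. This is a minimal sketch; the function name t_statistic is an illustrative choice, and the critical value 1.7959 is simply the t-table value quoted in the solution, since the standard library has no t-distribution.

```python
from math import sqrt


def t_statistic(sample_mean, mu0, s, n):
    """t = (x_bar - mu0) * sqrt(n - 1) / s, where s uses divisor n."""
    return (sample_mean - mu0) * sqrt(n - 1) / s


t_mileage = t_statistic(sample_mean=16, mu0=15, s=1.8, n=12)
T_CRITICAL = 1.7959  # t-table value for alpha = 0.05, 11 d.f. (from the text)
reject_h0 = t_mileage > T_CRITICAL  # right-tailed test

print(round(t_mileage, 2), reject_h0)
```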

2. Small Sample Test for Equality of Means

Here we assume that the populations are independent and follow normal distributions,
and that the population variances are unknown but equal.

The null hypothesis is H0: μ1 = μ2. The alternative hypothesis is: 1) H1: μ1 ≠ μ2,
2) H1: μ1 < μ2, or 3) H1: μ1 > μ2.

We take a sample of size n1 from the first population and a sample of size n2 from the
second population. Let $\bar{x}_1$ and $\bar{x}_2$ be the sample means and s1² and s2² the
sample variances (computed with divisors n1 and n2, as above).

The test statistic is
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},$$
which follows the Student t-distribution with n1 + n2 − 2 degrees of freedom.

Example: Given two independent random samples of sizes n1 = 12 and n2 = 20 from
two different normal populations, with $\bar{x}_1$ = 180 and $\bar{x}_2$ = 187, s1² = 40 and
s2² = 60, test the hypothesis at α = 0.10 that the population means are equal.

Solution: H0: μ1 = μ2 against H1: μ1 ≠ μ2

The test statistic is
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{180 - 187}{\sqrt{\dfrac{12 \times 40 + 20 \times 60}{12 + 20 - 2}\left(\dfrac{1}{12} + \dfrac{1}{20}\right)}} = -2.562$$
For a two-tailed test at α = 0.10 with 30 degrees of freedom, the critical value from the
t-table is t0.05,30 = 1.6973.
Since |t| = 2.562 > 1.6973, we reject H0.
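
The pooled two-sample statistic can be recomputed as below. This is a minimal sketch; the function name t_two_means is an illustrative choice, the sample variances are taken to use divisors n1 and n2 as in the formula above, and 1.6973 is the t-table value quoted in the solution.

```python
from math import sqrt


def t_two_means(mean1, mean2, var1, var2, n1, n2):
    """Pooled t statistic for H0: mu1 = mu2 (variances computed with divisor n)."""
    pooled = (n1 * var1 + n2 * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled * (1 / n1 + 1 / n2))


t_samples = t_two_means(mean1=180, mean2=187, var1=40, var2=60, n1=12, n2=20)
T_CRITICAL = 1.6973  # t-table value used in the text for 30 d.f.
reject_h0 = abs(t_samples) > T_CRITICAL  # two-tailed test at alpha = 0.10

print(round(t_samples, 3), reject_h0)
```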

3. Test of Equality of Means of Paired Observations

Paired measurements arise when two measurements are made on one unit of
observation. For example, the severity of an illness measured before and after
medication.

When the difference of two measurements is the variable of interest, a test of the
hypothesis that the mean difference is zero in the population can be obtained from
the differences of pairs of measurements in the sample. This is a particularly useful
application because a mean difference of 0 signifies that the mean of one measure
is identical to the mean of the other measure.

Let (x1, y1), (x2, y2), …, (xn, yn) be the sample observations. Let di = yi − xi. Then to
test whether the means of X and Y are equal, it is sufficient to test whether the mean of
the differences, μd, is 0.

Thus H0: μd = 0. The alternative is 1) H1: μd ≠ 0, 2) H1: μd < 0, or 3) H1: μd > 0.

The test statistic is
$$t = \frac{\bar{d}}{s_d}\sqrt{n - 1},$$
where $\bar{d}$ and $s_d$ are the mean and standard deviation (with divisor n) of the
differences. It follows the t-distribution with n − 1 degrees of freedom.

Example: Two laboratories A and B carry out independent estimates of fat content
in ice-cream made by a firm. A sample is taken from each batch, halved, and the
separated halves sent to the two laboratories. The fat content obtained by the
laboratories is recorded below:

Batch no. 1 2 3 4 5 6 7 8 9 10
Lab A 7 8 7 3 8 6 9 4 7 8
Lab B 9 8 8 4 7 7 9 6 6 6

Is there a significant difference between the mean fat content obtained by the two
laboratories A and B?

Solution:

xi     yi     di = yi − xi     di²
7      9       2               4
8      8       0               0
7      8       1               1
3      4       1               1
8      7      -1               1
6      7       1               1
9      9       0               0
4      6       2               4
7      6      -1               1
8      6      -2               4
Total          3              17

H0: μd = 0 against H1: μd ≠ 0

Here,
$$\bar{d} = \frac{1}{n}\sum d_i = \frac{3}{10} = 0.3$$
$$s_d^2 = \frac{1}{n}\sum d_i^2 - (\bar{d})^2 = \frac{17}{10} - (0.3)^2 = 1.61$$
So, sd = 1.27.

The test statistic is
$$t = \frac{\bar{d}}{s_d}\sqrt{n - 1} = \frac{0.3}{1.27}\sqrt{10 - 1} = 0.709$$
For a two-tailed test at α = 0.05 with 9 degrees of freedom, the critical value from the
t-table is t0.025,9 = 2.26.
Since |t| = 0.709 < 2.26, we accept H0. Thus there is no significant difference
between the mean fat content obtained by the two laboratories A and B.
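
The paired test can be recomputed directly from the laboratory data. This is a minimal sketch; the variable names are illustrative, and 2.26 is the t-table value quoted in the solution. Note that √(10 − 1) = 3, so the statistic is about 0.71.

```python
from math import sqrt

lab_a = [7, 8, 7, 3, 8, 6, 9, 4, 7, 8]
lab_b = [9, 8, 8, 4, 7, 7, 9, 6, 6, 6]

# Differences d_i = y_i - x_i for each batch.
d = [b - a for a, b in zip(lab_a, lab_b)]
n = len(d)
d_bar = sum(d) / n
# Standard deviation of the differences, with divisor n as in the text.
s_d = sqrt(sum(x * x for x in d) / n - d_bar ** 2)

t_paired = d_bar * sqrt(n - 1) / s_d
T_CRITICAL = 2.26  # t-table value used in the text for 9 d.f.
reject_h0 = abs(t_paired) > T_CRITICAL  # two-tailed test at alpha = 0.05

print(sum(d), round(s_d, 2), round(t_paired, 3), reject_h0)
```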
5A.4 Analysis of Variance

The analysis of variance is a set of statistical techniques for studying variability from
different sources and comparing them to understand the relative importance of each of the
sources. It is also used to make inferences about the population through tests of
significance, including the very important comparison of the means of two or more separate
populations.

The technique of analysis of variance developed by R.A. Fisher is capable of fruitful
application in a variety of problems. The technique originated in agricultural research
where the effect of various types of soils on the output or the effect of different types of
fertilizers on production had to be studied. Later, this technique was found to be extremely
useful in all types of research where the effects of one or two variables on a problem
under study had to be determined on the basis of a number of experiments conducted
simultaneously.

The technique of analysis of variance in the case of a single variable and in the case of
two variables is similar. In both cases a comparison is made between the variance of
sample means and the residual variance. However, in the case of a single variable the
total variance is divided into two parts only, viz. variance between the samples and
variance within the samples. The latter variance is called the residual variance. In the
case of two variables, the total variance is divided into three parts, viz. variance due to
the first variable, variance due to the second variable and residual variance.

1. Analysis of Variance (ANOVA) in One – Way Classification.

In one-way classification we take into account only one variable, say the effect of
treatment. Let there be m treatments, with ni sample observations on the ith
treatment. Let X be the dependent variable and xij be the jth observation of X for the ith
treatment. We start with the null hypothesis that the mean treatment effects are the same,
H0: μ1 = μ2 = μ3 = … = μm, against the alternative that the mean treatment effects are
not all the same.

The following are the steps in testing the above hypothesis.

1. Calculate the sum of squares of variation between samples (BSS).
2. Calculate the total sum of squares of variation (TSS).
3. Calculate the sum of squares of variation within the samples, or Error Sum of
Squares (ESS). This is obtained by subtracting the BSS from the TSS.
4. Calculate the F-ratio.
5. Compare the F-ratio so calculated with the critical value of the F-ratio as given in
Snedecor’s table.
6. Draw the inference whether the null hypothesis is accepted or rejected.

The various steps in the calculations are:

i) Find the grand total T, which is the sum of the values of all the items of
all the samples.

ii) Calculate the correction factor, CF = T²/N, where N is the total number of
observations (N = n1 + n2 + … + nm).

iii) Find the sum of squares of all the items of all the samples (i.e. Σxij²).

iv) Find the total sum of squares (TSS) by subtracting the correction
factor from the sum of squares of all the items:
$$TSS = \sum x_{ij}^2 - \frac{T^2}{N}$$

v) Find the total of each sample (xi.). Then square the sample totals and
divide each by the number of items in that sample. Add all these figures.
The between sum of squares is obtained by subtracting the correction factor
from this sum:
$$BSS = \frac{x_{1.}^2}{n_1} + \frac{x_{2.}^2}{n_2} + \dots + \frac{x_{m.}^2}{n_m} - \frac{T^2}{N}$$

vi) The within sum of squares (also written WSS) is ESS = TSS − BSS.

vii) The degrees of freedom of BSS is m − 1, the degrees of freedom of TSS
is N − 1, and the degrees of freedom of ESS is N − m.

viii) Find the mean sums of squares.
Mean sum of squares between samples, MSB = BSS/(m − 1)
Mean sum of squares within samples, MSE = ESS/(N − m)

ix) Calculate the F-ratio, F = MSB/MSE.

x) Find the table value from the F-table corresponding to m − 1 and N − m degrees
of freedom and significance level α.

xi) If the calculated value is greater than the table value, reject H0.

Analysis of Variance Table ( One Way Classification)

Source of Variation   Sum of Squares   d.f.     Mean Squares          F-ratio        Table value      Inference
Between Samples       BSS              m − 1    MSB = BSS/(m − 1)     F = MSB/MSE    Fα; m−1, N−m     If F > Fα, reject H0
Within Samples        ESS              N − m    MSE = ESS/(N − m)
Total                 TSS              N − 1

Analysis of Variance in Two-Way Classification

In a one-way classification we take into account the effect of only one variable. In a
two-way classification the effect of two variables can be studied. In the two-way case,
the total variation is the sum of column variation, row variation and error variation. The
variances are calculated for both columns and rows and they are compared with the
residual or error variation. Let there be r rows and c columns. The null hypotheses are:
H01: Column-wise effects are not significant; H02: Row-wise effects are not significant.
The ANOVA table is given below:

Source of Variation   Sum of Squares   d.f.              Mean Squares                   F-ratio         Table value               Inference
Between Columns       CSS              c − 1             MSC = CSS/(c − 1)              F1 = MSC/MSE    Fα; c−1, (c−1)(r−1)       If F1 > Fα, reject H01
Between Rows          RSS              r − 1             MSR = RSS/(r − 1)              F2 = MSR/MSE    Fα; r−1, (c−1)(r−1)       If F2 > Fα, reject H02
Error                 ESS              (c − 1)(r − 1)    MSE = ESS/((c − 1)(r − 1))
Total                 TSS              N − 1

Example 1: From the data given below, set up a table of analysis of variance and find out
whether the means of the various samples differ significantly among themselves.

Sample 1: 9 11 13 9 8
Sample 2: 13 12 10 15 5
Sample 3: 19 13 17 7 9
Sample 4: 14 10 13 17 16

Solution: H0: Means are equal (μ1 = μ2 = μ3 = μ4)

         X1     X1²     X2     X2²     X3     X3²     X4     X4²
          9      81     13     169     19     361     14     196
         11     121     12     144     13     169     10     100
         13     169     10     100     17     289     13     169
          9      81     15     225      7      49     17     289
          8      64      5      25      9      81     16     256
Total    50     516     55     663     65     949     70    1010

Sum of all observations, T = 50 + 55 + 65 + 70 = 240

Correction Factor, CF = T²/N = (240)²/20 = 2880

Sum of squares of all observations = 516 + 663 + 949 + 1010 = 3138

i) Total Sum of Squares, TSS = 3138 − CF = 3138 − 2880 = 258

Sum of squares of the sample totals, each divided by the number of observations in that
sample
$$= \frac{(50)^2}{5} + \frac{(55)^2}{5} + \frac{(65)^2}{5} + \frac{(70)^2}{5} = 2930$$

ii) Between Sum of Squares, BSS = 2930 − CF = 2930 − 2880 = 50

iii) Within Sum of Squares, ESS = TSS − BSS = 258 − 50 = 208

Source of Variation   Sum of Squares   d.f.   Mean Squares           F-ratio                 Table value           Inference
Between Samples        50               3     MSB = 50/3 = 16.67     F = 16.67/13 = 1.28     F0.05; 3,16 = 3.24    Since F < Fα, accept H0
Within Samples        208              16     MSE = 208/16 = 13
Total                 258              19

Hence, the means are not significantly different.
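
The calculation steps for one-way classification can be sketched in code and applied to the four samples of Example 1. The function name one_way_anova and the dictionary keys are illustrative choices; the table value 3.24 is the one quoted in the solution.

```python
def one_way_anova(samples):
    """Return BSS, ESS, TSS and the F-ratio using the correction-factor method."""
    m = len(samples)                       # number of samples (treatments)
    N = sum(len(s) for s in samples)       # total number of observations
    T = sum(sum(s) for s in samples)       # grand total
    cf = T ** 2 / N                        # correction factor
    tss = sum(x * x for s in samples for x in s) - cf
    bss = sum(sum(s) ** 2 / len(s) for s in samples) - cf
    ess = tss - bss
    msb = bss / (m - 1)
    mse = ess / (N - m)
    return {"BSS": bss, "ESS": ess, "TSS": tss, "F": msb / mse}


samples = [
    [9, 11, 13, 9, 8],
    [13, 12, 10, 15, 5],
    [19, 13, 17, 7, 9],
    [14, 10, 13, 17, 16],
]
result = one_way_anova(samples)
F_TABLE = 3.24  # F-table value for 3 and 16 d.f. at alpha = 0.05 (from the text)
print(result, result["F"] > F_TABLE)
```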


Example 2: The price of a certain commodity was ascertained in each of the four towns A,
B, C and D, in four quarters of a year. The prices are given below. Are the variations in
prices between different towns and in different seasons significant?

Towns
Quarters A B C D Total

I 60 50 60 50 220
II 50 40 65 50 205
III 45 35 45 50 175
IV 65 45 60 70 240
Total 220 170 230 220 840

Solution: Null Hypotheses H01: Prices do not differ in the four towns
                          H02: Prices do not differ in the four quarters

i) Correction Factor, CF = T²/N = (840)²/16 = 44100

ii) Column Sum of Squares,
$$CSS = \frac{(220)^2}{4} + \frac{(170)^2}{4} + \frac{(230)^2}{4} + \frac{(220)^2}{4} - CF = 44650 - 44100 = 550$$

iii) Row Sum of Squares,
$$RSS = \frac{(220)^2}{4} + \frac{(205)^2}{4} + \frac{(175)^2}{4} + \frac{(240)^2}{4} - CF = 44662.5 - 44100 = 562.5$$

iv) Total Sum of Squares, TSS = 60² + 50² + … + 70² − CF
= 45550 − 44100 = 1450

v) Error Sum of Squares, ESS = TSS − CSS − RSS = 1450 − 550 − 562.5
= 337.5

Source of Variation   Sum of Squares   d.f.   Mean Squares               F-ratio                    Table value   Inference
Between Columns        550              3     MSC = 550/3 = 183.3        F1 = 183.3/37.5 = 4.89     3.86          Reject H01
Between Rows           562.5            3     MSR = 562.5/3 = 187.5      F2 = 187.5/37.5 = 5        3.86          Reject H02
Error                  337.5            9     MSE = 337.5/9 = 37.5
Total                 1450             15
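
The two-way calculation of Example 2 can be sketched in the same style, with one observation per cell. The function name two_way_anova and the dictionary keys are illustrative choices; the table value 3.86 is the one quoted in the solution.

```python
def two_way_anova(table):
    """Two-way ANOVA with one observation per cell; rows and columns are factors."""
    r, c = len(table), len(table[0])
    N = r * c
    T = sum(sum(row) for row in table)     # grand total
    cf = T ** 2 / N                        # correction factor
    tss = sum(x * x for row in table for x in row) - cf
    rss = sum(sum(row) ** 2 / c for row in table) - cf
    css = sum(sum(col) ** 2 / r for col in zip(*table)) - cf
    ess = tss - css - rss
    mse = ess / ((c - 1) * (r - 1))
    return {
        "CSS": css, "RSS": rss, "ESS": ess, "TSS": tss,
        "F1": (css / (c - 1)) / mse,       # between columns (towns)
        "F2": (rss / (r - 1)) / mse,       # between rows (quarters)
    }


# Rows are quarters I-IV; columns are towns A-D.
prices = [
    [60, 50, 60, 50],
    [50, 40, 65, 50],
    [45, 35, 45, 50],
    [65, 45, 60, 70],
]
result = two_way_anova(prices)
F_TABLE = 3.86  # F-table value for 3 and 9 d.f. at alpha = 0.05 (from the text)
print(result, result["F1"] > F_TABLE, result["F2"] > F_TABLE)
```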

5A.5. Exercises

1. The manufacturer of light bulbs claims that a light bulb lasts on an average 1600
hours. A sample of 100 light bulbs was taken at random and the average life of
bulbs was computed as 1570 hours with a standard deviation of 120 hours. At
α = 0.01, test the validity of the claim.
2. An insurance company claims that it takes 2 weeks (14 days), on an average, to
process an auto accident claim. The standard deviation is 6 days. To test the
validity of the claim, an investigator randomly selected 36 people who recently
filed claims. This sample revealed that it took the company an average of 16
days to process these claims. At 99% level of confidence, check if it takes the
company more than 14 days on an average to process the claim.
3. The sponsor of a television show believes that his studio audience is divided
equally between men and women. Out of 400 persons attending the show one
day, there were 230 men. At 5% significance level, test if the belief of the
sponsor is correct.
4. An airline claims that at most 8% of its lost luggage is never found. A consumer
advocacy wants to test this claim. In a study of 200 random cases of lost
luggage, it was found that in 22 cases, the lost luggage was never found. At
95% confidence test the airline’s claim.
5. An advertising agency wants to find out if there is any difference in the degree
of loyalty for a given brand of cereal between men and women. A random
sample of 200 men and 200 women was taken and it was determined that 58%
of women and 65% of men showed brand loyalty. At 5% level of significance
test the null hypothesis that there is no significant difference between the
population proportion of men and women who are brand loyal.
6. An experiment has been conducted to compare the productivity of two
machines. Machine I was observed for 40 hours and machine II for 50 hours.
The average productivity of items produced per hour and the standard deviation
for each machine is recorded below:

        Machine I   Machine II
Mean    61.4        59.5
SD      3.1         2.8

At 1% level of significance, does the sample provide sufficient evidence to
conclude that productivity on machine I is better than productivity on machine
II?
7. A gas station repair shop claims that it can do a lubrication job and oil change
in 30 minutes. The consumer protection department wants to test this claim. A
sample of 6 cars was sent to the station for oil change and lubrication. The job
took an average of 34 minutes with a standard deviation of 4 minutes. Test this
claim at 5% level of significance.
8. Two salespersons A and B are working for the same insurance company in a
certain district. From a sample survey conducted by the head office regarding
the sales of these salespersons in a given month, the following results were
obtained:
Salesperson A Salesperson B
Number of sales 10 18
Average sales(‘000 Rs.) 170 205
Standard deviation(‘000Rs.) 20 25

9. A certain stimulus administered to each of 12 patients resulted in the
following changes in their systolic blood pressure:
5, 2, 8, -1, 3, 0, -2, 1, 5, 0, 4, 6
At 95% confidence level, can it be concluded that the stimulus, in general, will
result in an increase in systolic blood pressure count?
10. Four salesmen were posted in different areas by a company. The number of units
of commodity X sold by them in four randomly selected weeks are as follows:

Salesman A   20 23 28 29
Salesman B   25 32 30 21
Salesman C   23 28 35 18
Salesman D   15 21 19 25
Based on this information, can it be concluded at the 0.05 level of significance
that there is a significant difference in the performance of these four salesmen?
11. The following table gives the number of refrigerators sold by 4 salesmen in
three months May, June, and July:

Month Salesmen
A B C D
May 50 40 48 39
June 46 48 50 45
July 39 44 40 39

Is there a significant difference in the sales made by the four salesmen?
Is there a significant difference in the sales made during different months?

5B. CORRELATION AND REGRESSION

5B.1. Introduction

The statistical techniques discussed so far were concerned with univariate data, that is,
data on a single variable. A relationship may exist between two or more variables, and it
can be gainfully utilized in taking decisions. For example, it
is worthwhile for the management of a business concern to know the relationship between
expenses on advertisement of a product and its sales. In order to study the joint
behavior of the variables, we need to study their joint probability distribution.

Let X and Y be two random variables. The joint distribution of (X,Y) gives the
probabilities of the simultaneous occurrence of events defined by (X,Y). Since X and Y are
random variables, we can also get the individual distributions of X and Y. They are called
marginal distributions: the individual distribution of X is called the marginal distribution
of X and that of Y is called the marginal distribution of Y.

Let X be the height and Y be the weight of students in a class. The weights of students
may vary even though their heights are the same. So, it makes sense to find the probability
distribution of Y when the height X takes a particular value. The distribution of Y when X is
given is called the conditional distribution of Y given X. Similarly, the distribution of X
when Y is given is called the conditional distribution of X given Y.
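
As a rough illustration of marginal and conditional distributions, the Python sketch below uses a small joint distribution of height and weight categories. The categories and probabilities are hypothetical and chosen only for illustration; they are not data from these notes.

```python
# Hypothetical joint distribution P(X = x, Y = y) for height X and weight Y.
joint = {
    ("short", "light"): 0.30,
    ("short", "heavy"): 0.10,
    ("tall", "light"): 0.15,
    ("tall", "heavy"): 0.45,
}


def marginal(joint, axis):
    """Sum the joint probabilities over the other variable (axis 0 = X, axis 1 = Y)."""
    totals = {}
    for outcome, p in joint.items():
        totals[outcome[axis]] = totals.get(outcome[axis], 0.0) + p
    return totals


def conditional_y_given_x(joint, x_value):
    """Conditional distribution of Y given X = x_value: P(x, y) / P(x)."""
    p_x = marginal(joint, 0)[x_value]
    return {y: p / p_x for (x, y), p in joint.items() if x == x_value}


marginal_height = marginal(joint, 0)   # marginal distribution of X
marginal_weight = marginal(joint, 1)   # marginal distribution of Y
weight_given_tall = conditional_y_given_x(joint, "tall")
print(marginal_height, marginal_weight, weight_given_tall)
```

The conditional distribution is obtained by dividing each joint probability by the marginal probability of the given value, so its probabilities again add up to 1.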

In this topic, we study the statistical relationship between two quantitative variables. We
examine the directional relationship between two variables. In many instances one variable
may have a direct effect on the other or may be used to predict the other.

There are many instances where managers take decisions based on future events. For this,
they rely on observations of two or more variables which appear to be related to one
another.

5B.2. Linear Regression

Regression analysis is a set of statistical techniques for analyzing the relationship between
two numerical variables. One variable is viewed as the dependent variable and the other as
the independent variable. The purpose of regression analysis is to understand the direction
and extent to which values of dependent variable can be predicted by the corresponding
values of the independent variable. The regression gives the nature of relationship between
the variables.

Often the relationship between two variables x and y is not an exact mathematical
relationship, but rather several y values corresponding to a given x value scatter about a
value that depends on the x value. For example, although not all persons of the same height
have exactly the same weight, their weights bear some relation to that height. On the
average, people who are 6 feet tall are heavier than those who are 5 feet tall; the mean
weight in the population of 6-footers exceeds the mean weight in the population of 5-
footers.

This relationship is modeled statistically as follows: For every value of x there is a
corresponding population of y values. The population mean of y for a particular value of x
is denoted by f(x). As a function of x it is called the regression function. If this regression
function is linear it may be written as f(x) = a + bx. The quantities a and b are parameters
that define the relationship between x and f(x).

In conducting a regression analysis, we use a sample of data to estimate the values of these
parameters. The population of y values at a particular x value also has a variance; the usual
assumption is that the variance is the same for all values of x.

Principle of Least Squares

Principle of least squares is used to estimate the parameters of a linear regression. The
principle states that the best estimates of the parameters are those values of the parameters,
which minimize the sum of squares of residual errors. The residual error is the difference
between the actual value of the dependent variable and the estimated value of the dependent
variable.

Fitting of Regression Line y = a + bx

By the principle of least squares, the best estimates of a and b are

$$ b = \frac{S_{xy}}{S_x^2} \quad \text{and} \quad a = \bar{y} - b\,\bar{x}, $$

where $S_{xy}$ is the covariance between x and y, defined as

$$ S_{xy} = \frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y}, $$

and $S_x^2$ is the variance of x, that is,

$$ S_x^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2. $$

Example: Fit a straight line y = a + bx for the following data.

Y 3.5 4.3 5.2 5.8 6.4 7.3 7.2 7.5 7.8 8.3
X 6 8 9 12 10 15 17 20 18 24

Solution:
         Y      X     XY      X²
         3.5    6     21      36
         4.3    8     34.4    64
         5.2    9     46.8    81
         5.8    12    69.6    144
         6.4    10    64      100
         7.3    15    109.5   225
         7.2    17    122.4   289
         7.5    20    150     400
         7.8    18    140.4   324
         8.3    24    199.2   576
Total    63.3   139   957.3   2239

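$$ \bar{x} = \frac{139}{10} = 13.9, \qquad \bar{y} = \frac{63.3}{10} = 6.33 $$

$$ S_{xy} = \frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y} = \frac{957.3}{10} - 13.9 \times 6.33 = 7.743 $$

$$ S_x^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 = \frac{2239}{10} - 13.9^2 = 30.69 $$

So,

$$ b = \frac{S_{xy}}{S_x^2} = \frac{7.743}{30.69} = 0.252 $$

and

$$ a = \bar{y} - b\,\bar{x} = 6.33 - 0.252 \times 13.9 = 2.8272 $$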

Therefore, the straight line is y = 2.8272 + 0.252 x
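
The same fit can be reproduced with a minimal Python sketch of the least squares formulas. The function name fit_line is an illustrative assumption; the data are those of the example. Because the code does not round b before computing a, the intercept comes out slightly different from the hand value 2.8272.

```python
def fit_line(x, y):
    """Least squares estimates (a, b) of the line y = a + b x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variance in the "mean of products minus product of means" form
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    s_xx = sum(xi * xi for xi in x) / n - mean_x ** 2
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b


x = [6, 8, 9, 12, 10, 15, 17, 20, 18, 24]
y = [3.5, 4.3, 5.2, 5.8, 6.4, 7.3, 7.2, 7.5, 7.8, 8.3]
a, b = fit_line(x, y)
print(f"y = {a:.4f} + {b:.4f} x")
```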

Two Regression Lines

There are two regression lines; regression line of y on x and regression line of x on y. In
the regression line of y on x, y is the dependent variable and x is the independent variable
and it is used to predict the value of y for a given value of x. But in the regression line of x
on y, x is the dependent variable and y is the independent variable and it is used to predict
the value of x for a given value of y.
The regression line of y on x is given by

$$ y - \bar{y} = \frac{S_{xy}}{S_x^2}\,(x - \bar{x}) $$

and the regression line of x on y is given by

$$ x - \bar{x} = \frac{S_{xy}}{S_y^2}\,(y - \bar{y}) $$
Regression Coefficients
The quantity $\frac{S_{xy}}{S_x^2}$ is the regression coefficient of y on x and is denoted by
$b_{yx}$, which gives the slope of the regression line of y on x. That is,
$b_{yx} = \frac{S_{xy}}{S_x^2}$ is the rate of change in y for a unit change in x.

The quantity $\frac{S_{xy}}{S_y^2}$ is the regression coefficient of x on y and is denoted by
$b_{xy}$, which gives the slope of the regression line of x on y. That is,
$b_{xy} = \frac{S_{xy}}{S_y^2}$ is the rate of change in x for a unit change in y.

5B.3. Correlation

Correlation measures the degree of linear relation between the variables. The existence of
correlation between variables does not necessarily mean that one is the cause of the change
in the other. It should be noted that correlation analysis merely helps in determining the
degree of association between two variables; it does not tell anything about the cause
and effect relationship. While interpreting the correlation coefficient, it is necessary to see
whether there is any cause and effect relationship between the variables under study. If there
is no such relationship, the observed correlation is meaningless.

The Scatter Diagram

The first step in correlation and regression analysis is to visualize the relationship between
the variables. A scatter diagram is obtained by plotting the points (x1, y1), (x2, y2), …,
(xn, yn) on a two-dimensional plane. If the points are scattered around a straight line, we
may infer that there exists a linear relationship between the variables. If the points are
clustered around a straight line with negative slope, then there exists negative correlation, or
the variables are inversely related (i.e., when x increases y decreases and vice versa). If
the points are clustered around a straight line with positive slope, then there exists positive
correlation, or the variables are directly related (i.e., when x increases y also increases and
vice versa).

Karl Pearson’s Correlation Coefficient

If (x1, y1), (x2, y2), …, (xn, yn) are n given observations, then Karl Pearson's correlation
coefficient is defined as

$$ r = \frac{S_{xy}}{S_x S_y}, $$

where $S_{xy}$ is the covariance and $S_x$, $S_y$ are the standard deviations of X and Y
respectively. That is,

$$ r = \frac{\frac{1}{n}\sum xy - \bar{x}\,\bar{y}}{\sqrt{\frac{1}{n}\sum x^2 - \bar{x}^2}\;\sqrt{\frac{1}{n}\sum y^2 - \bar{y}^2}} $$

The value of r lies between –1 and 1. That is, $-1 \le r \le 1$. When r = 1, there exists a perfect
positive linear relationship between x and y. When r = -1, there exists a perfect negative linear
relationship between x and y. When r = 0, there is no linear relationship between x and y.

Relation between Regression Coefficients and Correlation Coefficient

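The correlation coefficient is, up to sign, the geometric mean of the regression coefficients.

We know that $b_{yx} = \frac{S_{xy}}{S_x^2}$ and $b_{xy} = \frac{S_{xy}}{S_y^2}$.

The geometric mean of $b_{yx}$ and $b_{xy}$ is

$$ \sqrt{b_{xy}\, b_{yx}} = \sqrt{\frac{S_{xy}}{S_y^2} \cdot \frac{S_{xy}}{S_x^2}} = \frac{|S_{xy}|}{S_x S_y} = |r|, $$

the magnitude of the correlation coefficient.
Also note that both regression coefficients have the same sign as $S_{xy}$, so the sign of the
correlation coefficient is the same as the sign of the regression coefficients:
$r = \pm\sqrt{b_{yx}\, b_{xy}}$, taking the common sign of $b_{yx}$ and $b_{xy}$.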

Coefficient of Determination

The coefficient of determination is the square of the correlation coefficient and gives the
proportion of variation in y explained by x. That is, the coefficient of determination is the ratio
of explained variance to the total variance. For example, r² = 0.879 means that 87.9% of
the total variation in y is explained by x. When r² = 1, it means that all the points on the
scatter diagram fall on the regression line and the entire variation is explained by the
straight line. On the other hand, if r² = 0, the regression line explains none of the variation
in y, meaning thereby that there is no linear relationship between the
variables.
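
The interpretation of r² as the explained share of the total variation can be checked numerically. The sketch below is only an illustration: it uses the data of the straight-line example from the section on fitting the regression line, and the function name is an assumption. It computes r² once as the square of the correlation coefficient and once as 1 minus the ratio of residual to total sum of squares around the fitted line.

```python
def r_squared_two_ways(x, y):
    """Return (r squared from the correlation formula, r squared as explained share)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    s_xx = sum(xi * xi for xi in x) / n - mean_x ** 2
    s_yy = sum(yi * yi for yi in y) / n - mean_y ** 2
    r = s_xy / (s_xx * s_yy) ** 0.5

    # Fit y = a + b x and compare unexplained variation with total variation
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    return r ** 2, 1 - residual_ss / total_ss


x = [6, 8, 9, 12, 10, 15, 17, 20, 18, 24]
y = [3.5, 4.3, 5.2, 5.8, 6.4, 7.3, 7.2, 7.5, 7.8, 8.3]
from_correlation, from_fit = r_squared_two_ways(x, y)
print(round(from_correlation, 3), round(from_fit, 3))
```

The two values agree up to floating-point error, which is the sense in which r² is the ratio of explained to total variance for a fitted straight line.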

Example: Consider the following data:


X: 15 16 17 18 19 20
Y: 80 75 60 40 30 20

1. Fit both regression lines
2. Find the correlation coefficient
3. Verify the correlation coefficient is the geometric mean of the regression
coefficients
4. Find the value of y when x = 17.5

Solution:

         X     Y     XY     X²     Y²
         15    80    1200   225    6400
         16    75    1200   256    5625
         17    60    1020   289    3600
         18    40    720    324    1600
         19    30    570    361    900
         20    20    400    400    400
Total    105   305   5110   1855   18525

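$$ \bar{x} = \frac{\sum x}{n} = \frac{105}{6} = 17.5, \qquad \bar{y} = \frac{\sum y}{n} = \frac{305}{6} = 50.83 $$

$$ S_{xy} = \frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y} = \frac{5110}{6} - 17.5 \times 50.83 = -37.86 $$

$$ S_x^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 = \frac{1855}{6} - 17.5^2 = 2.92 $$

$$ S_y^2 = \frac{1}{n}\sum y_i^2 - \bar{y}^2 = \frac{18525}{6} - 50.83^2 = 503.81 $$

$$ b_{yx} = \frac{S_{xy}}{S_x^2} = \frac{-37.86}{2.92} = -12.96 \quad \text{and} \quad b_{xy} = \frac{S_{xy}}{S_y^2} = \frac{-37.86}{503.81} = -0.075 $$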
1. Regression line of y on x is $y - \bar{y} = \frac{S_{xy}}{S_x^2}(x - \bar{x})$
i.e., y – 50.83 = -12.96(x – 17.5)
y = -12.96 x + 277.63
Regression line of x on y is $x - \bar{x} = \frac{S_{xy}}{S_y^2}(y - \bar{y})$
i.e., x – 17.5 = -0.075(y – 50.83)
x = -0.075 y + 21.31

2. Correlation coefficient,

$$ r = \frac{S_{xy}}{S_x S_y} = \frac{-37.86}{1.71 \times 22.45} = -0.986 $$

3. $b_{yx}\, b_{xy} = -12.96 \times -0.075 = 0.972$
Then, $\sqrt{0.972} = 0.986$.
Since both regression coefficients are negative, r = -0.986.
4. To predict the value of y, use the regression line of y on x.
When x = 17.5, y = -12.96 × 17.5 + 277.63 = 50.83
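
The steps of this example can be retraced with a short Python sketch. The function name and return layout are illustrative assumptions. The code keeps full precision, so its results differ slightly from the hand computation, which rounds the mean of y to 50.83; for instance, r comes out near -0.99 rather than exactly -0.986.

```python
import math


def regression_summary(x, y):
    """Return (mean_x, mean_y, b_yx, b_xy, r) for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    s_xx = sum(xi * xi for xi in x) / n - mean_x ** 2
    s_yy = sum(yi * yi for yi in y) / n - mean_y ** 2
    b_yx = s_xy / s_xx                    # regression coefficient of y on x
    b_xy = s_xy / s_yy                    # regression coefficient of x on y
    r = s_xy / math.sqrt(s_xx * s_yy)     # Karl Pearson's correlation coefficient
    return mean_x, mean_y, b_yx, b_xy, r


x = [15, 16, 17, 18, 19, 20]
y = [80, 75, 60, 40, 30, 20]
mean_x, mean_y, b_yx, b_xy, r = regression_summary(x, y)

# Geometric mean of the regression coefficients gives the magnitude of r
geometric_mean = math.sqrt(b_yx * b_xy)

# Prediction from the regression line of y on x at x = 17.5
predicted_y = mean_y + b_yx * (17.5 - mean_x)
print(round(b_yx, 3), round(b_xy, 4), round(r, 3), round(predicted_y, 2))
```

Since x = 17.5 equals the mean of x, the predicted value is simply the mean of y, which matches the answer to part 4.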

Short-Cut Method: The correlation coefficient is invariant under a change of origin and
scale (with positive scale factors).

Let us take the transformations $u = \frac{x - 18}{1}$ and $v = \frac{y - 40}{10}$.

         X     Y     u     v     uv    u²    v²
         15    80    -3    4     -12   9     16
         16    75    -2    3.5   -7    4     12.25
         17    60    -1    2     -2    1     4
         18    40    0     0     0     0     0
         19    30    1     -1    -1    1     1
         20    20    2     -2    -4    4     4
Total    105   305   -3    6.5   -26   19    37.25

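$$ \bar{u} = \frac{\sum u}{n} = \frac{-3}{6} = -0.5, \qquad \bar{v} = \frac{\sum v}{n} = \frac{6.5}{6} = 1.083 $$

$$ S_{uv} = \frac{1}{n}\sum u_i v_i - \bar{u}\,\bar{v} = \frac{-26}{6} - (-0.5)(1.083) = -3.79 $$

$$ S_u^2 = \frac{1}{n}\sum u_i^2 - \bar{u}^2 = \frac{19}{6} - (-0.5)^2 = 2.92 $$

$$ S_v^2 = \frac{1}{n}\sum v_i^2 - \bar{v}^2 = \frac{37.25}{6} - 1.083^2 = 5.035 $$

$$ b_{vu} = \frac{S_{uv}}{S_u^2} = \frac{-3.79}{2.92} = -1.297 \quad \text{and} \quad b_{uv} = \frac{S_{uv}}{S_v^2} = \frac{-3.79}{5.035} = -0.75 $$

1. Regression line of v on u is $v - \bar{v} = b_{vu}(u - \bar{u})$
i.e., v – 1.083 = -1.297(u + 0.5)
v = -1.297u + 0.4345
Therefore, the regression line of y on x is $\frac{y - 40}{10} = -1.297\,\frac{x - 18}{1} + 0.4345$
i.e., y = -12.97 x + 277.8
Regression line of u on v is $u - \bar{u} = b_{uv}(v - \bar{v})$
i.e., u + 0.5 = -0.75(v – 1.083)
u = -0.75 v + 0.31225
Therefore, the regression line of x on y is $\frac{x - 18}{1} = -0.75\,\frac{y - 40}{10} + 0.31225$
i.e., x = -0.075 y + 21.31

2. Correlation coefficient,

$$ r = \frac{S_{uv}}{S_u S_v} = \frac{-3.79}{1.71 \times 2.244} = -0.988 $$

3. $b_{vu}\, b_{uv} = -1.297 \times -0.75 = 0.97275$
Then, $\sqrt{0.97275} = 0.986$.
Since both regression coefficients are negative, r = -0.986, which agrees with the value in
step 2 up to rounding.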
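
The invariance used by the short-cut method can be verified numerically. The following Python sketch, with an illustrative function name, computes r from the original data and from the transformed values u = (x − 18)/1 and v = (y − 40)/10, and also shows how the regression coefficient is recovered from the transformed one.

```python
import math


def pearson_r(x, y):
    """Karl Pearson's correlation coefficient for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    s_xx = sum(xi * xi for xi in x) / n - mean_x ** 2
    s_yy = sum(yi * yi for yi in y) / n - mean_y ** 2
    return s_xy / math.sqrt(s_xx * s_yy)


def slope(x, y):
    """Regression coefficient of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y
    s_xx = sum(xi * xi for xi in x) / n - mean_x ** 2
    return s_xy / s_xx


x = [15, 16, 17, 18, 19, 20]
y = [80, 75, 60, 40, 30, 20]
u = [(xi - 18) / 1 for xi in x]    # change of origin for x
v = [(yi - 40) / 10 for yi in y]   # change of origin and scale for y

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)
b_yx_from_uv = slope(u, v) * 10 / 1   # rescale b_vu by (scale of y) / (scale of x)
print(round(r_xy, 4), round(r_uv, 4), round(b_yx_from_uv, 3))
```

The correlation coefficient is unchanged by the transformation, whereas the regression coefficient has to be rescaled by the ratio of the two scale factors.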

Spearman’s Rank Correlation Coefficient

Sometimes the characteristics whose possible correlation is being investigated cannot be
measured, but individuals can only be ranked on the basis of the characteristics to be
measured. We then have two sets of ranks available for working out the correlation
coefficient. Sometimes the data on one variable may be in the form of ranks while the data
on the other variable are in the form of measurements which can be converted into ranks.
Thus, when both the variables are ordinal or when the data are available in the ordinal form
irrespective of the type of variable, we use the rank correlation coefficient.

The Spearman's rank correlation coefficient is defined as

$$ r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}, $$

where $d_i$ is the difference between the two ranks of the i-th individual and n is the number
of individuals.

Example: Ten competitors in a beauty contest were ranked by two judges in the following
orders:

First judge: 1 6 5 10 3 2 4 9 7 8
Second judge: 3 5 8 4 7 10 2 1 6 9
Find the correlation between the rankings.

Solution:

xi yi di = xi-yi di2
1 3 -2 4
6 5 1 1
5 8 -3 9
10 4 6 36
3 7 -4 16
2 10 -8 64
4 2 2 4
9 1 8 64
7 6 1 1
8 9 -1 1

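Here $\sum d_i^2 = 200$ and n = 10, so the Spearman's rank correlation coefficient is

$$ r = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 200}{10(10^2 - 1)} = -0.212 $$

That is, the rankings of the two judges show a weak negative association; their opinions
tend to run opposite to each other.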
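
The calculation can be sketched in a few lines of Python. The function name is an illustrative assumption, and the formula applies as written only when there are no tied ranks.

```python
def spearman_rank_correlation(rank_x, rank_y):
    """Spearman's coefficient r = 1 - 6 * sum(d^2) / (n (n^2 - 1)) for untied ranks."""
    n = len(rank_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


first_judge = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
second_judge = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
r = spearman_rank_correlation(first_judge, second_judge)
print(round(r, 3))
```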

Tied Ranks

Sometimes, when more than one item has the same value, a common rank is given
to such items. This rank is the average of the ranks which these items would have got had
they differed slightly from each other. When this is done, the coefficient of rank correlation
needs some correction, because the above formula is based on the supposition that the ranks
of various items are different.
If $m_i$ is the number of items sharing the i-th tied rank, then

$$ r = 1 - \frac{6\left[\sum d_i^2 + \frac{1}{12}\sum (m_i^3 - m_i)\right]}{n(n^2 - 1)} $$
Example: Calculate the rank correlation coefficient from the sales and expenses of 10 firms
given below:
Sales(X): 50 50 55 60 65 65 65 60 60 50
Expenses(Y): 11 13 14 16 16 15 15 14 13 13
Solution:

x     R1    y     R2     d = R1 – R2    d²
50    9     11    10     -1             1
50    9     13    8      1              1
55    7     14    5.5    1.5            2.25
60    5     16    1.5    3.5            12.25
65    2     16    1.5    0.5            0.25
65    2     15    3.5    -1.5           2.25
65    2     15    3.5    -1.5           2.25
60    5     14    5.5    -0.5           0.25
60    5     13    8      -3             9
50    9     13    8      1              1
Total                                   31.5

Here there are 7 groups of tied ranks, with m1 = 3, m2 = 3, m3 = 3, m4 = 2, m5 = 2, m6 = 2, m7 = 3.

$$ r = 1 - \frac{6\left[\sum d_i^2 + \frac{1}{12}\sum (m_i^3 - m_i)\right]}{n(n^2 - 1)} $$

$$ = 1 - \frac{6\left[31.5 + \frac{1}{12}\left((3^3 - 3) + (3^3 - 3) + (3^3 - 3) + (2^3 - 2) + (2^3 - 2) + (2^3 - 2) + (3^3 - 3)\right)\right]}{10(10^2 - 1)} $$

$$ = 0.75 $$
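
A minimal Python sketch of the tied-rank procedure is given below. The helper names are illustrative assumptions; ranks are assigned in descending order with tied values sharing the average rank, as in the table, and the correction term follows the formula stated in these notes.

```python
from collections import Counter


def average_ranks(values):
    """Rank in descending order; tied values receive the average of their ranks."""
    return [
        sum(other > v for other in values) + (sum(other == v for other in values) + 1) / 2
        for v in values
    ]


def tie_correction(values):
    """(1/12) * sum of (m^3 - m) over the groups of tied values in one series."""
    return sum(m ** 3 - m for m in Counter(values).values() if m > 1) / 12


def spearman_with_ties(x, y):
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d_squared + correction) / (n * (n ** 2 - 1))


sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]
r = spearman_with_ties(sales, expenses)
print(round(r, 2))
```

The ties in both series contribute to the correction: three groups of three in the sales, and three groups of two plus one group of three in the expenses.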

5B.4. Exercises

1. A company selling household appliances wants to determine if there is any
relationship between advertising expenditures and sales. The following data
was compiled for 6 major sales regions. The expenditure is in thousands of
rupees and the sales are in millions of rupees.

Region : 1 2 3 4 5 6
Expenditure(X): 40 45 80 20 15 50
Sales (Y): 25 30 45 20 20 40

a) Compute the line of regression to predict sales
b) Compute the expected sales for a region where Rs. 72000 is being spent on
advertising
2. The following data represents the scores in the final exam of 10 students in
the subjects of Economics and Finance.

Economics: 61 78 77 97 65 95 30 74 55
Finance: 84 70 93 93 77 99 43 80 67
a) Compute the correlation coefficient.
3. Calculate the rank correlation coefficient from the sales and expenses of 9
firms given below:
Sales(X): 42 40 54 62 55 65 65 66 62
Expenses(Y): 10 18 18 17 17 14 13 10 13

5C. MULTIVARIATE TECHNIQUES

5C.1. Multiple Linear Regression

In multiple regression, we form a linear composite of explanatory (independent) variables
in such a way that it has maximum correlation with a criterion (dependent) variable. This
technique is appropriate when the researcher has a single metric dependent variable which
is supposed to be a function of other explanatory variables. The main objective in using
this technique is to predict the variability of the dependent variable based on its covariance
with all the independent variables. One can predict the level of the dependent phenomenon
through a multiple regression analysis model, given the levels of the independent variables.

Given a dependent variable, the linear multiple regression problem is to estimate
parameters B0, B1, B2, …, Bk such that the expression

Y = B0 + B1 X1 + B2 X2 + … + Bk Xk + e

provides a good estimate of an individual's Y score based on his X scores. The
least-squares method is used to estimate the parameters in such a way that the sum of the squared
deviations of the actual values from the predicted values is kept as small as possible.
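
One common way to carry out the least-squares estimation is to solve the normal equations (X'X)B = X'Y. The Python sketch below is a minimal illustration under stated assumptions: the function names are illustrative, the data are hypothetical and generated without error from Y = 1 + 2 X1 + 3 X2, and the small Gaussian elimination routine stands in for a numerical library.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - known) / m[r][r]
    return z


def fit_multiple_regression(rows, y):
    """Return [B0, B1, ..., Bk] minimizing the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]  # column of ones for B0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical observations of (X1, X2); Y is generated without error for illustration.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coefficients])
```

Because the hypothetical data contain no error term, the estimates recover the generating coefficients; with real data the residuals would be non-zero.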

The maximum correlation between the dependent variable and the linear combination of
independent variables is called the multiple correlation and is usually denoted by R. The
value of R lies between 0 and 1. If R = 0, there exists no linear relation between the dependent
variable and the independent variables taken. If R = 1, there exists a perfect linear relation
between the dependent variable and the independent variables taken. R² is the
coefficient of determination, which gives the proportion of variation of the dependent
variable explained by the independent variables.

In the regression equation above, Y is the dependent variable; X1, X2, …, Xk are the
independent variables; B0, B1, B2, …, Bk are the regression coefficients; and e is the error term.

5C.2. Cluster Analysis:

Cluster analysis is a multivariate procedure for detecting groupings in the data. The objects
in these groups may be cases or variables. A cluster analysis of cases resembles
discriminant analysis in one respect – the researcher seeks to classify a set of objects into
groups or categories, but, in cluster analysis, neither the number nor the members of the
groups are known.
Two methods of clustering objects into categories are a) Hierarchical cluster analysis and
b) K-means cluster analysis.

5C.3. Factor Analysis:

Factor analysis is used in exploratory data analysis to a) study the correlations among a
large number of interrelated quantitative variables by grouping the variables into a few
factors; after grouping, the variables within each factor are more highly correlated with
variables in that factor than with variables in other factors, b) interpret each factor
according to the meaning of the variables, c) summarize many variables by a few factors.
