
Statistics - Chapter Notes


STATISTICS

1. DEFINITION :

STATISTICS : A set of concepts, rules and procedures that help us to :

(i) Organize numerical information in the form of tables, graphs and charts;
(ii) Understand statistical techniques underlying decisions that affect our lives and well-being; and
(iii) Make informed decisions.

2. DATA :

Facts, observations and information that come from investigations.


Generally three types of data are used

(i) Ungrouped data, Raw data or individual series :

(ii) Discrete frequency or ungrouped data :


Definition :
The data consist of n distinct values $x_1, x_2, \ldots, x_n$ occurring with frequencies $f_1, f_2, \ldots, f_n$ respectively.
Such data, presented in tabular form, are called a discrete frequency distribution.

(iii) Continuous frequency or grouped data :


Definition :
A continuous frequency distribution is a series in which the data are classified into class intervals
without gaps, each interval being listed with its frequency.

3. MEASURES OF CENTRAL VALUE :

A measure of central value gives a rough idea of where the data points are centred. The mean, the median
and the mode are the three measures of central tendency.

(A) MEAN :
The mean is the most common measure of central tendency and the one that can be manipulated
mathematically. It is the average of a distribution: the sum of all the scores in the distribution ($\sum X$)
divided by the total number of scores ($N$), that is, $\bar{X} = \sum X / N$.


www.rancho.in
(I) Arithmetic mean of individual series (Ungrouped data) :
If the series is $x_1, x_2, x_3, \ldots, x_n$, then the arithmetic mean $\bar{x}$ is given by

$$\bar{x} = \frac{\text{Sum of the series}}{\text{Number of terms}} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$

(II) Arithmetic mean for discrete frequency distribution :


If the terms of the given series are $x_1, x_2, \ldots, x_n$ and the corresponding frequencies are $f_1, f_2, \ldots, f_n$,
then the arithmetic mean $\bar{x}$ is given by

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{N} = \frac{1}{N}\sum_{i=1}^{n} f_i x_i , \qquad \text{where } N = \sum_{i=1}^{n} f_i .$$

(III) Arithmetic mean for grouped or continuous frequency distribution :

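$$\text{Arithmetic mean } (\bar{x}) = A + \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - A),$$

where $A$ = assumed mean, $f_i$ = frequency of the $i$th class, $x_i$ = mid-point of the $i$th class, $N = \sum f_i$
and $x_i - A$ = deviation of each mid-point from the assumed mean.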

(IV) Combined Arithmetic mean :


If $\bar{x}_i$ ($i = 1, 2, \ldots, k$) are the means of $k$ component series of sizes $n_i$ ($i = 1, 2, \ldots, k$) respectively, then
the mean $\bar{x}$ of the composite series obtained by combining the component series is given by the formula

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i} .$$

(V) Weighted Arithmetic Mean :


The weighted arithmetic mean is the arithmetic mean calculated after assigning weights to the different
values of the variable. It is suitable when the items of the variable are not of equal relative importance.
The weighted arithmetic mean is given by

$$\bar{X}_W = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i} .$$

Properties of arithmetic mean :
If each value of a variable 'X' is increased or decreased by some constant k, then the arithmetic mean
is also increased or decreased by k.
Similarly, when the values of the variable 'X' are multiplied or divided by a non-zero constant k, the
arithmetic mean is also multiplied or divided by the same quantity k.
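The mean formulas above share one pattern: a sum of weight times value, divided by the sum of the weights. The following is a minimal Python sketch of that observation, not part of the original notes. The function names and the sample list `raw` are illustrative assumptions; the other figures are taken from the two illustrations that follow.

```python
def arithmetic_mean(values):
    """Mean of raw data: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights.

    With frequencies as weights this is the mean of a frequency
    distribution; with group sizes as weights and group means as
    values it is the combined mean.
    """
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


# Grouped data: class mid-points and frequencies (marks illustration).
mid_points = [15, 25, 35, 45, 55, 65, 75]
frequencies = [2, 3, 8, 14, 8, 3, 2]
grouped_mean = weighted_mean(mid_points, frequencies)  # 1800 / 40 = 45.0

# Combined mean: 50 men averaging 70 kg and 100 women averaging 55 kg.
combined = weighted_mean([70, 55], [50, 100])          # 9000 / 150 = 60.0

# Properties: adding k shifts the mean by k, multiplying by k scales it by k.
raw = [4, 8, 15, 16, 23, 42]  # made-up sample values, mean 18
k = 3
shifted_mean = arithmetic_mean([x + k for x in raw])   # 18 + 3
scaled_mean = arithmetic_mean([x * k for x in raw])    # 18 * 3
```

Using a single `weighted_mean` helper is a design choice of this sketch; it makes explicit that the discrete, combined and weighted means differ only in what plays the role of the weight.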

Illustration :
The mean weight of 150 persons in a group is 60 kg. The mean weight of men in the group is 70 kg and
that of the women is 55 kg. Find the number of men and women.
Sol. Number of persons = 150; their mean weight = 60 kg;
mean weight of men ($\bar{x}_1$) = 70 kg and
mean weight of women ($\bar{x}_2$) = 55 kg.

Let $n_1$ and $n_2$ be the number of men and the number of women respectively.
We know that the total number of persons is $n_1 + n_2 = 150$, or $n_2 = 150 - n_1$.
We also know that the mean weight of all persons is

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

or $60 = \dfrac{70 n_1 + 55 n_2}{150}$
or $70 n_1 + 55(150 - n_1) = 9000$, which on dividing by 5 gives $14 n_1 + 11(150 - n_1) = 1800$
or $3 n_1 = 1800 - 1650 = 150$
or $n_1 = 50$ and $n_2 = 100$.

Illustration :
Find the mean of the following data :

Marks obtained        10-20   20-30   30-40   40-50   50-60   60-70   70-80
Number of students      2       3       8      14       8       3       2

Sol. Method-1 :

Marks obtained   Number of students (fi)   Mid-point (xi)   fi xi
10-20                      2                    15             30
20-30                      3                    25             75
30-40                      8                    35            280
40-50                     14                    45            630
50-60                      8                    55            440
60-70                      3                    65            195
70-80                      2                    75            150
Total                     40                                 1800

$$N = \sum_{i=1}^{7} f_i = 40, \qquad \sum_{i=1}^{7} f_i x_i = 1800$$

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{7} f_i x_i = \frac{1800}{40} = 45$$

Method-2 :
Assumed mean $a = \dfrac{10 + 80}{2} = 45$, class width $h = 10$.

Marks obtained   Number of students (fi)   Mid-point (xi)   di = (xi - 45)/10   fi di
10-20                      2                    15                 -3             -6
20-30                      3                    25                 -2             -6
30-40                      8                    35                 -1             -8
40-50                     14                    45                  0              0
50-60                      8                    55                  1              8
60-70                      3                    65                  2              6
70-80                      2                    75                  3              6
Total                     40                                                       0

$$\bar{x} = a + \frac{\sum_{i=1}^{7} f_i d_i}{N} \times h = 45 + \frac{0}{40} \times 10 = 45$$
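The step-deviation calculation of Method-2 can be checked with a short Python sketch. This is an illustration under stated assumptions, not part of the original notes: the function name `mean_step_deviation` is hypothetical, and equal class widths are assumed. The point it demonstrates is that the result does not depend on which assumed mean is chosen.

```python
def mean_step_deviation(mid_points, frequencies, assumed_mean, width):
    """Mean via step deviations: a + h * sum(f * d) / N, with d = (x - a) / h."""
    total = sum(frequencies)
    deviations = [(x - assumed_mean) / width for x in mid_points]
    weighted_sum = sum(f * d for f, d in zip(frequencies, deviations))
    return assumed_mean + width * weighted_sum / total


mid_points = [15, 25, 35, 45, 55, 65, 75]
frequencies = [2, 3, 8, 14, 8, 3, 2]

# Assumed mean 45, as in Method-2: the deviations cancel, so the mean is 45.
mean_a45 = mean_step_deviation(mid_points, frequencies, 45, 10)

# A different assumed mean (35, chosen arbitrarily) gives the same answer.
mean_a35 = mean_step_deviation(mid_points, frequencies, 35, 10)
```

Choosing an assumed mean near the centre of the table only keeps the hand arithmetic small; it does not change the answer.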

(B) MEDIAN :

(a) Definition : The median is the score that divides the distribution into halves; half of the scores are above
the median and half are below it when the data are arranged in numerical order. The median is also
referred to as the score at the 50th percentile in the distribution.

Calculation of median :

(i) Individual series : If the data are raw, arrange them in ascending or descending order. Let $n$ be the number of
observations.

If $n$ is odd, Median = value of the $\left(\dfrac{n+1}{2}\right)^{\text{th}}$ item.

If $n$ is even, Median = $\dfrac{1}{2}\left[\text{value of the } \left(\dfrac{n}{2}\right)^{\text{th}} \text{ item} + \text{value of the } \left(\dfrac{n}{2} + 1\right)^{\text{th}} \text{ item}\right]$.

(ii) Discrete series : In this case, we first find the cumulative frequencies of the variable arranged in ascending
or descending order. The median is then given by Median = $\left(\dfrac{n}{2} + 1\right)^{\text{th}}$ observation, where $n$ is the total
frequency, i.e. the last cumulative frequency.

(iii) For grouped or continuous distributions : In this case, the following formula can be used.

$$\text{Median} = l + \frac{\dfrac{N}{2} - C}{f} \times i$$

where l = Lower limit of the median class
f = Frequency of the median class
N = The sum of all frequencies
i = The width of the median class
C = The cumulative frequency of the class preceding the median class.
The median class is the class whose cumulative frequency is just greater than N/2.
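The grouped-median formula above can be sketched in Python as follows. This is a hedged illustration, not part of the original notes: the function name `grouped_quantile` and its `fraction` argument are assumptions of the sketch, and equal class widths are assumed. A fraction of 0.5 reproduces the median formula; other fractions locate other cut points of the distribution in the same way. The data are the marks table from the mean illustration above.

```python
def grouped_quantile(lower_limits, frequencies, width, fraction):
    """Value below which `fraction` of a grouped distribution lies.

    Implements l + (fraction * N - C) / f * i, where the class used is
    the first one whose cumulative frequency exceeds fraction * N.
    """
    total = sum(frequencies)
    target = fraction * total
    cumulative = 0  # C: cumulative frequency of the preceding classes
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f > target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    # Only reached when fraction == 1: the upper end of the last class.
    return lower_limits[-1] + width


lower_limits = [10, 20, 30, 40, 50, 60, 70]
frequencies = [2, 3, 8, 14, 8, 3, 2]   # N = 40

median = grouped_quantile(lower_limits, frequencies, 10, 0.5)
# N/2 = 20 falls in class 40-50 (C = 13, f = 14): 40 + 7/14 * 10 = 45.0

first_quarter = grouped_quantile(lower_limits, frequencies, 10, 0.25)
# N/4 = 10 falls in class 30-40 (C = 5, f = 8): 30 + 5/8 * 10 = 36.25
```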

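(b) Quartile : Just as the median divides a distribution into two equal parts, the quartiles, quintiles, deciles
and percentiles divide the distribution into 4, 5, 10 and 100 equal parts respectively. The jth quartile
(j = 1, 2, 3) is given by

$$Q_j = l + \frac{j\,\dfrac{N}{4} - C}{f} \times i ,$$

where l, f, i and C now refer to the class containing the jth quartile. Replacing N/4 by N/10 gives the jth decile.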

Illustration :
The marks obtained by 10 students in an examination are 22, 26, 14, 30, 18, 11, 35, 41, 12, 32. What
is the median mark?
Sol. Number of students (n) = 10 and marks obtained by them = 22, 26, 14, 30, 18, 11, 35, 41, 12, 32.
Arranging the given marks in ascending order, we get 11, 12, 14, 18, 22, 26, 30, 32, 35, 41.
Since the number of students is even, the median of their marks
= Arithmetic mean of the $\left(\dfrac{10}{2}\right)^{\text{th}}$ and $\left(\dfrac{10 + 2}{2}\right)^{\text{th}}$ marks
= Arithmetic mean of the 5th and 6th marks
= $\dfrac{22 + 26}{2}$ = 24 Ans.

Illustration :
Calculate the median of the following data:

Wages per week (in Rs)   10-20   20-30   30-40   40-50   50-60   60-70   70-80
Number of workers          4       6      10      20      10       6       4

Sol. Calculation of median

Wages per week (in Rs.)   Mid-value (xi)   Frequency (fi)   Cumulative frequency
10-20                          15                4                  4
20-30                          25                6                 10
30-40                          35               10                 20
40-50                          45               20                 40
50-60                          55               10                 50
60-70                          65                6                 56
70-80                          75                4                 60
                                          N = Σ fi = 60

Here, N = 60. So N/2 = 30.
The cumulative frequency just greater than N/2 = 30 is 40 and the corresponding class is 40-50.
So, 40-50 is the median class.
∴ l = 40, f = 20, i = 10, C = 20

$$\text{Now, Median} = l + \frac{\dfrac{N}{2} - C}{f} \times i = 40 + \frac{30 - 20}{20} \times 10 = 45 \text{ Ans.}$$

(C) MODE :
Mode is the most frequent score in the distribution. A distribution where a single score is most frequent
has one mode and is called unimodal. When there are ties for the most frequent score, the distribution is
bimodal if two scores tie or multimodal if more than two scores tie.
Mode for continuous series :

$$\text{Mode} = l_1 + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$$

where l1 = The lower limit of the modal class
f1 = The frequency of the modal class
f0 = The frequency of the class preceding the modal class
f2 = The frequency of the class succeeding the modal class
i = The size of the modal class.
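The mode formula for a continuous series can be sketched in Python as below. This is an illustration under stated assumptions rather than a definitive implementation: the function name `grouped_mode` is hypothetical, classes are assumed to have equal width, a missing neighbouring class is treated as having frequency 0, and when several classes tie for the highest frequency the first one is used. The data are the marks table from the mean illustration above.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Mode of grouped data: l1 + (f1 - f0) / (2*f1 - f0 - f2) * i."""
    # Index of the modal class (first class with the highest frequency).
    k = max(range(len(frequencies)), key=lambda j: frequencies[j])
    f1 = frequencies[k]
    # Assumption: a class that does not exist has frequency 0.
    f0 = frequencies[k - 1] if k > 0 else 0
    f2 = frequencies[k + 1] if k + 1 < len(frequencies) else 0
    # Note: the denominator is 0 when f0 == f1 == f2; not handled here.
    return lower_limits[k] + (f1 - f0) / (2 * f1 - f0 - f2) * width


lower_limits = [10, 20, 30, 40, 50, 60, 70]
frequencies = [2, 3, 8, 14, 8, 3, 2]

mode = grouped_mode(lower_limits, frequencies, 10)
# Modal class 40-50 with f1 = 14, f0 = 8, f2 = 8: 40 + 6/12 * 10 = 45.0
```

For this table the mean, the median and the mode all equal 45, as expected for frequencies that are symmetric about the middle class.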

Symmetric distribution :
A distribution is a symmetric distribution if the values of mean, mode and median coincide. In a symmetric
distribution frequencies are symmetrically distributed on both sides of the centre point of the frequency
curve.

mean = median = mode

A distribution which is not symmetric is called a skewed distribution. In a moderately asymmetric
distribution, the interval between the mean and the median is approximately one-third of the interval
between the mean and the mode, i.e., we have the following empirical relation between them:
Mean – Mode = 3 (Mean – Median) ⇒ Mode = 3 Median – 2 Mean.

Positively skewed :
A distribution is positively skewed when it has a tail extending out to the right (larger numbers). When a
distribution is positively skewed, the mean is greater than the median, reflecting the fact that the mean is
sensitive to each score in the distribution and is subject to large shifts when the sample is small and
contains extreme scores.

Mean > Median > Mode

[Figure: positively skewed frequency curve; from left to right on the horizontal axis: Mode, Median, Mean]

Negatively skewed :
A negatively skewed distribution has an extended tail pointing to the left (smaller numbers) and reflects
bunching of numbers in the upper part of the distribution with fewer scores at the lower end of the
measurement scale.

Mean < Median < Mode

[Figure: negatively skewed frequency curve; from left to right on the horizontal axis: Mean, Median, Mode]

Coefficient of skewness = $\dfrac{\text{Mean} - \text{Mode}}{\sigma}$, where $\sigma$ is the standard deviation.

Limitations of central values :


An average, such as the mean or the median only locates the centre of the data and does not tell us
anything about the spread of the data.

4. MEASURES OF SPREAD OR DISPERSION :

Measures of variability provide information about the degree to which individual scores are clustered
about or deviate from the average value in a distribution. In other words, the degree to which numerical
data tend to spread about an average value is called the dispersion of the data. The four measures of
dispersion are
(i) Range (ii) Mean deviation
(iii) Variance (iv) Standard deviation

Important Note :
(a) A small value of a measure of dispersion indicates that the data are clustered closely about the mean
(the mean is therefore representative of the data).
(b) A large value of a measure of dispersion indicates that the mean is not reliable (it is not representative of the data).

(i) Range :
The simplest measure of variability to compute and understand is the range. The range is the difference
between the highest and lowest score in a distribution. Because it is based solely on the most extreme
scores in the distribution and does not fully reflect the pattern of variation within a distribution, the range
is a very limited measure of variability.

Coefficient of range = $\dfrac{L - S}{L + S}$, where
L = Largest value
S = Smallest value

(ii) Mean deviation :


The arithmetic average of the absolute deviations (all deviations taken as positive) from the mean, median
or mode is known as the mean deviation.

(a) Mean deviation from ungrouped data (or individual series)

$$\text{Mean deviation} = \frac{1}{N}\sum_{i=1}^{n} |x_i - M| ,$$

where $\sum_{i=1}^{n} |x_i - M|$ is the sum of the moduli of the deviations of the variate from the central value M
(mean, median or mode) and N is the number of terms.

(b) Mean deviation from continuous series :


Here we first find the central value M (mean, median or mode) from which the deviations are to be taken.
Then we find the deviation $|x_i - M|$ of each variate (class mid-point) from M and multiply these
deviations by the corresponding frequencies.

$$\text{So, Mean deviation} = \frac{1}{N}\sum_{i=1}^{n} f_i\,|x_i - M| , \qquad \text{where } N = \sum_{i=1}^{n} f_i .$$

Illustration :
The scores of a batsman in ten innings are : 38, 70, 48, 34, 42, 55, 63, 46, 54, 44. Find the mean
deviation about the median.
Sol. Arranging the data in ascending order, we have
34, 38, 42, 44, 46, 48, 54, 55, 63, 70
Here n = 10. So, the median is the A.M. of the 5th and 6th observations.
∴ Median, M = $\dfrac{46 + 48}{2}$ = 47

Calculation of Mean Deviation

xi      |di| = |xi – 47|
38            9
70           23
48            1
34           13
42            5
55            8
63           16
46            1
54            7
44            3
Total   Σ |di| = 86

∴ M.D. = $\dfrac{1}{n}\sum |d_i| = \dfrac{86}{10}$ = 8.6 Ans.

Illustration :
Calculate the mean deviation from the median of the following data:

Age      16-20   21-25   26-30   31-35   36-40   41-45   46-50   51-55
Number     5       6      12      14      26      12      16       9

The given data do not form a continuous frequency distribution, but we can make them continuous by
subtracting 0.5 from the lower limit and adding 0.5 to the upper limit of every class.
Sol. Calculation of Mean Deviation from Median

Age         Mid-value (xi)   Frequency (fi)   Cumulative frequency   |di| = |xi – 38|   fi |di|
15.5-20.5        18                5                   5                   20             100
20.5-25.5        23                6                  11                   15              90
25.5-30.5        28               12                  23                   10             120
30.5-35.5        33               14                  37                    5              70
35.5-40.5        38               26                  63                    0               0
40.5-45.5        43               12                  75                    5              60
45.5-50.5        48               16                  91                   10             160
50.5-55.5        53                9                 100                   15             135
                            N = Σ fi = 100                                        Σ fi |di| = 735

Here, N = 100. So N/2 = 50.
The cumulative frequency just greater than N/2 = 50 is 63 and the corresponding class is 35.5-40.5.
So, 35.5-40.5 is the median class.
∴ l = 35.5, f = 26, h = 5, C = 37

$$\text{Now, Median} = l + \frac{\dfrac{N}{2} - C}{f} \times h = 35.5 + \frac{50 - 37}{26} \times 5 = 38 \text{ Ans.}$$

$$\text{Mean deviation from median} = \frac{\sum f_i |d_i|}{N} = \frac{735}{100} = 7.35 \text{ Ans.}$$
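Both mean-deviation illustrations above can be reproduced with a short Python sketch. It is offered as a check of the arithmetic under stated assumptions, not as part of the original notes: the function names are hypothetical, and `statistics.median` from the standard library is used for the raw scores (for an even count it averages the two middle values, as the notes do).

```python
from statistics import median


def mean_deviation(values, centre):
    """Average absolute deviation of raw values from a chosen centre."""
    return sum(abs(x - centre) for x in values) / len(values)


def mean_deviation_grouped(mid_points, frequencies, centre):
    """Frequency-weighted average absolute deviation of class mid-points."""
    total = sum(frequencies)
    return sum(f * abs(x - centre) for x, f in zip(mid_points, frequencies)) / total


# Batsman's scores: the median is 47 and the deviations sum to 86.
scores = [38, 70, 48, 34, 42, 55, 63, 46, 54, 44]
md_scores = mean_deviation(scores, median(scores))        # 86 / 10

# Age table: mid-points of the continuous classes, median 38 from the formula.
age_mid_points = [18, 23, 28, 33, 38, 43, 48, 53]
age_frequencies = [5, 6, 12, 14, 26, 12, 16, 9]
md_ages = mean_deviation_grouped(age_mid_points, age_frequencies, 38)  # 735 / 100
```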

(iii) Variance or Var(X) or $\sigma^2$ :
The variance is a measure based on the deviations of individual scores from the mean. Simply summing
the deviations, however, always gives 0, because the positive and negative deviations cancel. To get around
this problem, the variance is based on the squared deviations of scores about the mean. When the deviations
are squared, the rank order of the scores' distances from the mean is preserved while negative
values are eliminated. Then, to control for the number of subjects in the distribution, the sum of the
squared deviations, $\sum (X - \bar{X})^2$, is divided by N (the population size). This average of the squared
deviations is called the variance.

(a) Variance of individual observations :


If $x_1, x_2, \ldots, x_n$ are n values of a variable X, then

$$\operatorname{Var}(X) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{X})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$$

= Mean of squares – Square of mean

(b) Variance of discrete frequency distribution :


If $x_1, x_2, \ldots, x_n$ are n values of a variable X and their corresponding frequencies are $f_1, f_2, \ldots, f_n$, then

$$\operatorname{Var}(X) = \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{X})^2 = \frac{1}{N}\sum_{i=1}^{n} f_i x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{n} f_i x_i\right)^2 , \qquad \text{where } N = \sum_{i=1}^{n} f_i .$$

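(c) Variance of a grouped or continuous frequency distribution :

$$\operatorname{Var}(X) = h^2\left[\frac{1}{N}\sum f_i u_i^2 - \left(\frac{1}{N}\sum f_i u_i\right)^2\right], \qquad u_i = \frac{x_i - A}{h}$$

where h = Class width, $x_i$ = mid-point of the ith class and A = an assumed mean (any convenient constant).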


Properties :
(1) Let $x_1, x_2, x_3, \ldots, x_n$ be n values of a variable X. If these values are changed to $x_1 + a, x_2 + a, \ldots, x_n + a$,
where $a \in R$, then the variance remains unchanged.
(2) Let $x_1, x_2, \ldots, x_n$ be n values of a variable X and let 'a' be a non-zero real number. Then the variance of the
observations $ax_1, ax_2, \ldots, ax_n$ is $a^2\,\operatorname{Var}(X)$.
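The variance formulas and the two properties above can be checked numerically with a minimal Python sketch. This is an illustration, not part of the original notes: the function names and the constants 5 and 3 are arbitrary choices, and the sample values are the seven observations that appear in an illustration later in these notes (mean 8, variance 16).

```python
def variance(values):
    """Population variance: mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_shortcut(values):
    """Same quantity as 'mean of squares minus square of mean'."""
    n = len(values)
    return sum(x * x for x in values) / n - (sum(values) / n) ** 2


data = [2, 4, 10, 12, 14, 6, 8]
var_data = variance(data)                            # 112 / 7 = 16.0
var_short = variance_shortcut(data)                  # 560 / 7 - 64 = 16.0

# Property (1): adding a constant leaves the variance unchanged.
var_shifted = variance([x + 5 for x in data])

# Property (2): multiplying by a scales the variance by a squared.
var_scaled = variance([3 * x for x in data])         # 9 * 16
```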

(iv) Standard Deviation :
The standard deviation (s or $\sigma$) is defined as the positive square root of the variance. The variance is a
measure in squared units and is therefore hard to interpret directly against the data. The standard
deviation, by contrast, is a measure of variability expressed in the same units as the data. It behaves very
much like a mean or "average" of the deviations from the mean.

Combined Standard Deviation :


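If there are two sets of observations containing $n_1$ and $n_2$ items with respective means $\bar{x}_1$ and $\bar{x}_2$ and
standard deviations $\sigma_1$ and $\sigma_2$, then the mean $\bar{x}$ and the standard deviation $\sigma$ of the $n_1 + n_2$ observations,
taken together, are given by

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

$$\sigma^2 = \frac{1}{n_1 + n_2}\left[n_1\left(\sigma_1^2 + d_1^2\right) + n_2\left(\sigma_2^2 + d_2^2\right)\right]$$

where $d_1 = \bar{x} - \bar{x}_1$, $d_2 = \bar{x} - \bar{x}_2$.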
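The combined standard deviation formula can be verified against a direct calculation on the pooled data. The Python sketch below is illustrative only: the function name `combined_mean_sd` and the two small sample sets are made up for the check, and `mean` and `pstdev` (population standard deviation) come from the standard `statistics` module.

```python
from math import sqrt
from statistics import mean, pstdev


def combined_mean_sd(n1, mean1, sd1, n2, mean2, sd2):
    """Mean and standard deviation of two sets taken together."""
    n = n1 + n2
    combined_mean = (n1 * mean1 + n2 * mean2) / n
    d1 = combined_mean - mean1
    d2 = combined_mean - mean2
    combined_var = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / n
    return combined_mean, sqrt(combined_var)


# Made-up sample sets used only to check the formula.
set_a = [2, 4, 6]
set_b = [1, 3, 5, 7, 9]

from_formula = combined_mean_sd(
    len(set_a), mean(set_a), pstdev(set_a),
    len(set_b), mean(set_b), pstdev(set_b),
)

# Direct calculation on the pooled observations for comparison.
pooled = set_a + set_b
direct = (mean(pooled), pstdev(pooled))
```

The two results agree, which shows that only the sizes, means and standard deviations of the component sets are needed, not the observations themselves.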

Illustration :
Calculate the mean and standard deviation of first n natural numbers.
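Sol. Here $x_i = i$, $i = 1, 2, \ldots, n$. Let $\bar{X}$ be the mean and $\sigma$ be the S.D. Then,

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{i=1}^{n} i = \frac{1}{n}(1 + 2 + 3 + \cdots + n)$$

$$\Rightarrow \bar{X} = \frac{n(n+1)}{2n} = \frac{n+1}{2}$$

$$\text{and } \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2 = \frac{1}{n}\left(1^2 + 2^2 + \cdots + n^2\right) - \left(\frac{n+1}{2}\right)^2$$

$$\Rightarrow \sigma^2 = \frac{n(n+1)(2n+1)}{6n} - \left(\frac{n+1}{2}\right)^2 = \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} = \frac{n^2 - 1}{12}$$

$$\Rightarrow \sigma = \sqrt{\frac{n^2 - 1}{12}} \text{ Ans.}$$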
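The closed forms for the first n natural numbers, mean (n + 1)/2 and variance (n² – 1)/12, can be spot-checked numerically. The Python sketch below is only a sanity check for a few values of n, not a proof; the function name `first_n_stats` is a hypothetical helper.

```python
def first_n_stats(n):
    """Mean and population variance of 1, 2, ..., n computed directly."""
    values = range(1, n + 1)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    return mean, var


# Compare the direct computation with the closed forms for several n.
checks = []
for n in (1, 2, 5, 10, 100):
    mean, var = first_n_stats(n)
    checks.append(
        abs(mean - (n + 1) / 2) < 1e-9 and abs(var - (n * n - 1) / 12) < 1e-9
    )

all_match = all(checks)

mean_10, var_10 = first_n_stats(10)   # 5.5 and 99 / 12 = 8.25
```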

Illustration :
The mean and variance of 7 observations are 8 and 16 respectively. If 5 of the observations are
2, 4, 10, 12, 14, find the remaining two observations.
Sol. Let x and y be the remaining two observations. Then,
Mean = 8
$\Rightarrow \dfrac{2 + 4 + 10 + 12 + 14 + x + y}{7} = 8 \Rightarrow 42 + x + y = 56$
$\Rightarrow x + y = 14$ .....(i)
Variance = 16
$\Rightarrow \dfrac{1}{7}\left(2^2 + 4^2 + 10^2 + 12^2 + 14^2 + x^2 + y^2\right) - (\text{Mean})^2 = 16$
$\Rightarrow \dfrac{1}{7}\left(4 + 16 + 100 + 144 + 196 + x^2 + y^2\right) - 64 = 16 \Rightarrow 460 + x^2 + y^2 = 7 \times 80$
$\Rightarrow x^2 + y^2 = 100$ .....(ii)
Now, $(x + y)^2 + (x - y)^2 = 2(x^2 + y^2)$
$\Rightarrow 196 + (x - y)^2 = 2 \times 100 \Rightarrow (x - y)^2 = 4 \Rightarrow x - y = \pm 2$

If x – y = 2, then x + y = 14 and x – y = 2 ⇒ x = 8, y = 6
If x – y = –2, then x + y = 14 and x – y = –2 ⇒ x = 6, y = 8
Hence, the remaining two observations are 6 and 8.

Illustration :
Find the variance and standard deviation for the following distribution:

Classes     30-40   40-50   50-60   60-70   70-80   80-90   90-100
Frequency     3       7      12      15       8       3       2

Sol. Calculation of Variance and Standard Deviation

Class     Frequency (fi)   Mid-point (xi)   yi = (xi – 65)/10   yi²   fi yi   fi yi²
30-40           3                35                –3             9     –9      27
40-50           7                45                –2             4    –14      28
50-60          12                55                –1             1    –12      12
60-70          15                65                 0             0      0       0
70-80           8                75                 1             1      8       8
80-90           3                85                 2             4      6      12
90-100          2                95                 3             9      6      18
Total        N = 50                                                    –15     105

$$\text{Therefore } \bar{x} = A + \frac{\sum f_i y_i}{N} \times h = 65 - \frac{15}{50} \times 10 = 62$$

$$\text{Variance } \sigma^2 = \frac{h^2}{N^2}\left[N \sum f_i y_i^2 - \left(\sum f_i y_i\right)^2\right] = \frac{(10)^2}{(50)^2}\left[50 \times 105 - (-15)^2\right] = \frac{1}{25}\,[5250 - 225] = 201$$

and standard deviation $(\sigma) = \sqrt{201} = 14.18$ Ans.

Illustration :
The mean and standard deviation of 20 observations are found to be 10 and 2 respectively. On rechecking,
it was found that an observation 8 was incorrect. Calculate the correct mean and standard deviation in
each of the following cases:
(i) If the wrong item is omitted.
Sol. We have n = 20, $\bar{X}$ = 10 and $\sigma$ = 2.
$\bar{X} = \dfrac{1}{n}\sum x_i \Rightarrow \sum x_i = n\bar{X} = 20 \times 10 = 200 \Rightarrow$ Incorrect $\sum x_i = 200$
and $\sigma = 2 \Rightarrow \sigma^2 = 4 \Rightarrow \dfrac{1}{n}\sum x_i^2 - (\text{Mean})^2 = 4$
$\Rightarrow \dfrac{1}{20}\sum x_i^2 - 100 = 4 \Rightarrow \sum x_i^2 = 104 \times 20 \Rightarrow$ Incorrect $\sum x_i^2 = 2080$
When 8 is omitted from the data, 19 observations are left.
Now Incorrect $\sum x_i = 200 \Rightarrow$ Correct $\sum x_i + 8 = 200 \Rightarrow$ Correct $\sum x_i = 192$
and Incorrect $\sum x_i^2 = 2080 \Rightarrow$ Correct $\sum x_i^2 + 8^2 = 2080 \Rightarrow$ Correct $\sum x_i^2 = 2016$

∴ Correct mean = $\dfrac{192}{19} \approx 10.11$

Correct variance = $\dfrac{1}{19}\left(\text{Correct } \sum x_i^2\right) - (\text{Correct mean})^2 = \dfrac{2016}{19} - \left(\dfrac{192}{19}\right)^2 = \dfrac{38304 - 36864}{361} = \dfrac{1440}{361}$

∴ Correct standard deviation = $\sqrt{\dfrac{1440}{361}} = \dfrac{12\sqrt{10}}{19} \approx 1.997$

Analysis of Frequency Distributions :


Measures of dispersion cannot be used to compare two or more series which are measured in different units,
even if they have the same mean. Thus, we require a measure which is independent of the units.
The measure of variability which is independent of units is called the coefficient of variation (C.V.). The
coefficient of variation is defined as

$$\text{C.V.} = \frac{\sigma}{\bar{X}} \times 100$$

where $\sigma$ and $\bar{X}$ are the standard deviation and mean of the data.
For comparing the variability of two series, we calculate the coefficient of variation for each series. The
series having the greater C.V. is said to be more variable (or, conversely, less consistent, less uniform, less
stable or less homogeneous) than the other, and the series having the lesser C.V. is said to be more consistent
(or homogeneous) than the other.
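The comparison rule above can be sketched in a few lines of Python. This is a hedged illustration, not part of the original notes: the function name `coefficient_of_variation` is hypothetical, and the means and variances are those given in the height and weight illustration that follows.

```python
from math import sqrt


def coefficient_of_variation(standard_deviation, mean):
    """C.V. = standard deviation / mean * 100, a unit-free percentage."""
    return standard_deviation / mean * 100


# Heights: mean 162.6 cm, variance 127.69 cm^2 (standard deviation 11.3 cm).
cv_height = coefficient_of_variation(sqrt(127.69), 162.6)   # roughly 6.9

# Weights: mean 52.36 kg, variance 23.1361 kg^2 (standard deviation 4.81 kg).
cv_weight = coefficient_of_variation(sqrt(23.1361), 52.36)  # roughly 9.2

# The series with the larger C.V. is the more variable one.
weights_more_variable = cv_weight > cv_height
```

Because the units cancel in the ratio, the two values can be compared directly even though one series is in centimetres and the other in kilograms.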

Illustration :
The following values are calculated in respect of heights and weights of the students of a section of
Class XI :
             Height         Weight
Mean         162.6 cm       52.36 kg
Variance     127.69 cm²     23.1361 kg²

Can we say that the weights show greater variation than the heights ?
Sol. To compare the variability, we have to calculate their coefficients of variation.
Given Variance of height = 127.69 cm²
Therefore Standard deviation of height = $\sqrt{127.69}$ cm = 11.3 cm
Also Variance of weight = 23.1361 kg²
Therefore Standard deviation of weight = $\sqrt{23.1361}$ kg = 4.81 kg
Now, the coefficients of variation (C.V.) are given by

(C.V.) in heights = $\dfrac{\text{Standard Deviation}}{\text{Mean}} \times 100 = \dfrac{11.3}{162.6} \times 100 = 6.95$

and (C.V.) in weights = $\dfrac{4.81}{52.36} \times 100 \approx 9.19$

Clearly the C.V. in weights is greater than the C.V. in heights.
Therefore, we can say that weights show more variability than heights.

IMPORTANT DEFINITIONS :
1. Raw Data :
Data collected in original form.

2. Frequency :
The number of times a certain value or class of values occurs.

3. Frequency Distribution :
The organization of raw data in table form with classes and frequencies.

4. Categorical Frequency Distribution :


A frequency distribution in which the data are only nominal or ordinal.

5. Ungrouped Frequency Distribution :


A frequency distribution of numerical data. The raw data is not grouped.

6. Grouped Frequency Distribution :


A frequency distribution where several numbers are grouped into one class.

7. Class Limits :
The values that separate one class in a grouped frequency distribution from another. The limits could
actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the next.

8. Class Boundaries :
The values that separate one class in a grouped frequency distribution from another without gaps. The
boundaries have one more decimal place than the raw data and therefore do not appear in the data. There is
no gap between the upper boundary of one class and the lower boundary of the next class. The lower class
boundary is found by subtracting 0.5 units from the lower class limit and the upper class boundary is found
by adding 0.5 units to the upper class limit.

9. Class Width :
The difference between the upper and lower boundaries of any class. The class width is also the difference
between the lower limits of two consecutive classes or the upper limits of two consecutive classes. It is
not the difference between the upper and lower limits of the same class.

10. Class Mark (Midpoint) :


The number in the middle of the class. It is found by adding the upper and lower limits and dividing by
two. It can also be found by adding the upper and lower boundaries and dividing by two.

11. Cumulative Frequency :


The number of values less than the upper class boundary for the current class. This is a running total of
the frequencies.

12. Relative Frequency :


The frequency divided by the total frequency. This gives the proportion (or, multiplied by 100, the percent) of values falling in that class.

13. Cumulative Relative Frequency (Relative Cumulative Frequency) :


The running total of the relative frequencies, or the cumulative frequency divided by the total frequency.
It gives the proportion of the values which are less than the upper class boundary.

14. Histogram :
A graph which displays the data by using vertical bars of various heights to represent frequencies.
The horizontal axis can be either the class boundaries, the class marks, or the class limits.

15. Frequency Polygon :


A line graph. The frequency is placed along the vertical axis and the class midpoints are placed along the
horizontal axis. These points are connected with lines.

16. Ogive :
A frequency polygon of the cumulative frequency or the relative cumulative frequency. The vertical axis
is the cumulative frequency or relative cumulative frequency. The horizontal axis is the class boundaries.
The graph always starts at zero at the lowest class boundary and will end up at the total frequency
(for a cumulative frequency) or 1.00 (for a relative cumulative frequency).
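Several of the definitions above (class boundaries, cumulative frequency, relative frequency and cumulative relative frequency) translate directly into code. The Python sketch below is illustrative only: the function names are hypothetical, the `unit` argument (the gap between consecutive class limits) is an assumption of the sketch, and the frequencies are those of the wages table used earlier in these notes.

```python
from itertools import accumulate


def class_boundaries(lower_limit, upper_limit, unit=1):
    """Boundaries: subtract half a unit from the lower limit, add half to the upper."""
    return lower_limit - unit / 2, upper_limit + unit / 2


def frequency_columns(frequencies):
    """Cumulative, relative and cumulative relative frequencies."""
    total = sum(frequencies)
    cumulative = list(accumulate(frequencies))       # running total
    relative = [f / total for f in frequencies]
    cumulative_relative = [c / total for c in cumulative]
    return cumulative, relative, cumulative_relative


# Class limits 16-20 become boundaries 15.5-20.5.
boundaries = class_boundaries(16, 20)

# Wages table frequencies, total 60.
wage_frequencies = [4, 6, 10, 20, 10, 6, 4]
cumulative, relative, cumulative_relative = frequency_columns(wage_frequencies)
# cumulative is [4, 10, 20, 40, 50, 56, 60]; the last cumulative relative value is 1.0
```

The cumulative columns are the values an ogive plots against the upper class boundaries.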
