Quant For Student
Organization of data:
It is the summarization of data in some meaningful way, e.g. in table form.
Presentation of the data:
It is the process of re-organization, classification, compilation, and summarization of data to
present it in a meaningful form.
Analysis of data: The process of extracting relevant information from the summarized
data, mainly through the use of elementary mathematical operations.
Interpretation of data:
The process of drawing conclusions and making inferences from the statistical measures
obtained through the analysis of the data.
1.2. Some Basic Concepts in statistics
1) A (statistical) population is the complete set of possible measurements for which
inferences are to be made. The population represents the target of an investigation, and
the objective of the investigation is to draw conclusions about the population; hence we
sometimes call it the target population.
Examples:-
Population of trees under specified climatic conditions
Population of animals fed a certain type of diet
Population of farms having a certain type of natural fertility
Population of households, etc
The population could be finite or infinite (an imaginary collection of units)
There are two ways of investigation: Census and sample survey.
2) Census: a complete enumeration of the population. In most real problems a census cannot
be carried out, hence we take a sample.
3) Sample: A sample from a population is the set of measurements that are actually
collected in the course of an investigation. It should be selected using some pre-defined
sampling technique in such a way that it represents the population very well.
Examples:-
Monthly production data of a certain factory in the past 10 years.
Small portion of a finite population.
In practice, we usually do not conduct a census; instead we conduct a sample survey.
4) Parameter: A value describing a characteristic of a population.
5) Statistic: A characteristic or measure obtained from a sample.
6) Sampling: The process or method of sample selection from the population.
7) Sample size: The number of elements or observations to be included in the sample.
8) Data: These are pieces of information that represent the qualitative or
quantitative attributes of a variable or set of variables. Data are typically the results of
measurements and can be the basis of graphs, images or observations of a set of
variables. Data are often viewed as the lowest level of abstraction from which
information and knowledge are derived for statistical analysis.
9) Variable: This is any quality that can have a number of values, which may be either
discrete or continuous. A variable is a property that can take on different values.
Individuals in a class may differ in sex, age, intelligence, height, etc. These properties are
variables. Variables could vary in quality or in quantity. Constants, unlike variables, do
not assume different values.
Quantitative Variables are numerical variables and can be measured. Examples
include the balance in a checking account and the number of children in a family. This type of
variable assumes values that vary in terms of magnitude. They are easy to measure and
compare with others, e.g. weight, height, age, distance, marks obtained in a test, etc. Note
that quantitative variables are either discrete (which can assume only certain values,
and there are usually "gaps" between the values, such as the number of bedrooms in
your house) or continuous (which can assume any value within a specific range, such as
the air pressure in a tire).
Qualitative Variables are nonnumeric variables and cannot be measured numerically.
This type of variable differs in kind. They are only categorized, e.g. gender, nationality,
socio-economic status, academic qualifications, marital status.
Independent Variable: These variables can be manipulated or treated. The effect is
reflected on the dependent variable. The value of the dependent variable thus depends
on that of the independent variable. Note that in graphing, the dependent variable is
placed on the vertical axis (y-axis) while the independent variable is placed on the
horizontal axis (x-axis).
Discrete Variable: This is a variable that can be counted, or for which there is a fixed
set of values. For example, the number of votes in an election is a discrete variable.
Continuous Variable: This concept is characterized by being related to some
numerical scale of measurement, any interval of which may, if desired, be subdivided
into an infinite number of values, e.g. length, height, weight, temperature, volume and
time.
10) Distribution: This is the arrangement of a set of numbers classified according to some
properties or attributes such as age, height, weight, etc.
Applications, Uses and Limitations of statistics.
Applications of statistics:
• Statistics is applied in almost all fields of human endeavor.
• Almost all human beings deal with numerical facts in their daily life.
• It is applicable in processes such as the development of certain drugs and the assessment
of the extent of environmental pollution.
• It is used in industries, especially in the area of quality control.
Uses of statistics:
The main function of statistics is to enlarge our knowledge of complex phenomena. The
following are some uses of statistics:
1. It presents facts in a definite and precise form.
2. Data reduction.
3. Measuring the magnitude of variations in data.
4. Furnishing a technique of comparison.
5. Estimating unknown population characteristics.
6. Formulating and testing hypotheses.
7. Studying the relationship between two or more variables.
8. Forecasting future events.
Limitations of statistics
Statistics deals mainly with those subjects of investigation which are capable of being
quantitatively measured and numerically expressed. But, nowadays we can apply
statistics to study both quantitative and qualitative aspects.
Statistics deals only with aggregates of facts, and no importance is attached to individual
items in an aggregate. It is, therefore, limited to those problems where group
characteristics are to be studied. For example, if the mean age of a class of
students is 22 years, it does not mean that each and every student is 22 years old. It is
simply the average of the class.
Lack of Exactitude - Statistical data are only approximate and not mathematically
exact. This means that by observing only a limited number of items we make an
estimate of the characteristic of the entire population. It is well known that the mathematical
and physical sciences are exact, whereas statistical laws are only
approximations. Statistical conclusions may not have universal validity.
Misuse of Records - Statistics must be used only by experts; otherwise statistical
methods are the most dangerous tools in the hands of inexperienced people. The use
of statistical tools by untrained persons might lead to wrong conclusions. Statistics may be
easily misused by quoting wrong figures to achieve a selfish interest. Therefore, it
should be used by experts.
1.3. Measurement level
Proper knowledge about the nature and type of data to be dealt with is essential in order to
specify and apply the proper statistical method for their analysis and inferences. Measurement is
the assignment of numbers or values to objects or events in a systematic fashion.
Measurement scale refers to the property of the values assigned to the data. Four levels of
measurement scales are commonly distinguished: nominal, ordinal, interval, and ratio scale,
and each possesses different properties of measurement systems.
A. The Nominal or Classificatory Scale
A nominal scale enables the classification of individuals, objects or responses based on a
common/shared property or characteristic. These people, objects or responses are divided into
a number of sub-groups in such a way that each member of the sub-group has a common
characteristic. A variable measured on a nominal scale may have one, two or more
sub-categories depending upon the extent of variation. Nominal scale variables are those qualitative
variables which show the category of individuals. Numbers are assigned to the variables simply for
coding purposes. It is not possible to compare individuals based on the numbers assigned to
them. No arithmetic or relational (greater than, less than) operations can be applied.
For example, suppose Abass wears number 6 when playing football for the Awassa Kenema team, and he
scored 10 marks in a chemistry test. Further, suppose Gemeda wears number 9 when he plays
football for the Awassa Kenema team, and Gemeda scored 7 marks in the chemistry test. Based on
the numbers on the shirts, it is not possible to judge whether Gemeda plays better than Abass. Using
test scores, however, it is possible to judge that Abass performed better than Gemeda in the
chemistry test. The mean mark of the two students can also be obtained, but it is not meaningful to
find the mean of the shirt numbers because they are simply codes.
Another example: when we collect data we may code for sex: 1 = male and 2 = female. This does
not show that male is greater than female, nor does it show that female is twice male.
The numbers are simply representations. The sequence in which sub-groups are listed makes no
difference, as there is no relationship among sub-groups. What we can do is count the number
of males and the number of females.
Other examples of variables which are measured on a nominal scale include religion, ethnicity,
color, marital status (married, single, widowed, divorced), country code and regional differentiation
of Ethiopia.
B. Ordinal or Ranking Scale
An ordinal scale has all the properties of a nominal scale plus one of its own. Besides
categorizing individuals, objects, responses or a property into sub-groups on the basis of
common characteristics, it ranks the sub-groups in a certain order. It refers to the variables
whose values can be ordered or ranked. The arrangement can be either ascending or descending
order. For example: developed, less developed, least developed; less degraded, highly degraded,
etc. The use of an ordinal scale implies a statement of greater than or less than or equal to
without being able to state how much greater or less.
Generally,
• It is the level of measurement which classifies data into categories that can be ranked, but
precise differences between the ranks cannot be measured.
• Arithmetic operations are not applicable but relational operations are applicable.
• Ordering is the sole additional property of the ordinal scale.
Examples: Letter grades (A, B, C, D, F), Rating scales (Excellent, Very good, Good, Fair, poor)
and the like.
C. The Interval Scale
An interval scale has all the characteristics of an ordinal scale and, in addition, uses a unit of
measurement that enables values of a variable to be placed at equally spaced intervals in relation
to the spread of the variable. This scale has a starting and a terminating point, and the gap between
them is divided into equally spaced units/intervals. The starting and terminating points and the
number of units/intervals between them vary from scale to scale. The Centigrade and Fahrenheit scales
are examples of the interval scale. In the Centigrade system the starting point (the freezing point of
water) is 0 °C and the terminating point (the boiling point of water) is 100 °C. The gap between the
freezing and boiling points is divided into 100 equally spaced intervals, known as degrees. Each degree
or interval is a measurement of temperature: the higher the degree, the higher the temperature. As the
starting and terminating points are arbitrary, they are not absolute. For example, zero does not show
absence of temperature. Therefore, the interval scale is relative in nature. Relational operations are
possible and differences between readings are meaningful (addition and subtraction can be applied),
but because there is no true zero, ratios of readings are not meaningful: 40 °C is not twice as hot as 20 °C.
D. Ratio Scale
A ratio scale has all the properties of nominal, ordinal and interval scales plus its own property:
the zero point of a ratio scale is fixed, which means it has a fixed starting point. Therefore, it is
an absolute scale: the difference between the intervals is always measured from the zero point.
Zero shows the absence of something in this case. For example, if the yield is zero it shows no
yield. It also allows ratios of numbers to be meaningfully interpreted, e.g. the ratio of Bekele's
height to Martha's height is 1.32, whereas this is not possible with interval scales.
In the ratio scale all mathematical operations can be used. The measurements of income, age, height
and weight are examples of this scale. A person who is 40 years of age is twice as old as a
20-year-old. A person earning 60,000 birr per year earns three times the salary of a person earning
20,000 birr. The following table summarizes the characteristics of the four scales.
Table 1. Characteristics and examples of the four measurement scales
Measurement scale: Nominal/classificatory
Examples: Tree, house, taxi, etc.; Gender: male/female; Attitude: favorable/unfavorable; Political parties: EPRDF/democrat, republican; Religion: Christian/Muslim/Hindu, etc.
Measurement scale: Ordinal/ranking
Examples: Income: above average, average, below average; Socio-economic status: upper, middle, low; Attitude: strongly ...
Measurement scale: Interval
Examples: Temperature: centigrade, Fahrenheit; Altitudinal scale: 10-20, 20-30, 30-40, etc.
Measurement scale: Ratio
Examples: Height: cm, inches, etc.; Income: birr; Age: year/month; Weight: kg
\[ \frac{20 + 29}{2} = \frac{49}{2} = 24.5 \]
Class width is the size of the class interval, and it is obtained by subtracting the lower class
boundary from the upper class boundary. Using table 3 again, the class width for the first class
is 29.5 - 19.5 = 10.
Example 1: The following data represent the record of high temperatures (in °C) for 50 major
towns in Ethiopia. Construct a grouped frequency distribution for the data.
28, 34, 19, 19, 28, 41, 26, 26, 30, 21, 30, 21, 23, 21, 16, 19, 32, 29, 25, 30,
30, 28, 29, 22, 33, 27, 27, 30, 23, 31, 24, 37, 35, 29, 30, 23, 23, 28, 21, 32,
20, 21, 29, 29, 28, 30, 31, 30, 27, 29.
Solution: use the following procedure for constructing a grouped frequency distribution.
Step 1: Determine the number of classes (n = 50):
\[ K = 1 + 3.3\log_{10} n = 1 + 3.3\log_{10} 50 = 1 + 3.3(1.699) = 1 + 5.6 = 6.6 \approx 7 \text{ (seven classes)} \]
Step 2: Find the highest value and the lowest value: H = 41 and L = 16
Step 3: Find the range (R): R = highest value - lowest value = H - L, so R = 41 - 16 = 25
Step 4: Find the class width by dividing the range by the number of classes.
\[ \text{width} = \frac{R}{\text{number of classes}} = \frac{25}{7} \approx 3.6 \]
This is rounded up to the nearest whole number, so the class width is 4.
Step 5: Select a starting point for the lowest class limit. This can be the smallest data value or
any convenient number less than the smallest data value. In this case, 16 is used. Add the class
width to the starting point to get the lower limit of the next class. Keep
adding until there are 7 classes: 16, 20, 24, etc.
Step 6: Subtract one unit from the lower limit of the second class to get the upper limit of the
first class, i.e. 20 - 1 = 19. Then add the class width to each upper limit to get all the upper limits.
Thus, the first class is 16-19, the second class is 20-23, and so on.
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5
to each upper class limit: 16 - 0.5 = 15.5 and 19 + 0.5 = 19.5, so the class boundaries for the
first class are 15.5-19.5, for the second class 19.5-23.5, etc.
Step 8: Count the data to find the numerical frequencies.
Cumulative frequency also can be computed by adding the frequency in each class to the total
of the frequencies of the classes preceding that class.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
16-19          15.5-19.5          4           4
20-23          19.5-23.5          11          15
24-27          23.5-27.5          7           22
28-31          27.5-31.5          21          43
32-35          31.5-35.5          5           48
36-39          35.5-39.5          1           49
40-43          39.5-43.5          1           50
The frequency distribution shows that the class 28-31 contains the largest number of towns
(21), followed by the class 20-23 with 11 towns. Hence, most of the towns (39) have
temperatures between 19.5 °C and 31.5 °C.
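The eight steps above can be sketched in Python. This is a minimal illustration rather than a general-purpose routine: the function name `grouped_frequency` is made up for this sketch, and it assumes whole-number data, the 1 + 3.3 log10(n) rule rounded to the nearest integer, and the smallest observation as the starting point, as in Example 1.

```python
import math
from itertools import accumulate

# High temperatures (in °C) for the 50 towns of Example 1.
temps = [28, 34, 19, 19, 28, 41, 26, 26, 30, 21, 30, 21, 23, 21, 16, 19, 32,
         29, 25, 30, 30, 28, 29, 22, 33, 27, 27, 30, 23, 31, 24, 37, 35, 29,
         30, 23, 23, 28, 21, 32, 20, 21, 29, 29, 28, 30, 31, 30, 27, 29]

def grouped_frequency(data):
    """Return rows (lower, upper, lower_boundary, upper_boundary, f, cum_f).

    Hypothetical helper: assumes integer data and starts at the minimum.
    """
    n = len(data)
    k = round(1 + 3.3 * math.log10(n))       # Step 1: number of classes
    low, high = min(data), max(data)         # Step 2: lowest and highest value
    width = math.ceil((high - low) / k)      # Steps 3-4: range / k, rounded up
    rows = []
    for i in range(k):
        lower = low + i * width              # Step 5: lower class limits
        upper = lower + width - 1            # Step 6: upper class limits
        freq = sum(lower <= x <= upper for x in data)          # Step 8
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq))  # Step 7
    cumulative = accumulate(row[4] for row in rows)
    return [row + (c,) for row, c in zip(rows, cumulative)]

table = grouped_frequency(temps)
for lower, upper, lb, ub, f, cf in table:
    print(f"{lower}-{upper}  {lb}-{ub}  {f:>2}  {cf:>2}")
```

For other data sets the rounded-up width may need checking so that the last class still covers the largest observation.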
formula to compute a statistic. It is tedious to write an expression like this very often, so
mathematicians have developed a shorthand notation to represent a sum of scores, called the
summation notation.
The symbol $\sum_{i=1}^{N} X_i$ is a mathematical shorthand for $X_1 + X_2 + X_3 + \dots + X_N$.
Example: Suppose the following were scores made on the first homework assignment for five
students in the class: 5, 7, 7, 6, and 8. In this example set of five numbers, where N=5, the
summation could be written:
\[ \sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33 \]
The "i=1" in the bottom of the summation notation tells where to begin the sequence of
summation. If the expression were written with "i=3", the summation would start with the third
number in the set.
For example:
\[ \sum_{i=3}^{N} X_i = X_3 + X_4 + \dots + X_N \]
The "N" in the upper part of the summation notation tells where to end the sequence of
summation. When the limits are omitted, the sum runs over all N values:
\[ \sum X = \sum_{i=1}^{N} X_i = X_1 + X_2 + \dots + X_N \]
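As a small illustration of the lower and upper limits of summation, the homework scores above can be summed in Python. The variable names are chosen for this sketch only; note that Python lists are indexed from 0, so the term X_3 is `scores[2]`.

```python
# Homework scores X_1 ... X_5 from the example above.
scores = [5, 7, 7, 6, 8]

# Sum over i = 1 .. N: every score is included.
total = sum(scores)

# Sum over i = 3 .. N: start at the third score (list index 2).
from_third = sum(scores[2:])

print(total, from_third)
```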
Example:
considering the
following data.
Types of measures of central tendency
2.1. Arithmetic Mean
The arithmetic mean of a set of observations is the sum of all the observations divided by the total
number of observations. If we are considering a population, it is termed the population mean, and
if a sample is considered, it is called the sample mean. The population and the sample mean
are respectively designated by $\mu$ and $\bar{X}$. The arithmetic mean (or just mean) is the
most important measure of central tendency, the reason being that all members of the set are
used in the calculation of the mean. It is, however, affected by the extreme values of the set.
Let $X_1, X_2, X_3, \dots, X_n$ be the n observations. Then
\[ \bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} \quad\text{or}\quad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]
Example
Find the mean of the following set of numbers: 10, 9, 11, 13, 12, 12, 11, 13, 10 and 16
\[ \bar{X} = \frac{10 + 9 + 11 + 13 + 12 + 12 + 11 + 13 + 10 + 16}{10} = \frac{117}{10} = 11.7 \]
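If the values $X_1, X_2, X_3, \dots, X_n$ have frequencies $f_1, f_2, f_3, \dots, f_n$ respectively, then:
\[ \bar{X} = \frac{f_1 X_1 + f_2 X_2 + f_3 X_3 + \dots + f_n X_n}{f_1 + f_2 + f_3 + \dots + f_n} \quad\text{or}\quad \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} \]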
Example:
Find the mean of the set of data in the table below:
Mark (Xi)        0  1  2  3  4  5  6  7  8  9
Frequency (fi)   2  3  4  6  1  4  2  2  1  3
\[ \bar{X} = \frac{\sum f_i X_i}{\sum f_i} = \frac{(0 \times 2) + (1 \times 3) + (2 \times 4) + (3 \times 6) + (4 \times 1) + (5 \times 4) + (6 \times 2) + (7 \times 2) + (8 \times 1) + (9 \times 3)}{2 + 3 + 4 + 6 + 1 + 4 + 2 + 2 + 1 + 3} = \frac{114}{28} = 4.07 \]
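The frequency-weighted mean in this example can be checked with a few lines of Python. The helper name `frequency_mean` is an assumption of this sketch, not something defined in the notes.

```python
def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

marks = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
freqs = [2, 3, 4, 6, 1, 4, 2, 2, 1, 3]

mean_mark = frequency_mean(marks, freqs)
print(round(mean_mark, 2))  # 114 / 28, about 4.07
```

The same function applies to grouped data if the class marks are passed as `values`.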
Arithmetic Mean of Group Data
In a grouped frequency distribution, the values within any class interval are considered as
condensed at the mid-point of the class interval (class mark). If $X_i$ is the class mark of the ith
class interval, then the mean $\bar{X}$ of the grouped frequency distribution is defined as:
\[ \bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i} \]
where k is the number of classes.
Solutions:
• First find the class marks (Xi).
• Find the product of frequency and class marks.
• Find the mean using the formula.
Class    fi    Xi    Xifi
6-10     35    8     280
11-15    23    13    299
16-20    15    18    270
21-25    12    23    276
26-30    9     28    252
31-35    6     33    198
Total    100         1575
\[ \bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i} = \frac{1575}{100} = 15.75 \]
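\[ \bar{X} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} \quad\text{or}\quad \bar{X} = A + \frac{\sum fd}{\sum f} \]
where
A = the ASSUMED or GUESSED MEAN,
di = the deviation of xi (the value or class midpoint) from the assumed mean, di = xi - A,
n = the number of classes.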
Example for ungrouped frequency distribution
A farmer recorded the mass of 25 timbers in kg as follows:
10 14 12 10 12 11 11 9 13
16 13 9 12 13 12 10 15
10 9 11 8 14 12 8 11
a) Construct a frequency table for the data.
b) Use an assumed mean of 12 kg to calculate the mean
Solution: Given assumed mean, A= 12 kg
Masses (X)   Frequency (f)   d = X - A   fd
8            2               -4          -8
9            3               -3          -9
10           4               -2          -8
11           4               -1          -4
12           5               0           0
13           3               1           3
14           2               2           4
15           1               3           3
16           1               4           4
Total        ∑f = 25                     ∑fd = -15
Use the formula:
\[ \bar{X} = A + \frac{\sum fd}{\sum f} = 12 + \frac{-15}{25} = 12 - 0.6 = 11.4 \text{ kg} \]
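The assumed-mean calculation for the timber data can be reproduced as follows. This is only a sketch: the function name `assumed_mean` is hypothetical, and `collections.Counter` is used here to build the frequency table from the raw masses.

```python
from collections import Counter

# Masses (kg) of the 25 timbers from the example.
masses = [10, 14, 12, 10, 12, 11, 11, 9, 13,
          16, 13, 9, 12, 13, 12, 10, 15,
          10, 9, 11, 8, 14, 12, 8, 11]

def assumed_mean(data, a):
    """Mean via the assumed-mean method: A + sum(f * d) / sum(f), with d = x - A."""
    freq = Counter(data)                              # frequency table: value -> f
    sum_fd = sum(f * (x - a) for x, f in freq.items())
    return a + sum_fd / sum(freq.values())

mean_mass = assumed_mean(masses, 12)
print(round(mean_mass, 2))  # 11.4
```

Whatever value is chosen for A, the result equals the ordinary arithmetic mean; a convenient A only makes the hand calculation shorter.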
Solution
Class interval   Class center (X)   Frequency (f)   d = X - A   fd
40-49            44.5               4               -30         -120
50-59            54.5               12              -20         -240
60-69            64.5               18              -10         -180
70-79            74.5               11              0           0
80-89            84.5               7               10          70
90-99            94.5               5               20          100
100-109          104.5              2               30          60
110-119          114.5              1               40          40
Total                               ∑f = 60                     ∑fd = -270
\[ \bar{X} = A + \frac{\sum fd}{\sum f} = 74.5 + \frac{-270}{60} = 74.5 - 4.5 = 70 \]
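If $\bar{X}_1, \bar{X}_2, \dots, \bar{X}_k$ are the means of k groups with $n_1, n_2, \dots, n_k$ observations respectively,
then the mean of all the observations in all groups, often called the combined mean, is given by:
\[ \bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2 + \dots + \bar{X}_k n_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} \bar{X}_i n_i}{\sum_{i=1}^{k} n_i} \]
For example, with $\bar{X}_1 = 60$, $\bar{X}_2 = 72$, $n_1 = 30$ and $n_2 = 70$:
\[ \bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2}{n_1 + n_2} = \frac{\sum_{i=1}^{2} \bar{X}_i n_i}{\sum_{i=1}^{2} n_i} = \frac{30(60) + 70(72)}{30 + 70} = \frac{6840}{100} = 68.40 \]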
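The combined mean is simply a mean weighted by group size, which a short Python sketch makes explicit. The function name `combined_mean` is an assumption of this example; the inputs are the two group means (60 and 72) and the group sizes 30 and 70 used in the computation above.

```python
def combined_mean(means, sizes):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)

xc = combined_mean([60, 72], [30, 70])
print(round(xc, 2))  # 6840 / 100 = 68.4
```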
5. If a wrong figure has been used when calculating the mean, the correct mean can be
obtained without repeating the whole process using:
\[ \text{correct mean} = \text{wrong mean} + \frac{\text{correct value} - \text{wrong value}}{n} \]
where n is the total number of observations.
Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered
that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
Solution
\[ \text{correct mean} = \text{wrong mean} + \frac{\text{correct value} - \text{wrong value}}{n} = 65 + \frac{80 - 40}{10} = 65 + 4 = 69 \text{ kg} \]
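The correction rule can be written as a one-line function. The name `corrected_mean` and its parameter names are made up for this sketch.

```python
def corrected_mean(wrong_mean, n, wrong_value, correct_value):
    """Adjust a mean of n observations after one misrecorded value is fixed."""
    return wrong_mean + (correct_value - wrong_value) / n

# Ten students, mean 65 kg, one weight read as 40 kg instead of 80 kg.
fixed = corrected_mean(65, 10, 40, 80)
print(fixed)  # 69.0
```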
6. The effect of transforming the original series on the mean.
a) If a constant k is added to (or subtracted from) every observation, then the new mean
will be the old mean + k (or the old mean - k) respectively.
b) If every observation is multiplied by a constant k, then the new mean will be
k × old mean.
Example: The mean of a set of numbers is 500.
a. If 10 is added to each of the numbers in the set, then what will be the mean of the new set?
b. If each of the numbers in the set is multiplied by -5, then what will be the mean of the new
set?
Solution:
a. new mean = old mean + 10 = 500 + 10 = 510
b. new mean = -5 × old mean = -5 × 500 = -2500
Merit and demerit of arithmetic mean
Merits:
• It is based on all observation.
• It is suitable for further mathematical treatment.
• It is a stable average, i.e. it is not much affected by fluctuations of sampling.
• It is easy to calculate and simple to understand.
Demerits:
•It is affected by extreme observations.
•It cannot be used in the case of open end classes.
•It cannot be used when dealing with qualitative characteristics, such as intelligence, honesty,
beauty.
•Sometimes, it leads to wrong conclusion if the details of the data from which it is obtained are
not available.
•It gives high weight to high extreme values and less weight to low extreme values.
2.2. Geometric mean (G)
The geometric mean of a set of n positive numbers x1, x2, ..., xn is the nth root of the product of
the numbers.
\[ G = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} \]
It is used to determine the average percent increases in sales, production and other business
activities.
Example: Suppose the profits earned by company A on five projects were 3, 4, 4, 5 and 6 million
respectively. Compute the geometric mean profit for the company.
Solution:
\[ G = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[5]{3 \cdot 4 \cdot 4 \cdot 5 \cdot 6} = \sqrt[5]{1440} = 4.28 \]
Or, using logarithms:
\[ \begin{aligned}
\log G &= \log \sqrt[5]{1440} = \log 1440^{1/5} = \tfrac{1}{5}\log 1440 \\
&= \tfrac{1}{5}\log(1.44 \times 1000) = \tfrac{1}{5}(\log 1.44 + \log 10^{3}) \\
&= \tfrac{1}{5}(\log 1.44 + 3\log 10) = \tfrac{1}{5}(0.158362 + 3) = 0.6316
\end{aligned} \]
G = antilog 0.6316 = 4.28 ≈ 4.3
For a frequency distribution:
\[ G = \operatorname{antilog}\left(\frac{\sum f \log x}{\sum f}\right) \]
Example: Assume that a given power industry produces a product for one month and the output
is given in quintals below. Compute the geometric mean of the data.
Output (X)   Number of days (F)   log x     F × log x
20           3                    1.3       3.9
18           4                    1.255     5.02
19           10                   1.27875   12.78
22           8                    1.342     10.73
24           2                    1.380     2.76
23           3                    1.36      4.08
∑            30                             39.2835
\[ G = \operatorname{antilog}\left(\frac{\sum f \log x}{\sum f}\right) = \operatorname{antilog}\left(\frac{39.2835}{30}\right) = \operatorname{antilog} 1.30945 = 20.39 \text{ quintals} \]
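Both geometric-mean examples can be reproduced with the logarithm method in Python. This is a sketch only: the helper `geometric_mean` and its optional `freqs` argument are assumptions of this example. Because it uses unrounded logarithms, the frequency-table result (about 20.41) differs slightly from a hand calculation based on logarithms rounded in the table.

```python
import math

def geometric_mean(values, freqs=None):
    """G = antilog( sum(f * log10(x)) / sum(f) ); all values must be positive."""
    if freqs is None:
        freqs = [1] * len(values)          # ungrouped data: every value counts once
    log_sum = sum(f * math.log10(x) for x, f in zip(values, freqs))
    return 10 ** (log_sum / sum(freqs))

# Profits on five projects (millions).
g_profit = geometric_mean([3, 4, 4, 5, 6])

# Daily output (quintals) with the number of days each output occurred.
g_output = geometric_mean([20, 18, 19, 22, 24, 23], [3, 4, 10, 8, 2, 3])

print(round(g_profit, 2), round(g_output, 2))
```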
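For grouped data, with m the class mark (midpoint) of each class:
\[ G = \operatorname{antilog}\left(\frac{\sum f \log m}{\sum f}\right) \]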
antilog1.256=18.03
∑ f m=25
2.3. Harmonic mean(H)
The harmonic mean of a set of observations x1, x2, ..., xn is the reciprocal of the arithmetic mean of the
reciprocals of the numbers.
\[ H = \frac{N}{\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \dots + \frac{1}{x_n}} = \frac{N}{\sum \frac{1}{x_i}} \]
where N is the number of observations and x_i is an individual observation.
N.B. It is used in problems of averaging rates per unit of time.
Example: Ayantu and Gadise take 2 and 3 hours respectively to finish a given typing job. What is
the average time per job, based on their average working rate (the harmonic mean of the two times)?
Given: N (total observations) = 2; the individual observations x1 and x2 are 2 and 3 hours
respectively.
\[ H = \frac{N}{\frac{1}{x_1} + \frac{1}{x_2}} = \frac{2}{\frac{1}{2} + \frac{1}{3}} = \frac{2}{\frac{5}{6}} = \frac{2 \times 6}{5} = 2.4 \text{ h} \]
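Harmonic mean for an ungrouped frequency distribution
The formula is given by
\[ H = \frac{f_1 + f_2 + f_3 + \dots + f_n}{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \frac{f_3}{x_3} + \dots + \frac{f_n}{x_n}} = \frac{\sum f}{\sum f \cdot \frac{1}{x}} \]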
The mean daily temperature for a hypothetical area is given below for 30 successive days of
the month. Compute H.
Mean temperature in °C (X)   No. of days (f)   1/x      f × (1/x)
22                           3                 0.0454   0.1362
23                           5                 0.0434   0.2170
25                           10                0.0400   0.4000
27                           7                 0.0370   0.2590
28                           3                 0.0357   0.1071
29                           2                 0.0344   0.0688
Total                        ∑f = 30                    ∑ f × (1/x) = 1.1881
\[ H = \frac{\sum f}{\sum f \cdot \frac{1}{x}} = \frac{30}{1.1881} = 25.25 \]
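The two harmonic-mean formulas can be sketched in one Python function. The name `harmonic_mean` and the optional `freqs` argument are assumptions of this example. The code uses unrounded reciprocals, so for the temperature data it gives about 25.23, which can differ slightly from a hand calculation that rounds each table entry.

```python
def harmonic_mean(values, freqs=None):
    """H = sum(f) / sum(f / x); with no frequencies, H = N / sum(1 / x)."""
    if freqs is None:
        freqs = [1] * len(values)          # ungrouped data: every value counts once
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))

# Typing example: 2 hours and 3 hours per job.
h_typing = harmonic_mean([2, 3])

# Temperature example: mean temperature (°C) and number of days.
h_temp = harmonic_mean([22, 23, 25, 27, 28, 29], [3, 5, 10, 7, 3, 2])

print(round(h_typing, 2), round(h_temp, 2))
```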
Hence, half of the sample workers get less than 1,000 Birr per month and half get more than
1,000 Birr.
Example 2: The following data shows monthly salaries of 10 sample workers in factory. Find
the median salary. 1000, 300, 1500, 500, 2000, 2500, 750, 600, 3000, 800
Solution:
Step 1: Arrange the data in order. 300, 500, 600, 750, 800, 1000, 1500, 2000, 2500, 3000.
Step 2: Select the middle value. In this data set, the total number of observations is 10 (n = 10,
which is an even number). Thus, the median value will be the mean of the two middle values.
The two middle values in the data array are the 5th and the 6th values, which are 800 and 1000.
Thus, the median will be:
\[ \text{Median} = \frac{800 + 1000}{2} = 900 \]
Hence, fifty percent of the sample workers get a monthly salary above 900 Birr and fifty percent
get below it.
Median for Grouped Data
When the observations in a data set exist in grouped form, the formula for the median value is:
\[ MD = L + \left(\frac{\frac{n}{2} - F_{pm}}{f_m}\right) W \]
Where, L = the lower class boundary of the median class,
n = the sum of the frequencies of all the classes,
Fpm = the cumulative frequency of all the classes
immediately preceding the class containing the median,
W = width of the median class, and
fm = frequency of the median class.
Remark: The median class is the class with the smallest cumulative frequency greater than or
equal to n/2.
Example: Refer to the frequency distribution on the record of high temperature of 50 selected
major urban centers in Ethiopia.
Class limits Class boundaries Frequency Cumulative frequency
16 - 19 15.5 - 19.5 4 4
20 - 23 19.5 - 23.5 11 15
24 - 27 23.5 - 27.5 7 22
28 - 31 27.5 - 31.5 21 43
32 - 35 31.5 - 35.5 5 48
36 - 39 35.5 - 39.5 1 49
40 - 43 39.5 - 43.5 1 50
From this frequency distribution of the record of high temperature of 50 towns, n/2 = 50/2 = 25.
This value locates the median class: it is the class with the smallest cumulative frequency of 25 or
more, which is the fourth class (27.5-31.5).
Therefore,
L = 27.5, W = 4, n = 50, Fpm = 22, fm = 21
\[ MD = L + \left(\frac{\frac{n}{2} - F_{pm}}{f_m}\right) W = 27.5 + \left(\frac{\frac{50}{2} - 22}{21}\right) 4 = 27.5 + 0.57 = 28.07 \]
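The search for the median class and the interpolation inside it can be sketched as below. The function name `grouped_median` is hypothetical, and the class boundaries and frequencies are those of the temperature table above.

```python
def grouped_median(boundaries, freqs):
    """Median of grouped data: L + ((n/2 - Fpm) / fm) * W."""
    n = sum(freqs)
    cum = 0                                  # cumulative frequency before this class
    for (lower, upper), f in zip(boundaries, freqs):
        if cum + f >= n / 2:                 # first class reaching n/2 is the median class
            return lower + (n / 2 - cum) / f * (upper - lower)
        cum += f
    raise ValueError("empty frequency table")

boundaries = [(15.5, 19.5), (19.5, 23.5), (23.5, 27.5), (27.5, 31.5),
              (31.5, 35.5), (35.5, 39.5), (39.5, 43.5)]
freqs = [4, 11, 7, 21, 5, 1, 1]

median_temp = grouped_median(boundaries, freqs)
print(round(median_temp, 2))  # about 28.07
```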
c) \[ \text{Mean} = \frac{(1 \times 8) + (2 \times 7) + (3 \times 3) + (4 \times 4) + (5 \times 3) + (6 \times 4) + (7 \times 3) + (8 \times 2) + (9 \times 3) + (10 \times 3)}{40} = \frac{180}{40} = 4.5 \]
\[ \text{Mode} = L_m + \left(\frac{d_1}{d_1 + d_2}\right) W \]
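Here, the modal class is the fourth class (27.5-31.5, with a frequency of 21). Thus,
Lm = 27.5, fm = 21, fpm = 7, fsm = 5, W = 31.5 - 27.5 = 4,
so that d1 = fm - fpm = 21 - 7 = 14 and d2 = fm - fsm = 21 - 5 = 16.
Then, inserting these values in the equation:
\[ \text{Mode} = L_m + \left(\frac{d_1}{d_1 + d_2}\right) W = 27.5 + \left(\frac{21 - 7}{(21 - 7) + (21 - 5)}\right) 4 = 27.5 + \left(\frac{14}{14 + 16}\right) 4 \approx 27.5 + 1.87 = 29.37 \]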
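A short Python sketch of the grouped-mode formula is given below. The function name `grouped_mode` is made up for this example, and it assumes a single modal class, with a neighbouring frequency of 0 when the modal class is the first or last class.

```python
def grouped_mode(boundaries, freqs):
    """Mode of grouped data: Lm + d1 / (d1 + d2) * W (assumes one modal class)."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])    # index of the modal class
    f_prev = freqs[m - 1] if m > 0 else 0                 # frequency of preceding class
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0    # frequency of succeeding class
    d1 = freqs[m] - f_prev
    d2 = freqs[m] - f_next
    lower, upper = boundaries[m]
    return lower + d1 / (d1 + d2) * (upper - lower)

boundaries = [(15.5, 19.5), (19.5, 23.5), (23.5, 27.5), (27.5, 31.5),
              (31.5, 35.5), (35.5, 39.5), (39.5, 43.5)]
freqs = [4, 11, 7, 21, 5, 1, 1]

mode_temp = grouped_mode(boundaries, freqs)
print(round(mode_temp, 2))  # about 29.37
```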
Modes of subsets cannot be combined to determine the mode of the complete data set
without going back to the original data.
called the first, the second and the third quartile respectively.
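To compute quartiles for raw (ungrouped) data, first arrange the data in increasing order of
magnitude. Then, the ith quartile is given by:
\[ Q_i = \left(\frac{i(n+1)}{4}\right)^{\text{th}} \text{ value}, \quad i = 1, 2, 3 \]
In dividing i(n+1) by 4, there may be a remainder. Let q be the quotient and r be the remainder of
the division. Then,
\[ Q_i = q^{\text{th}} \text{ value} + \frac{r}{4}\left((q+1)^{\text{th}} \text{ value} - q^{\text{th}} \text{ value}\right) \]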
Example: The following are yields of barley (kg/plot) from 14 plots: 30, 32, 35, 38, 40, 42, 48,
49, 52, 55, 58, 60, 62 and 65. Find the first and third quartile. (Be informed that the data must
be arranged in ascending order).
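\[ \begin{aligned}
Q_1 &= \frac{1(14+1)}{4}^{\text{th}} \text{ value} = \frac{15}{4}^{\text{th}} \text{ value} = 3.75^{\text{th}} \text{ value} \\
&= 3^{\text{rd}} \text{ value} + \frac{3}{4}\left(4^{\text{th}} \text{ value} - 3^{\text{rd}} \text{ value}\right) = 35 + \frac{3}{4}(38 - 35) = 37.25
\end{aligned} \]
\[ \begin{aligned}
Q_3 &= \frac{3(14+1)}{4}^{\text{th}} \text{ value} = \frac{45}{4}^{\text{th}} \text{ value} = 11.25^{\text{th}} \text{ value} \\
&= 11^{\text{th}} \text{ value} + \frac{1}{4}\left(12^{\text{th}} \text{ value} - 11^{\text{th}} \text{ value}\right) = 58 + \frac{1}{4}(60 - 58) = 58.5
\end{aligned} \]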
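The quotient-and-remainder rule can be coded directly. The function name `quantile_raw` and the way positions outside the data are handled (returning the smallest or largest value) are assumptions of this sketch, not part of the notes. With `parts=10` or `parts=100` the same function follows the (n+1) rule for deciles and percentiles.

```python
def quantile_raw(data, i, parts=4):
    """i-th quantile of raw data using the i(n+1)/parts position rule."""
    xs = sorted(data)                        # data must be in increasing order
    n = len(xs)
    q, r = divmod(i * (n + 1), parts)        # quotient q and remainder r
    if q < 1:                                # position before the first value (assumption)
        return xs[0]
    if q >= n:                               # position at or beyond the last value (assumption)
        return xs[-1]
    # q-th value plus r/parts of the gap to the (q+1)-th value (1-based positions)
    return xs[q - 1] + r / parts * (xs[q] - xs[q - 1])

yields = [30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62, 65]
q1 = quantile_raw(yields, 1)
q3 = quantile_raw(yields, 3)
print(q1, q3)  # 37.25 58.5
```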
c = the cumulative frequency (less than type) preceding the quartile class
fQi = the frequency of the quartile class
Remark: The quartile class (class containing Qi) is the class with the smallest cumulative
frequency (less than type) greater than or equal to iN/4.
Deciles: Deciles are measures that divide the frequency distribution into ten equal parts. They
are denoted by D1, D2, ..., D9. For raw (ungrouped) data, first arrange the data in
increasing order of magnitude. Then, the ith decile is given by:
\[ D_i = \left(\frac{i(n+1)}{10}\right)^{\text{th}} \text{ value} \]
Similar to quartiles, in dividing i(n+1) by 10 there may be a remainder. Let q be the quotient and r
be the remainder of the division. Then,
\[ D_i = q^{\text{th}} \text{ value} + \frac{r}{10}\left((q+1)^{\text{th}} \text{ value} - q^{\text{th}} \text{ value}\right) \]
For grouped data we have the following formula:
\[ D_i = L_{D_i} + \frac{w}{f_{D_i}}\left(\frac{iN}{10} - C\right), \quad i = 1, 2, 3, \dots, 9 \]
C = the cumulative frequency (less than type) preceding the decile class
Percentiles
Percentiles are measures that divide the frequency distribution into hundred equal parts. They
are denoted by P1, P2, ..., P99.
For raw (ungrouped) data, first arrange the data in increasing order of magnitude. Then, the ith
percentile is given by:
\[ P_i = \left(\frac{i(n+1)}{100}\right)^{\text{th}} \text{ value} \]
In dividing i(n+1) by 100, there may be a remainder. Let q be the quotient and r be the remainder
of the division. Then,
\[ P_i = q^{\text{th}} \text{ value} + \frac{r}{100}\left((q+1)^{\text{th}} \text{ value} - q^{\text{th}} \text{ value}\right) \]
For grouped data we have the following formula:
\[ P_i = L_{P_i} + \frac{w}{f_{P_i}}\left(\frac{iN}{100} - C\right), \quad i = 1, 2, 3, \dots, 99 \]
C = the cumulative frequency (less than type) preceding the percentile class
Values Frequency
140-150 17
150-160 29
160-170 42
170-180 72
180-190 84
190-200 107
200-210 49
210-220 34
220-230 31
230-240 16
240-250 12
Solutions:
• First find the less than cumulative frequency.
• Use the formula to calculate the required quantile.
Values Frequency Cumulative frequency
140-150 17 17
150-160 29 46
160-170 42 88
170-180 72 160
180-190 84 244
190-200 107 351
200-210 49 400
210-220 34 434
220-230 31 465
230-240 16 481
240-250 12 493
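a) Quartiles:
i. Q1: Determine the class containing the first quartile: N/4 = 493/4 = 123.25, which falls in the class 170-180.
\[ Q_1 = L_{Q_1} + \frac{W}{f_{Q_1}}\left(\frac{N}{4} - C\right) = 170 + \frac{10}{72}(123.25 - 88) = 174.90 \]
ii. Q2: Determine the class containing the second quartile: 2N/4 = 246.5, which falls in the class 190-200.
\[ Q_2 = L_{Q_2} + \frac{W}{f_{Q_2}}\left(\frac{2N}{4} - C\right) = 190 + \frac{10}{107}(246.5 - 244) = 190.23 \]
iii. Q3: Determine the class containing the third quartile: 3N/4 = 369.75, which falls in the class 200-210.
\[ Q_3 = L_{Q_3} + \frac{W}{f_{Q_3}}\left(\frac{3N}{4} - C\right) = 200 + \frac{10}{49}(369.75 - 351) = 203.83 \]
b) Deciles:
D7: Determine the class containing the seventh decile: 7N/10 = 345.1, which falls in the class 190-200.
\[ D_7 = L_{D_7} + \frac{W}{f_{D_7}}\left(\frac{7N}{10} - C\right) = 190 + \frac{10}{107}(345.1 - 244) = 199.45 \]
c) Percentiles:
P90: Determine the class containing the 90th percentile: 90N/100 = 443.7, which falls in the class 220-230.
\[ P_{90} = L_{P_{90}} + \frac{W}{f_{P_{90}}}\left(\frac{90N}{100} - C\right) = 220 + \frac{10}{31}(443.7 - 434) = 223.13 \]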
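Since quartiles, deciles and percentiles of grouped data share one formula, a single Python function can reproduce all five results above. The name `grouped_quantile` and its parameters are assumptions of this sketch, and it assumes equal class widths as in the table.

```python
def grouped_quantile(lower_bounds, width, freqs, i, parts):
    """L + (W / f) * (i*N/parts - C) for the class containing the quantile."""
    n = sum(freqs)
    target = i * n / parts                   # e.g. N/4 for Q1, 7N/10 for D7
    cum = 0                                  # C: cumulative frequency before the class
    for lower, f in zip(lower_bounds, freqs):
        if cum + f >= target:                # smallest cumulative frequency >= target
            return lower + width / f * (target - cum)
        cum += f
    raise ValueError("target lies beyond the last class")

lowers = list(range(140, 250, 10))           # 140, 150, ..., 240
freqs = [17, 29, 42, 72, 84, 107, 49, 34, 31, 16, 12]

q1 = grouped_quantile(lowers, 10, freqs, 1, 4)
q2 = grouped_quantile(lowers, 10, freqs, 2, 4)
q3 = grouped_quantile(lowers, 10, freqs, 3, 4)
d7 = grouped_quantile(lowers, 10, freqs, 7, 10)
p90 = grouped_quantile(lowers, 10, freqs, 90, 100)
print([round(v, 2) for v in (q1, q2, q3, d7, p90)])
```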
Chapter Three
3. Measures of Dispersion
Introduction and objectives of measuring Variation
The scatter or spread of items of a distribution is known as dispersion or variation.
In other words, the degree to which numerical data tend to spread about an average
value is called the dispersion or variation of the data (i.e. variability or dispersion is
concerned with the extent to which the values of a data set differ from their computed
mean).
Measures of dispersion are statistical measures which provide ways of measuring the
extent to which data are dispersed or spread out.
The objectives of measuring variation are to judge the reliability of measures of central
tendency, to control variability itself, to compare two or more groups of numbers in terms of their
variability, and to make further statistical analysis.
Types of Measures of Dispersion
Various measures of dispersions are in use. The most commonly used measures of dispersions
are:
Range and relative range
Mean and Quartile deviation
Variance and Standard deviation
Coefficient of variation.
Because the range is greatly affected by extreme scores, it may give a distorted picture
of the scores.
The following two distributions have the same range, 13, yet appear to differ greatly in
the amount of variability.
Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.
S= smallest observation
Mark 1-5 6-10 11-15 16-20 21-25 26-30 31-35 36-40 41-45 46-50
Frequency 0 2 4 11 20 10 5 3 2 1
Solution
The lowest class = 1-5; lower limit of the lowest class = 0.5
The highest class = 46-50; upper limit of the highest class = 50.5
Therefore, the range = upper limit of the highest class - lower limit of the lowest class
= 50.5 - 0.5 = 50
Properties of Range
It is easy to calculate and simple to understand.
It is not based on all observations.
It is highly affected by extreme observations.
It is affected by fluctuation in sampling.
It cannot be computed in the case of open end distribution.
3.2. Mean and quartile deviation
3.2.1. The Mean Deviation (M.D)
This is a much better measure of dispersion than the range.
Mean deviation is the mean of the absolute values of the deviations from some measure
of central tendency.
Depending up on the type of averages used we have different mean deviations.
a) Mean Deviation about the mean
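It is denoted by $M.D(\bar{X})$ and given by
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$$
For the case of a frequency distribution it is given as:
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \bar{X}|}{n}$$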
Steps to calculate $M.D(\bar{X})$:
1. Find the arithmetic mean, $\bar{X}$.
2. Find the deviation of each reading from $\bar{X}$.
3. Find the arithmetic mean of the deviations, ignoring sign.
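b) Mean Deviation about the median.
It is denoted by $M.D(\tilde{X})$ and given by
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{n} |X_i - \tilde{X}|}{n}$$
For the case of a frequency distribution it is given as:
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \tilde{X}|}{n}$$
c) Mean Deviation about the mode.
It is denoted by $M.D(\hat{X})$ and given by:
$$M.D(\hat{X}) = \frac{\sum_{i=1}^{n} |X_i - \hat{X}|}{n}$$
For the case of a frequency distribution it is given as:
$$M.D(\hat{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \hat{X}|}{n}$$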
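Xi            4    4    5    5    5    6    7    7    8    9    Total
|Xi - 6|      2    2    1    1    1    0    1    1    2    3    14
|Xi - 5.5|    1.5  1.5  0.5  0.5  0.5  0.5  1.5  1.5  2.5  3.5  14
|Xi - 5|      1    1    0    0    0    1    2    2    3    4    14
Here the mean is 6, the median is 5.5 and the mode is 5, so that
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{10} |X_i - 6|}{10} = \frac{14}{10} = 1.4$$
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{10} |X_i - 5.5|}{10} = \frac{14}{10} = 1.4$$
$$M.D(\hat{X}) = \frac{\sum_{i=1}^{10} |X_i - 5|}{10} = \frac{14}{10} = 1.4$$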
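The three mean deviations in this example can be checked with a short Python sketch. The function name `mean_deviation` is an illustrative choice, not notation from the notes; the mean, median and mode come from the standard `statistics` module.

```python
from statistics import mean, median, mode

data = [4, 4, 5, 5, 5, 6, 7, 7, 8, 9]

def mean_deviation(values, center):
    """Mean of the absolute deviations of the values from a chosen center."""
    return sum(abs(x - center) for x in values) / len(values)

md_mean = mean_deviation(data, mean(data))      # about the mean (6)
md_median = mean_deviation(data, median(data))  # about the median (5.5)
md_mode = mean_deviation(data, mode(data))      # about the mode (5)

print(md_mean, md_median, md_mode)
```

For this particular data set the three values coincide; in general they differ.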
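$$\text{Median} = L + \left(\frac{\frac{n}{2} - F_{pm}}{f_m}\right) C = 60.5 + \left(\frac{\frac{141}{2} - 60}{42}\right) 10 = 63$$
Then, mean deviation about the median:
$$M.D = \frac{\sum f|d|}{\sum f} = \frac{2260}{140} = 16.143$$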
It gives the average amount by which the two quartiles differ from the median.
Example: Compute Q.D and its coefficient for the following distribution.
Values Frequency
140-150 17
150-160 29
160-170 42
170-180 72
180-190 84
190-200 107
200-210 49
210-220 34
220-230 31
230-240 16
240-250 12
Solutions: From the above table we have obtained the values of all quartiles as:
Q1 = 174.90, Q2 = 190.23, Q3 = 203.83
$$Q.D = \frac{Q_3 - Q_1}{2} = \frac{203.83 - 174.90}{2} = 14.47$$
$$C.Q.D = \frac{2 \cdot Q.D}{Q_3 + Q_1} = \frac{2 \cdot 14.47}{203.83 + 174.90} = 0.076$$
Remark: Q.D or C.Q.D includes only the middle 50% of the observations.
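The quartiles, the quartile deviation and its coefficient for this distribution can be reproduced with the following Python sketch. It assumes equal class widths and linear interpolation inside the class that contains the required position; the function name `grouped_quantile` is illustrative and not from the notes.

```python
# Lower class boundaries, common class width and frequencies from the table above.
lower_bounds = [140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240]
width = 10
freqs = [17, 29, 42, 72, 84, 107, 49, 34, 31, 16, 12]

def grouped_quantile(position, lower_bounds, freqs, width):
    """Value at a cumulative position (e.g. N/4 for Q1) by interpolation."""
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= position:
            return lower + width / f * (position - cumulative)
        cumulative += f
    raise ValueError("position exceeds the total frequency")

N = sum(freqs)
q1 = grouped_quantile(N / 4, lower_bounds, freqs, width)
q2 = grouped_quantile(2 * N / 4, lower_bounds, freqs, width)
q3 = grouped_quantile(3 * N / 4, lower_bounds, freqs, width)

qd = (q3 - q1) / 2           # quartile deviation
cqd = (q3 - q1) / (q3 + q1)  # coefficient of quartile deviation = 2*Q.D/(Q3+Q1)

print(round(q1, 2), round(q2, 2), round(q3, 2), round(qd, 2), round(cqd, 3))
```

The same function gives deciles and percentiles by passing positions such as 7N/10 or 90N/100.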
Variance and Standard Deviation
Population Variance
The variance is the average of the squares of the distance each value is from the mean.
It is obtained by taking the difference between each observation and the mean, squaring
the difference, adding the squares, and finally averaging the squares.
The formula for the population variance is as follows:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$
where $X_i$ = individual value, $\mu$ = population mean, N = population size
For the case of a frequency distribution, it is expressed as:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{k} f_i (X_i - \mu)^2$$
Sample Variance
One of the major uses of a statistic is to estimate the corresponding parameter (characteristic of a
population). If the sum of the squared deviations from the sample mean were divided by the sample
size n, the result would tend to underestimate the population variance. To correct for this, the sum
of the squares of the deviations is divided by one less than the sample size.
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
For the case of a frequency distribution, it is expressed as:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i (X_i - \bar{X})^2$$
We usually use the following shortcut formulas if the data have decimals and there is a
problem of rounding.
$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}, \text{ for raw data}$$
$$S^2 = \frac{\sum_{i=1}^{k} f_i X_i^2 - n\bar{X}^2}{n-1}, \text{ for a frequency distribution}$$
Standard Deviation
It is used to measure the deviation of observations from the mean.
It is an improvement on the mean deviation.
This is the most satisfactory and universally adopted measure of dispersion.
The measure of central tendency used in calculating the standard deviation is the mean.
The example below illustrates the computation stages of the standard deviation.
There is a problem with variances: the deviations, and hence the units, are squared.
To get the units back to those of the original data values, the square root must be
taken.
Population standard deviation is computed as $\sigma = \sqrt{\sigma^2}$
Sample standard deviation is computed as $S = \sqrt{S^2}$
N.B:- The larger the variance or standard deviation is, the more variable the data are.
The following steps are used to calculate the sample variance:
1. Find the arithmetic mean.
2. Find the difference between each observation and the mean.
3. Square these differences.
4. Sum the squared differences.
5. Since the data is a sample, divide the number (from step 4 above) by the number of
observations minus one, i.e., n-1 (where n is equal to the number of observations in the data
set).
Examples: Find the variance and standard deviation of the following sample data
1. 5, 17, 12, 10.
2. The data is given in the form of frequency distribution.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
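Solutions:
1. The sample mean is $\bar{X} = (5 + 10 + 12 + 17)/4 = 11$.
Xi                5    10    12    17    Total
(Xi - X̄)^2       36     1     1    36    74
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{74}{3} = 24.67$$
$$S = \sqrt{S^2} = \sqrt{24.67} = 4.97$$
2. Taking Xi as the class mark (midpoint) of each class, $\bar{X} = 55$ and n = 75.
Xi                42     47     52     57     62     67     72     Total
fi(Xi - X̄)^2     1183   640    198    60     588    864    867    4400
$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n-1} = \frac{4400}{74} = 59.46$$
$$S = \sqrt{S^2} = \sqrt{59.46} = 7.71$$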
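Both solutions follow the steps listed above, and a small Python sketch can confirm the arithmetic. The function `sample_variance` is an illustrative helper (not from the notes) that treats raw data as a frequency distribution in which every frequency is 1.

```python
from math import sqrt

def sample_variance(values, freqs=None):
    """Sample variance: sum of f*(x - mean)^2 divided by (n - 1)."""
    if freqs is None:
        freqs = [1] * len(values)          # raw data: each value counted once
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    squared = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return squared / (n - 1)

# Example 1: raw data
var_raw = sample_variance([5, 17, 12, 10])
sd_raw = sqrt(var_raw)

# Example 2: class marks (midpoints) and class frequencies
midpoints = [42, 47, 52, 57, 62, 67, 72]
class_freqs = [7, 10, 22, 15, 12, 6, 3]
var_grouped = sample_variance(midpoints, class_freqs)
sd_grouped = sqrt(var_grouped)

print(round(var_raw, 2), round(sd_raw, 2))
print(round(var_grouped, 2), round(sd_grouped, 2))
```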
Other example:
Calculate the population standard deviation from the frequency distribution given below.
X    0-10   11-20   21-30   31-40   41-50   51-60   61-70   71-80
F    18     16      15      12      10      5       2       2
Solution: Represent the deviation of each midpoint from the mean by d = X - X̄.
Class    Midpoint (X)   Frequency (f)   fX       d = X - X̄    d^2         fd^2
0-10     5.5            18              99.0     -21.625      467.641     8417.538
11-20    15.5           16              248.0    -11.625      135.141     2162.256
21-30    25.5           15              382.5    -1.625       2.641       39.615
31-40    35.5           12              426.0    8.375        70.141      841.692
41-50    45.5           10              455.0    18.375       337.641     3376.410
51-60    55.5           5               277.5    28.375       805.141     4025.705
61-70    65.5           2               131.0    38.375       1472.641    2945.282
71-80    75.5           2               151.0    48.375       2340.141    4680.282
Total                   80              2170.0                            26488.78
$$\bar{X} = \frac{\sum fX}{\sum f} = \frac{2170}{80} = 27.125$$
$$\sigma = \sqrt{\frac{\sum f d^2}{\sum f}} = \sqrt{\frac{26488.78}{80}} \approx 18.20$$
Variance 100 121
Solutions: The standard deviation is the square root of the variance, so S for Firms A and B is 10
and 11 respectively.
$$C.V_A = \frac{S_A}{\bar{X}_A} \times 100 = \frac{10}{52.5} \times 100 = 19.05\%$$
$$C.V_B = \frac{S_B}{\bar{X}_B} \times 100 = \frac{11}{47.5} \times 100 = 23.16\%$$
Since C.VA < C.VB, in firm B there is greater variability in individual wages.
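The comparison above can be sketched in Python as follows. The means (52.5 and 47.5) and variances (100 and 121) are the figures used in the solution; the function name is an illustrative choice.

```python
from math import sqrt

def coefficient_of_variation(mean, variance):
    """C.V = (standard deviation / mean) * 100, expressed in percent."""
    return sqrt(variance) / mean * 100

cv_a = coefficient_of_variation(52.5, 100)  # Firm A
cv_b = coefficient_of_variation(47.5, 121)  # Firm B

more_variable = "B" if cv_b > cv_a else "A"
print(round(cv_a, 2), round(cv_b, 2), more_variable)
```

Because the C.V is unit-free, it allows groups with different means to be compared on the same footing.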
Chapter Four
4. Measurement of Distribution
The shape of the frequency distribution best describes the relationship among the mean, median
and mode.
4.1 Normal Distribution
When the distribution of items in a series happens to be perfectly symmetrical, then we
have the following type of curve for the distribution.
A symmetric distribution is a distribution of observations in which the mean, median and
mode have the same value, as shown in Figure 1 below.
Figure 1
It has zero skewness.
The curve illustrated above is technically described as a normal curve and the related
distribution as a normal distribution.
Such a curve is perfectly bell shaped, in which case the values of the mean, median and
mode are the same and skewness is absent.
The normal distribution curve can be used to study many variables that are either exactly
or approximately normally distributed.
The mathematical equation for the normal distribution is:
$$y = \frac{e^{-(X - \mu)^2 / (2\sigma^2)}}{\sigma\sqrt{2\pi}}$$
where: $e \approx 2.718$, $\pi \approx 3.14$,
µ is the population mean
σ is the population standard deviation
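As an illustration, the equation of the normal curve can be evaluated with a few lines of Python. The function name `normal_density` and the values µ = 10, σ = 2 are assumptions chosen only for this sketch.

```python
from math import exp, pi, sqrt

def normal_density(x, mu, sigma):
    """Height y of the normal curve at x for mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

mu, sigma = 10, 2  # hypothetical parameters

peak = normal_density(mu, mu, sigma)       # the curve is highest at the mean
left = normal_density(mu - 2, mu, sigma)   # two units below the mean
right = normal_density(mu + 2, mu, sigma)  # two units above the mean

print(round(peak, 4), round(left, 4), round(right, 4))
```

The equal heights on both sides of the mean reflect the symmetry of the curve.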
Another important aspect is that the area under the normal curve is more important than
the frequencies.
Therefore, when the normal distribution is pictured, the y axis, which indicates the
frequencies, is sometimes omitted.
Generally, the normal distribution is a continuous, symmetric, bell-shaped
distribution of a variable.
Summary of the properties of the theoretical Normal Distribution is presented below.
The normal distribution curve is bell-shaped.
The mean, median, and mode are equal and located at the center of the distribution.
The normal distribution curve is unimodal (i.e., it has only one mode).
The curve is symmetric about the mean, which is equivalent to saying that its shape is
the same on both sides of a vertical line passing through the center.
The curve is continuous, that is, there are no gaps or holes. For each value X, there is a
corresponding value of Y.
The curve never touches the x axis. Theoretically, no matter how far in either direction
the curve extends, it never meets the x axis—but it gets increasingly closer.
If the curve is distorted (whether on the right side or on the left side), we have
asymmetrical distribution which indicates that there is skewness.
If the curve is distorted on the right side, we have positive skewness but when the curve
is distorted towards left, we have negative skewness.
4.2 Skewness
An outlier can significantly alter the mean of a series of numbers, whereas the median will
remain at the center of the series. In such a case, the resulting curve drawn from the values will
appear to be skewed, tailing off rapidly to the left or right. In negatively skewed or
positively skewed curves, the median lies between the mean and the mode. Figure 2
below shows a negatively skewed curve.
For moderately skewed distribution, the following relation holds among the three
commonly used measures of central tendency.
Mean - Mode = 3 × (Mean - Median)
Measures of Skewness
-Denoted by α3
-There are different measures of skewness. Two measures of skewness are given below.
1. The Pearsonian coefficient of skewness
$$\alpha_3 = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{\bar{X} - \hat{X}}{S}$$
2. Bowley's coefficient of skewness (coefficient of skewness based on quartiles).
$$\alpha_3 = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$
The shape of the curve is determined by the value of α3.
If α3 >0, the distribution is positively skewed.
If α3 =0, the distribution is symmetric distribution.
If α3 <0, the distribution is negatively skewed.
Remark:
In a positively skewed distribution, smaller observations are more frequent than larger
observations, i.e. the majority of the observations have a value below the average.
In a negatively skewed distribution, smaller observations are less frequent than larger
observations, i.e. the majority of the observations have a value above the average.
Example 1: Suppose the mean, the mode, and the standard deviation of a certain distribution
are 32, 30.5 and 10 respectively. What is the shape of the curve representing the distribution?
Solutions: Use the Pearsonian coefficient of skewness
$$\alpha_3 = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{32 - 30.5}{10} = 0.15$$
Since $\alpha_3 > 0$, the distribution is positively skewed.
Example 2: In a frequency distribution, the coefficient of skewness based on the quartiles is
given to be 0.5. If the sum of the upper and lower quartile is 28 and the middle quartile is 11,
find the values of the upper and lower quartiles.
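Solutions: Given: $\alpha_3 = 0.5$, $Q_2 = 11$ and
Q3 + Q1 = 28 ---------- (X)
Substituting into Bowley's coefficient:
$$\alpha_3 = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} = \frac{28 - 2(11)}{Q_3 - Q_1} = 0.5$$
so that
Q3 - Q1 = 12 ---------- (Y)
Then solve (X) and (Y): Q3 - (28 - Q3) = 12, 2Q3 = 40; finally, we get Q3 = 20 and Q1 = 8.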
The significance of skewness lies in the fact that through it one can study the formation of series
and can have the idea about the shape of the curve, whether normal or otherwise, when the
items of a given series are plotted on a graph.
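The two coefficients of skewness defined in this section can be written as short Python functions and checked against Examples 1 and 2. The function names, including the helper `describe_skewness`, are illustrative and not from the notes.

```python
def pearson_skewness(mean, mode, sd):
    """Pearsonian coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / sd

def bowley_skewness(q1, q2, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)

def describe_skewness(alpha3):
    """Translate the sign of alpha3 into the shape of the distribution."""
    if alpha3 > 0:
        return "positively skewed"
    if alpha3 < 0:
        return "negatively skewed"
    return "symmetric"

a3_example1 = pearson_skewness(32, 30.5, 10)  # Example 1
a3_example2 = bowley_skewness(8, 11, 20)      # Example 2, using Q1 = 8 and Q3 = 20

print(a3_example1, describe_skewness(a3_example1))
print(a3_example2, describe_skewness(a3_example2))
```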
4.3 Kurtosis
Kurtosis is the degree of peakedness (condition of having peak) of a distribution,
usually taken relative to a normal distribution.
A distribution having relatively high peak is called leptokurtic.
If a curve representing a distribution is flat topped, it is called platykurtic.
The normal distribution which is not very high peaked or flat topped is called
mesokurtic.
In brief, Kurtosis is the humpedness (convexness) of the curve and points to the nature
of distribution of items in the middle of a series.
Measures of kurtosis
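The moment coefficient of kurtosis:
• Denoted by α4 and given by
$$\alpha_4 = \frac{M_4}{M_2^2} = \frac{M_4}{\sigma^4}$$
where $M_4$ is the fourth moment about the mean.
1. The rth moment about the mean (the rth central moment) is denoted by Mr and defined as:
$$M_r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^r}{n} \quad \text{(Ungrouped distribution)}$$
$$M_r = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^r}{n} \quad \text{(Grouped distribution)}$$
In particular, for r = 4:
$$M_4 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^4}{n}$$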
Example: If the first four central moments of a distribution are M1 = 0, M2 = 16, M3 = -60, and
M4 = 162, compute a measure of kurtosis and give your interpretation.
Solution:
$$\alpha_4 = \frac{M_4}{M_2^2} = \frac{162}{16^2} = 0.63$$
Interpretation: since 0.63 < 3, the curve is platykurtic.
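The moment coefficient of kurtosis can be sketched in Python as below. The helper `central_moment` divides by n, which is the assumption under which M2 equals the variance σ² used in the formula; the raw data set is hypothetical and only demonstrates the helper.

```python
def central_moment(values, r):
    """rth moment about the mean for ungrouped data (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n

def moment_kurtosis(m4, m2):
    """alpha4 = M4 / M2^2."""
    return m4 / m2 ** 2

def describe_kurtosis(alpha4):
    """Compare alpha4 with 3, the value for the normal (mesokurtic) curve."""
    if alpha4 > 3:
        return "leptokurtic"
    if alpha4 < 3:
        return "platykurtic"
    return "mesokurtic"

# The worked example: M2 = 16 and M4 = 162
alpha4 = moment_kurtosis(162, 16)
print(round(alpha4, 2), describe_kurtosis(alpha4))

# Hypothetical raw data, computing the moments from scratch
data = [4, 4, 5, 5, 5, 6, 7, 7, 8, 9]
alpha4_data = moment_kurtosis(central_moment(data, 4), central_moment(data, 2))
print(round(alpha4_data, 2), describe_kurtosis(alpha4_data))
```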
Chapter Five
5. Sampling
The concept of sampling
Sampling may be defined as the selection of some part of an aggregate or totality on
the basis of which a judgment or inference about the aggregate or totality is made.
In other words, it is the process of obtaining information about an entire population by
examining only a part of it.
In most of the research work and surveys, the usual approach happens to be to make
generalizations or to draw inferences based on samples about the parameters of
population from which the samples are taken.
The researcher quite often selects only a few items from the universe for study
purposes. All this is done on the assumption that the sample data will enable the
researcher to estimate the population parameters.
The items so selected constitute what is technically called a sample; their selection
process or technique is called the sample design.
A sample should be truly representative of the population characteristics, without any
bias, so that it may result in valid and reliable conclusions.
Some Fundamental Definitions
Before we talk about details and uses of sampling, it seems appropriate that we should be
familiar with some fundamental definitions concerning sampling concepts and principles.
1. Population: refers to the total of items about which information is desired.
The population represents the target of an investigation, and the objective of the
investigation is to draw conclusions about the population hence we sometimes call it
target population.
The population or universe can be finite or infinite.
The population is said to be finite if it consists of a fixed number of elements so that it
is possible to enumerate it in its totality. For instance, the population of a city, the
number of workers in a factory, etc.
An infinite population is a population in which it is theoretically impossible to
observe all the elements, e.g. the number of stars in the sky. From a practical
point of view, we use the term infinite population for a population that cannot be
enumerated in a reasonable period of time. In this way we use the theoretical concept of
an infinite population as an approximation of a very large finite population.
Examples
Population of trees under specified climatic conditions
Population of animals fed a certain type of diet
Population of farms having a certain type of natural fertility
Population of households, etc
2. Census: a complete enumeration of the population. But in most real problems it cannot be
realized, hence we take a sample.
3. Sampling frame: it is the group or cluster or list of items from which the sample is to be
drawn. Whatever the frame may be, it should be a good representative of the population.
4. Sampling design: it is a definite plan for obtaining a sample from the sampling frame. It
refers to the technique or the procedure the researcher would adopt in selecting some sampling
units.
5. Statistic(s) and parameter(s): A statistic is a characteristic of a sample, whereas a parameter
is a characteristic of a population. Thus, when we work out certain measures such as the mean,
median, mode or the like from samples, they are called statistic(s), for they describe
the characteristics of a sample. But when such measures describe the characteristics of a
population, they are known as parameter(s). For instance, the population mean ($\mu$) is a
parameter, whereas the sample mean ($\bar{X}$) is a statistic.
6. Errors: there will be a certain amount of inaccuracy in the information collected. This
inaccuracy may be of two types, i.e. sampling error (error variance) and non-sampling errors.
a) Sampling error:
Is the discrepancy between the population value and the sample value.
May arise due to inappropriate sampling techniques being applied.
b) Non-sampling errors: are errors due to procedural bias, such as:
Incorrect responses (response error).
Measurement errors.
Errors at different stages in processing the data.
The more homogeneous the universe, the smaller the sampling error.
Sampling error is inversely related to the size of the sample i.e., sampling error
decreases as the sample size increases and vice-versa.
7. Sampling distribution: it is all values of a particular statistic, say mean, together with their
relative frequencies.
8. Sampling: The process or method of sample selection from the population.
9. Sampling unit: the ultimate unit to be sampled or elements of the population to be sampled.
5.1. Why is sampling needed?
-Reduced cost - Sampling can save time and money. A sample study is usually less expensive
than a census study and produces results at a relatively faster speed.
-Greater speed
-Greater accuracy - Sampling may enable more accurate measurements because a sample study is
generally conducted by trained and experienced investigators.
-Greater scope
-Avoids destructive tests
-The only option when the population is infinite (or very large).
Sometimes taking a census makes more sense than using a sample, for example when a sample is
unlikely to be representative or when detailed information on every unit is needed.
5.2. Sampling Distribution of Sample Means
Given a variable X, if we arrange its values in ascending order and assign a probability to each of
the values, or if we present $X_i$ in the form of a frequency distribution, the result is called the
Sampling Distribution of X.
The sampling distribution of the sample mean is the distribution obtained by using the means
computed from random samples of a specific size taken from a population.
Steps for the construction of Sampling Distribution of the mean
1. From a finite population of size N, randomly draw all possible samples of size n.
2. Calculate the mean for each sample.
3. Summarize the mean obtained in step 2 in terms of frequency distribution or relative
frequency distribution.
Example: Suppose we have a population of size N = 5, consisting of the ages of five children:
6, 8, 10, 12, and 14.
Population mean = $\mu$ = 10
Population variance = $\sigma^2$ = 8
Take samples of size 2 with replacement and construct the sampling distribution of the sample
mean.
Step 1: Draw all possible samples of size 2 with replacement:
       6        8        10       12       14
6      (6,6)    (6,8)    (6,10)   (6,12)   (6,14)
8      (8,6)    (8,8)    (8,10)   (8,12)   (8,14)
10     (10,6)   (10,8)   (10,10)  (10,12)  (10,14)
12     (12,6)   (12,8)   (12,10)  (12,12)  (12,14)
14     (14,6)   (14,8)   (14,10)  (14,12)  (14,14)
Step 2: Calculate the mean for each sample:
6 8 10 12 14
6 6 7 8 9 10
8 7 8 9 10 11
10 8 9 10 11 12
12 9 10 11 12 13
14 10 11 12 13 14
Step 3: Summarize the mean obtained in step 2 in terms of frequency distribution.
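X̄i    6   7   8   9   10   11   12   13   14
fi    1   2   3   4   5    4    3    2    1
a) Find the mean of $\bar{X}$, say $\mu_{\bar{X}}$:
$$\mu_{\bar{X}} = \frac{\sum \bar{x}_i f_i}{\sum f_i} = \frac{250}{25} = 10 = \mu$$
b) Find the variance of $\bar{X}$, say $\sigma^2_{\bar{X}}$:
$$\sigma^2_{\bar{X}} = \frac{\sum (\bar{x}_i - \mu_{\bar{X}})^2 f_i}{\sum f_i} = \frac{100}{25} = 4$$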
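The three construction steps can be carried out mechanically, as the following Python sketch shows for the five ages in the example. It enumerates all samples of size 2 drawn with replacement using `itertools.product`; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

population = [6, 8, 10, 12, 14]
n = 2  # sample size

# Step 1: all possible samples of size n, drawn with replacement
samples = list(product(population, repeat=n))

# Step 2: the mean of each sample
sample_means = [sum(s) / n for s in samples]

# Step 3: frequency distribution of the sample means
distribution = Counter(sample_means)

mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)

# Population variance, for comparison with the variance of the sample means
mu = sum(population) / len(population)
pop_variance = sum((x - mu) ** 2 for x in population) / len(population)

print(sorted(distribution.items()))
print(mean_of_means, var_of_means, pop_variance / n)
```

In this example the variance of the sample means (4) equals the population variance divided by the sample size (8/2).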
Sample Size refers to the number of sampling units selected from the population for
investigation.
The size of the sample should be determined by a researcher keeping in view the following
points:
(i) Nature of universe (population): Universe may be either homogenous or heterogeneous in
nature. If the items of the universe are homogenous, a small sample can serve the purpose. But
if the items are heterogeneous, a large sample would be required. Technically, this can be
termed as the dispersion factor.
(ii) Number of classes proposed: If many class-groups (groups and sub-groups) are to be
formed, a large sample would be required because a small sample might not be able to give a
reasonable number of items in each class-group.
(iii) Type of sampling: The sampling technique plays an important part in determining the size of
the sample. A small random sample is apt to be much superior to a larger but badly
selected sample.
(iv) Standard of accuracy and acceptable confidence level: If the standard of accuracy or the
level of precision is to be kept high, we shall require relatively larger sample.
(v) Availability of finance: In practice, the size of the sample depends upon the amount of money
available for the study. This factor should be kept in view while determining the size
of the sample, because large samples increase the cost of sampling estimates.
(vi) Other considerations: Nature of units, size of the population, size of questionnaire,
availability of trained investigators, the conditions under which the sample is being conducted,
the time available for completion of the study are a few other considerations to which a
researcher must pay attention while selecting the size of the sample.
Process of determining Sample Size
Precision is the range within which the answer may vary and still be acceptable;
confidence level indicates the likelihood that the answer will fall within that range, and
the significance level indicates the likelihood that the answer will fall outside that
range.
We can always remember that if the confidence level is 95%, then the significance level
will be (100 – 95) i.e., 5%; if the confidence level is 99%, the significance level is (100
– 99) i.e., 1%, and so on.
We should also remember that the area of normal curve within precision limits for the
specified confidence level constitutes the acceptance region and the area of the curve
outside these limits in either direction constitutes the rejection regions.
In other words, a confidence interval is the specific interval estimate of the parameter
determined by using the data obtained from a sample and the specific confidence level of the
estimate.
The confidence level of an interval estimate of a parameter is the probability that the interval
estimate will contain the parameter. Three confidence levels are commonly used: 90%,
95% and 99%.
If a specific sample mean, say $\bar{x}$, is selected, there is a 95% probability that it falls within the
range $\mu \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$. Likewise, there is a 95% probability that the interval specified by
$\bar{x} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$ will contain $\mu$, i.e.
$$\bar{x} - 1.96\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$
Hence one can be 95% confident that the population mean is contained in the interval when the
values of the variable are normally distributed.
E.g. A teacher wishes to estimate the average age of the students enrolled. From past studies
the standard deviation is known to be 2 years. A sample of 50 students is selected and the mean is
found to be 23.2 years. Find the 95% confidence interval of the population mean.
Since the 95% confidence interval is desired, the $z_{\alpha/2}$ value is 1.96.
$$\bar{x} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$
$$23.2 - 1.96\left(\frac{2}{\sqrt{50}}\right) < \mu < 23.2 + 1.96\left(\frac{2}{\sqrt{50}}\right)$$
23.2 - 0.6 < μ < 23.2 + 0.6 years, i.e. 23.2 ± 0.6 years.
The teacher can say with 95% confidence that the average age of the students is between 22.6
and 23.8 years, based on 50 students. I.e. there is a 95% probability that the confidence interval
built around a specific sample mean would contain the population mean.
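The interval in this example can be reproduced with a short Python sketch. Instead of reading z from a table, it obtains the critical value from `statistics.NormalDist().inv_cdf`, which gives about 1.96 for a 95% level; the function name `mean_confidence_interval` is an illustrative choice and σ is assumed to be known.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval sample_mean +/- z * sigma / sqrt(n) for a known sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value leaving alpha/2 in each tail
    margin = z * sigma / sqrt(n)             # maximum error of estimate
    return sample_mean - margin, sample_mean + margin

low, high = mean_confidence_interval(23.2, 2, 50)
print(round(low, 1), round(high, 1))
```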
$\alpha$ (alpha) represents the total area in both tails of the normal curve.
$\frac{\alpha}{2}$ represents the area in each one of the tails.
The relationship between $\alpha$ and the confidence level is: confidence level = $1 - \alpha$.
E.g. when a 95% confidence interval is to be found, $\alpha = 0.05$, since 1 - 0.05 = 0.95 or 95%; when
$\alpha = 0.01$, 1 - 0.01 = 0.99 or 99%.
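The formula for the confidence interval is as follows.
$$\bar{x} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$
For 95%, $z_{\alpha/2} = 1.96$; for 99%, $z_{\alpha/2} = 2.58$. If n ≥ 30, S can be used in place of σ where σ is
unknown.
$z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$ is called the maximum error of estimate.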
Sample size: it depends on the maximum error of estimate, the population standard deviation
and the degree of confidence.
-The population standard deviation is assumed to be known (it has been estimated from
previous studies).
The formula for sample size is derived from the maximum error of estimate (e) formula,
$$e = z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$
This is solved for n as follows:
$$n = \left(\frac{z_{\alpha/2} \times \sigma}{e}\right)^2$$
where σ = standard deviation of the population (to be estimated from past experience)
z = the value of the standard variate at a given confidence level (to be read
from the table or given); it is 1.96 for a 95% confidence level
n = size of the sample
e = acceptable error (the precision).
N.B:- If n comes out as a fraction, use the next whole number as the sample size n.
- The above formula is applicable when the population happens to be large (n>30). But in the case
of a small population of size N, the formula for determining the sample size becomes:
$$n = \frac{Z^2 \cdot N \cdot \sigma^2}{(N-1)e^2 + Z^2\sigma^2}$$
Example: The president of a university asks the statistics instructors to estimate the average age
of the students in the university. How large a sample is necessary? The statistics instructors decide
that the estimate should be accurate within 1 year and be 99% confident. From a previous study, the
standard deviation of the students' ages is known to be 3 years.
Solution:
$\alpha = 0.01$ (1 - 0.99), $z_{\alpha/2} = 2.58$, e = 1, and σ = 3
$$n = \left(\frac{z_{\alpha/2} \times \sigma}{e}\right)^2 = \left(\frac{2.58 \times 3}{1}\right)^2 = 59.9$$
Since n is a fraction, it is rounded up: a sample of at least 60 students is needed.
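Both sample-size formulas can be sketched in Python as follows. The function names are illustrative, rounding up follows the N.B. above, and the population size N = 500 in the second call is a hypothetical figure chosen only to show the small-population version.

```python
from math import ceil

def sample_size(z, sigma, e):
    """n = (z * sigma / e)^2, rounded up to the next whole number."""
    return ceil((z * sigma / e) ** 2)

def sample_size_small_population(z, sigma, e, N):
    """n = z^2 * N * sigma^2 / ((N - 1) * e^2 + z^2 * sigma^2), rounded up."""
    return ceil(z ** 2 * N * sigma ** 2 / ((N - 1) * e ** 2 + z ** 2 * sigma ** 2))

# The worked example: 99% confidence (z = 2.58), sigma = 3 years, e = 1 year
n_large = sample_size(2.58, 3, 1)

# Same requirements, but for a hypothetical population of N = 500 students
n_small = sample_size_small_population(2.58, 3, 1, 500)

print(n_large, n_small)
```

For the same precision and confidence, the small-population formula never asks for more units than the first one.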
5.4 Sampling Methods (Techniques)
There are two types of sampling techniques:
Random Sampling (probability sampling) & Non probability sampling
5.4.1. Probability sampling
It is a method of sampling in which all elements in the population have a pre-assigned nonzero
probability of being included in the sample.
Examples:
• Simple random sampling
• Stratified random sampling
• Cluster sampling
• Systematic sampling
1. Simple Random Sampling:
It is a method of selecting items from a population such that every possible sample of a
specific size has an equal chance of being selected.
All elements in the population have the same pre-assigned nonzero probability of being
included in the sample.
Simple random sampling can be done using the lottery method.
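As a minimal sketch of the lottery idea, Python's standard `random.sample` draws units without replacement so that every subset of the requested size is equally likely. The population of 100 numbered units, the sample size of 10 and the seed are hypothetical choices for illustration.

```python
import random

# Hypothetical sampling frame: units numbered 1 to 100
population = list(range(1, 101))

rng = random.Random(42)  # fixed seed so that the draw can be repeated
sample = rng.sample(population, 10)  # 10 distinct units, each subset equally likely

print(sorted(sample))
```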
2. Stratified Random Sampling:
The population is divided into non-overlapping groups called strata.
Random selection is carried out within each sub-group. Then, the randomly selected
representatives of the sub-groups together form the stratified sample.
The random selection can be done in proportion to the size or number in the
population of each sub-group. This is called proportional allocation. This requires
information about the relative sizes of the strata in the population. That is to say, the
exact population numbers or good estimates of these numbers should be available.
Elements within the same stratum should be more or less homogeneous, while elements in
different strata should differ.
It is applied if the population is heterogeneous.
Some of the criteria for dividing a population into strata are: Sex (male, female); Age
(under 18, 18 to 28, 29 to 39); Occupation (blue-collar, professional, other).
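Proportional allocation can be sketched in Python as follows. The stratum sizes (200, 500 and 300), the total sample size of 100 and the unit labels are hypothetical; only the age groups are taken from the text above.

```python
import random

# Hypothetical stratum sizes for the age groups mentioned above
strata_sizes = {"under 18": 200, "18 to 28": 500, "29 to 39": 300}

def proportional_allocation(strata_sizes, n):
    """Share a total sample of size n among strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    # Rounding can make the parts add up to slightly more or less than n;
    # in that case the allocation has to be adjusted by hand.
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

allocation = proportional_allocation(strata_sizes, 100)

rng = random.Random(1)  # fixed seed so that the draw can be repeated
stratified_sample = {}
for name, size in strata_sizes.items():
    units = [f"{name} #{i}" for i in range(1, size + 1)]  # hypothetical unit labels
    # simple random sampling within each stratum
    stratified_sample[name] = rng.sample(units, allocation[name])

print(allocation)
```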
3. Cluster Sampling:
This is a method of sampling in which the sampling unit is a naturally occurring group of
individuals rather than an individual.
In other words, in a cluster sample the characteristics of research interest are
identified, the areas or zones in which these characteristics exist are also
identified, and samples from the identified zones are randomly constituted. The
population is divided into non-overlapping groups called clusters.
A simple random sample of groups or cluster of elements is chosen and all the sampling
units in the selected clusters will be surveyed.
Clusters are formed in a way that elements within a cluster are heterogeneous, i.e.
observations in each cluster should be more or less dissimilar.
Cluster sampling is useful when it is difficult or costly to generate a simple random
sample. For example, to estimate the average annual household income in a large city
we use cluster sampling, because to use simple random sampling we need a complete
list of households in the city from which to sample. To use stratified random sampling,
we would again need the list of households. A less expensive way is to let each block
within the city represent a cluster. A sample of clusters could then be randomly
selected, and every household within these clusters could be interviewed to find the
average annual household income.
4. Systematic Sampling:
A complete list of all elements within the population (sampling frame) is required.
The procedure starts by determining the first element to be included in the sample.
Then the technique is to take every kth item from the sampling frame.
Let N = population size, n = sample size, k = N/n = sampling interval.
Choose any number between 1 and k; suppose it is j (1 ≤ j ≤ k).
The jth unit is selected first and then the (j+k)th, (j+2k)th, … etc. until the required sample
size is reached.
With sample size n and population size N, the sampling interval is k = N/n. For instance,
if N = 1000 and n = 100, then k = 10.
We can randomly pick any number from 1 to 10. In this case, the selection of that
number determines the entire sample. Example: if we pick 2, then 2, 12, 22, 32, 42, etc.
automatically become members of the sample.
The main advantage here is that it requires less work. The
disadvantage is that if the listing of the population is not in random
order, periodicity can be introduced. Periodicity means a situation where every kth
member of the population has some characteristic peculiar or unique to only those
members.
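The selection rule can be sketched in Python as follows. The function name `systematic_sample` is illustrative, and N is assumed to be an exact multiple of n so that the interval k is a whole number.

```python
import random

def systematic_sample(N, n, start=None, rng=random):
    """Positions j, j+k, j+2k, ... in a frame of N units, with k = N // n."""
    k = N // n                                   # sampling interval
    j = start if start is not None else rng.randint(1, k)  # random start in 1..k
    return [j + i * k for i in range(n)]

# The example from the text: N = 1000, n = 100, starting number 2
sample = systematic_sample(1000, 100, start=2)
print(sample[:5], "...", sample[-1])
```

Once the starting number is fixed, every other member of the sample is determined, which is why an ordered or periodic list can bias the result.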
5.4.2. Non probability sampling
It is a sampling technique in which the choice of individuals for a sample is based on
convenience, personal choice or interest.
Examples: • Judgment sampling.
• Convenience sampling
• No-rule sampling:
1. Judgment Sampling
- In this case, the person taking the sample has direct or indirect control over which items are
selected for the sample.
2. Convenience Sampling
- In this method, the decision maker selects a sample from the population in a manner that is
relatively easy and convenient.
3. No-rule sampling: we take a sample without any rule; the sample is representative if the
population is homogeneous and there is no selection bias.
Chapter Six
6. SIMPLE CORRELATION AND LINEAR REGRESSION
Linear regression and correlation analysis study and measure the linear relationship among two
or more variables. When only two variables are involved, the analysis is referred to as simple
correlation and simple linear regression analysis, and when there are more than two variables
the terms multiple regression and partial correlation are used.
Regression Analysis: is a statistical technique that can be used to develop a mathematical
equation showing how variables are related.
Correlation Analysis: deals with the measurement of the closeness of the relationship which
is described in the regression equation.
We say there is correlation when the two series of items vary together, directly or inversely.
Simple Correlation
Suppose we have two variables X=(x1, x2,… xn) and Y =(y1, y2,….. yn)
When higher values of X are associated with higher values of Y and lower values of X are
associated with lower values of Y, then the correlation is said to be positive or direct.
Examples:
- Income and expenditure
- Number of hours spent in studying and the score obtained
When higher values of X are associated with lower values of Y and lower values of X are
associated with higher values of Y, then the correlation is said to be negative or inverse.
Examples:
- Demand and supply
The correlation between X and Y may be one of the following
1. Perfect positive (slope=1)
2. Positive (slope between 0 and 1)
3. No correlation (slope=0)
4. Negative (slope between -1 and 0)
5. Perfect negative (slope=-1)
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called “subject” or “independent”
variable, while the effect is called “dependent” variable.
2. Both variables being the result of a common cause. That is, the correlation that exists between
two variables is due to their being related to some third force.
Example:
Let X1= be HEEE result
Y1 & Y2 have positive correlation but they are not directly related, but they are related to each
3. Chance:
The correlation that arises by chance is called spurious correlation.
Examples:
• Weight of individuals in Ethiopia and income of individuals in Kenya.
Therefore, while interpreting correlation coefficient, it is necessary to see if there is any
likelihood of any relationship existing between variables under study.
The correlation coefficient is the measure used to determine the strength of the relationship
between two variables. There are several types of correlation coefficients. One of the common
types is the Pearson Product Moment Correlation Coefficient (PPMC). The correlation
coefficient computed from the sample data measures the strength and direction of a linear
relationship between two variables. The symbol for the sample correlation coefficient is r.
The symbol for the population correlation coefficient is ρ (Greek letter rho).
6.1.1 Pearson's Product-moment Correlation Coefficient
This measure considers the magnitudes of the observations rather than their ranks. The formula
and procedure are applicable only to quantitative data. The coefficient of correlation (r) is a
measure of the strength of the relationship between two variables. It requires interval or ratio
scaled data (variables).
$$ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}} $$
where n is the number of data pairs and X and Y are the variables. The dependent variable (Y) is
the variable that is being predicted or estimated, and the independent variable (X) is the
predictor variable that provides the basis for estimation.
The short cut formula is
$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} $$
Remark:
The value of r always lies between -1 and 1, inclusive.
Interpretation of r:
1. Perfect positive linear relationship (if r= 1)
2. Some Positive linear relationship (if r is between 0 and 1)
3. No linear relationship (if r= 0)
4. Some Negative linear relationship (if r is between 0 and -1)
5. Perfect negative linear relationship (if r= -1)
Example 8.1: The data below shows age and average daily income of six farmers. Compute the
value of the correlation coefficient.
Solution:
Make a table, find the values of XY, X², and Y², and place these values in the corresponding
columns of the table.
Farmer code   Age (X)   Average daily income (Y)   XY      X²      Y²
A             43        128                        5504    1849    16384
B             48        120                        5760    2304    14400
C             56        135                        7560    3136    18225
D             61        143                        8723    3721    20449
E             67        141                        9447    4489    19881
F             70        152                        10640   4900    23104
Total (∑)     345       819                        47634   20399   112443
$$ r = \frac{6(47634) - (345)(819)}{\sqrt{[6(20399) - 345^2][6(112443) - 819^2]}} = \frac{285804 - 282555}{\sqrt{[122394 - 119025][674658 - 670761]}} $$
$$ r = \frac{3249}{\sqrt{13128993}} = 0.897 $$
The correlation coefficient suggests a strong positive relationship between age and average
daily income of farmers.
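The arithmetic above can be checked with a short script. The following is a minimal sketch of the short cut formula in plain Python; the function name `pearson_r` and the variable names are illustrative choices, not part of the original notes.

```python
import math

# Age (X) and average daily income (Y) of the six farmers in the example.
ages = [43, 48, 56, 61, 67, 70]
incomes = [128, 120, 135, 143, 141, 152]


def pearson_r(xs, ys):
    """Pearson correlation computed with the short cut (raw sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(ages, incomes)
print(round(r, 3))  # 0.897
```

Working from the raw sums mirrors the hand calculation, so each intermediate quantity (∑XY, ∑X², ∑Y²) can be compared with the table.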
Coefficient of Determination
The coefficient of determination (r²) is the proportion of the total variation of the dependent
variable Y that is explained by the variation in the independent variable X. It is the square of
the coefficient of correlation (r) and ranges from 0 to 1.
From the above example, r = 0.897.
r² = (0.897)² ≈ 0.80. This is a proportion or a percent. We can say that about 80 percent of the
variation in average daily income is explained by the variation in age.
6.1.2 Spearman's Rank Order Correlation Coefficient (rank correlation)
This is the technique of determining the degree of correlation between two variables in the case
of ordinal data, where ranks are given to the different values of the variables. The main
objective of this coefficient is to determine the extent to which the two sets of rankings are
similar or dissimilar.
Spearman's coefficient of correlation (rs) is given by:
$$ r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} $$
where rs = coefficient of rank correlation, Di = difference between the two ranks of the i-th
pair, and n = number of pairs.
For example, with ∑Di² = 12 and n = 7:
$$ r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{7(48)} = 0.786 $$
so there is positive correlation.
Example 2
Eight nations report the following data on their infant mortality rate and general mortality
rate. Rank the data. Does there appear a relationship between the two mortality rates?
                   Canada   U.S.A   Sweden   U.K    France   Japan   China   Spain
Infant mortality   8.1      10.5    6.4      9.6    10.0     6.2     50      9.6
Mortality          7.0      9.0     11.0     11.0   10.7     6.0     8       8.1
Step 1: Rank the data from lowest to highest. The lowest score should be ranked 1 and the
highest score 8. Be sure to use the mean rank for values that tie. For example, Sweden and the
U.K tie for the worst general mortality rate. Since they share the seventh and eighth positions,
both are assigned the rank 7.5, i.e. (7 + 8)/2. Rewrite the ranked data.
                   Canada   U.S.A   Sweden   U.K    France   Japan   China   Spain
Infant mortality   3        7       2        4.5    6        1       8       4.5
Mortality          2        5       7.5      7.5    6        1       3       4
Step 2: Rearranging the data in columns, calculate the Spearman rank correlation coefficient.
Nation    Infant mortality rank (x)   Mortality rank (y)   D = x - y   D² = (x - y)²
Canada    3                           2                    1.0         1.00
U.S.A     7                           5                    2.0         4.00
Sweden    2                           7.5                  -5.5        30.25
U.K       4.5                         7.5                  -3.0        9.00
France    6                           6                    0.0         0.00
Japan     1                           1                    0.0         0.00
China     8                           3                    5.0         25.00
Spain     4.5                         4                    0.5         0.25
Total                                                                  69.50
$$ r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(69.5)}{8(8^2 - 1)} = 1 - \frac{417}{504} = 0.173 $$
Interpret the results: A correlation of 0.173 suggests there is little correlation between the
rankings of these nations' infant mortality rates and general mortality rates. The small
correlation that does exist is positive, which suggests that as a nation's infant mortality
ranking increases, so does its general mortality ranking.
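The ranking and the rank correlation in this example can be reproduced with a few lines of Python. This is a sketch under the same conventions as the worked solution (ranks from lowest to highest, tied values given the mean of their positions, n = 8 nations); the helper names `average_ranks` and `spearman_rs` are hypothetical.

```python
def average_ranks(values):
    """Rank from lowest (1) to highest; tied values get the mean of their positions."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first position occupied by this value
        count = ordered.count(v)                # how many observations share the value
        ranks.append(first + (count - 1) / 2)   # mean of the tied positions
    return ranks


def spearman_rs(xs, ys):
    """Spearman coefficient from the formula 1 - 6*sum(D^2) / (n*(n^2 - 1))."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Order: Canada, U.S.A, Sweden, U.K, France, Japan, China, Spain
infant = [8.1, 10.5, 6.4, 9.6, 10.0, 6.2, 50, 9.6]
general = [7.0, 9.0, 11.0, 11.0, 10.7, 6.0, 8, 8.1]

rs = spearman_rs(infant, general)
print(average_ranks(infant))  # [3.0, 7.0, 2.0, 4.5, 6.0, 1.0, 8.0, 4.5]
print(round(rs, 3))           # 0.173
```

Note that this applies the formula exactly as given in the text; statistical packages that correct for ties may report a slightly different value.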
b is a constant indicating the slope of the regression line, and it gives a measure of the change in
Y for a unit change in X. It is also called the regression coefficient of Y on X.
Example 1: The following data shows the score of 12 students for Accounting and Statistics
examinations.
a) Calculate a simple correlation coefficient
b) Fit a regression line of Statistics on Accounting using least square estimates.
c) Predict the score of Statistics if the score of accounting is 85.
Student     1    2    3    4    5    6     7    8    9    10   11   12
Acc. (X)    74   93   55   41   23   92    64   40   71   33   30   71
Stat. (Y)   81   86   67   35   30   100   55   52   76   24   48   87
First draw a scatter plot of the raw data. The scatter plot is used to determine the nature of
the relationship. After the scatter plot, the next step is to compute r (the correlation
coefficient). If r is significant, the next step is to determine the equation of the regression
line. Determining a regression line when r is not significant, and making predictions from it,
is meaningless.
As the scatter plot shows, there appears to be some linear relationship between the
variables.
Student   Acc. (X)   Stat. (Y)   X²      Y²      XY
1         74         81          5476    6561    5994
2         93         86          8649    7396    7998
3         55         67          3025    4489    3685
4         41         35          1681    1225    1435
5         23         30          529     900     690
6         92         100         8464    10000   9200
7         64         55          4096    3025    3520
8         40         52          1600    2704    2080
9         71         76          5041    5776    5396
10        33         24          1089    576     792
11        30         48          900     2304    1440
12        71         87          5041    7569    6177
Total     687        741         45591   52525   48407
Mean      57.25      61.75
$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} $$
$$ r = \frac{12(48407) - (687)(741)}{\sqrt{[12(45591) - 687^2][12(52525) - 741^2]}} = \frac{71817}{\sqrt{(75123)(81219)}} \approx 0.92 $$
a).The Coefficient of Correlation (r) has a value of 0.92. This indicates that the two variables
are positively correlated (Y increases as X increases).
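b) Using OLS (ordinary least squares):
$$ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} $$
$$ b = \frac{48407 - 12(57.25)(61.75)}{45591 - 12(57.25)^2} = 0.9560 $$
$$ a = \bar{Y} - b\bar{X} = 61.75 - 0.9560(57.25) = 7.0194 $$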
This means that for each unit change in X, Y changes by 0.9560 units. The regression line can be
used to make predictions for the dependent variable, e.g. to predict the value of the dependent
variable (Ŷ) if X = 85.
c) Insert X = 85 in the estimated regression line.
$$ \hat{Y} = 7.0194 + 0.9560X $$
$$ \hat{Y} = 7.0194 + 0.9560(85) = 88.28 $$
For a valid prediction, the value of the correlation coefficient must be significant. Also, for
any specific value of x, the variable y must be normally distributed about the regression line.
The standard deviation of the dependent variable must be the same for each value of the
independent variable.
- A prediction is made based on present conditions, or on the premise that the present trend
will continue.
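The least squares fit and the prediction for the Accounting and Statistics scores can be verified with the sketch below. It uses the computational form b = (∑XY − n·X̄·Ȳ) / (∑X² − n·X̄²) and a = Ȳ − b·X̄; the function name `fit_line` is an illustrative assumption.

```python
acc = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]     # Accounting (X)
stat = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]   # Statistics (Y)


def fit_line(xs, ys):
    """Return (a, b) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    sxx = sum(x * x for x in xs) - n * mean_x ** 2
    b = sxy / sxx             # slope: change in Y per unit change in X
    a = mean_y - b * mean_x   # intercept
    return a, b


a, b = fit_line(acc, stat)
predicted = a + b * 85        # predicted Statistics score for Accounting = 85

# Slope is about 0.956, intercept about 7.019, prediction about 88.28.
print(round(b, 3), round(a, 3), round(predicted, 2))
```

Because the script keeps b unrounded, the last digits of a and the prediction can differ slightly from a hand calculation that rounds b to 0.9560 first.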
6.2.1. Coefficient of determination (r2).
It is the ratio of explained variation to total variation and is denoted by r². It measures the
proportion of the variation of the dependent variable that is explained by the regression line.
Variation due to the relationship is called explained variation. Variation due to chance is
called unexplained variation. The sum of explained and unexplained variation is the total
variation.
$$ r^2 = \frac{\text{explained variation}}{\text{total variation}} $$
To find r², square the correlation coefficient (r) and convert the result to a percent.
If r = 0.90, then r² = 0.81, which is equivalent to 81%. This means 81% of the variation in the
dependent variable is accounted for by the variation in the independent variable. The remaining
19% is unexplained variation. This proportion is called the coefficient of non-determination
(CND) and is found by subtracting the coefficient of determination (r²) from 1:
CND = 1.00 - r².
E.g. if r = 0.6, r² = 0.36, which means only 36% of the variation in the dependent variable can
be attributed to variation in the independent variable.
6.2.2. Standard error of estimate (Sest).
It is the standard deviation of the observed y values about the predicted (y′) values. The
formula is given as follows.
$$ S_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}} $$
The closer the observed values (y) are to the predicted values (y′), the smaller the Sest.
E.g. a researcher collected the following data and determined that there is a significant
relationship between the age of a copy machine and its monthly maintenance cost. The regression
equation is y′ = 55.57 + 8.13x. Find the standard error of estimate.
Solution
Machine   Age (x)   Monthly cost (y)   y′       y - y′   (y - y′)²   xy     y²
A         1         62                 63.70    -1.70    2.8900      62     3844
B         2         78                 71.83    6.17     38.0689     156    6084
C         3         70                 79.96    -9.96    99.2016     210    4900
D         4         90                 88.09    1.91     3.6481      360    8100
E         4         93                 88.09    4.91     24.1081     372    8649
F         6         103                104.35   -1.35    1.8225      618    10609
Total                                                    169.7392    1778   42186
$$ S_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}} = \sqrt{\frac{169.7392}{6 - 2}} = 6.51 $$
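The copy machine calculation can be sketched in Python as follows. Two assumptions are made and should be read as such: machine E is taken to be 4 years old, which is what its listed xy = 372 and y − y′ = 4.91 imply, and the divisor is n − 2, the usual choice for simple regression because two coefficients (a and b) are estimated.

```python
import math

ages = [1, 2, 3, 4, 4, 6]              # machine E assumed to be age 4 (see note above)
costs = [62, 78, 70, 90, 93, 103]      # monthly maintenance cost


def predict(x):
    """Regression equation given in the example: y' = 55.57 + 8.13x."""
    return 55.57 + 8.13 * x


residuals = [y - predict(x) for x, y in zip(ages, costs)]
sse = sum(e ** 2 for e in residuals)            # sum of squared residuals
s_est = math.sqrt(sse / (len(ages) - 2))        # assumed n - 2 divisor

print(round(sse, 4), round(s_est, 2))  # 169.7392 6.51
```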
CHAPTER SEVEN
7. Multiple Regression
A simple regression equation contains one dependent variable (DV) and one independent
variable (IV) and is written as y′ = a + bx.
In multiple regression, there are several independent variables (IVs) and one dependent
variable (DV), and the equation is y′ = a + b1x1 + b2x2 + ... + bkxk, where x1, x2, ..., xk
are the IVs.
It is used when several IVs contribute to the variation of the DV.
This analysis is important for increasing the accuracy of predictions for the dependent
variable over what a single independent variable provides.
The multiple correlation coefficient, R, can be computed to determine whether a significant
relationship exists between the IVs and the DV. Since the computation of multiple regression is
quite complicated, most of it is done on a computer.
Let us see an example with only two IVs and one DV, as follows.
A researcher wants to see whether students' grade point average (x1) and age (x2) are related to
students' score on an examination (y) in one hypothetical department. Five students are
selected, and the regression equation obtained from their data is
$$ y' = -44.572 + 87.679x_1 + 14.519x_2 $$
$$ y' = -44.572 + 87.679(3.0) + 14.519(25) = 581.44 \approx 581 $$
Hence, if a student has a GPA of 3.0 and is 25 years old, his or her predicted score is 581.
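A small sketch of this prediction in Python is given below. The intercept is taken as −44.572, which is an inference: it is the value that reproduces the stated result of 581.44 for a GPA of 3.0 and an age of 25. The function name `predict_score` is hypothetical.

```python
def predict_score(gpa, age):
    """Predicted examination score from GPA (x1) and age (x2).

    The intercept -44.572 is assumed; it is the value consistent with 581.44.
    """
    return -44.572 + 87.679 * gpa + 14.519 * age


score = predict_score(3.0, 25)
print(round(score, 2))  # 581.44

# Holding age constant, one extra GPA point changes the prediction by b1.
gpa_effect = predict_score(4.0, 25) - predict_score(3.0, 25)
print(round(gpa_effect, 3))  # 87.679
```

The second calculation illustrates what a regression coefficient means: the change in the predicted score for a one-unit change in one variable while the other is held fixed.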
In the equation, each b represents the amount of change in y′ for a one-unit change in the
corresponding x value when the other x values are held constant. In the above example,
y′ = -44.572 + 87.679x1 + 14.519x2, so for each one-unit increase in a student's GPA there is a
change of 87.679 units in the score, with the student's age held constant. The strength of the
relationship between the independent variables and the dependent variable is measured by the
multiple correlation coefficient, which is symbolized by R. It can range from 0 to +1; the closer
to 1, the stronger the relationship. The value of R takes into account all the independent
variables and can be computed using the values of the individual correlation coefficients. The
formula for R with two independent variables is given below.
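$$ R = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\,r_{yx_1}\,r_{yx_2}\,r_{x_1x_2}}{1 - r_{x_1x_2}^2}} $$
where r_{yx1} is the correlation coefficient for variables y and x1, r_{yx2} is the correlation
coefficient for variables y and x2, and r_{x1x2} is the correlation coefficient for variables x1
and x2.
Then
$$ R = \sqrt{\frac{(0.845)^2 + (0.791)^2 - 2(0.845)(0.791)(0.371)}{1 - (0.371)^2}} = \sqrt{\frac{0.8437569}{0.862359}} = \sqrt{0.9784288} = 0.989 $$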
Hence, the multiple correlation between students' grade point average and age and their score on
the examination is 0.989, so there is a strong relationship among the variables, as R ≈ 1.00. As
in simple regression, R² is the coefficient of multiple determination, and it is the proportion
of variation explained by the regression model. 1 - R² is the proportion of unexplained
variation, called the error or residual variation. Since R = 0.989 in the previous example,
R² = 0.978 and
1 - R² = 1 - 0.978 = 0.022.
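The multiple correlation coefficient for two independent variables can be computed directly from the three pairwise correlations, as in the sketch below. Only 0.845 is legible in the worked step; the values 0.791 and 0.371 used here are inferred from the intermediate results 0.8437569 and 0.862359, so treat them as assumptions. The function name `multiple_r` is illustrative.

```python
import math


def multiple_r(r_yx1, r_yx2, r_x1x2):
    """Multiple correlation coefficient R for two independent variables."""
    numerator = r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2
    return math.sqrt(numerator / (1 - r_x1x2 ** 2))


# 0.845 is from the text; 0.791 and 0.371 are inferred (see note above).
R = multiple_r(0.845, 0.791, 0.371)
unexplained = 1 - R ** 2

print(round(R, 3), round(R ** 2, 3), round(unexplained, 3))  # 0.989 0.978 0.022
```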
Adjusted R2
Since the value of R² depends on n (the number of data pairs) and k (the number of independent
variables), statisticians also calculate an adjusted R², which is denoted by R²adj:
$$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} $$
R²adj is smaller than R² and takes into account the fact that when n and k are approximately
equal, the value of R may be artificially high due to sampling error rather than a true
relationship among the variables. Even if the correlation coefficients between each independent
variable and the dependent variable were all zero, R could be higher than zero due to sampling
error. Hence, both R² and R²adj are usually reported in multiple regression analysis.
R²adj for the previous example is given as follows.
$$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} = 1 - \frac{(1 - 0.989^2)(5 - 1)}{5 - 2 - 1} = 1 - 0.043758 = 0.956 $$
In this case, when the number of data pairs and the number of independent variables are
accounted for, R²adj is 0.956.
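The adjustment is a one-line formula, sketched here in Python with the numbers of the example (R = 0.989, n = 5 data pairs, k = 2 independent variables). The function name `adjusted_r2` is an illustrative choice.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n data pairs and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


r2 = 0.989 ** 2
adj = adjusted_r2(r2, n=5, k=2)
print(round(adj, 3))  # 0.956

# With many more observations than variables the adjustment becomes small.
adj_large_n = adjusted_r2(r2, n=100, k=2)
```

The second call uses a hypothetical n = 100 only to show how the penalty shrinks as n grows relative to k.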
CHAPTER EIGHT
8. Hypothesis Testing
8.1. Introduction
WHAT IS A HYPOTHESIS?
Ordinarily, when one talks about a hypothesis, one simply means a mere assumption or
some supposition to be proved or disproved.
But for a researcher, a hypothesis is a formal question that he or she intends to resolve.
Thus a hypothesis may be defined as a proposition or a set of propositions set forth as
an explanation for the occurrence of some specified group of phenomena, either asserted
merely as a provisional conjecture to guide some investigation or accepted as highly
probable in the light of established facts.
Quite often a research hypothesis is a predictive statement, capable of being tested by
scientific methods, that relates an independent variable to some dependent variable. For
example, consider statements like the following ones:
“Students who receive counseling will show a greater increase in creativity than students not
receiving counseling” Or “the automobile A is performing as well as automobile B.”
These are hypotheses capable of being objectively verified and tested.
Thus, we may conclude that a hypothesis states what we are looking for and it is a
proposition which can be put to a test to determine its validity.
Characteristics of hypothesis:
Hypothesis should be clear and precise. If the hypothesis is not clear and precise, the
inferences drawn on its basis cannot be taken as reliable.
Hypothesis should be capable of being tested.
Hypothesis should be limited in scope and must be specific. A researcher must
remember that narrower hypotheses are generally more testable and he should develop
such hypotheses.
Hypothesis should be stated as far as possible in most simple terms so that the same is
easily understandable by all concerned. But one must remember that simplicity of
hypothesis has nothing to do with its significance.
Hypothesis should be consistent with most known facts i.e., it must be consistent with
a substantial body of established facts. In other words, it should be one which judges
accept as being the most likely.
Hypothesis should be amenable to testing within a reasonable time. One should not
use even an excellent hypothesis, if the same cannot be tested in reasonable time for one
cannot spend a life-time collecting data to test it.
Hypothesis must explain the facts that gave rise to the need for explanation.
Hypothesis must actually explain what it claims to explain; it should have empirical
reference.
8.2. Basic concepts in the context of testing of hypotheses
Alternative hypothesis:
It is the hypothesis available when the null hypothesis has to be rejected.
It is the hypothesis of difference.
Usually denoted by H1 or Ha.
- The following table gives a summary of the possible results of any hypothesis test:
Researcher decides to:   Null hypothesis is really true   Null hypothesis is really false
Accept hypothesis        Correct decision                 Type II error
Reject hypothesis        Type I error (the worst)         Correct decision
- Type I error (α): Rejecting the null hypothesis when it is true. The probability of a type I
error, α, is called the level of significance.
- Type II error (β): Failing to reject the null hypothesis when it is false.
Power of a test:
The most powerful test is the test that, for a fixed level of significance, minimizes the
probability of a type II error (β). The power of a test is defined as the probability of
rejecting the null hypothesis when it is actually false. It is given as: power of test = 1 - β
NOTE:
1. There are errors that are prevalent in any two choice decision making problems.
2. There is always a possibility of committing one or the other errors.
3. Type I error (α) and type II error (β) have inverse relationship and therefore, cannot be
minimized at the same time.
• In practice we set α at some value and design a test that minimizes β. This is because a type I
error is often considered to be more serious, and therefore more important to avoid, than a type
II error.
The level of significance:
It is always some percentage (mostly 5%) which should be chosen with great care, thought and
reason. In case we take the significance level at 5 per cent, then this implies that H0 will be
rejected when the sampling result (i.e., observed evidence) has a less than 0.05 probability of
occurring if H0 is true. In other words, the 5 per cent level of significance means that researcher
is willing to take as much as a 5 per cent risk of rejecting the null hypothesis when it (H0)
happens to be true. Thus the significance level is the maximum value of the probability of
rejecting H0 when it is true and is usually determined in advance before testing the hypothesis.
In short, the level of significance is the maximum probability with which we would be willing
to commit a type I error. It is denoted by the Greek letter alpha (α).
General steps in hypothesis testing:
The first step in hypothesis testing is to specify the null hypothesis (H 0) and the
Data analysis.
Decision rule: if the computed value is greater than the table value, H0 will be rejected.
CASES:
$$ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} $$
$$ Z = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}, \quad \text{if } \sigma \text{ is unknown.} $$
Or
Case 2: When sampling is from a normal distribution with σ2 unknown and small sample size
$$ t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t \text{ with } n - 1 \text{ degrees of freedom.} $$
CHAPTER NINE
After the data have been organized into a frequency distribution, they can be presented in
graphical form. The purpose of graphs in statistics is to convey the data to the viewers in
pictorial form. It is easier for most people to comprehend the meaning of data presented
graphically than data presented numerically in tables or frequency distributions. This is
especially true if the users have little or no statistical knowledge.
Statistical graphs can be used to describe the data set or to analyze it. Graphs are also useful in
getting the audience's attention in a publication or a speaking presentation. They can be used to
discuss an issue, reinforce a critical point, or summarize a data set. They can also be used to
discover a trend or pattern in a situation over a period of time.
The three most commonly used graphs in research are (1) The histogram, (2) The frequency
polygon, and (3) The cumulative frequency graph, or ogive.
9.1 Histogram
It is a graph which displays the data by using vertical bars of various heights to represent
frequencies. It is a graph in which class boundaries or class interval is marked on the horizontal
axis and the corresponding class frequency on the vertical axis.
Note that the bars of a histogram must be joined together; this differentiates it from a bar
chart.
Example
Considering the frequency distribution table below, draw a histogram for the table
Step 2 Represent the frequency on the y axis and the class boundaries on the x axis.
Step 3 Using the frequencies as heights, draw vertical bars for each class boundaries. Look at
the figure below.
A set of bars (thick lines or narrow rectangles) representing some magnitude over time or space.
- They are useful for comparing aggregates over time or space.
- Bars can be drawn either vertically or horizontally.
- There are different types of bar charts. The most common being :
• Simple bar chart
• Deviation or two way bar chart
• Broken bar chart
• Component or sub divided bar chart.
• Multiple bar charts.
Draw a component bar chart to represent the sales by product from 1957 to 1959.
9.3 Frequency Polygon:
This is a line graph that displays the data by using lines that connect points plotted for the
frequencies at the midpoints/class marks of the classes. The frequencies are represented by the
heights of the points. It can also be obtained by connecting midpoints of the tops of the
rectangles in the histogram.
Example: Using the frequency distribution of the temperature of towns listed below,
construct a frequency polygon.
Class Limits   Class Mark   Class Boundaries   Frequency   Cumulative Frequency
16–19          17.5         15.5–19.5          4           4
20–23          21.5         19.5–23.5          11          15
24–27          25.5         23.5–27.5          7           22
28–31          29.5         27.5–31.5          21          43
32–35          33.5         31.5–35.5          5           48
36–39          37.5         35.5–39.5          1           49
40–43          41.5         39.5–43.5          1           50
Step 1 Find the midpoints of each class.
Step 2 Draw the x and y axes. Label the x axis with the midpoint of each class, and then use a
suitable scale on the y axis for the frequencies.
Step 3 Using the midpoints for the x values and the frequencies as the y values, plot the points.
Step 4 Connect adjacent points with line segments.
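Steps 1 to 3 amount to pairing each class midpoint with its frequency. The sketch below computes those points for the temperature data; drawing the line segments of Step 4 is left to whatever plotting tool is available. The variable names are illustrative.

```python
# Class limits and frequencies from the temperature example.
class_limits = [(16, 19), (20, 23), (24, 27), (28, 31), (32, 35), (36, 39), (40, 43)]
frequencies = [4, 11, 7, 21, 5, 1, 1]

# Step 1: the midpoint (class mark) is the average of the lower and upper limits.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Step 3: each point of the polygon is (midpoint, frequency).
polygon_points = list(zip(midpoints, frequencies))

print(midpoints)  # [17.5, 21.5, 25.5, 29.5, 33.5, 37.5, 41.5]
```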
9.4. Ogive (cumulative frequency polygon)
This is the other type of line graph that can be used to represent cumulative frequencies for the
classes. This type of graph is called cumulative frequency graph or ogive. The cumulative
frequency is the sum of the frequencies accumulated up to the upper boundary of a class in the
distribution.
Step 3 Plot the cumulative frequency at each class mark (if class marks are used) or upper
boundary (if class boundaries are used). Upper boundaries are used since the cumulative
frequencies represent the number of data values accumulated up to the upper boundary of each
class.
Step 4 Starting with the first class mark, 17.5, (or first upper class boundary, 19.5) connect
adjacent points with line segments, as shown in the figure below.
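The points of the ogive can be obtained by accumulating the class frequencies and pairing each running total with the upper class boundary. This is a minimal sketch using the temperature data from the frequency polygon example; the variable names are illustrative.

```python
from itertools import accumulate

upper_boundaries = [19.5, 23.5, 27.5, 31.5, 35.5, 39.5, 43.5]
frequencies = [4, 11, 7, 21, 5, 1, 1]

# Running total of the frequencies up to each upper class boundary.
cumulative = list(accumulate(frequencies))

# Each ogive point is (upper boundary, cumulative frequency).
ogive_points = list(zip(upper_boundaries, cumulative))

print(cumulative)  # [4, 15, 22, 43, 48, 49, 50]
```

The last cumulative frequency equals the total number of observations (50), which is a quick check that no class was skipped.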
9.5 Pie chart
- It is a chart in which each frequency is converted to degrees and presented as a sector of a
circle. It is a graph in the shape of a circular pie. The variable is nominal or
categorical.
Example The number of passengers that board a bus from one hypothetical area to other on a
daily basis for a week is given below.
Days Passengers
Monday 50
Tuesday 80
Wednesday 60
Thursday 60
Friday 150
Saturday 150
Sunday 50
Total 600
Calculate the degrees for each day. Total passengers = 600
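Each day's share of the circle is its frequency divided by the total, multiplied by 360°. The following sketch applies that rule to the passenger data; the dictionary and variable names are illustrative.

```python
passengers = {
    "Monday": 50,
    "Tuesday": 80,
    "Wednesday": 60,
    "Thursday": 60,
    "Friday": 150,
    "Saturday": 150,
    "Sunday": 50,
}

total = sum(passengers.values())  # 600

# Degrees for each sector: frequency * 360 / total.
degrees = {day: count * 360 / total for day, count in passengers.items()}

for day, angle in degrees.items():
    print(day, angle)
# Monday 30.0, Tuesday 48.0, Wednesday 36.0, Thursday 36.0,
# Friday 90.0, Saturday 90.0, Sunday 30.0 (one per line)
```

The sector angles add up to 360°, which is a useful check before drawing the chart.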
Example: The following table shows the population size of place A between 2000 and 2009. Draw
a time series graph for the data (Population Size of Place A, 2000–2009) and summarize the
findings.
Year   Population Size
2000   150,000
2001   160,000
2002   175,000
2003   200,000
2004   230,000
2005   270,000
2006   280,000
2007   290,000
2008   295,000
2009   300,000
Step 2: Label the x axis for years and the y axis for population size.