
Quant For Student

Chapter 1 introduces key concepts in statistics, defining it as the science of collecting, organizing, analyzing, and interpreting numerical data. It outlines the two main branches of statistics, descriptive and inferential, and details the stages of statistical investigation, including data collection and analysis. The chapter also covers measurement levels, types of variables, and the applications, uses, and limitations of statistics.

Uploaded by

dababiru88
Copyright
© © All Rights Reserved

Chapter 1.

Introduction to the course


1.1. Key concepts in statistics
Definition of Statistics
The word 'Statistics' has different meanings to different people.
• To some, it is a collection of tables, charts, data or numbers.
• To others, it is an aspect of advanced mathematics.
• In the plural sense, statistics are the raw data themselves, such as statistics of births, statistics of deaths, statistics of students, statistics of imports and exports, etc.
• In the singular sense, statistics is the subject that deals with the collection, organization, presentation, analysis and interpretation of numerical data.
Note:-
Generally, statistics is the science of collecting, organizing, summarizing and analyzing data and drawing conclusions from it. It also refers to the subject area of applied mathematics that is concerned with extracting relevant information from available data with the aim of making sound decisions.
Depending on how data are used, statistics is sometimes divided into two main areas or branches.
Descriptive Statistics: is concerned with summary calculations, graphs, charts and tables.
Inferential Statistics: is a method used to generalize from a sample to a population. For example, the average income of all families (the population) in Ethiopia can be estimated from figures obtained from a few hundred families (the sample).
• It is important because statistical data usually arise from a sample.
• Statistical techniques based on probability theory are required.
Stages in statistical investigation.
There are five stages or steps in any statistical investigation.
1. Collection of data: the process of measuring, gathering and assembling the raw data upon which the statistical investigation is to be based. Data can be collected in a variety of ways; one of the most common is the survey. Surveys can be carried out in different ways, three of the most common being:
• Telephone survey
• Mailed questionnaire
• Personal interview.
2. Organization of data: the summarization of data in some meaningful way, e.g. in table form.
3. Presentation of data: the process of re-organization, classification, compilation and summarization of data to present it in a meaningful form.
4. Analysis of data: the process of extracting relevant information from the summarized data, mainly through the use of elementary mathematical operations.
5. Interpretation of data: the interpretation and further observation of the various statistical measures obtained from the analysis, by which conclusions are formed and inferences are made.
1.2. Some Basic Concepts in statistics
1) A (statistical) population: is the complete set of possible measurements for which inferences are to be made. The population represents the target of an investigation, and the objective of the investigation is to draw conclusions about the population; hence we sometimes call it the target population.
Examples:-
• Population of trees under specified climatic conditions
• Population of animals fed a certain type of diet
• Population of farms having a certain type of natural fertility
• Population of households, etc.
The population could be finite or infinite (an imaginary collection of units).
There are two ways of investigation: census and sample survey.
2) Census: a complete enumeration of the population. In most real problems it cannot be realized, hence we take a sample.
3) Sample: A sample from a population is the set of measurements that are actually collected in the course of an investigation. It should be selected using some pre-defined sampling technique in such a way that it represents the population very well.
Examples:-
• Monthly production data of a certain factory in the past 10 years.
• A small portion of a finite population.

In practice, we usually do not conduct a census; instead, we conduct a sample survey.
4) Parameter: A measure (value) describing a characteristic of a population.
5) Statistic: A characteristic or measure obtained from a sample.
6) Sampling: The process or method of selecting a sample from the population.
7) Sample size: The number of elements or observations to be included in the sample.
8) Data: Pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are typically the results of measurements and can be the basis of graphs, images or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived for statistical analysis.
9) Variable: Any quality that can have a number of values, which may be either discrete or continuous. A variable is a property that can take on different values. Individuals in a class may differ in sex, age, intelligence, height, etc. These properties are variables. Variables can vary in quality or in quantity. Constants, unlike variables, do not assume different values.
• Quantitative Variables are numerical variables and can be measured. Examples include the balance in a checking account and the number of children in a family. This type of variable assumes values that vary in magnitude and is easy to measure and compare, e.g. weight, height, age, distance, marks obtained in a test, etc. Note that quantitative variables are either discrete (which can assume only certain values, usually with "gaps" between the values, such as the number of bedrooms in your house) or continuous (which can assume any value within a specific range, such as the air pressure in a tire).
• Qualitative Variables are nonnumeric variables and cannot be measured. This type of variable differs in kind. They are only categorized, e.g. gender, nationality, socio-economic status, academic qualifications, marital status.
• Independent Variable: These variables can be manipulated or treated. The effect is reflected in the dependent variable; the value of the dependent variable thus depends on that of the independent variable. Note that in graphing, the dependent variable is placed on the vertical axis (y-axis) while the independent variable is placed on the horizontal axis (x-axis).
• Discrete Variable: A variable that can be counted, or for which there is a fixed set of values. For example, the number of votes in an election is a discrete variable.
• Continuous Variable: A variable measured on a numerical scale, any interval of which may, if desired, be subdivided into an infinite number of values, e.g. length, height, weight, temperature, volume and time.
10) Distribution: The arrangement of a set of numbers classified according to some property or attribute such as age, height, weight, etc.
Applications, Uses and Limitations of statistics.
Applications of statistics:
• Statistics is applied in almost all fields of human endeavor.
• Almost all human beings encounter numerical facts in their daily lives.
• It is applicable in processes such as the invention of certain drugs and the assessment of the extent of environmental pollution.
• It is used in industry, especially in the area of quality control.
Uses of statistics:
The main function of statistics is to enlarge our knowledge of complex phenomena. The following are some uses of statistics:
1. Presenting facts in a definite and precise form.
2. Data reduction.
3. Measuring the magnitude of variations in data.
4. Furnishing a technique of comparison.
5. Estimating unknown population characteristics.
6. Formulating and testing hypotheses.
7. Studying the relationship between two or more variables.
8. Forecasting future events.
Limitations of statistics
• Statistics deals mainly with those subjects of investigation which are capable of being quantitatively measured and numerically expressed. Nowadays, however, statistics can be applied to study both quantitative and qualitative aspects.
• Statistics deals only with aggregates of facts, and no importance is attached to individual items in an aggregate. It is, therefore, limited to those problems where group characteristics are to be studied. For example, if the mean age of a class of students is 22 years, it does not mean that each and every student is 22 years old. It is simply the average of the class.
• Lack of exactitude: Statistical data are only approximate and not mathematically exact. By observing only a limited number of items we make an estimate of the characteristic of the entire population. Mathematical and physical sciences are exact, whereas statistical laws are only approximations. Statistical conclusions may not have universal validity.
• Misuse of records: Statistical methods are dangerous tools in the hands of inexperienced people. The use of statistical tools by untrained persons might lead to wrong conclusions, and statistics may easily be misused by quoting wrong figures to serve a selfish interest. Statistics should therefore be used by experts.
1.3. Measurement level
Proper knowledge about the nature and type of data to be dealt with is essential in order to specify and apply the proper statistical method for their analysis and inference. Measurement is the assignment of numbers or values to objects or events in a systematic fashion. Measurement scale refers to the property of the values assigned to the data. Four levels of measurement scale are commonly distinguished: nominal, ordinal, interval and ratio, and each possesses different properties.
A. The Nominal or Classificatory Scale
A nominal scale enables the classification of individuals, objects or responses based on a common/shared property or characteristic. These people, objects or responses are divided into a number of sub-groups in such a way that each member of a sub-group has a common characteristic. A variable measured on a nominal scale may have one, two or more sub-categories depending upon the extent of variation. Nominal scale variables are qualitative variables that show the category of individuals. Numbers are assigned to the categories simply for coding purposes. It is not possible to compare individuals on the basis of the numbers assigned to them. No arithmetic or relational operations can be applied.
For example, suppose Abass wears number 6 when playing football for the Awassa Kenema team and scored 10 marks in a chemistry test, while Gemeda wears number 9 when playing for the same team and scored 7 marks in the chemistry test. Based on the numbers on the shirts, it is not possible to judge whether Gemeda plays better than Abass. Using the test scores, however, it is possible to judge that Abass performed better than Gemeda in the chemistry test. The mean mark of the two students can also be obtained, but it is not meaningful to find the mean of the shirt numbers because they are simply codes.
Another example: when we collect data we may code sex as 1 = male and 2 = female. This does not show that one sex is greater than the other, nor that female is twice male. The numbers are simply representations. The sequence in which sub-groups are listed makes no difference, as there is no relationship among sub-groups. What we can do is count the number of males and the number of females.
Other examples of variables measured on a nominal scale include religion, ethnicity, color, marital status (married, single, widowed, divorced), country code and the regional differentiation of Ethiopia.
B. Ordinal or Ranking Scale
An ordinal scale has all the properties of a nominal scale plus one of its own. Besides categorizing individuals, objects, responses or a property into sub-groups on the basis of common characteristics, it ranks the sub-groups in a certain order. It refers to variables whose values can be ordered or ranked. The arrangement can be in either ascending or descending order. For example: developed, less developed, least developed; less degraded, highly degraded, etc. The use of an ordinal scale implies a statement of greater than, less than or equal to, without being able to state how much greater or less.
Generally,
• It is the level of measurement which classifies data into categories that can be ranked, but precise differences between the ranks do not exist.
• Arithmetic operations are not applicable but relational operations are applicable.
• Ordering is the sole additional property of the ordinal scale.
Examples: Letter grades (A, B, C, D, F), rating scales (Excellent, Very good, Good, Fair, Poor) and the like.
C. The Interval Scale
An interval scale has all the characteristics of an ordinal scale and, in addition, uses a unit of measurement that enables values of the variable to be placed at equally spaced intervals in relation to the spread of the variable. This scale has a starting and a terminating point, and the range between them is divided into equally spaced units/intervals. The starting and terminating points and the number of units/intervals between them vary from scale to scale. The Centigrade and Fahrenheit scales are examples of the interval scale. In the Centigrade system the starting point (taken as the freezing point) is 0 °C and the terminating point (taken as the boiling point) is 100 °C. The gap between the freezing and boiling points is divided into 100 equally spaced intervals, known as degrees. Each degree or interval is a measurement of temperature: the higher the degree, the higher the temperature. As the starting and terminating points are arbitrary, they are not absolute. For example, zero does not show absence. Therefore, the interval scale is relative in nature. Relational operations are possible, and differences between readings are meaningful, so addition and subtraction can be applied. Because the zero point is arbitrary, however, ratios of readings are not meaningful: 40 °C is not twice as hot as 20 °C.
D. Ratio Scale
A ratio scale has all the properties of nominal, ordinal and interval scales plus its own property: the zero point of a ratio scale is fixed, which means it has a fixed starting point. Therefore, it is an absolute scale: the difference between the intervals is always measured from the zero point. Zero shows the absence of something in this case. For example, if the yield is zero it shows no yield. It also allows ratios of numbers to be meaningfully interpreted; e.g. the ratio of Bekele's height to Martha's height is 1.32, whereas this is not possible with interval scales.
On the ratio scale all mathematical operations can be used. The measurement of income, age, height and weight are examples of this scale. A person who is 40 years of age is twice as old as a 20-year-old. A person earning 60,000 birr per year earns three times the salary of a person earning 20,000 birr. The following table summarizes the characteristics of the four scales.
Table 1. Characteristics and examples of the four measurement scales

Measurement scale: Nominal/classificatory
Characteristics of the scale: Each sub-group has a property which is common to all classified within that group.
Examples:
A. Tree, house, taxi, etc.
B. Gender: male/female; Attitude: favorable/unfavorable
C. Political parties: EPRDF/democrat/republican; Religion: Christian/Muslim/Hindu, etc.; Climatic region: kola/dega/woinadega, etc.

Measurement scale: Ordinal/ranking
Characteristics of the scale: Sub-groups have a relationship to one another. They are arranged in ascending or descending order.
Examples: Income: above average, average, below average; Socio-economic status: upper, middle, low; Attitude: strongly agree, agree, disagree, strongly disagree

Measurement scale: Interval
Characteristics of the scale: It has a unit of measurement within an arbitrary starting and terminating point.
Examples: Temperature: centigrade, Fahrenheit; Altitudinal scale: 10-20, 20-30, 30-40, etc.

Measurement scale: Ratio
Characteristics of the scale: It has a fixed starting point, e.g. a zero point.
Examples: Height: cm, inches, etc.; Income: birr; Age: year/month; Weight: kg

1.4. Frequency Distribution


After a researcher has obtained raw data from any source, the data need to be arranged and organized in a meaningful way in order to describe them and draw useful inferences. The method used for such organization and arrangement is called a frequency distribution. A frequency distribution is a tabular arrangement of data into the classes they belong to. Frequency means the number of times something happens.
Ungrouped Frequency Distribution
This is a type of frequency distribution in which the data are not grouped into intervals; each distinct value is listed with its own frequency.
Example: Given the set of raw data below, construct a frequency table for it.
2, 3, 2, 1, 4, 4, 1, 1, 3, 2, 5, 3, 5, 1, 5, 1, 4, 2, 3, 0, 4, 1, 0, 3, 5, 2, 5, 0, 4, 1, 4, 0, 5, 2, 3, 0, 4, 4, 1, 2.
Table 2: Ungrouped Frequency Distribution
Value Frequency
0 5
1 8
2 7
3 6
4 8
5 6
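
The tally in Table 2 can be reproduced programmatically. The following is a minimal Python sketch, assuming the raw data are held in a list; the variable names are illustrative and not part of the original notes.

```python
from collections import Counter

# Raw data from the example above (40 observations).
raw_data = [2, 3, 2, 1, 4, 4, 1, 1, 3, 2, 5, 3, 5, 1, 5, 1, 4, 2, 3, 0,
            4, 1, 0, 3, 5, 2, 5, 0, 4, 1, 4, 0, 5, 2, 3, 0, 4, 4, 1, 2]

# Count how many times each distinct value occurs.
frequency = Counter(raw_data)

# Print the table in ascending order of value.
print("Value  Frequency")
for value in sorted(frequency):
    print(f"{value:<6} {frequency[value]}")
```

The frequencies should add up to the number of observations (40), which is a quick check that no value was missed.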

Grouped Frequency Distribution


We often deal with a large volume of data that have close numerical values. The purpose of classification is to organise the data into a manageable size. To achieve this, the data are grouped. The grouping may use equal or unequal intervals. This reduces the work involved in analysis when the number of observations becomes large. Observations are thus grouped into a number of classes. In doing this, we have to decide at the beginning the number of groups or classes into which we wish to classify the data. Each group is given as an interval, called a class interval.
Example: The following are scores obtained by forty students who sat for cartography
examination in Bule Hora University. Construct a frequency distribution for the scores.
56, 20, 45, 70, 50, 49, 62, 39, 41, 65, 25, 76, 59, 48, 55, 57, 71, 49, 42, 44, 63, 60, 40, 45, 50,
31, 35, 21, 58, 56, 54, 56, 63, 30, 39, 28, 49, 53, 64, 66.

Table 3: Grouped Frequency Distribution


Scores Frequency
20-29 4
30-39 5
40-49 10
50-59 11
60-69 7
70-79 3
As tabulated above, the scores obtained by the students are grouped into the intervals 20-29, 30-39, ..., 70-79, and the corresponding number of students is recorded as the frequency.
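
The grouping in Table 3 can be checked with a short Python sketch. It assumes class intervals of equal size starting at 20, as in the table; the names used are illustrative only.

```python
# Scores of the forty students from the example above.
scores = [56, 20, 45, 70, 50, 49, 62, 39, 41, 65, 25, 76, 59, 48, 55, 57,
          71, 49, 42, 44, 63, 60, 40, 45, 50, 31, 35, 21, 58, 56, 54, 56,
          63, 30, 39, 28, 49, 53, 64, 66]

# Class intervals as (lower limit, upper limit) pairs: 20-29, 30-39, ..., 70-79.
classes = [(lower, lower + 9) for lower in range(20, 80, 10)]

# Count the scores that fall inside each class interval (limits inclusive).
grouped = {
    f"{lower}-{upper}": sum(lower <= s <= upper for s in scores)
    for lower, upper in classes
}

print("Scores  Frequency")
for label, count in grouped.items():
    print(f"{label:<7} {count}")
```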
Some concepts concerning frequency distribution
Class Interval is one of the classes (ranges of values) used to group the raw data. In Table 3 above, the class intervals are the score ranges 20-29, 30-39, ..., 70-79. The number of class intervals can be determined by the formula $K = 1 + 3.3 \log_{10} n$, where K = the number of classes required and n = the number of observations in the sample.
Class Limits are the end numbers of a class interval. The lower value of a class interval is called the lower class limit, while the upper value is called the upper class limit. For the class interval 20-29 in Table 3 above, 20 is the lower class limit and 29 is the upper class limit.
Class Boundaries are obtained by subtracting 0.5 from the lower class limit and adding 0.5 to the upper class limit of a class interval. For example, the class boundaries for Table 3 above are 19.5 - 29.5, 29.5 - 39.5, ..., 69.5 - 79.5.
Class Mark is the mid-point of the class interval. It is obtained by adding the lower and upper class limits and dividing by two.
Referring to Table 3 above, the class mark for the first class interval is $\frac{20+29}{2} = \frac{49}{2} = 24.5$.

Class Width is the size of the class interval. It is obtained by subtracting the lower class boundary from the upper class boundary. Using Table 3 again, the class width for the first class is 29.5 - 19.5 = 10.
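
The definitions above can be applied to every class of Table 3 at once. The sketch below is only an illustration: the function name `describe_class` and the `half_unit` parameter (half the measurement unit, 0.5 for whole-number scores) are assumptions introduced here.

```python
# Class limits of Table 3 as (lower limit, upper limit) pairs.
class_limits = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]

def describe_class(lower, upper, half_unit=0.5):
    """Return (lower boundary, upper boundary, class mark, class width)."""
    lower_boundary = lower - half_unit
    upper_boundary = upper + half_unit
    class_mark = (lower + upper) / 2
    class_width = upper_boundary - lower_boundary
    return lower_boundary, upper_boundary, class_mark, class_width

summary = [describe_class(lo, hi) for lo, hi in class_limits]
for (lo, hi), (lb, ub, mark, width) in zip(class_limits, summary):
    print(f"{lo}-{hi}: boundaries {lb}-{ub}, class mark {mark}, width {width}")
```

For the first class this gives boundaries 19.5 and 29.5, a class mark of 24.5 and a width of 10.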
Example 1: The following data represent the record of high temperatures (in °C) for 50 major towns in Ethiopia. Construct a grouped frequency distribution for the data.
28, 34, 19, 19, 28, 41, 26, 26, 30, 21, 30, 21, 23, 21, 16, 19, 32, 29, 25, 30, 30, 28, 29, 22, 33, 27, 27, 30, 23, 31, 24, 37, 35, 29, 30, 23, 23, 28, 21, 32, 20, 21, 29, 29, 28, 30, 31, 30, 27, 29.
Solution: use the following procedure for constructing a grouped frequency distribution.
Step 1: Determine the number of classes (n = 50):
$K = 1 + 3.3 \log_{10} n = 1 + 3.3 \log_{10} 50 = 1 + 3.3(1.699) = 1 + 5.6 = 6.6 \approx 7$ (seven classes)
Step 2: Find the highest value and the lowest value: H = 41 and L = 16.
Step 3: Find the range (R): R = highest value - lowest value = H - L, so R = 41 - 16 = 25.
Step 4: Find the class width by dividing the range by the number of classes:
$\text{width} = \frac{R}{\text{number of classes}} = \frac{25}{7} \approx 3.6$, which is rounded up to the nearest whole number, 4.
Step 5: Select a starting point for the lowest class limit. This can be the smallest data value or any convenient number less than the smallest data value. In this case, 16 is used. Add the class width to the starting point to get the lower limit of the next class. Keep adding until there are 7 classes: 16, 20, 24, etc.
Step 6: Subtract one unit from the lower limit of the second class to get the upper limit of the first class, i.e. 20 - 1 = 19. Then add the class width to each upper limit to get all the upper limits. Thus, the first class is 16-19, the second class is 20-23, etc.
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit: 16 - 0.5 = 15.5 and 19 + 0.5 = 19.5, so the class boundaries for the first class are 15.5-19.5, for the second class 19.5-23.5, etc.
Step 8: Count the data to find the numerical frequencies.

The cumulative frequency can also be computed by adding the frequency in each class to the total of the frequencies of the classes preceding that class.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
16-19          15.5-19.5          4           4
20-23          19.5-23.5          11          15
24-27          23.5-27.5          7           22
28-31          27.5-31.5          21          43
32-35          31.5-35.5          5           48
36-39          35.5-39.5          1           49
40-43          39.5-43.5          1           50

The frequency distribution shows that the class 28-31 contains the largest number of towns (21), followed by the class 20-23 with 11 towns. Most of the towns (39) have temperatures between 19.5 °C and 31.5 °C.
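
The eight steps of Example 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the number of classes is rounded to the nearest whole number with `round`, the class width is rounded up with `math.ceil`, and the lowest data value is used as the starting point, as in the worked solution.

```python
import math

temperatures = [28, 34, 19, 19, 28, 41, 26, 26, 30, 21, 30, 21, 23, 21, 16,
                19, 32, 29, 25, 30, 30, 28, 29, 22, 33, 27, 27, 30, 23, 31,
                24, 37, 35, 29, 30, 23, 23, 28, 21, 32, 20, 21, 29, 29, 28,
                30, 31, 30, 27, 29]

n = len(temperatures)
# Step 1: number of classes, K = 1 + 3.3 log10(n), rounded to a whole number.
k = round(1 + 3.3 * math.log10(n))
# Steps 2-4: lowest value, highest value, range and class width (rounded up).
low, high = min(temperatures), max(temperatures)
width = math.ceil((high - low) / k)

# Steps 5-8: class limits, class boundaries, frequencies, cumulative frequencies.
table = []
cumulative = 0
for i in range(k):
    lower = low + i * width
    upper = lower + width - 1
    freq = sum(lower <= t <= upper for t in temperatures)
    cumulative += freq
    table.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))

for lower, upper, lb, ub, freq, cum in table:
    print(f"{lower}-{upper}  {lb}-{ub}  {freq:>2}  {cum:>2}")
```

The last cumulative frequency should equal the number of observations (50).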

Chapter 2: Measure of Central Tendency


Very often, when we are given a set of numerical data, we want a single quantity that represents the entire set, so that we need not consider every member of the set individually. A statistical measure which describes the middle or centre of a set of data is called a measure of central tendency. The objective is to comprehend the data easily, to facilitate comparison and to support further statistical analysis. There are several different measures of central tendency, such as the mean (arithmetic, geometric and harmonic), the mode and the median.
The Summation Notation: (Σ or sigma)
- Let $X_1, X_2, X_3, \dots, X_N$ be a number of measurements, where N is the total number of observations and $X_i$ is the ith observation.
- Very often in statistics an arithmetical expression of the form $X_1 + X_2 + X_3 + \dots + X_N$ is used in a formula to compute a statistic. It is tedious to write an expression like this very often, so mathematicians have developed a shorthand notation to represent a sum of scores, called the summation notation.
The symbol $\sum_{i=1}^{N} X_i$ is a mathematical shorthand for $X_1 + X_2 + X_3 + \dots + X_N$.

Example: Suppose the following were scores made on the first homework assignment by five students in the class: 5, 7, 7, 6, and 8. In this example set of five numbers, where N = 5, the summation could be written:
$$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$$

The "i=1" at the bottom of the summation notation tells where to begin the sequence of summation. If the expression were written with "i=3", the summation would start with the third number in the set. For example:
$$\sum_{i=3}^{N} X_i = X_3 + X_4 + \dots + X_N$$
The "N" in the upper part of the summation notation tells where to end the sequence of summation.
* A shorthand notation for the shorthand notation is also possible: when the summation sign is written without limits, "i=1" and "N" are assumed.
$$\sum X = \sum_{i=1}^{N} X_i = X_1 + X_2 + \dots + X_N$$


1. $\sum_{i=1}^{n} k = nk$, where k is any constant.
2. $\sum_{i=1}^{n} k X_i = k \sum_{i=1}^{n} X_i$, where k is any constant.
3. $\sum_{i=1}^{n} (a + b X_i) = na + b \sum_{i=1}^{n} X_i$, where a and b are any constants.
4. $\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i$
5. The sum of the product of the two variables could be written:
$$\sum_{i=1}^{N} (X_i Y_i) = (X_1 Y_1) + (X_2 Y_2) + \dots + (X_N Y_N)$$
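
The summation rules listed above can be checked numerically. The sketch below uses the five homework scores from the earlier example as X; the second list Y and the constants a, b and k are hypothetical values chosen only for illustration.

```python
# X: homework scores from the example; Y, a, b, k: hypothetical values.
X = [5, 7, 7, 6, 8]
Y = [2, 4, 1, 3, 5]
a, b, k = 3, 2, 4
n = len(X)

# Summing a constant n times gives n*k.
assert sum(k for _ in range(n)) == n * k
# A constant factor can be taken outside the sum.
assert sum(k * x for x in X) == k * sum(X)
# The sum of (a + b*X_i) equals n*a + b times the sum of X_i.
assert sum(a + b * x for x in X) == n * a + b * sum(X)
# The sum of (X_i + Y_i) equals the sum of X_i plus the sum of Y_i.
assert sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)

# The sum of products is computed term by term.
sum_of_products = sum(x * y for x, y in zip(X, Y))
print(sum(X), sum_of_products)
```

Note that the sum of products is in general not equal to the product of the sums, which is why it is written out term by term.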

Example:
considering the
following data.

Types of measures of central tendency
2.1. Arithmetic Mean
The arithmetic mean of a set of observations is the sum of all the observations divided by the total number of observations. If we are considering a population, it is termed the population mean, and if a sample is considered, it is called the sample mean. The population mean and the sample mean are designated by $\mu$ and $\bar{X}$ respectively. The arithmetic mean (or just mean) is the most important measure of central tendency because all members of the set are used in its calculation. It is, however, affected by the extreme values of the set.
Let $\sum X = X_1 + X_2 + X_3 + X_4 + \dots + X_n$. Then
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} \quad \text{or} \quad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Example
Find the mean of the following set of numbers: 10, 9, 11, 13, 12, 12, 11, 13, 10 and 16.
$$\bar{X} = \frac{10 + 9 + 11 + 13 + 12 + 12 + 11 + 13 + 10 + 16}{10} = \frac{117}{10} = 11.7$$
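If the values $X_1, X_2, X_3, \dots, X_n$ have frequencies $f_1, f_2, f_3, \dots, f_n$ respectively, then:
$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + f_3 X_3 + \dots + f_n X_n}{f_1 + f_2 + f_3 + \dots + f_n} \quad \text{or} \quad \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$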
Example:-
Find the mean of the set of data in the table below:
Mark (Xi)       0  1  2  3  4  5  6  7  8  9
Frequency (fi)  2  3  4  6  1  4  2  2  1  3

$$\bar{X} = \frac{\sum f_i X_i}{\sum f_i} = \frac{(0 \times 2) + (1 \times 3) + (2 \times 4) + (3 \times 6) + (4 \times 1) + (5 \times 4) + (6 \times 2) + (7 \times 2) + (8 \times 1) + (9 \times 3)}{2 + 3 + 4 + 6 + 1 + 4 + 2 + 2 + 1 + 3} = \frac{114}{28} = 4.07$$
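
The same calculation can be written as a small Python function. This is a sketch only; the function name `frequency_mean` is an illustrative choice, not part of the notes.

```python
marks = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
frequencies = [2, 3, 4, 6, 1, 4, 2, 2, 1, 3]

def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    total_f = sum(freqs)
    return total_fx / total_f

mean_mark = frequency_mean(marks, frequencies)
print(round(mean_mark, 2))  # 4.07
```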
Arithmetic Mean of Grouped Data
In a grouped frequency distribution, the values within any class interval are considered to be concentrated at the mid-point of the class interval (the class mark). If $X_i$ is the class mark of the ith class interval, then the mean $\bar{X}$ of the grouped frequency distribution is defined as:
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$
where k = the number of classes,
$X_i$ = the class mark (midpoint) of the ith class,
$f_i$ = the frequency of the ith class.

Example: calculate the mean for the following age distribution


class frequency
6-10 35
11-15 23
16-20 15
21-25 12
26-30 9
31-35 6

Solutions:
• First find the class marks (Xi).
• Find the product of frequency and class mark for each class.
• Find the mean using the formula.
Class    fi     Xi    fiXi
6-10     35     8     280
11-15    23     13    299
16-20    15     18    270
21-25    12     23    276
26-30    9      28    252
31-35    6      33    198
Total    100          1575

$$\bar{X} = \frac{\sum_{i=1}^{6} f_i X_i}{\sum_{i=1}^{6} f_i} = \frac{1575}{100} = 15.75$$
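
The grouped mean can be reproduced with a few lines of Python. The sketch assumes each class is stored as a (lower limit, upper limit, frequency) triple, which is a representation chosen here for illustration.

```python
# Age distribution from the example: (lower limit, upper limit, frequency).
age_classes = [(6, 10, 35), (11, 15, 23), (16, 20, 15),
               (21, 25, 12), (26, 30, 9), (31, 35, 6)]

# Class mark = midpoint of the class limits.
class_marks = [(lower + upper) / 2 for lower, upper, _ in age_classes]
frequencies = [f for _, _, f in age_classes]

total_fx = sum(f * x for f, x in zip(frequencies, class_marks))
total_f = sum(frequencies)
grouped_mean = total_fx / total_f
print(total_fx, total_f, grouped_mean)  # 1575.0 100 15.75
```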

Using Assumed Mean (another way of calculating the arithmetic mean)

The mean can also be calculated from an assumed mean, using the formula:
$$\bar{X} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} \quad \text{or} \quad \bar{X} = A + \frac{\sum fd}{\sum f}$$
A = the ASSUMED or GUESSED MEAN.
$d_i$ = the deviation of $X_i$ (the midpoint) from the assumed mean A, i.e. $d_i = X_i - A$.
n = the number of classes.
Example for ungrouped frequency distribution
A farmer recorded the mass of 25 timbers in kg as follows:
10 14 12 10 12 11 11 9 13
16 13 9 12 13 12 10 15
10 9 11 8 14 12 8 11
a) Construct a frequency table for the data.
b) Use an assumed mean of 12 kg to calculate the mean.
Solution: Given assumed mean, A = 12 kg
Mass (X)   Frequency (f)   d = X - A   fd
8          2               -4          -8
9          3               -3          -9
10         4               -2          -8
11         4               -1          -4
12         5               0           0
13         3               1           3
14         2               2           4
15         1               3           3
16         1               4           4
Total      ∑f = 25                     ∑fd = -15
Use the formula:
$$\bar{X} = A + \frac{\sum fd}{\sum f} = 12 + \frac{-15}{25} = 11.4 \text{ kg}$$

Example for grouped frequency distribution


The following table shows the distribution of weekly wages earned by 60 employees of a sugar factory in Ethiopia. Use an assumed mean (A) of 74.5 to calculate the mean of the distribution.
Class interval No of employees
40-49 4
50-59 12
60-69 18
70-79 11
80-89 7
90-99 5
100-109 2
110-119 1

Solution
Class interval Class center (X) Frequency (f) d=X-A fd
40-49 44.5 4 -30 -120
50-59 54.5 12 -20 -240
60-69 64.5 18 -10 -180
70-79 74.5 11 0 0
80-89 84.5 7 10 70
90-99 94.5 5 20 100
100-109 104.5 2 30 60
110-119 114.5 1 40 40
∑f=60 ∑fd=-270

$$\bar{X} = A + \frac{\sum fd}{\sum f} = 74.5 + \frac{-270}{60} = 74.5 - 4.5 = 70$$
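
The assumed-mean method can be sketched as a single Python function and applied to both examples above. The function name `assumed_mean_average` is hypothetical; the data are taken from the two solution tables.

```python
def assumed_mean_average(values, freqs, assumed):
    """Mean via the assumed-mean method: A + sum(f * d) / sum(f), with d = x - A."""
    total_fd = sum(f * (x - assumed) for x, f in zip(values, freqs))
    return assumed + total_fd / sum(freqs)

# Ungrouped example: masses (kg) and frequencies, assumed mean A = 12.
masses = [8, 9, 10, 11, 12, 13, 14, 15, 16]
mass_freqs = [2, 3, 4, 4, 5, 3, 2, 1, 1]
timber_mean = assumed_mean_average(masses, mass_freqs, 12)

# Grouped example: class centers of the weekly wages, assumed mean A = 74.5.
centers = [44.5, 54.5, 64.5, 74.5, 84.5, 94.5, 104.5, 114.5]
wage_freqs = [4, 12, 18, 11, 7, 5, 2, 1]
wage_mean = assumed_mean_average(centers, wage_freqs, 74.5)

print(round(timber_mean, 2), round(wage_mean, 2))  # 11.4 70.0
```

Whatever value is chosen for A, the result equals the ordinary mean, since the deviations are measured from A and A is added back at the end; for the wage data this is 74.5 + (-270/60) = 70.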

Special properties of Arithmetic mean


1. There is only one mean for a data set.
2. Its value is influenced by extreme values.
3. It is applicable to quantitative data only (difficult to apply to qualitative data).
4. If $\bar{X}_1$ is the mean of $n_1$ observations, $\bar{X}_2$ is the mean of $n_2$ observations, ..., and $\bar{X}_k$ is the mean of $n_k$ observations, then the mean of all the observations in all groups, often called the combined mean, is given by:
$$\bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2 + \dots + \bar{X}_k n_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} \bar{X}_i n_i}{\sum_{i=1}^{k} n_i}$$

Example: In a class there are 30 females and 70 males. If the females averaged 60 in an examination and the males averaged 72, find the mean for the entire class.
Solution:
Females: $\bar{X}_1 = 60$, $n_1 = 30$
Males: $\bar{X}_2 = 72$, $n_2 = 70$
$$\bar{X}_c = \frac{\bar{X}_1 n_1 + \bar{X}_2 n_2}{n_1 + n_2} = \frac{30(60) + 70(72)}{30 + 70} = \frac{6840}{100} = 68.40$$
5. If a wrong figure has been used when calculating the mean, the correct mean can be obtained without repeating the whole process using:
$$\text{correct mean} = \text{wrong mean} + \frac{\text{correct value} - \text{wrong value}}{n}$$
where n is the total number of observations.
Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
Solution:
$$\text{correct mean} = 65 + \frac{80 - 40}{10} = 65 + 4 = 69 \text{ kg}$$
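
The correction formula can be checked against a direct recalculation. In the sketch below the function name `corrected_mean` and the ten individual weights are hypothetical; the weights were chosen only so that their mean is 65 when one of them is misread as 40.

```python
def corrected_mean(wrong_mean, wrong_value, correct_value, n):
    """Adjust a mean that was computed with one misrecorded observation."""
    return wrong_mean + (correct_value - wrong_value) / n

fixed = corrected_mean(65, 40, 80, 10)
print(fixed)  # 69.0

# Cross-check with hypothetical weights whose mean is 65 with the misread value 40.
wrong_weights = [40, 60, 62, 64, 66, 68, 70, 72, 74, 74]
right_weights = [80] + wrong_weights[1:]   # replace the misread 40 by 80
recomputed = sum(right_weights) / len(right_weights)
print(recomputed)  # 69.0
```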
6. The effect of transforming the original series on the mean.
a) If a constant k is added to (or subtracted from) every observation, then the new mean will be the old mean plus (or minus) k.
b) If every observation is multiplied by a constant k, then the new mean will be k × old mean.
Example: The mean of a set of numbers is 500.
a. If 10 is added to each of the numbers in the set, what will be the mean of the new set?
b. If each of the numbers in the set is multiplied by -5, what will be the mean of the new set?
Solution:
a. new mean = old mean + 10 = 500 + 10 = 510
b. new mean = -5 × old mean = -5 × 500 = -2500
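
Both transformation rules can be illustrated numerically. The example does not list the original numbers, so the data set below is hypothetical and was chosen only because its mean is 500.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)

# Hypothetical data set whose mean is 500 (the original numbers are not given).
numbers = [480, 490, 500, 510, 520]

shifted = [x + 10 for x in numbers]   # add 10 to every observation
scaled = [x * -5 for x in numbers]    # multiply every observation by -5

print(mean(numbers), mean(shifted), mean(scaled))  # 500.0 510.0 -2500.0
```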
Merits and demerits of arithmetic mean
Merits:
• It is based on all observations.
• It is suitable for further mathematical treatment.
• It is a stable average, i.e. it is affected relatively little by fluctuations of sampling.
• It is easy to calculate and simple to understand.
Demerits:
• It is affected by extreme observations.
• It cannot be used in the case of open-ended classes.
• It cannot be used when dealing with qualitative characteristics, such as intelligence, honesty and beauty.
• Sometimes it leads to wrong conclusions if the details of the data from which it is obtained are not available.
• It gives high weight to high extreme values and less weight to low extreme values.
2.2. Geometric mean (G)
The geometric mean of a set of n positive numbers x1, x2, x3, ..., xn is the nth root of the product of
the numbers.
$$G = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$$
It is used to determine the average percent increases in sales, production and other business
activities.
Example: Suppose the profits earned by company A on five projects were 3, 4, 4, 5 and 6 million
respectively. Compute the geometric mean profit for the company.
Solution:
$$G = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[5]{3 \cdot 4 \cdot 4 \cdot 5 \cdot 6} = \sqrt[5]{1440} = 4.28$$
Or, using logarithms:
log G = log 1440^(1/5)
= 1/5 log 1440
= 1/5 log(1.44 * 1000)
= 1/5 (log 1.44 + log 1000)
= 1/5 (log 1.44 + log 10^3)
= 1/5 (log 1.44 + 3 log 10)
= 1/5 (log 1.44 + 3)
= 1/5 (0.158362 + 3)
= 0.6317
G = antilog 0.6317 = 4.28 ≈ 4.3

N.B: Generally, $G = \text{antilog}\left(\frac{\sum \log x}{N}\right)$
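The logarithmic form of the geometric mean can be sketched in Python as follows. The helper name `geometric_mean` is an assumption for this illustration (Python's standard library also provides `statistics.geometric_mean`); base-10 logarithms are used to mirror the hand calculation.

```python
import math


def geometric_mean(values):
    """Geometric mean via G = antilog(sum(log x) / N), using base-10 logs."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive numbers")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


# Profits (in millions) on the five projects
g = geometric_mean([3, 4, 4, 5, 6])
print(round(g, 2))  # 4.28
```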
Geometric mean for ungrouped frequency distribution

$$G = \text{antilog}\left(\frac{\sum f \log x}{\sum f}\right)$$
Example: Assume that a given power industry produces a product for one month and the output
is given in quintals below. Compute the geometric mean of the data.

Output (X)   Number of days (f)   log x     f*log x
20           3                    1.3010    3.9031
18           4                    1.2553    5.0211
19           10                   1.2788    12.7875
22           8                    1.3424    10.7394
24           2                    1.3802    2.7604
23           3                    1.3617    4.0852
∑            30                             39.2967

$$G = \text{antilog}\left(\frac{\sum f \log x}{\sum f}\right) = \text{antilog}\left(\frac{39.2967}{30}\right) = \text{antilog}(1.30989) = 20.41 \text{ quintals}$$

Geometric mean for Grouped frequency distribution

$$G = \text{antilog}\left(\frac{\sum f \log m}{\sum f}\right)$$

Example: Find the geometric mean for the following distribution.

Class interval   Frequency (f)   Midpoint (m)   log m   f*log m
6-10             1               8              0.9     0.9
11-15            6               13             1.11    6.66
16-20            8               18             1.25    10
21-25            7               23             1.36    9.52
26-30            3               28             1.44    4.32
∑                25                                     31.4

∑f = 25, ∑f log m = 31.4

$$G = \text{antilog}\left(\frac{\sum f \log m}{\sum f}\right) = \text{antilog}\left(\frac{31.4}{25}\right) = \text{antilog}(1.256) = 18.03$$
2.3. Harmonic mean(H)
The harmonic mean of a set of observations x1, x2, ..., xn is the reciprocal of the arithmetic mean of the
reciprocals of the numbers.
$$H = \frac{N}{\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \dots + \frac{1}{x_n}} = \frac{N}{\sum \frac{1}{x_i}}$$
where N is the number of observations and $x_i$ is an individual observation.
N.B. It is used in problems of averaging rates per unit of time.
Example: If Ayantu and Gadise take 2 and 3 hours respectively to finish a given typing task,
find the average time taken per task using the harmonic mean.
Given: N (total observations) = 2; the individual observations x1 and x2 are 2 and 3 hours
respectively.
$$H = \frac{N}{\frac{1}{x_1} + \frac{1}{x_2}} = \frac{2}{\frac{1}{2} + \frac{1}{3}} = \frac{2}{\frac{5}{6}} = \frac{2 \cdot 6}{5} = 2.4 \text{ h}$$
Harmonic mean for ungrouped data
The formula is given by
$$H = \frac{f_1 + f_2 + f_3 + \dots + f_n}{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \frac{f_3}{x_3} + \dots + \frac{f_n}{x_n}} = \frac{\sum f}{\sum \frac{f}{x}}$$
Example: The mean daily temperature for a hypothetical area is given below for 30 successive days of
the month. Compute H.

Mean temperature in °C (X)   No. of days (f)   1/x      f*(1/x)
22                           3                 0.0455   0.1364
23                           5                 0.0435   0.2174
25                           10                0.0400   0.4000
27                           7                 0.0370   0.2593
28                           3                 0.0357   0.1071
29                           2                 0.0345   0.0690
∑                            30                         1.1892

$$H = \frac{\sum f}{\sum \frac{f}{x}} = \frac{30}{1.1892} = 25.23$$
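A small Python sketch of the harmonic mean, with optional frequencies, is given here as an illustration. The function name `harmonic_mean` and its `freqs` parameter are assumptions made for this example, not part of the notes.

```python
def harmonic_mean(values, freqs=None):
    """Harmonic mean H = sum(f) / sum(f / x); freqs defaults to 1 per value."""
    if freqs is None:
        freqs = [1] * len(values)
    if not values or any(x <= 0 for x in values):
        raise ValueError("harmonic mean needs a non-empty list of positive numbers")
    return sum(freqs) / sum(f / x for x, f in zip(values, freqs))


# Two typists taking 2 and 3 hours
print(round(harmonic_mean([2, 3]), 2))  # 2.4

# Daily temperatures with the number of days each was observed
temps = [22, 23, 25, 27, 28, 29]
days = [3, 5, 10, 7, 3, 2]
print(round(harmonic_mean(temps, days), 2))
```

Because the code keeps full precision in the reciprocals, its result can differ in the last digits from a hand calculation that rounds each 1/x.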

Harmonic mean for grouped data


$$H = \frac{\sum f}{\sum \frac{f}{m}} = \frac{f_1 + f_2 + f_3 + \dots + f_n}{\frac{f_1}{m_1} + \frac{f_2}{m_2} + \frac{f_3}{m_3} + \dots + \frac{f_n}{m_n}}$$
Example: Find the harmonic mean for the following distribution.

Class interval   Frequency (f)   Midpoint (m)   1/m    f*(1/m)
2-4              20              3              0.33   6.66
4-6              40              5              0.2    8
6-8              30              7              0.14   4.26
8-10             10              9              0.11   1.11
∑                100                                   20.03

$$H = \frac{\sum f}{\sum \frac{f}{m}} = \frac{100}{20.03} = 4.99$$
3.2. The Median
In a distribution, the median is the value of the variable which divides it into two equal halves. It is
symbolized by MD.
- In an ordered series of data, the median is the observation lying exactly in the middle of the series.
It is the middlemost value in the sense that the number of values less than the median is equal
to the number of values greater than it. Before one can find the median, the data must be
arranged in order (ascending or descending). When the data set is ordered, it is called a data
array.
Median for Ungrouped Data
Steps in computing the median for ungrouped data are arranging the data in order and selecting
the middle value.
Example1: The following data shows monthly salaries (in Birr) of 9 sample workers in a
certain factory. Find the median salary. 1000, 300, 1500, 500, 2000, 2500, 750, 600, 3000
Solution: Step 1- Arrange the data in order. 300, 500, 600, 750, 1000, 1500, 2000, 2500, 3000
Step 2- Select the middle value. In this data set, the total number of observations is 9 (n = 9,
which is an odd number). Thus, the median value will be:
$$MD = \left(\frac{n+1}{2}\right)^{th} \text{ value} = \left(\frac{9+1}{2}\right)^{th} \text{ value} = 5^{th} \text{ value in the data array, which is 1000 Birr.}$$

Hence, half of the sample workers get less than 1,000 Birr per month and half get more than
1,000 Birr.
Example 2: The following data shows monthly salaries of 10 sample workers in factory. Find
the median salary. 1000, 300, 1500, 500, 2000, 2500, 750, 600, 3000, 800
Solution:
Step 1: Arrange the data in order. 300, 500, 600, 750, 800, 1000, 1500, 2000, 2500, 3000.
Step 2: Select the middle value. In this data set, the total number of observations is 10 (n = 10,
which is an even number). Thus, the median value will be the mean of the two middle values.
The two middle values in the data array are the 5th and the 6th values, which are 800 and 1000.
Thus, the median will be:
$$MD = \frac{800 + 1000}{2} = 900$$

Hence, fifty percent of the sample workers get a monthly salary above 900 Birr and fifty percent
get below it.
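The two cases (odd and even n) can be sketched in Python as follows. The function name `median` is illustrative; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Median of raw data: middle value (odd n) or mean of the two middle values (even n)."""
    data = sorted(values)          # Step 1: arrange the data in order
    n = len(data)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:                 # odd n: the ((n + 1) / 2)th value
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2   # even n: mean of the two middle values


salaries_9 = [1000, 300, 1500, 500, 2000, 2500, 750, 600, 3000]
salaries_10 = salaries_9 + [800]
print(median(salaries_9))   # 1000
print(median(salaries_10))  # 900.0
```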
Median for Grouped Data
When the observations in a data set exist in grouped form, the formula for the median value is:
$$MD = L + \left(\frac{\frac{n}{2} - F_{pm}}{f_m}\right) W$$
Where, L = the lower class boundary of the median class,
n = the sum of the frequencies of all the classes,
Fpm = the cumulative frequency of all the classes
immediately preceding the class containing the median,
W = width of the median class, and
fm = frequency of the median class.
Remark: The median class is the class with the smallest cumulative frequency greater than or
equal to n/2.
Example: Refer to the frequency distribution on the record of high temperature of 50 selected
major urban centers in Ethiopia.
Class limits Class boundaries Frequency Cumulative frequency
16 - 19 15.5 - 19.5 4 4
20 - 23 19.5 - 23.5 11 15
24 - 27 23.5 - 27.5 7 22
28 - 31 27.5 - 31.5 21 43
32 - 35 31.5 - 35.5 5 48
36 - 39 35.5 - 39.5 1 49
40 - 43 39.5 - 43.5 1 50

From this frequency distribution of the record of high temperature of 50 towns, n/2 = 50/2 = 25.
This locates the median class: it is the first class whose cumulative frequency is 25 or more,
which is the fourth class (27.5-31.5).
Therefore:
L = 27.5, W = 4, n = 50, Fpm = 22, fm = 21

$$MD = L + \left(\frac{\frac{n}{2} - F_{pm}}{f_m}\right) W = 27.5 + \left(\frac{\frac{50}{2} - 22}{21}\right) 4 = 27.5 + 0.57 = 28.07$$
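The grouped-median formula can be sketched as a small Python function. The name `grouped_median` and the representation of classes as `(lower, upper)` boundary pairs are assumptions made for this illustration.

```python
def grouped_median(boundaries, freqs):
    """Median for grouped data: L + ((n/2 - Fpm) / fm) * W.

    boundaries: list of (lower, upper) class boundaries, in ascending order
    freqs:      class frequencies, in the same order
    """
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    cum = 0  # cumulative frequency of the classes before the current one (Fpm)
    for (lower, upper), f in zip(boundaries, freqs):
        if cum + f >= n / 2:       # first class whose cumulative frequency reaches n/2
            return lower + (n / 2 - cum) / f * (upper - lower)
        cum += f
    raise ValueError("median class not found")


boundaries = [(15.5, 19.5), (19.5, 23.5), (23.5, 27.5), (27.5, 31.5),
              (31.5, 35.5), (35.5, 39.5), (39.5, 43.5)]
freqs = [4, 11, 7, 21, 5, 1, 1]
print(round(grouped_median(boundaries, freqs), 2))  # 28.07
```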

Properties of the Median


There is only one median for a data set.
It is not influenced by extreme values.
Medians of subsets cannot be combined to determine the median of the complete data
without going back to the original observations.
It is applicable to quantitative data only.
It is unreliable as a central measure when we have a sampling problem.
3.3. Mode
 Mode is a value which occurs most frequently in a set of values
 The mode may not exist and even if it does exist, it may not be unique.
 In the case of a discrete distribution, the value having the maximum frequency is the modal
value.
In the grouped frequency distribution, the class interval with the highest frequency is called
Modal Class.
Mode of an Ungrouped Distribution
Example: The marks of 40 students out of 10 marks in Mathematics test are as follows:
6 3 5 4 1 2 4 1 6 9
10 1 2 4 6 8 2 7 3 7
2 1 1 4 5 3 2 1 9 8
10 6 5 2 2 1 1 7 9 10
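(a) Draw a frequency table for the distribution.
(b) State the mode and median of the distribution.
(c) Calculate the mean of the distribution.
Solution: Construct a frequency distribution table.

Mark    fi (frequency)
1       8
2       7
3       3
4       4
5       3
6       4
7       3
8       2
9       3
10      3
Total   40

a) Mode = 1 mark (it has the highest frequency, 8).
b) With n = 40, the median is the mean of the 20th and 21st ordered values, both of which are 4:
$$\text{Median} = \frac{4 + 4}{2} = 4 \text{ marks}$$
c) $$\text{Mean} = \frac{(1 \cdot 8) + (2 \cdot 7) + (3 \cdot 3) + (4 \cdot 4) + (5 \cdot 3) + (6 \cdot 4) + (7 \cdot 3) + (8 \cdot 2) + (9 \cdot 3) + (10 \cdot 3)}{40} = \frac{180}{40} = 4.5$$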
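The same three results can be reproduced with Python's standard library. This is only a sketch for checking the hand calculation; the variable names are illustrative.

```python
from collections import Counter
from statistics import mean, median

marks = [6, 3, 5, 4, 1, 2, 4, 1, 6, 9,
         10, 1, 2, 4, 6, 8, 2, 7, 3, 7,
         2, 1, 1, 4, 5, 3, 2, 1, 9, 8,
         10, 6, 5, 2, 2, 1, 1, 7, 9, 10]

# (a) frequency table: mark -> number of students
freq = Counter(marks)
for mark in sorted(freq):
    print(mark, freq[mark])

# (b) mode(s): every mark that attains the highest frequency
top = max(freq.values())
modes = sorted(m for m, c in freq.items() if c == top)
print(modes)          # [1]
print(median(marks))  # 4.0

# (c) mean
print(mean(marks))    # 4.5
```

Collecting all values with the highest frequency, rather than a single one, reflects the fact that the mode need not be unique.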

Mode for Grouped data


If data are given in the form of a continuous frequency distribution, the mode is calculated as:
$$\text{Mode} = L_m + \left(\frac{d_1}{d_1 + d_2}\right) W$$
Where, Lm = the lower class boundary of the modal class
d1 = fm − fpm
d2 = fm − fsm
fm = the frequency of the modal class
fpm = frequency of the class immediately preceding the modal class
fsm = frequency of the class immediately succeeding the modal class
W = class width of the modal class
Note: The modal class is the class with the highest frequency.
Example: Refer to the following table on the record of high temperature of 50 selected major
urban centers in Ethiopia.
Class limits Class boundaries Frequency Cumulative frequency
16 - 19 15.5 - 19.5 4 4
20 - 23 19.5 - 23.5 11 15
24 - 27 23.5 - 27.5 7 22
28 - 31 27.5 - 31.5 21 43
32 - 35 31.5 - 35.5 5 48
36 - 39 35.5 - 39.5 1 49
40 - 43 39.5 - 43.5 1 50

Here, the modal class is the fourth class (27.5-31.5, with frequency 21). Thus,
Lm = 27.5, fm = 21, fpm = 7, fsm = 5, W = 31.5 − 27.5 = 4
Then, inserting these values in the equation:
$$\text{Mode} = L_m + \left(\frac{d_1}{d_1 + d_2}\right) W = 27.5 + \left(\frac{21 - 7}{(21 - 7) + (21 - 5)}\right) 4 = 27.5 + \left(\frac{14}{14 + 16}\right) 4 = 27.5 + (0.47) 4 = 29.37$$

The modal record of high temperature is, therefore, 29.37 °C.
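The grouped-mode formula can be sketched in Python as below. The function name `grouped_mode` and the `(lower, upper)` boundary representation are assumptions for this illustration; when several classes share the highest frequency, this sketch simply takes the first one.

```python
def grouped_mode(boundaries, freqs):
    """Mode for grouped data: Lm + (d1 / (d1 + d2)) * W."""
    i = max(range(len(freqs)), key=freqs.__getitem__)   # index of the modal class
    f_m = freqs[i]
    if f_m == 0:
        raise ValueError("all frequencies are zero")
    f_prev = freqs[i - 1] if i > 0 else 0                # fpm (0 if there is no preceding class)
    f_next = freqs[i + 1] if i < len(freqs) - 1 else 0   # fsm (0 if there is no succeeding class)
    d1, d2 = f_m - f_prev, f_m - f_next
    lower, upper = boundaries[i]
    return lower + d1 / (d1 + d2) * (upper - lower)


boundaries = [(15.5, 19.5), (19.5, 23.5), (23.5, 27.5), (27.5, 31.5),
              (31.5, 35.5), (35.5, 39.5), (39.5, 43.5)]
freqs = [4, 11, 7, 21, 5, 1, 1]
print(round(grouped_mode(boundaries, freqs), 2))  # 29.37
```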


Properties of the Mode
 There can be more than one mode for a data set, or there may be no mode when all
observations in the data set have equal frequencies.

 It is not influenced by extreme values.

 Modes of subsets cannot be combined to determine the mode of the complete data set
without going back to the original data.

 It is applicable for both qualitative and quantitative data.


The general relationship between mean, median and mode for unimodal frequency distribution
is given by: Mean—Mode = 3 (Mean—Median)
Note: being the point of maximum density, mode is especially useful in finding the most
popular size in studies relating to marketing, trade, business, and industry. It is the appropriate
average to be used to find the ideal size.
MEASURES OF LOCATION (NON-CENTRAL TENDENCY)
When a distribution is arranged in order of magnitude of items, the median is the value of the
middle term. Measures that depend upon their positions (location) in the
distribution are collectively called quantiles or positional averages. Some also call them
measures of non-central tendency. These include quartiles, deciles, and percentiles.
Quartiles
- Quartiles are measures that divide the frequency distribution into four equal parts.
- The values of the variable corresponding to these divisions are denoted Q1, Q2, and Q3, often
called the first, the second and the third quartile respectively.
To compute quartiles for raw (ungrouped) data, first arrange the data in increasing order of
magnitude. Then, the ith quartile is given by:
$$Q_i = \left(\frac{i(n+1)}{4}\right)^{th} \text{ value}, \quad i = 1, 2, 3$$

In dividing i(n+1) by 4, there may be a remainder. Let q be the quotient and r be the remainder of
the division. Then,
$$Q_i = q^{th} \text{ value} + \frac{r}{4}\left[(q+1)^{th} \text{ value} - q^{th} \text{ value}\right]$$

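Example: The following are yields of barley (kg/plot) from 14 plots: 30, 32, 35, 38, 40, 42, 48,
49, 52, 55, 58, 60, 62 and 65. Find the first and third quartiles. (Note that the data must
be arranged in ascending order.)

First quartile:
$$Q_1 = \left(\frac{1(14+1)}{4}\right)^{th} \text{ value} = \left(\frac{15}{4}\right)^{th} \text{ value} = 3.75^{th} \text{ value}$$
$$Q_1 = 3^{rd} \text{ value} + \frac{3}{4}\left(4^{th} \text{ value} - 3^{rd} \text{ value}\right) = 35 + \frac{3}{4}(38 - 35) = 37.25$$

Third quartile:
$$Q_3 = \left(\frac{3(14+1)}{4}\right)^{th} \text{ value} = \left(\frac{45}{4}\right)^{th} \text{ value} = 11.25^{th} \text{ value}$$
$$Q_3 = 11^{th} \text{ value} + \frac{1}{4}\left(12^{th} \text{ value} - 11^{th} \text{ value}\right) = 58 + \frac{1}{4}(60 - 58) = 58.5$$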
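The positional rule used above (the i(n+1)/4-th value with linear interpolation) can be sketched in Python. The function name `positional_quantile` and the `parts` parameter (4 for quartiles, 10 for deciles, 100 for percentiles) are assumptions for this illustration, as is the clamping of positions that fall outside the data.

```python
def positional_quantile(values, i, parts):
    """i-th quantile of raw data using the i*(n+1)/parts position rule."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("no data")
    pos = i * (n + 1) / parts    # 1-based position, possibly fractional
    q = int(pos)                 # whole part of the position
    frac = pos - q               # fractional part, i.e. r/parts
    if q < 1:                    # assumption: clamp to the smallest value
        return data[0]
    if q >= n:                   # assumption: clamp to the largest value
        return data[-1]
    return data[q - 1] + frac * (data[q] - data[q - 1])


yields = [30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62, 65]
print(positional_quantile(yields, 1, 4))  # 37.25
print(positional_quantile(yields, 3, 4))  # 58.5
```

Statistical packages use several slightly different quantile conventions, so their results may not match this rule exactly.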

For grouped data, we have the following formula:
$$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{iN}{4} - C\right), \quad i = 1, 2, 3$$

Where, LQi = lower class boundary of the quartile class
w = the size of the quartile class
N = total number of observations
C = the cumulative frequency (less than type) preceding the quartile class
fQi = the frequency of the quartile class

Remark: The quartile class (class containing Qi) is the class with the smallest cumulative
frequency (less than type) greater than or equal to iN/4.

Deciles: Deciles are measures that divide the frequency distribution into ten equal parts. They
are denoted by D1, D2, ..., D9. For raw (ungrouped) data, first arrange the data in
increasing order of magnitude. Then, the ith decile is given by:
$$D_i = \left(\frac{i(n+1)}{10}\right)^{th} \text{ value}, \quad i = 1, 2, \dots, 9$$
As with quartiles, in dividing i(n+1) by 10 there may be a remainder. Let q be the quotient and r
be the remainder of the division. Then,
$$D_i = q^{th} \text{ value} + \frac{r}{10}\left[(q+1)^{th} \text{ value} - q^{th} \text{ value}\right]$$
For grouped data, we have the following formula:
$$D_i = L_{D_i} + \frac{w}{f_{D_i}}\left(\frac{iN}{10} - C\right), \quad i = 1, 2, 3, \dots, 9$$

Where, LDi = lower class boundary of the decile class
w = the size of the decile class
N = total number of observations
C = the cumulative frequency (less than type) preceding the decile class
fDi = the frequency of the decile class

Remark: The decile class (class containing Di) is the class with the smallest cumulative
frequency (less than type) greater than or equal to iN/10.

Percentiles
Percentiles are measures that divide the frequency distribution into a hundred equal parts. They
are denoted by P1, P2, ..., P99.
For raw (ungrouped) data, first arrange the data in increasing order of magnitude. Then, the ith
percentile is given by:
$$P_i = \left(\frac{i(n+1)}{100}\right)^{th} \text{ value}, \quad i = 1, 2, \dots, 99$$
In dividing i(n+1) by 100, there may be a remainder. Let q be the quotient and r be the remainder
of the division. Then,
$$P_i = q^{th} \text{ value} + \frac{r}{100}\left[(q+1)^{th} \text{ value} - q^{th} \text{ value}\right]$$
For grouped data, we have the following formula:
$$P_i = L_{P_i} + \frac{w}{f_{P_i}}\left(\frac{iN}{100} - C\right), \quad i = 1, 2, 3, \dots, 99$$

Where, LPi = lower class boundary of the percentile class
w = the size of the percentile class
N = total number of observations
C = the cumulative frequency (less than type) preceding the percentile class
fPi = the frequency of the percentile class

Remark: The percentile class (class containing Pi) is the class with the smallest cumulative
frequency (less than type) greater than or equal to iN/100.

Example: Consider the following distribution and calculate:
a) All quartiles.
b) The 7th decile.
c) The 90th percentile.

Values    Frequency
140-150   17
150-160   29
160-170   42
170-180   72
180-190   84
190-200   107
200-210   49
210-220   34
220-230   31
230-240   16
240-250   12
Solutions:
• First find the less than cumulative frequency.
• Use the formula to calculate the required quantile.

Values    Frequency   Cumulative frequency
140-150   17          17
150-160   29          46
160-170   42          88
170-180   72          160
180-190   84          244
190-200   107         351
200-210   49          400
210-220   34          434
220-230   31          465
230-240   16          481
240-250   12          493
a) Quartiles:
i. Q1: Determine the class containing the first quartile.
N/4 = 493/4 = 123.25
170-180 is the class containing the first quartile.
LQ1 = 170, N = 493, W = 10, C = 88, fQ1 = 72
$$Q_1 = L_{Q_1} + \frac{W}{f_{Q_1}}\left(\frac{N}{4} - C\right) = 170 + \frac{10}{72}(123.25 - 88) = 174.90$$

ii. Q2: Determine the class containing the second quartile.
2N/4 = 246.5
190-200 is the class containing the second quartile.
LQ2 = 190, N = 493, W = 10, C = 244, fQ2 = 107
$$Q_2 = L_{Q_2} + \frac{W}{f_{Q_2}}\left(\frac{2N}{4} - C\right) = 190 + \frac{10}{107}(246.5 - 244) = 190.23$$

iii. Q3: Determine the class containing the third quartile.
3N/4 = 369.75
200-210 is the class containing the third quartile.
LQ3 = 200, N = 493, W = 10, C = 351, fQ3 = 49
$$Q_3 = L_{Q_3} + \frac{W}{f_{Q_3}}\left(\frac{3N}{4} - C\right) = 200 + \frac{10}{49}(369.75 - 351) = 203.83$$

b) D7: Determine the class containing the seventh decile.
7N/10 = 345.1
190-200 is the class containing the seventh decile.
LD7 = 190, N = 493, W = 10, C = 244, fD7 = 107
$$D_7 = L_{D_7} + \frac{W}{f_{D_7}}\left(\frac{7N}{10} - C\right) = 190 + \frac{10}{107}(345.1 - 244) = 199.45$$

c) P90: Determine the class containing the 90th percentile.
90N/100 = 443.7
220-230 is the class containing the 90th percentile.
LP90 = 220, N = 493, W = 10, C = 434, fP90 = 31
$$P_{90} = L_{P_{90}} + \frac{W}{f_{P_{90}}}\left(\frac{90N}{100} - C\right) = 220 + \frac{10}{31}(443.7 - 434) = 223.13$$
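Quartiles, deciles and percentiles for grouped data share one formula, so a single helper can cover all of them. The sketch below assumes equal class widths; the function name `grouped_quantile` and the `parts` parameter (4, 10 or 100) are illustrative choices, not notation from the notes.

```python
def grouped_quantile(lower_bounds, width, freqs, i, parts):
    """i-th quantile for grouped data: L + (W / f) * (i*N/parts - C)."""
    n = sum(freqs)
    target = i * n / parts
    cum = 0  # less-than cumulative frequency before the current class (C)
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum + f >= target:   # first class reaching the target
            return lower + width / f * (target - cum)
        cum += f
    raise ValueError("quantile class not found")


lower_bounds = list(range(140, 250, 10))          # 140, 150, ..., 240
freqs = [17, 29, 42, 72, 84, 107, 49, 34, 31, 16, 12]

for label, i, parts in [("Q1", 1, 4), ("Q2", 2, 4), ("Q3", 3, 4),
                        ("D7", 7, 10), ("P90", 90, 100)]:
    print(label, round(grouped_quantile(lower_bounds, 10, freqs, i, parts), 2))
```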

Chapter Three
3. Measures of Dispersion
Introduction and objectives of measuring Variation
 The scatter or spread of items of a distribution is known as dispersion or variation.
 In other words, the degree to which numerical data tend to spread about an average
value is called dispersion or variation of the data (i.e. variability or dispersion is concerned
with the extent to which values of a data set differ from their computed
mean).
 Measures of dispersion are statistical measures which provide ways of measuring the
extent to which data are dispersed or spread out.
 The objectives of measuring variation are to judge the reliability of measures of central
tendency, to control variability itself, to compare two or more groups of numbers in terms
of their variability, and to support further statistical analysis.
Types of Measures of Dispersion
Various measures of dispersions are in use. The most commonly used measures of dispersions
are:
Range and relative range
Mean and Quartile deviation
Variance and Standard deviation
Coefficient of variation.

3.1. The Range (R)


 The range is the largest score minus the smallest score.
 It is a quick and dirty measure of variability.

 Because the range is greatly affected by extreme scores, it may give a distorted picture
of the scores.
 The following two distributions have the same range, 13, yet appear to differ greatly in
the amount of variability.

Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45

Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45

For this reason, among others, the range is not the most important measure of variability.

R= L – S where, L= largest observation

S= smallest observation

Range for grouped frequency distribution


Find the range of the frequency distribution given below

Mark 1-5 6-10 11-15 16-20 21-25 26-30 31-35 36-40 41-45 46-50
Frequency 0 2 4 11 20 10 5 3 2 1

Solution
The lowest class is 1-5; its lower class boundary is 0.5.
The highest class is 46-50; its upper class boundary is 50.5.
Therefore, the range = upper class boundary of the highest class − lower class boundary of the
lowest class = 50.5 − 0.5 = 50
Properties of Range
 It is easy to calculate and simple to understand.
 It is not based on all observation.
 It is highly affected by extreme observations.
 It is affected by fluctuation in sampling.
 It cannot be computed in the case of open end distribution.
3.2. Mean and quartile deviation
3.2.1. The Mean Deviation (M.D)
 This is a much better measure of dispersion.
 Mean deviation is the mean of the absolute values of the deviation from some measure
of central tendency.

 Depending up on the type of averages used we have different mean deviations.
a) Mean Deviation about the mean
It is denoted by $M.D(\bar{X})$ and given by
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$$

For the case of frequency distribution it is given as:
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \bar{X}|}{n}$$
Steps to calculate $M.D(\bar{X})$:
1. Find the arithmetic mean, $\bar{X}$.
2. Find the deviations of each reading from $\bar{X}$.
3. Find the arithmetic mean of the deviations, ignoring sign.
b) Mean Deviation about the median.
It is denoted by $M.D(\tilde{X})$ and given by
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{n} |X_i - \tilde{X}|}{n}$$

For the case of frequency distribution it is given as:
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \tilde{X}|}{n}$$
Steps to calculate $M.D(\tilde{X})$:
1. Find the median, $\tilde{X}$.
2. Find the deviations of each reading from $\tilde{X}$.
3. Find the arithmetic mean of the deviations, ignoring sign.
c) Mean Deviation about the mode.
It is denoted by $M.D(\hat{X})$ and given by:
$$M.D(\hat{X}) = \frac{\sum_{i=1}^{n} |X_i - \hat{X}|}{n}$$
For the case of frequency distribution it is given as:
$$M.D(\hat{X}) = \frac{\sum_{i=1}^{k} f_i |X_i - \hat{X}|}{n}$$
Where $X_i$ is the class midpoint.
Steps to calculate $M.D(\hat{X})$:
1. Find the mode, $\hat{X}$.
2. Find the deviations of each reading from $\hat{X}$.
3. Find the arithmetic mean of the deviations, ignoring sign.
Examples:
The following are the numbers of visits made by ten mothers to the local doctor's surgery: 8, 6, 5,
5, 7, 4, 5, 9, 7, 4. Find the mean deviation about the mean, median and mode.
Solutions: First calculate the three averages: $\bar{X} = 6$, $\tilde{X} = 5.5$, $\hat{X} = 5$.
Then take the deviations of each observation from these averages.

Xi          4     4     5     5     5     6     7     7     8     9     Total
|Xi - 6|    2     2     1     1     1     0     1     1     2     3     14
|Xi - 5.5|  1.5   1.5   0.5   0.5   0.5   0.5   1.5   1.5   2.5   3.5   14
|Xi - 5|    1     1     0     0     0     1     2     2     3     4     14

$$M.D(\hat{X}) = \frac{\sum_{i=1}^{10} |X_i - 5|}{10} = \frac{14}{10} = 1.4$$
$$M.D(\bar{X}) = \frac{\sum_{i=1}^{10} |X_i - 6|}{10} = \frac{14}{10} = 1.4$$
$$M.D(\tilde{X}) = \frac{\sum_{i=1}^{10} |X_i - 5.5|}{10} = \frac{14}{10} = 1.4$$
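The three mean deviations can be checked with one helper that takes the chosen average as an argument. This is a minimal sketch; `mean_deviation` is an illustrative name, and the averages come from Python's `statistics` module.

```python
from statistics import mean, median, mode


def mean_deviation(values, center):
    """Mean of the absolute deviations of the values from a chosen average."""
    return sum(abs(x - center) for x in values) / len(values)


visits = [8, 6, 5, 5, 7, 4, 5, 9, 7, 4]

for name, center in [("mean", mean(visits)),
                     ("median", median(visits)),
                     ("mode", mode(visits))]:
    print(name, center, mean_deviation(visits, center))
```

For this data set all three deviations coincide; that is a feature of these particular numbers, not a general rule.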

Example (for grouped frequency distribution)


140 students sat for a Geology test; their marks are shown in the frequency distribution table
below. Calculate the mean deviation about the median of their marks.

Marks    Class mark (X)   Frequency (f)   Cumulative frequency   |d| = |X − median|   f|d|
0-10     5.5              4               4                      57.5                 230
11-20    15.5             6               10                     47.5                 285
21-30    25.5             9               19                     37.5                 337.5
31-40    35.5             9               28                     27.5                 247.5
41-50    45.5             12              40                     17.5                 210
51-60    55.5             20              60                     7.5                  150
61-70    65.5             42              102                    2.5                  105
71-80    75.5             22              124                    12.5                 275
81-90    85.5             10              134                    22.5                 225
91-100   95.5             6               140                    32.5                 195
Total                     140                                                         2260

The median is the ((N+1)/2)th = (141/2)th value, which lies in the class 61-70.
$$\text{Median} = L + \left(\frac{\frac{N+1}{2} - \sum f}{f}\right) C$$
Where, L = lower class boundary of the median class
C = median class width
f = frequency of the median class
∑f = cumulative frequency below the median class

$$\text{Median} = 60.5 + \left(\frac{\frac{141}{2} - 60}{42}\right) 10 = 60.5 + 2.5 = 63$$
Then, the mean deviation about the median is
$$M.D = \frac{\sum f|d|}{\sum f} = \frac{2260}{140} = 16.143$$

3.2.2. Quartile deviation and semi-inter quartile range

The inter quartile range is the difference between the third and the first quartiles of a set of
items (Q3 − Q1), and the semi-inter quartile range (quartile deviation) is half of the inter quartile
range, i.e. ½ (Q3 − Q1).
$$Q.D = \frac{Q_3 - Q_1}{2}$$
It gives the average amount by which the two quartiles differ from the median.

Coefficient of quartile deviation (C.Q.D)
$$C.Q.D = \frac{\frac{Q_3 - Q_1}{2}}{\frac{Q_3 + Q_1}{2}} = \frac{2 \cdot Q.D}{Q_3 + Q_1} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
Example: Compute Q.D and its coefficient for the following distribution.
Values    Frequency
140-150   17
150-160   29
160-170   42
170-180   72
180-190   84
190-200   107
200-210   49
210-220   34
220-230   31
230-240   16
240-250   12

Solutions: For this distribution the quartiles have already been obtained as:
Q1 = 174.90, Q2 = 190.23, Q3 = 203.83

$$Q.D = \frac{203.83 - 174.90}{2} = 14.47$$

$$C.Q.D = \frac{2 \cdot Q.D}{Q_3 + Q_1} = \frac{2 \cdot 14.47}{203.83 + 174.90} = 0.076$$
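As a quick check, the quartile deviation and its coefficient can be computed from Q1 and Q3 with two one-line functions. The function names are illustrative, and the quartile values are taken from the example above.

```python
def quartile_deviation(q1, q3):
    """Semi-inter quartile range: (Q3 - Q1) / 2."""
    return (q3 - q1) / 2


def coefficient_of_quartile_deviation(q1, q3):
    """Relative measure: (Q3 - Q1) / (Q3 + Q1)."""
    return (q3 - q1) / (q3 + q1)


q1, q3 = 174.90, 203.83
print(quartile_deviation(q1, q3))
print(round(coefficient_of_quartile_deviation(q1, q3), 3))  # 0.076
```

Note that the coefficient divides by the sum of the quartiles, which keeps it unit-free and allows comparison across data sets.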

Remark: Q.D or C.Q.D includes only the middle 50% of the observation.
Variance and Standard Deviation
Population Variance
The variance is the average of the squares of the distance each value is from the mean.
It is obtained by taking the difference between each observation and the mean, squaring
the difference, adding the squares, and finally averaging the squares.
The formula for the population variance is as follows:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$
Where, $X_i$ = individual value, μ = population mean, N = population size
For the case of frequency distribution, it is expressed as:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{k} f_i (X_i - \mu)^2$$
Sample Variance
One of the major uses of statistics is to estimate the corresponding parameter (a characteristic of
the population). If the population variance formula is applied to a sample, the result tends to
underestimate the population variance. To correct for this, the sum of the squares of the
deviations is divided by one less than the sample size.
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
For the case of frequency distribution, it is expressed as:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i (X_i - \bar{X})^2$$
We usually use the following shortcut formulas if the data have decimals and there is a
problem of rounding.
$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}, \text{ for raw data}$$
$$S^2 = \frac{\sum_{i=1}^{k} f_i X_i^2 - n\bar{X}^2}{n-1}, \text{ for frequency distribution}$$

Standard Deviation
 It is used to measure the deviation of observations from the mean.
 It is an improvement on the mean deviation.
 It is the most satisfactory and universally adopted measure of dispersion.
 The measure of central tendency used in calculating the standard deviation is the mean.
 The example below illustrates the computation stages of the standard deviation.
 There is a problem with variances, i.e. the deviations and units are squared.
 To get the units back to those of the original data values, the square root must be
taken.
 Population standard deviation is computed as $\sigma = \sqrt{\sigma^2}$
 Sample standard deviation is computed as $S = \sqrt{S^2}$
N.B:- The larger the variance or standard deviation is, the more variable the data are.
The following steps are used to calculate the sample variance:
1. Find the arithmetic mean.
2. Find the difference between each observation and the mean.
3. Square these differences.

4. Sum the squared differences.
5. Since the data is a sample, divide the number (from step 4 above) by the number of
observations minus one, i.e., n-1 (where n is equal to the number of observations in the data
set).
Examples: Find the variance and standard deviation of the following sample data.
1. 5, 17, 12, 10.
2. The data are given in the form of a frequency distribution.
Class   Frequency
40-44   7
45-49   10
50-54   22
55-59   15
60-64   12
65-69   6
70-74   3

Solutions:
1. $\bar{X} = 44/4 = 11$

Xi                    5    10   12   17   Total
$(X_i - \bar{X})^2$   36   1    1    36   74

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{74}{3} = 24.67$$
$$S = \sqrt{S^2} = \sqrt{24.67} = 4.97$$

2. $\bar{X} = 55$, where Xi are the class midpoints.

Xi                        42     47    52    57   62    67    72    Total
$f_i(X_i - \bar{X})^2$    1183   640   198   60   588   864   867   4400

$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n-1} = \frac{4400}{74} = 59.46$$
$$S = \sqrt{S^2} = \sqrt{59.46} = 7.71$$
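The five steps for the sample variance can be sketched in Python, with optional frequencies so the same function covers both examples. The name `sample_variance` and its `freqs` parameter are assumptions for this illustration.

```python
import math


def sample_variance(values, freqs=None):
    """Sample variance: sum(f * (x - mean)^2) / (n - 1), with n = sum(f)."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    if n < 2:
        raise ValueError("at least two observations are needed")
    xbar = sum(f * x for x, f in zip(values, freqs)) / n          # step 1: mean
    ss = sum(f * (x - xbar) ** 2 for x, f in zip(values, freqs))  # steps 2-4
    return ss / (n - 1)                                           # step 5


# Example 1: raw data
s2_raw = sample_variance([5, 17, 12, 10])
print(round(s2_raw, 2), round(math.sqrt(s2_raw), 2))  # 24.67 4.97

# Example 2: class midpoints with their frequencies
midpoints = [42, 47, 52, 57, 62, 67, 72]
freqs = [7, 10, 22, 15, 12, 6, 3]
s2_grouped = sample_variance(midpoints, freqs)
print(round(s2_grouped, 2), round(math.sqrt(s2_grouped), 2))  # 59.46 7.71
```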

Other example: Calculate the population standard deviation from the frequency distribution given below.
X   0-10   11-20   21-30   31-40   41-50   51-60   61-70   71-80
F   18     16      15      12      10      5       2       2
Solution: Represent the deviation of each midpoint from the mean by d, i.e. $d = X - \bar{X}$.

Class    Midpoint (X)   Frequency (f)   fX       d = X − X̄   d²         fd²
0-10     5.5            18              99.0     -21.625     467.641    8417.538
11-20    15.5           16              248.0    -11.625     135.141    2162.256
21-30    25.5           15              382.5    -1.625      2.641      39.615
31-40    35.5           12              426.0    8.375       70.141     841.692
41-50    45.5           10              455.0    18.375      337.641    3376.410
51-60    55.5           5               277.5    28.375      805.141    4025.705
61-70    65.5           2               131.0    38.375      1472.641   2945.282
71-80    75.5           2               151.0    48.375      2340.141   4680.282
Total                   80              2170                            26488.78

$$\bar{X} = \frac{\sum fX}{\sum f} = \frac{2170}{80} = 27.125$$
$$\sigma = \sqrt{\frac{\sum f d^2}{\sum f}} = \sqrt{\frac{26488.78}{80}} = \sqrt{331.110} = 18.196$$

Properties of Standard deviations

It gives us a measure of dispersion relative to the mean.


It is sensitive to the exact value of all observation.
It is the most frequently used measures of dispersion.
It is used for quantitative data.
Coefficient of Variation (C.V)
• It is defined as the ratio of the standard deviation to the mean, usually expressed as a percentage.
$$C.V = \frac{S}{\bar{X}} \cdot 100$$
The distribution having less C.V is said to be less variable or more consistent.
Examples: An analysis of the monthly wages paid (in Birr) to workers in two firms A and B
belonging to the same industry gives the following results. Calculate the coefficient of variation for
both firms.
Value         Firm A   Firm B
Mean wage     52.5     47.5
Median wage   50.5     45.5
Variance      100      121
Solutions: The standard deviation is the square root of the variance, so S for firms A and B is 10
and 11 respectively.
$$C.V_A = \frac{S_A}{\bar{X}_A} \cdot 100 = \frac{10}{52.5} \cdot 100 = 19.05\%$$
$$C.V_B = \frac{S_B}{\bar{X}_B} \cdot 100 = \frac{11}{47.5} \cdot 100 = 23.16\%$$

Since C.VA < C.VB, in firm B there is greater variability in individual wages.
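The comparison of the two firms can be reproduced with a short sketch; `coefficient_of_variation` is an illustrative name.

```python
import math


def coefficient_of_variation(std_dev, mean):
    """C.V = (S / mean) * 100, expressed as a percentage."""
    return std_dev / mean * 100


# Standard deviations are the square roots of the given variances
cv_a = coefficient_of_variation(math.sqrt(100), 52.5)
cv_b = coefficient_of_variation(math.sqrt(121), 47.5)
print(round(cv_a, 2), round(cv_b, 2))  # 19.05 23.16
print("Firm B is more variable" if cv_b > cv_a else "Firm A is more variable")
```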

Chapter Four
4. Measurement of Distribution
The shape of' the frequency distribution best describes the relationship among mean, median
and mode.
4.1 Normal Distribution

 When the distribution of item in a series happens to be perfectly symmetrical, then we
have the following type of curve for the distribution.
 Symmetric Distribution is the distribution of observations in which mean, median and
mode have the same value as shown in Figure 1 below.

Figure 1
It is Zero skewness.

 Such a curve illustrated above is technically described as a normal curve and the relating
distribution as normal distribution.
 Such a curve is perfectly bell shaped curve in which case the value of mean, median and
mode is just the same and skewness is absent.
 The normal distribution curve can be used to study many variables that are exactly or
approximately normally distributed.
 The mathematical equation for the normal distribution is:
$$y = \frac{e^{-(x-\mu)^2 / (2\sigma^2)}}{\sigma\sqrt{2\pi}}$$
where: e ≈ 2.718, π ≈ 3.14,
µ is the population mean
σ is the population standard deviation
 Another important aspect is that the area under the normal curve is more important than
the frequencies.
 Therefore, when the normal distribution is pictured, the y axis, which indicates the
frequencies, is sometimes omitted.
 Generally, the normal distribution is a continuous, symmetric, bell-shaped
distribution of a variable.
 Summary of the properties of the theoretical Normal Distribution is presented below.
 The normal distribution curve is bell-shaped.

 The mean, median, and mode are equal and located at the center of the distribution.

 The normal distribution curve is unimodal (i.e., it has only one mode).

 The curve is symmetric about the mean, which is equivalent to saying that its shape is
the same on both sides of a vertical line passing through the center.
 The curve is continuous, that is, there are no gaps or holes. For each value X, there is a
corresponding value of Y.

 The curve never touches the x axis. Theoretically, no matter how far in either direction
the curve extends, it never meets the x axis—but it gets increasingly closer.
 If the curve is distorted (whether on the right side or on the left side), we have
asymmetrical distribution which indicates that there is skewness.
 If the curve is distorted on the right side, we have positive skewness but when the curve
is distorted towards left, we have negative skewness.
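Several of the listed properties can be observed numerically. The sketch below assumes the standard normal density, y = exp(−(x − µ)² / (2σ²)) / (σ√(2π)); the function name `normal_pdf` and the chosen values of µ and σ are illustrative.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 50.0, 10.0

# Symmetry about the mean: equal heights at equal distances on both sides
print(normal_pdf(mu - 15, mu, sigma) == normal_pdf(mu + 15, mu, sigma))  # True

# The peak is at the mean
print(normal_pdf(mu, mu, sigma) > normal_pdf(mu + 1, mu, sigma))  # True

# The curve approaches, but never reaches, the x axis
print(normal_pdf(mu + 5 * sigma, mu, sigma) > 0)  # True
```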
4.2 Skewness
An outlier can significantly alter the mean of a series of numbers, whereas the median will
remain at the center of the series. In such a case, the resulting curve drawn from the values will
appear to be skewed, tailing off rapidly to the left or right. In the case of negatively skewed or
positively skewed curves, the median remains in the center of these three measures. Figure 2
below shows a negatively skewed distribution.

A negatively skewed distribution, mean < median < mode.

Figure 3 below shows a positively skewed distribution.

Mode < median < mean.

 Skewness is the degree of asymmetry or departure from symmetry of a distribution
curve.
 A skewed frequency distribution is one that is not symmetrical.
 Skewness is concerned with the shape of the curve not size.
 If the frequency curve (smoothed frequency polygon) of a distribution has a longer tail
to the right of the central maximum than to the left, the distribution is said to be skewed
to the right or said to have positive skewness. If it has a longer tail to the left of the
central maximum than to the right, it is said to be skewed to the left or said to have
negative skewness.

 For moderately skewed distribution, the following relation holds among the three
commonly used measures of central tendency.
Mean−Mode=3∗(mean−median)
Measures of Skewness
- Denoted by α3.
- There are different measures of skewness. Two measures of skewness are mentioned as follows.
1. The Pearsonian coefficient of skewness
$$\alpha_3 = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{\bar{x} - \hat{x}}{S}$$
2. Bowley's coefficient of skewness (coefficient of skewness based on quartiles)
$$\alpha_3 = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$
The shape of the curve is determined by the value of α3.
If α3 >0, the distribution is positively skewed.
If α3 =0, the distribution is symmetric distribution.
If α3 <0, the distribution is negatively skewed.
Remark:
 In a positively skewed distribution, smaller observations are more frequent than larger
observations. I.e. the majority of the observations have a value below an average.
 In a negatively skewed distribution, smaller observations are less frequent than larger
observations. I.e. the majority of the observations have a value above an average.
Example 1: Suppose the mean, the mode, and the standard deviation of a certain distribution
are 32, 30.5 and 10 respectively. What is the shape of the curve representing the distribution?
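Solutions: Use the Pearsonian coefficient of skewness.
$$\alpha_3 = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} = \frac{32 - 30.5}{10} = 0.15$$
Since α3 = 0.15 > 0, the distribution is positively skewed.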
Example 2: In a frequency distribution, the coefficient of skewness based on the quartiles is
given to be 0.5. If the sum of the upper and lower quartile is 28 and the middle quartile is 11,
find the values of the upper and lower quartiles.
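Solutions: Given α3 = 0.5, Q2 = 11 and
Q3 + Q1 = 28 ..... (x)
From Bowley's coefficient,
$$\alpha_3 = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1} = \frac{28 - 2(11)}{Q_3 - Q_1} = 0.5$$
so that
Q3 − Q1 = 12 ..... (y)
Then, solving (x) and (y):
Q3 − (28 − Q3) = 12, 2Q3 = 40; finally, we get Q3 = 20 and Q1 = 8.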
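Both coefficients of skewness are simple ratios and can be sketched as follows. The function names are illustrative; the inputs are the figures from Examples 1 and 2.

```python
def pearson_skewness(mean, mode, std_dev):
    """Pearsonian coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / std_dev


def bowley_skewness(q1, q2, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


def describe(alpha3):
    """Interpret the sign of a skewness coefficient."""
    if alpha3 > 0:
        return "positively skewed"
    if alpha3 < 0:
        return "negatively skewed"
    return "symmetric"


a = pearson_skewness(32, 30.5, 10)
print(a, describe(a))            # 0.15 positively skewed

# Check of Example 2: Q1 = 8, Q2 = 11, Q3 = 20 should give 0.5
print(bowley_skewness(8, 11, 20))  # 0.5
```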
The significance of skewness lies in the fact that through it one can study the formation of series
and can have the idea about the shape of the curve, whether normal or otherwise, when the
items of a given series are plotted on a graph.
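The two coefficients of skewness can be checked with a few lines of Python. This is only a minimal sketch: the function names (pearson_skewness, bowley_skewness, describe_shape) are made up for illustration and are not part of the course material. The numbers are those of Examples 1 and 2 above.

```python
def pearson_skewness(mean, mode, std_dev):
    """Pearsonian coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / std_dev


def bowley_skewness(q1, q2, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


def describe_shape(alpha3):
    """Translate the sign of alpha3 into the shape of the curve."""
    if alpha3 > 0:
        return "positively skewed"
    if alpha3 < 0:
        return "negatively skewed"
    return "symmetric"


# Example 1: mean = 32, mode = 30.5, standard deviation = 10
a3 = pearson_skewness(32, 30.5, 10)
print(round(a3, 2), describe_shape(a3))

# Example 2: Q1 = 8, Q2 = 11, Q3 = 20 should give back the stated 0.5
print(bowley_skewness(8, 11, 20))
```

The second call works backwards from the answer of Example 2, which is a quick way to confirm that the quartiles found there are consistent with the given coefficient.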
4.3 Kurtosis
 Kurtosis is the degree of peakedness (condition of having peak) of a distribution,
usually taken relative to a normal distribution.
 A distribution having relatively high peak is called leptokurtic.
 If a curve representing a distribution is flat topped, it is called platykurtic.
 The normal distribution which is not very high peaked or flat topped is called
mesokurtic.
 In brief, Kurtosis is the humpedness (convexness) of the curve and points to the nature
of distribution of items in the middle of a series.
Measures of kurtosis
The moment coefficient of kurtosis:
• Denoted by $\alpha_4$ and given by $\alpha_4 = \dfrac{M_4}{M_2^2} = \dfrac{M_4}{\sigma^4}$
where $M_4$ is the fourth moment about the mean,
$M_2$ is the second moment about the mean, and
$\sigma$ is the population standard deviation.
The peakedness depends on the value of $\alpha_4$:
If $\alpha_4 > 3$, then the curve is leptokurtic.
If $\alpha_4 = 3$, then the curve is mesokurtic.
If $\alpha_4 < 3$, then the curve is platykurtic.

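N.B. If X is a variable that assumes the values X1, X2, ..., Xn, then
1. The rth moment about the mean (the rth central moment) is denoted by $M_r$ and defined as:
$M_r = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^r}{n}$ (ungrouped distribution)
$M_r = \dfrac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^r}{n}$ (grouped distribution with $k$ classes, where $n = \sum f_i$)
In particular, for a grouped distribution,
$M_4 = \dfrac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^4}{n}$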

Example: If the first four central moments of a distribution are M1 = 0, M2 = 16, M3 = −60 and
M4 = 162, compute a measure of kurtosis and give your interpretation.
Solution: $\alpha_4 = \dfrac{M_4}{M_2^2} = \dfrac{162}{16^2} = \dfrac{162}{256} \approx 0.63$
Interpretation: since 0.63 < 3, the curve is platykurtic.
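The moment calculations can be sketched in Python as follows. The helper names (central_moment, kurtosis_coefficient, classify_kurtosis) are illustrative assumptions, and the divisor n follows the usual definition of a central moment.

```python
def central_moment(values, r, freqs=None):
    """r-th moment about the mean; pass freqs for a grouped distribution."""
    if freqs is None:
        freqs = [1] * len(values)  # ungrouped data: every value counted once
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n


def kurtosis_coefficient(m4, m2):
    """Moment coefficient of kurtosis: alpha4 = M4 / M2^2."""
    return m4 / m2 ** 2


def classify_kurtosis(alpha4):
    """Compare alpha4 with 3, the value for a normal curve."""
    if alpha4 > 3:
        return "leptokurtic"
    if alpha4 < 3:
        return "platykurtic"
    return "mesokurtic"


# The example above: M2 = 16 and M4 = 162
a4 = kurtosis_coefficient(162, 16)
print(round(a4, 2), classify_kurtosis(a4))

# Moments computed directly from a small, made-up set of raw values
data = [6, 8, 10, 12, 14]
print(central_moment(data, 1), central_moment(data, 2))
```

The first central moment of any data set is zero, which is why M1 = 0 in the example.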

Chapter Five
5. Sampling
The concept of sampling

Sampling may be defined as the selection of some part of an aggregate or totality on
the basis of which a judgment or inference about the aggregate or totality is made.
In other words, it is the process of obtaining information about an entire population by
examining only a part of it.
In most of the research work and surveys, the usual approach happens to be to make
generalizations or to draw inferences based on samples about the parameters of
population from which the samples are taken.
The researcher quite often selects only a few items from the universe for his study
purposes. All this is done on the assumption that the sample data will enable him to
estimate the population parameters.
The items so selected constitute what is technically called a sample, their selection
process or technique is called sample design.
Sample should be truly representative of population characteristics without any bias so
that it may result in valid and reliable conclusions.
Some Fundamental Definitions
Before we talk about details and uses of sampling, it seems appropriate that we should be
familiar with some fundamental definitions concerning sampling concepts and principles.
1. Population: refers to the total of items about which information is desired.
 The population represents the target of an investigation, and the objective of the
investigation is to draw conclusions about the population hence we sometimes call it
target population.
 The population or universe can be finite or infinite.
 The population is said to be finite if it consists of a fixed number of elements so that it
is possible to enumerate it in its totality, for instance the population of a city or the
number of workers in a factory.
 An infinite population is one in which it is theoretically impossible to
observe all the elements, e.g. the number of stars in the sky. From a practical
standpoint, we use the term infinite population for a population that cannot be
enumerated in a reasonable period of time. In this way we use the theoretical concept of
an infinite population as an approximation of a very large finite population.
Examples
• Population of trees under specified climatic conditions
• Population of animals fed a certain type of diet
• Population of farms having a certain type of natural fertility
• Population of households, etc.
2. Census: a complete enumeration of the population. But in most real problems it cannot be
realized, hence we take sample.
3. Sampling frame: it is the group or cluster or list of items from which the sample is to be
drawn. Whatever the frame may be, it should be a good representative of the population.
4. Sampling design: it is a definite plan for obtaining a sample from the sampling frame. It
refers to the technique or the procedure the researcher would adopt in selecting some sampling
units.
5. Statistic(s) and parameter(s): A statistic is a characteristic of a sample, whereas a parameter
is a characteristic of a population. Thus, when measures such as the mean,
median or mode are worked out from samples, they are called statistics, for they describe
the characteristics of a sample. When such measures describe the characteristics of a
population, they are known as parameters. For instance, the population mean ($\mu$) is a
parameter, whereas the sample mean ($\bar{X}$) is a statistic.
6. Errors: there will always be a certain amount of inaccuracy in the information collected. This
inaccuracy may be of two types: sampling error (error variance) and non-sampling error.
a) Sampling error:
• Is the discrepancy between the population value and the sample value.
• May arise when inappropriate sampling techniques are applied.
b) Non-sampling errors: errors due to procedural bias, such as:
• Incorrect responses.
• Measurement errors.
• Errors at different stages in processing the data.

[Figure: population, sample frame and sample, with frame error, chance error and response error marked between them.]
Sampling error = Frame error + chance error + response error.
(If we add measurement error, or the non-sampling error, to sampling error, we get total error.)

 The more homogeneous the universe, the smaller the sampling error.
 Sampling error is inversely related to the size of the sample i.e., sampling error
decreases as the sample size increases and vice-versa.
7. Sampling distribution: it is all values of a particular statistic, say mean, together with their
relative frequencies.
8. Sampling: The process or method of sample selection from the population.
9. Sampling unit: the ultimate unit to be sampled or elements of the population to be sampled.
5.1. Why sampling is needed?

-Reduced cost - Sampling can save time and money. A sample study is usually less expensive
than a census study and produces results at a relatively faster speed.
-Greater speed
-Greater accuracy - Sampling may enable more accurate measurements for a sample study is
generally conducted by trained and experienced investigators.
-Greater scope
-Avoids destructive test
-The only option when the population is infinite (large).
Sometimes taking a census makes more sense than using a sample, for example when a sample
would not be representative or when a high level of detail is needed.
5.2. Sampling Distribution of Sample Means

Given a variable X, if we arrange its values in ascending order and assign a probability to each of
the values, or if we present the values $X_i$ in the form of a frequency distribution, the result is called the
sampling distribution of X.
The sampling distribution of the sample mean is the distribution obtained by using the means computed
from random samples of a specific size taken from a population.
Steps for the construction of Sampling Distribution of the mean
1. From a finite population of size N, randomly draw all possible samples of size n.

2. Calculate the mean for each sample.
3. Summarize the mean obtained in step 2 in terms of frequency distribution or relative
frequency distribution.
Example: Suppose we have a population of size N = 5, consisting of the ages of five children:
6, 8, 10, 12, and 14.
Population mean: $\mu = 10$
Population variance: $\sigma^2 = 8$
Take samples of size 2 with replacement and construct the sampling distribution of the sample
mean.
Solution: N = 5, n = 2
We have $N^n = 5^2 = 25$ possible samples, since sampling is with replacement.

Step 1: Draw all possible samples:

        6         8         10        12        14
6       (6,6)     (6,8)     (6,10)    (6,12)    (6,14)
8       (8,6)     (8,8)     (8,10)    (8,12)    (8,14)
10      (10,6)    (10,8)    (10,10)   (10,12)   (10,14)
12      (12,6)    (12,8)    (12,10)   (12,12)   (12,14)
14      (14,6)    (14,8)    (14,10)   (14,12)   (14,14)
Step 2: Calculate the mean for each sample:

6 8 10 12 14
6 6 7 8 9 10
8 7 8 9 10 11
10 8 9 10 11 12
12 9 10 11 12 13
14 10 11 12 13 14
Step 3: Summarize the mean obtained in step 2 in terms of frequency distribution.

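Sample mean ($\bar{X}_i$)   6   7   8   9   10   11   12   13   14
Frequency ($f_i$)           1   2   3   4   5    4    3    2    1
a) Find the mean of $\bar{X}$, say $\mu_{\bar{X}}$:
$\mu_{\bar{X}} = \dfrac{\sum \bar{X}_i f_i}{\sum f_i} = \dfrac{250}{25} = 10 = \mu$
b) Find the variance of $\bar{X}$, say $\sigma^2_{\bar{X}}$:
$\sigma^2_{\bar{X}} = \dfrac{\sum (\bar{X}_i - \mu_{\bar{X}})^2 f_i}{\sum f_i} = \dfrac{100}{25} = 4$
Note that the mean of the sample means equals the population mean, and that $\sigma^2_{\bar{X}} = \sigma^2 / n = 8/2 = 4$.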
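The three construction steps can be reproduced by brute force in Python. This is a small sketch using only the standard library; the variable names are chosen here for illustration.

```python
from collections import Counter
from itertools import product

population = [6, 8, 10, 12, 14]  # ages of the five children
n = 2                            # sample size

# Step 1: all possible samples of size n, drawn with replacement
samples = list(product(population, repeat=n))

# Step 2: the mean of each sample
sample_means = [sum(s) / n for s in samples]

# Step 3: frequency distribution of the sample means
freq = Counter(sample_means)

mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)

print(len(samples))                 # number of possible samples
print(sorted(freq.items()))         # (sample mean, frequency) pairs
print(mean_of_means, var_of_means)  # mean and variance of the sample means
```

Changing n to 3 in the sketch gives 125 samples, and the variance of the sample means drops further, in line with the statement that sampling error decreases as the sample size increases.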

5.3 Sample Size Determination

Sample Size refers to the number of sampling units selected from the population for
investigation.

The size of the sample should be determined by a researcher keeping in view the following
points:
(i) Nature of universe (population): Universe may be either homogenous or heterogeneous in
nature. If the items of the universe are homogenous, a small sample can serve the purpose. But
if the items are heterogeneous, a large sample would be required. Technically, this can be
termed as the dispersion factor.
(ii) Number of classes proposed: If many class-groups (groups and sub-groups) are to be
formed, a large sample would be required because a small sample might not be able to give a
reasonable number of items in each class-group.
(iii) Type of sampling: Sampling technique plays an important part in determining the size of
the sample. A small random sample is apt to be much superior to a larger but badly
selected sample.
(iv) Standard of accuracy and acceptable confidence level: If the standard of accuracy or the
level of precision is to be kept high, we shall require relatively larger sample.
(v) Availability of finance: In practice, the size of the sample depends upon the amount of money
available for the study. This factor should be kept in view while determining the size
of the sample, because large samples increase the cost of sampling estimates.
(vi) Other considerations: Nature of units, size of the population, size of questionnaire,
availability of trained investigators, the conditions under which the sample is being conducted,
the time available for completion of the study are a few other considerations to which a
researcher must pay attention while selecting the size of the sample.
Process of determining Sample Size

Precision is the range within which the answer may vary and still be acceptable;
confidence level indicates the likelihood that the answer will fall within that range, and
the significance level indicates the likelihood that the answer will fall outside that
range.
We can always remember that if the confidence level is 95%, then the significance level
will be (100 – 95) i.e., 5%; if the confidence level is 99%, the significance level is (100
– 99) i.e., 1%, and so on.
We should also remember that the area of normal curve within precision limits for the
specified confidence level constitutes the acceptance region and the area of the curve
outside these limits in either direction constitutes the rejection regions.
 In other words, a confidence interval is the specific interval estimate of a parameter
determined by using the data obtained from a sample and the specific confidence level of the
estimate.
 The confidence level of an interval estimate of a parameter is the probability that the interval
estimate will contain the parameter. Three confidence levels are commonly used: 90%,
95% and 99%.
If a specific sample mean, say $\bar{x}$, is selected, there is a 95% probability that it falls within the
range $\mu \pm 1.96\left(\dfrac{\sigma}{\sqrt{n}}\right)$. Likewise, there is a 95% probability that the interval specified by
$\bar{x} \pm 1.96\left(\dfrac{\sigma}{\sqrt{n}}\right)$ will contain $\mu$, i.e.
$\bar{x} - 1.96\left(\dfrac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + 1.96\left(\dfrac{\sigma}{\sqrt{n}}\right)$.
Hence one can be 95% confident
that the population mean is contained in the interval when the values of the variable are
normally distributed.
E.g. A teacher wishes to estimate the average age of the students enrolled. From past studies, the
standard deviation is known to be 2 years. A sample of 50 students is selected and the mean is found to be 23.2
years. Find the 95% confidence interval of the population mean.
Since the 95% confidence interval is desired, $z_{\alpha/2} = 1.96$.
$\bar{x} - z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$
$23.2 - 1.96\left(\dfrac{2}{\sqrt{50}}\right) < \mu < 23.2 + 1.96\left(\dfrac{2}{\sqrt{50}}\right)$
$23.2 - 0.6 < \mu < 23.2 + 0.6$ years, i.e. $23.2 \pm 0.6$ years (the margin $1.96 \times 2/\sqrt{50} \approx 0.55$ is rounded to 0.6).
The teacher can say with 95% confidence that the average age of the students is between 22.6
and 23.8 years, based on 50 students. That is, there is a 95% probability that a confidence interval
built around a specific sample mean would contain the population mean.
• $\alpha$ (alpha) represents the total area in both tails of the normal curve.
• $\alpha/2$ represents the area in each one of the tails.
• The relationship between $\alpha$ and the confidence level is: confidence level = $1 - \alpha$.
E.g. when the 95% confidence interval is to be found, $\alpha = 0.05$, since 1 − 0.05 = 0.95 or 95%. When
$\alpha = 0.01$, 1 − 0.01 = 0.99 or 99%.
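The formula for the confidence interval is as follows.
$\bar{x} - z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$
For 95%, $z_{\alpha/2} = 1.96$; for 99%, $z_{\alpha/2} = 2.58$. If n ≥ 30, S can be used in place of σ when σ is
unknown.
$z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$ is called the maximum error of estimate.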
Sample size: it depends on the maximum error of estimate, the population standard deviation
and the degree of confidence.
- The population standard deviation is assumed to be known (it has been estimated from
previous studies).
The formula for sample size is derived from the maximum error of estimate (e) formula:
$e = z_{\alpha/2}\left(\dfrac{\sigma}{\sqrt{n}}\right)$. This is solved for n as follows:
$n = \left(\dfrac{z_{\alpha/2} \times \sigma}{e}\right)^2$
where σ = standard deviation of the population (to be estimated from past experience)
z = the value of the standard variate at a given confidence level (to be read
from the table or given); it is 1.96 for a 95% confidence level
n = size of the sample
e = acceptable error (the precision).
N.B.: If n comes out as a fraction, use the next whole number for the sample size n.
- The above formula is applicable when the population happens to be large (n > 30). In the case
of a small population of size N, the formula for determining sample size becomes
$n = \dfrac{z^2 \times N \times \sigma^2}{(N - 1)e^2 + z^2 \sigma^2}$
Example: The president of the university asks the statistics instructors to estimate the average age of
students in the university. How large a sample is necessary? The statistics instructors decide
the estimate should be accurate within 1 year and be 99% confident. From a previous study, the
standard deviation of the students' ages is known to be 3 years.
Solution:
$\alpha = 0.01$ (i.e. 1 − 0.99), $z_{\alpha/2} = 2.58$, e = 1, and σ = 3
$n = \left(\dfrac{z_{\alpha/2} \times \sigma}{e}\right)^2 = \left(\dfrac{2.58 \times 3}{1}\right)^2 = 59.9$
Rounding up to the next whole number, a sample of 60 students is needed.
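The confidence interval and sample size formulas of this section can be tried out with the short Python sketch below. The function names are illustrative only, and the population size N = 1000 used for the finite-population formula is an assumed figure, not one taken from the examples.

```python
import math


def confidence_interval(sample_mean, sigma, n, z):
    """Interval x_bar +/- z * sigma / sqrt(n) for the population mean."""
    margin = z * sigma / math.sqrt(n)  # maximum error of estimate
    return sample_mean - margin, sample_mean + margin


def sample_size(sigma, e, z):
    """n = (z * sigma / e)^2, rounded up to the next whole number."""
    return math.ceil((z * sigma / e) ** 2)


def sample_size_finite(sigma, e, z, N):
    """Sample size when the population of size N is small (finite)."""
    n = (z ** 2 * N * sigma ** 2) / ((N - 1) * e ** 2 + z ** 2 * sigma ** 2)
    return math.ceil(n)


# Teacher example: x_bar = 23.2, sigma = 2, n = 50, 95% confidence
low, high = confidence_interval(23.2, 2, 50, 1.96)
print(round(low, 1), round(high, 1))

# University example: sigma = 3, e = 1, 99% confidence
print(sample_size(3, 1, 2.58))

# Same requirements, but assuming a finite population of N = 1000 students
print(sample_size_finite(3, 1, 2.58, N=1000))
```

math.ceil is used because a fractional sample size must always be rounded up, never to the nearest whole number. The finite-population formula never asks for a larger sample than the large-population one.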
5.4 Sampling Methods (Techniques)
There are two types of sampling techniques:
Random Sampling (probability sampling) & Non probability sampling
5.4.1. Probability sampling

Probability sampling is a method of sampling in which every element in the population has a
pre-assigned non-zero probability of being included in the sample.
Examples:
• Simple random sampling
• Stratified random sampling
• Cluster sampling
• Systematic sampling
1. Simple Random Sampling:
Is a method of selecting items from a population such that every possible sample of
specific size has an equal chance of being selected.
All elements in the population have the same pre-assigned non zero probability to be
included in to the sample.
Simple random sampling can be done using the lottery method.
2. Stratified Random Sampling:
 The population will be divided in to non overlapping groups called strata.
 Random selection can be carried out within each sub-group. Then, the randomly selected
representatives of the sub-groups together form the stratified sample.
 The random selection can be done in proportion, according to the size or number in the
population of each sub-group. This is called proportional allocation. This requires
information about the relative sizes of the strata in the population. That is to say that the
exact population numbers or good estimates of these numbers should be available.
 Elements in the same stratum should be more or less homogeneous, while elements in
different strata should differ from one another.
 It is applied if the population is heterogeneous.
 Some of the criteria for dividing a population into strata are: Sex (male, female); Age
(under 18, 18 to 28, 29 to 39); Occupation (blue-collar, professional, other).
3. Cluster Sampling:
 This is a method of sampling involving a naturally occurring group of individuals rather
than an individual.
 In other words, a cluster sample is one in which the research interest characteristics
have been identified, the areas or zones in which these characteristics exist have also
been identified and samples from each of the identified zones randomly constituted. The
population is divided in to non overlapping groups called clusters.
 A simple random sample of groups or cluster of elements is chosen and all the sampling
units in the selected clusters will be surveyed.
 Clusters are formed in a way that elements within a cluster are heterogeneous, i.e.
observations in each cluster should be more or less dissimilar.
 Cluster sampling is useful when it is difficult or costly to generate a simple random
sample. For example, to estimate the average annual household income in a large city
we use cluster sampling, because to use simple random sampling we need a complete
list of households in the city from which to sample. To use stratified random sampling,
we would again need the list of households. A less expensive way is to let each block
within the city represent a cluster. A sample of clusters could then be randomly
selected, and every household within these clusters could be interviewed to find the
average annual household income.
4. Systematic Sampling:
A complete list of all elements within the population (sampling frame) is required.
The procedure starts in determining the first element to be included in the sample.

Then the technique is to take every kth item from the sampling frame.
Let N = population size, n = sample size, and k = N/n = sampling interval.
 Choose any number between 1 and k; suppose it is j ($1 \le j \le k$).
 The jth unit is selected first, and then the (j+k)th, (j+2k)th, etc., until the required sample
size is reached.
 For instance, if N = 1000 and n = 100, then k = N/n = 10.
We can randomly pick any number from 1 to 10. The selection of this
number determines the entire sample. Example: if we pick 2, then units 2, 12, 22, 32, 42, etc.
automatically become members of the sample.
 The main advantage of this method is that it requires less work. The
disadvantage is that if the listing of the population is not in random
order, periodicity can be introduced. Periodicity means a situation where every kth
member of the population has some characteristic peculiar or unique to only those
members.
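The four probability sampling methods described above can be sketched with Python's standard random module. Everything here is illustrative: the function names, the numbered frame of 1000 units, the strata and the city blocks are all assumptions made for the sake of the example.

```python
import random


def simple_random_sample(frame, n, rng):
    """Every unit has the same chance of selection (like the lottery method)."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, j):
    """Take the j-th unit, then every k-th unit after it, where k = N / n."""
    k = len(frame) // n
    if not 1 <= j <= k:
        raise ValueError("the random start j must satisfy 1 <= j <= k")
    return [frame[j - 1 + i * k] for i in range(n)]


def stratified_sample(strata, n, rng):
    """Proportional allocation: sample each stratum in proportion to its size."""
    total = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        share = round(n * len(units) / total)  # rounding may shift the total slightly
        sample.extend(rng.sample(units, share))
    return sample


def cluster_sample(clusters, m, rng):
    """Pick m whole clusters at random and survey every unit inside them."""
    chosen = rng.sample(list(clusters), m)
    return [unit for name in chosen for unit in clusters[name]]


rng = random.Random(2024)     # fixed seed so the sketch is repeatable
frame = list(range(1, 1001))  # N = 1000 numbered units

print(simple_random_sample(frame, 5, rng))
print(systematic_sample(frame, 100, j=2)[:5])  # random start j = 2, k = 10

strata = {"male": list(range(600)), "female": list(range(600, 1000))}
print(len(stratified_sample(strata, 50, rng)))

clusters = {"block A": [1, 2, 3], "block B": [4, 5], "block C": [6, 7, 8, 9]}
print(cluster_sample(clusters, 2, rng))
```

Note that in the systematic sketch the only random choice is the start j; once it is fixed, the whole sample is determined, which is exactly why periodicity in the list can bias the result.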
5.4.2. Non probability sampling
 It is a sampling technique in which the choice of individuals for a sample depends on
the basis of convenience, personal choice or interest.
Examples: • Judgment sampling.
• Convenience sampling
• No-rule sampling:
1. Judgment Sampling
- In this case, the person taking the sample has direct or indirect control over which items are
selected for the sample.
2. Convenience Sampling
- In this method, the decision maker selects a sample from the population in a manner that is
relatively easy and convenient.
3. No-rule sampling: we take a sample without any rule. The sample is representative if the
population is homogeneous and there is no selection bias.

Chapter Six
6. SIMPLE CORRELATION AND LINEAR REGRESSION
Linear regression and correlation analysis study and measure the linear relationship among two
or more variables. When only two variables are involved, the analysis is referred to as simple
correlation and simple linear regression analysis; when there are more than two variables,
the terms multiple regression and partial correlation are used.
Regression Analysis: a statistical technique that can be used to develop a mathematical
equation showing how variables are related.
Correlation Analysis: deals with the measurement of the closeness of the relationship
described in the regression equation.
We say there is correlation when the two series of items vary together, directly or inversely.
Simple Correlation
Suppose we have two variables X=(x1, x2,… xn) and Y =(y1, y2,….. yn)
When higher values of X are associated with higher values of Y and lower values of X are
associated with lower values of Y, then the correlation is said to be positive or direct.
Examples:
- Income and expenditure
- Number of hours spent in studying and the score obtained
When higher values of X are associated with lower values of Y and lower values of X are
associated with higher values of Y, then the correlation is said to be negative or inverse.
Examples:
- Demand and supply
The correlation between X and Y may be one of the following:
1. Perfect positive (r = 1)
2. Positive (r between 0 and 1)
3. No correlation (r = 0)
4. Negative (r between −1 and 0)
5. Perfect negative (r = −1)
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called “subject” or “independent”
variable, while the effect is called “dependent” variable.
2. Both variables being the result of a common cause. That is, the correlation that exists between
two variables is due to their being related to some third force.
Example:
Let X1 be the HEEE result,
Y1 be the rate of surviving in the University, and
Y2 be the rate of getting a scholarship.
Both X1 & Y1 and X1 & Y2 have high positive correlation. Likewise,
Y1 & Y2 have positive correlation, but they are not directly related; they are related to each
other via X1.

3. Chance:
The correlation that arises by chance is called spurious correlation.
Examples:
• Weight of individuals in Ethiopia and income of individuals in Kenya.
Therefore, while interpreting correlation coefficient, it is necessary to see if there is any
likelihood of any relationship existing between variables under study.
The correlation coefficient is the measure used to determine the strength of the relationship
between two variables. There are several types of correlation coefficients. One of the common
types is the Pearson Product Moment Correlation Coefficient
(PPMC). The correlation coefficient computed from the sample data measures the strength and
direction of a linear relationship between two variables. The symbol for the sample correlation
coefficient is r. The symbol for the population correlation coefficient is ρ (Greek letter rho).
There are several ways to compute the value of the correlation coefficient. One method is
Pearson's Product-moment Correlation Coefficient.
6.1.1 Pearson's Product-moment Correlation Coefficient

This measure considers the magnitudes of the observations rather than their ranks. The formula and
procedure are applicable only to quantitative data. The coefficient of correlation (r) is a measure of the
strength of the relationship between two variables. It requires interval or ratio scaled data
(variables).

If the coefficient(r) has a value;


 Under 0.20, it indicates very weak correlation
 0.21 - 0.40 = weak correlation
 0.41 - 0.70 = moderate correlation
 0.71 - 0.91 = strong correlation
 >0.91 = very strong correlation
The formula is given as follows.
$r = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
where n is the number of data pairs and X and Y are the variables. The dependent variable (Y) is the
variable that is being predicted or estimated, and the independent variable (X) is the variable that
provides the basis for estimation (it is the predictor variable).
The shortcut formula is
$r = \dfrac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$
Remark:
Always this r lies between -1 and 1 inclusively.
Interpretation of r:
1. Perfect positive linear relationship (if r= 1)
2. Some Positive linear relationship (if r is between 0 and 1)
3. No linear relationship (if r= 0)
4. Some Negative linear relationship (if r is between 0 and -1)
5. Perfect negative linear relationship (if r= -1)
Example 8.1: The data below show the age and average daily income of six farmers. Compute the
value of the correlation coefficient.
Solution:
Make a table, find the values of XY, X2 and Y2, and place these values in the corresponding
columns of the table.
Farmer code   Age (X)   Average daily income (Y)   XY       X2       Y2
A             43        128                        5504     1849     16384
B             48        120                        5760     2304     14400
C             56        135                        7560     3136     18225
D             61        143                        8723     3721     20449
E             67        141                        9447     4489     19881
F             70        152                        10640    4900     23104
Total         ∑X=345    ∑Y=819                     ∑XY=47634   ∑X2=20399   ∑Y2=112443
Substitute in the formula and solve for r.
$r = \dfrac{n(\sum XY) - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$
$r = \dfrac{6(47634) - (345)(819)}{\sqrt{[6(20399) - (345)^2][6(112443) - (819)^2]}}$
$r = \dfrac{285804 - 282555}{\sqrt{[122394 - 119025][674658 - 670761]}}$
$r = \dfrac{3249}{\sqrt{13128993}} = 0.897$
The correlation coefficient suggests a strong positive relationship between age and average
daily income of farmers.
Coefficient of Determination
The Coefficient of determination (r2) is the proportion of the total variation of dependent
variable Y that is explained by the variation in the independent variable X. It is the square of
the coefficient of correlation(r) and ranges from 0 to 1.
From the above example, r = 0.897.
$r^2 = (0.897)^2 \approx 0.80$. This is a proportion or a percent. We can say that about 80 percent of the
variation in average daily income is explained by the variation in age.
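The shortcut formula is easy to check by computer. The sketch below is a plain-Python illustration (the function name pearson_r is made up here) applied to the farmers' data of Example 8.1.

```python
import math


def pearson_r(x, y):
    """Pearson's r from the shortcut formula, using the column totals."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


age = [43, 48, 56, 61, 67, 70]            # X
income = [128, 120, 135, 143, 141, 152]   # Y

r = pearson_r(age, income)
print(round(r, 3))       # correlation coefficient
print(round(r ** 2, 2))  # coefficient of determination
```

Swapping the two lists gives the same r, since the formula treats X and Y symmetrically; only the regression line depends on which variable is taken as dependent.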
6.1.2 Spearman's Rank Order Correlation Coefficient (rank correlation)

Spearman's rank correlation is the technique of determining the degree of correlation between two
variables in the case of ordinal data, where ranks are given to the different values of the variables.
The main objective of this coefficient is to determine the extent to which the two sets of rankings are
similar or dissimilar. This coefficient is determined as follows.
Spearman's coefficient of correlation ($r_s$) is given by:
$r_s = 1 - \dfrac{6\sum D_i^2}{n(n^2 - 1)}$
where $r_s$ = coefficient of rank correlation,
$D_i$ = the difference between paired ranks,
n = the number of pairs.
Rank correlation is a non-parametric technique for measuring the relationship between paired
observations of two variables when the data are in ranked form.
Example:
Aster and Almaz were asked to rank 7 different types of lipsticks. See if there is correlation
between the tastes of the ladies.
Lipstick   A   B   C   D   E   F   G
Aster      2   1   4   3   5   7   6
Almaz      1   3   2   4   5   6   7
Solution:
RX             2    1    4    3    5    7    6    Total
RY             1    3    2    4    5    6    7
D = RX − RY    1   −2    2   −1    0    1   −1
D2             1    4    4    1    0    1    1    12
$r_s = 1 - \dfrac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \dfrac{6(12)}{7(48)} = 0.786$; there is a positive correlation.
Example 2
Eight nations report the following data on their infant mortality rate and general mortality
rate. Rank the data. Does there appear to be a relationship between the two mortality rates?
                   Canada   U.S.A   Sweden   U.K    France   Japan   China   Spain
Infant mortality   8.1      10.5    6.4      9.6    10.0     6.2     50      9.6
Mortality          7.0      9.0     11.0     11.0   10.7     6.0     8       8.1
Step 1: Rank the data from lowest to highest. The lowest score should be ranked 1 and the
highest score 8. Be sure to use the mean rank for two values that tie. For example, Sweden and the
U.K tie for the worst general mortality rate. Since they tie for the seventh and eighth positions,
both are assigned the rank 7.5, i.e. (7 + 8)/2. Rewrite the ranked data.
                   Canada   U.S.A   Sweden   U.K    France   Japan   China   Spain
Infant mortality   3        7       2        4.5    6        1       8       4.5
Mortality          2        5       7.5      7.5    6        1       3       4
Step 2: Rearranging the data in columns, calculate the Spearman rank correlation coefficient.
          Infant mortality (x)   Mortality (y)   D = x − y   D2 = (x − y)2
Canada    3                      2               1.0         1.00
U.S.A     7                      5               2.0         4.00
Sweden    2                      7.5             −5.5        30.25
U.K       4.5                    7.5             −3.0        9.00
France    6                      6               0.0         0.00
Japan     1                      1               0.0         0.00
China     8                      3               5.0         25.00
Spain     4.5                    4               0.5         0.25
Total                                                        69.50
$r_s = 1 - \dfrac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \dfrac{6(69.5)}{8(8^2 - 1)} = 1 - \dfrac{417}{504} = 0.173$
Interpret the results: A correlation of 0.173 suggests there is little correlation between the
rankings of these nations' infant mortality rates and general mortality rates. The small correlation
that does exist is positive, which suggests that as a nation's infant mortality ranking increases, so
does its general mortality ranking.
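Both rank correlation examples can be reproduced with the Python sketch below. The helper names (average_ranks, spearman_rs) are illustrative, and tied values are given the mean of the positions they occupy, as in Step 1 of Example 2.

```python
def average_ranks(values):
    """Rank from lowest (1) to highest; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the block while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def spearman_rs(rank_x, rank_y):
    """r_s = 1 - 6 * sum(D^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Lipstick example: the data are already ranks
aster = [2, 1, 4, 3, 5, 7, 6]
almaz = [1, 3, 2, 4, 5, 6, 7]
print(round(spearman_rs(aster, almaz), 3))

# Mortality example: rank the raw rates first
infant = [8.1, 10.5, 6.4, 9.6, 10.0, 6.2, 50, 9.6]
general = [7.0, 9.0, 11.0, 11.0, 10.7, 6.0, 8, 8.1]
print(round(spearman_rs(average_ranks(infant), average_ranks(general)), 3))
```

The countries are listed in the order Canada, U.S.A, Sweden, U.K, France, Japan, China, Spain, so the ranks produced by average_ranks can be compared directly with the table in Step 1.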

6.2. Simple Linear Regression


- Simple linear regression refers to the linear relationship between two variables.
- It is used to predict the value of a single continuous DV (which we will call Y) from a single
continuous IV (which we will call X). Regression assumes that the relationship between the IV and
the DV can be represented by a linear equation.
• The regression equation is Y = a + bX, where:
• Y and X are variables.
• a and b are constants.
• The constant a stands for the value of Y when X = 0, and represents the y-intercept.
• The constant b represents the slope of the line.
- The regression line is one of many possible lines, but it is the line of best fit.
$b = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
$a = \bar{Y} - b\bar{X}$
b is a constant indicating the slope of the regression line, and it gives a measure of the change in
Y for a unit change in X. It is also called the regression coefficient of Y on X.
Example 1: The following data show the scores of 12 students in Accounting and Statistics
examinations.
a) Calculate the simple correlation coefficient.
b) Fit a regression line of Statistics on Accounting using least squares estimates.
c) Predict the score in Statistics if the score in Accounting is 85.
Student     1    2    3    4    5    6    7    8    9    10   11   12
Acc. (X)    74   93   55   41   23   92   64   40   71   33   30   71
Stat. (Y)   81   86   67   35   30   100  55   52   76   24   48   87

First draw a scatter plot of the raw data. The scatter plot is used to determine the nature of the
relationship. After the scatter plot, the next step is to compute r (the correlation coefficient). If r is
significant, the next step is to determine the equation of the regression line. Determining the regression
line when r is not significant, and making predictions from it, is meaningless.

As you see from the scatter plot, it seems there is some linear relationship between the
variables.
Student   Acc. (X)   Stat. (Y)   X2      Y2      XY
1         74         81          5476    6561    5994
2         93         86          8649    7396    7998
3         55         67          3025    4489    3685
4         41         35          1681    1225    1435
5         23         30          529     900     690
6         92         100         8464    10000   9200
7         64         55          4096    3025    3520
8         40         52          1600    2704    2080
9         71         76          5041    5776    5396
10        33         24          1089    576     792
11        30         48          900     2304    1440
12        71         87          5041    7569    6177
Total     687        741         45591   52525   48407
Mean      57.25      61.75
$r = \dfrac{n(\sum XY) - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$
$r = \dfrac{12(48407) - (687)(741)}{\sqrt{[12(45591) - (687)^2][12(52525) - (741)^2]}} = \dfrac{71817}{\sqrt{(75123)(81219)}} \approx 0.92$
a).The Coefficient of Correlation (r) has a value of 0.92. This indicates that the two variables
are positively correlated (Y increases as X increases).
b) Using OLS (ordinary least squares):
$b = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \dfrac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$
$b = \dfrac{48407 - 12(57.25)(61.75)}{45591 - 12(57.25)^2} = 0.9560$
$a = \bar{Y} - b\bar{X} = 61.75 - 0.9560(57.25) = 7.0194$
$\hat{Y} = 7.0194 + 0.9560X$ is the estimated regression line.
This means that for each unit change in X, Y changes by 0.9560 units. The regression line can be
used to make predictions for the dependent variable.
c) To predict the value of the dependent variable ($\hat{Y}$) when X = 85, insert X = 85 in the
estimated regression line:
$\hat{Y} = 7.0194 + 0.9560X$
$\hat{Y} = 7.0194 + 0.9560(85) = 88.28$
For valid prediction, the value of the correlation coefficient must be significant. Also, for any
specific value of x, the variable y must be normally distributed about the regression line. The standard
deviation of the dependent variable must be the same for each value of the independent variable.
- Prediction is made based on the present conditions, or on the premise that the present trend will
continue.
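The least squares estimates can be verified with a few lines of Python. This is a minimal sketch; the function name fit_line is made up for illustration, and the data are the Accounting and Statistics scores of Example 1.

```python
def fit_line(x, y):
    """Least squares estimates (a, b) for the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


accounting = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]   # X
statistics = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]  # Y

a, b = fit_line(accounting, statistics)
print(round(a, 4), round(b, 4))

# Part c): predicted Statistics score for an Accounting score of 85
print(round(a + b * 85, 2))
```

Because the sketch keeps full precision for b, the last decimals may differ slightly from a hand calculation that rounds b to 0.9560 first.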
6.2.1. Coefficient of determination (r2).
It is the ratio of explained variation to total variation and is denoted by r². It measures the
proportion of the variation in the dependent variable that is explained by the regression line.
Variation due to the relationship is called explained variation. Variation due to chance is called
unexplained variation. The sum of the explained and unexplained variation is called total variation.
\[ r^2 = \frac{\text{explained variation}}{\text{total variation}} \]
r² is found by squaring the correlation coefficient (r) and is usually expressed as a percentage.
If r = 0.90, then r² = 0.81, which is equivalent to 81%. This means 81% of the variation in the
dependent variable is accounted for by the variation in the independent variable. The remaining 19%
is unexplained variation. This proportion is called the coefficient of non-determination (CND) and is
found by subtracting the coefficient of determination (r²) from 1.
E.g. if r = 0.6, r² = 0.36, which means only 36% of the variation in the dependent variable can be
attributed to variation in the independent variable. CND = 1.00 − r² = 0.64.
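As a small illustration, the two examples (r = 0.90 and r = 0.60) can be computed as below. The function name `determination` is hypothetical and used only for this sketch.

```python
def determination(r):
    """Return (r squared, coefficient of non-determination)."""
    r2 = r ** 2
    return r2, 1 - r2


for r in (0.90, 0.60):
    r2, cnd = determination(r)
    # Express both shares of the total variation as percentages.
    print(f"r = {r:.2f}: explained {r2:.0%}, unexplained {cnd:.0%}")
```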
6.2.2. Standard error of estimate (Sest).
It is the standard deviation of the observed y values about the predicted (y′) values. The
formula is given as follows.

\[ S_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}} \]

The closer the observed values (y) are to the predicted values (y′), the smaller the Sest.
E.g. a researcher collected the following data and determined that there is a significant
relationship between the age of a copy machine and its monthly maintenance cost. The regression
equation is y′ = 55.57 + 8.13x. Find the standard error of estimate.
Solution
Machine   Age (x)   Monthly cost (y)   y′       y − y′   (y − y′)²   xy      y²
A         1         62                 63.70    −1.70    2.8900      62      3,844
B         2         78                 71.83    6.17     38.0689     156     6,084
C         3         70                 79.96    −9.96    99.2016     210     4,900
D         4         90                 88.09    1.91     3.6481      360     8,100
E         4         93                 88.09    4.91     24.1081     372     8,649
F         6         103                104.35   −1.35    1.8225      618     10,609
Total                                                    169.7392    1,778   42,186

\[ S_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}} = \sqrt{\frac{169.7392}{6 - 2}} = \sqrt{42.4348} \approx 6.51 \]
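The copy-machine example can be recomputed with the sketch below. Two assumptions are made and should be read as such: Machine E is taken to be 4 years old (the value consistent with xy = 372 and the residual 4.91 in the table), and the sum of squared residuals is divided by n − 2, the usual denominator for the standard error of estimate in simple regression.

```python
import math

# Age (x) and monthly maintenance cost (y) for machines A to F.
# Assumption: machine E has age 4, which matches xy = 372 in the table.
ages = [1, 2, 3, 4, 4, 6]
costs = [62, 78, 70, 90, 93, 103]


def standard_error_of_estimate(xs, ys, intercept, slope):
    """Square root of the residual sum of squares divided by n - 2."""
    residual_ss = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)
    )
    return math.sqrt(residual_ss / (len(xs) - 2))


s_est = standard_error_of_estimate(ages, costs, intercept=55.57, slope=8.13)
print(round(s_est, 2))
```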

CHAPTER SEVEN

7. Multiple Regression

• A simple regression equation contains one dependent variable (DV) and one independent
variable (IV) and is written as y′ = a + bx.
• In multiple regression, there are several independent variables (IVs) and one dependent
variable (DV), and the equation is y′ = a + b1x1 + b2x2 + … + bKxK, where x1, x2, …, xK
are the IVs.
• It is used when several IVs contribute to the variation of the DV.
• This analysis is important because it increases the accuracy of predictions for the dependent
variable compared with using a single independent variable.
• The multiple correlation coefficient, R, can be computed to determine whether a significant
relationship exists between the IVs and the DV. Since the computation of multiple regression is
quite complicated, most of it is done on a computer.

Let us see an example with two IVs and one DV.

A researcher wants to see whether students' grade point average and age are related to their
score on an examination in one hypothetical department. Five students are selected, and the
data are as follows.

Student   GPA (X1)   Age (X2)   Score (Y)
A         3.2        22         550
B         2.7        27         570
C         2.5        24         525
D         3.4        28         670
E         2.2        23         490

The multiple regression equation obtained from the data is y′ = −44.572 + 87.679x1 + 14.519x2. If
a student has a GPA of 3.0 and is 25 years old, his or her predicted score can be computed by
substituting these values for x1 and x2 in the above equation.

\[ y' = -44.572 + 87.679x_1 + 14.519x_2 \]

\[ y' = -44.572 + 87.679(3.0) + 14.519(25) = 581.44 \approx 581 \]

Hence, if a student has a GPA of 3.0 and is 25 years old, his or her predicted score is 581.
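The substitution can be written as a small function. This sketch only evaluates the fitted equation; it does not re-estimate the coefficients. The intercept is taken as −44.572, since that sign is the one that reproduces the stated result of 581.44, and `predict_score` is an illustrative name.

```python
def predict_score(gpa, age, a=-44.572, b1=87.679, b2=14.519):
    """Evaluate y' = a + b1*x1 + b2*x2 for one student."""
    return a + b1 * gpa + b2 * age


predicted = predict_score(gpa=3.0, age=25)
print(round(predicted, 2))
```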

In the equation, each b represents the amount of change in y′ for a one-unit change in the
corresponding x value when the other x values are held constant. In the above example,
y′ = −44.572 + 87.679x1 + 14.519x2, for each one-unit change in a student's GPA there is a change of
87.679 units in the score, with the student's age held constant. The strength of the relationship
between the independent variables and the dependent variable is measured by the multiple
correlation coefficient, which is symbolized by R. It can range from 0 to +1; the closer to 1, the
stronger the relationship. The value of R takes into account all the independent variables and can
be computed from the individual correlation coefficients. The formula for R with two
independent variables is given below.

\[ R = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\,r_{yx_1}\,r_{yx_2}\,r_{x_1x_2}}{1 - r_{x_1x_2}^2}} \]

where r_{yx1} is the correlation coefficient for variables y and x1, r_{yx2} is the correlation
coefficient for variables y and x2, and r_{x1x2} is the correlation coefficient for variables x1 and x2.
The multiple correlation coefficient is never smaller than the individual correlation coefficients.
For the example just mentioned, find the value of R.

The values of the correlation coefficients are r_{yx1} = 0.845, r_{yx2} = 0.791 and r_{x1x2} = 0.371.


Then

\[ R = \sqrt{\frac{(0.845)^2 + (0.791)^2 - 2(0.845)(0.791)(0.371)}{1 - (0.371)^2}} = \sqrt{\frac{0.8437569}{0.862359}} = \sqrt{0.9784288} = 0.989 \]

Hence, the multiple correlation between students' grade point average and age and their score on the
examination is 0.989, so there is a strong relationship among the variables, as R ≈ 1.00. As in simple
regression, R² is the coefficient of multiple determination, and it is the proportion of variation
explained by the regression model. 1 − R² is the proportion of unexplained variation, called the error
or residual variation. Since R = 0.989 in the previous example, R² = 0.978.
1 − R² = 1 − 0.978 = 0.022.
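The two-predictor formula for R can be checked with the three correlation coefficients quoted in the example. This is a sketch; `multiple_r` is a name introduced here for illustration.

```python
import math


def multiple_r(r_yx1, r_yx2, r_x1x2):
    """Multiple correlation coefficient for two independent variables."""
    numerator = r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2
    denominator = 1 - r_x1x2 ** 2
    return math.sqrt(numerator / denominator)


R = multiple_r(0.845, 0.791, 0.371)
print(round(R, 3), round(R ** 2, 3), round(1 - R ** 2, 3))
```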
Adjusted R²
Since the value of R² depends on n (the number of data pairs) and k (the number of independent
variables), statisticians also calculate an adjusted R², denoted by R²adj.

\[ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \]

R²adj is smaller than R² and takes into account the fact that when n and k are approximately equal,
the value of R may be artificially high because of sampling error rather than a true
relationship among the variables. Even if the correlation coefficients between each independent
variable and the dependent variable were all zero, R could be higher than zero because of sampling error.
Hence, both R² and R²adj are usually reported in multiple regression analysis.
R²adj for the previous example is computed as follows.

\[ R^2_{adj} = 1 - \frac{(1 - 0.989^2)(5 - 1)}{5 - 2 - 1} = 1 - 0.043758 = 0.956 \]

In this case, when the number of data pairs and the number of independent variables are
accounted for, R²adj is 0.956.
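A minimal sketch of the adjustment, using R = 0.989, n = 5 and k = 2 from the example. The guard against n − k − 1 ≤ 0 is an addition for safety and is not part of the original formula.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R squared for n data pairs and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 to compute adjusted R squared")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


adj = adjusted_r2(0.989 ** 2, n=5, k=2)
print(round(adj, 3))
```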

CHAPTER EIGHT

8. Hypothesis Testing

8.1. Introduction

WHAT IS A HYPOTHESIS?
Ordinarily, when one talks about a hypothesis, one simply means a mere assumption or
some supposition to be proved or disproved.
But for a researcher, a hypothesis is a formal question that he or she intends to resolve.
Thus a hypothesis may be defined as a proposition or a set of propositions set forth as
an explanation for the occurrence of some specified group of phenomena, either asserted
merely as a provisional conjecture to guide some investigation or accepted as highly
probable in the light of established facts.

Quite often a research hypothesis is a predictive statement, capable of being tested by
scientific methods, that relates an independent variable to some dependent variable. For
example, consider statements like the following ones:
“Students who receive counseling will show a greater increase in creativity than students not
receiving counseling” or “Automobile A is performing as well as automobile B.”
These are hypotheses capable of being objectively verified and tested.
Thus, we may conclude that a hypothesis states what we are looking for and that it is a
proposition which can be put to a test to determine its validity.
Characteristics of hypothesis:

Hypothesis must possess the following characteristics:

• A hypothesis should be clear and precise. If the hypothesis is not clear and precise, the
inferences drawn on its basis cannot be taken as reliable.
• A hypothesis should be capable of being tested.
• A hypothesis should be limited in scope and must be specific. A researcher must
remember that narrower hypotheses are generally more testable and should develop
such hypotheses.
• A hypothesis should be stated, as far as possible, in the simplest terms so that it is
easily understandable by all concerned. But one must remember that the simplicity of a
hypothesis has nothing to do with its significance.
• A hypothesis should be consistent with most known facts, i.e., it must be consistent with
a substantial body of established facts. In other words, it should be one which judges
accept as being the most likely.
• A hypothesis should be amenable to testing within a reasonable time. One should not
use even an excellent hypothesis if it cannot be tested in a reasonable time, for one
cannot spend a lifetime collecting data to test it.
• A hypothesis must explain the facts that gave rise to the need for explanation.
It must actually explain what it claims to explain; it should have empirical
reference.
8.2. Basic concepts in the context of testing of hypotheses

Statistical hypothesis: an assertion or statement about the population whose plausibility is to
be evaluated on the basis of the sample data.
Test statistic: a statistic whose value serves to determine whether to reject or accept the
hypothesis to be tested.
Statistical test: a test or procedure used to evaluate a statistical hypothesis; its outcome
depends on the sample data.
There are two types of hypothesis:
Null hypothesis:
It is the hypothesis to be tested.
It is the hypothesis of equality or the hypothesis of no difference.
Usually denoted by H0.

Alternative hypothesis:
It is the hypothesis available when the null hypothesis has to be rejected.
It is the hypothesis of difference.
Usually denoted by H1 or Ha.

Types and size of errors:

- Hypothesis testing is based on sample data, which may involve sampling and non-sampling
errors.

- The following table gives a summary of the possible results of any hypothesis test:

Researcher decides to:   Hypothesis is really incorrect        Hypothesis is really correct
Accept hypothesis        Type II error                         Correct decision: the researcher
                                                               accepts a hypothesis that is true
Reject hypothesis        Correct decision: the researcher      Type I error (the worst)
                         rejects a hypothesis that is wrong

- Type I error (α): rejecting the null hypothesis when it is true. The probability of committing
it, α, is called the level of significance.
- Type II error (β): failing to reject the null hypothesis when it is false.
Power of a test:
The most powerful test is a test that fixes the level of significance and minimizes the type II
error (β). The power of a test is defined as the probability of rejecting the null hypothesis when it is
actually false. It is given as: power of test = 1 − β
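To make β and power concrete, the sketch below works through a hypothetical upper-tailed Z test. All numbers (µ0 = 100, true mean 105, σ = 15, n = 36, α = 0.05) are invented for illustration and do not come from the course material; the function name is also hypothetical.

```python
from statistics import NormalDist


def upper_tail_z_test_power(mu0, mu_true, sigma, n, alpha):
    """Return (beta, power) for H0: mu = mu0 against H1: mu > mu0."""
    se = sigma / n ** 0.5
    # Sample means above this cut-off lead to rejection of H0.
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # beta: probability the sample mean stays below the cut-off
    # even though the true mean is mu_true.
    beta = NormalDist(mu_true, se).cdf(critical_mean)
    return beta, 1 - beta


beta, power = upper_tail_z_test_power(
    mu0=100, mu_true=105, sigma=15, n=36, alpha=0.05
)
print(round(beta, 2), round(power, 2))
```

Raising n or α in this sketch lowers β, which illustrates why the two error types cannot both be minimized at a fixed sample size.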

NOTE:
1. These errors are present in any two-choice decision-making problem.
2. There is always a possibility of committing one or the other error.
3. Type I error (α) and type II error (β) have an inverse relationship and therefore, for a fixed
sample size, cannot be minimized at the same time.
• In practice we set α at some value and design a test that minimizes β. This is because a type I
error is often considered to be more serious, and therefore more important to avoid, than a type
II error.
The level of significance:
It is always some percentage (mostly 5%) which should be chosen with great care, thought and
reason. In case we take the significance level at 5 per cent, then this implies that H0 will be
rejected when the sampling result (i.e., observed evidence) has a less than 0.05 probability of
occurring if H0 is true. In other words, the 5 per cent level of significance means that researcher
is willing to take as much as a 5 per cent risk of rejecting the null hypothesis when it (H0)
happens to be true. Thus the significance level is the maximum value of the probability of
rejecting H0 when it is true and is usually determined in advance before testing the hypothesis.
In short, the level of significance is the maximum probability with which we would be willing to
commit a type I error. It is denoted by the Greek letter alpha (α).
General steps in hypothesis testing:
• Specify the null hypothesis (H0) and the alternative hypothesis (H1).
• Select a significance level, α.
• Analyze the data and compute the test statistic.
• Identify the degrees of freedom.
• Determine the critical (table) value.
• Decision rule: if the computed value is greater than the table value, H0 is rejected.
• Conclusion: summarize the result.
Tests about the population mean

CASES:

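Case 1: When sampling is from a normal distribution with σ² known

The relevant test statistic is the Z-test:

\[ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \]

or, if σ is unknown and the sample is large, with σ replaced by the sample standard deviation S:

\[ Z = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \]

Case 2: When sampling is from a normal distribution with σ² unknown and small sample size

- The relevant test statistic is

\[ t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \sim t \text{ with } n - 1 \text{ degrees of freedom.} \]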
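The test statistic for both cases can be computed with one small helper. The sample below is hypothetical, invented for illustration, and the critical value 2.365 is the standard two-tailed table value for α = 0.05 with 7 degrees of freedom. The helper name `mean_test_statistic` is an assumption of this sketch.

```python
import math
from statistics import mean, stdev


def mean_test_statistic(sample, mu0, sigma=None):
    """(sample mean - mu0) / (spread / sqrt(n)).

    Uses sigma when it is known (Z), otherwise the sample
    standard deviation S (t with n - 1 degrees of freedom).
    """
    n = len(sample)
    spread = sigma if sigma is not None else stdev(sample)
    return (mean(sample) - mu0) / (spread / math.sqrt(n))


# Hypothetical small sample; H0: mu = 50, H1: mu != 50.
sample = [52, 48, 55, 50, 53, 49, 51, 54]
t = mean_test_statistic(sample, mu0=50)
df = len(sample) - 1
critical_t = 2.365  # table value: two-tailed, alpha = 0.05, df = 7
reject_h0 = abs(t) > critical_t
print(round(t, 3), df, reject_h0)
```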

CHAPTER NINE

9. Tabular and graphic display of Quantitative Data

After the data have been organized into a frequency distribution, they can be presented in
graphical form. The purpose of graphs in statistics is to convey the data to the viewers in
pictorial form. It is easier for most people to comprehend the meaning of data presented
graphically than data presented numerically in tables or frequency distributions. This is
especially true if the users have little or no statistical knowledge.

Statistical graphs can be used to describe the data set or to analyze it. Graphs are also useful in
getting the audience's attention in a publication or a spoken presentation. They can be used to
discuss an issue, reinforce a critical point, or summarize a data set. They can also be used to
discover a trend or pattern in a situation over a period of time.

The three most commonly used graphs in research are (1) The histogram, (2) The frequency
polygon, and (3) The cumulative frequency graph, or ogive.

- In short, these techniques are important because:

• They attract more attention.

• They facilitate comparison.

• They are easily understandable.

9.1 Histogram

It is a graph which displays the data by using vertical bars of various heights to represent
frequencies. It is a graph in which class boundaries or class interval is marked on the horizontal
axis and the corresponding class frequency on the vertical axis.
Note that the bars of a histogram must be joined together; this differentiates it from a bar
chart.

Example
Considering the frequency distribution table below, draw a histogram for the table

Class interval Class boundaries Frequency


1-5 0.5-5.5 6
6-10 5.5-10.5 8
11-15 10.5-15.5 5
16-20 15.5-20.5 4
21-25 20.5-25.5 2
Step 1 Draw and label the x and y axes. The x axis is always the horizontal axis, and the y axis is
always the vertical axis.

Step 2 Represent the frequency on the y axis and the class boundaries on the x axis.

Step 3 Using the frequencies as heights, draw a vertical bar for each class. Look at
the figure below.
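The steps above can be sketched in code. This example derives the class boundaries from the class limits and prints a rough text histogram (rotated, one row per class); it is an illustration only, and `class_boundaries` is a hypothetical helper name.

```python
# (lower limit, upper limit, frequency) for each class in the table.
classes = [(1, 5, 6), (6, 10, 8), (11, 15, 5), (16, 20, 4), (21, 25, 2)]


def class_boundaries(lower, upper, gap=1):
    """Extend the class limits by half the gap between adjacent classes."""
    return lower - gap / 2, upper + gap / 2


for lower, upper, freq in classes:
    low_b, high_b = class_boundaries(lower, upper)
    # One '#' per observation; bar length represents the frequency.
    print(f"{low_b:>4}-{high_b:<4} | {'#' * freq} ({freq})")
```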

9.2 Bar Graph

A set of bars (thick lines or narrow rectangles) representing some magnitude over time or space.
- They are useful for comparing aggregates over time or space.
- Bars can be drawn either vertically or horizontally.
- There are different types of bar charts. The most common being :
• Simple bar chart
• Deviation or two way bar chart
• Broken bar chart
• Component or sub divided bar chart.
• Multiple bar charts.

Simple Bar Chart

- It is used to display data on one variable.
- The bars are thick lines (narrow rectangles) having the same breadth. The magnitude of a quantity is
represented by the height or length of the bar.
Example: The following data represent the sales by product, 1957-1959, of a given company for three
products A, B and C.
Product   Sales ($) in 1957   Sales ($) in 1958   Sales ($) in 1959
A         12                  14                  18
B         24                  21                  18
C         24                  35                  54

Component Bar chart


- When there is a desire to show how a total (or aggregate) is divided into its component parts,
we use a component bar chart.
- The bars represent the total value of a variable, with each total broken into its component parts;
different colours or designs are used for identification.
Example:
Draw a component bar chart to represent the sales by product from 1957 to 1959 from the above
table.

Multiple Bar charts


- These are used to display data on more than one variable.
- They are used for comparing different variables at the same time.
Example:
Draw a multiple bar chart to represent the sales by product from 1957 to 1959.
9.3 Frequency Polygon:

This is a line graph that displays the data by using lines that connect points plotted for the
frequencies at the midpoints/class marks of the classes. The frequencies are represented by the
heights of the points. It can also be obtained by connecting midpoints of the tops of the
rectangles in the histogram.
Example: Using the frequency distribution of the temperature of towns listed below,
construct a frequency polygon.
Class limits   Class mark   Class boundaries   Frequency   Cumulative frequency
16-19          17.5         15.5-19.5          4           4
20-23          21.5         19.5-23.5          11          15
24-27          25.5         23.5-27.5          7           22
28-31          29.5         27.5-31.5          21          43
32-35          33.5         31.5-35.5          5           48
36-39          37.5         35.5-39.5          1           49
40-43          41.5         39.5-43.5          1           50
Step 1 Find the midpoints of each class.
Step 2 Draw the x and y axes. Label the x axis with the midpoint of each class, and then use a
suitable scale on the y axis for the frequencies.
Step 3 Using the midpoints for the x values and the frequencies as the y values, plot the points.
Step 4 Connect adjacent points with line segments.

Example: Draw a frequency polygon for the above data
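The points of the frequency polygon can be computed as follows. This is a sketch that finds each class mark from the class limits and pairs it with its frequency; the resulting (x, y) pairs are the points to plot and connect.

```python
# Class limits and frequencies from the temperature table.
limits = [(16, 19), (20, 23), (24, 27), (28, 31),
          (32, 35), (36, 39), (40, 43)]
frequencies = [4, 11, 7, 21, 5, 1, 1]

# Step 1: the midpoint (class mark) of each class.
midpoints = [(low + high) / 2 for low, high in limits]

# Step 3: points to plot, x = class mark, y = frequency.
points = list(zip(midpoints, frequencies))
for x_value, y_value in points:
    print(x_value, y_value)
```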

9.4. Ogive (cumulative frequency polygon)
This is another type of line graph, used to represent the cumulative frequencies for the
classes. It is called a cumulative frequency graph or ogive. The cumulative
frequency is the sum of the frequencies accumulated up to the upper boundary of a class in the
distribution.

Example: Using the frequency distribution of the temperature of towns given above,
construct an ogive for the frequency distribution.

Step 1 Find the cumulative frequencies for each class.

Class boundaries   Class mark   Frequency   Cumulative frequency
15.5-19.5          17.5         4           4
19.5-23.5          21.5         11          15
23.5-27.5          25.5         7           22
27.5-31.5          29.5         21          43
31.5-35.5          33.5         5           48
35.5-39.5          37.5         1           49
39.5-43.5          41.5         1           50
Step 2 Draw the x and y axes. Label the x axis with the class boundaries. Use an appropriate
scale for the y axis to represent the cumulative frequencies. (Depending on the numbers in the
cumulative frequency columns, scales such as 0, 1, 2, 3, …, or 5, 10, 15, 20, …., or 1000, 2000,
3000, … can be used. Do not label the y axis with the numbers in the cumulative frequency
column.) In this example, a scale of 0, 5, 10, 15, … will be used.

Step 3 Plot the cumulative frequency at each class mark (if class marks are used) or upper
boundary (if class boundaries are used). Upper boundaries are used since the cumulative
frequencies represent the number of data values accumulated up to the upper boundary of each
class.

Step 4 Starting with the first class mark, 17.5, (or first upper class boundary, 19.5) connect
adjacent points with line segments, as shown in the figure below.
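The ogive points can be generated with a running total. In this sketch the cumulative frequencies are plotted at the upper class boundaries, and the graph is started at the first lower boundary (15.5) with a cumulative frequency of 0; that starting point is a common convention and is an assumption here, not something stated in the steps above.

```python
from itertools import accumulate

# Upper class boundaries and frequencies from the temperature table.
upper_boundaries = [19.5, 23.5, 27.5, 31.5, 35.5, 39.5, 43.5]
frequencies = [4, 11, 7, 21, 5, 1, 1]

# Step 1: running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Steps 3 and 4: points to plot and connect.
ogive_points = [(15.5, 0)] + list(zip(upper_boundaries, cumulative))
for boundary, total in ogive_points:
    print(boundary, total)
```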

9.5 Pie chart

- A pie chart is a circular graph in which each frequency is converted to degrees and shown as a
sector of the circle. The variable is nominal or categorical.

Example The number of passengers that board a bus from one hypothetical area to other on a
daily basis for a week is given below.

Days Passengers
Monday 50
Tuesday 80
Wednesday 60
Thursday 60
Friday 150
Saturday 150
Sunday 50
Total 600
Calculate the degrees for each day as (passengers ÷ total passengers) × 360°. Total passengers = 600.
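The conversion from frequencies to degrees can be sketched as follows, using the passenger counts from the table. Each day's share of the 600 passengers is multiplied by 360°.

```python
# Passengers per day from the table above.
passengers = {
    "Monday": 50, "Tuesday": 80, "Wednesday": 60, "Thursday": 60,
    "Friday": 150, "Saturday": 150, "Sunday": 50,
}

total = sum(passengers.values())

# Angle of each sector: share of the total times 360 degrees.
degrees = {day: count / total * 360 for day, count in passengers.items()}

for day, angle in degrees.items():
    print(f"{day}: {angle:.0f} degrees")
```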

The above figure is the pie chart for the passengers that board a bus in a particular week.

9.6 Time Series Graph


When data are collected over a period of time, they can be represented by a time series graph. A
time series graph represents data that occur over a specific period of time.

Example: The following table shows the population size of place A between 2000
and 2009. Draw a time series graph for the data and summarize the findings.

Year   Population size
2000   150,000
2001   160,000
2002   175,000
2003   200,000
2004   230,000
2005   270,000
2006   280,000
2007   290,000
2008   295,000
2009   300,000

Step 1: Draw and label the x and y axes.

Step 2: Label the x axis for years and the y axis for population size.

Step 3: Plot each point according to the table.

Step 4: Draw line segments connecting adjacent points.

The graph shows an increase in the population between 2000 and 2009. The most radical
increment was between 2003 and 2005, where the population size grew from 200,000 to 270,000.
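The summary can be checked by computing the year-to-year changes. This sketch assumes the 2009 figure is 300,000 (the table entry appears truncated). It shows that the two largest yearly increases fall in 2003-2004 and 2004-2005.

```python
# Population of place A by year (2009 value assumed to be 300,000).
population = {
    2000: 150_000, 2001: 160_000, 2002: 175_000, 2003: 200_000,
    2004: 230_000, 2005: 270_000, 2006: 280_000, 2007: 290_000,
    2008: 295_000, 2009: 300_000,
}

years = sorted(population)

# Change between each pair of consecutive years.
changes = {
    (start, end): population[end] - population[start]
    for start, end in zip(years, years[1:])
}

largest = max(changes, key=changes.get)
print(largest, changes[largest])
```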
