Introduction To Statistics Material 2023
CHAPTER ONE
1. Introduction
1.1. Definition and Classification of Statistics
Statistics has become an integral part of our daily lives. Every day we are confronted with some
form of statistical information through newspapers, magazines and other forms of
communications. Such statistical information has become highly influential in our lives. The
term ‘statistics’ is derived from the Latin word status, meaning state, and historically statistics
referred to the display of facts and figures relating to the demography of states or countries.
Statistics can be defined in two senses: plural (as Statistical Data) and singular (as Statistical
Methods).
i. Plural sense: Statistics are collections of facts (figures). This meaning of the word is widely used
when reference is made to facts and figures on sales, employment or unemployment, accidents,
weather, deaths, education, etc. E.g.: Sales Statistics, Labor Statistics, Employment Statistics, etc.
In this sense the word Statistics serves simply as data. But not all data are statistics. For
numerical data to be identified as statistics, they must possess certain identifiable
characteristics. Some of these characteristics are described as follows:
a. Statistics are aggregates of facts. Single or isolated facts or figures cannot be called statistics
as these cannot be compared or related to other figures within the same framework.
Accordingly, there must be an aggregate of these figures. For example, if a person says that
“I earn Birr 30,000 per year”, it would not be considered as statistics. On the other hand if we
say that the average salary of a professor at our university is Birr 30,000 per year, then this
would be considered as statistics since the average has been computed from many related
figures such as yearly salaries of many professors.
b. Statistics, generally, are not the outcome of a single cause but are affected by multiple
causes. There are a number of forces working together that affect the facts and figures. For
example, when we say the crime rate in a certain city has increased by 15% over the last
year, a number of factors might have contributed to this change. These factors may include the
general state of the economy (such as an economic recession), the unemployment rate, the extent
of drug use, the effectiveness of the legal system and so on. While these factors can be identified
individually, their effects cannot be isolated and measured individually. Similarly, a marked increase in
food grain production in a certain country may have been due to the combined effect of many
factors such as better seeds, more extensive use of fertilizers, governmental and banking
support, adequate rainfall and so on. It is generally not possible to segregate and study the
effect of each of these forces individually.
c. Statistics are numerically expressed. All statistics are stated in numerical figures only.
Qualitative statements cannot be called statistics. For example, such qualitative statements as
‘Ethiopia is a developing country’ or ‘Jack is very tall’ would not be considered statistical
statements. On the other hand, comparing the per capita income of Ethiopia with that of Kenya
would be considered statistical in nature. Similarly, Jack’s height in numbers compared to the
average height in Ethiopia would also be considered statistics.
d. Statistical data are collected in a systematic manner for a predetermined purpose. The
purpose and objective of collecting pertinent data must be clearly defined, decided upon and
determined prior to data collection. Also the procedures for collecting data should be
predetermined and well planned. These would facilitate the collection of proper and relevant
data.
e. Statistics are enumerated or estimated according to a reasonable standard of accuracy.
There are basically two ways of collecting data. One is the actual counting or measuring,
which is the most accurate way. The second way of collecting data is by estimation and is
used in situations where actual counting or measuring is not feasible or where it involves
prohibitive costs. Estimates based on samples cannot be as precise and accurate as actual
counts or measurements, but they should be consistent with the degree of accuracy desired.
ii. Singular sense: Statistics is the science that deals with the methods of collecting,
organizing, presenting, analyzing and interpreting data. It refers to the subject area that is
concerned with extracting relevant information from available data with the aim of making
sound decisions. According to this meaning, statistics is concerned with the development and
application of methods and techniques for collecting, organizing, presenting, analyzing and
interpreting statistical data.
1.1.1. Classification of Statistics
Based on the scope of the decision, statistics can be classified into two: Descriptive and
Inferential Statistics.
Descriptive Statistics refers to the procedures used to organize and summarize masses of data.
It is concerned with describing or summarizing the most important features of the data. It deals
only with the characteristics of the collected data, without attempting to infer (conclude)
anything that goes beyond the data themselves.
The methodology of descriptive statistics includes the methods of organizing (classification,
tabulation, Frequency Distributions) and presenting (Graphical and Diagrammatic Presentation)
data and calculations of certain indicators of data like Measures of Central Tendency and
Measures of Dispersion (Variation).
Inferential Statistics includes the methods used to find out something about a population, based
on a sample. It is concerned with drawing statistically valid conclusions about the
characteristics of the population based on information obtained from a sample. In this form of
statistical analysis, inferential statistics is linked with probability theory in order to generalize the
results of the sample to the population. Performing hypothesis tests, determining relationships
between variables and making predictions also fall under inferential statistics.
Example: Classify the following statements as Descriptive or Inferential Statistics
a. The average age of the students in this class is 21 years.
b. There is a strong association between smoking and lung cancer.
c. Of the students enrolled in Haramaya University in this year 74% are male and 26% are
female.
d. The price of wheat will increase by 5% in the coming year.
e. The chance of winning the Ethiopian National Lottery in any day is 1 out of 167000.
1.2. Stages in Statistical Investigation
According to the singular sense definition of statistics, a statistical study (statistical
investigation) involves five stages: Collection of Data, Organization of Data, Presentation of
Data, Analysis of Data and Interpretation of Data.
1. Collection of Data: This is the first stage in any statistical investigation and involves the
process of obtaining (gathering) a set of related measurements or counts to meet
predetermined objectives. The data collected may be primary data (data collected directly by
the investigator) or secondary data (data obtained from intermediate sources such as
newspapers, journals, official records, etc.).
2. Organization of Data: It is usually not possible to derive any conclusion about the main
features of the data from direct inspection of the observations. The second purpose of
statistics is describing the properties of the data in a summary form. This stage of statistical
investigation helps to have a clear understanding of the information gathered and includes
editing (correcting), classifying and tabulating the collected data in a systematic manner. Thus
the first step in the organization of data is editing, which means correcting (adjusting) omissions,
inconsistencies, irrelevant answers and wrong computations in the collected data. The second
step is classification, that is, arranging the collected data according
to some common characteristics. The last step is tabulation, that is, presenting the
classified data in tabular form, using rows and columns.
3. Presentation of Data: The purpose of data presentation is to have an overview of what the
data actually look like, and to facilitate statistical analysis. Data presentation can be done
using graphs and diagrams, which are easy to remember and facilitate comparison.
4. Analysis of Data: The analysis of data is the extraction of summarized and comprehensive
numerical description in order to reach conclusions or provide answers to a problem. The
problem may require simple or sophisticated mathematical expressions.
5. Interpretation of Data: This is the last stage of statistical investigation. Interpretation
involves drawing conclusions from the data collected and analyzed in order to make decisions.
1.3. Application, Uses and Limitations of Statistics
1.3.1. Applications of Statistics
In this modern time, statistical information plays a very important role in a wide range of fields.
Today, statistics is applied in almost all fields of human endeavor.
In Scientific Research: Statistics is used as a tool in scientific research. Statistical
formulas and concepts are applied to data that result from an experiment.
In Quality Control: Statistical methods help to check whether a product satisfies a given
standard.
For Decision Making: statistics helps to enhance the power of decision making in the face
of uncertainty by providing sufficient information.
In Agriculture: Experiments are designed and analyzed using statistical procedures.
In Public Health and Medicine: statistical methods are used for computation and
interpretation of birth and death rates.
In Economics: for modeling functional relationships between or among variables
In Education and Agricultural Extension: to study the effects of certain trainings.
In Natural and Social Sciences, Business, Planning, Behavior Sciences, etc.
For example: Height, Family size, Gender, consumption, automobile color, etc.
Based on the values that variables assume, variables can be classified as qualitative or quantitative.
1. Qualitative variables: A qualitative variable has values that are intrinsically non-numerical
(categorical).
For example: Gender, marital status, religion, phone number, ID number, etc.
2. Quantitative variables: A quantitative variable has values that are intrinsically numerical.
For example: Height, Family size, Time, SAT score, etc.
A quantitative variable can take whole-number or decimal values. Depending on
which kind of values it takes, a quantitative variable is classified as
either discrete or continuous.
A discrete variable takes whole number values and consists of distinct recognizable
individual elements that can be counted. It is a variable that assumes a finite or countable
number of possible values. These values are obtained by counting (0, 1, 2, ...).
For example: Family size, number of children in a family, number of cars at the traffic
light, number of goals per match, etc.
A continuous variable can take any value within a range, including decimals. Such a variable can
theoretically assume an infinite number of possible values. These values are obtained by
measuring.
1. Nominal variables: are qualitative variables that show the category of individuals. They
reflect classification into categories (names of groups) where there is no particular order or
qualitative difference between the labels. Numbers may be assigned to the categories simply for coding
purposes. It is not possible to compare individuals based on the numbers assigned to them. The
only mathematical operation permissible on these variables is counting.
These variables
Have mutually exclusive (non-overlapping) and exhaustive categories.
Have no ranking or order between (among) the values of the variable.
Examples: Gender, Religion, ID No, Ethnicity, Color
2. Ordinal variables: are qualitative variables whose values can be ordered and ranked.
Ranking and counting are the only mathematical operations that can be performed on the values of these
variables, because there is no precise difference between the values (categories) of the variable.
Examples: Academic qualifications (B.Sc., M.Sc., Ph.D.), Grade Scores (A, B, C, D, F), Strength
(very weak, weak, strong, very strong), Wealth Index (very poor, poor, rich, very rich)
3. Interval variables: are quantitative variables for which a value of zero does not indicate the
absence of the characteristic, i.e. there is no true zero. Zero is simply a low value on the scale, not
"nothing". There is a precise difference between the units of measurement (levels).
Example: temperature; 0°C does not mean that there is no temperature, only that it is very cold.
4. Ratio variables: are quantitative variables for which a value of zero indicates the
absence of the characteristic, i.e. there is a true zero.
Examples: Income, Amount of yield, Expenditure, Consumption.
All mathematical operations can be performed on the values of these variables.
Exercise
1. What are the differences and similarities between a sample and a population?
2. What are the merits of using a sample rather than the whole population?
3. Discuss the differences between the four levels of measurement.
4. What are the applications of Statistics in your field of study?
CHAPTER TWO
interviewed, since we can clarify the questions and avoid
incomplete and disordered responses.
Disadvantage:
It is costlier than other methods, since it requires training of interviewers and transportation
costs.
The respondent may not give truthful information on sensitive questions, since the
interaction is face to face. E.g.: when asked about salary, a respondent whose salary is very small
might report a wrong figure out of embarrassment.
Telephone Interview: This method involves contacting the respondent on telephone and
collecting information. It is a faster way to collect information. The absence of telephone lines
makes this approach less usable, and it is difficult to use for rural surveys.
Advantage: It is less costly, since it requires fewer interviewers and the cost of
calling is less than the cost of transportation. The respondent may give his/her opinion candidly
since there is no face-to-face interaction. Because of this, the data obtained through this
method may be more realistic than those from face-to-face interviews.
Disadvantage: This method is difficult to apply in developing countries because of limited
access to telephones. The respondent might not be at home or may not respond to the
call, and in the meantime the interviewer might get bored. There is a high chance of getting
incomplete responses, since the connection can be interrupted.
Mailed Questionnaire: The researcher sends the questionnaire to the respondents; the
respondents complete the form and send it back to the researcher. Costs are low. The
responses are free from interviewer bias and respondents have more time to
give well-thought-out answers. However, the method is applicable only to educated respondents, and it
suffers from non-response, partial response and low return rates.
Disadvantage: The respondents might give inappropriate answers, since there is
no one with them to clarify the questions; they may misunderstand a question and answer it
incorrectly.
2.3. Questionnaire
It is a form containing the cover letter that explains about the person conducting the survey and
the objectives of the survey, and a set of related questions which will be answered by the
respondents. Great care is required in preparing a questionnaire for data collection. One of the
most important points in preparing it is that all questions in it must have relevance to the
objectives of the survey.
Having decided which type of questionnaire to use, the following points should be kept in mind
while designing a questionnaire.
The person conducting the survey should introduce himself/herself, state the objective of the
survey, promise anonymity and include any instructions necessary for giving correct
responses (on the cover letter).
The number of questions should be as few as possible.
Once the objectives of the survey are clearly defined, only questions pertinent to the
objectives should be included. The time of the respondent should not be wasted by asking irrelevant
questions. In general, 5 to 25 questions may be regarded as a fair number. If a lengthy questionnaire is
unavoidable, it should preferably be divided into two or more parts.
Questions should be logically arranged. Put the questions in the appropriate sequence of
topics. Topics should not be mixed up.
The questions should be in a logical order so that a natural and spontaneous reply is
induced. They should not skip back and forth.
It is undesirable to ask a person how many children s/he has before asking whether s/he is
married or not. Questions related to identification and description of the respondent should
come first, followed by major information questions. If opinions are requested, such
questions should usually be placed at the end of the list.
Questions should be simple, short and easy to understand and they should convey one and
only one idea. Technical terms should be avoided.
Sensitive questions (questions of personal and financial nature) should be avoided. If such
information is needed, it should be obtained indirectly, using a set of ranges, and the questions
should be placed at the end of the questionnaire. E.g.: Age (0-25, 26-50, 51-75, >75)
Salary (Below 200, 200-500, 500-1000, >1000)
Leading questions should be completely avoided. If you ask a person “Don't you
smoke?” the person will automatically say ‘Yes, I do not’.
Answers to the questions should not require any calculation.
There should be instructions on how to fill in the form.
Questions should be capable of objective answers.
Types of questions
Different types of questions that may form a questionnaire can be grouped into two categories.
1. Closed-ended (Dichotomous questions and Multiple-choice questions)
2. Open-ended questions
Dichotomous questions are questions that have two alternative responses. Such
questions can be answered with ‘Yes’ or ‘No’.
Example: Do you intend to purchase a TV? A. Yes B. No
Do you drink coffee? A. Yes B. No
Multiple-choice questions: in such types of questions the respondent is asked to select one out of
a number of alternative responses. This process not only facilitates tabulation of data but also
takes very little time of the respondent to fill the questionnaire.
Example: Why did you purchase a Sony TV?
Lower price
Best quality
Better picture
Longer guarantee
Any other
The problem with multiple-choice questions is that the respondent may like to tick more than one
alternative. To avoid this problem we either have to ask the respondent to choose the
most important one or to rank his/her choices. The use of multiple-choice questions
is indicated only when the investigator is confident that there is a limited group of
important alternatives.
Open-ended or free answer questions: In such types of questions, the
respondent has the chance to answer the questions in his/her own words.
Example: What is your opinion on the teaching policy?
The difficulty with these types of questions is in classifying the answers during tabulation and
analysis.
2.4. Methods of Data Organization
In order to describe situations, draw conclusions or make inferences about the population, or even to
describe the sample, the collected data must be organized in some meaningful way. The most
convenient way of organizing data is to construct a frequency distribution.
Frequency distribution is the organization of raw data in table form, using classes and
frequencies.
Definition of some terms
Class: is a description of a group of similar numbers in a data set.
Class Width (Class Size): the difference between the upper class boundary (UCB) and the lower
class boundary (LCB) of a class. It is also the difference between the lower limits of two
consecutive classes, or the difference between the upper limits of two consecutive classes.
The relative frequencies are particularly helpful when comparing two or more frequency
distributions in which the numbers of cases under investigation are not equal. The percentage
distributions make such a comparison more meaningful, since percentages are relative
frequencies and hence the total number in the sample or population under consideration becomes
irrelevant.
Percentage frequency = Relative frequency × 100
Class     Class          Class   Frequency   Relative      Percentage
Limits    Boundaries     Mark                frequency     frequency
1-25      0.5-25.5       13      20          20/70         28.57
26-50     25.5-50.5      38      15          15/70         21.43
51-75     50.5-75.5      63      25          25/70         35.71
76-100    75.5-100.5     88      10          10/70         14.29
Total                            70          70/70=1       100
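The relative and percentage frequencies in the table above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative and are not part of the original material.

```python
from fractions import Fraction

# Class limits and frequencies taken from the table above.
classes = ["1-25", "26-50", "51-75", "76-100"]
frequencies = [20, 15, 25, 10]

total = sum(frequencies)  # total number of observations

# Relative frequency = class frequency / total frequency.
relative = [Fraction(f, total) for f in frequencies]

# Percentage frequency = relative frequency x 100.
percentage = [float(r) * 100 for r in relative]

for label, f, r, p in zip(classes, frequencies, relative, percentage):
    print(f"{label:>7}  f={f:>2}  relative={float(r):.4f}  percentage={p:.2f}")
```

Exact fractions are used for the relative frequencies so that they add up to exactly 1; the percentages are rounded only when printed.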
The above frequency distribution tells us the actual number (percentage) of units in each class, but it
does not tell us directly the total number (percentage) of units that lie below or above the
specified values of the classes.
Cumulative frequency: is the sum of frequencies (total number of observations) below or above
a certain value. A cumulative frequency distribution displays the total number of observations
above (below) a certain value.
Less than Cumulative Frequency: is the total number of values of a variable below a certain
Upper Class Boundary. When the interest of the investigator focuses on the number of items
below a specified value, then this specified value is taken as the upper boundary of a class. The result is known
as a less than cumulative frequency distribution.
More than Cumulative Frequency: is the total number of values of a variable above a certain
Lower Class Boundary. When the interest lies in finding the number of cases above a specified
value, then this value is taken as the lower boundary of a class. The result is known as a more
than cumulative frequency distribution.
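As an illustration, both kinds of cumulative frequency can be obtained as running totals of the class frequencies. The sketch below reuses the four-class table (classes 1-25 to 76-100) shown earlier; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Class boundaries and frequencies of the four-class table shown earlier.
lower_boundaries = [0.5, 25.5, 50.5, 75.5]
upper_boundaries = [25.5, 50.5, 75.5, 100.5]
frequencies = [20, 15, 25, 10]

# Less than cumulative frequency: running total from the first class onward.
less_than_cf = list(accumulate(frequencies))

# More than cumulative frequency: running total from the last class backward.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

for ucb, lcf in zip(upper_boundaries, less_than_cf):
    print(f"less than {ucb}: {lcf}")
for lcb, mcf in zip(lower_boundaries, more_than_cf):
    print(f"more than {lcb}: {mcf}")
```

The last less than value and the first more than value both equal the total number of observations, which is a quick consistency check.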
6. Put the smallest value of the data set as the LCL of the first class. To obtain the LCL of the
second class add the class width W to the LCL of the first class. Continue adding until you get
K classes.
Let X be the smallest observation.
$LCL_1 = X$
$LCL_i = LCL_{i-1} + W$ for i=2, 3, …, K.
7. Obtain the UCLs of the FD by adding W-U to the corresponding LCLs: $UCL_i = LCL_i + (W - U)$ for
i=1, 2, …, K.
8. Generate the class boundaries: $LCB_i = LCL_i - \frac{1}{2}U$ and $UCB_i = UCL_i + \frac{1}{2}U$ for i=1, 2, …, K.
Example 1: Marks of 50 students (out of 40)
16 21 26 24 11 17 25 26 13 27 24 26 3 27 23 24 15 22 22 12 22 29 18 22 28 25 7
17 22 28 19 23 23 22 3 19 13 31 23 28 24 9 20 33 30 23 20 8 21 24
Construct a grouped frequency distribution.
Solution:
1. The array form of the data (increasing order)
3 3 7 8 9 11 12 13 13 15 16 17 17 18 19 19 20 20 21 21 22 22 22 22 22 22
23 23 23 23 23 24 24 24 24 24 25 25 26 26 26 27 27 28 28 28 29 30 31 33
2. U=9-8=1
3. R=L-S=33-3=30
4. K=1+3.322logN=1+3.322log50=6.64≈7
5. W=R/K=30/6.64=4.5≈5
6. W-U=5-1=4
Class    Class        Class   Frequency   Relative      Percentage   LCF   MCF
Limits   Boundaries   Mark                Frequency     Frequency
3-7      2.5-7.5      5       3           3/50=0.06     6            3     50
8-12     7.5-12.5     10      4           4/50=0.08     8            7     47
13-17    12.5-17.5    15      6           6/50=0.12     12           13    43
18-22    17.5-22.5    20      13          13/50=0.26    26           26    37
23-27    22.5-27.5    25      17          17/50=0.34    34           43    24
28-32    27.5-32.5    30      6           6/50=0.12     12           49    7
33-37    32.5-37.5    35      1           1/50=0.02     2            50    1
Total                         50          1             100
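The construction in Example 1 can be reproduced programmatically. The sketch below follows the steps used in the solution (unit of measurement U, range R, number of classes K from the formula K = 1 + 3.322 log N, class width W, then limits, boundaries and class marks). The rounding rules are an assumption: both K and W are rounded up, which matches the values 7 and 5 obtained above.

```python
import math

marks = [16, 21, 26, 24, 11, 17, 25, 26, 13, 27, 24, 26, 3, 27, 23, 24, 15,
         22, 22, 12, 22, 29, 18, 22, 28, 25, 7, 17, 22, 28, 19, 23, 23, 22,
         3, 19, 13, 31, 23, 28, 24, 9, 20, 33, 30, 23, 20, 8, 21, 24]

U = 1                                # unit of measurement (marks are whole numbers)
N = len(marks)
R = max(marks) - min(marks)          # range = largest - smallest
k_raw = 1 + 3.322 * math.log10(N)    # about 6.64
K = math.ceil(k_raw)                 # number of classes, rounded up (assumption)
W = math.ceil(R / k_raw)             # class width, rounded up (assumption)

rows = []
for i in range(K):
    lcl = min(marks) + i * W                   # lower class limit
    ucl = lcl + (W - U)                        # upper class limit
    lcb, ucb = lcl - U / 2, ucl + U / 2        # class boundaries
    class_mark = (lcl + ucl) / 2               # midpoint of the class
    freq = sum(lcl <= x <= ucl for x in marks) # observations in the class
    rows.append((lcl, ucl, lcb, ucb, class_mark, freq))

for lcl, ucl, lcb, ucb, class_mark, freq in rows:
    print(f"{lcl}-{ucl}  {lcb}-{ucb}  mark={class_mark}  f={freq}")
```

The same code can be tried on the exercise data that follows by replacing the `marks` list, although a different rounding rule may be preferred for other data sets.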
Exercise: In a survey the ages of 44 women at marriage were reported as follows. Construct the
appropriate FD for these data.
24 25 27 26 22 23 24 25 24 23 26 28 24 25 23 24 25 25 25 22 27 28
27 24 25 24 25 28 26 25 24 28 24 25 25 24 25 24 26 27 27 25 28 26
Properties of Classes (Class Boundaries)
Classes should be:
Complete and non-overlapping
Complete: the classes should cover the whole data set. Non-overlapping: no observation should belong
to two classes.
Clear and properly set
The W and K should be calculated properly and W should be the same for all classes.
Standardized
A class should follow logical and chronological (increasing) order.
The number of classes should be between 5 and 20, i.e. 5≤K≤20. K depends on N: the
larger the N, the larger the K. But we need to condense the data set into easily manageable
classes with minimum loss of information.
Continuous
Even if there are no values in a class the class must be included in the frequency
distribution.
Advantages and disadvantages of frequency distributions
a. Advantages
It condenses a large mass of data into a comparatively small table.
It attracts the attention of even a layman and gives him an insight into the nature of the
distribution.
It helps for further statistical analysis, like central tendency, scatter, symmetry,… of the
data.
b. Disadvantages
In the grouped frequency distributions, the identity of the observations is lost. We know
only the number of observations in a class and do not know what the values are.
Because the selection of the class width and the lower class limit of the first class are to a
certain extent arbitrary, different frequency distributions may be constructed for the same
data and hence may give contradictory impressions.
b) Component Bar Diagram: is used when there is a desire to show how a total or aggregate is
divided into its component parts. The bars represent the total value of a variable, with each total
broken into its component parts, and different colors are used for identification. In this type
of diagram, a bar is subdivided into parts in proportion to the size of each subdivision.
These subdivided rectangles are shaded differently by lines, dots and colors so that the
components are easy to compare.
Sometimes the volumes of different attributes may be greatly different. For making meaningful
comparisons, the components of the attributes are reduced to percentages. In that case each
attribute will have 100 as its maximum volume. This sort of component bar diagram is known as
percentage bar-diagram. Each rectangle represents total value of a variable and is broken into its
component parts.
Example:
Marital Status   Male   Female   Total
Single           90     10       100
Married          30     40       70
Divorced         1      29       30
2. Pie chart: A pie chart is popularly used in practice to show the percentage breakdown of data. A
pie chart is a circle representing a set of data by dividing the circle into sectors proportional to
the number of items in the categories; in other words, it is a circle representing the total, cut into
slices in proportion to the size of the parts that make up the total. It gives the proportional
sizes of different data groups as slices of a pie or a circle.
Example:
Marital Status   Number of individuals   Percentage   Degree
Single           100                     50           180
Married          70                      35           126
Divorced         30                      15           54
Total            200                     100          360
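The Percentage and Degree columns of the table above follow from each category's share of the total: percentage = (count / total) × 100 and degree = (count / total) × 360. A minimal sketch, with illustrative variable names:

```python
# Counts from the marital status table above.
counts = {"Single": 100, "Married": 70, "Divorced": 30}
total = sum(counts.values())

# Share of each category as a percentage and as a central angle of the circle.
percentages = {status: 100 * n / total for status, n in counts.items()}
degrees = {status: 360 * n / total for status, n in counts.items()}

for status in counts:
    print(f"{status:<9} {counts[status]:>4} {percentages[status]:>6.1f}% {degrees[status]:>6.1f} deg")
```

The percentages should add up to 100 and the angles to 360, which is a useful check before drawing the chart.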
2. Frequency Polygon: A graph that consists of line segments connecting the points where the
class marks and the frequencies intersect.
It can be constructed from a histogram by joining the mid-points of the tops of the bars.
Example: Construct frequency polygon for the following Grouped frequency Distribution.
3. Cumulative Frequency (Ogive) curves: an ogive is a smooth free-hand curve drawn through the points of a cumulative frequency polygon.
Example: Construct Ogive curve for the following Grouped frequency Distribution.
CHAPTER THREE
2. It should not be affected by extreme values. It should be as close to the maximum number of
observed values as possible.
3. It should be defined rigidly which means it should have a definite value (it should be
unique).
4. It should always exist.
5. It should be easy to understand and calculate. It should not be subject to complicated and tedious
calculations, though the advent of electronic calculators and computers has made such calculations feasible.
6. It should be capable of further algebraic treatment. By algebraic treatment, we mean that the
measure can be used in the formulation of other formulae or for
further statistical analysis.
3.3. Summation Notation
The sum X1+X2+…+Xn is denoted using the Greek letter ∑ (sigma) as
$\sum_{i=1}^{n} X_i = X_1 + X_2 + \dots + X_n$, and
$\sum_{i=1}^{n} X_i Y_i = X_1Y_1 + X_2Y_2 + \dots + X_nY_n$
$\sum_{i=1}^{n} (X_i + c) = \sum_{i=1}^{n} X_i + nc$, where c is a constant.
$\sum_{i=1}^{n} CX_i = C\sum_{i=1}^{n} X_i$, where C is a constant.
$\sum_{i=1}^{n} a = na$, where a is a constant.
From now onwards we will use ∑X in place of $\sum_{i=1}^{n} X_i$ just for simplicity.
Simple Arithmetic Mean: is the sum of all observations divided by the total number of observations.
For a sample of n observations X1, X2,…,Xn the sample mean is denoted by $\bar{X}$ (X-bar) and
calculated as follows.
$\bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$
For a frequency array (ungrouped FD),
$\bar{X} = \frac{\sum fX}{\sum f} = \frac{f_1X_1 + f_2X_2 + \dots + f_KX_K}{f_1 + f_2 + \dots + f_K}$
For grouped FD, X represents class mark.
Example 1: The high temperatures for a 7-day week during December in Haramaya University
were 29°, 31°, 28°, 32°, 29°, 27° and 55°. Find the mean high temperature for the
week.
Solution: $\bar{X} = \frac{\sum X}{n} = \frac{29+31+28+32+29+27+55}{7} = \frac{231}{7} = 33^{\circ}$.
The mean, or average, high temperature for the week was 33°.
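The calculation in Example 1 can be verified in Python. The sketch below computes the simple mean and also the frequency-array form (∑fX divided by ∑f) for the same data. The function name `frequency_mean` is illustrative, not part of the original material.

```python
temperatures = [29, 31, 28, 32, 29, 27, 55]

# Simple arithmetic mean: sum of observations / number of observations.
mean = sum(temperatures) / len(temperatures)

def frequency_mean(values, freqs):
    """Mean of a frequency array: sum(f * X) / sum(f).
    For a grouped FD, pass the class marks as values."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)

# The same week written as a frequency array (29 occurs twice).
values = [27, 28, 29, 31, 32, 55]
freqs = [1, 1, 2, 1, 1, 1]
mean_from_frequencies = frequency_mean(values, freqs)

# Effect of the extreme value 55: the mean of the other six days.
mean_without_55 = sum(t for t in temperatures if t != 55) / 6

print(mean, mean_from_frequencies, round(mean_without_55, 2))
```

Dropping the single extreme value 55 lowers the mean from 33 to about 29.3, which shows how strongly one unusual observation can pull the arithmetic mean.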
Example2: The amounts of drops of water in drip irrigation were registered from 43 sample drip
holes in one day and the data are as follows
The algebraic sum of the deviations of each value from the arithmetic mean is zero. That is,
$\sum (X - \bar{X}) = 0$.
The sum of the squares of the deviations from the mean is less than the sum of the squares of
the deviations about any other value.
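Both properties can be checked numerically. The sketch below reuses the temperature data of Example 1 (mean 33); the helper `sum_of_squares` and the list of alternative values are illustrative choices.

```python
data = [29, 31, 28, 32, 29, 27, 55]
mean = sum(data) / len(data)

# Property 1: the deviations from the mean add up to zero.
sum_of_deviations = sum(x - mean for x in data)

def sum_of_squares(center):
    """Sum of squared deviations of the data about a chosen value."""
    return sum((x - center) ** 2 for x in data)

# Property 2: the sum of squared deviations is smallest about the mean.
about_mean = sum_of_squares(mean)
about_others = {c: sum_of_squares(c) for c in (27, 29, 30, 32, 35, 55)}

print(sum_of_deviations, about_mean, about_others)
```

Every alternative value tried gives a larger sum of squares than the mean does, as the second property states.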
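Combined Mean: If there are p different groups (having the same unit of measurement) with
means $\bar{X}_1, \bar{X}_2, \dots, \bar{X}_p$ and numbers of observations $n_1, n_2, \dots, n_p$ respectively, then the mean of all the
observations (the combined mean) is
$\bar{X}_C = \frac{\sum n\bar{X}}{\sum n} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \dots + n_p\bar{X}_p}{n_1 + n_2 + \dots + n_p}$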
Example: The mean weight of 50 women workers in a factory is 48 kg. The mean weight of 75
men working in the same factory is 58 kg. Find the mean weight of all workers in the factory.
Solution: $\bar{X}_C = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2} = \frac{50(48) + 75(58)}{50 + 75} = \frac{6750}{125} = 54$. Therefore, the mean weight of the
factory workers is 54 kg.
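The combined mean is a mean of the group means weighted by the group sizes. A minimal sketch of the example above, where the function name `combined_mean` is illustrative:

```python
def combined_mean(means, sizes):
    """Combined mean of several groups: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)

# 50 women with mean weight 48 kg and 75 men with mean weight 58 kg.
factory_mean = combined_mean(means=[48, 58], sizes=[50, 75])
print(factory_mean)
```

Note that the result is not the simple average of 48 and 58 (which would be 53), because the two groups are of different sizes.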
$\bar{X}_W = \frac{\sum WX}{\sum W} = \dots = 83$, so the average grade of the student is 83.
Arithmetic mean fulfills almost all characteristics of good measures of central tendency with the
exception that it is highly affected by extreme values. And it cannot be calculated for a FD with
open-ended classes (a FD with no lower class boundary of the first class or with no upper class
boundary of the last class or with both).
3.4.2. Geometric Mean
Geometric mean is the nth root of the product of the n values.
$GM = \sqrt[n]{\prod X} = \sqrt[n]{X_1 X_2 \cdots X_n}$
But this formula is practical only if n is small. If n is large, it is difficult to calculate the nth root. Thus to
facilitate the computation, we make use of logarithms.
$GM = \text{Antilog}\left(\frac{1}{n}\sum \log X\right)$
For ungrouped FD, $GM = \text{Antilog}\left(\frac{1}{\sum f}\sum f\log X\right)$
For grouped FD, X represents class mark.
If the variable values are measured as ratios, proportions or percentages and some values are
much larger in magnitude than others, then the geometric mean is a better representative of
the data than the simple average. In a “geometric series”, the most meaningful average is the
geometric mean, because the arithmetic mean is heavily influenced by the large numbers in the series.
The geometric mean is important in determining average rates of growth, percentages, ratios
and proportions.
The disadvantage of GM is that it cannot be calculated if one or more observations are zero or
negative. It is also affected by extreme values but not to the extent of AM.
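The logarithm form of the geometric mean can be sketched as follows. The data values (2, 4, 8) are illustrative and do not come from the original material; the function names are likewise assumptions. Base-10 logarithms are used, so the antilog is a power of 10.

```python
import math

def geometric_mean(values):
    """GM = antilog( (1/n) * sum(log X) ), defined only for positive values."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires all values to be positive")
    mean_log = sum(math.log10(x) for x in values) / len(values)
    return 10 ** mean_log

def geometric_mean_freq(values, freqs):
    """GM for a frequency array: antilog( sum(f * log X) / sum(f) )."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires all values to be positive")
    mean_log = sum(f * math.log10(x) for x, f in zip(values, freqs)) / sum(freqs)
    return 10 ** mean_log

sample = [2, 4, 8]                       # illustrative data
gm = geometric_mean(sample)              # cube root of 2*4*8 = 64
am = sum(sample) / len(sample)           # arithmetic mean, for comparison
print(round(gm, 4), round(am, 4))
```

For these values the geometric mean is 4, which is smaller than the arithmetic mean (about 4.67). Python's standard `statistics` module (version 3.8 and later) also provides a `geometric_mean` function that can serve as a cross-check.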
Exercise:
1. Find the geometric mean of A) 1, 2, 3, 4, 5. B) 1, 2, 3, 4, 100. Is there a great difference
between the GM of A and that of B?
2. The price of a commodity increased by 5% from 1989 to 1990, 8% from 1990 to 1991 and by
77% from 1991 to 1992. Find the average price increase.
3. A machine depreciated by 10% each in the first two years and by 40% in the third year. Find
out the average rate of depreciation.
4. Decadal percentage growth of population in country A is given below. Find the average growth
rate.
Exercise:
1. Find the harmonic mean of A) 1, 2, 3, 4, 5. B) 1, 2, 3, 4, 100. Is there a great difference
between the HM of A and that of B?
2. A driver traveled 400 km per day for three days at a speed of 60, 50 and 40 kilometers per
hour. Find the average speed of the driver.
3. A student reads the first 100 pages of a book at a rate of 5 pages per hour, the next 100 pages
at a rate of 8 pages per hour. What is the student‟s average reading speed?
4. Suppose a train moves 100 km with a speed of 40 km per hour, then 150 km with a speed of
50 km per hour and the next 135 km with a speed of 45 km per hour. Calculate the average
speed of the train.
5. In a factory a mechanic takes 15 days to fabricate a machine, the second mechanic takes 18
days, the third takes 30 days and the fourth takes 90 days. Find the average number of days
taken by the workers to fabricate the machine.
6. Suppose a train moves 5 hours at a speed of 40 km per hour, then 3 hours at a speed of 50 km
per hour and the next 5 hours with a speed of 45 km per hour. Calculate the average speed of
the train.
3.5. Median
Median is the half-way point in a data set. It divides a data set into two equal parts such that half
of the values are less than the median and half are greater than the median. Graphically, the
median is the intersection of the less than and more than cumulative frequency curves.
The median of a set of n observations X1, X2, …, Xn arranged in ascending order of magnitude is
the middle value if n is odd, or the arithmetic mean of the two middle values if n is even. That is,
If n is odd, $\tilde{X} = \left(\frac{n+1}{2}\right)^{th}$ value
If n is even, $\tilde{X} = \frac{\left(\frac{n}{2}\right)^{th}\text{ value} + \left(\frac{n}{2}+1\right)^{th}\text{ value}}{2}$
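Median for continuous grouped data: for grouped frequency distributions the median is given by the
formula $\tilde{X} = L_{\tilde{X}} + \left(\frac{\frac{n}{2} - F_{\tilde{X}-1}}{f_{\tilde{X}}}\right) w$
Where n=∑f= sum of frequencies
$L_{\tilde{X}}$ is the LCB of the median class.
$F_{\tilde{X}-1}$ is the less than cumulative frequency just before the median class.
$f_{\tilde{X}}$ is the frequency of the median class.
w is the class width.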
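Solution: The median class is the class whose less than cumulative frequency first contains the
value n/2=40/2=20. This implies that 145-153 is the median class.
$L_{\tilde{X}}$ =144.5, n=40, $F_{\tilde{X}-1}$ =17, $f_{\tilde{X}}$ =12 and w=9
$\tilde{X} = L_{\tilde{X}} + \left(\frac{\frac{n}{2} - F_{\tilde{X}-1}}{f_{\tilde{X}}}\right) w = 144.5 + \frac{(20-17)}{12} \times 9 = 144.5 + 2.25 = 146.75 \approx 146.8$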
3.6. Mode
The mode denoted by X̂ , is the most frequently occurring value in a set of observations or it is
the value with the highest frequency. A data set may have one mode (uni-modal), two modes (bi-
modal), more than two modes (multi-modal) or no mode at all (i.e. when all observations are
equally frequent).
Ungrouped (individual series): Arrange the data in ascending order and take the value
appearing most frequently (the most frequent value).
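Grouped (continuous) series: In a frequency distribution, the mode is located in the class with
the highest frequency, and that class is the modal class.
Then the formula for mode is
$\hat{X} = L_{\hat{X}} + \left(\frac{f_{\hat{X}} - f_{\hat{X}-1}}{(f_{\hat{X}} - f_{\hat{X}-1}) + (f_{\hat{X}} - f_{\hat{X}+1})}\right) w$
Where $L_{\hat{X}}$ is the LCB of the modal class, $f_{\hat{X}}$ is the frequency of the modal class,
$f_{\hat{X}-1}$ and $f_{\hat{X}+1}$ are the frequencies of the classes just before and just after the modal
class, and w is the class width.
Mode is not affected by extreme values and can be calculated for open-ended classes. But it
often does not exist, and its value may not be unique.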
Example 1: A study of the relationship between age and various functions (such as acuity and
depth perception) reported the following observations on the area of scleral lamina (mm2) from human
optic nerve heads (Experimental Eye Research, 1988): 2.75, 2.62, 2.74, 3.85, 2.34, 2.74, 3.93, 4.21,
3.88, 4.33, 3.46, 4.52, 2.43, 3.65, 2.78, 3.56, 3.01. Find the mean, median, mode, Q1, D5 and P75.
Solution: Check the answer (mean=3.341, median=3.46, mode=2.74, Q1=2.74, D5=3.46 &
P75=3.905). Here n=17, so P75 is the 75(17+1)/100=13.5th value, i.e. 3.88+0.5(3.93-3.88)=3.905.
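Example 2: Find the mode & interpret the result of 40 male college students in state university.
Solution: the highest frequency appears at class interval 145-153, so
$L_{\hat{X}}$ =144.5, n=40, $f_{\hat{X}-1}$ =9, $f_{\hat{X}+1}$ =5, $f_{\hat{X}}$ =12 and w=9
$\hat{X} = L_{\hat{X}} + \left(\frac{f_{\hat{X}} - f_{\hat{X}-1}}{(f_{\hat{X}} - f_{\hat{X}-1}) + (f_{\hat{X}} - f_{\hat{X}+1})}\right) w = 144.5 + \frac{(12-9)}{(12-9)+(12-5)} \times 9 = 144.5 + 2.7 = 147.2$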
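The grouped-mode calculation in Example 2 can be checked with a few lines of Python. This is a minimal sketch; the function name `grouped_mode` and its parameter names are our own choices, not part of the course material.

```python
def grouped_mode(lcb, f_mode, f_prev, f_next, width):
    """Mode of a grouped frequency distribution by the interpolation formula."""
    d1 = f_mode - f_prev   # excess of the modal class over the class just before it
    d2 = f_mode - f_next   # excess of the modal class over the class just after it
    return lcb + d1 / (d1 + d2) * width

# Values taken from Example 2 (modal class 145-153)
mode_value = grouped_mode(lcb=144.5, f_mode=12, f_prev=9, f_next=5, width=9)
print(round(mode_value, 1))
```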
Deciles: are values that divide the data into ten equal parts. These values are denoted by D1, D2,
…, D9 such that 10% of the data fall below D1, 20% below D2, …, 90% below D9.
Percentiles: are values that divide a dataset into 100 equal parts. These values are denoted by P 1,
P2, …, P99.
Methods of calculation
a. Ungrouped (individual) series: Arrange the values in ascending order. Then
Quartiles: Let Qi be the ith quartile (i=1,2,3), then
$Q_i = \left(\frac{i(n+1)}{4}\right)^{th}$ value
Deciles: Let Di be the ith decile (i=1,2,…,9)
$D_i = \left(\frac{i(n+1)}{10}\right)^{th}$ value
Percentiles: Let Pi be the ith percentile (i=1,2,…,99)
$P_i = \left(\frac{i(n+1)}{100}\right)^{th}$ value
If x1, x2, . . . , xn is the sorted data set and j and k are the integral and fractional parts of the
position of Qi respectively, then Qi lies between xj and xj+1 and is given by
Qi = xj + k(xj+1 − xj)
The same interpolation is used for deciles and percentiles.
Example: Given the data 42, 43, 35, 38, 41, 49, 50, 51, 52 and 55. Find
A) All quartiles
B) The 2nd and 8th deciles
C) 35th and 75th percentiles
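Solutions: The data in ascending order are 35, 38, 41, 42, 43, 49, 50, 51, 52, 55 (n=10).
a) $Q_1 = \left(\frac{1(10+1)}{4}\right)^{th}$ value = 2.75th value
=2nd value+0.75(3rd value-2nd value)
=38+0.75(41-38)=40.25
$Q_2 = \left(\frac{2(10+1)}{4}\right)^{th}$ value = 5.5th value
b) $D_2 = \left(\frac{2(10+1)}{10}\right)^{th}$ value = 2.2th value
c) $P_{35} = \left(\frac{35(10+1)}{100}\right)^{th}$ value = 3.85th value
=3rd value+0.85(4th value-3rd value)
=41+0.85(42-41)=41.85 (P75 left as exercise)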
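The position rule and interpolation used in this example can be sketched in Python. The helper name `position_quantile` is hypothetical, and the clamping of positions that fall outside the data (which can happen for very high or very low percentiles of small samples) is an assumption of this sketch, not something the notes specify.

```python
def position_quantile(data, i, parts):
    """i-th quantile when the data are divided into `parts` equal parts,
    using the i(n+1)/parts position rule with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    pos = i * (n + 1) / parts
    j = int(pos)           # integral part of the position
    k = pos - j            # fractional part of the position
    if j < 1:              # assumption: clamp positions before the first value
        return xs[0]
    if j >= n:             # assumption: clamp positions after the last value
        return xs[-1]
    return xs[j - 1] + k * (xs[j] - xs[j - 1])

data = [42, 43, 35, 38, 41, 49, 50, 51, 52, 55]
q1 = position_quantile(data, 1, 4)       # first quartile
p35 = position_quantile(data, 35, 100)   # 35th percentile
print(q1, round(p35, 2))
```

Calling the same function with `parts=10` gives deciles, so the remaining parts of the example can be checked the same way.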
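b. Grouped (continuous) data:
Quartiles: $Q_i = L_{Q_i} + \left(\frac{\frac{in}{4} - F_{Q_i-1}}{f_{Q_i}}\right) w$, i=1, 2, 3.
Deciles: $D_i = L_{D_i} + \left(\frac{\frac{in}{10} - F_{D_i-1}}{f_{D_i}}\right) w$, i=1, 2,…., 9.
Percentiles: $P_i = L_{P_i} + \left(\frac{\frac{in}{100} - F_{P_i-1}}{f_{P_i}}\right) w$, i=1, 2,…,99.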
Where n=∑f= sum of frequencies
L is the LCB of the ith(quartile, decile and percentile) class.
F is the less than cumulative frequency just before the ith(quartile, decile and percentile)
class.
f is frequency of the ith(quartile, decile and percentile) class .
w is the class width.
Example 2: In a certain investigation, 460 persons were involved in the study, and based on an
enquiry on their age, the following frequency distribution shows the age composition of the
persons under study.
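Age interval (years)   Number of persons   Less than cumulative frequency
10.5-15.5              24                  24
15.5-20.5              64                  88
20.5-25.5              90                  178
25.5-30.5              122                 300
30.5-35.5              51                  351
35.5-40.5              56                  407
40.5-45.5              20                  427
45.5-50.5              33                  460
The first quartile position is 1*460/4=115, which lies in 20.5-25.5.
$Q_1 = 20.5 + \left(\frac{\frac{1 \times 460}{4} - 88}{90}\right) 5 = 22$
The second quartile position is 2*460/4=230, which lies in 25.5-30.5.
$Q_2 = 25.5 + \left(\frac{\frac{2 \times 460}{4} - 178}{122}\right) 5 = 27.63$
b) Deciles: $D_i = L_{D_i} + \left(\frac{\frac{in}{10} - F_{D_i-1}}{f_{D_i}}\right) w$, i=1, 2,…., 9.
The 5th decile position is 5*460/10=230, which lies in 25.5-30.5.
$D_5 = 25.5 + \left(\frac{\frac{5 \times 460}{10} - 178}{122}\right) 5 = 27.63$
The 25th percentile position is 25*460/100=115, which lies in 20.5-25.5.
$P_{25} = 20.5 + \left(\frac{\frac{25 \times 460}{100} - 88}{90}\right) 5 = 22$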
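The grouped-data formulas for quartiles, deciles and percentiles differ only in the divisor (4, 10 or 100), so one function can cover all three. The sketch below is an illustration with a hypothetical function name `grouped_quantile`; it assumes the classes are listed in order and described by their class boundaries.

```python
def grouped_quantile(boundaries, freqs, i, parts):
    """i-th quantile (out of `parts`) of a grouped frequency distribution.

    boundaries: class boundaries in ascending order (one more than the classes)
    freqs: class frequencies
    """
    n = sum(freqs)
    target = i * n / parts
    cum = 0  # less than cumulative frequency just before the current class
    for lower, upper, f in zip(boundaries, boundaries[1:], freqs):
        if f > 0 and cum + f >= target:   # first class that contains the target position
            return lower + (target - cum) / f * (upper - lower)
        cum += f
    raise ValueError("no class contains the requested position")

bounds = [10.5, 15.5, 20.5, 25.5, 30.5, 35.5, 40.5, 45.5, 50.5]
freqs = [24, 64, 90, 122, 51, 56, 20, 33]
q1 = grouped_quantile(bounds, freqs, 1, 4)
q2 = grouped_quantile(bounds, freqs, 2, 4)
d5 = grouped_quantile(bounds, freqs, 5, 10)
p25 = grouped_quantile(bounds, freqs, 25, 100)
print(round(q1, 2), round(q2, 2), round(d5, 2), round(p25, 2))
```

Since Q2, D5 and P50 all target the same position n/2, the function returns the grouped median for any of them.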
CHAPTER FOUR
To illustrate this, let us consider the following four data sets: the price of a certain commodity in
four Maya cities in five different months.
City    Price in each of the five months
A       30   30   30   30   30
B       28   29   31   30   32
C       15    5   55   45   30
D        3    5   37   30   75
Now if we calculate the mean and median for each city, we come up with the value 30 in every
case. This implies that the price of the commodity in the four cities A, B, C and D is, on
average, the same.
But by inspection, it is apparent that the prices in the four cities differ remarkably from one
another. For city A the value 30 is exact, and for city B it is a reasonable summary, but for cities
C and D it is not realistic to say that the price of the commodity is 30. This means that we cannot
describe a data set confidently by looking at the average alone. So, along with the average values
(measures of central tendency), we have to study the scatter or dispersion of the data.
Dispersion or variation may be defined as the extent of scatter of the values around the
measures of central tendency. Thus a measure of dispersion tells us the extent to which the values
of a variable vary about the measure of central tendency.
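The point of the four-city illustration can be reproduced with Python's standard `statistics` module. This is just a sketch of the comparison; the population standard deviation (`pstdev`) is used here as one possible measure of spread, which is our choice for the illustration.

```python
from statistics import mean, median, pstdev

prices = {
    "A": [30, 30, 30, 30, 30],
    "B": [28, 29, 31, 30, 32],
    "C": [15, 5, 55, 45, 30],
    "D": [3, 5, 37, 30, 75],
}

# Same centre in every city, very different spread
for city, xs in prices.items():
    spread = max(xs) - min(xs)   # range
    print(city, mean(xs), median(xs), spread, round(pstdev(xs), 2))
```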
4.2.1. Range
It is the simplest and crudest measure of dispersion. Range is defined as the difference between
the largest and the smallest values in the data.
Range hardly satisfies any property of good measure of dispersion as it is based on two
extreme values only, ignoring the others. It is not liable to further algebraic treatment.
4.2.2. Quartile Deviation
Sometimes known as Semi-interquartile Range (SIR)
Interquartile Range=Q3-Q1
$QD = \frac{Q_3 - Q_1}{2}$        Coefficient of QD $= \frac{Q_3 - Q_1}{Q_3 + Q_1}$
QD involves only the middle 50% of the observations by excluding the observations below the
lower quartile and the observations above the upper quartile. Note that QD does not take into
account all the individual values occurring between Q1 and Q3. This means that it gives no
information about the variation even within the middle 50% of the values. It does, however,
provide some idea of the spread if the values are uniformly distributed between Q1 and Q3. It can
be calculated for open-ended classes.
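Coefficient of MD about the mean: $CMD(\bar{X}) = \frac{MD(\bar{X})}{\bar{X}}$
Coefficient of MD about the median: $CMD(\tilde{X}) = \frac{MD(\tilde{X})}{\tilde{X}}$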
MD is less affected by extreme values than the variance and standard deviation. Its main drawback
is that the algebraic negative signs of the deviations are ignored. MD is minimum when the
deviations are taken from the median.
4.2.4. Variance and Standard Deviation
The Variance and Standard Deviation are the most superior and widely used measures of
dispersions and both measure the average dispersion of the observations around the mean.
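For a population containing N elements, the population variance ($\sigma^2$) is calculated by using the
formula $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
For a sample containing n elements, the sample variance is $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$
$S^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1} = \frac{2926 - (176)^2/11}{10} = 11$
So, $S = \sqrt{S^2} = \sqrt{11} = 3.316$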
2) Computing the variance & standard deviation for the data given below.
Observation (Xi)   32   36   40   44   48   Total
Frequency (fi)      2    5    8    4    1   20
$S^2 = \frac{\sum f_i x_i^2 - \left(\sum f_i x_i\right)^2 / \sum f_i}{\sum f_i - 1} = \frac{31376 - (788)^2/20}{19} = 17.31$
Class     f     m     fm     fm²
1-3       1     2     2      4
3-5       9     4     36     144
13-15     3     14    42     588
$S^2 = \frac{\sum f_i m_i^2 - \left(\sum f_i m_i\right)^2 / \sum f_i}{\sum f_i - 1} = \frac{7016 - (800)^2/100}{99} = 6.22$
$S^2$ =6.22. So, $S = \sqrt{6.22}$ =2.49
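The shortcut formula used in these frequency-table examples can be sketched in Python. The function name `frequency_variance` is our own; the data are those of example 2) above (observations 32 to 48 with frequencies 2, 5, 8, 4, 1). For a grouped distribution, the class marks would be passed in place of the observations.

```python
import math

def frequency_variance(values, freqs):
    """Sample variance from a frequency table using the shortcut formula."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

s2 = frequency_variance([32, 36, 40, 44, 48], [2, 5, 8, 4, 1])
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))
```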
Properties of Variance and Standard Deviation
1. The variance and standard deviation are always non-negative.
2. If every value is multiplied by a constant C, the new variance is S2new=C2S2old and the new
standard deviation is Snew=|C|Sold.
3. When a constant C is added to (or subtracted from) each and every value, the standard
deviation and variance remain the same.
4.2.5. Coefficient of Variation
All absolute measures of dispersion have units. If two or more distributions differ in their units
of measurement, their variability cannot be compared by any of the absolute measures given
before. Also, the size of these measures of dispersion depends upon the size of the values. That
is, if the values are larger, the absolute measures will also be larger. Hence,
in situations where either the two or more data sets have different units of measurement, or their
means differ sufficiently in size, absolute measures fail to be appropriate.
The coefficient of variation is a relative measure of standard deviation. It is the ratio of the
standard deviation to the mean, expressed as a percentage.
$CV = \frac{\sigma}{\mu} \times 100\%$, for population
$CV = \frac{S}{\bar{X}} \times 100\%$, for sample
It is used for comparing the variability of two or more distributions. The distribution having less
CV is said to be less variable or more consistent or more uniform.
Since absolute measures depend on the units of measurement of the data, they fail to be
appropriate for comparing two or more groups if
1. The groups have different units of measurement.
2. The size of the data between the groups is not the same.
When either of these two conditions happens we have to use relative measures of variation. CV
is a unit less measure of variation and also takes into account the size of the means of the
distributions.
EX: Given Data Set A: 2 Meters, 4 Meters, 6 Meters
Data Set B: 1000 Liters, 800 Liters, 900Liters
Compare the variability of the two data sets using standard deviation and coefficient of variation.
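One way to carry out this comparison is sketched below in Python. It assumes both data sets are treated as samples, so the sample standard deviation (`statistics.stdev`, divisor n-1) is used; the function name `coefficient_of_variation` is our own.

```python
from statistics import mean, stdev

def coefficient_of_variation(xs):
    """Sample coefficient of variation, in percent."""
    return stdev(xs) / mean(xs) * 100

set_a = [2, 4, 6]            # metres
set_b = [1000, 800, 900]     # litres

sd_a, sd_b = stdev(set_a), stdev(set_b)
cv_a, cv_b = coefficient_of_variation(set_a), coefficient_of_variation(set_b)
print(sd_a, sd_b)                        # absolute measures: units differ
print(round(cv_a, 2), round(cv_b, 2))    # relative measures: unit free
```

The standard deviations cannot be compared directly because they are in metres and litres, while the two CV values are unit free and can be compared.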
4.2.6. Standard Score(Z-score)
It is used to determine how many standard deviations a given value lies above or below the
mean, according to whether the z-score is positive or negative.
$Z = \frac{X - \mu}{\sigma}$, for Population
$Z = \frac{X - \bar{X}}{S}$, for Sample
Example: Suppose Ablakat scored 90 on a basic statistics test in which the mean and standard
deviation of the class were 70 and 10 respectively. In a second test, Meklit scored 60, where
the mean and standard deviation of the class were 56 and 4 respectively. Who is better off relative
to her class?
Solution:
Ablakat: $Z = \frac{90 - 70}{10} = 2.0$
Meklit: $Z = \frac{60 - 56}{4} = 1.0$
The score of Ablakat (90) is 2 standard deviations above the mean of her class, whereas the score
of Meklit (60) is 1 standard deviation above the mean of her class. This implies that
Ablakat's score is the better relative score when compared with Meklit's score.
4.3. Moments (about the origin and about the mean)
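If X is a variable that assumes the values X1, X2, …, XN,
b. If A=0, the rth moment about the origin is $\mu'_r = \frac{\sum X_i^r}{N}$
c. If A= $\bar{X}$, the rth moment about the mean is $\mu_r = \frac{\sum (X_i - \bar{X})^r}{N}$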
2. Positively Skewed curve: If one or more observations are extremely large, the mean of the
distribution becomes greater than the median or mode and the distribution is said to be positively
skewed.
In this case
The right tail is more elongated, longest tail to the right of the central point.
More values are on the left of the mean.
The extreme variation is towards large values (to the right).
Smaller values are more frequent.
Mean>Median>Mode
3. Negatively Skewed Curve: If one or more extremely small observations are present, the mean
is the smallest of the three averages, and the distribution is said to be negatively skewed.
In this case
The left tail is more elongated.
More observations are concentrated on the right of the mean
The extreme variation is towards lower values (to the left).
Larger values are more frequent than small values
Mean<Median<Mode
If α3 =0, Symmetrical
If α3>0, Positively Skewed
If α3<0, Negatively Skewed
Where $\mu_r$ is the rth central moment.
Example:
$Sk_b = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$
If Skb =0, Symmetrical
If Skb >0, Positively Skewed
If Skb <0, Negatively Skewed
4.4.3. Kurtosis
The shape of the peak of a distribution may be sharp or flat. Kurtosis refers to the peakedness or
flatness of a distribution with respect to the normal distribution. It is the extent to which
the curve is more peaked or more flat-topped than the normal curve.
1. If a distribution is more peaked than normal, it is called a leptokurtic distribution.
2. If a distribution is more flat-topped than normal, it is called a platykurtic distribution.
3. A distribution which is neither more peaked nor more flat-topped than normal is called
mesokurtic.
Measures of Kurtosis
1. The coefficient of Kurtosis
$K = \frac{Q_3 - Q_1}{D_9 - D_1}$
2. The Moment Measure of Kurtosis
$\beta = \frac{\mu_4}{\mu_2^2} = \frac{\mu_4}{\sigma^4}$
If β=3, Mesokurtic, β>3, Leptokurtic and β<3, Platykurtic
CHAPTER FIVE
5. ELEMENTARY PROBABILITY
with one head shown and E3 be an event with two heads shown. Are E1, E2 and E3 mutually
exclusive?
Solution
S= {HH, HT, TH, TT}
E1= {TT}
E2= {HT, TH}
E3= {HH}
E1 ∩ E2 = E2 ∩ E3 = E1 ∩ E3 = Ø
Thus, E1 and E2, E2 and E3, E1 and E3 are mutually exclusive events.
Independent events: Two events E1 and E2 are said to be independent if the occurrence of E1 has
no effect on the occurrence of E2. That means knowing that event E1 has occurred gives no
information about the occurrence of event E2. If two events are not independent, they are said
to be dependent.
Equally likely outcomes: If each outcome in the sample space of an experiment has the
same chance of occurring, then we say that the outcomes are equally likely. Example:
in throwing a fair die, all possible outcomes are equally likely. That means the
elements of the sample space have the same chance to occur.
Set theory
Set: any well-defined list or collection of objects.
Null event: an event which contains no outcome of the experiment.
Intersection of two events: let A and B be events; then the intersection of A and B is the set
of elements that are common to both A and B.
Union of events: let A and B be two events; then the union of the two events is the set of
elements that belong to A or B or both.
Complement of an event: let A be an event; A' is the event that occurs if A does not occur.
Mutually exclusive events: two events A and B are said to be mutually exclusive if
they cannot occur together, i.e. A ∩ B = Ø.
Exhaustive events: events A1, A2, …, An are said to be exhaustive if their union gives the
sample space.
Independent events: two events A and B are said to be independent if the occurrence or
non-occurrence of one does not affect the occurrence or non-occurrence of the other.
Concept of Set
In order to discuss the theory of probability, it is essential to be familiar with some ideas and
concepts of mathematical theory of set. A set is a collection of well-defined objects which is
denoted by capital letters like A, B, C, etc.
In describing which objects are contained in set A, two common methods are available.
These methods are:
1. Listing all objects of A. For example, A = {1, 2, 3, 4} describes the set consisting of the
positive integers 1, 2, 3 and 4.
2. Describing a set in words, for example, set A consists of all real numbers between 0 and 1,
inclusive. It can be written as A = {x : 0 ≤ x ≤1}, that is, A is the set of all x‟s where x is a
real number between 0 and 1, inclusive.
If A = {a1, a2, ..., an}, then each object ai, i = 1, 2, ..., n, belonging to set A is called a member or
an element of set A, i.e., ai ∈ A. A set consisting of all possible elements under consideration is
called a universal set (denoted by U). On the other hand, a set containing no element is called
an empty set (denoted by Ø or {}).
If every element of set A is also an element of set B, A is said to be a subset of B, written A ⊂
B. Every set is a subset of itself, i.e., A ⊂ A. The empty set is a subset of every set. If A ⊂ B and
B ⊂ C, then A ⊂ C. If A ⊂ B and B ⊂ A, then A and B are said to be equal.
5.2. Counting Techniques
Counting techniques are mathematical models which are used to determine the number of
possible ways of arranging or ordering objects. They are used to determine the size of a
sample space that is too large to list.
In order to determine the number of outcomes, one can use several rules of counting.
The addition rule
The multiplication rule
Permutation rule
Combination rule
a. Addition Rule: suppose there are k procedures ($p_1, p_2, ..., p_k$) in which the ith procedure can
be performed in $n_i$ ways. Then the number of ways in which we can perform $p_1$ or $p_2$
or … or $p_k$ is $n_1 + n_2 + ... + n_k$, provided that no two procedures can be performed at the same
time or one after the other.
Example:
b) In a certain class a class representative is to be chosen from 3 female and 4 male students.
Count the ways in which a class representative can be chosen.
c) There are 2 bus and 3 train routes from city X to city Y. in how many ways can a person go
from city X to city Y?
Solution a) Here, a female representative is to be chosen in 3 ways and a male representative is
to be chosen in 4 ways. Therefore, the number of ways in which a class representative can be
chosen will be 3+4=7ways.
b. The Multiplication Rule: If a choice consists of k steps of which the first can be made in n1
ways, the second can be made in n2 ways, …, the kth can be made in nk ways, then the whole
choice can be made in $(n_1 \times n_2 \times ... \times n_k)$ ways.
Example:
i) An instructor gives a 6-question multiple-choice examination. There are 4 possible
responses to each question. How many answer keys can be made?
ii) The personnel department of a large corporation wishes to issue an ID card with a 4-digit
number to each employee. How many ID cards can be prepared
A. If repetition is allowed?
B. If repetition is not allowed?
iii) There are 2 bus routes from city X to city Y and 3 train routes from city Y to city Z. In how
many ways can a person go from city X to city Z?
Solution i: There are six questions (N=6), each with 4 choices, K1=K2=…=K6=4
Total=4×4×4×4×4×4=$4^6$=4096
ii A). We have 10 digits
K1=K2=K3=K4=10.
Total=10×10×10×10=$10^4$=10000
B). K1=10, K2=9, K3=8, K4=7 because repetition is not allowed.
Total=10×9×8×7=5040.
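These counts can be verified by brute-force enumeration in Python. This is a sketch for checking the arithmetic only; the variable names are our own, and `math.perm` requires Python 3.8 or later.

```python
import math
from itertools import permutations, product

# i) six questions, four possible responses each
answer_keys = 4 ** 6

# ii A) four-digit IDs with repetition: every position can use any of 10 digits
ids_with_repetition = sum(1 for _ in product(range(10), repeat=4))

# ii B) four-digit IDs without repetition: ordered selections of 4 distinct digits
ids_without_repetition = sum(1 for _ in permutations(range(10), 4))

print(answer_keys, ids_with_repetition, ids_without_repetition)
print(math.perm(10, 4))   # closed-form count for case B
```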
c. Permutation: is the arrangement of objects in a specified order.
a. Permutation Rule 1: The number of permutations of n distinct objects taken all together is n!
Where $n! = n \times (n-1) \times (n-2) \times ... \times 3 \times 2 \times 1$. By definition 1!=0!=1
Example: In how many ways can 6 persons be seated in a row? Ans: 6!=720
Exercise: Suppose that a photographer must arrange 4 people in a row for a photograph. In how
many different ways can the arrangement be done?
b. Permutation Rule 2: The number of arrangements of n objects in a specified order using r
objects at a time is
$_nP_r = \frac{n!}{(n-r)!}$
Example: In how many ways can 9 books be arranged on a shelf having 4 places?
Ans: $_9P_4 = \frac{9!}{(9-4)!} = 3024$
Exercise: How many flags of two colors can be formed from a piece of cloth consisting of six
different colors?
c. Permutation Rule 3: The number of permutations of n objects in which n1 are alike, n2 are
alike, …, nk are alike is given by
$\frac{n!}{n_1! \, n_2! \cdots n_k!}$
Example: How many different permutations can be made from the letters in the word
"CORRECTION"?
Solution:
Here n = 10, of which 2 are C, 2 are O, 2 are R, 1 E, 1 T, 1 I, 1 N.
$n_1 = 2, n_2 = 2, n_3 = 2, n_4 = n_5 = n_6 = n_7 = 1$
Using the 3rd rule of permutation, there are
$\frac{10!}{2! \times 2! \times 2! \times 1! \times 1! \times 1! \times 1!} = 453600$ permutations.
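The third permutation rule can be sketched in Python by counting how often each letter occurs and dividing n! by the factorial of each count. The function name `distinct_arrangements` is our own.

```python
from collections import Counter
from math import factorial

def distinct_arrangements(word):
    """Number of distinct orderings of the letters of `word` (rule 3)."""
    total = factorial(len(word))
    for count in Counter(word).values():
        total //= factorial(count)   # divide out orderings of identical letters
    return total

print(distinct_arrangements("CORRECTION"))
```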
Exercise: In how many different ways can the letters in the term „STATISTICS‟ be arranged?
Combination Rule: The number of combinations of r objects selected from n objects is denoted by
$_nC_r$ or $\binom{n}{r}$ and is given by the formula:
$\binom{n}{r} = \frac{n!}{(n-r)! \times r!}$
Examples:
1. In how many ways can a committee of 5 people be chosen out of 9 people?
Solution:
n = 9, r = 5
$\binom{n}{r} = \frac{n!}{(n-r)! \times r!} = \frac{9!}{4! \times 5!} = 126$ ways
2. Among 15 clocks there are two defectives. In how many ways can an inspector choose three
of the clocks for inspection so that:
a) There is no restriction.
b) None of the defective clocks is included.
c) Only one of the defective clocks is included.
d) Two of the defective clocks are included.
Solutions: n=15, of which 2 are defective and 13 are non-defective, r=3
a) If there is no restriction, select three clocks from 15 clocks, and this can be done in:
n = 15, r = 3
$\binom{n}{r} = \frac{n!}{(n-r)! \times r!} = \frac{15!}{12! \times 3!} = 455$ ways
b) None of the defective clocks is included.
This is equivalent to zero defective and three non-defective, which can be done in:
$\binom{2}{0} \times \binom{13}{3} = 286$ ways.
c) Only one of the defective clocks is included.
This is equivalent to one defective and two non-defective, which can be done in:
$\binom{2}{1} \times \binom{13}{2} = 156$ ways.
d) Two of the defective clocks are included.
This is equivalent to two defective and one non-defective, which can be done in:
$\binom{2}{2} \times \binom{13}{1} = 13$ ways.
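The four counts in the clock example can be checked with `math.comb` (Python 3.8 or later). The variable names below are our own. Since cases b), c) and d) together cover every possible selection, their counts should add up to the unrestricted count in a).

```python
from math import comb

defective, good, chosen = 2, 13, 3

no_restriction = comb(defective + good, chosen)
none_defective = comb(defective, 0) * comb(good, 3)
one_defective = comb(defective, 1) * comb(good, 2)
two_defective = comb(defective, 2) * comb(good, 1)

print(no_restriction, none_defective, one_defective, two_defective)
# Consistency check: the three disjoint cases cover all selections
print(none_defective + one_defective + two_defective == no_restriction)
```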
5.3. Definitions of Probability approaches
1. Classical (Mathematical) Probability: Suppose there are N possible outcomes in the
sample space S of an experiment. If, out of these N outcomes, only n are favorable to the event
E, then the probability that the event E will occur is $P(E) = \frac{n(E)}{n(S)} = \frac{n}{N}$.
Examples:
a) What is the probability of getting number 6 in rolling a die?
b) What is the probability of getting two heads in tossing two coins?
c) A family plans to have three children. Describe the sample space for all possible gender
combinations. What is the probability that the family will have two boys?
d) A die is rolled. What is the probability of getting
i. An odd number.
ii. Number greater than 3.
e) Two dice are rolled. Describe the sample space. What is the probability of getting
i. A sum of 10 or more.
ii. A pair which at least one number is 3.
iii. A sum of 8, 9 or 10.
iv. One number less than 4.
Solutions:
a) S={1, 2, 3, 4, 5, 6} and E=getting number 6={6}. Thus n(S)=6 and n(E)=1
P(E)=n(E)/n(S)=1/6
b) S={HH, HT, TH, TT}and E={HH}. Thus n(S)=4 and n(E)=1
P(E)=n(E)/n(S)=1/4
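The classical definition can be sketched in Python by listing the sample space and counting favorable outcomes. This is an illustration under the assumption that all outcomes are equally likely; the helper name `classical_probability` is our own, and `Fraction` keeps the answers exact.

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, event):
    """P(E) = n(E) / n(S) for equally likely outcomes."""
    outcomes = list(sample_space)
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

die = range(1, 7)
two_dice = list(product(die, repeat=2))          # 36 ordered pairs

p_six = classical_probability(die, lambda x: x == 6)
p_two_heads = classical_probability(product("HT", repeat=2),
                                    lambda o: o == ("H", "H"))
p_odd = classical_probability(die, lambda x: x % 2 == 1)
p_sum_10_or_more = classical_probability(two_dice, lambda o: sum(o) >= 10)
print(p_six, p_two_heads, p_odd, p_sum_10_or_more)
```

The remaining parts of examples d) and e) can be checked by passing a different condition as the `event` argument.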
The difference between classical and empirical probability is that the former uses sample space
to determine the numerical probability while the latter is based on frequency distribution.
Grade A B C D F
No of students 10 20 50 15 5
Outcome 1 2 3 4 5 6
∑Pi=1/6+1/6+1/6+1/6+1/6+1/6=1
4. If there are two events E1 and E2, the probability that at least one of these events will occur
is the sum of the probability that each event will occur minus the probability that both events
will occur at the same time (simultaneously).
P(E1 ∪ E2)=P(E1)+P(E2)-P(E1 ∩ E2)
Examples:
i. Assume that 50 students take exams in statistics and economics. Out of these
students, 20 passed in statistics, 15 passed in economics and 18 failed in both subjects. If one of
these students is selected at random, find the probability that the student:
A. Passed in both exams.
B. Failed only in statistics.
C. Failed in statistics or economics.
ii. An MBA applies for a job in two firms X and Y. The probability of his being selected in firm
X is 0.7 and of being rejected at Y is 0.5. The probability of at least one of his applications being
rejected is 0.6. What is the probability that he will be selected in one of the firms?
Solutions: Let E=the event that the student passes in statistics.
F=the event that the student passes in economics.
From this, P(E)=20/50, P(F)=15/50, P(E' ∩ F')=18/50
A. P(E ∩ F)=P(E)+P(F)-P(E ∪ F), but P(E ∪ F)=1-P((E ∪ F)')=1-P(E' ∩ F')=1-18/50=32/50
=20/50+15/50-32/50=3/50
B. P(E' ∩ F)=P(F)-P(E ∩ F)=15/50-3/50=12/50
C. P(E' ∪ F')=P(E')+P(F')-P(E' ∩ F')
=30/50+35/50-18/50=47/50
ii. Let E=the event that the person is selected in firm X.
Cardinality of B is given by
$N(B) = \binom{8}{2} = 28$, then $P(B) = \frac{N(B)}{N(S)} = \frac{28}{153} = 0.183$
(b). Notice that B is a proper subset of A (B ⊂ A). Hence
$P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{P(B)}{P(A)} = \frac{0.183}{0.32}$
P (B/A) = 0.572
This indicates that the conditional probability of B given A is about 0.572.
5.5.2. Independent events
Two events are said to be independent if the occurrence of one does not affect the occurrence of
the other. If A and B are independent, the probability of A occurring is in no way affected by
event B having occurred, or vice versa. Hence, P(A ∩ B)=P(A).P(B)
Theorem: Let A and B be independent events, then
I. A and B' are independent.
II. A' and B' are independent.
III. A' and B are independent.
IV. P(A/B)=P(A), P(B)>0 and P(B/A)=P(B), P(A)>0
Proof
I. P(A ∩ B')=P(A)-P(A ∩ B)
=P(A)-P(A).P(B)
=P(A)[1-P(B)]
=P(A).P(B')
So, they are independent.
II. P(A' ∩ B')=P((A ∪ B)')
=1-P(A ∪ B)
=1-[P(A)+P(B)-P(A ∩ B)]
=1-P(A)-P(B)+P(A ∩ B)
=P(A')-P(B)+P(A).P(B)
=P(A')-P(B)[1-P(A)]
=P(A')-P(B).P(A')
=P(A')[1-P(B)]
=P(A').P(B')
So, they are independent.
Example:
i. An urn contains 6 white and 3 black balls. Three balls are drawn. What is the probability that
all the drawn balls will be black?
A. If the selection is done with replacement.
B. If the selection is without replacement.
Solutions
Total balls N=9, number of black balls n=3
Let E1=the first ball selected is black, E2=the second ball selected is black and E3=the third
ball selected is black.
A. With replacement the draws are independent:
P(E1 ∩ E2 ∩ E3)=P(E1)P(E2)P(E3)=3/9*3/9*3/9=0.0370
B. Without replacement:
P(E1 ∩ E2 ∩ E3)=P(E1)P(E2/E1)P(E3/E1 ∩ E2)=3/9*2/8*1/7=0.0119
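The two urn calculations differ only in whether the composition of the urn changes between draws. A minimal Python sketch, with a hypothetical helper name `all_black`, makes the difference explicit:

```python
from fractions import Fraction

def all_black(black, total, draws, replacement):
    """Probability that every one of `draws` balls drawn is black."""
    p = Fraction(1)
    for k in range(draws):
        if replacement:
            p *= Fraction(black, total)           # urn is unchanged each draw
        else:
            p *= Fraction(black - k, total - k)   # one black ball fewer each draw
    return p

with_replacement = all_black(3, 9, 3, replacement=True)
without_replacement = all_black(3, 9, 3, replacement=False)
print(with_replacement, round(float(with_replacement), 4))
print(without_replacement, round(float(without_replacement), 4))
```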
ii. A coin is tossed and a die is rolled. What is the probability of getting a head on the coin or
number 4 on the die?
Solution: Let A=getting a head on the coin, so P(A)=1/2. Let B=getting number 4 on the
die, so P(B)=1/6.
P(A ∪ B)=P(A)+P(B)-P(A ∩ B). But P(A ∩ B)=P(A) x P(B), since A and B are independent.
P(A ∪ B)=1/2+1/6-1/2x1/6=7/12=0.5833
Exercise
1. A, B, C are three mutually exclusive and exhaustive events .find p (B) if
1/3P (A) =1/2P (B) =P(C).
2. A part time student is taking two courses, namely Economics and Statistics. The probability
that the student will pass economics course is 0.60 and the probability of passing statistics
course is 0.70. The probability that the student will pass both courses is 0.50. Find the
probability that the student
a. Will pass at least one course.
b. Will fail both courses.
3. A certain travel club has 1000 members. 60%of these members are males. 45% of these
members pay by credit card when they travel including 175 females. If a member is selected
from the travel club at random, what is the probability that :
Probability Distribution is a listing of all possible values of a random variable together with
their corresponding probabilities. Based on the type of a random variable, a probability
distribution can be discrete or continuous.
probability $p(x_i)$ is associated. The numbers $p(x_i)$, i = 1, 2, ... must satisfy the following conditions.
$0 \le p(x_i) \le 1$
∑P(X=xi) =1
The function p defined above is called the probability mass function (pmf) of the random variable
X. The collection of pairs $(x_i, p(x_i))$, i = 1, 2, ... is called the probability distribution of X.
Examples:
1. Construct a probability distribution for the number of heads observed in tossing a coin
two times.
2. Construct a probability distribution for the number of heads observed in tossing a coin
three times.
3. Construct a probability distribution for the number of girls if a family plans to have four
children.
Solutions:
Let X be the number of heads observed in tossing a coin two times. Rx={0, 1, 2}
x       0     1     2     Total
P(x)    1/4   2/4   1/4   1
2. S={HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
Let X be the number of heads observed in tossing a coin three times. Rx={0, 1, 2, 3}
x       0     1     2     3     Total
P(x)    1/8   3/8   3/8   1/8   1
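These tables can be generated by enumerating the sample space and counting heads in each outcome. The sketch below assumes a fair coin, so that every sequence is equally likely; the function name `heads_distribution` is our own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(tosses):
    """Probability distribution of the number of heads in `tosses` fair tosses."""
    outcomes = list(product("HT", repeat=tosses))      # the sample space S
    counts = Counter(o.count("H") for o in outcomes)   # outcomes per value of X
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for n in (2, 3):
    dist = heads_distribution(n)
    print(n, {x: str(p) for x, p in dist.items()}, sum(dist.values()))
```

The same enumeration with four positions and the labels "G" and "B" would give the distribution of the number of girls in Example 3, assuming both genders are equally likely.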
Let X be the number of successes. Then X follows a binomial distribution with parameters n, the
number of trials performed, and p, the probability of success, written X~Bin(n,p). Then
the probability of getting exactly x successes in n trials is given by:
$P(X = x) = \binom{n}{x} p^x q^{n-x}$, x = 0, 1, 2, ..., n.
Where p is the probability of success
q=1-p is the probability of failure
n is number of trials
x is number of successes.
This is called the Binomial Distribution. The mean of a binomial distribution is E(X)=np and
variance is V(X)=npq.
Examples:
1. Suppose a coin is tossed 10 times. What is the probability of getting
a) Exactly 3 heads
b) No head
c) At most 3 heads
d) At least 3 heads
e) More than 3 heads
Find the average and variance of the number of heads.
2. The probability of a man kicking into the goal is 2/3. If a person kicks 5 times, what is
the probability of scoring
a) At least one goal.
b) At most 3 goals.
Find the average, variance and standard deviation of the number of goals.
3. If the mean and variance of a binomial distribution are 4 and 2 respectively, find the
probability that:
A. Exactly two successes appear.
B. Less than two successes appear.
C. More than six successes appear.
D. At least two successes appear.
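Solution:
Let X be the number of heads observed in tossing a fair coin 10 times, Rx= {0, 1, 2,…, 10}
$P(X = x) = \binom{n}{x} p^x q^{n-x} = \binom{10}{x} 0.5^x \, 0.5^{10-x} = \binom{10}{x} 0.5^{10}$, x = 0, 1, 2, ..., 10
a) $P(X = 3) = \binom{10}{3} \left(\frac{1}{2}\right)^{10}$
b) $P(X = 0) = \binom{10}{0} \left(\frac{1}{2}\right)^{10}$
c) $P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)$
d) $P(X \ge 3) = P(X = 3) + P(X = 4) + ... + P(X = 10) = 1 - P(X < 3)$
e) $P(X > 3) = P(X = 4) + P(X = 5) + ... + P(X = 10) = 1 - P(X \le 3)$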
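The expressions above can be evaluated numerically with a short Python sketch. The function name `binomial_pmf` is our own; it implements the binomial formula directly using `math.comb` (Python 3.8 or later).

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5
exactly_3 = binomial_pmf(3, n, p)
no_head = binomial_pmf(0, n, p)
at_most_3 = sum(binomial_pmf(x, n, p) for x in range(0, 4))
at_least_3 = 1 - sum(binomial_pmf(x, n, p) for x in range(0, 3))
more_than_3 = 1 - at_most_3
mean, variance = n * p, n * p * (1 - p)   # E(X) = np, V(X) = npq

print(exactly_3, no_head, at_most_3, at_least_3, more_than_3)
print(mean, variance)
```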
Examples:
1. On average a typist commits 3 errors per page. Find the probability that she will make
a) No mistake.
b) More than one mistake.
2. Customer arrive at a photocopying machine at an average rate of two every 10 minutes.
What is the probability that there will be
a) No arrivals during any period of ten minutes.
b) Exactly one arrival during these time period.
c) More than two arrivals during this time period.
Solution:
Examples:
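1. Show that each of the following functions is a pdf.
a. $f(x) = \begin{cases} 1, & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$
b. $f(x) = \begin{cases} e^{-x}, & x > 0 \\ 0, & \text{otherwise} \end{cases}$
2. Find the value of b for the following function to be a pdf.
$f(x) = \begin{cases} bx^2, & 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$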
Normal Distribution
The most often used continuous probability distribution is the normal distribution. This
distribution plays a very important role in statistical theory and practice, particularly in the area
of statistical inference and statistical quality control. Its importance is due to the fact that in
practice, the experimental results, very often seem to follow the normal distribution or bell
shaped curve.
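A random variable X is said to have a normal distribution if its probability density function is
given by
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, $-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$
Where $\mu = E(X)$ and $\sigma^2 = \mathrm{Variance}(X)$.
$\mu$ and $\sigma^2$ are the parameters of the normal distribution.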
1. It is bell shaped, symmetrical about its mean, and mesokurtic. The maximum
ordinate is at $x = \mu$ and is given by $f(x) = \frac{1}{\sigma\sqrt{2\pi}}$
2. It is asymptotic to the axis, i.e., it extends indefinitely in either direction from the mean.
3. It is a continuous distribution.
4. It is a family of curves, i.e., every unique pair of mean and standard deviation defines a
different normal distribution. Thus, the normal distribution is completely described by two
parameters: mean and standard deviation. .
5. Total area under the curve sums to 1, i.e., the area of the distribution on each side of the
mean is 0.5.
6. It is uni -modal, i.e., values mound up only in the center of the curve. i.e.
mean=median=mode
The probability that a random variable will have a value between any two points is equal to the
area under the curve between those points.
Note: To facilitate the use of the normal distribution, the following distribution, known as the
standard normal distribution, was derived by using the transformation
$Z = \frac{X - \mu}{\sigma}$
$f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}$
Properties of the Standard Normal Distribution:
Mean is zero
Variance is one
Standard Deviation is one
The total area under the (standard) normal curve is 1. Hence, the area to the right and left
of the center value (µ=0) of the standard normal distribution is 0.5 (as it is symmetric
about 0).
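Examples:
1. Find the area under the standard normal distribution which lies
a) Between Z = 0 and Z = 0.96
Solution:
Area = P(0 < Z < 0.96)
= 0.3315
b) Between Z = -1.45 and Z = 0
Solution:
Area = P(-1.45 < Z < 0)
= P(0 < Z < 1.45)
= 0.4265
c) To the right of Z = -0.35
Solution:
Area = P(Z > -0.35)
= P(-0.35 < Z < 0) + P(Z > 0)
= P(0 < Z < 0.35) + P(Z > 0)
= 0.1368 + 0.50 = 0.6368
d) To the left of Z = -0.35
Solution:
Area = P(Z < -0.35)
= 1 - P(Z > -0.35)
= 1 - 0.6368 = 0.3632
2. Find the value of z if P(Z < z) = 0.9868
Solution
P(Z < z) = 0.9868
= P(Z < 0) + P(0 < Z < z)
= 0.50 + P(0 < Z < z)
P(0 < Z < z) = 0.9868 - 0.50 = 0.4868
and from the table
P(0 < Z < 2.22) = 0.4868
z = 2.22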
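The table look-ups in the examples above can be cross-checked with Python's standard library. This is a minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative only, and the printed table values may differ from the computed ones in the last decimal place.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)  # standard normal distribution

# Area between Z = 0 and Z = 0.96
area_0_to_096 = Z.cdf(0.96) - Z.cdf(0.0)

# Area between Z = -1.45 and Z = 0 (equal to P(0 < Z < 1.45) by symmetry)
area_m145_to_0 = Z.cdf(0.0) - Z.cdf(-1.45)

# Area to the right and to the left of Z = -0.35
area_right_of_m035 = 1.0 - Z.cdf(-0.35)
area_left_of_m035 = Z.cdf(-0.35)

# Value of z with P(Z < z) = 0.9868 (inverse look-up)
z_value = Z.inv_cdf(0.9868)

print(round(area_0_to_096, 4), round(area_m145_to_0, 4))
print(round(area_right_of_m035, 4), round(area_left_of_m035, 4))
print(round(z_value, 2))
```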
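3. A random variable X has a normal distribution with mean 80 and standard deviation 4.8.
What is the probability that it will take a value
a) less than 87.2?
b) greater than 76.4?
c) between 81.2 and 86.0?
Solution
a) $P(X < 87.2) = P\left(\frac{X-\mu}{\sigma} < \frac{87.2-\mu}{\sigma}\right)$
$= P\left(Z < \frac{87.2-80}{4.8}\right)$
$= P(Z < 1.5)$
$= P(Z < 0) + P(0 < Z < 1.5)$
$= 0.50 + 0.4332 = 0.9332$
b) $P(X > 76.4) = P\left(\frac{X-\mu}{\sigma} > \frac{76.4-\mu}{\sigma}\right)$
$= P\left(Z > \frac{76.4-80}{4.8}\right)$
$= P(Z > -0.75)$
$= P(Z > 0) + P(0 < Z < 0.75)$
$= 0.50 + 0.2734 = 0.7734$
c) $P(81.2 < X < 86.0) = P\left(\frac{81.2-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{86.0-\mu}{\sigma}\right)$
$= P\left(\frac{81.2-80}{4.8} < Z < \frac{86.0-80}{4.8}\right)$
$= P(0.25 < Z < 1.25)$
$= P(0 < Z < 1.25) - P(0 < Z < 0.25)$
$= 0.3944 - 0.0987 = 0.2957$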
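4. A normal distribution has mean 62.4. Find its standard deviation if 20.05% of the area under
the normal curve lies to the right of 72.9.
Solution
$P(X > 72.9) = 0.2005 \Rightarrow P\left(\frac{X-\mu}{\sigma} > \frac{72.9-\mu}{\sigma}\right) = 0.2005$
$\Rightarrow P\left(Z > \frac{72.9-62.4}{\sigma}\right) = P\left(Z > \frac{10.5}{\sigma}\right) = 0.2005$
$\Rightarrow P\left(0 < Z < \frac{10.5}{\sigma}\right) = 0.50 - 0.2005 = 0.2995$
And from the table, $P(0 < Z < 0.84) = 0.2995$
$\Rightarrow \frac{10.5}{\sigma} = 0.84 \Rightarrow \sigma = 12.5$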
5. A random variable has a normal distribution with $\sigma = 5$. Find its mean if the probability
that the random variable will assume a value less than 52.5 is 0.6915.
Solution
$P(Z < z) = P\left(Z < \frac{52.5-\mu}{5}\right) = 0.6915$
$\Rightarrow P(0 < Z < z) = 0.6915 - 0.50 = 0.1915$
But from the table
$P(0 < Z < 0.5) = 0.1915$
$\Rightarrow z = \frac{52.5-\mu}{5} = 0.5$
$\Rightarrow \mu = 50$
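Examples 3 to 5 can also be checked without a table. The sketch below is an illustration only, assuming Python 3.8 or later for `statistics.NormalDist`; the variable names are made up here. Small differences from the worked answers come from the rounding in the printed table.

```python
from statistics import NormalDist

# Example 3: X is normal with mean 80 and standard deviation 4.8
X = NormalDist(mu=80, sigma=4.8)
p_less_87_2 = X.cdf(87.2)                # P(X < 87.2)
p_greater_76_4 = 1 - X.cdf(76.4)         # P(X > 76.4)
p_between = X.cdf(86.0) - X.cdf(81.2)    # P(81.2 < X < 86.0)

z = NormalDist()  # standard normal, used for inverse look-ups

# Example 4: mean 62.4 and P(X > 72.9) = 0.2005, solve for sigma
sigma = (72.9 - 62.4) / z.inv_cdf(1 - 0.2005)

# Example 5: sigma = 5 and P(X < 52.5) = 0.6915, solve for the mean
mu = 52.5 - 5 * z.inv_cdf(0.6915)

print(round(p_less_87_2, 4), round(p_greater_76_4, 4), round(p_between, 4))
print(round(sigma, 1), round(mu, 1))
```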
CHAPTER SIX
6. Sampling Techniques
6.1. Basic concepts: population, sample, parameter, statistic, sampling frame, sampling unit
Population: It is the totality of subjects or things possessing certain common
characteristics that we are interested in studying. It is a collection of all the units under
investigation over a given space or time. The population should be defined by the investigator
on the basis of the objective of the study. Examples: the total household population of a village
or country; the total number of plants in a field; the total number of patients with a certain disease.
A sample: consists of elements selected from a population with statistical methods for the
purpose of investigation and with the aim of estimating the characteristics of the population. It is
a subgroup or part of the population selected by some method in order to estimate population
characteristics.
Sampling units: for the purpose of sample selection, the population is divided into a finite
number of distinct, non-overlapping and identifiable units called sampling units. For example, in
cluster sampling, clusters are sampling units and subjects in the cluster are elementary units.
Frame: once a population has been defined, the next step is to establish a means to access it. A
frame provides this means to access it. In its simplest form, a frame is a list of elements covering
the survey population, and serves as a base for sample selection.
Population Parameters: These are facts about the population. Since parameters are descriptions
of the population, a population can have many parameters.
Parameter: is a measure computed from all the observations in the population. Example:
population mean and population variance.
Sampling: is a statistical process in which one selects and examines a sample instead of
considering the whole population. In other words, it is a valid statistical procedure for drawing
a sample from the population.
A sampling frame is a list of units or elements that defines the target population.
In practice a sample can only be a collection of elements from sampling units drawn from
a sampling frame. Many times the sampling frame and the sampling unit are derived
from administrative data. The definition of the unit may be based on some natural criteria
in administrative data, e.g., households, persons, units of product, tickets, etc.
In many cases sampling is the only way to determine something about the population. Some of
the major reasons why sampling is necessary are:
complicated to set up and are acceptable when there is no intention or need to make a
statistical generalization to any population beyond the sample surveyed. Non probability
sampling is well suited for exploratory research intended to generate new ideas that will be
systematically tested later. However, if the goal is to learn about a large population, it is
imperative to avoid judgment or other non-probability samples in survey research. Examples of
non-probability sampling are convenience sampling, quota sampling, purposive sampling
and snowball sampling.
i. Quota sampling: Here the strategy is to obtain representatives of the various elements of a
population, usually in the relative proportions in which they occur in the population. Quota
sampling is a special form of stratified sampling. According to this method, the population
is first divided into different strata. Then the number to be selected from each stratum is
decided. This number is known as the quota.
ii. Purposive sampling: In purposive sampling, sampling is done with a purpose in mind. We
usually would have one or more specific predefined groups we are seeking. The principle
of selection in purposive sampling is the researcher's judgment as to typicality or interest.
A sample is built up which enables the researcher to satisfy his/her specific needs in a
research project. Accordingly, when the researcher deliberately or purposively selects
certain units for study from the population it is known as purposive selection. In this type
of selection the choice of the selector is supreme and nothing is left to chance.
iii. Convenience sampling: In many research contexts, we sample simply by asking for
volunteers. It involves choosing the nearest and most convenient persons to act as
respondents. The process continues until the required sample size is reached. It is
sometimes used as a cheap and dirty way of doing a sample survey. You do not know
whether or not the findings are representative. This is probably one of the most widely used
and least satisfactory methods of sampling. According to this system, a sample is selected
according to the convenience of the field workers or researchers. The convenience may be in
respect of availability of a source list and accessibility of the units. It is used when the universe
or population is not clearly defined, the sampling unit is not clear, or a complete source list is
not available.
iv. Snowball sampling: Here the researcher identifies one or more individuals from the
population of interest. After they have been interviewed, they are used as informants to
identify other members of the population, who are themselves used as informants, and so
on. Snowball sampling is useful when there is difficulty in identifying members of the
population, e.g. when this is a clandestine group.
6.3.2. Probability Sampling: Basic Concepts and Definitions
Probability sampling refers to sampling techniques for which a person's (or event's) likelihood
of being selected for membership in the sample is known. In most cases,
researchers who use probability sampling techniques are aiming to identify a representative
sample from which to collect data. A representative sample is one that resembles the population
from which it was drawn in all the ways that are important for the research being conducted.
Obtaining a representative sample is important in probability sampling because a key goal of
studies that rely on probability samples is generalizability. In order to achieve generalizability, a
core principle of probability sampling is that all elements in the researcher's target population
have an equal chance of being selected for inclusion in the study. In research, this is the principle
of random selection. Random selection is a mathematical process that must meet two criteria.
The first criterion is that chance governs the selection process. The second is that every sampling
element has an equal probability of being selected.
There are a variety of probability sampling techniques that researchers may use.
Simple random samples are the most basic type of probability sample, but their use is not
particularly common. Part of the reason for this may be the work involved in generating a simple
random sample. To draw a simple random sample, a researcher starts with a list of every single
member, or element, of his or her population of interest. This list is sometimes referred to as
a sampling frame. Once that list has been created, the researcher numbers each element
sequentially and then randomly selects the elements from which he or she will collect data. To
randomly select elements, researchers use a table of numbers that have been generated
randomly.
If properly conducted, this gives each person an equal chance of being included in the sample,
and also makes all possible combinations of persons for a particular sample size equally likely.
So, random sampling is the form applied when the method of selection assures each element or
individual in the population an equal chance of being chosen. It is more suitable in more
homogeneous and comparatively larger groups. A random sample can be drawn either by the lottery
method or by using a random number table. If the population is small we can easily choose a
simple random sample (SRS) by the lottery method: the units to be included in the sample are
chosen by a lottery. The lottery method can only be used if the population is not very large. If we
have a large population we can perform the same procedure using a computer or a table of random
numbers.
Lottery method
List the N individual elements of a finite population, and then take a random sample by choosing
the elements to be included in the sample one at a time without replacement, making sure that in
each of the successive drawings each of the remaining elements of the population has the same
chance of being selected. For instance, to take a random sample of size 12 from a population of
N = 247 we could write each of the 247 numbers on a slip of paper, mix them up thoroughly in a
bag, a box, or a hat, and draw (without looking) twelve slips one at a time without replacement.
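The lottery method described above can be imitated on a computer. The following is a minimal sketch, not a prescribed procedure: the function name `lottery_sample` and the seed value are assumptions made for illustration, and `random.sample` plays the role of drawing slips without replacement.

```python
import random

def lottery_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct labels from 1..population_size."""
    rng = random.Random(seed)                        # seed only makes the draw repeatable
    slips = list(range(1, population_size + 1))      # one numbered slip per element
    return rng.sample(slips, sample_size)            # draws without replacement

# A sample of size 12 from a population of N = 247, as in the example above
sample = lottery_sample(247, 12, seed=2023)
print(sorted(sample))
```

Because the draw is without replacement, no element can appear twice, and every remaining slip has the same chance at each draw.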
Table of random numbers
Random number tables are constructed in such a way that every number occurs with equal
chance. Further, the occurrence of any one number in a position is independent of any of the
other numbers that appear in the table. To use a table of random numbers, number the N elements
in the population from 1 to N. Then turn to the table of random numbers and select a starting
number in the table. Proceeding from this number either across the row or down the column,
select and record n numbers that are less than or equal to N from the table. The numbers in the
table may have many digits, but we consider only the first m digits, where m is the number of
digits in N.
Example: The money section of USA Today gives the 1,900 most active New York Stock
Exchange issues. The random numbers in the table below can be used to randomly select 10 of
these issues. Imagine that the issues are numbered from 0001 to 1900. Suppose we randomly
decide to start in row 1 and columns 21 through 24. The four-digit number located here is 0345.
Reading down these four columns and discarding any number exceeding 1900, we obtain the
following eight random numbers between 0001 and 1900: 0345, 1304, 0990, 1580, 1461, 1064,
0676, and 0347. To obtain our other two numbers, we proceed to row 1 and columns 26 through
29. Reading down these columns, we find 1149 and 1074. The 10 stock issues selected are
therefore the ones located in positions 345, 347, 676, 990, 1064, 1074, 1149,
1304, 1461, and 1580.
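The reading rule used with a random number table (take m-digit numbers, discard any that exceed N) can be sketched in code. This is an illustration under stated assumptions: the m-digit numbers come from Python's pseudo-random generator rather than a printed table, the function name `table_sample` is hypothetical, and skipping a number that was already chosen is an assumption the text does not state.

```python
import random

def table_sample(population_size, sample_size, seed=None):
    """Select labels in 1..population_size by reading m-digit random numbers."""
    rng = random.Random(seed)
    digits = len(str(population_size))   # m = number of digits in N
    chosen = []
    while len(chosen) < sample_size:
        number = rng.randrange(10 ** digits)   # next m-digit "table entry", e.g. 0000..9999
        # Discard 0, numbers exceeding N, and repeats (the last is an assumption)
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
    return sorted(chosen)

# Ten of the 1,900 issues, numbered 0001 to 1900
issues = table_sample(1900, 10, seed=1)
print(issues)
```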
Stratified sampling
This involves dividing the population into a number of groups or strata, and a sample is selected
from each stratum. The elements in a stratum are supposed to be homogeneous with respect to a
given characteristic, but to differ in that characteristic from the elements in the other strata.
After the population has been divided into strata, either a proportional or non-proportional
sample can be selected. As the name implies, a proportional sampling procedure requires that the
number of items sampled from each stratum be in the same proportion as found in the population.
In a non-proportional stratified sample, the number of items studied in each stratum is
disproportionate to the respective numbers in the population. We then weight the sample results
according to the stratum's proportion of the total population.
Example: Suppose you want to take a sample of 200 learners from a college called
CAES to study their performance. Suppose, further, that there are six departments with
the respective numbers of learners as shown in the table below.
Department                 Number of learners
Agricultural Economics     96
Animal Science             51
ABVM                       81
RDAI                       42
TOTAL                      600
Under proportional allocation, the sample size for stratum i is $n_i = n \times \frac{N_i}{N}$,
where $N_i$ is the number of learners in the stratum, N = 600 and n = 200.
The sum of the sample sizes becomes 200, in this case, because we round the
decimal places up to the next integer, to get the benefit of the added sample size.
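Proportional allocation, with fractions rounded up as described above, can be sketched as follows. Only the four department sizes that appear in the table are used; the function name `proportional_allocation` is an assumption made for illustration.

```python
def proportional_allocation(stratum_sizes, population_size, sample_size):
    """Allocate the sample to strata in proportion to stratum size, rounding up."""
    allocation = {}
    for name, size in stratum_sizes.items():
        # Ceiling of sample_size * size / population_size, done in integer arithmetic
        allocation[name] = -(-sample_size * size // population_size)
    return allocation

# Department sizes listed in the table (total population N = 600, sample n = 200)
listed_departments = {
    "Agricultural Economics": 96,
    "Animal Science": 51,
    "ABVM": 81,
    "RDAI": 42,
}
allocation = proportional_allocation(listed_departments, population_size=600, sample_size=200)
print(allocation)
```

Since 200/600 = 1/3, each listed department contributes one third of its learners: 32, 17, 27 and 14 respectively.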
Cluster sampling
If the total area of interest happens to be a big one, a convenient way in which a sample can be
taken is to divide the area into a number of smaller non-overlapping areas and then to randomly
select a number of these smaller areas (usually called clusters), with the ultimate sample
consisting of all (or samples of) units in these small areas or clusters. Thus in cluster sampling
the total population is divided into a number of relatively small subdivisions which are
themselves clusters of still smaller units and then some of these clusters are randomly selected
for inclusion in the overall sample. Suppose we want to estimate the proportion of machine parts
in an inventory which are defective. Also assume that there are 20,000 machine parts in the
inventory at a given point of time, stored in 400 cases of 50 each. Now, using cluster sampling,
we would consider the 400 cases as clusters, randomly select n cases and examine all the
machine parts in each randomly selected case.
Cluster sampling, no doubt, reduces cost by concentrating surveys in selected clusters. But
certainly it is less precise than random sampling. There is also not as much information in n
observations within a cluster as there happens to be in n randomly drawn observations. Cluster
sampling is used only because of the economic advantage it possesses; estimates based on cluster
samples are usually more reliable per unit cost.
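The machine-parts example can be sketched in code to show that the random selection happens at the level of cases (clusters), not individual parts. This is only an illustration: the function name `cluster_sample`, the choice of 8 cases and the seed are assumptions, since the text leaves n unspecified.

```python
import random

def cluster_sample(num_clusters, cluster_size, clusters_to_draw, seed=None):
    """Randomly select whole clusters and return every unit inside them."""
    rng = random.Random(seed)
    # Step 1: select clusters (cases) at random, without replacement
    chosen_cases = sorted(rng.sample(range(num_clusters), clusters_to_draw))
    # Step 2: examine every part in each selected case
    parts = [(case, part) for case in chosen_cases for part in range(cluster_size)]
    return chosen_cases, parts

# 400 cases of 50 parts each; drawing 8 cases is a hypothetical choice of n
cases, parts = cluster_sample(400, 50, clusters_to_draw=8, seed=7)
print(cases)
print(len(parts))
```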
CHAPTER SEVEN
7. SIMPLE LINEAR REGRESSION AND CORRELATION
7.1. Simple Linear Regression
Regression may be defined as the estimation of the unknown value of one variable from the
known values of one or more variables. The variable whose values are to be estimated is known
as the dependent or explained variable, while the variables which are used in determining the
value of the dependent variable are called independent or predictor variables. The regression study
that involves only two variables is called simple regression, and the regression analysis that studies
more than two variables is called multiple regression. If the relationship between the two
variables can be described by a straight line then the regression is known as linear regression;
otherwise it is called non-linear. The regression analysis involving only two variables and having
a linear relationship is called Simple Linear Regression. This linear relationship between the two
variables is represented by a straight line.
Regression Line (Line of Regression): is the line that gives the best estimate of one variable for
any given value of another variable. The regression line which is used to estimate the values of Y
for any given value of X is called regression line of Y on X.
Regression Equation: is a mathematical equation that defines the relationship between two
variables.
Regression of Y on X
Model: Y = α + βX + ε
Where Y is the dependent variable
X is the independent variable
α is the intercept
β is the slope
ε is the error term
Its parameters are interpreted as follows:
α is the value of the dependent variable when the value of the independent variable is zero.
β is the change in the value of the dependent variable when the value of the independent
variable increases by 1 unit. There is a direct linear relationship between the two variables if
β is positive, there is an inverse linear relationship between the two variables if β is negative,
and there is no linear relationship between the two variables if β is zero.
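The estimates of the parameters are given by
$$\hat{\beta} = \frac{n\sum XY - \sum X \sum Y}{n\sum X^{2} - \left(\sum X\right)^{2}} \quad \text{and} \quad \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$
The sample covariance between X and Y is
$$S_{xy} = \frac{1}{n-1}\left[\sum XY - \frac{\sum X \sum Y}{n}\right]$$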
Pearson’s Coefficient of Correlation (r)
The coefficient of correlation is a measure of the degree or strength of the linear association
between two variables. It is defined as a ratio of the covariance between the two variables and
the product of the standard deviations of the two variables. The sample correlation coefficient is
denoted by r and the population correlation coefficient is denoted by ρ.
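$$r = \frac{S_{xy}}{S_x S_y} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}$$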
Examples
The ranks of some 10 students in two courses; Statistics and Economics are given below.
Calculate the rank correlation and interpret it.
Statistics 5 2 9 8 1 10 3 4 6 7
Economics 10 5 1 3 8 6 2 7 9 4
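One way to work this example is sketched below, under stated assumptions: because the ranks contain no ties, applying the correlation formula above directly to the ranks gives the rank correlation, and the same value follows from the usual rank-difference formula 1 - 6∑d²/(n(n² - 1)), which does not appear in the text above and is introduced here only as a cross-check. The function and variable names are illustrative.

```python
from math import sqrt

statistics_rank = [5, 2, 9, 8, 1, 10, 3, 4, 6, 7]
economics_rank = [10, 5, 1, 3, 8, 6, 2, 7, 9, 4]

def pearson_r(x, y):
    """Correlation coefficient from raw sums, following the formula for r."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Correlation computed on the ranks themselves
rank_r = pearson_r(statistics_rank, economics_rank)

# Cross-check with the rank-difference formula (valid when there are no ties)
n = len(statistics_rank)
sum_d2 = sum((a - b) ** 2 for a, b in zip(statistics_rank, economics_rank))
rank_r_from_differences = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(round(rank_r, 3), round(rank_r_from_differences, 3))
```

Both routes give about -0.309, a weak negative association between the two sets of ranks.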
Interpretation of r: The value of the correlation coefficient can be positive, zero or negative,
depending on the sign of the covariance between the two variables. But it lies between the limits
-1 and +1; that is, -1≤r≤1.
If the value of r is -1 or +1, there is a perfect negative or perfect positive linear
relationship between the variables, respectively.
If the value of r is approximately -1 or +1, there is a strong negative or strong positive
linear relationship between the variables, respectively.
If r is -0.5 (or approximately -0.5) or 0.5 (or approximately 0.5), there is a moderate
negative or moderate positive linear relationship between the variables, respectively.
If the value of r is near zero, there is little or no linear relationship between the two variables.
Properties of Correlation Analysis
1. It doesn't describe the cause and effect relationship.
2. It is not used for prediction and estimation.
3. It is used to study the degree or extent of relationship of the variable.
Coefficient of determination (r²)
So far, we were concerned with the problem of estimating the parameters of the regression model
and the correlation coefficient between two variables. We now consider the goodness of fit of
the estimated model to a set of data; that is, we shall find out how “well” the estimated model fits
the data.
The coefficient of determination tells how well the estimated model fits the data. For simple
linear regression (two variables case), it is defined as the square of the sample correlation
coefficient, and denoted by r². Hence r² measures the proportion or percentage of the variation in
the dependent variable explained by the independent variable. Generally, r² is a nonnegative
quantity which lies between the limits 0 and 1, i.e., 0≤r²≤1. If it approaches 1, it means a good
fit, and if it approaches 0, it means there is no linear relationship between the variables.
Examples:
The following data were obtained in a study of age and blood pressure of six randomly selected
people.
Age              43   48   56   61   67   70
Blood pressure   128  120  135  143  141  152
A. Fit the regression line of blood pressure on age.
B. By how much does the blood pressure change per unit change in age?
C. Compute the correlation coefficient of blood pressure and age and also the coefficient of
determination, and interpret them.
D. Predict or estimate the blood pressure of somebody whose age is 80.
E. Interpret the regression coefficients.
Solution:
Since blood pressure depends on the age of individuals, the dependent variable is blood pressure
and age is the independent variable.
Age (X)   B.P (Y)   X²       Y²        XY
43        128       1849     16384     5504
48        120       2304     14400     5760
56        135       3136     18225     7560
61        143       3721     20449     8723
67        141       4489     19881     9447
70        152       4900     23104     10640
∑X=345    ∑Y=819    ∑X²=20,399   ∑Y²=112,443   ∑XY=47,634
The summarized data are
n = 6, ∑xi = 345, ∑yi = 819, ∑xi² = 20,399,
∑yi² = 112,443, ∑xiyi = 47,634 and ∑xi∑yi = 282,555
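A. The regression line is $\hat{Y} = \hat{B}_0 + \hat{B}_1 X$
$$\hat{B}_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^{2} - \left(\sum x_i\right)^{2}} = \frac{6(47{,}634) - (345)(819)}{6(20{,}399) - (345)^{2}} = \frac{3249}{3369} = 0.964$$
$$\hat{B}_0 = \bar{Y} - \hat{B}_1\bar{X} = \frac{819}{6} - \frac{3249}{3369}\cdot\frac{345}{6} = 81.048$$
Therefore, $\hat{Y} = \hat{B}_0 + \hat{B}_1 X$ gives
$\hat{Y} = 81.048 + 0.964X_i$; this is the fitted line of B.P on age.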
B. For each one-unit increase in age, the blood pressure increases by 0.964 on average.
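C. $$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^{2} - \left(\sum x_i\right)^{2}\right]\left[n\sum y_i^{2} - \left(\sum y_i\right)^{2}\right]}} = \frac{6(47{,}634) - (345)(819)}{\sqrt{\left[6(20{,}399) - (345)^{2}\right]\left[6(112{,}443) - (819)^{2}\right]}} = \frac{3249}{\sqrt{(3369)(3897)}}$$
r = 0.897; this value indicates that there is a strong positive linear relation between age and
blood pressure of individuals. The coefficient of determination is r² = (0.897)² ≈ 0.80, that is,
about 80% of the variation in blood pressure is explained by age.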
Another method of finding the correlation coefficient is
$$r = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}}$$
where r is the correlation coefficient,
$S_{XY}$ is the covariance between x and y,
$S_{XX}$ is the variance of x, and
$S_{YY}$ is the variance of y.
We can also determine the regression parameters using the above information:
$$\hat{B}_1 = \frac{S_{XY}}{S_{XX}} \quad \text{and then} \quad \hat{B}_0 = \bar{Y} - \hat{B}_1\bar{X}$$
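The whole age and blood pressure example, including the prediction asked for in part D, can be reproduced with a short script. This is a minimal sketch using only the raw-sum formulas from this chapter; the variable names are illustrative, and the prediction at age 80 lies outside the observed age range (43 to 70), so it should be read with caution.

```python
from math import sqrt

age = [43, 48, 56, 61, 67, 70]
blood_pressure = [128, 120, 135, 143, 141, 152]

n = len(age)
sum_x, sum_y = sum(age), sum(blood_pressure)
sum_xy = sum(x * y for x, y in zip(age, blood_pressure))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in blood_pressure)

# Slope and intercept of the regression line of blood pressure on age
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n

# Correlation coefficient and coefficient of determination
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2

# Part D: estimated blood pressure at age 80 (an extrapolation)
predicted_at_80 = intercept + slope * 80

print(round(slope, 3), round(intercept, 3))
print(round(r, 3), round(r_squared, 3))
print(round(predicted_at_80, 1))
```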
Exercise
1. Given the following data on supply (X) and sales (Y) of a certain commodity
Supply (X) 60 62 65 70 73 75 71
Sales (Y) 10 11 13 15 16 19 14
a) Estimate the regression equation of sales on supply and interpret the coefficients.
b) Calculate the correlation coefficient between supply and sales, and interpret it.
c) Find the coefficient of determination and interpret it.
d) Predict the amount of sales of the commodity if the supply amount is 80.
2. The following summary results are obtained from price and demand of a commodity
∑price = 30    ∑demand = 40    ∑(price)(demand) = 214
∑(price)² = 220    ∑(demand)² = 340    n = 5
a) Identify the dependent and independent variable.
b) Estimate the regression equation.
c) Interpret the estimated coefficients.
d) Calculate the correlation coefficient between price and demand, and interpret it.
e) Find the coefficient of determination and interpret it.
3. Given n = 25, $\bar{X}$ = 3.95, $\bar{Y}$ = 2.03, $S_x^{2}$ = 85.35, $S_y^{2}$ = 98.75, $S_{xy}$ = 90