Research Methodology and Statistical Analysis
(An UGC Autonomous Institution Approved by AICTE New Delhi & Affiliated to JNTU, Hyderabad
Accredited by NAAC with ‘A++’ Grade (Cycle III), NBA Tier-I Accredited
IIC Four-Star Rating, NIRF Ranking 210-250, ARIIA Band Performer)
Maisammaguda (H), Medchal-Malkajgiri District, Secunderabad, Telangana State-500100, www.mrec.ac.in
E-Content File
I MBA I Semester
Subject
RESEARCH METHODOLOGY AND STATISTICAL ANALYSIS
Code: C1E05
MBA-I Semester Paper Code: AB105
UNIT – I
Meaning Of Research:
Research in simple terms refers to search for knowledge. It is a scientific and systematic
search for information on a particular topic or issue. It is also known as the art of scientific
investigation. Several social scientists have defined research in different ways.
According to Redman and Mory (1923), research is a “systematized effort to gain new
knowledge”. It is an academic activity and therefore the term should be used in a technical sense.
According to Clifford Woody (Kothari, 1988), research comprises “defining and redefining
problems, formulating hypotheses or suggested solutions; collecting, organizing
and evaluating data; making deductions and reaching conclusions; and finally,
carefully testing the conclusions to determine whether they fit the formulated
hypotheses”.
Objectives Of Research:
is not only important for the researcher to know the research techniques/ methods, but also
the scientific approach called methodology.
Research Approaches:
There are two main approaches to research, namely quantitative approach and qualitative
approach. The quantitative approach involves the collection of quantitative data, which are put
to rigorous quantitative analysis in a formal and rigid manner. This approach further includes
experimental, inferential, and simulation approaches to research. Meanwhile, the qualitative
approach uses the method of subjective assessment of opinions, behaviour and attitudes. Research
in such a situation is a function of the researcher’s impressions and insights. The results
generated by this type of research are either in non-quantitative form or in a form which
cannot be put to rigorous quantitative analysis. Usually, this approach uses techniques like
in-depth interviews, focus group interviews, and projective techniques.
Types Of Research:
There are different types of research. The basic ones are as follows.
Meanwhile, in Analytical Research, the researcher has to use the
already available facts or information, and analyse them to make a critical
evaluation of the subject.
The research related to some abstract idea or theory is known as Conceptual Research.
Generally, philosophers and thinkers use it for developing new concepts or for reinterpreting the
existing ones. Empirical Research, on the other hand, exclusively relies on the observation or
experience with hardly any regard for theory and system. Such research is data based, and
often comes up with conclusions that can be verified through experiments or observation.
Empirical research is also known as experimental type of research, in which it is important to
first collect the facts and their sources, and actively take steps to stimulate the production of
desired information. In this type of research, the researcher first formulates a working
hypothesis, and then gathers sufficient facts to prove or disprove the stated hypothesis. He/she
formulates the experimental design, which according to him/her would manipulate the variables,
so as to obtain the desired information. This type of research is thus characterized by the
researcher’s control over the variables under study. In simple terms, empirical research is most
appropriate when an attempt is made to prove that certain variables influence other variables
in some way. Therefore, the results obtained from experimental or empirical studies are
considered to be the most powerful evidence for a given hypothesis.
The remaining types of research are variations of one or more of the afore-mentioned
types of research. They vary in terms of the purpose of research, or the time required to
complete it, or may be based on some
other similar factor. On the basis of time, research may either be in the nature
of one-time or longitudinal time series research. While the research is
restricted to a single time-period in the former case, it is conducted over several
time-periods in the latter case. Depending upon the environment in which
the research is to be conducted, it can also be laboratory research or field-
setting research, or simulation research, besides being diagnostic or clinical
in nature. Under such research, in-depth approaches or case study method
may be employed to analyse the basic causal relations. These studies usually
undertake a detailed in-depth analysis of the causes of certain events of
interest, and use very small samples and sharp data collection methods. The
research may also be explanatory in nature. Formalized research studies consist
of substantial structure and specific hypotheses to be verified. As regards
historical research, sources like historical documents, remains, etc. are utilized
to study past events or ideas. It also includes the philosophy of persons and groups
of the past or any remote point of time.
iii. The knowledge of research methodology equips the researcher with the tools that help
him/her to make the observations objectively; and
iv. The knowledge of methodology helps the research consumers to evaluate research and
make rational decisions.
Qualities Of A Researcher:
It is important for a researcher to possess certain qualities to conduct research. First and
foremost, he being a scientist should be firmly committed to the ‘articles of faith’ of the
scientific methods of research. This implies that a researcher should be a social science
person in the truest sense. Sir Michael Foster (cited by Wilkinson and Bhandarkar, 1979)
identified a few distinct qualities of a scientist. According to him, a true research scientist should
possess the following qualities:
(1) First of all, the nature of a researcher must be of the temperament that vibrates in
unison with the theme which he is searching. Hence, the seeker of knowledge must be truthful
with truthfulness of nature, which is much more important, much more exacting than what is
sometimes known as truthfulness. The truthfulness relates to the desire for accuracy of
observation and precision of statement. Ensuring facts is the principal rule of science, which is
not an easy matter. The difficulty may arise due to an untrained eye, which fails to see anything
beyond what it has the power of seeing and sometimes even less than that. This may also be
due to the lack of discipline in the method of science. An unscientific individual often remains
satisfied with expressions like approximately, almost, or nearly, which is never what
nature is. A real researcher cannot see two things which differ, however minutely, as the same.
(2) A researcher must possess an alert mind. Nature is constantly changing and revealing
itself through various ways. A scientific researcher must be keen and watchful to notice such
changes, no matter how small or insignificant they may appear. Such receptivity has to be
cultivated slowly and patiently over time by the researcher through practice. An individual who
is ignorant or not alert and receptive during his research will not make a good researcher. He
will fail as a good researcher if he has no keen eyes or mind to observe the unusual changes
behind the routine. Research
demands a systematic immersion into the subject matter by the researcher to grasp
even the slightest hint that may culminate into significant research problems.
In this context, Cohen and Nagel (cited by Selltiz et al., 1965; Wilkinson and
Bhandarkar, 1979) state that “the ability to perceive in some brute experience
the occasion of a problem is not a common talent among men… it is a mark of
scientific genius to be sensitive to difficulties where less gifted people pass by
untroubled by doubt”.
Significance Of Research:
analysis of needs and desires of people, and the availability of revenues, which requires research.
Research helps to formulate alternative policies, in addition to examining the consequences of
these alternatives. Thus, research also facilitates the decision-making of policy-makers,
although decision-making in itself is not a part of research. In the process, research also helps in the proper
allocation of a country’s scarce resources.
Research is also necessary for collecting information on the social and economic structure
of an economy to understand the process of change occurring in the country. Collection of
statistical information, though not a routine task, involves various research problems.
Therefore, a large staff of research technicians or experts is engaged by the government these
days to undertake this work. Thus, research as a tool of government economic policy
formulation involves three distinct stages of operation:
(i) investigation of economic structure through continual compilation of facts; (ii) diagnosis
of events that are taking place and analysis of the forces underlying them; and (iii) the prognosis
i.e., the prediction of future developments (Wilkinson and Bhandarkar, 1979).
Research also assumes significance in solving various operational and planning problems
associated with business and industry. In several ways, operations research, market research
and motivational research are vital and their results assist in taking business decisions. Market
research refers to the investigation of the structure and development of a market for the
formulation of efficient policies relating to purchases, production and sales. Operational
research relates to the application of logical, mathematical, and analytical techniques to find
solutions to business problems, such as cost minimization, profit maximization, or other
optimization problems. Motivational research helps to determine why people behave in the
manner they do with respect to market characteristics. More specifically, it is concerned with the
analysis of the motivations underlying consumer behaviour. All these types of research are very useful
for business and industry, and help those who are responsible for business decision-making.
Research is equally important to social scientists for analyzing the social relationships
and seeking explanations to various social problems. It gives intellectual satisfaction of
knowing things for the sake of knowledge. It also possesses the practical utility for the social
scientist to gain knowledge so as to be able to do something better or in a more
efficient manner. The research in social sciences is concerned with both
knowledge for its own sake, and knowledge for what it can contribute to solve
practical problems.
Hypothesis-Testing Research Designs are those in which the researcher tests the
hypothesis of causal relationship between two or more variables. These studies require
procedures that would not only decrease bias and enhance reliability, but also facilitate
deriving inferences about the causality. Generally, experiments satisfy such requirements.
Hence, when research design is discussed in such studies, it often refers to the design of
experiments.
Hypothesis:
i. “Students who take tuitions perform better than the others who do
not receive tuitions” or,
ii. “The female students perform as well as the male students”.
These two statements are hypotheses that can be objectively verified and tested. Thus,
they indicate that a hypothesis states what one is looking for. Besides, it is a proposition that
can be put to test in order to examine its validity.
Characteristics Of Hypothesis:
i. A hypothesis must be precise and clear. If it is not precise and clear, then the inferences
drawn on its basis would not be reliable.
ii. A hypothesis must be capable of being put to test. Quite often, research programmes fail
because the hypothesis is incapable of being subjected to testing for validity. Therefore, some prior study
may be conducted by the researcher in order to make a hypothesis testable. A hypothesis
“is tested if other deductions can be made from it, which in turn can be confirmed or
disproved by observation” (Kothari, 1988).
iii. A hypothesis must state relationship between two variables, in the case of relational
hypotheses.
iv. A hypothesis must be specific and limited in scope. This is because a simpler hypothesis
generally would be easier to test for the researcher. And therefore, he/she must formulate
such hypotheses.
vi. A hypothesis must be consistent and derived from the most known facts. In other
words, it should be consistent with a substantial body of established facts. That is, it
must be in the form of a statement which is most likely to occur.
vii. A hypothesis must be amenable to testing within a stipulated or reasonable period of time.
No matter how excellent a hypothesis is, a researcher should not use it if it cannot be
tested within a given period of time, as no one can afford to spend a lifetime on collecting
data to test it.
viii. A hypothesis should state the facts that give rise to the necessity of looking
for an explanation. This is to say that by using the hypothesis, and other
known and accepted generalizations, a researcher must be able to derive
the original problem condition. Therefore, a hypothesis should explain
what it actually wants to explain, and for this it should also have an
empirical reference.
H0: μ = μ_H0 = 100

Alternative hypothesis      To be read as follows
H1: μ ≠ μ_H0                The alternative hypothesis is that the population mean is not equal to 100, i.e., it could be greater than or less than 100
H1: μ > μ_H0                The alternative hypothesis is that the population mean is greater than 100
H1: μ < μ_H0                The alternative hypothesis is that the population mean is less than 100
Before the sample is drawn, the researcher has to state the null hypothesis and the
alternative hypothesis. While formulating the null hypothesis, the following aspects need to be
considered:
A. Alternative hypothesis is usually the one which a researcher wishes to prove, whereas
the null hypothesis is the one which he/she wishes to disprove. Thus, a null hypothesis is
usually the one which a researcher tries to reject, while an alternative hypothesis is the
one that represents all other possibilities.
C. Null hypothesis should always be a specific hypothesis, i.e., it should not state about or
approximately a certain value.
In the context of hypothesis testing, the level of significance is a very important concept. It
is a certain percentage that should be chosen with great care, reason and insight. If for instance,
the significance level is taken at 5 per cent, then it means that H0 would be rejected when the
sampling result has a less than 0.05 probability of occurrence when H0 is true. In other words, the
five per cent level of significance implies that the researcher is willing to take a risk of five per
cent of rejecting the null hypothesis when H0 is actually true. In sum, the significance level
reflects the maximum value of the probability of rejecting H0 when it is actually true, and
is usually determined prior to testing the hypothesis.
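The meaning of the five per cent level can be checked with a small simulation. The sketch below is only illustrative: it assumes a normal population with the hypothesized mean of 100, an assumed standard deviation of 15, samples of 30 items and a two-tailed z-test; none of these settings other than the value 100 come from the text. It repeatedly draws samples for which H0 is true and counts how often H0 is nevertheless rejected.

```python
import random
from statistics import NormalDist, mean


def type_one_error_rate(alpha=0.05, mu0=100.0, sigma=15.0, n=30,
                        trials=2000, seed=1):
    """Estimate how often a two-tailed z-test rejects H0 when H0 is true.

    sigma, n, trials and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Critical value of the standard normal for a two-tailed test.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # The sample is drawn from a population whose mean really is mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / std_error
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate()
print(f"Share of true null hypotheses rejected: {rate:.3f}")
```

The printed share should lie close to 0.05, which is the risk of wrongly rejecting a true H0 that the researcher accepted in advance by choosing the five per cent level.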
3) Test Of Hypothesis Or Decision Rule:
These two types of tests are very important in the context of hypothesis
testing. A two-tailed test rejects the null hypothesis, when the sample
mean is significantly greater or lower than the hypothesized value of the
mean of the population. Such a test is suitable when the null hypothesis is
some specified value and the alternative hypothesis is a value that is not equal
to the specified value of the null hypothesis.
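The decision rule for a two-tailed test, and for the one-tailed alternatives H1: μ > μ_H0 and H1: μ < μ_H0, may be sketched as follows. This is a minimal illustration that assumes a z-test with a known population standard deviation; the function name, the sample mean of 104.5, the standard deviation of 15 and the sample size of 36 are hypothetical values chosen for the example.

```python
from statistics import NormalDist


def rejects_null(sample_mean, mu0, sigma, n, alpha=0.05, tail="two"):
    """Return True if H0: mu = mu0 is rejected by a z-test at level alpha.

    tail="two"   tests H1: mu != mu0 (rejection region in both tails)
    tail="right" tests H1: mu >  mu0 (rejection region in the upper tail)
    tail="left"  tests H1: mu <  mu0 (rejection region in the lower tail)
    """
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    std_normal = NormalDist()
    if tail == "two":
        # alpha is split equally between the two tails.
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    if tail == "right":
        return z > std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return z < std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'right' or 'left'")


# Hypothetical sample: mean 104.5 from 36 items, assumed sigma of 15.
for tail in ("two", "right", "left"):
    decision = rejects_null(104.5, mu0=100, sigma=15, n=36, tail=tail)
    print(tail, "-> reject H0" if decision else "-> do not reject H0")
```

With these assumed figures the same sample leads to rejection only in the right-tailed test, because the whole five per cent rejection region is then placed in the upper tail instead of being split between both tails.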
1) Making a Formal Statement:
This step involves making a formal statement of the null hypothesis (H0) and the
alternative hypothesis (Ha). This implies that the hypotheses should be clearly stated within the
purview of the research problem. For example, suppose a school teacher wants to test the
understanding capacity of the students which must be rated more than 90 per cent in terms
of marks, the hypotheses may be stated as follows:
After deciding on the level of significance for hypothesis testing, the
researcher has to next determine the appropriate sampling distribution. The choice to be made
generally relates to normal distribution and the t-distribution. The rules governing the
selection of the correct distribution are similar to the ones already discussed with respect to
estimation.
Another step involved in hypothesis testing is the selection of a random sample and then
computing a suitable value from the sample data relating to test statistic by using the appropriate
distribution. In other words, it involves drawing a sample for furnishing empirical data.
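As an illustration of computing the test statistic from sample data, the sketch below assumes the common case in which the population standard deviation is unknown and the sample is small, so that the t-distribution would be used. The marks are hypothetical, and the hypothesized mean of 90 is borrowed from the school teacher example above.

```python
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    # stdev() uses the n - 1 divisor, i.e. the sample standard deviation.
    std_error = stdev(sample) / n ** 0.5
    return (mean(sample) - mu0) / std_error, n - 1


# Hypothetical marks (per cent) of eight students.
marks = [92, 88, 95, 91, 94, 89, 93, 90]
t, df = t_statistic(marks, mu0=90)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

For these marks the statistic works out to about 1.73 with 7 degrees of freedom. It would then be compared with the tabulated t value for the chosen level of significance.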
5) Calculation Of The Probability:
The next step for the researcher is to calculate the probability that
the sample result would diverge as widely as it has from expectations, under
the situation when the null hypothesis is actually true.
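This probability is commonly called the p-value. A minimal sketch of its calculation for a two-tailed z-test is given here, assuming a known population standard deviation; the sample figures (mean 104.5, standard deviation 15, sample size 36) are hypothetical.

```python
from statistics import NormalDist


def two_tailed_p_value(sample_mean, mu0, sigma, n):
    """Probability, under H0, of a sample mean at least this far from mu0."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Area in one tail beyond |z|, doubled to cover both tails.
    return 2 * (1 - NormalDist().cdf(abs(z)))


p = two_tailed_p_value(sample_mean=104.5, mu0=100, sigma=15, n=36)
alpha = 0.05
print(f"p-value = {p:.4f}")
print("Reject H0" if p < alpha else "Do not reject H0")
```

The calculated probability is then compared with the significance level: here it is roughly 0.07, which exceeds 0.05, so H0 would not be rejected at the five per cent level.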
Sample Survey:
1) Type Of Universe:
The first step involved in developing sample design is to clearly define the number of
cases, technically known as the universe. A universe may be finite or infinite. In a finite universe
the number of items is certain, whereas in the case of an infinite universe the number of items is
infinite (i.e., there is no idea about the total number of items). For example, while the population
of a city or the number of workers in a factory comprise finite universes, the number of stars
in the sky, or the throws of a die, represent infinite universes.
2) Sampling Unit:
Prior to selecting a sample, a decision has to be made about the sampling unit. A sampling
unit may be a geographical area like a state, district, village, etc., or a social unit like a family,
religious community, school, etc., or it may also be an individual. At times, the researcher would
have to choose one or more of such units for his/her study.
3) Source List:
Source list is also known as the ‘sampling frame’, from which the sample is to be
selected. The source list consists of names of all the items of a universe. The researcher has to
prepare a source list when it is not available. The source list must be reliable, comprehensive,
correct, and appropriate. It is important that the source list should be as representative of the
population as possible.
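Once a source list exists, drawing the sample from it is mechanical. The sketch below assumes simple random sampling without replacement, which is only one of several possible selection procedures; the frame of 200 workers and the sample of 20 are hypothetical.

```python
import random


def draw_simple_random_sample(source_list, sample_size, seed=None):
    """Select sample_size distinct items from the source list at random."""
    if sample_size > len(source_list):
        raise ValueError("sample size cannot exceed the size of the source list")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(source_list, sample_size)


# Hypothetical sampling frame: 200 workers of a factory (a finite universe).
frame = [f"Worker-{i:03d}" for i in range(1, 201)]
sample = draw_simple_random_sample(frame, 20, seed=42)
print(sample[:5])
```

Every item on the list has the same chance of being chosen, which is why an incomplete or incorrect source list translates directly into an unrepresentative sample.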
4) Size Of Sample:
Size of the sample refers to the number of items to be chosen from the universe to form
a sample. For a researcher, this constitutes a major problem. The size of sample must be
optimum. An optimum sample may be defined as the one that satisfies the requirements of
representativeness, flexibility, efficiency, and reliability. While deciding the size of sample,
a researcher should determine the desired precision and the acceptable confidence level for the
estimate. The size of the population variance should be considered, because in the case of a
larger variance generally a larger sample is required. The size of the population should be
considered,
as it also limits the sample size. The parameters of interest in a research study
should also be considered, while deciding the sample size. Besides, costs or
budgetary constraints also play a crucial role in deciding the sample size.
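The way precision, confidence level, variance and population size combine can be seen from the standard formula for estimating a mean, n = (z·σ/e)², together with its finite-population version, n = z²σ²N / ((N − 1)e² + z²σ²). The sketch below applies both; the standard deviation of 15, the acceptable error of 2 units and the population of 1,000 are assumed values for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin_of_error, confidence=0.95,
                         population_size=None):
    """Sample size needed to estimate a mean within +/- margin_of_error."""
    # z value matching the desired confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    if population_size is None:
        n = (z * sigma / margin_of_error) ** 2
    else:
        # Finite universe: the known population size limits the sample size.
        big_n = population_size
        n = (z ** 2 * sigma ** 2 * big_n) / (
            (big_n - 1) * margin_of_error ** 2 + z ** 2 * sigma ** 2
        )
    return math.ceil(n)  # round up so the desired precision is still met


n_infinite = sample_size_for_mean(sigma=15, margin_of_error=2)
n_finite = sample_size_for_mean(sigma=15, margin_of_error=2, population_size=1000)
print(n_infinite, n_finite)
```

Under these assumptions the required sample is 217 items for a very large universe and 178 items when the universe has only 1,000 items. A larger variance or a tighter margin of error raises both figures.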
Introduction:
It is important for a researcher to know the sources of data which he requires for
different purposes. Data are nothing but information. There are two sources of information
or data: primary data and secondary data. The data are named after the source. Primary
data refers to the data collected for the first time, whereas secondary data refers to the data
that have already been collected and used earlier by somebody or some agency. For example,
the statistics collected by the Government of India relating to the population are primary data
for the Government of India since they have been
collected for the first time. Later when the same data are used by a researcher for his study
of a particular problem, then the same data become the secondary data for the researcher. Both
the sources of information have their merits and demerits. The selection of a particular source
depends upon the
(a) purpose and scope of enquiry,
(b) availability of time,
(c) availability of finance,
(d) accuracy required,
(e) statistical tools to be used,
(f) sources of information (data), and (g) method of data collection.
After the purpose of enquiry has been clearly defined, the next step is
to decide about the scope of the enquiry. Scope of the enquiry means the
coverage with regard to the type of information, the subject-matter and
geographical area. For instance, an enquiry may relate to India as a whole or
a state or an industrial town wherein a particular problem related to a particular
industry can be studied.
The investigation will greatly depend on the resources available like number of skilled
personnel, the financial position etc. If the number of skilled personnel who will carry out the
enquiry is quite sufficient and the availability of funds is not a problem, then the enquiry can be
conducted over a big area covering a good number of samples; otherwise a small sample size
will do.
Deciding the degree of accuracy required is a must for the investigator, because absolute
accuracy in statistical work is seldom achieved. This is so because (i) statistics are based on
estimates, (ii) tools of measurement are not always perfect and (iii) there may be unintentional
bias on the part of the investigator, enumerator or informant. Therefore, a desire for 100%
accuracy is bound to remain unfulfilled. Degree of accuracy desired primarily depends upon the
object of enquiry. For example, when we buy gold, even a difference of 1/10th gram in its weight
is significant, whereas the same will not be the case when we buy rice or wheat. However, the
researcher must aim at attaining a higher degree of accuracy, otherwise the whole purpose of
research would become meaningless.
A well defined and identifiable object or a group of objects with which the measurements
or counts in any statistical investigation are associated is called a statistical unit. For example, in
a socio-economic survey the unit may be an individual, a family, a household or a block of
locality. A very important step before the collection of data begins is to define clearly the
statistical units on which the data are to be collected. In a number of situations the units are
conventionally fixed like the physical units of measurement, such as meters, kilometers, quintals,
hours, days, weeks etc., which are well defined and do not need any elaboration or explanation.
However, in many statistical investigations, particularly relating to socio-economic
studies, arbitrary units are used which must be clearly defined. This
is a must because in the absence of a clear cut and precise definition of the
statistical units, serious errors in the data collection may be committed in the
sense that we may collect irrelevant data on the items, which should have, in
fact, been excluded and omit data on certain items which should have been
included. This will ultimately lead to fallacious conclusions.
After deciding about the unit, a researcher has to decide about the
source from which the information can be obtained or collected. For any
statistical inquiry, the investigator may collect the data first hand or he
may use the data from other published sources, such as publications of the
government/semi-government organizations or journals and magazines etc.
Methods of Collecting Primary Data:
A face to face contact is made with the informants (persons from whom the information
is to be obtained) under this method of collecting data. The interviewer asks them questions
pertaining to the survey and collects the desired information. Thus, if a person wants to
collect data about the working conditions of the workers of the Tata Iron and Steel Company,
Jamshedpur, he would go to the factory, contact the workers and obtain the desired
information. The information collected in this manner is first hand and also original in character.
There are many merits and demerits of this method, which are discussed as under:
Merits:
1. Most often respondents are happy to pass on the information required from them when
contacted personally and thus response is encouraging.
2. The information collected through this method is normally more accurate because the
interviewer can clear doubts of the informants about certain questions and thus obtain
correct information. In case the interviewer apprehends that the informant is not giving
accurate information, he may cross-examine him and thereby try to obtain the information.
3. This method also provides the scope for getting supplementary information from the
informant, because while interviewing it is possible to ask some supplementary questions
which may be of greater use later.
4. There might be some questions which the interviewer would find difficult to ask directly,
but with some tactfulness, he can mingle such
questions with others and get the desired information. He can twist
the questions keeping in mind the informant’s reaction. In short, a delicate
situation can usually be handled more effectively by a personal interview
than by other survey techniques.
5. The interviewer can adjust the language according to the status and
educational level of the person interviewed, and thereby can avoid
inconvenience and misinterpretation on the part of the informant.
Demerits:
Conclusion:
Under this method of data collection, the investigator contacts third parties generally
called ‘witnesses’ who are capable of supplying necessary information. This method is
generally adopted when the information to be obtained is of a complex nature and
informants are not inclined to respond if approached directly. For example, when the researcher
is trying to obtain data on drug addiction or the habit of taking liquor, there is a high probability
that the addicted person will not provide the desired data and hence will disturb the whole
research process. In this situation taking the help of such persons or agencies or the neighbours
who know them well becomes necessary. Since these people know the person well, they can
provide the desired data. Enquiry Committees and Commissions appointed by the Government
generally adopt this method to get people’s views and all possible details of the facts related to
the enquiry.
Though this method is very popular, its correctness depends upon a number of factors such as
1. The person or persons or agency whose help is solicited must be of proven integrity;
otherwise any bias or prejudice on their part will not bring out the correct information and
the whole process of research will become useless.
2. The ability of the interviewers to draw information from witnesses by means of appropriate
questions and cross-examination.
3. It might happen that because of bribery, nepotism or certain other reasons those who are
collecting the information give it such a twist that correct conclusions are not arrived at.
Therefore, for the success of this method it is necessary that the evidence of one person
alone is not relied upon. Views from other persons
and related agencies should also be ascertained to find the real position.
Utmost care must be exercised in the selection of these persons because it is
on their views that the final conclusions are reached.
When no formal questionnaire is used, interviewers adapt their questioning to each
interview as it progresses. They might even try to elicit responses by indirect methods, such as
showing pictures on which the respondent comments. When a researcher follows a prescribed
sequence of questions, it is referred to as structured study. On the other hand, when no
prescribed sequence of questions exists, the study is non-structured.
When questionnaires are constructed in such a way that the objective is clear to the
respondents, then these questionnaires are known as non-disguised; on the other hand, when the
objective is not clear, the questionnaire is a disguised one. On the basis of these two
classifications, four types of studies can be distinguished:
1. Non-disguised structured,
2. Non-disguised non-structured,
3. Disguised structured and
4. Disguised non-structured.
There are certain merits and demerits of this method of data collection, which are discussed
below:
Merits:
1. Questionnaire method of data collection can be easily adopted where the field of
investigation is very vast and the informants are spread over a wide geographical area.
2. This method is relatively cheap and expeditious provided the informants respond in time.
3. This method has proved to be superior when compared to other methods like personal
interviews or telephone method. This is because when questions pertaining to personal nature
or the ones requiring reaction by the family are put forth to the informants, there is a chance
for them to be embarrassed in answering them.
Demerits:
1. This method can be adopted only where the informants are literate so that they can
understand written questions and give the answers in writing.
2. It involves some uncertainty about the response. Co-operation on the
part of informants may be difficult to presume.
3. The information provided by the informants may not be correct and it
may be difficult to verify the accuracy.
Merits:
This method too, like others, is not free from defects or limitations. The main limitations are
listed below:
Demerits:
1. In comparison to other methods of collecting primary data, this method is quite costly as
enumerators are generally paid persons.
2. The success of the method depends largely upon the training imparted to the enumerators.
3. Interviewing is very skilled work and it requires experience and training. Many
statisticians have the tendency to neglect this extremely important part of the data collecting
process and this results in bad interviews. Without good interviewing most of the information
collected may be of doubtful value.
4. Interviewing is not only skilled work but it also requires a great degree of politeness and
thus the way the enumerators conduct the interview would affect the data collected. When
questions are asked by a number of different interviewers, it is possible that variations in
the personalities of the interviewers will cause variation in the answers obtained. This
variation will not be obvious. Hence, every effort must be made to remove as much of
the variation due to different interviewers as possible.
Secondary Data:
As stated earlier, secondary data are those data which have already been collected
and analyzed by some earlier agency for its own use, and later the same data are used by a
different agency. According to W.A. Neiswanger, “A primary source is a publication in which
the data are published by the same authority which gathered and analyzed them. A secondary
source is a publication, reporting the data which was gathered by other authorities and for
which others are responsible.”
Sources Of Secondary Data:
1. Published Sources:
Journals and newspapers are very important and powerful sources of secondary data.
Current and important materials on statistics and socio-economic problems can be obtained from
journals and newspapers like Economic Times, Commerce, Capital, Indian Finance, Monthly
Statistics of Trade etc.
2. Unpublished Sources:
Unpublished data can be obtained from many unpublished sources like records
maintained by various government and private offices, the theses of the numerous research
scholars in the universities or institutions etc.
Since secondary data have already been obtained, it is highly desirable that a proper
scrutiny of such data is made before they are used by the investigator. In fact the user
has to be extra-cautious while using secondary data. In this context Prof. Bowley rightly points
out that “Secondary data should not be accepted at their face value.” The reason is that data
may be erroneous in many respects due to bias, inadequate size of the sample, substitution, errors
of definition, arithmetical errors etc. Even if there is no error such data may not be suitable
and adequate for the purpose of the enquiry. Prof. Simon Kuznets’ view in this regard is also of
great importance. According to him, “the degree of reliability of secondary source is to be
assessed from the source, the compiler and his capacity to produce correct statistics and the users
also, for the most part, tend to accept a series particularly one issued by a government agency
at its face value without enquiring its reliability”.
Therefore, before using the secondary data the investigators should consider the following
factors:
55
4. The Suitability Of Data:
The investigator must satisfy himself that the data available are suitable
for the purpose of enquiry. This can be judged by comparing the nature and scope of
the present enquiry with those of the original enquiry. For example, if the object of the
present enquiry is to study the trend in retail prices, and if the data provide only
wholesale prices, such data are unsuitable.
If the data are suitable for the purpose of investigation then we must
consider whether the data are useful or adequate for the present analysis. This
can be studied from the geographical area covered by the original enquiry. The
time for which data are available is also a very important element. In the above
example, if our object is to study the retail price trend of India, and if the
available data cover only the retail price trend in the state of Bihar, then it
would not serve the purpose.
2. The data should be accurate
3. The data should be consistent, and
4. The data should be homogeneous.
For data to possess the above-mentioned characteristics, they have to undergo the type of editing
which is discussed below:
While editing, the editor should see that each schedule and questionnaire is complete in
all respects. He should see to it that the answers to each and every question have been furnished.
If some questions are not answered and if they are of vital importance, the informants should be
contacted again either personally or through correspondence. Even after all the efforts it
may happen that a few questions remain unanswered. In such cases, the editor should
mark ‘No answer’ in the space provided for answers, and if the questions are of vital
importance then the schedule or questionnaire should be dropped.
At the time of editing the data for consistency, the editor should see that the answers to
questions are not contradictory in nature. If there are mutually contradictory answers, he should
try to obtain the correct answers either by referring back to the questionnaire or by contacting,
wherever possible, the informant in person. For example, if amongst others, two questions
in a questionnaire are (a) Are you a student? (b) In which class do you study? and the reply to the
first question is ‘no’ and to the latter ‘tenth’, then there is a contradiction and it should be
clarified.
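The consistency check described above can be sketched in code. This is only a minimal illustration: the field names `is_student` and `class_studied` and the three sample records are hypothetical, chosen to mirror the two questions in the example.

```python
# Hypothetical questionnaire answers; field names are illustrative only.
responses = [
    {"id": 1, "is_student": "yes", "class_studied": "tenth"},
    {"id": 2, "is_student": "no", "class_studied": "tenth"},  # contradictory
    {"id": 3, "is_student": "no", "class_studied": ""},
]


def find_contradictions(records):
    """Return ids of records where a non-student still reports a class."""
    flagged = []
    for rec in records:
        if rec["is_student"] == "no" and rec["class_studied"]:
            flagged.append(rec["id"])
    return flagged


# Flagged schedules are the ones the editor should refer back to the informant.
print(find_contradictions(responses))
```

A check of this kind only locates the contradiction; obtaining the correct answer still requires going back to the questionnaire or the informant.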
(c) Editing For Homogeneity:
In using the secondary data, it is best to obtain the data from the primary
source as far as possible. By doing so, we would at least save ourselves from
the errors of transcription which might have inadvertently crept in the
secondary source. Moreover, the primary source will also provide us with
detailed discussion about the terminology used, statistical units employed, size
of the sample and the technique of sampling (if sampling method was used),
methods of data collection and analysis of results and we can ascertain
ourselves if these would suit our purpose.
Now-a-days in a large number of statistical enquiries, secondary data
are generally used because fairly reliable published data on a large number of
diverse fields are now available in the publications of governments, private
organizations and research institutions, agencies, periodicals and magazines
etc. In fact, primary data are collected only if there do not exist
any secondary data suited to the investigation under study. In some investigations both
primary as well as secondary data may be used.
Questionnaire
Nowadays the questionnaire is widely used for data collection in social research. It is a
reasonably fair tool for gathering data from large, diverse, varied and scattered social groups.
The questionnaire is the medium of communication between the investigator and the
respondents. According to Bogardus, a questionnaire is a list of questions sent to a number
of persons for their answers and which obtains standardized results that can be tabulated and
treated statistically. The Dictionary of Statistical Terms defines it as a “group of or sequence
of questions designed to elicit information upon a subject or sequence of subjects from
an informant.” A questionnaire should be designed or drafted with utmost care and caution so
that all the relevant and essential information for the enquiry may be collected without any
difficulty, ambiguity and vagueness. Drafting a good questionnaire is a highly specialized job
and requires great care, skill, wisdom, efficiency and experience. No hard and fast rule can be
laid down for designing or framing a questionnaire. However, in this connection, the following
general points may be borne in mind:
Logical arrangement of questions reduces a lot of unnecessary work
on the part of the researcher because it not only facilitates the tabulation work
but also does not leave any chance for omissions or commissions. For
example, to find if a person owns a television, the logical order of questions
would be: Do you own a television? When did you buy it? What is its make?
How much did it cost you? Is its performance satisfactory? Have you ever
got it serviced?
4. Questions Should Be Simple To Understand:
Vague words like good, bad, efficient, sufficient, prosperity, rarely, frequently,
reasonable, poor, rich etc., should not be used since these may be interpreted differently by
different persons and as such might give unreliable and misleading information. Similarly the
use of words having double meaning like price, assets, capital, income etc., should also be
avoided.
Questions should be designed in such a way that they are readily comprehensible and
easy to answer for the respondents. They should not be tedious nor should they tax the
respondents’ memory. At the same time questions involving mathematical calculations like
percentages, ratios etc., should not be asked.
There are some questions which disturb the respondents, who may be shy of or
irritated by such questions. Therefore, every effort should be made to avoid such
questions. For example, ‘Do you cook yourself or does your wife cook?’ or ‘Do you drink?’ Such
questions will certainly irk the respondents and should thus be avoided at any cost. If unavoidable, then
the highest amount of politeness should be used.
7. Types Of Questions:
Under this head, the questions in the questionnaire may be classified as follows:
Shut questions are those where possible answers are suggested by the framers of the
questionnaire and the respondent is required to tick one of them. Shut questions can further
be subdivided into the following forms:
(i) Simple Alternate Questions:
In this type of questions the respondent has to choose from the two clear
cut alternatives like ‘Yes’ or ‘No’, ‘Right or Wrong’ etc. Such questions are also
called as dichotomous questions. This technique can be applied with
elegance to situations where two clear cut alternatives exist.
Do you smoke?
Multiple choice questions are very easy and convenient for the
respondents to answer. Such questions save time and also facilitate tabulation.
This method should be used if only a selected few alternative answers exist to
a particular question.
Questions like ‘why do you use a particular type of car, say Maruti car’
should preferably be framed into two questions-
(i) which car do you use? (ii) why do you prefer it?
It would be practical in every sense to try out the questionnaire on a small scale before
using it for the given enquiry on a large scale. This has been found extremely useful in
practice. The given questionnaire can be improved or modified in the light of the drawbacks,
shortcomings and problems faced by the investigator in the pre-test.
11. A Covering Letter:
A covering letter from the organizers of the enquiry should be enclosed along
with the questionnaire. It should explain the definitions, units and concepts used in the
questionnaire, take the respondent into confidence, mention the self-addressed envelope enclosed
in the case of a mailed questionnaire, mention any award or incentive for a quick response, and
promise to send a copy of the survey report etc.
SAMPLING
Though sampling is not new, the sampling theory has been developed recently. Knowingly or
not, people have been using the sampling technique in their day to day life. For
example a housewife tests a small quantity of rice to see whether it has been well cooked
and gives the generalized result about the whole rice boiling in the vessel. The result arrived at
is most of the times 100% correct. In another example, when a doctor wants to examine the
blood for any deficiency, he takes only a few drops of blood of the patient and examines them. The
result arrived at is most of the times correct and represents the whole amount of blood available
in the body of the patient. In all these cases, by inspecting a few, they simply believe that the
samples give a correct idea about the population. Most of our decisions are based on the
examination of a few items only, i.e., sample studies. In the words of Croxton and Cowden,
“It may be too expensive or too time consuming to attempt either a complete or a nearly complete
coverage in a statistical study. Further to arrive at valid conclusions, it may not be necessary
to enumerate all or nearly all of a population. We may study a sample drawn from the large
population and if that sample is adequately representative of the population, we should be able
to arrive at valid conclusions.”
According to Rosander, “The sample has many advantages over a census or complete
enumeration. If carefully designed, the sample is not only considerably cheaper but may give
results which are just as accurate and sometimes more accurate than those of a census. Hence a
carefully designed sample may actually be better than a poorly planned and executed census.”
Merits:
It saves time:
Sampling method of data collection saves time because fewer items are collected and processed.
When the results are urgently required, this method is very helpful.
It reduces cost
Since only a few and selected items are studied in sampling, there is reduction in cost of money
and reduction in terms of man hours.
More reliable results can be obtained:
(a) There are fewer chances of sampling statistical errors. If there is sampling error, it is possible
to estimate and control the results.
(b) Highly experienced and trained persons can be employed for
scientific processing and analyzing of relatively limited data and they can use their high technical
knowledge and get more accurate and reliable results.
1. It provides more detailed information:
As it saves time, money and labor, more detailed information can be collected in a sample
survey.
Sometimes it so happens that one has to depend upon the sampling method alone,
because if the population under study is infinite, the sampling method is the only method to be used.
For example, if someone’s blood has to be examined, it would be fatal to take all the blood
out from the body and study it by the total enumeration method.
3. Administrative convenience:
The organization and administration of sample survey are easy for the reasons which
have been discussed earlier.
4. More scientific:
Since the methods used to collect data are based on scientific theory and the results obtained
can be tested, sampling is a more scientific method of collecting data.
It is not that sampling is free from demerits or shortcomings. There are certain
shortcomings of this method which are discussed below:
1. Illusory conclusion:
If a sample enquiry is not carefully planned and executed, the conclusions may be
inaccurate and misleading.
As there is a lack of experts to plan and conduct a sample survey, its execution and analysis,
the results may be unsatisfactory and not trustworthy.
Personal Bias:
There may be personal biases and prejudices with regard to the choice of technique and drawing
of sampling units.
If the size of the sample is not appropriate then it may lead to untrue characteristics of the
population.
If the information is required for each and every item of the universe, then a complete
enumeration survey is better.
Essentials of sampling:
In order to reach a clear conclusion, the sampling should possess the following essentials:
1. It must be representative:
The sample selected should possess characteristics similar to those of the original universe from
which it has been drawn.
2. Homogeneity:
Selected samples from the universe should be similar in nature and should not show any
difference when compared with the universe.
3. Adequate samples:
In order to have a more reliable and representative result, a good number of items are to
be included in the sample.
4. Optimization:
All efforts should be made to get maximum results both in terms of cost as well as
efficiency. If the size of the sample is larger, there is better efficiency and at the same time
the cost is more. A proper size of sample is maintained in order to have optimized results in
terms of cost and efficiency.
If the sample size is less than 30, then those samples may be regarded as small samples. As a rule,
the methods and the theory of large samples are not applicable to small samples. Small
samples are used in testing a given hypothesis, to find out whether the observed values could
have arisen by sampling fluctuations from some values given in advance. In a small sample, the
investigator’s estimate will vary widely from sample to sample. An inference drawn from a smaller
sample result is less precise than the inference drawn from a large sample result.
The t-distribution will be employed when the sample size is 30 or less and the population standard
deviation is unknown.
The formula is
$$t = \frac{\bar{X} - \mu}{\sigma} \times \sqrt{n}$$
where,
$$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Illustration:
Solution:
$$t = \frac{\bar{X} - \mu}{\sigma} \times \sqrt{n}$$
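As the data of the illustration are not reproduced here, the following is a minimal sketch of the calculation on a hypothetical sample of five observations with an assumed population mean of 47. The function name `t_statistic` and the numbers are illustrative only; σ is computed with the divisor n – 1, as in the formula.

```python
import math
import statistics


def t_statistic(sample, mu):
    """t = (sample mean - mu) / sigma * sqrt(n), sigma using divisor n - 1."""
    n = len(sample)
    mean = statistics.mean(sample)
    sigma = statistics.stdev(sample)  # square root of sum((X - mean)^2) / (n - 1)
    return (mean - mu) / sigma * math.sqrt(n)


# Hypothetical small sample (n < 30) and hypothesised mean.
sample = [48, 50, 52, 49, 51]
t = t_statistic(sample, mu=47)
print(round(t, 3), "with", len(sample) - 1, "degrees of freedom")
```

The calculated t is then compared with the table value of t for n – 1 degrees of freedom at the chosen level of significance.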
4. CHI-SQUARE TEST
F, t and Z tests are based on the assumption that the samples were drawn from normally
distributed populations. The testing procedure requires assumption about the type of population
or parameters, and these tests are known as ‘parametric tests’.
There are many situations in which it is not possible to make any rigid assumption about
the distribution of the population from which samples are being drawn. This limitation has
led to the development of a group of alternative techniques known as non-parametric tests.
Chi-square test of independence and goodness of fit is a prominent example of the use of
non-parametric tests.
Though non-parametric theory developed as early as the middle of the nineteenth
century, it was only after 1945 that non-parametric tests came to be used widely in sociological
and psychological research. The main reasons for the increasing use of non-parametric tests in
business research are:-
The χ2 test is one of the simplest and most widely used non-parametric tests in statistical
work. It is defined as:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where
$$E = \frac{R \times C}{N}$$
Where
O = Observed frequency, E = Expected frequency, R = row’s total of the respective cell,
C = column’s total of the respective cell and N = the total number of
observations.
(iii) Divide the values of (O – E)2 obtained in step (ii) by the respective
expected frequency and obtain the total, which can be symbolically represented
by ∑[(O – E)2/E]. This gives the value of χ2 which can range from zero to
infinity. If χ2 is zero it means that the observed and expected frequencies
completely coincide. The greater the discrepancy between the observed and
expected frequencies, the greater shall be the value of χ2.
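The steps can be sketched in code as follows. This is an illustrative sketch, not a prescribed procedure: the function name `chi_square` and the 2 x 2 table of observed frequencies are hypothetical. Each expected frequency is taken as R x C / N, and the terms (O – E)2/E are summed over all cells.

```python
def chi_square(observed):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # E = R x C / N
            chi2 += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
    return chi2, dof


# Hypothetical observed frequencies; every expected frequency is 25 here.
table = [[30, 20],
         [20, 30]]
chi2, dof = chi_square(table)
print(chi2, dof)
```

If the observed table coincided with the expected one, the function would return zero, in line with the remark above.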
The following observations may be made with regard to the χ2 distribution:-
i. The sum of the differences between the observed and expected frequencies is always zero.
Symbolically, ∑(O – E) = ∑O – ∑E = N – N = 0
ii. The χ2 test depends only on the set of observed and expected frequencies and on degrees of
freedom v. It is a non-parametric test.
For large sample sizes, the sampling distribution of χ2 can be closely approximated by a
continuous curve known as the Chi-square distribution. The probability function of χ2
distribution is:
$$f(\chi^2) = C \, (\chi^2)^{\frac{v}{2} - 1} \, e^{-\chi^2 / 2}$$
Where
e = 2.71828, v = number of degrees of freedom, C = a constant
depending only on v.
The χ2 distribution has only one parameter, v, the number of degrees of freedom. As in
the case of the t-distribution, there is a distribution for each different number of degrees of freedom.
For a very small number of degrees of freedom, the Chi-square distribution is severely skewed
to the right. As the number of degrees of freedom increases, the curve rapidly becomes more
symmetrical. For large values of v the Chi-square distribution is closely approximated by the
normal curve.
The following diagram gives χ2 distribution for 1, 5 and 10 degrees of
freedom:
[Figure: χ2 Distribution. Curves of f(χ2) for v = 1, v = 5 and v = 10, with χ2 on the horizontal axis from 0 to 20.]
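Properties of χ2 Distribution
The mean of the χ2 distribution is equal to the number of degrees of freedom, i.e.,
$$\bar{X} = v$$
(iii) $\mu_1 = 0$,
(iv) $\mu_2 = 2v$,
(v) $\mu_3 = 8v$,
(vi) $\mu_4 = 48v + 12v^2$,
(vii) $\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{48v + 12v^2}{4v^2} = 3 + \frac{12}{v}$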
The table values of χ2 are available only up to 30 degrees of freedom. For degrees of
freedom greater than 30, the distribution of √(2χ2) approximates the normal distribution, and the
approximation is acceptably close. The mean of the distribution of √(2χ2) is
√(2v – 1), and the standard deviation is equal to 1. Thus the application of the test is simple,
for the deviation of √(2χ2) from √(2v – 1) may be interpreted as a normal deviate with unit standard
deviation. That is,
$$Z = \sqrt{2\chi^2} - \sqrt{2v - 1}$$
of χ2
In a 2x2 table where the cell frequencies and marginal totals are as below:
a b (a+b)
c d (c+d)
(a+c) (b+d) N
N is the total frequency and ad the larger cross-product, the value of χ2 can easily be
obtained by the following formula:
$$\chi^2 = \frac{N (ad - bc)^2}{(a + c)(b + d)(c + d)(a + b)}$$
or, with Yates’s correction for continuity,
$$\chi^2 = \frac{N (ad - bc - \frac{1}{2}N)^2}{(a + c)(b + d)(c + d)(a + b)}$$
(iii) The constraints on the cell frequencies if any should be linear, i.e., they
should not involve square and higher powers of the frequencies such as ∑O
= ∑E = N.
Uses of χ2 test:
i. χ2 test as a test of independence. With the help of the χ2 test, we can find out
whether two or more attributes are associated or not. Let us assume that we
have n observations classified according to some attributes.
We may ask whether the attributes are related or independent. Thus, we can find out
whether there is any association between skin colour of husband and wife. To examine
whether the attributes are associated, we formulate the null hypothesis that there is no
association against the alternative hypothesis that there is an association between the
attributes under study. If the calculated value of χ2 is less than the table value at a certain
level of significance, we say that the result of the experiment provides no evidence for
doubting the hypothesis. On the other hand, if the calculated value of χ2 is greater than the
table value at a certain level of significance, the results of the experiment do not support the
hypothesis.
ii. χ2 test as a test of goodness of fit. It is so called because it enables us to ascertain how
appropriately theoretical distributions such as the binomial, Poisson, normal, etc., fit
empirical distributions. When an ideal frequency curve, whether normal or some other
type, is fitted to the data, we are interested in finding out how well this curve fits with the
observed facts. A test of the concordance of the two can be made just by inspection, but
such a test is obviously inadequate. Precision can be secured by applying the χ2 test.
iii. χ2 test as a test of homogeneity. The χ2 test of homogeneity is an extension of the chi-square
test of independence. Tests of homogeneity are designed to determine whether two or more
independent random samples are drawn from the same population or from different
populations. Instead of one sample, as we use with the independence problem, we shall now have
2 or more samples. For example, we may be interested in finding out whether or not
university students of various levels, i.e., poor, middle and richer income groups, are
homogeneous in performance in the examination.
Illustration:
Treatment Diabetes No Diabetes Total
Solution:
Or E1, i.e., the expected frequency corresponding to the first row and first column, is
60. The table of expected frequencies shall be:
60 752 812
180 2256 2436
240 3008 3248
O E (O – E)2 (O – E)2/E
20 60 1600 26.667
220 180 1600 8.889
792 752 1600 2.128
2216 2256 1600 0.709
χ2 = ∑[(O – E)2/E] = 38.393
v = (r – 1) (c – 1) = (2 – 1) (2 – 1) = 1
For v = 1, χ2 at the 5% level = 3.84
The calculated value of χ2 is greater than the table value. The hypothesis is rejected. Hence
medicine x is useful in checking malaria.
Illustration:
Inoculated 10 20
Not inoculated 15 5
Calculate χ2 and discuss the effect of the vaccine in controlling susceptibility to tuberculosis (5%
value of χ2 for one degree of freedom = 3.84).
Solution:
Let us take the hypothesis that the vaccine is not effective in controlling susceptibility
to tuberculosis. Applying χ2 test:
Since the calculated value of χ2 is greater than the table value, the hypothesis is rejected. We,
therefore, conclude that the vaccine is effective in controlling susceptibility to tuberculosis.
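The working of this solution is not shown, so the following sketch recomputes χ2 for the table given in the illustration using the 2x2 shortcut formula. It assumes the cells are read as a = 10, b = 20, c = 15, d = 5; the function name `chi_square_2x2` and the `yates` flag are illustrative. The absolute value of (ad – bc) is used so that the order of the cross-products does not matter.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2x2 table via N(ad - bc)^2 / product of marginal totals."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates's correction: reduce the cross-product difference by N / 2.
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / ((a + c) * (b + d) * (c + d) * (a + b))


# Cell frequencies from the illustration (inoculated: 10, 20; not inoculated: 15, 5).
plain = chi_square_2x2(10, 20, 15, 5)
corrected = chi_square_2x2(10, 20, 15, 5, yates=True)
print(round(plain, 3), round(corrected, 3))
```

Both values exceed the 5% table value of 3.84 for one degree of freedom, which agrees with the conclusion stated in the solution.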
UNIT-IV
SIMPLE CORRELATION
Correlation
Correlation means the average relationship between two or more variables. When changes in the
values of a variable affect the values of another variable, we say that there is a correlation between
the two variables. The two variables may move in the same direction or in opposite directions.
Simply because of the presence of correlation between two variables, we cannot jump to the
conclusion that there is a cause-effect relationship between them. Sometimes, it may be due to
chance also.
Simple correlation
We say that the correlation is simple if the comparison involves two variables only.
TYPES OF CORRELATION
Positive correlation
If two variables x and y move in the same direction, we say that there is a positive correlation
between them. In this case, when the value of one variable increases, the value of the other
variable also increases and when the value of one variable decreases, the value of the other
variable also decreases. E.g., the age and height of a child.
Negative correlation
If two variables x and y move in opposite directions, we say that there is a negative correlation
between them. i.e., when the value of one variable increases, the value of the other variable
decreases and vice versa. Eg. The price and demand of a normal good.
The following diagrams illustrate positive and negative correlations between x and y.
[Figure: Positive Correlation and Negative Correlation, each plotting y against x.]
If changes in two variables are in the same direction and the changes are in equal
proportion, we say that there is a perfect positive correlation between them.
If changes in two variables are in opposite directions and the absolute values of changes
are in equal proportion, we say that there is a perfect negative correlation between them.
[Figure: Perfect Positive Correlation and Perfect Negative Correlation, each plotting y against x.]
Zero Correlation
[Figure: Zero Correlation, with x on the horizontal axis.]
Linear Correlation
If the quantum of change in one variable always bears a constant ratio to the quantum of change
in the other variable, we say that the two variables have a linear correlation between them.
Coefficient of Correlation
The coefficient of correlation between two variables X, Y is a measure of the degree of association
(i.e., strength of relationship) between them. The coefficient of correlation is usually denoted by
‘r’.
Let N denote the number of pairs of observations of two variables X and Y. The correlation
coefficient r between X and Y is defined by
$$r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{N \sum X^2 - (\sum X)^2} \; \sqrt{N \sum Y^2 - (\sum Y)^2}}$$
This formula is suitable for solving problems with hand calculators. To apply this formula,
we have to calculate ∑X, ∑Y, ∑XY, ∑X2, ∑Y2.
Let r denote the correlation coefficient between two variables. r is interpreted using the
following properties:
Problem 1
The following are data on Advertising Expenditure (in Rupees Thousand) and Sales (Rupees
in lakhs) in a company.
Advertising Expenditure : 18 19 20 21 22 23
Sales : 17 17 18 19 19 19
Determine the correlation coefficient between them and interpret the result.
Solution:
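X Y XY X2 Y2
18 17 306 324 289
19 17 323 361 289
20 18 360 400 324
21 19 399 441 361
22 19 418 484 361
23 19 437 529 361
Total: 123 109 2243 2539 1985
$$r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{N \sum X^2 - (\sum X)^2} \; \sqrt{N \sum Y^2 - (\sum Y)^2}}$$
$$r = \frac{6 \times 2243 - 123 \times 109}{\sqrt{6 \times 2539 - 123^2} \; \sqrt{6 \times 1985 - 109^2}} = \frac{51}{\sqrt{105} \; \sqrt{29}} = 0.92$$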
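The calculation in this solution can be checked with a short script. This is a minimal sketch: the function name `pearson_r` is illustrative, and it uses only the sums ∑X, ∑Y, ∑XY, ∑X2 and ∑Y2 that the formula requires, applied to the data of Problem 1.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient from the sums used in the hand-calculator formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Data of Problem 1.
advertising = [18, 19, 20, 21, 22, 23]
sales = [17, 17, 18, 19, 19, 19]
print(round(pearson_r(advertising, sales), 2))
```

The same function can be applied to the data of the later problems by replacing the two lists.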
Interpretation
Problem 2
X : 12 14 18 23 24 27
Y : 18 13 12 30 25 10
Determine the correlation coefficient between the two variables and interpret the result.
Solution:
X Y XY X2 Y2
12 18 216 144 324
14 13 182 196 169
18 12 216 324 144
23 30 690 529 900
24 25 600 576 625
27 10 270 729 100
Interpretation
The value of r is 0.21. Even though it is positive, the value of r is very low. Hence
we conclude that there is hardly any correlation between the two variables X and Y. Consequently we
cannot construct any functional relationship between them.
Problem 3
Consider the following data on supply and price. Determine the correlation
coefficient between the two variables and interpret the result.
Supply : 11 13 17 18 22 24 26 28
Price : 25 32 26 25 20 17 11 10
Solution:
X Y XY X2 Y2
Interpretation
The value of r is – 0.91. The negative sign in r shows that the two variables move in
opposite directions. The absolute value of r is 0.91, which is very high. Therefore we
conclude that there is high negative correlation between the two variables ‘Supply’ and
‘Price’.
Problem 4
Consider the following data on income and savings in Rs. Thousand.
Income : 50 51 52 55 56 58 60 62 65 66
Savings : 10 11 13 14 15 15 16 16 17 17
Determine the correlation coefficient between the two variables and interpret the result.
Solution:
We have N = 10. Take X = Income and Y = Savings. Calculate ∑X, ∑Y, ∑XY, ∑X2, ∑Y2
as follows:
X Y XY X2 Y2
50 10 500 2500 100
51 11 561 2601 121
52 13 676 2704 169
55 14 770 3025 196
56 15 840 3136 225
58 15 870 3364 225
60 16 960 3600 256
62 16 992 3844 256
65 17 1105 4225 289
66 17 1122 4356 289
Total: 575 144 8396 33355 2126
The value of r is 0.94. The positive sign in r shows that the two variables move in the
same direction. The value of r is very high. Therefore we conclude that there is high
positive correlation between the two variables ‘Income’ and ‘Savings’. As a result, we can
construct a functional relationship between them.
RANK CORRELATION
If ranks can be assigned to pairs of observations for two variables X and Y, then the correlation
between the ranks is called the rank correlation coefficient. It is usually denoted by the symbol
ρ (rho). It is given by the formula
$$\rho = 1 - \frac{6 \sum D^2}{N^3 - N}$$
where
D = difference between the corresponding ranks of X and Y
= RX − RY
Problem 5
Alpha Recruiting Agency short listed 10 candidates for final selection. They were
examined in written and oral communication skills. They were ranked as follows:
Candidate’s serial no. : 1 2 3 4 5 6 7 8 9 10
Rank in written communication : 8 7 2 10 3 5 1 9 6 4
Rank in oral communication : 10 7 2 6 5 4 1 9 8 3
Find out whether there is any correlation between the written and oral communication skills of
the short listed candidates.
Solution:
We have N = 10 and ∑D2 = 30. The rank correlation coefficient is
ρ = 1 – { 6 x 30 / (1000 – 10) }
= 1 – (180 / 990)
= 1 – 0.18
= 0.82
Inference:
Problem 6
The following are the ranks obtained by 10 workers in abc company on the
basis of their length of service and efficiency.
Rank as per service : 1 2 3 4 5 6 7 8 9 10
Rank as per efficiency : 2 3 6 5 1 10 7 9 8 4
Find out whether there is any correlation between the ranks obtained by
the workers as per the two criteria.
Solution:
Take X = Length of Service and Y = Efficiency.
Rank of X: R1 RANK OF Y: R2 D= R1- R2 D2
1 2 -1 1
2 3 -1 1
3 6 -3 9
4 5 -1 1
5 1 4 16
6 10 -4 16
7 7 0 0
8 9 -1 1
9 8 1 1
10 4 6 36
Total 82
We have N = 10. The rank correlation coefficient is
ρ = 1 - {6 ∑ D2 / (N3 – N)}
= 1 – { 6 x 82 / (1000 – 10) }
= 1 – (492 / 990)
= 1 – 0.497
= 0.503
Inference:
The rank correlation coefficient is not high.
Problem 7
Calculate the rank correlation to determine the relationship between equity shares and
preference shares given by the following data on their price.
Equity share : 90.0 92.4 98.5 98.3 95.4 91.3 98.0 92.0
Preference share : 76.0 74.2 75.0 77.4 78.3 78.8 73.2 76.5
Solution:
From the given data on share price, we have to find out the ranks for equity shares and preference
shares.
Step 1.
First, consider the equity shares and arrange them in descending
order of their price as 1,2,…,8. We have the following ranks:
Equity share : 98.5 98.3 98.0 95.4 92.4 92.0 91.3 90.0
Rank : 1 2 3 4 5 6 7 8
Step 2.
Next, take the preference shares and arrange them in descending
order of their price as 1,2,…,8. We obtain the following ranks:
Preference share : 78.8 78.3 77.4 76.5 76.0 75.0 74.2 73.2
Rank : 1 2 3 4 5 6 7 8
Step 3.
Calculation of D2:
Fit the given data with the correct rank. Take X = Equity share and Y =
Preference share. We have the following table:
X Y Rank of X: R1 Rank of Y: R2 D = R1 - R2 D2
90.0 76.0 8 5 3 9
92.4 74.2 5 7 -2 4
98.5 75.0 1 6 -5 25
98.3 77.4 2 3 -1 1
95.4 78.3 4 2 2 4
91.3 78.8 7 1 6 36
98.0 73.2 3 8 -5 25
92.0 76.5 6 4 2 4
Total 108
Step 4.
Calculation of ρ:
ρ = 1 - { 6 ∑ D2 / (N3 – N)}
= 1 – { 6 x 108 / (512 – 8) }
= 1 – (648 / 504)
= 1 – 1.29
= - 0.29
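Steps 1 to 4 can be reproduced with the sketch below, which ranks the prices in descending order and then applies the rank correlation formula. The helper names `descending_ranks` and `rank_correlation` are illustrative, and the ranking helper assumes there are no tied values, which holds for this data.

```python
def descending_ranks(values):
    """Rank 1 for the largest value, 2 for the next, and so on (no ties assumed)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def rank_correlation(x, y):
    """rho = 1 - 6 * sum(D^2) / (N^3 - N), with D the difference in ranks."""
    n = len(x)
    rank_x = descending_ranks(x)
    rank_y = descending_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Share prices from the problem above.
equity = [90.0, 92.4, 98.5, 98.3, 95.4, 91.3, 98.0, 92.0]
preference = [76.0, 74.2, 75.0, 77.4, 78.3, 78.8, 73.2, 76.5]
print(descending_ranks(equity))
print(round(rank_correlation(equity, preference), 2))
```

The first line printed gives the ranks R1 in the order of the original data, which can be compared with the table in Step 3.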
Inference:
From the value of ρ, it is inferred that the equity shares and preference shares under
consideration are negatively correlated. However, the absolute value of ρ is 0.29 which is not
even moderate.
Problem 8
Sales Person : 1 2 3 4 5 6 7 8 9 10
Rank Awarded by Manager I : 8 7 6 1 5 9 10 2 3 4
Rank Awarded by Manager II : 7 8 4 6 5 10 9 3 2 1
Rank Awarded by Manager III : 4 5 1 8 9 10 6 7 3 2
Determine which two managers have the nearest approach in the evaluation of the performance
of the sales persons.
Solution:
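Sales Person  Manager I Rank: R1  Manager II Rank: R2  Manager III Rank: R3  (R1 – R2)2  (R1 – R3)2  (R2 – R3)2
1 8 7 4 1 16 9
2 7 8 5 1 4 9
3 6 4 1 4 25 9
4 1 6 8 25 49 4
5 5 5 9 0 16 16
6 9 10 10 1 1 0
7 10 9 6 1 16 9
8 2 3 7 1 25 16
9 3 2 3 1 0 1
10 4 1 2 9 4 1
Total 44 156 74
The rank correlation coefficient between managers I and II is
1 – { 6 x 44 / (1000 – 10) }
= 1 – (264 / 990)
= 1 – 0.27
= 0.73
The rank correlation coefficient between managers I and III is
1 – { 6 x 156 / (1000 – 10) }
= 1 – (936 / 990)
= 1 – 0.95
= 0.05
The rank correlation coefficient between managers II and III is
1 – { 6 x 74 / (1000 – 10) }
= 1 – (444 / 990)
= 1 – 0.45
= 0.55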
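The pairwise comparison of the three managers can also be sketched in code. The dictionary layout and the names `spearman_rho` and `closest` are illustrative assumptions; the ranks are those given in the problem.

```python
from itertools import combinations

# Ranks awarded to sales persons 1 to 10 by each manager.
ranks = {
    "I": [8, 7, 6, 1, 5, 9, 10, 2, 3, 4],
    "II": [7, 8, 4, 6, 5, 10, 9, 3, 2, 1],
    "III": [4, 5, 1, 8, 9, 10, 6, 7, 3, 2],
}


def spearman_rho(rank_x, rank_y):
    """rho = 1 - 6 * sum(D^2) / (N^3 - N) for two lists of ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Coefficient for every pair of managers.
results = {
    pair: spearman_rho(ranks[pair[0]], ranks[pair[1]])
    for pair in combinations(ranks, 2)
}
for pair, rho in results.items():
    print(pair, round(rho, 2))

# The pair with the highest coefficient has the nearest approach.
closest = max(results, key=results.get)
print("Nearest approach:", closest)
```

The pair with the largest coefficient is the one whose evaluations agree most closely.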
Inference:
If there are two items with equal values, their ranks will be two consecutive integers, say
s and s + 1. Their average is { s + (s+1) } / 2. Assign this rank to both items. Note that we
allow ranks to be fractions also.
If there are three items with equal values, their ranks will be three consecutive integers,
say s, s + 1 and s + 2. Their average is { s + (s+1) + (s+2) } / 3 = (3s + 3) / 3 = s + 1.
Assign this rank to all the three items. A similar procedure
is followed if four or more items have equal values.
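Correction term: in the formula
$$\rho = 1 - \frac{6 \sum D^2}{N^3 - N}$$
if a tie involves m items, we add the correction term
$$\frac{m^3 - m}{12}$$
to the term ∑D2 in ρ. We have to add as many terms like (m3 – m) / 12 as there are ties.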
Let us calculate the correction terms for certain values of m. These are provided in the
following table.
m  m3  m3 – m  Correction term = (m3 – m) / 12
2 8 6 0.5
3 27 24 2
4 64 60 5
5 125 120 10
Illustrative examples:
If there are 2 ties involving 2 items each, then the correction term is 0.5 + 0.5 = 1
If there are 3 ties with 2 items each, then the correction term is 0.5 + 0.5 + 0.5 = 1.5
If there are 2 ties involving 3 items each, then the correction term is 2 + 2 = 4
If there is a tie with 2 items and another tie with 3 items, then the correction
term is 0.5 + 2 = 2.5
If there are 2 ties with 2 items each and another tie with 3 items, then the
correction term is 0.5 + 0.5 + 2 = 3
Problem 9 : Resolving ties in ranks
The following are the details of ratings scored by two popular insurance schemes.
Determine the rank correlation coefficient between them.
Scheme I 80 80 83 84 87 87 89 90
Scheme II 55 56 57 57 57 58 59 60
Solution:
Step 1.
Arrange the scores for Insurance Scheme I in descending order and
rank them as 1,2,3,…,8.
Scheme I
Score   90   89   87   87   84   83   80   80
Rank    1    2    3    4    5    6    7    8
The score 87 appears twice. The corresponding ranks are 3, 4. Their average is (3 + 4) / 2 = 3.5. Assign this rank to the two equal scores in Scheme I.
The score 80 appears twice. The corresponding ranks are 7, 8. Their average is (7 + 8) / 2 = 7.5. Assign this rank to the two equal scores in Scheme I.
The revised ranks for Scheme I are as follows:
Scheme I
Score   90   89   87    87    84   83   80    80
Rank    1    2    3.5   3.5   5    6    7.5   7.5
Step 2.
Arrange the scores for Insurance Scheme II in descending order
and rank them as 1,2,3,…,8.
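Scheme II
Score   60   59   58   57   57   57   56   55
Rank    1    2    3    4    5    6    7    8
The score 57 appears thrice. The corresponding ranks are 4, 5, 6. Their average is (4 + 5 + 6) / 3 = 5. Assign this rank to the three equal scores in Scheme II.
The revised ranks for Scheme II are as follows:
Scheme II
Score   60   59   58   57   57   57   56   55
Rank    1    2    3    5    5    5    7    8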
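Step 3.
Calculation of D²: Assign the revised ranks to the given pairs of values and calculate D² as follows:
Scheme I score   Scheme II score   Rank R1   Rank R2   D = R1 – R2   D²
80               55                7.5       8         –0.5          0.25
80               56                7.5       7         0.5           0.25
83               57                6         5         1             1
84               57                5         5         0             0
87               57                3.5       5         –1.5          2.25
87               58                3.5       3         0.5           0.25
89               59                2         2         0             0
90               60                1         1         0             0
Total                                                                4
Step 4.
Calculation of ρ:
We have N = 8 and ∑D² = 4.
Since there are 2 ties with 2 items each and another tie with 3 items, the correction term is 0.5 + 0.5 + 2 = 3.
ρ = 1 – { 6 x (4 + 3) / (512 – 8) }
= 1 – (42 / 504)
= 1 – 0.0833
= 0.9167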
Inference:
It is inferred that the two insurance schemes are highly, positively correlated.
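The tie-handling procedure of Problem 9 can be sketched in Python as follows. The sketch assigns average ranks to tied scores, adds (m³ – m) / 12 to ∑D² for every tie of m items, and then applies the rank correlation formula. The helper names are illustrative assumptions; this follows the correction rule described in these notes rather than any library routine.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores in descending order; tied scores share the average rank."""
    ordered = sorted(scores, reverse=True)
    positions = {}
    for position, score in enumerate(ordered, start=1):
        positions.setdefault(score, []).append(position)
    return [sum(positions[s]) / len(positions[s]) for s in scores]


def tie_correction(scores):
    """Sum of (m^3 - m) / 12 over every group of m tied scores."""
    return sum((m ** 3 - m) / 12 for m in Counter(scores).values() if m > 1)


def spearman_rho_with_ties(x, y):
    """Rank correlation with the tie correction added to sum(D^2)."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    corrected = sum_d2 + tie_correction(x) + tie_correction(y)
    return 1 - 6 * corrected / (n ** 3 - n)


# Ratings from Problem 9
scheme_1 = [80, 80, 83, 84, 87, 87, 89, 90]
scheme_2 = [55, 56, 57, 57, 57, 58, 59, 60]
rho = spearman_rho_with_ties(scheme_1, scheme_2)
print(round(rho, 4))
```

For these data ∑D² is 4 and the total correction is 3, so the coefficient works out to 1 – 42/504.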
REGRESSION
In the pairs of observations, if there is a cause and effect relationship between the variables X and Y, then the average relationship between these two variables is called regression, which means “stepping back” or “return to the average”. The linear relationship giving the best mean value of a variable corresponding to the other variable is called a regression line or line of best fit. The regression of X on Y is different from the regression of Y on X. Thus, there are two equations of regression and the two regression lines are given as follows:
Regression of Y on X: $Y - \bar{Y} = b_{yx}\,(X - \bar{X})$
Regression of X on Y: $X - \bar{X} = b_{xy}\,(Y - \bar{Y})$
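Result:
Let σx, σy denote the standard deviations of x, y respectively. We have the following result.
$$ b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \quad \text{and} \quad b_{xy} = r\,\frac{\sigma_x}{\sigma_y} $$
Hence
$$ r^2 = b_{yx}\, b_{xy} \quad \text{and so} \quad r = \pm\sqrt{b_{yx}\, b_{xy}} $$
where r takes the common sign of the two regression coefficients.
Result:
The coefficient of correlation r between X and Y is the square root of the product of the b values in the two regression equations. We can find r in this way also.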
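Application
Suppose a straight line is fitted to the n observations, and let ŷ1, ŷ2, …, ŷn denote the values of y given by the line at x1, x2, …, xn. The errors of estimation are
E1 = y1 – ŷ1, E2 = y2 – ŷ2, …, En = yn – ŷn.
Some of these quantities are positive while the remaining ones are negative. However, the squares of all these quantities are non-negative,
i.e.,
E1² = (y1 – ŷ1)² ≥ 0, E2² = (y2 – ŷ2)² ≥ 0, …, En² = (yn – ŷn)² ≥ 0.
[Figure: observed points (X1, Y1), (X2, Y2) plotted against X from the origin O, with the errors e1, e2 shown as deviations from the fitted line.]
Among all those straight lines which are somewhat near to the given observations (x1, y1), (x2, y2), …, (xn, yn), we consider that straight line as the ideal one for which the sum of the squares of the errors (SSE), E1² + E2² + … + En², is the least. Since the ideal straight line giving the regression of y on x is based on this concept, we call this principle the Principle of least squares.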
Normal equations
Suppose the straight line to be fitted is
y = a + b x    (1)
where a, b are constants to be determined. Mathematically speaking, two distinct points on a straight line are sufficient to find its equation. However, a different approach is followed here. We want to include all the observations in our attempt to build a straight line. Then all the n observed points (x, y) are required to satisfy the relation (1). Summing (1) over the n observations, we get
∑ y = ∑ (a + b x) = ∑ (a.1 + b x) = (∑ a.1) + (∑ b x) = a (∑ 1) + b (∑ x),
i.e.,
∑ y = a n + b (∑ x)    (2)
Next, multiply (1) by x. We obtain
x y = a x + b x².
Summing over the n observations,
∑ x y = a (∑ x) + b (∑ x²)    (3)
Equations (2) and (3) are referred to as the normal equations associated with the regression of y on x. Solving these two equations, we obtain
$$ a = \frac{(\sum X^2)(\sum Y) - (\sum X)(\sum XY)}{n \sum X^2 - (\sum X)^2} $$
and
$$ b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2} $$
Note:
For calculating the coefficient of correlation, we require ∑X, ∑Y, ∑XY, ∑X², ∑Y². For calculating the regression of y on x, we require ∑X, ∑Y, ∑XY, ∑X². Thus, the tabular column is the same in both cases, with the difference that ∑Y² is also required for the coefficient of correlation.
Similarly, the regression of x on y is X = a + b Y, where
$$ a = \frac{(\sum Y^2)(\sum X) - (\sum Y)(\sum XY)}{n \sum Y^2 - (\sum Y)^2} $$
and
$$ b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum Y^2 - (\sum Y)^2} $$
Problem 10
X 5 6 7 8 9 10 11
Y 2 4 5 5 3 8 7
Solution:
X Y XY X2
5 2 10 25
6 4 24 36
7 5 35 49
8 5 40 64
9 3 27 81
10 8 80 100
11 7 77 121
Total: 56 34 293 476
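The regression of Y on X is Y = a + b X, where
a = {(∑ x²) (∑ y) – (∑ x) (∑ x y)} / {n (∑ x²) – (∑ x)²}
= (476 x 34 – 56 x 293) / (7 x 476 – 56²)
= (16184 – 16408) / (3332 – 3136)
= – 224 / 196
= – 1.1429
b = {n (∑ x y) – (∑ x) (∑ y)} / {n (∑ x²) – (∑ x)²}
= (7 x 293 – 56 x 34) / 196
= 147 / 196
= 0.75
So, the regression equation is Y = a + b X,
i.e.,
Y = – 1.14 + 0.75 X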
Problem 11
Income 40 70 50 60 80 50 90 40 60 60
Expenditure 25 60 45 50 45 20 55 30 35 30
Determine the regression of expenditure on income and estimate the expenditure when the income is 65.
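Solution:
Let X denote the income and Y the expenditure. Calculate ∑X, ∑Y, ∑XY, ∑X² as follows:
X     Y     XY      X²
40    25    1000    1600
70    60    4200    4900
50    45    2250    2500
60    50    3000    3600
80    45    3600    6400
50    20    1000    2500
90    55    4950    8100
40    30    1200    1600
60    35    2100    3600
60    30    1800    3600
Total: 600   395   25100   38400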
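a = {(∑ x²) (∑ y) – (∑ x) (∑ x y)} / {n (∑ x²) – (∑ x)²}
= (38400 x 395 – 600 x 25100) / (10 x 38400 – 600²)
= 108000 / 24000
= 4.5
b = {n (∑ x y) – (∑ x) (∑ y)} / {n (∑ x²) – (∑ x)²}
= (10 x 25100 – 600 x 395) / 24000
= 14000 / 24000
= 0.583
Y = a + b X
i.e.,
Y = 4.5 + 0.583 X
To estimate the expenditure when income is 65:
Y = 4.5 + 0.583 x 65
= 4.5 + 37.895
= 42.395
= 42 (approximately).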
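The calculations in Problems 10 and 11 can be reproduced with a small Python sketch that solves the normal equations for Y = a + b X directly from the raw sums. The function name fit_y_on_x is an illustrative assumption. Because the sketch keeps b unrounded, its estimate for an income of 65 differs slightly in the decimals from the hand calculation, which uses b = 0.583.

```python
def fit_y_on_x(xs, ys):
    """Solve the normal equations for the regression line y = a + b x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b


# Problem 10
a10, b10 = fit_y_on_x([5, 6, 7, 8, 9, 10, 11], [2, 4, 5, 5, 3, 8, 7])
print(round(a10, 4), round(b10, 4))

# Problem 11: regression of expenditure (Y) on income (X)
income = [40, 70, 50, 60, 80, 50, 90, 40, 60, 60]
expenditure = [25, 60, 45, 50, 45, 20, 55, 30, 35, 30]
a11, b11 = fit_y_on_x(income, expenditure)
estimate_at_65 = a11 + b11 * 65
print(round(a11, 4), round(b11, 4), round(estimate_at_65, 2))
```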
Problem 12
Occupancy rate    40   45   70   60   70   75   70   80   95   90
Solution:
Note that in Problems 10 and 11, we wanted only one regression line and so we did not take ∑Y². Now we require two regression lines. Therefore, calculate ∑X, ∑Y, ∑XY, ∑X², ∑Y².
          X      Y      XY       X²       Y²
Total:    695    885    65450    51075    84925
The regression of Y on X is
Y=a+bX
Where
a ={(∑ x2) (∑ y) – (∑ x) (∑ x y)} / {n (∑ x2) – (∑ x) 2}
and
b ={n (∑ x y) – (∑ x) (∑ y)} / {n (∑ x2) – (∑ x) 2}
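We obtain
a = (51075 x 885 – 695 x 65450) / (10 x 51075 – 695²)
= – 286375 / 27725
= – 10.329
b = (10 x 65450 – 695 x 885) / 27725
= 39425 / 27725
= 1.422
So, the regression equation is Y = – 10.329 + 1.422 X
Similarly, the regression of X on Y is X = a + b Y, where
a = (84925 x 695 – 885 x 65450) / (10 x 84925 – 885²)
= 1099625 / 66025
= 16.655,
b = (10 x 65450 – 695 x 885) / 66025
= 39425 / 66025
= 0.597
So, the regression equation is X = 16.655 + 0.597 Y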
Note:
For the data given in this problem, if we use the formula for r, we get
$$ r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{N \sum X^2 - (\sum X)^2}\;\sqrt{N \sum Y^2 - (\sum Y)^2}} $$
= (10 x 65450 – 695 x 885) / { √ (10 x 51075 – 695²) √ (10 x 84925 – 885²) }
= 39425 / 42784.85
= 0.9215
However, once we know the two b values, we can find the coefficient of correlation r between X and Y as the square root of the product of the two b values. Thus we obtain
r = √ (1.422 x 0.597)
= √ 0.848934
= 0.9214.
Note that this agrees with the above value of r up to rounding.
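The relationship between the two regression lines and r can be sketched in Python using only the five sums quoted in Problem 12. The function name regression_from_sums is an illustrative assumption. The sketch computes both lines and r, and then recovers r from the two slopes, attaching the sign of the slopes to the square root.

```python
import math


def regression_from_sums(n, sx, sy, sxy, sxx, syy):
    """Return (a, b) for Y on X, (a, b) for X on Y, and r, from raw sums."""
    cross_term = n * sxy - sx * sy      # numerator shared by both slopes and r
    x_term = n * sxx - sx ** 2          # denominator for the line of Y on X
    y_term = n * syy - sy ** 2          # denominator for the line of X on Y
    b_yx = cross_term / x_term
    a_yx = (sxx * sy - sx * sxy) / x_term
    b_xy = cross_term / y_term
    a_xy = (syy * sx - sy * sxy) / y_term
    r = cross_term / math.sqrt(x_term * y_term)
    return (a_yx, b_yx), (a_xy, b_xy), r


# Sums used in Problem 12: n, sum X, sum Y, sum XY, sum X^2, sum Y^2
(a_yx, b_yx), (a_xy, b_xy), r = regression_from_sums(
    10, 695, 885, 65450, 51075, 84925
)
# r is the square root of the product of the slopes, with their common sign
r_from_slopes = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
print(round(b_yx, 3), round(b_xy, 3), round(r, 4), round(r_from_slopes, 4))
```

Without intermediate rounding the two routes to r give the same value; the small difference in the hand calculation comes from rounding the slopes to three decimals.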
Introduction
Definition
Assumptions of ANOVA
three important assumptions, namely
One-way classified data
Define the linear model for the sample data obtained from the experiment by the equation
$$ y_{ij} = \mu + a_i + e_{ij}, \qquad i = 1, 2, \ldots, k; \quad j = 1, 2, \ldots, n_i $$
where µ represents the general mean effect, which is fixed and which represents the general condition of the experimental units, and $a_i$ denotes the fixed effect due to the ith level of the factor A (i = 1, 2, …, k); hence the variation due to $a_i$ (i = 1, 2, …, k) is said to be controlled.
The last component of the model, $e_{ij}$, is a random variable. It is called the error component and it makes $Y_{ij}$ a random variate. The variation in $e_{ij}$ is due to all the uncontrolled factors, and the $e_{ij}$ are independently, identically and normally distributed with mean zero and constant variance σ².
For the realization of the random variate $Y_{ij}$, consider $y_{ij}$ defined by
$$ y_{ij} = \mu + a_i + e_{ij}, \qquad i = 1, 2, \ldots, k; \quad j = 1, 2, \ldots, n_i $$
The expected value of the general observation $y_{ij}$ in the experimental units is given by
$$ E(y_{ij}) = \mu_i \quad \text{for all } i = 1, 2, \ldots, k $$
with $y_{ij} = \mu_i + e_{ij}$, where $e_{ij}$ is the random error effect due to uncontrolled factors (i.e., due to chance only).
Here we may expect $\mu_i = \mu$ for all i = 1, 2, …, k, if there is no variation due to control factors. If it is not the case, we have $\mu_i \neq \mu$ for at least one i. Suppose $\mu_i - \mu = a_i$. Then
$$ y_{ij} = \mu + a_i + e_{ij}, \qquad i = 1, 2, \ldots, k; \quad j = 1, 2, \ldots, n_i \qquad (1) $$
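By the principle of least squares, µ and $a_i$ are estimated by minimizing the error sum of squares
$$ E = \sum_{ij} e_{ij}^2 = \sum_{ij} (y_{ij} - \mu - a_i)^2 . $$
Differentiating E with respect to µ and $a_i$ for all i = 1, 2, …, k and equating the derivatives to zero, we obtain the normal equations
$$ G = N\mu + \sum_i n_i a_i \qquad (2) $$
and
$$ T_i = n_i \mu + n_i a_i, \qquad i = 1, 2, \ldots, k \qquad (3) $$
where $G = \sum_{ij} y_{ij}$ is the grand total, $T_i = \sum_j y_{ij}$ is the total for the ith level, and $N = \sum_i n_i$ is the total number of observations (N = nk when every $n_i = n$). We see that the number of unknowns (k + 1) is more than the number of independent equations (k). So, by the theorem on a system of linear equations, it follows that a unique solution for this system is not possible.
However, by making the assumption that $\sum_i n_i a_i = 0$, we can get a unique solution. Equation (2) then gives G = Nµ, i.e., µ = G / N. Therefore the estimate of µ is given by
$$ \hat{\mu} = \frac{G}{N} \qquad (4) $$
Again, from equation (3), we have
$$ \frac{T_i}{n_i} = \hat{\mu} + \hat{a}_i $$
Hence $\hat{a}_i = T_i / n_i - \hat{\mu}$, i.e.,
$$ \hat{a}_i = \frac{T_i}{n_i} - \frac{G}{N} \qquad (5) $$
Substituting these estimates, the residual (error) sum of squares is
$$ E = \sum_{ij} (y_{ij} - \hat{\mu} - \hat{a}_i)^2 $$
After carrying out some calculations and using the normal equations (2) and (3), we obtain
$$ E = \left( \sum_{ij} y_{ij}^2 - \frac{G^2}{N} \right) - \left( \sum_i \frac{T_i^2}{n_i} - \frac{G^2}{N} \right) \qquad (6) $$
The first term in the RHS of equation (6) is called the corrected total sum of squares, while $\sum_{ij} y_{ij}^2$ is called the uncorrected total sum of squares.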
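For measuring the variation due to treatment (controlled factor), we consider the null hypothesis that all the treatment effects are equal,
i.e.,
$$ H_0 : \mu_1 = \mu_2 = \cdots = \mu_k = \mu $$
i.e., $H_0 : \mu_i = \mu$ for all i = 1, 2, …, k. Under $H_0$, the linear model reduces to
$$ y_{ij} = \mu + e_{ij}, \qquad i = 1, 2, \ldots, k; \quad j = 1, 2, \ldots, n_i $$
Proceeding as before, we get the residual sum of squares for this hypothetical model as
$$ E_1 = \sum_{ij} y_{ij}^2 - \frac{G^2}{N} \qquad (7) $$
Actually, $E_1$ contains the variation due to both treatment and error. Therefore a measure of variation due to treatment can be obtained by “$E_1 - E$”. Using (6) and (7), we get
$$ E_1 - E = \sum_{i=1}^{k} \frac{T_i^2}{n_i} - \frac{G^2}{N} \qquad (8) $$
The expression in (8) is called the treatment sum of squares. Here it may be noted that $G^2 / N$ is a correction factor.
The test statistic is the ratio of the treatment mean sum of squares (TrMSS) to the error mean sum of squares (EMSS), each obtained by dividing the corresponding sum of squares (TrSS, ESS) by its degrees of freedom:
$$ F = \frac{\text{TrMSS}}{\text{EMSS}} = \frac{\text{TrSS} / (k - 1)}{\text{ESS} / (N - k)} \sim F_{k-1,\, N-k} $$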
All these values are represented in the form of a table called ANOVA
table, furnished below.
ANOVA table for one-way classified data
Source of variation                             Degrees of freedom   Sum of squares                        Mean sum of squares   Variance ratio
Between the levels of the factor (Treatment)    k – 1                QT = E1 – E = ∑ (Ti² / ni) – G² / N   MT = QT / (k – 1)     FT = MT / ME : F(k – 1, N – k)
Within the levels of the factor (Error)         N – k                QE (by subtraction)                   ME = QE / (N – k)     -
Total                                           N – 1                Q = ∑ yij² – G² / N                   -                     -
Variance ratio
The variance ratio is the ratio of the greater variance to the smaller
variance. It is also called the F-coefficient. We have
Inference
Note:
If the calculated value of F and the table value of F are equal, we can try some other value of α.
Problem 1
The following are the details of sales effected by three sales persons in three door-to-door campaigns.
Sales person    Sales in door-to-door campaigns
A               8    9    5    10
B               7    6    6    9
C               6    6    7    5
Construct an ANOVA table and find out whether there is any significant difference in the performance of the sales persons.
Solution:
∑A = 8 + 9 + 5 + 10 = 32
∑B = 7 + 6 + 6 + 9 = 28
∑C = 6 + 6 + 7 + 5 = 24
Sample mean for A: $\bar{A}$ = 32 / 4 = 8
Sample mean for B: $\bar{B}$ = 28 / 4 = 7
Sample mean for C: $\bar{C}$ = 24 / 4 = 6
Total number of sample items = No. of items for A + No. of items for B + No. of items for C
= 4 + 4 + 4 = 12
Mean of all the samples: $\bar{X}$ = (32 + 28 + 24) / 12 = 84 / 12 = 7
Sum of squares of deviations for A:
A      A – 8    (A – 8)²
8      0        0
9      1        1
5      –3       9
10     2        4
Total           14
Sum of squares of deviations for B:
B      B – 7    (B – 7)²
7      0        0
6      –1       1
6      –1       1
9      2        4
Total           6
Sum of squares of deviations for C:
C      C – 6    (C – 6)²
6      0        0
6      0        0
7      1        1
5      –1       1
Total           2
Sum of squares of deviations within varieties
= $\sum (A - \bar{A})^2 + \sum (B - \bar{B})^2 + \sum (C - \bar{C})^2$
= 14 + 6 + 2
= 22
Sum of squares of deviations for total variance:
Sales person    Sales    Sales – 7    (Sales – 7)²
A               8        1            1
A               9        2            4
A               5        –2           4
A               10       3            9
B               7        0            0
B               6        –1           1
B               6        –1           1
B               9        2            4
C               6        –1           1
C               6        –1           1
C               7        0            0
C               5        –2           4
Total                                 30
Sum of squares of deviations between varieties = 30 – 22 = 8
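ANOVA Table
Source of variation    Degrees of freedom    Sum of squares of deviations    Variance
Between varieties      3 – 1 = 2             8                               8 / 2 = 4
Within varieties       12 – 3 = 9            22                              22 / 9 = 2.44
Total                  12 – 1 = 11           30
Calculation of F value:
F = Greater variance / Smaller variance = 4 / 2.44 = 1.64
Degrees of freedom for greater variance (df1) = 2
Degrees of freedom for smaller variance (df2) = 9
Table value of F at 5% level of significance for (2, 9) degrees of freedom = 4.26
Inference:
The calculated value of F is less than the table value of F. Therefore, the null hypothesis is accepted. It is concluded that there is no significant difference in the performance of the sales persons, at 5% level of significance.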
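Alternative method (using the correction factor):
Correction Factor (CF) = T² / N = 84² / 12 = 588
Sales Person    X     X²
A               8     64
A               9     81
A               5     25
A               10    100
B               7     49
B               6     36
B               6     36
B               9     81
C               6     36
C               6     36
C               7     49
C               5     25
Total                 618
Sum of squares of deviations for total variance = ∑X² – CF = 618 – 588 = 30
Sum of squares of deviations for variance between samples
= (∑A)² / N1 + (∑B)² / N2 + (∑C)² / N3 – CF
= 32² / 4 + 28² / 4 + 24² / 4 – 588
= 256 + 196 + 144 – 588
= 8
Sum of squares of deviations for variance within samples = 30 – 8 = 22
ANOVA Table
Source of variation    Degrees of freedom    Sum of squares of deviations    Variance
Between varieties      3 – 1 = 2             8                               8 / 2 = 4
Within varieties       12 – 3 = 9            22                              22 / 9 = 2.44
Total                  12 – 1 = 11           30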
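The ANOVA table for Problem 1 can be reproduced with a short Python sketch based on the correction factor G² / N. The function name one_way_anova and the dictionary keys are illustrative assumptions. The sketch reports the ratio of the between-groups mean square to the within-groups mean square, which for these data is also the ratio of the greater variance to the smaller one.

```python
def one_way_anova(groups):
    """One-way ANOVA sums of squares using the correction factor G^2 / N."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / n_total
    # Corrected total sum of squares: sum of y^2 minus G^2 / N
    ss_total = sum(x * x for g in groups for x in g) - correction
    # Treatment sum of squares: sum of T_i^2 / n_i minus G^2 / N
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_within = ss_total - ss_between  # error sum of squares, by subtraction
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "df_between": k - 1,
        "df_within": n_total - k,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
    }


# Sales of persons A, B and C from Problem 1
sales = [[8, 9, 5, 10], [7, 6, 6, 9], [6, 6, 7, 5]]
table = one_way_anova(sales)
for key, value in table.items():
    print(key, round(value, 2))
```

The resulting F ratio is then compared with the table value of F for (k – 1, N – k) degrees of freedom at the chosen level of significance.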
Problem 2
The following are the details of plinth areas of ownership apartment flats offered by 3
housing companies A,B,C. Use analysis of variance to determine whether there is any significant
difference in the plinth areas of the apartment flats.
Housing Company    Plinth area of apartment flats
A                  1500    1430    1550    1450
Note:
As the given figures are large, working with them will be difficult. Therefore, we use the following facts: the variance ratio F is unaffected if the same constant is subtracted from every item, or if every item is divided by the same non-zero constant.
In the problem under consideration, the numbers vary from 1420 to 1600. So we follow a method called the coding method. First, let us subtract 1400 from each item. Then divide each difference by 10.
The transformed data are given below.
A 10 3 15 5
B 5 15 10 8
C 15 2 5 3
∑A = 10 + 3 + 15 + 5 = 33
∑B = 5 + 15 + 10 + 8 = 38
∑C = 15 + 2 + 5 + 3 = 25
T = ∑A + ∑B + ∑C
= 33 + 38 + 25
= 96
Correction factor (CF) = T² / N = 96² / 12 = 768
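Company    X     X²
A          10    100
A          3     9
A          15    225
A          5     25
B          5     25
B          15    225
B          10    100
B          8     64
C          15    225
C          2     4
C          5     25
C          3     9
Total            1036
Sum of squares of deviations for total variance = ∑X² – CF = 1036 – 768 = 268
Sum of squares of deviations for variance between samples
= (∑A)² / N1 + (∑B)² / N2 + (∑C)² / N3 – CF
= 33² / 4 + 38² / 4 + 25² / 4 – 768
= 272.25 + 361 + 156.25 – 768
= 21.5
Sum of squares of deviations for variance within samples = 268 – 21.5 = 246.5
ANOVA Table
Source of variation    Degrees of freedom    Sum of squares of deviations    Variance
Between varieties      3 – 1 = 2             21.5                            21.5 / 2 = 10.75
Within varieties       12 – 3 = 9            246.5                           246.5 / 9 = 27.39
Total                  12 – 1 = 11           268
Calculation of F value:
F = Greater variance / Smaller variance = 27.39 / 10.75 = 2.55
Degrees of freedom for greater variance (df1) = 9
Degrees of freedom for smaller variance (df2) = 2
Table value of F at 5% level of significance for (9, 2) degrees of freedom = 19.38
Inference:
Since the calculated value of F is less than the table value of F, the null hypothesis is accepted and it is concluded that there is no significant difference in the plinth areas of ownership apartment flats offered by the three companies, at 5% level of significance.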
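The coding method works because the variance ratio does not change when a constant is subtracted from every item or when every item is divided by a common constant. The Python sketch below illustrates this with the transformed data of Problem 2. The decoded figures are an assumption made only for illustration: they are rebuilt as 1400 + 10 x (coded value), which matches the row given for Company A but has not been checked against the original rows for B and C. The function name f_ratio is also illustrative.

```python
def f_ratio(groups):
    """Between-groups mean square divided by within-groups mean square."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    correction = sum(sum(g) for g in groups) ** 2 / n_total
    ss_total = sum(x * x for g in groups for x in g) - correction
    ss_between = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ms_between = ss_between / (k - 1)
    ms_within = (ss_total - ss_between) / (n_total - k)
    return ms_between / ms_within


# Transformed (coded) data from Problem 2
coded = [[10, 3, 15, 5], [5, 15, 10, 8], [15, 2, 5, 3]]
# Assumption: undo the coding as 1400 + 10 * value (confirmed only for row A)
decoded = [[1400 + 10 * x for x in group] for group in coded]

f_coded = f_ratio(coded)
f_decoded = f_ratio(decoded)
print(round(f_coded, 4), round(f_decoded, 4))
# Greater variance over smaller variance, as used in these notes
print(round(1 / f_coded, 2))
```

Both calls give the same ratio, so the test can be carried out on the smaller coded figures.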
Problem 3
Source of variation    Degrees of freedom    Sum of squares of deviations
Treatments             2                     15
Residual               5                     25
Total (corrected)      7                     40
Solution:
Variance between varieties = 15 / 2 = 7.5
Variance within varieties = 25 / 5 = 5
F = Greater variance / Smaller variance = 7.5 / 5 = 1.5
Degrees of freedom for greater variance (df1) = 2
Degrees of freedom for smaller variance (df2) = 5
Table value of F at 5% level of significance for (2, 5) degrees of freedom = 5.79
Inference:
Since the calculated value of F is less than the table value of F, we accept the null hypothesis and conclude that there is no significant difference in the performance of the three financial schemes.
I. PARTIAL CORRELATION
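Suppose Y is a dependent variable, depending on n other variables X1, X2, …, Xn. Partial correlation is a measure of the relationship between Y and any one of the variables X1, X2, …, Xn, as if the other variables have been eliminated from the situation.
The partial correlation coefficient is defined in terms of simple correlation coefficients as follows:
Let r12 be the simple correlation coefficient between X1 and X2. Let r13 be the simple correlation coefficient between X1 and X3. Let r23 be the simple correlation coefficient between X2 and X3. Then we have
$$ r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\;\sqrt{1 - r_{23}^2}} $$
Similarly,
$$ r_{13.2} = \frac{r_{13} - r_{12}\, r_{32}}{\sqrt{1 - r_{12}^2}\;\sqrt{1 - r_{32}^2}} $$
and
$$ r_{32.1} = \frac{r_{32} - r_{31}\, r_{21}}{\sqrt{1 - r_{31}^2}\;\sqrt{1 - r_{21}^2}} $$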
Problem 1
Given that r12 = 0.6, r13 = 0.58, r23 = 0.70 determine the partialcorrelation coefficient
r12.3
Solution:
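We have
$$ r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\;\sqrt{1 - r_{23}^2}} $$
= (0.6 – 0.58 x 0.70) / { √ (1 – 0.3364) √ (1 – 0.49) }
= 0.194 / { √ 0.6636 x √ 0.51 }
= 0.194 / (0.8146 x 0.7141)
= 0.194 / 0.5817
= 0.3335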
Problem 2
If r12 = 0.75, r13 = 0.80, r23 = 0.70, find the partial correlation
coefficient r13.2
Solution:
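We have
$$ r_{13.2} = \frac{r_{13} - r_{12}\, r_{32}}{\sqrt{1 - r_{12}^2}\;\sqrt{1 - r_{32}^2}} $$
= (0.8 – 0.525) / { √ (1 – 0.5625) √ (1 – 0.49) }
= 0.275 / { √ 0.4375 x √ 0.51 }
= 0.275 / (0.6614 x 0.7141)
= 0.275 / 0.4723
= 0.5823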
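The two partial correlation problems above can be checked with a small Python sketch. The function name partial_corr and its argument names are illustrative assumptions: the first argument is the simple correlation between the two variables of interest, and the other two are their correlations with the variable that is held constant.

```python
import math


def partial_corr(r_ab, r_ac, r_bc):
    """Partial correlation between a and b with c held constant."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# Problem 1: r12 = 0.6, r13 = 0.58, r23 = 0.70, find r12.3
r12_3 = partial_corr(0.6, 0.58, 0.70)

# Problem 2: r12 = 0.75, r13 = 0.80, r23 = 0.70, find r13.2
# Here a = X1, b = X3 and c = X2, so the arguments are r13, r12, r32
r13_2 = partial_corr(0.80, 0.75, 0.70)

print(round(r12_3, 3), round(r13_2, 3))
```

Small differences in the fourth decimal place relative to the hand calculations come from rounding the intermediate square roots.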
II. MULTIPLE CORRELATION
Suppose Y is a dependent variable, which is influenced by n other variables X1, X2, …, Xn. The multiple correlation is a measure of the relationship between Y and X1, X2, …, Xn considered together.
The multiple correlation coefficient is denoted by the letter R. The dependent variable is denoted by X1. The independent variables are denoted by X2, X3, X4, …, etc.
Meaning of notations:
R1.23 denotes the multiple correlation of the dependent variable X1 with two independent variables X2 and X3. It is a measure of the relationship that X1 has with X2 and X3.
R2.13 is the multiple correlation of the dependent variable X2 with two independent variables X1 and X3.
R3.12 is the multiple correlation of the dependent variable X3 with two independent variables X1 and X2.
R1.234 is the multiple correlation of the dependent variable X1 with three independent variables X2, X3 and X4.
The coefficient of multiple linear correlation is given in terms of the simple correlation coefficients as follows:
$$ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}} $$
$$ R_{2.13} = \sqrt{\frac{r_{21}^2 + r_{23}^2 - 2\, r_{21}\, r_{23}\, r_{13}}{1 - r_{13}^2}} $$
$$ R_{3.12} = \sqrt{\frac{r_{31}^2 + r_{32}^2 - 2\, r_{31}\, r_{32}\, r_{12}}{1 - r_{12}^2}} $$
3. R1.23 ≥ |r12|,
Problem 3
If the simple correlation coefficients have the values r12 = 0.6, r13 =
0.65, r23 = 0.8, find the multiple correlation coefficient R1.23
Solution:
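We have
$$ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}} $$
= √ { (0.36 + 0.4225 – 0.624) / (1 – 0.64) }
= √ { (0.7825 – 0.624) / 0.36 }
= √ (0.1585 / 0.36)
= √ 0.4403
= 0.6636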
Problem 4
Given that r21 = 0.7, r23 = 0.85 and r13 = 0.75, determine R2.13
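Solution:
We have
$$ R_{2.13} = \sqrt{\frac{r_{21}^2 + r_{23}^2 - 2\, r_{21}\, r_{23}\, r_{13}}{1 - r_{13}^2}} $$
= √ { (0.49 + 0.7225 – 2 x 0.7 x 0.85 x 0.75) / (1 – (0.75)²) }
= √ { (1.2125 – 0.8925) / 0.4375 }
= √ (0.32 / 0.4375)
= √ 0.7314
= 0.8552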
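Problems 3 and 4 can be verified with the following Python sketch of the two-predictor multiple correlation formula. The function name multiple_corr and its argument names are illustrative assumptions: the first two arguments are the simple correlations of the dependent variable with each independent variable, and the third is the correlation between the two independent variables.

```python
import math


def multiple_corr(r_dep_a, r_dep_b, r_ab):
    """Multiple correlation of a dependent variable on two predictors a, b."""
    r_squared = (
        r_dep_a ** 2 + r_dep_b ** 2 - 2 * r_dep_a * r_dep_b * r_ab
    ) / (1 - r_ab ** 2)
    return math.sqrt(r_squared)


# Problem 3: R1.23 with r12 = 0.6, r13 = 0.65, r23 = 0.8
R1_23 = multiple_corr(0.6, 0.65, 0.8)

# Problem 4: R2.13 with r21 = 0.7, r23 = 0.85, r13 = 0.75
R2_13 = multiple_corr(0.7, 0.85, 0.75)

print(round(R1_23, 4), round(R2_13, 4))
```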
UNIT-V
INDEX NUMBERS
Introduction:
Index numbers are meant to study changes in the effects of factors which cannot be measured directly. According to Bowley, “Index numbers are used to measure the changes in some quantity which we cannot observe directly”. For example, changes in business activity in a country are not capable of direct measurement, but it is possible to study relative changes in business activity by studying the variations in the values of factors which affect business activity and which are capable of direct measurement.
Index numbers are a commonly used statistical device for measuring the combined fluctuations in a group of related variables. If we wish to compare the price level of consumer items today with that prevalent ten years ago, we are not interested in comparing the prices of only one item, but in comparing some sort of average price levels. We may wish to compare the present agricultural production or industrial production with that at the time of independence. Here again, we have to consider all items of production, and each item may have undergone a different fractional increase (or even a decrease). How do we obtain a composite measure? This composite measure is provided by index numbers, which may be defined as a device for combining the variations that have occurred in a group of related variables over a period of time, with a view to obtaining a figure that represents the ‘net’ result of the change in the constituent variables.
Index numbers may be classified in terms of the variables that they are intended to measure. In
business, different groups of variables in the measurement of which index number techniques are
commonly used are (i) price, (ii) quantity, (iii) value and (iv) business activity. Thus, we have
index of wholesale prices, index of consumer prices, index of industrial output, index of value of
exports and index of business activity, etc. Here we shall be mainly interested in index numbers
of prices showing changes with respect to time, although methods described can be applied to
other cases. In general, the present level of prices is compared with the level of prices in the past.
The present period is called the current period and some period in the past is called the base
period.
Index Numbers:
Index numbers are statistical measures designed to show changes in a variable or group of
related variables with respect to time, geographic location or other characteristics such as
income, profession, etc. A collection of index numbers for different years, locations, etc., is
sometimes called an index series.
A simple index number is a number that measures a relative change in a single variable with
respect to a base.
A composite index number is a number that measures the average relative change in a group of related variables with respect to a base.
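A simple index number and an index series can be illustrated with a short Python sketch. It assumes the usual convention that the base period is set to 100 and every other period is expressed as a percentage of the base value. The prices and the function names are hypothetical and serve only as an illustration.

```python
def simple_index(current_value, base_value):
    """Current-period value expressed as a percentage of the base value."""
    if base_value == 0:
        raise ValueError("base-period value must be non-zero")
    return current_value / base_value * 100


def index_series(values_by_period, base_period):
    """Index numbers for every period relative to the chosen base period."""
    base = values_by_period[base_period]
    return {p: simple_index(v, base) for p, v in values_by_period.items()}


# Hypothetical prices of a single commodity (illustrative figures only)
prices = {2020: 40.0, 2021: 44.0, 2022: 50.0}
series = index_series(prices, base_period=2020)
for year, index in series.items():
    print(year, round(index, 1))
```

An index above 100 indicates a rise relative to the base period, and an index below 100 indicates a fall.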
Price Index Numbers:
Price index numbers measure the relative changes in the prices of commodities between two periods. Prices can be either retail or wholesale.
Quantity Index Numbers:
Quantity index numbers measure changes in the physical quantity of goods produced, consumed or sold, for an item or a group of items.
Uses
This index number is a useful number that helps us quantify changes in our field. It is easier to
see one value than a thousand different values for each item in our field.
Take the stock market, for example. It is comprised of thousands of different public companies.
We could, of course, look at the stock value of each of these companies to see how the
companies are doing as a whole, or we can look at just one number, the stock index, to get a
general feel for how the companies are doing.
The same goes for the cost of goods. We could look at the cost of each item and compare it to its
cost from last year. But that would mean looking at the cost of millions of items. Or we could
look at the cost of goods index, just one number, to see whether prices have increased or
decreased over the past year.
We can say that the index number is one simple number that we can look at to give us a general
overview of what is happening in our field. Let's take a look at two real world index numbers.
Line of Best Fit:
A line of best fit is a straight line that is the best approximation of the given set of data. It is used to study the nature of the relation between two variables. (We are only considering the two-dimensional case here.)
A line of best fit can be roughly determined using an eyeball method by drawing a straight line on a scatter plot so that the number of points above the line and below the line is about equal (and the line passes through as many points as possible).
A more accurate way of finding the line of best fit is the least square method.
Use the following steps to find the equation of the line of best fit for a set of ordered pairs (x1, y1), (x2, y2), ..., (xn, yn).
Step 1: Calculate the mean of the x-values and the mean of the y-values.
$$ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} y_i}{n} $$
Step 2: The following formula gives the slope of the line of best fit:
$$ m = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{X})^2} $$
Step 3: Compute the y-intercept of the line by using the formula:
$$ b = \bar{Y} - m \bar{X} $$
Step 4: Use the slope m and the y-intercept b to form the equation of the line, y = m x + b.
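The steps above can be sketched in Python as follows. The intercept is computed with the standard least-squares relation b = Ȳ – m X̄, and the data are those of the worked example in this section. The function name line_of_best_fit is an illustrative assumption.

```python
def line_of_best_fit(xs, ys):
    """Slope m and intercept b of the least-squares line y = m x + b."""
    n = len(xs)
    x_bar = sum(xs) / n                      # Step 1: mean of the x-values
    y_bar = sum(ys) / n                      # Step 1: mean of the y-values
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    m = sxy / sxx                            # Step 2: slope
    b = y_bar - m * x_bar                    # Step 3: y-intercept
    return m, b                              # Step 4: the line is y = m x + b


xs = [8, 2, 11, 6, 5, 4, 12, 9, 6, 1]
ys = [3, 10, 3, 6, 8, 12, 1, 4, 9, 14]
m, b = line_of_best_fit(xs, ys)
print(round(m, 2), round(b, 2))
```

This mean-deviation form gives the same line as solving the normal equations; it only arranges the arithmetic differently.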
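Example:
Use the least square method to determine the equation of the line of best fit for the data. Then plot the line.
x    8    2     11    6    5    4     12    9    6    1
y    3    10    3     6    8    12    1     4    9    14
Solution:
Plot the points on a coordinate plane.
Calculate the means of the x-values and the y-values.
$$ \bar{X} = \frac{8 + 2 + 11 + 6 + 5 + 4 + 12 + 9 + 6 + 1}{10} = 6.4 $$
$$ \bar{Y} = \frac{3 + 10 + 3 + 6 + 8 + 12 + 1 + 4 + 9 + 14}{10} = 7 $$
Now calculate $x_i - \bar{X}$, $y_i - \bar{Y}$, $(x_i - \bar{X})(y_i - \bar{Y})$ and $(x_i - \bar{X})^2$ for each i.
i     xi    yi    xi – X̄    yi – Ȳ    (xi – X̄)(yi – Ȳ)    (xi – X̄)²
1     8     3     1.6       –4        –6.4                2.56
2     2     10    –4.4      3         –13.2               19.36
3     11    3     4.6       –4        –18.4               21.16
4     6     6     –0.4      –1        0.4                 0.16
5     5     8     –1.4      1         –1.4                1.96
6     4     12    –2.4      5         –12                 5.76
7     12    1     5.6       –6        –33.6               31.36
8     9     4     2.6       –3        –7.8                6.76
9     6     9     –0.4      2         –0.8                0.16
10    1     14    –5.4      7         –37.8               29.16
Total                                 –131                118.4
Calculate the slope.
$$ m = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{X})^2} = \frac{-131}{118.4} \approx -1.1 $$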