business-statistics-notes-pdf

The document provides comprehensive notes on Business Statistics, covering various modules such as data representation, probability, random variables, hypothesis testing, and correlation analysis. It defines statistics, explains its characteristics, and distinguishes between descriptive, inferential, and applied statistics. Additionally, it outlines the stages of statistical investigation and important terminologies related to the field.

Uploaded by

singhsumit2913
Copyright
© © All Rights Reserved

Business Statistics Notes pdf

MMS (University of Mumbai)


Downloaded by Sumit Singh ([email protected])

Business Statistics
Module-I
Data Representation-Central Tendency and dispersion-Kurtosis and skewness

Module-II
Probability-Axioms-Addition and Multiplication Rule-Types of Probability-Independence of Events-
Probability Tree-Bayes' Theorem

Module-III
Concept of Random Variable-Probability Distributions-Expected Value and Variance of Random
Variable-Conditional Expectations-Classical Newspaper Boy's Problem (EMV, EVPI)

Module-IV
Probability distributions-Binomial-Poisson-Normal

Module-V
Sampling distributions

Module-VI
Estimation-Point and Interval

Module-VII
Hypothesis Testing-t test, Chi-Square test, Z test

Module-VIII
ANOVA-One Way, Two Way

Module-IX
Correlation and Regression Analysis

Suggested Reading: As given in the syllabus

BUSINESS STATISTICS

Q1. What is Statistics?
The word "Statistics" is derived from the Latin word "Status", the Italian word "Statista" or the German
word "Statistik". Each of these words means a political state. Initially, statistics was used to collect
information about the people of the state, such as their income, health, illiteracy and wealth.

Nowadays, statistics has become an important subject with useful applications in various fields of
day-to-day life.

Statistics in Plural Sense:-


In the plural sense, Statistics refers to information in terms of numbers, or numerical data, such as Population
Statistics, Employment Statistics etc. However, not all numerical information is statistics.

Example: "Ram gets Rs. 100 per month as pocket allowance" is not statistics, because it is neither an aggregate
nor an average. By contrast, "the average pocket allowance of the students of Class X is Rs. 100 per month" and
"there are 80 students in Class XI and 8 students in Class XII" are statistics.

Data which are not Statistics

• A cow has 4 legs.

• Ram has 200 rupees in his pocket.

• A young lady was run over by a truck speeding at 100 km per hour.

Data which are Statistics

• The average height of men aged 26 and over in India is 6 feet, compared to 5 feet in Nepal.

• The birth rate in India is 18 per thousand, compared to 8 per thousand in the USA.

• Over the past 10 years, India has won 60 test matches in cricket and lost 50.

From the above information we can say that "All statistics are data, but not all data are statistics."

Q2. Define Statistics

According to Bowley - "Statistics are numerical statements of facts in any department of enquiry placed in
relation to each other."

According to Yule and Kendall - "By Statistics we mean quantitative data affected to a marked extent by
multiplicity of causes."

Q3. Explain Characteristics of Statistics


(1) Aggregate of Facts - A single number does not constitute statistics, because no conclusion can be drawn
from a single number. Conclusions can be drawn only from an aggregate of facts.
For example, the statement that there are 1,000 students in our college has no statistical significance on
its own. But the statement that there are 300 students in arts, 400 in commerce and 300 in science in our
college makes statistical sense, because these data convey statistical information. Similarly, if it is
stated that the population of India is 130 crore or that the value of total exports from India is
Rs. 11,66,439 crore, then these aggregates of facts are termed statistics.

(2) Numerically Expressed - Statistics are expressed in terms of numbers. Qualitative descriptions like small or
big, rich or poor etc. are not statistics. For instance, the statement that Irfan Pathan is tall and Sachin is
short has no statistical sense. However, if it is stated that the height of Irfan Pathan is 6 ft 2 in
and the height of Sachin is 5 ft 4 in, then these numerical facts are called statistics.

(3) Affected by Multiplicity of Causes - Statistics are not affected by any single factor; they are affected by
many factors. For instance, a 30% rise in prices may be due to several causes such as reduction in
supply, increase in demand, shortage of power, rise in wages, rise in taxes, etc.

(4) Reasonable Accuracy - A reasonable degree of accuracy must be kept in view while collecting statistical
data. This accuracy depends on the purpose of the investigation, its nature, its size and the available resources.

(5) Pre-determined Purpose - Statistics are collected with some pre-determined objective. Information
collected without any definite purpose is merely a set of numerical values, not statistics. If data pertaining
to the farmers of a village are collected, there must be some pre-determined objective: whether the
statistics are collected to learn about their economic position, the distribution of land among
them, or their total population. All these objectives must be pre-determined.

(6) Collected in a Systematic Manner - Statistics should be collected in a systematic manner. Before
collecting the data, a plan must be prepared. No conclusion can be drawn from data collected in a
haphazard manner. For instance, data on the marks secured by the students of a college, without
any reference to the class, subject, examination or maximum marks, will lead to no conclusion.

Statistics in Singular Sense


In the singular sense, statistics means the science of statistics, or statistical methods. It refers to the
techniques or methods relating to the collection, classification, presentation, analysis and interpretation
of quantitative data.
Definition - Statistics may be defined as the collection, presentation, analysis and interpretation of numerical
data. (CROXTON AND COWDEN)

Statistics is the science which deals with the collection, classification and tabulation of numerical facts as a
basis for the explanation, description and comparison of phenomena. (LOVITT)

Q4. Give the meaning of the following terms:


A. Descriptive Statistics
B. Inferential Statistics
C. Applied Statistics
D. Terminologies (Population, Sample, Variable, Attribute)
E. Frequency Distribution

CLASSIFICATION OF STATISTICS - Statistics can be divided into 3 parts: DESCRIPTIVE STATISTICS -
INFERENTIAL STATISTICS - APPLIED STATISTICS

1. Descriptive Statistics: Descriptive statistics deals with numerical data or facts. Such data are
collected either by counting or by some other process of measurement. It also covers those
methods, including editing of data, classification, tabulation, diagrammatic or graphical presentation,
measures of central tendency, measures of dispersion, correlation etc., that help to make the description
of numerical facts simple, systematic, synoptic, understandable and meaningful.

2. Inferential Statistics: Inferential statistics helps in making generalizations about the population or
universe on the basis of the study of samples. It includes the process of drawing proper and rational
conclusions about the universe. Among these methods, probability theory and the various
sampling tests are important.

3. Applied Statistics: It involves the application of statistical methods and techniques to actual problems
and facts. For example, statistics related to national income, industrial and agricultural production,
population, prices etc. are called applied statistics. It can be divided into 2 parts: (1) Descriptive
Applied Statistics - it deals with the study of data which are known and naturally related. Its
main object is to provide descriptive information about either the past or the present for any area. For
example, price index numbers and vital statistics come under the category of descriptive applied
statistics. (2) Scientific Applied Statistics - under this branch of statistical science, statistical methods
are used to formulate and verify scientific laws. For example, an economist attempts to
establish the law of demand, the quantity theory of money, the trade cycle etc. These are established and
verified with the help of scientific applied statistics.

Important terminology in statistics


1. Population: By population we mean a well-defined set or group of all the objects in a particular study.
The objects may be persons, plants, books, fishes in ponds, shops etc. The population consists of
certain elements, such as the plants of a certain kind in a specified field, the fishes in a pond, the
unemployed persons in India, the books in a library and so on. For instance, if we want to study the
properties of students in a school, then the population consists of all the students of the school. Similarly,
if we want to study the books in a library, then the population includes all the books of the library. If the
number of elements is limited, the population is finite. If the number of elements is not
limited, the population is infinite. Mostly we deal with finite populations.

2. Sample: It is a part of the population selected by some sampling procedure. The process of selecting a
sample is known as sampling. The number of objects in the sample is called the size of the sample. A
sample is expected to be representative of the population.
For instance, suppose a research worker is required to study the weight of fishes in a pond after a
particular period of growth, and suppose that there are 3,000 fishes in the pond. He may
either measure the weight of all the fishes in the pond or select a small group of fishes
and measure their weights. The first approach, measuring the weight of all the fishes, is called complete
enumeration or census. The second approach, in which only a small group of fishes is considered, is called a
sample survey. In brief, in complete enumeration information is collected on all the units
of the universe, and in a sample survey only a part of the universe is considered.

3. Variable: A property of objects that differs from object to object and is expressible
numerically is known as a variable.
For instance, the performance of the students of a class in Mathematics can be expressed in terms of the marks
they obtain. So it is a variable property which is expressible quantitatively.

4. Attribute: A property or characteristic of objects that is not expressible
quantitatively in numbers is known as an attribute. Such data can only be expressed qualitatively. For
example, smoking, color, honesty etc.
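
The fish-pond example under "Sample" can be sketched in Python as below. This is only an illustration: the notes give no actual weights, so the weights are simulated, and the 500 g mean, the 50 g spread, the sample size of 100 and the name `pond` are all assumptions made for the sketch.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: simulated weights (grams) of the 3,000 fishes in the pond.
pond = [random.gauss(500, 50) for _ in range(3000)]

# Complete enumeration (census): every unit of the population is measured.
census_mean = mean(pond)

# Sample survey: only a part of the population is measured.
sample = random.sample(pond, 100)  # simple random sampling without replacement
sample_mean = mean(sample)

print(f"Population size: {len(pond)}, sample size: {len(sample)}")
print(f"Census mean weight: {census_mean:.1f} g")
print(f"Sample mean weight: {sample_mean:.1f} g")
```

The two means will usually be close, which is why a sample survey can stand in for a census when measuring every unit is too costly.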

FREQUENCY DISTRIBUTION
The tabular arrangement of data showing the frequency of each item is called a frequency distribution.
According to Croxton and Cowden, "Frequency distribution is a statistical table in which different values of
variable are shown in the sequence of magnitude along with corresponding frequencies."
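
As a minimal sketch of this definition, the Python snippet below builds a frequency distribution from a small list of marks. The marks are hypothetical values invented for the illustration, and `frequency_distribution` is a helper name chosen here, not something from the notes.

```python
from collections import Counter

# Hypothetical marks (out of 10) of 12 students; illustrative data only.
marks = [7, 5, 8, 7, 6, 5, 7, 9, 6, 7, 8, 5]

def frequency_distribution(values):
    """Return (value, frequency) pairs in increasing order of value."""
    counts = Counter(values)       # count how often each value occurs
    return sorted(counts.items())  # arrange values in sequence of magnitude

print(f"{'Marks':>5} {'Frequency':>9}")
for value, freq in frequency_distribution(marks):
    print(f"{value:>5} {freq:>9}")
```

Each row pairs one value of the variable with the number of times it occurs, and the frequencies add up to the number of observations.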

TYPES OF FREQUENCY DISTRIBUTION


(1) Discrete frequency distribution: It is a discontinuous frequency distribution, in which observations are
independent of each other. Each observation is distinct and separate from the others.

(2) Continuous frequency distribution: A continuous frequency distribution is one in which
data that are not exactly measurable are arranged in classes or groups. The groups or class-intervals
run continuously from the beginning of the frequency distribution to the end, within the given
range of the data. There are two types of series according to class interval:

(1) Inclusive form: A frequency distribution in which the upper limit of each class is included in that class.
Such as: 0-9, 10-19, 20-29, ...

(2) Exclusive form: A frequency distribution in which the upper limit of each class is excluded from that class
and is the lower limit of the next class-interval. Such as: 0-10, 10-20, 20-30, ...
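
The difference between the two forms can be sketched in Python as follows. The function names and the class width of 10 are assumptions taken from the example intervals above, and the sketch assumes non-negative whole-number observations.

```python
def inclusive_class(value, width=10):
    """Inclusive form (0-9, 10-19, ...): both class limits belong to the class."""
    lower = (value // width) * width
    return (lower, lower + width - 1)

def exclusive_class(value, width=10):
    """Exclusive form (0-10, 10-20, ...): the upper limit belongs to the next class."""
    lower = (value // width) * width
    return (lower, lower + width)

for observation in (9, 10, 19, 20):
    print(observation, inclusive_class(observation), exclusive_class(observation))
```

In the exclusive form an observation of exactly 10 is counted in the class 10-20, not in 0-10, so no observation is counted twice.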

CHARACTERISTICS OF STATISTICS

1. Statistics are aggregate of facts.

2. Statistics are numerically expressed.

3. Statistics are affected to a marked extent by multiplicity of causes.

4. Statistics are either enumerated or estimated with a reasonable standard of accuracy.

5. Statistics are collected in a systematic manner.

6. Statistics are collected for a pre-determined purpose.

7. Statistics should be placed in relation to each other.

In the absence of the above characteristics, numerical data cannot be called statistics; hence "all
statistics are numerical statements of facts, but not all numerical statements of facts are statistics."

According to the above definitions, statistics is both a science and an art. It is concerned with the study and
application of the principles and methods used in the collection, presentation, analysis, interpretation
and forecasting of data, or of statistical facts influenced by several factors and related to any area of
knowledge or research, so that concrete and intelligent decisions may be taken in the face of uncertainty.

Q5. "Statistics is both Science and Art." Explain.

NATURE OF STATISTICS
Statistics as a science: Science refers to a systematized body of knowledge. It studies cause and effect
relationships and attempts to make generalizations in the form of scientific principles or laws. "Science, in
short, is like a light house that gives light to the ships to find out their own way but doesn't indicate the
direction in which they should go." Like other sciences, statistical methods are used to answer
questions such as how an investigation should be conducted and in what way valid and reliable conclusions can
be drawn. Statistics is called the science of scientific methods.

In the words of Croxton and Cowden, "Statistics is not a science, it is scientific methods." According to Tippett,
"as science, the statistical method is a part of the general scientific method and is based on the same
fundamental ideas and processes."

Statistics as an art: We know that science is a body of systematized knowledge. How this knowledge is to be
used for solving a problem is the work of an art. An art is applied knowledge. It refers to the skill of handling
facts so as to achieve a given objective. It is concerned with ways and means of presenting and handling
data, making inferences logically and drawing relevant conclusions. The art aspect of statistics tells us how
to use statistical rules and principles to study problems and find their solutions. The collection of statistics
(data), its use and its utility are themselves an art.

Statistics is both science and art: Having considered the science and art aspects of statistics, we see that it
is used not only to gain knowledge but also to understand facts and draw important conclusions from them. If
science is knowledge, then art is action. Seen from this angle, statistics may also be regarded as an art. It
involves applying given methods to obtain facts, derive results and finally use them to devise action.

Q6. Explain STAGES IN A STATISTICAL INVESTIGATION

5 stages -
Collection - Organization - Presentation - Analysis - Interpretation

1. Collection: This is the primary step in a statistical study, and data should be collected with care by the
investigator. If data are faulty, the conclusions drawn can never be reliable. The data may be available
from existing published or unpublished sources or else may be collected by the investigator himself. The
first-hand collection of data is one of the most difficult and important tasks faced by a statistician.

2. Organization: Data collected from published sources are generally in organized form.

However, a large mass of figures collected from a survey frequently needs organization. Organizing
involves 3 steps:

(A) Editing (B) Classification (C) Tabulation.

(A) Editing: The collected data must be edited very carefully so that omissions, inconsistencies,
irrelevant answers and wrong computations in the returns from a survey may be corrected or adjusted.
(B) Classification: Classification is the process of arranging the data according to some common
characteristics possessed by the items constituting the data.
(C) Tabulation: Tabulation is the arrangement of the data in columns and rows.

Hence collected data are organized properly so that the desired information is highlighted and undesirable
information avoided.

3. Presentation: Organized data alone do not readily communicate to a layman. Thus, it is necessary that data
be presented with the help of tables, diagrams and graphs. By these devices facts can be understood easily.

4. Analysis: A major part of statistics is devoted to the methods used in analyzing the presented data, mostly
in tabular form. For this analysis, a number of statistical tools are available, such as averages, correlation,
regression etc.

5. Interpretation: The interpretation of data is a difficult task and requires a high degree of skill and
experience in statistical investigation, because decisions are made on the basis of the conclusions drawn.

Q7. Explain the SCOPE OF STATISTICS


In its early stages, the scope of statistics was very limited. It was confined mainly to the administration of
government and was, therefore, called the 'Science of Kings'. But in modern times the scope of statistics has
widened considerably. Virtually all facts that are expressed in quantitative terms, directly or indirectly,
come within the purview of statistics. That is why Croxton & Cowden observed, "Today there is hardly a phase
of endeavor which does not find statistical devices at least occasionally useful." It is not unfair to say,
"Science without statistics bears no fruit and statistics without science has no root." The applications of
statistics are so numerous that it is often remarked, "Statistics is what statisticians do." Now let us examine
a few fields or areas in which statistics is applied.

1. Statistics and the State: In recent years the functions of the state have increased tremendously. The
concept of the state has changed from that of simply maintaining law and order to that of a welfare
state. Statistical data and statistical methods are of great help in promoting human welfare. The
government in most countries is the biggest collector and user of statistical data. These statistics help
in framing suitable policies.

2. Statistics in Business and Management: With growing size and increasing competition, the problems
of business enterprises have become complex. Statistics is now considered an indispensable tool
in the analysis of activities in the fields of business, commerce and industry. This objective can be
achieved through properly conducted market surveys and research, which depend greatly on statistical
methods. Trends in sales and production can be determined by statistical methods like time-series
analysis, which is essential for future planning. Statistical concepts and methods
are also used in controlling the quality of products to the satisfaction of both consumer and producer.
Bankers use the objective analysis furnished by statistics and then temper their decisions on the basis
of qualitative information.

3. Statistics and Economics: R.A. Fisher complained of "the painful misapprehension that statistics is a
branch of economics." Statistical data and methods are of immense help in the proper understanding
of economic problems and in the formulation of economic policies. In the field of exchange, we
study markets, the law of prices based on supply and demand, cost of production, banking and credit
instruments etc. The development of various economic theories owes greatly to statistical methods,
e.g., 'Engel's law of family expenditure' and the 'Malthusian theory of population'. The impact of
mathematics and statistics has led to the development of new disciplines like 'Econometrics' and
'Economic Statistics'. In fact, the concept of planning, so vital for the growth of nations, would not have
been possible in the absence of data and proper statistical analysis.

4. Statistics and Psychology and Education: Statistics has found wide application in psychology and
education. Statistical methods are used to measure human abilities such as intelligence, aptitude,
personality, interest etc. by means of tests. The theory of learning is also based on statistical principles.
Applications of statistics in psychology and education have led to the development of a new discipline
called 'Psychometrics'.

5. Statistics and Natural Science: Statistical techniques have proved to be extremely useful in the study
of all natural sciences like biology, medicine, meteorology, botany etc. For example, in diagnosing
the correct disease the doctor has to rely heavily on factual data like body temperature, pulse
rate, blood pressure etc. In botany, the study of plant life, one has to rely heavily on statistics in
conducting experiments on plants, the effect of temperature, the type of soil etc. In agriculture, statistical
techniques like 'analysis of variance' and 'design of experiments' are useful for isolating the role of
manure, rainfall, watering process, seed quality etc. In fact it is difficult to find any scientific activity
where statistical data and methods are not used.

6. Statistics and Physical Science: The physical sciences are the fields in which statistical methods were first
developed and applied. They are making increasing use of statistics, especially in astronomy,
chemistry, engineering, geology, meteorology and certain branches of physics.

7. Statistics and Research: Statistics is indispensable in research work. Most of the advancement in
knowledge has taken place because of experiments conducted with the help of statistical methods.
Statistical methods also affect research in medicine and public health. In fact, there is hardly any
research work today that one can find complete without statistical methods.

8. Statistics and Computers: The development of statistics has been closely related to the evolution of
electronic computing machinery. Statistics is a form of data processing, a way of converting data into
information useful for decision-making. Computers can process large amounts of data quickly
and accurately. This is a great benefit to businesses and other organizations that must maintain records
of their operations. Processing of raw data is extensively required in the application of many
statistical techniques.

Q8. Briefly explain the IMPORTANCE OF STATISTICS

These days we hear everyone, from the common person to the highly qualified, talking about statistics. This
shows how intimately statistics is connected with a wide range of activities in daily life. People realize
that work in their fields requires some understanding of statistics. This indicates the importance of statistics.
A.L. Bowley says, "Knowledge of statistics is like knowledge of foreign language or of algebra. It may prove
of use at any time under any circumstances".

1. Importance to the State or Government: In the modern era, the role of the state has increased, and
governments around the world take care of the welfare of their people. These governments therefore require
much more information in the form of numerical figures. Statistics are extensively used as a basis for
government plans and policies. For example, five-year plans are framed using reliable statistical data on
different segments of life.

2. Importance in Human Behavior: Statistical methods such as averages and correlation are closely related to
human activities and behavior. For example, when a layman wishes to purchase some article, he first
enquires about its price at different shops in the market. In other words, he collects data on the price of a
particular article and aims to get an idea of the average price and the range within which the price
varies. Thus, it can be concluded that statistics plays an important role in every aspect of human activity
and behavior.

3. Importance in Economics: Statistics is gaining ever-increasing importance in the field of economics.
That is why Tugwell said, "The science of economics is becoming statistical in its method." Statistics and
economics are so closely interrelated that new disciplines like econometrics and economic statistics
have developed. The inductive method of generalization used in economics is also based on statistical
principles. Statistics is used in different segments of economics:

(A) Consumption - From the statistics of consumption we can find the way in which people in different groups
spend their income. The law of demand and the elasticity of demand in the field of consumption are based on
inductive or inferential statistics.

(B) Production - With the statistics of production, supply is adjusted according to demand. We can find out
the capital invested in different productive units and their output. Decisions about what to produce, how much
to produce and when to produce are based on facts analyzed statistically.

(C) Distribution - Statistics plays a vital role in the field of distribution. We calculate the national income
of a country by statistical methods and compare it with that of other countries. At every step we require the
help of figures; without them it is difficult to proceed and draw inferences.

4. Importance in Planning: Statistics plays a vital role in the proper utilization of natural and human
resources. Planning is indispensable for achieving a faster rate of growth through the best use of a nation's
resources. It is sometimes said that "Planning without statistics is a ship without rudder and compass." For
example, in India a number of organizations like the National Sample Survey Organisation (N.S.S.O.) and the
Central Statistical Organisation (C.S.O.) have been established to provide all types of information.

5. Importance in Business: The use of statistical methods in the solution of business problems is of fairly
recent origin. No business, whether large or small, public or private, can prosper without the help of
statistics. Statistics provides a businessman with the techniques necessary for formulating policies and
planning his business. Such as:
(A) Marketing - In the field of marketing, it is necessary first to find out what can be sold and then to evolve
a suitable strategy so that goods reach the ultimate consumer. A skillful analysis of data on population,
purchasing power, habits of people, competition, transportation cost etc. should precede any attempt to
establish a new market.
(B) Quality Control - To earn a better price in a competitive market, it is necessary to watch the quality of
the product. Statistical techniques can be used to control the quality of the product manufactured by a
firm, for example by means of control charts.
(C) Banking and Insurance Companies - Banks use statistical techniques to decide the
average amount of cash needed each day to meet the requirements of day-to-day transactions. Various
policies on investment and the sanction of loans are also based on the analysis provided by statistics.
(D) Accounts Writing and Auditing - Every business firm keeps accounts of its revenue and expenditure.
Statistical methods are also employed in accounting. In particular, the auditing function makes frequent
use of statistical sampling and estimation procedures, and the cost accountant uses regression analysis.
(E) Research and Development - Many business organizations have their own research and development
departments, which are responsible for the collection of such data. These departments also prepare charts,
graphs and other statistical analyses for the purpose.

Q9. Describe the FUNCTIONS OF STATISTICS


Statistics performs the functions of making the numerical aspects of facts simple, precise, comparable and
reliable. In fact, the various functions performed by statistics are the basis of its utility. R.W. Burgess says,
"The fundamental gospel of statistics is to push back the domain of ignorance, prejudice, rule of thumb,
arbitrary and premature decisions, tradition & dogmatism and to increase the domain in which decisions are
made. Principles are formulated on the basis of analyzed quantitative facts."

1. Numerical and definite expression of facts: The first function of statistics is the collection and
presentation of facts in numerical form. Numerical presentation helps in gaining a
better understanding of the nature of a problem. One of the most important functions of statistics is
to present general statements in a precise and definite form. Statements and facts conveyed in exact
quantitative terms are always more convincing than vague utterances.

2. Simplifies the data (condensation): Not only does statistics present facts in a definite form, but it also
helps in condensing a mass of data into a few significant figures. According to A.E. Waugh, "the
purpose of a statistical method is to simplify great bodies of numerical data." In fact, the human mind
cannot follow huge, complex and scattered numerical facts. So these facts are made simple and
precise with the help of various statistical methods like averages, dispersion, graphic or
diagrammatic presentation, classification, tabulation etc., so that the common man can also understand
them easily.

3. Comparison of facts: Baddington states, ÒThe essence of the statistics is not only counting but also
comparison.Ó The function of comparison does help in showing the relative importance of data. For
example- the pass % of examination result of a college may be appreciated better when it is
compared with the result of other college or the results of previous years of the same college.

4. Establishment of relationship b/w two or more phenomena; to investigate the relationship b/w two or
more facts is the main function of statistics. For example-demand and supply of a certain
commodity, prices and wages, temperature and germination time of seeds are interrelated.

5. Enlarges individual experiences: In word of Bowley, Òthe proper function of statistics indeed is to
enlarge individual experience.Ó Statistics is like a master key that is used to solve problems of
mankind in every Þeld. It would not be exaggeration to say that many Þelds of knowledge would
have remained closed to the mankind forever but for the efÞcient and useful techniques and
methodology of the science of statistics.

6. Helps in the formulation of policies: statistics helps in formulating policies in different fields,
especially in economic, social and political fields. Government policies like industrial policy,
export-import policy, taxation policy and monetary policy are determined on the basis of statistical
data and their movements. Plan targets are also fixed with the help of data.

7. Helps in forecasting: statistical methods provide helpful means of estimating from the available facts and
forecasting for the future. Here Bowley's statement is relevant: "a statistical estimate may be good or
bad, accurate or the reverse; but in almost all cases it is likely to be more accurate than a casual
observer's impression."

8. Testing of hypothesis: statistical methods are also employed to test hypotheses in theory and to
discover newer theories. For example, the statement that the average height of students of a college is 66
inches is a hypothesis. Here the students of the college constitute the population. It is possible to test the
validity of this statement by the use of statistical techniques.

Q10. Enumerate the LIMITATIONS OF STATISTICS


Newsholme states, "Statistics must be regarded as an instrument of research of great value but having several
limitations which are not possible to overcome and as such they need our careful attention."

1. Statistics does not study qualitative facts: Statistics means an aggregate of numerical facts. It means that
in statistics only those phenomena are studied which can be expressed in numerical terms directly or
indirectly. Facts may be expressed (1) directly in numerical terms, like the age, weight and income of an individual; (2) not
directly but indirectly, like the intelligence and achievements of students; or (3) neither directly
nor indirectly, like morality, affection etc. Facts of the third type do not come under the scope of statistics.

2. Statistics doesn't study individuals: According to W.I. King, "Statistics from their very nature of
subject cannot and will never be able to take into account individual cases. When these are
important, other means must be used for their study." Statistical studies are done to compare the general
behavior of a group at different points of time, or the behavior of different groups at a particular
point of time.

3. Statistical results are true only on the average: Statistical laws are not completely true and
accurate like the laws of physics. For example, the law of gravitation is perfectly true and
universal, but statistical conclusions are not perfectly true. Suppose the average age of a person in
India is 62 years. It does not mean that every person will attain this age. On the basis of statistical
methods we can speak only in terms of probability and not certainty.

4. Statistics lacks complete accuracy: According to Conner, "Statistical data must always be treated
as approximations or estimates and not as precise measurements." Statistical results, whether based on
sample or census data, are bound to be true only approximately. For example, according to
the population census of 2001, the country's population is 1,02,70,15,247, but the real population may be
more or less than this by a hundred, two hundred and so on.

5. Statistics is liable to be misused: Statistics deals with figures, which can be easily manipulated or
distorted. In the hands of inexpert and unskilled persons it is very likely to be misused.
In other words, the data should be handled by experts and used only by technically
sound persons.

6. Statistics is only one of the methods of studying a phenomenon: According to Croxton & Cowden,
"It must not be assumed that the statistical method is the only method to be used in research; neither
should this method be considered the best attack for every problem." The conclusions arrived at with the
help of statistics must be supplemented with other evidence.

7. Statistical results may be misleading: Without proper context, statistical results may lead to doubtful
conclusions. For example, on the basis of an increasing number of prisoners in a prison, it may be
concluded that crime is increasing. But it is possible that the number of prisoners is increasing because of
harsh behavior by the police administration, while crime is actually decreasing.

Therefore, it is worth mentioning that every science is based on certain assumptions and limitations. This does
not reduce the importance of the subject but lays emphasis on the fact that precautions should be taken while
dealing with statistical analysis and interpretation.

Q11. Explain TYPES OF DATA


Data are the foundation stones and basic raw material in relation to any statistical investigation that can be
counted, classified, measured or quantified.


ON THE BASIS OF CHARACTERISTICS OF FACTS

Data may be divided into two types;

1. Quantitative Data or Numerical Data: These types of data can be measured directly, such as age, income,
production, marks etc. Such facts are called variables, and variables may be discrete or continuous.

Discrete variable - A variable whose values are individually distinct and discontinuous. There is a
definite difference between two consecutive values. According to Boddington, "Discrete variable is one where the
variables (individual values) differ from each other by definite amounts." For example: number of students
of a class, number of children in a family, number of cattle etc. It takes integral values such as 0, 1, 2, 3,
4 ... etc.
Continuous variable - A continuous variable is one which assumes all values within an interval. That is, no
definite breaks are visible in this type of series. For example: age, weight, height etc.

Question: State which of the following represent discrete data and which represent continuous data.

I. No. of accidents on each day in a month

II. Lengths of 1,000 bolts produced in a factory

III. Speed of an automobile in kilometers per hour

IV. No. of books on a library shelf


2. Qualitative Data or Categorical Data: These include data relating to facts which can't be
measured directly but are counted or categorized on the basis of attributes. Characteristics such as literate, illiterate,
unemployed, honest etc. are called attributes. For example, a population can be classified on the basis
of sex into males and females, or males may be classified on the basis of marital status, i.e. married or
unmarried. Qualitative data may further be classified into two categories.

ON THE BASIS OF VARIABLES

On the basis of variables, also data may be of two types;


• (1) Univariate Data: When the frequencies are determined on the basis of one variable. For
example: number of workers on the basis of wages, number of persons on the basis of age etc.

• (2) Bivariate Data: When the data are edited or presented on the basis of two variables
simultaneously. For this a two-way frequency table is constructed, in which one variable is placed horizontally
and the second one vertically. For example: presenting the number of students in one table on the
basis of marks obtained in two subjects, or tabulating the number of persons in one table on the basis of two
variables, i.e. height and weight.
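
A two-way frequency table of this kind can be sketched in Python as below. This is only an illustrative sketch: the function name `two_way_table` and the height and weight classes are hypothetical and are not taken from these notes.

```python
from collections import Counter


def two_way_table(pairs):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # table[i][j] holds the frequency of the pair (rows[i], cols[j])
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Hypothetical (height class, weight class) observations for six persons
observations = [
    ("150-160 cm", "50-60 kg"),
    ("150-160 cm", "50-60 kg"),
    ("150-160 cm", "60-70 kg"),
    ("160-170 cm", "50-60 kg"),
    ("160-170 cm", "60-70 kg"),
    ("160-170 cm", "60-70 kg"),
]

rows, cols, table = two_way_table(observations)
print("columns:", cols)
for label, row in zip(rows, table):
    print(label, row)
```

Here heights run vertically (one row per class) and weights horizontally (one column per class), matching the layout described above.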

ON THE BASIS OF ARRANGEMENTS

Data may be categorized into two types;

• (1) Raw Data: When the data is not yet arranged and analyzed. It is called 'Raw' because it is unprocessed
by statistical methods.

• (2) Arranged Data: When the data is processed and is arranged, summarized, classified and tabulated
in a proper way.

Terms like 'Data Point' and 'Data Set' are also used in order to distinguish between the numbers relating to
individual or single facts and the aggregate of facts. For example, the data of production of sugar for ten
years will be termed a 'Data Set' and the figure for production of one year will be a 'Data Point'.

Q12. Difference between CLASSIFICATION & TABULATION

CLASSIFICATION
After collection and editing of data, the first step towards further processing is classification.
Classification is a process in which the collected data are arranged in separate classes, groups or subgroups
according to their characteristics. According to Secrist, "Classification is the process of arranging data into
sequences and groups according to their common characteristics or separating them into different classes."

It follows that classification means the arrangement and systematization of data into different classes, and
these classes are determined on the basis of the nature, objectives and scope of the enquiry.

OBJECTIVES OF CLASSIFICATION
Classification is a method or technique for extracting the essential information supplied by the raw data.

(1) To condense the data: the main objective of classification is to condense and simplify the statistical
material, so that it may be easily understood.

(2) To bring out points of similarity and dissimilarity in the data: classification brings out clearly the points of
similarity and dissimilarity of statistical facts, because data of similar characteristics are placed in one class,
e.g. males and females, literates and illiterates, married and unmarried etc.

(3) To make facts comparable: by arranging the data according to the points of similarity and dissimilarity,
it helps in comparison.

(4) To bring out relationships: classification helps in finding cause and effect relationships in the data. For
example, based on the literacy and criminal tendency of a group of people, it can be established whether literacy
has any impact on criminal tendency or not.


(5) To prepare ground for tabulation: tabulation is the basis of statistical analysis and classification is the
basis for tabulation.

It follows that classification occupies an important place in the process of statistical investigation.

The fact is that the process of tabulation, presentation and analysis can't even be started without
classification.

CHARACTERISTICS OR RULES FOR A GOOD CLASSIFICATION


• (1) Unambiguity: the various classes should be so defined that there is no room for doubt and
confusion. For example, a population is classified as literates or illiterates.

• (2) Exhaustive and mutually exclusive: the classification should be exhaustive (covering all items), and
no item should find a place in more than one class. For example, suppose students of a college are
classified into three groups: urban, rural and hostellers. This classification is not mutually exclusive
because among hostellers some may be urban and some others rural.

• (3) Stability: the classification of data into various classes must be stable over the period
of investigation.

• (4) Suitability: the classification should conform to the objectives of the enquiry. For example, to study
the relationship between sex and university education, there is no need to classify on the basis
of age and religion.

• (5) Flexibility: a good classification should be flexible so that adjustments may easily be made
in classes according to changed situations. An ideal classification is one that can adjust itself to these
changes and yet retain its stability.

METHODS OF CLASSIFICATION
There are 4 methods of classification:

• Geographical Classification
• Chronological Classification
• Qualitative Classification
• Quantitative Classification

TABULATION

Tabulation is the next step after classification of the data and is designed to summarise a lot of information in a
simple manner. In common language, tabulation is the process of arranging data in a systematic manner in the
form of rows and columns. According to Blair, "Tabulation in its broadest sense is any orderly arrangement
of data in columns and rows."

OBJECTIVES OF TABULATION
1. To simplify complex data

2. To facilitate comparison

3. To economise space

4. To facilitate presentation

5. To help in analysis of data


6. To help in reference

DIFFERENCE BETWEEN CLASSIFICATION & TABULATION

Basis of Difference    Classification                       Tabulation

Presentation           Classifies into different classes    Classifies into rows and columns

Sequence               First step                           Second step

Methods                Method of statistical analysis       Method of data presentation

Use of data            Original data are used               Derivatives like percentages, coefficients,
                                                            proportions, etc. may also be used


Module-I: Data Representation-Central Tendency and Dispersion-Kurtosis and Skewness

Q1. What do you mean by Graphical Representation of data? Explain different ways of
representing the data graphically.

Graphical Representation is a way of analysing numerical data. It exhibits the relation between data,
ideas, information and concepts in a diagram. It is easy to understand and it is one of the most
important learning strategies. It always depends on the type of information in a particular domain.
There are different types of graphical representation. Some of them are as follows:

• Line Graphs - A line graph or linear graph is used to display continuous data and is
useful for predicting future events over time.
• Bar Graphs - A bar graph is used to display categories of data and compares the data
using solid bars to represent the quantities.
• Histograms - A graph that uses bars to represent the frequency of numerical data that are
organised into intervals. Since all the intervals are equal and continuous, all the bars have
the same width.
• Line Plot - It shows the frequency of data on a given number line. An 'x' is placed above the
number line each time that data value occurs.
• Frequency Table - The table shows the number of pieces of data that fall within each given
interval.
• Circle Graph - Also known as the pie chart, it shows the relationship of the parts to the
whole. The full circle represents 100% and each category occupies a sector
corresponding to its specific percentage, like 15%, 56%, etc.
• Stem and Leaf Plot - In the stem and leaf plot, the data are organised from the least value to the
greatest value. The digits of the least place value form the leaves and the next place value
digits form the stems.
• Box and Whisker Plot - The plot summarises the data by dividing it into four parts.
The box and whiskers show the range (spread) and the middle (median) of the data.
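
As a small illustration of the stem and leaf plot described in the list above, the following Python sketch splits each value into a stem (all digits except the last) and a leaf (the last digit). It assumes non-negative whole numbers; the function name `stem_and_leaf` and the sample marks are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (value // 10) and leaf (value % 10)."""
    plot = defaultdict(list)
    for v in sorted(values):          # least value to greatest value
        plot[v // 10].append(v % 10)
    return dict(plot)


marks = [46, 25, 35, 30, 40, 50, 65, 67, 68]  # hypothetical sample
plot = stem_and_leaf(marks)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```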

(Note: The collected raw data can be placed in any one of the following ways:

1. Serial order or alphabetical order
2. Ascending order
3. Descending order

Raw data placed in ascending or descending order of magnitude is known as an
array or arrayed data.)


Q2. Enumerate the General Rules for Graphical Representation of Data

There are certain rules to effectively present the information in the graphical representation. They
are:

• Suitable Title: Make sure that an appropriate title is given to the graph which indicates the
subject of the presentation.
• Measurement Unit: Mention the measurement unit in the graph.
• Proper Scale: To represent the data in an accurate manner, choose a proper scale.
• Index: Index the appropriate colours, shades, lines and designs in the graph for better
understanding.
• Data Sources: Include the source of information wherever it is necessary at the bottom of the
graph.
• Keep it Simple: Construct the graph in a simple way that everyone can understand.
• Neat: Choose the correct size, fonts, colours etc. in such a way that the graph is a
visual aid for the presentation of information.

Graphical Representation in Maths

In Mathematics, a graph is defined as a chart with statistical data, which are represented in the form
of curves or lines drawn across the coordinate points plotted on its surface. It helps to study the
relationship between two variables by measuring the change in one variable
with respect to another variable within a given interval of time. It also helps to study the series
distribution and frequency distribution for a given problem. There are two types of graphs to
visually depict the information. They are:

• Time Series Graphs - Example: Line Graph
• Frequency Distribution Graphs - Example: Frequency Polygon Graph

Q3. Explain Principles of Graphical Representation with diagram.

Algebraic principles are applied to all types of graphical representation of data. In graphs, data are
represented using two lines called coordinate axes. The horizontal axis is denoted as the x-axis and
the vertical axis is denoted as the y-axis. The point at which the two lines intersect is called the origin
'O'. On the x-axis, the distance from the origin to the right side takes a positive value and the
distance from the origin to the left side takes a negative value. Similarly, for the y-axis, the
points above the origin take a positive value, and the points below the origin take a negative
value.


Generally, the frequency distribution is represented by the following methods:

• Histogram
• Smoothed frequency graph
• Pie diagram
• Cumulative frequency graph or ogive
• Frequency polygon

Q4. Explain Merits of Using Graphs

Some of the merits of using graphs are as follows:

• The graph is easily understood by everyone without any prior knowledge.
• It saves time.
• It allows us to relate and compare the data for different time periods.
• It is used in statistics to determine the mean, median and mode for different data, as well as
in the interpolation and the extrapolation of data.


Example for Frequency polygon Graph

Here are the steps to follow to draw a frequency polygon from a frequency distribution and so
represent it in a graphical way.

• Obtain the frequency distribution and find the midpoint of each class interval.
• Represent the midpoints along the x-axis and the frequencies along the y-axis.
• Plot the points corresponding to the frequency at each midpoint.
• Join these points in order, using line segments.
• To complete the polygon, join the point at each end to the class mark of the adjacent lower or higher
class (taken with zero frequency) on the x-axis.

Q5. Draw the frequency polygon for the following data

Class Interval   10-20   20-30   30-40   40-50   50-60   60-70   70-80   80-90
Frequency          4       6       8      10      12      14       7       5

Solution :
Mark the class intervals along the x-axis and the frequencies along the y-axis.

Let us assume two additional class intervals, 0-10 and 90-100, each with frequency zero.

Now calculate the midpoint of each class interval.

Class Interval   Midpoint   Frequency
0-10                 5          0
10-20               15          4
20-30               25          6
30-40               35          8
40-50               45         10
50-60               55         12
60-70               65         14
70-80               75          7
80-90               85          5
90-100              95          0


Using the midpoint and the frequency value from the above table, plot the points A (5, 0), B (15, 4),
C (25, 6), D (35, 8), E (45, 10), F (55, 12), G (65, 14), H (75, 7), I (85, 5) and J (95, 0).

To obtain the frequency polygon ABCDEFGHIJ, draw the line segments AB, BC, CD, DE, EF, FG,
GH, HI, IJ, and connect all the points.
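
The points of this frequency polygon can also be computed rather than read off a table. The sketch below, with a hypothetical helper name `polygon_points`, finds each class midpoint and adds the two zero-frequency end points; it assumes all classes have the same width.

```python
def polygon_points(lower_limits, width, frequencies):
    """Return (midpoint, frequency) points, padded with zero-frequency ends."""
    midpoints = [low + width / 2 for low in lower_limits]
    points = list(zip(midpoints, frequencies))
    # Close the polygon with an empty class before the first and after the last
    first = (midpoints[0] - width, 0)
    last = (midpoints[-1] + width, 0)
    return [first] + points + [last]


# Classes 10-20, 20-30, ..., 80-90 with the frequencies given in Q5
points = polygon_points(list(range(10, 90, 10)), 10, [4, 6, 8, 10, 12, 14, 7, 5])
for x, f in points:
    print(x, f)
```

The ten points returned correspond to A through J above and can be passed to any plotting tool and joined in order.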

Frequently Asked Questions

1. What are the Different Types of Graphical Representation?

Some of the various types of graphical representation include:

• Line Graphs
• Bar Graphs
• Histograms
• Line Plots
• Frequency Table
• Circle Graph, etc.

2. What are the Advantages of Graphical Method?

Some of the advantages of graphical representation are:

• It makes data more easily understandable.
• It saves time.
• It makes the comparison of data more efficient.

3. How is data represented?


A: The collected data can be expressed in various ways like bar graphs, pictographs, frequency
tables, line graphs, pie charts and many more. It depends on the purpose of the data, and
accordingly, the type of graph can be chosen.

4. What are the different types of data representation?


A: The few types of data representation are given below:
1. Frequency distribution table
2. Bar graph
3. Histogram
4. Line graph
5. Pie chart

5. What is data representation, and why is it essential?


A: After collecting the data, the investigator has to condense them in tabular form to study their
salient features. Such an arrangement is known as the presentation of data.
Importance: Data visualization gives us a clear understanding of what the information means by
displaying it visually through maps or graphs. This makes the data easier for the mind to comprehend
and makes it easier to identify trends and outliers within large data sets.

6. What is the difference between data and representation?


A: The term data refers to a collection of specific facts, quantitative in nature, such as height,
number of children etc., whereas data representation is the form in which the data, after being processed
and arranged, are presented so as to give them meaning.

7. Why do we use data representation?


A: Data visualization gives us a clear understanding of what the information means by
displaying it visually through maps or graphs. This makes the data easier for the mind to comprehend
and makes it easier to identify trends and outliers within large data sets.


Various measures of central tendency

Arithmetic Mean

(a) To find A.M. for Raw data

For raw data, the arithmetic mean of a series of numbers is the sum of all observations divided by the
number of observations in the series. Thus if x1, x2, ..., xn represent the values of n observations,
then the arithmetic mean (A.M.) for n observations is (direct method):

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

There are two methods for computing the A.M.:

(i) Direct method

(ii) Short cut method.


Example 5.1

The following data represent the number of books issued in a school library on 7 selected
days: 7, 9, 12, 15, 5, 4, 11. Find the mean number of books.

Solution:

$$\bar{x} = \frac{7 + 9 + 12 + 15 + 5 + 4 + 11}{7} = \frac{63}{7} = 9$$

Hence the mean number of books is 9.

Short-cut Method to find A.M.

Under this method an assumed mean or an arbitrary value (denoted by A) is used as the basis of
calculation of deviations (di) of the individual values. That is, if di = xi - A, then

$$\bar{x} = A + \frac{\sum_{i=1}^{n} d_i}{n}$$

Example 5.2

A student's marks in 5 subjects are 75, 68, 80, 92, 56. Find the average of his marks.

Solution:

Let us take the assumed mean, A = 68. The deviations di = xi - 68 are 7, 0, 12, 24 and -12, and their sum is 31.

$$\bar{x} = A + \frac{\sum d_i}{n} = 68 + \frac{31}{5} = 68 + 6.2 = 74.2$$

The arithmetic mean of the marks is 74.2.
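
Both methods for raw data can be checked with a few lines of Python. This is a minimal sketch; the function names `mean_direct` and `mean_shortcut` are illustrative and not part of the notes. The data are those of Examples 5.1 and 5.2.

```python
def mean_direct(values):
    """Direct method: sum of the observations divided by their number."""
    return sum(values) / len(values)


def mean_shortcut(values, assumed_mean):
    """Short-cut method: assumed mean plus the average deviation from it."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


books = [7, 9, 12, 15, 5, 4, 11]      # Example 5.1
marks = [75, 68, 80, 92, 56]          # Example 5.2

print(mean_direct(books))                    # 9.0
print(round(mean_shortcut(marks, 68), 2))    # 74.2
```

Whatever value is chosen for the assumed mean, the short-cut method gives the same result as the direct method; a value close to the true mean simply keeps the deviations small.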


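(b) To find A.M. for Discrete Grouped data

If x1, x2, ..., xn are discrete values with the corresponding frequencies f1, f2, ..., fn, then the mean
for discrete grouped data is defined as (direct method)

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{N}, \qquad N = \sum_{i=1}^{n} f_i$$

In the short cut method the formula is modified as

$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N}, \qquad d_i = x_i - A$$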

Example 5.3

A proofreader reads through a 73-page manuscript. The number of mistakes found on each of the pages is
summarized in the table below. Determine the mean number of mistakes found per page.

Solution:

(i) Direct Method


The mean number of mistakes is 4.09

(ii) Short-cut Method

The mean number of mistakes = 4.09
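
The direct and short-cut calculations for discrete grouped data can be sketched as follows. The values and frequencies below are hypothetical and are not the data of Example 5.3; the function names are likewise illustrative.

```python
def grouped_mean(values, frequencies):
    """Direct method: sum of f*x divided by the total frequency N."""
    total = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / total


def grouped_mean_shortcut(values, frequencies, assumed_mean):
    """Short-cut method: A plus sum of f*d divided by N, where d = x - A."""
    total = sum(frequencies)
    weighted_dev = sum(f * (x - assumed_mean) for x, f in zip(values, frequencies))
    return assumed_mean + weighted_dev / total


mistakes = [1, 2, 3, 4, 5]   # hypothetical number of mistakes per page
pages = [2, 3, 5, 3, 2]      # hypothetical number of pages with that many mistakes

print(grouped_mean(mistakes, pages))                 # 3.0
print(grouped_mean_shortcut(mistakes, pages, 3))     # 3.0
```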


(c) Mean for Continuous Grouped data:

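For the computation of A.M. for continuous grouped data, we can use the direct method or the short cut
method.

Direct Method:

The formula is

$$\bar{x} = \frac{\sum f_i x_i}{N}$$

where xi is the midpoint of the i-th class interval, fi is its frequency and N = Σfi.

Short cut method

$$\bar{x} = A + \frac{\sum f_i d_i}{N} \times c, \qquad d_i = \frac{x_i - A}{c}$$

where A is an assumed mean and c is the width of the class interval.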

Example 5.4

The following is the distribution of persons according to different income groups


Find the average income of the persons.

Solution :

Direct Method:

Short cut method:


Merits

• It is easy to compute and has a unique value.

• It is based on all the observations.

• It is well defined.

• It is least affected by sampling fluctuations.

• It can be used for further statistical analysis.

Limitations

• The mean is unduly affected by the extreme items (outliers).

• It cannot be determined for qualitative data such as beauty, honesty etc.

• It cannot be located by observation or by the graphic method.


When to use?

Arithmetic mean is the best representative of the data if the data set is homogeneous. On the other
hand, if the data set is heterogeneous, the result may be misleading and may not represent the data.

Weighted Arithmetic Mean

The arithmetic mean, as discussed earlier, gives equal importance (or weight) to each observation
in the data set. However, there are situations in which the values of individual observations in the data
set are not of equal importance. Under these circumstances, we may attach a weight to each
observation value as an indicator of its importance.
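
Under the usual definition, the weighted arithmetic mean is the sum of each value multiplied by its weight, divided by the sum of the weights. A minimal Python sketch follows; the function name `weighted_mean` and the marks and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w*x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


marks = [80, 60, 70]    # hypothetical marks in three components
weights = [2, 3, 5]     # hypothetical weight attached to each component

print(weighted_mean(marks, weights))   # 69.0
```

With equal weights the result reduces to the ordinary arithmetic mean, here 70.0.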

Uses of weighted arithmetic mean

Weighted arithmetic mean is used in:

• The construction of index numbers.

• Comparison of results of two or more groups where the number of items in the groups differs.

• Computation of standardized death and birth rates.

Example 5.5

The weights assigned to different components in an examination, and the marks scored by a student, are
given in the table below.

[Table: Component | Weightage | Marks scored]

Calculate the weighted average score of the student who scored marks as given in the table

Solution:


Combined Mean:

Let x̄1 and x̄2 be the arithmetic means of two groups (having the same unit of measurement of
a variable), based on n1 and n2 observations respectively. Then the combined mean can be
calculated using

$$\bar{x}_{12} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

Remark: The above result can be extended to any number of groups.
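
The combined mean for any number of groups can be sketched in Python as below, weighting each group mean by its group size. The function name `combined_mean` is illustrative; the figures used are those of a class with 4 boys averaging 20 marks and 3 girls averaging 30 marks.

```python
def combined_mean(sizes, means):
    """Each group mean weighted by its group size, divided by the total size."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


class_average = combined_mean([4, 3], [20, 30])
print(round(class_average, 2))   # 24.29
```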

Example 5.6

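A class consists of 4 boys and 3 girls. The average marks obtained by the boys and girls are 20 and
30 respectively. Find the class average.

Solution:

Here n1 = 4 and n2 = 3 are the group sizes, and the group means are 20 and 30, so the class average is

$$\frac{4 \times 20 + 3 \times 30}{4 + 3} = \frac{170}{7} \approx 24.29$$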


Median
Median is the value of the variable which divides the whole set of data into two equal parts. It is
the value such that in a set of observations, 50% observations are above and 50% observations
are below it. Hence the median is a positional average.

(a) Median for Ungrouped or Raw data:

In this case, the data is arranged in either ascending or descending order of magnitude.

(i) If the number of observations n is an odd number, then the median is the
numerical value of x at position (n+1)/2 in the ordered observations.
That is,

Median = value of the ((n+1)/2)th observation in the data array

(ii) If the number of observations n is an even number, then the median is defined as the arithmetic
mean of the two middle values in the array. That is,

$$\text{Median} = \frac{\text{value of the } (n/2)\text{th observation} + \text{value of the } (n/2 + 1)\text{th observation}}{2}$$

Example 5.14

The numbers of rooms in the seven five-star hotels in Chennai city are 71, 30, 61, 59, 31, 40 and 29.
Find the median number of rooms

Solution:

Arrange the data in ascending order 29, 30, 31, 40, 59, 61, 71

n = 7 (odd)

Median = value of the (7+1)/2 = 4th observation

Median = 40 rooms


Example 5.15

The export of agricultural products in million dollars from a country during eight quarters in 1974
and 1975 was recorded as 29.7, 16.6, 2.3, 14.1, 36.6, 18.7, 3.5, 21.3

Find the median of the given set of values

Solution:

We arrange the data in descending order

36.6, 29.7, 21.3, 18.7, 16.6, 14.1, 3.5, 2.3

n = 8 (even), so the median is the arithmetic mean of the 4th and 5th values:

Median = (18.7 + 16.6) / 2 = 17.65 million dollars
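
The odd and even cases for raw data can be combined in one short Python function. This is a sketch with an illustrative function name; the data are those of Examples 5.14 and 5.15.

```python
def median(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the (n+1)/2-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of the n/2-th and (n/2+1)-th


rooms = [71, 30, 61, 59, 31, 40, 29]                        # Example 5.14
exports = [29.7, 16.6, 2.3, 14.1, 36.6, 18.7, 3.5, 21.3]    # Example 5.15

print(median(rooms))                 # 40
print(round(median(exports), 2))     # 17.65
```

Sorting in ascending or descending order gives the same median, since the middle positions are unchanged.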

Cumulative Frequency

In a grouped distribution, values are associated with frequencies. The cumulative frequencies are
calculated to know the total number of items above or below a certain limit. They are obtained by
adding the frequencies successively up to the required level. These cumulative frequencies are
useful for calculating the median, quartiles, deciles and percentiles.

(b) Median for Discrete grouped data

We can find the median using the following steps:

i. Calculate the cumulative frequencies

ii. Find (N+1)/2, where N = Σf = total frequency

iii. Identify the cumulative frequency just greater than (N+1)/2

iv. The value of x corresponding to that cumulative frequency is the median.
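
The four steps can be sketched in Python as below. The weights and frequencies are hypothetical and are not the table of any example in these notes. One assumption to note: the sketch uses "greater than or equal to" in step iii, so that a cumulative frequency exactly equal to (N+1)/2 is also handled.

```python
from itertools import accumulate


def discrete_grouped_median(values, frequencies):
    """Return the value whose cumulative frequency first reaches (N+1)/2."""
    cumulative = list(accumulate(frequencies))   # step i
    position = (cumulative[-1] + 1) / 2          # step ii: (N+1)/2
    for x, cf in zip(values, cumulative):        # steps iii and iv
        if cf >= position:
            return x


weights = [30, 35, 40, 45]   # hypothetical weights in kg, in ascending order
counts = [10, 18, 22, 10]    # hypothetical number of students at each weight

print(discrete_grouped_median(weights, counts))   # 40
```

Here N = 60 and (N+1)/2 = 30.5; the cumulative frequencies are 10, 28, 50 and 60, so the first one to reach 30.5 is 50, which corresponds to 40 kg.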


Example 5.16

The following data are the weights of students in a class. Find the median weights of the students

Solution:

The cumulative frequency just greater than (N+1)/2 = 30.5 is 38. The value of x corresponding to 38 is 40. The
median weight of the students is 40 kg.


(c) Median for Continuous grouped data

In this case, the data is given in the form of a frequency table with class intervals. The
following formula is used to calculate the median.

$$\text{Median} = l + \frac{\frac{N}{2} - m}{f} \times c$$

Where

l = Lower limit of the median class

N = Total frequency

f = Frequency of the median class

m = Cumulative frequency of the class preceding the median class

c = Width of the class interval of the median class.

From the formula, it is clear that one has to find the median class first. The median class is the class
which corresponds to the cumulative frequency just greater than N/2.
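
As an illustration, the formula Median = l + ((N/2 - m) / f) × c can be applied in Python to the frequency distribution used in Q5 for the frequency polygon (classes 10-20 to 80-90). The function name is illustrative, equal class widths are assumed, and the median class is taken as the first class whose cumulative frequency reaches N/2.

```python
from itertools import accumulate


def continuous_grouped_median(lower_limits, width, frequencies):
    """Median = l + ((N/2 - m) / f) * c for equal-width class intervals."""
    cumulative = list(accumulate(frequencies))
    half = cumulative[-1] / 2                      # N/2
    for i, cf in enumerate(cumulative):
        if cf >= half:                             # median class found
            l = lower_limits[i]                    # lower limit of the median class
            m = cumulative[i - 1] if i > 0 else 0  # cumulative frequency before it
            f = frequencies[i]                     # frequency of the median class
            return l + (half - m) / f * width


lower_limits = list(range(10, 90, 10))             # 10, 20, ..., 80
frequencies = [4, 6, 8, 10, 12, 14, 7, 5]          # data from Q5

result = continuous_grouped_median(lower_limits, 10, frequencies)
print(round(result, 2))   # 54.17
```

Here N = 66 and N/2 = 33; the cumulative frequencies are 4, 10, 18, 28, 40, ..., so the median class is 50-60 with l = 50, m = 28, f = 12 and c = 10.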

Example 5.17

The following data were obtained from the records of a garden for a certain period. Calculate the median weight
of the apples.


Solution:


Example 5.18

The following table shows age distribution of persons in a particular region:

Find the median age.

Solution:

We are given upper limits and less-than cumulative frequencies. First find the class intervals and
the frequencies. Since the values increase by 10, the width of the class interval is
equal to 10.


Example 5.19

The following are the marks obtained by 140 students in a college. Find the median marks


Solution:


Graphical method for Location of median

Median can be located with the help of the cumulative frequency curve or 'ogive'.

The procedure for locating the median in grouped data is as follows:

Step 1 : The class intervals are represented on the horizontal axis (x-axis).

Step 2 : The cumulative frequency corresponding to each class is calculated. These cumulative
frequencies are plotted on the vertical axis (y-axis) against the upper limit of the respective
class interval.

Step 3 : The curve obtained by joining the points freehand is called the 'less than ogive'.

Step 4 : A horizontal straight line is drawn from the value N/2 or (N+1)/2 on the y-axis,
parallel to the x-axis, to meet the ogive (depending on whether N is even or odd).

Step 5 : From the point of intersection, draw a line perpendicular to the horizontal axis,
which meets the x-axis at a point, say m.

Step 6 : The value m on the x-axis gives the value of the median.


Example 5.20

Draw ogive curves for the following frequency distribution and determine
the median.

Solution:


The median value from the graph is 42

Merits

• It is easy to compute. It can be calculated by mere inspection and by
the graphical method

• It is not affected by extreme values.

• It can be easily located even if the class intervals in the series are
unequal


Limitations

• It is not amenable to further algebraic treatment.

• It is a positional average and is based on the middle item.

• It does not take into account the actual values of the items in the
series.


Mode
According to Croxton and Cowden, 'The mode of a distribution is the value
at the point around which the items tend to be most heavily concentrated.'

Suppose we survey the vehicle traffic at a particular place on a busy road
during a particular period of time and observe that the number of two
wheelers is greater than that of cars, buses and other vehicles. Because of the
higher frequency, we say that the modal value of this survey is 'two wheelers'.

Mode is defined as the value which occurs most frequently in a data set. A
frequency distribution may have two or more modes.

Computation of mode:

(a) For Ungrouped or Raw Data:

The mode is the value which occurs most frequently in the data set.


Example 5.21

The following are the marks scored by 20 students in the class. Find the
mode 90, 70, 50, 30, 40, 86, 65, 73, 68, 90, 90, 10, 73, 25, 35, 88, 67, 80, 74,
46

Solution:

Since the mark 90 occurs three times, more often than any other value,
the mode is 90.

Example 5.22

A doctor checked the sugar levels of 10 patients; the readings are given below. Find the mode
value of the sugar levels. 80, 112, 110, 115, 124, 130, 100, 90, 150, 180

Solution:

Since each value occurs only once, there is no mode.

Example 5.23

Compute mode value for the following observations.

2, 7, 10, 12, 10, 19, 2, 11, 3, 12

Solution:

Here, the observations 2, 10 and 12 each occur twice in the data set, so the modes are
2, 10 and 12.
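
The three ungrouped examples can be checked by counting. This is a minimal sketch with a hypothetical helper `modes`; it follows the convention used in Example 5.22 that data in which every value occurs once has no mode. Note that a direct count of the data in Example 5.23 shows the value 2 occurring twice as well as 10 and 12.

```python
from collections import Counter

def modes(values):
    """Return all values sharing the highest frequency (empty list if none repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:          # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

marks = [90, 70, 50, 30, 40, 86, 65, 73, 68, 90,
         90, 10, 73, 25, 35, 88, 67, 80, 74, 46]                # Example 5.21
sugar = [80, 112, 110, 115, 124, 130, 100, 90, 150, 180]       # Example 5.22
observations = [2, 7, 10, 12, 10, 19, 2, 11, 3, 12]            # Example 5.23

print(modes(marks))         # [90]
print(modes(sugar))         # []
print(modes(observations))  # [2, 10, 12]
```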

For a discrete frequency distribution, the mode is the value of the variable
corresponding to the maximum frequency.

Example 5.24

Calculate the mode from the following data


Solution:

Here, 7 is the maximum frequency, and the value of x corresponding to 7
is 8.

Therefore 8 is the mode.

(b) Mode for Continuous data:

The mode or modal value of the distribution is that value of the variate for
which the frequency is maximum. It is the value around which the items or
observations tend to be most heavily concentrated. The mode is computed
by the formula

Mode = l + [(f1 − f0) / (2f1 − f0 − f2)] × c

The modal class is the class which has the maximum frequency.

l = lower limit of the modal class

f1 = frequency of the modal class

f0 = frequency of the class preceding the modal class

f2 = frequency of the class succeeding the modal class

c = width of the class interval
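
As a rough illustration, the sketch below assumes the standard interpolation formula Mode = l + [(f1 − f0) / (2f1 − f0 − f2)] × c, where l (the lower limit of the modal class) is an extra symbol alongside f1, f0, f2 and c defined above. The function name and the numbers are hypothetical.

```python
def grouped_mode(l, f1, f0, f2, c):
    """Mode of a continuous distribution by interpolation within the modal class."""
    return l + (f1 - f0) / (2 * f1 - f0 - f2) * c

# Hypothetical modal class 20-25 with frequency 12,
# preceded by a class of frequency 8 and followed by one of frequency 6
print(grouped_mode(l=20, f1=12, f0=8, f2=6, c=5))  # 22.0
```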


Example 5.25

The following data relates to the daily income of families in an urban area.
Find the modal income of the families.

Solution:


Determination of Modal class:

For a frequency distribution, the modal class corresponds to the class with
the maximum frequency. But in any one of the following cases it cannot be
identified so easily:

i. If the maximum frequency is repeated.

ii. If the maximum frequency occurs at the beginning or at the end of the
distribution.

iii. If there are irregularities in the distribution.

In these cases, the modal class is determined by the method of grouping.

Steps for preparing Analysis table:

We prepare a grouping table with 6 columns.

i. In column I, we write down the given frequencies.

ii. Column II is obtained by combining the frequencies two by two.

iii. Leave the first frequency and combine the remaining frequencies
two by two and write in column III.

iv. Column IV is obtained by combining the frequencies three by three.

v. Leave the first frequency and combine the remaining frequencies
three by three and write in column V.

vi. Leave the first and second frequencies and combine the remaining
frequencies three by three and write in column VI.


Mark the highest frequency in each column. Then form an analysis table to
find the modal class. After finding the modal class use the formula to
calculate the modal value.
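
The grouping table and analysis table can be sketched in code. This is one possible reading of the six-column procedure, not a definitive implementation: each column is described by a group size and a number of leading frequencies to skip, and every class belonging to a column's largest group receives one tally mark. The function name and the frequencies are hypothetical.

```python
def grouping_method(frequencies):
    """Return (index of modal class, tally per class) using the six-column grouping table."""
    n = len(frequencies)
    tally = [0] * n
    # (group size, frequencies skipped at the start) for columns I to VI
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for size, skip in columns:
        groups = [(start, sum(frequencies[start:start + size]))
                  for start in range(skip, n - size + 1, size)]
        if not groups:
            continue
        best = max(total for _, total in groups)
        for start, total in groups:
            if total == best:                      # mark the highest total in the column
                for i in range(start, start + size):
                    tally[i] += 1                  # analysis table: one mark per class
    return tally.index(max(tally)), tally

# Hypothetical frequencies of eight consecutive classes
index, tally = grouping_method([4, 8, 18, 30, 20, 10, 5, 2])
print(index, tally)  # 3 [0, 1, 3, 6, 3, 1, 0, 0]
```

The class with the most tally marks (here the fourth class) is taken as the modal class, and the interpolation formula is then applied to it.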

Example 5.26

Calculate mode for the following frequency distribution:

Solution:


Analysis Table:


The maximum occurs in the class 20-25, and hence it is the modal
class.

(d) Graphical Location of Mode

The following are the steps to locate mode by graph

i. Draw a histogram of the given distribution.

ii. Join the top right corner of the highest rectangle (modal class
rectangle) by a straight line to the top right corner of the preceding
rectangle. Similarly, join the top left corner of the highest rectangle to
the top left corner of the rectangle on the right.

iii. From the point of intersection of these two diagonal lines, draw
a perpendicular to the x-axis, meeting it at M.

iv. The x-coordinate of M is the mode.


Example 5.27

Locate the modal value graphically for the following frequency distribution

Solution:

Merits of Mode:

• It is comparatively easy to understand.

• It can be found graphically.

• It is easy to locate in some cases by inspection.

• It is not affected by extreme values.

• It is the simplest descriptive measure of average.

Demerits of Mode:

• It is not suitable for further mathematical treatment.

• It is an unstable measure as it is affected more by sampling
fluctuations.

• Mode for the series with unequal class intervals cannot be calculated.

• In a bimodal distribution, there are two modal classes and it is
difficult to determine the values of the mode.


Measures of Dispersion
The following data provide the runs scored by two batsmen in the last 10 matches.

Batsman A: 25, 20, 45, 93, 8, 14, 32, 87, 72, 4

Batsman B: 33, 50, 47, 38, 45, 40, 36, 48, 37, 26

The means of the two data sets are the same (40), but the data sets differ significantly.

From the above diagrams, we see that the runs of batsman B are grouped around the mean,
whereas the runs of batsman A are scattered from 0 to 100, even though both have the same mean.

Thus, some additional statistical information is required to determine how the values
are spread in the data. For this, we shall discuss Measures of Dispersion.
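
A quick numerical check of the two batsmen, using Python's standard `statistics` module. The population standard deviation used here is one of the measures of dispersion introduced in this section.

```python
from statistics import mean, pstdev

batsman_a = [25, 20, 45, 93, 8, 14, 32, 87, 72, 4]
batsman_b = [33, 50, 47, 38, 45, 40, 36, 48, 37, 26]

# Both means are 40, so the mean alone cannot tell the batsmen apart
print(mean(batsman_a), mean(batsman_b))

# Population standard deviation: large for A, small for B
print(round(pstdev(batsman_a), 1), round(pstdev(batsman_b), 1))  # 31.2 7.2
```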


Dispersion is a measure which gives an idea about the scatteredness of the values.

Measures of Variation (or) Dispersion of a data set provide an idea of how the observations
are spread out (or) scattered throughout the data.

Different Measures of Dispersion are

1. Range

2. Mean deviation

3. Quartile deviation

4. Standard deviation

5. Variance

6. Coefficient of Variation

1. Range

The difference between the largest value and the smallest value is called the Range.

Range R = L − S

Coefficient of range = (L − S) / (L + S)

where L - Largest value; S - Smallest value

Example 8.1 Find the range and coefficient of range of the following data: 25, 67, 48, 53,
18, 39, 44.

Solution Largest value L = 67; Smallest value S = 18

Range R = L − S = 67 − 18 = 49

Coefficient of range = (L − S) / (L + S)

Coefficient of range = (67 − 18) / (67 + 18) = 49/85

= 0.576
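
The calculation in Example 8.1 can be reproduced with a short sketch; the helper name is hypothetical.

```python
def range_and_coefficient(values):
    """Return (range, coefficient of range) = (L - S, (L - S) / (L + S))."""
    largest, smallest = max(values), min(values)
    return largest - smallest, (largest - smallest) / (largest + smallest)

r, coefficient = range_and_coefficient([25, 67, 48, 53, 18, 39, 44])
print(r, round(coefficient, 3))  # 49 0.576
```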


Example 8.2 Find the range of the following distribution.

Solution Here Largest value L = 28

Smallest value S = 18

Range R = L − S

R = 28 − 18 = 10 Years

Example 8.3 The range of a set of data is 13.67 and the largest value is 70.08. Find the
smallest value.

Solution

Range R = 13.67

Largest value L = 70.08

Range R = L − S

13.67 = 70.08 − S

S = 70.08 − 13.67 = 56.41

Therefore, the smallest value is 56.41.

2. Deviations from the mean

For given data with n observations x1, x2, …, xn, the deviations from the mean x̄ are x1 − x̄, x2 − x̄, …, xn − x̄.


3. Squares of deviations from the mean

The squares of deviations from the mean x̄ of the observations x1, x2, …, xn are
(x1 − x̄)², (x2 − x̄)², …, (xn − x̄)².

Note

We note that (xi − x̄)² ≥ 0 for all observations xi, i = 1, 2, 3, …, n. If the deviations
from the mean (xi − x̄) are small, then the squares of the deviations will be very
small.

4. Variance

The mean of the squares of the deviations from the mean is called Variance.

It is denoted by σ² (read as sigma square).

Variance = Mean of squares of deviations

σ² = Σ(xi − x̄)² / n


5. Standard Deviation

The positive square root of Variance is called Standard deviation. That is, standard
deviation is the positive square root of the mean of the squares of deviations of the given
values from their mean. It is denoted by σ.

Standard deviation gives a clear idea about how far the values are spreading or deviating
from the mean.

1. Calculation of Standard Deviation for ungrouped data


(i) Direct Method


Note

• While computing standard deviation, arranging data in ascending order is not
mandatory.

• If the data values are given directly, then to find the standard deviation we can use the
formula σ = √[ Σxi²/n − (Σxi/n)² ]

• If the data values are not given directly but the squares of the deviations from the
mean of each observation are given, then to find the standard deviation we can use the formula
σ = √[ Σ(xi − x̄)²/n ]

Example 8.4 The number of televisions sold in each day of a week are 13, 8, 4, 9, 7, 12,
10.

Find its standard deviation.

Solution
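
A numerical check of Example 8.4, sketched under the assumption that the direct method means σ = √[ Σxi²/n − (Σxi/n)² ]. The function name is hypothetical.

```python
from math import sqrt

def sd_direct(values):
    """Standard deviation from the sums of x and x squared (direct method)."""
    n = len(values)
    mean_of_squares = sum(x * x for x in values) / n
    square_of_mean = (sum(values) / n) ** 2
    return sqrt(mean_of_squares - square_of_mean)

televisions_sold = [13, 8, 4, 9, 7, 12, 10]
# Sum of x = 63 and sum of x squared = 623, so the variance is 623/7 - 9**2 = 8
print(round(sd_direct(televisions_sold), 2))  # 2.83
```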


(ii) Mean method

Another convenient way of finding standard deviation is to use the following formula.

Standard deviation (by mean method) σ = √[ Σ(xi − x̄)²/n ]

If di = xi − x̄ are the deviations, then σ = √[ Σdi²/n ]

Example 8.5 The amount of rainfall in a particular season for 6 days are given as 17.8
cm, 19.2 cm, 16.3 cm, 12.5 cm, 12.8 cm and 11.4 cm. Find its standard deviation.

Solution Arranging the numbers in ascending order we get, 11.4, 12.5, 12.8, 16.3, 17.8,
19.2.

Number of observations n = 6


(iii) Assumed Mean method

When the mean value is not an integer (since calculations are very tedious in decimal
form), it is better to use the assumed mean method to find the standard deviation.

Let x1, x2, x3, ..., xn be the given data values and let x̄ be their mean.

Let di be the deviation of xi from the assumed mean A, which is usually the middle
value or near the middle value of the given data.

di = xi − A gives xi = di + A ...(1)


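Σdi = Σ(xi − A)

= Σxi − (A + A + A + . . . to n times)

Σdi = Σxi − A × n

Dividing by n gives d̄ = x̄ − A, that is, x̄ = d̄ + A. Since subtracting the constant A from
every value does not change the spread, the standard deviation is

σ = √[ Σdi²/n − (Σdi/n)² ]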

Example 8.6 The marks scored by 10 students in a class test are 25, 29, 30, 33, 35, 37,
38, 40, 44, 48. Find the standard deviation.

Solution The mean of marks is 35.9 which is not an integer. Hence we take assumed
mean, A = 35, n = 10 .
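
Example 8.6 can be checked with a short sketch. It assumes the assumed mean formula σ = √[ Σdi²/n − (Σdi/n)² ] with di = xi − A; the function name is hypothetical.

```python
from math import sqrt

def sd_assumed_mean(values, assumed_mean):
    """Standard deviation using deviations d = x - A from an assumed mean A."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return sqrt(sum(di * di for di in d) / n - (sum(d) / n) ** 2)

marks = [25, 29, 30, 33, 35, 37, 38, 40, 44, 48]
# With A = 35: sum of d = 9 and sum of d squared = 453
print(round(sd_assumed_mean(marks, 35), 2))  # 6.67
```

Any other choice of A gives the same result; A = 35 simply keeps the deviations small.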


(iv) Step deviation method

Let x1, x2, x3, ..., xn be the given data. Let A be the assumed mean.

Let c be the common divisor of xi − A.

Let di = (xi − A) / c. Then xi = A + c di and

σ = c × √[ Σdi²/n − (Σdi/n)² ]


Note

We can use any of the above methods for finding the standard deviation.

Example 8.7 The amounts that the children have spent on purchasing some eatables on a
one day trip of a school are 5, 10, 15, 20, 25, 30, 35, 40. Using the step deviation method,
find the standard deviation of the amount they have spent.

Solution We note that all the observations are divisible by 5. Hence we can use the step
deviation method. Let the Assumed mean A = 20, n = 8.
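
A sketch of the step deviation calculation for Example 8.7, assuming di = (xi − A) / c and σ = c × √[ Σdi²/n − (Σdi/n)² ]. The function name is hypothetical.

```python
from math import sqrt

def sd_step_deviation(values, assumed_mean, c):
    """Standard deviation using step deviations d = (x - A) / c."""
    n = len(values)
    d = [(x - assumed_mean) / c for x in values]
    return c * sqrt(sum(di * di for di in d) / n - (sum(d) / n) ** 2)

amounts = [5, 10, 15, 20, 25, 30, 35, 40]
# With A = 20 and c = 5: sum of d = 4 and sum of d squared = 44
print(round(sd_step_deviation(amounts, 20, 5), 2))  # 11.46
```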


Example 8.8 Find the standard deviation of the following data 7, 4, 8, 10, 11. Add 3 to
all the values then find the standard deviation for the new values.

Solution Arranging the values in ascending order we get, 4, 7, 8, 10, 11 and n = 5


When we add 3 to all the values, we get the new values as 7,10,11,13,14.

From the above, we see that the standard deviation will not change when we add some
fixed constant to all the values.

Example 8.9 Find the standard deviation of the data 2, 3, 5, 7, 8. Multiply each data by
4. Find the standard deviation of the new values.

Solution Given, n = 5


When we multiply each data by 4, we get the new values as 8, 12, 20, 28, 32.

From the above, we see that when we multiply each data value by 4, the standard deviation
also gets multiplied by 4.
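
The two properties shown in Examples 8.8 and 8.9 (adding a constant leaves the standard deviation unchanged, multiplying by a constant multiplies it) can be verified with the standard library:

```python
from math import isclose
from statistics import pstdev

data_88 = [7, 4, 8, 10, 11]
shifted = [x + 3 for x in data_88]            # add 3 to every value
print(isclose(pstdev(shifted), pstdev(data_88)))      # True

data_89 = [2, 3, 5, 7, 8]
scaled = [4 * x for x in data_89]             # multiply every value by 4
print(isclose(pstdev(scaled), 4 * pstdev(data_89)))   # True
```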

Example 8.10 Find the mean and variance of the first n natural numbers.

Solution
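
One way to work this out, sketched with the standard sums of the first n natural numbers and of their squares, and the variance written as the mean of squares minus the square of the mean:

```latex
\bar{x} = \frac{1 + 2 + \cdots + n}{n} = \frac{n(n+1)}{2n} = \frac{n+1}{2}

\sigma^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2
         = \frac{n(n+1)(2n+1)}{6n} - \left(\frac{n+1}{2}\right)^2
         = \frac{(n+1)\left[2(2n+1) - 3(n+1)\right]}{12}
         = \frac{(n+1)(n-1)}{12}
         = \frac{n^2 - 1}{12}
```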


Calculation of Standard deviation for grouped data


(i) Mean method

Example 8.11

48 students were asked to write the total number of hours per week they spent on
watching television. With this information find the standard deviation of hours spent
watching television.

Solution


(ii) Assumed Mean Method

Let x1, x2, x3, ..., xn be the given data with frequencies f1, f2, f3, ..., fn respectively.
Let x̄ be their mean and A be the assumed mean.


Example 8.12

The marks scored by the students in a slip test are given below.

Find the standard deviation of their marks.

Solution

Let the assumed mean, A = 8

2. Calculation of Standard deviation for continuous frequency distribution


(i) Mean method

Standard deviation σ = √[ Σfi(xi − x̄)² / N ], where xi = Middle value of the i th
class,

fi = Frequency of the i th class and N = Σfi.

(ii) Shortcut method (or) Step deviation method

To make the calculation simple, we provide the following formula. Let A be the
assumed mean, xi be the middle value of the i th class and c be the width of the class
interval. With di = (xi − A) / c and N = Σfi,

σ = c × √[ Σfidi²/N − (Σfidi/N)² ]
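
The sketch below compares the mean method and the step deviation method for a continuous frequency distribution. It assumes the formulas σ = √[ Σfi(xi − x̄)²/N ] and σ = c × √[ Σfidi²/N − (Σfidi/N)² ] with di = (xi − A) / c. The function names, midpoints and frequencies are hypothetical.

```python
from math import isclose, sqrt

def grouped_sd_mean(midpoints, frequencies):
    """Mean method: deviations of class midpoints from the actual mean."""
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
    return sqrt(sum(f * (x - mean) ** 2 for f, x in zip(frequencies, midpoints)) / n)

def grouped_sd_step(midpoints, frequencies, assumed_mean, c):
    """Step deviation method: d = (x - A) / c, then scale the result back by c."""
    n = sum(frequencies)
    d = [(x - assumed_mean) / c for x in midpoints]
    sum_fd = sum(f * di for f, di in zip(frequencies, d))
    sum_fd2 = sum(f * di * di for f, di in zip(frequencies, d))
    return c * sqrt(sum_fd2 / n - (sum_fd / n) ** 2)

# Hypothetical classes 0-10, 10-20, ..., 40-50 represented by their midpoints
midpoints = [5, 15, 25, 35, 45]
frequencies = [3, 5, 8, 3, 1]
print(isclose(grouped_sd_mean(midpoints, frequencies),
              grouped_sd_step(midpoints, frequencies, assumed_mean=25, c=10)))  # True
```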

Example 8.13

Marks of the students in a particular subject of a class are given below.

Find its standard deviation.

Solution

Let the assumed mean, A = 35, c = 10


Example 8.14

The mean and standard deviation of 15 observations are found to be 10 and 5
respectively. On rechecking, it was found that one of the observations, with value 8, was
incorrect. Calculate the correct mean and standard deviation if the correct observation
value was 23.

Solution

Wrong observation value = 8, Correct observation value = 23.

Correct total = 150 − 8 + 23 = 165
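
The correction in Example 8.14 can be sketched as follows. It relies on the identity σ² = Σx²/n − x̄², so that Σx² can be recovered from the given mean and standard deviation. The function name is hypothetical.

```python
from math import sqrt

def corrected_mean_sd(n, mean, sd, wrong, correct):
    """Replace one wrong observation and recompute the mean and standard deviation."""
    total = n * mean                           # sum of x: 150
    sum_sq = n * (sd ** 2 + mean ** 2)         # sum of x squared: 1875
    new_total = total - wrong + correct        # 165
    new_sum_sq = sum_sq - wrong ** 2 + correct ** 2   # 2340
    new_mean = new_total / n
    new_variance = new_sum_sq / n - new_mean ** 2
    return new_mean, sqrt(new_variance)

new_mean, new_sd = corrected_mean_sd(n=15, mean=10, sd=5, wrong=8, correct=23)
print(new_mean, round(new_sd, 2))  # 11.0 5.92
```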


Module-II
Probability-Axioms-Addition and Multiplication Rule-Types of Probability-Probability Tree-Bayes'
Theorem

PROBABILITY

INTRODUCTION:
Probability theory originated from the study of gambling. A large number of problems exist even
today which are based on games of chance, such as coin tossing, dice throwing and playing
cards.

Probability is defined in two different ways:

➢ Mathematical (or a priori) definition
➢ Statistical (or empirical) definition

Probability is the branch of mathematics concerning the occurrence of a random
event, and four main types of probability exist: classical, empirical, subjective and
axiomatic. Probability measures possibility, so you could say it's the
possibility that a particular event will happen. Probability is used to make predictions
about how likely it is for an event to happen, given the total number of possible
outcomes. There are many events you can't predict with total certainty, but you can
predict the chances of an event happening. You express every probability as a
value from zero to one.
If an event has a probability of zero, that tells you the event is impossible and won't
happen. If an event has a probability of one, that tells you the event is certain and will
happen. If an event has a probability between zero and one, that tells you how likely
the event is to happen. The closer the probability is to zero, the less likely it is to
happen, and the closer the probability is to one, the more likely it is to happen. The
probabilities of all the possible outcomes of an experiment add up to one.
For example, you know there's a one in two chance of tossing heads on a coin, so the
probability is 50%.
➢ SOME IMPORTANT TERMS & CONCEPTS:

• RANDOM EXPERIMENTS:
Experiments of any type where the outcome cannot be predicted are called random
experiments.

• SAMPLE SPACE:
A set of all possible outcomes from an experiment is called a sample space.
Eg: Consider a random experiment E of throwing 2 coins at a time. The possible outcomes are
HH, TT, HT, TH.
These 4 outcomes constitute a sample space denoted by, S ={ HH, TT, HT, TH}.

• TRIAL & EVENT:
Consider an experiment of tossing a coin. When tossing a coin, we may get a head (H) or a tail (T).
Here the tossing of a coin is a trial and getting a head or a tail is an event.
In other words, "Every non-empty subset A of the sample space S is called an event".

• NULL EVENT:
An event having no sample point is called a null event and is denoted by ∅.

• EXHAUSTIVE EVENTS:
The total number of possible outcomes in any trial is known as the exhaustive events.
Eg: In throwing a die the possible outcomes are getting 1 or 2 or 3 or 4 or 5 or 6. Hence we have
6 exhaustive events in throwing a die.

• MUTUALLY EXCLUSIVE EVENTS:


Two events are said to be mutually exclusive when the occurrence of one excludes the occurrence
of the other. In other words, if A & B are mutually exclusive events and A happens, then B will not
happen, and vice versa.
Eg: In tossing a coin the events head and tail are mutually exclusive, since both tail & head cannot
appear at the same time.

• EQUALLY LIKELY EVENTS:


Two events are said to be equally likely if neither of them can be expected in preference to
the other.
Eg: In tossing a coin, the events head & tail have equal chances of occurrence.

• INDEPENDENT & DEPENDENT EVENTS:


Two events are said to be independent when the actual happening of one does not influence in
any way the happening of the other. Events which are not independent are called dependent
events.
Eg: If we draw a card from a pack of well shuffled cards and again draw a card from the rest of the pack
(containing 51 cards), then the second draw is dependent on the first. But if, on the other
hand, we draw a second card from the pack after replacing the first card drawn, the second draw is
independent of the first.

• FAVOURABLE EVENTS:
Mathematical or classical or a priori definition of probability:
Probability (of happening of an event E) = Number of favourable cases / Total number of exhaustive cases = m / n
where m = Number of favourable cases
n = Total number of exhaustive cases.
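
A minimal sketch of the classical definition m / n for equally likely outcomes. The function name and the chosen event (an even number on a fair die) are illustrative only.

```python
from fractions import Fraction

def classical_probability(favourable, sample_space):
    """m / n for equally likely outcomes, returned as an exact fraction."""
    return Fraction(len(favourable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}                      # n = 6 exhaustive cases
even = {x for x in die if x % 2 == 0}         # m = 3 favourable cases
print(classical_probability(even, die))       # 1/2
```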

Q1. Explain Axiomatic Probability through an Example


Now let us take a simple example to understand the axiomatic approach to
probability.

On tossing a coin, we say that the probabilities of occurrence of head and tail are
1/2 each. Here we are assigning the probability value 1/2 to the occurrence of each event.

This assignment satisfies both conditions, i.e.

• Each value is neither less than zero nor greater than 1, and
• The sum of the probabilities of occurrence of head and tail is 1.

Hence, for this case we can say that the probabilities of occurrence of head and tail
are 1/2 each.

Now, say P(H) = 5/8 and P(T) = 3/8.

Does this assignment satisfy the conditions of the axiomatic approach?

For this, let us again check the basic conditions of the axiomatic approach to
probability.

• Each value is neither less than zero nor greater than 1, and
• The sum of the probabilities of occurrence of head and tail is 5/8 + 3/8 = 1.

Hence this assignment of probability values also satisfies the axiomatic approach
to probability. Thus, we can conclude that there can be infinitely many ways to assign
probabilities to the outcomes of an experiment.
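
The two conditions checked above can be written as a small test. This is only a sketch for a finite set of outcomes; the function name is hypothetical.

```python
from fractions import Fraction

def is_valid_assignment(probabilities):
    """Check that every value lies in [0, 1] and that the values sum to 1."""
    values = list(probabilities.values())
    return all(0 <= p <= 1 for p in values) and sum(values) == 1

print(is_valid_assignment({"H": Fraction(1, 2), "T": Fraction(1, 2)}))  # True
print(is_valid_assignment({"H": Fraction(5, 8), "T": Fraction(3, 8)}))  # True
print(is_valid_assignment({"H": Fraction(5, 8), "T": Fraction(1, 2)}))  # False
```

Exact fractions are used so that the sum can be compared with 1 without rounding error.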


Q2. What are the Rules of Addition and Multiplication in Probability?
The rule of addition and the rule of multiplication are two important rules of
probability that describe how probabilities are calculated for multiple events.

Rule of Addition
The rule of addition (also known as the "OR" rule) states that the probability of two
or more mutually exclusive events occurring is the sum of the probabilities of the
individual events occurring.

Example 1: If you toss a coin and want to know the probability of it landing on
heads "or" tails, the answer is 1/2 + 1/2 = 1. This means that there is a
100% chance that either heads or tails will occur.

Example 2: If you have two events, A and B, and the probability of event A occurring
is 0.40 and the probability of event B occurring is 0.30, the probability of events A
"or" B occurring is 0.40 + 0.30 = 0.70.
The above two examples apply when events are mutually exclusive, which means
that they cannot happen at the same time. In this case, the rule of addition says that
the probability of either event happening is the sum of the probabilities of each event
happening individually.
On the other hand, if events are not mutually exclusive, it means that they can happen
at the same time. In this case, the rule of addition says that the probability of either
event happening is the sum of each event's probabilities minus the probability of both
events happening simultaneously.

Example 3: If the probability of event A happening is 30% and the probability of
event B happening is 50%, and the probability of both events happening at the same
time is 10%, the probability of either event A or event B happening is 30% + 50% -
10% = 70%.

Rule of Addition
The probability that Event A or Event B occurs
= Probability that Event A occurs
+ Probability that Event B occurs
− Probability that both Events A and B occur

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
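
The addition rule can be checked by listing outcomes. The sketch below uses one roll of a fair die with two illustrative events that are not mutually exclusive; the event definitions and the helper `p` are hypothetical.

```python
from fractions import Fraction

die = set(range(1, 7))

def p(event):
    """Classical probability of an event on one roll of a fair die."""
    return Fraction(len(event), len(die))

a = {2, 4, 6}        # event A: an even number
b = {4, 5, 6}        # event B: a number greater than 3

print(p(a | b))                    # 2/3, counted directly from the union
print(p(a) + p(b) - p(a & b))      # 2/3, from the rule of addition
```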

Rule of Multiplication:
The multiplication rule (also known as the "AND" rule) states that the probability of
two independent events occurring together is equal to the product of their individual
probabilities.

Example 4: If you have two independent events A and B, where the probability of
event A occurring is 0.40 and the probability of event B occurring is 0.30, the
probability of events A "and" B both occurring is 0.40 * 0.30 = 0.12. This
is because, for independent events, the probability of both occurring is the product of
the probabilities of the individual events.

Example 5: If you want to calculate the probability of getting heads on the first coin
flip and tails on the second coin flip, you use the rule of multiplication. The
probability of getting heads on the first flip is 0.50 and the probability of getting
tails on the second flip is also 0.50, so the probability of both events occurring is
0.50 * 0.50 = 0.25.
Example 6: Suppose you have a bag containing 3 red balls and 2 green balls. If you
want to find the probability of drawing a red ball on the first draw (which is then put
back in the bag, i.e. with replacement) and a green ball on the second draw, you would
use the rule of multiplication:

P(red AND green) = P(red) * P(green) = (3/5) * (2/5) = 6/25 = 0.24

Please note that in this example, the probability of drawing a red ball in the first
selection does NOT affect the probability of drawing a green ball in the second pick, as the
first selection (red ball) is put back in the bag.
In this example, the two events are independent events, meaning that one event's
occurrence does not affect the probability of the other event occurring.

Example 7: Suppose you have a bag containing 3 red balls and 2 green balls. If you
want to find the probability of drawing a red ball on the first draw and a
green ball on the second draw (without replacement), you would use the rule of multiplication:
P(red AND green) = P(red) * P(green | red) = (3/5) * (2/4) = 6/20 = 0.30
In the above formula, P(green | red) means the probability of getting a green ball
"provided" the first event (getting a red ball) has already happened. This is called
conditional probability.

This means that the probability of drawing a red ball and then a green ball without
replacement is 0.30, or 30%.
Please note that in this example, the probability of drawing a red ball in the first
selection DOES affect the probability of drawing a green ball in the second pick, as the first
selection (red ball) is NOT put back in the bag. This reduces the total number of balls
in the bag to 4 (2 red and 2 green).
In this example, the two events are dependent events, which means that the
occurrence of one event affects the probability of the other event occurring.

In general, the probability of both events occurring is equal to the probability
of the first event occurring multiplied by the probability of the second event
occurring given that the first has occurred. When the two events are independent,
this conditional probability is simply the probability of the second event.

Rule of Multiplication:
The probability that Events A and B both occur
= Probability that Event A occurs
× Probability that Event B occurs, given that A has occurred

P(A ∩ B) = P(A) P(B | A)
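
The bag of 3 red and 2 green balls from Examples 6 and 7 can be worked through with exact fractions. The function name and its `replace` flag are hypothetical.

```python
from fractions import Fraction

def red_then_green(red, green, replace):
    """P(red on first draw AND green on second draw) = P(red) * P(green | red)."""
    total = red + green
    p_red = Fraction(red, total)
    # With replacement the bag is unchanged; without it one red ball is missing
    p_green_given_red = Fraction(green, total if replace else total - 1)
    return p_red * p_green_given_red

print(red_then_green(3, 2, replace=True))    # 6/25 (independent events)
print(red_then_green(3, 2, replace=False))   # 3/10 (dependent events)
```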

Summary:
• The rule of addition for mutually exclusive events: P(A or B) = P(A) + P(B)
• The rule of addition for non-mutually exclusive events: P(A or B) = P(A) +
P(B) - P(A and B)
• The rule of multiplication for dependent events: P(A and B) = P(A) * P(B | A)
• The rule of multiplication for independent events: P(A and B) = P(A) * P(B)

Q3. Explain Four Types of Probability

These are the four different types of probability:

1. Classical
The classical or theoretical perspective on probability states that in an experiment
where there are X equally likely outcomes, and event Y has exactly Z of these
outcomes, the probability of Y is Z/X, or P(Y) = Z/X. This is often the first
perspective that students experience in formal education. For example, when rolling a
fair die, there are six possible outcomes that are equally likely, so you can say there is a
1/6 probability of rolling each number.
The advantage of this perspective is that it's conceptually simple for a lot of
situations. However, it has limits, since many situations don't have finitely many
equally likely outcomes. For example, rolling a weighted die has a finite number of
outcomes that aren't equally likely, and studying employee incomes over many years
and into the future has an infinite number of outcomes for their maximum possible
income.

2. Empirical
The empirical or experimental perspective on probability defines probability through
repeated experiments. For example, if you are rolling a weighted die but you don't
know which side has the weight, you can estimate the probability of each
outcome by rolling the die an enormous number of times and calculating the
proportion of times the die gives that outcome.
The formal way to define this perspective is P(A) = the limit as C approaches infinity
of B/C, where A is the event, B is the number of times the event A
happens and C is the number of times you perform the process, like rolling a die or
tossing a coin.
Another way to think of this is to imagine tossing a coin 100 times, and then
continuing to 10,000 times. As you keep tossing the coin, the proportion of heads
you observe tends to become a better approximation of the theoretical
probability of the event. After the first 100 tosses the observed proportion of heads might
be some way from 1/2, but as the number of tosses grows it
approaches 1/2, the theoretical probability.
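
The coin tossing thought can be tried out with a small simulation. This is a sketch only: a pseudo-random generator stands in for a real coin, and the seed and the numbers of tosses are arbitrary choices.

```python
import random

def heads_frequency(tosses, seed=0):
    """Relative frequency B/C of heads in C simulated tosses of a fair coin."""
    rng = random.Random(seed)                # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

for tosses in (100, 10_000, 1_000_000):
    print(tosses, heads_frequency(tosses))
```

The printed proportions typically move closer to 0.5 as the number of tosses increases, which is the behaviour the empirical definition describes.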

3. Subjective
The subjective perspective on probability considers a person's own personal belief or
judgment that an event will happen. For example, an investor may have an educated
sense of the market and intuitively talk about the probability of a certain stock going
up tomorrow. You can rationally understand how that subjective view agrees with
theoretical or experimental views. In other words, it's the probability that what a
person is expecting to happen through their knowledge and feelings will actually be
the outcome, with no formal calculations.
For example, if a fan at a football game states that a particular team is going to win
the game, they are basing their decision on the team's past wins and losses, what they
know about the opposing team, facts they know about football and their opinions or
feelings about the game. They are not making a formal mathematical calculation.


4. Axiomatic
The axiomatic perspective on probability is a unifying perspective in which the same
coherent rules apply to theoretical, experimental and subjective probability alike.
You apply a set of rules or axioms by Kolmogorov to all types of
probability. Mathematicians know them as Kolmogorov's three axioms. When using
axiomatic probability, you can quantify the chances of an event occurring or not
occurring.
You can use the three axioms with all the other probability perspectives. Under this
perspective, a probability is any function from events to numbers that satisfies
the following three axioms:
• Zero is the smallest possible probability, and one is the largest.
• An event that is certain has a probability of one.
• Two mutually exclusive events cannot occur simultaneously, so the probability that
one or the other occurs (their union) is the sum of their individual probabilities.

Q 4. Why is probability important?


You use or see probability all around you on a daily basis. Even if you don't realize it,
you use probability every day to make decisions about things with an unknown
outcome. You may unknowingly perform mathematical calculations with theoretical
or experimental probability, or you may make judgment calls with subjective
probability. Here are some real-life examples of how you might use or see probability
every day:

1.Weather
Meteorologists aren't able to predict the weather exactly, so they use instruments and
tools to find the likelihood of snow, rain or other weather conditions. If there is a 30%
chance of rain, the meteorologist has determined that it
has rained on 30 out of 100 days with similar weather conditions. Because of the
forecast, you use probability to decide whether to wear sandals or rain boots to work
that morning.

2. Sports
Coaches and athletes frequently use probability to figure out the best sports strategies
for competitions and games. For example, if a football kicker makes 10 out of 15
field goals throughout the season, the estimated probability of him scoring his next field goal is
10/15 or 2/3. Another example is a baseball coach calculating a player's batting
average to determine the lineup for a game. If a player has a .300 batting average,
that means he has averaged three hits in every 10 at-bats, and the estimated probability of him
getting a base hit is 3/10.


3. Insurance
When analyzing insurance policies and considering deductible amounts, probability
plays an important role. For example, if 20 out of every 100 drivers in your area have
gotten hail damage in the last year, then when choosing your car insurance policy you
can use probability to understand that there's a 1/5 chance your car will get hail
damage. This significant probability may encourage you to get comprehensive cover
for hail damage and maybe even a lower deductible.

4. Games
When you play games with an element of luck or chance, like board games, card
games or video games, you often weigh the odds of a desirable event happening, like
getting the card you need or rolling a specific number on the die. The likelihood of
that favorable event happening helps you determine when to take a risk or how much
you're willing to risk. One example is poker players who know the probability of
getting certain hands, such as a 42% chance of getting two of a kind versus a
2% chance of getting three of a kind.

Q5. What Is Bayes Theorem?

The Bayes theorem is a mathematical formula for calculating conditional probability
in probability and statistics. In other words, it's used to figure out how likely an event
is, given that another related event has occurred. Bayes law or Bayes rule are other
names for the theorem.

Bayes Theorem Formula

The formula for the Bayes theorem can be written in a variety of ways. The following
is the most common version:

P(A ∣ B) = P(B ∣ A)P(A) / P(B)

P(A ∣ B) is the conditional probability of event A occurring, given that B is true.

P(B ∣ A) is the conditional probability of event B occurring, given that A is true.


P(A) and P(B) are the probabilities of A and B occurring on their own, without any
condition on the other event. P(B) must be greater than zero.

Example of Bayes Theorem

Now, try to solve a problem using the Bayes theorem.

Problem 1: Three urns contain 6 red, 4 black; 4 red, 6 black; and 5 red, 5 black balls
respectively. One of the urns is selected at random and a ball is drawn from it. If the
ball drawn is red, find the probability that it is drawn from the first urn.

Solution: Let E1, E2, E3, and A be the events defined as follows:

E1 = the first urn is chosen

E2 = the second urn is chosen

E3 = the third urn is chosen

A = ball drawn is red

Since there are three urns and one of the three urns is chosen at random, therefore:

P(E1) = P(E2) = P(E3) = ⅓

If E1 has already occurred, then the first urn has been chosen, containing 6 red and 4
black balls. The probability of drawing a red ball from it is 6/10.

So, P(A/E1) = 6/10

Similarly, you have P(A/E2) = 4/10 and P(A/E3) = 5/10


You are required to find P(E1/A), i.e., given that the ball drawn is red, what is the
probability that it is drawn from the first urn.

By Bayes theorem, you have

P(E1/A) = P(E1) P(A/E1) / [P(E1) P(A/E1) + P(E2) P(A/E2) + P(E3) P(A/E3)]

= (1/3 * 6/10) / [(1/3 * 6/10) + (1/3 * 4/10) + (1/3 * 5/10)]

= ⅖
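The calculation for Problem 1 can be checked with a short script. This is only a minimal sketch: the function name `posterior` and the variable names are illustrative choices, not part of the notes, and exact fractions are used so the result matches the hand calculation.

```python
from fractions import Fraction

# Prior probability of choosing each urn (one of three, at random).
priors = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]

# Probability of drawing a red ball from each urn: 6/10, 4/10, 5/10.
red_given_urn = [Fraction(6, 10), Fraction(4, 10), Fraction(5, 10)]


def posterior(priors, likelihoods):
    """Return P(E_i / A) for every i using the Bayes theorem."""
    # Numerators: P(E_i) * P(A / E_i) for each urn.
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: total probability of A (drawing a red ball).
    total = sum(joint)
    return [j / total for j in joint]


urn_posteriors = posterior(priors, red_given_urn)
print(urn_posteriors[0])  # prints 2/5
```

The same function also returns P(E2/A) and P(E3/A), and the three values add up to 1.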

Problem 2: An insurance company insured 2000 scooter drivers, 4000 car drivers,
and 6000 truck drivers. The probabilities of an accident involving a scooter driver, a car
driver, and a truck driver are 0.01, 0.03, and 0.15 respectively. One of the insured persons
meets with an accident. What is the probability that he is a scooter driver?

Let E1, E2, E3, and A be the events defined as follows:

E1 = person chosen is a scooter driver

E2 = person chosen is a car driver

E3 = person chosen is a truck driver and

A = person meets with an accident

Since there are 12000 people, therefore:

P(E1) = 2000/12000 = ⅙

P(E2) = 4000/12000 = ⅓

P(E3) = 6000/12000 = ½


It is given that P(A / E1) = Probability that a person meets with an accident given that
he is a scooter driver = 0.01

Similarly, you have P(A / E2) = 0.03 and P(A / E3) = 0.15

You are required to find P(E1 / A), i.e. given that the person meets with an accident,
what is the probability that he was a scooter driver?

P(E1/A) = P(E1) P(A/E1) / [P(E1) P(A/E1) + P(E2) P(A/E2) + P(E3) P(A/E3)]

= (1/6 * 0.01) / [(1/6 * 0.01) + (1/3 * 0.03) + (1/2 * 0.15)]

= 1/52
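Another way to see the result of Problem 2 is to count expected accidents in each group instead of working with probabilities directly. The sketch below assumes the accident probabilities 0.01, 0.03 and 0.15 used in the solution; the dictionary names are illustrative only.

```python
from fractions import Fraction

# Number of insured drivers in each group (from Problem 2).
insured = {"scooter": 2000, "car": 4000, "truck": 6000}

# Accident probability per group, as used in the solution (0.01, 0.03, 0.15).
accident_prob = {
    "scooter": Fraction(1, 100),
    "car": Fraction(3, 100),
    "truck": Fraction(15, 100),
}

# Expected number of accidents in each group.
expected_accidents = {g: insured[g] * accident_prob[g] for g in insured}
total_accidents = sum(expected_accidents.values())

# Share of all expected accidents that involve scooter drivers.
p_scooter_given_accident = expected_accidents["scooter"] / total_accidents
print(expected_accidents["scooter"], total_accidents, p_scooter_given_accident)
# prints: 20 1040 1/52
```

Out of 1040 expected accidents, only 20 involve scooter drivers, which gives the same 1/52 as the Bayes theorem.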

Q 6. What is a Probability Tree Diagram?

A probability tree diagram is used to represent the probability of occurrence of events
without using complicated formulas. It displays all the possible outcomes of an event
and helps calculate the probability of each of these outcomes. A probability tree
diagram can either represent a series of independent events or be used to denote
conditional probabilities.

Parts of a Probability Tree Diagram

There are two main parts of a probability tree. These are the nodes and the branches.
The nodes can further be classified into a parent node and a sibling node. The parent
node represents a certain event and has a probability of 1. The sibling nodes denote
other additional possible events or outcomes. The branches denote the probability of
occurrence of these events. Suppose a fair coin is tossed once, then the probability
tree can be constructed as follows:


This is a simple probability tree and has two branches only. Here, the first
node represents the parent event of a coin being tossed. Head and tail are the two
possible outcomes forming the sibling nodes. 0.5 is written on the branch and
represents the probability of occurrence of each sibling event.

Probability Tree Diagram Example

Suppose a probability tree diagram needs to be constructed for flipping a fair coin
twice. This is an example of independent events, as the outcome of each coin toss is
independent of the previous flip. First, the probability tree diagram of a coin being
flipped once is drawn as given in the previous section. The next step is to extend it to
two coin tosses as follows:


The second set of probabilities represents the second coin toss. Thus, in total there are
4 sets of possible outcomes.
To calculate the probabilities of a series of events, multiply the probabilities along the
branches of the probability tree diagram. The total probability can be computed by
adding these probabilities and its value will always be equal to 1.

Some useful inferences can be made from the probability tree diagram as follows:
• The probability of getting the outcome (Head, Head) = 0.5 × 0.5 = 0.25.
Similarly, the probability of the other outcomes can be calculated.
• 0.25 + 0.25 + 0.25 + 0.25 = 1. This implies that on adding the probabilities of
each outcome, the total is equal to 1.
• By looking at the probability tree, the probability of getting exactly one head
can be calculated as 0.25 + 0.25 = 0.5.
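The two-toss tree can also be enumerated in code as a quick check of these inferences. This is a minimal sketch; the names `branch` and `paths` are illustrative, and each path probability is the product of the branch probabilities along it.

```python
from itertools import product

# Branch probabilities for a single fair coin toss.
branch = {"H": 0.5, "T": 0.5}

# Each path through the tree is one (first toss, second toss) pair;
# its probability is the product of the branch probabilities along it.
paths = {
    (first, second): branch[first] * branch[second]
    for first, second in product(branch, repeat=2)
}

for outcome, prob in paths.items():
    print(outcome, prob)

# Adding the probabilities of all four paths gives the total probability.
total = sum(paths.values())

# Exactly one head: the paths (H, T) and (T, H).
exactly_one_head = sum(p for o, p in paths.items() if o.count("H") == 1)
print(total, exactly_one_head)  # prints 1.0 0.5
```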


Q 7. How to Draw a Probability Tree?

To draw a probability tree diagram it is necessary to identify all the possible
outcomes and the probabilities associated with them. The steps to construct a
probability tree are as follows:
• Step 1: Identify whether the events are dependent or independent.
• Step 2: Draw branches to represent the first set of outcomes.
• Step 3: Write the probabilities associated with each outcome on the branch.
• Step 4: Draw the next set of branches taking into account whether the
subsequent events are dependent or independent. Also, write the associated
probabilities.
• Step 5: Repeat step 4 for as many branches as required.
• Step 6: To calculate the probability of each final outcome, multiply the
probabilities along the branches that lead to it. The sum of these
probabilities should always be equal to 1.
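The steps above can be sketched in code for a case with dependent events. The example is hypothetical and not taken from the notes: two balls are drawn without replacement from an urn with 6 red and 4 black balls, so the second set of branch probabilities depends on the first draw. The function name `draw_two` is an illustrative choice.

```python
from fractions import Fraction


def draw_two(red, black):
    """Return the probability of each path for two draws without replacement."""
    paths = {}
    counts = {"R": red, "B": black}
    total = red + black
    for first in counts:
        # Steps 2 and 3: first set of branches and their probabilities.
        p_first = Fraction(counts[first], total)
        # Step 4: the events are dependent, so one ball is removed
        # before the second set of branches is drawn.
        remaining = dict(counts)
        remaining[first] -= 1
        for second in remaining:
            p_second = Fraction(remaining[second], total - 1)
            # Step 6: multiply the probabilities along the branches.
            paths[first + second] = p_first * p_second
    return paths


paths = draw_two(red=6, black=4)
for outcome, prob in paths.items():
    print(outcome, prob)
print(sum(paths.values()))  # prints 1
```

For independent events the inner probabilities would not change after the first draw; here they do, which is what Step 4 asks you to account for.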


Module-III
Concept of Random Variable-Probability Distributions-Expected Value and Variance of Random
Variable-Conditional Expectations-Classical Newspaper Boy's Problem (EMV, EVPI)

Random Variable

A random variable is a variable that is used to quantify the outcome of a random
experiment. As data can be of two types, discrete and continuous, there can be two
types of random variables. A discrete random variable takes on distinct, separate
values, while a continuous random variable can take any value within some interval.
Probability distributions are used to show how probabilities are distributed over the
values of a given random variable. In this section, we will learn the definition of a
random variable, its types and see various examples.

Random Variable

A random variable is a variable that can take on many values. This is because there
can be several outcomes of a random occurrence. Thus, a random variable should not
be confused with an algebraic variable. An algebraic variable represents the value of
an unknown quantity in an algebraic equation that can be calculated. On the other
hand, a random variable can have a set of values that could be the resulting outcome
of a random experiment.

Random Variable Definition

A random variable can be defined as a type of variable whose value depends upon the
numerical outcomes of a certain random phenomenon. It is also known as a stochastic
variable. Random variables are always real-valued, as they are required to be
measurable.

Random Variable Example

Suppose 2 dice are rolled and the random variable, X, is used to represent the sum of
the numbers. Then, the smallest value of X will be equal to 2 (1 + 1), while the
highest value would be 12 (6 + 6). Thus, X could take on any integer value from 2 to 12
(inclusive). Now if probabilities are attached to each outcome then the probability
distribution of X can be determined.
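As an illustration, the probability distribution of X can be built by listing all 36 equally likely rolls of two fair dice. This is a minimal sketch; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (die 1, die 2) results.
rolls = list(product(range(1, 7), repeat=2))

# Count how many rolls give each value of X = sum of the two dice.
counts = Counter(a + b for a, b in rolls)

# Probability distribution of X: number of favourable rolls out of 36.
distribution = {x: Fraction(n, len(rolls)) for x, n in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
```

The output runs from X = 2 (probability 1/36) up to X = 12 (probability 1/36), with X = 7 the most likely value at 1/6.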


Random variables can be divided into two broad categories depending upon the type
of data available. These are given as follows:
• Discrete random variable
• Continuous random variable
A probability mass function is used to describe a discrete random variable and a
probability density function describes a continuous random variable. The upcoming
sections will cover these topics in detail.


Q1. What Is a Random Variable?


A random variable is a variable whose value is unknown or a function that assigns
values to each of an experiment's outcomes. Random variables are often designated
by letters and can be classified as discrete, which are variables that have specific
values, or continuous, which are variables that can have any values within a
continuous range.

Random variables are often used in econometric or regression analysis to determine
the statistical relationships among variables.

KEY TAKEAWAYS
• A random variable is a variable whose value is unknown or a function that
assigns values to each of an experiment's outcomes.
• A random variable can be either discrete (having specific values) or continuous
(any value in a continuous range).
• The use of random variables is most common in probability and statistics,
where they are used to quantify outcomes of random occurrences.
• Risk analysts use random variables to estimate the probability of an adverse
event occurring.

Q2. Explain Types of Random Variables


A random variable has a probability distribution that represents the likelihood that
any of the possible values would occur. Let's say that the random variable, Z, is the
number on the top face of a die when it is rolled once. The possible values for Z will
thus be 1, 2, 3, 4, 5, and 6. The probability of each of these values is 1/6 as they are
all equally likely to be the value of Z.

For instance, the probability of getting a 3, or P(Z=3), when a die is thrown is 1/6,
and so is the probability of getting a 4 or a 2 or any other of the six faces of the
die. Note that the sum of all probabilities is 1.

A random variable can be either discrete or continuous.


Discrete Random Variables


Discrete random variables take on a countable number of distinct values. Consider an
experiment where a coin is tossed three times. If X represents the number of times
that the coin comes up heads, then X is a discrete random variable that can only have
the values 0, 1, 2, or 3 (from no heads in three successive coin tosses to all heads). No
other value is possible for X.

Continuous Random Variables


Continuous random variables can represent any value within a specified range or
interval and can take on an infinite number of possible values. An example of a
continuous random variable would be an experiment that involves measuring the
amount of rainfall in a city over a year or the average height of a random group of 25
people.

Drawing on the latter, if Y represents the random variable for the average height of a
random group of 25 people, you will find that the resulting outcome is a continuous
figure since height may be 5 ft or 5.01 ft or 5.0001 ft. Clearly, there is an infinite
number of possible values for height.

Example of a Random Variable


A typical example of a random variable is the outcome of a coin toss. Consider a
probability distribution in which the outcomes of a random event are not equally
likely to happen. If the random variable Y is the number of heads we get from tossing
two coins, then Y could be 0, 1, or 2. This means that we could have no heads, one
head, or both heads on a two-coin toss.

However, the two coins can land in four different ways: TT, HT, TH, and HH. Therefore,
P(Y=0) = 1/4 since we have one chance of getting no heads (i.e., two tails [TT]
when the coins are tossed). Similarly, the probability of getting two heads (HH) is
also 1/4. Notice that getting one head can occur in two ways: HT and
TH. In this case, P(Y=1) = 2/4 = 1/2.


1. What Are the 2 Kinds of Random Variables?


Random variables may be categorized as either discrete or continuous. A discrete
random variable is a type of random variable that has a countable number of distinct
values, such as heads or tails, playing cards, or the sides of a die. A continuous
random variable can reflect an infinite number of potential values, such as the average
rainfall in a region.

2. What Is a Mixed Random Variable?


A mixed random variable combines elements of both discrete and continuous random
variables.

3. How Do You Identify a Random Variable?


A random variable is one whose value is unknown a priori, or else is assigned a
random value based on some data generating process or mathematical function.

4. Why Are Random Variables Important?


Random variables produce probability distributions based on experimentation,
observation, or some other data-generating process. Random variables, in this way,
allow us to understand the world around us based on a sample of data, by knowing
the likelihood that a specific value will occur in the real world or at some point in the
future.

The Bottom Line


Random variables, whether discrete or continuous, are a key concept in statistics and
experimentation. Because they are random with unknown exact values, these allow us
to understand the probability distribution of those values or the relative likelihood of
certain events. As a result, analysts can test hypotheses and make inferences about the
natural and social world around us.


Module-IV
Probability distributions-Binomial-Poisson-Normal

Probability Distribution

A probability distribution gives the probabilities of the possible outcomes of any
random event. It is also defined based on the underlying sample space, the set of
possible outcomes of any random experiment. This set could be a set of real numbers
or a set of vectors or a set of any entities. It is a part of probability and statistics.

A random experiment is an experiment whose outcome cannot be predicted. Suppose
we toss a coin: we cannot predict whether it will come up Head or Tail. A possible
result of a random experiment is called an outcome, and the set of all outcomes is
called the sample space. With the help of these experiments or events, we can always
create a probability table in terms of variables and probabilities.

Probability Distribution of Random Variables

A random variable has a probability distribution, which defines the probability of its
unknown values. Random variables can be discrete, continuous, or a mixture of both.
A discrete random variable takes any of a designated finite or countable list of values,
and its probability distribution is described by a probability mass function. A
continuous random variable can take any numerical value in an interval or set of
intervals, and its probability distribution is described by a probability density
function. A mixed random variable is a combination of both discrete and continuous.

Two random variables with equal probability distributions can still differ with respect to
their relationships with other random variables or whether they are independent of
these. The realizations of a random variable, that is, the outcomes of randomly
choosing values as per the variable's probability distribution function, are called
random variates.

Q1. What Is a Probability Distribution?


A probability distribution is a statistical function that describes all the possible values
and likelihoods that a random variable can take within a given range. This range will
be bounded between the minimum and maximum possible values, but precisely
where the possible value is likely to be plotted on the probability distribution depends
on a number of factors. These factors include the distribution's mean (average),
standard deviation, skewness, and kurtosis.


Q2. How Do Probability Distributions Work?


Perhaps the most common probability distribution is the normal distribution, or "bell
curve," although several distributions exist that are commonly used. Typically, the
data generating process of some phenomenon will dictate its probability distribution.
The function that describes this distribution is called the probability density function.

Probability distributions can also be used to create cumulative distribution functions
(CDFs), which add up the probability of occurrences cumulatively and will always
start at zero and end at 100%.
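To illustrate how a CDF accumulates probability, the sketch below builds one for a single roll of a fair die. The die is an assumed example chosen for simplicity, and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

# Possible values of one roll of a fair die and their probabilities.
faces = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6

# The CDF at each face is the running total of the probabilities so far.
cdf = list(accumulate(pmf))

for face, cumulative in zip(faces, cdf):
    print(face, cumulative)
```

The running total rises with each face and reaches 1 (that is, 100%) at the largest value.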

Academics, financial analysts and fund managers alike may determine a particular
stock's probability distribution to evaluate the possible expected returns that the stock
may yield in the future. The stock's history of returns, which can be measured from
any time interval, will likely be composed of only a fraction of the stock's returns,
which will subject the analysis to sampling error. By increasing the sample size, this
error can be dramatically reduced.

KEY TAKEAWAYS
• A probability distribution depicts the expected outcomes of possible values for
a given data generating process.
• Probability distributions come in many shapes with different characteristics, as
defined by the mean, standard deviation, skewness, and kurtosis.
• Investors use probability distributions to anticipate returns on assets such as
stocks over time and to hedge their risk.

Types of Probability Distributions


There are many different classifications of probability distributions. Some of them
include the normal distribution, chi square distribution, binomial distribution, and
Poisson distribution. The different probability distributions serve different purposes
and represent different data generation processes. The binomial distribution, for
example, evaluates the probability of an event occurring several times over a given
number of trials, given the event's probability in each trial. It may be generated
by keeping track of how many free throws a basketball player makes in a game,
where 1 = a basket and 0 = a miss. Another typical example would be to use a fair
coin and figure out the probability of that coin coming up heads in 10 straight flips.


A binomial distribution is discrete, as opposed to continuous, since each trial has only
1 or 0 as a valid response.

The most commonly used distribution is the normal distribution, which is used
frequently in finance, investing, science, and engineering. The normal distribution is
fully characterized by its mean and standard deviation, meaning the distribution is not
skewed and its kurtosis is fixed. This makes the distribution symmetric and it is
depicted as a bell-shaped curve when plotted. The standard normal distribution has a
mean (average) of zero and a standard deviation of 1.0; every normal distribution has
a skew of zero and kurtosis = 3. In a normal distribution, approximately 68% of the
data collected will fall within +/- one standard deviation of the mean; approximately
95% within +/- two standard deviations; and 99.7% within three standard deviations.
Unlike the binomial distribution, the normal distribution is continuous, meaning that
all possible values are represented (as opposed to just 0 and 1 with nothing in between).

Q3. Explain the relevance of Probability Distributions Used in Investing


Stock returns are often assumed to be normally distributed but in reality, they exhibit
kurtosis with large negative and positive returns seeming to occur more than would
be predicted by a normal distribution. In fact, because stock prices are bounded by
zero but offer a potentially unlimited upside, the distribution of stock returns has been
described as log-normal. This shows up on a plot of stock returns with the tails of the
distribution having a greater thickness.

Probability distributions are often used in risk management as well to evaluate the
probability and amount of losses that an investment portfolio would incur based on a
distribution of historical returns. One popular risk management metric used in
investing is value-at-risk (VaR). VaR yields the minimum loss that can occur given a
probability and time frame for a portfolio. Alternatively, an investor can get a
probability of loss for an amount of loss and time frame using VaR. Misuse and
overreliance on VaR has been implicated as one of the major causes of the 2008
financial crisis.

Example of a Probability Distribution


As a simple example of a probability distribution, let us look at the number observed
when rolling two standard six-sided dice. Each die has a 1/6 probability of rolling any
single number, one through six, but the sum of two dice will form the probability
distribution depicted in the image below. Seven is the most common outcome (1+6,
6+1, 5+2, 2+5, 3+4, 4+3). Two and twelve, on the other hand, are far less likely (1+1
and 6+6).

Common Data Types


Before we jump to the explanation of distributions, let's see what kinds of data
we can encounter. The data can be discrete or continuous.

Discrete Data, as the name suggests, can take only specified values. For example,
when you roll a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or 2.45.

Continuous Data can take any value within a given range. The range may be finite or
infinite. Examples include a girl's weight or height, or the length of a road. The weight
of a girl can be any value, such as 54 kg, 54.5 kg, or 54.5436 kg.

Now let us start with the types of distributions.

Q4. Explain different Types of Distributions


1. Binomial Distribution
2. Normal Distribution
3. Poisson Distribution


Binomial distribution: Binomial probability distribution is a probability distribution
that gives the likelihood of a given number of successful outcomes in a fixed number
of trials. It is a discrete probability distribution that is used to model the number of
successes in a sequence of n independent trials (a fixed number of Bernoulli trials),
each asking a yes-no question. Each trial has a boolean-valued outcome such
as success/yes/true/one (with probability p) or failure/no/false/zero (with probability
q = 1 − p). The probability of exactly k successes in n trials is
P(Y = k) = C(n, k) × p^k × (1 − p)^(n − k), where C(n, k) is the number of ways to
choose k trials out of n. The mean is equal to np and the standard deviation is equal
to √(np(1 − p)); for large n the distribution is approximately bell-shaped. The
following conditions need to be satisfied for the experiment to be termed a binomial
experiment:
A. Fixed number of n trials.
B. Each trial is independent.
C. Only two outcomes are possible (Success and Failure).
D. The probability of success (p) for each trial is constant.
E. A random variable Y = the number of successes.
This type of probability calculation is important in many fields, from insurance and
finance to manufacturing and quality control. In each case, understanding the
binomial probability distribution can help decision-makers make better choices and
plan for different outcomes.
Here are some examples of binomial distribution:
• For a coin tossed N times, binomial distribution can be used to model the
probability of the number of successes (say, heads). For example, for the coin
toss 10 times, the binomial distribution could be used to model the probability
of a number of heads (0 to 10).
• Here is the sample binomial distribution plot created with different values of n
and p
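For the 10-toss coin example, the binomial probabilities can be computed with the standard formula P(Y = k) = C(n, k) p^k (1 − p)^(n − k). The sketch below is illustrative only; the function name `binomial_pmf` is an assumed name, not something from the notes.

```python
from math import comb, sqrt


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5  # 10 tosses of a fair coin, success = heads

# Probability of 0, 1, ..., 10 heads.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k, prob in enumerate(pmf):
    print(k, round(prob, 4))

mean = n * p                     # np
std_dev = sqrt(n * p * (1 - p))  # square root of np(1 - p)
print(mean, round(std_dev, 4))
```

With n = 10 and p = 0.5, the most likely count is 5 heads, and the probability of heads on all 10 tosses is 0.5 raised to the power 10.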


Normal distribution: A type of continuous probability distribution for a real-valued
random variable. It is a type of symmetric distribution where most of the observations
cluster around the central peak and the probabilities for values further away from the
mean taper off equally in both directions. It is represented using a bell-shaped density
curve described by its mean and standard deviation. It is also known as the Gaussian
distribution. It has the following features:
• Symmetric bell shape
• Mean and median are equal; both located at the center of the distribution
• About 68% of the data falls within 1 standard deviation of the mean
• About 95% of the data falls within 2 standard deviations of the mean
• About 99.7% of the data falls within 3 standard deviations of the mean
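The 68%, 95% and 99.7% figures can be checked numerically. This minimal sketch uses `NormalDist` from Python's standard `statistics` module with a mean of 0 and standard deviation of 1; the helper name `within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def within(k):
    """Probability of falling within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(within(k), 4))
```

The printed values are close to 0.68, 0.95 and 0.997, matching the features listed above.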

Here is a sample normal distribution curve:

Poisson distribution: A Poisson distribution is a discrete probability distribution that
shows how many times an event is likely to occur within a fixed interval of time or
space if these events occur with a known average rate and independently of the time
since the last event. It is used for independent events that occur at a constant rate
within a given interval of time. Note that Poisson distribution is associated with both
time and space. Another key point is that events need to be independent of each other.
When to use Poisson distribution? Poisson distribution is used for finding the
probability of a number of events in a time period or finding the probability of
waiting some time until the next event.
Here are some examples:
• Customers calling a help center: On average, there are, say, 10 customers
who call in an hour. Thus, Poisson distribution can be used to model the
probability of a different number of customers calling within an hour (say, 5 or
6 or 7 or 8 or 9 or 11 customers, etc). The diagram below represents
• No. of visitors to a website: On average, there are 500 visitors to a website
every day. Poisson distribution can be used to estimate the number of visitors
every day.
• Radioactive decay in atoms
• Photons arriving at a space telescope
• Movements in a stock price
• Number of trees in a given acre of land

Here is a sample diagram representing the probability distribution for a given
lambda (the average rate at which events occur)
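For the help-center example with an average of 10 calls per hour, the Poisson probabilities can be computed from the standard formula P(X = k) = e^(−λ) λ^k / k!. This is a minimal sketch; the function name `poisson_pmf` and the variable `lam` are illustrative names.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return exp(-lam) * lam**k / factorial(k)


lam = 10  # average number of customers calling per hour

# Probability of 5, 6, ..., 11 customers calling within an hour.
for k in (5, 6, 7, 8, 9, 10, 11):
    print(k, round(poisson_pmf(k, lam), 4))
```

The probabilities peak around k = 9 and k = 10 and fall away on either side of the average rate.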


Module-V
Sampling distributions

Q1. Explain Sampling Distribution

A sampling distribution, in the field of statistics, is a type of probability distribution
in which a statistic is calculated from random samples drawn repeatedly from a given
population. It is the distribution of that statistic across the samples, and it is used to
analyze data in numerous fields.

Although individual samples may deviate from the population's mean value, the
frequency distribution of the sample means often approximates a normal distribution,
with most sample means close to the population's mean value.

Understanding the concept

Ever heard about probability? Ever conjectured the possibility of an event? In
statistics, probability is a major concept that has things like these covered. In this
section, we are going to understand Sampling Distribution, a concept of probability
distribution in statistics.

A sampling distribution describes how a statistic varies across a large number of
samples obtained repeatedly from a specific group of subjects. In statistics, probability
is used for calculating the likely occurrence of a phenomenon.

This is done by collecting samples from populations. While samples (the observed
values) are the main focus in this case, populations (subjects) help us to procure them,
and thus, both samples and populations are considered to be equally essential.

A lot of data that is collected over time is included in studies that aim to calculate the
probabilities of an event. This data is collected with utmost precision and care so that
it leads to an effective result and does not hamper the statistics involved.

Sampling Distribution can be concerned with almost any subject. Be it the weight of
population or traits of animals, sampling distribution can cover almost anything and
everything. Another dimension of this concept is the binomial distribution.


The binomial distribution is defined as the calculation of the probability of success in
a given population. For this, the population size is required to be large (almost 20
times the size of a sample) so that the probability of success can be estimated by
using a large number of samples.

Defined as a concept that focuses on a statistic of sample statistics, sampling
distribution involves more than one statistical value of a sample. Let us understand
this with the help of a sampling distribution example.

Example of Sampling Distribution

Suppose a researcher wishes to identify the average age of babies when they begin to
walk. Instead of keeping a track of all the babies around the world, the researcher will
select a total of 500 babies.

The number of babies constitutes the population for this particular research. Now, the
researcher will identify the age of babies when they begin to walk. Let us assume that
25% of the babies began to walk at the age of 1.5 years old. Another 30% of the
babies began to walk at the age of 2 years old.

This way, the researcher will calculate the actual mean of the sampling distribution of
babies by picking a handful of samples. The sample mean (average of a sample) will
be further calculated along with other sample means obtained from the same
population.

This is how Sampling Distribution is calculated. In sampling distribution, the
standard deviation of the sampling distribution is regarded as the standard error that
keeps decreasing with the increase of the sample size.

Q2. How to create your own sampling distribution?

Here is a step-by-step guide for you to create your own sampling distribution. Let's
get started!

1. Choose a population and sample for this experiment.
2. Select a sample randomly out of the given population.
3. Calculate the sample mean of this group.
4. Repeat the above steps to obtain a number of sample means from the same population.
5. Generate a frequency distribution: plot the sample means on a graph sheet or tabulate
the data. The final graph shows your sampling distribution.
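
The steps above can be sketched as a short simulation. This is only an illustration in Python using the standard library. The population (10,000 hypothetical measurements with mean 50 and standard deviation 10), the sample size of 25 and the 2,000 repetitions are assumptions chosen for the example, not values from these notes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Step 1: a hypothetical population of 10,000 measurements (assumed values).
population = [random.gauss(50, 10) for _ in range(10_000)]


def sampling_distribution(pop, sample_size, n_samples):
    """Steps 2-4: draw random samples repeatedly and collect their means."""
    means = []
    for _ in range(n_samples):
        sample = random.sample(pop, sample_size)  # random sample, no replacement
        means.append(statistics.mean(sample))     # sample mean of this group
    return means


sample_means = sampling_distribution(population, sample_size=25, n_samples=2_000)

# Step 5: tabulate the sample means into a frequency distribution (1-unit bins).
frequency = {}
for m in sample_means:
    bin_start = int(m // 1)
    frequency[bin_start] = frequency.get(bin_start, 0) + 1

pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)
standard_error = statistics.stdev(sample_means)  # SD of the sample means

print("population mean:", round(pop_mean, 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("observed standard error:", round(standard_error, 2))
print("sigma / sqrt(n):", round(pop_sd / 25 ** 0.5, 2))
for bin_start in sorted(frequency):
    print(bin_start, "#" * (frequency[bin_start] // 20))
```

The observed standard error should land close to the population standard deviation divided by the square root of the sample size, and the printed histogram should look roughly bell-shaped.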

Q3. Explain the Significance of Sampling Distribution

The primary purpose of Sampling Distribution is to establish representative results of
small samples of a comparatively larger population.

This helps researchers and analysts to dig deep into the population, get a closer look
into small groups of the population, and create generalized results based on the same.
The significance of sampling distribution is immense in the field of statistics.

1. Firstly, the concept of sampling distribution provides accuracy. For any
population being studied, a single sample may not be representative. A sampling
distribution draws on many samples and their sample means, which gives a more
inclusive and reliable result.

2. Secondly, the repeated collection of samples from the same set of subjects
leads to consistency. What's more, the standard error allows a researcher to
quantify how far sample means deviate from the true mean and thus to judge
whether the estimate is unbiased.

3. Thirdly, the variability of the sampling distribution is significant because
it reflects the inclusion of numerous samples from the same set of subjects.
With enough samples, the graph of sample means becomes almost symmetric. The
variability also shows how much the statistic changes from sample to sample.

Q4. Briefly describe the different Types of Sampling Distribution

Having covered what a Sampling Distribution is, we will now look at the various types of
Sampling Distribution in statistics. There are 3 types of Sampling Distribution.

1. Sampling Distribution of Mean


The first type of sampling distribution is that of the mean. Here the mean of each sample
is calculated, and the distribution of these sample means is the sampling distribution.

The mean of all the sample means reflects the mean of the whole population. As the
sample size increases, the standard error decreases and the distribution of sample
means approaches a normal frequency distribution, or a bell-shaped curve on the graph.

2. Sampling Distribution of Proportion

In the second type of Sampling Distribution, the proportion of items that have a certain
attribute is calculated in each sample. The distribution of these sample proportions is
the sampling distribution of proportion.

As the proportion of a population is defined as the part of the population that possesses
a certain attribute, the mean of all sample proportions estimates the proportion in the
whole population.

3. T-Distribution

The third type, the t-distribution, is used when the sample size is small and the
population standard deviation is unknown. It is bell-shaped and symmetric like the
normal distribution but has heavier tails.

Most sample values in this distribution lie close to the mean of the sampling
distribution. Only a handful of samples are far off from the mean value of the
whole population.

As the sample size grows, the t-distribution becomes almost identical to the normal
distribution. It is therefore mainly needed for small samples.

Related Terminologies

• Probability Distribution - In statistics, a probability distribution gives the
probabilities of the different possible outcomes of a random phenomenon. It is a
mathematical representation of how likely each outcome is. A Sampling Distribution
is a type of Probability Distribution.

• Frequency Distribution - Repeated sampling results in a frequency distribution,
which is either a graphical representation or a tabular representation of sample
outcomes obtained from a given population.

• Binomial Distribution - The Binomial Distribution describes the number of successes
in a fixed number of trials, where each outcome is labelled as either a failure or a
success. When sampling without replacement, it works best with a population that is
large relative to the sample.

• Data Distribution - In statistics, data distribution refers to the listing of the data
of a given population. It simply shows how the data are spread out, which helps in
further analysis. Some of the Data Distribution types are the Binomial Distribution
and the Normal Distribution.

• Standard Error - The standard error is the standard deviation of the sample mean
around the actual mean in the sampling distribution. It represents the accuracy of a
sample mean as compared to the actual mean. A frequency distribution of sample means
is helpful for standard error calculation.

Module-VII
Hypothesis Testing-t test, Chi Square, Z test

Q1. Explain the concept of Hypothesis Testing

A hypothesis is an initial idea or assumption that may be used to try and explain an observation or
make an argument for some action that requires testing to check its validity. In a hypothesis test,
there are generally two different ideas or assumptions that are being juxtaposed and tested against
each other. The goal of the hypothesis test is to determine whether the data provide enough
evidence to reject the null hypothesis in favour of the alternative. Often, one or more inferences
are made based on a data sample, and the validity of the inferences is unknown. Then, the inference
is tested against another inference or against a standard point of reference. This process of
testing the inference is known as hypothesis testing.

Typically, hypothesis testing utilizes two different types of hypothesis: the null hypothesis and the
alternative hypothesis. The null hypothesis represents the assumption that is made about the data
sample, whereas the alternative hypothesis represents a counterpoint. More often than not, the
alternative hypothesis takes the exact opposite point of view from the null hypothesis.

Hypothesis testing is used in statistics to learn about and understand different population groups.
Additionally, the results of hypothesis testing can sometimes be used to predict the likelihood of
future outcomes within the population group. By going through the process of testing a hypothesis,
scientists and mathematicians are able to determine the statistical validity of their inferences, which
helps them to learn about the world around us.

For example, suppose a TV station was attempting to cater their advertisements to a more relevant
age demographic. This particular station believes that the ideal target demographic for their
advertisements is 65-year-olds. To determine if this is valid, a hypothesis test can be performed. In
this case, the null hypothesis would be that the average age of the TV station's viewers is 65. The
alternative hypothesis would be that the average age of the TV station's viewers is not equal to 65.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics.
It is most often used by scientists to test specific predictions, called hypotheses, that arise from
theories.

Though the specific details might vary, the procedure you will use when testing a hypothesis will
always follow some version of these steps.

1. Step 1: State your null and alternate hypothesis
2. Step 2: Collect data
3. Step 3: Perform a statistical test
4. Step 4: Decide whether to reject or fail to reject your null hypothesis
5. Step 5: Present your findings

Step 1: State your null and alternate hypothesis

After developing your initial research hypothesis (the prediction that you want to investigate), it is
important to restate it as a null (H0) and alternate (Ha) hypothesis so that you can test it
mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between
variables. The null hypothesis is a prediction of no relationship between the variables you are
interested in.

Hypothesis testing example


You want to test whether there is a relationship between gender and height. Based on your
knowledge of human physiology, you formulate a hypothesis that men are, on average, taller than
women. To test this hypothesis, you restate it as:

• H0: Men are, on average, not taller than women.
• Ha: Men are, on average, taller than women.

Step 2: Collect data

For a statistical test to be valid, it is important to perform sampling and collect data in a way that is
designed to test your hypothesis. If your data are not representative, then you cannot make
statistical inferences about the population you are interested in.

Hypothesis testing example


To test differences in average height between men and women, your sample should have an equal
proportion of men and women, and cover a variety of socio-economic classes and any other control
variables that might influence average height.

You should also consider your scope (worldwide, or for one country?). A potential data source in
this case might be census data, since it includes data from a variety of regions and social classes
and is available for many countries around the world.

Step 3: Perform a statistical test

There are a variety of statistical tests available, but they are all based on the comparison of within-
group variance (how spread out the data is within a category) versus between-group variance (how
different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then
your statistical test will reflect that by showing a low p-value. This means it is unlikely that the
differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your
statistical test will reflect that with a high p-value. This means it is likely that any difference you
measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of
your collected data.


Hypothesis testing example


Based on the type of data you collected, you perform a one-tailed t-test to test whether men are in
fact taller than women. This test gives you:

• an estimate of the difference in average height between the two groups.
• a p-value showing how likely you are to see this difference if the null hypothesis of no
difference is true.

Your t-test shows an average height of 175.4 cm for men and an average height of 161.7 cm for
women, with an estimate of the true difference ranging from 10.2 cm to infinity. The p-value is
0.002.

Step 4: Decide whether to reject or fail to reject your null hypothesis

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to
reject your null hypothesis.

In most cases you will use the p-value generated by your statistical test to guide your decision. And
in most cases, your predetermined level of significance for rejecting the null hypothesis will be
0.05, that is, when there is a less than 5% chance that you would see these results if the null
hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This
minimizes the risk of incorrectly rejecting the null hypothesis (Type I error).

Hypothesis testing example


In your analysis of the difference in average height between men and women, you find that the
p-value of 0.002 is below your cutoff of 0.05, so you decide to reject your null hypothesis of no
difference.

Step 5: Present your findings

The results of hypothesis testing will be presented in the results and discussion sections of your
research paper, dissertation or thesis.

In the results section you should give a brief summary of the data and a summary of the results of
your statistical test (for example, the estimated difference between group means and associated p-
value). In the discussion, you can discuss whether your initial hypothesis was supported by your
results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null
hypothesis. You will probably be asked to do this in your statistics assignments.

Stating results in a statistics assignment


In our comparison of mean height between men and women we found an average difference of 13.7
cm and a p-value of 0.002; therefore, we can reject the null hypothesis that men are not taller than
women and conclude that there is likely a difference in height between men and women.

However, when presenting research results in academic papers we rarely talk this way. Instead, we
go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than
women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as "supported the alternate
hypothesis."

Stating results in a research paper

We found a difference in average height between men and women of 13.7 cm, with a p-value of
0.002, consistent with our hypothesis that there is a difference in height between men and women.

These are superficial differences; you can see that they mean the same thing.

You might notice that we don't say that we reject or fail to reject the alternate hypothesis. This is
because hypothesis testing is not designed to prove or disprove anything. It is only designed to test
whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern
arose by chance), then we can say our test lends support to our hypothesis. But if the pattern does
not pass our decision rule, meaning that it could have arisen by chance, then we say the test is
inconsistent with our hypothesis.

A t-test is an inferential statistical test used in hypothesis testing. Its test statistic
follows a t-distribution under the null hypothesis. It is used to determine whether the
means of two groups are significantly different, even when the groups may be related in
certain features/characteristics. The t-test helps us estimate the difference between the
averages of two sets of data and judge how plausible it is that they come from the same
population. For example, if we take samples of students from two different schools, we
should not expect the means and standard deviations to be exactly the same. There will be
a slight difference between them, and the t-test tells us whether that difference is
larger than chance alone would produce.

Types of T-tests

There are three types of t-tests. Let us understand them under the circumstances in which
they are used:
• If the groups come from one single population (like measuring before
and after an experimental treatment), then we perform a paired t-test.
• If the groups under consideration come from two different populations (like
two different species, or people from two separate cities), then we
perform a two-sample t-test (or independent t-test).
• If there is one group being compared against a standard value (like comparing
the acidity of a liquid to a neutral pH of 7), then we perform a one-sample t-test.

Uses of t-tests
It is used to determine whether two sets of data are significantly different from each
other.
It is used to evaluate whether the means of the two groups of data are statistically
different from each other.

Calculation of t-test
For calculating a t-test, we require three key data values, which are
• The difference between the average values of the two data sets, known as the mean
difference.
• Standard deviation of each group.
• Number of data values in each group.
The result of the t-test is the t-value. This value is compared against the corresponding
value in the t-distribution table. The t-test helps us determine whether the
difference is a true difference or an arbitrary, negligible difference.

Necessary conditions for application of t-test
• The sample size should be small.
• Statistic follows a normal distribution.
• The scaling term (the population standard deviation) is unknown and is estimated from
the sample.
• Comparison is only between two groups.

T-Test formula
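The formula for the two-sample t-test (a.k.a. Student's t-test) is shown below

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} $$

In this formula, $t$ is the t-value, $\bar{x}_1$ and $\bar{x}_2$ are the means of the two
groups being compared, $s^2$ is the pooled variance of the two groups (the whole
denominator is the pooled standard error), and $n_1$ and $n_2$ are the numbers of
observations in the first and second group, respectively.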

A greater t-value indicates that the difference between means is greater than the
pooled standard error, which suggests a significant difference between the groups.

The calculated t-value can be compared against the values in a critical value chart to
determine whether your t-value is greater than what would be expected by chance. If
so, the null hypothesis can be rejected and it can be concluded that the two groups are
in fact different.
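
As a rough illustration of the calculation described above, the following Python sketch computes a pooled two-sample t-value using only the standard library. The function name `pooled_t_test` and the height values are hypothetical and chosen for the example. The critical value 1.943 is the standard one-tailed t-table entry for a significance level of 0.05 and 6 degrees of freedom.

```python
import math
import statistics


def pooled_t_test(group1, group2):
    """Return (t_value, degrees_of_freedom) for a pooled two-sample t-test."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    # Pooled variance s^2: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t_value = (mean1 - mean2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_value, n1 + n2 - 2


# Hypothetical heights in cm (illustrative only, not data from these notes).
men = [170, 174, 178, 182]
women = [160, 162, 164, 166]

t_value, df = pooled_t_test(men, women)
critical_value = 1.943  # one-tailed, alpha = 0.05, df = 6, from a t-table

print(f"t = {t_value:.3f}, df = {df}")
print("reject H0" if t_value > critical_value else "fail to reject H0")
```

Here the t-value of about 4.5 exceeds the critical value, so the null hypothesis would be rejected for these made-up numbers.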

Things to remember
• A t-test is a statistical test used to check whether there is a significant difference
between the averages of two groups.
• The t-test is used when the objective is hypothesis testing in statistics.
• The formula used for the two-sample t-test is
$t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s^2 (1/n_1 + 1/n_2)}$.
• The three key data values needed are the mean difference, the standard deviation, and
the number of data values of each group.
• A greater t-value represents a significant difference between two groups.
• A smaller t-value indicates that the difference between the two groups is negligible.

What Is a Chi-Square Test?

The Chi-Square test is a statistical procedure for determining the difference between
observed and expected data. It can also be used to determine whether two categorical
variables in our data are related. It helps to find out whether an observed association
between two categorical variables is due to chance or to a relationship between them.

Chi-Square Test Definition

A chi-square test is a statistical test that is used to compare observed and expected
results. The goal of this test is to identify whether a disparity between actual and
predicted data is due to chance or to a link between the variables under consideration.
As a result, the chi-square test is an ideal choice for aiding in our understanding and
interpretation of the connection between our two categorical variables.

A chi-square test or comparable nonparametric test is required to test a hypothesis
regarding the distribution of a categorical variable. Categorical variables, which
indicate categories such as animals or countries, can be nominal or ordinal. They
cannot have a normal distribution since they can only have a few particular values.

For example, a meal delivery firm in India wants to investigate the link between
gender, geography, and people's food preferences.

The test is used to decide whether a difference between two categorical variables arises:

• As a result of chance or

• Because of the relationship

Formula For Chi-Square Test

$$ \chi^2_c = \sum \frac{(O - E)^2}{E} $$

Where

c = Degrees of freedom

O = Observed Value

E = Expected Value

The degrees of freedom in a statistical calculation represent the number of values
that are free to vary in a calculation. The degrees of freedom must be calculated to
ensure that chi-square tests are statistically valid. These tests are frequently used to
compare observed data with data that would be expected to be obtained if a particular
hypothesis were true.

The observed values are those you gather yourself.

The expected values are the frequencies expected, based on the null hypothesis.

Why Do You Use the Chi-Square Test?

Chi-square is a statistical test that examines the differences between categorical
variables from a random sample in order to determine whether the expected and
observed results fit well.

Here are some of the uses of the Chi-Squared test:

• The Chi-squared test can be used to see if your data follows a well-known
theoretical probability distribution like the Normal or Poisson distribution.

• The Chi-squared test allows you to assess your trained regression model's
goodness of fit on the training, validation, and test data sets.

What Does A Chi-Square Statistic Test Tell You?

A Chi-Square test (symbolically represented as χ²) is fundamentally a data analysis
based on the observations of a random set of variables. It computes how well a model
matches actual observed data. A Chi-Square test statistic is calculated from data that
must be raw, random, drawn from independent variables, drawn from a sufficiently large
sample, and mutually exclusive. In simple terms, two sets of statistical data are
compared, for instance the observed and expected results of tossing a fair coin. Karl
Pearson introduced this test in 1900 for categorical data analysis and distribution.
This test is also known as 'Pearson's Chi-Squared Test'.

Chi-Squared Tests are most commonly used in hypothesis testing. A hypothesis is an
assumption that a given condition might be true, which can be tested afterwards.
The Chi-Square test estimates the size of the inconsistency between the expected results
and the actual results, given the size of the sample and the number of variables in the
relationship.

These tests use degrees of freedom to determine whether a particular null hypothesis can
be rejected based on the total number of observations made in the experiments. The
larger the sample size, the more reliable the result.

There are two main types of Chi-Square tests, namely:

1. Independence

2. Goodness-of-Fit

Independence

The Chi-Square Test of Independence is an inferential statistical test which examines
whether two variables are likely to be related to each other or not. This test is used
when we have counts of values for two nominal or categorical variables, and it is
considered a non-parametric test. A relatively large sample size and independence of
observations are the required criteria for conducting this test.

For Example-

In a movie theatre, suppose we made a list of movie genres. Let us consider this as
the first variable. The second variable is whether or not the people who came to watch
those genres of movies bought snacks at the theatre. Here the null hypothesis is
that the genre of the film and whether people bought snacks are unrelated. If
this is true, the movie genres don't impact snack sales.

Goodness-Of-Fit

In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines
whether a variable is likely to come from a given distribution or not. We must have a
set of data values and an idea of the distribution of this data. We can use this test
when we have value counts for categorical variables. This test provides a way of
deciding whether the data values have a "good enough" fit to our idea, or whether the
sample is representative of the entire population.

For Example-

Suppose we have bags of balls with five different colours in each bag. The given
condition is that the bag should contain an equal number of balls of each colour. The
idea we would like to test here is that the proportions of the five colours of balls in
each bag are equal.
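
The ball example can be sketched in a few lines of Python. The colour counts below are hypothetical, and the function name `chi_square_goodness_of_fit` is introduced only for illustration. The critical value 9.488 is the standard chi-square table entry for a significance level of 0.05 and 4 degrees of freedom.

```python
def chi_square_goodness_of_fit(observed, expected_proportions):
    """Return (chi-square statistic, degrees of freedom)."""
    total = sum(observed)
    expected = [p * total for p in expected_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


# Hypothetical colour counts for one bag of 100 balls (illustrative only).
observed_counts = [22, 18, 20, 25, 15]
equal_proportions = [1 / 5] * 5  # H0: every colour is equally common

statistic, df = chi_square_goodness_of_fit(observed_counts, equal_proportions)
critical_value = 9.488  # alpha = 0.05, df = 4, from a chi-square table

print(f"chi-square = {statistic:.2f}, df = {df}")
if statistic > critical_value:
    print("reject H0: the colours are not equally common")
else:
    print("fail to reject H0: the counts fit equal proportions well enough")
```

With these made-up counts the statistic is well below the critical value, so the equal-proportions idea is not rejected.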

Who Uses Chi-Square Analysis?

Chi-square is most commonly used by researchers who are studying survey response
data because it applies to categorical variables. Demography, consumer and
marketing research, political science, and economics are all examples of this type of
research.

Example

Let's say you want to know if gender has anything to do with political party
preference. You poll 440 voters in a simple random sample to find out which political
party they prefer. The results of the survey are shown in the table below:

To see if gender is linked to political party preference, perform a Chi-Square test of
independence using the steps below.

Step 1: Define the Hypothesis

H0: There is no link between gender and political party preference.

H1: There is a link between gender and political party preference.

Step 2: Calculate the Expected Values

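Now you will calculate the expected frequency of each cell from its row total, its column
total, and the grand total:

$$ E = \frac{\text{row total} \times \text{column total}}{\text{grand total}} $$

For example, the expected value for Male Republicans is the Male row total multiplied by
the Republican column total, divided by the 440 voters polled.

Similarly, you can calculate the expected value for each of the cells.

Step 3: Calculate (O-E)^2 / E for Each Cell in the Table

Now you will calculate the (O - E)^2 / E for each cell in the table.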

Where

O = Observed Value

E = Expected Value

Step 4: Calculate the Test Statistic X^2

X^2 is the sum of all the values in the last table

= 0.743 + 2.05 + 2.33 + 3.33 + 0.384 + 1

= 9.837

Before you can conclude, you must first determine the critical statistic, which
requires determining the degrees of freedom. The degrees of freedom in this case are
equal to the table's number of rows minus one multiplied by the table's number of
columns minus one, or (r-1)(c-1). We have (3-1)(2-1) = 2.

Finally, you compare the obtained statistic to the critical statistic found in the
chi-square table. For an alpha level of 0.05 and two degrees of freedom, the critical
statistic is 5.991, which is less than the obtained statistic of 9.837. You can
reject the null hypothesis because the obtained statistic is higher than the critical
statistic.

This means you have sufficient evidence to say that there is an association between
gender and political party preference.
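
The four steps can also be carried out programmatically. The sketch below is a minimal Python version. It uses a hypothetical 2 x 3 table, not the survey table from the example, and the function name `chi_square_independence` is an assumption made for illustration. For exactly two degrees of freedom the upper-tail probability of the chi-square distribution has the closed form exp(-x/2), which is used here to obtain a p-value without external libraries.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count for each cell = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected


# Hypothetical 2 x 3 table of counts (illustrative only, not the survey data).
observed = [
    [30, 20, 10],
    [20, 20, 20],
]

statistic, df, expected = chi_square_independence(observed)
critical_value = 5.991               # alpha = 0.05, df = 2 (value quoted in the text)
p_value = math.exp(-statistic / 2)   # exact upper-tail probability, valid only for df = 2

print(f"chi-square = {statistic:.3f}, df = {df}, p = {p_value:.4f}")
print("reject H0" if statistic > critical_value else "fail to reject H0")
```

For this made-up table the statistic is about 5.33, just under the critical value, so the null hypothesis of no association is not rejected. Both decision rules (statistic against critical value, p-value against alpha) agree.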

When to Use a Chi-Square Test?

A Chi-Square Test is used to examine whether the observed results are in line with
the expected values. When the data to be analysed is from a random sample, and
when the variable in question is a categorical variable, then Chi-Square proves the
most appropriate test. A categorical variable consists of selections such
as breeds of dogs, types of cars, genres of movies, educational attainment, male v/s
female etc. Survey responses and questionnaires are the primary sources of these
types of data, and the Chi-square test is most commonly used for analysing them.
This type of analysis is helpful for researchers who are studying survey response
data. The research can range from customer and marketing research to political
science and economics.

Chi-Square Distribution

Chi-square distributions (X^2) are a type of continuous probability distribution.
They're commonly utilized in hypothesis testing, such as the chi-square goodness of
fit and independence tests. The parameter k, which represents the degrees of freedom,
determines the shape of a chi-square distribution.

Very few real-world observations follow a chi-square distribution. The
objective of chi-square distributions is to test hypotheses, not to describe real-world
distributions. In contrast, most other commonly used distributions, such as normal
and Poisson distributions, can describe things like baby birth weights or
illness cases per year.

Because of their close relationship to the standard normal distribution, chi-square
distributions are well suited to hypothesis testing. Many essential statistical tests
rely on the standard normal distribution.

In statistical analysis, the Chi-Square distribution is used in many hypothesis tests
and is determined by the parameter k, the degrees of freedom. It belongs to the family
of continuous probability distributions. The sum of the squares of k independent
standard normal random variables follows the Chi-Squared distribution. Pearson's
Chi-Square Test formula is -

$$ X^2 = \sum \frac{(O - E)^2}{E} $$

Where X^2 is the Chi-Square test symbol

Σ is the summation over observations

O is the observed results

E is the expected results

The shape of the distribution graph changes with the increase in the value of k, i.e.
the degrees of freedom.

When k is 1 or 2, the Chi-square distribution curve is shaped like a backwards 'J'. It
means there is a high chance that X^2 is close to zero.

When k is greater than 2, the distribution curve looks like a hump, and there is a
low probability that X^2 is very near to 0 or very far from 0. The tail of the
distribution is much longer on the right-hand side than on the left-hand side. The
most probable value of X^2 is (k - 2).

When k is greater than ninety, a normal distribution closely approximates the
Chi-square distribution.

Chi-Square P-Values

Here P denotes probability. The Chi-Square test statistic is converted into a p-value,
and different p-values lead to different conclusions about the hypothesis.

1. P <= 0.05 (the null hypothesis is rejected)

2. P > 0.05 (the null hypothesis is not rejected)

The concepts of probability and statistics are entangled with the Chi-Square Test.
Probability is the estimation of how likely something is to happen. Simply put, it
is the chance of an event or outcome occurring in the sample. Probability can
represent bulky or complicated data in an understandable way. Statistics involves
collecting, organising, analysing, interpreting and presenting the data.

Finding P-Value

When you run any of the Chi-square tests, you'll get a test statistic called X^2. You have
two options for determining whether this test statistic is statistically significant at
some alpha level:

1. Compare the test statistic X^2 to a critical value from the Chi-square
distribution table.

2. Compare the p-value of the test statistic X^2 to a chosen alpha level.

Test statistics are calculated by taking into account the sampling distribution of the
test statistic under the null hypothesis, the sample data, and the approach which is
chosen for performing the test.

The p-value will be as mentioned in the following cases.

• A lower-tailed test is specified by: p-value = P(TS ≤ ts | H0 is true) = cdf(ts)

• An upper-tailed test is specified by: p-value = P(TS ≥ ts | H0 is true) = 1 - cdf(ts)

• A two-sided test is defined as follows, if we assume that the distribution of the test
statistic under H0 is symmetric about 0: p-value = 2 * P(TS ≥ |ts| | H0 is true) =
2 * (1 - cdf(|ts|))

Where:

P: Probability of an event

TS: Test statistic

ts: Observed value of the test statistic computed from your sample

cdf(): Cumulative distribution function of the test statistic's distribution (TS)
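
The tail definitions above translate directly into code. The sketch below assumes, purely for illustration, that the test statistic follows a standard normal distribution under H0 (a z-test), so that `statistics.NormalDist` from the Python standard library can supply the cdf. The function names and the observed value 1.96 are hypothetical.

```python
from statistics import NormalDist

# Assumption: the test statistic is standard normal under H0 (a z-test).
cdf = NormalDist(mu=0, sigma=1).cdf


def lower_tailed_p(ts):
    """P(TS <= ts | H0 is true)."""
    return cdf(ts)


def upper_tailed_p(ts):
    """P(TS >= ts | H0 is true)."""
    return 1 - cdf(ts)


def two_sided_p(ts):
    """2 * P(TS >= |ts| | H0 is true); requires a distribution symmetric about 0."""
    return 2 * (1 - cdf(abs(ts)))


ts = 1.96  # hypothetical observed test statistic
print("lower-tailed p:", round(lower_tailed_p(ts), 4))
print("upper-tailed p:", round(upper_tailed_p(ts), 4))
print("two-sided p:", round(two_sided_p(ts), 4))
```

The chi-square tests in these notes use the upper tail, because large values of the statistic indicate a poor fit to the null hypothesis.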


Types of Chi-square Tests

Pearson's chi-square tests are classified into two types:

1. Chi-square goodness-of-fit test

2. Chi-square independence test

These are, mathematically, the same test. However, because they are utilized for
distinct goals, we generally conceive of them as separate tests.

Properties

The chi-square distribution has the following significant properties:

1. The variance is equal to two times the number of degrees of freedom.

2. The mean of the distribution is equal to the number of degrees of freedom.

3. The chi-square distribution curve approaches the normal distribution as the
degrees of freedom increase.


Limitations of Chi-Square Test

There are two limitations to using the chi-square test that you should be aware of.

• The chi-square test, for starters, is extremely sensitive to sample size. Even
insignificant relationships can appear statistically significant when a large
enough sample is used. Keep in mind that "statistically significant" does not
always imply "meaningful" when using the chi-square test.

• Be mindful that the chi-square test can only determine whether two variables are
related. It does not necessarily follow that one variable has a causal
relationship with the other. It would require a more detailed analysis to
establish causality.

Chi-Square Goodness of Fit Test

When there is only one categorical variable, the chi-square goodness of fit test can be
used. The observed frequency distribution of the categorical variable is compared
with the distribution you expected, to determine whether the two differ significantly.
A common expectation is that the categories have equal proportions, but any expected
proportions can be tested.
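As a rough illustration of the calculation, the sketch below compares hypothetical observed counts in three categories with equal expected proportions. The statistic is the sum of (observed - expected)² / expected, with (number of categories - 1) degrees of freedom. The counts and the function name are made up for this example; 5.991 is the tabulated chi-square critical value for 2 degrees of freedom at alpha = 0.05.

```python
def chi_square_gof(observed, expected=None):
    """Return (statistic, degrees_of_freedom) for a goodness-of-fit test.

    If `expected` is omitted, equal proportions across categories are assumed.
    """
    total = sum(observed)
    if expected is None:
        expected = [total / len(observed)] * len(observed)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


# Hypothetical counts of customers choosing each of three products
observed_counts = [50, 30, 20]
statistic, dof = chi_square_gof(observed_counts)

CRITICAL_VALUE = 5.991  # chi-square table, df = 2, alpha = 0.05
print(f"chi-square = {statistic:.2f}, df = {dof}")
print("reject H0" if statistic > CRITICAL_VALUE else "fail to reject H0")
```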


Module-VIII
Anova -One Way, Two Way

ANOVA
The five steps to ANOVA are:
1. Formulate a hypothesis
2. Set a significance level
3. Compute an F-statistic
4. Use the F-statistic to derive a p-value
5. Compare the p-value and significance level to decide whether or not to reject the
null hypothesis
1. Formulate a Hypothesis
As with nearly all statistical significance tests, ANOVA starts with formulating a null
and alternative hypothesis. For this example, the hypotheses are as follows:
Null Hypothesis (H0): There is no difference in the average price of wine between the
three countries; they are all the same.
Alternative Hypothesis (H1): The average price of wine is not the same between the
three countries.
Note, this is an omnibus test, meaning if we are able to reject the null hypothesis it
will tell us that a statistically significant difference exists somewhere between these
countries, but it won't tell us where it is.

2. Set a Significance Level

The significance level, or alpha, is the probability of rejecting our null hypothesis
when it actually holds true. In other terms, it's the probability of making a Type I
error.
Typically, one should weigh the costs of making a Type I vs. a Type II error to
determine the best alpha for an experiment, but for this toy example I'm just going to
use the standard .05 for our α value.

3. Compute an F-Statistic
The F-statistic is simply the ratio of the variance between sample means to the
variance within samples. For this ANOVA test, we'll be looking at how far each
country's average wine price is from the overall average price, and dividing that by
how much variation in price there is within each country's sample distribution. The
F-statistic formula is below, which may look complicated until we break it down.

$$F = \frac{SSB / DoFB}{SSW / DoFW}$$


SSB = Sum of squares between groups. This is the summation of the squared
difference between each group's mean and the overall mean times the number of
elements per group. For this example, we take the mean of each country's wine price,
subtract it from the overall mean, square the difference and multiply by 1,000 (the
number of data points per country).

SSW = Sum of squares within groups. This is the summation of the squared
difference between the group mean and each value in the group. For France, we
would take the mean price of French wine, then subtract and square the difference for
each bottle of French wine of the thousand data points in that group.

DoFB = Degrees of freedom between groups, simply the number of groups minus 1.
We have three different countries we are comparing, so the degrees of freedom here is
2.

DoFW = Degrees of freedom within groups, simply the number of data points
minus the number of groups. We have 3,000 data points and three different countries,
so this is 2,997 for this example.

Dividing the sum of squares for a group by its degrees of freedom yields the mean
squares for that group, and the F-statistic is just a ratio of the mean squares between
over the mean squares within.

Below I manually calculate these values in Python, and end up with an F-Statistic of
~4.07.
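A minimal sketch of this manual calculation follows. It uses a handful of made-up prices rather than the 3,000-bottle data set, so its F-statistic will not match the ~4.07 quoted above. The function name and the data are illustrative assumptions.

```python
from statistics import mean


def f_statistic(groups):
    """Return (F, dof_between, dof_within) for a list of samples."""
    all_values = [x for group in groups for x in group]
    overall_mean = mean(all_values)

    ssb = 0.0  # sum of squares between groups
    ssw = 0.0  # sum of squares within groups
    for group in groups:
        group_mean = mean(group)
        ssb += len(group) * (group_mean - overall_mean) ** 2
        ssw += sum((x - group_mean) ** 2 for x in group)

    dof_between = len(groups) - 1
    dof_within = len(all_values) - len(groups)
    f_value = (ssb / dof_between) / (ssw / dof_within)
    return f_value, dof_between, dof_within


# Hypothetical wine prices, three bottles per country
prices = {
    "France": [10, 11, 12],
    "Italy": [11, 12, 13],
    "Spain": [14, 15, 16],
}
f_value, dof_b, dof_w = f_statistic(list(prices.values()))
print(f"F = {f_value:.2f} with ({dof_b}, {dof_w}) degrees of freedom")
```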


4. Using the F-Statistic, compute a p-value

Once we have our F-statistic, we plug it into an F-distribution to get a p-value. You
can find a table for these values in the back of any statistics book, or there are far
easier online calculators that will do this for you. With our specific degrees of
freedom (2 and 2,997), an F-statistic of 4.07 yields a p-value of .0172.
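As a cross-check that needs no table: when the between-groups degrees of freedom equal 2 (three groups, as here), the upper-tail probability of the F distribution has the closed form P(F > f) = (1 + 2f / DoFW)^(-DoFW / 2). The sketch below applies it to the numbers in the text. The function name is hypothetical, and the shortcut does not hold for any other number of groups.

```python
def f_p_value_three_groups(f_value, dof_within):
    """Upper-tail p-value of an F statistic when dof_between == 2.

    Uses the closed form (1 + 2f/d2) ** (-d2/2), valid only for d1 = 2.
    """
    return (1 + 2 * f_value / dof_within) ** (-dof_within / 2)


p = f_p_value_three_groups(4.07, 2997)
alpha = 0.05
print(f"p-value = {p:.4f}")  # approximately 0.0172
print("reject H0" if p < alpha else "fail to reject H0")
```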

5. Compare the p-value and significance level to decide whether or not to reject the
null hypothesis
Our p-value signifies that, assuming the null hypothesis (all countries have the same
mean price of wine) is true, there is roughly a 1.7% chance of seeing differences
between the sample means at least as large as those in our data by sheer sampling
chance. By setting our significance level, or alpha, at 5% before all of this, we said
that we would be willing to accept a 5% chance of rejecting the null when it is true.
Since our p-value is below our pre-determined significance level, we can reject the
null hypothesis and say that there is a statistically significant difference in the mean
price of wine between countries.

Remember that ANOVA is an omnibus test, meaning that because we are able to reject
the null we know that a difference exists among the average wine prices between
countries, but not exactly where. To find where the difference lies, we would then
conduct hypothesis tests between two countries at a time.


How to Perform Analysis of Variance (ANOVA): Step By Step Procedure

In this lesson, we look at the steps to follow to perform Analysis of Variance
(ANOVA). ANOVA is used to analyze the difference in the means of different groups
(for 3 or more groups).

The procedure is made up of seven basic steps. After looking at the procedure,
we will apply it to a real problem.

The Seven Steps are

Step 1: Calculate the Means
Step 2: Set up the null and alternate hypotheses
Step 3: Calculate the Sum of Squares
Step 4: Calculate the Degrees of Freedom
Step 5: Calculate the Mean Squares
Step 6: Calculate the F Statistic
Step 7: Look up the statistical table and state your conclusion

Before we begin, take some time to examine Figure 1. This figure summarizes what
needs to be calculated to perform a one-way ANOVA.

Figure 1: Analysis of Variance Table


Step 1: Calculate all the means

You need to calculate the mean of each group in the question. Then you also
need to calculate the overall mean, with all the data combined as one single group.

Step 2: Set up the null and alternate hypothesis and the Alpha
The null hypothesis assumes that the group means do not differ. In other words, the
means are all the same.
The alternate hypothesis states that the means are not all the same, that is, at least
one mean differs from the others. So we can state them as follows:

H0: μ1 = μ2 = μ3
Ha: not all of μ1, μ2, μ3 are equal

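Step 3: Calculate the Sum of Squares

Calculate the Sum of Squares Total (SST): SStotal is the sum of squares (SS)
based on the entire set in all the groups. In this case, you treat all the data from all
the groups like one single combined set of data.

$$SS_{total} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2$$

Here x_ij is the i-th value in group j, n_j is the number of values in group j, k is the
number of groups, and x̄ is the overall mean.

Calculate the Sum of Squares Within Groups (SSW): This is the sum of squares
within each group. After calculating the sum of squares for each group, you add
them together for all the groups. That is why the sum symbol appears twice in the
formula.

$$SS_{within} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$$

where x̄_j is the mean of group j.

• The internal sum is for within the group

• The external sum is to sum the sums!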


Note: If you have calculated the first two sums of squares, you can go ahead and
calculate the third one from the first two values using the formula

SSbetween = SStotal - SSwithin

But for learning purposes, we will calculate the third one directly. So let's keep moving!

Calculate the Sum of Squares Between Groups (SSB): This is the sum of squares
with the groups taken as single elements, each group mean being weighted by the
number of values in that group.

Assuming there are three groups with n1, n2 and n3 values, you will have to do the following:
n1 × (group1_mean - total_mean)² + n2 × (group2_mean - total_mean)² +
n3 × (group3_mean - total_mean)²

Verify that
SStotal = SSbetween + SSwithin
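The identity can be checked numerically. The sketch below computes the three sums of squares separately for hypothetical groups of unequal size and confirms that the total equals between plus within. Note that the between-groups term must weight each squared difference by the group size for the identity to hold. All data and names are illustrative assumptions.

```python
from math import isclose
from statistics import mean

# Hypothetical measurements in three groups of different sizes
groups = [[2, 4], [1, 3, 5, 7], [6, 6, 9]]

all_values = [x for group in groups for x in group]
total_mean = mean(all_values)

# SStotal: treat all values as one combined data set
ss_total = sum((x - total_mean) ** 2 for x in all_values)

# SSwithin: inner sum per group, outer sum across groups
ss_within = sum(
    sum((x - mean(group)) ** 2 for x in group) for group in groups
)

# SSbetween: squared distance of each group mean from the total mean,
# weighted by the number of values in that group
ss_between = sum(
    len(group) * (mean(group) - total_mean) ** 2 for group in groups
)

print(ss_total, ss_between, ss_within)
print(isclose(ss_total, ss_between + ss_within))
```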

Step 4: Calculate the Degrees of Freedom (df)

Calculate the Degrees of Freedom Total (DFT)

dft = n - 1

where n is the total number of values in all the data sets combined

Calculate the Degrees of Freedom Within Groups (DFW)

dfw = n - k

where k is the number of groups.

So, if there are three groups of measurements, then k = 3

Calculate the Degrees of Freedom Between Groups (DFB)

dfb = k - 1

Note: You could actually calculate the third degree of freedom if you have two of
them, just as in the case of the sums of squares.

You can verify that:

dft = dfw + dfb

Step 5: Calculate the Mean Squares

Calculate the Mean Squares Between (MSB)

MSB = SSB / dfb

Calculate the Mean Squares Within (MSW)

MSW = SSW / dfw


Step 6: Create a Summary Table and Calculate the F Statistic

This is the Analysis of Variance table summarized in Figure 1. You just have to fill it
with actual results based on your calculations. Starting with this table makes the
problem easier to solve.
Calculate the F statistic using the formula

F = MSB / MSW

Step 7: Look up F from the table

Look up the tabulated (critical) value of F from the statistical table and compare it
with the value you calculated.
If the calculated value is greater than the critical value, we reject the null hypothesis
and conclude that there is a significant difference between the means of the populations.
Otherwise, we fail to reject the null hypothesis.

The ANOVA Procedure

We will next illustrate the ANOVA procedure using the five-step approach. Because
the computation of the test statistic is involved, the computations are often organized
in an ANOVA table. The ANOVA table breaks down the components of variation in
the data into variation between treatments and error or residual variation. Statistical
computing packages also produce ANOVA tables as part of their standard output for
ANOVA, and the ANOVA table is set up as follows:


Source of Variation    Sums of Squares (SS)   Degrees of Freedom (df)   Mean Squares (MS)   F
Between Treatments     SSB                    k-1                       MSB = SSB/(k-1)     F = MSB/MSE
Error (or Residual)    SSE                    N-k                       MSE = SSE/(N-k)
Total                  SST                    N-1

where

• X = individual observation,
• $\bar{X}_j$ = sample mean of the jth treatment (or group),
• $\bar{X}$ = overall sample mean,
• k = the number of treatments or independent comparison groups, and
• N = total number of observations or total sample size.

The ANOVA table above is organized as follows.

• The first column is entitled "Source of Variation" and delineates the between
treatment and error or residual variation. The total variation is the sum of the
between treatment and error variation.
• The second column is entitled "Sums of Squares (SS)". The between treatment
sums of squares is

$$SSB = \sum n_j (\bar{X}_j - \bar{X})^2$$

and is computed by summing the squared differences between each treatment (or
group) mean and the overall mean. The squared differences are weighted
by the sample sizes per group (nj). The error sums of squares is:

$$SSE = \sum\sum (X - \bar{X}_j)^2$$

and is computed by summing the squared differences between each observation and
its group mean (i.e., the squared differences between each observation in
group 1 and the group 1 mean, the squared differences between each
observation in group 2 and the group 2 mean, and so on). The double
summation (ΣΣ) indicates summation of the squared differences within
each treatment and then summation of these totals across treatments to
produce a single value. (This will be illustrated in the following examples).
The total sums of squares is:

$$SST = \sum\sum (X - \bar{X})^2$$

and is computed by summing the squared differences between each observation and
the overall sample mean. In an ANOVA, data are organized by comparison
or treatment groups. If all of the data were pooled into a single sample,
SST would reflect the numerator of the sample variance computed on the
pooled or total sample. SST does not figure into the F statistic directly.
However, SST = SSB + SSE, thus if two sums of squares are known, the
third can be computed from the other two.

• The third column contains degrees of freedom. The between treatment degrees
of freedom is df1 = k-1. The error degrees of freedom is df2 = N - k. The total
degrees of freedom is N-1 (and it is also true that (k-1) + (N-k) = N-1).
• The fourth column contains "Mean Squares (MS)" which are computed by
dividing sums of squares (SS) by degrees of freedom (df), row by row.
Specifically, MSB=SSB/(k-1) and MSE=SSE/(N-k). Dividing SST/(N-1)
produces the variance of the total sample. The F statistic is in the rightmost
column of the ANOVA table and is computed by taking the ratio of MSB/MSE.

Example:

A clinical trial is run to compare weight loss programs and participants are randomly
assigned to one of the comparison programs and are counseled on the details of the
assigned program. Participants follow the assigned program for 8 weeks. The
outcome of interest is weight loss, defined as the difference in weight measured at the
start of the study (baseline) and weight measured at the end of the study (8 weeks),
measured in pounds.

Three popular weight loss programs are considered. The first is a low calorie diet.
The second is a low fat diet and the third is a low carbohydrate diet. For comparison
purposes, a fourth group is considered as a control group. Participants in the fourth
group are told that they are participating in a study of healthy behaviors with weight
loss only one component of interest. The control group is included here to assess the
placebo effect (i.e., weight loss due to simply participating in the study). A total of
twenty patients agree to participate in the study and are randomly assigned to one of
the four diet groups. Weights are measured at baseline and patients are counseled on
the proper implementation of the assigned diet (with the exception of the control
group). After 8 weeks, each patient's weight is again measured and the difference in
weights is computed by subtracting the 8 week weight from the baseline weight.
Positive differences indicate weight losses and negative differences indicate weight
gains. For interpretation purposes, we refer to the differences in weights as weight
losses and the observed weight losses are shown below.

Low Calorie   Low Fat   Low Carbohydrate   Control
8             2         3                  2
9             4         5                  2
6             3         4                  -1
7             5         2                  0
3             1         3                  3

Is there a statistically significant difference in the mean weight loss among the four
diets? We will run the ANOVA using the five-step approach.

• Step 1. Set up hypotheses and determine level of significance

H0: μ1 = μ2 = μ3 = μ4
H1: Means are not all equal
α = 0.05

• Step 2. Select the appropriate test statistic.

The test statistic is the F statistic for ANOVA, F=MSB/MSE.

• Step 3. Set up decision rule.

The appropriate critical value can be found in a table of probabilities for the F
distribution (see "Other Resources"). In order to determine the critical value of F we
need degrees of freedom, df1=k-1 and df2=N-k. In this example, df1=k-1=4-1=3 and
df2=N-k=20-4=16. The critical value is 3.24 and the decision rule is as follows:
Reject H0 if F > 3.24.

• Step 4. Compute the test statistic.


To organize our computations we complete the ANOVA table. In order to compute
the sums of squares we must first compute the sample means for each group and the
overall mean based on the total sample.

              Low Calorie   Low Fat   Low Carbohydrate   Control
n             5             5         5                  5
Group mean    6.6           3.0       3.4                1.2

If we pool all N=20 observations, the overall mean is 3.6 (71/20 = 3.55, rounded to
one decimal place).

We can now compute

$$SSB = \sum n_j (\bar{X}_j - \bar{X})^2$$

So, in this case:

$$SSB = 5(6.6 - 3.6)^2 + 5(3.0 - 3.6)^2 + 5(3.4 - 3.6)^2 + 5(1.2 - 3.6)^2 = 45.0 + 1.8 + 0.2 + 28.8 = 75.8$$

Next we compute,

$$SSE = \sum\sum (X - \bar{X}_j)^2$$

SSE requires computing the squared differences between each observation and its
group mean. We will compute SSE in parts. For the participants in the low calorie
diet:

Low Calorie   (X - 6.6)   (X - 6.6)²
8             1.4         2.0
9             2.4         5.8
6             -0.6        0.4
7             0.4         0.2
3             -3.6        13.0
Totals        0           21.4

Thus, Σ(X - 6.6)² = 21.4.

For the participants in the low fat diet:

Low Fat   (X - 3.0)   (X - 3.0)²
2         -1.0        1.0
4         1.0         1.0
3         0.0         0.0
5         2.0         4.0
1         -2.0        4.0
Totals    0           10.0

Thus, Σ(X - 3.0)² = 10.0.

For the participants in the low carbohydrate diet:

Low Carbohydrate   (X - 3.4)   (X - 3.4)²
3                  -0.4        0.2
5                  1.6         2.6
4                  0.6         0.4
2                  -1.4        2.0
3                  -0.4        0.2
Totals             0           5.4

Thus, Σ(X - 3.4)² = 5.4.


For the participants in the control group:

Control   (X - 1.2)   (X - 1.2)²
2         0.8         0.6
2         0.8         0.6
-1        -2.2        4.8
0         -1.2        1.4
3         1.8         3.2
Totals    0           10.6

Thus, Σ(X - 1.2)² = 10.6.

Therefore, SSE = 21.4 + 10.0 + 5.4 + 10.6 = 47.4.

We can now construct the ANOVA table.

Source of Variation    Sums of Squares (SS)   Degrees of Freedom (df)   Mean Squares (MS)   F
Between Treatments     75.8                   4-1=3                     75.8/3=25.3         25.3/3.0=8.43
Error (or Residual)    47.4                   20-4=16                   47.4/16=3.0
Total                  123.2                  20-1=19

• Step 5. Conclusion.
We reject H0 because 8.43 > 3.24. We have statistically significant evidence at
α=0.05 to show that there is a difference in mean weight loss among the four diets.
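The worked example can be re-run in a few lines. The sketch below recomputes the table from the raw weight losses without rounding any intermediate value, so its figures differ slightly from the hand calculation (which rounds the overall mean to 3.6, each squared difference to one decimal place, and MSE to 3.0): it gives SSB = 75.75, SSE = 47.2 and F of about 8.56 rather than 8.43. The conclusion is unchanged. Variable names are illustrative.

```python
from statistics import mean

# Observed weight losses (pounds) from the example
weight_loss = {
    "Low Calorie": [8, 9, 6, 7, 3],
    "Low Fat": [2, 4, 3, 5, 1],
    "Low Carbohydrate": [3, 5, 4, 2, 3],
    "Control": [2, 2, -1, 0, 3],
}

groups = list(weight_loss.values())
all_values = [x for group in groups for x in group]
overall_mean = mean(all_values)          # 3.55 before rounding

k = len(groups)                          # number of treatments
n_total = len(all_values)                # total sample size N

ssb = sum(len(g) * (mean(g) - overall_mean) ** 2 for g in groups)
sse = sum((x - mean(g)) ** 2 for g in groups for x in g)

msb = ssb / (k - 1)
mse = sse / (n_total - k)
f_value = msb / mse

F_CRITICAL = 3.24  # F table, df1 = 3, df2 = 16, alpha = 0.05
print(f"SSB = {ssb:.2f}, SSE = {sse:.2f}, F = {f_value:.2f}")
print("reject H0" if f_value > F_CRITICAL else "fail to reject H0")
```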


ANOVA is a test that provides a global assessment of a statistical difference in more
than two independent means. In this example, we find that there is a statistically
significant difference in mean weight loss among the four diets considered. In
addition to reporting the results of the statistical test of hypothesis (i.e., that there is a
statistically significant difference in mean weight losses at α=0.05), investigators
should also report the observed sample means to facilitate interpretation of the
results. In this example, participants in the low calorie diet lost an average of 6.6
pounds over 8 weeks, as compared to 3.0 and 3.4 pounds in the low fat and low
carbohydrate groups, respectively. Participants in the control group lost an average of
1.2 pounds which could be called the placebo effect because these participants were
not participating in an active arm of the trial specifically targeted for weight loss.


Module-IX
Correlation and Regression Analysis

Correlation refers to a process for establishing the relationships between two
variables. One way to get a general idea about whether or not two variables
are related is to plot them on a "scatter plot". While there are many measures of
association for variables which are measured at the ordinal or higher level of
measurement, correlation is the most commonly used approach.

Correlation in Statistics

This section shows how to calculate and interpret correlation coefficients for ordinal
and interval level scales. Methods of correlation summarize the relationship between
two variables in a single number called the correlation coefficient. The correlation
coefficient is usually represented using the symbol r, and it ranges from -1 to +1.

A correlation coefficient quite close to 0, but either positive or negative, implies little
or no relationship between the two variables. A correlation coefficient close to plus 1
means a positive relationship between the two variables, with increases in one of the
variables being associated with increases in the other variable.

A correlation coefficient close to -1 indicates a negative relationship between two
variables, with an increase in one of the variables being associated with a decrease in
the other variable. A correlation coefficient can be produced for ordinal, interval or
ratio level variables, but has little meaning for variables which are measured on a
scale which is no more than nominal.

For ordinal scales, the correlation coefficient can be calculated by using Spearman's
rho. For interval or ratio level scales, the most commonly used correlation coefficient
is Pearson's r, ordinarily referred to as simply the correlation coefficient.

What Does Correlation Measure?

In statistics, correlation studies and measures the direction and extent of the
relationship among variables, so correlation measures co-variation, not causation.
Therefore, we should never interpret correlation as implying a cause and effect
relation. For example, if there exists a correlation between two variables X and Y,
then when the value of one variable changes in one direction, the value of the other
variable is found to change either in the same direction (i.e. positive change) or in the
opposite direction (i.e. negative change). Furthermore, the correlation measured here
is linear, i.e. we can represent the relative movement of the two variables by drawing a
straight line on graph paper.

Correlation Coefficient

The correlation coefficient, r, is a summary measure that describes the extent of the
statistical relationship between two interval or ratio level variables. The correlation
coefficient is scaled so that it is always between -1 and +1. When r is close to 0 this
means that there is little relationship between the variables and the farther away from
0 r is, in either the positive or negative direction, the greater the relationship between
the two variables.

The two variables are often given the symbols X and Y. In order to illustrate how the
two variables are related, the values of X and Y are pictured by drawing the scatter
diagram, graphing combinations of the two variables. The scatter diagram is given
first, and then the method of determining Pearson's r is presented. In the following
examples, relatively small sample sizes are given. Later, data from larger samples are
given.

Scatter Diagram

A scatter diagram is a diagram that shows the values of two variables X and Y, along
with the way in which these two variables relate to each other. The values of variable
X are given along the horizontal axis, with the values of the variable Y given on the
vertical axis.

Later, when the regression model is used, one of the variables is defined as an
independent variable, and the other is defined as a dependent variable. In regression,
the independent variable X is considered to have some effect or influence on the
dependent variable Y. Correlation methods are symmetric with respect to the two
variables, with no indication of causation or direction of influence being part of the
statistical consideration. A scatter diagram is given in the following example. The
same example is later used to determine the correlation coefficient.

Types of Correlation

The scatter plot explains the correlation between the two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables:

• Positive Correlation - when the values of the two variables move in the same
direction so that an increase/decrease in the value of one variable is followed
by an increase/decrease in the value of the other variable.
• Negative Correlation - when the values of the two variables move in the
opposite direction so that an increase/decrease in the value of one variable is
followed by decrease/increase in the value of the other variable.
• No Correlation - when there is no linear dependence or no relation between the
two variables.

Correlation Formula

Correlation shows the relation between two variables. The correlation coefficient shows
the measure of correlation. To compare two datasets, we use the correlation formulas.

Pearson Correlation Coefficient Formula

The most common formula is the Pearson correlation coefficient, used for linear
dependency between the data sets. The value of the coefficient lies between -1 and
+1. When the coefficient is zero, the data are considered not linearly related. If we
get the value of +1, the data are perfectly positively correlated, and -1 indicates a
perfect negative correlation.

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

Where n = Quantity of Information (the number of data pairs)

Σx = Total of the First Variable Value

Σy = Total of the Second Variable Value

Σxy = Sum of the Product of First & Second Value

Σx² = Sum of the Squares of the First Value

Σy² = Sum of the Squares of the Second Value
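The sums defined above translate directly into code. The following is a minimal sketch of Pearson's r computed from n, Σx, Σy, Σxy, Σx² and Σy²; the function name and the sample data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

For this data the sums are Σx = 15, Σy = 20, Σxy = 66, Σx² = 55 and Σy² = 86, giving r = 30 / √1500, a moderately strong positive correlation.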

1. What is a correlation in statistics?

In statistics, correlation is a statistic that establishes the relationship between two
variables. In other words, it is the measure of association of variables.

2. What is a correlation of 1?

A correlation of 1 or +1 shows a perfect positive correlation, which means both the
variables move in the same direction.
A correlation of -1 shows a perfect negative correlation, which means as one variable
goes down, the other goes up.

3. What does a correlation of 0.45 mean?

We know that a correlation of 1 means the two variables are perfectly positively
associated, whereas if the correlation coefficient is 0, then there is no linear
correlation between the two variables. A correlation of 0.45 indicates a moderate
positive relationship. The share of the variance in one variable, say x, that is
accounted for by the second variable, say y, is given by the square of the coefficient:
0.45² = 0.2025, or about 20%, not 45%.


4. What are the 4 types of correlation?

The four types of correlation coefficients are given by:

Pearson Correlation Coefficient
Linear Correlation Coefficient
Sample Correlation Coefficient
Population Correlation Coefficient

5. What is a correlation example?

Positive, negative, or no correlation can be observed between two variables. An
example of a positive correlation would be dimensions and weight: larger objects
tend to be heavier, and smaller objects tend to be lighter.

6. Correlation Coefficient Properties

The correlation coefficient is all about establishing relationships between two variables.
Some properties of the correlation coefficient are as follows:

1) The correlation coefficient does not depend on the units in which the two
variables are measured.

2) The sign of the correlation coefficient is always the same as the sign of the
covariance.

3) The numerical value of the correlation coefficient lies between -1 and +1. It is a
real number.

4) A negative value of the coefficient indicates a negative relationship. As 'r'
approaches -1, the relationship becomes strong and negative.

When 'r' approaches +1, the relationship is strong and positive. If the correlation
equals +1, the relationship is perfectly positive.

5) A weak correlation is signaled when the coefficient of correlation approaches
zero. When 'r' is near zero, we can deduce that the linear relationship is weak.

6) The correlation coefficient must be interpreted with care: it is only as reliable as
the data behind it, for instance when we cannot tell whether survey participants
answered truthfully.

The coefficient of correlation is not affected when we interchange the two variables.


7) The coefficient of correlation is a pure number, free of any units. It is not
affected when we add the same number to all the values of one variable, and it is not
affected when we multiply all the values of a variable by the same positive number.
As discussed, 'r' is not affected by units because 'r' is scale invariant.

8) We use correlation for measuring association, but that does not mean we are
talking about causation. When two variables are correlated, it is possible that a
third variable is influencing both of them.
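The invariance properties can be demonstrated numerically. The sketch below, on hypothetical data, computes r from deviations about the means and shows that it is unchanged when the variables are interchanged, when a constant is added to one variable, and when a variable is multiplied by a positive constant. Multiplying by a negative constant flips the sign. The function name is illustrative.

```python
from math import isclose, sqrt


def correlation(xs, ys):
    """Pearson's r computed from deviations about the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariation = sum(a * b for a, b in zip(dx, dy))
    return covariation / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical paired observations
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                      # interchange variables
r_shifted = correlation([x + 100 for x in xs], ys)   # add a constant
r_scaled = correlation(xs, [3 * y for y in ys])      # multiply by a positive number
r_negated = correlation(xs, [-3 * y for y in ys])    # multiply by a negative number

print(r, r_swapped, r_shifted, r_scaled, r_negated)
```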

Examples on Correlation Coefficient

Example 1: Calculate the correlation coefficient of the given data:

x    50     51     52     53     54
y    3.1    3.2    3.3    3.4    3.5

Solution:

Here n = 5

x     50      51      52      53      54
y     3.1     3.2     3.3     3.4     3.5
xy    155     163.2   171.6   180.2   189
x²    2500    2601    2704    2809    2916
y²    9.61    10.24   10.89   11.56   12.25

∑x = 260

∑y = 16.5

∑xy = 859

∑x² = 13530

∑y² = 54.55


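Substituting all the values into the Pearson formula:

r = (5 × 859 - 260 × 16.5) / √[(5 × 13530 - 260²)(5 × 54.55 - 16.5²)] = 5 / √(50 × 0.5) = 5 / 5 = 1

We get r = 1. This shows a perfect positive correlation.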

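Example 2: Calculate the correlation coefficient of the given data:

x    12    15    18    21    27
y    2     4     6     8     12

Solution:

Here n = 5

x     12     15     18     21     27
y     2      4      6      8      12
xy    24     60     108    168    324
x²    144    225    324    441    729
y²    4      16     36     64     144

∑x = 93

∑y = 32

∑xy = 684

∑x² = 1863

∑y² = 264

Now, substitute all the values in the Pearson formula:

r = (5 × 684 - 93 × 32) / √[(5 × 1863 - 93²)(5 × 264 - 32²)] = 444 / √(666 × 296) = 444 / 444

We have r = 1, a perfect positive correlation (every point lies on the line
y = (2/3)x - 6).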


Difference Between Correlation And Regression

As mentioned earlier, Correlation and Regression are the principal units to be studied
while preparing for the 12th Board examinations. It is also important for students to
be well aware of the differences between correlation and regression. A few key
differences between the two are given below.

1. Correlation: As the name says, it determines the interconnection or
   co-relationship between the variables.
   Regression: It explains how an independent variable is numerically associated
   with the dependent variable.

2. Correlation: No distinction is made between independent and dependent
   variables.
   Regression: The dependent and independent variables are different and play
   different roles.

3. Correlation: The primary objective is to find a quantitative/numerical value
   expressing the association between the values.
   Regression: The main purpose is to calculate the values of a random variable
   based on the values of a fixed variable.

4. Correlation: It stipulates the degree to which both variables can move together.
   Regression: It specifies the effect of a unit change in the known variable (p) on
   the evaluated variable (q).

5. Correlation: It helps to constitute the connection between the two variables.
   Regression: It helps in estimating a variable's value based on another given value.


Example 9.9

Calculate the regression coefficient and obtain the lines of regression for the
following data

Solution:


Regression coefficient of X on Y

(i) Regression equation of X on Y


(ii) Regression coefficient of Y on X

(iii) Regression equation of Y on X

Y = 0.929X - 3.716 + 11

= 0.929X + 7.284

The regression equation of Y on X is Y = 0.929X + 7.284
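A minimal sketch of how the two regression coefficients and the two lines of regression are obtained from deviations about the means. It uses hypothetical data (not the data of the example above), and the function and key names are illustrative. The coefficient of Y on X is Σ(dx·dy) / Σ(dx²), the coefficient of X on Y is Σ(dx·dy) / Σ(dy²), and each line passes through the point of means.

```python
from math import sqrt


def regression_lines(xs, ys):
    """Return means, both regression coefficients and both intercepts."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]   # deviations of X from its mean
    dy = [y - mean_y for y in ys]   # deviations of Y from its mean

    s_xy = sum(a * b for a, b in zip(dx, dy))
    b_yx = s_xy / sum(a * a for a in dx)   # regression coefficient of Y on X
    b_xy = s_xy / sum(b * b for b in dy)   # regression coefficient of X on Y

    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "b_yx": b_yx,
        "b_xy": b_xy,
        "intercept_yx": mean_y - b_yx * mean_x,  # Y = b_yx * X + intercept_yx
        "intercept_xy": mean_x - b_xy * mean_y,  # X = b_xy * Y + intercept_xy
    }


# Hypothetical data
result = regression_lines([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"Y = {result['b_yx']:.3f}X + {result['intercept_yx']:.3f}")
print(f"X = {result['b_xy']:.3f}Y + {result['intercept_xy']:.3f}")

# The product of the two coefficients equals r squared
r_magnitude = sqrt(result["b_yx"] * result["b_xy"])
print(f"|r| = {r_magnitude:.4f}")
```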


Example 9.10

Calculate the two regression equations of X on Y and Y on X from the data
given below, taking deviations from the actual means of X and Y.

Estimate the likely demand when the price is Rs.20.

Solution:

Calculation of Regression equation

(i) Regression equation of X on Y


(ii) Regression Equation of Y on X

When X is 20, Y will be

Y = -0.25(20) + 44.25

= -5 + 44.25

= 39.25 (when the price is Rs. 20, the likely demand is 39.25)

Example 9.11

Obtain the regression equation of Y on X and estimate Y when X = 55 from the following data.

Solution:

(i) Regression coefficient of Y on X

(ii) Regression equation of Y on X

Y − 51.57 = 0.942(X − 48.29)

Y = 0.942X − 45.49 + 51.57

Y = 0.942X + 6.08

The regression equation of Y on X is Y = 0.942X + 6.08

Estimation of Y when X = 55

Y = 0.942(55) + 6.08 = 57.89
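
The point-slope step used in this example can be checked numerically. The Python sketch below is illustrative only: the function names `line_from_point_slope` and `predict` are assumptions introduced here, while the means (48.29 and 51.57) and the regression coefficient 0.942 are the values quoted in the solution above.

```python
def line_from_point_slope(x_mean, y_mean, b_yx):
    """Return (slope, intercept) of the Y-on-X line  Y - y_mean = b_yx * (X - x_mean)."""
    intercept = y_mean - b_yx * x_mean
    return b_yx, intercept


def predict(slope, intercept, x):
    """Estimate Y for a given X from the line Y = slope * X + intercept."""
    return slope * x + intercept


# Values quoted in Example 9.11: X-Bar = 48.29, Y-Bar = 51.57, byx = 0.942
slope, intercept = line_from_point_slope(48.29, 51.57, 0.942)
print(round(intercept, 2))                      # 6.08
print(round(predict(slope, intercept, 55), 2))  # 57.89
```

The same two functions would apply to any Y-on-X line once the two means and the regression coefficient are known.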

Example 9.12

Find the means of X and Y variables and the coefficient of correlation between them from the following two regression equations:

2Y − X − 50 = 0

3Y − 2X − 10 = 0

Solution:

We are given

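2Y − X − 50 = 0 ... (1)

3Y − 2X − 10 = 0 ... (2)

Solving equations (1) and (2)

We get Y = 90

Putting the value of Y in equation (1)

We get X = 130

Hence the means are X-Bar = 130 and Y-Bar = 90.

Calculating the correlation coefficient

Let us assume equation (1) to be the regression equation of Y on X

2Y = X + 50, so Y = 0.5X + 25 and byx = 0.5

Then equation (2) is the regression equation of X on Y

2X = 3Y − 10, so X = 1.5Y − 5 and bxy = 1.5

Therefore r = √(byx × bxy) = √(0.5 × 1.5) = √0.75 = 0.866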

NOTE

It may be noted that in the above problem one of the regression coefficients is greater than 1 and the other is less than 1, and their product (0.75) does not exceed 1. Therefore our assumption about the given equations is correct.
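
The check described in this note can be sketched in Python. This is a minimal illustration, not part of the original notes: the function name `analyse_regression_pair` and the (a, b, c) encoding of a line aX + bY + c = 0 are assumptions made here. The sketch finds the means as the intersection of the two lines, then tries both assignments and keeps the one for which the product of the regression coefficients lies between 0 and 1.

```python
import math


def analyse_regression_pair(line1, line2):
    """Each line is (a, b, c) for a*X + b*Y + c = 0 with a and b non-zero.

    Returns (x_mean, y_mean, b_yx, b_xy, r) for the admissible assignment
    of the two lines to "Y on X" and "X on Y".
    """
    a1, b1, c1 = line1
    a2, b2, c2 = line2

    # Both regression lines pass through (X-Bar, Y-Bar): solve by Cramer's rule.
    det = a1 * b2 - a2 * b1
    x_mean = (b1 * c2 - b2 * c1) / det
    y_mean = (a2 * c1 - a1 * c2) / det

    # Try both assignments; keep the one with 0 <= b_yx * b_xy <= 1.
    for y_on_x, x_on_y in ((line1, line2), (line2, line1)):
        b_yx = -y_on_x[0] / y_on_x[1]  # Y = -(a/b) X - c/b
        b_xy = -x_on_y[1] / x_on_y[0]  # X = -(b/a) Y - c/a
        product = b_yx * b_xy
        if 0 <= product <= 1:
            # r carries the common sign of the two regression coefficients.
            r = math.copysign(math.sqrt(product), b_yx)
            return x_mean, y_mean, b_yx, b_xy, r
    raise ValueError("no admissible assignment of the two lines")


# Example 9.12: 2Y - X - 50 = 0 and 3Y - 2X - 10 = 0
x_mean, y_mean, b_yx, b_xy, r = analyse_regression_pair((-1, 2, -50), (-2, 3, -10))
print(x_mean, y_mean, b_yx, b_xy, round(r, 3))  # 130.0 90.0 0.5 1.5 0.866
```

Because the two possible products are reciprocals of each other, at most one assignment passes the test unless the lines coincide.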

Example 9.13

Find the means of X and Y variables and the coefficient of correlation between them from the following two regression equations:

4X − 5Y + 33 = 0

20X − 9Y − 107 = 0

Solution:

We are given

4X − 5Y + 33 = 0 ... (1)

20X − 9Y − 107 = 0 ... (2)

Solving equations (1) and (2)

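We get Y = 17

Putting the value of Y in equation (1), we get X = 13

Calculating the correlation coefficient

Let us assume equation (1) to be the regression equation of X on Y, so that 4X = 5Y − 33 and bxy = 5/4 = 1.25

Let us assume equation (2) to be the regression equation of Y on X, so that 9Y = 20X − 107 and byx = 20/9 = 2.22

But this is not possible, because both regression coefficients are greater than 1, so their product (which equals r²) would exceed 1.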

So our above assumption is wrong. Therefore, treating equation (1) as the regression equation of Y on X and equation (2) as the regression equation of X on Y, we get byx = 4/5 = 0.8 and bxy = 9/20 = 0.45, so r = √(0.8 × 0.45) = √0.36 = 0.6.

Example 9.14

The following table shows the sales and advertisement expenditure of a firm.

Coefficient of correlation r = 0.9. Estimate the likely sales for a proposed advertisement expenditure of Rs. 10 crores.

Solution:

When the advertisement expenditure is Rs. 10 crores, i.e., Y = 10, the estimated sales are X = 6(10) + 4 = 64.

Example 9.15

There are two series of index numbers, P for the price index and S for the stock of the commodity. The mean and standard deviation of P are 100 and 8, and of S are 103 and 4, respectively. The correlation coefficient between the two series is 0.4. With these data, obtain the regression lines of P on S and S on P.

Solution:

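Let X denote the price index P and Y the stock index S. Then the mean and SD of P are X-Bar = 100 and σx = 8, and the mean and SD of S are Y-Bar = 103 and σy = 4. The correlation coefficient between the series is r(X,Y) = 0.4.

Let the regression line of X on Y be X − X-Bar = bxy(Y − Y-Bar), where bxy = r(σx/σy) = 0.4(8/4) = 0.8. Then X − 100 = 0.8(Y − 103), i.e., X = 0.8Y + 17.6.

Similarly, the regression line of Y on X is Y − Y-Bar = byx(X − X-Bar), where byx = r(σy/σx) = 0.4(4/8) = 0.2. Then Y − 103 = 0.2(X − 100), i.e., Y = 0.2X + 83.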
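
Both regression lines in this example follow from the two means, the two standard deviations and r alone. The Python sketch below shows one way to compute them; the function name `regression_lines_from_summary` is an assumption introduced for illustration, and the inputs are the figures stated in the problem.

```python
def regression_lines_from_summary(x_mean, sd_x, y_mean, sd_y, r):
    """Return ((b_xy, a_xy), (b_yx, a_yx)) for the lines
    X = b_xy * Y + a_xy   (X on Y)   and   Y = b_yx * X + a_yx   (Y on X)."""
    b_xy = r * sd_x / sd_y          # regression coefficient of X on Y
    b_yx = r * sd_y / sd_x          # regression coefficient of Y on X
    a_xy = x_mean - b_xy * y_mean   # each line passes through (x_mean, y_mean)
    a_yx = y_mean - b_yx * x_mean
    return (b_xy, a_xy), (b_yx, a_yx)


# Example 9.15: X-Bar = 100, sd_x = 8, Y-Bar = 103, sd_y = 4, r = 0.4
(b_xy, a_xy), (b_yx, a_yx) = regression_lines_from_summary(100, 8, 103, 4, 0.4)
print(f"X = {b_xy:.1f}Y + {a_xy:.1f}")  # X = 0.8Y + 17.6
print(f"Y = {b_yx:.1f}X + {a_yx:.1f}")  # Y = 0.2X + 83.0
```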

Example 9.16

For 5 pairs of observations the following results are obtained: ∑X = 15, ∑Y = 25, ∑X² = 55, ∑Y² = 135, ∑XY = 83. Find the equations of the lines of regression and estimate the value of X on the first line when Y = 12 and the value of Y on the second line if X = 8.

Solution:

Y − 5 = 0.8(X − 3)

Y = 0.8X + 2.6

When X = 8 the value of Y is estimated as

Y = 0.8(8) + 2.6

= 9
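
The whole of Example 9.16 can be reproduced from the five given sums. The sketch below is a hedged illustration: `regression_from_sums` is a name chosen here, and the estimate of X at Y = 12 is computed from the stated sums because that part of the worked solution is not preserved in these notes.

```python
def regression_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return (x_mean, y_mean, b_xy, b_yx) from raw sums of n paired observations."""
    x_mean = sum_x / n
    y_mean = sum_y / n
    cross = n * sum_xy - sum_x * sum_y           # shared numerator
    b_xy = cross / (n * sum_y2 - sum_y ** 2)     # regression coefficient of X on Y
    b_yx = cross / (n * sum_x2 - sum_x ** 2)     # regression coefficient of Y on X
    return x_mean, y_mean, b_xy, b_yx


x_mean, y_mean, b_xy, b_yx = regression_from_sums(5, 15, 25, 55, 135, 83)

# First line (X on Y) evaluated at Y = 12; second line (Y on X) evaluated at X = 8.
x_at_12 = x_mean + b_xy * (12 - y_mean)
y_at_8 = y_mean + b_yx * (8 - x_mean)
print(round(x_at_12, 2), round(y_at_8, 2))  # 8.6 9.0
```

Both regression coefficients come out as 0.8 here, so the lines are X = 0.8Y − 1 and Y = 0.8X + 2.6.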

Example 9.17

The two regression lines are 3X + 2Y = 26 and 6X + 3Y = 31. Find the correlation coefficient.

Solution:

Let the regression equation of Y on X be

3X+2Y = 26
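
The remaining steps can be written out as follows. This derivation is reconstructed here from the two given equations, under the stated assumption that 3X + 2Y = 26 is the line of Y on X and therefore 6X + 3Y = 31 is the line of X on Y.

```latex
\begin{gather*}
3X + 2Y = 26 \;\Rightarrow\; Y = -\tfrac{3}{2}X + 13, \quad b_{yx} = -\tfrac{3}{2} \\
6X + 3Y = 31 \;\Rightarrow\; X = -\tfrac{1}{2}Y + \tfrac{31}{6}, \quad b_{xy} = -\tfrac{1}{2} \\
r^{2} = b_{yx}\, b_{xy} = \tfrac{3}{4} \le 1 \\
r = -\sqrt{0.75} \approx -0.866
\end{gather*}
```

The sign of r is negative because both regression coefficients are negative. The opposite assignment would give a product of 4/3, which exceeds 1, so the assumption is consistent.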

Example 9.18

In a laboratory experiment on a correlation research study, the equations of the two regression lines were found to be 2X − Y + 1 = 0 and 3X − 2Y + 7 = 0. Find the means of X and Y. Also work out the values of the regression coefficients and the correlation coefficient between the two variables X and Y.

Solution:

Solving the two regression equations we get mean values of X and Y
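
A worked sketch of the solution, reconstructed here from the two given equations rather than copied from the source: the lines intersect at the means, and each line is assigned so that the product of the regression coefficients does not exceed 1.

```latex
\begin{gather*}
2X - Y + 1 = 0, \quad 3X - 2Y + 7 = 0 \;\Rightarrow\; \bar{X} = 5, \ \bar{Y} = 11 \\
\text{X on Y: } 2X = Y - 1 \;\Rightarrow\; X = \tfrac{1}{2}Y - \tfrac{1}{2}, \quad b_{xy} = \tfrac{1}{2} \\
\text{Y on X: } 2Y = 3X + 7 \;\Rightarrow\; Y = \tfrac{3}{2}X + \tfrac{7}{2}, \quad b_{yx} = \tfrac{3}{2} \\
r = \sqrt{b_{xy}\, b_{yx}} = \sqrt{0.75} \approx 0.866
\end{gather*}
```

The opposite assignment would give byx = 2 and bxy = 2/3, whose product 4/3 exceeds 1, so it is rejected.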

Example 9.19

For the given lines of regression 3X − 2Y = 5 and X − 4Y = 7, find

(i) Regression coefficients

(ii) Coefficient of correlation

Solution:

(i) First convert the given equations of Y on X and X on Y into standard form and find their regression coefficients respectively.

Given regression lines are

3X − 2Y = 5 ... (1)

X − 4Y = 7 ... (2)

Let the line of regression of X on Y be

3X − 2Y = 5

3X = 2Y + 5

X = (2/3)Y + 5/3, so bxy = 2/3 = 0.667

Let the line of regression of Y on X be

X − 4Y = 7

4Y = X − 7

Y = (1/4)X − 7/4, so byx = 1/4 = 0.25

Coefficient of correlation

Since the two regression coefficients are positive, the correlation coefficient is also positive, and it is given by r = √(bxy × byx) = √((2/3) × (1/4)) = √0.1667 = 0.4082.
