Business Statistics
Module-I
Data Representation-Central Tendency and dispersion-Kurtosis and skewness
Module-II
Probability-Axioms-Addition and Multiplication Rules-Types of Probability-Independence of events-
Probability Tree-Bayes' Theorem
Module-III
Concept of Random Variable-Probability Distributions-Expected Value and Variance of a Random
Variable-Conditional Expectations-Classical Newspaper Boy Problem (EMV, EVPI)
Module-IV
Probability distributions-Binomial-Poisson-Normal
Module-V
Sampling distributions
Module-VI
Estimation-Point and Interval
Module-VII
Hypothesis Testing-t test, Chi-square test, Z test
Module-VIII
ANOVA-One Way, Two Way
Module-IX
Correlation and Regression Analysis
BUSINESS STATISTICS
Q1. What is Statistics?
The word "Statistics" has been derived from the Latin word "Status", the Italian word "Statista" or the German
word "Statistik". Each of these words means "political state". Initially, statistics was used to collect
information about the people of the state, such as their income, health, illiteracy and wealth.
But nowadays, statistics has become an important subject with useful applications in various fields of day-to-day life.
Example: "Ram gets Rs.100 per month as pocket allowance" is not statistics, since it is neither an aggregate nor an
average. Whereas "the average pocket allowance of the students of Class X is Rs.100 per month", or "there are 80
students in Class XI and 8 students in Class XII", are statistics.
• A young lady was run over by a speeding truck travelling at 100 km per hour.
• The average height of males aged 26 and above in India is 6 feet, compared to 5 feet in Nepal.
• Over the past 10 years, India has won 60 test matches in cricket and lost 50.
According to Yule and Kendall, "By Statistics we mean quantitative data affected to a marked extent by a
multiplicity of causes."
(2) Numerically Expressed - Statistics are expressed in terms of numbers. Qualitative aspects like small or
big, rich or poor, etc. are not statistics. For instance, if we say that Irfan Pathan is tall and Sachin is short, then
this statement has no statistical sense. However, if it is stated that the height of Irfan Pathan is 6 ft 2 inches
and the height of Sachin is 5 ft 4 inches, then these numerical facts are called statistics.
(3) Affected by Multiplicity of Causes - Statistics are not affected by any single factor; they are affected by
many factors. For instance, a 30% rise in prices may be due to several causes, such as reduction in
supply, increase in demand, shortage of power, rise in wages, rise in taxes, etc.
(4) Reasonable Accuracy - A reasonable degree of accuracy must be kept in view while collecting statistical
data. This accuracy depends on the purpose of investigation, its nature, size and available resources.
(5) Pre-determined Purpose - Statistics are collected with some pre-determined objective. Any information
collected without a definite purpose will only be a numerical value and not statistics. If data pertaining
to the farmers of a village are collected, there must be some pre-determined objective: whether the
statistics are collected to know their economic position, the distribution of land among
them, or their total population. All these objectives must be pre-determined.
(6) Collected in a Systematic Manner - Statistics should be collected in a systematic manner. Before
collecting the data, a plan must be prepared. No conclusion can be drawn from data collected in a
haphazard manner. For instance, data regarding the marks secured by the students of a college, without
any reference to the class, subject, examination or maximum marks, will lead to no conclusion.
Statistics is the science which deals with the collection, classification and tabulation of numerical facts as a
basis for the explanation, description and comparison of phenomena. (LOVITT)
1. Descriptive Statistics: Descriptive statistics is related to numerical data or facts. Such data are
collected either by counting or by some other process of measurement. It also covers methods
such as editing of data, classification, tabulation, diagrammatic or graphical presentation,
measures of central tendency, measures of dispersion, correlation, etc., which help to make the description
of numerical facts simple, systematic, synoptic, understandable and meaningful.
2. Inferential Statistics: Inferential statistics helps in making generalizations about the population or
universe on the basis of a study of samples. It includes the process of drawing proper and rational
conclusions about the universe. Among these methods, probability theory and the different techniques of
sampling tests are important.
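The distinction can be sketched with a short example, using Python's standard `statistics` module (the sample data below are invented for illustration):

```python
import math
import statistics

# Invented sample: monthly pocket allowances (Rs.) of 10 students
sample = [95, 100, 110, 90, 105, 100, 98, 102, 97, 103]

# Descriptive statistics: summarize the sample itself
mean = statistics.mean(sample)       # central tendency
stdev = statistics.stdev(sample)     # dispersion (sample standard deviation)
print(f"mean = {mean}, stdev = {stdev:.2f}")

# Inferential statistics: generalize from the sample to the population,
# e.g. an approximate 95% confidence interval for the population mean
se = stdev / math.sqrt(len(sample))  # standard error of the mean
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"approx. 95% CI for the population mean: ({low:.1f}, {high:.1f})")
```

The first half only describes the data at hand; the second half makes a (hedged) statement about the whole population from which the sample was drawn.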
3. Applied Statistics: It involves the application of statistical methods and techniques to actual problems and
facts. For example, statistics relating to national income, industrial and agricultural production,
population, prices, etc. are called applied statistics. It can be divided into two parts: (1) Descriptive
Applied Statistics deals with the study of data which are known and which naturally relate to the past or
the present; its main object is to provide descriptive information for any area. For
example, price index numbers and vital statistics come under the category of descriptive applied
statistics. (2) Scientific Applied Statistics: under this branch of statistical science, statistical methods
are used to formulate and verify scientific laws. For example, an economist's efforts to
establish the law of demand, the quantity theory of money, trade cycles, etc. are established and
verified with the help of scientific applied statistics.
2. Sample: It is a part of the population selected by some sampling procedure. The process of selecting a
sample is known as sampling. The number of objects in the sample is called the size of the sample. It is
assumed that a sample is a good representative of the population.
For instance, suppose a research worker is required to study the weight of fishes in a pond after a
particular period of growth. Suppose there are 3,000 fishes in the pond; he may
either measure the weight of all the fishes in the pond, or he may decide to select a small group of fishes
and measure their weights. The first approach, measuring the weight of all the fishes, is called complete
enumeration or census. The second approach, in which only a small group of fishes is considered, is called a
sample survey. In brief, in complete enumeration information is collected on all the units
of the universe, while in a sample survey only a part of the universe is considered.
3. Variable: A property of objects which differs from object to object and is expressible numerically, in terms of numbers, is known as a variable.
For instance, the marks in Mathematics of students in a class can be expressed in terms of the marks
obtained by each student. So it is a variable property which is expressible quantitatively.
4. Attribute: A property or characteristic of objects which is not expressible quantitatively in numbers is known as an attribute. Such data can only be expressed qualitatively, for example smoking, colour, honesty, etc.
FREQUENCY DISTRIBUTION
The tabular arrangement of data showing the frequency of each item is called a frequency distribution.
According to Croxton and Cowden, "A frequency distribution is a statistical table in which different values of a
variable are shown in the sequence of magnitude along with corresponding frequencies."
(2) Continuous frequency distribution: A continuous frequency distribution is one in which the
data are arranged in classes or groups which are not exactly measurable. The groups or class intervals run
in a continuous form from the beginning of the frequency distribution to the end, within a given
range of the data. There are two forms of series according to the class interval:
(1) Inclusive form: a frequency distribution in which the upper limit of each class is included in that class,
such as 0-9, 10-19, 20-29, ...
(2) Exclusive form: a frequency distribution in which the upper limit of each class is excluded from that class and becomes the lower limit of the next class interval, such as 0-10, 10-20, 20-30, ...
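As a rough sketch, a frequency distribution with exclusive class intervals can be built like this (the marks are invented; plain Python, no special library):

```python
# Invented raw data: marks of 12 students
marks = [5, 12, 17, 23, 8, 31, 25, 14, 19, 3, 28, 36]

# Exclusive class intervals 0-10, 10-20, ...: the upper limit of each
# class is excluded and becomes the lower limit of the next class.
width = 10
counts = {}
for m in marks:
    lower = (m // width) * width          # lower limit of the class containing m
    counts[lower] = counts.get(lower, 0) + 1

# Print the frequency distribution in order of magnitude
for lower in sorted(counts):
    print(f"{lower}-{lower + width}: {counts[lower]}")
```

A mark of exactly 10 would fall in the 10-20 class, which is what makes the intervals exclusive rather than inclusive.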
CHARACTERISTICS OF STATISTICS
In the absence of the above characteristics, numerical data cannot be called statistics; hence "all
statistics are numerical statements of facts, but all numerical statements of facts are not statistics."
According to the above definitions, statistics is both a science and an art. It is related to the study and
application of the principles and methods applicable to the collection, presentation, analysis, interpretation
and forecasting of data, or of statistical facts influenced by several factors and related to any area of knowledge
or research, so that concrete and intelligent decisions may be taken in the face of uncertainty.
NATURE OF STATISTICS
Statistics as a science: Science refers to a systematized body of knowledge. It studies cause-and-effect
relationships and attempts to make generalizations in the form of scientific principles or laws. "Science, in
short, is like a lighthouse that gives light to the ships to find out their own way but doesn't indicate the
direction in which they should go." Like other sciences, statistical methods are used to answer
questions such as how an investigation should be conducted and in what way valid and reliable conclusions can
be drawn. Statistics is called the science of scientific methods.
In the words of Croxton and Cowden, "Statistics is not a science, it is scientific method." According to Tippett,
"As science, the statistical method is a part of the general scientific method and is based on the same
fundamental ideas and processes."
Statistics as an art: We know that science is a body of systematized knowledge. How this knowledge is to be
used for solving a problem is the work of an art. An art is applied knowledge; it refers to the skill of handling
facts so as to achieve a given objective. It is concerned with the ways and means of presenting and handling
data, making inferences logically and drawing relevant conclusions. The art aspect of statistics tells how to use
statistical rules and principles to study problems and find their solutions. The collection of statistics
(data) and its use and utility are themselves an art.
Statistics is both science and art: After studying the science and art aspects of statistics, it is clear that statistics is used not only to gain
knowledge but also to understand facts and draw important conclusions from them. If science is knowledge,
then art is action. Looking from this angle, statistics may also be regarded as an art: it involves the application
of given methods to obtain facts, derive results and finally use them for devising action.
Five stages:
Collection - Organisation - Presentation - Analysis - Interpretation
1. Collection: This is the primary step in a statistical study and data should be collected with care by the
investigator. If the data are faulty, the conclusions drawn can never be reliable. The data may be available
from existing published or unpublished sources, or else may be collected by the investigator himself. The
first-hand collection of data is one of the most difficult and important tasks faced by a statistician.
2. Organization: Data collected from published sources are generally already in organized form.
However, a large mass of figures collected from a survey frequently needs organization. In
organizing, there are three steps. The collected data are organized properly so that the desired information
may be highlighted and undesirable information avoided.
3. Presentation: Arranged data by itself is not capable of influencing a layman. Thus, it is necessary that the data be
presented with the help of tables, diagrams and graphs. By these devices, facts can be understood easily.
4. Analysis: A major part of statistics is devoted to the methods used in analyzing the presented data, mostly in
tabular form. For this analysis, a number of statistical tools are available, such as averages, correlation,
regression, etc.
5. Interpretation: The interpretation of data is a difficult task and necessitates a high degree of skill and
experience in statistical investigation, because certain decisions are made on the basis of the conclusions drawn.
1. Statistics and the State: In recent years the functions of the state have increased tremendously. The
concept of the state has changed from that of simply maintaining law and order to that of a welfare
state. Statistical data and statistical methods are of great help in promoting human welfare. The
government in most countries is the biggest collector and user of statistical data. These statistics help
in framing suitable policies.
2. Statistics in Business and Management: With growing size and increasing competition, the problems
of business enterprises have become complex. Statistics is now considered an indispensable tool
in the analysis of activities in the fields of business, commerce and industry. The objective can be
achieved by properly conducted market surveys and research, which greatly depend on statistical
methods. The trends in sales and production can be determined by statistical methods like time-series
analysis, which are essential for future planning. Statistical concepts and methods
are also used in controlling the quality of products to the satisfaction of the consumer and the producer.
Bankers use the objective analysis furnished by statistics and then temper their decisions on the basis
of qualitative information.
3. Statistics and Economics: R.A. Fisher complained of "the painful misapprehension that statistics is a
branch of economics." Statistical data and methods are of immense help in the proper understanding
of economic problems and in the formulation of economic policies. In the field of exchange, we
study markets, the law of prices based on supply and demand, the cost of production, banking and credit
instruments, etc. The development of various economic theories owes greatly to statistical methods,
e.g., Engel's law of family expenditure and the Malthusian theory of population. The impact of
mathematics and statistics has led to the development of new disciplines like 'Econometrics' and
'Economic Statistics'. In fact, the concept of planning, so vital for the growth of nations, would not have
been possible in the absence of data and proper statistical analysis.
4. Statistics and Psychology and Education: Statistics has found wide application in psychology and
education. Statistical methods are used to measure human abilities such as intelligence, aptitude,
personality, interest, etc. by tests. The theory of learning is also based on statistical principles.
Applications of statistics in psychology and education have led to the development of a new discipline
called 'Psychometrics'.
5. Statistics and Natural Science: Statistical techniques have proved to be extremely useful in the study
of all natural sciences like biology, medicine, meteorology, botany, etc. For example, in diagnosing
the correct disease the doctor has to rely heavily on factual data like body temperature, pulse
rate, B.P., etc. In botany, the study of plant life, one has to rely heavily on statistics in conducting
experiments about the plants, the effect of temperature, the type of soil, etc. In agriculture, statistical
techniques like analysis of variance and design of experiments are useful for isolating the role of
manure, rainfall, watering process, seed quality, etc. In fact, it is difficult to find any scientific activity
where statistical data and methods are not used.
6. Statistics and Physical Science: The physical sciences were among the fields in which statistical methods were first
developed and applied, and they are making increasing use of statistics, especially in astronomy,
chemistry, engineering, geology, meteorology and certain branches of physics.
7. Statistics and Research: Statistics is indispensable in research work. Most of the advancement in
knowledge has taken place because of experiments conducted with the help of statistical methods.
Statistical methods also affect research in medicine and public health. In fact, there is hardly any
research work today that one can find complete without statistical methods.
8. Statistics and Computers: The development of statistics has been closely related to the evolution of
electronic computing machinery. Statistics is a form of data processing, a way of converting data into
information useful for decision-making. Computers can process large amounts of data quickly
and accurately. This is a great benefit to business and other organizations that must maintain records
of their operations. Processing of raw data is extensively required in the application of many
statistical techniques.
In recent times, we hear talk about statistics from the common person to the highly qualified person. This only
shows how intimately statistics has become connected with a wide range of activities in daily life. People realize
that work in their fields requires some understanding of statistics, which indicates the importance of the subject.
A.L. Bowley says, "Knowledge of statistics is like knowledge of a foreign language or of algebra. It may prove
of use at any time under any circumstances."
1. Importance to the State or Government: In the modern era, the role of the state has increased and the various
governments of the world also take care of the welfare of their people. Therefore, these governments require
much greater information in the form of numerical figures. Statistics are extensively used as a basis for
government plans and policies. For example, five-year plans are framed using reliable statistical data from
different segments of life.
2. Importance in Human Behavior: Statistical methods, viz. averages, correlation, etc., are closely related to
human activities and behavior. For example, when a layman wishes to purchase some article, he first
enquires about its price at different shops in the market. In other words, he collects data about the price of a
particular article and aims at getting an idea of the average of the prices and the range within which the prices
vary. Thus, it can be concluded that statistics plays an important role in every aspect of human activities and
behavior.
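That informal reasoning amounts to computing an average and a range. As a tiny sketch (the shop prices below are invented):

```python
# Invented prices (Rs.) quoted for the same article at five shops
prices = [48, 52, 50, 55, 45]

average = sum(prices) / len(prices)      # central tendency of the quotations
price_range = max(prices) - min(prices)  # spread within which the prices vary

print(f"average price = {average}, range = {price_range}")
```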
3. Importance in Economics: Statistics is gaining ever-increasing importance in the field of economics.
That is why Tugwell said, "The science of economics is becoming statistical in its method." Statistics and
economics are so interrelated that new disciplines like econometrics and economic statistics
have been developed. The inductive method of generalization used in economics is also based on statistical
principles. Statistics is used in different segments of economics:
(A) Consumption- From the statistics of consumption we can find the way in which people in different groups
spend their income. The law of demand and the elasticity of demand in the field of consumption are based on
inductive or inferential statistics.
(B) Production- Using the statistics of production, supply is adjusted according to demand. We can find out the
capital invested in different productive units and their output. Decisions about what to produce, how much
to produce and when to produce are based on facts analyzed statistically.
(C) Distribution- Statistics plays a vital role in the field of distribution. We calculate the national income of a
country by statistical methods and compare it with other countries. At every step we require the help of
figures; without them it is difficult to proceed and draw inferences.
4. Importance in Planning: For the proper utilization of natural and human resources, statistics plays a vital
role. Planning is indispensable for achieving a faster rate of growth through the best use of a nation's
resources. It is sometimes said that "planning without statistics is a ship without rudder and compass." For
example, in India a number of organizations like the National Sample Survey Organisation (N.S.S.O.) and the
Central Statistical Organisation (C.S.O.) have been established to provide all types of information.
5. Importance in Business: The use of statistical methods in the solution of business problems is comparatively
recent. No business, small or large, public or private, can prosper without the help of statistics. Statistics provides the
necessary techniques to a businessman for the formulation of various policies and for planning with regard to his
business, such as:
(A) Marketing- In the field of marketing, it is necessary first to find out what can be sold and then to evolve
a suitable strategy so that the goods reach the ultimate consumer. A skillful analysis of data on population,
purchasing power, habits of people, competition, transportation cost, etc. should precede any attempt to
establish a new market.
(B) Quality Control- To earn a better price in a competitive market, it is necessary to watch the quality of
the product. Statistical techniques can also be used to control the quality of the product manufactured by a
firm, such as by means of control charts.
(C) Banking and Insurance Companies- Banks use statistical techniques to take decisions regarding the
average amount of cash needed each day to meet the requirements of day-to-day transactions. Various
policies of investment and sanction of loans are also based on the analysis provided by statistics.
(D) Accounts Writing and Auditing- Every business firm keeps accounts of its revenue and expenditure.
Statistical methods are also employed in accounting. In particular, the auditing function makes frequent
application of statistical sampling and estimation procedures, and the cost accountant uses regression analysis.
(E) Research and Development- Many business organizations have their own research and development
departments which are responsible for the collection of such data. These departments also prepare charts, graphs
and other statistical analyses for the purpose.
1. Numerical and definite expression of facts: The first function of statistics is the collection and
presentation of facts in numerical form. We know that numerical presentation helps in having a
better understanding of the nature of a problem. One of the most important functions of statistics is
to present general statements in a precise and definite form. Statements and facts conveyed in exact
quantitative terms are always more convincing than vague utterances.
2. Simplifies the data (condensation): Not only does statistics present facts in a definite form, but it also
helps in condensing a mass of data into a few significant figures. According to A.E. Waugh, "the
purpose of a statistical method is to simplify great bodies of numerical data." In fact, the human mind
cannot follow huge, complex and scattered numerical facts. So these facts are made simple and
precise with the help of various statistical methods like averages, dispersion, graphic or
diagrammatic presentation, classification, tabulation, etc., so that a common man can also understand
them easily.
3. Comparison of facts: Boddington states, "The essence of statistics is not only counting but also
comparison." The function of comparison does help in showing the relative importance of data. For
example, the pass percentage of the examination results of a college may be appreciated better when it is
compared with the results of other colleges or with the results of previous years of the same college.
4. Establishment of relationships between two or more phenomena: To investigate the relationship between two or
more facts is a main function of statistics. For example, the demand and supply of a certain
commodity, prices and wages, and the temperature and germination time of seeds are interrelated.
5. Enlarges individual experience: In the words of Bowley, "the proper function of statistics indeed is to
enlarge individual experience." Statistics is like a master key that is used to solve problems of
mankind in every field. It would not be an exaggeration to say that many fields of knowledge would
have remained closed to mankind forever but for the efficient and useful techniques and
methodology of the science of statistics.
6. Helps in the formulation of policies: Statistics helps in formulating policies in different fields,
especially in the economic, social and political fields. Government policies like industrial policy,
export-import policy, taxation policy and monetary policy are determined on the basis of statistical
data and their movements; plan targets are also fixed with the help of data.
7. Helps in forecasting: Statistical methods provide helpful means of estimating the available facts and
forecasting the future. Here Bowley's statement is relevant: "a statistical estimate may be good or
bad, accurate or the reverse; but in almost all cases it is likely to be more accurate than a casual
observer's impression."
8. Testing of hypotheses: Statistical methods are also employed to test hypotheses in theory and
discover new theories. For example, the statement that the average height of the students of a college is 66
inches is a hypothesis. Here the students of the college constitute the population. It is possible to test the
validity of this statement by the use of statistical techniques.
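A minimal sketch of such a test is a one-sample t statistic computed by hand (the sample heights are invented; 2.262 is the standard two-sided 5% critical value of t for 9 degrees of freedom):

```python
import math
import statistics

# Invented sample: heights (inches) of 10 students from the college
heights = [65, 67, 66, 64, 68, 66, 65, 67, 66, 66]
mu0 = 66                               # hypothesized population mean height

n = len(heights)
mean = statistics.mean(heights)
s = statistics.stdev(heights)          # sample standard deviation
t = (mean - mu0) / (s / math.sqrt(n))  # one-sample t statistic

print(f"t = {t:.3f}")
# Compare |t| with the critical value 2.262 (df = 9, alpha = 0.05, two-sided)
print("reject H0" if abs(t) > 2.262 else "fail to reject H0")
```

If the sample mean were far from 66 relative to the sampling variability, |t| would exceed the critical value and the hypothesis would be rejected.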
1. Statistics does not study qualitative facts: Statistics means an aggregate of numerical facts. This means that
in statistics only those phenomena are studied which can be expressed in numerical terms, directly or
indirectly. Such as: (1) directly in numerical terms, like the age, weight and income of an individual; (2) not
directly but indirectly, like the intelligence of students and achievements of students; (3) neither directly
nor indirectly, like morality, affection, etc. Facts of this last type do not come under the scope of statistics.
2. Statistics doesn't study the individual: According to W.I. King, "Statistics from the very nature of the
subject cannot and will never be able to take into account individual causes. Where these are
important, other means must be used for their study." Statistical studies are done to compare the general
behavior of a group at different points of time, or the behavior of different groups at a particular
point of time.
3. Statistical results are true only on the average: Statistical laws are not completely true and
accurate like the laws of physics. For example, the law of gravitational force is perfectly true and
universal, but statistical conclusions are not perfectly true. For instance, the average age of a person in
India is 62 years; it does not mean that every person will attain this age. On the basis of statistical
methods we can speak only in terms of probability, not certainty.
4. Statistics lacks complete accuracy: According to Conner, "Statistical data must always be treated
as approximations or estimates and not as precise measurements." Statistical results, being based on
sample or census data, are bound to be true only approximately. For example, according to the 2001
population census, the country's population is 1,02,70,15,247, but the real population may be
more or less by a hundred, two hundred and so on.
5. Statistics is liable to be misused: Statistics deals with figures, which can be easily manipulated or
distorted by inexpert and unskilled persons, so it is very likely to be misused in many
cases. In other words, the data should be handled by experts; statistics must be used by technically
sound persons.
6. Statistics is only one of the methods of studying a phenomenon: According to Croxton and Cowden,
"It must not be assumed that the statistical method is the only method to be used in research; neither should
this method be considered the best attack for every problem." The conclusions arrived at with the
help of statistics must be supplemented with other evidence.
7. Statistical results may be misleading: Without any reference, statistical results may provide doubtful
conclusions. For example, on the basis of an increasing number of prisoners in the prisons, it may be
concluded that crime is increasing. But it may be that, due to rude behavior by the police
administration, the number of prisoners is increasing while crime is decreasing.
Therefore, it is worth mentioning that every science is based on certain assumptions and limitations. This does
not reduce the importance of the subject, but lays emphasis on the fact that precautions should be taken while
dealing with statistical analysis and interpretation.
1. Quantitative Data or Numerical Data: These types of data can be measured directly, such as age, income,
production, marks, etc. Such facts are called variables, and variables may be discrete or continuous.
Discrete variable - A discrete variable is one whose values are individually distinct and discontinuous; there is a
definite difference between two successive values. According to Boddington, "A discrete variable is one where the
variates (individual values) differ from each other by definite amounts." For example, the number of students
in a class, the number of children in a family, the number of cattle, etc. It takes integral values such as 0, 1, 2, 3,
4, etc.
Continuous variable - A continuous variable is one which can assume any value within an interval; no
definite breaks are visible in this type of series. For example: age, weight, height.
Question: State which of the following represent discrete data and which continuous data.
• (1) Univariate Data: When the frequencies are determined on the basis of one variable. For
example, the number of workers on the basis of wages, the number of persons on the basis of age, etc.
• (2) Bivariate Data: When the data are edited or presented on the basis of two variables
simultaneously. For this, a two-way frequency table is constructed: one variable is placed horizontally
and the second vertically. For example, presenting in one table the number of students on the
basis of marks obtained in two subjects, or tabulating in one table the number of persons on the basis of two
variables, i.e. height and weight.
• (1) Raw Data: Data that have not yet been arranged or analyzed. They are called 'raw' because they are unprocessed
by statistical methods.
• (2) Arranged Data: Data that have been processed and are arranged, summarized, classified and tabulated
in a proper way.
Terms like 'data point' and 'data set' are used to distinguish between the numbers relating to
individual or single facts and an aggregate of facts. For example, the data of production of sugar for ten
years would be termed a 'data set', while the figure for production in one year would be a 'data point'.
CLASSIFICATION
After the collection and editing of data, the first step towards further processing is classification.
Classification is a process in which the collected data are arranged in separate classes, groups or subgroups
according to their characteristics. According to Secrist, "Classification is the process of arranging data into
sequences and groups according to their common characteristics, or separating them into different classes."
In short, classification means the arrangement and systematization of data into different classes, and
these classes are determined by the nature, objectives and scope of the enquiry.
OBJECTIVES OF CLASSIFICATION
Classification is a method or technique for extracting the essential information supplied by the raw data.
(1) To condense the data: the main objective of classification is to condense and simplify the statistical
material, so that it may be easily understood.
(2) To bring out points of similarity and dissimilarity in the data: classification brings out clearly the points of
similarity and dissimilarity among statistical facts, because data of similar characteristics are placed in one class,
e.g. males and females, literates and illiterates, married and unmarried, etc.
(3) To make facts comparable: by arranging the data according to the points of similarity and dissimilarity,
classification helps in comparison.
(4) To bring out relationships: classification helps in finding cause-and-effect relationships in the data. For
example, based on the literacy and criminal tendency of a group of people, it can be established whether literacy
has any impact on criminal tendency or not.
(5) To prepare ground for tabulation: tabulation is the basis of statistical analysis and classiÞcation is the
basis for tabulation.
It concludes that classiÞcation occupies an important place in the process of statistical investigation.
The fact is that the process of tabulation, presentation and analysis canÕt even be shorted without
classiÞcation.
• (2) Exhaustive and mutually exclusive: the classification should be exhaustive (covering all cases), and no item should find a place in more than one class. For example, if the students of a college are classified into three groups (urban, rural and hostlers), the classification is not mutually exclusive, because among hostlers some may be urban and others rural.
• (3) Stability: the classification of data into various classes must remain stable over the period of investigation.
• (4) Suitability: the classification should conform to the objectives of the enquiry. For example, to study the relationship between sex and university education, there is no need to classify on the basis of age and religion.
• (5) Flexibility: a good classification should be flexible, so that adjustments may easily be made to the classes as situations change. An ideal classification is one that can adjust itself to these changes and yet retains its stability.
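Classification by class intervals, as described above, can be sketched in a few lines of Python. The marks and the class width below are illustrative only, not taken from the notes:

```python
# A minimal sketch of quantitative classification: grouping raw values
# into class intervals of a fixed width (illustrative data).
def classify(values, width=10):
    classes = {}
    for v in values:
        lower = (v // width) * width          # lower limit of the class
        key = (lower, lower + width)          # class interval, e.g. (10, 20)
        classes.setdefault(key, []).append(v)
    return classes

marks = [12, 27, 35, 18, 22, 39, 31, 8]
for (lo, hi), items in sorted(classify(marks).items()):
    print(f"{lo}-{hi}: {len(items)} items")
```

Each raw value ends up in exactly one class, which is precisely the "exhaustive and mutually exclusive" requirement above.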
METHODS OF CLASSIFICATION
There are 4 methods of classification: geographical, chronological, qualitative and quantitative.
TABULATION
Tabulation is the next step after classification of the data and is designed to summarize a large amount of information in a simple manner. In common language, tabulation is the process of arranging data systematically in the form of rows and columns. According to Blair, "Tabulation in its broadest sense is any orderly arrangement of data in columns and rows."
OBJECTIVES OF TABULATION
1. To simplify complex data
2. To facilitate comparison
3. To economize space
4. To facilitate presentation
5. To help in reference
Classification arranges the data into different classes; tabulation arranges it into rows and columns for presentation.
Q1. What do you mean by Graphical Representation of data? Explain different ways of
representing the data graphically.
Graphical Representation is a way of analysing numerical data. It exhibits the relation between data,
ideas, information and concepts in a diagram. It is easy to understand and it is one of the most
important learning strategies. It always depends on the type of information in a particular domain.
There are different types of graphical representation. Some of them are as follows:
• Line Graphs - A line (linear) graph displays continuous data and is useful for predicting future events over time.
• Bar Graphs - A bar graph displays categorical data, comparing quantities using solid bars.
• Histograms - A graph that uses bars to represent the frequency of numerical data organised into intervals. Since all the intervals are equal and continuous, all the bars have the same width.
• Line Plot - Shows the frequency of data on a given number line. An 'x' is placed above the number line each time that value occurs.
• Frequency Table - A table showing the number of data values that fall within each given interval.
• Circle Graph - Also known as the pie chart; it shows the relationship of the parts to the whole. The circle represents 100%, and each category is represented by its specific percentage, such as 15%, 56%, etc.
• Stem and Leaf Plot - The data are organised from least value to greatest value. The digits of the least place value form the leaves, and the next place value digits form the stems.
• Box and Whisker Plot - Summarises the data by dividing it into four parts. The box and whiskers show the range (spread) and the middle (median) of the data.
(Note: When the collected raw data are placed in ascending or descending order of magnitude, the arrangement is known as an array, or arrayed data.)
There are certain rules to effectively present the information in the graphical representation. They
are:
¥ Suitable Title: Make sure that the appropriate title is given to the graph which indicates the
subject of the presentation.
¥ Measurement Unit: Mention the measurement unit in the graph.
¥ Proper Scale: To represent the data in an accurate manner, choose a proper scale.
• Index: Provide an index of the colours, shades, lines and designs used in the graph for better understanding.
¥ Data Sources: Include the source of information wherever it is necessary at the bottom of the
graph.
¥ Keep it Simple: Construct a graph in an easy way that everyone can understand.
¥ Neat: Choose the correct size, fonts, colours etc in such a way that the graph should be a
visual aid for the presentation of information.
Graphical Representation in Maths
In Mathematics, a graph is defined as a chart of statistical data represented in the form of curves or lines drawn across coordinate points plotted on its surface. It helps to study the relationship between two variables, measuring the change in one variable with respect to another within a given interval of time. It also helps to study series distributions and frequency distributions for a given problem. Graphs depict information visually in two broad ways, described below.
Algebraic principles apply to all types of graphical representation of data. In graphs, data are represented using two lines called coordinate axes. The horizontal axis is the x-axis and the vertical axis is the y-axis. The point at which the two lines intersect is called the origin 'O'. On the x-axis, distances to the right of the origin take positive values and distances to the left take negative values. Similarly, on the y-axis, points above the origin take positive values and points below the origin take negative values.
¥ Histogram
¥ Smoothed frequency graph
¥ Pie diagram
¥ Cumulative or ogive frequency graph
¥ Frequency Polygon
Here are the steps to construct a frequency polygon from a frequency distribution:
• Obtain the frequency distribution and find the midpoint of each class interval.
¥ Represent the midpoints along x-axis and frequencies along the y-axis.
¥ Plot the points corresponding to the frequency at each midpoint.
¥ Join these points, using lines in order.
¥ To complete the polygon, join the point at each end immediately to the lower or higher class
marks on the x-axis.
Example: Draw a frequency polygon for the following frequency distribution.
Class interval: 10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90
Frequency:       4      6      8     10     12     14      7      5
Solution:
Mark the class intervals along the x-axis and the frequencies along the y-axis.
Assume a class interval 0-10 with frequency zero and 90-100 with frequency zero.
Class interval   Midpoint   Frequency
0-10                5           0
10-20              15           4
20-30              25           6
30-40              35           8
40-50              45          10
50-60              55          12
60-70              65          14
70-80              75           7
80-90              85           5
90-100             95           0
Using the midpoint and the frequency value from the above table, plot the points A (5, 0), B (15, 4),
C (25, 6), D (35, 8), E (45, 10), F (55, 12), G (65, 14), H (75, 7), I (85, 5) and J (95, 0).
To obtain the frequency polygon ABCDEFGHIJ, draw the line segments AB, BC, CD, DE, EF, FG,
GH, HI, IJ, and connect all the points.
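The midpoint-and-frequency pairing above can be sketched in a few lines of Python (no plotting library needed to verify the points themselves):

```python
# Compute the frequency-polygon points: class midpoints paired with
# frequencies, padding a zero-frequency class at each end.
intervals = [(10, 20), (20, 30), (30, 40), (40, 50),
             (50, 60), (60, 70), (70, 80), (80, 90)]
freqs = [4, 6, 8, 10, 12, 14, 7, 5]

# Assumed classes 0-10 and 90-100, each with frequency zero.
intervals = [(0, 10)] + intervals + [(90, 100)]
freqs = [0] + freqs + [0]

points = [((lo + hi) / 2, f) for (lo, hi), f in zip(intervals, freqs)]
print(points)  # A(5, 0), B(15, 4), ..., J(95, 0)
```

Joining these points in order with line segments gives the polygon ABCDEFGHIJ described above.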
Advantages of graphical representation:
• It saves time.
• It makes the comparison of data more efficient.
Arithmetic Mean
For raw data, the arithmetic mean of a series of numbers is the sum of all observations divided by the number of observations. Thus if x1, x2, ..., xn represent the values of n observations, then the arithmetic mean (A.M.) for n observations is (direct method):
x̄ = (x1 + x2 + ... + xn) / n = Σxi / n
Example 5.1
The following data represent the number of books issued in a school library on 7 selected days: 7, 9, 12, 15, 5, 4, 11. Find the mean number of books.
Solution:
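The direct-method computation for Example 5.1 can be checked in Python:

```python
# Direct method: mean = (sum of observations) / (number of observations).
books = [7, 9, 12, 15, 5, 4, 11]
mean = sum(books) / len(books)
print(mean)  # 9.0
```

So the mean number of books issued per day is 9.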
Under this method an assumed mean, or an arbitrary value (denoted by A), is used as the basis for calculating deviations (di) from the individual values. That is, if di = xi - A, then
x̄ = A + Σdi / n
Example 5.2
A studentÕs marks in 5 subjects are 75, 68, 80, 92, 56. Find the average of his marks.
Solution:
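The assumed-mean (shortcut) method for Example 5.2 can be sketched as:

```python
# Shortcut method: pick an assumed mean A near the middle of the data,
# average the deviations di = xi - A, then add A back.
marks = [75, 68, 80, 92, 56]
A = 75                              # assumed mean
d = [x - A for x in marks]          # deviations: 0, -7, 5, 17, -19
mean = A + sum(d) / len(marks)
print(mean)  # 74.2
```

The direct method (sum 371 divided by 5) gives the same answer, 74.2.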
If x1, x2, ..., xn are discrete values with corresponding frequencies f1, f2, ..., fn, then the mean for discrete grouped data is defined as (direct method):
x̄ = Σfixi / Σfi
Example 5.3
A proofreader reads through a 73-page manuscript. The number of mistakes found on each page is summarized in the table below. Determine the mean number of mistakes found per page.
Solution:
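The mistakes table for Example 5.3 is not reproduced in these notes, so the counts below are illustrative only, but the computation x̄ = Σfx / Σf is the one the example calls for:

```python
# Mean for discrete grouped data: x_bar = sum(f*x) / sum(f).
# Hypothetical frequency table: mistakes per page vs. number of pages.
x = [0, 1, 2, 3, 4]        # mistakes per page (hypothetical)
f = [40, 20, 8, 3, 2]      # number of pages (hypothetical, total 73)
mean = sum(fi * xi for fi, xi in zip(f, x)) / sum(f)
print(round(mean, 3))
```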
For the computation of the A.M. for continuous grouped data, we can use the direct method or the shortcut method.
Direct Method:
The formula is x̄ = Σfixi / Σfi, where xi is the midpoint of the i-th class.
Example 5.4
Solution :
Direct Method:
Merits
• It is well defined.
Limitations
• It cannot be determined for qualitative data such as beauty, honesty etc.
When to use?
The arithmetic mean is the best representative of the data if the data set is homogeneous. On the other hand, if the data set is heterogeneous, the result may be misleading and may not represent the data.
The arithmetic mean, as discussed earlier, gives equal importance (or weight) to each observation in the data set. However, there are situations in which the values of individual observations are not of equal importance. Under these circumstances, we may attach to each observation value a weight as an indicator of its importance. The weighted mean is useful for:
• Comparison of the results of two or more groups where the number of items in the groups differs.
Example 5.5
Calculate the weighted average score of the student who scored marks as given in the table
Solution:
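The marks table for Example 5.5 is not reproduced in these notes, so the marks and weights below are hypothetical; the computation is the weighted mean x̄w = Σ(w·x) / Σw:

```python
# Weighted mean: each mark contributes in proportion to its weight
# (hypothetical marks and credit-weights, for illustration only).
marks   = [60, 75, 63, 59, 55]
weights = [2, 4, 3, 3, 2]
wmean = sum(w * x for w, x in zip(weights, marks)) / sum(weights)
print(wmean)  # 64.0
```

With equal weights this reduces to the ordinary arithmetic mean.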
Combined Mean:
Let x̄1 and x̄2 be the arithmetic means of two groups (having the same unit of measurement of a variable), based on n1 and n2 observations respectively. Then the combined mean can be calculated using
x̄ = (n1 x̄1 + n2 x̄2) / (n1 + n2)
Example 5.6
A class consists of 4 boys and 3 girls. The average marks obtained by the boys and girls are 20 and
30 respectively. Find the class average.
Solution:
Median
Median is the value of the variable which divides the whole set of data into two equal parts. It is
the value such that in a set of observations, 50% observations are above and 50% observations
are below it. Hence the median is a positional average.
In this case, the data are arranged in either ascending or descending order of magnitude.
(i) If the number of observations n is odd, then the median is the numerical value of x at the (n+1)/2-th position in the ordered observations.
(ii) If the number of observations n is even, then the median is defined as the arithmetic mean of the two middle values in the array, i.e., the mean of the (n/2)-th and (n/2 + 1)-th values.
Example 5.14
The numbers of rooms in seven five-star hotels in Chennai city are 71, 30, 61, 59, 31, 40 and 29. Find the median number of rooms.
Solution:
Arrange the data in ascending order: 29, 30, 31, 40, 59, 61, 71
n = 7 (odd), so the median is the (7+1)/2 = 4th value.
Median = 40 rooms
Example 5.15
The export of an agricultural product in million dollars from a country during eight quarters of 1974 and 1975 was recorded as 29.7, 16.6, 2.3, 14.1, 36.6, 18.7, 3.5, 21.3. Find the median export value.
Solution:
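For Example 5.15 the number of observations is even, so the median is the mean of the two middle values after sorting:

```python
# Median for an even number of observations: mean of the two middle
# values of the sorted data.
exports = [29.7, 16.6, 2.3, 14.1, 36.6, 18.7, 3.5, 21.3]
data = sorted(exports)     # 2.3, 3.5, 14.1, 16.6, 18.7, 21.3, 29.7, 36.6
n = len(data)
median = (data[n // 2 - 1] + data[n // 2]) / 2
print(median)
```

The two middle values are 16.6 and 18.7, so the median is 17.65 million dollars.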
Cumulative Frequency
In a grouped distribution, values are associated with frequencies. Cumulative frequencies are calculated to know the total number of items above or below a certain limit. They are obtained by adding the frequencies successively up to the required level. These cumulative frequencies are useful for calculating the median, quartiles, deciles and percentiles.
iv. The value of x corresponding to the cumulative frequency just greater than (N+1)/2 is the median.
Example 5.16
The following data are the weights of students in a class. Find the median weights of the students
Solution:
The cumulative frequency just greater than 30.5 is 38. The value of x corresponding to 38 is 40. The median weight of the students is 40 kg.
In this case, the data are given in the form of a frequency table with class intervals. The following formula is used to calculate the median:
Median = l + ((N/2 - cf) / f) × c
where l = lower limit of the median class, N = total frequency, cf = cumulative frequency of the class preceding the median class, f = frequency of the median class, and c = width of the median class.
From the formula, it is clear that one has to find the median class first. The median class is the class corresponding to the cumulative frequency just greater than N/2.
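The grouped-data median formula Median = l + ((N/2 - cf)/f) × c can be sketched as follows. The class intervals and frequencies below are illustrative, not taken from Example 5.17:

```python
# Grouped-data median: locate the median class (cumulative frequency
# just >= N/2), then interpolate within it (illustrative data).
intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 12, 9, 6]

N = sum(freqs)                       # 40, so N/2 = 20
cum, cf, idx = 0, 0, 0
for i, f in enumerate(freqs):
    if cum + f >= N / 2:             # first class reaching N/2
        idx, cf = i, cum             # cf = cum. freq. before this class
        break
    cum += f

l, f = intervals[idx][0], freqs[idx]
c = intervals[idx][1] - intervals[idx][0]
median = l + (N / 2 - cf) / f * c
print(round(median, 2))
```

Here the median class is 20-30 (cumulative frequencies 5, 13, 25), giving 20 + (20 - 13)/12 × 10 ≈ 25.83.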
Example 5.17
The following data attained from a garden records of certain period Calculate the median weight
of the apple
Solution:
Example 5.18
Solution:
We are given upper limits and less-than cumulative frequencies. First find the class intervals and the frequencies. Since the values increase by 10, the width of each class interval is 10.
Example 5.19
The following is the marks obtained by 140 students in a college. Find the median marks
Solution:
The median can be located with the help of the cumulative frequency curve, or 'ogive'.
Step 1: The class intervals are represented on the horizontal axis (x-axis).
Step 4: A horizontal straight line is drawn from the value N/2 or (N+1)/2 on the y-axis, parallel to the x-axis, to meet the ogive (depending on whether N is even or odd).
Example 5.20
Draw ogive curves for the following frequency distribution and determine
the median.
Solution:
Merits
• It can be easily located even if the class intervals in the series are unequal.
Limitations
• It does not take into account the actual values of the items in the series.
Mode
According to Croxton and Cowden, 'The mode of a distribution is the value at the point around which the items tend to be most heavily concentrated.'
In a survey of vehicle traffic on a busy road at a particular period of time, we may observe that the number of two-wheelers is greater than that of cars, buses and other vehicles. Because of the higher frequency, we say that the modal value of this survey is 'two-wheelers'.
Mode is defined as the value which occurs most frequently in a data set. A frequency distribution may have two or more modes.
Computation of mode:
The mode is the value which occurs most frequently in a data set.
Example 5.21
The following are the marks scored by 20 students in the class. Find the
mode 90, 70, 50, 30, 40, 86, 65, 73, 68, 90, 90, 10, 73, 25, 35, 88, 67, 80, 74,
46
Solution:
Since the mark 90 occurs the maximum number of times (three times, compared with the other values), the mode is 90.
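The mode of Example 5.21 can be found with a frequency counter:

```python
# Mode: the most frequent value in the data set.
from collections import Counter

marks = [90, 70, 50, 30, 40, 86, 65, 73, 68, 90,
         90, 10, 73, 25, 35, 88, 67, 80, 74, 46]
mode, freq = Counter(marks).most_common(1)[0]
print(mode, freq)  # 90 3
```

`most_common(1)` returns the value with the highest count; checking for ties in the counts would reveal a bimodal or multimodal data set.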
Example 5.22
The sugar levels of patients checked by a doctor are given below. Find the modal value of the sugar levels: 80, 112, 110, 115, 124, 130, 100, 90, 150, 180
Solution:
Since no value is repeated, this data set has no mode.
Example 5.23
Solution:
Here, the observations 10 and 12 each occur twice in the data set, so the modes are 10 and 12.
Example 5.24
Solution:
The mode or modal value of the distribution is that value of the variate for which the frequency is maximum. It is the value around which the items or observations tend to be most heavily concentrated. The mode is computed by the formula
Mode = l + ((f1 - f0) / (2f1 - f0 - f2)) × c
where l = lower limit of the modal class, f1 = frequency of the modal class, f0 = frequency of the preceding class, f2 = frequency of the succeeding class, and c = width of the modal class.
Example 5.25
The following data relates to the daily income of families in an urban area.
Find the modal income of the families.
Solution:
ii. If the maximum frequency occurs at the beginning or at the end of the distribution, the modal class is found by the method of grouping.
iii. Leave the 1st frequency, combine the remaining frequencies two by two, and write them in column III.
vi. Leave the 1st and 2nd frequencies, combine the remaining frequencies three by three, and write them in column VI.
Mark the highest frequency in each column. Then form an analysis table to find the modal class. After finding the modal class, use the formula to calculate the modal value.
Example 5.26
Solution:
Analysis Table:
ii. Join the top right corner of the highest rectangle (the modal class rectangle) by a straight line to the top right corner of the preceding rectangle. Similarly, join the top left corner of the highest rectangle to the top left corner of the succeeding rectangle.
iii. From the point of intersection of these two lines, draw a perpendicular to the x-axis, meeting it at M. The value at M is the mode.
Example 5.27
Locate the modal value graphically for the following frequency distribution
Solution:
Merits of Mode:
Demerits of Mode:
• The mode cannot be calculated for a series with unequal class intervals.
Measures of Dispersion
The following data provide the runs scored by two batsmen in the last 10 matches.
Batsman B: 33, 50, 47, 38, 45, 40, 36, 48, 37, 26
The means of both data sets are the same (40), but the scores differ significantly.
From the diagrams, we see that the runs of batsman B are grouped around the mean, while the runs of batsman A are scattered from 0 to 100, even though both have the same mean.
Thus, some additional statistical information is required to determine how the values are spread in the data. For this, we discuss Measures of Dispersion.
Dispersion is a measure which gives an idea of the scatteredness of the values.
1. Range
2. Mean deviation
3. Quartile deviation
4. Standard deviation
5. Variance
6. CoefÞcient of Variation
1. Range
The difference between the largest value and the smallest value is called Range.
Range R = L Ð S
Example 8.1 Find the range and coefficient of range of the following data: 25, 67, 48, 53, 18, 39, 44.
Solution
Largest value L = 67; smallest value S = 18
Range R = L - S = 67 - 18 = 49
Coefficient of range = (L - S) / (L + S) = 49 / 85 = 0.576
Example 8.2 Largest value L = 28; smallest value S = 18
Range R = L - S = 28 - 18 = 10 years
Example 8.3 The range of a set of data is 13.67 and the largest value is 70.08. Find the smallest value.
Solution
Range R = 13.67
Range R = L - S
13.67 = 70.08 - S
S = 70.08 - 13.67 = 56.41
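The range computations above can be checked in Python using the Example 8.1 data:

```python
# Range R = L - S and coefficient of range = (L - S) / (L + S).
data = [25, 67, 48, 53, 18, 39, 44]
L, S = max(data), min(data)
R = L - S
coeff = (L - S) / (L + S)
print(R, round(coeff, 3))  # 49 0.576
```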
For given data with n observations x1, x2, ..., xn, the deviations from the mean x̄ are
x1 - x̄, x2 - x̄, ..., xn - x̄
and the squares of the deviations from the mean are
(x1 - x̄)², (x2 - x̄)², ..., (xn - x̄)².
Note: if the deviations from the mean (xi - x̄) are small, then the squares of the deviations will be very small.
4. Variance
The mean of the squares of the deviations from the mean is called the Variance:
Variance σ² = Σ(xi - x̄)² / n
5. Standard Deviation
The positive square root of the Variance is called the Standard Deviation. That is, the standard deviation is the positive square root of the mean of the squares of the deviations of the given values from their mean. It is denoted by σ.
The standard deviation gives a clear idea of how far the values spread or deviate from the mean.
Note
• If the data values are given directly, then to find the standard deviation we can use the formula σ = √(Σ(xi - x̄)² / n).
• If the data values are not given directly but the squares of the deviations from the mean of each observation are given, then we can use the formula σ = √(Σdi² / n), where di = xi - x̄.
Example 8.4 The numbers of televisions sold on each day of a week are 13, 8, 4, 9, 7, 12, 10. Find the standard deviation.
Solution
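The variance and standard deviation for Example 8.4 can be computed directly from their definitions:

```python
# Variance = mean of squared deviations from the mean; sigma = its
# positive square root.
sales = [13, 8, 4, 9, 7, 12, 10]
mean = sum(sales) / len(sales)                              # 9.0
variance = sum((x - mean) ** 2 for x in sales) / len(sales)
sigma = variance ** 0.5
print(variance, round(sigma, 2))  # 8.0 2.83
```

The squared deviations are 16, 1, 25, 0, 4, 9, 1, summing to 56, so the variance is 56/7 = 8 and σ = √8 ≈ 2.83.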
Another convenient way of finding the standard deviation is to use the formula σ = √(Σxi²/n - (Σxi/n)²).
Example 8.5 The amount of rainfall in a particular season for 6 days are given as 17.8
cm, 19.2 cm, 16.3 cm, 12.5 cm, 12.8 cm and 11.4 cm. Find its standard deviation.
Solution Arranging the numbers in ascending order we get, 11.4, 12.5, 12.8, 16.3, 17.8,
19.2.
Number of observations n = 6
When the mean value is not an integer (calculations are very tedious in decimal form), it is better to use the assumed mean method to find the standard deviation.
Let x1, x2, x3, ..., xn be the given data values and let x̄ be their mean.
Let di be the deviation of xi from the assumed mean A, which is usually the middle value or a value near the middle of the given data.
di = xi - A, so Σdi = Σxi - n·A, and
σ = √(Σdi²/n - (Σdi/n)²)
Example 8.6 The marks scored by 10 students in a class test are 25, 29, 30, 33, 35, 37,
38, 40, 44, 48. Find the standard deviation.
Solution The mean of the marks is 35.9, which is not an integer. Hence we take the assumed mean A = 35, n = 10.
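The assumed-mean computation for Example 8.6 in Python:

```python
# Assumed-mean method: sigma = sqrt( sum(d^2)/n - (sum(d)/n)^2 ),
# where d = x - A.
marks = [25, 29, 30, 33, 35, 37, 38, 40, 44, 48]
A, n = 35, len(marks)
d = [x - A for x in marks]
sigma = (sum(di ** 2 for di in d) / n - (sum(d) / n) ** 2) ** 0.5
print(round(sigma, 2))  # 6.67
```

Here Σd = 9 and Σd² = 453, so σ = √(45.3 - 0.81) = √44.49 ≈ 6.67.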
Let x1, x2, x3, ..., xn be the given data. Let A be the assumed mean.
Note
We can use any of the above methods for finding the standard deviation.
Example 8.7 The amounts that the children spent on purchasing eatables during a one-day school trip are 5, 10, 15, 20, 25, 30, 35, 40. Using the step deviation method, find the standard deviation of the amount they spent.
Solution We note that all the observations are divisible by 5. Hence we can use the step
deviation method. Let the Assumed mean A = 20, n = 8.
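The step-deviation computation for Example 8.7 in Python:

```python
# Step-deviation method: d = (x - A)/c, and
# sigma = c * sqrt( sum(d^2)/n - (sum(d)/n)^2 ).
amounts = [5, 10, 15, 20, 25, 30, 35, 40]
A, c, n = 20, 5, len(amounts)
d = [(x - A) // c for x in amounts]        # -3, -2, -1, 0, 1, 2, 3, 4
sigma = c * (sum(di ** 2 for di in d) / n - (sum(d) / n) ** 2) ** 0.5
print(round(sigma, 2))  # 11.46
```

Here Σd = 4 and Σd² = 44, so σ = 5·√(5.5 - 0.25) = 5·√5.25 ≈ 11.46, matching the direct computation on the raw values.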
Example 8.8 Find the standard deviation of the following data: 7, 4, 8, 10, 11. Add 3 to all the values and then find the standard deviation of the new values.
When we add 3 to all the values, we get the new values as 7,10,11,13,14.
From the above, we see that the standard deviation does not change when we add a fixed constant to all the values.
Example 8.9 Find the standard deviation of the data 2, 3, 5, 7, 8. Multiply each data by
4. Find the standard deviation of the new values.
Solution Given, n = 5
When we multiply each data by 4, we get the new values as 8, 12, 20, 28, 32.
From the above, we see that when we multiply each value by 4, the standard deviation also gets multiplied by 4.
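Both properties from Examples 8.8 and 8.9 can be verified numerically:

```python
# Adding a constant leaves sigma unchanged; multiplying every value
# by k multiplies sigma by k.
def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

a = [7, 4, 8, 10, 11]
b = [2, 3, 5, 7, 8]
print(round(sd(a), 2), round(sd([x + 3 for x in a]), 2))  # equal
print(round(sd(b), 2), round(sd([4 * x for x in b]), 2))  # second is 4x first
```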
Example 8.10 Find the mean and variance of the first n natural numbers.
Solution
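The standard results for Example 8.10 are mean = (n+1)/2 and variance = (n² - 1)/12; they can be checked numerically, here for n = 10:

```python
# First n natural numbers: mean = (n+1)/2, variance = (n^2 - 1)/12.
n = 10
xs = list(range(1, n + 1))
mean = sum(xs) / n
variance = sum((x - mean) ** 2 for x in xs) / n
print(mean, variance)  # 5.5 8.25
```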
Example 8.11
48 students were asked to write the total number of hours per week they spent watching television. With this information find the standard deviation of the hours spent watching television.
Solution
Let x1, x2, x3, ..., xn be the given data with frequencies f1, f2, f3, ..., fn respectively. Let x̄ be their mean and A be the assumed mean.
Example 8.12
The marks scored by the students in a slip test are given below.
Solution
To make the calculation simple, we provide the following formula. Let A be the
assumed mean, xi be the middle value of the ith class and c is the width of the class
interval.
Example 8.13
Solution
Example 8.14
Solution
Module-II
Probability-Axioms-Addition and Multiplication Rule-Types of Probability-Probability Tree-Bayes'
Theorem
PROBABILITY
INTRODUCTION:
Probability theory was originated from gambling theory. A large number of problems exist even
today which are based on the game of chance, such as coin tossing, dice throwing and playing
cards.
• RANDOM EXPERIMENTS:
Experiments of any type where the outcome cannot be predicted are called random
experiments.
• SAMPLE SPACE:
A set of all possible outcomes from an experiment is called a sample space.
Eg: Consider a random experiment E of throwing 2 coins at a time. The possible outcomes are
HH, TT, HT, TH.
These 4 outcomes constitute a sample space denoted by, S ={ HH, TT, HT, TH}.
Consider an experiment of throwing a coin. When tossing a coin, we may get a head (H) or a tail (T). Here tossing the coin is a trial, and getting a head or a tail is an event.
In other words, "Every non-empty subset A of the sample space S is called an event."
• NULL EVENT:
An event having no sample point is called a null event and is denoted by ∅.
• EXHAUSTIVE EVENTS:
The total number of possible outcomes in any trial is known as the exhaustive events.
Eg: In throwing a die the possible outcomes are 1, 2, 3, 4, 5 or 6. Hence we have
6 exhaustive events in throwing a die.
• FAVOURABLE EVENTS:
By the mathematical (classical, or a priori) definition of probability,
Probability (of happening of an event E) = P(E) = (Number of favourable cases) / (Total number of exhaustive cases) = m / n
where m = number of favourable cases and n = total number of exhaustive cases.
On tossing a coin we say that the probability of occurrence of head and of tail is 1/2 each. Basically, here we are assigning the probability value of 1/2 for the occurrence of each event.
Now, say P(H) = 5/8 and P(T) = 3/8.
For this, let us again check the basic initial conditions of the axiomatic approach of probability:
• Each value is neither less than zero nor greater than 1, and
• The sum of the probabilities of occurrence of head and tail is 1.
Hence this sort of probability assignment also satisfies the axiomatic approach of probability. Thus, we can conclude that there can be infinitely many ways to assign probabilities to the outcomes of an experiment.
Rule of Addition
The rule of addition (also known as the "OR" rule) states that the probability of two
or more mutually exclusive events occurring is the sum of the probabilities of the
individual events occurring.
Example 1: if you have a coin and you want to know the probability of it landing on
heads "or" tails, then the answer would be 1/2 + 1/2 = 1. This means that there is a
100% chance that either heads or tails will occur.
Example 2: If you have two events, A and B, and the probability of event A occurring
is 0.40 and the probability of event B occurring is 0.30, the probability of events A
"or" B occurring is 0.40 + 0.30 = 0.70.
The above two examples apply when events are mutually exclusive, which means
that they cannot happen at the same time. In this case, the rule of addition says that
the probability of either event happening is the sum of the probabilities of each event
happening individually.
On the other hand, if events are not mutually exclusive, it means that they can happen
at the same time. In this case, the rule of addition says that the probability of either
event happening is the sum of each event's probabilities minus the probability of both
events happening simultaneously.
Rule of Addition
The probability that Event A or Event B occurs
= Probability that Event A occurs
+ Probability that Event B occurs
- Probability that both Events A and B occur
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
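The general addition rule can be sketched with the numbers from Example 2, additionally assuming A and B are independent so that P(A and B) = P(A)·P(B):

```python
# Addition rule for non-mutually-exclusive events:
# P(A or B) = P(A) + P(B) - P(A and B).
p_a, p_b = 0.40, 0.30
p_a_and_b = p_a * p_b            # 0.12, assuming A and B independent
p_a_or_b = p_a + p_b - p_a_and_b
print(round(p_a_or_b, 2))  # 0.58
```

Note this is smaller than the 0.70 obtained earlier for mutually exclusive events, because the overlap is no longer double-counted.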
Rule of Multiplication:
The multiplication rule (also known as the "AND" rule) states that the probability of
two independent events occurring together is equal to the product of their individual
probabilities.
Example 4: If you have two events A and B, and the probability of event A occurring is 0.40 and the probability of event B occurring is 0.30, then the probability of events A "and" B occurring simultaneously is 0.40 * 0.30 = 0.12. This is because the probability of both independent events occurring simultaneously is the product of the probabilities of the individual events.
Example 5: If you want to calculate the probability of getting a head on the first coin flip and a tail on the second coin flip, you will use the rule of multiplication to determine that the probability is 0.25: the probability of getting heads on the first flip is 0.50, the probability of getting tails on the second flip is also 0.50, and the probability of both events occurring simultaneously is 0.50 * 0.50 = 0.25.
Example 6: Suppose you have a bag containing 3 red balls and 2 green balls. If you want to find the probability of drawing a red ball (then putting it back in the bag: with replacement) and then drawing a green ball on the second draw, you would use the rule of multiplication:
P(red AND green) = P(red) * P(green) = (3/5) * (2/5) = 6/25 = 0.24
Example 7: Suppose you have a bag containing 3 red balls and 2 green balls. If you want to find the probability of drawing a red ball and then a green ball on the second draw (without replacement), you would use the rule of multiplication:
P(red AND green) = P(red) * P(green|red) = (3/5) * (2/4) = 6/20 = 0.30
In the above formula, P(green | red) means the probability of getting a green ball "provided" the first event (getting a red ball) has already happened. This is called conditional probability.
This means that the probability of drawing a red ball and then a green ball without
replacement is 0.30, or 30%.
Please note that in this example, the probability of drawing a red ball in the first selection DOES affect the probability of drawing a green ball in the second pick, as the first selection (a red ball) is NOT put back in the bag. This reduces the total number of balls in the bag to 4 (2 red and 2 green).
In this example, the two events are dependent events, which means that the
occurrence of one event affects the probability of the other event occurring.
More generally, the multiplication rule states that the probability of both events occurring equals the probability of the first event multiplied by the conditional probability of the second event given the first; for independent events this reduces to the product of the two probabilities.
Rule of Multiplication:
The probability that Events A and B both occur
= Probability that Event A occurs
× Probability that Event B occurs, given that A has occurred
P(A ∩ B) = P(A) P(B|A)
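The two multiplication scenarios from Examples 6 and 7 (drawing a red then a green ball from a bag of 3 red and 2 green) can be computed exactly with fractions:

```python
# Multiplication rule with and without replacement.
from fractions import Fraction

p_with = Fraction(3, 5) * Fraction(2, 5)     # ball replaced: 6/25
p_without = Fraction(3, 5) * Fraction(2, 4)  # not replaced: 6/20
print(float(p_with), float(p_without))  # 0.24 0.3
```

Without replacement, the second factor is the conditional probability P(green | red) = 2/4, since one red ball has left the bag.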
Summary:
¥ The rule of addition for mutually exclusive events: P(A or B) = P(A) + P(B)
¥ The rule of addition for non-mutually exclusive events: P(A or B) = P(A) +
P(B) - P(A and B)
• The rule of multiplication for dependent events: P(A and B) = P(A) * P(B|A)
• The rule of multiplication for independent events: P(A and B) = P(A) * P(B)
1. Classical
The classical or theoretical perspective on probability states that in an experiment where there are X equally likely outcomes, and event Y has exactly Z of these outcomes, then the probability of Y is Z/X, or P(Y) = Z/X. This is often the first perspective that students experience in formal education. For example, when rolling a fair die there are six equally likely outcomes, so you can say there is a 1/6 probability of rolling each number.
The advantage of this perspective is that it is conceptually simple for many situations. However, it has limits, since many situations do not have finitely many equally likely outcomes. For example, a weighted die has a finite number of outcomes that are not equally likely, and studying employee incomes over many years and into the future involves an infinite number of possible outcomes for the maximum income.
2. Empirical
The empirical or experimental perspective defines probability through repeated experiment. For example, if you are rolling a weighted die but you don't know which side carries the weight, you can estimate the probability of each outcome by rolling the die an enormous number of times and calculating the proportion of times the die gives that outcome.
The formal way to deÞne this perspective is P(A) = the limit as C approaches inÞnity
of B/C. Where A is the probability of the event, B is the number of times the event A
happens and C is the number of times you perform the process, like rolling a die or
tossing a coin.
Another way to think of this is to imagine tossing a coin 100 times, and then continuing to 10,000 times. Each time you toss the coin, the relative frequency you observe becomes a better approximation of the theoretical probability of the event. Over the first 100 tosses your observed proportion of heads might be 1/3, but as the number of tosses grows the proportion will approach 1/2, the theoretical probability.
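The convergence of relative frequency to theoretical probability can be sketched with a short simulation (the seed is arbitrary, chosen only for repeatability):

```python
# Empirical probability: estimate P(heads) for a fair coin by the
# relative frequency over many simulated tosses.
import random

random.seed(42)
tosses = 100_000
heads = sum(random.random() < 0.5 for _ in range(tosses))
estimate = heads / tosses
print(round(estimate, 2))  # close to the theoretical 0.5
```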
3. Subjective
The subjective perspective on probability considers a person's own personal belief or
judgment that an event will happen. For example, an investor may have an educated
sense of the market and intuitively talk about the probability of a certain stock going
up tomorrow. You can rationally understand how that subjective view agrees with
theoretical or experimental views. In other words, it's the probability that what a
person is expecting to happen through their knowledge and feelings will actually be
the outcome, with no formal calculations.
For example, if a fan at a football game states that a particular team is going to win
the game, they are basing their decision on the team's past wins and losses, what they
know about the opposing team, facts they know about football and their opinions or
feelings about the game. They are not making a formal mathematical calculation.
4. Axiomatic
The axiomatic perspective on probability is a unifying perspective in which the coherence conditions used in theoretical and experimental probability also apply to subjective probability. You apply a set of rules, or axioms, formulated by Kolmogorov to all types of
probability. Mathematicians know them as Kolmogorov's three axioms. When using
axiomatic probability, you can quantify the chances of an event occurring or not
occurring.
You can use the three axioms with all the other probability perspectives. In this perspective, probability is any function from events to numbers satisfying the following three axioms:
• Zero is the smallest possible probability, and one is the largest.
• An event that is certain has a probability of one.
• For two mutually exclusive events (which cannot occur simultaneously), the probability of their union is the sum of their individual probabilities.
1.Weather
Meteorologists aren't able to predict the weather exactly, so they use instruments and tools to find the likelihood of snow, rain or other weather conditions. If there is a 30% chance of rain, the meteorologist has determined that it has rained on 30 out of 100 days with similar weather conditions. Because of the forecast, you use probability to decide whether to wear sandals or rain boots to work that morning.
2. Sports
Coaches and athletes frequently use probability to figure out the best sports strategies
for competitions and games. For example, if a football kicker makes 10 out of 15
field goals throughout the season, the probability of him scoring his next field goal is
10/15, or 2/3. Another example is a baseball coach calculating a player's batting
average to determine the lineup for a game. If a player has a .300 batting average,
that means he has gotten three hits out of every 10 at bats, and the probability of him
getting a base hit is 3/10.
3. Insurance
When analyzing insurance policies and considering deductible amounts, probability
plays an important role. For example, if 20 out of every 100 drivers in your area have
gotten hail damage in the last year, then when choosing your car insurance policy you
can use probability to understand that there's a 1/5 chance your car will get hail
damage. This significant probability may encourage you to get comprehensive cover
for hail damage and maybe even a lower deductible.
4. Games
When you play games with an element of luck or chance, like board games, card
games or video games, you often weigh the odds of a desirable event happening, like
getting the card you need or rolling a specific number on the die. The likelihood of
that favorable event happening helps you determine when to take a risk or how much
you're willing to risk. One example is poker players who know the probability of
getting certain hands, like that there's about a 42% chance of getting a pair versus
roughly a 2% chance of getting three of a kind.
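These game odds can be checked by simulation. The sketch below estimates, by Monte Carlo simulation, the chance of being dealt exactly one pair in a five-card hand (an illustrative approach, not from the text; the exact combinatorial answer is about 42.3%):

```python
import random
from collections import Counter

random.seed(42)  # make the simulation reproducible

# A 52-card deck reduced to ranks: 13 ranks x 4 copies (suits don't affect pairs)
DECK = [rank for rank in range(13) for _ in range(4)]

def is_one_pair(hand):
    """True if the hand contains exactly one pair and nothing better."""
    counts = sorted(Counter(hand).values())
    return counts == [1, 1, 1, 2]

TRIALS = 20_000
hits = sum(is_one_pair(random.sample(DECK, 5)) for _ in range(TRIALS))
estimate = hits / TRIALS
print(f"Estimated P(one pair) = {estimate:.3f}")  # close to 0.423
```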
The formula for the Bayes theorem can be written in a variety of ways. The following
is the most common version:
P(A | B) = P(B | A) × P(A) / P(B)
Here P(A) and P(B) are the probabilities of A and B occurring independently of one
another, P(A | B) is the probability of A given that B has occurred, and P(B | A) is the
probability of B given that A has occurred.
Problem 1: Three urns contain 6 red, 4 black; 4 red, 6 black; and 5 red, 5 black balls
respectively. One of the urns is selected at random and a ball is drawn from it. If the
ball drawn is red, find the probability that it is drawn from the first urn.
Solution: Let E1, E2, E3 and A be the events defined as follows: E1, E2, E3 = the
first, second, or third urn is chosen, and A = a red ball is drawn.
Since there are three urns and one of the three urns is chosen at random:
P(E1) = P(E2) = P(E3) = 1/3
If E1 has already occurred, then the first urn has been chosen, containing 6 red and 4
black balls. The probability of drawing a red ball from it is 6/10, so P(A | E1) = 6/10;
similarly, P(A | E2) = 4/10 and P(A | E3) = 5/10.
You are required to find P(E1 | A), i.e., given that the ball drawn is red, the
probability that it is drawn from the first urn. By Bayes' theorem:
P(E1 | A) = P(E1) P(A | E1) / [P(E1) P(A | E1) + P(E2) P(A | E2) + P(E3) P(A | E3)]
= (1/3 × 6/10) / (1/3 × 6/10 + 1/3 × 4/10 + 1/3 × 5/10) = 6/15 = 2/5
Problem 2: An insurance company insured 2000 scooter drivers, 4000 car drivers,
and 6000 truck drivers. The probabilities of an accident involving a scooter driver, a
car driver, and a truck driver are 0.01, 0.03, and 0.15 respectively. One of the insured
persons meets with an accident. What is the probability that he is a scooter driver?
Solution: Let E1, E2, E3 be the events that the insured person is a scooter, car, or
truck driver, and A the event that he meets with an accident.
P(E1) = 2000/12000 = ⅙
P(E2) = 4000/12000 = ⅓
P(E3) = 6000/12000 = ½
It is given that P(A | E1) = probability that a person meets with an accident given that
he is a scooter driver = 0.01
Similarly, you have P(A | E2) = 0.03 and P(A | E3) = 0.15
You are required to find P(E1 | A), i.e. given that the person meets with an accident,
the probability that he was a scooter driver. By Bayes' theorem:
P(E1 | A) = (1/6 × 0.01) / (1/6 × 0.01 + 1/3 × 0.03 + 1/2 × 0.15) = 0.01/0.52
= 1/52
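Both problems follow the same pattern, which can be captured in a short, general Bayes' theorem helper (an illustrative sketch; exact fractions are used so the answers 2/5 and 1/52 come out exactly):

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods):
    """Posterior probabilities for hypotheses E1..En given evidence A.

    priors[i]      = P(Ei)
    likelihoods[i] = P(A | Ei)
    Returns [P(E1|A), ..., P(En|A)].
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)  # P(A), by the law of total probability
    return [j / total for j in joint]

# Problem 1: three urns, red ball drawn
urns = bayes_posterior([Fraction(1, 3)] * 3,
                       [Fraction(6, 10), Fraction(4, 10), Fraction(5, 10)])
print(urns[0])  # 2/5

# Problem 2: scooter / car / truck drivers, accident observed
drivers = bayes_posterior([Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)],
                          [Fraction(1, 100), Fraction(3, 100), Fraction(15, 100)])
print(drivers[0])  # 1/52
```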
There are two main parts of a probability tree: the nodes and the branches. The nodes
can further be classified into a parent node and sibling nodes. The parent node
represents a certain event and has a probability of 1. The sibling nodes denote the
other possible events or outcomes. The branches denote the probability of occurrence
of these events. Suppose a fair coin is tossed once; then the probability tree can be
constructed as follows:
This is a simple probability tree and has two branches only. Here, the first node
represents the parent event of a coin being tossed. Head and tail are the two possible
outcomes, forming the sibling nodes. 0.5 is written on each branch and represents the
probability of occurrence of each sibling event.
Suppose a probability tree diagram needs to be constructed for flipping a fair coin
twice. This is an example of independent events, as the outcome of each coin toss is
independent of the previous flip. First, the probability tree diagram of a coin being
flipped once is drawn as given in the previous section. The next step is to extend it to
two coin tosses as follows:
The second set of branches represents the second coin toss. Thus, in total there are
four possible outcomes.
To calculate the probabilities of a series of events, multiply the probabilities along the
branches of the probability tree diagram. The total probability can be computed by
adding these probabilities and its value will always be equal to 1.
Some useful inferences can be made from the probability tree diagram as follows:
• The probability of getting the outcome (Head, Head) = 0.5 × 0.5 = 0.25.
Similarly, the probability of the other outcomes can be calculated.
• 0.25 + 0.25 + 0.25 + 0.25 = 1. This implies that on adding the probabilities of
each outcome, the total is equal to 1.
• By looking at the probability tree, the probability of getting exactly one head
can be calculated as 0.25 + 0.25 = 0.5.
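These tree inferences can be reproduced directly by multiplying probabilities along the branches, as in the small sketch below for two tosses of a fair coin:

```python
from itertools import product

# Each branch of the tree carries probability 0.5 for a fair coin
P = {"H": 0.5, "T": 0.5}

# Multiply probabilities along the branches for every two-toss path
outcomes = {a + b: P[a] * P[b] for a, b in product("HT", repeat=2)}
print(outcomes)  # {'HH': 0.25, 'HT': 0.25, 'TH': 0.25, 'TT': 0.25}

# Adding the leaf probabilities always gives 1
assert abs(sum(outcomes.values()) - 1.0) < 1e-12

# Probability of exactly one head: the HT and TH leaves
exactly_one = outcomes["HT"] + outcomes["TH"]
print(exactly_one)  # 0.5
```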
Module-III
Concept of Random variable-Probability Distributions-Expected value and Variance of random
Variable-Conditional expectations-Classical news Paper boys Problem,(EMV,EVPI)
Random Variable
A random variable is a variable that can take on many values. This is because there
can be several outcomes of a random occurrence. Thus, a random variable should not
be confused with an algebraic variable. An algebraic variable represents the value of
an unknown quantity in an algebraic equation that can be calculated. On the other
hand, a random variable can have a set of values that could be the resulting outcome
of a random experiment.
A random variable can be defined as a type of variable whose value depends upon the
numerical outcomes of a certain random phenomenon. It is also known as a stochastic
variable. Random variables are always real-valued, as they are required to be
measurable.
Random Variable Example
Suppose 2 dice are rolled and the random variable X is used to represent the sum of
the numbers. Then the smallest value of X will be 2 (1 + 1), while the highest value
will be 12 (6 + 6). Thus, X can take any value from 2 to 12 (inclusive). If
probabilities are attached to each outcome, then the probability distribution of X can
be determined.
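The probability distribution of X in this example can be tabulated directly by enumerating all 36 equally likely outcomes (a minimal sketch):

```python
from fractions import Fraction
from itertools import product

# Distribution of X = sum of two fair dice
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    s = a + b
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)

print(min(pmf), max(pmf))      # 2 12 (smallest and largest possible sums)
print(pmf[7])                  # 1/6 (seven is the most likely sum)
assert sum(pmf.values()) == 1  # the probabilities over all outcomes sum to 1
```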
Random variables can be divided into two broad categories depending upon the type
of data available. These are given as follows:
• Discrete random variable
• Continuous random variable
A probability mass function is used to describe a discrete random variable and a
probability density function describes a continuous random variable. The upcoming
sections will cover these topics in detail.
KEY TAKEAWAYS
• A random variable is a variable whose value is unknown or a function that
assigns values to each of an experiment's outcomes.
• A random variable can be either discrete (having specific values) or continuous
(any value in a continuous range).
• The use of random variables is most common in probability and statistics,
where they are used to quantify outcomes of random occurrences.
• Risk analysts use random variables to estimate the probability of an adverse
event occurring.
For instance, the probability of getting a 3, or P(Z = 3), when a die is thrown is 1/6,
and the same holds for each of the other five faces. Note that the sum of all six
probabilities is 1.
As an example of a continuous random variable, if Y represents the average height of
a random group of 25 people, the resulting outcome is a continuous figure, since a
height may be 5 ft or 5.01 ft or 5.0001 ft. Clearly, there is an infinite number of
possible values for height.
Now suppose two coins are tossed and Y counts the number of heads. The two coins
can land in four different ways: TT, HT, TH, and HH. Therefore P(Y = 0) = 1/4, since
there is one chance of getting no heads (i.e., two tails [TT] when the coins are
tossed). Similarly, the probability of getting two heads (HH) is also 1/4. Notice that
getting one head can occur in two ways: HT and TH. In this case, P(Y = 1) = 2/4 = 1/2.
Module-IV
Probability distributions-Binomial-Poisson-Normal
Probability Distribution
A probability distribution gives the probabilities of the possible outcomes of a
random event. It is defined over the underlying sample space, the set of possible
outcomes of the random experiment; these outcomes may be real numbers, vectors, or
other entities. It is a part of probability and statistics.
A random variable has a probability distribution, which defines the probability of its
unknown values. Random variables can be discrete, continuous, or a mixture of both.
A discrete random variable takes any of a designated finite or countable list of
values, and its distribution is described by a probability mass function. A continuous
random variable can take any numerical value in an interval or set of intervals, and
its distribution is described by a probability density function.
Two random variables with equal probability distributions can still differ in their
relationships with other random variables, or in whether they are independent of
them. A realization of a random variable, that is, the outcome of randomly choosing
a value according to the variable's probability distribution function, is called a
random variate.
Academics, financial analysts and fund managers alike may determine a particular
stock's probability distribution to evaluate the possible expected returns that the stock
may yield in the future. The stock's history of returns, which can be measured over
any time interval, will likely be composed of only a fraction of the stock's returns,
which subjects the analysis to sampling error. By increasing the sample size, this
error can be dramatically reduced.
KEY TAKEAWAYS
• A probability distribution depicts the expected outcomes of possible values for
a given data-generating process.
• Probability distributions come in many shapes with different characteristics, as
defined by the mean, standard deviation, skewness, and kurtosis.
• Investors use probability distributions to anticipate returns on assets such as
stocks over time and to hedge their risk.
The most commonly used distribution is the normal distribution, which appears
frequently in finance, investing, science, and engineering. The normal distribution is
fully characterized by its mean and standard deviation: the distribution is not skewed,
and its kurtosis is fixed. This makes the distribution symmetric, and it is depicted as a
bell-shaped curve when plotted. The standard normal distribution has a mean
(average) of zero and a standard deviation of 1.0, with a skew of zero and kurtosis = 3.
In a normal distribution, approximately 68% of the data collected will fall within
+/- one standard deviation of the mean; approximately 95% within +/- two standard
deviations; and 99.7% within +/- three standard deviations. Unlike the binomial
distribution, the normal distribution is continuous, meaning that all possible values
are represented (as opposed to just 0 and 1 with nothing in between).
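The 68-95-99.7 figures quoted above can be verified with Python's standard library `statistics.NormalDist` (the exact values are rounded):

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # the standard normal distribution

for k in (1, 2, 3):
    # P(-k sigma < X < +k sigma) from the cumulative distribution function
    coverage = z.cdf(k) - z.cdf(-k)
    print(f"within +/- {k} sd: {coverage:.4f}")
# within +/- 1 sd: 0.6827
# within +/- 2 sd: 0.9545
# within +/- 3 sd: 0.9973
```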
Probability distributions are often used in risk management as well to evaluate the
probability and amount of losses that an investment portfolio would incur based on a
distribution of historical returns. One popular risk management metric used in
investing is value-at-risk (VaR). VaR yields the minimum loss that can occur given a
probability and time frame for a portfolio. Alternatively, an investor can get a
probability of loss for an amount of loss and time frame using VaR. Misuse of and
overreliance on VaR have been implicated as one of the major causes of the 2008
financial crisis.
When rolling two dice, a sum of seven is the most likely outcome, since it can occur
in six ways (1+6, 6+1, 5+2, 2+5, 3+4, 4+3). Two and twelve, on the other hand, are
far less likely (1+1 and 6+6).
Discrete data, as the name suggests, can take only specified values. For example,
when you roll a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or 2.45.
Continuous data can take any value within a given range. The range may be finite or
infinite. For example, a girl's weight or height, or the length of a road: the weight of
a girl can be any value, such as 54 kg, 54.5 kg, or 54.5436 kg.
The Poisson distribution models the number of events occurring in a fixed interval of
time, where events happen at a known average rate and independently of the time
since the last event.
Here are some examples:
• Customers calling a help center: on average, say, 10 customers call in an hour.
A Poisson distribution can then be used to model the probability of a different
number of customers calling within an hour (say, 5 or 6 or 7 or 8 or 9 or 11
customers, etc.).
• Number of visitors to a website: on average, there are 500 visitors to a website
every day. A Poisson distribution can be used to estimate the number of visitors
every day.
• Radioactive decay in atoms
• Photons arriving at a space telescope
• Movements in a stock price
• Number of trees in a given acre of land
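For the help-center example, individual probabilities can be computed from the Poisson probability mass function P(X = k) = e^(-λ) λ^k / k! (a minimal sketch with λ = 10 calls per hour):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

# Help-center example: on average 10 calls per hour (lam = 10)
for k in (5, 10, 15):
    print(f"P({k} calls in an hour) = {poisson_pmf(k, 10):.4f}")
```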
Module-V
Sampling distributions
A sampling distribution is the probability distribution of a statistic (such as the mean)
obtained from repeated samples drawn from a population. Even though some samples
deviate from the population's mean value, the frequency distribution of the sampling
distribution typically approximates a normal distribution, with most samples close to
the population's mean value.
This is done by collecting samples from populations. While samples (value of the
focus) are the main focus, in this case, populations (subjects) help us to procure them,
and thus, both samples and populations are considered to be equally essential.
Studies that aim to calculate the probabilities of an event include a lot of data
collected over time. This data is collected with precision and care so that it leads to
an effective result and does not distort the statistics involved.
Sampling Distribution can be concerned with almost any subject. Be it the weight of
population or traits of animals, sampling distribution can cover almost anything and
everything. Another dimension of this concept is the binomial distribution.
Suppose a researcher wishes to identify the average age at which babies begin to
walk. Instead of keeping track of all the babies around the world, the researcher will
select a total of 500 babies.
These 500 babies constitute the sample for this particular research, drawn from the
population of all babies. Now, the researcher will identify the age of the babies when
they begin to walk. Let us assume that 25% of the babies began to walk at the age of
1.5 years, and another 30% of the babies began to walk at the age of 2 years.
In this way, the researcher calculates the sample mean (the average of a sample),
which is considered together with the other sample means obtained from the same
population to build up the sampling distribution.
The significance of sampling distribution is immense in the field of statistics.
1. Firstly, it helps researchers and analysts to dig deep into the population, get a
closer look at small groups of the population, and create generalized results
based on the same.
2. Secondly, the repeated collection of samples from the same set of subjects
leads to consistency. What's more, the standard error also allows a researcher to
reflect on the deviation and thus assess the unbiased nature of the sampling
distribution.
As we have already discovered about Sampling Distribution, we will now learn about
the various types of Sampling Distribution in statistics. To begin with, there are 3
types of Sampling Distribution.
1. Sampling Distribution of the Mean
The average of every sample is put together and a sampling distribution mean is
calculated, which reflects the nature of the whole population. With more samples,
the standard deviation decreases, which leads to a normal frequency distribution,
or a bell-shaped curve, on the graph.
2. Sampling Distribution of Proportion
In the second type of sampling distribution, the population's samples are used to
obtain the proportions of a population. Herein, the mean of all sample proportions
is calculated, and thereby the sampling distribution of proportion is generated.
3. T-Distribution
The frequency distribution in this type lies closest to the mean of the sampling
distribution; only a handful of samples are far off from the mean value of the
whole population.
One characteristic of the t-distribution is that it does not work well with large
samples; therefore, this type works well only with small samples.
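The clustering of sample means around the population mean described above (the bell-shaped curve of the sampling distribution of the mean) can be demonstrated with a short simulation; the skewed population below is synthetic, generated purely for illustration:

```python
import random
from statistics import mean, stdev

random.seed(0)  # reproducible results

# A skewed, exponential-like population (hypothetical data)
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_means(n, num_samples=2_000):
    """Draw num_samples samples of size n; return the list of their means."""
    return [mean(random.sample(population, n)) for _ in range(num_samples)]

means = sample_means(n=50)
print(f"population mean ~ {mean(population):.3f}")  # about 1.0
print(f"mean of means   ~ {mean(means):.3f}")       # close to the population mean
print(f"sd of means     ~ {stdev(means):.3f}")      # far smaller than the population sd
```

Even though the population itself is skewed, the distribution of the 2,000 sample means piles up symmetrically around the population mean, with a spread of roughly sigma/sqrt(50).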
Related Terminologies
Module-VII
Hypothesis Testing-t test, Chi Square,Z test
A hypothesis is an initial idea or assumption that may be used to try and explain an observation or
make an argument for some action that requires testing to check its validity. In a hypothesis test,
there are generally two different ideas or assumptions that are being juxtaposed and tested against
each other. The goal of the hypothesis test is to determine which hypothesis is most correct and if
the null hypothesis can be rejected altogether. Often, one or more inferences are made based on a
data sample, and the validity of the inferences is unknown. Then, the inference is tested against
another inference or against a standard point of reference. This process of testing the inference is
known as hypothesis testing.
Typically, hypothesis testing utilizes two different types of hypothesis: the null hypothesis and the
alternative hypothesis. The null hypothesis represents the assumption that is made about the data
sample, whereas the alternative hypothesis represents a counterpoint. More often than not, the
alternative hypothesis takes the exact opposite point of view from the null hypothesis.
Hypothesis testing is used in statistics to learn about and understand different population groups.
Additionally, the results of hypothesis testing can sometimes be used to predict the likelihood of
future outcomes within the population group. By going through the process of testing a hypothesis,
scientists and mathematicians are able to determine the statistical validity of their inferences, which
helps them to learn about the world around us.
For example, suppose a TV station was attempting to cater their advertisements to a more relevant
age demographic. This particular station believes that the ideal target demographic for their
advertisements is 65-year-olds. To determine if this is valid, a hypothesis test can be performed. In
this case, the null hypothesis would be that the average age of the TV station's viewers is 65. The
alternative hypothesis would be that the average age of the TV station's viewers is not equal to 65.
Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics.
It is most often used by scientists to test specific predictions, called hypotheses, that arise from
theories.
Though the specific details might vary, the procedure you will use when testing a hypothesis will
always follow some version of these steps.
After developing your initial research hypothesis (the prediction that you want to investigate), it is
important to restate it as a null (Ho) and alternate (Ha) hypothesis so that you can test it
mathematically.
The alternate hypothesis is usually your initial hypothesis that predicts a relationship between
variables. The null hypothesis is a prediction of no relationship between the variables you are
interested in.
For a statistical test to be valid, it is important to perform sampling and collect data in a way that is
designed to test your hypothesis. If your data are not representative, then you cannot make
statistical inferences about the population you are interested in.
You should also consider your scope (Worldwide? For one country?) A potential data source in this
case might be census data, since it includes data from a variety of regions and social classes and is
available for many countries around the world.
There are a variety of statistical tests available, but they are all based on the comparison of within-
group variance (how spread out the data is within a category) versus between-group variance (how
different the categories are from one another).
If the between-group variance is large enough that there is little or no overlap between groups, then
your statistical test will reflect that by showing a low p-value. This means it is unlikely that the
differences between these groups came about by chance.
Alternatively, if there is high within-group variance and low between-group variance, then your
statistical test will reflect that with a high p-value. This means it is likely that any difference you
measure between groups is due to chance.
Your choice of statistical test will be based on the type of variables and the level of measurement of
your collected data.
Based on the outcome of your statistical test, you will have to decide whether to reject or fail to
reject your null hypothesis.
In most cases you will use the p-value generated by your statistical test to guide your decision. And
in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05
– that is, when there is a less than 5% chance that you would see these results if the null hypothesis
were true.
In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This
minimizes the risk of incorrectly rejecting the null hypothesis (a Type I error).
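As a concrete sketch of this decision rule, the snippet below runs a simple one-sample z-test by hand. The sample data and the hypothesised mean are made up for illustration, and the population standard deviation is assumed known:

```python
from math import sqrt
from statistics import NormalDist, mean

alpha = 0.05               # predetermined significance level
mu0, sigma = 65, 15        # hypothesised mean; population sd assumed known

# Hypothetical sample of 20 observations
sample = [58, 62, 49, 71, 55, 60, 52, 66, 57, 48,
          63, 54, 59, 50, 61, 56, 53, 64, 51, 58]
n = len(sample)

# z statistic: how many standard errors the sample mean lies from mu0
z = (mean(sample) - mu0) / (sigma / sqrt(n))

# two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.4f}")

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```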
The results of hypothesis testing will be presented in the results and discussion sections of your
research paper, dissertation or thesis.
In the results section you should give a brief summary of the data and a summary of the results of
your statistical test (for example, the estimated difference between group means and associated p-
value). In the discussion, you can discuss whether your initial hypothesis was supported by your
results or not.
In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null
hypothesis. You will probably be asked to do this in your statistics assignments.
If your null hypothesis was rejected, this result is interpreted as "supporting the alternate
hypothesis."
You might notice that we don't say that we reject or fail to reject the alternate hypothesis. This is
because hypothesis testing is not designed to prove or disprove anything. It is only designed to test
whether a pattern we measure could have arisen spuriously, or by chance.
If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern
arose by chance), then we can say our test lends support to our hypothesis. But if the pattern does
not pass our decision rule, meaning that it could have arisen by chance, then we say the test is
inconsistent with our hypothesis.
A t-test is a kind of inferential statistical test. It is done under the null hypothesis. It
is used to compare whether the means of two groups are significantly distinct or not,
even when some particular features or characteristics might be related. The t-test
helps us estimate the difference between the averages of two sets of data, together
with how confident we can be that they come from the same population. For example,
if we were to take samples of students from two different schools, we could not
expect the means and standard deviations to be identical; there will be some
difference between the averages and standard deviations.
Types of t-tests
There are three types of t-tests. Let us understand the circumstances under which
each is used:
• If all of the groups come from a single population (like measurements before
and after an experimental treatment), we perform a paired t-test.
• If the groups under consideration come from two different populations (like
two different species, or people from two separate cities), we perform a
two-sample t-test (also called an independent t-test).
• If one group is being compared against a standard value (like comparing the
acidity of a liquid to a neutral pH of 7), we perform a one-sample t-test.
Uses of t-tests
It is used to determine whether two sets of data are significantly different from each
other.
It is used to evaluate whether the means of the two groups of data are statistically
dissimilar from each other.
Calculation of t-test
Calculating a t-test requires three key data values:
• The difference between the average values of each data set, known as the mean
difference.
• The standard deviation of each group.
• The number of data values in each group.
The result of the t-test is the t-value. This value is compared against the value from a
t-distribution table. The t-test helps us determine whether the difference is a true
difference or an arbitrary, negligible one.
Necessary conditions for application of a t-test
• The sample size should be small.
• The statistic follows a normal distribution.
• The value of the scaling terms is known.
• The comparison is only between two groups.
T-Test formula
The formula for the t-test (a.k.a. Student's t-test) is shown below:
t = (x̄1 - x̄2) / √( s² (1/n1 + 1/n2) )
In this formula, t is the t-value, x̄1 and x̄2 are the means of the two groups being
compared, s² is the pooled variance (the pooled standard error term) of the two
groups, and n1 and n2 are the numbers of observations in the first and second group,
respectively.
A greater t-value indicates that the difference between means is greater than the
pooled standard error, which suggests a significant difference between the groups.
The calculated t-value can be compared against the values in a critical value chart to
determine whether your t-value is greater than what would be expected by chance. If
so, the null hypothesis can be rejected and it can be concluded that the two groups are
in fact different.
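This comparison can be made concrete with a small worked example using the pooled two-sample formula above (the two groups of scores are hypothetical):

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical test scores from two independent groups
group1 = [85, 90, 78, 92, 88, 76, 81, 89]
group2 = [79, 83, 74, 75, 80, 72, 78, 77]

n1, n2 = len(group1), len(group2)
m1, m2 = mean(group1), mean(group2)

# Pooled variance: weighted average of the two sample variances
s2 = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)

# t-value from the formula above
t = (m1 - m2) / sqrt(s2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The resulting t-value would then be compared against the critical value from a t-table at the chosen alpha level with `df` degrees of freedom.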
Things to remember
• A t-test is a type of statistic used to verify whether there is a considerable
difference between the averages of two groups.
• The t-test is used when the objective is hypothesis testing in statistics.
• It requires three key data values: the mean difference, the standard deviation,
and the number of data values of each group.
• A greater t-value represents a significant difference between the two groups.
• A smaller t-value represents a negligible difference between the two groups.
The Chi-Square test is a statistical procedure for determining the difference between
observed and expected data. This test can also be used to determine whether two
categorical variables in our data are related. It helps to find out whether a difference
between two categorical variables is due to chance or to a relationship between them.
A chi-square test is a statistical test that is used to compare observed and expected
results. The goal of this test is to identify whether a disparity between actual and
predicted data is due to chance or to a link between the variables under consideration.
As a result, the chi-square test is an ideal choice for aiding in our understanding and
interpretation of the connection between our two categorical variables.
For example, a meal delivery firm in India wants to investigate the link between
gender, geography, and people's food preferences.
It is used to determine whether the difference between two categorical variables is:
• a result of chance, or
• a result of a relationship between them.
The chi-square statistic is calculated as:
χ²_c = Σ (O - E)² / E
Where
c = Degrees of freedom
O = Observed Value
E = Expected Value
Chi-square tests are statistically valid when the sample is random and the expected
frequencies are not too small. These tests are frequently used to compare observed
data with the data that would be expected if a particular hypothesis were true.
The expected values are the frequencies expected, based on the null hypothesis.
• The chi-squared test can be used to see if your data follows a well-known
theoretical probability distribution like the normal or Poisson distribution.
• The chi-squared test allows you to assess your trained regression model's
goodness of fit on the training, validation, and test data sets.
These tests use degrees of freedom to determine whether a particular null hypothesis
can be rejected based on the total number of observations made in the experiments.
The larger the sample size, the more reliable the result.
There are two main types of chi-square tests:
1. Independence
2. Goodness-of-Fit
Independence
For example:
In a movie theatre, suppose we make a list of movie genres; let us consider this the
first variable. The second variable is whether or not the people who came to watch
those genres of movies bought snacks at the theatre. Here the null hypothesis is that
the genre of the film and whether people bought snacks are unrelated. If this is true,
the movie genres don't impact snack sales.
Goodness-of-Fit
For example:
Suppose we have bags of balls with five different colours in each bag. The given
condition is that each bag should contain an equal number of balls of each colour.
The idea we would like to test here is that the proportions of the five colours of balls
in each bag are as specified.
Chi-square is most commonly used by researchers who are studying survey response
data because it applies to categorical variables. Demography, consumer and
marketing research, political science, and economics are all examples of this type of
research.
Example
Let's say you want to know if gender has anything to do with political party
preference. You poll 440 voters in a simple random sample to find out which political
party they prefer. The results of the survey are shown in the table below:
Each expected value is computed as E = (row total × column total) / grand total; in
this way you can calculate the expected value for each of the cells.
Now you calculate (O - E)² / E for each cell in the table,
where
O = Observed Value
E = Expected Value
Summing these values over all cells gives the chi-square statistic:
χ² = 9.837
Before you can conclude, you must first determine the critical statistic, which
requires determining the degrees of freedom. The degrees of freedom in this case are
equal to the table's number of rows minus one multiplied by the table's number of
columns minus one, or (r-1)(c-1). We have (3-1)(2-1) = 2.
Finally, you compare the obtained statistic to the critical statistic found in the chi-
square table. For an alpha level of 0.05 and two degrees of freedom, the critical
statistic is 5.991, which is less than the obtained statistic of 9.837. You can reject the
null hypothesis because the obtained statistic is higher than the critical statistic.
This means you have sufÞcient evidence to say that there is an association between
gender and political party preference.
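The full chain of calculations (expected counts, cell contributions, degrees of freedom) can be sketched as follows. The 2x3 table of counts below is hypothetical, since the survey table from the example is not reproduced here:

```python
# Hypothetical 2x3 contingency table: rows = gender, columns = party preference
observed = [
    [10, 20, 30],
    [20, 20, 20],
]

rows, cols = len(observed), len(observed[0])
row_totals = [sum(r) for r in observed]
col_totals = [sum(observed[i][j] for i in range(rows)) for j in range(cols)]
grand = sum(row_totals)

# Expected count for each cell: (row total * column total) / grand total
expected = [[row_totals[i] * col_totals[j] / grand for j in range(cols)]
            for i in range(rows)]

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(rows) for j in range(cols))

df = (rows - 1) * (cols - 1)
print(f"chi-square = {chi2:.3f}, degrees of freedom = {df}")
# Compare chi2 against the chi-square table's critical value at your alpha level
```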
A chi-square test is used to examine whether the observed results are in accordance
with the expected values. When the data to be analysed come from a random sample,
and when the variable in question is a categorical variable, the chi-square test is the
most appropriate test. A categorical variable consists of selections such as breeds of
dogs, types of cars, genres of movies, educational attainment, male vs. female, etc.
Survey responses and questionnaires are the primary sources of these types of data.
The chi-square test is most commonly used for analysing this kind of data. This type
of analysis is helpful for researchers who are studying survey response data; the
research can range from customer and marketing research to political science and
economics.
Chi-Square Distribution
The shape of the distribution graph changes as the value of k, the degrees of freedom,
increases.
When k is greater than 2, the distribution curve is hump-shaped, with a low
probability that χ² is very near 0 or very far from 0. The distribution has a longer tail
on the right-hand side and a shorter one on the left-hand side. The most probable
(modal) value of χ² is k - 2.
When k is greater than about ninety, the chi-square distribution is well approximated
by a normal distribution.
Chi-Square P-Values
Here P denotes the probability; hence the Chi-Square test comes into the picture for
the calculation of p-values. Different p-values lead to different interpretations of the
hypothesis.
The concepts of probability and statistics are entangled with the Chi-Square test.
Probability is the estimation of something that is most likely to happen; simply put, it
is the possibility of an event or outcome of the sample, and it can represent bulky or
complicated data in an understandable way. Statistics involves collecting,
organising, analysing, interpreting and presenting the data.
Finding P-Value
When you run any of the Chi-square tests, you'll get a test statistic called X². You have
two options for determining whether this test statistic is statistically significant at
some alpha level: compare it with the critical value from the chi-square table, or
convert it into a p-value.
The p-value is calculated by taking into account the sampling distribution of the
test statistic under the null hypothesis, the sample data, and the approach which is
chosen for performing the test. For the upper-tailed chi-square test it is
P = 1 − cdf(TS)
Where:
P: probability of obtaining a test statistic at least as extreme as the observed one
TS: test statistic, the computed observed value of the test statistic from your sample
cdf(): cumulative distribution function of the test statistic's distribution (TS)
These are, mathematically, the same test. However, because they are utilized for
distinct goals, we generally conceive of them as separate tests.
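The p-value for the statistic obtained earlier (TS = 9.83 with 2 degrees of freedom) can be computed directly from the chi-square cdf; a minimal sketch:

```python
from scipy.stats import chi2

ts, dof = 9.83, 2                    # obtained statistic and degrees of freedom
p_value = 1 - chi2.cdf(ts, dof)      # upper-tail probability, P = 1 - cdf(TS)

print(f"p = {p_value:.4f}")          # well below alpha = 0.05, so reject H0
```

This agrees with the critical-value approach: the p-value falls below 0.05 exactly when the statistic exceeds 5.991.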
Properties
1. If you multiply the number of degrees of freedom by two, you get the variance of
the distribution.
There are two limitations to using the chi-square test that you should be aware of.
• The chi-square test, for starters, is extremely sensitive to sample size. Even
insignificant relationships can appear statistically significant when a large
enough sample is used. Keep in mind that "statistically significant" does not
always imply "meaningful" when using the chi-square test.
• Be mindful that the chi-square test can only determine whether two variables are
related. It does not necessarily follow that one variable has a causal
relationship with the other. It would require a more detailed analysis to
establish causality.
When there is only one categorical variable, the chi-square goodness of fit test can be
used. The frequency distribution of the categorical variable is evaluated to determine
whether it differs significantly from what you expected. The default assumption is
that the categories will have equal proportions; however, this need not be the case.
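A goodness-of-fit test against equal proportions can be sketched as follows; the observed counts are invented for illustration (say, four answer categories from a survey):

```python
from scipy.stats import chisquare

# Hypothetical counts for four categories (assumed data)
observed = [45, 55, 70, 30]

# With no expected frequencies given, chisquare() tests against equal
# proportions: each of the 4 categories is expected to hold 200/4 = 50.
result = chisquare(observed)

print(f"chi2 = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

A small p-value here indicates that the category frequencies depart significantly from equal proportions.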
Module-VIII
Anova -One Way, Two Way
ANOVA
The five steps to ANOVA are:
1. Formulate a hypothesis
2. Set a significance level
3. Compute an F-statistic
4. Use the F-statistic to derive a p-value
5. Compare the p-value and significance level to decide whether or not to reject the
null hypothesis
1. Formulate a Hypothesis
As with nearly all statistical significance tests, ANOVA starts with formulating a null
and alternative hypothesis. For this example, the hypotheses are as follows:
Null Hypothesis (H0): There is no difference in the average price of wine between the
three countries; they are all the same.
Alternative Hypothesis (H1): The average price of wine is not the same between the
three countries.
Note, this is an omnibus test, meaning if we are able to reject the null hypothesis it
will tell us that a statistically significant difference exists somewhere between these
countries, but it won't tell us where it is.
3. Compute an F-Statistic
The F-statistic is simply the ratio of the variance between sample means to the
variance within samples. For this ANOVA test, we'll be looking at how far each
country's average wine price is from the overall average price, and dividing that by
how much variation in price there is within each country's sample distribution. The
F-statistic formula is below, which may look complicated until we break it down.
SSB = Sum of squares between groups. This is the summation of the squared
difference between each group's mean and the overall mean, times the number of
elements per group. For this example, we take the mean of each country's wine price,
subtract it from the overall mean, square the difference and multiply by 1,000 (the
number of data points per country).
SSW = Sum of squares within groups. This is the summation of the squared
difference between the group mean and each value in the group. For France, we
would take the mean price of French wine, then subtract and square the difference for
each bottle of French wine of the thousand data points in that group.
DoFB = Degrees of freedom between groups, simply the number of groups minus 1.
We have three different countries we are comparing, so the degrees of freedom here is
2.
DoFW = Degrees of Freedom Within Groups, simply the number of data points
minus the number of groups. We have 3,000 data points and three different countries,
so this is 2,997 for this example.
Dividing the sum of squares for a group by its degrees of freedom yields the mean
squares for that group, and the F statistic is just a ratio of the mean squares between
over the mean squares within.
Below I manually calculate these values in Python, and end up with an F-Statistic of
~4.07.
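The original Python calculation is not reproduced in these notes; the sketch below follows the same steps on simulated wine-price data (three countries, 1,000 bottles each, means and spreads assumed), so the resulting F value will differ from the ~4.07 quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated prices for three countries, 1,000 bottles each (assumed data)
prices = {
    "France": rng.normal(30, 10, 1000),
    "Italy":  rng.normal(31, 10, 1000),
    "Spain":  rng.normal(29, 10, 1000),
}

overall_mean = np.mean(np.concatenate(list(prices.values())))

# SSB: n * (group mean - overall mean)^2, summed over the groups
ssb = sum(len(v) * (v.mean() - overall_mean) ** 2 for v in prices.values())
# SSW: (value - group mean)^2, summed within each group, then over groups
ssw = sum(((v - v.mean()) ** 2).sum() for v in prices.values())

dof_b = len(prices) - 1                                      # 3 groups -> 2
dof_w = sum(len(v) for v in prices.values()) - len(prices)   # 3000 - 3 = 2997

f_stat = (ssb / dof_b) / (ssw / dof_w)   # ratio of mean squares
print(f"F = {f_stat:.2f}")
```

The same F value can be obtained in one call with scipy.stats.f_oneway, which is a convenient cross-check on the manual arithmetic.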
5. Compare the p-value and significance level to decide whether or not to reject the
null hypothesis
Our p-value signifies that, assuming the null hypothesis (all countries have the same
mean price of wine) is true, there is roughly a 1.7% chance of seeing the data we have
by sheer sampling chance. By setting our significance level, or alpha, at 5% before all
of this, we said that we would be willing to accept a 5% chance of rejecting the null
when it is true. Since our p-value is below our pre-determined significance level, we
can reject the null hypothesis and say that there is a statistically significant difference
in the mean price of wine between countries.
Remember that ANOVA is an omnibus test, meaning because we are able to reject the
null we know that a difference exists among the average wine prices between
countries, but not exactly where. To find where the difference lies, we would then
conduct hypothesis tests between two countries at a time.
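Those pairwise follow-up tests can be sketched with two-sample t-tests. The samples below are simulated stand-ins for the country data (assumed, not the original dataset); in practice the alpha level is adjusted for multiple comparisons, e.g. with a Bonferroni correction:

```python
from itertools import combinations
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
samples = {                      # assumed stand-in samples, one per country
    "France": rng.normal(31, 10, 1000),
    "Italy":  rng.normal(30, 10, 1000),
    "Spain":  rng.normal(29, 10, 1000),
}

alpha = 0.05 / 3                 # Bonferroni: 3 pairwise comparisons
for a, b in combinations(samples, 2):
    t, p = ttest_ind(samples[a], samples[b])
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.4f}, reject = {p < alpha}")
```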
The procedure is made up of just three basic stages. After looking at the procedure,
we will apply it to a real problem.
Before we begin, take some time to examine Figure 1. This figure summarizes what
needs to be calculated to perform a one-way ANOVA.
Step 2: Set up the null and alternate hypotheses and the alpha
The null hypothesis assumes that there is no variation between the groups; in
other words, the means are the same.
The alternate hypothesis states that the means are not all the same. So we can state
them as follows:
H0: μ1 = μ2 = μ3
Ha: not all of μ1, μ2, μ3 are equal
Calculate the Sum of Squares Total (SST): The SStotal is the Sum of Squares (SS)
based on the entire data set across all the groups. In this case, you treat all the data
from all the groups as one single combined set of data.
Calculate the Sum of Squares Within Groups (SSW): This is the sum of squares
within each group. After calculating the sum of squares for each group, you then add
them together for all the groups. That is why you have the sum symbol twice in the
formula.
Note: If you have calculated the first two sums of squares, you can then go ahead and
calculate the third one from the first two values using the formula
SStotal = SSbetween + SSwithin
But for learning purposes, we will calculate the third one. So let's keep moving!
Calculate the Sum of Squares Between Groups (SSB): This is the sum of squares
with the groups taken as single elements, each squared deviation weighted by the
size of its group.
Assuming there are three groups of sizes n1, n2, n3, you will have to do the following:
n1(group1_mean − total_mean)² + n2(group2_mean − total_mean)² + n3(group3_mean −
total_mean)²
Verify that
SStotal = SSbetween + SSwithin
Note: You can actually calculate the third degree of freedom if you have two of
them, just as in the case of the sums of squares.
This is the same table you created in Step 2. You just have to fill it with actual results
based on your calculations. Starting with this table makes the problem easier to solve.
Calculate the F statistic using the formula.
Look up the tabulated value of F (the critical value) from the statistical table and
compare it with the value you calculated (the absolute value).
If the absolute value is greater than the critical value, we reject the null hypothesis
and conclude that there is a significant difference between the means of the
populations. Otherwise, we fail to reject the null hypothesis.
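The three sums of squares, and the identity used to verify them, can be sketched for three small made-up groups:

```python
import numpy as np

# Three made-up groups for illustration
groups = [np.array([4., 5., 6.]), np.array([7., 8., 9.]), np.array([1., 2., 3.])]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# SST: treat all the data as one single combined set
sst = ((all_data - grand_mean) ** 2).sum()
# SSW: sum of squares within each group, then summed over groups
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
# SSB: squared deviation of each group mean, weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

print(sst, ssw, ssb)   # 60.0 6.0 54.0 -- and SST = SSB + SSW holds
```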
We will next illustrate the ANOVA procedure using the five-step approach. Because
the computation of the test statistic is involved, the computations are often organized
in an ANOVA table. The ANOVA table breaks down the components of variation in
the data into variation between treatments and error or residual variation. Statistical
computing packages also produce ANOVA tables as part of their standard output for
ANOVA, and the ANOVA table is set up as follows:
Source of Variation    Sums of Squares (SS)    Degrees of Freedom (df)    Mean Squares (MS)    F
Between Treatments                             k-1
Error (or Residual)                            N-k
Total                                          N-1
where
• X = individual observation,
• The first column is entitled "Source of Variation" and delineates the between
treatment and error or residual variation. The total variation is the sum of the
between treatment and error variation.
• The second column is entitled "Sums of Squares (SS)". The between treatment
sums of squares is
SSB = ∑ nj ( X̄j − X̄ )²
and is computed by summing the squared differences between each treatment (or
group) mean X̄j and the overall mean X̄. The squared differences are weighted
by the sample sizes per group (nj). The error sums of squares is
SSE = ∑∑ ( Xij − X̄j )²
and is computed by summing the squared differences between each observation and
its group mean (i.e., the squared differences between each observation in
group 1 and the group 1 mean, the squared differences between each
observation in group 2 and the group 2 mean, and so on). The double
summation ( ∑∑ ) indicates summation of the squared differences within
each treatment and then summation of these totals across treatments to
produce a single value. (This will be illustrated in the following examples.)
The total sums of squares is
SST = ∑∑ ( Xij − X̄ )²
and is computed by summing the squared differences between each observation and
the overall sample mean. In an ANOVA, data are organized by comparison
or treatment groups. If all of the data were pooled into a single sample,
SST would reflect the numerator of the sample variance computed on the
pooled or total sample. SST does not figure into the
However, SST = SSB + SSE, thus if two sums of squares are known, the
third can be computed from the other two.
• The third column contains degrees of freedom. The between treatment degrees
of freedom is df1 = k-1. The error degrees of freedom is df2 = N - k. The total
degrees of freedom is N-1 (and it is also true that (k-1) + (N-k) = N-1).
• The fourth column contains "Mean Squares (MS)" which are computed by
dividing sums of squares (SS) by degrees of freedom (df), row by row.
Specifically, MSB=SSB/(k-1) and MSE=SSE/(N-k). Dividing SST/(N-1)
produces the variance of the total sample. The F statistic is in the rightmost
column of the ANOVA table and is computed by taking the ratio of MSB/MSE.
Example:
A clinical trial is run to compare weight loss programs and participants are randomly
assigned to one of the comparison programs and are counseled on the details of the
assigned program. Participants follow the assigned program for 8 weeks. The
outcome of interest is weight loss, defined as the difference in weight measured at the
start of the study (baseline) and weight measured at the end of the study (8 weeks),
measured in pounds.
Three popular weight loss programs are considered. The first is a low calorie diet.
The second is a low fat diet and the third is a low carbohydrate diet. For comparison
purposes, a fourth group is considered as a control group. Participants in the fourth
group are told that they are participating in a study of healthy behaviors with weight
loss only one component of interest. The control group is included here to assess the
placebo effect (i.e., weight loss due to simply participating in the study). A total of
twenty patients agree to participate in the study and are randomly assigned to one of
the four diet groups. Weights are measured at baseline and patients are counseled on
the proper implementation of the assigned diet (with the exception of the control
group). After 8 weeks, each patient's weight is again measured and the difference in
weights is computed by subtracting the 8 week weight from the baseline weight.
Positive differences indicate weight losses and negative differences indicate weight
gains. For interpretation purposes, we refer to the differences in weights as weight
losses and the observed weight losses are shown below.
Is there a statistically significant difference in the mean weight loss among the four
diets? We will run the ANOVA using the five-step approach.
To determine the critical value of F we need degrees of freedom, df1=k-1 and
df2=N-k. In this example, df1=k-1=4-1=3 and df2=N-k=20-4=16. The critical value is
3.24 and the decision rule is as follows: Reject H0 if F > 3.24.
Next we compute,
SSE requires computing the squared differences between each observation and its
group mean. We will compute SSE in parts. For the participants in the low calorie
diet:
X        X − 6.6    (X − 6.6)²
…        …          …
7        0.4        0.2
3        −3.6       13.0
Totals   0          21.4
Summing the squared deviations within each of the four groups in the same way,
and adding the group totals, gives SSE = 47.4. Computing the weighted squared
deviations of each group mean from the overall mean gives SSB = 75.8. The
completed ANOVA table is therefore:
Source of Variation    SS      df         MS             F
Between Treatments     75.8    4-1=3      75.8/3=25.3    25.3/3.0=8.43
Error (or Residual)    47.4    20-4=16    47.4/16=3.0
• Step 5. Conclusion.
We reject H0 because 8.43 > 3.24. We have statistically significant evidence at
α=0.05 to show that there is a difference in mean weight loss among the four diets.
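The arithmetic of the ANOVA table above can be checked directly, using the rounded mean squares from the table:

```python
from scipy.stats import f

msb, mse = 25.3, 3.0                 # mean squares from the ANOVA table (rounded)
df1, df2 = 3, 16                     # k-1 = 3 and N-k = 16

f_stat = msb / mse                   # approx. 8.43
critical = f.ppf(0.95, df1, df2)     # approx. 3.24 at alpha = 0.05

print(f"F = {f_stat:.2f}, critical = {critical:.2f}, reject H0: {f_stat > critical}")
```

Since 8.43 > 3.24, the null hypothesis of equal mean weight loss is rejected, matching the conclusion above.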
Module-IX
Correlation and Regression Analysis
Correlation in Statistics
This section shows how to calculate and interpret correlation coefficients for ordinal
and interval level scales. Methods of correlation summarize the relationship between
two variables in a single number called the correlation coefficient. The correlation
coefficient is usually represented using the symbol r, and it ranges from -1 to +1.
A correlation coefficient quite close to 0, but either positive or negative, implies little
or no relationship between the two variables. A correlation coefficient close to plus 1
means a positive relationship between the two variables, with increases in one of the
variables being associated with increases in the other variable; a coefficient close to
minus 1 means a negative relationship, with increases in one variable being
associated with decreases in the other.
For ordinal scales, the correlation coefficient can be calculated by using Spearman's
rho. For interval or ratio level scales, the most commonly used correlation coefficient
is Pearson's r, ordinarily referred to as simply the correlation coefficient.
In statistics, correlation studies and measures the direction and extent of the
relationship among variables; correlation measures co-variation, not causation.
Therefore, we should never interpret correlation as implying a cause and effect
relation. For example, if there exists a correlation between two variables X and Y,
this means that when the value of one variable changes in one direction, the value of
the other variable changes either in the same direction (i.e. positive change) or in the
opposite direction (i.e. negative change). Furthermore, if the correlation exists, it is
linear, i.e. we can represent the relative movement of the two variables by drawing a
straight line on graph paper.
Correlation Coefficient
The correlation coefficient, r, is a summary measure that describes the extent of the
statistical relationship between two interval or ratio level variables. The correlation
coefficient is scaled so that it is always between -1 and +1. When r is close to 0 this
means that there is little relationship between the variables, and the farther away from
0 r is, in either the positive or negative direction, the greater the relationship between
the two variables.
The two variables are often given the symbols X and Y. In order to illustrate how the
two variables are related, the values of X and Y are pictured by drawing a scatter
diagram, graphing combinations of the two variables. The scatter diagram is given
first, and then the method of determining Pearson's r is presented. In the following
examples, relatively small sample sizes are used. Later, data from larger samples are
given.
Scatter Diagram
A scatter diagram is a diagram that shows the values of two variables X and Y, along
with the way in which these two variables relate to each other. The values of variable
X are given along the horizontal axis, with the values of the variable Y given on the
vertical axis.
Later, when the regression model is used, one of the variables is defined as an
independent variable, and the other is defined as a dependent variable. In regression,
the independent variable X is considered to have some effect or influence on the
dependent variable Y. Correlation methods are symmetric with respect to the two
variables, with no indication of causation or direction of influence being part of the
statistical consideration. A scatter diagram is given in the following example. The
same example is later used to determine the correlation coefficient.
Types of Correlation
The scatter plot explains the correlation between the two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables:
• Positive Correlation – when the values of the two variables move in the same
direction so that an increase/decrease in the value of one variable is followed
by an increase/decrease in the value of the other variable.
• Negative Correlation – when the values of the two variables move in the
opposite direction so that an increase/decrease in the value of one variable is
followed by a decrease/increase in the value of the other variable.
• No Correlation – when there is no linear dependence or no relation between the
two variables.
Correlation Formula
Correlation shows the relation between two variables. The correlation coefficient
shows the measure of correlation. To compare two datasets, we use the correlation
formulas. The value of the coefficient always lies between -1 and +1. When the
coefficient comes down to zero, the data are considered as not related, while a value
of +1 means the data are perfectly positively correlated, and -1 indicates a perfect
negative correlation.
2. What is a correlation of 1?
We know that a correlation of 1 means the two variables are associated positively,
whereas if the correlation coefficient is 0, then there is no correlation between the two
variables. Note that the proportion of variance explained is r², not r: a correlation of
0.45 means about 20% (0.45² ≈ 0.20) of the variance in one variable, say x, is
accounted for by the second variable, say y.
2) The sign of the correlation coefficient will always be the same as the sign of the
covariance.
4) A negative value of the coefficient suggests that the correlation is negative, and as
'r' approaches -1 the negative relationship becomes stronger.
When 'r' approaches +1, the relationship is strong and positive. By this we can say
that if +1 is the result of the correlation then the relationship is perfectly positive.
6) The correlation coefficient can be misleading with survey data because we cannot
tell whether the participants answered truthfully.
The coefficient of correlation is not affected when we interchange the two variables.
7) The coefficient of correlation is a pure number without the effect of any units on it.
It is also not affected when we add the same number to all the values of one variable.
We can multiply all the values by the same positive number without affecting the
correlation coefficient. As we discussed, 'r' is not affected by any unit because 'r' is
scale invariant.
8) We use correlation for measuring association, but that does not mean we are
talking about causation. By this, we simply mean that when we are correlating two
variables, it is possible that a third variable may be influencing them.
x 50 51 52 53 54
Solution:
Here n = 5
x 50 51 52 53 54
∑x = 260
∑y = 16.5
∑xy = 859
∑x² = 13530
∑y² = 54.55
By substituting all the values in the formula, we get r = 1. This shows a perfect
positive correlation.
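Pearson's r can be recovered from the summary sums alone, which verifies the r = 1 result for this example:

```python
import math

n = 5
sx, sy, sxy, sx2, sy2 = 260, 16.5, 859, 13530, 54.55   # sums from the table

# Pearson's r from summary sums:
# r = [n*Sxy - Sx*Sy] / sqrt([n*Sx2 - Sx^2][n*Sy2 - Sy^2])
num = n * sxy - sx * sy
den = math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
r = num / den

print(r)   # 1.0 -- a perfect positive correlation
```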
x 12 15 18 21 27
y 2 4 6 8 12
Solution:
Here n = 5
x 12 15 18 21 27
y 2 4 6 8 12
xy 24 60 108 168 324
y² 4 16 36 64 144
∑x = 93
∑y = 32
∑xy = 684
∑x² = 1863
∑y² = 264
Substituting these values in the formula gives r = 1, a perfect positive correlation
(the points lie exactly on the line y = (2/3)x − 6).
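A direct check of this example with numpy (note that 18 × 6 = 108, so ∑xy = 684 and the points are exactly linear):

```python
import numpy as np

x = np.array([12, 15, 18, 21, 27])
y = np.array([2, 4, 6, 8, 12])

# Pearson correlation coefficient from the off-diagonal of the 2x2 matrix
r = np.corrcoef(x, y)[0, 1]
print(r)   # ≈ 1.0, since y = (2/3)x - 6 holds for every point
```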
Regression
Example 9.9
Calculate the regression coefficient and obtain the lines of regression for the
following data
Solution:
Regression coefficient of X on Y
Y = 0.929X − 3.716 + 11
Y = 0.929X + 7.284
Example 9.10
Solution:
= −0.25(20) + 44.25
= −5 + 44.25
= 39.25 (when the price is Rs. 20, the likely demand is 39.25)
Example 9.11
Solution:
Y − 51.57 = 0.942(X − 48.29)
Y = 0.942X − 45.49 + 51.57
Y = 0.942X + 6.08
When X = 55: Y = 0.942(55) + 6.08 = 57.89
Example 9.12
2Y − X − 50 = 0
3Y − 2X − 10 = 0.
Solution:
We are given the two regression lines. Multiplying the first equation by 2 gives
4Y − 2X − 100 = 0; subtracting the second equation from this gives Y = 90.
Substituting Y = 90 into 2Y = X + 50 gives X = 130. Taking 2Y = X + 50 as the line
of Y on X gives bYX = 1/2, and taking 3Y − 2X − 10 = 0, i.e. X = (3Y − 10)/2, as the
line of X on Y gives bXY = 3/2.
NOTE
It may be noted that in the above problem one of the regression coefficients is
greater than 1 and the other is less than 1, so their product does not exceed 1.
Therefore our assumption about the given equations is correct.
Example 9.13
4X − 5Y + 33 = 0
20X − 9Y − 107 = 0
Solution:
We are given the two lines. Solving them simultaneously (multiplying the first by 5
and subtracting the second gives 16Y = 272), we get Y = 17 and X = 13. Note that if
the first line were taken as the line of X on Y (bXY = 5/4) and the second as the line
of Y on X (bYX = 20/9), both regression coefficients would be greater than one,
which is not possible because their product cannot exceed one; the identification of
the lines must therefore be the other way around.
Example 9.14
The following table shows the sales and advertisement expenditure of a firm
Solution:
Example 9.15
There are two series of index numbers: P for the price index and S for the stock of
the commodity. The mean and standard deviation of P are 100 and 8, and of
S are 103 and 4, respectively. The correlation coefficient between the two
series is 0.4. With these data, obtain the regression lines of P on S and S on P.
Solution:
Let us consider X for price P and Y for stock S. Then the mean
and SD for P are taken as X̄ = 100 and σx = 8, and the
mean and SD of S as Ȳ = 103 and σy = 4. The correlation
coefficient between the series is r(X,Y) = 0.4.
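With X̄ = 100, σx = 8, Ȳ = 103, σy = 4 and r = 0.4, the two regression coefficients and lines follow directly:

```python
x_bar, sd_x = 100, 8     # price index P
y_bar, sd_y = 103, 4     # stock S
r = 0.4

b_xy = r * sd_x / sd_y   # regression coefficient of P on S: 0.4 * 8/4 = 0.8
b_yx = r * sd_y / sd_x   # regression coefficient of S on P: 0.4 * 4/8 = 0.2

# P on S:  P - 100 = 0.8 (S - 103)  =>  P = 0.8 S + 17.6
# S on P:  S - 103 = 0.2 (P - 100)  =>  S = 0.2 P + 83
print(b_xy, b_yx)
print(round(x_bar - b_xy * y_bar, 1), round(y_bar - b_yx * x_bar, 1))   # 17.6 83.0
```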
Example 9.16
Solution:
Y − 5 = 0.8(X − 3)
Y = 0.8X + 2.6
When X = 8: Y = 0.8(8) + 2.6 = 9
Example 9.17
The two regression lines are 3X+2Y = 26 and 6X+3Y = 31. Find the
correlation coefficient.
Solution:
Taking 3X + 2Y = 26 as the line of Y on X gives bYX = −3/2, and taking
6X + 3Y = 31 as the line of X on Y gives bXY = −1/2. Then
r² = bYX · bXY = 3/4, and since both regression coefficients are negative,
r = −√3/2 ≈ −0.87.
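A quick numerical check, identifying 3X+2Y = 26 as the line of Y on X (slope bYX = −3/2) and 6X+3Y = 31 as the line of X on Y (slope bXY = −1/2):

```python
import math

b_yx = -3 / 2              # from 3X + 2Y = 26  =>  Y = 13 - 1.5X
b_xy = -1 / 2              # from 6X + 3Y = 31  =>  X = (31 - 3Y)/6

r_squared = b_yx * b_xy    # 0.75; must not exceed 1 for a valid identification
r = -math.sqrt(r_squared)  # both coefficients negative => r is negative

print(round(r, 3))   # -0.866
```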
Example 9.18
Solution:
Example 9.19
Solution:
3X − 2Y = 5
3X = 2Y + 5
Coefficient of correlation
Since the two regression coefficients are positive, the correlation
coefficient is also positive and is given by r = √(bYX · bXY).