Statistics For Management Notes
LECTURE NOTES ON STATISTICS FOR MANAGEMENT
MBA I Semester (IARE-R18)
One-dimensional, two-dimensional and three-dimensional diagrams and graphs.
UNIT-IV SMALL SAMPLE TESTS Classes: 10
t-Distribution: properties and applications, testing for one and two means, paired t-test; Analysis of variance: one-way and
two-way ANOVA (with and without interaction); Chi-square distribution: test for a specified population variance, test for
goodness of fit, test for independence of attributes; Correlation analysis: scatter diagram, positive and negative correlation,
limits for the coefficient of correlation, Karl Pearson's coefficient of correlation, Spearman's rank correlation, concept of
multiple and partial correlation.
UNIT-V REGRESSION ANALYSIS Classes: 10
Concept, least-squares fit of a linear regression, two lines of regression, properties of regression coefficients; Time series
analysis: components, models of time series (additive, multiplicative and mixed); Trend analysis: free-hand curve, semi-averages,
moving averages, least-squares methods; Index numbers: introduction, characteristics and uses of index numbers, types of index
numbers, unweighted price indices, weighted price indices, tests of adequacy and consumer price indexes.
Text Books:
1. Levin R.I., Rubin S. David, "Statistics for Management", Pearson, 7th Edition, 2015.
2. Beri, "Business Statistics", TMH, 1st Edition, 2015.
3. Gupta S.C., "Fundamentals of Statistics", HPH, 6th Edition, 2015.
Reference Books:
1. Levine, Stephan, Krehbiel, Berenson, "Statistics for Managers using Microsoft Excel", PHI, 1st Edition, 2015.
2. J.K. Sharma, "Business Statistics", Pearson Publications, 2nd Edition, 2015.
Web References:
1. https://aditya30702.files.wordpress.com/2012/07/statistics-for-managers-using-microsoft-excel- gnv64.pdf
2. http://www.nprcet.org/mba/document/First%20Semester/BA7102%20STATISTICS%20FOR%20
MANAGEMENT%20LT%20P%20C%203%201%200%204%20ODD.pdf
E-Text Books:
1. http://bookboon.com/en/statistics-and-mathematics-ebooks
2. http://www.ebay.com/bhp/statistics-for-managers-using-microsoft-excel
UNIT-I
INTRODUCTION TO STATISTICS
DEFINITION OF STATISTICS
HISTORY OF STATISTICS
The word statistics has been derived from the Latin word "Status" or the Italian word "Statista"; the meaning of these words
is "political state" or a government. Shakespeare used the word "Statist" in his drama Hamlet (1602). In the past, statistics
was used by rulers. The application of statistics was very limited, but rulers and kings needed information about lands,
agriculture, commerce and the population of their states to assess their military potential, their wealth, taxation and other
aspects of government.
Gottfried Achenwall used the word "statistik" at a German university in 1749 to mean the political science of different
countries. In 1771 W. Hooper (an Englishman) used the word statistics in his translation of Elements of Universal Erudition
written by Baron B.F. Bieford; in that book statistics is defined as the science that teaches us the political arrangement of
all the modern states of the known world. There is a big gap between the old statistics and modern statistics, but the old
statistics is also used as a part of the present statistics.
During the 18th century English writers used the word statistics in their works, so statistics has developed gradually during
the last few centuries. A lot of work was done at the end of the nineteenth century.
At the beginning of the 20th century, William S. Gosset developed methods for decision making based on small sets of data.
During the 20th century several statisticians were active in developing new methods, theories and applications of statistics.
Nowadays the availability of electronic computers is certainly a major factor in the modern development of statistics.
Descriptive Statistics
Descriptive statistics deals with the presentation and collection of data. This is usually the first part of a statistical
analysis. It is usually not as simple as it sounds, and the statistician needs to be aware of designing experiments, choosing
the right focus group and avoiding biases that can so easily creep into the experiment.
Different areas of study require different kinds of analysis using descriptive statistics. For
example, a physicist studying turbulence in the laboratory needs the average quantities that
vary over small intervals of time. The nature of this problem requires that physical quantities
be averaged from a host of data collected through the experiment.
1. Business:
Statistics plays an important role in business. A successful businessman must be very quick and accurate in decision making.
He should know what his customers want; he should therefore know what to produce and sell and in what quantities. Statistics
helps the businessman to plan production according to the taste of the customers, and the quality of the products can also be
checked more efficiently by using statistical methods. So all the activities of the businessman are based on statistical
information. He can make correct decisions about the location of business, marketing of the products, financial resources, etc.
2. In Economics:
Statistics plays an important role in economics. Economics largely depends upon statistics. National income accounts are
multipurpose indicators for economists and administrators, and statistical methods are used for the preparation of these
accounts. In economic research statistical methods are used for collecting and analysing data and for testing hypotheses.
The relationship between supply and demand is studied by statistical methods; the imports and exports, the inflation rate and
the per capita income are problems which require a good knowledge of statistics.
4. In Mathematics:
Statistics plays a central role in almost all natural and social sciences. The methods of the natural sciences are the most
reliable, but the conclusions drawn from them are only probable because they are based on incomplete evidence. Statistics helps
in describing these measurements more precisely. Statistics is a branch of applied mathematics. A large number of statistical
methods such as probability, averages, dispersion and estimation are used in mathematics, and different techniques of pure
mathematics such as integration, differentiation and algebra are used in statistics.
5. In Banking:
Statistics plays an important role in banking. The banks make use of statistics for a number of purposes. The banks work on the
principle that not all the people who deposit their money with them withdraw it at the same time. The bank earns profits out of
these deposits by lending to others on interest. The bankers use statistical approaches based on probability to estimate the
number of depositors and their claims for a certain day.
9. In Astronomy:
Astronomy is one of the oldest branches of statistical study; it deals with the measurement of distances, sizes, masses and
densities of heavenly bodies by means of observations. During these measurements errors are unavoidable, so the most probable
measurements are found by using statistical methods.
UNIT-II
MEASURES OF CENTRAL TENDENCY
Example 1
Calculate the mean for pH levels of soil 6.8, 6.6, 5.2, 5.6, 5.8
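A minimal Python sketch of this calculation (Python is only an illustration here, it is not part of the original notes):

ph_values = [6.8, 6.6, 5.2, 5.6, 5.8]   # soil pH readings from Example 1
mean = sum(ph_values) / len(ph_values)  # (6.8 + 6.6 + 5.2 + 5.6 + 5.8) / 5
print(mean)                             # 6.0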
Grouped Data
The mean for grouped data is obtained from the following formula:
x-bar = A + (sum of fd / n) x c, where d = (x - A) / c
Where
A = any value in x (the assumed mean)
n = total frequency
c = width of the class interval
Example 2
Given the following frequency distribution, calculate the arithmetic mean
Marks : 64 63 62 61 60 59
Number of Students : 8 18 12 9 7 6
Solution
x       f      fx      d = x - A    fd
64      8      512     2            16
63      18     1134    1            18
62      12     744     0            0
61      9      549     -1           -9
60      7      420     -2           -14
59      6      354     -3           -18
Total   60     3713                 -7
Short-cut method
Here A = 62, so x-bar = A + (sum of fd) / n = 62 + (-7) / 60 = 62 - 0.12 = 61.88 marks.
(Direct method: x-bar = (sum of fx) / n = 3713 / 60 = 61.88 marks.)
Example 3
For the frequency distribution of seed yield of plot given in table, calculate the mean yield
per plot.
Yield per plot (in g)    64.5-84.5    84.5-104.5    104.5-124.5    124.5-144.5
No. of plots             3            5             7              20
Solution
Yield (in g)      No. of plots (f)    Mid x     d     fd
64.5-84.5         3                   74.5      -1    -3
84.5-104.5        5                   94.5      0     0
104.5-124.5       7                   114.5     1     7
124.5-144.5       20                  134.5     2     40
Total             35                                  44
A = 94.5, c = 20
The mean yield per plot is:
Direct method: x-bar = (sum of f times mid x) / n = 4187.5 / 35 = 119.64 g
Short-cut method: x-bar = A + ((sum of fd) / n) x c = 94.5 + (44 / 35) x 20 = 119.64 g
Merits
1. It is rigidly defined.
2. It is easy to understand and easy to calculate.
3. If the number of items is sufficiently large, it is more accurate and more reliable.
4. It is a calculated value and is not based on its position in the series.
5. It is possible to calculate even if some of the details of the data are lacking.
6. Of all averages, it is affected least by fluctuations of sampling.
7. It provides a good basis for comparison.
Demerits
1. It cannot be obtained by inspection nor located through a frequency graph.
2. It cannot be used in the study of qualitative phenomena that are not capable of numerical measurement, e.g.
intelligence, beauty, honesty, etc.
3. It can ignore any single item only at the risk of losing its accuracy.
4. It is affected very much by extreme values.
5. It cannot be calculated for open-end classes.
6. It may lead to fallacious conclusions, if the details of the data from which it is computed are
not given.
Median
The median is the middle most item that divides the group into two equal parts, one part
comprising all values greater, and the other, all values less than that item.
Ungrouped or Raw data
Arrange the given values in ascending order. If the number of values is odd, the median is the middle value. If the number of
values is even, the median is the mean of the middle two values.
By formula: Median = value of the ((n + 1)/2)th item.
Example 4
If the weights of sorghum ear heads are 45, 60,48,100,65 gms, calculate the median
Solution
Here n = 5
First arrange the values in ascending order:
45, 48, 60, 65, 100
Median = value of the ((5 + 1)/2)th item = value of the 3rd item = 60 g
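A short Python sketch of the ungrouped-data rule (illustrative only, not part of the original notes):

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                         # odd number of values: the middle one
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2       # even: mean of the two middle values

print(median([45, 60, 48, 100, 65]))       # 60 (Example 4)
print(median([5, 48, 60, 65, 65, 100]))    # 62.5 (Example 5)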
Example 5
If the sorghum ear- heads are 5,48, 60, 65, 65, 100 gms, calculate the median.
Solution
Here n = 6. The two middle values are the 3rd and 4th items, so Median = (60 + 65)/2 = 62.5 g.
Grouped data
In a grouped distribution, values are associated with frequencies. Grouping can be in the
form of a discrete frequency distribution or a continuous frequency distribution. Whatever may
be the type of distribution, cumulative frequencies have to be calculated to know the total
number of items.
Cumulative frequency (cf)
Cumulative frequency of each class is the sum of the frequency of the class and the frequencies of the previous classes, i.e.
adding the frequencies successively, so that the last cumulative frequency gives the total number of items.
Discrete Series
Step 1: Find the cumulative frequencies.
Step 2: Find (n + 1)/2.
Step 3: See in the cumulative frequencies the value just greater than (n + 1)/2; the corresponding value of x is the median.
Example 6
The following data pertain to the number of insects per plant. Find the median number of insects per plant.
Number of insects per plant (x) 1 2 3 4 5 6 7 8 9 10 11 12
No. of plants(f) 1 3 5 6 10 13 9 5 3 2 2 1
Solution
Form the cumulative frequency table
x f cf
1 1 1
2 3 4
3 5 9
4 6 15
5 10 25
6 13 38
7 9 47
8 5 52
9 3 55
10 2 57
11 2 59
12 1 60
60
Median = size of the ((n + 1)/2)th item = size of the 30.5th item.
Here the number of observations is even, so the median is the average of the (n/2)th and (n/2 + 1)th items, i.e. the average
of the 30th and 31st items. Both fall within the cumulative frequency 38, for which x = 6, so the median is 6 insects per plant.
Continuous Series
Step 1: Find the cumulative frequencies.
Step 2: Find n/2.
Step 3: See in the cumulative frequencies the value first greater than n/2; the corresponding class interval is called the
median class. Then apply the formula
Median = l + ((n/2 - m)/f) x c
where l = lower limit of the median class, m = cumulative frequency preceding the median class, f = frequency of the median
class and c = width of the class interval.
Example 7
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the median.
Weights of ear heads (in g)    No. of ear heads (f)    Less-than class    Cumulative frequency (m)
60-80                          22                      <80                22
80-100                         38                      <100               60
100-120                        45                      <120               105
120-140                        35                      <140               140
140-160                        24                      <160               164
Total                          164
Solution
n/2 = 164/2 = 82.
It lies between 60 and 105. Corresponding to 60 the less-than class is 100 and corresponding to 105 the less-than class is 120.
Therefore the median class is 100-120. Its lower limit is 100.
Here l = 100, n = 164, f = 45, c = 20, m = 60.
Median = 100 + ((82 - 60)/45) x 20 = 100 + 9.78 = 109.78 g
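The same grouped-data formula can be sketched in Python; the class list below simply restates the table of Example 7
(illustrative code, not part of the original notes):

classes = [(60, 80, 22), (80, 100, 38), (100, 120, 45), (120, 140, 35), (140, 160, 24)]

n = sum(f for _, _, f in classes)       # 164
half = n / 2                            # 82
cum = 0
for lower, upper, f in classes:
    if cum + f >= half:                 # first class whose cumulative frequency reaches n/2
        l, m, c = lower, cum, upper - lower
        break
    cum += f

median = l + (half - m) / f * c
print(round(median, 2))                 # about 109.78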
Merits of Median
1. Median is not influenced by extreme values because it is a positional average.
2. Median can be calculated in case of distribution with open-end intervals.
3. Median can be located even if the data are incomplete.
Demerits of Median
1. A slight change in the series may bring drastic change in median value.
2. In case of even number of items or continuous series, median is an estimated value
other than any value in the series.
3. It is not suitable for further mathematical treatment except its use in
calculating mean deviation.
4. It does not take into account all the observations.
Mode
The mode refers to the value in a distribution which occurs most frequently. It is an actual value, which has the highest
concentration of items in and around it. It shows the centre of concentration of the frequency in and around a given value.
Therefore, where the purpose is to know the point of highest concentration, it is preferred. It is, thus, a positional measure.
Its importance is very great in agriculture, for example to find the typical height of a crop variety, the main source of
irrigation in a region, or the most disease-prone paddy variety. Thus the mode is an important measure in the case of
qualitative data.
Grouped Data
For a discrete distribution, the value of x corresponding to the highest frequency is the mode.
Example:
Find the mode for the following
Weight of sorghum in gms (x)    No. of ear heads (f)
50                              4
65                              6
75                              16
80                              8
95                              7
100                             4
Solution
The maximum frequency is 16. The corresponding x value is 75.
mode = 75 gms.
Continuous distribution
Locate the highest frequency; the class corresponding to that frequency is called the modal class. Then apply the formula
Mode = l + (f2 x c) / (f1 + f2)
where l = lower limit of the modal class, f1 = frequency of the class preceding the modal class, f2 = frequency of the class
succeeding the modal class and c = width of the class interval.
Example 10
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the mode.
Weights of ear heads (g)    No. of ear heads (f)
60-80                       22
80-100                      38
100-120                     45   (modal class)
120-140                     35
140-160                     20
Total                       160
Solution
The highest frequency is 45, so the modal class is 100-120. Here l = 100, f1 = 38, f2 = 35, c = 20.
Mode = 100 + (35 x 20) / (38 + 35)
     = 100 + 9.589
     = 109.589 g
Geometric mean
The geometric mean of a series containing n observations is the nth root of the product of the values. If x1, x2, ..., xn are
the observations, then
G.M. = (x1 . x2 . ... . xn)^(1/n)
log GM = (log x1 + log x2 + ... + log xn) / n
GM = Antilog((sum of log x) / n)
For grouped data, GM = Antilog((sum of f log x) / n).
Example 11
The weights of sorghum ear heads are 45, 60, 48, 100, 65 g. Find the geometric mean for the following data.
Weight of ear head x (g)    log x
45                          1.653
60                          1.778
48                          1.681
100                         2.000
65                          1.813
Total                       8.925
Solution
Here n = 5
GM = Antilog((sum of log x) / n)
   = Antilog(8.925 / 5)
   = Antilog(1.785) = 60.95 g
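A small Python sketch of the logarithm route to the geometric mean used in Example 11 (Python and the math module are
illustrative assumptions, not tools used in the notes):

import math

weights = [45, 60, 48, 100, 65]                   # ear-head weights from Example 11
log_sum = sum(math.log10(x) for x in weights)     # about 8.925
gm = 10 ** (log_sum / len(weights))               # antilog of the mean logarithm
print(round(gm, 2))                               # about 60.95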
Grouped Data
Example 12
Find the Geometric mean for the following
Weight of sorghum (x) No. of ear head(f)
50 4
65 6
75 16
80 8
95 7
100 4
Solution
Weight of sorghum (x)    No. of ear heads (f)    log x     f log x
50                       4                       1.699     6.796
65                       6                       1.813     10.878
75                       16                      1.875     30.000
80                       8                       1.903     15.224
95                       7                       1.978     13.846
100                      4                       2.000     8.000
Total                    45                                 84.744
Here n = 45
GM = Antilog((sum of f log x) / n)
   = Antilog(84.744 / 45)
   = Antilog(1.883) = 76.4 g
Continuous distribution
Example 13
For the frequency distribution of weights of sorghum ear-heads given in the table below, calculate the geometric mean.
Weights of ear No of ear
heads ( in g) heads (f)
60-80 22
80-100 38
100-120 45
120-140 35
140-160 20
Total 160
Solution
Weights of ear heads (in g)    No. of ear heads (f)    Mid x    log x     f log x
60-80                          22                      70       1.845     40.59
80-100                         38                      90       1.954     74.25
100-120                        45                      110      2.041     91.85
120-140                        35                      130      2.114     73.99
140-160                        20                      150      2.176     43.52
Total                          160                                        324.20
Here n = 160
GM = Antilog((sum of f log x) / n)
   = Antilog(324.2 / 160)
   = Antilog(2.026)
   = 106.23 g
Harmonic Mean
Example 13
From the given data 5, 10, 17, 24, 30 calculate the H.M.
Solution
x       1/x
5       0.2000
10      0.1000
17      0.0588
24      0.0417
30      0.0333
Total   0.4338
H.M. = n / (sum of 1/x) = 5 / 0.4338 = 11.526
Example 14
Number of tomatoes per plant are given below. Calculate the harmonic mean.
Number of tomatoes per plant 20 21 22 23 24 25
Number of plants 4 2 7 1 3 1
Solution
Number of tomatoes per plant (x)    No. of plants (f)    1/x       f/x
20                                  4                    0.0500    0.2000
21                                  2                    0.0476    0.0952
22                                  7                    0.0454    0.3178
23                                  1                    0.0435    0.0435
24                                  3                    0.0417    0.1251
25                                  1                    0.0400    0.0400
Total                               18                             0.8216
H.M. = (sum of f) / (sum of f/x) = 18 / 0.8216 = 21.91
Merits of H.M
1. It is rigidly defined.
2. It is defined on all observations.
3. It is amenable to further algebraic treatment.
4. It is the most suitable average when it is desired to give greater weight to smaller observations
and less weight to the larger ones.
Demerits of H.M
1. It is not easily understood.
2. It is difficult to compute.
3. It is only a summary figure and may not be the actual item in the series
4. It gives greater importance to small items and is therefore, useful only when small items have
to be given greater weightage.
5. It is rarely used in grouped data.
Percentiles
The percentile values divide the distribution into 100 parts each containing 1 percent of
the cases. The xth percentile is that value below which x percent of values in the distribution fall.
It may be noted that the median is the 50th percentile.
For raw data, first arrange the n observations in increasing order. Then the xth percentile is the value at position
x(n + 1)/100, interpolating between the two adjacent observations when this position is not a whole number.
For grouped data, the xth percentile is given by
Px = l + ((xn/100 - m) / f) x c
Where
l = lower limit of the percentile class which contains the xth percentile value (x.n/100)
m = cumulative frequency up to the percentile class
f = frequency of the percentile class
c = class interval
n = total number of observations
Example 15
The following are the paddy yields (kg/plot) from 14 plots:
30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62 and 65 (after arranging in ascending order). The computation of the 25th
percentile (Q1) and the 75th percentile (Q3) is given below:
P25 = 35 + 0.75 x (38 - 35)
    = 35 + 2.25 = 37.25 kg
P75 = 55 + 0.25 x (58 - 55)
    = 55 + 0.75 = 55.75 kg
Example 16
The frequency distribution of weights of 190 sorghum ear-heads is given below. Compute the 25th percentile and the 75th
percentile.
Weight of ear- No of ear
heads (in g) heads
40-60 6
60-80 28
80-100 35
100-120 55
120-140 30
140-160 15
160-180 12
180-200 9
Total 190
Solution
Weight of ear-heads (in g)    No. of ear heads    Less-than class    Cumulative frequency
40-60                         6                   <60                6
60-80                         28                  <80                34
80-100                        35                  <100               69
100-120                       55                  <120               124
120-140                       30                  <140               154
140-160                       15                  <160               169
160-180                       12                  <180               181
180-200                       9                   <200               190
Total                         190
For P25, first find 25 x n/100 = 47.5, and for P75 find 75 x n/100 = 142.5, and proceed as in the case of the median.
The value 47.5 lies between 34 and 69. Therefore, the percentile class is 80-100. Hence,
P25 = 80 + ((47.5 - 34)/35) x 20 = 80 + 7.71 = 87.71 g.
Similarly, 142.5 lies between 124 and 154, so the percentile class for P75 is 120-140 and
P75 = 120 + ((142.5 - 124)/30) x 20 = 120 + 12.33 = 132.33 g.
Quartiles
The quartiles divide the distribution in four parts. There are three quartiles. The second
quartile divides the distribution into two halves and therefore is the same as the median. The first
(lower) quartile (Q1) marks off the first one-fourth, the third (upper) quartile (Q3) marks off the
three-fourth. It may be noted that the second quartile is the value of the median and 50th
percentile.
Example 18
Compute quartiles for the data given below (grains/panicles) 25, 18, 30, 8, 15, 5, 10, 35, 40, 45
Solution
5, 8, 10, 15, 18, 25, 30, 35, 40, 45
Q1 = value of the ((n + 1)/4)th item = (2.75)th item
   = 8 + 0.75 x (10 - 8)
   = 8 + 0.75 x 2
   = 8 + 1.5
   = 9.5
Q3 = value of the (3(n + 1)/4)th item = 3 x (2.75)th item
   = (8.25)th item
   = 35 + 0.25 x (40 - 35)
   = 35 + 1.25
   = 36.25
Discrete Series
Step 1: Find the cumulative frequencies.
Step 2: Find (n + 1)/4.
Step 3: See in the cumulative frequencies the value just greater than (n + 1)/4; the corresponding value of x is Q1.
Step 4: Find 3(n + 1)/4.
Step 5: See in the cumulative frequencies the value just greater than 3(n + 1)/4; the corresponding value of x is Q3.
Example 19
Compute quartiles for the data given bellow (insects/plant).
X 5 8 12 15 19 24 30
f 4 3 2 4 5 2 4
Solution
x       f     cf
5       4     4
8       3     7
12      2     9
15      4     13
19      5     18
24      2     20
30      4     24
Total   24
Here (n + 1)/4 = 25/4 = 6.25; the cumulative frequency just greater than 6.25 is 7, so Q1 = 8.
3(n + 1)/4 = 18.75; the cumulative frequency just greater than 18.75 is 20, so Q3 = 24.
Continuous series
Step 1: Find the cumulative frequencies.
Step 2: Find n/4.
Step 3: See in the cumulative frequencies the value just greater than n/4; the corresponding class interval is called the
first quartile class. Similarly, the class whose cumulative frequency first exceeds 3n/4 is the third quartile class.
Step 4: Apply the respective formulae
Q1 = l + ((n/4 - m)/f) x c
Q3 = l + ((3n/4 - m)/f) x c
where l, m, f and c are the lower limit, preceding cumulative frequency, frequency and width of the corresponding quartile class.
Table 1 shows the number of touchdown (TD) passes thrown by each of the 31 teams in the
National Football League in the 2000 season. The mean number of touchdown passes thrown is
20.4516 as shown below.
μ = ΣX/N
= 634/31
= 20.4516
Although the arithmetic mean is not the only "mean" (there is also a geometric mean), it is by far
the most commonly used. Therefore, if the term "mean" is used without specifying whether it is
the arithmetic mean, the geometric mean, or some other mean, it is assumed to refer to the
arithmetic mean.
Median
The median is also a frequently used measure of central tendency. The median is the
midpoint of a distribution: the same number of scores is above the median as below it.
For the data in Table 1, there are 31 scores. The 16th highest score (which equals 20) is
the median because there are 15 scores below the 16th score and 15 scores above the
16th score. The median can also be thought of as the 50th percentile.
GEOMETRIC MEAN
Geometric Mean is a special type of average where we multiply the numbers together and then
take a square root (for two numbers), cube root (for three numbers) etc.
Example: What is the Geometric Mean of 2 and 18?
First we multiply them: 2 × 18 = 36
Then (as there are two numbers) take the square root: √36 = 6
In one line:
Geometric Mean of 2 and 18 = √(2 × 18) = 6
It is like saying a 2 x 18 rectangle and a 6 x 6 square have the same area.
Example: What is the Geometric Mean of 10, 51.2 and 8?
In one line:
Geometric Mean = cube root of (10 x 51.2 x 8) = cube root of 4096 = 16
Example: What is the Geometric Mean of 1, 3, 9, 27 and 81?
In one line:
Geometric Mean = fifth root of (1 x 3 x 9 x 27 x 81) = fifth root of 59049 = 9
I can't show you a nice picture of this, but it is still true that:
1 × 3 × 9 × 27 × 81 = 9 × 9 × 9 × 9 × 9
Harmonic Mean
A kind of average. To find the harmonic mean of a set of n numbers, add the reciprocals of the
numbers in the set, divide the sum by n, then take the reciprocal of the result. The harmonic
mean of {a1, a2, a3, ..., an} is n / (1/a1 + 1/a2 + ... + 1/an).
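A one-line Python sketch of this definition (illustrative only, not part of the original notes), using the data of Example 13
from earlier:

values = [5, 10, 17, 24, 30]                    # data from Example 13
hm = len(values) / sum(1 / x for x in values)   # n divided by the sum of reciprocals
print(round(hm, 3))                             # about 11.526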
RANGE
The difference between the lowest and highest values.
In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9, so the range is 9 − 3 = 6.
Range can also mean all the output values of a function.
Quartile Deviation :
In a distribution, half the difference between the upper quartile and the lower quartile is known as the 'quartile deviation':
Q.D. = (Q3 - Q1)/2. The quartile deviation is therefore often regarded as the semi-interquartile range.
Mean Deviation
Three steps:
1. Find the mean of all values
2. Find the distance of each value from that mean (subtract the mean from each value, ignore minus signs)
3. Then find the mean of those distances
Example: the Mean Deviation of 3, 6, 6, 7, 8, 11, 15, 16
Step 1: Find the mean:
Mean = (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16) / 8 = 72 / 8 = 9
Step 2: Find the distance of each value from that mean:
Value Distance from 9
3 6
6 3
6 3
7 2
8 1
11 2
15 6
16 7
Step 3: Find the mean of those distances:
Mean Deviation = (6 + 3 + 3 + 2 + 1 + 2 + 6 + 7) / 8 = 30 / 8 = 3.75
In that example the values are, on average, 3.75 away from the middle.
For deviation, just think "distance".
Formula
The formula is:
Mean Deviation = (sum of |x - mu|) / N
Let's learn more about those symbols!
Firstly:
μ is the mean (in our example μ = 9)
x is each value (such as 3 or 16)
N is the number of values (in our example N = 8)
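The whole calculation can be sketched in a few lines of Python (illustrative only, not part of the original notes):

values = [3, 6, 6, 7, 8, 11, 15, 16]
mu = sum(values) / len(values)                        # the mean, 9
md = sum(abs(x - mu) for x in values) / len(values)   # mean of the absolute distances
print(mu, md)                                         # 9.0 3.75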
Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the greek letter sigma)
The formula is easy: it is the square root of the Variance. So now you ask, "What is the
Variance?"
Variance:
The Variance is defined as the average of the squared differences from the Mean.
Example: for the grouped data below, the variance can be found either directly from the deviations from the mean, or by the
short-cut method with an assumed mean A = 2.
Class    f      Mid x    fx     (x - mean)^2    f(x - mean)^2
1-3      40     2        80     4               160
3-5      30     4        120    0               0
5-7      20     6        120    4               80
7-9      10     8        80     16              160
Total    100             400                    400
Mean = 400/100 = 4, Variance = 400/100 = 4, Standard Deviation = 2.
Class    f      Mid x    d = x - 2    fd     fd^2
1-3      40     2        0            0      0
3-5      30     4        2            60     120
5-7      20     6        4            80     320
7-9      10     8        6            60     360
Total    100                          200    800
Variance = (sum of fd^2)/n - ((sum of fd)/n)^2 = 800/100 - (200/100)^2 = 8 - 4 = 4, Standard Deviation = 2.
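A Python sketch of the grouped-data variance computed directly in the first table above (illustrative only):

mids  = [2, 4, 6, 8]          # class mid-values from the table above
freqs = [40, 30, 20, 10]      # class frequencies

n = sum(freqs)                                                        # 100
mean = sum(f * x for f, x in zip(freqs, mids)) / n                    # 400 / 100 = 4
variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids)) / n  # 400 / 100 = 4
print(mean, variance, variance ** 0.5)                                # 4.0 4.0 2.0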
SKEWNESS
Measure of Skewness:
The difference between the mean and the mode gives an absolute measure of skewness. If we divide this difference by the
standard deviation we obtain a relative measure of skewness known as the coefficient of skewness, denoted by SK.
SK = (Mean - Mode) / S.D.
Sometimes the mode is difficult to find, so we use another formula:
SK = 3(Mean - Median) / S.D.
Bowley's coefficient of Skewness
SK = (Q1 + Q3 - 2 Median) / (Q3 - Q1)
Kelly's Measure of Skewness is one of several ways to measure skewness in a data distribution. Bowley's skewness is based on
the middle 50 percent of the observations in a data set; it leaves out 25 percent of the observations in each tail of the
distribution. Kelly suggested that leaving out fifty percent of the data to calculate skewness was too extreme, and created a
measure based on more of the data. Kelly's measure is based on P90 (the 90th percentile) and P10 (the 10th percentile); only
twenty percent of the observations (ten percent in each tail) are excluded from the measure.
Kelly's measure of skewness is given in terms of percentiles and deciles (D). Kelly's absolute measure of skewness is
Sk = P90 + P10 - 2 P50 (equivalently D9 + D1 - 2 D5),
and the relative (coefficient) form is Sk = (P90 + P10 - 2 P50) / (P90 - P10).
Kelly's Measure of Skewness gives the same kind of information about skewness as the other skewness measures above.
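The three relative measures can be sketched as small Python functions written directly from the formulas above; the numbers
passed in at the end are purely illustrative, not data from these notes:

def pearson_sk(mean, mode, sd):
    return (mean - mode) / sd

def bowley_sk(q1, median, q3):
    return (q1 + q3 - 2 * median) / (q3 - q1)

def kelly_sk(p10, p50, p90):
    return (p90 + p10 - 2 * p50) / (p90 - p10)

print(pearson_sk(50, 45, 10))    # 0.5: mean above mode, positive skew
print(bowley_sk(20, 35, 60))     # 0.25: based on the quartiles
print(kelly_sk(12, 25, 48))      # about 0.28: based on the 10th and 90th percentiles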
UNIT –III
TABULATION OF UNIVARIATE
Temperature (°C)    Ice Cream Sales
18.5                $406
22.1                $522
19.4                $412
25.1                $614
23.4                $544
18.1                $421
22.6                $445
17.2                $408
Now we can easily see that warmer weather and more ice cream sales are linked, but the
relationship is not perfect.
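The strength of that link can be quantified with Pearson's correlation coefficient, previewed here in a short Python sketch on
the table above (statistics.correlation needs Python 3.10 or later; the code is an illustration, not part of the original notes):

import statistics

temperature = [18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales       = [406, 522, 412, 614, 544, 421, 445, 408]

r = statistics.correlation(temperature, sales)   # Pearson's r
print(round(r, 2))                               # close to +1: a strong positive association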
Multivariate data
Multivariate Data Analysis refers to any statistical technique used to analyze data that arises
from more than one variable. This essentially models reality where each situation, product, or
decision involves more than a single variable.
Univariate Data
Univariate data involve a single variable, and the analysis describes that one variable on its own.
Bivariate Data
Bivariate data involve two variables observed together, so that the relationship between them can be studied (for example,
temperature and ice cream sales above).
General Rules for Constructing Diagrams
The diagram should be properly drawn at the outset. The pith and substance of the subject
matter must be made clear under a broad heading which properly conveys the purpose of
a diagram.
The size of the scale should neither be too big nor too small. If it is too big, it may look
ugly. If it is too small, it may not convey the meaning. In each diagram, the size of the
paper must be taken note-of. It will help to determine the size of the diagram.
For clarifying certain ambiguities some notes should be added at the foot of the diagram.
This shall provide the visual insight of the diagram.
Diagrams should be absolutely neat and clean. There should be no vagueness or
overwriting on the diagram.
Simplicity refers to love at first sight. It means that the diagram should convey the
meaning clearly and easily.
Scale must be presented along with the diagram.
It must be Self-Explanatory. It must indicate nature, place and source of data presented.
Different shades, colors can be used to make diagrams more easily understandable.
Vertical diagram should be preferred to Horizontal diagrams.
It must be accurate. Accuracy must not be done away with to make it attractive or
impressive.
Limitations of Diagrammatic Presentation
Types of Diagrams
(a) Line Diagrams
In these diagrams only line is drawn to represent one variable. These lines may be vertical or
horizontal. The lines are drawn such that their length is the proportion to value of the terms or
items so that comparison may be done easily.
(b) Simple Bar Diagrams
Like line diagrams, these figures are also used where only a single dimension, i.e. length, can present the data. The procedure
is almost the same, only the lines are drawn with a greater thickness as bars. These can also be drawn either vertically or
horizontally. The breadth of these lines or bars should be equal, and similarly the distance between the bars should be equal.
The breadth and the distance between them should be chosen according to the space available on the paper.
Imagine you just did a survey of your friends to find which kind of movie they liked best:
It is a really good way to show relative sizes: we can see which types of movie are most liked,
and which are least liked, at a glance.
We can use bar graphs to show the relative sizes of many things, such as what type of car people
have, how many customers a shop has on different days and so on.
Multiple Bar Diagram
This diagram is used when we have to make comparisons between two or more variables. The number of variables may be 2, 3, 4 or
more. In the case of 2 variables, a pair of bars is drawn; similarly, in the case of 3 variables, triple bars are drawn. The
bars are drawn on the same proportionate basis as in the case of simple bars. The same shade is given to the same item.
Example: Draw a multiple bar chart to represent the imports and exports of Canada (values in $) for the years 1991 to 1995.
(Chart showing the imports and exports of Canada from 1991 to 1995.)
Such a chart makes comparison between the years easy.
Advantages
The chief advantages of a bar diagram can be outlined as under:
1. It is very simple to draw and read as well.
2. It is the only form of diagram which can represent a large number of data on a piece of paper.
3. It can be drawn both vertically and horizontally.
4. It gives a better look and facilitates comparison.
Disadvantages
1. It cannot exhibit a large number of aspects of the data.
2. The widths of the bars are fixed arbitrarily by the drawer.
Two-Dimensional
A shape that only has two dimensions (such as width and height) and no thickness. Squares,
Circles, Triangles, etc are two dimensional objects.
Also known as "2D".
3.Three-Dimensional
An object that has height, width and depth, like any object in the real world. Example:
your body is three-dimensional.
Also known as "3D".
Pie Chart: a special chart that uses "pie slices" to show relative sizes of data.
Imagine you survey your friends to find the kind of movie they like best:
Table: Favorite Type of Movie
UNIT-IV
In this section we will adjust our statistical test for the population mean to apply to small-sample situations. Fortunately,
this will be easy; in fact, once you understand one statistical test, additional tests are easy since they all follow a
similar procedure.
The only difference in performing a "small sample" statistical test for the mean as opposed to a
"large sample test" is that we do not use the normal distribution as prescribed by the Central
Limit theorem, but instead a more conservative distribution called the T-Distribution. The
Central Limit theorem applies best when sample sizes are large so that we need to make some
adjustment in computing probabilities for small sample sizes. The appropriate function in Excel
is the TDIST function, defined as follows:
TDIST(T, N-1, TAILS), where T is the value of the test statistic, N - 1 is the number of degrees of freedom, and TAILS is 1
for a one-tailed or 2 for a two-tailed probability.
With that new Excel function our test procedure for a sample mean, small sample size, is as follows:
Statistical Test for the Mean (small sample size N < 30):
Fix an error level you are comfortable with (something like 10%, 5%, or 1% is most
common). Denote that "comfortable error level" by the letter "A".
Then set up the test as follows:
Null Hypothesis H0:
mean = M, i.e. The mean is a known number M Alternative Hypothesis Ha:
mean ≠ M, i.e. mean is different from M (2-tailed test)
Test Statistics:
Select a random sample of size N, compute its sample mean X and the standard deviation
S. Then compute the corresponding t-score as follows:
T= (X - M) / ( S / sqrt(N) )
Then compute the two-tailed probability p = TDIST(ABS(T), N-1, 2).
Rejection Region (Conclusion)
If the probability p computed in the above step is less than A (the error level you were comfortable with initially), you
reject the null hypothesis H0 and accept the alternative hypothesis. Otherwise you declare your test inconclusive.
Comments:
The null and alternative hypothesis for this test are the same as before
The calculation of the test statistics is the same as before, but the result is called T instead
of Z (oh well -:)
The TDIST function is similar to the NORMSDIST function, but it does not work for
negative values of T (a limitation of Excel), and it automatically gives a "tail" probability.
Thus, the computation of the p-value had to be adjusted accordingly.
The ABS function in the above formulas stands for the "absolute value" function. (In
other words, just drop any minus signs ... -:)
Example 1: A group of secondary education student teachers were given 2 1/2 days of training in interpersonal communication
group work. The effect of such a training session on the dogmatic nature of the student teachers was measured by the
difference of scores on the "Rokeach Dogmatism test" given before and after the training session. The difference "post minus
pre score" was recorded as follows:
-16, -5, 4, 19, -40, -16, -29, 15, -2, 0, 5, -23, -3, 16, -8, 9, -14, -33, -64, -33
Can we conclude from this evidence that the training session makes student teachers less
dogmatic (at the 5% level of significance) ?
This is of course the same example as before, where we incorrectly used the normal distribution to compute the probability in
the last step. This time, we will do it correctly, which is fortunately almost identical to the previous case (except that we
use TDIST instead of NORMSDIST):
Null Hypothesis: there is no difference in dogmatism, i.e. mean = 0
Alternative Hypothesis: dogmatism is different, i.e. mean not equal to 0
Test statistics: sample mean = -10.9, standard deviation = 21.33, sample size = 20.
Compute T = (-10.9 - 0) / (21.33 / sqrt(20)) = -2.29, and then p = TDIST(ABS(-2.29), 19, 2), which is about 0.034.
Note that in the previous section we (incorrectly) computed the probability p to be 2.2%, now it
is 3.4%. The difference is small, but can be significant in special situations. Thus, to be safe:
if N > 30 use the Z-Test based on the standard normal distribution NORMSDIST as in
the previous section
if N < 30 use the T-Test based on the T-Distribution TDIST as in this section
Example 2: Suppose GAP, the clothing store, wants to introduce their line of clothing for
women to another country. But their clothing sizes are based on the assumption that the
average size of a woman is 162 cm. To determine whether they can simply ship the clothes
to the new country they select 5 women at random in the target country and determine
their heights as follows: 149, 165, 150, 158, 153
Should they adjust their line of clothing, or can they ship it without change? Make sure to decide at the 0.05 level. By now
statistical testing is second nature (I hope -:)
Null Hypothesis: mean height in the new country is the same as in the old country, i.e. M = 162
Alt. Hypothesis: mean height in the new country is different from the old country, i.e. M not equal to 162 (either too small
or too tall would be bad for GAP)
Test Statistics: we can compute the sample mean = 155 and the sample standard
deviation = 6.59 while the sample size is clearly N = 5.
Therefore
T = (155 - 162) / ( 6.59 / sqrt(5) ) = -2.37
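For comparison, the same small-sample test can be run with scipy instead of Excel's TDIST; scipy is an assumption here, not a
tool used in the notes:

from scipy import stats

heights = [149, 165, 150, 158, 153]             # the five sampled heights (cm)
t, p = stats.ttest_1samp(heights, popmean=162)  # H0: mean height = 162 cm

print(round(t, 2))   # about -2.37, matching the hand computation above
print(round(p, 3))   # two-tailed p-value; here it comes out above 0.05
# Since p > 0.05 the test is inconclusive: there is no statistical evidence at the
# 5% level that the mean height differs from 162 cm.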
t-Distribution
t = (X - M) / (S / sqrt(N)), with N - 1 degrees of freedom.
This statistic follows the Student t-Distribution, or simply t-Distribution, for all samples of size n less than 30.
To find the Rejection Region, we can use the t-Distribution Table. This table merely requires
knowledge of the sample size, which allows us to calculate the Degrees of Freedom, and the
Significance Level.
As illustrated above, the t-distribution has many properties which differentiate it from the
standard normal or z-distribution.
The distribution shares the bell shape of the z, but reflects the variability that is inherent with smaller sample sizes.
The shape of the t-distribution depends on the sample size n.
The standard deviation is greater than 1.
As the sample size n increases, the shape of the curve approaches the standard normal distribution.
PAIRED T TEST
Paired sample t-test is a statistical technique that is used to compare two population means in the case of two samples that
are correlated. The paired sample t-test is used in 'before-after' studies, when the samples are matched pairs, or when it is
a case-control study. For example, if we give training to a company employee and we want to know whether or not the training
had any impact on the efficiency of the employee, we could use the paired sample t-test. We collect data from the employee on
a seven-point rating scale, before the training and after the training. By using the paired sample t-test, we can statistically
conclude whether or not training has improved the efficiency of the employee. In medicine, by using the paired sample t-test,
we can figure out whether or not a particular medicine will cure the illness.
Steps:
1. Set up hypothesis: We set up two hypotheses. The first is the null hypothesis, which
assumes that the mean of two paired samples are equal. The second hypothesis will
be an alternative hypothesis, which assumes that the means of two paired samples are
not equal.
2. Select the level of significance: After making the hypothesis, we choose the level
of significance. In most of the cases, significance level is 5%, (in medicine, the
significance level is set at 1%).
3. Calculate the parameter: To calculate the parameter we will use the following formula:
t = d-bar / (s / sqrt(n))
where d-bar is the mean difference between the two samples, s is the standard deviation of the differences (s^2 the sample
variance), n is the sample size and t is a paired sample t-statistic with n - 1 degrees of freedom. An alternate formula for
the paired sample t-test is:
t = (sum of d) / sqrt( (n(sum of d^2) - (sum of d)^2) / (n - 1) )
Assumptions:
1. Only the matched pairs can be used to perform the test.
2. Normal distributions are assumed.
3. The variance of two samples is equal.
4. Cases must be independent of each other.
Let x = the difference in weight 3 months after the program starts. The null hypothesis is:
H0: μ = 0; i.e. any differences in weight is due to chance
We can make the following calculations using the difference column D:
s.e. = std dev / sqrt(n) = 6.33 / sqrt(15) = 1.634
tobs = (x-bar - mu) / s.e. = (10.93 - 0) / 1.634 = 6.69
tcrit = TINV(alpha, df) = TINV(.05, 14) = 2.145
Since tobs > tcrit we reject the null hypothesis and conclude with 95% confidence that the difference in weight before and
after the program is not due solely to chance.
Alternatively we can use a type 1 TTEST to perform the analysis as follows:
p-value = TTEST(B4:B18, C4:C18, 2, 1) = 1.028E-05 < .05 = α
and so once again we reject the null hypothesis.
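A paired t-test can also be run directly in Python with scipy; the before/after weights below are hypothetical stand-ins (the
spreadsheet ranges B4:B18 and C4:C18 referred to above are not reproduced in these notes):

from scipy import stats

# Hypothetical before/after weights for eight people (illustrative data only)
before = [210, 205, 193, 182, 259, 239, 164, 197]
after  = [197, 195, 191, 174, 236, 226, 157, 196]

t, p = stats.ttest_rel(before, after)   # paired (related-samples) t-test, df = n - 1
print(round(t, 2), round(p, 4))
# Reject H0 (no difference) when p is below the chosen significance level, e.g. 0.05.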
The test procedure described in this lesson is appropriate when the following conditions are met:
The sampling method is simple random sampling.
The variables under study are each categorical.
If sample data are displayed in a contingency table, the expected frequency count for
each cell of the table is at least 5.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis plan, (3)
analyze sample data, and (4) interpret results.
Suppose that Variable A has r levels, and Variable B has c levels. The null hypothesis states that
knowing the level of Variable A does not help you predict the level of Variable B. That is, the
variables are independent.
The alternative hypothesis is that knowing the level of Variable A can help you predict the level
of Variable B.
Note: Support for the alternative hypothesis suggests that the variables are related; but the
relationship is not necessarily causal, in the sense that one variable "causes" the other.
The analysis plan describes how to use sample data to accept or reject the null hypothesis. The
plan should specify the following elements.
Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or
0.10; but any value between 0 and 1 can be used.
Test method. Use the chi-square test for independence to determine whether
there is a significant relationship between two categorical variables.
Analyze Sample Data
Using sample data, find the degrees of freedom, expected frequencies, test statistic, and the P- value
associated with the test statistic. The approach described in this section is illustrated in the sample
problem at the end of this lesson.
DF = (r - 1) * (c - 1)
where r is the number of levels for one categorical variable, and c is the number of levels for the other categorical variable.
Expected frequencies. The expected frequency counts are computed separately for each level of one categorical variable at each
level of the other categorical variable. Compute r x c expected frequencies, according to the following formula:
Er,c = (nr x nc) / n
where Er,c is the expected frequency count for level r of Variable A and level c of Variable B, nr is the total number of
sample observations at level r of Variable A, nc is the total number of sample observations at level c of Variable B, and n is
the total sample size.
Test statistic. The test statistic is a chi-square random variable (Χ2) defined by the following equation:
Χ2 = Σ [ (Or,c - Er,c)2 / Er,c ]
where Or,c is the observed frequency count at level r of Variable A and level c of Variable B, and Er,c is the expected
frequency count at level r of Variable A and level c of Variable B.
P-value. The P-value is the probability of observing a sample statistic as extreme as the
test statistic. Since the test statistic is a chi-square, use the Chi-Square Distribution
Calculator to assess the probability associated with the test statistic. Use the degrees of
freedom computed above.
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the null
hypothesis. Typically, this involves comparing the P-value to the significance level, and rejecting
the null hypothesis when the P-value is less than the significance level.
Problem
A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by
gender (male or female) and by voting preference (Republican, Democrat, or Independent). Results are
shown in the contingency table below.
Voting Preferences
         Republican    Democrat    Independent    Row total
Male     200           150         50             400
Is there a gender gap? Do the men's voting preferences differ significantly from the women's
preferences? Use a 0.05 level of significance.
Solution
State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
Formulate an analysis plan. For this analysis, the significance level is 0.05. Using sample data,
we will conduct a chi-square test for independence.
Analyze sample data. Applying the chi-square test for independence to sample data, we
compute the degrees of freedom, the expected frequency counts, and the chi-square test
statistic. Based on the chi-square statistic and the degrees of freedom, we determine the
P-value.
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
where DF is the degrees of freedom, r is the number of levels of gender, c is the number
of levels of the voting preference, nr is the number of observations from level r of gender,
nc is the number of observations from level c of voting preference, n is the number of
observations in the sample, Er,c is the expected frequency count when gender is level r
and voting preference is level c, and Or,c is the observed frequency count when gender is
level r voting preference is level c.
The P-value is the probability that a chi-square statistic having 2 degrees of freedom is
more extreme than 16.2.
We use the Chi-Square Distribution Calculator to find P(Χ2 > 16.2) = 0.0003.
Interpret results. Since the P-value (0.0003) is less than the significance level (0.05), we reject the null hypothesis. Thus,
we conclude that there is a relationship between gender and voting preference.
Note: If you use this approach on an exam, you may also want to mention why this approach is
appropriate. Specifically, the approach is appropriate because the sampling method was simple
random sampling, the variables under study were categorical, and the expected frequency count
was at least 5 in each cell of the contingency table.
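The same test can be reproduced with scipy's chi2_contingency (an assumption, not a tool used in the notes). The female row of
the contingency table did not survive in the text above, so the counts used below are assumed values chosen to be consistent
with the stated total of 1000 voters and the reported chi-square of 16.2:

from scipy import stats

observed = [[200, 150, 50],    # male:   Republican, Democrat, Independent
            [250, 300, 50]]    # female: assumed counts (see note above)

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(round(chi2, 1), dof, round(p, 4))   # about 16.2, 2 degrees of freedom, p near 0.0003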
Example: "Which holiday would you rather have?"
(Table of survey results: counts of Men and Women choosing Beach or Cruise.)
Now, p < 0.05 is the usual test for dependence. In this case p is greater than 0.05, so we believe
the variables are independent (ie not linked together).
In other words Men and Women probably do not have a different preference for Beach Holidays
or Cruises.
Understanding "p" Value
"p" is the probability the variables are independent.
Imagine for the previous example you had tried to fool the test by choosing a random sample of
Men each time:
Men (first sample): Beach 209, Cruise 280
Men (second sample): Beach 225, Cruise 248
Is it likely you would get such different results surveying Men each time?
Well the "p" value of 0.132 says that it really could happen every so often.
Surveys are random after all. We expect slightly different results each time, right?
So most people want to see a p value less than 0.05 before they are happy to say the results show
the groups have a different response.
(Table of survey results: counts of Men and Women preferring Cats or Dogs.)
P value is 0.043
In this case p < 0.05, so this result is thought of as being "significant" meaning we think the
variables are not independent.
In other words, because 0.043 < 0.05 we think that Gender is linked to Pet Preference (Men and
Women have different preferences for Cats and Dogs).
Just out of interest, notice that the numbers in our two examples are similar, but the resulting p-
values are very different: 0.132 and 0.043. This shows how sensitive the test is!
Why p<0.05 ?
It is just a choice! Using p<0.05 is common, but we could have chosen p<0.01 to be even more
sure that the groups behave differently, or any value really.
Calculating P-Value
Chi-Square Test
Note: Chi Sounds like "Hi" but with a K, so say Chi-Square like "Ki square"
And Chi is the greek letter Χ, so we can also write it Χ2
Important points before we get started:
This test only works for categorical data (data in categories), such as Gender {Men,
Women} or color {Red, Yellow, Green, Blue} etc, but not numerical data such as
height or weight.
The numbers must be large enough. Each entry must be 5 or more. In our example
we have values such as 209, 282, etc, so we are good to go.
Chi-Square is 4.102
From Chi-Square to p
To get from Chi-Square to p-value is a difficult calculation, so either look it up in a table, or use
the Chi-Square Calculator.
Done!
Chi-Square Formula
This is the formula for Chi-Square:
Χ2 = Σ ( (O - E)2 / E )
where O is each observed (actual) count and E is the corresponding expected count.
In this section we will first discuss correlation analysis, which is used to quantify the association
between two continuous variables (e.g., between an independent and a dependent variable or
between two independent variables). Regression analysis is a related technique to assess the
relationship between an outcome variable and one or more risk factors or confounding variables.
The outcome variable is also called the response or dependent variable and the risk factors and
confounders are called the predictors, or explanatory or independent variables. In regression
analysis, the dependent variable is denoted "y" and the independent variables are denoted by "x".
[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict
even beyond the limits of the data. Also, the term "explanatory variable" might give an
impression of a causal effect in a situation in which inferences should be limited to
identifying associations. The terms "independent" and "dependent" variable are less
subject to these interpretations as they do not strongly imply cause and effect.]
Correlation Analysis
In correlation analysis, we estimate a sample correlation coefficient, more specifically the
Pearson Product Moment correlation coefficient. The sample correlation coefficient, denoted r,
ranges between -1 and +1 and quantifies the direction and strength of the linear association
between the two variables. The correlation between two variables can be positive (i.e., higher
levels of one variable are associated with higher levels of the other) or negative (i.e., higher
levels of one variable are associated with lower levels of the other).
The sign of the correlation coefficient indicates the direction of the association. The magnitude
of the correlation coefficient indicates the strength of the association.
For example, a correlation of r = 0.9 suggests a strong, positive association between two variables, whereas a correlation of
r = -0.2 suggests a weak, negative association. A correlation close to zero suggests no linear association between two
continuous variables.
It is important to note that there may be a non-linear association between two continuous
variables, but computation of a correlation coefficient does not detect this. Therefore, it is always
important to evaluate the data carefully before computing a correlation coefficient. Graphical
displays are particularly useful to explore associations between variables.
The figure below shows four hypothetical scenarios in which one continuous variable is plotted
along the X-axis and the other along the Y-axis.
Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see
for the correlation between infant birth weight and birth length.
Scenario 2 depicts a weaker association (r = 0.2) that we might expect to see between age and body mass index (which tends to
increase with age).
Scenario 3 might depict the lack of association (r approximately 0) between the extent of
media exposure in adolescence and age at which adolescents initiate sexual activity.
Scenario 4 might depict the strong negative association (r= -0.9) generally observed
between the number of hours of aerobic exercise per week and percent body fat.
We wish to estimate the association between gestational age and infant birth weight. In this
example, birth weight is the dependent variable and gestational age is the independent variable.
Thus y=birth weight and x=gestational age. The data are displayed in a scatter diagram in the
figure below.
Each point represents an (x,y) pair (in this case the gestational age, measured in weeks, and the
birth weight, measured in grams). Note that the independent variable is on the horizontal axis (or
X-axis), and the dependent variable is on the vertical axis (or Y-axis). The scatter plot shows a
positive or direct association between gestational age and birth weight. Infants with shorter
gestational ages are more likely to be born with lower weights and infants with longer gestational
ages are more likely to be born with higher weights.
The variances of x and y measure the variability of the x scores and y scores around their respective sample means, considered
separately. The covariance measures the variability of the (x, y) pairs around the mean of x and mean of y, considered
simultaneously.
To compute the sample correlation coefficient, we need to compute the variance of gestational
age, the variance of birth weight and also the covariance of gestational age and birth weight.
We first summarize the gestational age data. The mean gestational age is:
To compute the variance of gestational age, we need to sum the squared deviations (or
differences) between each observed gestational age and the mean gestational age. The
computations are summarized below.
Next, we summarize the birth weight data. The mean birth weight is:
The variance of birth weight is computed just as we did for gestational age as shown in the table
below.
To compute the covariance of gestational age and birth weight, we need to multiply the deviation from the mean gestational age
by the deviation from the mean birth weight for each participant (i.e., (x - x-bar)(y - y-bar)).
The computations are summarized below. Notice that we simply copy the deviations from the mean gestational age and birth weight
from the two tables above into the table below and multiply.
Not surprisingly, the sample correlation coefficient indicates a strong positive correlation.
As we noted, sample correlation coefficients range from -1 to +1. In practice, meaningful
correlations (i.e., correlations that are clinically or practically important) can be as small as 0.4
(or -0.4) for positive (or negative) associations. There are also statistical tests to determine
whether an observed correlation is statistically significant or not (i.e., statistically significantly
different from zero). Procedures to test whether an observed sample correlation is suggestive of a
statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller. 1
Spearman's Rank Correlation
Formula:
R = 1 - (6 Σd2) / (n3 - n)
where d is the difference between the ranks of each pair of observations and n is the number of pairs.
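A small Python sketch of this formula; the two rank lists at the end are illustrative, not data from the notes:

def spearman_r(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * d2) / (n ** 3 - n)

print(spearman_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))   # 0.8: the rankings mostly agree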
Partial Correlation
Partial correlation analysis involves studying the linear relationship between two variables after excluding the effect of one
or more independent factors.
Simple correlation does not prove to be an all-encompassing technique, especially under the above circumstances. In order to
get a correct picture of the relationship between two variables, we should first eliminate the influence of other variables.
For example, a study of partial correlation between price and demand would involve studying the relationship between price and
demand excluding the effect of money supply, exports, etc.
Multiple Correlation
Another technique used to overcome the drawbacks of simple correlation is multiple regression
analysis.
Here, we study the effects of all the independent variables simultaneously on a dependent
variable. For example, the correlation co-efficient between the yield of paddy (X1) and the other
variables, viz. type of seedlings (X2), manure (X3), rainfall (X4), humidity (X5) is the multiple
correlation co-efficient R1.2345 . This co-efficient takes value between 0 and +1.
The limitations of multiple correlation are similar to those of partial correlation. If multiple and
partial correlation are studied together, a very useful analysis of the relationship between the
different variables is possible.
UNIT-V
REGRESSION ANALYSIS
Regression Analysis
Introduction
As you develop Cause & Effect diagrams based on data, you may wish to examine the degree of
correlation between variables. A statistical measurement of correlation can be calculated using
the least squares method to quantify the strength of the relationship between two variables. The
output of that calculation is the Correlation Coefficient, or (r), which ranges between -1 and 1.
A value of 1 indicates perfect positive correlation - as one variable increases, the second
increases in a linear fashion. Likewise, a value of -1 indicates perfect negative correlation - as
one variable increases, the second decreases. A value of zero indicates zero correlation.
Before calculating the Correlation Coefficient, the first step is to construct a scatter diagram.
Most spreadsheets, including Excel, can handle this task. Looking at the scatter diagram will
give you a broad understanding of the correlation. Following is a scatter plot chart example
based on an automobile manufacturer.
In this case, the process improvement team is analyzing door closing efforts to understand what
the causes could be. The Y-axis represents the width of the gap between the sealing flange of a
car door and the sealing flange on the body - a measure of how tight the door is set to the body.
The fishbone diagram indicated that variability in the seal gap could be a cause of variability in
door closing efforts.
In this case, you can see a pattern in the data indicating a negative correlation (negative slope)
between the two variables. In fact, the Correlation Coefficient is -0.78, indicating a strong
inverse or negative relationship.
MoreSteam Note: It is important to note that Correlation is not Causation - two variables can
be very strongly correlated, but both can be caused by a third variable. For example, consider
two variables: A) how much my grass grows per week, and B) the average depth of the local
reservoir. Both variables could be highly correlated because both are dependent upon a third
variable - how much it rains.
In our car door example, it makes sense that the tighter the gap between the sheet metal sealing
surfaces (before adding weatherstrips and trim), the harder it is to close the door. So a
rudimentary understanding of mechanics would support the hypothesis that there is a causal
relationship. Other industrial processes are not always as obvious as these simple examples, and
determination of causal relationships may require more extensive experimentation (Design of
Experiments).
Regression plots a line of best fit to the data using the least-squares method. You can see an
example below of linear regression using the same car door scatter plot:
[Scatter plot with fitted least-squares regression line: seal gap vs. door closing effort]
You can see that the data is clustered closely around the line, and that the line has a downward
slope. The strong negative correlation is expressed by two related statistics: the r value, as
stated before, is -0.78, so the r² value is 0.61. R², called the Coefficient of Determination,
expresses how much of the variability in the dependent variable is explained by variability in the
independent variable. You may find that a non-linear equation, such as an exponential or power
function, provides a better fit and yields a higher r² than a linear equation.
These statistical calculations can be made using Excel, or by using any of several statistical
analysis software packages. MoreSteam provides links to statistical software downloads,
including free software.
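The same calculation can also be done in a few lines of code. The sketch below (not from the notes) uses invented seal-gap and closing-effort values, so the fitted numbers will not match the -0.78 and 0.61 quoted above; it only shows the mechanics of a least-squares fit and R².

```python
import numpy as np

# Hypothetical (seal gap, closing effort) pairs standing in for the scatter-plot data
gap    = np.array([1.2, 1.5, 1.8, 2.0, 2.3, 2.6, 2.9, 3.1])   # mm
effort = np.array([9.5, 9.0, 8.2, 7.9, 7.1, 6.4, 6.0, 5.2])   # arbitrary units

# Least-squares line of best fit: effort ~ m * gap + b
m, b = np.polyfit(gap, effort, 1)

# Coefficient of determination R^2 is the square of the correlation coefficient
r = np.corrcoef(gap, effort)[0, 1]
print(f"fit: effort = {m:.2f} * gap + {b:.2f}")
print(f"r = {r:.2f}, R^2 = {r**2:.2f}")
```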
Many times historical data is used in multiple regression in an attempt to identify the most
significant inputs to a process. The benefit of this type of analysis is that it can be done very
quickly and relatively simply. However, there are several potential pitfalls:
The data may be inconsistent due to different measurement systems, calibration drift,
different operators, or recording errors.
The range of the variables may be very limited, and can give a false indication of low
correlation. For example, a process may have temperature controls because temperature
has been found in the past to have an impact on the output. Using historical temperature
data may therefore indicate low significance because the range of temperature is already
controlled in tight tolerance.
There may be a time lag that influences the relationship - for example, temperature
may be much more critical at an early point in the process than at a later point, or vice-
versa. There also may be inventory effects that must be taken into account to make sure
that all measurements are taken at a consistent point in the process.
Once again, it is critical to remember that correlation is not causality. As stated by Box, Hunter
and Hunter: "Broadly speaking, to find out what happens when you change something, it is
necessary to change it. To safely infer causality the experimenter cannot rely on natural
happenings to choose the design for him; he must choose the design for himself and, in
particular, must introduce randomization to break the links with possible lurking variables".
Returning to our example of door closing efforts, you will recall that the door seal gap had an r²
of 0.61. Using multiple regression and adding the additional variable "door weatherstrip
durometer" (softness), the r² rises to 0.66. So the durometer of the door weatherstrip added some
explanatory power, but only a little. Analyzed individually, durometer had a much lower correlation
with door closing efforts - only 0.41.
This analysis was based on historical data, so as previously noted, the regression analysis only
tells us what did have an impact on door efforts, not what could have an impact. If the range of
durometer measurements was greater, we might have seen a stronger relationship with door
closing efforts, and more variability in the output.
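As a hedged sketch of the multiple-regression idea (not the original study's data), the code below simulates gap and durometer measurements and shows how R² changes when a second regressor is added; the variable names mirror the door example but the numbers are invented.

```python
import numpy as np

def r_squared(columns, y):
    """R^2 from an ordinary least-squares fit of y on the given predictor columns (plus intercept)."""
    A = np.column_stack([np.ones(len(y))] + [np.asarray(c, dtype=float) for c in columns])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Hypothetical historical data: seal gap, weatherstrip durometer, closing effort
rng = np.random.default_rng(1)
n = 50
gap = rng.uniform(1.0, 3.0, n)
durometer = rng.uniform(55, 75, n)
effort = 12 - 2.0 * gap - 0.03 * durometer + rng.normal(scale=0.5, size=n)

print("R^2 (gap only)       :", round(r_squared([gap], effort), 2))
print("R^2 (gap + durometer):", round(r_squared([gap, durometer], effort), 2))
print("R^2 (durometer only) :", round(r_squared([durometer], effort), 2))
```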
Trend Analysis
There are no proven "automatic" techniques to identify trend components in the time series data;
however, as long as the trend is monotonic (consistently increasing or decreasing), that part of
data analysis is typically not very difficult. If the time series data contain considerable error, then
the first step in the process of trend identification is smoothing.
Smoothing. Smoothing always involves some form of local averaging of data such that the
nonsystematic components of individual observations cancel each other out. The most common
technique is moving average smoothing which replaces each element of the series by either the
simple or weighted average of n surrounding elements, where n is the width of the smoothing
"window" (see Box & Jenkins, 1976; Velleman & Hoaglin, 1981). Medians can be used instead
of means. The main advantage of median smoothing as compared to moving average smoothing is that its
results are less biased by outliers (within the smoothing window). Thus, if there are outliers in
the data (e.g., due to measurement errors), median smoothing typically produces smoother or at
least more "reliable" curves than moving average based on the same window width. The main
disadvantage of median smoothing is that in the absence of clear outliers it may produce more
"jagged" curves than moving average and it does not allow for weighting.
In the relatively less common cases (in time series data), when the measurement error is very
large, the distance weighted least squares smoothing or negative exponentially weighted
smoothing techniques can be used. All those methods will filter out the noise and convert the
data into a smooth curve that is relatively unbiased by outliers (see the respective sections on
each of those methods for more details). Series with relatively few and systematically distributed
points can be smoothed with bicubic splines.
Fitting a function. Many monotonic time series can be adequately approximated by a
linear function; if there is a clear monotonic nonlinear component, the data first need to be
transformed to remove the nonlinearity. Usually a logarithmic, exponential, or (less often)
polynomial function can be used.
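A hedged sketch of that idea (not from the notes): a series with an exponential trend is transformed with a logarithm, after which a straight line can be fitted. The data is simulated for illustration.

```python
import numpy as np

# Hypothetical series with an exponential (multiplicative) growth trend
t = np.arange(1, 21, dtype=float)
y = 50.0 * np.exp(0.08 * t) * (1 + 0.02 * np.sin(t))  # trend plus small fluctuation

# A straight line fits poorly in the original scale, but after a log transform
# the trend is approximately linear: log y = log a + b * t
b, log_a = np.polyfit(t, np.log(y), 1)
a = np.exp(log_a)
print(f"fitted exponential trend: y = {a:.1f} * exp({b:.3f} * t)")
```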
Additive models
The models that we have considered in earlier sections have been additive models, and there has
been an implicit assumption that the different components affect the time series additively
(Y = Trend + Seasonal + Cyclical + Irregular).
For monthly data, an additive model assumes that the difference between the January and July
values is approximately the same each year. In other words, the amplitude of the seasonal effect
is the same each year.
The model similarly assumes that the residuals are roughly the same size throughout the series --
they are a random component that adds on to the other components in the same way at all parts
of the series.
Multiplicative models
In many time series involving quantities (e.g. money, wheat production, ...), the absolute
differences in the values are of less interest and importance than the percentage changes.
For example, in seasonal data, it might be more useful to model that the July value is the same
proportion higher than the January value in each year, rather than assuming that their difference
is constant. Assuming that the seasonal and other effects act proportionally on the series is
equivalent to a multiplicative model,
Y = Trend × Seasonal × Cyclical × Irregular.
Fortunately, multiplicative models are just as easy to fit to data as additive models! The trick to
fitting a multiplicative model is to take logarithms of both sides of the model,
log Y = log(Trend) + log(Seasonal) + log(Cyclical) + log(Irregular).
After taking logarithms (either natural logarithms or to base 10), the four components of the time
series again act additively.
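A minimal sketch (not from the notes) showing the point numerically: in a simulated multiplicative series the seasonal swing grows with the level, but after taking logarithms the swing is roughly constant, so an additive treatment applies.

```python
import numpy as np

# Hypothetical monthly series: multiplicative trend x seasonal pattern
months = np.arange(48)
trend = 100 * 1.01 ** months                           # steadily growing level
seasonal = 1 + 0.2 * np.sin(2 * np.pi * months / 12)   # roughly +/-20% seasonal swing
y = trend * seasonal

# In the original scale the seasonal range grows with the level...
print("range, year 1:", round(y[:12].max() - y[:12].min(), 1))
print("range, year 4:", round(y[36:].max() - y[36:].min(), 1))

# ...but after taking logarithms the swings are roughly constant,
# because log y = log(trend) + log(seasonal): the components add.
log_y = np.log(y)
print("log-scale range, year 1:", round(log_y[:12].max() - log_y[:12].min(), 3))
print("log-scale range, year 4:", round(log_y[36:].max() - log_y[36:].min(), 3))
```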
An additive model is optional for Decomposition procedures and for Winters' method.
An additive model is optional for two-way ANOVA procedures. Choose this option to
omit the interaction term from the model.
What is a multiplicative model?
This model assumes that as the data increase, so does the seasonal pattern. Most time series plots
exhibit such a pattern. In this model, the trend and seasonal components are multiplied and then
added to the error component.
Should I use an additive model or a multiplicative model?
Choose the multiplicative model when the magnitude of the seasonal pattern in the data depends
on the magnitude of the data. In other words, the magnitude of the seasonal pattern increases as
the data values increase, and decreases as the data values decrease.
Choose the additive model when the magnitude of the seasonal pattern in the data does not
depend on the magnitude of the data. In other words, the magnitude of the seasonal pattern does
not change as the series goes up or down.
If the pattern in the data is not very obvious, and you have trouble choosing between the additive
and multiplicative procedures, you can try both and choose the one with smaller accuracy
measures.
INDEX NUMBERS
Introduction:
Index numbers are meant to study the change in the effects of such factors which cannot be
measured directly. According to Bowley, "Index numbers are used to measure the changes in
some quantity which we cannot observe directly". For example, changes in business activity in a
country are not capable of direct measurement but it is possible to study relative changes in
business activity by studying the variations in the values of some such factors which affect
business activity, and which are capable of direct measurement.
Index numbers are a commonly used statistical device for measuring the combined fluctuations in
a group of related variables. If we wish to compare the price level of consumer items today with
that prevalent ten years ago, we are not interested in comparing the prices of only one item, but
in comparing some sort of average price levels. We may wish to compare the present agricultural
production or industrial production with that at the time of independence. Here again, we have to
consider all items of production and each item may have undergone a different fractional
increase (or even a decrease). How do we obtain a composite measure? This composite measure
is provided by index numbers which may be defined as a device for combining the variations that
have occurred in a group of related variables over a period of time, with a view to obtaining a figure that
represents the 'net' result of the change in the constituent variables.
Index numbers may be classified in terms of the variables that they are intended to measure. In
business, different groups of variables in the measurement of which index number techniques are
commonly used are (i) price, (ii) quantity, (iii) value and (iv) business activity. Thus, we have
index of wholesale prices, index of consumer prices, index of industrial output, index of value of
exports and index of business activity, etc. Here we shall be mainly interested in index numbers
of prices showing changes with respect to time, although methods described can be applied to
other cases. In general, the present level of prices is compared with the level of prices in the past.
The present period is called the current period and some period in the past is called the base
period.
Index Numbers:
Index numbers are statistical measures designed to show changes in a variable or group of
related variables with respect to time, geographic location or other characteristics such as
income, profession, etc. A collection of index numbers for different years, locations, etc., is
sometimes called an index series.
A simple index number is a number that measures a relative change in a single variable with
respect to a base.
A composite index number is a number that measures the average relative change in a group of
related variables with respect to a base.
Price index numbers measure the relative changes in the prices of commodities between two
periods. Prices can be either retail or wholesale.
Quantity index numbers measure changes in the physical quantity of goods
produced, consumed or sold, for an item or a group of items.
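As a hedged worked sketch (the prices below are invented), a simple price index for one item is the price relative (p1/p0) × 100, and an unweighted composite index can be formed as a simple aggregate of prices; weighted variants are taken up later in this unit.

```python
# Hypothetical prices of three consumer items in the base and current periods
base_prices    = {"rice": 40.0, "milk": 25.0, "fuel": 70.0}
current_prices = {"rice": 50.0, "milk": 30.0, "fuel": 91.0}

# Simple price index (price relative) for each item: (p1 / p0) * 100
for item in base_prices:
    relative = current_prices[item] / base_prices[item] * 100
    print(f"{item:5s}: price relative = {relative:.1f}")

# Unweighted composite index: simple aggregate of prices, (sum p1 / sum p0) * 100
aggregate = sum(current_prices.values()) / sum(base_prices.values()) * 100
print(f"simple aggregate price index = {aggregate:.1f}")
```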
Uses
An index number is a single useful figure that helps us quantify changes in our field. It is easier to
look at one value than at a thousand different values, one for each item in our field.
Take the stock market, for example. It comprises thousands of different public companies.
We could, of course, look at the stock value of each of these companies to see how the
companies are doing as a whole, or we can look at just one number, the stock index, to get a
general feel for how the companies are doing.
The same goes for the cost of goods. We could look at the cost of each item and compare it to its
cost from last year. But that would mean looking at the cost of millions of items. Or we could
look at the cost of goods index, just one number, to see whether prices have increased or
decreased over the past year.
We can say that the index number is one simple number that we can look at to give us a general
overview of what is happening in our field. Let's take a look at two real world index numbers.
A line of best fit is a straight line that is the best approximation of the given set of data.
It is used to study the nature of the relation between two variables. (We're only considering the
two-dimensional case, here.)
A line of best fit can be roughly determined using an eyeball method by drawing a straight line
on a scatter plot so that the number of points above the line and below the line is about equal
(and the line passes through as many points as possible).
A more accurate way of finding the line of best fit is the least square method.
Use the following steps to find the equation of the line of best fit for a set of ordered pairs
(x1, y1), (x2, y2), ..., (xn, yn).
Step 1: Calculate the mean of the x-values and the mean of the y-values:
X̄ = (Σ xi)/n,  Ȳ = (Σ yi)/n.
Step 2: The following formula gives the slope of the line of best fit:
m = Σ (xi − X̄)(yi − Ȳ) / Σ (xi − X̄)².
Step 3: Compute the y-intercept of the line using the means and the slope:
b = Ȳ − m·X̄.
Step 4: Use the slope m and the y-intercept b to form the equation of the line, y = mx + b.
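A minimal sketch (not from the notes) implementing these four steps; it uses the data of the worked example that follows, so the slope should agree with the −131/118.4 computed there.

```python
import numpy as np

def line_of_best_fit(x, y):
    """Least-squares line of best fit, following the steps above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()                                   # Step 1: means
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)   # Step 2: slope
    b = y_mean - m * x_mean                                               # Step 3: y-intercept
    return m, b                                                           # Step 4: y = m*x + b

# Data from the worked example below
x = [8, 2, 11, 6, 5, 4, 12, 9, 6, 1]
y = [3, 10, 3, 6, 8, 12, 1, 4, 9, 14]
m, b = line_of_best_fit(x, y)
print(f"y = {m:.2f}x + {b:.2f}")  # slope about -1.11, intercept about 14.08 for this data
```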
Example:
Use the least square method to determine the equation of line of best fit for the data. Then plot
the line.
x: 8   2   11   6   5   4   12   9   6   1
y: 3   10   3   6   8   12   1   4   9   14
Solution:
Plot the points on a coordinate plane.
X̄ = (8 + 2 + 11 + 6 + 5 + 4 + 12 + 9 + 6 + 1)/10 = 6.4
Ȳ = (3 + 10 + 3 + 6 + 8 + 12 + 1 + 4 + 9 + 14)/10 = 7
m = Σ (xi − X̄)(yi − Ȳ) / Σ (xi − X̄)² = −131/118.4 ≈ −1.1