Advanced Data Analysis Binder 2015

Contents

Sr. No.   Topic                      Page No.
1         Teaching Plan              3 - 4
2         Charts                     5 - 7
3         Nonparametric Tests        8 - 28
4                                    29 - 42
5                                    43 - 48
6                                    48 - 74
7         Discriminant Analysis      75 - 89
8         Logistic Regression        89 - 97
9         MANOVA                     97 - 100
10        Factor Analysis            100 - 116
11        Canonical Correlation      116 - 120
12        Cluster Analysis           120 - 142
13        Conjoint Analysis          142 - 149
14                                   149 - 151
15                                   152 - 165
Advanced Methods of Data Analysis
School of Business Management, NMIMS University
MBA (Full Time) course: 2013-2014 Trim: IV
_________________________________________________________________________
Instructor: Ms Shailaja Rego
Email Id: [email protected], [email protected]
Extn no. 5873
________________________________________________________________________
Objectives:
This course aims at developing an understanding of multivariate techniques, both by manual methods and by using SPSS or other computer-assisted tools, for carrying out the analysis and interpreting the results to assist business decisions.
Prerequisite:
Statistical Analysis for Business Decisions
Evaluation Criteria: (in %)
Quiz            : 20%
Project         : 20%
Assignment      : 20%
End Term Exam   : 40%
Session Details:
Session Topic
1
Introduction to multivariate data analysis, dependence & Interdependence techniques,
Introduction to SPSS, two sample tests, one way ANOVA, n-way ANOVA using SPSS and
interpretation of the SPSS output. Discussion on assumptions of these tests.
(Chapter 11 & chapter 12, Appendix III- IBM SPSS of Business Research Methodology by
Srivastava Rego)
2
Introduction to SPSS and R(Open source Statistical Software)
(Appendix II of Business Research Methodology by Srivastava Rego) & R manual
3&4
Introduction to Econometrics, multiple regression analysis, Concepts of Ordinary Least
Square Estimate (OLSE) & Best Linear Unbiased Estimate (BLUE). With Practical
Examples in business
(Read: H & A Chapter 4,p.g. 193 - 288)
5
Multiple Regression Assumptions - Heteroscedasticity, Autocorrelation & Multicollinearity
(Read: H & A Chapter 4,p.g. 193 - 288)
6
Multiple Regression Case Problems, Dummy variables, Outlier analysis.
(Read: H & A Chapter 4,p.g. 193 - 288)
7 & 8 Factor Analysis Introduction to Factor Analysis, Objectives of Factor Analysis, Designing
a Factor Analysis, Assumptions in a Factor Analysis, Factors & assessing overall fit.
Interpretation of the factors.
(Read: H & A Chapter 3 pg. 125 - 165)
9
Case problem based on Factor Analysis
10 & 11
Introduction to Cluster Analysis. Objective of Cluster Analysis. Research Design of Cluster Analysis. Assumptions in a Cluster Analysis. Employing Hierarchical and Non-Hierarchical methods; interpretation of the clusters formed.
(Read: H & A Chapter 8 pg. 579 - 622)
12
13
14
15
16 & 17
18
19 & 20
Guest Sessions from Industry professionals in Business Analytics
The different Multivariate Techniques will also be interpreted using SPSS and R.
Text Book:
1. Multivariate Data Analysis by Hair & Anderson. ( H & A )
Reference Books:
1. Business Research Methodology by Srivastava & Rego
2. Statistics for Management by Srivastava & Rego
3. Analyzing Multivariate Data by Johnson & Wichern.
4. Multivariate Analysis by Subhash Sharma
5. Market Research by Naresh Malhotra
6. Statistics for Business & Economics by Douglas Lind
7. Market Research by Rajendra Nargundkar
8. Statistics for Management by Aczel & Sounderpandian
Hypothesis Testing: Univariate Techniques

Parametric
  One Sample: t Test, Z test
  Two Sample
    Independent Samples: t Test, Z test
    Dependent Samples: Paired t Test

Non-Parametric
  One Sample: Chi-Square, K-S, Runs, Binomial
  Two Sample
    Independent Samples: Chi-Square, Mann-Whitney, Median, K-S
    Dependent Samples: Sign, Wilcoxon, Chi-Square, McNemar
Multivariate Techniques

Interdependence Techniques
  Variable interdependence techniques: Factor Analysis
  Inter-object similarity techniques: Cluster Analysis, MDS

Dependence Techniques (metric dependent variable, one independent variable)
  Binary independent variable: t Test
  Categorical independent variable (Factorial): ANOVA
    One factor: One-way ANOVA
    More than one factor: N-way ANOVA
  Categorical & Interval independent variables: ANOCOVA
  Interval independent variable: Regression
NON-PARAMETRIC TESTS
Contents
1. Relevance - Advantages and Disadvantages
2. Tests for
   a. Randomness of a Series of Observations - Run Test
   b. Specified Mean or Median of a Population - Signed Rank Test
   c. Goodness of Fit of a Distribution - Kolmogorov-Smirnov Test
   d. Comparing Two Populations - Kolmogorov-Smirnov Test
   e. Equality of Two Means - Mann-Whitney (U) Test
   f. Equality of Several Means - Wilcoxon-Wilcox Test, Kruskal-Wallis Rank Sum (H) Test, Friedman's (F) Test (Two-Way ANOVA)
   g. Rank Correlation - Spearman's
   h. Testing Equality of Several Rank Correlations
   i. Kendall's Rank Correlation Coefficient
   j. Sign Test
1 Relevance and Introduction
All the tests of significance discussed in Chapters X and XI are based on certain assumptions about the variables and their statistical distributions. The most common assumption is that the samples are drawn from a normally distributed population. This assumption is more critical when the sample size is small. When this assumption or other assumptions for the various tests described in those chapters are not valid or are doubtful, or when the data available is of the ordinal (rank) type, we take the help of non-parametric tests. For example, in the Student's t test for testing the equality of means of two populations based on samples from the two populations, it is assumed that the samples are from normal distributions with equal variance. If we are not too sure of the validity of this assumption, it is better to apply a test given in this Chapter.
While the parametric tests refer to some parameters like the mean, standard deviation, correlation coefficient, etc., the non-parametric tests, also called distribution-free tests, are used for testing other features as well, like randomness, independence, association, rank correlation, etc.
In general, we resort to use of non-parametric tests where
The assumption of normal distribution for the variable under consideration or some
assumption for a parametric test is not valid or is doubtful.
The hypothesis to be tested does not relate to the parameter of a population
The numerical accuracy of collected data is not fully assured
Results are required rather quickly through simple calculations.
However, the non-parametric tests have the following limitations or disadvantages:
They ignore a certain amount of information.
They are often not as efficient or reliable as parametric tests.
The above advantages and disadvantages are consistent with the general premise in statistics that a method which is easier to calculate does not utilize the full information contained in a sample and is, therefore, less reliable.
The use of non-parametric tests thus involves a trade-off: while some efficiency or reliability is lost, the ability to work with less information and to calculate faster is gained.
There are a number of tests in statistical literature. However, we have discussed only the following
tests.
Types and Names of Tests for
Randomness of a Series of Observations - Run Test
Specified Mean or Median of a Population - Signed Rank Test
Goodness of Fit of a Distribution - Kolmogorov-Smirnov Test
Comparing Two Populations - Kolmogorov-Smirnov Test
Equality of Two Means - Mann-Whitney (U) Test
Equality of Several Means - Wilcoxon-Wilcox Test, Kruskal-Wallis Rank Sum (H) Test, Friedman's (F) Test
Rank Correlation - Spearman's
We now discuss certain popular and most widely used non-parametric tests in the subsequent
sections.
2 Test for Randomness in a Series of Observations: The Run Test
This test has been evolved for testing whether the observations in a sample occur in a certain order or in a random order. The hypotheses are:
Ho : The sequence of observations is random
H1 : The sequence of observations is not random
The only condition for validity of the test is that the observations in the sample be obtained under
similar conditions.
[Sequence of (+) and (-) signs of the sample observations, with the runs numbered (1) to (10)]
It may be noted that the number of (+) observations is equal to the number of (-) observations, both being equal to 10. As defined above, a succession of values with the same sign is called a run. Thus the first run comprises the observations 58 and 61, with the (-) sign; the second run comprises only one observation, i.e. 78, with the (+) sign; the third run comprises three observations, 72, 69 and 65, with the negative sign; and so on.
Total number of runs R = 10.
This value of R lies inside the acceptance interval found from the appropriate table, which is from 7 to 15 at the 5% level of significance. Hence the hypothesis that the sample is drawn in a random order is accepted.
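The same test can be run in R (used alongside SPSS in this course). The minimal sketch below uses hypothetical marks, since the original sample values are not reproduced here; the signs are taken relative to the sample median and the large-sample normal approximation to the run distribution is applied.

# Run test for randomness - a minimal sketch in R (hypothetical data)
marks <- c(58, 61, 78, 72, 69, 65, 80, 55, 63, 75,
           79, 64, 70, 82, 60, 68, 77, 59, 73, 66)
signs <- ifelse(marks > median(marks), "+", "-")
signs <- signs[marks != median(marks)]        # discard values equal to the median
R  <- length(rle(signs)$lengths)              # observed number of runs
n1 <- sum(signs == "+"); n2 <- sum(signs == "-")
mu  <- 2 * n1 * n2 / (n1 + n2) + 1            # expected number of runs under randomness
sig <- sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) /
            ((n1 + n2)^2 * (n1 + n2 - 1)))
z <- (R - mu) / sig
2 * pnorm(-abs(z))                            # two-sided p-value (normal approximation)

For small samples, as in the illustration above, the observed R is compared directly with the tabulated acceptance interval instead of using the normal approximation.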
Applications:
(i)
Testing Randomness of Stock Rates of Return
The number-of-runs test can be applied to a series of a stock's rates of return, one for each trading day, to see whether the rates of return are random or exhibit a pattern that could be exploited for earning a profit.
(ii)
Testing the Randomness of the Pattern Exhibited by Quality Control Data Over
Time
If a production process is in control, the sample values should be randomly distributed above and below the centre line of a control chart (please refer to Chapter XVII on Industrial Statistics). We can use this test for testing whether the pattern of, say, 10 sample observations taken over time is random.
3 Test for Specified Mean or Median of a Population The Signed Rank Test
This test has been evolved to investigate the significance of the difference between a population mean
or median and a specified value of the mean or median, say m0 .
The hypotheses are as follows:
Null hypothesis            Ho : m = m0
Alternative Hypothesis     H1 : m ≠ m0
The test procedure is numerically illustrated below.
Illustration 2
In the above example relating to the answer books of MBA students, the Director further desired to have an idea of the average marks of the students. When he enquired with the concerned Professor, he was informed that the Professor had not calculated the average but felt that the mean would be 70 and the median would be 65. The Director wanted to test this, and asked for a sample of 10 randomly selected answer books. The marks on those books are tabulated below as xi's.
Null hypothesis            Ho : m = 70
Alternative Hypothesis     H1 : m ≠ 70
Sample values are as follows:

xi (Marks)               55    58    63    78    72    69    64    79    75    80
xi - m0                 -15   -12    -7    +8    +2    -1    -6    +9    +5   +10
|xi - m0|*               15    12     7     8     2     1     6     9     5    10
Ranks of |xi - m0|        1     2     6     5     9    10     7     4     8     3
Ranks with signs of
respective (xi - m0)     -1    -2    -6    +5    +9   -10    -7    +4    +8    +3

Any sample values equal to m0 are to be discarded from the sample.
* absolute value or modulus of (xi - m0)
Now,
Sum of (+) ranks = 29
Sum of (-) ranks = 26
Here, the statistic T is defined as the minimum of the sum of positive ranks and the sum of negative ranks.
Thus, T = Minimum of 29 and 26 = 26
The critical value of T at 5% level of significance is 8 ( for n = number of values ranked = 10).
Since the calculated value 26 is more than the critical value, the null hypothesis is not rejected. Thus the Director does not have sufficient evidence to contradict the professor's guess about the mean marks being 70.
It may be noted that the criterion in rank methods is the reverse of that in parametric tests: here the null hypothesis is rejected when the calculated value is smaller than the critical (tabulated) value, whereas in parametric tests it is rejected when the calculated value exceeds the tabulated value.
In the above example, we have tested a specified value of the mean. If a specified value of the median is to be tested, the test is exactly similar; only the word mean is substituted by median. For example, if the Director wanted to test whether the median of the marks is 70, the test would have resulted in the same values and the same conclusion.
It may be verified that the Director does not have sufficient evidence to contradict the professor's guess about the median marks being 65.
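In R, the same test is available as the Wilcoxon signed rank test; a minimal sketch using the ten marks of Illustration 2 is given below. Note that R ranks the absolute differences in ascending order, so its statistic V differs from the T computed by hand above, but the conclusion is the same.

marks <- c(55, 58, 63, 78, 72, 69, 64, 79, 75, 80)
# Signed rank test of H0: m = 70 against H1: m is not equal to 70
wilcox.test(marks, mu = 70)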
Example 1
The following Table gives the real annual income that senior managers actually take home in certain countries, including India. These figures were arrived at by the US-based Hay Group in 2006, after adjusting for the cost of living, rental expenses and purchasing power parity.

Overall Rank   Country        Amount (in Euros)
5              Brazil         76,449
26             China          42,288
8              Germany        75,701
2              India          77,665
9              Japan          69,634
3              Russia         77,355
4              Switzerland    76,913
1              Turkey         79,021
23             UK             46,809
13             USA            61,960

Test whether 1) the mean income is equal to 70,000, and 2) the median income is equal to 70,000.
It may be verified that both the hypotheses are rejected.
4 Test for Goodness of Fit of a Distribution (One Sample) - Kolmogorov-Smirnov Test
In Chapter X, we have discussed the chi-square test as a test for judging the goodness of fit of a distribution; however, it relies on a large-sample (normal) approximation.
The test is used to investigate the significance of the difference between observed and expected
cumulative distribution function for a variable with a specified theoretical distribution which could be
Binomial, Poisson, Normal or an Exponential. It tests whether the observations could reasonably have
come from the specified distribution. Here,
Null Hypothesis
Ho : The sample comes from a specified population
Alternative Hypothesis H1 : The sample does not come from a specified population
The testing procedure envisages calculations of observed and expected cumulative distribution
functions denoted by Fo(x) and Fe(x), respectively, derived from the sample. The comparison of the
two distributions for various values of the variable is measured by the test statistic
D = Maximum | Fo(x) - Fe(x) |                                   (1)
If the value of D is small, the null hypothesis is likely to be accepted; but if it is large, the null hypothesis is likely to be rejected. The procedure is explained below for testing the fit of a uniform distribution (vide section 7.6.1 of Chapter VII on Statistical Distributions).
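Before turning to the illustration, a minimal R sketch of the one-sample test is given below, using hypothetical data; ks.test expects the parameters of the theoretical distribution to be fully specified, not estimated from the same data.

x <- c(55, 58, 63, 78, 72, 69, 64, 79, 75, 80)   # hypothetical sample
# Test the fit of a Normal distribution with mean 70 and s.d. 10
ks.test(x, "pnorm", mean = 70, sd = 10)
# For a uniform distribution over a known range, say 0 to 100:
# ks.test(x, "punif", min = 0, max = 100)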
Illustration 3
Suppose we want to test whether availing of educational loan by the students of 5 Management
Institutes is independent of the Institute in which they study.
The following data gives the number of students availing the loan at each of the five institutes, viz. A, B, C, D and E, out of 60 students selected randomly at each institute.
The relevant data and calculations for the test are given in the following Table.
Institutes                              A       B       C       D       E
Number availing the loan
(out of groups of 60 students)          5       9      11      16      19
Observed Cumulative Distribution
Function Fo(x)                       5/60   14/60   25/60   41/60   60/60
Expected Cumulative Distribution
Function Fe(x)                      12/60   24/60   36/60   48/60   60/60
| Fo(x) - Fe(x) |                    7/60   10/60   11/60    7/60    0/60

The test statistic is D = Maximum | Fo(x) - Fe(x) | = 11/60 = 0.183.
5 Comparing Two Populations - Kolmogorov-Smirnov Test (Two Samples)
This test is used for testing whether two samples come from two identical population distributions. The hypotheses are:
Ho : The two samples come from identical population distributions
H1 : The two samples do not come from identical population distributions
The observed cumulative distribution functions of the two samples can be determined and plotted. The maximum value of the difference between the plotted values can thus be found and compared with a critical value obtained from the concerned Table. If the observed value exceeds the critical value, the null hypothesis that the two population distributions are identical is rejected.
The test is explained through the illustration given below.
Illustration 4
At one of the Management Institutes, a sample of 30 Second Year MBA students (15 from a Commerce background and 15 from an Engineering background) was selected, and data was collected on their backgrounds and CGPA scores at the end of the First Year.
The data is given as follows.
Sr No   CGPA Commerce   CGPA Engineering
1       3.24            2.97
2       3.14            2.92
3       3.72            3.03
4       3.06            2.79
5       3.14            2.77
6       3.14            3.11
7       3.06            3.33
8       3.17            2.65
9       2.97            3.14
10      3.14            2.97
11      3.69            3.39
12      2.85            3.08
13      2.92            3.30
14      2.79            3.25
15      3.22            3.14
Here, we wish to test that the CGPA for students with Commerce and Engineering backgrounds
follow the same distribution.
The value of the statistic D is calculated by preparing the following Table.

Sr No  CGPA   Cum. Score   Fi(C) =           Sr No  CGPA   Cum. Score   Fi(E) =           Difference
       Com    of CGPAs     Col.(iii)/47.25          Eng    of CGPAs     Col.(vii)/45.84   Di = |Fi(C) - Fi(E)|
(i)    (ii)   (iii)        (iv)              (v)    (vi)   (vii)        (viii)            (ix)
14     2.79    2.79        0.0590             8     2.65    2.65        0.0578            0.0012
12     2.85    5.64        0.1194             5     2.77    5.42        0.1182            0.0012
13     2.92    8.56        0.1812             4     2.79    8.21        0.1791            0.0021
 9     2.97   11.53        0.2440             2     2.92   11.13        0.2428            0.0012
 4     3.06   14.59        0.3088             1     2.97   14.10        0.3076            0.0012
 7     3.06   17.65        0.3735            10     2.97   17.07        0.3724            0.0011
 2     3.14   20.79        0.4400             3     3.03   20.10        0.4385            0.0015
 5     3.14   23.93        0.5065            12     3.08   23.18        0.5057            0.0008
 6     3.14   27.07        0.5729             6     3.11   26.29        0.5735            0.0006
10     3.14   30.21        0.6394             9     3.14   29.43        0.6420            0.0026
 8     3.17   33.38        0.7065            15     3.14   32.57        0.7105            0.0040
15     3.22   36.60        0.7746            14     3.25   35.82        0.7814            0.0068
 1     3.24   39.84        0.8432            13     3.30   39.12        0.8534            0.0102
11     3.69   43.53        0.9213             7     3.33   42.45        0.9260            0.0047
 3     3.72   47.25        1.0000            11     3.39   45.84        1.0000            0.0000
It may be noted that the observations (scores) have been arranged in ascending order for both the groups of students.
The calculation of the values in the different columns is explained below.
Col. (iii): The first cumulative score is the same as the score in Col. (ii). The second cumulative score is obtained by adding the first score to the second score, and so on. The last, i.e. 15th, cumulative score is obtained by adding the first fourteen scores to the fifteenth score.
Col. (iv): The cumulative distribution function for any observation in the second column is obtained by dividing the cumulative score for that observation by the cumulative score of the last, i.e. 15th, observation.
Col. (v) to Col. (viii): The values for the Engineering group are obtained in the same way as those in Col. (i) to Col. (iv).
Col. (ix): Di is the absolute difference | Fi(C) - Fi(E) |.
The test statistic D is calculated from the last column, Col. (ix), of the above Table as
D = Maximum of Di = Maximum of | Fi(C) - Fi(E) | = 0.0102
Since the calculated value of the statistic D, viz. 0.0102, is less than its tabulated value of 0.457 at the 5% level (for n = 15), the null hypothesis that both samples come from the same population is not rejected.
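In R, the two-sample comparison of Illustration 4 can be run directly with ks.test, as in the minimal sketch below. Note that ks.test compares the usual empirical distribution functions of the observations, not the cumulative-score version tabulated above, so its D value will differ; the decision can still be compared (a warning about ties is expected here).

cgpa_com <- c(3.24, 3.14, 3.72, 3.06, 3.14, 3.14, 3.06, 3.17, 2.97, 3.14,
              3.69, 2.85, 2.92, 2.79, 3.22)
cgpa_eng <- c(2.97, 2.92, 3.03, 2.79, 2.77, 3.11, 3.33, 2.65, 3.14, 2.97,
              3.39, 3.08, 3.30, 3.25, 3.14)
ks.test(cgpa_com, cgpa_eng)     # two-sample Kolmogorov-Smirnov test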
6 Equality of Two Means Mann-Whitney U Test
This test is used with two independent samples. It is an alternative to the t test, without the latter's limiting assumption that the samples come from normal distributions with equal variance.
For using the U test, all observations are combined and ranked as one group of data, from smallest to
largest. The largest negative score receives the lowest rank. In case of ties, the average rank is
assigned. After the ranking, the rank values for each sample are totaled. The U statistic is calculated
as follows:
U1 = n1 n2 + n1(n1 + 1)/2 - R1                                   (2)
or,
U2 = n1 n2 + n2(n2 + 1)/2 - R2                                   (3)
where,
n1 = Number of observations in sample 1; n2 = Number of observations in sample 2
R1 = Sum of ranks in sample 1; R2 = Sum of ranks in sample 2.
For testing purposes, the smaller of the above two U values is used.
The test is explained through a numerical example given below.
Example 2
Two equally competent groups of 10 salespersons were imparted training by two different methods
A and B. The following data gives sales of a brand of paint, in 5 kg. tins, per week per salesperson
after one month of receiving training. Test whether both the methods of imparting training are equally
effective.
Training Method A                        Training Method B
Salesman Sr. No.   Sales                 Salesman Sr. No.   Sales
1                  1,500                 1                  1,340
2                  1,540                 2                  1,300
3                  1,860                 3                  1,620
4                  1,230                 4                  1,070
5                  1,370                 5                  1,210
6                  1,550                 6                  1,170
7                  1,840                 7                    950
8                  1,250                 8                  1,380
9                  1,300                 9                  1,460
10                 1,710                 10                 1,030
Solution :
Here, the hypothesis to be tested is that both the training methods are equally effective, i.e.
H0 : m1 = m2
H1 : m1 ≠ m2
where, m1 is the mean sales of salespersons trained by method A, and m2 is the mean sales of
salespersons trained by method B.
The following Table giving the sales values for both the groups as also the combined rank of sales for
each of the salesman is prepared to carry out the test.
                 Training Method A                          Training Method B
Salesman         Sales            Combined     Salesman     Sales            Combined
Sr. No.          in 5 kg. Tins    Rank         Sr. No.      in 5 kg. Tins    Rank
1                1,500            14           1            1,340            10
2                1,540            15           2            1,300            8.5
3                1,860            20           3            1,620            17
4                1,230            6            4            1,070            3
5                1,370            11           5            1,210            5
6                1,550            16           6            1,170            4
7                1,840            19           7              950            1
8                1,250            7            8            1,380            12
9                1,300            8.5          9            1,460            13
10               1,710            18           10           1,030            2
Average Sales    1,515                         Average Sales 1,253
Sum of Ranks     R1 = 134.5                    Sum of Ranks  R2 = 75.5
U value          20.5                          U value       79.5
The U statistics for the two training methods are calculated as follows:
U1 = 10 x 10 + 10(10 + 1)/2 - 134.5 = 20.5
U2 = 10 x 10 + 10(10 + 1)/2 - 75.5 = 79.5
The tabulated or critical value of U with n1 = n2 = 10, for α = 0.05, is 23, for a two-tailed test.
It may be noted that in this test too, like the signed rank test in section 3, the calculated value
must be smaller than the critical value to reject the null hypothesis.
Since the calculated value 20.5 is smaller than the critical value 23, the null hypothesis that the two training methods are equally effective is rejected. It implies that Method A is superior to Method B.
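A minimal R sketch for the same data is given below; wilcox.test with two samples performs the Mann-Whitney (Wilcoxon rank sum) test, and its statistic W corresponds to one of the two U values computed above.

sales_A <- c(1500, 1540, 1860, 1230, 1370, 1550, 1840, 1250, 1300, 1710)
sales_B <- c(1340, 1300, 1620, 1070, 1210, 1170, 950, 1380, 1460, 1030)
wilcox.test(sales_A, sales_B)    # Mann-Whitney U test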
7 Equality of Several Means - The Wilcoxon-Wilcox Test
Illustration
The weekly sales of LCD TVs were compared across showrooms in five cities, viz. Delhi (D), Mumbai (M), Kolkata (K), Ahmedabad (A) and Bangalore (B). Based on the data about sales in the showrooms for 5 consecutive weeks, the ranks of the five cities in the five successive weeks were recorded as follows:
Week        Delhi (D)   Mumbai (M)   Kolkata (K)   Ahmedabad (A)   Bangalore (B)
1           1           5            3             4               2
2           2           4            3             5               1
3           1           3            4             5               2
4           1           4            3             5               2
5           2           3            5             4               1
Rank Sum*   7           19           18            23              8
* Rank Sum is calculated by adding the ranks for all the five weeks for each of the cities.
From the values of Rank Sum, we calculate the net difference in Rank Sum for every pair of cities,
and tabulate as follows :
Difference in
Rank Sums      D      M      K      A      B
D              0      12     11     16*    1
M              12     0      1      4      11
K              11     1      0      5      10
A              16*    4      5      0      15*
B              1      11     10     15*    0
The critical value for the difference in Rank Sums for number of cities = 5, number of observations
for each city = 5, and 5% level of significance is 13.6.
Comparing the calculated differences in rank sums with this critical value of 13.6, we note that the difference in rank sums between A (Ahmedabad) and D (Delhi), as also the difference between A (Ahmedabad) and B (Bangalore), is significant.
Note:
In the above case, if the data had been available in terms of actual values rather than ranks alone, ANOVA would only have led to the conclusion that the means of D, M, K, A and B are not all equal, and would not have gone beyond that. The above test, however, concludes that the mean of Ahmedabad is not equal to the mean of Delhi, as also the mean of Bangalore. Thus, it gives a comparison of all pairs of means.
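There is no ready-made Wilcoxon-Wilcox function in base R, but the rank sums and their pairwise differences can be computed directly, as in the minimal sketch below; each difference is then compared with the tabulated critical value (13.6 in this illustration).

# Weekly ranks of the five cities (rows = weeks, columns = cities)
ranks <- matrix(c(1, 5, 3, 4, 2,
                  2, 4, 3, 5, 1,
                  1, 3, 4, 5, 2,
                  1, 4, 3, 5, 2,
                  2, 3, 5, 4, 1),
                nrow = 5, byrow = TRUE,
                dimnames = list(paste("Week", 1:5), c("D", "M", "K", "A", "B")))
rank_sums <- colSums(ranks)                  # 7 19 18 23 8
abs(outer(rank_sums, rank_sums, "-"))        # pairwise differences in rank sums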
8 Kruskal-Wallis Rank Sum Test for Equality of Means (of Several Populations) (H Test) (One-Way ANOVA)
This test is used for testing the equality of means of a number of populations, and the null hypothesis is of the type
H0 : m1 = m2 = ... = mk
It may be recalled that H0 is the same as in ANOVA. However, here the ranks of the observations are used and not the actual observations.
The Kruskal-Wallis test is a One-Factor or One-Way ANOVA with the values of the variable replaced by ranks. It is a generalization of the two-sample Mann-Whitney U rank sum test to situations involving more than two populations.
The procedure for carrying out the test involves assigning combined ranks to the observations in all the samples, from smallest to largest. The rank sum of each sample is then calculated. The test statistic H is calculated as follows:
H = [12 / (n(n + 1))] Σ (Tj² / nj) - 3(n + 1),   j = 1, ..., k                    (4)
where
Tj = Sum of ranks for treatment j
nj = Number of observations for treatment j
n = Σ nj = Total number of observations
k = Number of treatments
The test is illustrated through an example given below.
Example 5
A chain of departmental stores opened three stores in Mumbai. The management wants to compare the sales of the three stores over a six-day promotional period. The relevant data is given below.
(Sales in Rs. Lakhs)
Store A Sales   Store B Sales   Store C Sales
16              20              23
17              20              24
21              21              26
18              22              27
19              25              29
29              28              30
Use the Kruskal-Wallis test to compare the equality of mean sales in all the three stores.
Solution
The combined ranks to the sales of all the stores on all the six days are calculated and presented in the
following Table.
      Store A                 Store B                 Store C
Sales   Combined Rank   Sales   Combined Rank   Sales   Combined Rank
16      1               20      5.5             23      10
17      2               20      5.5             24      11
21      7.5             21      7.5             26      13
18      3               22      9               27      14
19      4               25      12              29      16.5
29      16.5            28      15              30      18
        T1 = 34.0               T2 = 54.5               T3 = 82.5
It may be noted that the sales value 21 in Store A and the sales value 21 in Store B are given equal ranks of 7.5. Since there are six values below, the rank for 21 would have been 7, but since the value 21 is repeated, both values get the average of the ranks 7 and 8, i.e. 7.5. The next value, 22, has been assigned the rank 9. If there had been three values of 21, the rank assigned to each would have been the average of 7, 8 and 9, i.e. 8, and the next value would have been ranked 10.
Now the H statistic is calculated as
H = [12 / (18(18 + 1))] x (34.0²/6 + 54.5²/6 + 82.5²/6) - 3(18 + 1)
  = (12 / 342) x (10932.5 / 6) - 57
  = 63.93 - 57 = 6.93
Since the calculated value of H, 6.93, exceeds the tabulated chi-square value of 5.99 (for k - 1 = 2 degrees of freedom at the 5% level of significance), the null hypothesis of equal mean sales in the three stores is rejected.
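The same test in R, as a minimal sketch; kruskal.test applies a correction for tied ranks, so its statistic may differ slightly from the hand-computed H, and the p-value is read from the chi-squared approximation with k - 1 = 2 degrees of freedom.

store_A <- c(16, 17, 21, 18, 19, 29)
store_B <- c(20, 20, 21, 22, 25, 28)
store_C <- c(23, 24, 26, 27, 29, 30)
kruskal.test(list(store_A, store_B, store_C))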
9 Friedman's Test (Two-Way ANOVA)
Friedman's test is a non-parametric test of the hypothesis that a given number of samples have been drawn from the same population. The test is similar to ANOVA but does not require the assumptions of normality and equal variance. Further, the test is carried out with the data in terms of ranks of the observations rather than their actual values, unlike in ANOVA. It is used whenever the number of samples is greater than or equal to 3 (say k) and each sample is of equal size (say n), as in two-way analysis of variance; in fact, it is often referred to as a Two-Way ANOVA on ranks. The null hypothesis to be tested is that all the k samples have come from identical populations.
The use of the test is illustrated below through a numerical example.
Illustration 5
Following data gives the percentage growth of sales of three brands of refrigerators, say A, B and
C over a period of six years.
Percentage Growth Rate of the Brands
Year   Brand A   Brand B   Brand C
1      15        14        32
2      10        19        30
3      15        11        27
4      13        19        38
5      18        20        33
6      27        20        22
In this case, the null hypothesis is that there is no significant difference among the growth rates of
the three brands. The alternative hypothesis is that at least two samples (two brands) differ from each
other.
Under the null hypothesis, Friedman's test statistic is:
F = [12 / (nk(k + 1))] Σ Rj² - 3n(k + 1),   j = 1, ..., k                    (5)
where,
k = Number of samples (brands) = 3 (in the illustration)
n = Number of observations for each sample (brand) = 6 (in the illustration)
Rj = Sum of ranks of the jth sample (brand)
It may be noted that this F is different from Fisher's F defined in Chapter VII on Statistical Distributions.
Although statistical tables exist for the sampling distribution of Friedman's F, they are not readily available for various values of n and k. However, the sampling distribution of F can be approximated by a chi-square distribution with k - 1 degrees of freedom. The chi-square distribution table shows that, with 3 - 1 = 2 degrees of freedom, the chi-square value at the 5% level of significance is 5.99.
If the calculated value of F is less than or equal to the tabulated value of chi-square (at the 5% level of significance), the growth rates of the brands are considered statistically the same. In other words, there is no significant difference in the growth rates of the brands. In case the calculated value exceeds the tabulated value, the difference is termed significant.
For the above example, the following null hypothesis is framed:
Ho : There is no significant difference in the growth rates of the three brands of refrigerators.
For calculation of F, the following Table is prepared. The figures in brackets indicate the rank of
growth of a brand in a particular year- the lowest growth is ranked 1 and the highest growth is ranked
3.
The Growth Rates of Refrigerators of Different Brands (ranks in brackets)
Year               Brand A    Brand B    Brand C    Total Ranks (Row Total)
1                  15 (2)     14 (1)     32 (3)     6
2                  18 (2)     15 (1)     30 (3)     6
3                  15 (2)     11 (1)     27 (3)     6
4                  13 (1)     19 (2)     38 (3)     6
5                  20 (2)     18 (1)     33 (3)     6
6                  27 (3)     20 (1)     22 (2)     6
Total Ranks (Rj)   12         7          17         36
With reference to the above table, Friedman's test amounts to testing that the sums of ranks (Rj) of the various brands are all equal.
The value of F is calculated as
F = [12 / (6 x 3 x (3 + 1))] x (12² + 7² + 17²) - 3 x 6 x (3 + 1)
  = (12 / 72) x 482 - 72
  = 80.3 - 72 = 8.3
It is observed that the calculated value of the F statistic is greater than the tabulated value of chi-square (5.99 at the 5% level of significance and 2 d.f.). Hence, the hypothesis that there is no significant difference in the growth rates of the three brands is rejected.
Therefore, we conclude that there is a significant difference in the growth rates of the three brands of refrigerators during the period under study. The significant difference is due to the markedly higher growth rate of brand C.
In the above example, if the data were given for six showrooms instead of six years, the test would
have remained the same.
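A minimal R sketch for this illustration, using the growth rates as listed in the data table above (rows are years, i.e. blocks; columns are brands, i.e. treatments):

growth <- matrix(c(15, 14, 32,
                   10, 19, 30,
                   15, 11, 27,
                   13, 19, 38,
                   18, 20, 33,
                   27, 20, 22),
                 nrow = 6, byrow = TRUE,
                 dimnames = list(paste("Year", 1:6), c("A", "B", "C")))
friedman.test(growth)       # chi-squared statistic with k - 1 = 2 d.f.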
10 Test for Significance of Spearman's Rank Correlation
In Chapter VIII on Simple Correlation and Regression Analysis, Spearman's rank correlation has been discussed in section 8.4, and is defined as
rs = 1 - 6 Σ di² / (n(n² - 1))
where n is the number of pairs of ranks given to individuals or units or objects, and di is the difference in the two ranks given to the ith individual / unit / object.
There is no separate statistic to be calculated for testing the significance of the rank correlation. The calculated value of rs is itself compared with the tabulated value of rs, given in the Appendix, at the 5% or 1% level of significance. If the calculated value is more than the tabulated value, the null hypothesis that there is no correlation between the two rankings is rejected.
Here, the hypotheses are as follows:
Ho : ρs = 0
H1 : ρs ≠ 0
In Example 8.6 of Chapter VIII on Correlation and Regression, the rank correlation between the priorities of Job Commitment Drivers among executives from India and from Asia Pacific was found to be 0.9515. Comparing this value with the tabulated value of rs for n = 10 at the 5% level of significance, viz. 0.6364, we find that the calculated value is more than the tabulated value, and hence we reject the null hypothesis that there is no correlation between the priorities of Job Commitment Drivers among executives from India and from Asia Pacific.
10.1 Test for Significance of Spearman's Rank Correlation for Large Sample Size
If the number of pairs of ranks is more than 30, then the distribution of the rank correlation rs, under the null hypothesis that ρs = 0, can be approximated by a normal distribution with mean 0 and standard deviation 1/√(n - 1). In symbolic form:
for n > 30,   rs ~ N(0, 1/√(n - 1))                                   (6)
For example, in Case Study 4 of this chapter, relating to the rankings of the top 50 CEOs in different cities, n = 50. It may be verified that the rank correlation between the rankings in Mumbai and Bangalore is 0.544. Thus the value of z, i.e. the standard normal variable, is
z = (0.544 - 0) / (1/√49) = 0.544 x 7 = 3.808,
which is more than 1.96, the critical value of z at the 5% level of significance.
Thus, it may be concluded that the rankings in Mumbai and Bangalore are significantly correlated.
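In R, the significance of Spearman's rank correlation is obtained with cor.test; the minimal sketch below uses hypothetical paired ranks, since the data of Example 8.6 and Case Study 4 are not reproduced here.

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)       # hypothetical ranks by judge 1
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)       # hypothetical ranks by judge 2
cor.test(x, y, method = "spearman")          # reports rho and the p-value for H0: rho = 0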
11 Testing Equality of Several Rank Correlations
Sometimes more than two rankings are given to an individual / entity / object, and we are required to test whether all the rankings are equal. Consider the following situation wherein some cities have been ranked as per three criteria.
Illustration 6
As per a study published in the Times of India dated 4th September 2006, the rankings of ten cities as per the earning, investing and living criteria are as follows:
City         Earning   Investing   Living
Bangalore    2         6           1
Coimbatore   5         1           5
Surat        1         2           10
Mumbai       7         5           2
Pune         3         4           7
Chennai      9         3           3
Delhi        4         7           8
Hyderabad    8         8           6
Kolkata      10        10          4
Ahmedabad    6         9           9
( Source : City Skyline of India 2006 published by Indicus Analytics)
The test statistic is the variance ratio
F = s2² / s1²  or  s1² / s2²   (the larger variance being taken in the numerator)
where
s1² = sd / (n(k - 1))                                   (8)
s2² = (s - sd/n) / (k(n - 1))                           (9)
s = nk(k² - 1) / 12                                     (10)
and sd is the sum of squares of the differences between the mean rank of a city and the overall mean rank of all the cities; n is the number of rankings (criteria) and k is the number of cities.
City         Earning   Investing   Living   Sum of City   Mean of City   Difference from      Squared Difference
                                            Rankings      Rankings       Grand Mean Ranking   from Grand Mean Ranking
Bangalore    2         6           1        9             3.00           -2.50                6.25
Coimbatore   5         1           5        11            3.67           -1.83                3.36
Surat        1         2           10       13            4.33           -1.17                1.36
Mumbai       7         5           2        14            4.67           -0.83                0.69
Pune         3         4           7        14            4.67           -0.83                0.69
Chennai      9         3           3        15            5.00           -0.50                0.25
Delhi        4         7           8        19            6.33           0.83                 0.69
Hyderabad    8         8           6        22            7.33           1.83                 3.36
Kolkata      10        10          4        24            8.00           2.50                 6.25
Ahmedabad    6         9           9        24            8.00           2.50                 6.25
                                            Total = 165   Mean = 5.5                          Total = 29.2
Here, n = 3 and k = 10. Therefore,
s = 3 x 10 x (100 - 1) / 12 = 247.5
sd = 29.2
s1² = 29.2 / (3(10 - 1)) = 1.08
s2² = (247.5 - 29.2/3) / (10(3 - 1)) = 11.89
It may be recalled that in Fisher's F ratio of two sample variances, the greater one is taken in the numerator, and the d.f. of F are taken accordingly. Here, because s2² > s1²,
F = 11.89 / 1.08 = 11.00
The tabulated value, F27, 9 = 2.25.
Since the calculated value is more than the tabulated value, we reject the null hypothesis, and conclude that the rankings on the given criteria are not equal.
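The computation can be reproduced in R directly from the formulas of this section, as in the minimal sketch below (using the city rankings of Illustration 6).

ranks <- cbind(Earning   = c(2, 5, 1, 7, 3, 9, 4, 8, 10, 6),
               Investing = c(6, 1, 2, 5, 4, 3, 7, 8, 10, 9),
               Living    = c(1, 5, 10, 2, 7, 3, 8, 6, 4, 9))
n <- ncol(ranks)    # number of rankings (criteria) = 3
k <- nrow(ranks)    # number of cities = 10
s   <- n * k * (k^2 - 1) / 12                        # 247.5
ssd <- sum((rowMeans(ranks) - mean(ranks))^2)        # about 29.2
s1  <- ssd / (n * (k - 1))                           # about 1.08
s2  <- (s - ssd / n) / (k * (n - 1))                 # about 11.89
max(s1, s2) / min(s1, s2)                            # F ratio, about 11.0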
12 Kendall's Rank Correlation Coefficient
In addition to Spearman's correlation coefficient, there exists one more rank correlation coefficient, introduced by Maurice Kendall in 1938. Like Spearman's correlation coefficient, it is used when the data is available in ordinal form. It is better known as Kendall's Tau, and is denoted by the Greek letter τ (corresponding to t in English). It measures the extent of agreement or association between rankings, and is defined as
τ = (nc - nd) / (nc + nd)
where,
nc : Number of concordant pairs of rankings
nd : Number of discordant pairs of rankings
The maximum value of nc + nd is the total number of pairs of subjects ranked by two different persons, or by the same person on two different criteria. For example, if the number of subjects is, say, 3 (call them a, b and c), then the possible pairs are ab, ac and bc. Similarly, if there are four subjects, the possible pairs are six in number, as follows:
ab, ac, ad, bc, bd, cd
It may be noted that the more the number of concordant pairs, the larger the numerator and hence the larger the value of the coefficient, indicating a higher degree of consistency between the rankings.
A pair of subjects (i.e. persons or units) is said to be concordant if the subject that ranks higher on one variable also ranks higher on (or equal to) the other variable. On the other hand, if one subject of the pair ranks higher on one variable but lower on the other variable, the pair of subjects is said to be discordant.
The concepts of concordant and discordant pairs, and the calculation of τ, are explained through an example given below.
As per the study published in the Times of India dated 4th September 2006, several cities were ranked as per the criteria of Earning, Investing and Living. The full data is given in the CD in the Chapter relating to Simple Correlation and Regression. An extract from the full Table is given below.
City        Ranking as per Earning   Ranking as per Investing
Chennai     3                        1
Delhi       1                        3
Kolkata     4                        4
Mumbai      2                        2
We rearrange the table by arranging the cities as per the Earning ranking:
City        Ranking as per Earning   Ranking as per Investing
Delhi       1                        3
Mumbai      2                        2
Chennai     3                        1
Kolkata     4                        4
Now, we form all possible pairs of cities. In this case, total possible pairs are 4C2 = 6, viz. DM, DC,
DK, MC,MK and CK.
The status of each pair i.e. whether, it is concordant or discordant, along with reasoning is given in
the following Table.
Pair   Concordant (C) / Discordant (D)   Reasoning
DM     D                                 The pairs (1, 2) and (3, 2) are in opposite order
DC     D                                 The pairs (1, 3) and (3, 1) are opposite of each other
DK     C                                 The pairs (1, 4) and (3, 4) are in the same (ascending) order
MC     D                                 The pairs (2, 3) and (2, 1) are in opposite order
MK     C                                 The pairs (2, 4) and (2, 4) are in the same (ascending) order
CK     C                                 The pairs (3, 4) and (1, 4) are in the same (ascending) order
It may be noted that for a pair of subjects (cities in the above case), when the subject that ranks higher on one variable also ranks higher on (or equal to) the other variable, the pair is said to be concordant. On the other hand, if a subject ranks higher on one variable and lower on the other variable, the pair is said to be discordant.
From the above Table, we note that
nc = 3 and nd = 3
Thus, Kendall's coefficient of correlation (concordance) is
τ = (3 - 3) / (3 + 3) = 0
As regards the possible values and interpretation of τ, the following may be noted:
If the association between the two rankings is perfect (i.e., the two rankings are the same), the coefficient has the value 1.
If the association between the two rankings is perfectly inverse (i.e., one ranking is the reverse of the other), the coefficient has the value -1.
In other cases, the value lies between -1 and 1, and increasing values imply increasing association between the rankings. If the rankings are completely independent, the coefficient has the value 0.
Incidentally, Spearman's and Kendall's correlation coefficients are not directly comparable, and their values for the same data can differ.
Kendall's Tau also measures the strength of association in a cross tabulation when both variables are measured at the ordinal level; in fact, it is the measure of association appropriate for cross-tabulated data available in ordinal form.
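A minimal R sketch for the four-city extract above (cities in the order Delhi, Mumbai, Chennai, Kolkata); cor.test with method = "kendall" returns tau together with a significance test of H0: tau = 0.

earning   <- c(1, 2, 3, 4)
investing <- c(3, 2, 1, 4)
cor.test(earning, investing, method = "kendall")   # tau = (3 - 3)/6 = 0 here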
13 Sign Test
The sign test is a nonparametric statistical procedure for fulfilling the following objectives:
(i) To identify the preference for one of two brands of a product (like tea, soft drinks, mobile phones, TVs) and/or a service (like a cellular company or internet service provider).
(ii) To determine whether a change being contemplated or introduced is found favourable.
The data collected in such situations is of the type + (preference for one) or - (preference for the other). Since the data is collected in terms of plus and minus signs, the test is called the Sign Test. A data set of 10 observations is of the type:
+, +, -, +, -, -, +, +, +, -
Illustration:
The Director of a management institute wanted to have an idea of the opinion of the students about
the new time schedule for the classes
8.00 a.m. to 2.00p.m.
He randomly selected a representative sample of 20 students, and recorded their preferences. The data
was collected in the following form:
Student No.   In Favour of the Proposed Option: 1,   Sign
              Opposed to the Option: 2               (+ for 1, - for 2)
1             1                                      +
2             1                                      +
3             2                                      -
4             1                                      +
5             2                                      -
6             2                                      -
7             2                                      -
8             2                                      -
9             1                                      +
10            1                                      +
11            2                                      -
12            1                                      +
13            2                                      -
14            2                                      -
15            1                                      +
16            1                                      +
17            1                                      +
18            1                                      +
19            1                                      +
20            1                                      +
Here the hypothesis to be tested is:
H0 : p = 0.50
H1 : p ≠ 0.50
Under the assumption that H0 is true, i.e. p = 0.50, the number of + signs follows a binomial distribution with p = 0.50.
Let x denote the number of + signs. It may be noted that if the value of x is either very low or very high, then the null hypothesis will be rejected in favour of the alternative H1 : p ≠ 0.50.
For α = 0.05, the rejection region consists of the low values of x whose total probability in the lower tail does not exceed 0.025, and the high values of x whose total probability in the upper tail does not exceed 0.025.
If the alternative is H1 : p < 0.50, then the value of x has to be low enough to reject the null hypothesis in favour of H1 : p < 0.50.
If the alternative is H1 : p > 0.50, then the value of x has to be high enough to reject the null hypothesis in favour of H1 : p > 0.50.
With a sample size of n =20, one can refer to the Table below showing the probabilities for all the
possible values of the binomial probability distribution with p = 0.5.
Number of + Signs   Probability
0                   0.0000
1                   0.0000
2                   0.0002
3                   0.0011
4                   0.0046
5                   0.0148
6                   0.0370
7                   0.0739
8                   0.1201
9                   0.1602
10                  0.1762
11                  0.1602
12                  0.1201
13                  0.0739
14                  0.0370
15                  0.0148
16                  0.0046
17                  0.0011
18                  0.0002
19                  0.0000
20                  0.0000
The binomial probability distribution shown above can be used to provide the decision rule for any sign test up to a sample size of n = 20. With the null hypothesis p = 0.50 and the sample size n, the decision rule can be established for any level of significance. In addition, by considering the probabilities in only the lower or upper tail of the binomial probability distribution, we can develop rejection rules for one-tailed tests.
The above Table gives the probability of the number of plus signs under the assumption that H0 is true, and is, therefore, the appropriate sampling distribution for the hypothesis test. This sampling distribution is used to determine a criterion for rejecting H0. This approach is similar to the method used for developing rejection criteria for hypothesis testing given in Chapter 11 on Statistical Inference.
For example, let α = 0.05 and let the test be two-sided. In this case the alternative hypothesis is H1 : p ≠ 0.50, and we would have a critical or rejection region with an area of 0.025 in each tail of the distribution.
Starting at the lower end of the distribution, we see that the probability of obtaining zero, one, two, three, four or five plus signs is 0.0000 + 0.0000 + 0.0002 + 0.0011 + 0.0046 + 0.0148 = 0.0207. Note that we stop at 5 plus signs because adding the probability of six plus signs would make the area in the lower tail equal to 0.0207 + 0.0370 = 0.0577, which substantially exceeds the desired area of 0.025. At the upper end of the distribution, we find the same probability of 0.0207 corresponding to 15, 16, 17, 18, 19 or 20 plus signs. Thus, the closest we can come to α = 0.05 without exceeding it is 0.0207 + 0.0207 = 0.0414. We therefore adopt the following rejection criterion:
Reject H0 if the number of + signs is less than 6 or greater than 14
Since the number of + signs in the given illustration is 12, we cannot reject the null hypothesis; thus the data reveal that the students are not against the proposed option.
It may be noted that Table T3 for the Binomial distribution does not provide probabilities for sample sizes greater than 20. In such cases, we can use the large-sample normal approximation to the binomial probabilities to determine the appropriate rejection rule for the sign test.
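A minimal R sketch for the illustration above; binom.test performs an exact sign test, and its two-sided p-value leads to the same decision as the rejection rule derived from the binomial table.

# 12 of the 20 students favoured the proposed option (+ signs)
binom.test(x = 12, n = 20, p = 0.5)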
14 Sign Test for Paired Data
The sign test can also be used to test the hypothesis that there is "no difference" between two
distributions of continuous variables x and y.
Let p = P(x > y), and then we can test the null hypothesis
H0 : p = 0.50
This hypothesis implies that given a random pair of measurements (xi, yi), then both xi and yi are
equally likely to be larger than the other.
To perform the test, we first collect independent pairs of sample data from the two populations as
(x1, y1), (x2, y2), . . ., (xn, yn)
We omit pairs for which there is no difference, so that we may have a reduced sample of pairs. Now, let x be the number of pairs for which xi - yi > 0. Assuming that H0 is true, x follows a binomial distribution with p = 0.5. It may be noted that if the value of x is either very low or very high, the null hypothesis will be rejected in favour of the alternative H1 : p ≠ 0.50.
For α = 0.05, the rejection region consists of the low values of x whose total probability in the lower tail does not exceed 0.025, and the high values of x whose total probability in the upper tail does not exceed 0.025.
The right-tail value, computed as the total of the probabilities of values greater than or equal to the observed x, is the p-value for the one-sided alternative H1 : p > 0.50; this alternative means that the x measurements tend to be higher than the y measurements. Similarly, the left-tail value is the p-value for H1 : p < 0.50, which means that the y measurements tend to be higher. For a two-sided alternative H1 : p ≠ 0.50, the p-value is twice the smaller tail value.
Illustration
The HRD Chief of an organisation wants to assess whether there is any significant difference in the
marks obtained by twelve trainee officers in the papers on Indian and Global Perspectives conducted
after the induction training.
(72, 70)
(82, 79)
(78, 69)
(80, 74)
(64, 66)
(78, 75)
(85, 86)
(83, 77)
(83, 88)
(84, 90)
(78, 72)
(84, 82)
We convert the above data into the following tabular form, where + indicates pairs with xi > yi (marks in the Indian Perspectives paper higher) and - indicates pairs with xi < yi.
Trainee Officer   1   2   3   4   5   6   7   8   9   10   11   12
Sign              +   +   +   +   -   +   -   +   -   -    +    +
It may be noted that the number of + signs is 8.
Here, we test the null hypothesis that there is "no significant difference" in the scores in the two
papers, with the two-sided alternative that there is a significant difference. In symbolic notation:
H0 : m1 = m2
H1 : m1 ≠ m2
Number of + Signs   Probability
0                   0.0002
1                   0.0029
2                   0.0161
3                   0.0537
4                   0.1208
5                   0.1934
6                   0.2256
7                   0.1934
8                   0.1208
9                   0.0537
10                  0.0161
11                  0.0029
12                  0.0002
Let the level of significance be α = 0.05. In a two-sided test, we would have a rejection region of approximately 0.025 in each tail of the distribution.
From the above table, we note that the probability of obtaining zero, one or two plus signs is 0.0002 + 0.0029 + 0.0161 = 0.0192. Thus, the probability of getting 2 or fewer plus signs is 0.0192; since this is less than 0.025, H0 will be rejected in that region.
Similarly, the probability of getting 10, 11 or 12 plus signs is 0.0161 + 0.0029 + 0.0002 = 0.0192. Thus, the probability of getting 10 or more plus signs is 0.0192; since this is less than 0.025, H0 will be rejected in that region as well.
Thus, at the 5% level of significance, H0 will be rejected if the number of + signs is less than 3 or greater than 9. We therefore adopt the following rejection criterion:
Reject H0 if the number of + signs is less than 3 or greater than 9
Since, in our case the number of plus signs is only 8, we cannot reject the null hypothesis. Thus, we
conclude that there is no significant difference between the marks obtained in the papers on Indian
and Global perspectives.
If we want to test whether the average marks obtained in the paper on Indian Perspectives are more than the average marks in the paper on Global Perspectives, the null hypothesis is H0 : m1 = m2 and the alternative hypothesis is H1 : m1 > m2. Thus, the test is one-sided. If the level of significance is 0.05, the entire critical region lies on the right side. The probability of getting more than 8 plus signs is 0.0729, which is more than 0.05, and, therefore, the null hypothesis cannot be rejected.
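A minimal R sketch for the paired illustration above; ties (if any) are dropped and the + signs are counted before calling binom.test.

indian <- c(72, 82, 78, 80, 64, 78, 85, 83, 83, 84, 78, 84)
global <- c(70, 79, 69, 74, 66, 75, 86, 77, 88, 90, 72, 82)
n_plus <- sum(indian - global > 0)     # number of + signs (here 8)
n_eff  <- sum(indian != global)        # pairs with a non-zero difference (here 12)
binom.test(n_plus, n_eff, p = 0.5)     # two-sided sign test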
One sample t-test
The mean of the variable write for this particular sample of students is 52.775, which is statistically
significantly different from the test value of 50. We would conclude that this group of students has a
significantly higher mean on the writing test than 50.
One sample median test
A one sample median test allows us to test whether a sample median differs significantly from a
hypothesized value. We will use the same variable, write, as we did in the one sample t-test example
above, but we do not need to assume that it is interval and normally distributed (we only need to
assume that write is an ordinal variable). However, we are unaware of how to perform this test in
SPSS.
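As an alternative (and since R is also used in this course), a one-sample median test can be approximated by a sign test against the hypothesised median; a minimal sketch, assuming the hsb2 data have already been read into an R data frame named hsb2 with a column write:

w <- hsb2$write                                   # hsb2 assumed already loaded
binom.test(sum(w > 50), sum(w != 50), p = 0.5)    # sign test of H0: median = 50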
Binomial test
A one sample binomial test allows us to test whether the proportion of successes on a two-level
categorical dependent variable significantly differs from a hypothesized value. For example, using
the hsb2 data file, say we wish to test whether the proportion of females (female) differs significantly
from 50%, i.e., from .5. We can do this as shown below.
npar tests
/binomial (.5) = female.
The results indicate that there is no statistically significant difference (p = .229). In other words, the
proportion of females in this sample does not significantly differ from the hypothesized value of 50%.
Chi-square goodness of fit
A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical
variable differ from hypothesized proportions. For example, let's suppose that we believe that the
general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White
folks. We want to test whether the observed proportions from our sample differ significantly from
these hypothesized proportions.
npar test
/chisquare = race
/expected = 10 10 10 70.
These results show that racial composition in our sample does not differ significantly from the
hypothesized values that we supplied (chi-square with three degrees of freedom = 5.029, p = .170).
Two independent samples t-test
An independent samples t-test is used when you want to compare the means of a normally distributed
interval dependent variable for two independent groups. For example, using the hsb2 data file, say
we wish to test whether the mean for write is the same for males and females.
t-test groups = female(0 1)
/variables = write.
The results indicate that there is a statistically significant difference between the mean writing score
for males and females (t = -3.734, p = .000). In other words, females have a statistically significantly
higher mean score on writing (54.99) than males (50.12).
Wilcoxon-Mann-Whitney test
The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and
can be used when you do not assume that the dependent variable is a normally distributed interval
variable (you only assume that the variable is at least ordinal). You will notice that the SPSS syntax
for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test. We
will use the same data file (the hsb2 data file) and the same variables in this example as we did in the
independent t-test example above and will not assume that write, our dependent variable, is normally
distributed.
npar test
/m-w = write by female(0 1).
The results suggest that there is a statistically significant difference between the underlying
distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.001).
Chi-square test
A chi-square test is used when you want to see if there is a relationship between two categorical
variables. In SPSS, the chisq option is used on the statistics subcommand of the crosstabs command
to obtain the test statistic and its associated p-value. Using the hsb2 data file, let's see if there is a
relationship between the type of school attended (schtyp) and students' gender (female). Remember
that the chi-square test assumes that the expected value for each cell is five or higher. This
assumption is easily met in the examples below. However, if this assumption is not met in your data,
please see the section on Fisher's exact test below.
crosstabs
/tables = schtyp by female
/statistic = chisq.
These results indicate that there is no statistically significant relationship between the type of school
attended and gender (chi-square with one degree of freedom = 0.047, p = 0.828).
Let's look at another example, this time looking at the linear relationship between gender (female)
and socio-economic status (ses). The point of this example is that one (or both) variables may have
more than two levels, and that the variables do not have to have the same number of levels. In this
example, female has two levels (male and female) and ses has three levels (low, medium and high).
crosstabs
/tables = female by ses
/statistic = chisq.
Again we find that there is no statistically significant relationship between the variables (chi-square
with two degrees of freedom = 4.577, p = 0.101).
Fisher's exact test
The Fisher's exact test is used when you want to conduct a chi-square test but one or more of your
cells has an expected frequency of five or less. Remember that the chi-square test assumes that each
cell has an expected frequency of five or more, but the Fisher's exact test has no such assumption and
can be used regardless of how small the expected frequency is. In SPSS unless you have the SPSS
Exact Test Module, you can only perform a Fisher's exact test on a 2x2 table, and these results are
presented by default. Please see the results from the chi squared example above.
One-way ANOVA
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable
(with two or more categories) and a normally distributed interval dependent variable and you wish to
test for differences in the means of the dependent variable broken down by the levels of the
independent variable. For example, using the hsb2 data file, say we wish to test whether the mean of
write differs between the three program types (prog). The command for this test would be:
oneway write by prog.
The mean of the dependent variable differs significantly among the levels of program type. However,
we do not know if the difference is between only two of the levels or all three of the levels. (The F
test for the Model is the same as the F test for prog because prog was the only variable entered into
the model. If other variables had also been entered, the F test for the Model would have been
different from prog.) To see the mean of write for each level of program type,
means tables = write by prog.
From this we can see that the students in the academic program have the highest mean writing score,
while students in the vocational program have the lowest.
Kruskal Wallis test
The Kruskal Wallis test is used when you have one independent variable with two or more levels and
an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a
generalized form of the Mann-Whitney test method since it permits two or more groups. We will use
the same data file as the one way ANOVA example above (the hsb2 data file) and the same variables
as in the example above, but we will not assume that write is a normally distributed interval variable.
npar tests
/k-w = write by prog (1,3).
If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly different value of chi-squared. With or without ties, the results indicate that there is a statistically significant difference among the three types of programs.
Paired t-test
A paired (samples) t-test is used when you have two related observations (i.e., two observations per
subject) and you want to see if the means on these two normally distributed interval variables differ
from one another. For example, using the hsb2 data file we will test whether the mean of read is
equal to the mean of write.
t-test pairs = read with write (paired).
These results indicate that the mean of read is not statistically significantly different from the mean
of write (t = -0.867, p = 0.387).
Wilcoxon signed rank sum test
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use
the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the
two variables is interval and normally distributed (but you do assume the difference is ordinal). We
will use the same example as above, but we will not assume that the difference between read and
write is interval and normally distributed.
npar test
/wilcoxon = write with read (paired).
The results suggest that there is not a statistically significant difference between read and write.
If you believe the differences between read and write were not ordinal but could merely be classified as positive and negative, then you may want to consider a sign test in lieu of the signed rank test. Again, we will use the same variables in this example and assume that this difference is not ordinal.
npar test
/sign = read with write (paired).
McNemar test
McNemar's chi-square statistic suggests that there is not a statistically significant difference in the
proportion of students in the himath group and the proportion of students in the hiread group.
One-way repeated measures ANOVA
You would perform a one-way repeated measures analysis of variance if you had one categorical
independent variable and a normally distributed interval dependent variable that was repeated at least
twice for each subject. This is the equivalent of the paired samples t-test, but allows for two or more
levels of the categorical variable. This tests whether the mean of the dependent variable differs by the
categorical variable. We have an example data set called rb4wide, which is used in Kirk's book
Experimental Design. In this data set, y is the dependent variable, a is the repeated measure and s is
the variable that indicates the subject number.
glm y1 y2 y3 y4
/wsfactor a(4).
You will notice that this output gives four different p-values. The output labeled "sphericity
assumed" is the p-value (0.000) that you would get if you assumed compound symmetry in the
variance-covariance matrix. Because that assumption is often not valid, the three other p-values offer
various corrections (the Huynh-Feldt, H-F, Greenhouse-Geisser, G-G and Lower-bound). No matter
which p-value you use, our results indicate that we have a statistically significant effect of a at the .05
level.
Factorial ANOVA
A factorial ANOVA has two or more categorical independent variables (either with or without the
interactions) and a single normally distributed interval dependent variable. For example, using the
hsb2 data file we will look at writing scores (write) as the dependent variable and gender (female)
and socio-economic status (ses) as independent variables, and we will include an interaction of
female by ses. Note that in SPSS, you do not need to have the interaction term(s) in your data set.
Rather, you can have SPSS create it/them temporarily by placing an asterisk between the variables
that will make up the interaction term(s).
glm write by female ses.
These results indicate that the overall model is statistically significant (F = 5.666, p = 0.000). The
variables female and ses are also statistically significant (F = 16.595, p = 0.000 and F = 6.611, p =
0.002, respectively). However, the interaction between female and ses is not statistically significant
(F = 0.133, p = 0.875).
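The same factorial model can be sketched in R as below, assuming the hsb2 data frame is available; note that aov() reports sequential (Type I) sums of squares, so with unbalanced data the F values need not match the SPSS GLM output exactly.
# Factorial ANOVA of write on female, ses and their interaction
fit_fact <- aov(write ~ factor(female) * factor(ses), data = hsb2)
summary(fit_fact)   # main effects of female and ses, plus the female x ses interaction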
Friedman test
You perform a Friedman test when you have one within-subjects independent variable with two or
more levels and a dependent variable that is not interval and normally distributed (but at least
ordinal). We will use this test to determine if there is a difference in the reading, writing and math
scores. The null hypothesis in this test is that the distribution of the ranks of each type of score (i.e.,
reading, writing and math) is the same. To conduct a Friedman test, the data need to be in a long
format. SPSS handles this for you, but in other statistical packages you will have to reshape the data
before you can conduct this test.
npar tests
/friedman = read write math.
Friedman's chi-square has a value of 0.645 and a p-value of 0.724 and is not statistically significant.
Hence, there is no evidence that the distributions of the three types of scores are different.
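A minimal R sketch of the same Friedman test, again assuming the hsb2 data frame, is:
# friedman.test() accepts a matrix with one row per subject and one column per condition
friedman.test(as.matrix(hsb2[, c("read", "write", "math")]))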
Correlation
A correlation is useful when you want to see the relationship between two (or more) normally
distributed interval variables. For example, using the hsb2 data file we can run a correlation between
two continuous variables, read and write.
correlations
/variables = read write.
In the second example, we will run a correlation between a dichotomous variable, female, and a
continuous variable, write. Although it is assumed that the variables are interval and normally
distributed, we can include dummy variables when performing correlations.
correlations
/variables =
female write.
In the first example above, we see that the correlation between read and write is 0.597. By squaring
the correlation and then multiplying by 100, you can determine what percentage of the variability is
shared. Let's round 0.597 to be 0.6, which when squared would be .36, multiplied by 100 would be
36%. Hence read shares about 36% of its variability with write. In the output for the second
example, we can see the correlation between write and female is 0.256. Squaring this number yields
.065536, meaning that female shares approximately 6.5% of its variability with write.
Simple linear regression
Simple linear regression allows us to look at the linear relationship between one normally distributed
interval predictor and one normally distributed interval outcome variable. For example, using the
hsb2 data file, say we wish to look at the relationship between writing scores (write) and reading
scores (read); in other words, predicting write from read.
regression variables = write read
/dependent = write
/method = enter.
We see that the relationship between write and read is positive (.552) and based on the t-value
(10.47) and p-value (0.000), we would conclude this relationship is statistically significant. Hence,
we would say there is a statistically significant positive linear relationship between reading and
writing.
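A minimal R sketch of this simple regression, assuming the hsb2 data frame, is:
# Simple linear regression predicting write from read
fit_slr <- lm(write ~ read, data = hsb2)
summary(fit_slr)   # slope of about 0.55 with its t-value and p-value, as quoted above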
Non-parametric correlation
A Spearman correlation is used when one or both of the variables are not assumed to be normally
distributed and interval (but are assumed to be ordinal). The values of the variables are converted into
ranks and then correlated. In our example, we will look for a relationship between read and write.
We will not assume that both of these variables are normal and interval.
nonpar corr
/variables = read write
/print = spearman.
The results suggest that the relationship between read and write (rho = 0.617, p = 0.000) is
statistically significant.
MULTIVARIATE STATISTICAL TECHNIQUES
Contents
Introduction
Multiple Regression
Discriminant Analysis
Logistic Regression
Multivariate Analysis of Variance (MANOVA)
Factor Analysis
Principal Component Analysis
Common Factor Analysis (Principal Axis Factoring)
Canonical Correlation Analysis
Cluster Analysis
Conjoint Analysis
Multidimensional Scaling
1 Relevance and Introduction
In general, what is true about predicting the success of a politician with the help of intelligence
software, as pointed out above by the President of India, is equally true for predicting the success of
products and services with the help of statistical techniques. In this Chapter, we discuss a number of
statistical techniques which are especially useful in designing of products and services. The products
and services could be physical, financial, promotional like advertisement, behavioural like
motivational strategy through incentive package or even educational like training
programmes/seminars, etc. These techniques basically involve reduction of data and its subsequent
summarisation, presentation and interpretation. A classical example of data reduction and
summarisation is provided by SENSEX (Bombay Stock Exchange) which is one number like 18,000,
but it represents movement in share prices listed in Bombay Stock Exchange. Yet another example is
the Grade Point Average, used for assessment of MBA students, which reduces and summarises
marks in all subjects to a single number.
In general, any live problem whether relating to individual, like predicting the cause of an ailment or
behavioural pattern, or relating to an entity, like forecasting its futuristic status in terms of products
and services, needs collection of data on several parameters. These parameters are then analysed to
summarise the entire set of data with a few indicators which are then used for drawing conclusions.
The following techniques (with their abbreviations in brackets) coupled with the appropriate
computer software like SPSS, play a very useful role in the endeavour of reduction and
summarisation of data for easy comprehension.
Multiple Regression Analysis (MRA)
Discriminant Analysis (DA)
Logistic Regression (LR)
Multivariate Analysis of Variance (MANOVA)( introduction)
Factor Analysis (FA)
Principal Component Analysis ( PCA)
Canonical Correlation Analysis (CRA) ( introduction)
Cluster Analysis
Conjoint Analysis
Multidimensional Scaling( MDS)
Before describing these techniques in details, we provide their brief description as also indicate
their relevance and uses, in a tabular format given below. This is aimed at providing
motivation for learning these techniques and generating confidence in using SPSS for arriving
at final conclusions/solutions in a research study. The contents of the Table will be fully
comprehended after reading all the techniques.
Statistical Techniques, Their Relevance and Uses for Designing and Marketing of Products and Services

Multiple Regression Analysis (MRA): It deals with the study of the relationship between one metric dependent variable and more than one metric independent variables.

Discriminant Analysis (DA): It is a statistical technique for classification, or for determining a linear function of the variables, called the discriminant function, which helps in discriminating between two groups of entities or individuals.

Logistic Regression (LR): It is a technique that assumes the errors are drawn from a binomial distribution. In logistic regression the dependent variable is the probability that an event will occur, and hence it is constrained between 0 and 1. The predictors can be binary, a mixture of categorical and continuous, or just continuous.

Multivariate Analysis of Variance (MANOVA): It explores, simultaneously, the relationship between several non-metric independent variables (treatments, say fertilisers) and two or more metric dependent variables (say, yield and harvest time). If there is only one dependent variable, MANOVA is the same as ANOVA.

Factor Analysis (FA): Identifies the smallest number of common factors that best explain or account for most of the correlation among the indicators. For example, the intelligence quotient of a student might explain most of the marks obtained in Mathematics, Physics, Statistics, etc. As yet another example, when two variables x and y are highly correlated, only one of them could be used to represent the entire data. It helps in assessing the image of a company/enterprise, the attitudes of sales personnel and customers, and the preference or priority for the characteristics of a product (like a television or mobile phone) or a service (like a TV programme or air travel).

Canonical Correlation Analysis (CRA): An extension of multiple regression analysis (which involves one dependent variable and several metric independent variables); it is used for situations wherein there are several dependent variables and several independent variables. It involves developing linear combinations of the two sets of variables (dependent and independent) and studies the relationship between the two sets. The weights in the linear combinations are derived on the criterion that maximises the correlation between the two sets of variables.

Cluster Analysis: It is an analytical technique used to develop meaningful subgroups of entities which are homogeneous or compact with respect to certain characteristics. Thus, observations in each group are similar to each other, while each group differs from the others with respect to the same characteristics, so that observations of one group are different from the observations of the other groups.

Conjoint Analysis: Involves determining the contribution of variables (each with several levels) to the choice preference over combinations of variables that represent realistic choice sets (products, concepts, services, companies, etc.). It is useful for analysing consumer responses and using them for the design of products and services; it helps in determining the contributions of the predictor variables and their respective levels to the desirability of the combinations of variables. For example, how much does the quality of food contribute to the continued loyalty of a traveller to an airline? Which type of food is liked most?

Multidimensional Scaling (MDS): It is a set of procedures for drawing pictures of data so as to visualise and clarify the relationships described by the data. It transforms consumer judgments/perceptions of similarity or preference into, usually, a two-dimensional space; the requisite data is typically collected by having respondents give simple one-dimensional responses. It is useful for the design of products and services, and helps in illustrating market segments based on indicated preferences, identifying the products and services that are more competitive with each other, and understanding the criteria used by people while judging objects (products, services, companies, advertisements, etc.).
1.1 Multivariate Techniques
These techniques are classified in two types
Dependence Techniques
Interdependence Techniques
Dependence Techniques
These are techniques that define some of the variables as independent variables and others as dependent
variables. They aim at finding the relationship between these variables and may, in turn, find the effect of
the independent variable(s) on the dependent variable(s).
The technique to be used differs as the type of the independent/dependent variables changes. For
example, if all the independent and dependent variables are metric (numeric), Multiple Regression
Analysis can be used; if the dependent variable is metric and the independent variable(s) are categorical,
ANOVA can be used. If the dependent variable is metric and some of the independent variables are
metric while some are qualitative, ANACOVA (analysis of covariance) can be used. If the dependent
variable is non-metric or categorical, multiple discriminant analysis or logistic regression are the
techniques used for analysis.
All the above techniques require a single dependent variable.
If there is more than one dependent variable, the techniques used are MANOVA (multivariate
analysis of variance) or canonical correlation.
MANOVA is used when there is more than one dependent variable and all independent variables
are categorical. If some of the independent variables are categorical and some are metric,
MANOCOVA (multivariate analysis of covariance) can be used. If there is more than one
dependent variable and all dependent and independent variables are metric, the best suited analysis is
canonical correlation.
Interdependence Techniques
Interdependence techniques do not designate any variable as independent or dependent, nor do they try to
find such a relationship. These techniques can be divided into those concerned with variable
interdependence and those concerned with inter-object similarity.
The variable interdependence techniques can also be termed data reduction techniques. Factor
analysis is an example: it is used when there are many related variables and one wants to reduce the
list of variables or find the underlying factors that determine the variables.
Inter-object similarity is assessed with the help of cluster analysis and multidimensional scaling
(MDS).
Brief descriptions of all the above techniques are given in subsequent sections of this Chapter.
2 Multiple Regression Analysis
Multiple regression analysis deals with the relationship between one metric dependent variable and two
or more metric independent variables, for example:
Expenditure on Advertisement + Expenditure on R & D
Return on Stock of Reliance Industries + Return on Stock of Infosys Technologies
With two independent variables x1 and x2, the relationship is expressed as
y = b0 + b1 x1 + b2 x2        ... (1)
where b0, b1 and b2 are constants.
The sample comprises n triplets of values of y, x1 and x2, in the following format:
y  :  y1,  y2,  ... , yn
x1 :  x11, x12, ... , x1n
x2 :  x21, x22, ... , x2n
The values of constants bo, b1 and b2 are estimated with the help of Principle of Least Squares just like values
of a and b were found while fitting the equation y = a + b x in Chapter 10 on Simple Correlation and
Regression analysis. These are calculated by using the above sample observations/values, and with the help of
the formulas given below :
These formulas and manual calculations are given for illustration only. In real life these are easily obtained
with the help of personal computers wherein the formulas are already stored.
b1 = [ (Σ yi x1i - n ȳ x̄1)(Σ x2i² - n x̄2²) - (Σ yi x2i - n ȳ x̄2)(Σ x1i x2i - n x̄1 x̄2) ] /
     [ (Σ x1i² - n x̄1²)(Σ x2i² - n x̄2²) - (Σ x1i x2i - n x̄1 x̄2)² ]                      ... (2)

b2 = [ (Σ yi x2i - n ȳ x̄2)(Σ x1i² - n x̄1²) - (Σ yi x1i - n ȳ x̄1)(Σ x1i x2i - n x̄1 x̄2) ] /
     [ (Σ x1i² - n x̄1²)(Σ x2i² - n x̄2²) - (Σ x1i x2i - n x̄1 x̄2)² ]

b0 = ȳ - b1 x̄1 - b2 x̄2                                                                    ... (3)
The calculations needed in the above formulas are facilitated by preparing the following Table
       y      x1     x2     y·x1       y·x2       x1·x2       y²       x1²       x2²
       y1     x11    x21    y1·x11     y1·x21     x11·x21     y1²      x11²      x21²
       .      .      .      .          .          .           .        .         .
       yi     x1i    x2i    yi·x1i     yi·x2i     x1i·x2i     yi²      x1i²      x2i²
       .      .      .      .          .          .           .        .         .
       yn     x1n    x2n    yn·x1n     yn·x2n     x1n·x2n     yn²      x1n²      x2n²
Sum    Σyi    Σx1i   Σx2i   Σyi·x1i    Σyi·x2i    Σx1i·x2i    Σyi²     Σx1i²     Σx2i²
The effectiveness or the reliability of the relationship thus obtained is judged by the multiple coefficient of
determination, usually denoted by R2, and is defined as the ratio of variation explained by the regression
equation ( 1 ) and total variation of the dependent variable y. Thus,
R² = Explained Variation in y / Total Variation in y                 ... (4)

R² = 1 - (Unexplained Variation / Total Variation)                   ... (5)

   = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²                                    ... (6)

where ŷi denotes the estimate of yi obtained from the regression equation.
It may be recalled from Chapter 10, that total variation in the variable y is equal to the variation explained by
the regression equation plus unexplained variation by the regression equation. Mathematically, this is
expressed as
Σ(yi - ȳ)²        =        Σ(ŷi - ȳ)²          +        Σ(yi - ŷi)²
Total Variation            Explained Variation          Unexplained Variation

If yi = ŷi for all observations, all the variation is explained by ŷi, and therefore the unexplained
variation is zero. In such a case, the total variation is fully explained by the regression equation, and
R² is equal to 1.
The square root of R², viz. R, is known as the coefficient of multiple correlation and always lies
between 0 and 1. In fact, R is the correlation between the dependent variable and its estimate
derived from the multiple regression equation, and as such it has to be positive.
All the calculations and interpretations for the multiple regression equation and coefficient of multiple
correlation or determination have been explained with the help of an illustration given below:
Example 1
The owner of a chain of ten stores wishes to forecast net profit with the help of next year's projected
sales of food and non-food items. The data about the current year's sales of food items and of non-food
items, as also the net profit, for all the ten stores are available as follows:
Table 1: Sales of Food and Non-Food Items and Net Profit of a Chain of Stores

Supermarket No.   Net Profit (Rs. Crores) y   Sales of Food Items (Rs. Crores) x1   Sales of Non-Food Items (Rs. Crores) x2
1                 5.6                         20                                    5
2                 4.7                         15                                    5
3                 5.4                         18                                    6
4                 5.5                         20                                    5
5                 5.1                         16                                    6
6                 6.8                         25                                    6
7                 5.8                         22                                    4
8                 8.2                         30                                    7
9                 5.8                         24                                    3
10                6.2                         25                                    4
In this case, the relationship is expressed by the equation (1) reproduced below:
y = b0 + b1 x1 + b2 x2
where, y denotes net profit, x1 denotes sales of food items, and x2 denotes sales of non-food items,
and b0, b1 and b2 are constants. Their values are obtained by the formulas (2) and (3), derived from the
Principle of Least Squares, given earlier.
The required calculations can be made with the help of the following Table:
(Amounts in Rs. Crores)

Supermarket   y      x1     x2    y·x1      y·x2     x1·x2   y²       x1²    x2²
1             5.6    20     5     112.0     28.0     100     31.36    400    25
2             4.7    15     5     70.5      23.5     75      22.09    225    25
3             5.4    18     6     97.2      32.4     108     29.16    324    36
4             5.5    20     5     110.0     27.5     100     30.25    400    25
5             5.1    16     6     81.6      30.6     96      26.01    256    36
6             6.8    25     6     170.0     40.8     150     46.24    625    36
7             5.8    22     4     127.6     23.2     88      33.64    484    16
8             8.2    30     7     246.0     57.4     210     67.24    900    49
9             5.8    24     3     139.2     17.4     72      33.64    576    9
10            6.2    25     4     155.0     24.8     100     38.44    625    16
Sum           59.1   215    51    1309.1    305.6    1099    358.07   4815   273
Average       5.91   21.5   5.1
Substituting the values of bo, b1 and b2, the desired relationship is obtained as
y = 0.233 + 0.196 x1 + 0.287 x2        ... (7)
This equation is known as the multiple regression equation of y on x1 and x2, and it indicates as to
how y changes with respect to changes in x1 and x2. The interpretation of the value of the coefficient
of x1, viz. b1 = 0.196, is that if x2 (sales of non-food items) is held constant, then for every additional
crore of rupees of sales of food items, the net profit increases by Rs. 0.196 Crore, i.e. Rs. 19.6 lakhs.
Similarly, the interpretation of the value of the coefficient of x2, viz. b2 = 0.287, is that if the sales of
non-food items increase by Rs. one Crore, the net profit increases by Rs. 0.287 Crore, i.e. Rs. 28.7 lakhs.
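The fitted equation (7) can be reproduced with a few lines of R; the sketch below keys in the values of Table 1 (the variable names profit, food and nonfood are chosen here only for readability and are not part of the original example).
profit  <- c(5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2)   # y
food    <- c(20, 15, 18, 20, 16, 25, 22, 30, 24, 25)             # x1
nonfood <- c(5, 5, 6, 5, 6, 6, 4, 7, 3, 4)                       # x2
fit <- lm(profit ~ food + nonfood)
coef(fit)                # approximately 0.233, 0.196 and 0.287, as in equation (7)
summary(fit)$r.squared   # approximately 0.9943, the R-squared derived below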
The effectiveness or the reliability of this relationship is judged by the multiple coefficient of
determination, usually denoted by R², and is defined as given in (4) as

R² = Explained Variation in y by the Regression Equation / Total Variation in y

The above two quantities are calculated with the help of the following Table. Column (3) gives the
difference between the observed value yi and its estimate ŷi derived from the fitted regression
equation by substituting the corresponding values of x1 and x2.
yi        ŷi*       yi - ŷi     (yi - ŷi)²     (yi - ȳ)²
5.6       5.587      0.0127      0.0002         0.0961
4.7       4.607      0.0928      0.0086         1.4641
5.4       5.482     -0.082       0.0067         0.2601
5.5       5.587     -0.087       0.0076         0.1681
5.1       5.090      0.0099      0.0001         0.6561
6.8       6.854     -0.054       0.0029         0.7921
5.8       5.693      0.1075      0.0116         0.0121
8.2       8.121      0.0789      0.0062         5.2441
5.8       5.798      0.0023      0.0000         0.0121
6.2       6.281     -0.081       0.0065         0.0841
Sum=59.1  Sum=59.1               Sum = 0.0504   Sum = 8.789
                                 (Unexplained   (Total
                                 Variation)     Variation)
(ȳ = 5.91)
* Derived from the earlier fitted equation, y = 0.233 + 0.196 x1 + 0.287 x2
Substituting the respective values in equation (6), we get

R² = 1 - (0.0504 / 8.789) = 1 - 0.0057 = 0.9943

The interpretation of the value R² = 0.9943 is that 99.43% of the variation in net profit is explained
jointly by the variation in sales of food items and non-food items.
Incidentally, the Explained Variation for the above example can be calculated by subtracting the
unexplained variation from the total variation: 8.789 - 0.0504 = 8.7386.
It may be recalled that in Chapter 10 on Simple Correlation and Regression Analysis, we have
discussed the impact of the change of variation in only one independent variable on the dependent
variable. We shall now demonstrate the usefulness of two independent variables in explaining the
variation in the dependent variable (net profit in this case).
Suppose we consider only one independent variable, say sales of food items; then the basic data would be as follows:
Supermarket   Net Profit (Rs. Crores) y   Sales of Food Items (Rs. Crores) x1
1             5.6                         20
2             4.7                         15
3             5.4                         18
4             5.5                         20
5             5.1                         16
6             6.8                         25
7             5.8                         22
8             8.2                         30
9             5.8                         24
10            6.2                         25
The scatter diagram indicates a positive linear correlation between the net profit and the sales of food
items.
The simple regression equation of y on x1 is y = a + b x1, where

b = (Σ yi x1i - n ȳ x̄1) / (Σ x1i² - n x̄1²)        ... (8)
a = ȳ - b x̄1

and the coefficient of determination is

r² = 1 - (Unexplained Variation in y / Total Variation in y) = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²

In the above illustration, the fitted simple regression equation works out to

ŷ = 1.61 + 0.2 x1        ... (9)

with
Total Variation = Σ(yi - ȳ)²
Unexplained Variation = Σ(yi - ŷi)²
Explained Variation = Σ(ŷi - ȳ)²
These quantities can be calculated from the following Table :
Supermarket  yi     xi     yi - ȳ   (yi - ȳ)²   ŷi*     yi - ŷi   (yi - ŷi)²   ŷi - ȳ   (ŷi - ȳ)²
1            5.6    20     -0.31    0.0961      5.61    -0.01     0.0001       -0.3     0.09
2            4.7    15     -1.21    1.4641      4.61     0.09     0.0081       -1.3     1.69
3            5.4    18     -0.51    0.2601      5.21     0.19     0.0361       -0.7     0.49
4            5.5    20     -0.41    0.1681      5.61    -0.11     0.0121       -0.3     0.09
5            5.1    16     -0.81    0.6561      4.81     0.29     0.0841       -1.1     1.21
6            6.8    25      0.89    0.7921      6.61     0.19     0.0361        0.7     0.49
7            5.8    22     -0.11    0.0121      6.01    -0.21     0.0441        0.1     0.01
8            8.2    30      2.29    5.2441      7.61     0.59     0.3481        1.7     2.89
9            5.8    24     -0.11    0.0121      6.41    -0.61     0.3721        0.5     0.25
10           6.2    25      0.29    0.0841      6.61    -0.41     0.1681        0.7     0.49
Sum          59.1   215             8.789                         1.109                 7.7
Average      5.91   21.5
* ŷi = 1.61 + 0.2 xi
It may be noted that the unexplained variation (residual error) is 1.109 when the simple regression
equation (9) of net profit on sales of food items is fitted, but it reduces to 0.0504 when the multiple
regression equation (7) is used, i.e. when one more variable, sales of non-food items (x2), is added.
Also it may be noted that when only one variable, viz. sales of food items, is considered, r² is 0.876,
i.e. 87.6% of the variation in net profit is explained by variation in sales of food items; but when both
the variables, viz. sales of food as well as non-food items, are considered, R² is 0.9943, i.e. 99.43% of
the variation in net profit is explained by variation in these two variables.
2.2 Forecast with a Regression Equation
The multiple regression equation ( 1 ) can be used to forecast the value of the dependent variable at
any point of time, given the values of the independent variables at that point of time. For illustration,
in the above example about net profit, one may be interested in forecasting the net profit for the next
year when the sales of food items are expected to increase to Rs. 30 Crores and the sales of non-food
items are expected to be Rs. 7 Crores. Substituting x1 = 30 and x2 = 7 in equation (7), we get

y = 0.233 + 0.196 × 30 + 0.287 × 7 = 8.122

Thus, the forecast net profit by the end of next year would be about Rs. 8.122 Crores.
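In R, the same forecast can be obtained from the fitted model of the earlier sketch with predict(); the point estimate agrees with the hand calculation above.
predict(fit, newdata = data.frame(food = 30, nonfood = 7))   # about 8.12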
Caution : It is important to note that a regression equation is valid for estimating the
value of the dependent variable only within the range of independent variable(s) or
only slightly beyond the range. However, it can be used even much beyond the range if
no other better option is available, and it is supported by commonsense.
2.3 Correlation Matrix
The multiple correlation coefficient can also be determined with the help of the total correlation
coefficients between all pairs of the dependent and independent variables. If there are three variables
y, x1 and x2, a simple correlation coefficient can be defined between every pair of them; when there
are more than two variables in a study, the simple correlation between any two of them is known as a
total correlation. All nine possible pairs can be presented in the form of a matrix as follows:

         y        x1       x2
y        ryy      ryx1     ryx2
x1       rx1y     rx1x1    rx1x2
x2       rx2y     rx2x1    rx2x2

Since ryy, rx1x1 and rx2x2 are all equal to 1, the matrix can be written as

         y        x1       x2
y        1        ryx1     ryx2
x1       rx1y     1        rx1x2
x2       rx2y     rx2x1    1

Further, since the matrix is symmetric (for example, ryx1 and rx1y are equal), it is sufficient to write
it in the following form:

         y        x1       x2
y        1        ryx1     ryx2
x1                1        rx1x2
x2                         1

For the above example relating to net profit, where the three variables are y, x1 and x2, the
correlation matrix works out to (approximately)

         y        x1       x2
y        1        0.935    0.394
x1                1        0.050
x2                         1
2.4 Adjusted R2
The adjusted multiple coefficient of determination takes into account n (the number of observations)
and k (the number of independent variables) for comparison in two situations, and is calculated as

Adjusted R2 = 1 - [ (n - 1) / (n - k - 1) ] (1 - R2)        ... (10)
where n is the sample size or the number of observations on each of the variables, and
k is the number of independent variables. For the above example,
Adjusted R2 = 1 - [ (10 - 1) / (10 - 2 - 1) ] (1 - 0.9943) = 1 - (9/7)(0.0057) = 0.9927
To start with, when an independent variable is added, i.e. the value of k is increased, the value of
adjusted R2 increases; but when the addition of a further variable does not contribute towards
explaining the variability in the dependent variable, the value of adjusted R2 decreases. This implies
that the addition of that variable is redundant.
The adjusted R2 is less than R2, the gap widening as the number of observations per independent
variable decreases. However, adjusted R2 tends to R2 as the sample size increases for a given number
of independent variables.
Adjusted R2 is useful in comparing two regression equations having different numbers of independent
variables, or having the same number of variables but based on different sample sizes.
As an illustration, in the above example relating to the regression of net profit on sales of food items
and sales of non-food items, R2 is about 0.88 when only sales of food items is taken as the independent
variable to predict net profit, but it increases to 0.99 when the other independent variable, viz. sales of
non-food items, is also taken into consideration; the corresponding adjusted R2 is 0.9927, marginally
lower than R2.
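A minimal R sketch of formula (10) for Example 1 (n = 10 observations, k = 2 independent variables) is given below; lm() reports the same quantity directly.
r2 <- 0.9943; n <- 10; k <- 2
1 - ((n - 1) / (n - k - 1)) * (1 - r2)   # about 0.9927
summary(fit)$adj.r.squared               # fit as in the earlier sketch of Example 1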
2.5 Dummy Variable
So far, we have considered independent variables which are quantifiable and measurable like income,
sales, profit, etc. However, sometimes the independent variables may not be quantifiable and
measurable, and may be only qualitative and categorical, and yet could impact the dependent variable
under study. For example, the amount of an insurance policy a person takes could depend on his/her
marital status, which is categorical, i.e. married or unmarried. The sale of ice-cream might depend on
the season, viz. summer or other seasons. The performance of a candidate at a competitive examination
depends not only on his/her I.Q. but also on the categorical variable coached or un-coached.
Dummy variables are very useful for capturing a variety of qualitative effects by using 0 and 1 to
indicate the two states of a qualitative or categorical variable. The dummy variable is assigned the
value 1 or 0 depending on whether the observation does or does not possess the specified
characteristic. Some examples are male and female, married and unmarried, MBA executives and
non-MBA executives, trained and not trained, advertisement I and advertisement II, or a
sales-promotion strategy such as financial discount versus gift item. Thus, a dummy variable converts
a non-numeric variable into a numeric one. Dummy variables are used as explanatory variables in a
regression equation; they act like switches which turn various parameters on and off in the equation.
Another advantage of a 0-1 dummy variable is that, even though it is a nominal-level variable, it can
be treated statistically like an interval-level variable taking the values 0 and 1. It is a form of coding
that transforms non-metric data into metric data, and it facilitates considering the two levels of an
independent variable separately.
Illustration 2
It is normally expected that a person with high income will purchase life insurance policy for a higher
amount. However, it may be worth examining whether there is any difference in the amounts of
insurance purchased by married & unmarried persons. To answer these queries, an insurance agent
collected the data about the policies purchased by his clients during the last month. The data is as
follows:
Sr. No. of Client   Annual Income (in Thousands of Rs.)   Amount of Stipulated Annual Insurance Premium (in Thousands of Rs.)   Marital Status (Married/Single)
1                   800                                   85                                                                    M
2                   450                                   50                                                                    M
3                   350                                   50                                                                    S
4                   1500                                  140                                                                   S
5                   1000                                  100                                                                   M
6                   500                                   50                                                                    S
7                   250                                   40                                                                    M
8                   60                                    10                                                                    S
9                   800                                   70                                                                    S
10                  1400                                  150                                                                   M
11                  1300                                  150                                                                   M
12                  1200                                  110                                                                   M

Note: The marital status is converted into an independent variable by substituting M by 1 and S by 0
for the purpose of fitting the regression equation.
It may be verified that the multiple regression equation with amount of insurance premium as
dependent variable and income as well as marital status as independent variables is
Premium = 5.27 + 0.091 Income + 8.95 Marital Status
The interpretation of the coefficient 0.091 is that for every additional thousand rupees of income, the
premium increases by 0.091 thousand, i.e. about Rs 91.
The interpretation of the coefficient 8.95, is that a married person takes an additional premium
of Rs 8,950 as compared to a single person.
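Illustration 2 can be reproduced in R as sketched below; the data are keyed in from the table above, with the dummy variable married coded 1 for M and 0 for S (the variable names are chosen here only for readability).
income  <- c(800, 450, 350, 1500, 1000, 500, 250, 60, 800, 1400, 1300, 1200)
premium <- c(85, 50, 50, 140, 100, 50, 40, 10, 70, 150, 150, 110)
married <- c(1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1)   # M = 1, S = 0
fit_dummy <- lm(premium ~ income + married)
coef(fit_dummy)   # approximately 5.27, 0.091 and 8.95, as in the equation above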
2.6 Partial Correlation Coefficients
So far, we have discussed total correlation coefficient and multiple correlation coefficient. In the
above case of net profit planning, we had three variables viz. x1, y and x2. The correlation coefficients
between any two of them, viz. ryx1, ryx2 and rx1x2, are called total correlation coefficients. A total
correlation coefficient indicates the relationship between two variables ignoring the presence or effect
of the third variable. The multiple correlation coefficient Ry.x1x2 indicates the correlation between y
and the estimate of y obtained from the regression equation of y on x1 and x2. The partial correlation
coefficients are defined as the correlation between any two variables when the effect of the third
variable on these two variables is removed, or when the third variable is held constant. For example,
ryx1.x2 means the correlation between y and x1 when the effect of x2 on y and x1 is removed, or x2
is held constant. The various partial correlation coefficients, viz. ryx1.x2, ryx2.x1 and rx1x2.y, are
calculated as

ryx1.x2 = (ryx1 - ryx2 rx1x2) / √[(1 - ryx2²)(1 - rx1x2²)]        ... (11)
ryx2.x1 = (ryx2 - ryx1 rx1x2) / √[(1 - ryx1²)(1 - rx1x2²)]        ... (12)
rx1x2.y = (rx1x2 - ryx1 ryx2) / √[(1 - ryx1²)(1 - ryx2²)]         ... (13)

For the net profit example, the values of these partial correlation coefficients, ryx1.x2, ryx2.x1 and
rx1x2.y, work out to 0.997, 0.977 and -0.973, respectively.
The interpretation of ryx2.x1 = 0.977 is that it indicates the extent of linear correlation between y and
x2 when x1 is held constant, or when its impact on y and x2 is removed.
Similarly, the interpretation of rx1x2.y = -0.973 is that it indicates the extent of linear correlation
between x1 and x2 when y is held constant, or when its impact on x1 and x2 is removed.
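The formulas (11) to (13) can be checked in R with the supermarket variables profit, food and nonfood from the earlier sketch of Example 1; this is only a sketch of the formulas, with cor() supplying the total correlation coefficients.
r_yx1  <- cor(profit, food)
r_yx2  <- cor(profit, nonfood)
r_x1x2 <- cor(food, nonfood)
# (11): correlation of y and x1 with x2 held constant, about 0.997
(r_yx1 - r_yx2 * r_x1x2) / sqrt((1 - r_yx2^2) * (1 - r_x1x2^2))
# (12): correlation of y and x2 with x1 held constant, about 0.977
(r_yx2 - r_yx1 * r_x1x2) / sqrt((1 - r_yx1^2) * (1 - r_x1x2^2))
# (13): correlation of x1 and x2 with y held constant, about -0.97
(r_x1x2 - r_yx1 * r_yx2) / sqrt((1 - r_yx1^2) * (1 - r_yx2^2))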
2.7 Partial Regression Coefficients
The regression coefficients b1 and b2 in the regression equation (1) are known as partial regression
coefficients. The value of b1 indicates the change that will be caused in y with a unit change in x1
when x2 is held constant. Similarly, b2 indicates the amount by which y will change given a unit
change in x2 when x1 is held constant. For illustration, in the regression equation (7), the
interpretation of the value of b1, i.e. 0.196, is that if x2 (sales of non-food items) is held constant,
then for every rise of Rs. 1 Crore in the sales of food items, on an average, the net profit will rise by
Rs. 19.6 lakhs. Similarly, the interpretation of b2, i.e. 0.287, is that if x1 (sales of food items) is held
constant, then for every rise of Rs. 1 Crore in the sales of non-food items, on an average, the net
profit will rise by Rs. 28.7 lakhs.
2.8 Beta Coefficients
If the independent variables are standardised i.e. they are measured from their means and divided by
their standard deviations, then the corresponding regression coefficients are called beta coefficients.
Their advantages, like in simple regression analysis vide Section 8.5.3 are that correlation and
regression between standardised variables solves the problem of dealing with different units of
measurements of the variables. Thus the magnitudes of these coefficients can be used to compare the
relative contribution of each independent variable in the prediction of each dependent variable.
Incidentally, for the data in Illustration 1, relating to sales and net profit of supermarkets, reproduced
below,
Supermarket   Net Profit (Rs. Crores) yi   Sales of Food Items (Rs. Crores) xi   Sales of Non-Food Items zi   xi²     Standardised Y   Standardised X1   Standardised X2
1             5.6                          20                                    5                            400     -0.331           -0.34             -0.09
2             4.7                          15                                    5                            225     -1.291           -1.48             -0.09
3             5.4                          18                                    6                            324     -0.544           -0.80              0.792
4             5.5                          20                                    5                            400     -0.437           -0.34             -0.09
5             5.1                          16                                    6                            256     -0.864           -1.25              0.792
6             6.8                          25                                    6                            625      0.949            0.798             0.792
7             5.8                          22                                    4                            484     -0.117            0.114            -0.97
8             8.2                          30                                    7                            900      2.443            1.937             1.673
9             5.8                          24                                    3                            576     -0.117            0.57             -1.85
10            6.2                          25                                    4                            625      0.309            0.798            -0.97
Sum           59.1                         215                                   51                           4815
Mean          5.91                         21.5                                  5.1
Variance      0.88                         19.25                                 1.29
s.d.          0.937                        4.387                                 1.136

Standardised Variable = (Variable - Mean) / s.d.
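A minimal R sketch of the beta coefficients for these data (profit, food and nonfood as in the earlier sketch of Example 1) is given below; scale() standardises each variable before the regression is refitted.
std <- data.frame(scale(cbind(profit, food, nonfood)))
coef(lm(profit ~ food + nonfood, data = std))
# the food and nonfood coefficients (about 0.92 and 0.35) are the beta coefficients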
If there are two independent variables, sometimes the exclusion of one may result in an abnormal
change in the regression coefficient of the other; sometimes even the sign of the regression coefficient
may change from + to - or vice versa, as demonstrated for the data given below.
y     x1    x2
10    12    25
18    16    21
18    20    22
25    22    18
21    25    17
32    24    15
It may be verified that the correlation between x1 and x2 is about -0.91, which is very high in
magnitude and indicates the existence of multicollinearity.
It may be verified that the regression equation of y on x1 alone is
y = -3.4 + 1.2 x1        ... (i)
and the regression equation of y on x2 alone is
y = 58.0 - 1.9 x2        ... (ii)
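The sign change can be verified with the following R sketch, which keys in the six observations above and fits the two simple regressions as well as the model with both predictors.
y  <- c(10, 18, 18, 25, 21, 32)
x1 <- c(12, 16, 20, 22, 25, 24)
x2 <- c(25, 21, 22, 18, 17, 15)
cor(x1, x2)            # about -0.91: x1 and x2 are highly correlated in magnitude
coef(lm(y ~ x1))       # equation (i): coefficient of x1 about +1.2
coef(lm(y ~ x2))       # equation (ii): coefficient of x2 about -1.9
coef(lm(y ~ x1 + x2))  # with both predictors the coefficient of x1 changes sign (about -0.29)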
variation in the dependent variable is significant. These issues will be explained with examples in
subsequent sections.
2.12 Regression Model with More Than Two Independent Variables:
So far we have discussed only the derivation of the regression equation and the interpretation of the
correlation and regression coefficients. Further, we have confined ourselves to only two independent variables.
However, sometimes, it is advisable to have more than two independent variables. Also, for using the
equation for interpreting and predicting the values of the dependent variable with the help of
independent variables, there are certain assumptions to validate the regression equation. We also have
to test whether all or some of the independent variables are really significant to have an impact on the
dependent variable. In fact, we also have to ensure that only the optimum numbers of variables are
used in the final regression equation. While details will be discussed later on, it may be mentioned for
now that mere increase in the number of independent variables does not ensure better predictive
capability of the regression equation. Each variable has to compete with the others to be included or
retained in the regression equation.
2.13 Selection of Independent Variables
Following are three prime methods of selecting independent variables in a regression model:
General Method
Hierarchical method
Stepwise Method
These are described below.
2.13.1 General Method (Standard Multiple Regression)
In standard multiple regression, all of the independent variables are entered into the regression
equation at the same time.
Multiple R and R² measure the strength of the relationship between the set of independent
variables and the dependent variable. An F test is used to determine if the relationship can be
generalized to the population represented by the sample.
A t-test is used to evaluate the individual relationship between each independent variable and
the dependent variable.
This method is used when a researcher knows exactly which independent variables contribute
significantly in the regression equation. In this method, all the independent variables are considered
together and the regression model is derived.
It is often difficult to identify the exact set of variables that are significant in the regression model and
the process of finding these may have many steps or iterations as explained through an illustration in
the next section. This is the limitation of this method. This limitation can be overcome in the stepwise
regression method.
2.13.2 Hierarchical Method (Hierarchical Multiple Regression)
In hierarchical multiple regression, the independent variables are entered in two stages.
In the first stage, the independent variables that we want to control for are entered into the
regression. In the second stage, the independent variables whose relationship we want to
examine after the controls are entered.
A statistical test of the change in R² from the first stage is used to evaluate the importance of
the variables entered in the second stage.
This method is used when a researcher has clearly identified three different types of variables namely
dependent variable, independent variable/s and the control variable/s.
This method helps the researcher to find the relationship between the independent variables and the
dependent variable, in the presence of some variables that are controlled in the experiment. Such
variables are termed as control variables. The control variables are first entered in the hierarchy, and
then the independent variables are entered. This method is available in most statistical software
including SPSS.
2.13.3 Stepwise Method (Stepwise Multiple regression)
Stepwise regression is designed to find the most parsimonious set of predictors that are most
effective in predicting the dependent variable.
Variables are added to the regression equation one at a time, using the statistical criterion of
maximizing the R² of the included variables.
When none of the possible additions can make a statistically significant improvement in R², the
analysis stops.
This method is used when a researcher wants to find out, which independent variables significantly
contribute in the regression model, out of a set of independent variables. This method finds the best
fit model, i.e. the model which has a set of independent variables that contribute significantly in the
regression equation.
For example, if a researcher has identified three independent variables that may affect the dependent
variable and wants to find the best combination of these variables that contributes significantly in the
regression model, the researcher may use stepwise regression. The software gives the exact set of
variables that contribute, i.e. are worth keeping in the model.
There are three popular stepwise regression approaches, namely forward selection, backward
elimination and stepwise regression. In forward selection, one independent variable is entered along
with the dependent variable and the regression equation is derived, along with the other tests
(ANOVA, t-tests, etc.). In the next iteration one more independent variable is added and the new
model is compared with the previous one. If the new variable contributes significantly to the model, it
is kept; otherwise it is dropped from the model. This process is repeated for each remaining
independent variable, thus arriving at a significant model containing all the contributing independent
variables. The backward method is exactly the opposite: initially all the variables are considered, and
they are removed one by one if they do not contribute to the model.
The stepwise regression method is a combination of the forward selection and backward elimination
methods. The basic difference between this and the other two methods is that in this method, even if a
variable is selected in the beginning or gets selected subsequently, it has to keep on competing with
the other entering variables at every stage to justify its retention in the equation.
These steps are explained in the next Section, with an example.
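In R, the three approaches can be sketched with the step() function, assuming a data frame named dat with a dependent variable y and candidate predictors x1, x2 and x3 (these names are placeholders, not from the text). Note that step() uses the AIC criterion rather than the F-based criterion described above, so it may not always retain exactly the same variables as SPSS's stepwise method.
full <- lm(y ~ x1 + x2 + x3, data = dat)
null <- lm(y ~ 1, data = dat)
step(null, scope = formula(full), direction = "forward")   # forward selection
step(full, direction = "backward")                         # backward elimination
step(null, scope = formula(full), direction = "both")      # stepwise selection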
2.14 Selection of Independent Variables in a Regression Model
Whenever we have several independent variables which influence a dependent variable, an issue that
arises is whether it is worthwhile to retain all the independent variables, or to include only some of the
variables which have relatively more influence on the dependent variable as compared to the others.
There are several methods to select the most appropriate or significant variables out of the given set of
variables. Here, we describe one of the methods, using the adjusted R2 as the selection criterion. The
method is illustrated with the help of the data given in the following example.
Example 2
The following Table gives certain parameters about some of the top rated companies in the ET 500
listings published in the issue of February 2006
Sr. No.   Company                      M-Cap Oct 05        Net Sales Sept 05    Net Profit Sept 05   P/E as on Oct 31, 05
                                       Amount     Rank     Amount     Rank      Amount      Rank     Amount     Rank
1         INFOSYS TECHNOLOGIES         68560      3        7836       29        2170.9      10       32         66
2         TATA CONSULTANCY SERVICES    67912      4        8051       27        1831.4      11       30         74
3         WIPRO                        52637      7        8211       25        1655.8      13       31         67
4         BHARTI TELE-VENTURES *       60923      5        9771       20        1753.5      12       128        3
5         ITC                          44725      9        8422       24        2351.3      8        20         183
6         HERO HONDA MOTORS            14171      24       8086       26        868.4       32       16         248
7         SATYAM COMPUTER SERVICES     18878      19       3996       51        844.8       33       23         132
8         HDFC                         23625      13       3758       55        1130.1      23       21         154
9         TATA MOTORS                  18881      18       18363      10        139         17       14         304
10        SIEMENS                      7848       49       2753       75        254.7       80       38         45
11        ONGC                         134571     1        37526      5         14748.1     1        9          390
12        TATA STEEL                   19659      17       14926      11        3768.6      5        5          469
13        STEEL AUTHORITY OF INDIA     21775      14       29556      7         6442.8      3        3          478
14        NESTLE INDIA                 8080       48       2426       85        311.9       75       27         99
15        BHARAT FORGE CO.             6862       55       1412       128       190.5       97       37         48
16        RELIANCE INDUSTRIES          105634     2        74108      2         9174.0      2        13         319
17        HDFC BANK                    19822      16       3563       58        756.5       37       27         98
18        BHARAT HEAVY ELECTRICALS     28006      12       11200      17        1210.1      21       25         116
19        ICICI BANK                   36890      10       11195      18        2242.4      9        16         242
20        MARUTI UDYOG                 15767      22       11601      16        988.2       26       17         213
21        SUN PHARMACEUTICALS          11413      29       1397       130       412.2       66       30         75
* The data about Bharti Tele-Ventures is not considered for analysis because its P/E ratio is
exceptionally high.
In the above example, we take market Capitalisation as the dependent variable, and Net Profit, P/E
Ratio and Net Sales as independent variables.
We may add that this example is to be viewed as an illustration of selection of optimum number of
independent variables, and not the concept of financial analysis.
The notations used for the variables are as follows.
Y  : Market Capitalisation
x1 : Net Sales
x2 : Net Profit
x3 : P/E Ratio
Step I :
First of all, we calculate the total correlation coefficients among all the independent variables, as also
the correlation coefficients of the dependent variable with each of them. These are tabulated below.
              Net Sales   Net Profit   P/E Ratio
Net Sales     1.0000
Net Profit    0.7978      1.0000
P/E Ratio     -0.5760     -0.6004      1.0000
Market Cap    0.6874      0.8310       -0.2464
We note that the correlation of y with x2 is highest. We therefore start by taking only this variable in
the regression equation.
Step II :
The regression equation of y on x2 is
y = 15465 + 7.906 x2
The values of R2 and adjusted R2 are: R2 = 0.6906, adjusted R2 = 0.6734.
Step III :
Now, we derive two regression equations, one by adding x1 and one by adding x3, to see which
combination, viz. x2 and x1 or x2 and x3, is better.
The regression equation of y on x2 and x1 is
Y = 14989 + 7.397 x2 + 0.135 x1
The values of R2 and adjusted R2 are: R2 = 0.6922, adjusted R2 = 0.656.
The regression equation of y on x2 and x3 is
Y = -19823 + 10.163 x2 + 1352.4 x3
The values of R2 and adjusted R2 are: R2 = 0.7903, adjusted R2 = 0.7656.
Since the adjusted R2 for the combination x2 and x3 (0.7656) is higher than the adjusted R2 for the
combination x2 and x1 (0.656), we select x3 as the additional variable along with x2.
It may also be noted that R2 with variables x2 and x3 (0.7903) is more than the value of R2 with only
x2 (0.6906). Thus it is advisable to have x3 along with x2 in the model.
Step IV :
Now we include the last variable, viz. x1, to have the model
Y = b0 + b1x1 + b2x2 + b3x3
The requisite calculations are too cumbersome to be carried out manually, and, therefore, we use an
Excel spreadsheet, which yields the following regression equation:
Y = -23532 + 0.363 x1 + 8.95 x2 + 1445.6 x3
The values of R2 and adjusted R2 are: R2 = 0.8016, adjusted R2 = 0.7644.
It may be noted that the inclusion of x1 in the model has very marginally increased the value of R2
from 0.7903 to 0.8016, but the adjusted R2 has come down from 0.7656 to 0.7644. Thus it is not
worthwhile to add the variable x1 to the regression model having the variables x2 and x3.
Step V :
The advisable regression model is, therefore, the one including only x2 and x3:
Y = -19823 + 10.163 x2 + 1352.4 x3        ... (14)
This is the best regression equation fitted to the data on the basis of the adjusted R2 criterion, as
discussed above. We have used the same example to illustrate the method using SPSS in Section 2.17.
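Steps II to V can be sketched in R as below, assuming the table above has been read into a data frame named et500 with columns mcap, sales, profit and pe for the 20 companies retained in the analysis (these column names are placeholders, not from the text).
m2   <- lm(mcap ~ profit,              data = et500)   # Step II
m21  <- lm(mcap ~ profit + sales,      data = et500)   # Step III, adding x1
m23  <- lm(mcap ~ profit + pe,         data = et500)   # Step III, adding x3
m123 <- lm(mcap ~ profit + pe + sales, data = et500)   # Step IV, all three
sapply(list(m2, m21, m23, m123), function(m) summary(m)$adj.r.squared)
# the model with the highest adjusted R-squared (net profit and P/E ratio) is retained, as in Step V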
2.15 Generalised Regression Model
In general, a regression equation, also referred to as a model, is written as
yi = b0 + b1x1i + b2x2i + b3x3i + ... + bkxki + ei        ... (15)
where there are k independent variables, viz. x1, x2, x3, ..., xk, and ei is the error or residual term,
i.e. the part of yi not explained by the regression model.
2.15.1 Assumptions for the Multiple Regression Model
There are certain assumptions about the error terms that ought to hold good for the regression
equation to be useful for drawing conclusions from it or using it for prediction purposes.
These are :
(i) The distribution of ei s is normal
The implication of this assumption is that the errors are symmetrical with both positive and
negative values.
(ii) E (ei) = 0
This assumption implies that the sum of positive and negative errors is zero, and thus they
cancel out each other.
(iii) Var (ei) = σ² for all values of i
This assumption means that the variance or fluctuations in all error terms are of the same
magnitude (homoscedasticity).
Heteroscedasticity often occurs when there is a large difference among the sizes of the observations.
The classic example of heteroscedasticity is that of income versus food consumption. As one's
income increases, the variability of food consumption will increase: a poorer person will spend a
rather constant amount by always eating essential food, while a wealthier person may, in addition to
essential food, occasionally spend on an expensive meal. Those with higher incomes display a greater
variability of food consumption.
Ideally, residuals are randomly scattered around 0 (the horizontal line) providing a relatively
even distribution
Heteroscedasticity is indicated when the residuals are not evenly scattered around the line.
The traditional test for the presence of first-order autocorrelation is the Durbin-Watson statistic.
Other than the above assumptions, regression analysis also requires that the independent variables
should not be related to each other; in other words, there should not be any multicollinearity.
Indicators that multicollinearity may be present in a model:
1) Large changes in the estimated regression coefficients when a predictor variable is added or
deleted
2) Insignificant regression coefficients for the affected variables in the multiple regression, but
a rejection of the hypothesis that those coefficients are insignificant as a group (using a F-test)
3) Large changes in the estimated regression coefficients when an observation is added or
deleted
Formal detection criteria are the tolerance and the variance inflation factor (VIF): for the j-th
independent variable, Tolerance(j) = 1 - Rj² and VIF(j) = 1 / (1 - Rj²), where Rj² is obtained by
regressing that independent variable on all the other independent variables. Some remedies for
multicollinearity are:
2) Drop one of the variables. An explanatory variable may be dropped to produce a model
with significant coefficients. However, you lose information (because you've dropped a
variable). Omission of a relevant variable results in biased coefficient estimates for the
remaining explanatory variables.
3) Obtain more data. This is the preferred solution. More data can produce more precise
parameter estimates (with lower standard errors).
Note: Multicollinearity does not impact the reliability of the forecast, but rather impacts the
interpretation of the explanatory variables. As long as the collinear relationships in your
independent variables remain stable over time, multicollinearity will not affect the forecast.
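Tolerance and VIF can be computed directly from their definitions, as in the following R sketch for a model with predictors x1, x2 and x3 in a data frame dat (placeholder names); each predictor is regressed on the remaining ones.
r2_x1        <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
tolerance_x1 <- 1 - r2_x1
vif_x1       <- 1 / tolerance_x1   # values above about 10 signal serious multicollinearity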
2.16 Applications in Finance
In this section, we indicate financial applications of regression analysis in some aspects relating to
stock market.
(i) Individual Stock Rates of Return, Payout Ratio, and Market Rates of Return
Let the relationship of the rate of return of a stock with the payout ratio, defined as the ratio of
dividend per share to earnings per share, and with the rate of return on BSE SENSEX stocks as a
whole, be
y = b0 + b1 (payout ratio) + b2 (rate of return on Sensex)
Let us assume that the relevant data, collected over a period of the last 10 years, yields the following
equation:
y = 1.23 - 0.22 (payout ratio) + 0.49 (rate of return on Sensex)
The coefficient -0.22 indicates that for a 1% increase in the payout ratio, the return on the stock
reduces by 0.22% when the rate of return on the SENSEX is held constant. Further, the coefficient
0.49 implies that for a 1% increase in the rate of return on the BSE SENSEX, the return on the stock
increases by 0.49% when the payout ratio is held constant.
Further, let the calculations yield the value of R2 as 0.66.
The value of R2 = 0.66 implies that 66% of variation in the rate of return on the investment in the
stock is explained by pay-out ratio and the return on BSE SENSEX.
(ii) Determination of Price per Share
To further demonstrate the application of multiple regression techniques, let us assume that a
cross-section regression equation is fitted with the dependent variable being the price per share (y) of
the 30 companies used to compile the SENSEX, and the independent variables being the dividend per
share (x1) and the retained earnings per share (x2) for the 30 companies. As mentioned earlier, in a
cross-section regression, all data come from a single period.
Let us assume that the relevant data is available, and the data is collected for SENSEX stocks in a
year, yield the following regression equation
y = 25.45 + 15.30 x1 + 3.55 x2
The regression equation could be used for interpreting regression coefficients and predicting average
price per share given the values of dividend paid and earnings retained.
The coefficient 15.30 of x1 (dividend per share) indicates that the average price per share increases
by Rs. 15.30 when the dividend per share increases by Re. 1, with the retained earnings per share
held constant.
The regression coefficient 3.55 of x2 means that when the retained earnings increases by Rs. 1.00, the
price per share increases by Rs 3.55 when dividend per share is held constant.
The use of multiple regression analysis in carrying out cost analysis was demonstrated by Bentsen in
1966. He collected data from a firm's accounting, production and shipping records to establish a
multiple regression equation.
Terms in Regression Analysis
Explained variance = R2 (coefficient of determination).
Unexplained variance = residuals (error).
Adjusted R-Square = reduces the R2 by taking into account the sample size and the number
of independent variables in the regression model (It becomes smaller as we have fewer
observations per independent variable).
Standard Error of the Estimate (SEE) = a measure of the accuracy of the regression
predictions. It estimates the variation of the dependent variable values around the regression
line. It should get smaller as we add more independent variables, if they predict well.
Total Sum of Squares (SST) = total amount of variation that exists to be explained by the
independent variables. TSS = the sum of SSE and SSR.
Sum of Squared Errors (SSE) = the variance in the dependent variable not accounted for by
the regression model = residual. The objective is to obtain the smallest possible sum of
squared errors as a measure of prediction accuracy.
Sum of Squares Regression (SSR) = the amount of improvement in explanation of the
dependent variable attributable to the independent variables.
Variance Inflation Factor (VIF) measures how much the variance of the regression
coefficients is inflated by multicollinearity problems. If VIF equals 1, there is no correlation
between that independent variable and the other independent measures. A VIF somewhat above 1
indicates some association between the predictor variables, but generally not enough to cause
problems. A maximum acceptable VIF value would be 10; anything higher would indicate a
problem with multicollinearity.
Tolerance = the amount of variance in an independent variable that is not explained by the
other independent variables. If the other variables explain a lot of the variance of a
particular independent variable, we have a problem with multicollinearity. Thus, small values
of tolerance indicate problems of multicollinearity. The minimum cutoff value for tolerance
is typically .10; that is, a tolerance value smaller than .10 indicates a problem of
multicollinearity.
The next dialog box that appears is shown in SPSS Snapshot MRA 2: the dependent variable is
entered in the Dependent box (1) and the independent variables in the Independent(s) box (2). The
next step is to click on the Statistics button at the bottom of the box; clicking it brings up the box
shown in SPSS Snapshot MRA 3.
Descriptive Statistics
                      Mean        Std. Deviation   N
m_cap_amt_oct05       36285.80    34367.670        20
Net_Sal_sept05        13419.30    17025.814        20
Net_Prof_sept05       2633.38     3612.171         20
peratio_oct05         21.70       10.037           20

Correlations (N = 20 for each pair)
Pearson Correlation   m_cap_amt_oct05   Net_Sal_sept05   Net_Prof_sept05   peratio_oct05
m_cap_amt_oct05       1.000             .687             .831              -.246
Net_Sal_sept05        .687              1.000            .798              -.576
Net_Prof_sept05       .831              .798             1.000             -.600
peratio_oct05         -.246             -.576            -.600             1.000

Sig. (1-tailed)       m_cap_amt_oct05   Net_Sal_sept05   Net_Prof_sept05   peratio_oct05
m_cap_amt_oct05       .                 .000             .000              .148
Net_Sal_sept05        .000              .                .000              .004
Net_Prof_sept05       .000              .000             .                 .003
peratio_oct05         .148              .004             .003              .
The part and partial correlations, along with the correlation matrix, are useful in understanding the
relationships between the independent and dependent variables. The regression coefficients can be
interpreted reliably only if the independent variables are not strongly related to each other; if they
are, the interpretation of the regression equation may be misleading. This is termed multicollinearity,
and its impact is described in Section 2.10. The above correlation matrix is useful for checking the
inter-relationships between the independent variables. In the above table, the correlations of the
independent variables net sales and net profit with the dependent variable are reasonably high (0.687
and 0.831), which means that these variables are related to market capitalisation. At the same time,
the correlations among the independent variables themselves (0.798, -0.576 and -0.600) are fairly
high, which means that this data may have multicollinearity. Generally, very high correlations
between the independent variables, like more than 0.9, may make the regression coefficients
unreliable to interpret.
Variables Entered/Removed(b)
Model   Variables Entered                                   Variables Removed   Method
1       peratio_oct05, Net_Sal_sept05, Net_Prof_sept05(a)   .                   Enter

Since the method selected was the Enter (General) method, this table does not communicate any
additional information.
Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .895a   .802       .764                16682.585                    .982
a. Predictors: (Constant), peratio_oct05, Net_Sal_sept05, Net_Prof_sept05
b. Dependent Variable: m_cap_amt_oct05
This table gives the model summary for the set of independent and dependent variables. R² for the
model is 0.802, which is high and means that around 80% of the variation in the dependent variable
(market capitalisation) is explained by the three independent variables (net sales, net profit and P/E
ratio). The Durbin-Watson statistic for this model is 0.982, which is very low; the generally accepted
range is 1.5 to 2.5. It may, therefore, be added as a caution that the assumption that the residuals
are uncorrelated does not appear to be valid.
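The Durbin-Watson statistic can be computed from the residuals of any fitted regression, as in the minimal R sketch below (fit is an lm object; the ordering of the observations matters for this statistic).
e <- resid(fit)
sum(diff(e)^2) / sum(e^2)   # values near 2 indicate uncorrelated residuals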
ANOVA(b)
Model 1      Sum of Squares   df   Mean Square    F        Sig.
Regression   1.80E+10         3    5996219888     21.545   .000a
Residual     4.45E+09         16   278308655.0
Total        2.24E+10         19
The ANOVA table for the regression analysis indicates whether the model is significant, and valid or
not. The ANOVA is significant, if the sig column in the above table is less than the level of
significance (generally taken as 5% or 1%). Since 0.000 < 0.01, we conclude that this model is
significant.
If the model is not significant, it implies that no relationship exists between the set of variables.
Coefficients(a)
                   Unstandardized Coefficients      Standardized
Model 1            B            Std. Error          Beta           t        Sig.    Zero-order   Partial   Part
(Constant)         -23531.5     13843.842                          -1.700   .109
Net_Sal_sept05     .363         .381                .180           .953     .355    .687         .232      .106
Net_Prof_sept05    8.954        1.834               .941           4.882    .000    .831         .774      .544
peratio_oct05      1445.613     486.760             .422           2.970    .009    -.246        .596      .331
This table gives the regression coefficients and their significance. The regression equation can be written as:
Market capitalization = -23531.5 + 0.363 Net Sales + 8.954 Net Profit + 1445.613 P/E Ratio
It may be noted that the above equation is given only for the purpose of understanding how to derive the equation from the table. In this example, although the regression coefficients of Net Profit and P/E ratio are significant, that of Net Sales is not; since not all three coefficients are significant, this equation should not be used for estimating market capitalization.
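A small sketch of how the equation in the table would be used, with purely hypothetical input values for a firm; the coefficients are those reported above.

# Sketch: plugging hypothetical values of the predictors into the fitted equation.
b        <- c(constant = -23531.5, net_sales = 0.363,
              net_profit = 8.954, pe_ratio = 1445.613)   # coefficients from the table
new_firm <- c(1, 5000, 900, 18)   # 1 for the constant; hypothetical Net Sales, Net Profit, P/E
sum(b * new_firm)                 # estimated market capitalization (for illustration only)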
Residuals Statistics

                                     Minimum      Maximum        Mean      Std. Deviation    N
Predicted Value                     10307.05     135150.28     36285.80      30769.653       20
Std. Predicted Value                   -.844         3.213         .000          1.000       20
Standard Error of Predicted Value    4242.990     16055.796     6689.075      3390.080       20
Adjusted Predicted Value             9759.82     139674.61     36258.65      30055.284       20
Residual                           -27441.7       28755.936        .000      15308.990       20
Std. Residual                          -1.645        1.724         .000           .918       20
Stud. Residual                         -1.879        1.809        -.009           .999       20
Deleted Residual                   -35824.0       31681.895       27.147     18618.671       20
Stud. Deleted Residual                 -2.061        1.964        -.021          1.060       20
Mahal. Distance                          .279       16.649        2.850          4.716       20
Cook's Distance                          .000         .277         .059           .091       20
Centered Leverage Value                  .015         .876         .150           .248       20

Charts
(Histogram of the standardized residuals: Mean = -1.67E-16, Std. Dev. = 0.918, N = 20.)
The above chart is used to check the validity of the assumption that the residuals are normally distributed. Looking at the chart, one may conclude that the residuals are more or less normal; this can be tested formally using a chi-square goodness-of-fit test.
Since not all three regression coefficients are significant, the Enter method model should not be used for estimation. It is advisable to use stepwise regression in this case, since it gives the most parsimonious set of variables in the equation.
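In R, a comparable (though not identical) stepwise search can be run with step(), which uses AIC rather than SPSS's F-to-enter/F-to-remove criteria; this sketch again assumes the hypothetical data frame stocks used earlier.

# Sketch: stepwise selection starting from the full model.
full_model <- lm(m_cap_amt_oct05 ~ Net_Sal_sept05 + Net_Prof_sept05 + peratio_oct05,
                 data = stocks)
step_model <- step(full_model, direction = "both")
summary(step_model)   # see which predictors are retained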
Stepwise Method for Entering Variables - SPSS Output
It may be noted that, as Bharti Telecom has an exceptionally high P/E ratio, it is omitted from the analysis.
Regression

Descriptive Statistics
                        Mean       Std. Deviation    N
m_cap_amt_oct05       36285.80       34367.670       20
Net_Sal_sept05        13419.30       17025.814       20
Net_Prof_sept05        2633.38        3612.171       20
peratio_oct05            21.70          10.037       20
Correlations

Pearson Correlation
                      m_cap_amt_oct05   Net_Sal_sept05   Net_Prof_sept05   peratio_oct05
  m_cap_amt_oct05          1.000             .687              .831            -.246
  Net_Sal_sept05            .687            1.000              .798            -.576
  Net_Prof_sept05           .831             .798             1.000            -.600
  peratio_oct05            -.246            -.576             -.600            1.000

Sig. (1-tailed)
  m_cap_amt_oct05            .               .000              .000             .148
  Net_Sal_sept05            .000              .                .000             .004
  Net_Prof_sept05           .000             .000               .               .003
  peratio_oct05             .148             .004              .003              .

N = 20 for every pair of variables.

Variables Entered/Removed

Model   Variables Entered    Variables Removed   Method
1       Net_Prof_sept05              .           Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
2       peratio_oct05                .           Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
This table gives the summary of the entered variables in the model.
Model Summary

Model     R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .831a       .691           .673                  19641.726
2       .889b       .790           .766                  16637.346                 1.112
In the previous (Enter) method there was only one model. Since this is the stepwise method, the output gives the model that is significant at each step. The Durbin-Watson statistic has improved over the previous model but is still below the desired range (1.5 to 2.5). The last model is generally the best; this can be verified from the adjusted R2 - the model with the highest adjusted R2 is the best. Model 2, which consists of the dependent variable market capitalization and the independent variables Net Profit and P/E ratio, is therefore the best model.
It may be noted that this model, although it does not contain the independent variable Net Sales, is slightly better than the model obtained earlier with the Enter (General) method: the previous model's adjusted R2 was 0.764, whereas this model's adjusted R2 is 0.766.
The following table gives the ANOVA for each of the iterations (in this case 2); both are significant. It is followed by the table of coefficients, from which the best model's equation can be read.
ANOVA

Model             Sum of Squares   df   Mean Square      F        Sig.
1  Regression       1.55E+10        1   1.550E+10       40.169    .000a
   Residual         6.94E+09       18   385797411.3
   Total            2.24E+10       19
2  Regression       1.77E+10        2   8867988179      32.037    .000b
   Residual         4.71E+09       17   276801281.6
   Total            2.24E+10       19
Coefficients

Model                 Unstandardized Coefficients    Standardized        t       Sig.       Correlations
                         B           Std. Error      Coefficients Beta                 Zero-order  Partial   Part
1  (Constant)         15465.085       5484.681                           2.820   .011
   Net_Prof_sept05        7.906          1.247             .831          6.338   .000       .831      .831   .831
2  (Constant)        -19822.8        13249.405                          -1.496   .153
   Net_Prof_sept05       10.163          1.321            1.068          7.691   .000       .831      .881   .854
   peratio_oct05       1352.358        475.527             .395          2.844   .011      -.246      .568   .316
It may be noted in the above Table that the values of the constant and regression coefficients are the
same as in the equation (14), derived manually. The SPSS stepwise regression did this automatically,
and the results we got are the same.
The following table gives a summary of the excluded variables in the two models.
Excluded Variables

Model                    Beta In      t       Sig.    Partial Correlation   Collinearity Statistics (Tolerance)
1  Net_Sal_sept05         .067a       .301    .767           .073                        .363
   peratio_oct05          .395a      2.844    .011           .568                        .639
2  Net_Sal_sept05         .180b       .953    .355           .232                        .349
There may be situations in which a researcher would like to divide the data into two parts, using one part to derive the model and the other part to validate it. SPSS allows the data to be split into two groups, termed the estimation group and the validation group. The estimation group is used to fit the model, which is then validated on the validation group; this improves confidence in the validity of the model. This process is called cross-validation. It can be used only if the data set is large enough to fit the model on a subset. The random-variable functions in SPSS can be used to select the cases randomly from the SPSS file.
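A minimal sketch of such a cross-validation in R, assuming the same hypothetical stocks data frame; the 70/30 split proportion is an arbitrary choice.

# Sketch: estimation/validation split and a simple check of out-of-sample fit.
set.seed(2015)
in_est     <- sample(nrow(stocks), size = round(0.7 * nrow(stocks)))
estimation <- stocks[in_est, ]
validation <- stocks[-in_est, ]

cv_fit <- lm(m_cap_amt_oct05 ~ Net_Prof_sept05 + peratio_oct05, data = estimation)
pred   <- predict(cv_fit, newdata = validation)
cor(pred, validation$m_cap_amt_oct05)^2   # squared correlation on the validation group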
3 Discriminant Analysis
Discriminant analysis is basically a classification technique, used for classifying a given set of objects, individuals or entities into two (or more) groups or categories on the basis of data about their characteristics. It is the process of deriving an equation, called the discriminant function, giving the relationship between one dependent variable which is categorical (i.e. it takes only two values, say yes or no, represented by 1 or 0) and several independent variables which are continuous. The independent variables selected for the analysis are those which contribute towards classifying an object, individual or entity into one of the two categories. For example, with the help of several financial indicators, one may decide whether or not to extend credit to a company. The classification could also be into more than two categories.
Identifying the set of variables that discriminates best between the two groups is the first step in discriminant analysis. These variables are called discriminating variables.
One of the simplest examples of a discriminating variable is height in the case of a class of graduate students. Let there be a class of 50 students comprising boys and girls. Suppose we are given only their roll numbers, and we are required to classify them by sex, i.e. segregate boys from girls. One alternative is to take height as the variable and premise that all those with height equal to or more than 5'6" are boys and those below that height are girls. This classification should work well except in those cases where girls are taller than 5'6" or boys are shorter than that height. In fact, one could work out, from a large sample of students, the most appropriate value of the discriminating height. This example illustrates one fundamental aspect of discriminant analysis: in real life we cannot find discriminating variable(s) or a function that provides 100% accurate discrimination or classification; we can only attempt to find the best classification from a given set of data. Yet another example is the variable marks (percentage or percentile) in an examination, which is used to classify students into two or more categories. As is well known, even marks cannot guarantee 100% accurate classification.
Discriminant analysis is used to analyze relationships between a non-metric dependent variable and
metric or dichotomous (Yes / No type or Dummy) independent variables. Discriminant analysis uses
the independent variables to distinguish among the groups or categories of the dependent variable.
The discriminant model can be valid or useful only if it is accurate. The accuracy of the model is
measured on the basis of its ability to predict the known group memberships in the categories of the
dependent variable.
Discriminant analysis works by creating a new variable called the discriminant function score
which is used to predict to which group a case belongs. The computations find the coefficients for the
independent variables that maximize the measure of distance between the groups defined by the
dependent variable.
The discriminant function is similar to a regression equation in which the independent variables are
multiplied by coefficients and summed to produce a score.
The general form of the discriminant function is:
D = b0 + b1X1 + b2X2 + ... + bkXk        (16)
where
D  = discriminant score
bi = discriminant coefficients or weights
Xi = independent variables
The weights bi are calculated using the criterion that the groups should differ as much as possible on the discriminant function scores.
If the dependent variable has only two categories, the analysis is termed (two-group) discriminant analysis. If the dependent variable has more than two categories, the analysis is termed Multiple Discriminant Analysis.
In case of multiple discriminant analysis, there will be more than one discriminant function. If the
dependent variable has three categories like, high risk, medium risk, low risk, there will be two
discriminant functions. If dependent variable has four categories, there will be three discriminant
functions. In general, the number of discriminant functions is one less than the number of categories
of the dependent variable.
It may be noted that in case of multiple discriminant functions, each function needs to be significant
to conclude the results.
The following illustrations explain the concepts and the technique of deriving a Discriminant
function, and using it for classification. The objective in this example is to explain the concepts in a
popular manner without mathematical rigour.
Illustration 3
Suppose, we want to predict whether a science graduate, studying inter alia the subjects of Physics
and Mathematics, will turn out to be a successful scientist or not. Here, it is premised that the
performance of a graduate in Physics and Mathematics, to a large extent, contributes in shaping a
successful scientist. The next step is to select some successful and some unsuccessful scientists, and
record the marks obtained by them in Mathematics and Physics in their graduate examination. While in a real-life application we would have to select a sufficient number of students, say 10 or more, in both categories, for the sake of simplicity let the data on two successful and two unsuccessful scientists be as follows:

                     Successful Scientists                    Unsuccessful Scientists
                 Marks in            Marks in             Marks in            Marks in
                 Mathematics (M)     Physics (P)          Mathematics (M)     Physics (P)
Scientist 1            12                 8                     5                  9
Scientist 2             8                10                    11                  7
Average            MS = 10            PS = 9                MU = 8              PU = 8

S : Successful        U : Unsuccessful

It may be mentioned that marks such as 8, 10, 12, etc. are taken just for the sake of ease in calculation.
The discriminant function assumed is
Z = w1 M + w2 P
The requisite calculations on the above data yield
w1 = 9
and w2 = 23
ZC = [(9 MS + 23 PS) + (9 MU + 23 PU)] / 2
   = [(9 x 10 + 23 x 9) + (9 x 8 + 23 x 8)] / 2
   = 276.5
This cutoff score helps us to predict whether a graduate student will turn out to be a successful scientist or not. The discriminant score for the two successful scientists is 292 and 302, both being more than the cutoff score of 276.5, while the score for the two unsuccessful scientists is 252 and 260, both being less than 276.5. If a young graduate gets 11 marks in Mathematics and 9 marks in Physics, his score as per the discriminant function is 9 x 11 + 23 x 9 = 306. Since this is more than the cutoff score of 276.5, we can predict that this graduate will turn out to be a successful scientist. This is depicted pictorially in the following diagram.
(Scatter diagram: Marks in Mathematics on the horizontal axis and Marks in Physics on the vertical axis, showing the successful scientists at (12, 8) and (8, 10), the unsuccessful scientists at (5, 9) and (11, 7), the new graduate at (11, 9), and the discriminant line separating the "Successful" region from the "Unsuccessful" region.)
It may be noted that the scores of both the successful scientists lie above the discriminant line and the scores of both the unsuccessful scientists lie below it. The student with the assumed marks is classified in the category of successful scientists.
This example illustrates that, with the help of past data about objects (including entities, individuals, etc.) and their classification into two categories, one can derive the discriminant function and the cutoff discriminant score. Subsequently, if the same type of data is given for some other object, its discriminant score can be worked out and the object thus classified into one of the two categories.
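The classification rule of the illustration can be written as a couple of lines of R; the weights (9 and 23) and the cutoff (276.5) are the ones derived above.

# Sketch: discriminant score and classification for the scientist example.
classify_scientist <- function(maths, physics, cutoff = 276.5) {
  z <- 9 * maths + 23 * physics          # discriminant score Z = w1*M + w2*P
  ifelse(z > cutoff, "Successful", "Unsuccessful")
}
classify_scientist(11, 9)                # score 306 > 276.5, hence "Successful"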
3.1 Some Other Applications of Discriminant Analysis
Some other applications of discriminant analysis are given below.
(i) Based on past data available for a number of firms on
    the current ratio (defined as Current Assets / Current Liabilities), and
    the debt/asset ratio (defined as Total Debt / Total Assets),
together with the information on whether each firm succeeded or failed, a discriminant function could be derived which discriminates between successful and failed firms on the basis of their current ratio and debt/asset ratio.
As an example, the discriminant function in a particular case could be
Z = 0.5 + 1.2 x1 - 0.07 x2
where
x1 represents the current ratio, and
x2 represents the debt/asset ratio.
The function could then be used to decide whether or not to sanction credit to an approaching firm, based on its current ratio and debt/asset ratio.
(ii)
Wilks' Lambda
Wilks' Lambda (λ) lies between 0 and 1. Large values of λ indicate that there is little or no difference in the group means for the independent variable, while small values of λ indicate that the group means differ. The smaller the value of λ, the greater the discriminating power of the variable between the groups.
(2) If the F test shows significance, then the individual independent variables are assessed to see which of them differ significantly (in mean) by group, and these are subsequently used to classify the dependent variable.
3.3 Key Terms Related to Discriminant Analysis
Some key terms related to discriminant analysis are described below
Functions at Group Centroids - The mean discriminant scores for each category of the dependent variable on each of the discriminant functions in MDA. Two-group discriminant analysis has two centroids, one for each group. We want the means to be well apart, to show that the discriminant function is clearly discriminating; the closer the means, the more errors of classification there are likely to be.

Discriminant function plots - Also called canonical plots; these can be created with two of the discriminant functions as the two axes (the dimensional meaning of each axis being determined by looking at the structure coefficients; see Structure Correlations), and circles within the plot locating the centroids of each category being analysed. The farther apart one point is from another on the plot, the more the dimension represented by that axis differentiates those two groups.

(Model) Wilks' lambda - Used to test the significance of the discriminant function as a whole. The "Sig." level for this test is the significance level of the discriminant function as a whole. The researcher wants a finding of significance; the smaller the lambda, the more likely it is to be significant. A significant lambda means one can reject the null hypothesis that the two groups have the same mean discriminant function scores and conclude that the model is discriminating.

Overall F test - Another overall test of the DA model. It is an F test, where a "Sig." p value < .05 means the model differentiates discriminant scores between the groups significantly better than chance (i.e. better than a model with just the constant).

Wilks' lambda for individual variables - Can be used to test which independent variables contribute significantly to the discriminant function. The smaller the value of Wilks' lambda for an independent variable, the more that variable contributes to the discriminant function. Lambda varies from 0 to 1, with 0 meaning that the group means differ (and thus the variable differentiates the groups well) and 1 meaning that all group means are the same. Dichotomous independent variables are more accurately tested with a chi-square test than with Wilks' lambda for this purpose.

Classification Matrix or Confusion Matrix - Also called the assignment or prediction matrix or table; used to assess the performance of DA. This is a table in which the rows are the observed categories of the dependent variable and the columns are the predicted categories. When prediction is perfect, all cases lie on the diagonal. The percentage of cases on the diagonal is the percentage of correct classifications, called the hit ratio. The hit ratio is judged not relative to zero but relative to the percentage that would have been correctly classified by chance alone. For two-group discriminant analysis with a 50-50 split in the dependent variable, the expected percentage is 50%. For unequally split two-way groups of different sizes, the expected percentage is computed from the "Prior Probabilities for Groups" table in SPSS by multiplying the prior probabilities by the group sizes, summing over all groups and dividing the sum by N. If the strategy is simply to assign all cases to the largest group, the expected percentage is the largest group size divided by N.

Cross-validation - Leave-one-out classification is available as a form of cross-validation of the classification table. Under this option, each case is classified using a discriminant function based on all cases except the given case. This is thought to give a better estimate of what the classification results would be in the population.

Measures of association - Can be computed by the Crosstabs procedure in SPSS if the researcher saves the predicted group membership for all cases.

Mahalanobis D-Square, Rao's V, Hotelling's trace, Pillai's trace, and Roy's gcr (greatest characteristic root) - Indices other than Wilks' lambda of the extent to which the discriminant functions discriminate between criterion groups. Each has an associated significance test. A measure from this group is sometimes used in stepwise discriminant analysis to determine whether adding an independent variable to the model will significantly improve classification of the dependent variable. SPSS uses Wilks' lambda by default but also offers Mahalanobis distance, Rao's V, unexplained variance and smallest F ratio as selection criteria.

Structure Correlations - Also known as discriminant loadings; these are the simple correlations between the independent variables and the discriminant functions.

Other key terms used in the output include the discriminating variables, the discriminant function, the eigenvalue, the relative percentage, the canonical correlation (R*), the centroid, the discriminant score, the cutoff, and the standardized discriminant coefficients.
The next box that will appear is given in the following snapshot.
SPSS Snapshot DA 1
1. Enter the categorical dependent variable (Previously defaulted) here.
2. Click on Define Range.
After entering the dependent variable and clicking on Define Range as shown above, SPSS will open the following box.
SPSS Snapshot DA 2
1. Enter Minimum as 0 and Maximum as 1.
2. Then click on Continue.
After defining the range, one should click on the Continue button as shown above. SPSS will go back to the previous box, shown below.
SPSS Snapshot DA 3
1. Enter the list of independent variables - in this example the variables age, years with current employer, years at current address, household income, debt to income ratio, credit card debt in thousands and other debt in thousands.
3. Next, click on Statistics.
After selecting the dependent variable, the independent variables and the method of entering variables, one may click on Statistics, and SPSS will open a box as shown below.
SPSS Snapshot DA 4
1. Select Univariate ANOVAs, Box's M, Fisher's, Unstandardized and Within-groups correlation.
After selecting these statistics, SPSS will go back to the previous box, shown below.
SPSS Snapshot DA 5
2. Next, click on Classify.
After clicking on Classify, SPSS will open a box as shown below.
SPSS Snapshot DA 6
2. Click Continue.
After clicking Continue, SPSS will again go back to the previous window; at this stage one may click the OK button. SPSS will then analyse the data and display the results in the output viewer.
Output for Enter Method
We now discuss the interpretation of each part of the output.
Discriminant

Analysis Case Processing Summary
Unweighted Cases                                                           N      Percent
Valid                                                                     700       82.4
Excluded   Missing or out-of-range group codes                            150       17.6
           At least one missing discriminating variable                     0         .0
           Both missing or out-of-range group codes and
           at least one missing discriminating variable                     0         .0
           Total                                                          150       17.6
Total                                                                     850      100.0
This table gives the case processing summary, i.e. how many valid cases were selected, how many were excluded (due to missing data), the total, and their respective percentages.
Group Statistics

Previously defaulted                        Mean      Std. Deviation    Valid N (listwise): Unweighted / Weighted
No     Age in years                       35.5145        7.70774             517 / 517.000
       Years with current employer         9.5087        6.66374             517 / 517.000
       Years at current address            8.9458        7.00062             517 / 517.000
       Household income in thousands      47.1547       34.22015             517 / 517.000
       Debt to income ratio (x100)         8.6793        5.61520             517 / 517.000
       Credit card debt in thousands       1.2455        1.42231             517 / 517.000
       Other debt in thousands             2.7734        2.81394             517 / 517.000
Yes    Age in years                       33.0109        8.51759             183 / 183.000
       Years with current employer         5.2240        5.54295             183 / 183.000
       Years at current address            6.3934        5.92521             183 / 183.000
       Household income in thousands      41.2131       43.11553             183 / 183.000
       Debt to income ratio (x100)        14.7279        7.90280             183 / 183.000
       Credit card debt in thousands       2.4239        3.23252             183 / 183.000
       Other debt in thousands             3.8628        4.26368             183 / 183.000
Total  Age in years                       34.8600        7.99734             700 / 700.000
       Years with current employer         8.3886        6.65804             700 / 700.000
       Years at current address            8.2786        6.82488             700 / 700.000
       Household income in thousands      45.6014       36.81423             700 / 700.000
       Debt to income ratio (x100)        10.2606        6.82723             700 / 700.000
       Credit card debt in thousands       1.5536        2.11720             700 / 700.000
       Other debt in thousands             3.0582        3.28755             700 / 700.000
This table gives the group statistics of the independent variables for each category (here Yes and No) of the dependent variable.
Tests of Equality of Group Means

                                  Wilks' Lambda       F       df1    df2    Sig.
Age in years                          .981          13.482      1    698    .000
Years with current employer           .920          60.759      1    698    .000
Years at current address              .973          19.402      1    698    .000
Household income in thousands         .995           3.533      1    698    .061
Debt to income ratio (x100)           .848         124.889      1    698    .000
Credit card debt in thousands         .940          44.472      1    698    .000
Other debt in thousands               .979          15.142      1    698    .000
This table gives the Wilks' Lambda test for each independent variable. If the test is significant (Sig. < 0.05 or 0.01), it means that the mean of the respective variable is different for the two groups (in this case, previously defaulted and previously not defaulted). An insignificant value indicates that the variable does not differ between the groups, or in other words does not discriminate between the categories of the dependent variable. In the above example, all the variables are significant except household income in thousands, which implies that default on the loan does not depend significantly on household income.
Pooled Within-Groups Matrices (Correlation)

                                   Age     Years with   Years at    Household   Debt to       Credit card   Other
                                           current      current     income in   income ratio  debt in       debt in
                                           employer     address     thousands   (x100)        thousands     thousands
Age in years                      1.000      .524         .588        .475        .077           .342         .368
Years with current employer        .524     1.000         .292        .627        .089           .509         .471
Years at current address           .588      .292        1.000        .310        .083           .260         .257
Household income in thousands      .475      .627         .310       1.000        .001           .608         .629
Debt to income ratio (x100)        .077      .089         .083        .001       1.000           .455         .580
Credit card debt in thousands      .342      .509         .260        .608        .455          1.000         .623
Other debt in thousands            .368      .471         .257        .629        .580           .623        1.000
Analysis 1
Box's Test of Equality of Covariance Matrices

Log Determinants
Previously defaulted       Rank    Log Determinant
No                           7         21.292
Yes                          7         24.046
Pooled within-groups         7         22.817

Test Results
Box's M         563.291
F   Approx.      19.819
    df1              28
    df2        431743.0
    Sig.           .000
This table indicates that Box's M is significant, which means that the assumption of equality of covariance matrices may not hold. This is a caution to keep in mind while interpreting the results.
Summary of Canonical Discriminant Functions

Eigenvalues
Function    Eigenvalue    % of Variance    Cumulative %    Canonical Correlation
1             .404a           100.0            100.0              .536
This table gives the summary of the canonical discriminant function. The eigenvalue for this model is 0.404 and the canonical correlation is 0.536. Since there is a single discriminant function, all of the explained variation is contributed by that function. The squared canonical correlation (0.536^2, about 0.29) indicates that roughly 29% of the variation in the dependent variable is explained by the model.
Wilks' Lambda
Test of Function(s)    Wilks' Lambda    Chi-square    df    Sig.
1                          .712           235.447      7    .000

This table tests the significance of the model as a whole; as seen in the Sig. column, the model is significant.
Standardized Canonical Discriminant Function Coefficients
                                   Function 1
Age in years                          .122
Years with current employer          -.829
Years at current address             -.310
Household income in thousands         .215
Debt to income ratio (x100)           .603
Credit card debt in thousands         .564
Other debt in thousands              -.178
Structure Matrix
                                   Function 1
Debt to income ratio (x100)           .666
Years with current employer          -.464
Credit card debt in thousands         .397
Years at current address             -.262
Other debt in thousands               .232
Age in years                         -.219
Household income in thousands        -.112
This table gives the simple correlations between the independent variables and the discriminant function. A high correlation translates into high discriminating power.
Canonical Discriminant Function Coefficients
                                   Function 1
Age in years                          .015
Years with current employer          -.130
Years at current address             -.046
Household income in thousands         .006
Debt to income ratio (x100)           .096
Credit card debt in thousands         .275
Other debt in thousands              -.055
(Constant)                           -.576
Unstandardized coefficients
This table gives the unstandardized canonical discriminant function coefficients, from which the discriminant score of a case can be computed. A negative sign indicates an inverse relation; for example, the coefficient of years with current employer is -0.130, which means that the more years a person has spent with the current employer, the lower the discriminant score and hence the lower the chance that the person will default.
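For readers working in R, a comparable analysis can be obtained with MASS::lda(). The sketch below is not the book's procedure, and the data frame bankloan with columns default, age, employ, address, income, debtinc, creddebt and othdebt is an assumption mirroring the SPSS bankloan data used above.

# Sketch: linear discriminant analysis in R on the assumed bankloan data frame.
# 'bankloan' is assumed to be already loaded (e.g. via read.csv).
library(MASS)
da_fit <- lda(default ~ age + employ + address + income + debtinc + creddebt + othdebt,
              data = bankloan)
da_fit$scaling                  # discriminant coefficients (scaled differently from SPSS)
head(predict(da_fit)$class)     # predicted group membership for the first few cases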
Functions at Group Centroids
Previously defaulted    Function 1
No                        -.377
Yes                       1.066
Classification Statistics

Classification Processing Summary
Processed                                              850
Excluded   Missing or out-of-range group codes           0
           At least one missing discriminating variable  0
Used in Output                                         850

Prior Probabilities for Groups
Previously defaulted    Prior
No                       .500
Yes                      .500
Total                   1.000
Classification Function Coefficients
                                   Previously defaulted
                                       No          Yes
Age in years                          .803         .825
Years with current employer          -.102        -.289
Years at current address             -.294        -.360
Household income in thousands         .073         .081
Debt to income ratio (x100)           .639         .777
Credit card debt in thousands       -1.004        -.608
Other debt in thousands             -1.044       -1.124
(Constant)                         -15.569      -16.898
The classification matrix, or confusion matrix, gives the percentage of cases that are classified correctly, i.e. the hit ratio. The hit ratio should be at least 25% higher than the rate expected by chance.
In this example, 532 of the 700 cases are classified correctly.
Overall, 76% of the cases are classified correctly.
139 out of 263 defaulters were identified correctly.
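Continuing the R sketch (same assumed bankloan data and da_fit object as before), the confusion matrix and hit ratio can be computed as follows.

# Sketch: classification (confusion) matrix and hit ratio for the lda fit.
pred_class <- predict(da_fit)$class
conf       <- table(Observed = bankloan$default, Predicted = pred_class)
conf
sum(diag(conf)) / sum(conf)   # hit ratio: proportion of cases classified correctly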
Output for Stepwise Method
All the steps for the stepwise method are very similar to those of the enter method, except that the method selected is Stepwise; this selection was shown in SPSS Snapshot DA 3. If one selects the stepwise method, one also needs to select the criterion that SPSS should use to choose the best set of independent variables. This can be selected by clicking the Method button, as shown in Snapshot DA 8 below.
SPSS Snapshot DA 8
1. Select the stepwise method.
2. Click on Method.
Variables Entered/Removed

                                                 Min. D Squared
Step   Entered                            Statistic   Between Groups   Exact F Statistic   df1   df2       Sig.
1      Debt to income ratio (x100)          .924       No and Yes           124.889          1   698.000   .000
2      Years with current employer         1.501       No and Yes           101.287          2   697.000   .000
3      Credit card debt in thousands       1.926       No and Yes            86.502          3   696.000   .000
4      Years at current address            2.038       No and Yes            68.572          4   695.000   .000

At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 14.
b. Minimum partial F to enter is 3.84.
c. Maximum partial F to remove is 2.71.
d. F level, tolerance, or VIN insufficient for further computation.
Variables in the Analysis

Step                                      Tolerance   F to Remove   Min. D Squared   Between Groups
1   Debt to income ratio (x100)             1.000       124.889
2   Debt to income ratio (x100)              .992       130.539          .450          No and Yes
    Years with current employer              .992        66.047          .924          No and Yes
3   Debt to income ratio (x100)              .766        35.888         1.578          No and Yes
    Years with current employer              .716       111.390          .947          No and Yes
    Credit card debt in thousands            .572        44.336         1.501          No and Yes
4   Debt to income ratio (x100)              .766        35.000         1.693          No and Yes
    Years with current employer              .691        89.979         1.213          No and Yes
    Credit card debt in thousands            .564        48.847         1.565          No and Yes
    Years at current address                 .898        11.039         1.926          No and Yes
Wilks' Lambda

                                                           Exact F
Step   Number of Variables   Lambda   df1   df2   df3    Statistic   df1   df2       Sig.
1             1               .848     1     1    698     124.889     1    698.000   .000
2             2               .775     2     1    698     101.287     2    697.000   .000
3             3               .728     3     1    698      86.502     3    696.000   .000
4             4               .717     4     1    698      68.572     4    695.000   .000
These tables give a summary of the variables that are in the analysis, the variables that are not in the analysis, and the model at each step along with its significance. It can be concluded that the variables Debt to income ratio (x100), Years with current employer, Credit card debt in thousands and Years at current address remain in the model, while the others are removed. This means that only these four variables contribute to the model.
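A hedged R sketch of the reduced model: rather than reproducing SPSS's stepwise algorithm, it simply refits the discriminant function with the four variables retained above (bankloan is the assumed data frame from the earlier sketches).

# Sketch: discriminant analysis restricted to the four retained variables.
da_step <- MASS::lda(default ~ debtinc + employ + creddebt + address, data = bankloan)
da_step$scaling                                     # coefficients of the reduced function
mean(predict(da_step)$class == bankloan$default)    # hit ratio of the reduced model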
4 Logistic Regression
In logistic regression, unlike in discriminant analysis, the independent variables could be either continuous or dichotomous, or a combination of continuous and dichotomous variables.
In logistic regression, the relationship between the dependent variable and the independent variable is not linear. It is of the type

p = 1 / (1 + e^(-y))        ...(8)

where p is the probability of success, i.e. of the dichotomous variable taking the value 1, (1 - p) is the probability of failure, i.e. of it taking the value 0, and y = a + bx.
The graph of this relationship between p and y is depicted below:
(Figure: Probability p as a function of y - an S-shaped logistic curve rising from 0 towards 1 as y increases.)
The logistic equation (8) can be reduced to a linear form by converting the probability p into the log of p/(1 - p), or logit, as follows:

logit = ln [p / (1 - p)] = a + bx

The logarithm here is the natural logarithm to the base e. (The logarithm of any number to this base is obtained by multiplying its logarithm to the base 10 by the log of 10 to the base e, i.e. 2.303.)
This logit equation is similar to a regression equation. However, here a unit change in the independent variable causes a change in the dependent variable expressed as a logit, rather than in the probability p directly. Such regression analysis is known as logistic regression.
The fitting of a logistic regression equation is explained through an illustration wherein data was
recorded on the CGPA (up to first semester in the second year of MBA) of 20 MBA students, and
their success in the first interview for placement. The data collected was as follows where Pass is
indicated as 1 while Fail is indicated as 0.
Student (Sr. No.)   CGPA    Success      Student (Sr. No.)   CGPA    Success
        1           3.12       0                11           3.48
        2           3.21       1                12           3.34
        3           3.15       0                13           3.25
        4           3.45       0                14           3.46
        5           3.14       0                15           3.32
        6           3.25       1                16           3.29
        7           3.16       1                17           3.42
        8           3.28       1                18           3.28
        9           3.22       0                19           3.36
       10           3.41       1                20           3.31
Now, given this data, can we find the probability of a student succeeding in the first interview given
the CGPA?
(Figure: scatter plot of Success in Interview (0/1) against CGPA, with a fitted linear trend line y = 1.8264x - 5.3679, R2 = 0.1706, i.e. approximately y = 1.83x - 5.37.)
Instead, let us now attempt to fit a logistic regression to the student data. We will do this by
computing the logits and then fitting a linear model to the logits. To compute the logits, we will
regroup the data by CGPA into intervals, using the midpoint of each interval for the independent
variable. We calculate the probability of success based on the number of students that passed the
interview for each range of CGPAs. This results in the following data:
Class Interval (CGPA)   Midpoint   Probability of Success p    Logit ln[p/(1 - p)]
3.1 - 3.2                 3.15         1/4 = 0.25                  -1.09861
3.2 - 3.3                 3.25         4/6 = 0.667                   .9163
3.3 - 3.4                 3.35         3/4 = 0.75                   1.0986
3.4 - 3.5                 3.45         5/6 = 0.833                  1.6094
We plot the logit against the CGPA and then look for the linear fit, which gives us the equation: y = 8.306x - 26.78.
Thus, if p is the probability of passing the interview and x is the CGPA, the logistic regression can be expressed as:

ln [p / (1 - p)] = 8.306x - 26.78

Converting the logarithm to an equivalent exponential form, this equation can also be expressed as:

p = e^(8.306x - 26.78) / (1 + e^(8.306x - 26.78))

The following table gives the value of y* = 8.306x - 26.78 and the corresponding probability p for a range of CGPA values x:
  x      y* = 8.306x - 26.78       p
 2.5           -6.015            0.002
 2.6           -5.184            0.006
 2.7           -4.354            0.013
 2.8           -3.523            0.029
 2.9           -2.693            0.063
 3.0           -1.862            0.134
 3.1           -1.031            0.263
 3.2           -0.201            0.45
 3.3            0.63             0.652
 3.4            1.46             0.812
 3.5            2.291            0.908
 3.6            3.122            0.958
 3.7            3.952            0.981
 3.8            4.783            0.992
 3.9            5.613            0.996
 4.0            6.444            0.998
From this regression model, we can see that the probability of success at the interview is below 25% for CGPAs up to about 3.1, and above 75% for CGPAs of about 3.4 and higher.
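A short R sketch of the same calculation: the probabilities in the table above follow directly from the fitted equation, and plogis() is R's logistic function.

# Sketch: probability of interview success implied by the fitted logistic equation.
p_success <- function(cgpa) plogis(8.306 * cgpa - 26.78)   # = e^y / (1 + e^y)
p_success(c(2.9, 3.3, 3.6))   # roughly 0.06, 0.65 and 0.96, in line with the table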
While one could apply logistic regression to a number of situations, it has been found particularly useful in the following:
Credit: study of the creditworthiness of an individual or a company. Various demographic and credit-history variables could be used to predict whether an individual will turn out to be a good or a bad customer.
Marketing / market segmentation: study of the purchasing behaviour of consumers. Various demographic and purchasing information could be used to predict whether an individual will purchase an item or not.
Customer loyalty: analysis to identify loyal or repeat customers using various demographic and purchasing information.
Medical: study of the risk of diseases / body disorders.
4.1 Assumptions of Logistic Regression
Multiple regression requires assumptions such as linearity and normality; these are not required for logistic regression. Discriminant analysis requires the independent variables to be metric, which is also not necessary for logistic regression. This makes the technique superior to discriminant analysis in such situations. The only care to be taken is that there should be no extreme observations (outliers) in the data.
4.2 Key Terms of Logistic Regression
Following are the key terms used in logistic regression:

Factor - The categorical independent variable in logistic regression is termed a factor. A factor is dichotomous in nature or is converted into dummy variables.
Covariate - An independent variable that is metric in nature is termed a covariate.
Maximum Likelihood Estimation - The method used in logistic regression to estimate the coefficients (and hence the odds ratios) for the dependent variable. In least-squares estimation the sum of squared errors is minimized, whereas in maximum likelihood estimation the log likelihood is maximized.
Significance Test - The Hosmer and Lemeshow chi-square test is used to test the overall goodness of fit of the model. It is a modified chi-square test, which is better than the traditional chi-square test. The Pearson chi-square test and the likelihood ratio test are used in multinomial logistic regression to assess the goodness of fit of the model.
Stepwise logistic regression - The three methods available are enter, backward and forward. In the enter method, all variables are included in the logistic regression, irrespective of whether a variable is significant or insignificant. In the backward method, the model starts with all variables and removes non-significant variables from the list. In the forward method, logistic regression starts with a single variable, adds variables one by one, tests their significance and removes insignificant variables from the model.

Other key terms include the odds ratio, measures of effect size, and the classification table.
1. Select the dependent variable (Previously defaulted).
3. Select the method of logistic regression as Forward: LR.
4. Click on Save.
SPSS will take you back to the window displayed in LR Snapshot 2; at this stage click on Options. The following window will be opened.
LR Snapshot 4
1. Select Hosmer-Lemeshow goodness of fit.
2. Click Continue.
SPSS will be back to the window shown in LR Snapshot 2. At this stage click OK, and the following output will be displayed.
Logistic Regression

Case Processing Summary
Unweighted Cases                           N      Percent
Selected Cases   Included in Analysis     700       82.4
                 Missing Cases            150       17.6
                 Total                    850      100.0
Unselected Cases                            0         .0
Total                                     850      100.0
This table gives the case processing summary: 700 out of 850 cases are used for the analysis, while 150 are excluded because they have missing values.
Dependent Variable Encoding
Original Value    Internal Value
No                      0
Yes                     1

This table indicates the coding of the dependent variable: 0 => not defaulted, 1 => defaulted.
Block 0: Beginning Block

Classification Table
                                          Predicted
                                  Previously defaulted     Percentage
Observed                            No          Yes         Correct
Step 0   Previously defaulted
           No                       517           0           100.0
           Yes                      183           0              .0
         Overall Percentage                                    73.9

Variables in the Equation
                      B       S.E.      Wald      df    Sig.    Exp(B)
Step 0   Constant   -1.039    .086    145.782      1    .000     .354
Variables not in the Equation
                               Score      df    Sig.
Step 0   Variables   age       13.265      1    .000
                     employ    56.054      1    .000
                     address   18.931      1    .000
                     income     3.526      1    .060
                     debtinc  106.238      1    .000
                     creddebt  41.928      1    .000
                     othdebt   14.863      1    .000
         Overall Statistics   201.271      7    .000
Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1   Step        102.935      1    .000
         Block       102.935      1    .000
         Model       102.935      1    .000
Step 2   Step         70.346      1    .000
         Block       173.282      2    .000
         Model       173.282      2    .000
Step 3   Step         55.446      1    .000
         Block       228.728      3    .000
         Model       228.728      3    .000
Step 4   Step         18.905      1    .000
         Block       247.633      4    .000
         Model       247.633      4    .000
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          701.429a               .137                   .200
2          631.083b               .219                   .321
3          575.636b               .279                   .408
4          556.732c               .298                   .436

Hosmer and Lemeshow Test
Step   Chi-square    df    Sig.
1         3.160       8    .924
2         4.158       8    .843
3         6.418       8    .600
4         8.556       8    .381

The Hosmer-Lemeshow statistic indicates a poor fit if the significance value is less than 0.05. Here, since the values are above 0.05, the model adequately fits the data.
Classification Table
                                              Predicted
                                      Previously defaulted     Percentage
         Observed                        No          Yes        Correct
Step 1   Previously defaulted  No        490          27          94.8
                               Yes       137          46          25.1
         Overall Percentage                                       76.6
Step 2   Previously defaulted  No        481          36          93.0
                               Yes       110          73          39.9
         Overall Percentage                                       79.1
Step 3   Previously defaulted  No        477          40          92.3
                               Yes        99          84          45.9
         Overall Percentage                                       80.1
Step 4   Previously defaulted  No        478          39          92.5
                               Yes        91          92          50.3
         Overall Percentage                                       81.4
This is the classification table. It indicates the number of cases classified correctly as well as incorrectly: the diagonal elements represent correctly classified cases and the off-diagonal elements represent incorrectly classified cases. It may be noted that at each step the number of correctly classified cases improves over the previous step; the last column gives the percentage of correctly classified cases, which also improves at each step.
Variables in the Equation
                           B        S.E.      Wald      df    Sig.    Exp(B)
Step 1a  debtinc          .132      .014     85.377      1    .000     1.141
         Constant       -2.531      .195    168.524      1    .000      .080
Step 2b  employ          -.141      .019     53.755      1    .000      .868
         debtinc          .145      .016     87.231      1    .000     1.156
         Constant       -1.693      .219     59.771      1    .000      .184
Step 3c  employ          -.244      .027     80.262      1    .000      .783
         debtinc          .088      .018     23.328      1    .000     1.092
         creddebt         .503      .081     38.652      1    .000     1.653
         Constant       -1.227      .231     28.144      1    .000      .293
Step 4d  employ          -.243      .028     74.761      1    .000      .785
         address         -.081      .020     17.183      1    .000      .922
         debtinc          .088      .019     22.659      1    .000     1.092
         creddebt         .573      .087     43.109      1    .000     1.774
         Constant        -.791      .252      9.890      1    .002      .453
The best model is usually that of the last step, i.e. Step 4. It contains the variables years with current employer, years at current address, debt to income ratio, and credit card debt; all other variables are insignificant and are excluded from the model.
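For reference, the step-4 model could be fitted directly in R with glm(); this is a sketch, not the book's code, and bankloan with columns default, employ, address, debtinc and creddebt is the assumed data frame used in the earlier sketches.

# Sketch: the final logistic regression model in R.
lr_fit <- glm(default ~ employ + address + debtinc + creddebt,
              data = bankloan, family = binomial)
summary(lr_fit)      # coefficients, standard errors and significance, as in the table above
exp(coef(lr_fit))    # odds ratios, comparable to the Exp(B) column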
Model if Term Removed
                          Model Log       Change in
Variable                  Likelihood    -2 Log Likelihood    df    Sig. of the Change
Step 1   debtinc           -402.182          102.935          1          .000
Step 2   employ            -350.714           70.346          1          .000
         debtinc           -369.708          108.332          1          .000
Step 3   employ            -349.577          123.518          1          .000
         debtinc           -299.710           23.783          1          .000
         creddebt          -315.541           55.446          1          .000
Step 4   employ            -333.611          110.490          1          .000
         address           -287.818           18.905          1          .000
         debtinc           -290.006           23.281          1          .000
         creddebt          -311.176           65.621          1          .000
Variables not in the Equation
                                  Score      df    Sig.
Step 1   Variables   age          16.478      1    .000
                     employ       60.934      1    .000
                     address      23.474      1    .000
                     income        3.219      1    .073
                     creddebt      2.261      1    .133
                     othdebt       6.631      1    .010
         Overall Statistics      113.910      6    .000
Step 2   Variables   age            .006      1    .939
                     address       8.407      1    .004
                     income       21.437      1    .000
                     creddebt     64.958      1    .000
                     othdebt       4.503      1    .034
         Overall Statistics       84.064      5    .000
Step 3   Variables   age            .635      1    .426
                     address      17.851      1    .000
                     income         .773      1    .379
                     othdebt        .006      1    .940
         Overall Statistics       22.221      4    .000
Step 4   Variables   age           3.632      1    .057
                     income         .012      1    .912
                     othdebt        .320      1    .572
         Overall Statistics        4.640      3    .200
The above table gives the score statistics for the variables not in the equation at each step. A significant score statistic indicates that adding the corresponding variable would significantly improve the model, while insignificant values identify variables that need not be added. The fitted equation itself (Variables in the Equation) is what is used to estimate the probability of default for a person with given values of the variables.
In this case, one of the conclusions drawn was that both the programmes had a positive impact on both knowledge and motivation, but there was no significant difference between the classroom-based and the job-based training programmes.
As yet another example, one could assess whether a change from Compensation System 1 to Compensation System 2 has brought about changes in sales, profit and job satisfaction in an organisation.
MANOVA is typically used when there is more than one dependent variable and the independent variables are qualitative/categorical.
5.1 MANOVA Using SPSS
We will use the case data on commodity market perceptions displayed at the end of this chapter.
Open the file Commodity.sav.
Select from the menu Analyze > General Linear Model > Multivariate, as shown below.
MANOVA Snapshot 1
4. Click OK.
It may be noted that the above example is really a MANCOVA, as we have selected some categorical independent variables and some metric ones.
We are assuming in this example that the dependent variables are the investments in the commodity market and in the share market, the categorical independent variables are occupation and the length of time for which the respondents block their investments, and the metric independent variables (covariates) are age and the respondents' ratings of the commodity market and the share market. Here we assume that their investments depend on their ratings, occupation, age and how long they block their investments.
The following output will be displayed
General Linear Model
Between-Subjects Factors
                                   Value    Value Label        N
Occupation                                  Self Employed      11
                                            Govt               15
                                            Student             4
                                            House Wife         14
how_much_time_block_your_money       1      < 6 months          6
                                     2      6 to 12 months      8
                                     3      1 to 3 years        5
                                     4      > 3 years          10
                                     5                          4
                                     6                          6
                                     7                          3
                                     8                          2
Multivariate Tests

Effect                                                   Value      F         Hypothesis df   Error df    Sig.
Intercept                         Pillai's Trace          .139     1.537a         2.000        19.000     .241
                                  Wilks' Lambda           .861     1.537a         2.000        19.000     .241
                                  Hotelling's Trace       .162     1.537a         2.000        19.000     .241
                                  Roy's Largest Root      .162     1.537a         2.000        19.000     .241
Age                               Pillai's Trace          .157     1.770a         2.000        19.000     .197
                                  Wilks' Lambda           .843     1.770a         2.000        19.000     .197
                                  Hotelling's Trace       .186     1.770a         2.000        19.000     .197
                                  Roy's Largest Root      .186     1.770a         2.000        19.000     .197
Rate_CM                           Pillai's Trace          .096     1.011a         2.000        19.000     .383
                                  Wilks' Lambda           .904     1.011a         2.000        19.000     .383
                                  Hotelling's Trace       .106     1.011a         2.000        19.000     .383
                                  Roy's Largest Root      .106     1.011a         2.000        19.000     .383
Rate_SM                           Pillai's Trace          .027      .268a         2.000        19.000     .768
                                  Wilks' Lambda           .973      .268a         2.000        19.000     .768
                                  Hotelling's Trace       .028      .268a         2.000        19.000     .768
                                  Roy's Largest Root      .028      .268a         2.000        19.000     .768
Occupation                        Pillai's Trace          .908     5.547          6.000        40.000     .000
                                  Wilks' Lambda           .250     6.333a         6.000        38.000     .000
                                  Hotelling's Trace      2.366     7.099          6.000        36.000     .000
                                  Roy's Largest Root     2.059    13.725b         3.000        20.000     .000
how_much_time_block_your_money    Pillai's Trace          .855     2.132         14.000        40.000     .031
                                  Wilks' Lambda           .318     2.102a        14.000        38.000     .035
                                  Hotelling's Trace      1.607     2.066         14.000        36.000     .040
                                  Roy's Largest Root     1.125     3.214b         7.000        20.000     .019
Occupation * how_much_time_       Pillai's Trace          .823     1.399         20.000        40.000     .180
block_your_money                  Wilks' Lambda           .335     1.382a        20.000        38.000     .191
                                  Hotelling's Trace      1.511     1.360         20.000        36.000     .206
                                  Roy's Largest Root     1.069     2.137b        10.000        20.000     .071

a. Exact statistic
b. The statistic is an upper bound on F that yields a lower bound on the significance level.
c. Design: Intercept+Age+Rate_CM+Rate_SM+Occupation+how_much_time_block_your_money+Occupation * how_much_time_block_your_money
This table indicates that the null hypothesis that the investments are equal across occupations is rejected, since the significance value (p-value) for Occupation is less than 0.05. Thus we may conclude, at a 5% level of significance (LOS), that the investments (in both the share market and the commodity market) differ significantly across the occupations of the respondents.
The null hypothesis that the investments are equal for the different lengths of time for which the investment is blocked is also rejected, since the corresponding significance value (p-value) is less than 0.05. Thus we may conclude at 5% LOS that the investments in both markets differ significantly with the period for which the respondents are willing to block their money.
The other hypotheses, relating to age, the rating of the commodity market and the rating of the share market, are not rejected (their p-values are greater than 0.05); this means there is no significant difference in the investments attributable to these variables.
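An equivalent multivariate test can be sketched in R with manova(); the data frame commodity and its column names are assumptions mirroring Commodity.sav, and the two categorical variables are assumed to be stored as factors.

# Sketch: MANOVA (with covariates) in R on the assumed commodity data frame.
man_fit <- manova(cbind(Invest_SM, Invest_CM) ~ Age + Rate_CM + Rate_SM +
                    Occupation * how_much_time_block_your_money,
                  data = commodity)
summary(man_fit, test = "Wilks")   # other options: "Pillai", "Hotelling-Lawley", "Roy"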
Tests of Between-Subjects Effects

Source                             Dependent Variable    df    Mean Square       F        Sig.
Corrected Model                    Invest_SM             23    4407960763       4.250     .001
                                   Invest_CM             23    518647980.2      2.033     .057
Intercept                          Invest_SM              1    64681514.21       .062     .805
                                   Invest_CM              1    532489379.0      2.087     .164
Age                                Invest_SM              1    313752666.0       .302     .588
                                   Invest_CM              1    472901520.3      1.854     .189
Rate_CM                            Invest_SM              1    224388812.0       .216     .647
                                   Invest_CM              1    526600355.7      2.064     .166
Rate_SM                            Invest_SM              1    528516860.7       .510     .484
                                   Invest_CM              1    76132638.56       .298     .591
Occupation                         Invest_SM              3    1.418E+010      13.672     .000
                                   Invest_CM              3    1043835840       4.092     .020
how_much_time_block_your_money     Invest_SM              7    2450532038       2.362     .062
                                   Invest_CM              7    379950972.8      1.489     .227
Occupation * how_much_time_        Invest_SM             10    2190181184       2.111     .074
block_your_money                   Invest_CM             10    264279220.7      1.036     .450
Error                              Invest_SM             20    1037276941
                                   Invest_CM             20    255119311.4
Total                              Invest_SM             44
                                   Invest_CM             44
Corrected Total                    Invest_SM             43
                                   Invest_CM             43
6 Factor Analysis
Factor analysis is an interdependence technique. In interdependence techniques the variables are not classified as independent or dependent; rather, their interrelationships are studied. Factor analysis is a general name for two different techniques, namely Principal Component Analysis (PCA) and Common Factor Analysis.
Factor analysis originated about a century ago, when Charles Spearman propounded that the results of a wide variety of mental tests could be explained by a single underlying intelligence factor.
Factor analysis is done principally for two reasons:
To identify a new, smaller set of uncorrelated variables to be used in subsequent multiple regression analysis. In this situation Principal Component Analysis is performed on the data. PCA considers the total variance in the data while finding principal components from a given set of variables.
To identify underlying dimensions/factors that are unobservable but explain the correlations among a set of variables. In this situation Common Factor Analysis is performed on the data. Common factor analysis considers only the common variance while finding common factors from a given set of variables; it is also termed Principal Axis Factoring.
The essential purpose of factor analysis is to describe, if possible, the covariance relationships among
many variables in terms of few underlying, but unobservable, random quantities called factors.
Basically, the factor model is motivated by the following argument. Suppose variables can be
grouped by their correlations. That is, all variables, within a particular group are highly correlated
among themselves but have relatively small correlations with variables in a different group. In that
case, it is conceivable that each group of variables represents a single underlying construct, or factor,
that is responsible for the correlations.
6.1 Rotation in Factor Analysis
If several factors have high loadings on the same variable, it is difficult to interpret the factors clearly. This can be improved by using rotation.
Rotation does not affect the communalities or the percentage of total variance explained. However, the percentage of variance accounted for by each factor changes: the variance explained by the individual factors is redistributed by rotation.
Exploratory Factor Analysis - This approach is used when the researcher has no prior knowledge about the number of factors the variables will indicate. In such cases computer-based techniques are used to indicate the appropriate number of factors.
Confirmatory Factor Analysis - This approach is used when the researcher has prior knowledge (on the basis of some pre-established theory) about the number of factors the variables will indicate. This makes the analysis easier, as there is no decision to be taken about the number of factors; the number is specified in the computer-based tool while conducting the analysis.
Correlation Matrix - The matrix showing the simple correlations between all possible pairs of variables. The diagonal elements of this matrix are 1 and the matrix is symmetric, since the correlation between two variables x and y is the same as that between y and x.
Communality - The amount of variance an original variable shares with all the other variables included in the analysis. A relatively high communality indicates that a variable has much in common with the other variables taken as a group.
Eigenvalue - The eigenvalue for each factor is the total variance explained by that factor.
Factor - A linear combination of the original variables. A factor also represents the underlying dimensions (constructs) that summarise or account for the original set of observed variables.
Factor Loadings - The factor loadings, or component loadings in PCA, are the correlation coefficients between the variables (given in the output as rows) and the factors (given in the output as columns). These loadings are analogous to Pearson's correlation coefficient r; the squared factor loading is the percentage of variance in the respective variable explained by the factor.
Factor Matrix - Contains the factor loadings of all the variables on all the factors extracted.
Factor Plot or Rotated Factor Space - A plot in which the factors form the axes and the variables are drawn on these axes. This plot can be interpreted only if the number of factors is 3 or fewer.
Factor Scores - Each individual observation has a score, or value, associated with each of the original variables. Factor analysis procedures derive factor scores that represent each observation's calculated value, or score, on each of the factors. The factor score represents an individual's combined response to the several variables representing the factor. The component scores may be used in subsequent analysis in PCA; when the factors are to represent a new set of variables that may predict or depend on some phenomenon, the factor scores may be used as the new input.
Goodness of a Factor - How well can a factor account for the correlations among the indicators? One could examine the correlations among the indicators after the effect of the factor is removed. For a good factor solution, the resulting partial correlations should be near zero, because once the effect of the common factor is removed there is nothing left to link the indicators.
Bartlett's Test of Sphericity - The test statistic used to test the null hypothesis that there is no correlation between the variables.
Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy - An index used to test the appropriateness of factor analysis. High values of this index (generally more than 0.5) indicate that factor analysis is appropriate, whereas lower values (less than 0.5) indicate that factor analysis may not be appropriate.
Scree Plot - A plot of the eigenvalues against the factors in the order of their extraction.
Trace - The sum of the values on the diagonal of the correlation matrix used in the factor analysis. It represents the total amount of variance on which the factor solution is based.
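Both of these checks are available in R through the psych package; the sketch below assumes a data frame survey_items containing the variables to be factor-analysed.

# Sketch: sampling adequacy and sphericity checks before running a factor analysis.
library(psych)
KMO(survey_items)                                             # KMO measure of sampling adequacy
cortest.bartlett(cor(survey_items), n = nrow(survey_items))   # Bartlett's test of sphericity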
There are a number of financial parameters/ratios for predicting the health of a company. It would be useful if just a couple of indicators could be formed as linear combinations of the original parameters/ratios, in such a way that these few indicators extract most of the information contained in the data on the original variables.
Further, in a regression model, if the independent variables are correlated, implying that there is multicollinearity, then new variables could be formed as linear combinations of the original variables which are themselves uncorrelated. The regression equation can then be derived with these new uncorrelated independent variables, and used for interpreting the regression coefficients as well as for predicting the dependent variable. This is highly useful in marketing and financial applications involving forecasting of sales, profit, price, etc. with the help of regression equations.
Further, analysis of principal components often reveals relationships that were not previously suspected and
thereby allows interpretations that would not be ordinarily understood. A good example of this is provided by
stock market indices.
Incidentally, PCA is a means to an end and not the end in itself. PCA can be used for inputting
principal components as variables for further analysing the data using other techniques such as
cluster analysis, regression and discriminant analysis.
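As a brief illustration of this use of PCA, the sketch below extracts principal components in R and keeps their scores for later analysis; ratios is an assumed data frame of financial ratios.

# Sketch: principal components as new, uncorrelated inputs for further analysis.
pca <- prcomp(ratios, scale. = TRUE)   # standardise the variables before extraction
summary(pca)                           # proportion of variance explained by each component
scores <- pca$x[, 1:2]                 # scores on the first two components, e.g. for regression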
6.4 Common Factor Analysis
Common factor analysis is yet another data reduction and summarization technique. It is a statistical approach used to analyse the interrelationships among a large number of variables (e.g. test scores, test items, questionnaire responses) and then to explain these variables in terms of their common underlying dimensions (factors). For example, a hypothetical survey questionnaire may consist of 20 or even more questions; but since not all of the questions are identical, they do not all measure the basic underlying dimensions to the same extent. By using factor analysis, we can identify the separate dimensions being measured by the survey and determine a factor loading for each variable (test item) on each factor.
Common factor analysis (unlike multiple regression, discriminant analysis or canonical correlation, in which one or more variables are explicitly considered as the criterion or dependent variable and all others as predictor or independent variables) is an interdependence technique in which all variables are considered simultaneously. In a sense, each of the observed (original) variables is considered a dependent variable that is a function of some underlying, latent and hypothetical/unobserved set of factors (dimensions). One could also consider the original variables as reflective indicators of the factors; for example, marks (a variable) in an examination reflect intelligence (a factor).
The statistical approach followed in factor analysis involves finding a way of condensing the information contained in a number of original variables into a smaller set of dimensions (factors) with a minimum loss of information.
Common factor analysis was originally developed to explain students' performance in various subjects and to understand the link between grades and intelligence. Thus, the marks obtained in an examination reflect the student's intelligence quotient, while a salesman's performance in terms of sales might reflect his attitude towards the job and the efforts made by him.
One of the studies relating to marks obtained by students in various subjects led to the conclusion that students' marks are a function of two common factors, viz. quantitative and verbal abilities. The quantitative ability factor explains marks in subjects like Mathematics, Physics and Chemistry, and verbal ability explains marks in subjects like Languages and History.
In another study, a detergent manufacturing company was interested in identifying the major underlying factors or dimensions that consumers use to evaluate various detergents. These factors are assumed to be latent; however, management believed that the various attributes or properties of detergents were indicators of these underlying factors, and factor analysis was used to identify them. Data was collected on several product attributes using a five-point scale. The analysis of the responses revealed the existence of two factors, viz. the ability of the detergent to clean, and its mildness.
In general, factor analysis performs the following functions:
It identifies the smallest number of common factors that best explain or account for the correlations among the indicators.
It identifies a set of dimensions that are latent (not easily observed) in a large number of variables.
It devises a method of combining or condensing a large number of consumers with varying preferences into a distinctly smaller number of groups.
It identifies and creates an entirely new, smaller set of variables to partially or completely replace the original set of variables for subsequent regression or discriminant analysis. This is especially useful in multiple regression analysis when multicollinearity is found to exist, as the number of independent variables is reduced by using factors, thereby minimizing or avoiding multicollinearity; in fact, the factors are used in lieu of the original variables in the regression equation.
Distinguishing Feature of Common Factor Analysis
Generally, the variables that we define in real-life situations reflect the presence of unobservable factors, and these factors impact the values of those variables. For example, the marks obtained in an examination reflect the student's intelligence quotient, and a salesman's performance in terms of sales might reflect his attitude towards the job and the efforts made by him.
Each of the above examples requires a scale, or an instrument, to measure the various constructs (i.e., attitudes, image, patriotism, sales aptitude, and resistance to innovation). These are but a few examples of the type of measurements that are desired by various business disciplines. Factor analysis is one of the techniques that can be used to develop scales to measure these constructs.
6.4.1 Applications of Common Factor Analysis
In one of the studies conducted by a group of the students of a Management Institute, they undertook
a survey of 120 potential buyers outside retail outlets and at dealer counters. Their opinions were
solicited through a questionnaire for each of the 20 parameters relating to a television.
Through the use of principal component analysis and factor analysis using computer software, the group concluded that the following five factors, out of the twenty parameters on which opinions were recorded, are the most important:
Price (price, schemes and other offers)
Picture Quality
Brand Ambassador (Person of admiration)
Wide range
Information (Website use, Brochures, Friends recommendations )
In yet another study, another group of students of a Management Institute conducted a survey to
identify the factors that influence the purchasing decision of a motorcycle in the 125 cc category.
Through the use of Principal Component Analysis and factor analysis using computer software, the
group concluded that the following three parameters are most important:
Comfort
Assurance
Long Term Value
6.5 Factor Analysis on Data Using SPSS
We shall first explain PCA using SPSS and then Common Factor Analysis.
Principal Component Analysis Using SPSS
For illustration we will be using the file car_sales.sav. This file is part of the SPSS cases and is in the tutorial folder of SPSS; within the tutorial folder, it is in the sample_files folder. For the convenience of readers, we have provided this file in the CD with the book. This data file contains hypothetical sales estimates, list prices, and physical specifications for various makes and models of vehicles. The list prices and physical specifications were obtained alternately from edmunds.com and manufacturer sites. Following is the list of the major variables in the file.
1. Manufacturer
2. Model
3. Sales in thousands
4. 4-year resale value
5. Vehicle type
6. Price in thousands
7. Engine size
8. Horsepower
9. Wheelbase
10. Width
11. Length
12. Curb weight
13. Fuel capacity
14. Fuel efficiency
After opening the file Car_sales.sav, click on Analyze → Data Reduction → Factor, as shown in the following snapshot.
FA Snapshot 1
In the Descriptives dialog: (1) click on Initial solution, (2) click on Coefficients, (3) click on KMO and Bartlett's test of sphericity, and (4) click on Continue.
SPSS will take you back to the previous window, as shown below.
FA Snapshot 3
Click on Extraction. In the Extraction dialog, select Correlation matrix and click on Continue.
SPSS will take you back to the window shown below.
FA Snapshot 5
Click on Rotation. In the Rotation dialog, select Varimax rotation, select Display rotated solution, and click Continue.
SPSS will take you back to the window shown in FA Snapshot 5. Click on the button Scores; this will open a new window, as shown below.
FA Snapshot 7
In the Scores dialog, select Save as variables, select the Regression method, select Display factor score coefficient matrix, and click on Continue.
This will take you back to the window shown in FA Snapshot 5; in this window, now click on OK. SPSS will produce the following output, which we shall explain briefly.
Factor Analysis
Correlation Matrix (variables in the order: (1) Vehicle type, (2) Price in thousands, (3) Engine size, (4) Horsepower, (5) Wheelbase, (6) Width, (7) Length, (8) Curb weight, (9) Fuel capacity, (10) Fuel efficiency)

                       (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)    (10)
(1) Vehicle type      1.000   -.042    .269    .017    .397    .260    .150    .526    .599   -.577
(2) Price in thous.   -.042   1.000    .624    .841    .108    .328    .155    .527    .424   -.492
(3) Engine size        .269    .624   1.000    .837    .473    .692    .542    .761    .667   -.737
(4) Horsepower         .017    .841    .837   1.000    .282    .535    .385    .611    .505   -.616
(5) Wheelbase          .397    .108    .473    .282   1.000    .681    .840    .651    .657   -.497
(6) Width              .260    .328    .692    .535    .681   1.000    .706    .723    .663   -.602
(7) Length             .150    .155    .542    .385    .840    .706   1.000    .629    .571   -.448
(8) Curb weight        .526    .527    .761    .611    .651    .723    .629   1.000    .865   -.820
(9) Fuel capacity      .599    .424    .667    .505    .657    .663    .571    .865   1.000   -.802
(10) Fuel efficiency  -.577   -.492   -.737   -.616   -.497   -.602   -.448   -.820   -.802   1.000
This is the correlation matrix. The PCA can be carried out if the correlation matrix for the variables contains at least two correlations of 0.30 or greater. It may be noted that many of the correlations in the above matrix exceed 0.3.
KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy              .833
Bartlett's Test of Sphericity     Approx. Chi-Square     1578.819
                                  df                           45
                                  Sig.                       .000
The KMO measure of sampling adequacy is an index used to test the appropriateness of the factor analysis. The minimum required KMO is 0.5. The above table shows that the index for this data is 0.833 and that the chi-square statistic of Bartlett's test of sphericity is significant (p < 0.05). This means the principal component analysis is appropriate for this data.
Communalities
                      Initial    Extraction
Vehicle type           1.000        .930
Price in thousands     1.000        .876
Engine size            1.000        .843
Horsepower             1.000        .933
Wheelbase              1.000        .881
Width                  1.000        .776
Length                 1.000        .919
Curb weight            1.000        .891
Fuel capacity          1.000        .861
Fuel efficiency        1.000        .860
Extraction communalities are estimates of the variance in each variable accounted for by the
components. The communalities in this table are all high, which indicates that the extracted
components represent the variables well. If any communalities are very low in a principal
components extraction, you may need to extract another component.
Total Variance Explained (Initial Eigenvalues)
Component   Total   % of Variance   Cumulative %
1           5.994       59.938          59.938
2           1.654       16.545          76.482
3           1.123       11.227          87.709
4            .339        3.389          91.098
5            .254        2.541          93.640
6            .199        1.994          95.633
7            .155        1.547          97.181
8            .130        1.299          98.480
9            .091         .905          99.385
10           .061         .615         100.000
This output gives the total variance explained, i.e. the total variance contributed by each component. We can see that the percentage of total variance contributed by the first component is 59.938, by the second component 16.545 and by the third component 11.227. It is also clear from this table that there are three distinct components (with eigenvalues greater than 1) for the given set of variables.
Scree Plot (eigenvalues plotted against the component number, 1 to 10)
The scree plot plots the eigenvalues against the component number and helps to determine the optimal number of components. Incidentally, "scree" is the geological term for the debris that collects on the lower part of a rocky slope. A steep slope at a component indicates that a good percentage of the total variance is explained by that component, and hence the component is justified; a shallow slope indicates that the contribution to total variance is small, and the component is not justified. In the above plot, the first three components have a steep slope and thereafter the slope is shallow. This indicates that the ideal number of components is three.
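These eigenvalues and the scree criterion can also be reproduced outside SPSS. The sketch below is a minimal Python illustration; the file name car_sales.csv and the column names are assumptions about how the SPSS sample file might be exported to CSV.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed CSV export of car_sales.sav restricted to the ten numeric variables used above.
cols = ["type", "price", "engine_s", "horsepow", "wheelbas",
        "width", "length", "curb_wgt", "fuel_cap", "mpg"]
df = pd.read_csv("car_sales.csv", usecols=cols).dropna()

# PCA on the correlation matrix: eigenvalues give the variance explained by each component.
R = np.corrcoef(df.values, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pct = 100 * eigvals / eigvals.sum()
print("Eigenvalues:   ", np.round(eigvals, 3))
print("% of variance: ", np.round(pct, 3))
print("Cumulative %:  ", np.round(pct.cumsum(), 3))
print("Components with eigenvalue > 1:", int((eigvals > 1).sum()))

# Scree plot: eigenvalues against the component number.
plt.plot(range(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("Component Number"); plt.ylabel("Eigenvalue"); plt.title("Scree Plot")
plt.show()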
Component Matrix(a)
                       Component 1   Component 2   Component 3
Vehicle type               .471          .533         -.651
Price in thousands         .580         -.729         -.092
Engine size                .871         -.290          .018
Horsepower                 .740         -.618          .058
Wheelbase                  .732          .480          .340
Width                      .821          .114          .298
Length                     .719          .304          .556
Curb weight                .934          .063         -.121
Fuel capacity              .885          .184         -.210
Fuel efficiency           -.863          .004          .339
This table gives each variable's component loadings, but it is the next table that is easier to interpret.
Rotated Component Matrix(a)
                       Component 1   Component 2   Component 3
Vehicle type              -.101          .095          .954
Price in thousands         .935         -.003          .041
Engine size                .753          .436          .292
Horsepower                 .933          .242          .056
Wheelbase                  .036          .884          .314
Width                      .384          .759          .231
Length                     .155          .943          .069
Curb weight                .519          .533          .581
Fuel capacity              .398          .495          .676
Fuel efficiency           -.543         -.318         -.681
This table is the most important table for interpretation. The maximum of each row (ignoring sign) indicates the component to which the respective variable belongs. The variables Price in thousands, Engine size and Horsepower are highly correlated and contribute to a single component; Wheelbase, Width and Length contribute to the second component; and Vehicle type, Curb weight, Fuel capacity and Fuel efficiency contribute to the third component.
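The varimax rotation that produces this table can also be sketched directly. The following minimal Python sketch implements the classic varimax algorithm and applies it to the unrotated loadings of the first three components; it assumes the same hypothetical car_sales.csv export as in the earlier sketch, and its output should agree with the SPSS table only up to the sign and the order of the columns.

import numpy as np
import pandas as pd

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # Classic varimax rotation of a (variables x factors) loading matrix.
    p, k = loadings.shape
    rotation = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = loadings @ rotation
        grad = loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt
        if s.sum() < crit * (1 + tol):       # stop when the criterion no longer improves
            break
        crit = s.sum()
    return loadings @ rotation

cols = ["type", "price", "engine_s", "horsepow", "wheelbas",
        "width", "length", "curb_wgt", "fuel_cap", "mpg"]
df = pd.read_csv("car_sales.csv", usecols=cols).dropna()     # assumed export, as before

R = np.corrcoef(df.values, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs[:, :3] * np.sqrt(eigvals[:3])             # unrotated component loadings
print(pd.DataFrame(np.round(varimax(loadings), 3), index=cols))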
Component Transformation Matrix
Component       1        2        3
1             .601     .627     .495
2            -.797     .422     .433
3            -.063     .655    -.753
Component Score Coefficient Matrix
                       Component 1   Component 2   Component 3
Vehicle type              -.173         -.194          .615
Price in thousands         .414         -.179         -.081
Engine size                .226          .028         -.016
Horsepower                 .368         -.046         -.139
Wheelbase                 -.177          .397         -.042
Width                      .011          .289         -.102
Length                    -.105          .477         -.234
Curb weight                .070          .043          .175
Fuel capacity              .012          .017          .262
Fuel efficiency           -.107          .108         -.298
This table gives the component score coefficients for each variable. The component scores can be saved for each case in the SPSS file; these scores are useful for replacing the inter-correlated original variables in a regression analysis. In the above table the coefficients are given component-wise, and the score for each component is calculated as the linear combination of the (standardized) variables weighted by these coefficients.
Component Score Covariance Matrix
Component       1        2        3
1            1.000     .000     .000
2             .000    1.000     .000
3             .000     .000    1.000
Carry out relevant analysis and write a report to discuss the findings for the above data.
Common Factor Analysis Using SPSS
The initial process of conducting common factor analysis is exactly the same as for principal component analysis, except for the extraction method selected (shown in FA Snapshot 4). We will discuss only the steps that differ from the principal component analysis shown above. The following steps are carried out to run the factor analysis using SPSS.
1. Open the file telcom.sav.
2. Click on Analyze → Data Reduction → Factor, as shown in FA Snapshot 1.
3. The following window will be opened by SPSS.
FA Snapshot 8
Select the variables Q1, Q2, Q3, ... through Q12.
4. Click on Descriptives; click on Coefficients, Initial solution and KMO and Bartlett's test of sphericity, and also select Anti-image, as shown in FA Snapshot 3. It may be noted that we did not select Anti-image in PCA, but we are required to select it here.
5. Click on Extraction; the following window will be opened by SPSS.
FA Snapshot 9
In the Extraction dialog, select Correlation matrix, select Unrotated factor solution, and click on Continue.
6. SPSS will take you back to the window shown in FA Snapshot 8. At this stage click on Rotation; the window SPSS opens is shown in FA Snapshot 6.
7. Select Varimax rotation, select Display rotated solution and click Continue, as shown in FA Snapshot 6.
8. It may be noted that in the PCA steps of FA Snapshot 7 we chose to save the factor scores as variables; that is not required here.
The following output will be generated by SPSS.
Factor Analysis
Descriptive Statistics
        Mean   Std. Deviation   Analysis N
Q1      4.80       1.568            35
Q2      3.20       1.410            35
Q3      2.83       1.671            35
Q4      3.89       1.605            35
Q5      3.09       1.245            35
Q6      3.49       1.772            35
Q7      3.23       1.734            35
Q8      3.86       1.611            35
Q9      3.46       1.633            35
Q10     3.74       1.615            35
Q11     3.17       1.505            35
Q12     3.60       1.866            35
These are the descriptive statistics given by SPSS; they provide a general understanding of the variables.
Correlation Matrix
(The 12 x 12 correlation matrix of Q1 through Q12 is reported here; a number of the off-diagonal correlations exceed 0.30.)
This is the correlation matrix. The common factor analysis can be carried out if the correlation matrix for the variables contains at least two correlations of 0.30 or greater. It may be noted that several of the correlations exceed 0.3.
KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy              .658
Bartlett's Test of Sphericity     Approx. Chi-Square      497.605
                                  df                           66
                                  Sig.                       .000
The KMO measure of sampling adequacy is an index used to test the appropriateness of the factor analysis. The minimum required KMO is 0.5. The above table shows that the index for this data is 0.658 and that the chi-square statistic of Bartlett's test is significant (.000 < 0.05). This means that the factor analysis is appropriate for this data.
Communalities
        Initial   Extraction
Q1       .980        .977
Q2       .730        .607
Q3       .926        .975
Q4       .978        .996
Q5       .684        .753
Q6       .942        .917
Q7       .942        .941
Q8       .396        .379
Q9       .949        .942
Q10      .872        .873
Q11      .934        .924
Q12      .916        .882
Initial communalities are the proportion of variance accounted for in each variable by the rest of the
variables. Small communalities for a variable indicate that the proportion of variance that this
variable shares with other variables is too small. Thus, this variable does not fit the factor solution. In
the above table, most of the initial communalities are very high indicating that all the variables share
a good amount of variance with each other, an ideal situation for factor analysis.
Extraction communalities are estimates of the variance in each variable accounted for by the factors
in the factor solution. The communalities in this table are all high. It indicates that the extracted
factors represent the variables well.
Total Variance Explained (Initial Eigenvalues)
Factor   Total   % of Variance   Cumulative %
1        4.789       39.908          39.908
2        3.035       25.288          65.196
3        1.675       13.960          79.156
4        1.321       11.006          90.163
5         .526        4.382          94.545
6         .275        2.291          96.836
7         .157        1.307          98.143
8         .093         .774          98.917
9         .050         .421          99.338
10        .042         .353          99.691
11        .027         .227          99.918
12        .010         .082         100.000
This output gives the variance explained by the initial solution, i.e. the total variance contributed by each factor. We may note that the percentage of total variance contributed by the first factor is 39.908, by the second factor 25.288 and by the third factor 13.960. It may be noted that the percentage of total variance is highest for the first factor and decreases thereafter. It is also clear from this table that there are four distinct factors (with eigenvalues greater than 1) for the given set of variables.
Scree Plot (eigenvalues plotted against the factor number, 1 to 12)
The scree plot plots the eigenvalues against the factor number and helps to determine the optimal number of factors. A steep slope at a factor indicates that a larger percentage of the total variance is explained by that factor, while a shallow slope indicates that the contribution to total variance is small. In the above plot, the first four factors have a steep slope and thereafter the slope is shallow. It may also be noted that the number of factors with eigenvalue greater than one is four. Hence the ideal number of factors is four.
Factor Matrix(a)
         Factor 1   Factor 2   Factor 3   Factor 4
Q1        -.410       .564       .698       .066
Q2         .522      -.065       .139       .557
Q3         .687      -.515       .423      -.245
Q4        -.413       .528       .725       .141
Q5         .601      -.500      -.067       .371
Q6         .705       .630      -.111      -.102
Q7         .808       .486      -.176      -.144
Q8         .293       .005      -.031       .540
Q9         .690       .682       .039       .019
Q10        .727      -.372       .401      -.213
Q11        .757       .575      -.137      -.040
Q12        .644      -.523       .432      -.090
This table gives each variable's factor loadings, but it is the next table that is easier to interpret.
Rotated Factor Matrix(a)
         Factor 1   Factor 2   Factor 3   Factor 4
Q1         .001      -.147       .972      -.109
Q2         .195       .253      -.002       .711
Q3         .084       .970      -.146       .074
Q4        -.042      -.137       .987      -.035
Q5         .006       .456      -.430       .600
Q6         .953       .053      -.004       .078
Q7         .939       .161      -.162       .086
Q8         .117      -.008      -.040       .603
Q9         .936       .068       .164       .186
Q10        .210       .899      -.101       .103
Q11        .943       .078      -.059       .158
Q12        .022       .909      -.110       .206
This table is the most important table for interpretation. The maximum in each row (ignoring sign) indicates the factor to which the respective variable belongs. For example, in the first row the maximum is 0.972, which is for factor 3; this indicates that Q1 contributes to the third factor. In the second row the maximum is 0.711, for factor 4, indicating that Q2 contributes to factor 4, and so on.
The variables Q6, Q7, Q9, and Q11 are highly correlated and contribute to a single factor
which can be named as Factor 1 or Economy.
The variables Q3, Q10, and Q12 are highly correlated and contribute to a single factor which can
be named as Factor 2 or Services beyond Calling.
The variables Q1 and Q4 are highly correlated and contribute to a single factor which can be
named as Factor 3 or Customer Care.
The variables Q2, Q5, and Q8 are highly correlated and contribute to a single factor which can
be named as Factor 4 or Anytime Anywhere Service.
We may summarise the above analysis in the following Table.
Factors                                Questions
Factor 1: Economy                      Q.6. Call rates and tariff plans
                                       Q.7. Additional features like unlimited SMS, lifetime prepaid, 2 phones free calling, etc.
                                       Q.9. SMS and Value Added Services charges
                                       Q.11. Roaming charges
Factor 2: Services beyond Calling      Q.3. Internet connection, with good speed
                                       Q.10. Value Added Services like MMS, caller tunes, etc.
                                       Q.12. Conferencing
Factor 3: Customer Care                Q.1. Availability of services (like drop boxes and different payment options in case of post paid, and recharge coupons in case of pre paid)
                                       Q.4. Quick and appropriate response at the customer care centre
Factor 4: Anytime Anywhere Service     Q.2. Good network connectivity all through the city
                                       Q.5. Connectivity while roaming (out of the state or out of the country)
                                       Q.8. Quality of network service like minimum call drops, minimum down time, voice quality, etc.
This implies that a telecom service provider should consider these four factors, which customers feel are important while selecting or switching a service provider.
7 Canonical Correlation
Canonical correlation analysis relates a set of metric criterion (dependent) variables to a set of metric predictor (independent) variables:
Y1 + Y2 + ... + Yq    =    X1 + X2 + ... + Xp
(metric)                   (metric)
Some Indicative Applications:
A medical researcher could be interested in determining whether individuals' lifestyles and personal habits have an impact on their health, as measured by a number of health-related variables such as hypertension, weight, blood sugar, etc.
The marketing manager of a consumer goods firm could be interested in determining if there is a relationship between the types of products purchased and consumers' income and profession.
The practical significance of a canonical correlation is that it indicates how much variance in one set of variables is accounted for by another set of variables. Squared canonical correlations are referred to as canonical roots or eigenvalues.
If X1, X2, X3, ..., Xp and Y1, Y2, Y3, ..., Yq are the observable variables, then the canonical variables will be
U1 = a1X1 + a2X2 + ... + apXp
V1 = b1Y1 + b2Y2 + ... + bqYq
U2 = c1X1 + c2X2 + ... + cpXp
V2 = d1Y1 + d2Y2 + ... + dqYq
and so on. The U's and V's are called canonical variables, and the coefficients are called canonical coefficients.
The first pair of sample canonical variables is obtained in such a way that
Var(U1) = Var(V1) = 1
and the correlation between U1 and V1 is maximized.
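A minimal Python sketch of how the canonical variables and canonical correlations can be obtained is given below. The data are simulated purely for illustration: X stands for a set of metric predictors and Y for a set of metric criterion variables; neither is the Section 8.5 survey data.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
n, p, q = 45, 5, 3
X = rng.normal(size=(n, p))                                              # predictor set X1 ... Xp
Y = X[:, :q] @ rng.normal(size=(q, q)) + 0.5 * rng.normal(size=(n, q))   # criterion set Y1 ... Yq

cca = CCA(n_components=q)
U, V = cca.fit_transform(X, Y)            # canonical variables U_i and V_i

# Canonical correlations: the correlation between each pair (U_i, V_i).
canon_corr = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(q)]
print(np.round(canon_corr, 3))
print(np.round(np.square(canon_corr), 3))  # squared canonical correlations (canonical roots)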
Canonical correlation also measures the association between the discriminant scores and the groups in discriminant analysis. For two groups, it is the usual Pearson correlation between the scores and the groups coded as 0 and 1.
In terms of the between-group (SSB), within-group (SSW) and total (SST) sums of squares of the discriminant scores:
CR2 = SSB / SST
Eigenvalue = SSB / SSW
Wilks' Lambda (Λ) = SSW / SST = 1 / (1 + Eigenvalue)
Wilks' Lambda (Λ) is the proportion of the total variance in the discriminant scores not explained by differences among the groups. It is used to test H0 that the group means of the variables are equal. The statistic
-{ n - 1 - (p + G)/2 } loge Λ
is approximately chi-square distributed with p(G - 1) degrees of freedom, where
p : number of variables
G : number of groups
Since 0 ≤ Λ ≤ 1, if Λ is small, H0 is rejected; if it is high, H0 is accepted.
MDA & PCA : Similarities and Differences
In both cases, a new axis is identified and a new variable is formed as a linear combination of
the original variables. The new variable is given by the projection of the points onto this new axis.
The difference is with respect to the criterion used to identify the new axis.
In PCA, a new axis is formed such that the projection of the points onto this new axis accounts for the maximum variance in the data. This is equivalent to maximizing SST, because there is no criterion variable for dividing the sample into groups.
In MDA, the objective is not to account for the maximum variance in the data (i.e. maximum SST), but to maximize the ratio of the between-group to the within-group sum of squares (i.e. SSB/SSW), which results in the best discrimination between the groups. The new axis, or the new linear combination that is identified, is called the linear discriminant function. The projection of an observed point onto this discriminant function (i.e. the value of the new variable) is called the discriminant score.
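The sketch below illustrates these quantities on a hypothetical two-group data set: the discriminant scores come from a fitted linear discriminant function, and the canonical correlation, eigenvalue and Wilks' Lambda are then computed exactly as defined above. The simulated data and group sizes are assumptions made only for the example.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 4)),       # group 0
               rng.normal(1.5, 1.0, size=(20, 4))])      # group 1
y = np.array([0] * 20 + [1] * 20)                         # group membership coded 0 / 1

lda = LinearDiscriminantAnalysis()
scores = lda.fit_transform(X, y).ravel()                  # discriminant scores

canonical_r = np.corrcoef(scores, y)[0, 1]                # Pearson r between scores and groups
eigenvalue = canonical_r ** 2 / (1 - canonical_r ** 2)    # SSB / SSW
wilks_lambda = 1.0 / (1.0 + eigenvalue)                   # 1 / (1 + eigenvalue)
print(round(canonical_r, 3), round(eigenvalue, 3), round(wilks_lambda, 3))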
Application: Asset Liability Mismatch
A Study on Structural changes and Asset Liability Mismatch in Scheduled Commercial Banks in
India was carried out in Reserve Bank of India. It was conducted as an empirical exercise to identify
and explore the relationships and structural changes, including hedging behaviour, between asset
and liability of a cross-section of scheduled commercial banks at two different time points representing pre- and post-liberalisation periods. As there were two sets of dependent variables, instead of regression, the study used the canonical correlation technique to investigate the asset-liability relationship of the banks at the two time points.
7.1 Canonical Correlation Using SPSS
We will use the case data on Commodity market perceptions given in Section 8.5.
Open the file Commodity.sav
Select from the menu Analyze General Linear Model Multivariate as shown below
CANONICAL Snapshot 1
1. Select the dependent variables as Invest_SM, Invest_CM and Invest_MF.
2. Select the metric independent variables (age, the ratings and the risk perceptions) as covariates.
3. Click OK.
It may be noted that the above example is discussed in Section 5.1. The difference between MANCOVA and canonical correlation is that MANCOVA can have both factors and metric independent variables, whereas canonical correlation can have only metric independent variables; factors (categorical independent variables) are not possible in canonical correlation.
We are assuming in the above example that the dependent variables are the investments in the commodity market, the share market and mutual funds. The metric independent variables are age, the respondents' ratings of the commodity market, share market, fixed deposits and mutual funds, and the respondents' perceptions of risk for the commodity market, share market, fixed deposits and mutual funds. Here we assume that their investments depend on their ratings, their age and their risk perceptions.
The following output will be displayed
General Linear Model
Multivariate Tests(b)
(For each effect, the four multivariate test statistics share the same exact F, degrees of freedom and significance.)

Effect      Pillai's Trace   Wilks' Lambda   Hotelling's Trace   Roy's Largest Root   F(a)    Hypothesis df   Error df   Sig.
Intercept        .022            .978             .023                 .023            .241       3.000        32.000    .867
Rate_CM          .024            .976             .025                 .025            .267       3.000        32.000    .848
Rate_SM          .026            .974             .026                 .026            .280       3.000        32.000    .839
risky_CM         .123            .877             .140                 .140           1.497       3.000        32.000    .234
risky_SM         .044            .956             .046                 .046            .490       3.000        32.000    .692
Age              .152            .848             .179                 .179           1.914       3.000        32.000    .147
Rate_FD          .031            .969             .032                 .032            .338       3.000        32.000    .798
Rate_MF          .092            .908             .101                 .101           1.075       3.000        32.000    .373
risky_FD         .145            .855             .170                 .170           1.814       3.000        32.000    .164
risky_MF         .001            .999             .001                 .001            .012       3.000        32.000    .998

a. Exact statistic
b. Design: Intercept+Rate_CM+Rate_SM+risky_CM+risky_SM+Age+Rate_FD+Rate_MF+risky_FD+risky_MF
This table indicates that the null hypotheses for age, the ratings of CM, SM, FD and MF, and the risk perceptions of CM, SM, FD and MF are not rejected (as the p-values are greater than 0.05); this means that these variables do not significantly affect the investments when the three investment variables are considered jointly.
Tests of Between-Subjects Effects

Source            Dependent Variable   df   Mean Square     F       Sig.
Corrected Model   Invest_CM            9    728267187.3     2.363   .034
                  Invest_SM            9    4250227452      1.723   .122
                  Invest_MF            9    3842114308      1.269   .289
Intercept         Invest_CM            1    41236149.02      .134   .717
                  Invest_SM            1    1820226683       .738   .396
                  Invest_MF            1    2063858495       .682   .415
Rate_CM           Invest_CM            1    3146747.812      .010   .920
                  Invest_SM            1    701640356.6      .284   .597
                  Invest_MF            1    91570171.09      .030   .863
Rate_SM           Invest_CM            1    5827313.435      .019   .891
                  Invest_SM            1    305120.840       .000   .991
                  Invest_MF            1    289045935.2      .095   .759
risky_CM          Invest_CM            1    1383505869      4.490   .041
                  Invest_SM            1    3599660366      1.459   .235
                  Invest_MF            1    4765133773      1.574   .218
risky_SM          Invest_CM            1    5068043.704      .016   .899
                  Invest_SM            1    3129346168      1.269   .268
                  Invest_MF            1    3115327283      1.029   .318
Age               Invest_CM            1    1759288000      5.709   .023
                  Invest_SM            1    4483914274      1.818   .187
                  Invest_MF            1    5852257582      1.933   .173
Rate_FD           Invest_CM            1    314194023.6     1.020   .320
                  Invest_SM            1    1038652882       .421   .521
                  Invest_MF            1    1418050354       .468   .498
Rate_MF           Invest_CM            1    709118825.7     2.301   .139
                  Invest_SM            1    4469496231      1.812   .187
                  Invest_MF            1    3743054315      1.236   .274
risky_FD          Invest_CM            1    925863327.7     3.005   .092
                  Invest_SM            1    7133105111      2.891   .098
                  Invest_MF            1    4644806048      1.534   .224
risky_MF          Invest_CM            1    112028.404       .000   .985
                  Invest_SM            1    46590657.39      .019   .892
                  Invest_MF            1    14383175.06      .005   .945
Error             Invest_CM            34   308143679.0
                  Invest_SM            34   2466958509
                  Invest_MF            34   3027453031
Total             Invest_CM            44
                  Invest_SM            44
                  Invest_MF            44
Corrected Total   Invest_CM            43
                  Invest_SM            43
                  Invest_MF            43
The above table reports a separate model for each dependent variable: one for Invest_CM, one for Invest_SM and one for Invest_MF. The table also indicates the individual relationship between each dependent and independent variable pair. It shows that only two pairs, namely risky_CM with Invest_CM (p = .041) and Age with Invest_CM (p = .023), are significant (p-value less than 0.05). This indicates that the independent variable, the consumers' perception of the risk of commodity markets (variable name risky_CM), significantly affects the dependent variable, i.e. their investment in commodity markets; in other words, the riskiness perceived by consumers affects their investments in that market. Similarly, the variable Age also impacts their investments in commodity markets. All other combinations are not significant.
8 Cluster Analysis
This type of analysis is used to divide a given number of entities or objects into groups called
clusters. The objective is to classify a sample of entities into a small number of mutually exclusive
clusters based on the premise that they are similar within the clusters but dissimilar among the
clusters. The criterion for similarity is defined with reference to some characteristics of the entity. For
example, for companies, it could be Sales, Paid up Capital, etc.
The basic methodological questions that are to be answered in the cluster analysis are:
What are the relevant variables and descriptive measures of an entity?
How do we measure the similarity between entities?
Given that we have a measure of similarity between entities, how do we form
clusters?
How do we decide on how many clusters are to be formed?
For measuring similarity, let us consider the following data. Data has been collected on each of k characteristics for all the n entities under consideration for being divided into clusters. Let the k characteristics be measured by k variables x1, x2, x3, ..., xk, and the data be presented in the following matrix form:

               Variables
               x1      x2      ...     xk
Entity 1       x11     x12     ...     x1k
Entity 2       x21     x22     ...     x2k
...
Entity n       xn1     xn2     ...     xnk
The question is how to determine how similar or dissimilar each row of data is from the others?
This task of measuring similarity between entities is complicated by the fact that, in most cases, the
data in its original form are measured in different units or/and scales. This problem is solved by
standardizing each variable by subtracting its mean from its value and then dividing by its standard
deviation. This converts each variable to a pure number.
The measure used to define the similarity between two entities, i and j, is computed as
Dij = (xi1 - xj1)² + (xi2 - xj2)² + ... + (xik - xjk)²
The smaller the value of Dij, the more similar are the two entities.
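A minimal sketch of this computation is given below: each variable is standardized and the squared Euclidean distance D_ij is then obtained for every pair of entities. The four entities and the two variables are hypothetical values chosen only for illustration.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 4 entities measured on 2 variables (say, deposits and loans).
X = np.array([[90.0, 80.0],     # entity 1
              [85.0, 75.0],     # entity 2
              [20.0, 15.0],     # entity 3
              [25.0, 10.0]])    # entity 4

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # standardize each variable

D = squareform(pdist(Z, metric="sqeuclidean"))        # D_ij = sum_k (z_ik - z_jk)^2
print(np.round(D, 3))        # the smaller D_ij, the more similar entities i and j are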
The basic method of clustering is illustrated through a simple example given below.
Let there be four branches of a commercial bank each described by two variables viz. deposits and
loans / credits. The following chart gives an idea of their deposits and loans / credits.
(Chart: the four branches x(1) to x(4) plotted with Deposits on the horizontal axis and Loans/Credit on the vertical axis; x(1) and x(2) lie in the high-deposit, high-credit region, while x(3) and x(4) lie in the low-deposit, low-credit region.)
From the above chart, it is obvious that if we want two clusters, we should group branches 1 and 2 (high deposit, high credit) into one cluster, and branches 3 and 4 (low deposit, low credit) into another, since such a grouping produces clusters within which the entities (branches) are most similar to each other. However, this graphical approach is not convenient for more than two variables.
In order to develop a mathematical procedure for forming the clusters, we need a criterion upon
which to judge alternative clustering patterns. This criterion defines the optimal number of entities
within each cluster.
Now we shall illustrate the methodology of using distances among the entities from clusters. We
assume the following distance similarity matrix among three entities.
Distance or Similarity Matrix
        1     2     3
1       0     5    10
2       5     0     8
3      10     8     0
The possible clusters and their respective distances are:

Cluster I   Cluster II   Distance Within the       Distance Between the    Total
                         Two-Entity Cluster        Two Clusters            Distance
1           2 & 3        8                         15 (= 5 + 10)           23
2           1 & 3        10                        13 (= 5 + 8)            23
3           1 & 2        5                         18 (= 8 + 10)           23
Thus the best clustering would be to cluster entities 1 and 2 together. This would yield the minimum distance within clusters (= 5) and simultaneously the maximum distance between clusters (= 18).
Obviously, if the number of entities is large, it is a prohibitive task to construct every possible cluster pattern, compute each within-cluster distance and select the pattern which yields the minimum; if the number of variables and dimensions is large, computers are needed.
The criterion of minimizing the within-cluster distances to form the best possible grouping into k clusters assumes that k clusters are to be formed. If the number of clusters is not fixed a priori, the criterion will not specify the optimal number of clusters. However, if the objective is to minimize the sum of the within-cluster distances and the number of clusters is free to vary, then all that is required is to make each entity its own cluster, and the sum of the within-cluster distances will be zero. Obviously, the more the number of clusters, the smaller will be the sum of within-cluster distances. Thus, making each entity its own cluster is of no value, and this issue is therefore resolved intuitively.
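For the three-entity illustration above, the same grouping can be obtained programmatically. The sketch below feeds the distance matrix to a single-linkage hierarchical clustering routine and cuts the resulting tree into two clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Distance (similarity) matrix for the three entities used in the illustration.
D = np.array([[0.0,  5.0, 10.0],
              [5.0,  0.0,  8.0],
              [10.0, 8.0,  0.0]])

Z = linkage(squareform(D), method="single")    # linkage() expects the condensed form
print(Z)                                       # merge history (an agglomeration schedule)

labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)      # entities 1 and 2 fall into one cluster, entity 3 into the other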
Discriminant and Cluster Analysis
It may be noted that though both discriminant analysis and cluster analysis are classification
techniques, there is a basic difference between the two. In DA, the data are
classified into a given set of categories using some prior information about the data. The entire rules of
classification are based on the categorical dependent variable and the tolerance of the model. But the
Cluster analysis does not assume any dependent variable. It uses different methods of classification to
classify the data into some groups without any prior information. The cases with similar data would
be in the same group, and the cases with distinct data would be classified in different groups.
Most computer oriented programs find the optimum number of clusters through their own
algorithm. We have described the methods of forming clusters in Section 8.3, and the use of SPSS
package in Section 8.5
8.1 Some Applications of Cluster Analysis
Following are two applications of cluster analysis in the banking system:
(i) Commercial Bank
In one of the banks in India, all its branches were formed into clusters by taking 15 variables
representing various aspects of the functioning of the branches. The variables are: four types of deposits, four types of advances, and miscellaneous business such as drafts issued, receipts on behalf of Government, foreign exchange business, etc. The bank uses this grouping of branches into clusters for collecting information or conducting quick surveys to study any aspect, as well as for planning, analysing and monitoring. If any sample survey is to be conducted, it is ensured that samples are taken from branches in all the clusters so as to get a true representative of the entire bank. Since the branches in the same cluster are more or less similar to each other, only a few branches are selected from each cluster.
(ii) Agricultural Clusters
A study was conducted by one of the officers of the Reserve Bank of India, to form clusters of
geographic regions of the country based on agricultural parameters like cropping pattern, rainfall,
land holdings, productivity, fertility, use of fertilisers, irrigation facilities, etc. The whole country
was divided into 9 clusters. Thus, all the 67 regions of the country were allocated to these clusters.
Such classification is useful for making policies at the national level as also at regional/cluster
levels.
8.2 Key Terms in Cluster Analysis
Agglomeration Schedule
Cluster Centroid
Cluster Centers
Cluster Membership
Dendrogram
Icicle Diagram
Similarity/Distance Coefficient Matrix
The hierarchical agglomerative methods differ in how the distance between two clusters is defined: single linkage uses the minimum distance (nearest neighborhood), complete linkage uses the maximum distance (furthest neighborhood), and average linkage uses the average distance between all pairs of objects. This is explained in the diagram provided below.
Centroid method: this method considers the distance between the two centroids; the centroid is the mean of all the variables.
Variance method: this is commonly termed Ward's method; it uses the squared distance from the means.
Diagram
a) Single Linkage (Minimum Distance / Nearest Neighborhood)
b) Complete Linkage (Maximum Distance / Furthest Neighborhood)
c) Average Linkage (Average Distance)
d) Centroid Method
e) Ward's Method (Variance Method)
(Each panel of the diagram shows Cluster 1 and Cluster 2 with the corresponding inter-cluster distance marked.)
above mentioned options. However, the limitation of the study is that it considers investors from
Mumbai only, and hence, might not be representative of the entire country.
An investment is a commitment of funds made in the expectation of some positive rate of return. If
properly undertaken, the return will be commensurate with the risk the investor assumes.
An analysis of the backgrounds and perceptions of the investors was undertaken in the report. The
data used in the analysis was collected by e-mailing and distributing the questionnaire among friends,
relatives and colleagues. 45 people were surveyed, and were asked various questions relating to their
backgrounds and knowledge about the investment markets and options. The raw data contains a wide
range of information, but only the data relevant to the objective of the study was considered.
The questionnaire used for the study is as follows
QUESTIONNAIRE
Age: _________
Occupation:
o SELF EMPLOYED
o GOVERNMENT
o STUDENT
o HOUSEWIFE
o DOCTOR
o ENGINEER
o CORPORATE PROFESSIONAL
o OTHERS (PLEASE SPECIFY) : ________________________
o
Gender:
o MALE
o FEMALE
1. RATE THE FOLLOWING ON THE PREFERENCE OF INVESTMENT {1 = Least Preferred, 5 = Most Preferred; tick the appropriate number}
Question                1   2   3   4   5
COMMODITY MARKET
STOCK MARKET
FIXED DEPOSITS
Mutual Funds
2. HOW MUCH YOU ARE READY TO INVEST In COMMODITY MARKET?
_______________
3. HOW MUCH YOU ARE READY TO INVEST In STOCK MARKET ? ______________
4. HOW MUCH YOU ARE READY TO INVEST In FIXED DEPOSITS? ______________
5. HOW MUCH YOU ARE READY TO INVEST In MUTUAL FUNDS? ______________
6. FOR HOW MUCH TIME WOULD YOU BLOCK YOUR MONEY WITH INVESTMENTS?
o UNDER 5 MONTHS
o 6-12 MONTHS
o 1-3 YEARS
o MORE THAN 3 YEARS
7. ON A SCALE OF 1-10 (1 = LEAST RISKY & 10 = MOST RISKY), HOW RISKY DO YOU THINK IS THE COMMODITY MARKET? (Circle the appropriate number)
SAFE  1  2  3  4  5  6  7  8  9  10  RISKY
8. ON A SCALE OF 1-10, HOW RISKY DO YOU THINK IS THE STOCK MARKET?
SAFE  1  2  3  4  5  6  7  8  9  10  RISKY
9. ON A SCALE OF 1-10, HOW RISKY DO YOU THINK ARE FIXED DEPOSITS?
SAFE  1  2  3  4  5  6  7  8  9  10  RISKY
10. ON A SCALE OF 1-10, HOW RISKY DO YOU THINK ARE MUTUAL FUNDS?
SAFE  1  2  3  4  5  6  7  8  9  10  RISKY
(A snapshot of the coded data file, with the responses of the 45 respondents, is used for the analysis below. Its columns are: Age, Occu, Sex, Rate CM, Rate SM, Rate FD, Rate MF, Inv CM, Inv SM, Inv FD, Inv MF, Block Money, Risky CM, Risky SM, Risky FD and Risky MF.)
CA Snapshot 2
1. Select the variables Age and Rate_CM through Risky_MF.
2. Clustering of 'Variables' can be selected if one wants to perform cluster analysis on variables rather than cases (like factor analysis); the default is Cases.
3. Click on Plots; the following window will be opened.
CA Snapshot 3
Select Dendrogram and click on Continue.
The next step is to select the clustering measure. The most common measure is the squared Euclidean distance.
CA Snapshot 5
1. Select Squared Euclidean distance from the list of measures.
2. Select Standardize: Z scores, since our data has scale differences among the variables.
3. Click on Continue.
SPSS will be back to the window shown in CA Snapshot 2. At this stage click on Save, and the following window will be displayed.
CA Snapshot 6
SPSS gives the option to save the cluster membership. This is useful when deciding how many clusters should be formed for the given data, or for understanding and analysing the formed clusters by performing ANOVA on the clusters; for the ANOVA, saving the cluster membership is required. One may save a single solution (if sure of the number of clusters) or a range of solutions. The default is None, and at this stage we will keep the default.
Click on Continue, and SPSS will be back to the window shown in CA Snapshot 2. At this stage click OK. The following output will be displayed; we shall discuss this output in detail.
Proximities
Case Processing Summary(a)
Valid: N = 44 (97.8%)     Missing: N = 1 (2.2%)     Total: N = 45 (100.0%)
This table gives the case processing summary and its percentages. It indicates that 44 out of the 45 cases are valid. Since one case has some missing values, it is excluded from the analysis.
Cluster
Single Linkage
This is the method we selected for cluster analysis
Agglomeration Schedule
Stage   Cluster 1   Cluster 2   Coefficients   Next Stage
1           6          42           .000           43
2           7          43           .829           20
3           2          10          2.190           16
4          40          41          2.361            6
5           8          44          2.636           20
6          38          40          3.002           15
7          30          32          3.749           14
8           1          14          3.808           10
9          26          29          3.891           12
10          1          16          4.145           13
11         34          35          4.326           12
12         26          34          4.587           15
13          1          18          4.698           16
14         19          30          5.105           23
15         26          38          5.751           19
16          1           2          5.921           17
17          1          20          6.052           18
18          1           9          6.236           21
19         24          26          6.389           32
20          7           8          6.791           37
21          1           5          7.298           22
22          1          15          7.481           24
23         11          19          7.631           31
24          1          21          7.711           26
25          4          12          7.735           28
26          1          13          8.289           27
27          1          27          8.511           28
28          1           4          8.656           29
29          1          22          8.807           30
30          1          33          8.994           31
31          1          11          9.066           32
32          1          24          9.071           33
33          1          17          9.245           34
34          1          28          9.451           35
35          1          31          9.483           36
36          1          37          9.946           38
37          7          36          9.953           38
38          1           7         10.561           39
39          1          25         10.705           40
40          1          23         11.289           41
41          1          39         12.785           42
42          1           3         12.900           43
43          1           6         15.449            0
This table gives the agglomeration schedule, i.e. the details of the clusters formed at each stage. It indicates that cases 6 and 42 were combined at the first stage, cases 7 and 43 at the second stage, cases 2 and 10 at the third stage, and so on. The last stage (stage 43) corresponds to the two-cluster solution, the stage above it (stage 42) to the three-cluster solution, and so on. The column 'Coefficients' gives the distance coefficient; a sudden increase in the coefficient indicates that relatively dissimilar clusters are being combined at that stage, which is one of the indicators for deciding the number of clusters.
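The linkage matrix produced by SciPy plays the same role as this agglomeration schedule, and the jump in the distance coefficient between successive stages can be computed from it directly. The sketch below assumes the survey data have been exported to a hypothetical commodity.csv file; SciPy is used here with plain Euclidean distances on Z scores rather than squared Euclidean distances, so the coefficients will not numerically match the SPSS output, although the pattern of jumps is read the same way.

import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.stats import zscore

# Assumed CSV export of the commodity-market survey data (numeric variables only).
df = pd.read_csv("commodity.csv").select_dtypes("number").dropna()

Z = linkage(zscore(df.values, ddof=1), method="single", metric="euclidean")

# Each row of Z is one stage: the two clusters combined, the distance coefficient
# and the size of the newly formed cluster.
schedule = pd.DataFrame(Z, columns=["cluster_1", "cluster_2", "coefficient", "size"])
schedule["difference"] = schedule["coefficient"].diff()   # jump relative to the previous stage
print(schedule.tail(5))      # a large jump near the end suggests stopping with few clusters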
Agglomeration Schedule (replicated, with the 'Difference in Coefficients' column added)

Stage  Cluster 1  Cluster 2  Coefficients  First Appears: Cl.1  Cl.2  Next Stage  Difference in Coefficients
1          6         42        0              0                  0        43
2          7         43        0.828734       0                  0        20       0.828734
3          2         10        2.189939       0                  0        16       1.361204
4         40         41        2.360897       0                  0         6       0.170958
5          8         44        2.636238       0                  0        20       0.275341
6         38         40        3.002467       0                  4        15       0.366229
7         30         32        3.749321       0                  0        14       0.746854
8          1         14        3.808047       0                  0        10       0.058726
9         26         29        3.89128        0                  0        12       0.083232
10         1         16        4.1447         8                  0        13       0.253421
11        34         35        4.325831       0                  0        12       0.181131
12        26         34        4.587371       9                 11        15       0.26154
13         1         18        4.697703      10                  0        16       0.110332
14        19         30        5.105442       0                  7        23       0.407739
15        26         38        5.750915      12                  6        19       0.645473
16         1          2        5.921352      13                  3        17       0.170437
17         1         20        6.052442      16                  0        18       0.13109
18         1          9        6.236206      17                  0        21       0.183764
19        24         26        6.389243       0                 15        32       0.153037
20         7          8        6.790893       2                  5        37       0.40165
21         1          5        7.297921      18                  0        22       0.507028
22         1         15        7.480892      21                  0        24       0.182971
23        11         19        7.631185       0                 14        31       0.150293
24         1         21        7.710566      22                  0        26       0.079381
25         4         12        7.735374       0                  0        28       0.024808
26         1         13        8.288569      24                  0        27       0.553195
27         1         27        8.510957      26                  0        28       0.222388
28         1          4        8.656467      27                 25        29       0.14551
29         1         22        8.807405      28                  0        30       0.150937
30         1         33        8.994409      29                  0        31       0.187004
31         1         11        9.066141      30                 23        32       0.071733
32         1         24        9.070588      31                 19        33       0.004447
33         1         17        9.244565      32                  0        34       0.173977
34         1         28        9.450741      33                  0        35       0.206176
35         1         31        9.483015      34                  0        36       0.032274
36         1         37        9.946286      35                  0        38       0.463272
37         7         36        9.953277      20                  0        38       0.00699
38         1          7       10.56085       36                 37        39       0.607572
39         1         25       10.70496       38                  0        40       0.144109
40         1         23       11.28888       39                  0        41       0.583924
41         1         39       12.78464       40                  0        42       1.495759
42         1          3       12.90033       41                  0        43       0.115693
43         1          6       15.44882       42                  1         0       2.548489
We have replicated the table above with one more column added, called 'Difference in Coefficients'; this is the difference between the coefficient of the current stage and that of the previous stage. The highest difference indicates the most likely number of clusters. In the above table the highest difference is 2.548, which corresponds to the two-cluster solution, and the next highest difference is 1.496, which corresponds to the three-cluster solution. This indicates that there could be 3 clusters for the data.
The icicle plot also gives a summary of the cluster formation. It is read from bottom to top: the topmost row is the single-cluster solution and the bottommost row has all cases separate. The cases appear in the columns, and the first column indicates the number of clusters at that stage. Each case is separated by an empty column; a cross in the empty column means the two cases are combined, while a gap means the two cases are in separate clusters. If the number of cases is huge, this table becomes difficult to interpret.
The diagram given below is called the dendrogram. A dendrogram is the most widely used tool for understanding the number of clusters and the cluster memberships. The cases are in the first column and are connected by lines at each stage of clustering. The graph is read from left to right: the leftmost position corresponds to every case being its own cluster and the rightmost to the one-cluster solution. The graph also has a rescaled distance line from 0 to 25; the longer the horizontal line for a cluster, the more appropriate is the cluster. The graph shows that the 2-cluster solution is a better solution, as indicated by the thick dotted line.
Dendrogram (Hierarchical Cluster Analysis, using Single Linkage)
The above solution is not decisive, as the differences are very close. Hence we shall try a different method, i.e. furthest neighborhood (complete linkage). The entire process is repeated, and this time the method selected (as shown in CA Snapshot 4) is furthest neighborhood. The output is as follows.
Proximities
Case Processing Summary(a)
Valid: N = 44 (97.8%)     Missing: N = 1 (2.2%)     Total: N = 45 (100.0%)
Cluster
Complete Linkage
Agglomeration Schedule
Stage   Cluster 1   Cluster 2   Coefficients   Next Stage
1           6          42           .000           35
2           7          43           .829           16
3           2          10          2.190           25
4          40          41          2.361           10
5           8          44          2.636           16
6          30          32          3.749           15
7           1          14          3.808           11
8          26          29          3.891           19
9          34          35          4.326           19
10         38          40          4.386           31
11          1          16          5.796           22
12          5          18          7.298           18
13         20          21          7.711           26
14          4          12          7.735           33
15         11          30          7.770           28
16          7           8          7.784           31
17          9          15          7.968           18
18          5           9          8.697           23
19         26          34          8.830           24
20         23          28         11.289           37
21         13          31         11.457           25
22          1          22         11.552           29
23          5          33         12.218           33
24         24          26         13.013           32
25          2          13         13.898           30
26         17          20         14.285           36
27         36          37         14.858           35
28         11          19         14.963           34
29          1          27         16.980           37
30          2          39         18.176           34
31          7          38         18.897           38
32         24          25         19.342           36
33          4           5         21.241           40
34          2          11         25.851           39
35          6          36         26.366           41
36         17          24         26.691           40
37          1          23         29.225           39
38          3           7         30.498           41
39          1           2         36.323           42
40          4          17         37.523           42
41          3           6         55.294           43
42          1           4         63.846           43
43          1           3         87.611            0
Dendrogram (Hierarchical Cluster Analysis, using Complete Linkage; rescaled distance cluster combine)
The above dendrogram clearly shows that the longest horizontal lines occur for the four-cluster solution, shown by the thick dotted line (the dotted line intersects four horizontal lines). It indicates, for example, that the cluster containing cases 7, 43, 37 and 38 may be named cluster 4, the cluster containing cases 41, 42, 39, 8, 44, 9, 45 and 3 may be named cluster 2, and so on.
We shall run the cluster analysis again with the same method, and this time we shall save the cluster membership for a single solution of 4 clusters, as indicated in CA Snapshot 6. The output will be the same as discussed, except that a new variable named CLU4_1 is added to the SPSS file; this variable takes values from 1 to 4, and each value indicates the cluster membership.
We shall then conduct an ANOVA test on the data, where the dependent variables are all the variables that were included while performing the cluster analysis and the factor is the cluster membership indicated by the variable CLU4_1. This ANOVA will indicate whether the clusters are really distinguished on the basis of the list of variables, i.e. which variables significantly distinguish the clusters and which do not.
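The same cluster-wise comparison can be sketched outside SPSS by running a one-way ANOVA per variable with the saved membership as the grouping factor. The file name commodity_clusters.csv and the assumption that the membership variable CLU4_1 has been exported along with the data are hypothetical.

import pandas as pd
from scipy import stats

df = pd.read_csv("commodity_clusters.csv")    # assumed export including CLU4_1

variables = ["Age", "Rate_CM", "Rate_SM", "Rate_FD", "Rate_MF",
             "Invest_CM", "Invest_SM", "Invest_FD", "Invest_MF",
             "risky_CM", "risky_SM", "risky_FD", "risky_MF"]

for var in variables:
    groups = [g[var].dropna() for _, g in df.groupby("CLU4_1")]
    f_stat, p_value = stats.f_oneway(*groups)         # one-way ANOVA across the clusters
    verdict = "differs across clusters" if p_value < 0.05 else "no significant difference"
    print(f"{var:12s}  F = {f_stat:8.3f}  p = {p_value:.3f}  ({verdict})")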
The ANOVA procedure in SPSS is as follows.
Select Analyze → Compare Means → One-Way ANOVA from the menu as shown below.
CA Snapshot 7
Click on Post Hoc; this gives the list of post hoc tests for ANOVA. The most common are LSD and HSD (discussed in Chapter 12); we shall select LSD and click on Continue.
SPSS will take you back to CA Snapshot 8. Click on Options, and the following window will be opened.
CA Snapshot 10
Select the required options and click on Continue.
SPSS will take you back to the window shown in CA Snapshot 8; at this stage click on OK.
Following Output will be displayed:
Oneway
Descriptives
(The Descriptives output reports, for each of the variables Age, Rate_CM, Rate_SM, Rate_FD, Rate_MF, Invest_CM, Invest_SM, Invest_FD, Invest_MF, how_much_time_block_your_money, risky_CM, risky_SM, risky_FD and risky_MF, the N, mean, standard deviation, standard error, minimum and maximum within each of the four clusters (N = 16, 8, 16 and 4) and for the total sample of 44 cases. The cluster sizes and means are summarised in the short table that follows.)
The above table gives descriptive statistics for the dependent variables for each cluster; a short summary of the table (cluster sizes and means) is displayed below.
Descriptives (cluster sizes and means)

Cluster   N    Age     Rate_CM  Rate_SM  Rate_FD  Rate_MF
1         16   20.56   3.69     3.94     2.81     4.00
2          8   46.13   1.50     2.13     4.13     2.75
3         16   33.56   1.88     2.31     4.44     2.94
4          4   26.00   2.50     3.50     3.00     4.50
Total     44   30.43   2.52     2.98     3.66     3.43

Cluster   Invest_CM   Invest_SM   Invest_FD   Invest_MF    Block_Time  Risky_CM  Risky_SM  Risky_FD  Risky_MF
1          5937.50     20156.25     6718.75    18593.75    2.19        5.31      5.38      1.94      5.13
2         36250.00    111875.00    95625.00   117500.00    6.13        6.63      7.25      1.13      6.50
3          4593.75     18656.25    20500.00    17781.25    4.31        6.00      7.19      1.50      6.56
4         57500.00    115000.00    63750.00   124250.00    4.25        3.00      4.75      1.00      4.75
Total     15647.73     44909.09    33079.55    45886.36    3.86        5.59      6.32      1.55      5.86
It may be noted that these four clusters have average ages of 20.56, 46.13, 33.56 and 26.00; this clearly forms four different age groups. The other descriptives can also be summarised cluster-wise; for example, Cluster 1 contains the cases with Sr. No. 31, 33, 12, 20, 2, 11, 14, 32, 40, 24, 29, 1, 15, 17, 23 and 28.
Test of Homogeneity of Variances
Variable                          Levene Statistic   df1   df2   Sig.
Age                                    7.243          3     40   .001
Rate_CM                                 .943          3     40   .429
Rate_SM                                1.335          3     40   .277
Rate_FD                                3.186          3     40   .034
Rate_MF                                1.136          3     40   .346
Invest_CM                               .369          3     40   .775
Invest_SM                             17.591          3     40   .000
Invest_FD                              4.630          3     40   .007
Invest_MF                             15.069          3     40   .000
how_much_time_block_your_money         3.390          3     40   .027
risky_CM                               1.995          3     40   .130
risky_SM                               4.282          3     40   .010
risky_FD                               5.294          3     40   .004
risky_MF                               2.118          3     40   .113
This table gives Levene's test of homogeneity of variances, which is a must for ANOVA, as ANOVA assumes that the different groups have equal variance. If the significance is less than 5% (the level of significance), the null hypothesis that the variances are equal is rejected, i.e. the assumption is not satisfied, and in such cases ANOVA cannot be used. In the above table this is the case for Age, Rate_FD, Invest_SM, Invest_FD, Invest_MF, how_much_time_block_your_money, risky_SM and risky_FD, which means the ANOVA could be invalid for those variables. It may be noted that when ANOVA is invalid, the non-parametric Kruskal-Wallis test, discussed in Chapter 13, can be performed instead.
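A minimal sketch of Levene's test, with the Kruskal-Wallis test as the non-parametric fallback, is given below. It reuses the hypothetical commodity_clusters.csv export described earlier, and the short list of variables is only an example.

import pandas as pd
from scipy import stats

df = pd.read_csv("commodity_clusters.csv")    # assumed export including CLU4_1

for var in ["Age", "Invest_SM", "Invest_FD", "Invest_MF"]:
    groups = [g[var].dropna() for _, g in df.groupby("CLU4_1")]

    w, p_levene = stats.levene(*groups)       # H0: equal variances across clusters
    h, p_kw = stats.kruskal(*groups)          # rank-based alternative to one-way ANOVA
    print(f"{var:10s}  Levene p = {p_levene:.3f}   Kruskal-Wallis p = {p_kw:.3f}")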
ANOVA
(df: Between Groups = 3, Within Groups = 40, Total = 43 for every variable)

Variable                          SS Between    SS Within    SS Total    MS Between    MS Within     F        Sig.
Age                               3764.045      1992.750     5756.795    1254.682      49.819        25.185   .000
Rate_CM                           36.790        22.188       58.977      12.263        .555          22.108   .000
Rate_SM                           28.727        14.250       42.977      9.576         .356          26.879   .000
Rate_FD                           24.636        27.250       51.886      8.212         .681          12.054   .000
Rate_MF                           17.358        39.438       56.795      5.786         .986           5.869   .002
Invest_CM                         1E+010        3E+009       2E+010      4621914299    79138671.88   58.403   .000
Invest_SM                         8E+010        5E+010       1E+011      2.545E+010    1144289844    22.243   .000
Invest_FD                         5E+010        2E+010       7E+010      1.624E+010    483427734.4   33.585   .000
Invest_MF                         9E+010        5E+010       1E+011      3.005E+010    1184108594    25.377   .000
how_much_time_block_your_money    89.682        87.500       177.182     29.894        2.188         13.666   .000
risky_CM                          39.324        119.313      158.636     13.108        2.983          4.394   .009
risky_SM                          43.108        150.438      193.545     14.369        3.761          3.821   .017
risky_FD                          5.097         13.813       18.909      1.699         .345           4.920   .005
risky_MF                          24.744        168.438      193.182     8.248         4.211          1.959   .136

(For risky_MF the significance is greater than 0.05, so the ANOVA null hypothesis is not rejected.)
The above ANOVA table tests the difference between the means of the different clusters. The null hypothesis states that there is no difference between the clusters for a given variable; if the significance is less than 5% (p-value less than 0.05), the null hypothesis is rejected. It may be noted that, in the above table, the null hypothesis of equal cluster means is rejected for all variables except Risky_MF. This means all the other variables vary significantly across the clusters. It also indicates that the four-cluster solution is a good solution.
K Means Cluster
This method is used when one knows in advance how many clusters are to be formed. The procedure for K-means clustering is as follows.
CA Snapshot 11
Click on Save; the following window will appear.
CA Snapshot 13
Select Cluster membership and click on Continue.
SPSS will take you back to the window shown in CA Snapshot 12. Click on Options, and the following window will appear.
CA Snapshot 14
Click on Continue.
Initial Cluster Centers
(The table reports, for each of the four clusters, the starting values of Age, the four rating variables, the four investment amounts, the time horizon and the risk-perception variables; for example, the four starting observations have ages 55, 22, 54 and 45.)
This table gives the initial cluster centers. The initial cluster centers are the variable values of the k well-spaced observations.
Iteration History(a)
Iteration    Change in Cluster Centers
1            31980.518
2            7589.872
3            .000
The iteration history shows the progress of the clustering process at each step. This table has only three steps, as the process stopped once there was no further change in the cluster centers.
Final Cluster Centers
Cluster             1         2         3          4
Age                 33        27        54         34
Rate_CM              2         3         1          2
Rate_SM              3         3         2          3
Rate_FD              4         4         4          4
Rate_MF              4         3         3          4
Invest_CM        31000      2981     50000      36667
Invest_SM        58182     11769     60000     161667
Invest_FD        41636     11827    200000      81667
Invest_MF        60636     10846     50000     170000
(The remaining rows of the table give the final centers for how_much_time_block_your_money and the four risk-perception variables.)
ANOVA
Variable                          Cluster Mean Square   df   Error Mean Square   df   F         Sig.
Age                               318.596                3   120.025             40    2.654    .062
Rate_CM                           2.252                  3   1.306               40    1.725    .177
Rate_SM                           .599                   3   1.029               40     .582    .630
Rate_FD                           .377                   3   1.269               40     .297    .827
Rate_MF                           .722                   3   1.366               40     .528    .665
Invest_CM                         3531738685             3   160901842.9         40   21.950    .000
Invest_SM                         3.750E+010             3   240364627.0         40  156.032    .000
Invest_FD                         1.819E+010             3   336746248.5         40   54.022    .000
Invest_MF                         4.225E+010             3   268848251.7         40  157.162    .000
how_much_time_block_your_money    10.876                 3   3.614               40    3.010    .041
risky_CM                          3.654                  3   3.692               40     .990    .407
risky_SM                          3.261                  3   4.594               40     .710    .552
risky_FD                          .603                   3   .427                40    1.411    .254
risky_MF                          1.949                  3   4.683               40     .416    .742
The F tests should be used only for descriptive purposes because the clusters have been chosen to
maximize the differences among cases in different clusters. The observed significance levels are not
corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are
equal.
The ANOVA indicates that the clusters differ significantly only for the investment variables (invest in
CM, invest in SM, invest in FD and invest in MF) and for the time for which money can be blocked, as the
significance is less than 0.05 only for these variables.
Number of Cases in each Cluster

Cluster 1     11.000
Cluster 2     26.000
Cluster 3      1.000
Cluster 4      6.000
Valid         44.000
Missing        1.000
The above table gives the number of cases in each cluster.
It may be noted that this solution is different from the hierarchical solution, and the hierarchical cluster
solution is more valid for this data because it was based on standardized scores, whereas this K-means run
did not standardize the variables.
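For comparison, a K-means solution on standardized scores can be obtained in R, which directly addresses the standardization point made above. The sketch below is illustrative only; the data frame name invest and the column names are assumptions, not the actual file supplied with the binder.

# K-means on standardized variables (all names are assumed for illustration).
num_vars <- na.omit(invest[, c("Age", "Rate_CM", "Rate_SM", "Rate_FD", "Rate_MF",
                               "Invest_CM", "Invest_SM", "Invest_FD", "Invest_MF",
                               "how_much_time_block_your_money",
                               "risky_CM", "risky_SM", "risky_FD", "risky_MF")])
set.seed(1)                                   # k-means uses random starting centers
km <- kmeans(scale(num_vars), centers = 4, nstart = 25)
km$centers                                    # final cluster centers (standardized scale)
km$size                                       # number of cases in each cluster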
9 Conjoint Analysis
The name "Conjoint Analysis" implies the study of the joint effects. In marketing applications, it
helps in the study of joint effects of multiple product attributes on product choice. Conjoint analysis
involves the measurement of psychological judgements
such as consumer preferences, or perceived similarities or differences between choice alternatives.
In fact, conjoint analysis is a versatile marketing research technique which provides valuable
information for new product development, assessment of demand, evolving market
segmentation strategies, and pricing decisions. This technique is used to assess a wide range of
issues, including:
- the profitability and/or market share for proposed new product concepts, given the existing competition;
- the impact of new competitors' products on the profits or market share of a company if the status quo is maintained with respect to its products and services;
- customers' switch rates, either from a company's existing products to the company's new products or from competitors' products to the company's new products;
- competitive reaction to the company's strategies of introducing a new product;
- the differential response to alternative advertising strategies and/or advertising themes;
- the customer response to alternative pricing strategies, specific price levels, and proposed price changes.
Conjoint analysis examines the trade-offs that consumers make in purchasing a product. In
evaluating products, consumers make trade-offs. A TV viewer may like to enjoy the programs
on an LCD TV but might not go for it because of the high cost. In this case, cost is said to have
a high utility value. Utility can be defined as a number which represents the value that
consumers place on specific attributes. A low utility indicates less value; a high utility
indicates more value. In other words, it represents the relative worth of the attribute. This
helps in designing products/services that are most appealing to a specific market. In addition,
because conjoint analysis identifies important attributes, it can be used to create advertising
messages that are most appealing.
The process of data collection involves showing respondents a series of cards that contain a written
description of the product or service. If a consumer product is being tested, then a picture of the
product can be included along with a written description. Several cards are prepared describing the
combination of various alternative sets of features of a product or service. A consumer's response is
collected as his/her selection of a number between 1 and 10, where 1 indicates strongest dislike and 10
indicates strongest liking for the combination of features on the card. Such data becomes the input for
final analysis which is carried out through computer software.
The concepts and methodology are elaborated in the case study given below.
9.1 Conjoint Analysis Using SPSS
Case 3
Credit Cards
The new head of the credit card division in a bank wanted to revamp the credit card business of the
bank and convert it from a loss-making business to a profit-making business. He was given freedom to
experiment with various options that he considered relevant. Accordingly, he organized a focus
group discussion for assessing the preference of the customers for various parameters associated with
the credit card business. Thereafter, he selected the following parameters for study.
1) Transaction Time - the time taken for a credit card transaction
2) Fees - the annual fees charged by the credit card company
3) Interest Rate - the interest rate charged by the credit card company to customers who revolve
their credit (customers who do not pay the full bill amount but use the partial payment option and pay at their
convenience)
The levels of the above-mentioned attributes were as follows:
Transaction Time - 1 minute, 1.5 minutes, 2 minutes
Fees - Rs 0, Rs 1000, Rs 2000
Interest Rate - 1.5%, 2%, 2.5% (per month)
This led to a total of 3 x 3 x 3 = 27 combinations. Twenty-seven cards were prepared, one representing each
combination, and the customers were asked to arrange these cards in order of their preference.
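The 27 cards form a full 3 x 3 x 3 factorial design, which can be generated mechanically; the short R sketch below is only illustrative (the object and column names are not from the binder's files).

# Generate all 27 attribute combinations for the card sort.
cards <- expand.grid(transaction_time = c(1, 1.5, 2),
                     fees             = c(0, 1000, 2000),
                     interest_rate    = c(1.5, 2.0, 2.5))
nrow(cards)   # 27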
The following table shows all the possible combinations and the order given by the customer.
Input Data for Credit Card

SR. No   Transaction Time (min)   Card Fees (Rs)   Interest Rate (% per month)   Rating* (27 to 1)
1        1                        0                1.5                           27
2        1.5                      0                1.5                           26
3        1                        1000             1.5                           25
4        1.5                      1000             1.5                           24
5        2                        0                1.5                           23
6        2                        1000             1.5                           22
7        1                        0                2.0                           21
8        1.5                      0                2.0                           20
9        1                        2000             1.5                           19
10       1.5                      2000             1.5                           18
11       1                        1000             2.0                           17
12       1.5                      1000             2.0                           16
13       1                        2000             2.0                           15
14       2                        2000             1.5                           14
15       1.5                      2000             2.0                           13
16       2                        0                2.0                           12
17       2                        1000             2.0                           11
18       1                        0                2.0                           10
19       1.5                      0                2.5                           9
20       1                        1000             2.5                           8
21       2                        1000             2.5                           7
22       2                        2000             2.0                           6
23       2                        0                2.5                           5
24       2                        1000             2.5                           4
25       1                        2000             2.5                           3
26       1.5                      2000             2.5                           2
27       2                        2000             2.5                           1
* Rating 27 indicates the most preferred and rating 1 the least preferred option by the customer.
Conduct appropriate analysis to find the utility for these three factors.
The data is available in credit card.sav file, given in the CD.
Running Conjoint as a Regression Model: Introduction of Dummy Variables
Representing dummy variables:
X1, X2 = transaction time
X3, X4 = Annual Fees
X5, X6 = Interest Rates
The 3 levels of transaction time are coded as follows:

Transaction Time    X1    X2
1                   1     0
1.5                 0     1
2                   -1    -1
The 3 levels of fees are coded as follows:

Fees     X3    X4
0        1     0
1000     0     1
2000     -1    -1
The 3 levels of interest rate are coded as follows:

Interest Rate    X5    X6
1.5              1     0
2.0              0     1
2.5              -1    -1
Thus, six variables, i.e. X1 to X6, are used to represent the 3 levels of transaction time
(1, 1.5, 2), the 3 levels of fees (0, 1000, 2000) and the 3 levels of interest rate (1.5, 2, 2.5). All six
variables are independent variables in the regression run. Another variable Y, which is the rating of
each combination given by the respondent, forms the dependent variable of the regression model.
Thus we generate the regression equation as: Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6
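The same effect-coded regression can also be run in R without typing the dummy columns by hand, because the 1/0/-1 coding shown above corresponds to "sum" contrasts for factors. The sketch below assumes a data frame cc holding the columns of the table that follows; the object and column names are illustrative, not from the credit card.sav file.

# Conjoint ratings regression with effect (sum) coding - a minimal sketch.
cc$time <- factor(cc$transaction_time)        # levels 1, 1.5, 2
cc$fee  <- factor(cc$fees)                    # levels 0, 1000, 2000
cc$rate <- factor(cc$interest_rate)           # levels 1.5, 2, 2.5
contrasts(cc$time) <- contr.sum(3)            # gives the 1/0/-1 coding used above
contrasts(cc$fee)  <- contr.sum(3)
contrasts(cc$rate) <- contr.sum(3)
fit <- lm(rating ~ time + fee + rate, data = cc)
summary(fit)                                  # the coefficients play the role of X1 to X6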
Input data for the regression model:

Sr No   Transaction Time   Fees    Interest Rate   Y     X1    X2    X3    X4    X5    X6
1       1                  0       1.5             27     1     0     1     0     1     0
2       1.5                0       1.5             26     0     1     1     0     1     0
3       1                  1000    1.5             25     1     0     0     1     1     0
4       1.5                1000    1.5             24     0     1     0     1     1     0
5       2                  0       1.5             23    -1    -1     1     0     1     0
6       2                  1000    1.5             22    -1    -1     0     1     1     0
7       1                  0       2.0             21     1     0     1     0     0     1
8       1.5                0       2.0             20     0     1     1     0     0     1
9       1                  2000    1.5             19     1     0    -1    -1     1     0
10      1.5                2000    1.5             18     0     1    -1    -1     1     0
11      1                  1000    2.0             17     1     0     0     1     0     1
12      1.5                1000    2.0             16     0     1     0     1     0     1
13      1                  2000    2.0             15     1     0    -1    -1     0     1
14      2                  2000    1.5             14    -1    -1    -1    -1     1     0
15      1.5                2000    2.0             13     0     1    -1    -1     0     1
16      2                  0       2.0             12    -1    -1     1     0     0     1
17      2                  1000    2.0             11    -1    -1     0     1     0     1
18      1                  0       2.0             10     1     0     1     0     0     1
19      1.5                0       2.5              9     0     1     1     0    -1    -1
20      1                  1000    2.5              8     1     0     0     1    -1    -1
21      2                  1000    2.5              7    -1    -1     0     1    -1    -1
22      2                  2000    2.0              6    -1    -1    -1    -1     0     1
23      2                  0       2.5              5    -1    -1     1     0    -1    -1
24      2                  1000    2.5              4    -1    -1     0     1    -1    -1
25      1                  2000    2.5              3     1     0    -1    -1    -1    -1
26      1.5                2000    2.5              2     0     1    -1    -1    -1    -1
27      2                  2000    2.5              1    -1    -1    -1    -1    -1    -1
This can be processed using the SPSS package as follows.
Open the file credit card.sav.
Select the Analyze > Regression > Linear option from the menu as shown below.
Conjoint Snapshot 1
In the Linear Regression dialog:
1. Select Rate as the dependent variable.
2. Select x1, x2, x3, x4, x5 and x6 as the independent variables.
3. Click on OK.
Variables Entered/Removed
Model    Variables Entered               Variables Removed    Method
1        x6, x4, x2, x5, x3, x1 (a)      .                    Enter

Model Summary
R          R Square    Adjusted R Square    Std. Error of the Estimate
.963(a)    .927        .905                 2.45038
This table indicates that R Square for the above model is 0.927 (R = 0.963), which is close to one. This means
that about 92.7% of the variation in the rating is explained by the six independent variables (x1 to x6).
We conclude that the regression model fits the data and explains the variation in the dependent variable
quite well.
ANOVA(b)
Model 1         Sum of Squares    df    Mean Square    F         Sig.
Regression      1517.912          6     252.985        42.133    .000(a)
Residual        120.088           20    6.004
Total           1638.000          26
Coefficients(a)

                 Unstandardized Coefficients       Standardized Coefficients
Model 1          B            Std. Error           Beta              t           Sig.
(Constant)       13.857       .476                                   29.104      .000
x1               1.377        .673                 .148              2.044       .054
x2               1.326        .695                 .138              1.908       .071
x3               2.265        .673                 .237              3.364       .003
x4               1.480        .673                 .155              2.198       .040
x5               8.143        .670                 .829              12.152      .000
x6               -.121        .661                 -.013             -.183       .857
The regression coefficients (column B) indicate the utility values for each variable.
The regression equation is as follows:
Y = 13.857 + 1.377X1 + 1.326X2 + 2.265X3 + 1.48X4 + 8.143X5 - 0.121X6
Output and Interpretation
Utility (Uij) - the utility or part-worth contribution associated with the jth level (j = 1, 2, 3) of the
ith attribute (i = 1, 2, 3). For example, U21 in our example means the utility associated with zero fees.
Importance of an attribute (Ii) - defined as the range of the part worths Uij across the levels of that
attribute: Ii = Max(Uij) - Min(Uij) for each attribute i.
Normalisation: the importance of an attribute is normalised to derive its relative importance among all
attributes:
Wi = Ii / Σ Ii, so that Σ Wi = 1.
The output provides the part utility of each level of each attribute, as shown below:
X1 = 1.377 (part utility for 1 min transaction time)
X2 = 1.326 (part utility for 1.5 min transaction time)
For 2 min transaction time the part utility = -2.703 (as all the utilities for a given attribute should sum to 0,
hence -1.377 - 1.326 = -2.703)
X3 = 2.265 (part utility for zero fees)
X4 = 1.48 (part utility for Rs 1000 fees)
For Rs 2000 fees the part utility = -3.745 (as all the utilities for a given attribute should sum to 0,
hence -2.265 - 1.48 = -3.745)
X5 = 8.143 (part utility for 1.5% interest)
X6 = -0.121 (part utility for 2% interest)
For 2.5% interest the part utility = -8.022 (as all the utilities for a given attribute should sum
to 0, hence -8.143 + 0.121 = -8.022)
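Continuing the illustrative R sketch given earlier (with the same assumed object names), the third-level utilities and the relative importances can be computed directly from the fitted coefficients:

b <- coef(fit)
time_util <- c(b["time1"], b["time2"], -sum(b[c("time1", "time2")]))  # utilities sum to 0
fee_util  <- c(b["fee1"],  b["fee2"],  -sum(b[c("fee1",  "fee2")]))
rate_util <- c(b["rate1"], b["rate2"], -sum(b[c("rate1", "rate2")]))
ranges <- c(time = diff(range(time_util)),
            fees = diff(range(fee_util)),
            rate = diff(range(rate_util)))
round(100 * ranges / sum(ranges), 2)          # relative importance, roughly 15.5 / 22.9 / 61.6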
Utilities Table for Conjoint Analysis

Attribute            Level      Part Utility     Range of Utility (Max - Min)     Percentage Utility
Transaction Time     1 min      1.377            1.377 - (-2.703) = 4.08          15.54%
                     1.5 min    1.326
                     2 min      -2.703
Annual Fees (Rs)     0          2.265            2.265 - (-3.745) = 6.01          22.89%
                     1000       1.480
                     2000       -3.745
Interest Rate        1.5%       8.143            8.143 - (-8.022) = 16.165        61.57%
                     2.0%       -0.121
                     2.5%       -8.022
To know the BEST COMBINATION, it is advisable to pick the level with the highest utility from each attribute
and then add them.
The best combination here is: 1 minute transaction time, zero annual fees and 1.5% interest rate.
INDIVIDUAL ATTRIBUTES
The difference in utility with the change of one level in one attribute can also be checked.
1. Transaction Time
From 1 min to 1.5 min there is a decrease in utility of 0.051 units, but the next level, that is,
1.5 min to 2 min, has a decrease in utility of 4.029 units.
2. Annual Fees
Increasing fees from Rs 0 to Rs 1000 induces a utility drop of 0.785,
whereas from Rs 1000 to Rs 2000 there is a decrease in utility of 5.225.
3. Interest Rates
An interest rate increase from 1.5% to 2.0% induces a drop of 8.264 units in utility.
An interest rate increase from 2.0% to 2.5% induces a drop of 7.901 units in utility.
10 Multidimensional Scaling
Multidimensional Scaling (MDS) transforms consumer judgments/perceptions of similarity or preference
into points in a multidimensional space (usually 2 or 3 dimensions). It is useful for designing products and
services. In fact, MDS is a set of procedures for drawing pictures of data so that the researcher can
- visualise relationships described by the data more clearly, and
- offer clearer explanations of those relationships.
Thus MDS reveals relationships that appear to be obscure when one examines only the numbers resulting from
a study.
It attempts to find the structure in a set of distance measures between objects. This is done by
assigning observations to specific locations in a conceptual space (2 or 3 dimensions) such that the
If objects A and B are judged by the respondents as being most similar compared to all other possible
pairs of objects, the multidimensional scaling technique positions these objects in the space in such a manner that
the distance between them is smaller than that between any other two objects.
Suppose data is collected on the perceived differences or distances among three objects, say A, B and
C, and the following distance matrix emerges.

        A    B    C
A       0    4    6
B       4    0    3
C       6    3    0

This can be depicted as three points plotted so that the distances between them reproduce the values in the matrix.

However, if the data comprises only ordinal or rank data, then the same distance matrix could be
written in terms of ranks (1 for the most similar pair) as:

        A    B    C
A       0    2    3
B       2    0    1
C       3    1    0

and can again be depicted as a configuration of the three points A, B and C.
If the actual magnitudes of the original similarities (distances) are used to obtain a geometric representation, the
process is called Metric Multidimensional Scaling.
When only the ordinal information in terms of ranks is used to obtain a geometric representation, the process
is called Non-metric Multidimensional Scaling.
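Both variants can be tried in R. The sketch below runs classical (metric) MDS on the three-object distance matrix shown above; for ordinal data, non-metric MDS is available through MASS::isoMDS. The example is purely illustrative.

d <- as.dist(matrix(c(0, 4, 6,
                      4, 0, 3,
                      6, 3, 0), nrow = 3,
                    dimnames = list(c("A", "B", "C"), c("A", "B", "C"))))
coords <- cmdscale(d, k = 2)                  # metric (classical) MDS
plot(coords, type = "n")
text(coords, labels = rownames(coords))       # map of the three objects
# For rank data on a larger set of objects: MASS::isoMDS(d, k = 2)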
10.1 Uses of MDS
(i) Illustrating market segments based on preference and judgments.
(ii) Determining which products are more competitive with each other.
(iii) Deriving the criteria used by people while judging objects (products, brands, advertisements, etc.).
Illustration 4
An all-India organization had six zonal offices, each headed by a zonal manager. The top management of the
organization wanted to have a detailed assessment of all the zonal managers for selecting two of them for
higher positions in the Head Office. They approached a consultant for helping them in the selection. The
management indicated that they would like to have assessment on several parameters associated with the
functioning of a zonal manager. The management also briefed the consultant that they laid great emphasis on
the staff with a view to developing and retaining them.
The consultants collected a lot of relevant data, analyzed it and offered their recommendations. In
one of the presentations, they showed the following diagram obtained through Multi Dimensional
Scaling technique. The diagram shows the concerns of various zonal managers, indicated by letters A
to F, towards the organization and also towards the staff working under them.
MDS map: zonal managers A to F plotted against Concern for Organisation and Concern for Staff.
It is observed that two zonal managers viz. B and E exhibit high concern for both the organisation as
well as staff. If these criteria are critical to the organisation, then these two zonal managers could be
the right candidates for higher positions in the Head Office.
Illustration 5
Similar study could be conducted for a group of companies to assess the perception of
investors about the attitude of companies towards the interest of their shareholders vis-à-vis the interest
of their staff.
For example, from the following MDS graph, it is observed that company A is perceived to be taking
more interest in the welfare of the staff than company B.
MDS graph: companies plotted against Interest of Shareholders and Interest of Staff.
C&RT, a recursive partitioning method, builds classification and regression trees for predicting
continuous dependent variables (regression) and categorical predictor variables (classification). The
classic C&RT algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone,
1984; see also Ripley, 1996). A general introduction to tree-classifiers, specifically to
the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm, is also presented in the context
of the Classification Trees Analysis facilities, and much of the following discussion presents the same
information, in only a slightly different context. Another, similar type of tree building algorithm is
CHAID (Chi-square Automatic Interaction Detector; see Kass, 1980).
CLASSIFICATION AND REGRESSION PROBLEMS
There are numerous algorithms for predicting continuous variables or categorical variables from a set
of continuous predictors and/or categorical factor effects. For example, in GLM (General Linear
Models) and GRM (General Regression Models), we can specify a linear combination (design) of
continuous predictors and categorical factor effects (e.g., with two-way and three-way interaction
effects) to predict a continuous dependent variable. In GDA (General Discriminant Function
Analysis), we can specify such designs for predicting categorical variables, i.e., to solve classification
problems.
Regression-type problems. Regression-type problems are generally those where we attempt to
predict the values of a continuous variable from one or more continuous and/or categorical predictor
variables. For example, we may want to predict the selling prices of single family homes (a
continuous dependent variable) from various other continuous predictors (e.g., square footage) as well
as categorical predictors (e.g., style of home, such as ranch, two-story, etc.; zip code or telephone area
code where the property is located, etc.; note that this latter variable would be categorical in nature,
even though it would contain numeric values or codes). If we used simple multiple regression, or
some general linear model (GLM) to predict the selling prices of single family homes, we would
determine a linear equation for these variables that can be used to compute predicted selling prices.
There are many different analytic procedures for fitting linear models (GLM, GRM, Regression),
various types of nonlinear models (e.g., Generalized Linear/Nonlinear Models (GLZ), Generalized
Additive Models (GAM), etc.), or completely custom-defined nonlinear models (see Nonlinear
Estimation), where we can type in an arbitrary equation containing parameters to be estimated.
CHAID also analyzes regression-type problems, and produces results that are similar (in nature) to
those computed by C&RT. Note that various neural network architectures are also applicable to solve
regression-type problems.
Classification-type problems. Classification-type problems are generally those where we attempt to
predict values of a categorical dependent variable (class, group membership, etc.) from one or more
continuous and/or categorical predictor variables. For example, we may be interested in predicting
who will or will not graduate from college, or who will or will not renew a subscription. These would
be examples of simple binary classification problems, where the categorical dependent variable can
only assume two distinct and mutually exclusive values. In other cases, we might be interested in
predicting which one of multiple different alternative consumer products (e.g., makes of cars) a
person decides to purchase, or which type of failure occurs with different types of engines. In those
cases there are multiple categories or classes for the categorical dependent variable. There are a
number of methods for analyzing classification-type problems and to compute predicted
classifications, either from simple continuous predictors (e.g., binomial or multinomial logit
regression in GLZ), from categorical predictors (e.g., Log-Linear analysis of multi-way frequency
tables), or both (e.g., via ANCOVA-like designs in GLZ or GDA). CHAID also analyzes
classification-type problems, and produces results that are similar (in nature) to those computed
by C&RT. Note that various neural network architectures are also applicable to solve classification-type problems.
CLASSIFICATION AND REGRESSION TREES (C&RT)
In most general terms, the purpose of the analyses via tree-building algorithms is to determine a set
of if-then logical (split) conditions that permit accurate prediction or classification of cases.
CLASSIFICATION TREES
For example, consider the widely referenced Iris data classification problem introduced by Fisher
[1936; see also Discriminant Function Analysis and General Discriminant Analysis (GDA)]. The data
file Irisdat reports the lengths and widths of sepals and petals of three types of irises (Setosa,
Versicol, and Virginic). The purpose of the analysis is to learn how we can discriminate between the
three types of flowers, based on the four measures of width and length of petals and sepals.
Discriminant function analysis will estimate several linear combinations of predictor variables for
computing classification scores (or probabilities) that allow the user to determine the predicted
classification for each observation. A classification tree will determine a set of logical if-then
conditions (instead of linear equations) for predicting or classifying cases:
The interpretation of this tree is straightforward: if the petal width is less than or equal to 0.8, the
respective flower is classified as Setosa; if the petal width is greater than 0.8 and less than or equal
to 1.75, the flower is classified as Versicol; else, it belongs to class Virginic.
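A classification tree of this kind can be grown in R with the rpart package. The sketch below uses R's built-in iris data rather than the Irisdat file referred to above, so the split values obtained may differ slightly from those quoted in the text.

library(rpart)
tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
              data = iris, method = "class")
print(tree)                 # the if-then split conditions
# plot(tree); text(tree)    # draw the tree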
REGRESSION TREES
The general approach to derive predictions from few simple if-then conditions can be applied to
regression problems as well. This example is based on the data file Poverty, which contains 1960 and
1970 Census figures for a random selection of 30 counties. The research question (for that example)
was to determine the correlates of poverty, that is, the variables that best predict the percent of
families below the poverty line in a county. A reanalysis of those data, using regression tree analysis
(and v-fold cross-validation), yields the following results.
Again, the interpretation of these results is rather straightforward: counties where the percent of
households with a phone is greater than 72% generally have a lower poverty rate. The greatest
poverty rate is evident in those counties that show less than (or equal to) 72% of households with a
phone, and where the population change (from the 1960 census to the 1970 census) is less than -8.3
(minus 8.3). These results are straightforward, easily presented, and intuitively clear as well: there are
some affluent counties (where most households have a telephone), and those generally have little
poverty. Then there are counties that are generally less affluent, and among those the ones that shrunk
most showed the greatest poverty rate. A quick review of the scatterplot of observed vs. predicted
values shows how the discrimination between the latter two groups is particularly well "explained" by
the tree model.
As mentioned earlier, there are a large number of methods that an analyst can choose from when
analyzing classification or regression problems. Tree classification techniques, when they "work" and
produce accurate predictions or predicted classifications based on few logical if-then conditions, have
a number of advantages over many of those alternative techniques.
Simplicity of results. In most cases, the interpretation of results summarized in a tree is very simple.
This simplicity is useful not only for purposes of rapid classification of new observations (it is much
easier to evaluate just one or two logical conditions, than to compute classification scores for each
possible group, or predicted values, based on all predictors and using possibly some complex
nonlinear model equations), but can also often yield a much simpler "model" for explaining why
observations are classified or predicted in a particular manner (e.g., when analyzing business
problems, it is much easier to present a few simple if-then statements to management, than some
elaborate equations).
Tree methods are nonparametric and nonlinear. The final results of using tree methods for
classification or regression can be summarized in a series of (usually few) logical if-then conditions
(tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the
predictor variables and the dependent variable are linear, follow some specific non-linear link
function [e.g., see Generalized Linear/Nonlinear Models (GLZ)], or that they are even monotonic in
nature. For example, some continuous outcome variable of interest could be positively related to a
variable Income if the income is less than some certain amount, but negatively related if it is more
than that amount (i.e., the tree could reveal multiple splits based on the same variable Income,
revealing such a non-monotonic relationship between the variables). Thus, tree methods are
particularly well suited for data mining tasks, where there is often little a priori knowledge nor any
coherent set of theories or predictions regarding which variables are related and how. In those types
of data analyses, tree methods can often reveal simple relationships between just a few variables that
could have easily gone unnoticed using other analytic techniques.
GENERAL COMPUTATION ISSUES AND UNIQUE SOLUTIONS OF C&RT
The computational details involved in determining the best split conditions to construct a simple yet
useful and informative tree are quite complex. Refer to Breiman et al. (1984) for a discussion of their
CART algorithm to learn more about the general theory of and specific computational solutions for
constructing classification and regression trees. An excellent general discussion of tree classification
and regression methods, and comparisons with other approaches to pattern recognition and neural
networks, is provided in Ripley (1996).
AVOIDING OVER-FITTING: PRUNING, CROSS-VALIDATION, AND V-FOLD CROSS-VALIDATION
A major issue that arises when applying regression or classification trees to "real" data with much
random error noise concerns the decision when to stop splitting. For example, if we had a data set
with 10 cases, and performed 9 splits (determined 9 if-then conditions), we could perfectly predict
every single case. In general, if we only split a sufficient number of times, eventually we will be able
to "predict" ("reproduce" would be the more appropriate term here) our original data (from which we
determined the splits). Of course, it is far from clear whether such complex results (with many splits)
will replicate in a sample of new observations; most likely they will not.
This general issue is also discussed in the literature on tree classification and regression methods, as
well as neural networks, under the topic of "overlearning" or "overfitting." If not stopped, the tree
algorithm will ultimately "extract" all information from the data, including information that is not and
cannot be predicted in the population with the current set of predictors, i.e., random or noise
variation. The general approach to addressing this issue is first to stop generating new split nodes
when subsequent splits only result in very little overall improvement of the prediction. For example,
if we can predict 90% of all cases correctly from 10 splits, and 90.1% of all cases from 11 splits, then
it obviously makes little sense to add that 11th split to the tree. There are many such criteria for
automatically stopping the splitting (tree-building) process.
Once the tree building algorithm has stopped, it is always useful to further evaluate the quality of the
prediction of the current tree in samples of observations that did not participate in the original
computations. These methods are used to "prune back" the tree, i.e., to eventually (and ideally) select
a simpler tree than the one obtained when the tree building algorithm stopped, but one that is equally
as accurate for predicting or classifying "new" observations.
Crossvalidation. One approach is to apply the tree computed from one set of observations (learning
sample) to another completely independent set of observations (testing sample). If most or all of the
splits determined by the analysis of the learning sample are essentially based on "random noise," then
the prediction for the testing sample will be very poor. Hence, we can infer that the selected tree is
not very good (useful), and not of the "right size."
V-fold crossvalidation. Continuing further along this line of reasoning (described in the context of
crossvalidation above), why not repeat the analysis many times over with different randomly drawn
samples from the data, for every tree size starting at the root of the tree, and applying it to the
prediction of observations from randomly selected testing samples. Then use (interpret, or accept as
our final result) the tree that shows the best average accuracy for cross-validated predicted
classifications or predicted values. In most cases, this tree will not be the one with the most terminal
nodes, i.e., the most complex tree. This method for pruning a tree, and for selecting a smaller tree
from a sequence of trees, can be very powerful, and is particularly useful for smaller data sets. It is an
essential step for generating useful (for prediction) tree models, and because it can be
computationally difficult to do, this method is often not found in tree classification or regression
software.
REVIEWING LARGE TREES: UNIQUE ANALYSIS MANAGEMENT TOOLS
Another general issue that arises when applying tree classification or regression methods is that the
final trees can become very large. In practice, when the input data are complex and, for example,
contain many different categories for classification problems and many possible predictors for
performing the classification, then the resulting trees can become very large. This is not so much a
computational problem as it is a problem of presenting the trees in a manner that is easily accessible
to the data analyst, or for presentation to the "consumers" of the research.
ANALYZING ANCOVA-LIKE DESIGNS
The classic (Breiman et al., 1984) classification and regression trees algorithms can accommodate
both continuous and categorical predictors. However, in practice, it is not uncommon to combine such
variables into analysis of variance/covariance (ANCOVA) like predictor designs with main effects or
interaction effects for categorical and continuous predictors. This method of analyzing coded
ANCOVA-like designs is relatively new. However, it is easy to see how the use of coded
predictor designs expands these powerful classification and regression techniques to the analysis of
data from experimental designs (e.g., see for example the detailed discussion of experimental design
methods for quality improvement in the context of the Experimental Design module of Industrial
Statistics).
Computational Details
The process of computing classification and regression trees can be characterized as involving four
basic steps:
Specifying the criteria for predictive accuracy
Selecting splits
Determining when to stop splitting
Selecting the "right-sized" tree
Misclassification costs. Sometimes more accurate classification of the response is desired for some
classes than others for reasons not related to the relative class sizes. If the criterion for predictive
accuracy is Misclassification costs, then minimizing costs would amount to minimizing the
proportion of misclassified cases when priors are considered proportional to the class sizes and
misclassification costs are taken to be equal for every class.
Case weights. Case weights are treated strictly as case multipliers. For example, the misclassification
rates from an analysis of an aggregated data set using case weights will be identical to the
misclassification rates from the same analysis where the cases are replicated the specified number of
times in the data file.
However, note that the use of case weights for aggregated data sets in classification problems is
related to the issue of minimizing costs. Interestingly, as an alternative to using case weights for
aggregated data sets, we could specify appropriate priors and/or misclassification costs and produce
the same results while avoiding the additional processing required to analyze multiple cases with the
same values for all variables. Suppose that in an aggregated data set with two classes having an equal
number of cases, there are case weights of 2 for all cases in the first class, and case weights of 3 for
all cases in the second class. If we specified priors of .4 and .6, respectively, specified equal
misclassification costs, and analyzed the data without case weights, we will get the same
misclassification rates as we would get if we specified priors estimated by the class sizes, specified
equal misclassification costs, and analyzed the aggregated data set using the case weights. We would
also get the same misclassification rates if we specified priors to be equal, specified the costs of
misclassifying class 1 cases as class 2 cases to be 2/3 of the costs of misclassifying class 2 cases as
class 1 cases, and analyzed the data without case weights.
SELECTING SPLITS
The second basic step in classification and regression trees is to select the splits on the predictor
variables that are used to predict membership in classes of the categorical dependent variables, or to
predict values of the continuous dependent (response) variable. In general terms, the split at each
node will be found that will generate the greatest improvement in predictive accuracy. This is usually
measured with some type of node impurity measure, which provides an indication of the relative
homogeneity (the inverse of impurity) of cases in the terminal nodes. If all cases in each terminal
node show identical values, then node impurity is minimal, homogeneity is maximal, and prediction
is perfect (at least for the cases used in the computations; predictive validity for new cases is of
course a different matter...).
For classification problems, C&RT gives the user the choice of several impurity measures: The Gini
index, Chi-square, or G-square. The Gini index of node impurity is the measure most commonly
chosen for classification-type problems. As an impurity measure, it reaches a value of zero when only
one class is present at a node. With priors estimated from class sizes and equal misclassification
costs, the Gini measure is computed as the sum of products of all pairs of class proportions for classes
present at the node; it reaches its maximum value when class sizes at the node are equal; the Gini
index is equal to zero if all cases in a node belong to the same class. The Chi-square measure is
similar to the standard Chi-square value computed for the expected and observed classifications (with
priors adjusted for misclassification cost), and the G-square measure is similar to the maximum-likelihood Chi-square (as for example computed in the Log-Linear module). For regression-type
problems, a least-squares deviation criterion (similar to what is computed in least squares regression)
is automatically used. Computational Formulas provides further computational details.
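As a quick numerical illustration of the Gini index, with made-up class proportions (0.5, 0.3, 0.2) at a node rather than values from any data set in this binder:

p <- c(0.5, 0.3, 0.2)
sum(outer(p, p)) - sum(p^2)   # sum of products of all pairs of class proportions = 0.62
1 - sum(p^2)                  # equivalent form of the Gini index, also 0.62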
DETERMINING WHEN TO STOP SPLITTING
As discussed in Basic Ideas, in principle, splitting could continue until all cases are perfectly
classified or predicted. However, this wouldn't make much sense since we would likely end up with a
tree structure that is as complex and "tedious" as the original data file (with many nodes possibly
containing single observations), and that would most likely not be very useful or accurate for
predicting new observations. What is required is some reasonable stopping rule. In C&RT, two
options are available that can be used to keep a check on the splitting process; namely Minimum n
and Fraction of objects.
Minimum n. One way to control splitting is to allow splitting to continue until all terminal nodes are
pure or contain no more than a specified minimum number of cases or objects. In C&RT this is done
by using the option Minimum n that allows us to specify the desired minimum number of cases as a
check on the splitting process. This option can be used when Prune on misclassification error, Prune
on deviance, or Prune on variance is active as the Stopping rule for the analysis.
Fraction of objects. Another way to control splitting is to allow splitting to continue until all
terminal nodes are pure or contain no more cases than a specified minimum fraction of the sizes of
one or more classes (in the case of classification problems, or all cases in regression problems). This
option can be used when FACT-style direct stopping has been selected as the Stopping rule for the
analysis. In C&RT, the desired minimum fraction can be specified as the Fraction of objects. For
classification problems, if the priors used in the analysis are equal and class sizes are equal as well,
then splitting will stop when all terminal nodes containing more than one class have no more cases
than the specified fraction of the class sizes for one or more classes. Alternatively, if the priors used
in the analysis are not equal, splitting will stop when all terminal nodes containing more than one
class have no more cases than the specified fraction for one or more classes. See Loh and
Vanichsetakul (1988) for details.
PRUNING AND SELECTING THE "RIGHT-SIZED" TREE
The size of a tree in the classification and regression trees analysis is an important issue, since an
unreasonably big tree can only make the interpretation of results more difficult. Some generalizations
can be offered about what constitutes the "right-sized" tree. It should be sufficiently complex to
account for the known facts, but at the same time it should be as simple as possible. It should exploit
information that increases predictive accuracy and ignore information that does not. It should, if
possible, lead to greater understanding of the phenomena it describes. The options available
in C&RT allow the use of either, or both, of two different strategies for selecting the "right-sized" tree
from among all the possible trees. One strategy is to grow the tree to just the right size, where the
right size is determined by the user, based on the knowledge from previous research, diagnostic
information from previous analyses, or even intuition. The other strategy is to use a set of well-documented, structured procedures developed by Breiman et al. (1984) for selecting the "right-sized"
tree. These procedures are not foolproof, as Breiman et al. (1984) readily acknowledge, but at least
they take subjective judgment out of the process of selecting the "right-sized" tree.
FACT-style direct stopping. We will begin by describing the first strategy, in which the user
specifies the size to grow the tree. This strategy is followed by selecting FACT-style direct stopping
as the stopping rule for the analysis, and by specifying the Fraction of objects that allows the tree to
grow to the desired size. C&RT provides several options for obtaining diagnostic information to
determine the reasonableness of the choice of size for the tree. Specifically, three options are
available for performing cross-validation of the selected tree; namely Test sample, V-fold, and
Minimal cost-complexity.
Test sample cross-validation. The first, and most preferred type of cross-validation is the test
sample cross-validation. In this type of cross-validation, the tree is computed from the learning
sample, and its predictive accuracy is tested by applying it to predict the class membership in the test
sample. If the costs for the test sample exceed the costs for the learning sample, then this is an
indication of poor cross-validation. In that case, a different sized tree might cross-validate better. The
test and learning samples can be formed by collecting two independent data sets, or if a large learning
sample is available, by reserving a randomly selected proportion of the cases, say a third or a half, for
use as the test sample.
In the C&RT module, test sample cross-validation is performed by specifying a sample identifier
variable that contains codes for identifying the sample (learning or test) to which each case or object
belongs.
V-fold cross-validation. The second type of cross-validation available in C&RT is V-fold cross-validation. This type of cross-validation is useful when no test sample is available and the learning
sample is too small to have the test sample taken from it. The user-specified 'v' value for v-fold cross-validation (its default value is 3) determines the number of random subsamples, as equal in size as
possible, that are formed from the learning sample. A tree of the specified size is computed 'v' times,
each time leaving out one of the subsamples from the computations, and using that subsample as a
test sample for cross-validation, so that each subsample is used (v - 1) times in the learning sample
and just once as the test sample. The CV costs (cross-validation cost) computed for each of the 'v' test
samples are then averaged to give the v-fold estimate of the CV costs.
Minimal cost-complexity cross-validation pruning. In C&RT, minimal cost-complexity cross-validation pruning is performed if Prune on misclassification error has been selected as the Stopping
rule. On the other hand, if Prune on deviance has been selected as the Stopping rule, then minimal
deviance-complexity cross-validation pruning is performed. The only difference in the two options is
the measure of prediction error that is used. Prune on misclassification error uses the costs that
equals the misclassification rate when priors are estimated and misclassification costs are equal,
while Prune on deviance uses a measure, based on maximum-likelihood principles, called the
deviance (see Ripley, 1996). For details about the algorithms used in C&RT to implement Minimal
cost-complexity cross-validation pruning, see also the Introductory Overview and Computational
Methods sections of Classification Trees Analysis.
The sequence of trees obtained by this algorithm has a number of interesting properties. They are
nested, because the successively pruned trees contain all the nodes of the next smaller tree in the
sequence. Initially, many nodes are often pruned going from one tree to the next smaller tree in the
sequence, but fewer nodes tend to be pruned as the root node is approached. The sequence of largest
trees is also optimally pruned, because for every size of tree in the sequence, there is no other tree of
the same size with lower costs. Proofs and/or explanations of these properties can be found in
Breiman et al. (1984).
Tree selection after pruning. The pruning, as discussed above, often results in a sequence of
optimally pruned trees. So the next task is to use an appropriate criterion to select the "right-sized"
tree from this set of optimal trees. A natural criterion would be the CV costs (cross-validation costs).
While there is nothing wrong with choosing the tree with the minimum CV costs as the "right-sized"
tree, oftentimes there will be several trees with CV costs close to the minimum. Following Breiman et
al. (1984) we could use the "automatic" tree selection procedure and choose as the "right-sized" tree
the smallest-sized (least complex) tree whose CV costs do not differ appreciably from the minimum
CV costs. In particular, they proposed a "1 SE rule" for making this selection, i.e., choose as the
"right-sized" tree the smallest-sized tree whose CV costs do not exceed the minimum CV costs plus 1
times the standard error of the CV costs for the minimum CV costs tree. In C&RT, a multiple other
than the 1 (the default) can also be specified for the SE rule. Thus, specifying a value of 0.0 would
result in the minimal CV cost tree being selected as the "right-sized" tree. Values greater than 1.0
could lead to trees much smaller than the minimal CV cost tree being selected as the "right-sized"
tree. One distinct advantage of the "automatic" tree selection procedure is that it helps to avoid
"over-fitting" and "under-fitting" of the data.
As can be seen, minimal cost-complexity cross-validation pruning and subsequent "right-sized"
tree selection is a truly "automatic" process. The algorithms make all the decisions leading to the
selection of the "right-sized" tree, except for, perhaps, specification of a value for the SE rule. V-fold
cross-validation allows us to evaluate how well each tree "performs" when repeatedly cross-validated
in different samples randomly drawn from the data.
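Cost-complexity pruning guided by cross-validation can be tried in R with the rpart package, whose terminology differs slightly from the C&RT module described here. The sketch below is only illustrative and again uses the built-in iris data.

library(rpart)
set.seed(1)
tree <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(xval = 10))      # 10-fold cross-validation
printcp(tree)                        # CV error ("xerror") and its SE for each tree size
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp) # tree with minimum cross-validated error
# A 1-SE-style choice picks the smallest tree whose xerror is within one xstd of the minimum.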
Computational Formulas
In Classification and Regression Trees, estimates of accuracy are computed by different formulas for
categorical and continuous dependent variables (classification and regression-type problems). For
classification-type problems (categorical dependent variable) accuracy is measured in terms of the
true classification rate of the classifier, while in the case of regression (continuous dependent
variable) accuracy is measured in terms of mean squared error of the predictor.
In addition to measuring accuracy, the following measures of node impurity are used for
classification problems: the Gini measure, generalized Chi-square measure, and generalized G-square measure. The Chi-square measure is similar to the standard Chi-square value computed for the
expected and observed classifications (with priors adjusted for misclassification cost), and the G-square measure is similar to the maximum-likelihood Chi-square (as for example computed in
the Log-Linear module). The Gini measure is the one most often used for measuring purity in the
context of classification problems, and it is described below.
For continuous dependent variables (regression-type problems), the least squared deviation (LSD)
measure of impurity is automatically applied.
ESTIMATION OF ACCURACY IN CLASSIFICATION
In classification problems (categorical dependent variable), three estimates of the accuracy are used:
resubstitution estimate, test sample estimate, and v-fold cross-validation. These estimates are defined
here.
Resubstitution estimate. The resubstitution estimate is the proportion of cases that are misclassified by the
classifier constructed from the entire sample. It is computed by checking, for every case in the sample,
whether the class predicted by the classifier d(x) differs from the observed class, using an indicator X that
equals 1 if the statement is true and 0 if the statement is false. The resubstitution estimate is therefore
computed using the same data as were used in constructing the classifier d.

Test sample estimate. The total number of cases is divided into two subsamples: a learning subsample,
from which the classifier is constructed, and a test subsample, which is not used for constructing the
classifier. The test sample estimate is the proportion of cases in the test subsample that are misclassified
by the classifier constructed from the learning subsample.

v-fold cross-validation. The total number of cases is divided into v subsamples of almost equal sizes.
For each subsample in turn, a classifier is constructed from the remaining cases (the learning sample with
that subsample left out), and the proportion of misclassified cases in the held-out subsample is recorded.
The v-fold cross-validation estimate is the average of these v proportions.
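In standard notation (a sketch following Breiman et al., 1984, writing X(.) for the indicator just defined, d for the classifier, N for the total number of cases and L for the learning sample), the three estimates can be written as:

\[
R^{\mathrm{resub}}(d) \;=\; \frac{1}{N}\sum_{n=1}^{N} X\!\left(d(x_n) \neq j_n\right)
\]
\[
R^{\mathrm{ts}}(d) \;=\; \frac{1}{N_2}\sum_{(x_n,\, j_n)\in\mathcal{L}_2} X\!\left(d(x_n) \neq j_n\right),
\qquad \mathcal{L}=\mathcal{L}_1\cup\mathcal{L}_2,\ \ d \text{ constructed from } \mathcal{L}_1
\]
\[
R^{\mathrm{cv}}(d) \;=\; \frac{1}{v}\sum_{k=1}^{v}\;\frac{1}{N_k}\sum_{(x_n,\, j_n)\in\mathcal{L}_k} X\!\left(d^{(k)}(x_n) \neq j_n\right),
\qquad d^{(k)} \text{ constructed from } \mathcal{L}\setminus\mathcal{L}_k
\]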
The Gini measure is the measure of impurity of a node and is commonly used when the dependent variable
is categorical; it is defined as the sum of products of all pairs of class proportions present at the node
(equivalently, one minus the sum of the squared class proportions).
Least-squared deviation (LSD) is used as the measure of impurity of a node when the response variable is
continuous; it is the weighted average of the squared deviations of the responses in the node from the node
mean, where Nw(t) is the weighted number of cases in node t, wi is the value of the weighting variable for
case i, fi is the value of the frequency variable, yi is the value of the response variable, and y(t) is the
weighted mean for node t.
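In the notation just defined, and writing p(j|t) for the proportion of class j cases at node t, the two impurity measures take their standard forms:

\[
g(t) \;=\; \sum_{i \neq j} p(i \mid t)\, p(j \mid t) \;=\; 1-\sum_{j} p(j \mid t)^{2}
\]
\[
\mathrm{LSD}(t) \;=\; \frac{1}{N_w(t)}\sum_{i \in t} w_i\, f_i\,\bigl(y_i-\bar{y}(t)\bigr)^{2}
\]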