
Session 8: Dimension Reduction Method - The Factor Analysis

Dr. Mahesh K C 1
Factor Analysis
• In real life, data tends to follow patterns, but the reasons are not apparent right from the start of the data analysis.
• In a demographics-based survey, many people will answer questions in a particular 'way'. For example, married men will tend to have higher expenses than single men, but lower expenses than married men with children.

• In this case, the driving factor that makes them answer in a pattern is economic status, but the answers may also depend on other factors such as level of education, salary, and locality or area.
• It becomes complicated to attribute answers to multiple factors manually.
• One option is to map all variables (or answers) automatically onto new categories, with customized weights (or loadings) based on their influence on each category.

• Factor analysis starts with the assumption of hidden latent variables that cannot be observed directly but are reflected in the answers (variables) of the data.
• It also starts by assuming that there are as many factors as there are variables.
An example
• Consider the following factor loadings from an airline customer satisfaction survey.

• The first factor may represent the customer experience after on-boarding.
• The second factor reflects the airline booking experience and related perks.
• The third factor shows the competitive advantage of the airline's flights compared with its competitors.
• It is the factor loadings and their interpretation that make factor analysis so important, together with the ability to scale down to a few factors without losing much information.
Factor Analysis Model (Optional)
• The set of variables X = (X1, X2, …, Xm) can be modeled as linear combinations of a smaller set of k (< m) unobserved ("latent") factors F1, F2, …, Fk together with an error term ε = (ε1, ε2, …, εm).
• Factor analysis model:
  X(m×1) = L(m×k) F(k×1) + ε(m×1)

• X is the standardized data vector.


• L is the matrix of factor loadings, with lij representing the loading of the ith variable on the jth factor.
• F represents the vector of unobservable (or latent) common factors.
• e represents the error vector such that E(e) = 0, V(e) is a diagonal matrix.

• Total variance in X = shared variance explained by the factors + error variance.
• A significantly high correlation among the variables is required to conduct FA.
• Factor analysis is similar to PCA, but the goal of factor analysis is to model the data.
Factor Analysis (FA) Process
• Formulate the FA problem.
• Identify the variables (measured on either interval or ratio scale) to be factor
analyzed.
• Partition the data into "training" and "test" sets.
• Standardize the training data based on the variables selected.
• Check whether significant correlations exist among these variables, based on the partitioned (training) data.
• Select a method of FA (exploratory/confirmatory).
• Decide the number of factors (eigenvalue, scree plot, and communalities) to
be extracted and the method of rotation (varimax).

• Interpretation of factors.
Kaiser-Meyer-Olkin (KMO) Test and Bartlett’s Test for Sphericity
• KMO test: a measure of sampling adequacy used to examine the appropriateness of FA.
• The test measures the sampling adequacy of each variable in the data as well as the overall sampling adequacy of the data.
• KMO lies between 0 and 1 (0 ≤ KMO ≤ 1); KMO < 0.5 indicates that FA may not be appropriate.
• Essentially, KMO tests the level of correlation between the variables.

• Bartlett test for Sphericity: A test used to examine the hypothesis that the variables
are uncorrelated.
• Let ρ denote the correlation matrix of order m and I the identity matrix of order m. Then the hypothesis to be tested is
  H0: ρ = I against H1: ρ ≠ I

• If p-value < 0.01, reject the null hypothesis that no correlation exists among the
variables.
Factor rotation: Varimax Method
• Solutions to factor analysis are non-unique without further constraints; this motivates the need for factor rotation.

• Factor loadings represent the correlations between the factors and the variables. A large absolute value indicates that the factor and the variable are closely related.
• With un-rotated factor loadings, a factor may be correlated with several variables, which makes interpretation difficult.

• By rotating the factors, we would like each factor to have significant loadings on only a few of the variables and, similarly, each variable to have significant loadings on only a few factors.
• Rotation does not affect communalities and percentage of total variance explained.

• Varimax: an orthogonal (axes are maintained at right angles) method of factor rotation that minimizes the number of variables with high loadings on a factor, thereby enhancing the interpretability of the factors.
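A minimal R sketch of the FA workflow described above, assuming a data frame named survey (a hypothetical name) containing only interval/ratio-scaled variables; KMO() and cortest.bartlett() come from the psych package, while factanal() is part of base R.

```r
# Sketch of the FA workflow; `survey` is a hypothetical data frame of interval-scaled variables
library(psych)

Z <- scale(survey)                 # standardize the (training) data
R <- cor(Z)                        # correlation matrix of the standardized variables

KMO(R)                             # sampling adequacy; overall MSA < 0.5 suggests FA is not appropriate
cortest.bartlett(R, n = nrow(Z))   # Bartlett's test of sphericity: H0: correlation matrix = identity

# Extract, say, 3 factors with varimax rotation (choose the number from eigenvalues/scree plot)
fa_fit <- factanal(Z, factors = 3, rotation = "varimax", scores = "regression")
print(fa_fit$loadings, cutoff = 0.4)   # suppress small loadings to aid interpretation
```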
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 8
Session 9: Simple Linear Regression
Revisited and Diagnostics

Dr. Mahesh K C 1
Introduction
• Managerial decisions are often based on the relationship between two or more variables.
• Examples:
• A company in the distribution business may be interested in the relationship between the price of crude oil and the company's transportation cost.
• A marketing executive might want to know how strong the relationship is between advertising expenditure and sales.
• An economist may be interested in the relationship between income and expenditure.
• An airline may be interested in predicting the cost of flying based on the type of plane, distance, number of passengers, etc.

• Two key concepts are correlation analysis and regression analysis.
• Correlation measures the degree of the relationship between the variables, i.e., how strong or weak the relationship between them is.
• Regression tries to build a mathematical model (equation) which can be used for prediction.

Dr. Mahesh K C 2
Scatter Plot: Linear and Non-Linear
• A graphical way of identifying the relationship between two variables.
• Let x = student population (in 1000s) and y = sales ($1000s). See Figure 1.
• Let x = months employed and y = items sold. See Figure 2.
[Figure 1: Linear scatter plot of Sales against Population]
[Figure 2: Non-linear scatter plot of items sold against Months employed]

Dr. Mahesh K C 3
Simple Linear Regression (SLR) Model
• Regression analysis is the process of constructing a mathematical model or
function that can be used to predict one variable by another variable or set of
variables.
• The variable being predicted is called the dependent variable (denoted by y) and
the variable being used to predict the dependent variable is called independent
variable (denoted by x).
• The equation that describes how y, the dependent variable, is related to x, the independent variable, and an error term is called the regression model. In SLR, the model used is:
Y = β0 + β1X + ε
where Y = (y1, y2, …, yn), X = (x1, x2, …, xn); β0 and β1 are referred to as the parameters of the model, and ε (epsilon) is a random variable referred to as the error term.
• The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.
Dr. Mahesh K C 4
Principle of Least Squares and Estimated regression line
• Since β0 and β1 are parameters in the regression model, generally unknown,
one has to use their estimated values, say, b0 and b1.
• The estimated values are obtained using the principle of least squares which
states that the sum of squares of errors should be minimum.
• Using this principle the values of b0 and b1 are as follows:

b1 = (Σxy/n − x̄·ȳ) / (Σx²/n − x̄²),   b0 = ȳ − b1·x̄

• The estimated regression line: ŷ = b0 + b1x
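As a brief illustration (using R's built-in mtcars data as a stand-in, not the slides' student-population example), the least squares estimates b0 and b1 can be obtained with lm():

```r
# Illustrative SLR fit on R's built-in mtcars data (a stand-in for the slides' example)
fit <- lm(mpg ~ wt, data = mtcars)   # response y = mpg, predictor x = weight

coef(fit)        # estimated intercept b0 and slope b1
summary(fit)     # t-tests, r-squared and the overall F-test
```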

Dr. Mahesh K C 5
How well does the estimated regression line fit the data?
• The least squares regression method can approximate the linear relationship between any two variables.
• How useful is the estimated regression line for making predictions?
• The coefficient of determination (the r² statistic) measures the estimated line's goodness of fit to the data.
• The following rough guidelines can sometimes be useful in judging the goodness of fit of the model:
  If r² ≥ 0.8, the regression is good;
  if 0.5 ≤ r² < 0.8, the regression is moderate; and
  if r² < 0.5, the regression is poor.
• These thresholds may differ depending on the data (e.g., in the social sciences).


Dr. Mahesh K C 6
The Pearson’s Correlation Coefficient
• Correlation Coefficient measures strength of linear relationship between two
quantitative variables
Cov x, y 
r ; -1  r  1
SD  x  SD  y 

• Rough guidelines for interpreting correlation between two variables:


r > 0.7 highly positively correlated
0.33 < r ≤ 0.7 mildly positively correlated
-0.33 < r ≤ 0.33 not correlated
-0.7 < r ≤ -0.33 mildly negatively correlated
r ≤ -0.7 highly negatively correlated
• r is conveniently expressed as r = ±√r²
• r is positive when b1 is positive, r is negative when b1 is negative
Dr. Mahesh K C 7
The Regression Model : Assumptions
• In regression, the model assumptions concern the error term (ε).
(1) Zero-mean assumption: the error term ε is a random variable with mean E(ε) = 0.
(2) Constant-variance assumption: the variance of ε is a constant, say σ², regardless of the value of x.
(3) Independence assumption: the values of ε are independent.
(4) Normality assumption: the error term ε is a normally distributed random variable.
• Summary: εi (i=1, 2, 3,…, n) are independent normally distributed with mean = 0
and constant variance σ2.

• Validating the regression assumptions is not essential when no inference or model building is performed; however, the assumptions must be validated whenever inference or model building is carried out.

Dr. Mahesh K C 8
Inference on Regression: The t-test and the F-test
• The inference on regression is done in two ways:
test of significance of predictor (the t-test) and
test of overall significance of the model (the F-test).

• Irrespective of the test, the hypothesis to be tested is H0: β1 = 0 against H1: β1 ≠ 0.

• Rejection Rule: Reject H0: if p-value ≤ α where α is the level of significance.

• Note that with one independent variable, the F-test will provide the same
conclusion as the t-test.
• But with more than one independent variable, only the F-test can be used to test for an overall significant relationship. In that case, the t-test is used to test the individual significance of each independent variable.
Dr. Mahesh K C 9
The Residual Analysis
• For model building and inference purposes, regression model assumptions
require validation.
• Estimated regression line: ŷ = b0 + b1x
• Residuals (actual value − estimated value): yi − ŷi, i = 1, 2, …, n

• Residuals may have different variances, so it is preferable to use standardized residuals (SR).
• Standardized residuals are residuals divided by their standard error, so that they are all on the same scale. The SR for the ith residual is given by:
  SRi = (yi − ŷi) / SE(yi − ŷi),   i = 1, 2, …, n

Dr. Mahesh K C 10
Outliers and High Leverage Values
• Observations with a very large standardized residual (in absolute value) are outliers.
• Generally, observations with standardized residuals beyond ±2 are flagged as outliers.
• Extreme observations in the predictor space (outliers in the predictors) are referred to as high leverage points.
• These are very large values of the predictor variable, identified without reference to the response variable.
• The lower and upper bounds of leverage are:
  1/n ≤ leverage ≤ 1
• Observations with leverage > 2(p + 1)/n are considered to have high leverage, where p is the number of predictors and n is the number of observations.
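Leverage values can be obtained in R with hatvalues() and compared against the 2(p + 1)/n rule of thumb; a sketch on the same illustrative fit:

```r
# Leverage (hat) values and the 2(p+1)/n rule of thumb (illustrative fit on mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

lev <- hatvalues(fit)
p   <- length(coef(fit)) - 1     # number of predictors
n   <- nrow(mtcars)
which(lev > 2 * (p + 1) / n)     # observations with high leverage
```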

Dr. Mahesh K C 11
Outlier Example
Fitted lines: with all points (Figure 1): y = -7.3305x + 64.958, R² = 0.4968; with the outlier removed (Figure 2): y = -7.0283x + 60.425, R² = 0.8912.

X   Y    SR
1   45   -1.06
1   55   -0.22
2   50   -0.02
3   75    2.68
3   40   -0.25
3   45    0.17
4   30   -0.47
4   35   -0.05
5   25   -0.28
6   15   -0.50

• In Figure 1, the point (3, 75) could be an outlier. The model has r² = 0.4968.
• After removing the outlier there is a substantial improvement in the model fit (r² = 0.8912); see Figure 2.
• The observation (3, 75) is an outlier since its SR = 2.68 exceeds 2.
Dr. Mahesh K C 12
Leverage Point Example
Fitted lines: with all points (Figure 1): y = -0.4251x + 127.47, R² = 0.7989; with the high leverage point removed (Figure 2): y = -1.0909x + 138.18, R² = 0.8727.

X    Y     Leverage
10   125   0.22
10   130   0.22
15   120   0.18
20   115   0.15
20   120   0.15
25   110   0.14
70   100   0.94

• The point (70, 100) could be a high leverage point, i.e., an outlier in the predictor variable.
• Removing it changes the fitted line and the r² value (see Figure 2).
• Its leverage of 0.94 exceeds the cutoff 2(p + 1)/n = 4/7 ≈ 0.57, so the observation at X = 70 has high leverage.

Dr. Mahesh K C 13
Influential Observation
• An influential observation significantly alters the regression parameters depending on its presence or absence in the data set.
• An outlier or a high leverage point may or may not be influential.
• Influential observations combine both characteristics: a large residual and high leverage.
• An observation not flagged as an outlier or as a high leverage point can still be influential.
• Cook’s Distance measures an observation’s level of influence and considers both
size of residual and leverage for an observation.
• In general, influential observations have Cook’s Distance > 1.0.
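Cook's distance is returned in R by cooks.distance(); observations exceeding 1.0 would be flagged as influential under the rule quoted above. A sketch:

```r
# Cook's distance for an illustrative fit (mtcars as a stand-in data set)
fit <- lm(mpg ~ wt, data = mtcars)

cd <- cooks.distance(fit)
which(cd > 1)              # influential observations by the D > 1 rule of thumb
```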

Dr. Mahesh K C 14
Influential Observation Example
Fitted lines: with all points (Figure 1): y = 0.0981x + 119.05, R² = 0.0751; with the influential point removed (Figure 2): y = -1.0909x + 138.18, R² = 0.8727.

X    Y     Cook's D
10   125   0.07
10   130   0.29
15   120   0.00
20   115   0.06
20   120   0.00
25   110   0.21
70   130   35.19

• Consider the observation (70, 130) in Figure 1, and note the estimated line and the corresponding r².
• Once we remove this point, the r² value increases sharply and the regression line changes drastically (compare the slopes in the two figures).
• Its Cook's distance D = 35.19 > 1, so the point (70, 130) is an influential observation.
Dr. Mahesh K C 15
Verifying Regression Assumptions
• For inferential purposes, adherence to the regression assumptions is essential.
• Two graphical methods are used to verify the assumptions:
(1) Normal probability plot of the residuals.
(2) Plot of standardized residuals (SR) against predicted values.
• Method 1: Normal probability plot (Q-Q plot)
• Determines whether the residual distribution deviates from normality.
• When the residuals are normal, the bulk of the points should lie on a straight line.
• Otherwise, systematic deviations from linearity indicate non-normality.

Dr. Mahesh K C 16
Verifying Regression Assumptions (cont’d)
• Method 2: Plot standardized residuals against predicted (fitted) values.
• Four commonly found patterns in residual-versus-fit plots are shown (panels A-D).
• Plot (A): a "healthy" plot in which no detectable patterns exist.
• The data points form an overall rectangular shape.
• Here, the regression assumptions remain intact.
• Plot (B): exhibits curvature, which violates the independence assumption.
• Plot (C): displays a "funnel" pattern, which violates the constant variance assumption.
• Plot (D): shows a pattern increasing from left to right, which violates the zero-mean assumption.
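Both graphical checks can be produced in a few lines of R; a sketch using the illustrative mtcars fit:

```r
# Method 1: normal probability (Q-Q) plot of the standardized residuals
fit <- lm(mpg ~ wt, data = mtcars)    # illustrative fit, mtcars as a stand-in
sr  <- rstandard(fit)

qqnorm(sr); qqline(sr)                # points close to the line suggest normality

# Method 2: standardized residuals against fitted values
plot(fitted(fit), sr,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)                # look for a patternless rectangular band
```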

Dr. Mahesh K C 17
What if the graphical tests indicate that regression assumption(s) are violated?
• Suppose, for example, that the constant variance assumption is violated.
• A transformation of the response and/or the predictor variable(s) may help.
• Frederick Mosteller and Tukey's Bulging Rule (FMTB rule):
• A "ladder of re-expressions" is proposed; these are essentially power transformations.
• Compare the curve in the scatter plot to the curves in the bulging-rule diagram.
• Ladder of re-expressions: t^(-3), t^(-2), t^(-1), t^(-1/2), ln(t), √t, t, t^2, t^3

Dr. Mahesh K C 18
Transformations to Achieve Linearity: Two variable case
• Scrabble® is a game in which players build crosswords by randomly selecting letter tiles. Each tile has an associated point value, and the point value is roughly related to the letter's frequency.
• The scatter plot indicates that the relationship between the two variables is curvilinear rather than linear.
• Therefore, modeling a linear relationship is not appropriate.

Dr. Mahesh K C 19
FMTB Rule applied to the Scrabble data
• The bulging rule says to move "x down, y down."
• We should transform x, moving down one or more positions from x's current position t^1 on the ladder; similarly, transform y from its position t^1.
• That is, the bulging rule suggests applying a square-root or log transformation to x and y.
• A square-root transformation is applied first, producing sqrt(points) and sqrt(frequency). However, the scatter plot shows that the relationship remains non-linear.
• Next, x and y are transformed using the log transformation. The scatter plot of ln(points) against ln(freq) shows a reasonably acceptable level of linearity.
• The regression of ln(points) on ln(freq) gives r² = 87.6%. In contrast, the untransformed regression of points on freq has r² = 45.5%.
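A sketch of this comparison in R, assuming a data frame named scrabble with columns points and freq (hypothetical names standing in for the Scrabble data, which is not reproduced here):

```r
# Assumes a data frame `scrabble` with columns `points` and `freq`
# (hypothetical names standing in for the Scrabble data)
raw_fit <- lm(points ~ freq, data = scrabble)             # untransformed model
log_fit <- lm(log(points) ~ log(freq), data = scrabble)   # ln-ln model per the bulging rule

summary(raw_fit)$r.squared    # reported on the slide as about 0.455
summary(log_fit)$r.squared    # reported on the slide as about 0.876
```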

Dr. Mahesh K C 20
Scatter plots: After square root and log transformations

[Scatter plot: SqPoints vs. SqFrequency (axes Psq vs. Fsq)]
[Scatter plot: logPoints vs. logFrequency (axes LogP vs. LogF)]

Dr. Mahesh K C 21
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 22
Session 10: Multiple Linear Regression-
Model Building

Dr. Mahesh K C 1
Multiple Linear Regression (MLR)
• Simple linear regression examines the relationship between a single predictor and the response.
• Multiple regression relates a set of predictors to a single continuous response.
• It provides improved precision for estimation and prediction.
• The model uses a plane or hyperplane to approximate the relationship between the predictor set and the single response.
• Predictors are typically continuous.
• Categorical predictors can be included through the use of indicator (dummy) variables.
• Here, the plane/hyperplane represents a linear surface in p dimensions.

Dr. Mahesh K C 2
The Multiple Regression Model
• Multiple regression model:
  y = β0 + β1x1 + β2x2 + … + βpxp + ε
  where β1, β2, …, βp are model parameters whose true values remain unknown and ε represents the error term.
• Model parameters are estimated from the data set using the method of least squares.
• The estimated regression plane: ŷ = b0 + b1x1 + b2x2 + … + bpxp

• We interpret the coefficient bi as the "estimated change in the response variable for a unit increase in variable xi, when all remaining predictors are held constant."
• The quantity (y − ŷ) measures the error in prediction and is called the residual. The residual equals the vertical distance between the data point and the regression plane (or hyperplane) in multiple regression.
• Coefficient of determination (R²): represents the proportion of variability in the response variable accounted for by its linear relationship with the predictor set.

Dr. Mahesh K C 3
An Example
• Consider an MLR model to estimate miles per gallon (mpg) based on weight (wt) and displacement (disp).
• A 3D scatter plot can be used to display the data.
• The estimated regression equation:
  mpg = b0 + b1(wt) + b2(disp)
• The fitted model corresponds to the approximating plane through this 3D scatter of points.
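The variables mpg, wt and disp match R's built-in mtcars data, so a hedged sketch of this fit (assuming mtcars is the intended data set) is:

```r
# MLR of mpg on weight and displacement; mtcars assumed to be the intended data set
fit <- lm(mpg ~ wt + disp, data = mtcars)

summary(fit)     # b0, b1, b2, their t-tests, R-squared, adjusted R-squared and the F-test
```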

Dr. Mahesh K C 4
Model Assumptions

• Zero-mean assumption: the error term ε is a random variable with mean E(ε) = 0.
• Constant-variance assumption: the variance of ε is constant regardless of the values of x1, x2, …, xp. This assumption is also known as homoscedasticity.
• Independence assumption: the values of ε are independent.
• Normality assumption: the error term ε is a normally distributed random variable.

Dr. Mahesh K C 5
Coefficient of Determination (R2) and Adjusted R2
• Would we expect a higher R² value when using two predictors rather than one?
• Yes: R² always increases when an additional predictor is included. When the new predictor is useful, R² increases substantially; otherwise, R² may increase by only a small or negligible amount.
• The largest R² may therefore occur for the model with the most predictors, rather than the best predictors.
• The adjusted R² measure "adjusts" R² by penalizing models that include non-useful predictors.
• If R²adj is noticeably smaller than R², at least one predictor in the model may be extraneous and should be omitted from the model.
• Models should be evaluated based on R²adj rather than R².
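For reference, a standard form of the adjustment (not written out on the slide) penalizes extra predictors through the error degrees of freedom:

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}$$

where n is the number of observations and p the number of predictors.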

Dr. Mahesh K C 6
Inference on Regression: The t-test and F-test
• The t-test is used to test the significance of the individual predictors in the regression model.
• Hypothesis test for each predictor: H0: βi = 0 against H1: βi ≠ 0 (i = 1, 2, …, p).
• Reject the null hypothesis when the p-value < level of significance (α).
• The F-test is used to test the overall significance of the regression model.
• Hypotheses for the F-test: H0: β1 = β2 = … = βp = 0 against H1: at least one βi ≠ 0.
• The ANOVA table
Source of Variation   Sum of Squares    Degrees of Freedom   Mean Square               F
Regression            SSR               p                    MSR = SSR / p             F = MSR / MSE
Error (Residual)      SSE               n − p − 1            MSE = SSE / (n − p − 1)
Total                 SST = SSR + SSE   n − 1

Reject H0 if p-value < α.

Dr. Mahesh K C 7
Multi-collinearity
• Multicollinearity is a condition in which two or more predictors are correlated.
• This leads to instability in the solution space, with possibly incoherent results.
• A data set with severe multicollinearity may have a significant F-test while none of the t-tests for the individual predictors is significant.
• Multicollinearity produces high variability in the coefficient estimates (b1, b2, …).
• Highly correlated variables tend to overemphasize a particular component of the regression model.
• The multicollinearity issue can be identified by examining the correlation structure among the predictors.
• One may use a matrix (pairs) plot to inspect the correlation structure.

Dr. Mahesh K C 8
Multicollinearity (cont'd)
• Consider an MLR with two predictors: ŷ = b0 + b1x1 + b2x2
• If the predictors x1 and x2 are uncorrelated (orthogonal), they form a solid basis on which the response surface y rests firmly, providing stable coefficients b1 and b2 (see figure A) with small variability SE(b1) and SE(b2).
• If the predictors x1 and x2 are correlated (a multicollinear situation), then as one of them increases, so does the other. In this case the predictors no longer form a solid basis on which the response surface rests firmly (it is unstable), producing highly variable coefficients b1 and b2 (see figure B) due to highly inflated values of SE(b1) and SE(b2).

Dr. Mahesh K C 9
Does a method exist to identify multicollinearity in a regression model?
• The variance inflation factor (VIF) measures the correlation between the ith predictor xi and the remaining predictor variables:
  VIFi = 1 / (1 − Ri²),   i = 1, 2, …, p
  where Ri² is the R² obtained from regressing xi on the remaining predictors.
• When xi is completely uncorrelated with the remaining predictors, Ri² = 0, which gives the minimum value VIFi = 1. Conversely, VIFi increases without bound as Ri² approaches 1.
• A large VIFi produces an inflated standard error for the corresponding estimate, degrading the precision of the estimates.
• In general, VIFi > 5 indicates moderate and VIFi > 10 indicates severe multicollinearity.
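VIFs can be computed with vif() from the car package (an assumption that this add-on package is acceptable here); a sketch on an illustrative mtcars fit:

```r
# VIFs for an illustrative multiple regression (mtcars as a stand-in data set)
library(car)                                   # provides vif()

fit <- lm(mpg ~ wt + disp + hp, data = mtcars)
vif(fit)                                       # values > 5 (moderate) or > 10 (severe) flag multicollinearity
```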

Dr. Mahesh K C 10
Some Guidelines for model building using multiple linear regression
Step 1: Detect (using VIF criterion) and eliminate multicollinearity (if present) by
dropping variables. Drop one variable at a time until multicollinearity is eliminated.
Step 2: Run regression and check for influential observations, outliers and high
leverage observations.
Step 3: If one or more influential observations/outliers/high leverage observations are present, delete one of them, rerun the regression, and go back to Step 2.
Step 4: Keep doing this until there are no further influential observations/outliers/high leverage observations, or until 10% (or 5%, case by case) of the data has been removed.
Step 5: Check the regression assumptions of linearity, normality, homoscedasticity and independence of the residuals.
Step 6: If some of the assumptions in Step 5 are violated, try using transformations. If no transformation can be found that corrects the violations, then STOP.

Dr. Mahesh K C 11
Step 7: When all the regression assumptions are met, look at the p-value of the F-test. If it is not significant, then STOP.
Step 8: If the p-value of the F-test is significant, look at the p-values of the individual coefficients. If some of the p-values are not significant, choose one of the variables with a non-significant p-value, drop it from the model, and run the regression again.
Step 9: Repeat Step 8 until the p-values of all the coefficients are significant.

Dr. Mahesh K C 12
Model Building: Health Care Revenue data
• These data were collected by the Department of Health and Social Services (DHSS) of the state of New Mexico and cover 52 of the 60 licensed facilities in New Mexico in 1998. Specific definitions of the variables are given below. The location variable records whether the facility is in a rural or non-rural area.
Variable   Definition
RURAL      Rural home (1) or non-rural home (0)
BED        Number of beds in home
MCDAYS     Annual medical in-patient days (hundreds)
TDAYS      Annual total patient days (hundreds)
PCREV      Annual total patient care revenue ($100)
NSAL       Annual nursing salaries ($100)
FEXP       Annual facilities expenditure ($100)

• DHSS is interested in predicting patient care revenue based on the other facility characteristics.
Dr. Mahesh K C 13
Model Building: HCR Data
• Objective: Build a model to predict patient care revenue.
• Response Variable: PCREV
• Continuous Predictors: BED, MCDAYS, TDAYS, NSAL, FEXP
• Categorical Predictor (dummy): RURAL
• Total records: 52
• Total Variables: 7 (6 predictors)
• No missing values

• Step 1: Check for multicollinearity. The variable TDAYS was dropped because its VIF = 8.47 > 5. We now have 5 predictors: BED, MCDAYS, NSAL, FEXP and RURAL. A repeated check showed that the resulting model (lm2) is free of multicollinearity.
• Step 2a: Check for influential observations. Since no residual has a Cook's distance greater than 1, there are no influential observations; the largest Cook's distance is 0.77.
Dr. Mahesh K C 14
Model Building Cont’d
• Step 2b: Check for outliers. The following standardized residual values exceed 2 in absolute value: 3.74, 3.55, 2.76 and 2.94. We deleted these one by one, updating the data set each time. The updated data (HRev5) at this stage consists of 48 records on 5 predictors, with a largest standardized residual of 1.94.
• Step 2c: Check for leverage values. The largest leverage value, 0.66, exceeds 2(p + 1)/n = 12/48 = 0.25. After removing the corresponding record, the new updated data set (HRev6, with 47 records) was again tested for leverage and the problem still persists.
• Since we have already deleted 5 records (about 10% of the 52 records), we stop at this stage and proceed to Step 3. The data (HRev6) now has 47 records and 5 predictors.
• Step 3: Check the regression assumptions. The standardized residuals versus fitted plot and the Q-Q plot show that the assumptions are more or less met, so we proceed to Step 4.
• Step 4a: We checked the significance of the variables and found that the predictor RURAL is not significant at the 5% level. We removed it and updated the data (HRev7).
• Step 4b: We again checked for significance and found that FEXP is not significant at 5%. We removed it and updated the data (HRev8).
• Step 4c: All remaining predictors are now significant, and the overall model is also significant.
Dr. Mahesh K C 15
Model Building Cont’d

Dr. Mahesh K C 16
Summary Results of Model 9

Variable    Estimate    Std. error   Pr(>|t|)
Intercept   -2056.49    833.18
BED         82.92       16.36        8.12e-06
MCDAYS      15.67       4.46         0.00106
NSAL        1.44        0.28         7.87e-06

Residual standard error: 1817 on 43 degrees of freedom. Multiple R-squared: 0.9045, Adjusted R-squared: 0.8978. F-statistic: 135.7 on 3 and 43 DF, p-value: < 2.2e-16.

Final MLR model:
PCREV = -2056.49 + 82.92 BED + 15.67 MCDAYS + 1.44 NSAL
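As a hedged illustration of how the final model would be used, the reported coefficients can be applied to a hypothetical facility (the input values below are made up purely for illustration):

```r
# Apply the reported final model to a hypothetical facility (inputs are illustrative only)
b <- c(intercept = -2056.49, BED = 82.92, MCDAYS = 15.67, NSAL = 1.44)

new_facility <- c(BED = 100, MCDAYS = 60, NSAL = 5000)    # hypothetical facility values
pcrev_hat <- b["intercept"] + sum(b[names(new_facility)] * new_facility)
pcrev_hat    # predicted annual patient care revenue (in $100s)
```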

Dr. Mahesh K C 17
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 18
Sessions 11&12: Multiple Regression
with Categorical Predictors & Variable
Selection Method-Backward Elimination

Dr. Mahesh K C 1
Regression with Categorical Predictors Using Indicator Variables
• Categorical variables can be included in the model through the use of indicator variables.
• Example: consider the Cars data set. Here mpg, cylinders, cubicinches, hp, weightlbs and time.to.60 are continuous variables, and brand is a categorical variable with three levels: US, Japan and Europe. The variable year is not considered.
• For regression, a categorical variable with k categories is transformed into k − 1 indicator (dummy) variables. An indicator variable is binary: it equals 1 when the observation belongs to the category, and 0 otherwise.
• The brand variable is transformed into two indicator (dummy) variables:
  C1 = 1 if brand is Japan, 0 otherwise;   C2 = 1 if brand is US, 0 otherwise.
• Note that brand = Europe is implied when C1 = 0 and C2 = 0; Europe is therefore known as the reference category.
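A sketch of how R would build these k − 1 indicators, assuming a data frame cars_df (a hypothetical name for the Cars data set) with the columns named on the slide:

```r
# Assumes a data frame `cars_df` (hypothetical name) with a categorical column `brand`
cars_df$brand <- relevel(factor(cars_df$brand), ref = "Europe")   # Europe as the reference category

# lm() creates the k - 1 = 2 dummies automatically:
fit <- lm(mpg ~ cylinders + cubicinches + hp + weightlbs + time.to.60 + brand, data = cars_df)

# The design matrix shows the two indicator columns (brandJapan = C1, brandUS = C2)
head(model.matrix(~ brand, data = cars_df))
```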

Dr. Mahesh K C 2
Estimated Regression Equation with Categorical Predictors
• Including the indicator variables in the model produces the estimated regression equation:
  mpg = b0 + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60) + b6C1 + b7C2
• Estimated regression equation when brand is Japan (C1 = 1, C2 = 0):
  mpg = (b0 + b6) + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60)
• Estimated regression equation when brand is US (C1 = 0, C2 = 1):
  mpg = (b0 + b7) + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60)
• Estimated regression equation when brand is Europe (C1 = C2 = 0):
  mpg = b0 + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60)

Dr. Mahesh K C 3
Variable Selection Methods
• Several variable selection methods available.
• Assist analyst in determining which variables to include in model.
• Algorithms help select predictors leading to optimal model.
• Four variable selection methods:
(1) Forward Selection
(2) Backwards Elimination
(3) Stepwise Selection
(4) Best Subsets

Dr. Mahesh K C 4
The Partial F-Test (Theory optional)
• Suppose the model has predictors x1, …, xp and we consider adding an additional predictor x*.
• Calculate the sequential sum of squares from adding x*, given that x1, …, xp are already in the model.
• Full model sum of squares SSFull: x1, …, xp and x* in the model.
• Reduced model sum of squares SSReduced: only x1, …, xp in the model.
• Therefore, the extra sum of squares is
  SSExtra = SS(x* | x1, x2, …, xp) = SSFull − SSReduced
• Hypotheses for the partial F-test:
  − H0: SSExtra associated with x* does not contribute significantly to the model.
  − Ha: SSExtra associated with x* does contribute significantly to the model.
• Test statistic for the partial F-test:
  F(x* | x1, x2, …, xp) = SSExtra / MSEFull
  which follows an F(1, n − p − 2) distribution when H0 is true.
• Therefore, H0 is rejected for a small p-value.
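In R, this partial F-test is exactly what anova() reports when the reduced and full nested models are compared; a sketch on illustrative mtcars models:

```r
# Partial F-test by comparing nested models (mtcars used as an illustrative stand-in)
reduced <- lm(mpg ~ wt + hp, data = mtcars)          # x1, ..., xp
full    <- lm(mpg ~ wt + hp + disp, data = mtcars)   # x1, ..., xp plus x* = disp

anova(reduced, full)   # F = SS_Extra / MSE_Full on 1 and n - p - 2 df; a small p-value keeps disp
```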

Dr. Mahesh K C 5
Backwards Elimination Procedure
• The procedure begins with all variables in the model (a brief R sketch follows the steps below).
• Step 1:
  Perform the regression on the full model with all variables.
  For example, assume the model has x1, …, x4.
• Step 2:
  For each variable in the model, perform the partial F-test.
  Select the variable with the smallest partial F-statistic, denoted Fmin.
• Step 3:
  If Fmin is not significant, remove the associated variable from the model and return to Step 2.
  Otherwise, if Fmin is significant, stop the algorithm and report the current model.
  If this is the first pass, the current model is the full model.
  If not, the full set of predictors has been reduced by one or more variables.
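A minimal sketch of one backward-elimination pass in R: drop1(..., test = "F") reports the partial F-test for removing each term, and the least significant term is dropped (step(..., direction = "backward") automates a similar search, though it uses AIC rather than F-tests). mtcars is used as a stand-in for the Cars data.

```r
# One backward-elimination pass (mtcars as an illustrative stand-in for the Cars data)
fit <- lm(mpg ~ cyl + disp + hp + wt + qsec, data = mtcars)

drop1(fit, test = "F")            # partial F-test for dropping each predictor in turn
# Remove the predictor with the smallest, non-significant F, refit, and repeat:
fit2 <- update(fit, . ~ . - cyl)  # example step: if cyl had the smallest non-significant F
drop1(fit2, test = "F")
```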

Dr. Mahesh K C 6
Backwards Elimination Applied to Cars Data Set
• We begin with all predictors (excluding the predictor “Year”) included in the model.

Model A: mpg = b0 + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60) + b6(brand)

• The partial F-statistic is calculated for each predictor. The smallest F-statistic, Fmin (= 0.5132), is associated with cubicinches. Since this Fmin is not significant at 5%, cubicinches is dropped.
• On the second pass, the predictor cylinders is eliminated, as its Fmin (= 0.4425) is not significant at 5%.
• On the third pass, the predictor time.to.60 is dropped, with Fmin (= 1.7229) not significant at 5%.
• Finally, all remaining predictors are significant at the 5% level.
• The procedure terminates with Model B:
Model B: mpg = b0 + b1(hp) + b2(weightlbs) + b6(brand)

Dr. Mahesh K C 7
Backwards Elimination Applied to Cars Data Set
• Variable selection methods usually take care of multicollinearity; still, one may check the resulting model for it.
• Based on Model B, check for influential observations, outliers and high leverage values.
• Check the regression assumptions. If any is violated, one may try a transformation of the response variable, the predictors, or both.

Dr. Mahesh K C 8
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 9
Session 14: Linear Discriminant Analysis
(LDA)

Dr. Mahesh K C 1
LDA: Basic Concept and Objectives
• A technique for analyzing multivariate data when the response variable is categorical and the predictors are interval-scaled in nature.
• In most cases the dependent variable consists of two groups or classifications, such as high versus normal blood pressure, loan default versus non-default, or use versus non-use of internet banking.
• The choice among three candidates, A, B or C, in an election is an example where the dependent variable consists of more than two groups.

• Objectives:
• Develop a discriminant function: a linear combination of the predictors that best discriminates between the categories of the response variable (the groups).
• Examine whether significant differences exist between the groups on the predictors.
• Classify cases into one of the groups based on the values of the predictors.
• Evaluate the accuracy of the classification.
Dr. Mahesh K C 2
Some Examples of LDA
• The technique can be used to answer questions such as:
• Based on demographic characteristics, how do customers who exhibit store loyalty differ from those who do not?
• Do heavy, medium and light users of soft drinks differ in terms of their consumption of frozen foods?
• Do the various market segments differ in their media consumption habits?
• What psychographic characteristics help differentiate price-sensitive from non-price-sensitive buyers?
• Studying the bankruptcy problem.

Dr. Mahesh K C 3
Fisher’s Linear Discriminant Function
• LDA is typically considered more of a statistical classification method than a data mining method; it was introduced by R. A. Fisher in 1936.
• The aim is to obtain a linear combination of the independent variables (known as the discriminant function) that best discriminates between the groups of the dependent variable.
• The idea is to find linear functions of the measurements that maximize the ratio of between-class variability to within-class variability; in other words, to obtain groups that are internally homogeneous and differ the most from each other.
• For each record, these functions are used to compute scores that measure the proximity of that record to each of the classes.
• A record is classified as belonging to the class for which it has the highest classification score.
Dr. Mahesh K C 4
The LDA Model and assumptions
• Let X1, X2, …, Xk denote the predictors and let D denote the discriminant score. Then a linear combination of the predictors is given by:
  D = b1X1 + … + bkXk
  where the bi are the discriminant coefficients or weights. Note that with g groups we need g − 1 discriminant functions.
• Assumptions:
1) The groups must be mutually exclusive and have equal sample sizes.
2) The groups should have the same variance-covariance matrices on the independent variables.
3) The independent variables should follow a multivariate normal distribution.
• If assumption 3 is met, LDA is a more powerful tool than other classification methods such as logistic regression (roughly 30% more efficient; see Efron, 1975).
• LDA performs better as the sample size increases.
Dr. Mahesh K C 5
Statistics associated with LDA
• Canonical correlation: measures the extent of association between the discriminant function and the groups of the response variable.
• Confusion (classification) matrix: a matrix representing the number of correctly classified and misclassified cases. Correctly classified cases appear on the diagonal and misclassified cases appear off-diagonal.
• Hit ratio (accuracy): the sum of the diagonal elements divided by the total number of cases.
• Eigenvalue: the ratio of the between-group to the within-group sum of squares. Large eigenvalues imply a superior function.

Dr. Mahesh K C 6
The Iris Flower Data
• This famous (Fisher's or Anderson's) iris data set gives the measurements, in centimeters, of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris: Iris setosa, Iris versicolor and Iris virginica.
• Predictors: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width
• Dependent variable: Species, with three levels: setosa, versicolor and virginica
• Total observations: 150
• [Images: Iris setosa, Iris versicolor, Iris virginica]


Dr. Mahesh K C 7
LDA of iris data
• Required R packages: MASS and psych.
• The scatter plot shows, in most cases, a clear grouping of the species.
• Partition the data into 70% training and 30% testing.
• The three groups had equal prior probabilities (about 33% each).
• The group means (centroids) clearly show a separation between the groups on the corresponding predictors.
• Since Species has three levels, we obtain two discriminant functions (LD1 and LD2) with the following weights:
  LD1: 0.534 (Sepal.Length), 2.125 (Sepal.Width), 1.962 (Petal.Length), 3.561 (Petal.Width)
  LD2: 0.294 (Sepal.Length), 1.933 (Sepal.Width), 1.143 (Petal.Length), 3.003 (Petal.Width)
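A minimal sketch of this analysis with lda() from MASS on R's built-in iris data; the 70/30 split depends on a random seed, so the coefficients and counts will not reproduce the slide's numbers exactly.

```r
# LDA on the built-in iris data with a 70/30 train/test split
library(MASS)

set.seed(123)                                   # arbitrary seed; results will differ from the slides
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit <- lda(Species ~ ., data = train)           # two discriminant functions, LD1 and LD2
fit$scaling                                     # discriminant coefficients (weights)
fit$svd^2 / sum(fit$svd^2)                      # proportion of trace (separation per function)
```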

Dr. Mahesh K C 8
Matrix plot: Iris data

Dr. Mahesh K C 9
LDA of iris data Cont’d
• The proportion of trace for LD1 is 0.993 and that for LD2 is 0.007; this is the proportion of between-group separation achieved by each discriminant function.
• In the training data, the predicted classification is: setosa = 35, versicolor = 36 and virginica = 35.
• The eigenvalues corresponding to LD1 and LD2 are 44.23 and 3.71; the higher the eigenvalue, the better the separation.
• The discriminant scores are obtained by evaluating LD1 and LD2 at the predictor values of each record.

Dr. Mahesh K C 10
Histogram of Discriminant Scores Based on LD1 & LD2

The histogram of discriminant scores based on LD1 (left panel) shows a clear separation of the species, while that based on LD2 shows significant overlap.

Dr. Mahesh K C 11
Confusion Matrix & Accuracy
• Confusion matrix: a matrix that summarizes the correct and incorrect classifications a classifier produced for a given data set (see Table 1).
• Accuracy: the overall proportion of correct classifications. For the training data, the accuracy is 100%.

Table 1: Confusion matrix for training data
Predicted      Actual: Setosa   Versicolor   Virginica
Setosa         35               0            0
Versicolor     0                36           0
Virginica      0                0            35

Table 2: Confusion matrix for test data
Predicted      Actual: Setosa   Versicolor   Virginica
Setosa         15               0            0
Versicolor     0                13           1
Virginica      0                1            14

• For the validation (test) data the accuracy is 95.45%, which is expected, since test accuracy is typically somewhat lower than training accuracy.
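Continuing the sketch above (and assuming the fit, train and test objects from that block), the confusion matrices and accuracies can be obtained with predict() and table():

```r
# Confusion matrices and accuracy, assuming `fit`, `train` and `test` from the previous sketch
pred_train <- predict(fit, train)$class
pred_test  <- predict(fit, test)$class

cm_train <- table(Predicted = pred_train, Actual = train$Species)
cm_test  <- table(Predicted = pred_test,  Actual = test$Species)

sum(diag(cm_train)) / sum(cm_train)   # training accuracy (hit ratio)
sum(diag(cm_test))  / sum(cm_test)    # test accuracy
```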

Dr. Mahesh K C 12
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 13
