
Session 8: Dimension Reduction Method - The Factor Analysis

Dr. Mahesh K C 1
Factor Analysis
• In real life, data tends to follow patterns, but the reasons are not apparent right from the start of the data analysis.
• In a demographics-based survey, many people will answer questions in a particular 'way'. For example, married men will tend to have higher expenses than single men, but lower expenses than married men with children.

• In this case, the driving factor that makes them answer in a pattern is economic status, but the answers may also depend on other factors such as level of education, salary, and locality or area.
• It becomes complicated to attribute answers to multiple factors manually.
• One option is to map all variables (or answers) automatically onto new categories, with customized weights (or loadings) based on their influence on each category.

• Factor analysis starts with the assumption of hidden latent variables that cannot be observed directly but are reflected in the answers (variables) of the data.
• It also starts by assuming that there are as many factors as there are variables.
An example
• Consider the following factor loadings from an airline customer satisfaction survey.

• The first factor may represent the customer experience after on-boarding.
• The second factor reflects the airline booking experience and related perks.
• The third factor shows the competitive advantage of the airline's flights compared with its competitors.
• It is the factor loadings and their interpretation that make factor analysis so important, together with the ability to scale down to a few factors without losing much information.
Factor Analysis Model (Optional)
• The set of variables X = (X1, X2, …, Xm) can be modeled as linear combinations of a smaller set of k (< m) unobserved ("latent") factors F1, F2, …, Fk together with an error term ε = (ε1, ε2, …, εm).
• Factor analysis model:
  X(m×1) = L(m×k) F(k×1) + ε(m×1)

• X is the standardized data vector.


• L is the matrix of factor loadings, with lij representing the loading of the ith variable on the jth factor.
• F represents the vector of unobservable (or latent) common factors.
• e represents the error vector such that E(e) = 0, V(e) is a diagonal matrix.

• Total variance in X = shared variance explained by the factors + error variance.
• A significantly high correlation among the variables is required to conduct FA.
• Factor analysis is similar to PCA, but the goal of factor analysis is to model the data.
Factor Analysis (FA) Process
• Formulate the FA problem.
• Identify the variables (measured on either interval or ratio scale) to be factor
analyzed.
• Partition the data into "training" and "test" sets.
• Standardize the training data based on the variables selected.
• Check whether significant correlations exist among these variables, based on the partitioned (training) data.
• Select a method of FA (exploratory/confirmatory).
• Decide the number of factors (eigenvalue, scree plot, and communalities) to
be extracted and the method of rotation (varimax).

• Interpretation of factors.
Kaiser-Meyer-Olkin (KMO) Test and Bartlett’s Test for Sphericity
• KMO test: a measure of sampling adequacy used to examine the appropriateness of FA.
• The test measures the sampling adequacy of each variable in the data as well as the overall sampling adequacy of the data.
• KMO lies between 0 and 1 (0 ≤ KMO ≤ 1); KMO < 0.5 indicates that FA may not be appropriate.
• Essentially, KMO tests the level of correlation between the variables.

• Bartlett test for Sphericity: A test used to examine the hypothesis that the variables
are uncorrelated.
• Let ρ denote the correlation matrix of order m and I the identity matrix of order m. Then the hypothesis to be tested is
  H0: ρ = I against H1: ρ ≠ I

• If p-value < 0.01, reject the null hypothesis that no correlation exists among the
variables.
Factor rotation: Varimax Method
• Solutions to factor analysis are non-unique without further constraints; this motivates the need for factor rotation.

• Factor loadings represent the correlations between the factors and the variables. A large absolute value indicates that the factor and the variable are closely related.
• With un-rotated factor loadings, a factor may be correlated with several variables, which makes interpretation difficult.

• By rotating the factors, we would like each factor to have significant loadings on only a few of the variables and, similarly, each variable to have significant loadings on only a few factors.
• Rotation does not affect communalities and percentage of total variance explained.

• Varimax: an orthogonal (axes are maintained at right angles) method of factor rotation that minimizes the number of variables with high loadings on a factor, thereby enhancing the interpretability of the factors.
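A minimal R sketch of the FA workflow described above, assuming a data frame named survey (a hypothetical name) containing only interval/ratio-scaled variables; KMO() and cortest.bartlett() come from the psych package, while factanal() is part of base R.

```r
# Sketch of the FA workflow; `survey` is a hypothetical data frame of interval-scaled variables
library(psych)

Z <- scale(survey)                 # standardize the (training) data
R <- cor(Z)                        # correlation matrix of the standardized variables

KMO(R)                             # sampling adequacy; overall MSA < 0.5 suggests FA is not appropriate
cortest.bartlett(R, n = nrow(Z))   # Bartlett's test of sphericity: H0: correlation matrix = identity

# Extract, say, 3 factors with varimax rotation (choose the number from eigenvalues/scree plot)
fa_fit <- factanal(Z, factors = 3, rotation = "varimax", scores = "regression")
print(fa_fit$loadings, cutoff = 0.4)   # suppress small loadings to aid interpretation
```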
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 8
Session 9: Simple Linear Regression
Revisited and Diagnostics

Dr. Mahesh K C 1
Introduction
• Managerial decisions are often based on the relationship between two or more variables.
• Examples:
• A company in the distribution business may be interested in the relationship between the price of crude oil and the company's transportation cost.
• A marketing executive might want to know how strong the relationship is between advertising expenditure and sales.
• An economist may be interested in the relationship between income and expenditure.
• An airline may be interested in predicting the cost of flying based on the type of plane, distance, number of passengers, etc.

• Two key concepts are correlation analysis and regression analysis.
• Correlation measures the degree of the relationship between the variables, i.e., how strong or weak the relationship between them is.
• Regression tries to build a mathematical model (equation) which can be used for prediction.

Dr. Mahesh K C 2
Scatter Plot: Linear and Non-Linear
• A graphical way of identifying the relationship between two variables.
• Let x = student population (in 1000s) and y = sales ($1000s). See Figure 1.
• Let x = months employed and y = items sold. See Figure 2.
[Figure 1: Linear scatter plot of Sales against Population]
[Figure 2: Non-linear scatter plot of items sold against Months employed]

Dr. Mahesh K C 3
Simple Linear Regression (SLR) Model
• Regression analysis is the process of constructing a mathematical model or
function that can be used to predict one variable by another variable or set of
variables.
• The variable being predicted is called the dependent variable (denoted by y) and
the variable being used to predict the dependent variable is called independent
variable (denoted by x).
• The equation that describes how y, the dependent variable, is related to x, the independent variable, and an error term is called the regression model. In SLR, the model used is:
Y = β0 + β1X + ε
where Y = (y1, y2, …, yn), X = (x1, x2, …, xn); β0 and β1 are referred to as the parameters of the model, and ε (epsilon) is a random variable referred to as the error term.
• The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.
Dr. Mahesh K C 4
Principle of Least Squares and Estimated regression line
• Since β0 and β1 are parameters in the regression model, generally unknown,
one has to use their estimated values, say, b0 and b1.
• The estimated values are obtained using the principle of least squares which
states that the sum of squares of errors should be minimum.
• Using this principle the values of b0 and b1 are as follows:

b1 = (Σxy/n − x̄·ȳ) / (Σx²/n − x̄²),   b0 = ȳ − b1·x̄

• The estimated regression line: ŷ = b0 + b1x
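As a brief illustration (using R's built-in mtcars data as a stand-in, not the slides' student-population example), the least squares estimates b0 and b1 can be obtained with lm():

```r
# Illustrative SLR fit on R's built-in mtcars data (a stand-in for the slides' example)
fit <- lm(mpg ~ wt, data = mtcars)   # response y = mpg, predictor x = weight

coef(fit)        # estimated intercept b0 and slope b1
summary(fit)     # t-tests, r-squared and the overall F-test
```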

Dr. Mahesh K C 5
How well does the estimated regression line fit the data?
• The least squares regression method can approximate the linear relationship between any two variables.
• How useful is the estimated regression line for making predictions?
• The coefficient of determination (the r² statistic) measures the estimated line's goodness of fit to the data.
• The following rough guidelines can sometimes be useful in judging the goodness of fit of the model:
  If r² ≥ 0.8, the regression is good;
  if 0.5 ≤ r² < 0.8, the regression is moderate; and
  if r² < 0.5, the regression is poor.
• These thresholds may differ depending on the data (e.g., in the social sciences).


Dr. Mahesh K C 6
The Pearson’s Correlation Coefficient
• Correlation Coefficient measures strength of linear relationship between two
quantitative variables
Cov x, y 
r ; -1  r  1
SD  x  SD  y 

• Rough guidelines for interpreting correlation between two variables:


r > 0.7 highly positively correlated
0.33 < r ≤ 0.7 mildly positively correlated
-0.33 < r ≤ 0.33 not correlated
-0.7 < r ≤ -0.33 mildly negatively correlated
r ≤ -0.7 highly negatively correlated
• r is conveniently expressed as r = ±√r²
• r is positive when b1 is positive, r is negative when b1 is negative
Dr. Mahesh K C 7
The Regression Model : Assumptions
• In regression, the model assumptions concern the error term (ε).
(1) Zero-mean assumption: the error term ε is a random variable with mean E(ε) = 0.
(2) Constant-variance assumption: the variance of ε is a constant, say σ², regardless of the value of x.
(3) Independence assumption: the values of ε are independent.
(4) Normality assumption: the error term ε is a normally distributed random variable.
• Summary: εi (i=1, 2, 3,…, n) are independent normally distributed with mean = 0
and constant variance σ2.

• Validating the regression assumptions is not essential when no inference or model building is performed; however, the assumptions must be validated whenever inference or model building is carried out.

Dr. Mahesh K C 8
Inference on Regression: The t-test and the F-test
• The inference on regression is done in two ways:
test of significance of predictor (the t-test) and
test of overall significance of the model (the F-test).

• Irrespective of the test, the hypothesis to be tested is H0: β1 = 0 against H1: β1 ≠ 0.

• Rejection Rule: Reject H0: if p-value ≤ α where α is the level of significance.

• Note that with one independent variable, the F-test will provide the same
conclusion as the t-test.
• But with more than one independent variable, only the F-test can be used to test for an overall significant relationship. In that case, the t-test is used to test the individual significance of each independent variable.
Dr. Mahesh K C 9
The Residual Analysis
• For model building and inference purposes, regression model assumptions
require validation.
• Estimated regression line: ŷ = b0 + b1x
• Residuals (actual value − estimated value): yi − ŷi, i = 1, 2, …, n

• Residuals may have different variances, so it is preferable to use standardized residuals (SR).
• Standardized residuals are residuals divided by their standard error, so that they are all on the same scale. The SR for the ith residual is given by:
  SRi = (yi − ŷi) / SE(yi − ŷi),   i = 1, 2, …, n

Dr. Mahesh K C 10
Outliers and High Leverage Values
• Observations with a very large standardized residual (in absolute value) are outliers.
• Generally, observations with standardized residuals beyond ±2 are flagged as outliers.
• Extreme observations in the predictor space (outliers in the predictors) are referred to as high leverage points.
• These are very large values of the predictor variable, identified without reference to the response variable.
• The lower and upper bounds of leverage are:
  1/n ≤ leverage ≤ 1
• Observations with leverage > 2(p + 1)/n are considered to have high leverage, where p is the number of predictors and n is the number of observations.
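Leverage values can be obtained in R with hatvalues() and compared against the 2(p + 1)/n rule of thumb; a sketch on the same illustrative fit:

```r
# Leverage (hat) values and the 2(p+1)/n rule of thumb (illustrative fit on mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

lev <- hatvalues(fit)
p   <- length(coef(fit)) - 1     # number of predictors
n   <- nrow(mtcars)
which(lev > 2 * (p + 1) / n)     # observations with high leverage
```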

Dr. Mahesh K C 11
Outlier Example
Fitted lines: with all points (Figure 1): y = -7.3305x + 64.958, R² = 0.4968; with the outlier removed (Figure 2): y = -7.0283x + 60.425, R² = 0.8912.

X   Y    SR
1   45   -1.06
1   55   -0.22
2   50   -0.02
3   75    2.68
3   40   -0.25
3   45    0.17
4   30   -0.47
4   35   -0.05
5   25   -0.28
6   15   -0.50

• In Figure 1, the point (3, 75) could be an outlier. The model has r² = 0.4968.
• After removing the outlier there is a substantial improvement in the model fit (r² = 0.8912); see Figure 2.
• The observation (3, 75) is an outlier since its SR = 2.68 exceeds 2.
Dr. Mahesh K C 12
Leverage Point Example
Fitted lines: with all points (Figure 1): y = -0.4251x + 127.47, R² = 0.7989; with the high leverage point removed (Figure 2): y = -1.0909x + 138.18, R² = 0.8727.

X    Y     Leverage
10   125   0.22
10   130   0.22
15   120   0.18
20   115   0.15
20   120   0.15
25   110   0.14
70   100   0.94

• The point (70, 100) could be a high leverage point, i.e., an outlier in the predictor variable.
• Removing it changes the fitted line and the r² value (see Figure 2).
• Its leverage of 0.94 exceeds the cutoff 2(p + 1)/n = 4/7 ≈ 0.57, so the observation at X = 70 has high leverage.

Dr. Mahesh K C 13
Influential Observation
• An influential observation significantly alters the regression parameters depending on its presence or absence in the data set.
• An outlier or a high leverage point may or may not be influential.
• Influential observations combine both characteristics: a large residual and high leverage.
• An observation not flagged as an outlier or as a high leverage point can still be influential.
• Cook’s Distance measures an observation’s level of influence and considers both
size of residual and leverage for an observation.
• In general, influential observations have Cook’s Distance > 1.0.
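Cook's distance is returned in R by cooks.distance(); observations exceeding 1.0 would be flagged as influential under the rule quoted above. A sketch:

```r
# Cook's distance for an illustrative fit (mtcars as a stand-in data set)
fit <- lm(mpg ~ wt, data = mtcars)

cd <- cooks.distance(fit)
which(cd > 1)              # influential observations by the D > 1 rule of thumb
```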

Dr. Mahesh K C 14
Influential Observation Example
Fitted lines: with all points (Figure 1): y = 0.0981x + 119.05, R² = 0.0751; with the influential point removed (Figure 2): y = -1.0909x + 138.18, R² = 0.8727.

X    Y     Cook's D
10   125   0.07
10   130   0.29
15   120   0.00
20   115   0.06
20   120   0.00
25   110   0.21
70   130   35.19

• Consider the observation (70, 130) in Figure 1, and note the estimated line and the corresponding r².
• Once we remove this point, the r² value increases sharply and the regression line changes drastically (compare the slopes in the two figures).
• Its Cook's distance D = 35.19 > 1, so the point (70, 130) is an influential observation.
Dr. Mahesh K C 15
Verifying Regression Assumptions
• For inferential purposes, adherence to the regression assumptions is essential.
• Two graphical methods are used to verify the assumptions:
(1) Normal probability plot of the residuals.
(2) Plot of standardized residuals (SR) against predicted values.
• Method 1: Normal probability plot (Q-Q plot)
• Determines whether the residual distribution deviates from normality.
• When the residuals are normal, the bulk of the points should lie on a straight line.
• Otherwise, systematic deviations from linearity indicate non-normality.

Dr. Mahesh K C 16
Verifying Regression Assumptions (cont’d)
• Method 2: Plot standardized residuals against predicted (fitted) values.
• Four commonly found patterns in residual-versus-fit plots are shown (panels A-D).
• Plot (A): a "healthy" plot in which no detectable patterns exist.
• The data points form an overall rectangular shape.
• Here, the regression assumptions remain intact.
• Plot (B): exhibits curvature, which violates the independence assumption.
• Plot (C): displays a "funnel" pattern, which violates the constant variance assumption.
• Plot (D): shows a pattern increasing from left to right, which violates the zero-mean assumption.
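Both graphical checks can be produced in a few lines of R; a sketch using the illustrative mtcars fit:

```r
# Method 1: normal probability (Q-Q) plot of the standardized residuals
fit <- lm(mpg ~ wt, data = mtcars)    # illustrative fit, mtcars as a stand-in
sr  <- rstandard(fit)

qqnorm(sr); qqline(sr)                # points close to the line suggest normality

# Method 2: standardized residuals against fitted values
plot(fitted(fit), sr,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)                # look for a patternless rectangular band
```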

Dr. Mahesh K C 17
What if the graphical tests indicate that regression assumption(s) are violated?
• Suppose, for example, that the constant variance assumption is violated.
• A transformation of the response and/or the predictor variable(s) may help.
• Frederick Mosteller and Tukey's Bulging Rule (FMTB rule):
• A "ladder of re-expressions" is proposed; these are essentially power transformations.
• Compare the curve in the scatter plot to the curves in the bulging-rule diagram.
• Ladder of re-expressions: t^(-3), t^(-2), t^(-1), t^(-1/2), ln(t), √t, t, t^2, t^3

Dr. Mahesh K C 18
Transformations to Achieve Linearity: Two variable case
• Scrabble® is a game in which players build crosswords by randomly selecting letter tiles. Each tile has an associated point value, and the point value is roughly related to the letter's frequency.
• The scatter plot indicates that the relationship between the two variables is curvilinear rather than linear.
• Therefore, modeling a linear relationship is not appropriate.

Dr. Mahesh K C 19
FMTB Rule applied to the Scrabble data
• The bulging rule says to move "x down, y down."
• We should transform x, moving down one or more positions from x's current position t^1 on the ladder; similarly, transform y from its position t^1.
• That is, the bulging rule suggests applying a square-root or log transformation to x and y.
• A square-root transformation is applied first, producing sqrt(points) and sqrt(frequency). However, the scatter plot shows that the relationship remains non-linear.
• Next, x and y are transformed using the log transformation. The scatter plot of ln(points) against ln(freq) shows a reasonably acceptable level of linearity.
• The regression of ln(points) on ln(freq) gives r² = 87.6%. In contrast, the untransformed regression of points on freq has r² = 45.5%.
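A sketch of this comparison in R, assuming a data frame named scrabble with columns points and freq (hypothetical names standing in for the Scrabble data, which is not reproduced here):

```r
# Assumes a data frame `scrabble` with columns `points` and `freq`
# (hypothetical names standing in for the Scrabble data)
raw_fit <- lm(points ~ freq, data = scrabble)             # untransformed model
log_fit <- lm(log(points) ~ log(freq), data = scrabble)   # ln-ln model per the bulging rule

summary(raw_fit)$r.squared    # reported on the slide as about 0.455
summary(log_fit)$r.squared    # reported on the slide as about 0.876
```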

Dr. Mahesh K C 20
Scatter plots: After square root and log transformations

[Scatter plot: SqPoints vs. SqFrequency (axes Psq vs. Fsq)]
[Scatter plot: logPoints vs. logFrequency (axes LogP vs. LogF)]

Dr. Mahesh K C 21
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 22
Session 10: Multiple Linear Regression-
Model Building

Dr. Mahesh K C 1
Multiple Linear Regression (MLR)
• Simple linear regression examines the relationship between a single predictor and the response.
• Multiple regression relates a set of predictors to a single continuous response.
• It provides improved precision for estimation and prediction.
• The model uses a plane or hyperplane to approximate the relationship between the predictor set and the single response.
• Predictors are typically continuous.
• Categorical predictors can be included through the use of indicator (dummy) variables.
• Here, the plane/hyperplane represents a linear surface in p dimensions.

Dr. Mahesh K C 2
The Multiple Regression Model
• Multiple regression model:
  y = β0 + β1x1 + β2x2 + … + βpxp + ε
  where β1, β2, …, βp are model parameters whose true values remain unknown and ε represents the error term.
• Model parameters are estimated from the data set using the method of least squares.
• The estimated regression plane: ŷ = b0 + b1x1 + b2x2 + … + bpxp

• We interpret the coefficient bi as the "estimated change in the response variable for a unit increase in variable xi, when all remaining predictors are held constant."
• The quantity (y − ŷ) measures the error in prediction and is called the residual. The residual equals the vertical distance between the data point and the regression plane (or hyperplane) in multiple regression.
• Coefficient of determination (R²): represents the proportion of variability in the response variable accounted for by its linear relationship with the predictor set.

Dr. Mahesh K C 3
An Example
• Consider an MLR model to estimate miles per gallon (mpg) based on weight (wt) and displacement (disp).
• A 3D scatter plot can be used to display the data.
• The estimated regression equation:
  mpg = b0 + b1(wt) + b2(disp)
• The fitted model corresponds to the approximating plane through this 3D scatter of points.
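The variables mpg, wt and disp match R's built-in mtcars data, so a hedged sketch of this fit (assuming mtcars is the intended data set) is:

```r
# MLR of mpg on weight and displacement; mtcars assumed to be the intended data set
fit <- lm(mpg ~ wt + disp, data = mtcars)

summary(fit)     # b0, b1, b2, their t-tests, R-squared, adjusted R-squared and the F-test
```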

Dr. Mahesh K C 4
Model Assumptions

• Zero-mean assumption: the error term ε is a random variable with mean E(ε) = 0.
• Constant-variance assumption: the variance of ε is constant regardless of the values of x1, x2, …, xp. This assumption is also known as homoscedasticity.
• Independence assumption: the values of ε are independent.
• Normality assumption: the error term ε is a normally distributed random variable.

Dr. Mahesh K C 5
Coefficient of Determination (R2) and Adjusted R2
• Would we expect a higher R² value when using two predictors rather than one?
• Yes: R² always increases when an additional predictor is included. When the new predictor is useful, R² increases substantially; otherwise, R² may increase by only a small or negligible amount.
• The largest R² may therefore occur for the model with the most predictors, rather than the best predictors.
• The adjusted R² measure "adjusts" R² by penalizing models that include non-useful predictors.
• If R²adj is noticeably smaller than R², at least one predictor in the model may be extraneous and should be omitted from the model.
• Models should be evaluated based on R²adj rather than R².
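For reference, a standard form of the adjustment (not written out on the slide) penalizes extra predictors through the error degrees of freedom:

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}$$

where n is the number of observations and p the number of predictors.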

Dr. Mahesh K C 6
Inference on Regression: The t-test and F-test
• The t-test is used to test the significance of the individual predictors in the regression model.
• Hypothesis test for each predictor: H0: βi = 0 against H1: βi ≠ 0 (i = 1, 2, …, p).
• Reject the null hypothesis when the p-value < level of significance (α).
• The F-test is used to test the overall significance of the regression model.
• Hypotheses for the F-test: H0: β1 = β2 = … = βp = 0 against H1: at least one βi ≠ 0.
• The ANOVA table
Source of Variation   Sum of Squares    Degrees of Freedom   Mean Square               F
Regression            SSR               p                    MSR = SSR / p             F = MSR / MSE
Error (Residual)      SSE               n − p − 1            MSE = SSE / (n − p − 1)
Total                 SST = SSR + SSE   n − 1

Reject H0 if p-value < α.

Dr. Mahesh K C 7
Multi-collinearity
• Multicollinearity is a condition in which two or more predictors are correlated.
• This leads to instability in the solution space, with possibly incoherent results.
• A data set with severe multicollinearity may have a significant F-test while none of the t-tests for the individual predictors is significant.
• Multicollinearity produces high variability in the coefficient estimates (b1, b2, …).
• Highly correlated variables tend to overemphasize a particular component of the regression model.
• The multicollinearity issue can be identified by examining the correlation structure among the predictors.
• One may use a matrix (pairs) plot to inspect the correlation structure.

Dr. Mahesh K C 8
Multicollinearity (cont'd)
• Consider an MLR with two predictors: ŷ = b0 + b1x1 + b2x2
• If the predictors x1 and x2 are uncorrelated (orthogonal), they form a solid basis on which the response surface y rests firmly, providing stable coefficients b1 and b2 (see figure A) with small variability SE(b1) and SE(b2).
• If the predictors x1 and x2 are correlated (a multicollinear situation), then as one of them increases, so does the other. In this case the predictors no longer form a solid basis on which the response surface rests firmly (it is unstable), producing highly variable coefficients b1 and b2 (see figure B) due to highly inflated values of SE(b1) and SE(b2).

Dr. Mahesh K C 9
Does a method exist to identify multicollinearity in a regression model?
• The variance inflation factor (VIF) measures the correlation between the ith predictor xi and the remaining predictor variables:
  VIFi = 1 / (1 − Ri²),   i = 1, 2, …, p
  where Ri² is the R² obtained from regressing xi on the remaining predictors.
• When xi is completely uncorrelated with the remaining predictors, Ri² = 0, which gives the minimum value VIFi = 1. Conversely, VIFi increases without bound as Ri² approaches 1.
• A large VIFi produces an inflated standard error for the corresponding estimate, degrading the precision of the estimates.
• In general, VIFi > 5 indicates moderate and VIFi > 10 indicates severe multicollinearity.
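VIFs can be computed with vif() from the car package (an assumption that this add-on package is acceptable here); a sketch on an illustrative mtcars fit:

```r
# VIFs for an illustrative multiple regression (mtcars as a stand-in data set)
library(car)                                   # provides vif()

fit <- lm(mpg ~ wt + disp + hp, data = mtcars)
vif(fit)                                       # values > 5 (moderate) or > 10 (severe) flag multicollinearity
```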

Dr. Mahesh K C 10
Some Guidelines for model building using multiple linear regression
Step 1: Detect (using VIF criterion) and eliminate multicollinearity (if present) by
dropping variables. Drop one variable at a time until multicollinearity is eliminated.
Step 2: Run regression and check for influential observations, outliers and high
leverage observations.
Step 3: If one or more influential observations/outliers/high leverage observations are present, delete one of them, rerun the regression, and go back to Step 2.
Step 4: Keep doing this until there are no further influential observations/outliers/high leverage observations, or until 10% (or 5%, case by case) of the data has been removed.
Step 5: Check the regression assumptions of linearity, normality, homoscedasticity and independence of the residuals.
Step 6: If some of the assumptions in Step 5 are violated, try using transformations. If no transformation can be found that corrects the violations, then STOP.

Dr. Mahesh K C 11
Step 7: When all the regression assumptions are met, look at the p-value of the F-test. If it is not significant, then STOP.
Step 8: If the p-value of the F-test is significant, look at the p-values of the individual coefficients. If some of the p-values are not significant, choose one of the variables with a non-significant p-value, drop it from the model, and run the regression again.
Step 9: Repeat Step 8 until the p-values of all the coefficients are significant.

Dr. Mahesh K C 12
Model Building: Health Care Revenue data
• These data were collected by the Department of Health and Social Services (DHSS) of the state of New Mexico and cover 52 of the 60 licensed facilities in New Mexico in 1998. Specific definitions of the variables are given below. The location variable records whether the facility is in a rural or non-rural area.
Variable   Definition
RURAL      Rural home (1) or non-rural home (0)
BED        Number of beds in home
MCDAYS     Annual medical in-patient days (hundreds)
TDAYS      Annual total patient days (hundreds)
PCREV      Annual total patient care revenue ($100)
NSAL       Annual nursing salaries ($100)
FEXP       Annual facilities expenditure ($100)

• DHSS is interested in predicting patient care revenue based on the other facility characteristics.
Dr. Mahesh K C 13
Model Building: HCR Data
• Objective: Build a model to predict patient care revenue.
• Response Variable: PCREV
• Continuous Predictors: BED, MCDAYS, TDAYS, NSAL, FEXP
• Categorical Predictor (dummy): RURAL
• Total records: 52
• Total Variables: 7 (6 predictors)
• No missing values

• Step 1: Check for multicollinearity. The variable TDAYS was dropped because its VIF = 8.47 > 5. We now have 5 predictors: BED, MCDAYS, NSAL, FEXP and RURAL. A repeated check showed that the resulting model (lm2) is free of multicollinearity.
• Step 2a: Check for influential observations. Since no residual has a Cook's distance greater than 1, there are no influential observations; the largest Cook's distance is 0.77.
Dr. Mahesh K C 14
Model Building Cont’d
• Step 2b: Check for outliers. The following standardized residual values exceed 2 in absolute value: 3.74, 3.55, 2.76 and 2.94. We deleted these one by one, updating the data set each time. The updated data (HRev5) at this stage consists of 48 records on 5 predictors, with a largest standardized residual of 1.94.
• Step 2c: Check for leverage values. The largest leverage value, 0.66, exceeds 2(p + 1)/n = 12/48 = 0.25. After removing the corresponding record, the new updated data set (HRev6, with 47 records) was again tested for leverage and the problem still persists.
• Since we have already deleted 5 records (about 10% of the 52 records), we stop at this stage and proceed to Step 3. The data (HRev6) now has 47 records and 5 predictors.
• Step 3: Check the regression assumptions. The standardized residuals versus fitted plot and the Q-Q plot show that the assumptions are more or less met, so we proceed to Step 4.
• Step 4a: We checked the significance of the variables and found that the predictor RURAL is not significant at the 5% level. We removed it and updated the data (HRev7).
• Step 4b: We again checked for significance and found that FEXP is not significant at 5%. We removed it and updated the data (HRev8).
• Step 4c: All remaining predictors are now significant, and the overall model is also significant.
Dr. Mahesh K C 15
Model Building Cont’d

Dr. Mahesh K C 16
Summary Results of Model 9

Variable    Estimate    Std. error   Pr(>|t|)
Intercept   -2056.49    833.18
BED         82.92       16.36        8.12e-06
MCDAYS      15.67       4.46         0.00106
NSAL        1.44        0.28         7.87e-06

Residual standard error: 1817 on 43 degrees of freedom. Multiple R-squared: 0.9045, Adjusted R-squared: 0.8978. F-statistic: 135.7 on 3 and 43 DF, p-value: < 2.2e-16.

Final MLR model:
PCREV = -2056.49 + 82.92 BED + 15.67 MCDAYS + 1.44 NSAL
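As a hedged illustration of how the final model would be used, the reported coefficients can be applied to a hypothetical facility (the input values below are made up purely for illustration):

```r
# Apply the reported final model to a hypothetical facility (inputs are illustrative only)
b <- c(intercept = -2056.49, BED = 82.92, MCDAYS = 15.67, NSAL = 1.44)

new_facility <- c(BED = 100, MCDAYS = 60, NSAL = 5000)    # hypothetical facility values
pcrev_hat <- b["intercept"] + sum(b[names(new_facility)] * new_facility)
pcrev_hat    # predicted annual patient care revenue (in $100s)
```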

Dr. Mahesh K C 17
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 18
Sessions 11&12: Multiple Regression
with Categorical Predictors & Variable
Selection Method-Backward Elimination

Dr. Mahesh K C 1
Regression with Categorical Predictors Using Indicator Variables
• Categorical variables can be included in the model through the use of indicator variables.
• Example: consider the Cars data set. Here mpg, cylinders, cubicinches, hp, weightlbs and time.to.60 are continuous variables, and brand is a categorical variable with three levels: US, Japan and Europe. The variable year is not considered.
• For regression, a categorical variable with k categories is transformed into k − 1 indicator (dummy) variables. An indicator variable is binary: it equals 1 when the observation belongs to the category, and 0 otherwise.
• The brand variable is transformed into two indicator (dummy) variables:
  C1 = 1 if brand is Japan, 0 otherwise;   C2 = 1 if brand is US, 0 otherwise.
• Note that brand = Europe is implied when C1 = 0 and C2 = 0; Europe is therefore known as the reference category.
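A sketch of how R would build these k − 1 indicators, assuming a data frame cars_df (a hypothetical name for the Cars data set) with the columns named on the slide:

```r
# Assumes a data frame `cars_df` (hypothetical name) with a categorical column `brand`
cars_df$brand <- relevel(factor(cars_df$brand), ref = "Europe")   # Europe as the reference category

# lm() creates the k - 1 = 2 dummies automatically:
fit <- lm(mpg ~ cylinders + cubicinches + hp + weightlbs + time.to.60 + brand, data = cars_df)

# The design matrix shows the two indicator columns (brandJapan = C1, brandUS = C2)
head(model.matrix(~ brand, data = cars_df))
```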

Dr. Mahesh K C 2
Estimated Regression Equation with Categorical Predictors
• Including the indicator variables in the model produces the estimated regression equation:
  mpg = b0 + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60) + b6C1 + b7C2
• Estimated regression equation when brand is Japan (C1 = 1, C2 = 0):
  mpg = (b0 + b6) + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60)
• Estimated regression equation when brand is US (C1 = 0, C2 = 1):
  mpg = (b0 + b7) + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60)
• Estimated regression equation when brand is Europe (C1 = C2 = 0):
  mpg = b0 + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60)

Dr. Mahesh K C 3
Variable Selection Methods
• Several variable selection methods available.
• Assist analyst in determining which variables to include in model.
• Algorithms help select predictors leading to optimal model.
• Four variable selection methods:
(1) Forward Selection
(2) Backwards Elimination
(3) Stepwise Selection
(4) Best Subsets

Dr. Mahesh K C 4
The Partial F-Test (Theory optional)
• Suppose the model has predictors x1, …, xp and we consider adding an additional predictor x*.
• Calculate the sequential sum of squares from adding x*, given that x1, …, xp are already in the model.
• Full model sum of squares SSFull: x1, …, xp and x* in the model.
• Reduced model sum of squares SSReduced: only x1, …, xp in the model.
• Therefore, the extra sum of squares is
  SSExtra = SS(x* | x1, x2, …, xp) = SSFull − SSReduced
• Hypotheses for the partial F-test:
  − H0: SSExtra associated with x* does not contribute significantly to the model.
  − Ha: SSExtra associated with x* does contribute significantly to the model.
• Test statistic for the partial F-test:
  F(x* | x1, x2, …, xp) = SSExtra / MSEFull
  which follows an F(1, n − p − 2) distribution when H0 is true.
• Therefore, H0 is rejected for a small p-value.
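In R, this partial F-test is exactly what anova() reports when the reduced and full nested models are compared; a sketch on illustrative mtcars models:

```r
# Partial F-test by comparing nested models (mtcars used as an illustrative stand-in)
reduced <- lm(mpg ~ wt + hp, data = mtcars)          # x1, ..., xp
full    <- lm(mpg ~ wt + hp + disp, data = mtcars)   # x1, ..., xp plus x* = disp

anova(reduced, full)   # F = SS_Extra / MSE_Full on 1 and n - p - 2 df; a small p-value keeps disp
```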

Dr. Mahesh K C 5
Backwards Elimination Procedure
• The procedure begins with all variables in the model (a brief R sketch follows the steps below).
• Step 1:
  Perform the regression on the full model with all variables.
  For example, assume the model has x1, …, x4.
• Step 2:
  For each variable in the model, perform the partial F-test.
  Select the variable with the smallest partial F-statistic, denoted Fmin.
• Step 3:
  If Fmin is not significant, remove the associated variable from the model and return to Step 2.
  Otherwise, if Fmin is significant, stop the algorithm and report the current model.
  If this is the first pass, the current model is the full model.
  If not, the full set of predictors has been reduced by one or more variables.
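A minimal sketch of one backward-elimination pass in R: drop1(..., test = "F") reports the partial F-test for removing each term, and the least significant term is dropped (step(..., direction = "backward") automates a similar search, though it uses AIC rather than F-tests). mtcars is used as a stand-in for the Cars data.

```r
# One backward-elimination pass (mtcars as an illustrative stand-in for the Cars data)
fit <- lm(mpg ~ cyl + disp + hp + wt + qsec, data = mtcars)

drop1(fit, test = "F")            # partial F-test for dropping each predictor in turn
# Remove the predictor with the smallest, non-significant F, refit, and repeat:
fit2 <- update(fit, . ~ . - cyl)  # example step: if cyl had the smallest non-significant F
drop1(fit2, test = "F")
```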

Dr. Mahesh K C 6
Backwards Elimination Applied to Cars Data Set
• We begin with all predictors (excluding the predictor “Year”) included in the model.

Model A: mpg = b0 + b1(cylinders) + b2(cubicinches) + b3(hp) + b4(weightlbs) + b5(time.to.60) + b6(brand)

• The partial F-statistic is calculated for each predictor. The smallest F-statistic, Fmin (= 0.5132), is associated with cubicinches. Since this Fmin is not significant at 5%, cubicinches is dropped.
• On the second pass, the predictor cylinders is eliminated, as its Fmin (= 0.4425) is not significant at 5%.
• On the third pass, the predictor time.to.60 is dropped, with Fmin (= 1.7229) not significant at 5%.
• Finally, all remaining predictors are significant at the 5% level.
• The procedure terminates with Model B:
Model B: mpg = b0 + b1(hp) + b2(weightlbs) + b6(brand)

Dr. Mahesh K C 7
Backwards Elimination Applied to Cars Data Set
• Variable selection methods usually take care of multicollinearity; still, one may check the resulting model for it.
• Based on Model B, check for influential observations, outliers and high leverage values.
• Check the regression assumptions. If any is violated, one may try a transformation of the response variable, the predictors, or both.

Dr. Mahesh K C 8
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 9
Session 14: Linear Discriminant Analysis
(LDA)

Dr. Mahesh K C 1
LDA: Basic Concept and Objectives
• A technique for analyzing multivariate data when the response variable is categorical and the predictors are interval-scaled in nature.
• In most cases the dependent variable consists of two groups or classifications, such as high versus normal blood pressure, loan default versus non-default, or use versus non-use of internet banking.
• The choice among three candidates, A, B or C, in an election is an example where the dependent variable consists of more than two groups.

• Objectives:
• Develop a discriminant function: a linear combination of the predictors that best discriminates between the categories of the response variable (the groups).
• Examine whether significant differences exist between the groups on the predictors.
• Classify cases into one of the groups based on the values of the predictors.
• Evaluate the accuracy of the classification.
Dr. Mahesh K C 2
Some Examples of LDA
• The technique can be used to answer questions such as:
• Based on demographic characteristics, how do customers who exhibit store loyalty differ from those who do not?
• Do heavy, medium and light users of soft drinks differ in terms of their consumption of frozen foods?
• Do the various market segments differ in their media consumption habits?
• What psychographic characteristics help differentiate price-sensitive from non-price-sensitive buyers?
• Studying the bankruptcy problem.

Dr. Mahesh K C 3
Fisher’s Linear Discriminant Function
• LDA is typically considered more of a statistical classification method than a data mining method; it was introduced by R. A. Fisher in 1936.
• The aim is to obtain a linear combination of the independent variables (known as the discriminant function) that best discriminates between the groups of the dependent variable.
• The idea is to find linear functions of the measurements that maximize the ratio of between-class variability to within-class variability; in other words, to obtain groups that are internally homogeneous and differ the most from each other.
• For each record, these functions are used to compute scores that measure the proximity of that record to each of the classes.
• A record is classified as belonging to the class for which it has the highest classification score.
Dr. Mahesh K C 4
The LDA Model and assumptions
• Let X1, X2, …, Xk denote the predictors and let D denote the discriminant score. Then a linear combination of the predictors is given by:
  D = b1X1 + … + bkXk
  where the bi are the discriminant coefficients or weights. Note that with g groups we need g − 1 discriminant functions.
• Assumptions:
1) The groups must be mutually exclusive and have equal sample sizes.
2) The groups should have the same variance-covariance matrices on the independent variables.
3) The independent variables should follow a multivariate normal distribution.
• If assumption 3 is met, LDA is a more powerful tool than other classification methods such as logistic regression (roughly 30% more efficient; see Efron, 1975).
• LDA performs better as the sample size increases.
Dr. Mahesh K C 5
Statistics associated with LDA
• Canonical correlation: measures the extent of association between the discriminant function and the groups of the response variable.
• Confusion (classification) matrix: a matrix representing the number of correctly classified and misclassified cases. Correctly classified cases appear on the diagonal and misclassified cases appear off-diagonal.
• Hit ratio (accuracy): the sum of the diagonal elements divided by the total number of cases.
• Eigenvalue: the ratio of the between-group to the within-group sum of squares. Large eigenvalues imply a superior function.

Dr. Mahesh K C 6
The Iris Flower Data
• This famous (Fisher's or Anderson's) iris data set gives the measurements, in centimeters, of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris: Iris setosa, Iris versicolor and Iris virginica.
• Predictors: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width
• Dependent variable: Species, with three levels: setosa, versicolor and virginica
• Total observations: 150
• [Images: Iris setosa, Iris versicolor, Iris virginica]


Dr. Mahesh K C 7
LDA of iris data
• Required R packages: MASS and psych.
• The scatter plot shows, in most cases, a clear grouping of the species.
• Partition the data into 70% training and 30% testing.
• The three groups had equal prior probabilities (about 33% each).
• The group means (centroids) clearly show a separation between the groups on the corresponding predictors.
• Since Species has three levels, we obtain two discriminant functions (LD1 and LD2) with the following weights:
  LD1: 0.534 (Sepal.Length), 2.125 (Sepal.Width), 1.962 (Petal.Length), 3.561 (Petal.Width)
  LD2: 0.294 (Sepal.Length), 1.933 (Sepal.Width), 1.143 (Petal.Length), 3.003 (Petal.Width)
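A minimal sketch of this analysis with lda() from MASS on R's built-in iris data; the 70/30 split depends on a random seed, so the coefficients and counts will not reproduce the slide's numbers exactly.

```r
# LDA on the built-in iris data with a 70/30 train/test split
library(MASS)

set.seed(123)                                   # arbitrary seed; results will differ from the slides
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit <- lda(Species ~ ., data = train)           # two discriminant functions, LD1 and LD2
fit$scaling                                     # discriminant coefficients (weights)
fit$svd^2 / sum(fit$svd^2)                      # proportion of trace (separation per function)
```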

Dr. Mahesh K C 8
Matrix plot: Iris data

Dr. Mahesh K C 9
LDA of iris data Cont’d
• The proportion of trace for LD1 is 0.993 and that for LD2 is 0.007; this is the proportion of between-group separation achieved by each discriminant function.
• In the training data, the predicted classification is: setosa = 35, versicolor = 36 and virginica = 35.
• The eigenvalues corresponding to LD1 and LD2 are 44.23 and 3.71; the higher the eigenvalue, the better the separation.
• The discriminant scores are obtained by evaluating LD1 and LD2 at the predictor values of each record.

Dr. Mahesh K C 10
Histogram of Discriminant Scores Based on LD1 & LD2

The histogram of discriminant scores based on LD1 (left panel) shows a clear separation of the species, while that based on LD2 shows significant overlap.

Dr. Mahesh K C 11
Confusion Matrix & Accuracy
• Confusion matrix: a matrix that summarizes the correct and incorrect classifications a classifier produced for a given data set (see Table 1).
• Accuracy: the overall proportion of correct classifications. For the training data, the accuracy is 100%.

Table 1: Confusion matrix for training data
Predicted      Actual: Setosa   Versicolor   Virginica
Setosa         35               0            0
Versicolor     0                36           0
Virginica      0                0            35

Table 2: Confusion matrix for test data
Predicted      Actual: Setosa   Versicolor   Virginica
Setosa         15               0            0
Versicolor     0                13           1
Virginica      0                1            14

• For the validation (test) data the accuracy is 95.45%, which is expected, since test accuracy is typically somewhat lower than training accuracy.
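Continuing the sketch above (and assuming the fit, train and test objects from that block), the confusion matrices and accuracies can be obtained with predict() and table():

```r
# Confusion matrices and accuracy, assuming `fit`, `train` and `test` from the previous sketch
pred_train <- predict(fit, train)$class
pred_test  <- predict(fit, test)$class

cm_train <- table(Predicted = pred_train, Actual = train$Species)
cm_test  <- table(Predicted = pred_test,  Actual = test$Species)

sum(diag(cm_train)) / sum(cm_train)   # training accuracy (hit ratio)
sum(diag(cm_test))  / sum(cm_test)    # test accuracy
```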

Dr. Mahesh K C 12
References

• Shmueli, G., Bruce, P.C., Yahav, I., Patel, N.R., & Lichtendahl, K.C. (2018), Data Mining for Business Analytics, Wiley.
• Larose, D.T. & Larose, C.D. (2016), Data Mining and Predictive Analytics, 2nd edition, Wiley.
• Kumar, U.D. (2018), Business Analytics: The Science of Data-Driven Decision Making, 1st edition, Wiley.

Dr. Mahesh K C 13
