Business Analytics-II

Dr. Ramesh Kandela


Room No. F010
[email protected]
Regression

Ramesh Kandela
[email protected]
Regression
• Regression analysis is a predictive modelling technique that investigates the
relationship between a dependent (target) variable and one or more independent (predictor) variables.
• The variable you want to predict is the dependent variable.
• The variable used for prediction is the independent variable.
• Regression is the technique of predicting some variables from known values of others: the
process of predicting variable Y using variable X.
Linear regression is used to find a linear relationship between the target and one or more
predictors. There are two types of linear regression: simple and multiple.
Examples of Relationships (Independent Variable(s) → Dependent Variable):
• Advertisement → Sales
• Income → Expenditure
• Years of experience, years of education, and gender → Salaries
• Square footage of the house and number of rooms → House price
(The independent variables may be numerical or categorical; the dependent variable is numerical.)

In Simple Linear Regression only one independent variable is present, and the model finds the
linear relationship with one dependent variable.
In Multiple Linear Regression there is more than one independent variable, and the model finds
their relationship with one dependent variable.
Application of Regression Analysis in Business
Some potential uses of regression analysis in business:
• How does a company’s sales level depend on its advertising levels?
• How do wages of employees depend on years of experience, years of
education, and gender?
• How does the total cost of producing a batch of items depend on the total
quantity of items that have been produced?
• How does the selling price of a house depend on such factors as the
appraised value of the house, the square footage of the house, the number
of bedrooms in the house, and perhaps others?
Simple Linear Regression
• Simple linear regression is useful for finding the relationship between two variables: one is the
predictor (independent) variable and the other is the response (dependent) variable.
• The dependent variable is the one being explained, and the independent variable is the one
used to explain the variation in the dependent variable.
• The equation of a linear relationship between two variables x and y is written as

y = a + bx + e

• For example, X may represent TV advertising and Y may represent sales. Then we can regress
Sales onto TV advertising by fitting the model
• Sales ≈ a + b × TV Advertising + e
• a and b are two unknown constants that represent the intercept and slope terms in the
linear model. Together, a and b are known as the model coefficients or parameters.
Estimating the Coefficients (Estimating the Parameters of Regression)
• The goal is to find an intercept a and a slope b such that the resulting line is as close as possible to the
given data points. There are a number of ways of measuring closeness. However, by far the
most common approach involves minimizing the least squares criterion.
• The least squares estimates are given by

b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²   (slope of the regression line)
a = ȳ - b·x̄                        (intercept or constant)

x = values of the independent variable
y = values of the dependent variable
x̄ = mean of the values of the independent variable
ȳ = mean of the values of the dependent variable
Estimating the Coefficients (Estimating the Parameters of Regression)

Income(x)  Expenditure(y)  X-X̄        Y-Ȳ        (X-X̄)(Y-Ȳ)  (X-X̄)²
55         14              -0.14286   -1.42857   0.204082     0.020408
83         24              27.85714   8.571429   238.7755     776.0204
38         13              -17.1429   -2.42857   41.63265     293.8776
61         16              5.857143   0.571429   3.346939     34.30612
33         9               -22.1429   -6.42857   142.3469     490.3061
49         15              -6.14286   -0.42857   2.632653     37.73469
67         17              11.85714   1.571429   18.63265     140.5918
ΣX=386     ΣY=108                                 Σ(X-X̄)(Y-Ȳ)=447.5714   Σ(X-X̄)²=1772.857
x̄ = 55.14286   ȳ = 15.42857143

b = Σ(X-X̄)(Y-Ȳ)/Σ(X-X̄)² = 0.252458
a = ȳ - b·x̄ = 1.507333
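The same least-squares arithmetic can be reproduced in R. Below is a minimal sketch, assuming the seven
income/expenditure pairs from the table above; the vector names are illustrative.

# Compute the least-squares slope and intercept by hand, then check against lm()
x <- c(55, 83, 38, 61, 33, 49, 67)   # income
y <- c(14, 24, 13, 16, 9, 15, 17)    # expenditure

b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope, approx. 0.2525
a <- mean(y) - b * mean(x)                                      # intercept, approx. 1.5073
c(a = a, b = b)

coef(lm(y ~ x))   # lm() returns the same estimates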
Interpretation of the regression coefficients
• a is the intercept term—that is, the expected value of Y when X = 0
• b= slope of regression line that represents the expected change in the value of y for unit
change in the value of x.
• b is the slope—the average increase in Y associated with a one-unit increase in X.
a) Note that when b is positive, an increase in x will lead to an increase in y,
and a decrease in x will lead to a decrease in y. In other words, when b is
positive, the movements in x and y are in the same direction. Such a
relationship between x and y is called a positive linear relationship. The
regression line in this case slopes upward from left to right.
b) On the other hand, if the value of b is negative, an increase in x will lead
to a decrease in y, and a decrease in x will cause an increase in y. The
changes in x and y in this case are in opposite directions. Such a
relationship between x and y is called a negative linear relationship. The
regression line in this case slopes downward from left to right.
Interpretation of a
• Consider a household with zero income. Income and food expenditure in this example are measured in
hundreds of dollars, so the intercept a = 1.5073 corresponds to about $150.73.
• A household with no income is expected to spend about $150.73 per month. Alternatively, we can
also state that the point estimate of the average monthly food expenditure for all
households with zero income is about $150.73.
Interpretation of b
• The value of b in a regression model gives the change in y (dependent variable) due to a
change of one unit in x (independent variable).
• On average, a $1 increase in the income of a household will increase expenditure by $0.2525.
We can also state that, on average, a $100 increase in income will result in a $25.25
increase in expenditure.
Fitted Value
After estimating the model coefficients a and b, we can predict future Expenditure (sales) based
on a particular value of Income (advertising) by computing
ŷ = a + bX
• where ŷ indicates a prediction of Y on the basis of X. Here we use a hat symbol, ˆ, to
denote the estimated value of an unknown parameter or coefficient, or to denote the
predicted value of the response.
• A fitted value is the predicted value of the dependent variable.
Fundamental Equation for Regression
Observed Value = Fitted Value + Residual
Y = ŷ + e, where ŷ = a + bX
Interpretation of the regression coefficients
• a is the intercept term, that is, the expected value of Y when X = 0 (the Y-intercept).
• b is the slope of the regression line: the expected change in the value of y for a unit
change in the value of x, i.e. b = (Change in Y) / (Change in X).
• b is also the average increase in Y associated with a one-unit increase in X.
Find the least squares regression line for the data on incomes and expenditures of the seven
households given in the table. Use income as the independent variable and food expenditure as the
dependent variable. What is the Expenditure value when x is 36?

Income(x)  Expenditure(y)  X-X̄        Y-Ȳ        (X-X̄)(Y-Ȳ)  (X-X̄)²
55         14              -0.14286   -1.42857   0.204082     0.020408
83         24              27.85714   8.571429   238.7755     776.0204
38         13              -17.1429   -2.42857   41.63265     293.8776
61         16              5.857143   0.571429   3.346939     34.30612
33         9               -22.1429   -6.42857   142.3469     490.3061
49         15              -6.14286   -0.42857   2.632653     37.73469
67         17              11.85714   1.571429   18.63265     140.5918
ΣX=386     ΣY=108                                 Σ(X-X̄)(Y-Ȳ)=447.5714   Σ(X-X̄)²=1772.857
x̄ = 55.14286   ȳ = 15.42857143

b = Σ(X-X̄)(Y-Ȳ)/Σ(X-X̄)² = 0.252458
a = ȳ - b·x̄ = 1.507333
Regression estimated model: ŷ = a + bX
ŷ = 1.5073 + 0.2525X
If x is 36, ŷ = 1.5073 + 0.2525(36) = 10.5958
Predicted Values (ŷ)
Regression estimated model: ŷ = a + bX, i.e. ŷ = 1.5073 + 0.2525X

Income  Expenditure  X-X̄        Y-Ȳ        (X-X̄)(Y-Ȳ)  (X-X̄)²     Predicted(ŷ)
55      14           -0.14286   -1.42857   0.204082     0.020408    15.39251
83      24           27.85714   8.571429   238.7755     776.0204    22.46132
38      13           -17.1429   -2.42857   41.63265     293.8776    11.10073
61      16           5.857143   0.571429   3.346939     34.30612    16.90725
33      9            -22.1429   -6.42857   142.3469     490.3061    9.838437
49      15           -6.14286   -0.42857   2.632653     37.73469    13.87776
67      17           11.85714   1.571429   18.63265     140.5918    18.422
Sum: ΣX=386, ΣY=108, Σ(X-X̄)(Y-Ȳ)=447.5714, Σ(X-X̄)²=1772.857
x̄ = 55.14286, ȳ = 15.42857143, b = 0.252458, a = 1.507333

[Income line fit plot: scatter of Expenditure against Income with the fitted line through the
predicted Expenditure values.]
Residual(Error)
• The line will have a good fit if it minimizes the error between the estimated points on the line
and the actual observed points that were used to draw it.
• Using a and b, we write the estimated regression model as
ŷ = a + bX
• Where ŷ is the estimated or predicted value of y for a given value of x
• The values a and b must be chosen so that they minimize the error.
• residual, denoted by e, is the difference between the observed value y and the fitted value ŷ.
• Then ei = yi− ŷi represents the ith residual—this is the difference between the ith actual
value and the ith predicted value by linear model.
• We define the sum of squared residuals (errors) as SSE = Σ ei^2 = Σ (yi - ŷi)^2.
Error (e = y - ŷ)
Income  Expenditure(y)  X-X̄        Y-Ȳ        (X-X̄)(Y-Ȳ)  (X-X̄)²     Predicted(ŷ)  Error(e=y-ŷ)  (y-ŷ)^2
55      14              -0.14286   -1.42857   0.204082     0.020408    15.39251      -1.39251      1.939073
83      24              27.85714   8.571429   238.7755     776.0204    22.46132      1.538678      2.367531
38      13              -17.1429   -2.42857   41.63265     293.8776    11.10073      1.899275      3.607245
61      16              5.857143   0.571429   3.346939     34.30612    16.90725      -0.90725      0.823107
33      9               -22.1429   -6.42857   142.3469     490.3061    9.838437      -0.83844      0.702976
49      15              -6.14286   -0.42857   2.632653     37.73469    13.87776      1.12224       1.259423
67      17              11.85714   1.571429   18.63265     140.5918    18.422        -1.422        2.022079
ΣX=386  ΣY=108          Sum                   447.5714     1772.857                                 SSE=12.72143
x̄ = 55.14286   ȳ = 15.42857143
b = 0.252458   a = 1.507333

ŷ = a + bX
ŷ = 1.5073 + 0.2525X
Standard error of the regression (S)
• The standard error of the regression (S), also known as the standard error of the estimate,
represents the average distance that the actual (observed) values fall from the regression
line.
• Conveniently, it tells you how wrong the regression model is on average using the units of
the dependent variable. Smaller values are better because it indicates that the actual values
are closer to the fitted line.

Se = sqrt(SSE / (n - 2))
• In this formula, n - 2 is the degrees of freedom for the regression model. The reason
df = n - 2 is that two parameters (the intercept a and the slope b) are estimated from the data.
• SSE = 12.72143
• The degrees of freedom for a simple linear regression model are df = n - 2 = 7 - 2 = 5.
• Se = sqrt(12.72143/5) = 1.5951
Root Mean Square Error (RMSE)
• This is the root of the mean of the squared errors.
• Most popular (has same units as y)
• Root Mean Square Error (RMSE) is a standard way to measure the error of a
model in predicting quantitative data.

RMSE = sqrt(SSE / n) = sqrt(Σ(y - ŷ)^2 / n)

Is this value of RMSE good?
Compare the error metric to the average value of the dependent variable.
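A short R sketch (assuming the x and y vectors from the worked example above) that computes SSE, the
standard error of the regression, and RMSE, and compares RMSE with the mean of y:

fit  <- lm(y ~ x)                      # simple linear regression fit
res  <- residuals(fit)
sse  <- sum(res^2)                     # approx. 12.7214
se   <- sqrt(sse / (length(y) - 2))    # standard error of the regression, approx. 1.595
rmse <- sqrt(mean(res^2))              # RMSE, approx. 1.35
c(SSE = sse, Se = se, RMSE = rmse)
rmse / mean(y)                         # RMSE relative to the mean of the dependent variable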
R-Square(Measure of fit) / Coefficient of Determination
• The coefficient of determination (or R-square or R2) measures the percentage (proportion) of variation
in the dependent variable Y explained by the regression model (a + bX).
• The simple linear regression model can be broken into explained variation and unexplained variation
y= a + bx + e
• To calculate R2, we use the formula R2 = 1 - SSE/SST = SSR/SST.
• The R-Square value lies between 0 and 1.
• R2 measures the proportion of variability in Y that can be explained using X. An R2 statistic that is close
to 1 indicates that a large proportion of the variability in the response has been explained by the
regression. A number near 0 indicates that the regression did not explain much of the variability in the
response; this might occur because the linear model is wrong, or the inherent error is high, or both.
• Mathematically, the square of correlation coefficient is equal to coefficient of determination
(i.e., r2 = R2).
R-Square
Income  Expenditure  X-X̄        Y-Ȳ        (X-X̄)(Y-Ȳ)  (X-X̄)²     Predicted(ŷ)  Error(e=y-ŷ)  (y-ŷ)^2   (Y-Ȳ)^2
55      14           -0.14286   -1.42857   0.204082     0.020408    15.39251      -1.39251      1.939073  2.040816
83      24           27.85714   8.571429   238.7755     776.0204    22.46132      1.538678      2.367531  73.46939
38      13           -17.1429   -2.42857   41.63265     293.8776    11.10073      1.899275      3.607245  5.897959
61      16           5.857143   0.571429   3.346939     34.30612    16.90725      -0.90725      0.823107  0.326531
33      9            -22.1429   -6.42857   142.3469     490.3061    9.838437      -0.83844      0.702976  41.32653
49      15           -6.14286   -0.42857   2.632653     37.73469    13.87776      1.12224       1.259423  0.183673
67      17           11.85714   1.571429   18.63265     140.5918    18.422        -1.422        2.022079  2.469388
ΣX=386  ΣY=108       Sum                   447.5714     1772.857                   SSE=12.72143  SST=125.7143
x̄ = 55.14286   ȳ = 15.42857143   b = 0.252458   a = 1.507333
SSE/SST = 0.101193
R2 = 1 - SSE/SST = 0.898807

• The total sum of squares SST is a measure of the total variation in expenditures.
• The Regression sum of squares SSR is the portion of total variation explained by the regression model
(or by income), and the error sum of squares SSE is the portion of total variation not explained by the
regression model.
• Hence, we can state that 90% of the total variation in the expenditures of households occurs because
of the variation in their incomes, and the remaining 10% is due to randomness and other variables.
Hypothesis Test for Regression Coefficient (t-Test)
Hypothesis testing is used to confirm whether the beta coefficients are significant in a linear
regression model. Every time we run the linear regression model, we test whether the line is
significant by checking whether the coefficient is significant.
The null and alternative hypotheses for the simple linear regression model can be stated as follows:
• H0: There is no relationship between X and Y
• Ha: There is a relationship between X and Y
Thus, the null and alternative hypotheses can be restated as follows:
• H0: b = 0
• Ha: b ≠ 0
• The value of the test statistic t is calculated as t = b / SE(b), with df = n - 2.
Make a Decision
• If the value of the test statistic t is greater than the critical value of t, it falls in the rejection region. Hence,
we reject the null hypothesis and conclude that there is a linear relationship between x and y.
• If the p-value is less than 0.05, we reject the null hypothesis and conclude that there is significant evidence
of a linear relationship between x and y.

Interpretation of b
If b > 0, then x(predictor) and y(target) have a positive relationship. That is increase in x will increase y.
If b < 0, then x(predictor) and y(target) have a negative relationship. That is increase in x will decrease y.
Hypothesis Test for Regression Coefficient (t-Test)
Income  Expenditure  X-X̄        Y-Ȳ        (X-X̄)(Y-Ȳ)  (X-X̄)²     Predicted(ŷ)  Error(e=y-ŷ)  (y-ŷ)^2
55      14           -0.14286   -1.42857   0.204082     0.020408    15.39251      -1.39251      1.939073
83      24           27.85714   8.571429   238.7755     776.0204    22.46132      1.538678      2.367531
38      13           -17.1429   -2.42857   41.63265     293.8776    11.10073      1.899275      3.607245
61      16           5.857143   0.571429   3.346939     34.30612    16.90725      -0.90725      0.823107
33      9            -22.1429   -6.42857   142.3469     490.3061    9.838437      -0.83844      0.702976
49      15           -6.14286   -0.42857   2.632653     37.73469    13.87776      1.12224       1.259423
67      17           11.85714   1.571429   18.63265     140.5918    18.422        -1.422        2.022079
ΣX=386  ΣY=108       Sum                   447.5714     1772.857                   SSE=12.72143
x̄ = 55.14286   ȳ = 15.42857143   b = 0.252458   a = 1.507333

Standard error of the regression coefficient:
sqrt(Σ(X-X̄)²) = sqrt(1772.857) = 42.10531   and   Se = 1.5951
SE(b) = 1.5951/42.10531 = 0.0379
t = 0.2525/0.0379 = 6.66
df = n - 2 = 7 - 2 = 5
The value of the test statistic t 6.66 is greater than the critical value of t 2.571, and
it falls in the rejection region. Hence, we reject the null hypothesis and conclude that x (income)
determines y (food expenditure) positively. That is, food expenditure increases with an
increase in income and it decreases with a decrease in income.
Standard error of the coefficient
• The standard error of an estimator reflects how it varies under repeated sampling.

SE(b) = Se / sqrt(Σ(X-X̄)²) = 1.5951/42.10531 = 0.0379

• The standard error of the coefficient measures how precisely the model estimates the
coefficient's unknown value. The standard error of the coefficient is always positive.
• Use the standard error of the coefficient to measure the precision of the estimate of the
coefficient. The smaller the standard error, the more precise the estimate. Dividing the
coefficient by its standard error of the coefficient calculates a t-value
Confidence Intervals
• These standard errors can be used to compute confidence intervals. A 95% confidence
interval is defined as a range of values such that with 95% probability, the range will contain
the true unknown value of the parameter. It has the form b±2SE(b).
• That is, there is approximately a 95% chance that the interval [b-2SE(b), b+2SE(b)] will
contain the true value of b.
Simple Linear Regression with R
• # read in the first worksheet from the workbook Boston.xlsx
# first row contains variable names
• library(readxl)
• Boston <- read_excel("E:/Kandela/Analytics/IBS/BA-II/Data/Boston.xlsx")
• attach(Boston)   # attach so that rm and medv can be referenced directly
• plot(rm,medv)
• lm(medv~rm)
• slr=lm(medv~rm)
• summary(slr)
• names(slr)
• coef(slr)
• confint(slr)
• abline(slr)
• medv
• fitted.values(slr)
• Actual=medv
• Predicted=fitted.values(slr)
• Residual=residuals(slr)
• Error=data.frame(Actual,Predicted,Residual)
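Once slr has been fitted as above, predictions for new values of rm follow from predict(); a brief usage
sketch (new_rm is an illustrative name, and the rm values are assumed):

new_rm <- data.frame(rm = c(5, 6, 7))                     # new predictor values
predict(slr, newdata = new_rm)                            # point predictions of medv
predict(slr, newdata = new_rm, interval = "confidence")   # CI for the mean response
predict(slr, newdata = new_rm, interval = "prediction")   # PI for an individual response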
Regression with Excel
Regression in Excel using formulas
• The Excel SLOPE(known_y's, known_x's) and INTERCEPT(known_y's, known_x’s) functions return the slope and
intercept, respectively, of the least-squares line.
• The RSQ(known_y's, known_x's) function returns the R2 value, and STEYX(known_y's, known_x's) returns the
standard error of the predicted y-value for each x in the regression.
Regression in Excel using the Data Analysis ToolPak
• Select "Data" from the toolbar. The "Data" menu displays.
• Select "Data Analysis". The Data Analysis - Analysis Tools dialog box displays. From
the menu, select "Regression" and click "OK".
• In the Regression dialog box, click the "Input Y Range" box and select the dependent variable data.
• Click the "Input X Range" box and select the independent variable data.
• Click "OK" to run the results.

SUMMARY OUTPUT
Regression Statistics
Multiple R          0.948054203
R Square            0.898806772
Adjusted R Square   0.878568127
Standard Error      1.595082087
Observations        7

ANOVA
            df   SS            MS            F             Significance F
Regression  1    112.9928514   112.9928514   44.41042122   0.001148513
Residual    5    12.72143433   2.544286865
Total       6    125.7142857

            Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept   1.507332796    2.17424242       0.693268047   0.519019865   -4.081735275   7.096400867
Income      0.252457695    0.037883157      6.664114436   0.001148513   0.155075939    0.349839451
Boston dataset task
https://www.kaggle.com/datasets/puxama/bostoncsv
The questions below involve the use of scatter plots, correlation, and simple linear regression on the Boston data
set.
A) Use the Excel/R to perform Scatterplot/Correlation with medv as the dependent variable and any one of the
remaining as the independent variable. Comment on the output
B) Use the Excel/R to perform a simple linear regression with medv as the dependent variable and any one of
the remaining as the independent variable. Comment on the output.
For example
I. Is there a relationship between the predictor and the response? Is the relationship between the predictor
and the response positive or negative?
II. How strong is the relationship between the predictor and the response?
III. Comment on the standard error of the regression (estimate).
IV. What is your decision from the hypothesis test, and why? Write the interpretation of the coefficient (b).
V. What is the predicted medv associated with an independent variable value of 5?
VI. What are the associated 95% confidence and prediction intervals?
Happy Analyzing
Multiple Regression
Ramesh Kandela
[email protected]
Multiple Regression
Multiple regression analysis is a straightforward extension of simple regression analysis which
allows more than one independent variable.
• Multiple linear regression (MLR) is a statistical technique for finding an association
between a dependent variable (response or outcome variable) and several independent
variables (explanatory or predictor variables).
• Multiple Regression Equation
• Y = a + b1X1 + b2X2 + … + bkXk + error term
(sales = b0 + b1 × TV + b2 × radio + b3 × newspaper + e)
• In this equation, a is called the intercept or constant term.
• bi is called the regression coefficient for the independent variable Xi.
• A positive value of the error term occurs if the actual value of the dependent variable
exceeds the predicted value (a + b1X1 + b2X2 + … + bkXk). A negative value of the error term
occurs when the actual value of the dependent variable is less than the predicted value.
Estimating the Regression Coefficients
• To estimate b1, b2, …, bk, we choose the values that minimize the sum of squared errors (residuals):
• SSE = Σ ei^2 = Σ (yi - ŷi)^2
• This is done using standard statistical software. The values that minimize SSE are the multiple
least squares regression coefficient estimates.
Interpretation of Regression Coefficients
• Each coefficient can be estimated and tested separately.
• Interpretations such as a unit change in Xj is associated with a bj change in Y , while all the
other variables stay constant.
• For example, b1 is the expected change in Y when X1 increases by one unit and the other Xs
in the equation, X2 through Xk, remain constant.
• The intercept a is the expected value of y when all of the Xs equal zero.
Interpretation of Regression Coefficients
• Estimate and interpret the equation for Overhead when both explanatory variables, Machine
Hours and Production Runs, are included in the regression equation.
Regression Statistics
Multiple R          0.930819542
R Square            0.866425021
Adjusted R Square   0.858329567
Standard Error      4108.99309
Observations        36

ANOVA
            df   SS            MS            F             Significance F
Regression  2    3614020661    1807010330    107.0261279   3.75374E-15
Residual    33   557166199.1   16883824.22
Total       35   4171186860

                 Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept        3996.678209    6603.650932      0.605222512   0.549170949   -9438.550632   17431.90705
Machine Hours    43.53639812    3.5894837        12.12887472   1.04645E-13   36.23353862    50.83925761
Production Runs  883.6179252    82.25140753      10.74289124   2.6114E-12    716.2761784    1050.959672

Predicted Overhead = 3997 + 43.54 × Machine Hours + 883.62 × Production Runs
The interpretation of Equation is that if the number of production runs is held constant, the
overhead cost is expected to increase by $43.54 for each extra machine hour, and if the number
of machine hours is held constant, the overhead cost is expected to increase by $883.62 for
each extra production run. The intercept, $3997, can be interpreted as the fixed component of overhead.
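As a quick illustration of using the estimated equation (the 1,500 machine hours and 40 production runs are
assumed values, not taken from the data):

# Evaluate the fitted overhead equation at assumed input values
machine_hours   <- 1500
production_runs <- 40
overhead_hat <- 3997 + 43.54 * machine_hours + 883.62 * production_runs
overhead_hat   # about 104,652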
Validation of Overall Regression Model: F-test
• Statistical significance of individual variables in MLR – t-test
• Analysis of Variance (ANOVA) is used to validate the overall regression model. If there are k
independent variables in the model, the null hypothesis is that all coefficients of the
explanatory variables are zero (H0: β1 = β2 = · · · = βk = 0). The alternative is that at least one
of these coefficients is not zero. This hypothesis test is performed by computing the F-statistic
• F = MSR/MSE
ANOVA
                  df   SS            MS            F             Significance F
Regression        2    3614020661    1807010330    107.0261279   3.75374E-15
Residual (Error)  33   557166199.1   16883824.22
Total             35   4171186860

                 Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept        3996.678209    6603.650932      0.605222512   0.549170949   -9438.550632   17431.90705
Machine Hours    43.53639812    3.5894837        12.12887472   1.04645E-13   36.23353862    50.83925761
Production Runs  883.6179252    82.25140753      10.74289124   2.6114E-12    716.2761784    1050.959672
Standard error of estimate for multiple regression
Once we have rejected the null hypothesis in favor of the alternative hypothesis, the quality of a
linear regression fit is typically assessed using two related quantities: the standard error of the
estimate (Se) and the R^2 statistic.
The standard error of the estimate is a measure of the accuracy of predictions:
Se = sqrt(SSE / (n - k - 1))

Regression Statistics
Multiple R          0.930819542
R Square            0.866425021
Adjusted R Square   0.858329567
Standard Error      4108.99309
Observations        36

ANOVA
            df   SS            MS            F             Significance F
Regression  2    3614020661    1807010330    107.0261279   3.75374E-15
Residual    33   557166199.1   16883824.22
Total       35   4171186860

The percentage error is Se / ȳ = 4108/99151 ≈ 0.04 = 4%.
Co-efficient of Multiple Determination (R-square) and Adjusted R-square
Adjusted R2 is a corrected goodness-of-fit (model accuracy) measure for linear models. It identifies the
percentage of variance in the target field that is explained by the independent variables.
• The R2 value is adjusted by normalizing both SSE and SST with the corresponding degrees of freedom.
• The adjusted R-square with k predictors is given by
Adjusted R2 = 1 - (1 - R2)(N - 1)/(N - k - 1)   or, equivalently,   Adjusted R2 = 1 - [SSE/(N - k - 1)] / [SST/(N - 1)]
• N = the sample size
• k = the number of independent variables in the regression equation
• Every time you add an independent variable to a model, the R-square increases, even if the
independent variable is insignificant; it never declines. Adjusted R-square, by contrast, increases only
when the added independent variable is significant and affects the dependent variable.
• The adjusted R-square value is always less than or equal to the R-square value.
• If there is no increase in adjusted R-square after adding a new predictor variable, the newly added
variable may not be statistically significant, or it may not be explaining any variation in the response
that is not already explained by the variables present in the model.
Find the Adjusted R-square

Regression Statistics
Multiple R 0.930819542
R Square 0.866425021
Adjusted R Square 0.858329567
Standard Error 4108.99309
Observations 36
ANOVA
            df   SS            MS            F             Significance F
Regression  2    3614020661    1807010330    107.0261279   3.75374E-15
Residual    33   557166199.1   16883824.22
Total       35   4171186860

Machine Hours and Production Runs combine to explain 85.83% of the variation in Overhead
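Plugging the summary output into the adjusted R-square formula reproduces the reported value; a short check in R:

# Adjusted R2 = 1 - (1 - R2) * (N - 1) / (N - k - 1)
R2 <- 0.866425021
N  <- 36   # observations
k  <- 2    # predictors: Machine Hours and Production Runs
1 - (1 - R2) * (N - 1) / (N - k - 1)   # 0.8583, matching the Excel output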
Advertising dataset
• The Advertising data displays sales (in thousands of units) for a particular product as a
function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media.

• As a business analyst, the task is to suggest, based on this data, a marketing plan for next year that will
result in high product sales.
Here are a few questions that we might seek to address:
• Is there a relationship between advertising budget and sales?
• How strong is the relationship between the advertising budget and sales?
• Which media contribute to sales?
• How can we predict future sales?

First rows of the Advertising data:
TV      radio   newspaper   sales
230.1   37.8    69.2        22.1
44.5    39.3    45.1        10.4
17.2    45.9    69.3        9.3
151.5   41.3    58.5        18.5
180.8   10.8    58.4        12.9
8.7     48.9    75          7.2
57.5    32.8    23.5        11.8
120.2   19.6    11.6        13.2
8.6     2.1     1           4.8
199.8   2.6     21.2        10.6
66.1    5.8     24.2        8.6
214.7   24      4           17.4
23.8    35.1    65.9        9.2
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.947212
R Square 0.897211
Adjusted R Square 0.895637
Standard Error 1.68551
Observations 200

ANOVA
df SS MS F Significance F
Regression 3 4860.323487 1620.107829 570.2707037 1.57523E-96
Residual 196 556.8252629 2.840945219
Total 199 5417.14875

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 2.938889 0.311908236 9.42228844 1.26729E-17 2.323762279 3.55401646
TV 0.045765 0.001394897 32.80862443 1.50996E-81 0.043013712 0.048515579
radio 0.18853 0.008611234 21.89349606 1.50534E-54 0.171547447 0.205512586
newspaper -0.00104 0.00587101 -0.176714587 0.85991505 -0.012615953 0.010540967
• Is there a relationship between advertising budget and sales?
• This question can be answered by fitting a multiple regression model of sales onto TV, radio, and
newspaper and testing the hypothesis H0 : βTV = βradio = βnewspaper = 0.
• we showed that the F-statistic can be used to determine whether we should reject this null
hypothesis. In this case the p-value corresponding to the F-statistic is very low, indicating clear
evidence of a relationship between advertising and sales.
• How strong is the relationship between the advertising budget and sales?
• R^2 statistic records the percentage of variability in the response that is explained by the predictors.
The predictors explain almost 90% of the variance in sales.
• Which media contribute to sales?
• To answer this question, we can examine the p-values associated with each predictor’s t-statistic . In
the multiple linear regression model, the p-values for TV and radio are low, but the p-value for
newspaper is not. This suggests that only TV and radio are related to sales.
• How can we predict future sales?
• Sales = 2.938889 + 0.045765 × TV + 0.18853 × radio - 0.00104 × newspaper
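As a sketch of how the equation would be used for prediction, with assumed (not actual) budgets of 100, 20,
and 20 thousand dollars for TV, radio, and newspaper:

sales_hat <- 2.938889 + 0.045765 * 100 + 0.18853 * 20 - 0.00104 * 20
sales_hat   # about 11.27 thousand units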
Validation of Multiple Regression Model
The following measures and tests are carried out to validate a multiple linear regression
model:
• Standard Error or RMSE and Coefficient of multiple determination (R-Square) / Adjusted R-
Square, which can be used to judge the overall fitness of the model.
• t-test to check the existence of statistically significant relationship between the response
variable and individual explanatory variable at a given significance level (α) or at (1 −
α)100% confidence level.
• F-test to check the statistical significance of the overall model at a given significance level
(α) or at (1 − α)100% confidence level.

• Conduct a residual analysis to check whether the linearity, normality, and homoscedasticity
assumptions have been satisfied. Also, check for any pattern in the residual plots to check
for correct model specification.
• Check for presence of multi-collinearity (strong correlation between independent variables)
that can destabilize the regression model.
• Check for auto-correlation in case of time-series data.
Other Considerations in the Regression Model
• Qualitative Variables
• Interaction Variables
• Non-linear Relationships
Qualitative Predictors (Dummy Variable)
• A Dummy variable or Indicator Variable is an artificial variable created to represent an
attribute with two or more distinct categories/levels.
• A dummy variable is a numeric variable that represents categorical data, such as gender,
race, political affiliation, etc.
• Technically, dummy variables are dichotomous, quantitative variables. Their range of
values is small; they can take on only two quantitative values,1 or 0. Typically, 1 represents
the presence of a qualitative attribute, and 0 represents the absence.
• The number of dummy variables necessary to represent a single attribute variable is equal
to the number of levels (categories) in that variable minus one.
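In R, dummy variables are created automatically when a factor enters a model formula; a minimal sketch with
illustrative data showing that a three-level attribute yields 3 - 1 = 2 dummy columns:

grade <- factor(c("Low", "Medium", "High", "Medium"))   # illustrative 3-level attribute
model.matrix(~ grade)   # intercept plus two 0/1 dummy columns (reference level omitted)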
Objective: To use the Regression to analyze whether the bank discriminates against females in terms of salary.
• Suppose that we wish to investigate differences in salary between males and females, ignoring the other
variables for the moment. We simply create an indicator or dummy variable that takes on two possible
numerical values. For example, based on the gender variable, we can create a new variable that takes the form

xi = 1 if the ith person is female, 0 if the ith person is male

We use this variable as a predictor in the regression equation: yi = a + bxi + e.
Now a can be interpreted as the average salary among males, a + b as the average salary
among females, and b as the average difference in salary between females and males.

Regression Statistics
Multiple R          0.346541171
R Square            0.120090783
Adjusted R Square   0.115819379
Standard Error      10584.26048
Observations        208

            Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept   45505.44118     1283.530115      35.45334904    1.22205E-89   42974.90165    48035.98
Gender      -8295.512605    1564.493318      -5.302363718   2.93545E-07   -11379.98419   -5211.04

Male predicted salary = 45505
Female predicted salary = 45505 - 8296(1) = 37209
Females get paid $8296 less on average than males.
As expected, the coefficients of the job grade dummies are all positive, and they increase as the
job grade increases: it pays to be in the higher job grades. The effect of age appears to be
minimal, and there appears to be a "bonus" of close to $5000 for having a PC-related job. In this
larger model the penalty for being female has decreased to $2562, still large but not as large as
before.
Interaction Variables
• An interaction variable is the product of two explanatory variables. You can include such a
variable in a regression equation if you believe the effect of one explanatory variable on Y
depends on the value of another explanatory variable.
• In marketing, this is known as a synergy effect, and in statistics it is referred to as an
interaction effect.
• regression equation without an interaction: y= b0 + b1X1 + b2X2+e
• regression equation with an interaction: y = b0 + b1X1 + b2X2 + b3X1X2+e

From the Advertising data, the model takes the form
Sales = b0 + b1 TV + b2 radio + b3 (radio × TV) + e
      = b0 + (b1 + b3 radio) TV + b2 radio + e
Interaction Effect
• The results in this table suggest that interactions are important. The p-value for the interaction
term TV×radio is extremely low, indicating strong evidence for HA: b3 ≠ 0.
• The R2 for the interaction model is 96.8%, compared to only 89.7% for the model that predicts
sales using TV and radio without an interaction term.

In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may
increase sales more than allocating the entire amount to either TV or to radio.
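A sketch of fitting the interaction model in R, assuming the Advertising data has been read into a data frame
named Advertising with columns sales, TV, and radio:

fit_main <- lm(sales ~ TV + radio, data = Advertising)             # no interaction
fit_int  <- lm(sales ~ TV + radio + TV:radio, data = Advertising)  # with TV x radio interaction
summary(fit_main)$r.squared   # about 0.897
summary(fit_int)$r.squared    # about 0.968
summary(fit_int)              # the TV:radio coefficient has a very small p-value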
Interaction Effect
• Objective: To use multiple regression with an interaction variable to see whether the effect
of years of experience on salary is different across the two genders.
• Solution: The multiple regression output appears below.

The regression equations for Female and Male are shown


graphically below.
Validation of the Model Fit
• To see if the regression equation will be successful in predicting new values of the
dependent variable, split the original data into two subsets: one for
estimation(Training data) and one for validation(Testing data).
• A regression equation is estimated from the first subset.
• Then the values of the independent variables from the second subset are
substituted into the equation to obtain predicted values for the dependent
variable.
• Finally, these predicted values are compared to the known values of the
dependent variable in the second subset.
• If the model is good, there is reason to believe that the regression equation will
predict well for the new data.
• This procedure is called validating the fit.
Validation of the Fit with Overhead cost data using R
• # Import dataset
• library(readxl)
• Overhead_Costs = read_excel("Overhead Costs.xlsx")
• # View(Overhead_Costs)
• attach(Overhead_Costs)
• # MLR on the full data
• MLR = lm(Overhead ~ `Machine Hours` + `Production Runs`)
• summary(MLR)
• # Model fit: splitting the dataset into the Training set and Test set
• # install.packages('caTools')
• library(caTools)
• set.seed(123)
• split = sample.split(Overhead_Costs$Overhead, SplitRatio = 0.7)
• training_set = subset(Overhead_Costs, split == TRUE)
• test_set = subset(Overhead_Costs, split == FALSE)
• # Fitting Multiple Linear Regression to the Training set
• MLR = lm(Overhead ~ `Machine Hours` + `Production Runs`, data = training_set)
• summary(MLR)
• # Predicting the Test set results
• y_pred = predict(MLR, newdata = test_set)
• results <- cbind(y_pred, test_set$Overhead)
• colnames(results) <- c('pred','actual')
• results <- as.data.frame(results)
• results
• sse = (results$actual - results$pred)^2
• sse
• sum(sse)
• mse <- mean((results$actual - results$pred)^2)
• print(mse)
• rmse = mse^0.5
• print(rmse)
• summary(test_set$Overhead)
Residual Analysis in Regression
• Residual analysis is used to assess the appropriateness of a linear regression model by
defining residuals and examining the residual plots.
• https://data.library.virginia.edu/diagnostic-plots/

• library(MASS)
• attach(Boston)
• par(mfrow=c(2,2))   # show the four diagnostic plots in one window
• LR=lm(medv~lstat)
• summary(LR)
• plot(LR)
• MLR=lm(medv~lstat+I(lstat^2))
• summary(MLR)
• plot(MLR)
• # Change back to 1 x 1
• par(mfrow=c(1,1))
Assumptions of Multiple Regression Model
1.There should be a linear relationship between dependent (response) variable and independent (predictor)
variable(s). A linear relationship suggests that a change in response Y due to one unit change in X is
constant, regardless of the value of X.
2.The error terms must have constant variance. This phenomenon is known as homoscedasticity. The
presence of non-constant variance is referred to heteroscedasticity.
3. The error terms must be normally distributed.
4. The independent variables should not be strongly correlated with one another. The presence of such
correlation is known as multicollinearity.
5. There should be no correlation between the residual (error) terms. The presence of such correlation is
known as autocorrelation.
Transformation of Variables
• Transformations are applied to accomplish certain objectives such as to ensure linearity, to achieve
normality, or to stabilize the variance.
• The necessity for transforming the data arises because the original variables, or the model in terms of the
original variables, violates one or more of the standard regression assumptions.
• Some common transformations are the log transformation, square root transformation, reciprocal square
root transformation, and polynomial transformation.
Non-Linearity of the data
• Nonlinear transformations of variables are often used because of curvature detected in scatter plots. If the
true relationship is non-linear, the prediction accuracy of a linear model can be significantly reduced.
• Residual plots are a useful graphical tool for identifying non-linearity. Plot the residuals versus the
predicted (fitted) values ŷi. If there is any pattern (for example, a parabolic shape) in this plot, treat it as
a sign of non-linearity in the data.
• If the residual plot indicates that there are non-linear associations in the data, then a simple approach
is to use non-linear transformations of the dependent variable Y or of any of the explanatory variables X
(or both), such as the natural logarithm, the square root, the reciprocal, or the square, in the regression
model.

• The mpg data suggest a curved relationship. A simple approach for


incorporating non-linear associations in a linear model is to include
transformed versions of the predictors in the model.
• mpg = β0 + β1 × horsepower + β2 × horsepower^2 + e
• The left panel of Figure displays a residual plot from
the linear regression of mpg onto horsepower on the
Auto data set . The red line is a smooth fit to the
residuals, which is displayed in order to make it
easier to identify any trends. The residuals exhibit a
clear U-shape, which provides a strong indication of
non-linearity in the data.
• In contrast, the right-hand panel of Figure displays
the residual plot that results from the model (mpg =
β0 + β1 × horsepower + β2 × horsepower2 + e
),which contains a quadratic term. There appears to
be little pattern in the residuals, suggesting that the
quadratic term improves the fit to the data.
SUMMARY OUTPUT (linear fit)              SUMMARY OUTPUT (quadratic fit)
Multiple R   0.777683                    Multiple R   0.827713
R Square     0.604791                    R Square     0.685109

The R Square of the quadratic fit is about 0.685, compared to about 0.605 for the linear fit.
Heteroscedasticity (Non-constant Variance of Error Terms)
• Another important assumption of the linear regression model is that the error terms have a
constant variance(homoscedasticity). The standard errors, confidence intervals, and
hypothesis tests associated with the linear model rely upon this assumption.
• We can identify non-constant variances in the errors, or heteroscedasticity, from the
presence of a funnel shape in the residual plot.
• To overcome heteroskedasticity, a possible way is to transform the response variable such as
log(Y) or √Y. Also, we can use weighted least square method to tackle heteroskedasticity.

Left: the funnel shape indicates heteroscedasticity.
Right: the response has been log-transformed, and there is now no evidence of heteroscedasticity.
Normality Test
• For the data to be considered a good fit for regression, the residuals must be normally
distributed. The regression model requires that the errors (residuals) between observed and
expected values be normally distributed.
• The normal Q-Q plot is used to test how the error term is distributed, i.e., normally or
non-normally. If the residuals follow along the diagonal line, they are normally distributed.
• In the plot shown, the residuals are approximately normally distributed because they are aligned
along the diagonal line.

If the errors are not normally distributed, a non-linear transformation of the variables
(response or predictors) can bring improvement in the model.
Multicollinearity Test
• Multicollinearity means the presence of an exact or near-exact linear relationship among the
independent variables. No two independent variables (regressors) should share a strong correlation.
Multicollinearity leads to less precise estimates of the coefficients (Wooldridge, 2013). Correlation
matrices and the variance inflation factor (VIF) are used to test for the presence of multicollinearity.
• Collinearity reduces the accuracy of the estimates of the regression coefficients: it causes the
standard error of b̂j to grow. The t-statistic for each predictor is calculated by dividing b̂j by its
standard error. Consequently, collinearity results in a decline in the t-statistic. As a result, in the
presence of collinearity, we may fail to reject H0: bj = 0.
• A better way to assess multicollinearity is to compute the variance inflation factor (VIF).
• VIF determines the strength of the correlation between the independent variables. It is computed by
taking a variable and regressing it against every other predictor.
• The VIF for each variable can be computed using the formula VIFj = 1 / (1 - Rj²), where Rj² is the
R-square from regressing Xj on all the other independent variables.
• As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
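The VIF definition can be applied directly; a sketch assuming the Credit data (with Age, Limit, Rating, and
Balance) used later in these notes has been loaded into a data frame named Credit:

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the other predictors
r2_limit  <- summary(lm(Limit ~ Age + Rating, data = Credit))$r.squared
vif_limit <- 1 / (1 - r2_limit)   # large values (> 5 or 10) flag problematic collinearity
vif_limit
# car::vif(lm(Balance ~ Age + Limit + Rating, data = Credit)) returns all VIFs at once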
Dealing with Multi-collinearity
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -259.51752 55.88219 -4.644 4.66e-06 ***
Age -2.34575 0.66861 -3.508 0.000503 ***
Limit 0.01901 0.06296 0.302 0.762830
Rating 2.31046 0.93953 2.459 0.014352 *

In the Credit data, a regression of balance on age, rating, and limit indicates that the
predictors have VIF values of 1.01, 160.67, and 160.59. As we suspected, there is
considerable collinearity in the data.
• Model 1 is a regression of balance on age and limit, and Model 2 a regression of balance on
rating and limit. The standard error of β̂limit increases 12-fold in the second regression,
due to collinearity.
• The first is to drop one of the problematic variables from the regression. So, dropping limit
from the set of predictors has effectively solved the collinearity problem without
compromising the fit.
• The second solution is to combine the collinear variables together into a single predictor. For
instance, we might take the average of standardized versions of limit and rating in order to
create a new variable that measures credit worthiness.
Autocorrelation
• The presence of correlation in error terms drastically reduces model’s accuracy. This usually
occurs in time series models where the next instant is dependent on previous instant. If the
error terms are correlated, the estimated standard errors tend to underestimate the true
standard error.
• If this happens, it causes confidence intervals and prediction intervals to be narrower. A
narrower confidence interval means that a nominal 95% confidence interval would have less than a
0.95 probability of containing the actual value of the coefficient. Also, lower
standard errors would cause the associated p-values to be lower than they should be. This will make
us incorrectly conclude that a parameter is statistically significant.
• The Durbin Watson d statistic applied to check the autocorrelation problem in the regression
model.

• DW = 2: no autocorrelation
• 0 < DW < 2: positive autocorrelation
• 2 < DW < 4: negative autocorrelation

MLR=lm(Balance~Age+Rating)
summary(MLR)
library(lmtest)
library(zoo)
dwtest(MLR)
Assumptions with R
• # Linearity and residual plots with the Boston data
• library(MASS)
• attach(Boston)
• LR=lm(medv~lstat)
• summary(LR)
• plot(LR)
• MLR=lm(medv~lstat+I(lstat^2))
• summary(MLR)
• par(mfrow=c(2,2))
• plot(MLR)
• #Change back to 1 x 1
• par(mfrow=c(1,1))
• # Non-linearity with the mtcars data
• data("mtcars")
• View(mtcars)
• head(mtcars)
• attach(mtcars)
• plot(hp,mpg)
• LR=lm(mpg~hp)
• summary(LR)
• plot(LR)
• MLR=lm(mpg~hp+I(hp^2))
• summary(MLR)
• # Multicollinearity and autocorrelation with the Credit data
• library(readxl)
• library(carData)
• library(car)
• Credit=read_excel('Credit.xlsx')
• View(Credit)
• attach(Credit)
• MLR=lm(Balance~Age+Limit+Rating)
• summary(MLR)
• vif(MLR)
• MLR=lm(Balance~Age+Rating)
• summary(MLR)
• library(lmtest)
• library(zoo)
• dwtest(MLR)
Include/Exclude Decisions
• The t-values of regression coefficients can be used to make include/exclude decisions for
explanatory variables in a regression equation.
• Always trying to get the best fit possible, but the principle of parsimony suggests using the
fewest number of variables.
• Look at a variable’s t-value and its associated p-value. If the p-value is above some accepted
significance level, such as 0.05, this variable is a candidate for exclusion.
• Check whether a variable’s t-value is less than 1 or greater than 1 in magnitude. If it is less
than 1, then it is a mathematical fact that se will decrease (and adjusted R2 will increase) if
this variable is excluded from the equation.
• Look at t-values and p-values, rather than correlations, when making include/exclude
decisions. An explanatory variable can have a fairly high correlation with the dependent
variable, but because of other variables included in the equation, it might not be needed.
Model Building
❖Model building is the process of deciding which independent variables to include in the model.
❖When building a model, it is best to start with a few independent variables and then begin adding other variables.
However, when adding a variable, check for:
• Improved prediction (increase in adjusted R2)
• Statistically significant estimated coefficients
• Do other coefficients change when adding the new one?
• Particularly look for sign changes for estimated coefficients.
There are three types of equation-building procedures:
Forward—begins with no explanatory variables in the equation and successfully adds one at a time
until no remaining variables make a significant contribution.
Backward—begins with all potential explanatory variables in the equation and deletes them one at
a time until further deletion would do more harm than good.
Stepwise—is much like a forward procedure, except that it also considers possible deletions along
the way.
All of these procedures have the same basic objective—to find an equation with a small se and a large R2
(or adjusted R2).
Stepwise Regression
• Stepwise regression is a combination of forward selection and backward elimination procedure.
• Stepwise regression is a way to build a model by adding or removing predictor variables. It involves
adding or removing potential explanatory variables in succession and testing for statistical significance
(or the change in R^2) after each iteration.
Stepwise Regression in R
Build a regression model from a set of candidate predictor variables by entering predictors based on
p-values, in a stepwise manner, until there is no variable left to enter. The full model supplied to the
function should include all the candidate predictor variables.
install.packages("olsrr")
library(olsrr)
#import credit data
library(readxl)
credit=read_excel('Credit.xlsx')
attach(credit)
MLR=lm(Balance~ Age+Rating+Limit)
summary(MLR)
# Stepwise Regression in R
ols_step_both_p(MLR)
ols_step_both_p(MLR,details = TRUE)
Task with Auto data set
This question involves the use of multiple linear regression on the Auto data set.

• (a) Produce a scatterplot matrix which includes all of the variables in the data set.

• (b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the
name variable, which is qualitative.

• (c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables
except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

• i. Is there a relationship between the predictors and the response?

• ii. Which predictors appear to have a statistically significant relationship to the response?

• iii. What does the coefficient for the year variable suggest?

• (d) Use the plot() function to produce diagnostic (residual) plots of the linear regression fit. Comment on any
problems you see with the fit.

• (e) Do any interactions appear to be statistically significant?

• (f) Try a few different transformations of the variables, such as log(X), √ X, X^2. Comment on your findings
Happy Analyzing
Logistic Regression
Ramesh Kandela
[email protected]
Logistic Regression
Logistic regression is the appropriate regression analysis to conduct when the dependent
variable is qualitative/categorical (binary). This type of statistical model is often used for
classification and predictive analytics.

• Logistic regression is a classification algorithm used to assign observations to a discrete set


of classes.
• Logistic Regression (also called Logit Regression) is commonly used to estimate the
probability that an instance belongs to a particular class (e.g., what is the probability that
this email is spam?).

Application of Logistic Regression


• An organization may like to predict the customers who are likely to buy (here Y takes two
values, Y = 1 for buy and Y = 0 for do not buy).
• Fraud detection in Credit card
• Email spam or ham
• Disease prediction – Diabetes, Cancer, etc.…
Linear versus Logistic Regression
Logistic regression is similar to linear regression except in how the line is fit to the data.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function,
which predicts probabilities that lie between 0 and 1. The default threshold value is 0.5.

▪ In logistic regression, we use the concept of a threshold value, which decides the predicted class:
values above the threshold are classified as 1, and values below the threshold are classified as 0.
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
p(X) = 1 / (1 + e^(-z))
• p(X) = output between 0 and 1 (probability estimate)
• z = input to the function (the model's linear prediction, e.g. z = β0 + β1X)
• e = base of the natural log

• It maps any real value into another value within a


range of 0 and 1.
• The value of the logistic regression must be between
0 and 1, which cannot go beyond this limit, so it
forms a curve like the "S" form. The S-form curve is
called the Sigmoid function or the logistic function.
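A minimal R sketch of the sigmoid function and its S-shaped curve:

sigmoid <- function(z) 1 / (1 + exp(-z))        # logistic (sigmoid) function
z <- seq(-6, 6, by = 0.1)
plot(z, sigmoid(z), type = "l",
     xlab = "z", ylab = "p(X)", main = "Logistic (sigmoid) function")
sigmoid(0)   # 0.5, the usual classification threshold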
The Logistic Model
Let p(X) = Pr(Y = 1 | X) and consider using balance to predict default. Logistic regression uses the form
p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)),   or equivalently   p(X) = 1 / (1 + e^(-(β0 + β1X)))
For multiple logistic regression, β0 + β1X is replaced by β0 + β1X1 + … + βkXk.

After a bit of manipulation,
p(X) / [1 - p(X)] = e^(β0 + β1X)
The quantity p(X)/[1 - p(X)] is called the odds. The odds in favor of an event occurring are defined
as the probability the event will occur divided by the probability the event will not occur (range 0 to ∞).
By taking the logarithm of both sides,
log( p(X) / [1 - p(X)] ) = β0 + β1X
• The logit is also called the log-odds, since it is the log of the ratio between the estimated probability
for the positive class and the estimated probability for the negative class (range -∞ to ∞).
• In a logistic regression model, increasing X by one unit changes the log odds by β1, or equivalently it
multiplies the odds by e^β1 (odds ratio = e^β).
The odds are calculated by dividing the probability of success by the probability of failure; for example,
a success probability of 0.8 gives odds of 0.8/0.2 = 4. The odds ratio compares two sets of odds:
Odds Ratio = odds1 / odds0.
Interpret the logistic regression coefficients
• The interpretation of the odds ratio depends on whether the predictor(Independent
variable) is categorical or continuous.
• Odds ratios that are greater than 1 indicate that the event is more likely to occur as the
predictor increases. Odds ratios that are less than 1 indicate that the event is less likely to
occur as the predictor increases.
• Numerical feature: If you increase the value of feature xj by one unit, the estimated odds
change by a factor of exp(βj).
• Categorical feature: Changing the feature xj from the reference category to the other
category changes the estimated odds by a factor of exp(βj).
Odds Ratio Interpretation

exp(0.1229589) = 1.13
exp(0.979948) = 2.66
exp(0.0590632) = 1.06

• This fitted model says that, holding math and reading constant, the odds of getting into an honors class
for females (female = 1) over the odds for males (female = 0) are exp(0.979948) = 2.66. In terms of
percent change, we can say that the odds for females are 166% higher than the odds for males.
• The coefficient for math says that, holding female and reading constant, we will see a 13% increase in
the odds of getting into an honors class for a one-unit increase in math score (exp(0.1229589) = 1.13).
• The coefficient for reading says that, holding female and math constant, we will see a 6% increase in
the odds of getting into an honors class for a one-unit increase in reading score (exp(0.0590632) = 1.06).
Estimating the Logistic Regression Coefficients (β0 and β1)
• Logistic regression uses method of maximum likelihood to find the best fitting model.
• Maximum likelihood to estimate the parameters β0 and β1
Most statistical packages can fit linear logistic regression models by maximum likelihood.
             Coefficients(b)   S.E.     z value   P-value   Exp(b)
balance      0.0055            0.0002   24.95     <2e-16    1.006
Constant     -10.6513          0.361    -29.49    <2e-16    0
• If a coefficient b is positive, then if its X increases, the log odds increases, so the probability
of being in category 1 increases. The opposite is true for a negative b.
• Xs with positive bs are positively associated with being in category 1, and Xs with negative bs are
positively associated with being in group 0.
Testing for Significance
Wald’s test :Wald’s test is used for checking statistical significance of individual predictor variables (equivalent to t-
test in MLR).
Likelihood Ratio (or Deviance) Test:The test for overall significance is based upon the value of G test statistic. The
sampling distribution of G follows a chi-square distribution with degrees of freedom equal to no. of independent
variable ((equivalent to F-test in MLR).
Wald’s test
• Wald’s test is used for checking statistical significance of individual independent variables .
The null and alternative hypotheses for Wald’s test are:
• Wald test for an individual regression coefficient: H0: βk = 0
• Wald test for an individual regression coefficient: Ha: βk ≠ 0

The Wald statistic can be defined in two ways:
• Wald = βk^2 / SE(βk)^2  (hypothesis testing procedure as for the chi-square test)
• Wald = βk / SE(βk)      (hypothesis testing procedure as for the z test)

             Coefficients(b)   S.E.        Wald(X^2)   z value   P-value   Exp(b)
balance      0.0055            0.0002204   622.622     24.95     <2e-16    1.006
Constant     -10.6513          0.361       869.736     -29.49    <2e-16    0

• P value<.05, reject H0. In other words, we conclude that there is indeed an association
between balance and probability of default.
• To be precise, a one-unit increase in balance is associated with an increase in the log odds of
default by 0.0055 units.
Testing the Global Null Hypothesis (chi-squared test for the complete regression model)
• Null hypothesis: H0: β1 = β2 = … = βK = 0
• Alternative hypothesis: H1: not all regression coefficients are zero.

X² = Null deviance - Residual deviance, compared with a chi-square distribution with df = Null df - Residual df

The null deviance is conceptually similar to the total variance of the dependent variable in linear regression
analysis; the residual deviance is conceptually similar to the residual variance in linear regression analysis.
p-value of the overall chi-square statistic: 1 - pchisq(2920.6 - 1571.5, 9999 - 9996) = 0
The p-value is smaller than α (0.05); therefore we reject the null hypothesis and conclude that overall the
logistic regression model is significant.
Hosmer Lemeshow test for goodness of fit
• Hosmer- Lemeshow (H-L) is a chi-square goodness of fit test used for checking the goodness of logistic
regression model (Hosmer and Lemeshow, 2000).
• The Hosmer- Lemeshow test is constructed by dividing the data set into 10 groups (deciles). The Hosmer-
Lemeshow test checks whether the observed and expected frequencies in each group are equal. The null and
alternative hypotheses in Hosmer-Lemeshow test are
• H0: The logistic regression model fits the data
• H1: The logistic regression model does not fit the data
The H-L test statistic is given by H = Σ (Ok - Ek)² / (Nk·pk·(1 - pk)), summed over the groups, where
• Ok is the observed frequency in group k, Ek is the expected frequency in group k, Nk is the number of
observations in group k, and pk is the group mean (the average predicted probability), so that Ek = Nk·pk.
• A large value of Chi-squared (with small p-value < 0.05) indicates poor fit and small Chi-squared values (with
larger p-value closer to 1) indicate a good logistic regression model fit.
Hosmer-Lemeshow test with R
library(ResourceSelection)
# hoslem.test() expects a numeric 0/1 outcome, so the factor default is recoded first
HL = hoslem.test(ifelse(Default$default == "Yes", 1, 0), fitted(logreg), g = 10)
HL
cbind(HL$observed, HL$expected)
Result: Chi-square = 4.289, df = 8, Sig. = .830
Since the p-value is 0.830, we retain the null hypothesis; that is, the logistic regression model fits the data.
Making Predictions
Estimated coefficients of the logistic regression model that predicts the probability of default
using balance.
• Once the coefficients have been estimated, we can compute the probability of default for
any given credit card balance using p(X) = e^(b0 + b1X) / (1 + e^(b0 + b1X)).
• What is our estimated probability of default for someone with a balance of $1,000?
p(1000) = e^(-10.6513 + 0.0055×1000) / (1 + e^(-10.6513 + 0.0055×1000)) ≈ 0.00576, which is below 1%.
• The predicted probability of default for an individual with a balance of $2,000 is 0.586.
Making Predictions
• For the Default data, estimated coefficients of the logistic regression model that predicts the
probability of default using balance, income, and student status. Student status is encoded
as a dummy variable student[Yes], with a value of 1 for a student and a value of 0 for a non-
student. In fitting this model, income was measured in thousands of dollars.

The negative coefficient for student in the multiple logistic regression indicates that for a
fixed value of balance and income, a student is less likely to default than a non-student.
For example, a student with a credit card balance of $1, 500 and an income of $40, 000
has an estimated probability of default of

A non-student with the same balance and income has an estimated probability of default of
Logistic Regression with R
# import the data
library(ISLR)
attach(Default)
summary(Default)
logreg = glm(default ~ balance, family = binomial("logit"))
summary(logreg)

# Predicting
prob_pred = predict(logreg, type = 'response', newdata = data.frame(balance = 1000))
prob_pred
y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred
prob_pred = predict(logreg, type = 'response', newdata = data.frame(balance = 2000))
prob_pred
y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred
prob_pred = predict(logreg, type = 'response', newdata = data.frame(balance = c(1000, 1500, 2000)))
prob_pred
y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred

mlogreg = glm(default ~ balance + income + student, family = binomial("logit"))
summary(mlogreg)
prob_pred = predict(mlogreg, type = 'response',
                    newdata = data.frame(balance = 1500, income = 40000, student = c("Yes", "No")))
prob_pred
y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred
Thank You
Time Series Analysis
Ramesh Kandela

[email protected]
TIME SERIES ANALYSIS
• A time series is a set of numerical values of some variable obtained at regular period over
time. The series is usually tabulated or graphed in a manner that readily conveys the
behaviour of the variable under study.
• A statistical technique that attempts to forecast future values of the time series by examining
past observations of the data only.
• Time series may be annual, quarterly, or monthly, and sometimes weekly, daily, or hourly (e.g., the study
of road traffic).
Examples of Forecasting Applications
• When a company plans its ordering or production schedule for a product it sells to the public,
it must forecast the customer demand for this product so that it can stock appropriate
quantities—neither too many nor too few.
• When an organization plans to invest in stocks, bonds, or other financial instruments, it
typically attempts to forecast movements in stock prices and interest rates.
• When government officials plan policy, they attempt to forecast movements in
macroeconomic variables such as inflation, interest rates, and unemployment.
Components of a Time Series
The various reasons or forces which affect the values of an observation in a time series are the
components of a time series. Time-series data contain four components: trend, seasonality,
cyclicality and irregularity. Not all time series have all these components.
• Trend: describes the long-term movement of the series.
• Seasonal variations: represent seasonal changes.
• Cyclical fluctuations: correspond to periodical but not seasonal variations.
• Irregular variations: the unpredictable component.
[Illustrative charts: (a) Trend, (b) Seasonality, (c) Cyclicality, (d) Random]
Trend(T)
• Trend Sometimes a time-series displays a steady tendency of either upward or downward movement in
the average (or mean) value of the forecast variable y over time. Such a tendency is called a trend.
When observations are plotted against time, a straight line describes the increase or decrease in the
time series over a period of time.
• The population, agricultural production, items manufactured, number of births and deaths, number of
industry or any factory, number of schools or colleges are some of its example showing some kind of
tendencies of movement.
• If observations increase or decrease regularly through time, the time series has a trend.
• Linear trend—occurs if the observations increase by the same amount from period to period.
• Quadratic trend —the values of a time series tend to rise or fall at a rate that is not constant; it
changes over time.
• Exponential trend—occurs when observations increase at a tremendous rate.
Seasonal Variation(S)
• Seasonal It is a special case of a cycle component of time series in which fluctuations are
repeated usually within a year (e.g. daily, weekly, monthly, quarterly) with a high degree of
regularity. For example, average sales for a retail store may increase greatly during festival
seasons.
• These variations come into play either because of the natural forces or man-made
conventions. The various seasons or climatic conditions play an important role in seasonal
variations. Such as production of crops depends on seasons, the sale of umbrella and
raincoats in the rainy season, and the sale of electric fans and A.C. shoots up in summer
seasons.
Cycle(C)
• Cycles An upward and downward movement in the variable value about the trend time over a
time period are called cycles. A business cycle may vary in length, usually more than a year
but less than 5 to 7 years. The movement is through four phases: from peak (prosperity) to
contraction (recession) to trough (depression) to expansion (recovery or growth).
• A time series has a cyclic component when business cycles
affect the variables in similar ways.
• The cyclic component is more difficult to predict than
the seasonal component, because seasonal variation is
much more regular.
• The length of the business cycle varies, sometimes
substantially.
• The length of a seasonal cycle is generally one year, while
the length of a business cycle is generally longer than
one year and its actual length is difficult to predict.
Random Variations(R)
• Irregular variations are rapid changes or bleeps in the data caused by short-term
unanticipated and non-recurring factors. Irregular fluctuations can happen as often as
day to day.
• Random variation (or noise) is the unpredictable component that gives most time series
graphs their irregular, zigzag appearance.
• A time series can be determined only to a certain extent by its trend, seasonal, and
cyclic components; other factors determine the rest.
• These other factors combine to create a certain amount of unpredictability in almost
all time series.
Forecasting Methods (Time Series Models)
• Naïve Forecasting
• Moving Average
• Weighted Moving Averages
• Regression Based Forecasting
• Linear Trend Model
• Nonlinear Trend(Quadratic trend & Exponential Trend)
• Exponential Smoothing
• Simple Exponential Smoothing
• Holt’s Exponential Smoothing
• Holt-winter’s Exponential Smoothing
Forecast Accuracy
• Measures of forecast accuracy are used to determine how well a particular forecasting
method is able to reproduce the time series data that are already available.
• Measures of forecast accuracy are important factors in comparing different forecasting
methods.
• By selecting the method that has the best accuracy for the data already known, we hope to
increase the likelihood that we will obtain better forecasts for future time periods.
• The key concept associated with measuring forecast accuracy is forecast error.
• Forecast Error = Actual Value - Forecast
• A positive forecast error indicates the forecasting method underestimated the actual value.
• A negative forecast error indicates the forecasting method overestimated the actual value.
Measures of Accuracy
The forecast error is the difference between the actual value and the forecast.
Mean Absolute Error (MAE)
• Mean absolute error (MAE) is the average absolute error: MAE = (1/n) Σ |Yt – Ft|
• Yt is the actual value of Y at time t and Ft is the corresponding forecasted value.
Mean Absolute Percentage Error (MAPE)
• Mean absolute percentage error (MAPE) is the average of the absolute percentage errors:
MAPE = (100/n) Σ |Yt – Ft| / |Yt|
Mean Square Error (MSE)
• Mean square error is the average of the squared errors: MSE = (1/n) Σ (Yt – Ft)²
• Lower MSE implies better prediction. However, it depends on the range of the time-series data.
Root Mean Square Error (RMSE)
• Root mean square error (RMSE) is the square root of the mean square error: RMSE = √MSE
• RMSE along with MAPE are the two most popular accuracy measures in forecasting. RMSE is the standard
deviation of the errors or residuals.
Measures of Accuracy
• The model selection may depend on the chosen forecasting accuracy measure.
• Some forecasting software packages choose the best model from a given class by
minimizing MAE, RMSE, or MAPE.
• However, small values of these measures guarantee only that the model tracks the
historical observations well.
• There is still no guarantee that the model will forecast future values accurately.
• Unlike residuals from the regression equation, forecast errors are not guaranteed to always
average to zero.
• If the average of the forecast errors is negative, this implies a bias, or that the forecasts
tend to be too high.
• If the average is positive, the forecasts tend to be too low.
Naïve Forecasting
Naïve forecasting is the technique in which the last period’s sales are used as the next period’s
forecast, without adjusting them or attempting to establish causal factors.

Week (t)  Sales  Forecast  Error  Absolute Error  Squared Error  % Error  Absolute % Error
1         17
2         21     17        4      4               16             19.05    19.05
3         19     21        -2     2               4              -10.53   10.53
4         23     19        4      4               16             17.39    17.39
5         18     23        -5     5               25             -27.78   27.78
6         16     18        -2     2               4              -12.50   12.50
7         20     16        4      4               16             20.00    20.00
8         18     20        -2     2               4              -11.11   11.11
9         22     18        4      4               16             18.18    18.18
10        20     22        -2     2               4              -10.00   10.00
11        15     20        -5     5               25             -33.33   33.33
12        22     15        7      7               49             31.82    31.82
Sum: Absolute Error = 41, Squared Error = 179, % Error = 1.19, Absolute % Error = 212

MAE = 41/11 = 3.73
MSE = 179/11 = 16.27
RMSE = √MSE = 4.03
MAPE = 212/11 = 19.24%
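These accuracy measures are easy to compute directly in R. A minimal sketch (function and object names are illustrative) that reproduces the MAE, MSE, RMSE and MAPE values in the naïve-forecast table above:

# forecast accuracy measures for paired vectors of actual and forecast values
forecast_accuracy <- function(actual, forecast) {
  e <- actual - forecast                       # forecast errors
  c(MAE  = mean(abs(e)),
    MSE  = mean(e^2),
    RMSE = sqrt(mean(e^2)),
    MAPE = mean(abs(e) / abs(actual)) * 100)   # in percent
}
sales <- c(17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22)
# naive forecast: last period's value; compare weeks 2-12 with weeks 1-11
forecast_accuracy(actual = sales[-1], forecast = sales[-length(sales)])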
Moving Averages
• The moving averages method uses the average of the most recent k data values in the time series. As
the forecast for the next period..

F(t+1) = Σ(most recent k data values) / k = (Yt + Yt−1 + … + Yt−k+1) / k
• where: Ft+1= forecast of the time series for period t + 1
• The term moving is used because every time a new observation becomes available for the time
series, it replaces the oldest observation in the equation.

• As a result, the average will change, or move, as new observations become available.

• To use moving averages to forecast, we must first select the order k, or number of time series values,
to be included in the moving average.

• A smaller value of k will track shifts in a time series more quickly than a larger value of k.

• If more past observations are considered relevant, then a larger value of k is better.
Moving Averages (k = 3)

Week  Sales  3MA Forecast  Forecast Error  Absolute Error  Squared Error  Absolute % Error
1     17
2     21
3     19
4     23     19            4               4               16             17.39
5     18     21            -3              3               9              16.67
6     16     20            -4              4               16             25.00
7     20     19            1               1               1              5.00
8     18     18            0               0               0              0.00
9     22     18            4               4               16             18.18
10    20     20            0               0               0              0.00
11    15     20            -5              5               25             33.33
12    22     19            3               3               9              13.64
Sum                                        24              92             129.21

MAE = 24/9 = 2.67
MSE = 92/9 = 10.22
RMSE = √MSE = 3.20
MAPE = 129.21/9 = 14.36%
Weighted Moving Averages
• To use this method we must first select the number of data values to be included in the average.
• Next, we must choose the weight for each of the data values.
• The more recent observations are typically given more weight than older observations.
• For convenience, the weights should sum to 1.
An example of a 3-period weighted moving average (3WMA), using the weekly sales data
(weeks 1–3: 17, 21, 19) with weights .2, .3 and .5, where 19 is the most recent observation:
• Forecast for week 4: 3WMA = .2(17) + .3(21) + .5(19) = 19.2
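A quick sketch of the same 3-period weighted moving average in R (vector names are illustrative; the weights are ordered from oldest to most recent):

sales   <- c(17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22)
weights <- c(0.2, 0.3, 0.5)                    # weights sum to 1
wma3 <- sapply(4:length(sales), function(t) sum(weights * sales[(t - 3):(t - 1)]))
wma3[1]                                        # forecast for week 4: 19.2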
Exponential Smoothing
• This method is a special case of a weighted moving averages method; we select only
the weight for the most recent observation.
• The weights for the other data values are computed automatically and become
smaller as the observations grow older.
• The exponential smoothing forecast is a weighted average of all the observations in
the time series.
• The term exponential smoothing comes from the exponential nature of the
weighting scheme for the historical values.

Exponential smoothing is a forecasting method for univariate time series data. This method
produces forecasts that are weighted averages of past observations where the weights of
older observations exponentially decrease.
Exponential Smoothing
• SES assumes a fairly steady time-series data with no significant trend, seasonal or cyclical component
• Level (or Intercept) equation 𝐹𝑡+1 = 𝛼𝑌𝑡 + (1 − 𝛼)𝐹𝑡

where:Ft+1 = forecast of the time series for period t + 1


Ft = forecast of the time series for period t
Yt = actual value of the time series in period t
𝛼 = smoothing constant (0 < 𝛼 < 1)
and let: F2 = Y1 (to initiate the computations)

• The closer α is to 1, the more importance is given to recent values; the closer it is to 0, the
more importance is given to past values.
• The software chooses the optimal value of α that minimizes the RMSE.
• Since the model uses one smoothing constant, it is called single exponential smoothing.
Exponential Smoothing

α = .2
Week  Time Series Value  Forecast  Error   Squared Error
1     17                 #N/A
2     21                 17.00     4.00    16
3     19                 17.80     1.20    1.44
4     23                 18.04     4.96    24.6016
5     18                 19.03     -1.03   1.065024
6     16                 18.83     -2.83   7.984015
7     20                 18.26     1.74    3.02593
8     18                 18.61     -0.61   0.370131
9     22                 18.49     3.51    12.34323
10    20                 19.19     0.81    0.657128
11    15                 19.35     -4.35   18.93549
12    22                 18.48     3.52    12.382
Sum                                10.92   98.80
MSE = 8.98, RMSE = 3.00

α = .3
Week  Time Series Value  Forecast  Error   Squared Error
1     17                 #N/A
2     21                 17.00     4.00    16.00
3     19                 18.20     0.80    0.64
4     23                 18.44     4.56    20.79
5     18                 19.81     -1.81   3.27
6     16                 19.27     -3.27   10.66
7     20                 18.29     1.71    2.94
8     18                 18.80     -0.80   0.64
9     22                 18.56     3.44    11.83
10    20                 19.59     0.41    0.17
11    15                 19.71     -4.71   22.23
12    22                 18.30     3.70    13.69
Sum                                8.03    102.86
MSE = 9.35, RMSE = 3.06
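A short R sketch of the level equation applied by hand (names are illustrative), which reproduces the α = 0.2 forecasts and RMSE above; F2 = Y1 initializes the recursion:

ses_by_hand <- function(y, alpha) {
  f <- rep(NA, length(y))
  f[2] <- y[1]                                   # F2 = Y1
  for (t in 2:(length(y) - 1)) f[t + 1] <- alpha * y[t] + (1 - alpha) * f[t]
  f
}
sales <- c(17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22)
f <- ses_by_hand(sales, alpha = 0.2)
sqrt(mean((sales - f)^2, na.rm = TRUE))          # RMSE ≈ 3.00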
Trend Projection
Linear Trend
• Many time series follow a long-term trend except for random variation.
• This trend can be upward or downward.
• A straightforward way to model this trend is to estimate a regression equation for Yt, using time
t(X) as the single explanatory variable.
• If a time series exhibits a linear trend, the method of least squares may be used to determine a trend
line (projection) for future forecasts.
• A linear trend means that the time series variable changes by a constant amount each time period.
• The independent variable is the time period and the dependent variable is the actual observed value
in the time series.
• Using the method of least squares, the formula for the trend projection is:
Tt = b0 + b1t
: where Tt = linear trend forecast in period t
b0 = Y-intercept of the linear trend line
b1 = slope of the linear trend line
t = the time period
For the trend projection equation Tt = b0 + b1t:
b1 = Σ (t − t̄)(Yt − Ȳ) / Σ (t − t̄)²
b0 = Ȳ − b1 t̄
where: Yt = actual value of the time series in period t
n = number of periods in the time series
Ȳ = average value of the time series
t̄ = mean value of t
• The intercept term b0 is less important: it literally represents the expected value of the series
at time t = 0.
• The interpretation of 𝑏1 is that it represents the expected change in the series from one
period to the next.
• If 𝑏1 is positive, the trend is upward.
• If 𝑏1 is negative, the trend is downward.
• A graph of the time series indicates whether a linear trend is likely to provide a good fit.
Below are given the figures of Sales (in thousands) of a Mobile Company:
Year 2014 2015 2016 2017 2018 2019 2020
Sales 80 90 92 83 94 99 92
(a) Fit a straight-line trend to these figures.
(b) Plot these figures on a graph and show the trend line.
(c) Estimate the Sales in 2021.
Year  Time Period (x)  Sales (y)  X-X̅  Y-Y̅  (X-X̅)(Y-Y̅)  (X-X̅)^2  Trend Values (Forecast)
2014  1                80         -3    -10   30            9         84
2015  2                90         -2    0     0             4         86
2016  3                92         -1    2     -2            1         88
2017  4                83         0     -7    0             0         90
2018  5                94         1     4     4             1         92
2019  6                99         2     9     18            4         94
2020  7                92         3     2     6             9         96
Sum   28               630        0           56            28
X̅ = 4, Y̅ = 90
b = 56/28 = 2, a = 90 − 2(4) = 82, so the trend line is Y = 82 + 2x.
(c) For 2021, x = 8, so the estimated sales are 82 + 2(8) = 98 thousand.
Year (t)  Sales (Y)  X-X̅   Y-Y̅    (X-X̅)(Y-Y̅)  (X-X̅)^2  Forecast  Error  Absolute Error  Squared Error
1         21.6       -4.5   -4.85   21.825        20.25     21.5      -0.1   0.1             0.01
2         22.9       -3.5   -3.55   12.425        12.25     22.6      -0.3   0.3             0.09
3         25.5       -2.5   -0.95   2.375         6.25      23.7      -1.8   1.8             3.24
4         21.9       -1.5   -4.55   6.825         2.25      24.8      2.9    2.9             8.41
5         23.9       -0.5   -2.55   1.275         0.25      25.9      2      2               4
6         27.5       0.5    1.05    0.525         0.25      27        -0.5   0.5             0.25
7         31.5       1.5    5.05    7.575         2.25      28.1      -3.4   3.4             11.56
8         29.7       2.5    3.25    8.125         6.25      29.2      -0.5   0.5             0.25
9         28.6       3.5    2.15    7.525         12.25     30.3      1.7    1.7             2.89
10        31.4       4.5    4.95    22.275        20.25     31.4      0      0               0
X̅ = 5.5, Y̅ = 26.45; Sum: (X-X̅)(Y-Y̅) = 90.75, (X-X̅)^2 = 82.5, Squared Error = 30.7
b = 90.75/82.5 = 1.1, a = 26.45 − 1.1(5.5) = 20.4; MSE = 30.7/10 = 3.07
Quadratic Trend
• Quadratic trend regression: Yt = b0 + b1t + b2t²

Year  Revenue  t    t^2
1     23.1     1    1
2     21.3     2    4
3     27.4     3    9
4     34.6     4    16
5     33.8     5    25
6     43.2     6    36
7     59.5     7    49
8     64.4     8    64
9     74.2     9    81
10    99.3     10   100

Regression Statistics: Multiple R = 0.990548, R Square = 0.981185, Adjusted R Square = 0.975809,
Standard Error = 3.975782, Observations = 10

ANOVA
            df   SS        MS        F         Significance F
Regression  2    5770.128  2885.064  182.5199  9.14E-07
Residual    7    110.6479  15.80685
Total       9    5880.776

            Coefficients  Standard Error  t Stat    P-value   Lower 95%
Intercept   24.18167      4.676124        5.171305  0.001293  13.12439
t           -2.10598      1.952947        -1.07836  0.316623  -6.72397
t^2         0.921591      0.173024        5.326385  0.001092  0.512455
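The same quadratic trend can be fit with lm() in R. A short sketch using the revenue series from the table above (object names are illustrative):

revenue <- c(23.1, 21.3, 27.4, 34.6, 33.8, 43.2, 59.5, 64.4, 74.2, 99.3)
t <- 1:10
quad <- lm(revenue ~ t + I(t^2))
summary(quad)                                   # coefficients match the output above
predict(quad, newdata = data.frame(t = 11))     # forecast for the next period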
Exponential Trend
• An exponential trendline is a curved line that is most useful when data values rise or fall at
increasingly higher rates.
• The appropriate regression equation contains a multiplicative error term ut:
Yt = c e^(bt) ut
• This equation is not useful for estimation; for that, a linear equation is required.
• You can achieve linearity by taking natural logarithms of both sides, as shown below, where
a = ln(c) and et = ln(ut):
ln(Yt) = a + bt + et
• The coefficient b (expressed as a percentage) is approximately the percentage change
per period. For example, if b = 0.05, then the series is increasing by approximately 5%
per period.
• If a time series exhibits an exponential trend, then a plot of its logarithm should be
approximately linear.
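If a series shows an exponential trend, a simple way to estimate it is to regress the log of the series on time. A hedged sketch using R's built-in AirPassengers series (the same data used in the R code that follows):

y <- as.numeric(AirPassengers)
t <- 1:length(y)
expfit <- lm(log(y) ~ t)
coef(expfit)["t"]                                              # approximate proportional growth per period
exp(predict(expfit, newdata = data.frame(t = length(y) + 1)))  # back-transformed one-step forecast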
Time Series Models with R
#install.packages("forecast")
library(forecast)
# import data
AirPassengers
summary(AirPassengers)

# Naive
naivem = naive(AirPassengers, h = 12)
naivem
forecastnavie = forecast(naivem, 12)
forecastnavie
accuracy(naivem)
accuracy(forecastnavie)
plot(forecastnavie)

# Moving Average
ma = ma(AirPassengers, 3)
ma
accuracy(ma)
ma2 = forecast(ma, 12)
accuracy(ma2)
plot(ma2)

# Simple Exponential smoothing
es = ses(AirPassengers)
summary(es)
accuracy(es)
plot(es)

# Linear Trend Model
month = 1:144
month
ltm = tslm(AirPassengers ~ month)
summary(ltm)
accuracy(ltm)

# Accuracy of all models
accuracy(naivem)    # Naive
accuracy(ma2)       # Moving Average
accuracy(es)        # Simple Exponential smoothing
accuracy(ltm)       # Linear Trend Model
Case Study: Marriott Rooms Forecasting

• What forecasting procedure would you recommend for making the


Tuesday afternoon forecast of each day’s demand for the following
Saturday through Friday?
• What is your forecast for Saturday?
• What will you do about the current request for up to 60 rooms for
Saturday?
Thank You
Factor Analysis
Ramesh Kandela
[email protected]
Factor Analysis
• Factor analysis is a general name denoting a class of procedures primarily used for data
reduction and summarization.
• Dimensionality reduction, or variable reduction techniques, refers to the process of reducing
the number of dimensions or features in a dataset.
• Factor analysis is an interdependence technique in that an entire set of interdependent
relationships is examined without making the distinction between dependent and
independent variables.
• Factors: An underlying dimension that explains the correlations among a set of variables.
• Factor analysis is used in the following circumstances:
• To identify underlying dimensions, or factors, that explain the correlations among a set of
variables.
• To identify a new, smaller, set of uncorrelated variables to replace the original set of
correlated variables in subsequent multivariate analysis (regression or discriminant
analysis).
Types of Factor Analysis
1. An Exploratory Factor Analysis explores the relationships among the variables and does not
have an a priori fixed number of factors.
▪ Principal Components Analysis
▪ Common Factor Analysis.
2. A Confirmatory Factor Analysis assumes that you enter the factor analysis with a firm idea
about the number of factors you will encounter and which variables will most likely load onto
each factor.
• The major difference is that EFA seeks to discover the number of factors and does not specify
which variables (items) load on which factors.
Factor Analysis Assumptions
• The variables must be appropriately measured on an interval or ratio scale.
• An appropriate sample size should be used.
• Variables must be interrelated.
Steps in Exploratory Factor Analysis
Formulate the Problem

Construction of the Correlation Matrix


Bartlett's test of sphericity and
Kaiser-Meyer-Olkin (KMO)

Determine the Method of Factor Analysis

Factor Extraction

Determination of Number of Factors

Rotation of Factors

Interpretation of Factors
Formulate the Problem
• The objectives of factor analysis should be identified.
• The variables to be included in the factor analysis should be specified based on the researcher's past
research, theory, and judgment. The variables must be appropriately measured on an interval or
ratio scale.
• An appropriate sample size should be used. As a rough guideline, there should be at least four or
five times as many observations (sample size) as there are variables.
Conducting Factor Analysis with Airline Passenger Satisfaction
Download the airline passenger satisfaction data using the below link
https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction
Objective: Categorize the dimensions of passenger satisfaction into a smaller number of Factors.
Variables
1. Inflight wifi service 8. Inflight entertainment
2. Departure/Arrival time convenient 9. On-board service
3. Ease of Online booking 10. Leg room service
4. Gate location 11. Baggage handling
5. Food and drink 12. Check-in service
6. Online boarding 13. Inflight service
7. Seat comfort 14. Cleanliness
Construct the Correlation Matrix
The analytical process is based on a matrix of correlations between the variables.
Correlation matrix. A correlation matrix is a lower triangle matrix showing the simple correlations,
r, between all possible pairs of variables included in the analysis. The diagonal elements, which are
all 1, are usually omitted.
Bartlett's test of sphericity. Bartlett's test of sphericity is a test statistic used to examine the
hypothesis that the variables are uncorrelated in the population. In other words, the population
correlation matrix is an identity matrix; each variable correlates perfectly with itself (r = 1) but has
no correlation with the other variables (r = 0).
Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The Kaiser-Meyer-Olkin (KMO)
measure of sampling adequacy is an index used to examine the appropriateness of factor analysis.
High values (between 0.5 and 1.0) indicate factor analysis is appropriate. Values below 0.5 imply
that factor analysis may not be appropriate.
Small values of the KMO statistic indicate that the correlations between pairs of variables
cannot be explained by other variables and that factor analysis may not be appropriate.

KMO and Bartlett's Test (airline passenger satisfaction data)
Kaiser-Meyer-Olkin Measure of Sampling Adequacy: .781
Bartlett's Test of Sphericity: Approx. Chi-Square = 601676.894, df = 91, Sig. = .000
Determine the Method of Factor Analysis
Factor Extraction
 The primary objective of this stage is to determine the factors.
 Estimates of initial factors are obtained using Principal components analysis and
common factor analysis.
• In principal components analysis, the total variance in the data is considered. The
diagonal of the correlation matrix consists of unities, and full variance is brought into the
factor matrix. Principal components analysis is recommended when the primary concern is
determining the minimum number of factors that will account for maximum variance in
the data for subsequent multivariate analysis. The factors are called principal components.
• In common factor analysis, the factors are estimated based only on the common variance.
Communalities are inserted in the diagonal of the correlation matrix. This method is
appropriate when the primary concern is to identify the underlying dimensions and the
common variance is of interest. This method is also known as principal axis factoring.
Factor Analysis Model
• Mathematically, each variable is expressed as a linear combination of underlying factors.
The amount of variance a variable shares with all other variables included in the analysis is
called communality.
• The covariation among the variables is described in terms of a small number of common
factors plus a unique factor for each variable. If the variables are standardized, the factor
model may be represented as:
Xi = Ai 1F1 + Ai 2F2 + Ai 3F3 + . . . + AimFm + ViUi

where
Xi = i th standardized variable
Aij = standardized multiple regression coefficient of variable i on common factor j
F = common factor
Vi = standardized regression coefficient of variable i on unique factor i
Ui = the unique factor for variable i
m = number of common factors
Factor Analysis Model
The unique factors are uncorrelated with each other and with the common factors. The
common factors themselves can be expressed as linear combinations of the observed
variables.
Fi = Wi1X1 + Wi2X2 + Wi3X3 + . . . + WikXk
Where:
Fi = estimate of i th factor
Wi = weight or factor score coefficient
k = number of variables
• It is possible to select weights or factor score coefficients so that the first factor explains the
largest portion of the total variance.
• Then a second set of weights can be selected, so that the second factor accounts for most of
the residual variance, subject to being uncorrelated with the first factor.
• This same principle could be applied to selecting additional weights for the additional factors.
Principal Components Analysis
• The principal components analysis is the most commonly used extraction method
• In principal components analysis, linear combinations of the observed variables are
formed.
 The 1st principal component is the combination that accounts for the largest
amount of variance in the sample (1st extracted factor).
 The 2nd principal component accounts for the next largest amount of variance and
is uncorrelated with the first (2nd extracted factor).
 Successive components explain progressively smaller portions of the total sample
variance, and all are uncorrelated with each other.
Component Matrix (Factor Matrix)
• Factor matrix. A factor matrix contains the factor loadings of all the variables on all the factors extracted.
• Factor loadings. Factor loadings are simple correlations between the variables and the factors.
• The components can be interpreted as the correlation of each variable (item) with the component. The square
of each loading represents the proportion of variance (think of it as an R² statistic) explained by a particular
component.
• Communality. Communality is the amount of variance a variable shares with all the other variables being
considered. This is also the proportion of variance explained by the common factors.
• Summing the squared component loadings across the components (columns) gives the communality estimate for
each item (variable). For example, for Inflight entertainment the loadings are 0.834, -0.279, -0.102 and 0.205;
the squared loadings are 0.696, 0.078, 0.010 and 0.042; and the communality is 0.825.

Component Matrix using Principal Component Analysis (loadings on components 1–4, small loadings suppressed;
h^2 is the communality). Extraction Method: Principal Component Analysis.
Variables (Items)                    Loadings                           h^2
Inflight entertainment               0.834   -0.279  -0.102  0.205      0.825
Cleanliness                          0.695   -0.279  -0.469  0.13       0.798
Seat comfort                         0.673   -0.243  -0.462             0.725
Food and drink                       0.599   -0.25   -0.513  0.217      0.732
Checkin service                      0.356   0.196   -0.294             0.257
Ease of Online booking               0.321   0.824   -0.151             0.811
Inflight wifi service                0.462   0.689   -0.221             0.743
Gate location                        0.131   0.66    0.493              0.7
Departure/Arrival time convenient    0.201   0.64    0.369              0.588
Inflight service                     0.517   0.652   0.117              0.715
Baggage handling                     0.511   0.638   0.1                0.685
On-board service                     0.541   -0.105  0.563              0.621
Leg room service                     0.433   0.441   -0.119             0.396
Online boarding                      0.549   0.23    -0.242  -0.618     0.795
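Since a communality is just the row sum of squared loadings, it can be checked with a couple of lines of R (values taken from the Inflight entertainment row above):

loadings <- c(0.834, -0.279, -0.102, 0.205)
sum(loadings^2)      # approximately 0.825, the communality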
Results of Principal Components Analysis
• Eigenvalue. The eigenvalue represents the total variance explained by each factor.
• Eigenvalues are also the sum of squared component loadings across all items (rows) for each component,
which represent the amount of variance in each item that can be explained by the principal component.
• Percentage of variance. The percentage of the total variance attributed to each factor.
• Scree plot. A scree plot is a plot of the eigenvalues against the number of factors in order of extraction.

Factor  Eigenvalues  % of Variance  Cumulative %
1       3.8          27.144         27.144
2       2.362        16.871         44.015
3       2.166        15.471         59.486
4       1.063        7.595          67.08
5       0.951        6.792          73.873
6       0.7          5.002          78.875
7       0.54         3.857          82.732
8       0.515        3.676          86.408
9       0.469        3.353          89.762
10      0.369        2.633          92.395
11      0.328        2.346          94.741
12      0.295        2.108          96.848
13      0.253        1.808          98.657
14      0.188        1.343          100
With 14 variables, % of Variance = Eigenvalue/14: 3.8/14 = 27.14%, 2.36/14 = 16.87%,
2.17/14 = 15.47%, 1.06/14 = 7.59%.
Results of Principal Components Analysis After removing Checkin service and Leg room service
Commun
Variables alities Component Matrix
% of Cumulati
Inflight wifi Component Component Eigenvalues Variance ve %
service 0.77 1 2 3 4 1 3.572 29.769 29.769
Departure/Arriva Inflight .849 -.299 .128 2 2.359 19.656 49.424
3 1.987 16.561 65.985
l time convenient 0.644 entertainment 4 1.038 8.646 74.632
Ease of Online Cleanliness .741 -.311 -.364 .148 5 .589 4.907 79.539
booking 0.819 6 .521 4.342 83.881
Seat comfort .714 -.274 -.372 7 .482 4.017 87.897
Gate location 0.722 Food and drink .659 -.286 -.408 .210 8 .369 3.072 90.970
Food and drink 0.726 Ease of Online .344 .813 -.177 9 .334 2.781 93.751
Online boarding 0.797 10 .296 2.464 96.214
booking 11 .254 2.113 98.327
Seat comfort 0.724 12 .201 1.673 100.000
Inflight wifi service .483 .676 -.276
Inflight
Gate location .158 .653 .517
entertainment 0.827
Departure/Arrival .210 .641 .434
On-board service 0.658
time convenient
Baggage
handling 0.718 Inflight service .444 .740
Inflight service 0.751 Baggage handling .439 .722
Cleanliness 0.8 On-board service .476 .648
Online boarding .568 .213 -.210 -.621
Determine the Number of Factors

• A Priori Determination

• Determination Based on Eigenvalues

• Determination Based on Scree Plot

• Determination Based on Percentage of Variance

• Determination Based on Split-Half Reliability

• Determination Based on Significance Tests


Determine the Number of Factors
• A Priori Determination. Sometimes, because of prior knowledge, the researcher knows how many
factors to expect and thus can specify the number of factors to be extracted beforehand.
• Determination Based on Eigenvalues. In this approach, only factors with Eigenvalues greater than
1.0 are retained. An Eigenvalue represents the amount of variance associated with the factor.
Hence, only factors with a variance greater than 1.0 are included. Factors with variance less than 1.0
are no better than a single variable, since, due to standardization, each variable has a variance of 1.0.
• Determination Based on Percentage of Variance. In this approach the number of factors extracted is
determined so that the cumulative percentage of variance extracted by the factors reaches a
satisfactory level. It is recommended that the factors extracted should account for at least 60% of the
variance.
• Determination Based on Scree Plot. A scree plot is a plot of the
Eigenvalues against the number of factors in order of extraction.
Experimental evidence indicates that the point at which the
scree begins denotes the true number of factors. Generally, the
number of factors determined by a scree plot will be one or a
few more than that determined by the Eigenvalue criterion.
Determine the Number of Factors
• Determination Based on Split-Half Reliability. The sample is split in half and factor
analysis is performed on each half. Only factors with high correspondence of factor
loadings across the two subsamples are retained.
• Determination Based on Significance Tests. It is possible to determine the statistical
significance of the separate Eigenvalues and retain only those factors that are statistically
significant. A drawback is that with large samples (size greater than 200), many factors are
likely to be statistically significant, although from a practical viewpoint many of these
account for only a small proportion of the total variance.
Component Matrix
Number of Factors:3 Component
Variables
1 2 3
Variables Communalities Inflight entertainment 0.888 -0.214 -0.009
Inflight wifi service 0.65 Cleanliness 0.752 -0.277 -0.393
Departure/Arrival
time convenient 0.50 Seat comfort 0.704 -0.255 -0.385
Ease of Online Food and drink 0.678 -0.255 -0.444
booking 0.75 Ease of Online booking 0.25 0.814 -0.169
Gate location 0.52 Inflight wifi service 0.399 0.689 -0.132
Food and drink 0.72
Seat comfort 0.71 Gate location 0.138 0.685 -0.175
Inflight Departure/Arrival time convenient 0.186 0.678 -0.067
entertainment 0.83 Inflight service 0.495 0.042 0.708
Baggage handling 0.72
Baggage handling 0.486 0.048 0.691
Inflight service 0.75
Cleanliness 0.80 On-board service 0.512 0.014 0.627
On-board service 0.66 Eigenvalues 3.328 2.323 1.952
% of Variance 30.252 21.12 17.745
Cumulative % 30.252 51.372 69.117
Factor Rotation
• An important output from factor analysis is the factor matrix, also called the factor pattern
matrix.
• The factor matrix contains the coefficients used to express the standardised variables in
terms of the factors. These coefficients, the factor loadings, represent the correlations
between the factors and the variables. A coefficient with a large absolute value indicates that
the factor and the variable are closely related.
• Although the initial or unrotated factor matrix indicates the relationship between the factors
and individual variables, it seldom results in factors that can be interpreted, because the
factors are correlated with many variables. Therefore, through rotation the factor matrix is
transformed into a simpler one that is easier to interpret.
• In rotating the factors, we would like each factor to have nonzero, or significant, loadings or
coefficients for only some of the variables. Likewise, we would like each variable to have
nonzero or significant loadings with only a few factors, if possible with only one.
• The rotation is called orthogonal rotation if the axes are maintained at right angles. The most
commonly used method for rotation is the varimax procedure. This is an orthogonal method of
rotation that minimizes the number of variables with high loadings on a factor, thereby enhancing
the interpretability of the factors. Orthogonal rotation results in factors that are uncorrelated.

• The rotation is called oblique rotation when the axes are not maintained at right angles, and the
factors are correlated. Sometimes, allowing for correlations among factors can simplify the factor
pattern matrix. Oblique rotation should be used when factors in the population are likely to be
strongly correlated. promax
Factor Matrix Before and After Rotation
Variables Factor 1 Factor 2 Variables Factor 1 Factor 2
1 X 1 X
2 X X 2 X
3 X 3 X
4 X X 4 X
5 X X 5 X
6 X 6 X
A)High Loading Before Rotation B)High Loading After Rotation
Factor Matrix Before and After Rotation
Component Matrix Rotated Component Matrix
Component Component
1 2 3 1 2 3
Inflight entertainment .888 -.214 -.009 Cleanliness .891 .015 .048
Cleanliness .752 -.277 -.393 Food and drink .849 .027 -.034
Seat comfort .704 -.255 -.385 Seat comfort .841 .021 .030
Food and drink .678 -.255 -.444 Inflight entertainment .795 .036 .448
Ease of Online booking .250 .814 -.169 Ease of Online booking .018 .868 .010
Inflight wifi service .399 .689 -.132 Inflight wifi service .162 .782 .115
Gate location .138 .685 -.175 Gate location -.028 .717 -.057
Departure/Arrival time convenient .186 .678 -.067 Departure/Arrival time convenient -.039 .703 .059
Inflight service .495 .042 .708 Inflight service .046 .034 .863
Baggage handling .486 .048 .691 Baggage handling .045 .041 .844
On-board service .512 .014 .627 On-board service .108 .027 .802
Extraction Method: Principal Component Analysis
Rotation Method: Varimax.
Interpret Factors Rotated Component Matrix
Variables Comfort Convenience Service
Making final decisions Cleanliness 0.891 0.015 0.048
• The final decision about the number Food and drink 0.849 0.027 -0.034
of factors to choose is the number of Seat comfort 0.841 0.021 0.03
factors for the rotated solution that is Inflight
most interpretable.
entertainment 0.795 0.036 0.448
• A factor can then be interpreted in Ease of Online
terms of the variables that load high booking 0.018 0.868 0.01
on it.
Inflight wifi service 0.162 0.782 0.115
• To identify factors, group variables
Gate location -0.028 0.717 -0.057
that have large loadings for the same
factor. Departure/Arrival
time convenient -0.039 0.703 0.059
• Interpret factors according to the
meaning of the variables
Inflight service 0.046 0.034 0.863
Baggage handling 0.045 0.041 0.844
Reference: On-board service 0.108 0.027 0.802
Malhotra Naresh, K. and Dash, S. (2015) Marketing Research, An Applied Orientation. 7th Edition, Pearson, India.
Exploratory Factor Analysis (PCA) with R
# Formulate the Problem
# Import the data set
library(readxl)
APS <- read_excel("E:/Documents/APS.xlsx")
attach(APS)

# Construction of the Correlation Matrix
#install.packages("psych")
library(psych)
# Bartlett's test of sphericity (tests that the population correlation matrix is an identity matrix)
cortest.bartlett(cor(APS), n = nrow(APS))
# Kaiser-Meyer-Olkin measure
KMO(APS)

# Determine the Method of Factor Analysis & Factor Extraction
# The princomp() function produces an unrotated principal component analysis.
PCA <- princomp(APS, cor = TRUE)
summary(PCA)
loadings(PCA)

# Determination of Number of Factors
ev <- eigen(cor(APS))   # get eigenvalues
ev$values
scree(APS, pc = FALSE)  # Use pc=FALSE for factor analysis

# Rotation of Factors
# Varimax Rotated Principal Components, retaining 3 components
RPCA <- principal(APS, nfactors = 3, rotate = "varimax")
RPCA

# Interpretation of Factors
Thank you
Cluster Analysis

Ramesh Kandela
[email protected]
Cluster Analysis
• Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a
data set. Clustering looks to find homogeneous subgroups among the observations.
• Cluster analysis is a class of techniques used to classify objects or cases into relatively
homogeneous groups called clusters. Objects in each cluster tend to be similar to each
other and dissimilar to objects in the other clusters. Cluster analysis is also called
classification analysis, or numerical taxonomy.
• In cluster analysis, there is no a priori information about the group or cluster membership
for any of the objects. Groups or clusters are suggested by the data, not defined a priori.
• Cluster analysis is therefore sometimes referred to as unsupervised classification.
• Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.
Applications of Clustering
• In Search Engines: Search engines also work on the clustering technique. The search result
appears based on the closest object to the search query. It does it by grouping similar data
objects in one group that is far from the other dissimilar objects. The accurate result of a
query depends on the quality of the clustering algorithm used.
• Customer Segmentation: It is used in market research to segment the customers based on
their choice and preferences.
An Ideal Clustering Situation
[Scatter plot of Variable 1 against Variable 2 showing well-separated clusters]
Assumptions
• The sample is representative of the population.
• It is assumed that the variables are not correlated.
• There are no significant outliers.
• The data collected are assumed to be standardized.
Customer Segmentation
• Customer segmentation is the process of dividing customers into groups based on common
characteristics. The most common ways in which businesses segment their customer base are:
1.Demographic information, such as gender, age, familial and marital status, income, education, and
occupation.
2.Geographical information, Examples of segmentation by geography include country, state, city, and
town.
3.Psychographics, such as social class, lifestyle, and personality traits.
4.Behavioral data, such as spending and consumption habits, product/service usage, and desired
benefits.
Clustering with Mall Customer Segmentation Data
• The dataset can be downloaded from the Kaggle website which can be found here.
• The data includes the following features:
Customer ID Gender Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
Business Applications
• When Procter & Gamble test markets a new cosmetic, it may want to group U.S. cities into
groups that are similar on demographic attributes such as percentage of Asians, percentage
of Blacks, percentage of Hispanics, median age, unemployment rate, and median income
level.
• A marketing analyst at Coca-Cola wants to segment the soft drink market based on
consumer preferences for price sensitivity, preference of diet versus regular soda, and
preference of Coke versus Pepsi.
• Microsoft might cluster its corporate customers based on the price a given customer is
willing to pay for a product. For example, there might be a cluster of construction
companies that are willing to pay a lot for Microsoft Project but not so much for Power
Point.
• Samsung segmentation is one of the key elements of the marketing strategy of this
electronics company. The Samsung market segmentation consists of four segmentation
types: Geographic, Demographic, Behavioral, and Psychographic segmentation. Each form
of segmentation is further divided based on certain criteria.
Steps in Conducting Cluster Analysis
Formulate the Problem

Select a Distance Measure

Select a Clustering Procedure

Decide on the Number of Clusters

Interpret and Profile Clusters

Assess the Validity of Clustering


Formulate the Problem
• Perhaps the most important part of formulating the clustering problem is selecting the
variables on which the clustering is based.
• Basically, the set of variables selected should describe the similarity between objects in
terms that are relevant to the marketing research problem.
• The variables should be selected based on past research, theory, or consideration of the
hypotheses being tested.
• Annual Income (k$) and Spending Score (1-100) variables were selected from 200
respondents.
Select a Distance or Similarity Measure
• The most common approach is to measure similarity in terms of distance between pairs of
objects.
• The most commonly used measure of similarity is the Euclidean distance or its square. The
Euclidean distance is the square root of the sum of the squared differences in values for
each variable.
• If the points (x1, y1) and (x2, y2) are in 2-dimensional space, then the Euclidean distance
between them is d = √((x1 − x2)² + (y1 − y2)²).

Example points and their Euclidean distance matrix:
Point  X  Y            P1    P2    P3    P4
P1     0  2      P1    0
P2     2  0      P2    2.8   0
P3     3  1      P3    3.2   1.4   0
P4     5  1      P4    5.1   3.2   2     0
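The same matrix can be obtained in R with the dist() function; a short sketch using the four points above (object names are illustrative):

pts <- data.frame(X = c(0, 2, 3, 5), Y = c(2, 0, 1, 1), row.names = c("P1", "P2", "P3", "P4"))
round(dist(pts, method = "euclidean"), 1)    # reproduces the distance matrix shown above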
• The city-block or Manhattan distance between two objects is the sum of the absolute
differences in values for each variable.

• The Chebychev distance between two objects is the maximum absolute difference in values
for any variable.
• If the variables are measured in vastly different units, the clustering solution will be
influenced by the units of measurement. In these cases, before clustering respondents, we
must standardize the data by rescaling each variable to have a mean of zero and a standard
deviation of unity. It is also desirable to eliminate outliers (cases with typical values).
• Use of different distance measures may lead to different clustering results. Hence, it is
advisable to use different measures and compare the results.
A Classification of Clustering Procedures
• Hierarchical
  • Agglomerative
    • Linkage Methods: Single Linkage, Complete Linkage, Average Linkage
    • Variance Methods: Ward’s Method
    • Centroid Methods
  • Divisive
• Nonhierarchical
  • Sequential Threshold
  • Parallel Threshold
  • Optimizing Partitioning
• Other
  • Two-Step
Clustering Algorithms
• Nonhierarchical clustering: The nonhierarchical clustering methods are frequently referred
to as k-means clustering. These methods include sequential threshold, parallel threshold,
and optimizing partitioning.
• K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. This is a prototype-based, partitional clustering technique that attempts to find
a user-specified number of clusters (K), which are represented by their centroids.
• Hierarchical clustering is characterized by the development of a hierarchy or tree-like
structure. Hierarchical methods can be agglomerative or divisive.
• Agglomerative clustering starts with each object in a
separate cluster. Clusters are formed by grouping objects
into bigger and bigger clusters. This process is continued
until all objects are members of a single cluster.
Agglomerative is a bottom-up approach.
• Divisive clustering starts with all the objects grouped in a
single cluster. Clusters are divided or split until each object is
in a separate cluster. Divisive is a top-down approach.
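The k-means procedure mentioned above is available in base R through kmeans(). A hedged sketch on the two Mall Customers variables used later in this section (file and variable names are taken from the later R example and assumed to be available):

library(readxl)
Mall <- read_excel("Mall_Customers.xlsx")
X <- scale(Mall[, c("Annual Income (k$)", "Spending Score (1-100)")])
set.seed(123)                               # k-means starts from random centroids
km <- kmeans(X, centers = 5, nstart = 25)   # 5 clusters, as in the hierarchical solution later
table(km$cluster)                           # cluster sizes
km$centers                                  # cluster centroids on the standardized scale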
Why hierarchical clustering?
In the K-means clustering, there are some challenges with this algorithm, which are a
predetermined number of clusters, and it always tries to create clusters of the same size. To
solve these two challenges, we can opt for the hierarchical clustering algorithm because, in
this algorithm, we don't need to know the predefined number of clusters.

Agglomerative Hierarchical Clustering


The agglomerative hierarchical clustering algorithm is a popular example of Hierarchical
Clustering. To group the datasets into clusters, it follows the bottom-up approach. It means,
this algorithm considers each data point as a single cluster at the beginning, and then start
combining the closest pair of clusters together. It does this until all the clusters are merged
into a single cluster that contains all the datasets.
Agglomerative Hierarchical clustering
• Agglomerative methods are
commonly used in
marketing research. They
consist of linkage methods,
error sums of squares or
variance methods, and
centroid methods.
▪ Single Linkage
▪ Complete Linkage
▪ Average Linkage
▪ Ward’s Method
▪ Centroid Method
Linkage Methods of Clustering
• Single Linkage: It is the Shortest Distance Single Linkage
between the closest points of the clusters. At
every stage, the distance between two clusters Minimum Distance
is the distance between their two closest points
• Complete Linkage: It is similar to a single Cluster 1 Cluster 2
linkage, except based on the maximum distance Complete Linkage
or the furthest neighbor approach. In complete
linkage, the distance between two clusters is Maximum Distance
calculated as the distance between their two
furthest points.
Cluster 1
• Average Linkage: It is the linkage method in Average Linkage Cluster 2
which the distance between each pair of
datasets is added up and then divided by the
total number of datasets to calculate the
average distance between two clusters. It is also
Average Distance
one of the most popular linkage methods. Cluster 1 Cluster 2
Linkage Methods: a worked example
Point coordinates: P1(0, 2), P2(2, 0), P3(3, 1), P4(5, 1)

Euclidean distance matrix
      P1    P2    P3    P4
P1    0
P2    2.8   0
P3    3.2   1.4   0
P4    5.1   3.2   2     0

After merging the closest pair (P2 and P3), the distance from each remaining point to the new
cluster depends on the linkage method:
• Single linkage:   d(P1, {P2,P3}) = min(2.8, 3.2) = 2.8;  d(P4, {P2,P3}) = min(3.2, 2) = 2
• Complete linkage: d(P1, {P2,P3}) = max(2.8, 3.2) = 3.2;  d(P4, {P2,P3}) = max(3.2, 2) = 3.2
• Average linkage:  d(P1, {P2,P3}) = (2.8 + 3.2)/2 = 3;    d(P4, {P2,P3}) = (3.2 + 2)/2 = 2.6
(d(P1, P4) = 5.1 in all three cases.)
• The variance methods attempt to generate clusters to
minimize the within-cluster variance. A commonly
used variance method is the Ward's procedure.
• Ward’s Method: For each cluster, the means for all
the variables are computed. Then, for each object,
the squared Euclidean distance to the cluster means
is calculated. These distances are summed for all the
objects. At each stage, the two clusters with the
smallest increase in the overall sum of squares within
cluster distances are combined.
Centroid Method: In the centroid methods, the
distance between two clusters is the distance
between their centroids (means for all the
variables.) Every time objects are grouped, a new
centroid is computed.

• Of the hierarchical methods, Average linkage and Ward's methods have been shown to
perform better than the other procedures.
Dendrogram in Hierarchical clustering (Single Linkage)
A Dendrogram is a tree-like diagram that records the sequences of merges or splits.

Point coordinates: P1(0, 2), P2(2, 0), P3(3, 1), P4(5, 1)

Euclidean distance matrix
      P1    P2    P3    P4
P1    0
P2    2.8   0
P3    3.2   1.4   0
P4    5.1   3.2   2     0

Single linkage after merging P2 and P3:
         P1    P2&P3   P4
P1       0
P2&P3    2.8   0
P4       5.1   2       0

Single linkage after merging P4 into {P2, P3}:
              P1    P2,P3&P4
P1            0
P2,P3&P4      2.8   0
Dendrogram in Hierarchical clustering
• A Dendrogram is a tree-like diagram that records the sequences of merges or splits.
• The dendrogram is a tree-like structure that is mainly used to store each step as a memory
that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean
distances between the data points, and the x-axis shows all the data points of the given
dataset.
• The term hierarchical refers to the fact that clusters obtained by cutting the dendrogram at
a given height are necessarily nested within the clusters obtained by cutting the
dendrogram at any greater height.
• The number of clusters will be the number of vertical lines which are being intersected by
the line drawn using the threshold(Generally, we try to set the threshold in such a way that
it cuts the tallest vertical line).
• More the distance of the vertical lines in the dendrogram, more the distance between those
clusters.
How the Agglomerative Hierarchical clustering Work?
The primary objective of cluster analysis is to define the structure of the data by placing the most
similar observations in a group. To accomplish this task, we must address three basic questions.
1. How do we measure similarity?
2. How do we form clusters?
3. How many clusters do we form?
• Step-1: Create each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.
• Step-2: Take two closest data points or clusters and merge them to form one cluster. So,
there will now be N-1 clusters. (Using Euclidean distance and linkage methods)
• Step-3: Again, take the two closest clusters and merge them together to form one cluster.
There will be N-2 clusters.
• Step-4: Repeat Step 3 until only one cluster left.
• Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to
divide the clusters as per the problem.
Step-1: Create each data point as a single
cluster. Let's say there are N data points, so
the number of clusters will also be N.

Step-2: Take two closest data points or


clusters and merge them to form one cluster.
So, there will now be N-1 clusters.
Step-3: Again, take the two closest clusters and
merge them together to form one cluster. There
will be N-2 clusters.

Step-4: Repeat Step 3 until only one cluster left. So, we will get the following clusters.
Consider the below images:
Step-5: Develop the
dendrogram to divide the
clusters as per the problem.

• In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.
• Firstly, the datapoints P2 and P3 combine together and form a cluster, correspondingly a
dendrogram is created, which connects P2 and P3 with a rectangular shape. The hight is
decided according to the Euclidean distance between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is
higher than of previous, as the Euclidean distance between P5 and P6 is a little bit greater than
the P2 and P3.
• Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and
P4, P5, and P6, in another dendrogram.
• At last, the final dendrogram is created that combines all the data points together.
Decide on the Number of Clusters
• Theoretical, conceptual, or practical considerations may suggest a certain number of clusters.
• In hierarchical clustering, the distances at which clusters are combined can be used as criteria.
This information can be obtained from the agglomeration schedule or from the dendrogram.
• In nonhierarchical clustering, the ratio of total within-group variance to between-group variance
can be plotted against the number of clusters. The point at which an elbow or a sharp bend
occurs indicates an appropriate number of clusters.
• The relative sizes of the clusters should be meaningful.
• # Fitting Hierarchical Clustering to the dataset
• HC= hclust(d = dist(Mall_Customers, method = 'euclidean'),
method = 'ward.D')
• HC
• Cluster method : ward.D
• Distance : euclidean
• Number of objects: 200
• # Using the dendrogram to find the optimal number of clusters
• plot(HC)
Interpreting and Profiling the Clusters
• Interpreting and profiling clusters involves examining the cluster centroids. The centroids
enable us to describe each cluster by assigning it a name or label.
• It is often helpful to profile the clusters in terms of variables that were not used for clustering.
These may include demographic, psychographic, product usage, media usage, or other variables.
For the Mall Customers data, the five clusters can be labelled:
1. Careless
2. Sensible
3. Standard
4. Careful
5. Target Customers
[Cluster plot of Annual Income vs Spending Score showing the five labelled segments]
Assess Reliability and Validity
• Perform cluster analysis on the same data using different distance measures. Compare the
results across measures to determine the stability of the solutions.
• Use different methods of clustering and compare the results.
• Split the data randomly into halves. Perform clustering separately on each half. Compare
cluster centroids across the two subsamples.
• Delete variables randomly. Perform clustering based on the reduced set of variables.
Compare the results with those obtained by clustering based on the entire set of variables.
• In nonhierarchical clustering, the solution may depend on the order of cases in the data set.
Make multiple runs using different order of cases until the solution stabilizes.

Reference:
Malhotra Naresh, K. and Dash, S. (2015) Marketing Research, An Applied Orientation. 7th Edition, Pearson, India.
Hierarchical Clustering with R
# Importing the dataset
library(readxl)
Mall_Customers <- read_excel("Mall_Customers.xlsx")
attach(Mall_Customers)
View(Mall_Customers)
head(Mall_Customers)
library(dplyr)
Mall_Customers = select(Mall_Customers, `Annual Income (k$)`, `Spending Score (1-100)`)
head(Mall_Customers)
Mall_Customers = scale(Mall_Customers, center = TRUE, scale = TRUE)

# Fitting Hierarchical Clustering to the dataset
HC = hclust(d = dist(Mall_Customers, method = 'euclidean'), method = 'ward.D')
HC
# Using the dendrogram to find the optimal number of clusters
plot(HC)

# Fitting Hierarchical Clustering to the dataset using 5 clusters
HC = cutree(HC, 5)
HC
table(HC)

# Visualising the clusters
library(cluster)
clusplot(Mall_Customers, HC, lines = 0, shade = TRUE, color = TRUE, labels = 2,
         plotchar = FALSE, span = TRUE, main = paste('Clusters'),
         xlab = 'Annual Income', ylab = 'Spending Score')
Thank You
Linear Programming

Ramesh Kandela
[email protected]
Optimization Models
Whenever there is a hard job to be done I assign it to a lazy man; he is sure to
find an easy way of doing it. – Walter Chrysler
• Prescriptive analytics is the highest level of analytics capability which is used for choosing
optimal actions once an organization gains insights through descriptive and predictive
analytics.
• Optimization is a precise procedure using design constraints and criteria to enable the
planner to find the optimal solution.
• An optimization model is a translation of the key characteristics of the business problem you
are trying to solve. The model consists of three elements: the objective function, decision
variables and business constraints.
▪Linear Programming
▪Transportation & Assignment Problems
▪Decision Analysis
▪Markov Models
▪Simulation
Linear Programming
Linear Programming is a mathematical technique useful for allocation of scarce or limited
resources (such as labour, material, machine, time, warehouse space, capital, energy, etc.), to
several competing activities (such as products, services, jobs, new equipment, projects, etc.)
on the basis of given criterion of optimality.

• First conceived by George B. Dantzig around 1947.
• Dantzig's first paper was titled "Programming in a Linear Structure".
• Koopmans coined the term "Linear Programming" in 1948.
• The simplex method was published by Dantzig in 1949.

Properties of Linear Programming Model
Basic Components of an LP
1.Decision Variables
describe our choices that are under our control;
2.Objective Function
describes a criterion that we wish to minimize (e.g., cost) or maximize (e.g., profit);
3.Constraints
describe the limitations that restrict our choices for decision variables.
4. Non-negativity
the variables of linear programs must always take non-negative values (i.e., they must be
greater than or equal to zero).
Assumptions of an LP Model
1.Certainty:
In LP models, it is assumed that all its parameters such as: availability of resources, profit (or cost) contribution per unit of
decision variable and consumption of resources per unit of decision variable must be known and constant.

2. Additivity:
The value of the objective function and the total amount of each resource used (or supplied), must be equal to the sum of
the respective individual contribution (profit or cost) of the decision variables. For example, the total profit earned from the
sale of two products A and B must be equal to the sum of the profits earned separately from A and B. Similarly, the amount
of a resource consumed for producing A and B must be equal to the total sum of resources used for A and B individually.

3. Linearity (or proportionality):


The amount of each resource used (or supplied) and its contribution to the profit (or cost) in objective function must be
proportional to the value of each decision variable. For example, if production of one unit of a product uses 5 hours of a
particular resource, then making 3 units of that product uses 3×5 = 15 hours of that resource.

4. Divisibility (or continuity):


The solution values of decision variables are allowed to assume continuous values. For instance, it is possible to collect 6.254
thousand liters of milk by a milk dairy and such variables are divisible. But, it is not desirable to produce 2.5 machines and
such variables are not divisible and therefore must be assigned integer values. Hence, if any of the variables can assume only
integer values, or is limited to a discrete number of values, the LP model is no longer applicable.
General Mathematical Model of Linear Programming Problem
The general linear programming problem (or model) with n decision variables and m
constraints can be stated in the following form:
Optimize (maximize or minimize) Z = c1x1 + c2x2 + . . . + cnxn
subject to the constraints
ai1x1 + ai2x2 + . . . + ainxn (≤, =, ≥) bi,  i = 1, 2, . . . , m
and x1, x2, . . . , xn ≥ 0
• cj s are coefficients representing the per unit profit (or cost) of decision variable xj to the value of
objective function.
• The aij’s are referred as technological coefficients (or input-output coefficients).These represent the
amount of resource, say i consumed per unit of variable (activity) xj. These coefficients can be
positive, negative or zero.
• The bi represents the total availability of the ith resource.
Mathematical Formulation of LP Model
The term formulation refers to the process of converting the verbal description and
numerical data into mathematical expressions, which represents the relationship among
relevant decision variables (or factors), objective and restrictions (constraints) on the use of
scarce resources to several competing activities on the basis of a given criterion of optimality.
• Step1: Study the given situation, find the key decision to be made. Hence, identify the
decision variables of the problem.
• Step2: Formulate the objective function to be optimized.
• Step3: Formulate the constraints of the problem.
• Step4: Add non-negativity restrictions.
The objective function, the set of constraints, and, the non-negativity restrictions together
form an LP model.
Examples of LP Model Formulation
A company manufactures two types of products, A and B, and sells them at a profit of
Rs. 4 per unit on type A and Rs. 5 per unit on type B. Each product is processed on two
machines, X and Y. Type A requires 2 minutes of processing time on X and 3 minutes on Y;
type B requires 2 minutes on X and 2 minutes on Y. Machine X is available for 5 hours
30 minutes and machine Y for 8 hours during any working day. Formulate the problem as
an LP problem.
Suppose an industry is manufacturing two types of products P1 and P2. The
profits per Kg of the two products are Rs.30 and Rs.40 respectively. These two
products require processing in three types of machines. The following table
shows the available machine hours per day and the time required on each
machine to produce one Kg of P1 and P2. Formulate the problem in the form
of linear programming model.
Machine    Hours per Kg of P1    Hours per Kg of P2    Total Available Machine hours/day
1                  3                     2                         600
2                  3                     5                         800
3                  5                     6                        1100
Solution:
The procedure for linear programming problem formulation is as follows:
• Introduce the decision variable as follows:
• Let x1 = amount of P1 and x2 = amount of P2
• In order to maximize profits, we establish the objective function as 30x1 + 40x2
• Since one Kg of P1 requires 3 hours of processing time in machine 1 while the corresponding requirement of P2
is 2 hours. So, the first constraint can be expressed as 3x1 + 2x2 ≤ 600
• Similarly, corresponding to machine 2 and 3 the constraints are
• 3x1 + 5x2 ≤ 800
• 5x1 + 6x2 ≤ 1100
• In addition to the above there is no negative production, represented algebraically as x1 ≥ 0 ; x2 ≥ 0
Thus, the product mix problem in the linear programming model is as follows:
• Maximize 30x1 + 40x2
• Subject to:
• 3x1 + 2x2 ≤ 600
• 3x1 + 5x2 ≤ 800
• 5x1 + 6x2 ≤ 1100
• x1≥ 0, x2 ≥ 0
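This model can also be solved in software. A minimal sketch using the lpSolve package (an assumption; the package is not used elsewhere in these notes) is shown below; it should report x1 = 100, x2 = 100 with maximum profit Z = 7000.
# install.packages('lpSolve') if needed
library(lpSolve)
obj <- c(30, 40)                              # profit per kg of P1 and P2
A   <- rbind(c(3, 2), c(3, 5), c(5, 6))       # machine hours per kg on machines 1-3
rhs <- c(600, 800, 1100)                      # available machine hours per day
sol <- lp(direction = 'max', objective.in = obj,
          const.mat = A, const.dir = rep('<=', 3), const.rhs = rhs)
sol$solution    # optimal values of x1 and x2
sol$objval      # maximum total profit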
4.Reddy Mikks produces both interior and exterior paints from two raw materials, M1and M2.
The following table provides the basic data of the problem:
                           Tons of raw material per ton of
                           Exterior paint   Interior paint   Maximum daily availability (tons)
Raw material, M1                 6                4                        24
Raw material, M2                 1                2                         6
Profit per ton ($1000)           5                4
A market survey indicates that the daily demand for interior paint cannot exceed that for
exterior paint by more than 1 ton. Also, the maximum daily demand for interior paint is 2 tons.
Reddy Mikks wants to determine the optimum (best) product mix of interior and exterior
paints that maximizes the total daily profit. Maximize Z = 5x1 + 4x2
subject to the constraints
(i) 6x1 + 4x2≤24
(ii) x1+2x2 ≤ 6
(iii) -x1+x2 ≤ 1
(iv) x2 ≤2
and x1, x2 ≥ 0.
                Product 1    Product 2    Resource Availability
Material 1          1            1                  5
Material 2          3            2                 12
Profit/Unit         6            5
• LP formulation
• x1 : the amount of product type 1 produced (units)
• x2 : the amount of product type 2 produced (units)
Total profit:      Maximize Z = 6x1 + 5x2
Resource A limit:  x1 + x2 ≤ 5
Resource B limit:  3x1 + 2x2 ≤ 12
Non-negativity:    x1, x2 ≥ 0
Linear Programming: Solution
Two basic solution approaches of linear programming exist.
1.The graphical Method
Simple, but limited to two decision variables.
2.The simplex method
More complex, but solves multiple decision variable problems.
Graphical Method
For LP problem that have only two variables, it is possible that the entire set of feasible
solutions can be displayed graphically by plotting linear constraints on graph paper in order
to locate the best(optimal) solution. The technique used to identify the optimal solution is
called the graphical solution method for an LP problem with two variables.
Optimal solution to an LP problem is obtained by
(i) Extreme (corner) point, and
(ii) Iso-profit (cost) function line method
Extreme (corner) point method
Step1: Plot constraints on graph paper and decide the feasible region
• Replace the inequality sign in each constraint by an equality sign
• Draw the straight line on the graph paper and decide each time the area of feasible solutions
according to inequality sign of the constraint.
• ≤ then shaded area is below the line
• ≥ then shaded area is above the line
• Shade the common portion of the graph paper that satisfies all the constraints simultaneously drawn
so far. The final shaded area is called the feasible region of the given LP problem. Any point inside this
region is called the feasible solution and this provide values of x1 and x2 that satisfies the all the
constraints.
Step2: Examine extreme points of the feasible region to find an optimal solution

• Determine the coordinates of each extreme point of the feasible region.


• Compute and compare the value of the objective function at each extreme point.
• Identify the extreme point that gives the optimal (maximum or minimum) value of the objective function.
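A minimal R sketch (not in the original notes) of this corner-point evaluation for the two-variable product-mix example introduced above (Max Z = 6x1 + 5x2, subject to x1 + x2 ≤ 5, 3x1 + 2x2 ≤ 12, x1, x2 ≥ 0), whose graphical solution follows on the next slides; the corner labels are assumed to match the slide's O, A, B, C:
# Extreme points of the feasible region and the objective value at each
corners <- rbind(O = c(0, 0), A = c(4, 0), B = c(2, 3), C = c(0, 5))
z <- 6 * corners[, 1] + 5 * corners[, 2]
cbind(corners, z)              # z = 0, 24, 27, 25
corners[which.max(z), ]        # optimal corner B = (2, 3), with Max Z = 27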
Graphical Solution

• Start with Non-negativity constraints!


• Assume equality of the constraint (𝑥1 = 0 or 𝑥2 = 0 )
• Then, find the region defined by the constraint

• Resource A limit constraint


• Draw 𝑥1 + 𝑥2 = 5
• 𝑖𝑓 𝑥1 = 0 then 𝑥2 = 5
• 𝑖𝑓 𝑥2 = 0 then 𝑥1 = 5
Find the region where 𝑥1 + 𝑥2 ≤ 5
• (0 ,5 ) and (5,0 )
Graphical Solution

• Resource B limit constraint


• Draw 3𝑥1 + 2𝑥2 = 12
• Find the region where
3𝑥1 + 2𝑥2 ≤ 12
Graphical Solution

• Define the feasible region


• It is the region defined by the
intersection of all of your
constraints
• It includes all of the feasible
solutions to LP
Graphical Solution
• Since the objective function Z is to be maximized, we conclude that the maximum value of
Z = 27 is achieved at the extreme point B (2, 3) of the feasible region with corner points O, A, B and C.
• Hence the optimal solution to the given LP problem is: x1 = 2, x2 = 3 and Max Z = 27.
Use the graphical method to solve the following LP problem.
Maximize Z = 5x1 + 4x2
subject to the constraints
(i) 6x1 + 4x2 ≤ 24
(ii) x1 + 2x2 ≤ 6
(iii) -x1 + x2 ≤ 1
(iv) x2 ≤ 2
and x1, x2 ≥ 0.
Optimal solution: x1 = 3, x2 = 1.5 and Max Z = 21.
Use the graphical method to solve the
following LP problem.
Maximize Z = 15x1 + 10x2
subject to the constraints
(i) 4x1 + 6x2≤360, (ii) 3x1 ≤ 180, (iii) 5x2 ≤ 200
and x1, x2 ≥ 0.
• Solution: Treat x1 as the horizontal axis and x2 as the vertical axis.
Plot each constraint on the graph by treating it as a linear equation.

• Consider the first constraint 4x1 + 6x2 ≤ 360. Treat this as the equation 4x1 + 6x2 = 360. For this find any two
points that satisfy the equation and then draw a straight line through them. The two points are generally the
points at which the line intersects the x1 and x2 axes. For example, when x1 = 0 we get 6x2 = 360 or x2 = 60.
Similarly, when x2 = 0, 4x1 = 360, x1 = 90. These two points are then connected by a straight line as shown in
Fig.
• Similarly, the constraints 3x1 ≤ 180 and 5x2 ≤ 200 are also plotted on the graph and are indicated by the shaded
area as shown in Fig.
• Since all constraints have been graphed, the area which is bounded by all the constraints lines including all the
boundary points is called the feasible region (or solution space). The feasible region is shown in Fig. by the
shaded area OABCD.
• Since the optimal value of the objective function occurs at
one of the extreme points of the feasible region, it is
necessary to determine their coordinates. The coordinates
of extreme points of the feasible region are: O = (0, 0), A =
(60, 0), B = (60, 20), C = (30, 40), D = (0, 40).
• Evaluate objective function value at each extreme point
of the feasible region as shown in the Table

Since objective function Z is to be maximized, from Table we conclude that maximum value of Z = 1,100 is
achieved at the point extreme B (60, 20). Hence the optimal solution to the given LP problem is: x1 = 60, x2 =
20 and Max Z = 1,100.
Minimization LP Problem
• Use the graphical method to solve the following LP problem.
Minimize Z = 3x1 + 2x2
subject to the constraints
(i) 5x1 + x2 ≥ 10, (ii) x1 + x2 ≥ 6, (iii) x1 + 4x2 ≥ 12 and x1, x2 ≥ 0.
The minimum (optimal) value of the objective function Z = 13 occurs at the extreme point C
(1, 5). Hence, the optimal solution to the given LP problem is: x1 = 1, x2 = 5, and Min Z = 13.
Slack and Surplus Variables
• Slack variable represents an unused quantity of resource; it is added to less-than or equal-
to type constraints in order to get an equality constraint.
• Surplus variable represents the amount of resource usage above the minimum required
and is subtracted to greater-than or equal-to constraints in order to get equality
constraint.
• A linear program in which all the variables are non-negative and all the constraints are
equalities is said to be in standard form.
• Slack and surplus variables represent the difference between the left and right sides of the
constraints.
• Slack and surplus variables have objective function coefficients equal to 0.
Slack Variables (for ≤ constraints)
Max 15x1 + 10x2 + 0s1 + 0s2 + 0s3
s.t. 4x1 + 6x2 + s1 = 360
     3x1 + s2 = 180
     5x2 + s3 = 200
     x1, x2, s1, s2, s3 ≥ 0
s1, s2 and s3 are slack variables.
The optimal solution to this LP problem is: x1 = 60, x2 = 20 and Max Z = 1,100.
Constraint            Value of slack variable
4x1 + 6x2 ≤ 360                 0
3x1 ≤ 180                       0
5x2 ≤ 200                     100

Surplus Variables (for ≥ constraints)
Min 3x1 + 2x2 + 0s1 + 0s2 + 0s3
s.t. 5x1 + x2 - s1 = 10
     x1 + x2 - s2 = 6
     x1 + 4x2 - s3 = 12
     x1, x2, s1, s2, s3 ≥ 0
s1, s2 and s3 are surplus variables.
The optimal solution to this LP problem is: x1 = 1, x2 = 5 and Min Z = 13.
Constraint            Value of surplus variable
5x1 + x2 ≥ 10                   0
x1 + x2 ≥ 6                     0
x1 + 4x2 ≥ 12                   9
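A small R sketch (not in the original notes) that recovers the slack and surplus values above directly from the optimal solutions:
# Slack for the maximization problem at x1 = 60, x2 = 20
A_max <- rbind(c(4, 6), c(3, 0), c(0, 5)); rhs_max <- c(360, 180, 200)
rhs_max - as.vector(A_max %*% c(60, 20))     # slack values: 0, 0, 100
# Surplus for the minimization problem at x1 = 1, x2 = 5
A_min <- rbind(c(5, 1), c(1, 1), c(1, 4)); rhs_min <- c(10, 6, 12)
as.vector(A_min %*% c(1, 5)) - rhs_min       # surplus values: 0, 0, 9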
Iso-profit (Cost) Function Line Method
Iso-profit (or cost)function line is a straight line that represents all nonnegative combinations
of x1 and x2 variable values for a particular profit (or cost) level
Step 1: Identify the feasible region and extreme points of the feasible region.
Step 2: Draw an iso-profit (iso-cost) line for an arbitrary but small value of the objective function
without violating any of the constraints of the given LP problem. However, it is simple to pick a value
that gives an integer value to x1 when we set x2 = 0 and vice-versa. A good choice is to use a number
that is divisible by the coefficients of both variables.
Step 3: Move iso-profit (iso-cost) lines parallel in the direction of increasing (decreasing) objective
function values. The farthest iso-profit line may intersect only at one corner point of feasible region
providing a single optimal solution. Also, this line may coincide with one of the boundary lines of the
feasible area. Then at least two optimal solutions must lie on two adjoining corners and others will lie
on the boundary connecting them. However, if the iso-profit line goes on without limit from the
constraints, then an unbounded solution would exist. This usually indicates that an error has been
made in formulating the LP model.
Step 4: An extreme (corner) point touched by an iso-profit (or cost) line is considered as the optimal
solution point. The coordinates of this extreme point give the value of the objective function.
Iso-profit (Cost) Function Line Method

The coordinates x1 = 60 and x2 = 20 of corner point B satisfy the given constraints and the
total profit obtained is Z = 1,100
Graphical Method -Special Cases
• Alternative (or Multiple) Optimal Solutions
• Infeasible Solution
• Unbounded Solution

Use the graphical method to solve the following LP problems.


Maximize Z = 10x1 + 6x2
subject to the constraints
(i) 5x1 + 3x2≤30, (ii) x1 +2x2 ≤ 18 and x1, x2 ≥ 0.

Maximize Z = 6x1 - 4x2
subject to the constraints
(i) 2x1 + 4x2 ≤ 4 (ii) 4x1 + 8x2 ≥ 16 and x1, x2 ≥ 0.

Maximize Z = 40x1 + 60x2
subject to
(i) 2x1 + x2 ≥ 70 (ii) x1 + x2 ≥ 40 (iii) x1 + 3x2 ≥ 90 and x1, x2 ≥ 0
Alternative (or Multiple) Optimal Solutions
• LP problem may have more than one solution yielding the same optimal objective function
value. Each of such optimal solutions is termed as alternative optimal solution.

• Since value (maximum) of objective function, Z = 60 at two different extreme points B and
C is same, therefore two alternative solutions: x1 = 6/7, x2 = 60/7 and x1= 6, x2 = 0 exist
Infeasible Solution
• An infeasible solution to an LP problem arises when there is no solution that satisfies all the
constraints simultaneously. This happens when there is no common feasible region, i.e.
when the LP model has conflicting constraints.

The constraints are plotted on graph as usual as shown in above Figure. Since there is no
unique feasible solution space, therefore a unique set of values of variables x1 and x2 that
satisfy all the constraints cannot be determined. Hence, there is no feasible solution to this LP
problem because of the conflicting constraints.
Unbounded Solution
Sometimes an LP problem may have an infinite solution. Such a solution is referred as an
unbounded solution. It happens when value of certain decision variables and the value of the
objective function (maximization case) are permitted to increase infinitely, without violating
the feasibility condition.
Maximize Z = 40x1 + 60x2
subject to
2x1 + x2 ≥ 70
x1 + x2 ≥ 40
x1 + 3x2 ≥ 90
x1, x2 ≥ 0
• The point (x1, x2) must be somewhere in the solution space as shown in the figure by shaded portion.
• The three extreme points (corner points) in the finite plane are:
P = (90, 0); Q = (24, 22) and R = (0, 70) The values of the objective function at these extreme points are: Z(P)
= 3600, Z(Q) = 2280 and Z(R) = 4200.
• In this case, no maximum of the objective function exists because the region has no boundary for increasing
values of x1 and x2. Thus, it is not possible to maximize the objective function in this case and the solution is
unbounded.
Find the optimal solution using graphical method
                      Tons required per ton of product      Resource Available
                      Fuel Additive    Solvent Base         for Production
Material 1 (tons)          0.4              0.5                 20 tons
Material 2 (tons)          0                0.2                  5 tons
Material 3 (tons)          0.6              0.3                 21 tons
Profit/ton                 40               30
Max 40F+30S
S.t
0.4F+0.5S ≤ 20
0.2S ≤ 5
0.6F+0.3S ≤ 21
F,S>0
• The objective function 40F+30S takes on its
maximum value at the extreme point F = 25 and S =
20. Thus, F =25 and S = 20 is the optimal solution
and z=1600 is the value of the optimal solution.
LPP with Excel Solver (click the Solver button on the Excel Data tab)
Model:  Max 40F + 30S
        s.t. 0.4F + 0.5S ≤ 20
             0.2S ≤ 5
             0.6F + 0.3S ≤ 21
             F, S ≥ 0
Worksheet layout (labels in column A, coefficients in columns B and C, formulas in column D):
Row 6   Variables        F      S      Max Z
Row 7   Optimal Sol                    =SUMPRODUCT(B8:C8,B7:C7)
Row 8   Obj Fun Coeff    40     30
Row 10  S.t                            LHS                                     RHS
Row 11  Constraint 1     0.4    0.5    =SUMPRODUCT(B11:C11,$B$7:$C$7)   <=    20
Row 12  Constraint 2     0      0.2    =SUMPRODUCT(B12:C12,$B$7:$C$7)   <=    5
Row 13  Constraint 3     0.6    0.3    =SUMPRODUCT(B13:C13,$B$7:$C$7)   <=    21
Sensitivity Analysis
Introduction to Sensitivity Analysis
• Sensitivity analysis (or post-optimality analysis) is used to determine how the optimal
solution is affected by changes, within specified ranges, in:
• the objective function coefficients ( Unit profit or unit cost)
• the right-hand side (RHS) values (The Availability of resources)

Max 40F+30S Max 40F+30S+50C


S.t S.t
.4F+.5S ≤ 20 .4F+.5S+.6C ≤ 20
.2S ≤ 5 .2S +.1C≤ 5
.6F+.3S ≤ 21 .6F+.3S+.3C ≤ 21
F,S>0 F,S>0
Changes in the Objective Function Coefficients
• The range of optimality for each coefficient provides the range of values over which the
current solution will remain optimal.

• Now suppose RMC learns that a price reduction in the fuel additive has reduced its profit
contribution to $30 per ton.
Max 30F+30S
S.t
.4F+.5S ≤ 20
.2S ≤ 5
.6F+.3S ≤ 21
F,S>0
The total profit contribution decreased to 30(25)
+30(20)=1350, the decrease in the profit contribution for
the fuel additive from $40 per ton to $30 per ton does
not change the optimal solution F=25 and S =20.
Changes in the Objective Function Coefficients
• Decreasing the profit contribution for the fuel additive to
$20 per ton changes the optimal solution. The solution F
=25 tons and S =20 tons is no longer optimal.
Sensitivity Report (Range of Optimality for c1 and c2)

Simultaneous Changes
100% Rule for objective function coefficients
The 100% rule states that simultaneous changes in objective function coefficients will not
change the optimal solution as long as the sum of the percentages of the change divided by
the corresponding maximum allowable change in the range of optimality for each coefficient
does not exceed 100%.
Change in the Right-Hand Sides
• Let us consider how a change in the right-hand side for a constraint might affect the feasible
region and perhaps cause a change in the optimal solution.
• The improvement in the value of the optimal solution per unit increase in the right-hand
side is called the shadow price.
Max 40F+30S
S.t 0.4F+0.5S ≤ 20
0.2S ≤ 5
0.6F+0.3S ≤ 25.5
F,S>0
• The additional 4.5 tons of material 3 in the revised problem provides a new optimal
solution and increases the value of the optimal solution by $1800 - $1600 = $200. On a
per-ton basis, the additional 4.5 tons of material 3 increases the value of the optimal
solution at the rate of $200/4.5 = $44.44 per ton.
Shadow Price (Dual Price)
• The shadow price is the change in the optimal objective function value per unit increase in
the right-hand side of a constraint.
• Hence, the shadow price for the material 3 constraint is $44.44 per ton.
• The shadow price for a nonbinding constraint is 0.
• A negative shadow price indicates that the objective function will not improve if the RHS is
increased
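For reference, a hedged R sketch (not part of the original notes, which use Excel Solver) showing how the lpSolve package can return the same kind of sensitivity information for the RMC model when compute.sens = 1:
library(lpSolve)
sol <- lp('max', c(40, 30), rbind(c(0.4, 0.5), c(0, 0.2), c(0.6, 0.3)),
          rep('<=', 3), c(20, 5, 21), compute.sens = 1)
sol$duals                               # dual (shadow) prices of the constraints, followed by reduced costs
sol$duals.from; sol$duals.to            # right-hand-side ranges over which the duals remain valid
sol$sens.coef.from; sol$sens.coef.to    # ranges of optimality for the objective function coefficients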

Simultaneous Changes
100% Rule for right-hand sides
For all right-hand sides that are changed, sum the percentages of the allowable increases
and the allowable decreases used. If the sum of the percentages is less than or equal to
100%, the shadow prices do not change.
Sensitivity Analysis with Excel Solver
Before you click OK, select Sensitivity from the Reports section.
Thank You
Transportation & Assignment
Problems
Ramesh Kandela
[email protected]
Network Models
• A network model is one which can be represented by a set
of nodes, a set of arcs, and functions (e.g. costs, supplies,
demands, etc.) associated with the arcs and/or nodes.
• Transportation, Assignment, transshipment, shortest-
route, maximal flow and PERT/CPM problems are all
examples of network problems.
• For each of the problems, if the right-hand side of the
linear programming formulations are all integers, the
optimal solution will be in terms of integer values for the
decision variables.
Transportation Problem
Transportation Problem

Supply Demand
Minimize
locations locations
the Cost
(Origins) (Destinations)

• The transportation problem arises frequently in planning for the distribution of goods
and services from several supply locations to several demand locations. The usual
objective in a transportation problem is to minimize the cost of shipping goods from the
origins to the destinations.
Transportation Problem
Linear Programming Formulation (with a network representation of supply and demand nodes)
Min Σi Σj cij xij                 (summed over i = 1, ..., m and j = 1, ..., n)
s.t. Σj xij ≤ si    i = 1, 2, ..., m    (Supply)
     Σi xij = dj    j = 1, 2, ..., n    (Demand)
     xij ≥ 0 for all i and j
where
xij = number of units shipped from origin i to destination j
cij = cost per unit of shipping from origin i to destination j
si = supply or capacity in units at origin i
dj = demand in units at destination j
(The network diagram connects each source node i, with supply si, to each destination node j, with demand dj, by an arc of cost cij.)
Transportation Problem
A transportation problem faced by Foster Generators. This problem involves the transportation
of a product from three plants to four distribution centers. Foster Generators operates plants
in Cleveland, Ohio; Bedford, Indiana; and York, Pennsylvania. Production capacities over the
next three-month planning period for one particular type of generator are as follows:
Transportation cost per unit, Supply and Demand for the foster generators transportation problem
Destination
Origin Boston Chicago St. Louis Lexington Supply
Cleveland 3 2 7 6 5000
Bedford 7 5 2 3 6000
York 2 5 4 5 2500
Demand 6000 4000 2000 1500 13500
Develop a network representation of the distribution system (transportation problem).

Develop a linear programming model for this problem.


Network Representation
Transportation Problem: LP Formulation
• A linear programming model can be used to solve this transportation problem. We use
double-subscripted decision variables, with x11 denoting the number of units shipped from
origin 1 (Cleveland) to destination 1 (Boston), x12 denoting the number of units shipped
from origin 1 (Cleveland) to destination 2 (Chicago), and so on. In general, the decision
variables for a transportation problem having m origins and n destinations are written as
follows:
• xij = number of units shipped from origin i to destination j
• where i = 1, 2, . . . , m and j = 1, 2, . . . , n
• Transportation costs for units shipped from Cleveland = 3x11 + 2x12 + 7x13 + 6x14
• Transportation costs for units shipped from Bedford = 7x21 + 5x22 + 2x23 + 3x24
• Transportation costs for units shipped from York = 2x31 + 5x32 + 4x33 + 5x34
Transportation Problem: LP Formulation
Combining the objective function and constraints into one model provides a 12-variable, 7-
constraint linear programming formulation of the Foster Generators transportation problem:
Formulate this transportation problem as an LP model to minimize the total transportation cost.
• A company has three production facilities S1, S2 and S3 with production capacities of 7, 9 and 18 units (in 100s) per week of a
product, respectively. These units are to be shipped to four warehouses D1, D2, D3 and D4 with requirements of 5, 8, 7 and 14
units (in 100s) per week, respectively. The transportation costs (in rupees) per unit between factories and warehouses are given
in the table below:
                      Destination
Origin      D1     D2     D3     D4     Supply
S1          19     30     50     10        7
S2          70     30     40     60        9
S3          40      8     70     20       18
Demand       5      8      7     14       34
Let xij = number of units of the product to be transported from production facility i (i = 1, 2, 3) to warehouse j (j = 1, 2, 3, 4).
In the LP model there are m × n = 3 × 4 = 12 decision variables xij and m + n = 7 constraints, where m is the number of rows
and n is the number of columns in a general transportation table.
Model formulation
Minimize (total transportation cost) Z = 19x11 + 30x12 + 50x13 + 10x14 + 70x21 + 30x22 + 40x23 + 60x24 +
40x31 + 8x32 + 70x33 + 20x34
Subject to the constraints
(Supply)  x11 + x12 + x13 + x14 ≤ 7;  x21 + x22 + x23 + x24 ≤ 9;  x31 + x32 + x33 + x34 ≤ 18
(Demand)  x11 + x21 + x31 = 5;  x12 + x22 + x32 = 8;  x13 + x23 + x33 = 7;  x14 + x24 + x34 = 14
and xij ≥ 0 for i = 1, 2, 3 and j = 1, 2, 3, 4.
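A minimal R sketch (not in the original notes) using the lp.transport function from the lpSolve package to solve the model above:
library(lpSolve)
cost <- rbind(c(19, 30, 50, 10),
              c(70, 30, 40, 60),
              c(40,  8, 70, 20))
sol <- lp.transport(cost, direction = 'min',
                    row.signs = rep('<=', 3), row.rhs = c(7, 9, 18),
                    col.signs = rep('=', 4),  col.rhs = c(5, 8, 7, 14))
sol$solution    # optimal shipment plan (units from Si to Dj)
sol$objval      # minimum total transportation cost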
Transportation Problem: LP Formulation Special Cases
1. Total supply not equal to total demand
• Total supply exceeds total demand:
No modification of LP formulation is necessary.
• Total demand exceeds total supply:
Add a dummy origin with supply equal to the shortage amount. Assign a zero shipping cost per unit.
The amount “shipped” from the dummy origin (in the solution) will not actually be shipped.

2. Maximization objective function


The objective is maximizing profit or revenue:
Solve as a maximization problem.

3. Route capacities or route minimums


Maximum route capacity from i to j:
xij < Lij

4. Unacceptable routes
Remove the corresponding decision variable.
Assignment Problem
Assignment Problem
• An assignment problem seeks to minimize the total cost assignment of m workers to m
jobs, given that the cost of worker i performing job j is cij.
• It assumes all workers are assigned and each job is performed.
• An assignment problem is a special case of a transportation problem in which all supplies
and all demands are equal to 1; hence assignment problems may be solved as linear
program.

• The assignment problem arises in a variety of decision-making situations; typical


assignment problems involve assigning jobs to machines, agents to tasks, sales personnel
to sales territories, contracts to bidders, and so on. A distinguishing feature of the
assignment problem is that one agent is assigned to one and only one task. Specifically,
we look for the set of assignments that will optimize a stated objective, such as minimize
cost, minimize time, or maximize profits.
Assignment Problem
• Network Representation: each agent node (1, 2, 3) is connected to each task node (1, 2, 3)
by an arc whose cost cij is the cost of assigning agent i to task j.
Assignment Problem: Linear Programming Formulation

Using the notation:
xij = 1 if agent i is assigned to task j, 0 otherwise
cij = cost of assigning agent i to task j

Min Σi Σj cij xij                 (summed over i = 1, ..., m agents and j = 1, ..., n tasks)
s.t. Σj xij ≤ 1    i = 1, 2, ..., m    (Agents)
     Σi xij = 1    j = 1, 2, ..., n    (Tasks)
     xij ≥ 0 for all i and j
Assignment Problem
Estimated project completion times (days) for the fowle marketing research assignment problem
Network Model
Client
Project Leader 1 2 3
Terry 10 15 9
Carle 9 18 5
McClymonds 6 14 3
The decision variables for Fowle’s assignment problem as
xij = 1 if agent i is assigned to task j
0 otherwise
where i = 1, 2, 3, and j =1, 2, 3
Develop completion time expressions using this notation and the
completion time
Days required for Terry’s assignment = 10x11 +15x12 +9x13
Days required for Carle’s assignment = 9x21 +18x22 +5x23
Days required for McClymonds’s assignment = 6x31 +14x32 +3x33

The sum of the completion times for the three project leaders will provide the total days required to complete the three
assignments. Thus, the objective function is Min 10x11 +15x12 +9x13 +9x21 +18x22 +5x23 +6x31 +14x32 +3x33
The constraints for the assignment problem reflect the conditions that each project leader can be assigned to at
most one client and that each client must have one assigned project leader. These constraints are written as
follows:
x11 +x12 +x13 ≤1 Terry’s assignment
x21 +x22 +x23 ≤1 Carle’s assignment
x31 +x32 +x33 ≤1 McClymonds’s assignment
x11 +x21 +x31 =1 Client 1
x12 +x22 +x32 =1 Client 2
x13 + x23 +x33 =1 Client 3
Combining the objective function and constraints into one model provides the following nine-variable,
six-constraint linear programming model of the Fowle Marketing Research assignment problem
Min 10x11 +15x12 +9x13 +9x21 +18x22 +5x23 +6x31 +14x32 +3x33
x11 +x12 +x13 ≤1
x21 +x22 +x23 ≤1
x31 +x32 +x33 ≤1
x11 +x21 +x31 =1
x12 +x22 +x32 =1
x13 + x23 +x33 =1

xij ≥ 0 for i =1, 2, 3; j = 1, 2, 3
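A minimal R sketch (not in the original notes) solving the Fowle problem with the lp.assign function from the lpSolve package; the minimum total completion time works out to 26 days:
library(lpSolve)
days <- rbind(Terry = c(10, 15, 9), Carle = c(9, 18, 5), McClymonds = c(6, 14, 3))
sol <- lp.assign(days, direction = 'min')
sol$solution    # 0/1 assignment matrix (project leaders in rows, clients in columns)
sol$objval      # minimum total completion time in days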


Assignment Problem: LP Formulation Special Cases
• Number of agents exceeds the number of tasks:
Extra agents simply remain unassigned.
• Number of tasks exceeds the number of agents:
Add enough dummy agents to equalize the number of agents and the number of tasks.
The objective function coefficients for these new variable would be zero.
• The assignment alternatives are evaluated in terms of revenue or profit:
Solve as a maximization problem.
• An assignment is unacceptable:
Remove the corresponding decision variable.
• An agent is permitted to work t tasks:
𝑛

෍ 𝑥𝑖𝑗 ≤ 𝑡 𝑖 = 1,2, … , 𝑚 Agents


𝑗=1
Assignment Problem: Example
An electrical contractor pays his subcontractors a fixed fee plus mileage for
work performed. On a given day the contractor is faced with three electrical jobs
associated with various projects. Given below are the distances between the
subcontractors and the projects.
Projects
Subcontractor A B C
Westside 50 36 16
Federated 28 30 18
Goliath 35 32 20
Universal 25 25 14

How should the contractors be assigned so that total mileage is minimized?


Assignment Problem: Example

Network Representation: each subcontractor node (Westside, Federated, Goliath, Universal)
is connected to each project node (A, B, C) by an arc labelled with the corresponding
distance from the table above.
Assignment Problem: Example
Linear Programming Formulation

Min 50x11+36x12+16x13+28x21+30x22+18x23
+35x31+32x32+20x33+25x41+25x42+14x43
s.t. x11+x12+x13 < 1
x21+x22+x23 < 1 The optimal assignment is:
Agents Subcontractor Project Distance
x31+x32+x33 < 1
x41+x42+x43 < 1 Westside C 16
x11+x21+x31+x41 = 1 Federated A 28
x12+x22+x32+x42 = 1 Tasks Goliath (unassigned)
x13+x23+x33+x43 = 1 Universal B 25
xij = 0 or 1 for all i and j Total Distance = 69 miles
Thank You
Decision Analysis
Ramesh Kandela
[email protected]
Decision Analysis
• Decision analysis is an analytical approach of comparing decision alternatives in terms of
expected outcomes.
• Decision analysis can be used to develop an optimal strategy when a decision maker is
faced with several decision alternatives and an uncertain or risk-filled pattern of future
events.
• Even when a careful decision analysis has been conducted, the uncertain future events
make the final consequence uncertain.
• The risk associated with any decision alternative is a direct result of the uncertainty
associated with the final consequence.
• Good decision analysis includes risk analysis that provides probability information about
the favorable as well as the unfavorable consequences that may occur.
A decision problem is characterized by
• Decision alternatives
• States of nature
• Payoffs
Problem Formulation
• A decision problem is characterized by decision alternatives, states of nature, and resulting
payoffs.
• The decision alternatives are the different possible strategies the decision maker can employ.
• The states of nature refer to future events, not under the control of the decision maker, which
may occur. States of nature should be defined so that they are mutually exclusive and
collectively exhaustive.
• Payoff It is a numerical value (outcome) obtained due to the application of each possible
combination of decision alternatives and states of nature.
• The consequence resulting from a specific combination of a decision alternative and a state of
nature is a payoff.
• A table showing payoffs for all combinations of decision alternatives and states of nature is a
payoff table.
• Payoffs can be expressed in terms of profit, cost, time, distance or any other appropriate
measure.
Example: Pittsburgh Development Corp.
• Pittsburgh Development Corporation (PDC) purchased land that will be the site of a new
luxury condominium complex. PDC commissioned preliminary architectural drawings for
three different projects: one with 30, one with 60, and one with 90 condominiums.
• The financial success of the project depends upon the size of the condominium complex and
the chance event concerning the demand for the condominiums. The statement of the PDC
decision problem is to select the size of the new complex that will lead to the largest profit
given the uncertainty concerning the demand for the condominiums.
Consider the following problem with three decision alternatives and two states of nature
with the following payoff table representing profits:
Payoff table
States of Nature
Decision Alternative Strong Demand s1 Weak Demand s2
Small complex, d1 8 7
Medium complex, d2 14 5
Large complex, d3 20 -9
Influence Diagrams
• An influence diagram is a graphical device showing the relationships among
the decisions, the chance events, and the consequences.
• Squares or rectangles depict decision nodes.
• Circles or ovals depict chance nodes.
• Lines or arcs connecting the nodes show the direction of influence.
• Diamonds depict consequence nodes.
Decision Trees
• A decision tree is a chronological
representation of the decision problem.
• Decision tree is the graphical display of the
progression of decision and random events.
• Each decision tree has two types of nodes;
round nodes correspond to the states of
nature while square nodes correspond to the
decision alternatives.
• The branches leaving each round node
represent the different states of nature while
the branches leaving each square node
represent the different decision alternatives.
• At the end of each limb of a tree are the
payoffs attained from the series of branches
making up that limb.
Types of Decision-making Environments
Decision-Making under Certainty
In this decision-making environment, decision-maker has complete knowledge (perfect
information) of outcome due to each decision-alternative (course of action). In such a case he
would select a decision alternative that yields the maximum return (payoff) under known state of
nature. For example, the decision to invest in National Saving Certificate, Indira Vikas Patra, Public
Provident Fund, etc., is where complete information about the future return due and the principal
at maturity is know.
Decision-Making under Risk
In this decision-environment, decision-maker does not have perfect knowledge about possible
outcome of every decision alternative. It may be due to more than one states of nature.
Decision-Making under Uncertainty
In this decision environment, decision-maker is unable to specify the probability for occurrence of
particular state of nature. However, this is not the case of decision-making under ignorance,
because the possible states of nature are known. Thus, decisions under uncertainty are taken even
with less information than decisions under risk. For example, the probability that Mr X will be the
prime minister of the country 15 years from now is not known.
Decision Making without Probabilities (Decision-making Under Uncertainty)
In this decision environment, decision-maker is unable to specify the probability for occurrence
of particular state of nature.

• Three commonly used criteria for decision making when probability information
regarding the likelihood of the states of nature is unavailable are:
• Optimistic approach (maximax or minimin)
• Conservative approach (maximin or minimax)
• Minimax regret approach.
Optimistic Approach
In this criterion the decision-maker ensures that he should not miss the opportunity to
achieve the largest possible profit (maximax) or the lowest possible cost (minimin). Thus, he
selects the decision alternative that represents the maximum of the maxima (or minimum of
the minima) payoffs (consequences or outcomes).
• The optimistic approach would be used by an optimistic decision maker.
• The decision with the largest possible payoff is chosen.
• If the payoff table was in terms of costs, the decision with the lowest cost would be
chosen.
Conservative Approach
• The conservative approach would be used by a conservative decision maker.
• For each decision the minimum payoff is listed and then the decision corresponding to the
maximum of these minimum payoffs is selected. (Hence, the minimum possible payoff is
maximized.)
• If the payoff was in terms of costs, the maximum costs would be determined for each
decision and then the decision corresponding to the minimum of these maximum costs is
selected. (Hence, the maximum possible cost is minimized.)

in this criterion the decision-maker is conservative about the future and always anticipate
the worst possible outcome (minimum for profit and maximum for cost or loss), it is called
pessimistic decision criterion. This criterion is also known as Wald’s criterion.
Minimax Regret Approach
• The minimax regret approach requires the construction of a regret table or an opportunity
loss table. This is done by calculating for each state of nature the difference between each
payoff and the largest payoff for that state of nature.
• Then, using this regret table, the maximum regret for each possible decision is listed.
• The decision chosen is the one corresponding to the minimum of the maximum regrets.
Example: Minimax Regret Approach

Opportunity loss, or regret, table for the PDC


condominium project ($ millions)
Which strategy should the concerned executive choose on the basis of
(a) Maximin criterion (b) Maximax criterion (c) Minimax regret criterion
Decision Making with Probabilities (Decision making under Risk)
Expected Value Approach(EMV)
• If probabilistic information regarding the states of nature is available, one may use the
expected value (EV) approach.
• Here the expected return for each decision is calculated by summing the products of the payoff
under each state of nature and the probability of the respective state of nature occurring.
• The decision yielding the best expected return is chosen.
• The expected value of a decision alternative is the sum of weighted payoffs for the decision
alternative.
• The expected value (EV) of decision alternative di is defined as:
EV(di) = Σ (j = 1 to N) P(sj) Vij
where: N = the number of states of nature


P(sj ) = the probability of state of nature sj
Vij = the payoff corresponding to decision alternative di and state of nature sj
Expected Value Approach
Calculate the expected value for each decision. Here d1, d2, and d3 represent the decision
alternatives of building a small, medium, and large complex, while s1 and s2 represent
the states of nature of strong demand and weak demand. (s1) =0.8 and P(s2) =0.2
Decision Tree

Payoffs (decision node 1 branches to d1, d2, d3, leading to chance nodes 2, 3 and 4; each
chance node branches to s1 with probability .8 and s2 with probability .2, with payoffs in
$ millions of 8 and 7 for d1, 14 and 5 for d2, and 20 and -9 for d3).
Expected Value for Each Decision

EMV(d1, Small complex, node 2)  = .8(8 mil) + .2(7 mil)  = $7.8 mil
EMV(d2, Medium complex, node 3) = .8(14 mil) + .2(5 mil) = $12.2 mil
EMV(d3, Large complex, node 4)  = .8(20 mil) + .2(-9 mil) = $14.2 mil
Choose the decision alternative with the largest EV. Build the large complex.
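A minimal R sketch (not in the original notes) of the expected value computation for the PDC payoff table:
payoff <- rbind(d1 = c(8, 7), d2 = c(14, 5), d3 = c(20, -9))   # payoffs in $ millions
prob   <- c(s1 = 0.8, s2 = 0.2)                                 # state-of-nature probabilities
ev <- payoff %*% prob               # EV(d1) = 7.8, EV(d2) = 12.2, EV(d3) = 14.2
rownames(payoff)[which.max(ev)]     # 'd3': build the large complex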
• Suppose that the decision maker obtained the probability assessments P(s1) = 0.65,
P(s2) = 0.15, and P(s3) = 0.20. Use the expected value approach to determine the
optimal decision.
Expected Value of Perfect Information
• Frequently information is available which can improve the probability estimates for the
states of nature.
• The expected value of perfect information (EVPI) is the increase in the expected profit that
would result if one knew with certainty which state of nature would occur.
• The EVPI provides an upper bound on the expected value of any sample or survey
information.
EVPI = (Expected profit with perfect information) – (Expected profit without perfect information)
• EVPI Calculation
• Step 1:
Determine the optimal return corresponding to each state of nature.
• Step 2:
Compute the expected value of these optimal returns.
• Step 3:
Subtract the EMV of the optimal decision from the amount determined in step (2).
Expected Value of Perfect Information
PDC’s optimal decision strategy when the perfect information becomes available as follows:
If s1, select d3 and receive a payoff of $20 million.
If s2, select d1 and receive a payoff of $7 million.

• Expected Value with Perfect Information (EVwPI)


EVwPI = .8(20 mil) + .2(7 mil) = $17.4 mil

• Expected Value without Perfect Information (EVwoPI)


EVwoPI = .8(20 mil) + .2(-9 mil) = $14.2 mil

• Expected Value of Perfect Information (EVPI)


EVPI = |EVwPI – EVwoPI| = |17.4 – 14.2| = $3.2 mil
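Continuing the small R sketch used earlier for the expected values (not part of the original notes), EVPI can be computed as:
best_per_state <- apply(payoff, 2, max)    # best payoff under each state: 20 and 7
EVwPI  <- sum(prob * best_per_state)       # .8(20) + .2(7) = 17.4
EVwoPI <- max(payoff %*% prob)             # 14.2
EVPI   <- EVwPI - EVwoPI                   # 3.2 ($ millions)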
• Suppose that the decision maker obtained the probability assessments P(s1) = 0.65,
P(s2) = 0.15, and P(s3) = 0.20. Use the expected value approach to determine the
optimal decision. Also calculate the EVPI.
Decision Analysis with Sample Information
Decision Analysis with Sample Information
• Frequently, decision makers have preliminary or prior probability assessments for the states of
nature that are the best probability values available at that time.
• To make the best possible decision, the decision maker may want to seek additional information
about the states of nature.
• This new information, often obtained through sampling, can be used to revise the prior probabilities
so that the final decision is based on more accurate probabilities for the states of nature.
• These revised probabilities are called posterior probabilities.
Example: Pittsburgh Development Corp.
Let us return to the PDC problem and assume that management is considering a 6-month market
research study designed to learn more about potential market acceptance of the PDC
condominium project. Management anticipates that the market research study will provide one of
the following two results:
1. Favorable report: A significant number of the individuals contacted express interest in purchasing
a PDC condominium.
2. Unfavorable report: Very few of the individuals contacted express interest in purchasing a PDC
condominium.
Bayes’ Theorem and Posterior Probabilities
• Knowledge of sample (survey) information can be used to revise the probability estimates
for the states of nature.
• Prior to obtaining this information, the probability estimates for the states of nature are
called prior probabilities.
• With knowledge of conditional probabilities for the outcomes or indicators of the sample or
survey information, these prior probabilities can be revised by employing Bayes' Theorem.
• The outcomes of this analysis are called posterior probabilities or branch probabilities for
decision trees.
Branch (Posterior) Probabilities Calculation
• Step 1: For each state of nature, multiply the prior probability by its conditional probability
for the indicator -- this gives the joint probabilities for the states and indicator.
• Step 2: Sum these joint probabilities over all states -- this gives the marginal probability for
the indicator.
• Step 3: For each state, divide its joint probability by the marginal probability for the indicator
-- this gives the posterior probability distribution.
Bayes' Theorem Statement
• An initial probability statement to evaluate expected payoff is called a prior probability distribution,
but if the probability statement has been revised due to additional information, then such a
probability statement is called a posterior probability distribution.
• The method of computing posterior probabilities, given prior probabilities using Bayes’ theorem.
The analysis of problems using posterior probabilities with new expected payoffs and additional
information, is called prior-posterior analysis.
• Let A1, A2, . . ., An be mutually exclusive and collectively exhaustive outcomes. Their probabilities
P(A1), P(A2), . . ., P(An) are known (the prior probability or marginal probability ). There is an
experimental outcome B for which the conditional probabilities P(B | A1), P(B | A2), . . ., P(B | An)
are also known. Given the information that outcome B has occurred, the revised conditional
probabilities of outcomes Ai, i.e. P(Ai | B), i = 1, 2, . . ., n are determined by using the following
relationship:

Since each joint probability can be expressed as the product of a known marginal (prior) and conditional
probability, i.e., P( Ai ∩ B) = P( Ai ) x P( B | Ai )
Posterior Probabilities
Favorable
State of Prior Conditional Joint Posterior
Nature Probability Probability Probability Probability
sj P(sj) P(F|sj) P(F I sj) P(sj |F)
s1 0.8 0.90 0.72
s2 0.2 0.25 0.05
P(favorable) = P(F) =
Unfavorable
State of Prior Conditional Joint Posterior
Nature Probability Probability Probability Probability
sj P(sj) P(U|sj) P(U I sj) P(sj |U)
s1 0.8 0.10 0.08
s2 0.2 0.75 0.15
P(unfavorable) = P(U) =
Posterior Probabilities
Favorable
State of Prior Conditional Joint Posterior
Nature Probability Probability Probability Probability
sj P(sj) P(F|sj) P(F I sj) P(sj |F)
s1 0.8 0.90 0.72 0.94
s2 0.2 0.25 0.05 0.06
P(favorable) = P(F) = 0.77 1.00
Unfavorable
State of Prior Conditional Joint Posterior
Nature Probability Probability Probability Probability
sj P(sj) P(U|sj) P(U I sj) P(sj |U)
s1 0.8 0.10 0.08 0.35
s2 0.2 0.75 0.15 0.65
P(unfavorable) = P(U) = 0.23 1.00
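The posterior probabilities in the tables above can be reproduced with a small R sketch (not in the original notes):
prior   <- c(s1 = 0.8, s2 = 0.2)
cond_F  <- c(0.90, 0.25)             # P(F | s1), P(F | s2)
joint_F <- prior * cond_F            # 0.72, 0.05; P(F) = sum = 0.77
post_F  <- joint_F / sum(joint_F)    # P(s1 | F) = 0.94, P(s2 | F) = 0.06 (rounded)
cond_U  <- c(0.10, 0.75)             # P(U | s1), P(U | s2)
joint_U <- prior * cond_U            # 0.08, 0.15; P(U) = sum = 0.23
post_U  <- joint_U / sum(joint_U)    # P(s1 | U) = 0.35, P(s2 | U) = 0.65 (rounded)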
Sample Information
PDC has developed the following branch probabilities.
If the market research study is undertaken:
P(Favorable report) = P(F) = .77
P(Unfavorable report) = P(U) = .23
If the market research report is favorable:
P(Strong demand | favorable report) = P(s1|F) = .94
P(Weak demand | favorable report) = P(s2|F) = .06
If the market research report is unfavorable:
P(Strong demand | unfavorable report) = P(s1|U) = .35
P(Weak demand | unfavorable report) = P(s2|U) = .65
If the market research study is not undertaken, the prior
probabilities are applicable:
P(Favorable report) = P(F) = .80
P(Unfavorable report) = P(U) = .20
Decision Tree
(Structure of the PDC decision tree with sample information, nodes 1 to 14:
• Node 1 (decision): Conduct Market Research Study, or Do Not Conduct.
• Conduct → node 2 (chance): Favorable report F (.77) → node 3; Unfavorable report U (.23) → node 4.
• Node 3 (decision): d1, d2, d3 → chance nodes 6, 7, 8 with P(s1|F) = .94 and P(s2|F) = .06.
• Node 4 (decision): d1, d2, d3 → chance nodes 9, 10, 11 with P(s1|U) = .35 and P(s2|U) = .65.
• Do Not Conduct → node 5 (decision): d1, d2, d3 → chance nodes 12, 13, 14 with prior
probabilities P(s1) = .80 and P(s2) = .20.
• Payoffs at the end of each branch ($ millions): 8 (s1) and 7 (s2) for d1; 14 and 5 for d2; 20 and -9 for d3.)
Decision Strategy
• A decision strategy is a sequence of decisions and chance outcomes where the decisions
chosen depend on the yet-to-be-determined outcomes of chance events.
• The approach used to determine the optimal decision strategy is based on a backward
pass through the decision tree using the following steps:
• At chance nodes, compute the expected value by multiplying the payoff at the end
of each branch by the corresponding branch probabilities.
• At decision nodes, select the decision branch that leads to the best expected value.
This expected value becomes the expected value at the decision node.
Decision Tree
(Expected values obtained from the backward pass:
• Node 6: EV = .94(8) + .06(7) = $7.94 mil        • Node 9: EV = .35(8) + .65(7) = $7.35 mil
• Node 7: EV = .94(14) + .06(5) = $13.46 mil      • Node 10: EV = .35(14) + .65(5) = $8.15 mil
• Node 8: EV = .94(20) + .06(-9) = $18.26 mil     • Node 11: EV = .35(20) + .65(-9) = $1.15 mil
• Node 12: EV = .8(8) + .2(7) = $7.80 mil
• Node 13: EV = .8(14) + .2(5) = $12.20 mil
• Node 14: EV = .8(20) + .2(-9) = $14.20 mil
• Decision node 3: best EV = $18.26 mil (d3); decision node 4: best EV = $8.15 mil (d2);
decision node 5: best EV = $14.20 mil (d3).
• Chance node 2: EV = .77(18.26) + .23(8.15) = $15.93 mil.)
Decision Strategy
• PDC’s optimal decision strategy is:
• Conduct the market research study.
• If the market research report is favorable, construct the large condominium complex.
• If the market research report is unfavorable, construct the medium condominium
complex.
EVSI = EVwSI - EVwoSI
where
EVSI = expected value of sample information
EVwSI = expected value with sample information about the states of nature
EVwoSI = expected value without sample information about the states of nature
EVwSI
EV(Node 2)= 0.77EV(Node 3) + 0.23EV(Node 4)
=0.77(18.26) + 0.23(8.15)
= 15.93
Expected Value of Sample Information
• The expected value of sample information (EVSI) is the additional expected profit possible
through knowledge of the sample or survey information.
• The expected value associated with the market research study is $15.93.
• The best expected value if the market research study is not undertaken is $14.20.
• We can conclude that the difference, $15.93 − $14.20 = $1.73, is the expected value of
sample information.
• Conducting the market research study adds $1.73 million to the PDC expected value.
Efficiency of Sample Information
• Efficiency of sample information is the ratio of EVSI to EVPI.
• As the EVPI provides an upper bound for the EVSI, efficiency is always a number between 0 and 1.
The efficiency of the survey:
E = (EVSI/EVPI) X 100
= [($1.73 mil)/($3.20 mil)] X 100
= 54.1%
The information from the market research study is 54.1% as efficient as perfect information.
Thank You
Markov Models
Ramesh Kandela
[email protected]
Markov Processes
• Markov chain models (a special type of stochastic process) are useful for studying how a
system evolves over time, where the state in the next period depends only on the current
state (the memory-less property discussed below).
• Markov process models are useful in studying the evolution of systems over repeated trials
or sequential time periods or stages.
• the promotion of managers to various positions within an organization
• the migration of people into and out of various regions of the country
• the progression of students through the years of college, including eventually dropping
out or graduating
Markov processes have been used to describe the probability that:
• a machine that is functioning in one period will function or break down in the next period.
• a consumer purchasing brand A in one period will purchase brand B in the next period.
Examples
For example, consider the following few systems:
I. Market share of a product and its competitive brands.
II. Cash collection procedures involved in converting accounts receivable from the product’s
sales into cash.
III. Machines used to manufacture a product.
IV. Area of specialization by a management student at one time.
In all these examples, each process (or system) may be in one of several possible states. These
states describe all possible conditions of the given system. For example,
I. the brand of the product that a customer is presently using is termed as a state.
II. the accounts receivable can be in one of the two states: cash sale or credit sale.
III. the machine condition can be in one of the two possible states: working or not working
IV. the few areas in which a student can specialize at one time represent states.
Transition Probabilities
• State : This is the position at a specific time-step in the environment.
• State probability of an event is the probability of its occurrence at a point in time.
• Transition : Moving from one state to another is called Transition.
• Transition probabilities govern the manner in which the state of the system changes from
one stage to the next. These are often represented in a transition matrix.
• A system has a finite Markov chain with stationary transition probabilities if:
• there are a finite number of states,
• the transition probabilities remain constant from stage to stage, and
• the probability of the process being in a particular state at stage n+1 is completely
determined by the state of the process at stage n (and not the state at stage n-1). This
is referred to as the memory-less property.
Student Markov Process (transition diagram)
Example: Market Share Analysis
• Suppose we are interested in analyzing the market share and customer loyalty for
Murphy’s Foodliner and Ashley’s Supermarket, the only two grocery stores in a small
town. We focus on the sequence of shopping trips of one customer and assume that the
customer makes one shopping trip each week to either Murphy’s Foodliner or Ashley’s
Supermarket, but not both.

• We refer to the weekly periods or shopping trips as the trials of the process. Thus, at each
trial, the customer will shop at either Murphy’s Foodliner or Ashley’s Supermarket. The
particular store selected in a given week is referred to as the state of the system in that
period. Because the customer has two shopping alternatives at each trial, we say the
system has two states.
State 1. The customer shops at Murphy’s Foodliner.
State 2. The customer shops at Ashley’s Supermarket.
Example: Market Share Analysis
• Suppose that, as part of a market research study, we collect data from 100 shoppers over a
10-week period. In reviewing the data, suppose that we find that of all customers who
shopped at Murphy’s in a given week, 90% shopped at Murphy’s the following week while
10% switched to Ashley’s.
• Suppose that similar data for the customers who shopped at Ashley’s in a given week show
that 80% shopped at Ashley’s the following week while 20% switched to Murphy’s.
Transition Probabilities

pij = probability of making a transition from state i in a given period to state j in the next period
P = [ p11  p12 ]  =  [ 0.9  0.1 ]
    [ p21  p22 ]     [ 0.2  0.8 ]
• The terms π1(n) and π2(n) will denote the probability of the system being in state 1 or
state 2 at period n.
• Week 0 represents the most recent period, when we are beginning the analysis of the
Markov process.
• If we set π1(0) = 1 and π2(0) = 0, we are saying that as an initial condition the customer
shopped last week at Murphy's.
• Alternatively, if we set π1(0) = 0 and π2(0) = 1, we would be starting the system with a
customer who shopped last week at Ashley's.
Π(n) = [π1(n)  π2(n)]
denotes the vector of state probabilities for the system in period n, and
Π(next period) = Π(current period) P, i.e. Π(n + 1) = Π(n) P
State Probabilities for Future Periods
Beginning with the system in state 1 at period 0, we have Π(0) = [1 0]. We can compute the
state probabilities for period 1 as follows:
Π(1) = Π(0)P = [1  0] [ 0.9  0.1 ]  =  [0.9  0.1]
                      [ 0.2  0.8 ]
The state probabilities π1(1) = 0.9 and π2(1) = 0.1 are the probabilities that a customer who
shopped at Murphy's during week 0 will shop at Murphy's or at Ashley's during week 1.
Similarly, Π(2) = Π(1)P = [0.83  0.17]: the probability of shopping at Murphy's during the
second week is 0.83, while the probability of shopping at Ashley's during the second week is 0.17.
We can compute the state probabilities for any future period in the same way; that is, Π(n + 1) = Π(n)P.
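A minimal R sketch (not in the original notes) of these calculations:
P   <- rbind(c(0.9, 0.1),
             c(0.2, 0.8))    # transition probability matrix
pi0 <- c(1, 0)               # the customer shopped at Murphy's in week 0
pi1 <- pi0 %*% P             # week 1: 0.90, 0.10
pi2 <- pi1 %*% P             # week 2: 0.83, 0.17
pi3 <- pi2 %*% P             # week 3: 0.781, 0.219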
Example: Market Share Analysis
State Probabilities (tree of two weekly shopping trips for a customer who starts at Murphy's):
• Murphy's → Murphy's: P = .9(.9) = .81
• Murphy's → Ashley's: P = .9(.1) = .09
• Ashley's → Murphy's: P = .1(.2) = .02
• Ashley's → Ashley's: P = .1(.8) = .08
Example: Market Share Analysis
State Probabilities for Future Periods Beginning Initially with a Murphy’s Customer

State Probabilities for Future Periods Beginning Initially with an Ashley’s Customer
In the market share analysis, suppose that we are considering the Markov process
associated with the shopping trips of one customer, but we do not know where the
customer shopped during the last week. Thus, we might assume a 0.5 probability that the
customer shopped at Murphy’s and a 0.5 probability that the customer shopped at Ashley’s
at period 0; that is, π1(0) = 0.5 and π2(0) = 0.5. Given these initial state probabilities, find the
probability of each state for the next 3 periods.
Steady-State (Equilibrium) Probabilities
• The state probabilities at any stage of the process can be calculated recursively by
multiplying the state probabilities at the previous stage by the transition probability matrix: Π(n + 1) = Π(n) P.
• The probability of the system being in a particular state after a large number of stages is
called a steady-state probability.
• Steady-state probabilities can be found by solving the system of equations
Π P = Π
together with the condition for probabilities that Σi πi = 1.
• Matrix P is the transition probability matrix.
• Vector Π is the vector of steady-state probabilities.
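A minimal NumPy sketch (not part of the original slides) of solving Π P = Π together with Σ πi = 1 for the Murphy’s/Ashley’s matrix. The least-squares call simply solves the small over-determined linear system formed by (Pᵀ − I)π = 0 plus the normalization row:

```python
# Minimal sketch: steady-state probabilities from pi P = pi and sum(pi) = 1
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])  # (P^T - I) pi = 0, plus sum(pi) = 1
b = np.append(np.zeros(n), 1.0)
pi_steady, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi_steady)   # approximately [0.6667, 0.3333]
```

The result matches the hand calculation that follows: π1 = 2/3 and π2 = 1/3.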
Example: Market Share Analysis
Steady-State Probabilities
Let π1 = long-run proportion of Murphy’s visits
    π2 = long-run proportion of Ashley’s visits
Then,
              | 0.9  0.1 |
[π1  π2]      |          |  =  [π1  π2]
              | 0.2  0.8 |
which gives the equations
.9π1 + .2π2 = π1     (1)
.1π1 + .8π2 = π2     (2)
π1 + π2 = 1          (3)
Substitute π2 = 1 - π1 into (1) to give:
π1 = .9π1 + .2(1 - π1), so π1 = 2/3 = .667
Substituting back into (3) gives:
π2 = 1/3 = .333
Thus, if we have 1000 customers in the system, the Markov process model tells us that in the long run, with steady-state probabilities π1 = .667 and π2 = .333, 667 customers will be Murphy’s customers and 333 will be Ashley’s customers.
Example: Market Share Analysis
Revised Steady-State Probabilities
Using the revised transition probabilities (shown on the following slide), where the probability of a Murphy’s customer switching to Ashley’s increases from 0.10 to 0.15:
.85π1 + .20π2 = π1     (1)
.15π1 + .80π2 = π2     (2)
π1 + π2 = 1            (3)
Substitute π2 = 1 - π1 into (1) to give:
π1 = .85π1 + .20(1 - π1), so π1 = .57
Substituting back into (3) gives:
π2 = .43
• Suppose that the total market consists of 6000 customers per week. The new promotional
strategy will increase the number of customers doing their weekly shopping at Ashley’s
from 2000 (a 1/3 share) to about 2580 (a .43 share).
• If the average weekly profit per customer is $10, the proposed promotional strategy can
be expected to increase Ashley’s profits by about $5800 per week. If the cost of the promotional
campaign is less than $5800 per week, Ashley’s should consider implementing the strategy.
Example: Market Share Analysis
Suppose Ashley’s Supermarket is contemplating an advertising campaign to attract more of
Murphy’s customers to its store. Let us suppose further that Ashley’s believes this
promotional strategy will increase the probability of a Murphy’s customer switching to
Ashley’s from 0.10 to 0.15.
Revised Transition Probabilities

         | 0.85  0.15 |
    P =  |            |
         | 0.20  0.80 |
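The same comparison can be sketched in code. This is an illustrative NumPy sketch (not from the slides) that recomputes both steady states and the implied weekly profit gain; note that with exact shares the gain is about $5,714, which the slides round (using π2 = .43) to 2580 customers and roughly $5,800:

```python
# Sketch: effect of the promotion on Ashley's long-run share and weekly profit
import numpy as np

def steady_state(P):
    """Solve pi P = pi with sum(pi) = 1 for a small transition matrix."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P_old = np.array([[0.90, 0.10], [0.20, 0.80]])
P_new = np.array([[0.85, 0.15], [0.20, 0.80]])   # switching prob 0.10 -> 0.15

market = 6000              # total customers per week
profit_per_customer = 10   # dollars per customer per week

old_share = steady_state(P_old)[1]   # Ashley's long-run share, about 0.333
new_share = steady_state(P_new)[1]   # about 0.429
extra_customers = market * (new_share - old_share)
print(round(extra_customers * profit_per_customer))  # about 5714 per week
```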
Management of the New-Fangled Soft drink Company believes that the probability of a
customer purchasing Red Pop or the company’s major competition, Super Cola, is based on
the customer’s most recent purchase. Suppose that the following transition probabilities are
appropriate:
(transition probability table)
• Find the state probabilities for the next 3 periods.
• Show the two-period tree diagram for a customer who last purchased Red Pop. What is the
probability that this customer purchases Red Pop on the second purchase?
• What is the long-run market share for each of these two products?
• A Red Pop advertising campaign is being planned to increase the probability of attracting
Super Cola customers. Management believes that the new campaign will increase to 0.15 the
probability of a customer switching from Super Cola to Red Pop. What is the projected effect
of the advertising campaign on the market shares?
Thank You
Introduction to Machine Learning
Ramesh Kandela
[email protected]
What Is Machine Learning?
• Machine Learning is the science (and art) of programming computers so
they can learn from data.
• Machine Learning is the field of study that gives computers the ability to
learn without being explicitly programmed. —Arthur Samuel
• A computer program is said to learn from experience E with respect to
some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E. —Tom Mitchell
• It could be used as a design tool to help us think clearly about what data
to collect (E), what decisions the software needs to make (T) and how we
will evaluate its results (P).
Machine Learning
• Spam filter is a Machine Learning program that can learn to flag spam given examples of
spam emails (e.g., flagged by users) and examples of regular (nonspam, also called “ham”)
emails.
• Problem: Classify emails as spam or not spam. Identify the Experience, Task and
Performance measure.
A) Classifying emails as spam or not spam
B) Watching you label emails as spam or not spam
C) The ratio of correctly classified emails
• The examples that the system uses to learn are called the training set. Each training example
is called a training instance (or sample). In this case, the task T is to flag spam for new
emails, the experience E is the training data, and the performance measure P needs to be
defined; for example, you can use the ratio of correctly classified emails. This particular
performance measure is called accuracy and it is often used in classification tasks.
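As a quick illustration (the labels below are toy values assumed for this sketch, not data from the notes), the performance measure P, accuracy, is simply the fraction of emails the filter classifies correctly:

```python
# Tiny sketch of the accuracy measure: correct predictions / total predictions
actual    = ["spam", "ham", "spam", "ham", "ham"]   # true labels
predicted = ["spam", "ham", "ham",  "ham", "ham"]   # filter's output

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)   # 0.8 -> 80% of emails classified correctly
```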
Why Machine Learning
• Consider how you would write a spam filter using traditional programming techniques
• 1. First you would look at what spam typically looks like. You might notice that some words or
phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the subject.
Perhaps you would also notice a few other patterns in the sender’s name, the email’s body, and
so on.
• 2. You would write a detection algorithm for each of the patterns that you noticed, and your
program would flag emails as spam if a number of these patterns are detected.
• 3. You would test your program, and repeat steps 1 and 2 until it is good enough.
(Figure: the traditional approach)
• Since the problem is not trivial, your program will likely become a long list of complex rules—
pretty hard to maintain.
Why Machine Learning
• In contrast, a spam filter based on Machine Learning
techniques automatically learns which words and
phrases are good predictors of spam by detecting
unusually frequent patterns of words in the spam
examples compared to the ham examples). The
program is much shorter, easier to maintain, and most
likely more accurate.
• Moreover, if spammers notice that all their emails
containing “4U” are blocked, they might start writing
“For U” instead. A spam filter using traditional
programming techniques would need to be updated to
flag “For U” emails. If spammers keep working around
your spam filter, you will need to keep writing new rules
forever.
(Figure: the Machine Learning approach)
• In contrast, a spam filter based on Machine Learning
techniques automatically notices that “For U” has
become unusually frequent in spam flagged by users,
and it starts flagging them without your intervention
Why Machine Learning
To summarize, Machine Learning is great for:
• Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one
Machine Learning algorithm can often simplify code and perform better.
• Complex problems for which there is no good solution at all using a traditional approach: the
best Machine Learning techniques can find a solution.
• Fluctuating environments: a Machine Learning system can adapt to new data.
• Getting insights about complex problems and large amounts of data.
• Machine learning is continuously evolving, with many new technologies emerging, and it is
now used across a wide range of industries.
• Machine learning is important because it gives enterprises a view of trends in customer
behavior and operational business patterns, as well as supports the development of new
products. Many of today's leading companies, such as Amazon, Flipkart, Facebook,
Google, Netflix and Uber, make machine learning a central part of their operations. Machine
learning has become a significant competitive differentiator for many companies.
Scope of Machine Learning in the future
• The scope of Machine Learning (ML) is vast, and in the near future it will deepen its reach
into fields such as medicine, finance, social media, facial and voice recognition, online
fraud detection, and biometrics. Because the scope of Machine Learning is so wide, there are
several areas where researchers are working toward revolutionizing the world of the future:
• Medical
• Cybersecurity
• Digital voice assistants
• Education
• Search engines
• Automotive Industry
• Quantum Computing
Types of Machine Learning Systems
Machine Learning systems can be classified
according to the amount and type of
supervision they get during training.
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
• These algorithms require the knowledge of both the outcome variable (dependent variable)
and the independent variable (input variables).
• In supervised learning, the training data you feed to the algorithm includes the desired
solutions, called labels.
• Regression and Classification algorithms are Supervised Learning algorithms. Both the
algorithms are used for prediction in machine learning and work with the labeled datasets.
• The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc. Predicting
a car price is an example of regression: we predict a target numeric value, such as the price
of a car, given a set of features (mileage, age, brand, etc.) called predictors.
• Classification algorithms are used to predict/classify discrete values such as Positive or
Negative, Spam or Not Spam, etc. The spam filter is a good example of this: it is trained
with many example emails along with their class (spam or ham), and it must learn how to
classify new emails.
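To make the regression/classification distinction concrete, here is a small, hedged scikit-learn sketch; the car-price and word-count numbers are made-up illustrations, not data from the notes:

```python
# Sketch: regression predicts a continuous value, classification a discrete class
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value (e.g., car price from age in years)
X_cars = [[1], [3], [5], [7]]          # predictor: age of the car (years)
y_price = [9.0, 7.2, 5.5, 3.9]         # label: resale price (toy numbers)
reg = LinearRegression().fit(X_cars, y_price)
print(reg.predict([[4]]))              # predicted price for a 4-year-old car

# Classification: predict a discrete class (spam = 1, ham = 0)
X_mails = [[0], [1], [3], [8], [10]]   # predictor: count of "spammy" words
y_class = [0, 0, 0, 1, 1]              # label: ham / spam
clf = LogisticRegression().fit(X_mails, y_class)
print(clf.predict([[6]]))              # predicted class for a new email
```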
Most important supervised learning algorithms

• Linear Regression
• Logistic Regression
• k-Nearest Neighbors
• Support Vector Machines (SVMs)
• Naïve Bayes
• Decision Trees and Random Forests
• Neural Networks
Unsupervised Learning
• Unsupervised learning: In unsupervised learning, the training data is unlabeled.
The system tries to learn without a teacher.
• These algorithms are set of algorithms which do not have the knowledge of the
outcome variable in the dataset.
• These algorithms discover hidden patterns or data groupings without the need
for human intervention. Its ability to discover similarities and differences in
information make it the ideal solution for exploratory data analysis, cross-selling
strategies, customer segmentation, and image recognition.
Most Important Unsupervised Algorithms
Here, are some of the most important unsupervised learning algorithms
Clustering
• K-Means
• DBSCAN
• Hierarchical Cluster Analysis (HCA)
Dimensionality Reduction / Feature Selection
• Principal Component Analysis (PCA)
Association Rule Learning
• Apriori
• Eclat
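A minimal sketch of the first algorithm in the list, K-Means, on made-up customer data (an illustrative assumption, not data from the notes): the model receives no labels and discovers the two customer segments on its own. Assumes scikit-learn is installed.

```python
# Sketch: unsupervised learning with K-Means on unlabeled customer data
from sklearn.cluster import KMeans

# Each row: [annual spend, number of visits] for one customer (made-up values)
X = [[200, 4], [220, 5], [210, 6],     # low-spend group
     [900, 25], [950, 30], [880, 28]]  # high-spend group

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(model.labels_)           # cluster assignment for each customer
print(model.cluster_centers_)  # the two discovered segment centres
```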
Supervised vs. Unsupervised Learning
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
Supervised learning models take direct feedback to check whether the predicted output is correct. | Unsupervised learning models do not take any feedback.
Supervised learning models predict the output. | Unsupervised learning models find the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights in the unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.
It includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Decision Tree, etc. | It includes algorithms such as Clustering, Principal Component Analysis, and the Apriori algorithm.
Applications and Technique
Task | Technique
Forecasting your company’s revenue next year, based on many performance metrics | Regression
Classifying emails as spam or not spam | Classification
Analyzing images of products on a production line to automatically classify them | Neural Networks (CNNs)
Detecting tumours in brain scans | Neural Networks (CNNs)
Automatically classifying news articles | Natural Language Processing (NLP)
Automatically flagging offensive comments on discussion forums | Natural Language Processing (NLP)
Summarizing long documents automatically | Natural Language Processing (NLP)
Creating a chatbot or a personal assistant | Natural Language Processing (NLP)
Detecting credit card fraud | Anomaly Detection
Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment | Clustering
Representing a complex, high-dimensional dataset in a clear and insightful diagram | PCA
Recommending a product that a client may be interested in, based on past purchases | Recommender System
Building an intelligent bot for a game | Reinforcement Learning
Machine Learning Applications
Image Recognition
• Image recognition is one of the most common applications of machine learning. It is used to identify
objects, persons, places, digital images, etc. The popular use case of image recognition and face
detection is, Automatic friend tagging suggestion:
• Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo with
our Facebook friends, then we automatically get a tagging suggestion with name, and the technology
behind this is machine learning's face detection and recognition algorithm.
• It is based on the Facebook project named "Deep Face," which is responsible for face recognition and
person identification in the picture.
Speech Recognition
• While using Google, we get an option of "Search by voice," it comes under speech recognition, and
it's a popular application of machine learning.
• Speech recognition is a process of converting voice instructions into text, and it is also known as
"Speech to text", or "Computer speech recognition." At present, machine learning algorithms are
widely used by various applications of speech recognition. Google assistant, Siri, Cortana,
and Alexa are using speech recognition technology to follow the voice instructions.
Machine Learning Applications
Traffic prediction
• If we want to visit a new place, we take help of Google Maps, which shows us the correct path with the shortest route and
predicts the traffic conditions.
• It predicts traffic conditions such as whether traffic is clear, slow-moving, or heavily congested with the help of two
inputs:
• The real-time location of the vehicle from the Google Maps app and sensors, and the average time taken on past days at the same time.
Self-driving cars
• One of the most exciting applications of machine learning is self-driving cars, where machine learning plays a significant role.
Tesla, a well-known car manufacturer, is working on self-driving cars and uses machine learning methods
to train its car models to detect people and objects while driving.
Product recommendations
• Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for
product recommendation to users. Whenever we search for a product on Amazon, we then start getting
advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
• Google understands the user interest using various machine learning algorithms and suggests the product as per customer
interest.
• Similarly, when we use Netflix, we see recommendations for series, movies, etc., and this is also done
with the help of machine learning.
Machine Learning Applications
Virtual Personal Assistant:
• We have various virtual personal assistants such as Google assistant, Alexa, Cortana, Siri. As the name suggests, they help
us in finding the information using our voice instruction. These assistants can help us in various ways just by our voice
instructions such as Play music, call someone, Open an email, Scheduling an appointment, etc.
• These virtual assistants use machine learning algorithms as an important part.
• These assistants record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act
accordingly.
Online Fraud Detection:
• Machine learning makes our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform
an online transaction, fraud can occur in various ways, such as fake accounts, fake
IDs, or money being stolen in the middle of a transaction. To detect this, a feed-forward neural network can check
whether a transaction is genuine or fraudulent.
• For each genuine transaction, the output is converted into hash values, and these values become the input for the
next round. Genuine transactions follow a specific pattern, which changes for fraudulent transactions; the model
detects this change and makes our online transactions more secure.
Automatic Language Translation:
• Nowadays, if we visit a new place and do not know the language, it is not a problem at all, because
machine learning helps us by converting text into a language we know. Google's GNMT (Google Neural Machine
Translation) provides this feature: a neural machine translation model that translates text into our familiar language, and
this is called automatic translation.
• The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is also used with image
recognition to translate text from one language to another.
Applications of ML
• Self-driving cars on the roads
• Movie recommendations
• Amazon product recommendations
• Speech recognition in your smartphone
Steps of Machine Learning
1. Get the data
2. Understand the data and visualize it to gain insights
3. Data preprocessing (prepare the data for Machine Learning algorithms)
4. Select a model and train it
5. Make predictions
6. Model evaluation (model testing)
7. Model deployment
(A short code sketch of these steps is shown below.)
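The sketch below walks through these steps with scikit-learn, using the built-in Iris dataset as a stand-in for "get the data" (an assumption made for illustration; any labeled dataset would do). Model deployment is beyond the scope of this short example.

```python
# Sketch of the machine learning workflow with scikit-learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Get the data
X, y = load_iris(return_X_y=True)

# 2-3. Understand/prepare the data: hold out a test set and scale the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 4. Select a model and train it
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model on unseen data
print("Accuracy:", accuracy_score(y_test, y_pred))
```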
Happy Analyzing