0% found this document useful (0 votes)

91 views158 pages

CSE3506 - Essentials of Data Analytics: Facilitator: DR Sathiya Narayanan S

This document provides an overview of regression analysis techniques for a data analytics course. It introduces simple linear regression, multiple linear regression, and the advertising dataset used as an example. Simple linear regression models a quantitative response Y based on a single predictor X, while multiple linear regression extends this to model Y based on multiple predictors X1, X2, etc. The advertising dataset contains sales and advertising expenditure data for different media and will be used to predict sales based on advertising predictors.

Uploaded by

Ujjwal Karnani

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

91 views158 pages

CSE3506 - Essentials of Data Analytics: Facilitator: DR Sathiya Narayanan S

Uploaded by

Ujjwal Karnani

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 158

CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)

School of Electronics Engineering (SENSE), VIT-Chennai

Email: [email protected]
Handphone No.: +91-9944226963

Winter Semester 2020-21

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 1 / 36

Summary of Facilitator’s Profile

Education

B.E., Electronics and Communication Engineering, Anna University,

Tamil Nadu, India - April 2008.
M.Sc., Signal Processing, Nanyang Technological University
(NTU), Singapore - May 2011.
Ph.D., Signal/Image Compression, NTU, Singapore - August 2016.

Experience

Post-doctoral experience: Research Fellow, NTU, Singapore -

October 2016 to April 2018.
Teaching experience: Assistant Professor (Senior Grade), VIT,
Chennai - June 2018 onwards.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 2 / 36

Suggested Readings

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani,

“An Introduction to Statistical Learning with Applications in R”,
Springer Texts in Statistics, 2013 (Facilitator’s Recommendation).

Alpaydin Ethem, “ Introduction to Machine Learning”, 3rd Edition,

PHI Learning Private Limited, 2019.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 3 / 36

Contents

1 Module 1: Regression Analysis

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 4 / 36

Module 1: Regression Analysis

Topics to be covered in Module-1

The Advertising Dataset and Problem Statement

Simple Linear Regression
Multiple Linear Regression
Model Estimation and Evaluation
Correlation
Time Series Forecasting
Autocorrelation
ANOVA - Analysis of Variance

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 5 / 36

Module 1: Regression Analysis

The Advertising Dataset and Problem Statement

Figure 1: Sales (in thousands of units) for a particular product as a function of

advertising budgets (in thousands of dollars) for TV, radio, and newspaper media.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 6 / 36

Module 1: Regression Analysis

The Advertising Dataset and Problem Statement

The plot in Figure 1 displays sales, in thousands of units, as a

function of TV, radio, and newspaper budgets, in thousands of
dollars, for 200 different markets.
In each plot, a simple least squares fit of sales to that variable is
shown. In other words, each blue line represents a simple model that
can be used to predict sales using TV, radio, and newspaper,
respectively.
Suppose that in our role as statistical consultants we are asked to
suggest, on the basis of this data, a marketing plan for next year that
will result in high product sales.
What information would be useful in order to provide such a
recommendation?

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 7 / 36

Module 1: Regression Analysis

The Advertising Dataset and Problem Statement

Few questions that we might seek to address:

Is there a relationship between advertising budget and sales? If yes,

how strong is that relationship?
Is the relationship linear?
How accurately can we estimate the effect of each medium on sales?
How accurately can we predict future sales?
Which media contribute to sales?
Which media generate the biggest boost in sales?
How much increase in sales is associated with a given increase in TV
advertising?

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 8 / 36

Module 1: Regression Analysis

Simple Linear Regression

Simple Linear Regression is a straightforward approach for predicting

a quantitative response Y on the basis of a single predictor variable
X . Mathematically, this linear relationship can be expressed as

Y ≈ β0 + β1 X

where β0 and β1 are two unknown constants that represent the

intercept and slope terms in the linear model.
For example, X may represent TV advertising and Y may represent
sales. Then we can regress sales onto TV by fitting the model

sales ≈ β0 + β1 TV

Together, β0 and β1 known as the model coefficients or parameters.

We must use training data/samples to estimate these coefficients.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 9 / 36
Module 1: Regression Analysis
Simple Linear Regression
Once we produce the estimates β̂0 and β̂1 using the training data, we
can predict y given x:
ŷ ≈ β̂0 + β̂1 x.
Let ŷi ≈ β̂0 + β̂1 xi be the prediction for i th value of y based on the
i th value of x. Then
ei = yi − ŷi
represents the i residual. This is the difference between the i th
th
observed response value and the i th predicted response value.
The residual sum of squares (RSS) is defined as
n
X
RSS = ei2 = e12 + e22 + e32 + .... + en2
i
where n is the number of predictions or simply, the number of
samples in the training data.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 10 / 36
Module 1: Regression Analysis

Simple Linear Regression

A random pattern in the residual plot is an indication that a linear

model provides a decent fit to the data.
The least squares approach chooses β̂0 and β̂1 to minimize the RSS.
Using calculus, one can show that the minimizers are
Pn
(xi − x̄)(yi − ȳ )
β̂1 = i Pn 2
and β̂0 = ȳ − β̂1 x̄
i (xi − x̄)

where x̄ = n1 ni xi and ȳ = n1 ni yi are sample means. These β̂0 and

P P

β̂1 are the least squares coefficient estimates for simple linear
regression, and they give the best linear fit on the given training
data.
Figure 2 shows the simple linear regression fit to the Advertising data,
where β̂0 = 7.03 and β̂1 = 0.0475.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 11 / 36
Module 1: Regression Analysis

Simple Linear Regression

Figure 2: Simple linear regression fit to the Advertising data.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 12 / 36

Module 1: Regression Analysis

Question 1.1

Which of the following statements is true about linear regression regarding

outliers?

(a) Linear regression is sensitive to outliers.

(b) Linear regression is not sensitive to outliers.
(c) The impact of outliers on linear regression depends upon the data.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 13 / 36

Module 1: Regression Analysis

Question 1.2
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f (x) of the form f (x) = ax + b which is
parameterized by (a, b). Using squared error as the loss function, which of
the following parameters would you use to model this function.

(a) (4 3)
(b) (5 3)
(c) (5 1)
(d) (1 5)

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 14 / 36

Module 1: Regression Analysis

Question 1.3

For the five training examples given in Question 1.2,

(i) Find the best linear fit.

(ii) Determine the minimum RSS.
(iii) Draw the residual plot for the best linear fit and comment on the
suitability of the linear model to this training data.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 15 / 36

Module 1: Regression Analysis
Multiple Linear Regression
Although simple linear regression is a useful approach for predicting a
response on the basis of a single predictor variable, in practice more
than one predictor variable will be available, and hence simple linear
regression can be extended to multiple linear regression.
Continuing with the same sales prediction example, in the advertising
data, amount of money spent advertising on the radio and in
newspaper are available. Therefore, we can regress sales onto TV,
radio and newspaper by fitting the model
sales ≈ β0 + β1 TV + β2 radio + β3 newspaper
where β0 , β1 , β2 , and β3 are the model coefficients or parameters.
Predicting a quantitative response Y on the basis of a multiple
predictor variables X1 , X2 , ... and Xp can be expressed as
Y ≈ β0 + β1 X1 + β2 X2 + ... + βp Xp
where p is the number of distinct predictor variables.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 16 / 36
Module 1: Regression Analysis
Multiple Linear Regression
Upon estimating β0 , β1 , ... βp using training data/samples, we can
predict y as follows:
ŷ ≈ β̂0 + β̂1 x1 + β̂2 x2 + ... + β̂p xp .
The regression model can be re-stated in matrix form as follows:
X×B =Y
T
where X = [1 X1 X2 ... Xp ], and B = [β̂0 β̂1 β̂2 ... β̂p ] is the
(column) vector form of model coefficients to be estimated. Note
that Y , X1 , X2 , ... Xp are training samples of dimension n × 1.
As in the case of simple linear regression, the least squares approach
can be used to determine the coefficients. The solution is given by
B = X† Y
−1
where X † = (X T X ) X T is the pseudo-inverse of X .
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 17 / 36
Module 1: Regression Analysis

Question 1.4

When you perform multiple linear regression, which among the following
are questions you will be interested in?

(a) Is at least one of the predictors useful in predicting the response?

(b) Do all the predictors help to explain Y , or is only a subset of the
predictors useful?
(c) How well does the model fit the data?
(d) Given a set of predictor values, what response value should we predict,
and how accurate is our prediction?

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 18 / 36

Module 1: Regression Analysis
Model Estimation and Evaluation
Assume that the true relationship between X and Y takes the form
Y = f (X ) + for some unknown function f (X ), where is a
mean-zero random error term. If f (X ) is to be approximated by a
simple linear function, then this linear relationship can be expressed as
Y = β0 + β1 X + .
In the case of Y being a random variable, how accurate is the sample
mean (µ̂) of Y as an estimate of its population mean (µ)? In general,
this question is answered by computing the standard error of µ̂,
expressed as SE(µ̂)
p σ
SE(µ̂) = Var(µ̂) = √
n
p
where n is the size of the training set and σ = Var() is the
standard deviation of each of the realizations yi of Y .
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 19 / 36
Module 1: Regression Analysis
Model Estimation and Evaluation
Assuming the errors i for each observation are uncorrelated with
common variance σ 2 , the standard errors associated with β̂0 and β̂1
can be expressed as
s
1 x̄ 2
SE(β̂0 ) = σ + Pn 2
n i=1 (xi − x̄)
and s
1
SE(β̂1 ) = σ Pn .
i=1 (xi − x̄)2
p
In general, σ = Var() is not known, but can be estimated from the
data. This estimate is known as the residual standard error (RSE),
and is expressed as r
RSS
RSE = .
n−2
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 20 / 36
Module 1: Regression Analysis
Model Estimation and Evaluation
Standard errors can be used to compute confidence intervals.
For simple linear regression, the 95% confidence interval for β0
approximately takes the form
β̂0 ± 2 SE(β̂0 ).
That is, there is approximately a 95% probability that the interval
[β̂0 − 2 SE(β̂0 ) , β̂0 + 2 SE(β̂0 )]
will contain the true value of β0 . Similarly, there is approximately a
95% probability that the interval
[β̂1 − 2 SE(β̂1 ) , β̂1 + 2 SE(β̂1 )]
will contain the true value of β0 .
The word ‘approximately’ is included mainly because: (i) The errors
are assumed to be Gaussian; and (ii) The factor ‘2’ in front of SE(.)
terms will vary slightly depending on ‘n’ in linear regression.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 21 / 36
Module 1: Regression Analysis

Model Estimation and Evaluation

The RSE provides an absolute measure of lack of fit of the model to

the data. A small RSE indicates that the model fits the data well
whereas a large RSE indicates that the model doesn’t fit the data
well. But since it is measured in the units of Y , it is not always clear
what constitutes a good RSE.
The R 2 statistic provides an alternative measure of fit. It takes the
form of a proportion of variance, expressed as
RSS
R2 = 1 −
TSS

where TSS = ni=1 (yi − ȳ )2 is the total sum of squares. Note that
P
R 2 statistic is independent of the scale of Y , and it always takes a
value between 0 and 1.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 22 / 36
Module 1: Regression Analysis

Model Estimation and Evaluation

TSS = ni=1 (yi − ȳ )2 measures the total variance in the response
P
variable Y , and can be interpreted as the amount of variability
inherent in the response before the regression is performed.
TSS - RSS = ni=1 {(yi − ȳ )2 − (yi − ŷi )2 } measures the amount of
P
variability in the response that is removed by performing the
regression, and therefore R 2 measures the proportion of variability in
Y that can be explained using X .
An R 2 statistic that is close to 1 indicates that a large proportion of
the variability in the response has been explained by the regression. A
number close to 0 indicates that the regression did not explain much
of the variability in the response; this might occur because the linear
model is wrong, or the inherent error σ 2 is high, or both.
The R 2 statistic is also a measure of the linear relationship between
X and Y and it is closely related to correlation between X and Y .
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 23 / 36
Module 1: Regression Analysis

Question 1.5
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f (x) of the form f (x) = ax + b which is
parameterized by (a, b).

(a) Find the best linear fit.

(b) Evaluate the standard errors associated with â and b̂.
(c) Determine the 95% confidence interval for a and b.
(d) Compute the R 2 statistic.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 24 / 36

Module 1: Regression Analysis

Model Estimation and Evaluation

Bias-Variance Tradeoff

Bias is the error resulting from simplifying assumptions made by the

model to make the target function easier to approximate.
Variance is the amount that the estimate of the target function will
change given different training data.
Underfitted models have high bias and low variance.
Overfitted models have low bias and high variance.
With an increase in model complexity, bias decreases and variance
increases.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 25 / 36

Module 1: Regression Analysis

Correlation
When comparing two random variables, say x1 and x2 , covariance
Cov(x1 , x2 ) is used to determine how much these two vary together,
whereas correlation Corr(x1 , x2 ) is used to determine whether a
change in one variable will result in a change in another.
For multiple data points, the covariance matrix is given by

(X − m)(X − m)T
C= .
n
where X = [x1 x2 ...] is the data matrix with n columns (each column
is one data point) and m is the mean vector of the data points.
Correlation, a normalized version of the covariance, is expressed as
Cov(x1 , x2 )
Corr(x1 , x2 ) = .
σx1 σx2

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 26 / 36

Module 1: Regression Analysis

Correlation

Both covariance and correlation measure linear relationships between

variables. Examples: relationship between height and weight of
children and relationship between speed and weight of cars, etc.
Since covariance is affected by a change in scale, it can take values
between −∞ and ∞. However, the correlation coefficient always lies
between -1 and 1, and it can be used to make statements and
compare correlations.
When the correlation coefficient is positive, an increase in one
variable results in an increase in the other. When the correlation
coefficient is negative, an increase in one variable results in a decrease
in the other (i.e. the change happens in the opposite direction). A
zero correlation coefficient indicates there is no relationship between
the two variables. Figure 3 shows these three types of relationship.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 27 / 36

Module 1: Regression Analysis
Correlation
In some scenarios, correlation measure may be misleading due to the
existence of a spurious relationship (two variables have no relationship
but wrongly inferred due to either coincidence or the presence of a
certain unseen factor known as confounding factor/lurking variable).

Figure 3: Four-quadrant scatterplots showing 3 types of relationship between 2

random variables. Source: https://acadgild.com/blog/covariance-and-correlation

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 28 / 36

Module 1: Regression Analysis

Time Series Forecasting

Time series modeling deals with the time based data. Time can be
years, days, hours, minutes, etc.
Time series forecasting involves fitting a model on time based data
and using it to predict future observations.
Time series forecasting serves two purposes: understanding the
pattern/trend in the time series data and forecasting/extrapolating
the future values of it. The forecast package in R contains functions
which serve these purposes.
In time series forecasting, the AutoRegressive Integrated Moving
Average (ARIMA) model is fitted to the time series data either to
better understand the data or to predict future points in the series.
Components of a time series are level, trend, seasonal, cyclical and
noise/irregular (random) variations.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 29 / 36

Module 1: Regression Analysis
Time Series Forecasting
Figure 4 shows the forecast of 4 future values of ’AirPassengers’ data
using ARIMA model (available in forecast package).

Figure 4: Forecast from ARIMA(3,1,3) - ’AirPassengers’ data

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 30 / 36

Module 1: Regression Analysis
Autocorrelation
As correlation measures the linear relationship between two variables,
autocorrelation measures the linear relationship between lagged values
of a time series data/variable. The term ’lag’ refers to ’time dealy’.
Figure 5 shows the autocorrelation plot of ’AirPassengers’ data
obtained using Acf() function (available in forecast package).

Figure 5: ACF plot - ’AirPassengers’ data

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 31 / 36
Module 1: Regression Analysis

ANOVA - Analysis of Variance

Analysis of Variance (ANOVA) is a statistical technique for comparing

the means of more than 2 sample groups and deciding whether they
are drawn from the same population or not.
The hypothesis is stated as follows:

H0 : µ1 = µ2 = µ3 = ...
Ha : µ1 6= µ2 6= µ3 6= ...

ANOVA also allows comparision of more than 2 population.

Assumptions made:
(i) Samples are independent and randomly drawn from respective
populations,
(ii) Populations are normally distributed, and
(iii) Variances of the population are equal.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 32 / 36
Module 1: Regression Analysis
ANOVA - Analysis of Variance
Let X denote the data matrix consisting of samples from r groups
such that each column corresponds to one group, X̄ denote the mean
of all the entries in X , x̄j denote the mean of all entries in column-j
and nj denote the number of samples in column-j.
To establish comparison between groups, three variances are
considered. They are Sum-of-Squares-Total (SST ),
Sum-of-Squares-TReatments (SSTR) and Sum-of-Squares-Error
(SSE ):
2
XX
SST = (Xi,j − X̄ )
j i
2
X
SSTR = nj (x̄j − X̄ )
j
XX
SSE = (Xi,j − x̄j )2 .
j i
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 33 / 36
Module 1: Regression Analysis
ANOVA - Analysis of Variance
SST gives the overall variance in the data, SSTR gives the part of
the variation within the data due to differences among the groups,
and SSE gives the part of the variation within the data due to error.
Note that SST = SSTR + SSE .
The ANOVA F-statistic is defined as
MSTR
F =
MSE
where MSTR = SSTR/d.o.f = SSTR/(r − 1) and P
MSE = SSE /d.o.f = SSE /(n − r ). Note that n = j nj is the total
number of samples.
If F-statistic is greater than the critical value, then the null hypothesis
is rejected. The critical value is obtained from the F-distribution table
using parameters such as significance level (α) and degrees of
freedom (d.o.f) of SSTR and SSE.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 34 / 36
Module 1: Regression Analysis

Question 1.6
Assume there are 3 canteens in a college and the sale of an item in those
canteens during first week of February-2021 is as follows:

Table 1: Data for Question 1.6

Canteen A Canteen B Canteen C

40 30 50
60 30 60
70 10 30
30 70 20
50 60 20

Is there a significant difference between the mean sales of the item, at

α = 0.05?
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 35 / 36
Module 1: Regression Analysis

Module-1 Summary

The Advertising dataset example and problem statements

Simple Linear Regression and Multiple Linear Regression
Simple Linear Regression Model - Estimation and Evaluation
Correlation: Measures linear relationship between 2 variables
Time Series Forecasting: Analysis and prediction of time-based data
Autocorrelation: Measures linear relationship between lagged values
ANOVA: Compares more than 2 population (uses F-statistic)

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 36 / 36

CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)

School of Electronics Engineering (SENSE), VIT-Chennai

Email: [email protected]
Handphone No.: +91-9944226963

Winter Semester 2020-21

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 1 / 42

Suggested Readings

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani,

“An Introduction to Statistical Learning with Applications in R”,
Springer Texts in Statistics, 2013 (Facilitator’s Recommendation).

Alpaydin Ethem, “ Introduction to Machine Learning”, 3rd Edition,

PHI Learning Private Limited, 2019.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 2 / 42

Contents

1 Module 2: Classification

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 3 / 42

Module 2: Classification

Topics to be covered in Module-2

Logistic Regression
Bayes’ Theorem for classification
Decision Trees
Bagging, Boosting and Random Forest
Hyperplane for Classification
Support Vector Machines

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 4 / 42

Module 2: Classification

Logistic Regression

Most common problems that occur when we fit a linear regression

model to a particular data set are: (i) non-linearity of the
response-predictor relationships, (ii) outliers and (iii) correlation of
error terms.
Moreover, the linear regression model assumes that the response
variable Y is quantitative (or numerical). But in many situations,Y is
instead qualitative (or categorical).
Consider predicting whether an individual will default on his or her
credit card payment, on the basis of annual income and monthly
credit card balance. Since the outcome is not quantitative, the linear
regression model is not appropriate.
In general, if the response Y falls into one of two categories (Yes or
No), logistic regression is used.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 5 / 42
Module 2: Classification

Logistic Regression

Rather than modeling Y directly, logistic regression models the

probability that Y belongs to a particular category.
For example, in the case of predicting whether an individual will
default on his or her credit card payment on the basis of monthly
credit card balance, logistic regression models the probability of
default as

Pr (default=Yes | balance) = p(balance).

The values of p(balance) ranges from 0 to 1. For any given value of

balance, a prediction can be made for default. For example, one
might predict default=Yes for any individual for whom p(balance)
exceeds a predefined threshold.
Logistic regression uses a logistic function to model this probability.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 6 / 42
Module 2: Classification

Logistic Regression

The logistic function for predicting the probability of Y on the basis

of a single predictor variable X is be expressed as

e β0 +β1 X
p(X ) =
1 + e β0 +β1 X
where β0 and β1 are the model parameters.
To fit the above model (i.e. to determine β0 and β1 ), a method called
maximum likelihood is used.
The estimates β0 and β1 are chosen to maximize the likelihood
function:
Y Y
`(β0 , β1 ) = p(xi ) × (1 − p(xi 0 )).
i:yi =1 i 0 :yi 0 =1

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 7 / 42

Module 2: Classification

Logistic Regression

The logistic function can be manipulated as follows:

p(X )
= e β0 +β1 X
1 − p(X )
p(X )
The quantity 1−p(X ) is called the odds, and can take on any value
odds between 0 and ∞. Values of the odds close to 0 and ∞ indicate
very low and very high probabilities of default, respectively.
Taking logarithm on both sides of the above equation gives log-odds
or logit:
p(X )
loge = β0 + β1 X .
1 − p(X )
The logit of a logistic regression model is linear in X . Note that
loge () is natural logarithm which is usually denoted as ln().
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 8 / 42
Module 2: Classification

Logistic Regression

Logistic regression can be extended to multiple logistic regression (i.e.

to make 2-class prediction based on p predictor variables X1 , X2 , ... ,
Xp ). The logistic function for multiple logistic regression can be
expressed as
e β0 +β1 X1 +β2 X2 +...+βp Xp
p(X ) = .
1 + e β0 +β1 X1 +β2 X2 +...+βp Xp
Model parameters can be chosen to maximize the same likelihood
function as in the case of single predictor variable.
The logit of a multiple logistic regression model will be linear in
{X1 , X2 , ..., Xp }.
Logistic regression can be extended to predict a response variable that
has more than two classes as well. However, for such tasks,
discriminant analysis is preferred.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 9 / 42
Module 2: Classification
Question 2.1
Consider the following training examples
Marks scored: X = [81 42 61 59 78 49]
Grade (Pass/Fail): Y = [Pass Fail Pass Fail Pass Fail]
Assume we want to model the probability of Y of the form
e β0 +β1 x
p(x) = 1+e β0 +β1 x which is parameterized by (β0 , β1 ).

(i) Which of the following parameters would you use to model p(x).
(a) (-119, 2)
(b) (-120, 2)
(c) (-121, 2)
(ii) With the chosen parameters, what should be the minimum mark to
ensure the student gets a ‘Pass’ grade with 95% probability?

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 10 / 42

Module 2: Classification

Bayes’ Theorem for Classification

Bayes’ theorem is used in formulating the optimal classifier. The

classification task is: Given an input x, find the class ωi it belongs to.
Assume there are K ≥ 2 classes: ω1 , ω2 , ... , ωK .
The likelihood function of class k (i.e. the probability that class k has
x in it) is represented as p(x|ωk ) for k = 1, 2, ..., K .
The probability of deciding x belonging to ωk is denoted as p(ωk |x).
This probability distribution is generally unknown and it can be
estimated using Bayes’ theorem:

p(x|ωk ) p(ωk )
p(ωk |x) =
p(x)

where p(ωk ) is the probability of occurrence of class k and p(x) is the

probability of occurance of x. Note that p(x) is independent of k.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 11 / 42

Module 2: Classification

Bayes’ Theorem for Classification

Both p(x|ωk ) and p(ωk ) are apriori probabilities and they can be
estimated using training data. Using these apriori probabilities, the
posterior probability p(ωk |x) or its equivalent can be estimated.
The decision function for Bayes’ classifier is
K
X
dj (x) = − Lkj p(x|ωk ) p(ωk )
k=1

where Lkj is the loss/penalty due to misclassification. In general, Lkj

takes a value between 0 and 1. Since p(x) is independent of k, it
becomes a common term and hence it is not included in dj (x).
The decision is stated as follows:
x → ωi if di = max{dj } for j = 1, 2, ..., K
j

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 12 / 42

Module 2: Classification

Question 2.2
Assume A and B are Boolean random variables (i.e. they take one of the
two possible values: True and False).
Given: p(A = True) = 0.3, p(A = False) = 0.7,
p(B = True|A = True) = 0.4, p(B = False|A = True) = 0.6,
p(B = True|A = False) = 0.6, p(B = False|A = False) = 0.4.

Calculate p(A = True|B = False) by appyling Bayes’ rule.

Hint: Use the relation

p(B = False) = p(B = False|A = True) × p(A = True)

+ p(B = False|A = False) × p(A = False).

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 13 / 42

Module 2: Classification
Bayes’ Theorem for Classification
Naive Bayes’ Classifier (NBC) assumes conditional independence.
Two random variables A and B are said to be conditionally
independent given another random variable C if
p(A ∩ B|C ) = p(A, B|C ) = p(A|C ) × p(B|C ).
This implies, as long as the value of C is known and fixed, A and B
are independent. Equivalently, p(A|B, C ) = p(A|C ).
NBC is termed naive because of this strong assumption which is
unrealistic (for real data), yet very effective.
The joint probability distribution of n random variables A1 , A2 , ..., An
can be expressed as a product of n localized probabilities:
n
Y
p(∩nk=1 Ak ) = p(Ak | ∩k−1
j=1 Aj ).
k=1

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 14 / 42

Module 2: Classification
Bayes’ Theorem for Classification
Consider the Bayesian network in Figure 1. It is a directed acyclic
graph in which each edge corresponds to a conditional dependency,
and each node corresponds to a unique random variable.
The network has 4 nodes: Cloudy, Sprinkler, Rain and WetGrass.
Since Cloudy has an edge going into Rain, it means that
p(Rain|Cloudy) will be a factor, whose probability values are specified
next to the Rain node in a conditional probability table.
Note that Sprinkler is conditionally independent of Rain given
Cloudy. Therefore,

p(Sprinkler|Cloudy, Rain) = p(Sprinkler|Cloudy).

Using the relationships specified by the Bayesian network, the joint

probability distribution can be obtained as a product of n factors (i.e.
n probabilities) by taking advantage of conditional independence.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 15 / 42
Module 2: Classification

Figure 1: Bayesian network - example 1

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 16 / 42

Module 2: Classification

Question 2.3

(a) Consider the Bayesian network in Figure 1. Evaluate the following

probability distribution functions:
(i) p(Cloudy = True, Sprinkler = True, Rain = False, WetGrass = True)
(ii) p(Cloudy = True, Sprinkler = False, Rain = True, WetGrass = True)

(b) Consider the Bayesian network in Figure 2. Evaluate the following

probability distribution functions:
(i) p(a = 1, b = 0, c = 1, d = 1, e = 0)
(ii) p(a = 1, b = 1, c = 2, d = 0, e = 1)
(iii) p(a = 1, b = 1, c = 2, d = 0)

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 17 / 42

Module 2: Classification

Figure 2: Bayesian network - example 2

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 18 / 42

Module 2: Classification

Decision Trees
A decision tree is a hierarchical model for supervised learning. It can
be applied to both regression and classification problems.
A decision tree consists of decision nodes (root and internal) and leaf
nodes (terminal). Figure 3 shows a data set and its classification tree
(i.e. decision tree for classification).
Given an input, at each decision node, a test function is applied and
one of the branches is taken depending on the outcome of the
function. The test function gives discrete outcomes labeling the
branches (say for example, Yes or No).
The process starts at the root node (topmost decision node) and is
repeated recursively until a leaf node is hit. Each leaf node has an
output label (say for example, Class 0 or Class 1).
During the learning process, the trees grows, branches and leaf nodes
are added depending on the data.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 19 / 42
Module 2: Classification

Figure 3: Data set (left) and the corresponding decision tree (right) - Example of
a classification tree.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 20 / 42

Module 2: Classification

Decision Trees

Decision trees do not assume any parametric form for the class
densities and the tree structure is not fixed apriori. Therefore, a
decision tree is a non-parametric model.
Different decision trees assume different models for the test function,
say f (·). In a decision tree, the assumed model for f (·) defines the
shape of the classified regions. For example, in Figure 3, the test
functions define ‘rectangular’ regions.
In a univariate decision tree, the test function in each decision node
uses only one of the input dimensions.
In a classification tree, the ‘goodness of a split’ is quantified by an
impurity measure. Popular among them are entropy and Gini index. If
the split is such that, for all branches, all the instances choosing a
branch belong to the same class, then it is pure.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 21 / 42

Module 2: Classification

Question 2.4

What is specified at any non-leaf node in a decision tree?

(a) Class of instance (Class 0 or Class 1)

(b) Data value description
(c) Test function/specification
(d) Data process description

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 22 / 42

Module 2: Classification

Advantages of Decision Trees

Fast localization of the region covering an input - due to hierarchical

placement of decisions. If the decisions are binary, it requires only
log2 (b) decisions to localize b regions (in the best case). In the case
of classification trees, there is no need to create dummy variables
while handling qualitative predictors.
Easily interpretable (in graphical form) and can be converted to easily
understandable IF-THEN rules. To some extent, decision trees mirror
human decision-making. For this reason, decision trees are sometimes
preferred over more accurate but less interpretable methods.

Disadvantages of Decision Trees

Greedy learning approach - they look for the best split at each step.
Low prediction accuracy compared to methods like regression.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 23 / 42

Module 2: Classification

Bagging, Boosting and Random Forest

Since the prediction accuracy of a decision tree is low (due to high

variance), techniques like bagging, random forests, and boosting
aggregate many decision trees to construct more powerful prediction
models.
Bagging creates multiple copies of the original training data using
the bootstrap (i.e. random sampling), fits a separate decision tree to
each copy, and then combines all of the trees in order to create a
single, powerful prediction model. Each tree is independent of the
other trees.
Boosting works in a way similar to bagging, except that the trees are
grown sequentially. Boosting does not involve random sampling;
instead each tree is grown using information from previously grown
trees (i.e. fit on a modified version of the original training data).

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 24 / 42

Module 2: Classification

Bagging, Boosting and Random Forest

As in bagging, random forests build a number of decision trees on
bootstrapped training data. While building these trees, for each split,
a random sample of m predictors is chosen as split candidates from
the full set of p predictors and one among these m is used.
Suppose that there is one very strong predictor in the data set, along
with a number of other moderately strong predictors. In this case,
bootstrap aggregation (i.e. bagging) will not lead to a substantial
reduction in variance over a single tree.
Since in random forests only m out of p predictors are considered for
each split, on average p−mp of the splits will not even consider the
strong predictor, and therefore other predictors stand a chance. This
decorrelation process reduces the variance in the average of the
resulting trees and hence improves the reliability and the prediction
√
accuracy. Typically, m = p.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 25 / 42
Module 2: Classification

Question 2.5

Using a small value of m in building a random forest will typically be

helpful when

(a) the number of correlated samples is zero

(b) the number of correlated samples is small
(c) the number of correlated samples is large
(d) all predictors in the data set are moderately strong

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 26 / 42

Module 2: Classification
Hyperplane for Classification
A hyperplane is a flat subspace of dimension p-1, in a p-dimensional
space. It is mathematically defined as
α0 + α1 X1 + α2 X2 + ... + αp Xp = 0.
The set of points X = {X1 , X2 , ...Xp } (i.e. vectors of length p)
satisfying the above equation lie on the hyperplane.
Suppose that,
α0 + α1 X1 + α2 X2 + ... + αp Xp > 0.
This shows the set of points lie on one side of the hyperplane.
On the other hand, if
α0 + α1 X1 + α2 X2 + ... + αp Xp < 0,
then the set of points lie on the other side of the hyperplane.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 27 / 42
Module 2: Classification

Hyperplane for Classification

In a 2-dimensional space (i.e. for p = 2), a hyperplane is a line

dividing the space into two halves. Figure 4 shows the hyperplane
1 + 2X1 + 3X2 = 0 dividing a 2-dimensional space into two. Similarly
for p = 3, a hyperplane is a plane dividing the 3-dimensional space
into two halves. In p > 3 dimensions, it becomes hard to visualize a
hyperplane but the notion of dividing p-dimensional space into two
halves still applies.
Consider a training data X of dimension n × p (i.e. a n × p data
matrix consisting of n training observations in p-dimensional space) in
which each of the observations fall into two classes, say Class -1 and
Class 1. Now, given a test observation x ? (i.e. a vector of p features
or variables), the concept of a separating hyperplane can be used to
develop a classifier that will correctly classify x ? .

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 28 / 42

Module 2: Classification

Figure 4: The hyperplane (i.e. line) 1 + 2X1 + 3X2 = 0 in a 2-dimensional space.

Blue region: set of points satisfying 1 + 2X1 + 3X2 > 0. Purple region: set of
points satisfying 1 + 2X1 + 3X2 < 0

.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 29 / 42
Module 2: Classification

Hyperplane for Classification

If the class labels for Class -1 and Class 1 are yi = −1 and yi = 1,
respectively, then the separating hyperplane has the property that
yi (α0 + α1 xi,1 + α2 xi,2 + ... + αp xi,p ) > 0.
for all i = 1, 2, ..., n.
If there exists a hyperplane that separates the training observations
perfectly according to their class labels, then x ? can be assigned a
class depending on which side of the hyperplane it is located.
As shown in Figure 5, a classifier based on a separating hyperplane
leads to a linear boundary, and there can be more than one separating
hyperplane. The separating hyperplane that is farthest from the
training observations is considered for classification. It is called
optimal separating hyperplane or maximal margin hyperplane.
Figure 6 shows one such hyperplane.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 30 / 42
Module 2: Classification

Figure 5: Two classes of observations (shown in purple and blue), each having
two features/variables, and three separating hyperplanes.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 31 / 42

Module 2: Classification

Figure 6: Two classes of observations (shown in purple and blue), each having
two features/variables, and the optimal separating hyperplane or the maximal
marigin hyperplane
Facilitator: Dr Sathiya Narayanan S .
VIT-Chennai - SENSE Winter Semester 2020-21 32 / 42
Module 2: Classification

Hyperplane for Classification

Let M represent the marigin of the hyperplane. The maximal marigin
hyperplane is the solution to the following optimization problem:

maximizeα0 ,α1 ,...,αp M

p
X
subject to αj2 = 1,
j=1
p
X
yi α0 + αj xij ≥ M for all i = 1, 2, ..., n.
j=1

The two constraints in the above optimization problem ensures that:

(i) each training observation is in the correct side of the hyperplane;
and (ii) each observation is located at least a distance M from the
hyperplane.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 33 / 42
Module 2: Classification

Hyperplane for Classification

As shown in Figure 7, addition of a single observation leads to a
dramatic change in the maximal marigin hyperplane. Such highly
sensitive hyperplanes are problematic in the sense that they may
overfit the training data.
Consider a hyperplane that does not perfectly separate the two
classes, in the interest of: (i) robustness to individual observations;
and (ii) better classification of most of the training observations.
Classifier based on such hyperplane is called support vector classifier
(SVC) or soft marigin classifier.
The underlying assumption is, allowing misclassification of a few
training observations will result in a better classification of the
remaining observations.
The SVC is a natural approach for two-class classification, if the
boundary between the two classes is linear.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 34 / 42
Module 2: Classification

Figure 7: Two classes of observations (shown in purple and blue), each having
two features/variables, and two separating hyperplanes.

.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 35 / 42
Module 2: Classification

Hyperplane for Classification

The hyperplane for SVC is the solution to the following optimization

problem:

maximizeα0 ,α1 ,...,αp ,1 ,2 ,...,n M

p
X
subject to αj2 = 1,
j=1
p
X
yi α0 + αj xij ≥ M(1 − i ),
j=1
n
X
i ≥ 0, i ≤ C ,
i=1

for all i = 1, 2, ..., n and C is a non-negative tuning parameter.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 36 / 42
Module 2: Classification
Support Vector Machines
In real-world data, the class boundaries are often non-linear (as shown
in Figure 8) and in such scenarios, SVC or any linear classifier will
perform poorly.
In the case of SVC, in computing its coefficients, only inner products
are required. This inner product can be generalized as K (xi , xi 0 ),
where K is some function and it will be referred as a kernel. A linear
kernel will give back the SVC.
To handle non-linear boundaries, a polynomial kernel of degree d
(where d is a positive integer) is required. Using such a kernel with
d > 1 leads to a more flexible decision boundary compared to that of
a SVC. When the SVC is combined with a non-linear kernel, the
resulting classifier is known as a support vector machine (SVM).
Therefore, SVM is an extension of the SVC that enlarges the feature
space using polynomial kernels of degree d > 1, to handle non-linear
boundaries.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 37 / 42
Module 2: Classification

Figure 8: Two classes of observations (shown in purple and blue), with a

non-linear boundary separating them.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 38 / 42

Module 2: Classification

Support Vector Machines

The hyperplane for SVM using polynomial kernel of degree d = 2 is

the solution to the following optimization problem:

maximizeα0 ,α11 ,α12 ...,αp1 ,αp2 ,...,1 ,2 ,...,n M

p X
X 2
2
subject to αjk = 1,
j=1 k=1
p
X p
X
2
yi α0 + αj1 xij + αj2 xij ≥ M(1 − i ),
j=1 j=1
n
X
i ≥ 0, i ≤ C , for all i = 1, 2, ..., n.
i=1

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 39 / 42

Module 2: Classification

Support Vector Machines

A radial kernel or radial basis function (RBF) is a popular non-linear
kernel used in SVMs. It takes the form
p
X
2
K (xi , xi 0 ) = exp − γ (xij − xi 0 j )
j=1

where γ is a positive constant. For a test observation x ? that is far

from a training observation xi , the value of K (xi , xi 0 ) will be tiny.
Therefore, the radial kernel has a local behavior, in the sense that
only nearby observations have an effect on the predicted class labels.
Figure 9 shows an example of an SVM with a radial kernel on a
non-linear data.
Usage of kernels (instead of simply expanding the feature space) in
SVM is computationally advantageous. A kernel-based approach
requires computation of K (xi , xi 0 ) for all n(n−1)
2 distinct pairs i and i 0 .
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 40 / 42
Module 2: Classification

Figure 9: SVM with a radial kernel, on a non-linear data.

.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 41 / 42
Module 2: Classification

Module-2 Summary

Logistic regression: Modeling the probability that the response Y

belongs to a particular category, using a logistic function, on the basis
of single or multiple variables.
Bayes’ theorem for classification: Bayes’ classifier using conditional
independence
Decision trees and random forests: A non-parametric,
‘information-based learning’ approach which is easy to interpret.
Hyperplane for classification: maximal marigin classifier and SVC.
Support Vector Machines (SVMs): Extension of SVC to handle
‘non-linear boundaries’ between classes. Uses kernels for
computational efficiency. RBF kernel exhibits ‘local behavior’.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 42 / 42

CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)

School of Electronics Engineering (SENSE), VIT-Chennai

Email: [email protected]
Handphone No.: +91-9944226963

Winter Semester 2020-21

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 1 / 17

Suggested Readings

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani,

“An Introduction to Statistical Learning with Applications in R”,
Springer Texts in Statistics, 2013 (Facilitator’s Recommendation).

Alpaydin Ethem, “ Introduction to Machine Learning”, 3rd Edition,

PHI Learning Private Limited, 2019.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 2 / 17

Contents

1 Module 3: Clustering

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 3 / 17

Module 3: Clustering

Topics to be covered in Module-3

Introduction to Clustering
K -Means Clustering
K -Medoids Clustering
Hierarchical Clustering
Applications of Clustering

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 4 / 17

Module 3: Clustering

Introduction to Clustering

Clustering algorithms group samples/data points/features/objects

into clusters by natural association according to some similarity
measures (say Euclidean distance).
Clustering serves two purposes: (i) understanding the structure in the
data (i.e. data exploration) and (ii) finding the similarities between
instances (data points) and thus grouping them.
After grouping the data points, the groups can be named and their
attributes can be defined (using domain knowledge). This paves the
way for supervised learning. In this case, clustering becomes a part of
preprocessing stage.
In most cases, labelling the data is costly and therefore, preceding a
supervised learning (regression or classification) with unsupervised
learning (clustering) is advantageous.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 5 / 17

Module 5: Clustering

K -Means Clustering

Partitioning based clustering methods group data points based on

their similarity and characteristics.
K -means and K -medoids (partitioning around medoids) are the two
popular partitioning based methods.
K -means algorithm partitions the data points into K clusters and the
value of K should be known apriori (i.e. it needs to be specified
beforehand).
K -means algorithm is based on the minimization of the sum of
squared distances.
The K -means procedure attempts to form clusters such that the
intracluster similarity is high and intercluster similarity is low. The
similarity of a cluster is determined based on the centroid (i.e. mean
value) of the data points in it.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 6 / 17

Module 3: Clustering

K -Means Clustering - Procedure

Step 1: Initialize the iteration count: n = 1; and arbitrarily choose K
samples as initial cluster centres: z1 (n), z2 (n), ..., and zK (n).
Step 2: Distribute the pattern samples x among the K clusters according
to the following rule:

x ∈ Gi (n) if kx − zi (n)k < kx − zj (n)k for j = 1, 2, ..., K ; j 6= i.

Step 3: Compute zi (n + 1) for i = 1, 2, .., K :

1 X
zi (n + 1) = x
Ni
x∈Gi (n)

where Ni is the number of pattern samples assigned to class Gi (n).

Step 4: If ∀i zi (n + 1) = zi (n), then STOP. Otherwise go to Step 2.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 7 / 17

Module 3: Clustering

K -Means Clustering

Advantages of K -means clustering: (i) simple and efficient ; and (ii)

low computational complexity.
Drawbacks: (i) K must be known/decided; and (ii) final clusters
usually depend on the order of presentation of training samples and
the initial cluster centres.

Question 3.1
Apply K -means clustering to cluster the following samples/data points:
(0,0), (0,1), (1,0), (3,3), (5,6), (8,9), (9,8) and (9,9).
Fix K = 2 and choose (0,0) and (5,6) as the initial cluster centres.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 8 / 17

Module 3: Clustering
K -Means Clustering
The ‘elbow method’ (a heuristic approach) for determining K : Plot the
‘explained variation’ (say, the ‘distortion’) as a function of K , and pick the
‘elbow’ or ’knee’ of the curve as the value of K (as shown in Figure 1).

Figure 1: Elbow method using distortion.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 9 / 17
Module 5: Clustering

K -Medoids Clustering
In K -medoids clustering, each cluster is represented by a cluster
medoid which is one among the data points in the cluster.
The medoid of a cluster is defined as a data point in the cluster
whose average dissimilarity to all the other data points in the cluster
is minimal. As ‘medoid’ is the most centrally located point in the
cluster, the cluster representatives can be interpreted in a better way
(compared to K -means).
In K -medoids can use arbitrary dissimilarity measures, whereas
K -means generally requires Euclidean distance for better
performance. In general, K -medoids use Manhattan distance and
minimizes the sum of pairwise dissimilarities.
As in the case of K -means, the value of K needs to be specified
beforehand. An heuristic approach, the ‘silhouette method’ , can be
used for determining the optimal value of K .
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 10 / 17
Module 5: Clustering

K -Medoids Clustering

As the K -medoids clustering problem is NP-hard to solve exactly,

many heuristic approaches/solutions exist. The most common
approach is the Partitioning Around Medoids (PAM) algorithm.
PAM algorithm has 2 phases: BUILD phase and SWAP phase.
The BUILD phase greedily selects K data points from the available
data points and initialize them as cluster medoids.
The SWAP phase associates each data point to the closest medoid
and SWAPS a cluster medoid with a non-medoid data point in the
cluster if the cost of the configuration decreases.
PAM is faster than exhaustive search and being a greedy search, it
may not find the optimum solution.
K -medoids clustering is more robust (i.e. less sensitive to outliers and
noise) compared to K -means.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 11 / 17
Module 3: Clustering
Hierarchical Clustering
Hierarchical clustering procedure groups similar data points into
clusters. It can be performed with either raw data (i.e. data points)
or a similarity matrix. When raw data is provided, a similarity matrix
S should be computed:
1
Si,j =
di,j + 1
where di,j is the Euclidean distance between data points i and j. In
the recent years, many other distance metrics have been developed.
An agglomerative hierarchical clustering is an iterative, 2-step
procedure that starts by considering each data point as a separate
cluster. In each iteration, it executes the following steps: (i)
identifying the two clusters that are closest together, and (ii) merging
the two most similar clusters. This iterative process continues until all
the clusters are merged together.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 12 / 17
Module 3: Clustering

Hierarchical Clustering
Click here for more details (particularly an illustration) on hierachical
clustering.
The dendrogram obtained at the end of hierarchical clustering shows
the hierarchical relationship between the clusters.
After completing the merging step, it is necessary to update the
similarity matrix. The updation can be based on (i) the two most
similar parts of a cluster (single-linkage), (ii) the two least similar bits
of a cluster (complete-linkage), or (iii) the center of the clusters
(mean or average-linkage). Refer Figure 2.
The choice of similarity or distance metric and the choice of linkage
criteria are always application-dependent.
Hierarchical clustering can also be done by initially treating all data
points as one cluster, and then successively splitting them. This
approach is called the divisive hierarchical clustering.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 13 / 17
Module 3: Clustering

Hierarchical Clustering

Figure 2: Three linkage types used in hierarchical clustering. Source:

https://www.dexlabanalytics.com/blog/hierarchical-clustering-foundational-
concepts-and-example-of-agglomerative-clustering

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 14 / 17

Module 3: Clustering

Question 3.2
Consider the similarity matrix given below.

Determine the hierarchy of clusters created by

(a) the single-linkage clustering algorithm, and
(b) the complete-linkage clustering algorithm.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 15 / 17

Module 3: Clustering

Applications of Clustering

Clustering analysis finds its application in market research, image

processing, etc.
Consider Customer Relationship Management for example.
Assume the customers of a company are defined in terms of their
demographic attributes and transactions with the company. If those
customers are grouped into K clusters, then a better understanding of
customer base is possible. Based on this understanding the company
shall adapt different strategies for different types of customers. The
company shall also identify unique customers (those who don’t fall in
any large group) and develop strategies for them. For example,
’churning’ customers who require immediate attention.
In Image Segmentation, clustering can be used to group the pixels
that ‘belong together’ (say, foreground and background pixels).

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 16 / 17

Module 3: Clustering

Module-3 Summary

Clustering: grouping data points based on similarity measures

Partitioning based methods: K -means and K -medoids
The elbow method and the silhouette method: heuristic approaches
for deciding the optimal value of K for partition based methods
Hierarchical clustering: single-linkage, complete-linkage, etc.
Dendogram: shows the hierarchical relationship between the clusters
Applications of clustering: customer relationship management, image
segmentation, etc.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 17 / 17

CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)

School of Electronics Engineering (SENSE), VIT-Chennai

Email: [email protected]
Handphone No.: +91-9944226963

Winter Semester 2020-21

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 1 / 20

Contents

1 Module 4: Optimization

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 2 / 20

Module 4: Optimization

Topics to be covered in Module-4

Introduction to Optimization
Gradient Descent
Variants of Gradient Descent
Momentum Optimizer
Nesterov Accelerated Gradient
Adagrad
Adadelta
RMSProp
Adam
AMSGrad

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 3 / 20

Module 4: Optimization

Introduction to Optimization
Optimization is the process of maximizing or minimizing a real
function by systematically choosing input values from an allowed set
of values and computing the value of the function.
It refers to usage of specific methods to determine the best solution
from all feasible solutions, say for example, finding the best functional
representation and finding the best hyperplane to classify data.
Three components of an optimization problem: objective function
(minimzation or maximization), decision variables and constraints.
Based on the type of objective function, constraints and decision
variables, several types of optimization problems exists. An
optmization can be linear or non-linear, convex or non-convex,
iterative or non-iterative, etc.
Optimization is considered as one among the three pillars of data
science. Linear algebra and statistics are the other two pillars.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 4 / 20
Module 4: Optimization

Introduction to Optimization
Consider the following optimization problem which attempts to find
the maximal marigin hyperplane with marigin M:

maximizeα0 ,α1 ,...,αp M (1)

p
X
subject to αj2 = 1, (2)
j=1
p
X
yi α0 + αj xij ≥ M for all i = 1, 2, ..., n. (3)
j=1

Equation (1) is the objective function, equations (2) and (3) are the
constraints, and α0 , α1 , ..., αp are the decision variables.
In general, an objective function is denoted as f (·) and minimizer of
f (·) is same as the maximizer of −f (·).
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 5 / 20
Module 4: Optimization

Gradient Descent
Gradient Descent is the most common optimization algorithm in
machine learning and deep learning.
It is a first-order, iterative-based optimization algorithm which only
takes into account the first derivative when performing the updates
on the parameters.
In each iteration, there are 2 steps: (i) finding the (locally) steepest
direction according to the first derivative of an objective function; and
(ii) finding the best point in the line. The parameters are updated in
the opposite direction of the gradient of the objective function.
The learning rate determines the convergence (i.e. the number of
iterations required to reach the local minimum). It should neither be
too small nor too large. Very small α leads to very slow convergance
and a very large value leads to oscillations around the minima or may
even lead to divergence.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 6 / 20
Module 4: Optimization

Gradient (Steepest) Descent

Let f (X ) denote the objective function and X0 denote the starting
point. In iteration k, the best point is given by

Xk = Xk−1 − αGk−1

where α is the learning rate (step length) and

Gk−1 = Of (X ) = f 0 (X ) is the derivative of f (X ) (search direction).
Consider for example, f (X ) = x1 + 2x12 + 2x1 x2 + 3x22 , α = 0.1 and

0.5
X0 = .
0.5

In this case,
0 1 + 4x1 + 2x2
f (X ) = .
2x1 + 6x2

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 7 / 20

Module 4: Optimization

Gradient (Steepest) Descent (contd.)

In the first iteration, the direction G0 and the best point X1 are
estimated as follows:

0 4 0.1
G0 = f (X0 ) = and X1 = X0 − αG0 = .
4 0.1

Similarly, In the next iteration,

0 1.6 −0.6
G1 = f (X1 ) = and X2 = X1 − αG1 = .
0.8 0.02

The iterations continue till convergence. The parameter α plays a

significant role in both convergence and stability. Figure 1 shows a
sample plot of sequence of estimated points.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 8 / 20

Module 4: Optimization

Figure 1: Steepest descent - convergence plot. Source: Mishra S.K., Ram B.

(2019) Steepest Descent Method. In: Introduction to Unconstrained
Optimization with R. Springer, Singapore.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 9 / 20
Module 4: Optimization

Variants of Gradient Descent

There are three variants of gradient descent based on the amount of data
(samples) considered for computing the gradient at each iteration.
1 Batch Gradient Descent: The parameter update step involves
summing up all data samples. It has straight trajectory towards the
minimum and its convergence is guaranteed.
2 Mini-Batch Gradient Descent: The parameter update involves
summing up lower number of samples based on batch size. It is faster
than batch gradient descent but convergence is not guaranteed.
3 Stochastic Gradient Descent: The parameter update is done
sample-wise. It has less generalization error compared to mini-batch
gradient descent but the run time is more.
Therefore, there exists a gradient accuracy - time complexity tradeoff
between these variants.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 10 / 20
Module 4: Optimization

Question 4.1

Apply gradient descent approach to minimize the function:

f (X ) = 4 x12 + 3 x1 x2 + 2.5 x22 − 5.5 x1 − 4 x2 .

Assume the step size is 0.135 and the starting point is

x1 (0) 2
X0 = = .
x2 (0) 2

Let the stopping criteria be the absolute difference between the function
values in successive iterations less than 0.005. Your answer should show
the search direction and the value of the function in each iteration.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 11 / 20

Module 4: Optimization

Momentum Optimizer

In gradient descent approach, the biggest challenge lies in choosing a

proper learning rate α. In addition, there are challenges such as
non-convex error functions (quite common in neural networks) getting
struck at their suboptimal local minima.
To circumvent these challenges, several optimization algorithms were
proposed and used by the deep learning community. Notable among
them are momentum, Nesterov accelerated gradient, Adagrad,
Adadelta, RMSprop, and AMSGrad.
As indicated earlier, gradient descent approach (say the stochastic
gradient descent) with improper α value might lead to oscillations
around the minima. Momentum optimizer attempts to dampen these
oscillations by accelerating the stochastic gradient descent in the
relevant direction.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 12 / 20

Module 4: Optimization

Momentum Optimizer

Momentum optimizer accomplishes the task by adding a fraction γ of

the update vector of the past iteration to the current update vector:

wk = wk−1 − [γ vk−2 + α f 0 (wk−1 )]

= wk−1 − γ vk−2 − α f 0 (wk−1 )

where the term vk−2 = wk−2 − wk−1 = γ vk−3 + α f 0 (wk−2 ) refers to

the update vector in the previous iteration.
Two forces act on the parameter to be updated in an iteration: the
gradient force (α f 0 (wk−1 )) and the momentum force (γ vk−2 ).
The momentum term γ vk−2 decreases where there is a change in
gradient direction(s) and increases when there is no change in
direction(s). Therefore, this approach dampens oscillations and leads
to faster convergence.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 13 / 20
Module 4: Optimization
Nesterov Accelerated Gradient
Nesterov Accelerated Gradient (NAG) attempts to use the
momentum more effectively compared to momemtum optimizer.
Given the fact that wk−1 − γ vk−2 gives a rough approximation of wk ,
the search direction (i.e. gradient) is computed with respect to
anticipated current update wk−1 − γ vk−2 instead of previous update
wk−1 . The current update vector is expressed as follows:
wk = wk−1 − [γ vk−2 + α f 0 (wk−1 − γ vk−2 )]
= wk−1 − γ vk−2 − α f 0 (wk−1 − γ vk−2 )
This anticipatory update in NAG improves the performance of
gradient descent further. Click here for more details.
Both momentum optimizer and NAG require two hyper-parameters (γ
and α) to be set manually. These parameters decide the learning rate.
These two optimizers use same learning rate for all dimensions which
is not proper.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 14 / 20
Module 4: Optimization
Adagrad
Adaptive Gradient (Adagrad) optimizer adaptively scales the learning
rate for different dimensions. For a parameter, the scale factor is
inversely proportional to the square root of sum of historical squared
values of gradient. The update rule is:
α
wk (i) = wk−1 (i) − p Gk−1 (i)
Rk−1 (i, i) +
where Rk−1 is a diagonal matrix with each diagonal element i, i being
the sum of squares of the gradients with respect to w (i) upto time
step k − 1, and is the smoothing term (usually 10−8 ).
The learning rate reduces faster for parameters showing large slope.
Adagrad does not require manual tuning of hyper-parameters.
It converges rapidly when applied to convex functions. In the case of
non-convex functions, the learning rate becomes too small and
therefore, at some point, the model may stop learning.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 15 / 20
Module 4: Optimization
Adadelta
Adadelta, an extension of Adagrad, attempts to resolve Adagrad’s
issue - radically diminishing learning rates. It limits the window of
accumulated gradients to some fixed size.
Instead of storing the previous squared gradients, the sum of
gradients is recursively defined as a decaying average of all past
squared gradients. The update becomes:
α
wk = wk−1 − p Gk−1 (4)
2
E [G ]k−1 +
where E [G 2 ]k−1 = βE [G 2 ]k−2 + (1 − β)Gk−1
2 .
p
The term E [G 2 ]k−1 + is the Root-Mean-Square (RMS) of the
gradient. Adadelta, further replaces α term in the numerator with the
RMS of the previous update. Therefore, there is no need to set the
value of α.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 16 / 20
Module 4: Optimization

RMSProp

Both Adadelta and RMSProp have been developed independently

around the same time.
RMSProp is same as the first update of Adadelta (given as Equation
(4) in previous slide):
α
wk = wk−1 − Gk−1 .
RMS[G ]k−1

Like Adadelta, it uses exponentially decaying average of squared

gradient and discards history from the extreme past.
It converges rapidly once it finds a locally convex bowl. It behaves
like Adagrad initialized within that convex bowl.
RMSProp is very effective for mini-batch gradient descent learning.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 17 / 20

Module 4: Optimization

Adam
Adaptive Moment estimation (Adam) combines RMSProp and
Momentum.
It incorporates the momentum term (i.e. first moment with
exponential weighting of the gradient) in RMSProp as follows:
α
wk = wk−1 − p m̂k−1
v̂k−1 +
where m̂k−1 and v̂k−1 are bias-corrected versions of mk−1 (first
moment) and vk−1 (second moment) respectively. The first and
second moments are:

mk−1 = β1 mk−2 + (1 − β1 )Gk−1

2
vk−1 = β2 vk−2 + (1 − β2 )Gk−1 .

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 18 / 20

Module 4: Optimization

AMSGrad

In situations where some mini-batches provide large and informative

gradients, Adam converges to a suboptimal solution. This is due to
the fact that the exponential averaging diminishes the influence of
such rarely occuring mini-batches, which leads to poor convergence.
AMSGrad updates the parameters by considering the maximum of
past squared gradients rather than the exponential average. The
update rule is:
α
wk = wk−1 − p mk−1 .
MAX(ṽk−2 , vk−1 ) +

Note that bias-correction is not considered.

AMSGrad results in a non-increasing step size. This resolves the
problem suffered by Adam.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 19 / 20

Module 4: Optimization
Module-4 Summary
Introduction to Optimization: three components
Gradient Descent: first-order, iterative-based optimization algorithm
Variants of Gradient Descent: batch gradient descent, mini-batch
gradient descent and stochastic gradient descent
Momentum Optimizer: accelerates the stochastic gradient descent in
the relevant direction - NAG uses the momentum term for
anticipatory update
Adagrad: adaptively scales learning rate for different dimension
Adadelta: sum of gradients recursively defined as the decaying
average of past gradients
RMSProp: same first update of Adadelta
Adam: combination of RMSProp and momentum
AMSGrad: considers the maximum of past squared gradients
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 20 / 20
CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)

School of Electronics Engineering (SENSE), VIT-Chennai

Email: [email protected]
Handphone No.: +91-9944226963

Winter Semester 2020-21

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 1 / 15

Suggested Readings

https://www.sscnasscom.com/qualification-pack/SSC/Q2101/
(For Modules 5, 6 & 7).

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 2 / 15

Contents

1 Module 5: Managing Health and Safety

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 3 / 15

Module 5: Managing Health and Safety

Topics to be covered in Module-5

Performance Criteria
Basic Workplace Safety Guidelines
Types of Accidents in Workplace
Types of Emergencies in Worksplace
Hazards

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 4 / 15

Module 5: Managing Health and Safety
Performance Criteria (PC)
PC1 Comply with your organization’s current health, safety and security
policies and procedures
PC2 Report any identified breaches in health, safety, and security policies
and procedures to the designated person
PC3 Identify and correct any hazards that you can deal with safely,
competently and within the limits of your authority
PC4 Report any hazards that you are not competent to deal with to the
relevant person and warn other people who may be affected
PC5 Follow your organization’s emergency procedures promptly, calmly,
and efficiently
PC6 Identify and recommend opportunities for improving health, safety,
and security to the designated person
PC7 Complete any health and safety records legibly and accurately
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 5 / 15
Module 5: Managing Health and Safety
Basic Workplace Safety Guidelines
Fire safety: Employees should be aware of all emergency exits
(including fire escape routes) of the office building and also the
locations of fire extinguishers and alarms.
Falls and slips: All things must be arranged properly. There should
be proper lighting in all areas (including stairways). Any spilt liquid,
food or other items must be promptly cleaned to avoid any accidents.
First Aid: First-aid kits should be kept in places that can be reached
quickly and these locations should be known to all employees. These
kits should contain all the important items for first aid, for example,
items to deal with cuts, burns, muscle cramps, etc.
Electrical Safety: Electrical engineers and staffs should carry out
routine inspections of all wiring to make sure there are no damaged or
broken wires. Employees must be provided instructions about
electrical safety such as keeping water and food items away from
electrical equipment.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 6 / 15
Module 5: Managing Health and Safety

Basic Workplace Safety Guidelines

It is the joint responsibility of both employer and employees to ensure that
the workplace is safe and secure. State whether the following are correct
or not.
At any office, the first-aid kit should always be available for use in an
emergency.
It is optional to participate in the random fire drills conducted by the
Offices from time-to-time.
There is no need to train employees on how to use the fire
extinguisher. They can operate extinguishers following the instruction
written on the extinguisher case, when needed.
Wet floor can be identified easily, without the signs. The “Wet Floor”
sign is not needed and causes problems for people.
It is okay to place heavy and light items on the same shelf.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 7 / 15

Module 5: Managing Health and Safety

Types of Accidents in Workplace

The following are some of commonly occurring accidents in organizations:

Trip and fall
Slip and fall
Injuries caused due to escalators or elevators
Accidents due to falling of goods
Accidents due to moving objects

Try to avoid accidents by finding out all potential hazards and eliminating
them. One person’s careless action can harm the safety of many others in
the organization.

Figure 1 shows the major types of safety hazards and Figure 2 shows the
major types of workplace hazards.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 8 / 15

Module 5: Managing Health and Safety

Figure 1: Major types of safety hazards. Source:

https://www.mscdirect.com/betterMRO/safety/understanding-accidents-why-
they-happen-and-what-you-can-do

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 9 / 15

Module 5: Managing Health and Safety

Figure 2: Major types of workplace hazards. Source:

https://www.totalika.org/understanding-the-six-major-types-of-workplace-
hazards/

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 10 / 15

Module 5: Managing Health and Safety

Types of Emergencies in Worksplace

Categories of emergencies include (but not limited to) the following:

Medical emergencies, such as heart attack or an expectant mother in
labor
Substance emergencies, such as fire, chemical spills, and explosions
Structural emergencies, such as loss of power or collapsing of walls
Security emergencies, such as armed robberies, intruders, and mob
attacks or civil disorder
Natural disaster emergencies, such as floods and earthquakes

Keep a list of numbers to call during emergencies. Regularly check that all
emergency handling equipments are in working condition. Ensure that
emergency exits are not obstructed.

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 11 / 15

Module 5: Managing Health and Safety
Hazards
In relation to workplace safety and health, hazard can be defined as
any source of potential harm or danger to someone or any adverse
health effect produced under certain condition.
A hazard can harm an individual or an organization. Hazard to an
organization include loss of property or equipment while hazard to an
individual involve harm to health or body.
Examples of potential hazards: (i) Materials such as knife or sharp
edged nails can cause cuts; (ii) Substances such as Benzene can cause
fume suffocation. Inflammable substances like petrol can cause fire;
(iii) Naked wires or electrodes can result in electric shocks; (iv)
Condition such as “Wet floor” can cause slippage, (v) Objects falling
on workers; and (vi) Clothes entangled into rotating objects.

Figure 3 shows some signage boards used to notify hazards and Figure 4
shows some common safety signs.
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 12 / 15
Module 5: Managing Health and Safety

Figure 3: Signage boards to notify hazards

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 13 / 15

Module 5: Managing Health and Safety

Figure 4: Common safety signs

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 14 / 15

Module 5: Managing Health and Safety

Module-5 Summary

Performance criteria (7 in total)

Basic workplace safety guidelines: fire safety, first-aid kit, electrical
safety, etc.
Types of accidents: trips, slips, injuries/accidents due to
falling/moving items, etc.
Types of emergencies: medical, structural, natural disaster, etc.
Hazards: sources of potential harm (notified using signage boards)

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 15 / 15

CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)

School of Electronics Engineering (SENSE), VIT-Chennai

Email: [email protected]
Handphone No.: +91-9944226963

Winter Semester 2020-21

Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 1 / 16

Suggested Readings

https://www.sscnasscom.com/qualification-pack/SSC/Q2101/
(For Modules 5, 6 & 7)
https://www.datapine.com/blog/daily-weekly-monthly-financial-
report-examples/
https://www.datapine.com/blog/daily-weekly-monthly-marketing-
report-examples/
https://www.datapine.com/blog/sales-report-kpi-examples-for-daily-
reports/