CSE3506 - Essentials of Data Analytics: Facilitator: DR Sathiya Narayanan S
Email: [email protected]
Handphone No.: +91-9944226963
Education
Experience
Y ≈ β0 + β1 X
sales ≈ β0 + β1 TV
β̂0 and β̂1 are the least squares coefficient estimates for simple linear
regression, and they give the best linear fit to the given training
data.
Figure 2 shows the simple linear regression fit to the Advertising data,
where β̂0 = 7.03 and β̂1 = 0.0475.
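The least squares estimates have a simple closed form. A minimal Python sketch of that computation (the data here is hypothetical, not the Advertising data):

```python
# Closed-form least squares estimates for simple linear regression:
#   beta1_hat = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
#   beta0_hat = y_mean - beta1_hat * x_mean

def least_squares_fit(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Example on exact data y = 1 + 2x, so the fit recovers (1, 2):
b0, b1 = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # → 1.0 2.0
```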
Facilitator: Dr Sathiya Narayanan S VIT-Chennai - SENSE Winter Semester 2020-21 11 / 36
Module 1: Regression Analysis
Question 1.1
Question 1.2
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f (x) of the form f (x) = ax + b which is
parameterized by (a, b). Using squared error as the loss function, which of
the following parameters would you use to model this function?
(a) (4 3)
(b) (5 3)
(c) (5 1)
(d) (1 5)
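One way to answer is to evaluate the squared-error loss for each candidate (a, b) and pick the smallest; a quick sketch:

```python
# Squared-error loss for each candidate (a, b) on the training examples.
X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
candidates = [(4, 3), (5, 3), (5, 1), (1, 5)]

def sse(a, b):
    # Sum of squared residuals for f(x) = a*x + b.
    return sum((a * x + b - y) ** 2 for x, y in zip(X, Y))

best = min(candidates, key=lambda p: sse(*p))
print(best)  # → (5, 3), i.e. option (b)
```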
Question 1.3
Question 1.4
When you perform multiple linear regression, which among the following
are questions you will be interested in?
where TSS = Σi (yi − ȳ)², with the sum taken over i = 1, ..., n, is the
total sum of squares. Note that the R² statistic is independent of the
scale of Y, and it always takes a value between 0 and 1.
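The R² statistic can be computed as 1 − RSS/TSS. A small sketch using the five training examples from Question 1.2, with f(x) = 5x + 3 as an illustrative fit:

```python
X = [2, 3, 4, 5, 6]
Y = [12.8978, 17.7586, 23.3192, 28.3129, 32.1351]
preds = [5 * x + 3 for x in X]                      # f(x) = 5x + 3

y_bar = sum(Y) / len(Y)
tss = sum((y - y_bar) ** 2 for y in Y)              # total sum of squares
rss = sum((y - p) ** 2 for y, p in zip(Y, preds))   # residual sum of squares
r2 = 1 - rss / tss
print(round(r2, 4))  # close to 1, indicating a very good linear fit
```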
Question 1.5
Consider the following five training examples
X = [2 3 4 5 6]
Y = [12.8978 17.7586 23.3192 28.3129 32.1351]
We want to learn a function f (x) of the form f (x) = ax + b which is
parameterized by (a, b).
Bias-Variance Tradeoff
Correlation
When comparing two random variables, say x1 and x2, the covariance
Cov(x1, x2) measures how much the two vary together, whereas the
correlation Corr(x1, x2) is a normalized measure of how strongly a
change in one variable is associated with a change in the other.
For multiple data points, the covariance matrix is given by
C = (X − m)(X − m)ᵀ / n,
where X = [x1 x2 ...] is the data matrix with n columns (each column
is one data point) and m is the mean vector of the data points.
Correlation, a normalized version of the covariance, is expressed as
Corr(x1, x2) = Cov(x1, x2) / (σx1 σx2).
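A small numerical check of these definitions (hypothetical data; population covariance, dividing by n, to match the matrix formula above):

```python
def cov(a, b):
    # Population covariance of two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n

def corr(a, b):
    # Correlation = covariance normalized by the standard deviations.
    return cov(a, b) / ((cov(a, a) ** 0.5) * (cov(b, b) ** 0.5))

x1 = [1, 2, 3, 4]
x2 = [2, 4, 6, 8]      # x2 = 2 * x1, a perfect linear relationship
print(corr(x1, x2))    # → 1.0 (up to floating-point rounding)
```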
Time Series Forecasting
Time series modeling deals with time-based data. Time can be
years, days, hours, minutes, etc.
Time series forecasting involves fitting a model on time based data
and using it to predict future observations.
Time series forecasting serves two purposes: understanding the
pattern/trend in the time series data and forecasting/extrapolating
the future values of it. The forecast package in R contains functions
which serve these purposes.
In time series forecasting, the AutoRegressive Integrated Moving
Average (ARIMA) model is fitted to the time series data either to
better understand the data or to predict future points in the series.
Components of a time series are level, trend, seasonal, cyclical and
noise/irregular (random) variations.
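As a toy illustration of autoregressive fitting (the AR part of ARIMA), the coefficient of an AR(1) model y_t = φ·y_{t−1} + ε_t can be estimated by least squares. This is a bare-bones Python sketch, not the R forecast package:

```python
# Least squares AR(1) estimate (no intercept):
#   phi_hat = sum(y_t * y_{t-1}) / sum(y_{t-1}^2)
def fit_ar1(y):
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

# Noise-free series generated with phi = 0.8, so the estimate recovers 0.8:
series = [1.0]
for _ in range(20):
    series.append(0.8 * series[-1])
print(fit_ar1(series))  # → 0.8
```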
H0 : µ1 = µ2 = µ3 = ...
Ha : not all the means are equal (at least one µi differs).
Question 1.6
Assume there are 3 canteens in a college and the sale of an item in those
canteens during first week of February-2021 is as follows:
Module-1 Summary
1 Module 2: Classification
Logistic Regression
Bayes’ Theorem for classification
Decision Trees
Bagging, Boosting and Random Forest
Hyperplane for Classification
Support Vector Machines
Logistic Regression
p(X) = e^(β0+β1X) / (1 + e^(β0+β1X))
where β0 and β1 are the model parameters.
To fit the above model (i.e. to determine β0 and β1 ), a method called
maximum likelihood is used.
The estimates β̂0 and β̂1 are chosen to maximize the likelihood
function:
ℓ(β0, β1) = ∏i:yi=1 p(xi) × ∏i′:yi′=0 (1 − p(xi′)).
Logistic Regression
p(X) / (1 − p(X)) = e^(β0+β1X)
The quantity p(X)/(1 − p(X)) is called the odds, and can take on any
value between 0 and ∞. Values of the odds close to 0 and ∞ indicate
very low and very high probabilities of default, respectively.
Taking logarithm on both sides of the above equation gives the log-odds
or logit:
loge(p(X) / (1 − p(X))) = β0 + β1X.
The logit of a logistic regression model is linear in X . Note that
loge () is natural logarithm which is usually denoted as ln().
Module 2: Classification
Logistic Regression
(i) Which of the following parameters would you use to model p(x)?
(a) (-119, 2)
(b) (-120, 2)
(c) (-121, 2)
(ii) With the chosen parameters, what should be the minimum mark to
ensure the student gets a ‘Pass’ grade with 95% probability?
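With p(x) = e^(β0+β1x)/(1 + e^(β0+β1x)), requiring p(x) ≥ 0.95 is the same as requiring the logit β0 + β1x ≥ loge(0.95/0.05). A sketch of the computation, using (β0, β1) = (−120, 2) purely as an illustration (which pair actually fits part (i) depends on the data given in the question):

```python
import math

def min_mark(beta0, beta1, p=0.95):
    # p(x) >= p  ⇔  beta0 + beta1 * x >= ln(p / (1 - p))
    x = (math.log(p / (1 - p)) - beta0) / beta1
    return math.ceil(x)   # smallest integer mark meeting the requirement

print(min_mark(-120, 2))  # → 62
```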
p(ωk|x) = p(x|ωk) p(ωk) / p(x)
Question 2.2
Assume A and B are Boolean random variables (i.e. they take one of the
two possible values: True and False).
Given: p(A = True) = 0.3, p(A = False) = 0.7,
p(B = True|A = True) = 0.4, p(B = False|A = True) = 0.6,
p(B = True|A = False) = 0.6, p(B = False|A = False) = 0.4.
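These numbers plug directly into Bayes' theorem: p(A = True | B = True) = p(B = True | A = True) p(A = True) / p(B = True), where p(B = True) comes from the law of total probability:

```python
p_A = 0.3                      # p(A = True)
p_B_given_A = 0.4              # p(B = True | A = True)
p_B_given_notA = 0.6           # p(B = True | A = False)

# Total probability: p(B = True) = 0.4*0.3 + 0.6*0.7 = 0.54
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: p(A = True | B = True) = 0.12 / 0.54 ≈ 0.2222
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_B, 2), round(p_A_given_B, 4))  # → 0.54 0.2222
```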
Question 2.3
Decision Trees
A decision tree is a hierarchical model for supervised learning. It can
be applied to both regression and classification problems.
A decision tree consists of decision nodes (root and internal) and leaf
nodes (terminal). Figure 3 shows a data set and its classification tree
(i.e. decision tree for classification).
Given an input, at each decision node, a test function is applied and
one of the branches is taken depending on the outcome of the
function. The test function gives discrete outcomes labeling the
branches (say for example, Yes or No).
The process starts at the root node (topmost decision node) and is
repeated recursively until a leaf node is hit. Each leaf node has an
output label (say for example, Class 0 or Class 1).
During the learning process, the tree grows: branches and leaf nodes
are added depending on the data.
Figure 3: Data set (left) and the corresponding decision tree (right) - Example of
a classification tree.
Decision Trees
Decision trees do not assume any parametric form for the class
densities, and the tree structure is not fixed a priori. Therefore, a
decision tree is a non-parametric model.
Different decision trees assume different models for the test function,
say f (·). In a decision tree, the assumed model for f (·) defines the
shape of the classified regions. For example, in Figure 3, the test
functions define ‘rectangular’ regions.
In a univariate decision tree, the test function in each decision node
uses only one of the input dimensions.
In a classification tree, the ‘goodness of a split’ is quantified by an
impurity measure; entropy and the Gini index are popular choices. A
split is pure if, for every branch, all the instances taking that
branch belong to the same class.
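Both impurity measures are easy to compute from the class proportions at a node; a pure node scores 0 under either measure:

```python
import math

def entropy(probs):
    # Entropy in bits: -sum(p * log2(p)), skipping zero proportions.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    # Gini index: 1 - sum(p^2).
    return 1 - sum(p * p for p in probs)

print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # pure node: both measures are 0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 50/50 split: entropy 1 bit, Gini 0.5
```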
Question 2.4
Greedy learning approach - they look for the best split at each step.
Low prediction accuracy compared to methods like regression.
Question 2.5
Module 2: Classification
Figure 5: Two classes of observations (shown in purple and blue), each having
two features/variables, and three separating hyperplanes.
Figure 6: Two classes of observations (shown in purple and blue), each having
two features/variables, and the optimal separating hyperplane or the maximal
margin hyperplane.
Figure 7: Two classes of observations (shown in purple and blue), each having
two features/variables, and two separating hyperplanes.
Module-2 Summary
1 Module 3: Clustering
Introduction to Clustering
K -Means Clustering
K -Medoids Clustering
Hierarchical Clustering
Applications of Clustering
Introduction to Clustering
K -Means Clustering
zi(n + 1) = (1/Ni) Σx∈Gi(n) x
K -Means Clustering
Question 3.1
Apply K -means clustering to cluster the following samples/data points:
(0,0), (0,1), (1,0), (3,3), (5,6), (8,9), (9,8) and (9,9).
Fix K = 2 and choose (0,0) and (5,6) as the initial cluster centres.
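A bare-bones K-means loop (Lloyd's algorithm) applied to the question's data with the given initial centres; a sketch of the procedure, not a full worked answer:

```python
def kmeans(points, centres, iters=10):
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centres]
            clusters[d.index(min(d))].append(p)
        # Update step: each centre moves to the mean of its cluster.
        centres = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            for cl in clusters
        ]
    return centres, clusters

pts = [(0, 0), (0, 1), (1, 0), (3, 3), (5, 6), (8, 9), (9, 8), (9, 9)]
centres, clusters = kmeans(pts, [(0, 0), (5, 6)])
print(centres)   # converges to (1.0, 1.0) and (7.75, 8.0)
```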
K -Medoids Clustering
In K -medoids clustering, each cluster is represented by a cluster
medoid which is one among the data points in the cluster.
The medoid of a cluster is defined as a data point in the cluster
whose average dissimilarity to all the other data points in the cluster
is minimal. As ‘medoid’ is the most centrally located point in the
cluster, the cluster representatives can be interpreted in a better way
(compared to K -means).
K-medoids clustering can use arbitrary dissimilarity measures,
whereas K-means generally requires Euclidean distance for good
performance. In practice, K-medoids often uses Manhattan distance and
minimizes the sum of pairwise dissimilarities.
As in the case of K-means, the value of K needs to be specified
beforehand. A heuristic approach, the ‘silhouette method’, can be
used for determining the optimal value of K.
Module 3: Clustering
Hierarchical Clustering
The dendrogram obtained at the end of hierarchical clustering shows
the hierarchical relationship between the clusters.
After completing the merging step, it is necessary to update the
similarity matrix. The update can be based on (i) the two most
similar parts of a cluster (single-linkage), (ii) the two least similar
parts of a cluster (complete-linkage), or (iii) the centres of the
clusters (mean or average-linkage). Refer Figure 2.
The choice of similarity or distance metric and the choice of linkage
criteria are always application-dependent.
Hierarchical clustering can also be done by initially treating all data
points as one cluster, and then successively splitting them. This
approach is called the divisive hierarchical clustering.
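Agglomerative clustering with single linkage can be sketched in a few lines (1-D toy data, pure Python; real use would rely on a library such as R's hclust):

```python
def single_linkage(points, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

print(single_linkage([1, 2, 9, 10, 11], 2))  # → [[1, 2], [9, 10, 11]]
```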
Hierarchical Clustering
Question 3.2
Consider the similarity matrix given below.
Applications of Clustering
Module-3 Summary
1 Module 4: Optimization
Introduction to Optimization
Gradient Descent
Variants of Gradient Descent
Momentum Optimizer
Nesterov Accelerated Gradient
Adagrad
Adadelta
RMSProp
Adam
AMSGrad
Introduction to Optimization
Optimization is the process of maximizing or minimizing a real
function by systematically choosing input values from an allowed set
of values and computing the value of the function.
It refers to the use of specific methods to determine the best solution
from all feasible solutions, for example, finding the best functional
representation or the best hyperplane to classify data.
The three components of an optimization problem are: the objective
function (minimization or maximization), the decision variables and the
constraints.
Based on the type of objective function, constraints and decision
variables, several types of optimization problems exist. An
optimization problem can be linear or non-linear, convex or non-convex,
iterative or non-iterative, etc.
Optimization is considered one of the three pillars of data
science; linear algebra and statistics are the other two.
Module 4: Optimization
Introduction to Optimization
Consider the following optimization problem which attempts to find
the maximal margin hyperplane with margin M:
Equation (1) is the objective function, equations (2) and (3) are the
constraints, and α0 , α1 , ..., αp are the decision variables.
In general, an objective function is denoted as f (·), and the minimizer
of f (·) is the same as the maximizer of −f (·).
Gradient Descent
Gradient Descent is the most common optimization algorithm in
machine learning and deep learning.
It is a first-order, iterative-based optimization algorithm which only
takes into account the first derivative when performing the updates
on the parameters.
In each iteration, there are two steps: (i) finding the (locally) steepest
direction according to the first derivative of the objective function; and
(ii) finding the best point along that direction. The parameters are
updated in the opposite direction of the gradient of the objective
function.
The learning rate α determines the convergence (i.e. the number of
iterations required to reach the local minimum). It should be neither
too small nor too large: a very small α leads to very slow convergence,
while a very large α leads to oscillations around the minimum or may
even cause divergence.
Xk = Xk−1 − αGk−1
In this case,
f′(X) = [1 + 4x1 + 2x2, 2x1 + 6x2]ᵀ.
In the first iteration, the direction G0 and the best point X1 are
estimated as follows:
G0 = f′(X0) = [4, 4]ᵀ and X1 = X0 − αG0 = [0.1, 0.1]ᵀ.
Question 4.1
Let the stopping criterion be that the absolute difference between the
function values in successive iterations is less than 0.005. Your answer
should show the search direction and the value of the function in each
iteration.
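A sketch of this procedure in Python. The gradient shown earlier implies f(X) = x1 + 2x1² + 2x1x2 + 3x2²; the starting point X0 = (0.5, 0.5) and the fixed step α = 0.1 are assumptions chosen to reproduce G0 = (4, 4) and X1 = (0.1, 0.1):

```python
def f(x1, x2):
    # Objective consistent with the gradient [1 + 4x1 + 2x2, 2x1 + 6x2].
    return x1 + 2 * x1 ** 2 + 2 * x1 * x2 + 3 * x2 ** 2

def grad(x1, x2):
    return (1 + 4 * x1 + 2 * x2, 2 * x1 + 6 * x2)

alpha = 0.1                      # assumed step size
x = (0.5, 0.5)                   # assumed starting point
prev = f(*x)
while True:
    g = grad(*x)
    x = (x[0] - alpha * g[0], x[1] - alpha * g[1])
    cur = f(*x)
    if abs(cur - prev) < 0.005:  # stopping criterion from Question 4.1
        break
    prev = cur
print(x, cur)   # approaches the minimizer (-0.3, 0.1), where f = -0.15
```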
Momentum Optimizer
RMSProp
Adam
Adaptive Moment estimation (Adam) combines RMSProp and
Momentum.
It incorporates the momentum term (i.e. first moment with
exponential weighting of the gradient) in RMSProp as follows:
wk = wk−1 − (α / √(v̂k−1 + ε)) m̂k−1
where m̂k−1 and v̂k−1 are bias-corrected versions of mk−1 (first
moment) and vk−1 (second moment) respectively. The first and
second moments are:
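A minimal Adam-style update loop on a one-dimensional quadratic, with the standard exponential-moving-average moments and bias corrections (default hyperparameter values assumed):

```python
import math

def adam(grad, w, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m = v = 0.0
    for k in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum term)
        v = beta2 * v + (1 - beta2) * g * g    # second moment
        m_hat = m / (1 - beta1 ** k)           # bias-corrected first moment
        v_hat = v / (1 - beta2 ** k)           # bias-corrected second moment
        w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimize f(w) = w^2 (gradient 2w), starting from w = 5.0:
w_final = adam(lambda w: 2 * w, 5.0)
print(w_final)   # close to the minimizer w = 0
```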
AMSGrad
https://www.sscnasscom.com/qualification-pack/SSC/Q2101/
(For Modules 5, 6 & 7).
Performance Criteria
Basic Workplace Safety Guidelines
Types of Accidents in Workplace
Types of Emergencies in Workplace
Hazards
Try to avoid accidents by finding out all potential hazards and eliminating
them. One person’s careless action can harm the safety of many others in
the organization.
Figure 1 shows the major types of safety hazards and Figure 2 shows the
major types of workplace hazards.
Keep a list of numbers to call during emergencies. Regularly check that all
emergency-handling equipment is in working condition. Ensure that
emergency exits are not obstructed.
Figure 3 shows some signage boards used to notify hazards and Figure 4
shows some common safety signs.
Module 5: Managing Health and Safety
Module-5 Summary
https://www.datapine.com/blog/daily-weekly-monthly-financial-report-examples/
https://www.datapine.com/blog/daily-weekly-monthly-marketing-report-examples/
https://www.datapine.com/blog/sales-report-kpi-examples-for-daily-reports/
Performance Criteria
Knowledge Management
Reporting Templates
Knowledge Management
Knowledge Management (KM) is the process of capturing,
developing, sharing, and effectively using organizational knowledge.
KM refers to a multi-disciplinary approach to achieving organizational
objectives by making the best use of knowledge. It captures the
uniqueness of each project, makes complex work scalable, reduces
people dependencies and reduces the delivery time through faster
knowledge distribution.
KM is an evolving process and does not need to adhere to stringent
rules. However, it needs to be done within a specified framework for
an organization. Each organization will have some set standards,
methods and approaches towards KM.
In general, KM deals with certain knowledge items as depicted in
Figure 1.
Reporting Templates
Reporting templates are pre-created structures based on which reports are
to be created. Various types of templates are
Financial reporting template: describes balance sheet, income
statement, profit margin, etc.
Marketing report template: describes marketing cost (i.e. total
spend), clicks rate (in the case of web-based marketing), etc.
Sales reporting template: describes sales revenue, profit, target met
(in percentage), etc.
Research template: describes the results of a survey, interview or
any other type of qualitative/quantitative research.
Whitepaper: concisely presents a complex issue along with the
issuing body’s philosophy on the matter. It is meant to help readers
understand an issue, solve the issue, or make a decision.
Module-6 Summary
Performance Criteria
Common Definitions: Knowledge, Skills and Competence
Skills Needed for Job Roles in Industry
Training and Development
PC1 Obtain advice and guidance from appropriate people to develop your
knowledge, skills and competence
PC2 Identify accurately the knowledge and skills you need for your job
role
PC3 Identify accurately your current level of knowledge, skills and
competence and any learning and development needs
PC4 Agree with appropriate people a plan of learning and development
activities to address your learning needs
PC5 Undertake learning and development activities in line with your plan
PC6 Apply your new knowledge and skills in the workplace, under
supervision
Module-7 Summary