
22ECSC306

Learning! Knowledge! Intelligence!
• Learning is the ability to:
  • see patterns
  • recognize patterns
  • add constraints
• Knowledge is the variability in constraints (patterns with variability).
• Intelligence is invoking the knowledge on a given test case.


Machines & Humans
• Von Neumann architecture bottlenecks:
  • Serial
  • Unambiguous
• Architecture of the Brain – simulation by ANN
• Process & Represent / Represent & Process:
  • Process & Represent – Humans
  • Represent & Process – Machines


Learning!!
• Human Intelligence – Biological Neural Network
• Artificial Intelligence – Von Neumann architecture / Artificial Neural Network


“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

-- Tom Mitchell, Carnegie Mellon University
Machine Learning

AI, ML and DL

Types of Machine Learning

Classical Machine Learning techniques

Knowledge Discovery (KDD) Process
• Data mining – the core of the knowledge discovery process.
• Typical steps: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge
Machine Learning
Traditional Programming:
  Data + Program → Computer → Output
Machine Learning:
  Data + Output → Computer → Program
Machine Learning
• Past: Training Data → model/predictor
• Future: Testing Data → model/predictor


Machine Learning
• Machine learning is more like gardening:
  • Seeds = Algorithms
  • Nutrients = Data
  • Gardener = You
  • Plants = Programs


ML in a Nutshell
• Tens of thousands of machine learning algorithms
• Hundreds of new ones every year
• Every machine learning algorithm has three components:
  • Representation
  • Evaluation
  • Optimization


Machine Learning - Types
• Supervised (inductive) learning
• Training data includes desired outputs
• Unsupervised learning
• Training data does not include desired outputs
• Semi-supervised learning
• Training data includes a few desired outputs
• Reinforcement learning
• Rewards from a sequence of actions
Machine Learning - Supervised
In supervised learning, the training data you feed to the algorithm includes the desired
solutions, called labels.
Machine Learning - Supervised
• Regression: to predict a target numeric value
Example: To predict the price of a car, given a set of features
(mileage, age, brand, etc.) called predictors.

Figure 2: Regression
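
As a quick, illustrative sketch of this idea (not from the slides; the car data, feature values and prices below are made up), scikit-learn's LinearRegression can be used as follows:

# Illustrative only: made-up car data (mileage in km, age in years) and prices.
from sklearn.linear_model import LinearRegression

X = [[30000, 2], [60000, 4], [90000, 6], [120000, 8]]   # predictors
y = [18000, 14500, 11000, 8000]                          # target: price

model = LinearRegression().fit(X, y)
print(model.predict([[75000, 5]]))   # estimated price for an unseen car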
Machine Learning – Unsupervised
• In unsupervised learning, the training data is unlabeled (Figure 3). The system tries to learn without a teacher.

Figure 3: An unlabeled training set
Figure 4: Clustering for unsupervised learning
Machine Learning – Semi-Supervised
• Algorithms that work with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semi-supervised learning.
Machine Learning – Reinforcement
• Reinforcement Learning: the learning system, called an agent, observes the environment, selects and performs actions, and gets rewards in return.
• It must then learn by itself the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
Machine Learning Workflow


Supervised Learning
Linear Regression
Supervised Learning
• In regression, we seek to identify (or estimate) a continuous variable y associated with a given input vector x.
• In classification, we seek to identify the categorical class Ck associated with a given input vector x.
• Regression methods are used for forecasting and for finding cause-and-effect relationships between variables.
• Regression techniques differ based on the number of independent variables and the type of relationship between the independent and dependent variables.


Why Regression
• Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and independent variable(s) (predictors).
• Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve/line to the data points in such a manner that the distances of the data points from the curve or line are minimized.
Why do we use regression analysis?
• There are multiple benefits of using regression analysis. They are as follows:
  • It indicates the significant relationships between the dependent variable and the independent variables.
  • It indicates the strength of impact of multiple independent variables on a dependent variable.
Linear Regression
• Linear regression is usually among the first few topics people pick up while learning predictive modeling.
• In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear.
• The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, whereas simple linear regression has only one. Now, the question is: “How do we obtain the best-fit line?”
Linear regression: Introduction
• A scatter plot of the data on a 2-dimensional plane (Pizza Size vs. Pizza Price).


Linear regression: Introduction
• Choose a linear (straight line) model, and tweak it to match the data points by
changing its slope.

• Choose a linear (straight line) model, and tweak it to match the data points by
changing its intercept/bias.



Linear regression: Introduction
• Fitting a linear (straight) line by finding optimal values for the slope and bias.


Linear Regression
• Different applications of linear regression:
  i. Impact of product price on number of sales
  ii. Impact of rainfall amount on number of fruits yielded
  iii. To predict whether the funds invested in marketing a particular brand have given a substantial return on investment
• In order to predict an accurate value of Y for a given value of X, we need:
  1. Data (samples, combinations of X and Y)
  2. Model (function to represent the relationship between X & Y)
  3. Cost function (how well our model approximates the training samples)
  4. Optimization (find parameters of the model that minimize the value of the cost function)


Linear Regression- data
• In simple linear regression the number of independent variables is one and there is a linear relationship between the independent (x) and dependent (y) variable.

Population density (per sq km)   Number of COVID-19 patients
20     2
40     6
60     8
80     12
100    14


Linear Regression- hypothesis
• How to represent the data as a model/hypothesis?
• Since the data is linear, we can use linear operators.
• Hypothesis: hθ(x) = θ0x0 + θ1x1
• θ0 & θ1 – parameters
• x0 and x1 – inputs
• y – actual output
• hθ(x) – predicted output
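
A minimal Python sketch of this hypothesis (assuming, as is conventional, that x0 = 1 so θ0 acts as the bias; the parameter values below are illustrative, not fitted):

def h(theta0, theta1, x1, x0=1.0):
    """Hypothesis hθ(x) = θ0*x0 + θ1*x1, with x0 fixed to 1 by convention."""
    return theta0 * x0 + theta1 * x1

# Illustrative (not fitted) parameters: predicted patients for density 60.
print(h(1.0, 0.13, 60))   # -> 8.8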



Linear Regression- data

Population density (per sq km)   Average age   Number of COVID-19 patients
20     16    3
40     16    4
60     14    6
80     14    7
300    21    21


Linear Regression- cost function
• Cost function (squared error): J(θ0, θ1) = (1/(2m)) Σi (hθ(x^(i)) − y^(i))², which measures how well the hypothesis approximates the training samples.
• Example training set:

X   Y
1   1
2   2
3   3
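
A small sketch of this cost on the tiny training set above (illustrative only):

def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(θ0, θ1) = (1/(2m)) * Σ (hθ(x) − y)²."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]
print(cost(0.0, 1.0, xs, ys))   # 0.0: the line y = x fits the table perfectly
print(cost(0.0, 0.5, xs, ys))   # ≈ 0.583: a worse fit has a larger cost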




Linear Regression- optimization
Gradient descent algorithm:
  repeat until convergence {
    θj := θj − α ∂/∂θj J(θ0, θ1)    (for j = 0 and j = 1)
  }
Linear Regression Model:
  hθ(x) = θ0x0 + θ1x1 (with x0 = 1)
  J(θ0, θ1) = (1/(2m)) Σi (hθ(x^(i)) − y^(i))²


Linear Regression- optimization
Update θ0 and θ1 simultaneously:
  temp0 := θ0 − α ∂/∂θ0 J(θ0, θ1)
  temp1 := θ1 − α ∂/∂θ1 J(θ0, θ1)
  θ0 := temp0
  θ1 := temp1


Linear Regression- optimization
Correct (simultaneous update): compute temp0 and temp1 from the current θ0 and θ1, then assign both.
Incorrect: updating θ0 first and then using the new θ0 when computing the update for θ1.
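
An illustrative Python sketch of the difference, using the toy data from the cost-function slide:

xs, ys = [1, 2, 3], [1, 2, 3]      # toy data from the cost-function slide
m, alpha = len(xs), 0.1
theta0, theta1 = 0.0, 0.0

def grad(t0, t1):
    """Partial derivatives of J with respect to θ0 and θ1."""
    errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
    return sum(errs) / m, sum(e * x for e, x in zip(errs, xs)) / m

# Correct: both gradients are computed from the same (old) parameters.
g0, g1 = grad(theta0, theta1)
theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1

# Incorrect (do NOT do this): θ0 is overwritten before θ1's gradient is
# computed, so θ1's step would use the already-updated θ0.
# theta0 = theta0 - alpha * grad(theta0, theta1)[0]
# theta1 = theta1 - alpha * grad(theta0, theta1)[1]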


Linear Regression- optimization
• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.


Linear Regression- optimization
• Determine the direction with the steepest downward gradient at the current position and take a step (whose size is governed by the learning rate α) in that direction. Rinse and repeat; this is known as training.
• The gradient descent algorithm guides us to move in the direction of steepest descent.
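
Putting the pieces together, here is a minimal batch gradient descent training loop (a sketch under the assumptions used so far: hθ(x) = θ0 + θ1x, squared-error cost, learning rate α; the data is the population-density example from earlier, and the learning rate and iteration count are arbitrary choices):

xs = [20, 40, 60, 80, 100]   # population density per sq km
ys = [2, 6, 8, 12, 14]       # number of COVID-19 patients
m = len(xs)
theta0, theta1 = 0.0, 0.0
alpha = 0.0003               # learning rate: too large diverges, too small is slow

for epoch in range(200000):
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                             # dJ/dθ0
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dθ1
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update

print(theta0, theta1)          # converges towards roughly -0.6 and 0.15 for this data
print(theta0 + theta1 * 70)    # prediction for a density of 70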




Linear Regression- Variations
• Gradient descent can vary in terms of the number of training patterns used to calculate the error that is in turn used to update the model.
• The number of patterns used to calculate the error influences how stable the gradient used to update the model is.
• There is a tension in gradient descent configurations between computational efficiency and the fidelity of the error gradient.
• The three main flavours of gradient descent are stochastic, batch and mini-batch.


Stochastic Gradient Descent
• It is a variation of the gradient descent algorithm that calculates the error
and updates the model for each example in the training dataset.
• Parameters are updated after computing the gradient of error with
respect to a single training example.
• The update of the model for each training example means that stochastic
gradient descent is often called an online machine learning algorithm.
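
A minimal sketch of the stochastic variant on the same toy data as before (the learning rate and epoch count are arbitrary; only the per-example update differs from the batch version shown earlier):

import random

xs = [20, 40, 60, 80, 100]
ys = [2, 6, 8, 12, 14]
theta0, theta1, alpha = 0.0, 0.0, 0.00005

for epoch in range(20000):
    order = list(range(len(xs)))
    random.shuffle(order)                  # visit the examples in a random order
    for i in order:                        # update after EVERY single example
        err = theta0 + theta1 * xs[i] - ys[i]
        theta0, theta1 = theta0 - alpha * err, theta1 - alpha * err * xs[i]

print(theta0, theta1)                      # noisy estimates of intercept and slope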



Stochastic Gradient Descent-Upsides
• The frequent updates immediately give an insight into the performance of
the model and the rate of improvement.
• This variant of gradient descent may be the simplest to understand and
implement, especially for beginners.
• The increased model update frequency can result in faster learning on
some problems.
• The noisy update process can allow the model to avoid local minima (e.g.
premature convergence).



Stochastic Gradient Descent-Downside
• Updating the model so frequently is more computationally expensive than
other configurations of gradient descent, taking significantly longer to
train models on large datasets.
• The frequent updates can result in a noisy gradient signal, which may
cause the model parameters and in turn the model error to jump around
(have a higher variance over training epochs).
• The noisy learning process down the error gradient can also make it hard
for the algorithm to settle on an error minimum for the model.



Batch Gradient Descent-Upside
• Fewer updates to the model means this variant of gradient descent is
more computationally efficient than stochastic gradient descent.
• The decreased update frequency results in a more stable error gradient
and may result in a more stable convergence on some problems.
• The separation of the calculation of prediction errors and the model
update lends the algorithm to parallel processing based implementations.



Batch Gradient Descent-Downside
• The more stable error gradient may result in premature convergence of
the model to a less optimal set of parameters.
• The updates at the end of the training epoch require the additional
complexity of accumulating prediction errors across all training examples.
• Commonly, batch gradient descent is implemented in such a way that it
requires the entire training dataset in memory and available to the
algorithm.
• Model updates, and in turn training speed, may become very slow for
large datasets.



Mini-Batch Gradient Descent
• It splits the training dataset into small batches that are used to calculate
model error and update model coefficients.
• Implementations may choose to sum the gradient over the mini-batch
which further reduces the variance of the gradient.
• Mini-batch gradient descent seeks to find a balance between the
robustness of stochastic gradient descent and the efficiency of batch
gradient descent.
• It is the most common implementation of gradient descent used in the
field of deep learning.
• Parameters are updated after computing the gradient of error with
respect to a subset of the training set.
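
A sketch of the mini-batch variant on the same toy data (NumPy is used here to show the vectorized per-batch update; the batch size and learning rate are arbitrary choices):

import numpy as np

X = np.array([20, 40, 60, 80, 100], dtype=float)
y = np.array([2, 6, 8, 12, 14], dtype=float)
theta0, theta1 = 0.0, 0.0
alpha, batch_size = 1e-4, 2                # "mini-batch size" hyperparameter

for epoch in range(20000):
    perm = np.random.permutation(len(X))   # shuffle, then walk through mini-batches
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        errs = theta0 + theta1 * xb - yb   # vectorized over the whole mini-batch
        theta0 -= alpha * errs.mean()
        theta1 -= alpha * (errs * xb).mean()

print(theta0, theta1)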
Mini batch Gradient Descent-Upside
• The model update frequency is higher than batch gradient descent which
allows for a more robust convergence, avoiding local minima.
• The batched updates provide a computationally more efficient process
than stochastic gradient descent.
• The batching allows both the efficiency of not having all training data in
memory and algorithm implementations.



Mini batch Gradient Descent-Downside
• Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.
• Error information must be accumulated across mini-batches of training
examples like batch gradient descent.



Gradient Descent-summary
• Stochastic Gradient Descent: only a single training example is considered before taking a step in the direction of the gradient; we are forced to loop over the training set and thus cannot exploit the speed associated with vectorizing the code; it makes very noisy updates in the parameters.
• Batch Gradient Descent: the entire training data is considered before taking a step in the direction of the gradient; therefore it takes a lot of time to make a single update; it makes smooth updates in the model parameters.
• Mini-Batch Gradient Descent: a subset of training examples is considered, so it can make quick updates in the model parameters; it can also exploit the speed associated with vectorizing the code; depending upon the batch size, the updates can be made less noisy: the greater the batch size, the less noisy the update.


Multivariate Linear Regression

Size (feet^2)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104            5                    1                  45                    460
1416            3                    2                  40                    232
1534            3                    2                  30                    315
852             2                    1                  36                    178
…               …                    …                  …                     …

Notation:
n = number of features
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example


Multivariate Linear Regression
• Univariate linear regression: hθ(x) = θ0 + θ1x
• Multivariate linear regression:
Hypothesis: hθ(x) = θ0x0 + θ1x1 + θ2x2 + … + θnxn (with x0 = 1)
Parameters: θ0, θ1, …, θn
Cost function: J(θ0, θ1, …, θn) = (1/(2m)) Σi (hθ(x^(i)) − y^(i))²

Gradient descent:
Repeat {
  θj := θj − α ∂/∂θj J(θ0, …, θn)
} (simultaneously update for every j = 0, …, n)
Multivariate Linear Regression
Gradient Descent
Previously (n = 1):
Repeat {
  θ0 := θ0 − α (1/m) Σi (hθ(x^(i)) − y^(i))
  θ1 := θ1 − α (1/m) Σi (hθ(x^(i)) − y^(i)) x^(i)
} (simultaneously update θ0, θ1)

New algorithm (n ≥ 1):
Repeat {
  θj := θj − α (1/m) Σi (hθ(x^(i)) − y^(i)) xj^(i)
} (simultaneously update θj for j = 0, …, n)
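
A vectorized NumPy sketch of the multivariate update above, using the housing table from the earlier slide (x0 = 1 is prepended so θ0 is handled like any other parameter; the feature-scaling step is an addition, not from the slides, and is only there to make a single learning rate workable):

import numpy as np

# Features: size (feet^2), bedrooms, floors, age; target: price ($1000).
X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 3, 2, 30],
              [ 852, 2, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

X = (X - X.mean(axis=0)) / X.std(axis=0)   # feature scaling (an added step, see note above)
X = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 for the bias term
m, n = X.shape
theta = np.zeros(n)
alpha = 0.1

for _ in range(2000):
    errors = X @ theta - y                 # hθ(x) − y for every example at once
    theta -= alpha * (X.T @ errors) / m    # simultaneous update of every θj

print(theta)          # fitted parameters
print(X @ theta)      # predictions for the training examples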


Bias and Variance
Problem of Overfitting & Under-fitting

• If we take a quadratic function, it will give a good fit: w1x + w2x^2 + b
• If we take a very high-order polynomial, we get a curve that overfits the data: w1x + w2x^2 + w3x^3 + w4x^4 + b (see the sketch at the end of this slide)

• There are several ways to deal with over-fitting:


➢ Collecting more data
➢ Select a subset of features
➢ Penalize the weights (parameter values)
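
As noted above, a quick way to see the overfitting effect is to fit polynomials of different degrees to a handful of noisy points (an illustrative sketch; the data here is synthetic):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.05, size=x.size)   # quadratic + noise

quad = np.polyfit(x, y, deg=2)   # w1x + w2x^2 + b style model: generalises well
wild = np.polyfit(x, y, deg=7)   # very high-order polynomial: fits every point exactly

x_new = 1.2                      # a point just outside the training range
print(np.polyval(quad, x_new))   # stays close to the underlying trend
print(np.polyval(wild, x_new))   # can swing far away, a symptom of overfitting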
Regularization
• There are several problems like overfitting and under-fitting.
• Regularization helps to solve the problem of overfitting.
• Its aim is to:
  ➢ Reduce model complexity
  ➢ Reduce the cost function
• It adds a penalty to reduce the magnitude of the weights.
• There are 3 such types: Ridge, Lasso & Elastic-Net.
• Regularized cost function for linear regression: the squared-error cost J(θ) plus a penalty term on the weights (L2 for Ridge, L1 for Lasso, both for Elastic-Net).
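
An illustrative sketch of how the penalty term changes the cost (the λ value, the θ values and the scaling of the penalty are arbitrary; conventions differ between textbooks):

def mse(theta, xs, ys):
    """Plain squared-error part of the cost; the bias theta[0] is not penalized."""
    m = len(xs)
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def ridge_cost(theta, xs, ys, lam):
    return mse(theta, xs, ys) + lam * theta[1] ** 2    # L2 penalty on the weight

def lasso_cost(theta, xs, ys, lam):
    return mse(theta, xs, ys) + lam * abs(theta[1])    # L1 penalty on the weight

xs, ys = [1, 2, 3], [1, 2, 3]
print(ridge_cost([0.0, 1.0], xs, ys, lam=0.1))   # 0.0 + 0.1: large weights are discouraged
print(lasso_cost([0.0, 1.0], xs, ys, lam=0.1))   # 0.0 + 0.1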
Limitation of Linear Regression
• Linear regression is not robust to outliers.
• Linear regression is also susceptible to overfitting.
Ridge Regression
• Ridge regression works with an enhanced cost function when compared to the least squares cost function.
• Instead of the simple sum of squares, Ridge regression
introduces an additional ‘regularization’ parameter that
penalizes the size of the weights.
• It can be used when there are too many predictors, or
predictors have a high degree of Multicollinearity between
each other.
• The cost function and the gradient descent update are given by Equations 1 and 2:
  J(θ) = (1/(2m)) Σi (hθ(x^(i)) − y^(i))² + (λ/(2m)) Σj θj²                      (1)
  θj := θj − α [ (1/m) Σi (hθ(x^(i)) − y^(i)) xj^(i) + (λ/m) θj ]                (2)
Limitation of Ridge Regression
• It is not good for feature selection.
• Ridge regression decreases the complexity of a model but does not reduce the number of variables, since it never leads to a coefficient being zero; rather, it only minimizes it.
Lasso Regression
• Lasso regression stands for Least Absolute Shrinkage and Selection
Operator.
• It adds a penalty term to the cost function. This term is the absolute sum of the coefficients. As the value of the coefficients increases from 0, this term penalizes the model, causing it to decrease the value of the coefficients in order to reduce the loss.
• The difference between Ridge and Lasso regression is that Lasso tends to shrink coefficients to absolute zero, whereas Ridge never sets the value of a coefficient to absolute zero.
Limitations of Lasso Regression
• Lasso sometimes struggles with some types of data.
• If the number of predictors (p) is greater than the number of observations (n),
Lasso will pick at most n predictors as non-zero, even if all predictors are
relevant (or may be used in the test set).
• If there are two or more highly collinear variables, then Lasso regression selects one of them randomly, which is not good for the interpretation of the data.
The Elastic Net:

• Elastic Net often performs better because it combines the regularization of both Lasso and Ridge.
• Its advantage is that it does not easily eliminate highly collinear coefficients.
• The elastic net has two parameters, as it combines both L1 and L2 regularization, so we need a lambda1 for the L1 penalty and a lambda2 for the L2 penalty.
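
A minimal scikit-learn sketch of the three regularizers discussed above (the synthetic data and the alpha values are arbitrary illustrations):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + 0.01 * rng.normal(size=50)        # two highly collinear predictors
y = 3 * X[:, 0] + 2 * X[:, 3] + rng.normal(size=50)

print(Ridge(alpha=1.0).fit(X, y).coef_)               # L2: shrinks coefficients, none exactly zero
print(Lasso(alpha=0.1).fit(X, y).coef_)               # L1: can set some coefficients exactly to zero
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # combines the L1 and L2 penalties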
Thank you
