Cost Function

The document discusses the process of predicting house prices using linear regression, focusing on the importance of cost functions and gradient descent for optimizing the model. It explains how to evaluate regression performance through metrics like R-squared and introduces various methods for data partitioning, including holdout, random subsampling, and cross-validation. Additionally, it covers multiple linear regression and how to interpret the resulting equations based on predictor variables.


Gradient Descent and Cost Function

Predicting house prices

Now the big question: how do we determine the best-fitting line for our data?

To figure out the best line, the first thing we need to do is mathematically represent what a bad line looks like. So let's take a "bad" line: according to it, a 2000 ft² house should sell for ~$140,000, whereas we know it actually sold for $300,000.
Predicting house prices - Cost Function

With any cost function, the goal is to minimize its value: we want to reduce the cost to the greatest extent possible.

We need to determine the optimal values of the slope and the intercept for our linear regression problem: the best-fitting line is the one that minimizes the MSE (mean squared error).

Let's assume that we somehow magically already have the value of the slope: 0.069.
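As an illustration (not from the slides), here is a minimal Python sketch of such a cost function. The slope 0.069 and the 2000 ft² / $300,000 data point come from the text; the remaining house sizes and prices are placeholder values (in thousands of dollars), chosen so that the best intercept later comes out near 100, matching the sweep shown on the following slides.

```python
import numpy as np

# Hypothetical training data: house sizes (ft²) and sale prices in $1000s.
# The (2000, 300) point is the one from the text; the rest are placeholders.
sizes = np.array([1000.0, 2000.0, 3000.0, 3500.0])
prices = np.array([170.0, 300.0, 290.0, 295.5])

SLOPE = 0.069  # slope assumed already known, as in the slide


def mse(intercept, slope=SLOPE, x=sizes, y=prices):
    """Mean squared error of the line: price = slope * size + intercept."""
    predictions = slope * x + intercept
    return np.mean((y - predictions) ** 2)


print(mse(intercept=0.0))  # cost of the "intercept = 0" line
```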
Predicting house prices - Cost Function

To get the predicted price of a house of a certain size, all we need to do is plug the intercept and the desired house size into the line. For instance, for a house of size 1000 ft² with intercept 0, the prediction is 0.069 × 1000 + 0 = 69, i.e. about $69,000.

All we need to do now is find the optimal value of the intercept.

Brute-force method: repeatedly guess a value for the intercept, draw the corresponding regression line, and calculate the MSE (a sketch of this search follows).
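A sketch of this brute-force search, reusing the mse() helper and placeholder data from the previous snippet; the candidate intercepts mirror the values tried on the following slides.

```python
candidate_intercepts = [0, 25, 50, 75, 100, 125, 150, 175]

# Compute the MSE for every guess and keep the best one.
costs = {b: mse(b) for b in candidate_intercepts}
best_intercept = min(costs, key=costs.get)

for b, cost in costs.items():
    print(f"intercept={b:>3}  MSE={cost:10.1f}")
print("lowest MSE among the guesses at intercept =", best_intercept)
```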
Predicting house prices - Cost Function

Start by guessing a random value for the intercept (let's start with 0) and plotting the regression line.

Predicting house prices - Cost Function

We'll test another value for the intercept (let's say 25), plot the corresponding line, and calculate the MSE.
Predicting house prices - Cost Function

We can continue this process with different values of the intercept (0, 25, 50, 75, 100, 125, 150, and 175) until we end up with a graph of MSE against intercept. From the points plotted on the graph, we can see that the MSE is lowest when the intercept is set to 100. However, it is possible that another intercept value between 75 and 100 would result in an even lower MSE. A slow and painful way to find the minimal MSE is to plug and chug through a bunch more values for the intercept.

Testing multiple intercept values manually is tedious and inefficient. And how can we be certain that we have found the lowest possible MSE value?

Solution for finding the lowest MSE: Gradient Descent
• It is a powerful optimization algorithm that aims to quickly and efficiently find the minimum point
of a curve.
• The best way to visualize this process is to imagine you are standing at the top of a hill, with a
treasure chest filled with gold waiting for you in the valley.

1. The exact location of the valley is unknown, because it's super dark out and we can't see anything.
2. We want to reach the valley before anyone else does.
3. Gradient descent helps us navigate the terrain and reach this optimal point efficiently and quickly, as sketched in the code below.
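A minimal gradient-descent sketch for the same one-parameter problem (optimizing only the intercept, with the slope fixed at 0.069). The data are the same placeholders as above; the learning rate and number of steps are arbitrary assumptions. The update uses the derivative of the MSE with respect to the intercept.

```python
import numpy as np

sizes = np.array([1000.0, 2000.0, 3000.0, 3500.0])   # same placeholder data as before
prices = np.array([170.0, 300.0, 290.0, 295.5])      # prices in $1000s
SLOPE = 0.069                                        # slope still fixed


def mse_gradient(intercept):
    """Derivative of the MSE with respect to the intercept."""
    residuals = prices - (SLOPE * sizes + intercept)
    return -2.0 * residuals.mean()


intercept = 0.0        # arbitrary starting guess
learning_rate = 0.1    # step size (assumed)
for _ in range(1000):  # number of steps (assumed)
    intercept -= learning_rate * mse_gradient(intercept)

print("intercept found by gradient descent:", round(intercept, 2))
```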
Evaluating Regression Performance

Case 1: Model gives accurate results

Actual yi   Predicted y'i   yi - y'i   yi - mean
10          10               0         -10
20          20               0           0
30          30               0          10

SSE = 0    SST = 200
Evaluating Regression Performance

Sum of Squares Total (SST) = sum of the squared differences of each observation from the overall mean.
Sum of Squared Errors (SSE) = sum of the squared residuals, i.e. the squared differences of each prediction from the actual value.
R² is computed from these two quantities as R² = 1 - SSE / SST.
• R-squared (R²) is a statistical measure used to evaluate the
performance of a regression model in machine learning. It represents
the proportion of the variance in the dependent variable (the target
variable) that is predictable from the independent variables (the
features). In other words, it indicates how well the model's
predictions fit the actual data.
• How R-squared Works:
• R-squared value: for typical models the value of R-squared lies between 0 and 1 (it can drop below 0 for very poor models, as noted below).
• R² = 1: This means the model perfectly predicts the target variable, with no
errors.
• R² = 0: This means the model does not explain any of the variability in the
target variable, essentially performing no better than a model that simply
predicts the mean of the target variable.
• R² < 0: This can occur if the model is worse than predicting the mean,
indicating a poor fit.
Evaluating Regression Performance

Case 2: Model always gives the same result

Actual yi   Predicted y'i   yi - y'i   yi - mean
10          20              -10        -10
20          20               0           0
30          20               10          10

SSE = 200    SST = 200
Evaluating Regression Performance

Case 3: Model gives worse results

Actual yi   Predicted y'i   yi - y'i   yi - mean
10          30              -20        -10
20          10               10          0
30          20               10          10

SSE = 600    SST = 200
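A quick check of all three cases in Python, using the relation R² = 1 - SSE/SST introduced above; the actual and predicted values are taken directly from the tables.

```python
import numpy as np

actual = np.array([10.0, 20.0, 30.0])

cases = {
    "Case 1 (accurate)":        np.array([10.0, 20.0, 30.0]),
    "Case 2 (always the same)": np.array([20.0, 20.0, 20.0]),
    "Case 3 (worse)":           np.array([30.0, 10.0, 20.0]),
}

sst = np.sum((actual - actual.mean()) ** 2)       # total sum of squares
for name, predicted in cases.items():
    sse = np.sum((actual - predicted) ** 2)       # sum of squared residuals
    r2 = 1.0 - sse / sst
    print(f"{name}: SSE={sse:.0f}, SST={sst:.0f}, R²={r2:.0f}")
```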
Train, Validation, Test split

[Diagram: the data set with known results is split into a training set (fed to the model builder), a validation set (used to evaluate the model's predictions), and a final test set (used for the final evaluation of the final model).]
Prediction-Linear Regression
Holdout method and Random Subsampling
• In the holdout method, the given data are randomly partitioned into two independent sets: a training set and a test set.
• Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set.
• The training set is used to derive the model. The model's accuracy is then estimated with the test set, as sketched below.
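A holdout sketch assuming scikit-learn; the feature matrix X and target y below are synthetic stand-ins, and the 2/3 : 1/3 split follows the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a feature matrix X and a numeric target y.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=90)

# Holdout: two-thirds for training, the remaining one-third for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)       # derive the model on the training set
print("holdout test R²:", model.score(X_test, y_test))  # estimate accuracy on the test set
```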
Holdout method and Random Subsampling
• Random subsampling is a variation of the holdout method in which the holdout method is repeated k times.
• The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration (sketched below).
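Random subsampling is just the holdout split repeated k times with the resulting scores averaged; a sketch reusing X, y, and the imports from the previous snippet (k = 10 is an arbitrary choice).

```python
k = 10
scores = []
for i in range(k):
    # A fresh random holdout split on every repetition.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=i)
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

print(f"average R² over {k} random subsamples:", sum(scores) / k)
```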
Cross Validation
• In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds," D1, D2, … , Dk, each of approximately equal size.
• Training and testing are performed k times.
• In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model.
• That is, in the first iteration, subsets D2, … , Dk collectively serve as the training set to obtain a first model, which is tested on D1.
• The second iteration is trained on subsets D1, D3, … , Dk and tested on D2, and so on.
• Unlike the holdout and random subsampling methods, here each sample is used the same number of times for training and exactly once for testing.
• For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data. A sketch follows.
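A k-fold cross-validation sketch with scikit-learn, reusing the synthetic X and y from the holdout snippet; k = 5 is an arbitrary choice, and cross_val_score performs the "train on the other folds, test on fold Di" loop described above.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=5, shuffle=True, random_state=42)      # 5 mutually exclusive folds
scores = cross_val_score(LinearRegression(), X, y, cv=kfold)  # one score per held-out fold Di

print("per-fold scores:", scores.round(3))
print("cross-validated estimate:", scores.mean().round(3))
```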
Cross Validation
Leave One Out Cross Validation (the k = n case of k-fold)
This approach leaves one data point out of the training data: if there are n data points in the original sample, then n - 1 of them are used to train the model and the single remaining point is used as the validation set. This is repeated for every way the original sample can be separated like this (n times in total), and the error is averaged over all trials to give the overall effectiveness.
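Leave-one-out is exposed directly by scikit-learn; a sketch on the same synthetic X and y (mean squared error is used as the score, since R² is not defined on a single held-out point).

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Each of the n iterations trains on n-1 points and tests on the single point left out.
loo_scores = cross_val_score(
    LinearRegression(), X, y,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)
print("LOOCV mean squared error:", -loo_scores.mean())
```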
Stratified k-Fold Cross Validation
The splitting of data into folds may be governed by criteria such as ensuring that each
fold has the same proportion of observations with a given categorical value, such as the
class outcome value. This is called stratified cross-validation.
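Stratification applies to a categorical outcome, so this sketch assumes a separate classification target y_class with imbalanced classes; each fold then keeps the same class proportions as the full data set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
X_cls = rng.normal(size=(90, 3))        # hypothetical features
y_class = np.repeat([0, 1, 1], 30)      # imbalanced class labels: 30 zeros, 60 ones

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X_cls, y_class)):
    # Every test fold preserves the 1:2 class ratio of the full data set.
    print(f"fold {fold}: class counts in test fold =", np.bincount(y_class[test_idx]))
```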
MLR
• When we want to understand the relationship between a single
predictor variable and a response variable, we often
use simple linear regression.
• However, to understand the relationship between multiple predictor variables and a response variable, we can instead use multiple linear regression.
• If we have p predictor variables, then a multiple linear
regression model takes the form:
• Y = β0 + β1X1 + β2X2 + … + βpXp + ε
where:
Y: The response variable
Xj: The jth predictor variable
βj: The average effect on Y of a one unit increase in Xj, holding all
other predictors fixed
ε: The error term
The values for β0, β1, β2, … , βp are chosen using the least squares method, which minimizes the sum of squared residuals (RSS, also called SSE):
RSS = Σ(yi - ŷi)²
Multiple linear regression is a method we can use to quantify the
relationship between two or more predictor variables and a response
variable.
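A sketch of fitting a multiple linear regression by minimizing RSS with NumPy's least-squares solver; the two predictors and the response below are made-up numbers, not the data behind the worked example that follows.

```python
import numpy as np

# Made-up data: two predictors x1, x2 and a response y.
x1 = np.array([4.0, 7.0, 9.0, 12.0, 15.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = np.array([10.0, 18.0, 25.0, 30.0, 41.0])

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares: the coefficients that minimize RSS = Σ(yi - ŷi)².
coeffs, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coeffs
print(f"ŷ = {b0:.3f} + {b1:.3f}·x1 + {b2:.3f}·x2")
```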
• The formula to calculate b2 is: [(Σx1²)(Σx2y) - (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) - (Σx1x2)²]
• Thus, b2 = [(263.875)(-953.5) - (-200.375)(1152.5)] / [(263.875)(194.875) - (-200.375)²] = -1.656
• The formula to calculate b0 is: ȳ - b1x̄1 - b2x̄2, using the sample means of y, x1, and x2.
• Thus, b0 = 181.5 - 3.148(69.375) - (-1.656)(18.125) = -6.867
• Step 5: Place b0, b1, and b2 in the estimated linear regression equation.
• The estimated linear regression equation is: ŷ = b0 + b1x1 + b2x2
• In our example, it is ŷ = -6.867 + 3.148x1 - 1.656x2
How to Interpret a Multiple Linear Regression Equation

• Here is how to interpret this estimated linear regression equation: ŷ = -6.867 + 3.148x1 - 1.656x2
• b0 = -6.867. When both predictor variables are equal to zero, the
mean value for y is -6.867.
• b1 = 3.148. A one unit increase in x1 is associated with a 3.148 unit
increase in y, on average, assuming x2 is held constant.
• b2 = -1.656. A one unit increase in x2 is associated with a 1.656 unit
decrease in y, on average, assuming x1 is held constant.
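To see this interpretation numerically (a small check, not from the slides): plug two x1 values one unit apart into the fitted equation with x2 held at an arbitrary value, and the predictions differ by exactly b1 = 3.148.

```python
def y_hat(x1, x2):
    # Estimated regression equation from the example above.
    return -6.867 + 3.148 * x1 - 1.656 * x2

# Holding x2 fixed (here at 10), a one-unit increase in x1 changes ŷ by b1 = 3.148.
print(round(y_hat(6, 10) - y_hat(5, 10), 3))   # 3.148
```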

You might also like