Cost Function
Predicting house prices
Now the big question: how do we determine the best-fitting line for our data?
Evaluating Regression Performance
Case 1: Model gives accurate results
Actual yi    Predicted y’i    yi – y’i    yi – mean
10           10               0           -10
20           20               0            0
30           30               0            10
SSE = Σ(yi – y’i)² = 0    SST = Σ(yi – mean)² = 200
Evaluating Regression Performance
Case 2: Model always gives the same result
Actual yi    Predicted y’i    yi – y’i    yi – mean
10           20               -10         -10
20           20               0            0
30           20               10           10
SSE = Σ(yi – y’i)² = 200    SST = Σ(yi – mean)² = 200
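These two quantities are typically combined into the coefficient of determination, R² = 1 – SSE/SST: Case 1 gives R² = 1 (a perfect fit) and Case 2 gives R² = 0 (no better than always predicting the mean). A minimal sketch in plain Python reproducing both cases (the slides show no code, so this layout is an assumption):

# Reproduce Case 1 and Case 2: SSE, SST, and R^2 = 1 - SSE/SST.
actual = [10, 20, 30]
cases = {
    "Case 1 (accurate)": [10, 20, 30],  # predictions equal the actuals
    "Case 2 (constant)": [20, 20, 20],  # always predicts the mean
}

mean_y = sum(actual) / len(actual)
sst = sum((y - mean_y) ** 2 for y in actual)  # total sum of squares

for name, predicted in cases.items():
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    print(name, "SSE =", sse, "SST =", sst, "R^2 =", 1 - sse / sst)
# Case 1 (accurate) SSE = 0 SST = 200.0 R^2 = 1.0
# Case 2 (constant) SSE = 200 SST = 200.0 R^2 = 0.0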
Evaluating Regression Performance
Case 3: Model gives worse results
When the model’s predictions are further from the actual values than the mean is, SSE > SST, so R² = 1 – SSE/SST becomes negative.
[Diagram: the final model is evaluated on the final test set to produce the final evaluation]
Prediction-Linear Regression
Holdout Method and Random Subsampling
• In the holdout method, the given data are randomly partitioned into two independent sets: a training set and a test set.
• Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set.
• The training set is used to derive the model; the model’s accuracy is then estimated with the test set (see the sketch below).
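A minimal holdout sketch, assuming scikit-learn and a linear-regression learner (the slides name neither); X and y are placeholder data:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5], [6]]   # placeholder predictor values
y = [10, 20, 30, 40, 50, 60]         # placeholder responses

# Randomly partition: two-thirds training, one-third held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # derive the model on the training set
print(model.score(X_test, y_test))                # accuracy estimate (R^2) on the test set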
• Random subsampling is a variation of the holdout method in which the holdout method is repeated k times.
• The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration (see the sketch below).
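A random-subsampling sketch under the same assumptions: the holdout split is simply repeated k times with different random seeds and the scores are averaged.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = [[i] for i in range(12)]          # placeholder data
y = [2 * i + 1 for i in range(12)]

k = 5
scores = []
for seed in range(k):                 # repeat the holdout split k times
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))

print(sum(scores) / k)                # overall estimate = average over the k repetitions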
Cross-Validation
• In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D1, D2, …, Dk, each of approximately equal size.
• Training and testing are performed k times.
• In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model.
• That is, in the first iteration, subsets D2, …, Dk collectively serve as the training set to obtain a first model, which is tested on D1.
• The second iteration is trained on subsets D1, D3, …, Dk and tested on D2, and so on.
• Unlike the holdout and random subsampling methods, here each sample is used the same number of times for training and exactly once for testing.
• For classification, the accuracy estimate is the overall number of correct classifications from the k iterations, divided by the total number of tuples in the initial data (see the sketch below).
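A k-fold sketch, again assuming scikit-learn; the data and the choice k = 3 are placeholders:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X = np.arange(9).reshape(-1, 1)   # placeholder data
y = 3 * X.ravel() + 2

scores = []
# Each fold Di serves as the test set exactly once; the other folds train the model.
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))            # estimate averaged over the k iterations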
Leave-One-Out Cross-Validation
• This approach leaves one data point out of the training data: if there are n data points in the original sample, then n – 1 points are used to train the model and the single remaining point is used as the validation set.
• This is repeated for all n ways in which the original sample can be separated this way, and the error is then averaged over all trials to give the overall effectiveness (see the sketch below).
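A leave-one-out sketch under the same scikit-learn assumption; the noisy data here are made up:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

X = np.arange(6).reshape(-1, 1)                 # placeholder data, n = 6
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # train on n - 1 points
    pred = model.predict(X[test_idx])[0]                        # validate on the point left out
    errors.append((y[test_idx][0] - pred) ** 2)

print(np.mean(errors))   # error averaged over all n trials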
Stratified k-Fold Cross-Validation
The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation (see the sketch below).
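A stratified sketch: scikit-learn’s StratifiedKFold (an assumption, as before) keeps the class proportions of y within every fold.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(12).reshape(-1, 1)   # placeholder features
y = np.array([0, 0, 1] * 4)        # two-thirds class 0, one-third class 1

for train_idx, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    print("test fold classes:", y[test_idx])   # each fold preserves the 2:1 ratio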
Multiple Linear Regression (MLR)
• When we want to understand the relationship between a single predictor variable and a response variable, we often use simple linear regression.
• However, to understand the relationship between multiple predictor variables and a response variable, we can instead use multiple linear regression.
• If we have p predictor variables, then a multiple linear regression model takes the form:
• Y = β0 + β1X1 + β2X2 + … + βpXp + ε
where:
Y: The response variable
Xj: The jth predictor variable
βj: The average effect on Y of a one-unit increase in Xj, holding all other predictors fixed
ε: The error term
The values for β0, β1, β2, …, βp are chosen using the least squares method, which minimizes the sum of squared residuals (RSS, also called SSE):
RSS = Σ(yi – ŷi)²
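As a sketch of this least-squares fit, NumPy’s lstsq solves for the coefficients that minimize the RSS above (the slides show no code, and the data values here are made up for illustration):

import numpy as np

# Hypothetical data: 8 observations of two predictors and a response.
X1 = np.array([60, 65, 70, 72, 68, 75, 71, 74], dtype=float)
X2 = np.array([22, 20, 18, 17, 19, 15, 18, 16], dtype=float)
y = np.array([140, 155, 175, 185, 170, 205, 180, 200], dtype=float)

A = np.column_stack([np.ones_like(X1), X1, X2])    # design matrix [1, X1, X2]
coef, rss, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes RSS = sum of squared residuals
b0, b1, b2 = coef
print(b0, b1, b2, rss)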
Multiple linear regression is a method we can use to quantify the relationship between two or more predictor variables and a response variable.
• The formula to calculate b2 is: [(Σx1²)(Σx2y) – (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]
• Thus, b2 = [(263.875)(-953.5) – (-200.375)(1152.5)] / [(263.875)(194.875) – (-200.375)²] = -1.656
• The formula to calculate b0 is: b0 = ȳ – b1x̄1 – b2x̄2, where ȳ, x̄1, and x̄2 are the sample means.
• Thus, b0 = 181.5 – 3.148(69.375) – (-1.656)(18.125) = -6.867
• Step 5: Place b0, b1, and b2 in the estimated linear regression equation.
• The estimated linear regression equation is: ŷ = b0 + b1x1 + b2x2
• In our example, it is ŷ = -6.867 + 3.148x1 – 1.656x2 (see the sketch below).
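The same hand calculation written out in code. Assumptions to flag: Σx1², Σx2², Σx1x2, Σx1y, and Σx2y are read as sums of deviations from the means (consistent with the means 181.5, 69.375, and 18.125 used above); the b1 formula is the standard symmetric counterpart of the b2 formula; and the data are the made-up arrays from the previous sketch (the slide’s raw data table isn’t included here), so the printed values match that sketch rather than the slide’s numbers.

import numpy as np

# Same hypothetical data as the lstsq sketch above.
x1 = np.array([60, 65, 70, 72, 68, 75, 71, 74], dtype=float)
x2 = np.array([22, 20, 18, 17, 19, 15, 18, 16], dtype=float)
y = np.array([140, 155, 175, 185, 170, 205, 180, 200], dtype=float)

d1, d2, dy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()          # deviations from the means
S11, S22 = (d1 ** 2).sum(), (d2 ** 2).sum()                        # Σx1², Σx2²
S12, S1y, S2y = (d1 * d2).sum(), (d1 * dy).sum(), (d2 * dy).sum()  # Σx1x2, Σx1y, Σx2y

den = S11 * S22 - S12 ** 2
b1 = (S22 * S1y - S12 * S2y) / den                 # counterpart of the b2 formula
b2 = (S11 * S2y - S12 * S1y) / den                 # the b2 formula from the step above
b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()    # b0 = ȳ – b1·x̄1 – b2·x̄2
print(b0, b1, b2)                                  # agrees with the lstsq sketch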
How to Interpret a Multiple Linear Regression Equation