
SUPERVISED - REGRESSION.

The unadjusted R² value from estimating a linear regression model will almost always increase if more features are added.

The Total Sum of Squares (TSS) can never be used to select the best-fitting regression model, because it measures the total variation in the target and does not depend on the fitted model.

The Sum of Squared Errors (SSE) can be used to select the best-fitting regression model.

Model estimation involves choosing parameters that minimize the cost function.

It is less concerning to treat a Machine Learning model as a black box for prediction purposes,
compared to interpretation purposes.

Creating a new variable that flags 1 for values above a certain threshold and 0 otherwise converts a regression problem into a classification problem (see the sketch below).
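A minimal sketch of this thresholding step in Python, assuming a NumPy array y of continuous targets and a hypothetical cutoff value threshold:

import numpy as np

y = np.array([3.2, 7.5, 1.1, 9.8, 4.4])   # continuous target values
threshold = 5.0                            # hypothetical cutoff

# Flag 1 when the value is above the threshold, 0 otherwise
y_class = np.where(y > threshold, 1, 0)
print(y_class)                             # [0 1 0 1 0]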

Introduction to Supervised Machine Learning

The types of supervised Machine Learning are:

 Regression, in which the target variable is continuous.
 Example: Stock price prediction, box office revenue, location (x, y coordinates).
 Classification, in which the target variable is categorical.
 Example: Face recognition, customer churn, which word comes next.

To build a classification model you need:

 Features that can be quantified.
 A labeled target or outcome variable.
 A method to measure similarity.

Linear Regression

A linear regression models the relationship between a continuous target variable and one or more scaled input variables. It is usually represented as a dependent variable equal to the sum of an intercept coefficient plus scaling factors times the independent variables.
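Written out (standard notation, added here for reference), the model is:

y = β0 + β1·x1 + β2·x2 + … + βn·xn + ε

where β0 is the intercept, β1 … βn are the coefficients on the independent variables, and ε is the error term.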

Residuals are defined as the difference between an actual value and a predicted value.

A modeling best practice for linear regression is:

 Use a cost function to fit the linear regression model.
 Develop multiple models.
 Compare the results and choose the one that best fits your data and your goal, whether you are using the model for prediction or interpretation.

Three common measures of error for linear regressions are:

 Sum of Squared Errors (SSE)
 Total Sum of Squares (TSS)
 Coefficient of Determination (R²)
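For reference, with yᵢ the actual values, ŷᵢ the model's predictions, and ȳ the mean of the target, these measures are:

SSE = Σ (yᵢ − ŷᵢ)²
TSS = Σ (yᵢ − ȳ)²
R² = 1 − SSE / TSS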

Linear Regression Syntax

The simplest syntax to train a linear regression using scikit-learn is:

from sklearn.linear_model import LinearRegression

LR = LinearRegression()

LR = LR.fit(X_train, y_train)

To score a data frame X_test you would use this syntax:

y_predict = LR.predict(X_test)

Training and Test Splits

Splitting your data into a training and a test set can help you choose a model that has a better chance of generalizing and is not overfitted.

The training data is used to fit the model, while the test data is used to measure error and
performance.
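A minimal sketch of this split with scikit-learn, assuming a feature matrix X and target y (the test_size and random_state values are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit on the training data only
LR = LinearRegression().fit(X_train, y_train)

# Measure error on the unseen test data
test_mse = mean_squared_error(y_test, LR.predict(X_test))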

Training error tends to decrease with a more complex model. Cross-validation error generally has a U-shape: it decreases with more complex models, up to a point at which it starts to increase again.

Cross Validation

The three most common cross validation approaches are:

 k-fold cross validation – Use each of the k subsamples in turn as the test sample: in each split, one part is the test set and the remaining parts form the training set.
 leave one out cross validation – Use each observation as a test sample. This is k-fold cross validation with k = m, where m = number of rows: leave one row out for testing and train on the remaining rows, until every row has been used for testing.
 stratified cross validation – k-fold cross validation with representative samples. Both the train and test sets are stratified (divided so that class ratios are preserved; for example, if the True:False ratio is 3:1 in the train set, it must also be 3:1 in the test set).
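A minimal sketch of k-fold cross validation with scikit-learn, assuming X and y are already defined (the fold count and scoring choice are illustrative):

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# 5-fold cross validation: each fold is used once as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring='neg_mean_squared_error')

# Average cross-validation error across the folds
print(-scores.mean())

scikit-learn also provides LeaveOneOut and StratifiedKFold for the other two approaches.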

[Figure: examples of an underfitting model, a correct model, and an overfitting model.]

 Low k-value → high bias → underfitting.
 High k-value → high variance → overfitting.

 A high variance of parameter estimates across cross-validation subsamples indicates likely overfitting.
 For a dataset with M observations and N features, leave-one-out cross-validation is equivalent to k-fold cross-validation with k = M.
 If a low-complexity model is underfitting during estimation, k-fold cross-validation will still lead to underfitting for any k (most likely True).
 For a dataset with M observations and N features, Stratified cross-validation is not
equivalent to k-fold cross-validation, where k =N-1.
 A linear regression model is being tested by cross-validation. Relative to k-fold cross-validation, stratified cross-validation (with the same k) is not likely to increase the variance of the estimated parameters.
 In K-fold cross-validation, how will increasing k affect the variance (across subsamples) of
estimated model parameters?
 Increasing k will usually increase the variance of estimated parameters.
 The main purpose of splitting your data into a training and test sets is to avoid overfitting.
 Too many features trained → overfitting.

Polynomial Regression

Polynomial terms help you capture nonlinear effects of your features.
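A minimal sketch of adding polynomial terms with scikit-learn, assuming a feature matrix X and target y (the degree of 2 is illustrative):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Expand the features with squared terms and pairwise interactions
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Fit an ordinary linear regression on the expanded features
LR_poly = LinearRegression().fit(X_poly, y)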

Other algorithms that help you extend your linear models are:

 Logistic Regression
 K-Nearest Neighbors
 Decision Trees
 Support Vector Machines
 Random Forests
 Ensemble Methods
 Deep Learning Approaches
Regularization techniques:
 BOTH Ridge regression and Lasso regression add a term to the loss function proportional to
a regularization parameter.
 Compared with Lasso regression (assuming similar implementation), Ridge regression is
less likely to set feature coefficients to zero.
 Regularization decreases the likelihood of overfitting relative to training data.
 Underfitting is characterized by higher errors in both training and test samples.
 Higher model complexity leads to a higher chance of overfitting.

Bias-Variance trade-off:

High bias: Tendency of predictions to miss true values.

- Worsened by missing information and overly simplistic assumptions.
- Misses real patterns (underfitting).
- Model is not complex enough.

High variance: Tendency of predictions to fluctuate.

- Characterized by sensitivity of the output to small changes in the input.
- Often due to overly complex models.
- Overfitting.

Irreducible error: We cannot perfectly model real-world data. Irreducible error reflects intrinsic uncertainty or randomness in the data, and it will be present even in the best possible model.
 The higher the degree of a polynomial regression, the more complex the model is → more likely to overfit (i.e., low bias, high variance).
 At lower degrees, we can see visual signs of bias: predictions are too rigid to capture the
curve pattern in the data.
 At higher degrees, we can see the visual signs of variance: predictions fluctuate widely
because of model’s sensitivity.
 Bias represents a model that is rigid and unable to properly capture the relationship between X and Y, whereas variance represents models that are highly sensitive to minor changes in the input variables.

Regularization and model selection:


 Regularization (lambda) helps to reduce overfitting.

M(w) + lambda*R(w) is the adjusted cost function, where M(w) is the model error, R(w) is a function of the estimated parameters, and lambda is the regularization strength parameter.
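Using this notation, the two standard penalty choices (spelled out later in these notes) are: ridge regression uses R(w) = Σ wⱼ² (the L-2 penalty) and lasso uses R(w) = Σ |wⱼ| (the L-1 penalty).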

 The regularization portion, which is just lambda times R(w), is added onto the original cost function so that we can penalize the model extra if it is too complex.
 Essentially, this allows us to "dumb down" the model: the stronger the weights (parameters), the higher this cost function will be. Since we are ultimately trying to minimize it, we will not be able to fit the model as closely to the training data.
 Regularization adds an (adjustable) regularization strength parameter directly into the cost
function.
 The lambda adds a penalty proportional to the size of the estimated model parameters, or to a function of the parameters. This means the larger the lambda, the more we penalize large parameters. The more we penalize the model for having strong parameters, the less complex the model can be as we try to minimize this function.
Regularization and feature selection:
 Regularization performs feature selection by shrinking the contribution of features (as it adds
more weight to the penalty). For L-1 regularization this is accomplished by driving some
coefficients to zero.
o Lasso will drive some of the coefficients in our linear regression down to zero. A coefficient of zero essentially removes the contribution of that feature altogether, which has the same effect as manually removing some features prior to modeling, except that Lasso finds which ones to remove automatically according to a mathematical criterion.
 Feature selection can also be performed by removing some features.
 Reducing the number of features reduces overfitting.
 For some models (which do not have inbuilt regularization term like ridge/ lasso), fewer
features can improve fitting time and/or results.
 Feature selection can identify the most important features, which can improve model interpretability.

Ridge regression:

 Here beta = the weights (coefficients) and RSS = the residual sum of squares.
 Feature scaling (StandardScaler or MinMaxScaler) plays a major role; StandardScaler is usually preferred.
 The complexity penalty lambda is applied proportionally to squared coefficient values.
o The penalty term (lambda) has the effect of shrinking the coefficients toward zero. Not exactly to zero, unlike Lasso, but the larger a coefficient is, the larger its penalty → the size of those coefficients is reduced.
o This imposes bias but reduces variance.
o We can select best regularization strength lambda via cross-validation.

 Alpha in sklearn, which is used in Lasso and Ridge regressions, is used to reduce overfitting (see the sketch below).
 Lasso → L-1 regularization
 Ridge → L-2 regularization
 To penalize large weights even more heavily, ridge is more useful than lasso.
 As lambda increases, the standardized coefficients should decrease.
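A minimal sketch of ridge regression in scikit-learn with scaling, assuming X_train, y_train, and X_test are available (the alpha value is illustrative and would normally be chosen by cross-validation, e.g. with RidgeCV):

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Scale the features, then fit ridge regression with L-2 regularization strength alpha
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X_train, y_train)
y_predict = ridge.predict(X_test)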
Lasso Regression:

 Here, the complexity penalty lambda, is directly proportional to the absolute values of
coefficients.
 Lasso stands for least absolute shrinkage and selection operator, essentially using the absolute value to penalize the coefficients. Like ridge, this gives the user a means to reduce complexity: an increase in lambda again raises the bias but lowers the variance (i.e., lowers the complexity).
 LASSO is more likely to perform feature selection, in that for a fixed lambda, LASSO is
more likely to result in coefficients being set to zero.
 In Lasso:
o Penalty selectively shrinks some coefficients.
o Can be used for feature selection.
o Slower to converge than ridge regression.
 The difference between LASSO and ridge is that LASSO will quickly zero out many of the coefficient values (see the sketch after this list).
 Shrinkage and selection effect as regularization strength increases: some features drop
to zero.
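A minimal sketch of lasso-based feature selection in scikit-learn, assuming scaled training data X_train_scaled and y_train (the alpha value is illustrative):

import numpy as np
from sklearn.linear_model import Lasso

# L-1 regularization: some coefficients are driven exactly to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)

# Features whose coefficients are non-zero are the ones lasso kept
selected = np.where(lasso.coef_ != 0)[0]
print(selected)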
Which one to select?

 Elastic net combines the penalties of both ridge and lasso regressions (sketched below).
 It requires tuning an additional parameter that determines the emphasis of the L-1 vs. L-2 regularization penalties.
 Recursive feature elimination and lasso are used for feature selection.
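A minimal sketch of elastic net in scikit-learn, reusing the assumed X_train_scaled and y_train (alpha and l1_ratio are illustrative; l1_ratio is the extra parameter that sets the L-1 vs. L-2 emphasis):

from sklearn.linear_model import ElasticNet

# l1_ratio = 1.0 is a pure lasso penalty; l1_ratio = 0.0 is a pure ridge penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train_scaled, y_train)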

Recursive Feature Elimination (RFE):

 Provided by sklearn.
 RFE is an approach that combines:
o A model or estimation approach
o A desired number of features
o RFE then repeatedly applies the model, measures feature importance, and recursively removes the less important features.

 The RFECV class will perform feature elimination using cross-validation.
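A minimal sketch of RFE and RFECV with scikit-learn, assuming X and y are defined (the estimator and the number of features to keep are illustrative):

from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression

# Repeatedly fit the model and drop the least important features until 5 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features

# RFECV chooses the number of features via cross-validation instead
rfecv = RFECV(estimator=LinearRegression(), cv=5)
rfecv.fit(X, y)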


Regularization techniques have an analytical, a geometric, and a probabilistic interpretation.

 If a coefficient of a feature is large, a small change in the feature has a larger impact on the prediction.

 Under the geometric formulation, the cost-function minimum is found at the intersection of the penalty border and a contour of the traditional OLS (Ordinary Least Squares) cost function surface.
If training set is small, high bias / low variance models (e.g. Naive Bayes) tend
to perform better because they are less likely to be overfit.

If training set is large, low bias / high variance models (e.g. Logistic
Regression) tend to perform better because they can reflect more complex
relationships.
