Supervised Regression Notes
The unadjusted R-squared value from estimating a linear regression model will almost always increase if more
features are added.
The Total Sum of Squares (TSS) can never be used to select the best-fitting regression model, since it depends only on the observed target values and not on the model's predictions.
The Sum of Squared Errors (SSE) can be used to select the best-fitting regression model.
Model estimation involves choosing parameters that minimize the cost function.
It is less concerning to treat a Machine Learning model as a black box for prediction purposes
than for interpretation purposes.
Create a new variable that flags 1 above a certain threshold and 0 otherwise – this converts a regression
problem into a classification problem.
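A minimal sketch of this thresholding step, assuming a pandas DataFrame df with a numeric column named price (both names are hypothetical placeholders):

import pandas as pd

# Hypothetical data: a numeric target we want to turn into a binary label.
df = pd.DataFrame({"price": [120, 340, 95, 510, 270]})

# Flag 1 if the value is above the chosen threshold, 0 otherwise.
threshold = 250
df["above_threshold"] = (df["price"] > threshold).astype(int)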
Linear Regression
A linear regression models the relationship between a continuous dependent variable and one or more
independent variables. It is usually represented as the dependent variable being equal to an intercept
plus scaling factors (coefficients) times the independent variables.
Residuals are defined as the difference between an actual value and a predicted value.
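In the usual notation (a sketch; the intercept and coefficient symbols below are the standard textbook ones, not defined elsewhere in these notes):

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p, \qquad e_i = y_i - \hat{y}_i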
The simplest syntax to train a linear regression using scikit-learn is:
from sklearn.linear_model import LinearRegression

LR = LinearRegression()              # instantiate the estimator
LR = LR.fit(X_train, y_train)        # fit on the training data (split as described below)
y_predict = LR.predict(X_test)       # predict on the held-out test data
Splitting your data into a training and a test set can help you choose a model that has better chances
at generalizing and is not overfitted.
The training data is used to fit the model, while the test data is used to measure error and
performance.
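A minimal sketch of the split using scikit-learn's train_test_split (the synthetic data and the test_size/random_state values are arbitrary stand-ins for a real dataset):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Illustrative synthetic data; in practice X and y come from your own dataset.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Hold out 30% of the rows as a test set; the rest is used for fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)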
Training error tends to decrease with a more complex model. Cross-validation error generally has a
U-shape: it decreases with more complex models, up to a point at which it starts to increase again
(illustrated in the sketch below).
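A sketch of how this can be observed, using polynomial degree as the complexity knob on a synthetic dataset (all data and parameter values here are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(0, 3, size=(60, 1))
y = np.sin(2 * X).ravel() + rng.normal(scale=0.3, size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (1, 3, 5, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))   # tends to keep falling
    cv_mse = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_mean_squared_error").mean()  # U-shaped
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}")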
Cross Validation
k-fold cross validation – split the data into k subsamples and use each one in turn as the test set,
with the remaining k-1 parts as the train set (fold layout, e.g.: Train | Train | Test | Train | Train).
leave one out cross validation – use each observation as a test sample. Here k = m,
where m = no. of rows: leave one row out for testing and train on the remaining m-1 rows,
repeating until every row has been used for testing.
stratified cross validation – k-fold cross validation with representative samples: both train and
test sets preserve the class ratio (for example, if the True:False ratio is 3:1 in the train set,
it must also be 3:1 in the test set). Sketches of all three splitters follow.
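A sketch of the three splitters in scikit-learn (the synthetic classification data is an assumption; stratification needs a class label):

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

# Illustrative classification data.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)             # k-fold
loo = LeaveOneOut()                                              # leave-one-out (k = m)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # preserves class ratios

# Each splitter yields train/test row indices for every fold.
for train_idx, test_idx in skf.split(X, y):
    pass  # fit on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]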
[Figure: fit quality comparison – underfitting vs. correct model]
Other algorithms that help you extend your linear models are:
Logistic Regression
K-Nearest Neighbors
Decision Trees
Support Vector Machines
Random Forests
Ensemble Methods
Deep Learning Approaches
Regularization techniques:
BOTH Ridge regression and Lasso regression add a term to the loss function proportional to
a regularization parameter.
Compared with Lasso regression (assuming similar implementation), Ridge regression is
less likely to set feature coefficients to zero.
Regularization decreases the likelihood of overfitting relative to training data.
Underfitting is characterized by higher errors in both training and test samples.
Higher model complexity leads to a higher chance of overfitting.
Bias-Variance trade-off:
Adjusted cost function: M(w) + lambda*R(w), where M(w) is the model error, R(w) is a function of the
estimated parameters, and lambda is the regularization strength parameter.
The regularization portion, lambda times R(w), is added onto our original cost function so that we can
penalize the model extra if it is too complex.
Essentially, this lets us rein in the model: the larger our weights (parameters), the higher this cost
function is going to be, and since we are ultimately minimizing it, we will not be able to fit the model
as closely to the actual training data.
Regularization adds an (adjustable) regularization strength parameter directly into the cost
function.
The lambda adds a penalty proportional to the size of the estimated model parameters, or a
function of the parameters. The larger the lambda, the more we penalize large parameters,
and the more we penalize the model for having large parameters, the less complex that model
can be as we try to minimize this function (written out below).
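Written out as a sketch (w_j denotes the individual estimated coefficients):

J(w) = M(w) + \lambda R(w), \qquad R_{\text{ridge}}(w) = \sum_j w_j^2, \qquad R_{\text{lasso}}(w) = \sum_j |w_j|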
Regularization and feature selection:
Regularization performs feature selection by shrinking the contribution of features (as the penalty
weight lambda increases). For L1 regularization this is accomplished by driving some
coefficients to zero.
o Lasso will drive some of the coefficients in our linear regression down to zero. A coefficient
of zero essentially removes the contribution of that feature altogether, which has the same
effect as manually removing some features prior to modeling, except that Lasso finds which
ones to remove automatically according to a mathematical criterion.
Feature selection can also be performed by removing some features manually.
Reducing the number of features reduces overfitting.
For some models (which do not have a built-in regularization term like ridge/lasso), fewer
features can improve fitting time and/or results.
Identifying the most important features can improve model interpretability.
Ridge regression:
Here, the complexity penalty is proportional to the sum of the squares of the coefficients
(the L2 norm), with lambda controlling its strength.
Lasso regression:
Lasso stands for least absolute shrinkage and selection operator; it uses the absolute value
(L1 norm) of the coefficients to penalize them. Like ridge, this gives the user a means to
reduce complexity: an increase in lambda again will raise the bias but lower the variance
(lower the complexity).
LASSO is more likely to perform feature selection, in that for a fixed lambda, LASSO is
more likely to result in coefficients being set to zero.
In Lasso:
o Penalty selectively shrinks some coefficients.
o Can be used for feature selection.
o Slower to converge than ridge regression.
The difference between LASSO and ridge is that LASSO will quickly zero out many of the coefficient
values. As regularization strength increases, the shrinkage and selection effect grows: some features
drop to zero (see the sketch below).
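A minimal sketch comparing the two in scikit-learn (synthetic data; the alpha values, which play the role of lambda, are arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data where only a few features are truly informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

# Ridge shrinks coefficients but rarely makes them exactly zero;
# Lasso drives some coefficients to exactly zero (feature selection).
print("ridge zero coefficients:", np.sum(ridge.coef_ == 0))
print("lasso zero coefficients:", np.sum(lasso.coef_ == 0))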
Which features to select? Recursive Feature Elimination (RFE) is provided by sklearn.
RFE is an approach that combines:
o A model or estimation approach
o A desired number of features
o RFE then repeatedly applies the model, measures feature importance and
recursively removes the less important features (see the sketch below).
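A sketch of RFE in scikit-learn (the estimator, the synthetic data, and the number of features to keep are arbitrary choices):

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Illustrative data with 10 features, of which only 3 are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# RFE repeatedly fits the model, ranks feature importance, and removes the
# weakest features until only the requested number remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X, y)

print("kept features:", rfe.support_)   # boolean mask of selected features
print("ranking:", rfe.ranking_)         # rank 1 = kept; larger = dropped earlier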
If the coefficient of a feature is large, a small change in that feature has a large impact on the
prediction.
Under the geometric formulation, the cost function minimum is found at the intersection of the
penalty region boundary and a contour of the traditional OLS (Ordinary Least Squares) cost function
surface (sketched below).
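A sketch of that geometric picture written as a constrained problem (t is a budget corresponding to a given lambda):

\min_w \; \|y - Xw\|_2^2 \quad \text{subject to} \quad \|w\|_1 \le t \ \text{(lasso)} \quad \text{or} \quad \|w\|_2^2 \le t \ \text{(ridge)}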
If the training set is small, high bias / low variance models (e.g. Naive Bayes) tend
to perform better because they are less likely to overfit.
If the training set is large, low bias / high variance models (e.g. Logistic
Regression) tend to perform better because they can capture more complex
relationships.