Describe in Brief Different Types of Regression Algorithms
Answer:
1. Linear Regression
Regression is a technique used to model and analyze the relationships between variables, and often
how they jointly contribute to producing a particular outcome. A linear
regression refers to a regression model that is composed entirely of linear variables. Beginning
with the simple case, Single Variable Linear Regression is a technique used to model the
relationship between a single input independent variable (feature variable) and an output
dependent variable using a linear model, i.e. a line.
The more general case is Multi Variable Linear Regression where a model is created for the
relationship between multiple independent input variables (feature variables) and an output
dependent variable. The model remains linear in that the output is a linear combination of the
input variables. We can model a multi-variable linear regression as the following:
Y = a_1*X_1 + a_2*X_2 + ... + a_n*X_n + b
where a_n are the coefficients, X_n are the variables and b is the bias. As we can see, this
function does not include any non-linearities and so is only suited for modelling linearly separable
data. It is quite easy to understand as we are simply weighting the importance of each feature
variable X_n using the coefficient weights a_n. We determine these weights a_n and the
bias b using Stochastic Gradient Descent (SGD). Check out the illustration below for a more
visual picture.
(Figure: Illustration of how Gradient Descent finds the optimal parameters for a Linear Regression.)
Linear regression is fast and easy to model and is particularly useful when the relationship to be
modeled is not extremely complex and you don't have a lot of data.
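As a quick illustration of the above, here is a minimal sketch (on made-up synthetic data, with arbitrary coefficients) of fitting a multi-variable linear regression with scikit-learn's SGDRegressor, which determines the weights a_n and the bias b via Stochastic Gradient Descent:

import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data: three feature variables and a purely linear target
rng = np.random.RandomState(0)
X = rng.rand(200, 3)                                   # X_1, X_2, X_3
y = 4 * X[:, 0] + 2 * X[:, 1] - 3 * X[:, 2] + 0.5      # a linear combination plus a bias

# Fit the linear model with Stochastic Gradient Descent
model = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
model.fit(X, y)

print("learned coefficients a_n:", model.coef_)        # roughly [4, 2, -3]
print("learned bias b:", model.intercept_)             # roughly 0.5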
2. Polynomial Regression
When we want to create a model that is suitable for handling non-linearly separable data, we will
need to use a polynomial regression. In this regression technique, the best fit line is not a straight
line. It is rather a curve that fits into the data points. For a polynomial regression, the power of
some independent variables is more than 1. For example, we can have something like:
Y = a_1*X_1 + a_2*X_2^2 + a_3*X_3^3 + b
We can give some variables exponents, leave others without, and also select the exact exponent
we want for each variable. However, selecting the exact exponent of each variable naturally
requires some knowledge of how the data relates to the output. See the illustration below for a
visual comparison of linear vs polynomial regression.
(Figure: Linear vs Polynomial Regression with data that is non-linearly separable.)
A few key points about Polynomial Regression:
Able to model non-linearly separable data; linear regression can’t do this. It is much more
flexible in general and can model some fairly complex relationships.
Full control over the modelling of feature variables (which exponent to set).
Requires careful design. Need some knowledge of the data in order to select the best
exponents.
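As a small sketch of the idea (again on synthetic data, with the cubic degree chosen arbitrarily), polynomial regression can be implemented in scikit-learn by expanding the features with PolynomialFeatures and then fitting an ordinary linear model on the expanded features:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following a (noisy) cubic relationship
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.5, size=200)

# Degree-3 polynomial regression: expand the features, then fit a linear model
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[2.0]]))   # prediction on the fitted curve at X = 2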
3. Ridge Regression
A standard linear or polynomial regression will fail in the case where there is high collinearity
among the feature variables. Collinearity is the existence of near-linear relationships among the
independent variables. The presence of high collinearity can be determined in a few different
ways:
A regression coefficient is not significant even though, theoretically, that variable should
be highly correlated with Y.
When you add or delete an X feature variable, the regression coefficients change
dramatically.
Your X feature variables have high pairwise correlations (check the correlation matrix).
We can first look at the optimization function of a standard linear regression to gain some insight
as to how ridge regression can help:
min || Xw - y ||²
To alleviate this issue, Ridge Regression adds a small squared bias factor to the variables:
min || Xw - y ||² + λ|| w ||²
Such a squared bias factor shrinks the feature variable coefficients, introducing a small amount of
bias into the model but greatly reducing the variance.
The assumptions of this regression are the same as those of least squares regression, except that
normality is not assumed.
It shrinks the value of the coefficients but never reaches exactly zero, which means it performs no
feature selection.
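A minimal ridge regression sketch in scikit-learn, using synthetic data and an arbitrary penalty weight (scikit-learn calls the squared-penalty weight alpha rather than λ):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data with 10 feature variables
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0)   # alpha plays the role of λ in the formula above
ridge.fit(X, y)
print(ridge.coef_)         # shrunk towards zero, but none are exactly zero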
4. Lasso Regression
Lasso Regression is quite similar to Ridge Regression in that both techniques have the same
premise. We are again adding a biasing term to the regression optimization function in order to
reduce the effect of collinearity and thus the model variance. However, instead of the squared
bias used by ridge regression, lasso uses an absolute value bias:
min || Xw - y ||² + λ|| w ||₁
There are a few differences between Ridge and Lasso regression that essentially come down to
the differences in the properties of L2 and L1 regularization:
Sparsity: refers to only very few entries in a matrix (or vector) being non-zero. The L1-norm
has the property of producing many coefficients with zero values, or very small values with a
few large coefficients, which is why Lasso performs a type of built-in feature selection.
Computational efficiency: the L1-norm does not have an analytical solution, but the L2-norm
does. This allows L2-norm solutions to be calculated computationally efficiently.
However, L1-norm solutions do have the sparsity property, which allows them to be used
along with sparse algorithms, making the calculation more computationally efficient.
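A matching lasso sketch on synthetic data (alpha is again an arbitrary illustrative value); with the L1 penalty a number of the fitted coefficients typically come out exactly zero, which is the built-in feature selection mentioned above:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data where only 3 of the 10 features actually matter
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
print(lasso.coef_)   # several entries typically come out exactly 0.0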
5. ElasticNet Regression
ElasticNet is a hybrid of the Lasso and Ridge Regression techniques. It uses both the L1 and L2
regularization penalties, taking on the effects of both techniques:
min || Xw - y ||² + λ_1|| w ||₁ + λ_2|| w ||²
A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to
inherit some of Ridge's stability under rotation.
It encourages group effect in the case of highly correlated variables, rather than zeroing
some of them out like Lasso.
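A short ElasticNet sketch in the same spirit (the alpha and l1_ratio values are placeholders); l1_ratio controls the trade-off between the L1 and L2 penalties, with 1.0 behaving like lasso and 0.0 like ridge:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Mix of L1 and L2 penalties: alpha sets the overall strength, l1_ratio the mix
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)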
Regularization
This is a form of regression that constrains/regularizes or shrinks the coefficient estimates
towards zero. In other words, this technique discourages learning a more complex or flexible
model, so as to avoid the risk of overfitting.
A simple relation for linear regression looks like this:
Y ≈ β0 + β1*X1 + β2*X2 + ... + βp*Xp
Here Y represents the learned relation and β represents the coefficient estimates for the different
variables or predictors (X).
The fitting procedure involves a loss function, known as the residual sum of squares or RSS:
RSS = Σi (yi - β0 - Σj βj*xij)²
The coefficients are chosen such that they minimize this loss function.
Now, this will adjust the coefficients based on your training data. If there is noise in the training
data, then the estimated coefficients won’t generalize well to the future data. This is where
regularization comes in and shrinks or regularizes these learned estimates towards zero.
Ridge Regression
In ridge regression, the RSS is modified by adding the shrinkage quantity:
RSS + λ Σj βj²
Now, the coefficients are estimated by minimizing this function. Here, λ is the tuning
parameter that decides how much we want to penalize the flexibility of our model. The increase
in flexibility of a model is represented by increase in its coefficients, and if we want to minimize
the above function, then these coefficients need to be small. This is how the Ridge regression
technique prevents coefficients from rising too high. Also, notice that we shrink the estimated
association of each variable with the response, except the intercept β0. This intercept is a measure
of the mean value of the response when xi1 = xi2 = ... = xip = 0.
When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be
equal to least squares. However, as λ→∞, the impact of the shrinkage penalty grows, and the
ridge regression coefficient estimates will approach zero. As can be seen, selecting a good value
of λ is critical. Cross validation comes in handy for this purpose. The penalty used by this
method is based on the L2 norm of the coefficients, so ridge regression is also known as L2
regularization.
The coefficients that are produced by the standard least squares method are scale equivariant,
i.e. if we multiply each input by c then the corresponding coefficients are scaled by a factor of 1/c.
Therefore, regardless of how the predictor is scaled, the multiplication of predictor and
coefficient(Xjβj) remains the same. However, this is not the case with ridge regression, and
therefore, we need to standardize the predictors or bring the predictors to the same scale before
performing ridge regression. The formula used to do this divides each predictor by its standard
deviation:
x̃ij = xij / sqrt((1/n) Σi (xij - x̄j)²)
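The two practical points above, standardizing the predictors and choosing λ by cross validation, can be sketched with scikit-learn as follows (synthetic data, arbitrary alpha grid; alpha is scikit-learn's name for λ):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Standardize the predictors, then pick λ (alpha) by cross validation
model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13)))
model.fit(X, y)
print(model.named_steps["ridgecv"].alpha_)   # the λ selected by cross validation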
Lasso
Lasso is another variation, in which the following function is minimized:
RSS + λ Σj |βj|
It's clear that this variation differs from ridge regression only in how it penalizes the high
coefficients: it uses |βj| (the modulus) instead of the square of βj as its penalty. In statistics, this is
known as the L1 norm.
Let's take a look at the above methods from a different perspective. The ridge regression can be
thought of as solving an equation, where summation of squares of coefficients is less than or
equal to s. And the Lasso can be thought of as an equation where summation of modulus of
coefficients is less than or equal to s. Here, s is a constant that exists for each value of shrinkage
factor λ. These equations are also referred to as constraint functions.
Consider there are 2 parameters in a given problem. Then according to the above formulation,
the ridge regression is expressed by β1² + β2² ≤ s. This implies that ridge regression coefficients
have the smallest RSS(loss function) for all points that lie within the circle given by β1² + β2² ≤ s.
Similarly, for lasso, the equation becomes,|β1|+|β2|≤ s. This implies that lasso coefficients have
the smallest RSS(loss function) for all points that lie within the diamond given by |β1|+|β2|≤ s.
Picture the constraint regions for lasso (a diamond) and ridge regression (a circle), along with the
contours of the RSS (ellipses). Points on the same ellipse share the same value of RSS. For a very
large value of s, the constraint regions will contain the center of the ellipses, making the
coefficient estimates of both regression techniques equal to the least squares estimates. When s is
smaller, however, the lasso and ridge regression coefficient estimates
are given by the first point at which an ellipse contacts the constraint region. Since ridge
regression has a circular constraint with no sharp points, this intersection will not generally
occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-
zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often
intersect the constraint region at an axis. When this occurs, one of the coefficients will equal
zero. In higher dimensions(where parameters are much more than 2), many of the coefficient
estimates may equal zero simultaneously.
This sheds light on the obvious disadvantage of ridge regression, which is model
interpretability. It will shrink the coefficients for least important predictors, very close to zero.
But it will never make them exactly zero. In other words, the final model will include all
predictors. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the
coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently
large. Therefore, the lasso method also performs variable selection and is said to yield sparse
models.
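This difference is easy to see in code. The sketch below (synthetic data, arbitrary penalty strengths) fits ridge and lasso on the same data: ridge keeps every coefficient non-zero, while lasso sets several of them exactly to zero, yielding a sparse model:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, of which only 5 are actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("zero coefficients (ridge):", np.sum(ridge.coef_ == 0))   # typically 0
print("zero coefficients (lasso):", np.sum(lasso.coef_ == 0))   # typically several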
An example of a decision tree can be explained using a simple binary tree. Let's say you want to
predict whether a person is fit given information such as their age, eating habits, physical
activity, etc. The decision nodes here are questions like 'What is the age?', 'Does he exercise?',
'Does he eat a lot of pizzas?', and the leaves are outcomes such as 'fit' or 'unfit'. In
this case it is a binary classification problem (a yes/no type problem).
There are two main types of Decision Trees:
1. Classification Trees, where the decision or the outcome variable is categorical, e.g. 'fit' or 'unfit'.
2. Regression Trees, where the decision or the outcome variable is continuous, e.g. a number like 123.
Working
Now that we know what a Decision Tree is, we’ll see how it works internally. There are many
algorithms out there which construct Decision Trees, but one of the best known is the ID3
Algorithm. ID3 stands for Iterative Dichotomiser 3.
Before discussing the ID3 algorithm, we'll go through a few definitions.
Entropy
Entropy, also called Shannon Entropy, is denoted by H(S) for a finite set S and is a measure of
the amount of uncertainty or randomness in the data:
H(S) = - Σ p(x) * log2 p(x)
where the sum runs over the possible outcomes x and p(x) is the probability of outcome x.
Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss
whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest
possible, since there's no way of determining what the outcome might be. Alternatively, consider
a coin which has heads on both sides; the outcome of such a toss can be predicted perfectly,
since we know beforehand that it'll always be heads. In other words, this event has no
randomness, hence its entropy is zero.
In particular, lower values imply less uncertainty while higher values imply high uncertainty.
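The coin examples above can be checked with a few lines of Python (a small sketch; the entropy helper is written here just for illustration):

from math import log2

def entropy(probabilities):
    """Shannon entropy H(S) = -sum(p * log2(p)) over the outcome probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 -> a fair coin has maximum uncertainty
print(entropy([1.0]))        # 0.0 -> a two-headed coin has no randomness at all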
Information Gain
Information gain, also called Kullback-Leibler divergence and denoted by IG(S, A) for a set S, is
the effective change in entropy after deciding on a particular attribute A. It measures the relative
change in entropy with respect to the independent variables:
IG(S, A) = H(S) - H(S | A)
Alternatively,
IG(S, A) = H(S) - Σ P(x) * H(x)
where IG(S, A) is the information gain from applying feature A, H(S) is the Entropy of the entire
set, and the second term calculates the Entropy after applying the feature A, where P(x) is the
probability of each value x of that feature.
Let’s understand this with the help of an example
Consider a piece of data collected over the course of 14 days where the features are Outlook,
Temperature, Humidity, Wind and the outcome variable is whether Golf was played on the day.
Now, our job is to build a predictive model which takes in the above 4 parameters and predicts
whether Golf will be played on the day. We’ll build a decision tree to do that using ID3
algorithm.
(Table of the 14 days, with columns: Day, Outlook, Temperature, Humidity, Wind, Play Golf.)
Now we’ll go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy
of the current state. In the above example, we can see in total there are 5 No’s and 9 Yes’s.
Yes No Total
9 5 14
H(S) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.94
Remember that the Entropy is 0 if all members belong to the same class, and 1 when half of them
belong to one class and the other half belong to the other class, which is perfect randomness. Here
it's 0.94, which means the distribution is fairly random.
Now the next step is to choose the attribute that gives us the highest possible Information
Gain; that attribute becomes the root node.
Let’s start with ‘Wind’
IG(S, Wind) = H(S) - Σ P(x) * H(x)
where 'x' are the possible values for the attribute. Here, attribute 'Wind' takes two possible
values in the sample data, hence x = {Weak, Strong}.
We'll have to calculate H(S_weak) and H(S_strong):
Amongst all the 14 examples we have 8 places where the wind is weak and 6 where the wind
is Strong.
Wind = Weak Wind = Strong Total
8 6 14
Now out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and 2 of them were 'No'
for Play Golf. So, we have,
H(S_weak) = -(6/8) * log2(6/8) - (2/8) * log2(2/8) = 0.811
Similarly, out of the 6 Strong examples, we have 3 examples where the outcome was 'Yes' for
Play Golf and 3 where we had 'No' for Play Golf.
H(S_strong) = -(3/6) * log2(3/6) - (3/6) * log2(3/6) = 1.0
Remember, here half the items belong to one class while the other half belong to the other. Hence
we have perfect randomness.
Now we have all the pieces required to calculate the Information Gain:
IG(S, Wind) = H(S) - (8/14) * H(S_weak) - (6/14) * H(S_strong)
            = 0.94 - (8/14) * 0.811 - (6/14) * 1.0
            = 0.048
This tells us that the Information Gain obtained by considering 'Wind' as the feature is 0.048. Now
we must similarly calculate the Information Gain for all the features.
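The numbers above can be reproduced with a short Python sketch (the entropy helper is again written just for this illustration):

from math import log2

def entropy(counts):
    """Entropy of a class distribution given the counts per class."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

h_s      = entropy([9, 5])   # entropy of the full set, about 0.94
h_weak   = entropy([6, 2])   # Wind = Weak subset, about 0.811
h_strong = entropy([3, 3])   # Wind = Strong subset, exactly 1.0

ig_wind = h_s - (8 / 14) * h_weak - (6 / 14) * h_strong
print(round(ig_wind, 3))     # about 0.048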
We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we
choose the Outlook attribute as the root node and split the data on its values.
Here we observe that whenever the outlook is Overcast, Play Golf is always 'Yes'. This is no
coincidence; this simple subtree results because Outlook gives the highest information gain.
Now how do we proceed from this point? We simply apply recursion: repeat the same procedure
of selecting the attribute with the highest information gain on each subset of the data.
Now that we've used Outlook, we've got three attributes remaining: Humidity, Temperature, and
Wind. And we had three possible values of Outlook: Sunny, Overcast, and Rain. The Overcast
node already ended up as the leaf node 'Yes', so we're left with two subtrees to compute:
Sunny and Rain.
Code:
Let’s see an example in Python
import pydotplus
from sklearn.datasets import load_iris
from sklearn import tree
from IPython.display import Image, display

__author__ = "Mayur Kulkarni <[email protected]>"

def load_data_set():
    """Load the iris data set."""
    return load_iris()

def train_model(iris):
    """Train a decision tree classifier on the iris features and labels."""
    return tree.DecisionTreeClassifier().fit(iris.data, iris.target)

def display_image(clf, iris):
    """Export the fitted tree to DOT format and render it as a PNG image."""
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names, filled=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    display(Image(data=graph.create_png()))

if __name__ == '__main__':
    iris_data = load_data_set()
    decision_tree_classifier = train_model(iris_data)
    display_image(clf=decision_tree_classifier, iris=iris_data)
Conclusion:
Below is the summary of what we’ve studied in this blog:
Machine Learning: As discussed in this article, machine learning is a field of study
which allows computers to "learn" like humans, without being explicitly programmed.
What is Predictive Modeling: Predictive modeling is a probabilistic process that allows us to
forecast outcomes, on the basis of some predictors. These predictors are basically features that
come into play when deciding the final result, i.e. the outcome of the model.
Validation
The process of deciding whether the numerical results quantifying hypothesized relationships
between variables are acceptable as descriptions of the data is known as validation. Generally,
an error estimation for the model is made after training, better known as evaluation of residuals.
In this process, a numerical estimate of the difference between the predicted and original responses
is made, also called the training error. However, this only gives us an idea about how well our
model does on the data used to train it. It is possible that the model is underfitting or overfitting
the data. So, the problem with this evaluation technique is that it does not give an indication of
how well the learner will generalize to an independent/unseen data set. Cross validation is the
technique that gives us this idea about our model.
Holdout Method
A basic remedy for this involves removing a part of the training data and using it to get
predictions from the model trained on the rest of the data. The error estimation then tells us how our
model is doing on unseen data or the validation set. This is a simple kind of cross validation
technique, also known as the holdout method. Although this method doesn’t take any overhead
to compute and is better than traditional validation, it still suffers from issues of high
variance. This is because it is not certain which data points will end up in the validation set and
the result might be entirely different for different sets.
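A quick sketch of the holdout method with scikit-learn (the iris data set, the 80/20 split and the decision tree model are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as a validation set, train on the remaining 80%
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(clf.score(X_val, y_val))   # accuracy on the unseen validation set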
K Fold Cross Validation
In K Fold cross validation, the data is divided into k subsets. Now the holdout method is
repeated k times, such that each time, one of the k subsets is used as the test set/ validation set
and the other k-1 subsets are put together to form a training set. The error estimation is
averaged over all k trials to get the total effectiveness of our model. As can be seen, every data point
gets to be in the validation set exactly once, and gets to be in the training set k-1 times. This
significantly reduces bias, as we are using most of the data for fitting, and also significantly
reduces variance, as most of the data is also used in a validation set. Interchanging the
training and test sets also adds to the effectiveness of this method. As a general rule and
empirical evidence, K = 5 or 10 is generally preferred, but nothing’s fixed and it can take any
value.
Leave P Out Cross Validation
In this approach, p data points are left out of the training data: the model is trained on the
remaining n - p points and validated on the p held-out points, and this is repeated for all possible
combinations. This method is exhaustive in the sense that it needs to train and validate the model
for all possible combinations, and for moderately large p, it can become computationally infeasible.
A particular case of this method is when p = 1. This is known as Leave One Out Cross
Validation. This method is generally preferred over the previous one because it does not suffer
from such intensive computation, as the number of possible combinations is equal to the number
of data points in the original sample, n.
Cross Validation is a very useful technique for assessing the effectiveness of your model,
particularly in cases where you need to mitigate overfitting. It is also of use in determining the
hyperparameters of your model, in the sense of finding which parameters will result in the lowest
test error. This is all the basics you need to get started with cross validation. You can get started
with all kinds of validation techniques using Scikit-Learn, which gets you up and running with
just a few lines of code in Python.
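Here is a minimal K Fold sketch (with K = 5, matching the rule of thumb above; the iris data set and a decision tree are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: each fold is used once as the validation set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())   # average accuracy across the 5 folds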