Data Analytics Unit 2
Regression analysis is mainly used to determine the strength of predictors and to forecast an effect or a trend. For example, a gym supplement company can use regression analysis to determine how prices and advertisements affect the sales of its supplements.
Note: It’s very important to understand a variable before feeding it into a model; a good set of input variables can strongly influence the success of a business.
The benefit of regression analysis is that it lets businesses crunch their data to make better decisions. A greater understanding of the variables can improve a business’s performance in the coming weeks, months, and years.
Regression analysis is used for one of two purposes: predicting the value
of the dependent variable when information about the independent
variables is known or predicting the effect of an independent variable on
the dependent variable.
Linear regression
The name says it all: linear regression can be used only when there is a linear
relationship among the variables. It is a statistical model used to understand the
association between independent variables (X) and dependent variables (Y).
The variables that are taken as input are called independent variables. In the
example of the gym supplement above, the prices and advertisement effect are the
independent variables, whereas the one that is being predicted is called the
dependent variable (in this case, ‘sales’).
Simple linear regression describes a relationship between only two variables. When there is just one input variable, the equation is:

y = β0 + β1x
If there is more than one independent variable, it is called multiple linear regression and is expressed as follows:

y = β0 + β1x1 + β2x2 + … + βnxn
where x1, x2, …, xn denote the explanatory (independent) variables, β1, β2, …, βn are the slopes of the regression line, and β0 is its Y-intercept.
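As a quick worked illustration of this equation, the prediction is just the intercept plus each coefficient times its variable. The coefficients below are assumed values for the gym-supplement example, not fitted from real data:

# Hypothetical multiple linear regression for the gym-supplement example:
# sales = b0 + b1*price + b2*ad_spend  (all numbers are illustrative assumptions)
b0, b1, b2 = 500.0, -12.5, 3.2

price = 40.0       # unit price
ad_spend = 150.0   # advertising spend

predicted_sales = b0 + b1 * price + b2 * ad_spend
print(predicted_sales)   # 500 - 500 + 480 = 480.0 units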
Usually, regression lines are used in the financial sector and for business
procedures. Financial analysts use regression techniques to predict stock prices,
commodities, etc. whereas business analysts use them to forecast sales,
inventories, and so on.
The best way to fit a line is by minimizing the sum of squared errors, i.e., the squared distances between the predicted values and the actual values. The least squares method is the process of fitting the best curve to a set of data points. The quantity being minimized is:

SSE = Σ (yi − ŷi)²

where yi is the observed value and ŷi is the predicted value.
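A minimal sketch of least squares in Python, using the closed-form formulas for the slope and intercept of a simple regression (the data values are made up for illustration):

import numpy as np

# Made-up data: advertising spend (x) vs. supplement sales (y)
x = np.array([10, 20, 30, 40, 50], dtype=float)
y = np.array([25, 44, 70, 89, 112], dtype=float)

# Closed-form least squares estimates for y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_pred = b0 + b1 * x
sse = np.sum((y - y_pred) ** 2)   # the sum of squared errors being minimized

print(f"intercept={b0:.2f}, slope={b1:.2f}, SSE={sse:.2f}")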
If the degree of the fitted equation is greater than 1 (i.e., polynomial regression), the best-fit line is no longer a straight line; instead, it is a curve that bends to fit the data points.
Sometimes this results in overfitting or underfitting, depending on the degree of the polynomial. Therefore, always plot the relationship to make sure the curve is just right and neither overfitted nor underfitted.
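A short sketch of how the polynomial degree affects the fit (the data here are synthetic, and numpy.polyfit is used only to compare degrees):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
y = 2 + 1.5 * x - 0.3 * x**2 + rng.normal(scale=0.5, size=x.size)  # noisy quadratic

for degree in (1, 2, 8):
    coeffs = np.polyfit(x, y, degree)      # least squares polynomial fit
    y_fit = np.polyval(coeffs, x)
    sse = np.sum((y - y_fit) ** 2)
    print(f"degree={degree}  training SSE={sse:.3f}")

# Degree 1 tends to underfit (high SSE); degree 8 drives the training SSE very low
# but is likely to overfit. Plotting the three curves makes this obvious.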
Logistic regression
Logistic regression analysis is generally used to find the probability of an event. It is
used when the dependent variable is dichotomous or binary. For example, if the
output is 0 or 1, True or False, Yes or No, Cat or Dog, etc., it is said to be a binary
variable. Since it gives us the probability, the output will be in the range of 0-1.
Let’s see how logistic regression squeezes the output into the 0-1 range. We already know that the equation of the best-fit line is:

y = β0 + β1x

Since logistic regression gives a probability, let’s use the probability P instead of y. Written this way, P can exceed the 0-1 limits. To address this, we take the odds, and the equation becomes:

P / (1 − P) = β0 + β1x
Another issue is that the odds on the left-hand side always lie in the range (0, +∞). We don’t want a restricted range, because it may decrease the correlation. To solve this, we take the log odds, whose range is (−∞, +∞):

ln(P / (1 − P)) = β0 + β1x

Since we want to predict the probability P itself, we solve the above equation for P and get:

P = 1 / (1 + e^−(β0 + β1x))
This is also called the logistic (sigmoid) function. Its graph is an S-shaped curve that rises smoothly from 0 to 1.
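A small sketch of the logistic function, showing that it maps any value of β0 + β1x into the (0, 1) range (the coefficients and inputs are assumed for illustration):

import numpy as np

def logistic(z):
    # Sigmoid: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 0.08                      # assumed coefficients (illustrative only)
x = np.array([0, 25, 50, 75, 100], dtype=float)

z = b0 + b1 * x                          # linear part: the log odds
p = logistic(z)                          # probability of the positive class

for xi, pi in zip(x, p):
    print(f"x={xi:5.1f}  P={pi:.3f}")    # every probability stays between 0 and 1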
Ridge Regression
Ridge regression is a technique used when the data suffers from multicollinearity (the independent variables are highly correlated). Under multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, which pushes the observed values far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces these standard errors.
Above, we saw the equation for linear regression. Remember? It can be represented as:

y = a + b*x

This equation also has an error term. The complete equation becomes:

y = a + b*x + e   (the error term e is the value needed to correct for the prediction error between the observed and predicted values)

For multiple independent variables, this becomes:

y = a + b1x1 + b2x2 + … + bnxn + e
In a linear equation, prediction errors can be decomposed into two sub-components: the first is due to bias and the second is due to variance. Prediction error can arise from either of these components or from both. Here, we’ll discuss the error caused by variance.
Ridge regression tackles this by adding a penalty to the least squares objective:

minimize Σ (yi − ŷi)² + λ Σ βj²

In this equation, we have two components. The first is the least squares term, and the second is λ (lambda) times the summation of β² (beta squared), where β is a coefficient. This penalty is added to the least squares term in order to shrink the parameters so that they have a very low variance.
In plain linear regression, we minimize only the sum of squared errors. Remember that the goal of a model is to have low variance and low bias; to achieve this, ridge regression adds the extra penalty term shown above to the cost function.
Important Points:
The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed.
Ridge regression shrinks the values of the coefficients but never makes them exactly zero, which means it performs no feature selection (see the sketch below).
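A minimal sketch of this shrinkage effect using scikit-learn’s Ridge (assuming scikit-learn is installed; the collinear data and the alpha value are made up):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # almost identical to x1: multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)        # alpha plays the role of lambda

print("OLS coefficients:  ", ols.coef_)    # can be wildly unstable under multicollinearity
print("Ridge coefficients:", ridge.coef_)  # shrunk and stable, but not exactly zero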
Lasso Regression
Similar to ridge regression, lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the size of the regression coefficients, and it is capable of reducing variability and improving the accuracy of linear regression models. Its objective is:

minimize Σ (yi − ŷi)² + λ Σ |βj|

Lasso differs from ridge regression in that the penalty uses the absolute values of the coefficients instead of their squares. Penalizing (or, equivalently, constraining) the sum of the absolute values of the estimates causes some of the parameter estimates to turn out exactly zero, so lasso also performs feature selection. The larger the penalty applied, the further the estimates are shrunk towards zero, which results in variable selection out of the given n variables.
Important Points:
The assumptions of lasso regression are the same as those of least squares regression, except that normality is not assumed.
Lasso regression can shrink coefficients all the way to exactly zero, which certainly helps with feature selection (see the sketch below).
If a group of predictors is highly correlated, lasso picks only one of them and shrinks the others to zero.
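A minimal sketch with scikit-learn’s Lasso (assuming scikit-learn is installed; the data and the alpha value are synthetic), showing coefficients being driven exactly to zero:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.normal(size=(n, p))
# Only the first two features actually influence y
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.5).fit(X, y)   # alpha controls the absolute-value penalty
print(lasso.coef_)                   # the irrelevant features get coefficients of exactly 0.0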
Multivariate analysis
Multivariate analysis encompasses all statistical techniques that are used to analyze more than two variables at once. The aim is to find patterns and correlations between several variables simultaneously, allowing for a much deeper, more complex understanding of a given scenario than you’ll get with bivariate analysis.
An example of multivariate analysis
Let’s imagine you’re interested in the relationship between a person’s social media habits and their self-esteem. You could carry out a bivariate analysis comparing just two variables, such as the number of hours spent on social media per day and a self-esteem score.
You may or may not find a relationship between the two variables; however, you know that,
in reality, self-esteem is a complex concept. It’s likely impacted by many different factors—
not just how many hours a person spends on Instagram. You might also want to consider
factors such as age, employment status, how often a person exercises, and relationship status
(for example). In order to deduce the extent to which each of these variables correlates with
self-esteem, and with each other, you’d need to run a multivariate analysis.
So we know that multivariate analysis is used when you want to explore more than two
variables at once. Now let’s consider some of the different techniques you might use to do
this.
2. Multivariate data analysis techniques and examples
There are many different techniques for multivariate analysis, and they can be divided into
two categories:
Dependence techniques
Interdependence techniques
When we use the terms “dependence” and “interdependence,” we’re referring to different
types of relationships within the data. To give a brief explanation:
Dependence methods
Dependence methods are used when one or some of the variables are dependent on others.
Dependence looks at cause and effect; in other words, can the values of two or more
independent variables be used to explain, describe, or predict the value of another, dependent
variable? To give a simple example, the dependent variable of “weight” might be predicted
by independent variables such as “height” and “age.”
In machine learning, dependence techniques are used to build predictive models. The analyst
enters input data into the model, specifying which variables are independent and which ones
are dependent—in other words, which variables they want the model to predict, and which
variables they want the model to use to make those predictions.
Interdependence methods
Interdependence methods are used to understand the structural makeup and underlying
patterns within a dataset. In this case, no variables are dependent on others, so you’re not
looking for causal relationships. Rather, interdependence methods seek to give meaning to a
set of variables or to group them together in meaningful ways.
So: One is about the effect of certain variables on others, while the other is all about the
structure of the dataset.
With that in mind, let’s consider some useful multivariate analysis techniques. We’ll look at:
Factor analysis
Cluster analysis
Multiple linear regression
Multiple linear regression is a dependence method which looks at the relationship between
one dependent variable and two or more independent variables. A multiple regression model
will tell you the extent to which each independent variable has a linear relationship with the
dependent variable. This is useful as it helps you to understand which factors are likely to
influence a certain outcome, allowing you to estimate future outcomes.
As a data analyst, you could use multiple regression to predict crop growth. In this example,
crop growth is your dependent variable and you want to see how different factors affect it.
Your independent variables could be rainfall, temperature, amount of sunlight, and amount of
fertilizer added to the soil. A multiple regression model would show you the proportion of
variance in crop growth that each independent variable accounts for.
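A sketch of this crop-growth example with scikit-learn’s LinearRegression (all of the observations below are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented observations: rainfall (mm), temperature (°C), sunlight (hours/day), fertilizer (kg)
X = np.array([
    [120, 22, 7.5, 30],
    [ 90, 25, 8.0, 20],
    [150, 20, 6.5, 40],
    [ 80, 27, 9.0, 25],
    [110, 23, 7.0, 35],
    [100, 24, 8.5, 28],
], dtype=float)
y = np.array([5.1, 4.3, 5.8, 4.0, 5.0, 4.7])  # crop growth (tonnes per hectare)

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)        # contribution of each independent variable
print("Intercept:   ", model.intercept_)
print("R^2:         ", model.score(X, y))  # proportion of variance in crop growth explained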