Data Analytics Unit 2

REGRESSION MODELING

Regression analysis is a statistical technique for measuring the relationship between
variables. It estimates the value of a dependent variable from the values of one or more
independent variables, and it is one of the most important forms of predictive modeling
used in data analysis. The term “regression” in this context was first coined by Sir Francis
Galton, a cousin of Charles Darwin. The earliest form of regression, the method of least
squares, was developed by Adrien-Marie Legendre and Carl Friedrich Gauss.

The main uses of regression analysis are to determine the strength of predictors and to
forecast an effect or a trend. For example, a gym supplement company can use regression
analysis to determine how prices and advertising affect the sales of its supplements.

Why is Regression Analysis important?


Regression analysis is a statistical technique for evaluating the relationship between two or
more variables. It helps enterprises understand what their data points represent and use them,
in coordination with other business analytics techniques, to make better decisions. Regression
analysis shows how the typical value of the dependent variable changes when one of the
independent variables is varied while the other independent variables remain unchanged.
For this reason, business analysts and other data professionals use this powerful statistical
tool to remove unwanted variables and keep only the important ones.

Why are regression analysis techniques needed?


Regression analysis helps organizations understand what their data points mean
and use them carefully, together with business analysis techniques, to arrive at better
decisions. It shows how the dependent variable varies when one of the independent
variables is varied while the other independent variables remain unchanged. It acts as
a tool that helps business analysts and data experts pick significant variables and
discard unwanted ones.

Note: It’s very important to understand a variable before feeding it into a model; a
good set of input variables can have a real impact on the success of a business.
The benefit of regression analysis is that it turns data crunching into better business
decisions, and a deeper understanding of the variables can shape a business’s
success over the coming weeks, months, and years.

What is the purpose of a regression model?

Regression analysis is used for one of two purposes: predicting the value
of the dependent variable when information about the independent
variables is known or predicting the effect of an independent variable on
the dependent variable.

Types of regression techniques


There are several types of regression analysis, each with its own strengths and
weaknesses. Here are the most common.

Linear regression
The name says it all: linear regression can be used only when there is a linear
relationship among the variables. It is a statistical model used to understand the
association between independent variables (X) and dependent variables (Y).

The variables taken as input are called independent variables. In the gym supplement
example above, price and advertising are the independent variables, whereas the
variable being predicted, sales, is called the dependent variable.

Simple linear regression describes a relationship between only two variables. When there
is only one input variable, the equation is:

Y = β0 + β1*X + e

If there is more than one independent variable, it is called multiple linear regression
and is expressed as follows:

Y = β0 + β1*X1 + β2*X2 + .... + βn*Xn + e

where X1, X2, ...., Xn denote the explanatory variables, β1, β2, ...., βn are the slopes of the
regression line, and β0 is the Y-intercept of the regression line.

If we take two variables, X and Y, there will be two regression lines (illustrated in the small
sketch after this list):

 Regression line of Y on X: gives the most probable Y values from the given values of X.
 Regression line of X on Y: gives the most probable X values from the given values of Y.
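To make the distinction concrete, here is a minimal sketch in Python (not from the original text; the data values are invented) showing that fitting Y on X and fitting X on Y generally give different lines:

import numpy as np

# Hypothetical paired observations, invented purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 7.2])

# Regression line of Y on X: predicts the most probable Y for a given X
slope_yx, intercept_yx = np.polyfit(x, y, deg=1)

# Regression line of X on Y: predicts the most probable X for a given Y
slope_xy, intercept_xy = np.polyfit(y, x, deg=1)

print(f"Y on X: y = {slope_yx:.3f}*x + {intercept_yx:.3f}")
print(f"X on Y: x = {slope_xy:.3f}*y + {intercept_xy:.3f}")
# The two lines coincide only when the correlation is perfect.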

Usually, regression lines are used in the financial sector and for business
procedures. Financial analysts use regression techniques to predict stock prices,
commodities, etc. whereas business analysts use them to forecast sales,
inventories, and so on.

How is the best fit line achieved?

The best way to fit a line is by minimizing the sum of squared errors, i.e., the squared
differences between the predicted values and the actual values. The least squares method
is the process of fitting the best curve to a set of data points. The quantity to be minimized
is:

SSE = Σi (yi - ŷi)²

where yi is the actual value and ŷi is the predicted value.
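As a rough sketch (not part of the original text; the data and variable names are invented), the least squares fit described above can be reproduced with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (X) vs. supplement sales (y)
X = np.array([[10], [20], [30], [40], [50]])   # single input variable
y = np.array([25.0, 41.0, 58.0, 75.0, 95.0])   # observed sales

model = LinearRegression()   # ordinary least squares under the hood
model.fit(X, y)

y_pred = model.predict(X)
sse = np.sum((y - y_pred) ** 2)   # the sum of squared errors being minimized

print("slope (beta1):", model.coef_[0])
print("intercept (beta0):", model.intercept_)
print("sum of squared errors:", sse)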

Assumptions of linear regression

 Independent and dependent variables should be linearly related.
 All the variables should be independent of each other, i.e., a change
in one variable should not affect another variable.
 Outliers must be removed before fitting a regression line.
 There must be no multicollinearity.
Polynomial regression
You must have noticed in the above equations that the power of the independent
variable was one (Y = m*x + c). When the power of the independent variable is
more than one, it is referred to as polynomial regression (e.g., Y = m*x² + c).

Since the degree is no longer 1, the best fit line won’t be a straight line anymore. Instead, it
will be a curve that fits the data points.

Important points to note

 A higher-degree polynomial can sometimes result in overfitting or underfitting. Therefore,
always plot the relationship to make sure the curve is just right, neither overfitted nor underfitted.
 Higher-degree polynomials can produce bad results on extrapolation, so pay attention to the
behaviour of the curve towards the ends.
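As an illustrative sketch only (the data and the chosen degree are assumptions, not from the text), polynomial regression can be implemented by expanding the input with polynomial features and then fitting an ordinary linear model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical curved data, invented for illustration
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.1, 36.2])   # roughly quadratic in X

# Degree-2 expansion turns x into [x, x^2]; higher degrees risk overfitting
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)

# Extrapolation warning: predictions far outside the training range of X
# can behave badly for higher-degree polynomials.
print("prediction at x = 10:", model.predict(poly.transform([[10]])))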

Logistic regression
Logistic regression analysis is generally used to find the probability of an event. It is
used when the dependent variable is dichotomous or binary. For example, if the
output is 0 or 1, True or False, Yes or No, Cat or Dog, etc., it is said to be a binary
variable. Since it gives us the probability, the output will be in the range of 0-1.

Let’s see how logistic regression squeezes the output into the range 0-1. We already know that
the equation of the best fit line is:

y = β0 + β1*x

Since logistic regression gives a probability, let’s model the probability P instead of y. Used
directly, though, the right-hand side can produce values outside the limits of 0-1. As a first
step we work with the odds:

P / (1 - P) = β0 + β1*x

The issue now is that the odds always lie in the range (0, +∞), while the linear term on the
right can take any real value. To remove this restriction, we take the log odds, whose range is
(-∞, +∞):

log(P / (1 - P)) = β0 + β1*x

Since we want to predict the probability P, we solve the above equation for P and get:

P = 1 / (1 + e^-(β0 + β1*x))

This is also called the logistic (sigmoid) function. Its graph is an S-shaped curve that maps
any real-valued input to an output between 0 and 1.

Important points to note

 Logistic regression is mostly used in classification problems.
 Unlike linear regression, it doesn’t require a linear relationship between the dependent and
independent variables, because it applies a non-linear log transformation to the predicted
odds ratio.
 If there are several classes in the output, it is called multinomial logistic regression.
 Like linear regression, it doesn’t allow multicollinearity.
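To connect the formula to practice, here is a minimal, hypothetical sketch using scikit-learn’s LogisticRegression; the feature (hours studied) and the pass/fail labels are invented for the example:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary outcome: hours studied (X) vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns P(y=0) and P(y=1); both always lie between 0 and 1
print("P(pass) for 4.5 hours:", clf.predict_proba([[4.5]])[0, 1])
print("predicted class for 4.5 hours:", clf.predict([[4.5]])[0])

# The learned log odds are linear in x: log(P / (1 - P)) = b0 + b1*x
print("b1 (slope):", clf.coef_[0][0], "b0 (intercept):", clf.intercept_[0])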

Ridge Regression

Ridge regression is a technique used when the data suffers from multicollinearity
(independent variables are highly correlated). Under multicollinearity, even though the least
squares (OLS) estimates are unbiased, their variances are large, which can push the estimates
far from the true value. By adding a degree of bias to the regression estimates, ridge
regression reduces their standard errors.

Above, we saw the equation for linear regression. It can be represented as:

y = a + b*x

This equation also has an error term. The complete equation becomes:

y = a + b*x + e

where the error term e is the value needed to correct for the prediction error between the
observed and predicted values. For multiple independent variables, this becomes:

y = a + b1*x1 + b2*x2 + .... + e

In a linear equation, the prediction error can be decomposed into two sub-components: the
first is due to bias and the second is due to variance. Prediction error can occur due to either
of these components, or both. Here, we’ll discuss the error caused by variance.

Ridge regression tackles the multicollinearity problem through the shrinkage parameter λ
(lambda). Look at the equation below:

minimize  Σ (yi - ŷi)² + λ * Σ βj²

In this equation, we have two components. The first one is the least squares term and the
other one is lambda times the summation of β² (beta squared), where β is a coefficient. The
penalty is added to the least squares term in order to shrink the parameters so that they have
very low variance.

In linear regression, we minimize the cost function. Remember that the goal of a model is to
have low variance and low bias. To achieve this, we add another term to the cost function of
linear regression: lambda times the squared slope.

The equation of ridge regression is therefore:

Cost = Σ (yi - ŷi)² + λ * (slope)²

If there are multiple variables, we take the sum of the squared slopes of all the variables.

Important Points:

 The assumptions of this regression are the same as those of least squares regression, except
that normality is not assumed.
 Ridge regression shrinks the value of the coefficients but never reduces them exactly to zero,
so it does not perform feature selection.
 It is a regularization method and uses L2 regularization.
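As a hedged illustration (the data and the value of lambda are assumptions, not from the text), scikit-learn’s Ridge exposes the shrinkage parameter λ as alpha:

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data with two highly correlated predictors (multicollinearity)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

# alpha plays the role of lambda: larger values shrink the coefficients more
ridge = Ridge(alpha=1.0).fit(X, y)
print("ridge coefficients:", ridge.coef_)   # shrunk, but not exactly zero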

Lasso Regression

Similar to ridge regression, lasso (Least Absolute Shrinkage and Selection Operator) also
penalizes the size of the regression coefficients. In addition, it is capable of reducing the
variability and improving the accuracy of linear regression models. Lasso regression differs
from ridge regression in that it uses absolute values in the penalty function instead of squares.
Penalizing the sum of the absolute values of the estimates causes some of the parameter
estimates to turn out exactly zero. The larger the penalty applied, the further the estimates
are shrunk towards zero. This results in variable selection out of the given n variables.

In other words, lasso regression is very similar to ridge regression: it reduces variability and
improves the accuracy of linear regression models, and in addition it helps us perform feature
selection. Instead of squares, it uses absolute values in the penalty function.

The equation of lasso regression is:

Cost = Σ (yi - ŷi)² + λ * Σ |βj|

Important Points:

 The assumptions of lasso regression are the same as those of least squares regression, except
that normality is not assumed.
 Lasso regression can shrink coefficients to exactly zero, which certainly helps in feature
selection.
 Lasso is a regularization method and uses L1 regularization.
 If a group of predictors is highly correlated, lasso picks only one of them and shrinks the
others to zero.
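Continuing the same hypothetical setup as in the ridge sketch (again an assumption, not part of the original text), scikit-learn’s Lasso applies the L1 penalty and typically drives some coefficients exactly to zero:

import numpy as np
from sklearn.linear_model import Lasso

# Same kind of hypothetical, highly correlated predictors as before
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

# alpha is the strength of the L1 penalty; larger alpha zeroes out more coefficients
lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso coefficients:", lasso.coef_)   # typically one of the correlated pair ends up at or near zero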

Multivariate Analysis

Multivariate analysis encompasses all statistical techniques that are used to analyze more
than two variables at once. The aim is to find patterns and correlations between several
variables simultaneously, allowing for a much deeper, more complex understanding of a
given scenario than you’ll get with bivariate analysis.
An example of multivariate analysis

Let’s imagine you’re interested in the relationship between a person’s social media habits and
their self-esteem. You could carry out a bivariate analysis, comparing the following two
variables:

1. How many hours a day a person spends on Instagram

2. Their self-esteem score (measured using a self-esteem scale)

You may or may not find a relationship between the two variables; however, you know that,
in reality, self-esteem is a complex concept. It’s likely impacted by many different factors—
not just how many hours a person spends on Instagram. You might also want to consider
factors such as age, employment status, how often a person exercises, and relationship status
(for example). In order to deduce the extent to which each of these variables correlates with
self-esteem, and with each other, you’d need to run a multivariate analysis.

So we know that multivariate analysis is used when you want to explore more than two
variables at once. Now let’s consider some of the different techniques you might use to do
this.
Multivariate data analysis techniques and examples

There are many different techniques for multivariate analysis, and they can be divided into
two categories:

 Dependence techniques

 Interdependence techniques

So what’s the difference? Let’s take a look.

Multivariate analysis techniques: Dependence vs. interdependence

When we use the terms “dependence” and “interdependence,” we’re referring to different
types of relationships within the data. To give a brief explanation:

Dependence methods

Dependence methods are used when one or some of the variables are dependent on others.
Dependence looks at cause and effect; in other words, can the values of two or more
independent variables be used to explain, describe, or predict the value of another, dependent
variable? To give a simple example, the dependent variable of “weight” might be predicted
by independent variables such as “height” and “age.”

In machine learning, dependence techniques are used to build predictive models. The analyst
enters input data into the model, specifying which variables are independent and which ones
are dependent—in other words, which variables they want the model to predict, and which
variables they want the model to use to make those predictions.

Interdependence methods

Interdependence methods are used to understand the structural makeup and underlying
patterns within a dataset. In this case, no variables are dependent on others, so you’re not
looking for causal relationships. Rather, interdependence methods seek to give meaning to a
set of variables or to group them together in meaningful ways.
So: One is about the effect of certain variables on others, while the other is all about the
structure of the dataset.

With that in mind, let’s consider some useful multivariate analysis techniques. We’ll look at:

 Multiple linear regression

 Multiple logistic regression

 Multivariate analysis of variance (MANOVA)

 Factor analysis

 Cluster analysis
Multiple linear regression

Multiple linear regression is a dependence method which looks at the relationship between
one dependent variable and two or more independent variables. A multiple regression model
will tell you the extent to which each independent variable has a linear relationship with the
dependent variable. This is useful as it helps you to understand which factors are likely to
influence a certain outcome, allowing you to estimate future outcomes.

Example of multiple regression:

As a data analyst, you could use multiple regression to predict crop growth. In this example,
crop growth is your dependent variable and you want to see how different factors affect it.
Your independent variables could be rainfall, temperature, amount of sunlight, and amount of
fertilizer added to the soil. A multiple regression model would show you the proportion of
variance in crop growth that each independent variable accounts for.
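As a sketch of how such a model might be fitted in practice (all variable names and numbers here are hypothetical, not taken from the text):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observations: rainfall (mm), temperature (°C), sunlight (h/day), fertilizer (kg)
X = np.array([
    [120, 22, 7.0, 3.0],
    [ 90, 25, 8.5, 2.5],
    [150, 20, 6.0, 4.0],
    [110, 23, 7.5, 3.5],
    [ 80, 27, 9.0, 2.0],
    [130, 21, 6.5, 3.8],
])
y = np.array([4.1, 3.6, 4.8, 4.3, 3.2, 4.6])   # crop growth in t/ha, invented

model = LinearRegression().fit(X, y)
print("coefficient for each independent variable:", model.coef_)
print("R^2 (share of variance in crop growth explained):", model.score(X, y))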
