Regression
Regression analysis is a set of statistical methods used for the estimation of relationships between a
dependent variable and one or more independent variables. It can be utilized to assess the strength of
the relationship between variables and for modelling the future relationship between them.
The process of regression analysis helps to understand which factors are
important, which factors can be ignored, and how these factors influence each other.
Regression analysis includes several variations, such as linear, multiple linear, and
nonlinear regression. The most common models are simple linear and multiple linear regression.
•Dependent Variable: the target variable whose value the analysis aims to predict or explain.
•Independent Variables: the factors that influence the target variable; they provide information about how each variable relates to the target.
Regression is concerned with specifying the relationship between a single numeric dependent variable
(the value to be predicted) and one or more numeric independent variables (the predictors).
Regression analysis is used for prediction and forecasting, which overlaps substantially with the field of
machine learning. This statistical method is used across many industries, such as:
•Financial industry: understand trends in stock prices, forecast prices, and evaluate risk in the insurance domain.
•Marketing: understand the effectiveness of marketing campaigns, and forecast product pricing and sales.
•Manufacturing: evaluate the relationships among the variables that determine engine design in order to improve performance.
•Medicine: forecast the effects of different combinations of medicines, for example when preparing generic medicines for diseases.
Linear regression and multiple regression
The simplest of all regression types is linear regression, which tries to establish a
relationship between the independent and dependent variables. The dependent
variable considered here is always a continuous variable.
If the dependent variable is modeled as a function of a single independent variable, the method is known as
simple linear regression.
Simple Linear Regression
X → Y
y = α + βx
The intercept, α (alpha), describes where the line crosses the y axis, while the slope, β (beta), describes
the change in y given a one-unit increase in x.
(Figure: scatterplots illustrating a positive relationship and a negative relationship.)
Suppose we know that the estimated regression parameters in
the equation for the shuttle launch data are:
• a = 4.30
• b = -0.057
Hence, the full linear equation is y = 4.30 - 0.057x. Ignoring for
a moment how these numbers were obtained, we can plot this line over the scatterplot of the data.
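As a minimal sketch, the fitted line can be applied to new x values in Python. The x values below are hypothetical stand-ins, since the actual shuttle measurements are not reproduced here:

# Apply the fitted line y = 4.30 - 0.057x to new x values.
a, b = 4.30, -0.057   # estimated intercept and slope from the text

def predict(x):
    # Predicted y (y-hat) for a given x under the fitted line
    return a + b * x

for x in (50, 60, 70, 80):   # hypothetical x values
    print(f"x = {x}  ->  y-hat = {predict(x):.3f}")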
Ordinary least squares estimation
To determine the optimal estimates of α and β, an estimation method known as ordinary least
squares (OLS) is used. In OLS regression, the slope and intercept are chosen so that they
minimize the sum of the squared errors, that is, the vertical distances between the predicted y values and
the actual y values. These errors are known as residuals.
This objective can be written as Σe² = Σ(y − ŷ)², where e = y − ŷ. In plain language, this equation defines e (the error) as the difference between the actual y value and the
predicted y value. The error values are squared and summed across all points in the data.
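The sketch below computes these residuals and their squared sum on a small made-up dataset (the x and y arrays are toy values, not the shuttle data), using the line from the text:

import numpy as np

x = np.array([53.0, 57.0, 63.0, 70.0, 75.0, 81.0])   # toy predictor values
y = np.array([2.0, 1.0, 1.0, 1.0, 0.0, 0.0])         # toy observed values

a, b = 4.30, -0.057        # the fitted line from the text
y_hat = a + b * x          # predicted values
e = y - y_hat              # residuals: actual minus predicted
sse = np.sum(e ** 2)       # the sum of squared errors that OLS minimizes
print("residuals:", np.round(e, 3))
print("SSE:", round(sse, 3))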
The caret character (^) above the y term is a commonly used feature of statistical notation. It indicates
that the term is an estimate of the true y value and is read as y-hat.
It can be shown using calculus that the value of b that results in the minimum squared error is:
b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = Cov(x, y) / Var(x)
The intercept is then a = ȳ − b·x̄.
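A short sketch of this closed-form estimate, computed on the same toy data as above (assumed values, not the real shuttle measurements):

import numpy as np

x = np.array([53.0, 57.0, 63.0, 70.0, 75.0, 81.0])
y = np.array([2.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# b = Cov(x, y) / Var(x); a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(f"a = {a:.3f}, b = {b:.3f}")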
Covariance
Covariance is a measure of the joint variability of two random variables: Cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / n.
•Positive covariance: reveals that two variables tend to move in the same direction.
•Negative covariance: reveals that two variables tend to move in inverse directions.
The correlation between two variables is a number that indicates how closely their relationship follows
a straight line. Without additional qualification, correlation refers to Pearson's correlation coefficient,
which was developed by the 20th century mathematician Karl Pearson. The correlation ranges between
-1 and +1. The extreme values indicate a perfectly linear relationship, while a correlation close to zero
indicates the absence of a linear relationship.
ρ(X, Y) = Cov(X, Y) / (σX · σY)
Where:
•ρ(X, Y) – the correlation between the variables X and Y
•Cov(X, Y) – the covariance between the variables X and Y
•σX – the standard deviation of the X variable
•σY – the standard deviation of the Y variable
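The formula can be verified directly; the following sketch computes the covariance and the two standard deviations on toy data and checks the result against numpy's built-in corrcoef:

import numpy as np

x = np.array([53.0, 57.0, 63.0, 70.0, 75.0, 81.0])
y = np.array([2.0, 1.0, 1.0, 1.0, 0.0, 0.0])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # population covariance
rho = cov_xy / (x.std() * y.std())                  # Cov(X, Y) / (sigma_X * sigma_Y)
print(f"rho = {rho:.3f}")
print(f"np.corrcoef check = {np.corrcoef(x, y)[0, 1]:.3f}")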
Multiple linear regression
•Multiple linear regression extends simple linear regression to two or more independent variables and is most often used for numeric prediction tasks.
•Example: predicting crop yield from rainfall and temperature.
The multiple regression model takes the form:
y = α + β1x1 + β2x2 + … + βixi + ε
Here, y changes by the amount βi for each unit increase in xi. The intercept α is then the expected value of y
when the independent variables are all zero.
Since the intercept is really no different from any other regression parameter, it can also be denoted as
β0 (pronounced beta-naught), as shown in the following equation:
y = β0 + β1x1 + β2x2 + … + βixi + ε
In this form, β0 can be imagined as multiplying a term x0 that is always equal to 1.
In matrix notation, the model becomes Y = Xβ + ε. The dependent variable is now a vector, Y, with a row for every example. The independent variables
have been combined into a matrix, X, with a column for each feature plus an additional column of '1'
values for the intercept term. The regression coefficients β and the errors ε are also now vectors.
The goal now is to solve for the vector β that minimizes the sum of the squared errors between the
predicted and actual y values. Finding the optimal solution requires matrix algebra; the best estimate of β is given by:
β̂ = (XᵀX)⁻¹ XᵀY
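A sketch of this matrix solution on made-up data with two features, echoing the yield/rainfall/temperature example above (all numbers are invented for illustration):

import numpy as np

rainfall = np.array([10.0, 12.0, 8.0, 15.0, 11.0])   # toy feature 1
temp = np.array([20.0, 22.0, 19.0, 25.0, 21.0])      # toy feature 2
crop_yield = np.array([3.1, 3.6, 2.7, 4.4, 3.3])     # toy dependent variable

# The column of 1s lets beta_0 act as the intercept, as described above
X = np.column_stack([np.ones_like(rainfall), rainfall, temp])
# Solve (X^T X) beta = X^T Y rather than inverting X^T X explicitly
beta = np.linalg.solve(X.T @ X, X.T @ crop_yield)
print("beta_0, beta_1, beta_2 =", np.round(beta, 3))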