ML-U2-Regression

REGRESSION

Regression is a statistical approach used to analyze the relationship between a dependent variable (the target variable) and one or more independent variables (predictor variables). The objective is to determine the most suitable function that characterizes the connection between these variables. Regression seeks the best-fitting model, which can then be used to make predictions or draw conclusions.

Regression in Machine Learning


Regression is a supervised machine learning technique used to predict the value of the dependent variable for new, unseen data. It models the relationship between the input features and the target variable, allowing numerical values to be estimated or predicted.
Regression problems arise when the output variable is a real or continuous value, such as “salary” or “weight”. Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane passing through the points.

Regression Algorithms
There are many different types of regression algorithms, but some of the most common are listed below (a short comparison sketch follows the list):
● Linear Regression
○ Linear regression is one of the simplest and most widely used statistical models. It assumes a linear relationship between the independent and dependent variables, meaning that a change in the dependent variable is proportional to a change in the independent variables.
● Polynomial Regression
○ Polynomial regression is used to model nonlinear relationships between the dependent variable and the independent variables. It adds polynomial terms to the linear regression model to capture more complex relationships.
● Support Vector Regression (SVR)
○ Support vector regression (SVR) is a regression algorithm based on the support vector machine (SVM). SVM is best known for classification tasks, but it can also be applied to regression. SVR fits a function that stays within a margin ε of the observed values wherever possible (an ε-insensitive loss), rather than directly minimizing the squared residuals between predicted and actual values.
● Decision Tree Regression
○ Decision tree regression builds a decision tree to predict the target value. A decision tree is a tree-like structure of nodes and branches: each internal node represents a decision, and each branch represents an outcome of that decision. The goal is a tree that accurately predicts the target value for new data points.
● Random Forest Regression
○ Random forest regression is an ensemble method that combines multiple decision trees to predict the target value. Ensemble methods combine multiple models to improve the performance of the overall model. Random forest regression builds a large number of decision trees, each trained on a different subset of the training data, and makes its final prediction by averaging the predictions of all the trees.
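To make the list above concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (neither appears in the original text), that fits several of these regressors on the same data and compares their test-set error:

# Minimal sketch: fit several common regressors on assumed synthetic data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic regression problem with 5 input features.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regression": SVR(),
    "Decision Tree Regression": DecisionTreeRegressor(random_state=0),
    "Random Forest Regression": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)                              # learn from the training split
    mse = mean_squared_error(y_test, model.predict(X_test))  # error on unseen data
    print(f"{name}: test MSE = {mse:.2f}")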

Applications of Regression

● Predicting prices: for example, a regression model could be used to predict the price of a house based on its size, location, and other features.

● Forecasting trends: for example, a regression model could be used to forecast the sales of a product based on historical sales data and economic indicators.

● Identifying risk factors: for example, a regression model could be used to identify risk factors for heart disease based on patient data.

● Making decisions: for example, a regression model could be used to recommend which investment to buy based on market data.

Machine learning is a branch of artificial intelligence that focuses on the development of algorithms and statistical models that can learn from and make predictions on data. Linear regression is a type of machine learning algorithm, more specifically a supervised machine learning algorithm, that learns from labelled datasets and maps the data points to the most optimized linear function, which can then be used for prediction on new datasets.
First, we should understand what a supervised machine learning algorithm is. It is a type of machine learning where the algorithm learns from labelled data, meaning a dataset in which the target value of each record is already known. Supervised learning has two types:

● Classification: predicts the class of the dataset based on the independent input variables. A class is a categorical or discrete value, for example whether the image of an animal shows a cat or a dog.

● Regression: predicts continuous output variables based on the independent input variables, for example the prediction of house prices based on parameters such as house age, distance from the main road, location, and area.

Types of Regression Techniques

As the machine learning domain has developed, regression analysis techniques have gained popularity and grown well beyond the simple y = mx + c. There are several types of regression techniques, each suited to different types of data and different kinds of relationships. The main types of regression techniques are:

1. Linear Regression
2. Polynomial Regression
3. Stepwise Regression
4. Decision Tree Regression
5. Random Forest Regression
6. Support Vector Regression
7. Ridge Regression
8. Lasso Regression
9. ElasticNet Regression
10. Bayesian Linear Regression

Linear Regression

Linear regression is used for predictive analysis. It is a linear approach to modelling the relationship between the criterion (the scalar response) and one or more predictors (explanatory variables). Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. With many predictors or little data, there is a danger of overfitting. The general form of the linear regression model is:

y = β0 + β1X1 + β2X2 + … + βnXn

This is the most basic form of regression analysis and is used to model a linear relationship between a single dependent variable and one or more independent variables.

Below, a linear regression model is instantiated to fit a linear relationship between input features (X) and target values (y); this is only a simple demonstration of the approach.
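A minimal sketch, assuming scikit-learn and invented synthetic data (neither is specified in the original text):

import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed synthetic data: y ≈ 3x + 5 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # single input feature
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)   # target values

model = LinearRegression()    # instantiate the model
model.fit(X, y)               # fit the linear relationship between X and y
print("slope (beta1):", model.coef_[0])
print("intercept (beta0):", model.intercept_)
print("prediction at x = 4:", model.predict([[4.0]])[0])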

Simple Linear Regression

This is the simplest form of linear regression, and it involves only one
independent variable and one dependent variable. The equation for simple
linear regression is:

y=β0+β1X

where:
● Y is the dependent variable

● X is the independent variable

● β0 is the intercept

● β1 is the slope
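As a sketch of how β0 and β1 can be estimated by ordinary least squares (the data below is invented for illustration and is not part of the original text):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])   # roughly y = 2x

x_mean, y_mean = x.mean(), y.mean()
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
beta0 = y_mean - beta1 * x_mean                                          # intercept
print(f"y = {beta0:.3f} + {beta1:.3f} * x")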

Multiple Linear Regression

This involves more than one independent variable and one dependent
variable. The equation for multiple linear regression is:

y = β0 + β1X1 + β2X2 + … + βnXn

where:

● Y is the dependent variable

● X1, X2, …, Xn are the independent variables

● β0 is the intercept

● β1, β2, …, βn are the slopes

The goal of the algorithm is to find the best-fit line equation that can predict the values based on the independent variables.

In regression, a set of records with X and Y values is available, and these values are used to learn a function; this learned function can then be used to predict Y for an unseen X. Because Y is continuous in regression, a function is required that predicts a continuous Y given X as independent features.
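A minimal sketch of multiple linear regression, assuming scikit-learn and two invented predictors X1 and X2, with data generated purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))                              # columns are X1 and X2
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.5, 200)  # known coefficients plus noise

model = LinearRegression().fit(X, y)
print("beta0 (intercept):", model.intercept_)   # close to 4.0
print("beta1, beta2 (slopes):", model.coef_)    # close to [2.0, -3.0]

# The learned function predicts Y for unseen X.
print("prediction for X1=5, X2=1:", model.predict([[5.0, 1.0]])[0])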


Evaluation Metrics for Linear Regression

A variety of evaluation measures can be used to determine the strength of

any linear regression model. These assessment metrics often give an

indication of how well the model is producing the observed outputs.

The most common measurements are:

Mean Square Error (MSE)

Mean Squared Error (MSE) is an evaluation metric that calculates the

average of the squared differences between the actual and predicted

values for all the data points. The difference is squared to ensure that

negative and positive differences don’t cancel each other out.

MSE = (1/n) Σ (yi − ŷi)²

Here,

● n is the number of data points.

● yi is the actual or observed value for the ith data point.

● ŷi is the predicted value for the ith data point.

MSE is a way to quantify the accuracy of a model’s predictions. MSE is

sensitive to outliers as large errors contribute significantly to the overall

score.
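A small sketch of the MSE calculation on invented actual and predicted values (assumed purely for illustration):

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred   = np.array([2.5, 5.5, 7.0, 10.0])

mse = np.mean((y_actual - y_pred) ** 2)   # average of the squared differences
print("MSE:", mse)                        # (0.25 + 0.25 + 0 + 1) / 4 = 0.375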

Mean Absolute Error (MAE)


Mean Absolute Error is an evaluation metric used to calculate the accuracy

of a regression model. MAE measures the average absolute difference

between the predicted values and actual values.

Mathematically, MAE is expressed as:

MAE = (1/n) Σ |Yi − Ŷi|

Here,

● n is the number of observations.

● Yi represents the actual values.

● Ŷi represents the predicted values.

A lower MAE value indicates better model performance. MAE is less sensitive to outliers than MSE because the differences are not squared.
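A matching sketch of the MAE calculation on the same invented values:

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred   = np.array([2.5, 5.5, 7.0, 10.0])

mae = np.mean(np.abs(y_actual - y_pred))   # average absolute difference
print("MAE:", mae)                         # (0.5 + 0.5 + 0 + 1) / 4 = 0.5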

Root Mean Squared Error (RMSE)

The square root of the residuals’ variance is the Root Mean Squared Error.

It describes how well the observed data points match the expected

values, or the model’s absolute fit to the data.

In mathematical notation, it can be expressed as:

RMSE = √(RSS / n) = √( Σ (yi_actual − yi_predicted)² / n )

To obtain an unbiased estimate, the sum of squared residuals is divided by the model’s degrees of freedom rather than by the total number of data points. The resulting figure is referred to as the Residual Standard Error (RSE).

In mathematical notation, it can be expressed as:

RSE = √( RSS / (n − 2) ) = √( Σ (yi_actual − yi_predicted)² / (n − 2) )

RMSE is not as informative a metric as R-squared. Root Mean Squared Error can fluctuate when the units of the variables vary, since its value depends on the variables’ units (it is not a normalized measure).
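A short sketch contrasting RMSE (RSS divided by n) with RSE (RSS divided by the degrees of freedom, n − 2), using invented values:

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred   = np.array([2.5, 5.5, 7.0, 10.0, 10.5])

rss  = np.sum((y_actual - y_pred) ** 2)   # residual sum of squares
n    = len(y_actual)
rmse = np.sqrt(rss / n)                   # divide by the number of data points
rse  = np.sqrt(rss / (n - 2))             # divide by the degrees of freedom
print("RMSE:", rmse, "RSE:", rse)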

Coefficient of Determination (R-squared)

R-Squared is a statistic that indicates how much variation the developed


model can explain or capture. It is always in the range of 0 to 1. In general,
the better the model matches the data, the greater the R-squared number.
In mathematical notation, it can be expressed as:

R² = 1 − (RSS / TSS)

● Residual Sum of Squares (RSS): the sum of the squared residuals over all data points in the plot or data. It measures the difference between the observed output and the predicted output.

RSS = Σ (yi − b0 − b1xi)²

● Total Sum of Squares (TSS): the sum of the squared deviations of the data points from the mean of the response variable.

TSS = Σ (yi − ȳ)²

The R-squared metric is a measure of the proportion of variance in the dependent variable that is explained by the independent variables in the model.
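A small sketch of R-squared computed from RSS and TSS on invented values:

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred   = np.array([2.8, 5.4, 6.9, 9.3, 10.6])

rss = np.sum((y_actual - y_pred) ** 2)            # unexplained variation
tss = np.sum((y_actual - y_actual.mean()) ** 2)   # total variation around the mean
r2  = 1 - rss / tss
print("R-squared:", r2)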

Adjusted R-Squared Error

Adjusted R2 measures the proportion of variance in the dependent

variable that is explained by independent variables in a regression model.

Adjusted R-square accounts for the number of predictors in the model and penalizes the model for including irrelevant predictors that do not contribute significantly to explaining the variance in the dependent variable.

Mathematically, adjusted R² is expressed as:

Adjusted R² = 1 − [ (1 − R²)(n − 1) / (n − k − 1) ]

Here,

● n is the number of observations

● k is the number of predictors in the model

● R² is the coefficient of determination

Adjusted R-square helps to prevent overfitting. It penalizes the model

with additional predictors that do not contribute significantly to explain

the variance in the dependent variable.
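A short sketch of the adjusted R-squared formula, assuming R² = 0.85 from a model with n = 50 observations and k = 5 predictors (all three numbers invented for illustration):

r2, n, k = 0.85, 50, 5
adjusted_r2 = 1 - ((1 - r2) * (n - 1)) / (n - k - 1)
print("Adjusted R-squared:", round(adjusted_r2, 4))   # 1 - (0.15 * 49) / 44 ≈ 0.833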

FINDING THE LINE:


A linear regression lets you use one variable to predict another variable’s value.

Regression line formula

The regression line formula used in statistics is the same as the one used in algebra:

y = mx + b

Where: x = horizontal axis

y = vertical axis

m = the slope of the line (how steep it is)

b = the y-intercept (where the line crosses the Y axis)

For any data set, the slope m and the intercept b can be estimated directly from the observations, as in the sketch below.
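A minimal sketch, assuming NumPy and invented sample points, of fitting m and b from data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.2, 5.1, 6.8, 9.2, 10.9])

m, b = np.polyfit(x, y, deg=1)    # degree-1 (straight line) least-squares fit
print(f"y = {m:.2f}x + {b:.2f}")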

CORRELATION COEFFICIENT
Correlation is a statistical measure that describes the extent to which two

variables are related to each other. It quantifies the direction and strength
of the linear relationship between variables. Generally, the correlation between any two variables is one of three types:

● Positive Correlation

● Zero Correlation

● Negative Correlation

The Pearson Correlation Coefficient, denoted as r, is a statistical

measure that calculates the strength and direction of the linear

relationship between two variables on a scatterplot. The value of r

ranges between -1 and 1, where:

● 1 indicates a perfect positive linear relationship,

● -1 indicates a perfect negative linear relationship, and

● 0 indicates no linear relationship between the variables.


Pearson’s Correlation Coefficient Formula
Karl Pearson’s correlation coefficient formula is the most commonly used

and the most popular formula to get the statistical correlation coefficient.

It is denoted with the lowercase “r”. The formula for Pearson’s

correlation coefficient is shown below:

r = [ n(∑xy) − (∑x)(∑y) ] / √( [ n∑x² − (∑x)² ][ n∑y² − (∑y)² ] )

The full name for Pearson’s correlation coefficient is the Pearson Product Moment Correlation (PPMC). It describes the linear relationship between two sets of data.

Pearson’s correlation measures both the strength of a linear relationship between the two variables (given by the coefficient r, a value between −1 and +1) and its existence (given by the p-value); if the outcome is significant, we conclude that the correlation exists.

Cohen (1988) says that an absolute value of r of 0.5 is classified as large,

an absolute value of 0.3 is classified as medium and an absolute value of

0.1 is classified as small.
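A short sketch, assuming SciPy and invented study-time/test-score data, that computes both r and the p-value mentioned above:

import numpy as np
from scipy.stats import pearsonr

study_hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
test_scores = np.array([52, 55, 61, 64, 70, 72, 78, 85], dtype=float)

r, p_value = pearsonr(study_hours, test_scores)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")   # r near +1 indicates a strong positive correlation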


The interpretation of the Pearson’s correlation coefficient is as follows:

● A correlation coefficient of 1 means that for every positive increase in one variable, there is a positive increase of a fixed proportion in the other. For example, shoe size goes up in perfect correlation with foot length.

● If the correlation coefficient is 0, it indicates that there is no linear relationship between the variables.

● A correlation coefficient of -1 means that for every positive increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of water in a tank decreases in perfect correlation with the amount of water flowing out of the tap.

The Pearson correlation coefficient essentially captures how closely the

data points tend to follow a straight line when plotted together. It’s

important to remember that correlation doesn’t imply causation – just

because two variables are related, it doesn’t mean one causes the

change in the other.

Pearson Correlation Coefficient Table

● 0 < r ≤ 1 (Positive correlation): an increase in one variable associates with an increase in the other. Illustrative example: Study Time vs. Test Scores, where more hours spent studying tends to lead to higher test scores.

● r = 0 (No correlation): no discernible relationship between the changes in the two variables. Illustrative example: Shoe Size vs. Reading Skill, where a person’s shoe size doesn’t predict their ability to read.

● -1 ≤ r < 0 (Negative correlation): an increase in one variable associates with a decrease in the other. Illustrative example: Outdoor Temperature vs. Home Heating Cost, where heating costs in the home increase as the outdoor temperature decreases.
Pearson Correlation Coefficient Interpretation
Interpreting the Pearson correlation coefficient (r) involves assessing the
correlation strength, direction, and correlation significance of the relationship
between two variables. Here’s a guide to interpreting r:
1. Strength of Relationship:

● Close to +1: Indicates a strong positive linear relationship. As one

variable increases, the other tends to increase proportionally.

● Close to -1: Suggests a strong negative linear relationship. As one

variable increases, the other tends to decrease proportionally.

● Close to 0: Implies a weak or no linear relationship. Changes in one

variable do not consistently predict changes in the other.

2. Direction of Relationship:

● Positive r: Both variables tend to increase or decrease together.

● Negative r: One variable tends to increase as the other decreases, and

vice versa.

3. Significance:

● Statistical significance indicates whether the observed correlation

coefficient is likely to occur due to chance.

● Significance is typically assessed using a hypothesis test, such as the

t-test for correlation coefficient, with the null hypothesis stating that the true

correlation coefficient in the population is zero.

● If the p-value is less than the chosen significance level (e.g., 0.05), the

correlation is considered statistically significant.

4. Scatterplot Examination:
● Visual inspection of a scatterplot can provide additional insights into

the relationship between variables.

● A scatterplot allows you to assess the linearity, directionality, and

presence of outliers, complementing the numerical interpretation of r.

5. Caution:

● Correlation does not imply causation. Even if a strong correlation is

observed between two variables, it does not necessarily mean that changes in

one variable cause changes in the other.

● Other factors, such as confounding variables or omitted variables,

may influence the observed correlation.

6. Sample Size:

● Larger sample sizes tend to provide more reliable estimates of

correlation coefficients, reducing the likelihood of obtaining spurious

correlations.

7. Context Dependence:
● The interpretation of r should consider the specific context and

subject matter of the study. What is considered a strong or weak correlation may

vary depending on the field of research and the variables under investigation.

