Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. It finds the best fit straight line through the data points to establish a predictive equation for the dependent variable based on the independent variables. The model summary provides coefficients, R-squared value, p-values and other metrics to evaluate the model fit and significance of predictors. The regression equation can be used to predict new values of the dependent variable given the independent variable values.


Linear Regression

 Linear regression is a predictive statistical approach for modelling the relationship between a dependent variable and a given set of independent variables.
 It is a modelling technique used for predictive analytics.
 Linear regression models use a straight line to explain the relationship.
 The variable being predicted is called the dependent variable or response variable (y).
 The variable used to predict the value of the dependent variable is called the independent variable (x).
Linear Regression

 Regression is a supervised machine learning algorithm used to model and analyse the relationships between variables.
 If there is one independent variable, it is called simple linear regression.
 With more than one independent variable, the process is called multiple linear regression.
Scatter Plot

• A scatter plot is drawn between the x and y variables, which shows the relationship between them.
• The relationship can take three broad forms:
• No relationship
• Positive relationship
• Negative relationship
When to apply Linear Regression

 The correlation coefficient shows that the data is likely to be able to predict future outcomes.
 The scatter plot of the data appears to form a straight line.
 In that case, linear regression can be used to find a predictive function.
Line of Best Fit

• A straight line is fitted to the scatter plot which estimates the y variable for a given value of the x variable.
• The line is fitted in such a way that it minimises the error in prediction.
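The least-squares idea above can be sketched in R (a toy check using the built-in mtcars data; the comparison line, an intercept-shifted copy of the fit, is an illustrative choice, not part of the slides):

```r
# The fitted line minimises the sum of squared prediction errors
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)   # best-fit line for mpg vs weight
sse_fit <- sum(residuals(fit)^2)     # squared error of the fitted line

# Any other line does worse; here we shift the intercept up by 1
shifted <- (coef(fit)[1] + 1) + coef(fit)[2] * mtcars$wt
sse_shifted <- sum((mtcars$mpg - shifted)^2)

sse_shifted > sse_fit                # TRUE: the fitted line has less error
```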
Simple/Multiple Linear Regression

 Salary and CGPA
 Sales and discounts offered
 Impact of demonetisation on GDP
 Market share of a brand based on price, promotion expenses, competitors, etc.
Simple Linear Regression Coefficients

Simple linear regression is estimated by the formula

y = β0 + β1x + ε

where
y is the dependent variable,
x is the independent variable,
β0 is the intercept, which gives the predicted value of y when x is zero,
β1 is the regression coefficient, and
ε is the error of the estimate, also called the residual.
The residual terms represent the difference between the observed value and the predicted value.
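A minimal sketch of estimating β0 and β1 in R, using the built-in mtcars data (the choice of mpg as y and wt as x here is illustrative):

```r
# Fit y = beta0 + beta1 * x + error, with mpg as y and wt as x
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)

beta0 <- coef(fit)["(Intercept)"]  # predicted mpg when wt is zero
beta1 <- coef(fit)["wt"]           # change in mpg per unit change in wt
res   <- residuals(fit)            # observed mpg minus predicted mpg
```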
Validation of Simple Linear Regression

The p-value can be interpreted as follows:
If the p-value is less than the α value (α is the probability of making an error; it can be 0.01, 0.05 or 0.10, and is commonly taken as 0.05), the test results are significant.
If the p-value is more than 0.05, the test results are not significant. This means that the regression model is not valid.
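This check can be done programmatically; a sketch in R, assuming the simple mpg ~ wt model on mtcars:

```r
# Extract the slope's p-value from the model summary and compare with alpha
data(mtcars)
fit <- lm(mpg ~ wt, data = mtcars)
alpha <- 0.05
p_value <- summary(fit)$coefficients["wt", "Pr(>|t|)"]
significant <- p_value < alpha  # TRUE here: wt significantly predicts mpg
```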
MULTIPLE LINEAR REGRESSION

 Multiple regression has more than one predictor variable and one response variable.
 The predictor variables can be continuous or categorical.
 Multiple linear regression assumes that there is a linear relationship between the dependent variable and each of the independent variables.
 Multiple linear regression also requires that the independent variables are not highly correlated with each other.
Dummy variables in regression

 A dummy variable is an artificial variable which represents an attribute with two or more categories.
 Dummy variables take discrete values such as 1 or 0, indicating the presence or absence of a particular category.
 For example, the city in which a person resides has three categories: Mumbai, Delhi and Chennai. The variable city can be represented as,

Mumbai  Delhi  Chennai
  1       0      0
  0       1      0
  0       0      1

 But the third category can be represented as 0 0 0 instead of 0 0 1, so the third column is not needed.
 To represent a categorical variable with n categories/labels, n−1 dummy variables need to be defined.
 R automatically converts factors into dummy variables while building a linear regression model using lm().
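The encoding can be inspected directly with model.matrix(), which is what lm() uses internally; the city values below are the example from the slide:

```r
# A 3-level factor becomes n-1 = 2 dummy columns; the first level is baseline
city <- factor(c("Mumbai", "Delhi", "Chennai"))
mm <- model.matrix(~ city)
mm
# Factor levels are alphabetical, so "Chennai" is the baseline level:
# the Chennai row has 0 in both the cityDelhi and cityMumbai columns
```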
Multiple Linear Regression Coefficients

Multiple linear regression can be depicted by the formula

Y = β0 + β1x1 + β2x2 + … + βnxn + ε

where
Y is the dependent or predicted variable,
β0 is the intercept, which is the value of Y when all the x values are zero,
β1, β2, …, βn are the regression coefficients, each representing the change in Y for a unit change in the corresponding x variable, and
ε is the residual, or the error in prediction.
Coefficient of Determination

 The R-squared value denotes the proportion of variation in the dependent variable explained by the regression model with the help of the independent variables.
The value of R-squared varies between 0 and 1. Values near 1 indicate that the model fits the data well, whereas small values indicate that it does not.
 Adjusted R-squared measures the proportion of variation explained by only those independent variables that really help in explaining the dependent variable. It compensates for the addition of variables and increases only if a new predictor variable improves the model.
 RMSE (root mean square error) measures the effectiveness of the regression model.
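These fit metrics can be pulled from a fitted model in R; a sketch using the same mtcars model built later in these slides:

```r
# R-squared, adjusted R-squared and RMSE for mpg ~ disp + hp + wt
data(mtcars)
model <- lm(mpg ~ disp + hp + wt, data = mtcars)
s <- summary(model)

r2     <- s$r.squared                     # proportion of variation explained
adj_r2 <- s$adj.r.squared                 # penalised for number of predictors
rmse   <- sqrt(mean(residuals(model)^2))  # root mean square error
```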
Multiple Linear Regression Equation

 The mathematical equation for multiple regression is

Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn + ε
Multiple Linear Regression in R

 The lm() function creates the relationship model between the predictor and the response variables.
 The model determines the values of the coefficients from the input data.
 These coefficients can then be used to predict the value of the response variable for a given set of predictor variables.
Syntax

 lm(y ~ x1 + x2 + x3 ..., data)
 y is the response variable.
 x1, x2, ... xn are the predictor variables.
 data is the data frame containing the variables.
Objective

 Dataset: mtcars
 Goal of the model: to establish the relationship between "mpg" as the response variable and "disp", "hp" and "wt" as predictor variables.
 We create a subset of these variables from the mtcars dataset for this purpose.

data(mtcars)
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
print(head(input))
model <- lm(mpg ~ disp + hp + wt, data = input)
print(model)
summary(model)
1. Create Equation for Regression Model

 Based on the above intercept and coefficient values, we create the mathematical equation:
 Y = a + βdisp·x1 + βhp·x2 + βwt·x3, or
 Y = 37.15 + (−0.000937)·x1 + (−0.0311)·x2 + (−3.8008)·x3
Apply Equation for predicting New Values

 We can use the regression equation created above to predict the mileage when a new set of values for displacement, horsepower and weight is provided.
 For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage is
 Y = 37.15 + (−0.000937)·221 + (−0.0311)·102 + (−3.8008)·2.91 = 22.7104
Interpretation

2. R-squared value: 82.68%, indicating that 82.68% of the variation in mpg is explained by disp, hp and wt.
3. p-value of the model: 8.65e-11, which is highly significant (the p-value needs to be less than 0.05).
4. Individual coefficients: to see which predictor variables are significant, examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values.
5. Significance codes: '***' marks highly significant coefficients.
# Predicting for a single data point
predict(model, newdata = data.frame(disp = 221, hp = 102, wt = 2.91))

# Prediction for the whole dataset
Prediction <- predict(model)
View(Prediction)
Understanding by Plotting
Actual <- mtcars$mpg
BackTrack <- data.frame(Actual, Prediction)
BackTrack
plot(Actual, col = "red", pch = "o", ylab = "Miles Per Gallon",
     lty = 1, ylim = c(0, 70))
lines(Actual, col = "red")
points(Prediction, col = "blue", pch = "*", lty = 2)
lines(Prediction, col = "blue")
legend(1, 70, legend = c("Actual", "Prediction"),
       col = c("red", "blue"), pch = c("o", "*"), lty = c(1, 2), ncol = 1)
Actual vs Prediction
Exercise
