
Unit 3

Trendlines and Regression


BUSINESS ANALYTICS
B.Tech(CSE) IV Year - I Semester
Open Elective - III

Prof. S.Adinarayana, Dept of CS&SE, College of Engineering, Andhra University
Modeling Relationships and Trends in Data
• Mathematics and the descriptive properties of different functional
relationships are important in building predictive analytical models.
• Common types of mathematical functions used in predictive analytical
models include linear, logarithmic, polynomial, power, and exponential
functions.

• R² (R-squared) is a measure of the “fit” of the line to the data.
• The value of R² will be between 0 and 1.
• The larger the value of R², the better the fit.
• Trendlines can be used to model relationships between variables and
understand how the dependent variable behaves as the independent
variable changes.
• For example, demand-prediction models are generally developed by
analyzing historical data and fitting a trendline to it.
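The sketch below shows how a linear trendline can be fitted and its R² computed with NumPy's polyfit; the price/demand numbers are made up for illustration, not taken from the slides.

```python
# Minimal sketch: fit a linear trendline and compute R-squared with NumPy.
# The price/demand data below is hypothetical.
import numpy as np

price = np.array([80, 90, 100, 110, 120], dtype=float)   # independent variable
demand = np.array([52, 47, 43, 38, 35], dtype=float)     # dependent variable

# np.polyfit with degree 1 returns the slope and intercept of the trendline.
slope, intercept = np.polyfit(price, demand, 1)
predicted = slope * price + intercept

# R-squared = 1 - SSE / SST
sse = np.sum((demand - predicted) ** 2)
sst = np.sum((demand - demand.mean()) ** 2)
r_squared = 1 - sse / sst

print(f"demand = {intercept:.2f} + {slope:.2f} * price, R^2 = {r_squared:.3f}")
```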
Simple Linear Regression

• Regression analysis is a tool for building mathematical and statistical models that
characterize relationships between a dependent variable (which must be a ratio
variable and not categorical) and one or more independent, or explanatory, variables,
all of which are numerical (but may be either ratio or categorical).
• Two broad categories of regression models are used often in business settings: (1)
regression models of cross-sectional data and (2) regression models of time-series
data, in which the independent variables are time or some function of time and the
focus is on predicting the future.
• Time-series regression is an important tool in forecasting.
Simple Linear Regression

• A regression model that involves a single independent variable is called
simple regression.
• A regression model that involves two or more independent variables is
called multiple regression.
• Simple linear regression involves finding a linear relationship between
one independent variable, X, and one dependent variable, Y.
• The relationship between two variables can assume many forms, as
illustrated in the figure.

Linear Regression

• Linear regression identifies the linear relationship between target
variables and explanatory variables.
• Here, the variables to be predicted are called target variables, and the
variables that help in predicting the target variables are called
explanatory variables.
• With the linear relationship, we can identify the impact of a change in
the explanatory variables on the target variable, as the sketch below shows.
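A minimal sketch of this idea with scikit-learn; the study-hours and test-score numbers are hypothetical.

```python
# Sketch of simple linear regression with scikit-learn; the data is
# hypothetical. X must be 2-D (one column per explanatory variable).
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])   # explanatory variable
scores = np.array([52, 58, 61, 67, 72, 78])        # target variable

model = LinearRegression().fit(hours, scores)
print("slope (impact of one extra hour):", model.coef_[0])
print("intercept:", model.intercept_)
print("predicted score for 7 hours:", model.predict([[7]])[0])
```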

Least Squares Regression
• The mathematical basis for the best-fitting regression line is called least-
squares regression.
• In regression analysis, we assume that the values of the dependent
variable, Y, in the sample data are drawn from some unknown population
for each value of the independent variable, X.
• Imagine we have a list of people’s study hours and test scores. In the
scatterplot, we can see a positive relationship exists between study time
and test scores. Statistical software can display the least squares regression
line and its equation.
• This line minimizes the sum of the squared distances between the line and
the data points.
Least Squares Regression Line Formula
y = b + mx
Where:
• y is the dependent variable.
• x is the independent variable.
• b is the y-intercept.
• m is the slope of the line.

The slope and intercept are computed from the data as

m = (N·Σxy − Σx·Σy) / (N·Σx² − (Σx)²)
b = (Σy − m·Σx) / N

where N is the number of observations.

The slope represents the mean change in the dependent variable for a
one-unit change in the independent variable.
Example: Least Squares Regression

Let’s take the data from the hours-of-studying example. We’ll use the
least squares regression line formulas to find the slope and intercept
for our model.

Regression and Analysis of Variance

Regression output typically includes an analysis of variance (ANOVA) table,
which splits the total variation in Y into the portion explained by the
regression and the unexplained (residual) portion.


Testing Hypotheses for Regression Coefficients

Hypothesis testing involves confirming whether the estimated coefficients are
statistically significant. Two common approaches are used:
1. Confidence interval approach: determines if the confidence interval for the
coefficient includes zero.
2. t-test approach: calculates a t-statistic by dividing the estimated coefficient
by its standard error, indicating how many standard-error units the
coefficient is away from zero, as shown in the sketch below.
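A sketch of both approaches using statsmodels' OLS output; the data is hypothetical. conf_int gives the confidence-interval check and tvalues/pvalues give the t-test check.

```python
# Sketch: testing regression coefficients with statsmodels OLS.
# Data is hypothetical; statsmodels reports the t-statistic, p-value,
# and confidence interval for each estimated coefficient.
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 58, 61, 67, 72, 78], dtype=float)

X = sm.add_constant(x)               # adds the intercept column
results = sm.OLS(y, X).fit()

print(results.tvalues)               # coefficient / standard error
print(results.pvalues)               # significance of each coefficient
print(results.conf_int(alpha=0.05))  # 95% confidence intervals
```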

Confidence Interval
• Confidence intervals provide a systematic approach to quantifying
the uncertainty associated with sample statistics, offering a range
within which population parameters are likely to reside.
• A confidence interval is a range that, with a stated level of confidence,
is expected to contain the true value.
• The selection of a confidence level for an interval determines the
probability that the confidence interval will contain the true
parameter value.
• This range of values is generally used with population-based data,
extracting specific, valuable information with a certain amount of
confidence, hence the term ‘confidence interval’.

Types of Confidence Intervals

1. Confidence Interval for the Mean of Normally Distributed Data
A confidence interval for the mean of normally distributed data is often
calculated using the t-distribution.
2. Confidence Interval for Proportions
For proportions, a confidence interval estimates the likely range of values for the
true population proportion. Typically, the normal approximation or the binomial
distribution is used, depending on the sample size.
3. Confidence Interval for Non-Normally Distributed Data
When dealing with non-normally distributed data or unknown distributions,
bootstrap methods offer a flexible approach. Bootstrap confidence intervals
involve resampling from the dataset to create multiple samples, allowing for the
estimation of the parameter distribution.
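Following the bootstrap idea above, a minimal sketch with made-up sample data:

```python
# Sketch of a bootstrap confidence interval for the mean, using the
# resampling idea described above; the sample data is hypothetical.
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([4.1, 5.3, 2.8, 6.0, 3.9, 5.5, 4.7, 7.2, 3.1, 4.9])

# Resample with replacement many times and record each resample's mean.
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(10_000)]

# The 2.5th and 97.5th percentiles bound a 95% bootstrap interval.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")
```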

t-test Approach: Testing Hypotheses for Regression Coefficients

• t-tests are statistical hypothesis tests used to analyze one or two
sample means.
• Depending on the t-test you use, you can compare a sample mean to a
hypothesized value, compare the means of two independent samples, or
test the difference between paired samples.
• t-tests use t-values and t-distributions to calculate probabilities.

• t-values are a type of test statistic. Hypothesis tests use the test
statistic calculated from your sample to compare your sample to the
null hypothesis.
• A single t-test produces a single t-value. Suppose we repeat our study
many times by drawing many random samples of the same size from this
population, perform t-tests on all of the samples, and plot the
distribution of the t-values.
• This distribution is known as a sampling distribution, which is a type
of probability distribution (here, the t-distribution). The simulation
sketch below illustrates the idea.
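A sketch of this repeated-sampling experiment in code; the population mean, spread, and sample counts are arbitrary choices for illustration.

```python
# Sketch: simulating the sampling distribution of the t-statistic by
# repeatedly drawing samples from one population, as described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pop_mean, n_samples, sample_size = 100, 5000, 15

t_values = []
for _ in range(n_samples):
    sample = rng.normal(loc=pop_mean, scale=10, size=sample_size)
    # One-sample t-test of each sample against the true population mean.
    t_stat, _ = stats.ttest_1samp(sample, popmean=pop_mean)
    t_values.append(t_stat)

# A histogram of t_values approximates a t-distribution with n-1 df.
print("mean of t-values:", np.mean(t_values))   # close to 0
print("std of t-values:", np.std(t_values))     # slightly above 1
```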
Residual Analysis
• It is a statistical technique used to evaluate the performance of a linear
regression model by analyzing residuals.
• As the linear regression model is not always appropriate for the data,
you should assess the appropriateness of the model by defining
residuals and examining residual plots.
• The difference between the observed value of the dependent variable
(y) and the predicted value (ŷ) is called the residual (e). Each data
point has one residual.

Residual Plots
• A residual plot is a graph that shows the residuals on the vertical axis
and the independent variable on the horizontal axis.
• If the points in a residual plot are randomly dispersed around the
horizontal axis, a linear regression model is appropriate for the data;
otherwise, a nonlinear model is more suitable.
• The table below shows inputs and outputs from a simple linear
regression analysis.

• And the chart below displays the residual (e) and independent
variable (X) as a residual plot.

The residual plot shows a fairly random pattern - the first residual is
positive, the next two are negative, the fourth is positive, and the last
residual is negative. This random pattern indicates that a linear model
provides a decent fit to the data.
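A sketch of how such a residual plot can be produced with matplotlib; the data is hypothetical.

```python
# Sketch of a residual plot: residuals (e = y - y_hat) on the vertical
# axis, the independent variable on the horizontal axis.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([52, 58, 61, 67, 72, 78], dtype=float)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)       # one residual per data point

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")                # reference line at e = 0
plt.xlabel("Independent variable (X)")
plt.ylabel("Residual (e)")
plt.title("Residual plot: look for a random scatter around zero")
plt.show()
```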

Regression Assumptions
• Linear regression is a useful statistical method to understand the
relationship between two variables, x and y.
• Before conducting linear regression, we must first make sure that
four assumptions are satisfied:
1. Linear relationship, 2. Independence, 3. Homoscedasticity, and
4. Normality.

1. Linear relationship: There exists a linear relationship between the
independent variable, x, and the dependent variable, y.
2. Independence: The residuals are independent. In particular, there is
no correlation between consecutive residuals in time series data.
3. Homoscedasticity: The residuals have constant variance at every
level of x.
4. Normality: The residuals of the model are normally distributed.
If one or more of these assumptions are violated, then the results of our
linear regression may be unreliable or even misleading.
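One way to check assumptions 2-4 numerically is sketched below; the diagnostic functions are from statsmodels and scipy, the data is hypothetical, and the p-value cutoffs are conventional rules of thumb rather than anything from the slides.

```python
# Sketch: quick numeric checks of the regression assumptions.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([52, 58, 61, 67, 72, 78, 81, 88], dtype=float)

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

print("Durbin-Watson (independence, ~2 is good):",
      durbin_watson(results.resid))
print("Breusch-Pagan p-value (homoscedasticity, > 0.05 is good):",
      het_breuschpagan(results.resid, X)[1])
print("Shapiro-Wilk p-value (normality, > 0.05 is good):",
      stats.shapiro(results.resid).pvalue)
```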
Multiple Linear Regression
• Multiple linear regression is one of the important regression techniques.
It models the linear relationship between a single dependent continuous
variable Y and two or more independent variables x1, x2, ..., xk:
Y = b0 + b1x1 + b2x2 + ... + bkxk + ε
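A minimal sketch with scikit-learn and two illustrative predictors; the numbers are made up.

```python
# Sketch of multiple linear regression: one continuous target y and
# several explanatory variables; the data is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one observation: [x1, x2]
X = np.array([[60, 22], [62, 25], [67, 24], [70, 20],
              [71, 15], [72, 14], [75, 14], [78, 11]], dtype=float)
y = np.array([140, 155, 159, 179, 192, 200, 212, 215], dtype=float)

model = LinearRegression().fit(X, y)
print("intercept b0:", model.intercept_)
print("coefficients b1, b2:", model.coef_)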

Multiple Regression Analysis: An Example

• Suppose we have the following dataset with one response variable y
and two predictor variables x1 and x2.
• Steps to fit a multiple linear regression model to this dataset:
1. Calculate Σx1², Σx2², Σx1y, Σx2y, and Σx1x2.
2. Calculate the regression sums.
3. Calculate b0, b1, and b2.
4. Place b0, b1, and b2 in the estimated linear regression equation.

1. Calculate ΣX1², ΣX2², ΣX1y, ΣX2y, and ΣX1X2 from the raw data.

2. Calculate the regression sums (the same quantities in deviation form):
Σx1² = ΣX1² − (ΣX1)²/n
Σx2² = ΣX2² − (ΣX2)²/n
Σx1y = ΣX1y − (ΣX1)(Σy)/n
Σx2y = ΣX2y − (ΣX2)(Σy)/n
Σx1x2 = ΣX1X2 − (ΣX1)(ΣX2)/n

3. Calculate b0, b1, and b2:
b1 = [(Σx2²)(Σx1y) − (Σx1x2)(Σx2y)] / [(Σx1²)(Σx2²) − (Σx1x2)²]
b2 = [(Σx1²)(Σx2y) − (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) − (Σx1x2)²]
b0 = ȳ − b1·x̄1 − b2·x̄2

4. Place b0, b1, and b2 in the estimated linear regression equation:
ŷ = b0 + b1x1 + b2x2
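Carrying out these four steps in NumPy is sketched below; the eight-observation dataset is an assumption for illustration, not the original slide data.

```python
# Implementing steps 1-4 above with NumPy (illustrative data); any
# least squares routine would give the same coefficients.
import numpy as np

x1 = np.array([60, 62, 67, 70, 71, 72, 75, 78], dtype=float)
x2 = np.array([22, 25, 24, 20, 15, 14, 14, 11], dtype=float)
y  = np.array([140, 155, 159, 179, 192, 200, 212, 215], dtype=float)
n = len(y)

# Steps 1-2: regression sums in deviation form.
s_x1x1 = np.sum(x1**2) - np.sum(x1)**2 / n
s_x2x2 = np.sum(x2**2) - np.sum(x2)**2 / n
s_x1y  = np.sum(x1*y)  - np.sum(x1)*np.sum(y) / n
s_x2y  = np.sum(x2*y)  - np.sum(x2)*np.sum(y) / n
s_x1x2 = np.sum(x1*x2) - np.sum(x1)*np.sum(x2) / n

# Step 3: coefficients from the two-predictor normal equations.
denom = s_x1x1 * s_x2x2 - s_x1x2**2
b1 = (s_x2x2 * s_x1y - s_x1x2 * s_x2y) / denom
b2 = (s_x1x1 * s_x2y - s_x1x2 * s_x1y) / denom
b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()

# Step 4: the estimated regression equation.
print(f"y_hat = {b0:.3f} + {b1:.3f}*x1 + {b2:.3f}*x2")
```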
Interpret a Multiple Linear Regression Equation

Each coefficient bi represents the expected change in y for a one-unit
increase in xi, holding the other independent variables constant; b0 is the
predicted value of y when all independent variables equal zero.
Evaluation Metrics for Regression Models
The evaluation metrics for a linear regression model are:
1. Coefficient of Determination, or R-Squared (R²)
2. Root Mean Squared Error (RMSE)

R-Squared
• R-squared describes the amount of variation that is captured by the developed model. It always
ranges between 0 and 1. The higher the value of R-squared, the better the model fits the data.

Root Mean Squared Error (RMSE)
• RMSE measures the average magnitude of the errors, or residuals, between the predicted values
generated by a model and the actual observed values in a dataset.
• It ranges between 0 and positive infinity. Lower RMSE values indicate better predictive
performance.
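A sketch of computing both metrics with scikit-learn; the actual/predicted values are hypothetical.

```python
# Sketch: computing R-squared and RMSE with scikit-learn metrics.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([52, 58, 61, 67, 72, 78], dtype=float)
y_pred = np.array([53, 57, 62, 66, 73, 77], dtype=float)

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of the MSE

print(f"R-squared: {r2:.3f}")   # closer to 1 is better
print(f"RMSE: {rmse:.3f}")      # lower is better, in units of y
```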
Regression with Categorical Independent Variables (Two or More Levels)

• Categorical variables with two levels may be directly entered as
predictor or predicted variables in a multiple regression model.
• Their use in multiple regression is a straightforward extension of
their use in simple linear regression.
• When entered as predictor variables, the interpretation of regression
weights depends upon how the variable is coded, as the coding sketch
below illustrates.
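One common coding is dummy (0/1) variables; a minimal pandas sketch with made-up data:

```python
# Sketch of dummy (0/1) coding for a categorical predictor with pandas;
# the exercise/weight data below is hypothetical.
import pandas as pd

df = pd.DataFrame({
    "exercise": ["daily", "2-3 times", "once", "daily", "once"],
    "weight":   [150, 160, 172, 148, 168],
})

# drop_first=True keeps one level as the baseline absorbed by the intercept.
coded = pd.get_dummies(df, columns=["exercise"], drop_first=True)
print(coded)
```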

Example: Regression with Categorical Variables

• Consider the effect of (self-reported) exercise on weight in college
students.
• The students were asked the question: how often do you exercise in
a regular week?

• Let’s take a look at how many observations we have at each level of this
variable.
• A boxplot of this data shows the distribution of weight within each
exercise category.

• Thirteen students did not answer this question; they need to be removed
from consideration.
• Notice that only the first three options were reported in this data
set (nobody answered with the 4 or 5 options in the survey).
• To build our regression model we want something of the form:
weight = α + β1·(exercise == 2) + β2·(exercise == 3) + ε
• The “works out daily” group (exercise == 1) covers everyone who doesn’t
work out 2-3 times or once a week and is therefore included in the α
term. A fitting sketch follows below.
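A sketch of fitting this model with statsmodels' formula API; the weights and category codes are made up, and C() performs the dummy coding with the first level as the baseline.

```python
# Sketch: fitting weight ~ exercise category with statsmodels' formulas.
# C(exercise) creates dummy variables; level 1 ("works out daily")
# becomes the baseline absorbed by the intercept.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "exercise": [1, 1, 2, 2, 2, 3, 3, 1, 3, 2],   # 1=daily, 2=2-3x, 3=once
    "weight":   [148, 152, 160, 158, 163, 170, 174, 150, 169, 161],
})

model = smf.ols("weight ~ C(exercise)", data=df).fit()
print(model.params)       # intercept = baseline mean; betas = differences
print(model.conf_int())   # confidence intervals for each coefficient
```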

Confidence Intervals for the Coefficients

• The confidence intervals show that we can’t conclude there is any
difference in the average weight of these three categories, as the
intervals for the category coefficients contain both positive and
negative values.
• The output also gives us a confidence interval for the average weight of
those in category 1 (exercise every day), as this is the intercept.

Regression with Nonlinear Terms

• Non-linear regression is a general description of statistical
techniques used to model the relationship between a dependent
variable and one or more independent variables.
• Unlike linear regression, which assumes a linear relationship
between the independent features and dependent labels, non-linear
regression allows more complex relationships to be modeled, for
example by adding polynomial terms (see the sketch below).
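One common approach is polynomial regression: a linear model fit on polynomial features of x. A minimal scikit-learn sketch with synthetic data:

```python
# Sketch: modeling a nonlinear relationship with polynomial features,
# one common way to add nonlinear terms; the data is synthetic.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.linspace(0, 10, 30).reshape(-1, 1)
y = 2.0 + 0.5 * x.ravel() ** 2 + np.random.default_rng(1).normal(0, 2, 30)

# Fit y as a quadratic function of x: y = b0 + b1*x + b2*x^2
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("R^2 on training data:", model.score(x, y))
```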

Applications of Nonlinear Regression
• Because many real-world data sets do not follow a linear relationship,
nonlinear regression has many applications.
• These applications include predictive modeling, time series
forecasting, function approximation, and unraveling intricate
relationships between variables.
• Non-linear regression algorithms are machine learning techniques used to
model and predict non-linear relationships between input variables and
target variables.
• These algorithms aim to capture complex patterns and interactions that
cannot be effectively represented by a linear model.
Types of Regression Models
• There are many types of regression models; several of them can capture nonlinear
relationships. They are:
1. Simple Linear Regression: This model involves one independent variable used to predict the
dependent variable. It’s a basic yet powerful tool in understanding relationships between
variables.
2. Multiple Linear Regression: Unlike simple linear regression, this model incorporates
multiple independent variables to predict the dependent variable. It provides a fuller analysis
by considering various factors simultaneously.
3. Polynomial Regression: This model fits a curve to the data points. It’s useful when the
relationship between the independent and dependent variables is non-linear.
4. Logistic Regression: Primarily used for binary classification problems, logistic regression
predicts the probability of occurrence of an event by fitting data to a logistic curve.
5. Ridge Regression and Lasso Regression: These are regularization techniques used to prevent
overfitting in predictive models by adding a penalty term to the loss function.
6. Time Series Regression: This model is ideal for looking at data points collected over time to
identify trends, seasonality, and other patterns.
7. Ordinal Regression: It’s used when the dependent variable is ordinal, i.e., it has ordered
categories.
Advanced Techniques in Regression Analysis
When it comes to regression analysis, there are advanced techniques that can
take your models to the next level. They are:
1. Regularization: helps prevent overfitting by adding a penalty for complex
models.
2. Gradient Boosting: a powerful ensemble technique that builds models
sequentially to correct errors made by previous models.
3. Neural Networks: a complex modeling technique that can capture complex
patterns in data, though it requires a large amount of data.
4. Time Series Analysis: useful for modeling and forecasting time-dependent
data.
5. Support Vector Machines (SVM): effective in high-dimensional spaces and
ideal for cases where the data is not linearly separable.

