QT Chapter 4
CONCEPT OF REGRESSION - Regression is a statistical method, used in both statistics and machine learning, to analyze
the relationship between a dependent variable and one or more independent variables. The goal of regression analysis is
to model the relationship between the variables, understand the strength and nature of that relationship, and make
predictions based on the learned patterns.
Here are key concepts associated with regression:
TWO REGRESSION EQUATIONS - The functional relations developed between two correlated variables are
called regression equations.
The regression equation of X on Y is: (X – X̄) = bXY (Y – Ȳ), where bXY is the regression coefficient of X on Y.
The regression equation of Y on X is: (Y – Ȳ) = bYX (X – X̄), where bYX is the regression coefficient of Y on X.
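The two coefficients can be computed from the correlation coefficient r and the standard deviations of the variables: bYX = r·(σY/σX) and bXY = r·(σX/σY), so that bYX·bXY = r². A minimal Python sketch, with purely illustrative (made-up) data values:

    import numpy as np

    # Hypothetical paired observations
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    r = np.corrcoef(x, y)[0, 1]    # correlation coefficient
    b_yx = r * y.std() / x.std()   # regression coefficient of Y on X
    b_xy = r * x.std() / y.std()   # regression coefficient of X on Y

    # The product of the two regression coefficients equals r squared
    print(b_yx, b_xy, b_yx * b_xy, r ** 2)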
REGRESSION COEFFICIENTS AND PROPERTIES - Regression coefficients are numerical values that represent
the strength and direction of the relationship between independent and dependent variables in a regression model.
These coefficients are estimated from the data during model training. Let's discuss the properties of
regression coefficients; several of them are illustrated in the code sketch that follows this list:
1. Intercept (b0):
The intercept represents the predicted value of the dependent variable when all independent variables are set
to zero. It is the point where the regression line intersects the y-axis.
It may or may not have a meaningful interpretation depending on the context.
2. Slope (b1,b2,...,bn):
The slope coefficients indicate the change in the dependent variable for a one-unit change in the
corresponding independent variable, assuming all other variables remain constant.
A positive slope suggests a positive relationship, while a negative slope indicates a negative relationship.
3. Unit Change:
For each unit increase in the independent variable, the coefficient represents the expected change in the
dependent variable.
4. Significance Level:
The statistical significance of coefficients is often assessed using p-values. A low p-value suggests that the
coefficient is statistically significant.
A high p-value indicates that the evidence is not strong enough to reject the null hypothesis that the
coefficient is equal to zero.
5. Confidence Intervals:
Confidence intervals provide a range within which we can be reasonably confident that the true coefficient
lies. Wider intervals indicate more uncertainty, while narrower intervals suggest greater precision.
6. Multicollinearity:
Multicollinearity occurs when independent variables in a multiple regression model are highly correlated. It
can affect the stability and interpretability of coefficients.
Variance Inflation Factor (VIF) is a measure used to detect multicollinearity.
7. Heteroscedasticity:
Heteroscedasticity refers to the situation where the variability of the residuals (the differences between
actual and predicted values) is not constant across all levels of the independent variables.
It leaves the coefficient estimates unbiased but makes them inefficient and their usual standard errors, and hence p-values, unreliable.
8. Homoscedasticity:
Homoscedasticity is the opposite of heteroscedasticity. It implies that the variance of the residuals is constant
across all levels of the independent variables.
9. Coefficient of Determination (R²):
R² is a measure of how well the regression model explains the variance in the dependent variable. It ranges
from 0 to 1, with higher values indicating a better fit.
R² represents the proportion of the dependent variable's variability that is explained by the independent
variables.
10. Adjusted R²:
Adjusted R² takes into account the number of independent variables in the model and provides a more
accurate measure of goodness of fit, penalizing the inclusion of irrelevant variables.
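Several of these properties can be read directly off a fitted model. A minimal sketch with simulated data, assuming the statsmodels library is available (the data and coefficient values here are hypothetical):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Simulated data: two independent variables and one dependent variable
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=100)
    x2 = rng.normal(size=100)
    y = 1.5 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=100)

    X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept column
    model = sm.OLS(y, X).fit()

    print(model.params)      # intercept b0 and slopes b1, b2
    print(model.pvalues)     # significance of each coefficient
    print(model.conf_int())  # confidence intervals for the coefficients
    print(model.rsquared, model.rsquared_adj)  # R-squared and adjusted R-squared

    # Variance Inflation Factor for each independent variable (skipping the constant)
    print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])

As a rule of thumb, VIF values near 1 indicate little multicollinearity, while values well above 5 to 10 are commonly treated as a warning sign.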
TYPES OF REGRESSION - Common forms of regression include:
Simple Linear Regression: Involves one dependent variable and one independent variable.
Multiple Linear Regression: Involves one dependent variable and two or more independent variables.
Polynomial Regression: Involves using polynomial equations to model the relationship between
variables.
Logistic Regression: Used when the dependent variable is binary or categorical. It models the probability
of an event occurring (see the sketch below).
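As a brief illustration of logistic regression, a sketch using statsmodels with hypothetical pass/fail data (all numbers are made up):

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: pass/fail outcome versus hours studied
    hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
    passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

    X = sm.add_constant(hours)               # adds the intercept column
    logit = sm.Logit(passed, X).fit(disp=0)  # disp=0 silences the fitting log

    # Predicted probability of passing for a student who studies 3 hours
    print(logit.predict([[1.0, 3.0]]))

The fitted model returns a probability between 0 and 1 rather than a continuous predicted value. A polynomial regression could be sketched similarly with np.polyfit(x, y, deg) for deg greater than 1.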
USES OF REGRESSION - Regression analysis has numerous practical applications across various fields due
to its ability to model relationships between variables and make predictions. Here are some common uses of
regression analysis:
1. Economics and Finance:
Predicting economic indicators, such as GDP growth, based on factors like government spending,
interest rates, and consumer spending.
Evaluating the relationship between stock prices and various financial ratios.
2. Social Sciences:
Analyzing the relationship between education levels and income.
Studying factors influencing voting behavior in political science.
3. Education:
Predicting student performance based on factors such as study time, attendance, and socioeconomic
status.
Evaluating the effectiveness of educational interventions.
4. Human Resources:
Predicting employee performance based on training, experience, and other factors.
5. Environmental Science:
Modeling the impact of environmental factors on wildlife populations.
Studying the relationship between pollution levels and health outcomes.
6. Operations Research:
Predicting demand for products based on historical sales data.
Optimizing production processes by identifying key factors influencing efficiency.
7. Psychology:
Analyzing the relationship between variables like stress levels and performance on cognitive tasks.
Predicting behavior based on psychological factors.
DIFFERENCE BETWEEN CORRELATION AND REGRESSION - Correlation and regression are statistical
techniques used to analyze the relationship between two or more variables. However, they serve different
purposes and provide distinct types of information.
1. Purpose:
Correlation: Correlation measures the strength and direction of a linear relationship between two
variables. It helps in determining whether and how strongly two variables are related.
Regression: Regression, on the other hand, is used to model the relationship between a dependent
variable and one or more independent variables. It not only describes the relationship but also allows
for making predictions.
2. Output:
Correlation: The result of a correlation analysis is a correlation coefficient, often denoted by "r." It
ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect
positive linear relationship, and 0 indicates no linear relationship.
Regression: The output of a regression analysis includes coefficients, which represent the slope and
intercept of the regression line. The equation of the line can be used to predict the value of the
dependent variable based on the values of the independent variable(s). The two kinds of output are
contrasted in the short sketch at the end of this section.
3. Application:
Correlation: Correlation is used when you want to quantify the degree of association between two
variables without making predictions or determining cause and effect.
Regression: Regression is used when you want to predict the value of one variable based on the values
of one or more other variables. It's also used for understanding the strength and nature of the
relationship between variables.
4. Directionality:
Correlation: Correlation does not imply causation, and it does not specify which variable is the cause
and which is the effect. It only indicates the degree and direction of the relationship.
Regression: Regression treats the variables asymmetrically: the independent variable(s) are used to
explain or predict changes in the dependent variable, reflecting an assumed direction of influence.
5. Representation:
Correlation: It is often represented by a scatter plot, and the correlation coefficient summarizes the
pattern observed in the plot.
Regression: The relationship is represented by a regression line on a scatter plot, indicating the best-
fitting line through the data points.
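To make the contrast in output concrete, a minimal Python sketch (with hypothetical data) computing both the correlation coefficient and the regression line for the same pairs of observations:

    import numpy as np

    # Hypothetical paired observations
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

    # Correlation: a single number summarizing strength and direction
    r = np.corrcoef(x, y)[0, 1]

    # Regression: slope and intercept of the best-fitting line
    slope, intercept = np.polyfit(x, y, 1)
    y_pred = slope * 6.0 + intercept   # predict y for a new x value

    print(r, slope, intercept, y_pred)

Correlation yields only the single number r, while regression yields a slope and intercept that can be plugged into the line equation to predict new values.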