Da sem unit 3-1
Regression – Concepts:
Introduction:
• The term regression is used to indicate the estimation or prediction of the average value
of one variable for a specified value of another variable.
• Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.
• Regression Analysis is a statistical process for estimating the relationships between the
Dependent Variables (Criterion / Response Variables) and one or more Independent
Variables (Predictor / Explanatory Variables).
• A regression problem is when the output variable is a real or continuous value, such as
“salary” or “weight”.
• Mathematically a linear relationship represents a straight line when plotted as a graph.
• A non-linear relationship where the exponent of any variable is not equal to 1 creates a
curve.
Types of Regression:
• Linear Regression
• Logistic Regression
• Ridge Regression
• Lasso Regression
• Polynomial Regression
• Fast and easy to model; particularly useful when the relationship to be modeled is
not extremely complex and you don’t have a lot of data.
• Very intuitive to understand and interpret.
• Linear Regression is very sensitive to outliers.
Linear regression:
• Linear Regression is a very simple method but has proven to be very useful for a large
number of situations.
• When we have a single input attribute (x) and we want to use linear regression, this is
called simple linear regression.
• In simple linear regression we model our data as follows: y = B0 + B1 * x
• B0 and B1 are the coefficients that we need to estimate; they move the line
around.
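As a quick illustration, here is a minimal R sketch of simple linear regression using the built-in lm() function; the data values are made up purely for demonstration.
# Hypothetical example data: single input attribute x, output y
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit y = B0 + B1 * x by ordinary least squares
model <- lm(y ~ x)

# B0 (intercept) and B1 (slope) estimated from the data
coef(model)

# Predict y for a new input value
predict(model, newdata = data.frame(x = 6))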
2. BLUE Property Assumptions
• In regression analysis, the BLUE property assumptions refer to the Best Linear Unbiased
Estimator criteria, which are derived from the Gauss-Markov theorem. These
assumptions ensure that the Ordinary Least Squares (OLS) estimator is the best
(minimum variance) linear unbiased estimator.
• For an OLS estimator to be BLUE, it must satisfy the following assumptions:
a) Linearity of the model
• The relationship between the independent variables (X) and the dependent variable
(Y) must be linear.
• The model should take the form:
Y = β0 + β1X1 + β2X2 + ⋯ + βnXn + ϵ
b) Zero Mean of Errors
E[ϵi] = 0, ∀ i
• This ensures that the OLS estimates are unbiased, meaning that the expected value of the
estimated coefficients equals the true parameter values.
c) No Perfect Multicollinearity
If perfect multicollinearity exists, the matrix XᵀX becomes singular, making it impossible to compute (XᵀX)⁻¹.
If variables are highly correlated but not perfectly, OLS can still be used but may be unstable.
d) Homoscedasticity (Constant Error Variance)
Var(ϵi) = σ², ∀ i
Note: If heteroscedasticity (changing variance) exists, OLS is still unbiased but not efficient.
e) No Autocorrelation
Cov(ϵi, ϵj) = 0, ∀ i ≠ j
Note: If autocorrelation exists, OLS is still unbiased but inefficient, leading to incorrect standard
errors and hypothesis tests.
f) Normality of Errors (for exact inference)
ϵi ~ N(0, σ²)
If the errors are not normal but the sample size is large, the central limit theorem allows
approximate inference.
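These assumptions can be checked informally in R after fitting a model. The sketch below assumes a fitted lm object called model (as in the earlier example) and uses only base R diagnostics; which formal tests to apply depends on the data.
# Residuals of the fitted model
res <- residuals(model)

# d) Homoscedasticity: residuals vs fitted values should show constant spread
plot(fitted(model), res)

# e) No autocorrelation: residuals in observation order
# should show no systematic pattern
plot(res, type = "b")

# f) Normality of errors: Q-Q plot plus Shapiro-Wilk test
qqnorm(res); qqline(res)
shapiro.test(res)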
3. Least Squares Estimation:
🔷 Model:
y = β0 + β1x + ϵ
🔷 Objective:
S = Σ (yi − (β0 + β1xi))², summed over i = 1, …, n (minimized over β0, β1)
1. Slope (β₁):
β1 = Σ (xi − mean(x)) (yi − mean(y)) / Σ (xi − mean(x))²
2. Intercept (β₀):
β0 = mean(y) − β1 · mean(x)
🔷 Applications:
🔷 Example:
If the fitted line is y = 2x + 3, then:
• β1 = 2 (slope: y increases by 2 for each unit increase in x)
• β0 = 3 (intercept: the value of y when x = 0)
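The closed-form formulas above can be verified directly in R. A small sketch with made-up points lying exactly on y = 2x + 3, so the estimates come out as β1 = 2 and β0 = 3:
# Points lying exactly on y = 2x + 3 (no noise)
x <- c(0, 1, 2, 3, 4)
y <- 2 * x + 3

# Slope: sum of cross-deviations divided by sum of squared x-deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: mean(y) - b1 * mean(x)
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)  # 3 and 2, as in the example above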
4. Variable Rationalization:
• The data set may have a large number of attributes, some of which can be
irrelevant or redundant. The goal of Variable Rationalization is to improve
data processing in an optimal way through attribute subset selection.
• This process finds a minimum set of attributes such that dropping the
irrelevant attributes does not significantly affect the utility of the data,
while reducing the cost of data analysis.
Types:
All the following methods are greedy approaches for attribute subset selection; an R sketch of stepwise selection follows the list.
I. Stepwise Forward Selection: This procedure starts with an empty set of attributes as
the minimal set. The most relevant attributes are chosen (having minimum p-value)
and are added to the minimal set. In each iteration, one attribute is added to a
reduced set.
II. Stepwise Backward Elimination: Here all the attributes are considered in the initial
set of attributes. In each iteration, the attribute whose p-value is higher than the
significance level is eliminated from the set.
III. Combination of Forward Selection and Backward Elimination: The stepwise forward
selection and backward elimination are combined so as to select the relevant
attributes most efficiently. This is the most common technique which is generally
used for attribute selection.
IV. Decision Tree Induction: This approach uses a decision tree for attribute selection. It
constructs a flowchart-like structure in which each internal node denotes a test on an
attribute, each branch corresponds to an outcome of the test, and each leaf node
gives a class prediction. Any attribute that does not appear in the tree is considered
irrelevant and hence discarded.
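As mentioned before the list, R's built-in step() function implements these greedy search strategies over a fitted model. Note that step() ranks attributes by AIC rather than p-values, but the search pattern is the same; the mtcars variables below are illustrative only.
# Full and empty models on the built-in mtcars data
full  <- lm(mpg ~ wt + hp + disp + drat + qsec, data = mtcars)
empty <- lm(mpg ~ 1, data = mtcars)

# II. Stepwise backward elimination: start full, drop one attribute per step
backward <- step(full, direction = "backward")

# III. Combined forward selection and backward elimination
both <- step(empty, scope = formula(full), direction = "both")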
The Model Building Life Cycle is a systematic approach to solving business problems using
data analytics. It consists of six main stages:
1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration and Transformation
5. Predictive Modeling
6. Model Deployment
Logistic Regression:
1. Introduction:
• Logistic Regression is a Supervised Learning algorithm used for classification problems.
• It predicts a categorical dependent variable (binary or multiclass) using one or more
independent variables.
• Instead of exact 0 or 1 outputs, it produces probability values between 0 and 1.
2. Key Characteristics:
5. Assumptions:
7. Example in R (sigmoid):
# Input values from -10 to 10
y <- -10:10
# The sigmoid (logistic) function maps each value into (0, 1)
z <- 1 / (1 + exp(-y))
# Plot the characteristic S-shaped curve
plot(y, z)
Applications:
• Credit Scoring
• Customer Retention (CRM)
• Fraud Detection
• Finance & Risk Analysis
• Human Resource Analytics
• Marketing Response Modeling
Model Fit Statistics in logistic regression help evaluate how well the model explains the
relationship between the dependent and independent variables. Unlike linear regression, logistic
regression doesn't use R² directly, so we rely on several alternative measures:
1. Likelihood Function
• Logistic regression is estimated by maximum likelihood: the coefficients are chosen to
maximize the likelihood of the observed outcomes, and the maximized log-likelihood
underlies measures such as deviance and AIC.
2. Deviance
• Null Deviance: Measures the fit of a model with only the intercept (no predictors).
• Residual Deviance: Measures the fit of the model with all predictors.
• A large drop from the null to the residual deviance indicates that the predictors improve the model.
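Both deviances are reported by R's glm(). A minimal sketch on the built-in mtcars data, predicting the binary am column (chosen only for illustration):
# Logistic regression: transmission type (am) from weight and horsepower
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Null deviance (intercept only) and residual deviance (all predictors)
fit$null.deviance
fit$deviance

# Chi-squared test on the drop in deviance:
# a small p-value indicates the predictors improve the fit
pchisq(fit$null.deviance - fit$deviance,
       df = fit$df.null - fit$df.residual, lower.tail = FALSE)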
Model fit statistics in logistic regression provide essential insights into predictive performance.
They ensure the model is neither underfitted nor overfitted and help in comparing multiple
models to select the best one.
Logistic regression is a supervised learning classification algorithm used when the dependent
variable is categorical (binary or multinomial). Model construction in R is sketched below,
followed by common business domains where such models are applied.
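A minimal construction sketch in R, reusing the mtcars example from above; the 0.5 cutoff is a common default, not a fixed rule.
# 1. Fit: binary outcome am, single predictor wt, binomial family
model <- glm(am ~ wt, data = mtcars, family = binomial)

# 2. Inspect coefficients and their significance
summary(model)

# 3. Predicted probabilities between 0 and 1 (not exact 0/1 outputs)
probs <- predict(model, type = "response")

# 4. Convert probabilities to class labels with a cutoff
pred <- ifelse(probs > 0.5, 1, 0)
table(predicted = pred, actual = mtcars$am)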
🔶 3. Finance
🔶 4. Human Resources
🔶 5. Manufacturing
🔶 6. Marketing
🟩 Summary:
Logistic regression models are simple yet powerful tools in business analytics. They are best
used when the outcome is categorical, probability estimates between 0 and 1 are needed, and
the model must remain easy to interpret.