Da sem unit 3-1

The document provides an overview of regression analysis, including its definition, types, and applications in statistical modeling. It details the assumptions necessary for the Best Linear Unbiased Estimator (BLUE) property, the concept of least squares estimation, and variable rationalization techniques. Additionally, it outlines the model building life cycle in data analytics and the application of logistic regression in various business domains.

1.

Regression – Concepts:

Introduction:

• The term regression is used to indicate the estimation or prediction of the average value
of one variable for a specified value of another variable.
• Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.
• Regression analysis is a statistical process for estimating the relationships between a dependent variable (also called the criterion or response variable) and one or more independent variables (predictor variables).

• Regression describes how an independent variable is numerically related to the


dependent variable.
• Regression can be used for prediction, estimation and hypothesis testing, and modeling
causal relationships.

When is regression chosen?

• A regression problem is when the output variable is a real or continuous value, such as
“salary” or “weight”.
• Mathematically a linear relationship represents a straight line when plotted as a graph.
• A non-linear relationship where the exponent of any variable is not equal to 1 creates a
curve.

Types of Regression Analysis Techniques:

Linear Regression

Logistic Regression

Ridge Regression

Lasso Regression

Polynomial Regression

Bayesian Linear Regression

Advantages & Limitations:

• Fast and easy to model; particularly useful when the relationship to be modeled is not extremely complex and when you don't have a lot of data.
• Very intuitive to understand and interpret.
• Linear Regression is very sensitive to outliers.

Linear regression:

• Linear Regression is a very simple method but has proven to be very useful for a large
number of situations.
• When we have a single input attribute (x) and we want to use linear regression, this is
called simple linear regression.
• In simple linear regression we model our data as follows: y = B0 + B1 * x
• B0 and B1 are coefficients that we need to estimate; they move the line around.
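As a quick illustration, here is a minimal R sketch (the data values are made up for illustration) that estimates B0 and B1 using base R's lm() function:

# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)   # fits y = B0 + B1 * x by least squares
coef(fit)          # estimated coefficients: B0 (intercept) and B1 (slope)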
2. BLUE property Assumptions
• In regression analysis, the BLUE property assumptions refer to the Best Linear Unbiased
Estimator criteria, which are derived from the Gauss-Markov theorem. These
assumptions ensure that the Ordinary Least Squares (OLS) estimator is the best
(minimum variance) linear unbiased estimator.
• For an OLS estimator to be BLUE, it must satisfy the following assumptions:
a) Linearity of the model
• The relationship between the independent variables (X) and the dependent variable (Y) must be linear.
• The model should take the form:

Y=β0+β1X1+β2X2+⋯+βnXn+ϵ

b) Random and Zero-Mean Errors (Unbiasedness)

• The error term must have an expected value of zero:

E[ϵi] = 0 for all i

• This ensures that the OLS estimates are unbiased, meaning the expected value of the estimated coefficients equals the true parameter values.
c) No Perfect Multicollinearity

The independent variables must not be perfectly correlated: rank(X) = p (full column rank)

If perfect multicollinearity exists, the matrix XᵀX becomes singular, making it impossible to compute (XᵀX)⁻¹.

If variables are highly but not perfectly correlated, OLS can still be used, but the estimates may be unstable.

d) Homoscedasticity (Constant Variance of Errors)

The variance of the error term must be constant for all observations:

Var(ϵi) = σ²

Note: If heteroscedasticity (changing variance) exists, OLS is still unbiased but not efficient.

e) No Autocorrelation

Error terms must not be correlated with each other:

E[ϵi ϵj] = 0 for all i ≠ j

Note: If autocorrelation exists, OLS is still unbiased but inefficient, leading to incorrect standard errors and hypothesis tests.

f) Errors Are Normally Distributed (for Inference)

While not required for BLUE, normality of errors:

ϵi ~ N(0, σ²)

ensures valid hypothesis testing and confidence intervals.

If the errors are not normal but the sample size is large, the central limit theorem allows approximate inference.

3. 🔷 Concept of Least Squares Estimation:

Given a dataset with observations:

(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)

We assume a linear relationship:

y=β0+β1x+ϵ

• y: Dependent (response) variable
• x: Independent (predictor) variable
• β0, β1: Model parameters (intercept and slope) to be estimated
• ϵ: Error term (normally distributed with mean 0)

🔷 Objective:

Minimize the sum of squared residuals:

S = Σᵢ (yᵢ − (β0 + β1xᵢ))²   (sum over i = 1 … n)

This yields estimates of β0 and β1.

🔷 Formulas for Estimation:

1. Slope (β₁):

β1 = Σᵢ (xᵢ − mean(x)) * (yᵢ − mean(y)) / Σᵢ (xᵢ − mean(x))²   (sums over i = 1 … n)

2. Intercept (β₀):

β0 = mean(y) − β1 * mean(x)

Where mean(x) and mean(y) are the means of x and y.
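As a cross-check, here is a minimal R sketch of these formulas on made-up data; lm() should reproduce the hand-computed coefficients:

# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Least-squares estimates from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)

coef(lm(y ~ x))   # should match b0 and b1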

🔷 Applications:

• Linear regression (single or multiple)


• Curve fitting in physics/engineering
• Financial and economic modeling
• Time series forecasting

🔷 Example:

If the estimated equation is:

y = 2x + 3

Then:

• β1=2

• β0=3

This means for every unit increase in x, y increases by 2.

🔷 Error Estimation – RMSE (Root Mean Squared Error):

RMSE = √[ (1/n) Σᵢ (yᵢ − ŷᵢ)² ]

where ŷᵢ is the predicted value for observation i. A lower RMSE indicates a better fit.
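A minimal R sketch of the RMSE computation, using the estimated equation y = 2x + 3 from the example above on hypothetical noisy observations:

x <- c(1, 2, 3, 4, 5)
y <- c(5.2, 6.8, 9.1, 11.3, 12.7)   # hypothetical observed values
y_hat <- 2 * x + 3                  # predictions from y = 2x + 3

rmse <- sqrt(mean((y - y_hat)^2))   # root mean squared error
rmse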

Key Properties of LSE/OLS Estimators:


According to the Gauss-Markov theorem, under the assumptions of OLS, the least squares
estimators are:

• Best (minimum variance)


• Linear
• Unbiased

Collectively known as BLUE (Best Linear Unbiased Estimator).

4. Variable Rationalization:
• The data set may have a large number of attributes, but some of those attributes can be irrelevant or redundant. The goal of variable rationalization is to improve data processing in an optimal way through attribute subset selection.
• The aim of this process is to find a minimum set of attributes such that dropping the irrelevant ones does not significantly affect the utility of the data, while reducing the cost of data analysis.

Types

I. Stepwise Forward Selection


II. Stepwise Backward Elimination
III. Combination of Forward Selection and Backward Elimination
IV. Decision Tree Induction.

All the above methods are greedy approaches for attribute subset selection.

I. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (the one with the minimum p-value) is chosen and added to the minimal set. In each iteration, one attribute is added to the reduced set.
II. Stepwise Backward Elimination: Here all the attributes are included in the initial set. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set.
III. Combination of Forward Selection and Backward Elimination: Forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most commonly used technique for attribute selection (an R sketch of these stepwise searches follows this list).
IV. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure whose internal nodes denote tests on attributes, each branch corresponds to a test outcome, and leaf nodes give class predictions. Any attribute that is not part of the tree is considered irrelevant and discarded.
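For illustration, base R's step() function implements these greedy stepwise searches (it uses AIC rather than p-values, which is one common variant of the same idea); the sketch below assumes the built-in mtcars dataset:

full <- lm(mpg ~ ., data = mtcars)   # model with all candidate attributes
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model (empty attribute set)

# I. Forward selection: start empty, add one attribute per iteration
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

# II. Backward elimination: start full, drop one attribute per iteration
bwd <- step(full, direction = "backward", trace = 0)

# III. Combined forward selection and backward elimination
both <- step(full, direction = "both", trace = 0)
formula(both)                        # the selected attribute subset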

5. Model Building Life Cycle in Data Analytics – Summary

The Model Building Life Cycle is a systematic approach to solving business problems using
data analytics. It consists of six main stages:

1. Problem Definition

• Understand the business problem clearly.


• Define objectives and prediction targets.
• Identify challenges and requirements before proceeding.

2. Hypothesis Generation

• Form assumptions about factors influencing the outcome.


• Involves brainstorming potential predictors, even those not yet in the data.
• Helps guide data collection and feature selection.

3. Data Collection

• Gather relevant and reliable data from credible sources.


• The data should:
o Answer hypothesis-related questions.
o Be detailed enough to support analysis.
o Allow accurate outcome predictions.
4. Data Exploration/Transformation

• Explore and preprocess raw data to understand patterns and handle


inconsistencies.
• Key substeps include:
o Feature Identification – determine relevant variables.
o Univariate Analysis – study individual variables.
o Multivariate Analysis – explore relationships among variables.
o Handling Null Values – replace nulls with the mean or median (for numerical variables) or the most frequent value (for categorical variables), as sketched after this list.
• Takes up 60–70% of a data scientist’s time.
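A minimal R sketch of the null-handling substep on a hypothetical toy data frame:

# Hypothetical data frame with missing values
df <- data.frame(age  = c(25, NA, 31, 40, NA),
                 city = c("A", "B", NA, "B", "B"),
                 stringsAsFactors = FALSE)

# Numerical column: replace nulls with the mean
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# Categorical column: replace nulls with the most frequent value (mode)
mode_city <- names(which.max(table(df$city)))
df$city[is.na(df$city)] <- mode_city
df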

5. Predictive Modeling

• Use appropriate algorithms to train models on data.


• Steps:
o Algorithm Selection – supervised (e.g., regression, classification) or
unsupervised (e.g., clustering).
o Model Training – build model using selected algorithm.
o Model Prediction – test on new data and validate with metrics like accuracy
or ROC curve.

6. Model Deployment

• Implement the model in real-time systems (e.g., websites, business platforms).


• Ensure it:
o Supports strategic decision-making.
o Can be updated with new data.
o Enhances customer satisfaction (e.g., personalized recommendations).

Key Takeaways

• Clearly define the problem and prediction goals.


• Generate hypotheses before analyzing data.
• Collect quality data for effective modeling.
• Prioritize data cleaning and exploration.
• Choose the right algorithm and validate results.
• Deploy the model to derive actionable insights and support business strategies.


6. Logistic Regression – Model Theory

1. Introduction:
• Logistic Regression is a Supervised Learning algorithm used for classification problems.
• It predicts a categorical dependent variable (binary or multiclass) using one or more
independent variables.
• Instead of exact 0 or 1 outputs, it produces probability values between 0 and 1.

2. Key Characteristics:

• Predictive like regression, but it classifies outcomes rather than estimating a continuous value.


• Uses a sigmoid (logistic) function to model a curve that maps input features to a
probability between 0 and 1.
• Based on a threshold, it classifies outcomes as 0 or 1.

3. Types of Logistic Regression:

• Binomial: Two possible outcomes (e.g., Yes/No).


• Multinomial: More than two unordered outcomes (e.g., Dog/Cat/Sheep).
• Ordinal: More than two ordered outcomes (e.g., Low/Medium/High).

4. Logistic Regression Equation: Derived from the linear regression equation:

y = b0 + b1x1 + b2x2 + ... + bnxn

The sigmoid function then maps y to a probability:

σ(y) = 1 / (1 + e^(−y))

5. Assumptions:

• The dependent variable must be categorical.


• No multicollinearity among independent variables.

6. Model Fit and Evaluation Metrics:

• Confusion Matrix: Shows TP, TN, FP, FN.


• Accuracy: (TP + TN) / (TP + TN + FP + FN)
• Precision: TP / (TP + FP)
• Recall (Sensitivity): TP / (TP + FN)
• F1 Score: Harmonic mean of precision and recall.
• ROC Curve: Plots TPR vs. FPR.
• AUC (Area Under ROC Curve): Measures classification performance.

7. Example in R (sigmoid):

y <- -10:10              # inputs from -10 to 10
z <- 1 / (1 + exp(-y))   # sigmoid maps each input into (0, 1)
plot(y, z)               # characteristic S-shaped curve

8. Applications in Business Domains:

• Credit Scoring
• Customer Retention (CRM)
• Fraud Detection
• Finance & Risk Analysis
• Human Resource Analytics
• Marketing Response Modeling

7. Logistic Regression – Model Fit Statistics

Model Fit Statistics in logistic regression help evaluate how well the model explains the
relationship between the dependent and independent variables. Unlike linear regression, logistic
regression doesn't use R² directly, so we rely on several alternative measures:

1. Likelihood Function

• Logistic regression uses the Maximum Likelihood Estimation (MLE) instead of


least squares.
• Log-Likelihood (LL): Measures the probability of observed results given model
parameters. Higher values indicate a better model fit.

2. Deviance

• Null Deviance: Measures the fit of a model with only the intercept (no predictors).
• Residual Deviance: Measures the fit of the model with all predictors.
• A large drop in deviance indicates a good model.

3. Akaike Information Criterion (AIC)

• AIC balances model fit and complexity (penalizes for overfitting).


• Lower AIC = Better model.
• Used for model comparison.
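All three quantities are reported by R's glm(); a minimal sketch on simulated data (values are illustrative only):

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.5 * x))   # simulated binary outcome

model <- glm(y ~ x, family = binomial)
summary(model)   # prints null deviance, residual deviance, and AIC
logLik(model)    # log-likelihood of the fitted model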

4. Pseudo R-squared (McFadden’s R²)

• Measures the improvement in log-likelihood over the null model, playing a role analogous to R² in linear regression.


• Ranges from 0 to 1. Higher values = better fit.
• Values > 0.2 suggest a decent fit.
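McFadden's R² is not printed by glm() directly, but it follows from the log-likelihoods; a sketch continuing the simulated example above (the setup lines are repeated so it runs on its own):

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.5 * x))

model      <- glm(y ~ x, family = binomial)
null_model <- glm(y ~ 1, family = binomial)   # intercept-only baseline

mcfadden <- 1 - as.numeric(logLik(model) / logLik(null_model))
mcfadden   # values above roughly 0.2 are often read as a decent fit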
5. Confusion Matrix Metrics

Derived from predicted vs actual class labels:

• Accuracy: (TP + TN) / (TP + TN + FP + FN)


• Precision: TP / (TP + FP)
• Recall (Sensitivity): TP / (TP + FN)
• F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
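A small R sketch computing these metrics from hypothetical confusion-matrix counts:

# Hypothetical counts
TP <- 50; TN <- 35; FP <- 10; FN <- 5

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)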

6. ROC Curve and AUC (Area Under Curve)

• ROC curve plots True Positive Rate vs False Positive Rate.


• AUC (Area Under ROC Curve): Measures discrimination ability. Closer to 1 = better
model.
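One common way to compute ROC/AUC in R is the pROC package (an assumption here, not something the document prescribes; install it with install.packages("pROC") if needed):

library(pROC)   # assumes the pROC package is installed

set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))                 # simulated binary outcome
p <- predict(glm(y ~ x, family = binomial), type = "response")

roc_obj <- roc(y, p)   # ROC curve: TPR vs FPR across thresholds
auc(roc_obj)           # area under the curve; closer to 1 is better
plot(roc_obj)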
Conclusion

Model fit statistics in logistic regression provide essential insights into predictive performance.
They ensure the model is neither underfitted nor overfitted and help in comparing multiple
models to select the best one.

8. Logistic Regression – Model Construction

Logistic regression is a supervised learning classification algorithm used when the dependent variable is categorical (binary or multinomial). The model is constructed as follows:

Steps in Logistic Regression Model Construction:

1. Define the Problem:


o Identify the binary outcome (e.g., success/failure, yes/no).
o Understand the business context or research question.
2. Prepare the Dataset:
o Collect relevant independent variables (predictors).
o Ensure the dependent variable is categorical.
o Handle missing data, outliers, and convert categorical variables using
encoding techniques.
3. Check Assumptions:
o No multicollinearity among independent variables.
o Dependent variable must be binary or ordinal.
o Linearity of the logit for continuous independent variables.
4. Fit the Logistic Model:
o Use the logistic (sigmoid) function to model the probability of the outcome:

P(y = 1 | x) = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + … + βnxn)))

5. Model Evaluation – Fit Statistics:


o Use metrics like:
▪ Confusion Matrix (TP, TN, FP, FN)
▪ Accuracy, Precision, Recall, F1-score
▪ AIC (Akaike Information Criterion) for model comparison
▪ ROC Curve and AUC Score for classification quality
6. Model Prediction:
o Predict probabilities for new observations and convert them to class labels using a threshold (a sketch follows below).
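A minimal end-to-end sketch of steps 4–6 with glm() on simulated data (illustrative only):

set.seed(3)
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$y <- rbinom(100, 1, plogis(0.3 + 1.2 * train$x1 - 0.8 * train$x2))

# Step 4: fit the logistic model
model <- glm(y ~ x1 + x2, family = binomial, data = train)

# Step 5: fit statistics
summary(model)   # coefficients, deviance, AIC

# Step 6: predict probabilities for new data and classify at a 0.5 threshold
newdata <- data.frame(x1 = c(-1, 0, 1), x2 = c(0.5, 0, -0.5))
probs   <- predict(model, newdata, type = "response")
classes <- ifelse(probs > 0.5, 1, 0)
cbind(newdata, probs, classes)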

9. Logistic Regression – Applications to Various Business Domains


Logistic regression is widely used in business analytics due to its ability to predict categorical outcomes (like "Yes/No", "Buy/Not Buy", "Churn/Not Churn") from input features. Here's a concise overview of its applications across different business domains:

🔶 1. Credit Card Companies

• Purpose: To analyze spending habits and detect fraud.


• Application: Predict whether a transaction is fraudulent (0 = No, 1 = Yes).
• Use Case: Customer segmentation, behavior prediction, credit scoring.

🔶 2. Customer Relationship Management (CRM)

• Purpose: Enhance customer satisfaction and retention.


• Application: Predict customer churn or likelihood of repeat purchase.
• Use Case: Targeted marketing, personalized recommendations.

🔶 3. Finance

• Purpose: Financial planning and risk assessment.


• Application: Loan approval prediction, risk classification (e.g., high/low risk).
• Use Case: Budgeting, forecasting, portfolio optimization.

🔶 4. Human Resources

• Purpose: Recruitment and retention planning.


• Application: Predict employee attrition or likelihood to accept an offer.
• Use Case: Job-fit modeling, succession planning.

🔶 5. Manufacturing

• Purpose: Improve efficiency and reduce costs.


• Application: Predict equipment failure or quality defects (0 = No Defect, 1 = Defect).
• Use Case: Maintenance planning, supply chain optimization.

🔶 6. Marketing

• Purpose: Optimize campaigns and improve ROI.


• Application: Predict whether a customer will respond to a marketing campaign.
• Use Case: Lead conversion modeling, ad performance analysis, A/B testing.

🟩 Summary:

Logistic regression models are simple yet powerful tools in business analytics. They are best
used when:

• The outcome is categorical.


• You need to interpret the influence of features (coefficients).
• There's a need for probabilistic output for decision-making.

