Da sem unit 3-1

The document provides an overview of regression analysis, including its definition, types, and applications in statistical modeling. It details the assumptions necessary for the Best Linear Unbiased Estimator (BLUE) property, the concept of least squares estimation, and variable rationalization techniques. Additionally, it outlines the model building life cycle in data analytics and the application of logistic regression in various business domains.

1.

Regression – Concepts:

Introduction:

• The term regression is used to indicate the estimation or prediction of the average value
of one variable for a specified value of another variable.
• Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.
• Regression analysis is a statistical process for estimating the relationships between a dependent variable (also called the criterion or response variable) and one or more independent variables (predictor variables).

• Regression describes how an independent variable is numerically related to the


dependent variable.
• Regression can be used for prediction, estimation and hypothesis testing, and modeling
causal relationships.

When is regression chosen?

• A regression problem is when the output variable is a real or continuous value, such as
“salary” or “weight”.
• Mathematically a linear relationship represents a straight line when plotted as a graph.
• A non-linear relationship where the exponent of any variable is not equal to 1 creates a
curve.

Types of Regression Analysis Techniques:

Linear Regression

Logistic Regression

Ridge Regression

Lasso Regression

Polynomial Regression

Bayesian Linear Regression

Advantages & Limitations:

• Fast and easy to model; particularly useful when the relationship to be modeled is not extremely complex and when you don't have a lot of data.
• Very intuitive to understand and interpret.
• Linear Regression is very sensitive to outliers.

Linear regression:

• Linear Regression is a very simple method but has proven to be very useful for a large
number of situations.
• When we have a single input attribute (x) and we want to use linear regression, this is
called simple linear regression.
• In simple linear regression we model our data as follows: y = B0 + B1 * x
• B0 and B1 are coefficients that we need to estimate; they move the line around.
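As a quick illustration, here is a minimal R sketch (the data values are made up for illustration) that estimates B0 and B1 using base R's lm() function:

# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)   # fits y = B0 + B1 * x by least squares
coef(fit)          # estimated coefficients: B0 (intercept) and B1 (slope)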
2. BLUE property Assumptions
• In regression analysis, the BLUE property assumptions refer to the Best Linear Unbiased
Estimator criteria, which are derived from the Gauss-Markov theorem. These
assumptions ensure that the Ordinary Least Squares (OLS) estimator is the best
(minimum variance) linear unbiased estimator.
• For an OLS estimator to be BLUE, it must satisfy the following assumptions:
a) Linearity of the model
• The relationship between the independent variables (X) and the dependent variable (Y) must be linear.
• The model should take the form:

Y=β0+β1X1+β2X2+⋯+βnXn+ϵ

b) Random and Zero-Mean Errors (Unbiasedness)

• The error term must have an expected value of zero:

E[ϵi] = 0 for all i

• This ensures that the OLS estimates are unbiased, meaning the expected value of the estimated coefficients equals the true parameter values.
c) No Perfect Multicollinearity

The independent variables must not be perfectly correlated: rank(X) = p (full column rank)

If perfect multicollinearity exists, the matrix XᵀX becomes singular, making it impossible to compute (XᵀX)⁻¹.

If variables are highly but not perfectly correlated, OLS can still be used, but the estimates may be unstable.

d) Homoscedasticity (Constant Variance of Errors)

The variance of the error term must be constant for all observations:

Var(ϵi) = σ²

Note: If heteroscedasticity (changing variance) exists, OLS is still unbiased but not efficient.

e) No Autocorrelation

Error terms must not be correlated with each other:

E[ϵi ϵj] = 0 for all i ≠ j

Note: If autocorrelation exists, OLS is still unbiased but inefficient, leading to incorrect standard errors and hypothesis tests.

f) Errors Are Normally Distributed (for Inference)

While not required for BLUE, normality of errors:

ϵi ~ N(0, σ²)

ensures valid hypothesis testing and confidence intervals.

If the errors are not normal but the sample size is large, the central limit theorem allows approximate inference.

3. 🔷 Concept of Least Squares Estimation:

Given a dataset with observations:

(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)

We assume a linear relationship:

y=β0+β1x+ϵ

• y: Dependent (response) variable
• x: Independent (predictor) variable
• β0, β1: Model parameters (intercept and slope) to be estimated
• ϵ: Error term (normally distributed with mean 0)

🔷 Objective:

Minimize the sum of squared residuals:

S = Σᵢ (yᵢ − (β0 + β1xᵢ))²   (sum over i = 1 … n)

This yields estimates of β0 and β1.

🔷 Formulas for Estimation:

1. Slope (β₁):

β1 = Σᵢ (xᵢ − mean(x)) * (yᵢ − mean(y)) / Σᵢ (xᵢ − mean(x))²   (sums over i = 1 … n)

2. Intercept (β₀):

β0 = mean(y) − β1 * mean(x)

Where mean(x) and mean(y) are the means of x and y.
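As a cross-check, here is a minimal R sketch of these formulas on made-up data; lm() should reproduce the hand-computed coefficients:

# Hypothetical data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Least-squares estimates from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)

coef(lm(y ~ x))   # should match b0 and b1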

🔷 Applications:

• Linear regression (single or multiple)


• Curve fitting in physics/engineering
• Financial and economic modeling
• Time series forecasting

🔷 Example:

If the estimated equation is:

y = 2x + 3

Then:

• β1=2

• β0=3

This means for every unit increase in x, y increases by 2.

🔷 Error Estimation – RMSE (Root Mean Squared Error):

RMSE = √[ (1/n) Σᵢ (yᵢ − ŷᵢ)² ]

where ŷᵢ is the predicted value for observation i. A lower RMSE indicates a better fit.
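A minimal R sketch of the RMSE computation, using the estimated equation y = 2x + 3 from the example above on hypothetical noisy observations:

x <- c(1, 2, 3, 4, 5)
y <- c(5.2, 6.8, 9.1, 11.3, 12.7)   # hypothetical observed values
y_hat <- 2 * x + 3                  # predictions from y = 2x + 3

rmse <- sqrt(mean((y - y_hat)^2))   # root mean squared error
rmse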

Key Properties of LSE/OLS Estimators:


According to the Gauss-Markov theorem, under the assumptions of OLS, the least squares
estimators are:

• Best (minimum variance)


• Linear
• Unbiased

Collectively known as BLUE (Best Linear Unbiased Estimator).

4. Variable Rationalization:
• The data set may have a large number of attributes, but some of those attributes can be irrelevant or redundant. The goal of variable rationalization is to improve data processing in an optimal way through attribute subset selection.
• The aim of this process is to find a minimum set of attributes such that dropping the irrelevant ones does not significantly affect the utility of the data, while reducing the cost of data analysis.

Types

I. Stepwise Forward Selection


II. Stepwise Backward Elimination
III. Combination of Forward Selection and Backward Elimination
IV. Decision Tree Induction.

All the above methods are greedy approaches for attribute subset selection.

I. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (the one with the minimum p-value) is chosen and added to the minimal set. In each iteration, one attribute is added to the reduced set.
II. Stepwise Backward Elimination: Here all the attributes are included in the initial set. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set.
III. Combination of Forward Selection and Backward Elimination: Forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most commonly used technique for attribute selection (an R sketch of these stepwise searches follows this list).
IV. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure whose internal nodes denote tests on attributes, each branch corresponds to a test outcome, and leaf nodes give class predictions. Any attribute that is not part of the tree is considered irrelevant and discarded.
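For illustration, base R's step() function implements these greedy stepwise searches (it uses AIC rather than p-values, which is one common variant of the same idea); the sketch below assumes the built-in mtcars dataset:

full <- lm(mpg ~ ., data = mtcars)   # model with all candidate attributes
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model (empty attribute set)

# I. Forward selection: start empty, add one attribute per iteration
fwd <- step(null, scope = formula(full), direction = "forward", trace = 0)

# II. Backward elimination: start full, drop one attribute per iteration
bwd <- step(full, direction = "backward", trace = 0)

# III. Combined forward selection and backward elimination
both <- step(full, direction = "both", trace = 0)
formula(both)                        # the selected attribute subset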

5. Model Building Life Cycle in Data Analytics – Summary

The Model Building Life Cycle is a systematic approach to solving business problems using
data analytics. It consists of six main stages:

1. Problem Definition

• Understand the business problem clearly.


• Define objectives and prediction targets.
• Identify challenges and requirements before proceeding.

2. Hypothesis Generation

• Form assumptions about factors influencing the outcome.


• Involves brainstorming potential predictors, even those not yet in the data.
• Helps guide data collection and feature selection.

3. Data Collection

• Gather relevant and reliable data from credible sources.


• The data should:
o Answer hypothesis-related questions.
o Be detailed enough to support analysis.
o Allow accurate outcome predictions.
4. Data Exploration/Transformation

• Explore and preprocess raw data to understand patterns and handle


inconsistencies.
• Key substeps include:
o Feature Identification – determine relevant variables.
o Univariate Analysis – study individual variables.
o Multivariate Analysis – explore relationships among variables.
o Handling Null Values – replace nulls with the mean or median (for numerical variables) or the most frequent value (for categorical variables), as sketched after this list.
• Takes up 60–70% of a data scientist’s time.
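A minimal R sketch of the null-handling substep on a hypothetical toy data frame:

# Hypothetical data frame with missing values
df <- data.frame(age  = c(25, NA, 31, 40, NA),
                 city = c("A", "B", NA, "B", "B"),
                 stringsAsFactors = FALSE)

# Numerical column: replace nulls with the mean
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# Categorical column: replace nulls with the most frequent value (mode)
mode_city <- names(which.max(table(df$city)))
df$city[is.na(df$city)] <- mode_city
df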

5. Predictive Modeling

• Use appropriate algorithms to train models on data.


• Steps:
o Algorithm Selection – supervised (e.g., regression, classification) or
unsupervised (e.g., clustering).
o Model Training – build model using selected algorithm.
o Model Prediction – test on new data and validate with metrics like accuracy
or ROC curve.

6. Model Deployment

• Implement the model in real-time systems (e.g., websites, business platforms).


• Ensure it:
o Supports strategic decision-making.
o Can be updated with new data.
o Enhances customer satisfaction (e.g., personalized recommendations).

Key Takeaways

• Clearly define the problem and prediction goals.


• Generate hypotheses before analyzing data.
• Collect quality data for effective modeling.
• Prioritize data cleaning and exploration.
• Choose the right algorithm and validate results.
• Deploy the model to derive actionable insights and support business strategies.


6. Logistic Regression – Model Theory

1. Introduction:
• Logistic Regression is a Supervised Learning algorithm used for classification problems.
• It predicts a categorical dependent variable (binary or multiclass) using one or more
independent variables.
• Instead of exact 0 or 1 outputs, it produces probability values between 0 and 1.

2. Key Characteristics:

• Predictive like regression, but it classifies outcomes rather than estimating a continuous value.


• Uses a sigmoid (logistic) function to model a curve that maps input features to a
probability between 0 and 1.
• Based on a threshold, it classifies outcomes as 0 or 1.

3. Types of Logistic Regression:

• Binomial: Two possible outcomes (e.g., Yes/No).


• Multinomial: More than two unordered outcomes (e.g., Dog/Cat/Sheep).
• Ordinal: More than two ordered outcomes (e.g., Low/Medium/High).

4. Logistic Regression Equation: Derived from the linear regression equation:

y = b0 + b1x1 + b2x2 + ... + bnxn

The sigmoid function then maps y to a probability:

σ(y) = 1 / (1 + e^(−y))

5. Assumptions:

• The dependent variable must be categorical.


• No multicollinearity among independent variables.

6. Model Fit and Evaluation Metrics:

• Confusion Matrix: Shows TP, TN, FP, FN.


• Accuracy: (TP + TN) / (TP + TN + FP + FN)
• Precision: TP / (TP + FP)
• Recall (Sensitivity): TP / (TP + FN)
• F1 Score: Harmonic mean of precision and recall.
• ROC Curve: Plots TPR vs. FPR.
• AUC (Area Under ROC Curve): Measures classification performance.

7. Example in R (sigmoid):

y <- -10:10              # inputs from -10 to 10
z <- 1 / (1 + exp(-y))   # sigmoid maps each input into (0, 1)
plot(y, z)               # characteristic S-shaped curve

8. Applications in Business Domains:

• Credit Scoring
• Customer Retention (CRM)
• Fraud Detection
• Finance & Risk Analysis
• Human Resource Analytics
• Marketing Response Modeling

7. Logistic Regression – Model Fit Statistics

Model Fit Statistics in logistic regression help evaluate how well the model explains the
relationship between the dependent and independent variables. Unlike linear regression, logistic
regression doesn't use R² directly, so we rely on several alternative measures:

1. Likelihood Function

• Logistic regression uses the Maximum Likelihood Estimation (MLE) instead of


least squares.
• Log-Likelihood (LL): Measures the probability of observed results given model
parameters. Higher values indicate a better model fit.

2. Deviance

• Null Deviance: Measures the fit of a model with only the intercept (no predictors).
• Residual Deviance: Measures the fit of the model with all predictors.
• A large drop in deviance indicates a good model.

3. Akaike Information Criterion (AIC)

• AIC balances model fit and complexity (penalizes for overfitting).


• Lower AIC = Better model.
• Used for model comparison.
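All three quantities are reported by R's glm(); a minimal sketch on simulated data (values are illustrative only):

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.5 * x))   # simulated binary outcome

model <- glm(y ~ x, family = binomial)
summary(model)   # prints null deviance, residual deviance, and AIC
logLik(model)    # log-likelihood of the fitted model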

4. Pseudo R-squared (McFadden’s R²)

• Measures the improvement in log-likelihood over the null model, playing a role analogous to R² in linear regression.


• Ranges from 0 to 1. Higher values = better fit.
• Values > 0.2 suggest a decent fit.
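McFadden's R² is not printed by glm() directly, but it follows from the log-likelihoods; a sketch continuing the simulated example above (the setup lines are repeated so it runs on its own):

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.5 * x))

model      <- glm(y ~ x, family = binomial)
null_model <- glm(y ~ 1, family = binomial)   # intercept-only baseline

mcfadden <- 1 - as.numeric(logLik(model) / logLik(null_model))
mcfadden   # values above roughly 0.2 are often read as a decent fit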
5. Confusion Matrix Metrics

Derived from predicted vs actual class labels:

• Accuracy: (TP + TN) / (TP + TN + FP + FN)


• Precision: TP / (TP + FP)
• Recall (Sensitivity): TP / (TP + FN)
• F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
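A small R sketch computing these metrics from hypothetical confusion-matrix counts:

# Hypothetical counts
TP <- 50; TN <- 35; FP <- 10; FN <- 5

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)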

6. ROC Curve and AUC (Area Under Curve)

• ROC curve plots True Positive Rate vs False Positive Rate.


• AUC (Area Under ROC Curve): Measures discrimination ability. Closer to 1 = better
model.
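One common way to compute ROC/AUC in R is the pROC package (an assumption here, not something the document prescribes; install it with install.packages("pROC") if needed):

library(pROC)   # assumes the pROC package is installed

set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))                 # simulated binary outcome
p <- predict(glm(y ~ x, family = binomial), type = "response")

roc_obj <- roc(y, p)   # ROC curve: TPR vs FPR across thresholds
auc(roc_obj)           # area under the curve; closer to 1 is better
plot(roc_obj)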
Conclusion

Model fit statistics in logistic regression provide essential insights into predictive performance.
They ensure the model is neither underfitted nor overfitted and help in comparing multiple
models to select the best one.

8. Logistic Regression – Model Construction

Logistic regression is a supervised learning classification algorithm used when the dependent variable is categorical (binary or multinomial). The model is constructed as follows:

Steps in Logistic Regression Model Construction:

1. Define the Problem:


o Identify the binary outcome (e.g., success/failure, yes/no).
o Understand the business context or research question.
2. Prepare the Dataset:
o Collect relevant independent variables (predictors).
o Ensure the dependent variable is categorical.
o Handle missing data, outliers, and convert categorical variables using
encoding techniques.
3. Check Assumptions:
o No multicollinearity among independent variables.
o Dependent variable must be binary or ordinal.
o Linearity of the logit for continuous independent variables.
4. Fit the Logistic Model:
o Use the logistic (sigmoid) function to model the probability of the outcome:

P(y = 1 | x) = 1 / (1 + e^(−(β0 + β1x1 + β2x2 + … + βnxn)))

5. Model Evaluation – Fit Statistics:


o Use metrics like:
▪ Confusion Matrix (TP, TN, FP, FN)
▪ Accuracy, Precision, Recall, F1-score
▪ AIC (Akaike Information Criterion) for model comparison
▪ ROC Curve and AUC Score for classification quality
6. Model Prediction:
o Predict probabilities for new observations and convert them to class labels using a threshold (a sketch follows below).
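A minimal end-to-end sketch of steps 4–6 with glm() on simulated data (illustrative only):

set.seed(3)
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
train$y <- rbinom(100, 1, plogis(0.3 + 1.2 * train$x1 - 0.8 * train$x2))

# Step 4: fit the logistic model
model <- glm(y ~ x1 + x2, family = binomial, data = train)

# Step 5: fit statistics
summary(model)   # coefficients, deviance, AIC

# Step 6: predict probabilities for new data and classify at a 0.5 threshold
newdata <- data.frame(x1 = c(-1, 0, 1), x2 = c(0.5, 0, -0.5))
probs   <- predict(model, newdata, type = "response")
classes <- ifelse(probs > 0.5, 1, 0)
cbind(newdata, probs, classes)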

9. Logistic Regression – Applications to Various Business Domains


Logistic regression is widely used in business analytics due to its ability to predict categorical outcomes (like "Yes/No", "Buy/Not Buy", "Churn/Not Churn") from input features. Here's a concise overview of its applications across different business domains:

🔶 1. Credit Card Companies

• Purpose: To analyze spending habits and detect fraud.


• Application: Predict whether a transaction is fraudulent (0 = No, 1 = Yes).
• Use Case: Customer segmentation, behavior prediction, credit scoring.

🔶 2. Customer Relationship Management (CRM)

• Purpose: Enhance customer satisfaction and retention.


• Application: Predict customer churn or likelihood of repeat purchase.
• Use Case: Targeted marketing, personalized recommendations.

🔶 3. Finance

• Purpose: Financial planning and risk assessment.


• Application: Loan approval prediction, risk classification (e.g., high/low risk).
• Use Case: Budgeting, forecasting, portfolio optimization.

🔶 4. Human Resources

• Purpose: Recruitment and retention planning.


• Application: Predict employee attrition or likelihood to accept an offer.
• Use Case: Job-fit modeling, succession planning.

🔶 5. Manufacturing

• Purpose: Improve efficiency and reduce costs.


• Application: Predict equipment failure or quality defects (0 = No Defect, 1 = Defect).
• Use Case: Maintenance planning, supply chain optimization.

🔶 6. Marketing

• Purpose: Optimize campaigns and improve ROI.


• Application: Predict whether a customer will respond to a marketing campaign.
• Use Case: Lead conversion modeling, ad performance analysis, A/B testing.

🟩 Summary:

Logistic regression models are simple yet powerful tools in business analytics. They are best
used when:

• The outcome is categorical.


• You need to interpret the influence of features (coefficients).
• There's a need for probabilistic output for decision-making.

