Lecture 5. Part 1 - Regression Analysis
Lecture 5. STT153A
Overview
In this lecture, we are going to discuss the following:
• Intrinsically Linear Models
• Logistic Regression
• Kaplan-Meier Survival Analysis
Intrinsically Linear Models
Reference: Mislick, G. K., & Nussbaum, D. A. (2015). Cost estimation: Methods and
tools. John Wiley & Sons.
What if the relationship between our variables is NOT linear?
Transform the data by taking the natural logarithm (ln) of each value:
• Using a calculator: ln(value)
• Using the Excel formula function: =LN(value)
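The same transformation can also be done in code. A minimal sketch in Python, assuming the two (Hours, Unit Number) pairs (32, 35) and (25, 75) that appear in this lecture's data table; `math.log` with one argument is the natural logarithm.

```python
import math

# (Hours, Unit Number) pairs taken from the lecture's data table
rows = [(32, 35), (25, 75)]

# math.log with a single argument returns the natural logarithm (ln)
transformed = [(round(math.log(h), 5), round(math.log(u), 5)) for h, u in rows]

for (h, u), (ln_h, ln_u) in zip(rows, transformed):
    print(f"Hours={h}, Unit #={u} -> ln(Hours)={ln_h}, ln(Unit #)={ln_u}")
```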
ln(Hours)    ln(Unit #)
4.09434      1.60944
3.80666      2.48491
3.46574      3.55535
3.21888      4.31749
3.04452      4.82831
Since there is only one independent variable, use simple linear regression on the transformed data.
Enter the data in Statistica:
Hours    Unit Number    ln(Hours)    ln(Unit #)
32       35             3.46574      3.55535
25       75             3.21888      4.31749
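The simple linear regression on the transformed data can be sketched outside Statistica as well. The Python example below fits ordinary least squares to the five ln(Unit #) and ln(Hours) pairs listed in this lecture. Two things are assumptions here, not statements from the slides: that ln(Hours) is the dependent variable, and that the underlying model is the power form Hours = a · (Unit #)^b. The helper name `simple_ols` is hypothetical.

```python
import math

# Transformed values copied from the lecture's ln table
ln_unit = [1.60944, 2.48491, 3.55535, 4.31749, 4.82831]   # independent (assumed)
ln_hours = [4.09434, 3.80666, 3.46574, 3.21888, 3.04452]  # dependent (assumed)

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = simple_ols(ln_unit, ln_hours)

# Back-transform under the assumed power model: Hours = a * Unit^b
a = math.exp(intercept)
b = slope

print(f"ln(Hours) = {intercept:.4f} + ({slope:.4f}) * ln(Unit #)")
print(f"Hours = {a:.2f} * Unit^({b:.4f})")
```

Under the assumed power form, the slope of the fitted line is the exponent b and the exponential of the intercept is the multiplier a.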
Log Likelihood
Example 1.
Encode the data in Statistica:
Buys the Product    Salary    Age
Yes                 1500      33
No                  1200      33
No                  2200      34
No                  2100      42
Yes                 1500      29
Yes                 1700      19
No                  3000      50
No                  3000      55
Yes                 2800      31
Yes                 2900      46
No                  2750      36
No                  2550      48
Yes                 1200      24
Statistics -> Advanced Linear/Nonlinear Models -> Stepwise Model Builder -> Logistic Regression
-> Select Variables -> Dependent, Continuous, Categorical -> Bad code (Yes), Good code (No). The model predicts the probability of the “bad code”, and in this situation our interest is whether the customer will buy the product, so “Yes” is set as the bad code.
-> Full Sample -> Add variables (select all variables from the Marginal Results Table) -> Summary
The logistic regression model:
$$P(y = 1 \mid x_1, \dots, x_k) = \frac{1}{1 + e^{-(b_1 x_1 + \cdots + b_k x_k + a)}}$$
For Example 1, with $x_1$ = Salary and $x_2$ = Age:
$$P(\text{Buys the product} = \text{Yes} \mid \text{Salary}, \text{Age}) = \frac{1}{1 + e^{-(0.000777\, x_1 - 0.212893\, x_2)}}$$
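To see how the fitted equation turns Salary and Age into a probability, here is a minimal Python sketch. The two coefficients are the ones printed in the equation for Example 1. The function name `prob_buys` is hypothetical, and the `intercept` parameter defaults to 0.0 only because the printed equation shows no constant term; that default is an assumption, so the absolute probabilities should be read with care and only the direction of the effects relied on.

```python
import math

# Coefficients as printed in the fitted equation for Example 1
B_SALARY = 0.000777
B_AGE = -0.212893

def prob_buys(salary, age, intercept=0.0):
    """Logistic probability of "Yes" for a given Salary and Age.

    intercept=0.0 is an assumption: the printed equation shows no constant term.
    """
    z = intercept + B_SALARY * salary + B_AGE * age
    return 1.0 / (1.0 + math.exp(-z))

# Same Salary, different Age: the negative Age coefficient lowers the probability
p_young = prob_buys(1500, 24)
p_old = prob_buys(1500, 48)

# Same Age, different Salary: the positive Salary coefficient raises the probability
p_low_salary = prob_buys(1500, 33)
p_high_salary = prob_buys(3000, 33)

print(p_young, p_old)
print(p_low_salary, p_high_salary)
```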
Interpretation
Logistic regression analysis was performed to examine the influence of Age and Salary on the variable Buys the Product, predicting the value “Yes”.
The coefficient of the variable Salary is b1 = 0.000777, which is positive. This means that an increase in Salary is associated with an increase in the probability that the dependent variable is “Yes”. However, the p-value of 0.5837 indicates that this influence is not statistically significant at the conventional 5% level. The odds ratio of exp(0.000777) = 1.000777302 indicates that a one-unit increase in Salary multiplies the odds that the dependent variable is “Yes” by 1.000777302, an increase of about 0.08%.
The coefficient of the variable Age is b2 = −0.212893, which is negative. This means that an increase in Age is associated with a decrease in the probability that the dependent variable is “Yes”. However, the p-value of 0.094680 indicates that this influence is not statistically significant at the conventional 5% level. The odds ratio of exp(−0.212893) ≈ 0.8082 indicates that a one-unit increase in Age multiplies the odds that the dependent variable is “Yes” by about 0.8082, a decrease of about 19%.
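The odds ratio of a predictor is the exponential of its coefficient, so it can be checked directly from the two coefficients reported for Example 1. A minimal sketch in Python; the dictionary names are illustrative only.

```python
import math

# Coefficients reported for Example 1
coefficients = {"Salary": 0.000777, "Age": -0.212893}

# Odds ratio = exp(coefficient); values below 1 mean the odds shrink
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    change_pct = (ratio - 1.0) * 100.0
    print(f"{name}: odds ratio = {ratio:.6f} ({change_pct:+.2f}% per one-unit increase)")
```

An odds ratio above 1 goes with a positive coefficient and an odds ratio below 1 with a negative one, which is a quick consistency check on any reported value.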
Example 2.
Open Data CHD Logistic
Reference: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression
Variables
Demographic
• Sex: male or female (Nominal)
• Age: age of the patient (Continuous; although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
Behavioral
• Current Smoker: whether or not the patient is a current smoker (Nominal)
• Cigs Per Day: the number of cigarettes the person smoked on average in one day (can be considered continuous, as one can have any number of cigarettes, even half a cigarette)
Medical (history)
• BP Meds: whether or not the patient was on blood pressure medication (Nominal)
• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
• Diabetes: whether or not the patient had diabetes (Nominal)
Medical (current)
• Tot Chol: total cholesterol level (Continuous)
• Sys BP: systolic blood pressure (Continuous)
• Dia BP: diastolic blood pressure (Continuous)
• BMI: Body Mass Index (Continuous)
• Heart Rate: heart rate (Continuous; in medical research, variables such as heart rate, though in fact discrete, are considered continuous because of the large number of possible values)
• Glucose: glucose level (Continuous)
Note that this interpretation is based on the post on Kaggle. The data is not clean: there were 648 cases with an NA response, which is not practical for analysis. After cleaning the data, the results for men and women are different.
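One common way to clean such data is listwise deletion, that is, dropping every case that has an NA in any field; whether this is the method used for the lecture's results is an assumption. The Python sketch below shows the idea on a few hypothetical toy rows (they are NOT taken from the Kaggle file, and the column names are illustrative), then splits the complete cases by sex so each group can be analysed separately.

```python
import csv
import io

# Hypothetical toy rows for illustration only; NOT taken from the Kaggle file
raw = """sex,age,glucose,chd
male,39,77,0
female,46,85,0
male,48,NA,1
female,61,NA,1
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Listwise deletion (assumed cleaning method): keep rows with no "NA" field
complete = [r for r in rows if "NA" not in r.values()]
dropped = len(rows) - len(complete)

# Group the complete cases by sex so men and women can be modelled separately
by_sex = {}
for r in complete:
    by_sex.setdefault(r["sex"], []).append(r)

print(f"kept {len(complete)} rows, dropped {dropped}")
for sex, group in by_sex.items():
    print(sex, len(group))
```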