Lecture 5. Part 1 - Regression Analysis

- Intrinsically linear models use data transformations to linearize relationships between variables that are not truly linear. An example showed transforming hours and unit-number data using natural logs to create a linear relationship.
- Logistic regression predicts the probability of a binary/dichotomous outcome, such as disease presence/absence, from continuous predictor variables. It expresses the log-odds as a linear combination of the predictors.
- Kaplan-Meier survival analysis estimates the survival function from lifetime data and handles censored observations, where the event was not observed for some subjects.


Special Regression Topics

Lecture 5. STT153A
Overview
In this lecture, we are going to discuss the following:
• Intrinsically Linear Models
• Logistic Regression
• Kaplan-Meier Survival Analysis
Intrinsically Linear Models
Reference: Mislick, G. K., & Nussbaum, D. A. (2015). Cost estimation: Methods and
tools. John Wiley & Sons.
Intrinsically Linear Models
What if the relationship between our variables is NOT linear?

• We will be using data transformations to “linearize” the model


Example
Transform the data:

• using a calculator (take the natural log of each value)
• or the Excel function =LN(value)

ln(Hours)   ln(Unit #)
 4.09434      1.60944
 3.80666      2.48491
 3.46574      3.55535
 3.21888      4.31749
 3.04452      4.82831
Since there is only one independent variable, use simple linear regression on the transformed data.

Hours   Unit Number   ln(Hours)   ln(Unit #)
 60          5          4.09434     1.60944
 45         12          3.80666     2.48491
 32         35          3.46574     3.55535
 25         75          3.21888     4.31749
 21        125          3.04452     4.82831

Enter the data in Statistica:
Statistics -> Multiple Regression -> Dependent: ln(Hours), Independent: ln(Unit #) -> OK -> Summary: Regression Results

Get the regression model of the transformed data.
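As an optional cross-check outside Statistica, here is a minimal Python sketch that applies the same log transformation and fits the simple linear regression; it assumes only the Hours / Unit Number values from the table above.

```python
# Minimal sketch: linearize the Hours vs. Unit Number data with natural logs,
# then fit a simple linear regression to the transformed values.
import numpy as np

hours = np.array([60, 45, 32, 25, 21], dtype=float)
units = np.array([5, 12, 35, 75, 125], dtype=float)

ln_hours = np.log(hours)   # matches the ln(Hours) column
ln_units = np.log(units)   # matches the ln(Unit #) column

# Least-squares fit of ln(Hours) = b0 + b1 * ln(Unit #)
b1, b0 = np.polyfit(ln_units, ln_hours, 1)
print(f"ln(Hours) = {b0:.4f} + {b1:.4f} * ln(Unit #)")

# Back-transforming gives the equivalent power (learning-curve) form:
# Hours = exp(b0) * Unit^b1
print(f"Hours = {np.exp(b0):.2f} * Unit^{b1:.4f}")
```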
Logistic Regression
Logistic regression
The basic difference between linear regression and logistic regression is
the dependent variable.
In linear regression the dependent variable is continuous (e.g., ice cream sales, GPA, household income).
In logistic regression the dependent variable is dichotomous, with two possible outcomes (e.g., has a disease = 1, does not have a disease = 0). The observed outcome is either 0 or 1 (0% or 100%), and logistic regression models the probability of the outcome, which takes values between 0 and 1.
This video is a good introduction comparing linear regression and logistic regression:
https://www.youtube.com/watch?v=C5268D9t9Ak (minutes 1 to 5)
The model gives the probability

P(y = 1 | x1, ..., xk) = 1 / (1 + e^-(b1·x1 + ... + bk·xk + a))

or it can be expressed in log-odds form as

ln( P / (1 - P) ) = b1·x1 + ... + bk·xk + a

Log Likelihood: the coefficients b1, ..., bk and a are estimated by maximizing the log likelihood of the observed data.
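As a quick numerical illustration (not from the lecture), the sketch below verifies that the probability form and the log-odds form agree; the intercept, slope, and predictor value are hypothetical.

```python
# Hypothetical intercept, slope, and predictor value, chosen only to show
# that the two forms of the logistic model describe the same thing.
import numpy as np

a, b = -1.0, 0.5
x = 2.0

p = 1.0 / (1.0 + np.exp(-(a + b * x)))   # probability form: P(y = 1 | x)
log_odds = np.log(p / (1.0 - p))         # log-odds form: ln(P / (1 - P))

print(p)                      # 0.5, the modeled probability
print(log_odds, a + b * x)    # both equal 0.0, i.e. a + b*x
```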
Example 1. Encode the data in Statistica:

Buys the Product   Salary   Age
Yes                 1500     33
No                  1200     33
No                  2200     34
No                  2100     42
Yes                 1500     29
Yes                 1700     19
No                  3000     50
No                  3000     55
Yes                 2800     31
Yes                 2900     46
No                  2750     36
No                  2550     48
Yes                 1200     24

Statistics -> Advanced Linear/Nonlinear Models -> Stepwise Model Builder -> Logistic Regression
Select Variables -> Dependent, Continuous, Categorical -> Bad code (Yes), Good code (No). This is because the model will predict the probability of the "bad code", and in this situation our interest is whether the customer will buy the product.
-> Full Sample -> Add variables (select all variables from the Marginal Results Table) -> Summary
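For readers without Statistica, here is a minimal sketch that fits the same logistic regression on the Example 1 data using Python's statsmodels; since it is not the Stepwise Model Builder, the estimates may differ slightly from those reported on the following slides.

```python
# Example 1 data re-entered by hand; Buys the Product coded Yes = 1, No = 0.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    "buys":   [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    "salary": [1500, 1200, 2200, 2100, 1500, 1700, 3000, 3000,
               2800, 2900, 2750, 2550, 1200],
    "age":    [33, 33, 34, 42, 29, 19, 50, 55, 31, 46, 36, 48, 24],
})

# Logistic regression of buys on salary and age
model = smf.logit("buys ~ salary + age", data=data).fit()
print(model.summary())        # coefficients and p-values
print(np.exp(model.params))   # odds ratios
```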
Example 1

The general model is

P(y = 1 | x1, ..., xk) = 1 / (1 + e^-(b1·x1 + ... + bk·xk + a))

and the fitted model, with x1 = Salary and x2 = Age, is

P(Buys the product = Yes | Salary, Age) = 1 / (1 + e^-(0.000777·x1 - 0.212893·x2))

Interpretation
Logistic regression analysis was performed to examine the influence of Salary and Age on the variable Buys the Product, predicting the value "Yes".

The coefficient of the variable Salary is b1 = 0.000777, which is positive. This means that an increase in Salary is associated with an increase in the probability that the dependent variable is "Yes". However, the p-value of 0.5837 indicates that this influence is not statistically significant. The odds ratio of exp(0.000777) = 1.000777 indicates that a one-unit increase in Salary multiplies the odds that the dependent variable is "Yes" by about 1.0008 (a 0.08% increase).

The coefficient of the variable Age is b2 = -0.212893, which is negative. This means that an increase in Age is associated with a decrease in the probability that the dependent variable is "Yes". However, the p-value of 0.094680 indicates that this influence is not statistically significant. The odds ratio of exp(-0.212893) = 0.808 indicates that a one-year increase in Age multiplies the odds that the dependent variable is "Yes" by about 0.81, i.e. decreases the odds by roughly 19%.
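A quick numeric check of these odds ratios, using only the coefficients reported above:

```python
# Odds ratios are exp(coefficient); the coefficient values come from the slide.
import numpy as np

b1_salary = 0.000777
b2_age = -0.212893

print(np.exp(b1_salary))  # ~1.0008: odds multiplied by about 1.0008 per unit of Salary
print(np.exp(b2_age))     # ~0.808:  odds multiplied by about 0.81 per year of Age (~19% lower)
```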
Example 2.
Open Data CHD Logistic
Reference: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression
Variables
Demographic:
• Sex: male or female (Nominal)
• Age: age of the patient (Continuous; although the recorded ages are truncated to whole numbers, age itself is continuous)
Behavioral:
• Current Smoker: whether or not the patient is a current smoker (Nominal)
• Cigs Per Day: the average number of cigarettes the person smoked per day (can be treated as continuous, since any number of cigarettes is possible, even half a cigarette)
Medical (history):
• BP Meds: whether or not the patient was on blood pressure medication (Nominal)
• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
• Diabetes: whether or not the patient had diabetes (Nominal)
Medical (current):
• Tot Chol: total cholesterol level (Continuous)
• Sys BP: systolic blood pressure (Continuous)
• Dia BP: diastolic blood pressure (Continuous)
• BMI: body mass index (Continuous)
• Heart Rate: heart rate (Continuous; although heart rate is technically discrete, it is treated as continuous because of the large number of possible values)
• Glucose: glucose level (Continuous)

Predicted variable (desired target):
• 10-year risk of coronary heart disease, CHD (binary: 1 means "Yes", 0 means "No")
Example 2. Variables
Mark the continuous and categorical variables, and exclude Education.
Bad code = 1 (Yes): 10-year risk of coronary heart disease (CHD).
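For those working outside Statistica, the sketch below fits the same six-predictor model with Python's statsmodels. The file name framingham.csv and the column names (TenYearCHD, age, cigsPerDay, totChol, sysBP, glucose, male, education) are assumptions based on the Kaggle dataset referenced above; adjust them to match your copy of the data.

```python
# Minimal sketch of the Example 2 model; file and column names are assumed
# from the Kaggle heart-disease dataset and may need to be adjusted.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("framingham.csv")

# Exclude the education column and drop rows with missing values
df = df.drop(columns=["education"]).dropna()

# TenYearCHD = 1 means "Yes" (the "bad code" the model predicts)
model = smf.logit(
    "TenYearCHD ~ age + cigsPerDay + totChol + sysBP + glucose + male",
    data=df,
).fit()
print(model.summary())
```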
Example 2.
Variable      Coefficient Estimate   Odds Ratio (exp(coefficient))
Intercept          -8.84912               0.000143508
age                 0.06590               1.068116101
cigsPerDay          0.01923               1.01941152
totChol             0.00227               1.002274682
sysBP               0.01753               1.017688966
glucose             0.00728               1.00730684
male               -0.28072               0.755237497
Scale               1.00000               2.718281828
Example 2.
Interpreting the results: Odds Ratio
• The fitted model shows that, holding all other features constant, the odds of getting diagnosed with heart disease for males (male = 1) relative to females (male = 0) is exp(-0.28072) = 0.755. In terms of percent change, the odds for males are about 100% - 76% = 24% lower than the odds for females.
• The coefficient for age says that, holding all else constant, there is about a 7% increase in the odds of getting diagnosed with CHD for a one-year increase in age, since exp(0.06590) = 1.068.
• Similarly, with every extra cigarette smoked per day there is about a 2% increase in the odds of CHD.
• For total cholesterol level and glucose level the change is very small.
• There is about a 1.7% increase in the odds for every unit increase in systolic blood pressure.
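The percent changes quoted above come from the odds ratios; a small sketch of the conversion, using the coefficients from the table:

```python
# Percent change in odds per one-unit increase, holding other variables fixed:
# percent change = (exp(coefficient) - 1) * 100
import numpy as np

coefficients = {
    "age": 0.06590, "cigsPerDay": 0.01923, "totChol": 0.00227,
    "sysBP": 0.01753, "glucose": 0.00728, "male": -0.28072,
}
for name, b in coefficients.items():
    print(f"{name}: {(np.exp(b) - 1) * 100:+.1f}% change in odds per unit")
```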
Conclusions
• In this fit, women appear to be more susceptible to heart disease than men. Increases in age, number of cigarettes smoked per day, and systolic blood pressure are also associated with increasing odds of heart disease.
• Total cholesterol shows no significant change in the odds of CHD. This could be due to the presence of 'good' cholesterol (HDL) in the total cholesterol reading. Glucose likewise causes only a negligible change in the odds (less than 1%).

Note that this interpretation follows the post on Kaggle. The data are not clean: there were 648 cases with NA values, which is not practical for analysis. After cleaning the data, the result for men and women is different.
