Basic Biostatistics

Haramaya University

College of Health and Medical Sciences

School of Public Health

Continuous Data Analysis

By Adisu Birhanu (Assistant prof. of Biostatistics)

Feb 2025
Session Objectives

Describe continuous variables and methods of analysis

Describe relationships between continuous variables

Interpret the outputs from linear regression models


Analysis of Continuous Data
A continuous variable is one which can take on infinitely many, uncountable possible values within a range of real numbers.

Data analysis methods such as scatter plots, line graphs and histograms are applicable for describing numerical data.

More advanced methods for inferential analysis of continuous data include correlation, the t-test, ANOVA and linear regression.
Comparison of means
The t-test is appropriate for comparing two means from two populations.

There are three different t-tests:

One sample t-test

Two independent sample t-test

Paired sample t-test

ANOVA is used when the independent variable has more than two groups.
One sample t-test

 It is used to compare a sample estimate with a hypothesized population mean, to see if the sample mean is significantly different.

 There is one group being compared against a standard value.
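
A minimal Stata sketch, assuming the infant data set has a variable weight (birth weight in grams) and a hypothesized population mean of 3000 g — both the variable name and the value are illustrative:

STATA CODE: ttest weight == 3000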

Independent two sample t-test

Used to compare the means of two unrelated or independent groups.

The groups come from two different populations (e.g., different people from two separate cities).

Hypothesis: Ho: Mean of group 1 = Mean of group 2
HA: Mean of group 1 ≠ Mean of group 2

Example
Research question: to test whether there is a significant difference in the birth weight of male and female infants → the independent t-test is appropriate.
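A minimal Stata sketch, assuming the data contain weight (birth weight) and a binary variable sex coding male/female (illustrative names):

STATA CODE: ttest weight, by(sex)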

Interpretation

The 95% confidence interval for the difference of means does not contain 0.

The p-value is less than 0.05.

Hence, we conclude that there is a significant difference in birth weight between male and female infants.

Paired t-test
 Compares means when each observation in one sample has one and only one pair in the other sample, so the two samples are dependent on each other.

 In this case the groups come from a single population (e.g., measuring before and after an experimental treatment); perform a paired t-test.

 Hypothesis: Ho: Mean difference = 0 vs HA: Mean difference ≠ 0
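
A minimal Stata sketch, assuming paired before/after measurements stored in hypothetical variables bp_before and bp_after:

STATA CODE: ttest bp_before == bp_after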

One way ANOVA (Analysis of Variance)
For two normal distributions, the two sample means are compared by the t-test.

When the means of more than two distributions need to be compared, ANOVA is used.

One way ANOVA…
The t-test methodology generalizes to the one-way analysis of variance (ANOVA) for categorical variables with more than two categories.

ANOVA does not tell you which group is different, only whether a difference exists.

To know which group is different, we use post hoc tests (Bonferroni, Tukey, Scheffé).
One way ANOVA…

For K means (K ≥ 3):

Ho: µ1 = µ2 = … = µK

HA: at least one of the means is different.

There is one factor of grouping (one-way ANOVA).
One way ANOVA…

Consider the infant data:

Outcome variable: birth weight

Factor variable: residence (urban = 1, semi-urban = 2, rural = 3)

Objective: compare weight among the three place categories
STATA CODE: oneway weight place

One way ANOVA…
We reject the null hypothesis (p-value < 0.05) and
we can conclude that at least one of the groups' means differs on body weight.

Now the question is: which groups are different?

Answering this question requires multiple comparisons (post hoc tests).

Bonferroni, Tukey and Scheffé are commonly used methods.

The Bonferroni method corrects the probability of a Type I error for the number of comparisons made.
Interpretation:
All pairwise comparisons are statistically significant at the 0.05 level: urban versus semi-urban, urban versus rural, and semi-urban versus rural.

STATA CODE: oneway weight place, bonferroni

Correlation

Correlation is used to quantify the degree to which two continuous random variables are related.

Common correlation measure:
Pearson correlation coefficient: for a linear relationship between two variables
Scatterplot
A helpful tool for exploring the relationship between two variables.
 If there is no relationship between the proposed explanatory and dependent variables, then fitting a linear regression model to the data probably will not provide a useful model.
 Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest.
 This does not necessarily imply that one variable causes the other, but that there is some significant association between the two variables.
[Figure: scatter plot of CD4 count versus age of patients]
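
A minimal Stata sketch for producing such a plot, assuming variables cd4 and age (illustrative names); the dependent variable is listed first:

STATA CODE: scatter cd4 age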
Correlation coefficient
A valuable numerical measure of the relationship between two variables.
A value between -1 and 1 indicating the strength of the linear relationship between two variables.
 The population correlation coefficient ρ (rho) measures the strength of the linear relationship between two variables.
 The sample correlation coefficient, r, is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations.
Correlation coefficient
Basic features of the sample and population correlation coefficients:
 Unit-free, ranging between -1 and 1

 The closer to -1, the stronger the negative linear relationship

 The closer to 1, the stronger the positive linear relationship

 The closer to 0, the weaker the linear relationship
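
A minimal Stata sketch for estimating r with a significance test, again assuming illustrative variables cd4 and age:

STATA CODE: pwcorr cd4 age, sig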


Coefficient of determination / R squared
The coefficient of determination is a measure of the strength of the model.
Variation in the dependent variable is split into two parts:
Variation in y = SSE + SSR
Sum of Squares Error (SSE):
 Measures the amount of variation in y that remains unexplained (i.e. due to error)
Sum of Squares Regression (SSR):
 Measures the amount of variation in y explained by variation in the independent variable x
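
In symbols, writing the total variation as SST = SSR + SSE, the coefficient of determination is R² = SSR/SST = 1 − SSE/SST.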
Coefficient of determination…
 The coefficient of determination does not have a critical value that enables us to draw conclusions.
 The higher the value of R squared, the better the model fits the data.
 If R² = 1, there is a perfect match between the line and the data points.
 If R² = 0, there is no linear relationship between x and y.
 It is a quantitative measure of how well the independent variables account for the outcome.
 When R² is multiplied by 100, it can be thought of as the percentage of the variance in the dependent variable explained by the independent variables.
Linear Regression
We frequently measure two or more variables on the same individual to explore the nature of the relationship among these variables.
Regression analysis is a predictive modelling technique which investigates the relationship between a dependent and an independent variable.
Questions to be answered:
What is the relationship between Y and X?
How can changes in Y be explained by changes in X?
Linear regression (#2)
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
 Explanatory variable (X): can be any type of variable

 Dependent variable: Y

 The dependent variable for linear regression should be numeric (continuous)
Linear regression (#3)
The goal of linear regression is to find the line that best predicts the dependent variable from the independent variables.

 Linear regression does this by finding the line that minimizes the sum of the squares of the vertical distances of the points from the line.
How does linear regression work?
Least-squares method (OLS)
 Calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
 If a point lies exactly on the fitted line, its vertical deviation is 0.
 The goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line.
Linear Regression Model
To understand linear regression, therefore, you must understand the model:
Y = intercept + slope × X + error = α + β·X + ε
When X equals 0, the equation gives Y = α.
The slope, β, is the change in Y for every unit change in X.
Epsilon (ε) represents random variability.
The simplest way to express the dependence of the expected response Yi on the predictor xi is to assume that it is a linear function, say E(Yi) = α + β·xi.

Constant or intercept (α):
 This parameter represents the expected response when xi = 0.

Slope (β):
 This parameter represents the expected increment in the response per unit change in xi.
 Note: Both α and β are population parameters which are usually unknown and hence estimated from the data by a and b.
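
A minimal Stata sketch of a simple linear regression, assuming an outcome weight and a single continuous predictor age (illustrative names); the output reports a, b, their standard errors and R squared:

STATA CODE: regress weight age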
Assumptions of linear regression
Linearity: the relationship between the independent and dependent variable is linear.
 To check this assumption we draw a scatter plot of the residuals against the y values.
 If the scatter plot follows a linear pattern (i.e. not a curvilinear pattern), the linearity assumption is met.
Linear Regression Assumptions
Normality (normally distributed error terms): the error terms follow the normal distribution. We can use qnorm and pnorm to check the normality of the residuals.

The Shapiro-Wilk test can also be used.
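
A minimal Stata sketch of these checks after fitting a model with regress; resid is a hypothetical name for the stored residuals:

STATA CODE:
predict resid, residuals
qnorm resid
pnorm resid
swilk resid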


Homoscedasticity of Residuals
Homoscedasticity: the variance of the error terms is constant.

It is about homogeneity of variance of the residuals.

If the model is well-fitted, there should be no pattern in the residuals plotted against the fitted values.

If the variance of the residuals is non-constant, the model is heteroscedastic.


Homoscedasticity…
The Breusch-Pagan test is used.
If the p-value < 0.05, reject the hypothesis that the variance is homogeneous.
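
A minimal Stata sketch: after regress, the Breusch-Pagan test and a residual-versus-fitted plot are available as postestimation commands:

STATA CODE:
estat hettest
rvfplot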
Multicollinearity
When there is a perfect linear relationship among the predictors, the estimates cannot be uniquely computed.
The term collinearity implies that two variables are near-perfect linear combinations of one another.
The regression model estimates of the coefficients become unstable.
The standard errors for the coefficients can get wildly inflated.
We can use the VIF or tolerance to check for multicollinearity.
Multicollinearity…
As a rule of thumb, a variable whose VIF is greater than 5 may need further investigation.

Tolerance, defined as 1/VIF, is used by many researchers to check the degree of collinearity.
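
A minimal Stata sketch: VIFs (and their reciprocals, 1/VIF) are reported by a postestimation command after regress:

STATA CODE: estat vif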
Multiple Linear Regression

Simple linear regression can be extended to multiple linear regression models:
Two or more independent variables, which could be categorical or continuous.
 The response variable is modelled as a function of k explanatory variables x1, x2, …, xk.
Its purposes are mainly:
 Prediction and explanation
 Adjusting for the effects of confounders
Multiple Linear Regression

Best fitting model:
 Minimizes the sum of squared residuals

 Residuals are the deviations between the observed response values and the values predicted by the fitted model
 The smaller the residuals, the closer the fitted line

 Note that the residuals ei are given by: ei = yi − ŷi (observed value minus fitted value)
Coefficients in multiple linear regression

The beta coefficient measures the amount of increase or decrease in the dependent variable for a one-unit difference in a continuous independent variable.
If an independent variable has a nominal scale with more than two categories:
 Dummy variables are needed
 Each dummy should be considered as an independent variable
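
A minimal Stata sketch of a multiple linear regression on the infant data, assuming illustrative variables weight, age, sex and place; the i. prefix asks Stata to create the dummy variables for a categorical predictor automatically:

STATA CODE: regress weight age i.sex i.place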
Assumptions: Specification of the model (model building)

Strategies to identify a subset of variables:

Option 1: Variable selection based on significance in univariable models (simple linear regression):
 All variables that show a significant effect in univariable models are included
 Commonly, a variable with a p-value of less than 0.25 is taken forward to the MLR model

Option 2: Variable selection based on significance in the multivariable model:
 Backward selection

 Stepwise selection

 Forward selection
Backward/stepwise/forward selection
Backward selection:
 All variables are entered into the model
 Variables are then removed step by step until only significantly contributing variables are left in the model
 The least contributing variable is removed first
 Then the second least contributor is removed, and so on
Forward selection:
 The model starts empty (the null model)
 The most significantly contributing variable enters first
 This continues step by step until only significantly contributing variables have entered the model
Stepwise selection:
 Same as forward selection, except that even if a variable is included in the model, its contribution is re-tested after the inclusion of other variables
 Variables are added but can subsequently be removed if they no longer contribute to the prediction
(see the Stata sketch after these descriptions)
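
A minimal Stata sketch, assuming the same illustrative variables; Stata's stepwise prefix performs backward selection with pr() (probability for removal) and forward selection with pe() (probability for entry):

STATA CODE:
stepwise, pr(0.05): regress weight age sex place
stepwise, pe(0.05): regress weight age sex place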
Option 3: Variable selection based on subject matter knowledge:
 The best way to select variables, as it is not data-driven and is therefore considered to yield unbiased results
Practical session for Multiple linear
regression using STATA
Thank you!!
