Medical Biostatistics 2
Georg Heinze
e-mail: [email protected]
Version 2009-12
1 INTRODUCTION TO STATISTICAL MODELING
  Statistical tests and statistical models
  What is a statistical test?
  What is a statistical model?
  Response or outcome variable
  Independent variable
  Representing a statistical test by a statistical model
  Uncertainty of a model
  Types of responses – types of models
  Univariate and multivariable models
  Multivariate models
  Purposes of multivariable models
  Confounding
  Effect modification
  Assumptions of various models
References
In its simplest setting, a statistical test compares the values of a variable between two
groups. Often we want to infer whether two groups of patients actually belong to the
same population. We specify a null hypothesis and reject it if the observed data provide
evidence against that hypothesis. For simplicity we restrict the hypothesis to
the comparison of means, as the mean is the most important and most obvious feature of
any distribution. If our patient groups belong to the same population, they should exhibit
the same mean. Thus, our null hypothesis states “the means in the two groups are equal”.
To perform the statistical test, we need two pieces of information for each patient: his/her
group membership, and his/her value of the variable to be compared. (And so far, it is of
no importance whether the variable we want to compare is a scale or a nominal variable.)
As an example, consider the rat diet example of the basic lecture. We tested the equality
of weight gains between the groups of high protein diet and low protein diet.
                          Unstandardized            Standardized
                          Coefficients              Coefficients
Model                     B          Std. Error     Beta       t        Sig.
1   (Constant)            153.115    8.745                     17.509   .000
    Body-mass-index       1.179      .326            .283      3.620    .001
    Age                   .756       .091            .648      8.293    .000
a. Dependent Variable: Cholesterol level
Comparing two patients of the same age who differ in their BMI by 1 kg/m², the
heavier person’s cholesterol level is on average 1.179 units higher than that of the
slimmer person.
and
Comparing two patients with the same BMI who differ in their age by one year, the
older person will on average have a cholesterol level 0.756 units higher than the younger
person.
The column labeled "Sig." informs us whether these coefficients can be assumed to be 0:
the p-values in that column refer to testing whether the corresponding regression
coefficients are zero. If they were actually zero, these variables would have no effect on
cholesterol, as can be demonstrated easily. Suppose the estimated equation were

Cholesterol = 180 + 0*BMI + 0*Age

In this equation, the cholesterol level is completely independent of BMI and age.
No matter which values we insert for BMI or Age, the cholesterol level will not change
from 180.
Summarizing, we can get more out of a statistical model than out of a
statistical test: not only do we test the hypothesis of ‘no relationship’, we also obtain an
estimate of the magnitude of the relationship, and even a prediction rule for cholesterol.
Statistical models, in their simplest form, and statistical tests are related to each other. We
can express any statistical test as a statistical model, in which the P-value obtained by
statistical testing is delivered as a ‘by-product’.
Response or outcome variable
In the rat diet example, the response variable is the weight gain.
Independent variable
The statistical model provides an equation to estimate values of the response variable from
one or several independent variables. The term ‘independent’ points to their role in the
model: their part is an active one, namely to explain differences in the response and not to
be explained themselves. In our example, these independent variables were BMI and age.
In the rat diet example, we consider the diet group (high or low protein) as independent
variable.
Representing a statistical test by a statistical model
Recall the rat diet example. We can represent the t-test which was applied to the data as a
linear regression of weight gain on diet group:

Weight gain = b0 + b1*D

where D=1 for the high protein group, and D=0 for the low protein group.
b0 is the mean weight gain in the low protein group (because for D=0, we have Weight
gain = b0 + b1*0).
b1 is the excess average weight gain in the high protein group, compared to the low
protein group, or, put another way, the difference in mean weight gain between the two
groups.
Clearly, if b1 is significantly different from zero, then the type of diet influences weight
gain. Let’s verify this by applying linear regression to the rat diet data:
                          Unstandardized            Standardized
                          Coefficients              Coefficients
Model                     B          Std. Error     Beta       t        Sig.
1   (Constant)            139.000    14.575                    9.537    .000
    Dietary group         -19.000    10.045          -.417     -1.891   .076
a. Dependent Variable: Weight gain (day 28 to 84)
For interpreting the coefficient corresponding to ‘Dietary group’, we must know how this
variable was coded. Actually, 1 was the code for the high protein group, and 2 for the low
protein group. Inserting the codes into the regression model we obtain
Weight gain = 139 – 19 = 120 for the high protein group and
Weight gain = 139 – 19*2 = 101 for the low protein group,
which exactly reproduces the means of weight gain in the two groups. The p-value
associated with Dietary group is exactly the p-value of a two-sample t-test.
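As a brief illustration outside SPSS, the following Python sketch shows the same equivalence. The weight-gain values are the classic high/low protein rat data from the statistical literature; they reproduce the group means (120 and 101) and the p-value (0.076) reported above, but the sketch is not part of the original SPSS analysis.

    # Sketch: a two-sample t-test and a simple linear regression on the group
    # code give the same p-value; coding 1 = high protein, 2 = low protein.
    import numpy as np
    from scipy import stats
    import statsmodels.api as sm

    high = np.array([134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123])
    low = np.array([70, 118, 101, 85, 107, 132, 94])

    t, p_ttest = stats.ttest_ind(high, low)           # classic pooled t-test

    gain = np.concatenate([high, low])                # response: weight gain
    group = np.array([1] * len(high) + [2] * len(low))
    fit = sm.OLS(gain, sm.add_constant(group)).fit()  # Weight gain = b0 + b1*group

    print(fit.params)                # approx. [139, -19], as in the SPSS table
    print(p_ttest, fit.pvalues[1])   # both p-values are approx. 0.076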
Other relationships exist for other statistical tests, e. g., the chi-square test has its
analogue in logistic regression, or the log-rank test for comparing survival data can be
expressed as a simple Cox regression model. Both will be demonstrated in later sessions.
Uncertainty of a model
Since a model is estimated from a sample of limited size, we cannot be sure that the
estimated values resemble exactly those of the underlying population. Therefore, it is
important that when reporting results we also state how precise our estimates are. This is
usually done by supplying confidence intervals in addition to point estimates.
Summarizing, there are two sources of uncertainty related to statistical models: one
source is due to limited sample sizes, and the other source due to limited ability of a
model’s structure to predict the outcome.
Types of responses – types of models
The type of response defines the type of model to use. For scale variables as responses,
we will most often use the linear regression model. For binary (nominal) outcomes, the
logistic regression model is the model of choice. (There are other models for binary data,
but with less appealing interpretability of results.) For survival outcomes (time to event
data), the Cox regression model is useful. For repeated measurements on scale outcomes,
the analysis of variance for repeated measurements can be applied.
Univariate and multivariable models
A univariate model is the translation of a simple statistical test into a statistical model:
there is one independent variable and one response variable. The independent variable
may be nominal, ordinal or scale.
A multivariable model uses more than one independent variable to explain the outcome
variable. Multivariable models can be used for various purposes; some of them are listed
in the next subsection but one.
Often, univariate (crude) and multivariable (adjusted) models are contrasted in one table,
as the following example (from a Cox regression analysis) shows [1]:
In the above table, we see substantial differences in the estimated effects for KLF5
expression, nodal status and tumor size, but not for differentiation grade. It was shown
that KLF5 expression is correlated with nodal status and tumor size, but not with
differentiation grade. Therefore, the univariate effect of differentiation grade does not
change at all by including KLF5 expression into the model. On the other hand, the effect
of KLF5 is reduced by about 40%, caused by the simultaneous consideration of nodal
status and tumor size.
In other examples, the reverse may occur; an effect may be insignificant in a univariate
model and only be confirmable statistically if another effect is considered
simultaneously:
As outlined earlier, the ‘effect’ of sex (2=female, 1=male) on cholesterol level could also
be demonstrated by applying a univariate linear regression model:
                          Unstandardized            Standardized
                          Coefficients              Coefficients
Model                     B          Std. Error     Beta       t        Sig.
1   (Constant)            209.698    5.525                     37.952   .000
    Sex                   2.802      3.457           .090      .811     .420
a. Dependent Variable: Cholesterol level
Both analyses (comparison of means and linear regression) yield the same result: mean
cholesterol level in females is about 2.8 units higher than mean cholesterol level in males.
The difference is not significant, as revealed by a t-test (or a univariate regression model)
with a p-value of 0.42.
Coefficients(a)
                          Unstandardized            Standardized
                          Coefficients              Coefficients
Model                     B          Std. Error     Beta       t        Sig.
1   (Constant)            175.729    12.749                    13.784   .000
    Weight (kg)           .378       .129            .339      2.928    .004
    Sex                   7.132      3.622           .228      1.969    .052
a. Dependent Variable: Cholesterol level
Now, the effect of sex on cholesterol is much more pronounced (comparing males and
females of equal weight, the difference is 7.132 units) and marginally significant
(P=0.052).
Multivariate models
A multivariate model is a model with several outcome variables explained by the same
set of independent variables. As an example, consider a study in which two different
statin products are compared in their ability to decrease cholesterol levels. Patients’
cholesterol levels are repeatedly assessed, beginning with a baseline examination before
the start of treatment, followed by examinations after three and six months of statin therapy. A
simultaneous evaluation of all these cholesterol measurements makes sense because the
repeated cholesterol levels will be correlated within a patient, and this correlation should
be taken into account.
[Diagram: the repeated cholesterol measurements (baseline, month 3, month 6) modeled jointly as outcomes.]
Purposes of multivariable models
• Defining a prediction rule of the outcome
• Adjusting effects for confounders
The typical situation for the first purpose is a set of candidate variables, from which some
will enter the final (best explaining) model. There are several strategies to identify such a
subset of variables:
• Option 1: univariate pre-selection: only variables that show a significant univariate
association with the outcome enter the multivariable model.
• Option 2: automated stepwise selection based on statistical significance.
• Option 3: backward elimination based on significance and on the change in the
remaining regression coefficients (‘change in B’) caused by dropping a variable.
• Option 4: variable selection based on substance matter knowledge: this is the best
way to select variables, as it is not data-driven and it is therefore considered as
yielding unbiased results.
o Pros: no bias
o Cons: not automated, needs some thinking, hard to justify that selection
was really made without looking at the data
The optimal choice of variable selection method has always been a matter of debate. The
first option should be avoided if possible. The second option should only be used in
conjunction with careful validation using resampling techniques. Among all ‘automatic’
selection procedures, the third one is currently state-of-the-art and should be applied. It
however needs specialized software (there is one implementation in SAS but not in
SPSS). The fourth option is generally preferred by statisticians (passing the buck to their
clinical partners).
A worked example
Consider cholesterol as outcome variable. The candidate predictors are: sex, age, BMI,
WHR (waist-hip-ratio), and sports (although this variable is ordinal, we treat it as a scale
variable here for simplicity).
While model 2 can be easily calculated by SPSS, model 1 needs hand selection after all
univariate models have been estimated and model 3 needs many side calculations.
Model 3 selected Sex, Age, BMI and WHR as predictors of cholesterol. Age and BMI
were selected based on their significance (P<0.1) in the multivariable model. On the other
hand, Sex was selected because dropping it from the model would cause the B of WHR
to change by -63%. Similarly, dropping WHR from the model would imply a change in B
of Sex by -44%. Therefore, both variables were left in the model. Dropping sports from
the model including all 5 variables will cause a change in B of BMI of +17%, and has
less impact on the other variables. Since sports was not significant (P=0.54) and the
maximum change in B was 17% (less than the pre-specified 20%), it was eliminated.
There are some typical situations (among others) in which multivariable modeling is used
to adjust an effect for confounders:
• to assess the treatment effect in a randomized trial, adjusting for known
  predictors (e. g., tumor stage, nodal status etc.)
• if in an observational study one wants to separate the effects of two variables
  which are correlated (e. g., type of medication and comorbidities)
How many independent variables can be included in a multivariable model? There are
some guidelines addressing this issue. First of all it should be discussed why it is
important to restrict the number of candidate variables. In the extreme case, the number
of variables equals the number of subjects. In this situation the results cannot be
generalized, as they only reflect the sample at hand.
[Figure: two scatter plots of cholesterol level (y-axis, 160 to 240) against body-mass-index (x-axis, 20 to 40), with the regression lines discussed below.]
The red line is a linear regression line based on data from the first two patients only.
Although the fit for these two patients is perfect, as confirmed by an R-Square of 1
(=100%), it is not transferable to the other patients. A regression line computed from
patients 3-83 yields substantially different results, with an R-Square of only 9%.
Typically, the results based on a small sample show a more extreme relationship than
would be obtained in a larger sample. Such results are termed ‘overfit’.
In general, using too many variables with too few independent subjects tends to over-
estimate relationships (as shown in the example above), and the results are unstable (i. e., they
change greatly by leaving out one subject or one variable from the model). As a rule of
thumb, there should be at least 10 subjects for each variable in the model (or for each
candidate variable when automated variable selection is applied). In logistic regression
models, this rule is further tightened: if there are n events and m non-events, then the
number of variables should not exceed min(n, m)/10, i.e., there should be at least 10
subjects of the rarer outcome category per variable. In Cox regression models for survival
data, the 10-subjects-per-variable rule applies to the number of deaths.
Confounding
Univariate models describe the crude relationship between a variable (let’s call it the
exposure for the time being; it could also be the treatment in a randomized trial) and an
outcome. Often the crude relationship may not only reflect the effect of the exposure, but
may also reflect the effect of an extraneous factor, a confounder, which is associated with
the exposure. A confounder is an extraneous factor that is:
• associated with the exposure in the source population,
• a determinant of the outcome, independent of the exposure, and
• not part of the causal pathway from the exposure to the outcome.
This implies that the crude measure of effect reflects a mixture of the effect of the
exposure and the effect of confounding factors. When confounding exists, analytical
methods must be used to separate the effect of the exposure from the effects of the
confounding factor(s). Multivariable modeling is one way to control confounding
(another way would be stratification, which is not considered here).
Effect modification
Effect modification means that the size of the effect of a variable depends on the level of
another variable. Presence of effect modification can be assessed by adding interaction
terms to a model:
                          Unstandardized            Standardized
                          Coefficients              Coefficients
Model                     B          Std. Error     Beta       t        Sig.
1   (Constant)            218.519    27.718                    7.884    .000
    Age                   -.805      .636            -.691     -1.266   .209
    Body-mass-index       -1.712     1.208           -.411     -1.417   .160
    bmiage                .069       .028            1.538     2.478    .015
a. Dependent Variable: Cholesterol level
Here, Bmiage = Age*BMI. The effect of body mass index on cholesterol is modified by
age; we have different effects of BMI on cholesterol at different ages.
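A sketch of how such an interaction term can be specified in Python (simulated data; the variable names chol, age and bmi are placeholders, not the course data set):

    # Sketch (simulated data): effect modification is modeled by adding the
    # product term Age*BMI, which statsmodels writes as age:bmi.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 200
    df = pd.DataFrame({"age": rng.uniform(20, 80, n),
                       "bmi": rng.uniform(20, 40, n)})
    df["chol"] = (150 + 0.5 * df["age"] + 1.0 * df["bmi"]
                  + 0.05 * df["age"] * df["bmi"] + rng.normal(0, 15, n))

    # 'age * bmi' expands to age + bmi + age:bmi (main effects plus interaction)
    fit = smf.ols("chol ~ age * bmi", data=df).fit()
    print(fit.summary())   # the age:bmi row plays the role of 'bmiage' above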
Significant and relevant effect modification indicates the use of subgroup analyses
(separate models for patients divided into groups defined by the effect modifier). In our
example, we would divide the patients into young, middle-aged and old subjects and
present separate (univariate) regression models explaining cholesterol by BMI.
[Figure: scatter plots of cholesterol level (y-axis, 175 to 250) against body-mass-index, shown separately by age group (e.g., agegroup = 60-80).]
Assumptions of various models
Various assumptions underlie statistical models. Some of them are common to all
models, some are specific to linear or Cox regression.
• Linearity:
Consider the regression equation Cholesterol=b0+b1*age+b2*BMI. Both
independent variables age and BMI have by default a linear effect on cholesterol:
comparing two patients of age 30 and 31 leads to the same difference in
cholesterol as a comparison of two patients aged 60 and 61. The linearity
assumption can be relaxed by including quadratic and cubic terms for scale
variables, as was demonstrated in the basic course.
• Constant variance of residuals:
A plot of the residuals (e. g., against the predicted values) should not show any increase
or decrease of the spread of the residuals.
• Residuals are uncorrelated with each other:
This assumption could be violated if subjects were not sampled independently,
but were recruited in clusters. If the assumption of independence is violated, we
must account for the clustering by including so-called random effects into the
model. A random effect (as the opposite of a fixed effect) is not of interest per se;
it rather serves to adjust for the dependency of observations within a cluster.
• Residuals are uncorrelated with independent variables:
If a scatter plot of residuals versus an independent variable shows some
systematic dependency, it could be a consequence of a violation of the linearity
assumption, or it might also indicate a misspecification, e. g., the constant has
been omitted.
As the validity of a model’s results depends crucially on the validity of the model
assumptions, the estimation of statistical models should always be followed by a careful
investigation of the model assumptions.
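A sketch of such a residual check in Python (simulated data, not the course data):

    # Sketch: after fitting a linear model, plot residuals against fitted values
    # (constant spread?) and against an independent variable (no pattern?).
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 150
    df = pd.DataFrame({"age": rng.uniform(20, 80, n),
                       "bmi": rng.uniform(20, 40, n)})
    df["chol"] = 160 + 0.6 * df["age"] + 1.2 * df["bmi"] + rng.normal(0, 12, n)

    fit = smf.ols("chol ~ age + bmi", data=df).fit()

    fig, ax = plt.subplots(1, 2, figsize=(8, 3))
    ax[0].scatter(fit.fittedvalues, fit.resid)   # homoscedasticity check
    ax[0].set(xlabel="fitted values", ylabel="residuals")
    ax[1].scatter(df["age"], fit.resid)          # linearity / specification check
    ax[1].set(xlabel="age", ylabel="residuals")
    plt.tight_layout()
    plt.show()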
                Pneumoconiosis
                Present    Absent    Total
FEV1 < 80%      22         6         28
FEV1 > 80%      5          7         12
Total           27         13        40
In general, the test results can be cross-classified against the true disease status as
follows (TP = true positives, FP = false positives, FN = false negatives, TN = true negatives):

                Disease
Test result     Present    Absent    Total
Positive        TP         FP        TP+FP
Negative        FN         TN        FN+TN
Total           TP+FN      FP+TN     N
Assume that FEV-1 should be used to assess pneumoconiosis status. In order to quantify
its ability to detect pneumoconiosis, the following diagnostic measures are useful:
• Sensitivity (Se): the probability of a positive test given the disease is present.
  Se = TP/(TP+FN)
• Specificity (Sp): the probability of a negative test given the disease is absent.
  Sp = TN/(FP+TN)
• Accuracy (Ac): the probability of a correct test result.
  Ac = (TP+TN)/(TP+FP+FN+TN)
Sensitivity: 81.5%
Specificity: 53.8%
From our sample of mine workers, we estimate a pretest probability of the disease as
27/40=67.5%. Now assume that a mine worker’s FEV-1 is measured, and it falls below
80% of the reference value. How does this test result affect our pretest probability? We
can quantify the posttest probability (positive predictive value) as 78.6%. Generally, it is
defined as PPV = TP/(TP+FP).
The ability of a positive test result to change our prior (pretest) assessment is quantified
by the positive likelihood ratio (PLR). It is defined as the ratio of posttest odds and
pretest odds. Odds are another way to express probabilities: the odds of an
event with probability p are given by p/(1 – p). In terms of sensitivity and specificity,
• PLR = Se / (1 – Sp)
In our example, the positive likelihood ratio is thus 0.815/(1 – 0.538) = 1.764. This
means that a positive test result increases the odds for presence of disease by a factor of
1.764.
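The measures for this first mine can be reproduced with a small Python sketch:

    # Sketch: diagnostic measures for the 2x2 table above
    # (TP = 22, FN = 5, FP = 6, TN = 7).
    TP, FN, FP, TN = 22, 5, 6, 7

    Se = TP / (TP + FN)                          # sensitivity, 0.815
    Sp = TN / (FP + TN)                          # specificity, 0.538
    Ac = (TP + TN) / (TP + FP + FN + TN)         # accuracy
    pretest = (TP + FN) / (TP + FP + FN + TN)    # pretest probability, 27/40 = 0.675
    PPV = TP / (TP + FP)                         # posttest probability, 0.786
    PLR = Se / (1 - Sp)                          # positive likelihood ratio, 1.764

    print(round(Se, 3), round(Sp, 3), round(PPV, 3), round(PLR, 3))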
Assume, we investigate FEV-1 in workers of a different mine, and obtain the following
sample:
                Pneumoconiosis
                Present    Absent    Total
FEV1 < 80%      22         60        82
FEV1 > 80%      5          70        75
Total           27         130       157
The key measures characterizing the performance of the diagnostic test calculate as
follows: sensitivity 22/27 = 81.5%, specificity 70/130 = 53.8%, positive likelihood ratio
0.815/(1 – 0.538) = 1.764, and positive predictive value 22/82 = 26.8%.
We see that the positive likelihood ratio is unchanged. It is independent of the pretest
probability (or prevalence). In other words, a positive test result still increases the
odds of presence of disease by a factor of 1.764. However, since we start at a pretest
probability of 17.2%, this increase results in a lower value for the posttest probability
than before.
Similarly, we have
• the negative predictive value, NPV = TN/(TN+FN), and
• the negative likelihood ratio, NLR = Sp / (1 – Se),
the latter expressing the increase of the odds of absence of disease caused by a negative
test result.
In the example given above, we chose a cut-off value of 80% of reference as defining a
positive or negative test result. Selecting different cut-off values would change sensitivity
and specificity of the diagnostic test. Sensitivity and specificity resulting from various
cut-off values can be plotted in a so-called receiver operating characteristic (ROC) curve.
[Figure: ROC curve, plotting sensitivity (y-axis) against 1 - Specificity (x-axis), both ranging from 0.0 to 1.0.]
Note that on the x-axis, by convention 1-Specificity is plotted. A global criterion for a
test is the area under the ROC curve, often denoted as the c-index. Generally, this value
falls into the range 0 to 1. It can be interpreted as the probability that a randomly chosen
diseased worker has a lower FEV-1 value than a randomly chosen healthy worker.
Clearly, if the c-index is 0.5, it means that the healthy or the diseased worker may have a
higher FEV-1 value, or, put another way, that the test is meaningless. This is expressed
by the diagonal line in the ROC curve: the area under this line is exactly 0.5, and if the
ROC curve of a test more or less follows the diagonal, such a test would be meaningless
in detecting the disease. A common threshold value for the c-index to denote a test as
“useful” is 0.8. Our FEV-1 test has a c-index of 0.789, which is marginally below the
threshold value of 0.8. Because it is based on a very small sample, it is useful to state a
95% confidence interval for the index, which is given by [0.647, 0.931]. Since 0.5 is
outside this interval, we can confirm some association of the test with presence of disease.
However, our data are compatible with c-indices as low as about 0.65, meaning that we
cannot really establish the usefulness of the test to detect pneumoconiosis in mine workers.
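Computation of a ROC curve and the c-index can be sketched in Python as follows; the individual FEV-1 measurements are not reported in the text, so the values below are made up for illustration only:

    # Sketch (made-up FEV-1 values): ROC curve and c-index with scikit-learn.
    # Low FEV-1 indicates disease, so the negated value serves as the score.
    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    rng = np.random.default_rng(3)
    disease = np.concatenate([np.ones(27, dtype=int), np.zeros(13, dtype=int)])
    fev1 = np.concatenate([rng.normal(70, 15, 27), rng.normal(90, 15, 13)])

    cindex = roc_auc_score(disease, -fev1)
    fpr, tpr, thresholds = roc_curve(disease, -fev1)   # 1-Sp, Se per cut-off
    print(round(cindex, 3))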
Another criterion for selecting a cut-off value is the distance D of the corresponding point
on the ROC curve from the ideal upper left corner (Se = 1, Sp = 1),

D = sqrt((1 – Se)² + (1 – Sp)²),

with the best cut-off minimizing D. A graph plotting D against various cut-off values can
be used to validate the identified cut-off level:
[Figure: D plotted against cut-off values from 0 to 140; the y-axis ranges from 0 to 1.2.]
Here we see that the “best” cut-off value is indeed 80. The inverse peak at a cut-off value
of 80 underlines the uniqueness of that value.
Both approaches outlined above put the same weight on a high sensitivity and a high
specificity. However, sometimes it is more useful to attain a certain minimum level of
sensitivity, because it may be more harmful or costly to overlook presence of disease than
to falsely diagnose the disease in a healthy person. In such cases, one would consider
only such values as cut-points where the sensitivity is at least 95% (or 99%), and select
that value that maximizes the specificity.
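Continuing the ROC sketch from above, the two cut-off strategies could look like this (note that the thresholds returned by roc_curve refer to the negated FEV-1 score used there):

    # Sketch: choose a cut-off by minimizing D, or by demanding Se >= 0.95
    # and then maximizing specificity.
    import numpy as np

    D = np.sqrt((1 - tpr) ** 2 + fpr ** 2)       # since 1 - Sp equals fpr
    cut_equal_weight = thresholds[np.argmin(D)]

    ok = tpr >= 0.95                             # minimum sensitivity of 95% ...
    cut_constrained = thresholds[ok][np.argmax(1 - fpr[ok])]   # ... then best specificity
    print(cut_equal_weight, cut_constrained)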
ROC curves can also be used to compare diagnostic markers. A test A is preferable to
a test B if the ROC curve of A is always above the ROC curve of B.
The following example [4] is a prospective study which compares the incidences of
dyskinesia after ropinirole (ROP) or levodopa (LD) in patients with early Parkinson’s
disease. The results show that 17 of 179 patients who took ropinirole and 23 of 89
who took levodopa developed dyskinesia. The data are summarized in the following
table:
                Presence of dyskinesia
Group           Yes       No        Total
Levodopa        23        66        89
Ropinirole      17        162       179
The risk of having dyskinesia among patients who took LD is 23/89 = 0.258, whereas the
risk of developing dyskinesia among patients who took ROP is 17/179 = 0.095.
Therefore, the absolute risk reduction is ARR=0.258-0.095=0.163. Since ARR is a point
estimate, it is desirable to have an interval estimate as well which reflects the uncertainty
in the point estimate due to limited sample size. A 95% confidence interval can be
obtained by a simple normal approximation by first computing the variance of ARR. The
standard error of ARR is then simply the square root of the variance. Adding +/-1.96
times the standard error to the ARR point estimate yields a 95% confidence interval. To
compute the variance of the ARR, let’s first consider variances for the risk estimates in
both groups. These calculate as risk(1-risk)/N.
Summarizing, we have ARR = 0.163 with a standard error of about 0.051 and a 95%
confidence interval of approximately 0.063 to 0.264.
A related measure is the number needed to treat, NNT = 1/ARR. The NNT is interpreted as
the number of patients who must be treated in order to expect one additional patient to
benefit (here: one additional case of dyskinesia prevented). The larger the NNT, the less
useful is the treatment.
A 95% confidence interval for NNT can be obtained by taking the reciprocals of the
confidence limits of ARR. In our example, NNT = 1/0.163 ≈ 6.1 with a 95% confidence
interval of approximately 3.8 to 15.9.
Note: if ARR is close to 0, the confidence interval for NNT such obtained may not include
the point estimate. This is due to the singularity of NNT in case of ARR=0: in this
situation NNT is actually infinite. For illustration, consider an example where ARR (95%
C.I.) is 0.1 (-0.05, 0.25). The NNT (95% C.I.) would be calculated as 10 (-20, 4). The
confidence interval does not contain the point estimate. However, this confidence interval
is not correctly calculated. In case that the confidence interval of ARR covers the value 0,
the confidence interval of NNT must be redefined as (-20 to -∞, 4 to ∞). Thus it contains
all values between -20 and -∞, and at the same time all values between 4 and infinity.
This can be proven empirically by computing the NNT for some ARR values inside the
confidence interval, say for -0.03, -0.01, +0.05 and +0.15; we would end up with NNT
values of -33, -100, +20 and +6.7, which are all inside the redefined interval but not in the
original interval.
ARR is an absolute measure to compare the risk between two groups. Thus it reflects the
underlying risk without treatment (or with standard treatment) and has a clear
interpretation for the practitioner.
The next two popular measures are the relative risk (RR) and the relative risk reduction
(RRR). The relative risk is the ratio of risks of the treated group and the control group,
and also called the risk ratio. The relative risk reduction is derived from the relative risk
by subtracting it from one, which is the same as the ratio between the ARR and the risk in
the control group. A 95% confidence interval for RR can be obtained by first calculating
the standard error of the log of RR, then computing a confidence interval for log(RR),
and then taking the antilog to obtain a confidence interval for RR. In our example,
RR = 0.095 / 0.258 = 0.368 and RRR = 1 – 0.368 = 0.632.
These numbers are interpreted as follows: the risk of developing dyskinesia after
treatment by ROP is only 0.368 times the risk of developing dyskinesia after treatment by
LD. This means, the risk of developing dyskinesia is reduced by 63.2% if treatment ROP
is applied.
One disadvantage of RR is that its value can be the same for very different clinical
situations. For example, a RR of 0.167 would be the outcome for both of the following
clinical situations: 1) when the risks for the treated and control groups are 0.05 and
0.3, respectively; and 2) when the risks are 0.14 for the treated group and 0.84 for the
control group. RR is clear on a proportional scale, but has no real meaning on an absolute
scale. Therefore, it is generally more meaningful to use relative effect measures
for summarizing the evidence and absolute measures for application to a concrete clinical
or public health situation [2].
The odds ratio (OR) is a commonly used measure of the size of an effect and may be
reported in case control studies, cohort studies, or clinical trials. It can also be used in
retrospective studies and cross-sectional studies, where the goal is to look at associations
rather than differences.
The odds can be interpreted as the number of events relative to the number of nonevents.
The odds ratio is the ratio between the odds of the treated group and the odds of the
control group.
Both odds and odds ratios are dimensionless. An odds ratio less than 1 means that the
odds have decreased, and similarly, an OR greater than 1 means that the odds have
increased.
It should be noted that ORs are hard to comprehend [3] and are frequently interpreted as
an approximate relative risk. Although the odds ratio is close to the relative risk when the
outcome is relatively uncommon [2] as assumed in case-control studies, there is a
recognized problem that odds ratios do not give a good approximation of the relative risk
when the control group risk is “high”. Furthermore, an odds ratio will always exaggerate
the size of the effect compared to a relative risk. When the OR is less than 1, it is smaller
than the RR, and when it is greater than 1, the OR exceeds the RR. However, the
interpretation will not, generally, be influenced by this discrepancy, because the
discrepancy is large only for large positive or negative effect size, in which case the
qualitative conclusion will remain unchanged. The odds ratio is the only valid measure of
association regardless of whether the study design is follow-up, case-control, or cross
sectional. Risks or relative risks can be estimated only in follow-up designs.
The great advantage of odds ratios is that they are the result of logistic regression, which
allows adjusting effects for imbalances in important covariates.
Consider the general case where we have a table of the following structure:
                Disease
Group           Present    Absent    Total
Control         A          B         A+B
Treated         C          D         C+D
Total           A+C        B+D       N = A+B+C+D
The following describes the calculation of the measures and the associated 95%
confidence intervals:
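A Python sketch of these calculations, using the standard normal-approximation formulas and the dyskinesia data as input (this mirrors the kind of computation the spreadsheet mentioned below performs, but is not taken from it):

    # Sketch: ARR, NNT, RR and OR with normal-approximation 95% confidence
    # intervals, for a table with control row (A, B) and treated row (C, D).
    import math

    A, B = 23, 66      # control (levodopa): events, non-events
    C, D = 17, 162     # treated (ropinirole): events, non-events
    p_c, p_t = A / (A + B), C / (C + D)

    ARR = p_c - p_t
    se_arr = math.sqrt(p_c * (1 - p_c) / (A + B) + p_t * (1 - p_t) / (C + D))
    ci_arr = (ARR - 1.96 * se_arr, ARR + 1.96 * se_arr)
    NNT = 1 / ARR

    RR = p_t / p_c
    se_log_rr = math.sqrt(1/C - 1/(C + D) + 1/A - 1/(A + B))
    ci_rr = tuple(math.exp(math.log(RR) + z * se_log_rr) for z in (-1.96, 1.96))

    OR = (C / D) / (A / B)
    se_log_or = math.sqrt(1/A + 1/B + 1/C + 1/D)
    ci_or = tuple(math.exp(math.log(OR) + z * se_log_or) for z in (-1.96, 1.96))

    print(ARR, ci_arr, NNT)
    print(RR, ci_rr)
    print(OR, ci_or)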
Estimation of all the risk measures presented in this section and computation of 95%
confidence intervals is facilitated by the Excel application “RiskEstimates.xls” which is
available at the author’s homepage
http://www.meduniwien.ac.at/user/georg.heinze/RiskEstimates.xls
In logistic regression, the association between an independent variable and a binary
dependent variable is modeled by the equation

Log(Pr(Y=1)/Pr(Y=0)) = b0 + b1X

where X and Y denote the independent and the binary dependent variable, respectively.
This equation describes the association of X with the probability that Y assumes the value 1.
The right-hand side is linear in X; the left-hand side is the log odds (logit) of Y=1, and
solving for Pr(Y=1) yields the so-called logistic function (hence the name “logistic regression”).
The expression
Pr(Y=1)/Pr(Y=0)
is equal to the odds of Y=1, such that we are actually modeling the log odds by a linear
model. Thus, the regression coefficient b1 has the following meaning: it is the change in
the log odds of Y=1 associated with a one-unit increase of X.
Since a change in odds is called an odds ratio, we can directly compute odds ratios from
the regression coefficients which are given in the output of any statistical software
package for logistic regression. These odds ratios refer to a comparison of two subjects
differing in X by one unit.
For b0=0 and b1=1 (dashed line) or b1=2 (solid line), the logistic equation yields:
[Figure: the logistic curve for b0=0, plotted for X from -8 to 8, with Pr(Y=1) on the y-axis (0 to 1); dashed line: b1=1, solid line: b1=2.]
The higher the value of b1, the steeper is the slope of the curve. In the extreme case of
b1=0, the curve will be a flat line. Values of b0 different from 0 will shift the curve to the
left (for positive b0) or to the right (for negative b0). Negative values of b1 will mirror
the curve: it will fall from the upper left corner to the lower right corner of the panel.
By estimating the curve parameters b0 and b1, we can quantify the association of an
independent variable X with a binary outcome variable Y. The regression parameter b1
has a very intuitive meaning: it is simply the log of the odds ratio associated with a one-
unit increase of X. Put another way, exp(b1) is the factor by which the odds for an event
(Y=1) change if X is increased by 1.
Now assume that X is not a scale variable, but a dichotomous factor itself. It could be an
indicator of treatment, for instance X=1 defines the new treatment, and X=0 the standard
therapy. Of course, the curve will now reduce to two points, i. e. the probability of an
event in group X=1 and the probability of an event in group X=0. Estimating these two
probabilities by means of logistic regression will exactly yield the relative frequencies of
events in these two groups. So, logistic regression can be used for analysis of a two-by-
two table, yielding relative frequencies and an odds ratio, but it can also be extended to
scale variables, and one can even mix both in one model.
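A minimal Python sketch of such a model, with simulated data (the birth-weight data used below are not reproduced here), showing how odds ratios are read off a fitted logistic model as exp(B):

    # Sketch (simulated data): logistic regression with a scale and a binary
    # covariate; exp(coefficient) gives the odds ratio per unit increase.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(4)
    n = 189
    df = pd.DataFrame({"age_decade": rng.uniform(1.5, 4.5, n),
                       "smoke": rng.integers(0, 2, n)})
    linpred = 0.5 - 0.5 * df["age_decade"] + 0.7 * df["smoke"]
    df["low"] = rng.binomial(1, 1 / (1 + np.exp(-linpred)))

    fit = smf.logit("low ~ age_decade + smoke", data=df).fit()
    print(np.exp(fit.params))       # odds ratios, the Exp(B) column
    print(np.exp(fit.conf_int()))   # 95% confidence intervals for the odds ratios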
The following two examples are based on the same study, where the aim was to identify
risk factors for low birth weight (lower than 2500 grams) [5]. 189 newborn babies were
included in the study, 59 of them had low birth weight.
Let’s first consider age of the mother as independent variable, a scale variable. For
convenience, the age of the mother is expressed as decade, such that odds ratio estimates
refer to a 10-year change in age instead of a 1-year change.
The results of the logistic regression analysis using SPSS are given by the following table:
There are two ‘variables’ in the model: age_decade and Constant. The column labeled B
contains the regression coefficient estimates, from which the regression equation can be read off.
We cannot learn much from these coefficients unless we take a look at the column Exp(B),
which contains the odds ratio estimate for Age_Decade: 0.6, with a 95% confidence
interval of [0.32, 1.11]. This means that the odds of low birth weight decrease to 0.6
times their value with every decade of mother’s age. Put another way, each decade reduces
the odds of low birth weight by 40% (1-0.6, analogous to the formula for relative risk
reduction).
However, we see that the confidence interval contains the value 1 which would mean that
mother’s age has absolutely no influence on low birth weight. With our data, we cannot
rule out that situation. A 95% confidence interval containing the null hypothesis value is
always accompanied by an insignificant p-value; here it is 0.105, which is clearly above
the commonly accepted significance level of 0.05.
Despite the non-significant result, let’s have a look at the estimated regression curve:
[Figure: estimated probability of low birth weight (y-axis, 0.0 to 0.6) plotted against mother’s age (x-axis, 20 to 40 years).]
Smoking Status During Pregnancy (1/0) * Low Birth Weight (<2500g) Crosstabulation
We see that half of the mothers of low weight babies were smoking, but only one third of
the mothers of normal weight babies. Analysis by logistic regression yields:
The odds ratio corresponding to smoking (95% confidence interval) is 2 (1.1, 3.8). Thus,
smoking during pregnancy is a risk factor for low birth weight. (The very same result is
obtained if the data of the contingency table given above are entered into
RiskEstimates.xls.)
Using multiple logistic regression, it is now possible to obtain not only crude effects of
variables, but also adjusted effects. The following covariables are available: mother’s age
(AGE), her last weight (LWT), smoking during pregnancy (SMOKE), history of premature
labor (PTL), and hypertension (HT).
Let’s fit a multivariable logistic regression model. The analysis is done in four steps:
Ad 1: We have 59 events (cases of low birth weight), and 130 nonevents (cases of normal
birth weight). The number of covariates is 5. Since 5<59/10, we are allowed to fit the
model.
Column label   Contents
B              Estimated regression coefficients
S.E.           Their standard errors
Wald           Wald chi-squared statistic, computed as (B/S.E.)²
df             Degrees of freedom
Sig.           Two-sided p-value for testing the hypothesis B=0
Exp(B)         Estimated odds ratio referring to a unit increase in the variable, computed as exp(B)
Lower          Lower 95% confidence limit for the odds ratio, computed as exp(B – 1.96·S.E.)
Upper          Upper 95% confidence limit for the odds ratio, computed as exp(B + 1.96·S.E.)
Exercise: Try to figure out of that table which variables affect the outcome (low birth
weight), and in which way they do!
The last line contains the estimate for the constant, which was denoted as b0 in the
outline of simple logistic regression. The most important columns are the odds ratio
estimates, the confidence limits and the P-value. We learn that last weight, history of
premature labor and hypertension are independent risk factors for low birth weight.
SPSS outputs some other tables which are useful to interpret results:
                 Chi-square    df    Sig.
Step 1   Step    23.344        5     .000
         Block   23.344        5     .000
         Model   23.344        5     .000
This table contains a test for the hypothesis that all regression coefficients related to
covariates are zero (equivalent to: all odds ratios are one). SPSS performs the estimation
in Steps and Blocks, which are only of relevance, if automated variable selection is
applied (which is not the case here). The result of the test is “P<0.001” which means that
the null hypothesis of no effect at all is implausible.
The model summary provides two Pseudo-R-Square measures, which yield quite the
same result: about 11-16% of the variation in birth weight (depending on the way of
calculation) can be explained by our five predictors.
First, let’s look at the regression equation, which can be extracted from the regression
coefficients of the first output table:
Log odds(low birth weight) = 1.67 - 0.46 AGE - 0.015 LWT + 0.559 SMOKE + 0.69 PTL + 1.771 HT

Solving this equation for the probability of low birth weight gives

Pr(low birth weight) = 1 / (1 + exp(-1.67 + 0.46 AGE + 0.015 LWT - 0.559 SMOKE - 0.69 PTL - 1.771 HT))
Thus, we can predict the probability of low birth weight for each individual in the
sample. These predictions can be used to assess the model fit, which is done by the
Hosmer and Lemeshow Test:
This test mainly tests the hypothesis that important predictors are still missing from the
regression equation. In our case it is not significant, indicating adequacy of the model.
How’s it done? The subjects of the sample are categorized in deciles corresponding to
their predicted probabilities, and then the number of observed events (cases of low birth
weight) in each decile is compared to the number expected from the predicted
probabilities (i. e., the sum of predicted probabilities).
If the expected and observed numbers differ by more than what can be expected from
random variation, ‘lack of fit’ is still present, meaning that important predictors are
missing from the model. Such effects could be
• other variables explaining the outcome,
• nonlinear effects of continuous variables (e. g., of AGE or LWT), or
• interactions of variables (e. g., smoking in combination with hypertension could
  be worse than just the sum of the main effects of smoking and hypertension).
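A sketch of how such a Hosmer-Lemeshow-type statistic can be computed from predicted probabilities and observed outcomes (a simplified version of the test, for illustration only):

    # Sketch: group subjects into deciles of predicted risk and compare observed
    # with expected numbers of events; large discrepancies indicate lack of fit.
    import numpy as np
    from scipy.stats import chi2

    def hosmer_lemeshow(y, p_hat, g=10):
        order = np.argsort(p_hat)
        y = np.asarray(y)[order]
        p_hat = np.asarray(p_hat)[order]
        stat = 0.0
        for idx in np.array_split(np.arange(len(y)), g):   # risk deciles
            observed = y[idx].sum()
            expected = p_hat[idx].sum()
            n, pbar = len(idx), p_hat[idx].mean()
            stat += (observed - expected) ** 2 / (n * pbar * (1 - pbar))
        return stat, chi2.sf(stat, g - 2)                  # statistic, p-value (g-2 df)

    # e.g. hosmer_lemeshow(df["low"], fit.predict(df)) for the model sketched earlier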
Classification Table(a)
                                                Predicted
                                                Low Birth Weight (<2500g)    Percentage
Observed                                        0            1               Correct
Step 1   Low Birth Weight (<2500g)       0      121          9               93.1
                                         1      45           14              23.7
         Overall Percentage                                                  71.4
a. The cut value is .500
Here, subjects are classified according to their predicted probability for low birth weight,
with predicted probabilities above 0.5 defining the ‘high risk group’, for which we would
predict low birth weight. We see that overall 71.4% can be classified correctly.
Another way to assess model fit is to use not only one cut value, but all possible values,
constructing a ROC curve (with the predicted probabilities as ‘test’ variable, and the
outcome as ‘state’ variable):
Comparing two randomly chosen subjects with different outcomes, our model assigns
a higher risk score (predicted probability) to the subject with the unfavorable outcome with
72.3% probability; this value is the c-index of the model.
Clearly, if the c-index is 0.5, the model cannot predict anything. By contrast, a c-index
close to 1.0 indicates a perfect model fit.
In case-control studies, the variation of the outcome is set by the design of the study; it is
simply the proportion of cases among all subjects.
In cohort studies, the variation of the outcome reflects its prevalence in the study
population.
References
[1] Tong D, et al. Expression of KLF5 is a prognostic factor for disease-free survival and
overall survival in patients with breast cancer. Clin Cancer Res 2006; 12(8):2442-8.
[2] Egger M. Meta-analysis: principles and procedures. BMJ 1997; 315:1533-7.
[3] Davies HTO, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ 1998;
316:989-91.
[4] Rascol O, Brooks D, Korczyn AD, et al. A five-year study of the incidence of
dyskinesia in patients with early Parkinson’s disease who were treated with ropinirole or
levodopa. N Engl J Med 2000; 342:1484-91.
[5] Hosmer D, Lemeshow S. Applied Logistic Regression. New York: Wiley, 2000.
[6] Campbell M, Machin D. Medical Statistics. New York: Wiley, 1995.
[7] Reisinger J, et al. Prospective comparison of flecainide versus sotalol for immediate
cardioversion of atrial fibrillation. American Journal of Cardiology 1998; 81:1450-1454.
Typical examples of survival (time-to-event) outcomes are:
• Survival after an operation (e. g., resection of a tumor)
• Functional graft survival (e. g., after kidney transplant)
• Time to rejection (e. g., after transplant)
• Functional survival of a bypass
• Time to recurrence of disease (e. g., tumor)
• Time to remission
All these examples have in common, that we are interested in the time that elapses from a
well-defined starting point (e. g., operation, onset of therapy, diagnosis), until occurrence
of a particular event (e. g, death, time of progression, rejection, etc.). This time is
generally called survival time.
However, it is very unlikely that at the time of analysis the event of interest has occurred in
all patients of a study.
Example: dogs. Consider the following experiment: 11 dogs are treated with an
experimental drug, and for each dog we take records of its survival time. The data looks
as follows:
Name Entry Last date min max
01JAN2004 01NOV2005
*-----------------------------------------------------*
Matt 01JAN2004 01DEC2004 |E-------------------------d |
Andy 01FEB2004 01APR2005 | E---------------------------------d |
Jack 01MAR2004 01JAN2005 | E-----------------------d |
Sim 01APR2004 01JUN2005 | E---------------------------------d |
Jimmy 01MAY2004 01JUN2005 | E------------------------------d |
Phil 01JUN2004 01NOV2005 | E---------------------------------------d|
Bart 01JUL2004 01MAY2005 | E-----------------------a |
Tommy 01JUL2004 01OCT2005 | E------------------------------------a |
Teddy 01SEP2004 01OCT2005 | E-------------------------------d |
Jody 01OCT2004 01NOV2005 | E-----------------------------a|
Dolly 01OCT2004 01NOV2005 | E-----------------------------d|
*-----------------------------------------------------*
The letters ‘d’ and ‘a’ indicate whether a dog was dead or alive at the ‘last contact’ date.
Summary:
Name     Survival time (months)    Status at last contact
Matt     11                        dead
Andy     14                        dead
Jack     10                        dead
Sim      14                        dead
Jimmy    13                        dead
Phil     17                        dead
Bart     10                        alive (censored)
Tommy    15                        alive (censored)
Teddy    13                        dead
Jody     13                        alive (censored)
Dolly    13                        dead
Note that we are not able to measure the complete survival time for some dogs in our
experiment. Dogs with status ‘alive’ have incomplete or censored information on survival
times. We know the exact survival time for Matt, Andy, Jack, Sim, Jimmy, Phil, Teddy
and Dolly. Such observations are called complete. For Jody, all we know is that he
survived more than 13 months; we don’t know if his survival time is 14 months or 26
months.
Compare Jody to Tommy (who ran away) and Bart (who was eaten by Dolly).
• Does running away indicate he was seeking a place to die? (increased risk)
• Or was he seeking a postman to harass? (vitality associated with decreased risk)
• Could this be related to treatment (a toxic reaction that weakened or killed him)?
In a clinical trial of a new therapy for diabetes, would you consider an observation to be
censored if the patient:
• Committed suicide?
• Was killed in a car accident?
Competing risks:
Consider a study of an intervention to prevent death from stroke. The population is at risk
of death from many other causes. Should patients who die from other causes (e. g.,
cancer, myocardial infarction, etc.) be censored?
• Pros: only the relevant (stroke-related) causes of death are taken into account.
• Cons: it is often difficult to identify the precise cause of death with any certainty.
To be on the safe side, it might be wiser to consider ‘death from all causes’ as the event.
Henceforth, we will (have to) assume that Bart actually left town and refused further
cooperation with the investigators.
Potentially censored survival data can be properly dealt with using the methods presented
in the following.
The Kaplan-Meier method [1] can use the information that lies in censored observations
efficiently. The result of a Kaplan-Meier analysis is a survival curve like the following:
First, the survival time of each dog is computed by subtracting the entry date from the last
observation date.
Next, at each time point at which an event occurred, an estimate of the conditional
survival probability is obtained:
Survival Table
                             Cumulative Proportion
                             Surviving at the Time        N of          N of
      Time      Status       Estimate     Std. Error      Cumulative    Remaining
                                                          Events        Cases
 1    10.000    dead         .909         .087            1             10
 2    10.000    alive        .            .               1             9
 3    11.000    dead         .808         .122            2             8
 4    13.000    dead         .            .               3             7
 5    13.000    dead         .            .               4             6
 6    13.000    dead         .505         .158            5             5
 7    13.000    alive        .            .               5             4
 8    14.000    dead         .            .               6             3
 9    14.000    dead         .253         .149            7             2
10    15.000    alive        .            .               7             1
11    17.000    dead         .000         .000            8             0
The shortest survival time was 10 months. Eleven dogs lived up to 10 months. One dog
died at 10 months. Ten dogs survived at least a little bit longer than 10 months (we
assume that censoring at 10 months means: surviving a little bit longer than 10 months).
Thus, the survival probability at time 10 is 10/11=90.9%.
Immediately after the 10th month, one dog was lost due to censoring. Now, only 9 are
still under observation.
After 11 months, another dog dies. Nine dogs were under observation up to 11 months, 8
dogs survived that time point. The survival probability at 11 months is thus 8/9=88.9%,
but this probability is conditional on a dog having survived up to 11 months. To obtain an
unconditional, cumulative survival probability, it is multiplied by the cumulative survival
probability of the preceding event time: 0.909 × 0.889 = 0.808.
The computation is carried on until the last dog has died or disappeared.
Please note that the Kaplan-Meier curve is a step function: the curve stays at a certain
level until the next event occurs. At each event time, the curve drops.
The step heights provide an estimate of the risk of death at certain time points. Relating
the step heights of the Kaplan-Meier curve to the total height of the curve at the times the
steps occur, and cumulating them generates a so-called ‘cumulative hazard plot’. We see
that the relative step heights of the Kaplan-Meier curve and so the risk to die increase by
time.
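A short Python sketch of the product-limit calculation for the dog data (times and status taken from the chart and summary above):

    # Sketch: Kaplan-Meier (product-limit) estimate computed by hand.
    # Censored dogs are assumed to survive slightly beyond their last contact.
    import numpy as np

    months = np.array([11, 14, 10, 14, 13, 17, 10, 15, 13, 13, 13])
    died   = np.array([ 1,  1,  1,  1,  1,  1,  0,  0,  1,  0,  1])

    surv = 1.0
    for t in np.unique(months[died == 1]):       # distinct event times
        at_risk = np.sum(months >= t)            # still under observation at t
        deaths = np.sum((months == t) & (died == 1))
        surv *= (at_risk - deaths) / at_risk     # conditional survival probability
        print(t, round(surv, 3))                 # reproduces .909, .808, .505, .253, .000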
Kaplan-Meier curves can also be used to compare groups, as will be demonstrated below.
Basically there are two nonparametric tests available to compare survival curves between
groups: the log rank test and the generalized Wilcoxon test. A third test (the Tarone-Ware
test) is a kind of mixture of these two.
The log rank test is obtained by constructing a set of 2 by 2 tables, one at each distinct
event time. In each table, the death rates are compared between the two groups,
conditional on the number of subjects at risk in the groups. Observed death rates are
compared to those expected under the null hypothesis. The information from each table is
then combined into a single test statistic.
As an example, consider the following table at one particular event time t(j):

            Deaths at t(j)    At risk just before t(j)
Group 1           2                   40
Group 2           8                   60
Total            10                  100

In group 1, we would expect 40% of all deaths that occurred at this time
(because the proportion of subjects at risk in group 1 is 40 out of 100 at this time). In total, 10
deaths occurred, thus we would expect 4 in group 1. However, there were only 2 deaths
in group 1. The contribution to the log rank statistic of this table is therefore 2 – 4 = -2.
The contributions are summed up over all tables and related to their variance. The
resulting test statistic follows a chi-squared distribution with one degree of freedom. This
means, we can look up a p-value by using a table of quantiles of a chi-squared
distribution with one degree of freedom.
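A sketch of the contribution of a single table to the log rank statistic, using the numbers from the example above:

    # Sketch: observed minus expected deaths in group 1 at one event time t(j),
    # together with the hypergeometric variance of that contribution.
    d1, n1 = 2, 40     # deaths / at risk in group 1
    d2, n2 = 8, 60     # deaths / at risk in group 2
    d, n = d1 + d2, n1 + n2

    expected1 = d * n1 / n                                 # 10 * 40/100 = 4
    contribution = d1 - expected1                          # 2 - 4 = -2
    variance = d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)

    # Summing contributions and variances over all event times gives the
    # log rank statistic: chi2 = (sum of contributions)**2 / (sum of variances).
    print(contribution, round(variance, 3))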
The generalized Wilcoxon test constructs its test statistic in the very same way, but it
weights the contributions of each table by the number of subjects that are at risk just
before t(j). Thus, early time points obtain more weight compared to late time points. This
is often desirable if the effect of an experimental condition vanishes with ongoing time.
By contrast, the log rank test has optimal power to detect differences in event rates if
those differences are manifest during all the follow-up period.
Overall Comparisons
                                   Chi-Square    df    Sig.
Log Rank (Mantel-Cox)              11.229        1     .001
Breslow (Generalized Wilcoxon)     12.389        1     .000
Test of equality of survival distributions for the different levels of stage.
We see that survival in the high risk group (stage 2) is much worse than in the low risk
group (stage 1). The median survival time is 22.3 months in stage 2, compared to 71.5 in
stage 1. Both tests confirm the importance of stage for survival after diagnosis.
The following plots show Kaplan-Meier curves for groups defined by the expression of
particular genes. The groups are defined by expression below or above the median
expression of that gene in the sample.
Overall Comparisons
                                   Chi-Square    df    Sig.
Log Rank (Mantel-Cox)              6.889         1     .009
Breslow (Generalized Wilcoxon)     7.628         1     .006
Test of equality of survival distributions for the different levels of Gene_6799.
For this gene, both tests are significant. We see a constant gap between the curves, i. e.
the patients with gene expression below the median (group 0) have constantly higher risk
to die than patients with high gene expression.
Overall Comparisons
                                   Chi-Square    df    Sig.
Log Rank (Mantel-Cox)              7.323         1     .007
Breslow (Generalized Wilcoxon)     3.684         1     .055
Test of equality of survival distributions for the different levels of Gene_5193.
Overall Comparisons
                                   Chi-Square    df    Sig.
Log Rank (Mantel-Cox)              3.241         1     .072
Breslow (Generalized Wilcoxon)     6.699         1     .010
Test of equality of survival distributions for the different levels of Gene_10575.
For this gene, we see marked differences during the first 18 months, then the curves run
in parallel until surviving patients of the low expression group make up the differences
from 60 months on. The Wilcoxon test, which puts more weight on early times, yields a
significant p-value, while the log rank test does not.
Since discrepancies or agreement of the test results may provide further insight into
survival mechanisms, both tests should be carried out and reported.
Basics
So far, we have tested differences in survival curves by using log rank or generalized
Wilcoxon tests. These tests (in particular, the log rank test) can also be extended by using
a regression model instead, the so-called Cox proportional hazards regression model [3].
The Cox regression model is a semi-parametric model, i. e., it imposes no assumptions
about the distribution of survival times (e. g., exponential or normal distribution). On the
other hand, it cannot be used to predict survival times for individuals.
Consider the contingency table at t(j) we have already been working with:
Cox regression extends the group comparison which is performed by the log rank test to
estimating a group difference. The group difference is quantified by an estimate of
relative risk.
Considering the table shown above, the relative risk of group 2 vs group 1 is (8/60) / (2/40) =
2.67.
Cox regression combines the information of all tables that correspond to all distinct event
times. The final relative risk estimate is that one that best explains the observed group
differences in the data (maximum likelihood principle).
Technically, the following equation is used to model the logarithm of a relative risk
(please note that big X denotes a variable, and little x a particular value of that variable):
Log(R(X=x) / R(X=0)) = B * x
If we compare two subjects which differ in X (say, systolic blood pressure) by 5 units, we
have
Log(RR) = b * 5
or
RR = exp(b * 5)
If we compare two subjects coming from two different groups (coded as 1 and 0), we
have
Log(RR) = b * 1
or
RR = exp(b)
As an example, consider the variable stage of the lung cancer data set. Requesting a Cox
regression analysis, we obtain the following table:
The program automatically supplies the value for Exp(B), which corresponds to the
relative risk referring to a comparison of high risk to low risk patients. We learn that the
relative risk for high stage patients is 2.3 compared to low stage patients. The program
also supplies a 95% confidence interval for this estimate.
The relative risk estimates may assume only positive values. A relative risk estimate of 1
means that the groups to be compared don’t differ in terms of survival. A relative risk
estimate <1 means that a subject with a higher value for X has lower risk than a subject
with a lower value for X. The reverse applies if the relative risk estimate is >1.
While the relative risk estimate is the most intuitive result of a Cox regression analysis,
the regression coefficient B, which equals the logarithm of the relative risk, is used for
testing and for confidence interval estimation, because its distribution is symmetric and
approximately normal (it may assume values between minus and plus infinity).
Wald = (B/SE)²
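A sketch of a Cox regression fit in Python using the lifelines package (simulated data; the column names are placeholders, not the lung cancer data set):

    # Sketch (simulated data): Cox proportional hazards regression with lifelines.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(5)
    n = 200
    df = pd.DataFrame({
        "time": rng.exponential(24, n),     # follow-up time in months
        "event": rng.integers(0, 2, n),     # 1 = death observed, 0 = censored
        "stage": rng.integers(1, 3, n),     # 1 = low risk, 2 = high risk
        "age": rng.uniform(40, 80, n),
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    print(cph.summary[["coef", "exp(coef)", "p"]])   # B, hazard ratio Exp(B), p-value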
Previous models that have been dealt with (linear regression, logistic regression) can be
used to predict the outcome of future patients.
From the results of Cox regression, it is not straightforward to predict the death times of
patients. Cox regression imposes no assumptions about the distribution of death times. In
linear regression, we assume that at least the residuals should be normally distributed.
Since we are not predicting death times here, we also do not compute residuals (at least
not in the usual sense as observed minus expected outcome).
Assumptions
The most important assumption of the Cox model is the so-called proportional hazards
assumption. Hazard means the instantaneous risk to die, i.e. the probability to die within
the next minute. (Therefore, the relative risk estimated by Cox’s model is often denoted
by hazard ratio, although both terms mean the same.) ‘Proportional hazards’ means, that
the ratio between the hazards of two patient groups remains constant over the complete
follow-up period. Consider the following example, depicting follow-up time on the X-
axis and hazard on the Y-axis:
Another example:
In this example, the hazard for females remains constant, but the hazard for males
increases with time. Thus, the proportional hazards assumption is not justified in this
example.
Since in Cox regression we use the information from all tables evaluated at all distinct
death times, and compute one relative risk (from now on, let’s use the more correct
denomination ‘hazard ratio’) estimate which should apply to all death times, we must
make sure that this simplification is really justified. It is of course only justified, if the
group difference remains constant over the whole range of follow-up time we are dealing
with.
Proportional hazards hold if we observe a picture similar to the one above: The
cumulative hazard increases in both groups (it actually cannot decrease). The rate of
increase is sometimes higher, sometimes lower. However, comparing the rates between
the groups, we see that they are more or less proportional.
Computing one relative risk estimate for all the range would over-simplify the situation:
Although in total, we see a positive effect of high gene expression, from the cumulative
hazards plot we must assume that the benefit of the high expression group only lasts until
about 24 months. We must assume a much larger effect during the first 24 months, and
almost an inversion of the effect after that time point.
Comparing these model-based survival curves to Kaplan-Meier estimates, we see that the
model-based curves assume perfect proportional hazards. The constant risk ratio between
the groups leads to constantly increasing differences between the survival curves, because
these curves cumulate the death hazard over time.
By contrast, the Kaplan-Meier method estimates survival curves separately for each
group, without imposing any assumption on proportionality of hazards. Thus, the distance
between the curves may widen or narrow over time.
The Cox regression model is closely related to the log rank test comparing two groups.
The p-value obtained by the log rank test is approximately equal to the p-value
corresponding to a binary independent variable in Cox regression, as the following
comparison shows:
Overall Comparisons
                                  Chi-Square   df   Sig.
Log Rank (Mantel-Cox)                 11,229    1   ,001
Breslow (Generalized Wilcoxon)        12,389    1   ,000
This close relationship is the reason why the log-rank test is also labeled “Mantel-Cox”
test.
The numbers shown in column ‘Exp(B)’ are now adjusted relative risk (hazard ratio)
estimates; each effect is adjusted for any other effect in the model.
The inclusion of nikotin, gender and age doesn’t change our conclusions about the
importance of the factor stage.
The same model building strategies that have already been discussed also apply to
multivariable Cox regression. As a rule of thumb, we must keep in mind that the number
of candidate variables for model building must not exceed one tenth of the number of
events (deaths).
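As a hedged sketch of such a multivariable analysis outside SPSS, the Cox model could be fitted in Python with the lifelines package (the file name and column names below are assumptions, not the actual data set):

import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("lungcancer.csv")       # assumed file with the lung cancer data
cph = CoxPHFitter()
cph.fit(df[["survival", "death", "stage", "nikotin", "gender", "age"]],
        duration_col="survival",         # follow-up time
        event_col="death")               # 1 = death observed, 0 = censored
cph.print_summary()                      # B, Exp(B) = adjusted hazard ratios, CIs, p-values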
Suppose a candidate marker ‘Gene 5193’ should be evaluated in its ability to predict
survival of lung cancer patients. For simplicity, the expression values of this marker
again have been categorized (1: above median, 0: below median). The marker proves
satisfactory in univariable analyses:
                                  Chi-Square   df   Sig.
Log Rank (Mantel-Cox)                  7,323    1   ,007
Breslow (Generalized Wilcoxon)         3,684    1   ,055
These results suggest that Gene 5193 is a risk factor: patients with higher gene expression
have a 1.9-fold higher risk than patients with lower gene expression.
Adjusting for stage, we learn that the hazard ratio drops from 1.9 to 1.6, and the p-value
now exceeds the significance level of 5%. Therefore, we cannot prove that Gene 5193 is
an independent predictor of patient survival. The drop in hazard ratio comparing
univariable to multivariable analyses is a result of the correlation between stage and Gene
5193:
                             Gene_5193
                             0         1         Total
stage  1   Count             94        58        152
           % within stage    61,8%     38,2%     100,0%
       2   Count             26        48        74
           % within stage    35,1%     64,9%     100,0%
Total      Count             120       106       226
           % within stage    53,1%     46,9%     100,0%
While only 38.2% of the stage 1 patients have gene expression above the median, the
corresponding number for stage 2 patients is 64.9%. The chi-square test yields a p-value
< 0.001.
With scale independent variables, we must verify the linearity assumption by considering
nonlinear effects. The additivity assumption can be checked by screening interactions.
These assessments of model assumptions can be carried out in the same way as has been
demonstrated for linear and logistic regression, and are therefore not repeated in this
text.
Cox regression provides an alternative way to adjust for confounding variables, namely
stratification. Stratification does not mean that one reports results from subgroups.
• Only nominal or ordinal variables can be used as stratification variables, as
stratification by a scale variable would assign each individual to a separate
subgroup, making any estimation impossible. Scale variables can only be used for
stratification if they are grouped into a few categories prior to estimation of a Cox
model.
There will always be differences between the results obtained by stratification and those
obtained by multivariable estimation. These differences result from the relaxed
assumption of proportional hazards for variable stage in analysis by stratification. The
differences will be negligible if the proportional hazards assumption holds well for the
stratification variable, and will be larger if hazards are non-proportional between the
different levels of the stratification variable. Therefore, stratification can be used if we
want to adjust for a variable that
• exhibits non-proportional hazards
• and is not of interest by itself.
We have already defined ‘proportional hazards’ as the situation where the ratio between
the hazards of two patient groups remains constant over the complete follow-up period.
Since Cox hazard ratio estimates crucially depend on the validity of the proportional
hazards assumption, it is necessary that this assumption is verified in a Cox model. As
with assumption checking in other regression models, there are two options:
• Graphical checks
• Statistical testing of violations of the proportional hazards assumption
Comparing the cumulative hazard between two groups has already been outlined in Lec
6. These plots show the increasing cumulative risk as function of follow-up time.
Therefore, assuming proportional hazards, we should observe two lines which diverge
with increasing follow-up time. Proportional hazards approximately hold for variable
stage (as shown in the left panel). If the proportional hazards assumption does not hold,
there is no constant divergence; periods of divergence may be followed by periods of
convergence, and the cumulative hazards may even cross. Such a situation is shown in
the right panel of the graph below.
While we can use these plots to judge the proportional hazards assumption in
univariable models, it is not straightforward to apply them to multivariable models,
as the cumulative hazard may be confounded by other variables.
In brief, one such residual is computed at each death time, and it is basically defined as
the difference between the observed covariate value of the subject that failed
(experienced the event) at that time, and the covariate value of the average individual that
would be expected to fail from the estimated Cox model. If the partial residuals are on
average positive at early times and negative at later times, we can conclude that the
hazard ratio decreases with time, meaning that hazards are non-proportional. Sometimes,
it is useful to insert a line of moving averages (a so-called smoother; in the plots below
we used a cubic regression line with 95% confidence intervals as offered by SPSS) into
the partial residuals plots, which allows an easier interpretation. The two plots below
show the partial residuals corresponding to the cumulative hazards plots shown above:
While we see an almost constant smoother line for variable stage, there is a sharp
increase during the first 24 months in the residuals for Gene_1791, followed by a
constant period. (Occasional fluctuations that result from the small number of patients at
risk after 60 months, say, should not be over-interpreted.)
The most important advantage of partial residuals is their use in multivariable modeling.
The non-proportionality of hazards observed in Gene_1791 could disappear if we adjust
for other variables. In other situations, non-proportional hazards appear only in
multivariable modeling, but are not present in a univariable model. If a Cox model is fit
including the variables stage and Gene_1791, we obtain the following table of estimates
and hazard ratios:
Now we observe, as before, a slight decrease in the residuals of stage, and the same, even
more pronounced, picture for Gene_1791.
There are two ways to formally test the proportional hazards assumption, leading to
approximately the same results:
• Testing the slope of partial residuals
• Testing an interaction of covariate with time
If a linear regression model is computed, using the partial residuals as dependent variable
and time (or log of time) as independent variable, then we are able to test whether the
partial residuals change (linearly) with time. This test can be used to assess the
proportional hazards assumption. The test is not significant (and thus does not contradict
proportional hazards) if a horizontal line at 0 is completely covered by the 95% confidence
interval for the linear fit:
Coefficients(a)
Model            B        Std. Error    Beta      t        Sig.
(Constant)       ,089     ,100                    ,886     ,379
Survival         -,003    ,003          -,144     -1,121   ,267
a. Dependent Variable: Partial residual for stage

Coefficients(a)
Model            B        Std. Error    Beta      t        Sig.
(Constant)       -,205    ,094                    -2,178   ,033
Survival         ,008     ,003          ,338      2,754    ,008
a. Dependent Variable: Partial residual for Gene_1791
(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)
In the table shown above, we assume now the following relationship between the effect
of stage and time: B = 1.207 – 0.015*time. This means that there is a small and not
significant decline in the effect of stage. At the beginning of follow-up, the effect of stage is
1.207 (corresponding to a hazard ratio of 3.345). With every month, it declines by 0.015
(or in terms of hazard ratios, the hazard ratio reduces by 1.4%).
If we assume a linear relationship of the effect of Gene 1791 and time, we obtain the
following Cox model:
At the beginning, high expression of Gene 1791 has a protective effect. The B estimate is
-1.258, corresponding to a hazard ratio of 0.28. With every month, the B coefficient
increases by 0.032, so the protective effect diminishes (in terms of hazard ratios, the initial
hazard ratio of 0.28 is multiplied by 1.033 every month). After roughly 40 months, the
hazard ratio reaches 1; afterwards, high expression has a detrimental effect on the death hazard.
In summary, both tests reveal a significant time-dependency of the hazard ratio of Gene
1791 (p=0.008 and 0.029 for the slope-of-partial-residuals test and the interaction-with-
time test). On the other hand, based on these tests we would not conclude that the
proportional hazards assumption is violated for stage (p=0.267 and p=0.294,
respectively).
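The slope-of-partial-residuals test can be reproduced with any linear regression routine; a sketch with Python/statsmodels (file and column names are assumptions), regressing the partial residuals of one covariate on survival time:

import pandas as pd
import statsmodels.api as sm

res = pd.read_csv("partial_residuals.csv")          # one row per observed death time (assumed file)
X = sm.add_constant(res["survival"])                # intercept + time
fit = sm.OLS(res["partial_resid_stage"], X).fit()
print(fit.summary())                                # a significant slope indicates non-proportional hazards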
There are several options for how to proceed if the proportional hazards assumption is
violated in a model; one of them, stratification by the offending variable, has been discussed
above. Another option is to divide the time axis into two parts and estimate separate hazard
ratios for each part. Suppose we want to divide the time axis at 40 months. For a proper
subgroup analysis, we must follow this principle:
• All subjects enter the first subgroup. The censoring indicator is set to 'censored' for
all subjects who lived longer than 40 months, and the survival time is set to 40
months for these subjects.
• Only the subjects who lived longer than 40 months enter the second subgroup.
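A minimal sketch of this data splitting in Python/pandas (the column names 'survival', in months, and 'death' are assumptions):

import numpy as np
import pandas as pd

df = pd.read_csv("lungcancer.csv")

# Part 1: all subjects enter; subjects who lived longer than 40 months are censored at 40
part1 = df.copy()
part1["death"] = np.where(part1["survival"] > 40, 0, part1["death"])
part1["survival"] = part1["survival"].clip(upper=40)

# Part 2: only subjects who lived longer than 40 months
part2 = df[df["survival"] > 40].copy()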
We see that the effect of stage is higher in the first subgroup than in the second one (2.57
vs. 1.19), and that the hazard ratio of Gene 1791 is 0.4 in the first subgroup, but 1.8 in the
second one. Because of the small number of events in the second subgroup, these estimates cannot be interpreted with much confidence.
Influential observations
Plotting the DfBeta values for stage against patient ID, we learn that patient 84 has a
noticeably high value:
[Figure: DfBeta for stage plotted against patient ID (patid); the point for Patient No. 84 clearly stands out.]
Interestingly, the same patient is identified as influential point when plotting the DfBeta
of Gene 1791:
[Figure: DfBeta for Gene_1791 plotted against patient ID (patid); Patient No. 84 again stands out as an extreme value.]
References
[1] Kaplan EL, Meier P. (1958) Nonparametric estimation from incomplete observations.
Journal of the American Statistical Association 53: 457-481
[2] Bhattacharjee, A., et al (2001). Classification of human lung carcinomas by mRNA
expression profiling reveals distinct adenocarcinoma subclasses. PNAS 98, 13790-13795.
[3] Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical
Society B 34: 187-220.
[4] Schoenfeld, D. (1982). Partial residuals for the Cox proportional hazards model.
Biometrika 69, 239-241.
A box plot illustrates the distribution of pre- and post-treatment scores, stratified by
treatment:
We see that the post-treatment functional scores in the acupuncture group are clearly
higher than those in the placebo group, but there is already an imbalance between the
groups at baseline, with better scores in the acupuncture group.
Although we notice an improvement in both groups, we cannot see the (variation of)
individual improvements. The plot may even obscure occasional decreases in functional
scores in some patients.
A proper analysis should therefore fulfil two requirements:
• adjust for the baseline imbalance
• take into account individual changes in functional scores
Change scores
Let’s first ignore the baseline imbalance and compare the post-treatment scores using a t-
test:
Group Statistics
                         Group         N    Mean      Std. Deviation   Std. Error Mean
post-treatment score     Acupuncture   25   80,3200   16,83182         3,36636
                         Placebo       23   62,8261   18,70744         3,90077
The easiest way to accomplish both requirements is to use change scores. These are
simply defined as the difference between post-treatment and pre-treatment score,
computed for each patient separately. Change scores can be computed as raw change
scores (post − pre) or as percent change scores (100 · (post − pre) / pre).
We will use raw change scores, if a change of 5 units has the same interpretation at all
levels of pre-treatment scores. Sometimes, this may not be useful, and then percent
change scores are more appropriate. Using percent change scores, we assign the same
importance to a change from, e. g., 20 to 15 (-25%) as to a change from 40 to 30 (-25%).
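A short sketch (assumed column names 'pre', 'post' and 'group') of computing both kinds of change scores and comparing the raw change scores between the groups:

import pandas as pd
from scipy import stats

df = pd.read_csv("acupuncture.csv")
df["raw_change"] = df["post"] - df["pre"]
df["pct_change"] = 100 * (df["post"] - df["pre"]) / df["pre"]

acu = df.loc[df["group"] == "Acupuncture", "raw_change"]
pla = df.loc[df["group"] == "Placebo", "raw_change"]
print(stats.ttest_ind(acu, pla))                              # t-test on raw change scores
print(stats.mannwhitneyu(acu, pla, alternative="two-sided"))  # nonparametric alternative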
The boxplot shows approximately normal distributions of the raw change scores, such that
use of the t-test is indicated:
Group Statistics
                      Group         N    Mean      Std. Deviation   Std. Error Mean
Raw change score      Acupuncture   25   20,8000   15,82982         3,16596
                      Placebo       23   11,1304   15,61847         3,25668
Because of the outlier observed in the acupuncture group, we might prefer the Mann-
Whitney-U test for comparison:
Test Statistics(a)               Percent change score
Mann-Whitney U                   220,000
Wilcoxon W                       496,000
Z                                -1,393
Asymp. Sig. (2-tailed)           ,164
a. Grouping Variable: Group
The difference in relative (percent) changes is not as strong as that in raw changes, so that
significance is not reached.
Comparing change scores we have to take into account the possibility of the so-called
regression to the mean effect. This effect is responsible for the general observation that
patients with low functional scores before start of treatment tend to have greater change
scores than patients with better functional scores prior to treatment. If baseline functional
scores are very heterogeneous, this may affect the change scores leading to a biased
analysis. Using alternative options of analysis, we can circumvent the regression to the
mean effect and obtain an unbiased group comparison. One such option, analysis of
covariance, will be presented later.
If a patient is – by chance – in a good constitution at the day at which the baseline
measurement is taken, it is very likely that he/she will be in a worse constitution at the
next assessment, leading to a decrease in functional score.
[Figure: hypothetical second assessment plotted against baseline assessment, illustrating regression to the mean.]
Thus, low baseline functional scores are – in our hypothetical example of no treatment
effect – associated with positive change scores, and high baseline functional scores
correlate with negative change scores.
If a treatment effect is present, we must still assume that the change scores will tend to be
negatively correlated with the baseline scores.
Assuming that there is no effect of treatment, we would assume the mean of the two
assessments as the best guess of a patient’s long-time functional score.
The regression to the mean effect becomes most evident, if we correlate change scores
with baseline values:
[Figure: raw change score plotted against pre-treatment score, by group (Acupuncture, Placebo), with linear regression lines and 95% mean prediction intervals. Fitted lines: raw change score = 44.00 − 0.39·pre (R² = 0.09) and raw change score = 22.77 − 0.23·pre (R² = 0.04).]
We see that in both groups, the change scores are negatively correlated with pre-
treatment scores. This obvious negative correlation is a logical result from comparing the
expressions Y − X and X. Therefore, to determine whether the magnitude of the change
scores correlates with the baseline values, we have to correct for this automatic correlation
by replacing X by (X+Y)/2 (the mean) on the x-axis. The mean again serves as the best
estimate of the long-term average of a patient's functional score.
[Figure: raw change score plotted against (Pre + Post)/2, by group, with linear regression lines and 95% mean prediction intervals. Fitted lines: raw change score = −9.95 + 0.44·meanval (R² = 0.12) and raw change score = −10.25 + 0.37·meanval (R² = 0.12).]
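The correction described above can be sketched in a few lines (column names assumed): correlate the change score once with the baseline value and once with the mean of the two assessments:

import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("acupuncture.csv")
change = df["post"] - df["pre"]
print(pearsonr(change, df["pre"]))                     # automatically negative correlation
print(pearsonr(change, (df["pre"] + df["post"]) / 2))  # corrected: change vs. (pre + post)/2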
Analysis of covariance
In order to correct for the regression-to-the-mean effect when comparing the treatment
effect between acupuncture and placebo groups, the simple comparison of change scores
should be replaced by an analysis of covariance (ANCOVA).
ANCOVA is the same as ANOVA (or, in a two-group situation, a t-test), but it allows for
taking into account covariates which can be used for assessing adjusted treatment effects.
Technically, we fit a linear regression model with two variables – one defining the
treatment groups and one representing the baseline assessment. The change score or the
post-treatment scores may be used as outcome, leading to the same conclusion.
From ANCOVA we obtain an estimate of the treatment effect (the difference in change
score), which is adjusted for baseline imbalance using multivariable estimation. In other
words, we obtain the average difference in change score (or post-treatment score)
between a patient of the acupuncture group and a patient of the placebo group, assuming
both had the same baseline values.
[Table: Coefficients(a) of the ANCOVA model (unstandardized and standardized coefficients; not fully reproduced).]
We see that after optimal adjustment for baseline imbalance, we end up with a difference
in post-treatment scores of -12, referring to the difference Placebo (code 2) –
Acupuncture (code 1). Thus, patients treated by acupuncture have post-treatment scores
which are on average 12 units higher than those of patients in the placebo group with
equal pre-treatment scores. This effect is significant (p=0.013).
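A hedged sketch of this ANCOVA with Python/statsmodels (the variable names 'post', 'pre' and 'group' are assumptions); the coefficient of the group term is the baseline-adjusted treatment effect:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("acupuncture.csv")
fit = smf.ols("post ~ C(group) + pre", data=df).fit()
print(fit.summary())   # C(group) coefficient = adjusted difference between the treatment groups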
Repeated measurements of a parameter are typically analyzed for one of the following purposes:
• to describe the course of a parameter after an intervention
• to identify typical shapes of courses
• to compare the course of a parameter between groups of patients defined by an
experimental condition
The following table (AZT.sav) shows the serum zidovudine levels (AZT) of AIDS
patients at various time points after administration of AZT. Some patients are known to
possess abnormal fat absorption (malabsorption):
The first step in the analysis of repeated measurements should always consist of a
graphical display of individual curves (separately for each patient). With a small number
of patients, one may draw all curves into one diagram. In our data set, there are nine
patients with malabsorption and five patients with normal fat absorption. Therefore, we
can draw two diagrams, one for each group:
• First, explorative data analysis provides insight into the distribution of variables,
and some types of distributions demand special statistical methods (e. g., for
skewed distributions we should not use methods based on the assumption of
normally distributed data). For this purpose, all individual curves should be
equally scaled, i. e., the diagrams should all have the same y-axis and x-axis
definitions.
• Second, we may identify outliers or data errors by inspecting individual curves.
• Third, different shapes of time courses may be identified by inspecting individual
curves. In the example shown above, two types are identified: we see curves with
a marked peak at the beginning (e.g., patients 1, 3, 4, 5, 8, 9, 13), and some with a
slower ascent and a smoother peak occurring later (6, 14).
Grouped curves
Only if the shapes of the curves are similar is it reasonable to group the curves by showing
the course of means or medians over time. In any case, the variation of the values should
also be properly shown, either by displaying the standard deviation as error bars or by
adding curves referring to the 25th and 75th percentiles (or minimum and maximum).
For the example shown above, a plot of the mean course with standard deviation, grouped
by type of fat absorption, yields the following diagram:
[Figures: AZT serum levels plotted over time, grouped by type of fat absorption, with summary curves at each time point.]
In the above plot, the lines represent the medians at each time point. Clearly, the variation
of data points above the median is greater than that of data points below the median.
Therefore, we should not use parametric descriptive statistics, but rather the median and,
because of the small number of subjects, minimum and maximum (instead of 25th and
75th percentiles):
Drop-outs
Special attention should be paid to patients for whom measurements after a particular
time point are unavailable. Reasons for such drop-outs are:
• Patients who die
• Patients who are lost to follow-up
• Patients unwilling to further participate in the study
• Patients for whom measurements cannot be taken
• …
The quick and simple solution would involve computing a correlation coefficient
between A and B. The problem with that approach is that we have repeated
measurements for those parameters for each patient. We cannot assume that values taken
from the same patient are independent, which, however, is a crucial assumption
underlying the computation of correlation coefficients. Therefore, this violated
assumption has to be accounted for. One easy way to account for it is to compute partial
correlation coefficients. Partial correlation coefficients adjust the correlation between A
and B for the different average levels of the patients.
Consider the following (hypothetical) example. Let there be two parameters A and B,
each measured in 8 repeated assessments of 5 patients. Ignoring the ‘patient effect’, we
obtain the following scatter plot and correlation coefficient:
[Figure and Correlations table: scatter plot of Parameter A against Parameter B, ignoring the patient effect.]
A proper statistical analysis has to adjust for the 'patient effect'. If we mark the data
points in the scatter plot according to the patient they were measured on, we clearly see
that the parameters are only correlated because a patient in whom parameter A is high also
tends to have high values of B; if, within the same patient, parameter A changes, parameter
B may increase or decrease. There is absolutely no systematic correlation in the intra-
individual changes of parameters A and B:
[Figure and Correlations table: the same scatter plot with data points marked by Patient ID (1–5); within each patient, no systematic relation between Parameter A and Parameter B is visible.]
To adjust for a nominal variable such as the patient ID, dummy variables are created as follows:
• Define a reference category.
• With N levels, create N − 1 dummy variables referring to all other categories.
• The dummy variables are all 0 for the reference category.
• For any other category, the corresponding dummy variable is 1.
In our example, the variable ‘PatID’ has 5 levels: 1, 2, 3, 4, 5. Let’s define 5 as the
reference level (without loss of generality). The dummy variables are therefore defined as
follows:
Now the set of dummy variables can be used to adjust for the patient effect in a partial
correlation analysis (labeled by ‘control variables’ in the SPSS output).
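The same adjustment can be sketched outside SPSS: create the dummy variables, remove the patient effect from A and from B by regressing each on the dummies, and correlate the residuals (file and column names are assumptions):

import pandas as pd
import statsmodels.api as sm
from scipy.stats import pearsonr

df = pd.read_csv("paramAB.csv")                        # columns: PatID, A, B (assumed)
dummies = pd.get_dummies(df["PatID"], prefix="pat", drop_first=True).astype(float)
X = sm.add_constant(dummies)

resid_A = sm.OLS(df["A"], X).fit().resid               # A with the patient effect removed
resid_B = sm.OLS(df["B"], X).fit().resid               # B with the patient effect removed
print(pearsonr(resid_A, resid_B))                      # partial correlation of A and B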
Another example: correlation of changes in serum pH and serum PaCO2 [4]. In eight
patients, repeated measurements of serum pH and serum PaCO2 have been taken.
In a marginal scatter plot (ignoring the patient factor), we see no correlation between pH
and PaCO2 (a correlation coefficient is computed as -0.065, but please note that this
value is not valid because it ignores the dependence of values that are measured on the
same patient):
[Figure: marginal scatter plot of pH against PaCO2, ignoring the patient factor.]
A panel scatterplot shows the data cloud for each of the eight patients:
[Figure and Correlations table: panel scatter plots of pH against PaCO2, one panel per patient.]
In summary, correlating the original data is wrong here and should be replaced by partial
correlation.
Two types of curves can be distinguished:
• growth curves, which have their maximum at the end or at the beginning of the time
course, and
• peaked curves, which have a peak somewhere in the middle of the time course.
For growth curves, the following summary measures are useful:
• Last value
• Difference between last and first value, either computed from raw values or from
a regression analysis
• Slope of the growth curve
Sometimes measurements are taken repeatedly to describe the course of the concentration
of a drug in the body. To describe such peaked curves the following summary measures
are useful:
• Maximum concentration (Cmax)
• Time to maximum concentration (Tmax)
• Minimum concentration (Cmin) between two subsequent applications
• Time above a threshold value
• Area under the curve, which is used as a measure of drug absorption
• Half-time: time to half of maximum concentration (T1/2)
• The slope of the regression lines can be used to describe the kidney function.
• The values predicted from the regression line at particular days (say, day 90 after
transplantation) may serve as a daily-fluctuation-corrected value to be used in
further analysis, e. g., for predicting the long-term outcome of the kidney
function.
[Figure: reciprocal creatinine values plotted over time (days 0–75) for individual transplant patients (panels for patients 8, 14, 19, 24, 28, 30, 41, 44, 45).]
Consider the AZT data of Lec 10. The area under the curve can be computed using the
trapezoidal rule. The area under the curve between two subsequent time points T1 and T2
is based on an assumed linear course between these two time points, and is thus estimated
by (T2 − T1) · (X1 + X2) / 2, where X1 and X2 denote the concentrations at times T1 and
T2, respectively. The area under the curve (AUC) is obtained by summing up all partial
areas between subsequent time points.
The diagram below shows the way the area is computed using the trapezoidal rule on the
first three measurements on patient 1. By replacing the trapezoidal sections (under the
black line) by rectangles (under the red line) of average height, the area can easily be
computed using the formula length × height.
The area under the curve can be computed in SPSS using the instructions given in the
SPSS lab section.
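For illustration, a sketch of the trapezoidal rule with numpy; the times and concentrations below are invented:

import numpy as np

t = np.array([0.0, 0.5, 1.0, 2.0, 4.0])       # measurement times
x = np.array([0.0, 6.2, 4.8, 2.1, 0.9])       # concentrations at those times

partial = np.diff(t) * (x[:-1] + x[1:]) / 2   # (T2 - T1) * (X1 + X2) / 2 for each interval
print(partial.sum(), np.trapz(x, t))          # summing the partial areas equals np.trapz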
From the individual AZT curves, we can distill the maximum concentration Cmax and
the time to maximum concentration Tmax in order to find associations between the rise
time and the maximum absorption of the drug.
[Figure: Cmax plotted against Tmax for the individual AZT curves.]
Another example, reproduced from [5], deals with aspirin absorption in healthy and ill
persons. The basic research question was: Do ill persons have reduced aspirin
absorption?
The next figure shows the Cmax values plotted against the Tmax values. This plot clearly
shows that the maximum concentrations tend to be lower and occur later in the ill
patients. Additionally, the Cmax appears to be negatively correlated with Tmax, i. e.,
higher peaks occur earlier than lower peaks.
ANOVA compares the means of a scale variable between levels of a factor. In its
simplest setting, there are two levels, which means that ANOVA is actually a two-group
comparison and could be replaced by a t-test.
A basic assumption of ANOVA is that all observations are mutually independent. This
assumption is violated, if several measurements are taken from each subject. We cannot
treat these repeated measurements as being independent. ANOVA for repeated
measurements provides a toolbox to account for the dependence of the observations taken
on the same individuals.
So far, we have only considered effects which relate to experimental conditions (e. g., a
comparison of treatment A with treatment B) that were varied between individuals. Some
individuals were randomized to receive A, others to receive B. However, there was no
subject receiving both. Thus, we call ‘treatment’ in this simple example a ‘between-
subject effect’.
Example: consider an ophthalmologic trial, where two types of lenses are implanted in
the same subjects, and the outcome is a visus assessment one week after the operation.
Since each subject receives both lenses, the factor 'lens' is a within-subject effect. A
proper statistical analysis must account for the within-subject nature of this effect. If no
covariates are present, a paired t-test could be done; alternatively, type of lens could be
specified as a within-subject effect in an ANOVA.
If we take several serial measurements on the same individual, then these measurements
differ by the time at which they were taken. Therefore, the factor ‘time’ constitutes a
within-subject effect: it is varied within subjects.
Consider a trial where serial measurements of pain (on a visual analogue scale, VAS) are
taken on the same patients after start of a pain treatment: patients are randomized either
to electro-stimulated acupuncture or to standard acupuncture. In this example, we have
two types of effects: time is a within-subject effect, type of acupuncture a between-
subject effect.
Unstructured covariance
The rows and columns in the matrix shown above correspond to the time points at which
measurements are taken. Between each pair of time points there exists a correlation (or
covariance), and using an unstructured covariance matrix, we estimate each covariance
separately. With N time points, we end up with N(N+1)/2 covariance parameters that
have to be estimated. Of course, this structure offers highest flexibility. Variances
(standard deviations) are allowed to differ from time point to time point. However, since
many parameters have to be estimated, it can only be applied if the number of measurements
is small and the number of subjects comparably large; otherwise the '1:10' rule of thumb
would be violated and the estimates would be highly unstable.
Since these conditions do not always hold in real life, several covariance structures have
been proposed that simplify the unstructured covariance, but at the cost of lower
flexibility.
Toeplitz
The Toeplitz structure simplifies the covariance matrix in the following way:
• It assumes the same variance for each time point,
• it assumes that the correlations between subsequent measurements are the same,
• it assumes that the correlations between measurements separated by a third
measurement are the same, etc.
Toeplitz-heterogeneous
To relax the assumption of equal variances at each time point, one may choose the Toeplitz-
heterogeneous structure:
This covariance structure needs 2N-1 covariance parameters, thus it is about twice as
‘costly’ as the standard Toeplitz structure, yet it provides a much more flexible way to
define the covariance.
Compound symmetry means that we are assuming the same variance (standard deviation)
of measurements at each time point, and that we are assuming the same correlation
between measurements taken at different times. This covariance structure may be most
adequate if we have a process that is stable over time, e. g. with blood pressure measurements,
or if there is a relatively long time elapsing between two subsequent measurements, such
that we can assume that a preceding measurement no longer influences a
subsequent one. Only two parameters have to be estimated: the variance of the
measurements and the within-subject correlation.
First-order autoregressive structure, AR(1)
We see that, since rho is a number between -1 and 1, the correlation drops off with
increasing time between two measurements. However, the magnitude of this drop-off
may not be adequately modeled with this structure. On the other hand, similarly to the
compound symmetry structure, only two parameters are estimated.
The specification of the covariance structure is the most crucial part of an RM-ANOVA
analysis. All other specifications are very similar to simple ANOVA or regression
analysis.
Model formulation
The simplest setting assumes a between-subject factor like treatment arm and time as a
within-subject factor. Additionally, we may consider covariates such as
baseline measurements or demographic variables (age, sex) that the treatment effect
should be adjusted for.
Time Within-subject
Treatment Between-subject
Now the question arises which covariance structure is the most adequate. While the
unstructured covariance matrix provides the most flexible fit to the observed data, it
requires estimation of many additional parameters (the elements of the covariance
matrix), which we are not really interested in. Such parameters are called nuisance
parameters. Generally, estimation of many parameters with few independent subjects
should be avoided, because results can be unstable. By contrast, the Toeplitz or the AR(1)
structures are less flexible, but more parsimonious, as they require less parameters to be
estimated. The so-called ‘Akaike information criterion’ (AIC) can be used to decide
whether to use a more flexible or a more restrictive covariance matrix. It penalizes the
likelihood of the model (i. e., the probability of the observed data given the estimated
model parameters) by the number of estimated parameters. AIC is supplied in the
program output; by running the analysis with different covariance structures, one may
compare the AIC values and choose the covariance structure that yields the smallest AIC (by
convention, AIC is scaled such that smaller numbers indicate better fit).
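Conceptually, the comparison boils down to AIC = -2·log-likelihood + 2·(number of estimated parameters); a tiny sketch with invented log-likelihoods (the fixed effects are the same across the candidate models, so only the covariance parameters differ):

candidates = {
    # covariance structure: (log-likelihood, number of covariance parameters); values invented
    "unstructured":      (-310.2, 21),    # N(N+1)/2 parameters for N = 6 time points
    "Toeplitz":          (-314.8, 6),
    "compound symmetry": (-319.5, 2),
    "AR(1)":             (-315.1, 2),
}
aic = {name: -2 * ll + 2 * k for name, (ll, k) in candidates.items()}
print(aic, "-> choose:", min(aic, key=aic.get))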
With the AR(1) covariance we obtain the following table of global tests:
For the interaction test, the procedure calculates a p-value of 0.866, which means that the
interaction term should be dropped from the model:
Type III Tests of Fixed Effects(a)
We see that the effect of the baseline measurements is not significant; nevertheless this
effect is retained in the analysis (it is not important whether this effect is significant
or not, as we only want to adjust for it).
Although both treatment effect and time effect (row labeled ‘Week’) are not significant at
the usual 5% level, we take a look at the parameter estimates:
The VAS measurements in the PLACEBO group are about 1.6 cm higher than in the
VERUM group. Since there is no interaction with time, we can assume that the effect is
constant over time (as can also be seen from the error bar plot above). Regarding the time
effect, the program automatically assumes the last week as the reference category, and all
other time points are compared to week 6. In week 1, the measurements are on average
0.56 cm higher, and decline thereafter. After adjusting for multiple testing (multiplying
the p-values by 5), the only significant difference is between week 1 and week 6. Thus,
we conclude the final effect of treatment is not yet attained after the first week. However,
significances of comparisons of time points should not be overinterpreted. With a larger
sample size, we might have obtained a significant difference when comparing weeks 3 and 6
or weeks 4 and 6. Emphasis should be put on the magnitude of the effect, which clearly shows that
VAS measurements decline with ongoing acupuncture treatment.
With some data sets, the unstructured covariance specification may lead to a convergence
failure, i. e., the program does not supply reliable results. Such a failure is often
indicated by a warning like:
Warnings
Iteration was terminated but convergence has not been achieved. The MIXED
procedure continues despite this warning. Subsequent results produced are based on
the last iteration. Validity of the model fit is uncertain.
More restrictive covariance specifications may remove this convergence failure, but at
the cost of less flexibility.
References
[2] M. Bland. An Introduction to Medical Statistics, 3rd ed. Oxford University Press,
Oxford, 1995.
[3] X. Guo and B. P. Carlin. Separate and joint modeling of longitudinal and event time
data using standard computer packages. The American Statistician 58: 16-24, 2004.
Modeling of different types of variables will be exemplified on the data set PROS [1]. In
this study on 380 prostate cancer patients, the binary outcome variable is ‘tumor
penetration of prostatic capsule’ (1=yes: N=153; 0=no: N=227). Several potential
independent variables are considered, including PSA level (scale), Gleason score
(ordinal), Age (scale), Race (nominal), DPROS (nominal), tumor volume by ultrasound
(scale).
Nominal variables
The simplest type of nominal variables is a binary variable, which may assume two
values only. If this variable is coded as 1 and 0, then the regression coefficient of a
statistical model gives us a ‘difference’ between the levels associated with the higher and
the lower codes.
Consider the binary explanatory variable ‘Race’ (1=white, 2=black), here cross-tabulated
against the outcome variable:
                                Penetration of prostatic capsule
                                No        Yes       Total
race   white   Count            204       137       341
               % within race    59,8%     40,2%     100,0%
       black   Count            22        14        36
               % within race    61,1%     38,9%     100,0%
Total          Count            226       151       377
               % within race    59,9%     40,1%     100,0%
Considering also the value estimated for the constant, one may now compute predicted
probabilities of penetration of the prostatic capsule based on race. These probabilities are
computed by inserting the values into the logistic regression formula
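As a hedged sketch, the predicted probabilities can be reconstructed from the coefficients; the values below are back-calculated from the cross-tabulated percentages (40.2% and 38.9%) rather than copied from the SPSS table, so they are only approximate:

import numpy as np

b0, b_race = -0.342, -0.055          # approximate constant and B for race (coding 1=white, 2=black)

def predicted_probability(race):
    log_odds = b0 + b_race * race
    return 1 / (1 + np.exp(-log_odds))

print(predicted_probability(1))      # about 0.402 for white patients
print(predicted_probability(2))      # about 0.389 for black patients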
If the coding of race was changed to 0 (white) and 1 (black), the logistic regression
results change only with respect to the constant, because the difference in the codes
remains the same:
Similarly, the sign of the B coefficient for race reverses if the coding is 1 for white and 0
for black:
For variables with more than two levels, several coefficients are needed to enable the
estimation of separate predicted probabilities for each group. Consider the variable
DPROS which describes the result of the digital rectal exam: No nodule, unilobar nodule
left, unilobar nodule right, bilobar nodule:
Digital rectal exam * Penetration of prostatic capsule Crosstabulation
                                                       Penetration of prostatic capsule
                                                       No        Yes       Total
No Nodule                Count                         80        19        99
                         % within Digital rectal exam  80,8%     19,2%     100,0%
Unilobar Nodule (left)   Count                         84        48        132
                         % within Digital rectal exam  63,6%     36,4%     100,0%
Unilobar Nodule (right)  Count                         45        51        96
                         % within Digital rectal exam  46,9%     53,1%     100,0%
Bilobar Nodule           Count                         18        35        53
                         % within Digital rectal exam  34,0%     66,0%     100,0%
Total                    Count                         227       153       380
                         % within Digital rectal exam  59,7%     40,3%     100,0%
In the logistic regression menu, SPSS offers the possibility to define a nominal variable
as ‘categorical’. Since B coefficients and odds ratios (Exp(B)) always refer to a pair of
levels, a reference category must be specified; either the first (lowest code) or last
(highest code). Here, we consider specifying DPROS=1 (No nodule) as reference
category.
dpros(1) now refers to the odds ratio between unilobar nodule (left) and no nodule,
dpros(2) refers to the odds ratio between unilobar nodule (right) and no nodule,
dpros(3) refers to the odds ratio between bilobar nodule and no nodule.
Ordinal variables
                               Penetration of prostatic capsule
                               No        Yes       Total
gleason  5   Count             64        6         70
             % within gleason  91,4%     8,6%      100,0%
         6   Count             101       38        139
             % within gleason  72,7%     27,3%     100,0%
         7   Count             55        73        128
             % within gleason  43,0%     57,0%     100,0%
         8   Count             6         24        30
             % within gleason  20,0%     80,0%     100,0%
         9   Count             1         12        13
             % within gleason  7,7%      92,3%     100,0%
Total        Count             227       153       380
             % within gleason  59,7%     40,3%     100,0%
This analysis supplies separate odds ratio estimates between Gleason 6 and 5, Gleason 7
and 5, etc. (5 is the reference category). To compute an OR for Gleason 7 vs. 6, we have
to divide the ORs: OR(7 vs. 6) = OR(7 vs. 5) / OR(6 vs. 5).
The odds ratios and predicted probabilities can again be depicted in a bar chart:
The odds ratio of 3.481 now refers to the comparison of Gleason 6 vs. 5, Gleason 7 vs. 6,
Gleason 8 vs. 7, etc. Hence, we assume the same step between subsequent levels of
Gleason score. The p-value associated with Gleason score entering the analysis as scale
variable tests the ‘linear trend’ hypothesis, and it applies to all comparisons of subsequent
levels:
Use ‘as nominal’ representation if the effect is non-linear (revealed by inspecting the
probabilities of a positive outcome in a bar chart), and model ordinal variables ‘as scale’
if the linearity assumption can be justified, or if the low number of subjects in a data set
does not warrant the estimation of the number of parameters needed in an ‘as nominal’
analysis.
Scale variables
So far, scale variables were always considered as exhibiting a linear effect on the
outcome. This means, we assume the same effect to be present between any two values of
the scale explanatory variable which differ by 1.
As an example, consider the variable PSA in our PROS data set. The histograms of this
variable, separate for those with and without penetration of the prostatic capsule, give an
impression about the distribution of PSA:
We see that psa has a significant effect on penetration; the odds ratio estimate is 1.051
per unit increase in psa.
Since psa was modeled linearly, we assume that the same odds ratio applies between psa
levels of 2 and 3, 10 and 11, 50 and 51, etc. Is this assumption meaningful? The odds
ratio could be smaller when comparing levels of 50 and 51 than when comparing levels of 10 and 11.
How can we check this linearity assumption?
First, categorize the psa level into quartiles. Second, fit a logistic regression model using
penetration of the prostatic capsule as dependent, and the categorized psa level as nominal
(factor) independent variable. Third, plot the coefficients associated with the 1st, 2nd, 3rd
and 4th quartiles against the midpoints of the intervals, using a coefficient of 0 (reference)
for the first quartile:
Depending on the basic shape of this plot, the following actions are indicated: if the
relationship appears linear, use psa as is; if it appears quadratic, use psa and psa² as
independent variables; if it appears cubic, use psa, psa² and psa³.
In our case, we may assume a cubic (S-shaped) relationship, thus we try a model using
psa, psa² and psa³. The variables psa² and psa³ must be computed using Transform –
Compute. The resulting type of modeling is called 'polynomial'.
The model equation is: log odds = -1.794 + 0.169·PSA - 0.00389·PSA² + 0.0000307·PSA³
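A one-line check of this polynomial model, evaluating the log odds and the corresponding predicted probability at an arbitrarily chosen psa level of 20:

import numpy as np

def log_odds_poly(psa):
    return -1.794 + 0.169 * psa - 0.00389 * psa**2 + 0.0000307 * psa**3

print(1 / (1 + np.exp(-log_odds_poly(20.0))))    # roughly 0.57 at a psa level of 20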
[Figures: estimated log odds as a function of psa for the linear model (left) and the polynomial model (right).]
Likewise, we can plot predicted probabilities (which are only transforms of the log odds)
against psa levels:
[Figures: predicted probabilities against psa levels for the linear model (left) and the polynomial model (right).]
We see that the higher log odds estimated by the polynomial model do not translate into
visibly higher probabilities. The models show differences mainly at psa levels of 25 to 75,
where the polynomial model estimates a lower risk than the linear model.
The models yield identical discrimination of true status by psa; the more complex model
using the polynomial does not pay off in terms of discrimination of penetrated and non-
penetrated prostatic capsules.
In this example the nonlinear treatment of PSA does not add significant information to
the model.
Going back to the histograms, we notice that the distribution of the psa level is very skewed.
We could try a logarithmic transformation and repeat the steps of the analysis. By using
the log to base 2 (computed as log2psa = ln(psa)/ln(2)), we obtain a useful interpretation of
the results: a unit increase in log2-psa corresponds to a doubling of the original psa value:
PSA log2-PSA
2 1
4 2
8 3
16 4
32 5
etc.
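A small sketch of the transformation and of the 'per doubling' interpretation (the odds ratio of 1.826 used below is the value reported further down for the log-linear model):

import numpy as np

psa = np.array([2.0, 4.0, 8.0, 16.0, 32.0])
log2psa = np.log(psa) / np.log(2)        # identical to np.log2(psa): gives 1, 2, 3, 4, 5
print(log2psa)

b = np.log(1.826)                        # B on the log2 scale
print(np.exp(b))                         # odds ratio per doubling of psa: 1.826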
The histograms of log2psa now show approximate normal distributions (left: absence of
penetration, right: penetration present):
This transformation is also useful to reduce the influence of outliers, which could have a
disproportionate impact on the results. We can repeat the quartile method on the log-transformed variable:
Log-linear model: [Table 'Variables in the Equation' not reproduced.]
We can also compare the predicted probabilities obtained by these two models:
[Figures: predicted probabilities for the log-linear model (left) and the log-polynomial model (right).]
Since there is not much difference in the predicted probabilities, one may prefer the log-
linear model here. We will now discuss a formal way to test the non-linearity of an effect.
1) First, perform a simultaneous test of log2PSA, log2PSA² and log2PSA³. If
significant, proceed.
2) Second, perform a simultaneous test of log2PSA² and log2PSA³ only, i. e., test the
non-linear terms.
For the first test to be performed, all three parameters are entered within the same block.
The Omnibus test table looks as follows:
                    Chi-square   df   Sig.
Step 1   Step           58,109    3   ,000
         Block          58,109    3   ,000
         Model          58,109    3   ,000
Since there is only one block, the three p-values for Step, Block and the whole Model are
the same. Since the method of variable selection is specified as ‘Enter’, all variables of
one block enter the analysis. (There is no variable selection performed based on
significance.) The block adds significant (p<0.001) information to the null model
(which assumes the same probability of capsule penetration for all patients, irrespective of
their psa levels). Therefore we proceed to the test of non-linearity:
                    Chi-square   df   Sig.
Step 1   Step            2,606    2   ,272
         Block           2,606    2   ,272
         Model          58,109    3   ,000
Now the variables have been entered such that the first block consists of log2PSA only,
and the second block of log2PSA2 and log2PSA3. The second block adds no significant
information, as revealed by a non-significant p-value of 0.272. Therefore, we conclude
that the log-linear effect of PSA (using only log2PSA) is the most adequate one.
The odds ratio estimate associated with each doubling of psa is 1.826; put another
way, the odds for capsule penetration increase 1.8-fold if psa is doubled.
Should we categorize a scale variable prior to regression analysis? In the vast majority of
cases, the answer is no for the following reasons:
• Categorization most often results in a loss of power (because different values are
collapsed into one category).
• The search for a meaningful cut-off value is a 'multiple testing fiesta'.
• Cut-off values are difficult to defend statistically (only by independent validation
or well-conducted internal cross-validation).
• If confidence intervals for cut-off values are computed in a statistically correct
way, they are often very wide (reflecting the low power of cut-point searches
and the resulting uncertainty).
Assume we want to find a meaningful cut-off value for PSA. The following plot shows
the resulting odds ratios, if various candidate values of PSA are used as cut-off values for
dichotomization:
[Figure: odds ratio (with confidence limits) as a function of the PSA cut-off value used for dichotomization.]
Although at a psa level of 3 the highest odds ratio (6.2) would be attained, this plot does
not support any particular cut-off value at which categorization would be meaningful.
Also by selecting the cutpoint on grounds of significance (at 14.5, the lower confidence
limit is maximized), there is no clear answer.
To assess how stable such a data-driven cut-off is, the data set can be resampled repeatedly.
From each resampled data set, a cut-off value is determined such that it yields the highest
odds ratio between those patients with a psa value higher than the cut-off and those with
lower psa values. The frequency with which particular cut-off values are selected in the
resamples can be depicted graphically:
[Figure: relative frequency with which each candidate PSA cut-off value (2 to 30) was selected in the resamples.]
In only about 32% of the resamples would we select a very low cut-off point (2 or 3), while in the
other 68% we would select a cut-off value between 12 and 30. A 95% confidence interval
can be achieved by computing the 2.5th and 97.5th percentile of the cut-off distribution. In
this example, the 95% confidence interval ranges from 2 to 30, thus reflecting the
complete range of cut-off values used as candidate values. This supports our initial
impression that no unique cut-off value can be determined, and that categorization is not
meaningful here.
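A hedged sketch (not necessarily the exact procedure used for the figure above) of such a resampling exercise: in each bootstrap resample, the cut-off that maximizes the odds ratio is recorded, and the selection frequencies are tabulated; file and column names are assumptions:

import numpy as np
import pandas as pd

df = pd.read_csv("pros.csv")                         # columns 'psa' and 'capsule' assumed
candidates = np.arange(2, 31)                        # candidate cut-off values 2..30

def best_cutoff(d):
    best, best_or = None, -np.inf
    for c in candidates:
        high, low = d[d.psa > c], d[d.psa <= c]
        a, b = (high.capsule == 1).sum(), (high.capsule == 0).sum()
        cc, dd = (low.capsule == 1).sum(), (low.capsule == 0).sum()
        if min(a, b, cc, dd) == 0:                   # skip degenerate 2x2 tables
            continue
        odds_ratio = (a * dd) / (b * cc)
        if odds_ratio > best_or:
            best, best_or = c, odds_ratio
    return best

chosen = [best_cutoff(df.sample(n=len(df), replace=True, random_state=i)) for i in range(1000)]
print(pd.Series(chosen).value_counts(normalize=True).sort_index())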
Zero-inflated variables
[Figure: histogram of tumor volume (ultrasound), showing a spike of values at zero.]
A zero-inflated variable is represented by two variables in a regression model:
• The first one distinguishes zero from non-zero values (a binary indicator).
• The second one measures the impact of the scale part of the distribution. This
second part is like a normal scale variable; it could also be modeled in a non-
linear way.
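A minimal sketch of this coding in Python (the column name 'vol' is an assumption); both variables are then entered into the regression model together:

import pandas as pd

df = pd.read_csv("pros.csv")
df["vol_not0"] = (df["vol"] > 0).astype(int)   # binary indicator: non-zero vs. zero tumor volume
# the scale part is the original variable itself, which is 0 at the zero spike
# (it could also be transformed or expanded non-linearly, e. g. by adding vol**2)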
The results from a logistic regression analysis, treating the scale part as linear, are as
follows:
The variable vol_not0 supplies an estimate of the odds ratio for nonzero vs. zero values:
this estimate is 0.761.
Variable Vol supplies the odds ratio estimate for a unit increase in tumor volume, given
that volume is not zero: 0.992. This second part has been estimated assuming a linear
relationship. This assumption can be verified again using the quartile method. In the
following plot, quartile 0 refers to the zero spike of the distribution:
Formal testing for the effect of this variable yields the following two p-values:
[Table: Chi-square, df and Sig. values not reproduced.]
Test of non-linearity of effect: since only one variable is used to model an additional non-
linear effect, we can use the p-value of that variable (vol2): it is 0.064. Therefore, we
conclude that there is no non-linear effect (or that the observed apparent non-linear effect
is also plausible under validity of the linearity assumption). We refit the model using only
vol_not0 and vol and obtain:
Variable selection means to apply some rule in selecting variables for a multivariable
model. The type of this rule depends on the purpose of the multivariable model.
Consider a study which should evaluate a new biomarker for prognostic relevance. In this
case one will have to include all variables in the model that must be considered as
confounders. Recall the definition of confounders: all variables which are correlated with
the biomarker and with the outcome, and which are not intermediate on the causal path
from the biomarker to the outcome (confounders can not be ‘mediators’). It is not
necessary to include variables which are correlated with the biomarker but not with the
outcome.
As an example, consider again the lung cancer data set of Section 3. Assume we want to
evaluate expression of gene 7933 in its prognostic value to predict survival after
diagnosis of lung cancer. 71 deaths (events) have been observed. We consider as potential
confounders stage, gender, smoking and age. Running a Cox regression analysis, we
obtain:
Commonly used variable selection strategies are:
• Backward elimination: start with the 'full' model and eliminate, step by step, all non-
significant variables.
• Forward selection: start with the model containing only the most significant (in
univariable analyses) variable and add variables one by one.
• Select all variables with significant univariable effects (p<0.15).
These variable selection methods have been discussed already in Section 1. We will now
discuss the purposeful selection algorithm in detail.
The decision whether a variable should be considered a confounder or not can be based on how much the B coefficients of the other variables change when that variable is removed from the model (a pre-specified change of at least 15% is used below).
The purposeful selection algorithm selects all significant variables and additionally all
variables that confound the effect of others. The results are generally more stable than
those of backward/forward selection. This means, one will obtain similar results if
analyzing subgroups of the data set. By contrast, backward/forward selection must be
considered to yield biased, i.e., over-optimistic results.
Suppose we want to establish a model for predicting survival of lung cancer patients from
the following variables: gene 7933, stage, age, smoking, gender.
All significant (p<0.10) variables are kept in the model. (Variables not significant in the
model would have to be evaluated if they are confounders of other variables.)
Now, all variables not selected in the first step are included one-by-one to evaluate
whether they are confounders of other variables. We enter these variables in addition to
stage and Gene 7933:
Age is not significant (p=0.331). Hence, it cannot be included on grounds of significance.
Now we have to evaluate whether age should enter the model because of confounding of
other variables. This is done by comparing the B coefficients of stage and Gene_7933
from the model including age with those from the model excluding age. The B coefficient
of variable stage would change from 0.760 to 0.767 (+0.9%) if age was excluded. This
change (+0.9%) is less than the pre-specified 15% needed for the definition of a
confounder. Similarly, the B coefficient of Gene_7933 would change from -0.587 to -
0.614 (-4.6%), which is also less than 15%. Thus, age is not considered as confounder
and will not be included in the model.
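The change-in-estimate check itself is simple arithmetic; with the coefficients quoted above:

b_with_age, b_without_age = 0.760, 0.767
change = 100 * (b_without_age - b_with_age) / b_with_age
print(f"{change:+.1f}%")    # about +0.9%, far below the 15% threshold for confounding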
Nikotin is not significant (p=0.452). Comparing the B coefficients from the models including
and excluding nikotin, we notice a change from 0.813 to 0.767 for stage (-6%, below the
15% threshold), so nikotin is not considered a confounder either.
In this example we stay with the first model, including only stage and gene 7933. In other
examples, confounders may be identified in the second step, and a new ‘working model’
would have to include them. In such cases, the second step is then repeated, comparing
the ‘working model’ to a model including one-by-one the non-selected variables. The
algorithm stops, if the model does not change anymore.
Starting from a multivariable model, potential effect modification must be assessed. This
can be done by including, one-by-one, interaction (product) terms of the variables in the
model. We will show assessment of interactions by means of the multivariable model
from the purposeful selection algorithm. There is only one product term to assess:
stage*gene7933:
The product term is significant at p=0.006. Once the product term was included, the
associated effects of the variables constituting this product term should not be
interpreted! The interaction should be further evaluated, starting with graphical methods.
Here we compute Kaplan-Meier curves, separately for stage=1 and stage=2:
The effect of Gene 7933 for stage 1 patients can be computed as:
Thus, the effect of gene_7933 on survival is modified by stage; it is HR=0.27 for stage 1
patients and HR=1.22 for stage 2 patients. A convenient way to depict this difference is
by means of forest plot:
Group     HR
Total     0.54
Stage 1   0.27
Stage 2   1.22
[Figure: forest plot of these hazard ratios.]
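With the interaction term in the model, the stage-specific hazard ratio of Gene 7933 is obtained from the main effect plus the interaction coefficient. The coefficients below are hypothetical; they are chosen only so that the resulting hazard ratios match the 0.27 and 1.22 reported above, assuming stage is coded 1/2:

import numpy as np

b_gene = -2.820     # hypothetical main effect of gene_7933
b_inter = 1.509     # hypothetical stage*gene_7933 interaction coefficient

for stage in (1, 2):
    print(stage, round(float(np.exp(b_gene + b_inter * stage)), 2))   # 1 -> 0.27, 2 -> 1.22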
If more than two variables are included in a multivariable model, then all pairwise
interactions must be evaluated (one-by-one). To avoid spurious results caused by multiple
testing error, the significance level for these tests for interaction should be adjusted (by
dividing by number of interactions).
The best way to validate a model is to apply the obtained risk score (result from inserting
covariate values into the estimated model equation) in an independent sample and
compare the computed risk scores with the true outcome in those patients. This
evaluation should then show a clear distinction, e.g., between short-term and long-term
survivors.
• 10-fold cross-validation: Split the data set randomly into ten partitions of equal
size. Use nine of these partitions as the training set, and the tenth as the test set.
Develop the model on the training set, and compute risk scores for the test set.
Repeat, denoting a different partition as the test set, until all partitions have
been used as test sets. Repeat the whole process several times (e. g. 10 times), each
time starting from a different random split. Average the risk scores over the cross-
validation loops to obtain cross-validated risk scores.
• Bootstrap cross-validation: Draw B=1000 resamples of size N with replacement
from the original data set. In these data sets, some patients will appear multiple
times while others will not appear at all. Develop the model on the resample (the
training set) and evaluate risk scores on those patients who have not been included
in that particular resample (the test set). After all resamples have been processed,
compute average risk scores for each patient over those resamples where the
patient was part of the test set.
• Leave-one-out cross-validation: Estimate the model parameters from N-1 subjects
and compute the risk score for the Nth subject. Repeat for all N subjects.
• Evaluate the association of the cross-validated risk scores with the outcome
variable (Kaplan-Meier curves, a cross-validated measure of discrimination such as
the c-index, etc.).
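As a sketch of the repeated 10-fold cross-validation loop described above (Python; 'fit_model' and 'risk_score' are placeholders for whatever model development strategy is being validated):

import numpy as np
from sklearn.model_selection import KFold

def cross_validated_scores(df, fit_model, risk_score, repeats=10, folds=10):
    scores = np.zeros((repeats, len(df)))
    for r in range(repeats):
        kf = KFold(n_splits=folds, shuffle=True, random_state=r)
        for train_idx, test_idx in kf.split(df):
            model = fit_model(df.iloc[train_idx])                        # develop model on the training set
            scores[r, test_idx] = risk_score(model, df.iloc[test_idx])   # score the held-out test set
    return scores.mean(axis=0)                                           # average over the repeats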
In this overview, it was mentioned that a model is developed on the training set. This
model development could be restricted to estimate the parameters of a given model, but
could also involve variable selection, if there is uncertainty about the variables that
should be included into the model. In this latter case, one may have a different set of
explanatory variables in each of the training sets. Here, cross-validation is used to
evaluate the adequacy of this model building strategy.
Here we present some results from bootstrap cross-validation of the model including the
interaction for the lung cancer data set. Cross-validated risk scores have been computed
using the bootstrap method for each patient. Model development in each resample was
based on forward selection of main effects of stage, gender, age, nikotin, and gene_7933,
and all pairwise interactions. Only effects significant at the 0.05 level were allowed to
enter the model. Based on the original data set, the following model, including stage,
gene_7933 and their interaction would be obtained:
For computation of risk scores of the patients in the test sets (those not selected in a
resample), that model was used that was obtained in that particular resample:
[Histogram of the cross-validated risk scores (cv_xbeta); vertical axis: Percent]
In the following the risk scores were stratified into quartiles. Kaplan-Meier curves show
the association of risk group with survival:
[Kaplan-Meier curves for the four risk score groups; vertical axis: Survival Distribution Function, horizontal axis: Survival]
This analysis validates the strategy of model development; in our case forward selection
including interactions.
[Survival ROC curve for prediction of mortality before 24 months]
In this ROC curve, a true positive result is the prediction of mortality before 24 months
for patients with high risk scores who have died before 24 months. A false positive result
is a wrongly predicted death before 24 months. Comparing the true status of the patients
at 24 months (dead or alive), and using various cut-off values for the risk scores to
predict either ‘dead’ (for patients with risk scores exceeding the cut-off value) or ‘alive’,
an ROC curve is obtained.
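To illustrate this construction, here is a naive Python sketch that classifies patients by a known 24-month status and ignores the censoring corrections that proper survival ROC methods apply; the data are simulated, and all variable names are assumptions.

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    # Simulated cross-validated risk scores and 24-month status
    # (1 = died before 24 months, 0 = known to be alive at 24 months).
    rng = np.random.default_rng(2)
    cv_xbeta = rng.normal(size=80)
    dead_24 = (rng.random(80) < 1 / (1 + np.exp(-cv_xbeta))).astype(int)

    # Every cut-off on the risk score yields one sensitivity / 1-specificity pair.
    fpr, tpr, cutoffs = roc_curve(dead_24, cv_xbeta)
    print("area under the 24-month ROC curve:", round(auc(fpr, tpr), 2))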
The survival ROC curves for 24 months and 60 months indicate that the ability of the
model to predict survival at these time points is rather low; this constitutes a typical
situation for survival data.
[1] Hosmer D, Lemeshow S. Applied Logistic Regression. New York: Wiley, 2000.
[2] Bursac Z, Gauss CH, Williams DK, Hosmer D. A Purposeful Selection of Variables
Macro for Logistic Regression. SAS Global Forum 2007, Paper 173-2007.
http://www2.sas.com/proceedings/forum2007/TOC.html
Concluding remarks
The data files that are referred to in these notes can be downloaded from
http://www.meduniwien.ac.at/msi/biometrie/lehre
(Click on the link Medical Biostatistics 2)
Define a cut-point
Choose fev as input variable. Define fev80 as output variable, labelled ‘FEV cat 80’ (or
something similar). Press ‘Change’ to accept the new name.
Fill in the value ‘80’ (without quotation marks) in the field ‘Range, value through
HIGHEST’, and define 1 in the field ‘New Value’. Press ‘Add’ to accept this choice. In
the field ‘Range, LOWEST through value:’, fill in ‘80’, and in the field ‘New Value’,
define 0. Again, press ‘Add’ to confirm. Press ‘Continue’. Back at the first dialogue,
press ‘OK’.
A new variable, ‘FEV80’ has been added to the data sheet. We learn that the value 80
was categorized as 1. This is controlled by the sequence we use to define recoding
instructions. In our example ‘80 thru Highest’ precedes ‘Lowest thru 80’. Thus, the
program first applies the first instruction to all subjects. As soon as a subject is
categorized, it will not be recoded again by a subsequent instruction.
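For comparison, the same dichotomization can be written as a one-line computation outside SPSS. This is only an illustrative Python/pandas sketch; the column name fev and the toy values are assumptions.

    import pandas as pd

    df = pd.DataFrame({"fev": [65, 80, 95, 120]})    # toy values
    # '80 thru Highest' -> 1 is applied first, so the value 80 is categorized as 1;
    # everything below 80 -> 0
    df["fev80"] = (df["fev"] >= 80).astype(int)
    print(df)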
We can now use SPSS to compute a cross table of the diagnostic test and the disease
status. Choose
Analyze-Descriptive Statistics-Crosstabs…
Press ‘Cells…’ and choose ‘Column’ percentages to obtain the sensitivity and specificity
of the test.
                                                       Pneumokoniosis
                                                       absent    present   Total
FEV cat 80   positive test   Count                          6         22      28
                             % within Pneumokoniosis   46,2%      81,5%   70,0%
             negative test   Count                          7          5      12
                             % within Pneumokoniosis   53,8%      18,5%   30,0%
Total                        Count                         13         27      40
                             % within Pneumokoniosis  100,0%     100,0%  100,0%
The sensitivity is defined as the true positive rate (among the subgroup of diseased),
which can be read as 81.5%. Similarly, the specificity is defined as true negative rate and
computes to 53.8%.
To obtain positive and negative predictive value, we repeat the analysis, requesting ‘Row
percentages’ instead of column percentages:
                                                       Pneumokoniosis
                                                       absent    present   Total
FEV cat 80   positive test   Count                          6         22      28
                             % within FEV cat 80       21,4%      78,6%  100,0%
             negative test   Count                          7          5      12
                             % within FEV cat 80       58,3%      41,7%  100,0%
Total                        Count                         13         27      40
                             % within FEV cat 80       32,5%      67,5%  100,0%
Looking at the same cells as before, we read a positive predictive value of 78.6% and a
negative predictive value of 58.3%.
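All four quantities can also be computed directly from the cell counts of the cross table; a small Python check using the counts above:

    # Cell counts from the cross table
    tp, fp = 22, 6    # positive test: diseased / non-diseased
    fn, tn = 5, 7     # negative test: diseased / non-diseased

    sensitivity = tp / (tp + fn)   # 22/27 = 0.815
    specificity = tn / (tn + fp)   # 7/13  = 0.538
    ppv = tp / (tp + fp)           # 22/28 = 0.786
    npv = tn / (tn + fn)           # 7/12  = 0.583
    print(round(sensitivity, 3), round(specificity, 3), round(ppv, 3), round(npv, 3))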
ROC curves
Analyze-ROC Curve…
It’s crucial to check ‘Smaller test result indicates more positive test’, as small values of
FEV-1 indicate a poor lung function. Confirm by pressing ‘Continue’ and ‘OK’.
Sometimes it’s useful to further examine the coordinates of the ROC curve (e. g., for
defining an optimal cut value). The coordinates are output if the corresponding box
‘Coordinate points of the ROC curve’ is checked:
This table can be copied into Excel, say, for further analyses.
Logistic regression
We start with the low birth weight data set (lowbwt.sav). Logistic regression is called by
choosing from the menu
Analyze-Regression-Binary logistic…
Press ‘Options…’ and check ‘Hosmer-Lemeshow goodness-of-fit’ and ‘CI for exp(B):
95%’:
The program not only outputs the results of the procedure, but also some steps in
between. Results labeled as ‘Block 0’ refer to pre-fit results, i.e. ‘what happens if we
include certain variables in the model’:
We see, at step 0, only the constant is in the model, and all other variables are not. SPSS
performs some tests evaluating whether inclusion of these variables would significantly
improve this null model. We see that with the exception of AGE, inclusion of any other
variable would improve the model significantly.
Now let’s have a look at Block 1. We requested to enter all specified covariates
simultaneously (as defined by ‘Method: Enter’ in the logistic regression dialogue). If the
‘Enter’-method was requested, SPSS performs one single step: entering all variables. We
will later choose other methods, which may produce more than one step. The results
contain several tables, which have been explained previously:
The predicted probabilities are now contained in a new column of the data editor:
It is often desirable to check for interactions (effect modifications) that may exist
between a model’s covariates. For this purpose, forward selection could be applied,
starting with a model containing all variables in question.
The forward selection of significant interactions can now be performed as follows: first,
open the logistic regression dialogue:
Here, we select all possible pairs of covariates (click on the first covariate, then hold the
Ctrl key and click on the second covariate), and press the button ‘>a*b>’ to include their
interaction as a candidate covariate in the model. After having defined all possible pairs,
we change the ‘Method:’ to ‘Forward: Conditional’:
The most important table is the one labeled ‘Variables in the Equation’, and from this
table only the results for ‘Step 2’ are interesting (as these are the final results):
Two interactions have been included in the model: Age by HT and LWT by SMOKE.
This means that the effect of age depends on history of hypertension (if a significance
level of 0.3 was postulated!), and that the effect of mother’s weight depends on her
smoking status.
For non-hypertensive mothers, the effect of age is -0.056 or, expressed as an odds ratio,
0.945, which means that with each additional year of age, the odds of delivering a low
birth weight baby decrease by about 5.5%.
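The conversion from a regression coefficient to an odds ratio and a percentage change is simply exponentiation; a one-line Python check (the printed value may differ slightly from the 0.945 in the SPSS table, presumably because the table uses the unrounded coefficient):

    import math

    b_age = -0.056                  # coefficient of age for non-hypertensive mothers
    odds_ratio = math.exp(b_age)    # approximately 0.95
    print(round(odds_ratio, 3), f"{(1 - odds_ratio) * 100:.1f}% decrease per year")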
Exercises
1. Ultrasound techniques. Open the SPSS data set ‘stenosis.sav’.
Using the ultrasound methods ‘ultrasound dilution technique’ (UDT) and
‘color Doppler ultrasonography’ (CDUS), perfusion was evaluated in 59
patients on renal replacement therapy. Presence or absence of stenosis
was evaluated with fistulography.
Which cutpoints would you choose for either method, if a test for stenosis
should at least have
All methods of survival analysis are demonstrated using the data set lungcancer.sav. In
this data set, we have the following variables:
Kaplan-Meier analysis
First we compute Kaplan-Meier curves for groups defined by stage. Call the dialogue
Analyze-Survival-Kaplan-Meier… and define:
We select ‘Log rank’ and ‘Breslow’ and confirm by clicking on ‘Continue’. At the
submenu ‘Options’, we request survival tables, quartiles and survival plots:
The subsequent table shows the median survival, and 25th and 75th percentiles in both
groups and overall:
Percentiles
[Table of the 25th percentile, median, and 75th percentile of survival by stage and overall; not reproduced here]

Overall Comparisons
                                   Chi-Square   df   Sig.
Log Rank (Mantel-Cox)                  11,229    1   ,001
Breslow (Generalized Wilcoxon)         12,389    1   ,000
Test of equality of survival distributions for the different levels of stage.
Please note that these curves start at the shortest death time. In our analysis, the shortest
death time is close enough to 0 such that the plot seems to start at 0. In some analyses
however, it may be necessary to add, for each group, a ‘ghost observation’ with survival
time and censoring indicator both set to 0 to have the Kaplan-Meier curves starting at 0.
The status variable has to be defined the same way as before. In the Options subdialogue,
we request 95% confidence limits for Exp(B):
-2 Log Likelihood                      500,588
Overall (score):                       Chi-square 15,383   df 2   Sig. ,000
Change From Previous Step:             Chi-square 14,219   df 2   Sig. ,001
Change From Previous Block:            Chi-square 14,219   df 2   Sig. ,001
a. Beginning Block Number 0, initial Log Likelihood function: -2 Log likelihood: 514,807
b. Beginning Block Number 1. Method = Enter
This hypothesis is clearly rejected. At least one model effect is not zero in our model. The
next table is the most important one:
Variables in the Equation
This table contains regression coefficients (B), their standard errors (SE), the Wald
statistics, the p-values (Sig.), and the estimated hazard ratios (Exp(B)) and associated
95% confidence intervals.
In the Save subdialogue, we may request to save certain statistics into the data matrix.
The following could be of interest:
Partial residuals can be plotted against survival time by calling the chart builder (Graphs-
Chart Builder), dragging the symbol of Scatter/Dot into the preview, and defining the
partial residuals for Gene 1791 as vertical axis variable, and survival (the survival time)
as horizontal axis variable. Change the type of the partial residual variable (which is
erroneously set to ‘nominal’) to scale:
Using linear regression, we can test whether partial residuals increase or decrease with
time. Choose Analyze-Regression-Linear… and define Partial residual for Gene 1791 as
dependent variable, and survival (the survival time) as independent variable as shown
below:
                   Unstandardized Coefficients   Standardized Coefficients
Model              B         Std. Error          Beta           t        Sig.
1   (Constant)     -,205     ,094                               -2,178   ,033
    Survival       ,008      ,003                ,338            2,754   ,008
a. Dependent Variable: Partial residual for Gene_1791
We see that survival time indeed has a significant effect on the residuals; thus we
conclude that the effect of Gene 1791 changes with time.
Interactions of covariates with time (time-dependent effects) can be specified using the
menu Analyze-Survival-Cox w/ Time-Dep Cov…. First we have to define how the
interaction term should involve time (either linearly, or as logarithm etc.). We first
choose a linear interaction with time, by moving the system variable T_ (standing for
‘time’) into the field ‘Expression for T_COV_’. This variable denotes the survival time.
Please note that the variable ‘survival’ may not be used instead of T_ to define
interactions with time!
We see that the effect of Gene_1791 indeed depends on time. At the beginning of the
follow-up time (when time=0), we observe an adjusted hazard ratio of 0.211. This hazard
ratio increases by 3.8% (or: ‘multiplies by 1.038’) with every month.
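The implied time-dependent hazard ratio can be written out explicitly as HR(t) = 0.211 * 1.038^t, with t in months. A minimal Python check of this arithmetic:

    import math

    hr_0 = 0.211      # hazard ratio of Gene_1791 at time 0
    factor = 1.038    # multiplicative change per month (exp of the T_COV_ coefficient)

    for month in (0, 12, 24, 40):
        print(month, round(hr_0 * factor ** month, 3))
    # The effect crosses 1 at about log(1/0.211)/log(1.038), i.e. roughly 42 months.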
When dividing the time axis and fitting separate models for the two subgroups, we must
follow the two principles that have already been outlined above (please note that the
cutpoint of 40 months is completely arbitrary; as an alternative, we could choose the time
at which half of the total number of events have occurred):
• All subjects enter the first subgroup. The censoring indicator is set to ‘censored’
for all subjects who lived longer than 40 months, and the survival time of these
subjects is set to 40 months.
• Only the subjects who lived longer than 40 months enter the second subgroup.
The first subgroup analysis needs a redefinition of survival time. All subjects that lived
longer than 40 months have to be censored at 40 months. This is done in several steps.
For the first subgroup analysis, we create a new variable ‘upto40’ containing the survival
times redefined by the principle given above:
Then, we create a new censoring indicator. It should contain the true dead/alive status for
subjects who were followed up for no more than 40 months. For all other subjects, it
should indicate ‘alive’, as these subjects lived longer than 40 months.
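For readers who prefer syntax to dialogues, the two recoding steps can be sketched in Python/pandas. The column names survival and status and the toy values are assumptions; the sketch only illustrates the censoring logic, not the SPSS workflow.

    import pandas as pd

    df = pd.DataFrame({"survival": [12, 35, 47, 60],
                       "status": [1, 0, 1, 0]})        # toy rows; 1 = dead, 0 = alive

    # First subgroup: censor everybody who lived longer than 40 months at 40 months
    df["upto40"] = df["survival"].clip(upper=40)
    df["status40"] = df["status"].where(df["survival"] <= 40, other=0)   # 0 = 'alive'

    # Second subgroup: only subjects with survival > 40 months, original time and status
    after40 = df[df["survival"] > 40]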
Now we call Cox regression, using the redefined survival time and status:
The second subgroup analysis can be obtained by simply requesting a Cox regression
analysis with the original survival time and censoring indicator, but restricting the
analysis to all subjects with a survival time longer than 40 months. Choose Data-Select
Cases and request ‘If Survival > 40’:
We see that the hazard ratio estimates are quite different from the estimates we computed
for the first 40 months. (Although neither estimate is significantly different from 1, the
two estimates for Gene_1791, 1.783 and 0.413, differ significantly from each other.)
Exercise
Consider the data set stud1234.sav which contains data from 1118 breast cancer
patients who were treated at the former 1st Department of Surgery, University of
Vienna. The data set contains the following variables:
id Patient identifier
birth Date of birth
op_dat Date of surgery
rez_dat Date of recurrence of cancer
last_dat Date last seen or date of death
surs Status indicator (coded as 0 – alive and 1 - dead)
1. Compute the overall survival time in months for each patient! (Hint: Use
Transform-Compute and the function CTIME.DAYS(last_dat -
op_dat)/30.4.)
4. Assess the assumptions of the Cox regression model for the present
analysis!
You may assign labels to the raw and percent change scores in the Variable View. Now
the groups can easily be compared by calling Graphs-Chart Builder and choosing the
Boxplot from the Gallery:
Remember to define the group codes. In our data set, the groups are coded as 1 and 2.
Group                           N     Mean      Std. Deviation   Std. Error Mean
Raw change score  Acupuncture   25    20,8000   15,82982         3,16596
                  Placebo       23    11,1304   15,61847         3,25668
Coefficients(a)
[Table of unstandardized and standardized coefficients; not reproduced in full]
The coefficient B for Group is -12.019. Technically, a change in Group by 1 unit means a
reduction of the post-treatment score by 12.019. Practically, group 2 is placebo and group
1 is acupuncture. Thus, the placebo group has a baseline-adjusted average post-treatment
score that is 12.019 units lower than that of the acupuncture group.
Individual curves, plotted into several panels, can be obtained by choosing Graphs-
Legacy Dialogs-Interactive-Line:
This message reminds us that the PatID variable should be of nominal type but it is
(erroneously) of scale type. We can immediately change the type of PatID by selecting
‘Convert’. The same warning pops up if we move Group into the field ‘Style’ (to have
different line styles for either group):
A plot of mean curves with standard deviations represented as error bars can be obtained
in a similar way, just omitting to specify PatID to define panels on the ‘Assign Variables’
view, and moving Group into the ‘Panel variables’ field:
Summary measures suitable for growth curves are exemplified on the data set pigs.sav
which contains serial weight measurements on 48 pigs. Clearly, the data is in long
format:
To compute the weight change during 9 weeks of diet, we have to extract the first and last
weight measurements and compute their difference. First, sort the data set by pig ID and
week (Data-Sort Cases):
A new variable is created containing the weight gain for each of the pigs as a summary
measure in the new data set. The distribution of these values can be depicted using a
histogram (Graphs-Interactive-Histogram).
Then click OK. The individual regression equations are not of interest per se. We are
only interested in the predicted values, which have been added to the data set in the data
editor:
[Histogram of the weight gain summary measure; vertical axis: Count]
We start with the data set pigs.sav. Use file splitting by pig ID (Data-Split file…) as in
the preceding section. Choose Analyze-Regression-Linear and select weight and week as
dependent and independent variables, respectively. Now press ‘Save…’:
Now, there is only one row per pig, containing each individual’s regression equation. For
pig no. 1, the regression equation is weight = 19.75 + 5.78 * week. For pig no. 2, it is
17.42 + 6.78 * week, etc. Clearly, the slopes are contained in the column named ‘week’
(as these numbers are the regression coefficients corresponding to the independent
variable ‘week’). The slopes can be interpreted as the average weight gain per week and
are distributed as shown by the histogram below. Thus, the slopes are exactly
proportional to the stabilized weight gain from week 1 to 9 computed above (slope =
stabilized weight gain per 1 week).
[Histogram of the individual regression slopes (variable ‘week’); vertical axis: Count]
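For comparison outside SPSS, both summary measures can be computed per pig in a few lines of Python/pandas. This is only an illustrative sketch; the column names id, week and weight are assumptions, and the toy values are built from the two regression equations quoted above rather than from the actual pigs.sav data.

    import numpy as np
    import pandas as pd

    # Toy long-format data: 2 pigs, weeks 1-9, weights following the equations above
    pigs = pd.DataFrame({
        "id": np.repeat([1, 2], 9),
        "week": np.tile(np.arange(1, 10), 2),
        "weight": np.concatenate([19.75 + 5.78 * np.arange(1, 10),
                                  17.42 + 6.78 * np.arange(1, 10)]),
    })

    pigs = pigs.sort_values(["id", "week"])
    # Summary measure 1: weight gain = last minus first measurement
    gain = pigs.groupby("id")["weight"].agg(lambda w: w.iloc[-1] - w.iloc[0])

    # Summary measure 2: individual regression slope of weight on week
    slope = pigs.groupby("id").apply(
        lambda g: np.polyfit(g["week"], g["weight"], 1)[0])

    print(gain)    # total gain per pig
    print(slope)   # average weight gain per week; proportional to the gain (gain/8 here)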
First, sort the data set by patient ID, descending AZT value and ascending time (Data-
Sort Cases):
Next, choose ‘Data-Aggregate’. Define ‘PatID’ and ‘Group’ as Break Variable(s) and
move AZT and time into the field ‘Summaries of Variable(s)’. For azt, change the
summary function to ‘Maximum’ and for time, change to ‘First value’. Select ‘Create a
new dataset containing only the aggregated variables’ and give it a name (e. g.,
CmaxTmax):
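The same Cmax/Tmax aggregation can be sketched in Python/pandas; the column names mirror the description above, but the data values are made up for illustration.

    import pandas as pd

    azt = pd.DataFrame({
        "PatID": [1, 1, 1, 2, 2, 2],
        "Group": [1, 1, 1, 2, 2, 2],
        "time":  [0.5, 1.0, 2.0, 0.5, 1.0, 2.0],
        "azt":   [1.2, 2.8, 1.9, 0.9, 1.5, 2.2],
    })

    # Sort so that the row with the highest AZT value comes first per patient,
    # then take that first row: its azt is Cmax and its time is Tmax.
    cmax_tmax = (azt.sort_values(["PatID", "azt", "time"],
                                 ascending=[True, False, True])
                    .groupby(["PatID", "Group"], as_index=False)
                    .first()
                    .rename(columns={"azt": "Cmax", "time": "Tmax"}))
    print(cmax_tmax)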
The area under the curve can be computed using the trapezoidal rule. First, revert to the
original AZT.sav data set. Next, sort by PatID and ascending time (Data-Sort Cases).
Then, call ‘Transform-Compute’. We create a new variable ‘height’, the mean height of
each trapezoid. The numeric expression should read (azt + lag(azt)) / 2. Remember that the lag function
moves the preceding azt value into the respective subsequent line. We do not want this
computation for the first line of each patient (since there is no rectangle to compute).
Therefore, click on ‘If…’ and request ‘time > 0’:
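The complete trapezoidal-rule computation can likewise be written in a few lines of Python/pandas. This is again only a sketch with invented values; it mirrors the (azt + lag(azt))/2 logic above and additionally multiplies each mean height by the corresponding time difference before summing per patient.

    import pandas as pd

    azt = pd.DataFrame({
        "PatID": [1, 1, 1, 2, 2, 2],
        "time":  [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
        "azt":   [0.0, 2.8, 1.9, 0.0, 1.5, 2.2],
    })

    azt = azt.sort_values(["PatID", "time"])
    # Area under the concentration-time curve per patient (trapezoidal rule):
    # mean of consecutive azt values times the time difference, summed
    auc = azt.groupby("PatID").apply(
        lambda g: ((g["azt"] + g["azt"].shift()) / 2 * g["time"].diff()).sum())
    print(auc)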
Open the data set cervpain-long0.sav. This data set contains the data in long format, but
with an additional line (‘week’=0) for the baseline measurement:
We move patid into the field ‘Subjects’ and ‘week’ into the field ‘Repeated’. As a
covariance structure, we choose ‘Unstructured’. At the next menu, we select VAS as the
dependent variable, week and treatment as factors, and the baseline VAS measurement as
the covariate:
Finally, select vas0_mean (the baseline value) and move it as main effect or factorial into
the Model field.
After having defined these model terms, we click on ‘Continue’ and OK. For the time
being, we are only interested in the Akaike information criterion of the model, to judge
the adequacy of the assumed covariance structure:
Information Criteria(a) (unstructured covariance)
-2 Restricted Log Likelihood            167,835
Akaike's Information Criterion (AIC)    209,835
Hurvich and Tsai's Criterion (AICC)     221,103
Bozdogan's Criterion (CAIC)             286,367
Schwarz's Bayesian Criterion (BIC)      265,367
The AIC is 209.835 for the unstructured covariance. We repeat the analysis, this time
specifying the Toeplitz structure at the very beginning:
Information Criteria(a) (Toeplitz covariance)
-2 Restricted Log Likelihood            200,511
Akaike's Information Criterion (AIC)    212,511
Hurvich and Tsai's Criterion (AICC)     213,377
Bozdogan's Criterion (CAIC)             234,377
Schwarz's Bayesian Criterion (BIC)      228,377
Finally, we fit the model once more, this time with a first-order autoregressive (AR(1))
covariance structure:
Information Criteria(a) (AR(1) covariance)
-2 Restricted Log Likelihood            203,466
Akaike's Information Criterion (AIC)    207,466
Hurvich and Tsai's Criterion (AICC)     207,585
Bozdogan's Criterion (CAIC)             214,755
Schwarz's Bayesian Criterion (BIC)      212,755
This structure yields the smallest AIC. Therefore, we continue with the AR(1) structure.
We learn that the interaction is not significant and it can therefore be dropped from the
model. We recall the menu ‘Analyze-Mixed models-Linear’ with the same specifications,
but remove the interaction in the ‘Fixed effects’ submenu (select the interaction and press
‘Remove’):
Now we proceed by clicking ‘Continue’. Next, we call the ‘Statistics’ submenu and select
the following:
We see that the treatment effect is 1.58 (p=0.064). This means that, on average, the
placebo-treated group has VAS scores about 1.6 units higher than those of the
electro-stimulated acupuncture group. Since we could not find a significant interaction of
treatment effect and time, we may assume that the treatment effect is constant over the
whole range of follow-up (6 weeks).
The baseline VAS has no effect on later VAS measurements (p=0.879). This could be a
result of the very small range of baseline VAS measurements (there was not much
difference in the baseline VAS measurements between the patients).
A longitudinal data set, i.e. a data set involving repeated measurements on the same
subjects, can be represented in two formats:
• The ‘long’ format: each row of data corresponds to one time point at which
measurements are taken. Each subject is represented by multiple rows.
• The ‘wide’ format: each row of data corresponds to one subject. Each of several
serial measurements is represented by multiple columns.
The following screenshots show the cervical pain data set in long …
With SPSS, SAS and other statistics programs, it is possible to switch between these two
formats. We exemplify the format switching on the cervical pain data set.
We start with the data set cervpain-wide.sav as depicted above. From the menu, select
Data-Restructure:
After pressing ‘Finish’, the data set is immediately restructured into long format. You
should save the data set now using a different name.
We start with the data set in long format (cervpain-long.sav). Select Data-Restructure
from the menu, and choose ‘Restructure selected cases into variables’:
The order of the new variable groups is only relevant if more than one variable is serially
measured. In our case, we have only the VAS scores as repeated variable. Optionally, one
may also create a column which counts the number of observations that were combined
into one row for each subject.
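For completeness, the same wide/long restructuring can be done in Python/pandas with melt and pivot; a minimal sketch with hypothetical columns patid, treatment and vas1, vas2:

    import pandas as pd

    wide = pd.DataFrame({"patid": [1, 2], "treatment": [1, 2],
                         "vas1": [5.0, 6.0], "vas2": [4.5, 5.5]})

    # Wide -> long: one row per patient and week
    long = wide.melt(id_vars=["patid", "treatment"],
                     value_vars=["vas1", "vas2"],
                     var_name="week", value_name="vas")
    long["week"] = long["week"].str.replace("vas", "").astype(int)

    # Long -> wide: one row per patient, one column per week
    back = long.pivot(index=["patid", "treatment"], columns="week", values="vas")
    print(long)
    print(back)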
Exercise 1
Q1: Aggregate the data such that for each patient, we have only the mean VAS
(over weeks 1-6) as outcome variable. The data set should finally look like the
following:
Q4: Compare age and sex between Placebo and Verum patients.
Q5: Repeat the ANCOVA (Q3), but this time add age and sex as independent
variables.
Suppose the following trial on the effects of diet and different types of exercise on
the pulse measured before, during and after running. The pulse of each subject was
measured in three trials, and these have been entered into an SPSS dataset as
three lines per subject. The variable Diet denotes dietary preference, with values of
1 signifying meat eaters and 2 signifying vegetarians. Finally, the variable Exertype
is the type of exercise assigned to the subjects, with 1 signifying aerobic stairs, 2
signifying racquetball, and 3 signifying weight training.
The data are saved in data file exercise.sav.
To perform a logistic regression analysis with a nominal variable, first call Analyze-
Regression-Binary logistic and use as dependent variable ‘Penetration of prostatic
capsule [capsule]’, and as covariate ‘Digital rectal exam [dpros]’:
and move Digital rectal exam[dpros] into the field ‘Categorical Covariates’. Change the
Reference Category to ‘First’, and confirm with ‘Change’. Click on Continue, finally on
OK:
For a bar chart of the proportion of penetration of prostatic capsule, call the chart builder
and move the bar chart to the preview window. Drag ‘Digital rectal exam’ from the list of
variables to the x-axis:
Change the number format such that 1 decimal place is displayed. Click on Apply and
close the Chart Editor. The ‘Mean’ can be interpreted as proportion of patients with
penetration.
To categorize PSA level into the 4 quartiles, call the menu Transform-Rank Cases:
Drag PSA into the field Variable(s). Click on ‘Rank Types…’, deselect rank and select
‘Ntiles: 4’:
Select psa in the table, click on ‘N% Summary Statistics’, change ‘Mean’ to ‘Median’,
click on ‘Apply to Selection’:
Double-click on the ‘Variables in the Equation’ table. Select the three coefficients, copy
them and insert them into a new SPSS data table:
Call Transform-Compute. Write psa2 into the field ‘Target Variable:’ and psa ** 2 into
the Numeric Expression field. Click OK. Do the same with psa3 and psa**3. Two new
variables are created, psa2 and psa3.
Call logistic regression with squared and cubic psa; Analyze-Regression –Binary logistic:
In the first block, enter only psa:
Click on OK. In the table labelled ‘Omnibus Tests of Model Coefficients’, the p-value
labelled ‘Model’ with 3 degrees of freedom (df) is <0.001. This indicates that psa as a
whole is relevant for penetration. The Block p-value with 2 df is 0.021, indicating
relevance of the non-linear effect of psa.
These can be plotted against PSA, or used for ROC curve computation (see Section 2).
The squared and cubic transformations of variable log2PSA can be obtained as outlined
above.
Consider the prostatic cancer study data set pros.sav. The goal of the analysis
should be to determine whether variables measured at a baseline exam can be
used to predict whether the tumor has penetrated the prostatic capsule. Estimate
a multivariable model with capsule penetration (CAPSULE) as the dependent
variable, and all other variables (age, race, DPROS, DCAPS, PSA, VOL,
GLEASON) as potential independent variables.
a. Perform univariable analyses, modeling each of the variables in an
appropriate way.
b. Perform multivariable analysis, decide on which variables to keep in
the model.
c. Check interactions and non-linearities.
d. Do goodness-of-fit tests and compute the overall c-index. As another
index of explained variation (R-squared), compute the squared
Pearson correlation coefficient of predicted probabilities with true
status of capsule penetration.
e. Describe your results as for a medical journal. Use appropriate figures
(if needed) and tables. Describe methods and results concisely.