STATISTICS IN MEDICINE, VOL. 12, 1325-1338 (1993)

INFLUENCE OF MODEL-BUILDING STRATEGIES ON THE RESULTS OF A CASE-CONTROL STUDY

MARIA BLETTNER
German Cancer Research Centre, Division of Epidemiology, Im Neuenheimer Feld 280, W-6900 Heidelberg, Germany

AND

WILLI SAUERBREI
Institute of Medical Biometry and Informatics, University of Freiburg, Stefan Meier Str. 26, W-7800 Freiburg, Germany

SUMMARY
We evaluate the analysis of a case-control study in which many variables were investigated simultaneously.
The purpose of the study was to explore some rather unspecific hypotheses about potential risk factors for
adult brain tumour. Our aim is to show that in the analysis of case-control studies many decisions are
necessary which are usually not published in detail. As in most studies these decisions are made during
analysis and are data dependent. We demonstrate that the data allow sensible alternative decisions which
influence the final results. A sensitivity analysis of several aspects of the analysis such as different
measurement scales, variable selection, handling of missing values and interactions was performed, and
demonstrated variation in the results based on the strategy for analysis. We conclude that details of the final
analysis should be decided in the planning phase of a case-control study, and that more details of
model-building strategies must be published. Results from a study where the analysis is highly data
dependent must be interpreted with caution and validation of the results with new studies is essential.

1. INTRODUCTION
When conducting a case-control study, information is collected on possible risk factors of interest
and on potentially confounding variables. The aim of the analysis is to identify variables which
influence the estimate of the effect of exposure on disease (confounders) and those that may be
important risk factors. Ideally, we want to detect risk factors (or stable indicators of risk factors)
so that the results can be extrapolated and help to understand the development of the disease of
interest. The odds ratio (or approximate relative risk) is usually estimated using logistic regression
analysis, which also allows control of confounding and evaluation of combined effects and
interactions.1 This approach requires a strategy for building regression models, including decisions
about which measurement level or categorization to use and which variables to include in
the model. Many of these problems have been discussed recently,2 but to our knowledge little
systematic research has been done to investigate the effect of strategies of model building on the
results of a case-control study.
Here we use the data from a case-control study on brain tumours to illustrate methodological
aspects of the analysis. The results have been published elsewhere.3,4 It is not our intention to
criticize this study but to use it for illustration. The aim of our paper is to demonstrate the
complexity of the modelling strategy in the analysis of a large data set from cancer epidemiology.
We also want to show that decisions made prior to the analysis usually reported may influence
the conclusions, and consequently that more careful interpretations of the results are necessary,
especially when many weak risk factors are investigated.

0277-6715/93/141325-14$12.00    Received July 1992
© 1993 by John Wiley & Sons, Ltd.    Revised January 1993

2. MATERIAL AND METHODS

Study design and data


A case-control study of primary brain tumour was carried out in the Rhein-Neckar-Odenwald
area of Germany. The study included 231 incident cases diagnosed between January 1987 and
December 1988. In addition 581 controls, frequency matched by age and sex, were randomly
selected from the residential registers of the study area. The response rate was high for cases (97.8
per cent) and satisfactory for controls (72 per cent). The total number of subjects was 644 (226
cases and 418 controls).
The study was one part of a large international case-control study coordinated by the
International Agency for Research on Cancer, to investigate some possible risk factors for adult
brain tumour, such as exposure to nitrosamine or to electromagnetic fields. However, the
protocol did not detail the specific hypotheses. One advantage of such international collaboration
is the possibility of building a model with one part of the data set and testing it with the other(s)
(model validation). Additional to the main research hypothesis, namely the influence of nitros-
amine or electromagnetic fields, other risk factors which had been discussed in the literature were
also included.
A questionnaire with ten sections each corresponding to a major theme such as demography,
occupational history, smoking habits or medical history was completed by each subject. We use
here all sections except those on diet, residential history, drinking habits and life-style factors. The
questions yielded numerous variables of all types: continuous, ordinal, nominal, categorical
and binary. The total number of variables considered in this paper is 29; more than 50 other
variables were not considered in the modelling.

Statistical methods
Our objective is to investigate the effects of covariates on a dichotomous variable Y, assigned the
values 1 for the patients with brain tumours (cases) and 0 for the controls. We let
$X = (X_1, X_2, \ldots, X_K)$ denote the vector of K known or suspected risk factors, possible
confounding variables and effect modification variables (interactions). The two matching variables
age and sex will always be included in our regression model.
Formally, we want to estimate the odds ratio r(X) through the logistic model, in the form

$$
r(X) = \frac{P(Y = 1 \mid X)/P(Y = 0 \mid X)}{P(Y = 1 \mid X_0)/P(Y = 0 \mid X_0)} = \exp\left\{ \sum_{i=1}^{K} \beta_i (X_i - X_{0i}) \right\},
$$

where $X_0 = (X_{01}, \ldots, X_{0K})$ is a specific baseline level for X and $\beta = (\beta_1, \ldots, \beta_K)$ is the vector of regression coefficients to be
estimated. For each variable the reference and other categories have to be determined; this
important step is often not described in publications. After defining X, several strategies can be
used for selecting variables. Both variable definition and selection should be guided by the
hypotheses under investigation, medical knowledge and prior beliefs as well as the distribution of
the available data. All variables can be included in one (full) model, though it is often preferable
to select important variables for a more parsimonious (final) model. It is not unusual to
investigate each variable in a separate (univariate) model.
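The odds-ratio formula above can be sketched in a few lines of Python. This is only an illustration: the function name `odds_ratio` and the exposure profiles are assumptions made here, and the two coefficients are the rounded values for variables 1 and 10 reported later in Table III(a).

```python
import math

def odds_ratio(beta, x, x0):
    """Odds ratio r(X) = exp(sum_i beta_i * (x_i - x0_i)) under a logistic model."""
    if not (len(beta) == len(x) == len(x0)):
        raise ValueError("beta, x and x0 must have the same length")
    return math.exp(sum(b * (xi - x0i) for b, xi, x0i in zip(beta, x, x0)))

# Rounded coefficients from Table III(a): variable 1 (frequent infections)
# and variable 10 (electricity industry). The profiles are hypothetical.
beta = [-0.99, 0.64]
baseline = [0, 0]                                    # unexposed to both factors

or_infections = odds_ratio(beta, [1, 0], baseline)   # exposed to variable 1 only
or_both = odds_ratio(beta, [1, 1], baseline)         # exposed to both factors
print(round(or_infections, 2), round(or_both, 2))
```

With a single binary exposure the expression reduces to exp(beta), which is how the odds-ratio column of Table III can be read off the coefficient column.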
Several procedures are available for variable selection. Stepwise methods (forward selection,
stepwise selection and backward elimination) are well known and standard software is available
for unconditional logistic modelling.5,6 All-subsets procedures, often used with the linear regression
model, can be implemented for generalized linear models with the program ISMOD.7
Likelihood ratio statistics are calculated and the better models are determined by the smallest
values of Akaike’s information criterion (AIC).

3. MODEL BUILDING STRATEGY


We will describe several steps in the analysis including decisions prior to the variable selection
strategy, and the selection of variables for the final model. The influence of our decisions will be
assessed and the results of alternative approaches will be compared with our final model. The
decisions presented are a major part of nearly all analyses in observational studies with only
unspecific hypotheses, but they are rarely reported and few researchers evaluate their impact on
the results.

Decisions for analysing the data


Problem 1: excluding variables prior to model building
For the brain tumour study, about 200 items were coded and further variables were derived from
them. This reflects the non-specific character of the study. It is not feasible to include all variables
in a final model and there are good reasons to restrict the analysis to a subset of them. For some
variables only a descriptive analysis may be of interest to reveal questions for further research.
Some questions may have been included in the questionnaire without realizing the difficulty of
obtaining reliable answers.
Decision 1 The following analysis will be restricted to 29 variables. The main reasons why
variables were excluded from further analysis at this stage were:
1. too many missing values (for example water quality, which was not available from some
local water authorities);
2. substantial measurement errors8 (for example in questions about diet and drinking habits);
3. low prevalence (for example rare sporting activities such as boxing, and exposures to specific
solvents for painters).
In general, it is not sensible to use these variables in the modelling; however, descriptive results
should be presented as they could suggest more specific hypotheses.

Problem 2: variable definition and coding


For each variable, the measurement scale and coding have to be decided prior to analysis. The
baseline category, number of categories and scoring are often not defined in the protocol, but may
be done retrospectively and be data dependent.
Continuous variables can be coded in their original format, or categorized into two or more
groups; the latter may result in loss of efficiency.9 When possible, categorization should be based
on medical reasoning and very small categories should be avoided. For integer variables (number
of events) or ordered categorical variables (never, seldom, often) it has to be decided whether or
not to collapse categories. The decision can be based on the distribution of the variables
(combining small categories), on estimates of parameters from different approaches and on
likelihood-ratio tests.
Decision 2 We decided to divide the variables into clusters defined by the sections of the
questionnaire; Table I gives the original coding as used in the questionnaire. For most categorical
variables some categories were collapsed. We discuss variable definition, coding and selection for
the medical history cluster in the next section.

Problem 3: dealing with missing values


Even when only few values are missing for each variable, the number of subjects with at least one
missing value can be large. Analysis based on the complete cases only leads to biased estimates
when the exclusion rates are different for cases and controls. Additionally, we would lose valuable
information by restricting the analysis to a subset of the data. If the number of missing values for
each variable is small, the frequently used approach of defining an additional category is not
sensible and always leads to biased estimates.10 Other techniques include probability imputation11
and replacing missing data by values generated randomly from the distribution of the
available data. The latter can be carried out for cases and controls separately and can also
incorporate other stratification factors.
Decision 3 The number of missing values for each variable is given in Table I. For about 20 per
cent of the subjects at least one variable was missing. We decided to use the replacement
technique and compared the results of our final model with those obtained from the complete
case analysis.
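A minimal sketch of the replacement technique follows, assuming it means drawing each missing value at random from the observed values of the same variable within the same stratum (here cases and controls). The function name `impute_by_stratum`, the toy data and the seed are hypothetical; the paper does not give implementation details.

```python
import random

def impute_by_stratum(values, strata, rng):
    """Replace None entries by random draws from the observed values of the same stratum."""
    observed = {}
    for value, stratum in zip(values, strata):
        if value is not None:
            observed.setdefault(stratum, []).append(value)
    completed = []
    for value, stratum in zip(values, strata):
        if value is None:
            if stratum not in observed:
                raise ValueError(f"no observed values in stratum {stratum!r}")
            value = rng.choice(observed[stratum])
        completed.append(value)
    return completed

# Hypothetical toy data: 1 = case, 0 = control; None marks a missing answer.
status = [1, 1, 1, 0, 0, 0, 0]
eczema = ["no", None, "yes", "no", "no", None, "no"]
rng = random.Random(1993)   # fixed seed so the sketch is reproducible
filled = impute_by_stratum(eczema, status, rng)
print(filled)
```

Drawing within strata keeps the case and control distributions separate, which is the point of carrying out the replacement for cases and controls separately; further stratification factors could be added by using tuples as stratum labels.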

Problem 4: combined or separate models


Many studies have low power for separate subgroup analyses.12 Unless there is strong prior
evidence for separating the data, model building should be done using the complete data set.
Decision 4 Our study included 99 men and 127 women with brain tumour and 185 male and 233
female controls. Model building was conducted with both sexes, and interactions were investig-
ated in the final model. We used unconditional logistic regression analysis adjusted for age (a
categorical variable with four levels) and sex in each model.

Problem 5: Selection level and selection procedure


Choice of the P-value (α level) used for variable selection is important and should be guided by the
aims of analysis. The decision on whether or not to include a potential confounder is different
from that of declaring a variable of primary interest significant. Some authors prefer lower levels
of significance (larger P-values) for confounders than the popular values of 0.01 or 0.05. Thomas
et al.13 give the significant associations based on 0.1 and for tests of collapsibility values as high as
0.20 are sometimes favoured.' l 42v

Variable selection can be done within each cluster, or for all clusters combined. There are
automatic procedures available, such as stepwise selection, backward elimination or all-subsets
selection, and these may be modified to incorporate prior beliefs. Decisions on whether and how
to include a variable in the final model are usually guided by P-values, but may also be based on
the magnitude of the regression coefficient. Alternatively at least one variable from each cluster
might be included irrespective of the P-value.
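The control flow of backward elimination with forced-in variables can be sketched as below. This is a generic skeleton under stated assumptions, not the software used in the study: `backward_elimination` and `stub_p_values` are hypothetical names, and the stub simply returns the univariate P-values of Table II(b) for cluster I, whereas a real analysis would refit the logistic model at every step.

```python
def backward_elimination(candidates, forced, p_values, alpha=0.157):
    """Drop the least significant candidate until all remaining ones have P < alpha.

    p_values(variables) is supplied by the caller; it should refit the model
    on `variables` and return a dict {name: P-value}.
    """
    selected = list(candidates)
    while selected:
        pvals = p_values(list(forced) + selected)
        worst = max(selected, key=lambda name: pvals[name])
        if pvals[worst] < alpha:
            break                      # every remaining candidate passes the level
        selected.remove(worst)
    return list(forced) + selected

# Stand-in for a model fit: fixed P-values (univariate values from Table II(b)).
# A real p_values function would depend on the variables currently in the model.
UNIVARIATE_P = {"v1": 0.028, "v2": 0.290, "v3": 0.189, "v4": 0.230, "v5": 0.060}

def stub_p_values(variables):
    return {name: UNIVARIATE_P[name] for name in variables if name in UNIVARIATE_P}

model = backward_elimination(["v1", "v2", "v3", "v4", "v5"], ["age", "sex"], stub_p_values)
print(model)
```

Keeping the matching factors in a separate `forced` list mirrors the rule that age and sex stay in every model regardless of their P-values.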
Table I. Definition, coding and selection of variables

Variable                                                                      Number
no.    Name                            Categories                     %       missing

Cluster I: medical history
1      Infections                      never                          51      4
                                       seldom                         38
                                       often                          6
2      Fever                           never                          66      24
                                       seldom                         20
                                       often                          13
3      Asthma                          yes/no                         4*      3
4      Unspecific allergies            yes/no                         32      3
5      Eczema                          yes/no                         10      4
6†     Any allergy                     yes/no                         35      6

Cluster II: job history
7      Chemical industry               yes/no                         7       0‡
8      Metal industry                  yes/no                         16      0
9      Office workers                  yes/no                         21      0
10     Electricity industry            yes/no                         4       0
11     Sales person                    yes/no                         15      0
12     Food processing                 yes/no                         7       0
13     Textiles                        yes/no                         12      0
14     Wood                            yes/no                         6       0
15†    Any industrial worker           yes/no                         31      0
       (including 7, 8, 10, 13, 14)

Cluster III: X-ray exposure
16     Dental X-ray, age ≥ 25          never                          15      17
                                       seldom                         65
                                       often                          21
17     Dental X-ray, age < 25          never                          75      83
                                       seldom                         19
                                       often                          6
18     Other X-ray of head or neck     yes/no                         11      26
19     Full mouth X-ray                never                          75      0
                                       1 time                         18
                                       2 and more                     I

Cluster IV: head injuries
20     Head injuries                                                  27      4
21†    With hospitalization                                           10      4
22†    With unconsciousness                                           11      4
23†    With concussion                                                12      4

Cluster V: social class factor
24     Schooling                       9 years                        73      4
                                       10 years                       15
                                       12 years (technical school)    2
                                       13 years (Abitur)              11
25     Job class                       unskilled worker               28      15
                                       skilled worker                 55
                                       polytechnic degree             7
                                       university degree              10

Cluster VI: smoking
26     Pack-years                      average number among smokers   20      11
                                       median                         16
                                       (25th, 75th centile)           (5, 30)
27     Smoking                         never                          48      2
                                       ex-smokers                     21
                                       current smokers                31

Matching factors§
28     Age                             mean age                       53.3    0
                                       (range) (years)                (24-75)
29     Male sex                        yes/no                         44      0

* Percentage positive for binary variables
† Not used in the full model and the simulation study
‡ No values missing as exposed defined by ‘worked in this area for at least five years’. A job history was available for each
person
§ Included in all models

Decision 5 We used backward elimination with a selection level of 0.157, which corresponds to
the asymptotic significance level of the all-subsets approach based on Akaike’s AIC or Mallows’s
Cp criterion in the linear model.15 This significance level is appropriate in complex studies with
moderate sample sizes, as has been confirmed recently in a large simulation study.16 Variable
selection was carried out within each cluster. We compared this approach with results using all
variables from all clusters simultaneously and selecting variables with two stepwise approaches
and all-subsets selection with the AIC criterion. With the stepwise procedures we used in addition
the usual selection level of 0.05.
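The link between AIC and the selection level of 0.157 can be checked numerically. With AIC = -2 log L + 2k, adding one parameter lowers the AIC only when the likelihood ratio statistic exceeds 2, and the upper tail probability of a chi-square variable with one degree of freedom at 2 is the implied significance level. The helper name `chi2_sf_1df` is an assumption made for this sketch.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability P(chi-square with 1 d.f. > x), via P(|Z| > sqrt(x))."""
    return math.erfc(math.sqrt(x / 2.0))

# AIC = -2 log L + 2k: one extra parameter is worthwhile only if the
# likelihood ratio statistic (1 d.f.) is larger than 2.
implied_level = chi2_sf_1df(2.0)
print(round(implied_level, 3))   # 0.157

# For comparison, the conventional 5 per cent level corresponds to a cut-off of about 3.84.
print(round(chi2_sf_1df(3.84), 3))
```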

4. VARIABLE DEFINITION, CODING AND SELECTION FOR THE MEDICAL HISTORY CLUSTER
Cluster I consists of variables describing medical history prior to diagnosis of brain tumour.
A complete list of medical conditions (15 items) recorded is given in Schlehofer et al.3 Some
diseases possibly related to brain tumour, such as toxoplasmosis and schizophrenia, were only
mentioned by one or two people and could not be evaluated any further.
It has been postulated that people with a history of infections may have a smaller risk of
developing certain tumours.17 Similar observations have been reported for people with allergies.
Consequently we examined the six items shown in Table I.
Table II(a) shows results from the univariate model for variable 1 (infections) coded by three
levels (coding A), then collapsing the first two levels (coding B), and thirdly using equidistant
coding (0, 1, 2) assuming a linear trend (coding C). Coding A suggests that the risk from the
category ‘often’ is different from that associated with the categories ‘never’ and ‘seldom’. Also,
misclassification between the groups ‘never’ and ‘seldom’ seems likely, owing to recall bias. For
these reasons we collapsed the first two categories to form a binary variable. The log-likelihood
criterion did not favour one of the three models strongly. Similar reasoning yielded a binary
variable for variable 2 (fever).

Table II. Examples of possible codings for two categorical variables from cluster I (sex and age included in
each model)

(a) Infections

                                                          Coding A     Coding B     Coding C
                                                          (3 levels)   (2 levels)   (equidistant)
                                      Cases   Controls    β            β            β
Never                                 141     226          0.00         0.00         0.00
Seldom                                79      163         -0.24         0.00        -0.36
Often                                 6       29          -1.09        -1.00        -0.72
Likelihood ratio                                          -826.28      -828.19      -827.61
P-value                                                    0.023        0.017        0.013
d.f.                                                       2            1            1
Deviance (compared with null model)                        7.71         5.80         6.38

(b) Univariate analysis for cluster I

Variable   Cases†   Controls†   β       SE(β)    P-value
1*         6        29          -1.00   (0.46)   0.028
2          69       146         -0.19   (0.18)   0.290
3          6        20          -0.63   (0.48)   0.189
4          64       139         -0.22   (0.18)   0.230
5          15       47          -0.57   (0.31)   0.060
6‡         68       160         -0.36   (0.18)   0.040

* Coding B is used here
† Number of exposed cases and controls is given
‡ Variable 6 is a combination of 3, 4 and 5 and is not used in the multivariate analysis
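The comparison of codings can be made concrete with likelihood ratio tests, since codings B and C are both special cases of coding A with one parameter fewer. The sketch below assumes that the row labelled 'Likelihood ratio' in Table II(a) holds 2 log L (its differences from -834.0, the value for the model with age and sex only, reproduce the deviance row); the helper name `lr_test_1df` is hypothetical.

```python
import math

def lr_test_1df(two_log_l_small, two_log_l_large):
    """Likelihood ratio statistic and P-value (1 d.f.) for a nested pair of models."""
    stat = two_log_l_large - two_log_l_small
    return stat, math.erfc(math.sqrt(stat / 2.0))

# 2 log L values read from Table II(a); coding A has 2 d.f., codings B and C have 1.
two_log_l = {"A": -826.28, "B": -828.19, "C": -827.61}

stat_b, p_b = lr_test_1df(two_log_l["B"], two_log_l["A"])   # collapsing never/seldom
stat_c, p_c = lr_test_1df(two_log_l["C"], two_log_l["A"])   # linear trend
print(round(stat_b, 2), round(p_b, 2))
print(round(stat_c, 2), round(p_c, 2))
```

Under these assumptions neither simplification is rejected at the 0.157 level used elsewhere in the analysis, which is consistent with the remark that the log-likelihood criterion did not strongly favour one of the three models.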
Although the correlations between most of the variables in this cluster are not high, it should be
noted that the variables are not independent. For example, 23 per cent of subjects with an allergy
(variable 4) do have eczema (variable 5) by contrast to 3.6 per cent of the subjects without an
allergy.
The results of the univariate logistic analysis for cluster I are given in Table II(b). Backward
elimination selected a model with variables 1 and 5.
As it was not clear how allergies may influence the risk of brain tumour, we created a new score
for allergy (variable 6). Since the prevalence of asthma and eczema is small compared with
unspecific allergies, this score is dominated by the latter. Univariate models show that all the
regression coefficients are negative but their absolute values are larger for variables 3 and 5 than
for 4 (Table II(b)). Of course, the standard errors for variables 3 and 5 are large owing to the small
prevalence. The regression coefficient for variable 6 is much smaller than those for variables 3 and
5; it also has the smallest standard error and lowest P-value owing to the high prevalence. Now,
backward elimination selected variables 1, 4 and 6, the last with a negative and the second with
a positive coefficient.
This example demonstrates that statistical significance is not always the best guide for variable
selection. The choice between the two models is rather arbitrary. To include both variables 4 and
6 in one model is unsatisfactory since they are ‘nested’. In addition variable 5 is easier to interpret
and its effect is stronger than those of the other two variables. We decided to include variables
1 and 5 from cluster I in the final model, a decision driven by our prior beliefs.18

Table III. Regression coefficients and standard errors for the final model, sex and age included

(a) Missing data substituted

Cluster           Variable   β        SE(β)     P-value   Odds ratio
Medical history   1*         -0.99    (0.46)    0.032     0.37
                  5          -0.47    (0.31)    0.140     0.62
Job               10          0.64    (0.40)    0.110     1.90
X-ray             17         -0.65    (0.44)    0.140     0.52
Head injuries     20         -0.35    (0.20)    0.082     0.70
Social class      25†        -0.44    (0.24)    0.071     0.64
Smoking           26         -0.014   (0.006)   0.013     0.986

2 log L: -808.1
2 log L: -834.0 with age and sex only
χ²: 25.9 (d.f. = 7) (p < 0.001)

(b) Complete case analysis

Cluster           Variable   β        SE(β)     P-value   Odds ratio
Medical history   1*         -1.11    (0.56)    0.05      0.33
                  5          -0.20    (0.34)    0.55      0.82
Job               10          0.69    (0.45)    0.13      1.99
X-ray             17         -0.74    (0.47)    0.12      0.48
Head injuries     20         -0.30    (0.24)    0.21      0.74
Social class      25†        -0.52    (0.28)    0.06      0.59
Smoking           26         -0.013   (0.007)   0.056     0.987

* Coding B is used
† Two categories are used, after collapsing four categories from Table I

5. FINAL MODEL AND SENSITIVITY ANALYSIS

Final model
The final model was obtained by (a) using unconditional logistic regression adjusting for age and
sex, (b) performing variable selection with both sexes combined, (c) replacing missing data with
random values, (d) using all data, (e) using data-dependent definitions for the coding of variables
and (f) selecting variables within the six clusters. This yielded a final model with seven variables,
one from each cluster, except cluster I from which two variables were selected. The results
are presented in Table III(a). All variables except pack-years (variable 26) are binary. The
regression coefficients for variables 5, 17, 20 and 25 indicate a small reduction in risk (OR
between 0.5 and 0.7). Exposure to electricity (variable 10) increases the risk slightly (OR = 1.9)
and only two factors (variables 1 and 26) have a more pronounced effect. The results for
dental X-rays (variable 17) and head injury (variable 20) are somewhat surprising as we would
expect both to be positively associated with brain tumour. We do not intend to interpret these
results, an extensive discussion of which may be found in the papers by Schlehofer et al.
However, an estimate of a regression coefficient in the opposite direction to that expected is
suspicious and may reveal some problems.
We will demonstrate in a simulation study that some of the ‘significant’ factors may be a result
of extensive data-dependent searching and indicate only random-noise variables. We carried out
a simulation study with the covariates of the brain tumour study but assuming that none of the
covariates (except the matching factors sex and age) had an influence on the outcome. With
backward elimination and a selection level of 0.157, three or more ‘noise variables’ were obtained
in about 67 per cent of all replications. Correlated variables were more often selected in
pairs. This simulation study confirmed the impression that some of the variables selected
in a final model have to be seen as noise variables. The results have to be interpreted with
caution.
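A rough back-of-envelope check of the simulation result is possible under a strong simplifying assumption: if each of the 22 candidate variables were an independent noise variable retained with probability equal to the selection level, the number selected would be binomial. The sketch below is not the simulation described above, which used the actual correlated covariates and backward elimination; the function name `prob_at_least` is hypothetical.

```python
import math

def prob_at_least(k, n, p):
    """P(at least k successes) for a binomial(n, p) count."""
    return 1.0 - sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k))

# Assumption: 22 independent noise variables, each retained with probability 0.157.
p_three_or_more = prob_at_least(3, n=22, p=0.157)
expected_selected = 22 * 0.157
print(round(p_three_or_more, 2), round(expected_selected, 1))
```

Under this idealization the probability of three or more noise variables is roughly 0.69, of the same order as the figure of about 67 per cent reported from the simulation, and about three to four noise variables would be expected per model.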
Comparison of the final model with the null model (containing sex and age only) shows a highly
significant reduction in variation (deviance = 25.9 with d.f. = 7). To assess the final model, we
calculated the predicted probabilities of being a case. Although there were differences in the
distributions of these predicted probabilities between cases and controls, there was a large
overlap, indicating that no strong risk factors were present and that our final model had only low
predictive power.19

Sensitivity analysis
For each of the five decisions discussed above we performed several alternative analyses.

Changing the coding of single variables


The early decision about the coding of a variable may have important consequences for the final
model. For infections (variable 1) we decided to combine the categories ‘never’ and ‘seldom’, but
there are alternatives. The odds ratios for infections in the final model differ for the three coding
schemes in Table II(a), leading to different interpretations of the results. For example, the
protective effect of many infections compared with few ranges from an OR of 0.37 (95 per cent CI:
0.15-0.53) for coding B (two levels), 0.43 (95 per cent CI: 0.17-1.09) for coding A (three levels) to
0.70 (95 per cent CI: 0.52-0.93) for coding C (equidistant coding).
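Odds ratios and approximate confidence limits of this kind are usually obtained by exponentiating the coefficient and its Wald limits. The sketch below assumes a Wald interval, exp(β ± 1.96 SE), applied to the rounded values for variable 1 in Table III(a); the function name `odds_ratio_ci` is hypothetical, and the paper does not state which interval method was used.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio with Wald confidence limits: exp(beta), exp(beta -/+ z * se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Variable 1 (infections, coding B) in the final model, Table III(a).
or_b, lower_b, upper_b = odds_ratio_ci(-0.99, 0.46)
print(round(or_b, 2), round(lower_b, 2), round(upper_b, 2))
```

With these rounded inputs the lower limit agrees with the interval quoted for coding B, while the upper limit comes out near 0.92 rather than the 0.53 printed above; the difference may reflect damage to the extracted text or a different interval method, so the computed value should be read as a check under the stated assumptions only.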
For other variables, changing the coding would have eliminated them from the final model,
and with it their apparent importance as risk factors. This was so for variables 17 and 25, which
were not statistically significant at the 15.7 per cent level when the original codings into three and four
categories were used.

Univariate models and full model


For the seven variables in the final model we estimated regression coefficients (and standard
errors) from univariate models adjusted for sex and age, and the full model based on the two
matching variables and all 22 variables in Table I. The estimates were very similar to those in the
final model, reflecting the small correlation between the variables in the final model, and the other
variables in the full model.
As expected, we obtained smaller standard errors for the regression coefficient when
moving from the full to a more parsimonious model; however, this gain was not large.
P-values for the variables not in the final model were all greater than 0.157 in the full
model.
The likelihood ratio test for the final compared with the full model yields χ² = 5.80 with
d.f. = 15, clearly indicating that the more parsimonious model is adequate.
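The two model comparisons reported in this section can be checked with the chi-square tail probability. The sketch below uses the standard closed-form series for integer degrees of freedom so that it needs only the Python standard library; the function name `chi2_sf` is an assumption made here.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer d.f. (closed-form series)."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    if x == 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd d.f.: normal tail plus a finite sum of x^(r-1) / (1*3*...*(2r-1))
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total

p_final_vs_full = chi2_sf(5.80, 15)   # final model against the full model
p_final_vs_null = chi2_sf(25.9, 7)    # final model against age and sex only
print(round(p_final_vs_full, 2), p_final_vs_null < 0.001)
```

A P-value close to 1 for the first comparison means the 15 omitted variables add almost nothing to the fit, while the second comparison reproduces the statement p < 0.001 for the final model against the model with the matching factors only.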
Table IV. Selected variables from 22 using different procedures, sex and age included in each model

Variable   Final model   Full model*   Backward elimination*   Stepwise*   All-subsets approach: Best   2nd   3rd   4th

1 X X X X X x x x
5 X X X X X x x x
9 x x x
10 X X X X X x x x
17 X X X X X X
20 X X X X X x x x
25 X X X X x x
26 X X X X X x x x

* Selection level = 15.7 per cent

Selection procedures
Table IV shows the result of different selection procedures applied to the 22 variables in Table I.
Backward elimination and stepwise selection lead to the same variables as in the final model. The
four best models from all-subset selection are also presented. All selected models include 7 or
8 variables (plus age and sex). The values of the log-likelihood from the chosen models indicate
that the 'good' models are quite similar. The best model from the all-subset approach is identical
to the one chosen by backward elimination and stepwise selection. This can be explained by our
choice of the rather unusual selection level of 0.157 for the stepwise approaches (see decision 5).
There are small differences for variable 9. When using backward elimination in the job categories
cluster, this variable was selected with a slightly smaller P-value than variable 10. We preferred to
choose variable 10 to represent this cluster. Major differences will only be obtained if the full
model is used and afterwards selection is based on the Wald criterion. With this approach, only
five variables are significant at the 15.7 per cent level.
If possible, we prefer variable selection based on the clusters, since medical reasons and
prior beliefs can be incorporated more easily. In our example, results did not differ substan-
tially between the selection procedures. The low correlation between the variables is the
probable explanation. Other examples with dramatic differences between selection proced-
ures can be found in the literature.20 With high multicollinearity, stepwise selection may
choose a rather bad model and backward elimination seems to be the better of these stepwise
approaches.21

Changing the selection level
If the usual selection level of 5 per cent is used, backward elimination and stepwise selection yield
models with only the three variables 1, 25 and 26. In the full model only variables 1 and 26 are
significant at the 5 per cent level.

Missing values
Table III(b) shows regression coefficients for the final model based on the complete case analysis.
As expected, reducing the number of subjects results in an increase in all standard errors. Small
reductions in the absolute values of the coefficients combined with increases in the corresponding
standard errors lead to important changes in some P-values.
Table V. Separate analysis by sex; regression coefficients and standard errors for the final model, age
included in each model

             Women*                       Men*                         Interaction†
Variable     β        SE(β)    P          β        SE(β)    P          P
1            -1.15    0.67     0.093      -1.24    0.65     0.056      0.930
5             0.20    0.41     0.636      -1.21    0.57     0.034      0.045
10            1.62    0.71     0.022      -0.15    0.58     0.790      0.051
17           -0.99    0.67     0.136      -0.43    0.62     0.492      0.600
20           -0.95    0.32     0.003       0.19    0.28     0.508      0.008
25           -0.96    0.42     0.022      -0.21    0.32     0.509      0.190
26            0.002   0.010    0.81       -0.02    0.007    0.003      0.083

* 360 women and 284 men are included
† Interaction terms were added to the final model (from Table III) for all seven variables

Interactions with sex


Model building was carried out with both sexes combined. To evaluate this decision we analysed
the data for the two sexes separately. The regression coefficients for the variables in the final
model are given in Table V. There are important differences for all variables except the first in
terms of absolute magnitude and/or P-values. The P-values for the interaction terms for the
model with seven variables and seven interaction terms are also given, showing that the
interactions for variables 5 and 20 are significant at the 5 per cent level. Backward elimination
(α = 0.05) of the interaction terms, with all main effects forced into the model, revealed
significant interactions with variables 10 and 20.
Rather than speculate whether the differences between men and women are real or artificial, we
simply observe that Table V demonstrates the problems of investigating interactions with weak
risk factors in a study of this size.
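One way to see where the interaction P-values come from is to compare the sex-specific coefficients directly. The sketch below assumes the two sex-specific estimates are independent and uses a simple z-test for their difference; it only approximates the tests in Table V, which come from a joint model with interaction terms for all seven variables. The function name `difference_test` is hypothetical and the inputs are the rounded Table V values.

```python
import math

def difference_test(b1, se1, b2, se2):
    """z statistic and two-sided P-value for the difference of two independent estimates."""
    z = (b1 - b2) / math.sqrt(se1**2 + se2**2)
    return z, math.erfc(abs(z) / math.sqrt(2.0))

# (beta women, SE women, beta men, SE men), rounded values from Table V.
table_v = {
    1: (-1.15, 0.67, -1.24, 0.65),
    5: (0.20, 0.41, -1.21, 0.57),
    20: (-0.95, 0.32, 0.19, 0.28),
}

p_by_variable = {}
for variable, (bw, sw, bm, sm) in table_v.items():
    z, p = difference_test(bw, sw, bm, sm)
    p_by_variable[variable] = p
    print(variable, round(z, 2), round(p, 3))
```

The approximate P-values land close to the interaction column of Table V for these three variables, which illustrates that an interaction test is essentially a test of the difference between the sex-specific coefficients, and why it has little power when each coefficient has a large standard error.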

Major differences due to alternative strategies


In Table VI we summarize the variables declared significant (α = 0.157) under the alternative
strategies. For the combined sex analyses only variables 1 and 26 were always declared significant.
More striking are the differences for the two analyses by sex where we obtained fewer
significant factors because of the smaller sample sizes. Most of the interaction terms are
also not significant because of the low power. Experience shows that significance testing
of a strong risk factor is robust against different modelling approaches. Our example shows
that the 'importance' of weak risk factors may depend heavily on the modelling strategy.
Note also that the results presented in Table VI depend on the sequence in which decisions
are made. If, for example, the coding had been done separately for men and women, the results
would be different.
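
The effect of the selection level on the sex-specific results can be reproduced from the P-values in Table V. The sketch below is illustrative only: it applies a threshold to the reported P-values and does not refit any model. It also computes the tail probability of a χ² distribution with one degree of freedom beyond 2, which is approximately 0.157; this is one common motivation for that level, and this section does not state whether it was the authors' reason.

```python
import math

# P-values from Table V for the seven variables of the final model
P_WOMEN = {1: 0.093, 5: 0.636, 10: 0.022, 17: 0.136, 20: 0.003, 25: 0.022, 26: 0.81}
P_MEN = {1: 0.056, 5: 0.034, 10: 0.790, 17: 0.492, 20: 0.508, 25: 0.509, 26: 0.003}

def significant(p_values, alpha):
    """Return the sorted variable numbers with P-value at or below alpha."""
    return sorted(var for var, p in p_values.items() if p <= alpha)

# P(chi-square with 1 df > 2) = P(|Z| > sqrt(2)) = erfc(1)
alpha_chi2_gt_2 = math.erfc(1.0)
print(f"P(chi2_1 > 2) = {alpha_chi2_gt_2:.4f}")

for alpha in (0.05, 0.157):
    print(alpha, "women:", significant(P_WOMEN, alpha), "men:", significant(P_MEN, alpha))
```

At α = 0.157 this gives variables 1, 10, 17, 20 and 25 for women and variables 1, 5 and 26 for men; at α = 0.05 variable 1 drops out for both sexes and variable 17 for women.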

6. DISCUSSION
Recently Andersen22 stated that 'too often information is missing on how the analysis was carried
out and why' and that 'the strategy of the analysis should be clearly stated'. We have described the
1336 M. BLETTNER A N D W. SAUERBREI

Table VI. Significant (α = 0.157) variables under different strategies

Column headings: Cluster selection with/without prior beliefs (with*, without); Full model;
Complete case analysis‡; Original categories‡; Separated by sex‡ (men, women)

1 X X X X X X X
5 X † X X X
9 X
10 X X X X X
17 X X X X
20 X X X X X
25 X X X X
26 X X X X X X

* Main strategy (final model)
† If the score (variable 6) is eligible, variables 4 and 6 are selected
‡ Only seven variables from final model used

complexity of the analysis of a case-control study, demonstrated that many decisions are made
which are not usually mentioned in the report of such a study, and shown how those decisions
may influence the results. Data-dependent decisions are usually necessary even for the definition
of variables and their scales and categories. Frequently the data are insufficient for an optimal
choice.
We defined variables and their coding within nearly independent clusters. Within each cluster
variable selection is necessary. This selection yields P-values which are too small and should be
interpreted descriptively. An important decision is the selection level itself. Several authors12,14
favour a high selection level to include important confounding variables. The decision may also
depend on the size of the study, because the probability of false elimination of important
confounding variables is only small in a large study, even when the usual selection level of 0.05 is
used. Selection procedures implemented in computer program packages only judge by the
P-values themselves, and ignore the estimates. However, concentration on the standardized
estimates also yields problems, especially when the prevalences of important factors vary. Criteria
such as quality, simplicity and reproducibility should be used to select a ‘representative’ from
a cluster of correlated variables.
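
The dependence of false elimination on study size and selection level can be illustrated with a normal approximation. The sketch below considers a single variable in a single test, which is a simplification of a full selection procedure; the ratios of coefficient to standard error are hypothetical and the function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def wald_power(effect_over_se, alpha):
    """Approximate power of a two-sided Wald test when the true coefficient
    divided by its standard error equals effect_over_se."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return (STD_NORMAL.cdf(effect_over_se - z_crit)
            + STD_NORMAL.cdf(-effect_over_se - z_crit))

def false_elimination_probability(effect_over_se, alpha):
    """Probability that a real effect is not significant at level alpha."""
    return 1.0 - wald_power(effect_over_se, alpha)

# Hypothetical ratios: the standard error shrinks roughly with 1/sqrt(n),
# so for a fixed true coefficient the ratio grows with study size.
for ratio in (1.0, 2.0, 4.0):
    for alpha in (0.05, 0.157):
        prob = false_elimination_probability(ratio, alpha)
        print(f"beta/SE = {ratio:.1f}, alpha = {alpha}: P(eliminated) = {prob:.2f}")
```

For a large ratio the probability of false elimination is small at either level, whereas for moderate ratios the higher selection level reduces it noticeably.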
In epidemiology there is a distinction between confounding variables and risk factors, and
dealing with the former remains a major problem. In practice this distinction is not always
clear, and statistical significance testing is not a reliable guide to identify confounding variables.
Matching factors and well-established confounders should always be taken into account.
The exclusion of important confounding variables may alter the estimates for the risk factors
even when they are not statistically significant, and lead to potential bias in the estimation of odds
ratios. If deletion of a confounder does not change the regression coefficient of a risk factor
substantially then use of a reduced model can lead to some gain in precision as well as easier
interpretation.
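
The comparison of a risk-factor coefficient with and without a candidate confounder can be written as a simple change-in-estimate check. This is a minimal sketch: the 10 per cent threshold is an assumed rule of thumb, not a value given in this paper, and the coefficients in the example are hypothetical.

```python
def relative_change(beta_full, beta_reduced):
    """Relative change in a risk-factor coefficient when a candidate
    confounder is dropped from the model."""
    if beta_full == 0:
        raise ValueError("relative change is undefined for a zero coefficient")
    return abs(beta_reduced - beta_full) / abs(beta_full)

def can_drop_confounder(beta_full, beta_reduced, threshold=0.10):
    """True if dropping the confounder changes the coefficient by less than
    the chosen threshold (10 per cent here, an assumed value)."""
    return relative_change(beta_full, beta_reduced) < threshold

# Hypothetical coefficients, not taken from the study
small_change = can_drop_confounder(beta_full=-1.20, beta_reduced=-1.15)
large_change = can_drop_confounder(beta_full=-1.20, beta_reduced=-0.80)
print(small_change, large_change)
```

The check looks only at the estimate, not at the P-value of the confounder, in line with the point that significance testing is not a reliable guide here.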
We have called the model based on all 22 variables in Table I the full model, although many
other variables have already been excluded in an early stage of analysis. In practice the full model,
including all possible variables, rarely exists. Decisions have to be made on which items from the
questionnaire are to be transferred into regression variables, and how. We do not wish to obtain
the most complex model or the most parsimonious model: ‘all models are wrong, some though
are better than others and we can search for the better ones’.28 Several meaningful alternatives are
available. As Vandenbroucke29 points out, ‘widely different combinations of variables may fit the
data equally as likely, so that we have to refer to prior opinions to construct the model which best
suits our tastes.’
Investigation of the stability of the selected model and its sensitivity to other decisions, for
example about scaling and missing values as well as model assumptions, should lead to more
careful interpretation of results.
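
One way to examine stability is to repeat a selection step on bootstrap samples and record how often each variable is selected. The sketch below is a deliberately simplified illustration: it uses synthetic data and a univariate Wald test of the log odds ratio for each binary exposure, not the multivariable selection used in this study. All names, sample sizes and prevalences are hypothetical.

```python
import math
import random

def log_odds_ratio_p(cases, controls):
    """Two-sided Wald P-value for the log odds ratio of a binary exposure,
    with 0.5 added to every cell to avoid division by zero."""
    a = sum(cases) + 0.5                      # exposed cases
    b = len(cases) - sum(cases) + 0.5         # unexposed cases
    c = sum(controls) + 0.5                   # exposed controls
    d = len(controls) - sum(controls) + 0.5   # unexposed controls
    log_or = math.log(a * d / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.erfc(abs(log_or / se) / math.sqrt(2.0))

def inclusion_frequencies(cases, controls, n_boot=200, alpha=0.157, seed=1):
    """Fraction of bootstrap samples in which each exposure has P <= alpha.
    cases and controls are lists of rows; each row is a tuple of 0/1 values."""
    rng = random.Random(seed)
    n_vars = len(cases[0])
    counts = [0] * n_vars
    for _ in range(n_boot):
        # Resample cases and controls separately, keeping group sizes fixed
        boot_cases = rng.choices(cases, k=len(cases))
        boot_controls = rng.choices(controls, k=len(controls))
        for j in range(n_vars):
            p = log_odds_ratio_p([row[j] for row in boot_cases],
                                 [row[j] for row in boot_controls])
            if p <= alpha:
                counts[j] += 1
    return [count / n_boot for count in counts]

# Synthetic data (hypothetical): exposure 0 differs strongly between cases
# and controls, exposure 1 only weakly, exposure 2 not at all.
data_rng = random.Random(0)

def simulate(n, prevalences):
    return [tuple(int(data_rng.random() < p) for p in prevalences) for _ in range(n)]

cases = simulate(300, (0.50, 0.35, 0.30))
controls = simulate(300, (0.25, 0.30, 0.30))
freqs = inclusion_frequencies(cases, controls)
print([round(f, 2) for f in freqs])
```

A strong factor is selected in nearly every replicate, while the inclusion frequency of a weak factor can lie anywhere between these extremes, which is the kind of instability discussed above.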
We argue that the strategy of data analysis should be decided as far as possible prior to the
analysis itself, in agreement with recommendations for the analysis of clinical trials.32 However,
the analysis of clinical trials is less complicated, owing to random allocation of treatment and the
objective of estimating a well-defined treatment effect. In observational studies a clear distinction should be made
between prior beliefs and data-dependent decisions.
Our example has few strong risk factors and only vague prior knowledge about the cause of the
disease. We recommend that a validation study with new data and precise hypotheses should be
carried out following data-dependent modelling. For the brain tumour study this validation can
be performed using data from the other centres.

ACKNOWLEDGEMENTS

We thank Dr. B. Schlehofer for allowing us to use the data and for important discussion on the
medical background of the study, Dr. Lawless for letting us have the program ISMOD for the
all-subsets approach, Dorothea Niehoff for technical assistance, Beate Edinger and Heike Weis
for secretarial help, and the referees for helpful comments on earlier drafts of the paper.

REFERENCES
1. Breslow, N. E. and Day, N. E. Statistical Methods in Cancer Research 1: the Analysis of Case-Control
Studies, International Agency for Research on Cancer, Lyon, 1980.
2. Greenland, S. ‘Modelling and variable selection in epidemiologic analysis’, American Journal of Public
Health, 79, 340-349 (1989).
3. Schlehofer, B., Kunze, S., Sachsenheimer, W., Blettner, M., Niehoff, D. and Wahrendorf, J. ‘Occupational
risk factors for brain tumours: results from a population-based case-control study in Germany’,
Cancer Causes and Control, 1, 209-215 (1992).
4. Schlehofer, B., Blettner, M., Becker, N., Martinsohn, C. and Wahrendorf, J. ‘Medical risk factors and the
development of brain tumours’, Cancer, 69, 2541-2547 (1992).
5. SAS Institute Inc. SAS User’s Guide. Statistics, Version 5, Cary, NC, 1985.
6. BMDP Statistical Software Manual, University of California, Berkeley, 1988.
7. Lawless, J. F. and Singhal, K. ‘ISMOD: an all-subsets regression program for generalized linear models.
I: Statistical and computational background. II: Program guide and examples’, Computer Methods and
Programs in Biomedicine, 24, 117-124, 125-134 (1987).
8. Armstrong, B. G. ‘The effects of measurement errors on relative risk regressions’, American Journal of
Epidemiology, 132, 1176-1184 (1990).
9. Lagakos, S. W. ‘Effects of mismodelling and mismeasuring explanatory variables on tests of their
association with a response variable’, Statistics in Medicine, 7, 257-274 (1988).
10. Vach, W. and Blettner, M. ‘Biased estimation of the odds ratio in case-control studies due to the use of
ad-hoc methods of correcting for missing values in confounding variables’, American Journal of
Epidemiology, 134, 895-907 (1991).
11. Schemper, M. and Smith, T. L. ‘Efficient evaluation of treatment effects in the presence of missing
covariate values’, Statistics in Medicine, 9, 777-784 (1990).
12. Greenland, S. ‘Power, sample size and smallest detectable effect determination for multivariate studies’,
Statistics in Medicine, 4, 117-127 (1985).
13. Thomas, D. C., Siemiatycki, J., Dewar, R., Robins, J., Goldberg, M. and Armstrong, B. G. ‘The problem
of multiple inference in studies designed to generate hypotheses’, American Journal of Epidemiology, 122,
1080-1095 (1985).
14. Dales, L. G. and Ury, H. K. ‘An improper use of statistical significance testing in studying covariables’,
International Journal of Epidemiology, 112, 696-706 (1978).
15. Teräsvirta, T. and Mellin, I. ‘Model selection criteria and model selection tests in regression models’,
Scandinavian Journal of Statistics, 13, 159-171 (1986).
16. Sauerbrei, W. Variablenselektion in Regressionmodellen unter besonderer Berücksichtigung medizinischer
Fragestellung (Variable selection in regression models with application in medical research), unpublished
dissertation, University of Dortmund, 1992.
17. Abel, U., Becker, N., Angerer, R., Frentzel-Beyme, R., Kaufmann, M., Schlag, P., Wysocki, S.,
Wahrendorf, J. and Schulz, G. ‘History of common infections in cancer patients and controls’, Journal of
Cancer Research Clinical Oncology, 117, 339-344 (1991).
18. Robins, J. M. and Greenland, S. ‘The role of model selection in causal inference from non experimental
data’, American Journal of Epidemiology, 123, 392-402 (1986).
19. Korn, E. L. and Simon, R. ‘Measures of explained variation for survival data’, Statistics in Medicine, 9,
487-503 (1990).
20. McGee, D., Reed, D. and Yano, K. ‘The results of logistic analysis when the variables are highly
correlated: an empirical example using diet and CHD incidence’, Journal of Chronic Diseases, 37,
713-719 (1984).
21. Mantel, N. ‘Why stepdown procedures in variable selection’, Technometrics, 12, 621-625 (1970).
22. Andersen, P. K. ‘Survival analysis 1982-1991: the second decade of the proportional hazards regression
model’, Statistics in Medicine, 10, 1931-1941 (1991).
23. Wickramaratne, P. J. and Holford, T. R. ‘Confounder in epidemiologic studies: the adequacy of the
control group as a measure of confounding’, Biometrics, 43, 751-765 (1987).
24. Greenland, S. ‘Confounding in epidemiologic studies’, Biometrics, 45, 1309-1310 (1989).
25. Holland, P. W. ‘Confounding in epidemiologic studies’, Biometrics, 45, 1310-1316 (1989).
26. Mantel, N. ‘Confounding in epidemiologic studies’, Biometrics, 45, 1317-1318 (1989).
27. Mantel, N. ‘Confounding variables: correcting superstitions, correspondence to Wickramaratne and
Holford (Biometrics, 1989, 1319-1322)’, Biometrics, 46, 869-870 (1990).
28. McCullagh, P. and Nelder, J. Generalized Linear Models, Chapman and Hall, London, 1983.
29. Vandenbroucke, J. P. ‘Should we abandon statistical modeling altogether?’, American Journal of
Epidemiology, 126, 10-13 (1987).
30. Altman, D. G. and Andersen, P. K. ‘Bootstrap investigation of the stability of a Cox regression model’,
Statistics in Medicine, 8, 771-783 (1989).
31. Sauerbrei, W. and Schumacher, M. ‘A bootstrap resampling procedure for model building: application
to the Cox regression model’, Statistics in Medicine, 11, 2093-2109 (1992).
32. Canner, P. L. ‘Further aspects of data analysis’, Controlled Clinical Trials, 4, 485-503 (1983).
