Influence of Model-Building Strategies on the Results of a Case-Control Study
MARIA BLETTNER
German Cancer Research Centre, Division of Epidemiology, Im Neuenheimer Feld 280, W-6900 Heidelberg,
Germany
AND
WILLI SAUERBREI
Institute of Medical Biometry and Informatics, University of Freiburg, Stefan Meier Str. 26, W-7800 Freiburg, Germany
SUMMARY
We evaluate the analysis of a case-control study in which many variables were investigated simultaneously.
The purpose of the study was to explore some rather unspecific hypotheses about potential risk factors for
adult brain tumour. Our aim is to show that in the analysis of case-control studies many decisions are
necessary which are usually not published in detail. As in most studies these decisions are made during
analysis and are data dependent. We demonstrate that the data allow sensible alternative decisions which
influence the final results. A sensitivity analysis of several aspects of the analysis such as different
measurement scales, variable selection, handling of missing values and interactions was performed, and
demonstrated variation in the results based on the strategy for analysis. We conclude that details of the final
analysis should be decided in the planning phase of a case-control study, and that more details of
model-building strategies must be published. Results from a study where the analysis is highly data
dependent must be interpreted with caution and validation of the results with new studies is essential.
1. INTRODUCTION
When conducting a case-control study, information is collected on possible risk factors of interest
and on potentially confounding variables. The aim of the analysis is to identify variables which
influence the estimate of the effect of exposure on disease (confounders) and those that may be
important risk factors. Ideally, we want to detect risk factors (or stable indicators of risk factors)
so that the results can be extrapolated and help to understand the development of the disease of
interest. The odds ratio (or approximate relative risk) is usually estimated using logistic regression
analysis, which also allows control of confounding and evaluation of combined effects and
interactions.1 This approach requires a strategy for building regression models, including decisions
about which measurement level or categorization to use and which variables to include in
the model. Many of these problems have been discussed recently,2 but to our knowledge little
systematic research has been done to investigate the effect of strategies of model building on the
results of a case-control study.
Here we use the data from a case-control study on brain tumours to illustrate methodological
aspects of the analysis. The results have been published elsewhere.3,4 It is not our intention to
criticize this study but to use it for illustration. The aim of our paper is to demonstrate the
complexity of the modelling strategy in the analysis of a large data set from cancer epidemiology.
We also want to show that decisions made prior to the analysis usually reported may influence
the conclusions, and consequently that more careful interpretations of the results are necessary,
especially when many weak risk factors are investigated.
Statistical methods
Our objective is to investigate the effects of covariates on a dichotomous variable Y, assigned the
values 1 for the patients with brain tumours (cases) and 0 for the controls. We let
$X = (X_1, X_2, \ldots, X_K)$ denote the vector of K known or suspected risk factors, possible
confounding variables and effect modification variables (interactions). The two matching variables
age and sex will always be included in our regression model.
Formally, we want to estimate the odds ratio $r(X)$ through the logistic model, in the form

$$
r(X) = \frac{P(Y = 1 \mid X)/P(Y = 0 \mid X)}{P(Y = 1 \mid X_0)/P(Y = 0 \mid X_0)}
     = \exp\left\{ \sum_{i=1}^{K} \beta_i (X_i - X_{0i}) \right\},
$$

where $X_0$ is a specific baseline level for $X$, with components $X_{0i}$, and $\beta$ is the vector of regression coefficients to be
estimated. For each variable the reference and other categories have to be determined; this
important step is often not described in publications. After defining X, several strategies can be
used for selecting variables. Both variable definition and selection should be guided by the
hypotheses under investigation, medical knowledge and prior beliefs as well as the distribution of
the available data. All variables can be included in one (full) model, though it is often preferable
to select important variables for a more parsimonious (final) model. It is not unusual to
investigate each variable in a separate (univariate) model.
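As a minimal sketch of how the odds ratio formula above is evaluated, the following Python function computes exp of the sum of beta_i times (X_i minus X_0i) for a chosen baseline. The coefficient values and the exposure pattern are purely illustrative assumptions, and the function name `odds_ratio` is hypothetical.

```python
import math

def odds_ratio(beta, x, x0):
    """Odds ratio r(X) = exp(sum_i beta_i * (x_i - x0_i)) relative to baseline x0."""
    if not (len(beta) == len(x) == len(x0)):
        raise ValueError("beta, x and x0 must have the same length")
    return math.exp(sum(b * (xi - x0i) for b, xi, x0i in zip(beta, x, x0)))

# Illustrative (assumed) coefficients: two binary factors and one continuous factor.
beta = [0.64, -0.47, -0.014]
baseline = [0, 0, 0]      # reference level X_0: unexposed, zero dose
exposed = [1, 0, 20]      # exposed to the first factor, 20 units of the third

or_single = odds_ratio([0.64], [1], [0])            # one binary factor on its own
or_combined = odds_ratio(beta, exposed, baseline)   # combined effect of two factors
```

For a single binary variable coded 0/1 with reference category 0, the odds ratio reduces to exp(beta), which is why the choice of reference category and coding directly changes the reported estimate.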
Several procedures are available for variable selection. Stepwise methods (forward selection,
stepwise selection and backward elimination) are well known and standard software is available
for unconditional logistic modelling.5,6 All-subsets procedures, often used with the linear regression
model, can be implemented for generalized linear models with the program ISMOD.7
Likelihood ratio statistics are calculated and the better models are determined by the smallest
values of Akaike’s information criterion (AIC).
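The two quantities named here can be sketched in a few lines of Python. The log-likelihoods below are derived from the 2 log L values reported in Table III(a) of this paper; the parameter counts (intercept, age and sex, plus seven selected variables) are our assumption for illustration, since the paper does not state them, and the function names are hypothetical.

```python
def lr_statistic(loglik_small, loglik_large):
    """Likelihood ratio statistic for two nested models: 2 * (log L_large - log L_small)."""
    return 2.0 * (loglik_large - loglik_small)

def aic(loglik, n_params):
    """Akaike's information criterion: -2 log L + 2 * (number of parameters)."""
    return -2.0 * loglik + 2.0 * n_params

# Log-likelihoods implied by Table III(a): 2 log L = -834.0 and -808.1.
loglik_base = -834.0 / 2.0     # model with age and sex only
loglik_final = -808.1 / 2.0    # final model with seven additional variables

# Assumed parameter counts: intercept + age + sex = 3, plus 7 variables = 10.
lr = lr_statistic(loglik_base, loglik_final)
aic_base = aic(loglik_base, 3)
aic_final = aic(loglik_final, 10)
```

Under these assumptions the likelihood ratio statistic is 25.9 on 7 degrees of freedom, and the final model has the smaller AIC, so it would be preferred over the model with the matching factors only.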
not to collapse categories. The decision can be based on the distribution of the variables
(combining small categories), on estimates of parameters from different approaches and on
likelihood-ratio tests.
Decision 2 We decided to divide the variables into clusters defined by the sections of the
questionnaire; Table I gives the original coding as used in the questionnaire. For most categorical
variables some categories were collapsed. We discuss variable definition, coding and selection for
the medical history cluster in the next section.
Variable selection can be done within each cluster, or for all clusters combined. There are
automatic procedures available, such as stepwise selection, backward elimination or all-subsets
selection, and these may be modified to incorporate prior beliefs. Decisions on whether and how
to include a variable in the final model are usually guided by P-values, but may also be based on
the magnitude of the regression coefficient. Alternatively at least one variable from each cluster
might be included irrespective of the P-value.
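The backward elimination loop described above can be sketched as follows. This is only an outline under stated assumptions: `fit_p_values` is a hypothetical callable standing in for refitting the logistic model and returning one P-value per variable, and the toy version used here returns fixed P-values, whereas in a real analysis they change after every refit. The matching factors are passed as `forced` variables and are never removed.

```python
def backward_elimination(candidates, forced, fit_p_values, alpha=0.157):
    """Drop the least significant candidate until all remaining P-values are <= alpha.

    candidates   -- variables eligible for removal
    forced       -- variables always kept (for example the matching factors)
    fit_p_values -- hypothetical callable: list of variables -> {variable: P-value}
    alpha        -- selection level
    """
    selected = list(candidates)
    while selected:
        p = fit_p_values(forced + selected)          # refit with the current variables
        worst = max(selected, key=lambda v: p[v])    # candidate with the largest P-value
        if p[worst] <= alpha:
            break                                    # every candidate passes the level
        selected.remove(worst)
    return forced + selected

# Toy stand-in for model fitting: fixed, made-up P-values.
TOY_P = {"age": 0.50, "sex": 0.90, "a": 0.03, "b": 0.30, "c": 0.12}

def toy_fit(variables):
    return {v: TOY_P[v] for v in variables}

model_high_level = backward_elimination(["a", "b", "c"], ["age", "sex"], toy_fit, alpha=0.157)
model_usual_level = backward_elimination(["a", "b", "c"], ["age", "sex"], toy_fit, alpha=0.05)
```

Running the same loop within each cluster, or once over all clusters combined, only changes the `candidates` list, which makes it easy to compare the two strategies on the same data.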
Table I.

Variable                                                                      Number
no.       Name                            Categories                     %    missing

Cluster I: medical history
1         Infections                      never                          51   4
                                          seldom                         38
                                          often                          6
2         Fever                           never                          66   24
                                          seldom                         20
                                          often                          13
3         Asthma                          yes/no                         4*   3
4         Unspecific allergies            yes/no                         32   3
5         Eczema                          yes/no                         10   4
6†        Any allergy                     yes/no                         35   6

Cluster II: job history
7         Chemical industry               yes/no                         7    0‡
8         Metal industry                  yes/no                         16   0
9         Office workers                  yes/no                         21   0
10        Electricity industry            yes/no                         4    0
11        Sales person                    yes/no                         15   0
12        Food processing                 yes/no                         7    0
13        Textiles                        yes/no                         12   0
14        Wood                            yes/no                         6    0
15†       Any industrial worker           yes/no                         31   0
          (including 7, 8, 10, 13, 14)

Cluster III: X-ray exposure
16        Dental X-ray, age ≥ 25          never                          15   17
                                          seldom                         65
                                          often                          21
17        Dental X-ray, age < 25          never                          75   83
                                          seldom                         19
                                          often                          6
18        Other X-ray of head or neck     yes/no                         11   26
19        Full mouth X-ray                never                          75   0
                                          1 time                         18
                                          2 and more                     I

Cluster IV: head injuries
20        Head injuries                                                  27   4
21†       With hospitalization                                           10   4
22†       With unconsciousness                                           11   4
23†       With concussion                                                12   4

Cluster V: social class factor
24        Schooling                       9 years                        73   4
                                          10 years                       15
                                          12 years (technical school)    2
                                          13 years (Abitur)              11
25        Job class                       unskilled worker               28   15
                                          skilled worker                 55
                                          polytechnic degree             7
                                          university degree              10

Matching factors§
28        Age (years)                     mean (range)                   53.3 (24-75)   0
29        Male sex                        yes/no                         44   0

* Percentage positive for binary variables
† Not used in the full model and the simulation study
‡ No values missing as exposed defined by ‘worked in this area for at least five years’. A job history was available for each person
§ Included in all models
Decision 5 We used backward elimination with a selection level of 0.157, which corresponds to
the asymptotic significance level of the all-subsets approach based on Akaike’s AIC or Mallows’s
$C_p$ criterion in the linear model.15 This significance level is appropriate in complex studies with
moderate sample sizes, as has been confirmed recently in a large simulation study.16 Variable
selection was carried out within each cluster. We compared this approach with results using all
variables from all clusters simultaneously and selecting variables with two stepwise approaches
and all-subsets selection with the AIC criterion. With the stepwise procedures we used in addition
the usual selection level of 0.05.
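The link between the selection level 0.157 and AIC can be checked numerically. AIC penalizes each additional parameter by 2, so a single variable improves AIC exactly when its likelihood ratio statistic exceeds 2, and the upper tail probability of a chi-square distribution with 1 degree of freedom at 2 is about 0.157. The short sketch below uses only the Python standard library; the function name is hypothetical.

```python
import math

def chi2_1df_upper_tail(x):
    """P(chi-square with 1 d.f. > x), which equals erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# AIC adds a penalty of 2 per parameter, so the implied test threshold is 2.
aic_level = chi2_1df_upper_tail(2.0)

# For comparison: the threshold 3.841459 corresponds to the usual 5 per cent level.
usual_level = chi2_1df_upper_tail(3.841459)
```

Seen this way, stepwise selection at the 0.05 level is a stricter rule than AIC for single variables, which is consistent with the smaller models it produces.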
Table II. Examples of possible codings for two categorical variables from cluster I (sex and age included in
each model)
(a) Infections
criterion did not favour one of the three models strongly. Similar reasoning yielded a binary
variable for variable 2 (fever).
Although the correlations between most of the variables in this cluster are not high, it should be
noted that the variables are not independent. For example, 23 per cent of subjects with an allergy
(variable 4) do have eczema (variable 5), in contrast to 3.6 per cent of the subjects without an
allergy.
The results of the univariate logistic analysis for cluster I are given in Table II(b). Backward
elimination selected a model with variables 1 and 5.
As it was not clear how allergies may influence the risk of brain tumour, we created a new score
for allergy (variable 6). Since the prevalence of asthma and eczema is small compared with
unspecific allergies, this score is dominated by the latter. Univariate models show that all the
regression coefficients are negative but their absolute values are larger for variables 3 and 5 than
for 4 (Table II(b)). Of course, the standard errors for variables 3 and 5 are large owing to the small
prevalence. The regression coefficient for variable 6 is much smaller than those for variables 3 and
5; it also has the smallest standard error and lowest P-value owing to the high prevalence. Now,
backward elimination selected variables 1, 4 and 6, the last with a negative and the second with
a positive coefficient.
This example demonstrates that statistical significance is not always the best guide for variable
selection. The choice between the two models is rather arbitrary. To include both variables 4 and
6 in one model is unsatisfactory since they are ‘nested’. In addition variable 5 is easier to interpret
and its effect is stronger than those of the other two variables. We decided to include variables
1 and 5 from cluster I in the final model, a decision driven by our prior beliefs.18

Table III. Regression coefficients and standard errors for the final model, sex and age included

(a) Missing data substituted

Cluster           Variable   β        SE(β)     P-value   Odds ratio
Medical history   1*         -0.99    (0.46)    0.032     0.37
                  5          -0.47    (0.31)    0.140     0.62
Job               10          0.64    (0.40)    0.110     1.90
X-ray             17         -0.65    (0.44)    0.140     0.52
Head injuries     20         -0.35    (0.20)    0.082     0.70
Social class      25†        -0.44    (0.24)    0.071     0.64
Smoking           26         -0.014   (0.006)   0.013     0.986

2 log L: -808.1
2 log L: -834.0 with age and sex only
χ²: 25.9 (d.f. = 7) (P < 0.001)

* Coding B is used
† Two categories are used, after collapsing four categories from Table I
Final model
The final model was obtained by (a) using unconditional logistic regression adjusting for age and
sex, (b) performing variable selection with both sexes combined, (c) replacing missing data with
random values, (d) using all data, (e) using data-dependent definitions for the coding of variables
and (f) selecting variables within the six clusters. This yielded a final model with seven variables,
one from each cluster, except cluster I from which two variables were selected. The results
are presented in Table III(a). All variables except pack years (variable 26) are binary. The
regression coefficients for variables 5, 17, 20 and 25 indicate a small reduction in risk (OR
between 0.5 and 0.7). Exposure to electricity (variable 10) increases the risk slightly (OR = 1.9)
and only two factors (variables 1 and 26) have a more pronounced effect. The results for
dental X-rays (variable 17) and head injury (variable 20) are somewhat surprising as we would
expect both to be positively associated with brain tumour. We do not intend to interpret these
results, an extensive discussion of which may be found in the papers by Schlehofer et al.3,4
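The odds ratios in Table III(a) follow from the regression coefficients as exp(β), and a Wald test gives an approximate P-value from β and its standard error. The sketch below recomputes both from the rounded values printed in the table. Because the inputs are rounded, and because the paper does not say which test produced the published P-values, the recomputed P-values should be expected to agree only approximately. The function name `wald_summary` is hypothetical.

```python
import math

def wald_summary(beta, se):
    """Odds ratio, Wald z statistic and two-sided normal P-value for one coefficient."""
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided tail of the standard normal
    return {"odds_ratio": math.exp(beta), "z": z, "p_value": p_value}

# (beta, SE) pairs as printed in Table III(a), keyed by variable number.
table_3a = {
    1: (-0.99, 0.46),
    5: (-0.47, 0.31),
    10: (0.64, 0.40),
    17: (-0.65, 0.44),
    20: (-0.35, 0.20),
    25: (-0.44, 0.24),
    26: (-0.014, 0.006),
}

summaries = {var: wald_summary(b, se) for var, (b, se) in table_3a.items()}
```

Note that variable 26 (pack years) is continuous, so its odds ratio of 0.986 refers to a change of one unit and is not directly comparable with the odds ratios of the binary variables.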
Sensitivity analysis
For each of the five decisions discussed above we performed several alternative analyses.
Table IV. Selected variables from 22 using different procedures, sex and age included in each model
1 X X X X X x x x
5 X X X X X x x x
9 x x x
10 X X X X X x x x
17 X X X X X X
20 X X X X X x x x
25 X X X X x x
26 X X X X X x x x
Selection procedures
Table IV shows the result of different selection procedures applied to the 22 variables in Table I.
Backward elimination and stepwise selection lead to the same variables as in the final model. The
four best models from all-subsets selection are also presented. All selected models include 7 or
8 variables (plus age and sex). The values of the log-likelihood from the chosen models indicate
that the 'good' models are quite similar. The best model from the all-subsets approach is identical
to the one chosen by backward elimination and stepwise selection. This can be explained by our
choice of the rather unusual selection level of 0.157 for the stepwise approaches (see decision 5).
There are small differences for variable 9. When using backward elimination in the job categories
cluster, this variable was selected with a slightly smaller P-value than variable 10. We preferred to
choose variable 10 to represent this cluster. Major differences will only be obtained if the full
model is used and afterwards selection is based on the Wald criterion. With this approach, only
five variables are significant at the 15.7 per cent level.
If possible, we prefer variable selection based on the clusters, since medical reasons and
prior beliefs can be incorporated more easily. In our example, results did not differ substan-
tially between the selection procedures. The low correlation between the variables is the
probable explanation. Other examples with dramatic differences between selection proced-
ures can be found in the literature.20 With high multicollinearity, stepwise selection may
choose a rather bad model and backward elimination seems to be the better of these stepwise
approaches.21
Changing the selection level
If the usual selection level of 5 per cent is used, backward elimination and stepwise selection yield
models with only three variables: 1, 25 and 26. In the full model only variables 1 and 26 are
significant at the 5 per cent level.
Missing values
Table III(b) shows regression coefficients for the final model based on the complete case analysis.
As expected, reducing the number of subjects results in an increase in all standard errors. Small
reductions in the absolute values of the coefficients combined with increases in the corresponding
standard errors lead to important changes in some P-values.
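The two ways of handling missing values that are compared here can be sketched as follows. The paper states only that missing data were replaced with random values, so drawing each replacement from the observed values of the same variable is our assumption; the function names and the toy records are hypothetical.

```python
import random

def complete_cases(records, variables):
    """Keep only the records with no missing value (None) in the listed variables."""
    return [r for r in records if all(r[v] is not None for v in variables)]

def substitute_at_random(records, variables, rng):
    """Replace each missing value by a random draw from the observed values.

    Assumption: replacements are sampled from the observed distribution of the
    same variable. The input records are left unchanged.
    """
    filled = [dict(r) for r in records]
    for v in variables:
        observed = [r[v] for r in records if r[v] is not None]
        if not observed:
            continue                     # nothing to sample from
        for r in filled:
            if r[v] is None:
                r[v] = rng.choice(observed)
    return filled

# Toy data: two binary variables with some missing values.
records = [
    {"fever": 1, "xray": 0},
    {"fever": None, "xray": 1},
    {"fever": 0, "xray": None},
    {"fever": 1, "xray": 1},
]
variables = ["fever", "xray"]

cc = complete_cases(records, variables)
filled = substitute_at_random(records, variables, random.Random(1))
```

The complete case analysis loses every subject with at least one missing value, which is why standard errors increase, while random substitution keeps all subjects at the price of adding noise to the substituted variables.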
Table V. Separate analysis by sex; regression coefficients and standard errors for the final model, age
included in each model
6. DISCUSSION
Recently Andersen22 stated that 'too often information is missing on how the analysis was carried
out and why' and that 'the strategy of the analysis should be clearly stated'. We have described the
1 X X X X X X X
5 X t X X X
9 X
10 X X X X X
17 X X X X
20 X X X X X
25 X X X X
26 X X X X X X
complexity of the analysis of a case-control study, demonstrated that many decisions are made
which are not usually mentioned in the report of such a study, and shown how those decisions
may influence the results. Data-dependent decisions are usually necessary even for the definition
of variables and their scales and categories. Frequently the data are insufficient for an optimal
choice.
We defined variables and their coding within nearly independent clusters. Within each cluster
variable selection is necessary. This selection yields P-values which are too small and should be
interpreted descriptively. An important decision is the selection level itself. Several authors12,14
favour a high selection level to include important confounding variables. The decision may also
depend on the size of the study, because the probability of false elimination of important
confounding variables is only small in a large study, even when the usual selection level of 0.05 is
used. Selection procedures implemented in computer program packages only judge by the
P-values themselves, and ignore the estimates. However, concentration on the standardized
estimates also yields problems, especially when the prevalences of important factors vary. Criteria
such as quality, simplicity and reproducibility should be used to select a ‘representative’ from
a cluster of correlated variables.
In epidemiology there is a distinction between confounding variables and risk factors, and
dealing with the former remains a major problem.23-27 In practice this distinction is not always
clear, and statistical significance testing is not a reliable guide to identify confounding variables.
Matching factors and well-established confounders should always be taken into account.
The exclusion of important confounding variables may alter the estimates for the risk factors
even when they are not statistically significant, and lead to potential bias in the estimation of odds
ratios. If deletion of a confounder does not change the regression coefficient of a risk factor
substantially then use of a reduced model can lead to some gain in precision as well as easier
interpretation.
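The idea that a confounder may be dropped when its deletion hardly changes the coefficient of the risk factor can be expressed as a simple change-in-estimate check. The 10 per cent tolerance and the coefficient values below are illustrative assumptions and do not come from the study; the function names are hypothetical.

```python
def relative_change(beta_full, beta_reduced):
    """Relative change in a risk factor's coefficient after deleting a confounder."""
    if beta_full == 0:
        raise ValueError("relative change is undefined for a zero coefficient")
    return abs(beta_reduced - beta_full) / abs(beta_full)

def can_drop_confounder(beta_full, beta_reduced, tolerance=0.10):
    """True if the coefficient changes by no more than the (assumed) tolerance."""
    return relative_change(beta_full, beta_reduced) <= tolerance

# Made-up coefficients of one risk factor with and without a given confounder.
small_shift = can_drop_confounder(beta_full=-0.50, beta_reduced=-0.48)   # 4 per cent change
large_shift = can_drop_confounder(beta_full=-0.50, beta_reduced=-0.35)   # 30 per cent change
```

Unlike a significance test on the confounder itself, this check looks at the quantity of interest, the estimate for the risk factor, which is the point made in the paragraph above.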
We have called the model based on all 22 variables in Table I the full model, although many
other variables have already been excluded at an early stage of analysis. In practice the full model,
including all possible variables, rarely exists. Decisions have to be made on which items from the
questionnaire are to be transferred into regression variables, and how. We do not wish to obtain
the most complex model or the most parsimonious model: ‘all models are wrong, some though
are better than others and we can search for the better ones’.28 Several meaningful alternatives are
available. As Vandenbroucke29 points out, ‘widely different combinations of variables may fit the
data equally as likely, so that we have to refer to prior opinions to construct the model which best
suits our tastes.’
Investigation of the stability of the selected model30,31 and its sensitivity to other decisions, for
example about scaling and missing values as well as model assumptions, should lead to more
careful interpretation of results.
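One possible way to investigate the stability of a selected model is to repeat the selection on bootstrap resamples and record how often each variable is chosen, in the spirit of the bootstrap studies listed in the references. The paper does not report such an analysis for the brain tumour data, so the sketch below is only an illustration: `select` is a hypothetical callable standing in for the whole model-building strategy, and the toy rule and toy data are made up.

```python
import random
from collections import Counter

def inclusion_frequencies(data, select, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each variable is selected.

    data   -- list of records (one per subject)
    select -- hypothetical callable: list of records -> iterable of selected variables
    """
    rng = random.Random(seed)
    counts = Counter()
    n = len(data)
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]   # resample with replacement
        counts.update(set(select(sample)))
    return {v: c / n_boot for v, c in counts.items()}

def toy_select(sample, variables=("a", "b"), threshold=0.2):
    """Toy rule: keep a variable if its mean differs between cases and controls."""
    cases = [r for r in sample if r["y"] == 1]
    controls = [r for r in sample if r["y"] == 0]
    if not cases or not controls:
        return []
    chosen = []
    for v in variables:
        diff = sum(r[v] for r in cases) / len(cases) - sum(r[v] for r in controls) / len(controls)
        if abs(diff) > threshold:
            chosen.append(v)
    return chosen

# Made-up data: "a" tracks case status exactly, "b" is constant.
data = [{"y": i % 2, "a": i % 2, "b": 0} for i in range(40)]
freq = inclusion_frequencies(data, toy_select)
```

Variables that are selected in nearly all resamples can be regarded as stable, while variables selected in only a fraction of them depend strongly on the particular sample.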
We argue that the strategy of data analysis should be decided as far as possible prior to the
analysis itself, in agreement with recommendations for the analysis of clinical trials.32 However,
the analysis of clinical trials is less complicated, because treatment is allocated at random and the objective is to estimate
a well-defined treatment effect. In observational studies a clear distinction should be made
between prior beliefs and data-dependent decisions.
Our example has few strong risk factors and only vague prior knowledge about the cause of the
disease. We recommend that a validation study with new data and precise hypotheses should be
carried out following data-dependent modelling. For the brain tumour study this validation can
be performed using data from the other centres.
ACKNOWLEDGEMENTS
We thank Dr. B. Schlehofer for allowing us to use the data and for important discussion on the
medical background of the study, Dr. Lawless for letting us have the program ISMOD for the
all-subsets approach, Dorothea Niehoff for technical assistance, Beate Edinger and Heike Weis
for secretarial help, and the referees for helpful comments on earlier drafts of the paper.
REFERENCES
1. Breslow, N. E. and Day, N. E. Statistical Methods in Cancer Research 1: the Analysis of Case-Control
Studies, International Agency for Research on Cancer, Lyon, 1980.
2. Greenland, S. ‘Modelling and variable selection in epidemiologic analysis’, American Journal of Public
Health, 79, 340-349 (1989).
3. Schlehofer, B., Kunze, S., Sachsenheimer, W., Blettner, M., Niehoff, D. and Wahrendorf, J. ‘Occupa-
tional risk factors for brain tumours: results from a population-based case-control study in Germany’,
Cancer Causes and Control, 1, 209-15 (1992).
4. Schlehofer, B., Blettner, M., Becker, N., Martinsohn, C. and Wahrendorf, J. ‘Medical risk factors and the
development of brain tumours’, Cancer, 69, 2541-2547 (1992).
5. SAS Institute Inc. SAS User’s Guide. Statistics, Version 5, Cary, NC, 1985.
6. BMDP Statistical Software Manual, University of California, Berkeley, 1988.
7. Lawless, J. F. and Singhal, K. ‘ISMOD: an all-subsets regression program for generalized linear models.
I: Statistical and computational background. II: Program guide and examples’, Computer Methods and
Programs in Biomedicine, 24, 117-124, 125-134 (1987).
8. Armstrong, B. G. ‘The effects of measurement errors on relative risk regressions’, American Journal of
Epidemiology, 132, 1176-84 (1990).
9. Lagakos, S. W. ‘Effects of mismodelling and mismeasuring explanatory variables on tests of their
association with a response variable’, Statistics in Medicine, 7, 257-274 (1988).
10. Vach, W. and Blettner, M. ‘Biased estimation of the odds ratio in case-control studies due to the use of
ad-hoc methods of correcting for missing values in confounding variables’, American Journal of
Epidemiology, 134, 895-907 (1991).
11. Schemper, M. and Smith, T. L. ‘Efficient evaluation of treatment effects in the presence of missing
covariate values’, Statistics in Medicine, 9, 777-784 (1990).
12. Greenland, S. ‘Power, sample size and smallest detectable effect determination for multivariate studies’,
Statistics in Medicine, 4, 117-127 (1985).
13. Thomas, D. C., Siemiatycki, J., Dewar, R., Robins, J., Goldberg, M. and Armstrong, B. G. ‘The problem
of multiple inference in studies designed to generate hypotheses’, American Journal of Epidemiology, 122,
1080-1095 (1985).
14. Dales, L. G. and Ury, H. K. ‘An improper use of statistical significance testing in studying covariables’,
International Journal of Epidemiology, 112, 696-706 (1978).
15. Teräsvirta, T. and Mellin, I. ‘Model selection criteria and model selection tests in regression models’,
Scandinavian Journal of Statistics, 13, 159-171 (1986).
16. Sauerbrei, W. Variablenselektion in Regressionsmodellen unter besonderer Berücksichtigung medizinischer
Fragestellungen (Variable selection in regression models with application in medical research), unpublished
dissertation, University of Dortmund, 1992.
17. Abel, U., Becker, N., Angerer, R., Frentzel-Beyme, R., Kaufmann, M., Schlag, P., Wysocki, S.,
Wahrendorf, J. and Schulz, G. ‘History of common infections in cancer patients and controls’, Journal of
Cancer Research and Clinical Oncology, 117, 339-344 (1991).
18. Robins, J. M. and Greenland, S. ‘The role of model selection in causal inference from non experimental
data’, American Journal of Epidemiology, 123, 392-402 (1986).
19. Korn, E. L. and Simon, R. ‘Measures of explained variation for survival data’, Statistics in Medicine, 9,
487-503 (1990).
20. McGee, D., Reed, D. and Yano, K. ‘The results of logistic analysis when the variables are highly
correlated: an empirical example using diet and CHD incidence’, Journal of Chronic Diseases, 37,
713-719 (1984).
21. Mantel, N. ‘Why stepdown procedures in variable selection’, Technometrics, 12, 621-625 (1970).
22. Andersen, P. K. ‘Survival analysis 1982-1991: the second decade of the proportional hazards regression
model’, Statistics in Medicine, 10, 1931-1941 (1991).
23. Wickramaratne, P. J. and Holford, T. R. ‘Confounder in epidemiologic studies: the adequacy of the
control group as a measure of confounding’, Biometrics, 43, 751-765 (1987).
24. Greenland, S. ‘Confounding in epidemiologic studies’, Biometrics, 45, 1309-1310 (1989).
25. Holland, P. W. ‘Confounding in epidemiologic studies’, Biometrics, 45, 1310-1316 (1989).
26. Mantel, N. ‘Confounding in epidemiologic studies’, Biometrics, 45, 1317-1318 (1989).
27. Mantel, N. ‘Confounding variables: correcting superstitions, correspondence to Wickramaratne and
Holford (Biometrics, 1989, 1319-1322)’, Biometrics, 46, 869-870 (1990).
28. McCullagh, P. and Nelder, J. Generalized Linear Models, Chapman and Hall, London, 1983.
29. Vandenbroucke, J. P. ‘Should we abandon statistical modeling altogether?’, American Journal of
Epidemiology, 126, 10-13 (1987).
30. Altman, D. G. and Andersen, P. K. ‘Bootstrap investigation of the stability of a Cox regression model’,
Statistics in Medicine, 8, 771-783 (1989).
31. Sauerbrei, W. and Schumacher, M. ‘A bootstrap resampling procedure for model building: application
to the Cox regression model’, Statistics in Medicine, 11, 2093-2109 (1992).
32. Canner, P. L. ‘Further aspects of data analysis’, Controlled Clinical Trials, 4, 485-503 (1983).