
Fitting additive hazards models for case-cohort studies: A multiple imputation approach

Jinhyouk Jung
Department of Statistics University of Connecticut

Aug 1, 2012

Joint work with Ofer Harel and Sangwook Kang at UConn

Jinhyouk Jung (UConn)

JSM 2012

08/01/12

1 / 31

Outline

Case-cohort study
Multiple imputation
Additive hazards models
Simulation studies
Data example: zinc concentration study


INTRODUCTION: CASE-COHORT STUDY

Epidemiological cohort studies
  Goal: to assess the relationship between certain risk factors and the disease outcomes of interest.

If the disease outcome of interest is rare:
  A large number of subjects and a long study period are needed to observe a sufficient number of cases.

If the main risk factor is expensive to measure:
  The cost of conducting such a large study becomes even higher.

Case-cohort study design
  Proposed by Prentice (1986) to overcome this difficulty, as an alternative to conducting a full cohort study.


INTRODUCTION: CASE-COHORT STUDY

The main point of this design
  To obtain the expensive covariate measurements disproportionately on cases and controls.
  Cases: people who are observed to develop the disease of interest.
  Controls: people who are not observed to develop it.

Two steps
  1) Select a subcohort randomly from the cohort.
  2) Add the remaining cases in the cohort to the subcohort.

The resulting sample includes all cases and a fraction of the controls.
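The two sampling steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `case_cohort_sample`, the Bernoulli (rather than fixed-size) subcohort draw, and the toy cohort are all hypothetical and do not come from the talk.

```python
import random

def case_cohort_sample(is_case, subcohort_fraction, seed=0):
    """Return the sorted indices of subjects whose expensive covariate is measured.

    Assumption: each subject enters the subcohort independently with
    probability `subcohort_fraction` (Bernoulli sampling).
    """
    rng = random.Random(seed)
    n = len(is_case)
    # Step 1: draw the subcohort at random from the whole cohort,
    # without looking at case status.
    subcohort = {i for i in range(n) if rng.random() < subcohort_fraction}
    # Step 2: add every case that the subcohort did not already contain.
    remaining_cases = {i for i in range(n) if is_case[i] and i not in subcohort}
    return sorted(subcohort | remaining_cases)

# Toy cohort of 200 subjects in which every tenth subject is a case.
is_case = [i % 10 == 0 for i in range(200)]
selected = case_cohort_sample(is_case, subcohort_fraction=0.2)
```

By construction every case is in `selected`, while controls appear only if step 1 happened to draw them.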


CASE-COHORT DESIGN

[Figures: diagrams of the case-cohort design, slides 5 to 9]

INTRODUCTION: ASSUMPTION

Model assumption
  Semiparametric additive hazards model.
  Why not the Cox proportional hazards model? The additive model is preferable when the critical proportional hazards assumption seems to be violated, or when the quantity of interest is a risk difference rather than a relative risk.

Missingness assumption
  Missing at random (MAR): the selection of the subcohort may depend only on some covariates.
  The case-cohort design is treated as a special case of the missing covariate problem under MAR.


INTRODUCTION: Recent studies

Recent studies related to the additive hazards model or MI
  Qi et al. (2010): Cox model, MI and fully augmented weighted estimators (FAWE).
  Lin (2011): semiparametric additive hazards model with missing covariates.


Additive hazards model - Notation

$Z = (Z^{mis}, Z^{obs})$: a set of covariates
$T$: failure time
$C$: potential censoring time
$X = \min(T, C)$: observed time
$\Delta = I(T \le C)$: failure indicator
$N(t) = I(T \le t)$: counting process for failure
$Y(t) = I(X \ge t)$: at-risk indicator process
$\tau$: study end time
$n$: cohort size
$S$: selection indicator variable, used to identify subjects with missing covariates

The missing data mechanism is determined by the conditional distribution of $S$ given $(X, \Delta, Z^{obs})$.


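Additive hazards model - Model and Estimation

Additive hazards model:

$$\lambda(t \mid Z(t)) = \lambda_0(t) + \beta^\top Z(t), \qquad (1)$$

where $\lambda_0(t)$ is an unspecified baseline hazard function and $\beta$ is a vector-valued regression parameter.

Kulich M. and Lin DY. (1994): if the full cohort data were available, the estimating function would be

$$U(\beta) = \sum_{i=1}^{n} \int_0^{\tau} \{Z_i(t) - \bar{Z}(t)\}\{dN_i(t) - Y_i(t)\,\beta^\top Z_i(t)\,dt\}, \qquad (2)$$

where $\bar{Z}(t) = \sum_{i=1}^{n} Y_i(t) Z_i(t) \big/ \sum_{i=1}^{n} Y_i(t)$.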


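Additive hazards model - Model and Estimation

There exists an explicit solution to the estimating equation $U(\beta) = 0$, taking the following form:

$$\hat{\beta} = \left[\sum_{i=1}^{n} \int_0^{\tau} Y_i(t)\{Z_i(t) - \bar{Z}(t)\}^{\otimes 2}\,dt\right]^{-1} \left[\sum_{i=1}^{n} \int_0^{\tau} \{Z_i(t) - \bar{Z}(t)\}\,dN_i(t)\right], \qquad (3)$$

where $a^{\otimes 2} = a a^\top$.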
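To make the explicit solution concrete, here is a minimal Python sketch for the simplest case: a single scalar covariate that does not change over time, with the study end time taken as the largest observed time. The function name `additive_hazards_beta` and the two-subject toy data are assumptions for illustration; this is not the implementation used in the talk. Because the risk set only changes at observed times, both integrals reduce to finite sums.

```python
def additive_hazards_beta(times, events, z):
    """Closed-form additive hazards estimate for one time-fixed scalar covariate.

    Assumptions: tau = max(times), and Y_i(t) = I(X_i >= t).
    """
    n = len(times)
    info = 0.0   # sum_i of the integral of Y_i(t) {Z_i - Zbar(t)}^2 dt
    score = 0.0  # sum_i of the integral of {Z_i - Zbar(t)} dN_i(t)
    prev = 0.0
    for t in sorted(set(times)):
        # On the interval (prev, t] the risk set is constant: everyone with X_i >= t.
        at_risk = [i for i in range(n) if times[i] >= t]
        zbar = sum(z[i] for i in at_risk) / len(at_risk)
        info += (t - prev) * sum((z[i] - zbar) ** 2 for i in at_risk)
        # Only subjects who fail exactly at t contribute a jump of N_i.
        score += sum(z[i] - zbar for i in at_risk if times[i] == t and events[i])
        prev = t
    return score / info

# Two subjects: a failure at time 1 with Z = 1, and a censored subject at time 2 with Z = 0.
beta_hat = additive_hazards_beta(times=[1.0, 2.0], events=[1, 0], z=[1.0, 0.0])
```

For the toy data both the bracketed "information" term and the "score" term equal 0.5, so the estimate is 1.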


Multiple Imputation - Concept and process

Instead of filling in a single value for each missing value, two or more acceptable values, representing a distribution of possibilities, are used.

Rubin (1987) suggested three steps for MI:
1) Imputation step: specify a reasonable imputation model that approximates the true distributional relationship between the unobserved data and the available information.
2) Analysis step: the complete-data analysis is performed M times, once on each completed data set.
3) Combining step: the M estimates are combined to obtain the so-called repeated-imputation inference.


Multiple Imputation - Combining rules

The combining rule for the point estimates $\hat{\beta}_m$, $m = 1, \dots, M$:

$$\bar{\beta}_M = \frac{1}{M} \sum_{m=1}^{M} \hat{\beta}_m. \qquad (4)$$

The combining rule for the variance estimate:

within-imputation variance: $W_M = \frac{1}{M} \sum_{m=1}^{M} \hat{V}(\hat{\beta}_m)$,

between-imputation variance: $B_M = \frac{1}{M-1} \sum_{m=1}^{M} (\hat{\beta}_m - \bar{\beta}_M)(\hat{\beta}_m - \bar{\beta}_M)^\top$,

total variance:

$$T_M = W_M + \left(1 + \frac{1}{M}\right) B_M. \qquad (5)$$
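The combining rules can be checked with a small Python sketch for a scalar parameter. The function name `pool_rubin` and the toy numbers are hypothetical, and the sketch assumes Rubin's usual 1/(M - 1) divisor for the between-imputation variance.

```python
def pool_rubin(estimates, variances):
    """Pool M scalar estimates and their variance estimates with Rubin's rules."""
    m = len(estimates)
    pooled = sum(estimates) / m                                    # point estimate
    within = sum(variances) / m                                    # within-imputation variance
    between = sum((e - pooled) ** 2 for e in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between                         # total variance
    return pooled, within, between, total

# Toy example with M = 3 imputed-data analyses.
pooled, within, between, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Here the pooled estimate is 2, the within-imputation variance is 0.5, the between-imputation variance is 1, and the total variance is 0.5 + (4/3)(1).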

Multiple Imputation - Missing patterns

Little and Rubin (1987):
  Missing completely at random (MCAR): the probability of an incomplete observation is independent of both the observed and the unobserved values.
  Missing at random (MAR): the probability depends only on observed values.
  Missing not at random (MNAR): the probability depends on unobserved values, even after accounting for the observed ones.


SIMULATION I: continuous target variable

Goal: to study the effect of the strength of the correlation between the continuous missing and observed covariates.
$Z = (Z^{mis}, Z^{obs})$: sampled from a bivariate normal distribution with the correlation coefficient varying from 0 to 0.7.
True values of the coefficients corresponding to $Z = (Z^{mis}, Z^{obs})$: $\beta_{mis} = -0.1$, $\beta_{obs} = 0.1$.
Failure time: $T = -\log(1 - U)/(\lambda_0 + \beta^\top Z)$, where $U$ was drawn from Uniform(0, 1).
Censoring rate: fixed at about 65%.
Selection variable $S$: generated from a Bernoulli distribution with probability

$$\pi = \{1 + \exp(a + b\Delta + cX + dZ^{obs})\}^{-1},$$

where $a$, $b$, $c$ and $d$ were assigned specific values to generate a missing rate of about 50 percent in the full cohort. In our study, we used $(a, b, c, d) = (0.68, 1, 1, 1)$, and the sample size was $n = 2000$.
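A rough Python sketch of the data-generating steps for one subject follows. Several ingredients are assumptions rather than details from the slides: the covariates are taken to be standard normal, the baseline hazard `LAMBDA0 = 1.0` is a hypothetical value, the coefficients $(a, b, c, d)$ are used with the signs as printed, and censoring is left out because its distribution is not stated.

```python
import math
import random

BETA_MIS, BETA_OBS = -0.1, 0.1  # true coefficients of Z_mis and Z_obs
LAMBDA0 = 1.0                   # hypothetical baseline hazard, not given in the slides

def draw_covariates(rho, rng):
    """Draw (z_mis, z_obs) from a standard bivariate normal with correlation rho."""
    z_obs = rng.gauss(0.0, 1.0)
    z_mis = rho * z_obs + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return z_mis, z_obs

def draw_failure_time(z_mis, z_obs, rng):
    """Inverse-transform draw from the constant hazard lambda0 + beta'Z."""
    hazard = LAMBDA0 + BETA_MIS * z_mis + BETA_OBS * z_obs
    if hazard <= 0:
        raise ValueError("the additive hazard must be positive")
    return -math.log(1.0 - rng.random()) / hazard

def selection_probability(delta, x, z_obs, a=0.68, b=1.0, c=1.0, d=1.0):
    """Selection probability {1 + exp(a + b*delta + c*x + d*z_obs)}^(-1)."""
    return 1.0 / (1.0 + math.exp(a + b * delta + c * x + d * z_obs))

rng = random.Random(2012)
z_mis, z_obs = draw_covariates(0.5, rng)
t = draw_failure_time(z_mis, z_obs, rng)
p = selection_probability(delta=1, x=t, z_obs=z_obs)
```

The explicit positivity check matters because an additive hazard, unlike a multiplicative one, is not guaranteed to be positive for every covariate value.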


SIMULATION I: continuous target variable

To check the performance of the MI estimators:
  Percentage Bias (PB): $(E(\hat{\beta}) - \beta)/\beta$, expressed as a percentage. If PB exceeds 5% in either direction, the bias is considered large.
  Average Length of CI (AL): a measure of the precision of the estimates.
  Coverage Rate (CR)

Generate 1000 data sets.
Full imputation model, such as $Z^{mis} \sim \alpha_0 + \alpha_1 X + \alpha_2 \Delta + \alpha_3 Z^{obs}$.
MICE package in R:
  Bayesian linear regression imputation (Rubin, 1987): norm
  Predictive mean matching (Little, 1988): pmm
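The three performance measures can be computed from the per-replicate estimates and standard errors as in the Python sketch below. The function name `performance` and the toy inputs are hypothetical, and the intervals are plain normal-approximation intervals (estimate plus or minus 1.96 standard errors), which is an assumption; MI intervals are often built from a t reference distribution instead.

```python
def performance(estimates, std_errors, true_beta, z_crit=1.96):
    """Return (PB in %, AL, CR in %) over simulation replicates."""
    reps = len(estimates)
    mean_est = sum(estimates) / reps
    # Percentage bias of the averaged estimate relative to the true value.
    pb = 100.0 * (mean_est - true_beta) / true_beta
    intervals = [(e - z_crit * s, e + z_crit * s) for e, s in zip(estimates, std_errors)]
    # Average length of the confidence intervals.
    al = sum(upper - lower for lower, upper in intervals) / reps
    # Percentage of intervals that contain the true value.
    cr = 100.0 * sum(lower <= true_beta <= upper for lower, upper in intervals) / reps
    return pb, al, cr

# Three toy replicates with true beta = 0.1; the third interval misses the truth.
pb, al, cr = performance([0.09, 0.11, 0.30], [0.05, 0.05, 0.05], true_beta=0.1)
```

In this toy example two of the three intervals cover 0.1, so CR is about 66.7%, and every interval has length 2 x 1.96 x 0.05 = 0.196.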


SIMULATION I: continuous target variable - PB, AL

[Figure: PB% and AL plotted against correlation (0.0 to 0.7), with separate panels for $\beta_{mis}$ and $\beta_{obs}$; methods shown: Full Cohort, MI norm, MI pmm]


Averaged Estimates

                    $\beta_{mis}$                  $\beta_{obs}$
Correlation    Full     NORM     PMM       Full     NORM     PMM
0.0           -.101    -.098    -.097      .099     .099     .096
0.1           -.102    -.101    -.101      .099     .098     .098
0.2           -.100    -.098    -.097      .099     .098     .098
0.3           -.096    -.093    -.093      .098     .096     .096
0.4           -.098    -.091    -.090      .096     .093     .092
0.5           -.099    -.099    -.098      .102     .101     .100
0.6           -.098    -.098    -.096      .096     .095     .093
0.7           -.101    -.099    -.099      .100     .098     .097


Coverage Rate (%)

                    $\beta_{mis}$                  $\beta_{obs}$
Correlation    Full     NORM     PMM       Full     NORM     PMM
0.0           93.19    95.20    97.79     94.09    95.19    95.89
0.1           95.00    94.30    95.90     95.09    95.09    94.70
0.2           95.69    94.49    95.19     96.29    96.09    96.19
0.3           93.00    94.69    94.00     94.99    94.19    95.09
0.4           95.40    94.29    94.59     95.60    95.60    95.10
0.5           95.39    96.19    95.69     95.49    95.99    95.59
0.6           94.50    94.59    93.19     95.19    94.79    94.29
0.7           96.90    95.87    94.49     96.59    94.99    94.29


SIMULATION II: binary target variable

Model: $\lambda(t \mid Z^{mis}, Z^{obs}) = 0.5 + \beta_{obs} Z^{obs} + \beta_{mis} Z^{mis}$.
The two covariates $Z^{obs}$ and $Z^{mis}$ were sampled from Bernoulli(0.5) with
$\Pr(Z^{mis} = 0 \mid Z^{obs} = 0) = \Pr(Z^{mis} = 1 \mid Z^{obs} = 1) = \rho$, where $\rho = 0.5, 0.8$.
True value: $\beta = (\beta_{obs}, \beta_{mis}) = (0.4, 0.5)$.
$C \sim U(0, c)$, where $c$ was chosen so that the censoring rate is close to 50%.
The selection variable $S$ was generated from a Bernoulli distribution with probability
$\pi = 0.1(1-\Delta)(1-Z^{obs}) + 0.3(1-\Delta)Z^{obs} + 0.5\Delta(1-Z^{obs}) + 0.7\Delta Z^{obs}$.
Missing rate: 60%.
$n = 500$.
Imputation method: logreg, which imputes binary data by the Bayesian logistic regression model (Rubin, 1987; Van Buuren, 1999, 2000).


binary target variable

Table: Estimates for $\beta_{mis}$ and $\beta_{obs}$ under simulation scenario II.

Setting                 Method        Parameter        Average Est.   AL      PB%      CR%
n = 500, $\rho$ = 0.5   Full Cohort   $\beta_{mis}$    0.500          0.471    0.116   94.8
                                      $\beta_{obs}$    0.402          0.464    0.747   96.0
                        MI logreg     $\beta_{mis}$    0.480          0.901   -3.862   91.8
                                      $\beta_{obs}$    0.409          0.493    2.368   96.1
n = 500, $\rho$ = 0.8   Full Cohort   $\beta_{mis}$    0.498          0.575   -1.777   96.4
                                      $\beta_{obs}$    0.398          0.596   -0.425   95.1
                        MI logreg     $\beta_{mis}$    0.491          0.989   -0.299   91.8
                                      $\beta_{obs}$    0.397          0.651   -0.671   96.0


binary target variable: case-cohort study

$n = 500$, $P[\Delta = 1] = 0.1$

$Z^{obs}$: always observed
  We assume $Z^{obs} \sim$ Bernoulli(0.1), i.e. $P[Z^{obs} = 1] = 0.1$.

$Z^{mis}$: partially observed
  The missingness in $Z^{mis}$ arises from the case-cohort study design.
  The covariate measurements for $Z^{mis}$ are available only for subjects sampled into the case-cohort sample.
  The probability of being observed depends on $Z^{obs}$ and $\Delta$.
  We assume $p_0 = 1/18$ and $p_1 = 1/2$, where $p_0 = P[S = 1 \mid Z^{obs} = 0]$ and $p_1 = P[S = 1 \mid Z^{obs} = 1]$.

The selection variable $S \sim$ Bernoulli($\pi$), where
$\pi = p_1(1-\Delta)Z^{obs} + \Delta Z^{obs} + p_0(1-\Delta)(1-Z^{obs}) + \Delta(1-Z^{obs})$.

The corresponding probabilities of $Z^{mis} = 1$ are 0.5 and 0.26 when $\rho = 0.5$ and $\rho = 0.8$, respectively.
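The quoted probabilities 0.5 and 0.26 follow from the law of total probability, and the selection rule always keeps cases. The Python sketch below checks both with exact fractions. It writes rho for the agreement probability Pr(Z_mis = Z_obs | Z_obs), which is a symbol name assumed here, and the function names are hypothetical.

```python
from fractions import Fraction

def marginal_prob_zmis(rho, p_obs=Fraction(1, 10)):
    """P(Z_mis = 1) when Z_mis agrees with Z_obs with probability rho
    and P(Z_obs = 1) = p_obs."""
    return p_obs * rho + (1 - p_obs) * (1 - rho)

def selection_prob(delta, z_obs, p0=Fraction(1, 18), p1=Fraction(1, 2)):
    """Selection probability: cases (delta = 1) are always selected,
    controls are selected with probability p1 or p0 depending on Z_obs."""
    return (p1 * (1 - delta) * z_obs + delta * z_obs
            + p0 * (1 - delta) * (1 - z_obs) + delta * (1 - z_obs))

prob_rho_half = marginal_prob_zmis(Fraction(1, 2))   # rho = 0.5
prob_rho_high = marginal_prob_zmis(Fraction(4, 5))   # rho = 0.8
```

With P(Z_obs = 1) = 0.1, rho = 0.8 gives 0.1 x 0.8 + 0.9 x 0.2 = 0.26, which matches the value on the slide.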

binary target variable : Area we want to select


binary target variable: Result

Setting                 Method        Parameter        Average Est.   AL      PB%      CR%
n = 500, $\rho$ = 0.5   Full Cohort   $\beta_{mis}$    0.507          0.819    1.502   95.3
                                      $\beta_{obs}$    0.409          1.640    2.345   94.2
                        MI logreg     $\beta_{mis}$    0.511          1.440    2.289   96.7
                                      $\beta_{obs}$    0.383          1.681   -4.180   94.3
n = 500, $\rho$ = 0.8   Full Cohort   $\beta_{mis}$    0.507          1.006    1.460   94.5
                                      $\beta_{obs}$    0.410          1.533    2.719   90.7
                        MI logreg     $\beta_{mis}$    0.472          1.733   -5.413   92.7
                                      $\beta_{obs}$    0.391          1.562   -2.150   90.7


Zinc concentration study

The zinc and oesophageal cancer data
  Abnet et al. (2005)
  NestedCohort package in R, Katki and Mark (2005)
  Menggang Yu (2011)

There are 81 cases and 350 controls among 431 subjects in total.
The sample with zinc measurements consists of 56 cases and 67 controls, so data are available for 123 of the 431 subjects (the missing rate is 71%).
Because it is expensive and difficult to measure metal concentrations on precious oesophageal biopsy tissue for everyone, only some subjects in the cohort were chosen to have the concentrations of zinc and other metals measured.
This can be treated as a special case of a missing covariate problem.
Software: ahaz package in R (Anders, 2011).

Zinc concentration study

Goal: to investigate the effect of zinc concentration, baseline histology, and family history of cancer on oesophageal cancer.
The missing data mechanism is related to the subjects' baseline histology by design.
In the complete-case analysis, only zinc (zncent) was significant.
Imputation methods: zinc (MI: pmm and norm); anyhist (MI: logreg).
Variables in the imputation model: sex, age, ever smoking for 6 months, history of alcohol consumption within the last 12 months, family history of cancer, baseline histology including carcinoma in situ (CIS) and NOS, failure time and censoring indicator.


Zinc concentration study - result

Table: MI approach using norm for zncent and logreg for anyhist

Covariate      Estimate   Std. Error   Z        Pr(>|z|)   lower 95%   upper 95%
sexMale         0.120     0.100         1.208   0.256       -0.075       0.317
agepill         0.008     0.006         1.297   0.194       -0.004       0.021
bahistE         0.306     0.108         2.824   0.004        0.093       0.519
basehistMD      0.631     0.203         3.110   0.001        0.233       1.029
basehistMoD     1.006     0.504         2.101   0.035        0.071       2.052
basehistSeD     1.762     0.857         2.055   0.039        0.081       3.442
basehistNOS     0.747     1.003         0.744   0.456       -1.219       2.714
basehistCIS     9.536     9.652         0.987   0.323       -9.382      20.845
anyhist         0.170     0.122         1.375   0.168       -0.072       0.412
zncent         -0.066     0.031        -2.128   0.033       -0.127      -0.005


Concluding Remarks

Missing data problems in case-cohort studies: since case-cohort data follow a MAR mechanism, the multiple imputation method can be applied.
To yield valid results when imputing missing data, it is crucial that the imputation model include all available variables.
MI norm, MI pmm, and MI logreg as imputation methods yield reasonable results in terms of estimates, Coverage Rate (CR), Average Length of CI (AL), and Percentage Bias (PB) in the simulation study.
The bias can be large when the amount of missing data is greater than 75%, but this depends on the sample size and on the correlation among the covariates related to the target missing variable. However, these methods might still produce useful results despite high missing rates.
