Regression for Correlated Data

Fabian Scheipl

Summer Term 2024

Slides and exercises based on material co-created with Anne-Laure Boulesteix, Sonja Greven, Almond Stöcker, Jona Cederbaum, Giacomo de Nicola & Nurzhan Sapargali.

Regression for Correlated Data, Summer 2024


Language

• Lecture videos, labs and exams (!) are in English.

• You can always ask questions in German, I’ll translate and reply in
English.



Flipped Classroom

• Lecture videos to be viewed and slides/code to be reviewed every week before the Q&A session on Thursday. Do not come unprepared; you would be wasting your time.

• Lab sessions alternating with lecture.

• Ex. sheets 02 and 04 are to be solved in groups of 2+ people and handed in (due May 09 and May 30, respectively); I’ll provide individual feedback on those.

• Up-to-date schedule, announcements, etc. on Moodle.


• Lecture videos on LMUcast (see Moodle for link)

• Immediately after each video, do the corresponding quiz on Moodle for self-assessment, activation and engagement.

• Moodle forums will be used to collect your questions & requests for
clarification that we’ll discuss in the Q&A sessions.

• Homework: post at least one question (or answer) per week.



Exam

• 15 minute oral exams

• Dates: July 22-23 (Retake: sometime around October 9).

• Registration for exams is mandatory.

• Lectures, code and exercise sessions will all be relevant for the exam.

• Oral exams will feature theoretical questions as well as interpretation of R code, R outputs, and data visualizations.



References
• Diggle, Heagerty, Liang, and Zeger (2002). Analysis of Longitudinal Data. Oxford University Press.

• Fahrmeir, Kneib, Lang, and Marx (2013). Regression: Models, Methods and Applications. Springer.

• Fitzmaurice, Laird, and Ware (2004). Applied Longitudinal Analysis. Wiley.

• Molenberghs and Verbeke (2005). Models for Discrete Longitudinal Data. Springer.

• Verbeke and Molenberghs (2000). Linear Mixed Models for Longitudinal Data. Springer.

• Wood (2017). Generalized Additive Models: An Introduction with R. Springer.

Additional papers and books are referenced in the slides / on Moodle.



Overview Chapter 1 - Introduction

1.1 Introduction to longitudinal data

1.2 Examples

1.3 Advantages of longitudinal data

1.4 Challenges of longitudinal data

1.5 Correlation and modeling approaches



What are longitudinal data?

For repeated measures data, the variable of interest is measured repeatedly for the same subjects under different conditions.
Example: heart rate measurements for several subjects after different exercises.

Longitudinal data are a special type of repeated measures data: the variable of interest is measured for the same subjects repeatedly over time.
Example: heart rate measurements for several subjects over 12 months.

[We will use the term “subject” for convenience, even though the unit of
observation might be an animal, a crop field, a country etc.]



Examples of longitudinal studies
• Cohort studies set up a cohort of people sharing some characteristic
(e.g. born in the same year, free of a certain disease that is
prospectively studied) and follow it over time. Often used in
medicine/epidemiology, but also in other areas.

• Panel studies are similar to cohort studies, often collecting repeated measurements at specified time intervals, but the term is more common in the social and economic sciences. In some uses of the term, the panel is drawn to represent a cross-section of the population being studied and this sometimes involves replacement of panel members leaving the study.

• In randomized (clinical) trials, subjects are randomly assigned to treatment groups and often followed up over time.



Notation and special cases

• Let $n_i$ be the number of observations for subject $i = 1, \dots, N$.

• Let $t_{i1}, \dots, t_{in_i}$ be the time points at which subject $i$ is measured.

• Balanced data have the same number of observations $n_1 = \dots = n_N$ and the same time points $t_{ij} \equiv t_j$, $j = 1, \dots, n_i$, for all subjects $i$.

• If the observation times also have the same distance $d = t_{j+1} - t_j$ for all $j$, they are called equally spaced.
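
The two definitions above can be checked mechanically. The following is a minimal Python sketch, not part of the course material: the function names and subject IDs are made up for illustration, and the time points mimic the Orthodont ages (8, 10, 12, 14) and the TLC weeks (0, 1, 4, 6) from the examples in this chapter.

```python
def is_balanced(times_by_subject):
    """True if every subject is measured at exactly the same time points."""
    schedules = list(times_by_subject.values())
    return all(schedule == schedules[0] for schedule in schedules)


def is_equally_spaced(times):
    """True if all consecutive time points have the same distance d."""
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return all(gap == gaps[0] for gap in gaps)


# Hypothetical subject IDs; measurement times as in the Orthodont example.
orthodont_like = {"child_1": [8, 10, 12, 14], "child_2": [8, 10, 12, 14]}
# Same schedule for everyone, but gaps of 1, 3 and 2 weeks (TLC-like).
tlc_like = {"child_1": [0, 1, 4, 6], "child_2": [0, 1, 4, 6]}
# Dropout leads to unequal n_i, so the data are no longer balanced.
dropout_like = {"rat_1": [50, 60, 70], "rat_2": [50, 60]}

print(is_balanced(orthodont_like), is_equally_spaced([8, 10, 12, 14]))
print(is_balanced(tlc_like), is_equally_spaced([0, 1, 4, 6]))
print(is_balanced(dropout_like))
```

Comparing whole schedules (rather than only their lengths) matters: two subjects can have the same $n_i$ and still be measured at different times.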



Example 1: Sleep deprivation study

• Sleep deprivation study with daily measurements from day 0 (normal sleep) to day 9 (3 hours of sleep per night on subsequent nights) for 𝑁 = 18 subjects.

• Response: average reaction time (in milliseconds, ms) on a series of tests.

• No missing values; balanced and equally spaced data; time is the only covariate.

• First analyzed in Belenky et al. (2003), re-analyzed in Bates et al. (2014), and part of the R package lme4.

[Figure: Average reaction time [ms] against Days of sleep deprivation, one panel per subject (308, 309, 310, 330, 331, 332, 333, 334, 335, 337, 349, 350, 351, 352, 369, 370, 371, 372).]


Example 2: Cranial growth in children (Orthodont data)

• Data from Potthoff and Roy (1964), re-analyzed in the book by Little
and Rubin (1987) and part of the R-package nlme.

• 𝑁 = 27, 11 girls, 16 boys

• Response: distance between two points in the face (in mm)

• 4 measurements at the ages 8, 10, 12, 14 (balanced data, equally spaced)

• Questions of interest: Comparison of intercept and slope between boys and girls. Heterogeneity between subjects?

[Figure: Distance [mm] against Age [years] for the Orthodont data, legend: sex (Male, Female).]


Example 3: Treatment of lead-exposed children (TLC)

Background: US children can be exposed to lead-based paint in deteriorating housing built before 1978 (when the paint was banned). High blood lead levels increase the risk of several adverse health effects.

TLC trial (see Fitzmaurice et al, 2004): In this data set

• N=100 children 12-33 months old with high blood lead levels
• Response: Blood lead level (𝜇g/dl)
• Treatment: placebo or succimer (enhances urinary excretion of lead)
• Measurements: baseline, week 1, week 4 and week 6 (balanced data).

Question: Is the treatment effective?


Data source: http://www.hsph.harvard.edu/fitzmaur/ala/tlc.txt

[Figure: Blood lead level [µg/dl] against Week for the TLC trial, panels: Treatment, Placebo.]


Example 4: Rats

Question: Effect of an inhibitor for testosterone production (Decapeptyl) in rats on their craniofacial growth (see Verbeke and Molenberghs, 2000).

• 𝑁 = 50 male rats

• randomized into three groups: control, low dose, high dose

• Response: Distance between two well-defined points on X-ray pictures of the skull, characterizing the height of the skull (in pixels), taken on anesthetized rats.

• Same measurement times $t_j$ for all rats, but dropout due to rats not surviving the anesthesia (unequal $n_i$).

Data source: https://gbiomed.kuleuven.be/english/research/50000687/50000696/geertverbeke/data/rats.sas

[Figure: Distance [pixels] against Age [d] for the rat data, panels: low dose, high dose, control.]


Example 5: CD4

• Background: The human immunodeficiency virus (HIV) destroys CD4 cells, which are important in the body’s immune response. The CD4 cell count decreases after seroconversion (when anti-HIV antibodies develop, “HIV positive”) and is a good indicator of disease progression.

• The data, with 2376 observations on 369 men infected with HIV, are highly unbalanced (see Diggle et al., 2002).

• Questions of interest:
– the average time course for the CD4 cell depletion
– time courses for individuals
– heterogeneity between individuals
– factors influencing the CD4 cell count change (age, cigarette and
drug use, number of sexual partners, psychological health)

[Figure: CD4 cell count [1/ml] against Time since seroconversion [a].]


Some observations on longitudinal data

• Covariates can be
– time-invariant and only measured at baseline, e.g. gender or
treatment for Examples 2-4.
– time-varying and measured over time, e.g. cigarette and drug use
in the CD4 data.

• Sometimes longitudinal data are measured together with explicit survival / time-to-event data (more in Chapter 12).
Implicitly, the time to dropout is often related to the longitudinal outcome (e.g. in the CD4 data); this is very important to take into account.



• Longitudinal data can be measured prospectively or retrospectively (e.g. via a survey or by searching through archives). Prospective studies are typically more reliable, since retrospective studies are prone to e.g. recall bias (such as when subjects who developed a disease remember risk factors better than healthy controls). All considered examples are prospective studies.



Typical questions with longitudinal data

• Are there changes over time (e.g. sleep study)?

• If so, which shape do they take? Linear (e.g. growth in children)? Are
there break points (e.g. TLC trial)?

• Do changes depend on covariates, e.g. on treatment group or gender (e.g. growth in children, rat data, TLC trial)?

• Are changes associated with the baseline value at 𝑡 = 0 (e.g. CD4 data)?

• How large is the intra-individual variability compared to the inter-individual variability (e.g. CD4 data)?



Advantages of longitudinal studies: Sources of variation

We can distinguish different sources of variation:

• differences between subjects (inter-subject variability)

• changes/variability within a subject over time (intra-subject variability).

Heterogeneity between subjects is of direct interest e.g. in the CD4 data.
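
To make the two sources of variation concrete, here is a small Python sketch on made-up numbers (the reaction times and subject labels are hypothetical, not taken from any of the data sets above). It uses simple descriptive variances, not the mixed-model estimators discussed later in the course.

```python
from statistics import mean, variance

# Hypothetical reaction times [ms]: three subjects, three measurements each.
reaction_times = {
    "A": [250, 260, 270],
    "B": [300, 310, 320],
    "C": [350, 360, 370],
}

subject_means = {s: mean(ys) for s, ys in reaction_times.items()}

# Inter-subject variability: sample variance of the subject means.
between_var = variance(list(subject_means.values()))

# Intra-subject variability: pooled variance of the deviations of each
# observation from its own subject mean.
squared_devs = [
    (y - subject_means[s]) ** 2
    for s, ys in reaction_times.items()
    for y in ys
]
n_obs = sum(len(ys) for ys in reaction_times.values())
within_var = sum(squared_devs) / (n_obs - len(reaction_times))

print(between_var, within_var)
```

In this toy example the subjects differ far more from each other than each subject varies over time, which is the typical situation in which following the same subjects pays off.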



Advantages of longitudinal studies: Distinguishing effects
We can distinguish changes over time within an individual (ageing effects)
from differences in baseline levels between people (cohort effects).


[Figure: income against time.]

Income is increasing over time for each person (“ageing” effect).

Starting salaries seem to be decreasing over time (cohort effect).


Longitudinal studies can follow individual change over time and are thus more informative than cross-sectional studies ($n_i = 1$).

We can distinguish ageing effects and cohort effects, e.g.

$$\mathrm{E}[Y_{ij}] = \beta_0 + \beta_C t_{i1} + \beta_L (t_{ij} - t_{i1}),$$

• $\beta_C$ = increase in average starting salaries per year

• $\beta_L$ = average increase in salary per year after starting to work.

Note that $\mathrm{E}[Y_{ij} - Y_{ik}] = \beta_L (t_{ij} - t_{ik})$, i.e. changes in $Y$ for subject $i$ when $t$ changes contribute to the estimation of $\beta_L$.
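
A small numerical illustration of this decomposition, as a sketch only: the coefficient values, starting years and the helper `ols_slope` are assumptions chosen for illustration, and the incomes are generated without noise so that the slopes can be read off exactly.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical coefficients: starting salaries fall by 1 per year (cohort
# effect), individual salaries rise by 2 per year (ageing effect).
BETA_0, BETA_C, BETA_L = 50.0, -1.0, 2.0


def mean_income(t_start, t):
    """E[Y_ij] for a subject who started at t_start, observed at time t."""
    return BETA_0 + BETA_C * t_start + BETA_L * (t - t_start)


start_times = [0, 2, 4, 6]  # t_i1 for four subjects
n_obs = 3                   # yearly observations per subject
data = {
    i: [(s + j, mean_income(s, s + j)) for j in range(n_obs)]
    for i, s in enumerate(start_times)
}

# Cross-sectional view: only the first observation of each subject.
first_t = [obs[0][0] for obs in data.values()]
first_y = [obs[0][1] for obs in data.values()]
cross_sectional_slope = ols_slope(first_t, first_y)

# Longitudinal view: within-subject changes between consecutive observations.
dt, dy = [], []
for obs in data.values():
    for (t0, y0), (t1, y1) in zip(obs, obs[1:]):
        dt.append(t1 - t0)
        dy.append(y1 - y0)
longitudinal_slope = sum(dy) / sum(dt)

print(cross_sectional_slope, longitudinal_slope)
```

The first observations alone recover only the cohort effect, while the within-subject differences recover the ageing effect; the two slopes even have opposite signs here.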



In a cross-sectional study, without longitudinal information, we have to assume $\beta_C = \beta_L$. This is a strong assumption! In our example, $\beta_C$ and $\beta_L$ have different signs:

[Figure: two panels showing income against time.]


Advantages of longitudinal studies: Power

Longitudinal studies are often more powerful than cross-sectional studies for estimating $\beta_L$.
E.g. in the Orthodont data, $\beta_L$ can be estimated more precisely, as there is less variability in the slopes per subject than there is overall in the data. By tracking the same set of subjects over time, we can “remove” the variability in the individual level from our estimate of the average change over time $\beta_L$:

[Figure: Distance [mm] against Age [years] for the Orthodont data, two columns of panels, rows: Male, Female.]


Advantages of longitudinal studies: Confounding

Consider a time-varying covariate $x_{ij}$ of interest and a time-constant confounder $z_i$ and say we have

$$\mathrm{E}[Y_{ij}] = \beta_0 + \beta_1 x_{ij} + \beta_2 z_i \;\Rightarrow\; \mathrm{E}[Y_{ij} - Y_{ik}] = \beta_1 (x_{ij} - x_{ik}).$$

[Confounder: a variable that is associated with both the response and the covariate of interest and will lead to biased effect estimates if ignored.]

Note that $z_i$ does not appear in the mean of the change in $Y$. Longitudinal studies thus offer better protection against confounding: For changes in the response, each subject serves as its own control for time-constant variables such as age, gender, socio-economic background, education, genetics, disease history, …
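
The differencing argument can be tried out on a toy data set. This is only a sketch: the coefficients, the design and the helper `ols_slope` are made up for illustration, and the responses are noise-free means so that the effect of ignoring the confounder is visible exactly.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


BETA_0, BETA_1, BETA_2 = 1.0, 2.0, 5.0  # hypothetical true coefficients

rows = []  # tuples (subject, x_ij, z_i, y_ij)
for subject in range(4):
    z = float(subject)      # time-constant confounder z_i
    for j in range(3):
        x = z + j           # covariate x_ij is correlated with z_i
        y = BETA_0 + BETA_1 * x + BETA_2 * z
        rows.append((subject, x, z, y))

# Naive pooled regression of y on x that ignores z: biased for beta_1.
pooled_slope = ols_slope([r[1] for r in rows], [r[3] for r in rows])

# Within-subject differences: z_i cancels, only beta_1 remains.
dx, dy = [], []
for subject in range(4):
    obs = [r for r in rows if r[0] == subject]
    for prev, curr in zip(obs, obs[1:]):
        dx.append(curr[1] - prev[1])
        dy.append(curr[3] - prev[3])
within_slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)

print(round(pooled_slope, 2))  # about 5.26, far from the true value 2
print(within_slope)            # 2.0
```

The pooled slope absorbs part of the effect of $z_i$ because $x_{ij}$ and $z_i$ move together across subjects; the differences only use changes within a subject, where $z_i$ is constant.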



Similarly, if we include an intercept per subject in our model, this offers protection against any time-constant confounders $z_i$: Using

$$\mathrm{E}[Y_{ij}] = \beta_i + \beta_1 x_{ij} \;\Rightarrow\; \mathrm{E}[Y_{ij} - Y_{ik}] = \beta_1 (x_{ij} - x_{ik}),$$

$\beta_i$ now captures the effect of $\beta_0 + \beta_2 z_i$.

Note, however, that confounding by time-varying variables is still possible.



Confounding: Example air pollution study

Consider a study comparing cities with respect to their mortality counts and PM10 levels (particulate matter < 10 µm) to determine whether higher PM10 levels increase mortality.

• Cross-sectional study comparing average PM10 levels and mortality:
– Confounding by time-constant variables per city, e.g. different industrialization, poverty levels, climates, …

• Longitudinal study comparing daily PM10 levels and mortality counts:
– No confounding by time-constant variables if each city is allowed its own average mortality level in the model.
– Confounding by time-varying variables possible, e.g. seasonality (both PM10 and influenza higher in the winter) and long-term trends.



What is special about longitudinal data?

• Observations on the same subject are more similar than observations on different subjects. They are not independent, but correlated.

• Observations have an ordering in time.

• Often, observations are more similar the closer they are in time, i.e. the correlation is decreasing with the time difference. (In contrast to clustered data, e.g. on families.)

• Missing data are common, e.g. because of drop-out, when subjects leave the study.
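
The contrast between clustered and longitudinal correlation can be sketched with two simple within-subject correlation matrices. The functional forms below (constant correlation versus correlation $\rho^{|t_j - t_k|}$) and the function names are illustrative assumptions, not the only structures considered in this course.

```python
def exchangeable_corr(times, rho):
    """Same correlation rho for every pair, as is typical for clustered data."""
    n = len(times)
    return [[1.0 if j == k else rho for k in range(n)] for j in range(n)]


def decaying_corr(times, rho):
    """Correlation rho ** |t_j - t_k|, decreasing with the time difference."""
    return [[rho ** abs(tj - tk) for tk in times] for tj in times]


times = [0, 1, 2, 4]  # hypothetical measurement times for one subject
clustered = exchangeable_corr(times, 0.5)
longitudinal = decaying_corr(times, 0.5)

for row in longitudinal:
    print(row)
```

In the exchangeable matrix the first and last observation are as strongly correlated as neighbouring ones (0.5); in the decaying matrix their correlation drops to 0.5 ** 4 = 0.0625.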



Some challenges in longitudinal data

• Appropriate modeling of correlation structure (more in Chapters 3-10).

• There has been a lot of development in recent years, but flexibility and
robustness of software can still be an issue.

• Missing values are methodologically challenging and constitute a problem depending on the missing data mechanism and the method used (more in Chapter 11).



Challenges in longitudinal data: Time-varying covariates

• Determining an appropriate lag structure of covariate effects. Examples:
– does air pollution increase mortality immediately? After hours? Days? Cumulatively?
– carry-over effects in cross-over trials

• Covariate endogeneity when the response predicts the covariate values at later times (feedback mechanisms). Examples:
– the treatment is changed when the response values indicate that the patient is not responding
– patients in a study on the effects of physical activity on blood glucose levels increase their physical activity after high glucose measurements.

More in Chapter 12.
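
As an illustration of what different lag structures mean in practice, the following Python sketch derives a lagged and a cumulative version of a daily covariate. The PM10 values and the function names are hypothetical; which of these versions belongs in a model is exactly the open question raised above.

```python
def lagged(series, lag):
    """Covariate value `lag` steps earlier; None where it is not available."""
    return [series[t - lag] if t - lag >= 0 else None
            for t in range(len(series))]


def cumulative(series, window):
    """Sum over the current and the previous window - 1 values.

    None is returned until a full window is available.
    """
    return [
        sum(series[t - window + 1: t + 1]) if t - window + 1 >= 0 else None
        for t in range(len(series))
    ]


pm10 = [20, 35, 50, 30, 25]   # hypothetical daily PM10 levels for one city

same_day = pm10               # immediate effect
lag1 = lagged(pm10, 1)        # effect of yesterday's level
cum3 = cumulative(pm10, 3)    # cumulative exposure over three days

print(lag1)  # [None, 20, 35, 50, 30]
print(cum3)  # [None, None, 105, 115, 105]
```

Each derived series would enter the model as its own time-varying covariate; the leading `None` entries show that longer lags cost observations at the start of each subject's series.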



Why are simple methods not adequate?

Example: Orthodont data.


Question: Difference between genders in change over time?
Possible naive approaches:

• Linear regression model with covariates gender, age and their interaction.
Problems:

• Cross-sectional analysis, comparison of boys and girls at each age.
Problems:

• Linear regression model with covariate age for each subject. Comparison of subject-specific regression coefficients between boys and girls.
Problems:



Different viewpoints of correlation

• Marginal models: Model marginal correlation and/or account for it with robust standard errors (GEE).

• Mixed models: Observations are correlated, because they are from the same subject and share the same underlying processes.

• Transition/Markov models: Observations are correlated, because the past influences the present.
(Typical here: Past = preceding $q$ observations ⟹ Markov property.)



The three approaches for the linear model

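Consider a simple linear regression model (e.g. for child growth, $Y$ = height)

$$\mathrm{E}[Y_{ij}] = \beta_0 + \beta_1 t_{ij}. \tag{1}$$

• Marginal model: In addition to (1), specify a model for the variance $\mathrm{Var}(Y_{ij})$ and the correlation $\mathrm{Corr}(Y_{ij}, Y_{ik})$.

• Mixed model: Model curves with subject-specific means, e.g.

$$Y_{ij} = (\beta_0^{*} + b_{i0}) + (\beta_1^{*} + b_{i1}) t_{ij} + \epsilon_{ij},$$

$$\begin{pmatrix} b_{i0} \\ b_{i1} \end{pmatrix} \overset{iid}{\sim} N\left(\mathbf{0}, \begin{pmatrix} d_{11} & d_{12} \\ d_{12} & d_{22} \end{pmatrix}\right) \quad \text{ind. of} \quad \epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2).$$

• Transition model: Model the present in terms of the past, e.g. ($q = 1$)

$$Y_{ij} = \beta_0^{**} + \beta_1^{**} t_{ij} + \epsilon_{ij},$$

$$\epsilon_{ij} = \alpha \epsilon_{i,j-1} + \xi_{ij}, \quad \xi_{ij} \overset{iid}{\sim} N(0, \tau^2), \quad \epsilon_{i1} \sim N(0, \sigma^2).$$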

• In the linear model: Transition and linear mixed model imply marginal
models with particular correlation structures (cf. Ch. 3.5 and 6.1).
The 𝛽 parameters in all three approaches have the same marginal
interpretation. This is no longer the case in the generalized setting,
see Ch. 8 ff.

• For the linear case, we will focus on the linear mixed model (Ch. 3-7).
The generalized linear mixed model is discussed in Chapter 9.

• Marginal models are discussed for the generalized case in Chapter 10.
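
As a sketch of what “particular correlation structures” means here: the following expressions are derived directly from the two model equations above, they are not copied from the later chapters. For the transition model, the conditions $|\alpha| < 1$ and $\sigma^2 = \tau^2 / (1 - \alpha^2)$ are added assumptions that make the error variance constant over time.

```latex
% Linear mixed model with random intercept and slope, for j \neq k:
\begin{align*}
\mathrm{Var}(Y_{ij}) &= d_{11} + 2 d_{12} t_{ij} + d_{22} t_{ij}^2 + \sigma^2, \\
\mathrm{Cov}(Y_{ij}, Y_{ik}) &= d_{11} + d_{12} (t_{ij} + t_{ik}) + d_{22} t_{ij} t_{ik}.
\end{align*}

% Transition model with q = 1, assuming |\alpha| < 1 and
% \sigma^2 = \tau^2 / (1 - \alpha^2):
\begin{align*}
\mathrm{Var}(\epsilon_{ij}) &= \alpha^2 \sigma^2 + \tau^2 = \sigma^2, \\
\mathrm{Corr}(Y_{ij}, Y_{ik}) &= \alpha^{|j - k|}.
\end{align*}
```

In the mixed model the covariance depends on the time points themselves through the random slope, whereas in the transition model the correlation depends only on how many measurement occasions lie between the two observations.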



Longitudinal and other data

• (Balanced) longitudinal data can be viewed as a type of multivariate data. But with a special correlation structure!

• Hierarchical / multi-level / clustered data: Similar nested structure and approaches (random effects etc.), but without the temporal structure.

• Spatial data: 2-D / 3-D, no inherent ordering, usually no independent subunits. But many similar approaches to modeling correlation: Marginal models, Gaussian random effects / fields, Markov chains / random fields.

• Longitudinal data can be viewed as realizations of stochastic processes. Especially if measurements are frequent, they can be viewed as functional data, i.e., units of observation are random functions $Y_i(t)$, $i = 1, \dots, N$.



Analysis of longitudinal data (ALD) vs. time series analysis

• Both model time courses and try to take into account temporal
correlation between observations.

• Time series analysis usually focuses on prediction/forecasting, ALD usually focuses on the estimation of covariate effects.

• Longitudinal data typically span shorter time periods than time series,
but they contain independent replications in the form of subjects.
This allows us to borrow strength (can be more robust to model
assumptions).

• Many concepts from time series analysis are useful in ALD.
