This tutorial serves as a quick boot camp to jump-start your own analyses with
linear mixed effects models. This text is different from other introductions by
being decidedly conceptual; I will focus on why you want to use mixed models
and how you should use them. While many introductions to this topic can be very
daunting to readers who lack the appropriate statistical background, this text is
going to be a softer kind of introduction. So, don't panic!
The tutorial requires R, so if you haven't installed it yet, go and get it! I also
recommend reading tutorial 1 in this series before you go further. You can find it
here:
http://www.bodowinter.com/tutorial/bw_LME_tutorial1.pdf
This tutorial will take you about 1 hour (possibly a bit more).
For updates and other tutorials, check my webpage www.bodowinter.com. If you have any
suggestions, please write me an email: [email protected]
Please cite as:
Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications.
arXiv:1308.5499. [http://arxiv.org/pdf/1308.5499.pdf]
[Figure: boxplots of pitch (roughly 100 to 400 Hz) for each individual subject (F1 to F9 and M1 to M7) and for each item, showing the by-subject and by-item variation in overall pitch.]
The variation between items isn't as big as the variation between subjects, but
there are still noticeable differences, and we better account for them in our model!
We do this by adding an additional random effect:
pitch ~ politeness + sex + (1|subject) + (1|item) + ε
So, on top of different intercepts for different subjects, we now also have different
intercepts for different items. We now resolved those non-independencies (our
model knows that there are multiple responses per subject and per item), and we
accounted for by-subject and by-item variation in overall pitch levels.
Note the efficiency and elegance of this model. Before, people used to do a lot of
averaging. For example, in psycholinguistics, people would average over items
for a subjects-analysis (each data point comes from one subject, assuring
independence), and then they would also average over subjects for an items-analysis
(each data point comes from one item). There's a whole literature on the
advantages and disadvantages of this approach (Clark, 1973; Forster & Dickinson,
1976; Wike & Church, 1976; Raaijmakers, Schrijnemakers, & Gremmen, 1999;
Raaijmakers, 2003; Locker, Hoffman, & Bovaird, 2007; Baayen, Davidson, &
Bates, 2008; Barr, Levy, Scheepers, & Tily, 2013).
The upshot is: while traditional analyses that do averaging are in principle legit,
mixed models give you much more flexibility and they take the full data into
account. If you do a subjects-analysis (averaging over items), you're essentially
disregarding by-item variation. Conversely, in the items-analysis, you're
disregarding by-subject variation.
Mixed models in R
For a start, we need to install the R package lme4 (Bates, Maechler & Bolker,
2012). While being connected to the internet, open R and type in:
install.packages("lme4")
Select a server close to you. After installation, load the lme4 package into R with
the following command:
library(lme4)
Now, you have the function lmer() available to you, which is the mixed model
equivalent of the function lm() in tutorial 1. This function is going to construct
mixed models for us.
But first, we need some data! I put a shortened version of the dataset that we used
for Winter and Grawunder (2012) onto my server. You can load it into R the
following way:
politeness=
read.csv("http://www.bodowinter.com/tutorial/politeness_data.csv")
Let's visualize the data with a boxplot of pitch by politeness and gender:

boxplot(frequency ~ attitude*gender,
    col=c("white","lightgray"), politeness)

[Figure: boxplots of pitch (Hz) for the four combinations inf.F, pol.F, inf.M and pol.M.]
What do we see? In both cases, the median line (the thick line in the middle of the
boxplot) is lower for the polite than for the informal condition. However, there
may be a bit more overlap between the two politeness categories for males than
for females.
Let's start with constructing our model!
Type in the command below:
lmer(frequency ~ attitude, data=politeness)
and you will get an error that should look like this:
Error in mkReTrms(findbars(RHSForm(formula)), fr) :
No random effects terms specified in formula
This is because the model needs a random effect (after all, mixing fixed and
random effects is the point of mixed models). We just specified a single fixed
effect, attitude, and that was not enough.
So, let's add random intercepts for subjects and items (remember that items are
called "scenarios" here):
politeness.model = lmer(frequency ~ attitude +
(1|subject) + (1|scenario), data=politeness)
The last command created a model that used the fixed effect attitude (polite vs.
informal) to predict voice pitch, controlling for by-subject and by-item variability.
We saved this model in the object politeness.model. Use summary() to
display the full result:
summary(politeness.model)
Again, let's work through this: First, the output reminds you of the model that you
fit. Then, there are some general summary statistics such as Akaike's Information
Criterion, the log-likelihood etc. We won't go into the meaning of these different
values in this tutorial because they are conceptually a little bit more involved.
Let's focus on the output for the random effects first:
Have a look at the column standard deviation. This is a measure of how much
variability in the dependent measure there is due to scenarios and subjects (our
two random effects). You can see that scenario (item) has much less variability
than subject. Based on our boxplots from above, where we saw more idiosyncratic
differences between subjects than between items, this is to be expected. Then, you
see "Residual" which stands for the variability that's not due to either scenario or
subject. This is our "ε" again, the random deviations from the predicted values
that are not due to subjects and items. Here, this reflects the fact that each and
every utterance has some factors that affect pitch that are outside of the purview
of our experiment.
The fixed effects output mirrors the coefficient table that we considered in tutorial
1 when we talked about the results of our linear model analysis.
The coefficient attitudepol is the slope for the categorical effect of politeness.
The value -19.695 means that to go from informal to polite, you have to go down
by 19.695 Hz. In other words: pitch is lower in polite speech than in informal speech,
by about 20 Hz. Then, there's a standard error associated with this slope, and a
t-value, which is simply the estimate (about 20 Hz) divided by the standard error (check
this by performing the calculation by hand).
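For instance, assuming the standard error shown in your output is about 5.6 (the value reported in the write-up below), the by-hand check would look roughly like this:

-19.695 / 5.6    # roughly -3.5, which should match the t-value in the summary output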
Note that the lmer() function (just like the lm() function in tutorial 1) took
whatever comes first in the alphabet to be the reference level. "inf" comes before
"pol", so the slope represents the change from "inf" to "pol". If the reference
category were "pol" rather than "inf", the only thing that would change would
be the sign of the coefficient: 19.695 would be positive. Standard errors,
significance etc. would remain the same.
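If you ever want the other reference level, you can change it yourself. Here is a minimal sketch using base R's relevel() (this assumes attitude is stored as a factor; if your version of R read it in as plain text, wrap it in factor() first, as shown):

politeness$attitude = relevel(factor(politeness$attitude), ref="pol")
# refit the model; the attitude slope now describes the change from "pol" to "inf"
# and its sign flips, exactly as described above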
Now, let's consider the intercept. In tutorial 1, we already talked about the fact
that oftentimes, model intercepts are not particularly meaningful. But this
intercept is especially weird. It's 202.588 Hz. Where does that value come from?
If you look back at the boxplot that we constructed earlier, you can see that the
value 202.588 Hz seems to fall halfway between males and females (in the
informal condition), and this is indeed what this intercept represents. It's the
average of our data for the informal condition.
As we didn't inform our model that there are two sexes in our dataset, the intercept
is particularly off, in between the voice pitch of males and females. This is just
like the classic example of a farm with a dozen hens and a dozen cows, where
the mean number of legs across all farm animals is three, which is not a particularly
informative representation of what's going on at the farm.
Let's add gender as an additional fixed effect:
politeness.model = lmer(frequency ~ attitude +
gender + (1|subject) +
(1|scenario), data=politeness)
We overwrote our original model object politeness.model with this new
model. Note that we added gender as a fixed effect because the relationship
between sex and pitch is systematic and predictable (i.e., we expect females to
have higher pitch). This is different from the random effects subject and item,
where the relationship between these and pitch is much more unpredictable and
random. We'll talk more about the distinction between fixed and random effects
later.
Let's print the model output again:
summary(politeness.model)
Let's have a look at the residuals first:
Note that compared to our earlier model without the fixed effect gender, the
variation that's associated with the random effect "subject" dropped considerably.
This is because the variation that's due to gender was confounded with the
variation that's due to subject. The model didn't know about males and females,
and so its predictions were relatively more off, creating relatively larger residuals.
Now that we have added the effect of gender, we have shifted a considerable
amount of the variance that was previously in the random effects component
(differences between male and female individuals) to the fixed effects component.
Let's look at the coefficient table now:
We see that males and females differ by about 109 Hz. And the intercept is now
much higher (256.846 Hz), as it now represents the female category (for the
informal condition). The coefficient for the effect of attitude didn't change much.
Statistical significance
So far, we haven't talked about significance yet. But, if you want to publish this,
you'll most likely need to report some kind of p-value. Unfortunately, p-values
for mixed models aren't as straightforward as they are for the linear model. There
are multiple approaches, and there's a discussion surrounding these, with
sometimes wildly differing opinions about which approach is the best. Here, I
focus on the Likelihood Ratio Test as a means to attain p-values.
Likelihood is the probability of seeing the data you collected given your model.
The logic of the likelihood ratio test is to compare the likelihood of two models
with each other. First, the model without the factor that you're interested in (the
null model), then the model with the factor that you're interested in. Maybe an
analogy helps you to wrap your head around this: Say, you're a hiker, and you
carry a bunch of different things with you (e.g., a gallon of water, a flashlight). To
know whether each item affects your hiking speed, you need to get rid of it. So,
you get rid of the flashlight and run without it. Your hiking speed is not affected
much. Then, you get rid of the gallon of water, and you realize that your hiking
speed is affected a lot. You would conclude that carrying a gallon of water with
you significantly affects your hiking speed whereas carrying a flashlight does not.
Expressed in formulas, you would want to compare the following two models
(think "hikes") with each other:
mdl1 = hiking speed ~ gallon of water + flashlight
mdl2 = hiking speed ~ flashlight
If there is a significant difference between mdl2 and mdl1, then you know
that the gallon of water matters. To assess the effect of the flashlight, you would
have to do a similar comparison:
mdl1 = hiking speed ~ gallon of water + flashlight
mdl2 = hiking speed ~ gallon of water
In both cases, we compared a full model (with the fixed effects in question)
against a reduced model without the effects in question. In each case, we conclude
that a fixed effect is significant if the difference between the likelihood of these
two models is significant.
Here's how you would do this in R. First, you need to construct the null model:
politeness.null = lmer(frequency ~ gender +
(1|subject) + (1|scenario), data=politeness,
REML=FALSE)
Note one additional technical detail. I just added the argument REML=FALSE.
Don't worry about it too much but, in case you're interested, this changes some
internal stuff (in particular, the likelihood estimator), and it is necessary to do this
when you compare models using the likelihood ratio test (Pinheiro & Bates, 2000;
Bolker et al., 2009).
Then, we re-do the full model above, this time also with REML=FALSE:
politeness.model = lmer(frequency ~ attitude +
    gender + (1|subject) + (1|scenario),
    data=politeness, REML=FALSE)
Now you have two models to compare with each other: one with the effect in
question, one without the effect in question. We perform the likelihood ratio test
using the anova() function:
anova(politeness.null,politeness.model)
This is the resulting output:
You're being reminded of the formula of the two models that you're comparing.
Then, you find a Chi-Square value, the associated degrees of freedom and the p-value.²
You would report this result the following way:
"politeness affected pitch (χ²(1)=11.62, p=0.00065), lowering it by
about 19.7 Hz ± 5.6 (standard errors)"
If you're used to t-tests, ANOVAs and linear model stuff, then this likelihood-based
approach might seem weird to you. Rather than getting a p-value
straightforwardly from your model, you get a p-value from a comparison of two
models. To help you get used to the logic, remember the hiker and the analogy of
putting one piece of luggage away to estimate that piece's effect on hiking speed.
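By the way, if you want to see roughly what anova() is doing under the hood here, the following sketch computes the likelihood ratio statistic by hand from the two REML=FALSE models above (it should agree with the anova() output):

ll.null = logLik(politeness.null)
ll.full = logLik(politeness.model)
chisq = as.numeric(2 * (ll.full - ll.null))   # -2 times the log likelihood ratio
pchisq(chisq, df=1, lower.tail=FALSE)         # p-value, 1 degree of freedom (only attitude differs)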
2
You might wonder why we're doing a Chi-Square test here. There's a lot of technical detail here,
but the main thing is that there's a theorem, called Wilks' Theorem, which states that negative two
times the log likelihood ratio of two models approaches a Chi-Square distribution with degrees of
freedom of the number of parameters that differ between the models (in this case, only "attitude").
So, somebody has done a proof of this and you're good to go!
Do note, also, that some people don't like straight-jacketing likelihood into the classical
null-hypothesis significance testing framework that we're following here, and so they would disagree
with the interpretation of likelihood the way we used it in the likelihood ratio test.
Note that we kept the predictor gender in the model. The only change between
the full model and the null model that we compared in the likelihood ratio test
was the factor of interest, politeness. In this particular test, you can think of
gender as a control variable and of attitude as your test variable.
We could have also compared the following two models:
full model:      frequency ~ attitude + gender
reduced model:   frequency ~ 1
mdl.null in this case is an intercept only model, where you just estimate the
mean of the data. You could compare this to mdl.full, which has two more
effects, attitude and gender. If this difference became significant, you would
know that mdl.full and mdl.null are significantly different from each other
but you would not know whether this difference is due to attitude or due to
gender. Coming back to the hiker analogy, it is as if you dropped both the
gallon of water and the flashlight and then you realized that your hiking speed
changed, but you wouldn't be able to determine conclusively which one of the
two pieces of luggage was the crucial one.
A final thing regarding likelihood ratio tests: What happens if you have an
interaction? We didn't talk much about interactions yet, but say, you predicted
attitude to have an effect on pitch that is somehow modulated through gender.
For example, it could be that speaking politely versus informally has the opposite
effect for men and women. Or it could be that women show a difference and men
don't (or vice versa). If you have such an inter-dependence between two factors
(called an interaction), you can test it the following way:
full model:      frequency ~ attitude*gender
reduced model:   frequency ~ attitude + gender
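In R, such a comparison could look roughly like the following sketch (the model names are just for illustration, and I'm using the random intercepts from before; the next section refines the random effects structure):

politeness.interaction = lmer(frequency ~ attitude*gender +
    (1|subject) + (1|scenario), data=politeness, REML=FALSE)
politeness.nointeraction = lmer(frequency ~ attitude + gender +
    (1|subject) + (1|scenario), data=politeness, REML=FALSE)
anova(politeness.nointeraction, politeness.interaction)
# if this comparison is significant, the effect of attitude differs between genders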
Random slopes versus random intercepts
If you inspect the coefficients of the model (for example with coef(politeness.model)), you
see that each scenario and each subject is assigned a different intercept.
That's what we would expect, given that we've told the model with (1|subject)
and (1|scenario) to take by-subject and by-item variability into account.
But note also that the fixed effects (attitude and gender) are all the same for all
subjects and items. Our model is what is called a random intercept model. In this
model, we account for baseline differences in pitch, but we assume that whatever
the effect of politeness is, it's going to be the same for all subjects and items.
But is that a valid assumption? In fact, oftentimes it's not: it is quite expected
that some items would elicit more or less politeness. That is, the effect of
politeness might be different for different items. Likewise, the effect of politeness
might be different for different subjects. For example, it might be expected that
some people are more polite, others less. So, what we need is a random slope
model, where subjects and items are not only allowed to have differing intercepts,
but where they are also allowed to have different slopes for the effect of
politeness. This is how we would do this in R:
politeness.model = lmer(frequency ~ attitude +
gender + (1+attitude|subject) +
(1+attitude|scenario),
data=politeness,
REML=FALSE)
Note that the only thing that we changed is the random effects, which now look a
little more complicated. The notation (1+attitude|subject) means that you tell
the model to expect differing baseline-levels of frequency (the intercept,
represented by 1) as well as differing responses to the main factor in question,
which is attitude in this case. You then do the same for items.
Have a look at the coefficients of this updated model by typing in the following:
coef(politeness.model)
Here's a reprint of the output that I got:
Now, the column with the by-subject and by-item coefficients for the effect of
politeness (attitudepol) is different for each subject and item. Note, however,
that it's always negative and that many of the values are quite similar to each
other. This means that despite individual variation, there is also consistency in
how politeness affects the voice: for all of our speakers, the voice tends to go
down when speaking politely, but for some people it goes down slightly more so
than for others.
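If you want to pull these by-subject and by-item politeness slopes out for further inspection or plotting, one way to do it is the following sketch (coef() on a model fitted with lmer() returns one table per grouping factor; by.subject is just an illustrative name):

by.subject = coef(politeness.model)$subject
by.subject[ , "attitudepol"]       # one politeness slope per subject
coef(politeness.model)$scenario    # the corresponding by-item estimates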
Have a look at the column for gender. Here, the coefficients do not change. That
is because we didn't specify random slopes for the by-subject or by-item effect of
gender.
O.k., let's try to obtain a p-value. We keep our model from above
(politeness.model) and compare it to a new null model in a likelihood ratio
test. Let's construct the null model first:
politeness.null = lmer(frequency ~ gender +
(1+attitude|subject) + (1+attitude|scenario),
data=politeness, REML=FALSE)
Note that the null model needs to have the same random effects structure. So, if
your full model is a random slope model, your null model also needs to be a
random slope model.
Let's now do the likelihood ratio test:
anova(politeness.null,politeness.model)
This is, again, significant.
There are a few important things to say here: You might ask yourself "Which
random slopes should I specify?" or even "Are random slopes necessary at
all?"
A lot of people construct random intercept-only models, but conceptually it makes
hella sense to include random slopes most of the time. After all, you can almost
always expect that people differ in how they react to an experimental
manipulation! And likewise, you can almost always expect that the effect of an
experimental manipulation is not going to be the same for all items.
Moreover, researchers in ecology (Schielzeth & Forstmeier, 2009),
psycholinguistics (Barr, Levy, Scheepers, & Tily, 2013) and other fields have
shown via simulations that mixed models without random slopes are anti-conservative
or, in other words, they have a relatively high Type I error rate (they
tend to find a lot of significant results which are actually due to chance).
Barr et al. (2013) recommend that you should "keep it maximal" with respect to
your random effects structure, at least for controlled experiments. This means that
you include all random slopes that are justified by your experimental design
and you do this for all fixed effects that are important for the overall interpretation
of your study.
In the model above, our whole study crucially rested on stating something about
politeness. We were not interested in gender differences, but they are well worth
controlling for. This is why we had random slopes for the effect of attitude (by
subjects and by items) but not gender. In other words, we only modeled by-subject
and by-item variability in how politeness affects pitch.
Assumptions
In tutorial 1, we talked a lot about the many different assumptions of the linear
model. The good news is: Everything that we discussed in the context of the
linear model applies straightforwardly to mixed models. So, you also have to
worry about collinearity and influential data points. And you have to worry about
homoscedasticity (and potentially about lack of normality). But you don't have to
learn much new stuff. The way you check these assumptions in R is exactly the
same as in the case of the linear model, say, by creating a residual plot, a
histogram of the residuals or a Q-Q plot.
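For example, here is a minimal sketch of those checks for the final model (fitted(), residuals(), hist() and qqnorm() are base R functions that work on lmer() models):

plot(fitted(politeness.model), residuals(politeness.model))   # residual plot: look for funnels or other patterns
hist(residuals(politeness.model))                              # roughly symmetric and bell-shaped?
qqnorm(residuals(politeness.model)); qqline(residuals(politeness.model))   # Q-Q plot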
Independence, being the most important assumption, requires a special word: One
of the main reasons we moved to mixed models rather than just working with
linear models was to resolve non-independencies in our data. However, mixed
models can still violate independence if you're missing important fixed or
random effects. So, for example, if we analyzed our data with a model that didn't
include the random effect "subject", then our model would not know that there
are multiple responses per subject. This amounts to a violation of the
independence assumption. So choose your fixed effects and random effects
carefully, and always try to resolve non-independencies.
Then, a word on influential data points. You will find that the function
dfbeta() that we used in the context of linear models doesn't work for mixed
models. If you worry about influential points, you can check out the package
influence.ME (Nieuwenhuis, te Grotenhuis, & Pelzer, 2012), or you can program
a for loop that does the leave-one-out diagnostics by hand. The following code
gives you an outline of the general structure of how you might want to do this
(you can check my doodling tutorials on loops and programming structures in R
to get a better grasp of this):
# pre-define a vector with one slot per row of your data frame
all.res = numeric(nrow(mydataframe))

for(i in 1:nrow(mydataframe)){
    # refit the full model, leaving out row i
    myfullmodel = lmer(response ~ predictor +
        (1+predictor|randomeffect), data=mydataframe[-i,])
    # store the fixed effect coefficient you care about (see footnote 3)
    all.res[i] = fixef(myfullmodel)[some number]
}
Go ahead and play with checking the assumptions. You can go back to tutorial 1
and apply the code in there to the new mixed model objects in this tutorial.
3
The basic idea of this code snippet is this: Pre-define a vector that has as many elements as you
have rows in your dataset. Then, cycle through each row. For each iteration, make a new mixed
model without that row (this is achieved by mydataframe[-i,]). Then, the function fixef() extracts
whatever coefficient interests you.
You will need to adapt this code to your analysis. Besides the names of your data frame and your
variables, you need to run fixef() on your model once so you know which position the relevant
coefficient is in. In our case, I would put a "2" in there because the effect of attitudepol appears
second in the list of coefficients. "1" would give me the intercept, always the first coefficient
mentioned in the coefficient table.
In contrast to fixed effects, random effects generally sample from the population of interest. That
means that they are far away from exhausting the population, because there are
usually many, many more subjects or items that you could have tested. The levels
of the factor in your experiment are a tiny subset of the levels "out there" in the
world.
The write-up
A lot of tutorials don't cover how to write up your results. And that's a pity,
because this is a crucial part of your study!
The most important thing: You need to describe the model to such an extent that
people can reproduce the analysis. So, a useful heuristic for writing up your
results is to ask yourself the question "Would I be able to re-create the analysis
given the information that I provided?" If the answer is yes, your write-up is
good.
In particular, this means that you specify all fixed effects and all random effects,
and you should also mention whether you have random intercepts or random
slopes.
For reporting individual results, you can stick to my example with the likelihood
ratio test above. Remember that it's always important to report the actual
coefficients/estimates and not just whether an effect is significant. You should
also mention standard errors.
Another important thing is to give enough credit to the people who put so much of
their free time into making lme4 and R work so efficiently. So let's cite them! It's
also a good idea to cite exactly the version that you used for your analysis. You
can find out your version and who to cite by typing in
citation()
for your R-version and
citation("lme4")
for the lme4 package.
Finally, it's important that you mention that you checked assumptions, and that
the assumptions are satisfied.
So here's what I would have written for the analysis that we performed in this
tutorial:
We used R (R Core Team, 2012) and lme4 (Bates, Maechler & Bolker,
2012) to perform a linear mixed effects analysis of the relationship
between pitch and politeness. As fixed effects, we entered politeness and
gender (without interaction term) into the model. As random effects, we
had intercepts for subjects and items, as well as by-subject and by-item
random slopes for the effect of politeness. Visual inspection of residual
plots did not reveal any obvious deviations from homoscedasticity or
normality. P-values were obtained by likelihood ratio tests of the full
model with the effect in question against the model without the effect in
question.
Yay, we're done!! I hope this tutorial was of help to you. I want to thank the
many people who taught me stats (in particular Roger Mundry), and the many
readers of this tutorial who gave feedback to earlier versions. Finally, I want to
thank you for joining me on this quick tour through mixed models.
References
Bates, D.M., Maechler, M., & Bolker, B. (2012). lme4: Linear mixed-effects models
using S4 classes. R package version 0.999999-0.
Baayen, R.H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics
Using R. Cambridge: Cambridge University Press.
Baayen, R.H., Davidson, D.J., Bates, D.M. (2008). Mixed-effects modeling with crossed
random effects for subjects and items. Journal of Memory and Language, 59,
390-412.
Barr, D.J., Levy, R., Scheepers, C., & Tily, H.J. (2013). Random effects structure for
confirmatory hypothesis testing: Keep it maximal. Journal of Memory and
Language, 68, 255-278.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H.
H., & White, J. S. S. (2009). Generalized linear mixed models: a practical guide
for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127-135.
Clark, H.H. (1973). The language-as-fixed-effect fallacy: A critique of language
statistics in psychological research. Journal of Verbal Learning and Verbal
Behavior, 12, 335-359.
Forster, K.I., & Dickinson, R.G. (1976). More on the language-as-fixed-effect fallacy:
Monte Carlo estimates of error rates for F1, F2, F′ and min F′. Journal of Verbal
Learning & Verbal Behavior, 15, 135-142.
Locker, L., Hoffman, L., & Bovaird, J.A. (2007). On the use of multilevel modeling as an
alternative to items analysis in psycholinguistic research. Behavior Research
Methods, 39, 723-730.
Nieuwenhuis, R., te Grotenhuis, M., & Pelzer, B. (2012). influence.ME: Tools for
detecting influential data in mixed effects models. R Journal, 4(2), 38-47.
Pinheiro, J.C., & Bates, D.M. (2000). Mixed-Effects Models in S and S-PLUS. New York:
Springer.
Raaijmakers, J.G. (2003). A further look at the language-as-fixed-effect fallacy.
Canadian Journal of Experimental Psychology, 57, 141-151.
Raaijmakers, J.G., Schrijnemakers, J.M.C., & Gremmen, F. (1999). How to Deal with
"The Language-as-Fixed-Effect Fallacy": Common Misconceptions and
Alternative Solutions. Journal of Memory and Language, 41, 416-426.
R Core Team (2012). R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria.
Schielzeth, H., & Forstmeier, W. (2009). Conclusions beyond support: overconfident
estimates in mixed models. Behavioral Ecology, 20, 416-420.
Wike, E.L., & Church, J.D. (1976). Comments on Clark's "The language-as-fixed-effect
fallacy". Journal of Verbal Learning & Verbal Behavior, 15, 249-255.
Winter, B., & Grawunder, S. (2012). The Phonetic Profile of Korean Formality. Journal
of Phonetics, 40, 808-815.