Babies Learning Language: Methods




Friday, September 7, 2018

Scale construction, continued


For psychometrics fans: I helped out with a post by Brent Roberts, "Yes or No 2.0: Are Likert scales always preferable to dichotomous rating scales?" This post continues our earlier conversation on scale construction, examining whether – and if so, when – it's appropriate to use a Likert scale rather than a dichotomous scale. Spoiler: in some circumstances it's totally safe, while in others it's a disaster!

Posted by Michael Frank at 2:12 PM No comments:

Labels: Methods

Saturday, May 5, 2018

nosub: a command line tool for pushing web experiments to Amazon Mechanical Turk

(This post is co-written with Long Ouyang, a former graduate student in our department,
who is the developer of nosub, and Manuel Bohn, a postdoc in my lab who has created a
minimal working example).

Although my lab focuses primarily on child development, our typical workflow is to refine experimental paradigms by working with adults first. Because we treat adults as a convenience population, Amazon Mechanical Turk (AMT) is a critical part of this workflow. AMT allows us to pay an hourly wage to participants all over the US who complete short experimental tasks. (Some background is in an old post.)

Our typical workflow for AMT tasks is to create custom websites that guide participants
through a series of linguistic stimuli of one sort or another. For simple questionnaires we
often use Qualtrics, a commercial survey product, but most tasks that require more
customization are easy to set up as free-standing JavaScript/HTML sites. These sites then
need to be pushed to AMT as "external HITs" (Human Intelligence Tasks) so that workers can
find them, participate, and be compensated.

nosub is a simple tool for accomplishing this process, building on earlier tools used by my
lab.* The idea is simple: you customize your HIT settings in a configuration file and type

nosub upload

to upload your experiment to AMT. Then you can type

nosub download

to fetch results. Two nice features of nosub from a psychologist's perspective: (1) worker IDs are anonymized by default, so you don't need to worry about privacy issues (but they are deterministically hashed, so you can still flag repeat workers); and (2) nosub can post HITs in batches, so you don't get charged Amazon's surcharge for tasks with more than 9 HITs.
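
For intuition about the anonymization point, here is a tiny R sketch of deterministic hashing (this illustrates the concept only and is not nosub's actual implementation):

# Concept sketch: the same worker ID always hashes to the same anonymous string,
# so repeat workers can be flagged without keeping the raw IDs around
library(digest)

ids <- c("A1B2C3D4", "E5F6G7H8", "A1B2C3D4")   # hypothetical worker IDs
hashed <- sapply(ids, digest, algo = "sha256")

duplicated(hashed)                             # the third entry flags the repeat worker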

All you need to get started is to install Node.js; installation instructions for nosub are
available in the project repository.

Once you've run nosub, you can download your data in JSON format, which can easily be
parsed into R. We've put together a minimal working example of an experiment that can be
run using nosub and a data analysis script in R that reads in the data.
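
As a rough sketch of the R side (the file name and the nested structure here are assumptions for illustration; the minimal working example shows the real format), reading the JSON might look like this:

# Hypothetical sketch: read nosub's downloaded JSON into R
# (file name and structure are illustrative; see the minimal working example)
library(jsonlite)

raw <- fromJSON("results.json")   # downloaded results, one entry per participant in this sketch
str(raw)                          # inspect the structure before flattening into a data frame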

---
* psiTurk is another framework that provides a way of serving and tracking HITs. psiTurk is great, and we have used it for heavier-weight applications where we need to track participants, but it can be tricky to debug and is not always compatible with some of our lighter-weight web experiments.

Posted by Michael Frank at 3:52 PM No comments:

Labels: Methods

Monday, February 26, 2018

Mixed effects models: Is it time to go Bayesian by default?

(tl;dr: Bayesian mixed effects modeling using brms is really nifty.)

Introduction: Teaching Statistical Inference?

How do you reason about the relationship between your data and your hypotheses?
Bayesian inference provides a way to make normative inferences under uncertainty. As
scientists – or even as rational agents more generally – we are interested in knowing the
probability of some hypothesis given the data we observe. As a cognitive scientist I've long
been interested in using Bayesian models to describe cognition, and that's what I did much
of my graduate training in. These are custom models, sometimes fairly difficult to write
down, and they are an area of active research. That's not what I'm talking about in this
blogpost. Instead, I want to write about the basic practice of statistics in experimental
data analysis.

Mostly when psychologists do and teach "stats," they're talking about frequentist statistical tests. Frequentist statistics are the standard kind people in psych have been using for the last 50+ years: t-tests, ANOVAs, regression models, etc. Anything that produces a p-value. P-values represent the probability of the data (or data more extreme) under the null hypothesis (typically "no difference between groups" or something like that). The problem is that this is not what we really want to know as scientists. We want the opposite: the probability of the hypothesis given the data, which is what Bayesian statistics allow you to compute. You can also compute the relative evidence for one hypothesis over another (the Bayes Factor).

Now, the best way to set psychology twitter on fire is to start a holy war about who's
actually right about statistical practice, Bayesians or frequentists. There are lots of
arguments here, and I see some merit on both sides. That said, there is lots of evidence
that much of our implicit statistical reasoning is Bayesian. So, on balance, I tend towards the Bayesian side <ducks head>. But despite this bias, I've avoided teaching Bayesian stats in my classes. I've felt that, even with its philosophical attractiveness, actually computing Bayesian stats posed too many severe challenges for students. For example,
in previous years you might run into major difficulties inferring the parameters of a model
that would be trivial under a frequentist approach. I just couldn't bring myself to teach a
student a philosophical perspective that – while coherent – wouldn't provide them with an
easy toolkit to make sense of their data.

The situation has changed in recent years, however. In particular, the BayesFactor R
package by Morey and colleagues makes it extremely simple to do basic inferential tasks
using Bayesian statistics. This is a huge contribution! Together with JASP, these tools make
the Bayes Factor approach to hypothesis testing much more widely accessible. I'm really
impressed by how well these tools work.
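
For example, a comparison that would ordinarily be a two-sample t-test takes a single line with BayesFactor (the data below are simulated just for illustration):

# Illustrative only: Bayes factor for a two-group comparison with simulated data
library(BayesFactor)

set.seed(42)
group1 <- rnorm(30, mean = 0.5)   # simulated scores for group 1
group2 <- rnorm(30, mean = 0.0)   # simulated scores for group 2

# Relative evidence for a group difference versus the null of no difference
ttestBF(x = group1, y = group2)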

All that said, my general approach to statistical inference tends to rely less on inference
about a particular hypothesis and more on parameter estimation – following the spirit of
folks like Gelman & Hill (2007) and Cumming (2014). The basic idea is to fit a model whose
parameters describe substantive hypotheses about the generating sources of the dataset,
and then to interpret these parameters based on their magnitude and the precision of the
estimate. (If this sounds vague, don't worry – the last section of the post is an example).
The key tool for this kind of estimation is not tests like the t-test or the chi-squared.
Instead, it's typically some variant of regression, usually mixed effects models.

Mixed-Effects Models

Especially in psycholinguistics where our experiments typically show many people many
different stimuli, mixed effects models have rapidly become the de facto standard for data
analysis. These models (also known as hierarchical linear models) let you estimate sources
of random variation ("random effects") in the data across various grouping factors. For
example, in a reaction time experiment some participants will be faster or slower (and so
all data from those particular individuals will tend to be faster or slower in a correlated
way). Similarly, some stimulus items will elicit faster or slower responses, and so all the data from those items will covary. The lme4 package in R was a game-changer for using these models (in
a frequentist paradigm) in that it allowed researchers to estimate such models for a full
dataset with just a single command. For the past 8-10 years, nearly every paper I've
published has had a linear or generalized linear mixed effects model in it.
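
As a quick illustration of that single command (using lme4's built-in sleepstudy dataset rather than data from one of our experiments):

# One-command mixed effects model in lme4: reaction time over days of
# sleep deprivation, with by-subject random intercepts and slopes
library(lme4)

fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)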

Despite their simplicity, the biggest problem with mixed effects models (from an
educational point of view, especially) has been figuring out how to write consistent model
specifications for random effects. Often there are many factors that vary randomly
(subjects, items, etc.) and many other factors that are nested within those (e.g., each
subject might respond differently to each condition). Thus, it is not trivial to figure out
what model to fit, even if fitting the model is just a matter of writing a command. Even in
a reaction-time experiment with just items and subjects as random variables, and one
condition manipulation, you can write

(1) rt ~ condition + (1 | subject) + (1 | item)

for just random intercepts by subject and by item, or you can nest condition (fitting a
random slope) for one or both:

(2) rt ~ condition + (condition | subject) + (condition | item)

and you can additionally fiddle with covariance between random effects for even more
degrees of freedom!
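
To make that last point concrete, here is a small sketch in lme4 syntax (the data frame d is simulated purely for illustration, and the double-bar notation assumes condition is contrast-coded as a numeric variable):

# Sketch: the same three specifications in lme4, on simulated data
library(lme4)

set.seed(1)
# 20 subjects x 10 items x 2 conditions (condition contrast-coded as -0.5 / 0.5)
d <- expand.grid(subject = factor(1:20), item = factor(1:10), condition = c(-0.5, 0.5))
d$rt <- 500 + 30 * d$condition +
  rnorm(20, sd = 40)[d$subject] +                # by-subject intercept variation
  rnorm(10, sd = 25)[d$item] +                   # by-item intercept variation
  rnorm(20, sd = 15)[d$subject] * d$condition +  # by-subject condition slopes
  rnorm(nrow(d), sd = 60)                        # trial-level noise

m1 <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = d)                    # (1) random intercepts only
m2 <- lmer(rt ~ condition + (condition | subject) + (condition | item), data = d)    # (2) slopes plus correlations
m3 <- lmer(rt ~ condition + (condition || subject) + (condition || item), data = d)  # slopes, correlations suppressed
# Note: with modest amounts of data, the fuller models may produce
# convergence or singularity warnings -- more on that below.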

Luckily, a number of years ago, a powerful and clear simulation paper by Barr et al. (2013)
came out. They argued that there was a simple solution to the specification issue: use the
"maximal" random effects structure supported by the design of the experiment. This meant
adding any random slopes that were actually supported by your design (e.g., if condition
was a within-subject variable, you could fit condition-by-subject slopes). While this suggestion was quite controversial,* Barr et al.'s simulations provided persuasive evidence that it led to conservative inferences. In addition, having a simple guideline to
follow eliminated a lot of the worry about analytic flexibility in random effects structure.
If you were "keeping it maximal" that meant that you weren't intentionally – or even
inadvertently – messing with your model specification to get a particular result.

Unfortunately, a new problem reared its head in lme4: convergence. With very high
frequency, when you specify the maximal model, the approximate inference algorithms
that search for the maximum likelihood solution for the model will simply not find a
satisfactory solution. This outcome can happen even in cases where you have quite a lot of
data – in part because the number of parameters being fit is extremely high. In the case
above, not counting covariance parameters, we are fitting a slope and an intercept across
participants, plus a slope and intercept for every participant and for every item.

To deal with this, people have developed various strategies. The first is to do some black
magic to try to change the optimization parameters (e.g., following these helpful tips). Then you start to prune random effects away until your model is "less maximal" and you get convergence. But these practices mean you're back in flexible-model-adjustment land, and the analytic flexibility that "keeping it maximal" was supposed to eliminate creeps right back in.
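
This is where the brms option from the tl;dr comes in. As a sketch (the sampler settings below are illustrative defaults, and d is the kind of data frame from the lme4 sketch above, not a real dataset), the maximal model can be written with the same lme4-style formula and estimated by MCMC sampling in Stan rather than by maximum likelihood:

# Sketch: the maximal model in brms (lme4-style formula, estimated with Stan)
# Assumes a data frame d with rt, condition, subject, and item columns,
# like the simulated one in the lme4 sketch above
library(brms)

fit <- brm(
  rt ~ condition + (condition | subject) + (condition | item),
  data = d,
  family = gaussian(),
  chains = 4, iter = 2000, cores = 4   # sampler settings shown for illustration
)

summary(fit)   # posterior estimates and credible intervals for fixed and random effects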
