As you can see, there are now only three variables left because POS now
functions as row names. Note that this is only possible when the column
with the row names contains no element twice.
A second way of creating data frames that is much less flexible, but ex-
tremely important for Chapter 5 involves the function expand.grid. In its
simplest use, the function takes several vectors or factors as arguments and
returns a data frame the rows of which contain all possible combinations of
vector elements and factor levels. This sounds complicated, but it is very easy to
understand from an example, and we will use this function many times:
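A minimal illustration (the factor names and levels are just made up here):

> expand.grid(SEX=c("male", "female"), GENRE=c("dialog", "monolog"))¶
     SEX   GENRE
1   male  dialog
2 female  dialog
3   male monolog
4 female monolog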
While you can generate data frames as shown above, this is certainly not
the usual way in which data frames are entered into R. Typically, you will
read in files that were created with spreadsheet software. If you create a
table in, say, LibreOffice Calc and want to work on it within R, then you
should first save it as a comma-separated text file. There are two ways to
do this. Either you copy the whole file into the clipboard, paste it into a text
editor (e.g., geany or Notepad++), and then save it as a tab-delimited text
file, or you save it directly out of the spreadsheet software as a CSV file (as
mentioned above with File: Save As … and Save as type: Text CSV (.csv));
then you choose tabs as the field delimiter and no text delimiter, and don’t
forget to provide the file extension. To load this file into R, you use the
function read.table and some of its arguments:
− file="…": the path to the text file with the table (on Windows PCs you
can use choose.files() here, too; if the file is still in the clipboard,
you can also write file="clipboard";
Thus, if you want to read in the above table from the file
<_inputfiles/02-5-2_dataframe1.csv> – once without row names and once
with row names – then you could type something like the two lines sketched below.
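Assuming the file was saved with a header row and tab-delimited fields as recommended above, the two calls would be roughly these (a1 without row names, a2 using the first column as row names):

> a1<-read.table(file.choose(), header=TRUE, sep="\t")¶
> a2<-read.table(file.choose(), header=TRUE, sep="\t", row.names=1)¶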
By entering a1¶ or str(a1)¶ (same with a2), you can check whether
the data frames have been loaded correctly.
While the above is the most explicit and most general way to load all
sorts of different data frames, when you have set up your data as recom-
mended above, you can often use a shorter version with read.delim,
which has header=TRUE and sep="\t" as defaults and should, therefore,
work most of the time:
> a3<-read.delim(file.choose())¶
If you want to save a data frame from R, then you can use
write.table. Its most important arguments are the data structure you want
to save, the file=… to write it to, and arguments such as sep=…, quote=…,
row.names=…, and col.names=… that control the format of the output.
Given these default settings and under the assumption that your operat-
ing system uses an English locale, you would save data frames as follows:
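A sketch of such a call (the file name and the particular argument settings are only examples):

> write.table(a1, file="dataframe1_out.csv", sep="\t", quote=FALSE, row.names=FALSE)¶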
In this section, we will discuss how you can access parts of data frames and
then how you can edit and change data frames.
Further below, we will discuss many examples in which you have to ac-
cess individual columns or variables of data frames. You can do this in
several ways. The first of these you may have already guessed from look-
ing at how a data frame is shown in R. If you load a data frame with col-
umn names and use str to look at the structure of the data frame, then you
see that the column names are preceded by a “$”. You can use this syntax
to access columns of data frames, as in this example using the file
<_inputfiles/02-5-3_dataframe.csv>.
> rm(list=ls(all=TRUE))¶
> a<-read.delim(file.choose())¶
> a¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
3 n 1411 735 open
4 conj 458 18 closed
5 prep 455 37 closed
You can now use these just like any other vector or factor. For example,
the following line computes token/type ratios of the parts of speech:
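That line is not reproduced here, but it would simply divide the one column by the other:

> a$TOKENFREQ/a$TYPEFREQ¶
[1]  1.553506  3.271845  1.919728 25.444444 12.297297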
You can also use indices in square brackets for subsetting. Vectors and
factors as discussed above are one-dimensional structures, but R allows you
to specify arbitrarily complex data structures. With two-dimensional data
structures, you can also use square brackets, but now you must of course
provide values for both dimensions to identify one or several data points –
just like in a two-dimensional coordinate system. This is very simple and
the only thing you need to memorize is the order of the values – rows, then
columns – and that the two values are separated by a comma. Here are
some examples:
> a[2,3]¶
[1] 103
> a[2,]¶
POS TOKENFREQ TYPEFREQ CLASS
2 adv 337 103 open
> a[,3]¶
[1] 271 103 735 18 37
> a[2:3,4]¶
[1] open open
Levels: closed open
> a[2:3,3:4]¶
TYPEFREQ CLASS
2 103 open
3 735 open
Note that row and column names are not counted. Also note that all
functions applied to vectors above can be used with what you extract out of
a column of a data frame:
> which(a[,2]>450)¶
[1] 3 4 5
> a[,3][which(a[,3]>100)]¶
[1] 271 103 735
> a[,3][ a[,3]>100]¶
[1] 271 103 735
> attach(a)¶
> CLASS¶
[1] open open open closed closed
Levels: closed open
Note two things. First, if you attach a data frame that has one or more
names that have already been defined as data structures or as columns of
previously attached data frames, you will receive a warning; in such cases,
make sure you are really dealing with the data structures or columns you
want and consider using detach to un-attach the earlier data frame. Second,
when you use attach you are strictly speaking using ‘copies’ of these vari-
ables. You can change those, but these changes do not affect the data frame
they come from.
> CLASS[4]<-"closed"¶
If you want to change the data frame a, then you must make your
changes in a directly, e.g. with a$CLASS[4]<-NA¶ or a$TOKENFREQ[2]<-
338¶. Given what you have seen in Section 2.4.3, however, this is only
easy with vectors or with factors where you do not add a new level – if you
want to add a new factor level, you must define that level first.
Sometimes you will need to investigate only a part of a data frame –
maybe a set of rows, or a set of columns, or a matrix within a data frame.
Also, a data frame may be so huge that you only want to keep one part of it
in memory. As usual, there are several ways to achieve that. One uses indi-
ces in square brackets with logical conditions or which. Depending on whether you
have already used attach, you can refer to the column by its name directly or by its index:
> b<-a[CLASS=="open",]; b¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
3 n 1411 735 open
> b<-a[a[,4]=="open",]; b¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
3 n 1411 735 open
(Of course you can also write b<-a[a$CLASS=="open",]¶.) That is, you
determine all elements of the column called CLASS / the fourth column that
are open, and then you use that information to access the desired rows and
all columns (hence the comma before the closing square bracket). There is
a more elegant way to do this, though, the function subset. This function
takes two arguments: the data structure of which you want a subset and the
logical condition(s) describing which subset you want. Thus, the following
line creates the same structure b as above:
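A sketch of that line (assuming, as before, that the relevant column is CLASS):

> b<-subset(a, CLASS=="open"); b¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
3 n 1411 735 open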
If you need to make larger changes, you can of course always edit the underlying
text file itself, changing as many columns and rows as you need, in a text editor. For the sake of completeness, let me
mention that R of course also allows you to edit data frames in a spread-
sheet-like format. The function fix takes as argument a data frame and
opens a spreadsheet editor in which you can edit the data frame; you can
even introduce new factor levels without having to define them first: when
you close the editor, R will define those levels for you.
Finally, let us look at ways in which you can sort data frames. Recall
that the function order creates a vector of positions and that vectors can be
used for sorting. Imagine you wanted to sort the data frame a according
to the column CLASS (in alphabetically ascending order) and, within
CLASS, according to TOKENFREQ (in descending order). How can you do that?
THINK
BREAK
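The line defining order.index is not shown above; given the sorted output below (and the fact that a is still attached), it presumably looked like this:

> order.index<-order(CLASS, -TOKENFREQ); order.index¶
[1] 4 5 3 1 2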
After that, you can use the vector order.index to sort the data frame:
> a[order.index,]¶
POS TOKENFREQ TYPEFREQ CLASS
4 conj 458 18 closed
5 prep 455 37 closed
3 n 1411 735 open
1 adj 421 271 open
2 adv 337 103 open
You can now also use the function sample to sort the rows of a data
frame randomly (for example, to randomize tables with experimental items;
cf. above). You first determine the number of rows to be randomized (e.g.,
with nrow or dim) and then combine sample with order. Your output will
probably differ from the one shown here because the sampling is random.
12. Note that R is superior to many other programs here because the number of sorting
parameters is in principle unlimited.
> no.rows<-nrow(a)¶
> order.index<-sample(no.rows); order.index¶
[1] 3 4 1 2 5
> a[order.index,]¶
POS TOKENFREQ TYPEFREQ CLASS
3 n 1411 735 open
4 conj 458 18 closed
1 adj 421 271 open
2 adv 337 103 open
5 prep 455 37 closed
> a[sample(nrow(a)),] # in just one line¶
But what do you do when you need to sort a data frame according to
several factors – some in ascending and some in descending order? You
can of course not use negative values of factor levels – what would -open
be? Thus, you first use the function rank, which rank-orders factor levels,
and then you can use negative values of these ranks:
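One possible combination (not necessarily the one used in the original example) sorts CLASS in descending and TOKENFREQ in ascending order:

> a[order(-rank(CLASS), TOKENFREQ),]¶
POS TOKENFREQ TYPEFREQ CLASS
2 adv 337 103 open
1 adj 421 271 open
3 n 1411 735 open
5 prep 455 37 closed
4 conj 458 18 closed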
6. Conditional expressions and loops

So far, we have focused on simple and existing functions, but we have done
little to explore the programming-language character of R. This section will
introduce a few very powerful notions that allow you to make R decide
which of two or more user-specified things to do and/or do something over
and over again. In Section 2.6.1, we will explore the former, Section 2.6.2
then discusses the latter, but the treatment here can only be very brief and I
advise you to explore some of the reading suggestions for more details.
6.1. Conditional expressions

Later, you will often face situations where you want to pursue one of sev-
eral possible options in a statistical analysis. In a plot, for example, the data
points for male subjects should be plotted in blue and the data points for
female subjects should be plotted in pink. Or, you actually only want R to
generate a plot when the result is significant but not, when it is not. In gen-
eral, you can of course always do these things stepwise yourself: you could
decide for each analysis yourself whether it is significant and then generate
a plot when it is. However, a more elegant way is to write R code that
makes decisions for you, that you can apply to any data set, and that, there-
fore, allows you to recycle code from one analysis to the next. Conditional
expressions are one way – others are available and sometimes more elegant
– to make R decide things. This is what the syntax can look like in a nota-
tion often referred to as pseudo code (so, no need to enter this into R!):
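The pseudo code itself is not shown above; its general shape is this:

if (some logical expression) {
   what R should do if the logical expression evaluates to TRUE
} else {
   what R should do if the logical expression evaluates to FALSE
}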
That’s it, and the part after the first } is even optional. Here’s an exam-
ple with real code (recall, "\n" means ‘a new line’):
> pvalue<-0.06¶
> if (pvalue>=0.05) {¶
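+ # (the original continuation of this example is not shown; the two cat()¶
+ # messages below are only illustrative reconstructions)¶
+ cat("This p-value is not significant!\n")¶
+ } else {¶
+ cat("This p-value is significant!\n")¶
+ }¶
This p-value is not significant!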
The first line defines a p-value, which you will later get from a statisti-
cal test. The next line tests whether that p-value is greater than or equal to
0.05. It is, which is why the code after the first opening { is executed and
why R then never gets to see the part after else.
If you now set pvalue to 0.04 and run the if expression again, then this
happens: Line 2 from above tests whether 0.04 is greater than or equal to
0.05. It is not, which is why the block of code between { and } before else
is skipped and why the second block of code is executed. Try it.
A short version of this can be extremely useful when you have many
tests to make but only one instruction for the cases where a test returns TRUE
and one for the cases where it returns FALSE. It uses the function ifelse, here represented schematically again:
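A schematic sketch plus an example (the four p-values and the two labels are assumptions; only the general shape is implied by the description below):

ifelse(logical expression, value to return where TRUE, value to return where FALSE)

> pvalues<-c(0.03, 0.06, 0.01, 0.09)¶
> decisions<-ifelse(pvalues>=0.05, "not significant", "significant")¶
> decisions¶
[1] "significant"     "not significant" "significant"     "not significant"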
As you can see, ifelse tested all four values of pvalues against the
threshold value of 0.05, and put the correspondingly required values into
the new vector decisions. We will use this a lot to customize graphs.
6.2. Loops
Loops are useful to have R execute one or more (possibly many) functions multiple
times. Like many other programming languages, R has different types of
loops, but I will only discuss for-loops here. This is the general syntax in
pseudo code:
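Its general shape is roughly this:

for (some.name in a.sequence) {
   what R should do on each pass through the loop
}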
Let’s go over this step by step. The data structure some.name stands for
any name you might wish to assign to a data structure that is processed in
the loop, and a.sequence stands for anything that can be interpreted as a
sequence of values, most typically a vector of length 1 or more. This
sounds more cryptic than it actually is; here’s a very easy example:
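The example itself is not printed above; a reconstruction with an illustrative message (the counter name and the sequence 1:3 are given in the description that follows):

> for (counter in 1:3) {¶
+ cat("This is iteration number", counter, "\n")¶
+ }¶
This is iteration number 1
This is iteration number 2
This is iteration number 3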
When R enters the for-loop, it assigns to counter the first value of the
sequence 1:3, i.e. 1. Then, in the only line in the loop, R prints some sen-
tence and ends it with the current value of counter, 1, and a line break.
Then R reaches the } and, because counter has not yet iterated over all
values of a.sequence, re-iterates, which means it goes back to the begin-
ning of the loop, this time assigning to counter the next value of
a.sequence, i.e., 2, and so on. Once R has printed the third line, it exits the
loop because counter has now iterated over all elements of a.sequence.
Here is a more advanced example, but one that is typical of what we’re
going to use loops for later. Can you see what it does just from the code?
> some.numbers<-1:100¶
> collector<-vector(length=10)¶
> for (i in 1:10) {¶
+ collector[i]<-mean(sample(some.numbers, 50))¶
+ }¶
> collector¶
[1] 50.78 51.14 45.04 48.04 55.30 45.90 53.02 48.40 50.38
49.88
THINK
BREAK
The first line generates a vector some.numbers with the values from 1
to 100. The second line generates a vector called collector which has 10
elements and which will be used to collect results from the looping. Line 3
begins a loop of 10 iterations, using a vector called i as the counter. Line 4
is the crucial one now: In it, R samples 50 numbers randomly without re-
placement from the vector some.numbers, computes the mean of these 50
numbers, and then stores that mean in the i-th slot of collector. On the first
iteration, i is of course 1 so the first mean is stored in the first slot of col-
lector. Then R iterates, i becomes 2, R generates a second random sam-
ple, computes its mean, and stores it in the – now – 2nd slot of collector,
and so on, until R has done the sampling, averaging, and storing process 10
times and exits the loop. Then, the vector collector is printed on the
screen.
In Chapter 4, we will use an approach like this to help us explore data
that violate some of the assumptions of common statistical tests. However,
it is already worth mentioning that loops are often not the best way to do
things like the above in R: in contrast to some other programming lan-
guages, R is designed such that it is often much faster and more memory-
efficient to do things not with loops but with members of the apply family
of functions, which you will get to know a bit later. Still, being able to
quickly write a loop and test something is often a very useful skill.
7. Writing your own functions

The fact that R is not just a statistics software but a full-fledged program-
ming language is something that can hardly be overstated. It means
that nearly anything is possible: the limit of what you can do with R is not
defined by what the designers of some other software thought you may
want to do – the limit is set pretty much only by your skills and maybe your
RAM/processor (which is one reason why I recommend using R for cor-
pus-linguistic analyses, see Gries 2009a). One aspect making this particu-
larly obvious is how you can very easily write your own functions to facili-
tate and/or automate tedious and/or frequent tasks. In this section, I will
give a few very small examples of the logic of how to write your own func-
tions, mainly because we haven’t dealt with any statistical functions yet.
Don’t despair if you don’t understand these programming issues immedi-
ately – for most of this book, you will not need them, but these capabilities
can come in very handy when you begin to tackle more complex data. Al-
so, in Chapters 3 and 4 I will return to this topic so that you get more prac-
tice in this and end up with a list of useful functions for your own work.
The first example I want to use involves looking at a part of a data
structure. For example, let’s assume you loaded a really long vector (let’s
say, 10,000 elements long) and want to check whether you imported it into
R properly. Just printing that onto the screen is somewhat tedious since you
can’t possibly read all 10,000 items (let alone at the speed with which they
are displayed), nor do you usually need all 10,000 items – the first n are
usually enough to see whether your data import was successful. The same
holds for long data frames: you don’t need to see all 1600 rows to check
whether loading it was successful, maybe the first 5 or 6 are sufficient.
Let’s write a function peek that by default shows you the first 6 elements of
each of the data structures you know about: one-dimensional vectors or
factors and two-dimensional data frames.
One good way to approach the writing of functions is to first consider
how you would solve that problem just for a particular data structure, i.e.
outside of the function-writing context, and then make whatever code you
wrote general enough to cover not just the one data structure you just ad-
dressed, but many more. To that end, let’s first load a data frame for this
little example (from <_inputfiles/02-7_dataframe1.csv>):
> into.causatives<-read.delim(file.choose())¶
> str(into.causatives)¶
'data.frame': 1600 obs. of 5 variables:
$ BNC       : Factor w/ 929 levels "A06","A08","A0C",..: 1 2 3 4 ...
$ TAG_ING   : Factor w/ 10 levels "AJ0-NN1","AJ0-VVG",..: 10 7 10 ...
$ ING       : Factor w/ 422 levels "abandoning","abdicating",..: 354 49 382 ...
$ VERB_LEMMA: Factor w/ 208 levels "activate","aggravate",..: 76 126 186 ...
$ ING_LEMMA : Factor w/ 417 levels "abandon","abdicate",..: 349 41 377 ...
Depending on the data structure, showing the first six elements thus requires one of these two forms of subsetting:
vector.or.factor[1:6]
data.frame[1:6,]
So, essentially you need to decide what the data structure is of which R
is supposed to display the first n elements (by default 6) and then you sub-
set with either [1:6] or [1:6,]. Since, ultimately, the idea is to have R –
not you – decide on the right way of subsetting (depending on the data
structure), you use a conditional expression:
> if (is.data.frame(into.causatives)) {¶
+ into.causatives[1:6,]¶
+ } else {¶
+ into.causatives[1:6]¶
+ }¶
BNC TAG_ING ING VERB_LEMMA ING_LEMMA
1 A06 VVG speaking force speak
2 A08 VBG being nudge be
3 A0C VVG taking talk take
4 A0F VVG taking bully take
5 A0H VVG trying influence try
6 A0H VVG thinking delude think
To turn this into a function, you wrap a function definition (naming the
function peek) around this piece of code. However, if you use the above
code as is, then this function will use the name into.causatives in the
function definition, which is not exactly very general. As you have seen,
many R functions use x for the main obligatory variable. Following this
tradition, you could write this:
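The definition is not shown above; it presumably looked like this:

> peek<-function(x) {¶
+ if (is.data.frame(x)) {¶
+ x[1:6,]¶
+ } else {¶
+ x[1:6]¶
+ }¶
+ }¶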
This works, but several questions arise:
− what if the data structure you apply peek to is two-dimensional, like a matrix, but not
a data frame?
− what if you want to be able to see not 6 but n elements?
− what if the data structure you use peek with has fewer than n elements
or rows?
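The author's own improved version is in the code file; the following is merely one possible sketch that addresses all three questions (the details are illustrative, not the book's code):

> peek<-function(x, n=6) {¶
+ if (is.null(dim(x))) { # one-dimensional: vectors and factors¶
+ return(x[1:min(n, length(x))])¶
+ } else { # two-dimensional: data frames and matrices¶
+ return(x[1:min(n, nrow(x)),])¶
+ }¶
+ }¶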
The code file shows one way of addressing these questions and of deciding
exactly what peek should return and output to the user when it’s done. Try
the following lines (output not shown here; see the comments in the
code file) to see that it works:
> peek(into.causatives)¶
> peek(into.causatives, 3)¶
> peek(into.causatives, 9)¶
> peek(21:50, 10)¶
> peek(into.causatives$BNC, 12)¶
> peek(as.matrix(into.causatives))¶
While all this may not seem easy or worth the effort, we will later see
that being able to write your own functions will facilitate quite a few statis-
tical analyses below. Let me also note that this was a tongue-in-cheek ex-
ample: there is actually already a function in R that does what peek does
(and more, because it can handle more data structures) – look up head and
also tail ;-).
Now you should do the exercise(s) for Chapter 2 …
Chapter 3
Descriptive statistics
Any 21st century linguist will be required to read about and understand
mathematical models as well as understand statistical methods of analysis.
Whether you are interested in Shakespearean meter, the sociolinguistic
perception of identity, Hindi verb agreement violations, or the perception
of vowel duration, the use of math as a tool of analysis is already here and
its prevalence will only grow over the next few decades. If you're not pre-
pared to read articles involving the term Bayesian, or (p<.01), k-means
clustering, confidence interval, latent semantic analysis, bimodal and uni-
modal distributions, N-grams, etc, then you will be
but a shy guest at the feast of linguistics.
(<http://thelousylinguist.blogspot.com/2010/01/
why-linguists-should-study-math.html>)
In this chapter, I will explain how you obtain descriptive results. In section
3.1, I will discuss univariate statistics, i.e. statistics that summarize the
distribution of one variable, of one vector, of one factor. Section 3.2 then is
concerned with bivariate statistics, statistics that characterize the relation of
two variables, two vectors, two factors to each other. Both sections also
introduce ways of representing the data graphically; many additional
graphs will be illustrated in Chapters 4 and 5.
1. Univariate statistics
Probably the simplest way to describe the distribution of data points is
frequency tables, i.e. lists that state how often each individual outcome was
observed. In R, generating a frequency table is extremely easy. Let us look
at a psycholinguistic example. Imagine you extracted all occurrences of the
disfluencies uh, uhm, and ‘silence’ and noted for each disfluency whether it
was produced by a male or a female speaker, whether it was produced in a
monolog or in a dialog, and how long in milliseconds the disfluency lasted.
First, we load these data from the file <_inputfiles/03-1_uh(m).csv>.
> UHM<-read.delim(file.choose())¶
> str(UHM)¶
'data.frame': 1000 obs. of 5 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ SEX : Factor w/ 2 levels "female","male": 2 1 1 1 2 ...
$ FILLER: Factor w/ 3 levels "silence","uh",..: 3 1 1 3 ...
$ GENRE : Factor w/ 2 levels "dialog","monolog": 2 2 1 1 ...
$ LENGTH: int 1014 1188 889 265 465 1278 671 1079 643 ...
> attach(UHM)¶
To see which disfluency or filler occurs how often, you use the function
table, which creates a frequency list of the elements of a vector or factor:
> table(FILLER)¶
FILLER
silence uh uhm
332 394 274
If you also want to know the percentages of each disfluency, then you
can either do this rather manually or you use the function prop.table,
whose argument is a table generated with table and which returns the per-
centages of the frequencies in that table (cf. also below).
> table(FILLER)/length(FILLER)¶
FILLER
silence uh uhm
0.332 0.394 0.274
> prop.table(table(FILLER))¶
FILLER
silence uh uhm
0.332 0.394 0.274
If you are interested in cumulative frequencies or percentages, the function cumsum is useful: it returns, for each position of a vector, the sum of that element and all elements before it:
> 1:5¶
[1] 1 2 3 4 5
> cumsum(1:5)¶
[1] 1 3 6 10 15
> cumsum(table(FILLER))¶
silence uh uhm
332 726 1000
> cumsum(prop.table(table(FILLER)))¶
silence uh uhm
0.332 0.726 1.000
Let us now turn to graphs. The most basic plotting function is plot: if you give it a single numerical vector as an argument, its values are plotted on the y-axis against their positions on the x-axis. But if you give two vectors as arguments, then the values of the first and
the second are interpreted as coordinates on the x-axis and the y-axis re-
spectively (and the names of the vectors will be used as axis labels):
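A minimal example with two made-up vectors:

> v1<-c(1, 3, 5, 6, 8); v2<-c(2, 5, 3, 8, 9)¶
> plot(v1, v2)¶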
With the argument type=…, you can specify the kind of graph you want.
The default, which was used because you did not specify anything else, is
type="p" (for points). If you use type="b" (for both), you get points and
lines connecting the points; if you use type="l" (for lines), you get a line
plot; cf. Figure 16. (With type="n", nothing gets plotted into the main
plotting area, but the coordinate system is set up.)
Other simple but useful ways to tweak graphs involve defining labels
for the axes (xlab="…" and ylab="…"), a bold heading for the whole graph
(main="…"), the ranges of values of the axes (xlim=… and ylim=…), and the
addition of a grid (grid()¶). With col="…", you can also set the color of
the plotted element, as you will see more often below.
An important rule of thumb is that the ranges of the axes must be chosen
such that the distribution of the data is represented most meaningfully. It is
often useful to include the point (0, 0) within the ranges of the axes and to
make sure that graphs to be compared have the same and sufficient axis
ranges. For example, if you want to compare the ranges of values of two
vectors x and y in two graphs, then you usually may not want to let R de-
cide on the ranges of axes. Consider the upper panel of Figure 18.
The clouds of points look very similar and you only notice the distribu-
tional difference between x and y when you specifically look at the range
of values on the y-axis. The values in the upper left panel range from 0 to 2
but those in the upper right panel range from 0 to 6. This difference be-
tween the two vectors is immediately obvious, however, when you use
ylim=… to manually set the ranges of the y-axes to the same range of val-
ues, as I did for the lower panel of Figure 18.
Note: whenever you use plot, by default a new graph is created and the
old graph is lost (In RStudio, you can go back to previous plots, however,
with the arrow button or the menu Plots: …) If you want to plot two lines
into a graph, you first generate the first with plot (and type="l" or
type="b") and then add the second one with points (or lines; sometimes
you can also use the argument add=TRUE). That also means that you must
define the ranges of the axes in the first plot in such a way that the values
of the second graph can also be plotted into it.
An example will clarify that point. If you want to plot the points of the
vectors m and n, and then want to add into the same plot the points of the
vectors x and y, then this does not work, as you can see in the left panel of
Figure 19.
The left panel of Figure 19 shows the points defined by m and n, but not
those of x and y because the ranges of the axes that R used to plot m and n
are too small for x and y, which is why you must define those manually
while creating the first coordinate system. One way to do this is to use the
function max, which returns the maximum value of a vector (and min re-
turns the minimum). The right panel of Figure 19 shows that this does the
trick. (In this line, the minimum is set to 0 manually – of course, you could
also use min(m, x) and min(n, y) for that, but I wanted to include (0, 0)
in the graph.)
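The plotting call itself is not shown above; given the description, it would be something along these lines (m, n, x, and y are the vectors discussed in the text):

> plot(m, n, xlim=c(0, max(m, x)), ylim=c(0, max(n, y))); points(x, y)¶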
The function to generate a pie chart is pie. Its most important argument is a
table generated with table. You can either just leave it at that or, for ex-
ample, change category names with labels=… or use different colors with
col=… etc.:
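For example (the colors and the capitalized labels are merely illustrative choices):

> pie(table(FILLER))¶
> pie(table(FILLER), labels=c("Silence", "Uh", "Uhm"), col=c("grey40", "grey70", "grey95"))¶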
One thing that’s a bit annoying about this is that, to use different colors
with col=… as above, you have to know how many colors you need and
specify each of them by name, which becomes cumbersome with many different
colors and/or graphs. For situations like these, the function rainbow can be
very useful. In its simplest use, it requires only one argument, namely the
number of different colors you want. Thus, how would you re-write the
above line for the pie chart in such a way that you let R find out how many
colors are needed rather than saying col=rainbow(3)?
THINK
BREAK
Let R use as many colors as the table you are plotting has elements:
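This is presumably what the line looked like:

> pie(table(FILLER), col=rainbow(length(table(FILLER))))¶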
Note that pie charts are usually not a good way to summarize data be-
cause humans are not very good at inferring quantities from angles. Thus,
pie is not a function you should use too often – the function rainbow, on
the other hand, is one you should definitely bear in mind.
To create a bar plot, you can use the function barplot. Again, its most
important argument is a table generated with table and again you can cre-
ate either a standard version or more customized ones. If you want to de-
fine your own category names, you unfortunately must use names.arg=…,
not labels=… (cf. Figure 21 below).
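A basic version and one with renamed categories might look like this (the capitalized names are illustrative):

> barplot(table(FILLER))¶
> barplot(table(FILLER), names.arg=c("Silence", "Uh", "Uhm"))¶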
The second way to create a similar graph – cf. the right panel of Figure
22 – involves some useful changes:
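The two lines are not shown above; judging from the description that follows, they were roughly these (the ylim setting is an assumption):

> mids<-barplot(table(FILLER), ylim=c(0, 500))¶
> text(mids, table(FILLER), labels=table(FILLER), pos=1)¶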
The first line now does not just plot the barplot; it also assigns what R
returns to a data structure called mids, which contains the x-coordinates of
the middles of the bars, which we can then use for adding text. (Look at mids.)
Second, the second line now uses mids for the x-coordinates of the text to
be printed and it uses pos=1 to make R print the text a bit below the speci-
fied coordinates; pos=2, pos=3, and pos=4 would print the text a bit to the
left, above, and to the right of the specified coordinates respectively.
The functions plot and text allow for another powerful graph: first,
you generate a plot that contains nothing but the axes and their labels (with
type="n", cf. above), and then with text you plot words or numbers. Try
this for an illustration of a kind of plot you will more often see below:
> tab<-table(FILLER)¶
> plot(tab, type="n", xlab="Disfluencies", ylab="Observed
frequencies", xlim=c(0, 4), ylim=c(0, 500)); grid()¶
> text(seq(tab), tab, labels=tab)¶
1.1.4. Pareto-charts
In a Pareto chart, the frequencies of the categories are represented as in a bar plot, but they are first sorted in descending order of fre-
quency and then overlaid by a line plot of cumulative percentages that indi-
cates what percent of all data one category and all other categories to the
left of that category account for. The function pareto.chart comes with
the library qcc that you must (install and/or) load first; cf. Figure 23.
> library(qcc)¶
> pareto.chart(table(FILLER), main="")¶
Pareto chart analysis for table(FILLER)
Frequency Cum.Freq. Percentage Cum.Percent.
uh 394.0 394.0 39.4 39.4
silence 332.0 726.0 33.2 72.6
uhm 274.0 1000.0 27.4 100.0
1.1.5. Histograms
While bar plots are probably the most frequent forms of representing the
frequencies of nominal/categorical variables, histograms are most wide-
spread for the frequencies of interval/ratio variables. In R, you can use
hist, which just requires the relevant vector as its argument.
> hist(LENGTH)¶
For some ways to make the graph nicer, cf. Figure 24, whose left panel
contains a histogram of the variable LENGTH with axis labels and grey bars.
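A call along these lines produces such a histogram (the labels and the grey shade are illustrative):

> hist(LENGTH, main="", xlab="Length of disfluency in ms", ylab="Frequency", col="grey80")¶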
With the argument breaks=… to hist, you can instruct R to try to use a
particular number of bins (or bars). You either provide one integer – then R
tries to create a histogram with as many bins – or you provide a vector with
the boundaries of the bins. The latter raises the question of how many bins
should or may be chosen. In general, you should not have more than 20
bins, and as one rule of thumb for the number of bins to choose you can use
the formula in (14) (cf. Keen 2010:143–160 for discussion). The most im-
portant aspect is that the bins you choose do not misrepresent the data.
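The introduction of the next type of graph, the empirical cumulative distribution (ecdf) plot, is missing at this point; such a plot of LENGTH can be created roughly like this (the exact call behind the figure discussed next may differ):

> plot(ecdf(LENGTH), main="", xlab="Length in ms", ylab="Cumulative proportion")¶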
This plot is very useful because it does not lose information by binning
data points: every data point is represented in the plot, which is why ecdf
plots can be very revealing even for data that most other graphs cannot
illustrate well. Let’s see whether you’ve understood this plot: what do ecdf
plots of normally-distributed and uniformly-distributed data look like?
THINK
BREAK
You will find the answer in the code file (with graphs); make sure you
understand why so you can use this very useful type of graph.
Measures of central tendency are probably the most frequently used statis-
tics. They provide a value that attempts to summarize the behavior of a
variable. Put differently, they answer the question, if I wanted to summa-
rize this variable and were allowed to use only one number to do that,
which number would that be? Crucially, the choice of a particular measure
of central tendency depends on the variable’s level of measurement. For
nominal/categorical variables, you should use the mode (if you do not
simply list frequencies of all values/bins anyway, which is often better), for
ordinal variables you should use the median, for interval/ratio variables you
can often use the arithmetic mean.
The mode of a variable or distribution is the value that is most often ob-
served. As far as I know, there is no function for the mode in R, but you
can find it very easily. For example, the mode of FILLER is uh:
> which.max(table(FILLER))¶
uh
2
> max(table(FILLER))¶
[1] 394
Be careful when there is more than one level that exhibits the maximum
number of observations – tabulating is usually safer.
The measure of central tendency for ordinal data is the median, the value
you obtain when you sort all values of a distribution according to their size
and then pick the middle one (e.g., the median of the numbers from 1 to 5
is 3). If you have an even number of values, the median is the average of
the two middle values.
> median(LENGTH)¶
[1] 897
The best-known measure of central tendency is the arithmetic mean for interval/ratio variables. You compute it by adding up all values of a vector and dividing that sum by the number of values, or you simply use the function mean:
> sum(LENGTH)/length(LENGTH)¶
[1] 915.043
> mean(LENGTH)¶
[1] 915.043
One weakness of the arithmetic mean is its sensitivity to outliers, as a comparison of these two vectors shows:
> a<-1:10; a¶
[1] 1 2 3 4 5 6 7 8 9 10
> b<-c(1:9, 1000); b¶
[1] 1 2 3 4 5 6 7 8 9 1000
> mean(a)¶
[1] 5.5
> mean(b)¶
[1] 104.5
Although the vectors a and b differ with regard to only a single value,
the mean of b is much larger than that of a because of that one outlier, in
fact so much larger that b’s mean of 104.5 neither summarizes the values
from 1 to 9 nor the value 1000 very well. There are two ways of handling
such problems. First, you can add the argument trim=…, the percentage of
elements from the top and the bottom of the distribution that are discarded
before the mean is computed. The following lines compute the means of a
and b after the highest and the lowest value have been discarded:
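With trim=0.1 – i.e., discarding 10% of the values, here exactly one value, from each end – the calls and results are these (the exact trim value used in the original is an assumption):

> mean(a, trim=0.1)¶
[1] 5.5
> mean(b, trim=0.1)¶
[1] 5.5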
Second, you can just use the median, which is also a good idea if the da-
ta whose central tendency you want to report are not normally distributed.
Warning/advice
Just because R or your spreadsheet software can return many decimals does
not mean you have to report them all. Use a number of decimals that makes
sense given the statistic that you report.
Imagine you studied the development of a child's lexicon and determined, at six successive monthly intervals (ages 2;1 to 2;6), how many different words the child produced. You now want to know the average rate at which the lexicon increased.
First, you compute the successive increases:
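The data and the computation are not shown above; the following reconstruction (the vector name sizes is an assumption) is consistent with the first lexicon size of 132, the final size of 240, and the percentages and means quoted below:

> sizes<-c(132, 158, 169, 188, 221, 240)¶
> increases<-sizes[2:6]/sizes[1:5]; increases¶
[1] 1.196970 1.069620 1.112426 1.175532 1.085973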
That is, by age 2;2, the child produced 19.697% more types than by age
2;1, by age 2;3, the child produced 6.962% more types than by age 2;2, etc.
Now, you must not think that the average rate of increase of the lexicon is
the arithmetic mean of these increases:
You can easily test that this is not the correct result. If this number were
the true average rate of increase, then the product of 132 (the first lexicon
size) and this rate of 1.128104 to the power of 5 (the number of times the
supposed ‘average rate’ applies) should be the final value of 240. This is
not the case:
> 132*mean(increases)^5¶
[1] 241.1681
Instead, you must compute the geometric mean. The geometric mean of
a vector x with n elements is computed according to formula (15), and if
you use this as the average rate of increase, you get the right result:
(15) $\text{mean}_{\text{geom}} = (x_1 \cdot x_2 \cdot \ldots \cdot x_{n-1} \cdot x_n)^{\frac{1}{n}}$
> rate.increase<-prod(increases)^(1/length(increases));
rate.increase¶
[1] 1.127009
> 132*rate.increase^5¶
[1] 240
True, the difference between 240 – the correct value – and 241.1681 –
the incorrect value – may seem negligible, but 241.1681 is still wrong and
the difference is not always that small, as an example from Wikipedia (s.v.
geometric mean) illustrates: if you do an experiment and get an increase
rate of 10,000 and then you do a second experiment and get an increase rate
of 0.0001 (i.e., a decrease), then the average rate of increase is not approx-
imately 5,000 – the arithmetic mean of the two rates – but 1 – their geomet-
ric mean.13
Finally, let me again point out how useful it can be to plot words or
numbers instead of points, triangles, … Try to generate Figure 26, in which
the position of each word on the y-axis corresponds to the average length of
the disfluency (e.g., 928.4 for women, 901.6 for men, etc.). (The horizontal
line is the overall average length – you may not know yet how to plot that
one.) Many tendencies are immediately obvious: men are below the aver-
age, women are above, silent disfluencies are of about average length, etc.
13. Alternatively, you can compute the geometric mean of increases as follows:
exp(mean(log(increases)))¶.
Most people know what measures of central tendencies are. What many
people do not know is that they should never – NEVER! – report a measure
of central tendency without some corresponding measure of dispersion.
The reason for this rule is that without such a measure of dispersion you
never know how good the measure of central tendency actually is at sum-
marizing the data. Let us look at a non-linguistic example, the monthly
temperatures of two towns and their averages:
> town1<-c(-5, -12, 5, 12, 15, 18, 22, 23, 20, 16, 8, 1)¶
> town2<-c(6, 7, 8, 9, 10, 12, 16, 15, 11, 9, 8, 7)¶
> mean(town1); mean(town2)¶
[1] 10.25
[1] 9.833333
On the basis of the means alone, the towns seem to have a very similar
climate, but even a quick glance at Figure 27 shows that that is not true – in
spite of the similar means, I know where I would want to be in February.
Obviously, the mean of Town 2 summarizes the central tendency of Town
2 much better than the mean of Town 1 does for Town 1: the values of
Town 1 vary much more widely around their mean. Thus, always provide a
measure of dispersion for your measure of central tendency: relative entro-
py for the mode, the interquartile range or quantiles for the median and
interval/ratio-scaled data that are non-normal or exhibit outliers, and the
standard deviation or the variance for normal interval/ratio-scaled data.
The dispersion measure for categorical data mentioned above, relative entropy Hrel, is computed according to the formula in (16), where the pi are the relative frequencies of the n categories:

(16) $H_{\text{rel}} = -\frac{\sum_{i=1}^{n} (p_i \cdot \ln p_i)}{\ln n}$
Thus, if you count the articles of 300 noun phrases and find 164 cases
with no determiner, 33 indefinite articles, and 103 definite articles, this is
how you compute Hrel:
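The computation would proceed along these lines (perc contains the three relative frequencies; the result is approximately 0.86):

> perc<-c(164, 33, 103)/300¶
> hrel<--sum(perc*log(perc))/log(length(perc)); hrel¶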
It is worth pointing out that the above formula does not produce the desired result when one or more of the observed frequencies are 0, because log(0) is not defined (R returns -Inf, and 0·-Inf yields NaN). In such cases, you can use a small helper function that returns 0 for an input of 0; with it, a maximally concentrated distribution (all observations in one category) receives the minimal Hrel value of 0:
> perc<-c(1, 0, 0) # e.g., all 300 noun phrases in a single category (this defining line is a reconstruction)¶
> logw0<-function(x) {¶
+ ifelse (x==0, 0, log(x))¶
+ }¶
> hrel<--sum(perc*logw0(perc))/logw0(length(perc)); hrel¶
[1] 0
The simplest measure of dispersion for interval/ratio data is the range, the
difference of the largest and the smallest value. You can either just use the
function range, which requires the vector in question as its only argument,
and then compute the difference from the two values with diff, or you just
compute the range from the minimum and maximum yourself:
> range(LENGTH)¶
[1] 251 1600
> diff(range(LENGTH))¶
[1] 1349
> max(LENGTH)-min(LENGTH)¶
[1] 1349
Another simple but useful and flexible measure of dispersion involves the
quantiles of a distribution. We have met quantiles before in the context of
probability distributions in Section 1.3.4. Theoretically, you compute quan-
tiles by sorting the values in ascending order and then counting which val-
ues delimit the lowest x%, y%, etc. of the data; when these percentages are
25%, 50%, and 75%, then they are called quartiles. In R you can use the
function quantile (see below on type=1):
> a<-1:100¶
> quantile(a, type=1)¶
0% 25% 50% 75% 100%
1 25 50 75 100
If you write the integers from 1 to 100 next to each other, then 25 is the
value that cuts off the lower 25%, etc. The value for 50% corresponds to
the median, and the values for 0% and 100% are the minimum and the
maximum. Let me briefly mention two arguments of this function. First,
the argument probs allows you to specify other percentages. Second, the
argument type=… allows you to choose other ways in which quantiles are
computed. For discrete distributions, type=1 is probably best; for continu-
ous variables, the default setting type=7 is best.
A useful dispersion measure based on quantiles is the interquartile range (IQR), the difference between the 75% and the 25% quartile, which you can also obtain with the function IQR; compare the two towns:
> quantile(town1)¶
0% 25% 50% 75% 100%
-12.0 4.0 13.5 18.5 23.0
> IQR(town1)¶
[1] 14.5
> quantile(town2)¶
0% 25% 50% 75% 100%
6.00 7.75 9.00 11.25 16.00
> IQR(town2)¶
[1] 3.5
You can now apply this function to the lengths of the disfluencies:
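The call is not shown above; given the values discussed next, it was presumably something like this:

> quantile(LENGTH, probs=seq(0, 1, 0.2), type=1)¶
0% 20% 40% 60% 80% 100%
251 519 788 1039 1307 1600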
That is, the central 20% of all the lengths of disfluencies are greater than
788 and range up to 1039 (as you can verify with sort(LENGTH)
[401:600]¶), 20% of the lengths are smaller than or equal to 519, 20% of
the values are 1307 or larger, etc.
An interesting application of quantile is to use it to split vectors of
continuous variables up into groups. For example, if you wanted to split the
vector LENGTH into five groups containing (nearly) the same numbers of values, you can
use the function cut from Section 2.4.1 again, which splits up vectors into
groups, and the function quantile, which tells cut what the groups should
look like (see the sketch below). That way, there are 200 values of LENGTH between and including
251 and 521, etc.
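A sketch of how cut and quantile can be combined for this (the name of the grouping variable is illustrative):

> LENGTH.grp<-cut(LENGTH, breaks=quantile(LENGTH, probs=seq(0, 1, 0.2)), include.lowest=TRUE)¶
> table(LENGTH.grp) # each of the five groups contains (about) 200 lengths¶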
One intuitively simple measure of dispersion for interval/ratio data is the average deviation: you compute the absolute difference of every data point from the mean of the vector, and then you average these absolute differences:
> town1¶
[1] -5 -12 5 12 15 18 22 23 20 16 8 1
> town1-mean(town1)¶
[1] -15.25 -22.25 -5.25 1.75 4.75 7.75 11.75
12.75 9.75 5.75 -2.25 -9.25
> abs(town1-mean(town1))¶
[1] 15.25 22.25 5.25 1.75 4.75 7.75 11.75 12.75 9.75 5.75 2.25 9.25
The mean of these absolute differences is the average deviation; for the lengths of the disfluencies, it is:
> mean(abs(LENGTH-mean(LENGTH)))¶
[1] 329.2946
Much more widely used, however, are the variance and its square root, the standard deviation: the deviations from the mean are squared, summed up, and divided by n-1, which yields the variance; the square root of the variance is the standard deviation:
> town1¶
[1] -5 -12 5 12 15 18 22 23 20 16 8 1
> town1-mean(town1)¶
[1] -15.25 -22.25 -5.25 1.75 4.75 7.75 11.75
12.75 9.75 5.75 -2.25 -9.25
> (town1-mean(town1))^2¶
[1] 232.5625 495.0625 27.5625 3.0625 22.5625 60.0625
138.0625 162.5625 95.0625 33.0625 5.0625 85.5625
> sum((town1-mean(town1))^2)¶
[1] 1360.25
> sum((town1-mean(town1))^2)/(length(town1)-1)¶
[1] 123.6591
> sqrt(sum((town1-mean(town1))^2)/(length(town1)-1))¶
[1] 11.12021
Even though the standard deviation is probably the most widespread meas-
ure of dispersion, it has a potential weakness: its size is dependent on the
mean of the distribution, as you can see in the following example:
> sd(town1)¶
[1] 11.12021
> sd(town1*10)¶
[1] 111.2021
When the values, and hence the mean, are increased by one order of
magnitude, then so is the standard deviation. You can therefore not com-
pare standard deviations from distributions with different means if you do
not first normalize them. If you divide the standard deviation of a distribu-
tion by its mean, you get the variation coefficient. You see that the varia-
tion coefficient is not affected by the multiplication with 10, and Town 1
still has a larger degree of dispersion.
> sd(town1)/mean(town1)¶
[1] 1.084899
> sd(town1*10)/mean(town1*10)¶
[1] 1.084899
> sd(town2)/mean(town2)¶
[1] 0.3210999
If you want to obtain several summarizing statistics for a vector (or a fac-
tor), you can use summary, whose output is self-explanatory.
> summary(town1)¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
-12.00 4.00 13.50 10.25 18.50 23.00
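The call generating the boxplots of the two towns in Figure 28 is not shown above, but it would be along these lines (the names= argument is an illustrative addition):

> boxplot(town1, town2, notch=TRUE, names=c("Town 1", "Town 2"))¶

In the resulting plot: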
− the bold-typed horizontal lines represent the medians of the two vectors;
− the regular horizontal lines that make up the upper and lower boundary
of the boxes represent the hinges (approximately the 75% and the 25%
quartiles);
− the whiskers – the dashed vertical lines extending from the box until the
upper and lower limit – represent the largest and smallest values that are
not more than 1.5 interquartile ranges away from the box;
− each data point that would be outside of the range of the whiskers would
be represented as an outlier with an individual small circle;
− the notches on the left and right sides of the boxes extend across the
range ±1.58*IQR/sqrt(n): if the notches of two boxplots do not over-
lap, then their medians will most likely be significantly different.
Figure 28 shows that the average temperatures of the two towns are very
similar and probably not significantly different from each other. Also, the
dispersion of Town 1 is much larger than that of Town 2. Sometimes, a
good boxplot nearly obviates the need for further analysis; boxplots are
extremely useful and will often be used in the chapters to follow. However,
there are situations where the ecdf plot introduced above is better and the
following example is modeled after what happened in a real dataset of a
student I supervised. Run the code in the code file and consider Figure 29.
As you could see in the code file, I created a vector x1 that actually con-
tains data from two very different distributions whereas the vector x2 con-
tains data from only a single, but wider, distribution.
Figure 29. Boxplots (left panel) and ecdf plots (right panel) of two vectors
Crucially, the boxplots do not reveal that at all. Yes, the second darker
boxplot is wider and has some outliers, but the fact that the first, lighter boxplot actually contains data from two very different distributions is not visible in it at all – the ecdf plots in the right panel, by contrast, reveal this immediately.
A last statistic to be introduced here is the standard error of a mean, i.e. the standard deviation of the means you would get if you drew many random samples (here: of 1000 lengths each, with replacement) from the data and computed each sample's mean. You can simulate this with a loop:
> means<-vector(length=10000)¶
> for (i in 1:10000) {¶
+ means[i]<-mean(sample(LENGTH, size=1000, replace=TRUE))¶
+ }¶
> sd(means)¶
[1] 12.10577
You do not actually have to run such a resampling simulation, though: the standard error of an arithmetic mean can be computed directly according to the formula in (18):

(18) $se_{\text{mean}} = \sqrt{\frac{var}{n}} = \frac{sd}{\sqrt{n}}$
Thus, the standard error of the mean length of disfluencies here is this,
which is very close to our resampled result from above.
> mean(LENGTH)¶
[1] 915.043
> sqrt(var(LENGTH)/length(LENGTH))¶
[1] 12.08127
You can also compute standard errors for statistics other than arithmetic
means but the only other example we look at here is the standard error of a
relative frequency p, which is computed according to the formula in (19):
(19) $se_{\text{percentage}} = \sqrt{\frac{p \cdot (1-p)}{n}}$
Thus, the standard error of the percentage of all silent disfluencies out
of all disfluencies (33.2% of 1000 disfluencies) is:
> prop.table(table(FILLER))¶
FILLER
silence uh uhm
0.332 0.394 0.274
> sqrt(0.332*(1-0.332)/1000)¶
[1] 0.01489215
Finally, the standard error of the difference between two means is computed according to the formula in (20):

(20) $se_{\text{difference between means}} = \sqrt{se_{\text{mean group1}}^2 + se_{\text{mean group2}}^2}$
Warning/advice
Standard errors are only really useful if the data to which they are applied
are distributed pretty normally or when the sample size n ≥ 30.
Imagine, for example, that a student X scored 80% in one course while a student Y scored 60% in another course, and that the two courses differ considerably in their overall results, so that the raw percentages are not directly comparable (cf. below). One way to normalize the grades is called centering and simply in-
volves subtracting from each individual value within one course the aver-
age of that course.
> a<-1:5¶
> centered.scores<-a-mean(a); centered.scores¶
[1] -2 -1 0 1 2
You can see how these scores relate to the original values in a: since the
mean of a is obviously 3, the first two centered scores are negative (i.e.,
smaller than a’s mean), the third is 0 (it does not deviate from a’s mean),
and the last two centered scores are positive (i.e., larger than a’s mean).
Another, more sophisticated, way involves standardizing, i.e. transforming each value into a so-called z-score by subtracting from it the mean of its vector and dividing the result by that vector's standard deviation.
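The corresponding code is not printed above; given the name z.scores used below and the values returned by scale further down, it was presumably:

> z.scores<-(a-mean(a))/sd(a); z.scores¶
[1] -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111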
The relationship between the z-scores and a’s original values is very
similar to that between the centered scores and a’s values: since the mean
of a is obviously 3, the first two z-scores are negative (i.e., smaller than a’s
mean), the third z-score is 0 (it does not deviate from a’s mean), and the
last two z-scores are positive (i.e., larger than a’s mean). Note that such z-
scores have a mean of 0 and a standard deviation of 1:
> mean(z.scores)¶
[1] 0
> sd(z.scores)¶
[1] 1
Instead of computing such scores manually, you can also use the function scale, which standardizes a vector and reports the vector's original mean and standard deviation as attributes:
> scale(a)¶
[,1]
[1,] -1.2649111
[2,] -0.6324555
[3,] 0.0000000
[4,] 0.6324555
[5,] 1.2649111
attr(,"scaled:center")
[1] 3
attr(,"scaled:scale")
[1] 1.581139
With its argument scale=FALSE, the function scale centers only, i.e. it does not divide by the standard deviation (the call producing the following output is not shown above, but it would be scale(a, scale=FALSE)):
> scale(a, scale=FALSE)¶
[,1]
[1,] -2
[2,] -1
[3,] 0
[4,] 1
[5,] 2
attr(,"scaled:center")
[1] 3
If we apply both versions to our example with the two courses, then you
see that the 80% scored by student X is only 0.436 standard deviations (and
13.33 percent points) better than the mean of his course whereas the 60%
scored by student Y is actually 0.873 standard deviations (and 26.67 per-
cent points) above the mean of his course. Thus, X’s score is higher than
Y’s, but if we take the overall results in the two courses into consideration,
then Y’s performance is better; standardizing data is often useful.
In most cases, you are not able to investigate the whole population you are
actually interested in because that population is not accessible and/or too
large so investigating it is impossible, too time-consuming, or too expen-
sive. However, even though you know that different samples will yield
different statistics, you of course hope that your sample would yield a reli-
able estimate that tells you much about the population you are interested in:
− if you find in your sample of 1000 disfluencies that their average length
is approximately 915 ms, then you hope that you can generalize from
that to the population and future investigations;
− if you find in your sample of 1000 disfluencies that 33.2% of these are
silences, then you hope that you can generalize from that to the popula-
tion and future investigations.
So far, we have only discussed how you can compute percentages and
means for samples – the question of how valid these are for populations is
the topic of this section. In Section 3.1.5.1, I explain how you can compute
confidence intervals for arithmetic means, and Section 3.1.5.2 explains how
to compute confidence intervals for percentages. The relevance of such
confidence intervals must not be underestimated: without a confidence
interval it is unclear how well you can generalize from a sample to a popu-
lation; apart from the statistics we discuss here, one can also compute con-
fidence intervals for many others.
If you compute a mean on the basis of a sample, you of course hope that it
represents that of the population well. As you know, the average length of
disfluencies in our example data is 915.043 ms (standard deviation:
382.04). But as we said above, other samples’ means will be different so
you would ideally want to quantify your confidence in this estimate. The
so-called confidence interval, which is useful to provide together with your mean, is the range of values around the sample mean within which we assume there is no significant difference from the sample mean. From the expression “significant difference”, it follows that a confidence interval is typically defined as 1 minus the significance level, i.e., typically as 1-0.05 = 0.95.
In a first step, you again compute the standard error of the arithmetic mean according to the formula in (18). Then you multiply that standard error with the critical t-value for the desired confidence level (and the relevant degrees of freedom), and you add the result to, and subtract it from, the observed sample mean, as summarized in (21).
(21) CI = x̄ ± t · SE
To do this more simply, you can use the function t.test with the rele-
vant vector and use conf.level=… to define the relevant percentage. R then
computes a significance test the details of which are not relevant yet, which
is why we only look at the confidence interval (with $conf.int):
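A sketch of what this looks like for the disfluency lengths (assuming the vector of lengths is called LENGTH, as in the data set used below; output omitted):
> t.test(LENGTH, conf.level=0.95)$conf.int¶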
Note that when you compare means of two roughly equally large sam-
ples and their 95%-confidence intervals do not overlap, then you know the
sample means are significantly different and, therefore, you would assume
that there is a real difference between the population means, too. However,
if these intervals do overlap, this does not show that the means are not sig-
nificantly different from each other (cf. Crawley 2005: 169f.).
The above logic with regard to means also applies to percentages. Given a
particular percentage from a sample, you want to know what the corre-
sponding percentage in the population is. As you already know, the per-
centage of silent disfluencies in our sample is 33.2%. Again, you would
like to quantify your confidence in that sample percentage. As above, you
compute the standard error for percentages according to the formula in
(19), and then this standard error is inserted into the formula in (22).
(22) CI = a ± z · SE   (a = the observed percentage, z = the critical z-score for the chosen confidence level)
For a 95% confidence interval for the percentage of silences, you enter:
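One possible semi-manual computation, using the 332 silences out of 1,000 disfluencies mentioned above (the object name se is only used here for convenience):
> se<-sqrt(0.332*(1-0.332)/1000)¶
> round(0.332+c(-1, 1)*qnorm(0.975)*se, 3)¶
[1] 0.303 0.361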
The simpler way requires the function prop.test, which tests whether
a percentage obtained in a sample is significantly different from an ex-
pected percentage. Again, the functionality of that significance test is not
relevant yet, but this function also returns the confidence interval for the
observed percentage. R needs the observed frequency (332), the sample
size (1000), and the probability for the confidence interval. R uses a formu-
la different from ours but returns nearly the same result.
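A sketch of that call (output omitted; as just mentioned, the interval differs only minimally from the one computed semi-manually):
> prop.test(332, 1000, conf.level=0.95)$conf.int¶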
Warning/advice
Since confidence intervals are based on standard errors, the warning from
above applies here, too: if data are not normally distributed or the samples
too small, then you should probably use other methods to estimate confi-
dence intervals (e.g., bootstrapping).
2. Bivariate statistics
We have so far dealt with statistics and graphs that describe one variable or
vector/factor. In this section, we now turn to methods to characterize two
variables and their relation. We will again begin with frequencies, then we
will discuss means, and finally talk about correlations. You will see that we
can use many functions from the previous sections.
> UHM<-read.delim(file.choose())¶
> attach(UHM)¶
Let’s assume you wanted to see whether men and women differ with re-
gard to the kind of disfluencies they produce. First two questions: are there
dependent and independent variables in this design and, if so, which?
THINK
BREAK
In this case, SEX is the independent variable and FILLER is the depend-
ent variable. Computing the frequencies of variable level combinations in R
is easy because you can use the same function that you use to compute
frequencies of an individual variable’s levels: table. You just give table a
second vector or factor as an argument and R lists the levels of the first
vector in the rows and the levels of the second in the columns:
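A sketch of that call for the present question, with the dependent variable FILLER as the first argument so that its levels appear in the rows (output omitted):
> table(FILLER, SEX)¶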
In fact you can provide even more vectors to table, just try it out, and
we will return to this below. Again, you can create tables of percentages
with prop.table, but with two-dimensional tables there are different ways
to compute percentages and you can specify one with margin=…. The de-
fault is margin=NULL, which computes the percentages on the basis of all
elements in the table. In other words, all percentages in the table add up to
1. Another possibility is to compute row percentages: set margin=1 and
you get percentages that add up to 1 in every row. Finally, you can choose
column percentages by setting margin=2: the percentages in each column
add up to 1. This is probably the best way here since then the percentages
adding up to 1 are those of the dependent variable.
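For example (a sketch; output omitted):
> prop.table(table(FILLER, SEX), margin=2)¶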
You can immediately see that men appear to prefer uh and disprefer
uhm while women appear to have no real preference for any disfluency.
However, we of course do not know yet whether this is a significant result.
The function addmargins outputs row and column totals (or other user-
defined margins, such as means):
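A sketch (output omitted):
> addmargins(table(FILLER, SEX))¶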
Of course you can also represent such tables graphically. The simplest way
involves providing a formula as the main argument to plot. Such formulae
consist of a dependent variable (here: FILLER), a tilde (“~” meaning ‘as a function of’), and an independent variable (here: GENRE).
> plot(FILLER~GENRE)¶
The widths and heights of rows, columns, and the six boxes represent
the observed frequencies. For example, the column for dialogs is a little
wider than that for monologs because there are more dialogs in the data; the
row for uh is widest because uh is the most frequent disfluency, etc.
Other similar graphs can be generated with the following lines:
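Two possibilities along those lines are sketched here; the exact settings used for the graphs in the book may differ:
> barplot(table(SEX, FILLER), beside=FALSE, legend.text=TRUE)¶
> mosaicplot(table(GENRE, FILLER))¶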
These graphs are called stacked bar plots or mosaic plots and are – to-
gether with association plots to be introduced below – often effective ways
of representing crosstabulated data. In the code file for this chapter you will
find R code for another kind of useful graph.
2.1.2. Spineplots
You can use the function spineplot with a formula:
> spineplot(FILLER~LENGTH)¶
The y-axis represents the dependent variable and its three levels. The x-
axis represents the independent ratio-scaled variable, which is split up into
the value ranges that would also result from hist (which also means you
can change the ranges with breaks=…; cf. Section 3.1.1.5 above).
Apart from these plots, you can also generate line plots that summarize
frequencies. If you generate a table of relative frequencies, then you can
create a primitive line plot by entering the code shown below.
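One possible way to produce such a plot – not necessarily the exact code behind Figure 32, and the name fil.sex is only used here for convenience – is this:
> fil.sex<-prop.table(table(FILLER, SEX), margin=2)¶
> plot(fil.sex[,"female"], type="b", ylim=c(0, 1), xlab="Filler", ylab="Relative frequency", xaxt="n"); axis(1, at=1:nrow(fil.sex), labels=rownames(fil.sex))¶
> lines(fil.sex[,"male"], type="b", lty=2)¶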
Warning/advice
Sometimes, it is recommended not to represent such frequency data with a
line plot like this because the lines ‘suggest’ that there are frequency values
between the levels of the categorical variable, which is of course not the
case. Again, you should definitely explore the function dotchart for this.
Figure 32. Line plot with the percentages of the interaction of SEX and FILLER
2.2. Means
> mean(LENGTH[SEX=="female"])¶
[1] 928.3984
> mean(LENGTH[SEX=="male"])¶
[1] 901.5803
This approach works, but it has several disadvantages:
− you must define the values of LENGTH that you want to include manually, which requires a lot of typing (especially when the independent variable has more than two levels or, even worse, when you have more than one independent variable);
− you must know all relevant levels of the independent variables – other-
wise you couldn’t use them for subsetting in the first place;
− you only get the means of the variable levels you have explicitly asked
for. However, if, for example, you made a coding mistake in one row –
such as entering “malle” instead of “male” – this approach will not
show you that.
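The approach that avoids these problems – and the ‘better way’ referred to in the next sentence – is to let tapply split up LENGTH by the levels of SEX and apply mean to each part (a sketch):
> tapply(LENGTH, SEX, mean)¶
  female     male
928.3984 901.5803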
Of course the result is the same as above, but you obtained it in a better
way. You can of course use functions other than mean: median, IQR, sd,
var, …, even functions you wrote yourself. For example, what do you get
when you use length? The numbers of lengths observed for each sex.
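For instance (a sketch; output omitted):
> tapply(LENGTH, SEX, length)¶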
2.2.1. Boxplots
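The plot discussed in the following paragraph can be generated with boxplot; a sketch (notch=TRUE produces the notches referred to below; the exact labels used in the book may differ):
> boxplot(LENGTH~GENRE, notch=TRUE, ylab="Length in ms")¶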
(If you only want to plot a boxplot and not provide any further argu-
ments, it is actually enough to just enter plot(LENGTH~GENRE)¶: R ‘infers’
you want a boxplot because LENGTH is a numerical vector and GENRE is a
factor.) Again, you can infer a lot from that plot: both medians are close to
900 ms and do most likely not differ significantly from each other (since
the notches overlap). Both genres appear to have about the same amount of
dispersion since the notches, the boxes, and the whiskers are nearly equally
large, and both genres have no outliers.
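The THINK BREAK below asks what a line such as the following adds to the boxplot (a reconstruction based on the explanation that follows it):
> text(seq(levels(GENRE)), tapply(LENGTH, GENRE, mean), "+")¶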
THINK
BREAK
It adds plusses into the boxplot representing the means of LENGTH for
each GENRE: seq(levels(GENRE)) returns 1:2, which is used as the x-
coordinates; the tapply code returns the means of LENGTH for each GENRE,
and the "+" is what is plotted.
Such results are best shown in tabular form such that you don’t just pro-
vide the above means of the interactions as they were represented in Figure
32 above, but also the means of the individual variables. Consider Table 17
and the formula in its caption exemplifying the relevant R syntax.
A plus sign between variables refers to just adding main effects of vari-
ables (i.e., effects of variables in isolation, e.g. when you only inspect the
two means for SEX in the bottom row of totals or the three means for
FILLER in the rightmost column of totals). A colon between variables refers to the interaction of the variables, i.e. to the means for the combinations of their levels (the six interior cells of the table).
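Such interaction means are usually visualized with interaction.plot, which expects the factor for the x-axis, the factor defining the separate lines, and the numeric variable whose means are plotted; a sketch, assuming the means of LENGTH are the values in question:
> interaction.plot(SEX, FILLER, LENGTH)¶
> interaction.plot(FILLER, SEX, LENGTH)¶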
That means, you can choose one of two formats, depending on which
independent variable is shown on the x-axis and which is shown with dif-
ferent lines. While the represented means will of course be identical, I ad-
vise you to always generate and inspect both graphs anyway because one of
the two graphs is usually easier to interpret. In Figure 34, you find both
graphs for the above values and I prefer the lower panel.
THINK
BREAK
First, you should not just report the means like this because I told you to
never ever report means without a measure of dispersion. Thus, when you
want to provide the means, you must also add, say, standard deviations,
standard errors, confidence intervals:
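One way to obtain such tables of means together with, say, standard deviations (a sketch; output omitted):
> tapply(LENGTH, list(FILLER, SEX), mean)¶
> tapply(LENGTH, list(FILLER, SEX), sd)¶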
How do you get the standard errors and the confidence intervals?
THINK
BREAK
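One possible answer (a sketch; the helper function se is introduced here only for illustration and is not a built-in R function):
> se<-function(x) sd(x)/sqrt(length(x)) # standard error of a mean¶
> tapply(LENGTH, list(FILLER, SEX), se)¶
> tapply(LENGTH, list(FILLER, SEX), function(x) t.test(x)$conf.int)¶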
The last section in this chapter is devoted to cases where both the depend-
ent and the independent variable are ratio-scaled. For this scenario we turn
to a new data set. First, we clear our memory of all data structures we have
used so far:
> rm(list=ls(all=TRUE))¶
Let us load and plot the data, using by now familiar lines of code:
> ReactTime<-read.delim(file.choose())¶
> str(ReactTime); attach(ReactTime)¶
'data.frame': 20 obs. of 3 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ LENGTH : int 14 12 11 12 5 9 8 11 9 11 ...
$ MS_LEARNER: int 233 213 221 206 123 176 195 207 172 ...
> plot(MS_LEARNER~LENGTH, xlim=c(0, 15), ylim=c(0, 300),
xlab="Word length in letters", ylab="Reaction time of
learners in ms"); grid()¶
THINK
BREAK
(23) Covariance(x, y) = [ ∑ (xᵢ − x̄) · (yᵢ − ȳ) ] / (n − 1), where the sum runs over all value pairs i = 1, …, n
> covariance<-sum((LENGTH-mean(LENGTH))*(MS_LEARNER-
mean(MS_LEARNER)))/(length(MS_LEARNER)-1)¶
> covariance<-cov(LENGTH, MS_LEARNER); covariance¶
[1] 79.28947
The sign of the covariance already indicates whether two variables are
positively or negatively correlated; here it is positive. However, we cannot
use the covariance to quantify the correlation between two vectors because
its size depends on the scale of the two vectors: if you multiply both vec-
tors with 10, the covariance becomes 100 times as large as before although
the correlation as such has of course not changed:
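You can verify this scaling behavior directly – the result is exactly 100 times the covariance from above:
> cov(LENGTH*10, MS_LEARNER*10)¶
[1] 7928.947
The standard solution to this problem is to divide the covariance by the product of the two standard deviations, which yields the Pearson product-moment correlation coefficient r: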
> covariance/(sd(LENGTH)*sd(MS_LEARNER))¶
[1] 0.9337171
> cor(MS_LEARNER, LENGTH, method="pearson")¶
[1] 0.9337171
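The two numbers used in the next line are the (rounded) intercept and slope of the linear regression of the learners' reaction times on word length; such a model can be fitted with lm (a sketch; the object name model is only chosen here so that it can be referred to below):
> model<-lm(MS_LEARNER~LENGTH) # intercept ≈ 93.61, slope ≈ 10.3¶
With these coefficients, you can predict the reaction time the model expects for, say, a word of 16 letters: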
> 93.61+10.3*16¶
[1] 258.41
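The same prediction can be obtained from the model with predict, to which you supply a small data frame with the word length in question (a sketch; output omitted):
> predict(model, newdata=expand.grid(LENGTH=16))¶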
The use of expand.grid is overkill here for a data frame with a single
length but I am using it here because it anticipates our uses of predict and
expand.grid below where we can actually get predictions for a large num-
ber of values in one go (as in the following; the output is not shown here):
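A sketch of that kind of call, using the range of observed word lengths:
> predict(model, newdata=expand.grid(LENGTH=5:15))¶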
If you only use the model as an argument to predict, you get the values
the model predicts for every observed word length in your data in the order
of the data points (same with fitted).
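A sketch of that call (output omitted here):
> predict(model)¶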
The first value of LENGTH is 14, so the first of the above values is the
reaction time we expect for a word with 14 letters, etc. Since you now have
the needed parameters, you can also draw the regression line. You do this
with the function abline, which either takes a linear model object as an
argument or the intercept and the slope; cf. Figure 36:
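A sketch of how the line can be added to the scatterplot created above (using the model object from before):
> abline(model) # or: abline(93.61, 10.3)¶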
You can easily test manually that these are in fact the residuals:
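For instance (a sketch):
> all.equal(as.vector(MS_LEARNER-predict(model)), as.vector(residuals(model)))¶
[1] TRUE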
Note two important points though: First, regression equations and lines
are most useful for the range of values covered by the observed values.
Here, the regression equation was computed on the basis of lengths be-
tween 5 and 15 letters, which means that it will probably be much less reli-
able for lengths of 50+ letters. Second, in this case the regression equation
also makes some rather nonsensical predictions because theoretically/mathematically it predicts reaction times of around 0 ms for word lengths
of -9. Such considerations will become important later on.
The correlation coefficient r also allows you to specify how much of the
variance of one variable can be accounted for by the other variable. What
does that mean? In our example, the values of both variables –
MS_LEARNER and LENGTH – are not all identical: they vary around their
means and this variation was called dispersion and quantified with the
standard deviation or the variance. If you square r and multiply the result
by 100, then you obtain the amount of variance of one variable that the
other variable accounts for. In our example, r = 0.933, which means that
87.18% of the variance of the reaction times can be accounted for – in a
statistical sense, not necessarily a cause-effect sense – on the basis of the
word lengths. This value, r2, is referred to as coefficient of determination.
Incidentally, I have sometimes heard students or colleagues compare two r-
values such that they say something like, “Oh, here r = 0.6, nice, that’s
twice as much as in this other data set, where r = 0.3.” Even numerically
speaking, this is at least misleading, if nothing worse. Yes, 0.6 is twice as
high as 0.3, but one should not compare r-values directly like this – one has
to apply the so-called Fisher’s Z-transformation first, which is exemplified
in the following two lines:
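The transformation is Z = 0.5·ln((1+r)/(1−r)), so for the two r-values from the example (a sketch of the two lines):
> 0.5*log((1+0.6)/(1-0.6))¶
[1] 0.6931472
> 0.5*log((1+0.3)/(1-0.3))¶
[1] 0.3095196
As you can see, on the transformed scale the larger value is no longer simply twice the smaller one.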
A second scenario in which Pearson's r is problematic involves outliers, to which r is very sensitive while rank-based coefficients such as Kendall's τ are much more robust: once you include the outlier, Pearson's r suddenly becomes 0.75 (and the regression line's slope is changed markedly) while Kendall's τ remains appropriately small: 0.14.
The previous explanations were all based on the assumption that there is
in fact a linear correlation between the two variables or one that is best
characterized with a straight line. This need not be the case, though, and a
third scenario in which neither r nor τ are particularly useful involves cases
where these assumptions do not hold. Often, this can be seen by just look-
ing at the data. Figure 38 represents a well-known example from
Anscombe (1973) (from <_inputfiles/03-2-3_anscombe.csv>), which consists of four small data sets whose summary statistics – the means and variances of the x- and y-values, the correlations, and the linear regression lines –
are all identical although the distributions are obviously very different.
In the top left of Figure 38, there is a case where r and τ are unproblem-
atic. In the top right we have a situation where x and y are related in a cur-
vilinear fashion – using a linear correlation here does not make much
sense.16 In the two lower panels, you see distributions in which individual
outliers have a huge influence on r and the regression line. Since all the
summary statistics are identical, this example illustrates most beautifully
how important, in fact indispensable, a visual inspection of your data is,
which is why in the following chapters visual exploration nearly always
precedes statistical computation.
Now you should do the exercise(s) for Chapter 3 …
Warning/advice
Do not let the multitude of graphical functions and settings of R and/or
your spreadsheet software tempt you to produce visual overkill. Just be-
cause you can use 6 different fonts, 10 colors, and cute little smiley sym-
bols does not mean you should: Visualization should help you and/or the
reader understand something otherwise difficult to grasp, which also means
you should make sure your graphs are fairly self-sufficient, i.e. contain all
the information required to understand them (e.g., meaningful graph and
axis labels, legends, etc.) – a graph may need an explanation, but if the
explanation is three quarters of a page, chances are your graph is not help-
ful (cf. Keen 2010: Chapter 1).
16. I do not discuss nonlinear regressions; cf. Crawley (2007: Ch. 18, 20) for overviews.
Chapter 4
Analytical statistics
Typically, there are only two possible answers to the question of what the overall purpose of your study is: “hypothesis-generating” or “hypothesis-testing.” The former means that you
are approaching a (typically large) data set with the intentions of detecting
structure(s) and developing hypotheses for future studies; your approach to
the data is therefore data-driven, or bottom-up; an example for this will be
discussed in Section 5.6. The latter is what most of the examples in this
book are about and means your approach to the data involves specific hy-
potheses you want to test and requires the types of tests in this chapter and
most of the following one.
(25) What kinds of variables are involved in your hypotheses, and how
many?
There are essentially two types of answers. One pertains to the infor-
mation value of the variables and we have discussed this in detail in Sec-
tion 1.3.2.2 above. The other allows for four different possible answers.
First, you may only have one dependent variable, in which case, you nor-
mally want to compute a so-called goodness-of-fit test to test whether the
results from your data correspond to other results (from a previous study)
or correspond to a known distribution (such as a normal distribution). Examples include testing whether two constructions are equally frequent in a corpus sample, or whether a set of measurements is normally distributed – two scenarios we will return to below.
Second, you may have one dependent and one independent variable or
you may just have two sets of measurements (i.e. two dependent variables).
In both cases you typically want to compute a monofactorial test for inde-
pendence to determine whether the values of one/the independent variable
are correlated with those of the other/dependent variable. For example,
− does the animacy of the referent of the direct object (a categorical inde-
pendent variable) correlate with the choice of one of two postverbal
constituent orders (a categorical dependent variable)?
− does the average acceptability judgment (a mean of a ratio/interval de-
pendent variable) vary as a function of whether the subjects doing the
rating are native speakers or not (a categorical independent variable)?
Third, you may have one dependent and two or more independent vari-
ables, in which case you want to compute a multifactorial analysis (such as
a multiple regression) to determine whether the individual independent
variables and their interactions correlate with, or predict, the dependent
variable. For example, does the choice of a verb-particle construction depend on both the givenness and the length of the direct object (and maybe on their interaction)?
Fourth, you have two or more dependent variables, in which case you
may want to perform a multivariate analysis, which can be exploratory
(such as hierarchical cluster analysis, principal components analysis, factor
analysis, multi-dimensional scaling, etc.) or hypothesis-testing in nature
(MANOVA). For example, if you retrieved from corpus data ten words and
the frequencies of all content words occurring close to them, you can per-
form a cluster analysis to see which of the words behave more (or less)
similarly to each other, which often is correlated with semantic similarity.
(26) Are data points in your data related such that you can associate
them to each other meaningfully and in a principled way?
This question is concerned with whether you have what are called inde-
pendent or dependent samples (and brings us back to the notion of inde-
pendence discussed in Section 1.3.4.1). For example, your two samples –
e.g., the numbers of mistakes made by ten male and ten female non-native
speakers in a grammar test – are independent of each other if you cannot
connect each male subject’s value to that of one female subject on a mean-
ingful and principled basis. You would not be able to do so if you randomly
sampled ten men and ten women and let them take the same test.
There are two ways in which samples can be dependent. One is if you
test subjects more than once, e.g., before and after a treatment. In that case,
you could meaningfully connect each value in the before-treatment sample
to a value in the after-treatment sample, namely connect each subject’s two
values. The samples are dependent because, for instance, if subject #1 is
very intelligent and good at the language tested, then these characteristics
will make his results better than average in both tests, esp. compared to a
subject who is less intelligent and proficient in the language and who will
perform worse in both tests. Recognizing that the samples are dependent
this way will make the test of before-vs.-after treatments more precise.
The second way in which samples may be dependent can be explained
using the above example of ten men and ten women. If the ten men were
the husbands of the ten women, then one would want to consider the sam-
ples dependent. Why? Because spouses are on average more similar to each
other than randomly chosen people: they often have similar IQs, similar
professions, they spend more time with each other than with randomly-
selected people, etc. Thus, one should associate each husband with his
wife, making this two dependent samples.
Independence of data points is often a very important criterion: many
tests assume that data points are independent, and which test you must choose often depends on what kind of samples – independent or dependent – you have.
(27) What is the statistic of the dependent variable in the statistical hy-
potheses?
There are essentially five different answers to this question, which were
already mentioned in Section 1.3.2.3 above, too. Your dependent variable
may involve frequencies/counts, central tendencies, dispersions, correla-
tions, or distributions.
(28) What does the distribution of the data or your test statistic look
like? Normal, some other way that can ultimately be described by a
probability function (or a way that can be transformed to look like
a probability function), or some other way?
(29) How big are the samples you collected? n < 30 or n ≥ 30?
This decision process is summarized in the chart <sflwr_navigator.png>, which you should have downloaded as part of all the files from the companion website. Let's exemplify the use of this graph using the above
example scenario: you hypothesize that the average acceptability judgment
(a mean of an ordinal dependent variable) varies as a function of whether
the subjects providing the ratings are native or non-native speakers (a bina-
ry/categorical independent variable).
You start at the rounded red box with approach in it. Then, the above
scenario is a hypothesis-testing scenario so you go down to statistic. Then,
the above scenario involves averages so you go down to the rounded blue
box with mean in it. Then, the hypothesis involves both a dependent and an
independent variable so you go down to the right, via 1 DV 1 IV to the
transparent box with (tests for) independence/difference in it. You got to
that box via the blue box with mean so you continue to the next blue box
containing information value. Now you make two decisions: first, the de-
pendent variable is ordinal in nature. Second, the samples are independent.
Thus, you take the arrow down to the bottom left, which leads to a blue box
with U-test in it. Thus, the typical test for the above question would be the
U-test (to be discussed below), and the R function for that test is already
provided there, too: wilcox.test.
Now, what does the dashed arrow mean that leads towards that box? It
means that you would also do a U-test if your dependent variable was in-
terval/ratio-scaled but violated other assumptions of the t-test. That is,
dashed arrows provide alternative tests for the first-choice test from which
they originate.
Obviously, this graph is a simplification and does not contain every-
thing one would want to know, but I think it can help beginners to make
first choices for tests so I recommend that, as you continue with the book,
you always determine for each section which test to use and how to identify
this on the basis of the graph.
Before we get started, let me remind you once again that in your own
data your nominal/categorical variables should ideally always be coded
with meaningful character strings so that R recognizes them as factors
when reading in the data from a file. Also, I will assume that you have
downloaded the data files from the companion website.
In this section, I will illustrate how to test whether distributions and fre-
quencies from one sample differ significantly from a known distribution
(cf. Section 4.1.1) or from another sample (cf. Section 4.1.2). In both sec-
tions, we begin with variables from the interval/ratio level of measurement
and then proceed to lower levels of measurement.
In this section, I will discuss how you compare whether the distribution of
one dependent interval-/ratio-scaled variable is significantly different from
a known distribution. I will restrict my attention to one of the most frequent
cases, the situation where you test whether a variable is normally distribut-
ed (because as mentioned above in Section 1.3.4, many statistical tech-
niques require a normal distribution so you must know some test like this).
We will deal with an example from the first language acquisition of
tense and aspect in Russian. Simplifying a bit here, one can often observe a
relatively robust correlation between past tense and perfective aspect as
well as non-past tenses and imperfective aspect. Such a correlation can be
quantified with Cramer’s V values (cf. Stoll and Gries, 2009, and Section
4.2.1 below). Let us assume you studied how this association – the
Cramer’s V values – changes for one child over time. Let us further assume
you had 117 recordings for this child, computed a Cramer’s V value for
each one, and now you want to see whether these are normally distributed.
This scenario involves a single dependent interval/ratio-scaled variable – the Cramer's V values – and no independent variable: you test the goodness of fit of the observed distribution against a known distribution, the normal distribution.
You can test for normality in several ways. The test we will use is the
Shapiro-Wilk test (remember: check <sflwr_navigator.png> to see how we
get to this test!), which does not really have any assumptions other than
ratio-scaled data and involves the following procedure:
Procedure
− Formulating the hypotheses
− Visualizing the data
− Computing the test statistic W and p
> RussianTensAsp<-read.delim(file.choose())¶
> attach(RussianTensAsp)¶
> hist(TENSE_ASPECT, xlim=c(0, 1), main="", xlab="Tense-Aspect
correlation", ylab="Frequency") # left panel¶
Figure 39. Histogram of the Cramer’s V values reflecting the strengths of the
tense-aspect correlations
At first glance, this looks very much like a normal distribution, but of
course you must do a real test. The Shapiro-Wilk test is rather cumbersome
to compute semi-manually, which is why its manual computation will not
be discussed here (unlike nearly all other monofactorial tests). In R, how-
ever, the computation could not be easier. The relevant function is called
shapiro.test and it only requires one argument, the vector to be tested:
> shapiro.test(TENSE_ASPECT)¶
Shapiro-Wilk normality test
data: TENSE_ASPECT
W = 0.9942, p-value = 0.9132
What does this mean? This simple output teaches an important lesson:
Usually, you want to obtain a significant result, i.e., a p-value that is small-
er than 0.05 because this allows you to accept H1. Here, however, you may
actually welcome an insignificant result because normally-distributed vari-
ables are often easier to handle. The reason for this is again the logic under-
lying the falsification paradigm. When p < 0.05, you reject H0 and accept
H1. But here you ‘want’ H0 to be true because H0 states that the data are
normally distributed. You obtained a p-value of 0.9132, which means you
cannot reject H0 and, thus, consider the data to be normally distributed.
You would therefore summarize this result in the results section of your
paper as follows: “According to a Shapiro-Wilk test, the distribution of this
child’s Cramer’s V values measuring the tense-aspect correlation does not
deviate significantly from normality: W = 0.9942; p = 0.9132.” (In paren-
theses or after a colon you usually mention all statistics that helped you
decide whether or not to accept H1.)
As an alternative to the Shapiro-Wilk test, you can also use a Kolmogo-
rov-Smirnov test for goodness of fit. This test requires the function
ks.test and is more flexible than the Shapiro-Wilk test, since it can test
for more than just normality and can also be applied to vectors with more
than 5000 data points. To test the Cramer's V values for normality, you pro-
vide them as the first argument, then you name the distribution you want to
test against (for normality, "pnorm"), and then, to define the parameters of
the normal distribution, you provide the mean and the standard deviation of
the Cramer’s V values:
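A sketch of that call:
> ks.test(TENSE_ASPECT, "pnorm", mean=mean(TENSE_ASPECT), sd=sd(TENSE_ASPECT))¶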
The result is the same as above: the data do not differ significantly from
normality. You also get a warning because ks.test assumes that no two
values in the input are the same, but here some values (e.g., 0.27, 0.41, and
others) are attested more than once; below you will see a quick and dirty
fix for this problem.
In this section, we are going to return to an example from Section 1.3, the
constructional alternation of particle placement in English, which is again
represented in (30).
As you already know, often both constructions are acceptable and native
speakers can often not explain their preference for one of the two. One may
therefore expect that both constructions are equally frequent, and this is
what you are going to test. This scenario involves a single dependent nominal/categorical variable – the choice of construction – and no independent variable: you compare the observed frequencies of the two constructions with those expected if both constructions were equally likely.
Such questions are generally investigated with tests from the family of
chi-squared tests, which is one of the most important and widespread tests.
Since there is no independent variable, you test the degree of fit between
your observed and an expected distribution, which should remind you of
Section 3.1.5.2. This test is referred to as the chi-squared goodness-of-fit
test and involves the following steps:
Procedure
− Formulating the hypotheses
− Computing descriptive statistics and visualizing the data
− Computing the frequencies you would expect given H0
− Testing the assumption(s) of the test:
− all observations are independent of each other
− 80% of the expected frequencies are ≥ 5 (cf. n. 17)
− all expected frequencies are > 1
− Computing the contributions to chi-squared for all observed frequencies
− Computing the test statistic χ2, df, and p
The first step is very easy here. As you know, H0 typically postulates
that the data are distributed randomly/evenly, and that means that both
constructions occur equally often, i.e., 50% of the time (just as tossing a
fair coin many times will result in a largely equal distribution). Thus:
H0: The frequencies of the two verb-particle constructions are identical.
H1: The frequencies of the two verb-particle constructions are different.
17. This threshold value of 5 is the one most commonly mentioned. There are a few studies
that show that the chi-squared test is fairly robust even if this assumption is violated –
especially when, as is here the case, H0 postulates that the expected frequencies are
equally high (cf. Zar 1999: 470). However, to keep things simple, I stick to the most
common conservative threshold value of 5 and refer you to the literature quoted in Zar.
If your data violate this assumption, then you must compute a binomial test (if, as here,
you have two groups) or a multinomial test (for three or more groups); cf. the recom-
mendations for further study.
THINK
BREAK
It’s wrong because Table 19 does not show you the raw data – what it
shows you is already a numerical summary. You don’t have interval/ratio
data – you have an interval/ratio summary of categorical data, because the
numbers 247 and 150 summarize the frequencies of the two levels of the
categorical variable CONSTRUCTION (which you probably obtained from
applying table to a vector/factor). One strategy to not mix this up is to
always conceptually envisage what the raw data table would look like in
the case-by-variable format discussed in Section 1.3.3. In this case, it
would look like this:
You must now check whether you can actually do a chi-squared test
here, but the observed frequencies are obviously larger than 5 and we as-
sume that Peters’s data points are in fact independent (because we will
assume that each construction has been provided by a different speaker).
We can therefore proceed with the chi-squared test, the computation of
which is fairly straightforward and summarized in (31).
(31) Pearson chi-squared = χ² = ∑ (observed − expected)² / expected, summed over all n cells of the table
That is to say, for every value of your frequency table you compute a
so-called contribution to chi-squared by (i) computing the difference be-
tween the observed and the expected frequency, (ii) squaring this differ-
ence, and (iii) dividing that by the expected frequency again. The sum of
these contributions to chi-squared is the test statistic chi-squared. Here, it is
approximately 23.7.
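If you want to follow the computation in R, the two vectors used in the next line contain the observed and the expected frequencies respectively (a sketch based on the figures discussed in this section):
> VPCs<-c(247, 150)¶
> VPCs.exp<-c(198.5, 198.5)¶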
> sum(((VPCs-VPCs.exp)^2)/VPCs.exp)¶
[1] 23.70025
In statistical terms, the hypotheses therefore are:
H0: χ2 = 0.
H1: χ2 > 0.
But the chi-squared value alone does not show you whether the differ-
ences are large enough to be statistically significant. So, what do you do
with this value? Before computers became more widespread, a chi-squared
value was used to look up whether the result is significant or not in a chi-
squared table. Such tables typically have the three standard significance
levels in the columns and different numbers of degrees of freedom (df) in
the rows. Df here is the number of categories minus 1, i.e., df = 2-1 = 1,
because when we have two categories, then one category frequency can
vary freely but the other is fixed (so that we can get the observed number of
elements, here 397). Table 21 is one such chi-squared table for the three
significance levels and df = 1 to 3.
Table 21. Critical χ2-values for ptwo-tailed = 0.05, 0.01, and 0.001 for 1 ≤ df ≤ 3
p = 0.05 p = 0.01 p = 0.001
df = 1 3.841 6.635 10.828
df = 2 5.991 9.21 13.816
df = 3 7.815 11.345 16.266
You can actually generate those values yourself with the function
qchisq. That function requires three arguments:
− p: the p-value(s) for which you need the critical chi-squared values (for
some df);
− df: the df-value(s) for the p-value for which you need the critical chi-
squared value;
− lower.tail=FALSE: the argument to instruct R to only use the area
under the chi-squared distribution curve that is to the right of / larger
than the observed chi-squared value.
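For example, the critical values in the first row of Table 21 can be generated like this:
> round(qchisq(c(0.05, 0.01, 0.001), 1, lower.tail=FALSE), 3)¶
[1]  3.841  6.635 10.828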
More advanced users find code to generate all of Table 21 in the code
file. Once you have such a table, you can test your observed chi-squared
value for significance by determining whether it is larger than the chi-
squared value(s) tabulated at the observed number of degrees of freedom.
You begin with the smallest tabulated chi-squared value and compare your
observed chi-squared value with it and continue to do so as long as your
observed value is larger than the tabulated ones. Here, you first check
whether the observed chi-squared is significant at the level of 5%, which is
obviously the case: 23.7 > 3.841. Thus, you can check whether it is also
significant at the level of 1%, which again is the case: 23.7 > 6.635. Thus,
you can finally even check if the observed chi-squared value is maybe even
highly significant, and again this is so: 23.7 > 10.827. You can therefore
reject H0 and the usual way this is reported in your results section is this:
“According to a chi-squared goodness-of-fit test, the frequency distribution
of the two verb-particle constructions deviates highly significantly from the
expected one (χ2 = 23.7; df = 1; ptwo-tailed < 0.001): the construction where
the particle follows the verb directly was observed 247 times although it
was only expected 199 times, and the construction where the particle fol-
lows the direct objet was observed only 150 times although it was expected
199 times.”
With larger and more complex amounts of data, this semi-manual way
of computation becomes more cumbersome (and error-prone), which is
why we will simplify all this a bit. First, you can of course compute the p-
value directly from the chi-squared value using the mirror function of
qchisq, viz. pchisq, which requires the above three arguments:
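A sketch of such a call, using the chi-squared value computed above (output omitted here):
> pchisq(23.70025, 1, lower.tail=FALSE)¶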
As you can see, the level of significance we obtained from our stepwise
comparison using Table 21 is confirmed: p is indeed much smaller than
0.001, namely 0.00000125825. However, there is another even easier way:
why not just do the whole test with one function? The function is called
chisq.test, and in the present case it requires maximally three arguments:
In this case, this is easy: you already have a vector with the observed
frequencies, the sample size n is much larger than 60, and the expected
probabilities result from H0. Since H0 says the constructions are equally
frequent and since there are just two constructions, the vector of the ex-
pected probabilities contains two times 1/2 = 0.5. Thus:
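A sketch of the corresponding call, using the vector of observed frequencies from above (output omitted):
> chisq.test(VPCs, p=c(0.5, 0.5))¶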
You get the same result as from the manual computation but this time
you immediately also get a p-value. What you do not also get are the ex-
pected frequencies, but these can be obtained very easily, too. The function
chisq.test computes more than it returns. It returns a data structure (a so-
called list) so you can assign a name to this list and then inspect it for its
contents (output not shown):
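For example (a sketch; the name test is the one used below):
> test<-chisq.test(VPCs, p=c(0.5, 0.5))¶
> str(test)¶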
Thus, if you require the expected frequencies, you just retrieve them
with a $ and the name of the list component you want, and of course you
get the result you already know.
> test$expected¶
[1] 198.5 198.5
Let me finally mention that the above method computes a p-value for a
two-tailed test. There are many tests in R where you can define whether
you want a one-tailed or a two-tailed test. However, this does not work
with the chi-squared test. If you require the critical chi-squared value for
pone-tailed = 0.05 for df = 1, then you must compute the critical chi-squared
value for ptwo-tailed = 0.1 for df = 1 (with qchisq(0.1, 1, lower.tail=
FALSE)¶), since your prior knowledge is rewarded such that a less extreme
result in the predicted direction will be sufficient (cf. Section 1.3.4). Also,
this means that when you need the pone-tailed-value for a chi-square value,
just take half of the ptwo-tailed-value of the same chi-square value. In this
case, if your H1 had been directional, this would have been your p-value.
But again: this works only with df = 1.
Warning/advice
Above I warned you to never change your hypotheses after you have ob-
tained your results and then sell your study as successful support of the
‘new’ H1. The same logic does not allow you to change your hypothesis
from a two-tailed one to a one-tailed one because your ptwo-tailed = 0.08 (i.e.,
non-significant) so that the corresponding pone-tailed = 0.04 (i.e., significant).
Your choice of a one-tailed hypothesis must be motivated conceptually.
Another hugely important warning: never ever compute a chi-square
test like the above on percentages – always on ‘real’ observed frequencies!
Let us now look at an example in which two independent samples are com-
pared with regard to their overall distributions. You will test whether men
and women differ with regard to the frequencies of hedges they use in dis-
course (i.e., expressions such as kind of or sort of). Again, note that we are
here only concerned with the overall distributions – not just means or just
variances. We could of course do that, too, but it is possible that the means are very similar while the variances are not, and a test comparing only the means would not reveal that difference.
The question of whether the two sexes differ in terms of the distribu-
tions of hedge frequencies is investigated with the two-sample Kolmogo-
rov-Smirnov test (again, check <sflwr_navigator.png>):
Procedure
− Formulating the hypotheses
− Computing descriptive statistics and visualizing the data
− Testing the assumption(s) of the test: the data are continuous
− Computing the cumulative frequency distributions for both samples, the
maximal absolute difference D of both distributions, and p
First the hypotheses: the text form is straightforward and the statistical
version is based on a test statistic called D, which will be explained below:
H0: The distribution of the dependent variable HEDGES does not differ
depending on the levels of the independent variable SEX; D = 0.
H1: The distribution of the dependent variable HEDGES differs depend-
ing on the levels of the independent variable SEX; D > 0.
Before we do the actual test, let us again inspect the data graphically.
You first load the data from <_inputfiles/04-1-2-1_hedges.csv>, check the
data structure (I will usually not show that output here in the book), and
make the variable names available.
> Hedges<-read.delim(file.choose())¶
> str(Hedges)¶
> attach(Hedges)¶
You are interested in the general distribution, so one plot you can create
is a stripchart. In this kind of plot, the frequencies of hedges are plotted
separately for each sex, but to avoid that identical frequencies are plotted
directly onto each other (and can therefore not be distinguished anymore),
you also use the argument method="jitter" to add a tiny value to each
data point, which decreases the chance of overplotted data points (also try
method="stack"). Then, you include the meaningful point of x = 0 on the
x-axis. Finally, with the function rug you add little bars to the x-axis
(side=1) which also get jittered. The result is shown in Figure 40.
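A sketch of code along those lines (the exact graphical parameters behind Figure 40 may differ):
> stripchart(HEDGES~SEX, method="jitter", xlim=c(0, max(HEDGES)), xlab="Number of hedges")¶
> rug(jitter(HEDGES), side=1)¶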
To prepare the computation of the cumulative distributions, you first sort both vectors in the order of increasing hedge frequencies:
> SEX<-SEX[order(HEDGES)]¶
> HEDGES<-HEDGES[order(HEDGES)]¶
The next step is a little more complex. You must now compute the max-
imum of all differences of the two cumulative distributions of the hedges.
You can do this in three steps: First, you generate a frequency table with
the numbers of hedges in the rows and the sexes in the columns. This table
in turn serves as input to prop.table, which generates a table of column
percentages (hence margin=2; cf. Section 3.2.1, output not shown):
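A sketch of these two steps in a single line (the name dists is the one used below):
> dists<-prop.table(table(HEDGES, SEX), margin=2)¶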
This table shows that, say, 10% of all numbers of hedges of men are 4,
but these are of course not cumulative percentages yet. The second step is
therefore to convert these percentages into cumulative percentages. You
can use cumsum to generate the cumulative percentages for both columns
and can even compute the differences in the same line:
> differences<-cumsum(dists[,1])-cumsum(dists[,2])¶
That is, you subtract from every cumulative percentage of the first col-
umn (the values of the women) the corresponding value of the second col-
umn (the values of the men). The third and final step is then to determine
the maximal absolute difference, which is the test statistic D:
> max(abs(differences))¶
[1] 0.4666667
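The two cumulative distribution curves can be plotted with ecdf; a sketch of code producing such curves (the exact arguments in the book probably differ):
> plot(ecdf(HEDGES[SEX=="male"]), verticals=TRUE, main="", xlab="Number of hedges")¶
> plot(ecdf(HEDGES[SEX=="female"]), verticals=TRUE, add=TRUE, col="darkgrey")¶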
> abline(v=9, lty=2)¶
For example, the fact that the values of the women are higher and more
homogeneous is indicated especially in the left part of the graph where the
low hedge frequencies are located and where the values of the men already
rise but those of the women do not. More than 40% of the values of the
men are located in a range where no hedge frequencies for women were
obtained at all. As a result, the largest difference at position x = 9 arises
where the curve for the men has already risen considerably while the curve
for the women has only just begun to take off. This graph also explains
why H0 postulates D = 0. If the curves are completely identical, there is no
difference between them and D becomes 0.
The above explanation simplified things a bit. First, you do not always
have two-tailed tests and identical sample sizes. Second, identical values –
so-called ties – can complicate the computation of this test (and others).
Fortunately, you do not really have to worry about any of this because the
R function ks.test does everything for you in just one line. You just need
the following arguments (cf. n. 19):
19. Unfortunately, the function ks.test does not take a formula as input.
When you test a two-tailed H1 as we do here, then the line to enter into
R reduces to the following, and you get the same D-value and the p-value.
(I omitted the warning about ties here but, again, you can use jitter to get
rid of it; cf. the code file.)
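A sketch of that line:
> ks.test(HEDGES[SEX=="male"], HEDGES[SEX=="female"])¶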
In Section 4.1.1.2 above, we discussed how you test whether the distribu-
tion of a dependent nominal/categorical variable is significantly different
from another known distribution. A probably more frequent situation is that
you test whether the distribution of one nominal/categorical variable is
dependent on another nominal/categorical variable.
Above, we looked at the frequencies of the two verb-particle construc-
tions. We found that their distribution was not compatible with H0. Howev-
er, we also saw earlier that there are many variables that are correlated with
the constructional choice. One of these is whether the referent of the direct
object is given information, i.e., known from the previous discourse, or not.
Specifically, previous studies found that objects referring to given referents
prefer the position before the particle whereas objects referring to new ref-
erents prefer the position after the particle. We will look at this hypothesis using the data from Peters (2001).
Procedure
− Formulating the hypotheses
− Computing descriptive statistics and visualizing the data
− Computing the frequencies you would expect given H0
− Testing the assumption(s) of the test:
− all observations are independent of each other
− 80% of the expected frequencies are ≥ 5 (cf. n. 17)
− all expected frequencies are > 1
− Computing the contributions to chi-squared for all observed frequencies
− Computing the test statistic χ2, df, and p
First, we explore the data graphically. You load the data from
<_inputfiles/04-1-2-2_vpcs.csv>, create a table of the two factors, and get a
first visual impression of the distribution of the data (cf. Figure 43).
> VPCs<-read.delim(file.choose())¶
> str(VPCs); attach(VPCs)¶
> Peters.2001<-table(CONSTRUCTION, GIVENNESS)¶
> plot(CONSTRUCTION~GIVENNESS)¶
In the simplest case, the statistical version of the hypotheses could be formulated like this:
H0: n(V DO Part & Ref DO = given) = n(V DO Part & Ref DO ≠ given) = n(V Part DO & Ref DO = given) = n(V Part DO & Ref DO ≠ given)
H1: as H0, but there is at least one “≠” instead of an “=”.
However, life is usually not that simple, for example when (a) as in Pe-
ters (2001) not all subjects answer all questions or (b) naturally-observed
data are counted that are not as nicely balanced. Thus, in Peters’s real data,
it does not make sense to simply assume equal frequencies. Put differently,
H0 cannot look like Table 24 because the row totals of Table 23 show that
the different levels of GIVENNESS are not equally frequent. If GIVENNESS
had no influence on CONSTRUCTION, you would expect that the frequencies
of the two constructions for each level of GIVENNESS would exactly reflect
the frequencies of the two constructions in the whole sample. That means
(i) all marginal totals (row/column totals) must remain constant (as they
reflect the numbers of the investigated elements), and (ii) the proportions of
the marginal totals determine the cell frequencies in each row and column.
From this, a rather complex set of hypotheses follows:
In other words, you cannot simply say, “there are 2·2 = 4 cells and I as-
sume each expected frequency is 397 divided by 4, i.e., approximately
100.” If you did that, the upper row total would amount to nearly 200 – but
that can’t be right since there are only 150 cases of CONSTRUCTION: VERB-
OBJECT-PARTICLE. Thus, you must include this information, namely that there are only 150 cases of CONSTRUCTION: VERB-OBJECT-PARTICLE, in the computation of the expected frequencies. The easiest way to do this is using
percentages: there are 150/397 cases of CONSTRUCTION: VERB-OBJECT-
PARTICLE (i.e. 0.3778 = 37.78%). Then, there are 185/397 cases of
GIVENNESS: GIVEN (i.e., 0.466 = 46.6%). If the two variables are independ-
ent of each other, then the probability of their joint occurrence is
0.3778·0.466 = 0.1761. Since there are altogether 397 cases to which this
probability applies, the expected frequency for this combination of variable
levels is 397·0.1761 = 69.91. This logic can be reduced to (33).
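Formula (33) itself is not reproduced in this excerpt, but the arithmetic just described can be checked directly in R; it reduces to multiplying the row total by the column total and dividing by the overall number of cases:
> (150/397)*(185/397)*397¶
[1] 69.89924
> 150*185/397 # row total * column total / overall n¶
[1] 69.89924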
If you apply this logic to every cell, you get Table 26.
You can immediately see that this table corresponds to the above H0: the
ratios of the values in each row and column are exactly those of the row
totals and column totals respectively. For example, the ratio of 69.9 to 80.1
to 150 is the same as that of 115.1 to 131.9 to 247 and as that of 185 to 212
to 397, and the same is true in the other dimension. Thus, H0 is not “all cell
frequencies are identical” – it is “the ratios of the cell frequencies are equal
(to each other and the respective marginal totals).”
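You can verify this proportionality with the figures just mentioned; in each row and in the column of totals, the proportion of GIVENNESS: GIVEN is the same:
> c(69.9/150, 115.1/247, 185/397) # all approximately 0.466¶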
This method of computing expected frequencies can be extended to arbitrarily complex frequency tables (see Gries 2009b: Section 5.1). But how do we test whether these expected frequencies deviate strongly enough from the observed fre-
quencies? Thankfully, we do not need such complicated hypotheses but can
use the simpler versions of χ2 = 0 and χ2 > 0 used above, and the chi-
squared test for independence is identical to the chi-squared goodness-of-fit
test you already know: for each cell, you compute a contribution to chi-
squared and sum those up to get the chi-squared test statistic.
As before, the chi-squared test can only be used when its assumptions
are met. The expected frequencies are large enough, and for simplicity's sake we assume here that every subject gave just one sentence, so that the observations are independent of each other: for example, the fact that one subject produced a particular sentence on one occasion then does not affect any other subject's formulation. We can therefore proceed as above
and compute (the sum of) the contributions to chi-squared on the basis of
the same formula, here repeated as (34):
(34)  Pearson χ² = Σᵢ₌₁ⁿ (observed − expected)² / expected
The results are shown in Table 27 and the sum of all contributions to
chi-squared, chi-squared itself, is 9.82. However, we again need the num-
ber of degrees of freedom. For two-dimensional tables and when the ex-
pected frequencies are computed on the basis of the observed frequencies
as here, the number of degrees of freedom is computed as shown in (35).20
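Formula (35) is not reproduced in this excerpt; for such two-dimensional tables it is presumably the familiar df = (number of rows−1)·(number of columns−1), which yields df = 1 here. The whole manual computation can be sketched in R on the basis of the table Peters.2001 created above (the object names obs, exp.freqs, and contribs are merely illustrative):
> obs<-Peters.2001¶
> exp.freqs<-outer(rowSums(obs), colSums(obs))/sum(obs) # expected frequencies (cf. Table 26)¶
> contribs<-(obs-exp.freqs)^2/exp.freqs # contributions to chi-squared (cf. Table 27)¶
> sum(contribs) # the chi-squared statistic, approximately 9.82¶
> (nrow(obs)-1)*(ncol(obs)-1) # df = 1¶
> pchisq(sum(contribs), df=1, lower.tail=FALSE) # the p-value¶
The p-value obtained this way falls between 0.01 and 0.001, in line with the table lookup described next.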
With both the chi-squared and the df-value, you can look up the result in
a chi-squared table (e.g., Table 28 below, which is the same as Table 21).
As above, if the observed chi-squared value is larger than the one tabulated
for p = 0.05 at the required df-value, then you can reject H0. Here, chi-
squared is not only larger than the critical value for p = 0.05 and df = 1, but
also larger than the critical value for p = 0.01 and df = 1. But, since the chi-
squared value is not also larger than 10.828, the actual p-value is somewhere between 0.01 and 0.001: the result is very, but not highly, significant.
Table 28. Critical χ²-values for p (two-tailed) = 0.05, 0.01, and 0.001 for 1 ≤ df ≤ 3
           p = 0.05    p = 0.01    p = 0.001
df = 1     3.841       6.635       10.828
df = 2     5.991       9.210       13.816
df = 3     7.815       11.345      16.266
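Instead of consulting such a table, you can also obtain the critical values and the exact p-value directly from R (the probabilities given to qchisq are 1 minus the two-tailed p-values of Table 28):
> qchisq(c(0.95, 0.99, 0.999), df=1) # critical values for df=1: 3.841, 6.635, 10.828¶
> pchisq(9.819132, df=1, lower.tail=FALSE) # the exact p-value, between 0.01 and 0.001¶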
Fortunately, all this is much easier when you use R’s built-in function.
Either you compute just the p-value as before,
20. In our example, the expected frequencies were computed from the observed frequencies
in the marginal totals. If you compute the expected frequencies not from your observed
data but from some other distribution, the computation of df changes to: df = (number of
rows ⋅ number of columns)-1.
or you use the function chisq.test and do everything in a single step. The
most important arguments for our purposes are:
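The list of arguments and the function call are not shown in this excerpt. The two arguments that matter most are x, the observed frequency table, and correct, which controls whether Yates' continuity correction is applied. Judging from the uncorrected chi-squared value in the output below, the call presumably looked roughly like this:
> test.Peters<-chisq.test(Peters.2001, correct=FALSE)¶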
> test.Peters$expected¶
GIVENNESS
CONSTRUCTION given new
V_DO_Part 69.89924 80.10076
V_Part_DO 115.10076 131.89924
> test.Peters$statistic¶
X-squared
9.819132
For effect sizes, the fact that chi-squared values increase with the sample size is of course a disadvantage: just because the sample size is larger does not mean that the relation of the values to each other has changed, too. You can easily verify this by noticing that the
ratios of percentages, for example, have stayed the same. For that reason,
the effect size is often quantified with a coefficient of correlation (called φ
in the case of k×2/m×2 tables or Cramer’s V for k×m tables with k or m >
21. For further options, cf. again ?chisq.test¶. Note also what happens when you enter
summary(Peters.2001)¶.