$ TYPEFREQ : num 271 103 735 18 37
$ CLASS : Factor w/ 2 levels "closed","open": 2 2 2 1 1

As you can see, there are now only three variables left because POS now
functions as row names. Note that this is only possible when the column
with the row names contains no element twice.
A second way of creating data frames, which is much less flexible but ex-
tremely important for Chapter 5, involves the function expand.grid. In its
simplest use, the function takes several vectors or factors as arguments and
returns a data frame whose rows contain all possible combinations of the
vector elements and factor levels. This sounds complicated, but it is very easy
to understand from the following example, and we will use this function many times:

> expand.grid(COLUMN1=c("a", "b"), COLUMN2=1:3)¶
COLUMN1 COLUMN2
1 a 1
2 b 1
3 a 2
4 b 2
5 a 3
6 b 3

5.2. Loading and saving data frames

While you can generate data frames as shown above, this is certainly not
the usual way in which data frames are entered into R. Typically, you will
read in files that were created with spreadsheet software. If you create a
table in, say, LibreOffice Calc and want to work on it within R, then you
should first save it as a plain text file. There are two ways to
do this. Either you copy the whole file into the clipboard, paste it into a text
editor (e.g., geany or Notepad++), and then save it as a tab-delimited text
file, or you save it directly out of the spreadsheet software as a CSV file (as
mentioned above with File: Save As … and Save as type: Text CSV (.csv);
then you choose tabs as field delimiter and no text delimiter), and don't
forget to provide the file extension. To load this file into R, you use the
function read.table and some of its arguments:

− file="…": the path to the text file with the table (on Windows PCs you
can use choose.files() here, too; if the file is still in the clipboard,
you can also write file="clipboard");

− header=TRUE: an indicator of whether the first row of the file contains
column headers (which it should always have) or header=FALSE (the
default);
− sep="": between the double quotes you put the single character that
delimits columns; the default sep="" means space or tab, but usually
you should set sep="\t" so that you can use spaces in cells of the table;
− dec="." or dec=",": the decimal separator;
− row.names=…, where … is the number of the column containing the row
names;
− quote=…: the default is that quoted material is marked with single or double
quotes, but you should nearly always set quote="";
− comment.char=…: the default is that comments are marked by "#", but
we will always set comment.char="".

Thus, if you want to read in the above table from the file
<_inputfiles/02-5-2_dataframe1.csv> – once without row names and once
with row names – then this is what you could type:

> a1<-read.table(file.choose(), header=TRUE, sep="\t",
   quote="", comment.char="") # R numbers rows¶

or

> a2<-read.table(file.choose(), header=TRUE, sep="\t",
   quote="", comment.char="", row.names=1) # row names¶

By entering a1¶ or str(a1)¶ (same with a2), you can check whether
the data frames have been loaded correctly.
While the above is the most explicit and most general way to load all
sorts of different data frames, when you have set up your data as recom-
mended above, you can often use a shorter version with read.delim,
which has header=TRUE and sep="\t" as defaults and should, therefore,
work most of the time:

> a3<-read.delim(file.choose())¶

If you want to save a data frame from R, then you can use
write.table. Its most important arguments are:

− x: the data frame you want to save;


− file: the path to the file into which you wish to save the data frame;
typically, using file.choose() is easiest;


− append=FALSE (the default) or append=TRUE: the former generates or
overwrites the defined file, the latter appends the data frame to that file;
− quote=TRUE (the default) or quote=FALSE: the former prints factor lev-
els with double quotes; the latter prints them without quotes;
− sep="": between the double quotes you put the single character that
delimits columns; the default " " means a space; what you should use is
"\t", i.e. tabs;
− eol="\n": between the double quotes you put the single character that
separates lines from each other (eol for end of line); the default "\n"
means newline;
− dec="." (the default): the decimal separator;
− row.names=TRUE (the default) or row.names=FALSE: whether you want
row names or not;
− col.names=TRUE (the default) or col.names=FALSE: whether you want
column names or not.

Given these default settings and under the assumption that your operat-
ing system uses an English locale, you would save data frames as follows:

> write.table(a1, file.choose(), quote=FALSE, sep="\t",
   col.names=NA)¶

5.3. Editing data frames

In this section, we will discuss how you can access parts of data frames and
then how you can edit and change data frames.
Further below, we will discuss many examples in which you have to ac-
cess individual columns or variables of data frames. You can do this in
several ways. The first of these you may have already guessed from look-
ing at how a data frame is shown in R. If you load a data frame with col-
umn names and use str to look at the structure of the data frame, then you
see that the column names are preceded by a “$”. You can use this syntax
to access columns of data frames, as in this example using the file
<_inputfiles/02-5-3_dataframe.csv>.

> rm(list=ls(all=TRUE))¶
> a<-read.delim(file.choose())¶
> a¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
3 n 1411 735 open
4 conj 458 18 closed
5 prep 455 37 closed
> a$TOKENFREQ¶
[1] 421 337 1411 458 455
> a$CLASS¶
[1] open open open closed closed
Levels: closed open

You can now use these just like any other vector or factor. For example,
the following line computes token/type ratios of the parts of speech:

> ratio<-a$TOKENFREQ/a$TYPEFREQ; ratio¶
[1] 1.553506 3.271845 1.919728 25.444444 12.297297

You can also use indices in square brackets for subsetting. Vectors and
factors as discussed above are one-dimensional structures, but R allows you
to specify arbitrarily complex data structures. With two-dimensional data
structures, you can also use square brackets, but now you must of course
provide values for both dimensions to identify one or several data points –
just like in a two-dimensional coordinate system. This is very simple and
the only thing you need to memorize is the order of the values – rows, then
columns – and that the two values are separated by a comma. Here are
some examples:

> a[2,3]¶
[1] 103
> a[2,]¶
POS TOKENFREQ TYPEFREQ CLASS
2 adv 337 103 open
> a[,3]¶
[1] 271 103 735 18 37
> a[2:3,4]¶
[1] open open
Levels: closed open
> a[2:3,3:4]¶
TYPEFREQ CLASS
2 103 open
3 735 open

Note that row and column names are not counted. Also note that all
functions applied to vectors above can be used with what you extract out of
a column of a data frame:

> which(a[,2]>450)¶
[1] 3 4 5
> a[,3][which(a[,3]>100)]¶
[1] 271 103 735
> a[,3][a[,3]>100]¶
[1] 271 103 735

The most practical way to access individual columns, however, involves
the function attach (which gets undone with detach). I will not get into the
ideological debate about whether one should use attach or rather with,
etc. – if you are interested in that, go to the R-Help list or read ?with.
Attaching produces no output, but you can now access any column with its name:

> attach(a)¶
> CLASS¶
[1] open open open closed closed
Levels: closed open

Note two things. First, if you attach a data frame that has one or more
names that have already been defined as data structures or as columns of
previously attached data frames, you will receive a warning; in such cases,
make sure you are really dealing with the data structures or columns you
want and consider using detach to un-attach the earlier data frame. Second,
when you use attach you are strictly speaking using ‘copies’ of these vari-
ables. You can change those, but these changes do not affect the data frame
they come from.

> CLASS[4]<-NA; CLASS¶
[1] open open open <NA> closed
Levels: closed open
> a¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
3 n 1411 735 open
4 conj 458 18 closed
5 prep 455 37 closed

Let’s change CLASS back to its original state:

> CLASS[4]<-"closed"¶

If you want to change the data frame a, then you must make your
changes in a directly, e.g. with a$CLASS[4]<-NA¶ or a$TOKENFREQ[2]<-
338¶. Given what you have seen in Section 2.4.3, however, this is only
easy with vectors or with factors where you do not add a new level – if you
want to add a new factor level, you must define that level first.
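For example, a minimal sketch of how this could look – the level name
"other" is just made up for illustration:

> levels(a$CLASS)<-c(levels(a$CLASS), "other") # define the new level first¶
> a$CLASS[4]<-"other" # now this assignment works¶
> a$CLASS[4]<-"closed" # restore the original value¶
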
Sometimes you will need to investigate only a part of a data frame –
maybe a set of rows, or a set of columns, or a matrix within a data frame.
Also, a data frame may be so huge that you only want to keep one part of it
in memory. As usual, there are several ways to achieve that. One uses indi-
ces in square brackets with logical conditions or which. Either you have
already used attach and can use the column names directly or not:

> b<-a[CLASS=="open",]; b¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
3 n 1411 735 open

> b<-a[a[,4]=="open",]; b¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
3 n 1411 735 open

(Of course you can also write b<-a[a$CLASS=="open",]¶.) That is, you
determine all elements of the column called CLASS / the fourth column that
are open, and then you use that information to access the desired rows and
all columns (hence the comma before the closing square bracket). There is
a more elegant way to do this, though, the function subset. This function
takes two arguments: the data structure of which you want a subset and the
logical condition(s) describing which subset you want. Thus, the following
line creates the same structure b as above:

> b<-subset(a, CLASS=="open")¶

The formulation “condition(s)” already indicates that you can of course
use several conditions at the same time.

> b<-subset(a, CLASS=="open" & TOKENFREQ<1000); b¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open
> b<-subset(a, POS %in% c("adj", "adv")); b¶
POS TOKENFREQ TYPEFREQ CLASS
1 adj 421 271 open
2 adv 337 103 open

As I mentioned above, you will usually edit data frames in a spreadsheet
software or, because the spreadsheet software does not allow for as many
rows as you need, in a text editor. For the sake of completeness, let me
mention that R of course also allows you to edit data frames in a spread-
sheet-like format. The function fix takes as argument a data frame and
opens a spreadsheet editor in which you can edit the data frame; you can
even introduce new factor levels without having to define them first. When
you close the editor, R will do that for you.
Finally, let us look at ways in which you can sort data frames. Recall
that the function order creates a vector of positions and that vectors can be
used for sorting. Imagine you wanted to sort the data frame a according
to the column CLASS (in alphabetically ascending order), and within CLASS
according to TOKENFREQ (in descending order). How can you do that?

THINK
BREAK

The problem is that the two sorting styles are different: one is decreasing=
FALSE, the other is decreasing=TRUE. What you can do is apply order not
to TOKENFREQ, but to the negative values of TOKENFREQ.

> order.index<-order(CLASS, -TOKENFREQ); order.index¶
[1] 4 5 3 1 2

After that, you can use the vector order.index to sort the data frame:

> a[order.index,]¶
POS TOKENFREQ TYPEFREQ CLASS
4 conj 458 18 closed
5 prep 455 37 closed
3 n 1411 735 open
1 adj 421 271 open
2 adv 337 103 open

Of course you can do that in just one line:12

> a[order(CLASS, -TOKENFREQ),]¶

You can now also use the function sample to sort the rows of a data
frame randomly (for example, to randomize tables with experimental items;
cf. above). You first determine the number of rows to be randomized (e.g.,
with nrow or dim) and then use sample to put these row numbers into a
random order. Your result will probably differ from the one shown here
because the sampling is random.

12. Note that R is superior to many other programs here because the number of sorting
parameters is in principle unlimited.

> no.rows<-nrow(a)¶
> order.index<-sample(no.rows); order.index¶
[1] 3 4 1 2 5
> a[order.index,]¶
POS TOKENFREQ TYPEFREQ CLASS
3 n 1411 735 open
4 conj 458 18 closed
1 adj 421 271 open
2 adv 337 103 open
5 prep 455 37 closed
> a[sample(nrow(a)),] # in just one line¶

But what do you do when you need to sort a data frame according to
several factors – some in ascending and some in descending order? You
can of course not use negative values of factor levels – what would -open
be? Thus, you first use the function rank, which rank-orders factor levels,
and then you can use negative values of these ranks:

> order.index<-order(-rank(CLASS), -rank(POS))¶
> a[order.index,]¶
POS TOKENFREQ TYPEFREQ CLASS
3 n 1411 735 open
2 adv 337 103 open
1 adj 421 271 open
5 prep 455 37 closed
4 conj 458 18 closed

Recommendation(s) for further study


− the function is.data.frame to test if a data structure is a data frame
− the function dim for the number of rows and columns of a data frame
− the functions read.csv and read.csv2 to read in comma-separated and
semicolon-separated files respectively
− the function save to save data structures in a compressed binary format
− the function with to access columns of a data frame without attach
− the functions cbind and rbind to combine vectors and factors in a
columnwise or rowwise way
− the function merge to combine different data frames
− the function complete.cases to test which rows of a data frame contain
missing data / NA


6. Some programming: conditionals and loops

So far, we have focused on simple and existing functions but we have done
little to explore the programming-language character of R. This section will
introduce a few very powerful notions that allow you to make R decide
which of two or more user-specified things to do and/or do something over
and over again. In Section 2.6.1, we will explore the former, Section 2.6.2
then discusses the latter, but the treatment here can only be very brief and I
advise you to explore some of the reading suggestions for more details.

6.1. Conditional expressions

Later, you will often face situations where you want to pursue one of sev-
eral possible options in a statistical analysis. In a plot, for example, the data
points for male subjects should be plotted in blue and the data points for
female subjects should be plotted in pink. Or, you actually only want R to
generate a plot when the result is significant but not when it is not. In gen-
eral, you can of course always do these things stepwise yourself: you could
decide for each analysis yourself whether it is significant and then generate
a plot when it is. However, a more elegant way is to write R code that
makes decisions for you, that you can apply to any data set, and that, there-
fore, allows you to recycle code from one analysis to the next. Conditional
expressions are one way – others are available and sometimes more elegant
– to make R decide things. This is what the syntax can look like in a nota-
tion often referred to as pseudo code (so, no need to enter this into R!):

if (some logical expression testing a condition) {
   what to do if this logical expression evaluates to TRUE
   (this can be more than one line)
} else if (some other logical expression) {
   what to do if this other logical expression evaluates to TRUE
   (this can be more than one line)
} else {
   what to do if all logical expressions above evaluate to FALSE
}

That’s it, and the part after the first } is even optional. Here’s an exam-
ple with real code (recall, "\n" means ‘a new line’):

> pvalue<-0.06¶
> if (pvalue>=0.05) {¶
+ cat("Not significant, p =", pvalue, "\n")¶
+ } else {¶
+ cat("Significant, p =", pvalue, "\n")¶
+ }¶
Not significant, p = 0.06

The first line defines a p-value, which you will later get from a statisti-
cal test. The next line tests whether that p-value is greater than or equal to
0.05. It is, which is why the code after the first opening { is executed and
why R then never gets to see the part after else.
If you now set pvalue to 0.04 and run the if expression again, then this
happens: Line 2 from above tests whether 0.04 is greater than or equal to
0.05. It is not, which is why the block of code between { and } before else
is skipped and why the second block of code is executed. Try it.
A short version of this can be extremely useful when you have many
tests to make but only one instruction each for when a test returns TRUE or
FALSE. It uses the function ifelse, here represented schematically again:

ifelse(logical expression, what when TRUE, what when FALSE)

And here’s an application:

> pvalues<-c(0.02, 0.00096, 0.092, 0.4)¶
> decisions<-ifelse(pvalues<0.05, "*", "ns")¶
> decisions¶
[1] "*" "*" "ns" "ns"

As you can see, ifelse tested all four values of pvalues against the
threshold value of 0.05 and put the corresponding values into
the new vector decisions. We will use this a lot to customize graphs.

6.2. Loops

Loops are useful to have R execute one or (many) more functions multiple
times. Like many other programming languages, R has different types of
loops, but I will only discuss for-loops here. This is the general syntax in
pseudo code:

for (some.name in a.sequence) {
   what to do as often as a.sequence has elements
   (this can be more than one line)
}


Let’s go over this step by step. The data structure some.name stands for
any name you might wish to assign to a data structure that is processed in
the loop, and a.sequence stands for anything that can be interpreted as a
sequence of values, most typically a vector of length 1 or more. This
sounds more cryptic than it actually is, here’s a very easy example:

> for (counter in 1:3) {¶
+ cat("This is iteration number", counter, "\n")¶
+ }¶
This is iteration number 1
This is iteration number 2
This is iteration number 3

When R enters the for-loop, it assigns to counter the first value of the
sequence 1:3, i.e. 1. Then, in the only line in the loop, R prints some sen-
tence and ends it with the current value of counter, 1, and a line break.
Then R reaches the } and, because counter has not yet iterated over all
values of a.sequence, re-iterates, which means it goes back to the begin-
ning of the loop, this time assigning to counter the next value of
a.sequence, i.e., 2, and so on. Once R has printed the third line, it exits the
loop because counter has now iterated over all elements of a.sequence.
Here is a more advanced example, but one that is typical of what we’re
going to use loops for later. Can you see what it does just from the code?

> some.numbers<-1:100¶
> collector<-vector(length=10)¶
> for (i in 1:10) {¶
+ collector[i]<-mean(sample(some.numbers, 50))¶
+ }¶
> collector¶
[1] 50.78 51.14 45.04 48.04 55.30 45.90 53.02 48.40 50.38
49.88

THINK
BREAK

The first line generates a vector some.numbers with the values from 1
to 100. The second line generates a vector called collector which has 10
elements and which will be used to collect results from the looping. Line 3
begins a loop of 10 iterations, using a vector called i as the counter. Line 4
is the crucial one now: In it, R samples 50 numbers randomly without re-
placement from the vector some.numbers, computes the mean of these 50

numbers, and then stores that mean in the i-th slot of collector. On the first
iteration, i is of course 1 so the first mean is stored in the first slot of col-
lector. Then R iterates, i becomes 2, R generates a second random sam-
ple, computes its mean, and stores it in the – now – 2nd slot of collector,
and so on, until R has done the sampling, averaging, and storing process 10
times and exits the loop. Then, the vector collector is printed on the
screen.
In Chapter 4, we will use an approach like this to help us explore data
that violate some of the assumptions of common statistical tests. However,
it is already worth mentioning that loops are often not the best way to do
things like the above in R: in contrast to some other programming lan-
guages, R is designed such that it is often much faster and more memory-
efficient to do things not with loops but with members of the apply family
of functions, which you will get to know a bit later. Still, being able to
quickly write a loop and test something is often a very useful skill.
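For instance, the loop above could be rewritten in one line with sapply
(just a preview sketch; the apply family is introduced later):

> collector<-sapply(1:10, function(i) mean(sample(some.numbers, 50)))¶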

Recommendation(s) for further study


− the functions next and break to control behavior of/in loops

7. Writing your own little functions

The fact that R is not just statistics software but a full-fledged program-
ming language is something that can hardly be overstated. It means
that nearly anything is possible: the limit of what you can do with R is not
defined by what the designers of some other software thought you may
want to do – the limit is set pretty much only by your skills and maybe your
RAM/processor (which is one reason why I recommend using R for cor-
pus-linguistic analyses, see Gries 2009a). One aspect making this particu-
larly obvious is how you can very easily write your own functions to facili-
tate and/or automate tedious and/or frequent tasks. In this section, I will
give a few very small examples of the logic of how to write your own func-
tions, mainly because we haven’t dealt with any statistical functions yet.
Don’t despair if you don’t understand these programming issues immedi-
ately – for most of this book, you will not need them, but these capabilities
can come in very handy when you begin to tackle more complex data. Al-
so, in Chapter 3 and 4 I will return to this topic so that you get more prac-
tice in this and end up with a list of useful functions for your own work.
The first example I want to use involves looking at a part of a data
structure. For example, let’s assume you loaded a really long vector (let’s

say, 10,000 elements long) and want to check whether you imported it into
R properly. Just printing that onto the screen is somewhat tedious since you
can’t possibly read all 10,000 items (let alone at the speed with which they
are displayed), nor do you usually need all 10,000 items – the first n are
usually enough to see whether your data import was successful. The same
holds for long data frames: you don’t need to see all 1600 rows to check
whether loading it was successful, maybe the first 5 or 6 are sufficient.
Let’s write a function peek that by default shows you the first 6 elements of
each of the data structures you know about: one-dimensional vectors or
factors and two-dimensional data frames.
One good way to approach the writing of functions is to first consider
how you would solve that problem just for a particular data structure, i.e.
outside of the function-writing context, and then make whatever code you
wrote general enough to cover not just the one data structure you just ad-
dressed, but many more. To that end, let’s first load a data frame for this
little example (from <_inputfiles/02-7_dataframe1.csv>):

> into.causatives<-read.delim(file.choose())¶
> str(into.causatives)¶
'data.frame': 1600 obs. of 5 variables:
$ BNC : Factor w/ 929 levels "A06","A08","A0C",..: 1 2 3 4 ...
$ TAG_ING : Factor w/ 10 levels "AJ0-NN1","AJ0-VVG",..: 10 7 10 ...
$ ING : Factor w/ 422 levels "abandoning","abdicating",..: 354 49 382 ...
$ VERB_LEMMA: Factor w/ 208 levels "activate","aggravate",..: 76 126 186 ...
$ ING_LEMMA : Factor w/ 417 levels "abandon","abdicate",..: 349 41 377 ...

Now, you want peek to work with one-dimensional vectors and factors
and with two-dimensional data frames. How would you get the first six elements of
each of these? That you already know. For vectors or factors you’d write:

vector.or.factor[1:6]

and for data frames you’d write:

data.frame[1:6,]

So, essentially you need to decide what the data structure is of which R
is supposed to display the first n elements (by default 6) and then you sub-
set with either [1:6] or [1:6,]. Since, ultimately, the idea is to have R –

not you – decide on the right way of subsetting (depending on the data
structure), you use a conditional expression:

> if (is.data.frame(into.causatives)) {¶
> into.causatives[1:6,]¶
> } else {¶
> into.causatives[1:6]¶
> }¶
BNC TAG_ING ING VERB_LEMMA ING_LEMMA
1 A06 VVG speaking force speak
2 A08 VBG being nudge be
3 A0C VVG taking talk tak
4 A0F VVG taking bully take
5 A0H VVG trying influence try
6 A0H VVG thinking delude think

To turn this into a function, you wrap a function definition (naming the
function peek) around this piece of code. However, if you use the above
code as is, then this function will use the name into.causatives in the
function definition, which is not exactly very general. As you have seen,
many R functions use x for the main obligatory variable. Following this
tradition, you could write this:

> peek<-function (x) {¶


> if (is.data.frame(x)) {¶
> x[1:6,]¶
> } else {¶
> x[1:6]¶
> }¶
> }¶
> peek(into.causatives)¶

This means, R defines a function called peek that requires an argument,
and that argument is function-internally called x. When you call peek with
some argument – e.g., into.causatives – then R will take the content of
that data structure and, for the duration of the function execution, assign it
to x. Then, within the function R will carry out all of peek with x and re-
turn/output the result, which is the first 6 rows of into.causatives.
It seems like we’re done. However, some things are missing. When you
write a function, it is crucial you make sure it covers all sorts of possibili-
ties or data you may throw at it. After all, you’re writing a function to make
your life easier, to allow you not to have to worry about stuff anymore after
you have thought about it once, namely when you wrote the function.
There are three ways in which the above code should be improved:
− what if the data structure you use peek with is not a vector or a factor or

a data frame?
− what if you want to be able to see not 6 but n elements?
− what if the data structure you use peek with has fewer than n elements
or rows?

To address the first possibility, we just add another conditional expres-
sion. So far we only test whether whatever we use peek with is a data
frame – now we also need to check whether, if it is not a data frame,
it then is a vector or a factor, and ideally we return some warning
if the data structure is none of the three.
To address the second possibility, we need to be able to tell the function
flexibly how many parts of x we want to see, and the way we tell this to a
function is of course by its arguments. Thus, we add an argument, let’s call
it n, that says how much we want to see of x, but we make 6 the default.
To address the final possibility, we have to make sure that R realizes
how many elements x has: if it has more than n, R should show n, but if it
has fewer than n, R should show as many as it can, i.e., all of them.
This version of peek addresses all of these issues:

> peek<-function (x, n=6) {¶
> if (is.data.frame(x)) {¶
> return(x[1:min(nrow(x), n),])¶
> } else if (is.vector(x) | is.factor(x)) {¶
> return(x[1:min(length(x), n)])¶
> } else {¶
> cat("Not defined for other data structures ...\n")¶
> }¶
> }¶

Issue number one is addressed by adding a second conditional with the
else if test – recall the use of | to mean ‘or’ – and outputting a message if
x is neither a vector, nor a factor, nor a data frame.
Issue number two is addressed by adding the argument n to the function
definition and using n in the body of the function. The argument n is set to
6 by default, so if the user does not specify n, 6 is used, but the user can
also override this with another number.
The final issue is addressed by tweaking the subsetting: instead of using
just 1:n, we use 1: followed by the minimum of n and the number of elements
x has. Thus, if x has more than n elements, then n will be the minimum and
we get to see n elements, and if x has fewer than n elements, then that number
of elements will be the minimum and we get to see them all.
Finally, also note that I am now using the function return to specify

exactly what peek should return and output to the user when it's done. Try
the following lines (output not shown here; see the comments in the
code file) to see that it works:

> peek(into.causatives)¶
> peek(into.causatives, 3)¶
> peek(into.causatives, 9)¶
> peek(21:50, 10)¶
> peek(into.causatives$BNC, 12)¶
> peek(as.matrix(into.causatives))¶

While all this may not seem easy and worth the effort, we will later see
that being able to write your own functions will facilitate quite a few statis-
tical analyses below. Let me also note that this was a tongue-in-cheek ex-
ample: there is actually already a function in R that does what peek does
(and more, because it can handle more data structures) – look up head and
also tail ;-).
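For instance (a quick sketch; output not shown):

> head(into.causatives, 3) # the same rows as peek(into.causatives, 3)¶
> tail(into.causatives$BNC) # the last 6 elements of a factor¶
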
Now you should do the exercise(s) for Chapter 2 …

Recommendation(s) for further study


− the functions NA, is.na, NaN, is.nan, na.action, na.omit, and
na.fail on how to handle missing data
− Ligges (2005), Crawley (2007), Braun and Murdoch (2008), Spector
(2008), Gentleman (2009), and Gries (2009a) for more information on
R: Ligges (2005), Braun and Murdoch (2008), and Gentleman (2009) on
R as a (statistical) programming language, Crawley (2007) as a very compre-
hensive overview, Spector (2008) on data manipulation in R, and Gries
(2009a) on corpus-linguistic methods with R

Chapter 3
Descriptive statistics

Any 21st century linguist will be required to read about and understand
mathematical models as well as understand statistical methods of analysis.
Whether you are interested in Shakespearean meter, the sociolinguistic
perception of identity, Hindi verb agreement violations, or the perception
of vowel duration, the use of math as a tool of analysis is already here and
its prevalence will only grow over the next few decades. If you're not pre-
pared to read articles involving the term Bayesian, or (p<.01), k-means
clustering, confidence interval, latent semantic analysis, bimodal and uni-
modal distributions, N-grams, etc, then you will be
but a shy guest at the feast of linguistics.
(<http://thelousylinguist.blogspot.com/2010/01/
why-linguists-should-study-math.html>)

In this chapter, I will explain how you obtain descriptive results. In section
3.1, I will discuss univariate statistics, i.e. statistics that summarize the
distribution of one variable, of one vector, of one factor. Section 3.2 then is
concerned with bivariate statistics, statistics that characterize the relation of
two variables, two vectors, two factors to each other. Both sections also
introduce ways of representing the data graphically; many additional
graphs will be illustrated in Chapters 4 and 5.

1. Univariate statistics

1.1. Frequency data

Probably the simplest way to describe the distribution of data points is
with frequency tables, i.e. lists that state how often each individual outcome was
observed. In R, generating a frequency table is extremely easy. Let us look
at a psycholinguistic example. Imagine you extracted all occurrences of the
disfluencies uh, uhm, and ‘silence’ and noted for each disfluency whether it
was produced by a male or a female speaker, whether it was produced in a
monolog or in a dialog, and how long in milliseconds the disfluency lasted.
First, we load these data from the file <_inputfiles/03-1_uh(m).csv>.


> UHM<-read.delim(file.choose())¶
> str(UHM)¶
'data.frame': 1000 obs. of 5 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ SEX : Factor w/ 2 levels "female","male": 2 1 1 1 2 ...
$ FILLER: Factor w/ 3 levels "silence","uh",..: 3 1 1 3 ...
$ GENRE : Factor w/ 2 levels "dialog","monolog": 2 2 1 1 ...
$ LENGTH: int 1014 1188 889 265 465 1278 671 1079 643 ...
> attach(UHM)¶

To see which disfluency or filler occurs how often, you use the function
table, which creates a frequency list of the elements of a vector or factor:

> table(FILLER)¶
FILLER
silence uh uhm
332 394 274

If you also want to know the percentages of each disfluency, then you
can either do this rather manually or you use the function prop.table,
whose argument is a table generated with table and which returns the per-
centages of the frequencies in that table (cf. also below).

> table(FILLER)/length(FILLER)¶
FILLER
silence uh uhm
0.332 0.394 0.274
> prop.table(table(FILLER))¶
FILLER
silence uh uhm
0.332 0.394 0.274

Often, it is also useful to generate a cumulative frequency table of the
observed values or of the percentages. R has a function cumsum, which
successively adds the values of a vector and returns all sums, which is ex-
emplified in the following two lines:

> 1:5¶
[1] 1 2 3 4 5
> cumsum(1:5)¶
[1] 1 3 6 10 15

And of course you can apply cumsum to our tables:

> cumsum(table(FILLER))¶
silence uh uhm
332 726 1000

> cumsum(prop.table(table(FILLER)))¶
silence uh uhm
0.332 0.726 1.000

Usually, it is instructive to represent the observed distribution graphical-
ly and the sections below introduce a few graphical formats. For reasons of
space, I only discuss some ways to tweak graphs, but you can turn to the
help pages of these functions (using ?…) and Murrell (2011) for more info.

1.1.1. Scatterplots and line plots

Before we begin to summarize vectors and factors graphically in groups of
elements, we discuss how the data points of a vector are plotted individual-
ly. The simplest approach just requires the function plot. This is a very
versatile function, which, depending on the arguments you use with it, cre-
ates many different graphs. (This may be a little confusing at first, but al-
lows for an economical style of working, as you will see later.) If you pro-
vide just one numerical vector as an argument, then R plots a scatterplot,
i.e., a two-dimensional coordinate system in which the values of the vector
are interpreted as coordinates of the y-axis, and the order in which they
appear in the vector are the coordinates of the x-axis. Here’s an example:

> a<-c(1, 3, 5, 2, 4); b<-1:5¶
> plot(a) # left panel of Figure 15¶

Figure 15. Simple scatterplots


But if you give two vectors as arguments, then the values of the first and
the second are interpreted as coordinates on the x-axis and the y-axis re-
spectively (and the names of the vectors will be used as axis labels):

> plot(a, b) # right panel of Figure 15¶

With the argument type=…, you can specify the kind of graph you want.
The default, which was used because you did not specify anything else, is
type="p" (for points). If you use type="b" (for both), you get points and
lines connecting the points; if you use type="l" (for lines), you get a line
plot; cf. Figure 16. (With type="n", nothing gets plotted into the main
plotting area, but the coordinate system is set up.)

> plot(b, a, type="b") # left panel of Figure 16¶
> plot(b, a, type="l") # right panel of Figure 16¶

Figure 16. Simple line plots

Other simple but useful ways to tweak graphs involve defining labels
for the axes (xlab="…" and ylab="…"), a bold heading for the whole graph
(main="…"), the ranges of values of the axes (xlim=… and ylim=…), and the
addition of a grid (grid()¶). With col="…", you can also set the color of
the plotted element, as you will see more often below.

> plot(b, a, xlab="A vector b", ylab="A vector a", xlim=c(0,
   8), ylim=c(0, 8), type="b"); grid() # Figure 17¶


Figure 17. A scatterplot exemplifying a few simple plot settings

An important rule of thumb is that the ranges of the axes must be chosen
such that the distribution of the data is represented most meaningfully. It is
often useful to include the point (0, 0) within the ranges of the axes and to
make sure that graphs to be compared have the same and sufficient axis
ranges. For example, if you want to compare the ranges of values of two
vectors x and y in two graphs, then you usually may not want to let R de-
cide on the ranges of axes. Consider the upper panel of Figure 18.
The clouds of points look very similar and you only notice the distribu-
tional difference between x and y when you specifically look at the range
of values on the y-axis. The values in the upper left panel range from 0 to 2
but those in the upper right panel range from 0 to 6. This difference be-
tween the two vectors is immediately obvious, however, when you use
ylim=… to manually set the ranges of the y-axes to the same range of val-
ues, as I did for the lower panel of Figure 18.
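A minimal sketch of how such a comparison could be set up (the two vectors
here are just made-up stand-ins for the x and y of Figure 18):

> x<-runif(50, 0, 2); y<-runif(50, 0, 6) # hypothetical data¶
> par(mfrow=c(1, 2)) # two plotting panels side by side¶
> plot(x, ylim=c(0, 6)); grid()¶
> plot(y, ylim=c(0, 6)); grid()¶
> par(mfrow=c(1, 1)) # reset to a single panel¶
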
Note: whenever you use plot, by default a new graph is created and the
old graph is lost (In RStudio, you can go back to previous plots, however,
with the arrow button or the menu Plots: …) If you want to plot two lines
into a graph, you first generate the first with plot (and type="l" or
type="b") and then add the second one with points (or lines; sometimes
you can also use the argument add=TRUE). That also means that you must
define the ranges of the axes in the first plot in such a way that the values
of the second graph can also be plotted into it.


Figure 18. Scatterplots and the importance of properly-defined ranges of axes

An example will clarify that point. If you want to plot the points of the
vectors m and n, and then want to add into the same plot the points of the
vectors x and y, then this does not work, as you can see in the left panel of
Figure 19.

> m<-1:5; n<-5:1¶
> x<-6:10; y<-6:10¶
> plot(m, n, type="b"); points(x, y, type="b"); grid()¶

The left panel of Figure 19 shows the points defined by m and n, but not
those of x and y because the ranges of the axes that R used to plot m and n
are too small for x and y, which is why you must define those manually
while creating the first coordinate system. One way to do this is to use the

function max, which returns the maximum value of a vector (and min re-
turns the minimum). The right panel of Figure 19 shows that this does the
trick. (In this line, the minimum is set to 0 manually – of course, you could
also use min(m, x) and min(n, y) for that, but I wanted to include (0, 0)
in the graph.)

Figure 19. Scatterplots and the importance of properly-defined ranges of axes

> plot(m, n, type="b", xlim=c(0, max(m, x)), ylim=
   c(0, max(n, y)), xlab="Vectors m and x",
   ylab="Vectors n and y"); grid()¶
> points(x, y, type="b")¶

Recommendation(s) for further study


the functions pmin and pmax to determine the minima and maxima at each
position of different vectors (try pmin(c(1, 5, 3), c(2, 4, 6))¶)

1.1.2. Pie charts

The function to generate a pie chart is pie. Its most important argument is a
table generated with table. You can either just leave it at that or, for ex-
ample, change category names with labels=… or use different colors with
col=… etc.:

> pie(table(FILLER), col=c("grey20", "grey50", "grey80"))¶


Figure 20. A pie chart with the frequencies of disfluencies

One thing that’s a bit annoying about this is that, to use different colors
with col=… as above, you have to know how many colors there are and
assign names to them, which becomes cumbersome with many different
colors and/or graphs. For situations like these, the function rainbow can be
very useful. In its simplest use, it requires only one argument, namely the
number of different colors you want. Thus, how would you re-write the
above line for the pie chart in such a way that you let R find out how many
colors are needed rather than saying col=rainbow(3)?

THINK
BREAK

Let R use as many colors as the table you are plotting has elements:

> pie(table(FILLER), col=rainbow(length(table(FILLER))))¶

Note that pie charts are usually not a good way to summarize data be-
cause humans are not very good at inferring quantities from angles. Thus,
pie is not a function you should use too often – the function rainbow, on
the other hand, is one you should definitely bear in mind.

1.1.3. Bar plots

To create a bar plot, you can use the function barplot. Again, its most
important argument is a table generated with table and again you can cre-
ate either a standard version or more customized ones. If you want to de-
fine your own category names, you unfortunately must use names.arg=…,
not labels=… (cf. Figure 21 below).


> barplot(table(FILLER)) # left panel of Figure 21¶
> barplot(table(FILLER), col=c("grey20", "grey40",
   "grey60")) # right panel of Figure 21¶

Figure 21. Bar plots with the frequencies of disfluencies

An interesting way to configure bar plots is to use space=0 to have the
bars be immediately next to each other. That is of course not exactly mind-
blowing in itself, but it is one of two ways to make it easier to add further
data/annotation to the plot. For example, you can then easily plot the ob-
served frequencies into the middle of each bar using the function text. The
first argument of text is a vector with the x-axis coordinates of the text to
be printed (with space=0, 0.5 for the middle of the first bar, 1.5 for the
middle of the second bar, and 2.5 for the middle of the third bar), the sec-
ond argument is a vector with the y-axis coordinates of that text (half of
each observed frequency so that the text ends up in the middle of the bars),
and labels=… provides the text to be printed; cf. the left panel of Figure 22.

> barplot(table(FILLER), col=c("grey40", "grey60", "grey80"),
   names.arg=c("Silence", "Uh", "Uhm"), space=0)¶
> text(c(0.5, 1.5, 2.5), table(FILLER)/2, labels=
table(FILLER))¶

The second way to create a similar graph – cf. the right panel of Figure
22 – involves some useful changes:

> mids<-barplot(table(FILLER), col=c("grey40", "grey60",
   "grey80"))¶
> text(mids, table(FILLER), labels=table(FILLER), pos=1)¶


Figure 22. Bar plots with the frequencies of disfluencies

The first line now does not just plot the barplot, it also assigns what R
returns to a data structure called mids, which contains the x-coordinates of
the middles of the bars, which we can then use for texting. (Look at mids.)
Second, the second line now uses mids for the x-coordinates of the text to
be printed and it uses pos=1 to make R print the text a bit below the speci-
fied coordinates; pos=2, pos=3, and pos=4 would print the text a bit to the
left, above, and to the right of the specified coordinates respectively.
The functions plot and text allow for another powerful graph: first,
you generate a plot that contains nothing but the axes and their labels (with
type="n", cf. above), and then with text you plot words or numbers. Try
this for an illustration of a kind of plot you will more often see below:

> tab<-table(FILLER)¶
> plot(tab, type="n", xlab="Disfluencies", ylab="Observed
frequencies", xlim=c(0, 4), ylim=c(0, 500)); grid()¶
> text(seq(tab), tab, labels=tab)¶

Recommendation(s) for further study


the function dotchart for dot plot and the parameter settings cex, srt,
col, pch, and font to tweak plots: ?par¶.

1.1.4. Pareto-charts

A related way to represent the frequencies of the disfluencies is a pareto-
chart. In pareto-charts, the frequencies of the observed categories are repre-
sented as in a bar plot, but they are first sorted in descending order of fre-
quency and then overlaid by a line plot of cumulative percentages that indi-
cates what percent of all data one category and all other categories to the
left of that category account for. The function pareto.chart comes with
the library qcc that you must (install and/or) load first; cf. Figure 23.

> library(qcc)¶
> pareto.chart(table(FILLER), main="")¶
Pareto chart analysis for table(FILLER)
Frequency Cum.Freq. Percentage Cum.Percent.
uh 394.0 394.0 39.4 39.4
silence 332.0 726.0 33.2 72.6
uhm 274.0 1000.0 27.4 100.0

Figure 23. Pareto-chart with the frequencies of disfluencies

1.1.5. Histograms

While bar plots are probably the most frequent forms of representing the
frequencies of nominal/categorical variables, histograms are most wide-
spread for the frequencies of interval/ratio variables. In R, you can use
hist, which just requires the relevant vector as its argument.

> hist(LENGTH)¶

For some ways to make the graph nicer, cf. Figure 24, whose left panel
contains a histogram of the variable LENGTH with axis labels and grey bars.


> hist(LENGTH, main="", xlab="Length in ms", ylab=
   "Frequency", xlim=c(0, 2000), ylim=c(0, 100),
   col="grey80")¶

The right panel of Figure 24 contains a histogram of the probability
densities (generated by freq=FALSE) with a curve (generated by lines).

> hist(LENGTH, main="", xlab="Length in ms", ylab="Density",
   freq=FALSE, xlim=c(0, 2000), col="grey50")¶
> lines(density(LENGTH))¶

Figure 24. Histograms for the frequencies of lengths of disfluencies

With the argument breaks=… to hist, you can instruct R to try to use a
particular number of bins (or bars). You either provide one integer – then R
tries to create a histogram with that many bins – or you provide a vector with
the boundaries of the bins. This raises the question of how many bins
should or may be chosen. In general, you should not have more than 20
bins, and as one rule of thumb for the number of bins to choose you can use
the formula in (14) (cf. Keen 2010:143–160 for discussion). The most im-
portant aspect is that the bins you choose do not misrepresent the data.

(14) Number of bins for a histogram of n data points = 1+3.32·log10 n
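For instance, for the 1,000 lengths in LENGTH, the formula in (14) suggests
about 11 bins, which you could then request via breaks=… (a quick sketch;
the resulting histogram is not shown):

> 1+3.32*log10(length(LENGTH))¶
[1] 10.96
> hist(LENGTH, breaks=11)¶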


1.1.6. Empirical cumulative distributions

A very useful visualization of numerical data is the empirical cumulative
distribution (function, abbreviated ecdf) plot, an example of which you
have already seen as part of the pareto chart in Section 3.1.1.4. On the x-
axis of an ecdf plot, you find the range of the variable that is visualized, on
the y-axis you find a percentage scale from 0 to 1 (=100%), and the points
in the coordinate system show how much in percent of all data one variable
value and all other smaller values to the left of that value account for. Fig-
ure 25 shows such a plot for LENGTH and you can see that approximately
18% of all lengths are smaller than 500 ms.

Figure 25. Ecdf plot of lengths of disfluencies
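
One way such a plot can be generated (a sketch; not necessarily the code
used for Figure 25 itself) uses the function ecdf:

> plot(ecdf(LENGTH), main="", xlab="Length in ms",
   ylab="Cumulative percentage"); grid()¶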

This plot is very useful because it does not lose information by binning
data points: every data point is represented in the plot, which is why ecdf
plots can be very revealing even for data that most other graphs cannot
illustrate well. Let’s see whether you’ve understood this plot: what do ecdf
plots of normally-distributed and uniformly-distributed data look like?

THINK
BREAK


You will find the answer in the code file (with graphs); make sure you
understand why so you can use this very useful type of graph.
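If you want to check your answer directly in R, here is a quick sketch with
simulated data (rnorm generates normally-distributed values, runif
uniformly-distributed ones):

> plot(ecdf(rnorm(1000))) # an S-shaped curve¶
> plot(ecdf(runif(1000))) # a roughly straight diagonal line¶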

Recommendation(s) for further study


− the functions dotchart and stripchart (with method="jitter") to
represent the distribution of individual data points in very efficient ways
− the function scatterplot (from the library car) for more sophisticated
scatterplots
− the functions plot3d and scatterplot3d (from the library rgl and the
library scatterplot3d) for different three-dimensional scatterplots

1.2. Measures of central tendency

Measures of central tendency are probably the most frequently used statis-
tics. They provide a value that attempts to summarize the behavior of a
variable. Put differently, they answer the question, if I wanted to summa-
rize this variable and were allowed to use only one number to do that,
which number would that be? Crucially, the choice of a particular measure
of central tendency depends on the variable’s level of measurement. For
nominal/categorical variables, you should use the mode (if you do not
simply list frequencies of all values/bins anyway, which is often better), for
ordinal variables you should use the median, for interval/ratio variables you
can often use the arithmetic mean.

1.2.1. The mode

The mode of a variable or distribution is the value that is most often ob-
served. As far as I know, there is no function for the mode in R, but you
can find it very easily. For example, the mode of FILLER is uh:

> which.max(table(FILLER))¶
uh
2
> max(table(FILLER))¶
[1] 394

Careful when there is more than one level that exhibits the maximum
number of observations – tabulating is usually safer.
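For instance, a small sketch that returns all levels tied for the highest
frequency (here just one):

> names(table(FILLER))[table(FILLER)==max(table(FILLER))]¶
[1] "uh"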


1.2.2. The median

The measure of central tendency for ordinal data is the median, the value
you obtain when you sort all values of a distribution according to their size
and then pick the middle one (e.g., the median of the numbers from 1 to 5
is 3). If you have an even number of values, the median is the average of
the two middle values.

> median(LENGTH)¶
[1] 897

1.2.3. The arithmetic mean

The best-known measure of central tendency is the arithmetic mean for
interval/ratio variables. You compute it by adding up all values of a distri-
bution or a vector and dividing that sum by the number of values, but of
course there is also a function for this:

> sum(LENGTH)/length(LENGTH)¶
[1] 915.043
> mean(LENGTH)¶
[1] 915.043

One weakness of the arithmetic mean is its sensitivity to outliers:

> a<-1:10; a¶
[1] 1 2 3 4 5 6 7 8 9 10
> b<-c(1:9, 1000); b¶
[1] 1 2 3 4 5 6 7 8 9 1000
> mean(a)¶
[1] 5.5
> mean(b)¶
[1] 104.5

Although the vectors a and b differ with regard to only a single value,
the mean of b is much larger than that of a because of that one outlier, in
fact so much larger that b’s mean of 104.5 neither summarizes the values
from 1 to 9 nor the value 1000 very well. There are two ways of handling
such problems. First, you can add the argument trim=…, the percentage of
elements from the top and the bottom of the distribution that are discarded
before the mean is computed. The following lines compute the means of a
and b after the highest and the lowest value have been discarded:


> mean(a, trim=0.1)¶


[1] 5.5
> mean(b, trim=0.1)¶
[1] 5.5

Second, you can just use the median, which is also a good idea if the da-
ta whose central tendency you want to report are not normally distributed.

> median(a); median(b)¶


[1] 5.5
[1] 5.5

Warning/advice
Just because R or your spreadsheet software can return many decimals does
not mean you have to report them all. Use a number of decimals that makes
sense given the statistic that you report.

1.2.4. The geometric mean

The geometric mean is used to compute averages of factors or ratios


(whereas the arithmetic mean is computed to get the average of sums).
Let’s assume you have six recordings of a child at the ages 2;1 (two years
and one month), 2;2, 2;3, 2;4, 2;5, and 2;6. Let us also assume you had a
vector lexicon that contains the cumulative numbers of different words
(types!) that the child produced at each age:

> lexicon<-c(132, 158, 169, 188, 221, 240)¶


> names(lexicon)<-c("2;1", "2;2", "2;3", "2;4", "2;5",
"2;6")¶

You now want to know the average rate at which the lexicon increased.
First, you compute the successive increases:

> increases<-lexicon[2:6]/lexicon[1:5]; increases¶


2;2 2;3 2;4 2;5 2;6
1.196970 1.069620 1.112426 1.175532 1.085973

That is, by age 2;2, the child produced 19.697% more types than by age
2;1, by age 2;3, the child produced 6.962% more types than by age 2;2, etc.
Now, you must not think that the average rate of increase of the lexicon is
the arithmetic mean of these increases:


> mean(increases) # wrong!¶


[1] 1.128104

You can easily test that this is not the correct result. If this number were
the true average rate of increase, then the product of 132 (the first lexicon
size) and this rate of 1.128104 to the power of 5 (the number of times the
supposed ‘average rate’ applies) should be the final value of 240. This is
not the case:

> 132*mean(increases)^5¶
[1] 241.1681

Instead, you must compute the geometric mean. The geometric mean of
a vector x with n elements is computed according to formula (15), and if
you use this as the average rate of increase, you get the right result:

(15) $mean_{geom} = (x_1 \cdot x_2 \cdot \ldots \cdot x_{n-1} \cdot x_n)^{1/n}$

> rate.increase<-prod(increases)^(1/length(increases));
rate.increase¶
[1] 1.127009
> 132*rate.increase^5¶
[1] 240

True, the difference between 240 – the correct value – and 241.1681 –
the incorrect value – may seem negligible, but 241.1681 is still wrong and
the difference is not always that small, as an example from Wikipedia (s.v.
geometric mean) illustrates: if you do an experiment and get an increase rate of 10,000 and then you do a second experiment and get an increase rate of 0.0001 (i.e., a decrease), then the average rate of increase is not approximately 5,000 – the arithmetic mean of the two rates – but 1 – their geometric mean.13
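
You can verify this in R with the two rates from the Wikipedia example:

> rates<-c(10000, 0.0001)¶
> mean(rates) # the arithmetic mean: approximately 5,000¶
[1] 5000
> prod(rates)^(1/length(rates)) # the geometric mean¶
[1] 1
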
Finally, let me again point out how useful it can be to plot words or
numbers instead of points, triangles, … Try to generate Figure 26, in which
the position of each word on the y-axis corresponds to the average length of
the disfluency (e.g., 928.4 for women, 901.6 for men, etc.). (The horizontal
line is the overall average length – you may not know yet how to plot that
one.) Many tendencies are immediately obvious: men are below the aver-
age, women are above, silent disfluencies are of about average length, etc.
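
In case you would like a starting point before you check the code file, the following sketch shows one possible way of producing such a plot; it assumes that the columns SEX, FILLER, and LENGTH of the disfluency data are still attached, disfl.means is just an ad-hoc name, and the version in the code file may well differ in its details:

> disfl.means<-c(tapply(LENGTH, SEX, mean), tapply(LENGTH, FILLER, mean))¶
> plot(seq(disfl.means), disfl.means, type="n", xlab="", xaxt="n", ylab="Mean length in ms")¶
> text(seq(disfl.means), disfl.means, labels=names(disfl.means))¶
> abline(h=mean(LENGTH)) # the horizontal line for the overall average¶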

13. Alternatively, you can compute the geometric mean of increases as follows:
exp(mean(log(increases)))¶.


Figure 26. Mean lengths of disfluencies

1.3. Measures of dispersion

Most people know what measures of central tendencies are. What many
people do not know is that they should never – NEVER! – report a measure
of central tendency without some corresponding measure of dispersion.
The reason for this rule is that without such a measure of dispersion you
never know how good the measure of central tendency actually is at sum-
marizing the data. Let us look at a non-linguistic example, the monthly
temperatures of two towns and their averages:

> town1<-c(-5, -12, 5, 12, 15, 18, 22, 23, 20, 16, 8, 1)¶
> town2<-c(6, 7, 8, 9, 10, 12, 16, 15, 11, 9, 8, 7)¶
> mean(town1); mean(town2)¶
[1] 10.25
[1] 9.833333

On the basis of the means alone, the towns seem to have a very similar
climate, but even a quick glance at Figure 27 shows that that is not true – in
spite of the similar means, I know where I would want to be in February.
Obviously, the mean of Town 2 summarizes the central tendency of Town
2 much better than the mean of Town 1 does for Town 1: the values of
Town 1 vary much more widely around their mean. Thus, always provide a
measure of dispersion for your measure of central tendency: relative entro-
py for the mode, the interquartile range or quantiles for the median and
interval/ratio-scaled data that are non-normal or exhibit outliers, and the
standard deviation or the variance for normal interval/ratio-scaled data.


Figure 27. Temperature curves of two towns

1.3.1. Relative entropy

A simple dispersion measure for categorical data is relative entropy Hrel.


Hrel is 1 when the levels of the relevant categorical variable are all equally
frequent, and it is 0 when all data points have only one and the same varia-
ble level. For categorical variables with n levels, Hrel is computed as shown
in formula (16), in which pi corresponds to the frequency in percent of the
i-th level of the variable:

(16) $H_{rel} = -\frac{\sum_{i=1}^{n} (p_i \cdot \ln p_i)}{\ln n}$

Thus, if you count the articles of 300 noun phrases and find 164 cases
with no determiner, 33 indefinite articles, and 103 definite articles, this is
how you compute Hrel:

> article<-c(164, 33, 103)¶


> perc<-article/sum(article)¶
> hrel<--sum(perc*log(perc))/log(length(perc)); hrel¶
[1] 0.8556091

It is worth pointing out that the above formula does not produce the desired result of 0 when only no-determiner cases are observed because log(0) is not defined:

> article<-c(300, 0, 0)¶


> perc<-article/sum(article)¶
> hrel<--sum(perc*log(perc))/log(length(perc)); hrel¶
[1] NaN

Usually, this is taken care of by simply setting the result of log(0) to


zero (or sometimes also by incrementing all values by 1 before logging).
This is a case where writing a function to compute logarithms that can
handle 0s can be useful. For example, this is how you could define your
own logarithm function logw0 and then use that function instead of log to
get the desired result:

> logw0<-function(x) {¶
+ ifelse (x==0, 0, log(x))¶
+ }¶
> hrel<--sum(perc*logw0(perc))/logw0(length(perc)); hrel¶
[1] 0

Distributions of categorical variables will be dealt with in much more


detail below in Section 4.1.1.2.

1.3.2. The range

The simplest measure of dispersion for interval/ratio data is the range, the
difference of the largest and the smallest value. You can either just use the
function range, which requires the vector in question as its only argument,
and then compute the difference from the two values with diff, or you just
compute the range from the minimum and maximum yourself:

> range(LENGTH)¶
[1] 251 1600
> diff(range(LENGTH))¶
[1] 1349
> max(LENGTH)-min(LENGTH)¶
[1] 1349

This measure is extremely simple to compute but obviously also very


sensitive: one outlier is enough to yield results that are not particularly
meaningful anymore. For this reason, the range is not used very often.


1.3.3. Quantiles and quartiles

Another simple but useful and flexible measure of dispersion involves the
quantiles of a distribution. We have met quantiles before in the context of
probability distributions in Section 1.3.4. Theoretically, you compute quan-
tiles by sorting the values in ascending order and then counting which val-
ues delimit the lowest x%, y%, etc. of the data; when these percentages are
25%, 50%, and 75%, then they are called quartiles. In R you can use the
function quantile, (see below on type=1):

> a<-1:100¶
> quantile(a, type=1)¶
0% 25% 50% 75% 100%
1 25 50 75 100

If you write the integers from 1 to 100 next to each other, then 25 is the
value that cuts off the lower 25%, etc. The value for 50% corresponds to
the median, and the values for 0% and 100% are the minimum and the
maximum. Let me briefly mention two arguments of this function. First,
the argument probs allows you to specify other percentages. Second, the
argument type=… allows you to choose other ways in which quantiles are
computed. For discrete distributions, type=1 is probably best, for continu-
ous variables the default setting type=7 is best.

> quantile(a, probs=c(0.05, 0.1, 0.5, 0.9, 0.95), type=1)¶


5% 10% 50% 90% 95%
5 10 50 90 95

The bottom line of using quantiles as a measure of dispersion of course


is that the more the 25% quartile and the 75% quartile differ from each
other, the more heterogeneous the data are, which is confirmed by looking
at the data for the two towns: the so-called interquartile range – the differ-
ence between the 75% quartile and the 25% quartile – is much larger for
Town 1 than for Town 2.

> quantile(town1)¶
0% 25% 50% 75% 100%
-12.0 4.0 13.5 18.5 23.0
> IQR(town1)¶
[1] 14.5
> quantile(town2)¶
0% 25% 50% 75% 100%
6.00 7.75 9.00 11.25 16.00
> IQR(town2)¶
[1] 3.5

You can now apply this function to the lengths of the disfluencies:

> quantile(LENGTH, probs=c(0.2, 0.4, 0.5, 0.6, 0.8, 1),


type=1)¶
20% 40% 50% 60% 80% 100%
519 788 897 1039 1307 1600

That is, the central 20% of all the lengths of disfluencies are greater than
788 and range up to 1039 (as you can verify with sort(LENGTH)
[401:600]¶), 20% of the lengths are smaller than or equal to 519, 20% of
the values are 1307 or larger, etc.
An interesting application of quantile is to use it to split vectors of
continuous variables up into groups. For example, if you wanted to split the
vector LENGTH into five groups of nearly equal ranges of values, you can
use the function cut from Section 2.4.1 again, which splits up vectors into
groups, and the function quantile, which tells cut what the groups should
look like. That is, there are 200 values of LENGTH between and including
251 and 521 etc.

> LENGTH.GRP<-cut(LENGTH, breaks=quantile(LENGTH, probs=


c(0, 0.2, 0.4, 0.6, 0.8, 1)), include.lowest=TRUE)¶
> table(LENGTH.GRP)¶
LENGTH.GRP
[251,521] (521,789] (789,1.04e+03]
200 200 200
(1.04e+03,1.31e+03] (1.31e+03,1.6e+03]
203 197

1.3.4. The average deviation

Another way to characterize the dispersion of a distribution is the average


deviation. You compute the absolute difference of every data point from
the mean of the distribution (cf. abs), and then you compute the mean of
these absolute differences. For Town 1, the average deviation is 9.04:

> town1¶
[1] -5 -12 5 12 15 18 22 23 20 16 8 1
> town1-mean(town1)¶
[1] -15.25 -22.25 -5.25 1.75 4.75 7.75 11.75
12.75 9.75 5.75 -2.25 -9.25
> abs(town1-mean(town1))¶
[1] 15.25 22.25 5.25 1.75 4.75 7.75 11.75 12.75 9.75 5.75 2.25 9.25


> mean(abs(town1-mean(town1)))¶
[1] 9.041667
> mean(abs(town2-mean(town2)))¶
[1] 2.472222

For the lengths of the disfluencies, we obtain:

> mean(abs(LENGTH-mean(LENGTH)))¶
[1] 329.2946

Although this is a quite intuitive measure, it is unfortunately hardly used


anymore. For better or for worse (cf. Gorard 2004), you will more often
find the dispersion measure discussed next, the standard deviation.

1.3.5. The standard deviation/variance

The standard deviation sd of a distribution x with n elements is defined in


(17). This may look difficult at first, but the standard deviation is con-
ceptually similar to the average deviation. For the average deviation, you
compute the difference of each data point to the mean and take its absolute
value – for the standard deviation you compute the difference of each data
point to the mean, square these differences, sum them up, and after dividing
the sum by n-1, you take the square root (to ‘undo’ the previous squaring).

(17) $sd = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Once we ‘translate’ this into R, it probably becomes clearer:

> town1¶
[1] -5 -12 5 12 15 18 22 23 20 16 8 1
> town1-mean(town1)¶
[1] -15.25 -22.25 -5.25 1.75 4.75 7.75 11.75
12.75 9.75 5.75 -2.25 -9.25
> (town1-mean(town1))^2¶
[1] 232.5625 495.0625 27.5625 3.0625 22.5625 60.0625
138.0625 162.5625 95.0625 33.0625 5.0625 85.5625
> sum((town1-mean(town1))^2)¶
[1] 1360.25
> sum((town1-mean(town1))^2)/(length(town1)-1)¶
[1] 123.6591
> sqrt(sum((town1-mean(town1))^2)/(length(town1)-1))¶
[1] 11.12021

There is of course an easier way …

> sd(town1); sd(town2)¶


[1] 11.12021
[1] 3.157483

Note in passing: the standard deviation is the square root of another


measure, the variance, which you can also compute with the function var.

Recommendation(s) for further study


the function mad to compute another very robust measure of dispersion, the
median absolute deviation

1.3.6. The variation coefficient

Even though the standard deviation is probably the most widespread meas-
ure of dispersion, it has a potential weakness: its size is dependent on the
mean of the distribution, as you can see in the following example:

> sd(town1)¶
[1] 11.12021
> sd(town1*10)¶
[1] 111.2021

When the values, and hence the mean, are increased by one order of magnitude, then so is the standard deviation.
pare standard deviations from distributions with different means if you do
not first normalize them. If you divide the standard deviation of a distribu-
tion by its mean, you get the variation coefficient. You see that the varia-
tion coefficient is not affected by the multiplication with 10, and Town 1
still has a larger degree of dispersion.

> sd(town1)/mean(town1)¶
[1] 1.084899
> sd(town1*10)/mean(town1*10)¶
[1] 1.084899
> sd(town2)/mean(town2)¶
[1] 0.3210999


1.3.7. Summary functions

If you want to obtain several summarizing statistics for a vector (or a fac-
tor), you can use summary, whose output is self-explanatory.

> summary(town1)¶
Min. 1st Qu. Median Mean 3rd Qu. Max.
-12.00 4.00 13.50 10.25 18.50 23.00

An immensely useful graph is the so-called boxplot. In its simplest


form, the function boxplot just requires one vector as an argument, but we
also add notch=TRUE, which I will explain shortly, as well as a line that
adds little plus signs for the arithmetic means. Note that I am assigning the
output of boxplot to a data structure called boxsum for later inspection.

> boxsum<-boxplot(town1, town2, notch=TRUE,


names=c("Town 1", "Town 2"))¶
> text(1:2, c(mean(town1), mean(town2)), c("+", "+"))¶

This plot, see Figure 28, contains a lot of valuable information:

− the bold-typed horizontal lines represent the medians of the two vectors;
− the regular horizontal lines that make up the upper and lower boundary
of the boxes represent the hinges (approximately the 75%- and the 25%
quartiles);

Figure 28. Boxplot of the temperatures of the two towns


− the whiskers – the dashed vertical lines extending from the box until the
upper and lower limit – represent the largest and smallest values that are
not more than 1.5 interquartile ranges away from the box;
− each data point that would be outside of the range of the whiskers would
be represented as an outlier with an individual small circle;
− the notches on the left and right sides of the boxes extend across the range ±1.58*IQR/sqrt(n): if the notches of two boxplots do not overlap, then their medians will most likely be significantly different (see the short check right after this list, which recomputes these notch limits from boxsum).
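
Since the output of boxplot was assigned to boxsum above, you can verify this notch formula yourself. Note that R computes the interquartile range for the notches from the hinges, which can differ slightly from what the function IQR returns, so the two results need not match to the last decimal; this is just a quick sketch:

> boxsum$conf # the notch limits R used for the two boxes¶
> median(town1)+c(-1.58, 1.58)*IQR(town1)/sqrt(length(town1))¶
> median(town2)+c(-1.58, 1.58)*IQR(town2)/sqrt(length(town2))¶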

Figure 28 shows that the average temperatures of the two towns are very
similar and probably not significantly different from each other. Also, the
dispersion of Town 1 is much larger than that of Town 2. Sometimes, a
good boxplot nearly obviates the need for further analysis; boxplots are
extremely useful and will often be used in the chapters to follow. However,
there are situations where the ecdf plot introduced above is better and the
following example is modeled after what happened in a real dataset of a
student I supervised. Run the code in the code file and consider Figure 29.
As you could see in the code file, I created a vector x1 that actually con-
tains data from two very different distributions whereas the vector x2 con-
tains data from only one but wider distribution.

Figure 29. Boxplots (left panel) and ecdf plots (right panel) of two vectors

Crucially, the boxplots do not reveal that at all. Yes, the second darker
boxplot is wider and has some outliers, but the fact that the first lighter boxplot represents a vector containing data from two different distributions is completely absent from the graph. The ecdf plots in the right panel show
that very clearly, however: the darker line for the second vector increases
steadily in a way that suggests one normal distribution whereas the lighter
line for the first vector shows that it contains two normal distributions,
given the two s-shaped curve segments. Thus, while the ecdf plot is not as
intuitively understandable as a boxplot, it can be much more informative.

Recommendation(s) for further study


the functions hdr.boxplot (from the library hdrcde), vioplot (from the
library vioplot), and bpplot (from the library Hmisc) for interesting alter-
natives to, or extensions of, boxplots

1.3.8. The standard error

The standard error of an arithmetic mean is defined as the standard devia-


tion of the means of equally large samples drawn randomly from a popula-
tion with replacement. Imagine you took a sample from a population and
computed the arithmetic mean of some variable. Unless your sample is
perfectly representative of the population, this mean will not correspond
exactly to the arithmetic mean of that variable in the population, and it will
also not correspond exactly to the arithmetic mean you would get from
another equally large sample from the same population. If you take many
(e.g., 10,000) random and equally large samples from the population with
replacement and computed the arithmetic mean of each of them, then the
standard deviation of all these means is the standard error.

> means<-vector(length=10000)¶
> for (i in 1:10000) {¶
+ means[i]<-mean(sample(LENGTH, size=1000, replace=TRUE))¶
+ }¶
> sd(means)¶
[1] 12.10577

The standard error of an arithmetic mean is computed according to the


formula in (18), and from (18) you can already see that the larger the stand-
ard error of a mean, the smaller the likelihood that that mean is a good es-
timate of the population mean, and that the larger sample size n, the smaller
the standard error becomes:


(18) $se_{mean} = \sqrt{\frac{var}{n}} = \frac{sd}{\sqrt{n}}$

Thus, the standard error of the mean length of disfluencies here is this,
which is very close to our resampled result from above.

> mean(LENGTH)¶
[1] 915.043
> sqrt(var(LENGTH)/length(LENGTH))¶
[1] 12.08127

You can also compute standard errors for statistics other than arithmetic
means but the only other example we look at here is the standard error of a
relative frequency p, which is computed according to the formula in (19):

(19) $se_{percentage} = \sqrt{\frac{p \cdot (1-p)}{n}}$

Thus, the standard error of the percentage of all silent disfluencies out
of all disfluencies (33.2% of 1000 disfluencies) is:

> prop.table(table(FILLER))¶
FILLER
silence uh uhm
0.332 0.394 0.274
> sqrt(0.332*(1-0.332)/1000)¶
[1] 0.01489215

Standard errors will be much more important in Section 3.1.5 because


they are used to compute so-called confidence intervals. Note that when
you compare the means of two roughly equally large samples and their intervals of mean ± standard error overlap, then you know the sample means are not significantly different. However, if these intervals do not overlap, this
does not show that the means are significantly different (cf. Crawley 2005:
169f.). In Chapter 5, you will also get to see standard errors of differences
of means, which are computed according to the formula in (20).

(20) $se_{difference\ between\ means} = \sqrt{SE_{mean\ group1}^2 + SE_{mean\ group2}^2}$
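
As a quick preview with the present data, this is one way formula (20) could be computed for the difference between the mean lengths of men's and women's disfluencies; the names se.female and se.male are just ad-hoc choices, and the code assumes the column SEX of the disfluency data is still attached:

> se.female<-sd(LENGTH[SEX=="female"])/sqrt(sum(SEX=="female"))¶
> se.male<-sd(LENGTH[SEX=="male"])/sqrt(sum(SEX=="male"))¶
> sqrt(se.female^2+se.male^2) # standard error of the difference of the two means¶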


Warning/advice
Standard errors are only really useful if the data to which they are applied
are distributed pretty normally or when the sample size n ≥ 30.

1.4. Centering and standardization (z-scores)

Very often it is useful or even necessary to compare values coming from


different scales. An example (from Bortz 2005): if a student X scored 80%
in a course and a student Y scored 60% in another course, can you then say
that student X was better than student Y? On the one hand, sure you can:
80% is better than 60%. On the other hand, the test in which student Y
participated could have been much more difficult than the one in which
student X participated. It can therefore be useful to relativize/normalize the
individual grades of the two students on the basis of the overall perfor-
mance of students in their courses. (You encountered a similar situation
above in Section 3.1.3.6 when you learned that it is not always appropriate
to compare different standard deviations directly.) Let us assume the grades
obtained in the two courses look as follows:

> grades.course.X<-rep((seq(0, 100, 20)), 1:6);


grades.course.X¶
[1] 0 20 20 40 40 40 60 60 60 60 80 80 80 80
80 100 100 100 100 100 100
> grades.course.Y<-rep((seq(0, 100, 20)), 6:1);
grades.course.Y¶
[1] 0 0 0 0 0 0 20 20 20 20 20 40 40 40
40 60 60 60 80 80 100

One way to normalize the grades is called centering and simply in-
volves subtracting from each individual value within one course the aver-
age of that course.

> a<-1:5¶
> centered.scores<-a-mean(a); centered.scores¶
[1] -2 -1 0 1 2

You can see how these scores relate to the original values in a: since the
mean of a is obviously 3, the first two centered scores are negative (i.e.,
smaller than a’s mean), the third is 0 (it does not deviate from a’s mean),
and the last two centered scores are positive (i.e., larger than a’s mean).
Another more sophisticated way involves standardizing, i.e. transforming the values to be compared into so-called z-scores, which indicate
how many standard deviations each value of the vector deviates from the
mean of the vector. The z-score of a value from a vector is the difference of
that value from the mean of the vector, divided by the vector’s standard
deviation. You can compute that manually as in this simple example:

> z.scores<-(a-mean(a))/sd(a); z.scores¶


[1] -1.2649111 -0.6324555 0.0000000 0.6324555 1.2649111

The relationship between the z-scores and a’s original values is very
similar to that between the centered scores and a’s values: since the mean
of a is obviously 3, the first two z-scores are negative (i.e., smaller than a’s
mean), the third z-score is 0 (it does not deviate from a’s mean), and the
last two z-scores are positive (i.e., larger than a’s mean). Note that such z-
scores have a mean of 0 and a standard deviation of 1:

> mean(z.scores)¶
[1] 0
> sd(z.scores)¶
[1] 1

Both normalizations can be performed with the function scale, which


takes three arguments: the vector to be normalized, center=… (the default
is TRUE) and scale=… (the default is TRUE). If you do not provide any ar-
guments other than the vector to be standardized, then scale’s default set-
ting returns a matrix that contains the z-scores and whose attributes corre-
spond to the mean and the standard deviation of the vector:

> scale(a)¶
[,1]
[1,] -1.2649111
[2,] -0.6324555
[3,] 0.0000000
[4,] 0.6324555
[5,] 1.2649111
attr(,"scaled:center")
[1] 3
attr(,"scaled:scale")
[1] 1.581139

If you set scale to FALSE, then you get centered scores:

> scale(a, scale=FALSE)¶


[,1]
[1,] -2
[2,] -1
[3,] 0
[4,] 1
[5,] 2
attr(,"scaled:center")
[1] 3

If we apply both versions to our example with the two courses, then you
see that the 80% scored by student X is only 0.436 standard deviations (and
13.33 percent points) better than the mean of his course whereas the 60%
scored by student Y is actually 0.873 standard deviations (and 26.67 per-
cent points) above the mean of his course. Thus, X’s score is higher than
Y’s, but if we take the overall results in the two courses into consideration,
then Y’s performance is better; standardizing data is often useful.
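
These numbers can be recomputed directly from the two grade vectors from above; the numerators are the centered scores (i.e. the percent-point differences of 13.33 and 26.67), and dividing by the standard deviations turns them into z-scores:

> (80-mean(grades.course.X))/sd(grades.course.X) # z-score of student X¶
[1] 0.4364358
> (60-mean(grades.course.Y))/sd(grades.course.Y) # z-score of student Y¶
[1] 0.8728716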

1.5. Confidence intervals

In most cases, you are not able to investigate the whole population you are
actually interested in because that population is not accessible and/or too
large so investigating it is impossible, too time-consuming, or too expen-
sive. However, even though you know that different samples will yield
different statistics, you of course hope that your sample would yield a reli-
able estimate that tells you much about the population you are interested in:

− if you find in your sample of 1000 disfluencies that their average length
is approximately 915 ms, then you hope that you can generalize from
that to the population and future investigations;
− if you find in your sample of 1000 disfluencies that 33.2% of these are
silences, then you hope that you can generalize from that to the popula-
tion and future investigations.

So far, we have only discussed how you can compute percentages and
means for samples – the question of how valid these are for populations is
the topic of this section. In Section 3.1.5.1, I explain how you can compute
confidence intervals for arithmetic means, and Section 3.1.5.2 explains how
to compute confidence intervals for percentages. The relevance of such
confidence intervals must not be underestimated: without a confidence
interval it is unclear how well you can generalize from a sample to a popu-
lation; apart from the statistics we discuss here, one can also compute con-
fidence intervals for many others.


1.5.1. Confidence intervals of arithmetic means

If you compute a mean on the basis of a sample, you of course hope that it
represents that of the population well. As you know, the average length of
disfluencies in our example data is 915.043 ms (standard deviation:
382.04). But as we said above, other samples’ means will be different so
you would ideally want to quantify your confidence in this estimate. The
so-called confidence interval, which is useful to provide with your mean, is
the interval of values around the sample mean around which we will as-
sume there is no significant difference with the sample mean. From the
expression “significant difference”, it follows that a confidence interval is
typically defined as 1-significance level, i.e., typically as 1-0.05 = 0.95.
In a first step, you again compute the standard error of the arithmetic
mean according to the formula in (18).

> se<-sqrt(var(LENGTH)/length(LENGTH)); se¶


[1] 12.08127

This standard error is used in (21) to compute the confidence interval.


The parameter t in formula (21) refers to the distribution mentioned in Sec-
tion 1.3.4.3, and its computation requires the number of degrees of free-
dom. In this case, the number of degrees of freedom df is the length of the
vector-1, i.e. 999. Since you want to compute a t-value on the basis of a p-
value, you need the function qt, and since you want a two-tailed interval –
95% of the values around the observed mean, i.e. values larger and smaller
than the mean – you must compute the t-value for 2.5% (because 2.5% on
both sides result in the desired 5%):

(21) CI = x ±t·SE

> t.value<-qt(0.025, df=999, lower.tail=FALSE); t.value¶


[1] 1.962341

Now you can compute the confidence interval:

> mean(LENGTH)-(se*t.value); mean(LENGTH)+(se*t.value)¶


[1] 891.3354
[1] 938.7506

To do this more simply, you can use the function t.test with the rele-
vant vector and use conf.level=… to define the relevant percentage. R then


computes a significance test the details of which are not relevant yet, which
is why we only look at the confidence interval (with $conf.int):

> t.test(LENGTH, conf.level=0.95)$conf.int¶


[1] 891.3354 938.7506
attr(,"conf.level")
[1] 0.95

This confidence interval

identifies a range of values a researcher can be 95% confi-


dent contains the true value of a population parameter (e.g.,
a population mean). Stated in probabilistic terms, the re-
searcher can state there is a probability/likelihood of .95
that the confidence interval contains the true value of the
population parameter. (Sheskin 2011:75; see also Field,
Miles, and Field 2012:45)14

Note that when you compare means of two roughly equally large sam-
ples and their 95%-confidence intervals do not overlap, then you know the
sample means are significantly different and, therefore, you would assume
that there is a real difference between the population means, too. However,
if these intervals do overlap, this does not show that the means are not sig-
nificantly different from each other (cf. Crawley 2005: 169f.).

1.5.2. Confidence intervals of percentages

The above logic with regard to means also applies to percentages. Given a
particular percentage from a sample, you want to know what the corre-
sponding percentage in the population is. As you already know, the per-
centage of silent disfluencies in our sample is 33.2%. Again, you would
like to quantify your confidence in that sample percentage. As above, you
compute the standard error for percentages according to the formula in
(19), and then this standard error is inserted into the formula in (22).

14 A different way of explaining confidence intervals is this: “A common error is to misin-


terpret the confidence interval as a statement about the unknown parameter [here, the
percentage in the population, STG]. It is not true that the probability that a parameter is
included in a 95% confidence interval is 95%. What is true is that if we derive a large
number of 95% confidence intervals, we can expect the true value of the parameter to be
included in the computed intervals 95% of the time” (Good and Hardin 2012:156)


> se<-sqrt(0.332*(1-0.332)/1000); se¶


[1] 0.01489215

(22) CI = a±z·SE

The parameter z in (22) corresponds to the z-score mentioned above in


Section 1.3.4.3, which defines 5% of the area under a standard normal dis-
tribution – 2.5% from the upper part and 2.5% from the lower part:

> z.score<-qnorm(0.025, lower.tail=FALSE); z.score¶


[1] 1.959964

For a 95% confidence interval for the percentage of silences, you enter:

> z.score<-qnorm(0.025, lower.tail=FALSE)¶


> 0.332-z.score*se; 0.332+z.score*se¶
[1] 0.3028119
[1] 0.3611881

The simpler way requires the function prop.test, which tests whether
a percentage obtained in a sample is significantly different from an ex-
pected percentage. Again, the functionality of that significance test is not
relevant yet, but this function also returns the confidence interval for the
observed percentage. R needs the observed frequency (332), the sample
size (1000), and the probability for the confidence interval. R uses a formu-
la different from ours but returns nearly the same result.

> prop.test(332, 1000, conf.level=0.95)$conf.int¶


[1] 0.3030166 0.3622912
attr(,"conf.level")
[1] 0.95

Recommendation(s) for further study


Dalgaard (2002: Ch. 7.1 and 4.1), Crawley (2005: 167ff.)

Warning/advice
Since confidence intervals are based on standard errors, the warning from
above applies here, too: if data are not normally distributed or the samples
too small, then you should probably use other methods to estimate confi-
dence intervals (e.g., bootstrapping).


2. Bivariate statistics

We have so far dealt with statistics and graphs that describe one variable or
vector/factor. In this section, we now turn to methods to characterize two
variables and their relation. We will again begin with frequencies, then we
will discuss means, and finally talk about correlations. You will see that we
can use many functions from the previous sections.

2.1. Frequencies and crosstabulation

We begin with the case of two nominal/categorical variables. Usually, one


wants to know which combinations of variable levels occur how often. The
simplest way to do this is cross-tabulation. Let’s return to the disfluencies:

> UHM<-read.delim(file.choose())¶
> attach(UHM)¶

Let’s assume you wanted to see whether men and women differ with re-
gard to the kind of disfluencies they produce. First two questions: are there
dependent and independent variables in this design and, if so, which?

THINK
BREAK

In this case, SEX is the independent variable and FILLER is the depend-
ent variable. Computing the frequencies of variable level combinations in R
is easy because you can use the same function that you use to compute
frequencies of an individual variable’s levels: table. You just give table a
second vector or factor as an argument and R lists the levels of the first
vector in the rows and the levels of the second in the columns:

> freqs<-table(FILLER, SEX); freqs¶


SEX
FILLER female male
silence 171 161
uh 161 233
uhm 170 104

In fact you can provide even more vectors to table, just try it out, and


we will return to this below. Again, you can create tables of percentages
with prop.table, but with two-dimensional tables there are different ways
to compute percentages and you can specify one with margin=…. The de-
fault is margin=NULL, which computes the percentages on the basis of all
elements in the table. In other words, all percentages in the table add up to
1. Another possibility is to compute row percentages: set margin=1 and
you get percentages that add up to 1 in every row. Finally, you can choose
column percentages by setting margin=2: the percentages in each column
add up to 1. This is probably the best way here since then the percentages
adding up to 1 are those of the dependent variable.

> percents<-prop.table(table(FILLER, SEX), margin=2)¶


> percents¶
SEX
FILLER female male
silence 0.3406375 0.3232932
uh 0.3207171 0.4678715
uhm 0.3386454 0.2088353

You can immediately see that men appear to prefer uh and disprefer
uhm while women appear to have no real preference for any disfluency.
However, we of course do not know yet whether this is a significant result.
The function addmargins outputs row and column totals (or other user-
defined margins, such as means):

> addmargins(freqs) # cf. also colSums and rowSums¶


SEX
FILLER female male Sum
silence 171 161 332
uh 161 233 394
uhm 170 104 274
Sum 502 498 1000

Recommendation(s) for further study


the functions xtabs and especially ftable to generate more complex tables
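
For example, a compact three-dimensional table of the attached disfluency data could be generated along the following lines (which factor ends up in the rows and which in the columns is up to you; see ?ftable for the arguments row.vars and col.vars):

> ftable(GENRE, SEX, FILLER)¶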

2.1.1. Bar plots and mosaic plots

Of course you can also represent such tables graphically. The simplest way
involves providing a formula as the main argument to plot. Such formulae
consist of a dependent variable (here: FILLER: FILLER), a tilde (“~” mean-
ing ‘as a function of’), and an independent variable (here: GENRE: GENRE).


> plot(FILLER~GENRE)¶

The widths and heights of rows, columns, and the six boxes represent
the observed frequencies. For example, the column for dialogs is a little
wider than that for monologs because there are more dialogs in the data; the
row for uh is widest because uh is the most frequent disfluency, etc.
Other similar graphs can be generated with the following lines:

> plot(GENRE, FILLER)¶


> plot(table(GENRE, FILLER))¶
> mosaicplot(table(GENRE, FILLER))¶

These graphs are called stacked bar plots or mosaic plots and are – to-
gether with association plots to be introduced below – often effective ways
of representing crosstabulated data. In the code file for this chapter you will
find R code for another kind of useful graph.

Figure 30. Stacked bar plot / mosaic plot for FILLER~GENRE

2.1.2. Spineplots

Sometimes, the dependent variable is nominal/categorical and the inde-


pendent variable is interval/ratio-scaled. Let us assume that FILLER is the
dependent variable, which is influenced by the independent variable
LENGTH. (This does not make much sense here, we just do this for exposi-


tory purposes.) You can use the function spineplot with a formula:

> spineplot(FILLER~LENGTH)¶

The y-axis represents the dependent variable and its three levels. The x-
axis represents the independent ratio-scaled variable, which is split up into
the value ranges that would also result from hist (which also means you
can change the ranges with breaks=…; cf. Section 3.1.1.5 above).
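
For instance, something along the following lines should produce a coarser spineplot with about five bins; this is just a sketch, since the exact binning is chosen by R in the same way as for hist:

> spineplot(FILLER~LENGTH, breaks=5)¶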

2.1.3. Line plots

Apart from these plots, you can also generate line plots that summarize
frequencies. If you generate a table of relative frequencies, then you can
create a primitive line plot by entering the code shown below.

Figure 31. Spineplot for FILLER~LENGTH

> fill.table<-prop.table(table(FILLER, SEX), 2); fill.table¶


SEX
FILLER female male
silence 0.3406375 0.3232932
uh 0.3207171 0.4678715
uhm 0.3386454 0.2088353
> plot(fill.table[,1], ylim=c(0, 0.5), xlab="Disfluency",
ylab="Relative frequency", type="b")¶
> points(fill.table[,2], type="b")¶


However, somewhat more advanced code in the companion file shows


you how you can generate the graph in Figure 32. (Again, you may not
understand the code immediately, but it will not take you long.)

Warning/advice
Sometimes, it is recommended to not represent such frequency data with a
line plot like this because the lines ‘suggest’ that there are frequency values
between the levels of the categorical variable, which is of course not the
case. Again, you should definitely explore the function dotchart for this.
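
A minimal sketch of such a dotchart for the table of column percentages from above might look as follows; the conversion to a plain matrix and the name fill.matrix are just precautionary/ad-hoc choices, and labels and ordering can of course be polished further:

> fill.matrix<-matrix(fill.table, ncol=2, dimnames=dimnames(fill.table))¶
> dotchart(fill.matrix, xlab="Relative frequency")¶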

Figure 32. Line plot with the percentages of the interaction of SEX and FILLER

Recommendation(s) for further study


the function plotmeans (from the library gplots) to plot line plots with
means and confidence intervals

2.2. Means

If the dependent variable is interval/ratio-scaled or ordinal and the inde-


pendent variable is nominal/categorical, then one is often not interested in
the frequencies of particular values of the dependent variable, but its cen-
tral tendencies at each level of the independent variable. For example, you
might want to determine whether men and women differ with regard to the
average disfluency lengths. One way to get these means is the following:


> mean(LENGTH[SEX=="female"])¶
[1] 928.3984
> mean(LENGTH[SEX=="male"])¶
[1] 901.5803

This approach is too primitive for three reasons:

− you must define the values of LENGTH that you want to include manual-
ly, which requires a lot of typing (especially when the independent vari-
able has more than two levels or, even worse, when you have more than
one independent variable);
− you must know all relevant levels of the independent variables – other-
wise you couldn’t use them for subsetting in the first place;
− you only get the means of the variable levels you have explicitly asked
for. However, if, for example, you made a coding mistake in one row –
such as entering “malle” instead of “male” – this approach will not
show you that.

Thus, we use an extremely useful function called tapply, which mostly


takes three arguments. The first is a vector or factor to which you want to
apply a function – here, this is LENGTH, to which we want to apply mean.
The second argument is a vector or factor that has as many elements as the
first one and that specifies the groups of values from the first vector/factor
to which the function is to be applied. The last argument is the relevant
function, here mean. We get:

> tapply(LENGTH, SEX, mean)¶


female male
928.3984 901.5803

Of course the result is the same as above, but you obtained it in a better
way. You can of course use functions other than mean: median, IQR, sd,
var, …, even functions you wrote yourself. For example, what do you get
when you use length? The numbers of lengths observed for each sex.
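
For example, with length you get the number of observations per sex, which matches the column totals of the crosstabulation above:

> tapply(LENGTH, SEX, length)¶
female male
502 498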

2.2.1. Boxplots

In Section 3.1.3.7 above, we looked at boxplots, but restricted our attention


to cases where we have one or more dependent variables (such as town1
and town2). However, you can also use boxplots for cases where you have


one or more independent variables and a dependent variable. Again, the


easiest way is to use a formula with the tilde meaning ‘as a function of’:

> boxplot(LENGTH~GENRE, notch=TRUE, ylim=c(0, 1600))¶

(If you only want to plot a boxplot and not provide any further argu-
ments, it is actually enough to just enter plot(LENGTH~GENRE)¶: R ‘infers’
you want a boxplot because LENGTH is a numerical vector and GENRE is a
factor.) Again, you can infer a lot from that plot: both medians are close to
900 ms and do most likely not differ significantly from each other (since
the notches overlap). Both genres appear to have about the same amount of
dispersion since the notches, the boxes, and the whiskers are nearly equally
large, and both genres have no outliers.

Figure 33. Boxplot for LENGTH~GENRE

Quick question: can you infer what this line does?

> text(seq(levels(GENRE)), tapply(LENGTH, GENRE, mean), "+")¶

THINK
BREAK

It adds plusses into the boxplot representing the means of LENGTH for
each GENRE: seq(levels(GENRE)) returns 1:2, which is used as the x-


coordinates; the tapply code returns the means of LENGTH for each GENRE,
and the "+" is what is plotted.

2.2.2. Interaction plots

So far we have looked at graphs representing one variable or one variable


depending on another variable. However, there are also cases where you
want to characterize the distribution of one interval/ratio-scaled variable
depending on two, say, nominal/categorical variables. You can again obtain
the means of the variable level combinations of the independent variables
with tapply. You must specify the two independent variables in the form
of a list, and the following two examples show you how you get the same
means in two different ways (so that you see which variable goes into the
rows and which into the columns):

> tapply(LENGTH, list(SEX, FILLER), mean)¶


silence uh uhm
female 942.3333 940.5652 902.8588
male 891.6894 904.9785 909.2788
> tapply(LENGTH, list(FILLER, SEX), mean)¶
female male
silence 942.3333 891.6894
uh 940.5652 904.9785
uhm 902.8588 909.2788

Such results are best shown in tabular form such that you don’t just pro-
vide the above means of the interactions as they were represented in Figure
32 above, but also the means of the individual variables. Consider Table 17
and the formula in its caption exemplifying the relevant R syntax.

Table 17. Means for LENGTH ~ FILLER * SEX


SEX: FEMALE SEX: MALE Total
FILLER: SILENCE 942.33 891.69 917.77
FILLER: UH 940.57 904.98 919.52
FILLER: UHM 902.86 909.28 905.3
TOTAL 928.4 901.58 915.04

A plus sign between variables refers to just adding main effects of vari-
ables (i.e., effects of variables in isolation, e.g. when you only inspect the
two means for SEX in the bottom row of totals or the three means for
FILLER in the rightmost column of totals). A colon between variables refers


to only the interaction of the variables (i.e., effects of combinations of vari-


ables as when you inspect the six means in the main body of the table
where SEX and FILLER are combined). Finally, an asterisk between varia-
bles denotes both the main effects and the interaction (here, all 12 means).
With two variables A and B, A*B is the same as A + B + A:B.
Now to the results. These are often easier to understand when they are
represented graphically. You can create and configure an interaction plot
manually, but for a quick and dirty glance at the data, you can also use the
function interaction.plot. As you might expect, this function takes at
least three arguments:

− x.factor: a vector/factor whose values/levels are represented on the x-


axis;
− trace.factor: the second argument is a vector/factor whose val-
ues/levels are represented with different lines;
− response: the third argument is a vector whose means for all variable
level combinations will be represented on the y-axis by the lines.

That means, you can choose one of two formats, depending on which
independent variable is shown on the x-axis and which is shown with dif-
ferent lines. While the represented means will of course be identical, I ad-
vise you to always generate and inspect both graphs anyway because one of
the two graphs is usually easier to interpret. In Figure 34, you find both
graphs for the above values and I prefer the lower panel.

> interaction.plot(FILLER, SEX, LENGTH); grid()¶


> interaction.plot(SEX, FILLER, LENGTH); grid()¶

Obviously, uhm behaves differently from uh and silences: the average


lengths of women’s uh and silence are larger than those of men, but the
average length of women’s uhm is smaller than that of men. But now an
important question: why should you now not just report the means you
computed with tapply and the graphs in Figure 34 in your study?

THINK
BREAK


Figure 34. Interaction plot for LENGTH ~ FILLER : SEX

First, you should not just report the means like this because I told you to
never ever report means without a measure of dispersion. Thus, when you
want to provide the means, you must also add, say, standard deviations,
standard errors, confidence intervals:

> tapply(LENGTH, list(SEX, FILLER), sd)¶


silence uh uhm
female 361.9081 397.4948 378.8790
male 370.6995 397.1380 382.3137


How do you get the standard errors and the confidence intervals?

THINK
BREAK

> se<-tapply(LENGTH, list(SEX, FILLER), sd)/


sqrt(tapply(LENGTH, list(SEX, FILLER), length)); se¶
silence uh uhm
female 27.67581 31.32698 29.05869
male 29.21522 26.01738 37.48895
> t.value<-qt(0.025, df=999, lower.tail=FALSE); t.value¶
[1] 1.962341
> tapply(LENGTH, list(SEX, FILLER), mean)-(t.value*se)¶
silence uh uhm
female 888.0240 879.0910 845.8357
male 834.3592 853.9236 835.7127
> tapply(LENGTH, list(SEX, FILLER), mean)+(t.value*se)¶
silence uh uhm
female 996.6427 1002.0394 959.882
male 949.0197 956.0335 982.845

And this output immediately shows again why measures of dispersion


are important: the standard deviations are large and the means plus/minus
one standard error overlap (as do the confidence intervals), which shows
that the differences are not significant. You can see this with boxplot,
which allows formulae with more than one independent variable (boxplot(
LENGTH~SEX*FILLER, notch=TRUE)¶, with an asterisk for the interaction).
Second, the graphs should not be used as they are (at least not uncriti-
cally) because R has chosen the range of the y-axis such that it is as small
as possible but still covers all necessary data points. However, this small
range on the y-axis has visually inflated the differences in Figure 34 – a
more realistic representation would have either included the value y = 0 (as
in the first pair of the following four lines) or chosen the range of the y-axis
such that the complete range of LENGTH is included (as in the second pair
of the following four lines):

> interaction.plot(SEX, FILLER, LENGTH, ylim=c(0, 1000))¶


> interaction.plot(FILLER, SEX, LENGTH, ylim=c(0, 1000))¶
> interaction.plot(SEX, FILLER, LENGTH, ylim=range(LENGTH))¶
> interaction.plot(FILLER, SEX, LENGTH, ylim=range(LENGTH))¶


2.3. Coefficients of correlation and linear regression

The last section in this chapter is devoted to cases where both the depend-
ent and the independent variable are ratio-scaled. For this scenario we turn
to a new data set. First, we clear our memory of all data structures we have
used so far:

> rm(list=ls(all=TRUE))¶

We look at data to determine whether there is a correlation between the


reaction times in ms of second language learners in a lexical decision task
and the length of the stimulus words. We have

− a dependent ratio-scaled variable: the reaction time in ms


MS_LEARNER, whose correlation with the following independent varia-
ble we are interested in;
− an independent ratio-scaled variable: the length of the stimulus words
LENGTH (in letters).

Such correlations are typically quantified using a so-called coefficient


of correlation r. This coefficient, and many others, are defined to fall in the
range between -1 and +1. Table 18 explains what the values mean: the sign
of a correlation coefficient reflects the direction of the correlation, and the
absolute size reflects the strength of the correlation. When the correlation
coefficient is 0, then there is no correlation between the two variables in
question, which is why H0 says r = 0 – the two-tailed H1 says r ≠ 0.

Table 18. Correlation coefficients and their interpretation


Correlation Labeling the Kind of correlation
coefficient correlation
0.7 < r ≤ 1 very high positive correlation:
0.5 < r ≤ 0.7 high the more/higher …, the more/higher …
0.2 < r ≤ 0.5 intermediate the less/lower …, the less/lower …
0 < r ≤ 0.2 low
r≈0 no statistical correlation (H0)
0 > r ≥ -0.2 low negative correlation:
-0.2 > r ≥ -0.5 intermediate the more/higher …, the less/lower …
-0.5 > r ≥-0.7 high the less/lower …, the more/higher …
-0.7 > r ≥ -1 very high


Let us load and plot the data, using by now familiar lines of code:

> ReactTime<-read.delim(file.choose())¶
> str(ReactTime); attach(ReactTime)¶
'data.frame': 20 obs. of 3 variables:
$ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
$ LENGTH : int 14 12 11 12 5 9 8 11 9 11 ...
$ MS_LEARNER: int 233 213 221 206 123 176 195 207 172 ...
> plot(MS_LEARNER~LENGTH, xlim=c(0, 15), ylim=c(0, 300),
xlab="Word length in letters", ylab="Reaction time of
learners in ms"); grid()¶

Figure 35. Scatterplot15 for MS_LEARNER~LENGTH

What kind of correlation is that, a positive or a negative one?

THINK
BREAK

This is a positive correlation, because we can describe it with a “the


more …, the more …” statement: the longer the word, the longer the reac-
tion time: when you move from the left (short words) to the right (long
words), the reaction times get higher. But we also want to quantify the
correlation and compute the Pearson product-moment correlation r.

15 Check the code file for how to handle overlapping points.


First, we do this manually: We begin by computing the covariance of


the two variables according to the formula in (23).

(23) $Covariance_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) \cdot (y_i - \bar{y})}{n-1}$

As you can see, the covariance involves computing the differences of


each variable’s value from the variable’s mean. For example, when the i-th
value of both the vector x and the vector y are above the averages of x and
y, then this pair of i-th values will contribute a positive value to the covari-
ance. In R, we can compute the covariance manually or with the function
cov, which requires the two relevant vectors:

> covariance<-sum((LENGTH-mean(LENGTH))*(MS_LEARNER-
mean(MS_LEARNER)))/(length(MS_LEARNER)-1)¶
> covariance<-cov(LENGTH, MS_LEARNER); covariance¶
[1] 79.28947

The sign of the covariance already indicates whether two variables are
positively or negatively correlated; here it is positive. However, we cannot
use the covariance to quantify the correlation between two vectors because
its size depends on the scale of the two vectors: if you multiply both vec-
tors with 10, the covariance becomes 100 times as large as before although
the correlation as such has of course not changed:

> cov(MS_LEARNER*10, LENGTH*10)¶


[1] 7928.947

Therefore, we divide the covariance by the product of the standard
deviations of the two vectors and obtain r. This is a very high positive
correlation: r is close to the theoretical maximum of 1. In R, we can do all this
more efficiently with the function cor. Its first two arguments are the two
vectors in question, and the third specifies the desired kind of correlation:

> covariance/(sd(LENGTH)*sd(MS_LEARNER))¶
[1] 0.9337171
> cor(MS_LEARNER, LENGTH, method="pearson")¶
[1] 0.9337171

The correlation can be investigated more closely, though. We can try to
predict values of the dependent variable on the basis of the independent
one. This method is called linear regression. In its simplest form, it in-
volves trying to draw a straight line in such a way that it represents the
scattercloud best. Here, best is defined as ‘minimizing the sums of the
squared vertical distances of the observed y-values (here: reaction times)
and the predicted y-values reflected by the regression line.’ That is, the
regression line is drawn fairly directly through the scattercloud because
then these deviations are smallest. It is defined by a regression equation
with two parameters, an intercept a and a slope b. Without discussing the
relevant formulae here, I immediately explain how to get these values with
R. Using the formula notation you already know, you define and inspect a
so-called linear model using the function lm:

> model<-lm(MS_LEARNER~LENGTH); model¶


Call:
lm(formula = MS_LEARNER ~ LENGTH)
Coefficients:
(Intercept) LENGTH
93.61 10.30

That is, the intercept – the y-value of the regression line at x = 0 – is
93.61, and the slope of the regression line is 10.3, which means that for
every letter of a word the estimated reaction time increases by 10.3 ms. For
example, our data do not contain a word with 16 letters, but since the corre-
lation between the variables is so strong, we can come up with a good pre-
diction for the reaction time such words might result in:

predicted reaction time = intercept + b · LENGTH


258.41 ≈ 93.61 + 10.3 · 16

> 93.61+10.3*16¶
[1] 258.41

(This prediction of the reaction time is of course overly simplistic as it
neglects the large number of other factors that influence reaction times but
within the current linear model this is how it would be computed.) Alterna-
tively, you can use the function predict, whose first argument is the (line-
ar) model and whose second argument can be a data frame called newdata
that contains a column with values for each independent variable for which
you want to make a prediction. With the exception of differences resulting
from me only using two decimals, you get the same result:


> predict(model, newdata=expand.grid(LENGTH=16))¶


[1] 258.4850

The use of expand.grid is overkill here for a data frame with a single
length but I am using it here because it anticipates our uses of predict and
expand.grid below where we can actually get predictions for a large num-
ber of values in one go (as in the following; the output is not shown here):

> predict(model, newdata=expand.grid(LENGTH=1:16))¶

If you only use the model as an argument to predict, you get the values
the model predicts for every observed word length in your data in the order
of the data points (same with fitted).

> round(predict(model), 2)¶


1 2 3 4 5 6 7 8
237.88 217.27 206.96 217.27 145.14 186.35 176.05 206.96
9 10 11 12 13 14 15 16
186.35 206.96 196.66 165.75 248.18 227.57 248.18 186.35
17 18 19 20
196.66 155.44 176.05 206.96

The first value of LENGTH is 14, so the first of the above values is the
reaction time we expect for a word with 14 letters, etc. Since you now have
the needed parameters, you can also draw the regression line. You do this
with the function abline, which either takes a linear model object as an
argument or the intercept and the slope; cf. Figure 36:

> plot(MS_LEARNER~LENGTH, xlim=c(0, 15), ylim=c(0, 300),
xlab="Word length in letters", ylab="Reaction time of
learners in ms"); grid()¶
> abline(model) # abline(93.61, 10.3)¶

It is obvious why the correlation coefficient is so high: the regression
line is an excellent summary of the data points since all points are fairly
close to it. (Below, we will see two ways of making this graph more in-
formative.) We can even easily check how far away every predicted value
is from its observed value.
This difference – the vertical distance between an observed y-value / re-
action time and the y-value on the regression line for the corresponding x-
value – is called a residual, and the function residuals requires just the
linear model object as its argument.


Figure 36. Scatterplot with regression line for MS_LEARNER~LENGTH

> round(residuals(model), 2)¶


1 2 3 4 5 6 7 8
-4.88 -4.27 14.04 -11.27 -22.14 -10.35 18.95 0.04
9 10 11 12 13 14 15 16
-14.35 -6.96 8.34 11.25 7.82 -14.57 7.82 1.65
17 18 19 20
-1.66 10.56 6.95 3.04

You can easily test manually that these are in fact the residuals:

> round(MS_LEARNER-(predict(model)+residuals(model)), 2)¶


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Note two important points though: First, regression equations and lines
are most useful for the range of values covered by the observed values.
Here, the regression equation was computed on the basis of lengths be-
tween 5 and 15 letters, which means that it will probably be much less reli-
able for lengths of 50+ letters. Second, in this case the regression equation
also makes some rather non-sensical predictions because theoretically/
mathematically it predicts reaction times of around 0 ms for word lengths
of -9. Such considerations will become important later on.
The correlation coefficient r also allows you to specify how much of the
variance of one variable can be accounted for by the other variable. What
does that mean? In our example, the values of both variables –
MS_LEARNER and LENGTH – are not all identical: they vary around their
means and this variation was called dispersion and quantified with the
standard deviation or the variance. If you square r and multiply the result
by 100, then you obtain the amount of variance of one variable that the
other variable accounts for. In our example, r = 0.933, which means that
87.18% of the variance of the reaction times can be accounted for – in a
statistical sense, not necessarily a cause-effect sense – on the basis of the
word lengths. This value, r2, is referred to as coefficient of determination.
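In R, you can obtain this value in one line by simply squaring the correlation coefficient computed above; this minimal sketch (mine, with rounding to four decimals) reproduces the 87.18% just mentioned:

> round(cor(MS_LEARNER, LENGTH)^2, 4)¶
[1] 0.8718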
Incidentally, I have sometimes heard students or colleagues compare two r-
values such that they say something like, “Oh, here r = 0.6, nice, that’s
twice as much as in this other data set, where r = 0.3.” Even numerically
speaking, this is at least misleading, if nothing worse. Yes, 0.6 is twice as
high as 0.3, but one should not compare r-values directly like this – one has
to apply the so-called Fisher’s Z-transformation first, which is exemplified
in the following two lines:

> r<-0.3; 0.5*log((1+r)/(1-r))¶


[1] 0.3095196
> r<-0.6; 0.5*log((1+r)/(1-r))¶
[1] 0.6931472
> 0.6931472/0.3095196
[1] 2.239429

Thus, an r-value of 0.6 is twice as high as one of 0.3, but it reflects a
correlation that is in fact nearly 2¼ times as strong. How about writing a
function fisher.z that would compute Z from r for you …
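One possible answer to that little exercise – a minimal sketch in which the function name fisher.z and the lack of any error checking are my own choices – is the following; the two test calls return the values computed above:

> fisher.z<-function(r) { 0.5*log((1+r)/(1-r)) }¶
> fisher.z(0.3); fisher.z(0.6)¶
[1] 0.3095196
[1] 0.6931472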
The product-moment correlation r is probably the most frequently used
correlation. However, there are a few occasions on which it should not be
used. First, when the relevant variables are not interval/ratio-scaled but
ordinal or when they are not both normally distributed (cf. below Section
4.4), then it is better to use another correlation coefficient, for example
Kendall’s tau τ. This correlation coefficient is based only on the ranks of
the variable values and thus more suited for ordinal data. Second, when
there are marked outliers in the variables, then you should also use Ken-
dall’s τ, because as a measure that is based on ordinal information only it is,
just like the median, less sensitive to outliers. Cf. Figure 37, which shows a
scatterplot with one noteworthy outlier in the top right corner. If you cannot
justify excluding this data point, then it can influence r very strongly, but
not τ. Pearson’s r and Kendall’s τ for all data points but the outlier are 0.11
and 0.1 respectively, and the regression line with the small slope shows that
there is clearly no correlation between the two variables. However, if we
include the outlier, then Pearson’s r suddenly becomes 0.75 (and the re-
gression line’s slope is changed markedly) while Kendall’s τ remains ap-
propriately small: 0.14.

Figure 37. The effect of outliers on r
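If you want to recreate this effect yourself, a minimal sketch with simulated data (my own, so the exact coefficients will differ from the ones quoted above for Figure 37) could look like this; the output is not shown, but Pearson’s r will typically jump to a high value once the outlier is included, while Kendall’s τ changes only slightly:

> set.seed(1) # for a reproducible random sample¶
> x<-rnorm(20); y<-rnorm(20) # 20 essentially uncorrelated points¶
> cor(x, y, method="pearson"); cor(x, y, method="kendall")¶
> x.out<-c(x, 10); y.out<-c(y, 10) # add one extreme outlier¶
> cor(x.out, y.out, method="pearson"); cor(x.out, y.out, method="kendall")¶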

But how do you compute Kendall’s τ? The computation of Kendall’s τ
is rather complex (especially with larger samples and ties), which is why I
only explain how to compute it with R. The function is actually the same as
for Pearson’s r – cor – but the argument method=… is changed. For our
experimental data we again get a high correlation, which turns out to be a
little bit smaller than r. (Note that correlations are bidirectional – the order
of the vectors does not matter – but linear regressions are not because you
have a dependent and an independent variable and it matters what goes
before the tilde – that which is predicted – and what goes after it.)

> cor(LENGTH, MS_LEARNER, method="kendall")¶


[1] 0.8189904
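As a quick illustration of the parenthetical remark on bidirectionality, the following two lines (a sketch of mine; output not shown) show that swapping the two vectors leaves the correlation unchanged, whereas swapping the two sides of the regression formula yields different coefficients:

> cor(LENGTH, MS_LEARNER); cor(MS_LEARNER, LENGTH) # same value¶
> coef(lm(MS_LEARNER~LENGTH)); coef(lm(LENGTH~MS_LEARNER)) # different¶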

The previous explanations were all based on the assumption that there is
in fact a linear correlation between the two variables or one that is best
characterized with a straight line. This need not be the case, though, and a
third scenario in which neither r nor τ are particularly useful involves cases
where these assumptions do not hold. Often, this can be seen by just look-
ing at the data. Figure 38 represents a well-known example from
Anscombe (1973) (from <_inputfiles/03-2-3_anscombe.csv>), which has
the intriguing characteristics that

− the means and variances of the x-variable;
− the means and variances of the y-variable;
− the correlations and the linear regression lines of x and y;

are all identical although the distributions are obviously very different.

Figure 38. The sensitivity of linear correlations: the Anscombe data


In the top left of Figure 38, there is a case where r and τ are unproblem-
atic. In the top right we have a situation where x and y are related in a cur-
vilinear fashion – using a linear correlation here does not make much
sense.16 In the two lower panels, you see distributions in which individual
outliers have a huge influence on r and the regression line. Since all the
summary statistics are identical, this example illustrates most beautifully
how important, in fact indispensable, a visual inspection of your data is,
which is why in the following chapters visual exploration nearly always
precedes statistical computation.
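Incidentally, R also ships with these numbers as the built-in data set anscombe (columns x1 to x4 and y1 to y4), so you can check the claim about the identical correlations yourself; in this minimal sketch (mine; output not shown), all four coefficients come out at roughly 0.82:

> sapply(1:4, function(i) cor(anscombe[, i], anscombe[, i+4]))¶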
Now you should do the exercise(s) for Chapter 3 …

Warning/advice
Do not let the multitude of graphical functions and settings of R and/or
your spreadsheet software tempt you to produce visual overkill. Just be-
cause you can use 6 different fonts, 10 colors, and cute little smiley sym-
bols does not mean you should: Visualization should help you and/or the
reader understand something otherwise difficult to grasp, which also means
you should make sure your graphs are fairly self-sufficient, i.e. contain all
the information required to understand them (e.g., meaningful graph and
axis labels, legends, etc.) – a graph may need an explanation, but if the
explanation is three quarters of a page, chances are your graph is not help-
ful (cf. Keen 2010: Chapter 1).

Recommendation(s) for further study


− the function s.hist (from the library ade4) and scatterplot (from the
library car) to produce more refined scatterplots with histograms or
boxplots
− Good and Hardin (2012: Ch. 8), Crawley (2007: Ch. 5, 27), Braun and
Murdoch (2008: Section 3.2), and Keen (2010) for much advice to cre-
ate good graphs; cf. also <http://cran.r-project.org/src/contrib/
Views/Graphics.html>

16. I do not discuss nonlinear regressions; cf. Crawley (2007: Ch. 18, 20) for overviews.

Chapter 4
Analytical statistics

The most important questions of life are,
for the most part, really only questions of probability.
Pierre-Simon Laplace
(from <http://www-rohan.sdsu.edu/%7Emalouf/>)

In my description of the phases of an empirical study in Chapter 1, I
skipped over one essential step: how to decide which significance test to
use (Section 1.3.4). In this chapter, I will now discuss this step in some
detail as well as then discuss how to conduct a variety of significance tests
you may want to perform on your data. More specifically, in this chapter I
will explain how descriptive statistics from Chapter 3 are used in the do-
main of hypothesis-testing. For example, in Section 3.1 I explained how
you compute a measure of central tendency (such as a mean) or a measure
of dispersion (such as a standard deviation) for a particular sample. In this
chapter, you will see how you test whether such a mean or such a standard
deviation differs significantly from a known mean or standard deviation or
the mean or standard deviation of a second sample.
However, before we begin with actual tests: how do you decide which
of the many tests out there is required for your hypotheses and data? One
way to try to narrow down the truly bewildering array of tests is to ask
yourself the six questions I will list in (24) to (29) and discuss presently,
and the answers to these questions usually point you to only one or two
tests that you can apply to your data. (A bit later, I will also provide a visu-
al aid for this process.)
Ok, here goes. The first question is shown in (24).

(24) What kind of study are you conducting?

Typically, there are only two possible answers to that question: “hy-
pothesis-generating” and “hypothesis-testing.” The former means that you
are approaching a (typically large) data set with the intentions of detecting
structure(s) and developing hypotheses for future studies; your approach to
the data is therefore data-driven, or bottom-up; an example for this will be
discussed in Section 5.6. The latter is what most of the examples in this
book are about and means your approach to the data involves specific hy-
potheses you want to test and requires the types of tests in this chapter and
most of the following one.

(25) What kinds of variables are involved in your hypotheses, and how
many?

There are essentially two types of answers. One pertains to the infor-
mation value of the variables and we have discussed this in detail in Sec-
tion 1.3.2.2 above. The other allows for four different possible answers.
First, you may only have one dependent variable, in which case, you nor-
mally want to compute a so-called goodness-of-fit test to test whether the
results from your data correspond to other results (from a previous study)
or correspond to a known distribution (such as a normal distribution). Ex-
amples include

− is the ratio of no-negations (e.g., He is no stranger) and not-negations
(e.g., He is not a stranger) in your data 1 (i.e., the two negation types
are equally likely)?
− does the average acceptability judgment you receive for a sentence cor-
respond to that of a previous study?

Second, you may have one dependent and one independent variable or
you may just have two sets of measurements (i.e. two dependent variables).
In both cases you typically want to compute a monofactorial test for inde-
pendence to determine whether the values of one/the independent variable
are correlated with those of the other/dependent variable. For example,

− does the animacy of the referent of the direct object (a categorical inde-
pendent variable) correlate with the choice of one of two postverbal
constituent orders (a categorical dependent variable)?
− does the average acceptability judgment (a mean of a ratio/interval de-
pendent variable) vary as a function of whether the subjects doing the
rating are native speakers or not (a categorical independent variable)?

Third, you may have one dependent and two or more independent vari-
ables, in which case you want to compute a multifactorial analysis (such as
a multiple regression) to determine whether the individual independent
variables and their interactions correlate with, or predict, the dependent
variable. For example,


− does the frequency of a negation type (a categorical dependent variable
with the levels NO vs. NOT; cf. above) depend on the mode of communi-
cation (a binary independent variable with the levels SPOKEN vs.
WRITTEN), the type of verb that is negated (a categorical independent
variable with the levels COPULA, HAVE, or LEXICAL), and/or the interac-
tion of these independent variables?
− does the reaction time to a word w in a lexical decision task (a ratio-
scaled dependent variable) depend on the word class of w (a categorical
independent variable), the frequency of w in a reference corpus (a ra-
tio/interval independent variable), whether the subject has seen a word
semantically related to w on the previous trial or not (a binary independ-
ent variable), whether the subject has seen a word phonologically simi-
lar to w on the previous trial or not (a binary independent variable),
and/or the interactions of these independent variables?

Fourth, you have two or more dependent variables, in which case you
may want to perform a multivariate analysis, which can be exploratory
(such as hierarchical cluster analysis, principal components analysis, factor
analysis, multi-dimensional scaling, etc.) or hypothesis-testing in nature
(MANOVA). For example, if you retrieved from corpus data ten words and
the frequencies of all content words occurring close to them, you can per-
form a cluster analysis to see which of the words behave more (or less)
similarly to each other, which often is correlated with semantic similarity.

(26) Are data points in your data related such that you can associate
them to each other meaningfully and in a principled way?

This question is concerned with whether you have what are called inde-
pendent or dependent samples (and brings us back to the notion of inde-
pendence discussed in Section 1.3.4.1). For example, your two samples –
e.g., the numbers of mistakes made by ten male and ten female non-native
speakers in a grammar test – are independent of each other if you cannot
connect each male subject’s value to that of one female subject on a mean-
ingful and principled basis. You would not be able to do so if you randomly
sampled ten men and ten women and let them take the same test.
There are two ways in which samples can be dependent. One is if you
test subjects more than once, e.g., before and after a treatment. In that case,
you could meaningfully connect each value in the before-treatment sample
to a value in the after-treatment sample, namely connect each subject’s two
values. The samples are dependent because, for instance, if subject #1 is

very intelligent and good at the language tested, then these characteristics
will make his results better than average in both tests, esp. compared to a
subject who is less intelligent and proficient in the language and who will
perform worse in both tests. Recognizing that the samples are dependent
this way will make the test of before-vs.-after treatments more precise.
The second way in which samples may be dependent can be explained
using the above example of ten men and ten women. If the ten men were
the husbands of the ten women, then one would want to consider the sam-
ples dependent. Why? Because spouses are on average more similar to each
other than randomly chosen people: they often have similar IQs, similar
professions, they spend more time with each other than with randomly-
selected people, etc. Thus, one should associate each husband with his
wife, making this two dependent samples.
Independence of data points is often a very important criterion: many
tests assume that data points are independent, and for many tests you must
choose your test depending on what kind of samples you have.

(27) What is the statistic of the dependent variable in the statistical hy-
potheses?

There are essentially five different answers to this question, which were
already mentioned in Section 1.3.2.3 above, too. Your dependent variable
may involve frequencies/counts, central tendencies, dispersions, correla-
tions, or distributions.

(28) What does the distribution of the data or your test statistic look
like? Normal, some other way that can ultimately be described by a
probability function (or a way that can be transformed to look like
a probability function), or some other way?
(29) How big are the samples you collected? n < 30 or n ≥ 30?

These questions relate back to Section 1.3.4, where I explained two
things: First, if your data / test statistics follow a particular probability dis-
tribution, you can often use a computationally simpler parametric test, and
if your data / test statistics don’t, you must often use a non-parametric test.
Second, given sufficient sample sizes, even data from a decidedly non-
normal distribution can begin to look normal and, thus, allow you to apply
parametric tests. It is safer, however, to be very careful and maybe
conservative, and run both types of tests.
Let us now use a graph (<sflwr_navigator.png>) that visualizes this pro-
cess, which you should have downloaded as part of all the files from the
companion website. Let’s exemplify the use of this graph using the above
example scenario: you hypothesize that the average acceptability judgment
(a mean of an ordinal dependent variable) varies as a function of whether
the subjects providing the ratings are native or non-native speakers (a bina-
ry/categorical independent variable).
You start at the rounded red box with approach in it. Then, the above
scenario is a hypothesis-testing scenario so you go down to statistic. Then,
the above scenario involves averages so you go down to the rounded blue
box with mean in it. Then, the hypothesis involves both a dependent and an
independent variable so you go down to the right, via 1 DV 1 IV to the
transparent box with (tests for) independence/difference in it. You got to
that box via the blue box with mean so you continue to the next blue box
containing information value. Now you make two decisions: first, the de-
pendent variable is ordinal in nature. Second, the samples are independent.
Thus, you take the arrow down to the bottom left, which leads to a blue box
with U-test in it. Thus, the typical test for the above question would be the
U-test (to be discussed below), and the R function for that test is already
provided there, too: wilcox.test.
Now, what does the dashed arrow mean that leads towards that box? It
means that you would also do a U-test if your dependent variable was in-
terval/ratio-scaled but violated other assumptions of the t-test. That is,
dashed arrows provide alternative tests for the first-choice test from which
they originate.
Obviously, this graph is a simplification and does not contain every-
thing one would want to know, but I think it can help beginners to make
first choices for tests so I recommend that, as you continue with the book,
you always determine for each section which test to use and how to identify
this on the basis of the graph.
Before we get started, let me remind you once again that in your own
data your nominal/categorical variables should ideally always be coded
with meaningful character strings so that R recognizes them as factors
when reading in the data from a file. Also, I will assume that you have
downloaded the data files from the companion website.

Recommendation(s) for further study


Good and Hardin (2012: Ch. 6) on choosing test statistics


1. Distributions and frequencies

In this section, I will illustrate how to test whether distributions and fre-
quencies from one sample differ significantly from a known distribution
(cf. Section 4.1.1) or from another sample (cf. Section 4.1.2). In both sec-
tions, we begin with variables from the interval/ratio level of measurement
and then proceed to lower levels of measurement.

1.1. Distribution fitting

1.1.1. One dep. variable (ratio-scaled)

In this section, I will discuss how you compare whether the distribution of
one dependent interval-/ratio-scaled variable is significantly different from
a known distribution. I will restrict my attention to one of the most frequent
cases, the situation where you test whether a variable is normally distribut-
ed (because as mentioned above in Section 1.3.4, many statistical tech-
niques require a normal distribution so you must know some test like this).
We will deal with an example from the first language acquisition of
tense and aspect in Russian. Simplifying a bit here, one can often observe a
relatively robust correlation between past tense and perfective aspect as
well as non-past tenses and imperfective aspect. Such a correlation can be
quantified with Cramer’s V values (cf. Stoll and Gries, 2009, and Section
4.2.1 below). Let us assume you studied how this association – the
Cramer’s V values – changes for one child over time. Let us further assume
you had 117 recordings for this child, computed a Cramer’s V value for
each one, and now you want to see whether these are normally distributed.
This scenario involves

− a dependent interval/ratio-scaled variable called TENSEASPECT, consist-
ing of the Cramer’s V values;
− no independent variable because you are not testing whether the distri-
bution of the variable TENSEASPECT is influenced by, or correlated
with, something else.

You can test for normality in several ways. The test we will use is the
Shapiro-Wilk test (remember: check <sflwr_navigator.png> to see how we
get to this test!), which does not really have any assumptions other than
ratio-scaled data and involves the following procedure:


Procedure
− Formulating the hypotheses
− Visualizing the data
− Computing the test statistic W and p

As always, we begin with the hypotheses:

H0: The data points do not differ from a normal distribution; W = 1.
H1: The data points differ from a normal distribution; W ≠ 1.

First, you load the data from <_inputfiles/04-1-1-1_tense-aspect.csv>
and create a graph; the code for the left panel is shown below but you can
also generate the right panel using the code from the code file.

> RussianTensAsp<-read.delim(file.choose())¶
> attach(RussianTensAsp)¶
> hist(TENSE_ASPECT, xlim=c(0, 1), main="", xlab="Tense-Aspect
correlation", ylab="Frequency") # left panel¶

Figure 39. Histogram of the Cramer’s V values reflecting the strengths of the
tense-aspect correlations

At first glance, this looks very much like a normal distribution, but of
course you must do a real test. The Shapiro-Wilk test is rather cumbersome
to compute semi-manually, which is why its manual computation will not
be discussed here (unlike nearly all other monofactorial tests). In R, how-
ever, the computation could not be easier. The relevant function is called
shapiro.test and it only requires one argument, the vector to be tested:

> shapiro.test(TENSE_ASPECT)¶
Shapiro-Wilk normality test
data: TENSE_ASPECT
W = 0.9942, p-value = 0.9132

What does this mean? This simple output teaches an important lesson:
Usually, you want to obtain a significant result, i.e., a p-value that is small-
er than 0.05 because this allows you to accept H1. Here, however, you may
actually welcome an insignificant result because normally-distributed vari-
ables are often easier to handle. The reason for this is again the logic under-
lying the falsification paradigm. When p < 0.05, you reject H0 and accept
H1. But here you ‘want’ H0 to be true because H0 states that the data are
normally distributed. You obtained a p-value of 0.9132, which means you
cannot reject H0 and, thus, consider the data to be normally distributed.
You would therefore summarize this result in the results section of your
paper as follows: “According to a Shapiro-Wilk test, the distribution of this
child’s Cramer’s V values measuring the tense-aspect correlation does not
deviate significantly from normality: W = 0.9942; p = 0.9132.” (In paren-
theses or after a colon you usually mention all statistics that helped you
decide whether or not to accept H1.)
As an alternative to the Shapiro-Wilk test, you can also use a Kolmogo-
rov-Smirnov test for goodness of fit. This test requires the function
ks.test and is more flexible than the Shapiro-Wilk-Test, since it can test
for more than just normality and can also be applied to vectors with more
than 5000 data points. To test the Cramer’s V value for normality, you pro-
vide them as the first argument, then you name the distribution you want to
test against (for normality, "pnorm"), and then, to define the parameters of
the normal distribution, you provide the mean and the standard deviation of
the Cramer’s V values:

> ks.test(TENSE_ASPECT, "pnorm", mean=mean(TENSE_ASPECT),
sd=sd(TENSE_ASPECT))¶
One-sample Kolmogorov-Smirnov test
data: TENSE_ASPECT
D = 0.078, p-value = 0.4752
alternative hypothesis: two-sided

The result is the same as above: the data do not differ significantly from
normality. You also get a warning because ks.test assumes that no two
values in the input are the same, but here some values (e.g., 0.27, 0.41, and
others) are attested more than once; below you will see a quick and dirty
fix for this problem.
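To anticipate that fix very briefly: one quick and dirty possibility – a sketch of my own, not necessarily identical to what the code file does – is to break the ties by adding a tiny bit of random noise with jitter; D and p then change only minimally and the warning disappears (output not shown):

> ks.test(jitter(TENSE_ASPECT), "pnorm", mean=mean(TENSE_ASPECT),
sd=sd(TENSE_ASPECT))¶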

Recommendation(s) for further study


− as alternatives to the above functions, the functions jarqueberaTest
and dagoTest (both from the library fBasics)
− the function mshapiro.test (from the library mvnormtest) to test for
multivariate normality
− the function qqnorm and its documentation (for quantile-quantile plots)
− Crawley (2005: 100f.), Crawley (2007: 316f.), Sheskin (2011: Test 7)

1.1.2. One dep. variable (nominal/categorical)

In this section, we are going to return to an example from Section 1.3, the
constructional alternation of particle placement in English, which is again
represented in (30).

(30) a. He picked up the book. (verb - particle - direct object)
     b. He picked the book up. (verb - direct object - particle)

As you already know, often both constructions are acceptable and native
speakers can often not explain their preference for one of the two. One may
therefore expect that both constructions are equally frequent, and this is
what you are going to test. This scenario involves

− a dependent nominal/categorical variable CONSTRUCTION: VERB-
PARTICLE-OBJECT vs. CONSTRUCTION: VERB-OBJECT-PARTICLE;
− no independent variable, because you do not investigate whether the
distribution of CONSTRUCTION is dependent on anything else.

Such questions are generally investigated with tests from the family of
chi-squared tests, which is one of the most important and widespread tests.
Since there is no independent variable, you test the degree of fit between
your observed and an expected distribution, which should remind you of
Section 3.1.5.2. This test is referred to as the chi-squared goodness-of-fit
test and involves the following steps:


Procedure
− Formulating the hypotheses
− Computing descriptive statistics and visualizing the data
− Computing the frequencies you would expect given H0
− Testing the assumption(s) of the test:
− all observations are independent of each other
− 80% of the expected frequencies are ≥ 5 (see note 17)
− all expected frequencies are > 1
− Computing the contributions to chi-squared for all observed frequencies
− Computing the test statistic χ2, df, and p

The first step is very easy here. As you know, H0 typically postulates
that the data are distributed randomly/evenly, and that means that both
constructions occur equally often, i.e., 50% of the time (just as tossing a
fair coin many times will result in a largely equal distribution). Thus:

H0: The frequencies of the two variable levels of CONSTRUCTION are
identical – if you find a difference in your sample, this difference is
just random variation; nV Part DO = nV DO Part.
H1: The frequencies of the two variable levels of CONSTRUCTION are
not identical; nV Part DO ≠ nV DO Part.

Note that this is a two-tailed H1; no direction of the difference is provid-
ed. Next, you would collect some data and count the occurrences of both
constructions, but we will abbreviate this step and use frequencies reported
in Peters (2001). She conducted an experiment in which subjects described
pictures and obtained the construction frequencies represented in Table 19.

Table 19. Observed construction frequencies of Peters (2001)


Verb - Particle - Direct Object     Verb - Direct Object - Particle
247                                 150

17. This threshold value of 5 is the one most commonly mentioned. There are a few studies
that show that the chi-squared test is fairly robust even if this assumption is violated –
especially when, as is here the case, H0 postulates that the expected frequencies are
equally high (cf. Zar 1999: 470). However, to keep things simple, I stick to the most
common conservative threshold value of 5 and refer you to the literature quoted in Zar.
If your data violate this assumption, then you must compute a binomial test (if, as here,
you have two groups) or a multinomial test (for three or more groups); cf. the recom-
mendations for further study.


Obviously, there is a strong preference for the construction in which the
particle follows the verb directly. At first glance, it seems very unlikely that
H0 could be correct, given these data.
One very important side remark here: beginners often look at something
like Table 19 and say, oh, ok, we have interval/ratio data: 247 and 150.
Why is this wrong?

THINK
BREAK

It’s wrong because Table 19 does not show you the raw data – what it
shows you is already a numerical summary. You don’t have interval/ratio
data – you have an interval/ratio summary of categorical data, because the
numbers 247 and 150 summarize the frequencies of the two levels of the
categorical variable CONSTRUCTION (which you probably obtained from
applying table to a vector/factor). One strategy to not mix this up is to
always conceptually envisage what the raw data table would look like in
the case-by-variable format discussed in Section 1.3.3. In this case, it
would look like this:

Table 20. The case-by-variable version of the data in Table 19


CASE    CONSTRUCTION
1       vpo
2       vpo
…       …
247     vpo
248     vop
…       …
397     vop

From this format, it is quite obvious that the variable CONSTRUCTION is
categorical. So, don’t mix up interval/ratio summaries of categorical data
with interval/ratio data.
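To make the connection to table explicit, the following sketch (mine) reconstructs such a raw-data factor for the frequencies of Table 19 – the level names vpo and vop are just taken over from Table 20 – and summarizes it again:

> CONSTRUCTION<-factor(c(rep("vpo", 247), rep("vop", 150)))¶
> table(CONSTRUCTION)¶
CONSTRUCTION
vop vpo
150 247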
As the first step of our evaluation, you should now have a look at a
graphical representation of the data. A first possibility would be to gener-
ate, say, a dot chart. Thus, you first enter the two frequencies – first the
frequency data, then the names of the frequency data (for the plotting) –
and then you create a dot chart or a bar plot as follows:


> VPCs<-c(247, 150) # VPCs="verb-particle constructions"¶


> names(VPCs)<-c("V-Part-DO", "V-DO-Part")¶
> dotchart(VPCs, xlim=c(0, 250))¶
> barplot(VPCs)¶

The question now of course is whether this preference is statistically
significant or whether it could just as well have arisen by chance. Accord-
ing to the above procedure, you must now compute the frequencies that
follow from H0. In this case, this is easy: since there are altogether 247+150
= 397 constructions, which should be made up of two equally large groups,
you divide 397 by 2:

> VPCs.exp<-rep(sum(VPCs)/length(VPCs), length(VPCs))¶


> VPCs.exp¶
[1] 198.5 198.5

You must now check whether you can actually do a chi-squared test
here, but the observed frequencies are obviously larger than 5 and we as-
sume that Peters’s data points are in fact independent (because we will
assume that each construction has been provided by a different speaker).
We can therefore proceed with the chi-squared test, the computation of
which is fairly straightforward and summarized in (31).

(31) Pearson chi-squared = $\chi^2 = \sum_{i=1}^{n} \dfrac{(\mathrm{observed} - \mathrm{expected})^2}{\mathrm{expected}}$

That is to say, for every value of your frequency table you compute a
so-called contribution to chi-squared by (i) computing the difference be-
tween the observed and the expected frequency, (ii) squaring this differ-
ence, and (iii) dividing that by the expected frequency again. The sum of
these contributions to chi-squared is the test statistic chi-squared. Here, it is
approximately 23.7.

(32) Pearson $\chi^2 = \dfrac{(247 - 198.5)^2}{198.5} + \dfrac{(150 - 198.5)^2}{198.5} \approx 23.7$

> sum(((VPCs-VPCs.exp)^2)/VPCs.exp)¶
[1] 23.70025

Obviously, this value increases as the differences between observed and
expected frequencies increase (because then the numerators become larg-
er). That also means that chi-squared becomes 0 when all observed fre-
quencies correspond to all expected frequencies: then the numerators be-
come 0. Thus, we can simplify our statistical hypotheses to the following:

H0: χ2 = 0.
H1: χ2 > 0.

But the chi-squared value alone does not show you whether the differ-
ences are large enough to be statistically significant. So, what do you do
with this value? Before computers became more widespread, a chi-squared
value was used to look up whether the result is significant or not in a chi-
squared table. Such tables typically have the three standard significance
levels in the columns and different numbers of degrees of freedom (df) in
the rows. Df here is the number of categories minus 1, i.e., df = 2-1 = 1,
because when we have two categories, then one category frequency can
vary freely but the other is fixed (so that we can get the observed number of
elements, here 397). Table 21 is one such chi-squared table for the three
significance levels and df = 1 to 3.

Table 21. Critical χ2-values for ptwo-tailed = 0.05, 0.01, and 0.001 for 1 ≤ df ≤ 3
         p = 0.05   p = 0.01   p = 0.001
df = 1      3.841      6.635      10.828
df = 2      5.991      9.21       13.816
df = 3      7.815     11.345      16.266

You can actually generate those values yourself with the function
qchisq. That function requires three arguments:

− p: the p-value(s) for which you need the critical chi-squared values (for
some df);
− df: the df-value(s) for the p-value for which you need the critical chi-
squared value;
− lower.tail=FALSE: the argument to instruct R to only use the area
under the chi-squared distribution curve that is to the right of / larger
than the observed chi-squared value.

> qchisq(c(0.05, 0.01, 0.001), 1, lower.tail=FALSE)¶


[1] 3.841459 6.634897 10.827566


More advanced users find code to generate all of Table 21 in the code
file. Once you have such a table, you can test your observed chi-squared
value for significance by determining whether it is larger than the chi-
squared value(s) tabulated at the observed number of degrees of freedom.
You begin with the smallest tabulated chi-squared value and compare your
observed chi-squared value with it and continue to do so as long as your
observed value is larger than the tabulated ones. Here, you first check
whether the observed chi-squared is significant at the level of 5%, which is
obviously the case: 23.7 > 3.841. Thus, you can check whether it is also
significant at the level of 1%, which again is the case: 23.7 > 6.635. Thus,
you can finally even check if the observed chi-squared value is maybe even
highly significant, and again this is so: 23.7 > 10.827. You can therefore
reject H0 and the usual way this is reported in your results section is this:
“According to a chi-squared goodness-of-fit test, the frequency distribution
of the two verb-particle constructions deviates highly significantly from the
expected one (χ2 = 23.7; df = 1; ptwo-tailed < 0.001): the construction where
the particle follows the verb directly was observed 247 times although it
was only expected 199 times, and the construction where the particle fol-
lows the direct object was observed only 150 times although it was expected
199 times.”
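Before moving on: one way of generating all of Table 21 yourself – a minimal sketch of mine, which need not be identical to the solution in the code file – uses outer to cross the df-values with the p-values (the output, not shown here, corresponds to Table 21):

> crit<-outer(1:3, c(0.05, 0.01, 0.001), function(dfs, ps)
qchisq(ps, dfs, lower.tail=FALSE))¶
> dimnames(crit)<-list(paste("df =", 1:3), paste("p =", c(0.05, 0.01, 0.001)))¶
> round(crit, 3)¶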
With larger and more complex amounts of data, this semi-manual way
of computation becomes more cumbersome (and error-prone), which is
why we will simplify all this a bit. First, you can of course compute the p-
value directly from the chi-squared value using the mirror function of
qchisq, viz. pchisq, which requires the above three arguments:

> pchisq(23.7, 1, lower.tail=FALSE)¶


[1] 1.125825e-06

As you can see, the level of significance we obtained from our stepwise
comparison using Table 21 is confirmed: p is indeed much smaller than
0.001, namely 0.00000125825. However, there is another even easier way:
why not just do the whole test with one function? The function is called
chisq.test, and in the present case it requires maximally three arguments:

− x: a vector with the observed frequencies;
− p: a vector with the expected percentages (not the frequencies!);
− correct=TRUE or correct=FALSE: when the sample size n is small (15
≤ n ≤ 60), it is sometimes recommended to apply a so-called continuity
correction (after Yates); correct=TRUE is the default setting.18

In this case, this is easy: you already have a vector with the observed
frequencies, the sample size n is much larger than 60, and the expected
probabilities result from H0. Since H0 says the constructions are equally
frequent and since there are just two constructions, the vector of the ex-
pected probabilities contains two times 1/2 = 0.5. Thus:

> chisq.test(VPCs, p=c(0.5, 0.5))¶


Chi-squared test for given probabilities
data: VPCs
X-squared = 23.7003, df = 1, p-value = 1.126e-06

You get the same result as from the manual computation but this time
you immediately also get a p-value. What you do not also get are the ex-
pected frequencies, but these can be obtained very easily, too. The function
chisq.test computes more than it returns. It returns a data structure (a so-
called list) so you can assign a name to this list and then inspect it for its
contents (output not shown):

> test<-chisq.test(VPCs, p=c(0.5, 0.5))¶


> str(test)¶

Thus, if you require the expected frequencies, you just retrieve them
with a $ and the name of the list component you want, and of course you
get the result you already know.

> test$expected¶
[1] 198.5 198.5

Let me finally mention that the above method computes a p-value for a
two-tailed test. There are many tests in R where you can define whether
you want a one-tailed or a two-tailed test. However, this does not work
with the chi-squared test. If you require the critical chi-squared value for
pone-tailed = 0.05 for df = 1, then you must compute the critical chi-squared
value for ptwo-tailed = 0.1 for df = 1 (with qchisq(0.1, 1, lower.tail=
FALSE)¶), since your prior knowledge is rewarded such that a less extreme
result in the predicted direction will be sufficient (cf. Section 1.3.4). Also,
this means that when you need the pone-tailed-value for a chi-square value,
just take half of the ptwo-tailed-value of the same chi-square value. In this
case, if your H1 had been directional, this would have been your p-value.
But again: this works only with df = 1.

> pchisq(23.7, 1, lower.tail=FALSE)/2¶

18. For further options, cf. ?chisq.test¶, formals(chisq.test)¶ or args(chisq.test)¶.

Warning/advice
Above I warned you to never change your hypotheses after you have ob-
tained your results and then sell your study as successful support of the
‘new’ H1. The same logic does not allow you to change your hypothesis
from a two-tailed one to a one-tailed one because your ptwo-tailed = 0.08 (i.e.,
non-significant) so that the corresponding pone-tailed = 0.04 (i.e., significant).
Your choice of a one-tailed hypothesis must be motivated conceptually.
Another hugely important warning: never ever compute a chi-square
test like the above on percentages – always on ‘real’ observed frequencies!

Recommendation(s) for further study


− the functions binom.test or dbinom to compute binomial tests
− the function prop.test (cf. Section 3.1.5.2) to test relative frequencies /
percentages for deviations from expected frequencies / percentages
− the function dmultinom to help compute multinomial tests
− Baayen (2008: Section 4.1.1), Sheskin (2011: Test 8, 9)

1.2. Tests for differences/independence

In Section 4.1.1, we looked at goodness-of-fit tests for distributions and
frequencies – now we turn to tests for differences/independence.

1.2.1. One dep. variable (ordinal/interval/ratio scaled) and one indep.
variable (nominal) (indep. samples)

Let us now look at an example in which two independent samples are com-
pared with regard to their overall distributions. You will test whether men
and women differ with regard to the frequencies of hedges they use in dis-
course (i.e., expressions such as kind of or sort of). Again, note that we are
here only concerned with the overall distributions – not just means or just
variances. We could of course do that, too, but it is possible that the means
are very similar while the variances are not, and a test for differ-
ent means might not uncover the overall distributional difference.


Let us assume you have recorded 60 two-minute conversations between
a confederate of an experimenter, each with one of 30 men and 30 women,
and then counted the numbers of hedges that the male and female subjects
produced. You now want to test whether the distributions of hedge fre-
quencies differs between men and women. This question involves

− an independent nominal/categorical variable, SEX: MALE and SEX:
FEMALE;
− a dependent interval/ratio-scaled variable: the number of hedges produced:
HEDGES.

The question of whether the two sexes differ in terms of the distribu-
tions of hedge frequencies is investigated with the two-sample Kolmogo-
rov-Smirnov test (again, check <sflwr_navigator.png>):

Procedure
− Formulating the hypotheses
− Computing descriptive statistics and visualizing the data
− Testing the assumption(s) of the test: the data are continuous
− Computing the cumulative frequency distributions for both samples, the
maximal absolute difference D of both distributions, and p

First the hypotheses: the text form is straightforward and the statistical
version is based on a test statistic called D, to be explained below.

H0: The distribution of the dependent variable HEDGES does not differ
depending on the levels of the independent variable SEX; D = 0.
H1: The distribution of the dependent variable HEDGES differs depend-
ing on the levels of the independent variable SEX; D > 0.

Before we do the actual test, let us again inspect the data graphically.
You first load the data from <_inputfiles/04-1-2-1_hedges.csv>, check the
data structure (I will usually not show that output here in the book), and
make the variable names available.

> Hedges<-read.delim(file.choose())¶
> str(Hedges)¶
> attach(Hedges)¶


You are interested in the general distribution, so one plot you can create
is a stripchart. In this kind of plot, the frequencies of hedges are plotted
separately for each sex, but to avoid that identical frequencies are plotted
directly onto each other (and can therefore not be distinguished anymore),
you also use the argument method="jitter" to add a tiny value to each
data point, which decreases the chance of overplotted data points (also try
method="stack"). Then, you include the meaningful point of x = 0 on the
x-axis. Finally, with the function rug you add little bars to the x-axis
(side=1) which also get jittered. The result is shown in Figure 40.

> stripchart(HEDGES~SEX, method="jitter", xlim=c(0, 25),
xlab="Number of hedges", ylab="Sex")¶
> rug(jitter(HEDGES), side=1)¶

Figure 40. Stripchart for HEDGES~SEX

It is immediately obvious that the data are distributed quite differently:
the values for women appear to be a little higher on average and more ho-
mogeneous than those of the men. The data for the men also appear to fall
into two groups, a suspicion that also receives some prima facie support
from the following two histograms in Figure 41. (Note that all axis limits
are again defined identically to make the graphs easier to compare.)

> par(mfrow=c(1, 2))¶


> hist(HEDGES[SEX=="M"], xlim=c(0, 25), ylim=c(0, 10), ylab=
"Frequency", main="")¶
> hist(HEDGES[SEX=="F"], xlim=c(0, 25), ylim=c(0, 10), ylab=
"Frequency", main="")¶
> par(mfrow=c(1, 1))¶


Figure 41. Histograms of the number of hedges by men and women

The assumption of continuous data points is not exactly met because
frequencies are discrete – there are no frequencies 3.3, 3.4, etc. – but
HEDGES spans quite a range of values and we could in fact jitter the values
to avoid ties. To test these distributional differences with the Kolmogorov-
Smirnov test, which involves the empirical cumulative distribution of the
data, you first rank-order the data: You sort the values of SEX in the order
in which you need to sort HEDGES, and then do the same to HEDGES itself:

> SEX<-SEX[order(HEDGES)]¶
> HEDGES<-HEDGES[order(HEDGES)]¶

The next step is a little more complex. You must now compute the max-
imum of all differences of the two cumulative distributions of the hedges.
You can do this in three steps: First, you generate a frequency table with
the numbers of hedges in the rows and the sexes in the columns. This table
in turn serves as input to prop.table, which generates a table of column
percentages (hence margin=2; cf. Section 3.2.1, output not shown):

> dists<-prop.table(table(HEDGES, SEX), margin=2); dists¶

This table shows that, say, 10% of all numbers of hedges of men are 4,
but these are of course not cumulative percentages yet. The second step is
therefore to convert these percentages into cumulative percentages. You
can use cumsum to generate the cumulative percentages for both columns
and can even compute the differences in the same line:


> differences<-cumsum(dists[,1])-cumsum(dists[,2])¶

That is, you subtract from every cumulative percentage of the first col-
umn (the values of the women) the corresponding value of the second col-
umn (the values of the men). The third and final step is then to determine
the maximal absolute difference, which is the test statistic D:

> max(abs(differences))¶
[1] 0.4666667

You can then look up this value in a table for Kolmogorov-Smirnov
tests; for a significant result, the computed value must be larger than the
tabulated one. For cases in which both samples are equally large, Table 22
shows the critical D-values for two-tailed Kolmogorov-Smirnov tests
(computed from Sheskin 2011: Table A23).

Table 22. Critical D-values for two-sample Kolmogorov-Smirnov tests


               p = 0.05    p = 0.01
n1 = n2 = 29   0.3571535   0.428059
n1 = n2 = 30   0.3511505   0.4208642
n1 = n2 = 31   0.3454403   0.4140204

Our value of D = 0.4667 is not only significant (D > 0.3511505), but
even very significant (D > 0.4208642). You can therefore reject H0 and
summarize the results: “According to a two-sample Kolmogorov-Smirnov
test, there is a significant difference between the distributions of hedge
frequencies of men and women: women seem to use more hedges and be-
have more homogeneously than the men, who use fewer hedges and whose
data appear to fall into two groups (D = 0.4667, ptwo-tailed < 0.01).”
The logic of this test is not always immediately clear but worth explor-
ing. To that end, we look at a graphical representation. The following lines
plot the two empirical cumulative distribution functions (ecdf) of men (in
black) and women (in grey) as well as a vertical line at position x = 9,
where the largest difference (D = 0.4667) was found. This graph in Figure
42 below shows what the Kolmogorov-Smirnov test reacts to: different
empirical cumulative distributions.

> plot(ecdf(HEDGES[SEX=="M"]), do.points=TRUE, verticals=
TRUE, main="Hedges: men (black) vs. women (grey)",
xlab="Numbers of hedges")¶
> lines(ecdf(HEDGES[SEX=="F"]), do.points=TRUE, verticals=
TRUE, col="darkgrey")¶
> abline(v=9, lty=2)¶

Figure 42. Empirical cumulative distribution functions of the numbers of hedges of men (black) and women (grey)

For example, the fact that the values of the women are higher and more
homogeneous is indicated especially in the left part of the graph where the
low hedge frequencies are located and where the values of the men already
rise but those of the women do not. More than 40% of the values of the
men are located in a range where no hedge frequencies for women were
obtained at all. As a result, the largest difference at position x = 9 arises
where the curve for the men has already risen considerably while the curve
for the women has only just begun to take off. This graph also explains
why H0 postulates D = 0. If the curves are completely identical, there is no
difference between them and D becomes 0.
The above explanation simplified things a bit. First, you do not always
have two-tailed tests and identical sample sizes. Second, identical values –
so-called ties – can complicate the computation of this test (and others).
Fortunately, you do not really have to worry about any of this because the
R function ks.test does everything for you in just one line. You just need
the following arguments:19

− x and y: the two vectors whose distributions you want to compare;

19. Unfortunately, the function ks.test does not take a formula as input.


− alternative="two.sided" for two-tailed tests (the default) or alternative="greater" or alternative="less" for one-sided tests, depending on which H1 you want to test: the argument alternative="…" refers to the first-named vector so that alternative="greater" means that the cumulative distribution function of the first vector is above that of the second (see the sketch after this list).
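
For example, if your H1 had been directional from the start – say, that the men's empirical cumulative distribution function lies above the women's, as Figure 42 above suggests – a sketch of the corresponding call (not part of the original analysis) would be:

> ks.test(HEDGES[SEX=="M"], HEDGES[SEX=="F"], alternative="greater")¶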

When you test a two-tailed H1 as we do here, then the line to enter into
R reduces to the following, and you get the same D-value and the p-value.
(I omitted the warning about ties here but, again, you can use jitter to get
rid of it; cf. the code file.)

> ks.test(HEDGES[SEX=="M"], HEDGES[SEX=="F"])¶
Two-sample Kolmogorov-Smirnov test
data: HEDGES[SEX == "M"] and HEDGES[SEX == "F"]
D = 0.4667, p-value = 0.002908
alternative hypothesis: two-sided
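
If you want to avoid the ties warning altogether, a minimal sketch of the jitter approach mentioned above (and in the code file) looks like this; note that the added noise is random, so D and p will vary very slightly between runs:

> ks.test(jitter(HEDGES[SEX=="M"]), jitter(HEDGES[SEX=="F"]))¶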

Recommendation(s) for further study
− apart from the function mentioned in the text (plot(ecdf(…))), you can also create such graphs with plot.stepfun
− Crawley (2005: 100f.), Crawley (2007: 316f.), Baayen (2008: Section 4.2.1), Sheskin (2011: Test 13)

1.2.2. One dep. variable (nominal/categorical) and one indep. variable (nominal/categorical) (indep. samples)

In Section 4.1.1.2 above, we discussed how you test whether the distribu-
tion of a dependent nominal/categorical variable is significantly different
from another known distribution. A probably more frequent situation is that
you test whether the distribution of one nominal/categorical variable is
dependent on another nominal/categorical variable.
Above, we looked at the frequencies of the two verb-particle construc-
tions. We found that their distribution was not compatible with H0. Howev-
er, we also saw earlier that there are many variables that are correlated with
the constructional choice. One of these is whether the referent of the direct
object is given information, i.e., known from the previous discourse, or not.
Specifically, previous studies found that objects referring to given referents
prefer the position before the particle whereas objects referring to new ref-
erents prefer the position after the particle. We will look at this hypothesis


(for the sake of simplicity as a two-tailed hypothesis). It involves

− a dependent nominal/categorical variable, namely CONSTRUCTION: VERB-PARTICLE-OBJECT vs. CONSTRUCTION: VERB-OBJECT-PARTICLE;
− an independent nominal/categorical variable, namely the givenness of the referent of the direct object: GIVENNESS: GIVEN vs. GIVENNESS: NEW;
− independent samples, because we will assume that, in the data below, any particular constructional choice is unrelated to any other one (this is often far from obvious, but too complex to be discussed here in more detail).

As before, such questions are investigated with chi-squared tests: you test whether the levels of the independent variable result in different frequencies of the levels of the dependent variable. The overall procedure for a chi-squared test for independence is very similar to that of a chi-squared test for goodness of fit, but you will see below that the computation of the expected frequencies is (only superficially) a bit different from above.

Procedure
− Formulating the hypotheses
− Computing descriptive statistics and visualizing the data
− Computing the frequencies you would expect given H0
− Testing the assumption(s) of the test:
− all observations are independent of each other
− 80% of the expected frequencies are ≥ 5 (cf. n. 17)
− all expected frequencies are > 1
− Computing the contributions to chi-squared for all observed frequencies
− Computing the test statistic χ2, df, and p

The hypotheses are simple, especially since we apply what we learned from the chi-squared test for goodness of fit from above:

H0: The frequencies of the levels of the dependent variable CONSTRUCTION do not vary as a function of the levels of the independent variable GIVENNESS; χ2 = 0.
H1: The frequencies of the levels of the dependent variable CONSTRUCTION vary as a function of the levels of the independent variable GIVENNESS; χ2 > 0.


In order to discuss this version of the chi-squared test, we return to the data from Peters (2001). As a matter of fact, the above discussion did not utilize all of Peters's data because I omitted an independent variable, namely GIVENNESS. Peters (2001) did not just study the frequency of the two constructions – she studied what we are going to look at here, namely whether GIVENNESS is correlated with CONSTRUCTION. In the picture-description experiment described above, she manipulated the variable GIVENNESS and obtained the already familiar 397 verb-particle constructions, which patterned as represented in Table 23. (By the way, the cells of such 2-by-2 tables are often referred to with the letters a to d, a being the top left cell (85), b being the top right cell (65), etc.)

Table 23. Observed construction frequencies of Peters (2001)

                          GIVENNESS: GIVEN   GIVENNESS: NEW   Row totals
CONSTRUCTION: V DO PART          85                 65            150
CONSTRUCTION: V PART DO         100                147            247
Column totals                   185                212            397

First, we explore the data graphically. You load the data from
<_inputfiles/04-1-2-2_vpcs.csv>, create a table of the two factors, and get a
first visual impression of the distribution of the data (cf. Figure 43).

> VPCs<-read.delim(file.choose())¶
> str(VPCs); attach(VPCs)¶
> Peters.2001<-table(CONSTRUCTION, GIVENNESS)¶
> plot(CONSTRUCTION~GIVENNESS)¶
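
Should you not have the raw data file at hand, you could alternatively enter the frequencies from Table 23 directly. The following is only a sketch (the level labels are chosen to match the chisq.test output shown further below), and the later computations with Peters.2001 work the same way; mosaicplot produces a mosaic plot comparable to the one from the formula-based plot call:

> Peters.2001<-matrix(c(85, 100, 65, 147), ncol=2, dimnames=
list(CONSTRUCTION=c("V_DO_Part", "V_Part_DO"),
GIVENNESS=c("given", "new")))¶
> mosaicplot(Peters.2001)¶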

Obviously, the differently-colored areas differ in size across rows and columns. To test these differences for significance, we need the frequencies expected from H0. But how do we compute the frequencies predicted by H0? Since this is a central question, we will discuss it in detail. Let us assume Peters had obtained the totals in Table 24. What would the distribution following from H0 look like? Above in Section 4.1.1.2, we said that H0 typically postulates equal frequencies. Thus, you might assume – correctly – that the expected frequencies are those represented in Table 25: all marginal totals are 100 and every variable has two equally frequent levels, so we have 50 in each cell.


Figure 43. Mosaic plot for CONSTRUCTION~GIVENNESS

Table 24. Fictitious observed construction frequencies of Peters (2001)

                          GIVENNESS: GIVEN   GIVENNESS: NEW   Row totals
CONSTRUCTION: V DO PART                                           100
CONSTRUCTION: V PART DO                                           100
Column totals                   100                100            200

Table 25. Fictitious expected construction frequencies of Peters (2001)

                          GIVENNESS: GIVEN   GIVENNESS: NEW   Row totals
CONSTRUCTION: V DO PART          50                 50            100
CONSTRUCTION: V PART DO          50                 50            100
Column totals                   100                100            200

The statistical hypotheses that go beyond just stating whether or not χ2 = 0 would then be:

H0: n(V DO Part & Ref DO = given) = n(V DO Part & Ref DO ≠ given) = n(V Part DO & Ref DO = given) = n(V Part DO & Ref DO ≠ given)
H1: as H0, but there is at least one “≠” instead of an “=”.


However, life is usually not that simple, for example when (a) as in Pe-
ters (2001) not all subjects answer all questions or (b) naturally-observed
data are counted that are not as nicely balanced. Thus, in Peters’s real data,
it does not make sense to simply assume equal frequencies. Put differently,
H0 cannot look like Table 24 because the row totals of Table 23 show that
the different levels of GIVENNESS are not equally frequent. If GIVENNESS
had no influence on CONSTRUCTION, you would expect that the frequencies
of the two constructions for each level of GIVENNESS would exactly reflect
the frequencies of the two constructions in the whole sample. That means
(i) all marginal totals (row/column totals) must remain constant (as they
reflect the numbers of the investigated elements), and (ii) the proportions of
the marginal totals determine the cell frequencies in each row and column.
From this, a rather complex set of hypotheses follows:

H0: n(V DO Part & Ref DO = given) : n(V DO Part & Ref DO ≠ given) ∝
    n(V Part DO & Ref DO = given) : n(V Part DO & Ref DO ≠ given) ∝
    n(Ref DO = given) : n(Ref DO ≠ given) and
    n(V DO Part & Ref DO = given) : n(V Part DO & Ref DO = given) ∝
    n(V DO Part & Ref DO ≠ given) : n(V Part DO & Ref DO ≠ given) ∝
    n(V DO Part) : n(V Part DO)
H1: as H0, but there is at least one “≠” instead of an “=”.

In other words, you cannot simply say, “there are 2·2 = 4 cells and I as-
sume each expected frequency is 397 divided by 4, i.e., approximately
100.” If you did that, the upper row total would amount to nearly 200 – but
that can’t be right since there are only 150 cases of CONSTRUCTION: VERB-
OBJECT-PARTICLE. Thus, you must include this information, that there are
only 150 cases of CONSTRUCTION: VERB-OBJECT-PARTICLE, into the com-
putation of the expected frequencies. The easiest way to do this is using
percentages: there are 150/397 cases of CONSTRUCTION: VERB-OBJECT-
PARTICLE (i.e. 0.3778 = 37.78%). Then, there are 185/397 cases of
GIVENNESS: GIVEN (i.e., 0.466 = 46.6%). If the two variables are independ-
ent of each other, then the probability of their joint occurrence is
0.3778·0.466 = 0.1761. Since there are altogether 397 cases to which this
probability applies, the expected frequency for this combination of variable
levels is 397·0.1761 = 69.91. This logic can be reduced to (33).

(33)    n(expected cell frequency) = (row sum ⋅ column sum) / n
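
To see how (33) plays out in R, here is a small sketch – the object name exp.freqs is mine, not from the original code – which applies the formula to all four cells of Peters.2001 at once; below, you will see that chisq.test returns the same values:

> exp.freqs<-outer(rowSums(Peters.2001), colSums(Peters.2001))/sum(Peters.2001)¶
> round(exp.freqs, 1)  # compare Table 26 below¶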


If you apply this logic to every cell, you get Table 26.

Table 26. Expected construction frequencies of Peters (2001)

                          GIVENNESS: GIVEN   GIVENNESS: NEW   Row totals
CONSTRUCTION: V DO PART         69.9               80.1           150
CONSTRUCTION: V PART DO        115.1              131.9           247
Column totals                    185                212           397

You can immediately see that this table corresponds to the above H0: the
ratios of the values in each row and column are exactly those of the row
totals and column totals respectively. For example, the ratio of 69.9 to 80.1
to 150 is the same as that of 115.1 to 131.9 to 247 and as that of 185 to 212
to 397, and the same is true in the other dimension. Thus, H0 is not “all cell
frequencies are identical” – it is “the ratios of the cell frequencies are equal
(to each other and the respective marginal totals).”
This method to compute expected frequencies can be extended to arbi-
trarily complex frequency tables (see Gries 2009b: Section 5.1). But how
do we test whether these deviate strongly enough from the observed fre-
quencies? Thankfully, we do not need such complicated hypotheses but can
use the simpler versions of χ2 = 0 and χ2 > 0 used above, and the chi-
squared test for independence is identical to the chi-squared goodness-of-fit
test you already know: for each cell, you compute a contribution to chi-
squared and sum those up to get the chi-squared test statistic.
As before, the chi-squared test can only be used when its assumptions are met. The expected frequencies are large enough, and for simplicity's sake we assume here that every subject gave just one sentence so that the observations are independent of each other: for example, the fact that one subject produced a particular sentence on one occasion then does not affect any other subject's formulation. We can therefore proceed as above and compute (the sum of) the contributions to chi-squared on the basis of the same formula, here repeated as (34):

(34)    Pearson χ2 = ∑i=1…n (observed − expected)^2 / expected
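
The contributions to chi-squared can likewise be computed for all cells in one go; again just a sketch, reusing the hypothetical exp.freqs object from the sketch above (alternatively, chisq.test does all of this for you, as shown below):

> contribs<-(Peters.2001-exp.freqs)^2/exp.freqs¶
> round(contribs, 2)  # compare Table 27 below¶
> sum(contribs)  # chi-squared itself, ca. 9.82¶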

The results are shown in Table 27 and the sum of all contributions to chi-squared, chi-squared itself, is 9.82. However, we again need the number of degrees of freedom. For two-dimensional tables and when the expected frequencies are computed on the basis of the observed frequencies as here, the number of degrees of freedom is computed as shown in (35).20

Table 27. Contributions to chi-squared for the data of Peters (2001)

                          GIVENNESS: GIVEN   GIVENNESS: NEW   Row totals
CONSTRUCTION: V DO PART         3.26               2.85
CONSTRUCTION: V PART DO         1.98               1.73
Column totals                                                    9.82

(35) df = (no. of rows-1) ⋅ (no. of columns-1) = (2-1)⋅(2-1) = 1

With both the chi-squared and the df-value, you can look up the result in
a chi-squared table (e.g., Table 28 below, which is the same as Table 21).
As above, if the observed chi-squared value is larger than the one tabulated
for p = 0.05 at the required df-value, then you can reject H0. Here, chi-
squared is not only larger than the critical value for p = 0.05 and df = 1, but
also larger than the critical value for p = 0.01 and df = 1. But, since the chi-
squared value is not also larger than 10.827, the actual p-value is some-
where between 0.01 and 0.001: the result is very, but not highly significant.

Table 28. Critical χ2-values for p (two-tailed) = 0.05, 0.01, and 0.001 for 1 ≤ df ≤ 3

           p = 0.05   p = 0.01   p = 0.001
df = 1       3.841      6.635      10.828
df = 2       5.991      9.21       13.816
df = 3       7.815     11.345      16.266
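
Incidentally, you do not have to rely on a printed table for such critical values: qchisq returns them directly. A quick sketch for df = 1 (the other rows of Table 28 follow analogously):

> round(qchisq(c(0.05, 0.01, 0.001), df=1, lower.tail=FALSE), 3)¶
[1]  3.841  6.635 10.828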

Fortunately, all this is much easier when you use R’s built-in function.
Either you compute just the p-value as before,

> pchisq(9.82, 1, lower.tail=FALSE)¶
[1] 0.001726243

20. In our example, the expected frequencies were computed from the observed frequencies
in the marginal totals. If you compute the expected frequencies not from your observed
data but from some other distribution, the computation of df changes to: df = (number of
rows ⋅ number of columns)-1.


or you use the function chisq.test and do everything in a single step. The
most important arguments for our purposes are:

− x: the two-dimensional table for which you do a chi-squared test;
− correct=TRUE or correct=FALSE; cf. above for the correction.21

> test.Peters<-chisq.test(Peters.2001, correct=FALSE)¶
> test.Peters¶
Pearson's Chi-squared test
data: Peters.2001
X-squared = 9.8191, df = 1, p-value = 0.001727

This is how you obtain expected frequencies or the chi-squared value:

> test.Peters$expected¶
GIVENNESS
CONSTRUCTION given new
V_DO_Part 69.89924 80.10076
V_Part_DO 115.10076 131.89924
> test.Peters$statistic¶
X-squared
9.819132

You now know that GIVENNESS is correlated with CONSTRUCTION, but you do not yet know how strong that effect is, nor which variable level combinations are responsible for this result. As for the effect size, even though you might be tempted to use the size of the chi-squared value or the p-value to quantify the effect, you must not do that. This is because the chi-squared value is dependent on the sample size, as we can easily see:

> chisq.test(Peters.2001*10, correct=FALSE)¶
Pearson's Chi-squared test
data: Peters.2001 * 10
X-squared = 98.1913, df = 1, p-value < 2.2e-16
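
To see that only the sample size has changed, but not the distribution, you can compare the row percentages of both tables – a quick sketch:

> prop.table(Peters.2001, margin=1)¶
> prop.table(Peters.2001*10, margin=1)  # identical row percentages¶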

For effect sizes, this is of course a disadvantage since just because the
sample size is larger, this does not mean that the relation of the values to
each other has changed, too. You can easily verify this by noticing that the
ratios of percentages, for example, have stayed the same. For that reason,
the effect size is often quantified with a coefficient of correlation (called φ
in the case of k×2/m×2 tables or Cramer’s V for k×m tables with k or m >

21. For further options, cf. again ?chisq.test¶. Note also what happens when you enter
summary(Peters.2001)¶.
