Workflow of Statistical Data Analysis
Oliver Kirchkamp
© Oliver Kirchkamp
The workflow of empirical work may seem obvious. It is not. Small initial mistakes can
lead to a lot of hard work afterwards. In this course we discuss some techniques that
will hopefully facilitate the organisation of your empirical work.
This handout provides a summary of the slides from the lecture. It is not supposed
to replace a book.
Many examples in the text are based on the statistical software R. I urge you to try
these examples on your own computer.
Attached to this PDF you find a file wf.zip with some raw data. You
also find a file wf.Rdata with some R functions and some data already in R’s internal
format.
The drawing on the previous page is Albrecht Dürer’s “Der Hafen von Antwerpen”
— an example of workflow in a medieval city.
Contents
1 Introduction 4
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Structure of a paper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Aims of statistical data analysis . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Creativity and chaos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Making the analysis reproducible . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Preserve raw data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.7 Interaction with coauthors . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Digression: R 11
2.1 Installation of R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Types and assignments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Random numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Example Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Basic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7.1 Plotting functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7.2 Empty plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.3 Line type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.4 Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.5 Legends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.6 Auxiliary lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7.7 Axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Fancy math . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.8.1 Several diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.9 Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
[29 July 2016 14:28:20] —3
2.10 Regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3 Organising work 28
3.1 Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.1 Robust scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.2 Robustness towards different computers . . . . . . . . . . . . . 30
3.1.3 Robustness towards changes in context . . . . . . . . . . . . . . 31
3.1.4 Functions increase robustness . . . . . . . . . . . . . . . . . . . . 31
3.2 Calculations that take a lot of time . . . . . . . . . . . . . . . . . . . . . 33
3.3 Nested functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Reproducible randomness . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Recap — writing scripts and using functions . . . . . . . . . . . . . . . 34
3.6 Human readable scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5 Data manipulation 53
5.1 Subsetting data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Merging data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Reshaping data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Preparing Data 58
6.1 Reading data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1.1 Reading z-Tree Output . . . . . . . . . . . . . . . . . . . . . . . . 59
6.1.2 Reading and writing R-Files . . . . . . . . . . . . . . . . . . . . . 60
6.1.3 Reading Stata Files . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.1.4 Reading CSV Files . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.5 Filesize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 Checking Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.1 Range of values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.2 (Joint) distribution of values . . . . . . . . . . . . . . . . . . . . . 71
6.2.3 (Joint) distribution of missings . . . . . . . . . . . . . . . . . . . 74
6.2.4 Checking signatures . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3 Naming variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.4 Labeling (describing) variables . . . . . . . . . . . . . . . . . . . . . . . 75
6.5 Labeling values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.6 Recoding data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.6.1 Replacing values by missings . . . . . . . . . . . . . . . . . . . . 78
6.6.2 Replacing values by other values . . . . . . . . . . . . . . . . . . 79
6.7 Creating new variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.8 Select subsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8 Version control 92
8.1 Problem I – concurrent edits . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2 A “simple” solution: locking . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.3 Problem II – nonlinear work . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.4 Version control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
8.5 Solution to problem II: nonlinear work . . . . . . . . . . . . . . . . . . . 94
8.6 Solution to problem I: concurrent edits . . . . . . . . . . . . . . . . . . . 98
8.7 Edits without conflicts: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.8 Going back in time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.9 git and subversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.10 Steps to set up a subversion repository at the URZ at the FSU Jena . . . 100
8.11 Setting up a subversion repository on your own computer . . . . . . . 101
8.12 Usual workflow with git . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.13 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9 Exercises 102
1 Introduction
1.1 Motivation
Literature: Surprisingly, there is not much literature about the workflow of statistical
data analysis:
General Literature
• J. Scott Long; The Workflow of Data Analysis Using Stata, Stata Press, 2009.
Literate Programming
• Friedrich Leisch; Sweave User Manual.
• Nicola Sartori; Sweave = R · LaTeX²
• Max Kuhn; CRAN Task View: Reproducible Research.
Version control
• Scott Chacon, Ben Straub; Pro Git.
• Ben Collins-Sussman, Brian W. Fitzpatrick, C. Michael Pilato; Version Con-
trol with Subversion.
What is workflow:
• A sequence of operations.
[Diagram: raw data → statistical methods → paper; the workflow connects these steps.]
• We do not tell students how to apply these methods (how to integrate methods
into a “workflow”)
• Why?
• Describe the research question
Which economic model do we use to structure this question?
Which statistical model do we use for inference? (Estimation, hypothesis testing,
classification. . . )
• Replicability
– for us, to understand our data and our methods after we get back to work
after a short break
– for our friends (coauthors), so that they can understand what we are doing
– for our enemies — we should always (even years after) be able to prove our
results exactly
• Assume we have another look at our paper (and our analysis) after a break of 6
months:
– What does it mean if sex==1 ?
– For the variable meanContribution: was the mean taken with respect to
all players and the same period, or with respect to the same player and all
periods, or . . .
– What is the difference between payoff and payoff2. . .
– Do the tables and figures in version 27 of the paper . . .
∗ . . . refer to all periods of the experiment or only to the last 6 periods?
∗ . . . do they include data from the two pilot experiments we ran?
∗ . . . do they refer to the “cleaned” dataset, or to the “cleaned dataset in
long form” (where we eliminated a few outliers)
∗ Do all tables and figures and p-values and t-tests. . . actually refer to
the same data? (or do some include outliers, some not,. . . )
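One way to avoid the sex==1 ambiguity later is to store such codes as labelled factors right away. A minimal sketch (the variable names are made up for illustration):

```r
# A labelled factor documents itself; a bare numeric code does not
sexCode <- c(1, 2, 1, 1)                       # what does 1 mean here?
sex <- factor(sexCode, levels = c(1, 2),
              labels = c("female", "male"))    # now the coding is explicit
table(sex)
```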
Assume we take only 10 not completely obvious decisions between two alterna-
tives during our analysis (which perhaps took us 1 week). . .
. . . → we will have to explore 2¹⁰ = 1024 variants of our analysis (= 1024 weeks) to
recover what we actually did.
Often we take more than 10 not completely obvious decisions.
→ we should follow a workflow that facilitates replicability.
This is not obvious, since workflow is (unfortunately) not linear:
[Diagram: descriptive analysis, get results, and write paper feed into each other rather than following a straight line.]
During this process we create a lot of intermediate results. How can we organise
these results?
• Store everything — not feasible
• Better: a creative (chaotic) life and a permanent (documented) life
Let our computer(s) reflect these two lives:
.../projectXYZ/
/permanent/
/rawData
/cleanData
/R
/Paper
/Slides
/creative/
/cleanData
/R
/Paper
/Slides
Rules
1. Anything that we give to other people (collaborators, journals,. . . ) must come
entirely from permanent
4. We must be able to trace back everything in permanent clearly to our raw data.
Since we give things to other people more than once (first draft, second draft,. . . ,
first revision, . . . , second revision,. . . ), we must be able to replicate each of these in-
stances.
Consequences — permanent data has versions (Below we will discuss the advan-
tages of a version control system (git, svn). Let us assume for a moment that we have
to do everything manually.)
• We will accumulate versions in our permanent life (do not delete them, do not
change them)
cleaned_data_150721.Rdata
cleaned_data_150722.Rdata
cleaned_data_150722b.Rdata
...
preparingData_150721.R
preparingData_150722.R
descriptives_150722.R
econometrics_150723.R
...
paper_150724.Rnw
paper_150725.Rnw
paper_150727.Rnw
...
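The date stamps in such file names need not be typed by hand. A small sketch, following the naming pattern above:

```r
# Build a dated file name for the "permanent" copy of the cleaned data
stamp <- format(Sys.Date(), "%y%m%d")               # e.g. "150722"
fileName <- paste0("cleaned_data_", stamp, ".Rdata")
fileName
```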
What is the optimal workflow? The optimal workflow is different for each of us.
Aims
• Exactness (allow clear replication)
• Efficiency
• Both problems are solved independently:
– Probability of (undiscovered) mistake A: 0.1 · 0.2
– Probability of (undiscovered) mistake B: 0.1 · 0.2
– Probability of some undiscovered mistake: 1 − 0.98² ≈ 0.04
• Both problems are solved with the same routine (one function in your code):
– Probability of some undiscovered mistake: 0.1 · 0.2² = 0.004
Producing your results with the help of identical (and computerised) routines makes
it much easier to discover mistakes.
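The arithmetic from the two cases above, spelled out in R:

```r
p <- 0.1 * 0.2     # P(one routine contains an undiscovered mistake) = 0.02
1 - (1 - p)^2      # two independently written routines: about 0.04
0.1 * 0.2^2        # one shared routine: 0.004
```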
• We keep a log where we document the above steps for a given project on a daily
basis (research log) (nobody wants to keep logs, so this must be easy)
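Keeping the log easy can mean a one-line helper. A sketch (the function and file name are made up, not part of the handout):

```r
# Append a time-stamped entry to a plain-text research log
logEntry <- function(text, file = "researchLog.txt") {
  cat(format(Sys.time(), "%Y-%m-%d %H:%M"), text, "\n",
      file = file, append = TRUE)
}
logEntry("recoded sex, dropped the two pilot sessions")
```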
• If our raw data comes from z-Tree experiments: we had better keep all the programs
(the current version can always be found as @1.ztt,. . . in the working directory).
2 Digression: R
For the purpose of this course we take R as an example of a statistical language.
Even if you use other languages for your work, you will find that the concepts are
similar.
2.1 Installation of R
On the homepage of the R Project you find in the menu on the left a link Download /
CRAN. This link leads to a choice of “mirrors”. If you are in Jena, the GWDG mirror
in Göttingen might be fast. There you also find instructions on how to install R on your
OS.
Installation of Libraries If the command library complains about not being able to
find the required library, then the library is most likely not installed. The command
install.packages("Ecdat")
installs the library Ecdat. Some installations have a menu “Packages” that allows
you to install missing libraries. Users of Microsoft operating systems find support
in the FAQ for Packages.
x <- 4
typeof(x)
[1] "double"
x + x
[1] 8
sqrt(x)
[1] 2
Often our calculations will not only involve a single number (a scalar) but several
numbers which are connected as a vector. Several numbers are combined with c:
x <- c(21,22,23,24,25,16,17,18,19,20)
x
[1] 21 22 23 24 25 16 17 18 19 20
y <- 21:30
y
[1] 21 22 23 24 25 26 27 28 29 30
x[1]
[1] 21
When we want to access several elements at the same time, we simply use several
indices (which are connected with c). We can use this to change the sequence of
values (e.g. to sort).
x[c(3,2,1)]
[1] 23 22 21
x[3:1]
[1] 23 22 21
x
[1] 21 22 23 24 25 16 17 18 19 20
order(x)
[1] 6 7 8 9 10 1 2 3 4 5
x[order(x)]
[1] 16 17 18 19 20 21 22 23 24 25
(order determines an “ordering”, i.e. a sequence in which the elements of the dataset
should be ordered. We use x[...] to see the ordered result.)
Negative indices drop elements:
x[-1:-3]
[1] 24 25 16 17 18 19 20
Logicals Logicals can be either TRUE or FALSE. When we compare a vector with a
number, then all the elements will be compared (this follows from the recycling rule,
see below):
x < 20
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
x [ x < 20 ]
[1] 16 17 18 19
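The recycling rule mentioned above can be seen directly: the shorter vector is repeated until it matches the length of the longer one.

```r
# c(10, 20) is recycled to c(10, 20, 10, 20) before the addition
c(1, 2, 3, 4) + c(10, 20)
# [1] 11 22 13 24
```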
Characters Not only numbers but also character strings can be assigned to a variable:
x <- "Mary"
x
[1] "Mary"
x[3] <- "Lucy"
x
[1] "Mary" NA     "Lucy"
Factors Often it is clumsy to store a string of characters again and again if this string
appears in the dataset several times. We might, e.g., want to store whether an obser-
vation belongs to a man or a woman. This can be done in an efficient way by storing
2 for "male", and 1 for "female".
x <- as.factor(c("male","female","female","male"))
levels(x)
[1] "female" "male"
x[2]
[1] female
Levels: female male
as.numeric(x)
[1] 2 1 1 2
Usually the first level in a factor is the level that comes first on the alphabet. If we
do not want this, we can relevel a factor:
x<-relevel(x,"male")
x
[1] male   female female male
Levels: male female
as.numeric(x)
[1] 1 2 2 1
Sometimes, when we have more than only two levels, we want to order levels of a
factor along a third variable. This is done by reorder.
y <- c(12,7,8,11)
reorder(x,y)
2.3 Functions
R knows many built-in functions:
mean(x)
median(x)
max(x)
min(x)
length(x)
unique(c(1,2,3,4,1,1,1))
We can also define our own functions:
square <- function(x) {
x*x
}
The last expression in a function (here x*x) is the return value. Now we can use the
function.
square(7)
[1] 49
Functions in R typically work on entire vectors:
range <- 1:10
square(range)
[1] 1 4 9 16 25 36 49 64 81 100
sapply(range,function(x) x*x)
[1] 1 4 9 16 25 36 49 64 81 100
2.4 Random numbers
Random numbers can be generated for rather different distributions. R calculates
pseudo-random numbers, i.e. R picks numbers from a very long list that appears
random. Where we start in this long list is determined by set.seed:
set.seed(123)
rnorm(10)
We get the same list when we initialise the list with the same starting value:
set.seed(123)
rnorm(10)
This is very useful, when we want to replicate the same “random” results.
10 uniformly distributed random numbers from the interval [100, 200] can be ob-
tained with
runif(10,min=100,max=200)
replicate(10,mean(rnorm(100)))
takes 10 times the mean of each 100 pseudo-normally distributed random num-
bers.
2.5 Example Datasets
To save memory and time R does not load all libraries initially. The command library allows us to
load a library; data loads one of its example datasets:
library(Ecdat)
data(BudgetFood)
Usually we do not want to see many numbers. Instead we want to derive (in a
structured way) a few numbers (parameters, confidence intervals, p-values,. . . ).
The command help aids us in finding out the meaning of the numbers in the different
columns of a dataset.
help(BudgetFood)
How can we access specific columns from our dataset? Since R may have several
datasets at the same time in its memory, there are several possibilities. One possibility
is to append the name of the dataset BudgetFood with a $ and then the name of the
column.
BudgetFood$age
[1] 43 40 28 60 37 35 40 68 43 51 43 48 51 58 61 53 58 64 50 50 47 76 49 44 49
[26] 51 56 63 30 70 29 60 50 56 36 46 43 32 45 34
[ reached getOption("max.print") -- omitted 23932 entries ]
This is helpful when we work with several different datasets at the same time.
The example also shows that R does not flood our screen with long lists of num-
bers. Instead we only see the first few numbers, and then the text “omitted ...
entries”.
When we want to use only one dataset, then the command attach is helpful.
attach(BudgetFood)
age
[1] 43 40 28 60 37 35 40 68 43 51 43 48 51 58 61 53 58 64 50 50 47 76 49 44 49
[26] 51 56 63 30 70 29 60 50 56 36 46 43 32 45 34
[ reached getOption("max.print") -- omitted 23932 entries ]
From now on, all variables will first be searched for in the dataset BudgetFood. When
we no longer want this, we say
detach(BudgetFood)
Alternatively, the command with evaluates an expression in the context of a dataset:
with(BudgetFood,age)
[1] 43 40 28 60 37 35 40 68 43 51 43 48 51 58 61 53 58 64 50 50 47 76 49 44 49
[26] 51 56 63 30 70 29 60 50 56 36 46 43 32 45 34
[ reached getOption("max.print") -- omitted 23932 entries ]
We often use with when we use a function and want to refer to a specific dataset in
this function. E.g. hist shows a histogram:
with(BudgetFood,hist(age))
[Figure: “Histogram of age” — Frequency vs. age]
Most commands have several options which allow you to fine-tune the result. Have
a look at the help-page for hist (you can do this with help(hist)). Perhaps you
prefer the following graph:
with(BudgetFood,hist(age,breaks=40,xlab="Age [years]",col=gray(.7),main="Spain"))
[Figure: histogram titled “Spain” — Frequency vs. Age [years]]
2.6 Graphs
There is more than one way to represent numbers as graphs.
with(BudgetFood, {
hist(age)
plot(density(age))
boxplot(age ~ sex,main="Boxplot")
})
[Figure: histogram, density plot, and boxplot of age]
x <- sample(BudgetFood$age,100)
plot(ecdf(x),main="ecdf")
qqnorm(x)
qqline(x)
[Figure: empirical CDF of x and a normal Q-Q plot]
• Sometimes it is obvious how to prepare our data for these functions. Sometimes
it is more complicated. Then other commands help and calculate an object that
can be plotted (with plot)
curve(dchisq(x,3),from=0,to=10)
[Figure: the density dchisq(x, 3) plotted from 0 to 10]
plot(NULL,xlim=c(0,10),ylim=c(-3,6),xlab="x",ylab="y",main="empty plot")
[Figure: an empty plot titled “empty plot”]
plot(NULL,ylim=c(1,6),xlim=c(0,1),xaxt="n",ylab="lty",las=1)
sapply(1:6,function(lty) abline(h=lty,lty=lty,lwd=5))
[Figure: horizontal lines with line types lty = 1, . . . , 6]
2.7.4 Points
range=1:20
plot(range,range/range,pch=range,frame=FALSE)
text(range,range/range+.2,range)
[Figure: the plotting characters pch = 1, . . . , 20, each labelled with its number]
2.7.5 Legends
When we use more than one line or more than one symbol in our plot we have to
explain their meaning. This is done in a legend.
Usually legend gets as an option a vector of linetypes lty and symbols pch. They
will be used to construct example lines and symbols next to the actual text of the
legend. If the lty or pch is NA, then no line or point is drawn.
plot(NULL,xlim=c(0,10),ylim=c(-3,6),xlab="x",ylab="y",main="empty plot")
legend("topleft",c("Text 1","more Text","even more"),lty=1:3,pch=1:3)
legend("bottomright",c("no line no symbol","line only","line and symbol","symbol only"),lty=c(NA,1,1,NA),pch=c(NA,NA,1,1))
c Oliver Kirchkamp
[29 July 2016 14:28:20] — 23
[Figure: the empty plot with one legend at the top left (“Text 1”, “more Text”, “even more”) and one at the bottom right (“no line no symbol”, “line only”, “line and symbol”, “symbol only”)]
plot(NULL,xlim=c(0,10),ylim=c(-3,6),xlab="x",ylab="y",main="main title")
abline(h=2:6,lty="dotted")
abline(v=5,lty="dashed")
abline(a=-1,b=1,lwd=5,col=grey(.7))
legend("bottomright",c("h","v","a/b"),lty=c("dotted","dashed","solid"),col=c("black","black",grey(.7)),lwd=c(1,1,5))
[Figure: plot titled “main title” with dotted horizontal lines, a dashed vertical line, a thick grey diagonal, and a legend]
Note that these arguments can be vectors if we want to draw several lines at the
same time.
2.7.7 Axes
The options log=’x’, log=’y’ or log=’xy’ determine which axis is shown
on a logarithmic scale.
data(PE,package="Ecdat")
xx<-as.data.frame(PE)
attach(xx)
[Figure: price vs. earnings, once on linear axes and twice with logarithmic axes]
To gain more flexibility, axis can draw a wide range of axes. Before using axis,
the default axes can be removed entirely (axes=FALSE) or suppressed selectively
(xaxt="n" or yaxt="n").
plot(price, earnings, log="xy", axes=FALSE)

plot(price, earnings, log="xy", xaxt="n")

plot(price, earnings, log="xy", xaxt="n")
axis(1,at=c(5,10,20,40,80,160,320,640,1280))
[Figure: the three variants: no axes at all, no x axis, and a custom x axis]
If we specify a lot of axis labels, as in the example above, R does not print them all
if they overlap.
R can also render more than only textual labels with plotmath:
plot(price, earnings,xlab=’$\\pi_1$’,ylab=’$\\hat{\\gamma}_0$’,
main="the $\\int_\\theta^{\\bar{\\theta}} \\sqrt{\\xi} d\\phi$")
abline(lm(earnings~price))
legend("bottomright",c("legend","$\\xi^2$","line $\\phi$"),pch=c(NA,1,NA),lty=c(NA,NA,1))
[Figure: plot with mathematical expressions as title, axis labels, and legend]
Diagrams side by side To put several diagrams on one plot side by side we can call
par(mfrow=c(...)) or layout or split.screen.
par(mfrow=c(1,2))
with(BudgetFood, {
hist(age)
plot(density(age))
})
[Figure: histogram and density plot of age side by side]
Superimposed graphs
• Anything that can create lines or points (like density or ecdf) can immedi-
ately be added to an existing plot.
• Plot-objects that would otherwise create a new figure (like plot, hist, or curve)
can be added to an existing plot with the optional parameter add=TRUE.
with(BudgetFood, {
plot(density(age),lwd=2)
lines(density(age[sex=="man"],na.rm=TRUE),lty=3,lwd=2)
hist(age,freq=FALSE,add=TRUE)
curve(dnorm(x,mean(age),sd(age)), add = TRUE,lty=2)
})
[Figure: density of age with a superimposed histogram and a normal density]
2.9 Tables
The command table counts how often each combination of values occurs:
with(BudgetFood,table(sex,age))
age
sex 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
man 3 6 21 21 36 37 87 100 132 201 210 248 254 329 367 363
woman 0 2 7 9 12 21 19 21 22 26 18 28 10 25 28 12
Other statistics The command aggregate groups our data by levels of one or several
factors and applies a function to each group. In the following example the factor is
sex, the function is the mean which is applied to the variable age.
with(BudgetFood,aggregate(age ~ sex,FUN=mean))
sex age
1 man 49.08985
2 woman 59.47445
2.10 Regressions
Simple regressions can be estimated with lm. The operator ~ allows us to describe
the regression equation. The dependent variable is written on the left side of ~, the
independent variables are written on the right side of ~.
lm (wfood ~ totexp,data=BudgetFood)
Call:
lm(formula = wfood ~ totexp, data = BudgetFood)
Coefficients:
(Intercept) totexp
0.4950397225 -0.0000001348
The result is a bit terse. More details are shown with the command summary.
summary(lm(wfood ~ totexp,data=BudgetFood))
Call:
lm(formula = wfood ~ totexp, data = BudgetFood)
Residuals:
Min 1Q Median 3Q Max
-0.49307 -0.09374 -0.01002 0.08617 1.06182
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.495039722500 0.001561819134 316.96 <2e-16 ***
totexp -0.000000134849 0.000000001459 -92.41 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
options(browser = "/usr/bin/iceweasel")
in .Rprofile makes sure that the help system of R always uses iceweasel.
Also when we quit R with the command q(), the application tries to make our life
easier.
q()
R first asks us
Save workspace image? [y/n/c]:
Here we have the possibility to save all the data that we currently use (and that are in
our workspace) in a file .Rdata in the current working directory. When we start R the
next time (from this directory), R automatically reads this file and we can continue
our work.
3 Organising work
3.1 Scripts
Most of the practical work in data analysis and statistics can be seen as a sequence of
commands to a statistical software.
How can we run these commands?
• Execute a command in the command window (or with mouse and dialog boxes)
– clumsy
– hard to replicate what we did and why we did it (logs don’t really help).
– hard to find mistakes (structure of the mistake is easy to overlook).
• Write a file (.R or .do) and execute single lines (or small regions) from the file
while editing the file.
– great way to creatively develop code line by line. Not reproducible, since
the file changes all the time.
– one window with the file, another window with mainly the R output
• Write a source file (.R or .do), open it in an editor and then always execute the
entire file (while editing the file).
– great way to creatively develop larger parts of code
• Source “public” files (.R or .do) from a “master file”
source("read_data_160715.R")
source("clean_data_160715.R")
source("create_figures_160715.R")
This is the first step to reproducible research. When our script seems to do what
it is supposed to do, we make it “public”, give it a unique name, and never
change it again.
• From a master file, first source a file which defines functions. Then call these
functions.
source("functions_XYZ_160715.R")
read_data()
clean_data()
create_figures()
How can we make our scripts “robust”? Remember:
• The structure of the data may change over time.
– New variables might come with new treatments of our experiment.
– New treatments might require that we code variables differently.
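One cheap safeguard is to let the script check its assumptions about the data explicitly. A sketch (the column names and codings are hypothetical):

```r
# Fail loudly, and early, when the data structure has changed
checkData <- function(data) {
  stopifnot(all(c("sex", "payoff", "period") %in% colnames(data)))
  stopifnot(all(data$sex %in% c("male", "female")))
  invisible(data)
}
```

Called at the top of a script, such a check turns a silent coding change into an immediate error.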
next to it we have
/home/oliver/projectXYZ/data/munich/1998/test.Rdata
load(file="/home/oliver/projectXYZ/data/munich/1998/test.Rdata")
or (relative path)
load(file="../data/munich/1998/test.Rdata")
setwd("../data/munich/1998/")
...
load(file="test.Rdata")
(and remember to make the setwd relative, i.e. avoid the following:
setwd("/home/oliver/projectXYZ/data/munich/1998/")
...
).
# script1.R
load("someData.Rdata")
# now two variables, x and y are defined
source("script2.R")
# script2.R
est <- lm ( y ~ x)
In this example script2.R assumes that variables y and x are defined. As long as
script2.R is called in this context, everything is fine.
Changing script1.R might have unexpected side effects since we transport vari-
ables from one script to the other. The call
source("script2.R")
only works as long as its context defines x and y. A safer alternative is to pass variables
explicitly as function arguments:
# script1.R
source("script2.R")
load("someData.Rdata")
myFunction(y,x)
# script2.R
# defines myFunction
myFunction <- function(y,x) {
est <<- lm ( y ~ x)
}
Now script2.R only defines a function. The function has arguments; hence, when
we use it in script1.R, we see which variable goes where. This is more elegant (and
less risky) than writing functions like this one:
# script2.R
# defines myFunction
myFunction <- function() {
est <<- lm ( y ~ x)
}
# script1.R
source("script2.R")
load("someData.Rdata")
x <- ...
y <- ...
myFunction()
It will still work, but later it will be less clear to us that the assignments before the
function call are essential for the function.
This function also has a side effect: it changes a variable est outside the function. Often
it is less confusing to define functions with return values and no side effects.
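A version of myFunction with a return value and no side effect might look like this (a sketch, not the handout's code):

```r
# script2.R -- return the estimate instead of assigning est with <<-
myFunction <- function(y, x) {
  lm(y ~ x)                  # the last expression is the return value
}

# script1.R -- the assignment now happens, visibly, at the call site:
# est <- myFunction(y, x)
```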
Recap
• Functions which only use arguments and return values: often better
If a sequence of functions takes a lot of time to run, let it generate intermediate data.
Our master-R-file could look like this:
set.seed(123)
...
source("projectXYZ_init_160715.R")
getAndCleanData() # takes a lot of time
save(cleanedData,file="cleanedData160715.Rdata")
load("cleanedData160715.Rdata")
doBootstrap() # takes a lot of time
save(bsData,file="bsData160715.Rdata")
load("cleanedData160715.Rdata")
load("bsData160715.Rdata")
doFigures()
...
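The save/load pairs above can be wrapped into a small caching helper. A sketch (the function name cached and its arguments are made up):

```r
# Recompute a slow step only when no cached file exists yet
cached <- function(cacheFile, slowStep) {
  if (!file.exists(cacheFile)) {
    result <- slowStep()
    save(result, file = cacheFile)
  }
  load(cacheFile)      # restores `result` from the cache
  result
}
```

For example, cleanedData <- cached("cleanedData160715.Rdata", getAndCleanData) behaves like the explicit save/load above.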
Actually, if we need some functions only within a specific other function then we
can define them within this function:
...
doAnalysis <- function () {
firstStep <- function() {
...
}
secondStep <- function() {
...
}
firstStep()
secondStep()
thirdStep()
...
}
• Advantage: these functions are only visible from within doAnalysis and can do
no harm elsewhere (where we, perhaps, defined functions with the same name
that do different things).
N <- 100
profit88 <- rnorm(N)
profit89 <- rnorm(N)
profit98 <- rnorm(N)
myData <- as.data.frame(cbind(profit88,profit89,profit98))
Compare
t.test(profit88,data=myData)$p.value
t.test(profit89,data=myData)$p.value
t.test(profit98,data=myData)$p.value
with
sapply(grep("profit",colnames(myData),value=TRUE),function(x)
t.test(myData[,x])$p.value)
#
# to detect outliers we use lrt-method.
# We have tried depth.trim and depth.pond
# but they produce implausible results...
outl <- foutliers(data,method="lrt")
• Formatting
• Variable names: short but not too short
• Abbreviations in scripts
R (and other languages too) allows you to refer to parameters in functions by
name, and such names may be abbreviated as long as they remain unique:
qnorm(p=.01,lower.tail=FALSE)
[1] 2.326348
qnorm(p=.01,low=FALSE)
[1] 2.326348
library(Ecdat)
data(Kakadu)
head(Kakadu)
lower upper answer recparks jobs lowrisk wildlife future aboriginal finben
1 0 2 nn 3 1 5 5 1 1 1
mineparks moreparks gov envcon vparks tvenv conservation sex age schooling
1 4 5 1 yes yes 1 no male 27 3
income major
1 25 no
[ reached getOption("max.print") -- omitted 5 rows ]
Assume we have defined a small function sqMean which squares the mean of a vector:
sqMean <- function(x) mean(x)^2
sqMean(Kakadu$lower)
[1] 2361.471
(xx <- sample(Kakadu$lower,3))
[1] 100 0 20
sqMean(xx)
[1] 1600
Assume that we still do not trust the function. debug allows us to debug a function.
ls allows us to list the variables in the current environment.
debug(sqMean)
sqMean(xx)
undebug(sqMean)
options(error=recover)
In the following function we refer to the variable xxx, which is not defined. The
function will, hence, fail. With options(error=recover) we can inspect the function
at the time of the failure.
sqMean <- function(x) {
mean(xxx)
}
sqMean(xx)
1: sqMean(xx)
2: #2: mean(xxx)
Selection: 1
Called from: top level
Browse[1]> xxx
Error during wrapup: object ’xxx’ not found
Browse[1]> x
[1] 20 0 0 250 100 50 20 50 50 100
Browse[1]> Q
Whenever things repeat, we define them in variables at the top of the paper:
a b c
(Intercept) 2.122∗∗∗ 2.765∗∗∗ 2.648∗∗∗
(0.035) (0.065) (0.076)
income 0.003∗ 0.003∗ 0.002
(0.001) (0.001) (0.001)
age −0.013∗∗∗ −0.012∗∗∗
(0.001) (0.001)
sex: male/female −0.196∗∗∗ −0.190∗∗∗
(0.043) (0.043)
conservation: yes/no 0.215∗∗
(0.083)
vparks: yes/no 0.120∗
(0.047)
R-squared 0.0 0.1 0.1
adj. R-squared 0.0 0.1 0.1
sigma 0.9 0.9 0.9
F 4.7 47.6 31.7
p 0.0 0.0 0.0
Log-likelihood −2402.8 −2336.2 −2328.8
Deviance 1484.5 1380.1 1369.1
AIC 4811.5 4682.3 4671.7
BIC 4828.0 4709.9 4710.3
N 1827 1827 1827
Now we use the same explanatory variables to explain a different dependent vari-
able:
The object returned by lm is a list; str shows its structure:
List of 12
$ coefficients : Named num [1:2] 2.12202 0.00278
..- attr(*, "names")= chr [1:2] "(Intercept)" "income"
$ residuals : Named num [1:1827] -1.19 -1.15 -1.19 -1.19 -1.22 ...
..- attr(*, "names")= chr [1:1827] "1" "2" "3" "4" ...
$ effects : Named num [1:1827] -93.28 1.95 -1.17 -1.17 -1.21 ...
..- attr(*, "names")= chr [1:1827] "(Intercept)" "income" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:1827] 2.19 2.15 2.19 2.19 2.22 ...
..- attr(*, "names")= chr [1:1827] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:1827, 1:2] -42.7434 0.0234 0.0234 0.0234 0.0234 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:1827] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "income"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.02 1.02
..$ pivot: int [1:2] 1 2
..$ tol : num 0.0000001
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 1825
$ xlevels : Named list()
$ call : language lm(formula = paste("as.integer(answer) ~ ", m), data = Kakadu)
$ terms :Classes ’terms’, ’formula’ language as.integer(answer) ~ income
.. ..- attr(*, "variables")= language list(as.integer(answer), income)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "as.integer(answer)" "income"
.. .. .. ..$ : chr "income"
.. ..- attr(*, "term.labels")= chr "income"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: 0x6c0fc80>
.. ..- attr(*, "predvars")= language list(as.integer(answer), income)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "as.integer(answer)" "income"
$ model :’data.frame’: 1827 obs. of 2 variables:
..$ as.integer(answer): int [1:1827] 1 1 1 1 1 1 1 1 1 1 ...
..$ income : num [1:1827] 25 9 25 25 35 27 25 25 35 25 ...
..- attr(*, "terms")=Classes ’terms’, ’formula’ language as.integer(answer) ~ income
.. .. ..- attr(*, "variables")= language list(as.integer(answer), income)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "as.integer(answer)" "income"
.. .. .. .. ..$ : chr "income"
.. .. ..- attr(*, "term.labels")= chr "income"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: 0x6c0fc80>
.. .. ..- attr(*, "predvars")= language list(as.integer(answer), income)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
There are at least two ways to extract data from these objects:
• Extractor functions
coef(lm1)
(Intercept) income
2.122018102 0.002781938
vcov(lm1)
(Intercept) income
(Intercept) 0.00121806075 -0.000035685812
income -0.00003568581 0.000001647787
hccm(lm1)
(Intercept) income
(Intercept) 0.00123366056 -0.000036812592
income -0.00003681259 0.000001719666
logLik(lm1)
effects(lm1)
fitted.values(lm1)
residuals(lm1)
lm1$coefficients
lm1$residuals
lm1$fitted.values
lm1$residuals
Note: Some interesting values are not provided by the lm-object itself. These can
often be accessed as part of the summary-object.
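For instance (a small sketch using R's builtin cars data rather than the models above):

```r
## R^2, the residual standard error, and the coefficient table with
## standard errors live in the summary object, not in the lm object
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)
s$r.squared  # R^2
s$sigma      # residual standard error
coef(s)      # estimates, standard errors, t- and p-values
```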
Looping The simplest way to repeat a command is a loop:
for (i in 1:10) print(i)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
for (i in 1:10) {
x <- runif(i)
print(mean(x))
}
[1] 0.3565607
[1] 0.9663778
[1] 0.5063639
[1] 0.4378409
[1] 0.487012
[1] 0.5853594
[1] 0.3502112
[1] 0.499148
[1] 0.5078825
[1] 0.4557163
sapply(1:10,print)
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 1 2 3 4 5 6 7 8 9 10
sapply(1:10,function(i) {
x <- runif(i)
mean(x)
})
Note that sapply already returns a vector which is in many cases what we want
anyway.
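When the repeated expression does not use the index at all, replicate is a convenient shorthand (a sketch):

```r
## ten means of ten uniform draws, returned as a vector
set.seed(123)  # only to make the example reproducible
replicate(10, mean(runif(10)))
```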
In the above examples we applied a function to a vector. Sometimes we want to
apply functions to a matrix.
apply(Kakadu,2,function(x) mean(as.integer(x)))
cbind(apply(Kakadu,2,function(x) mean(as.integer(x))))
[,1]
lower 48.594964
upper 536.714286
answer NA
recparks 3.688560
jobs 2.592228
lowrisk 2.790367
wildlife 4.739464
future 4.466886
aboriginal 3.569787
finben 2.915709
mineparks 3.643678
moreparks 3.864806
gov 1.083196
envcon NA
vparks NA
tvenv 1.785441
conservation NA
sex NA
age 42.968254
schooling 3.683634
income 21.656814
major NA
wide:
  a b c
A 1 2 3
B 4 5 6

long:
hor vert x
a   A    1
b   A    2
c   A    3
a   B    4
b   B    5
c   B    6
Ragged array:
wide:
  a b c
A   2 3
B 4 5

long:
hor vert x
b   A    2
c   A    3
a   B    4
b   B    5
data(Fatality)
head(Fatality)
state year mrall beertax mlda jaild comserd vmiles unrate perinc
1 1 1982 2.12836 1.539379 19.00 no no 7.233887 14.4 10544.15
2 1 1983 2.34848 1.788991 19.00 no no 7.836348 13.7 10732.80
3 1 1984 2.33643 1.714286 19.00 no no 8.262990 11.1 11108.79
4 1 1985 2.19348 1.652542 19.67 no no 8.726917 8.9 11332.63
[ reached getOption("max.print") -- omitted 2 rows ]
by(Fatality,list(Fatality$year),function(x) mean(x$mrall))
: 1982
[1] 2.089106
------------------------------------------------------------
: 1983
[1] 2.007846
------------------------------------------------------------
: 1984
[1] 2.017122
------------------------------------------------------------
: 1985
[1] 1.973671
------------------------------------------------------------
: 1986
[1] 2.065071
------------------------------------------------------------
: 1987
[1] 2.060696
------------------------------------------------------------
: 1988
[1] 2.069594
by(Fatality,list(Fatality$state),function(x) mean(x$mrall))
: 1
[1] 2.412627
------------------------------------------------------------
: 4
[1] 2.7059
------------------------------------------------------------
: 5
[1] 2.435336
------------------------------------------------------------
: 6
[1] 1.904977
------------------------------------------------------------
: 8
[1] 1.866981
------------------------------------------------------------
: 9
[1] 1.463509
------------------------------------------------------------
: 10
[1] 2.068231
------------------------------------------------------------
: 12
[1] 2.477799
------------------------------------------------------------
: 13
[1] 2.401569
------------------------------------------------------------
: 16
[1] 2.571667
------------------------------------------------------------
: 17
[1] 1.405084
------------------------------------------------------------
: 18
[1] 1.834221
------------------------------------------------------------
: 19
[1] 1.679544
------------------------------------------------------------
: 20
[1] 1.969664
------------------------------------------------------------
: 21
[1] 2.133043
------------------------------------------------------------
: 22
[1] 2.120829
------------------------------------------------------------
: 23
[1] 1.87013
------------------------------------------------------------
: 24
[1] 1.629377
------------------------------------------------------------
: 25
[1] 1.199393
------------------------------------------------------------
: 26
[1] 1.672087
------------------------------------------------------------
: 27
[1] 1.370441
------------------------------------------------------------
: 28
[1] 2.761846
------------------------------------------------------------
: 29
[1] 1.977451
------------------------------------------------------------
: 30
[1] 2.903021
------------------------------------------------------------
: 31
[1] 1.685413
------------------------------------------------------------
: 32
[1] 2.74526
------------------------------------------------------------
: 33
[1] 1.798824
------------------------------------------------------------
: 34
[1] 1.319227
------------------------------------------------------------
: 35
[1] 3.653197
------------------------------------------------------------
: 36
[1] 1.207581
------------------------------------------------------------
: 37
[1] 2.34471
------------------------------------------------------------
: 38
[1] 1.601454
------------------------------------------------------------
: 39
[1] 1.550474
------------------------------------------------------------
: 40
[1] 2.33993
------------------------------------------------------------
: 41
[1] 2.177147
------------------------------------------------------------
: 42
[1] 1.541673
------------------------------------------------------------
: 44
[1] 1.110077
------------------------------------------------------------
: 45
[1] 2.821669
------------------------------------------------------------
: 46
[1] 2.04929
------------------------------------------------------------
: 47
[1] 2.403066
------------------------------------------------------------
: 48
[1] 2.27587
------------------------------------------------------------
: 49
[1] 1.835836
------------------------------------------------------------
: 50
[1] 2.092991
------------------------------------------------------------
: 51
[1] 1.740946
------------------------------------------------------------
: 53
[1] 1.677211
------------------------------------------------------------
: 54
[1] 2.300624
------------------------------------------------------------
: 55
[1] 1.616567
------------------------------------------------------------
: 56
[1] 3.217534
by does not return a vector but an object of class by. If we actually need a vector we
have to use c and sapply.
In the following example we let by actually return two values.
byObj <- by(Fatality,list(Fatality$year),
function(x) c(year=median(x$year),
fatality=mean(x$mrall),
meanbeertax=mean(x$beertax)))
sapply(byObj,c)
xx<-data.frame(t(sapply(byObj,c)))
xyplot(fatality ~ meanbeertax,type="l",data=xx)+
layer(with(xx,panel.text(label=year,y=fatality,x=meanbeertax,adj=c(1,1))))
[Figure: mean fatality rate (fatality) against mean beer tax (meanbeertax); each point is one year, 1982–1988]
We can also let by return an entire regression per group. To get only the coefficients from the regression (and not fitted values, residuals, etc.), we use coef:
xx<-data.frame(t(sapply(byObj,coef)))
xyplot(beertax ~ jaildyes,type="l",data=xx)+
layer(with(xx,panel.text(label=rownames(xx),y=beertax,x=jaildyes,adj=c(1,1))))
[Figure: estimated beertax coefficient against estimated jaildyes coefficient; each point is one year, 1982–1988]
by is very complex. It offers the entire subset of the dataframe, as defined by the
index variable, to the function.
Sometimes we simply want to apply a function to a single vector along a ragged array.
with(Fatality,aggregate(mrall~year,FUN=mean))
year mrall
1 1982 2.089106
2 1983 2.007846
3 1984 2.017122
4 1985 1.973671
5 1986 2.065071
6 1987 2.060696
7 1988 2.069594
Again, the function (which was mean in the previous example) can be defined by
us:
with(Fatality,aggregate(mrall~year,FUN=function(x) sd(x)/mean(x)))
year mrall
1 1982 0.3196449
2 1983 0.3017002
3 1984 0.2721300
4 1985 0.2726437
5 1986 0.2709500
6 1987 0.2738153
7 1988 0.2518286
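The same pattern extends to several variables at once; a sketch using the builtin mtcars data:

```r
## mean mpg and mean hp for each number of cylinders
aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
```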
5 Data manipulation
5.1 Subsetting data
There are several ways to access only a part of a dataset:
Call:
lm(formula = mrall ~ beertax + jaild, data = Fatality, subset = year ==
1982)
Coefficients:
(Intercept) beertax jaildyes
1.9080 0.1824 0.4501
state year mrall beertax mlda jaild comserd vmiles unrate perinc
1 1 1982 2.12836 1.53937948 19.0 no no 7.233887 14.4 10544.152
8 4 1982 2.49914 0.21479714 19.0 yes yes 6.810157 9.9 12309.069
15 5 1982 2.38405 0.65035802 21.0 no no 7.208500 9.8 10267.303
22 6 1982 1.86194 0.10739857 21.0 no no 6.858677 9.9 15797.136
[ reached getOption("max.print") -- omitted 44 rows ]
Call:
lm(formula = mrall ~ beertax + jaild)
Coefficients:
(Intercept) beertax jaildyes
1.9080 0.1824 0.4501
Fatality[ Fatality$year==1982 , ]
state year mrall beertax mlda jaild comserd vmiles unrate perinc
1 1 1982 2.12836 1.53937948 19.0 no no 7.233887 14.4 10544.152
8 4 1982 2.49914 0.21479714 19.0 yes yes 6.810157 9.9 12309.069
15 5 1982 2.38405 0.65035802 21.0 no no 7.208500 9.8 10267.303
22 6 1982 1.86194 0.10739857 21.0 no no 6.858677 9.9 15797.136
[ reached getOption("max.print") -- omitted 44 rows ]
Call:
lm(formula = mrall ~ beertax + jaild)
Coefficients:
(Intercept) beertax jaildyes
1.9080 0.1824 0.4501
library(plyr)
rbind.fill(x,y)
merge(x,y)
merge(x,y,all.x=TRUE)
Dataset A:
Name  Grade
Eva   2.0
Mary  1.0
Mike  3.0

Dataset B:
Name   eMail
Eva    eva@...
Eva    eva2@...
Susan  susan@...
Mike   mike@...
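The two datasets above can be reconstructed as follows (the e-mail addresses are hypothetical placeholders):

```r
## Dataset A and Dataset B from the table above
A <- data.frame(Name  = c("Eva", "Mary", "Mike"),
                Grade = c(2.0, 1.0, 3.0))
B <- data.frame(Name  = c("Eva", "Eva", "Susan", "Mike"),
                eMail = c("eva@example.com", "eva2@example.com",
                          "susan@example.com", "mike@example.com"))
merge(A, B)                # inner join: only names present in both tables
merge(A, B, all.x = TRUE)  # left join: Mary is kept, her eMail becomes NA
```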
Appending In the following example we first split the data from an experiment into two parts. rbind.fill (from the plyr package) helps us append one part to the other.
load("data/160716_060x.Rdata")
experiment1 <- subset(trustGS$subjects,Date=="160716_0601")
experiment2 <- subset(trustGS$subjects,Date=="160716_0602")
dim(experiment1)
[1] 108 14
dim(experiment2)
[1] 108 14
library(plyr)
dim(rbind.fill(experiment1,experiment2))
[1] 216 14
Joining A frequent application of joins are z-Tree tables that are related to each other. E.g. the globals and the subjects tables both provide information about each period. Common variables in these tables are Date, Treatment, and Period.
By merging globals with subjects, merge looks up for each record in the subjects
table the matching record in the globals table and adds the variables which are not
already present in subjects.
head(trustGS$global)
         Date Treatment Period NumPeriods RepeatTreatment
3 160716_0601         1      3          6               0
4 160716_0601         1      4          6               0
5 160716_0601         1      5          6               0
6 160716_0601         1      6          6               0
head(trustGS$subject)
In the following example we simply get two more variables in the dataset (NumPeriods
and RepeatTreatment). With more variables in globals we would, of course, also get
more variables in the merged dataset.
dim(trustGS$global)
[1] 24 5
dim(trustGS$subject)
[1] 432 14
dim(merge(trustGS$global,trustGS$subject))
[1] 432 16
head(Fatality)
state year mrall beertax mlda jaild comserd vmiles unrate perinc
1 1 1982 2.12836 1.539379 19.00 no no 7.233887 14.4 10544.15
2 1 1983 2.34848 1.788991 19.00 no no 7.836348 13.7 10732.80
3 1 1984 2.33643 1.714286 19.00 no no 8.262990 11.1 11108.79
4 1 1985 2.19348 1.652542 19.67 no no 8.726917 8.9 11332.63
[ reached getOption("max.print") -- omitted 2 rows ]
Aggregate(c(avgMrall=mean(mrall)) ~ year,data=Fatality)
year avgMrall
1 1982 2.089106
2 1983 2.007846
3 1984 2.017122
4 1985 1.973671
5 1986 2.065071
6 1987 2.060696
7 1988 2.069594
merge(Fatality,Aggregate(c(avgMrall=mean(mrall)) ~ year,data=Fatality))
merge has joined the two datasets, the large Fatality one, and the small aggregated
one, on the variable year.
reshape(trustWide,direction="long")[1:4,]
↑ Reshaping back returns more or less the original data. The ordering has changed and the rows now have names.
library(reshape2)
recast(trustLong, Date + Period + Group ~ Pos, measure.var=c("Offer"))
6 Preparing Data
• read data
• check values
– eliminate outliers
– reshape data
setwd("rawdata/Trust/")
files <- list.files(pattern = "[0-9]{6}_[0-9]{4}.xls$",recursive=TRUE)
Doing: globals
Doing: subjects
*** rawdata/PublicGood/090622_0602.xls is file 6 / 16 ***
reading rawdata/PublicGood/090622_0602.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/PublicGood/090622_0603.xls is file 7 / 16 ***
reading rawdata/PublicGood/090622_0603.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/PublicGood/090622_0604.xls is file 8 / 16 ***
reading rawdata/PublicGood/090622_0604.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/PublicGood/130616_0601.xls is file 9 / 16 ***
reading rawdata/PublicGood/130616_0601.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/PublicGood/130616_0602.xls is file 10 / 16 ***
reading rawdata/PublicGood/130616_0602.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/PublicGood/130616_0603.xls is file 11 / 16 ***
reading rawdata/PublicGood/130616_0603.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/PublicGood/130616_0604.xls is file 12 / 16 ***
reading rawdata/PublicGood/130616_0604.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/Trust/130716_0601.xls is file 13 / 16 ***
reading rawdata/Trust/130716_0601.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/Trust/130716_0602.xls is file 14 / 16 ***
reading rawdata/Trust/130716_0602.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/Trust/130716_0603.xls is file 15 / 16 ***
reading rawdata/Trust/130716_0603.xls ...
Skipping:
Doing: globals
Doing: subjects
*** rawdata/Trust/130716_0604.xls is file 16 / 16 ***
reading rawdata/Trust/130716_0604.xls ...
Skipping:
Doing: globals
Doing: subjects
save in R-format:
save(trustGS,zTreeTables,file="160716_060x.Rdata")
save in Stata-format:
xx<-with(trustGS,merge(globals,subjects))
write.dta(xx,file="160716_060x.dta")
save.dta13(xx,file="160716_060x.dta13")
save as csv:
write.csv(xx,file="160716_060x.csv")
fn<-list.files(pattern="160716_060x\\.[^.]*")
xtable(cbind(name=fn,size=file.size(fn)))
name size
1 160716_060x.csv 301048
2 160716_060x.dta 613274
3 160716_060x.dta13 618378
4 160716_060x.Rdata 24508
As long as we need only a single table, we can access, e.g., the subjects table with $subjects.
If we need, e.g. the globals table together with the subjects table, we can merge:
with(trustGS,merge(globals,subjects))
save(trustGS,zTreeTables,file="data/160716_060x.Rdata")
load("data/160716_060x.Rdata")
Advantages:
library(foreign)
sta <- read.dta("data/160716_060x.dta")
foreign (read.dta) stores internal Stata information as attributes of the data frame.
memisc (Stata.file) stores internal Stata information as attributes of the variable.
str(sta)
$ Contrib3 : num 0.678 0.586 0.259 0.896 0.378 0.818 1 0.381 0.81 0.618 ...
- attr(*, "datalabel")= chr "Written by R. "
- attr(*, "time.stamp")= chr ""
- attr(*, "formats")= chr "%11s" "%9.0g" "%9.0g" "%9.0g" ...
- attr(*, "types")= int 138 100 100 100 100 100 100 100 100 100 ...
- attr(*, "val.labels")= chr "" "" "" "" ...
- attr(*, "var.labels")= chr "Date" "Treatment" "Period" "NumPeriods" ...
- attr(*, "version")= int 7
str(sta2)
attributes(sta)
$datalabel
[1] "Written by R. "
$time.stamp
[1] ""
$names
[1] "Date" "Treatment" "Period" "NumPeriods"
[5] "RepeatTreatment" "Subject" "Pos" "Group"
[9] "Offer" "Receive" "Return" "GetBack"
[13] "country" "siblings" "sex" "age"
[17] "Contrib1" "Contrib2" "Contrib3"
$formats
[1] "%11s" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g"
[10] "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g" "%9.0g"
[19] "%9.0g"
$types
[1] 138 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
$val.labels
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
$var.labels
[1] "Date" "Treatment" "Period" "NumPeriods"
[5] "RepeatTreatment" "Subject" "Pos" "Group"
[9] "Offer" "Receive" "Return" "GetBack"
[13] "country" "siblings" "sex" "age"
[17] "Contrib1" "Contrib2" "Contrib3"
$row.names
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40"
[ reached getOption("max.print") -- omitted 3896 entries ]
$version
[1] 7
$class
[1] "data.frame"
$ptr
<pointer: 0x9c1e090>
attr(,"file.name")
[1] "data/160716_060x.dta"
$document
character(0)
$names
[1] "Date" "Treatment" "Period" "NumPeriods"
[5] "RepeatTreatment" "Subject" "Pos" "Group"
[9] "Offer" "Receive" "Return" "GetBack"
[13] "country" "siblings" "sex" "age"
[17] "Contrib1" "Contrib2" "Contrib3"
$data.spec
$data.spec$names
[1] "Date" "Treatment" "Period" "NumPeriods"
[5] "RepeatTreatment" "Subject" "Pos" "Group"
[9] "Offer" "Receive" "Return" "GetBack"
[13] "country" "siblings" "sex" "age"
[17] "Contrib1" "Contrib2" "Contrib3"
$data.spec$types
[1] 0b ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
$data.spec$nobs
[1] 3936
$data.spec$nvar
[1] 19
$data.spec$varlabs
Date Treatment Period NumPeriods
"Date" "Treatment" "Period" "NumPeriods"
RepeatTreatment Subject Pos Group
"RepeatTreatment" "Subject" "Pos" "Group"
Offer Receive Return GetBack
"Offer" "Receive" "Return" "GetBack"
country siblings sex age
"country" "siblings" "sex" "age"
Contrib1 Contrib2 Contrib3
"Contrib1" "Contrib2" "Contrib3"
$data.spec$value.labels
named character(0)
$data.spec$missing.values
NULL
$data.spec$version.string
[1] "Stata 7"
$class
[1] "Stata.importer"
attr(,"package")
[1] "memisc"
Within the memisc world you can obtain more information with codebook.
codebook(sta2)
================================================================================
Date ’Date’
--------------------------------------------------------------------------------
Min: 090622_0601
Max: 160716_0604
================================================================================
Treatment ’Treatment’
--------------------------------------------------------------------------------
Mean: 1.000
Variance: 0.000
Skewness: NaN
Kurtosis: NaN
Min: 1.000
Max: 1.000
================================================================================
Period ’Period’
--------------------------------------------------------------------------------
Mean: 5.841
Variance: 11.483
Skewness: 0.290
Kurtosis: -1.082
Min: 1.000
Max: 12.000
================================================================================
NumPeriods ’NumPeriods’
--------------------------------------------------------------------------------
Mean: 10.683
Variance: 6.168
Skewness: -1.355
Kurtosis: -0.163
Min: 6.000
Max: 12.000
================================================================================
RepeatTreatment ’RepeatTreatment’
--------------------------------------------------------------------------------
Mean: 0.000
Variance: 0.000
Skewness: NaN
Kurtosis: NaN
Min: 0.000
Max: 0.000
================================================================================
Subject ’Subject’
--------------------------------------------------------------------------------
Mean: 8.720
Variance: 22.665
Skewness: 0.028
Kurtosis: -1.165
Min: 1.000
Max: 18.000
================================================================================
Pos ’Pos’
--------------------------------------------------------------------------------
Mean: 2.280
Variance: 1.202
Skewness: 0.317
Kurtosis: -1.214
Min: 1.000
Max: 4.000
================================================================================
Group ’Group’
--------------------------------------------------------------------------------
Mean: 3.049
Variance: 3.510
Skewness: 1.287
Kurtosis: 1.685
Min: 1.000
Max: 9.000
================================================================================
Offer ’Offer’
--------------------------------------------------------------------------------
Mean: 0.327
Variance: 0.137
Skewness: 0.481
Kurtosis: -1.424
Min: 0.000
Max: 1.000
Miss.: 3072.000
NAs: 3072.000
================================================================================
Receive ’Receive’
--------------------------------------------------------------------------------
Storage mode: double
Measurement: interval
Mean: 0.981
Variance: 1.230
Skewness: 0.481
Kurtosis: -1.424
Min: 0.000
Max: 3.000
Miss.: 3072.000
NAs: 3072.000
================================================================================
Return ’Return’
--------------------------------------------------------------------------------
Mean: 0.409
Variance: 0.381
Skewness: 1.502
Kurtosis: 1.437
Min: 0.000
Max: 2.763
Miss.: 3072.000
NAs: 3072.000
================================================================================
GetBack ’GetBack’
--------------------------------------------------------------------------------
Mean: 0.409
Variance: 0.381
Skewness: 1.502
Kurtosis: 1.437
Min: 0.000
Max: 2.763
Miss.: 3072.000
NAs: 3072.000
================================================================================
country ’country’
--------------------------------------------------------------------------------
Mean: 18.069
Variance: 721.620
Skewness: 2.555
Kurtosis: 4.875
Min: 1.000
Max: 99.000
Miss.: 3072.000
NAs: 3072.000
================================================================================
siblings ’siblings’
--------------------------------------------------------------------------------
Mean: 2.903
Variance: 131.421
Skewness: 8.176
Kurtosis: 65.579
Min: 0.000
Max: 99.000
Miss.: 3072.000
NAs: 3072.000
================================================================================
sex ’sex’
--------------------------------------------------------------------------------
Mean: 8.912
Variance: 663.355
Skewness: 3.192
Kurtosis: 8.196
Min: 1.000
Max: 99.000
================================================================================
age ’age’
--------------------------------------------------------------------------------
Mean: 31.421
Variance: 511.610
Skewness: 2.478
Kurtosis: 4.619
Min: 16.000
Max: 99.000
================================================================================
Contrib1 ’Contrib1’
--------------------------------------------------------------------------------
Mean: 0.504
Variance: 0.047
Skewness: -0.047
Kurtosis: -0.269
Min: 0.000
Max: 1.000
Miss.: 864.000
NAs: 864.000
================================================================================
Contrib2 ’Contrib2’
--------------------------------------------------------------------------------
Mean: 0.507
Variance: 0.048
Skewness: 0.006
Kurtosis: -0.429
Min: 0.000
Max: 1.000
Miss.: 864.000
NAs: 864.000
================================================================================
Contrib3 ’Contrib3’
--------------------------------------------------------------------------------
Mean: 0.501
Variance: 0.049
Skewness: -0.003
Kurtosis: -0.361
Min: 0.000
Max: 1.000
Miss.: 864.000
NAs: 864.000
The memisc approach preserves more information. Often this is more intuitive.
Some packages are, however, confused by these attributes.
Stata 13 Every now and then Stata changes its file format:
library(readstata13)
sta13<-read.dta13("data/160716_060x.dta13")
• Separators: , ; TAB
The advantage of CSV as a medium to exchange data is: CSV can be read by any
software.
The disadvantage is: No extra information (variable labels, levels of factors, . . . )
can be stored.
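A small sketch of the information loss in a CSV round trip:

```r
## a factor survives a CSV round trip only as its string labels;
## levels, variable labels and other attributes are not stored in the file
d <- data.frame(sex = factor(c("male", "female", "female")))
f <- tempfile(fileext = ".csv")
write.csv(d, f, row.names = FALSE)
str(read.csv(f))  # plain strings (a bare factor in older R versions)
```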
6.1.5 Filesize
For our example we obtain the following sizes:
codebook(data.set(trustC))
...
================================================================================
--------------------------------------------------------------------------------
Min: 0.000
Max: 1.000
Mean: 0.654
Std.Dev.: 0.244
Skewness: -0.684
Kurtosis: 0.034
Miss.: 216.000
NAs: 216.000
================================================================================
--------------------------------------------------------------------------------
Basic plots
with(trustC,hist(GetBack/Offer))
boxplot(GetBack/Offer ~ sub("_","",Date),data=trustC,main="Boxplot")
with(trustC,plot(ecdf(GetBack/Offer)))
abline(v=1)
[Figure: histogram of GetBack/Offer (left), boxplot of GetBack/Offer by session date (middle), and empirical CDF of GetBack/Offer with a vertical line at 1 (right)]
plot(GetBack ~ Offer ,data=trustC)
abline(a=0,b=3)
[Figure: scatterplot of GetBack against Offer with the line GetBack = 3 · Offer]
If something is suspicious (which does not seem to be the case here) plot the data for
subgroups:
[Figure: GetBack against Offer, conditioned on Period and Date]
data(Kakadu)
nrow(Kakadu)
[1] 1827
When our data falls into a small number of categories a simple scatterplot is not too
informative. The right graph shows a scatterplot with some jitter added.
plot(lower ~ upper,data=Kakadu)
abline(a=0,b=1)
plot(jitter(lower,factor=50) ~ jitter(upper,factor=50),cex=.1,
data=Kakadu)
[Figure: lower against upper (left) and the same data with jitter added (right)]
With such a large number of observations, and so few categories, a table might be more informative:
with(Kakadu,table(lower,upper))
     upper
lower   2   5  20  50 100 250 999
  0   129 147 156 176   0   0   0
  2     0   9   0   0   0   0   0
  5     0   0  63   0   0   0   0
  20    0   0   0  69   0   0 321
  50    0   0   0   0  76   0 281
  100   0   0   0   0   0  61 187
  250   0   0   0   0   0   0 152
In our dataset we do not have the age of the oldest sibling, but let us just pretend:
with(trustGS$subjects,table(siblings,age,useNA=’always’))
age
siblings 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 98 99 <NA>
0 12 24 0 0 12 24 36 0 0 12 24 0 0 12 12 0 12 12 24 12 0
[ reached getOption("max.print") -- omitted 5 rows ]
with(trustGS$subjects,table(siblings,is.na(age)))
siblings FALSE
0 228
1 180
2 192
3 252
99 12
The discussion of value labels in section 6.5 contains more details on missings.
library(tools)
md5sum("data/160716_060x.Rdata")
data/160716_060x.Rdata
"9551b43d01a79e6659c15d86b2cae879"
It might be worthwhile to include the checksum of your datasets in the draft version of your paper.
trustC[,sort(names(trustC))]
Either. . .
• use a small number of source files, and keep the information somewhere in the
file
. . . or. . .
• use many source files and few data files, and keep the information with the data.
load("data/160716_060x.Rdata")
trust <- within(with(trustGS,merge(globals,subjects)), {
description(Pos)<- "(1=trustor, 2=trustee)"
description(Offer)<- "trustor’s offer"
description(Receive)<- "amount received by trustee"
description(Return)<- "amount trustee sends back to trustor"
description(GetBack)<- "amount trustor receives back from trustee"
description(country)<- "country of origin"
description(sex)<- "participant’s sex (1=male, 2=female)"
description(siblings)<- "number of siblings"
description(age)<- "true age"
})
codebook(data.set(trust))
attr(trust,"annotation")<-"Note: 160716_0601 was a pilot,..."
annotation(trust)["note"]="Note: This is not a real dataset..."
• labels can be long, but they should be meaningful even if they are truncated.
The following is not a label but a wording:
Better:
General attributes

description()        short description of the variable       always
wording()            wording of a question                   if necessary
annotation()["..."]  e.g. specific property of dataset,      if necessary
                     how a variable was created
• numbers: 1, 2, 3
• characters: “male”, “female”, . . .
• factors: “male”=1, “female”=2,. . .
– factors are integers + levels, often treated as characters.
– factors have only one type of missing (this is not a restriction, since the type
of missingness could be stored in another variable)
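A brief illustration of the integers-plus-levels representation (a sketch):

```r
## a factor stores integer codes plus a table of levels
sex <- factor(c("male", "female", "female"))
levels(sex)        # "female" "male" -- levels are sorted alphabetically
as.integer(sex)    # 2 1 1 -- the underlying codes
as.character(sex)  # "male" "female" "female" -- the labels
```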
codebook(trustC$sex)
================================================================================
--------------------------------------------------------------------------------
table(as.factor(trustC$sex),useNA="always")
table(as.numeric(trustC$sex),useNA="always")
1 2 <NA>
174 216 42
table(as.character(trustC$sex),useNA="always")
trust <- within(trust,{
labels(sex)<-c("male"=1,"female"=2,"refused"=98,"missing"=99)
labels(siblings)<-c("refused"=98,"missing"=99)
labels(age)<-c("refused"=98,"missing"=99)
labels(country)<-c("a"=1, "b"=2, "c"=3, "d"=4, "e"=5, "f"=6, "g"=7, "h"=8, "i"=9, "j"=10, "k"
missing.values(sex)<-c(98,99)
missing.values(siblings)<-c(98,99)
missing.values(age)<-c(98,99)
missing.values(country)<-c(98,99)
})
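The memisc declarations above keep the codes 98 and 99 in the data but treat them as missing. The same effect can be sketched in base R (the raw codes below are invented):

```r
## hypothetical raw codes: 98 = "refused", 99 = "missing"
sexRaw <- c(1, 2, 98, 2, 99, 1)
sexRaw[sexRaw %in% c(98, 99)] <- NA     # declare both codes as missing
table(sexRaw, useNA = "always")
mean(sexRaw == 2, na.rm = TRUE)         # share of code 2 among valid answers
```

In contrast to the base-R recoding, the memisc approach preserves the distinction between the two kinds of missingness in the stored data.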
save(trustC,file="data/160716_060x_C.Rdata")
Introducing missings makes a difference. The left graph shows the plot where missings were coded (wrongly) as zeroes, the right graph shows the plot with missings coded as missings.
c(ecdfplot(~Offer,data=trust),ecdfplot(~Offer,data=trustC))
[Figure: empirical CDF of Offer; left panel: missings wrongly coded as zeroes, right panel: missings coded as NA]
[29 July 2016 14:28:20] — 81
mean(trust$Offer)
[1] NA
mean(trustC$Offer)
[1] NA
mean(trustC$Offer,na.rm=TRUE)
[1] 0.6536776
(Note that the equivalents in Stata, . == . and 7 < ., do not fail but return TRUE.)
The following works:
x<-NA
if(is.na(x)) print("x is na")
• keep the old variables
• generate indicator variables for records you will use in a specific context
trust<-within(trust,youngSingle <- age<25 & siblings==0)
with(subset(trust,youngSingle),...)
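A small sketch with made-up data (the variable names follow the example above, the values are invented):

```r
## made-up data for illustration
trust <- data.frame(age      = c(22, 30, 24),
                    siblings = c(0, 0, 2),
                    Offer    = c(5, 7, 3))
## indicator variable: keeps the old variables intact
trust <- within(trust, youngSingle <- age < 25 & siblings == 0)
trust$youngSingle                               # TRUE FALSE FALSE
with(subset(trust, youngSingle), mean(Offer))   # statistics for that subgroup
```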
[Diagram: CTANGLE turns foo.w into the program foo.c; CWEAVE turns foo.w into the documentation foo.tex]
• supposed to be read
• reduces the amount of text one must read to understand the code
[Diagram: with knitr, tangle turns foo.Rnw into foo.R; knit (weave) turns foo.Rnw into foo.tex, which is compiled to foo.pdf]
Nonliterate:
[Diagram: the raw data, several versions of the statistical methods, several versions of the paper, and the workflow all exist side by side]
Remember: it is easy to confuse the different versions of the analysis and their relation to the versions of the paper.
Literate:
[Diagram: the workflow ties raw data, statistical methods, and paper together in a single document]
• Connection of methods to paper (no more: ‘which version of the methods were
used for which figure, which table’)
Don’t write:
We ran 12 sessions with 120 participants.
instead:
numSession <- length(unique(sessionID))
numPart <- length(unique(partID))
...
We ran \Sexpr{numSession} sessions with \Sexpr{numPart} participants.
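In an R chunk of the same document, the counts would be computed along these lines (the identifiers below are invented):

```r
## invented identifiers for illustration
sessionID <- rep(c("s01", "s02", "s03"), each = 4)
partID    <- paste(sessionID, 1:4, sep = "-")
numSession <- length(unique(sessionID))   # number of distinct sessions
numPart    <- length(unique(partID))      # number of distinct participants
```

If a session is dropped or added, \Sexpr{numSession} and \Sexpr{numPart} update automatically at the next compilation.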
7.3 An example
\documentclass{article}
\begin{document}
text that explains what you are doing and why it is
interesting ...
<<someCalculations,results='asis',echo=FALSE>>=
library(Ecdat)
library(xtable)
library(lattice)
data(Caschool)
attach(Caschool)
est <- lm(testscr ~ avginc)
xtable(anova(est))
@
<<aFigure,echo=FALSE,fig.width=4,fig.height=3>>=
plot(xyplot(testscr ~ avginc,xlab="average income",ylab="testscore",
type=c("p","r","smooth")))
@
library(knitr)
knit("<filename.Rnw>")
system("pdflatex <filename.tex>")
text that explains what you are doing and why it is interesting ...
[Figure: scatterplot of testscore against average income, with regression line and smoother]
<<>>=
lm(testscr ~ avginc)
@
<<fig.height=2.5>>=
plot(est,which=1)
@
More generally:
<<...parameters...>>=
...R-commands...
@
• <<anyName,...>>=
not necessary, but identifies the chunk. Also helps recycling chunks, e.g. a figure.
• <<...,eval=FALSE,...>>=
the chunk is shown but not evaluated.
• <<...,echo=FALSE,...>>=
the chunk is evaluated, but the code itself is not shown.
• <<...,fig.width=3,fig.height=3,...>>=
All figures produced in this chunk will have this width and height.
• <<...,results='asis',...>>=
the output of the chunk is inserted into the LaTeX file as it is (useful, e.g., for tables produced by xtable).
Elements of a knitr-document
\documentclass{article}
\begin{document}
<<>>=
opts_chunk[["set"]](dev='tikz', external=FALSE, fig.width=4.5,
fig.height=3, echo=TRUE, warning=TRUE,
error=TRUE, message=TRUE,
cache=TRUE, autodep=TRUE,
size="footnotesize")
@
\usepackage{tikz}
• echo=TRUE, warning=TRUE, error=TRUE, message=TRUE what kind of output
is shown
• cache=TRUE, autodep=TRUE recalculate chunks only when they (or chunks they depend on) have changed
• size="footnotesize" size of the output
All these values can be overridden for specific knitr chunks.
This document has been generated on July 29, 2016, with R version 3.3.1 (2016-06-21), on x86_64-pc-linux-gnu.
7.5 Advantages
• Accuracy (no more mistakes from copying and pasting)
• Reproducibility (even years later, it is always clear how results were generated)
• Dynamic document (changes are immediately reflected everywhere, this speeds
up the writing process)
<<eval=!FAST>>=
read.csv('rawData.csv')
expensiveData<-thisTakesALongTime()
save(expensiveData,file='expensive.Rdata')
@
<<>>=
load('expensive.Rdata')
...
@
Alternatively: caching intermediate results. knitr can also cache intermediate results:
<<expensiveStep,cache=TRUE>>=
intermediateResults <- ....
@
The above chunk is executed only once (unless it changes); results are stored on disk and can be used later on.
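The caching pattern can also be rolled by hand in plain R; the cache file name and the cheap stand-in computation below are invented:

```r
cacheFile <- tempfile(fileext = ".Rdata")  # in practice: a fixed project path
if (file.exists(cacheFile)) {
  load(cacheFile)                 # reuse the stored result
} else {
  expensiveData <- 6 * 7          # stand-in for a long computation
  save(expensiveData, file = cacheFile)
}
expensiveData
```

Unlike knitr's cache, this variant does not notice when the computation itself changes; the cache file must then be deleted by hand.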
You can save a lot of work if you harness R to create and format your tables for you.
A versatile function is xtable:
(x <- rbind(c(1,2,3),c(4,5,6)))
<<results='asis'>>=
library(xtable)
xtable(x)
1 2 3
1 1.00 2.00 3.00
2 4.00 5.00 6.00
library(Ecdat)
library(memisc)  # provides mtable() and toLatex()
data(Caschool)
est1 <- lm(testscr ~ str,data=Caschool)
est2 <- lm(testscr ~ str + elpct,data=Caschool)
est3 <- lm(testscr ~ str + elpct +avginc,data=Caschool)
toLatex(mtable(est1,est2,est3))
c Oliver Kirchkamp
90 Workflow of statistical data analysis — 7 WEAVING AND TANGLING
                     est1         est2         est3
(Intercept)    698.933∗∗∗   686.032∗∗∗   640.315∗∗∗
                 (9.467)      (7.411)      (5.775)
str             −2.280∗∗∗    −1.101∗∗     −0.069
                 (0.480)      (0.380)      (0.277)
elpct                        −0.650∗∗∗    −0.488∗∗∗
                              (0.039)      (0.029)
avginc                                     1.495∗∗∗
                                           (0.075)
R-squared            0.1          0.4          0.7
adj. R-squared       0.0          0.4          0.7
sigma               18.6         14.5         10.3
F                   22.6        155.0        334.9
p                    0.0          0.0          0.0
Log-likelihood   −1822.2      −1716.6      −1575.4
Deviance        144315.5      87245.3      44540.7
AIC               3650.5       3441.1       3160.7
BIC               3662.6       3457.3       3180.9
N                    420          420          420
toLatex(relabel(mtable("simple"=est1,"intermediate"=est2,
"final"=est3),c(str="student/teacher",
elpct="English learners",avginc="average income",
"(Intercept)"="Constant")))
Requirements The default of toLatex assumes the dcolumn package, i.e. in the preamble of the document we have to say something like:
\usepackage{dcolumn}
\let\toprule\hline
\let\midrule\hline
\let\bottomrule\hline
AIC <- smer@AICtab$AIC
BIC <- smer@AICtab$BIC
logLik <- smer@AICtab$logLik
deviance <- smer@AICtab$deviance
REMLdev <- smer@AICtab$REMLdev
N <- length(mer@resid)
# below we assume two random effects: one for the independent
# observations and one for the participants
# this is frequently the case for experiments but need not
# always be the case for other mer-s
ngrps<-min(smer@ngrps)
mgrps<-max(smer@ngrps)
sumstat <- c(deviance=deviance,AIC=AIC,BIC=BIC,
logLik=logLik,N=N,ngrps=ngrps,mgrps=mgrps)
list(coef=coef,sumstat=sumstat,call = mer@call)
}
setSummaryTemplate(mer=c("Log-likelihood" = "($logLik:f#)",
Deviance = "($deviance:f#)",
AIC = "($AIC:f#)",
BIC = "($BIC:f#)",
N = "($N:d)",
"indep.obs."="($ngrps:d)",
"participants"="($mgrps:d)"))
setCoefTemplate(pci=c(est="($est:#)($p:*)",
ci="[($lwr:#);($upr:#)]"))
We should note that our definition of indep.obs. and participants as the smallest and
largest number of groups, respectively, is often reasonable if we have, indeed, two
random effects, one for independent observations, the other for participants. This is
frequently the case for experiments but need not always be the case for other mixed
effects models.
We should also note that there are several ways to bootstrap p-values. In the example we use mcmcsamp and we assume that the distribution of coefficients follows a normal distribution.
PROJECT = myProject_160601

pdf: $(PROJECT).pdf

%.pdf: %.tex
	pdflatex $<

%.tex: %.Rnw
	echo "library(knitr);knit(\"$<\");" | R --vanilla
PROJECT = myProject_160601
Here we define a variable. This is useful since this is, most of the time, the only line of the Makefile I ever have to change (instead of changing every occurrence of the filename).
pdf: $(PROJECT).pdf
The part pdf before the colon is a target. Since it is the first target in the file it is also the default target, i.e. make will try to make it whenever I just say

make

or, equivalently,

make pdf

The part after the colon tells make on which file(s) the target depends (the prerequisites). Here it is only one, but there could be several. If all prerequisites exist, and if they are up-to-date (newer than all files they depend on), make will apply the rule. Otherwise, make will first try to create the prerequisites (the pdf file in this case, with the help of other rules) and then apply this rule.
%.tex: %.Rnw
echo "library(knitr);knit(\"$<\");" | R --vanilla
This is a rule that make can use to create tex files. So above we requested the pdf file myProject_160601.pdf, and now make knows that we require a file myProject_160601.tex. If this file already exists and is up-to-date (i.e. newer than all files it depends on), make will apply this rule. Otherwise, make will first try to create the prerequisite (the single Rnw file in this case) and then apply this rule.
To create our pdf it is now sufficient to say (from the command line, not from R)
make
A Makefile for a larger project When I wrote this handout I split it into several Rnw
files. This saves time. When I make changes to one part, only this part has to be
compiled again. The files were all in the same directory. The directory also contained
a “master”-tex file that would assemble the tex-files for each Rnw-file.
The following example shows how we assemble the output of several files to make
one document:
PROJECT = myProject_160601
RPARTS = $(wildcard $(PROJECT)_[1-9].Rnw)
TEXPARTS = $(RPARTS:.Rnw=.tex)
pdf: $(PROJECT).pdf
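Putting these fragments together, a complete Makefile for such a project might look as follows. Note that the rule linking the master pdf to its parts is my reconstruction (a sketch, not taken from the handout); adapt it to your own file names:

```make
PROJECT  = myProject_160601
RPARTS   = $(wildcard $(PROJECT)_[1-9].Rnw)
TEXPARTS = $(RPARTS:.Rnw=.tex)

pdf: $(PROJECT).pdf

# the master pdf depends on the master tex file and on all woven parts
$(PROJECT).pdf: $(PROJECT).tex $(TEXPARTS)
	pdflatex $(PROJECT).tex

# weave any part that is older than its Rnw source
%.tex: %.Rnw
	echo "library(knitr);knit(\"$<\");" | R --vanilla
```

With this setup, changing one part recompiles only that part before the master document is reassembled.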
8 Version control
8.1 Problem I – concurrent edits
What happens if two authors, Anna and Bob, want to work on the same file simultaneously? Chances are that one deletes the changes of the other. (This problem is similar to one author working on two different machines.)
[Diagram: Anna and Bob check out the version V from the server, edit their copies into VA and VB, and the merged result VA+B is stored on the server]
• Bob can only work with Anna’s permission — very inefficient (50% of the time
Anna and Bob are forced to wait)
[Diagram: a version history A, B, C, D, E with a branch F, G]
• Editions of a book
• Revisions of a specification
• ...
Software:
• Subversion (SVN)
• Git
c Oliver Kirchkamp
96 Workflow of statistical data analysis — 8 VERSION CONTROL
• Mercurial
• Bazaar
• ...
• Free
• Distributed repository
git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
# (use "git rm --cached <file>..." to unstage)
#
# new file: test.Rnw
Note that git denotes versions with identifiers like “3ea6194” (and not A, B, C).
After some changes to test.Rnw...
git status
# On branch master
# Changes not staged for commit:
# (use "git add <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# modified: test.Rnw
#
no changes added to commit (use "git add" and/or "git commit -a")
git log --oneline
f965066 added funny model (does not fully work yet)
9100277 improved regression results (do not fully work)
1d05e8f draft conclusion
74fd521 introduction and first results
3ea6194 first version of test.Rnw
[Diagram: HEAD and master point to the latest commit f965066]
Assume we want to go back to 1d05e8f but not forget what we did between 1d05e8f
and f965066.
Remember current state:
Now that we have given the current branch a name we can revert to the old state:
git status
# On branch master
nothing to commit, working directory clean
[Diagram: funny points to f965066; HEAD and master point to 1d05e8f]
do more work...
[Diagram: master advances with new commits beca79e and 9682285; funny still points to f965066]
or that...
Auto-merging test.Rnw
CONFLICT (content): Merge conflict in test.Rnw
Automatic merge failed; fix conflicts and then commit the result
Merging: test.Rnw
[Diagram: the merge commit joins the branches master and funny; HEAD and master point to it]
Version control allows all authors to work on the file(s) simultaneously.
In this example we start with an empty repository. In a first step both Anna and Bob “checkout” the repository, i.e. they create a local copy of the repository on their computer.
Anna creates a file, adds it to version control and commits it to the repository. Bob
then updates his copy and, thus, obtains Anna’s changes.
• This repository can now be accessed from “clients”, either on the same machine...
git clone /path/to/repository/
Anna and Bob then both make changes. Since they both edit different parts of the file, the version control system can merge the changes automatically.
1d05e8f draft conclusion
74fd521 introduction and first results
3ea6194 first version of test.Rnw
git blame allows you to inspect modifications in specific files. If we want to find out who introduced or removed “something specific” (and when), we would say...
git blame -L ’/something specific/’ test.Rnw
19eb9bac (w6kiol2 2016-06-17 ...) therefore important to study something specific which
dd0647f7 (w6kiol2 2016-06-21 ...) switched our focus to something else and continue with
There is a range of GUIs that allow you to browse the commit tree.
Try, e.g., gitk
8.10 Steps to set up a subversion repository at the URZ at the FSU Jena
If you need to set up a subversion repository here at the FSU, tell me about it and tell me the <urz-login>s of the people who plan to use it. Technically, setting up a new repository means the following:
• ssh to subversion.rz.uni-jena.de
• then, at the local machine, in a directory that actually contains only the files you want to add:
svn --username <urz-login> import . https://subversion.rz.uni-jena.de/svn/ -m "Initial import"
(this “imports” data into the repository)
• then, at all client machines:
svn --username <urz-login> checkout https://subversion.rz.uni-jena.de/svn/ewf/<reposit
• editing
– git add add a file to version control
– git mv move a file under version control
– git rm delete a file under version control
8.13 Exercise
Create (in <path>) three directories A, B, C.
From A create a repository: svnadmin create ../R
In A create a file test.txt with some text:
A=...
B=...
Initial import. In A say:
svn import . file://<path>/R -m "My first initial import"
in B:                          in C:
svn checkout file://<path>/R   svn checkout file://<path>/R
Simultaneous changes to test.txt
in B/R:                        in C/R:
A=1                            A=...
B=...                          B=2
Commit changes
svn commit                     svn commit
Update
svn update                     svn update
9 Exercises
Exercise 1
Have a look at the dataset Workinghours from the library Ecdat. Compare the distribution of “other household income” for whites and non-whites. Do the same for the different types of occupation of the husband.
Exercise 2
Read the data from a hypothetical experiment from rawdata/Coordination. Does the
Effort change over time?
Exercise 3-a
Read the data from a hypothetical z-Tree experiment from rawdata/Trust. Do you
find any relation between the number of siblings and trust?
Exercise 3-b
For the same dataset: Attach a label (description) to siblings. Attach value labels to
this variable.
Exercise 3-c
Make the above a function.
Also write a function that compares the offers of all participants with n siblings
with the other offers. This function should (at least) return a p-value of a two-sample
Wilcoxon test (wilcox.test). The number n should be a parameter of the function.
Exercise 4
Read the data from a hypothetical z-Tree experiment from rawdata/PublicGood. The
three variables Contrib1, Contrib2, and Contrib3 are contributions of the participants to the other three players in their group (in groups of four).
1. Check that, indeed, in each period, players are equally distributed into four
groups.
2. Produce for each period a boxplot with the contribution (i.e. 16 boxplots in one
graph).
4. Produce for each contribution partner a boxplot with the contribution (i.e. 3
boxplots in one graph).
5. Produce an Sweave file that generates the two graphs. In this file, also write when you estimate that the average contribution will reach zero.