R_intro2021
Mahmood Arai
Department of Economics, Stockholm University
First Version: 2002-11-05, This Version: 2021-03-26
1 Introduction
1.1 About R
R is published under the GPL (GNU General Public License) and exists for all major
platforms. R is described on the R Homepage as follows:
[1] "example"
R-codes including comments of codes that are not executed are indented as
follows:
The characters within < > refer to verbatim names of files, functions etc. when it
is necessary for clarity. The names <mysomething> such as <mydata>,<myobject>
are used to refer to a general dataframe, object etc.
2 First things
2.1 Installation
It is a good idea to create a directory for a project and start R from there. This
makes it easy to save your work and find it in later sessions.
If you want R to start in a certain directory in MS-Windows, you have to
specify the <start in directory> to be your working directory. This is done
by changing the <properties> by clicking on the right button of the mouse
while pointing at your R-icon, and then going to <properties>.
Displaying the working directory within R:
getwd()
setwd("/home/ma/project1")
2.3 Naming in R
Do not name an object <my_object> or <my-object>; use <my.object> or
<myObject> instead. Notice that in R <my.object> and <My.object> are two different names.
Names starting with a digit (<1a>) are not accepted. You can instead use <a1>.
You should not use names of variables in a data-frame as names of objects. If you do
so, the object will shadow the variable with the same name in another object. The
problem is then that when you call this variable you will get the object instead: the
variable is masked by the object with the same name.
To avoid this problem:
1- Do not give a name to an object that is identical to the name of a variable in your
data frames.
2- If you are not able to follow this rule, refer to variables by referring to the variable
and the dataset that includes the variable. For example the variable <wage> in the
data frame <df1> is called by:
df1$wage.
The problem of "shadowing" concerns R functions as well. Do not use object names
that are the same as R functions. <conflicts(detail=TRUE)> checks whether an
object you have created conflicts with another object in the R packages and lists them.
You should only care about those that are listed under <.GlobalEnv>, i.e. objects in your
workspace. All objects listed under <.GlobalEnv> shadow objects in R packages and
should be removed in order to be able to use the objects in the R packages.
The following example creates <T>, which should be avoided (since <T> stands for <TRUE>),
checks conflicts and resolves the conflict by removing <T>.
T <- "time"
conflicts(detail=TRUE)
rm(T)
conflicts(detail=TRUE)
You should avoid using the following one-letter words <c,C,D,F,I,q,t,T> as names.
They have special meanings in R.
Extensions for files
It is a good practice to use the extension <R> for your files including R-codes. A file
<lab1.R> is then a text-file including R-codes.
The extension <rda> is appropriate for work images (i.e. files created by <save()>).
The file <lab1.rda> is then a file including R-objects.
The default name for the saved work image is <.RData>. Be careful not to name a
file <.RData> when you use <RData> as extension, since you will then overwrite the
default <.RData> file.
install.packages("Ecdat")
data("Wages1", package="Ecdat")
The following command saves the object <Wages1> in a file <mydata.rda>.
save(Wages1, file="mydata.rda")
To save an image of your workspace that will be automatically loaded the next time
you start R in the same directory:
save.image()
You can also save your working image by answering <yes> when you quit and are
asked
<Save workspace image? [y/n/c]:>.
In this way the image of your workspace is saved in the hidden file <.RData>.
You can save an image of the current workspace and give it a name <myimage.rda>.
save.image("myimage.rda")
3 Elementary commands
ls() # Lists all objects.
ls.str() # Lists details of all objects
str(myobject) # Lists details of <myobject>.
list.files() # Lists all files in the current directory.
dir() # Lists all files in the current directory.
myobject # Prints simply the object.
rm(myobject) # removes the object <myobject>.
rm(list=ls()) # removes all the objects in the working space.
save(myobject, file="myobject.rda")
# saves the object <myobject> in a file <myobject.rda>.
q() # Quits R.
The output of a command can be stored in an object by using < <- >; the object is
then assigned a value. The first line in the following code chunk creates a vector named
<VV> with the values 1, 2 and 3. The second line creates an object named <VV> and prints
the contents of the object <VV>.
[1] 1 2
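A minimal sketch of assignment with and without printing. Wrapping an assignment in parentheses is standard R behaviour; the particular values are only illustrative.

```r
# Assign without printing anything.
VV <- c(1, 2, 3)

# Wrapping the assignment in parentheses assigns and prints in one step.
(VV <- 1:2)

# Typing the name of the object prints its contents.
VV
```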
4 Data management
4.1 Reading data in plain text format:
Data tables
Start by writing your data table in a file. The file will contain the variable names in
the first line (separated with a space) and the values of these variables (separated with
a space) in the following lines.
The argument <header = TRUE> indicates that the first line includes the names of the
variables. The object <dat> is a data-frame as it is called in R.
If the columns of the data in the file <tmp.txt> were separated by <,>, the syntax
would be:
Note that if your decimal character is not <.> you should specify it. If the decimal
character is <,>, you can use <read.csv> and specify the following argument in the
function <dec=",">.
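The three cases above can be sketched as follows. The file is a temporary file and the variable names <wage> and <school> are made up for the illustration. For the decimal-comma case the sketch assumes that the columns are separated by <;>, so that the comma is free to serve as decimal character.

```r
# A small space-separated table with the variable names in the first line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("wage school", "10.5 12", "8.25 9"), tmp)
dat <- read.table(tmp, header = TRUE)

# The same data with columns separated by commas.
writeLines(c("wage,school", "10.5,12", "8.25,9"), tmp)
dat.csv <- read.table(tmp, header = TRUE, sep = ",")

# Decimal comma: columns separated by ";" (an assumption of this sketch).
writeLines(c("wage;school", "10,5;12", "8,25;9"), tmp)
dat.dec <- read.csv(tmp, sep = ";", dec = ",")

str(dat)
```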
4.2 Non-available and delimiters in tabular data
Let us create a small dataset where the first observation on the second column (variable)
is a missing value coded as <.> and save it to a file. The data are:
1 . 9
6 3 2
When reading this data, to tell R that <.> is a missing value, you use the argument:
<na.strings=".">
V1 V2 V3
1 1 NA 9
2 6 3 2
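A minimal sketch that reproduces the table above, assuming the two data lines are written to a temporary file first:

```r
# Write the two lines of data to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 . 9", "6 3 2"), tmp)

# Read the file and treat "." as a missing value.
dat <- read.table(tmp, na.strings = ".")
dat
```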
Sometimes columns are separated by other separators than spaces. The separator
might for example be <,> in which case we have to use the argument <sep=",">.
Be aware that if the columns are separated by <,> and there are spaces in some
columns like the case below the <na.strings="."> does not work. The NA is
actually coded as two spaces, a point and two spaces, and should be indicated as:
<na.strings=" . ">.
1, . ,9
6, 3 ,2
1  9
6 3 2
Notice that there are two spaces between 1 and 9 in the first line implying that the
value in the second column is blank. This is a missing value. Here it is important to
specify <sep=" "> along with <na.strings=""> .
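As a sketch, the blank-coded case can be tried directly with the <text> argument of <read.table>, so that no file is needed. The object name <dat.blank> is only illustrative.

```r
# Two spaces between 1 and 9: the second field of the first line is empty.
dat.blank <- read.table(text = "1  9\n6 3 2", sep = " ", na.strings = "")
dat.blank
```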
library(foreign)
# Try to write the 'Wages1' data frame in Stata format.
write.dta(Wages1, file = "wage.dta")
# reads the data <wage.dta> and put it in the object <Wages1>
Wages1 <- read.dta(file="wage.dta")
<read.ssd()> , <read.spss()> etc. are other commands in the foreign package for
reading data in SAS and SPSS format.
4.4 Examining the contents of a data-frame object
Attaching the <Wages1> data by <attach(Wages1)> allows you to access the contents
of the dataset <Wages1> by referring to the variable names in the <Wages1>. If you
have not attached the <Wages1> you can use <Wages1$sex> to refer to the variable
<sex> in the data frame <Wages1>. When you do not need to have the data attached
anymore, you can undo the <attach()> by <detach()>.
A description of the contents of the data frame Wages1.
Notice that you do not need to create variables that are simple transformations of the
original variables. You can do the transformation directly in your computations and
estimations.
4.6 Choosing a subset of variables in a data frame
# Read a <subset> of variables (wage,sex) in Wages1.
Wages1.female <- subset(Wages1, select=c(wage,sex))
# The following keeps all variables from sex to wage as listed above
Wages1xx <- subset(Wages1, select=sex:wage)
# Keeping observations for females and those with more than 12 years of Schooling only.
fem.collage.data <- subset(Wages1, sex=="female" & school > 12 )
The variable <Wages1$university> is now a dummy for university education.
Remember to re-attach the data set after recoding.
university
FALSE TRUE
2541 753
university
FALSE TRUE
2541 753
However, we usually do not need to create dummies. We can compute on <school > 12>
directly,
FALSE TRUE
2541 753
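A minimal sketch of both approaches. The data frame <wages> is a small made-up stand-in for <Wages1>, so the counts differ from the tables above.

```r
# Hypothetical stand-in for the Wages1 data frame.
wages <- data.frame(school = c(9, 12, 13, 16, 11, 14))

# Create a (logical) dummy for more than 12 years of schooling.
wages$university <- wages$school > 12
table(wages$university)

# The same table without creating a dummy first.
table(wages$school > 12)

# The share of observations with more than 12 years of schooling.
mean(wages$school > 12)
```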
4.10 Factors
Sometimes a variable has to be redefined to be used as a category variable with appropriate
levels that correspond to various intervals. We might wish to have schooling
categories that correspond to schooling up to 9 years, 10 to 12 years and above 12
years. This could be coded by using <cut()>. To include the lowest category we use
the argument <include.lowest=TRUE>.
SchoolLevel
(1,9] (9,12] (12,16]
293 2248 753
Labels can be set for each level. Consider the university variable created in the previous
section.
SchoolLevel
basic highschool university
293 2248 753
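A sketch of <cut()> with and without labels. The vector <school> and the break points are assumptions chosen for the illustration, so the counts differ from the tables above.

```r
# Hypothetical years of schooling.
school <- c(3, 9, 10, 12, 13, 16)

# Intervals up to 9 years, 10 to 12 years and above 12 years.
SchoolLevel <- cut(school, breaks = c(0, 9, 12, 16), include.lowest = TRUE)
table(SchoolLevel)

# The same categories with a label for each level.
SchoolLevel <- cut(school, breaks = c(0, 9, 12, 16), include.lowest = TRUE,
                   labels = c("basic", "highschool", "university"))
table(SchoolLevel)
```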
The factor defined as above can for example be used in a regression model. The reference
category is the level with the lowest value. The lowest value is 1, which corresponds
to <basic>, and the column for <basic> is not included in the contrast matrix.
Changing the base category will remove another column instead of this column. This
is demonstrated in the following example:
> contrasts(SchoolLevel)
highschool university
basic 0 0
highschool 1 0
university 0 1
basic highschool
basic 1 0
highschool 0 1
university 0 0
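One way to obtain two contrast matrices of this kind is sketched below with <contr.treatment>. The small factor is made up, and whether the original example used exactly this call is an assumption.

```r
# A small factor with the three schooling levels.
SchoolLevel <- factor(c("basic", "highschool", "university", "basic"),
                      levels = c("basic", "highschool", "university"))

# Default treatment contrasts: <basic> is the reference category.
contrasts(SchoolLevel)

# Make the third level (<university>) the base category instead.
contrasts(SchoolLevel) <- contr.treatment(levels(SchoolLevel), base = 3)
contrasts(SchoolLevel)
```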
4.11 Aggregating data by groups of observations
Let us create a simple dataset consisting of 3 variables V1, V2 and V3. V1 is the
group identity and V2 and V3 are two numeric variables.
V1 V2 V3
1 1 1 11
2 2 2 12
3 3 3 13
4 1 4 14
5 2 5 15
6 3 6 16
7 1 7 17
8 2 8 18
9 3 9 19
Group.1 V2 V3
1 1 12 42
2 2 15 45
3 3 18 48
Group.1 V2 V3
1 1 4 14
2 2 5 15
3 3 6 16
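The dataset and the two tables of group sums and group means above can be reproduced with <aggregate>. This is a sketch; the object names <sums> and <means> are only illustrative.

```r
# V1 is the group identity, V2 and V3 are numeric variables.
dat <- data.frame(V1 = rep(1:3, 3), V2 = 1:9, V3 = 11:19)

# Sum of V2 and V3 within each group defined by V1.
sums <- aggregate(dat[, c("V2", "V3")], list(dat$V1), FUN = sum)

# Mean of V2 and V3 within each group defined by V1.
means <- aggregate(dat[, c("V2", "V3")], list(dat$V1), FUN = mean)

sums
means
```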
id Time x y
1 11 1 2 5
2 11 2 3 6
3 12 1 2 5
4 12 2 3 6
This computes group means for all variables in the data frame and drops the variable
<Time> and the automatically created group-indicator variable <Group.1>.
> (Bdat <- subset(aggregate(dat,list(dat$id),FUN=mean),select=-c(Time,Group.1)))
id x y
1 11 2.5 5.5
2 12 2.5 5.5
Merge <Bdat> and <dat$id> to create a data set with repeated group averages for
each observation on <id> and of the same length as <id>.
x y
1 2.5 5.5
2 2.5 5.5
3 2.5 5.5
4 2.5 5.5
Now you can create a data set including the <id> and <Time> indicators and the
deviation from mean values of all the other variables.
id Time x y
1 11 1 -0.5 -0.5
2 11 2 0.5 0.5
3 12 1 -0.5 -0.5
4 12 2 0.5 0.5
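The steps above can be sketched as follows. The names <dat2> and <wdat> are made up for the illustration, and the sketch assumes that <dat> is sorted by <id>, since <merge> returns its result sorted by the matching variable.

```r
dat <- data.frame(id = c(11, 11, 12, 12), Time = c(1, 2, 1, 2),
                  x = c(2, 3, 2, 3), y = c(5, 6, 5, 6))

# Group means of all variables, dropping <Time> and <Group.1>.
Bdat <- subset(aggregate(dat, list(dat$id), FUN = mean),
               select = -c(Time, Group.1))

# Repeat the group means for each observation on <id>.
dat2 <- subset(merge(data.frame(id = dat$id), Bdat), select = -id)

# Indicators plus deviations from the group means.
wdat <- cbind(dat[, c("id", "Time")], dat[, c("x", "y")] - dat2)
wdat
```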
We can use variables from both datasets without merging the datasets. Let us regress
<data1$wage> on <data1$female> and <data2$experience>.
We can put together variables from different data frames into a data frame and do
our analysis on these data. This requires that the variables have the same number of
observations (vectors with the same length).
data1.wage data1.female data2.experience
1 81 1 17
2 77 1 10
3 63 1 18
4 84 1 16
5 110 0 13
6 151 0 15
7 59 1 19
8 109 0 20
9 159 1 21
10 71 0 20
We can merge the datasets. If we have one common variable in both data sets, the
data is merged according to that variable.
Notice that unlike some other software packages, we do not need the observations to appear in
the same order as defined by the <id>.
If we need to match two data sets using a common variable (column) and the common
variable has different names in the datasets, we can either change the names to the
same name or use the data as they are and specify the variables that are to be used
for matching in the data sets. If the matching variables in <data2> and <data1> are
called <id2> and <id>, you can use the following syntax:
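A minimal sketch with two small made-up data frames. The arguments <by.x> and <by.y> name the matching variable in the first and the second data frame. The rows of <data1> are deliberately out of order.

```r
# Hypothetical data: the matching variable is <id> in data1 and <id2> in data2.
data1 <- data.frame(id = c(3, 1, 2), wage = c(63, 81, 77))
data2 <- data.frame(id2 = c(1, 2, 3), experience = c(17, 10, 18))

# Match <id> in data1 with <id2> in data2.
data3 <- merge(data1, data2, by.x = "id", by.y = "id2")
data3

# Alternatively, give the variables the same name and merge on it.
names(data2)[1] <- "id"
data3b <- merge(data1, data2)
```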
2 77 0 10
3 63 1 18
If you want to add a number of observations at the end of a data set, you use <rbind>.
The following example splits the columns 2, 3 and 4 in <data4> in two parts and then
puts them together by <rbind>.
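A sketch of such a split, assuming a small made-up <data4> with four columns. The names <top>, <bottom> and <data5> are only illustrative.

```r
# Hypothetical data set with four columns.
data4 <- data.frame(id = 1:4, wage = c(81, 77, 63, 84),
                    female = c(1, 1, 1, 1), experience = c(17, 10, 18, 16))

# Split columns 2, 3 and 4 into the first two and the last two rows.
top    <- data4[1:2, 2:4]
bottom <- data4[3:4, 2:4]

# Put the two parts together again, one below the other.
data5 <- rbind(top, bottom)
data5
```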
5 Basic statistics
Summary statistics for all variables in a data frame:
summary(mydata)
mean(myvariable)
median(myvariable)
sd(myvariable)
max(myvariable)
min(myvariable)
# compute 10, 20, ..., 90 percentiles
quantile(myvariable, 1:9/10)
When R computes <sum>, <mean> etc. on an object containing <NA>, it returns <NA>.
To be able to apply these functions on observations where data exists, you should
add the argument <na.rm=TRUE>. Another alternative is to remove all lines of data
containing <NA> by <na.omit>.
[1] NA
> sum(a,na.rm=TRUE)
[1] 8
a
1 3 4 <NA>
1 1 1 1
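The output above is consistent with a vector such as the one in the following sketch. The exact vector used is an assumption.

```r
# A vector with one missing value.
a <- c(1, NA, 3, 4)

sum(a)                      # NA, since one element is missing
sum(a, na.rm = TRUE)        # ignores the missing value
table(a, useNA = "ifany")   # tabulates the values, counting NA as well
```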
You can also use <sum(na.omit(a))> that removes the NA and computes the sum or
<sum(a[!is.na(a)])> that sums the elements that are not NA (!is.na) in <a>.
5.1 Tabulation
Load the ’DataWageMacro.rda’ first.
Cross Tabulation
> attach(Wages1,warn.conflicts=FALSE)
> table(sex, school > 12) # yields frequencies
Creating various statistics by category. The following yields average wage for males
and females.
female male
5.146924 6.313021
Using <length>, <min>, <max>, etc. yields number of observations, minimum,
maximum etc. for males and females.
female male
1569 1725
The following example yields average wage for males and females with schooling more
than 12 years and those who have 12 years or less of schooling.
FALSE TRUE
female 4.676244 6.486286
male 5.838858 8.209675
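Tables of this kind can be produced with <tapply>. The sketch below uses a small made-up data frame <wages> in place of <Wages1>, so the numbers differ from those above.

```r
# Hypothetical stand-in for the Wages1 data frame.
wages <- data.frame(
  wage   = c(5, 7, 4, 6, 8, 10),
  sex    = factor(c("female", "male", "female", "female", "male", "male")),
  school = c(10, 12, 13, 14, 11, 16))

# Average wage for males and females.
m <- tapply(wages$wage, wages$sex, mean)

# Number of observations for males and females.
n <- tapply(wages$wage, wages$sex, length)

# Average wage by sex and by schooling above 12 years or not.
tab <- tapply(wages$wage, list(wages$sex, wages$school > 12), mean)
tab
```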
The following computes the average by group creating a vector of the same length.
Same length implies that the group statistic is repeated for all members of each
group. Average wage for males and females:
The function <mean> can be substituted with <min>, <max>, <length> etc. yielding
group-wise minimum, maximum, number of observations, etc.
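One base R function with this behaviour is <ave>. The sketch below assumes made-up vectors <wage> and <sex>; whether the original example used <ave> is an assumption.

```r
# Hypothetical wages and sex for six individuals.
wage <- c(5, 7, 4, 6, 8, 10)
sex  <- factor(c("female", "male", "female", "female", "male", "male"))

# Group mean repeated for every member of the group.
wage.mean <- ave(wage, sex, FUN = mean)

# Group size repeated for every member of the group.
wage.n <- ave(wage, sex, FUN = length)

cbind(wage, wage.mean, wage.n)
```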
6 Matrixes
In R we define a matrix as follows (see ?matrix in R):
A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns.
> matrix(1:12,3,4)
A matrix with 3 rows and 4 columns with elements 1,2,3, ..., 12 filled by rows:
[1] 3 4
[1] 3
[1] 4
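A short sketch of filling by rows and of the functions that return the dimensions shown above:

```r
# Elements 1, 2, ..., 12 filled by rows.
A <- matrix(1:12, 3, 4, byrow = TRUE)
A

dim(A)    # number of rows and columns
nrow(A)   # number of rows
ncol(A)   # number of columns
```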
6.1 Indexation
The elements of a matrix can be extracted by using brackets after the matrix name
and referring to rows and columns separated by a comma. You can use the indexation
in a similar way to extract elements of other types of objects.
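A few indexing examples, assuming the matrix <A> with elements 1 to 12 filled by rows:

```r
A <- matrix(1:12, 3, 4, byrow = TRUE)

A[2, 3]        # element in row 2, column 3
A[1, ]         # first row
A[, 2]         # second column
A[, c(1, 3)]   # columns 1 and 3
A[A > 6]       # elements greater than 6, taken column by column
```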
> A>3 # Elements greater than 3
[1] 9 10 7 11 8 12
> diag(2,3,3)
> diag(diag(2,3,3))
[1] 2 2 2
> t(matrix(1:6,2,3)) # Transpose
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
Try matrix(1:6,2,3) and matrix(1:6,3,2, byrow=TRUE).
Addition and subtraction
Addition and subtraction can be applied on matrixes of the same dimensions or a
scalar and a matrix.
# Try this
A <- matrix(1:12,3,4)
B <- matrix(-1:-12,3,4)
C1 <- A+B
D1 <- A-B
Scalar multiplication
# Try this
A <- matrix(1:12,3,4); TwoTimesA = 2*A
c(2,2,2)*A
c(1,2,3)*A
c(1,10)*A
Matrix multiplication
For multiplying matrixes R uses <%*%> and this works only when the matrixes are
conformable.
E <- matrix(1:9,3,3)
crossproduct.of.E <- t(E)%*%E
# Or another and more efficient way of obtaining crossproducts is:
crossproduct.of.E <- crossprod(E)
Matrix inversion
The inverse of a square matrix $A$, denoted $A^{-1}$, is defined as the matrix that when
multiplied with $A$ results in an identity matrix (1's on the diagonal and 0's in all
off-diagonal elements):
$$A A^{-1} = A^{-1} A = I$$
FF <- matrix((1:9),3,3)
detFF<- det(FF) # we check the determinant
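The determinant of <FF> is (numerically) zero, so <FF> has no inverse. As a sketch, the following uses a made-up invertible matrix <B> and <solve()> to compute and check its inverse.

```r
FF <- matrix(1:9, 3, 3)
det(FF)                  # (numerically) zero: FF is singular

# A hypothetical invertible matrix.
B <- matrix(c(2, 1, 1, 3), 2, 2)
B.inv <- solve(B)        # the inverse of B

# Both products give the identity matrix (up to rounding).
B %*% B.inv
B.inv %*% B
```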
The dependent variable is <log(wage)> and the explanatory variables
are <school> and <sex>. An intercept is included by default. Notice that we do not
have to specify the data since the data frame <Wages1> containing these variables is
attached. The result of the regression is assigned to the object named <reg.model>.
This object includes a number of interesting regression results that can be extracted
as illustrated further below after some examples for using <lm>.
Call:
lm(formula = log(wage) ~ school + sex, data = Wages1)
Residuals:
Min 1Q Median 3Q Max
-3.9486 -0.2844 0.0589 0.3713 2.0451
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.119153 0.074659 1.596 0.111
school 0.114517 0.006183 18.522 <2e-16 ***
sexmale 0.260111 0.020517 12.678 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Sometimes we need to use transformed values of the variables in the model. The
transformation should be given inside the function <I()>, which tells R to treat the
expression as is, so that arithmetic operators are not read as formula operators.
<I(expr^2)> enters <expr> squared in the model.
Same as:
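A minimal sketch of the two equivalent forms, using simulated data in a made-up data frame <wages>. The third model shows what happens when <I()> is left out.

```r
# Simulated data (an assumption for the illustration).
set.seed(1)
wages <- data.frame(school = sample(6:16, 50, replace = TRUE))
wages$wage <- exp(0.5 + 0.1 * wages$school + rnorm(50, sd = 0.3))

# Squared schooling inside the formula by means of I().
m1 <- lm(log(wage) ~ school + I(school^2), data = wages)

# The same model with the squared variable created beforehand.
wages$school2 <- wages$school^2
m2 <- lm(log(wage) ~ school + school2, data = wages)

# Without I(), <^> is read as a formula operator and no squared term enters.
m3 <- lm(log(wage) ~ school + school^2, data = wages)

coef(m1)
coef(m3)
```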
7.1 Extracting the model formula and results
The model formula
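A sketch of some common extractor functions applied to a fitted model. The data are simulated and the variable names are made up; only the extractor functions themselves are standard R.

```r
# Simulated data (an assumption for the illustration).
set.seed(1)
d <- data.frame(school = sample(6:16, 50, replace = TRUE))
d$lwage <- 0.5 + 0.1 * d$school + rnorm(50, sd = 0.3)
reg.model <- lm(lwage ~ school, data = d)

formula(reg.model)                    # the model formula
coef(reg.model)                       # estimated coefficients
se <- coef(summary(reg.model))[, 2]   # standard errors
tv <- coef(summary(reg.model))[, 3]   # t-values
head(resid(reg.model))                # first residuals
summary(reg.model)$r.squared          # R-squared
```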
> coef(summary(reg.model))[,2]
> coef(summary(reg.model))[,3]
7.2 White’s heteroskedasticity corrected standard errors
The packages <car> and <sandwich> have predefined functions for computing robust
standard errors. There are different weighting options.
The White’s correction
> library(car)
> f1 <- formula(log(wage) ~ sex +school)
> sqrt(diag(hccm(lm(f1),type="hc1")))
> library(sandwich)
> library(lmtest)
> coeftest(lm(f1), vcov=(vcovHC(lm(f1), "HC1")))
t test of coefficients:
<hc0> in library <car> and <HC0> in library <sandwich> use the original White formula.
The <hc1> and <HC1> options multiply the variances by $N/(N-k)$.
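To make the two weighting options concrete, the following sketch computes them by hand in base R on simulated data. It illustrates the formulas and is not the code used inside <car> or <sandwich>.

```r
# Simulated data with heteroskedastic errors (an assumption).
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = exp(0.5 * x))
m <- lm(y ~ x)

X <- model.matrix(m)       # regressors including the intercept
e <- resid(m)              # OLS residuals
k <- ncol(X)               # number of estimated coefficients

# White's original formula: (X'X)^-1 X' diag(e^2) X (X'X)^-1
bread <- solve(crossprod(X))
V.hc0 <- bread %*% crossprod(X * e) %*% bread

# Degrees of freedom correction N/(N-k).
V.hc1 <- V.hc0 * n / (n - k)

sqrt(diag(V.hc1))          # robust standard errors
```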
7.3 F-test
Estimate the restricted (restricting some (or all) of slope coefficients to be zero) and
the unrestricted model (allowing non-zero as well as zero coefficients). You can then
use anova() to test the joint hypotheses defined as in the restricted model.
Model 1: log(wage) ~ 1
Model 2: log(wage) ~ sex + school
Res.Df RSS Df Sum of Sq F Pr(>F)
1 3293 1277.0
2 3291 1122.1 2 154.9 227.15 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
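A sketch of the procedure on simulated data. The variable names are made up, while the model names <mod.restricted> and <mod.unrestricted> follow the text.

```r
# Simulated data (an assumption for the illustration).
set.seed(3)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(80)

# Restricted model: all slope coefficients are zero.
mod.restricted <- lm(y ~ 1, data = d)

# Unrestricted model.
mod.unrestricted <- lm(y ~ x1 + x2, data = d)

# F-test of the joint hypothesis.
ftab <- anova(mod.restricted, mod.unrestricted)
ftab
```

With an intercept-only restricted model, the F-value equals the model F-statistic reported by <summary(mod.unrestricted)>.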
Under non-constant error variance, we use the White variance-covariance matrix
and the model F-value is as follows. The <-1> in the codes below removes the
row/column related to the intercept.
> library(car)
> COV <- hccm(mod.unrestricted, "hc1")[-1,-1]
> beta <- matrix(coef(mod.unrestricted), ,1)[-1,]
> t(beta)%*%solve(COV)%*%beta/(lm(f1)$rank -1)
[,1]
[1,] 204.0736
8 Writing functions
The syntax is: myfunction <- function(x, a, ...) {...} The arguments for
a function are the variables used in the operations as specified in the body of the
function, i.e. the codes within { }. Once you have written a function and saved it,
you can use this function to perform the operation as specified in { ... } by referring to
your function and using the arguments relevant for the actual computation.
The following function computes the square of the mean of a variable. By defining the
function <ms> we can write <ms(x)> instead of <(mean(x))^2> every time we want
to compute the square of the mean for a variable <x>.
[1] 2550.25
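A minimal sketch of such a function. The input <1:100> is an assumption that is consistent with the value shown above.

```r
# Square of the mean of a variable x.
ms <- function(x) {
  (mean(x))^2
}

ms(1:100)   # mean is 50.5, so the result is 2550.25
```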
[1] "Welcome"
This function takes an argument x. The arguments of the function must be supplied.
If a default value is specified, the default value is assumed when no arguments are
supplied.
[1] "I use R and sometimes something else for statistical computation."
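A sketch of a function with a default value. The function name <mysoftware> and its wording are made up for the illustration.

```r
# The argument x has the default value "R".
mysoftware <- function(x = "R") {
  paste("I use", x, "for statistical computation.")
}

mysoftware()                                   # the default value is used
mysoftware("R and sometimes something else")   # the supplied value is used
```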
8.1 A function for computing Clustered Standard Errors
Here follows a function for computing clustered-Standard Errors. (See also the function
robcov in the library Design discussed above.) The arguments are a data frame <dat>,
a model formula <f1>, and the cluster variable <cluster>.
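A base R sketch of a function with these three arguments. The name <cl.se>, the simulated data and the small-sample correction used are assumptions, and the sketch assumes that the data contain no missing values.

```r
# Clustered standard errors for a linear model (illustrative sketch).
cl.se <- function(dat, f1, cluster) {
  m  <- lm(f1, data = dat)
  X  <- model.matrix(m)
  uj <- rowsum(X * resid(m), cluster)   # sum of scores within each cluster
  M  <- nrow(uj)                        # number of clusters
  N  <- nrow(X)                         # number of observations
  K  <- ncol(X)                         # number of coefficients
  dfc <- (M / (M - 1)) * ((N - 1) / (N - K))   # small-sample correction
  bread  <- solve(crossprod(X))
  vcovCL <- dfc * bread %*% crossprod(uj) %*% bread
  sqrt(diag(vcovCL))
}

# Simulated data with 10 clusters of 5 observations each.
set.seed(4)
dat <- data.frame(g = rep(1:10, each = 5), x = rnorm(50))
dat$y <- 1 + dat$x + rnorm(10)[dat$g] + rnorm(50)
cl.se(dat, y ~ x, dat$g)
```

When every observation forms its own cluster, the correction reduces to N/(N-K) and the result coincides with the HC1 standard errors.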
9 Acknowledgements
I am grateful to Michael Lundholm, Lena Nekby and Achim Zeileis for helpful comments.