Lecture 2
Outline
Why R, and R Paradigm
References, Tutorials and links
R Overview
R Interface
R Workspace
Help
R Packages
Input/Output
Reusing Results
You can safely remove the object mean with the function
rm() without risking deletion of the mean function.
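A minimal sketch of why this is safe, assuming a workspace object that happens to be named mean (the values are made up for illustration). rm() removes only the workspace object; the built-in function lives in the base package and is untouched.

```r
# A user object that shares its name with the built-in mean() function
mean <- c(2, 4, 6)

# When a name is called as a function, R skips non-function objects,
# so the built-in mean() is still found
before <- mean(mean)

# rm() removes only the workspace object, not the base function
rm(mean)
after <- mean(c(4, 6))

print(before)
print(after)
```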
Importing Data
Importing data into R is fairly simple.
For Stata and Systat, use the foreign package.
For SPSS and SAS I would recommend the Hmisc
package for ease and functionality.
See the Quick-R section on packages for information
on obtaining and installing these packages.
Examples of importing data are provided below.
From Excel
The best way to read an Excel file is to export it to a comma delimited file and
import it using the method above.
On Windows systems you can use the RODBC package to access Excel files.
The first row should contain variable/column names.
# first row contains variable names
# we will read in workSheet mysheet
library(RODBC)
channel <- odbcConnectExcel("c:/myexel.xls")
mydata <- sqlFetch(channel, "mysheet")
odbcClose(channel)
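The comma-delimited route can be sketched as follows. This is an illustration only: the temporary file and its contents are made up and stand in for a worksheet exported from Excel, and read.table() with sep="," is assumed to be the import method meant.

```r
# Write a small comma-delimited file to a temporary location
# (hypothetical data standing in for an exported worksheet)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,3.5", "2,4.0", "3,2.5"), csv_path)

# header = TRUE: the first row contains variable names
mydata <- read.table(csv_path, header = TRUE, sep = ",")
str(mydata)
```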
%y 2-digit year 07
%Y 4-digit year 2007
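A short sketch of the two year codes in use. The date is arbitrary, and %d and %m (day and month) are standard format codes not listed in the rows above.

```r
d <- as.Date("2007-06-22")

# Output formatting: 2-digit versus 4-digit year
two_digit  <- format(d, "%y")
four_digit <- format(d, "%Y")

# The same codes describe the layout of text being read in
parsed <- as.Date("22/06/07", format = "%d/%m/%y")

print(two_digit)
print(four_digit)
print(parsed)
```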
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)
Search for pattern in x. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. Returns matching indices.
grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
Find pattern in x and replace with replacement text. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string.
sub("\\s",".","Hello There") returns "Hello.There"
toupper(x) Uppercase
tolower(x) Lowercase
Applied Statistical Computing and Graphics
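The character functions above can be tried on a few made-up strings. A minimal sketch:

```r
x <- c("b", "A", "c")

# Literal match: index of the element equal to "A"
idx_fixed <- grep("A", x, fixed = TRUE)

# Regular-expression match that ignores case
idx_nocase <- grep("a", x, ignore.case = TRUE)

# sub() replaces only the first match of the pattern
replaced <- sub("\\s", ".", "Hello There")

upper <- toupper("abc")
lower <- tolower("ABC")

print(list(idx_fixed, idx_nocase, replaced, upper, lower))
```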
Stat/Prob Functions
The following table describes functions related
to probability distributions. For random number
generators below, you can use set.seed(1234)
or some other integer to create reproducible
pseudo-random numbers.
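A minimal sketch of what reproducible means here: resetting the same seed before each draw yields identical numbers. rnorm() is used as one example generator.

```r
set.seed(1234)
a <- rnorm(5)

# Same seed, same sequence
set.seed(1234)
b <- rnorm(5)

# Without resetting the seed, the stream simply continues
c_draw <- rnorm(5)

print(identical(a, b))
print(identical(a, c_draw))
```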
sd(x) standard deviation of x. Also see var(x) for variance and mad(x) for median absolute deviation.
median(x) median
quantile(x, probs) quantiles, where x is the numeric vector whose quantiles are desired and probs is a numeric vector of probabilities in [0,1].
# 30th and 84th percentiles of x
y <- quantile(x, c(.3,.84))
range(x) range
sum(x) sum
diff(x, lag=1) lagged differences, with lag indicating which lag to use
min(x) minimum
max(x) maximum
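The summary functions above applied to a small made-up vector, as a quick sketch:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

s   <- sd(x)       # standard deviation
v   <- var(x)      # variance, equal to sd(x)^2
med <- median(x)
rng <- range(x)    # c(minimum, maximum)
tot <- sum(x)

# 30th and 84th percentiles of x
q <- quantile(x, c(.3, .84))

# Lagged differences: x[i + lag] - x[i]
d1 <- diff(x, lag = 1)
d2 <- diff(x, lag = 2)

print(list(sd = s, var = v, median = med, range = rng, sum = tot,
           quantiles = q, diff1 = d1, diff2 = d2))
```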
resamplings.
# y is numeric, A is a grouping factor, and B is a
# blocking factor.
library(coin)
friedman_test(y~A|B, data=mydata,
distribution=approximate(B=9999))
Dr. Fox's car package provides advanced utilities for regression modeling.
# Assume that we are fitting a multiple linear regression
# on the mtcars data
library(car)
fit <- lm(mpg~disp+hp+wt+drat, data=mtcars)
This example is for exposition only. We will ignore the fact that this may not
be a great way of modeling this particular set of data!
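A hedged sketch of two diagnostics on the fit above. vif() and outlierTest() are functions in the car package; choosing these two is an assumption for illustration, and the car calls run only if the package is installed.

```r
# Same model as above, fitted with base R
fit <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
print(coef(fit))

# Optional car diagnostics, skipped when car is not installed
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(fit))          # variance inflation factors
  print(car::outlierTest(fit))  # Bonferroni test for the largest residual
}
```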
Nonlinear Regression
The nls() function, in the stats package that ships with R, fits nonlinear regression models. See John Fox's Nonlinear Regression and Nonlinear Least Squares for an overview. Huet and colleagues' Statistical Tools for Nonlinear Regression: A Practical Guide with S-PLUS and R Examples is a valuable reference book.
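A minimal sketch of a nonlinear least squares fit with nls(). The data are simulated from an exponential curve with made-up parameters, and taking starting values from a log-linear fit is one common choice, not the only one.

```r
set.seed(1234)

# Simulated data: y = a * exp(b * x) + noise, with a = 2 and b = 0.15
x <- 1:20
y <- 2 * exp(0.15 * x) + rnorm(20, sd = 0.2)
sim <- data.frame(x = x, y = y)

# Starting values from a linear fit on the log scale
lin <- coef(lm(log(y) ~ x, data = sim))
start_vals <- list(a = exp(unname(lin[1])), b = unname(lin[2]))

# Nonlinear least squares fit
fit_nls <- nls(y ~ a * exp(b * x), data = sim, start = start_vals)
est <- coef(fit_nls)
print(est)
```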
Robust Regression
The robust package provides a comprehensive library of robust methods, including regression. The robustbase package also provides basic robust statistics including model selection methods. And David Olive has provided a detailed online review of Applied Robust Statistics with sample R code.
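A hedged sketch comparing ordinary least squares with a robust fit. It assumes lmrob() from the robustbase package and uses the built-in stackloss data; both choices are for illustration, and the robust fit runs only if robustbase is installed.

```r
# Ordinary least squares fit for comparison
fit_ols <- lm(stack.loss ~ ., data = stackloss)
print(coef(fit_ols))

# Robust MM-type fit, skipped when robustbase is not installed
if (requireNamespace("robustbase", quietly = TRUE)) {
  set.seed(1234)  # lmrob() uses random subsampling internally
  fit_rob <- robustbase::lmrob(stack.loss ~ ., data = stackloss)
  print(coef(fit_rob))
}
```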
Graphics
ANOVA
If you have been analyzing
ANOVA designs in traditional
statistical packages, you are likely
to find R's approach less coherent
and user-friendly. A good online
presentation on ANOVA in R is
available from Katholieke
Universiteit Leuven.
Cohen suggests that d values of 0.2, 0.5, and 0.8 represent small,
medium, and large effect sizes respectively.
You can specify alternative="two.sided", "less", or "greater" to indicate
a two-tailed or one-tailed test. A two-tailed test is the default.
Power Analysis - ANOVA
For a one-way analysis of variance use
pwr.anova.test(k = , n = , f = , sig.level = , power = )
where k is the number of groups and n is the common sample
size in each group.
For a one-way ANOVA effect size is measured by f where
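Assuming Cohen's usual definition, f is the square root of sum(p_i * (mu_i - mu)^2) / sigma^2, where p_i = n_i/N is the share of observations in group i, mu_i is the group mean, mu is the grand mean, and sigma^2 is the error variance within groups. A minimal sketch with made-up group means; the pwr call runs only if the package is installed.

```r
# Hypothetical group means and common within-group standard deviation
group_means <- c(10, 12, 14)
sigma <- 4

# Equal group sizes, so each p_i = 1/k
k <- length(group_means)
p <- rep(1 / k, k)
grand_mean <- sum(p * group_means)

# Cohen's f under the definition assumed above
f <- sqrt(sum(p * (group_means - grand_mean)^2) / sigma^2)
print(f)

# Per-group sample size for 80% power, if pwr is available
if (requireNamespace("pwr", quietly = TRUE)) {
  print(pwr::pwr.anova.test(k = k, f = f, sig.level = 0.05, power = 0.80))
}
```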
The first formula is appropriate when we are evaluating the impact of a set of
predictors on an outcome. The second formula is appropriate when we are
evaluating the impact of one set of predictors above and beyond a second set
of predictors (or covariates). Cohen suggests f2 values of 0.02, 0.15, and 0.35
represent small, medium, and large effect sizes.
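The two effect sizes can be sketched as follows, assuming Cohen's standard definitions: f2 = R2 / (1 - R2) for a single set of predictors, and f2 = (R2_AB - R2_A) / (1 - R2_AB) for set B above and beyond set A. The mtcars models are chosen only for illustration.

```r
# Set A: covariate only. Sets A and B: covariate plus the predictor of interest
fit_a  <- lm(mpg ~ wt, data = mtcars)
fit_ab <- lm(mpg ~ wt + hp, data = mtcars)

r2_a  <- summary(fit_a)$r.squared
r2_ab <- summary(fit_ab)$r.squared

# First form: impact of the full set of predictors
f2_full <- r2_ab / (1 - r2_ab)

# Second form: impact of hp above and beyond wt
f2_change <- (r2_ab - r2_a) / (1 - r2_ab)

print(c(f2_full = f2_full, f2_change = f2_change))
```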
Power Analysis
Tests of Proportions
When comparing two proportions use
pwr.2p.test(h = , n = , sig.level =, power = )
where h is the effect size and n is the common sample size in each
group.
Cohen suggests that h values of 0.2, 0.5, and 0.8 represent small,
medium, and large effect sizes respectively.
For unequal n's use
pwr.2p2n.test(h = , n1 = , n2 = , sig.level = , power = )
To test a single proportion use
pwr.p.test(h = , n = , sig.level = , power = )
For both two sample and one sample proportion tests, you can specify
alternative="two.sided", "less", or "greater" to indicate a two-tailed
or one-tailed test. A two-tailed test is the default.
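A hedged sketch of obtaining h from two proportions, assuming Cohen's arcsine definition h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)). The proportions are made up, and the pwr call runs only if the package is installed.

```r
# Hypothetical proportions in the two groups
p1 <- 0.60
p2 <- 0.50

# Cohen's h via the arcsine transformation
h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
print(h)

# Common per-group sample size for 80% power, two-tailed by default
if (requireNamespace("pwr", quietly = TRUE)) {
  print(pwr::pwr.2p.test(h = h, sig.level = 0.05, power = 0.80))
}
```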
Power Analysis
Chi-square Tests
For chi-square tests use
pwr.chisq.test(w =, N = , df = , sig.level =, power = )
where w is the effect size, N is the total sample size, and df is
the degrees of freedom. The effect size w is defined as
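Assuming Cohen's usual definition, w is the square root of sum((p0_i - p1_i)^2 / p0_i), where p0_i and p1_i are the cell probabilities under the null and alternative hypotheses. A minimal sketch for a two-cell goodness-of-fit test with made-up probabilities; the pwr call runs only if the package is installed.

```r
# Hypothetical cell probabilities under H0 and H1
p0 <- c(0.5, 0.5)
p1 <- c(0.6, 0.4)

# Cohen's w under the definition assumed above
w <- sqrt(sum((p0 - p1)^2 / p0))
print(w)

# Total sample size N for 80% power with df = number of cells - 1
if (requireNamespace("pwr", quietly = TRUE)) {
  print(pwr::pwr.chisq.test(w = w, df = length(p0) - 1,
                            sig.level = 0.05, power = 0.80))
}
```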
$$ g(D) = E(Y \mid D) = E_0 + \frac{E_{\max} D}{ED_{50} + D} $$