StatisticUsing R PDF
StatisticUsing R PDF
1 INTRODUCTION TO STATISTICS
Statistics is the science of learning from data. It is the fields that deal with all aspect of data
including the planning of data collection, data preparation, analysis and interpretation to
make credible understanding or meaning from data in making more effective decisions. In
general, there are two types of statistics:
Descriptive statistics: statistics that summarize, organize and make sense of
observations. In other word, descriptive statistics is concerned to describe certain
features of a variable.
Inferential statistics: statistics use data gathered from a sample to make inferences about
the larger population from which the sample was drawn.
1.2 Variable
In statistics, each variable is required to be defined and categorized with the scales of
measurement. Each scale of measurement has certain properties which in turn determine the
appropriateness for use of certain statistical analyses. The four scales of measurement are
nominal, ordinal, interval, and ratio.
Nominal ( Qualitative)
Categorical data and numbers that are simply used as identifiers or names represent a nominal
scale of measurement. Each observation belongs to some distinctively different categories. It is
cannot be quantified or ranked. (i.e. 1=Married, 2=Single ; M=Married, S=Single)
Ordinal: (Quantitative)
Interval: ( Quantitative)
A scale that represents quantity and has equal units but for which zero represents simply an
additional point of measurement is an interval scale. The Fahrenheit scale is a clear example of
the interval scale of measurement. Thus, 60 degree Fahrenheit or -10 degrees Fahrenheit
represent interval data. Measurement of Sea Level is another example of an interval scale. With
each of these scales there are direct, measurable quantities with equality of units. In addition,
zero does not represent the absolute lowest value. Rather, it is point on the scale with numbers
both above and below it (for example, -10degrees Fahrenheit).
Ratio: ( Quantitative)
2
3 Introduction to Statistics
The ratio scale of measurement is similar to the interval scale in that it also represents quantity
and has equality of units. However, this scale also has an absolute zero (no numbers exist below
zero). Very often, physical measures will represent ratio data (for example, height and weight). If
one is measuring the length of a piece of wood in centimeters, there is quantity, equal units, and
that measure cannot go below zero centimeters. A negative length is not possible.
The table can be used to conduct descriptive statistics between the four scales of measurement
Qualitative Quantitative
Unordered
Definition Ordered Categories Numeric Values
Categories
Histogram (with
Graph Pie / Bar Pie / Bar
normal curve)
3
4 Introduction to R
2 INTRODUCTION TO R
R is widely known as a free software programming language and software environment
for statistical computing and graphics. It provides a wide variety of statistical analysis such as
linear and nonlinear modelling, time-series analysis, classical statistical tests, and graphical
techniques which is highly extensible. Likewise, R is also an integrated suite of software which
facilities data manipulation, calculation and graphical display. The functionalities of R includes
an effective data handling and storage facility, a suite of operators for calculations, integrated
collection of intermediate tools for data analysis, graphical facilities for data analysis as well as a
well-developed, simple yet effective programming language which includes conditionals, loops,
user-defined recursive functions and input and output facilities.
There are around eight packages supplied with the R. Those packages are available through the
Comprehensive R Archives Network (CRAN) sites, http://cran.stat.nus.edu.sg. The
functionalities are quite similar as Minitab, SPSS and other statistical software that we can find
in the market. However, this R is almost similar like a Matlab which it allows us to do
programming in order to do computation and generate graphical view (graph).
4
5 Introduction to R
R is an open-source package for statistical computing. Open source has a number of different
meanings; here the important one is that R is freely available, and its users are free to see how it
is written, and to improve it. R is based on the computer language S, developed by John
Chambers and others at Bell Laboratories in 1976. In 1993 Robert Gentleman and Ross Ihaka at
the University of Auckland wanted to do experiment with the language, so they developed an
implementation, and named it R. They made it open source in 1995, and hundreds of people
around the world have contributed to its development until today.
S-PLUS is a commercial implementation of the S language. Because both R and S-PLUS are
based on the S language, much of what is described in what follows will apply without change to
S-PLUS. However, other statistical packages such as S-PLUS, SAS, SPSS are not free. These are
the licensed software and we have to buy in the market.
2.3 Installation of R
Browse http://www.r-project.org/
Click download R
We can read the available information and most frequent asked questions in the website.
5
6 Introduction to R
Click Download R-3.1.0 for Windows and save into our hard disk (will take some time as
the file is big, 54MB for 32/64 bit machines).
2.4.1 Syntax of R
In Microsoft Windows, the R installer will have created a start menu item and an icon for R on
our desktop. Double clicking on the R icon to start the program.
First thing that will happen is that R will open the console, into which the user can type any
commands. The greater-than sign (>) is the prompt symbol. When this appears, you can begin
typing commands.
Upon pressing the Enter key, the result 54 appears, prefixed by the number 1 in square brackets:
> 5 + 49
[1] 54
The [1] indicates that this is the first (and in
this case only) result from the command
The sequence of integers from 1 to 100 may be displayed as follows:
> 1:100
Comment
6
7
Sign "#" is used for any remark and comment. Everything following a "#" sign is assumed to be a ignored
by R.
> # type any comment and the comment is ignored by R
More calculation
>3-8
[1] -5
> 12 / 4
[1] 3
> 13 / 5
[1] 2.6 Note what happens if you omit the parentheses () when attempting
to quit:
>q
2.4.3 Quit From R function (save = "default", status = 0, runLast = TRUE)
To quit your R session, type .Internal(quit(save, status, runLast))
> q() <environment: namespace:base>
3 INTRODUCTION TO RSTUDIO
The RStudio IDE is a user interface for R. Its free and open source, and works on Windows,
Mac, and Linux. RStudio is not a replacement for R. R must be installed in addition to RStudio.
However, when both are installed on your system, only RStudio needs to be run.
R STUDIO INSTALLATION .
7
8 Introduction to RStudio
The RStudio interface consists of several main windows sitting below a top-level toolbar and
menu bar. Although this placement can be adjusted, the default layout utilizes four main panels
or panes in the following positions:
8
9 Introduction to RStudio
If you have different projects you can change the working directory for that session, see above.
Or you can type:
getwd()
# Changes the wd
setwd("C:/myfolder/data")
9
10 Introduction to RStudio
10
11 Introduction to RStudio
11
12 Introduction to RStudio
12
13 Introduction to RStudio
The easiest way to enter data in R is to work with a text file, in which the columns are
separated by tabs. There are many ways in which you can get such files. Here we simply
explain how to make such a file with Excel. Open Excel and enter the data.
After enter the data in worksheet in Excel, give the file a name (e.g. student) and select the
option Text (Tab delimited). Now save the file.
To enter the data in RStudio you can use the command Import Dataset from Text File:
13
14 Introduction to RStudio
Find the student.txt file and open it. Import the data as shown in the following screenshot:
Now you have defined a dataset in R, called student, containing the variable midMarks as shown
in source window (upper left panel).
14
15 Introduction to RStudio
To start analyzing, create the workspace with a new Rscript file (click on the option File or
create icon in the menu bar)
To make sure you have entered your data correctly, type the command list (student),
activate it (by going over it with the mouse, while you keep the left button pressed and click on
Run Line(s).
15
16 Introduction to RStudio
This should show you the data of the midmarks in the lower left panel.
Two of the most common statistics are the mean and standard deviation:
Standard Deviation :
16
17 Introduction to RStudio
Histogram
> hist(mystud)
17
18 Descriptive Statistics
Boxplot
4 DESCRIPTIVE STATISTICS
4.1 Frequency
#simple frequency
x=table(mystud)
x
barplot (x)
18
19 Descriptive Statistics
group=seq(0,100, by=20)
group.data=cut(student$midMarks,group,right=FALSE)
group.data
group.feq=table(group.data)
list(group.feq)
cbind(group.feq)
barplot(group.feq)
barplot(group.feq, col=c("lightblue", "mistyrose",
"lightcyan","lavender", "cornsilk"), )
barplot(group.feq, col=c("lightblue", "mistyrose",
"lightcyan","lavender", "cornsilk"), main= list("Carry mark
BACS1111", font=2))
Making a frequency distribution table at first sight looks more complicated in R than in other
programs. The reason for this, however, is that you use the same commands for the simplest
table as for the more complicated. So, the commands are slightly harder to start with, but
remain the same to make all types of tables, included grouped frequency distribution tables.
First, you have to define the spans you are interested in. Suppose we are interested in all spans
from 0 to 100. This is defined by the command:
group=seq(0,100, by=10)
19
20 Descriptive Statistics
This commands defines the lower values of the intervals as the sequence of values from 0 to
100 with a step function of 10 (i.e., it will go from 0 to 10, to 20, etc.). If we define by=20,
then the lower values of the spans will go from 0 to 20, to 40, and so on, which means that we
will have a grouped frequency distribution table.
Next we segment the continuum of values in midMarks according to the spans we defined
above. We do this with the following command:
group.data=cut(student$midMarks,group,right=FALSE)
This commands cuts the values of the variable midMarks data from the dataset student into
parts starting with the lower value defined in span, and ending with the upper value
immediately below the lower value of the next span, and puts it into a new variable group,data.
The following parts are important:
group.data
This is the variable group.data from the dataset student. Variables from a dataset are defined as
dataset$variable (i.e., with a $ sign between the name of the dataset and the name of the
variable).
right=FALSE makes that the lower value of the next interval will be excluded from the present
interval. If we had not done this, the interval 20 40 would exclude the number 20 and include
the number 40. Now the interval goes from 20.0 to 39.9999.
To calculate the frequencies of the intervals (Midterm group) we use the command:
group.feq=table(group.data)
20
21 Descriptive Statistics
21
22 Descriptive Statistics
Write command:
list(student)
22
23 Descriptive Statistics
gen.feq=table(student$gender)
gen.feq
cbind(gen.feq)
barplot (gen.feq, col=c("red", "blue"))
23
24 Descriptive Statistics
4.4.1 Entering
24
25 Descriptive Statistics
25
26 Descriptive Statistics
List(student)
AssignDiff =c(student$diffAssign)
AssignDiff
#
AssignDiff.feq=table (student$diffAssign)
AssignDiff.feq
cbind(AssignDiff.feq)
pie(AssignDiff.feq)
What will you get? Error: could not find function "pie3D"
26
27 Descriptive Statistics
will appear in console window that indicate function Plotrix in your RStudio.
27
28 Descriptive Statistics
pie3D(AssignDiff.feq)
lbls=paste(names (AssignDiff.feq))
pie3D(AssignDiff.feq)
pie3D(AssignDiff.feq, labels=lbls, explode=0.1, main= "Difficulty
level for Assignment")
28
29 Inferential Analysis
5 INFERENTIAL ANALYSIS
5.1 Introduction
In terms of inferential statistics R has many varieties of functions for doing statistical analyses
including the parametric and non-parametric t-test, analysis of variance, t-test, correlation and
linear regression.
Some of them are self-understanding, and really easy to use. This is the case for example to
perform a t-test to compare one average value with zero or two average values. The function
t.test() do this. With one argument, the function tests if the average value of the variable
differs from zero. With two arguments it simply compares the two average values, e.g.
t.test(midMarks,finalMarks).
The function wilcox.test()performs wilcoxon rank test, which assumes that the means of
two unnormally distributed datasets are equal.
The function binom.test() performs an exact test of a simple null hypothesis about the
probability of success in a percentage. For example, among 125 insects counted, 45 are females,
and we want to see whether this differs from 50% females. For this, binom.test(45,125,.5)
give the corresponding results (the answer is yes, with p= 0.002226).
The function kruskal.test()performs a Kruskal-Wallis rank sum test of the null that the
location parameters of the distribution of x are the same in each group (sample). The alternative
is that they differ in at least one. It can take a single argument that is either a table with several
29
30 Inferential Analysis
## Formula interface.
require(graphics)
boxplot(Ozone ~ Month, data = airquality)
kruskal.test(Ozone ~ Month, data = airquality)
v. Correlation
The function cor.test() tests the correlation or relationship between two vectors, e.g.,
cor.test(midMarks,finalMarks).
Overall, there are several other functions in R, covering (almost) all standard statistical methods.
See http://www.statmethods.net/stats/ttest.html
30
31 Inferential Analysis
5.3.1 T-test
list(student)
midterm.data<-c(student$midMarks)
midterm.data
final.data<-c(student$finalMarks)
final.data
t.test(midterm.data,final.data) # independent t-test
t.test(midterm.data,final.data, paired=TRUE) #paired sample t-test
t.test(final.data, mu=75) #Ho:mu=75 #One sample t-test.
31
32 Inferential Analysis
If we are assuming the data to have not normal distribution or small sample size. In our case
n<20, so we can proceed with the non-parametric t-test or Wilcoxon rank test .
wilcox.test(midterm.data,final.data, paired=TRUE)
5.4 Exercises
Exercise 1: A company has proposed a new packaging training and wants to test the
effectiveness of the training by comparing the average times of 10 workers in the packaging
process. The table below shows the time in minute before and after training for each worker.
Before new 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3
training
After new 12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1
taraining
Answer:
In this case we have two sets of paired samples, since the measurements were made on the same
worker before and after the training. To see if there was an improvement, deterioration, or if the
means of times have remained substantially the same (hypothesis H0), we need to make a
Student's t-test for paired samples, proceeding in this way:
a = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
b = c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)
t.test(a,b, paired=TRUE)
Paired t-test
32
33 Inferential Analysis
data: a and b
t = -0.2133, df = 9, p-value = 0.8358
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5802549 0.4802549
sample estimates:
mean of the differences
-0.05
The p-value is greater than 0.05, then we can accept the hypothesis H0 of equality of the
averages. In conclusion, the new training has not made any significant improvement (or
deterioration) to the team of workers.
Exercise 2:
An investigator believes that caffeine facilitates performance on a simple spelling test. Two
groups of subjects are given either 200 mg of caffeine or a placebo. The data are:
Placebo Drug
24 24
25 29
27 26
26 23
26 25
22 28
21 27
22 24
23 27
25 28
25 27
25 26
Answer:
To describe the differences between these two groups, we can use basic descriptive statistics
(means and standard deviations), and graph the results. To see how likely a difference of this
33
34 References:
magnitude would happen by chance if, in fact, the two groups were sampled from the same
population, we can do a t-test. Please try
Other exercises:
6 REFERENCES:
For more information on RStudio, try the following resources:
Other websites:
http://www.statmethods.net/graphs/bar.html
http://www.r-bloggers.com/
https://sites.google.com/site/r4statistics/popularity
http://en.freestatistics.info/
http://lib.stat.cmu.edu/
http://www.comfsm.fm/~dleeling/statistics/notes000.html
34
35 Appendix
7 APPENDIX
In RStudio
In SPSS
35