R For Data Exploration
Topics
Data exploration
Graphics in R
Exploration – first step of analysis
Usually the first step of a data analysis is graphical
data exploration
The most important aim is to get an overview of the
dataset
• Where is data centered?
• How is the data spread (symmetric, skewed…)?
• Any outliers?
• Are the variables normally distributed?
• How are the relationships between variables:
• Between dependent and independents
• Between independents
Categorical
(factors in R)
• Make of a car
• Gender
• Color of the eyes
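As a minimal sketch of how a categorical variable is stored as a factor in R: the eye-colour values below are hypothetical and only serve as an illustration.

```r
# Hypothetical eye-colour data, used only for illustration
eye <- c("blue", "brown", "brown", "green", "blue", "brown")

eye_f <- factor(eye)   # convert the character vector to a factor
levels(eye_f)          # the categories, sorted alphabetically by default
table(eye_f)           # number of observations in each category
```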
Exploration – methods I/II
Single continuous variable
• Plots: boxplot, histogram (density plot, stem-and-leaf), normal
probability plot, strip chart
• Descriptive: mean, median, standard deviation, five-number summary (fivenum)
Five-number summary:
• Minimum (1), 1st quartile (3), median (5), 3rd quartile (7), maximum (9)
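The descriptive statistics listed above can be computed as follows. This is a small sketch that assumes the data are simply the integers 1 to 9, which reproduces the five-number summary quoted on the slide.

```r
# Assumed example data: the integers 1..9
x <- 1:9

mean(x)      # arithmetic mean
median(x)    # median
sd(x)        # standard deviation
fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum: 1 3 5 7 9
summary(x)   # similar overview, with the mean added
```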
What if the distribution is skewed or there are outliers/deviant observations?
• Coding errors
• Male = 0, Female = 1
Extreme observations
• Measurements that are somehow largely different from others, but can’t be
treated as outliers
• If the observation is not definitely an outlier, better treat it as an extreme
observation, and keep it in the data
Outliers
What are those with gender coded as 2?
gender   0   1   2
count   11   8   1
• Probably a typing error
• What if they are missing values (gender is unknown)?
If a typing error, should be checked from the original data
If a missing value, should be coded as missing value
• We will come to this shortly
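A frequency table is one quick way to spot such codes. The sketch below rebuilds the counts from the slide (11 zeros, 8 ones, one 2) and, assuming the 2 really stands for an unknown gender, recodes it as a missing value. If it were a typing error, the original data should be checked instead.

```r
# Rebuild the gender variable with the counts shown in the table
gender <- c(rep(0, 11), rep(1, 8), 2)
table(gender)                      # the unexpected code 2 shows up once

# Assumption: code 2 means "unknown", so it is recoded as NA
gender_clean <- gender
gender_clean[!gender_clean %in% c(0, 1)] <- NA
table(gender_clean, useNA = "ifany")
```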
Extreme observations
Missing values
Missing values are observations that really are missing a value
• Some samples were not measured during the experiment
• Some students did not answer certain questions on the feedback form
Missing values in R I/II
In R missing values are coded with NA
• NA = not available
• Many graphical, descriptive, and testing procedures fail if there are missing values in the data
An example
• x<-c(NA, rnorm(10))
• mean(x)
• [1] NA
To compute without NA
• mean(x, na.rm=TRUE)
• [1] 0.176616
Missing values in R II/II
The simplest way to treat missing values is to delete all cases (rows) that contain at least one missing value.
For a vector this means just removing the missing values:
• x2<-na.omit(x)
• mean(x2)
• [1] 0.176616
There are other ways to treat missing values, such as imputation, where the missing values are replaced with, e.g., the mean of the continuous variable, or with the most common observation, if the variable is categorical.
• x3<-x
• x3[is.na(x3)]<-mean(x, na.rm=TRUE)
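For a data frame, the two approaches described above might look like the following sketch. The data frame df and its columns height and group are hypothetical names invented for this illustration.

```r
# Hypothetical data frame with missing values in both columns
df <- data.frame(height = c(170, NA, 182, 165, NA, 175),
                 group  = c("a", "b", NA, "a", "a", "b"),
                 stringsAsFactors = FALSE)

colSums(is.na(df))            # number of missing values per column

# Deletion: keep only the rows without any missing value
df_complete <- na.omit(df)

# Imputation: mean for the continuous variable,
# most common category for the categorical variable
df_imp <- df
df_imp$height[is.na(df_imp$height)] <- mean(df$height, na.rm = TRUE)
most_common <- names(which.max(table(df$group)))
df_imp$group[is.na(df_imp$group)] <- most_common
```

Deletion is simple but discards data. Mean imputation keeps every row but makes the variable look less variable than it really is, so which one fits depends on the analysis.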
Graphical methods
Continuous variables
Boxplot
Link between quartiles and
boxplot
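The link can be checked numerically. In this sketch the data (the integers 1 to 9 plus one deliberately large value, 30) are an assumption chosen so that one point falls outside the whiskers. R's boxplot draws the box from the hinges of the five-number summary, and the whiskers extend to the most extreme points within 1.5 box lengths of the box.

```r
# Assumed example data with one clearly deviant value
x <- c(1:9, 30)

fivenum(x)                          # 1, 3, 5.5, 8, 30

b <- boxplot(x, horizontal = TRUE)  # draws the plot and returns its statistics
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points drawn individually beyond the whiskers
```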
Histogram I/II
[Figure: two panels, "Histogram of rnorm(10000)" and "density.default(x = rnorm(10000))", x axis from -4 to 4]
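A histogram and a kernel density estimate of 10000 standard normal values can be drawn side by side roughly as follows. The seed and the panel titles are choices made for this sketch.

```r
set.seed(42)                  # seed chosen only for reproducibility
x <- rnorm(10000)

old_par <- par(mfrow = c(1, 2))   # two panels side by side
hist(x, main = "Histogram of x")
plot(density(x), main = "Kernel density estimate")
par(old_par)                      # restore the previous layout
```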
Link between histogram and
boxplot
Scatterplot
QQ-plot
A QQ-plot can be used to check graphically whether a variable is normally distributed.
• Normal distribution is an assumption made by many statistical
procedures.
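A sketch of a normal QQ-plot for one normally distributed and one skewed sample. The exponential sample is an assumption added here for contrast. Points close to the reference line suggest approximate normality.

```r
set.seed(1)                    # seed chosen only for reproducibility
x_norm <- rnorm(200)           # normally distributed sample
x_skew <- rexp(200)            # right-skewed sample, for comparison

old_par <- par(mfrow = c(1, 2))
qqnorm(x_norm, main = "Normal data")
qqline(x_norm)                 # reference line through the quartiles
qqnorm(x_skew, main = "Skewed data")
qqline(x_skew)
par(old_par)
```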
Pairwise scatterplot
Categorical variables
Barchart
Contingency table
           January  February  March  April
Friday           4         5      3      4
Monday           4         4      4      5
Thursday         4         4      4      4
Tuesday          5         4      4      4
Wednesday        5         4      4      4
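A contingency table like the one above is produced by table( ) applied to two categorical variables. The small vectors below are hypothetical and much shorter than the data behind the slide.

```r
# Hypothetical observations: one weekday and one month per record
day   <- c("Mon", "Mon", "Tue", "Tue", "Tue", "Wed")
month <- c("Jan", "Feb", "Jan", "Jan", "Feb", "Feb")

tab <- table(day, month)      # rows: weekday, columns: month
tab
addmargins(tab)               # same table with row and column totals

barplot(tab, beside = TRUE, legend.text = TRUE)   # grouped bar chart
mosaicplot(tab, main = "Weekday by month")
```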
Graphics in R
Basic idea
All graphs in R are displayed on a graphical device.
If no device is open when the plotting command is called, a new one is opened, and the image is displayed in it.
A graphics device is simply a new window that displays the graphic.
A graphics device can also be a file where the plot is written.
• Open it
• Make the plot
• Close it
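The open, plot, close sequence for a file device might look like this sketch. The file name and the temporary directory are arbitrary choices, and png( ) or jpeg( ) can be used in the same way as pdf( ).

```r
# Arbitrary output location chosen for this sketch
out_file <- file.path(tempdir(), "example_plot.pdf")

pdf(out_file, width = 6, height = 4)   # open the device (a PDF file)
plot(1:10, (1:10)^2, main = "Written to a file")   # make the plot
dev.off()                              # close the device, which writes the file

file.exists(out_file)
```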
Traditional graphics commands in R
High level graphical commands create the plot
• plot( ) # Scatter plot, and general plotting
• hist( ) # Histogram
• stem( ) # Stem-and-leaf
• boxplot( ) # Boxplot
• qqnorm( ) # Normal probability plot
• mosaicplot( ) # Mosaic plot
Low level graphical commands add to the plot
• points( ) # Add points
• lines( ) # Add lines
• text( ) # Add text
• abline( ) # Add straight lines (slope/intercept, horizontal, vertical)
• legend( ) # Add legend
Most commands also accept additional graphical parameters
• par( ) # Set parameters for plotting
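The division of labour between the two kinds of commands can be sketched as follows: one high-level call creates the plot, and the low-level calls then add to it. The data are simulated for this illustration.

```r
set.seed(1)                           # seed chosen only for reproducibility
x <- seq(-3, 3, length.out = 50)
y <- x^2 + rnorm(50)                  # simulated noisy observations

plot(x, y, main = "High-level call creates the plot")   # high level
lines(x, x^2, col = "red")            # low level: add the true curve
points(0, 0, pch = 19, col = "blue")  # low level: mark one point
abline(h = 0, lty = 2)                # low level: horizontal reference line
text(0, 6, "true curve: y = x^2")     # low level: annotation
legend("top", legend = c("data", "true curve"),
       pch = c(1, NA), lty = c(NA, 1), col = c("black", "red"))
```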
Graphical parameters in R
par( )
• cex # scaling factor for text and plotting symbols
• col # color of plotting symbols
• lty # line type
• lwd # line width
• mar # inner margins
• mfrow # splits plotting area (mult. figs. per page)
• oma # outer margins
• pch # plotting symbol
• xlim # min and max of X axis range (argument of plot( ), not of par( ))
• ylim # min and max of Y axis range (argument of plot( ), not of par( ))
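A sketch of how these parameters are typically combined. Layout parameters such as mfrow and mar are set with par( ), while col, pch, lty, lwd and the axis ranges are passed to the plotting commands. The particular values are arbitrary.

```r
set.seed(1)                    # seed chosen only for reproducibility
x <- rnorm(100)

# par() returns the previous settings, so they can be restored afterwards
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), cex = 0.9)

hist(x, col = "grey", main = "Histogram")
plot(x, pch = 19, col = "blue", ylim = c(-4, 4), main = "Index plot")
abline(h = 0, lty = 2, lwd = 2)   # dashed, thicker reference line

par(old_par)                      # restore the previous settings
```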
A few worked examples
Drawing a scatterplot in R I/V
Let’s generate some
data
• x<-rnorm(100)
• y<-rpois(100, 10)
• g<-c(rep("horse", 50), rep("hound",50))
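One possible way to continue from this data is a scatterplot in which the two groups get different colors. The colors, the plotting symbol, and the legend position are choices made for this sketch, not part of the original slides.

```r
set.seed(1)                    # seed chosen only for reproducibility
x <- rnorm(100)
y <- rpois(100, 10)
g <- c(rep("horse", 50), rep("hound", 50))

# One color per group
cols <- ifelse(g == "horse", "blue", "red")

plot(x, y, col = cols, pch = 19, xlab = "x", ylab = "y",
     main = "Scatterplot by group")
legend("topright", legend = c("horse", "hound"),
       col = c("blue", "red"), pch = 19)
```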
x=matrix(rnorm(400),nrow=100,ncol=4)
# 2*2 figures on the same page
# Setting graphical parameters
# (axis ranges are not par( ) parameters, so the range is passed to boxplot( ))
par(mfrow=c(2,2))
# plotting
# Every box plot has a title
boxplot(x[,1], main="1. column", ylim=c(-3,3))
boxplot(x[,2], main="2. column", ylim=c(-3,3))
boxplot(x[,3], main="3. column", ylim=c(-3,3))
Scatter Plot
x=rnorm(100)
y=5*x+rnorm(100)
plot(x,y,col="blue")
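Since y is generated as 5*x plus noise, a fitted straight line can be added to the scatterplot with abline( ). Fitting the line with lm( ) is an addition made in this sketch, and the estimated slope should come out close to 5.

```r
set.seed(1)                    # seed chosen only for reproducibility
x <- rnorm(100)
y <- 5 * x + rnorm(100)

plot(x, y, col = "blue")
fit <- lm(y ~ x)               # least-squares line through the points
abline(fit, col = "red", lwd = 2)
coef(fit)                      # intercept near 0, slope near 5
```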
Pairwise Scatterplots
y=matrix(rnorm(400),nrow=100,ncol=4)
pairs(y)
Bar and Pie Charts
• x=c(rep("Fair",20),rep("Good",50),rep("Excellent",30))
• y=table(x)
• barplot(y)
• pie(y)
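By default table( ) sorts the categories alphabetically, so the bars appear as Excellent, Fair, Good. A sketch of one way to get the natural order is to make x a factor with explicit levels before tabulating. The axis label is a choice made here.

```r
x <- c(rep("Fair", 20), rep("Good", 50), rep("Excellent", 30))

# Explicit level order instead of the alphabetical default
x_f <- factor(x, levels = c("Fair", "Good", "Excellent"))
tab <- table(x_f)

barplot(tab, ylab = "Count")   # bars in the order Fair, Good, Excellent
pie(tab)
```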