
R For Data Exploration

This document provides an overview of data exploration and graphics in R. It discusses exploring both continuous and categorical variables through graphical and descriptive methods. These include histograms, boxplots, scatterplots, and contingency tables. It also covers important concepts like outliers, missing values, and the normal distribution. Graphical parameters in R like color, shape, and size can be adjusted to customize plots. Working examples demonstrate how to generate basic graphs like scatterplots in R.

Uploaded by

Jad Abou Assaly

Data Exploration and Graphics in R

Topics
 Data exploration
 Graphics in R
Exploration – first step of analysis
 Usually the first step of a data analysis is graphical data exploration
 The most important aim is to get an overview of the dataset
• Where is the data centered?
• How is the data spread (symmetric, skewed…)?
• Are there any outliers?
• Are the variables normally distributed?
• How are the variables related to each other:
  • The dependent variable and the independent variables
  • The independent variables among themselves
 Graphical exploration complements descriptive statistics
Variable types
 Continuous (numeric vectors in R)
• Height
• Age
• Temperature in degrees Celsius

 Categorical (factors in R)
• Make of a car
• Gender
• Color of the eyes
Exploration – methods I/II
 Single continuous variable
• Plots: boxplot, histogram (density plot, stem-and-leaf), normal probability plot, strip chart
• Descriptive: mean, median, standard deviation, fivenum summary

 Single categorical variable
• Plots: contingency table, strip chart, barplot
• Descriptive: mode, contingency table

 Two continuous variables
• Plots: scatterplot
• Descriptive: individually, same as for a single variable

 Two categorical variables
• Plots: contingency table, mosaic plot
• Descriptive: individually, same as for a single variable
Exploration – methods II/II
 One continuous, one categorical variable
• Plots: boxplot, histogram, but for each category separately
• Descriptives: mean, median, sd…, for each category separately

 Several continuous and / or categorical variables
• Plots: pairwise scatterplot, mosaic plot
• Descriptives: as for continuous or categorical variables
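The "for each category separately" idea can be sketched with tapply( ) and a formula in boxplot( ). The scores and section labels below are made up for illustration; they are not from the slides.

```r
# Hypothetical exam scores for two sections (illustration only)
score   <- c(70, 80, 90, 60, 65, 75)
section <- factor(c("A", "A", "A", "B", "B", "B"))

tapply(score, section, mean)     # mean per category
tapply(score, section, median)   # median per category
tapply(score, section, sd)       # standard deviation per category

boxplot(score ~ section)         # one box per category
```

tapply( ) applies the given function to the continuous variable within each level of the factor, which is one of several ways to get per-group descriptives in base R.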
Descriptive statistics
Mean
 Mean = (sum of all values) / (number of values)
Standard deviation and variance
 SD = the square root of (the sum of the squared differences between each observation and the mean, divided by the number of observations minus one)

• Has the same unit as the original variable

 Var = SD*SD = SD^2
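As a small check of these definitions, the mean, variance and standard deviation can be computed by hand and compared with R's built-in functions. The five heights are hypothetical values chosen so that the arithmetic is easy to follow.

```r
# Hypothetical sample of five heights in cm (illustration only)
x <- c(160, 165, 170, 175, 180)
n <- length(x)

m <- sum(x) / n                  # mean by hand: 170
v <- sum((x - m)^2) / (n - 1)    # sample variance by hand: 250 / 4 = 62.5
s <- sqrt(v)                     # standard deviation: square root of the variance

# Compare with the built-in functions
all.equal(m, mean(x))
all.equal(v, var(x))
all.equal(s, sd(x))
```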


Normal distribution I/III
 Some measurements are normally distributed in the real world
• Height
• Weight
 Means of samples taken from data with another distribution are also approximately normally distributed, provided the samples are large enough
 Many descriptive statistics and statistical tests rely on the assumption of normality
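The claim about means can be illustrated with a short simulation. This is only a sketch: the exponential distribution, the sample size of 50 and the seed are arbitrary choices made here, not taken from the slides.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# 1000 samples of size 50 from a skewed (exponential) distribution,
# keeping only the mean of each sample
sample_means <- replicate(1000, mean(rexp(50, rate = 1)))

hist(rexp(1000, rate = 1), main = "Raw exponential data")  # clearly skewed
hist(sample_means, main = "Means of 50 observations")      # roughly bell-shaped
```

The raw data are strongly right-skewed, while the histogram of the sample means is close to symmetric around 1, the mean of the underlying distribution.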
Normal distribution II/III
 Normal distribution is described by two parameters:
• Mean
• Standard deviation

 These two are enough to tell:
• Where the peak (center) of the distribution is located
• How the data are spread around this peak
Normal distribution III/III
Quartiles
 1st quartile (25%), median (50%), and 3rd quartile (75%)

Example data: 1 2 3 4 5 6 7 8 9
• Q1 = 3, median = 5, Q3 = 7
• Interquartile range (IQR) = Q3 - Q1 = 7 - 3 = 4

Fivenum summary:
• Minimum (1), 1st quartile (3), median (5), 3rd quartile (7), maximum (9)
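The same numbers can be reproduced in R with the example data 1 to 9:

```r
x <- 1:9

fivenum(x)                   # minimum, Q1, median, Q3, maximum: 1 3 5 7 9
quantile(x, c(0.25, 0.75))   # 1st and 3rd quartiles: 3 and 7
IQR(x)                       # Q3 - Q1: 4
```

For other data, fivenum( ) and quantile( ) can give slightly different quartiles, because they use different interpolation rules.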
What if the distribution is skewed or there are outliers/deviant observations?

 Use nonparametric alternative measures
• Median instead of mean
• Inter-quartile range instead of standard deviation
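A minimal sketch of why these measures are preferred: one extreme value changes the mean and the standard deviation a lot, but leaves the median and the inter-quartile range untouched. The value 100 is a made-up outlier for illustration.

```r
x     <- c(1, 2, 3, 4, 5)
x_out <- c(1, 2, 3, 4, 100)   # same data with one hypothetical extreme value

mean(x);   mean(x_out)     # 3 and 22: the mean is pulled toward the outlier
median(x); median(x_out)   # 3 and 3: the median is unchanged
sd(x);     sd(x_out)       # the standard deviation grows considerably
IQR(x);    IQR(x_out)      # 2 and 2: the IQR is unchanged
```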
Summary of a continuous variable I/II
 summary( )
• x<-rnorm(100)
• summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.005561 0.079430 0.202900 0.310300 0.401000 1.677000
• median(x)
• mean(x)
• min(x)
• max(x)
• quantile(x, c(0.25, 0.75)) # 1st and 3rd quartiles
Summary of a continuous variable II/II
 IQR(x) # inter-quartile range
 sd(x) # standard deviation
 var(x) # variance

 table( ) # makes a table of frequencies (categorical variables)
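For a categorical variable, table( ) gives the frequencies, and the mode can be read from it. The eye colors below are hypothetical; using which.max( ) to pick the mode is one possible approach, and it returns only the first category if there is a tie.

```r
# Hypothetical eye colors (illustration only)
eyes <- factor(c("blue", "brown", "brown", "green", "brown", "blue"))

table(eyes)                     # frequency of each category
names(which.max(table(eyes)))   # the mode: the most frequent category
```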
Outliers and missing values
What are these outliers then?
 Outliers
• Technical errors
• Coding errors
  • Example: Male = 0, Female = 1, but the data has some values coded with 2

 Extreme observations
• Measurements that differ considerably from the others, but can’t be treated as outliers
• If the observation is not definitely an outlier, it is better to treat it as an extreme observation and keep it in the data
Outliers
 What are those with gender coded as 2?

gender
 0  1  2
11  8  1

 Probably a typing error
• What if they are missing values (gender is unknown)?
 If a typing error, it should be checked against the original data
 If a missing value, it should be coded as a missing value
• We will come to this shortly
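A sketch of how such a check might look in R. The vector is rebuilt from the counts in the table above (11 zeros, 8 ones, one 2); treating the 2 as "unknown" is an assumption made here for illustration.

```r
# Rebuild a vector with the counts shown above: 11 zeros, 8 ones, one 2
gender <- c(rep(0, 11), rep(1, 8), 2)
table(gender)                    # reveals the unexpected code 2

# Assumption: 2 means "unknown", so recode it as a missing value
gender[gender == 2] <- NA
table(gender, useNA = "ifany")   # NA is now counted separately
```

If the 2 turned out to be a typing error instead, the value would be corrected from the original data rather than set to NA.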
Extreme observations
Missing values
 Missing values are observations that really are missing a value
• Some samples were not measured during the experiment
• Some students did not answer certain questions on the feedback form
Missing values in R I/II
 In R missing values are coded with NA
• NA = not available
• Many graphical, descriptive, and testing procedures fail if there are missing values in the data

 An example
• x<-c(NA, rnorm(10))
• mean(x)
• [1] NA

 To compute without NA
• mean(x, na.rm=TRUE)
• [1] 0.176616
Missing values in R II/II
 The simplest way to treat missing values is to delete all cases (rows) that contain at least one missing value.
 For a vector this means just removing the missing values:
• x2<-na.omit(x)
• mean(x2)
• [1] 0.176616
 There are other ways to treat missing values, such as imputation, where the missing values are replaced with, e.g., the mean of the continuous variable, or with the most common observation if the variable is categorical.
• x3<-x
• x3[is.na(x3)]<-mean(x, na.rm=TRUE)
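Both kinds of imputation can be sketched on small made-up vectors. The variable names and values are hypothetical, and mean or mode imputation is only the simple approach described above, not a recommendation over other methods.

```r
# Hypothetical continuous variable with one missing value
x <- c(NA, 1, 2, 3)
x_imp <- x
x_imp[is.na(x_imp)] <- mean(x, na.rm = TRUE)   # NA replaced with the mean, 2

# Hypothetical categorical variable with one missing value
g <- c("a", "b", NA, "b")
most_common <- names(which.max(table(g)))      # table() ignores NA by default
g_imp <- g
g_imp[is.na(g_imp)] <- most_common             # NA replaced with "b"
```

Note that the imputation is done on a copy of the original vector, so the positions of the missing values are still available in x and g.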
Graphical methods
Continuous variables
Boxplot
Link between quartiles and boxplot
Histogram I/II
[Figure: histogram of rnorm(10000) and density estimate density.default(x = rnorm(10000)) with N = 10000 and bandwidth = 0.1432; both x axes run from -4 to 4]
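A histogram and a density plot like the ones on this slide can be drawn roughly as follows. The seed is an arbitrary choice for reproducibility, so the exact shape and bandwidth will differ from the slide.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
x <- rnorm(10000)

h <- hist(x)       # histogram; hist() also returns the bin counts invisibly
d <- density(x)    # kernel density estimate
plot(d)            # density plot; N and the bandwidth appear under the x axis
```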


Histogram II/II
[Figure: two histograms of x; the x axes run from 9956 to 9964 and from 9956 to 9962]
Link between histogram and boxplot
Scatterplot
QQ-plot
 A QQ-plot can be used to check graphically whether a variable is normally distributed.
• Normal distribution is an assumption made by many statistical procedures.
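A minimal sketch of a normal QQ-plot, comparing simulated normal data with simulated skewed data. Both data sets and the seed are made up for illustration.

```r
set.seed(1)        # arbitrary seed, for reproducibility only

x <- rnorm(100)    # normally distributed data
qqnorm(x)          # sample quantiles against theoretical normal quantiles
qqline(x)          # reference line; points close to it suggest normality

y <- rexp(100)     # skewed data for comparison
qqnorm(y)
qqline(y)          # the points bend away from the line
```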
Pairwise scatterplot
Categorical variables
Barchart
Contingency table
            January  February  March  April
Friday            4         5      3      4
Monday            4         4      4      5
Thursday          4         4      4      4
Tuesday           5         4      4      4
Wednesday         5         4      4      4
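A table of this kind is produced by calling table( ) with two categorical variables. The observations below are hypothetical and much fewer than in the table above; they only show the mechanics.

```r
# Hypothetical observations (illustration only)
day   <- c("Monday", "Monday", "Friday", "Friday", "Friday", "Tuesday")
month <- c("January", "February", "January", "January", "March", "March")

tab <- table(day, month)   # rows: day, columns: month
tab
mosaicplot(tab)            # graphical version of the same table
```

Rows and columns are sorted alphabetically unless the variables are factors with an explicit level order, which is why Friday comes first in the table above.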
Graphics in R
Basic idea
 All graphs in R are displayed on a graphical device.
 If no device is open when the plotting command is called, a new one is opened, and the image is displayed in it.
 A graphics device is typically a new window that displays the graphic.
 A graphics device can also be a file to which the plot is written.
• Open it
• Make the plot
• Close it
Traditional graphics commands in R
 High level graphical commands create the plot
• plot( ) # Scatter plot, and general plotting
• hist( ) # Histogram
• stem( ) # Stem-and-leaf
• boxplot( ) # Boxplot
• qqnorm( ) # Normal probability plot
• mosaicplot( ) # Mosaic plot
 Low level graphical commands add to the plot
• points( ) # Add points
• lines( ) # Add lines
• text( ) # Add text
• abline( ) # Add straight lines
• legend( ) # Add legend
 Most commands also accept additional graphical parameters
• par( ) # Set parameters for plotting
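The division of labor between high-level and low-level commands can be sketched as follows: one high-level call creates the plot, and the low-level calls then add to it. The data, seed and labels are made up for illustration.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
x <- rnorm(50)
y <- 2 * x + rnorm(50)

plot(x, y)                                        # high-level: creates the plot
abline(h = 0, lty = 2)                            # low-level: dashed horizontal line
points(mean(x), mean(y), pch = 19, col = "red")   # low-level: mark the center
text(mean(x), mean(y), "mean", pos = 4)           # low-level: label right of the point
legend("topleft", legend = "mean", pch = 19, col = "red")
```

Calling a low-level command without an existing plot gives an error, because there is nothing to add to.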
Graphical parameters in R
 par( )
• cex # size of text and plotting symbols (magnification factor)
• col # color of plotting symbols
• lty # line type
• lwd # line width
• mar # figure margins (bottom, left, top, right), in lines of text
• mfrow # splits plotting area (mult. figs. per page)
• oma # outer margins
• pch # plotting symbol
 Arguments of the plotting functions such as plot( ), not of par( )
• xlim # min and max of X axis range
• ylim # min and max of Y axis range
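A short sketch of how these parameters might be combined. The curves are arbitrary examples; note that mfrow and mar are set with par( ), while xlim and ylim are passed to plot( ).

```r
# par() returns the previous values, so they can be restored afterwards
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

x <- seq(-3, 3, length.out = 50)
plot(x, x^2, pch = 20, col = "blue", cex = 0.8, xlim = c(-3, 3))
plot(x, x^3, type = "l", lty = 2, lwd = 2, ylim = c(-30, 30))

par(old)   # restore the previous settings
```

Restoring the old settings avoids surprising later plots with a split plotting area.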
A few worked examples
Drawing a scatterplot in R I/V
 Let’s generate some data
• x<-rnorm(100)
• y<-rpois(100, 10)
• g<-c(rep("horse", 50), rep("hound", 50))

 Simple scatter plot
• plot(x, y)
Adding a title and axis labels II/V
 plot(x, y, main="Horses and hounds", xlab="Performance", ylab="Races")
Drawing a scatterplot in R III/V
 Coloring points according to the group (horse or hound) they belong to
• cols<-ifelse(g=="horse", "Black", "Red")
• plot(x, y, main="Horses and hounds", xlab="Performance", ylab="Races", col=cols)
Drawing a scatterplot in R IV/V
 Changing the plotting symbol
• plot(x, y, main="Horses and hounds", xlab="Performance", ylab="Races", col=cols, pch=20)
• plot(x, y, main="Horses and hounds", xlab="Performance", ylab="Races", col=cols, pch="+")
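Once colors carry meaning, a legend helps the reader. The following sketch repeats the data generation so that it runs on its own; the seed and the legend position are choices made here, not part of the slides.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
x <- rnorm(100)
y <- rpois(100, 10)
g <- c(rep("horse", 50), rep("hound", 50))
cols <- ifelse(g == "horse", "black", "red")

plot(x, y, main = "Horses and hounds", xlab = "Performance",
     ylab = "Races", col = cols, pch = 20)
# legend() is a low-level command: it adds to the existing plot
legend("topright", legend = c("horse", "hound"),
       col = c("black", "red"), pch = 20)
```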
Drawing a scatterplot in R V/V
 Saving the image
• Menu: File -> Save As -> JPEG / BMP / PDF / postscript

 Directing the plotting to a file
• pdf("hnh.pdf")
• plot(x, y, main="Horses and hounds", xlab="Performance", ylab="Races", col=cols, pch=20)
• dev.off()

 Setting the size of the image in inches
• pdf("hnh.pdf", width=7, height=7)
• plot(x, y, main="Horses and hounds", xlab="Performance", ylab="Races", col=cols, pch=20)
• dev.off()
Drawing a box plot I/II
 x<-rnorm(100) # x is a vector
 boxplot(x) # makes a boxplot
Drawing Side-By-Side Boxplots II/II
 x=rnorm(100,mean=80,sd=5)
 f=c(rep("Section 1",40),rep("Section 2",35),rep("Section 3",25))
 boxplot(x~f)
Putting several graphs on the same page I/II

 x=matrix(rnorm(400),nrow=100,ncol=4)
 # 2*2 figures on the same page
 # Setting graphical parameters
 # (axis limits such as ylim=c(-3,3) are arguments of boxplot( ), not of par( ))
 par(mfrow=c(2,2))
 # plotting
 # Every box plot has a title
 boxplot(x[,1], main="1. column")
 boxplot(x[,2], main="2. column")
 boxplot(x[,3], main="3. column")
Scatter Plot
 x=rnorm(100)
 y=5*x+rnorm(100)
 plot(x,y,col="blue")
Pairwise Scatterplots
 y=matrix(rnorm(400),nrow=100,ncol=4)
 pairs(y)
Bar and Pie Charts
• x=c(rep("Fair",20),rep("Good",50),rep("Excellent",30))
• y=table(x)
• barplot(y)
• pie(y)
