Statistics - A.Y. 2018-2019: BIEF - Class 22
Lecture 3
R manual
Previously on “Statistics 30001”
• Identify types of data and levels of measurement
• Appropriately create and interpret tables and graphs to describe categorical variables:
• frequency distribution,
• bar chart,
• pie chart
Basics and approach
• We must type commands
> …[ENTER]
• Objects
> X <- 3 [ENTER]
> X [ENTER]
[1] 3
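As a minimal sketch of the same idea, the lines below create an object, print it, and reuse it in a later command. The name Y is introduced here only for illustration.

```r
# Assign the value 3 to an object named X (object names are case-sensitive).
X <- 3
# Typing the name alone prints the stored value in the console.
X
# A stored object can be reused in later commands.
Y <- X * 2 + 1
print(Y)
```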
Basics and approach
• Dataset format: *.rda or *.rds or *.rdata
• To open a dataset in R:
> load("path-and-file.rda")
– In RStudio: menu File, command Open File … or Import Dataset
• To save a dataset in R:
> save(Dataframe, file = "path-and-file.rda")
– The same command works in RStudio
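A small round trip may make the two commands concrete. The sketch below uses a made-up data frame (toy) and temporary files, both assumptions for illustration. It also shows that *.rds files are handled by a different pair of functions, saveRDS() and readRDS(), rather than by save() and load().

```r
# A small made-up data frame (hypothetical values, for illustration only).
toy <- data.frame(title = c("A", "B", "C"), year = c(1997L, 2009L, 2015L))

# save() stores named objects; load() restores them under the same names.
rda_path <- tempfile(fileext = ".rda")
save(toy, file = rda_path)
rm(toy)            # remove the object from the workspace
load(rda_path)     # the object `toy` is available again

# saveRDS() stores a single object without its name;
# readRDS() returns it, so the result must be assigned.
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, file = rds_path)
toy_copy <- readRDS(rds_path)
identical(toy, toy_copy)
```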
Basics and approach
• Dataframe vs Datafile
> D<-Dataset_Movie_v3
> str(D)
'data.frame': 2868 obs. of 46 variables:
$ id_imdb : chr "tt0120338" "tt0499549" "tt2488496" "tt0107290" ...
$ movie_title : chr "Titanic" "Avatar" "Star Wars: The Force Awakens" "Jurassic Park" ...
$ year : int 1997 2009 2015 1993 2015 2012 2015 2003 1999 2011 ...
$ year_bins : Factor w/ 3 levels "1980-1999","2000-2009",..: 1 2 3 1 3 3 3 2 1 3 ...
$ studio_distrib : Factor w/ 11 levels "20th Century Fox",..: 7 1 10 9 9 10 9 11 1 11 ...
. . . .
> focus <- data.frame(Dataset_Movie_v3$movie_title,
Dataset_Movie_v3$main_genre)
> str(focus)
'data.frame': 2868 obs. of 2 variables:
$ Dataset_Movie_v3.movie_title: Factor w/ 2854 levels "10 Cloverfield Lane",..: 2627 214 1947 1139 1141 2066 802 2351 1943 916 ...
$ Dataset_Movie_v3.main_genre : Factor w/ 8 levels "Action","Adventure",..: 7 1 1 2 1 1 1 2 1 2 ...
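One possible alternative for building a smaller data frame is to select columns by name, which keeps the original column names instead of the long prefixed ones. The sketch below uses a made-up stand-in (movies) for the course dataset; its values are hypothetical.

```r
# Made-up stand-in for the movie data (hypothetical values).
movies <- data.frame(
  movie_title = c("Film A", "Film B", "Film C"),
  main_genre  = c("Action", "Drama", "Action"),
  year        = c(1997L, 2009L, 2015L),
  stringsAsFactors = FALSE
)

# Selecting columns by name keeps the original column names.
focus <- movies[, c("movie_title", "main_genre")]
str(focus)

# Naming the columns explicitly inside data.frame() gives the same names.
focus2 <- data.frame(movie_title = movies$movie_title,
                     main_genre  = movies$main_genre,
                     stringsAsFactors = FALSE)
names(focus2)
```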
Basics and approach
• Object vector
> classes <- c(1, 4, 254)
> str(classes)
num [1:3] 1 4 254
> classes
[1] 1 4 254
> classes[2]
[1] 4
> classes[c(1, 3)]
[1] 1 254
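Beyond positive positions, vectors can also be indexed with negative numbers or with a logical condition. The following is a short sketch of these options on the same vector.

```r
classes <- c(1, 4, 254)

length(classes)          # number of elements in the vector
classes[-1]              # negative index: every element except the first
classes[classes > 3]     # logical condition: elements greater than 3

# Indexing on the left of <- replaces the selected element.
classes[2] <- 40
classes
```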
Basics and approach
• Matrices
> A <- matrix(data = c(4, 2, 0, 1, -3, 0.9), nrow = 3,
ncol = 2, byrow = TRUE)
> A
[,1] [,2]
[1,] 4 2.0
[2,] 0 1.0
[3,] -3 0.9
Basics and approach
• Matrices
> B <- matrix(data = c(4, 2, 0, 1, -3, 0.9), nrow = 3,
ncol = 2, byrow = FALSE)
> B
[,1] [,2]
[1,] 4 1.0
[2,] 2 -3.0
[3,] 0 0.9
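The two slides above differ only in byrow. A compact sketch of the comparison, with the data stored once in a helper vector v (a name introduced here for illustration):

```r
v <- c(4, 2, 0, 1, -3, 0.9)

A <- matrix(v, nrow = 3, ncol = 2, byrow = TRUE)   # filled row by row
B <- matrix(v, nrow = 3, ncol = 2, byrow = FALSE)  # filled column by column

dim(A)    # number of rows and columns
A[1, ]    # first row of A holds the first two values of v
B[, 1]    # first column of B holds the first three values of v

# Filling a 2 x 3 matrix by column and transposing it reproduces A.
all.equal(A, t(matrix(v, nrow = 2, ncol = 3)))
```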
Basics and approach
• Matrices
> A <- matrix(data = c(4, 2, 0, 1, -3, 0.9), nrow = 3,
ncol = 2, byrow = TRUE)
> A
[,1] [,2]
[1,] 4 2.0
[2,] 0 1.0
[3,] -3 0.9
> A[2, 1] # element in row 2 and column 1
[1] 0
> A[c(1, 3), 2] # elements in rows 1 and 3 of column 2
[1] 2.0 0.9
> A[, 1] # all elements in column 1
[1] 4 0 -3
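A few further indexing patterns, sketched on the same matrix A: selecting a whole row, keeping a single column as a matrix with drop = FALSE, and selecting rows by a condition.

```r
A <- matrix(c(4, 2, 0, 1, -3, 0.9), nrow = 3, ncol = 2, byrow = TRUE)

A[2, ]                 # whole second row, returned as a plain vector
A[, 1, drop = FALSE]   # column 1 kept as a 3 x 1 matrix
A[A[, 1] > 0, ]        # rows whose value in column 1 is positive
```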
Basics and approach
• Object factor: a categorical variable, even if its categories are coded as numbers
> gender <- factor(c("f", "f", "f", "m", "m", "f", "m", "f",
"f", "m"))
> gender
[1] f f f m m f m f f m
> str(Dataset_Movie_v3$major_distrib)
int [1:2868] 1 1 1 1 1 1 1 1 1 1 ...
> major_recoded <- factor(Dataset_Movie_v3$major_distrib)
> str(major_recoded)
Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
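When categories are coded as numbers, readable labels can be attached while creating the factor. The sketch below uses made-up 0/1 codes; the labels "other" and "major" are assumptions for illustration, not the dataset's documented meaning.

```r
# Made-up 0/1 codes standing in for a variable such as major_distrib.
major <- c(1, 1, 0, 1, 0)

# levels = the codes found in the data; labels = the names shown for them.
# The labels "other" and "major" are illustrative assumptions.
major_f <- factor(major, levels = c(0, 1), labels = c("other", "major"))

levels(major_f)
table(major_f)

# as.integer() returns the internal level codes (1, 2),
# not the original 0/1 values.
as.integer(major_f)
```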
Get help
> help(command)
Basics and approach
> table(dataframe$variable)
Example:
In the case of the Motion Picture Industry (dataset «Dataset_Movie_v3.rda»), what is the distribution of the movies by genre (variable «main_genre»)?
Frequency tables
Example:
In the case of the Motion Picture Industry (dataset «Dataset_Movie_v3.rda»), what is the distribution of the movies by genre (variable «main_genre»)?
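A minimal sketch of a frequency table, using a made-up vector of genres (hypothetical data, not the course dataset). With the course data, the same calls would take Dataset_Movie_v3$main_genre as input.

```r
# Made-up genres for six movies (hypothetical data).
genre <- c("Action", "Drama", "Action", "Comedy", "Drama", "Action")

counts <- table(genre)                 # absolute frequencies
counts

prop.table(counts)                     # relative frequencies (sum to 1)
round(100 * prop.table(counts), 1)     # percentages
sort(counts, decreasing = TRUE)        # most frequent category first
```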
> pie(table(dataframe$variable))
Example:
In the case of the Motion Picture Industry (dataset «Dataset_Movie_v3.rda»), what is the distribution of the movies by genre (variable «main_genre»)?
Pie chart
Example:
In the case of the Motion Picture Industry (dataset «Dataset_Movie_v3.rda»), what is the distribution of the movies by genre (variable «main_genre»)?
1. Basic version:
> pie(table(Dataset_Movie_v3$main_genre))
2. Improvement:
> pie(table(Dataset_Movie_v3$main_genre), main = "Pie chart for the Main Genre of the Movies")
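A possible further improvement is to show the percentage of each category in the slice labels. The sketch below does this on a made-up vector of genres (hypothetical data); the object names pct and slice_labels are introduced here for illustration.

```r
# Made-up genres for six movies (hypothetical data).
genre <- c("Action", "Drama", "Action", "Comedy", "Drama", "Action")
counts <- table(genre)

# Percentage of each category, rounded to whole numbers.
pct <- as.vector(round(100 * prop.table(counts)))

# Labels of the form "Category (xx%)".
slice_labels <- paste0(names(counts), " (", pct, "%)")

pie(counts, labels = slice_labels, main = "Main genre (toy data)")
```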
Pie chart
Example:
In the case of the Motion Picture Industry (dataset «Dataset_Movie_v3.rda»), what is the distribution of the movies by genre (variable «main_genre»)?
Why R and not Excel?
• It depends on your purposes.
• Excel is better for data management and for presenting and reporting results to a general audience.
• R is better for exploring data, analyzing datasets, and finding evidence: it is more powerful, faster, and offers more statistical tools.
• They can be combined!
Recap
Upcoming