MIS2502:
Data Analytics
Introduction to R and RStudio
Acknowledgement: David Schuff Aaron Zhi Cheng
http://community.mis.temple.edu/zcheng/
Now we will start using R and RStudio heavily
in class activities and assignments
• R/RStudio has become one of the dominant software
environments for data analysis
• It has a large user community that contributes
functionality
Make sure you download both on your
computer!
R (The base/engine)
• Software development platform and programming language
• Open source, free
• Many, many, many statistical add-on “packages” that perform data analysis
RStudio (The pretty face)
• Integrated Development Environment for R
• Nicer interface that makes R easier to use
• Requires R to run
• If you have both installed, you only need to interact with RStudio
• Mostly, you do not need to touch R directly
RStudio Interface
Environment: info of your data
History: Previous commands
Console
(just like a command line)
• Files
• Plots
• Packages
• Help
• Viewer
It may have an additional window
for R script(s) and data view if you
have any of them open
The Basics: Calculations
• R will do math
for you:
Type commands into the console and
it will give you an answer
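The console screenshot from this slide is not reproduced here. As a minimal sketch, these are the kinds of expressions you might type at the console; the specific numbers are illustrative and not taken from the original slide.

```r
# Basic arithmetic typed at the console; R prints each result
2 + 3        # addition
10 - 4       # subtraction
6 * 7        # multiplication
20 / 8       # division
2 ^ 10       # exponentiation
17 %% 5      # remainder after division (modulo)
(1 + 2) * 3  # parentheses control the order of operations
```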
The Basics: Functions
sqrt(), log(), abs(), and exp() are
functions.
Functions accept parameters
(in parentheses) and return a
value
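A short sketch of calling the functions named above. The input values are chosen only for illustration; note that log() uses the natural logarithm unless a base is given.

```r
# Each function takes its parameter(s) in parentheses and returns a value
sqrt(16)              # square root
abs(-7.5)             # absolute value
log(100)              # natural logarithm (base e) by default
log(100, base = 10)   # the base parameter selects a different base
exp(1)                # e raised to the power 1
```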
The Basics: Variables
• Variables are named containers for data
• The assignment operator in R is:
<- or =
(<- and = do the same thing when assigning at the top level)
• rm() removes the variable from memory
• Variable names must start with a letter (or a period not followed by a digit); after that they can contain letters, digits, periods, and underscores.
• A name cannot start with a digit or be a number by itself.
• Examples: result, x1, b2 (not 2b or 2)
• x, y, and z are variables that can be manipulated
• R is case sensitive (i.e. Result is a different variable than result)
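The points above can be sketched in a few lines. The variable names x, y, z, result, and Result follow the slide; the values assigned to them are illustrative assumptions.

```r
x <- 5          # assign with <-
y = 3           # = also assigns at the top level
z <- x * y      # variables can be used in calculations
print(z)

result <- 10
Result <- 20    # a different variable, because R is case sensitive
print(result == Result)

rm(z)                                # remove z from memory
print(exists("z", inherits = FALSE)) # check whether z still exists here
```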
Basic Data Types
Type               Range           Assign a Value
Numeric            Numbers         x <- 1
                                   y <- -2.5
Character          Text strings    name <- "Mark"
                                   color <- "red"
Logical (Boolean)  TRUE or FALSE   female <- TRUE
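One way to confirm which basic type a variable holds is the built-in class() function. This is a small sketch using the same assignments as the table; class() itself is not mentioned on the slide.

```r
x <- 1
name <- "Mark"
female <- TRUE

class(x)        # "numeric"
class(name)     # "character"
class(female)   # "logical"
```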
Advanced Data Types
• Vectors
• Lists
• Data frames
Vectors of Values
• A vector is a sequence of data elements of the same
basic type.
> scores<-c(65,75,80,88,82,99,100,100,50)
> scores
[1] 65 75 80 88 82 99 100 100 50
> studentnum<-1:9
> studentnum
[1] 1 2 3 4 5 6 7 8 9
> ones<-rep(1,4)
> ones
[1] 1 1 1 1
> names<-c("Nikita","Dexter","Sherlock")
> names
[1] "Nikita" "Dexter" "Sherlock"
c() and rep() are functions
Indexing Vectors
• We use brackets [ ] to pick specific elements in the
vector.
• In R, the index of the first element is 1
> scores
[1] 65 75 80 88 82 99 100 100 50
> scores[1]
[1] 65
> scores[2:3]
[1] 75 80
> scores[c(1,4)]
[1] 65 88
Simple statistics with R
• You can get descriptive statistics from a vector
> scores
[1] 65 75 80 88 82 99 100 100 50
> length(scores)
[1] 9
> min(scores)
[1] 50
> max(scores)
[1] 100
> mean(scores)
[1] 82.11111
> median(scores)
[1] 82
> sd(scores)
[1] 17.09857
> var(scores)
[1] 292.3611
> summary(scores)
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.00 75.00 82.00 82.11 99.00 100.00
Again, length(), min(), max(), mean(), median(), sd(), var() and summary()
are all functions. These functions accept a vector as their parameter.
Lists
• A list can contain many different types of elements
inside it
> studentnum<-1:9
> names<-c("Nikita","Dexter","Sherlock")
> list1<-list(c(2,5,3), 21.3, "hello", studentnum, names)
> list1
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
[1] "hello"
[[4]]
[1] 1 2 3 4 5 6 7 8 9
[[5]]
[1] "Nikita" "Dexter" "Sherlock"
Indexing Lists
• We use double brackets [[ ]] to pick specific
elements in the list.
> list1[[1]]
[1] 2 5 3
> list1[[5]]
[1] "Nikita" "Dexter" "Sherlock"
> list1[[5]][1]
[1] "Nikita"
Data Frames
• A data frame is a type of variable used for storing data tables
– It is a special type of list in which every element has the
same length (i.e. a data frame is a “rectangular” list)
> BMI<-data.frame(
+ gender = c("Male","Male","Female"),
+ height = c(152,171.5,165),
+ weight = c(81,93,78),
+ Age = c(42,38,26)
+ )
> BMI
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
> nrow(BMI)
[1] 3
> ncol(BMI)
[1] 4
Identify elements of a data frame
• To retrieve cell values, rows, and columns
> BMI[1,3]
[1] 81
> BMI[1,]
gender height weight Age
1 Male 152 81 42
> BMI[,3]
[1] 81 93 78
• More ways to retrieve columns as vectors
> BMI[[2]]
[1] 152.0 171.5 165.0
> BMI$height
[1] 152.0 171.5 165.0
> BMI[["height"]]
[1] 152.0 171.5 165.0
Packages
• Packages (add-ons) are collections of R functions and code
in a well-defined format.
• To install a package:
install.packages("psych")
– Each package only needs to be installed once
• For every new R session (i.e., every time you re-open
RStudio), you must load the package before it can be used
library(psych)
or
require(psych)
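Because installation is needed once but loading is needed every session, a script can check whether a package is present before installing it. The sketch below assumes a helper named is_installed(), which is hypothetical (defined here, not part of R); requireNamespace() is the base R function it wraps. The install and load lines are left as comments because they need an internet connection.

```r
# Hypothetical helper (not part of R): TRUE if the package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this check should succeed on any installation
print(is_installed("stats"))

# Install psych only when it is missing, then load it for this session:
# if (!is_installed("psych")) install.packages("psych")
# library(psych)
```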
Creating and opening a .R file
• The R script is where you keep a record of your work in
R/RStudio.
• To create a .R file
– Click “File|New File|R Script” in the menu
• To save the .R file
– click “File|Save”
• To open an existing .R file
– click “File|Open File” to browse for the .R file
Working directory
• The working directory is where RStudio will
look first for scripts and files
• Keeping everything in a self-contained
directory helps organize code and analyses
• Check your current working directory with
getwd()
To change working directory
Use the Session | Set Working Directory Menu
– If you already have an .R file open, you can select
“Set Working Directory>To Source File Location”.
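The menu option has a console equivalent, setwd(). As a sketch, the example below switches to R's temporary folder as a stand-in for your own project folder (an assumption made so the code runs anywhere) and then switches back.

```r
old_dir <- getwd()     # remember the current working directory
setwd(tempdir())       # stand-in for your own project folder
print(getwd())         # confirm where R is looking now
print(list.files())    # files R can now find by name alone
setwd(old_dir)         # switch back to where we started
```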
Reading data from a file
• Usually you won’t type in data
manually, you’ll get it from a file
• Example: 2009 Baseball Statistics
(http://www2.stetson.edu/~jrasp/data.htm)
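• The read command on this slide reads data from a CSV file and
creates a data frame called teamData that stores the data table.
• Reference the HomeRuns column in the data frame using
teamData$HomeRuns (R is case sensitive, so TeamData would be a
different variable)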
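The slide's screenshot is not reproduced here, so the following is only a sketch of reading a CSV with base R's read.csv(). The file is a tiny temporary CSV with made-up placeholder values (not the real 2009 statistics), and the Team and League column names are assumptions; only teamData and HomeRuns come from the slide.

```r
# Write a tiny made-up CSV to a temporary file (placeholder values only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Team,League,HomeRuns",
  "TeamA,AL,150",
  "TeamB,NL,120",
  "TeamC,NL,135"
), csv_path)

teamData <- read.csv(csv_path)   # read the CSV into a data frame
str(teamData)                    # show the columns and their types
print(teamData$HomeRuns)         # one column, returned as a vector
print(mean(teamData$HomeRuns))   # vector functions work on the column
```

With a real data file saved in your working directory, the path argument would simply be the file name in quotes.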
Looking for differences across groups:
The setup
• We want to know if National League (NL) teams scored more runs
than American League (AL) Teams
– And if that difference is statistically significant
• To do this, we need a package that will do this analysis
– In this case, it’s the “psych” package
– Installing it downloads and installs the package (once per R installation)
Looking for differences across groups: The
analysis (t-test)
• Descriptive statistics, broken up by group (League)
• Results of the t-test for differences in Runs by League
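The commands from the slide's screenshot are not recoverable, so this is a hedged sketch of the same two steps using base R only. The data frame holds made-up placeholder values, not the real 2009 statistics. Group summaries come from tapply(), and the test is base R's t.test(), which runs a Welch two-sample t-test by default. The psych package's describeBy() is the richer alternative for the first step and is shown only as a comment.

```r
# Made-up placeholder data (NOT the real 2009 baseball statistics)
teamData <- data.frame(
  League = c("AL", "AL", "AL", "AL", "NL", "NL", "NL", "NL"),
  Runs   = c(780, 810, 745, 790, 720, 760, 700, 735)
)

# Step 1: descriptive statistics, broken up by group (League)
group_means <- tapply(teamData$Runs, teamData$League, mean)
group_sds   <- tapply(teamData$Runs, teamData$League, sd)
print(group_means)
print(group_sds)
# With psych installed and loaded, a fuller table is available:
# psych::describeBy(teamData$Runs, group = teamData$League)

# Step 2: t-test for a difference in Runs between the two leagues
result <- t.test(Runs ~ League, data = teamData)
print(result)
```

A small p-value in the printed result suggests the difference between the group means is statistically significant.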
Histogram
hist(teamData$BattingAvg,
xlab="Batting Average",
main="Histogram: Batting Average")
hist()
first parameter – data values
xlab parameter – label for x axis
main parameter - sets title for chart
Plotting data
plot(teamData$BattingAvg,teamData$WinningPct,
xlab="Batting Average",
ylab="Winning Percentage",
main="Do Teams With Better Batting Averages Win More?")
plot()
first parameter – x data values
second parameter – y data values
xlab parameter – label for x axis
ylab parameter – label for y axis
main parameter - sets title for chart
Running this analysis as a script
Use the Code | Run Region | Run All Menu
Commands can be entered one at
a time, but usually they are all put
into a single file that can be saved
and run over and over again.
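Running a saved script can also be done from the console with base R's source() function. The sketch below writes a two-line script to a temporary file (a stand-in for your own .R file, with illustrative contents) and runs it.

```r
# Create a small script file (stand-in for a .R file you saved yourself)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "scores <- c(65, 75, 80)",
  "avg <- mean(scores)"
), script_path)

source(script_path)   # runs every line in the file, in order
print(avg)            # variables created by the script are now available
```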
Getting help
help.start()                general help
help(mean)                  help about function mean()
?mean                       same as help(mean)
example(mean)               show an example of function mean()
help.search("regression")   get help on a specific topic such as regression
Online Tutorials
• If you’d like to know more about R, check these out:
• Quick-R (http://www.statmethods.net/index.html)
• R Tutorial (http://www.r-tutor.com/r-introduction)
• Learn R Programming (https://www.tutorialspoint.com/r/index.htm)
• Programming with R (https://swcarpentry.github.io/r-novice-inflammation/)
• There is also an interactive tutorial to learn R basics,
highly recommended! (http://tryr.codeschool.com/)