Introduction To R
R is a statistical language based on the S language, developed at Bell Labs in 1976
Named R both because it is based on S and after the first names of Ross Ihaka and Robert Gentleman
In 2008 the commercial implementation of S (S-PLUS) was acquired by TIBCO for $25 million
Version 3.0.2 was released on September 25, 2013
Active development, with many contributed packages
Interactive language
Free!
Integration with Spotfire, Tableau, QlikView, MicroStrategy, Alteryx, Oracle, and more!
Resources
The R Homepage: http://cran.r-project.org/
Stack Overflow: http://stackoverflow.com/questions/tagged/r
Cross Validated: http://stats.stackexchange.com/questions/tagged/r
Books: http://www.r-project.org/doc/bib/R-books.html
Springer books
R blog aggregation: http://www.r-bloggers.com/
EI team members!
help() brings up the command overview, options, and examples
Working with R
Editors:
RStudio, Tinn-R
Point and Click: Rattle
Let's discuss pros and cons
R and Client Work
Today's expectations: R's ability to do the following.
Load, inspect, transform, and summarize data
Create analytics and visualizations
Create predictive and descriptive models
Datasets overview
REMOVED
For the remainder of the class we will be breaking down some of the key
points from each of these sets and inspecting the way R processes data.
This presentation is a guide. We will not cover each line of code!
Working with R
Using the GUI
GUI preferences
Expressions are evaluated and the results are returned. For a final expression the output is
printed; for an assignment, the value is stored in memory under the variable name.
If a value is currently stored in memory, it can be printed by typing the variable name
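For example, a minimal illustration at the console:
y <- 5    # the value is stored under the name y; nothing is printed
y         # typing the name prints the stored value: 5
2 + 3     # a final expression is evaluated and its result is printed: 5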
Assignment and Basic Arithmetic
Assignment is done using the <- or = operators**; best practice uses <- for clarity.
x1 <- 2
x2 <- 3
x3 <- 1:10
Arithmetic is performed using the standard + - * / ^ operators (%% and %/% for modulo and
integer division)
x2 + x3
** Note that == is used for comparison (testing equality), not assignment
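A minimal illustration of assignment versus comparison:
x1 <- 2    # assignment: stores the value 2 under the name x1
x1 == 2    # comparison: returns TRUE
x1 == 3    # comparison: returns FALSE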
Getting started
General Code:
The # character indicates a comment; R skips those lines
The c() function is used to create vectors of objects.
id <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2) # Vector of 1s and 2s
id2 <- seq(from = 1, to = 4, by = 1)
id*2
R objects:
Numeric Vectors as above
List - an ordered collection of objects whose elements can be of any type
Data <- list("EI", Id = 2, 5)
Factors : used to represent categorical data
x <- factor(c("yes", "yes", "no", "yes", "no"))
Matrix - vectors with a dimension attribute
m <- matrix(id, nrow = 3, ncol = 5)   # give the id vector dimensions (3 rows x 5 columns)
data.frame - used to store tabular data; columns can be of different types, similar to a list (a matrix cannot mix types)
x <- data.frame(test = 1:10, test2 = c(T, T, T, T,F,F,F,F,F,F))
Interacting with objects and missing values
Function calls require parentheses, while square brackets [ ] are used for subscripting and matrix
operations
A <- matrix(c(1,4,1,7,1,8,1,1,33,6,2,12,2,9,2), nrow=3, ncol=5, byrow=TRUE)
A[3,2]
A[ ,2]
B<-t(A)
A %*% B
Data sets will frequently have missing values, for example:
x <- c(NA, 2, 4, 5)
y <- is.na(x)
y
any(y)           # TRUE if any values are missing
x[!is.na(x)]     # drop the missing values
## note that we will cover functions in the next few slides
Most summary functions return NA if the data contain NA, unless the option na.rm=T is set:
id <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, NA)
sum(id)          # this returns NA
sum(id, na.rm=T) # this works
Libraries
Core to R is the extensibility offered through R libraries (or packages); the primary source for
these is CRAN (http://cran.r-project.org/)
To install a library use the following command:
install.packages("library_name")
To use a library you must load it into memory:
library(library_name)
To access information about the library use the following command to see functions contained in
the library:
lsf.str("package:reshape2")
library(help=library_name)
To find info on a specific function use either of the following
?function_name
help(function_name)
Install packages MASS, lattice, plyr, reshape2, ggplot2
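For example, the packages used in this class can be installed in a single call:
install.packages(c("MASS", "lattice", "plyr", "reshape2", "ggplot2"))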
Beginning data analysis
Loading external data
Your working directory is the default location that R will try to read and write files to. To set your
working directory use:
setwd("C:\\Users\\lavre\\Documents\\R_Training\\Jan_Class")
R can list the contents of the working directory (note that ls() lists objects in the R workspace, not files):
list.files()
Import data:
There are many ways to read data into R, including from other statistical packages and from database servers
Once loaded, any variable can be referred to as part of the loaded data set
The user can attach a data set to the search path with attach() and remove it with detach()
Files can be loaded from the web with no extra packages (there is even a Twitter package!!)
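For example, read.csv() accepts a URL directly (the address below is a placeholder, not a real data source):
web_data <- read.csv("http://example.com/some_file.csv", header = TRUE)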
Data exploration
AML data: the first two methods of loading data below are functionally the same.
Option 1: ARS_Data <- read.table("rawdata.csv", sep=",", header=T)
Option 2: ARS_Data2 <- read.csv("rawdata.csv", header=T)
Option 3*: ARS_Data3 <- read.xlsx("rawdata.xlsx", sheetIndex=1, header=T)
* Note that Option 3 needs the xlsx package to be installed.
To verify this claim we will use several built-in functions to inspect our data:
names(ARS_Data)
names(ARS_Data2)
head(ARS_Data)
ARS_Data[2,]
Data exploration
To get a better idea of the structure of the ARS_Data object we can use the following functions:
str(ARS_Data) # Overview of the data and structure
ARS_Data[1:2,]
dim(ARS_Data)
summary(ARS_Data)                    # summary can be used on a full data frame
summary(ARS_Data$CURR_DR_TRANS_AMT)  # or on an individual column
Some statistics: Mean of Current Debit Transaction Amount
mean(ARS_Data$CURR_DR_TRANS_AMT)
Why didn't that work? Missing values!!
Fix:
mean(ARS_Data$CURR_DR_TRANS_AMT, na.rm=T)
Cross tables and histograms
R has built-in apply functions for applying the same function over a range
We've obtained the mean for one variable. sapply(), for example, runs the same function on each
selected element (or variable):
sapply(ARS_Data[,4:6], FUN = mean, na.rm=T)
Cross tables: Provide a frequency table
table(ARS_Data$ACCT_GEO_RISK_NB)
table(ARS_Data$PRCSNG_DT)
table(ARS_Data$ACCT_GEO_RISK_NB, ARS_Data$PRCSNG_DT)
Cross table with a condition applied:
table((ifelse(ARS_Data$LOG_OUTGOING_CASH>0,"Good","Not Good")))
Explore Incoming Cash variable by creating some basic graphs
hist(ARS_Data$INCOMING_CASH)
dens<-density(ARS_Data$INCOMING_CASH, na.rm=T)
plot(dens, col=2,lwd=3 , main="Density INCOMING_CASH")
(try plot(dens, col=2,lwd=30, main="Density INCOMING_CASH"))
Functions
We have a neat new column, but we still haven't dealt with all of these missing values in a proper
way. Let's examine the scope of the problem by creating a function.
Function - an action based on a set of inputs.
1 - Any variable created inside the function is local. If the function references a variable that is
not among its inputs, R will use the last known variable with that name from the enclosing environment
2 - The function only accepts parameters that fit the types expected in the function body
NAcount <- function(X) {
  N <- is.na(X)    # logical matrix: TRUE where values are missing
  colSums(N)       # missing-value count per column
}
Output:
NAcount(ARS_Data)
Imputation
Now that we can see the scope of our problem, let's fix the issue by imputing nulls to 0.
ARS_Data$LOG_INCOMING_CASH[is.na(ARS_Data$LOG_INCOMING_CASH)] <- 0
NAcount(ARS_Data)
** Note that we selected the NA values within the variable itself
How did this affect our graphs?
par(mfrow=c(2,1) ,mar=c(3,3,5,1),oma=c(1.5,2,2,5))
plot(dens, col=2,lwd=3 , main="Density INCOMING_CASH")
plot(dens2, col=2,lwd=3 , main="log Density INCOMING_CASH")
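Note: dens2 is not defined in these slides; presumably it is the density of the (now imputed) log-transformed incoming cash, along the lines of:
dens2 <- density(ARS_Data$LOG_INCOMING_CASH, na.rm=T)   # assumed definition of dens2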
Exporting data
The transformation and imputation are complete. It is best practice to export the data we are
using for record retention.
The following will export the data in two formats (the same as reading, just in reverse):
write.table(ARS_Data, "ARS_Data.txt", sep="\t")
write.csv(ARS_Data, "ARS_Data.csv")
Data analysis 2
Exporting data 2 - Data manipulation
Size assessment: Aggregations by date:
aggdata <- aggregate(ARS_Data[ ,5:9], by=list(ARS_Data$PRCSNG_DT), FUN=mean, na.rm=TRUE)
aggdata
Aggregations by date and days open:
aggdata2 <- aggregate(ARS_Data[ ,5:9], by=list(ARS_Data$PRCSNG_DT, ARS_Data$CUST_DAYS_OPEN_LE_100), FUN=mean, na.rm=TRUE)
aggdata2
Create a subset - data where CUST_DAYS_OPEN_LE_100 is true
sub_ARS_Data <- subset(ARS_Data, ARS_Data$CUST_DAYS_OPEN_LE_100==1)
dim(sub_ARS_Data)
dim(ARS_Data)
Function and frequencies
Now we'll review frequencies by creating a function to perform a count for us
frq_over <- function(Colm){
  X <- as.data.frame(table(Colm))            # frequency table for the column
  X2 <- as.data.frame(subset(X, Freq > 6))   # keep values occurring more than 6 times
  X2
}
Run:
apply(ARS_Data[ , 4:8], 2,frq_over)
Based on this output some follow up needs to take place!
Credit transactions look interesting, let's explore further.
dens3 <- density(log(ARS_Data$CR_TRANS_AMT_INCR), na.rm=T)
plot(dens3, col=2, lwd=3, main="Density CR_TRANS_AMT_INCR")
Removing observations and plotting (again)
Clearly there is something wrong with our data!!
Let's compare the log-scaled observations without removals, the log-scaled observations with
removals, and the non-scaled observations with removals.
par(mfrow=c(3,1) ,mar=c(3,3,5,1),oma=c(1.5,2,2,5))
plot(dens3, col=2, lwd=3, main="Density CR_TRANS_AMT_INCR - ALL")
Removed
*This is for one variable. A true analysis will run through all variables
Visualizations
Graphics
This class isn't about Spotfire and Excel, it's about R!
One of R's best attributes is how it represents data graphically
R can represent a lot of data with few lines of code
There are many options; we will cover the most basic functions in the base package. Note that the
lattice package includes further options, as do other packages for 3D graphics
The most popular graphics package in R is ggplot2. It has its own unique language structure
and is related to R in a similar way as D3 is related to JavaScript
(i.e., based on the language, but used in a significantly different manner than the standard)
For a general overview of graphical capabilities, see the Appendix for a reference using the Bank
data set provided.
Graphics - Risk Loss
Our ORI dataset overview - business overview of the data
Subset - the approach
Device for output - png, pdf, ...
Printing - select PDF!
Print area - the par options
par(mfrow=c(2,1), mar=c(3,3,5,1), oma=c(1.5,2,2,5))
mfrow - c(nrows, ncols)
mar - the number of lines of margin on the four sides of the plot (bottom, left, top, right)
oma - the size of the outer margins in lines of text (bottom, left, top, right)
Let's go over the code in your R editor
Capabilities: plot, lines, legend, fit, abline
Options: col, lwd, pch
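A minimal sketch combining these capabilities and options (made-up data, not the ORI dataset):
set.seed(1)
x <- rnorm(100)
y <- 2*x + rnorm(100)
plot(x, y, col=4, pch=8)      # scatter plot with a colour and point style
fit <- lm(y ~ x)              # simple linear fit
abline(fit, col=2, lwd=3)     # add the fitted line
legend("topleft", legend=c("data", "fit"), col=c(4, 2), pch=c(8, NA), lwd=c(NA, 3))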
Graphics - Operational Risk
REMOVED
Basic predictive modeling
Making things easier (by making them look harder)
Using our knowledge of different data types in R, we can put various parameters in control files
rather than hard-coding them into R. This gives us the ability to update the data without
updating the code.
Import our cleaned up data, as well as a control file called varlist.csv
datraw <- read.csv(paste(getwd(), "Input/rawdata.csv", sep = "/"), sep = ",", header = TRUE)
varlist <- read.csv(paste(getwd(), "Input/varlist.csv", sep = "/"), sep = ",", header = TRUE)
Variable types to consider:
Categorical variables
Continuous variables
Target variable
Non-predictor variables
Making things easier (by making them look harder)
Now, let's pull apart the different aspects of the control file. Prep work:
1. Pull the categorical variable list from varlist.csv
catPredictors <- varlist$Variable[varlist$ModelType == 'Pred' & varlist$VarType == 'Cat']
2. Pull the continuous variable list from varlist.csv
contPredictors <- varlist$Variable[varlist$ModelType == 'Pred' & varlist$VarType == 'Cont']
3. Pull the target variable from varlist.csv
target <- REMOVED
4. Pull the additional non-predictor variables list from varlist.csv (a hedged sketch of steps 3 and 4 follows this list)
nonPredictors <- REMOVED
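The code for steps 3 and 4 was removed from the handout. Assuming varlist.csv tags these rows in its ModelType column (the 'Target' and 'NonPred' labels below are guesses, not confirmed), the same subsetting pattern would be:
target <- varlist$Variable[varlist$ModelType == 'Target']          # hypothetical label
nonPredictors <- varlist$Variable[varlist$ModelType == 'NonPred']  # hypothetical label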
Correlation chart
Create and view a correlation chart of the predictors vs. the target variable.
5. Pull only the target and predictor variables for the correlation chart
cont <- datraw[names(datraw) %in% target | names(datraw) %in% contPredictors]
6. Convert the target to a numeric vector
REMOVED
7. Create and write the correlation chart to CSV in the output folder
write.csv(cor(cont, cont$IS_SAR), file = paste(getwd(), "Output/corr.csv", sep = "/"))
Model
Logistic model - a classification model based on a binary dependent variable
8. Remove any columns not in our control file.
REMOVED
9. Format categorical variables as factors (the correlation matrix process above confirmed the
appropriate formatting of the continuous variables)
REMOVED
10. Data transformation as appropriate (based on our data exploration work)
In our dataset - log transform all variables that were marked Log (a hedged sketch of steps 8-10 follows this list).
transVars <- REMOVED
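The code for steps 8-10 was removed from the handout. A hedged sketch of what they might look like, assuming varlist.csv has a column (here called Transform, a guessed name) marking the 'Log' variables:
# 8. Keep only the columns listed in the control file
dat <- datraw[names(datraw) %in% varlist$Variable]
# 9. Format the categorical predictors as factors
dat[as.character(catPredictors)] <- lapply(dat[as.character(catPredictors)], factor)
# 10. Log-transform the variables marked 'Log' (column name and log(x + 1) choice are assumptions)
transVars <- as.character(varlist$Variable[varlist$Transform == 'Log'])
dat[transVars] <- lapply(dat[transVars], function(v) log(v + 1))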
Model
Best practices of modeling require partitioning the data into training, testing, and
validation sets. We are going to ignore validation for now
11. Define seed
This will allow the random selection to start from a set point for everyone
12. Define the split function (a hedged sketch follows below)
REMOVED
13. Generate data sets
spl <- splitdf(dat,seed=12345)
Outcome: spl$trainset and spl$testset are our train and test sets
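The split function itself was removed from the handout. A minimal sketch of a splitdf() helper with the same interface (the 50/50 split proportion is an assumption):
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)               # reproducible sampling
  index <- seq_len(nrow(dataframe))
  trainindex <- sample(index, trunc(length(index) / 2))
  list(trainset = dataframe[trainindex, ],         # rows sampled for training
       testset  = dataframe[-trainindex, ])        # remaining rows for testing
}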
Model
14. Create a basic logistic regression with stepwise selection (a hedged sketch follows below)
model <- REMOVED
15. Export the coefficients and p-values
cf <- REMOVED
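The model-fitting code was removed. A hedged sketch, assuming the target column is IS_SAR (seen in the correlation step), that the remaining columns of spl$trainset are predictors, and that the output file name below is made up:
model <- step(glm(IS_SAR ~ ., data = spl$trainset, family = binomial),
              direction = "both", trace = FALSE)   # logistic fit plus stepwise selection
cf <- summary(model)$coefficients                  # estimates, std. errors, z values, p-values
write.csv(cf, file = paste(getwd(), "Output/model_coefficients.csv", sep = "/"))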
Model Scoring
Select a scoring formula we would like to use: REMOVED
Final Score = XXXX
16. Predict our train set and apply the score formula (a hedged sketch follows below):
LogisticScore <- REMOVED
17. Create variables for the scoring formula min and max
asmin <- quantile(log10(LogisticScore), 0.2, na.rm=TRUE)
asmax <- quantile(log10(LogisticScore), 0.95, na.rm=TRUE)
18. Score based on our model
REMOVED
(This is what you have been waiting for all day!!!)
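The prediction and scoring code was removed. A hedged sketch of steps 16 and 18, assuming the predicted probabilities are the raw score and the final score is a 0-100 rescaling between the asmin/asmax cut points (the actual formula was not shown):
LogisticScore <- predict(model, newdata = spl$trainset, type = "response")   # step 16 (assumed)
score <- 100 * (log10(LogisticScore) - asmin) / (asmax - asmin)              # step 18 (assumed)
score <- pmin(pmax(score, 0), 100)                                           # cap at the 0-100 boundaries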
Appendix - Graphics
Plot - simple but powerful
A: plot(BankData$Profit_09, BankData$Age_09)
B: plot(x = BankData$Profit_09, y = BankData$Age_09, xlab = "Profit", ylab = "Age",
        main = "Profit 09", pch = 8, col = 3)
Appendix - Graphics
Bar Plot:
BankData3 <- aggregate(subset(BankData, select = c(2,8)), by = list(BankData$Age_09), FUN = sum, na.rm = TRUE)
barplot(t(BankData3[ , -1]), beside = T, ylab = "Age at 99", xlab = "Sum Of Profit", main = "Profit Per Year")   # drop the grouping column before plotting
legend("topleft", legend = "Red 09 Blue 00")
Appendix - Graphics
Histogram
hist(BankData$Profit_09)
Or for many:
par(mfrow=c(1,2))
hist(BankData$Profit_09, main="Profit 09", col=1)
hist(BankData$Profit_00, main="Profit 00", col=2)
Appendix - Graphics
Pie
pie(BankData3$Profit_09, col=rainbow(6), main="pie chart")
Someone wrote this package:
library(plotrix)
pie3D(BankData3$Profit_09, labels=(BankData3$Group.1), main="Profit By Age",
      explode=0.01, radius=1.1, theta=0.45, labelcex=1.5, shade=0.5)