Advanced R Data Analysis Training PDF
Training
https://github.com/rkrtiwari/rAdvanced
Module 1
Getting Started
Data Analysis Steps
• Data Collection
• Data Processing
• Data Cleaning
• Data Visualization
• Data Product
R Data Analysis Packages
Data Manipulation
dplyr: Data manipulation tasks
tidyr: Reshape data
mice: Missing data imputation
Data Analysis
DataExplorer: Visualize variables
R Data Analysis Packages
Data Visualization
ggplot2: Powerful visualization
shiny: Interactive data visualization
VIM: Missing data visualization
Install Packages
install.packages("tidyverse")
install.packages("DataExplorer")
install.packages("data.table")
install.packages("mice")
install.packages("ggplot2")
Module 2
Obtaining Data
Read Data from CSV File
data1 <- read.csv("data.csv", header = TRUE)
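A minimal, self-contained sketch of the same call. The temporary file and its two columns are made up for illustration, so the example runs without an existing data.csv:

```r
# Write a small sample file to a temporary location (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,90.5", "2,78.0", "3,85.2"), csv_path)

# header = TRUE tells read.csv that the first line holds the column names
data1 <- read.csv(csv_path, header = TRUE)

str(data1)   # inspect the column types
head(data1)  # preview the first rows
```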
Read Data from JSON
# fromJSON() is assumed to come from the jsonlite package
library(jsonlite)
data <- fromJSON("data.json")
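A short sketch of what fromJSON returns, assuming the jsonlite package (the slides do not name the package). The JSON text here is invented so the example needs no file; jsonlite accepts a JSON string, a file path, or a URL:

```r
library(jsonlite)

# Hypothetical JSON: an array of records with the same fields
json_text <- '[{"name": "a", "value": 1}, {"name": "b", "value": 2}]'

# An array of uniform records is simplified into a data frame
data <- fromJSON(json_text)

str(data)
```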
Read Data from Web
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
read.csv(url, nrows = 5, header = FALSE)
Read Data from XML
library(XML)
data <- xmlTreeParse("data.xml")
Challenge
Read the housing data from the following webpage
"https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
and store it in a data frame named house
Time: 5 min
Module 3
Data Exploration and Cleaning
Exploring our data
# load our library
library(DataExplorer)
library(data.table)
group_category(heartDT, "chest_pain", 0, "chol")
# correlation plot
plot_correlation(heartDT)
Plotting
# density plot (numeric columns only)
plot_density(heartDT)
# histogram (numeric columns only)
plot_histogram(heartDT)
# scatterplot, using age as y axis
plot_scatterplot(heartDT, "age")
Splitting data
output=split_columns(heartDT)
output$discrete
output$continuous
Imputing data
library(mice)
library(VIM)
# Mean Substitution
mean_sub <- miss_mtcars
mean_sub$qsec[is.na(mean_sub$qsec)] <- mean(mean_sub$qsec, na.rm = TRUE)
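A runnable sketch of mean substitution. The slides do not show how miss_mtcars is built, so here it is assumed to be a copy of mtcars with a few qsec values set to NA:

```r
# Assumption: miss_mtcars is mtcars with some qsec values missing
miss_mtcars <- mtcars
miss_mtcars$qsec[c(2, 5, 9)] <- NA

# Replace every missing qsec with the mean of the observed values
mean_sub <- miss_mtcars
qsec_mean <- mean(mean_sub$qsec, na.rm = TRUE)
mean_sub$qsec[is.na(mean_sub$qsec)] <- qsec_mean

sum(is.na(mean_sub$qsec))  # number of missing values left
```

Mean substitution keeps the column mean unchanged but shrinks its variance, which is one reason packages such as mice offer model-based alternatives.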
Dealing with Outliers
# ESD method
t=2
m=mean(x)
s=sd(x)
b1=m - s*t
b2=m + s*t
table(y)
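A complete sketch of the bounds calculation above. The sample x and the definition of y are assumptions, since the slides do not show them; here y labels each value as inside or outside the interval [b1, b2]:

```r
set.seed(1)
# Hypothetical sample: 50 typical values plus one extreme value
x <- c(rnorm(50, mean = 10, sd = 1), 25)

t <- 2             # number of standard deviations allowed
m <- mean(x)
s <- sd(x)
b1 <- m - s * t    # lower bound
b2 <- m + s * t    # upper bound

# Assumption: y flags values that fall outside [b1, b2]
y <- ifelse(x >= b1 & x <= b2, "normal", "outlier")
table(y)

x[y == "outlier"]  # the flagged values
```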
Dealing with Outliers
# boxplot method
boxplot(x)
boxplot.stats(x)
# outliers package
library(outliers)
dixon.test(x)
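A small sketch of the boxplot method using base R only. The vector x is invented; boxplot.stats reports as outliers the points lying more than 1.5 times the hinge spread beyond the hinges:

```r
# Hypothetical data with one extreme value
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 20, 45)

stats <- boxplot.stats(x)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values flagged as outliers

# Drop the flagged values
x_clean <- x[!x %in% stats$out]
```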
Challenge (10 mins)
Using the airquality dataset in R
glimpse(x)
lst(x)
tbl_sum(x)
Selecting columns
x2=select(x,col1,col2,col3,col4)
# selecting only 4 columns
trestbps_class=trestbps/5)
# this will give two new columns
Creating calculated columns
x2=mutate(x, cholLevel=
if_else(chol>250,"highrisk","normal"),
chol_class=chol/20)
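The mutate call above can be tried on a small data frame. The values below are invented; only the column name chol comes from the slides:

```r
library(dplyr)

# Hypothetical patient data
x <- data.frame(age = c(45, 61, 52), chol = c(230, 286, 250))

# Add a categorical and a numeric column derived from chol
x2 <- mutate(x,
  cholLevel = if_else(chol > 250, "highrisk", "normal"),
  chol_class = chol / 20)

x2
```

Note that a chol of exactly 250 is labelled "normal" because the test is a strict greater-than.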
Counting and arranging
count(x, chest_pain, sort = TRUE)
x2=arrange(x, desc(age))
# descending order
x2=top_n(x,20)
# top 20 rows ranked by the last column (the default when wt is not given)
Chaining
# the "%>%" (pipe) operator is used to chain operations:
# it passes the result of one step on to the next
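A short sketch of a chain. The heart data frame here is a made-up stand-in with two columns; each step receives the output of the previous one as its first argument:

```r
library(dplyr)

# Hypothetical data
heart <- data.frame(age = c(45, 61, 52, 70), chol = c(230, 286, 250, 199))

result <- heart %>%
  filter(age > 50) %>%               # keep patients older than 50
  mutate(chol_class = chol / 20) %>% # add a derived column
  arrange(desc(age))                 # sort by age, oldest first

result
```

The same result could be written as nested calls, arrange(mutate(filter(heart, ...), ...), ...), but the chained form reads in the order the steps run.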
right_join(A,B, by="col1")
# join matching rows from A to B (keeps all rows of B)
inner_join(A,B, by="col1")
# join data, retain only rows present in both sets
full_join(A,B, by="col1")
# join data, retain all values, all rows
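A minimal sketch comparing the three joins on two invented tables that share the key col1. Keys 2 and 3 appear in both tables, key 1 only in A, and key 4 only in B:

```r
library(dplyr)

# Hypothetical tables
A <- data.frame(col1 = c(1, 2, 3), a = c("x", "y", "z"))
B <- data.frame(col1 = c(2, 3, 4), b = c(TRUE, FALSE, TRUE))

right <- right_join(A, B, by = "col1")  # every row of B; a is NA for key 4
inner <- inner_join(A, B, by = "col1")  # only keys found in both tables
full  <- full_join(A, B, by = "col1")   # every key from either table

nrow(right)
nrow(inner)
nrow(full)
```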
Group by
groupDisease=group_by(x, disease)
# disease is the variable we want to group by
# (groups: "positive", "negative")
summarize(heart,
  count=n(),
  avgAge=mean(age, na.rm=TRUE),
  sdAge=sd(age, na.rm=TRUE),
  medAge=median(age, na.rm=TRUE),
  Q3rdAge=quantile(age, .75, na.rm=TRUE)
)
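A sketch that combines grouping and summarizing, so each statistic is computed once per group. The heart data frame is invented for illustration; only the column names disease and age come from the slides:

```r
library(dplyr)

# Hypothetical data with one missing age
heart <- data.frame(
  disease = c("positive", "negative", "positive", "negative", "positive"),
  age = c(63, 41, 57, NA, 66)
)

by_disease <- heart %>%
  group_by(disease) %>%
  summarize(
    count = n(),                        # rows per group, including NA ages
    avgAge = mean(age, na.rm = TRUE),
    medAge = median(age, na.rm = TRUE)
  )

by_disease
```

Without group_by, summarize returns a single row covering the whole data set.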
Challenge (10 mins)
Use the mtcars dataset
Arguments
# first: dataset name
# second: name of the column to split
# third: names of the new columns to split the column into
# fourth: the separator (what to split the column by)
Unite
# opposite of separate: combines columns
Arguments
# first: dataset name
# second: name of the new column to unite the columns into
# third: names of the columns to combine
# fourth: the separator placed between values in the new column
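A round-trip sketch of the two functions with the four arguments in the order listed. The data frame and its column names are invented:

```r
library(tidyr)

# Hypothetical data: year and month packed into one column
df <- data.frame(id = 1:2, date = c("2015-07", "2014-11"),
                 stringsAsFactors = FALSE)

# separate: dataset, column to split, new column names, separator
sep <- separate(df, date, c("year", "month"), sep = "-")

# unite: dataset, new column name, columns to combine, separator
uni <- unite(sep, "date", year, month, sep = "-")

sep
uni
```

After separate, year and month are character columns; converting them to numbers is a separate step.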
homeruns2=gather(homeruns, year, home_runs, YR2015:YR2013)
Spread
# opposite of gather, spreading out the columns
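A sketch of gather followed by spread. The table name and the YR columns follow the slide, while the player column and all values are invented:

```r
library(tidyr)

# Hypothetical wide table: one column per year
homeruns <- data.frame(player = c("A", "B"),
                       YR2015 = c(30, 12),
                       YR2014 = c(25, 18),
                       YR2013 = c(20, 22),
                       stringsAsFactors = FALSE)

# Wide to long: one row per player and year
homeruns2 <- gather(homeruns, year, home_runs, YR2015:YR2013)

# Long back to wide: one column per distinct value of year
homeruns3 <- spread(homeruns2, year, home_runs)

homeruns2
homeruns3
```

In current tidyr releases, pivot_longer and pivot_wider are the recommended replacements for gather and spread, which remain available.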
model_heart=n_heart %>%
  mutate(model=map(data, mod_fun))
# "data" is the list-column that holds the data for each row of n_heart
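A sketch of how the pieces might fit together. The slides do not show how n_heart or mod_fun are defined, so both are assumptions here: n_heart is built with tidyr's nest, and mod_fun fits a simple linear model. The data values are invented:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical data
heart <- data.frame(
  disease = rep(c("positive", "negative"), each = 3),
  age = c(63, 57, 66, 41, 48, 52),
  chol = c(250, 236, 270, 199, 210, 225)
)

# Assumption: n_heart has one row per disease group, with the
# remaining columns packed into a list-column named "data"
n_heart <- nest(heart, data = c(age, chol))

# Assumption: mod_fun fits a linear model to one group's data
mod_fun <- function(df) lm(chol ~ age, data = df)

# map applies mod_fun to each element of the list-column
model_heart <- n_heart %>%
  mutate(model = map(data, mod_fun))

model_heart
```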
Logical testing
pluck(heart,"age") # get values in "age"
old=function(x){x>50}
some(heart$age, old)
# do any elements pass the test?
detect(heart$age, old)
# find the first element that passes the test
detect_index(heart$age, old)
# find the position of the first element that passes the test
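A self-contained sketch of these purrr helpers on an invented age column. The predicate old matches the one on the slide:

```r
library(purrr)

# Hypothetical data
heart <- data.frame(age = c(45, 61, 52, 70))

old <- function(x) x > 50

ages <- pluck(heart, "age")           # same values as heart$age
any_old <- some(ages, old)            # TRUE: at least one age is above 50
all_old <- every(ages, old)           # FALSE: 45 fails the test
first_old <- detect(ages, old)        # 61, the first age above 50
first_pos <- detect_index(ages, old)  # 2, the position of that age
```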
pmap
# pmap takes a list of arguments as input
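A minimal sketch of pmap with three invented argument vectors. The function is called once per position, taking one element from each vector; pmap_dbl is the variant that returns a numeric vector instead of a list:

```r
library(purrr)

# Hypothetical arguments: three parallel vectors in a named list
args <- list(x = c(1, 2, 3), y = c(10, 20, 30), z = c(100, 200, 300))

# Names in the list are matched to the function's argument names
sums <- pmap_dbl(args, function(x, y, z) x + y + z)

sums  # 111 222 333
```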