Practical Preprocessing and Data Cleaning
Importing files
CSV file
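A correct call, and the setting that causes errors
data2 <- read.csv("input.csv", sep = ",", header = TRUE)
# sep = "" would split on whitespace and misread a comma-separated file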
From an online source
df <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
                 header = FALSE)
install.packages("tidyverse")
# Alternatively, install just dplyr:
install.packages("dplyr")
Data sets
lemonade2016.csv, or starwars from the dplyr package
1. Import the file.
2. Identify missing values, character values, and NA and NaN values.
3. Count the missing values in each column (entries coded as na, NA, or a blank space).
4. Replace values with the required numeric values.
> data2 <- read.csv("lemonade2016.csv", header = TRUE)
> data2
Identify the missing values. Only NA is recognized as missing; the – and na entries are not.
Count the total number of NA values
> sum(is.na(data2))
[1] 11
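One way to make R recognize the other missing-value codes is the na.strings argument of read.csv(). The sketch below uses a small made-up CSV rather than the real lemonade2016.csv, and a plain hyphen stands in for the dash; the column names and values are assumptions for illustration only.

```r
# Toy CSV text standing in for lemonade2016.csv (hypothetical values)
csv_text <- "Location,Lemon,Orange
Park,97,67
na,98,-
Beach,110,NA"

# Without na.strings, only the literal NA is treated as missing
raw <- read.csv(text = csv_text, header = TRUE)
sum(is.na(raw))

# Declare the extra missing-value codes at import time
clean <- read.csv(text = csv_text, header = TRUE,
                  na.strings = c("NA", "na", "-", ""))

# Missing values per column
colSums(is.na(clean))
```

Declaring the codes at import also lets the Orange column be read as numeric instead of character.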
If the number of missing values is small enough that dropping them will not affect the overall analysis, you may drop
them. Drop the rows with missing values in the orange and Location columns.
Remove every row that contains an NA with this code
data2_new <- data2[complete.cases(data2), ]
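To drop rows only when particular columns are missing, complete.cases() can be applied to just those columns. This is a minimal sketch on a made-up miniature of data2; the values are invented, and the column names Orange and Location are assumed to match the file.

```r
# Hypothetical miniature of data2; only the column names follow the text
data2 <- data.frame(
  Location = c("Park", NA, "Beach", "Park"),
  Lemon    = c(97, 98, NA, 134),
  Orange   = c(67, 90, 120, NA)
)

# Keep rows where both Orange and Location are present;
# an NA in Lemon does not cause a row to be dropped here
keep <- complete.cases(data2[, c("Orange", "Location")])
data2_subset <- data2[keep, ]

nrow(data2_subset)
```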
Using the dplyr package (a grammar of data manipulation)
dplyr verbs
select() (Selecting columns)
mutate() (Adding or changing columns)
filter() (Selecting rows)
summarise() (Summarising groups of rows)
arrange() (Ordering the rows)
Using filter() (Selecting rows)
starwars %>%
  filter(eye_color != "black")
Using select() (Selecting columns)
Helper          Description
starts_with()   Starts with a prefix
ends_with()     Ends with a suffix
contains()      Contains a literal string
matches()       Matches a regular expression
num_range()     Numerical range like x01, x02, x03
one_of()        Variables in a character vector
everything()    All variables
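# bmi is not a column of starwars, so create it before selecting it
starwars %>%
  mutate(bmi = mass / (height * height)) %>%
  select(name, height, mass, bmi)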
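The selection helpers can be tried directly on starwars. The following is a small sketch using three of them; the result names (color_cols, h_cols, underscore_cols) are only illustrative.

```r
library(dplyr)

# Columns whose names end with "color", plus name
color_cols <- starwars %>%
  select(name, ends_with("color"))

# Columns whose names start with "h"
h_cols <- starwars %>%
  select(starts_with("h"))

# Columns whose names match a regular expression (here: contain an underscore)
underscore_cols <- starwars %>%
  select(matches("_"))

names(color_cols)
names(h_cols)
```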
Using factors with mutate()
Other examples
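As a possible illustration of both headings, mutate() can convert a character column to a factor and add a derived column in the same call. The names starwars_fct and height_m are hypothetical, chosen only for this sketch.

```r
library(dplyr)

starwars_fct <- starwars %>%
  mutate(gender   = factor(gender),   # character column -> factor
         height_m = height / 100)     # new column: height in metres

# The factor levels are the distinct non-missing values of gender
levels(starwars_fct$gender)
```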
Using arrange() (Ordering of the rows)
By default, rows are sorted in ascending order
starwars %>%
  arrange(height)
Use desc() for descending order
starwars %>%
  arrange(desc(height))
Using summarise() with the mtcars dataset
group_by()
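A minimal sketch of summarise() combined with group_by() on mtcars: one summary row per number of cylinders. The choice of cyl and mpg, and the name cyl_summary, are assumptions for illustration.

```r
library(dplyr)

cyl_summary <- mtcars %>%
  group_by(cyl) %>%                  # one group per cylinder count
  summarise(mean_mpg = mean(mpg),    # average miles per gallon in the group
            n_cars   = n())          # number of cars in the group

cyl_summary
```

Without group_by(), the same summarise() call would return a single row for the whole data set.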
Tidyr practical
Install and load the tidyr package
pivot_longer()
pivot_wider()
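The two pivots are inverses of each other, which a small round trip can show. The tibble below is made up for this sketch; scores_wide, week and score are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Wide format: one column per week (made-up values)
scores_wide <- tibble(
  student = c("A", "B"),
  wk1 = c(10, 20),
  wk2 = c(15, 25)
)

# Wide -> long: the wk columns become (week, score) pairs
scores_long <- scores_wide %>%
  pivot_longer(cols = starts_with("wk"),
               names_to = "week", values_to = "score")

# Long -> wide: spread the pairs back into one column per week
scores_back <- scores_long %>%
  pivot_wider(names_from = week, values_from = score)

scores_long
```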
Functions of stringr
str_length("abc")
#> [1] 3
x <- c("abcdef", "ghifjk")
str_sub(x, 3, 3)
#> [1] "c" "i"
str_sub(x, 2, -2)
#> [1] "bcde" "hifj"
Whitespace
Three functions add, remove, or modify whitespace: str_pad(), str_trim(), and str_wrap().
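A short sketch of the whitespace functions on made-up strings; the variable names are illustrative only.

```r
library(stringr)

padded  <- str_pad("abc", width = 6)                 # pads on the left by default
centred <- str_pad("abc", width = 7, side = "both")  # pads on both sides
trimmed <- str_trim("  abc  ")                       # strips leading and trailing whitespace
wrapped <- str_wrap("one two three four", width = 10) # inserts line breaks to fit the width

padded
trimmed
cat(wrapped)
```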
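# billboard (from tidyr) stores weeks as columns wk1, wk2, ...; the reshape step is assumed here
billboard %>%
  pivot_longer(starts_with("wk"), names_to = "week", values_to = "rank") %>%
  mutate(week = substr(week, 3, 4),   # "wk12" -> "12"
         week = as.integer(week))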
Detecting strings using str_detect()
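str_detect() returns TRUE or FALSE for each element, so its result can be used to subset a vector or inside filter(). A minimal sketch on a made-up vector:

```r
library(stringr)

fruit_names <- c("apple", "banana", "pear")   # made-up example vector

has_an <- str_detect(fruit_names, "an")       # TRUE where the pattern occurs
has_an
#> [1] FALSE  TRUE FALSE

# Use the logical vector to keep only the matching elements
fruit_names[has_an]
#> [1] "banana"
```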
library(rvest)
library(stringr)
# top_movies is assumed to hold the page already parsed with read_html()
years <- top_movies %>%
  html_nodes("tbody tr td.titleColumn") %>%
  html_text() %>%
  str_trim() %>%
  str_split("\n") %>%
  lapply(function(movie) {
    movie[3]  # extract the third element, which is the year
  }) %>%
  unlist() %>%
  str_trim() %>%
  str_replace("\\(", "") %>%  # remove the opening parenthesis
  str_replace("\\)", "") %>%  # remove the closing parenthesis
  as.integer()
ratings <- top_movies %>%
  html_nodes(".imdbRating strong") %>%
  html_text() %>%
  as.numeric()
Ranks
top_movies_tables <- tibble(
  Rank = ranks,
  Title = titles,
  Year = years,
  Rating = ratings
)
Cleaning Data
Packages and dataset to install and load
library(dplyr)
library(tidyr)
library(skimr)
starwars dataset
The skimr package is used to show the data summary (skim) below
Keep data_test for validation and clean data_train
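The text does not show how data_train and data_test were created. One possible way, sketched here under assumptions, is a random 80/20 split of a few starwars columns; the seed, the proportion and the selected columns are all hypothetical choices, not the original ones.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the split is reproducible

# Assumed starting columns; the original selection is not given in the text
data <- starwars %>%
  select(height, mass, gender)

# Draw 80% of the row numbers at random for training
train_rows <- sample(nrow(data), size = floor(0.8 * nrow(data)))

data_train <- data[train_rows, ]    # cleaned in the steps that follow
data_test  <- data[-train_rows, ]   # set aside for validation

nrow(data_train)
nrow(data_test)
```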
skim(data_train)
Or anyNA(data_train)
any(is.na(data_train))
colSums(is.na(data_train))
Dropping missing values when they are very few
Dropping rows with missing height and gender values
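With tidyr, drop_na() removes rows that have an NA in the named columns only. The sketch below runs on a small made-up stand-in for data_train (toy_train is a hypothetical name), so the effect is easy to check by eye.

```r
library(dplyr)
library(tidyr)

# Small made-up stand-in for data_train
toy_train <- tibble(
  height = c(172, NA, 96, 202),
  mass   = c(77, 75, NA, 136),
  gender = c("masculine", "masculine", "masculine", NA)
)

# Drop rows with a missing height or gender; a missing mass is kept
# so that it can be imputed later
toy_dropped <- toy_train %>%
  drop_na(height, gender)

toy_dropped
```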
Imputing the missing mass and bmi values with the column mean
data_tr_imputed <- data_train %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi  = ifelse(is.na(bmi), mean(bmi, na.rm = TRUE), bmi))
data_tr_imputed
gender is a categorical variable and must be encoded
data_tr_imputed_encoded <- data_tr_imputed %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender)
data_tr_imputed_encoded
Feature Scaling
normalize <- function(feature) {
  # standardize: subtract the mean, then divide by the standard deviation
  (feature - mean(feature)) / sd(feature)
}
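As a quick check of what this kind of scaling does, the standardization can be worked through on three made-up values and compared with base R's scale(). The vector x is an assumption for illustration.

```r
x <- c(1, 2, 3)               # made-up feature values

# Subtract the mean (2), then divide by the standard deviation (1)
z <- (x - mean(x)) / sd(x)
z
#> [1] -1  0  1

# Base R's scale() performs the same centring and scaling
all.equal(z, as.numeric(scale(x)))
#> [1] TRUE
```

After scaling, the feature has mean 0 and standard deviation 1, so features measured in different units become comparable.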
Complete Processing Pipeline
Putting the whole data cleaning process into one pipeline
Steps
I. Feature Engineering.
II. Missing values.
III. Encoding categorical variables.
IV. Feature Scaling.
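data_train %>%
  mutate(bmi = mass / (height * height)) %>%                            # I. feature engineering
  drop_na(height, gender) %>%                                           # II. drop the few missing rows
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi  = ifelse(is.na(bmi), mean(bmi, na.rm = TRUE), bmi)) %>%   # II. impute with the mean
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%    # III. encoding
  select(-gender) %>%
  mutate(across(everything(), normalize))                               # IV. feature scaling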
The recipes package provides functions for doing all the steps above
library(recipes)
data_train %>%
  recipe() %>%
  step_mutate(BMI = mass / (height * height)) %>%
  step_naomit(height, gender) %>%
  step_impute_mean(mass, BMI) %>%  # named step_meanimpute() in older recipes releases
  step_dummy(gender) %>%
  step_normalize(everything()) %>%
  prep()