Practical Preprocessing

Data Manipulation and Cleaning


The data cleaning techniques you use depend on the types of errors your data contain, but some cleaning activities are so common that almost half of all data preprocessing may involve them.
Before we begin, recall that data manipulation topics such as slicing and drilling, and the importation of different data file types, are outside the scope of this module (you are expected to study them on your own).

Setting the work folder/Directory


# check working directory
> getwd()

#set working directory


> setwd("C:/Users/ebenu/Downloads/COMP1810Web AnalyticsLectures")

Importation of files
CSV file
See the correct and incorrect forms. For a comma-separated file the separator should be "," (the read.csv() default); the call below uses sep = "", which treats any whitespace as the separator and is a common source of errors.
data2 <- read.csv("input.csv", sep = "", header = TRUE)
From an online source:

df <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
                 header = FALSE)

df1 <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
                  header = FALSE,
                  sep = ",")

df2 <- read.csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
                header = FALSE)

//////////////////////////////////////////////////////////////////////////////////////////////////////

read.delim() for Delimited Files

read.delim() and read.delim2() are intended for tab-delimited files (read.delim2() assumes a comma as the decimal mark). They behave like read.table(), just as read.csv() does, but with different defaults; if the file uses another separator character, pass it explicitly with sep =, as in the examples below where the separator is "$".
df <- read.delim("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test_delim.txt", sep = "$")

df <- read.delim2("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test_delim.txt", sep = "$")


////////////////////////////////////////////////////////////////////////////////////////////////////////////

Identifying and imputing missing values.


RStudio shortcuts: Run = Ctrl+Alt+Enter; Pipe operator (%>%) = Ctrl+Shift+M
Install these packages, used for data cleaning and manipulation:
 tidyr — Tidy Messy Data (https://tidyr.tidyverse.org)
# The easiest way to get tidyr is to install the whole tidyverse:
# You may also install the waldo package to use its compare() function

install.packages("tidyverse")
# Alternatively, install just tidyr:
install.packages("tidyr")
Data sets
lemonade2016.csv, or starwars from the dplyr package.
1. Import the file.
2. Identify missing values, character values, and NA and NaN values.
3. Count the missing values (na, NA, blank/space) in each column.
4. Replace missing values with the required numeric values.
> data2 <- read.csv("lemonade2016.csv", header = TRUE)
> data2
/////////////////////////////////////////////////////////////////////////////////////////////////////////////

Identify the missing values. Note that only NA is recognised as missing; values such as "-" and "na" are not.
Count the total number of NA missing values:

> sum(is.na(data2))
[1] 11

Missing values for each column
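A minimal sketch of counting the NA values in each column:

colSums(is.na(data2))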

Impute the NA values using the mean of the column:

data2$Lemon[is.na(data2$Lemon)] <- round(mean(data2$Lemon, na.rm = TRUE))


Recheck the number of missing values again

If the number of missing values is so small that it will not affect the overall analysis, you may drop them. Drop the missing values in the Orange and Location columns.
The code below removes every column that consists entirely of NA values:
data2_new <- data2[, colSums(is.na(data2)) < nrow(data2)]
To drop the rows containing NA instead, use na.omit(data2) or data2[complete.cases(data2), ].
Using the dplyr package (a grammar of data manipulation)
dplyr Verbs
 select() (Selecting columns)
 mutate() (Add or change columns).
 filter() (Selecting rows)
 summarise() (Summary of group of rows)
 arrange() (Ordering of the rows).

Using starwars dataset that comes with dplyr

Install the magrittr package to be able to use the pipe operator %>% (it is also re-exported by dplyr).


Using filter() (selecting rows) to select rows where eye colour is black, and then where it is not black:
starwars %>%
filter(eye_color =="black")

starwars %>%

filter(eye_color !="black")
Using select() (Selecting columns)

Selecting columns by index

Selecting columns by variable (column) name

Selecting columns by a range of indices

Selecting columns by combining different constructs (see the sketch below).
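A minimal sketch of these select() variants on the starwars dataset (the particular columns and positions are chosen only for illustration):

starwars %>% select(1, 3)                            # by index
starwars %>% select(name, height, mass)              # by column name
starwars %>% select(1:5)                             # by a range of indices
starwars %>% select(name, hair_color:eye_color, 10)  # mixing names, ranges and indices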


Select helper functions of dplyr
These functions are used inside select(), hence they are called helper functions.

Helpers Description
starts_with() Starts with a prefix
ends_with() Ends with a suffix
contains() Contains a literal string
matches() Matches a regular expression
num_range() Numerical range like x01, x02, x03.
one_of() Variables in character vector.
everything() All variables.

Re-arranging the columns by naming the ones you want first and then using everything()

How to use the helpers starts_with(), ends_with() and contains() (see the sketch below).
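A minimal sketch of everything() and the other helpers on starwars (the columns are chosen for illustration):

starwars %>% select(name, species, everything())  # move name and species to the front
starwars %>% select(starts_with("s"))              # skin_color, sex, species, starships, ...
starwars %>% select(ends_with("color"))            # hair_color, skin_color, eye_color
starwars %>% select(contains("_"))                 # any column whose name contains an underscore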
Using mutate() (Add or change columns).
Using starwars, create a BMI column.

BMI = mass / (height/100)^2   (height is recorded in centimetres, so it is divided by 100 to convert to metres)

Rounding the BMI column to 2 decimal places:

starwars %>%
  mutate(bmi = mass / (height/100)^2) %>%
  mutate(bmi = round(bmi, 2)) %>%
  select(name, height, mass, bmi)
Using factors inside mutate(), and other examples (see the sketch below).
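A minimal sketch of creating a factor inside mutate() (the choice of column is only for illustration):

starwars %>%
  mutate(gender = as.factor(gender)) %>%
  select(name, gender)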
Using arrange() (ordering the rows)
By default, arrange() sorts in ascending order; wrap a column in desc() for descending order.

starwars %>%

arrange(height)

starwars %>%

arrange(desc(height))
Using summarise() with the mtcars dataset, together with group_by() (see the sketch below).
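A minimal sketch of group_by() and summarise() on mtcars (the grouping column and summary statistics are chosen for illustration):

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())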
Tidyr practical
Install and load the tidyr package

pivot_longer()

pivot_wider()

Making a tidy dataset by reshaping it from wide to long; see below.


Load tidyr to access the relig_income dataset:

pivot_longer(relig_income, -religion, names_to ="income" , values_to ="count" )
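pivot_wider() is the reverse operation; a minimal sketch that widens the long table produced above back to its original shape (the intermediate variable relig_long is introduced only for illustration):

relig_long <- pivot_longer(relig_income, -religion, names_to = "income", values_to = "count")
pivot_wider(relig_long, names_from = "income", values_from = "count")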


Using billboard datasets in tidyr
write.csv(billboard, "billboard.csv",row.names = FALSE)
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to ="week",
values_to ="rank"
)
String Manipulation
install.packages("stringr")   # install the package

library(stringr)              # load the package

All stringr functions start with the str_ prefix.

Functions of stringr

Getting and setting individual characters

/////////////////////////////////////////////////////////////////////////////////////////////////////////
str_length("abc")

#> [1] 3

////////////////////////////////////////////////////////////////////////////////////////////////////////////

x <- c("abcdef", "ghifjk")

# The 3rd letter

str_sub(x, 3, 3)
#> [1] "c" "i"

# The 2nd to 2nd-to-last character

str_sub(x, 2, -2)

#> [1] "bcde" "hifj"

//////////////////////////////////////////////////////////////////////////////////////////////////////////

Whitespace
Three functions add, remove, or modify whitespace:

Add space: str_pad()

> x <- c("abc", "defghi")

> str_pad(x, 10) # default pads on left

combine str_pad() and str_trunc():
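A minimal sketch combining the two, so that every string ends up exactly 10 characters wide (the width is chosen for illustration):

x %>%
  str_trunc(10) %>%
  str_pad(10, "right")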


In this table (the long-format billboard data created above) we will remove the "wk" prefix from the week column and convert the values to integers:

billboard %>%
  pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank") %>%
  mutate(week = substr(week, 3, 4),
         week = as.integer(week))
Detecting strings using str_detect()

Regular expressions and str_detect()

To detect any letter, upper or lower case, use a pattern such as "[A-Za-z]".
str_replace(): replace matched patterns in a string.
str_count(): count the number of matches.
str_locate(): locate the position of patterns in a string.

A sketch of these functions follows.
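A minimal sketch of these stringr functions (the example strings and patterns are chosen for illustration):

fruit <- c("Apple", "banana", "PEAR-42")
str_detect(fruit, "[A-Za-z]")   # TRUE for any string containing a letter, upper or lower case
str_replace(fruit, "a", "o")    # replace the first "a" match in each string
str_count(fruit, "a")           # count how many times "a" matches in each string
str_locate(fruit, "an")         # start and end position of the first match of "an"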


Web Scraping
install.packages("rvest")

We will be scraping the IMDb Top 250 page (IMDb Top 250 - IMDb).


Using html_nodes() to extract
This allows you to extract specific tags from the HTML.
html_nodes("div")      # extract all div tags
If an element has class = "hello":
html_nodes(".hello")   # extract elements with the hello class
If an element has id = "hi":
html_nodes("#hi")      # extract the element with the hi id
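The code below assumes the page has already been read into top_movies; a minimal sketch of that step (the URL is the IMDb Top 250 chart, and note that IMDb's page layout may have changed since these notes were written, so the selectors below may need adjusting):

library(rvest)
top_movies <- read_html("https://www.imdb.com/chart/top/")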

Extract the text using html_text(), then assign it to titles so that it can be cleaned up.
//////////////////////////////////////////////Extract title//////////////////////////////////////////

titles <- top_movies %>%
  html_nodes("tbody tr td.titleColumn") %>%
  html_text() %>%
  str_trim() %>%
  str_split("\n") %>%
  lapply(function(movie){
    movie[2]   # the 2nd element is the title
  }) %>%
  unlist() %>%
  str_trim()
////////////////////////////////////////////Extracting year////////////////////////////////////////////////

years <- top_movies %>%
  html_nodes("tbody tr td.titleColumn") %>%
  html_text() %>%
  str_trim() %>%
  str_split("\n") %>%
  lapply(function(movie){
    movie[3]   # extract the 3rd element, which is the year
  }) %>%
  unlist() %>%
  str_trim() %>%
  str_replace("\\(", "") %>%   # remove the opening parenthesis
  str_replace("\\)", "") %>%   # remove the closing parenthesis
  as.integer()

/////////////////////////////////////Extracting the Ratings//////////////////////////////////////

ratings <- top_movies %>%
  html_nodes(".imdbRating strong") %>%
  html_text() %>%
  as.numeric()
////////////////////////////////Ranks//////////////////////////////////////////////////////////////////

This is simple; the ranks are just the sequence 1 to 250:

ranks <- 1:250

//////////////////////////////////////////////////////////////////////////////////////////////////////////////

Putting it all together by creating a table (a tibble):

top_movies_tables <- tibble(
  Rank = ranks,
  Title = titles,
  Year = years,
  Rating = ratings
)
/////////////////////////////////////////////////////////////////////////////////////////////////////////////
Cleaning Data
Packages and dataset to install and load:

library(dplyr)
library(tidyr)
library(skimr)
starwars dataset
The skimr package is used to produce the data summary (skim) shown below.

Let's extract height, mass and gender from the dataset

data <- starwars %>%
  select(height, mass, gender)
data
Split the data by installing and loading the rsample package:
library(rsample)
///////////////////////////////////////////////////////////////////////////////////////////////////////////
data_split <- initial_split(data)   # by default, 3/4 of the rows go to the training set
data_train <- training(data_split)
data_test  <- testing(data_split)

Checking the sizes of the split data sets (see the sketch below).
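A minimal sketch of checking the split sizes:

nrow(data_train)
nrow(data_test)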

Keep data_test aside to be used for validation, and clean only data_train.

Creating a new feature bmi

data_train <- data_train %>%
  mutate(bmi = mass / (height * height))
data_train
To check for missing values

Use skim() to check for missing values.

skim(data_train)
Or use any(is.na(data_train)):
any(is.na(data_train))

colSums(is.na(data_train))
Dropping missing values where they are very few: drop the rows with missing height and gender (see the sketch below).
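A minimal sketch using tidyr's drop_na(), matching the step used in the full pipeline further below:

data_train %>%
  drop_na(height, gender)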

Imputation of missing values for mass and bmi


ifelse(condition, value_if_true, value_if_false)

data_tr_imputed <- data_train %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi  = ifelse(is.na(bmi),  mean(bmi,  na.rm = TRUE), bmi))
data_tr_imputed
gender is a categorical variable and must be encoded
data_tr_imputed_encoded <- data_tr_imputed %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender)
data_tr_imputed_encoded
Feature Scaling

Creating a function for normalisation

normalize <- function(feature){
  (feature - mean(feature)) / sd(feature)   # centre on the mean and scale by the standard deviation
}
Complete processes Pipeline
Putting the whole data cleaning process into a single pipeline.

Steps

I. Feature Engineering.
II. Missing values.
III. Encoding categorical variables.
IV. Feature Scaling.

data_train %>%
  mutate(bmi = mass / (height * height)) %>%
  drop_na(height, gender) %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi  = ifelse(is.na(bmi),  mean(bmi,  na.rm = TRUE), bmi)) %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender) %>%
  mutate_all(normalize)

Using Recipes for Data cleaning pipeline


install.packages("recipes")

The recipes package provides step functions for all of the cleaning steps coded above.
data_train %>%
  recipe() %>%
  step_mutate(BMI = mass / (height * height)) %>%
  step_naomit(height, gender) %>%
  step_meanimpute(mass, BMI) %>%    # renamed step_impute_mean() in newer versions of recipes
  step_dummy(gender) %>%
  step_normalize(everything()) %>%
  prep()
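prep() estimates the recipe steps but does not by itself return the cleaned data. A minimal sketch of retrieving it, assuming the recipe pipeline above is first assigned to a variable rec before prep():

rec %>%
  prep() %>%
  bake(new_data = NULL)   # returns the processed training data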

/////////////////////////ENCODING CATEGORICAL DATASET Using Iris////////////////////////////////////


//////////////////////////////////////////////////////////////////////////////////////////
iris %>%
  mutate(Species_versicolor = ifelse(Species == "versicolor", 1, 0),
         Species_virginica  = ifelse(Species == "virginica", 1, 0)) %>%
  select(-Species)   # remove the original Species column
