Practical Preprocessing

Data Manipulation and Cleaning


The data cleaning techniques you use depend on the types of errors your data contain, but some cleaning activities are so common that almost half of all data preprocessing may involve them.
Before we begin, recall that data manipulation topics such as slicing and drilling, and the importation of different data file types, are outside the scope of this module (you are expected to study them on your own).

Setting the work folder/Directory


# check working directory
> getwd()

#set working directory


> setwd("C:/Users/ebenu/Downloads/COMP1810Web AnalyticsLectures")

Importation of files
CSV file
See the correct and incorrect forms. For a comma-separated file the separator should be "," (the read.csv() default); the call below uses sep = "", which treats any whitespace as the separator and is a common source of errors.
data2 <- read.csv("input.csv", sep = "", header = TRUE)
From an online source:

df <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",
                 header = FALSE)

df1 <- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
                  header = FALSE,
                  sep = ",")

df2 <- read.csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.csv",
                header = FALSE)

//////////////////////////////////////////////////////////////////////////////////////////////////////

read.delim() for Delimited Files

read.delim() and read.delim2() are intended for tab-delimited files (read.delim2() assumes a comma as the decimal mark). They behave like read.table(), just as read.csv() does, but with different defaults; if the file uses another separator character, pass it explicitly with sep =, as in the examples below where the separator is "$".
df <- read.delim("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test_delim.txt", sep = "$")

df <- read.delim2("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test_delim.txt", sep = "$")


////////////////////////////////////////////////////////////////////////////////////////////////////////////

Identifying and imputing missing values.


RStudio shortcuts: Run = Ctrl+Alt+Enter; Pipe operator (%>%) = Ctrl+Shift+M
Install these packages, used for data cleaning and manipulation:
 tidyr — Tidy Messy Data (https://tidyr.tidyverse.org)
# The easiest way to get tidyr is to install the whole tidyverse:
# You may also install the waldo package to use its compare() function

install.packages("tidyverse")
# Alternatively, install just tidyr:
install.packages("tidyr")
Data sets
lemonade2016.csv, or starwars from the dplyr package.
1. Import the file.
2. Identify missing values, character values, and NA and NaN values.
3. Count the missing values (na, NA, blank/space) in each column.
4. Replace missing values with the required numeric values.
> data2 <- read.csv("lemonade2016.csv", header = TRUE)
> data2
/////////////////////////////////////////////////////////////////////////////////////////////////////////////

Identify the missing values. Note that only NA is recognised as missing; values such as "-" and "na" are not.
Count the total number of NA missing values:

> sum(is.na(data2))
[1] 11

Missing values for each column
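A minimal sketch of counting the NA values in each column:

colSums(is.na(data2))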

Impute the NA values using the mean of the column:

data2$Lemon[is.na(data2$Lemon)] <- round(mean(data2$Lemon, na.rm = TRUE))


Recheck the number of missing values again

If the number of missing values is so small that it will not affect the overall analysis, you may drop them. Drop the missing values in the Orange and Location columns.
The code below removes every column that consists entirely of NA values:
data2_new <- data2[, colSums(is.na(data2)) < nrow(data2)]
To drop the rows containing NA instead, use na.omit(data2) or data2[complete.cases(data2), ].
Using the dplyr package (a grammar of data manipulation)
dplyr Verbs
 select() (Selecting columns)
 mutate() (Add or change columns).
 filter() (Selecting rows)
 summarise() (Summary of group of rows)
 arrange() (Ordering of the rows).

Using starwars dataset that comes with dplyr

Install the magrittr package to be able to use the pipe operator %>% (it is also re-exported by dplyr).


Using filter() (selecting rows) to select rows where eye colour is black, and then where it is not black:
starwars %>%
filter(eye_color =="black")

starwars %>%

filter(eye_color !="black")
Using select() (Selecting columns)

Selecting columns by index

Selecting columns by variable (column) name

Selecting columns by a range of indices

Selecting columns by combining different constructs (see the sketch below).
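A minimal sketch of these select() variants on the starwars dataset (the particular columns and positions are chosen only for illustration):

starwars %>% select(1, 3)                            # by index
starwars %>% select(name, height, mass)              # by column name
starwars %>% select(1:5)                             # by a range of indices
starwars %>% select(name, hair_color:eye_color, 10)  # mixing names, ranges and indices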


Select helper functions of dplyr
These functions are used inside select(), hence they are called helper functions.

Helpers Description
starts_with() Starts with a prefix
ends_with() Ends with a suffix
contains() Contains a literal string
matches() Matches a regular expression
num_range() Numerical range like x01, x02, x03.
one_of() Variables in character vector.
everything() All variables.

Re-arranging the columns by naming the ones you want first and then using everything()

How to use the helpers starts_with(), ends_with() and contains() (see the sketch below).
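A minimal sketch of everything() and the other helpers on starwars (the columns are chosen for illustration):

starwars %>% select(name, species, everything())  # move name and species to the front
starwars %>% select(starts_with("s"))              # skin_color, sex, species, starships, ...
starwars %>% select(ends_with("color"))            # hair_color, skin_color, eye_color
starwars %>% select(contains("_"))                 # any column whose name contains an underscore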
Using mutate() (Add or change columns).
Using starwars, create a BMI column.

BMI = mass / (height/100)^2   (height is recorded in centimetres, so it is divided by 100 to convert to metres)

Rounding the BMI column to 2 decimal places:

starwars %>%
  mutate(bmi = mass / (height/100)^2) %>%
  mutate(bmi = round(bmi, 2)) %>%
  select(name, height, mass, bmi)
Using factors inside mutate(), and other examples (see the sketch below).
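A minimal sketch of creating a factor inside mutate() (the choice of column is only for illustration):

starwars %>%
  mutate(gender = as.factor(gender)) %>%
  select(name, gender)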
Using arrange() (ordering the rows)
By default, arrange() sorts in ascending order; wrap a column in desc() for descending order.

starwars %>%

arrange(height)

starwars %>%

arrange(desc(height))
Using summarise() with the mtcars dataset, together with group_by() (see the sketch below).
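A minimal sketch of group_by() and summarise() on mtcars (the grouping column and summary statistics are chosen for illustration):

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())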
Tidyr practical
Install and load the tidyr package

pivot_longer()

pivot_wider()

Making a tidy dataset by reshaping it from wide to long; see below.


Load tidyr to access the relig_income dataset:

pivot_longer(relig_income, -religion, names_to ="income" , values_to ="count" )
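pivot_wider() is the reverse operation; a minimal sketch that widens the long table produced above back to its original shape (the intermediate variable relig_long is introduced only for illustration):

relig_long <- pivot_longer(relig_income, -religion, names_to = "income", values_to = "count")
pivot_wider(relig_long, names_from = "income", values_from = "count")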


Using billboard datasets in tidyr
write.csv(billboard, "billboard.csv",row.names = FALSE)
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to ="week",
values_to ="rank"
)
String Manipulation
install.packages("stringr")   # install the package

library(stringr)              # load the package

All stringr functions start with the str_ prefix.

Functions of stringr

Getting and setting individual characters

/////////////////////////////////////////////////////////////////////////////////////////////////////////
str_length("abc")

#> [1] 3

////////////////////////////////////////////////////////////////////////////////////////////////////////////

x <- c("abcdef", "ghifjk")

# The 3rd letter

str_sub(x, 3, 3)
#> [1] "c" "i"

# The 2nd to 2nd-to-last character

str_sub(x, 2, -2)

#> [1] "bcde" "hifj"

//////////////////////////////////////////////////////////////////////////////////////////////////////////

Whitespace
Three functions add, remove, or modify whitespace:

Add space: str_pad()

> x <- c("abc", "defghi")

> str_pad(x, 10) # default pads on left

combine str_pad() and str_trunc():
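A minimal sketch combining the two, so that every string ends up exactly 10 characters wide (the width is chosen for illustration):

x %>%
  str_trunc(10) %>%
  str_pad(10, "right")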


In this table (the long-format billboard data created above) we will remove the "wk" prefix from the week column and convert the values to integers:

billboard %>%
  pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank") %>%
  mutate(week = substr(week, 3, 4),
         week = as.integer(week))
Detecting strings using str_detect()

Regular expressions and str_detect()

To detect any letter, upper or lower case, use a pattern such as "[A-Za-z]".
str_replace(): replace matched patterns in a string.
str_count(): count the number of matches.
str_locate(): locate the position of patterns in a string.

A sketch of these functions follows.
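A minimal sketch of these stringr functions (the example strings and patterns are chosen for illustration):

fruit <- c("Apple", "banana", "PEAR-42")
str_detect(fruit, "[A-Za-z]")   # TRUE for any string containing a letter, upper or lower case
str_replace(fruit, "a", "o")    # replace the first "a" match in each string
str_count(fruit, "a")           # count how many times "a" matches in each string
str_locate(fruit, "an")         # start and end position of the first match of "an"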


Web Scraping
install.packages("rvest")

We will be scraping the IMDb Top 250 page (IMDb Top 250 - IMDb).


Using html_nodes() to extract
This allows you to extract specific tags from the HTML.
html_nodes("div")      # extract all div tags
If an element has class = "hello":
html_nodes(".hello")   # extract elements with the hello class
If an element has id = "hi":
html_nodes("#hi")      # extract the element with the hi id
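The code below assumes the page has already been read into top_movies; a minimal sketch of that step (the URL is the IMDb Top 250 chart, and note that IMDb's page layout may have changed since these notes were written, so the selectors below may need adjusting):

library(rvest)
top_movies <- read_html("https://www.imdb.com/chart/top/")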

Extract the text using html_text(), then assign it to titles so that it can be cleaned up.
//////////////////////////////////////////////Extract title//////////////////////////////////////////

titles <- top_movies %>%
  html_nodes("tbody tr td.titleColumn") %>%
  html_text() %>%
  str_trim() %>%
  str_split("\n") %>%
  lapply(function(movie){
    movie[2]   # the 2nd element is the title
  }) %>%
  unlist() %>%
  str_trim()
////////////////////////////////////////////Extracting year////////////////////////////////////////////////

years <- top_movies %>%
  html_nodes("tbody tr td.titleColumn") %>%
  html_text() %>%
  str_trim() %>%
  str_split("\n") %>%
  lapply(function(movie){
    movie[3]   # extract the 3rd element, which is the year
  }) %>%
  unlist() %>%
  str_trim() %>%
  str_replace("\\(", "") %>%   # remove the opening parenthesis
  str_replace("\\)", "") %>%   # remove the closing parenthesis
  as.integer()

/////////////////////////////////////Extracting the Ratings//////////////////////////////////////

ratings <- top_movies %>%
  html_nodes(".imdbRating strong") %>%
  html_text() %>%
  as.numeric()
////////////////////////////////Ranks//////////////////////////////////////////////////////////////////

This is simple; the ranks are just the sequence 1 to 250:

ranks <- 1:250

//////////////////////////////////////////////////////////////////////////////////////////////////////////////

Putting it all together by creating a table (a tibble):

top_movies_tables <- tibble(
  Rank = ranks,
  Title = titles,
  Year = years,
  Rating = ratings
)
/////////////////////////////////////////////////////////////////////////////////////////////////////////////
Cleaning Data
Packages and dataset to install and load:

library(dplyr)
library(tidyr)
library(skimr)
starwars dataset
The skimr package is used to produce the data summary (skim) shown below.

Let's extract height, mass and gender from the dataset

data <- starwars %>%
  select(height, mass, gender)
data
Split the data by installing and loading the rsample package:
library(rsample)
///////////////////////////////////////////////////////////////////////////////////////////////////////////
data_split <- initial_split(data)   # by default, 3/4 of the rows go to the training set
data_train <- training(data_split)
data_test  <- testing(data_split)

Checking the sizes of the split data sets (see the sketch below).
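A minimal sketch of checking the split sizes:

nrow(data_train)
nrow(data_test)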

Keep data_test aside to be used for validation, and clean only data_train.

Creating a new feature bmi

data_train <- data_train %>%
  mutate(bmi = mass / (height * height))
data_train
To check for missing values

Use skim() to check for missing values.

skim(data_train)
Or use any(is.na(data_train)):
any(is.na(data_train))

colSums(is.na(data_train))
Dropping missing values where they are very few: drop the rows with missing height and gender (see the sketch below).
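A minimal sketch using tidyr's drop_na(), matching the step used in the full pipeline further below:

data_train %>%
  drop_na(height, gender)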

Imputation of missing values for mass and bmi


ifelse(condition, value_if_true, value_if_false)

data_tr_imputed <- data_train %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi  = ifelse(is.na(bmi),  mean(bmi,  na.rm = TRUE), bmi))
data_tr_imputed
gender is a categorical variable and must be encoded
data_tr_imputed_encoded <- data_tr_imputed %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender)
data_tr_imputed_encoded
Feature Scaling

Creating a function for normalisation

normalize <- function(feature){
  (feature - mean(feature)) / sd(feature)   # centre on the mean and scale by the standard deviation
}
Complete processes Pipeline
Putting the whole data cleaning process into a single pipeline.

Steps

I. Feature Engineering.
II. Missing values.
III. Encoding categorical variables.
IV. Feature Scaling.

data_train %>%
  mutate(bmi = mass / (height * height)) %>%
  drop_na(height, gender) %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         bmi  = ifelse(is.na(bmi),  mean(bmi,  na.rm = TRUE), bmi)) %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender) %>%
  mutate_all(normalize)

Using Recipes for Data cleaning pipeline


install.packages("recipes")

The recipes package provides step functions for all of the cleaning steps coded above.
data_train %>%
  recipe() %>%
  step_mutate(BMI = mass / (height * height)) %>%
  step_naomit(height, gender) %>%
  step_meanimpute(mass, BMI) %>%    # renamed step_impute_mean() in newer versions of recipes
  step_dummy(gender) %>%
  step_normalize(everything()) %>%
  prep()
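prep() estimates the recipe steps but does not by itself return the cleaned data. A minimal sketch of retrieving it, assuming the recipe pipeline above is first assigned to a variable rec before prep():

rec %>%
  prep() %>%
  bake(new_data = NULL)   # returns the processed training data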

/////////////////////////ENCODING CATEGORICAL DATASET Using Iris////////////////////////////////////


//////////////////////////////////////////////////////////////////////////////////////////
iris %>%
  mutate(Species_versicolor = ifelse(Species == "versicolor", 1, 0),
         Species_virginica  = ifelse(Species == "virginica", 1, 0)) %>%
  select(-Species)   # remove the original Species column
