Data Cleaning in R
Data Cleansing
Data cleansing, also known as data cleaning or data scrubbing, is the process of
detecting and correcting corrupt or inaccurate records in a data set.
It involves exploring raw data, tidying messy data and preparing data for
analysis.
In the data preprocessing phase, cleaning the data often takes 50-80% of the time
before it can actually be mined for insights.
Data Quality
Business decisions often revolve around
identifying prospects
understanding customers to stay connected
knowing about competitors and partners
being current and relevant with marketing campaigns
Data quality is an important factor that impacts the outcome of data analysis and
hence the accuracy of decision making. Reliable predictions cannot be made from
data of low or no quality.
However, dirty data is inevitable in any system for a variety of reasons, so it is
essential to keep your data clean at all times. This is an ongoing exercise that
organizations have to follow.
Dirty data
Dirty data refers to data containing erroneous information. The following are
considered dirty data:
Misleading data
Duplicate data
Inaccurate data
Non-integrated data
Data that violates business rules
Data without a generalized formatting
Incorrectly punctuated or spelled data
*source - Techopedia
Missing values
Inaccurate values
Duplicate values
Outliers like typographic / measurement errors
Noisy values
Data timeliness (age of data)
How to manage Missing Values
Ignoring or removing missing values is not always the right approach, as they may
be too important to ignore. Similarly, filling in missing values manually may be
tedious and not feasible.
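Before turning to the smoothing methods listed below, here is a minimal sketch of
the simplest automated options in R, dropping incomplete rows or imputing a column
mean; the data frame and calls are illustrative additions, not part of the course
material.

# Hypothetical data frame with a missing value
df <- data.frame(id = 1:4, score = c(10, NA, 8, 12))

# Option 1: drop rows containing any NA (may discard important records)
df_complete <- na.omit(df)

# Option 2: impute the missing score with the column mean
df$score[is.na(df$score)] <- mean(df$score, na.rm = TRUE)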
Binning method – first sort the data and partition it into equi-depth bins, then
smooth the data using bin means, bin medians, bin boundaries, etc. (see the sketch
after this list)
Clustering – group the data into clusters, then identify and remove outliers
Regression – use regression functions to smooth the data
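A rough sketch of binning by bin means; the numbers and the choice of three bins
are made up for illustration.

x <- sort(c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34))   # sorted data

# Partition into 3 equi-depth bins of 4 values each
bins <- split(x, rep(1:3, each = 4))

# Smooth by bin means: every value in a bin is replaced by its bin's mean
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
print(smoothed)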
Dirty data is everywhere. In fact, most real-world datasets start off dirty in one
way or another, but need to be cleaned and prepared for analysis.
In this video we will learn about the typical steps involved like exploring raw
data, tidying data, and preparing data for analysis.
Hi, I'm Nick. I'm a data scientist at DataCamp and I'll be your instructor for this
course on Cleaning Data in R. Let's kick things off by looking at an example of
dirty data.
You're looking at the top and bottom, or head and tail, of a dataset containing
various weather metrics recorded in the city of Boston over a 12 month period of
time. At first glance these data may not appear very dirty. The information is
already organized into rows and columns, which is not always the case. The rows are
numbered and the columns have names. In other words, it's already in table format,
similar to what you might find in a spreadsheet document. We wouldn't be this lucky
if, for example, we were scraping a webpage, but we have to start somewhere.
Despite the dataset's deceivingly neat appearance, a closer look reveals many
issues that should be dealt with prior to, say, attempting to build a statistical
model to predict weather patterns in the future. For starters, the first column X
(all the way on the left) appears to be meaningless; it's not clear what the columns
X1, X2, and so forth represent (and if they represent days of the month, then we
have time represented in both rows and columns); the different types of
measurements contained in the measure column should probably each have their own
column; there are a bunch of NAs at the bottom of the data; and the list goes on.
Don't worry if these things are not immediately obvious to you -- they will be by
the end of the course. In fact, in the last chapter of this course, you will clean
this exact same dataset from start to finish using all of the amazing new things
you've learned.
Dirty data are everywhere. In fact, most real-world datasets start off dirty in one
way or another, but by the time they make their way into textbooks and courses,
most have already been cleaned and prepared for analysis. This is convenient when
all you want to talk about is how to analyze or model the data, but it can leave
you at a loss when you're faced with cleaning your own data.
With the rise of so-called "big data", data cleaning is more important than ever
before. Every industry - finance, health care, retail, hospitality, and even
education - is now doggy-paddling in a large sea of data. And as the data get
bigger, the number of things that can go wrong does too. Each imperfection becomes
harder to find when you can't simply look at the entire dataset in a spreadsheet on
your computer.
In fact, data cleaning is an essential part of the data science process. In simple
terms, you might break this process down into four steps: collecting or acquiring
your data, cleaning your data, analyzing or modeling your data, and reporting your
results to the appropriate audience. If you try to skip the second step, you'll
often run into problems getting the raw data to work with traditional tools for
analysis in, say, R or Python. This could be true for a variety of reasons. For
example, many common algorithms require variables to be arranged into columns and
for missing values to be either removed or replaced with non-missing values,
neither of which was the case with the weather data you just saw.
Not only is data cleaning an essential part of the data science process - it's also
often the most time-consuming part. As the New York Times reported in a 2014
article called "For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights",
"Data scientists ... spend from 50 percent to 80 percent of their time mired in
this more mundane labor of collecting and preparing unruly digital data, before it
can be explored for useful nuggets." Unfortunately, data cleaning is not as sexy as
training a neural network to identify images of cats on the internet, so it's
generally not talked about in the media nor is it taught in most intro data science
and statistics courses. No worries, we're here to help.
In this course, we'll break data cleaning down into a three step process: exploring
your raw data, tidying your data, and preparing your data for analysis. Each of the
first three chapters of this course will cover one of these steps in depth, then
the fourth chapter will require you to use everything you've learned to take the
weather data from raw to ready for analysis.
The first step in the data cleaning process is exploring your raw data. We can
think of data exploration itself as a three step process consisting of
understanding the structure of your data, looking at your data, and visualizing
your data.
To understand the structure of your data, you have several tools at your disposal
in R. Here, we read in a simple dataset called lunch, which contains information on
the number of free, reduced price, and full price school lunches served in the US
from 1969 through 2014. First, we check the class of the lunch object to verify
that it's a data frame, or a two-dimensional table consisting of rows and columns,
in which each column holds a single data type such as numeric, character, etc.
We then view the dimensions of the dataset with the dim() function. This particular
dataset has 46 rows and 7 columns. dim() always displays the number of rows first,
followed by the number of columns.
Next, we take a look at the column names of lunch with the names() function. Each
of the 7 columns has a name: year, avg_free, avg_reduced, and so on.
Okay, so we're starting to get a feel for things, but let's dig deeper. The str()
(for "structure") function is one of the most versatile and useful functions in the
R language because it can be called on any object and will normally provide a
useful and compact summary of its internal structure. When passed a data frame, as
in this case, str() tells us how many rows and columns we have. Actually, the
function refers to rows as observations and columns as variables, which, strictly
speaking, is true in a tidy dataset, but not always the case as you'll see in the
next chapter. In addition, you see the name of each column, followed by its data
type and a preview of the data contained in it. The lunch dataset happens to be
entirely integers and numerics. We'll have a closer look at these datatypes in
chapter 3.
The dplyr package offers a slightly different flavor of str() called glimpse(),
which offers the same information, but attempts to preview as much of each column
as will fit neatly on your screen. So here, we first load dplyr with the library()
command, then call glimpse() with a single argument, lunch.
To review, you've seen how we can use the class() function to see the class of a
dataset, the dim() function to view its dimensions, names() to see the column
names, str() to view its structure, glimpse() to do the same in a slightly enhanced
format, and summary() to see a helpful summary of each column.
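The lunch dataset is specific to the course, so the calls below use the built-in
mtcars data frame as a stand-in to illustrate the same functions; treat this as a
sketch rather than the course code.

library(dplyr)

class(mtcars)      # "data.frame"
dim(mtcars)        # number of rows, then number of columns
names(mtcars)      # column names
str(mtcars)        # compact summary of the structure
glimpse(mtcars)    # dplyr's column-wise preview
summary(mtcars)    # per-column summary statistics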
Okay, so we've seen some useful summaries of our data, but there's no substitute
for just looking at it. The head() function shows us the first 6 rows by default.
If you add one additional argument, n, you can control how many rows to display.
For example, head(lunch, n = 15) will display the first 15 rows of the data.
We can also view the bottom of lunch with the tail() function, which displays the
last 6 rows by default, but that behavior can be altered in the same way with the n
argument.
Viewing the top and bottom of your data only gets you so far. Sometimes the easiest
way to identify issues with the data is to plot them. Here, we use hist() to plot
a histogram of the percent free and reduced lunch column, which quickly gives us a
sense of the distribution of this variable. It looks like the value of this
variable falls between 50 and 60 for 20 out of the 46 years contained in the lunch
dataset.
Finally, we can produce a scatter plot with the plot() function to look at the
relationship between two variables. In this case, we clearly see that the percent
of lunches that are either free or reduced price has been steadily rising over the
years, going from roughly 15 to 70 percent between 1969 and 2014.
To review, head() and tail() can be used to view the top and bottom of your data,
respectively. Of course, you can also just print() your data to the console, which
may be okay when working with small datasets like lunch, but is definitely not
recommended when working with larger datasets.
Lastly, hist() will show you a histogram of a single variable and plot() can be
used to produce a scatter plot showing the relationship between two variables.
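Again using mtcars as a stand-in for the course's lunch data, a short sketch of
these viewing and plotting functions:

head(mtcars)                   # first 6 rows
head(mtcars, n = 15)           # first 15 rows
tail(mtcars)                   # last 6 rows

hist(mtcars$mpg)               # histogram of a single variable
plot(mtcars$wt, mtcars$mpg)    # scatter plot of two variables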
#
# Complete the 'mtcars_data' function below.
#
# The function is expected to return an INTEGER.
#
# The intended return value is not specified in this excerpt; the body below is a
# hypothetical completion that returns the number of rows in mtcars.
mtcars_data <- function(){
  as.integer(nrow(mtcars))
}
mtcars_data()
R follows a set of conventions that makes one layout of tabular data much easier to
work with than others. A dataset is said to be tidy when it follows three rules:
each variable forms a column, each observation forms a row, and each value has its
own cell.
separate() and unite() help you split and combine cells to place a single, complete
value in each cell.
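As a small sketch of both functions, using a made-up people data frame:

library(tidyr)

# Hypothetical data frame for illustration
people <- data.frame(fname = c("Sachin", "Rahul"),
                     lname = c("Tendulkar", "Dravid"),
                     DoB   = c("24-04-1973", "11-01-1973"))

# separate(): split the DoB column into three columns
separate(people, col = "DoB", into = c("date", "month", "year"), sep = "-")

# unite(): combine fname and lname into a single Name column
unite(people, col = "Name", fname, lname, sep = " ")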
Let us create a new dataframe called "paverage.df" that stores the average scores
by player across 3 different years. We will use this dataframe to examine gather()
and spread() behaviour.
player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
Y2010 <- c(48.8, 40.22, 51.02, 53.34)
Y2011 <- c(53.7, 41.9, 50.8, 59.44)
Y2012 <- c(60.0, 52.39, 61.2, 61.44)
paverage.df <- data.frame(player,Y2010,Y2011,Y2012)
print(paverage.df)
When column headers are values and not variables, use the gather() function to tidy
the data, as shown in the sketch below.
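For example, the following sketch gathers the three year columns of paverage.df
into key/value pairs (the column selection Y2010:Y2012 is assumed from the data
frame created above):

library(tidyr)

# One row per player per year, with the year as a key column
pavg_gather <- gather(paverage.df, key = "year", value = "pavg", Y2010:Y2012)
print(pavg_gather)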
Now try to recreate this dataset again after tidying the data.
For datasets that are messy because a single column holds multiple variables, the
separate() function can be used.
When variables are stored in both rows and columns, use the gather() function
followed by the spread() function.
library(tidyr)

# first.df and mydf2.df are not defined in this excerpt; the data frames below are
# hypothetical stand-ins so that the function can run end to end.
first.df <- data.frame(fname = c("Sachin", "Rahul"),
                       lname = c("Tendulkar", "Dravid"),
                       DoB   = c("24-04-1973", "11-01-1973"))
mydf2.df <- data.frame(id = 1:2, city = c("Chennai", "Mumbai"), year = 2012,
                       Jan = c(25, 24), Feb = c(27, 25), Mar = c(30, 28))

tidyr_operations <- function(){
  ## print(paverage.df)
  pavg_gather <- gather(paverage.df, key = "year", value = "pavg", Y2010:Y2012)
  print(pavg_gather)
  paverage1.df <- spread(pavg_gather, key = "year", value = "pavg")
  print(paverage1.df)
  print(first.df)
  print(separate(first.df, col = "DoB", into = c('date', 'month', 'year'), sep = '-'))
  print(unite(first.df, col = "Name", c('fname', 'lname'), sep = ' '))
  print(mydf2.df)
  print(spread(gather(mydf2.df, key = "month", value = "temp", c(4, 5, 6)),
               key = "month", value = "temp"))
}
tidyr_operations()
library(dplyr)

# mtcars1, mt_select and mt_newcols are not defined in this excerpt; the
# definitions below are hypothetical stand-ins built from the mtcars dataset.
mtcars1    <- mtcars
mt_select  <- select(mtcars, mpg, cyl, gear)       # keep only a few columns
mt_newcols <- mutate(mtcars, kmpl = mpg * 0.425)   # add a derived column

dplyr_operations <- function(){
  print(arrange(mtcars1, cyl, -mpg))               # sort by cyl, then mpg descending
  print(mt_select)
  print(mt_newcols)
  print(mean(mtcars$mpg))
  print(max(mtcars$mpg))
  print(quantile(mtcars$mpg, probs = 0.25))
}
dplyr_operations()
library(stringr)

# Y, Z and z are not defined in this excerpt; the values below are hypothetical
# stand-ins chosen so that every call has something to operate on.
Y <- c("zebra", "alpha", "zoo")        # used with str_detect()
Z <- "NA values hide as the text NA"   # used with str_extract*, str_length, case
z <- "   padded string   "             # used with str_trim()

stringr_operations <- function(){
  x = "R"
  print(str_c(x, "Tutorial", sep = " "))   # concatenate strings
  print(str_detect(Y, 'z'))                # which elements contain "z"
  print(str_extract(Z, "NA"))              # first match only
  print(str_extract_all(Z, "NA"))          # all matches
  print(str_length(Z))                     # number of characters
  print(str_to_lower(Z))
  print(str_to_upper(Z))
  y <- c("alpha", "gama", "duo", "uno", "beta")
  print(y[str_order(y)])                   # alphabetical ordering
  print(str_trim(z))                       # strip leading/trailing whitespace
}
stringr_operations()
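The date_operations() function called below is not defined in this excerpt; a
minimal sketch, assuming it simply parses a day-month-year string with as.Date()
and prints it, would produce the output shown:

# Hypothetical sketch of date_operations(); the original body is not included here
date_operations <- function(){
  d <- as.Date("01-05-1965", format = "%d-%m-%Y")   # parse day-month-year text
  print(d)
}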
[1] "1965-05-01"
date_operations()
Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22,33,14,25, 10,29, 56)
Copy Outlierset to a new dataset, Outlierset1.
Replace the outliers with 36, which is the 3rd quartile plus the minimum.
Compare boxplots of Outlierset and Outlierset1; you should see no outliers in the
new dataset. A sketch of these steps follows below.
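One possible sketch of the exercise; using boxplot.stats() to identify the outliers
is an assumption, since the exercise does not prescribe a method.

Outlierset1 <- Outlierset                        # copy the data

# Values flagged as outliers by the boxplot rule (1.5 * IQR beyond the quartiles)
outs <- boxplot.stats(Outlierset)$out

Outlierset1[Outlierset1 %in% outs] <- 36         # replace outliers with 36

# Side-by-side boxplots: the second plot should show no outliers
par(mfrow = c(1, 2))
boxplot(Outlierset,  main = "Original")
boxplot(Outlierset1, main = "Outliers replaced")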
Obvious errors
We have so far seen how to handle missing values, special values and outliers.
Sometimes, however, we come across obvious errors that cannot be caught by the
techniques learnt previously.
Examples include an age field holding a negative value, or a height field holding
zero or an implausibly small number. Such erroneous data still needs manual checks
and corrections; a sketch of such a check follows below.
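The correct_data() function called below is not defined in this excerpt; a minimal
sketch, assuming a data frame with hypothetical age and height columns, might flag
impossible values and mark them for manual review:

# Hypothetical data frame and correction routine
people.df <- data.frame(name = c("A", "B", "C"),
                        age = c(34, -2, 29),
                        height = c(172, 165, 0))

correct_data <- function(){
  people.df$age[people.df$age < 0]        <- NA   # negative ages are impossible
  people.df$height[people.df$height <= 0] <- NA   # heights must be positive
  print(people.df)                                # review and correct NAs manually
}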
correct_data()