
WORKING WITH DATA FROM FILES

D. Deva Hema (AP/CSE)
WORKING WITH DATA FROM FILES

 The most common ready-to-go data format is a family of tabular formats called structured values.

Working with well-structured data from files
Loading the file:

uciCar <- read.table(
   'http://www.win-vector.com/dfiles/car.data.csv',
   sep=',', header=T)

 This loads the data and stores it in a new R data frame object called uciCar.
WORKING WITH DATA FROM FILES

EXAMINING OUR DATA


 class()—Tells us the object uciCar is of class data.frame.
 help()—Gives you the documentation for a class.
 summary()—Gives you a summary of almost any R object. summary(uciCar) shows us a lot about the distribution of the UCI car data.
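
As a quick illustration, the calls below show these three functions on the uciCar data frame (a minimal sketch; the output comments reflect what the slides report):

# Load the UCI car data as shown above
uciCar <- read.table(
   'http://www.win-vector.com/dfiles/car.data.csv',
   sep=',', header=T)

class(uciCar)       # "data.frame"
help("data.frame")  # documentation for the data.frame class
summary(uciCar)     # distribution of every column in the car data
dim(uciCar)         # number of rows and columns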
WORKING WITH DATA FROM FILES

Exploring the car data


 summary(uciCar)

    buying        maint
 high :432    high :432
 low  :432    low  :432
 med  :432    med  :432
 vhigh:432    vhigh:432

 The summary() command shows us the distribution of each variable in the dataset.

WORKING WITH OTHER DATA FORMATS

 XLS/XLSX—http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
 JSON—http://cran.r-project.org/web/packages/rjson/index.html
 XML—http://cran.r-project.org/web/packages/XML/index.html
 MongoDB—http://cran.r-project.org/web/packages/rmongodb/index.html
 SQL—http://cran.r-project.org/web/packages/DBI/index.html
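
As a brief illustration, the sketch below reads a JSON file and an XML file into R using the rjson and XML packages linked above; the file names customers.json and customers.xml are hypothetical placeholders.

# install.packages(c("rjson", "XML"))   # once, if not already installed
library(rjson)
library(XML)

# Parse a JSON file into a nested R list (file name is a placeholder)
jsonData <- fromJSON(file = "customers.json")

# Parse an XML file and convert its records into a data frame
xmlDoc  <- xmlParse("customers.xml")
xmlData <- xmlToDataFrame(xmlDoc)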
TRANSFORMING DATA IN R

Loading the credit dataset


d <- read.table(
   paste('http://archive.ics.uci.edu/ml/',
         'machine-learning-databases/statlog/german/german.data',
         sep=''),
   stringsAsFactors=FALSE, header=FALSE)
print(d[1:3,])
TRANSFORMING DATA IN R
Setting column names
 colnames(d) <- c('Status.of.existing.checking.account',
   'Duration.in.month', 'Credit.history', 'Purpose',
   'Credit.amount', 'Savings account/bonds',
   'Present.employment.since',
   'Installment.rate.in.percentage.of.disposable.income',
   'Personal.status.and.sex', 'Other.debtors/guarantors',
   'Present.residence.since', 'Property', 'Age.in.years',
   'Other.installment.plans', 'Housing',
   'Number.of.existing.credits.at.this.bank', 'Job',
   'Number.of.people.being.liable.to.provide.maintenance.for',
   'Telephone', 'foreign.worker', 'Good.Loan')
TRANSFORMING DATA IN R

Building a map to interpret loan use codes


mapping <- list(
'A40'='car (new)',
'A41'='car (used)',
'A42'='furniture/equipment',
'A43'='radio/television',
'A44'='domestic appliances', ...)
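
One way to apply this mapping (a sketch, not necessarily the original code) is to look each Purpose code up in the list and store the decoded label back in the column:

# Recode the Purpose column using the mapping above
# (assumes d$Purpose is a character vector and every code that appears has an entry in mapping)
d$Purpose <- as.factor(as.character(mapping[d$Purpose]))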
TRANSFORMING DATA IN R
EXAMINING OUR NEW DATA

> table(d$Purpose, d$Good.Loan)

                      BadLoan  GoodLoan
 business                  34        63
 car (new)                 89       145
 car (used)                17        86
 domestic appliances        4         8
 education                 22        28
 furniture/equipment       58       123
 others                     5         7
 radio/television          62       218
 repairs                    8        14
WORKING WITH RELATIONAL
DATABASES

RMySQL Package:
 R connects to MySQL through the "RMySQL" package, which provides native connectivity to MySQL databases. You can install this package in the R environment using the following command.

install.packages("RMySQL")
WORKING WITH RELATIONAL DATABASES-
STAGING THE DATA

 A staging area is an intermediate storage area used for data processing during the extract, transform, and load (ETL) process.

 The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories.
WORKING WITH RELATIONAL DATABASES-
CURATING THE DATA

 Data curation is the organization and integration of data collected from various sources.
 It involves annotation, publication, and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.
WORKING WITH RELATIONAL DATABASES

Connecting R to MySQL
 Create a connection object in R to connect to the database. It takes the username, password, database name, and host name as input.

library(RMySQL)

# Create a connection object to the MySQL database.
# We will connect to the sample database named "sakila" that comes with the MySQL installation.
mysqlconnection = dbConnect(MySQL(), user = 'root',
   password = '', dbname = 'sakila', host = 'localhost')

# List the tables available in this database.
dbListTables(mysqlconnection)
WORKING WITH RELATIONAL DATABASES
 We can query the database tables in MySQL using the function dbSendQuery().
 The query gets executed in MySQL and the result set is returned using the R fetch() function.
 It is stored as a data frame in R.

# Create the table
actor <- "CREATE TABLE actor(actor_id INT, first_name TEXT, last_name TEXT,
   last_update TEXT)"
dbSendQuery(mysqlconnection, actor)

# Insert a row into the table
dbSendQuery(mysqlconnection, "insert into
   actor(actor_id, first_name, last_name, last_update)
   values(188, 'Jeba', 'raj', '13/08/2019')")

# Query the "actor" table to get all the rows.
result = dbSendQuery(mysqlconnection, "select * from actor")
WORKING WITH RELATIONAL
DATABASES
# Store the result in an R data frame object. n = 5 is used to fetch the first 5 rows.
df = fetch(result, n = 5)
print(df)

Query with a filter clause:

result = dbSendQuery(mysqlconnection,
   "select * from actor where last_name = 'TORN'")

# Fetch all the records (with n = -1) and store them as a data frame.
df = fetch(result, n = -1)
print(df)

Output:
  actor_id first_name last_name         last_update
1       18        DAN      TORN 2006-02-15 04:34:33
2       94    KENNETH      TORN 2006-02-15 04:34:33
3      102     WALTER      TORN 2006-02-15 04:34:33
WORKING WITH RELATIONAL
DATABASES
Updating Rows in the Tables

dbSendQuery(mysqlconnection, "update mtcars set
   disp = 168.5 where hp = 110")

Dropping Tables in MySQL

dbSendQuery(mysqlconnection, 'drop table if
   exists student')
EXPLORING THE DATA
 Using summary statistics to spot problems
 Missing Values
 Invalid Values And Outliers
 Data Range
 Units
EXPLORING THE DATA

Using summary statistics to spot problems


 > summary(custdata)

     custid          sex
 Min.   :   2068   F:440
 1st Qu.: 345667   M:560
 Median : 693403
 Mean   : 698500
 3rd Qu.:1044606
 Max.   :1414286
EXPLORING THE DATA

Typical problems revealed by data summaries


 Missing Values
 Invalid Values And Outliers
 Data Range
 Units
EXPLORING THE DATA
 Missing Values
is.employed
   Mode :logical
   FALSE:73
   TRUE :599
   NA's :328   (missing values)

housing.type
   Homeowner free and clear    :157
   Homeowner with mortgage/loan:412
   Occupied with no rent       : 11
   Rented                      :364
   NA's                        : 56   (missing values)
EXPLORING THE DATA

Invalid values and Outliers

 Examples of invalid values include negative values in what should be a non-negative numeric field (like age or income), or text where you expect numbers.
 Outliers are data points that fall well outside the range where you expect the data to be.
EXPLORING THE DATA
Invalid values and Outliers
> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
(Negative values of income represent bad data)

> summary(custdata$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    38.0    50.0    51.7    64.0   146.7
(Customers of age zero, or of an age greater than about 110, are outliers)
EXPLORING THE DATA
DATA RANGE
> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
UNITS
 Does the income data represent hourly wages, or yearly wages in units of $1000?

 > summary(Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -8.7    14.6    35.0    53.5    67.0   615.0
MANAGING DATA- CLEANING DATA

 Treating missing values
  ◦ To drop or not to drop
  ◦ Missing data in categorical variables
  ◦ Missing values in numeric data

summary(custdata$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0   25000   45000   66200   82000  615000     328
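
A common treatment for missing numeric values, sketched below under the custdata example, is to keep a flag marking which rows were missing and to fill the NAs with a stand-in value such as the mean (the column names Income.isBad and Income.fix are illustrative, not from the original):

# Record which Income values were missing before changing anything
missingIncome <- is.na(custdata$Income)
custdata$Income.isBad <- missingIncome

# Replace the missing values with the mean of the observed incomes
meanIncome <- mean(custdata$Income, na.rm = TRUE)
custdata$Income.fix <- ifelse(missingIncome, meanIncome, custdata$Income)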
MANAGING DATA- CLEANING DATA
 TO DROP OR NOT TO DROP?

 summary(custdata[is.na(custdata$housing.type),
      c("recent.move", "num.vehicles")])

 The c() function in R is used to create a vector with the values you provide explicitly.
MANAGING DATA- CLEANING DATA

Missing data in categorical variables


custdata$is.employed.fix <-
ifelse(is.na(custdata$is.employed),
"missing",
ifelse(custdata$is.employed==T,
"employed",
"not employed"))
 Note: R's fix(x) function invokes edit on x and then assigns the new (edited) version of x in the user's workspace; here, however, ".fix" is simply a suffix on the new column name is.employed.fix, which leaves the original is.employed column unchanged while recoding the NAs as "missing".
MANAGING DATA- CLEANING DATA
 summary(as.factor(custdata$is.employed.fix))

      employed      missing not employed
           599          328           73

 Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables.
Sampling for modeling and validation
 Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling.
 It's important that the dataset you use is an accurate representation of your population as a whole. For example, your customers might come from all over the United States.
 When you collect your custdata dataset, it might be tempting to use all the customers from one state to train the model. But if you plan to use the model to make predictions about customers all over the country, it's a good idea to pick customers randomly from all the states.
Test and training splits
 The training set is the data that you feed to the
model-building algorithm—regression, decision
trees, and so on.

 The test set is the data that you feed into the
resulting model, to verify that the model’s
predictions are accurate.
Creating a sample group column
 A convenient way to manage random sampling is to add a sample group column to the data frame.
 The sample group column contains a number generated uniformly from zero to one, using the runif() function.
Creating a sample group column
> custdata$gp <- runif(dim(custdata)[1])
> testSet <- subset(custdata, custdata$gp <= 0.1)
> trainingSet <- subset(custdata, custdata$gp > 0.1)
> dim(testSet)[1]
[1] 93
> dim(trainingSet)[1]
[1] 907
Record grouping
 # Assign one sample group number per household, so all records for a household fall in the same split
 hh <- unique(hhdata$household_id)
 households <- data.frame(household_id = hh, gp = runif(length(hh)))
 hhdata <- merge(hhdata, households, by = "household_id")
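
The household-level gp column can then be used just like the earlier gp column (a sketch under that assumption):

# Split by household, so every record of a household lands in the same set
testSet     <- subset(hhdata, gp <= 0.1)
trainingSet <- subset(hhdata, gp > 0.1)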
Data provenance
 You’ll also want to add a column (or columns)
to record data provenance: when your dataset
was collected, perhaps what version of your
data cleaning procedure was used on the data
before modeling, and so on.
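
For instance, a minimal sketch (the column names and values below are hypothetical placeholders, not from the original):

# Hypothetical provenance columns: when the dataset was collected and
# which version of the cleaning procedure produced it
custdata$collection_date  <- as.Date("2019-08-13")
custdata$cleaning_version <- "clean-v2"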
DATA STRUCTURES
 Structured
 Semi- Structured
 Quasi-Structured
 Unstructured
DATA STRUCTURES
 Big data can come in multiple forms, including
structured and non-structured data such as
financial data, text files, multimedia files, and
genetic mappings.

 Most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze.
DATA STRUCTURES
 Structured data: Data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
 Semi-structured data: Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).

DATA STRUCTURES

 Quasi-structured data: Textual data with erratic data formats that can be formatted with effort, tools, and time.

 Unstructured data: Data that has no inherent structure, which may include text documents, PDFs, images, and video.
DRIVERS OF BIG DATA
 Medical information, such as genomic sequencing and
diagnostic imaging
 Photos and video footage uploaded to the World Wide Web
 Video surveillance, such as the thousands of video cameras
spread across a city
 Mobile devices, which provide geospatial location data of the
users, as well as metadata about text messages, phone calls,
and application usage on smart phones
 Smart devices, which provide sensor-based collection of
information from smart electric grids, smart buildings, and
many other public and industry infrastructures
 Nontraditional IT devices, including the use of radio-frequency
identification (RFID) readers, GPS navigation systems, and
seismic processing
