Working With Data
FILES
D. Deva Hema (AP/CSE)
WORKING WITH DATA FROM FILES
data.frame.
help()—Gives you the documentation for a class.
   buying       maint
 high :432   high :432
 low  :432   low  :432
 med  :432   med  :432
 vhigh:432   vhigh:432
The summary() command shows us the distribution of each variable in the data frame.
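As a minimal sketch of loading a delimited file and inspecting it, the example below reads a tiny inline CSV rather than the real car dataset. The four sample rows and the variable names csv_text and cars are assumptions made for illustration only.

```r
# Hypothetical four-row sample shaped like the car data (not the real dataset).
csv_text <- "buying,maint
vhigh,vhigh
high,low
med,med
low,high"

# stringsAsFactors = TRUE makes summary() report counts per level,
# which is the style of output shown above.
cars <- read.table(text = csv_text, sep = ",", header = TRUE,
                   stringsAsFactors = TRUE)

class(cars)     # confirms the result is a data.frame
summary(cars)   # per-column distribution of values
help(class(cars))  # opens the documentation for the data.frame class
```

For a real file, the text argument would be replaced by a file path or URL as the first argument to read.table().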
XLS/XLSX: http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
JSON: http://cran.r-project.org/web/packages/rjson/index.html
XML: http://cran.r-project.org/web/packages/XML/index.html
MongoDB: http://cran.r-project.org/web/packages/rmongodb/index.html
SQL: http://cran.r-project.org/web/packages/DBI/index.html
TRANSFORMING DATA IN R
'Number.of.people.being.liable.to.provide.maintenance.for',
'Telephone', 'foreign.worker', 'Good.Loan')
> table(d$Purpose, d$Good.Loan)
                      BadLoan GoodLoan
  business                 34       63
  car (new)                89      145
  car (used)               17       86
  domestic appliances       4        8
  education                22       28
  furniture/equipment      58      123
  others                    5        7
  radio/television         62      218
  Repairs                   8       14
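The cross-tabulation above can be reproduced in miniature. This is a sketch on a hypothetical five-row data frame; the rows are invented for illustration and are not taken from the loan dataset.

```r
# Hypothetical loan records; values are illustrative only.
d <- data.frame(
  Purpose   = c("car (new)", "car (new)", "business", "education", "business"),
  Good.Loan = c("BadLoan", "GoodLoan", "GoodLoan", "BadLoan", "GoodLoan")
)

# Count how often each Purpose occurs with each loan outcome.
tab <- table(d$Purpose, d$Good.Loan)
print(tab)

# Row-wise proportions: the share of bad and good loans within each purpose.
print(prop.table(tab, margin = 1))
```

Row proportions are often easier to compare than raw counts when the purposes differ a lot in size.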
WORKING WITH RELATIONAL
DATABASES
RMySQL Package:
"RMySQL" is an add-on package, available from CRAN, that
provides native connectivity between R and a MySQL
database. You can install this package in the R
environment using the following command.
install.packages("RMySQL")
WORKING WITH RELATIONAL DATABASES:
STAGING THE DATA
Connecting R to MySQL
Create a connection object in R to connect to the database. It
takes the username, password, database name, and host
name as input.
# Create a connection object to the MySQL database.
# We will connect to the sample database named "sakila" that
# comes with the MySQL installation.
The result set of a query is returned using the R fetch() function.
It is stored as a data frame in R.
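The connect, query, and fetch steps can be sketched as follows. So that the example runs without a database server, it assumes the RSQLite package and uses an in-memory SQLite database as a stand-in for MySQL. The actor table and its rows are hypothetical. dbFetch() is the current DBI name for the fetch() function mentioned above.

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for the MySQL server.
# With RMySQL, the connection would instead look like:
#   con <- dbConnect(RMySQL::MySQL(), user = "...", password = "...",
#                    dbname = "sakila", host = "localhost")
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table so that the query has something to return.
dbWriteTable(con, "actor",
             data.frame(actor_id = 1:3,
                        first_name = c("ANN", "BOB", "CAT")))

# Send the query, then pull the result set into R.
res <- dbSendQuery(con, "SELECT * FROM actor WHERE actor_id <= 2")
actors <- dbFetch(res)   # the result set arrives as a data frame
dbClearResult(res)       # release the result set
dbDisconnect(con)        # close the connection

print(actors)
```

Because both backends implement the DBI interface, only the dbConnect() call should need to change when pointing the same code at a MySQL server.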
#Create table
     custid          sex
 Min.   :   2068   F:440
 1st Qu.: 345667   M:560
 Median : 693403
 Mean   : 698500
 3rd Qu.:1044606
 Max.   :1414286
EXPLORING THE DATA
summary(custdata[is.na(custdata$housing.type),
c("recent.move","num.vehicles")])
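To see what that subsetting expression does, here is a small sketch on a hypothetical four-row custdata. The rows are invented; only the column names come from the command above.

```r
# Hypothetical customer data: two rows have no housing.type recorded.
custdata <- data.frame(
  housing.type = c("Rented", NA, "Homeowner free and clear", NA),
  recent.move  = c(FALSE, NA, TRUE, NA),
  num.vehicles = c(1, NA, 2, NA)
)

# Keep only the rows where housing.type is missing, and only two columns.
missing_rows <- custdata[is.na(custdata$housing.type),
                         c("recent.move", "num.vehicles")]

# If these columns are also all NA, the missing values occur together.
summary(missing_rows)
```

When the summary shows the same number of NAs in every column, the fields are probably missing for the same underlying reason.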
The test set is the data that you feed into the
resulting model, to verify that the model’s
predictions are accurate.
Creating a sample group column
A convenient way to manage random sampling
is to add a sample group column to the data
frame.
The sample group column contains a number
generated uniformly between zero and one with runif().
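# One random number per household (hh is the vector of household ids),
# merged back onto every row of hhdata.
households <- data.frame(household_id = hh, gp = runif(length(hh)))
hhdata <- merge(hhdata, households, by="household_id")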
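A sketch of household-level sampling, under stated assumptions: the hhdata rows are hypothetical, the seed is arbitrary, and the 0.5 cutoff is chosen only for illustration. Drawing one random number per household, rather than per row, keeps all rows of a household in the same set.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical data: several rows per household.
hhdata <- data.frame(household_id = c(1, 1, 2, 3, 3, 3, 4),
                     income       = c(10, 12, 30, 22, 25, 21, 40))

# One random number per household, merged back onto every row.
hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh, gp = runif(length(hh)))
hhdata <- merge(hhdata, households, by = "household_id")

# Assumed cutoff: households with gp <= 0.5 go to the test set.
test  <- subset(hhdata, gp <= 0.5)
train <- subset(hhdata, gp > 0.5)
```

Because gp is stored in the data frame, the same split can be reproduced later, or the cutoff changed, without redrawing the random numbers.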
Data provenance
You’ll also want to add a column (or columns)
to record data provenance: when your dataset
was collected, perhaps what version of your
data cleaning procedure was used on the data
before modeling, and so on.
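Recording provenance can be as simple as adding constant columns. In the sketch below, the column names collected_on and cleaning_version, the date, and the version string are all hypothetical.

```r
# Hypothetical two-row dataset.
d <- data.frame(custid = c(2068, 345667), sex = c("F", "M"))

# Record when the data was collected and which cleaning procedure was applied.
d$collected_on     <- as.Date("2024-01-15")  # hypothetical collection date
d$cleaning_version <- "v1.0"                 # hypothetical version label

print(d)
```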
DATA STRUCTURES
Structured
Semi-Structured
Quasi-Structured
Unstructured
Big data can come in multiple forms, including
structured and non-structured data such as
financial data, text files, multimedia files, and
genetic mappings.