
UNIT-1

Introduction to Data Science
Topics to be covered
• Introduction: What is Data Science?
• Big Data and Data Science hype, and getting past the hype
• Datafication
• Statistical Inference: populations and samples, statistical modelling, probability distributions
• Fitting a model; overfitting
• Basics of R: introduction, R environment setup, programming with R, basic data types
What is Data Science?
• Data science is the study of data. It involves developing methods of
recording, storing, and analyzing data to effectively extract useful
information.
• The goal of data science is to gain insights and knowledge from any
type of data, both structured and unstructured.
• Data science is closely related to the mathematical field of statistics,
which includes the collection, organization, analysis, and presentation
of data.
Cont..
• Because of the large amounts of data modern companies and
organizations maintain, data science has become an integral part
of IT. For example, a company that has petabytes of user data may
use data science to develop effective ways to store, manage, and
analyze the data. The company may use the scientific method to run
tests and extract results that can provide meaningful insights about
their users.
Data Science vs Data Mining
• Data science is often confused with data mining.
• Data mining is a subset of data science. It involves analyzing large
amounts of data (such as big data) in order to discover patterns and
other useful information.
• Data science covers the entire scope of data collection and
processing.
Big Data and Data Science
• Big Data is the discipline of processing and exploiting very large
amounts of data, while Data Science places no constraint on the amount
of data. Big Data techniques are therefore used within Data Science
when the quantity of data to be processed becomes very large.
Datafication
• Datafication is defined as a process of “taking all aspects of life and
turning them into data.” As examples, “Google’s augmented-reality
glasses datafy the gaze,” Twitter “datafies stray thoughts,” and
“LinkedIn datafies professional networks.”
• Datafication is an interesting concept and leads us to consider its
importance with respect to people’s intentions about sharing their
own data.
• We are being datafied, or rather our actions are. When we “like”
someone or something online, we are intentionally being datafied.
Cont..
• When we browse the Web, we are unintentionally, or passively,
being datafied through cookies that we might or might not be aware
of. And when we walk around in a store, or even on the street, we are
being datafied in a completely unintentional way, via sensors,
cameras, or Google glasses.
• Once we datafy things, we can transform their purpose and turn the
information into new forms of value.
• For modelers and entrepreneurs, the value means making money by
getting people to buy things, and the “value” translates into
something like increased efficiency through automation.
The Data Science Landscape
• Data science is part of the computer sciences [1]. It comprises the
disciplines of i) analytics, ii) statistics, and iii) machine learning.
Cont..
• Data science is a blend of Red-Bull-fueled hacking and espresso-
inspired statistics.
• But data science is not merely hacking, and it is not merely
statistics; data science is the civil engineering of data. Its acolytes
possess a practical knowledge of tools and materials, coupled with a
theoretical understanding of what’s possible.
• Driscoll refers to Drew Conway’s Venn diagram of data science
from 2010, shown in Figure 1-1.

Figure 1-1. Drew Conway’s Venn diagram of data science

• The skills listed in the “Rise of the Data Scientist” include:
• Statistics (traditional analysis you’re used to thinking about)
• Data munging (parsing, scraping, and formatting data)
• Visualization (graphs, tools, etc.)
Statistical Inference Definition
• Statistical inference is the process of analysing results and drawing
conclusions from data that is subject to random variation. It is also
called inferential statistics.
• Hypothesis testing and confidence intervals are applications of
statistical inference.
• Statistical inference is a method of making decisions about the
parameters of a population based on random sampling. It helps to
assess the relationship between the dependent and independent
variables.
Cont..
• The purpose of statistical inference is to estimate the uncertainty, or
sample-to-sample variation. It allows us to provide a probable range
of values for the true value of something in the population.
• The components used for making statistical inference are:

1. Sample Size
2. Variability in the sample
3. Size of the observed differences
Types of Statistical Inference
• There are different types of statistical inference that are extensively
used for drawing conclusions. They are listed below, followed by a small
R sketch of two of them:
• One sample hypothesis testing
• Confidence Interval
• Pearson Correlation
• Bi-variate regression
• Multi-variate regression
• Chi-square statistics and contingency table
• ANOVA or T-test
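
A minimal R sketch of two of these, a one-sample hypothesis test and the confidence interval it reports; the data here is simulated purely for illustration:

# hypothetical sample: 30 measurements from a population whose true mean we test
set.seed(1)
weights <- rnorm(30, mean = 52, sd = 5)

# one-sample t-test of H0: mu = 50, reported with a 95% confidence interval
result <- t.test(weights, mu = 50, conf.level = 0.95)
result$p.value    # strength of evidence against H0
result$conf.int   # probable range for the true population mean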
Importance of Statistical
Inference
• Inferential Statistics is important to examine the data properly.
• To make an accurate conclusion, proper data analysis is important to
interpret the research results.
• It is widely used to make predictions about future observations in
different fields.
• It helps us to make inferences about the data.
Cont..
• Statistical inference has a wide range of applications in different
fields, such as:
• Business Analysis
• Artificial Intelligence
• Financial Analysis
• Fraud Detection
• Machine Learning
• Share Market
• Pharmaceutical Sector
Statistical Inference Examples
• An example of statistical inference is given below.

Question: A card is drawn from a shuffled pack of cards. This trial is repeated 400 times, and the suits drawn are recorded below:

Suit                 Spade   Clubs   Hearts   Diamonds
No. of times drawn     90     100     120       90

When a card is drawn at random, what is the probability of getting
• a diamond card
• a black card
• any card except a spade

• Solution:
• By statistical inference,
• Total number of events = 400, i.e., 90 + 100 + 120 + 90 = 400
• (1) The probability of getting a diamond card:
• Number of trials in which a diamond card was drawn = 90
• Therefore, P(diamond card) = 90/400 = 0.225
• (2) The probability of getting a black card:
• Number of trials in which a black card (spade or club) showed up = 90 + 100 = 190
• Therefore, P(black card) = 190/400 = 0.475
• (3) The probability of getting any card except a spade:
• Number of trials in which a suit other than spade showed up = 90 + 100 + 120 = 310
• Therefore, P(not spade) = 310/400 = 0.775
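
The same calculation can be reproduced in R; a small sketch using the counts from the table above:

# counts of each suit over 400 draws (from the table above)
draws <- c(Spade = 90, Clubs = 100, Hearts = 120, Diamonds = 90)
total <- sum(draws)                        # 400

draws["Diamonds"] / total                  # P(diamond card) = 0.225
(draws["Spade"] + draws["Clubs"]) / total  # P(black card)   = 0.475
(total - draws["Spade"]) / total           # P(not spade)    = 0.775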
Model fitting
• Model fitting is a measure of how well a machine learning model
generalizes to data similar to the data on which it was trained. A
well-fitted model accurately approximates the output when it is given
unseen inputs.
• Fitting refers to adjusting the parameters of the model to improve
accuracy. The process involves running an algorithm on data for which
the target variable is known (“labeled” data) to produce a machine
learning model.
• The model’s outcomes are then compared to the real, observed values of
the target variable to determine its accuracy.
• A model that is well fitted produces more accurate outcomes.
• A model that is overfitted matches the data too closely.
• A model that is underfitted doesn’t match the data closely enough.
• Each machine learning algorithm has a basic set of parameters that can
be changed to improve its accuracy. During the fitting process, you run
the algorithm on the labeled data, compare the outcomes to the observed
values of the target, and use that information to adjust the parameters
so as to reduce the error, making the model better at uncovering the
patterns and relationships between the remaining features and the target.
• You repeat this process until the algorithm finds the parameter values
that produce valid, practical, applicable insights for your business
problem.
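
A minimal sketch of this fit-and-compare cycle in R, using a simple linear model on simulated labeled data (the variable names and numbers are illustrative only):

# simulated labeled data: the target y is known for every row
set.seed(42)
train <- data.frame(x = 1:100)
train$y <- 3 * train$x + rnorm(100, sd = 10)

fit <- lm(y ~ x, data = train)              # run the algorithm on the labeled data
predicted <- predict(fit, newdata = train)  # the model's outcomes
rmse <- sqrt(mean((train$y - predicted)^2)) # compare outcomes to observed values
rmse                                        # lower error means a better fit

In practice the comparison is made on held-out data the model has not seen during fitting; that distinction is exactly what the overfitting discussion below turns on.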
Why is Model Fitting Important?
• Model fitting is the essence of machine learning.
• If your model doesn’t fit your data correctly, the outcomes it
produces will not be accurate enough to be useful for practical
decision-making.
• A properly fitted model has hyperparameters that capture the
complex relationships between known variables and the target
variable, allowing it to find relevant insights or make accurate
predictions.
TYPES OF MODEL FITTING:
• Based on the characteristics of the data, there are two kinds of poorly
fitted models:
• Overfitted data model.
• Underfitted data model.
• Overfitting negatively impacts the performance of the model on new
data. It occurs when a model learns the details and noise in the
training data too closely.
• When random fluctuations or noise in the training data are picked up
and learned as concepts by the model, the model “overfits”.
• An overfit model will perform well on the training set but very poorly
on the test set. This hurts the model’s ability to generalize and make
accurate predictions for new data.
• Underfitting happens when the machine learning model can neither model
the training data sufficiently nor generalize to new data.
• An underfit machine learning model is not a suitable model; this will
be obvious because it performs poorly even on the training data.
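
A small illustrative sketch (simulated data, arbitrary polynomial degrees) of how an overly flexible model scores well on training data but worse on test data, while an overly simple one does poorly on both:

# simulated data: y is a noisy sine curve
set.seed(7)
n <- 40
dat <- data.frame(x = seq(0, 10, length.out = n))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)
idx <- sample(n, 30)
train <- dat[idx, ]
test  <- dat[-idx, ]

fit_under <- lm(y ~ x, data = train)            # straight line: too simple, underfits
fit_over  <- lm(y ~ poly(x, 15), data = train)  # degree-15 polynomial: tends to overfit

rmse <- function(fit, d) sqrt(mean((d$y - predict(fit, newdata = d))^2))
rmse(fit_under, train); rmse(fit_under, test)   # poor on both sets
rmse(fit_over, train);  rmse(fit_over, test)    # good on train, noticeably worse on test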
DIFFERENCES:
The characteristics of the data models can be summarized in the following table.

Model fit        Train Data Set               Test Data Set
Underfitting     Low accuracy / performance   Low accuracy / performance
Overfitting      High accuracy / performance  Low accuracy / performance

• Example of overfitting and underfitting:
• If the model performs well on the training data and also performs well
on the test data, it is a well-fitted model.
• Consider a student preparing from a course handbook for a GRE or AMCAT
exam: the exam syllabus is the training data, the exam is the test data,
and the student is the model.
• If the student’s confidence on the syllabus is low and the student does
not perform well in the exam, it is a case of underfitting.
• If the student’s confidence on the syllabus is high but the student
still does not perform well in the exam, it is a case of overfitting.
• Data leakage is like the exam paper being leaked: the student appears
to do well even though the score does not reflect genuine learning from
the syllabus.
• How to detect overfit models:
• To understand the accuracy of machine learning models, it’s
important to test for model fitness. K-fold cross-validation is one of
the most popular techniques for assessing the accuracy of a model.
• In k-fold cross-validation, the data is split into k equally sized
subsets, also called “folds.” One of the k folds acts as the test set,
also known as the holdout set or validation set, and the remaining folds
train the model. This process repeats until each fold has acted as the
holdout fold. After each evaluation, a score is retained, and when all
iterations have completed, the scores are averaged to assess the
performance of the overall model.
For example, with k = 5 the dataset is divided into five sub-groups, and each sub-group takes one turn as the holdout fold while the other four train the model.
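
A hand-rolled 5-fold cross-validation sketch in base R (simulated data; packages such as caret automate this, but the loop shows the idea):

set.seed(1)
dat <- data.frame(x = runif(100))
dat$y <- 2 * dat$x + rnorm(100, sd = 0.2)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))  # assign each row to one of k folds
scores <- numeric(k)

for (i in 1:k) {
  train <- dat[folds != i, ]   # k-1 folds train the model
  test  <- dat[folds == i, ]   # the remaining fold is the holdout (validation) set
  fit <- lm(y ~ x, data = train)
  scores[i] <- sqrt(mean((test$y - predict(fit, test))^2))  # RMSE on the holdout fold
}
mean(scores)   # average of the k scores estimates overall model performance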
Reasons for Overfitting are as
follows:
• 1. High variance and low bias
• 2. The model is too complex
• 3. The training data set is too small
Techniques to reduce overfitting:
• 1. Increase the training data.
• 2. Reduce model complexity.
• 3. Early stopping during the training phase (keep an eye on the loss
during training; as soon as the loss starts to increase, stop training).
• 4. Ridge regularization and Lasso regularization.
• 5. Use dropout for neural networks to tackle overfitting.
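
Ridge and lasso regularization (technique 4) can be tried in R with the glmnet package; a sketch assuming glmnet has been installed from CRAN, using simulated data:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)  # 100 observations, 20 predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)     # only two predictors really matter

ridge <- cv.glmnet(x, y, alpha = 0)   # alpha = 0 gives the ridge penalty
lasso <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso penalty
coef(lasso, s = "lambda.min")         # coefficients at the best cross-validated lambda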
Introduction to R
WHY R?
It's free!
It runs on a variety of platforms including Windows, Unix and
MacOS.
It provides an unparalleled platform for programming new
statistical methods in an easy and straightforward manner.
It contains advanced statistical routines not yet available in
other packages.
It has state-of-the-art graphics capabilities.

36
How to download?
• Google it using R or CRAN
(Comprehensive R Archive Network)
• http://www.r-project.org

37
R Overview
R is a comprehensive statistical and graphical
programming language and is a dialect of the S
language:
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
R: initially written by Ross Ihaka and Robert
Gentleman at Dep. of Statistics of U of Auckland,
New Zealand during 1990s.
Since 1997: international “R-core” team of 15
people with access to common CVS archive.

38
R Overview
You can enter commands one at a time at the command prompt
(>) or run a set of commands from a source file.
There is a wide variety of data types, including vectors
(numerical, character, logical), matrices, data frames, and
lists.
To quit R, use
>q()

39
R Overview
Most functionality is provided through built-in and user-created
functions and all data objects are kept in memory during an
interactive session.
Basic functions are available by default. Other functions are
contained in packages that can be attached to a current
session as needed

40
R Overview
A key skill to using R effectively is learning how to use the built-in
help system. Other sections describe the working environment, inputting
programs and outputting results, and installing new functionality
through packages.

A fundamental design feature of R is that the output from most
functions can be used as input to other functions. This is described in
reusing results.

41
R Interface
Start the R system; the main window (RGui) with a
sub-window (R Console) will appear.
In the `Console' window the cursor is waiting for you
to type in some R commands.

42
Your First R Session

43
R Introduction
• Results of calculations can be stored in objects using the
assignment operators:
• An arrow (<-) formed by a less-than character and a hyphen, without
a space in between.
• The equal character (=).
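For example:
> x <- 10   # assignment with the arrow operator
> y = 5     # assignment with the equal sign
> x + y     # stored objects can be used in further calculations
[1] 15
> x         # typing the name of an object prints it
[1] 10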

44
R Introduction
• These objects can then be used in other calculations. To
print the object just enter the name of the object. There
are some restrictions when giving an object a name:
• Object names cannot contain `strange' symbols like !, +, -, #.
• A dot (.) and an underscore (_) are allowed, as is a name starting
with a dot.
• Object names can contain a number but cannot start with a
number.
• R is case sensitive, X and x are two different objects, as well as
temp and temP.

45
An example
> # An example
> x <- c(1:10)
> x[(x > 8) | (x < 5)]
[1] 1 2 3 4 9 10
> # How it works
> x <- c(1:10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x > 8
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> x < 5
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> x > 8 | x < 5
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
> x[c(T,T,T,T,F,F,F,F,T,T)]
[1] 1 2 3 4 9 10

46
R Introduction
• To list the objects that you have in your current R session use
the function ls or the function objects.
> ls()
[1] "x" "y"
• So to run the function ls we need to enter the name followed by an
opening ( and a closing ). Entering only ls will just print the object:
you will see the underlying R code of the function ls. Most functions
in R accept certain arguments. For example, one of the arguments of the
function ls is pattern. To list all objects whose names contain the
letter x:
> x2 = 9
> y2 = 10
> ls(pattern="x")
[1] "x" "x2"

47
R Introduction
• If you assign a value to an object that already exists then the
contents of the object will be overwritten with the new value
(without a warning!). Use the function rm to remove one or
more objects from your session.
> rm(x, x2)

• Let’s create two small vectors with data and make a scatterplot.
z2 <- c(1,2,3,4,5,6)
z3 <- c(6,8,3,5,7,1)
plot(z2,z3)
title("My first scatterplot")

48
R Warning !
R is a case sensitive language.
FOO, Foo, and foo are three different objects

49
R Introduction
> x = sin(9)/75
> y = log(x) + x^2
> x
[1] 0.005494913
> y
[1] -5.203902
> m <- matrix(c(1,2,4,1), ncol=2)
> m
     [,1] [,2]
[1,]    1    4
[2,]    2    1
> solve(m)
           [,1]       [,2]
[1,] -0.1428571  0.5714286
[2,]  0.2857143 -0.1428571

50
R Workspace
Objects that you create during an R session
are held in memory; the collection of
objects that you currently have is called
the workspace. The workspace is not
saved on disk unless you tell R to do so.
This means that your objects are lost
when you close R without saving them,
or worse, when R or your system
crashes during a session.

51
R Workspace
When you close the RGui or the R console
window, the system will ask if you want
to save the workspace image. If you
select to save the workspace image then
all the objects in your current R session
are saved in a file .RData. This is a binary
file located in the working directory of R,
which is by default the installation
directory of R.

52
R Workspace
• During your R session you can also explicitly save the
workspace image. Go to the `File‘ menu and then
select `Save Workspace...', or use the save.image
function.
## save to the current working directory
save.image()
## just checking what the current working directory is
getwd()
## save to a specific file and location
save.image("C:\\Program Files\\R\\R-2.5.0\\bin\\.RData")

53
R Workspace
If you have saved a workspace image and you start R the
next time, it will restore the workspace. So all your
previously saved objects are available again. You can
also explicitly load a saved workspace, which could be
the workspace image of someone else. Go to the `File'
menu and select `Load workspace...'.

54
R Workspace
Commands are entered interactively at the R user
prompt. Up and down arrow keys scroll through your
command history.
You will probably want to keep different projects in
different physical directories.

55
R Workspace
R gets confused if you use a path in your code like
c:\mydocuments\myfile.txt
This is because R sees "\" as an escape character.
Instead, use
c:\\mydocuments\\myfile.txt
or
c:/mydocuments/myfile.txt

56
R Workspace
getwd() # print the current working directory

ls() # list the objects in the current workspace


setwd(mydirectory) # change to mydirectory
setwd("c:/docs/mydir")

57
R Workspace
#view and set options for the session
help(options) # learn about available options
options() # view current option settings
options(digits=3) # number of digits to print on output
# work with your previous commands
history() # display last 25 commands
history(max.show=Inf) # display all previous commands

58
R Workspace
# save your command history
savehistory(file="myfile") # default is ".Rhistory"
# recall your command history
loadhistory(file="myfile") # default is ".Rhistory"

59
R Help
Once R is installed, there is a comprehensive built-in help
system. At the program's command prompt you can
use any of the following:
help.start() # general help
help(foo) # help about function foo
?foo # same thing
apropos("foo") # list all functions containing the string foo
example(foo) # show an example of function foo

60
R Help
# search for foo in help manuals and archived mailing lists
RSiteSearch("foo")
# get vignettes on using installed packages
vignette() # show available vignettes
vignette("foo") # show specific vignette

61
R Datasets
R comes with a number of sample datasets that you can
experiment with. Type
> data( )
to see the available datasets. The results will depend on
which packages you have loaded. Type
help(datasetname)
for details on a sample dataset.

62
R Packages
• One of the strengths of R is that the system can easily
be extended. The system allows you to write new
functions and package those functions in a so called `R
package' (or `R library'). The R package may also
contain other R objects, for example data sets or
documentation. There is a lively R user community
and many R packages have been written and made
available on CRAN for other users. To give just a few
examples, there are packages for portfolio optimization,
drawing maps, exporting objects to HTML, time series
analysis, spatial statistics, and the list goes on and on.

63
R Packages
• When you download R, a number of packages (around 30)
are downloaded as well. To use a function in an R package,
that package has to be attached to the system. When you
start R, not all of the downloaded packages are attached;
only seven packages are attached to the system by default.
You can use the function search to see a list of packages
that are currently attached to the system; this list is
also called the search path.
> search()
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:datasets" "package:utils"
[7] "package:methods" "Autoloads" "package:base"

64
R Packages
To attach another package to the system you can use the menu or
the library function. Via the menu:

Select the `Packages' menu and select `Load package...'; a list of
available packages on your system will be displayed. Select one and
click `OK', and the package is now attached to your current R session.
Via the library function:
> library(MASS)
> shoes
$A
[1] 13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3
$B
[1] 14.0 8.8 11.2 14.2 11.8 6.4 9.8 11.3 9.3 13.6

65
R Packages
• The function library can also be used to list all the available
libraries on your system with a short description. Run the
function without any arguments
> library()
Packages in library 'C:/PROGRA~1/R/R-25~1.0/library':
base The R Base Package
boot Bootstrap R (S-Plus) Functions (Canty)
class Functions for Classification
cluster Cluster Analysis Extended Rousseeuw et al.
codetools Code Analysis Tools for R
datasets The R Datasets Package
DBI R Database Interface
foreign Read Data Stored by Minitab, S, SAS,
SPSS, Stata, Systat, dBase, ...
graphics The R Graphics Package

66
R Packages
install <- function() {
  install.packages(c("moments", "graphics", "Rcmdr", "hexbin"),
                   repos = "http://lib.stat.cmu.edu/R/CRAN")
}
install()

67
R Conflicting objects
• It is not recommended to do, but R allows the user to give an
object a name that already exists. If you are not sure if a name
already exists, just enter the name in the R console and see if R
can find it. R will look for the object in all the libraries (packages)
that are currently attached to the R system. R will not warn you
when you use an existing name.
> mean = 10
> mean
[1] 10
• The object mean already exists in the base package, but is now
masked by your object mean. To get a list of all masked objects
use the function conflicts.
> conflicts()
[1] "body<-" "mean"

68
R Conflicting objects
The object mean already exists in the base package, but is now
masked by your object mean. To get a list of all masked objects
use the function conflicts.
> conflicts()
[1] "body<-" "mean"

You can safely remove the object mean with the function
rm() without risking deletion of the mean function.

Calling rm() removes only objects in your working environment by
default.

69
Source Codes
You can have input come from a script file (a file containing R
commands) and direct output to a variety of destinations.
Input
The source( ) function runs a script in the current session. If the
filename does not include a path, the file is taken from the
current working directory.
# input a script
source("myfile")

70
Output
Output
The sink( ) function defines the direction of the output.
# direct output to a file
sink("myfile", append=FALSE, split=FALSE)

# return output to the terminal
sink()

71
Output
The append option controls whether output overwrites or adds to a
file.
The split option determines if output is also sent to the screen as
well as the output file.
Here are some examples of the sink() function.
# output appended to myfile.txt in the current working directory.
# split=TRUE also sends the output to the terminal.
sink("myfile.txt", append=TRUE, split=TRUE)

72
Data Types
R has a wide variety of data types including
scalars, vectors (numerical, character,
logical), matrices, dataframes, and lists.

73
Vectors
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)
#logical vector
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector

74
Matrices
All columns in a matrix must have the same mode (numeric, character, etc.)
and the same length.
The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c,
byrow=FALSE,dimnames=list(char_vector_rownames,
char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows.
byrow=FALSE indicates that the matrix should be filled by columns (the
default). dimnames provides optional labels for the columns and rows.

75
Matrices
# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames,
cnames))
# Identify rows, columns or elements using subscripts.
y[,4] # 4th column of matrix y
y[3,] # 3rd row of matrix y
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3

76
Arrays
Arrays are similar to matrices but can have more than two
dimensions. See help(array) for details.
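A quick illustration (the numbers are arbitrary):
z <- array(1:12, dim = c(2, 3, 2))  # a 2 x 3 x 2 array built from a vector
z[1, 2, 2]  # element in row 1, column 2 of the second "layer": 9
dim(z)      # 2 3 2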

77
Data frames
A data frame is more general than a matrix, in that different
columns can have different modes (numeric, character, factor,
etc.).
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed") #variable names

78
Data frames
There are a variety of ways to identify the elements of a data frame.
myframe[3:5] # columns 3,4,5 of the data frame
myframe[c("ID","Age")] # columns ID and Age from the data frame
myframe$X1 # the variable X1 in the data frame

79
Lists
An ordered collection of objects (components). A list allows you to
gather a variety of (possibly unrelated) objects under one name.
# example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)

# example of a list containing two lists
v <- c(list1, list2)

80
Lists
Identify elements of a list using the [[]] convention.
mylist[[2]] # 2nd component of the list

81
Factors
Tell R that a variable is nominal by making it a factor. The factor stores the nominal
values as a vector of integers in the range [ 1... k ] (where k is the number of
unique values in the nominal variable), and an internal vector of character
strings (the original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender internally as 20 2s and 30 1s and associates
# 1=female, 2=male (levels are ordered alphabetically)
# R now treats gender as a nominal variable
summary(gender)

82
Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
ls() # list current objects
rm(object) # delete an object
newobject <- edit(object) # edit a copy and save it as newobject
fix(object) # edit in place

83
Importing Data
Importing data into R is fairly simple.
For Stata and Systat, use the foreign package.
For SPSS and SAS I would recommend the Hmisc package for ease and
functionality.
See the Quick-R section on packages for information on obtaining and
installing these packages.
Examples of importing data are provided below.

84
From A Comma
Delimited Text File
# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on mswindows systems

mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",",
   row.names="id")

85
From Excel
The best way to read an Excel file is to export it to a comma delimited file and import it using the method above.
On windows systems you can use the RODBC package to access Excel files. The first row should contain variable/column names.
# first row contains variable names
# we will read in workSheet mysheet
library(RODBC)
channel <- odbcConnectExcel("c:/myexcel.xls")
mydata <- sqlFetch(channel, "mysheet")
odbcClose(channel)

86
From SAS
• # save SAS dataset in transport format
libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;
• # in R, read the transport file using the foreign package
library(foreign)
mydata <- read.xport("c:/mydata.xpt")

87
Keyboard Input
Usually you will obtain a dataframe by importing it from SAS, SPSS,
Excel, Stata, a database, or an ASCII file. To create it interactively,
you can do something like the following.
# create a dataframe from scratch
age <- c(25, 30, 56)
gender <- c("male", "female", "male")
weight <- c(160, 110, 220)
mydata <- data.frame(age,gender,weight)

88
Keyboard Input
You can also use R's built in spreadsheet to enter the data
interactively, as in the following example.
# enter data using editor
mydata <- data.frame(age=numeric(0), gender=character(0),
weight=numeric(0))
mydata <- edit(mydata)
# note that without the assignment in the line above,
# the edits are not saved!

89
Exporting Data
There are numerous methods for exporting R objects into other
formats. For SPSS, SAS and Stata, you will need to load the foreign
package. For Excel, you will need the xlsReadWrite package.

90
Exporting Data
To A Tab Delimited Text File
write.table(mydata, "c:/mydata.txt", sep="\t")
To an Excel Spreadsheet
library(xlsReadWrite)
write.xls(mydata, "c:/mydata.xls")
To SAS
library(foreign)
write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas",
package="SAS")

91
Viewing Data
There are a number of functions for listing the contents of an object or dataset.
# list objects in the working environment
ls()
# list the variables in mydata
names(mydata)
# list the structure of mydata
str(mydata)
# list levels of factor v1 in mydata
levels(mydata$v1)
# dimensions of an object
dim(object)

92
Viewing Data
There are a number of functions for listing the contents of an object
or dataset.
# class of an object (numeric, matrix, dataframe, etc)
class(object)
# print mydata
mydata
# print first 10 rows of mydata
head(mydata, n=10)
# print last 5 rows of mydata
tail(mydata, n=5)

93
Variable Labels
R's ability to handle variable labels is somewhat unsatisfying.
If you use the Hmisc package, you can take advantage of some labeling
features.
library(Hmisc)
label(mydata$myvar) <- "Variable label for variable myvar"
describe(mydata)

94
Variable Labels
Unfortunately the label is only in effect for functions provided by the
Hmisc package, such as describe(). Your other option is to use the
variable label as the variable name and then refer to the variable by
position index.
names(mydata)[3] <- "This is the label for variable 3"
mydata[3] # list the variable

95
Value Labels
To understand value labels in R, you need to understand the data structure
factor.
You can use the factor function to create your own value labels.
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue,3=green
mydata$v1 <- factor(mydata$v1,
levels = c(1,2,3),
labels = c("red", "blue", "green"))
# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High

96
Value Labels
mydata$v1 <- ordered(mydata$y,
levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))
Use the factor() function for nominal data and the ordered() function
for ordinal data. R statistical and graphic functions will then treat
the data appropriately.
Note: factor and ordered are used the same way, with the same
arguments. The former creates factors and the latter creates
ordered factors.

97
Missing Data
In R, missing values are represented by the symbol NA (not available) .
Impossible values (e.g., dividing by zero) are represented by the
symbol NaN (not a number). Unlike SAS, R uses the same symbol
for character and numeric data.
Testing for Missing Values
is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

98
Missing Data
Recoding Values to Missing
# recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata[mydata$v1==99,"v1"] <- NA
Excluding Missing Values from Analyses
Arithmetic functions on missing values yield missing values.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2

99
Missing Data
The function complete.cases() returns a logical vector indicating which
cases are complete.
# list rows of data that have missing values
mydata[!complete.cases(mydata),]
The function na.omit() returns the object with listwise deletion of
missing values.
# create new dataset without missing data
newdata <- na.omit(mydata)

100
Missing Data
Advanced Handling of Missing Data
Most modeling functions in R offer options for dealing with missing
values. You can go beyond pairwise or listwise deletion of missing
values through methods such as multiple imputation. Good
implementations that can be accessed through R include Amelia
II, mice, and mitools.

101
Date Values
Dates are represented as the number of days since 1970-01-01, with
negative values for earlier dates.
# use as.Date( ) to convert strings to dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
# number of days between 6/22/07 and 2/13/04
days <- mydates[1] - mydates[2]
Sys.Date( ) returns today's date.
date() returns the current date and time (as a character string).

102
Date Values
The following symbols can be used with the format( ) function to
print dates.

Symbol   Meaning                     Example
%d       day as a number (01-31)     01-31
%a       abbreviated weekday         Mon
%A       unabbreviated weekday       Monday
%m       month (01-12)               01-12
%b       abbreviated month           Jan
%B       unabbreviated month         January
%y       2-digit year                07
%Y       4-digit year                2007

103
Date Values
# print today's date
today <- Sys.Date()
format(today, format="%B %d %Y")
"June 20 2007"

104
THANK YOU
