SEU - DS510 - Module 4 Input-Output and Data Structure
Module 4
DS510
Statistics for Data Science
Instructor Name
Module 4 Learning Outcomes
1. Manage input for implementing
statistical projects.
2. Produce formatted output for reporting
project numbers.
3. Examine data structures such as matrices, lists,
factors, data frames, and tibbles.
DS510 Statistics for Data Science Long, J.D., & Teetor, P. (2019)
Input-Output and Data Structure
R Cookbook: Proven Recipes for Data
Analysis, Statistics, and Graphics
Input and Output
Redirecting Output to a File
► You can redirect the output of the cat function by using
its file argument:
cat("The answer is", answer, "\n", file = "filename.txt")
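A minimal sketch of the redirect in practice. The value of answer and the temporary file are assumptions made for illustration; append = TRUE is the cat argument that adds to an existing file instead of overwriting it.

```r
# Illustrative value and output location (both are assumptions)
answer <- 42
out_file <- tempfile(fileext = ".txt")

# First call creates (or overwrites) the file
cat("The answer is", answer, "\n", file = out_file)

# append = TRUE adds to the file instead of replacing its contents
cat("Done\n", file = out_file, append = TRUE)

# Read the file back to confirm what was written
written <- readLines(out_file)
print(written)
```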
“Cannot Open File”
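► On Windows, R may report "cannot open file" for a path such as C:\data\sample.txt.
Inside an R string the backslash is an escape character, so the backslashes in the
filepath are causing trouble. You can solve this problem in one of two ways: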
• Change the backslashes to forward slashes: "C:/data/sample.txt".
• Double the backslashes: "C:\\data\\sample.txt".
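The two fixes can be compared directly. This sketch uses the slide's example path (no such file needs to exist) and also shows the raw-string syntax available in R 4.0.0 and later, which is an addition not mentioned on the slide.

```r
# Three spellings of the same Windows-style path
p_forward <- "C:/data/sample.txt"
p_doubled <- "C:\\data\\sample.txt"
p_raw     <- r"(C:\data\sample.txt)"   # raw string, requires R >= 4.0.0

# The doubled form and the raw form are the same string
same <- identical(p_doubled, p_raw)

# Each "\\" in the source code is one backslash character in the string
n_chars <- nchar(p_doubled)

# Replacing backslashes with forward slashes gives the first form
converted <- gsub("\\", "/", p_doubled, fixed = TRUE)
```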
Reading Fixed-Width Records
► Use the read_fwf function from the readr package (which is part of the tidyverse). The
main arguments are the filename and the description of the fields:
library(tidyverse)
records <- read_fwf("myfile.txt",
                    fwf_cols(col1 = 10,
                             col2 = 7))
records
► This form uses the fwf_cols helper to pass column names and widths to
read_fwf. readr also provides fwf_widths and fwf_positions for describing the
columns in other ways.
Reading Tabular Data Files
► Use the read_table2 function from the readr package, which returns a
tibble (in readr 2.0 and later, read_table2 is deprecated in favor of read_table):
library(tidyverse)
Reading Tabular Data Files
► (continued)
tab1
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939
Reading from CSV Files
► The read_csv function from the readr package is a fast (and, according to the
documentation, fun) way to read CSV files. If your CSV file has a header line,
use this:
library(tidyverse)
tbl <- read_csv("datafile.csv")
► If your CSV file does not contain a header line, set the col_names option
to FALSE:
tbl <- read_csv("datafile.csv", col_names = FALSE)
Writing to CSV Files
► The write_csv function from the tidyverse readr package can write a
CSV file:
library(tidyverse)
Reading Tabular or CSV Data from the Web
► Use the read_csv or read_table2 functions from the readr package, using a
URL instead of a filename. The functions will read directly from the
remote server:
library(tidyverse)
berkley <- read_csv('http://bit.ly/barkley18', comment = '#')
#> Parsed with column specification:
#> cols(
#>   Name = col_character(),
#>   Location = col_character(),
#>   Time = col_time(format = "")
#> )
► You can also open a connection using the URL and then read from the
connection, which may be preferable for complicated files.
Reading Data from Excel
► The openxlsx package makes reading Excel files easy.
library(openxlsx)
df1 <- read.xlsx(xlsxFile = "file.xlsx",
sheet = 'sheet_name')
Writing a Data Frame to Excel
► The openxlsx package makes writing to Excel files relatively easy.
While there are lots of options in openxlsx, a typical pattern is to
specify an Excel filename and a sheet name:
library(openxlsx)
write.xlsx(df,
sheetName = "some_sheet",
file = "out_file.xlsx")
Reading Data from a SAS file
Reading Data from HTML Tables
► Use the read_html and html_table functions in the rvest package. To read
all tables on the page, do the following:
library(rvest)
library(tidyverse)
all_tables <-
read_html("URL") %>%
html_table(fill = TRUE, header = TRUE)
► Note that rvest is installed when you run install.packages('tidyverse'),
although it is not a core tidyverse package. So you must explicitly load
the package.
Reading Files with a Complex Structure
• Use the readLines function to read individual lines; then
process them as strings to extract data items.
• Alternatively, use the scan function to read individual
tokens and use the argument what to describe the stream of
tokens in your file. The function can convert tokens into
data and then assemble the data into records.
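Both approaches can be sketched on a small file. The file contents below (a name followed by two years on each line) are a hypothetical example written to a temporary file so the code is self-contained.

```r
# Hypothetical input: one record per line, fields separated by spaces
path <- tempfile(fileext = ".txt")
writeLines(c("Fisher 1890 1962", "Pearson 1857 1936"), path)

# Option 1: read whole lines, then pull the fields out as strings
lines  <- readLines(path)
fields <- strsplit(lines, " ", fixed = TRUE)
births <- as.numeric(sapply(fields, `[`, 2))   # second token of each line

# Option 2: let scan() parse tokens; `what` describes one record
recs <- scan(path,
             what = list(name = "", birth = 0, death = 0),
             quiet = TRUE)
print(recs$name)
```

readLines leaves all parsing to you, which suits irregular files; scan is more compact when every record has the same sequence of token types.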
Reading from MySQL Databases
1. Install the RMySQL package on your computer and add a
user and password.
2. Open a database connection using
the DBI::dbConnect function.
3. Use dbGetQuery to initiate a SELECT and return the result
sets.
4. Use dbDisconnect to terminate the database connection
when you are done.
Accessing a Database with dbplyr
► In addition to being a grammar of data manipulation, the
tidyverse package dplyr can, in connection with
the dbplyr package, turn dplyr commands into SQL for you.
► Let’s set up an example database using RSQLite. Then we’ll
connect to it and use dplyr and the dbplyr backend to extract
data.
► Set up the example table by loading the msleep example data into
an in-memory SQLite database:
library(tidyverse)  # provides copy_to and tbl (dplyr) and the msleep data (ggplot2)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
sleep_db <- copy_to(con, msleep, "sleep")
► Now that we have a table in our database, we can create a
reference to it from R:
sleep_table <- tbl(con, "sleep")
Accessing a Database with dbplyr
► The sleep_table object is a type of pointer or alias to the table
on the database. However, dplyr will treat it like a regular
tidyverse tibble or data frame, so you can operate on it
using dplyr and other R commands. Let’s select all animals
from the data who sleep less than three hours.
little_sleep <- sleep_table %>%
select(name, genus, order, sleep_total) %>%
filter(sleep_total < 3)
Accessing a Database with dbplyr
► The dbplyr backend does not go fetch the data when we do the
preceding commands. But it does build the query and get ready. To
see the query built by dplyr, you can use show_query:
show_query(little_sleep)
#> <SQL>
#> SELECT *
#> FROM (SELECT `name`, `genus`, `order`, `sleep_total`
#> FROM `sleep`)
#> WHERE (`sleep_total` < 3.0)
Accessing a Database with dbplyr
► To bring the data back to your local machine, use collect:
local_little_sleep <- collect(little_sleep)
local_little_sleep
#> # A tibble: 3 x 4
#>   name        genus         order          sleep_total
#>   <chr>       <chr>         <chr>                <dbl>
#> 1 Horse       Equus         Perissodactyla         2.9
#> 2 Giraffe     Giraffa       Artiodactyla           1.9
#> 3 Pilot whale Globicephalus Cetacea                2.7
Saving and Transporting Objects
► Write the objects to a file using the save function:
save(tbl, t, file = "myData.RData")
► Read them back using the load function, either on
your computer or on any platform that supports R:
load("myData.RData")
► The save function writes binary data. To save in an
ASCII format, use dput or dump instead:
dput(tbl, file = "myData.txt")
dump("tbl", file = "myData.txt") # Note quotes around variable name
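A round-trip sketch of the three functions. The objects tbl and t are small stand-ins invented for this example, and temporary files replace the filenames on the slide. Reading back with dget and source is shown as the usual counterpart of dput and dump.

```r
# Stand-in objects (assumptions for illustration)
tbl <- data.frame(x = 1:3, y = c("a", "b", "c"))
t   <- c(10, 20)

# Binary round trip: save() then load() into a separate environment
rdata <- tempfile(fileext = ".RData")
save(tbl, t, file = rdata)
env <- new.env()
load(rdata, envir = env)        # objects come back under their original names

# Text round trip with dput(): read back with dget()
txt1 <- tempfile(fileext = ".txt")
dput(tbl, file = txt1)
tbl_dget <- dget(txt1)

# Text round trip with dump(): read back with source()
txt2 <- tempfile(fileext = ".R")
dump("tbl", file = txt2)
env2 <- new.env()
source(txt2, local = env2)      # recreates `tbl` inside env2
```

Note the difference: dput writes only the value, so you choose the name when reading it back, whereas dump writes an assignment that restores the original name.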
Vectors
► Vectors are homogeneous: all elements of a vector must have the same type or, in R
terminology, the same mode.
► Vectors can be indexed by position, so v[2] refers to the second element of v.
► Vector elements can have names, and a named element can be selected by name.
For a vector v whose element named "Larry" is 20:
v[["Larry"]]
#> [1] 20
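A short sketch of these vector properties. The vector v and its names are assumptions chosen so that the element named "Larry" is 20.

```r
# A named numeric vector (values and names are illustrative)
v <- c(10, 20, 30)
names(v) <- c("Moe", "Larry", "Curly")

second   <- v[2]           # index by position; the name is kept
by_name  <- v[["Larry"]]   # index by name; returns the bare value
v_mode   <- mode(v)        # one mode for the whole vector

# Mixing types does not fail: R coerces every element to a common mode
mixed <- c(1, "a", TRUE)
mixed_mode <- mode(mixed)
print(mixed)
```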
Lists
► List elements can have names: both lst[["Moe"]] and lst$Moe refer to the element
named “Moe.”
► Since lists are heterogeneous and since their elements can be retrieved by name, a
list is like a dictionary or hash or lookup table in other programming languages.
► What’s surprising (and cool) is that in R, unlike most of those other programming
languages, lists can also be indexed by position.
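The points above can be tried on a small list. The element names and values here are made up for illustration.

```r
# A heterogeneous, named list (contents are illustrative)
lst <- list(Moe = 3.14, Larry = "stooge", Curly = TRUE)

a <- lst[["Moe"]]      # look up by name, like a dictionary
b <- lst$Moe           # same element, shorter syntax
c2 <- lst[[2]]         # look up by position
sub <- lst[c(1, 3)]    # single brackets return a (shorter) list

# Each element keeps its own mode
modes <- sapply(lst, mode)
print(modes)
```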
Mode: Physical Type
► In R, every object has a mode, which indicates how it is stored in memory: as a number,
as a character string, as a list of pointers to other objects, as a function, and so forth.
► The mode function gives us this information:
mode(3.1415) # Mode of a number
#> [1] "numeric"
mode(c(2.7182, 3.1415)) # Mode of a vector of numbers
#> [1] "numeric"
mode("Moe") # Mode of a character string
#> [1] "character"
mode(list("Moe", "Larry", "Curly")) # Mode of a list
#> [1] "list"
► A critical difference between a vector and a list can be summed up this way:
• In a vector, all elements must have the same mode.
• In a list, the elements can have different modes.
Class: Abstract Type
► In R, every object also has a class, which defines its abstract type. The
terminology is borrowed from object-oriented programming. A single number
could represent many different things: a distance, a point in time, or a weight, for
example. All those objects have a mode of “numeric” because they are stored as a
number, but they could have different classes to indicate their interpretation.
► For example, a Date object consists of a single number:
d <- as.Date("2010-03-15")
mode(d)
#> [1] "numeric"
length(d)
#> [1] 1
Class: Abstract Type
► But it has a class of Date, telling us how to interpret that number—namely, as the
number of days since January 1, 1970:
class(d)
#> [1] "Date"
► R uses an object’s class to decide how to process the object. For example, the
generic function print has specialized versions (called methods) for printing
objects according to their class: data.frame, Date, lm, and so forth. When you
print an object, R calls the appropriate print function according to the object’s
class.
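A sketch of how mode and class differ, and of method dispatch on class. The class name weight_kg and its print method are hypothetical and exist only for this example.

```r
d <- as.Date("2010-03-15")

# Stripping the class exposes the stored number: days since 1970-01-01
days <- unclass(d)

# A hypothetical class layered on a plain number
w <- structure(72.5, class = "weight_kg")

# A print method for that class; print(w) dispatches to it
print.weight_kg <- function(x, ...) {
  cat(unclass(x), "kg\n")
  invisible(x)
}

print(d)   # uses the Date method
print(w)   # uses print.weight_kg
```

The stored value of w is still an ordinary number (its mode is "numeric"); only the class attribute changes how R chooses to print it.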
Scalars
► The quirky thing about scalars is their relationship to vectors. In some software,
scalars and vectors are two different things. In R, they are the same thing: a scalar is
simply a vector that contains exactly one element. The R Cookbook often uses the
term scalar, but that’s just shorthand for “vector with one element.”
► Consider the built-in constant pi. It is a scalar:
pi
#> [1] 3.14
► Since a scalar is a one-element vector, you can use vector functions on pi:
length(pi)
#> [1] 1
► You can index it. The first (and only) element is π, of course:
pi[1]
#> [1] 3.14
► If you ask for the second element, there is none:
pi[2]
#> [1] NA
Matrices
► In R, a matrix is just a vector that has dimensions. It may seem strange at first,
but you can transform a vector into a matrix simply by giving it dimensions.
► A vector has an attribute called dim, which is initially NULL, as shown here:
A <- 1:6
dim(A)
#> NULL
print(A)
#> [1] 1 2 3 4 5 6
► We give dimensions to the vector when we set its dim attribute. Watch what
happens when we set our vector dimensions to 2 × 3 and print it:
dim(A) <- c(2, 3)
print(A)
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
Matrices
► Voilà! The vector was reshaped into a 2 × 3 matrix.
► A matrix can be created from a list, too. Like a vector, a list has a dim attribute,
which is initially NULL:
B <- list(1, 2, 3, 4, 5, 6)
dim(B)
#> NULL
► If we set the dim attribute, it gives the list a shape:
dim(B) <- c(2, 3)
print(B)
#>      [,1] [,2] [,3]
#> [1,] 1    3    5
#> [2,] 2    4    6
► Voilà! We have turned this list into a 2 × 3 matrix.
Arrays
► Note that R prints one “slice” of the structure at a time, since it’s not possible to
print a three-dimensional structure on a two-dimensional medium.
► It strikes us as very odd that we can turn a list into a matrix just by giving the list
a dim attribute. But wait: it gets stranger.
► Recall that a list can be heterogeneous (mixed modes). We can start with a
heterogeneous list, give it dimensions, and thus create a heterogeneous matrix. This
code snippet creates a matrix that is a mix of numeric and character data:
C <- list(1, 2, 3, "X", "Y", "Z")
dim(C) <- c(2, 3)
print(C)
#>      [,1] [,2] [,3]
#> [1,] 1    3    "Y"
#> [2,] 2    "X"  "Z"
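The slice-by-slice printing of a three-dimensional array can be seen with a small example. The vector D and the 2 × 3 × 2 shape are assumptions chosen for illustration; the technique is the same dim assignment used for matrices.

```r
# Give a 12-element vector three dimensions: 2 rows, 3 columns, 2 slices
D <- 1:12
dim(D) <- c(2, 3, 2)

# R prints the array one slice at a time (", , 1" and then ", , 2")
print(D)

# Elements are filled column by column, slice by slice
last_of_slice1  <- D[2, 3, 1]
first_of_slice2 <- D[1, 1, 2]
```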
Factors
► A factor looks like a character vector, but it has special properties. R keeps track of
the unique values in a vector, and each unique value is called a level of the
associated factor. R uses a compact representation for factors, which makes them
efficient for storage in data frames. In other programming languages, a factor
would be represented by a vector of enumerated values.
► There are two key uses for factors:
• Categorical variables: A factor can represent a categorical variable. Categorical
variables are used in contingency tables, linear regression, analysis of variance
(ANOVA), logistic regression, and many other areas.
• Grouping: This is a technique for labeling or tagging your data items according to
their group.
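A sketch of both uses on made-up data: the factor's levels and compact integer codes, a contingency count, and grouping a parallel vector of values with tapply.

```r
# Illustrative categorical data
wday <- c("Wed", "Thu", "Mon", "Wed", "Thu", "Thu")
f <- factor(wday)

lv    <- levels(f)       # the unique values, sorted
codes <- as.integer(f)   # compact storage: one integer code per element
counts <- table(f)       # categorical use: counts per level

# Grouping use: summarize a parallel vector by factor level
values <- c(10, 20, 30, 40, 50, 60)
group_means <- tapply(values, f, mean)
print(group_means)
```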
Data Frames
► A data frame is a powerful and flexible structure. Most serious R applications
involve data frames. A data frame is intended to mimic a dataset, such as one
you might encounter in SAS or SPSS, or a table in an SQL database.
► A data frame is a tabular (rectangular) data structure, which means that it has
rows and columns. It is not implemented by a matrix, however. Rather, a data
frame is a list with the following characteristics:
• The elements of the list are vectors and/or factors.
• Those vectors and factors are the columns of the data frame.
• The vectors and factors must all have the same length; in other words, all
columns must have the same height.
• The equal-height columns give a rectangular shape to the data frame.
• The columns must have names.
Data Frames
► Because a data frame is both a list and a rectangular structure, R provides two
different paradigms for accessing its contents:
• You can use list operators to extract columns from a data frame, such
as df[i], df[[i]], or df$name.
• You can use matrix-like notation, such as df[i,j], df[i,], or df[,j].
► A data frame is a rectangular data structure. The columns are typed, and each
column must be numeric values, character strings, or a factor. Columns must
have labels; rows may have labels. The table can be indexed by position, column
name, and/or row name. It can also be accessed by list operators, in which case
R treats the data frame as a list whose elements are the columns of the data
frame.
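The two access paradigms side by side, on a small data frame invented for this example.

```r
# Illustrative data frame
df <- data.frame(name  = c("Fisher", "Pearson", "Cox"),
                 birth = c(1890, 1857, 1900),
                 stringsAsFactors = FALSE)

# List-style access: columns are the list elements
col_vec <- df[["birth"]]   # the column itself (a vector)
col_dlr <- df$name         # same idea, by name
col_df  <- df["birth"]     # single brackets keep a one-column data frame

# Matrix-style access: [row, column]
cell <- df[2, "name"]
late <- df[df$birth > 1880, ]   # rows selected by a logical condition
print(late)
```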
Tibbles
► A tibble is a modern reimagining of the data frame, introduced by Hadley
Wickham in the tibble package, which is a core package in the tidyverse. Most
of the common functions you would use with data frames also work with tibbles.
However, tibbles typically do less than data frames and complain more. This
idea of complaining and doing less may remind you of your least favorite
coworker; however, we think tibbles will be one of your favorite data structures.
Doing less and complaining more can be a feature, not a bug.
► Unlike data frames, tibbles:
• Do not give you row numbers by default.
• Do not give you strange, unexpected column names.
• Don’t coerce your data into factors (unless you explicitly ask for that).
• Recycle vectors of length 1 but not other lengths.
Tibbles
► In addition to basic data frame functionality, tibbles:
• Print only the first 10 rows and a bit of metadata by default.
• Always return a tibble when subsetting.
• Never do partial matching: if you want a column from a tibble, you have to
ask for it using its full name.
• Complain more by giving you more warnings and chatty messages to make
sure you understand what the software is doing.
► All these extras are designed to give you fewer surprises and help you make
fewer mistakes.
Appending Data to a Vector
► Use the vector constructor (c) to construct a vector with the additional data
items:
v <- c(1, 2, 3)
newItems <- c(6, 7, 8)
c(v, newItems)
#> [1] 1 2 3 6 7 8
► For a single item, you can also assign the new item to the next vector element. R
will automatically extend the vector:
v <- c(1, 2, 3)
v[length(v) + 1] <- 42
v
#> [1] 1 2 3 42
Inserting Data into a Vector
► Despite its name, the append function inserts data into a vector by using
the after parameter, which gives the insertion point for the new item or items:
append(vec, newvalues, after = n)
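A brief sketch of the after parameter on an illustrative vector. Using after = 0 places the new items at the front.

```r
vec <- 1:10

# Insert 99 after the fifth element
mid_insert <- append(vec, 99, after = 5)

# after = 0 inserts at the head of the vector
front_insert <- append(vec, 99, after = 0)

print(mid_insert)
```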
Categorical Variable
► The factor function encodes your vector of discrete values into a factor:
f <- factor(v) # v can be a vector of strings or integers
► If your vector contains only a subset of possible values and not the entire universe,
then include a second argument that gives the possible levels of the factor:
f <- factor(v, levels)
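The effect of the second argument to factor can be seen when the data covers only some of the possible values. The weekday data below is an assumption for illustration.

```r
# Observed data contains only two of the five possible weekdays
v <- c("Wed", "Mon", "Wed")

f1 <- factor(v)   # levels inferred from the data only
f2 <- factor(v, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))

# With explicit levels, unobserved categories appear with a count of zero
counts <- table(f2)
print(counts)
```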
Combining Multiple Vectors into One Vector and a Factor
► Create a list that contains the vectors. Use the stack function to combine the list
into a two-column data frame:
comb <- stack(list(v1 = v1, v2 = v2, v3 = v3)) # Combine 3 vectors
► The data frame’s columns are called values and ind. The first column contains the
data, and the second column contains the parallel factor.
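A sketch of stack on three short vectors whose contents are made up. The ind column records which vector each value came from, so it can drive grouped summaries.

```r
# Illustrative vectors of different lengths
v1 <- c(1, 2, 3)
v2 <- c(4, 5)
v3 <- c(6, 7, 8, 9)

comb <- stack(list(v1 = v1, v2 = v2, v3 = v3))

# `values` holds the data; `ind` is the parallel factor of origins
sums <- tapply(comb$values, comb$ind, sum)
print(comb)
```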
Creating a List
► To create a list from individual data items, use the list function:
lst <- list(x, y, z)
Selecting List Elements by Position
► Use one of these ways. Here, lst is a list variable:
lst[[n]]
Select the nth element from the list.
lst[c(n1, n2, ..., nk)]
Returns a list of elements, selected by their positions.
► Note that the first form returns a single element and the second form returns a list.
Selecting List Elements by Name
► Use one of these forms. Here, lst is a list variable:
lst[["name"]]
Selects the element called name. Returns NULL if no element has that name.
lst$name
Same as previous, just different syntax.
lst[c(name1, name2, ..., namek)]
Returns a list built from the indicated elements of lst.
► Note that the first two forms return an element, whereas the third form returns a
list.
Building a Name/Value Association List
► The list function lets you give names to elements, creating an association between
each name and its value:
lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
► If you have parallel vectors of names and values, you can create an empty list and
then populate the list by using a vectorized assignment statement:
values <- c(1, 2, 3)
names <- c("a", "b", "c")
lst <- list()
lst[names] <- values
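A runnable version of the vectorized assignment, followed by lookups. The variable is called keys here rather than names to avoid shadowing the base names function, and the setNames one-liner is an alternative not shown on the slide.

```r
values <- c(1, 2, 3)
keys   <- c("a", "b", "c")

# Populate an empty list with one vectorized assignment
lst <- list()
lst[keys] <- values

# Lookup by name, as in a dictionary
b_val   <- lst[["b"]]
missing <- lst[["zzz"]]   # no such name: NULL

# Alternative construction in a single expression
lst2 <- setNames(as.list(values), keys)
print(lst)
```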
Removing List Elements Using a Condition
► Start with a function that returns TRUE when your criterion is met
and FALSE otherwise. Then use the discard function from purrr to remove values
that match your criterion. This code snippet, for example, uses the is.na function to
remove NA values from lst:
library(purrr)
lst <- list(NA, 0, NA, 1, 2)
lst %>%
  discard(is.na)
#> [[1]]
#> [1] 0
#>
#> [[2]]
#> [1] 1
#>
#> [[3]]
#> [1] 2
Initializing a Matrix
► Capture the data in a vector or list, and then use the matrix function to shape the
data into a matrix. This example shapes a vector into a 2 × 3 matrix (i.e., two rows
and three columns):
vec <- 1:6
matrix(vec, 2, 3)
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
Performing Matrix Operations
t(A)
Matrix transposition of A
solve(A)
Matrix inverse of A
A %*% B
Matrix multiplication of A and B
diag(n)
An n-by-n diagonal (identity) matrix
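The four operations can be checked on a small invertible matrix. The matrix A is an assumption for illustration; the last lines contrast %*% with the element-wise * operator, which is easy to confuse with matrix multiplication.

```r
# A 2 x 2 invertible matrix, filled column by column
A <- matrix(c(2, 0, 1, 3), 2, 2)
I <- diag(2)                 # 2 x 2 identity matrix

At    <- t(A)                # transpose
A_inv <- solve(A)            # inverse
check <- A %*% A_inv         # should equal the identity (up to rounding)

mat_prod  <- A %*% A         # matrix multiplication
elem_prod <- A * A           # element-wise multiplication, a different result
print(check)
```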
Giving Descriptive Names to the Rows and Columns of a Matrix
► Every matrix has a rownames attribute and a colnames attribute. Assign a vector
of character strings to the appropriate attribute:
rownames(mat) <- c("rowname1", "rowname2", ..., "rownameN")
colnames(mat) <- c("colname1", "colname2", ..., "colnameN")
Selecting One Row or Column from a Matrix
► The solution depends on what you want. If you want the result to be a
simple vector, just use normal indexing:
mat[1, ] # First row
mat[, 3] # Third column
► If you want the result to be a one-row matrix or a one-column matrix,
then include the drop=FALSE argument:
mat[1, , drop=FALSE] # First row, one-row matrix
mat[, 3, drop=FALSE] # Third column, one-column matrix
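A sketch showing what drop = FALSE changes, on a small matrix with illustrative row and column names.

```r
mat <- matrix(1:6, 2, 3)
rownames(mat) <- c("r1", "r2")
colnames(mat) <- c("a", "b", "c")

# Default indexing drops the dimensions and returns a plain (named) vector
row_vec <- mat[1, ]

# drop = FALSE keeps the result as a matrix
row_mat <- mat[1, , drop = FALSE]      # 1 x 3 matrix
col_mat <- mat[, "c", drop = FALSE]    # 2 x 1 matrix, selected by column name

print(row_mat)
```

Keeping the matrix shape matters when the result feeds into matrix operations such as %*%, which depend on the dimensions.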
Initializing a Data Frame from Column Data
Initializing a Data Frame from Row Data
► Store each row in a one-row data frame. Use rbind to bind the rows into one large
data frame:
rbind(row1, row2, ..., rowN)
Appending Rows to a Data Frame
► Create a second, temporary data frame containing the new rows. Then use
the rbind function to append the temporary data frame to the original data
frame.
Selecting Data Frame Columns by Position
► Use the select function:
df %>% select(n1, n2, ..., nk)
where df is a data frame and n1, n2, ..., nk are integers
with values between 1 and the number of columns.
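The row-binding steps can be sketched with one-row data frames invented for this example. The final line selects a column by position using base R indexing, which is shown here as a dependency-free analogue of select, not as the dplyr call itself.

```r
# Build a data frame from one-row data frames (illustrative rows)
row1 <- data.frame(name = "Fisher",  birth = 1890)
row2 <- data.frame(name = "Pearson", birth = 1857)
df <- rbind(row1, row2)

# Append: put the new rows in a temporary data frame, then rbind
new_rows <- data.frame(name = "Cox", birth = 1900)
df <- rbind(df, new_rows)

# Base R analogue of selecting a column by position
first_col <- df[, 1, drop = FALSE]
print(df)
```

rbind matches columns by name, so the temporary data frame needs the same column names as the original.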
Selecting Data Frame Columns by Name
Removing NAs from a Data Frame
Merging Data Frames by Common Column
► We can use the join functions from the dplyr package to join our data frames
together on a common column.
► If you want only rows that appear in both data frames, use inner_join:
inner_join(df1, df2, by = "col")
where "col" is the column that appears in both data frames.
► If you want all rows that appear in either data frame, use full_join instead.
full_join(df1, df2, by = "col")
► If you want all rows from df1 and only those from df2 that match, use left_join:
left_join(df1, df2, by = "col")
► Or to get all records from df2 and only the matching ones from df1,
use right_join:
right_join(df1, df2, by = "col")
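The four join types can be compared by row count. To stay free of package dependencies, this sketch uses base R's merge, whose all, all.x, and all.y arguments correspond to the dplyr joins named above; the data frames are made up for illustration.

```r
# Keys 2 and 3 appear in both data frames; 1 and 4 appear in only one
df1 <- data.frame(col = c(1, 2, 3), x = c("a", "b", "c"))
df2 <- data.frame(col = c(2, 3, 4), y = c("B", "C", "D"))

inner <- merge(df1, df2, by = "col")                 # like inner_join
full  <- merge(df1, df2, by = "col", all = TRUE)     # like full_join
left  <- merge(df1, df2, by = "col", all.x = TRUE)   # like left_join
right <- merge(df1, df2, by = "col", all.y = TRUE)   # like right_join

# Unmatched rows are kept with NA in the columns from the other table
print(full)
```

One difference to keep in mind: merge sorts the result by the key column by default, whereas the dplyr joins keep the row order of the first data frame.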
Converting One Atomic Value into Another
► For each atomic data type, there is a function for converting values to that
type. The conversion functions for atomic types include:
• as.character(x)
• as.complex(x)
• as.numeric(x) or as.double(x)
• as.integer(x)
• as.logical(x)
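A few conversions with illustrative inputs, including two behaviors that often surprise: as.integer truncates instead of rounding, and a failed conversion yields NA with a warning.

```r
num  <- as.numeric("3.14")    # string to number
int  <- as.integer(3.99)      # truncates toward zero, does not round
chr  <- as.character(101)     # number to string
lgl1 <- as.logical("TRUE")    # string to logical
lgl2 <- as.logical(0)         # zero is FALSE, nonzero is TRUE
cplx <- as.complex(2)         # real to complex

# A value that cannot be converted becomes NA (R also issues a warning)
bad <- suppressWarnings(as.numeric("foo"))
print(bad)
```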
Converting One Structured Data Type into Another
Module 4 Assignment Requirements
❑ Discussion Board
o Topic
o Key concepts
❑ Assignment
o Topic
o Key concepts
❑ Quiz
o Key concepts
This concludes our live session.
Thank you for your attendance!
Questions
Next Live Session