SEU - DS510 - Module 4 Input-Output and Data Structure

Live Session

Module 4

DS510
Statistics for Data Science

Instructor Name
Module 4 Learning Outcomes
1. Manage input for implementing
statistical projects.
2. Produce formatted output for reporting
project numbers.
3. Examine data structures such as matrices, lists,
factors, data frames, and tibbles.

DS510 Statistics for Data Science | Long, J.D., & Teetor, P. (2019)
Input-Output and Data Structure
R Cookbook: Proven Recipes for Data
Analysis, Statistics, and Graphics

Chapter 4 Input and Output

Chapter 5 Data Structures

Input and Output

► All statistical work begins with data, and most data is stuck inside files and
databases. Dealing with input is probably the first step of implementing any
significant statistical project.
► All statistical work ends with reporting
numbers back to a client, even if you are the
client. Formatting and producing output is
probably the climax of your project.
Entering Data from the Keyboard
► For very small datasets, enter the data as literals using
the c constructor for vectors:
scores <- c(61, 66, 90, 88, 100)

Printing Fewer Digits (or More Digits)


► For print, the digits parameter can control the number of printed digits.
► For cat, use the format function (which also has a digits parameter) to
alter the formatting of numbers.
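A brief sketch of both approaches, using the built-in constant pi as sample data (the choice of four digits is arbitrary):

```r
# print accepts a digits argument directly
print(pi, digits = 4)

# cat has no digits argument, so format the number first
cat(format(pi, digits = 4), "\n")

# format returns a character string, which is what cat then writes
formatted <- format(pi, digits = 4)
```

Because format returns a string, the same call can be reused wherever formatted text is needed, for example inside paste.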

Redirecting Output to a File
► You can redirect the output of the cat function by using
its file argument:
cat("The answer is", answer, "\n", file = "filename.txt")

► Use the sink function to redirect all output from both print and cat. Call sink
with a filename argument to begin redirecting console output to that file. When
you are done, use sink with no argument to close the file and resume output to
the console:
sink("filename") # Begin writing output to file
# ... other session work ...
sink() # Resume writing output to console
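The sink pattern can be tried end to end with a temporary file. This is a minimal sketch; the scores vector and the out_file name are illustrative assumptions, not part of the original slides:

```r
# Hypothetical output file; tempfile() avoids cluttering the working directory
out_file <- tempfile(fileext = ".txt")
scores <- c(61, 66, 90, 88, 100)

sink(out_file)                           # Begin writing output to file
print(summary(scores))                   # Goes to the file, not the console
cat("The mean is", mean(scores), "\n")   # Also goes to the file
sink()                                   # Resume writing output to console

captured <- readLines(out_file)          # Inspect what was captured
captured
```

Forgetting the final sink() call leaves the console silent, so it is worth pairing the two calls as soon as you write the first one.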
Listing Files
► The list.files function shows the contents of your working directory:
list.files()
#> [1] "_book" "_bookdown.yml"
#> [3] "_common.R" "_main.rds"
#> [5] "_output.yml" "01_GettingStarted.md"
#> [7] "01_GettingStarted.Rmd" "01_GettingStarted.utf8.md"
#> [9] "02_SomeBasics_files" "02_SomeBasics.md" etc ...

“Cannot Open File”
► On Windows, a path written as "C:\data\sample.txt" fails with “cannot open file”
because R treats each backslash in a string as an escape character. You can solve
this problem in one of two ways:
• Change the backslashes to forward slashes: "C:/data/sample.txt".
• Double the backslashes: "C:\\data\\sample.txt".
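A small sketch showing that the two spellings describe the same path. The variable names are illustrative; no file is opened:

```r
path_forward <- "C:/data/sample.txt"
path_doubled <- "C:\\data\\sample.txt"

# Each doubled backslash is stored as a single character in the string
nchar(path_doubled)

# Converting backslashes to forward slashes yields the first form
converted <- gsub("\\", "/", path_doubled, fixed = TRUE)
identical(converted, path_forward)
```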

Reading Fixed-Width Records
► Use the read_fwf function from the readr package (which is part of the tidyverse).
The main arguments are the filename and the description of the fields:
library(tidyverse)
records <- read_fwf("myfile.txt",
                    fwf_cols(col1 = 10,
                             col2 = 7))
records
► This form uses the fwf_cols helper function to pass column names and widths to
read_fwf. You can also describe the columns in other ways, for example with
fwf_widths or fwf_positions.
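To try read_fwf without an existing data file, the sketch below first writes a tiny fixed-width file. The file contents and the column names last and birth are assumptions made up for illustration:

```r
library(readr)

# Hypothetical sample: a 10-character name field followed by a 7-character year field
fwf_file <- tempfile(fileext = ".txt")
writeLines(c("Fisher    1890   ",
             "Pearson   1857   "), fwf_file)

records <- read_fwf(fwf_file,
                    fwf_cols(last = 10,
                             birth = 7))
records
```

By default read_fwf trims the padding, so the name column holds "Fisher" rather than "Fisher" followed by spaces.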

Reading Tabular Data Files
► Use the read_table2 function from the readr package, which returns a
tibble (readr 2.0 and later rename this function to read_table):
library(tidyverse)

tab1 <- read_table2("./data/datafile.tsv")
#> Parsed with column specification:
#> cols(
#>   last = col_character(),
#>   first = col_character(),
#>   birth = col_double(),
#>   death = col_double()
#> )

Reading Tabular Data Files (continued)
tab1
#> # A tibble: 5 x 4
#>   last    first    birth death
#>   <chr>   <chr>    <dbl> <dbl>
#> 1 Fisher  R.A.      1890  1962
#> 2 Pearson Karl      1857  1936
#> 3 Cox     Gertrude  1900  1978
#> 4 Yates   Frank     1902  1994
#> 5 Smith   Kirstine  1878  1939

Reading from CSV Files
► The read_csv function from the readr package is a fast (and, according to the
documentation, fun) way to read CSV files. If your CSV file has a header line,
use this:
library(tidyverse)

tbl <- read_csv("datafile.csv")

► If your CSV file does not contain a header line, set the col_names option
to FALSE:
tbl <- read_csv("datafile.csv", col_names = FALSE)

Writing to CSV Files
► The write_csv function from the tidyverse readr package can write a
CSV file:

library(tidyverse)

write_csv(df, "outfile.csv") # The file argument was named path before readr 1.4
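A round trip is a quick way to check that writing and reading agree. This sketch assumes a small made-up data frame and a temporary file instead of outfile.csv:

```r
library(readr)

# Hypothetical data frame to write out
df <- data.frame(last = c("Fisher", "Pearson"),
                 birth = c(1890, 1857))
csv_file <- tempfile(fileext = ".csv")

write_csv(df, csv_file)       # Header line is written by default
tbl <- read_csv(csv_file)     # Header line becomes the column names
tbl
```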

Reading Tabular or CSV Data from the Web
► Use the read_csv or read_table2 functions from the readr package, using a
URL instead of a filename. The functions will read directly from the
remote server:
library(tidyverse)
berkley <- read_csv('http://bit.ly/barkley18', comment = '#')
#> Parsed with column specification:
#> cols(
#> Name = col_character(),
#> Location = col_character(),
#> Time = col_time(format = "")
#> )
► You can also open a connection using the URL and then read from the
connection, which may be preferable for complicated files.

Reading Data from Excel
► The openxlsx package makes reading Excel files easy.

library(openxlsx)
df1 <- read.xlsx(xlsxFile = "file.xlsx",
                 sheet = 'sheet_name')

Writing a Data Frame to Excel
► The openxlsx package makes writing to Excel files relatively easy.
While there are lots of options in openxlsx, a typical pattern is to
specify an Excel filename and a sheet name:

library(openxlsx)
write.xlsx(df,
           sheetName = "some_sheet",
           file = "out_file.xlsx")

Reading Data from a SAS file

► The haven package supports reading SAS sas7bdat files into R:

library(haven)
sas_movie_data <- read_sas("data/movies.sas7bdat")

Reading Data from HTML Tables
► Use the read_html and html_table functions in the rvest package. To read
all tables on the page, do the following:
library(rvest)
library(tidyverse)

all_tables <-
  read_html("URL") %>%
  html_table(fill = TRUE, header = TRUE)
► Note that rvest is installed when you run install.packages('tidyverse'),
although it is not a core tidyverse package. So you must explicitly load
the package.

Reading Files with a Complex Structure
• Use the readLines function to read individual lines; then
process them as strings to extract data items.
• Alternatively, use the scan function to read individual
tokens and use the argument what to describe the stream of
tokens in your file. The function can convert tokens into
data and then assemble the data into records.
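Both approaches can be sketched on a small whitespace-separated file. The file contents and the field names name, birth, and death are made up for illustration:

```r
# Hypothetical input: one record per line, three tokens per record
raw_file <- tempfile(fileext = ".txt")
writeLines(c("fisher 1890 1962",
             "pearson 1857 1936"), raw_file)

# Approach 1: read whole lines, then split each line into fields yourself
lines  <- readLines(raw_file)
fields <- strsplit(lines, " ")

# Approach 2: read tokens; what describes one record as (string, number, number)
records <- scan(raw_file,
                what = list(name = character(0),
                            birth = numeric(0),
                            death = numeric(0)),
                quiet = TRUE)
records$birth
```

readLines leaves every field as text, while scan converts each token to the type given in what, which is why records$birth is already numeric.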

Reading from MySQL Databases
1. Install the RMySQL package on your computer and add a
user and password.
2. Open a database connection using
the DBI::dbConnect function.
3. Use dbGetQuery to initiate a SELECT and return the result
sets.
4. Use dbDisconnect to terminate the database connection
when you are done.
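Steps 2 through 4 can be sketched without a MySQL server by swapping in an in-memory SQLite database, since dbConnect, dbGetQuery, and dbDisconnect are generic DBI functions. The SQLite backend, the scores table, and its contents are assumptions for illustration; with MySQL you would pass the RMySQL driver and your own credentials to dbConnect instead:

```r
library(DBI)

# Stand-in connection (assumption): in-memory SQLite rather than MySQL
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical sample table so the query has something to return
dbWriteTable(con, "scores",
             data.frame(student = c("Moe", "Larry", "Curly"),
                        score = c(61, 90, 88)))

# Run a SELECT and get the result set back as a data frame
high <- dbGetQuery(con,
                   "SELECT student, score FROM scores
                    WHERE score > 80 ORDER BY score DESC")

# Close the connection when done
dbDisconnect(con)
high
```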

Accessing a Database with dbplyr
► In addition to being a grammar of data manipulation, the
tidyverse package dplyr can, in connection with
the dbplyr package, turn dplyr commands into SQL for you.
► Let’s set up an example database using RSQLite. Then we’ll
connect to it and use dplyr and the dbplyr backend to extract
data.
► Set up the example table by loading the msleep example data (from ggplot2) into
an in-memory SQLite database:
library(tidyverse)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
sleep_db <- copy_to(con, msleep, "sleep")
► Now that we have a table in our database, we can create a
reference to it from R:
sleep_table <- tbl(con, "sleep")

Accessing a Database with dbplyr
► The sleep_table object is a type of pointer or alias to the table
on the database. However, dplyr will treat it like a regular
tidyverse tibble or data frame, so you can operate on it
using dplyr and other R commands. Let’s select all animals
in the data that sleep less than three hours:
little_sleep <- sleep_table %>%
  select(name, genus, order, sleep_total) %>%
  filter(sleep_total < 3)

Accessing a Database with dbplyr
► The dbplyr backend does not go fetch the data when we do the
preceding commands. But it does build the query and get ready. To
see the query built by dplyr, you can use show_query:
show_query(little_sleep)
#> <SQL>
#> SELECT *
#> FROM (SELECT `name`, `genus`, `order`, `sleep_total`
#> FROM `sleep`)
#> WHERE (`sleep_total` < 3.0)

Accessing a Database with dbplyr
► To bring the data back to your local machine, use collect:
local_little_sleep <- collect(little_sleep)
local_little_sleep
#> # A tibble: 3 x 4
#>   name        genus         order          sleep_total
#>   <chr>       <chr>         <chr>                <dbl>
#> 1 Horse       Equus         Perissodactyla         2.9
#> 2 Giraffe     Giraffa       Artiodactyla           1.9
#> 3 Pilot whale Globicephalus Cetacea                2.7

Saving and Transporting Objects
► Write the objects to a file using the save function:
save(tbl, t, file = "myData.RData")
► Read them back using the load function, either on
your computer or on any platform that supports R:
load("myData.RData")
► The save function writes binary data. To save in an
ASCII format, use dput or dump instead:
dput(tbl, file = "myData.txt")
dump("tbl", file = "myData.txt") # Note quotes around variable name
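A minimal round trip for both formats, assuming a made-up scores vector and temporary files. The dget function, which reads back what dput wrote, is base R but is not mentioned on the slide:

```r
scores <- c(61, 66, 90, 88, 100)

# Binary format: save, remove the object, then load it back under its own name
rdata_file <- tempfile(fileext = ".RData")
save(scores, file = rdata_file)
rm(scores)
load(rdata_file)

# ASCII format: dput writes R code that recreates the object; dget evaluates it
txt_file <- tempfile(fileext = ".txt")
dput(scores, file = txt_file)
restored <- dget(txt_file)
identical(restored, scores)
```

Note that load restores the object under its original name, whereas dget returns a value that you assign to any name you like.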
Input-Output and Data Structure
R Cookbook: Proven Recipes for Data
Analysis, Statistics, and Graphics

Chapter 5 Data Structures

Vectors

► Here are some key properties of vectors:

► Vectors are homogeneous: All elements of a vector must have the same type or, in R
terminology, the same mode.
► Vectors can be indexed by position: So v[2] refers to the second element of v.
► Vectors can be indexed by multiple positions, returning a subvector: So v[c(2,3)] is a
subvector of v that consists of the second and third elements.
► Vector elements can have names: Vectors have a names property, the same length as the
vector itself, that gives names to the elements:
v <- c(10, 20, 30)
names(v) <- c("Moe", "Larry", "Curly")
print(v)
#> Moe Larry Curly
#> 10 20 30
► If vector elements have names, then you can select them by name. Continuing
the previous example:
v[["Larry"]]
#> [1] 20
Lists

► Lists are heterogeneous: Lists can contain elements of different types; in R
terminology, list elements may have different modes. Lists can even contain other
structured objects, such as lists and data frames; this allows you to create
recursive data structures.
► Lists can be indexed by position: So lst[[2]] refers to the second element of lst.
Note the double square brackets. Double brackets mean that R will return the
element as whatever type of element it is.
► Lists let you extract sublists: So lst[c(2,3)] is a sublist of lst that consists of the
second and third elements. Note the single square brackets. Single brackets
mean that R will return the items in a list. If you pull a single element with single
brackets, like lst[2], R will return a list of length 1 with the first item being the
desired item.

Lists

► List elements can have names: Both lst[["Moe"]] and lst$Moe refer to the element
named “Moe.”
► Since lists are heterogeneous and since their elements can be retrieved by name, a
list is like a dictionary or hash or lookup table in other programming languages.
► What’s surprising (and cool) is that in R, unlike most of those other programming
languages, lists can also be indexed by position.
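The indexing forms for lists can be compared side by side. The list contents below are made up for illustration:

```r
# A heterogeneous list: a number, a string, and a logical vector
lst <- list(Moe = 10, Larry = "twenty", Curly = c(TRUE, FALSE))

lst[[2]]        # The element itself: a character string
lst[2]          # A list of length 1 that contains that element
lst[c(2, 3)]    # A sublist with the second and third elements
lst[["Moe"]]    # Lookup by name
lst$Moe         # Same lookup, different syntax
```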

Mode: Physical Type
► In R, every object has a mode, which indicates how it is stored in memory: as a number,
as a character string, as a list of pointers to other objects, as a function, and so forth.
► The mode function gives us this information:
mode(3.1415) # Mode of a number
#> [1] "numeric"
mode(c(2.7182, 3.1415)) # Mode of a vector of numbers
#> [1] "numeric"
mode("Moe") # Mode of a character string
#> [1] "character"
mode(list("Moe", "Larry", "Curly")) # Mode of a list
#> [1] "list"
► A critical difference between a vector and a list can be summed up this way:
• In a vector, all elements must have the same mode.
• In a list, the elements can have different modes.
Class: Abstract Type
► In R, every object also has a class, which defines its abstract type. The
terminology is borrowed from object-oriented programming. A single number
could represent many different things: a distance, a point in time, or a weight, for
example. All those objects have a mode of “numeric” because they are stored as a
number, but they could have different classes to indicate their interpretation.
► For example, a Date object consists of a single number:
d <- as.Date("2010-03-15")
mode(d)
#> [1] "numeric"
length(d)
#> [1] 1

Class: Abstract Type
► But it has a class of Date, telling us how to interpret that number—namely, as the
number of days since January 1, 1970:
class(d)
#> [1] "Date"
► R uses an object’s class to decide how to process the object. For example, the
generic function print has specialized versions (called methods) for printing
objects according to their class: data.frame, Date, lm, and so forth. When you
print an object, R calls the appropriate print function according to the object’s
class.
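The split between stored number and class can be made visible with unclass, which strips the class and exposes the underlying value. This is a small sketch reusing the date from the example:

```r
d <- as.Date("2010-03-15")

unclass(d)    # The stored number: days since January 1, 1970
class(d)      # The interpretation attached to that number
format(d)     # Class-aware formatting produces a calendar date
d + 30        # Arithmetic acts on the number but keeps the Date class
```

Printing d and printing unclass(d) show the same stored value in two ways because print dispatches on the class.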

Scalars
► The quirky thing about scalars is their relationship to vectors. In some software,
scalars and vectors are two different things. In R, they are the same thing: a scalar is
simply a vector that contains exactly one element. In this book we often use the
term scalar, but that’s just shorthand for “vector with one element.”
► Consider the built-in constant pi. It is a scalar:

pi
#> [1] 3.14
► Since a scalar is a one-element vector, you can use vector functions on pi:

length(pi)
#> [1] 1
► You can index it. The first (and only) element is π, of course:

pi[1]
#> [1] 3.14
► If you ask for the second element, there is none:

pi[2]
#> [1] NA
Matrices
► In R, a matrix is just a vector that has dimensions. It may seem strange at first,
but you can transform a vector into a matrix simply by giving it dimensions.
► A vector has an attribute called dim, which is initially NULL, as shown here:

A <- 1:6
dim(A)
#> NULL
print(A)
#> [1] 1 2 3 4 5 6
► We give dimensions to the vector when we set its dim attribute. Watch what
happens when we set our vector dimensions to 2 × 3 and print it:
dim(A) <- c(2, 3)
print(A)
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
Matrices
► Voilà! The vector was reshaped into a 2 × 3 matrix.
► A matrix can be created from a list, too. Like a vector, a list has a dim attribute,
which is initially NULL:
B <- list(1, 2, 3, 4, 5, 6)
dim(B)
#> NULL
► If we set the dim attribute, it gives the list a shape:
dim(B) <- c(2, 3)
print(B)
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
► Voilà! We have turned this list into a 2 × 3 matrix.

Arrays

► The discussion of matrices can be generalized to three-dimensional or even
n-dimensional structures: just assign more dimensions to the underlying vector
(or list). The following example creates a three-dimensional array with
dimensions 2 × 3 × 2:
D <- 1:12
dim(D) <- c(2, 3, 2)
print(D)
#> , , 1
#>
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
#>
#> , , 2
#>
#> [,1] [,2] [,3]
#> [1,] 7 9 11
#> [2,] 8 10 12

Arrays

► Note that R prints one “slice” of the structure at a time, since it’s not possible to
print a three-dimensional structure on a two-dimensional medium.
► It strikes us as very odd that we can turn a list into a matrix just by giving the list
a dim attribute. But wait: it gets stranger.
► Recall that a list can be heterogeneous (mixed modes). We can start with a
heterogeneous list, give it dimensions, and thus create a heterogeneous matrix. This
code snippet creates a matrix that is a mix of numeric and character data:
C <- list(1, 2, 3, "X", "Y", "Z")
dim(C) <- c(2, 3)
print(C)
#> [,1] [,2] [,3]
#> [1,] 1 3 "Y"
#> [2,] 2 "X" "Z"

Arrays

► To us, this is strange because we ordinarily assume a matrix is purely numeric,
not mixed. R is not that restrictive.
► The possibility of a heterogeneous matrix may seem powerful and strangely
fascinating. However, it creates problems when you are doing normal, day-to-day
stuff with matrices. For example, what happens when the matrix C (from
the previous example) is used in matrix multiplication? What happens if it is
converted to a data frame? The answer is that odd things happen.
► In this book, we generally ignore the pathological case of a heterogeneous
matrix. We assume you’ve got simple, vanilla matrices. Some recipes involving
matrices may work oddly (or not at all) if your matrix contains mixed data.
Converting such a matrix to a vector or data frame, for instance, can be
problematic.

Factors

► A factor looks like a character vector, but it has special properties. R keeps track of
the unique values in a vector, and each unique value is called a level of the
associated factor. R uses a compact representation for factors, which makes them
efficient for storage in data frames. In other programming languages, a factor
would be represented by a vector of enumerated values.
► There are two key uses for factors:
• Categorical variables: A factor can represent a categorical variable. Categorical
variables are used in contingency tables, linear regression, analysis of variance
(ANOVA), logistic regression, and many other areas.
• Grouping: This is a technique for labeling or tagging your data items according to
their group.

Data Frames
► A data frame is a powerful and flexible structure. Most serious R applications
involve data frames. A data frame is intended to mimic a dataset, such as one
you might encounter in SAS or SPSS, or a table in an SQL database.
► A data frame is a tabular (rectangular) data structure, which means that it has
rows and columns. It is not implemented by a matrix, however. Rather, a data
frame is a list with the following characteristics:
• The elements of the list are vectors and/or factors.
• Those vectors and factors are the columns of the data frame.
• The vectors and factors must all have the same length; in other words, all
columns must have the same height.
• The equal-height columns give a rectangular shape to the data frame.
• The columns must have names.

Data Frames
► Because a data frame is both a list and a rectangular structure, R provides two
different paradigms for accessing its contents:
• You can use list operators to extract columns from a data frame, such
as df[i], df[[i]], or df$name.
• You can use matrix-like notation, such as df[i,j], df[i,], or df[,j].

► A data frame is a rectangular data structure. The columns are typed, and each
column must be numeric values, character strings, or a factor. Columns must
have labels; rows may have labels. The table can be indexed by position, column
name, and/or row name. It can also be accessed by list operators, in which case
R treats the data frame as a list whose elements are the columns of the data
frame.
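The two access paradigms can be compared on a small data frame. The column names and values here are made up for illustration:

```r
df <- data.frame(name = c("Moe", "Larry", "Curly"),
                 score = c(61, 90, 88))

# List-style access: the data frame is treated as a list of columns
df$score         # The column as a vector
df[["score"]]    # Same column, same vector
df["score"]      # A one-column data frame

# Matrix-style access: row and column positions or names
df[2, "score"]   # A single cell
df[2, ]          # The second row, as a one-row data frame
df[, "name"]     # One column, returned as a vector
```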

Tibbles
► A tibble is a modern reimagining of the data frame, introduced by Hadley
Wickham in the tibble package, which is a core package in the tidyverse. Most
of the common functions you would use with data frames also work with tibbles.
However, tibbles typically do less than data frames and complain more. This
idea of complaining and doing less may remind you of your least favorite
coworker; however, we think tibbles will be one of your favorite data structures.
Doing less and complaining more can be a feature, not a bug.
► Unlike data frames, tibbles:
• Do not give you row numbers by default.
• Do not give you strange, unexpected column names.
• Don’t coerce your data into factors (unless you explicitly ask for that).
• Recycle vectors of length 1 but not other lengths.

Tibbles
► In addition to basic data frame functionality, tibbles:
• Print only the first 10 rows and a bit of metadata by default.
• Always return a tibble when subsetting.
• Never do partial matching: if you want a column from a tibble, you have to
ask for it using its full name.
• Complain more by giving you more warnings and chatty messages to make
sure you understand what the software is doing.
► All these extras are designed to give you fewer surprises and help you make
fewer mistakes.
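A few of these differences can be observed directly. This is a small sketch with made-up column names, assuming the tibble package is installed:

```r
library(tibble)

# A length-1 value is recycled to match the other column
tb <- tibble(x = 1:3, y = "a")

# Subsetting a single column with [ still returns a tibble, not a vector
is_tibble(tb[, "x"])

# Data frames partially match column names with $; tibbles do not
df  <- data.frame(value = 1:3)
tb2 <- tibble(value = 1:3)
df$val                       # Partial match: returns the value column
suppressWarnings(tb2$val)    # No partial match: NULL (and a warning unless suppressed)
```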

Appending Data to a Vector

► Use the vector constructor (c) to construct a vector with the additional data
items:
v <- c(1, 2, 3)
newItems <- c(6, 7, 8)
c(v, newItems)
#> [1] 1 2 3 6 7 8
► For a single item, you can also assign the new item to the next vector element. R
will automatically extend the vector:
v <- c(1, 2, 3)
v[length(v) + 1] <- 42
v
#> [1] 1 2 3 42

Inserting Data into a Vector
► Despite its name, the append function inserts data into a vector by using
the after parameter, which gives the insertion point for the new item or items:
append(vec, newvalues, after = n)
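For instance, with a made-up vector and the value 99 as the new item:

```r
vec <- 1:10

middle <- append(vec, 99, after = 5)   # Insert after the fifth element
head_  <- append(vec, 99, after = 0)   # after = 0 inserts at the front
tail_  <- append(vec, 99)              # Default: add to the end

middle
```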

Understanding the Recycling Rule


► When you do vector arithmetic, R performs element-by-element operations. That
works well when both vectors have the same length: R pairs the elements of the
vectors and applies the operation to those pairs.
► But what happens when the vectors have unequal lengths?
► In that case, R invokes the Recycling Rule. It processes the vector elements in pairs,
starting at the first elements of both vectors. At a certain point, the shorter vector
is exhausted while the longer vector still has unprocessed elements. R returns to
the beginning of the shorter vector, “recycling” its elements; continues taking
elements from the longer vector; and completes the operation. It will recycle the
shorter-vector elements as often as necessary until the operation is complete.
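The rule can be checked by comparing R's result with an explicit repetition of the shorter vector. The vectors below are illustrative:

```r
long  <- 1:6
short <- 1:3

# R recycles short as 1 2 3 1 2 3 to match the length of long
recycled <- long + short

# The same computation with the recycling written out by hand
manual <- long + rep(short, length.out = length(long))

recycled
identical(recycled, manual)
```

When the longer length is not a multiple of the shorter one, R still recycles but also issues a warning.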

Categorical Variable

► The factor function encodes your vector of discrete values into a factor:
f <- factor(v) # v can be a vector of strings or integers
► If your vector contains only a subset of possible values and not the entire universe,
then include a second argument that gives the possible levels of the factor:
f <- factor(v, levels)
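A short sketch of both calls, assuming a made-up vector of weekday abbreviations in which some days never occur:

```r
v <- c("Wed", "Thu", "Mon", "Wed", "Thu")

# Without levels, R infers them from the values present (sorted alphabetically)
f <- factor(v)
levels(f)

# With levels, the factor knows the full universe and its order
week_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri")
f_week <- factor(v, levels = week_levels)
table(f_week)    # Tue and Fri appear with a count of zero
```

Supplying the levels matters for summaries: table on f silently omits the unseen days, while table on f_week reports them as zero.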

Combining Multiple Vectors into One Vector and a Factor

► Create a list that contains the vectors. Use the stack function to combine the list
into a two-column data frame:
comb <- stack(list(v1 = v1, v2 = v2, v3 = v3)) # Combine 3 vectors
► The data frame’s columns are called values and ind. The first column contains the
data, and the second column contains the parallel factor.
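With three made-up vectors of different lengths, the result looks like this:

```r
v1 <- c(1, 2, 3)
v2 <- c(4, 5)
v3 <- c(6, 7, 8, 9)

comb <- stack(list(v1 = v1, v2 = v2, v3 = v3))

names(comb)        # "values" and "ind"
levels(comb$ind)   # One level per original vector
comb
```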

Creating a List

► To create a list from individual data items, use the list function:
lst <- list(x, y, z)
Selecting List Elements by Position
► Use one of these ways. Here, lst is a list variable:
lst[[n]]
Selects the nth element from the list.
lst[c(n1, n2, ..., nk)]
Returns a list of elements, selected by their positions.
► Note that the first form returns a single element and the second form returns a list.

Selecting List Elements by Name
► Use one of these forms. Here, lst is a list variable:
lst[["name"]]
Selects the element called name. Returns NULL if no element has that name.
lst$name
Same as previous, just different syntax.
lst[c(name1, name2, ..., namek)]
Returns a list built from the indicated elements of lst.
► Note that the first two forms return an element, whereas the third form returns a
list.

Building a Name/Value Association List
► The list function lets you give names to elements, creating an association between
each name and its value:
lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
► If you have parallel vectors of names and values, you can create an empty list and
then populate the list by using a vectorized assignment statement:
values <- c(1, 2, 3)
names <- c("a", "b", "c")
lst <- list()
lst[names] <- values

List

Removing an Element from a List
► Assign NULL to the element. R will remove it from the list.
Flattening a List into a Vector
► Use the unlist function to flatten all the elements of a list into a vector.
Removing NULL Elements from a List
► The compact function from the purrr package will remove the NULL elements.
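The three list operations can be sketched in base R. The list contents are made up, and Filter with Negate(is.null) is used here as a base R stand-in for purrr's compact (an assumption chosen so the example runs without extra packages):

```r
lst <- list(a = 1, b = NULL, c = "x", d = 4)

# Removing an element: assign NULL to it
lst[["d"]] <- NULL
names(lst)

# Removing NULL elements (base R stand-in for purrr::compact)
no_nulls <- Filter(Negate(is.null), lst)
names(no_nulls)

# Flattening a list into a vector
flat <- unlist(list(1, 2, c(3, 4)))
flat
```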

Removing List Elements Using a Condition

► Start with a function that returns TRUE when your criterion is met
and FALSE otherwise. Then use the discard function from purrr to remove values
that match your criterion. This code snippet, for example, uses the is.na function to
remove NA values from lst:
library(purrr)

lst <- list(NA, 0, NA, 1, 2)

lst %>%
  discard(is.na)
#> [[1]]
#> [1] 0
#>
#> [[2]]
#> [1] 1
#>
#> [[3]]
#> [1] 2
Initializing a Matrix

► Capture the data in a vector or list, and then use the matrix function to shape the
data into a matrix. This example shapes a vector into a 2 × 3 matrix (i.e., two rows
and three columns):
vec <- 1:6
matrix(vec, 2, 3)
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6

Performing Matrix Operations
t(A)
Matrix transposition of A
solve(A)
Matrix inverse of A
A %*% B
Matrix multiplication of A and B
diag(n)
An n-by-n diagonal (identity) matrix
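The four operations can be tried on a small invertible matrix. The entries of A are arbitrary values chosen for illustration:

```r
A <- matrix(c(2, 0, 1, 3), 2, 2)   # Filled by column: columns are (2, 0) and (1, 3)
B <- diag(2)                       # 2-by-2 identity matrix

t(A)                               # Transpose
A_inv <- solve(A)                  # Inverse
A %*% A_inv                        # Product of a matrix and its inverse
A %*% B                            # Multiplying by the identity leaves A unchanged
```

Note that A * B (without the percent signs) is element-by-element multiplication, not matrix multiplication.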

Descriptive Names to the Rows and Columns of a Matrix

► Every matrix has a rownames attribute and a colnames attribute. Assign a vector
of character strings to the appropriate attribute:
rownames(mat) <- c("rowname1", "rowname2", ..., "rownameN")
colnames(mat) <- c("colname1", "colname2", ..., "colnameN")

Selecting One Row or Column from a Matrix

► The solution depends on what you want. If you want the result to be a
simple vector, just use normal indexing:
mat[1, ] # First row
mat[, 3] # Third column
► If you want the result to be a one-row matrix or a one-column matrix,
then include the drop=FALSE argument:
mat[1, , drop=FALSE] # First row, one-row matrix
mat[, 3, drop=FALSE] # Third column, one-column matrix
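The difference shows up in the dim attribute of the result. A small sketch with a made-up 2 × 3 matrix:

```r
mat <- matrix(1:6, 2, 3)

row_vec <- mat[1, ]                  # Plain vector: dimensions are dropped
row_mat <- mat[1, , drop = FALSE]    # Still a matrix with one row

dim(row_vec)    # NULL
dim(row_mat)    # 1 3
```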

Initializing a Data Frame from Column Data

► If your data is captured in several vectors and/or factors, use
the data.frame function to assemble them into a data frame:
df <- data.frame(v1, v2, v3, f1)
► If your data is captured in a list that contains vectors and/or factors,
use as.data.frame instead:
df <- as.data.frame(list.of.vectors)

Initializing a Data Frame from Row Data
► Store each row in a one-row data frame. Use rbind to bind the rows into one large
data frame:
rbind(row1, row2, ..., rowN)
Appending Rows to a Data Frame
► Create a second, temporary data frame containing the new rows. Then use
the rbind function to append the temporary data frame to the original data
frame.
Selecting Data Frame Columns by Position
► Use the select function:
df %>% select(n1, n2, ..., nk)
where df is a data frame and n1, n2, ..., nk are integers
with values between 1 and the number of columns.

Selecting Data Frame Columns by Name

► Use select and give it the column names:
df %>% select(name1, name2, ..., namek)
Changing the Names of Data Frame Columns
► The rename function from the dplyr package makes renaming pretty easy:
df %>% rename(newname1 = oldname1, ..., newnamen = oldnamen)
where df is a data frame, oldnamei are names of columns in df,
and newnamei are the desired new names.
► Note that the argument order is newname = oldname.
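Selecting and renaming can be sketched on a small data frame. The column names v1, v2, v3 and the new names id and flag are made up for illustration:

```r
library(dplyr)

df <- data.frame(v1 = 1:3,
                 v2 = c("a", "b", "c"),
                 v3 = c(TRUE, FALSE, TRUE))

by_name <- df %>% select(v1, v3)                # Select columns by name
by_pos  <- df %>% select(1, 3)                  # Same columns, by position
renamed <- df %>% rename(id = v1, flag = v3)    # newname = oldname

names(renamed)
```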

Removing NAs from a Data Frame

► Use na.omit to remove rows that contain any NA values:
clean_dfrm <- na.omit(dfrm)
Excluding Columns by Name
► Use the select function from the dplyr package with a dash (minus sign) in front of
the name of the column to exclude:
select(df, -bad) # Select all columns from df except bad
Combining Two Data Frames
► To combine the columns of two data frames side by side, use cbind (column bind):
all.cols <- cbind(df1, df2)
► To “stack” the rows of two data frames, use rbind (row bind):
all.rows <- rbind(df1, df2)
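The base R recipes on this slide (na.omit, cbind, rbind) can be tried together. All data frames below are made up; note that cbind needs the same number of rows in both inputs and rbind needs the same columns, so the sketch uses a separate df3 for the row case:

```r
# Removing rows with NA values
dfrm <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
clean_dfrm <- na.omit(dfrm)        # Only the first row is complete

# Side by side: both data frames have two rows
df1 <- data.frame(id = 1:2, x = c(10, 20))
df2 <- data.frame(y = c("a", "b"))
all.cols <- cbind(df1, df2)

# Stacked: both data frames have the columns id and x
df3 <- data.frame(id = 3:4, x = c(30, 40))
all.rows <- rbind(df1, df3)

dim(all.cols)
dim(all.rows)
```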

Merging Data Frames by Common Column
► We can use the join functions from the dplyr package to join our data frames
together on a common column.
► If you want only rows that appear in both data frames, use inner_join.
inner_join(df1, df2, by = "col")
where "col" is the column that appears in both data frames.
► If you want all rows that appear in either data frame, use full_join instead.
full_join(df1, df2, by = "col")
► If you want all rows from df1 and only those from df2 that match, use left_join:
left_join(df1, df2, by = "col")
► Or to get all records from df2 and only the matching ones from df1,
use right_join:
right_join(df1, df2, by = "col")
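The four joins differ only in which unmatched rows they keep, which is easy to see from the row counts. The two data frames here are made up so that they share the keys "b" and "c":

```r
library(dplyr)

df1 <- data.frame(col = c("a", "b", "c"), x = 1:3)
df2 <- data.frame(col = c("b", "c", "d"), y = c(20, 30, 40))

inner <- inner_join(df1, df2, by = "col")   # Keys in both: b, c
full  <- full_join(df1, df2, by = "col")    # Keys in either: a, b, c, d
left  <- left_join(df1, df2, by = "col")    # All keys from df1: a, b, c
right <- right_join(df1, df2, by = "col")   # All keys from df2: b, c, d

c(nrow(inner), nrow(full), nrow(left), nrow(right))
```

Unmatched rows are kept with NA in the columns that come from the other data frame.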

Converting One Atomic Value into Another
► For each atomic data type, there is a function for converting values to that
type. The conversion functions for atomic types include:
• as.character(x)
• as.complex(x)
• as.numeric(x) or as.double(x)
• as.integer(x)
• as.logical(x)
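A few sample conversions, with input values chosen for illustration:

```r
num  <- as.numeric("3.14")     # String to number
int  <- as.integer(3.99)       # Truncates toward zero, does not round
chr  <- as.character(101)      # Number to string
lgl  <- as.logical("TRUE")     # String to logical
zero <- as.logical(0)          # Zero is FALSE; any other number is TRUE

# A value that cannot be converted becomes NA (R also issues a warning)
bad <- suppressWarnings(as.numeric("foo"))
```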

Converting One Structured Data Type into Another

► These functions convert their argument into the corresponding structured
data type:
• as.data.frame(x)
• as.list(x)
• as.matrix(x)
• as.vector(x)
► Some of these conversions may surprise you, however. We suggest you
review

Module 4 Assignment Requirements

❑ Discussion Board
oTopic
oKey concepts
❑ Assignment
o Topic
o Key concepts
❑ Quiz
o Key concepts

This concludes our live session.
Thank you for your attendance!
Questions

Take advantage of this opportunity
to seek further clarification.

Next Live Session

• <Insert date for next Live Session.>


• <Insert time for next Live Session.>
