Introduction to Data Science in Biostatistics: Using R, the Tidyverse Ecosystem, and APIs
Thomas W. MacFarland
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
audience of this text, will help those who are still new to biostatistics and in turn
improve retention of students and career advancement of future biostatisticians.
Data scientists provide value beyond the immediate. Following along with this
concept, value is added to this text in that most lessons are enhanced by greatly
detailed addenda, often multiple addenda in each lesson. New ideas, exposure to
new tidyverse ecosystem packages and functions, and needed skills are gradually
addressed with each advancing lesson. The addenda often introduce, reinforce, and
expand on specialized tidyverse ecosystem packages and functions that go beyond
what was previously presented, address parametric versus nonparametric approaches
toward data, and often include practice data sets that support incremental engage-
ment in pursuit of advanced skills.
Of equal importance to those with interest beyond a cursory introduction to
biostatistics, many challenge activities are included throughout each lesson and the
addenda. The first challenges are simple and should be completed successfully by
all. Later in the text, the challenges become more detailed, calling for creative
attempts to achieve success, with some challenge activities purposely lacking
complete guidance. The later challenges are often worded so that there is no single
correct approach to resolution; instead, they allow for multiple approaches. A few
of the last challenges also call for individual inquiry into more advanced topics and
resources on the use of R with biostatistics than what is presented in the text. These
later challenges will indeed be challenging, but data scientists face challenges daily.
Additional value is also added in that each external dataset mentioned and used
has been placed at the publisher’s Web site for this text. These datasets are easily
available for download, and their inclusion makes it possible to follow along with
the syntax presented in the text. Ideally, use the syntax and provided datasets to
reproduce the outcomes shown in this text. Then, go beyond the original syntax and
try different approaches to data organization, experiment with other data analysis
approaches, and consider additional functions and function arguments to produce
even more enhanced figures and maps, etc. Take on the role of a data scientist and
add value beyond base requirements.
The use of R and specifically the use of APIs and R’s evolving tidyverse ecosystem
for engagement in biostatistics is the focus of this text. By following along with a
gradual exposure to R, APIs, and the tidyverse ecosystem, this text should help
beginning students and early-stage researchers gradually increase their skills with
the use of R syntax for inquiries into biostatistics.
The first lesson of this text is somewhat unique in that it looks closely at the way
data science is viewed as an emerged (not emerging) discipline in higher education.
The United States Department of Education has a hierarchical coding system for the
way academic majors are organized, and from this system, a large collection of
majors that call for some degree of expertise in data science is identified. These
majors are then put into context by the hierarchical coding system used by the
United States Department of Labor and the eventual transition from academic prep-
aration to careers. Although higher education has experienced a noticeable decline
in enrollment over the last few years, that is not the case for data science. There is
clearly an increase in interest in data science, not surprisingly due to the growth of
data science as a career opportunity. Employment in data science is projected to
grow and salaries are projected to increase. A few basic summaries on the use of R
and data types are also stressed in the first lesson, as either a recapitulation for those
with prior exposure to R or as an introduction for those who are not as well versed
in the use of R and how data are viewed.
The next two lessons look closely at data. A summary of possible data sources
related to biostatistics is the focus of the second lesson. Although it may seem intui-
tive to those with experience, beginning students and early-stage researchers need
to know that there are many resources that either provide data that may totally meet
needs as inquiries are attempted, or the data may serve as a useful proxy. Government
data sources are especially valued and are stressed in this lesson. Knowing possible
data resources, the third lesson stresses a curated ten-point process for statistical
analyses, with an emphasis on how these processes are used in an all-inclusive
demonstration of statistical tests.
The process stressed in the third lesson leads to a more detailed introduction to
the tidyverse ecosystem in the fourth lesson. Emphasis is placed on how the
Preface
I cannot begin to adequately thank the many individuals who contribute to the open-
source paradigm and the countless number of hours given freely to software devel-
opment and management, often for little if any financial remuneration and only rare
acknowledgment by deans and supervisors as a metric for career advancement.
These individuals put the profession and the advancement of science first, often at
the cost of time away from personal pursuits.
I also want to recognize all at Springer who assisted with this text, editors Laura
Aileen Briskman and Faith Su and the many staff members, domestic and foreign,
who turn disparate files into a cohesive final text. To all – thank you
for your many ideas, feedback, help, and supporting efforts.
Contents
Identify and Organize the Data and All Relevant Variables ... 149
Outline Potential Approach(s) for Analyses and Consider Alternate Approaches ... 150
Put Plans into Action, with Frequent Checks for Quality Assurance ... 150
Individual Review of All Outcomes ... 150
External Review of Outcomes Whenever Possible ... 151
Report at an Appropriate Level for the Intended Audience ... 151
Debrief to Establish Processes for Future Improvements ... 151
General Approach When Using R for Statistical Analysis ... 152
Exploratory Graphics ... 152
Exploratory Descriptive Statistics and Measures of Central Tendency ... 152
Exploratory Analyses ... 153
Addendum: Use Inferential Statistics and R Syntax to Address Differences in Percentage Deaths from COVID-19 by the Urban v Rural Continuum ... 153
External Data and/or Data Resources Used in This Lesson ... 173
4 Data Science and R, Base R, and the tidyverse Ecosystem ... 175
Workflow for Reproducible, Efficient, and Accurate Analyses and Presentations ... 175
Base R ... 179
The tidyverse Ecosystem ... 181
The tidyverse Ecosystem as an Idea and the Need for Tidy Data ... 182
The Core tidyverse Ecosystem as a Set of Tools in R Packages for Data Science ... 184
Auxiliary Packages Outside of the Core tidyverse Ecosystem ... 185
Addendum 1: Complex Data Set on Birth Rates Easily Accommodated by Using the tidyverse Ecosystem ... 187
Addendum 2: Complex Data Set on Gross Domestic Product (GDP) and Comparison to Birth Rates by Using the tidyverse Ecosystem ... 206
Addendum 3: Individual Initiative of Planned Workflow, Analyses, and Graphical Presentations ... 213
Addendum 4: Essential tidyverse Ecosystem Functions That Every Data Scientist Should Master ... 217
External Data and/or Data Resources Used in This Lesson ... 219
5 Statistical Analyses and Graphical Presentations in Biostatistics Using Base R and the tidyverse Ecosystem ... 221
Overview of Using R for Statistical Analysis ... 221
Background ... 221
Import Data ... 222
Code Book and Data Organization ... 223
Index ... 525
Chapter 1
Emergence of Data Science as a Critical
Discipline in Biostatistics
This text is about data science. This text is specifically about the way data science is
deployed by those who work in the biological sciences, using R and R’s associated
tidyverse ecosystem. This text also provides multiple examples of the use of APIs
(Application Programming Interfaces), where selected API functions (e.g., data
retrieval clients) from R packages provide an efficient and reproducible way to
obtain extant data from external resources (a generic sketch of this retrieval pattern
follows the list below):
• This text is not only about algorithms, computing, computer science, computing
hardware, computing software, computing infrastructure, and programming,
although algorithms, computing, computer science, computing hardware, com-
puting software, computing infrastructure, and programming all have a major
role in data science.
• This text is not only about data, although data clearly have a major role in data
science.1
• This text is not only about mathematics, although mathematics has a major role
in data science.
1 In this text, the term data is nearly always used in a plural sense of the term (e.g., the data have
been recorded.) and the term datum is nearly always used in a singular sense of the term (e.g., the
datum has been recorded.). This approach follows along with the Latin origin of the term(s) datum
and data. The term datapoints is occasionally used, but the term datums is avoided. It is recognized,
however, that the term data is regularly seen in the literature for both singular and plural use. Going
beyond its use in statistics and data science, the term data, in the plural sense of the term, is often
viewed as a mass noun, indicating either a substance or quality that cannot be counted.
• This text is not only about statistics, although statistics has a major role in data
science.
• This text is not only about analytics, although analytics has a major role in data
science.
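
As a generic illustration of the API retrieval pattern mentioned above, the sketch below issues an HTTP request and parses the JSON reply; the endpoint, query parameters, and object names are hypothetical, and the lessons that follow rely instead on purpose-built client functions from specific R packages.

library(httr)      # low-level HTTP requests
library(jsonlite)  # JSON parsing

# Hypothetical endpoint and query parameters, for illustration only
response <- GET("https://api.example.gov/v1/observations",
                query = list(series = "example_series", format = "json"))
stop_for_status(response)                  # halt early on any HTTP error

payload      <- content(response, as = "text", encoding = "UTF-8")
observations <- fromJSON(payload)          # convert JSON text to an R object
str(observations)                          # inspect the returned structure

Because every step of the retrieval is expressed in syntax, the request can be rerun and audited, which is the efficiency and reproducibility advantage noted above.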
Instead, this text is about data science for those who work in the biological sci-
ences and choose to use the R language and its associated tidyverse ecosystem and
associated APIs. With this degree of context as to the purpose of this text:
• Data science is viewed as a multidisciplinary means of using data and associated
support structures and processes in creative ways that look for otherwise undis-
covered patterns and from that discovery not only solve current problems but
also provide insight into possible next steps regarding future desired outcomes.
• Within the paradigm of data science, it is argued that mathematics, statistics, and
analytics, although critically important, are backward looking.
• In contrast, data science is forward looking, by using processes, outcomes, and
associated data from the past. As always, remember the well-known expression
past behavior is the best predictor of future behavior.
Ultimately, it is often said that data science gives value to data, with value seen
in the biological sciences in terms of new methodologies and processes that ulti-
mately improve the human condition and mitigate threats of various types, result in
new products such as medicines and therapies, and contribute to improved efficien-
cies in agriculture, biology, environmental studies, medicine, public health, etc.
With adequate skills in data science, it is possible to focus on insight that can be
gained from limited as well as massive amounts of data (e.g., Big Data).
Compared with more well-established sciences, data science is a relatively new field
of study:
• It is often suggested that the impetus for the emergence of data science grew out
of Tukey’s early 1960s publication The Future of Data Analysis.
• By the early 1970s, the term data science was used by Naur (Concise Survey of
Computer Methods) and soon after by others.
• By the late 1970s and into the 1990s, the term data science was found in publica-
tions, conference papers, and other literature. Related professional associations
that focused on data science also emerged during the late 1970s and into
the 1990s.
• As computing machinery, software, and systems improved, there were growing
trends such that by the late 1990s and into the early 2000s more than a mere few
professionals used the term data scientist instead of computer scientist or statis-
tician as an official job title. Concurrently, journals devoted to data science also
emerged at this time.
• Finally, by 2020, the United States Department of Education recognized data sci-
ence as an emerged (not emerging) field of study and assigned a Classification of
Instructional Programs (CIP) code and definition for data science: CIP Code
30.7001; Data Science, General; a program that focuses on the analysis of large-
scale data sources from the interdisciplinary perspectives of applied statistics, com-
puter science, data storage, data representation, data modeling, mathematics, and statistics.
The State of Data Science and the Need for Data Scientists
Definition of Data
The etymology of the term data (plural) is derived from the Latin term datum (sin-
gular), meaning "given." The classical use of the term(s) datum and data was "given
fact." By the late 1940s, the terms datum and data were used in a computational
fashion, eventually evolving into corollary terms such as data processing, database,
and data entry.
For this text, the terms datum (singular) and data (plural) are defined as informa-
tion that describes an identified attribute. It is recognized that this definition of the
terms datum and data is quite broad, but data scientists work with many types of
data, necessitating recognition of the broad nature of data. Data can take many
forms, and in R, data scientists often work with various data structures (covered in
more detail later in this lesson) and data types. Base R and the tidyverse ecosystem
have useful packages and functions for working with the many types of data typi-
cally encountered.
2 Like uniform coding practices by nearly all other federal agencies, the United States Bureau of
Labor Statistics Standard Occupational Classification (SOC) system is a uniform coding system
that consists of nearly 900 codes, organized in a hierarchy, used to classify worker occupations.
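The assignments that created the vectors examined below are not part of this excerpt. A minimal reconstruction, inferred from the printed output that follows, might look like the lines shown here; the character and integer values match the output, while the numeric and logical values are assumptions for illustration.

CharacterTypeVector <- c("Jane", "Mary", "Esther")  # character data
IntegerTypeVector   <- c(1L, 3L, 123L)              # integer data (note the L suffix)
NumericTypeVector   <- c(1.25, 3.50, 123.75)        # assumed double-precision values
LogicalTypeVector   <- c(TRUE, FALSE, TRUE)         # assumed logical values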
CharacterTypeVector
base::attributes(CharacterTypeVector) # Metadata
NULL
base::class(CharacterTypeVector) # Object type
[1] "character"
base::length(CharacterTypeVector) # Number of datapoints
[1] 3
utils::str(CharacterTypeVector) # Internal structure
chr [1:3] "Jane" "Mary" "Esther"
base::typeof(CharacterTypeVector) # Object type
[1] "character"
IntegerTypeVector
[1] 1 3 123
base::attributes(IntegerTypeVector) # Metadata
base::class(IntegerTypeVector) # Object type
base::length(IntegerTypeVector) # Number of datapoints
utils::str(IntegerTypeVector) # Internal structure
base::typeof(IntegerTypeVector) # Object type
3 Throughout this text, the color green indicates R-based input and the color red indicates
R-based output.
4 In an effort to save space, look for the expression "Selected sections of output were deleted to
save space." Even so, copy and paste the R-based syntax to replicate outcomes.
NumericTypeVector
base::attributes(NumericTypeVector) # Metadata
base::class(NumericTypeVector) # Object type
base::length(NumericTypeVector) # Number of datapoints
utils::str(NumericTypeVector) # Internal structure
base::typeof(NumericTypeVector) # Object type
LogicalTypeVector
base::attributes(LogicalTypeVector) # Metadata
base::class(LogicalTypeVector) # Object type
base::length(LogicalTypeVector) # Number of datapoints
utils::str(LogicalTypeVector) # Internal structure
base::typeof(LogicalTypeVector) # Object type
Practice with other, self-created actions used to create data. When completed, enter
q() (an alias for the quit() function) at the R prompt to quit the session. Then
respond to the Save workspace image? prompt: Yes – No – Cancel.
Some prior experience with any programming language, and ideally R, will help
for those who use this text as an aid for first inquiries into data science. If needed,
those with limited experience with R should consider the many resources available
for users who are new to this language.
install.packages("swirl", dependencies=TRUE)
# While connected to the Internet, start the download
# process. As a hint, a prompt will show asking for the
# selection of a CRAN mirror site, the place(s) where R
# packages are available. Most users select the most
# local site.
library("swirl")
# Once a package is downloaded it is still necessary to
# put the package into use. With R, this process is put
# into place by using the library() function.
#
# Be patient. It may take a few minutes to download and
# install the swirl package.
swirl()
# Follow along with the text displayed on the screen.
Data are generated in ways that many do not even realize, and increasingly these
data have become a valued commodity:
• Think of the actions needed to purchase food items at a grocery store. Early on,
the individual owner of a local grocery store had a sense of inventory control by
using paper-based hand tallies of available stock and from this knowledge orders
were placed using a telephone to connect with a food distributor, calling out
requirements item by item. As grocery stores became larger, to gain efficiencies
5 Similar to other software associated with R, the swirl package is free and open source software
(FOSS). The package is free to download and install. The syntax associated with the package can
also be viewed, to better understand the package and possibly improve upon it.
in food distribution, dedicated managers at the local level were needed to keep up
with inventory control. However, in the mid-1970s, grocery supermarkets gained
such size that newly developed barcodes were introduced and placed on food
packaging. It was no longer necessary to place adhesive labels on each item,
identifying the current cost. Equally, cashiers no longer needed to hand enter the
price of each item at checkout but instead merely scanned each package, placing
the barcode over the scanner to add each item to the bill. Most importantly, the
data generated from each scan at the checkout line were sent to a central location
serving all stores in the grocery chain, facilitating efficiencies in inventory con-
trol, automated ordering, loss prevention assessment, etc. Now, radio frequency
identification (RFID) technologies are being explored to add additional efficien-
cies to the business operations of food distribution, from farm to table and all
points along the food distribution network – a data focused logistics web of
increasing complexity that is challenged with freshness and spoilage issues that
are unique to food distribution.
• Consider contemporary smart phones, the use of health-related apps made avail-
able at time of purchase of these phones, certain third-party apps purposely
downloaded from some type of online app provider, and how the default settings
of these apps are often set to maintain an automated tally of daily steps while
either walking or jogging. By themselves, these data may be interesting to the indi-
vidual user and could be helpful as part of an exercise regimen. However, if it were
possible for an exercise- or health-related company to obtain these data, then the
data could be monetized through the distribution of unsolicited email or other
means to those who record 3000 or more steps per day, advertising
running shoes, exercise apparel, organic high energy food bars, etc. How does
this happen? Depending on the app, locale of the user, and associated local and
national governance over the involved process (or lack of governance), the data
are quite possibly legally obtained by automated transfer to a commercial entity,
organized by a data broker, sold to a commercial business, used by a data science
team, and then sold to interested health product and health service businesses.
The result is that private individuals may receive unsolicited messages for prod-
ucts and services that may (or may not) be desired, a cost-effective form of direct
marketing, often with very satisfactory results for some individuals but an obtrusive
and unwanted disturbance to others.6 Unwanted messages
6 Some apps, of all types and not only those related to health, have automated background sharing
of generated data (e.g., daily number of steps, as in the current example) as a default setting. Yes,
there is often a way of disabling automated data sharing, but many users either do not know about
this option or find it difficult to disable automated data sharing. There is a growing movement by
many national governments to respond to this unwanted data harvesting process – the right to be
left alone. A common remedy is that app developers must make default automated data sharing
prominently known at the time of app download or first use, and it must be as easy to disable
default automated data sharing as it is to enable this process. There are a few national governments
that have applied large fines against companies associated with the digital advertisement industry
that, unknown to the user, obtain app-generated data without express permission, but the univer-
sal application of these protective practices is still uncertain. Regardless of the appropriate use of
large-scale background data harvesting, it is only possible due to the emergence of data science
applications.
Data science bridges the output of statistical analyses and other forms of analyt-
ics to gain insight and in turn support future problem solving. Again, consider the
expression Past behavior is the best predictor of future behavior. Data science uses
extant data, often exceptionally large amounts of complex data in various formats
and from various sources, to make discoveries that help justify decisions about
future actions.
Biostatistics is one of the many support mechanisms (some may say the most impor-
tant, but of course, that statement is subject to opinion and discussion) for the
broader discipline of data science in the biological sciences. There are conceivably
as many definitions of biostatistics as there are those who would attempt to make a
definition of this discipline. For the purposes of this text, biostatistics is defined as
a process (e.g., a distinct set of activities) where measures of many different types
are gained from, about, or in association with biological organisms, and these mea-
sures are then subjected to critical examination using various analyses, analyses that
usually involve some type of mathematical focus. Enabling actions associated with
biostatistics include, in part:
• Broad processes where biologically oriented problems are considered, through
direct inquiry, collaboration with colleagues, literature review, etc.
• Actionable methods are considered, developed, refined, and later implemented to
obtain reliable and valid data related to the identified problem(s).
• Specific actions are used to put the obtained data into usable formats.
• Software and computing machinery are used against the data to perform appro-
priate analyses, such as descriptive statistics (e.g., calculation of mean, standard
deviation, median, mode, range, and frequency distribution), inferential statistics
(e.g., Chi-square, Student’s t-test for either matched pairs or independent sam-
ples, analysis of variance), and measures of association (correlation and regres-
sion, etc.); a brief generic illustration in R follows this list.
• Software and computing machinery are used against the data to generate appro-
priate graphics, to visualize outcomes that may not be readily evident when
viewing numeric analyses only.
• Consideration of outcomes is used to provide some degree of analysis, interpre-
tation, and insight into outcomes, both those outcomes that are obtained by
numeric analyses and those outcomes that are visualized using graphical presen-
tation (e.g., figures and maps).
• Using the five-chapter model (e.g., introduction, literature review, methods,
results, and interpretation and conclusions) and its many derivations, collective
efforts are shared with others (e.g., internal distribution of a summary memo,
symposia poster session, conference presentation, journal article, and book pub-
lication) as part of a communication and quality assurance process that invites
discussion and feedback for the purpose of continuous inquiry and improvement.
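
As a brief, generic illustration of the kinds of analyses named in the list above (R's built-in iris data are used here; the text's own worked examples appear in later lessons):

data(iris)                                           # built-in practice dataset

# Descriptive statistics and measures of central tendency
mean(iris$Sepal.Length); sd(iris$Sepal.Length); median(iris$Sepal.Length)

# Inferential statistics: Student's t-test for two independent samples
setosa     <- subset(iris, Species == "setosa")$Sepal.Length
versicolor <- subset(iris, Species == "versicolor")$Sepal.Length
t.test(setosa, versicolor)

# Measure of association: correlation between two continuous measures
cor(iris$Sepal.Length, iris$Petal.Length)

# Exploratory graphic, to visualize what the numbers alone may not show
boxplot(Sepal.Length ~ Species, data = iris)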
Although biostatistics is perhaps the most common term used today to describe
these constructs, an earlier term for what is now considered biostatistics was
biometry, a term that may appear in older literature. Investigations into life
expectancy, going back to the 1800s, were first associated with the term biometry.
Those with a special interest in biostatistics and how it emerged as an area of scien-
tific inquiry should look into the life stories of Florence Nightingale, John Snow,
William Farr, Francis Galton, Karl Pearson, Thomas Junius Calloway, Ronald
Aylmer Fisher, Claude Shannon, and others – individuals who either developed
mathematical processes for problem identification, data collection, and statistically
oriented analysis of biological phenomena or developed processes and models for
the visual presentation of outcomes and potential impact(s) to fellow scientists, leg-
islators and government officials, the press, and eventually the public.
Although by no means an exhaustive list, biostatistics and increasingly the link-
age between biostatistics and data science includes inquiries into the following
disciplines:
• Agriculture: animals (including veterinary science), plants, soils, storage and distribution, etc.
• Biology
• Epidemiology
• Health Science
The United States Department of Education (DOE) provides many resources that
offer a sense of the growing level of interest in data science and specifically those
data science programs of study that are dependent on the use of biostatistics.7 The
two primary DOE resources are the Classification of Instructional Programs (CIP)
and the Integrated Postsecondary Education Data System (IPEDS):
• The first implementation of Classification of Instructional Programs (CIP, https://
nces.ed.gov/ipeds/cipcode/) coding started in 1980, with updates in 1985, 1990,
2000, 2010, and 2020. CIP codes are based on a two-digit, four-digit, and six-
digit coding system of increasing granularity and were developed to communi-
cate the nature of programs of study in postsecondary education, from a broad
7 The United States Department of Labor (Labor), which is detailed later in this lesson, is also an
excellent resource for information on career opportunities and required skills and education for
those who wish to work in data science. As an advance organizer to information presented later in
this lesson, review Occupational Employment and Wages - 15-2051 Data Scientists (https://www.bls.gov/oes/current/oes152051.htm) and notice the salaries for data scientists (national and
selected regions), job distribution throughout the nation, etc. Again, far more detail is provided
later in this lesson.
8 Review the detailed description for Medical Informatics (https://nces.ed.gov/ipeds/cipcode/cipdetail.aspx?y=55&cipid=87670) and follow the progression of detail: CIP Code 51, Health
Professions and Related Clinical Sciences; CIP Code 51.27, Medical Illustration and Informatics;
and CIP Code 51.2706, Medical Informatics.
9 Although the Classification of Instructional Programs was developed specifically for educational
purposes, the use of CIP codes is found throughout the many agencies, institutions, businesses, and
other entities that have some degree of interest in throughput of students in postsecondary educa-
tion, including: Department of Commerce, Bureau of the Census; Department of Education, Office
of Career, Technical, and Adult Education; Department of Education, Office for Civil Rights;
Department of Education, Office of Federal Student Aid; Department of Education, Office of
Special Education; Department of Homeland Security; Department of Labor, Bureau of Labor
Statistics; National Academy of Sciences; National Institutes of Health; National Occupational
Information Coordinating Committee; National Science Foundation; etc. CIP codes are also cre-
atively used by state departments of education and other state agencies, national organizations and
professional groups, higher education institutions, technical institutions, and the many businesses
that provide employment services.
10 The normal lag from data submission by individual postsecondary institutions to public access of
the data at the IPEDS Peer Analysis System has been noticeably extended due to the prior and
continuing challenges brought about by COVID-19 and the still frequent practice of work from
home (WFH) by many postsecondary education information workers and government counter-
parts, including external contracted workers. As an example, deadlines for IPEDS data submission
were extended, and it is expected that these extensions will ultimately impact the timeliness of data
availability. These issues related to data availability are of course not restricted to the United States
Department of Education but are endemic to many data resources, like the way COVID-19 has
become increasingly endemic.
topics as diverse as tuition and other student charges, fall term enrollment, institu-
tional control (e.g., public, private not for profit, private for profit), completions,
financial aid, faculty and staff headcounts, finance (e.g., revenue and expenses),
library resources, etc. – topics that those in data science and other programs of study
should consider. IPEDS data are highly transparent, and the data are available to the
public worldwide and require no notification of access, no permissions for access,
and no key or other user-identified authentication for access.
IPEDS data are not yet available through use of a function-type (e.g., client)
R-based Application Programming Interface (API). Instead, data are gained by
interaction with a Graphical User Interface (GUI) menu-type selection process.11
When all selections are completed it is possible to download an IPEDS dataset that
consists of data for hundreds of variables against thousands of postsecondary insti-
tutions. An individual dataset gained from the IPEDS Peer Analysis System is
always initially downloaded as a comma separated values (.csv) file, with the data
in a rectangular row by column format. With a .csv file format as the default, it is
then an easy task to put IPEDS data into other file formats, such as .xlsx file format,
if desired.12
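
A minimal sketch of that two-step workflow is shown below; the file name is hypothetical, and the writexl package is only one of several options for .xlsx output.

library(readr)    # tidyverse-style CSV import
library(writexl)  # one of several packages that write .xlsx files

# Hypothetical file name; use whatever name the IPEDS Peer Analysis System assigns
ipeds_data <- read_csv("ipeds_download.csv")

# The imported object is already rectangular (rows = institutions, columns =
# variables), so conversion to .xlsx, if desired, is a single call
write_xlsx(ipeds_data, "ipeds_download.xlsx")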
Using available six-digit Classification of Instructional Programs (CIP) informa-
tion about programs of study (e.g., academic programs, academic majors, curricula,
and disciplines) and CIP-related data gained from the IPEDS Peer Analysis System,
the following programs of study were selected for emphasis in this lesson as a sam-
ple, with each identified as being among those that require some degree of knowl-
edge about biostatistics in relation to the overarching use of tools associated with
data science. These programs of study represent many possible levels of instruction,
ranging from short-course certificate programs to multi-year terminal degree gradu-
ate and professional programs, but all have some degree of focus on the efficient
management and use of data (e.g., data science) and the general discipline of
biostatistics:13
11 The IPEDS Peer Analysis System offers a Save session option for replication of consistently
structured queries, but by no means does this option begin to equal the convenience and quality
assurance of an R-based API data query process based on reproducible syntax, as will be seen in
later lessons with other data resources.
12 These datasets are in wide format, but tidyverse ecosystem tools can be used to put the data into
long format, a tidy approach to dataset structure. This practice is demonstrated multiple times in
this text, in later lessons.
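
As a minimal sketch of that wide-to-long reshaping, using a small hypothetical extract rather than an actual IPEDS download:

library(tidyr)

# Hypothetical wide-format extract, for illustration only
ipeds_wide <- data.frame(
  institution = c("College A", "College B"),
  enroll_2019 = c(12000, 5400),
  enroll_2020 = c(11500, 5100)
)

# One row per institution-year combination: a tidy, long-format structure
ipeds_long <- pivot_longer(ipeds_wide,
                           cols         = starts_with("enroll_"),
                           names_to     = "year",
                           names_prefix = "enroll_",
                           values_to    = "enrollment")
ipeds_long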
13 From among many possible resources, refer to the United States Department of Homeland
Security STEM Designated Degree Program list (https://www.ice.gov/sites/default/files/documents/stem-list.pdf) for an audit of the more than 500 CIPs associated with STEM (science, tech-
nology, engineering, and mathematics) that are recognized by this federal executive cabinet-level
department. Many CIPs on this extensive list require exposure to data science and acquaintance
with biostatistics.
14 This listing and many others in this text that are similar should not be viewed as a word-processed
table. Instead, text of this type represents output from some type of computing activity, with some
degree of modification for presentation purposes. Accordingly, the text shows in red, following
along with the identification of input and output used throughout this text.
15 At the time these figures were prepared, Academic Year 2018–2019 was the ending point for the
availability of CIP-specific IPEDS data on completers. Even if the data were updated by the time
this text becomes available, it is best to question if completions data for Academic Year 2019–2020
and onward for the next few years are typical and appropriate for year-by-year comparisons. From
among the many social outcomes of reactions to COVID-19, higher education enrollment patterns
were greatly stressed as communities went into lockdown, students went home, and many students
suspended their studies, moved on, and may not return to higher education. It will likely be a few
years before postsecondary education enrollment patterns, throughput, and completions return to
expected patterns.
Challenge: Syntax used to generate figures associated with these data is pre-
sented in Addendum 1. Use the syntax to replicate figures for all 30 CIP six-digit
academic majors. More detail on syntax, such as what is shown in Addendum 1, will
be provided in later lessons. In these early stages of acquaintance with data science
and R, focus on process and come back to the syntax later if needed.
Within the United States Department of Labor, the Employment and Training
Administration has responsibility for Occupational Information Network (O*NET),
a hierarchical coding system that describes occupations. Not surprisingly, the
Occupational Information Network process parallels the way the previously
described National Center for Education Statistics, within the United States
Department of Education, also maintains a hierarchical coding system that describes
programs of study, the Classification of Instructional Programs (CIP).16,17
It is beyond the purpose of this lesson to provide detailed instruction for use of
the Occupational Information Network (O*NET), but for now consider how this
process deconstructs jobs into occupational characteristics and worker requirements
that address three key components for each job:
• The term knowledge represents the cognitive capabilities needed to succeed at an
expected level of performance.
• The term skills is related to the psychomotor behaviors needed to succeed at an
expected level of performance.
• The term abilities refers to the affective dispositions needed to succeed at an
expected level of performance.
16 Similar hierarchical coding systems are used throughout the many departments, bureaus, agen-
cies, offices, etc. associated with the United States federal government. The data maintained by
these entities should always be considered for use, either as a primary source or as a proxy that at
least provides guidance, if possible. Data scientists quickly learn about the use of data resources
that are convenient and freely available if the available data meet needs.
17 External resources related to biostatistics that allow access to data that are highly vetted, reliable,
and valid are identified in a later lesson.
18 Observe how Occupational Information Network (O*NET) codes use a different numbering
sequence than the Classification of Instructional Programs (CIP) codes. Even so, there is structure
(e.g., hierarchy) to O*NET codes, and with experience it is possible to learn more about the
requirements for each specific job, regardless of how the job is coded by a local employer. Take the
time to review, for at least a few selected jobs, highly detailed information on: (1) employment
estimate and mean wage, (2) percentile wage estimates, (3) industries with the highest concentration
of employment, (4) top paying industries, (5) states with the highest employment level, (6) states
with the highest concentration of jobs and location quotients, (7) top paying states, (8) metropolitan
areas with the highest employment level, (9) metropolitan areas with the highest concentration of
jobs and location quotients, (10) top paying metropolitan areas, (11) nonmetropolitan areas with the
highest concentration of jobs and location quotients, and (12) top paying nonmetropolitan areas.
OCC_CODE OCC_TITLE
=============================================================
15-2041 Statisticians
17-2031 Bioengineers and Biomedical Engineers
19-1011 Animal Scientists
19-1012 Food Scientists and Technologists
19-1013 Soil and Plant Scientists
19-1020 Biological Scientists
19-1021 Biochemists and Biophysicists
19-1022 Microbiologists
19-1023 Zoologists and Wildlife Biologists
19-1029 Biological Scientists, All Other
19-1032 Foresters
19-1040 Medical Scientists
19-1041 Epidemiologists
19-4010 Agricultural and Food Science Technicians
19-4021 Biological Technicians
19-4040 Environmental Science and Geoscience Technicians
19-4092 Forensic Science Technicians
25-1040 Life Sciences Teachers, Postsecondary
25-1041 Agricultural Sciences Teachers, Postsecondary
25-1042 Biological Science Teachers, Postsecondary
25-1070 Health Teachers, Postsecondary
25-1072 Nursing Instructors and Teachers, Postsecondary
29-1021 Dentists, General
29-1041 Optometrists
29-1051 Pharmacists
29-1131 Veterinarians
29-1141 Registered Nurses
29-1151 Nurse Anesthetists
29-1211 Anesthesiologists
29-1215 Family Medicine Physicians
29-1216 General Internal Medicine Physicians
29-1218 Obstetricians and Gynecologists
29-1221 Pediatricians, General
29-1228 Physicians, All Other; and Ophthalmologists, Except Pediatric
29-1248 Surgeons, Except Ophthalmologists
-------------------------------------------------------------
Use the online resources associated with the Occupational Information Network
(https://www.onetonline.org/) to conduct job-by-job research. Regarding jobs in
data science and biostatistics, the following behaviors and dispositions seem to
19 These behaviors and dispositions are organized in alphabetical order, to avoid any attempt to
suggest that there is a hierarchy of importance or sequence of these behaviors and dispositions.
Again, these behaviors and dispositions go across the many jobs associated with data science and
biostatistics. Use the Occupational Information Network for specific job-by-job details and work
activities.
20 Give special attention to the following six-digit CIP codes and how these programs of study
crosswalk to O*NET-identified jobs: CIP 30.7001 Data Science, General; CIP 51.1201 Medicine;
and CIP 51.2706 Medical Informatics.
OCC_CODE OCC_TITLE
=============================================================
11-9121.00 Natural Sciences Managers
11-9121.01 Clinical Research Coordinators
11-9121.02 Water Resource Specialists
15-2041.00 Statisticians
15-2041.01 Biostatisticians
19-1029.00 Biological Scientists, All Other
19-1029.01 Bioinformatics Scientists
19-1029.02 Molecular and Cellular Biologists
19-1029.03 Geneticists
19-1029.04 Biologists
19-1042.00 Medical Scientists, Except Epidemiologists
19-4021.00 Biological Technicians
25-1022.00 Mathematical Science Teachers, Postsecondary
25-1071.00 Health Specialties Teachers, Postsecondary
-------------------------------------------------------------
The depth of job-related statistics associated with data science and biostatistics
that are available from the Bureau of Labor Statistics should be explored in detail
when considering possible career path decisions. Again, using May 2020 statistics,
keeping in mind the national and worldwide impact of COVID-19 on economic
outcomes, including employment numbers and salaries, look at the following table
of total employment (e.g., TOT_EMP) and annual median salary (e.g., A_MEDIAN)
for selected occupations (e.g., OCC_CODE and OCC_TITLE) related to data sci-
ence and biostatistics, using BLS Occupation Codes (https://www.bls.gov/oes/current/oes_stru.htm):21
Using federal resources, look at the following output that highlights total employ-
ment and median salary for jobs in data science. Again, consider the prior comments
on displacement in the workplace due to the COVID-19 pandemic and why data
published in May 2020 are used, since it is assumed that this point in time provides a
stable summary of employment.
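
The output itself is produced in Addendum 2; the sketch below shows, in general terms, how such a summary might be assembled in R. The file name national_M2020_dl.xlsx is taken from the note that follows, while the sheet position and the example occupation codes are assumptions.

library(readxl)   # read .xlsx files
library(dplyr)    # filter rows and select columns

# Hypothetical import of the downloaded BLS file; verify the sheet position
# against the data dictionary described in the note below
oes <- read_excel("national_M2020_dl.xlsx", sheet = 1)

selected_codes <- c("15-2041", "19-1021", "19-1041", "29-1141")  # example codes

oes %>%
  filter(OCC_CODE %in% selected_codes) %>%
  select(OCC_CODE, OCC_TITLE, TOT_EMP, A_MEDIAN)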
21 See the prior comment about the use of data from either before or at the earliest stages of the
COVID-19 pandemic and the impact of mitigation practices such as lockdowns on data representa-
tion, thus the use of data from 2019 and early 2020, but not beyond.
Fig. 1.1 OCCJobsNationalSalary: annual median salary (vertical axis, $0 to $150,000) for the selected BLS job codes listed above (horizontal axis, Job Code, https://www.bls.gov/oes/current/oes_stru.htm). The syntax for this figure is found in Addendum 2 (C01Fig01OCCJobsNationalSalary.png).
Note: By looking at the files used to generate the data in this part of the les-
son (detailed explicitly in Addendum 2), both offline after downloading the file
and then again after importing the data into R, the characters * and # show in
many columns, columns that should be numeric but are not due to the presence
of these special characters. Using the notes found in the data dictionary (e.g.,
Sheet 2 of the original downloaded file national_M2020_dl.xlsx), it is identified
how the * character is used to show that a wage estimate is not available. It is
also stated that the # character is used to indicate that the wage is equal to or
greater than $100.00 per hour or $208,000 per year. The masking of the exact
wage is a purposeful decision by the Bureau of Labor Statistics and the data at
this upper limit are not readily available to the public at this resource. Although
data imputation is supported in R, it was judged that it would be inappropriate
to impute $100.00 per hour or $208,000 per year since there is no way of knowing
the exact hourly or yearly wage statistics. Accordingly, these data exist, but are
unavailable.22,23
National total employment numbers and median salaries for specific jobs pro-
vide a good start on using the Bureau of Labor Statistics for guidance on career choices
and the many decisions that should be considered when developing a personalized
educational program of study, in preparation for an eventual individualized career
path. Yet, recall that employment is localized and statistics that apply at the national
level may not be representative of state-wide and even more localized levels of
comparison.
The national statistics on employment numbers and salaries provide some degree of
guidance for each selected occupation, but do not forget that salaries are very regionally
specific. As an example, compare the May 2020 mean salary for the job 19-1022
(Microbiologists) for all states. When viewing these statistics, be sure to consider salaries
and localized cost of living, with impacts from local rents, local taxes, local transportation
costs, etc. Additionally, compare more regional salaries using All Data (XLS). Details on
how these comparisons are made, using R, are provided in Addendum 2, with the data
gained from https://www.bls.gov/oes/current/oes_stru.htm, the BLS Occupation Codes
resource.
22 R supports processes to work with data imputation through the use of functions from a few differ-
ent packages, such as: Amelia, brms, mi, mice, VIM, mitml, etc.
23 When using R, these special characters should be removed and replaced by NA, the symbolic
representation of missing data. A simple and effective transformation process is demonstrated in
Addendum 2.
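
Addendum 2 demonstrates the transformation against the downloaded file itself; the general idea, shown here with a small hypothetical data frame, is to convert the special characters to NA before coercing the affected columns to numeric.

library(dplyr)

# Small hypothetical example; the real columns come from the BLS download
bls_wages <- data.frame(
  OCC_CODE = c("15-2041", "29-1248"),
  A_MEDIAN = c("90000", "#"),   # "#" masks wages at or above $208,000 per year
  H_MEAN   = c("44.85", "*")    # "*" indicates no wage estimate is available
)

bls_wages_clean <- bls_wages %>%
  mutate(across(c(A_MEDIAN, H_MEAN),
                ~ as.numeric(na_if(na_if(.x, "*"), "#"))))
# The masked and missing entries are now NA, and the columns are numeric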
Metropolitan Area
=================
19-2041 Miami-Fort Lauderdale-West Palm Beach, FL 61020
19-2041 Orlando-Kissimmee-Sanford, FL 42160
19-2041 Tallahassee, FL 44910
Nonmetropolitan Area
====================
19-2041 South Florida nonmetropolitan area 47580
19-2041 North Florida nonmetropolitan area 43910
-------------------------------------------------------------
• From among the many possibilities, look at the way geographic and sector break-
outs are available and compare STEM (science, technology, engineering, and
mathematics) and non-STEM salaries.
• Then, review the salary-type datasets associated with entry-level educational
requirements (e.g., (1) no formal educational credential, (2) high school diploma
or equivalent, (3) some college, no degree, (4) postsecondary nondegree award,
(5) associate’s degree, (6) bachelor’s degree, (7) master’s degree, and (8) doc-
toral or professional degree), showing that additional education nearly always
results in increased salary.
Pre-ENIAC (1946)
24 Challenge: Search on the names Jean Jennings Bartik, Frances Bilas, Ruth Lichterman, Kathleen
McNulty, Betty Snyder, and Marlyn Wescoff. What was their role regarding ENIAC?
• Early-form punch cards were developed to tabulate 1880 United States Census
Bureau data, and results were so satisfactory (e.g., accuracy and speed) that the
1890 Census was tabulated using this process, continuing until the 1950s, when
more modern computers took over this massive task.
• The Atanasoff Berry Computer (ABC) was developed at Iowa State College
(Ames), later Iowa State University, in the late 1930s and very early 1940s.
• Colossus was a set of computing devices developed during World War II and was
used by British codebreakers in support of the war effort. It is considered, by
some, as an electronic digital computer that could be programmed.
As important as ENIAC may have been, it had practical limitations that challenged
uses beyond its original purpose. However, within a few years after the development
of ENIAC, by the early to mid-1950s, what we now call mainframe computers were
being used by businesses, not only for mathematical calculations but for back-office
activities as mundane, but important, as billing, fiscal accounting, payroll, etc. Look
into development of the following for a more complete picture of these early days
in computing:
• UNIVAC (Universal Automatic Computer) is considered by many as the first com-
mercial computer, with sales to large companies beginning in 1951.
• Going forward throughout the 1960s and 1970s, the commercialization of main-
frame computers continued. The early mainframe machines became smaller,
faster, easier to maintain, easier to program, and had greater and greater memory
and data capacity.25 They were still large and complex by today’s standards, but
with each iteration improvements made them available to mid-size and increas-
ingly smaller companies.
If there is disagreement on just what was the first large-scale computer, there are
equally different views on the development of small, personal computers:
• By the late 1960s and going on to the 1970s, the first scientific pocket calculators
became available to the public. Dependence on slide rules declined as hand-held
computing devices became part of the tool kit of many engaged in mathematics
and the sciences.
25 Search on the term Moore’s law, and see if it applies, then and now.
• Evolving away from calculators, by the 1970s, there was intense competition
between different companies in the development of what would be considered
early prototype personal computers, like competition in the automobile industry
during the early 1900s. A few companies prospered while other companies failed.
• Eventually, many manufacturers of personal computers started to stabilize, and
by the mid to late 1970s and into the early 1980s, small (e.g., desktop size) per-
sonal computers became increasingly available to the public. Software was also
developed such that by the mid-1980s, these personal computers became afford-
able and supported complex mathematical and statistical calculations, word pro-
cessing, database management, graphics, gaming, and access to distributed
computing systems by use of modems, etc. – all for use by an individual. January
22, 1984, is considered a bellwether date for personal computing when the first
national television advertisement for a personal computer was aired during Super
Bowl XVIII, with an estimated audience of nearly 80 million viewers.
Increasingly, computers had arrived, and people often wondered how they ever
lived without them, like years earlier when disruptive technology such as univer-
sal postal service, railroad, telegraph, telephone, automobile, airplane, radio, and
television first gained public acceptance.
public. By using the World Wide Web and a GUI Web browser, actions could now
be implemented by using a mouse and a simple click against a visual hyperlink.
Graphics were soon deployed in full color, video (at first limited to a few minutes
but soon in full length) could be played, and real-time multi-user communication
with others was now possible. In turn, commercial services were developed that
enhanced the user experience and were gladly adopted by the public.
In the early 1990s and onward, because of the Internet and the World Wide Web,
it was now possible for those with a user account, a personal computer (later, a
smart phone), a modem (soon to be replaced by faster capabilities), and network
access to communicate with others worldwide. At the earliest beginning of the
Internet, most communication was serious and focused on research and academic
activities. It did not take long, however, for topical bulletin boards, listservs, and
other utilities to gain acceptance by members of the public, for personal use and
enjoyment.26 Soon after, intense commercialization was deployed, where user expe-
riences were tracked, data related to these experiences were harvested, algorithms
were applied against the harvested data, and commercialization became the norm,
initially with little to no oversight by any authoritative agency. Even with what
many view as obtrusive commercialization, use of the Internet and the World Wide
Web is now embedded into daily actions (e.g., banking, education, health mainte-
nance, real-time and delayed time communication, shopping, and transportation and
logistics), and there are many who would now find it difficult to successfully use
something as simple as a tri-fold paper map instead of using one of the many map-
type apps for personal transportation on unknown streets and highways.
As mainframe computers gained popularity in the late 1950s and well into the
1970s, it was common to have all mainframe computing facilities on campus at one
central building and users in other buildings had access to the mainframe from
campus-based distributed, but restricted, terminals – hardware that was little more
than a black and white television-like monitor and keyboard. All data and software
were at the centrally located mainframe, and connection was restricted to the use of
physical cables that were often limited to only a minimal distance. As the Internet
and World Wide Web grew in use, there were those who remembered this paradigm,
but now saw how the Internet and the World Wide Web could be used to connect to
vastly superior computing facilities hundreds, if not thousands, of miles
26
For unexplained reasons, online videos of dancing cats were among the first entertainment experiences for many early Internet users, and these funny videos helped entice the public to try out both entertainment-oriented and more serious video resources that were becoming available.
away – evolving into what is now called cloud computing, with possibly thousands
of devices placed on racks at what are referred to as server farms.27
The term cloud computing was possibly used for the first time in the late 1990s, but it was not until a few years later that the term was adopted by large companies offering this type of distributed service. As is typical of jargon, there were so many competing views of cloud computing, the cloud, etc. that by 2011 the federal government's National Institute of Standards and Technology, a unit within the United States Department of Commerce, provided clarity by offering a definitive definition of the term, available at https://csrc.nist.gov/publications/detail/sp/800-145/final.
A key concept associated with cloud computing is the term service and the willingness of users to consume services on a pay-as-you-go basis. Cloud computing service is often viewed as the following:
• Software as a Service (SaaS), where an individual can use cloud-provided
software
• Platform as a Service (PaaS), where an individual can use cloud-provided utili-
ties and tools
• Infrastructure as a Service (IaaS), where an individual can use cloud-provided
resources for storage and processing
It is not suggested that cloud computing is merely a modern update to the use of
restricted terminals and their connection to a campus-based mainframe computer.
Even so, there are advantages to distributed computing, whether in the 1960s or the
2020s. Of course, when there are disruptions, which occasionally happen when con-
struction equipment cuts dedicated communication lines, such as fiber optic cable,
the disadvantages of distributed computing also become evident.
Data Types Supported by R
Data science is focused, not surprisingly, on data – data in its many forms. It is common to think of data as being either character (e.g., A, cat, Dog) or numeric (e.g., 1,
5, 1.5). R (as well as many other statistically oriented languages) can certainly
accommodate both character data and numeric data, but R can also accommodate
many other types of data. The following presentation is a mere introduction to the
many different data types supported by R, with more detail provided throughout the
addenda for this lesson as well as other lessons in this text.
27
Server farms, with thousands of computing devices in operation, consume vast amounts of elec-
tricity and generate extreme amounts of heat, requiring additional electricity for cooling. There is
a growing movement in cloud computing to locate server farms at locations that are naturally cool
throughout the year, if not cold, with electricity often generated by either geothermal or hydro
technology. The excess heat generated by the many servers is often captured and ported to other
buildings for heating purposes, adding to efficiencies of use.
R has been developed so that actions are often performed against object variables. That said, it is immediately necessary to demonstrate a few different ways
object variables are assigned values in R. Following this brief demonstration of
assignment there will be a discussion on why the <- assignment operator is used
throughout this text.
# Assignment using <-
Variable123 <- 123
# In this example, create a numeric object variable called
# Variable123 using the <- assignment operator.
Variable123
[1] 123
# Assignment using =
Variable456 = 456
# In this example, create a numeric object variable called
# Variable456 using the = assignment operator. Be sure to
# notice where there is one and only one = character.
Variable456
[1] 456
base::assign("Variable789", 789)
# In this example, create a numeric object variable called
# Variable789 using the base::assign() function. Be sure
# to notice that there are double quote marks around the
# named object variable.
Variable789
[1] 789
Regarding these examples, it is common to say that each object variable is assigned a value, although there are those who might go as far as to say is temporarily assigned the value instead of is assigned the value. The point here is to consider the term variable.
Variables, as the term suggests, may vary in value. Accordingly, the term assign-
ment is the better term to use when speaking of variable values.
It also needs to be mentioned that throughout this text, of the three assignment
operators previously demonstrated:
• The <- assignment operator is preferred, and other methods will be rarely, if ever,
used, at least in this text.
• The = character, as used for assignment, is too easily confused with the use of == (e.g., two equal signs with no space between them), which is used to test for equality.
• The base::assign() function is too cumbersome.
Look into the life of George Boole (1815–1864) and his 1854 text An Investigation
of the Laws of Thought On Which are Founded the Mathematical Theories of Logic
and Probabilities, and think about all of the many times some type of decision was
made that reduced to a simple FALSE or TRUE outcome. This outcome, FALSE or
TRUE, is the basis for Boolean logic and comparisons.
The use of Boolean logic is then enhanced by a formal order of operations, where
there is structure to the evaluation of comparisons. A few simple examples will
serve as an introduction to this critically important aspect of data science.28
28
Challenge: To save space, the outcomes of the R-based syntax presented in this section are not
always presented. However, either duplicate or copy and paste the syntax into an active R session
and replicate findings. This self-directed activity is among the many paths used in this text for
learning R and later the tidyverse ecosystem.
X <- 12
Y <- 15
Z <- 18
# Create three numeric object variables for use in the
# comparisons that follow.
X
[1] 12
Y
[1] 15
Z
[1] 18
X > Y
# The value of X is greater than the value of Y:
# FALSE or TRUE.
[1] FALSE
Y < Z
# The value of Y is less than the value of Z:
# FALSE or TRUE.
[1] TRUE
(X + Y) > Z
# The value of the summation of X plus Y is greater
# than the value of Z: FALSE or TRUE. Give special
# attention to the use of parentheses and how the use
# of encapsulating parentheses remove any possibility
# of incorrect grouping, or the summation of X plus Y
# in this simple example.
[1] TRUE
X != Y
# R supports many comparisons, including not equal
# (e.g., !=) in this example.
[1] TRUE
DogID <- c(
"D01", "D02", "D03", "D04", "D05", "D06",
"D07", "D08", "D09", "D10", "D11", "D12",
"D13", "D14", "D15", "D16", "D17", "D18")
# Create an object variable consisting of IDs.
DogBreed <- c(
"Beagle", "Beagle", "Beagle", "Beagle", "Beagle", "Beagle",
"Lab", "Lab", "Lab", "Lab", "Lab", "Lab",
"Terrier", "Terrier", "Terrier", "Terrier", "Terrier",
"Terrier")
# Create an object variable consisting of breeds.
DogGender <- c(
"Female", "Female", "Female", "Male", "Male", "Male",
"Female", "Female", "Female", "Male", "Male", "Male",
"Female", "Female", "Female", "Male", "Male", "Male")
# Create an object variable consisting of genders.
DogWeightLb <- c(
18, 20, 21, 22, 26, 23,
44, 42, 46, 44, 50, 58,
11, 13, 12, 14, 15, 13)
# Create an object variable consisting of weights (Lbs.).
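The step that joins these four object variables into the dataframe DogIDBreedGenderWeightLb.df is not reproduced above. A minimal sketch, assuming the same as.data.frame() and cbind() pattern described later in this lesson:
DogIDBreedGenderWeightLb.df <- as.data.frame(
  cbind(DogID, DogBreed, DogGender, DogWeightLb))
# Join the four object variables into one dataframe. Note
# that cbind() returns a character matrix, which is why
# DogWeightLb is coerced back into numeric format
# immediately below.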
base::attach(DogIDBreedGenderWeightLb.df)
utils::str(DogIDBreedGenderWeightLb.df)
DogIDBreedGenderWeightLb.df$DogWeightLb <-
as.numeric(DogIDBreedGenderWeightLb.df$DogWeightLb)
# DogWeightLb needs to be put into numeric format.
base::attach(DogIDBreedGenderWeightLb.df)
utils::str(DogIDBreedGenderWeightLb.df)
DogIDBreedGenderWeightLb.df
base::attach(DogIDAllBreedsFemaleWeightLb.df)
DogIDAllBreedsFemaleWeightLb.df
base::attach(DogIDBeagleMaleWeightLb.df)
DogIDBeagleMaleWeightLb.df
base::attach(DogIDLabORTerrierFemaleGTE13Lb.df)
DogIDLabORTerrierFemaleGTE13Lb.df
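The selection syntax that produced these three subsets is not reproduced above. A minimal sketch of how such selections might be made with Base R Boolean comparisons, with the row-selection conditions inferred from the subset names:
DogIDAllBreedsFemaleWeightLb.df <-
  DogIDBreedGenderWeightLb.df[
    DogIDBreedGenderWeightLb.df$DogGender == "Female", ]
# All breeds, female dogs only.
DogIDBeagleMaleWeightLb.df <-
  DogIDBreedGenderWeightLb.df[
    DogIDBreedGenderWeightLb.df$DogBreed == "Beagle" &
    DogIDBreedGenderWeightLb.df$DogGender == "Male", ]
# Beagles that are male.
DogIDLabORTerrierFemaleGTE13Lb.df <-
  DogIDBreedGenderWeightLb.df[
    (DogIDBreedGenderWeightLb.df$DogBreed == "Lab" |
     DogIDBreedGenderWeightLb.df$DogBreed == "Terrier") &
    DogIDBreedGenderWeightLb.df$DogGender == "Female" &
    DogIDBreedGenderWeightLb.df$DogWeightLb >= 13, ]
# Labs or Terriers, female, weighing 13 pounds or more.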
This brief discussion on the use of Boolean logic and associated tools that sup-
port selection should be enhanced by further study. It is always desirable to check
outcomes against a known quantity, just to be sure that the initial logic and all asso-
ciated syntax are correct. Although the term Pencil Tracing is now rarely used, it is
still a desired practice. Look at the syntax for selection and then look at the dataset.
For small sections of syntax, use a pencil and trace each action against the dataset
and see if the syntax and data would yield expected outcomes. Quality assurance
can take many forms and this degree of review is well worth the effort.
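As one small example of checking against a known quantity, the counts in the original dataframe can be tabulated and compared with the number of rows returned by a selection; the expected counts follow directly from the data as entered:
base::table(DogIDBreedGenderWeightLb.df$DogBreed,
            DogIDBreedGenderWeightLb.df$DogGender)
# Each breed should show 3 Female and 3 Male dogs.
base::nrow(DogIDAllBreedsFemaleWeightLb.df)
# With 3 female dogs in each of 3 breeds, 9 rows are
# expected in the female-only subset.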
Numeric Data
R supports many different data types, including different numeric data types. The
two most frequently encountered numeric data types are: (1) decimal numbers or
real numbers, and (2) integer numbers (e.g., whole numbers). It should be men-
tioned that integers are often used as codes for otherwise factor-type data (e.g.,
Gender: use 1 as code for Female and use 2 as code for Male). A brief explanation
of the two numeric data types follows.
Decimal or real numeric data are often expressed in some type of decimal format
(e.g., 1.23, 45.67, and 890.12). Consider the weight of lab rats (Rattus norvegicus
domestica) as a typical example of how decimal notation is often expressed in R:
RatWeightKg <- c(0.65, 0.72, 0.45, 1.05, 0.91)
# Create a numeric object variable of rat weights (Kg).
RatWeightKg
[1] 0.65 0.72 0.45 1.05 0.91
utils::str(RatWeightKg)
# Show the internal structure of the named object.
 num [1:5] 0.65 0.72 0.45 1.05 0.91
base::typeof(RatWeightKg)
# Determine the internal storage mode of the named object.
[1] "double"
base::mode(RatWeightKg)
# Determine and/or set the internal storage mode of the
# named object.
[1] "numeric"
The output gained from all three functions helps confirm the true nature of an object variable that is in decimal notation.30
29
Preemptively, it must be mentioned that an oddity of R is that the base::mode() function is not used as a measure of central tendency, unlike the base::mean() function or the stats::median() function. As a measure of central tendency, the mode (e.g., the most frequently occurring value in a set of values) is accommodated by using functions from external R packages, such as the DescTools::Mode() function, among many other complementary functions that serve the same purpose.
30
Any further discussion would go beyond the purpose of this lesson, but recall that decimal notation, as used in this example, is not universal and an experienced data scientist may encounter the use of commas instead of decimal points to express what is otherwise called decimal notation. R can accommodate this practice (e.g., the use of commas instead of decimal points), if needed.
There are many times when numbers do not have a decimal format and cannot have
a decimal format – whole numbers, otherwise known as integers. Perhaps the most
common way to express this notion is to say that there are five lab rats represented
in the object variable RatWeightKg. In this context, consider the integer data as
whole numbers – numbers that cannot have fractional representation, such as five
lab rats in this example and not 5.0 lab rats. Along with their use as IDs or for head-
counts (e.g., five lab rats), it is also quite common to see integers used as a
numeric code.
Consider a situation where the numbers 1 and 2 are used to designate gender:
• 1 ..... Female
• 2 ..... Male
The two numbers, 1 and 2, by no means represent anything other than a useful
code and are often used to reduce typing for when data are hand-entered into a
spreadsheet. R can easily accommodate these numbers, in the context as factor-type
integer codes.31
31
As a good programming practice (gpp), when there is no compelling reason to do otherwise, it may be best to have codes organized so that the terms they represent show in alphabetical order, thus 1 for Female and 2 for Male. Of course, there are times when an ordinal approach may be best for the assignment of integer factor-type codes, such as 1 (Small), 2 (Medium), and 3 (Large), which is decidedly not an alphabetical ordering but is instead an ordering by size. These issues are best communicated in a code book.
RatGender <- as.factor(c(1, 1, 1, 2, 2))
# Create a factor-type object variable of gender codes,
# where 1 = Female and 2 = Male.
RatGender
[1] 1 1 1 2 2
Levels: 1 2
utils::str(RatGender)
 Factor w/ 2 levels "1","2": 1 1 1 2 2
base::length(RatGender)
# Confirm the headcount (e.g., number of)
# rats assigned a gender.
[1] 5
Follow along with this simple coding scheme, where integers are used as an iden-
tification code for each individual lab rat:
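The ID assignments themselves are not shown above. A minimal sketch, consistent with the utils::str() output presented a little later in this lesson (five rats, with IDs 1 through 5):
RatID <- c(1, 2, 3, 4, 5)
# Create an object variable consisting of integer IDs,
# one ID for each lab rat.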
Now that the three object variables have been separately created, join them
together into one common dataframe, by wrapping the as.data.frame() function
around the cbind() function. Then, preemptively put each object variable into desired
data type:
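The construction of the dataframe itself is not reproduced here. A minimal sketch, following the as.data.frame() and cbind() pattern just described, with the data type conversions inferred from the utils::str() output shown below:
Rat.df <- as.data.frame(cbind(RatID, RatGender, RatWeightKg))
# Join the three object variables into one dataframe.
Rat.df$RatID       <- as.factor(Rat.df$RatID)
Rat.df$RatGender   <- as.factor(Rat.df$RatGender)
Rat.df$RatWeightKg <- as.numeric(Rat.df$RatWeightKg)
# Preemptively put each object variable into desired
# data type.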
Rat.df$RatGender.Recode <-
factor(Rat.df$RatGender,
labels=c("Female", "Male"))
levels(Rat.df$RatGender.Recode)
# NOTE: factor(...) and NOT as.factor (...)
# Create an enumerated object variable and put it
# into desired data type.
# Note the alphabetical ordering of the two
# factor-type breakouts: Female and then Male.
base::attach(Rat.df)
utils::str(Rat.df)
'data.frame': 5 obs. of 4 variables:
 $ RatID           : Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
 $ RatGender       : Factor w/ 2 levels "1","2": 1 1 1 2 2
 $ RatWeightKg     : num  0.65 0.72 0.45 1.05 0.91
 $ RatGender.Recode: Factor w/ 2 levels "Female","Male": 1 1 1 2 2
Rat.df
By following along with these actions, the dataset Rat.df has been put into proper
format, where: (1) the integer values for RatID and RatGender serve as codes for
named factor-type breakouts associated with ID and Gender, (2) the numeric values
for RatWeightKg are expressed correctly, following decimal notation, and (3) the
RatGender codes were used to create an enumerated object variable (RatGender.
Recode) that designates text expressions of Female and Male, improving readability
of any future use of this dataset.
As a value-added activity for this simple dataset, use functions associated with
Base R to make a simple barchart of RatWeightKg means broken out by the two
genders. Later, the tidyverse ecosystem will be demonstrated throughout this text
for similar activities, but first, this gentle introduction to Base R is helpful as a guide
to how problem-solving is approached (Fig. 1.2).
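The creation of the enumerated object Mean.RatGender is not reproduced above. A minimal sketch, assuming the same base::tapply() pattern that appears later in this lesson for Mean.DaysOfTreatment:
Mean.RatGender <-
  base::t(base::tapply(Rat.df$RatWeightKg,
    base::list(Rat.df$RatGender.Recode), base::mean))
# Create an object named Mean.RatGender that holds the
# mean RatWeightKg value for each gender breakout,
# Female and Male.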
Mean.RatGender
Female Male
[1,] 0.6066667 0.98
par(ask=TRUE)
graphics::barplot(Mean.RatGender, # Plot the enumerated object
main="Mean Weight (Kg) of Lab Rats by Gender",
col=c("red","darkblue"), # Colors for each bar
beside=TRUE, # Place bars side by side
xlab="Gender", # X axis label
ylab="Mean Weight (Kg)", # Y axis label
ylim=c(0.0, 1.1), # Y axis scale
font.axis=2, # Bold
font.lab=2, # Bold
cex.axis=1.1, # Increase font size
cex.lab=1.2) # Increase font size
text(x=1.5, y=0.675, labels="Mean Weight = 0.607 Kg", font=2)
text(x=3.5, y=1.050, labels="Mean Weight = 0.980 Kg", font=2)
# There is no easy way to know exactly where to place the
# text, x and y – experiment. Otherwise, look at where the
# text for Mean Weight has been placed for each of the two
# genders in the enumerated object Mean.RatGender and also
# observe how the text font has been put into bold format.
# Fig. 1.2
Fig. 1.2
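The lesson turns next to character (string) data. The assignments that created the string objects used below are not reproduced here. A minimal sketch so that the calls can be replicated; the values for the first four objects are purely hypothetical, while StringTen is written to match the output shown below:
StringExampleDoubleQuotes <- "Text enclosed in double quotes"
StringExampleSingleQuotes <- 'Text enclosed in single quotes'
StringOne <- "A"
StringTwo <- c("A", "B")
StringTen <- LETTERS[1:10]
# LETTERS is a built-in constant; the first ten letters
# yield "A" through "J".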
StringExampleDoubleQuotes
utils::str(StringExampleDoubleQuotes)
# Show the internal structure of the named object.
base::typeof(StringExampleDoubleQuotes)
# Determine the internal storage mode of the named object.
base::mode(StringExampleDoubleQuotes)
# Determine and/or set the internal storage mode of the
# named object.
StringExampleSingleQuotes
StringOne
StringTwo
StringTen
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
Although Base R has many functions that are used to accommodate strings, the
tidyverse ecosystem is far better for this task. The stringr package, which is included
among the packages associated with the core tidyverse, is perhaps among the best
packages for string manipulation. Examples from the stringr package are found in
later lessons in this text.
As an ending comment on the seemingly complex nature of using strings, look at
the following examples, and from this, always remember to check everything in the
pursuit of continuous quality assurance. In this example, consider the complexity of
accommodating a company payroll information system for an employee with the
following name:
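The employee's full name and the construction of the Payroll object are not reproduced here. A minimal sketch so that the comparisons below can be replicated; the first and middle names (Alfred and Mack) are taken from the discussion that follows, while the last name used here is purely a placeholder:
Payroll <- data.frame(
  Name_First  = "Alfred",
  Name_Middle = "Mack",
  Name_Last   = "Oberlin-Smythe") # Placeholder surname only
# A one-row payroll record with separate name fields.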
Payroll
Consider some of the common mistakes made when humans work with string
data, with this example based on names in a payroll information system.
Payroll$Name_First == "Alfred"
# Name_First is Alfred and this expression should
# return TRUE.
[1] TRUE
Payroll$Name_First == "Al"
# But, it is not unexpected that someone using the payroll
# system might enter Al instead of Alfred since the worker
# uses Al as an informal first name and most users do not
# even know his true first name, Alfred.
[1] FALSE
Then, consider the complexity of Mack as a middle name and how many users
would immediately type Mac, instead:
Payroll$Name_Middle == "Mac"
[1] FALSE
However, the last name is perhaps the string most likely to cause confusion and
errors. Look at the many FALSE statements that are generated by seemingly simple
confusion of correct spelling:
32
Social Security Numbers are now rarely used as a requested number for identification purposes, due to public concern about this practice and justified personal identity and security concerns.
Such name-based matching problems do not even begin to consider the issue of information security. A unique numeric code is simply much easier to manage than a character-based set of names.
Many beginning data science students and even entry-level data scientists find time
and dates a frustrating challenge, regardless of the language, including R. The lub-
ridate package, which is associated with the tidyverse ecosystem, is often used to
work with time and dates. For now, this brief introduction on how R accommodates time and dates will depend on tools associated with Base R, keeping the focus on foundational concepts. The use of specialized packages, especially lubridate, will come with more experience with data science and the tidyverse ecosystem.
When using R, it is essential to know that January 1, 1970, is the origin (e.g.,
base) date for counting days.33,34 Negative numbers express the number of days prior
to the origin (January 1, 1970), and conversely, positive numbers express the num-
ber of days after the origin (January 1, 1970):
as.numeric(as.Date('1969-12-31'))
# 1969 December 31
# Wrap the as.numeric() function around the
# as.Date() function, to determine the number
# of days from the R origin date, January 1,
# 1970.
[1] -1
as.numeric(as.Date('1970-01-01'))
# 1970 January 01
[1] 0
as.numeric(as.Date('1970-01-02'))
# 1970 January 02
[1] 1
33
The origin date for R (January 1, 1970) borrows from what is commonly referred to as the arbitrary UNIX epoch time and date of 00:00:00 UTC (Coordinated Universal Time) on Thursday, January 01, 1970.
34
For calculations going back over extreme lengths of time, it may be helpful to know that the
Gregorian calendar is used for UNIX epoch time, as opposed to use of the Julian calendar.
With this brief introduction to the way dates in R are viewed as the number of
days from the origin, in either direction, practice with a few dates to see the cumu-
lative number of dates over time, subtracting the numeric value of a beginning date
from the numeric value of an ending date:
IndependenceDayUSA1776 <- as.Date("1776-07-04")
IndependenceDayUSA1776
[1] "1776-07-04"
as.numeric(IndependenceDayUSA1776)
# Determine the number of days prior to the
# January 1, 1970, origin date, with negative
# values showing direction -- the number of days
# prior to the origin date.
[1] -70672
IndependenceDayUSA2026 <- as.Date("2026-07-04")
IndependenceDayUSA2026
[1] "2026-07-04"
as.numeric(IndependenceDayUSA2026)
# Determine the number of days since the January
# 1, 1970, origin date, with positive values
# showing direction – the number of days since
# the origin date.
[1] 20638
As an interesting quality assurance check, subtract the two dates and see if the output represents approximately 250 years, the sestercentennial (or semiquincentennial), to use the Latin-derived term(s) for a 250-year anniversary.
IndependenceDayUSA2026 - IndependenceDayUSA1776
# Subtract the two dates.
Time difference of 91310 days
91310/365
# Number of days from Jul-04-1776 to Jul-04-2026
# divided by 365 days per year, which is not
# quite right considering the impact of leap
# year and how leap year does not occur every
# four years.
#
# Review the rules regarding leap years to see
# why 1800 and 1900 were not leap years and how
# this issue clouds the precise use of 365 days
# per year in calculation of the number of days
# over long periods of time, such as the dates in
# this example. Even so, this calculation offers
# a good estimate of the number of years from one
# date to another.
[1] 250.1644
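The construction of DaysOfTreatment.df is not reproduced here. A minimal sketch, assuming Begin and End dates for subjects in three treatment groups, with Delta computed as the number of days between the two dates. The specific dates are hypothetical, chosen only so that the group means match the values shown in the figure syntax below (6.33, 14.67, and 40.33 days):
DaysOfTreatment.df <- data.frame(
  Treatment = c("Treatment1", "Treatment1", "Treatment1",
                "Treatment2", "Treatment2", "Treatment2",
                "Treatment3", "Treatment3", "Treatment3"),
  Begin = as.Date(rep("2023-03-01", 9)),
  End   = as.Date(c("2023-03-06", "2023-03-07", "2023-03-09",
                    "2023-03-14", "2023-03-16", "2023-03-17",
                    "2023-04-08", "2023-04-11", "2023-04-12")))
DaysOfTreatment.df$Delta <- as.numeric(
  DaysOfTreatment.df$End - DaysOfTreatment.df$Begin)
# Delta is the number of days of treatment, End minus Begin.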
base::attach(DaysOfTreatment.df)
utils::str(DaysOfTreatment.df)
DaysOfTreatment.df
A graphic of mean days of treatment, from Begin to End, will help emphasize
outcomes from this simple example on the use of dates as the basis for a datum (e.g.,
Delta). The syntax previously used will help guide this graphic, an efficient (e.g.,
tidy) approach to the reuse of R syntax (Fig. 1.3).
Fig. 1.3
Mean.DaysOfTreatment <-
base::t(base::tapply(DaysOfTreatment.df$Delta,
base::list(DaysOfTreatment.df$Treatment), base::mean))
# Create an object named Mean.DaysOfTreatment that holds
# the mean Delta value for each treatment breakout,
# Treatment 1, Treatment 2, and Treatment 3.
Mean.DaysOfTreatment
par(ask=TRUE)
graphics::barplot(Mean.DaysOfTreatment, # Plot the enumerated
main="Mean Days of Treatment", # object
col=c("red","darkblue", "black"), # Colors for each bar
beside=TRUE, # Side by side bars
xlab="Treatment", # X axis label
ylab="Mean Days", # Y axis label
ylim=c(0.0, 50.0), # Y axis scale
font.axis=2, # Bold
font.lab=2, # Bold
cex.axis=1.1, # Increase font size
cex.lab=1.2) # Increase font size
text(x=1.5, y=12.00, labels="Mean Days = 06.33", font=2)
text(x=3.5, y=20.00, labels="Mean Days = 14.67", font=2)
text(x=5.5, y=46.00, labels="Mean Days = 40.33", font=2)
# Experiment with the best place to place text. Additionally,
# bold fonts and large print make it easier to view and
# understand outcomes presented in graphical format.
# Fig. 1.3
Up to this point, only default date formats have been presented. There are many
ways that dates can be presented in R, and there are many resources on this topic.
As a simple example, look at the way IndependenceDayUSA1776 can be presented
with slight adjustment in format:
IndependenceDayUSA1776
# Default presentation of a date
[1] "1776-07-04"
Going beyond the number of days from an origin date and how this datum is used
for different calculations, consider how R accommodates minutes and hours:35
January012023OneMinutePastMidnight <-
as.POSIXct("2023-01-01 00:01:00")
January012023OneMinutePastMidnight
December312023OneMinuteBeforeMidnight <-
as.POSIXct("2023-12-31 23:59:00")
December312023OneMinuteBeforeMidnight
December312023OneMinuteBeforeMidnight -
January012023OneMinutePastMidnight
# Subtract the dates.
Although far more could be presented about time and dates when using R, it
should be obvious that this topic is possibly one of the most challenging tasks in
data science given the complexity of how time and dates are seen throughout
the world:
• Does the notation 07/10/23 mean July 10, 2023, or does it instead mean October
07, 2023? Both interpretations could be correct, based on geographic location
and local views on how dates are expressed, and some type of context is needed
to be sure of the correct meaning.
• There are 365 days in a year, right? Perhaps not – How does this simple constant
of 365 account for leap years?
• There is a leap year every four years, right? Perhaps not – How does this simple
constant account for the century years 1700, 1800, and 1900, which when using
the leap year formula cannot be divided evenly by 400?
35
As an independent activity, at the R prompt key help(as.POSIXct) to learn more about this otherwise somewhat complex topic, which with careful thought is not as complex as it may first appear.
• There are 24 hours each day, right? Perhaps not – How does this simple constant account for the days when, twice yearly, there is a change from Daylight Saving Time to Standard Time and from Standard Time to Daylight Saving Time?
• There are 60 seconds each minute, right? Perhaps not – How does this simple constant account for the rare, but real, leap second that is occasionally added to account for variation in the earth's rotation?
• There are 24 time zones, right? Perhaps not – How does this simple constant
account for time zone lines that clearly do not follow exact longitudes, local
governments that do not account for changing the clock twice yearly, and local
time zones that instead follow 30 minute and 45 minute offsets?
In summary for this section, context is the key to working correctly with dates
and time, for all languages and not only R:
• Know the origin date (January 1, 1970, for R).
• Confirm that leap years are accounted for correctly, which should not be auto-
matically assumed for all languages and software applications.
• Learn the specific notation for how dates are presented, locally and for the wider
international R community.
With practice, R can provide excellent results when time and dates are used, but
careful attention and practice are especially critical for this area.
Missing Data
The reality of data science is that it is only the rare dataset that is complete, with no
missing data. Data can be missing for many reasons:
• A subject was not available for measurement at the appointed time (e.g., an unex-
pected snowstorm kept some subjects from showing up at the clinic where mea-
surements are taken).
• A subject was available for measurement at the appointed time but could not or
would not be measured (e.g., a frightened large animal with sharp hooves and
horns could not be weighed due to continuous and possibly dangerous movement).
• A subject was available for measurement at the appointed time, the measurement
was recorded, but for unknown reasons, the recorded datum did not appear in the
final dataset (e.g., an unsecured folder of time-specific paper-based field notes
that blow away across an open field, never to be retrieved, when a car door is
opened).
• A subject was available for measurement at the appointed time, the measurement
was recorded, the datum appeared in the final dataset, but the recorded datum
was either illogical or so totally out of range that it was not an outlier but was
instead viewed as either a misreading or a data entry error (e.g., systolic blood
pressure measurements were inadvertently recorded as weight in pounds for
some subjects, but not all, by a poorly supervised trainee).
There is seemingly no end to the reasons why data can be missing. R has structures that can accommodate missing data, and a key part of data science is working around the unfortunate, but not at all unexpected, reality of missing data.
Consider a simple example of how missing data are treated when using R, know-
ing that the nomenclature is clearly different than what is used in other languages:
XNoMissingData <- c(10, 20, 30, 40, 50)
XNoMissingData
[1] 10 20 30 40 50
XMissingData <- c(10, NA, 30, 40, 50)
XMissingData
[1] 10 NA 30 40 50
36
Using R, the symbol NA equates to the term Not Available. The symbol NaN equates to the term
Not a Number.
37
For those who wish to explore this topic in more detail, use available resources such as keying help(NameOfFunction) at the R prompt to learn more about the following terms, as used in R: NULL, NA, NaN, and Inf and -Inf (positive and negative infinity, along with the related is.finite() and is.infinite() functions).
base::mean(XNoMissingData)
# Calculate the mean of a dataset where
# there are no missing data.
[1] 30
base::mean(XMissingData)
# Calculate the mean of a dataset where
# there is one missing datum, but where
# there is no accommodation for missing
# data.
[1] NA
base::mean(XMissingData, na.rm=TRUE)
# Calculate the mean of a dataset where
# there is at least one missing datum,
# using the na.rm=TRUE argument.
[1] 32.5
base::mean(XNoMissingData)
# Calculate the mean of a dataset where
# there are no missing data, without
# use of the na.rm=TRUE argument.
[1] 30
base::mean(XNoMissingData, na.rm=TRUE)
# Calculate the mean of a dataset where
# there are no missing data, but with
# use of the na.rm=TRUE argument.
[1] 30
base::summary(XMissingData)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   10.0    25.0    35.0    32.5    42.5    50.0       1
The base::is.na() function is also used, at least for smaller datasets, to gain a
sense of exactly which datapoints have a missing value:
base::is.na(XMissingData)
[1] FALSE  TRUE FALSE FALSE FALSE
Create a more embellished dataset with missing values to see how missing data
are accommodated in multiple columns of a dataset:
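The data entry step for this dataset is not reproduced here. A minimal sketch, with values that are hypothetical but chosen to be consistent with the outputs shown below (a mean weight of 163 pounds among non-missing values, four Female and four Male subjects, one missing weight, and one missing gender):
XDatasetWithMissingValues.df <- data.frame(
  Subject  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  WeightLb = c(118, 132, 146, 155, NA, 168, 177, 190, 218),
  Gender   = c("Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male", NA))
# One missing weight (NA) and one missing gender (NA).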
XDatasetWithMissingValues.df$Subject <-
as.factor(XDatasetWithMissingValues.df$Subject)
XDatasetWithMissingValues.df$WeightLb <-
as.numeric(XDatasetWithMissingValues.df$WeightLb)
XDatasetWithMissingValues.df$Gender <-
as.factor(XDatasetWithMissingValues.df$Gender)
base::attach(XDatasetWithMissingValues.df)
utils::str(XDatasetWithMissingValues.df)
XDatasetWithMissingValues.df
# As an interesting observation, notice how NA shows
# for missing numeric values whereas <NA> shows for
# missing factor values.
base::mean(XDatasetWithMissingValues.df$WeightLb, na.rm=TRUE)
# Use the na.rm=TRUE argument to accommodate missing data.
[1] 163
base::table(XDatasetWithMissingValues.df$Gender)
# Look how NA does not show at all when using
# the base::table() function against a factor-
# type object variable.
Female Male
4 4
base::summary(XDatasetWithMissingValues.df)
# Look how summary provides a fairly complete
# view of the data, numeric and factor,
# including identification of missing values.
If there were a desire to remove all rows that have an NA value, consider use of
the stats::na.omit() function. In advance and only with caution, be sure to determine
that this is the desired action.38
38
Think of the expression measure twice and cut once before deploying any action that eliminates
data from a dataset.
XDatasetWithMissingValuesUsingna.omit <-
stats::na.omit(XDatasetWithMissingValues.df)
# Omit all rows that have a NA value.
XDatasetWithMissingValuesUsingna.omit
There are many functions in the tidyverse ecosystem that go far beyond these
Base R functions when working with (and around) missing data. These simple
examples provide a good start on recognizing that data will be missing and there
must be ways to accommodate this real-world issue.
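As a small preview, and certainly not the only approach available, the tidyr package (part of the core tidyverse) offers functions such as tidyr::drop_na() and tidyr::replace_na(). The sketch below assumes the tidyverse packages have been installed and loaded, which in this lesson does not happen until the addenda:
tidyr::drop_na(XDatasetWithMissingValues.df)
# Keep only the rows with no missing values, similar in
# spirit to stats::na.omit().
tidyr::replace_na(XDatasetWithMissingValues.df,
  list(WeightLb = 0))
# Replace missing WeightLb values with 0, shown only as an
# illustration; replacing missing data with 0 is rarely an
# appropriate analytic choice.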
R, meaning both Base R and the tidyverse ecosystem, can accommodate many different types of data structures. A few are discussed here; each discussion could be greatly expanded, but extended treatment would go beyond the purpose of an introductory text.
Although dataframes have been shown previously in this lesson, it needs repeating
that a dataframe is a rectangular collection of object variables, where each object
variable (e.g., column) consists of the same number of subjects (e.g., rows). Consider
a simple four by three dataframe (e.g., row by column - or 4 rows by 3 columns)
consisting of data on blood pressure from four subjects gained from three object
variables, each of a different data type:
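The data entry step for SBPAlert01.df is not reproduced here. A minimal sketch; the four SubjectID values and the first two SystolicBP values can be read from the matrix output later in this lesson, while the remaining SystolicBP values and the BPRisk entries are placeholders:
SBPAlert01.df <- data.frame(
  SubjectID  = c(1035, 1067, 2053, 1716),
  SystolicBP = c(122, 186, 145, 133), # Last two are placeholders
  BPRisk     = c("FALSE", "TRUE", "FALSE", "FALSE")) # Placeholders
# Four subjects (rows) and three object variables (columns).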
SBPAlert01.df$SubjectID <-
as.factor(SBPAlert01.df$SubjectID)
SBPAlert01.df$SystolicBP <-
as.numeric(SBPAlert01.df$SystolicBP)
SBPAlert01.df$BPRisk <-
as.logical(SBPAlert01.df$BPRisk)
base::attach(SBPAlert01.df)
utils::str(SBPAlert01.df)
The dataframe SBPAlert01.df has been put into desired format with the numbers
associated with SubjectID viewed as factors, the numbers associated with SystolicBP
viewed as real numbers, and the text associated with BPRisk viewed as logical
FALSE or TRUE values.
The tidyverse ecosystem has not yet been brought into this R session and as such
the tibble::as_tibble() function and the tibble::tibble() function cannot yet be dem-
onstrated at this point. However, tibbles are used in the addenda, in the back part of
this lesson and throughout this text. A tibble is a special type of dataframe that is
used extensively with the tidyverse ecosystem. It has unique features, and although
the tidyverse ecosystem can often be used with dataframes, it is common to use
tibbles when using functions associated with the tidyverse.
Factors
Factors need special mention in that they are a specialized data type, seen previ-
ously in this lesson (and will be seen throughout future lessons), but a more formal
introduction is provided here. Imagine three adult males from the general popula-
tion with the following heights:
Robert ... 164 centimeters
Juan .... 178 centimeters
Declan .. 192 centimeters
Assume that the measuring tape was used correctly and that these measures are
reliable and valid. However, it is not always easy to obtain precise measures and
sometimes broad measures are the best that can be obtained. Or it needs to be
considered that all recipients of the information on height may not have a good
grasp of the measuring scale. Thus, consider how these height measurements could
instead be expressed initially as:
Robert ... 1
Juan .... 2
Declan .. 3
Treat these 1-2-3 measures as ordered factor-type data, where listing in a code
book indicates that 1 = Short, 2 = Medium, and 3 = Tall. There is an order, and ide-
ally there needs to be a set of cut points for correct placement in each classification.
Of course, context is always important. It is unlikely that an adult male professional
basketball player with a measurement of 192 centimeters would ever be classified
as 3, or Tall – at least among this specialized group of adult males. A separate Code
Book and cut points would be needed for an ordered factor listing of heights for
these professional athletes, where a height of 212 centimeters would not be an
extreme upper-range value for those who are considered tall basketball players.
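A minimal sketch of how such ordered factor-type data might be expressed in R, using the base::factor() function with the ordered argument; the codes and labels follow the code book just described:
HeightClass <- factor(c(1, 2, 3),
  levels  = c(1, 2, 3),
  labels  = c("Short", "Medium", "Tall"),
  ordered = TRUE)
HeightClass
[1] Short  Medium Tall
Levels: Short < Medium < Tall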
Not all factor-type data are ordered, and many data of factor-type are coded list-
ings. The most common factor-type datum that is not ordered is perhaps gender,
with Female = 1 and Male = 2. The assignment of codes is merely a shortcut that
avoids excess keying, where it is far less work to key 1 instead of Female and to key
2 instead of Male. And note how the two codes in this example follow an alphabeti-
cal ordering (e.g., Female – Male), which is used to indicate that there is no ordering
based on the nature of the variable.
List
When using R, a list is a special type of vector containing other objects. Go back to
the demonstration for dataframes and note how a dataframe is actually a special
type of list, with the dataframe SBPAlert01.df consisting of three object variables,
each of a different type, but of the same length: (1) the numbers in SubjectID are
factors, (2) SystolicBP is composed of real numbers, and (3) the FALSE and TRUE
entries in BPRisk are logical values:
utils::str(SBPAlert01.df)
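A quick check confirms the point; the base::is.list() function returns TRUE for a dataframe:
base::is.list(SBPAlert01.df)
[1] TRUE
base::is.data.frame(SBPAlert01.df)
[1] TRUE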
Although lists can take many forms, for this lesson and later lessons, it will be
common to encounter rectangular lists, in the form of dataframes or tibbles. More
complex lists go beyond the introductory nature of this text.
Matrix
A dataframe is a rectangular object where the different variables can represent dif-
ferent data types, as seen in SBPAlert01.df. A matrix, however, must have all vari-
ables of the same data type.
Apply the base::as.matrix() function against the dataframe SBPAlert01.df, and
see the outcome of this action:
SBPAlert01.matrix <- base::as.matrix(SBPAlert01.df)
# Coerce the dataframe into a matrix.
SBPAlert01.matrix
str(SBPAlert01.matrix)
chr [1:4, 1:3] "1035" "1067" "2053" "1716" "122" "186" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "SubjectID" "SystolicBP" "BPRisk"
Matrices are used with some specialized functions pertaining to math, but it is
perhaps best to work with either dataframes or tibbles (when using the tidyverse
ecosystem) instead of a matrix. Perhaps more importantly for this lesson, the
ggplot2::ggplot() function, which is a key part of the tidyverse ecosystem for the
creation of Beautiful Graphics, either will not work or will work only with difficulty
against matrix data. Accordingly, a matrix-based object may need to be put into
either dataframe format or tibble format if there were a desire to plot the data using
functions from the ggplot2 package.
Vector
The simplest data structure in R is likely the vector. A vector is a collection of values of a single type: real numbers (e.g., numbers with a decimal value, even if the number does not display the decimal value); integers (e.g., whole numbers without decimal value); characters; or logical values. Consider two vectors, one vector consisting of only one numeric datapoint and the other vector consisting of five character datapoints:
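The assignments for the two vectors are not reproduced here. A minimal sketch; the single numeric datapoint (101) matches the output shown below, while the five character values are purely hypothetical:
VectorOneDatapoint <- c(101)
VectorFiveDatapoints <- c("Alpha", "Bravo", "Charlie",
                          "Delta", "Echo")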
VectorOneDatapoint
[1] 101
utils::str(VectorOneDatapoint)
base::length(VectorOneDatapoint)
base::is.vector(VectorOneDatapoint)
VectorFiveDatapoints
utils::str(VectorFiveDatapoints)
base::length(VectorFiveDatapoints)
base::is.vector(VectorFiveDatapoints)
Although others may have a different view on how to explain the ubiquity of
vectors in data science, it is argued in this text that vectors should be viewed as the
basic building blocks of nearly all complex R-based data structures. Certainly, vec-
tors are used to build the most common data structures in R and the associated
tidyverse ecosystem, namely dataframes and tibbles.39 Even so, data scientists
should have acquaintance with many different data structures.
Throughout this introductory lesson, R has been introduced in a gentle manner,
in keeping with the notion that: Petit à petit l’oiseau fait son nid. (French); Little by
little, the bird builds its nest. (English); and Doni doni kononi b’a nyaga da.
(Bambaran). The addenda that follow continue along with a gentle introduction to
R and more specifically the tidyverse ecosystem and the use of APIs. Increasing
complexity is seen as the addenda in this lesson progress and future lessons in this
text progress.
39
When using R, an array is an object that can store data in more than two dimensions. As seen
later, a matrix is a two-dimensional array.
Guidance was provided earlier in this lesson that introduced discussion on the num-
ber of completers (e.g., graduates), from Academic Year 2009–2010 to Academic
Year 2018–2019, of selected academic programs of study associated with data sci-
ence and biostatistics, using data provided by the United States Department of
Education.40 All data were gained from the Integrated Postsecondary Education
Data System (IPEDS) Peer Analysis System (https://nces.ed.gov/ipeds/use-the-
data), a United States Department of Education resource freely available to the pub-
lic, worldwide. The purpose of this addendum is to demonstrate how R and the
tidyverse ecosystem are used to generate figures that graphically communicate out-
comes relating to completions. As an advance organizer to R and the tidyverse eco-
system, give focus to the syntax, with more complete detail on the syntax provided
in later lessons.
The IPEDS Peer Analysis System (https://nces.ed.gov/ipeds/use-the-data) was
used to obtain and download the data associated with this part of the lesson. The
IPEDS Peer Analysis System interface does not currently provide data by use of an R-based Application Programming Interface (API) function (e.g., client), and detailed instructions on navigation through the IPEDS Peer Analysis System interface are far beyond the purpose of this lesson. However, all files downloaded
after using the IPEDS Peer Analysis System are available at the publisher’s Web
page associated with this text. For all selected programs of study, it may be helpful
to know that the data were originally downloaded in .csv format but were then also
saved in .xlsx format. Saving the original .csv files in .xlsx format allows convenient
use of the readxl package, a package that is associated with the tidyverse ecosystem,
but is not among packages in the core tidyverse ecosystem – but more detail on core
tidyverse ecosystem packages and associated tidyverse ecosystem packages is pro-
vided in later lessons.
Although an active R session was used in the front matter of this lesson, it is now
assumed that the R session for the addenda starts with the Housekeeping syntax,
below. Follow along with the provided syntax to first understand and then reproduce
40
As a brief explanation of the value of using Integrated Postsecondary Education Data System
(IPEDS) data on completers, it should be mentioned that six-digit CIP data (e.g., the highly granu-
lar data that are specific to individual programs of study) are available for nearly all completers, but
that is not the case for the availability of six-digit CIP (Classification of Instructional Programs)
fall term enrollment data from the IPEDS Peer Analysis System. The rationale for that decision is
that postsecondary students frequently change their academic major program of study before com-
pletion, such that six-digit CIP enrollment data would be an inconsistent and misleading false
friend. The decision to exclude fall term enrollment data also considers that many students transfer
from one postsecondary institution to another, again confounding the efficacy of using enrollment
as an appropriate metric. In contrast, completion of a program of study from a specific institution
and the awarding of either a certificate or degree by that institution is a recorded final event that
results in a fixed datum that will not change and therefore serves as a valid measure of interest for
individual programs of study.
the syntax used to produce the figures.41 The R-based syntax used to generate the
figure for CIP 01.0000 (Agriculture, General), CIP 26.1102 (Biostatistics), and CIP
51.1201 (Medicine) soon follows. When viewing the syntax used to prepare these
figures, note how the dplyr package and the ggplot2 package, key packages in the
core tidyverse ecosystem, are the dominant packages for organization of the data
(dplyr) and subsequently preparation of these figures (ggplot2).
There are a few different packages and related functions associated with the tidy-
verse ecosystem that are used to import data into an active R session:
• The readr package is part of core tidyverse and it supports many different func-
tions that are used to import delimited files during an active R session. Among
the many different file import functions supported by the readr package, perhaps
the most common is the readr::read_csv() function. Due to nearly universal use
of comma separated files, the readr::read_csv() function is often used to import
rectangular data (e.g., data organized in row by column format) that are in comma
separated values (.csv) format – an extremely common file format that is easily
shared with others.
• The readxl package is associated with the tidyverse ecosystem and is commonly
used to import spreadsheets that are in either .xls format (using the readxl::read_
xls() function) or .xlsx format (using the readxl::read_xlsx() function). The many arguments associated with these functions make the readxl package quite useful: when data are imported, they arrive organized as expected, with the data in declared format (a brief sketch of both import styles follows this list).
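A minimal sketch of the two import styles just described, using hypothetical file names and assuming the readr and readxl packages have been installed and loaded, as shown later in this addendum:
Completions.tbl <- readr::read_csv("CIP260102.csv")
# Import a comma separated values file; the result is a
# tibble.
Completions.tbl <- readxl::read_xlsx("CIP260102.xlsx",
  sheet = 1, col_names = TRUE)
# Import the same data saved in .xlsx format.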
A few reminders may be helpful for those who want to learn more about file
formats:
• Delimited files are commonly found in data science, where the file is structured
so that data are separated by some type of character. Commas are perhaps the
most common characters used to separate data in a row, data from one column to
data in another column.
• The .csv file format is structured so that commas are used to separate data fields.
This format has been used in data science since the early 1970s, with first use
often attributed to use of the Fortran programming language. Although .csv files
do not have the many robust features of other data-oriented file formats, simplic-
ity is the comparative advantage of the .csv file format. Simple .csv files can be
opened by nearly all text editors and spreadsheets, allowing nearly universal use
in data science. The United States Library of Congress (https://www.loc.gov/
preservation/digital/formats/fdd/fdd000323.shtml) provides a rich history of the
.csv format. The Library of Congress identifies the .csv format as a preferred
dataset format in their Recommended Formats Statement (RFS) for datasets.
Search for and read about standard RFC 4180 to learn more about the history and
41
The syntax in this addendum provides an advance organizer to the use of data science and R. For
now, give attention to process. More specific exposure to the use of R syntax in support of data
science is provided in later lessons.
use of files in .csv format. There are regions throughout the international data
science community where a comma character is used with numbers in place of a
decimal point (e.g., The mean weight of subjects was 12,34 Kg. instead of The
mean weight of subjects was 12.34 Kg.). In that case, it is possible to prepare a
.csv-equivalent delimited file with a semicolon character serving as a separator
between data fields, instead of a comma. Tabs are also frequently used to sepa-
rate data within a row. It is now uncommon, but white space (e.g., use of the
space bar) was once a common format used to separate row-based data into fixed
columns.
• The readxl::read_excel() function can be used to import both .xls format files and
.xlsx format files, and it is often used when the exact spreadsheet file extension
is unknown.
• Data are automatically organized as a tibble when using functions associated with the readxl package. A tibble is a specialized type of dataframe that has many advantages, and it is the standard dataframe format when working with the tidyverse ecosystem. More discussion about the tibble dataframe (e.g., dataset) format is provided in later lessons.
Data for this lesson were gained by using the IPEDS Peer Analysis System, orig-
inally downloaded as .csv files, put into .xlsx format, and placed on an external F:\
drive. With the first row of each spreadsheet serving as column name identifiers, the
data are rectangular and are consistently organized in more than 6000 rows and 22
columns. Each row represents a unique postsecondary institution, identified by a
unique UnitID, and each column represents a unique variable. In original format,
from when the data were first obtained, names in the column headers are quite long
and complex, but they are descriptive and fully illustrate the nature of the data.42 Be
sure to look at use of the many arguments associated with the readxl::read_xlsx()
function so that data, when imported, are in good form, especially the long col-
umn names.43
42
The original format column header names are certainly not tidy, but look at the syntax for how the base::colnames() function is used to put the final form column names into a more tidy format that is equally easy to read and understand.
43
The utils::read.table() function, associated with Base R, the set of packages and functions avail-
able when R is first downloaded, is a common tool for importing .csv files into an active R session.
One frequently used feature associated with the utils::read.table() function is use of the stringsAs-
Factors argument. By using this argument, it is possible to set character data (e.g., Female or Male,
Fail or Pass, etc.) as factors during the data import process. The tidyverse ecosystem and specifi-
cally the readxl package takes a totally different approach to this task. When the readxl package is
used to import data, using either the readxl::read_excel(), readxl::read_xls(), or readxl::read_xlsx()
functions, character data are retained as characters after they are imported and they are not put into
factor format during the data import process. If there were a desire to organize character data in
factor format, which is common, that forced action must come later, after the data are imported and
put into a tibble. The advantage of this approach is that when using tools in the tidyverse ecosys-
tem, data must be organized early-on and the syntax for this set of actions will be very visible as
syntax of its own, minimizing a possible misuse of the data because of a simple action that was
somehow earlier forgotten.
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
44
Ideally, in the near future the active R session will be based on use of the cloud, but for now a portable drive, a physical portable F:\ drive in this case, meets current needs.
45
After this syntax was prepared, the core tidyverse was updated to tidyverse 2.0.0. The lubridate
package was added as part of the new core tidyverse package of packages. More detail is provided
in a later lesson.
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
# All eight packages in the core tidyverse ecosystem
# will now be available and ready for use:
# dplyr
# forcats
# ggplot2
# purrr
# readr
# stringr
# tibble
# tidyr
# As a good programming practice (gpp), the following
# convention will be used when functions in the core
# tidyverse are used: PackageName::FunctionName(). By
# using this practice there should be no confusion as to
# which function is associated with which tidyverse (or
# other) package. Some may see this convention as an
# unnecessary redundant activity, but it is argued that
# this practice is part of the overall quality assurance
# process, especially for those who are new to data
# science, R, and the tidyverse ecosystem.
#
# Other packages will be downloaded as needed, but again,
# all eight packages associated with core tidyverse are
# now available and ready for use.
install.packages("readxl", dependencies=TRUE)
library(readxl)
# The readxl package is NOT part of the core tidyverse.
# It needs to be individually installed and loaded.
# The readxl package was selected for data import, but of
# course other packages could have also been used.
install.packages("magrittr", dependencies=TRUE)
library(magrittr)
# The magrittr package is NOT part of the core tidyverse.
# It needs to be individually installed and loaded.
# The pipe operator (e.g., %>%), from the magrittr
# package, is used to move an object forward. The use of
# pipes, often multiple pipes in a chain, is an essential
# part of how the tidyverse ecosystem is used to create
# syntax that is both functional and easily understood by
# others.
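A trivial illustration of the pipe operator, unrelated to the IPEDS data, simply to show how an object moves forward from one function to the next:
c(4, 9, 16) %>% sqrt() %>% sum()
# The vector is piped into sqrt(), and the square roots
# (2, 3, 4) are then piped into sum().
[1] 9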
The ggplot2 package is part of the core tidyverse and it is used to create Beautiful
Graphics. However, there are many ancillary packages that support the production
of graphical presentations, with features that go far beyond what can be prepared
using the ggplot2 package by itself. A few of these graphically oriented ancillary
packages include:
install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
install.packages("ggtext", dependencies=TRUE)
library(ggtext)
install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
install.packages("scales", dependencies=TRUE)
library(scales)
Challenge: As mentioned earlier, the R-based syntax used to generate the figures
for CIP 01.0000 (Agriculture, General), CIP 26.1102 (Biostatistics), and CIP
51.1201 (Medicine) soon follows. Go to the publisher’s Web site associated with
this text to obtain the full set of IPEDS-originated .xls files, one file for each of the
programs of study in the following output:
Challenge: Again, after reviewing the syntax and obtaining the data, reproduce
the figures for each of these academic programs of study, where each has some
degree of focus on biostatistics. The readxl::read_xlsx() function will be used to
start this task. Follow along with the arguments, such as sheet, col_names, col_
types, etc., to see the quality assurance measures used to be sure that data are
imported correctly and in desired format. Then use the syntax exactly as presented,
perhaps only changing the disk drive to personal preferences, to generate all figures.
In its original form, as the data are obtained from IPEDS, the column names are
extremely long, complex, have multiple spaces and special characters, etc. In short,
the column names are unmanageable from a tidy perspective. R supports many
functions that could be used to rename column names, including functions in either
the core tidyverse or those functions from packages that are ancillary to the tidy-
verse ecosystem as well as functions from Base R. The functions dplyr::rename()
and dplyr::rename_with() are commonly used, but other similar functions have
merit too, including gdata::rename.vars() and data.table::setnames().
From a much larger set of possible selections, the IPEDS Peer Analysis System
was used to construct 30 datasets relating to certificate and degree completion for
identified CIPs, programs of study that require some degree of association with data
science and biostatistics. Queries to the IPEDS Peer Analysis System followed a
consistent structure, allowing some degree of automation for data retrieval (given
that an R-based API function is not supported for these data) and eventual use of the
data to construct the figures:
• There are 22 columns.
• Column placement is consistent in terms of selected variables.
• The only difference from one download to another is that the data are CIP spe-
cific, but the number of rows (e.g., postsecondary institutions) and the structure
for columns (e.g., variables) are consistent.46
• Because of this consistency in the way columns are named and organized, it was judged best to use the base::colnames() function to rename the columns with manageable names. The tidyverse ecosystem is a great contribution to the R language, but functions from Base R should not be overlooked when their use is appropriate; in many cases they are the simplest approach to achieving aims.47
46. Ideally, the IPEDS Peer Analysis System would support function-specific R-based API (Application Programming Interface) data retrieval. R syntax would then invoke a function from an active R session and the desired data would be returned, eliminating manual interaction with an interface at the originating data source. Unfortunately, the IPEDS data resource does not yet support this type of API data retrieval. Those who work in data science must be able to react to multiple data acquisition processes, not only those that are ideal.
47. Search on the early 1300s writings of William of Ockham, who is generally credited with formulating Occam's razor, an approach that advocates simple solutions to problems whenever possible.
base::colnames(WCIP010000.tbl) <- c(
"UnitID", # Column 01 UnitID
"Institution", # Column 02 Institution
"Fall2019", # Column 03 Fall 2019 Enrollment
"Institution", # Column 04 Institution (redundant)
"City", # Column 05 City
"State", # Column 06 State
"FIPS", # Column 07 FIPS County
"Longitude", # Column 08 Longitude
"Latitude", # Column 09 Latitude
"Control", # Column 10 Institutional Control
"Highest", # Column 11 Highest Level Offered
"Carnegie", # Column 12 Carnegie Classification
"AY201819", # Column 13 AY 2018-19 Degrees/Certificates
"AY201718", # Column 14 AY 2017-18 Degrees/Certificates
"AY201617", # Column 15 AY 2016-17 Degrees/Certificates
"AY201516", # Column 16 AY 2015-16 Degrees/Certificates
"AY201415", # Column 17 AY 2014-15 Degrees/Certificates
"AY201314", # Column 18 AY 2013-14 Degrees/Certificates
"AY201213", # Column 19 AY 2012-13 Degrees/Certificates
"AY201112", # Column 20 AY 2011-12 Degrees/Certificates
"AY201011", # Column 21 AY 2010-11 Degrees/Certificates
"AY200910") # Column 22 AY 2009-10 Degrees/Certificates
There is now a high degree of assurance that the dataset has been imported correctly and that the data are in good order. Notice that many NAs show in the dataset, but that is expected. More than 6000 postsecondary institutions are represented in this dataset, gained by queries to the IPEDS Peer Analysis System, and of this large population only a few postsecondary institutions have a program of study coded as CIP 01.0000 Agriculture, General.
base::length(WCIP010000.tbl$AY201819)
# Number of lines (e.g., records, rows, etc.)
[1] 6179
base::table(is.na(WCIP010000.tbl$AY201819))
# For the identified object, generate a table of rows with
# missing data (e.g., is.na outcome is TRUE) and rows where
# data are not missing data (e.g., is.na outcome is FALSE).
# The data of interest to the upcoming figure are from the
# rows marked FALSE, where is.na yields a FALSE outcome.
FALSE TRUE
186 5993
Code Book: When navigating the IPEDS Peer Analysis System interface, a pop-up menu is available to describe the meaning of numeric codes. These codes are then downloaded along with the main dataset. There are variables in the IPEDS spreadsheets that are not currently needed to generate the desired figures, such as Longitude and Latitude or Carnegie Classification, but they were selected for possible use in the future.
Selected sections of the Code Book were deleted to save space, but all details are provided online, at the IPEDS Peer Analysis System resource.
# UnitID
# Each postsecondary institution is assigned a unique numeric
# code. Some institutions with multiple campuses have multiple
# UnitIDs, and some do not.
#
# Institution
# The actual name of an institution is provided, which may be
# different than what is commonly used in less formal usage.
#
# Fall 2019 Enrollment
# Unduplicated fall term headcount enrollment, as measured on a
# set census date near end of term, is provided. This metric
# is different from FTE (Full Time Equivalent) enrollment and
# duplicated fall term headcount enrollment. Recall that due to
# the COVID-19 pandemic and its impact on massive reduction in
# student engagement beginning in 2020, 2019 was the last year
# that was perhaps reflective of norm behavior and enrollment
# patterns.
#
# Institution (redundant)
# The name of the institution is provided again.
#
# City
# The city location is provided for the UnitID campus.
#
# State
# The state for the identified UnitID campus is provided.
#
# FIPS County
# Refer to https://www.census.gov/geographies/reference-files/
# 2020/demo/popest/2020-fips.html for a listing of the thousands
# of numeric state and county Federal Information Processing
# Standards (FIPS) codes. Numeric FIPS codes are a far more
# efficient way of identifying counties than the use of actual
# names since some county names show in multiple states (e.g.,
# Washington is used as a county or parish name in 31 states,
# Jefferson in 26, Franklin in 25, Jackson in 24 and Lincoln in
# 24, etc.).
#
# Longitude
# The longitude of the UnitID campus is used to create maps.
#
# Latitude
# The latitude of the UnitID campus is used to create maps.
#
# Institutional Control
# 1 Public
# 2 Private not-for-profit
# 3 Private for-profit
#
48. The tidyr::gather() function is deprecated. It is still included among the many functions available in the tidyr package, but there are no current efforts to improve its functionality. Most importantly, existing syntax from prior projects that uses the tidyr::gather() function still works, but new projects tend to use the tidyr::pivot_longer() function, given its support and improved functionality.
LCIP010000.tbl <-
tidyr::pivot_longer(WCIP010000Adjusted.tbl,
-c(UnitID),
names_to = "AY", values_to = "Completers")
# Put the data into long format, using the
# tidyr::pivot_longer() function.
#
# The expression -c(UnitID) means that the
# tidyr::pivot_longer() function should
# pivot everything except UnitID. In this
# syntax, the minus sign means except.
LCIP010000.tbl
Before any graphics are produced, it is important to know that the ggplot2::ggplot() function supports many different themes. A ggplot2 theme is a named collection of settings that controls the non-data appearance of a figure. To make the figures bold and vibrant, but also to reduce redundant keying, look at theme_Mac(), a self-created theme that will be used in concert with the ggplot2::ggplot() function.
Themes reduce redundant keying while adding value to a project. Additional themes, beyond the standard themes available to all, are used to enhance axis and tick mark presentation, bold labels and titles, centering, font size and color, etc. However, these ad hoc changes to standard themes require many lines of syntax. By keying that syntax one time and saving it under a unique name, it is possible to easily deploy a user-created theme such as theme_Mac() multiple times and in multiple projects. This multiple use and reuse of existing syntax promotes an efficient and tidy way of meeting project requirements.
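The full definition of theme_Mac() is not reproduced in this excerpt. A minimal sketch of a comparable user-created theme, built on ggplot2::theme_bw(), might look like the following; the specific settings are assumptions, not the exact choices behind theme_Mac():
theme_Mac <- function() {
  ggplot2::theme_bw() +
    ggplot2::theme(
      plot.title      = ggplot2::element_text(face="bold", size=16, hjust=0.5),
      plot.subtitle   = ggplot2::element_text(face="bold", size=12, hjust=0.5),
      axis.title      = ggplot2::element_text(face="bold", size=12),
      axis.text       = ggplot2::element_text(face="bold", size=10),
      legend.position = "none")
}
# Once keyed and saved, theme_Mac() can be added to any ggplot2
# object with the + operator, just like a standard theme.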
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
Create an object that summarizes the number of completers for each academic year, Academic Year 2009–2010 to Academic Year 2018–2019. The output supplies the labels placed over the top of each column when a column chart (e.g., bar chart) of completers by academic year is prepared (Fig. 1.4).
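The syntax that builds this summary object is not reproduced in this excerpt. A minimal sketch using the dplyr package, consistent with the printed output below, might be:
CIP010000SumByAY <- LCIP010000.tbl %>%
  dplyr::group_by(AY) %>%
  dplyr::summarize(
    sum = sum(Completers, na.rm=TRUE), # Total completers by AY
    n   = dplyr::n())                  # Records contributing to each AY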
CIP010000SumByAY
# Use this summary to help determine the range for the Y
# axis, labels placed over each academic year column, and
# also to confirm the correct ordering of academic years.
# A tibble: 10 x 3
AY sum n
<chr> <dbl> <int>
1 AY200910 2258 6179
2 AY201011 2508 6179
3 AY201112 2587 6179
4 AY201213 2691 6179
5 AY201314 2873 6179
6 AY201415 2932 6179
7 AY201516 3233 6179
8 AY201617 3302 6179
9 AY201718 3376 6179
10 AY201819 3464 6179
CIP010000AgricultureGeneral.fig <-
ggplot2::ggplot(data=LCIP010000.tbl, aes(x=AY, y=Completers)) +
geom_col(fill="red") +
geom_richtext(data=CIP010000SumByAY,
aes(x=AY, label=scales::comma(round(sum), accuracy=1),
y=sum), hjust=0.50, vjust = -0.75, size=5,
label.color="black", fontface="bold") +
# Notice the use of geom_richtext, not geom_text. A few
# desired embellishments are possible with geom_richtext,
# such as the large print label above each column and the
# way this label is highlighted in an attractive offset
# textbox with rounded corners. Give special attention to
# the way label=scales::comma(round(), accuracy=1) was used
# to use comma notation as a thousands separator and to be
# sure that whole numbers were generated since the labels
# represent counts and it would be inappropriate to use
# decimal notation.
labs(
title="CIP 01.0000 Agriculture, General",
subtitle="Completions (All Degrees and Certifications) by
Academic Year",
x = "\nAcademic Year",
y = "Completions - All Degrees and\nCertifications\n") +
annotate("text", x=5.5, y=-100.0, fontface="bold", size=03,
label="Academic Year: July 01 to June 30") +
# Notice how annotate() has been placed in a centered
# position (x=5.5 for this figure), below the columns.
# The scale_x_discrete() and scale_y_continuous() layers used
# for this figure are omitted from this excerpt; they follow
# the same pattern shown below for Fig. 1.5.
theme_Mac()
Fig. 1.4
par(ask=TRUE)
CIP010000AgricultureGeneral.fig
# Fig. 1.4
The figure offers a very useful perspective of throughput (e.g., completions of all degrees and certifications) for the selected CIP-based program of study. For those with special interest, the IPEDS Peer Analysis System can be used to provide not only completer totals but also breakouts by different degree levels. However, that degree of granularity is beyond the purpose of this summary presentation.
Review the syntax for the two other CIP-specific programs of study associated with data science and biostatistics selected for presentation in this text, from among the many possible selections: CIP 26.1102 (Biostatistics) and CIP 51.1201 (Medicine).
Once again, respond to the challenge to use the syntax and model in this addendum to create all 30 figures that reflect completions over time among programs of study that have some degree of association with biostatistics, from CIP 01.0000 Agriculture, General to CIP 51.3818 Nursing Practice, following along with the prior list.49 To save space, nearly all comments have been removed from the remaining syntax in this addendum since they repeat what was seen earlier (Figs. 1.5 and 1.6).
Fig. 1.5
Fig. 1.6
49. The best way to learn R syntax is to read R syntax prepared by others, write R syntax, make corrections, read documentation, and then experiment with multiple packages and functions, etc. Practice – practice – practice!
base::colnames(WCIP261102.tbl) <- c(
"UnitID", # Column 01 UnitID
"Institution", # Column 02 Institution
"Fall2019", # Column 03 Fall 2019 Enrollment
"Institution", # Column 04 Institution (redundant)
"City", # Column 05 City
"State", # Column 06 State
"FIPS", # Column 07 FIPS County
"Longitude", # Column 08 Longitude
"Latitude", # Column 09 Latitude
"Control", # Column 10 Institutional Control
"Highest", # Column 11 Highest Level Offered
"Carnegie", # Column 12 Carnegie Classification
"AY201819", # Column 13 AY 2018-19 Degrees/Certificates
"AY201718", # Column 14 AY 2017-18 Degrees/Certificates
"AY201617", # Column 15 AY 2016-17 Degrees/Certificates
"AY201516", # Column 16 AY 2015-16 Degrees/Certificates
"AY201415", # Column 17 AY 2014-15 Degrees/Certificates
"AY201314", # Column 18 AY 2013-14 Degrees/Certificates
"AY201213", # Column 19 AY 2012-13 Degrees/Certificates
"AY201112", # Column 20 AY 2011-12 Degrees/Certificates
"AY201011", # Column 21 AY 2010-11 Degrees/Certificates
"AY200910") # Column 22 AY 2009-10 Degrees/Certificates
base::length(WCIP261102.tbl$AY201819)
base::table(is.na(WCIP261102.tbl$AY201819))
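# Assumption: the dplyr::select() step that creates
# WCIP261102Adjusted.tbl is not reproduced in this excerpt; it
# presumably mirrors the CIP 51.1201 syntax shown later.
WCIP261102Adjusted.tbl <- WCIP261102.tbl %>%
  dplyr::select(c(
    UnitID,
    AY201819, AY201718, AY201617, AY201516, AY201415,
    AY201314, AY201213, AY201112, AY201011, AY200910))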
LCIP261102.tbl <-
tidyr::pivot_longer(WCIP261102Adjusted.tbl,
-c(UnitID),
names_to = "AY", values_to = "Completers")
LCIP261102.tbl
base::class(theme_Mac)
CIP261102SumByAY
CIP261102Biostatistics.fig <-
ggplot2::ggplot(data=LCIP261102.tbl, aes(x=AY, y=Completers)) +
geom_col(fill="red") +
geom_richtext(data=CIP261102SumByAY,
aes(x=AY, label=scales::comma(round(sum), accuracy=1),
y=sum), hjust=0.50, vjust = -0.75, size=5,
label.color="black", fontface="bold") +
labs(
title="CIP 26.1102 Biostatistics",
subtitle="Completions (All Degrees and Certifications) by
Academic Year",
x = "\nAcademic Year",
y = "Completions - All Degrees and\nCertifications\n") +
annotate("text", x=5.5, y=-40.0, fontface="bold", size=03,
label="Academic Year: July 01 to June 30") +
scale_x_discrete(labels = c(
"AY2009-10", # By using enumerated labels, the natural
"AY2010-11", # ordering of label placement was changed.
"AY2011-12", # As such, notice the reverse order of the
"AY2012-13", # enumerated labels.
"AY2013-14", #
"AY2014-15", # Quality Assurance Check: Compare the
"AY2015-16", # number in each textbox label to each value
"AY2016-17", # associated with CIP261102SumByAY.
"AY2017-18",
"AY2018-19")) +
scale_y_continuous(labels=scales::comma, limits=c(-40,1200),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
par(ask=TRUE)
CIP261102Biostatistics.fig
# Fig. 1.5
base::colnames(WCIP511201.tbl) <- c(
"UnitID", # Column 01 UnitID
"Institution", # Column 02 Institution
"Fall2019", # Column 03 Fall 2019 Enrollment
"Institution", # Column 04 Institution (redundant)
"City", # Column 05 City
"State", # Column 06 State
"FIPS", # Column 07 FIPS County
"Longitude", # Column 08 Longitude
"Latitude", # Column 09 Latitude
"Control", # Column 10 Institutional Control
"Highest", # Column 11 Highest Level Offered
"Carnegie", # Column 12 Carnegie Classification
"AY201819", # Column 13 AY 2018-19 Degrees/Certificates
"AY201718", # Column 14 AY 2017-18
86 1 Emergence of Data Science as a Critical Discipline in Biostatistics
Degrees/Certificates
"AY201617", # Column 15 AY 2016-17 Degrees/Certificates
"AY201516", # Column 16 AY 2015-16 Degrees/Certificates
"AY201415", # Column 17 AY 2014-15 Degrees/Certificates
"AY201314", # Column 18 AY 2013-14 Degrees/Certificates
"AY201213", # Column 19 AY 2012-13 Degrees/Certificates
"AY201112", # Column 20 AY 2011-12 Degrees/Certificates
"AY201011", # Column 21 AY 2010-11 Degrees/Certificates
"AY200910") # Column 22 AY 2009-10 Degrees/Certificates
base::length(WCIP511201.tbl$AY201819)
base::table(is.na(WCIP511201.tbl$AY201819))
WCIP511201Adjusted.tbl <- WCIP511201.tbl %>%
dplyr::select(c(
UnitID,
AY201819,
AY201718,
AY201617,
AY201516,
AY201415,
AY201314,
AY201213,
AY201112,
AY201011,
AY200910))
LCIP511201.tbl <-
tidyr::pivot_longer(WCIP511201Adjusted.tbl,
-c(UnitID),
names_to = "AY", values_to = "Completers")
LCIP511201.tbl
base::class(theme_Mac)
CIP511201SumByAY
CIP511201Medicine.fig <-
ggplot2::ggplot(data=LCIP511201.tbl, aes(x=AY, y=Completers)) +
geom_col(fill="red") +
geom_richtext(data=CIP511201SumByAY,
aes(x=AY, label=scales::comma(round(sum), accuracy=1),
y=sum), hjust=0.50, vjust = -0.75, size=5,
label.color="black", fontface="bold") +
labs(
title="CIP 51.1201 Medicine",
subtitle="Completions (All Degrees and Certifications) by
Academic Year",
x = "\nAcademic Year",
y = "Completions - All Degrees and\nCertifications\n") +
annotate("text", x=5.5, y=-300.0, fontface="bold", size=03,
label="Academic Year: July 01 to June 30") +
scale_x_discrete(labels = c(
"AY2009-10", # By using enumerated labels, the natural
"AY2010-11", # ordering of label placement was changed.
"AY2011-12", # As such, notice the reverse order of the
"AY2012-13", # enumerated labels.
"AY2013-14", #
"AY2014-15", # Quality Assurance Check: Compare the
"AY2015-16", # number in each textbox label to each value
"AY2016-17", # associated with CIP511201SumByAY.
"AY2017-18",
"AY2018-19")) +
scale_y_continuous(labels=scales::comma, limits=c(-300,23000),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
par(ask=TRUE)
CIP511201Medicine.fig
# Fig. 1.6
The purpose of the analyses in this addendum is to offer a sample of jobs that require some degree of acquaintance with data science and biostatistics. The United States Bureau of Labor Statistics is the sole source for these job-related data.
As time and interest permit, review the many resources made available by the Bureau of Labor Statistics, which go far beyond what is provided in this addendum.50
To reinforce, the United States Bureau of Labor Statistics was the source for the data in this addendum, national_M2020_dl.xlsx and state_M2020_dl.xlsx. The data were obtained by selecting the XLS option and then downloading the OEWS data posted at https://www.bls.gov/oes/tables.htm. Both datasets were downloaded to the F drive and have also been posted at the publisher's Web site associated with this text.
Like the IPEDS data in the prior addendum, the data from the Bureau of Labor Statistics are obtained through a typical Web-based interface; they are currently unavailable through an R-based API function serving as a client. Although an API data acquisition process would be ideal, as stated earlier, data scientists do not always find data in the desired format and instead need to adjust.
It was stated earlier that the characters * and # show in many columns, columns that should be numeric but are not due to the presence of these two characters. The notes found in the data dictionary (e.g., Sheet 2 of the downloaded file national_M2020_dl.xlsx) identify that the * character is used to show that a wage estimate is not available. The # character is used to indicate that the wage is equal to or greater than $100.00 per hour or $208,000 per year. The masking of the exact wage is a purposeful decision by the Bureau of Labor Statistics, and the exact values are not readily available at the original data source.
The task now is to import the data contained in the file national_M2020_dl.xlsx,
to immediately view it, and to then adjust the file based on the object variables of
interest in this addendum. For the previously identified jobs that require some
degree of acquaintance with data science and biostatistics, the focus for this adden-
dum is the presentation of statistics related to job code, job title, number of employ-
ees at the national level by job code, and annual median salary.51
NationalOCCJobsMay2020.tbl <-
readxl::read_excel("national_M2020_dl.xlsx", 1)
# Use the readxl::read_excel() function to import the .xlsx
# spreadsheet national_M2020_dl.xlsx into the current R
# session and place the contents into the object
# NationalOCCJobsMay2020.tbl.
#
# The number 1 that shows after the .xlsx filename is used to
# declare that only the 1st sheet in the spreadsheet should
# be read into NationalOCCJobsMay2020.tbl, the intended
# object in this example. However, prior to importing Sheet
# 1, review Sheet 2 which serves as a code book for the file.
base::getwd() # Working directory
dplyr::glimpse(NationalOCCJobsMay2020.tbl) # File structure
50. OCC-coded jobs refer to primary occupation. OCC codes are used by federal agencies and provide some degree of command and control over employment trends.
51. Although it is common to see the mean as a measure of central tendency when identifying average salary information, it is argued that the median may be a more appropriate statistic. Yet both measures of central tendency (e.g., mean and median) are commonly presented and used.
Based on the purpose of this addendum, it is only necessary to work with data
from four of the 31 columns that currently show in NationalOCCJobsMay2020.tbl:
NationalOCCJobsMay2020Adjusted1.tbl <-
NationalOCCJobsMay2020.tbl %>%
dplyr::select(c(
OCC_CODE, # Job code
OCC_TITLE, # Job title
TOT_EMP, # Total employment
A_MEDIAN)) # Annual median salary
# Use the dplyr::select() function to select only those
# columns that are required for the task at hand.
dplyr::glimpse(NationalOCCJobsMay2020Adjusted1.tbl)
52. This lesson is introductory. Give attention to the syntax, but more detail on its selection and use is presented in later lessons.
NationalOCCJobsMay2020Adjusted2.tbl <-
NationalOCCJobsMay2020Adjusted1.tbl %>%
dplyr::filter(OCC_CODE %in% c(
"15-2041", # Statisticians
"17-2031", # Bioengineers and Biomedical Engineers
"19-1011", # Animal Scientists
"19-1012", # Food Scientists and Technologists
"19-1013", # Soil and Plant Scientists
"19-1020", # Biological Scientists
"19-1021", # Biochemists and Biophysicists
"19-1022", # Microbiologists
"19-1023", # Zoologists and Wildlife Biologists
"19-1029", # Biological Scientists, All Other
"19-1032", # Foresters
"19-1040", # Medical Scientists
"19-1041", # Epidemiologists
"19-4010", # Agricultural and Food Science Technicians
"19-4021", # Biological Technicians
"19-4040", # Environmental Science and Geoscience Techni ...
"19-4092", # Forensic Science Technicians
"25-1040", # Life Sciences Teachers, Postsecondary
"25-1041", # Agricultural Sciences Teachers, Postsecondary
"25-1042", # Biological Science Teachers, Postsecondary
"25-1070", # Health Teachers, Postsecondary
"25-1072", # Nursing Instructors and Teachers, Postsecon ...
"29-1021", # Dentists, General
"29-1041", # Optometrists
"29-1051", # Pharmacists
"29-1131", # Veterinarians
"29-1141", # Registered Nurses
"29-1151", # Nurse Anesthetists
"29-1211", # Anesthesiologists
"29-1215", # Family Medicine Physicians
"29-1216", # General Internal Medicine Physicians
"29-1218", # Obstetricians and Gynecologists
"29-1221", # Pediatricians, General
"29-1228", # Physicians, All Other; and Ophthalmologists ...
"29-1248")) # Surgeons, Except Ophthalmologists
# Note the numbering sequence for the dataframe title:
# Adjusted2, not Adjusted1.
These actions should now result in a tibble-based dataframe that meets immediate requirements: the production of a printout of selected job codes, job titles, number of employees at the national level, and median annual salary.
base::getwd()
base::ls()
base::attach(NationalOCCJobsMay2020Adjusted2.tbl)
utils::str(NationalOCCJobsMay2020Adjusted2.tbl)
dplyr::glimpse(NationalOCCJobsMay2020Adjusted2.tbl)
utils::head(NationalOCCJobsMay2020Adjusted2.tbl)
base::summary(NationalOCCJobsMay2020Adjusted2.tbl)
The step-by-step process of deconstructing the original imported file into the desired tibble has resulted in a dataset that meets all requirements: the production of a printout that lists a self-selected sample of jobs that require some degree of acquaintance with data science and biostatistics, along with other useful information. Use the base::print() function to present an attractive printout of the final result, taking into account the required number of rows for the printout (gained from the dplyr::glimpse() function):
base::print(NationalOCCJobsMay2020Adjusted2.tbl, n=36,
width=64)
# A tibble: 36 x 4
OCC_CODE OCC_TITLE TOT_EMP A_MED~1
<chr> <chr> <chr> <chr>
1 15-2041 Statisticians 38860 92270
2 17-2031 Bioengineers and Biomedical Enginee~ 18660 92620
25 29-1041 Optometrists 36690 118050
26 29-1051 Pharmacists 315470 128710
27 29-1131 Veterinarians 73710 99250
28 29-1141 Registered Nurses 2986500 75330
29 29-1151 Nurse Anesthetists 41960 183580
Some of the job titles are quite long. If there were a desire to print a wider format
output, then merely adjust the width. Observe the difference, with width set to 110
(the width needed to have all job titles show in full) instead of 64.
base::print(NationalOCCJobsMay2020Adjusted2.tbl, n=36,
width=110)
R can easily accommodate this concern over missing values by using a fairly simple process associated with Base R. Of course, there are tidyverse ecosystem tools for this task but, going back to a prior comment on the value of simplicity, the syntax shown below is easy to implement and works well, given the introductory nature of this lesson.
NationalOCCJobsMay2020Adjusted2.tbl[
NationalOCCJobsMay2020Adjusted2.tbl == "#"] <- NA
# Replace all # characters with NA, the missing
# value indicator.
base::print(NationalOCCJobsMay2020Adjusted2.tbl, n=36,
width=64)
# Confirm that each # character has been replaced with NA.
The NA (e.g., missing value) character now shows, correctly, for the A_MEDIAN
object. There is one remaining task that is evident when looking at the first few lines
of the print output – TOT_EMP and A_MEDIAN both show as character-based
objects. There are many ways to accommodate this concern, but perhaps the easiest
way is to use a simple transformation process:53
53. The figure is presented in the front matter of this lesson.
NationalOCCJobsMay2020Adjusted2.tbl$TOT_EMP <-
as.numeric(NationalOCCJobsMay2020Adjusted2.tbl$TOT_EMP)
NationalOCCJobsMay2020Adjusted2.tbl$A_MEDIAN <-
as.numeric(NationalOCCJobsMay2020Adjusted2.tbl$A_MEDIAN)
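As an aside, the same conversion could be expressed with tidyverse tools; a minimal sketch of that alternative (an option, not the syntax used in this lesson) follows:
NationalOCCJobsMay2020Adjusted2.tbl <-
  NationalOCCJobsMay2020Adjusted2.tbl %>%
  dplyr::mutate(dplyr::across(c(TOT_EMP, A_MEDIAN), as.numeric))
# dplyr::across() applies as.numeric() to both columns in one call.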
base::getwd()
base::ls()
base::attach(NationalOCCJobsMay2020Adjusted2.tbl)
utils::str(NationalOCCJobsMay2020Adjusted2.tbl)
dplyr::glimpse(NationalOCCJobsMay2020Adjusted2.tbl)
utils::head(NationalOCCJobsMay2020Adjusted2.tbl)
base::summary(NationalOCCJobsMay2020Adjusted2.tbl)
NationalSalarySelectedOCCJobs.fig <-
ggplot2::ggplot(data=NationalOCCJobsMay2020Adjusted2.tbl,
aes(x=OCC_CODE, y=A_MEDIAN)) +
geom_col(fill="red") +
labs(
title=
"Selected Jobs Associated with Data Science and Biostatistics:
Annual Median Salary - May 2020",
subtitle=
"All Data are from the Bureau of Labor Statistics",
x = "\nJob Code,
https://www.bls.gov/oes/current/oes_stru.htm",
y = "Annual Median Salary\n") +
annotate("text", x=2.5, y=225000, fontface="bold", size=03,
hjust=0, label=
"Data are excluded in the original dataset for median") +
annotate("text", x=2.5, y=210000, fontface="bold", size=03,
hjust=0, label=
"salaries >= $208,000 per year.") +
scale_y_continuous(labels=scales::dollar, limits=c(0,250000),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(axis.text.x=element_text(face="bold", size=08,
hjust=0.5, vjust=1, angle=90)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
par(ask=TRUE)
NationalSalarySelectedOCCJobs.fig
# The syntax for this figure shows here, but the figure
# itself shows in the front matter of this lesson.
# Fig. 1.1
Much more could be done to examine readily available federal data regarding
career choices in biostatistics: academic programs of study and change in comple-
tions over time, job titles, job requirements, national survey of salaries by job, etc.
StateOEWSJobsMay2020.tbl <-
readxl::read_excel("state_M2020_dl.xlsx", 1)
# Use the readxl::read_excel() function to import the .xlsx
# spreadsheet state_M2020_dl.xlsx into the current R session
# and place the contents into the object
# StateOEWSJobsMay2020.tbl
#
# The number 1 that shows after the .xlsx filename is used to
# declare that only the 1st sheet in the spreadsheet should
# be read into StateOEWSJobsMay2020.tbl, the intended object
# in this example. However, prior to importing Sheet 1,
# review Sheet 2, which serves as a code book for the file.
With this federal dataset now available, use the prior organizational approach and R syntax (Base R and tools from the tidyverse ecosystem) to focus on a few selected jobs, such as OCC_CODE 29-1141 Registered Nurses. As a test of current skills with R, reproduce the figure for this job and see if it meets expectations, based on the generated figure (Fig. 1.7):
Fig. 1.7
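The syntax that prepares Job291141byStateMay2020.tbl is not reproduced in this excerpt. A minimal sketch, assuming the state-level file shares the OCC_CODE and A_MEDIAN column names seen in the national file and uses PRIM_STATE for the state abbreviation, might be:
Job291141byStateMay2020.tbl <- StateOEWSJobsMay2020.tbl %>%
  dplyr::filter(OCC_CODE == "29-1141") %>%
  dplyr::select(PRIM_STATE, OCC_CODE, A_MEDIAN)
Job291141byStateMay2020.tbl[
  Job291141byStateMay2020.tbl == "#"] <- NA
# Replace the # masking character with NA, as before.
Job291141byStateMay2020.tbl$A_MEDIAN <-
  as.numeric(Job291141byStateMay2020.tbl$A_MEDIAN)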
# Registered Nurses
# Job291141byStateMay2020.tbl
Job291141byStateMay2020.fig <-
ggplot2::ggplot(data=Job291141byStateMay2020.tbl,
aes(x=reorder(PRIM_STATE, -A_MEDIAN), y=A_MEDIAN)) +
geom_col(fill="red") +
labs(
title=
"OEWS Job 29-1141, Registered Nurses: Annual Median Salary
by State (Descending Order) - May 2020",
subtitle=
"All Data are from the Bureau of Labor Statistics",
x = "\nState, https://www.bls.gov/oes/#data",
y = "Annual Median Salary\n") +
annotate("text", x=35.0, y=120000, fontface="bold", size=03,
hjust=0, label=
"Data are excluded in the original dataset for median") +
annotate("text", x=35.0, y=114000, fontface="bold", size=03,
hjust=0, label=
"salaries >= \$208,000 per year.") +
annotate("text", x=35.0, y=105000, fontface="bold", size=03,
hjust=0, label=
"Data for all jobs are not provided by state for all states.")+
scale_y_continuous(labels=scales::dollar, limits=c(0,125000),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
par(ask=TRUE)
Job291141byStateMay2020.fig
# Fig. 1.7
Challenge: Use multiple federal resources to compare state-wide salary data with state-wide cost of living data. Although the prior back matter for this lesson provided useful by-state information about salaries for a small, self-selected sample of jobs that require some degree of acquaintance with data science and biostatistics, is this information sufficient to make informed lifelong career decisions, place-of-residence decisions, etc.? As evidenced by the tables and figures, salaries for the same job vary widely by state. However, there is also variance in cost of living across different regions within a state, especially urban versus rural regions.
The purpose of this late part of Addendum 2 is to offer a brief view of how extant data provided by the federal government, coupled with knowledge of the tidyverse ecosystem and data science skills, can be used to offer insight and improve decision making, in this case looking at rent as a proxy for cost of living. As a value-added activity at this early part of the text, an R-based Application Programming Interface (API) function (e.g., client) will be demonstrated in this example to automate the acquisition of data from the United States Census Bureau. Note how the API process is tidier than point-and-click menu selections at a Web page.54
54. An entire lesson in this text is provided on APIs. Look at the API process here, but use the later lesson for explicit detail on how APIs are used in data science.
# Use the tidycensus package and/or the acs package and the
# US Census Bureau key to obtain state and/or county specific
# data from selected American Community Survey (ACS) and/or
# Decennial Census tables.
#
# Use the following URL to access the form needed to obtain an
# API key from the US Census Bureau:
# https://api.census.gov/data/key_signup.html
#
# Complete details on the API process with the US Census Bureau
# are available at https://www.census.gov/content/dam/Census/
# library/publications/2020/acs/acs_api_handbook_2020_ch02.pdf.
install.packages("tidycensus", dependencies=TRUE)
library(tidycensus)
# CAUTION: The tidycensus package may take longer to
# download than other packages. Be patient.
tidycensus::census_api_key(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
#"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
# CAUTION: The code for this key is reserved
# for [email protected] only. Use your own key!
Far more detail on APIs is provided in a later lesson. For now, accept that an appropriate Census API key has been obtained (it is not shown in this lesson since it is a private key – obtain and use your own key) and that the variable code for this session (variables="B25031_004", Median gross rent, 2 bedrooms) is known, by examination of the file ACS2019.csv. Instead, simply focus on the simplicity of data acquisition through use of an R-based API function, tidycensus::get_acs() in this example.
All50StatesMedianRent <-
tidycensus::get_acs(
geography="state", # Breakouts by state
variables="B25031_004", # Median gross rent, 2 bedrooms
year=2019, # Year
survey="acs5", # ACS Survey
cache_table="TRUE", # Cache the table
output="tidy", # Tidy output
show_call=TRUE) # Confirm output at Census URL
# Median Gross Rent 2019 2 Bedrooms
print(All50StatesMedianRent, n=67)
# A tibble: 52 x 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B25031_004 759 6
2 02 Alaska B25031_004 1244 20
3 04 Arizona B25031_004 1007 7
4 05 Arkansas B25031_004 720 5
5 06 California B25031_004 1536 5
Challenge: Compare the 2019 median cost of rent by state to the previous summary of 2020 median salary by state for selected jobs. As an example, the 2019 median cost of rent in California was $1536 per month, whereas the 2020 median annual salary for a registered nurse in California was $118,410. What is the ratio of rent to salary? Compare the ratio to other states. For this one ratio-type metric alone, which state is the most favorable in terms of rent versus salary? Compare this metric for multiple jobs to see if outcomes are consistent. Then, prepare ratios for multiple jobs across multiple states. This challenge, when completed, should give a good understanding of how these data fit into career decisions for those with an interest in data science and biostatistics.
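As a small worked example, using only the two California figures quoted above:
(1536 * 12) / 118410
# Annual rent of $18,432 against an annual salary of $118,410
# yields a ratio of roughly 0.156, so about 15.6 percent of the
# median salary would go toward rent for a 2-bedroom unit.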
Career paths and life choices take many unexpected twists and turns, based on personal desires, job openings, relationships, family, etc. If relocation is an option, it may at first seem best to move to an area with the highest salaries for a chosen profession. However, some analysis of local cost of living, such as using rent as a recognized proxy for consumer costs, may contribute to a more informed decision. A career that includes the use of biostatistics, in whole or in part, involves many factors, such as salary and place of residence. As with data science in general, available data should be considered when developing life and career plans.
External Data and/or Data Resources Used in This Lesson
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, and .xlsx file formats.
CIP010000CompletionsAllCertDeg2010to2019AgrGeneral.xlsx
CIP010301CompletionsAllCertDeg2010to2019AgrProdOperGen.xlsx
CIP010401CompletionsAllCertiDeg2010to2019AgrFoodProdProc.xlsx
CIP010901CompletionsAllCertDeg2010to2019AnimalSciGen.xlsx
CIP011001CompletionsAllCertDeg2010to2019FoodScience.xlsx
CIP011101CompletionsAllCertDeg2010to2019PlantSciGen.xlsx
CIP011201CompletionsAllCertDeg2010to2019SoilSciAgmyGen.xlsx
CIP260502CompletionsAllCertDeg2010to2019MicrobioGen.xlsx
CIP261101CompletionsAllCertDeg2010to2019BioBiometrics.xlsx
CIP261102CompletionsAllCertDeg2010to2019Biostatistics.xlsx
CIP261103CompletionsAllCertDeg2010to2019Bioinformatics.xlsx
CIP261104CompletionsAllCertDeg2010to2019CompBiology.xlsx
CIP261199CompletionsAllCertDeg2010to2019BioinfCompBio.xlsx
CIP261306CompletionsAllCertDeg2010to2019PopBiology.xlsx
CIP261309CompletionsAllCertDeg2010to2019Epidemiology.xlsx
CIP270501CompletionsAllCertDeg2010to2019StatisticsGen.xlsx
CIP303001CompletionsAllCertDeg2010to2019ComputSci.xlsx
CIP440503CompletionsAllCertDeg2010to2019HlthPolAnaly.xlsx
CIP510401CompletionsAllCertDeg2010to2019Dentistry.xlsx
CIP511201CompletionsAllCertDeg2010to2019Medicine.xlsx
CIP511401CompletionsAllCertDeg2010to2019MedScientist.xlsx
CIP511901CompletionsAllCertDeg2010to2019OsteoMedOpath.xlsx
CIP512010CompletionsAllCertDeg2010to2019PharmSciences.xlsx
CIP512201CompletionsAllCertDeg2010to2019PubHealthGen.xlsx
CIP512202CompletionsAllCertDeg2010to2019EnvHealth.xlsx
CIP512401CompletionsAllCertDeg2010to2019VetMed.xlsx
CIP512706CompletionsAllCertDeg2010to2019MedInform.xlsx
CIP513801CompletionsAllCertDeg2010to2019NursingRegNur.xlsx
CIP513808CompletionsAllCertDeg2010to2019NursingSci.xlsx
CIP513818CompletionsAllCertDeg2010to2019NursPractice.xlsx
national_M2020_dl.xlsx
state_M2020_dl.xlsx
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()) and read.table(textConnection()).
Chapter 2
Data Sources in Biostatistics
Data scientists often work with data from external resources, data over which they often have little to no control regarding creation and original format. However, data scientists also often work with their own data – data that are the result of their actions and that they create, either individually or by delegation (with supervision) to subordinates.
Consider HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt, a tab-separated file serving as a teaching dataset. Immediately, it should be stated that no Code Book is provided to explain what the codes Code01-07 and CodeA-C represent. There is also no description as to why individual CIP codes show on multiple rows. That information is not relevant here, for this lesson. Instead, look at the way the data were saved by an inexperienced, and obviously unsupervised, assistant. The data are in tab-separated format; there is one row that includes a descriptive title, and there are two separate header rows. It would be wrong to say that the file cannot be used. Of course, the file can be used, but it would take some work to put it into a readily usable format. Planning and close supervision of assistants would have helped. Still, look at the file using a text editor to see the type of structure for this dataset, as an example of the challenges data scientists often face, even with data under their control.1
Challenge: HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt is avail-
able at the publisher’s Web site associated with this text. The data are clearly not in
1. It is often viewed that time on task in data science follows an 80-20 rule, where 80% of the time is given to organizing data into proper format and 20% of the time is given to the main job at hand: using the data for discovery purposes that place value on the entire data science process.
a tidy format. Using an editor, delete the title row and then organize the two header
rows so that there is only one descriptive header row, ideally so that for each column
the code provides a sense of Code01-07 and CodeA-C. Then prepare summations
for each column. These actions would be a good start as the data are made at least
somewhat tidy, but for this introductory activity, the Code Book is needed to offer
greater value to the data. After this experience, it should be obvious that hand edit-
ing of the file is no easy task. Later, with more experience, the tidyverse ecosystem
will be used to organize the data and provide value to its use. Those who are expe-
rienced with the tidyverse ecosystem should be able to achieve these aims using R
syntax, and these tools are introduced in a gradual manner throughout this text.
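For those who want a head start, a minimal sketch follows; it assumes the file has one title row, two header rows, and then tab-separated data, and the exact column contents are not shown here:
library(readr)
library(dplyr)

# Read the two header rows (skipping the title row) and paste
# them together into one descriptive header per column.
hdr <- readr::read_tsv(
  "HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt",
  skip=1, n_max=2, col_names=FALSE, show_col_types=FALSE)
new_names <- paste(hdr[1, ], hdr[2, ], sep="_")

# Re-read the data, skipping the title and both header rows, and
# apply the combined names.
Headcounts.tbl <- readr::read_tsv(
  "HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt",
  skip=3, col_names=new_names, show_col_types=FALSE)
dplyr::glimpse(Headcounts.tbl)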
Now consider LMilkManagementPoundsTidy.txt, a different personal dataset
that is tidy, with white space used to separate columns (e.g., variables). The full
dataset was originally quite comprehensive and addressed multiple variables associ-
ated with commercial milk production, such as: breed, operator, management prac-
tice, pounds of milk per lactation, percent fat, and percent protein. Data acquisition
and organization were closely supervised, the data were originally entered into a
common spreadsheet, the data were entered in wide format, and eventually R was
used to transform the desired data into long format. Further R-based actions were
used to sequester the data so that for each planned analysis there was one breakout
dataset, such as the dataset involving milk production, Management and Pounds. If
possible, plan for future work before data are ever obtained and put into some type
of organizational structure, ideally, so that more time is given to examination and
productive use of the data and less time is given to the mundane, but still critically
important, task of data organization.
Challenge: LMilkManagementPoundsTidy.txt is available at the publisher’s Web
site associated with this text. The data are in tidy format. Use the data to determine
the mean pounds of milk per lactation by management practice, Conventional and
Organic.
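A minimal sketch follows; it assumes the whitespace-separated file uses a header row with the column names Management and Pounds, consistent with the dataset description:
library(dplyr)

LMilk.df <- utils::read.table("LMilkManagementPoundsTidy.txt",
  header=TRUE, stringsAsFactors=TRUE)
LMilk.df %>%
  dplyr::group_by(Management) %>%
  dplyr::summarize(
    N          = dplyr::n(),
    MeanPounds = mean(Pounds, na.rm=TRUE))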
Local Data Sources
Local communities such as cities, counties, townships, tribal lands, and boroughs are often subject to sunshine laws, under which commissioners, freeholders, supervisors, directors, and all other public officers must hold formal meetings in an open setting and all relevant data under their charge must be made conveniently available to the public. Internet posting of data, with unfettered access, is now a common process by which data are made available to the public.
Consider two typical Web-based data sources made available by Palm Beach
County, Florida. The first resource is clearly associated with biostatistics in that it
relates to environmental issues. The other resource may not at first seem relevant to
biostatistics, but quite the opposite, it is an excellent proxy for gauging public health
and the way public health is impacted by the overall economy.
Palm Beach County, Florida, Natural Areas Trails: Many communities provide
an inventory of public nature trails since this resource is highly valued. Property
developers who wish to construct housing developments near protected natural
areas, to attract potential home buyers, find information about natural areas and
their location useful. Of course, those who focus on the conservation of natural
resources will also find this information useful as they try to restrict high-density
housing developments that encroach on nearby sensitive natural areas. Both parties
have access to these public data, in their attempt to meet goals.
Challenge: Review the URL https://opendata2-pbcgov.opendata.arcgis.com/datasets/PBCGOV::palm-beach-county-natural-areas-trails/explore?location=26.587800%2C-80.496600%2C9.32&showTable=true and notice that, along with the table first seen at this URL, it is also possible to download a shapefile (.shp file format), allowing creative mapping opportunities for those with this skill, with mapping a critical domain in data science. The file Palm_Beach_County_Natural_Areas_Trails.csv represents the data at this resource and it is available at the publisher's Web site associated with this text. Prepare a summary of the many different types of trails, including in part: boardwalk, equestrian, hiking, nature, multi-use, etc. How many trails are there of each type, and what is the mean length of each?
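A minimal sketch follows; the column names Trail_Type and Length_Miles are hypothetical placeholders and should be adjusted to match the header of the downloaded .csv file:
library(readr)
library(dplyr)

Trails.tbl <- readr::read_csv(
  "Palm_Beach_County_Natural_Areas_Trails.csv",
  show_col_types=FALSE)
Trails.tbl %>%
  dplyr::group_by(Trail_Type) %>%
  dplyr::summarize(
    N          = dplyr::n(),
    MeanLength = mean(Length_Miles, na.rm=TRUE))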
Palm Beach County, Florida, Bed Tax Collections: Many United States munici-
palities collect a daily short-term rental bed tax as one of many tools used to gener-
ate revenue from hotel guests, with the tax bundled into the total charge for lodging.2
At first, it may seem that this information is interesting, but not for those focused on
biostatistics. On the contrary, bed tax receipts are often viewed as a critical proxy
indicator of local economic vitality, a downstream indicator of public health due to
the linkage between public health and public finances. Given the pervasive impact
of COVID-19 on the economy, hotel-based bed taxes are an especially useful proxy measure of how the pandemic impacted the travel and tourism industry, comparing bed tax collections prior to the pandemic in 2019, during the worst of the pandemic in 2020 and 2021, and after the emergence of pandemic endemicity in late 2022 and onward. Never discount the usefulness of clever proxy measures.
Challenge: Review the URL https://discover.pbcgov.org/touristdevelopment/
pages/bed-tax-collections.aspx and observe how data can be downloaded in .pdf
format, with the resource Tourist_Development_Tax2021-2022.pdf also avail-
able at the publisher’s Web site associated with this text. Look at Gross Collections
and Net Collections and compare month-by-month change, beginning March
2019 to December 2021. Use the ggplot2::ggplot() function to prepare a line
chart of each (X axis Month and Year and Y axis Bed Tax Collection), Gross
Collections and Net Collections. Compare these line charts to any readily avail-
able resource that plots COVID-19 at the county level, such as what the New York
Times makes available at the URL https://www.google.com/search?client=firefox-
2. The Palm Beach County bed tax, conveniently called the Tourist Development Tax, is currently 6% of total charges and it is separate from any charges for state and county sales tax obligations.
In the same way that local governments provide a wide variety of data resources for
public access, states also work in the sunshine and provide relevant data for public
acquisition. Among the nearly countless number of datasets and related topics, look
at the sample datasets shown below. Of course, as time permits, look at datasets
from other states.
National School Lunch Programs (NSLP) Locations: The United States
Department of Agriculture, the United States Department of Education, states, local
school boards, and other educational agencies partner to coordinate school lunch
programs for those children who qualify. There are many data resources for the
National School Lunch Program (NSLP, https://www.fns.usda.gov/nslp), a long-
term program that was initiated at the federal level in 1946.
Challenge: Review the URL https://geodata.fdacs.gov/datasets/FDACS::nslp-
sites/explore?location=27.764172%2C-83.759111%2C6.54&showTable=true, an
NSLP resource provided by the Florida Department of Agriculture and Consumer
Services and specific to Florida educational institutions. The data at this resource
were easily downloaded as nslp_sites.csv and this dataset is also available at the
publisher's Web site associated with this text. This dataset is somewhat unusual in that it provides geospatial information (e.g., latitude and longitude coordinates) for thousands of schools and other service providers, allowing mapping opportunities. Once again, mapping is a rapidly emerging domain in data science. For those with skills beyond the introductory level, use this .csv file to map selected locations of participating NSLP schools throughout Florida, possibly by using functions from either the choroplethr package or the tmap package. Maps, as useful figures, are demonstrated later in this text.
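A minimal sketch follows; the column names Longitude and Latitude are assumptions about nslp_sites.csv and should be checked against the actual file, and the sf package is used here to build the spatial object passed to tmap:
library(readr)
library(sf)
library(tmap)

NSLP.tbl <- readr::read_csv("nslp_sites.csv", show_col_types=FALSE)
NSLP.sf  <- sf::st_as_sf(NSLP.tbl,
  coords=c("Longitude", "Latitude"), crs=4326)
# crs=4326 declares ordinary latitude/longitude coordinates.
tmap::tm_shape(NSLP.sf) +
  tmap::tm_dots()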
Florida Department of Health Tracking: Although the use of an API would be
desirable, by going to Florida Tracking (https://www.floridatracking.com/health-
tracking/, a resource with a degree of affiliation with the Centers for Disease Control
and Prevention), and completing a simple set of checks at a Graphical User Interface,
it is possible to construct a by-county dataset for all 67 Florida counties that focuses
on data associated with: (1) Life Expectancy at Birth, (2) Air Quality, and (3) Cancer
(various types and different measures; e.g., age-adjusted incidence rate and number
National Data Sources
Following along with the paradigm of government in the sunshine and requirements
that different government agencies make data readily available to the public, the
various federal cabinet departments and their respective agencies make available a
wealth of information, and there has been great improvement in the ready availabil-
ity of the data, along with its quality. It would take volumes to give any meaningful
detail on the many Web-based datasets offered by the federal government, so only a
few are listed below, but first visit the URL DATA.GOV (e.g., https://data.gov/) to
gain a first sense of possibilities.
3. Nearly all states provide similar information.
An excellent starting point for obtaining data from the decennial census is Quick Facts, https://www.census.gov/quickfacts/fact/table/US/PST045221. Along with selections made from a Graphical User Interface, it cannot be ignored that the Census Bureau is a leader in facilitating the use of R-based API functions as a means of making data available, including data from the Decennial Census, American Community Survey, Public Use Microdata Sample, and many other data resources (see Census Datasets, https://www.census.gov/data/datasets.html).
Challenge: Use the Quick Facts interface to create and download a dataset
detailing Cape May County, New Jersey, saved as QuickFactsApr072022
CapeMayCountyNJvUnitedStates.csv and made available at the publisher’s Web
site associated with this text. Carefully examine the file for format and consistent
presentation of information that describes: Population, Age and Sex, Race and
Hispanic Origin, Population Characteristics, Housing, Families and Living
Arrangements, Computer and Internet Use, Education, Health, Economy,
Transportation, Income and Poverty, Businesses, and Geography. Use Quick Facts
to search other communities. Anyone who engages in biostatistics relating to
humans and communities should visit the United States Census Bureau to learn as
much as possible about the areas under investigation.4
The data possibilities from the Centers for Disease Control and Prevention (CDC) may seem nearly endless for those data scientists who work in biostatistics. Look at the information related to Diagnosed Diabetes (https://gis.cdc.gov/grasp/diabetes/DiabetesAtlas.html) as a first review of what could be extremely useful for those who study this common disease.
Challenge: Use the dataset DiabetesAtlasData.csv (available at the publisher's Web site associated with this text) and the ggplot2::ggplot() function to construct a line chart of X axis (Year, 2000 to 2019) by Y axis (Total Percentage Diagnosed Adults), to offer a view of the growing incidence of this disease over the last two decades; a minimal sketch of the syntax follows at the end of this paragraph. But the curious data scientist would look not only for data related to diabetes but also for known comorbidities, to see if there are associations. With that said, a search tool was used to examine any possible relationship between diabetes and obesity (https://www.cdc.gov/nchs/hus/contents2019.htm?search=Obesity/overweight), a
4. With more experience in R, review the following R packages and how they can be used to obtain data from the Census Bureau, whether the Decennial Census, the American Community Survey (ACS), or other census-related resources: acs, censable, censusapi, cpsR, easycensus, idbr, ppmf, tidycensus, tidyqwi, totalcensus. All packages have merit, but by choice the tidycensus package is emphasized in this text.
known comorbidity of diabetes. This expanded query against CDC data yielded the
dataset SelectedHealthConditionsandRiskFactorsbyAge.xlsx (available at the pub-
lisher’s Web site associated with this text), addressing multiple risk factors and not
only diabetes and obesity.5 It would take some work to put the data into tidy format,
but clearly, it was not necessary to conduct what would possibly be redundant sur-
veys, clinical trials, or other time-intensive and expensive actions needed to obtain
the data.6 The CDC should always be among the first choices when data related to
public health issues are considered.
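A minimal sketch of the requested line chart follows; the column names Year and TotalPercentage are hypothetical placeholders and should be adjusted to match DiabetesAtlasData.csv:
library(readr)
library(ggplot2)

Diabetes.tbl <- readr::read_csv("DiabetesAtlasData.csv",
  show_col_types=FALSE)
ggplot2::ggplot(data=Diabetes.tbl,
  aes(x=Year, y=TotalPercentage)) +
  geom_line() +
  labs(
    title="Diagnosed Diabetes Among Adults: 2000 to 2019",
    x = "\nYear",
    y = "Total Percentage Diagnosed Adults\n")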
The United States Department of Agriculture recognizes the need for quality data in
support of food production and related concerns, as evidenced by their Data Strategy
Statement (https://www.usda.gov/sites/default/files/documents/usda-data-strategy.
pdf). For those who use R, the rnassqs package provides an interface to allow access
to data from the United States Department of Agriculture (USDA) National
Agricultural Statistics Service (NASS). The syntax and use of the rnassqs::nassqs()
function (sample syntax, showing how this function is used, follows), once estab-
lished and tested for validity, can be easily replicated with minor changes, allowing
repeated use across multiple regions, crops, years, etc.7
Challenge: Part of the syntax needed to obtain Kentucky corn yields over time is
provided below, eventually used to generate the file KentuckyCornYield1900Onward.
xlsx (this file is also available at the publisher’s Web site associated with this text),
but explicit detail and all syntax on use of the rnassqs::nassqs() function is provided
in the lesson on APIs. In response to this challenge, select known high-yielding
counties in Kentucky for corn production, such as Daviess, Fulton, Logan, Shelby,
and Warren, and for each county prepare a line chart of X axis Year by Y axis Corn
Yield (Bushels per Acre) to see the dramatic increase in productivity of corn pro-
duction, allowing for a few years when climate and disease challenged normal yields.
5. Although the two files mentioned in this section are not at all large and could easily be edited by using standard spreadsheet actions, as a challenge, use tidyverse ecosystem tools to put the data in good form, suitable for R. This may not be possible yet, but by the end of this text it should be possible to come back to these files to achieve that task. As an advance organizer, look at use of the janitor::row_to_names() function and the dplyr::slice() function, but again more detail is offered later in this text.
6. Data are rarely easy to obtain. Data are expensive. If a proxy dataset can be legally obtained and if that dataset meets needs, then consider its use – at least as a first indicator of direction.
7. As a quality assurance check on validity of the data, review corn (Zea mays) yields for 2012, a year of extreme Spring and Summer drought in many regions. Then check corn yields for 1970, a year when Southern Corn Leaf Blight (SCLB), a disease caused by the fungus Helminthosporium maydis, specifically Race T, caused extreme distress in many fields, reducing yields.
install.packages("rnassqs", dependencies=TRUE)
library(rnassqs)
# The Housekeeping section and related syntax
# has not yet been put into place in this
# lesson but is instead found at the start of
# Addendum 1.
nassqs_auth(key="UseTheKeyProvidedAtSign-Upxxxxxxxxxx")
# The USDA NASS key is free, but it must first be
# obtained at https://quickstats.nass.usda.gov/api.
#
# CAUTION: The code for this key is reserved
# for [email protected] only. Use your own key!
utils::str(KentuckyCornYield1900Onward.tbl, width=64,
strict.width="cut")
Consider the possibilities of how this dataset could be used to communicate outcomes to the public during National Farm-City Week, usually held in late November.
As stressed in this text, poverty has an extreme impact on public health. This text
purposely provides pointers to data resources that serve as national, state, and local
proxy indicators of the economic challenges that many families face, knowing the
impact of poverty on health, even when just one family member is out of work.
Poverty data should be considered biostatistics data.
As briefly mentioned earlier, the National School Lunch Program (NSLP), now
in place for more than 75 years, was established in a multi-agency attempt to pro-
vide wholesome and nutritious meals (often, breakfast, lunch, and a snack before
afternoon dismissal) during the school day to youth who are in need.8 Using
Graphical User Interface (GUI) menu-type selections at Number and percentage of
public school students eligible for free or reduced-price lunch, by state: Selected
years (https://nces.ed.gov/programs/digest/d18/tables/dt18_204.10.asp, select Click
here for the latest version of this table), look at the wealth of state-wide data on
participation in the NSLP, currently from the 2000–2001 school year onward.
Challenge: Data in NationalSchoolLunchProgram2000-01to2018-19.xlx, avail-
able at the publisher’s Web site associated with this text, are clearly not in tidy for-
mat. Using tidyverse ecosystem tools or, if needed, direct editing, put the data
into tidy format (by continuing with future lessons, direct editing should become
unnecessary).9 As a value-added activity, obtain state-wide Census-based data on
poverty. Merge these state-wide data on poverty with the state-wide dataset on
NSLP participation. Then, determine if there is any degree of association between
Census-obtained poverty statistics and NSLP-obtained percent eligibility rates in
free and reduced-price lunch programs. Data scientists often obtain, organize, scrub,
manipulate, etc. multiple datasets into one final dataset before any attempt is made
to use the data for value-added purposes. Come back to this challenge after skills
expand since it is a typical action in data science.
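Once both datasets are in tidy form, the merge itself can be brief. The sketch below only suggests the shape of that step; the tibble and column names are hypothetical placeholders, and the actual key depends on how the Census and NSLP data were prepared.

# A minimal sketch with hypothetical names: one row per state in each
# tibble, joined on a shared "state" column.
nslp_poverty.tbl <- dplyr::left_join(nslp.tbl, poverty.tbl,
  by = "state")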
8 Many school-based principals find creative ways to offer NSLP services to all students, regardless
of formal eligibility requirements, to avoid the stigma of participation for individual students
in need.
9 Read the footnotes at the bottom of the spreadsheet to see how the data were obtained and to also
gain a sense of expansion of eligibility requirements over time, thus expansion of participation
over time.
Following along with the theme that poverty impacts public health, data available
through the United States Department of Labor are without parallel when attempt-
ing to gauge the economic viability of a region, whether at the county, state, or national level.
Using data for September 01, 2020, during a period when avoidance of public gath-
erings due to the COVID-19 pandemic was at its most extreme, consider the data
presented in the file FloridaUnemploymentByCountySep-01-20.csv (this dataset is
also available at the publisher’s Web site associated with this text) and the accompa-
nying figure (Fig. 2.1), presenting a choropleth (a color-coded thematic map) of the
data for all 67 Florida counties and a brief summary of the most extreme values.10
Challenge: Give particular attention to the high unemployment rates in Central
Florida (especially Osceola County and Orange County) during this time; this
region is home to many internationally known theme parks, which of course are a
draw for nearby restaurants, hotels, airports, rental car agencies, and
similar tourist industry accommodations. The economic impact and later public
health impact of these exceptionally high unemployment rates are without current
parallel, and these impacts include cancelled and deferred medical appointments,
stress and the subsequent increased use of alcohol and tobacco, unhealthy family
dynamics due to missed pay checks and late housing payments, lost educational
opportunities for pK-12 students who did not have sufficient computing resources
at home to actively participate in emergency remote learning opportunities, etc.
Fig. 2.1
10 API-type functions from the blscrapeR package were used to obtain the unemployment data by
Florida county. It is important to note that the blscrapeR package is currently not available from
CRAN, but the latest development version is available at GitHub.
The Environmental Protection Agency (EPA) was established in the early 1970s and
since that time this agency has made tremendous strides to bring environmental
issues to public attention, often by using reliable and valid data to communicate
issues of importance.11 The EPA has a wide variety of data available to the public,
as part of an open government approach, and the URL https://www.epa.gov/data
should be reviewed to learn more about data availability.
A typical example of data provided by the EPA would be the data associated with
Air Quality Index (AQI), a metric that provides an overall summary of air quality
that should be easily understood by all and not only those scientists who work in this
specialized domain. As an example of the data, go to https://aqs.epa.gov/aqsweb/
airdata/download_files.html#Annual, unzip the file annual_aqi_by_county_2021.
zip, and then review the file annual_aqi_by_county_2021.csv, Annual Summary
Data AQI by County, which is also made available at the publisher’s Web site asso-
ciated with this text.12
Challenge: The dataset annual_aqi_by_county_2021.csv, addressing many coun-
ties in the United States, was selected in that the information could be easily under-
stood by the public, using column headers such as: Days with AQI, Good Days,
Moderate Days, Unhealthy for Sensitive Groups Days, Unhealthy Days, Very
Unhealthy Days, Hazardous Days, and Median AQI. For those with special interest
in how the COVID-19 pandemic and mitigating actions such as lockdowns impacted
not only public health but also environmental health, review data for 2019 (pre-
pandemic) and then data for 2020 and 2021. A ggplot2::ggplot() function line chart
of data from the column Median AQI for each of the 3 years, 2019, 2020, and 2021,
would be especially useful. Did the many public pandemic mitigating actions, such
as lockdowns, reduced economic activity, and limited automobile traffic, improve
air quality? Respond to this challenge, once skills with the tidyverse ecosystem are
sufficient. It serves as an excellent example of how the ggplot2::ggplot() function
can be used to communicate outcomes in a clear manner, ideally on a subject of
public interest.
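When skills with the tidyverse ecosystem are sufficient, the response might take roughly the following shape. The file names follow the EPA naming pattern for 2019 and 2020, and the column names (State, County, Year, Median AQI) are assumptions that should be checked against the downloaded files.

# A minimal sketch; verify file names and column headers before use.
library(tidyverse)
aqi.tbl <- dplyr::bind_rows(
  readr::read_csv("annual_aqi_by_county_2019.csv"),
  readr::read_csv("annual_aqi_by_county_2020.csv"),
  readr::read_csv("annual_aqi_by_county_2021.csv"))
aqi.tbl %>%
  dplyr::filter(State == "Florida", County == "Orange") %>%
  ggplot2::ggplot(aes(x = Year, y = `Median AQI`)) +
  geom_line() +
  geom_point()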
11 The early 1970s was a time when it was finally recognized that a healthy environment could not
be assumed, but instead attention to pollution and other environmental issues was needed. The first
Earth Day was celebrated on April 22, 1970. Concurrently, a Public Service Announcement (PSA)
television commercial (now retired) featuring an actor professionally known as Iron Eyes Cody
had tremendous impact on how it was in the national interest to give attention to environmental
concerns and the impact of the same on public health.
12 Review Air Quality Index (AQI) Basics, https://www.airnow.gov/aqi/aqi-basics/, to learn more
about the AQI scale.
For more than 70 years, the National Science Foundation (NSF) has played a
leading role in the way science is used to improve the human condition. On the topic
of leadership in the sciences, consider the annual NSF surveys used to monitor the
critical mass of doctoral recipients from various academic areas, those individuals
who are most likely to have later leadership roles in various fields of study in the
sciences, including the biological sciences.
Challenge: As an example, consider the NSF dataset Doctorate recipients, by
fine field of study: 2010–20, available at https://ncses.nsf.gov/pubs/nsf22300/data-
tables, and download the file nsf22300-tab013.xlsx, which is also available at the
publisher’s Web site associated with this text. For those who are concerned about
the production of sufficient numbers of individuals who can contribute to the sci-
ences, these data should offer a view of future leadership, considering how many of
these individuals may have a future 30–40-year career in their selected discipline.
The dataset is presented as an easy-to-read table and should take only a small
amount of effort to make the data suitable for use with R. For those with a special
interest, look at the data under the broad field of Agricultural sciences and natural
resources, from Agricultural economics to Natural resources and conservation. Use
the ggplot2::ggplot() function to prepare a line chart of general trends for selected
areas of study. Do the outcomes sync with national needs? These data may not be of
direct interest to the public, immediately, but they provide critical information for
appropriate policy makers as efforts are made to guide the nation's workforce in the sci-
ences, a workforce that needs skills with biostatistics and data science, well into
the future.
European Union and European Economic Area (EU/EEA) data on the daily number
of new reported COVID-19 cases and deaths are obtained at Data on the daily num-
ber of new reported COVID-19 cases and deaths by EU/EEA country (https://www.ecdc.europa.eu/en/publications-data/data-daily-new-cases-covid-19-eueea-coun-
try) and have been downloaded as COVID19CasesDeathsEUEEACountryApr-16-
2022.csv, with data updated after this point in time.13
Challenge: The data come from multiple cabinet-level ministries as well as other
resources and are made available on different dates; given this disparity, great
efforts are made to put the data into a unified dataset. Review All Topics, A to
Z (https://www.ecdc.europa.eu/en/all-topics) to gain a sense of the comprehensive
nature of this invaluable data resource. Select a few countries of special interest and
prepare a ggplot2-based line chart of date (X-axis) by total COVID-19 deaths
(Y-axis), to see the wave-like pattern of the pandemic, as the SARS-CoV-2 virus
mutated into variants, each with unique characteristics in terms of severity and
transmissibility.
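A sketch of one possible response follows. The column names (dateRep, deaths, countriesAndTerritories) follow the usual ECDC layout but are assumptions that should be verified against the downloaded file, and the selected countries are only examples.

# A minimal sketch, assuming the usual ECDC column names.
library(tidyverse)
ecdc.tbl <-
  readr::read_csv("COVID19CasesDeathsEUEEACountryApr-16-2022.csv") %>%
  dplyr::filter(countriesAndTerritories %in%
    c("France", "Germany", "Italy")) %>%
  dplyr::mutate(date = lubridate::dmy(dateRep)) %>%
  dplyr::arrange(countriesAndTerritories, date) %>%
  dplyr::group_by(countriesAndTerritories) %>%
  dplyr::mutate(total_deaths = cumsum(deaths)) %>%
  dplyr::ungroup()
ggplot2::ggplot(ecdc.tbl,
  aes(x = date, y = total_deaths, color = countriesAndTerritories)) +
  geom_line() +
  labs(x = "\nDate", y = "Total COVID-19 Deaths\n")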
13 This resource is very R friendly. Look at the directions at Script for downloading the CSV file into R software. The dataset is also available at the publisher's Web site associated with this text.
For those who work with international agricultural data, the data possibilities avail-
able at the United Nations Food and Agriculture Organization should be reviewed.
To do that, as a good starting point, look at the data related to Crops and Livestock
Products at FAOSTAT, https://www.fao.org/faostat/en/#data/QCL. From this start-
ing point, select France, Germany, and the United Kingdom of Great Britain and
Northern Ireland to gain a sense of the change in animal production over the last 60
or so years, from 1961 to 2020.
Challenge: After interacting with the interface, data were downloaded as
FranceGermanyUKLivestoack1961to2020.csv and this dataset is also available at
the publisher’s Web site associated with this text.14 For each selected country, com-
pare production for a selected variable (possibly Meat, turkey) over time, typically
by preparing a line chart based on use of the ggplot2::ggplot() function.
14 Similar data made available at FAOSTAT can be obtained using functions associated with the R
package FAOSTAT, but for now use the Graphical User Interface (GUI) to explore the many types
of data and how the data can be filtered to meet specific needs.
World Bank
From among many possible resources at the World Bank, review the birth rate data-
set, birth rate, crude (per 1000 people), made available at https://data.worldbank.
org/indicator/SP.DYN.CBRT.IN?end=2020&start=2000&view=chart, where birth
rate data are provided by country and a few selected regions or entities. Then review
the World Bank dataset relating to gross domestic product (GDP), GDP (current
US$), available at https://data.worldbank.org/indicator/NY.GDP.MKTP.CD.
Challenge: Clean (e.g., organize, scrub, etc.) each dataset (the datasets
BirthRatePer1000People.csv and GDPCurrentUSDollar.csv were both downloaded,
and each dataset is available at the publisher’s Web site associated with this text) as
needed, and eventually merge the two, ending with a new dataset that includes
Country Name, Country Code, 2019 (pre-pandemic) Gross Domestic Product, and
2019 (pre-pandemic) Birth Rate per 1000 People.15,16 As much as possible, antici-
pating that there may be some missing datapoints, construct a breakout data set that
includes the many countries associated with the African Union (from Algeria to
Zimbabwe) and the many countries associated with the European Union (from
Austria to Sweden) and explore the data, looking to see if there are any associations
between GDP and birth rate (consider use of the ggplot2::ggplot() function): (1)
overall, for all data, (2) for each of two constructed breakout groups, African Union
and European Union, and (3) for selected countries of individual interest. Come
back to this challenge later, if skills are not yet sufficient.
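As a shape to aim for once each downloaded file has been reduced to the columns of interest, the sketch below assumes the World Bank column labels Country Name and Country Code; the tibble names are hypothetical placeholders.

# A minimal sketch with hypothetical tibble names; the by = columns
# assume the usual World Bank download headers.
GDPBirthRate2019.tbl <- dplyr::inner_join(
  GDP2019.tbl, BirthRate2019.tbl,
  by = c("Country Name", "Country Code"))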
The World Health Organization (WHO) has been prominently mentioned over the
last few years, given its role in responding to the COVID-19 pandemic. As much
attention has been given to COVID-19, it is important to recall that there have been
many deaths during the pandemic that are the result of other causes, especially
behavioral and lifestyle choices, all for a variety of individual reasons. A leading
cause of death relates to the expression Deaths of Despair, with alcohol highly
associated with this expression. Saying that, go to Alcohol-attributable fractions,
all-cause deaths (%) at https://www.who.int/data/gho/data/indicators/indicator-
details/GHO/alcohol-attributable-fractions-all-cause-deaths-(-) and export the data.
15 The base::merge() function should not be ignored. The dplyr::bind_rows() and dplyr::bind_cols()
functions, associated with the tidyverse ecosystem, are also frequently used, along with other func-
tions from the dplyr package. Individual choice usually determines which function(s) are used to
achieve aims.
16 Are the codes in Country Code consistent for each of the two datasets? If so, this consistency
should help facilitate any merging actions of the two datasets.
There are a few commercial organizations that make public interest data available,
often with no authentication needed for access. However, in many cases, it may be
easier to obtain the data (or reasonable proxy data) at public resources, if possible.
Yet, it would be negligent not to identify perhaps the two most widely
known commercial entities providing data related to COVID-19, repeating the cau-
tion that the data are available at multiple resources, public and private.
The New York Times makes available data on COVID-19, providing extensive
resources. Look at the many selections, with most data at https://github.com/
nytimes/covid-19-data. The Cumulative Cases and Deaths datasets, at various
breakout levels (available as .csv files) and ending dates, are quite interesting and
useful for inquiries on this topic.17
17 Data may be out of date as COVID-19 is now endemic.
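For those who want to bring these data directly into an R session, a minimal sketch follows; the raw GitHub URL reflects the repository layout at the time of writing and should be verified before use.

# A minimal sketch; confirm the file path in the nytimes/covid-19-data
# repository before running.
nyt_states.tbl <- readr::read_csv(
  "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
utils::head(nyt_states.tbl)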
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all objects from the
                   # workspace. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
As used throughout this text, the Housekeeping section represents personal desires
in terms of how R is used, how settings are organized, where packages are kept,
default and other location(s) where files are maintained, etc. As always, use the
Housekeeping syntax as a guide, but of course, make changes as skills and prefer-
ences allow.
In keeping with what is seen in the Housekeeping section, use the many pack-
ages listed below as a starting point for what is often used in an R session that
focuses on the tidyverse ecosystem and its use in data science. Other packages will
be deployed later in this lesson, as needed.
In advance, notice a major difference in this Housekeeping section compared to
what was presented in the prior lesson. A # comment character is placed in front of
the install.packages() function for those packages that were previously downloaded.
There is confidence that the package is up to date and there is no need to download
the package again, at least not until there is an update.18,19 Of course, the library()
function is still used, to put the package into use.20
# install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
# install.packages("readxl", dependencies=TRUE)
library(readxl)
# install.packages("magrittr", dependencies=TRUE)
library(magrittr)
The ggplot2 package is part of the core tidyverse, and it is used to create beauti-
ful graphics.21 However, there are many ancillary packages that support the produc-
tion of graphical presentations, with features that go far beyond what can be prepared
using the ggplot2 package by itself. A few of these graphically oriented ancillary
packages include:
# install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
# install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
# install.packages("ggtext", dependencies=TRUE)
library(ggtext)
# install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
# install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
# install.packages("scales", dependencies=TRUE)
library(scales)
With all upfront work completed, it is now time to address the data associated
with the addenda in this lesson. Most approaches for use of the data will be based
on functions associated with the tidyverse ecosystem, but functions from Base R
18 Look into use of the old.packages() function and the update.packages() function, as needed.
19 It is common to comment-out syntax using the # comment character when there is a desire to
retain syntax that is important, but not currently needed.
20 It is a best practice to use the base::library() function instead of the base::require() function,
given a few known differences between the two functions.
21 The syntax needed to produce figures, typically using the ggplot2::ggplot() function, is purposely
presented throughout this text. Following this syntax, many, but not all, figures are also presented
in this text. Although this action saves space, the syntax should be used against the data and the
figures that are excluded from this text should still be produced, to gain a complete understanding
of the topics stressed in the syntax.
will be used when they represent the most appropriate approach toward problem-
solving, especially at an introductory level.
Addendum 1 is centered on the identification and later retrieval of data from Our
World in Data (https://ourworldindata.org/). Download the R-specific owidR pack-
age and then use a few simple functions to search for data that may help offer a
sense of life expectancy and wealth, as measured by Gross Domestic Product
(GDP). There may be many possible selections, so ideally the dataset names are
verbose and descriptive.
install.packages("owidR", dependencies=TRUE)
library(owidR)
owidR::owid_search("life expectancy")
# The owidR::owid_search() function returns: (1) a list of
# titles related to the declared search term, or "life
# expectancy" in this example, and (2) a list of the actual
# names (e.g., chart_id) of the identified files. Use the
# chart_id to retrieve the data, using R-based API-type
# syntax.
titles
[5,] "Life expectancy vs. GDP per capita"
[12,] "Life expectancy vs. healthcare expenditure"
[18,] "Life expectancy"
[44,] "Deaths from smallpox per 1,000 population vs. Life exp
[45,] "Life expectancy and smallpox deaths per 10,000 people
chart_id
[5,] "life-expectancy-vs-gdp-per-capita"
[12,] "life-expectancy-vs-healthcare-expenditure"
[18,] "life-expectancy"
[44,] "deaths-from-smallpox-per-1000-population-vs-life-expec
[45,] "sweden-life-expectancy-smallpox-deaths"
LifeExpectancyGDP.tbl <-
owidR::owid("life-expectancy-vs-gdp-per-capita")
# Use the owidR::owid() function to retrieve the desired Our
# World in Data dataset, life-expectancy-vs-gdp-per-capita in
# this example.
As a good programming practice (gpp), and because a dataset acquired from an
external Internet host may not always remain available, it is best to immediately
save a local copy of the data, or LifeExpectancyGDP.tbl for this example, which was
originally obtained by invoking the owidR::owid() function.
There are a few different functions that could be used to download a dataset cur-
rently in an R session, for safe keeping and assurance that the data will be avail-
able later:
• utils::write.csv()
• xlsx::write.xlsx()
• writexl::write_xlsx()
From among these possible options, the writexl::write_xlsx() function will be
used in this lesson:
install.packages("writexl", dependencies=TRUE)
library("writexl")
writexl::write_xlsx(LifeExpectancyGDP.tbl,
path = "F:\\R_Ceres\\LifeExpectancyGDP.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("LifeExpectancyGDP.xlsx")
base::file.info("LifeExpectancyGDP.xlsx")
base::list.files(pattern =".xlsx")
The file has been successfully retrieved from Our World in Data, but some col-
umn names are verbose. A few simple actions, using the base::colnames() function,
should make the column names more manageable.
base::colnames(LifeExpectancyGDP.tbl) <- c(
"entity", # Column 01 Entity
"code", # Column 02 Code for entity
"year", # Column 03 Year
"life_expectancy", # Column 04 Life Expectancy
"gdp_capita", # Column 05 GDP per Capita
"population", # Column 06 Population
"continent") # Column 07 Continent
A few key points associated with this file on world-wide trends for population,
life expectancy, and gross domestic product per capita need to be highlighted here,
to avoid later confusion:
• The term entity is used instead of terms such as country or nation. The data are
quite inclusive and include not only data for many countries but also selected
territories and other geographical locations that are not universally recognized
sovereign states. Entity, in this context, is a more correct term.
• Population and life expectancy are always difficult to estimate, but refer to the
original resources gained by deploying the owidR::owid_source() function to
gain a sense of the methods used in support of the dataset.
• As with many other Our World in Data datasets, it is assumed that Gross Domestic
Product (GDP) is adjusted for inflation to 2011 United States dollars (USD),
which of course can be a problematic application of the algorithms used for this
purpose, at least for some entities. Even so, it is assumed that GDP per capita is
useful as a comparative measure, allowing examination of change over time and
differences between and among selected geographic entities.
owidR::owid_source(LifeExpectancyGDP.tbl)
Although it is common to go straight to the data and begin planned and ad hoc
activities, data science calls for careful review of the data, to look for general trends
and possible discovery of the unexpected. Data science, as opposed to only data
analysis, also adds future value to the data, going beyond the basics. Look at a few
formative figures displaying trends over recent memory before more value-added
analyses and displays are attempted.
To achieve this aim, use the dplyr::filter() function to create a new dataset, where
the entire LifeExpectancyGDP.tbl dataset is examined, but only for data from 1950
onward. No attempt will be made to embellish the figures since they are only pre-
pared for diagnostic review (Fig. 2.2).
LifeExpectancyGDPWorld1950onward.tbl <-
LifeExpectancyGDP.tbl %>%
dplyr::filter(year >= 1950)
# Create the 1950 onward ad hoc dataset,
# using the dplyr::filter() function.
Population1950Onward.fig <-
ggplot2::ggplot(
data=LifeExpectancyGDPWorld1950onward.tbl,
aes(x=year, y=population)) +
geom_col(fill="red") +
labs(title="World Population:
1950 Onward") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
par(ask=TRUE); Population1950Onward.fig
LifeExpectancy1950Onward.fig <-
ggplot2::ggplot(
data=LifeExpectancyGDPWorld1950onward.tbl,
aes(x=year, y=life_expectancy)) +
geom_col(fill="red") +
labs(title="World Life Expectancy:
1950 Onward") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Fig. 2.2
par(ask=TRUE); LifeExpectancy1950Onward.fig
GDP1950Onward.fig <-
ggplot2::ggplot(
data=LifeExpectancyGDPWorld1950onward.tbl,
aes(x=year, y=gdp_capita)) +
geom_col(fill="red") +
labs(title="World GDP:
1950 Onward") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
par(ask=TRUE); GDP1950Onward.fig
par(ask=TRUE)
gridExtra::grid.arrange(
Population1950Onward.fig,
LifeExpectancy1950Onward.fig,
GDP1950Onward.fig, ncol=3)
# The Y axis scale was made blank to avoid any
# possible confusion with side by side comparisons,
# given different scales for each metric.
#
# C02Fig09World-WideOutcomesSideBySide.png
These figures, placed into one convenient side by side comparative figure, pro-
vide evidence that at least the most current data, overall and from 1950 onward, are
within expectations of a general upward trend, but of course with occasional
decreases.22
The data were retrieved using the owidR::owid() function, supporting an API
process, and the data have been subjected to an initial review for integrity and
expectations. The data will now be used for a more focused examination of a subset
of the data, to look for interesting outcomes among a few approximately neighboring
geographic entities and in turn add value to this data science experience.
22 The file has not yet been fully organized, scrubbed, cleaned, etc. Even so, these early figures
provide a sense of general trends.
LifeExpectancyGDPSample1900onward.tbl <-
LifeExpectancyGDP.tbl %>%
dplyr::filter(year >= 1900) %>%
dplyr::filter(entity %in% c(
"South Korea",
"North Korea",
"Taiwan",
"China",
"Honduras",
"United States"))
# The dplyr::filter() function is used twice in this syntax:
# (1) once to select only the data from year 1900 onward and
# to then (2) select data only for the six entities listed
# above: South Korea, North Korea, Taiwan, China, Honduras,
# and United States.
base::getwd()
base::ls()
base::attach(LifeExpectancyGDPSample1900onward.tbl)
utils::str(LifeExpectancyGDPSample1900onward.tbl)
dplyr::glimpse(LifeExpectancyGDPSample1900onward.tbl)
utils::head(LifeExpectancyGDPSample1900onward.tbl)
base::summary(LifeExpectancyGDPSample1900onward.tbl)
Before any detailed graphics are produced, it is important to know that the
ggplot2::ggplot() function supports many different themes. A ggplot2 theme is cre-
ated by using syntax to produce a figure with a desired appearance. In an attempt to
make the figures bold and vibrant, but also in an attempt to reduce redundant key-
ing, look at theme_Mac(), a self-created theme that will be used in concert with the
ggplot2::ggplot() function.
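The definition of theme_Mac() does not appear in this extract. A minimal sketch of what such a self-created theme might look like follows, assuming bold titles and axis labels on a plain background; the actual theme_Mac() used throughout the text may set different elements.

# A minimal sketch of a user-created ggplot2 theme; adjust as desired.
theme_Mac <- function() {
  ggplot2::theme_bw(base_size = 14) +
  ggplot2::theme(
    plot.title    = ggplot2::element_text(face = "bold", hjust = 0.5),
    plot.subtitle = ggplot2::element_text(face = "bold", hjust = 0.5),
    axis.title.x  = ggplot2::element_text(face = "bold"),
    axis.title.y  = ggplot2::element_text(face = "bold"),
    legend.title  = ggplot2::element_text(face = "bold"))
}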
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
With this preparation completed, try to produce one summative figure of popula-
tion to show the difficulty of producing an attractive figure straight off (Fig. 2.3).
Fig. 2.3
install.packages("directlabels", dependencies=TRUE)
library(directlabels)
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=population, group=code, color=code)) +
geom_line(size=1.25) +
geom_dl(aes(label=code), method=list(cex=0.75, rot = 25,
hjust=-.5, dl.combine("last.points"))) +
# Use geom_dl, from the directlabels package, to place
# labels at the end of each line, to better identify the
# geographic entity, avoiding reliance only on a color-
# coded legend.
labs(
title=
"Population of Selected Entities Over Time: 1900 to 2020",
subtitle=
"Data Were Obtained from Our World in Data",
x = "\nYear", y = "Population\n")
# Fig. 2.3
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=life_expectancy, group=code, color=code)) +
geom_line(size=1.25) +
geom_dl(aes(label=code), method=list(cex=0.75, rot = 25,
hjust=-.5, dl.combine("last.points"))) +
# Use geom_dl, from the directlabels package, to place
# labels at the end of each line, to better identify the
# geographic entity, avoiding reliance only on a color-
# coded legend.
labs(
title=
"Life Expectancy of Selected Entities Over Time: 1900 to 2020",
subtitle=
"Data Were Obtained from Our World in Data",
x = "\nYear", y = "Life Expectancy\n")
# NOTE: The closing arguments of this call were lost to a page break
# in the source; the title, subtitle, and axis labels shown here are
# reconstructed by analogy with the population figure above.
The life expectancy scale is generally adequate such that this line chart is quite
sufficient to see trends over time for each of the six selected entities. The ggplot
facet_grid() option will add more value and support a better understanding of
outcomes.
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=life_expectancy, group=code, color=code)) +
geom_line(size=1.75) +
facet_grid(cols = vars(entity)) +
# Note how entity was used and not code, to offer
# another view on detail.
labs(
title="Life Expectancy Over Time: 1900 Onward",
subtitle=
"Data Were Obtained from Our World in Data", x="\nYear",
y="Population\n") +
scale_y_continuous(labels=scales::comma, limits=c(20,
90), breaks=scales::pretty_breaks(n=10)) +
theme_Mac() +
theme(legend.position="none") +
theme(axis.text.x=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=90)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
Along with population change and life expectancy, GDP per capita is a leading
indicator of funds that can be devoted to health care, public sanitation and clean
water, workplace protections, and other factors that all contribute to general health
and wellness. Ostensibly, entities with a high GDP per capita have the potential to
promote better health care than entities with a lower GDP per capita. These metrics
do not dictate the quality of healthcare availability, but they do suggest potential
expenditures (Fig. 2.4).
Fig. 2.4
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=gdp_capita, group=code, color=code)) +
geom_line(size=0.75) +
# A thin line was declared in geom_line to accommodate to
# some degree overlap in the early years, when GDP per
# capita was quite low for multiple entities. It gave a
# slight improvement in output, but there is still a fair
# degree of overlap in the early years of this figure.
geom_dl(aes(label=code), method=list(cex=0.75, rot = 25,
hjust=-.5, dl.combine("last.points"))) +
# Use geom_dl, from the directlabels package, to place
# labels at the end of each line, to better identify the
# geographic entity, avoiding reliance only on a color-
# coded legend.
labs(
title=
"Gross Domestic Product (GDP) per Capita of Selected Entities
Over Time: 1900 Onward",
subtitle=
"Data Were Obtained from Our World in Data", x="\nYear",
y="Gross Domestic Product (GDP)\nper Capita\n") +
annotate("text", x=1900, y=25000, fontface="bold", size=03,
hjust=0, label=
"References: UN Population Division (2019) and Others")+
annotate("text", x=1900, y=60000, fontface="bold", size=03,
hjust=0, label="CHN China") +
annotate("text", x=1900, y=55000, fontface="bold", size=03,
hjust=0, label="HND Honduras") +
annotate("text", x=1900, y=50000, fontface="bold", size=03,
hjust=0, label="KOR South Korea") +
annotate("text", x=1900, y=45000, fontface="bold", size=03,
hjust=0, label="PRK North Korea") +
annotate("text", x=1900, y=40000, fontface="bold", size=03,
hjust=0, label="TWN Taiwan") +
annotate("text", x=1900, y=35000, fontface="bold", size=03,
hjust=0, label="USA United States") +
scale_y_continuous(labels=scales::dollar, limits=c(-1000,
60000), breaks=scales::pretty_breaks(n=10)) +
# By using labels=scales::dollar, the USD dollar sign will
# show on the Y axis.
theme_Mac() +
theme(legend.title=element_blank()) + # No legend title
theme(axis.text.x=element_text(face="bold", size=12,
hjust=0.5, vjust=1)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 2.4
There are more than a few years where some entities are missing GDP per capita
data, all for different reasons. Then, consider the GDP per capita overlap in the early
years, as referenced earlier. Given these concerns, once again ggplot’s facet_grid()
option will allow a more nuanced comparison of change in GDP per capita over
time and in turn add more value and support a better understanding of outcomes
(Fig. 2.5).
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=gdp_capita, group=code, color=code)) +
geom_line(size=1.75) +
facet_grid(cols = vars(entity)) +
# Note how entity was used and not code, to offer
# another view on detail.
labs(
title="Gross Domestic Product (GDP) per Capita Over Time:
1900 Onward",
subtitle=
"Data Were Obtained from Our World in Data", x="\nYear",
y="Gross Domestic Product (GDP)\nper Capita\n") +
scale_y_continuous(labels=scales::dollar, limits=c(-1000,
60000), breaks=scales::pretty_breaks(n=10)) +
# By using labels=scales::dollar, the USD dollar sign will
# show on the Y axis.
theme_Mac() +
theme(legend.position="none") +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=90)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 2.5
Fig. 2.5
Challenge: Much more can be done with the data in the original format. As an
example, review the object variable LifeExpectancyGDPWorld1950onward.tbl$entity
and notice how some entities consist of regions, not just individual entities:
• High-income countries
• Upper-middle-income countries
• Lower-middle-income countries
• Low-income countries
• More developed regions
• Less developed regions
Following along with the approach used in this addendum (a starting sketch follows
this paragraph), look into life expectancy and GDP by region, as listed above: (1)
income, high to low (four breakouts); and (2) development, more and less (two
breakouts). Do the figures provide some degree of evidence that GDP impacts life
expectancy? Later, with more experience,
it will be possible to use various inferential analyses to examine this relationship
more closely, but for now, graphical displays are sufficient.
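A starting sketch, assuming the region labels appear verbatim in the entity column:

# Keep only the region-level entities, from 1950 onward, before
# plotting life expectancy and GDP per capita by region.
LifeExpectancyGDPRegions.tbl <- LifeExpectancyGDP.tbl %>%
  dplyr::filter(year >= 1950) %>%
  dplyr::filter(entity %in% c(
    "High-income countries",
    "Upper-middle-income countries",
    "Lower-middle-income countries",
    "Low-income countries",
    "More developed regions",
    "Less developed regions"))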
23 The Bureau of Labor Statistics (BLS) provides a free key for use of API access to data resources.
Periodic renewal of the key is required, with the BLS sending out an e-mail notice of directions for
renewal.
Addendum 2: United States Department of Labor, Bureau of Labor Statistics
install.packages("blsR", dependencies=TRUE)
library(blsR)
NatPctUnemp2017to2021ByGender.tbl <-
  blsR::get_n_series_table(
    series_ids = list(uer.men = "LNS14000001",
                      uer.women = "LNS14000002"),
    api_key = "UseTheKeyProvidedAtSign-Upxxxxxxxxxx",
    start_year = 2017, end_year = 2021)
# NOTE: The arguments to this call were lost to a page break in the
# source. The series IDs shown here (unemployment rate for men and
# for women) and the placeholder key are assumed reconstructions;
# verify the series IDs at the BLS Web site and use your own key.
base::getwd()
base::ls()
base::attach(NatPctUnemp2017to2021ByGender.tbl)
utils::str(NatPctUnemp2017to2021ByGender.tbl)
dplyr::glimpse(NatPctUnemp2017to2021ByGender.tbl)
utils::head(NatPctUnemp2017to2021ByGender.tbl)
base::summary(NatPctUnemp2017to2021ByGender.tbl)
writexl::write_xlsx(NatPctUnemp2017to2021ByGender.tbl,
path = "F:\\R_Ceres\\NatPctUnemp2017to2021ByGender.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
Observe how year and month are integers, but for now, it would be better to view
them as factors.24 The variables uer.men and uer.women are real numbers, in deci-
mal format. A few actions are needed to put the dataset into desired format.
base::colnames(NatPctUnemp2017to2021ByGender.tbl) <- c(
"Year", # Column 01 Year.
"Month", # Column 02 Month
"Uer.Men", # Column 03 Unemployment rate, men
"Uer.Women") # Column 04 Unemployment rate, women
# Uer - Unemployment Rate, as a Percentage
base::getwd()
base::ls()
base::attach(NatPctUnemp2017to2021ByGender.tbl)
utils::str(NatPctUnemp2017to2021ByGender.tbl)
dplyr::glimpse(NatPctUnemp2017to2021ByGender.tbl)
utils::head(NatPctUnemp2017to2021ByGender.tbl)
base::summary(NatPctUnemp2017to2021ByGender.tbl)
As a simple Quality Assurance check of the data, use the ggplot2::qplot() func-
tion with only a few embellishments, to look for trends in unemployment over time
and by gender. With assurance that the data are useful, future actions would likely
call for use of the ggplot2::ggplot() function, but that action is deferred in this
addendum to instead show how the ggplot2::qplot() function has value and should
be considered for initial graphics (Fig. 2.6).25
24 Although it is not needed for this addendum, many who are experienced with the tidyverse eco-
system might use the dplyr::mutate() function and either the lubridate::make_date() function or the
lubridate::make_datetime() function to accommodate how date(s) are considered. Review Schedule
of Releases for the Employment Situation (https://www.bls.gov/schedule/news_release/empsit.
htm) for precise information if greater granularity for dates is needed.
25 It is recognized that the ggplot2::qplot() function is deprecated, but it is still available, it still
works quite nicely, and it is purposely demonstrated in this addendum.
Fig. 2.6
USAUnempPct2017to2021Men.fig <-
qplot(data = NatPctUnemp2017to2021ByGender.tbl,
Year, Uer.Men, ylim=c(0,16),
main="USA Percentage Unemployment of Men
by Year: 2017 to 2021",
xlab="\nYear", ylab="Percentage Unemployment\nMen\n") +
annotate("text", x=0.5, y=15.0, fontface="bold", size=03,
hjust=0, label=
"Each dot is a datapoint for a specific month.") +
annotate("text", x=0.5, y=14.0, fontface="bold", size=03,
hjust=0, label=
"Some months had values that were similar") +
annotate("text", x=0.5, y=13.2, fontface="bold", size=03,
hjust=0, label=
"to other months.") +
theme_Mac()
USAUnempPct2017to2021Women.fig <-
qplot(data = NatPctUnemp2017to2021ByGender.tbl,
Year, Uer.Women, ylim=c(0,16),
main="USA Percentage Unemployment of Women
by Year: 2017 to 2021",
xlab="\nYear", ylab="Percentage Unemployment\nWomen\n") +
annotate("text", x=0.5, y=15.0, fontface="bold", size=03,
hjust=0, label=
"Each dot is a datapoint for a specific month.") +
annotate("text", x=0.5, y=14.0, fontface="bold", size=03,
hjust=0, label=
136 2 Data Sources in Biostatistics
par(ask=TRUE)
gridExtra::grid.arrange(
USAUnempPct2017to2021Men.fig,
USAUnempPct2017to2021Women.fig, ncol=2)
# Fig. 2.6
NatPctUnemp2017to2021ByYearByGender.tbl <-
dplyr::select(NatPctUnemp2017to2021ByGender.tbl,
-Month)
# Use the dplyr::select() function to remove a column
# by name, or Month in this set of syntax, since the
# immediate focus does not include by Month analyses.
base::getwd()
base::ls()
base::attach(NatPctUnemp2017to2021ByYearByGender.tbl)
utils::str(NatPctUnemp2017to2021ByYearByGender.tbl)
dplyr::glimpse(NatPctUnemp2017to2021ByYearByGender.tbl)
utils::head(NatPctUnemp2017to2021ByYearByGender.tbl)
base::summary(NatPctUnemp2017to2021ByYearByGender.tbl)
LNatPctUnemp2017to2021ByYearByGender.tbl <-
tidyr::pivot_longer(NatPctUnemp2017to2021ByYearByGender.tbl,
-c(Year),
names_to = "Gender", values_to = "Pct_Unemployment")
# Put the data into long format, using the
# tidyr::pivot_longer() function.
#
# As is often shown in this text, the leading L
# in the dataset (e.g., tibble) name means that
# the data are in long format.
#
# The expression -c(Year) means that the
# tidyr::pivot_longer() function should
# pivot everything except Year. In this
# syntax, the minus sign means except.
base::getwd()
base::ls()
base::attach(LNatPctUnemp2017to2021ByYearByGender.tbl)
utils::str(LNatPctUnemp2017to2021ByYearByGender.tbl)
dplyr::glimpse(LNatPctUnemp2017to2021ByYearByGender.tbl)
utils::head(LNatPctUnemp2017to2021ByYearByGender.tbl)
base::summary(LNatPctUnemp2017to2021ByYearByGender.tbl)
install.packages("onewaytests", dependencies=TRUE)
library(onewaytests)
onewaytests::describe(Pct_Unemployment ~ Year,
data=LNatPctUnemp2017to2021ByYearByGender.tbl)
onewaytests::describe(Pct_Unemployment ~ Gender,
data=LNatPctUnemp2017to2021ByYearByGender.tbl)
Challenge: With a focus on inferential statistics, use a few different packages for
Analysis of Variance (ANOVA) testing that should provide more definitive statistics
related to statistically significant difference (p ≤ 0.05), following along with the
desire to know26:
• Is there a statistically significant difference (p ≤ 0.05) in percentage unemploy-
ment by year?
• Is there a statistically significant difference (p ≤ 0.05) in percentage unemploy-
ment by gender?
• Is there a statistically significant interaction (p ≤ 0.05) between year and gender
regarding percentage unemployment?
install.packages("agricolae", dependencies=TRUE)
library(agricolae)
agricolae::HSD.test(
aov(Pct_Unemployment ~ Year,
data=LNatPctUnemp2017to2021ByYearByGender.tbl), # Model
trt="Year", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Percentage Unemployment by Year")
# Wrap the agricolae::HSD.test() function around the
# Oneway ANOVA model obtained by using the aov()
# function. Select desired arguments, such as group,
# console, and alpha (e.g., p-value).
26 Far more detail is provided in a later lesson regarding inferential tests.
Year, means
Pct_Unemployment groups
2020 8.10417 a
2021 5.35000 b
2017 4.35417 bc
2018 3.89167 c
2019 3.66667 c
install.packages("car")
library(car)
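The call that generated the Twoway ANOVA table below does not appear in this extract. A sketch of syntax likely to produce output of this form follows, assuming the long-format dataset and the car::Anova() function; the object name is only illustrative.

# Year is wrapped in factor() so that each year forms a separate
# group, and Year * Gender includes both main effects and the
# interaction term.
UnempYearGender.aov <- stats::aov(
  Pct_Unemployment ~ factor(Year) * Gender,
  data = LNatPctUnemp2017to2021ByYearByGender.tbl)
car::Anova(UnempYearGender.aov)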
Response: Pct_Unemployment
Sum Sq Df F value Pr(>F)
Year 315.7 4 27.203 0.00000000000000106 ***
Gender 0.0 1 0.001 0.974
Year:Gender 2.6 4 0.228 0.922
Residuals 319.2 110
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Twoway ANOVA output for percentage unemployment from 2017 to 2021
confirms that:
• There is a statistically significant difference (p ≤ 0.05) in percentage unemploy-
ment by year, with explicit details found in the Oneway ANOVA output.
• There is no statistically significant difference (p ≤ 0.05) in percentage unem-
ployment by gender.
• There is no statistically significant interaction (p ≤ 0.05) between year and gen-
der regarding percentage unemployment.
As useful as these results may be, data scientists need to be aware of conditions
that may impact the data and most importantly, the concern that data are reliable,
valid, and representative. Look at the years (and subsequent data) selected for this
addendum on United States percentage unemployment. The United States economy
was robust in 2017, 2018, and 2019, and unemployment was consistently at or near
record lows. Then COVID-19 impacted everything, with concerns growing in
January and February 2020 – with devastating impacts on the economy by mid-
March 2020. In a mere matter of days, millions of workers were either laid off (e.g.,
made redundant) or voluntarily left their jobs in response to factory and other physi-
cal plant shutdowns, lockdowns, distancing from others, fear of infection, and other
mitigation responses to COVID-19.
Then, whether true or not, there was a continuing storyline in the press that
women were impacted by lost employment at a rate far greater than men once the
COVID-19 layoffs began.27 Look at the following syntax and outcomes to see if the
data support this issue, recalling that collectively, from 2017 to 2021, there was no
difference in percentage unemployment between the two genders (p ≤ 0.05).
LNatPctUnemp2020to2021ByYearByGender.tbl <-
LNatPctUnemp2017to2021ByYearByGender.tbl %>%
dplyr::filter(Year == 2020 | Year == 2021)
# Use the dplyr::filter() function to have data for
# 2020 or 2021, only. Be sure to note the condition
# (Year == 2020 | Year == 2021) and NOT the frequent
# mistake (Year == 2020 | 2021).
base::getwd()
base::ls()
base::attach(LNatPctUnemp2020to2021ByYearByGender.tbl)
utils::str(LNatPctUnemp2020to2021ByYearByGender.tbl)
dplyr::glimpse(LNatPctUnemp2020to2021ByYearByGender.tbl)
utils::head(LNatPctUnemp2020to2021ByYearByGender.tbl)
base::summary(LNatPctUnemp2020to2021ByYearByGender.tbl)
27 The expression follow the science, wherever it leads was used extensively in the press throughout
the pandemic. It might have been more appropriate to use the expression follow the data, wherever
they lead. Reliable and valid data are essential to furthering discovery of true outcomes.
Now that there is a dataset with data for 2020 and 2021 only, apply the Twoway
ANOVA syntax again to see if percentage unemployment outcomes for these 2 years
are consistent with prior findings, from 2017 to 2021.
agricolae::HSD.test(
aov(Pct_Unemployment ~ Year,
data=LNatPctUnemp2020to2021ByYearByGender.tbl), # Model
trt="Year", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Percentage Unemployment by Year: 2020 and 2021")
# Wrap the agricolae::HSD.test() function around the
# Oneway ANOVA model obtained by using the aov()
# function. Select desired arguments, such as group,
# console, and alpha (e.g., p-value).
Year, means
Pct_Unemployment groups
2020 8.10417 a
2021 5.35000 b
As evidenced by the use of this Oneway ANOVA, there was a statistically signifi-
cant difference (p ≤ 0.05) in mean percentage unemployment by year, with 2020
mean Pct_Unemployment = 8.10417 and 2021 mean Pct_Unemployment = 5.3500.
The Twoway ANOVA approach to the data should provide more information, now
introducing Gender for these 2 years.
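As above, the call that produced the table below is not in this extract; a sketch of syntax likely to generate it, again assuming the car::Anova() function and an illustrative object name, is:

# Repeat the Year-by-Gender model for the 2020 and 2021 subset only.
UnempYearGender2020to2021.aov <- stats::aov(
  Pct_Unemployment ~ factor(Year) * Gender,
  data = LNatPctUnemp2020to2021ByYearByGender.tbl)
car::Anova(UnempYearGender2020to2021.aov)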
Response: Pct_Unemployment
Sum Sq Df F value Pr(>F)
Year 91.0 1 12.612 0.000927 ***
Gender 0.4 1 0.049 0.826670
Year:Gender 2.1 1 0.294 0.590110
Residuals 317.6 44
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, even for 2020 and 2021 alone, it is now confirmed that there is no
statistically significant difference (p ≤ 0.05) in mean percentage unemployment by
gender. A few statistics, using federal data as a resource, may help clarify actual
outcomes versus what was often (incorrectly) reported in the press concerning
differences in unemployment by gender.
LNatPctUnemp2020ByYearByGender.tbl <-
LNatPctUnemp2017to2021ByYearByGender.tbl %>%
dplyr::filter(Year == 2020)
# Use the dplyr::filter() function to have data for
# 2020, only.
base::getwd()
base::ls()
base::attach(LNatPctUnemp2020ByYearByGender.tbl)
utils::str(LNatPctUnemp2020ByYearByGender.tbl)
dplyr::glimpse(LNatPctUnemp2020ByYearByGender.tbl)
utils::head(LNatPctUnemp2020ByYearByGender.tbl)
base::summary(LNatPctUnemp2020ByYearByGender.tbl)
With the data restricted to 2020 only, a time of exceptionally high rates of unem-
ployment, see if there are differences between women and men regarding percent-
age unemployment.
onewaytests::describe(Pct_Unemployment ~ Gender,
data=LNatPctUnemp2020ByYearByGender.tbl)
The 2020 mean percentage unemployment for Uer.Men = 7.80833 and the 2020
mean percentage unemployment for Uer.Women = 8.40000. However, is this differ-
ence statistically significant (p ≤ 0.05)?
onewaytests::st.test(Pct_Unemployment ~ Gender,
data=LNatPctUnemp2020ByYearByGender.tbl,
alpha=0.05, na.rm=TRUE, verbose=TRUE)
# Perform a Student's t-test for two samples.
statistic : -0.391328
parameter : 22
p.value : 0.699319
Fig. 2.7
USAUnempPct2020BothGenders.fig <-
qplot(data = LNatPctUnemp2020ByYearByGender.tbl,
Gender, Pct_Unemployment, ylim=c(0,16),
main="USA Percentage Unemployment of Both Genders:
2020",
xlab="\nGender", ylab="Percentage Unemployment\n") +
scale_x_discrete(labels=c("Uer.Men" = "Men",
"Uer.Women" = "Women")) +
annotate("text", x=1.25, y=15.0, fontface="bold", size=03,
hjust=0, label=
"Each dot is a datapoint for a specific month.") +
annotate("text", x=1.25, y=14.0, fontface="bold", size=03,
hjust=0, label=
"Some months had values that were similar") +
annotate("text", x=1.25, y=13.2, fontface="bold", size=03,
hjust=0, label=
"to other months.") +
theme_Mac()
# Fig. 2.7
par(ask=TRUE); USAUnempPct2020BothGenders.fig
These outcomes are a demonstration of the type of value that data scientists pro-
vide to clients. Far more is possible, but again these additional possibilities are
addressed in other lessons.
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, .xlsx, and other file formats.
AlcoholAttributableFractionsAllCauseDeaths.csv
annual_aqi_by_county_2021.csv
BirthRatePer1000People.csv
COVID19CasesDeathsEUEEACountryApr-16-2022.csv
DiabetesAtlasData.csv
FDOH_HealthTrackingData.csv
FloridaUnemploymentByCountySep-01-20.csv
FranceGermanyUKLivestoack1961to2020.csv
GDPCurrentUSDollar.csv
HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt
KentuckyCornYield1900Onward.xlsx
LifeExpectancyGDP.xlsx
LMilkManagementPoundsTidy.txt
NationalSchoolLunchProgram2000-01to2018-19.xlx
NatPctUnemp2017to2021ByGender.xlsx
nsf22300-tab013.xlsx
nslp_sites.csv
Palm_Beach_County_Natural_Areas_Trails.csv
QuickFactsApr-07-2022CapeMayCountyNJvUnitedStates.csv
SelectedHealthConditionsandRiskFactorsbyAge.xlsx
SHA_05032023201717071.csv
Tourist_Development_Tax2021-2022.pdf
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 3
Role of Statistics for Decision-Making
in Biostatistics
There are many ways to approach the decision-making process when data scientists
use statistical analyses to address problems associated with the biological sciences.
The ten-point process discussed in this lesson provides a general framework for the
overall workflow, from initial problem identification to communication of current
outcomes and plans for future improvements. Other frameworks are often encountered,
but the general process is usually similar to what is presented in this lesson. Although
the text in this lesson may be brief, it is suggested that the best way to learn about
the topics in this lesson is to read the biostatistics literature, literature on many bio-
logical topics and literature from many biologically oriented resources, print and
online. Look for similarities in the way workflow is addressed, as a planned statisti-
cal approach is used to achieve results.
A relevant problem associated with COVID-19 is used for demonstration pur-
poses. Reviewing as many resources in the literature as time allows, see how other
problems in biostatistics are addressed, as data scientists try to investigate relevant
problems and communicate findings. See the similarities (and differences) in the
literature to the general structure for the ten points identified in this lesson.
Was the percentage of COVID-19 deaths in the many counties of the United States
impacted by the degree of urbanization, otherwise known as the urban-rural con-
tinuum? Heavily urban communities have high population densities, whereas more rural communities have far lower population densities.
As seen later in this lesson, the sample problem is focused on the use of the tidy-
verse ecosystem to adjust an extant federal dataset and to then examine if the urban-
rural continuum has any impact on the overall percentage of COVID-19 deaths. The
Centers for Disease Control and Prevention is the source for data in this example,
https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-by-County-and-Race-and/k8wy-p9cg. As a proxy for a provided Code Book, examine the federal publica-
tion 2013 NCHS Urban–Rural Classification Scheme for Counties.1
This resource, like its earlier 2006 counterpart, is a county-level population density
scheme; the National Center for Health Statistics (NCHS) urban–rural classification
scheme for counties has six levels: four metropolitan breakouts (large central metro,
large fringe metro, medium metro, and small metro) and two nonmetropolitan
breakouts (micropolitan and noncore). Review the resource and see in Table 1 (page
8) how there is far greater detail on these breakouts, which should be examined to
fully understand the complexity of the county-based urban-rural continuum.
When reviewing the background information, note how county data are available
for counties with more than 100 COVID-19 deaths. Deaths are cumulative from the
week ending January 4, 2020, onward to the most recent reporting week (May 7,
2022, in this example) identified in the original dataset for when it was obtained,
and based on county of occurrence since it would be impossible to know with any
assurance the county of infection. Review the original resource to see the conditions
as to why data may not be reported, such as the threshold that a county must have at
least 100 COVID-19 deaths to be included in that reporting column as well as other
conditions that protect privacy. The exclusion of some data, although needed for
confidentiality purposes, is a continual concern to data scientists and is addressed to
some degree in this lesson.
Review the ending materials in this lesson for details on how the tidyverse ecosys-
tem was used to obtain the data, a federal dataset freely available to anyone with
interest. The readr::read_csv() function was used to import the downloaded federal
data into the R session, but a great deal of effort was needed to identify the appro-
priateness of this dataset and to then adjust it for immediate use.2 Far more support-
ing detail is provided in the ending materials.
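A minimal sketch of that import step, assuming the downloaded extract was saved locally under a hypothetical file name:

# The file name below is a hypothetical placeholder for the local copy
# of the CDC extract.
CovidDeathsByCounty.tbl <- readr::read_csv(
  "ProvisionalCOVID19DeathsByCountyAndRace.csv")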
Review the ending materials in this lesson for details on how the tidyverse ecosys-
tem was used to organize the data, using R syntax exclusively. Give special attention
to the many R packages and functions used against the dataset, especially the use of
1 Review https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf, Pages 2 and 8, Series 2, No. 166, for more detail.
2 Those with a special interest may want to explore functions from the rio package to see other strategies for how data can be imported into an active R session.
functions from the dplyr package and movement from wide to long data format by
using the tidyr::pivot_longer() function.
Carefully examine the ending materials in this lesson to see the many approaches
used for inquiry into the problem associated with this lesson, the issue of possible
differences in percentage deaths from COVID-19 by the degree of urbanization of
local communities. As the syntax is examined, give special attention to:
• Data of interest: Not all data in the original dataset are needed for immediate use, and unneeded data can be sequestered from the final working dataset.
• R packages and functions: Many different packages and associated functions, including packages from Base R and the tidyverse ecosystem, have a role. Select the R packages and functions that meet needs.
• Expected workflow and timelines: Notice how the workflow is facilitated by following a structured approach to problem-solving and that this process supports efficient use of time and resources.
• Quality assurance checks: Quality assurance is a continuous process, with periodic checks even if at first these checks seem redundant. A small problem that goes undetected early on can easily result in compounded errors by the end of the process.
• Develop templates for all major actions: Give attention to how syntax is used (with modifications, as needed) in multiple ways and in multiple places. These templates promote efficient reuse of syntax that has a known history of acceptable use.
Notice how the syntax in the ending materials of this lesson may at times seem
redundant, but the many actions provide a continuous set of activities that promote
correct analyses and the production of relevant beautiful graphics.
As the data for this lesson are organized (especially, from wide to long) and then
analyzed, give special attention to how the data are subjected to both nonparametric
and parametric inferential analyses. This type of individual review is far too often
overlooked. In a final summary of outcomes, perhaps only the calculated p-value will be reported, along with the concluding statement on significance relative to a priori hypotheses. But multiple reviews of all outcomes are needed to have full assurance of final conclusions.
When outcomes are prepared for publication in the literature, whether a journal
article or part of a published text, it can be assumed that the final manuscript will be
reviewed by a reputable editor and anonymous peer reviewers, often three or more.
Ideally, the peer reviewers will be selected by the editor for their expertise with the
subject matter and nonacquaintance with the author(s) of the draft publication. This
process provides additional quality assurance that the final publication is acceptable
for readership by the intended community of readers.
Even for publications that are not intended for external publication in a journal
or text but are instead internal to a select group of readers in a company, department,
research laboratory, etc., external review by disinterested third parties is still an
essential part of the process for inquiry and communication of outcomes. Free
inquiry and equally free criticism make for improvement in the sciences.
Did the inquiry answer all possible issues associated with the identified problem?
Given that this is rarely the case, what else could have (or should have) been done
to address the problem if there were only more available staff members, more time
allowed for investigation before moving on to other priority projects, more funds
available for the purchase of additional resources, etc.? Data scientists, in personal
notes, keep a list of what else may have been done for future improvements. These
ideal processes may not be currently achievable, but they provide a framework for
future actions – actions that can be defended based on the desire for continuous
improvement.
Give considerable attention to the ending materials in this lesson to examine the R syntax, selected packages and functions, selected approaches, etc. used to address the problem of this lesson, the relationship between deaths from COVID-19 and the degree of urbanization across different communities. It is not enough
to merely obtain the data, whether self-obtained or obtained from an external
resource. An experienced data scientist will know that full comprehension by exter-
nal readers and viewers will come from multiple approaches to communication.
Exploratory Graphics
Exploratory Analyses
Data scientists, when they finally have a dataset in desired form, address analyses
from multiple perspectives. There may be a planned set of analyses, as is often the
case, but it is also advisable to allow for serendipity by possibly looking for
unplanned patterns. It is often the case that the planned set of analyses are the analy-
ses that eventually provide the best understanding of outcomes, but it is still advis-
able to allow for exploration of the data by engaging in:
Inferential Analyses that Address Differences Between and Among
Groups: Consider the use of the following:
• Nonparametric tests such as: Sign Test, Chi-Square, Mann-Whitney U Test,
Wilcoxon Matched-Pairs Signed-Ranks Test, Kruskal-Wallis Oneway Analysis
of Variance, Friedman Twoway Analysis of Variance
• Parametric tests such as: Student’s t-Test for Independent Samples, Student’s
t-Test for Matched Pairs, Oneway Analysis of Variance, and Twoway Analysis of
Variance
The Centers for Disease Control and Prevention (CDC) provides a regularly updated dataset, Provisional COVID-19 Deaths by County, and Race and Hispanic Origin, that addresses deaths from COVID-19 for counties with more than 100 COVID-19 deaths. Deaths are cumulative from the week ending January 4, 2020, and are ongoing. The dataset Provisional_COVID-19_Deaths_by_County__and_Race_and_Hispanic_Origin.csv is available at https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-by-County-and-Race-and/k8wy-p9cg.
# UrbanRuralCode
# 1 Large central metro
# 2 Large fringe metro
# 3 Medium metro
# 4 Small metro
# 5 Micropolitan
# 6 Noncore
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
In keeping with what is seen in the Housekeeping section, use the many pack-
ages listed below as a starting point for what is often used in an R session that
focuses on the tidyverse ecosystem and its use in data science.3
3 See the comment in an earlier lesson as to why a # comment character has been placed in front of most packages that were previously downloaded.
#install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
#install.packages("readxl", dependencies=TRUE)
library(readxl)
#install.packages("magrittr", dependencies=TRUE)
library(magrittr)
#install.packages("janitor", dependencies=TRUE)
library(janitor)
#install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
#install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
#install.packages("ggtext", dependencies=TRUE)
library(ggtext)
#install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
#install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
#install.packages("scales", dependencies=TRUE)
library(scales)
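Between the package loading above and the confirmation below, the text relies on a user-created ggplot2 theme, theme_Mac(), defined in earlier material not reproduced in this excerpt. A minimal, hypothetical sketch of such a theme function, offered only so the later plotting syntax is self-contained, might look like:
# Hypothetical stand-in only: the theme_Mac() used throughout this text
# is defined elsewhere. This sketch shows the general pattern of a
# user-created theme that wraps an existing ggplot2 theme and adjusts
# a few elements.
theme_Mac <- function(base_size = 14) {
  ggplot2::theme_bw(base_size = base_size) +
    ggplot2::theme(
      plot.title      = ggplot2::element_text(face = "bold"),
      axis.title      = ggplot2::element_text(face = "bold"),
      legend.position = "none")
}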
###############################################################
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
###############################################################
Address the data associated with the focus of this example, the percentage of deaths from COVID-19 by degree of urbanization (see the publication serving as the Code Book, 2013 NCHS Urban–Rural Classification Scheme for Counties, page 9, for more details):
Most approaches for the use of the data will be based on functions associated with the tidyverse ecosystem, but functions from Base R will be used when they represent the most appropriate approach toward problem-solving, especially for this introductory text.
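The syntax that downloads the federal file and creates WCOVID19Deaths.tbl appears in material not reproduced in this excerpt. As a minimal sketch, assuming the renamed file ProvisionalCOVID19DeathsbyCountyRaceHispanicOrigin.csv (listed at the end of this lesson) is in the working directory and that janitor::clean_names() is used to regularize the column names:
# Sketch only, not the exact syntax used in the text: import the CDC
# file and standardize the column names (e.g., urban_rural_description).
WCOVID19Deaths.tbl <- readr::read_csv(
  "ProvisionalCOVID19DeathsbyCountyRaceHispanicOrigin.csv",
  show_col_types = FALSE) %>%
  janitor::clean_names()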
base::getwd()
base::ls()
base::attach(WCOVID19Deaths.tbl)
utils::str(WCOVID19Deaths.tbl)
dplyr::glimpse(WCOVID19Deaths.tbl)
base::summary(WCOVID19Deaths.tbl)
Before the dataset is adjusted, it is best to learn more about the main object vari-
able of interest, urban_rural_description, since the output of the base::summary()
function for this object variable is not overly helpful.
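The syntax that builds this summary is not reproduced in this excerpt. A minimal sketch, assuming dplyr::count() is acceptable for the purpose and using the object name echoed below:
# Sketch only: count the rows for each urban_rural_description value
# and express each count as a percentage of all rows.
Summaryurban_rural_description <- WCOVID19Deaths.tbl %>%
  dplyr::count(urban_rural_description, name = "Count") %>%
  dplyr::mutate(Percentage = 100 * Count / base::sum(Count)) %>%
  dplyr::arrange(dplyr::desc(Count))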
Summaryurban_rural_description
# A tibble: 6 x 3
urban_rural_description Count Percentage
<chr> <int> <dbl>
1 Micropolitan 990 29.4
2 Large fringe metro 687 20.4
3 Medium metro 660 19.6
4 Small metro 642 19.1
5 Large central metro 204 6.07
6 Noncore 180 5.35
Remember that the terms Count and Percentage refer to occurrence in the dataset, not the population. Most counties in the United States are rural, but most people reside in more densely populated urban counties.
A decision needs to be made now on how to adjust the dataset. Should any col-
umns be deleted? What is the best course of action to address the many occurrences
of missing data? If the data are to be put into long format, what is the best way to
arrange the data?
Use the utils::str() function to once again review the structure of the dataset and make decisions on its final form, deciding on the columns that are needed for the planned analyses of COVID-19 deaths by degree of urbanization and temporarily deleting all columns that do not play a role in this planned analysis. Although the adjusted datasets will eventually exclude selected object variables, remember that the original data are retained in case it is ever necessary to go back to the data for other analyses.
utils::str(WCOVID19Deaths.tbl)
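The dplyr::select() step that produces WCOVID19DeathsAdjusted1.tbl is in material not reproduced here. A hedged sketch, assuming only the columns used later in this lesson are retained:
# Sketch only: keep the indicator column, the urban-rural description,
# and the race-ethnicity percentage columns needed for later analyses.
WCOVID19DeathsAdjusted1.tbl <- WCOVID19Deaths.tbl %>%
  dplyr::select(
    urban_rural_description, indicator,
    non_hispanic_white, non_hispanic_black,
    non_hispanic_american_indian_or_alaska_native,
    non_hispanic_asian,
    non_hispanic_native_hawaiian_or_other_pacific_islander,
    hispanic, other)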
base::getwd()
base::ls()
base::attach(WCOVID19DeathsAdjusted1.tbl)
utils::str(WCOVID19DeathsAdjusted1.tbl)
dplyr::glimpse(WCOVID19DeathsAdjusted1.tbl)
base::summary(WCOVID19DeathsAdjusted1.tbl)
It is now best to adjust the wide dataset one more time, applying the dplyr::filter()
function against the object variable indicator so that the eventual dataset contains
data only for percentage of COVID-19 deaths, removing the many rows for the two
other indicator object variable breakouts, distribution of all-cause deaths (%) and
distribution of population (%).
WCOVID19DeathsAdjusted2.tbl <-
WCOVID19DeathsAdjusted1.tbl %>%
dplyr::filter(indicator %in% c(
"Distribution of COVID-19 deaths (%)")) # COVID19 deaths, only
# Use the seemingly ubiquitous dplyr::filter() function
# to remove rows, which in this example retains only the rows
# with the term Distribution of COVID-19 deaths (%) in the
# indicator column. The output of this filtering syntax
# results in a new dataset, WCOVID19DeathsAdjusted2.tbl. This
# adjusted dataset will be used to reconfigure the data into
# Long format.
base::getwd()
base::ls()
base::attach(WCOVID19DeathsAdjusted2.tbl)
utils::str(WCOVID19DeathsAdjusted2.tbl)
dplyr::glimpse(WCOVID19DeathsAdjusted2.tbl)
base::summary(WCOVID19DeathsAdjusted2.tbl)
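The step that creates WCOVID19DeathsAdjusted3.tbl is likewise not reproduced in this excerpt. As a hypothetical stand-in only, one further light adjustment could be to drop the now-constant indicator column, which keeps the later syntax in this lesson runnable:
# Hypothetical stand-in only: the actual adjustment used in the text is
# not shown here. Dropping the constant indicator column is one option.
WCOVID19DeathsAdjusted3.tbl <- WCOVID19DeathsAdjusted2.tbl %>%
  dplyr::select(-indicator)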
base::getwd()
base::ls()
base::attach(WCOVID19DeathsAdjusted3.tbl)
utils::str(WCOVID19DeathsAdjusted3.tbl)
dplyr::glimpse(WCOVID19DeathsAdjusted3.tbl)
base::summary(WCOVID19DeathsAdjusted3.tbl)
The wide dataset should now be in good form and ready for restructuring, from
wide to long. The tidyr::pivot_longer() function is likely the best choice for this
task. As a summary of intentions, the long dataset will consist of percentages of
death from COVID-19 in one column and urban_rural_description breakouts in
another column. With the data in long format, it will be possible to use tidyverse
ecosystem functions to the best effect.
LCOVID19Deaths.tbl <-
tidyr::pivot_longer(
WCOVID19DeathsAdjusted3.tbl,
cols=c(
non_hispanic_white,
non_hispanic_black,
non_hispanic_american_indian_or_alaska_native,
non_hispanic_asian,
non_hispanic_native_hawaiian_or_other_pacific_islander,
hispanic,
other),
names_to = "RaceEthnic",
values_to = "Percentage")
# The wide dataset is now put in long format, with the
# race-ethnicity percentages for distribution of COVID-19
# deaths all set to one column named Percentage. There
# are two complementary columns, urban_rural_description
# and RaceEthnic. This dataset is in long format and has
# all of the information needed to determine if there are
# statistically significant differences (p <= 0.05) in
# percentages of COVID-19 deaths by urban-rural gradients.
# The measured datum percentage deaths should be viewed
# as a real number, but urban-rural gradients are clearly
# factor object variables that show no precise order. As
# such, a nonparametric approach will be used along with
# a parametric understanding of the data.
#
# Comment: This example is focused on the percentage of
# COVID-19 deaths by degree of urbanization. The long
# format also contains breakouts of RaceEthnic, but this
# object variable is not used for analyses in this
# lesson, but is retained in the long format dataset for
# possible further inquiry.
base::getwd()
base::ls()
base::attach(LCOVID19Deaths.tbl)
utils::str(LCOVID19Deaths.tbl)
dplyr::glimpse(LCOVID19Deaths.tbl)
base::summary(LCOVID19Deaths.tbl)
federal resource serving as a proxy Code Book as to why confidentiality and the
Rule of 100 called for this outcome, the occurrence of so many NAs.
For this lesson, a decision has been made that analyses will be based only on a
complete dataset. Given this decision, all rows with a missing value will be deleted
from the dataset and the dataset used in this demonstration will have no missing
values.4
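The syntax that removes rows with missing values and creates LCOVID19DeathsNOMissingData.tbl is not reproduced in this excerpt. A minimal sketch, assuming tidyr::drop_na() is an acceptable way to implement the complete-case decision described above:
# Sketch only: retain only complete rows, consistent with the decision
# to base all analyses on a dataset with no missing values.
LCOVID19DeathsNOMissingData.tbl <- LCOVID19Deaths.tbl %>%
  tidyr::drop_na()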
base::getwd()
base::ls()
base::attach(LCOVID19DeathsNOMissingData.tbl)
utils::str(LCOVID19DeathsNOMissingData.tbl)
dplyr::glimpse(LCOVID19DeathsNOMissingData.tbl)
base::summary(LCOVID19DeathsNOMissingData.tbl)
Exploratory data analysis (EDA): Descriptive statistics of the relevant dataset are needed to obtain a general feel for the data, with attention not only to trends but also to any extreme values. From among the many functions that could possibly be used to calculate these descriptive statistics, consider the dplyr::summarize() function. Recall that there are no missing data in the adjusted dataset LCOVID19DeathsNOMissingData.tbl. If there were missing data, it would be necessary to account for NA values, for example with the na.rm argument of the summary functions called inside dplyr::summarize().
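As a brief, hedged illustration of that point:
# Sketch only: had missing values been retained, each summary function
# would need the na.rm = TRUE argument, for example:
LCOVID19DeathsNOMissingData.tbl %>%
  dplyr::summarize(Mean = base::mean(Percentage, na.rm = TRUE))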
4 The literature will be useful for those who want to explore the efficacy of this decision, the deletion of all rows with missing data. This action, the deletion of all rows with missing data, must be identified in the methods section of any eventual summary and it may be necessary to defend this action.
LCOVID19DeathsNOMissingData.tbl %>%
dplyr::group_by(urban_rural_description) %>%
dplyr::summarize(
N = base::length(Percentage),
Minimum = base::min(Percentage),
Median = stats::median(Percentage),
Mean = base::mean(Percentage),
SD = stats::sd(Percentage),
Maximum = base::max(Percentage))
# Descriptive statistics are generated by first using the
# dplyr::group_by() function against the object variable
# urban_rural_description. The dplyr::summarize() function
# is then used against a set of selected functions to make
# a neatly presented summary of descriptive statistics.
# A tibble: 6 x 7
urban_rural_description N Minimum Median Mean Maximum
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Large central metro 360 0.001 0.0905 0.188 0.866
2 Large fringe metro 706 0.002 0.132 0.318 0.995
3 Medium metro 725 0.003 0.101 0.299 0.991
4 Micropolitan 565 0.014 0.648 0.566 1
5 Noncore 87 0.056 0.758 0.672 1
6 Small metro 549 0.01 0.185 0.381 0.992
# Median - Nonparametric
par(ask=TRUE)
ggplot(LCOVID19DeathsNOMissingData.tbl,
aes(x=stats::reorder(urban_rural_description, Percentage),
y = Percentage)) +
geom_bar(stat = "summary", fun = "median")
# This rough draft figure is prepared only to show a
# sense of direction for outcomes. Far more will be done
# when the data are used to prepare Beautiful Graphics.
# To achieve that aim, tidyverse ecosystem arguments will
# be used to best effect.
#
# Give attention to the way the stats::reorder() function
# was wrapped around urban_rural_description, reordering
# output on the X axis by Percentage values. This syntax
# may not be needed, but it clearly makes it easier to
# see the progression in percentage of COVID-19 deaths by
# degree of urbanization, the percentage mostly ranging
# from the smallest percentage (Large central metro) to
# the largest percentage (Noncore). Is it a coincidence
# that the progression of percentage deaths from COVID-19
# generally parallels the continuum of urbanization?
The least to the greatest ordering of percentage deaths from COVID-19 from a
nonparametric perspective, using the median, ranged from (1) Large central metro,
to (2) Medium metro, to (3) Large fringe metro, to (4) Small metro, to (5)
Micropolitan, to (6) Noncore. This ordering mostly follows the urban-rural continuum: the percentage of deaths from COVID-19 increased as population density changed from a high degree of urbanization (low percentage death rate) to a low degree of urbanization (high percentage death rate).
Why was the percentage deaths from COVID-19 the least in urban areas, consid-
ering the urban-rural continuum? The data provided by the Centers for Disease
Control and Prevention provided sufficient evidence as to what happened, but the
data cannot begin to answer unequivocally why urban areas had lower percentages
of COVID-19 deaths than the percentage deaths from COVID-19 in more rural
areas. A data scientist should know when to make definitive statements on out-
comes, but a data scientist should also know when to avoid conjecture that is not
supported by the evidence, such as the question of why in this example.
An inferential analysis of the data is needed, however, to learn more about percentage deaths from COVID-19 by degree of urbanization. The Kruskal-Wallis test by ranks, a nonparametric counterpart to a Oneway Analysis of Variance (ANOVA), will be used to examine any commonalities and differences between and among the urban-rural breakout groups in view of percentage death from COVID-19.
# install.packages("agricolae", dependencies=TRUE)
library(agricolae)
# Median - Nonparametric
agricolae::kruskal(LCOVID19DeathsNOMissingData.tbl$Percentage,
LCOVID19DeathsNOMissingData.tbl$urban_rural_description,
alpha=0.05, group=FALSE, p.adj="holm",
main="COVID-19 Percentage Deaths by Urban-Rural Continuum
Using Nonparametric Kruskal-Wallis Oneway ANOVA",
console=TRUE)
# Use holm for pairwise comparisons. Another choice could
# have been to use bonferroni for pairwise comparisons.
LCOVID19DeathsNOMissingData.tbl$urban_rural_description, means of the ranks
                      LCOVID19DeathsNOMissingData.tbl.Percentage    r
Large central metro 1042.37 360
Large fringe metro 1395.70 706
Medium metro 1288.42 725
Micropolitan 2018.79 565
Noncore 2270.72 87
Small metro 1538.50 549
# Mean - Parametric
par(ask=TRUE)
ggplot(LCOVID19DeathsNOMissingData.tbl,
aes(x=stats::reorder(urban_rural_description, Percentage),
y = Percentage)) +
geom_bar(stat = "summary", fun = "mean")
The least to the greatest ordering of percentage deaths from COVID-19 from a
parametric perspective, using the mean, ranged from (1) Large central metro, to (2)
Medium metro, to (3) Large fringe metro, to (4) Small metro, to (5) Micropolitan,
to (6) Noncore. This ordering mostly follows the urban-rural continuum: the percentage of deaths from COVID-19 increased as an area changed from a high degree of urbanization (low percentage death rate) to a low degree of urbanization (high percentage death rate).
What is also interesting about this figure is that the ordering of percentage deaths
from COVID-19 by urban-rural continuum is the same, regardless of whether a
nonparametric (median) or parametric (mean) approach served as the basis for prep-
aration of the figure. The percentages are different, comparing median to mean, as
evidenced in the descriptive statistics – but the practical outcomes in terms of order-
ing are equivalent.
agricolae::HSD.test(
aov(Percentage ~ urban_rural_description,
data=LCOVID19DeathsNOMissingData.tbl), # Model
trt="urban_rural_description", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="COVID-19 Percentage Deaths by Urban-Rural Continuum
Using Tukey's HSD (Honestly Significant Difference)
Parametric Oneway ANOVA")
# Wrap the agricolae::HSD.test() function around the
# Oneway ANOVA model obtained by using the aov()
# function. Select desired arguments, such as group,
# console, and alpha (the significance level).
urban_rural_description, means
Percentage groups
Noncore 0.671862 a
Micropolitan 0.566340 a
Small metro 0.381002 b
Large fringe metro 0.318435 c
Medium metro 0.298619 c
Large central metro 0.188347 d
Fig. 3.1
COVID19MedianPercentageDeathByUrbanRural <-
LCOVID19DeathsNOMissingData.tbl %>%
dplyr::group_by(urban_rural_description) %>%
dplyr::summarize(
Median = stats::median(Percentage))
COVID19MedianPercentageDeathByUrbanRural
# A tibble: 6 x 2
urban_rural_description Median
<chr> <dbl>
1 Large central metro 0.0905
2 Large fringe metro 0.132
3 Medium metro 0.101
4 Micropolitan 0.648
5 Noncore 0.758
6 Small metro 0.185
par(ask=TRUE)
ggplot2::ggplot(LCOVID19DeathsNOMissingData.tbl,
aes(
x=stats::reorder(urban_rural_description, Percentage),
y=Percentage)) +
geom_bar(stat = "summary", fun = "median", fill="red",
color="black") +
labs(
title="Median Percentage COVID-19 Deaths by Urban-Rural
Continuum as of May 13, 2022",
subtitle="Data: Centers for Disease Control and Prevention",
x = "\nUrban - Rural Breakout Groups",
y = "Median Percentage COVID-19\nDeaths\n") +
annotate("text", x=0.75, y=1.00, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="Percentage Density Urbanization Level") +
annotate("text", x=0.75, y=0.95, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="----------------------------------------") +
annotate("text", x=0.75, y=0.90, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.091 2,037 Large central metro") +
annotate("text", x=0.75, y=0.85, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.101 148 Medium metro") +
annotate("text", x=0.75, y=0.80, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.132 233 Large fringe metro") +
annotate("text", x=0.75, y=0.75, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.185 93 Small metro") +
annotate("text", x=0.75, y=0.70, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.648 55 Micropolitan") +
annotate("text", x=0.75, y=0.65, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.758 18 Noncore") +
annotate("text", x=0.75, y=0.60, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="----------------------------------------", ) +
annotate("text", x=0.75, y=0.50, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="Density: Median Persons per Square Mile") +
scale_y_continuous(breaks=scales::pretty_breaks(n=5),
limits=c(0, 1), sec.axis = dup_axis()) +
# Note how the Y axis has been duplicated, where the left Y
# Axis (default) is shown again to the right of the figure,
# making it easier to read the values for each bar in the
# bar chart.
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12, hjust=0.5))
# Adjust the Y axis text (bold, size 12, centered).
# Notice how the placement of this one-off theme() adjustment comes
# after theme_Mac(), so that it is not overridden.
# Fig. 3.1
# Mean - Parametric
COVID19MeanPercentageDeathByUrbanRural <-
LCOVID19DeathsNOMissingData.tbl %>%
dplyr::group_by(urban_rural_description) %>%
dplyr::summarize(
Mean = base::mean(Percentage))
COVID19MeanPercentageDeathByUrbanRural
# A tibble: 6 x 2
urban_rural_description Mean
<chr> <dbl>
1 Large central metro 0.188
2 Large fringe metro 0.318
3 Medium metro 0.299
4 Micropolitan 0.566
5 Noncore 0.672
6 Small metro 0.381
par(ask=TRUE)
ggplot2::ggplot(LCOVID19DeathsNOMissingData.tbl,
aes(
x=stats::reorder(urban_rural_description, Percentage),
y = Percentage)) +
geom_bar(stat = "summary", fun = "mean", fill="red",
color="black") +
labs(
title="Mean Percentage COVID-19 Deaths by Urban-Rural
Continuum as of May 13, 2022",
subtitle="Data: Centers for Disease Control and Prevention",
x = "\nUrban - Rural Breakout Groups",
y = "Mean Percentage COVID-19\nDeaths\n") +
annotate("text", x=0.75, y=1.00, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="Percentage Density Urbanization Level") +
annotate("text", x=0.75, y=0.95, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="----------------------------------------") +
annotate("text", x=0.75, y=0.90, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.188 2,037 Large central metro") +
annotate("text", x=0.75, y=0.85, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.299 148 Medium metro") +
annotate("text", x=0.75, y=0.80, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.318 233 Large fringe metro") +
annotate("text", x=0.75, y=0.75, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.381 93 Small metro") +
annotate("text", x=0.75, y=0.70, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.566 55 Micropolitan") +
annotate("text", x=0.75, y=0.65, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.672 18 Noncore") +
annotate("text", x=0.75, y=0.60, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="----------------------------------------") +
annotate("text", x=0.75, y=0.50, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="Density: Median Persons per Square Mile") +
scale_y_continuous(breaks=scales::pretty_breaks(n=5),
limits=c(0, 1), sec.axis = dup_axis()) +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12, hjust=0.5))
# Adjust the Y axis text (bold, size 12, centered).
# Notice how the placement of this one-off theme() adjustment comes
# after theme_Mac(), so that it is not overridden.
# Fig. 3.2
Challenge: Much more can be done in terms of analyses and graphical presentations, not only with the long format adjusted table that has no missing data but also with the original dataset, WCOVID19Deaths.tbl (Fig. 3.2). Use WCOVID19Deaths.tbl and look at the data again, from many perspectives (for example, analyses by RaceEthnic breakouts, and not only the analyses and figures demonstrated in this lesson).
Fig. 3.2
Comment: If race-ethnicity were examined in any meaningful detail, consider
the issue of race-ethnicity health disparities across the population, and equally con-
sider disparities in percentage representation of residence by the many race-ethnicity
groups across the urban-rural continuum. It is far beyond the purpose of this lesson
to address why there may be disparities in outcomes (e.g., percentage COVID-19
deaths by different breakout groups, now possibly including race-ethnicity), but as
previously addressed the curious data scientist will at least examine the data, look
for trends and associations, and make these outcomes known to others.
Comment: For those who wish to go further with this line of inquiry, use data
from the Centers for Disease Control and Prevention and the National Center for
Health Statistics to examine Excess Deaths Associated with COVID-19, available
at https://www.cdc.gov/nchs/nvss/vsrr/covid19/excess_deaths.htm and the many .csv files that appear near the end of that Web page. Some comparisons can be made
along the Urban-Rural Continuum, comparing states that are known to be heavily
urban v states that are known to be quite rural, possibly New Jersey v Alaska,
Rhode Island v Wyoming, Massachusetts v Montana, etc. These comparisons
should provide another (e.g., proxy) view of the severity of COVID-19 along the
Urban-Rural Continuum. Federal data on this issue can also be obtained at
Provisional COVID-19 Deaths: Distribution of Deaths by Race and Hispanic
Origin (https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-Distribution-
of-Deaths/pj7m-y5uh), using the file Provisional_COVID-19_Deaths__Distribution_of_Deaths_by_Race_and_Hispanic_Origin.csv and made available
at the Publisher’s Web site associated with this text renamed as
ProvisionalCOVID19DeathsDistributionofDeathsbyRaceandHispanicOrigin.csv.
Much needs to be done to put the data into tidy format, but of course, by the end
of engagement with this text, this should be an achievable skill.
External Data and/or Data Resources Used in This Lesson
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, and .xlsx file formats.
ProvisionalCOVID19DeathsbyCountyRaceHispanicOrigin.csv.
ProvisionalCOVID19DeathsDistributionofDeathsbyRaceandHispanicOrigin.csv.
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
Note: There is only one addendum in this lesson.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 4
Data Science and R, Base R, and the tidyverse Ecosystem
seem tedious, verbose, only marginally necessary, etc. – until there is an attempt to
use the data or the syntax later, individually or by others, when attempting replica-
tion and reuse. What will be remembered about a project six days after completion,
six weeks after completion, six months after completion, etc.? How will others, whether co-workers, staff members from other departments, or future hires, use the data and documentation if there is an attempt to reuse the original workflow for future purposes, a possibility so common that it should be assumed standard practice? Few projects are static, or to use current jargon, few projects are one and done.
When considering how R and the tidyverse ecosystem fit into data science and how data science is used to address problems in the biological sciences, review the following sequence of terms and workflow:
• Import the Data – The data need to be brought into an R session, but how this is
accomplished depends on the nature of the data, how the data have been saved,
where the data are currently housed prior to import, etc. The tidyverse ecosystem
supports many tools used to import data and it would be hard to imagine a dataset
that could not eventually be imported into R, but of course some datasets may be
more challenging than others.
• Tidy the Data – It would only be the rare exception for a dataset to be imported
into R that requires no further effort to put it into good form, with desired format
depending on many conditions. As the name suggests, the tidyverse ecosystem
has many excellent tools used to put data into good form – a tidy format.
• Transform the Data into Desired Format – Data are often originally organized in
ways that defy understanding, as those with minimal background in data science
prepare, use, and, in many cases, abuse spreadsheets. A data entry decision that
may seem intuitive and easy to read may not at all be acceptable for computer-
based analyses. The most common data transformation is likely moving data
from wide format to long format. This transformation, from wide to long, is
demonstrated multiple times in this text.
• Program or Prepare tidy Syntax – When using the tidyverse ecosystem, it is best
to review how others prepare syntax and there are multiple sources where sam-
ples are provided. Give attention to the packages and functions used to achieve
aims, note how tibbles are named, give attention to consistency in capitalization
and the use of underscores in object variable names, follow along with norms on
spacing, etc.
• Visualize Outcomes Frequently – The tidyverse ecosystem supports the notion of
Beautiful Graphics. Even if the tidyverse ecosystem were not used to prepare
final form figures suitable for professional publication, which it certainly sup-
ports, rough draft figures should nearly always be prepared for quality assurance
purposes, to be sure that outcomes are in range, or highlight outcomes if out-
comes are not in expected range.
• Model the Process to Support Consistency in Future Replication – Using a com-
bination of different internal documentation activities (e.g., good programming
practices such as meaningful variable names and not cryptic abbreviations and
detail needed, especially if that somewhat long character string were used as a
label in a barchart or some other figure? There are tidyverse ecosystem tools (and
Base R tools, too) that can remove the characters Florida if it were decided that this action is needed, leaving Broward County only – a label that will more easily fit on either the X axis or Y axis of a figure (a brief sketch follows this list).
• Imagine, however, that the dataset included the character string Washington
County, Florida, as a specific datapoint. Tools from either Base R or the tidy-
verse ecosystem could be used to change the datapoint to Washington County
only, but is this a wise choice? There is only one county in the United States
named Broward County, but the name Washington County is found in 30 or so
states in the United States, in honor of General George Washington, a founding
father and the first president of the United States. The tidyverse ecosystem can
accommodate the change, if desired, but the appropriateness of any change is a
human decision.
• Imagine another dataset, in original format as a spreadsheet, where comments
and titles show in the first few rows, prior to the inclusion of data. These com-
ments and titles are quite appropriate for when the spreadsheet is shared with
others, but these extraneous rows are not acceptable for inclusion in any rectan-
gular dataset. There are tidyverse ecosystem tools that can remove the unneeded
rows, and with appropriate knowledge of the tidyverse ecosystem, it is not neces-
sary to manually edit it outside of R.
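As a brief sketch of the kind of tool involved in the county-label bullet above, and using made-up labels rather than any dataset from this text, stringr::str_remove() from the tidyverse ecosystem could drop the state name when a human reviewer decides that the change is appropriate:
library(stringr)

# Hypothetical labels only; remove the trailing ", Florida" so that the
# shorter county name fits more easily on a figure axis.
countyLabels <- c("Broward County, Florida", "Miami-Dade County, Florida")
stringr::str_remove(countyLabels, ", Florida")
# [1] "Broward County"    "Miami-Dade County"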
The terms dataset, data frame, and dataframe are regularly used in data science,
often interchangeably. Regardless of how these terms are expressed, it is important
to become acquainted with the term tibble and its special use in the tidyverse eco-
system. A tibble is a rectangular dataframe, but with somewhat different character-
istics than what is the norm when using Base R and the traditional concept of a
dataframe. More about tibbles will be evident by reviewing the back matter in the
many lessons in this text, but for now, it is important to know that a dataframe can
be converted into a tibble by using the tibble::as_tibble() function. An especially useful feature of the tibble::as_tibble() function is the .name_repair argument, which can be especially convenient in the early stages of data organization, prior to use of the data.
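A minimal sketch of the .name_repair argument in use, with a small made-up data frame that has duplicate and empty column names:
# Sketch only: .name_repair resolves problematic column names at the
# moment a data frame is converted into a tibble.
df <- data.frame(1:2, 3:4, 5:6)
base::names(df) <- c("x", "x", "")      # Duplicate and empty names.
tibble::as_tibble(df, .name_repair = "unique")
# Column names are repaired to x...1, x...2, and ...3.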
It is also noteworthy to consider how the tidyverse ecosystem is used to put wide
data into long format and conversely the restructuring of long data into wide format.
These actions (most often, wide to long) are demonstrated throughout this text and
many examples can be found by searching the Internet, so only a brief recap is
needed here, but the convenient restructuring of data is central to why many data
scientists first embraced the tidyverse ecosystem.1 A simple example of how wide
data may appear after it is put into long format follows:
1 Look at the history of the reshape package (and later the reshape2 package and even later the tidyr package) and note the time of first availability for each package. The reshape package and the ggplot2 package were among the earliest packages associated with the tidyverse ecosystem, and their use was quickly accepted by data scientists within the R community, leading to the development of other tidyverse packages and associated tools.
# WIDE Data
# LONG Data
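The small wide and long displays from the printed text are not reproduced here. As a hedged, self-contained sketch of the same idea, using made-up values, tidyr::pivot_longer() restructures a wide tibble so that each row holds one observation of one variable:
library(tidyverse)

# Sketch only: three subjects measured at two time points, first in
# wide format and then restructured into long format.
wide.tbl <- tibble::tibble(
  id    = c("S01", "S02", "S03"),
  time1 = c(10, 12, 9),
  time2 = c(14, 15, 11))

long.tbl <- tidyr::pivot_longer(
  wide.tbl,
  cols      = c(time1, time2),
  names_to  = "Time",
  values_to = "Score")
long.tbl   # One row for each id-by-Time combination.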
Base R
S was developed in the mid-1970s and R grew out of S. It might help to back up
briefly to describe how statistical analyses were deployed in the early days of com-
puting, prior to the use of more contemporary statistical analysis software, such as
S and later R.
2 Review the term ASCII Art to see how simple graphics were attempted in these early days, often with questionable results (including the subject matter) for selected early days figures.
There are nearly countless resources for that aim, and they should be reviewed if
needed. Instead, it is assumed that those who review this introductory lesson on data
science for biostatistics have some degree of acquaintance with R and the many
tools available from when R is first downloaded, prior to the use of any external
packages, packages from among the 20,000 or so packages that go beyond what is
possible when R is first downloaded – especially tools associated with the tidyverse
ecosystem. Step back and become fairly well experienced with Base R if needed.
S was quite popular with a dedicated user base. In the mid-1990s when R, as an
outgrowth of S, was made available to the public, many former S users quickly
downloaded early versions of R and began testing its features and then used R for
production.3 Because R is open-source software, it was possible to customize selected features and develop packages and functions, as needed. For many data scientists, R
became a first-choice software selection for data management and organization,
statistical analysis, and the preparation of quality graphics.
Recognizing that software is never static, by the mid-2000s, there were major
improvements to R when the first iterations of packages and functions associated
with what would later be called the tidyverse ecosystem were released. The reshape
3 As early as January 6, 2009, a prominent national newspaper in the United States featured R and how it was emerging as a first choice for many engaged in data management and statistical analyses.
The tidyverse Ecosystem as an Idea and the Need for Tidy Data
With such wide acceptance of the tidyverse ecosystem, it should not be surprising
that there are nearly countless resources, including journal articles, blog postings,
tutorials, short courses, and videos on its many features and uses. There is no need
to repeat what is so readily available and can be gained by searching the Internet.
Yet, it would be remiss if the driving feature of the tidyverse ecosystem, tidy data,
were not detailed. To avoid any future frustration by those who are just beginning to
use R and the tidyverse ecosystem, it must be mentioned that it is the rare dataset
that needs no accommodation prior to first use. Messy data are abundant, especially
when datasets are prepared by those who are not data scientists and those who do
not use the tidyverse ecosystem. Data scientists need to learn how to modify messy
data so that the three key features of a tidy dataset are observed. There are countless
ways messy data are put into datasets, often in seemingly unmanageable configura-
tions. In contrast, tidy data are housed as datapoints in rectangular datasets, where
rectangular datasets are composed of rows and columns, with data found at the
intersection of rows and columns:
• Rows represent observations (e.g., subjects) and each row represents a singular
observation (e.g., singular subject).
• Columns represent variables and each column represents a singular variable.
• Cells represent values and each cell represents a singular value.
A brief example or two of a messy dataset and how the data could appear in tidy
format follows. Hand-editing with software, either an editor or a spreadsheet, can be
used for a small dataset, but imagine the challenges if there were hundreds or thou-
sands of rows for the messy dataset presented below. R and more specifically R’s
tidyverse ecosystem has tools that could be used to accommodate the desire to put
the data into tidy format, a challenge perhaps but certainly requiring less time on
task than hand-editing.4
# Messy Dataset
#
# First and Last Gender/
# Name Sex Weight Height Body Mass Index
# ==========================================================
# John Smith Male 165 Lb 5 foot 6 in 26.6
# Sally Rojas Female 126 Lbs 5'2" 23.0
# William Danick M 183 lb 6 f 24.8
# Walter Maurer m 199 lbs 6 feet 4" 24.2
# Juanita Adams f 143 5-4 24.5
# ----------------------------------------------------------
# Tidy Dataset
#
# id    nameLast nameFirst gender weightLbs heightInches  bmi
# ID001 Smith    John      m      165       66           26.6
# ID002 Rojas    Sally     f      126       62           23.0
# ID003 Danick   William   m      183       72           24.8
# ID004 Maurer   Walter    m      199       76           24.2
# ID005 Adams    Juanita   f      143       64           24.5
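As one small, hedged illustration of the kind of cleaning involved, and using only the made-up gender/sex values from the messy display above, the entries could be standardized with tidyverse ecosystem tools; the weight and height columns would need similar, more involved treatment:
library(dplyr)
library(stringr)

# Sketch only: collapse the inconsistent gender/sex entries to "m"/"f".
sexRaw <- c("Male", "Female", "M", "m", "f")
dplyr::case_when(
  stringr::str_to_lower(stringr::str_sub(sexRaw, 1, 1)) == "m" ~ "m",
  stringr::str_to_lower(stringr::str_sub(sexRaw, 1, 1)) == "f" ~ "f",
  TRUE ~ NA_character_)
# [1] "m" "f" "m" "m" "f"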
4 This is not the only place where it could be mentioned, but it should be remembered that names are messy, first names and last names. This issue is especially evident when datasets are merged. Imagine a subject with Sean as a first name. It is not at all inconceivable, right or wrong, that this name may be entered in other datasets as Jean, Jehan, John, Shane, Shaun, Shawn, Shayne, or Shon. Which spelling is correct, which is incorrect, and is it possible to accommodate differences? If there are language accent marks over some characters, how will the software address these special characters, if at all? Then consider the clever idea of using Social Security Numbers, National Identity Numbers, or some other de facto recognized means of identifying individuals by using unique government-issued codes. At first this idea may sound grand, but is it legal? If legal, is it prudent? These national identification numbers may allow for consistency in identifying individuals and they are especially convenient when datasets are merged, but their use also puts individuals at risk for identity fraud, such as unauthorized credit card purchases, title theft, and dodgy mortgage applications. Always deploy best practices against the possibility of a data breach and use caution when deploying identification codes.
Great writers read the writings of other writers. Great musicians listen to the
music of other musicians. Great athletes observe the on-field efforts of other ath-
letes. It should not be surprising, then, that data scientists who want to improve their
mastery of R and the tidyverse ecosystem should:
• Study the datasets used by other data scientists, to see how the data were
organized.
• Study the syntax used by other data scientists, to see how the syntax was prepared.
• Study the statistical output generated by other data scientists, to see how the
statistics were generated.
• Study the graphical output generated by other data scientists, to see how the
graphics were generated.
• Study the conclusions presented by other data scientists, to see how the conclu-
sions were presented.
R takes time to learn and the tidyverse ecosystem adds an even greater demand
on what is needed to master the tools inherent to data science in biostatistics. Yet,
the rewards are many (e.g., employment, salary, professional development, peer
acceptance, recognition, etc.) for those data scientists who master R and the tidy-
verse ecosystem.
It may be a bit confusing at first to understand exactly which R packages are included
among the many packages associated with the tidyverse ecosystem. As an example,
the following packages are typically viewed as those packages associated with the
core tidyverse, listed in alphabetical order:5
• dplyr – Use the dplyr package to manipulate data so that a final dataset is in
desired format in terms of row and column placement.
• forcats – Use the forcats package to accommodate factor-type variables, typi-
cally variables representing different categories.
• ggplot2 – Use the ggplot2 package, which is based on the Grammar of Graphics,
to create both draft figures and publishable Beautiful Graphics.
• lubridate – Use the lubridate package to facilitate the complexities of working
with dates, times, etc.
• purrr – Use the purrr package to deploy functional programming practices, typi-
cally avoiding cumbersome syntax for loops.
5 With the March 2023 release of tidyverse 2.0.0, it can now be stated that the lubridate package has been added as part of the package of packages associated with core tidyverse. Older references to core tidyverse may not refer to the lubridate package.
• readr – Use the readr package to import offline rectangular datasets into an active
R session, such as data in the following formats: comma separated values (.csv),
tab separated values (.tsv), and the now less common but once ubiquitous fixed
width format (.fwf).
• stringr – Use the stringr package to ease the complexities of manipulating
character-type variables (e.g., strings).
• tibble – Use the tibble package to rethink what a data frame represents, now
organizing data in a tibble which improves quality by forcing resolution of data
issues early in the workflow.
• tidyr – Use the tidyr package to have data in a tidy format, typically placing data
in wide format into long format where observations represent singular rows,
variables represent singular columns, and datapoints are represented in singular
cells – the intersection of a row and a column. Of course, the tidyr package can
also be used to place data in long format into wide format, but that practice is
perhaps less common.
Although reference to the core tidyverse is found throughout this text, it is worth
repeating that all packages in the core tidyverse can be downloaded and put into use
in one simple operation:
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
There are many R packages that are directly associated with the tidyverse, going
beyond packages in core tidyverse. Use the tidyverse::tidyverse_packages() func-
tion to list these packages and be sure to account for packages in core tidyverse by
using the include_self = TRUE argument.
tidyverse::tidyverse_packages(include_self = TRUE)
# List all packages in the tidyverse
Some of these auxiliary tidyverse packages have regular use in data science and
efforts should be made to learn about their potential. Other auxiliary tidyverse pack-
ages are used less frequently in that they have very specialized applications that may
not be the norm for most users. Consider a few examples from among the R pack-
ages considered auxiliary packages outside of the core tidyverse ecosystem:6
• broom – Use the broom package to provide a tidy summary of information
gained from models, supporting methods such as anova, glm, lm, etc.
• cli – Use the cli package to build user-defined command line interfaces (e.g., CLIs).
• readxl – Use the readxl package to bring rectangular (e.g., tabular) data housed
in a spreadsheet into an active R session, including the complexity of differenti-
ating between different sheets within a spreadsheet.
• rvest – Use the rvest package to scrape (e.g., harvest, obtain) data from Web
pages, an increasingly important data acquisition process among data scientists.7
The tidyverse ecosystem is an expanding array of packages that work and play
well with other packages in the tidyverse, the heuristics of the tidyverse, and the
actual deployment of the tidyverse. As an example, consider the R packages ggmo-
saic, ggthemes, and scales. These packages are used throughout the many lessons in
this text, given their association with the ggplot2 package. Many would say that
these associated packages should be viewed as part of the tidyverse ecosystem, but
others may hold back on this broad statement.
Consider the many possibilities addressed in the remaining parts of this lesson:
Complex Data Set on Birth Rates Easily Accommodated by Using the tidyverse
Ecosystem, Addendum 1; Complex Data Set on Gross Domestic Product (GDP) and
Comparison to Birth Rates by Using the tidyverse Ecosystem, Addendum 2; and
Individual Initiative of Planned Workflow, Analyses, and Graphical Presentations,
Addendum 3. A wide variety of functions associated with the tidyverse, core
6 Go to the URL https://cran.r-project.org/web/packages/available_packages_by_name.html#available-packages-A to see the list of Available CRAN Packages By Name. At the time this chapter was prepared, there were nearly 20,000 packages included in this listing. Of those packages, more than 70 packages had the text string tidy somewhere in the package name (not the package description). This listing does not include the many tidyverse ecosystem packages where the text string tidy is not included in the package name, such as broom and pillar. The tidyverse ecosystem is ubiquitous in R.
7 Data scraping is an increasingly important activity, but far beyond appropriate demonstration in an introductory text.
tidyverse ecosystem, and auxiliary packages outside of the core tidyverse ecosys-
tem, along with functions from other packages, are used to make sense of data from
well-respected external sources.
A self-curated (and therefore admittedly biased) listing of Essential tidyverse
Ecosystem Functions That Every Data Scientist Should Master is the subject of
Addendum 4. Regard the parenthetical reminder that this listing is biased. Others
will have different views on which tidyverse ecosystem functions are essential,
perhaps disagreeing with some choices on the list in Addendum 4 while equally offering other ideas of their own.
As used throughout these analyses, the Housekeeping section represents personal desires
in terms of how R is used, how settings are organized, where packages are kept, default
and other location(s) where files are maintained, etc. As always, use the Housekeeping
syntax as a guide, but of course make changes as skills and preferences allow.
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
#install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
#install.packages("readxl", dependencies=TRUE)
library(readxl)
#install.packages("magrittr", dependencies=TRUE)
library(magrittr)
#install.packages("janitor", dependencies=TRUE)
library(janitor)
#install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
#install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
#install.packages("ggtext", dependencies=TRUE)
library(ggtext)
#install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
#install.packages("scales", dependencies=TRUE)
library(scales)
#install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
#install.packages("cowplot", dependencies=TRUE)
library(cowplot)
#install.packages("writexl", dependencies=TRUE)
library(writexl)
###############################################################
With all of the Housekeeping work completed and expected packages put into
use, it is now time to address the data associated with the addenda in this lesson.
Most approaches to use of the data will be based on functions associated with the
tidyverse ecosystem, but functions from Base R will be used when they represent
the most appropriate approach toward problem-solving.
As an advance organizer to the addenda in this lesson, consider the structure of
Addendum 1 and how it is centered on the use of birth rate data made available by
the World Bank (https://www.worldbank.org/en/home) and previously referenced
in the second lesson in this text. What is special about this dataset, and why it was selected for this chapter on how the tidyverse ecosystem is such a valuable tool for data scientists, comes down to the following challenges:
• The first row of BirthRatePer1000People.csv is not a header row, in its current
form. If the dataset were uploaded without accommodation, it would be quite
difficult to make any meaningful use of the data.
• Instead, the first four rows of BirthRatePer1000People.csv are not even part of
the dataset and are either descriptive information about the dataset (e.g., row 1
and row 3) or blank rows used only for spacing (e.g., row 2 and row 4).
• The header row of BirthRatePer1000People.csv is found on row 5, but there are
known problems with the column names in terms of later use:
–– Some column names consist of two words with a blank space between the
two. This type of column naming scheme may cause problems if not
accommodated.
–– Some column names consist of numbers, only. Again, this type of column
naming scheme may cause problems if not accommodated.
A similar dataset, also from the World Bank but now focused on Gross Domestic
Product (GDP), is used in Addendum 2. This addendum is prepared so that the syn-
tax used in Addendum 1 should also be used in Addendum 2, such that a tidy
approach that works in one application should work in another application, often
with only minimal change (mostly object variable names and selected scales, given
disparate data).8
An additional dataset from the World Health Organization, somewhat like the
datasets gained from the World Bank, is used for Addendum 3. A tidy approach is
suggested for use of this dataset. However, Addendum 3 is not prescriptive and only
a minimum of syntax is provided. The dataset is identified, a different function
(associated with the tidyverse ecosystem) is demonstrated to import the data and
concurrently adjust the dataset, and a few ideas are offered on direction, but only a
few ideas relating to sampling, workflow, analyses, and graphics are provided.9
Whether for Addendum 1, Addendum 2, or Addendum 3, the tidyverse ecosys-
tem has functions that can easily accommodate these and many other challenges.
The tidyverse ecosystem provides an excellent platform to efficiently (1) import
data into an active R session, (2) adjust object variable names, (3) restructure, select,
and filter data based on specific criteria, (4) support statistical analyses and graphi-
cal presentations, (5) expand scope and increase value by merging external data into
a currently active dataset.
When considering the idea of Must Know tidyverse ecosystem functions
(Addendum 4), once again recall that any such list is self-curated. Add to this list as
experience with the tidyverse ecosystem grows.
Given this background, upload the data for Addendum 1, knowing in advance
after prior offline review that the file will not be in final form after upload:
8
To promote initiative, a series of challenges are offered on how the data associated with Addendum
2 can be used to address interesting issues. Yet, unlike what is seen in Addendum 1, in Addendum
2 some syntax and output are purposely excluded from this chapter, all part of the desire to gradu-
ally encourage skill development. Use the syntax and process first seen in Addendum 1 as a guide
for completion of the challenges identified in Addendum 2. This should be an achievable outcome,
even at the introductory level of this text.
9
A series of suggestions are offered on how the data associated with Addendum 3 can be used to
address interesting issues. Yet, unlike what is seen in Addendum 1 and less so in Addendum 2, in
Addendum 3, most syntax and output are purposely excluded. Use the syntax and process first seen
in Addendum 1 and partially repeated in Addendum 2 as a guide for completion of Addendum 3,
again as a purposeful effort to encourage skill development.
utils::head(WBirthRateByEntity1960Onward.tbl)
# Confirm the nature of the uploaded dataset,
# prior to any accommodations.
install.packages("janitor", dependencies=TRUE)
library(janitor)
WBirthRateByEntity1960Onward.tbl <-
janitor::row_to_names(WBirthRateByEntity1960Onward.tbl,
row_number=5, # Make this row the header row.
remove_rows_above=TRUE) # Remove rows above the header row.
# The janitor::row_to_names() function is used to remove the
# rows above row 5 (e.g., rows 1 to 4), AND row 5 is now set
# as the header row, the row that holds the dataset column
# names.
#
# The task of removing a set number of rows from a dataset
# and declaring a later row in the dataset as the header row
# could have been accomplished using Base R. Yet, success
# with this task is far easier to obtain and the process is
# more succinctly communicated to others by deploying a few
# lines of syntax associated with the tidyverse ecosystem,
# using functions from the janitor package in this example.
utils::head(WBirthRateByEntity1960Onward.tbl)
# Confirm the nature of the adjusted dataset,
# after accommodations.
WBirthRateByEntity1960Onward.tbl <-
janitor::clean_names(WBirthRateByEntity1960Onward.tbl)
# Use the janitor::clean_names() function to put all
# object variable names into a tidy format.
utils::str(WBirthRateByEntity1960Onward.tbl)
# Confirm the nature of the adjusted dataset, after
# accommodations. By doing this, note the way all
# variable names are in lower case and an underscore
# is used as a replacement for spaces that may show
# in variable names.
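The assignment that creates WBirthRateByEntity1960OnwardNAMES is not repeated at this point in the text; presumably it simply captures the cleaned column names, along the lines of this hedged one-line sketch:
WBirthRateByEntity1960OnwardNAMES <-
  base::names(WBirthRateByEntity1960Onward.tbl)
# Assumed syntax; stores the cleaned object variable names
# so that they can be printed and reviewed.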
WBirthRateByEntity1960OnwardNAMES
# Print names after using the
# janitor::clean_names() function.
10
Although the janitor::clean_names() function is used occasionally in this text, there are many data scientists who use this function for nearly all datasets in use, as a standard practice for naming object variables.
The three things that are most obvious after using the janitor::clean_names()
function are that:
• Object variable names that started with a letter have all been put into lower
case, including the leading letter of each, if applicable.
• Any spaces between multiple words in an object variable name have been
replaced with an underscore character.
• Object variable names that started with a number have all been renamed so that
each begins with a lowercase x character, because R object variable names
cannot begin with a digit.
Now that the dataset WBirthRateByEntity1960Onward.tbl seems to be in good
form, use standard R functions to confirm one last time that all is correct and ready
for later use.
base::getwd()
base::ls()
base::attach(WBirthRateByEntity1960Onward.tbl)
utils::str(WBirthRateByEntity1960Onward.tbl)
dplyr::glimpse(WBirthRateByEntity1960Onward.tbl)
base::summary(WBirthRateByEntity1960Onward.tbl)
With the changes made to the original dataset, which was gained from an exter-
nal Web-based resource, it should now be easier to work with the data, where the
goal now is to make sense of birth rates (per 1,000 people) over time (e.g., 1960 onward)
for six purposely selected entities:
• ARG, Argentina
• BGD, Bangladesh
• BOL, Bolivia
• CHN, China
• MEX, Mexico
• USA, United States
Because of this consistency in the way columns are named and organized, it is
judged best to use the base::colnames() function to rename the columns into more
presentable names. Functions from Base R should not be overlooked when their use
is appropriate; in many cases they offer the simplest and easiest approach to
achieving aims.
base::colnames(WBirthRateByEntity1960Onward.tbl) <- c(
"country_name", # Column 01 Geographic Entity Name
"country_code", # Column 02 Geographic Entity Code
"indicator_name", # Column 03 Indicator Name
"indicator_code", # Column 04 Indicator Code
"1960", # Column 05 1960
"1961", # Column 06 1961
"1962", # Column 07 1962
"1963", # Column 08 1963
"1964", # Column 09 1964
"1965", # Column 10 1965
"1966", # Column 11 1966
"1967", # Column 12 1967
"1968", # Column 13 1968
"1969", # Column 14 1969
"1970", # Column 15 1970
"1971", # Column 16 1971
"1972", # Column 17 1972
"1973", # Column 18 1973
"1974", # Column 19 1974
"1975", # Column 20 1975
"1976", # Column 21 1976
"1977", # Column 22 1977
"1978", # Column 23 1978
"1979", # Column 24 1979
"1980", # Column 25 1980
"1981", # Column 26 1982
"1982", # Column 27 1983
"1983", # Column 28 1983
"1984", # Column 29 1984
"1985", # Column 30 1985
"1986", # Column 31 1986
"1987", # Column 32 1987
"1988", # Column 33 1988
"1989", # Column 34 1989
"1990", # Column 35 1990
"1991", # Column 36 1991
"1992", # Column 37 1992
"1993", # Column 38 1993
"1994", # Column 39 1994
"1995", # Column 40 1995
"1996", # Column 41 1996
"1997", # Column 42 1997
"1998", # Column 43 1998
"1999", # Column 44 1999
"2000", # Column 45 2000
"2001", # Column 46 2001
"2002", # Column 47 2002
"2003", # Column 48 2003
"2004", # Column 49 2004
"2005", # Column 50 2005
"2006", # Column 51 2006
base::getwd()
base::ls()
base::attach(WBirthRateByEntity1960Onward.tbl)
utils::str(WBirthRateByEntity1960Onward.tbl)
dplyr::glimpse(WBirthRateByEntity1960Onward.tbl)
base::summary(WBirthRateByEntity1960Onward.tbl)
utils::str(WBirthRateByEntity1960Onward.tbl)
# As a prudent QA check, confirm that the data
# are organized and named, as desired, for this
# point in the workflow.
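The hand-keyed vector above continues, one quoted name per column, through the final year in the file; keying each year by hand is tedious and error prone (note how easily a year comment can drift out of step with its column). As a compact alternative, sketched here under the assumption that the year columns run from 1960 through 2021, the same name vector can be built programmatically; match the end year to the actual number of columns in the downloaded file:
base::colnames(WBirthRateByEntity1960Onward.tbl) <- c(
  "country_name", "country_code",
  "indicator_name", "indicator_code",
  base::as.character(1960:2021))
  # One name per year column; adjust 2021 to the final
  # year actually present in the file.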
With the data in acceptable format, it is now best to make a few additional adjust-
ments to the dataset, which in this case entails using the dplyr::select() function to
remove three columns (e.g., country_name, indicator_name, and indicator_code)
since these columns are not needed for later analyses and graphical presentations.11
This action may not be needed and the columns could be retained, but their removal
will help in that the dataset will be more manageable in the evolving desire to use a
tidy approach to use of the data. The original dataset is retained in case there is ever a
desire to go back and use the data in the columns currently planned for removal.
11
Throughout this text, the convention PackageName::FunctionName() has been used regularly to
identify functions by their full name, taking into account function namespace, a critical issue if a
selected package and associated functions are to work and play well with other packages and other
functions. The select() function is an excellent example of why this approach,
PackageName::FunctionName(), is used. For this lesson, the select() function is associated with
the dplyr package. However, install and load the MASS package and type help(select) at the R
prompt to see that the select() function is also associated with the MASS package. If the dplyr
package and the MASS package were both loaded in the same R session, there may be some confu-
sion as to the use of the select() function, alone. However, using a PackageName::FunctionName()
approach to naming functions takes care of any confusion. If there were ever any question as to
which R packages are available in the current session, merely type sessionInfo() and look at the
output in the other attached packages: section.
WBirthRateByEntity1960OnwardAdjusted.tbl <-
dplyr::select(WBirthRateByEntity1960Onward.tbl,
-c(1, 3, 4)) # Remove columns by index, not name.
# Use the seemingly ubiquitous dplyr::select() function
# to remove columns by index: Column 1, Column 3, and
# Column 4. Column 2 and all other columns are retained.
# Entity Column 01
# Entity Code Column 02
# Indicator Name Column 03
# Indicator Code Column 04
# Entity codes (e.g., Column 02) will be used, purposely,
# due to their consistent use of a three-character alpha
# coding scheme.
#
# Be sure to notice that in this example, the minus sign
# indicates that the listed columns should be removed.
base::getwd()
base::ls()
base::attach(WBirthRateByEntity1960OnwardAdjusted.tbl)
utils::str(WBirthRateByEntity1960OnwardAdjusted.tbl)
dplyr::glimpse(WBirthRateByEntity1960OnwardAdjusted.tbl)
base::summary(WBirthRateByEntity1960OnwardAdjusted.tbl)
LBirthRateByEntity1960OnwardAdjusted.tbl <-
tidyr::pivot_longer(
WBirthRateByEntity1960OnwardAdjusted.tbl,
-c(country_code),
names_to = "year", values_to = "birth_rate") %>%
dplyr::filter(country_code %in% c(
"ARG", # Argentina
"BGD", # Bangladesh
"BOL", # Bolivia
"CHN", # China
"MEX", # Mexico
"USA")) # United States
# There are two parts to this syntax, with %>% used to: (1)
# send the product of the tidyr::pivot_longer() function for
# (2) use by the dplyr::filter() function. That is to say:
# The tidyr::pivot_longer() function was used to put the
# data in WBirthRateByEntity1960OnwardAdjusted.tbl into
# long format, and LBirthRateByEntity1960OnwardAdjusted.tbl
# is the result of that action.
# The dplyr::filter() function was used to filter out all
# long format data other than data for the six selected
# country_code entities.
# Because of these actions, the working dataset is in long
# format, a tidy approach to data science, and the dataset is
# restricted to data of interest, excluding all other data --
# another tidy approach to data science.
#
# When using tidyr::pivot_longer(), -c(country_code) means
# that the tidyr::pivot_longer() function should pivot
# everything except country_code. In this context, the minus
# sign means except.
base::getwd()
base::ls()
base::attach(LBirthRateByEntity1960OnwardAdjusted.tbl)
utils::str(LBirthRateByEntity1960OnwardAdjusted.tbl)
dplyr::glimpse(LBirthRateByEntity1960OnwardAdjusted.tbl)
base::summary(LBirthRateByEntity1960OnwardAdjusted.tbl)
# Even if seemingly redundant, it is still a prudent QA
# check to confirm that the data are organized and
# named, as desired, for this point in the workflow now
# that the data of interest are in long format.
birth_rate
Min. :10.5
1st Qu.:17.8
Median :22.9
Mean :26.0
3rd Qu.:34.2
Max. :48.4
Give attention to the minimum value (10.5) and maximum value (48.4) for the
object variable LBirthRateByEntity1960OnwardAdjusted.tbl$birth_rate, which
will guide the Y axis scale, set from 0 (less than the minimum) to 50 (greater than
the maximum).
Use the writexl::write_xlsx() function to immediately save the data externally, for
safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
LBirthRateByEntity1960OnwardAdjusted.tbl,
path =
"F:\\R_Ceres\\LBirthRateByEntity1960OnwardAdjusted.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("LBirthRateByEntity1960OnwardAdjusted.xlsx")
base::file.info("LBirthRateByEntity1960OnwardAdjusted.xlsx")
base::list.files(pattern =".xlsx")
Before any graphics are produced, it is important to know that the ggplot2::ggplot()
function supports many different themes. A ggplot2 theme is created by using syn-
tax to produce a figure with a desired appearance. In an attempt to make the figures
bold and vibrant, but also in an attempt to reduce redundant keying, look at theme_
Mac(), a self-created theme that will be used multiple times in concert with the
ggplot2::ggplot() function.
When a theme is used multiple times, as a preferred format, it reduces redundant
keying while adding value to a project. The use of additional themes, other than the
standard themes available to all, is valuable to enhance axis and tick mark presenta-
tion, bold labels and titles, centering, font size and color, etc. However, the use of
these additional ad hoc changes to standard themes often requires many lines of
syntax. By preparing syntax that is keyed one time and then saved with a unique
name, it is possible to easily deploy a user-created theme such as theme_Mac()
multiple times and in multiple projects. This approach of multiple use and reuse of
existing syntax promotes an efficient and tidy way of meeting project requirements.
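The definition of theme_Mac() itself is not repeated at this point in the text. A minimal sketch of what such a user-created theme might look like follows; every specific setting shown here is an illustrative assumption rather than the author's actual theme:
theme_Mac <- function(base_size = 14) {
  ggplot2::theme_classic(base_size = base_size) +
    ggplot2::theme(
      plot.title = ggplot2::element_text(face = "bold", hjust = 0.5),
      axis.title = ggplot2::element_text(face = "bold"),
      axis.text  = ggplot2::element_text(color = "black"))
}
# Once keyed and saved, the user-created theme is added to a
# ggplot2::ggplot() pipeline exactly like a built-in theme, by
# appending + theme_Mac().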
With all actions in place, it is now possible to produce individual figures that
plot changes in birth rate by year (1960–2020) for the six distinct selected geo-
graphic entities that are either adjacent to each other or at least in general proximity:
• Asia: Bangladesh and China
• North America: Mexico and United States
• South America: Bolivia and Argentina
Later, a grid approach will be used to present geographically nearby comparisons
for the selected entities from Asia, North America, and South America. These com-
parisons are among the many ways that a data scientist adds value to a project:
• The organization of relevant, current, reliable, and valid data into a table is a
good start for consumption by designated readers of a project, but it is not
enough. Most people, other than those who work nearly day to day with data,
simply do not have the background and patience to go over rows and rows of data
and make sense of the same. There are only a few professionals who can truly
make sense of thousands of datapoints, and now with expansion of data science
in education, government, industry, transportation and logistics, etc., there is a
need for accommodation of millions of datapoints in the pursuit of true compre-
hension of outcomes, those outcomes that were purposely investigated as well as
outcomes that are the result of unplanned discovery (e.g., serendipity).
• Accordingly, data are not only put into a readable rectangular dataset of some
sort, but the data then need to be organized in a way that supports statistical
analyses, collapsed tables, and figures, with the example in this addendum
focused on use of the ggplot2::ggplot() function to create Beautiful Graphics.
With this desire in mind, it may be best to briefly recap the data workflow in this
addendum and to summarize what has happened and what final actions are still
needed to generate the desired figures:
• The data were originally found at a Web page readily available to the public, and
by interacting with a Graphical User Interface (GUI) process, the data were
downloaded to a personal computer.12
• The data were then imported into an active R session.
• Once in the active R session, a set of R functions was used to put the data into an
acceptable wide format, with the dataset adjusted so that the eventual first row,
serving as the header row, had acceptable column names.
• Actions were then taken to remove columns that were not needed for cur-
rent plans.
• The data were put into long format by using tidyverse ecosystem functions.
• With the data in long format, tidyverse ecosystem tools will be used to select data
based on selected codes and the output of that action will then be used to produce
descriptive figures that display outcomes.
• To add value that increases understanding of outcomes, side by side figures will
be prepared, comparing birth rates over time for geographic entities that either
share a common border or are at least near each other.
utils::str(LBirthRateByEntity1960OnwardAdjusted.tbl)
# As a last reminder prior to generating the desired
# figures, use the utils::str() function to confirm the
# nature of each object variable found in the dataset
# LBirthRateByEntity1960OnwardAdjusted.tbl.
Note: Prior to preparation of these figures, it would be best to look at the opera-
tional definition of Birth rate, crude (per 1000 people), at the World Bank Metadata
Glossary, https://databank.worldbank.org/metadataglossary/gender-statistics/
12
An Application Programming Interface (API) paradigm for data retrieval is often desired, but not
always feasible. Data scientists need to interact with multiple processes to obtain desired data.
[Fig. 1, Fig. 2, and Fig. 3]
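The ggplot2 syntax that builds each single-entity figure is likewise not repeated here. A hedged sketch of what one of them, BGDBirthggplot.fig, might look like, assuming the long-format tibble prepared above and the 0-to-50 Y axis scale discussed earlier (labels and geoms are illustrative assumptions):
BGDBirthggplot.fig <-
  LBirthRateByEntity1960OnwardAdjusted.tbl %>%
  dplyr::filter(country_code == "BGD") %>%
  ggplot2::ggplot(ggplot2::aes(
    x = base::as.numeric(year),
    y = base::as.numeric(birth_rate))) +
  ggplot2::geom_line(color = "red") +
  ggplot2::geom_point(color = "red") +
  ggplot2::scale_y_continuous(limits = c(0, 50)) +
  ggplot2::labs(
    title = "Bangladesh (BGD): Birth Rate per 1,000 People, 1960 Onward",
    x = "Year", y = "Birth Rate per 1,000 People") +
  theme_Mac()
# The other five figures follow the same pattern, changing only
# the country_code filter and the title.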
par(ask=TRUE); BGDBirthggplot.fig
###########################################################
###########################################################
par(ask=TRUE); CHNBirthggplot.fig
###########################################################
par(ask=TRUE)
gridExtra::grid.arrange(
BGDBirthggplot.fig,
CHNBirthggplot.fig, ncol=2)
# C04Fig17BGDandCHNBirthRate.png
par(ask=TRUE); MEXBirthggplot.fig
###########################################################
###########################################################
par(ask=TRUE); USABirthggplot.fig
###########################################################
par(ask=TRUE)
gridExtra::grid.arrange(
MEXBirthggplot.fig,
USABirthggplot.fig, ncol=2)
# C04Fig18MEXandUSABirthRate.png
par(ask=TRUE); ARGBirthggplot.fig
###########################################################
###########################################################
par(ask=TRUE); BOLBirthggplot.fig
###########################################################
par(ask=TRUE)
gridExtra::grid.arrange(
ARGBirthggplot.fig,
BOLBirthggplot.fig, ncol=2)
# C04Fig19ARGandBOLBirthRate.png
Go to the World Bank Web page(s) relating to Data and view the page https://data.
worldbank.org/indicator/NY.GDP.MKTP.CD to obtain a dataset that follows the
same layout and structure as the Birth Rate dataset used in Addendum 1, but now
providing data related to Gross Domestic Product (GDP).13 Download the .csv ver-
sion of the file and use a descriptive name, such as GDPCurrentUSDollar.csv.
Bring the GDP-focused dataset into the same R session started in Addendum 1,
ideally using the readr::read_csv() function and process previously demonstrated.
As a suggested activity, but with far less guidance in this addendum than what
was provided in Addendum 1, follow along with the same process used in Addendum
1 and eventually generate the file LGDPByEntity1960OnwardAdjusted.tbl.14 Use
this file to generate six additional figures; if the same ordering and naming scheme
are used as previously demonstrated in Addendum 1, the figures should be presented
as: BGDGDPggplot.fig, CHNGDPggplot.fig, MEXGDPggplot.fig, USAGDPggplot.fig,
ARGGDPggplot.fig, and BOLGDPggplot.fig. These figures, now in Addendum 2,
show GDP change over time and are likely quite interesting, much as the Birth Rate
changes over time in Addendum 1 were also interesting.15
Experienced data scientists know that context is always needed if data are to have
value. Use the figures prepared in this R session (Addendum 1, where syntax for the
figures was provided and Addendum 2, where syntax for the figures is not provided)
to offer that context, where in the same general figure, there is a comparison of Birth
Rate over time to Gross Domestic Product (GDP) over time, by placing two separate
figures in the same summative figure, using the gridExtra::grid.arrange() function:
13
There is a known association between wealth and health. Wealth does not guarantee health, but
wealth eases access to medicines, availability of services, etc. It is entirely appropriate to consider
proxies for wealth, such as Gross Domestic Product (GDP), when examining health-related issues
in biostatistics.
14
Addendum 2 purposely does not include the full set of syntax needed for suggested actions, such
as syntax used to generate the dataset LGDPByEntity1960OnwardAdjusted.tbl and many later
actions based on this dataset. Use the syntax in Addendum 1 as a guide for actions in Addendum
2, but of course make improvements with an expanding skill set and interest.
15
If a simple copy and paste were used to change the syntax in Addendum 1 for use in Addendum
2, do not forget that the main object variable of interest in Addendum 1 was listed as birth_rate,
whereas the main object variable of interest in Addendum 2 should be listed as either gdp or
GDP. Equally, the scale needs to be adjusted for the Y axis in Addendum 2, with a maximum value
for GDP at 21400000000000, whereas the maximum value for the birth_rate Y axis value in
Addendum 1 was 48.4.
par(ask=TRUE)
gridExtra::grid.arrange(BGDBirthggplot.fig, BGDGDPggplot.fig,
ncol=2)
# Prepare BGDGDPggplot.fig using the same process as was used
# to prepare BGDBirthggplot.fig and follow along for the five
# other GDP figures. Recall that there will be differences
# in object names and scales, but most syntax and certainly
# the process can be used for both figures.
par(ask=TRUE)
gridExtra::grid.arrange(CHNBirthggplot.fig, CHNGDPggplot.fig,
ncol=2)
par(ask=TRUE)
gridExtra::grid.arrange(MEXBirthggplot.fig, MEXGDPggplot.fig,
ncol=2)
par(ask=TRUE)
gridExtra::grid.arrange(USABirthggplot.fig, USAGDPggplot.fig,
ncol=2)
par(ask=TRUE)
gridExtra::grid.arrange(ARGBirthggplot.fig, ARGGDPggplot.fig,
ncol=2)
par(ask=TRUE)
gridExtra::grid.arrange(BOLBirthggplot.fig, BOLGDPggplot.fig,
ncol=2)
Add even more value to this inquiry by preparing a scatterplot with GDP for each
year (1960 onward) on the X axis and Birth Rate for each year (1960 onward) on the Y axis.
To achieve this aim, it will be necessary to merge the two files currently available
in this R session: LBirthRateByEntity1960OnwardAdjusted.tbl and
LGDPByEntity1960OnwardAdjusted.tbl.
There are many creative ways that R can be used to merge (e.g., join, bind) two
files, with the selection based in part on the structure of the two original files and the
desired structure of the final file – the eventual merged file.
• Do not overlook the use of Base R and how the base::merge() function has value
and is often an expedient choice when merging two files.
• However, the tidyverse ecosystem and specifically functions available through
use of the dplyr package should also be considered, again depending on the
structure of the two original files and the desired structure of the final file.
Common functions available with the dplyr package for joining files include:
–– dplyr::left_join()
–– dplyr::right_join()
–– dplyr::inner_join()
–– dplyr::full_join()
–– dplyr::anti_join()
–– dplyr::semi_join()
–– dplyr::nest_join()
–– dplyr::bind_rows()
–– dplyr::bind_cols()
Although this text is focused on the tidyverse ecosystem, this lesson also includes
reference to Base R functions. Consider the base::merge() function. If the naming
process used in Addendum 1 were followed in Addendum 2, then the two files
(focus on birth_rate in Addendum 1 and focus on gdp in Addendum 2) could easily
be merged by using the following syntax:
LBirthRateGDP1960Onward.df <-
base::merge(
LBirthRateByEntity1960OnwardAdjusted.tbl,
LGDPByEntity1960OnwardAdjusted.tbl,
by = c("country_code", "year"))
# Merge the two files, by country_code and
# by year.
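For comparison only, a hedged sketch of the tidyverse route mentioned above: dplyr::inner_join() performs the same merge, keeping only the country_code and year combinations present in both tables and returning a tibble directly:
LBirthRateGDP1960Onward.tbl <-
  dplyr::inner_join(
    LBirthRateByEntity1960OnwardAdjusted.tbl,
    LGDPByEntity1960OnwardAdjusted.tbl,
    by = c("country_code", "year"))
# Because the result is already a tibble, the later
# tibble::as_tibble() conversion step is not needed.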
base::getwd()
base::ls()
base::attach(LBirthRateGDP1960Onward.df)
utils::str(LBirthRateGDP1960Onward.df)
dplyr::glimpse(LBirthRateGDP1960Onward.df)
base::summary(LBirthRateGDP1960Onward.df)
LBirthRateGDP1960Onward.tbl <-
tibble::as_tibble(LBirthRateGDP1960Onward.df)
# Put the dataframe into tibble format, to be
# consistent with use of tidyverse ecosystem
# tools.
base::getwd()
base::ls()
base::attach(LBirthRateGDP1960Onward.tbl)
utils::str(LBirthRateGDP1960Onward.tbl)
dplyr::glimpse(LBirthRateGDP1960Onward.tbl)
base::summary(LBirthRateGDP1960Onward.tbl)
There are two cells where NA shows for Gross Domestic Product. Use the
stats::na.omit() function to remove those two rows, since they may cause possible
problems in future analyses – especially correlation analyses:16
16
It may be beyond the purpose of this text, but it would still be useful to review the literature to
see different views on when missing data should be removed from a dataset and equally, when
missing data should be retained but accommodated.
LBirthRateGDP1960OnwardNoNAs.tbl <-
stats::na.omit(LBirthRateGDP1960Onward.tbl)
# The stats package is included among the
# many Base R packages, available when R is
# first downloaded. Other functions, those
# in the tidyverse ecosystem as well as
# those in Base R could have been used, but
# the stats::na.omit() function was perhaps
# the simplest selection for this example.
base::getwd()
base::ls()
base::attach(LBirthRateGDP1960OnwardNoNAs.tbl)
utils::str(LBirthRateGDP1960OnwardNoNAs.tbl)
dplyr::glimpse(LBirthRateGDP1960OnwardNoNAs.tbl)
base::summary(LBirthRateGDP1960OnwardNoNAs.tbl)
writexl::write_xlsx(
LBirthRateGDP1960OnwardNoNAs.tbl,
path = "F:\\R_Ceres\\LBirthRateGDP1960OnwardNoNAs.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("LBirthRateGDP1960OnwardNoNAs.xlsx")
base::file.info("LBirthRateGDP1960OnwardNoNAs.xlsx")
base::list.files(pattern =".xlsx")
LBirthRateGDP1960OnwardNoNAs.tbl %>%
dplyr::group_by(country_code) %>%
dplyr::summarise(
NCountry = base::length(country_code),
MeanGDP = base::mean(GDP_rate),
MeanBirthRate = base::mean(birth_rate),
Pearson_r = stats::cor(x=GDP_rate, y=birth_rate,
method = "pearson")
)
# Assume that GDP and Birth Rate are interval and in turn
# use the parametric Pearson's test to estimate the
# association between the two variables, GDP_rate and
# birth_rate.
# A tibble: 6 x 5
country_code NCountry MeanGDP MeanBirthRate Pearson_r
<chr> <int> <dbl> <dbl> <dbl>
1 ARG 59 2.07e11 21.2 -0.866
2 BGD 61 6.15e10 34.5 -0.803
3 BOL 61 1.00e10 34.1 -0.878
4 CHN 61 2.58e12 20.1 -0.558
5 MEX 61 4.73e11 30.3 -0.914
6 USA 61 7.68e12 15.5 -0.775
17
The expression 7.68e12, with experience, is far more convenient than writing the exceptionally
long number 7680000000000.
[Fig. 4]
par(ask=TRUE)
GDP_rateBYbirth_rateBYEntity.fig <-
ggplot2::ggplot(data=LBirthRateGDP1960OnwardNoNAs.tbl,
aes(x=base::log10(GDP_rate), y=base::log10(birth_rate))) +
geom_point(color="red", size=2) +
theme_Mac() +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
theme(
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
# Use the base::log10() function to accommodate and provide
# clarity to the widely disparate scales between the values
# in GDP_rate (maximum = USD 21,400,000,000,000) and the
# values in birth_rate (maximum = 48.4). Recall that the
# purpose of this set of figures is to provide a visual cue
# for general trends. A full set of arguments and options
# supported with use of the ggplot2::ggplot() function is not
# used at this time. If desired, go beyond this initial
# graphic to offer even more detail.
#
# This figure confirms the general outcome gained from the
# use of Pearson's r, in that Birth Rate (Y axis) decreases
# as GDP (X axis) increases.
# C04Fig20GDPRateBirthRate.png
par(ask=TRUE); GDP_rateBYbirth_rateBYEntity.fig
With this overall trend established, that Birth Rate (Y axis) decreases as GDP (X axis)
increases, it is now useful to use a faceting technique to generate an association-
type figure for each geographic entity (e.g., a breakout). Faceting is easily
supported in R, often by many different approaches. From among the many options,
observe how functions found in the ggpubr package easily accommodate this task
(Fig. 5).
[Fig. 5]
GDP_rateBYbirth_rateBYEntity_Facet.fig <-
ggpubr::facet(GDP_rateBYbirth_rateBYEntity.fig,
facet.by = "country_code")
# C04Fig21GDPRateBirthRateFacet.png
par(ask=TRUE); GDP_rateBYbirth_rateBYEntity_Facet.fig
Challenge: Much more could be (and should be) done with the two datasets that
were eventually merged into one dataset. The output in Addendum 2 provides a few
ideas on how R, both Base R and the tidyverse ecosystem, supports a thorough
understanding of the main variables of interest in this lesson, specifically the notion
that over time, parents have fewer children as their income gradually increases.
Reminder: The full set of syntax associated with Addendum 2 has been excluded
from this text. Use the syntax in Addendum 1 as a guide but make improvements in
presentation and content. Go beyond what was presented in Addendum 1 and begin
to model the behaviors of an engaged data scientist. By this point in the text, these
should all be acceptable goals.
It is known that alcohol consumption has an impact on a host of social and health
conditions, resulting in debt, debilitation, destruction, despair, and death.18 It is far
beyond the focus of this text to offer any meaningful discussion on alcohol
consumption other than to say that the individual decision to consume alcohol is
influenced by many factors that either promote or inhibit drink. The dataset for this
addendum is specific to the percentage of alcohol-attributable deaths, whether due
to disease or injury.
18
The term Deaths of Despair is now part of the lexicon found in the popular press as well as the professional literature and is, collectively, a major contributor to awareness of the growing phenomenon of excess deaths, observed well before the COVID-19 pandemic, but now accelerated far beyond expectations.
The World Health Organization (WHO) is the resource for the dataset used in
this addendum. Look at the many resources available at The Global Health
Observatory – Explore a World of Health Data (https://www.who.int/data/gho/data)
and use this starting point to find the many datasets associated with alcohol con-
sumption and breakout datasets on liver cirrhosis, road traffic deaths and injuries,
cancer, etc.
The dataset for this addendum, Alcohol-attributable fractions, all-cause
deaths (%), is found at https://www.who.int/data/gho/data/indicators/indicator-
details/GHO/alcohol-attributable-fractions-all-cause-deaths-(-) and has since
been downloaded to a local drive on the personal computer used for this lesson.
An API suitable for R specific to this resource was not found. As such, follow the
directions associated with EXPORT DATA in CSV format, Right click here &
Save link to download the dataset, saved externally as
AlcoholAttributableFractionsAllCauseDeaths.csv.
In an effort to demonstrate a variety of tidyverse ecosystem resources, the
vroom::vroom() function will be used to import the data placed on the F:\ drive into
the active R session and to concurrently adjust the dataset during this process, where
desired columns are retained and those columns that are not needed are excluded.
To make the decision on what columns (e.g., object variables) to include and what
columns to exclude:
• Examine the Metadata descriptions, a standard practice that should always be
followed whenever metadata are provided.
• Then, offline, examine the downloaded .csv file for what seems to be intuitive
descriptors used for column names. Make decisions on which variables are of
interest and which variables are not needed for the current planned workflow,
analyses, and graphics. Once a decision has been made on desired contents and
workflow, use the vroom::vroom() function to import the data, using the col_select
argument to retain desired columns so that unneeded columns are never imported at all.
install.packages("vroom", dependencies=TRUE)
library(vroom)
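The vroom::vroom() call itself is not repeated at this point in the text. A minimal sketch, assuming the downloaded file sits in the working directory; the column names listed in col_select are assumptions and must be checked against the actual .csv file and the WHO metadata before use:
WAlcoholDeaths.tbl <- vroom::vroom(
  "AlcoholAttributableFractionsAllCauseDeaths.csv",
  col_select = c(SpatialDimValueCode, Location, Period,
                 Dim1, Value))
# col_select keeps only the desired columns at import time, so
# unneeded columns are never brought into the active R session.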
base::getwd()
base::ls()
base::attach(WAlcoholDeaths.tbl)
utils::str(WAlcoholDeaths.tbl)
dplyr::glimpse(WAlcoholDeaths.tbl)
base::summary(WAlcoholDeaths.tbl)
19
Recall that use of the leading W is a self-guided practice, indicating that the data are in wide
format, not long format.
20
The inclusion of geographic entities in this listing is aspirational. The dataset provided by the
World Health Organization may not include all entities. It is also possible that slightly different
wording is used for identification of these entities. Use personal judgment on which entities to
include.
21
Whatever the result (likely a Pearson’s r estimate), never forget that association (e.g., correlation)
does not suggest causation.
Review this brief self-curated list of Must Know tidyverse ecosystem functions. Are
some of these functions useful, but perhaps they fail to rise to the level of Must
Know? Are there other Must Know functions that should have been included in this
brief listing? Feel free to contact the author of this text, [email protected], for sug-
gestions on additions and deletions to this admittedly self-curated list of Must Know
tidyverse packages, functions, and arguments, as appropriate for an introductory text.
dplyr::across()
dplyr::add_count()
dplyr::anti_join()
dplyr::arrange()
dplyr::case_when()
dplyr::count()
dplyr::filter()
dplyr::group_by() %>% dplyr::summarize()
dplyr::inner_join()
dplyr::left_join()
dplyr::mutate()
dplyr::nest_by()
dplyr::pull()
dplyr::relocate()
dplyr::rename()
dplyr::select()
dplyr::semi_join()
dplyr::slice()
dplyr::transmute()
forcats::as_factor()
forcats::fct_collapse()
forcats::fct_count()
forcats::fct_drop()
forcats::fct_expand()
forcats::fct_explicit_na()
forcats::fct_lump()
forcats::fct_lump_lowfreq()
forcats::fct_lump_min()
forcats::fct_lump_n()
forcats::fct_lump_prop()
forcats::fct_recode()
forcats::fct_reorder()
base::levels()
ggplot2::ggplot()
aes()
geom_density()
geom_dotplot()
geom_histogram()
coord_flip()
scale_x_continuous()
scale_x_log10()
lubridate::dmy()
lubridate::make_date()
lubridate::mdy()
lubridate::ymd()
purrr::discard()
purrr::keep()
purrr::map()
purrr::map_dfr()
purrr::modify()
purrr::when()
readr::read_csv()
readr::read_csv2()
readr::read_delim()
readr::read_tsv()
stringr::str_count()
stringr::str_extract()
stringr::str_pad()
stringr::str_replace()
stringr::str_replace_all()
stringr::str_split()
stringr::str_starts()
stringr::str_subset()
stringr::str_to_lower()
stringr::str_to_sentence()
stringr::str_to_title()
stringr::str_to_upper()
stringr::str_which()
tibble::add_column()
tibble::add_row()
tibble::as_tibble()
tibble::column_to_rownames()
tibble::has_rownames()
tibble::remove_rownames()
tibble::rownames_to_column()
tibble::tibble()
tidyr::crossing()
tidyr::drop_na()
tidyr::expand()
tidyr::extract()
tidyr::nest()
tidyr::pivot_longer()
tidyr::pivot_wider()
tidyr::replace_na()
tidyr::separate()
tidyr::unite()
janitor::clean_names()
janitor::find_header()
janitor::remove_empty()
janitor::row_to_names()
janitor::tabyl()
janitor::top_levels()
magrittr::extract()
magrittr %>% pipe operator
tidylog::tally()
tidyselect::contains()
tidyselect::ends_with()
tidyselect::matches()
tidyselect::num_range()
tidyselect::starts_with()
vroom::vroom()
External Data and/or Data Resources Used in This Lesson
The publisher's Web site associated with this text includes the following files,
presented in .csv, .txt, and .xlsx file formats.
AlcoholAttributableFractionsAllCauseDeaths.csv
BirthRatePer1000People.csv
GDPCurrentUSDollar.csv
LBirthRateByEntity1960OnwardAdjusted.xlsx
LBirthRateGDP1960OnwardNoNAs.xlsx
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 5
Statistical Analyses and Graphical Presentations in Biostatistics Using Base R
and the tidyverse Ecosystem
Background
There are many ways by which data scientists identify problems requiring statistical
analysis. Some data scientists are company or university employees and receive
work assignments from deans, department chairs, managers, supervisors, etc. Some
data scientists are consultants and discuss project possibilities with private individu-
als who wish to obtain their services. Regardless of the method of employment or
workplace, data science calls for a formal process on how problem-solving is
approached, ideally a collaborative and transparent approach from beginning to end.
To address this aim, early-on efforts include:
• Description of the Data: However the problem identification and later statistical
analysis process begins, the key issue here is that a reasonable attempt must be
made to learn as much as possible about the data: possible data resources, threats
to data availability, processes needed for data acquisition, expectations of what
will be needed to put data into good form, expectations for outcomes derived
from the data, etc. Once these issues are addressed there will be an initial idea of
resources needed for engagement, eventual completion of the analyses, and for-
mat for summary presentation of outcomes. Early on, before decisions are made
that would be difficult and costly to change, the expected workflow should be
drafted, and data should be carefully examined to avoid unexpected later prob-
lems. Initial examination of the data and communication with all parties involved
in the process can avoid untold problems later, problems that may make correc-
tions difficult, time-consuming, and expensive.
• Null Hypothesis: In statistics, it is still common to structure lines of inquiry in
the form of a Null Hypothesis, such as the simple statement: There is no statisti-
cally significant difference (p <= 0.05) between members of Group X and mem-
bers of Group Y in terms of characteristic Z. Deconstruct this statement and see
how it provides a great deal of context about the eventual line of inquiry.1
Import Data
Data can take many forms: (1) numeric (e.g., whole numbers and decimals such as
3 or 3.0), (2) alpha (e.g., strings such as R or S), (3) alpha-numeric (e.g., 123 Main
Street, Apartment 456, Cliff Island, Maine, 04019), (4) logical (e.g., 0 or 1, False or
True, Go or Stop, etc.), and (5) dates (e.g., year, month, day, hour, second). Data can
be housed at many locations: (1) offline on some type of storage device, portable
drive or resident PC, (2) Web page or cloud location, public or private. Data can be
organized in many ways: (1) rectangular dataset in a row-by-column format, (2)
disorganized spreadsheet with multiple sheets, where data are available, but not in
any type of tidy row-by-column format, (3) text embedded in a Web page or some
other type of document. Base R has many functions that facilitate the importation of
data into an active session and functions associated with the tidyverse ecosystem are
especially useful for this task, especially when data are not immediately found in a
rectangular (e.g., tidy) row by column format. Many different functions for data
import are demonstrated throughout this text.
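As one hedged illustration of this range of import functions (the file names and URL below are hypothetical placeholders, not files used in this text):
# A comma-delimited file read from the local working directory.
LocalFile.tbl <- readr::read_csv("ExampleLocalFile.csv")
# A worksheet read from an Excel workbook.
LocalWorkbook.tbl <- readxl::read_excel("ExampleWorkbook.xlsx", sheet = 1)
# A delimited file read directly from a public Web location.
WebFile.tbl <- readr::read_csv("https://example.org/ExampleWebFile.csv")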
1
It cannot be ignored that there are those who question the efficacy of structuring analyses in the
form of a formal acceptance or rejection of a Null Hypothesis, suggesting that reporting p-values
(only) is a better way of making judgment on outcomes. It is far beyond the purpose of this text to
offer comments on this issue, pro or con. It would be remiss to avoid attention to the use of the Null
Hypothesis, but attention should also be given to those who support the publication of p-values only.
Exploratory Graphics
Perhaps one of the most beneficial trends in statistical analysis software develop-
ment over the last few iterations is the way by which high quality graphics can be
generated, often in an easy and intuitive manner. Previously, graphics (e.g., figures)
were often crude, unattractive, difficult to interpret, and of marginal value. Now,
graphics can take many forms, from simple figures that help guide the exploratory
data analysis process to finalized figures that rival any professional artistic
composition:
• Graphics Using Base R: Base R can be used for graphical output. Functions used
to produce bar charts and histograms, density plots, scatterplots, etc. are all part
of the large collection of graphical possibilities supported by Base R. With some
degree of expertise, figures generated by use of Base R can be quite detailed and
attractive.
• Graphics Using the tidyverse Ecosystem: The tidyverse ecosystem and specifi-
cally the ggplot2 package is now, for many data scientists who use R, the default
tool to produce attractive graphics, initial graphics for guidance and final graph-
ics for presentation. As used throughout this text, the ggplot2::ggplot() function
has been used along the lines of the Grammar of Graphics paradigm and the desire for
Beautiful Graphics, where figures are not only accurate and easy to read but also
exceptionally attractive. Depending on the source and period reviewed, the ggplot2
package regularly ranks among the most frequently downloaded packages available from CRAN.2
Once the data have been imported and subjected to initial graphics, to gain a sense
of general trends between and among the data, more specificity is obtained when the
data are subjected to functions used to numerically describe the data and offer a
sense of central tendency. Factor-type data, data that deal with categorical data (e.g.,
HeightGeneral: short, medium, or tall), are often initially examined using frequency
distributions. Numeric-type data, either integer or real (e.g., HeightCentimeters:
174.05, 182.88, 185.76), are often examined by average (e.g., mean, median, mode),
dispersion (e.g., variance and standard deviation), and range (e.g., minimum and
maximum). There are many functions, using Base R and the tidyverse ecosystem,
that put descriptive statistics and measures of central tendency into an attractive and
well-organized table-type output, and these functions are demonstrated throughout
this text.
Exploratory Analyses
The expression follow the science, wherever it leads became a regular phrase
because of COVID-19 press releases in the mass media and from government
offices. This expression follows the notion of exploratory analyses. Analyses that
address the original Null Hypothesis are always essential, but other, possibly more
interesting, and useful analyses may come from a curious broad investigation of the
data, analyses that follow the data but then go beyond initial plans. On this subject,
it is also useful to explore the data from multiple perspectives, using parametric
analyses and nonparametric analyses. As shown in Addendum 1 and then Addendum
2 of this lesson, data are not always as pristine as purported, and analyses from
parametric and nonparametric viewpoints that yield consistent outcomes provide
additional quality assurance that derived outcomes have value.
2
Download and activate the cranlogs package. Then, deploy the R syntax cranlogs::cran_downloads(packages="ggplot2", from="2018-01-01", to="2022-12-31"), or other selected dates, to see a running summary of the number of times the ggplot2 package has been downloaded from CRAN. The syntax cranlogs::cran_top_downloads(when="last-month", count=100) provides
comparative data on R package downloads and it is the rare month, if at all, that the ggplot2 pack-
age is not among the top 5 or 10 downloaded packages.
Presentation of Outcomes
Once data have been obtained, subjected to graphical and statistical processes, and
all results are finalized, fully engaged data scientists need to consider their audience
when preparing the presentation of outcomes. Who are members of the audience:
(1) peer knowledge experts and others with technical expertise who will easily
understand processes, selected tests, jargon, etc. (2) deans, managers, supervisors,
colleagues, and others who have general knowledge of the scientific process and
statistics, but are by no means subject matter experts or (3) members of the public
who need to understand the general implications of outcomes and subsequent action
plans, but who will disdain recommendations derived from otherwise solid out-
comes if the presentation is not geared for a general, but respectful, level of
understanding?
This text is focused on R, the tidyverse ecosystem, and the use of APIs to obtain
data. Throughout this text, data are obtained by various processes. The data are then
often subjected to statistical analyses, typically using inferential tests. Even so, this
text is not solely another statistics text, using R as the platform for statistical
analyses.
Given the purpose and scope of this text, the front matter in this lesson provides
a brief listing of leading statistical tests. This listing of leading statistical tests is
very important, but the focus of this lesson is found in the four addenda and by
extension throughout the entire text. Give careful attention to the addenda and how
the tidyverse ecosystem and APIs are used in support of statistical analyses.
Nonparametric Tests
Data that are viewed as nonparametric are far too often underappreciated when
inferential analyses are planned. Quite the opposite, nonparametric inferential anal-
yses are an essential part of data science and subsequently statistical analysis,
whether R or some other platform is used. Nonparametric inferential tests are often
called ranking tests and distribution-free tests, given that these tests are frequently
based on the use of ordinal data and similar data that are ordered but may not meet
the desired assumption of normal distribution (e.g., bell-shaped curve). As a brief
summary, nonparametric data: (1) do not have the precision of continuous data that
fall along an interval scale, (2) often have extreme deviation from normal
distribution, and (3) may have noticeable differences in the number of subjects in
breakout groups.
R supports many functions, from Base R packages and external packages, that can
be used for nonparametric inferential statistics, including tests such as3:
• The Chi-square Test is typically used to test for differences in proportions
between two or more groups.
• The Mann-Whitney U Test is used to determine if two independent groups are
from the same population.
• The Wilcoxon Matched-Pairs Signed Ranks Test is used to examine differences
between paired subjects, however the term pair is constructed.
• The Kruskal-Wallis H-Test Oneway ANOVA is used to examine if there are sta-
tistically significant differences when comparing multiple groups (often three or
more), with different factors for each group.
• The Friedman Twoway ANOVA uses repeated measures across three or more
matched groups to determine if similarities or differences exist between groups.
• Spearman’s rho is used to estimate the association between two separate mea-
sures.4 Yet, always consider the expression that association does not suggest
causation.
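As a quick, hedged reference, the Base R functions most often used for these nonparametric tests are sketched below with small made-up vectors (all values are illustrative only):
x <- c(12, 15, 14, 11, 19, 18)  # Illustrative measures, group 1.
y <- c(21, 17, 22, 16, 20, 23)  # Illustrative measures, group 2.
stats::chisq.test(base::matrix(c(30, 10, 20, 25), nrow = 2))  # Chi-square.
stats::wilcox.test(x, y)                        # Mann-Whitney U.
stats::wilcox.test(x, y, paired = TRUE)         # Wilcoxon matched-pairs.
stats::kruskal.test(base::list(x, y, x + 2))    # Kruskal-Wallis H.
stats::friedman.test(base::cbind(x, y, x + 2))  # Friedman twoway ANOVA.
stats::cor.test(x, y, method = "spearman")      # Spearman's rho.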
Parametric Tests
Unlike nonparametric data which are often based on ranks, parametric data are, ide-
ally, continuous and have exact parameters. Parametric data may be expressed as
whole numbers (e.g., integers) or as real numbers (e.g., decimals), with parametric analyses including tests
such as5:
• The Student’s t-Test for Independent Samples is used to determine if there are
statistically significant differences in measurement gained from two sepa-
rate groups.
• The Student’s t-Test for Matched Pairs is used to determine if there are statisti-
cally significant differences in measurement gained from matched subjects, sep-
arated into two separate groups, however the term pair is constructed.
• The Oneway ANOVA Test is used to determine if there are statistically signifi-
cant differences in measurement gained from three separate groups.
3
Use p <= 0.05 or some other selected p-value for both nonparametric analyses and parametric
analyses.
4
When considering the use of a test of association, such as Spearman’s rho, Pearson’s r, or
Kendall’s tau, it is common to use the term estimate when referencing the derived statistic.
5
It cannot be ignored that there are many times when data are purported to be parametric, but the
assumption of a continuous nature to parametric data is violated and the data should instead be
viewed as being nonparametric. The literature should be reviewed to examine this issue in
more detail.
• The Twoway ANOVA Test is used to determine if there are statistically signifi-
cant differences (and possible interactions) between groups when variables have
two or more categories.
• Pearson’s r is used to estimate the association between two separate measures.
Yet, always consider the expression association does not suggest causation.
• Linear Regression is used to provide a predicted estimate of a variable of interest
when there is one or more predictor variables.6
As a reminder, the use of functions associated with many of these inferential tests
is found in the addenda associated with this lesson and throughout this text. But of
course, there are many resources beyond this text on the use of R functions for these
inferential tests, with the functions found in both Base R and other packages housed
at CRAN.
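Again purely as a hedged quick reference, Base R counterparts for the parametric tests listed above, using small made-up values (illustrative only):
g2 <- base::factor(c("X", "X", "X", "Y", "Y", "Y"))   # Two groups.
g3 <- base::factor(c("X", "X", "Y", "Y", "Z", "Z"))   # Three groups.
z  <- c(4.1, 5.0, 4.7, 6.2, 5.9, 6.5)                 # Illustrative measures.
w  <- c(1.2, 2.3, 3.1, 4.4, 5.0, 6.2)                 # Illustrative predictor.
stats::t.test(z ~ g2)                         # t-Test, independent samples.
stats::t.test(z[1:3], z[4:6], paired = TRUE)  # t-Test, matched pairs.
base::summary(stats::aov(z ~ g3))             # Oneway ANOVA.
stats::cor.test(z, w, method = "pearson")     # Pearson's r.
base::summary(stats::lm(z ~ w))               # Linear regression, one predictor.
f1 <- base::gl(2, 6, labels = c("A", "B"))            # Factor 1, two levels.
f2 <- base::gl(3, 2, 12, labels = c("L", "M", "H"))   # Factor 2, three levels.
v  <- stats::rnorm(12, mean = 50, sd = 5)             # Illustrative response.
base::summary(stats::aov(v ~ f1 * f2))                # Twoway ANOVA with interaction.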
6
When using a test of association, correlation, or regression, recall that the overarching assumption
is based on the expression past behavior is the best predictor of future behavior.
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
When looking at the syntax, below, notice how the # character was not placed in
front of each install.packages() function, which was the case in a few prior lessons.
An updated version of R was downloaded and put into use for this lesson.
Accordingly, it was decided as a good programming practice (gpp) to preemptively
download all required packages, too, to have the most up-to-date versions of R’s
many package-based functions. It is always a good idea, unless there are unusual
reasons otherwise, to use the most current version and supporting packages.
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
install.packages("readxl", dependencies=TRUE)
library(readxl)
install.packages("magrittr", dependencies=TRUE)
library(magrittr)
install.packages("janitor", dependencies=TRUE)
library(janitor)
install.packages("rlang", dependencies=TRUE)
library(rlang)
install.packages("htmltools", dependencies=TRUE)
library(htmltools)
install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
install.packages("ggtext", dependencies=TRUE)
library(ggtext)
install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
install.packages("scales", dependencies=TRUE)
library(scales)
install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
install.packages("cowplot", dependencies=TRUE)
library(cowplot)
install.packages("writexl", dependencies=TRUE)
library(writexl)
###############################################################
# Use the tidycensus package and/or the acs package and the
# U.S. Census Bureau key to obtain state and/or county specific
# data from selected American Community Survey (ACS) and/or
# Decennial Census tables.
#
# Use the following URL to access the form needed to obtain an
# API key from the U.S. Census Bureau:
# https://api.census.gov/data/key_signup.html
#
# Complete details on the API process with U.S. Census Bureau
# are available at https://www.census.gov/content/dam/Census/
# library/publications/2020/acs/acs_api_handbook_2020_ch02.pdf.
install.packages("tidycensus", dependencies=TRUE)
library(tidycensus)
# CAUTION: The tidycensus package may take longer to
# download than other packages. Be patient.
tidycensus::census_api_key(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
#"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
# CAUTION: The code for this key is reserved
# for [email protected] only. Use your own key!
install.packages("acs", dependencies=TRUE)
library(acs)
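If the key should persist across R sessions, the tidycensus package also supports writing it to the user-level .Renviron file by way of the install argument. A minimal sketch (substitute your own 40-digit key, and restart R afterward):
# Optional: store the key in .Renviron so it does not need to be
# supplied in every session. Use your own key, not the placeholder.
tidycensus::census_api_key(
  "Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
  install=TRUE)
Sys.getenv("CENSUS_API_KEY") # Confirm that the key is available.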
###############################################################
# Mapping #
###############################################################
install.packages("maptools", dependencies=TRUE)
library(maptools)
install.packages("rcpp", dependencies=TRUE)
library(rcpp)
install.packages("rgdal", dependencies=TRUE)
library(rgdal)
install.packages("rgeos", dependencies=TRUE)
library(rgeos)
install.packages("sf", dependencies=TRUE)
library(sf)
install.packages("stars", dependencies=TRUE)
library(stars)
install.packages("terra", dependencies=TRUE)
library(terra)
install.packages("xfun", dependencies=TRUE)
library(xfun)
install.packages("choroplethr", dependencies=TRUE)
library(choroplethr)
install.packages("choroplethrMaps", dependencies=TRUE)
library(choroplethrMaps)
install.packages("choroplethrAdmin1", dependencies=TRUE)
library(choroplethrAdmin1)
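The user-created theme_Mac() function is applied to nearly every figure that follows. For readers who want a stand-in while working through this addendum, a minimal sketch of a comparable user-created ggplot2 theme function is shown below; the specific settings here are assumptions, not the author's exact choices.
# A sketch of a user-created ggplot2 theme function comparable to theme_Mac().
# Assumes ggplot2 is attached (it is loaded above via the tidyverse).
theme_Mac <- function(base_size=12) {
  theme_classic(base_size=base_size) +
  theme(
    plot.title      = element_text(face="bold", hjust=0.5),
    axis.title.x    = element_text(face="bold"),
    axis.title.y    = element_text(face="bold"),
    strip.text      = element_text(face="bold"),
    legend.position = "bottom")
}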
###############################################################
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
###############################################################
Challenge: Review Addendum 1 and Addendum 2. Use the syntax in these two addenda to replicate all analyses and figures. Then, using the front matter and what is presented in these addenda, prepare a technical memorandum on outcomes associated with the data, following this general outline for presentation:
• Background
–– Description of the Data
–– Null Hypothesis
• Import Data
• Code Book and Data Organization
• Exploratory Graphics
–– Graphics Using Base R
–– Graphics Using the tidyverse Ecosystem
Background
Births and deaths are among the many base measures used to estimate excess deaths, a metric that was frequently cited in the press and by many government officials regarding the impact of the SARS-CoV-2 virus and, subsequently, COVID-19 on communities. Census Bureau resources are used in Addendum 1 and Addendum 2 to obtain data on the rates per 1000 persons of births (RBIRTH) and rates per 1000 persons of deaths (RDEATH) by county.7,8
County-based Census Bureau data are collectively collapsed by state and, eventually, the states are collapsed into four recognized geographical regions: Midwest, Northeast, South, and West (https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf). The data for this lesson will be gained from 2015 to 2019, to establish a base for trends prior to the impact of COVID-19, which emerged in very late 2019 and had become a pandemic by 2020.
Challenge: As data become available, replicate the syntax and process in
Addendum 1 and Addendum 2, but now for 2000 onward. Do the 2015 to 2019
trends hold, compared to 2000 onward?
7 The set of analyses in this addendum uses county-wide data, where county is declared as those primary administrative entities with a recognized FIPS (Federal Information Processing Standard) five-digit county code. Most states use the term county for these geographical areas, but the use of this term is not consistent. There are more than 3100 counties in the United States, and these entities also include terms such as: parish (Louisiana), borough (Alaska), census area (Alaska), and city (Maryland, Missouri, Nevada, Virginia). To compound this classification scheme, Washington DC, the District of Columbia, is classified by the Census Bureau as both a state and a county, yet it is neither. Fortunately, FIPS codes dismiss these concerns for those who use Census Bureau county-specific data: Washington County, Florida, FIPS code 12133; Washington Parish, Louisiana, FIPS code 22117; Washington, District of Columbia, FIPS code 11001, etc.
8 Refer to States, Counties, and Statistically Equivalent Entities (https://www2.census.gov/geo/pdfs/reference/GARM/Ch4GARM.pdf) for more specific information on how counties are viewed when using Census Bureau data. Give special attention to how codes are used to identify island areas such as Puerto Rico, perhaps the best-known island area under United States jurisdiction.
As a reminder, the data for this set of analyses are based on rates per 1000 persons of births (RBIRTH) and rates per 1000 persons of deaths (RDEATH). In its original format, the dataset had data for other variables, but the analyses in Addendum 1 and Addendum 2 are restricted to these two variables: RBIRTH and RDEATH. Be sure to notice how the variables that are not needed for this lesson are accommodated.
Null Hypotheses
The Census Bureau is a rich data source for the many vital statistics that have some
degree of impact on human health and wellness, thus the interest and use of these
data in biostatistics. From among the many possible analyses that could be achieved
from the planned dataset, Addendum 1 (a parametric approach to the data) and
Addendum 2 (a nonparametric approach to the data) will focus on the following null
hypotheses:
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Year (e.g., 2015, 2016, 2017, 2018, 2019).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Region (e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Year (e.g., 2015, 2016, 2017, 2018, 2019).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Region (e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Year (e.g., 2015, 2016, 2017, 2018, 2019) and by Region
(e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Year (e.g., 2015, 2016, 2017, 2018, 2019) and by Region
(e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant association (p <= 0.05) at the national level in the rate per 1000 per-
sons of births (RBIRTH) and the rate per 1000 persons of deaths (RDEATH).
The data will encompass multiple years (e.g., 2015, 2016, 2017, 2018, 2019) for
all four national regions (e.g., Midwest, Northeast, South, West).
Import Data
The tidycensus::get_estimates() function is used to obtain the data for all analyses attempted in Addendum 1 and Addendum 2. The data are from the Census Bureau Population Estimates. The data acquisition process was organized by using an Application Programming Interface (API) and facilitated by use of the purrr::map_dfr() function, which allowed data retrieval for multiple years, 2015 to 2019, all in one convenient process. Note how the purrr package is recognized as part of the tidyverse ecosystem and how the tidycensus package works and plays well with the tidyverse ecosystem.
After the desired Census Bureau Population Estimates dataset(s) are gained, data are then sequestered into the desired format. Specifically, data are organized by each of the four Census Bureau regions by using tidyverse ecosystem functions. Later, functions from the tidyverse ecosystem are used to put the four regional datasets into one unified national dataset.
It is more than recognized that many other approaches could have been used to put data into the desired format(s), the four multi-year breakout regional datasets and the final multi-year unified national dataset. The methods used in this addendum were purposely selected for teaching purposes, to demonstrate functions from the tidyverse ecosystem multiple times and to also reinforce the heuristics of the data and the many possibilities of what can be done once an inclusive dataset is obtained.9
9 With more experience, perhaps, think of and, if possible, implement other ways the tidyverse ecosystem can be used to have the equivalent data in format(s) suitable for analyses that support the many previously stated Null Hypotheses. Give special attention to how the tidycensus::get_estimates() function and the purrr::map_dfr() function are used to obtain the data instead of constructing some type of forced loop. Then, review how other tidyverse ecosystem functions are used to organize, sequester, and join data. There is no one and only one way to implement these analyses.
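As a point of reference, a minimal sketch of the multi-year acquisition pattern described above is shown here. The argument choices (geography, product, output) and the added year column are assumptions that mirror the code book that follows, not a verbatim copy of the lesson's syntax, and the call presumes the Census Bureau API key has already been set.
# A sketch of the multi-year pull with tidycensus and purrr.
# Assumes library(tidyverse) and library(tidycensus) are attached, as above.
years2015to2019.lst <- base::list(2015, 2016, 2017, 2018, 2019)
components2015to2019.tbl <- purrr::map_dfr(
  years2015to2019.lst,
  ~ tidycensus::get_estimates(
      geography = "county",     # County-level rows, FIPS-coded GEOID.
      product   = "components", # Components of change: BIRTHS, RBIRTH, RDEATH, etc.
      year      = .x,
      output    = "tidy") %>%   # Long format: NAME, GEOID, variable, value.
    dplyr::mutate(year = base::as.character(.x))) # Add a year column.
dplyr::glimpse(components2015to2019.tbl)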
#############################################################
#
# Code Book - Census Bureau Population Estimates After Using
# the tidycensus::get_estimates() Function and the product =
# "components" Argument Across Multiple Years
#
# Data when first obtained from the Census Bureau
#
# year, chr ................ 2015, 2016, 2017, 2018, and 2019
# NAME, chr ................. Names of counties, states, etc.
# GEOID, chr ........................... Five-digit FIPS code
# variable, chr ................ BIRTHS, DEATHS, DOMESTICMIG,
# INTERNATIONALMIG, NATURALINC, NETMIG, RBIRTH,
# RDEATH, RDOMESTICMIG, RINTERNATIONALMIG,
# RNATURALINC, RNETMIG
# value, num ...Rate per 1,000 Persons of Births (RBIRTH) and
# Rate per 1,000 Persons of Deaths (RDEATH)
#############################################################
#
# Region, chr ... An enumerated variable that is added to the
# dataset after initial data acquisition:
# Midwest, Northeast, South, West
#############################################################
A variety of mostly tidyverse ecosystem functions will be used on the dataset, in its original format, so that it meets the declared requirements. The major changes made to the original dataset, with the general pattern sketched in code immediately after this list, are:
• Delete data for 10 of the 12 breakout variables, so that data are retained only for
the breakouts RBIRTH and RDEATH.
• Prepare by Region breakout tibbles, and from this action prepare draft figures
and descriptive statistics.
• Add Region (e.g., Midwest, Northeast, South, West) as an enumerated column
(e.g., object variable), to aid later analyses.
• Temporarily transform the data from long to wide to give another view on how
correlation analyses are attempted.
• As needed, especially because of function requirements associated with mapping
activities, put selected character object variables into numeric format.
• Put the four by Region breakout tibbles (e.g., datasets) back into one unified
tibble, representing one common national dataset. This final dataset is used to
prepare a national map of rates for RBIRTH and RDEATH, with county outlines
and state outlines showing on the map. The national dataset will also be used for
statistical purposes.
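The per-region pattern of adding the enumerated Region column and then filtering down to RBIRTH and RDEATH, referenced in the list above, is short. In the sketch below, the upstream object MidwestRaw.tbl is a hypothetical name standing in for the Midwest subset of the acquired Population Estimates data; it is used here only to show the shape of the pipeline.
# A sketch of the per-region pattern; MidwestRaw.tbl is hypothetical.
RBIRTHandRDEATHMidwest2015Onward.tbl <- MidwestRaw.tbl %>%
  tibble::add_column(Region = "Midwest") %>%           # Add the enumerated Region column.
  dplyr::filter(variable %in% c("RBIRTH", "RDEATH"))   # Retain only the two breakouts.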
Exploratory Graphics
A variety of actions are used to prepare the four by Region (Midwest, Northeast,
South, West) breakout tibbles over the five years (2015, 2016, 2017, 2018, and
2019). As a reminder as to why these variables were selected, the data on RBIRTH
and RDEATH provide a base for trends prior to the COVID-19 epidemic, which had
major impact starting in early 2020. Ultimately, these data serve as base measures,
along with a host of other variables, to later address excess deaths, a key metric
associated with understanding the pandemic’s many impacts on public health and
wellness, educational institutions, workforce availability, supply chains, factory
output, economic growth, migrations and population redistribution, etc.
First, prepare a list of the 5 years (2015, 2016, 2017, 2018, and 2019) for which
county-wide data on RBIRTH and RDEATH are available from the Census Bureau
Population Estimates:
base::getwd()
base::ls()
base::attach(years2015to2019.lst)
base::class(years2015to2019.lst)
[1] "list"
10 Preemptively, it should be mentioned that there are many ways to obtain the data, organize the data, structure and restructure the data, modify the class of selected variables, etc. The steps taken in this part of the lesson may be somewhat repetitive and verbose, but they were purposely selected to demonstrate the complexity of this type of endeavor. Recall that this text is designed for those who are being introduced to the tidyverse ecosystem and the use of APIs for biostatistics. Those with more advanced skills might select other actions.
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHMidwest2015Onward.tbl)
utils::str(RBIRTHandRDEATHMidwest2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHMidwest2015Onward.tbl)
Rows: 10,550
Columns: 6
$ year <chr> "2015", "2015", "2015", "2015", "2015",
$ NAME <chr> "Adams County, Illinois, East North Cent
$ GEOID <chr> "17001", "17003", "17005", "17007", "170
$ variable <chr> "RBIRTH", "RBIRTH", "RBIRTH", "RBIRTH",
$ value <dbl> 12.48965, 14.69952, 9.50872, 10.69120, 8
$ Region <chr> "Midwest", "Midwest", "Midwest", "Midwes
base::summary(RBIRTHandRDEATHMidwest2015Onward.tbl)
base::print(RBIRTHandRDEATHMidwest2015Onward.tbl)
As one last quality assurance action, used to verify that filtering was done correctly, look again at the column named variable:
base::unique(RBIRTHandRDEATHMidwest2015Onward.tbl$variable)
With assurance that all is correct and that the enumerated tibble is ready for use, the writexl::write_xlsx() function should be used to immediately download the data, for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
RBIRTHandRDEATHMidwest2015Onward.tbl,
path = "F:\\R_Ceres\\RBIRTHandRDEATHMidwest2015Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is identified,
# especially the use of double back slashes.
base::file.exists("RBIRTHandRDEATHMidwest2015Onward.xlsx")
base::file.info("RBIRTHandRDEATHMidwest2015Onward.xlsx")
base::list.files(pattern =".xlsx")
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHNortheast2015Onward.tbl)
utils::str(RBIRTHandRDEATHNortheast2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHNortheast2015Onward.tbl)
base::summary(RBIRTHandRDEATHNortheast2015Onward.tbl)
base::print(RBIRTHandRDEATHNortheast2015Onward.tbl)
As one last quality assurance action, used to verify that filtering was done cor-
rectly, look again at the column named variable:
base::unique(RBIRTHandRDEATHNortheast2015Onward.tbl$variable)
With assurance that all is correct and that the enumerated tibble is ready for use,
the writexl::write_xlsx() function should be used to immediately download the data,
for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
RBIRTHandRDEATHNortheast2015Onward.tbl,
path = "F:\\R_Ceres\\RBIRTHandRDEATHNortheast2015Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is identified,
# especially the use of double back slashes.
base::file.exists("RBIRTHandRDEATHNortheast2015Onward.xlsx")
base::file.info("RBIRTHandRDEATHNortheast2015Onward.xlsx")
base::list.files(pattern =".xlsx")
#
# base::unique(RBIRTHandRDEATHSouth2015Onward.tbl$variable)
#
# View(RBIRTHandRDEATHSouth2015Onward.tbl)
#
tibble::add_column(Region = "South") %>%
# Continue with %>%
#
dplyr::filter(variable %in% c(
"RBIRTH",
"RDEATH"))
# Retain the RBIRTH and RDEATH
# rows and delete all others.
#
# The tibble, after modification by using the tidyverse
# ecosystem functions, now consists of 14,220 rows and 6
# columns.
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHSouth2015Onward.tbl)
utils::str(RBIRTHandRDEATHSouth2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHSouth2015Onward.tbl)
base::summary(RBIRTHandRDEATHSouth2015Onward.tbl)
base::print(RBIRTHandRDEATHSouth2015Onward.tbl)
As one last quality assurance action, used to verify that filtering was done cor-
rectly, look again at the column named variable:
base::unique(RBIRTHandRDEATHSouth2015Onward.tbl$variable)
With assurance that all is correct and that the enumerated tibble is ready for use,
the writexl::write_xlsx() function should be used to immediately download the data,
for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
RBIRTHandRDEATHSouth2015Onward.tbl,
path = "F:\\R_Ceres\\RBIRTHandRDEATHSouth2015Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is identified,
# especially the use of double back slashes.
base::file.exists("RBIRTHandRDEATHSouth2015Onward.xlsx")
base::file.info("RBIRTHandRDEATHSouth2015Onward.xlsx")
base::list.files(pattern =".xlsx")
# View(RBIRTHandRDEATHWest2015Onward.tbl)
#
tibble::add_column(Region = "West") %>%
# Continue with %>%
#
dplyr::filter(variable %in% c(
"RBIRTH",
"RDEATH"))
# Retain the RBIRTH and RDEATH
# rows and delete all others.
#
# The tibble, after modification by using the tidyverse
# ecosystem functions, now consists of 4,480 rows and 6
# columns.
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHWest2015Onward.tbl)
utils::str(RBIRTHandRDEATHWest2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHWest2015Onward.tbl)
base::summary(RBIRTHandRDEATHWest2015Onward.tbl)
base::print(RBIRTHandRDEATHWest2015Onward.tbl)
As one last quality assurance action, used to verify that filtering was done cor-
rectly, look again at the column named variable:
base::unique(RBIRTHandRDEATHWest2015Onward.tbl$variable)
With assurance that all is correct and that the enumerated tibble is ready for use,
the writexl::write_xlsx() function should be used to immediately download the data,
for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
RBIRTHandRDEATHWest2015Onward.tbl,
path = "F:\\R_Ceres\\RBIRTHandRDEATHWest2015Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is identified,
# especially the use of double back slashes.
base::file.exists("RBIRTHandRDEATHWest2015Onward.xlsx")
base::file.info("RBIRTHandRDEATHWest2015Onward.xlsx")
base::list.files(pattern =".xlsx")
Before graphics are attempted, often for draft purposes that provide initial guidance on general trends, it is best to prepare a series of breakout tibbles, where data for each of the four regions (e.g., Midwest, Northeast, South, West) are sequestered into two new tibbles, an RBIRTH tibble and an RDEATH tibble. Each of these eight new tibbles will be used to prepare a relevant boxplot, a boxplot that focuses on one and only one breakout (e.g., RBIRTH or RDEATH) for one and only one region (e.g., Midwest, Northeast, South, or West).11
Effort will then be made to consolidate the resulting breakout boxplots into a few meaningful side-by-side boxplots. Later, the ggplot2::ggplot() function and its facet capabilities will be used to demonstrate why use of the tidyverse ecosystem may be an easier way to approach these aims, even though Base R graphics still have a place in data science.
RBIRTH2015OnwardMidwest.tbl <-
RBIRTHandRDEATHMidwest2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
base::attach(RBIRTH2015OnwardMidwest.tbl)
RDEATH2015OnwardMidwest.tbl <-
RBIRTHandRDEATHMidwest2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
base::attach(RDEATH2015OnwardMidwest.tbl)
RBIRTH2015OnwardNortheast.tbl <-
RBIRTHandRDEATHNortheast2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
base::attach(RBIRTH2015OnwardNortheast.tbl)
RDEATH2015OnwardNortheast.tbl <-
RBIRTHandRDEATHNortheast2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
base::attach(RDEATH2015OnwardNortheast.tbl)
RBIRTH2015OnwardSouth.tbl <-
RBIRTHandRDEATHSouth2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
base::attach(RBIRTH2015OnwardSouth.tbl)
RDEATH2015OnwardSouth.tbl <-
RBIRTHandRDEATHSouth2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
base::attach(RDEATH2015OnwardSouth.tbl)
11 Explore use of the graphics::boxplot() function and its subset argument as a potential tool.
RBIRTH2015OnwardWest.tbl <-
RBIRTHandRDEATHWest2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
base::attach(RBIRTH2015OnwardWest.tbl)
RDEATH2015OnwardWest.tbl <-
RBIRTHandRDEATHWest2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
base::attach(RDEATH2015OnwardWest.tbl)
Now that the eight breakout tibbles are created, prepare eight corresponding boxplots, looking at value by variable for each. Add a few embellishments to improve presentation and understanding.12
Not only to save space but, perhaps more importantly, observe how the par(mfrow=c()) function is one tool (of many) for placing multiple figures into one composite figure. This type of presentation makes side-by-side comparisons easy to understand, but consistent scales are needed to fully benefit from side-by-side comparisons (Figs. 5.1 and 5.2).
Fig. 5.1
12 Give attention to the outliers showing in the boxplots. Outliers have a potential impact on later calculations of normality. Outliers can also introduce bias in model-building and the development of prediction equations. It is possible to trim (e.g., remove) outliers from the data, with the outliers::rm.outlier() function serving this purpose, along with other possible choices for this task. Of course, any alteration of data away from original form should only be done after careful thought and consideration of all possible unintended downstream impacts on later use of the data.
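As a brief, hedged illustration of the trimming option mentioned above, the outliers package can remove the single most extreme value from a numeric vector; such steps are best applied to a copy, so the original data remain untouched.
# A sketch of trimming one extreme value with outliers::rm.outlier().
install.packages("outliers", dependencies=TRUE)
library(outliers)
value_trimmed <- outliers::rm.outlier(
  RBIRTH2015OnwardMidwest.tbl$value, fill=FALSE)
base::length(RBIRTH2015OnwardMidwest.tbl$value) # Original count.
base::length(value_trimmed)                     # One fewer value.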
Fig. 5.2
par(ask=TRUE)
par(mfrow=c(2,2)) # 4 figures into a 2 row by 2 column grid
graphics::boxplot(data=RBIRTH2015OnwardMidwest.tbl,
value ~ variable, main="Midwest", xlab="RBIRTH",
ylab="Rate per 1,000 Persons", ylim=c(0, 35),
col="cornsilk")
#
graphics::boxplot(data=RBIRTH2015OnwardNortheast.tbl,
value ~ variable, main="Northeast", xlab="RBIRTH",
ylab="Rate per 1,000 Persons", ylim=c(0, 35),
col="cornsilk")
#
graphics::boxplot(data=RBIRTH2015OnwardSouth.tbl,
value ~ variable, main="South", xlab="RBIRTH",
ylab="Rate per 1,000 Persons", ylim=c(0, 35),
col="cornsilk")
#
graphics::boxplot(data=RBIRTH2015OnwardWest.tbl,
value ~ variable, main="West", xlab="RBIRTH",
ylab="Rate per 1,000 Persons", ylim=c(0, 35),
col="cornsilk")
# Fig. 5.1
par(ask=TRUE)
par(mfrow=c(2,2)) # 4 figures into a 2 row by 2 column grid
graphics::boxplot(data=RDEATH2015OnwardMidwest.tbl,
value ~ variable, main="Midwest", xlab="RDEATH",
ylab="Rate per 1,000 Persons", ylim=c(0, 30),
col="cornsilk")
#
graphics::boxplot(data=RDEATH2015OnwardNortheast.tbl,
value ~ variable, main="Northeast", xlab="RDEATH",
ylab="Rate per 1,000 Persons", ylim=c(0, 30),
col="cornsilk")
#
graphics::boxplot(data=RDEATH2015OnwardSouth.tbl,
value ~ variable, main="South", xlab="RDEATH",
ylab="Rate per 1,000 Persons", ylim=c(0, 30),
col="cornsilk")
#
graphics::boxplot(data=RDEATH2015OnwardWest.tbl,
value ~ variable, main="West", xlab="RDEATH",
ylab="Rate per 1,000 Persons", ylim=c(0, 30),
col="cornsilk")
# Fig. 5.2
Challenge: Much more could have been prepared using Base R functions, but for now these many boxplots should give a first glance at outcomes. As a challenge, replicate these breakout tibbles and breakout boxplots by year instead of by region. Do the trends demonstrated in the figures give a hint of future outcomes?
13 A geom, when using the ggplot2::ggplot() function, should be viewed as a geometric object.
When viewing these figures, the graphical images of the data distribution provide an early, visual sense of the descriptive statistics to come. It was decided to use geom_boxplot() for the figures, but other geoms could have just as easily been used (Figs. 5.3, 5.4, 5.5, and 5.6).
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHMidwest2015Onward.tbl,
aes(x=year, y=value)) +
geom_boxplot() +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Midwest") +
labs(x="\nYear", y="Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.3
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHNortheast2015Onward.tbl,
aes(x=year, y=value)) +
geom_boxplot() +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Northeast") +
labs(x="\nYear", y="Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.4
Fig. 5.3
Fig. 5.4
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHSouth2015Onward.tbl,
aes(x=year, y=value)) +
geom_boxplot() +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: South") +
labs(x="\nYear", y="Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.5
Fig. 5.5
Fig. 5.6
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHWest2015Onward.tbl,
aes(x=year, y=value)) +
geom_boxplot() +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: West") +
labs(x="\nYear", y="Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.6
Challenge: The same challenge that was offered when using the
graphics::boxplot() function is repeated, but now by use of the ggplot2::ggplot()
function. Replicate these breakout boxplots by year instead of Region. Do the
trends demonstrated in the figures give a hint of future outcomes?
The ggplot2::ggplot() function, which many data scientists (if not most) consider quite feature-rich, is often the first choice for the production of graphics when using R. This is especially the case because many R packages and their functions have been developed so that they work and play well with the ggplot2::ggplot() function. Increasingly, for those who wish to produce Beautiful Graphics, it is the exceptional figure where the ggplot2::ggplot() function was not among the first considerations.
It is far too common for those with limited experience in data science to generate a few draft figures and then immediately use inferential analyses to address Null Hypotheses and provide value by interpreting outcomes. The timing of this approach is undesirable in that it skips the examination of exploratory descriptive statistics and measures of central tendency. It may be somewhat dull to examine distribution patterns of data (e.g., data in the large and data by breakouts). It may be somewhat dull to prepare statistics such as mean, standard deviation, median, minimum, maximum, etc. Even so, these are the very statistics that provide guidance on inferential test selection, discovery of trends between and among variables, and other potential outcomes that may otherwise go unnoticed. These front-end tasks simply cannot be ignored.
With the importance of exploratory descriptive statistics and measures of central
tendency established as a vital concern, the first task is to examine the distribution
pattern(s) of all measured variables of interest.14 For this addendum, consider the
object variable named value, for each of the eight breakout tibbles, looking at value
datapoints for RBIRTH by all four regions and value datapoints for RDEATH by all
four regions.15 There are many ways to initiate these exploratory analyses, but for
this lesson review the following practices:
• Density plots will be prepared to graphically examine distribution pattern(s) for
value by established breakouts.
• QQ (e.g., Quantile-Quantile) plots will be prepared to graphically examine dis-
tribution pattern(s) for value by established breakouts.
The Anderson-Darling test will also be used to empirically examine distribution
patterns, specifically normal distribution, for value by established breakouts
(Figs. 5.7, 5.8, 5.9, and 5.10).
Fig. 5.7
14 The correct use of many inferential tests, especially those tests that use parametric data, assumes that data follow a normal distribution pattern (even though this assumption is frequently violated). If data do not follow a normal distribution pattern, it may be best to use an inferential test that is more appropriately associated with nonparametric data. There are many online and print resources on this topic, and greater discussion in this introductory text is not necessary.
15 Challenge: Once again, respond to the challenge that this approach should be replicated by year and not only by region.
Fig. 5.8
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHMidwest2015Onward.tbl,
aes(x=value)) +
geom_density(lwd=2, color="red") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Midwest") +
labs(x="\nYear", y="Density of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 0.28),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.7
Fig. 5.9
Fig. 5.10
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHNortheast2015Onward.tbl,
aes(x=value)) +
geom_density(lwd=2, color="red") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Northeast") +
labs(x="\nYear", y="Density of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 0.28),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.8
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHSouth2015Onward.tbl,
aes(x=value)) +
geom_density(lwd=2, color="red") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: South") +
labs(x="\nYear", y="Density of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 0.28),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.9
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHWest2015Onward.tbl,
aes(x=value)) +
geom_density(lwd=2, color="red") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: West") +
labs(x="\nYear", y="Density of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 0.28),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.10
Using the geom_density() geom, it seems that there might be an overall trend toward normal distribution. However, it is cautioned that the term might lacks precision. The visuals gained by using geom_density() are a good start, but the Anderson-Darling test (or some other test for normal distribution, such as the Shapiro-Wilk test) is needed for the final determination of normal distribution and, from that finding, whether a parametric or nonparametric approach (or both) should be used for selecting the appropriate inferential test(s).16
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHMidwest2015Onward.tbl,
aes(sample=value)) +
geom_qq(lwd=2, color="red") +
geom_qq_line(lwd=1.25, color="blue") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Midwest") +
labs(x="\nYear", y="QQ of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Somewhat different than might be expected, note how the
# aesthetic is expressed as aes(sample=value) and not
# aes(x=value).
Challenge: Both to save space and to encourage active participation, only one of the Quantile-Quantile plots of RBIRTH and RDEATH by Region is shown. As a challenge, generate all four QQ plot figures for this section. Be sure to go beyond mere visualization by also using an Anderson-Darling test, to continue with inquiries into normal distribution (Fig. 5.11).
16 Far too many times, a graphical figure appeared to show what might be a normal distribution, but that finding was not confirmed by applying an empirical normality test, whether the Anderson-Darling test, the Shapiro-Wilk test, or some other test. A decision must then be made on which paradigm, parametric or nonparametric, to use for inferential testing.
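For completeness, the Shapiro-Wilk alternative mentioned above can be sketched as follows; note that stats::shapiro.test() accepts at most 5000 values, so for the larger regional breakout tibbles in this lesson the Anderson-Darling test is the more practical choice.
# A sketch of the Shapiro-Wilk alternative; sample size must be 3 to 5000.
stats::shapiro.test(RDEATH2015OnwardNortheast.tbl$value) # About 1,085 values: acceptable.
# stats::shapiro.test(RBIRTH2015OnwardMidwest.tbl$value) # More than 5,000 values: would error.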
Fig. 5.11
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHNortheast2015Onward.tbl,
aes(sample=value)) +
geom_qq(lwd=2, color="red") +
geom_qq_line(lwd=1.25, color="blue") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Northeast") +
labs(x="\nYear", y="QQ of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHSouth2015Onward.tbl,
aes(sample=value)) +
geom_qq(lwd=2, color="red") +
geom_qq_line(lwd=1.25, color="blue") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: South") +
labs(x="\nYear", y="QQ of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.11
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHWest2015Onward.tbl,
aes(sample=value)) +
geom_qq(lwd=2, color="red") +
geom_qq_line(lwd=1.25, color="blue") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: West") +
labs(x="\nYear", y="QQ of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
install.packages("nortest", dependencies=TRUE)
library(nortest)
nortest::ad.test(RBIRTH2015OnwardMidwest.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RDEATH2015OnwardMidwest.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RBIRTH2015OnwardNortheast.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RDEATH2015OnwardNortheast.tbl$value)
# Calculated p-value <= 0.00075
nortest::ad.test(RBIRTH2015OnwardSouth.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RDEATH2015OnwardSouth.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RBIRTH2015OnwardWest.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RDEATH2015OnwardWest.tbl$value)
# Calculated p-value <= 0.000000000236
It will be interesting to see if there are any meaningful differences in outcomes between the two approaches, parametric vs. nonparametric.
Now that normal distribution patterns have been addressed, it is necessary to
provide the standard measures associated with descriptive statistics and measures of
central tendency:
• N
• Minimum
• Median (e.g., 50th percentile)
• Mean (e.g., arithmetic average)
• SD (e.g., standard deviation)
• Maximum
• Missing (e.g., expressed in R as NA)
There are many R packages and associated functions that can be used for this
task, but it was decided to keep with the tidyverse ecosystem as these metrics are
calculated for each of the four RBIRTH and RDEATH breakout tibbles currently in
use. As a brief reminder, there are no missing data for the object variable value. If there had been missing data, special accommodations such as the use of na.rm=TRUE would have been used.
RBIRTHandRDEATHMidwest2015Onward.tbl %>%
dplyr::group_by(variable, year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
)
# Descriptive statistics are generated by first using the
# dplyr::group_by() function against the object variables
# called variable and year, with the dplyr::summarize()
# function then used against a set of selected functions,
# all in an effort to make a neatly presented summary of
# descriptive statistics for each region, by variable
# breakout (e.g., RBIRTH and RDEATH) and by year (e.g.,
# 2015, 2016, 2017, 2018, and 2019). There are many ways
# the variables could have been grouped and there are many
# possible descriptive statistics that could have been
# presented. Additionally, there are no missing data for
# value so it was not necessary to use the na.rm=TRUE
# argument.
# A tibble: 10 x 9
# Groups: variable [2]
variable year N Minimum Median Mean SD Maximum
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RBIRTH 2015 1055 4.05 11.5 11.7 2.54 27.4
2 RBIRTH 2016 1055 4.53 11.4 11.7 2.49 29.4
3 RBIRTH 2017 1055 5.25 11.4 11.6 2.44 29.5
4 RBIRTH 2018 1055 4.58 11.6 11.9 2.55 29.8
5 RBIRTH 2019 1055 4.76 11.0 11.3 2.29 26.4
6 RDEATH 2015 1055 0 10.0 9.95 2.45 18.1
7 RDEATH 2016 1055 2.19 10.4 10.4 2.52 20.8
8 RDEATH 2017 1055 2.12 10.3 10.2 2.43 19.1
9 RDEATH 2018 1055 1.07 11.1 11.2 2.79 23.7
10 RDEATH 2019 1055 0 10.4 10.4 2.45 18.6
RBIRTHandRDEATHNortheast2015Onward.tbl %>%
dplyr::group_by(variable, year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
)
# A tibble: 10 x 9
# Groups: variable [2]
variable year N Minimum Median Mean SD Maximum
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RBIRTH 2015 217 4.96 9.98 10.1 1.72 18.6
2 RBIRTH 2016 217 3.53 9.90 10.0 1.76 17.3
3 RBIRTH 2017 217 4.38 9.78 9.87 1.73 16.4
4 RBIRTH 2018 217 5.80 10.1 10.2 1.80 18.3
5 RBIRTH 2019 217 2.75 9.42 9.62 1.75 17.3
6 RDEATH 2015 217 4.78 9.40 9.60 1.80 16.1
7 RDEATH 2016 217 4.85 9.77 10.0 1.86 16.4
8 RDEATH 2017 217 4.84 9.70 9.91 1.85 16.3
9 RDEATH 2018 217 5.58 10.1 10.1 1.95 15.1
10 RDEATH 2019 217 5.04 10.0 10.2 1.83 16.0
RBIRTHandRDEATHSouth2015Onward.tbl %>%
dplyr::group_by(variable, year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
)
# A tibble: 10 x 9
# Groups: variable [2]
variable year N Minimum Median Mean SD Maximum
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RBIRTH 2015 1422 0 11.7 11.8 2.60 26.4
2 RBIRTH 2016 1422 0 11.5 11.6 2.58 29.2
3 RBIRTH 2017 1422 0 11.5 11.5 2.46 27.5
4 RBIRTH 2018 1422 3.23 11.7 11.8 2.48 24.2
5 RBIRTH 2019 1422 0.796 11.0 11.0 2.23 28.2
6 RDEATH 2015 1422 0 10.4 10.2 2.57 18.1
7 RDEATH 2016 1422 0 10.9 10.7 2.66 19.8
8 RDEATH 2017 1422 0 10.8 10.6 2.63 18.9
9 RDEATH 2018 1422 1.13 11.3 11.2 2.79 25.4
10 RDEATH 2019 1422 0 11.2 11.0 2.65 19.3
RBIRTHandRDEATHWest2015Onward.tbl %>%
dplyr::group_by(variable, year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
)
# A tibble: 10 x 9
# Groups: variable [2]
variable year N Minimum Median Mean SD Maximum
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RBIRTH 2015 448 0 11.7 12.0 3.63 29.6
2 RBIRTH 2016 448 0 11.6 11.8 3.55 28.7
3 RBIRTH 2017 448 0 11.6 11.8 3.39 28.5
4 RBIRTH 2018 448 0 11.7 12.0 3.61 30.4
5 RBIRTH 2019 448 0 10.7 10.9 3.20 28.0
6 RDEATH 2015 448 0 8.05 8.16 2.85 22.5
7 RDEATH 2016 448 0 8.50 8.48 2.96 22.7
8 RDEATH 2017 448 0 8.33 8.42 2.92 22.6
9 RDEATH 2018 448 2.07 8.89 9.06 3.10 22.9
10 RDEATH 2019 448 0 8.62 8.71 2.92 22.1
Exploratory Analyses
The many graphics and descriptive statistics (including measures of central tendency) prepared up to this point provide some understanding of general trends between and among the data. It is suspected that there may be statistically significant differences (p <= 0.05) in RBIRTH and RDEATH by year and by region, but the key term here is suspected. It is only too common for early observations to fail to reach statistical significance at the declared p-value once the data are analyzed properly, using an appropriate inferential test. Data scientists do not put their reputation and career at risk by drawing conclusions merely from a glance at descriptive statistics, measures of central tendency, or graphics, absent the appropriate inferential testing.
To address the Null Hypotheses, it is necessary to engage in the use of different
inferential tests. For this lesson a parametric approach toward the data will be used,
dependent on Oneway ANOVA (Analysis of Variance), Twoway ANOVA (Analysis
of Variance), and Pearson’s r coefficient of correlation. Other analyses could be
prepared, but they will be deferred until later lessons. As a reminder, review the Null
Hypotheses one more time:
Null Hypotheses
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Year (e.g., 2015, 2016, 2017, 2018, 2019).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Region (e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Year (e.g., 2015, 2016, 2017, 2018, 2019).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Region (e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Year (e.g., 2015, 2016, 2017, 2018, 2019) and by Region
(e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Year (e.g., 2015, 2016, 2017, 2018, 2019) and by Region
(e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant association (p <= 0.05) at the national level in the rate per 1000 per-
sons of births (RBIRTH) and the rate per 1000 persons of deaths (RDEATH).
The data will encompass multiple years (e.g., 2015, 2016, 2017, 2018, 2019) for
all four national regions (e.g., Midwest, Northeast, South, West).
This lesson was structured so that there were four regional datasets, each gaining
data from 2015 to 2019. The task now is to join the four regional datasets into one
unified national dataset. There are many ways to complete this action, but because
of the common format for all four regional datasets, the dplyr::bind_rows() function
will be used in this lesson.
RBIRTHandRDEATHNational2015Onward.tbl <-
dplyr::bind_rows(
RBIRTHandRDEATHMidwest2015Onward.tbl,
RBIRTHandRDEATHNortheast2015Onward.tbl,
RBIRTHandRDEATHSouth2015Onward.tbl,
RBIRTHandRDEATHWest2015Onward.tbl)
# Merge, join, blend, bind, etc. rows of
# the four breakout regional datasets
# into one common national dataset.
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHNational2015Onward.tbl)
utils::str(RBIRTHandRDEATHNational2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHNational2015Onward.tbl)
base::summary(RBIRTHandRDEATHNational2015Onward.tbl)
base::print(RBIRTHandRDEATHNational2015Onward.tbl)
17 When observing how RBIRTHandRDEATHNational2015Onward.tbl consists of 31,420 rows, it is helpful to remember that the dataset covers data from each county in the United States over 5 years (e.g., 2015, 2016, 2017, 2018, 2019) and for two breakouts of variable (e.g., RBIRTH and RDEATH): 31,420 rows of county data divided by 5 years equals 6284, which is then divided by the two variable breakouts, resulting in 3142. Then go back to the earlier part of this lesson where it was stated that "There are more than 3100 counties in the United States," or 3142 to be more accurate, excluding county-equivalents in territories under United States jurisdiction. It is best to occasionally check, as a quality assurance measure, these mundane, but important, issues to be sure that data are correct.
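The arithmetic described in this footnote can be confirmed directly in R with a quick sketch against the unified national tibble:
# A quick quality assurance check of the row count described above.
base::nrow(RBIRTHandRDEATHNational2015Onward.tbl)              # 31,420 rows.
base::nrow(RBIRTHandRDEATHNational2015Onward.tbl) / (5 * 2)    # 3,142 counties.
dplyr::n_distinct(RBIRTHandRDEATHNational2015Onward.tbl$GEOID) # Should also be 3,142.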
RBIRTHandRDEATHNational2015Onward.tbl %>%
dplyr::group_by(year, variable, Region) %>%
# There are now three groups addressed by
# the dplyr::group_by() function. Be sure
# to compare the ordering within use of
# the dplyr::group_by() function and the
# way descriptive statistics are ordered
# in the tibble-based output.
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>%
print(n=40)
# The resulting tibble should be 40 rows:
# 5 breakouts for year * 2 breakouts for
# variable * 4 breakouts for Region = 40.
# A tibble: 40 x 10
# Groups: year, variable [10]
year variable Region N Minimum Median Mean SD
<chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 2015 RBIRTH Midwest 1055 4.05 11.5 11.7 2.54
2 2015 RBIRTH Northeast 217 4.96 9.98 10.1 1.72
3 2015 RBIRTH South 1422 0 11.7 11.8 2.60
4 2015 RBIRTH West 448 0 11.7 12.0 3.63
5 2015 RDEATH Midwest 1055 0 10.0 9.95 2.45
6 2015 RDEATH Northeast 217 4.78 9.40 9.60 1.80
7 2015 RDEATH South 1422 0 10.4 10.2 2.57
8 2015 RDEATH West 448 0 8.05 8.16 2.85
RBIRTHNational.tbl <-
RBIRTHandRDEATHNational2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
# Retain the RBIRTH rows.
# Although it is perhaps not totally necessary,
# it is often convenient to sequester data into a
# separate and individualized dataset, which in
# this example is a national dataset consisting
# of data for RBIRTH rows.
RBIRTHNational.tbl$year <-
as.factor(RBIRTHNational.tbl$year)
RBIRTHNational.tbl$GEOID <-
as.numeric(RBIRTHNational.tbl$GEOID)
RBIRTHNational.tbl$Region <-
as.factor(RBIRTHNational.tbl$Region)
RBIRTHNational.tbl$value <-
as.numeric(RBIRTHNational.tbl$value)
# Put the year, GEOID, Region, and value data into desired
# format, anticipating future needs and function requirements.
base::attach(RBIRTHNational.tbl)
utils::str(RBIRTHNational.tbl)
base::unique(RBIRTHNational.tbl$variable)
# Confirm that only RBIRTH shows in the
# enumerated dataset.
RBIRTHNational.tbl %>%
dplyr::group_by(year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>% # Continue by using %>%
# Up to this point of the chained syntax, using the %>% pipe
# operator, descriptive statistics are generated by first
# using the dplyr::group_by() function against the object
# variable year, with the dplyr::summarize() function then
# used against a set of selected statistically oriented
# functions, all in an effort to make a neatly presented
# summary of descriptive statistics. There are no missing
# data for the object variable called value so it was not
# necessary to use the argument na.rm=TRUE. There is a
# potential problem, however, in that the resulting tibble,
# by default, prints numbers with 3 significant digits.
# (Look at the tibble package and give attention to the
# pillar.sigfig option for more detail on this subject.) Do
# not be confused by what shows on screen v actual values.
# From among many ways to see more detail on the actual
# count of numbers to the right of the decimal point, look
# below at how the tibble was put into format as a data
# frame and then how the tidyverse ecosystem was used to
# offer greater precision in the final printout, showing a
# greater number of significant digits.
as.data.frame(.) %>% # Continue by using %>%
dplyr::mutate_if(is.numeric, round, 5)
Although there are many ways to address Analysis of Variance (both Oneway
ANOVA and Twoway ANOVA) using Base R, functions from the agricolae package
are demonstrated. The output is not only accurate, but it is also quite easy to inter-
pret. It is not always easy to interpret ANOVA output when using Base R functions,
but the problem of interpretation is largely mitigated by using functions from the
agricolae package.
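For comparison, a minimal sketch of the Base R route to the same Oneway ANOVA and post hoc comparisons is shown below; the agricolae-based syntax used in this lesson follows immediately afterward.
# A sketch of the Base R equivalents: Oneway ANOVA and Tukey's HSD.
summary(stats::aov(value ~ year, data=RBIRTHNational.tbl))
stats::TukeyHSD(stats::aov(value ~ year, data=RBIRTHNational.tbl))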
install.packages("agricolae", dependencies=TRUE)
library(agricolae)
agricolae::HSD.test(
stats::aov(value ~ year,
data=RBIRTHNational.tbl), # Model
trt="year", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Birth Rate per 1,000 Persons (RBIRTH) by Year
at the National Level Using Tukey's HSD (Honestly
Significant Difference) ParametricOneway ANOVA:
2015 to 2019 ")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
year, means
value groups
2018 11.7427 a
2015 11.6684 ab
2016 11.5388 bc
2017 11.4697 c
2019 10.9884 d
Fig. 5.12
Based on the output gained by using the agricolae::HSD.test() function, where rates per 1000 persons (value) for RBIRTH were examined by year, the group consolidations are:
• Group a: Rates per 1000 persons (value) for RBIRTH in 2018 (Mean = 11.7427) share commonality with rates per 1000 persons (value) for RBIRTH in 2015 (Mean = 11.6684).
• Group ab: Rates per 1000 persons (value) for RBIRTH in 2015 (Mean = 11.6684) share commonality with rates per 1000 persons (value) for RBIRTH in 2016 (Mean = 11.5388) and with the prior finding for 2018.
• Group bc: Rates per 1000 persons (value) for RBIRTH in 2016 (Mean = 11.5388) share commonality with rates per 1000 persons (value) for RBIRTH in 2017 (Mean = 11.4697) and with the prior finding for 2015.
• Group c: Rates per 1000 persons (value) for RBIRTH in 2017 (Mean = 11.4697) share commonality with rates per 1000 persons (value) for RBIRTH in 2016 (Mean = 11.5388).
• Group d: Rates per 1000 persons (value) for RBIRTH in 2019 (Mean = 10.9884) are totally unique and share no commonality with rates per 1000 persons (value) for RBIRTH in other years.
Use of the agricolae::HSD.test() function is especially helpful in that the output provides clearly organized and easy-to-understand summaries of descriptive statistics and group membership(s). Using a parametric perspective, notice how it is possible to see the overlap in mean(s) for different years and the unique mean(s) for individual years.
A summary graphic is also needed to reinforce RBIRTH differences by year, with the tidyverse ecosystem used once more. In this example, the onewaytests::gplot() function will be used to generate the figure, but give attention to the way this function has been developed in such a way that it works and plays well with the ggplot2::ggplot() function and the broader tidyverse ecosystem.
install.packages("onewaytests", dependencies=TRUE)
library(onewaytests)
RBIRTHNationalYEAR.fig <-
onewaytests::gplot(value ~ year, data=RBIRTHNational.tbl,
type = "errorbar", option = "se") +
geom_point(color="red", size=3) +
labs(
title =
"Birth Rate per 1,000 Persons (RBIRTH) by Year at the
National Level: 2015 to 2019",
subtitle = "Data: Census Bureau Population Estimates",
x = "\nYear",
y = "Birth Rate per 1,000 Persons\n") +
scale_y_continuous(labels=scales::comma, limits=c(10.80,
11.80), breaks=scales::pretty_breaks(n=7)) +
annotate("text", x=1.10, y=11.40, label="Year    Mean      SD",
fontface="bold", size=3, color="black", hjust=0,
family="mono") +
annotate("text", x=1.10, y=11.35, label="======================",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.30, label="2015 11.6684 2.73544",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.25, label="2016 11.5388 2.69872",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.20, label="2017 11.4697 2.60429",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.15, label="2018 11.7427 2.68837",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.10, label="2019 10.9884 2.41780",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X-axis text and/
# or Y-axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 5.12
par(ask=TRUE); RBIRTHNationalYEAR.fig
18 The onewaytests package has functions that can be used for Oneway ANOVA. However, Oneway ANOVA output from the onewaytests package is overly verbose, and it is not as easy to interpret as Oneway ANOVA output from the agricolae package, thus the preference for the agricolae package for statistical output.
Fig. 5.13
As a final quality assurance check, compare the means (the large red dots) in the figure to what was seen in the Oneway ANOVA output gained by using agricolae function(s) and argument(s). Do the means in the figure generated by using the onewaytests::gplot() function correspond to the prior Oneway ANOVA output generated by using the agricolae::HSD.test() function?
For this analysis and the remaining Oneway ANOVA analyses, the comments will be kept to a minimum. Follow along with the prior presentation of how functions from the agricolae package and the onewaytests package are used.
Oneway ANOVA of RBIRTH by Region at the National Level (Fig. 5.13)
RBIRTHNational.tbl %>%
dplyr::group_by(Region) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>% # Continue by using %>%
as.data.frame(.) %>% # Continue by using %>%
dplyr::mutate_if(is.numeric, round, 5)
agricolae::HSD.test(
stats::aov(value ~ Region,
data=RBIRTHNational.tbl), # Model
trt="Region", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Birth Rate per 1,000 Persons (RBIRTH) by Region
at the National Level Using Tukey's HSD (Honestly
Significant Difference) Parametric Oneway ANOVA:
2015 to 2019 ")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
Region, means
# Oneway ANOVA of RDEATH by Year at the National Level
RDEATHNational.tbl <-
RBIRTHandRDEATHNational2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
# Retain the RDEATH rows.
RDEATHNational.tbl$year <-as.factor(RDEATHNational.tbl$year)
RDEATHNational.tbl$GEOID<-as.numeric(RDEATHNational.tbl$GEOID)
RDEATHNational.tbl$Region <-as.factor(RDEATHNational.tbl$Region)
RDEATHNational.tbl$value <-as.numeric(RDEATHNational.tbl$value)
# Put the year, GEOID, Region, and value data into desired
# format, anticipating future needs and function requirements.
base::attach(RDEATHNational.tbl)
utils::str(RDEATHNational.tbl)
base::unique(RDEATHNational.tbl$variable)
# Confirm that only RDEATH shows in the
# enumerated dataset.
RDEATHNational.tbl %>%
dplyr::group_by(year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>% # Continue by using %>%
as.data.frame(.) %>% # Continue by using %>%
dplyr::mutate_if(is.numeric, round, 5)
agricolae::HSD.test(
stats::aov(value ~ year,
data=RDEATHNational.tbl), # Model
trt="year", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Death Rate per 1,000 Persons (RDEATH) by Year
at the National Level Using Tukey's HSD (Honestly
Significant Difference) Parametric Oneway ANOVA:
2015 to 2019 ")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
year, means
value std r Min Max
2015 9.80019 2.62162 3142 0.00000 22.4844
2016 10.22399 2.71406 3142 0.00000 22.6941
2017 10.13915 2.66652 3142 0.00000 22.5912
2018 10.80759 2.88947 3142 1.07124 25.3978
2019 10.40725 2.68092 3142 0.00000 22.1019
value groups
2018 10.80759 a
2019 10.40725 b
2016 10.22399 bc
2017 10.13915 c
2015 9.80019 d
RDEATHNationalYear.fig <-
onewaytests::gplot(value ~ year, data= RDEATHNational.tbl,
type = "errorbar", option = "se") +
geom_point(color="red", size=3) +
labs(
title =
"Death Rate per 1,000 Persons (RDEATH) by Year at the
National Level: 2015 to 2019",
subtitle = "Data: Census Bureau Population Estimates",
x = "\nYear",
y = "Death Rate per 1,000 Persons\n") +
scale_y_continuous(labels=scales::comma, limits=c(9.5,
11.00), breaks=scales::pretty_breaks(n=7)) +
# As a challenge, add an annotated table using the
# prior example.
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X-axis text and/
# or Y-axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 5.13
par(ask=TRUE); RDEATHNationalYear.fig
RDEATHNational.tbl %>%
dplyr::group_by(Region) %>%
dplyr::summarize(
N = base::length(value),
Minimum= base::min(value),
Median= stats::median(value),
Mean= base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>% # Continue by using %>%
as.
data
.frame(.) %>% # Continue by using %>%
dplyr::mutate_if(is.numeric, round, 5)
agricolae::HSD.test(
stats::aov(value ~ Region,
data=RDEATHNational.tbl), # Model
trt="Region", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Death Rate per 1,000 Persons (RDEATH) by Year
at the National Level Using Tukey's HSD (Honestly
Significant Difference) ParametricOneway ANOVA:
2015 to 2019 ")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
Region, means
Give special attention to how the Oneway ANOVA output for RDEATH by
Region is quite unique and not seen all that often. The mean for each regional break-
out (e.g., Midwest, Northeast, South, and West) is unique and there is no commonal-
ity (e.g., overlap for groups a, b, c, and d) in terms of group membership for any of
the four groups (Fig. 5.14).
Fig. 5.14
RDEATHNationalRegion.fig <-
onewaytests::gplot(value ~ Region, data= RDEATHNational.tbl,
type = "errorbar", option = "se") +
geom_point(color="red", size=3) +
labs(
title =
"Death Rate per 1,000 Persons (RDEATH) by Region at the
National Level: 2015 to 2019",
subtitle = "Data: Census Bureau Population Estimates",
x = "\nRegion",
y = "Death Rate per 1,000 Persons\n") +
scale_y_continuous(labels=scales::comma, limits=c(08.50,
11.00), breaks=scales::pretty_breaks(n=7)) +
# As a challenge, add an annotated table using the
# prior example.
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X-axis text and/
# or Y-axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 5.14
par(ask=TRUE); RDEATHNationalRegion.fig
Twoway ANOVA is used next to examine RBIRTH measured value datapoints by year (e.g., 2015, 2016, 2017, 2018, 2019)
and by Region (Midwest, Northeast, South, West):
RBIRTHTwowayYR.aov <-
stats::aov(value ~ year * Region,
data=RBIRTHNational.tbl)
# Twoway ANOVA for Y (year), and
# R (Region) --TwowayYR
base::attach(RBIRTHTwowayYR.aov)
base::class(RBIRTHTwowayYR.aov)
base::print(RBIRTHTwowayYR.aov)
Call:
stats::aov(formula = value ~ year * Region, data=
RBIRTHNational.tbl)
Terms:
year Region year:Region Residuals
Sum of Squares 1098.8 2755.5 96.0 105893.5
Deg. of Freedom 4 3 12 15690
base::summary(RBIRTHTwowayYR.aov)
# Wrap the base::summary around
# TwowayYR.aov, the enumerated
# object.
The overall outcome for Twoway ANOVA Using Base R at the top level of
understanding is easy to interpret:
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RBIRTH by year. The calculated (p <= 0.0000000000000002) is certainly less
than (p <= 0.05), confirming significant difference by year. Also note the use of
three asterisks, a common notation used to indicate statistically significant differ-
ence. Go back to the Oneway ANOVA findings to see exactly which years are in
common and which years are unique.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RBIRTH by Region. The calculated (p <= 0.0000000000000002) is certainly
less than (p <= 0.05), confirming significant difference by Region. Also note the
use of three asterisks, a common notation indicating statistically significant difference. Go back to the Oneway ANOVA findings to see exactly which Regions are in common and which Regions are unique.
Twoway ANOVA Using the Rfit Package: RBIRTH by year and by Region
It is always prudent to confirm findings, to look for consistency. There will be
times when individual statistics may be slightly different when different functions
are used due to selected algorithms, methods, rounding, etc. When there are differ-
ences between comparative analyses, they should be minimal and certainly summa-
tive outcomes should be consistent.
Use the Rfit package to confirm what was gained using
base::summary(RBIRTHTwowayYR.aov). Expect slight differences in calculated
statistics, but the outcomes should clearly be in parity.
install.packages("Rfit", dependencies=TRUE)
library(Rfit)
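The Rfit call that generates the table below is not reprinted at this point in the lesson; the comparison named a few sentences later indicates that it takes the following form:
Rfit::raov(value ~ year + Region,
data=RBIRTHNational.tbl)
# Rank-based (robust) Twoway ANOVA; compare the p-values in the
# table below to those from base::summary(RBIRTHTwowayYR.aov).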
Robust ANOVA Table
DF RD Mean RD F p-value
year 4 149.5876 37.39689 34.17400 0.00000
Region 3 635.9539 211.98465 193.71565 0.00000
year:Region 12 19.7175 1.64313 1.50152 0.11528
Going back to the comment that multiple approaches help with confirmation,
notice how use of the Rfit::raov() function confirmed Twoway ANOVA results
gained from using Base R, but with slightly different output. The calculated p-values
confirm that there are statistically significant (p <= 0.05) differences for year and
Region. Equally, the calculated p-value for interaction by year and by Region (e.g.,
year:Region) was in parity with what was seen using Base R, but output for the
calculated p-value was not exactly the same for both functions (e.g.,
base::summary(RBIRTHTwowayYR.aov) v Rfit::raov(value ~ year + Region, data=RBIRTHNational.tbl)).
install.packages("WRS2", dependencies=TRUE)
library(WRS2)
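For completeness, the call whose output appears immediately below (and is echoed in its Call: line) is:
WRS2::t2way(formula = value ~ year * Region,
data=RBIRTHNational.tbl, tr = 0.1)
# Robust Twoway ANOVA on 10 percent trimmed means.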
Call:
WRS2::t2way(formula = value ~ year * Region, data =
RBIRTHNational.tbl,
tr = 0.1)
value p.value
year 148.7459 0.001
Region 1038.5548 0.001
year:Region 15.7937 0.205
Once again, but now using the WRS2::t2way() function, the outcomes are con-
sistent even if the calculated statistics are slightly different. Regarding birth rates
per 1000 persons (RBIRTH, only):
• The p-value for year provides evidence that there are statistically significant
(p <= 0.05) differences by year.
• The p-value for Region provides evidence that there are statistically significant
(p <= 0.05) differences by Region.
• The p-value for year:Region provides evidence that there is no statistically sig-
nificant (p <= 0.05) interaction between year and Region.
Finally, an interaction plot is the best way to end this part of the lesson, where
focus will be on RBIRTH by year and by Region. Give special attention to any
overlaps that may occur but equally those cases where there is simply no overlap.
From a few different choices, the CGPfunctions::Plot2WayANOVA() function will
be used to generate the interaction plot. This function was purposely selected not
only because the figure is informative, but there is accompanying output about the
Twoway ANOVA that can be used to serve as a quality assurance check once again
for consistency and depth of understanding mean comparisons of RBIRTH between
and among the relevant object variables.
Be sure to note that the CGPfunctions::Plot2WayANOVA() function is compat-
ible with the ggplot2::ggplot() function. After a few trial and error attempts, it was
decided to revise the theme_Mac() function to accommodate how legend titles are
suppressed when using the CGPfunctions::Plot2WayANOVA() function, which is
compatible with the ggplot2::ggplot() function, but in a slightly different way -- thus
the need to alter theme_Mac() (Fig. 5.15).
Fig. 5.15
###############################################################
theme_MacNoLegendTitle <-function(base_size=12,
base_family ="sans"){
theme_stata() +
theme( # Embellishments to theme_stata()
plot.title=element_text(face="bold", size=14, hjust=0.5),
plot.subtitle=element_text(face="bold", size=12,
hjust=0.5),
plot.caption=element_text(face="bold", size=10, hjust=0.5),
axis.title.x=element_text(face="bold", size=14, hjust=0.5),
axis.text.x=element_text(face="bold", size=12, hjust=0.5),
axis.title.y=element_text(face="bold", size=14, hjust=0.5,
vjust=1, angle=90),
axis.text.y=element_text(face="bold", size=12, hjust=0.5),
legend.title=element_blank(),
# Suppress the legend title.
# legend.title=element_text(face="bold", size=12),
legend.text=element_text(face="bold", size=12),
axis.ticks.x=element_line(size=1.2),
axis.ticks.y=element_line(size=1.2),
axis.ticks.length=unit(0.25,"cm"),
panel.background=element_rect(fill="whitesmoke")
)
}
# hjust - horizontal justification; 0 = left edge to 1 = right
# edge, with 0.5 the default
# vjust - vertical justification; 0 = bottom edge to 1 = top
# edge, with 0.5 the default
# angle - rotation; generally 1 to 90 degrees, with 0 the
# default
base::class(theme_MacNoLegendTitle)
# Confirm that the user-created object
# theme_MacNoLegendTitle() is a function.
###############################################################
install.packages("CGPfunctions", dependencies=TRUE)
library(CGPfunctions)
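The exact Plot2WayANOVA() call used to produce Fig. 5.15 is not reproduced on this page. As a minimal sketch, assuming the default interaction-plot settings are acceptable (the object name below is hypothetical), the function needs only a formula and a data frame:
# Minimal sketch of an interaction plot for RBIRTH by year and Region.
# Additional arguments (confidence level, plot type, theme) exist; see
# ?CGPfunctions::Plot2WayANOVA before relying on them.
RBIRTHTwowayYR.int <-
CGPfunctions::Plot2WayANOVA(
formula = value ~ year * Region,
dataframe = RBIRTHNational.tbl)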
RDEATHTwowayYR.aov <-
stats::aov(value ~ year * Region,
data=RDEATHNational.tbl)
# Twoway ANOVA for Y (year), and
# R (Region) --TwowayYR
base::attach(RDEATHTwowayYR.aov)
base::class(RDEATHTwowayYR.aov)
base::print(RDEATHTwowayYR.aov)
Call:
stats::aov(formula = value ~ year * Region, data =
RDEATHNational.tbl)
Terms:
year Region year:Region Residuals
Sum of Squares 1720.7 8335.2 192.7 107330.2
Deg. of Freedom 4 3 12 15690
base::summary(RDEATHTwowayYR.aov)
# Wrap the base::summary around
# TwowayYR.aov, the enumerated
# object.
The overall outcome for Twoway ANOVA Using Base R at the top level of
understanding is easy to interpret:
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RDEATH by year. The calculated (p <= 0.0000000000000002) is certainly less
than (p <= 0.05), confirming significant difference by year. Also note the use of
three asterisks, a common notation indicating statistically significant difference.
Go back to the Oneway ANOVA findings to see exactly which years are in com-
mon and which years are unique.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RDEATH by Region. The calculated (p <= 0.0000000000000002) is certainly
less than (p <= 0.05), confirming significant difference by Region. Also note the
use of three asterisks, a common notation indicating statistically significant dif-
ference. Go back to the Oneway ANOVA findings to see exactly which Regions
are in common and which years are unique.
Twoway ANOVA Using the Rfit Package: RDEATH by year and by Region
It is always prudent to confirm findings, to look for consistency. There will be
times when individual statistics may be slightly different when different functions
are used due to selected algorithms, methods, rounding, etc. When there are differ-
ences, they should be minimal and, certainly, outcomes should be consistent.
Use the Rfit package to confirm what was gained using
base::summary(RDEATHTwowayYR.aov). Expect slight differences in calculated
statistics, but the outcomes should clearly be in parity.
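As with RBIRTH, the Rfit call itself is not reprinted here; consistent with the comparison named below, it takes the following form:
Rfit::raov(value ~ year + Region,
data=RDEATHNational.tbl)
# Rank-based (robust) Twoway ANOVA for RDEATH; compare the p-values
# in the table below to those from base::summary(RDEATHTwowayYR.aov).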
Robust ANOVA Table
DF RD Mean RD F p-value
year 4 149.9147 37.47868 29.34608 0.00000
Region 3 1705.3449 568.44832 445.09915 0.00000
year:Region 12 32.4758 2.70632 2.11907 0.01297
Going back to the comment that multiple approaches help with confirmation,
notice how use of the Rfit::raov() function confirmed Twoway ANOVA results
gained from using Base R, but with slightly different output. The calculated p-values
confirm that there are statistically significant (p <= 0.05) differences for year and
Region. Equally, the calculated p-value for interaction by year and by Region (e.g.,
year:Region) was in parity with what was seen using Base R, but output for the
calculated p-value was not exactly the same for both functions (e.g.,
base::summary(RDEATHTwowayYR.aov) v Rfit::raov(value ~ year + Region,
data=RDEATHNational.tbl)). See the prior comment about the way differences in calculated statistics can occur when different functions are used.
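For completeness, the call whose output follows (and is echoed in its Call: line) is:
WRS2::t2way(formula = value ~ year * Region,
data=RDEATHNational.tbl, tr = 0.1)
# Robust Twoway ANOVA on 10 percent trimmed means, now for RDEATH.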
Call:
WRS2::t2way(formula = value ~ year * Region, data =
RDEATHNational.tbl,
tr = 0.1)
value p.value
year 125.4927 0.001
Region 1220.4110 0.001
year:Region 25.7543 0.013
Once again, now using the WRS2::t2way() function, the outcomes are consistent
even if the calculated statistics are slightly different. Regarding death rates per 1000
persons (RDEATH, only):
• The p-value for year provides evidence that there are statistically significant
(p <= 0.05) differences by year.
• The p-value for Region provides evidence that there are statistically significant
(p <= 0.05) differences by Region.
• The p-value for year:Region provides evidence that there is a statistically signifi-
cant (p <= 0.05) interaction between year and Region.
Finally, an interaction plot is the best way to end this part of the lesson, where
focus will be on RDEATH by year and by Region. Give special attention to any
overlaps that may occur but equally those cases where there is simply no overlap.
Again, the CGPfunctions::Plot2WayANOVA() function will be used to generate the
interaction plot (Fig. 5.16).20
19
The inquiry into possible interaction for year:Region is especially interesting in this example, for
RDEATH datapoints of the object value. The interaction is statistically significant at (p <= 0.05),
but can the same be said for (p <= 0.01)? Why are there only two asterisks for this p-value when
statistical significance was symbolized previously with three asterisks? It is unacceptable to say
something such as: There is a statistically significant difference between X and Y in terms of Z. It
is instead necessary to say something such as: There is a statistically significant difference (p <=
0.05) between X and Y in terms of Z. The inclusion of (p <= 0.05, or any other declared p-value) in
this statement is not merely a fine point, but it is, instead, essential.
20
This text is focused on the use of R in data science at an introductory level. It would be far
beyond the purpose of this text to discuss in great detail the complexity of interactions between and
Fig. 5.16
among variables, how to use R to go beyond top-level statistical analysis of interactions, and inter-
pretations of the same. There are many R-based resources for those who desire more finite detail
on this topic, but as a starting point, consider a review of the many vignettes associated with the
interactions package, especially discussion of the Johnson-Neyman (J-N) interval and plots of this
statistic.
It would be interesting to see how the RBIRTH rates per 1000 persons and RDEATH
rates per 1000 persons compare to each other. Can the tidyverse ecosystem be used
to gain estimates of correlation between the two?
WRBIRTHandRDEATHNational2015Onward.tbl <-
RBIRTHandRDEATHNational2015Onward.tbl %>%
tidyr::pivot_wider(names_from = variable,
values_from = value)
# Put the data into WIDE format, to allow for
# simple comparisons between RBIRTH and RDEATH.
#
# Although the tidyverse ecosystem is often
# used to put wide data into long format, there
# are times when the opposite is an appropriate
# approach to data organization, the need to
# put long data into wide format.
base::getwd()
base::ls()
base::attach(WRBIRTHandRDEATHNational2015Onward.tbl)
utils::str(WRBIRTHandRDEATHNational2015Onward.tbl)
dplyr::glimpse(WRBIRTHandRDEATHNational2015Onward.tbl)
base::summary(WRBIRTHandRDEATHNational2015Onward.tbl)
base::print(WRBIRTHandRDEATHNational2015Onward.tbl)
With the wide data in final form, it is now a simple task to estimate the associa-
tion between RBIRTH and RDEATH (Fig. 5.17)21:
Fig. 5.17
21
The term often used with correlation and association is estimate. Pearson’s r correlation coeffi-
cient (parametric) is considered an estimate of association and the same concept applies to
Spearman’s rho correlation coefficient (nonparametric).
stats::cor.test(
WRBIRTHandRDEATHNational2015Onward.tbl$RBIRTH,
WRBIRTHandRDEATHNational2015Onward.tbl$RDEATH,
method = c("pearson"))
# It is viewed that the RBIRTH and RDEATH data
# are parametric and it is appropriate to use
# Pearson's r to estimate the association.
Pearson's product-moment correlation
data: WRBIRTHandRDEATHNational2015Onward.tbl$RBIRTH and
WRBIRTHandRDEATHNational2015Onward.tbl$RDEATH
t = -24.65, df = 15708, p-value<0.0000000000000002
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.207988 -0.177878
sample estimates:
cor
-0.192978
PearsonRBIRTHandRDEATHNational2015Onward.fig <-
ggplot2::ggplot(WRBIRTHandRDEATHNational2015Onward.tbl,
aes(x=RBIRTH, y=RDEATH)) +
geom_point() +
geom_smooth(method=lm, se=FALSE, lwd=3, color="red") +
labs(
title =
"Pearson's r Estimate of AssociationBetween RBIRTH
and RDEATH, from 2015 to 2019: National",
subtitle = "Data: Census Bureau Population Estimates",
x = "\nBirth Rate per 1,000 Persons (RBIRTH)\n",
y = "\nDeath Rate per 1,000 Persons (RDEATH)\n") +
annotate("text", x=20, y=25, fontface="bold", size=05,
color="black", hjust=0, family="mono",
label="Pearson's r = -0.192978") +
scale_x_continuous(labels=scales::comma, limits=c(0, 32)) +
scale_y_continuous(labels=scales::comma, limits=c(0, 27)) +
theme_Mac()
# Fig. 5.17
par(ask=TRUE); PearsonRBIRTHandRDEATHNational2015Onward.fig
Fig. 5.18
Fig. 5.19
RBIRTHNat2019.map <-
RBIRTHNational.tbl %>%
dplyr::filter(year %in% c("2019")) %>%
dplyr::select(-c('NAME', 'variable', 'Region'))
# The map will be restricted to RBIRTH datapoints
# for value in 2019.
base::attach(RBIRTHNat2019.map)
utils::str(RBIRTHNat2019.map)
base::names(RBIRTHNat2019.map)[names(RBIRTHNat2019.map) ==
'GEOID'] <- 'region'
# Use the base::names() function to rename the object
# variable GEOID to region. Give attention to the use
# of lowercase in this instance, region and NOT Region,
# to meet variable name requirements when using the map-
# based choroplethr package.
base::attach(RBIRTHNat2019.map)
utils::str(RBIRTHNat2019.map)
tibble[3,142 × 3] (S3: tbl_df/tbl/data.frame)
$ year : Factor w/ 5 levels "2015","2016",..: 5 5 5 5 5 5
$ region: num [1:3142] 17001 17009 17003 17005 17007 ...
$ value : num [1:3142] 11.84 8.52 9.15 8.83 10.54 ...
choroplethr::county_choropleth(RBIRTHNat2019.map,
title = "Birth Rate per 1,000 Persons in 2019:
California",
legend = "Birth Rate",
num_colors = 7,
state_zoom = c("california")) +
theme(plot.title = element_text(hjust = 0.5))
# DRAFT map, with 7 colors and no embellishments.
choroplethr::county_choropleth(RBIRTHNat2019.map,
title = "Birth Rate per 1,000 Persons (RBIRTH) by County
in 2019: California",
legend = "Birth Rate",
num_colors = 9,
state_zoom = c("california")) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="left")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is left justified,
# the title is centered, etc. Other embellishments could
# be provided, but it is judged that they are not needed.
#
# Map with 9 colors and embellishments.
# Fig. 5.18
RDEATHNat2019.map <-
RDEATHNational.tbl %>%
dplyr::filter(year %in% c("2019")) %>%
dplyr::select(-c('NAME', 'variable', 'Region'))
# The map will be restricted to RDEATH datapoints
# for value in 2019.
base::attach(RDEATHNat2019.map)
utils::str(RDEATHNat2019.map)
base::names(RDEATHNat2019.map)[names(RDEATHNat2019.map) ==
'GEOID'] <- 'region'
# Use the base::names() function to rename the object
# variable GEOID to region. Give attention to the use
# of lowercase in this instance, region and NOT Region,
# to meet variable name requirements when using the map-
# based choroplethr package.
base::attach(RDEATHNat2019.map)
utils::str(RDEATHNat2019.map)
choroplethr::county_choropleth(RDEATHNat2019.map,
title = "Death Rate per 1,000 Persons by County in 2019:
California",
legend = "Death Rate",
num_colors = 7,
state_zoom = c("california")) +
theme(plot.title = element_text(hjust = 0.5))
# DRAFT map, with 7 colors and no embellishments.
choroplethr::county_choropleth(RDEATHNat2019.map,
title = "Death Rate per 1,000 Persons (RDEATH) by County
in 2019: California",
legend = "Death Rate",
num_colors = 9,
state_zoom = c("california")) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="left")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is left justified,
# the title is centered, etc. Other embellishments could
# be provided, but it is judged that they are not needed.
#
# Map with 9 colors and embellishments.
# Fig. 5.19
Presentation of Outcomes
The analysis of RBIRTH and RDEATH for multiple years (e.g., 2015, 2016, 2017,
2018, 2019) and by multiple regions (e.g., Midwest, Northeast, South, and West) has
been the focus of this addendum, looking at outcomes from multiple viewpoints.
Going back to the original set of Null Hypothesis statements, the following out-
comes can now be stated:
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RBIRTH by year. The calculated (p <= 0.0000000000000002) is certainly less
than (p <= 0.05), confirming significant difference by year.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RBIRTH by Region. The calculated (p <= 0.0000000000000002) is certainly
less than (p <= 0.05), confirming significant difference by Region.
• It is confirmed that there is no statistically significant (p <= 0.05) interaction for
RBIRTH by year and by Region. The calculated (p <= 0.29) is certainly greater
than (p <= 0.05), confirming that there is no significant interaction for RBIRTH
by year and by Region.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RDEATH by year. The calculated (p <= 0.0000000000000002) is certainly less
than (p <= 0.05), confirming significant difference by year.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RDEATH by Region. The calculated (p <= 0.0000000000000002) is certainly
less than (p <= 0.05), confirming significant difference by Region.
• It is confirmed that there is a statistically significant (p <= 0.05) interaction for
RDEATH by year and by Region. The calculated (p <= 0.0053) is certainly less
than (p <= 0.05), confirming that there is a significant interaction for RDEATH
by year and by Region.
• It is confirmed that Pearson’s r for RBIRTH v RDEATH was -0.192978, with
p-value <= 0.0000000000000002, bringing into question any meaningful (e.g.,
practical) interpretation of association.
As time permits, go back to the prior analyses in this addendum and the accom-
panying figures to review exact statistics gained from the many analyses, knowing
that screen output has been edited in some cases to save space since many analyses
produce long sections of output. It is especially valuable to see printouts that clearly
identify commonality in group means and differences in group means. For those
with a special interest, investigate the many health-related, economic, and other
social contributors to the outcomes, those comparisons where there are no mean
differences between comparative groups and those comparative groups where there
are mean differences.
Addendum 2: A Nonparametric Approach to Statistical Analyses and Graphical Presentations
Addendum 1 addressed federal data on the two breakouts RBIRTH and RDEATH. It
would only be redundant to recapitulate discussion about the data. As a reminder, all
analyses demonstrated in Addendum 1 assumed that the data were such that it was
appropriate to base analyses from a parametric perspective – to assume that there
were acceptable levels of normal distribution, to assume that the data were inter-
val, etc.
The focus of Addendum 2 is to use the original dataset and breakout datasets
developed in Addendum 1 and look at the data again, but now from a nonparametric
perspective. Only a limited degree of discussion about the data is needed here since much
was stated in Addendum 1.
Nonparametric – Oneway ANOVA RBIRTH
install.packages("rstatix", dependencies=TRUE)
library(rstatix)
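The Kruskal-Wallis call itself is not shown above; a call of the following form produces the output that follows:
stats::kruskal.test(value ~ year,
data=RBIRTHNational.tbl)
# Nonparametric counterpart to the earlier Oneway ANOVA of RBIRTH by
# year; compare the chi-squared, df, and p-value to the output below.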
data: value by year
Kruskal-Wallis chi-squared = 193.6, df = 4, p-value < 0.0000000000000002
RBIRTHNational.tbl %>%
rstatix::dunn_test(value ~ year)
# A tibble: 10 x 9
.y. group1 group2 n1 n2 p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <chr>
1 value 2015 2016 3142 3142 4.75e- 2 1.42e- 1 ns
2 value 2015 2017 3142 3142 4.03e- 3 1.61e- 2 *
3 value 2015 2018 3142 3142 1.63e- 1 3.26e- 1 ns
4 value 2015 2019 3142 3142 3.52e-29 3.16e-28 ****
5 value 2016 2017 3142 3142 3.71e- 1 3.71e- 1 ns
6 value 2016 2018 3142 3142 7.35e- 4 3.67e- 3 **
7 value 2016 2019 3142 3142 2.68e-20 2.14e-19 ****
8 value 2017 2018 3142 3142 1.95e- 5 1.17e- 4 ***
9 value 2017 2019 3142 3142 7.61e-17 5.33e-16 ****
10 value 2018 2019 3142 3142 1.92e-36 1.92e-35 ****
RBIRTHNational.tbl %>%
rstatix::dunn_test(value ~ Region)
# A tibble: 6 x 9
.y. group1 group2 n1 n2 p p.adj p.adj.signif
* <chr> <chr> <chr> <int><int> <dbl> <dbl> <chr>
1 value Midwest Northeast 5275 1085 5.93e-117 2.96e-116 ****
2 value Midwest South 5275 7110 9.66e- 1 1 e+ 0 ns
3 value Midwest West 5275 2240 4.32e- 1 1 e+ 0 ns
4 value Northeast South 1085 7110 5.23e-122 3.14e-121 ****
5 value Northeast West 1085 2240 1.37e- 90 5.46e- 90 ****
6 value South West 7110 2240 4.32e- 1 1 e+ 0 ns
RDEATHNational.tbl %>%
rstatix::dunn_test(value ~ year)
# A tibble: 10 x 9
.y. group1 group2 n1 n2 p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <chr>
1 value 2015 2016 3142 3142 2.89e-10 1.73e- 9 ****
2 value 2015 2017 3142 3142 3.77e- 7 1.88e- 6 ****
3 value 2015 2018 3142 3142 2.34e-43 2.34e-42 ****
4 value 2015 2019 3142 3142 2.48e-19 2.24e-18 ****
5 value 2016 2017 3142 3142 2.21e- 1 2.21e- 1 ns
6 value 2016 2018 3142 3142 6.32e-14 4.43e-13 ****
7 value 2016 2019 3142 3142 7.26e- 3 1.45e- 2 *
8 value 2017 2018 3142 3142 2.65e-18 2.12e-17 ****
9 value 2017 2019 3142 3142 9.26e- 5 2.78e- 4 ***
10 value 2018 2019 3142 3142 1.46e- 6 5.84e- 6 ****
RDEATHNational.tbl %>%
rstatix::dunn_test(value ~ Region)
# A tibble: 6 x 9
.y. group1 group2 n1 n2 p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <chr>
1 value Midwest Northeast 5275 1085 2.72e- 10 2.72e- 10 ****
2 value Midwest South 5275 7110 7.20e- 15 1.44e- 14 ****
3 value Midwest West 5275 2240 3.10e-152 1.55e-151 ****
4 value Northeast South 1085 7110 3.59e- 27 1.08e- 26 ****
5 value Northeast West 1085 2240 2.18e- 34 8.71e- 34 ****
6 value South West 7110 2240 1.47e-241 8.81e-241 ****
install.packages("rcompanion", dependencies=TRUE)
library(rcompanion)
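The two result blocks below carry the output signature of the rcompanion::scheirerRayHare() function, a rank-based counterpart to the earlier Twoway ANOVA analyses. Calls of the following form, first for RBIRTH and then for RDEATH, are consistent with that output; treat the exact arguments as an assumption:
rcompanion::scheirerRayHare(value ~ year + Region,
data=RBIRTHNational.tbl)
rcompanion::scheirerRayHare(value ~ year + Region,
data=RDEATHNational.tbl)
# Scheirer-Ray-Hare test: an extension of the Kruskal-Wallis test to
# a two-factor design.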
DV: value
Observations: 15710
D: 1
MS total: 20568318
Df Sum Sq H p.value
year 4 3981854707 193.6 0.0000
Region 3 12107930757 588.7 0.0000
year:Region 12 211173885 10.3 0.5926
Residuals 15690 306806739699
DV: value
Observations: 15710
D: 1
MS total: 20568318
Df Sum Sq H p.value
year 4 4249998071 206.6 0.00000
Region 3 23496839829 1142.4 0.00000
year:Region 12 481830699 23.4 0.02432
Residuals 15690 294879029665
WRBIRTHandRDEATHMidwest2015Onward.tbl <-
RBIRTHandRDEATHMidwest2015Onward.tbl %>%
tidyr::pivot_wider(names_from = variable,
values_from = value)
# Put the data into WIDE format, to allow for
# simple comparisons between RBIRTH and RDEATH.
base::getwd()
base::ls()
base::attach(WRBIRTHandRDEATHMidwest2015Onward.tbl)
utils::str(WRBIRTHandRDEATHMidwest2015Onward.tbl)
dplyr::glimpse(WRBIRTHandRDEATHMidwest2015Onward.tbl)
base::summary(WRBIRTHandRDEATHMidwest2015Onward.tbl)
base::print(WRBIRTHandRDEATHMidwest2015Onward.tbl)
stats::cor.test(
WRBIRTHandRDEATHMidwest2015Onward.tbl$RBIRTH,
WRBIRTHandRDEATHMidwest2015Onward.tbl$RDEATH,
method = c("spearman"))
# Spearman's rho - nonparametric
stats::cor.test(
WRBIRTHandRDEATHMidwest2015Onward.tbl$RBIRTH,
WRBIRTHandRDEATHMidwest2015Onward.tbl$RDEATH,
method = c("pearson"))
# Pearson's r - parametric
Addendum 3: Data Wrangling, and Then Statistical Analyses and Mapping
Historically, Kansas has been one of the leading wheat-growing states, and it
retains its high ranking among all states in terms of wheat acreage, total production
of wheat, and wheat yields. There are many wheat types, but most wheat grown in
Kansas is classified as Hard Red Winter (HRW) wheat, which is known for having
high protein and strong gluten.22
The dataset for Addendum 3 was gained from the United States Department of
Agriculture (USDA) National Agricultural Statistics Service (NASS) Access Quick
Stats graphical user interface (GUI), https://quickstats.nass.usda.gov/. This GUI-
based resource was purposely selected, as opposed to the use of an API for data
retrieval. APIs have been demonstrated throughout this text and are the focus of a
later lesson.
Challenge: Similar to what has been presented throughout this lesson, follow
along with the outline that guides statistical analyses, but in a tidy and well-
organized manner:
• Background
–– Description of the Data
–– Null Hypothesis
• Import Data
• Code Book and Data Organization
• Exploratory Graphics
–– Graphics Using Base R
–– Graphics Using the tidyverse Ecosystem
• Exploratory Descriptive Statistics and Measures of Central Tendency
• Exploratory Analyses
• Presentation of Outcomes
Prepare a technical memorandum on the subject matter, data, process, and out-
comes for what is presented in this addendum. Even for those who are inexperi-
enced in wheat production, there are many resources that, for those with an interest,
can produce a cogent narrative relative to the data and how the data are obtained and
used. For many, wheat is part of the daily diet, thus the selection of a dataset on this
22
For those with special interest, review available resources to learn more about the many wheat
types commonly grown in the United States, including Hard Red Spring (HRS), Hard Red Winter
(HRW), Soft Red Winter (SRW), Hard White (HW), Soft White (SW), and Durum. Review the
environmental conditions, soil types, farming practices, etc. that are best for each wheat type.
Consider how this information fits into the background of the proposed memorandum associated
with this addendum.
critical crop and, from this base information, a look at the many biostatistics implications for
its production.
When preparing the technical memorandum, consider a reader who knows little
about Kansas and less about wheat farming. Use available resources to provide a
meaningful discussion on the history of wheat production in Kansas, current trends,
and reasons why there are such marked differences in yield by county. Keep the
discussion at a level that does not require expertise in agronomy.
As a reminder, most parts of this addendum will be completed by those who take
the challenge of starting out with a dataset that is somewhat complex. A few begin-
ning actions and ending actions will be presented, but most parts of this addendum
are self-completed, especially for those who have followed along with all lessons up
to this point and now have a skill set with R and the tidyverse that is greatly expand-
ing. Be sure to notice the many data wrangling actions needed to put the dataset into
good form, to support descriptive statistics, formative and final form graphics, infer-
ential analyses, mapping, and ultimately a concise conclusion. Occasionally, a few
hints will be offered, but there are so many possibilities on how the data can be used
that prescriptive mandates are generally avoided.
Background
The syntax used to import the data is demonstrated in this lesson, to provide a start-
ing point. Also note how data wrangling is used to put the data in good form. Give
special attention to a few challenges, such as efforts used to accommodate FIPS
(Federal Information Processing Standard) county codes, object variable name
requirements when using the choroplethr package, etc.
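The import statement itself does not appear on this page. As a sketch, assuming the Quick Stats query was downloaded as a comma-delimited file (the file name here is hypothetical), the import is a single readr call:
# Hypothetical import of the Quick Stats CSV download; adjust the file
# name and path to match the local copy.
KSWheat.tbl <- readr::read_csv("KS_Wheat_QuickStats.csv")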
base::getwd()
base::ls()
base::attach(KSWheat.tbl)
utils::str(KSWheat.tbl)
dplyr::glimpse(KSWheat.tbl)
base::summary(KSWheat.tbl)
Give attention to the messy (e.g., not tidy) column names for many object vari-
ables, especially the way spaces show in compound words (e.g., Geo Level, Ag
District, Ag District Code, etc.). Use the janitor::clean_names() function to accom-
modate this concern.
KSWheatYield.tbl <-
KSWheat.tbl %>%
janitor::clean_names()
# There are a few objects that have spaces in the object
# name. Use the janitor::clean_names() function to put
# object variable names into a tidy format.
base::getwd()
base::ls()
base::attach(KSWheatYield.tbl)
utils::str(KSWheatYield.tbl)
dplyr::glimpse(KSWheatYield.tbl)
base::summary(KSWheatYield.tbl)
KSWheatYield1934Onward.tbl <-
KSWheatYield.tbl %>%
# The tibble KSWheatYield.tbl currently consists
# of data in 7,760 rows and 21 columns. Use the
# dplyr::select() function, accompanied by the -
# character to delete previously identified
# unnecessary object variables (e.g., columns).
dplyr::select(-c('week_ending',
'zip_code',
'region',
'watershed_code',
'watershed',
'domain_category',
'cv_percent'))
# The tibble KSWheatYield.tbl now consists
# of data in 7,760 rows and 14 columns,
# having removed the 7 columns that were
# not needed.
KSWheatYield1934Onward.tbl$year <-
forcats::as_factor(KSWheatYield1934Onward.tbl$year)
# Specifically, to point out that year needs to be
# viewed as a factor and not as a number, look at
# the way the forcats::as_factor() function was
# used to put year into factor format, from the
# original numeric format.
base::getwd()
base::ls()
base::attach(KSWheatYield1934Onward.tbl)
utils::str(KSWheatYield1934Onward.tbl)
dplyr::glimpse(KSWheatYield1934Onward.tbl)
base::summary(KSWheatYield1934Onward.tbl)
At first, it seems that the data are in good form and that the tibble is ready for use,
but there are a few problems involving FIPS code(s) and how they are presented in
the current dataset:
• The state FIPS code (e.g., state_ansi; Kansas FIPS = 20) is separate from
the county FIPS codes.
• The county FIPS codes (there are 105 counties in Kansas) are inconsistent in the
number of digits. The three-digit FIPS code for Barton County, Kansas, should
read as 009. However, in the dataset, Barton County, Kansas, reads as 9. Compare
this outcome to the FIPS code for Lincoln County, Kansas, which is presented in
the tibble as 105, the correct three-digit FIPS code.
• Due to these problems (separate state and county FIPS codes and, more importantly, inconsistent padding for three-digit county FIPS codes), there needs to be a way to properly identify counties by their five-digit FIPS codes. Fortunately,
look at the way each county is identified in upper case (e.g., CAPS) within the
object variable called county. To meet the challenge of having the correct five-
digit FIPS code for each county:
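The county FIPS lookup table used for this purpose, KSCountyFIPSCodes.tbl, is imported next. The import statement is not reprinted on this page; as a sketch of the general idea, with the file name assumed (the actual file has 3,232 rows and the three columns FIPS, Name, and State):
# Hypothetical import of a national county FIPS lookup file; the file
# name is an assumption. FIPS is read as character and is converted
# to numeric later in this addendum.
KSCountyFIPSCodes.tbl <-
readr::read_csv("CountyFIPSCodes.csv",
col_types = readr::cols(
FIPS = readr::col_character(),
Name = readr::col_character(),
State = readr::col_character()))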
base::attach(KSCountyFIPSCodes.tbl)
utils::str(KSCountyFIPSCodes.tbl)
The tibble KSCountyFIPSCodes.tbl currently has 3232 rows and 3 columns. Use
the tidyverse ecosystem to make the data specific to Kansas, only. Then: (1) follow
along with the need to put county names in upper case, to match how county names
are listed in the wheat dataset, (2) keep the rows with Kansas FIPS codes and delete
all other rows, and (3) rename the Name column to county, to have a common name
for what will be the join element when the two datasets are merged into one new
dataset.
23
Experienced data scientists collect, organize, and curate private datasets, which may need adjust-
ment to some degree but otherwise serve anticipated needs such as the needs for this FIPS-oriented
dataset. A personal collection of these files is often needed.
KSCountyFIPSCodesUPPER.tbl <-
KSCountyFIPSCodes.tbl %>%
dplyr::mutate(Name = stringr::str_to_upper(Name)) %>%
# Put county names into all UPPER CASE, to match
# the way they show in the wheat dataset, to
# facilitate the join.
dplyr::filter(State %in% c("KS")) %>%
# Retain the 105 KS (e.g., Kansas) county
# rows and delete all others.
dplyr::rename(county=Name)
# When using the dplyr::rename() function to rename a column,
# the format is New_Name and then Old_Name, which for some
# may not be intuitive. Notice also how a single = character
# is used in this instance.
# The name county was introduced to be consistent with how
# county name is listed in the wheat dataset, which is needed
# to join the two files.
#
# Merely as a recap: (1) the all inclusive dataset of ZIP
# codes was altered by putting county names in all UPPER
# CASE, (2) rows with data were retained for KS and all other
# rows were deleted, and (3) the object Name was renamed as
# county.
base::getwd()
base::ls()
base::attach(KSCountyFIPSCodesUPPER.tbl)
utils::str(KSCountyFIPSCodesUPPER.tbl)
dplyr::glimpse(KSCountyFIPSCodesUPPER.tbl)
base::summary(KSCountyFIPSCodesUPPER.tbl)
# As expected, there are 105 rows in the dataset
# KSCountyFIPSCodesUPPER.tbl, one row representing
# one five-digit FIPS code for each Kansas county.
The data wrangling process continues, where it is now fairly easy to use the
dplyr::left_join() function to join KSWheatYield1934Onward.tbl (the adjusted
Kansas wheat yield dataset) and KSCountyFIPSCodesUPPER.tbl (the adjusted
Kansas county FIPS dataset) into one common dataset, to be called
KSWheatYieldByCounty1934to2007.tbl. The common object for each of the two
datasets is the object (e.g., column) named county, with county names in all
upper case.
Once the join process is completed and the two datasets are merged into one new
dataset, by using the %>% operator note how the FIPS object is renamed to region,
an object variable name that is needed for use with the choroplethr package. The
other term required for the choroplethr package is value, which is currently the
name for wheat yields (bushels per acre), so no action is needed for this requirement.
KSWheatYieldByCounty1934to2007.tbl <-
dplyr::left_join(KSWheatYield1934Onward.tbl,
KSCountyFIPSCodesUPPER.tbl, by = "county") %>%
# Apply the dplyr::left_join() function, with
# county (in UPPER CASE) the common variable in
# both datasets.
dplyr::rename(region=FIPS)
# The term region, representing FIPS codes in
# numeric format, is needed for use of the
# choroplethr package.
base::attach(KSWheatYieldByCounty1934to2007.tbl)
utils::str(KSWheatYieldByCounty1934to2007.tbl)
It seems that the only remaining task is the need to put the object variable region
into numeric format, changing it from the current character format. This task is
needed for successful use of the choroplethr package.
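A single conversion statement, in the style used earlier in this lesson, would complete the task:
KSWheatYieldByCounty1934to2007.tbl$region <-
base::as.numeric(KSWheatYieldByCounty1934to2007.tbl$region)
# Put region (the five-digit FIPS code) into numeric format, as
# required by the choroplethr::county_choropleth() function.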
base::getwd()
base::ls()
base::attach(KSWheatYieldByCounty1934to2007.tbl)
utils::str(KSWheatYieldByCounty1934to2007.tbl)
dplyr::glimpse(KSWheatYieldByCounty1934to2007.tbl)
base::summary(KSWheatYieldByCounty1934to2007.tbl)
The data transformations (e.g., wrangling) of the original Kansas wheat yield
dataset, gained from the USDA NASS, required careful planning so that the final
dataset meets needs for descriptive statistics, graphics, inferential statistics, and as a
value-added activity – mapping.
Challenge: Prepare a code book and explain the many ways wrangling was used to
organize the data.
Exploratory Graphics
KSWheatYieldByCounty1934to2007.fig <-
ggplot2::ggplot(data = KSWheatYieldByCounty1934to2007.tbl,
aes(x=year, y=value)) +
geom_point() +
stat_summary(
geom = "point",
fun = "mean",
col = "black",
size = 3,
shape = 21, # Circle
fill = "red") +
# An advantage of using geom_point() and stat_summary(),
# as presented in this figure, is that the figure is very
# comprehensive. Using geom_point() there is a sense of
# variance in wheat yield for each year whereas by using
# stat_summary() the mean is prominently displayed. Both
# add value to the figure.
labs(
title = "Overall and Mean Kansas Wheat Yield
(Bushels per Acre) Over Time",
subtitle = "Data: USDA NASS",
x = "\nYear",
y = "Yield (Bushels per Acre)\n") +
scale_x_discrete(breaks=scales::pretty_breaks(n=10)) +
scale_y_continuous(labels=scales::comma, limits=c(0, 85),
breaks=scales::pretty_breaks(n=10)) +
# As a challenge, add an annotated comment or two to help the
# reader. Perhaps it would be best to describe how the red dot
# indicates mean.
theme_Mac()
# Fig. 5.20
par(ask=TRUE); KSWheatYieldByCounty1934to2007.fig
Use this figure as a basis for discussion in the Presentation of Outcomes. Give
attention to the overall upward trajectory of Kansas wheat yield over time. Then
give special attention to the noticeable increases beginning in the mid to late 1950s.
There were clearly years from the mid to late 1950s with year-to-year declines, but
the overall trend has been increased yield over time. Why? What changed in terms
of farming practices that yields would increase so consistently beginning in the mid
to late 1950s? Data scientists add value, and this discussion would be a helpful addi-
tion to the Presentation of Outcomes.
Fig. 5.20
Exploratory Analyses
Are the prior Oneway ANOVA outcomes confirmed, now in the Twoway ANOVA? Is
there interaction between year and ag_district? If so, can it be explained?
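As a sketch of one way to begin (the addendum intentionally leaves the analysis open-ended), a Twoway ANOVA of yield by year and by agricultural district could mirror the earlier RBIRTH and RDEATH examples; the object name below is hypothetical:
# Hypothetical starting point: Twoway ANOVA of Kansas wheat yield
# (value) by year and by ag_district.
KSWheatTwowayYA.aov <-
stats::aov(value ~ year * ag_district,
data=KSWheatYieldByCounty1934to2007.tbl)
base::summary(KSWheatTwowayYA.aov)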
Presentation of Outcomes
KSWheatYieldByCounty2003.map <-
KSWheatYieldByCounty1934to2007.tbl %>%
dplyr::filter(year %in% c("2003"))
# Retain the 2003 rows.
base::getwd()
base::ls()
base::attach(KSWheatYieldByCounty2003.map)
utils::str(KSWheatYieldByCounty2003.map)
dplyr::glimpse(KSWheatYieldByCounty2003.map)
base::summary(KSWheatYieldByCounty2003.map)
KSWheatYieldByCounty2003.fig <-
choroplethr::county_choropleth(KSWheatYieldByCounty2003.map,
state_zoom = c("kansas"),
title = "Kansas Wheat Yield (Bushels per Acre) by County:
2003",
legend = "Wheat Yield (Bu/A)",
num_colors = 9) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="left")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is left justified,
# the title is centered, etc. Other embellishments could
# be provided, but it is judged that they are not needed.
# Fig. 5.21
par(ask=TRUE); KSWheatYieldByCounty2003.fig
Fig. 5.21
snow as well as rain and search on how snow is calculated as part of soil moisture
in terms of annual precipitation. Incorporate this information into a comprehensive
new dataset and determine if there is any association between annual precipitation
and wheat yield, year over year. Going beyond rainfall and snowfall, it would be
grand to consider data associated with irrigated acreage, possibly from water
pumped out of the Ogallala Aquifer or other sources.
Addendum 4: Prediction
Background
The data associated with this addendum are from a private dataset, adjusted for teach-
ing purposes. The data represent metrics from human subjects who participated in a
voluntary treatment (e.g., physical therapy) program. A set of measures, demographic
and assessment, were taken prior to participation, but these measures were not used
to limit enrollment. Enrollment in the treatment program was open to all willing
adults who provided written consent and subjects could leave at any time. Treatment
was ongoing, starting soon after the beginning of each year, from 2009 onward. The
dataset was obtained in late 2019 and reflects all subjects from 2009 to that end date.
After an unidentified period during which subjects engaged in treatment activi-
ties, a final performance assessment (POSTMeasure1) was required. Regardless of
the numeric score on the final performance metric, completion qualified subjects to
participate, by choice, in a terminal Fail/Pass (e.g., 0/1) final assessment. Again, by
choice, some subjects completed all treatment activities but selected to decline par-
ticipation in the terminal Fail/Pass (e.g., 0/1) final assessment. Those subjects who
did not receive a Pass on the terminal Fail/Pass assessment were allowed to attempt
the assessment at later dates. Continuance for these subjects was allowed until either
a Pass was obtained, or, by self-choice, a decision was made to finally decline fur-
ther participation. All subjects who participated in the terminal Fail/Pass assessment
received a certificate of completion, with a mark of Pass inscribed for those subjects
who passed the Fail/Pass terminal assessment.
This addendum is focused on prediction and how data gained from prior subjects
and their past activities can be used to predict future group outcomes. Consider the
common expression past behavior is the best predictor of future behavior. This
concept applies to biostatistics as well as the social and behavioral sciences. The
important thing to remember here is that prediction, even when valid, applies to
group outcomes and not the performance of any specific individuals.
Challenge: As an interesting activity, conduct a set of Pearson’s r correlations
and corresponding Null Hypotheses, to see if there is any meaningful correlation
between subject age, the three pre-assessment metrics, and the final post-performance
metrics. It is assumed that these measures represent interval data, but density plots
will be used to investigate this assumption. Then go beyond simple correlation
activities and move to regression, both linear and binomial, and finally prediction.
Before these later activities are attempted, it is suggested that those who have lim-
ited experience in statistics read about linear regression and binomial logistic
regression. It would be especially helpful to review the concept of odds ratio, as
opposed to the often-inappropriate use of the term(s) probability, chance, and odds.
Challenge: Use the syntax provided in this addendum to confirm outcomes, but
as skill level and interest allow go beyond the provided syntax to examine the data
in even greater detail. To save space, most figures are excluded from this addendum,
but the figures should be prepared and, as desired, improved upon.
Code Book
Rows: 849
Columns: 14
$ ID <dbl> 1 to end
$ YearStart <dbl> 2009 to 2019
$ AgeStart <dbl> 18 to 66
$ Gender <chr> A and B
$ RaceEthnicity <chr> A and B
$ PREMeasure1 <chr> 0 to 250
$ PREMeasure2 <dbl> 0 to 200
$ PREMeasure3 <chr> 0.00 to 5.00
$ POSTMeasure1 <dbl> 0.00 to 5.00
$ INTreatment <dbl> 0 and 1
$ COMPLETEDTreatment <dbl> 0 and 1
$ ATTEMPTEDFinalAssessment <dbl> 0 and 1
$ PASSEDFinalAssessment <dbl> 0 and 1
$ ATTEMPTSFinalAssessment <dbl> 1 to 10
Functions from the tidyverse ecosystem will be used to transform and reorganize
data into desired format(s).
As is often the case (but check to be sure), when binomial data are coded as 0 and
1, by convention the code 0 represents the negative (e.g., Die, Fail, No, Stop) and
the code 1 represents the positive (e.g., Live, Pass, Yes, Go).
base::getwd()
base::ls()
base::attach(NorY.tbl)
utils::str(NorY.tbl)
dplyr::glimpse(NorY.tbl)
base::summary(NorY.tbl)
Even if redundant, force each object variable into desired data type, particularly (but
not only) those object variables coded as 0 and 1.
NorY.tbl$ID <-
forcats::as_factor(NorY.tbl$ID)
NorY.tbl$YearStart <-
forcats::as_factor(NorY.tbl$YearStart)
NorY.tbl$AgeStart <-
base::as.numeric(NorY.tbl$AgeStart)
NorY.tbl$Gender <-
forcats::as_factor(NorY.tbl$Gender)
NorY.tbl$RaceEthnicity <-
forcats::as_factor(NorY.tbl$RaceEthnicity)
NorY.tbl$PREMeasure1 <-
base::as.numeric(NorY.tbl$PREMeasure1)
NorY.tbl$PREMeasure2 <-
base::as.numeric(NorY.tbl$PREMeasure2)
NorY.tbl$PREMeasure3 <-
base::as.numeric(NorY.tbl$PREMeasure3)
NorY.tbl$POSTMeasure1 <-
base::as.numeric(NorY.tbl$POSTMeasure1)
NorY.tbl$INTreatment <-
forcats::as_factor(NorY.tbl$INTreatment)
NorY.tbl$COMPLETEDTreatment <-
forcats::as_factor(NorY.tbl$COMPLETEDTreatment)
NorY.tbl$ATTEMPTEDFinalAssessment <-
forcats::as_factor(NorY.tbl$ATTEMPTEDFinalAssessment)
NorY.tbl$PASSEDFinalAssessment <-
forcats::as_factor(NorY.tbl$PASSEDFinalAssessment)
NorY.tbl$ATTEMPTSFinalAssessment <-
forcats::as_factor(NorY.tbl$ATTEMPTSFinalAssessment)
base::getwd()
base::ls()
base::attach(NorY.tbl)
utils::str(NorY.tbl)
dplyr::glimpse(NorY.tbl)
Rows: 849
Columns: 14
$ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9,
$ YearStart <fct> 2009, 2009, 2009, 2009, 2009
$ AgeStart <dbl> 25, 23, 23, 22, 22, 23, 25,
$ Gender <fct> B, B, B, B, A, B, A, B, B, A
$ RaceEthnicity <fct> B, A, B, A, B, B, B, B, B, B
$ PREMeasure1 <dbl> 176, 175, 173, 187, 178, 183
$ PREMeasure2 <dbl> 150, 145, 145, 153, 146, 149
$ PREMeasure3 <dbl> 2.57, 3.00, 2.77, 3.36, 3.20
$ POSTMeasure1 <dbl> 2.4332, NA, 2.5562, NA, 2.69
$ INTreatment <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ COMPLETEDTreatment <fct> 1, 0, 1, 0, 1, 0, 0, 1, 0, 0
$ ATTEMPTEDFinalAssessment <fct> 1, NA, 1, NA, 1, NA, NA, 1,
$ PASSEDFinalAssessment <fct> 0, NA, 1, NA, 1, NA, NA, 0,
$ ATTEMPTSFinalAssessment <fct> 3, NA, 1, NA, 2, NA, NA, 2,
base::summary(NorY.tbl)
Due to its role in binary logistic regression, give special attention to confirming that
PASSEDFinalAssessment is coded correctly:
base::levels(NorY.tbl$PASSEDFinalAssessment)
base::table(NorY.tbl$PASSEDFinalAssessment)
0 1
104 364
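With the outcome coding confirmed, a hedged sketch of the binomial logistic regression named in the earlier challenge follows; the model specification is an assumption, not syntax reproduced from this addendum:
# Sketch only: binomial logistic regression of the terminal Fail/Pass
# outcome on the final performance metric, with odds ratio(s) obtained
# by exponentiating the coefficient(s).
PassFinal.glm <-
stats::glm(PASSEDFinalAssessment ~ POSTMeasure1,
family=binomial(link="logit"), data=NorY.tbl)
base::summary(PassFinal.glm)
base::exp(stats::coef(PassFinal.glm)) # Odds ratio(s)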
With expected progression in the use of R as a data science tool, syntax for the fig-
ures of immediate importance is shown, but the figures are absent. Generate the
figures and from this display gain an initial sense of the data.
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=AgeStart)) +
geom_density(lwd=2, color="red") +
ggtitle("Age at Start of Treatment") +
labs(x="\nAge at Start", y="Density\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=PREMeasure1)) +
geom_density(lwd=2, color="red") +
ggtitle("PREMeasure1") +
labs(x="\nPREMeasure1", y="Density\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=PREMeasure2)) +
geom_density(lwd=2, color="red") +
ggtitle("PREMeasure2") +
labs(x="\nPREMeasure2", y="Density\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=PREMeasure3)) +
geom_density(lwd=2, color="red") +
ggtitle("PREMeasure3") +
labs(x="\nPREMeasure3", y="Density\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=POSTMeasure1)) +
geom_density(lwd=2, color="red") +
ggtitle("POSTMeasure1") +
labs(x="\nPOSTMeasure1", y="Density\n")
# Give special attention to the distribution
# pattern for POSTMeasure1 since this metric
# serves as the predictor variable in the
# later binomial logistic regression inquiry.
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=YearStart)) +
geom_bar(fill="red", color="black") +
ggtitle("Year Treatment Started") +
labs(x="\nYear", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=Gender)) +
geom_bar(fill="red", color="black") +
ggtitle("Gender") +
labs(x="\nGender", y="Count\n")
# Specific gender breakouts are purposely not
# identified by name, thus the cryptic codes A and
# B.
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=RaceEthnicity)) +
geom_bar(fill="red", color="black") +
ggtitle("Race-Ethnicity") +
labs(x="\nRace-Ethnicity", y="Count\n")
# Specific race-ethnicity breakouts (collapsed for
# this dataset) are purposely not identified by
# name, thus the cryptic codes A and B.
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=INTreatment)) +
geom_bar(fill="red", color="black") +
ggtitle("Currently in Treatment") +
labs(x="In Treatment\n", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=COMPLETEDTreatment)) +
geom_bar(fill="red", color="black") +
ggtitle("Completed Treatment") +
labs(x="Completed Treatment\n", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=ATTEMPTEDFinalAssessment)) +
geom_bar(fill="red", color="black") +
ggtitle("Attempted Final Assessment") +
labs(x="Attempted Final Assessment\n", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=PASSEDFinalAssessment)) +
geom_bar(fill="red", color="black") +
ggtitle("Passed Final Assessment") +
labs(x="Passed Final Assessment\n", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=ATTEMPTSFinalAssessment)) +
geom_bar(fill="red", color="black") +
ggtitle("Number of Attempts at Final Assessment") +
labs(x="Number of Attempts\n", y="Count\n")
$ YearStart <fct>
$ Gender <fct>
$ RaceEthnicity <fct>
$ INTreatment <fct>
$ COMPLETEDTreatment <fct>
$ ATTEMPTEDFinalAssessment <fct>
$ PASSEDFinalAssessment <fct>
$ ATTEMPTSFinalAssessment <fct>
Among the many ways the tidyverse ecosystem supports data science, note the clever way
the janitor::tabyl() function supports a thorough understanding of the data, specifically
through frequency distributions of selected factor-type object variables.24
24 Later, experiment by adding janitor::adorn_totals(c("col", "row")) to the pipeline and
observe how this addition impacts the output.
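As a brief illustration of the footnote's suggestion (a sketch, since the exact call is not shown in the original), the totals adornment can be added to a one-way frequency table such as the YearStart breakout shown below:
NorY.tbl %>%
  janitor::tabyl(YearStart) %>%
  janitor::adorn_totals("row") %>%
  janitor::adorn_pct_formatting(digits=2)
# Appends a Total row to the one-way frequency table of YearStart;
# with a two-way tabyl, adorn_totals(c("col", "row")) adds both
# row and column margins.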
YearStart n percent
2009 128 15.08%
2010 123 14.49%
2011 144 16.96%
2012 73 8.60%
2013 59 6.95%
2014 50 5.89%
2015 32 3.77%
2016 34 4.00%
2017 83 9.78%
2018 61 7.18%
2019 62 7.30%
ATTEMPTSFinalAssessmentNorY.df <- NorY.tbl %>%  # Opening restored;
janitor::tabyl(ATTEMPTSFinalAssessment,          # the original lines
show_na=TRUE, show_missing_levels=TRUE) %>%      # fall outside this excerpt.
janitor::adorn_pct_formatting(digits=2)
base::print(ATTEMPTSFinalAssessmentNorY.df)
Note the question this raises about data entry for what shows as an 8th attempt.
Should this datapoint have been marked as 6? Did the subject really make a 6th, 7th, and
then an 8th attempt? Perhaps the datum 8 is correct, but quality assurance issues of this
type are often noticed only because frequency distribution printouts are produced.
$ AgeStart <dbl>
$ PREMeasure1 <dbl>
$ PREMeasure2 <dbl>
$ PREMeasure3 <dbl>
$ POSTMeasure1 <dbl>
NorY.tbl %>%
# dplyr::group_by(PASSEDFinalAssessment) %>%
# Remove the # comment character if there were a desire
# to have the printout show breakouts by the object
# variable PASSEDFinalAssessment, 0 and 1.
dplyr::summarize(
N = base::length(AgeStart),
Minimum = base::min(AgeStart, na.rm=TRUE),
Median = stats::median(AgeStart, na.rm=TRUE),
Mean = base::mean(AgeStart, na.rm=TRUE),
SD = stats::sd(AgeStart, na.rm=TRUE),
Maximum = base::max(AgeStart, na.rm=TRUE),
Missing = base::sum(is.na(AgeStart))
)
# A tibble: 1 x 7
N Minimum Median Mean SD Maximum Missing
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 849 18 24 26.7 6.70 66 0
NorY.tbl %>%
dplyr::summarize(
N = base::length(PREMeasure1),
Minimum = base::min(PREMeasure1, na.rm=TRUE),
Median = stats::median(PREMeasure1, na.rm=TRUE),
Mean = base::mean(PREMeasure1, na.rm=TRUE),
SD = stats::sd(PREMeasure1, na.rm=TRUE),
Maximum = base::max(PREMeasure1, na.rm=TRUE),
Missing = base::sum(is.na(PREMeasure1))
)
NorY.tbl %>%
dplyr::summarize(
N = base::length(PREMeasure2),
Minimum = base::min(PREMeasure2, na.rm=TRUE),
Median = stats::median(PREMeasure2, na.rm=TRUE),
Mean = base::mean(PREMeasure2, na.rm=TRUE),
SD = stats::sd(PREMeasure2, na.rm=TRUE),
Maximum = base::max(PREMeasure2, na.rm=TRUE),
Missing = base::sum(is.na(PREMeasure2))
)
NorY.tbl %>%
dplyr::summarize(
N = base::length(PREMeasure3),
Minimum = base::min(PREMeasure3, na.rm=TRUE),
Median = stats::median(PREMeasure3, na.rm=TRUE),
Mean = base::mean(PREMeasure3, na.rm=TRUE),
SD = stats::sd(PREMeasure3, na.rm=TRUE),
Maximum = base::max(PREMeasure3, na.rm=TRUE),
Missing = base::sum(is.na(PREMeasure3))
)
NorY.tbl %>%
dplyr::summarize(
N = base::length(POSTMeasure1),
Minimum = base::min(POSTMeasure1, na.rm=TRUE),
Median = stats::median(POSTMeasure1, na.rm=TRUE),
Mean = base::mean(POSTMeasure1, na.rm=TRUE),
SD = stats::sd(POSTMeasure1, na.rm=TRUE),
Maximum = base::max(POSTMeasure1, na.rm=TRUE),
Missing = base::sum(is.na(POSTMeasure1))
)
# A tibble: 1 x 7
N Minimum Median Mean SD Maximum Missing
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 849 2.08 2.99 3.00 0.355 3.96 375
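The repeated dplyr::summarize() blocks above can also be collapsed into a single call with dplyr::across(); a compact sketch, not part of the original presentation:
NorY.tbl %>%
  dplyr::summarize(dplyr::across(
    c(AgeStart, PREMeasure1, PREMeasure2, PREMeasure3, POSTMeasure1),
    list(
      N       = ~ base::length(.x),
      Median  = ~ stats::median(.x, na.rm=TRUE),
      Mean    = ~ base::mean(.x, na.rm=TRUE),
      SD      = ~ stats::sd(.x, na.rm=TRUE),
      Missing = ~ base::sum(is.na(.x))
    )))
# One row of output, with one column per variable-statistic pair,
# such as AgeStart_Mean and POSTMeasure1_Missing.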
Exploratory Analyses
Prepare estimates of correlation (e.g., association) between and among the numeric
object variables using Pearson’s r coefficient of correlation. Use the tidyverse eco-
system to prepare a correlation matrix of all numeric variables, by various degrees
of breakout complexity.
install.packages("corrr", dependencies=TRUE)
library(corrr)
NorY.tbl %>%
dplyr::select(YearStart, AgeStart, PREMeasure1, PREMeasure2,
PREMeasure3, POSTMeasure1) %>%
dplyr::group_by(YearStart) %>%
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print(n=55)
# As a challenge, use this syntax and be sure to see
# the diagonal pattern of a correlation matrix.
To offer more clarity, look at the correlation matrix without AgeStart, but still
grouping by YearStart.
NorY.tbl %>%
dplyr::select(YearStart, PREMeasure1, PREMeasure2,
PREMeasure3, POSTMeasure1) %>%
dplyr::group_by(YearStart) %>%
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print(n=44)
# As a challenge, use this syntax and be sure to see
# the diagonal pattern of a correlation matrix.
NorY.tbl %>%
dplyr::select(PREMeasure1, PREMeasure2, PREMeasure3,
POSTMeasure1) %>%
# dplyr::group_by(YearStart) %>%
# COMMENT OUT dplyr::group_by(YearStart)
# Use the # comment character.
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print()
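For a cleaner printed matrix, the corrr package also offers helper functions; a brief sketch (these calls are not shown in the original):
NorY.tbl %>%
  dplyr::select(PREMeasure1, PREMeasure2, PREMeasure3,
    POSTMeasure1) %>%
  corrr::correlate() %>%
  corrr::shave() %>%      # Keep only the lower triangle.
  corrr::fashion()        # Round values and blank the diagonal.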
25 As a very broad measure of the weak, to moderate, to strong continuum of Pearson's r and
estimates of correlation, there are those who see: Strong Relationship, r = 0.50 and onward
(+ or −); Moderate Relationship, r = 0.30 to 0.49 (+ or −); Weak Relationship, r = 0.00 to
0.29 (+ or −). Review multiple resources and talk to practitioners for other views.
26 When reading about estimates of correlation, give special attention to the concepts of
positive and negative degrees of association. Avoid the temptation, far too common among
those who are new to this type of inquiry, to think incorrectly that positive is good and
negative is bad. Positive and negative degrees of association only signify direction, and
with clever rewording each could be reversed.
27 Give attention to the diagonal nature of a correlation matrix.
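The lines that fit PredictionModel fall on a page not reproduced in this excerpt; consistent with the Call shown in the summary below, the model was presumably prepared along these lines:
PredictionModel <- stats::lm(PREMeasure1 ~ PREMeasure2,
  data=NorY.tbl)
# Simple linear regression of PREMeasure1 on PREMeasure2.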
base::summary(PredictionModel)
Call:
stats::lm(formula = PREMeasure1 ~ PREMeasure2, data = NorY.tbl)
Residuals:
Min 1Q Median 3Q Max
-10.903 -3.894 0.095 3.106 11.064
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.5001 4.7488 6.21 0.00000000082 ***
PREMeasure2 1.0028 0.0324 30.98 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Next, adjust the dataset to focus on the subjects who were most successful in
treatment. Sequester the dataset to include only those subjects with the following
conditions:
• Subjects had a mark of COMPLETEDTreatment = 1 and
• Subjects had a mark of ATTEMPTEDFinalAssessment = 1 and
• Subjects had a mark of PASSEDFinalAssessment = 1 and
• Subjects had a mark of ATTEMPTSFinalAssessment = 1.
base::length(NorY.tbl$ATTEMPTSFinalAssessment)
# Confirm number of subjects prior to adjusting
# the dataset.
[1] 849
NorYMostSuccessful.tbl <-
NorY.tbl %>%
dplyr::filter(COMPLETEDTreatment %in% c("1")) %>%
dplyr::filter(ATTEMPTEDFinalAssessment %in% c("1")) %>%
dplyr::filter(PASSEDFinalAssessment %in% c("1")) %>%
dplyr::filter(ATTEMPTSFinalAssessment %in% c("1"))
# NOTE: The last two filters fall on a page not reproduced here
# and are restored from the four conditions listed above.
base::length(NorYMostSuccessful.tbl$ATTEMPTSFinalAssessment)
# Confirm number of subjects after adjusting
# the dataset.
[1] 235
base::getwd()
base::ls()
base::attach(NorYMostSuccessful.tbl)
utils::str(NorYMostSuccessful.tbl)
dplyr::glimpse(NorYMostSuccessful.tbl)
Rows: 235
Columns: 14
$ ID <fct> 3, 11, 12, 18, 21, 22, 23, 24,
$ YearStart <fct> 2009, 2009, 2009, 2009, 2009,
$ AgeStart <dbl> 23, 22, 36, 22, 41, 24, 32,
$ Gender <fct> B, B, B, B, B, B, B, B, B, B,
$ RaceEthnicity <fct> B, B, B, B, B, B, B, B, B, B,
$ PREMeasure1 <dbl> 173, 183, 186, 183, 179, 184,
$ PREMeasure2 <dbl> 145, 146, 161, 150, 148, 149,
$ PREMeasure3 <dbl> 2.77, 3.73, 2.48, 3.28, 3.12,
$ POSTMeasure1 <dbl> 2.5562, 3.1773, 3.3769, 2.8607
$ INTreatment <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
$ COMPLETEDTreatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
$ ATTEMPTEDFinalAssessment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
$ PASSEDFinalAssessment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
$ ATTEMPTSFinalAssessment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
base::summary(NorYMostSuccessful.tbl)
Now that the dataset has been adjusted to include only those subjects who were
the most successful (those who met all threshold activities, were qualified to
attempt the final assessment, and passed the Fail/Pass final assessment on the
first attempt), see if the correlations are in any meaningful way different from
what was previously seen when the entire dataset was examined.28
NorY.tbl %>%
dplyr::select(PREMeasure1, PREMeasure2, PREMeasure3,
POSTMeasure1) %>%
# dplyr::group_by(YearStart) %>%
# COMMENT OUT dplyr::group_by(YearStart)
# Use the # comment character.
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print()
# ORIGINAL dataset
# A tibble: 4 x 5
term PREMeasure1 PREMeasure2 PREMeasure3 POSTMeasure1
<chr> <dbl> <dbl> <dbl> <dbl>
1 PREMeasure1 NA 0.730 0.673 0.276
2 PREMeasure2 0.730 NA -0.00198 0.179
3 PREMeasure3 0.673 -0.00198 NA 0.222
4 POSTMeasure1 0.276 0.179 0.222 NA
NorYMostSuccessful.tbl %>%
dplyr::select(PREMeasure1, PREMeasure2, PREMeasure3,
POSTMeasure1) %>%
# dplyr::group_by(YearStart) %>%
# COMMENT OUT dplyr::group_by(YearStart)
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print()
# ADJUSTED dataset
28 Even if results are not published and shared with others, it is common to examine data
from many perspectives. Data scientists should be curious.
Review the correlation matrix from both perspectives, the original dataset (NorY.tbl)
and the adjusted dataset (NorYMostSuccessful.tbl). Are there any meaningful
(e.g., actionable) differences in interpretation of the estimates of correlation?
Then, even if somewhat redundant, see if there are any major changes in the linear
regression prediction equation for PREMeasure1 and PREMeasure2 among these
most successful subjects.
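As before, the fitting code is not reproduced here; consistent with the Call in the summary that follows, it was presumably:
PredictionModelMostSuccessful <- stats::lm(
  PREMeasure1 ~ PREMeasure2,
  data=NorYMostSuccessful.tbl)
# The same simple linear regression, fitted on the adjusted dataset.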
base::summary(PredictionModelMostSuccessful)
Call:
stats::lm(formula = PREMeasure1 ~ PREMeasure2,
data = NorYMostSuccessful.tbl)
Residuals:
Min 1Q Median 3Q Max
-11.02 -3.01 0.99 3.99 8.99
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.87 8.93 3.46 0.00065 ***
PREMeasure2 1.00 0.06 16.69 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For all subjects (e.g., PredictionModel), the Estimate for PREMeasure2 was
1.0028, whereas for those subjects who were the most successful
(PredictionModelMostSuccessful) the Estimate for PREMeasure2 was 1.00. Are
there any practical differences in these two estimates?
Challenge: Read not only about correlation, association, and linear regression, but
give attention to the extremely useful but far too often neglected topic of binary
logistic regression. Many life events, certainly an issue in biostatistics, eventually
equate to a binary decision, such as:
• Was the treatment effective or was the treatment ineffective?
• Should the program continue, or should the program be terminated?
• Did the subject live or did the subject die?
Before a binary logistic regression is attempted, it is necessary to consider missing
data and the impact of missing data on outcomes. For this example, review the
background and the original dataset and give attention to how certain thresholds of
completion must be met before subjects are qualified to attempt the Fail/Pass final
assessment. Subjects who have not met the required thresholds can (and should) be
eliminated from consideration. Once these subjects are removed, such as those subjects
from the later years of the dataset (who could not have been expected to complete all
activities), notice that missing data are not a major concern. With that issue
possibly resolved, consider the following questions:
• Is there some type of pattern to the missing data, or are the missing datapoints
distributed randomly?
• Is it desirable to impute data and, if so, how should this be done?
• If missing data are removed, will the eventual dataset have a sufficient N to jus-
tify binary logistic regression?
For this example, it has been decided (and can be justified, knowing that some may
have other views) that the final dataset should be complete and that there should be no
missing data. Notice how easy it is to use the tidyverse ecosystem to achieve that aim.
base::length(NorY.tbl$ID)
# Original dataset, including subjects
# who are not yet qualified to sit for
# the Fail/Pass final assessment.
[1] 849
NorYNONAs.tbl <-
NorY.tbl %>%
tidyr::drop_na()
# Remove all rows containing missing values.
base::length(NorYNONAs.tbl$ID)
[1] 464
base::summary(NorYNONAs.tbl)
# There are now no NAs in the dataset NorYNONAs.tbl
# and this tibble will be used for the binary
# logistic regression.
base::summary(NorYNONAs.tbl$PASSEDFinalAssessment)
0 1
102 362
base::getwd()
base::ls()
base::attach(NorYNONAs.tbl)
utils::str(NorYNONAs.tbl)
dplyr::glimpse(NorYNONAs.tbl)
base::summary(NorYNONAs.tbl)
The adjusted dataset is now in good form and ready for inquiry from a binary
logistic regression perspective, where it will be possible to see which identified
object variables contribute in a meaningful way to the Fail/Pass final assessment
outcome variable (e.g., PASSEDFinalAssessment)29:
Variable of interest: PASSEDFinalAssessment
Predictor Variables
YearStart ........ Factor
AgeStart ......... Numeric
Gender ........... Factor
RaceEthnicity .... Factor
PREMeasure1 ...... Numeric
PREMeasure2 ...... Numeric
PREMeasure3 ...... Numeric
POSTMeasure1 ..... Numeric
With this preparation, note how the ultimate goal of a binary logistic regression
is to examine odds ratios for the many predictor variables and how they contribute
to the binary outcome, or Fail/Pass in this example:
29 Do not ignore the data type of the many variables when engaged in binary logistic
regression. The binary variable of interest is viewed as a factor-type variable with two
levels (0/1, Fail/Pass, etc.). However, note how a combination of factor-type object
variables and numeric-type object variables have a role in the prediction algorithm. A
major value of binary logistic regression is that this approach toward analyses considers
factor-type variables as well as numeric-type variables.
logitPassedFinalAssessment <-
stats::glm(PASSEDFinalAssessment ~
YearStart +
AgeStart +
Gender +
RaceEthnicity +
PREMeasure1 +
PREMeasure2 +
PREMeasure3 +
POSTMeasure1,
data=NorYNONAs.tbl, family="binomial")
# Build the model.
base::summary(logitPassedFinalAssessment)
Call:
stats::glm(formula = PASSEDFinalAssessment ~ YearStart +
AgeStart + Gender + RaceEthnicity + PREMeasure1 +
PREMeasure2 + PREMeasure3 + POSTMeasure1, family =
"binomial", data = NorYNONAs.tbl)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0943 0.0666 0.2635 0.5533 1.8273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.97596 5.15699 -1.74 0.0818 .
YearStart2010 0.11104 0.57921 0.19 0.8480
YearStart2011 -2.38911 0.52588 -4.54 0.000005544278345 ***
YearStart2012 -1.97203 0.60053 -3.28 0.0010 **
YearStart2013 -2.93007 0.59109 -4.96 0.000000715509315 ***
YearStart2014 -2.57440 0.64635 -3.98 0.000068057146887 ***
YearStart2015 -3.33261 0.70688 -4.71 0.000002422691728 ***
YearStart2016 -4.39314 0.76705 -5.73 0.000000010202655 ***
YearStart2017 -1.40819 1.19847 -1.17 0.2400
AgeStart -0.06399 0.02349 -2.72 0.0064 **
GenderA -0.56140 0.29335 -1.91 0.0556 .
RaceEthnicityA -0.77954 0.38717 -2.01 0.0441 *
PREMeasure1 0.00821 0.16402 0.05 0.9601
PREMeasure2 0.00636 0.16417 0.04 0.9691
PREMeasure3 -0.70767 1.64886 -0.43 0.6678
POSTMeasure1 4.80679 0.63862 7.53 0.000000000000052 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This output is interesting, and the statistics in the two columns z value and Pr(>|z|)
give a hint of eventual outcomes, but for now continue and look at confidence intervals
for the object variables in the fitted model.
stats::confint(logitPassedFinalAssessment)
# Determine confidence intervals for
# object variables in a fitted model.
This work with the fitted model should be given attention. However, the odds
ratios are the main concern for this binary logistic regression inquiry. As mentioned
earlier, study the precise meanings of odds ratio versus odds versus chance versus
probability versus prediction; these terms are not synonyms.
base::exp(stats::coef(logitPassedFinalAssessment))
# Determine odds ratios for object variables in
# a fitted model.
#
# Review the literature to be sure what odds
# ratio means, as opposed to the incorrect use
# of other terms that are not the same.
(Intercept) 0.000126412
YearStart2010 1.117438197
YearStart2011 0.091711721
YearStart2012 0.139173601
YearStart2013 0.053393163
YearStart2014 0.076199435
YearStart2015 0.035699636
YearStart2016 0.012361855
YearStart2017 0.244585459
AgeStart 0.938019029
GenderA 0.570408488
RaceEthnicityA 0.458616410
PREMeasure1 1.008243025
PREMeasure2 1.006377812
PREMeasure3 0.492789926
POSTMeasure1 122.337700994
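Although not shown in the original presentation, a common companion display pairs each odds ratio with a confidence interval on the same scale:
base::exp(base::cbind(
  OddsRatio = stats::coef(logitPassedFinalAssessment),
  stats::confint(logitPassedFinalAssessment)))
# One row per model term: the odds ratio followed by the
# exponentiated 2.5% and 97.5% profile-likelihood limits.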
30 Give attention to the base::cut() function and how it has many uses in data science.
NorYNONAs.tbl$AddCutScore <-
base::cut(NorYNONAs.tbl$POSTMeasure1,
breaks=c(-Inf, 3.00, 3.50, Inf),
labels=c("Low","Middle","High"))
base::attach(NorYNONAs.tbl)
base::table(NorYNONAs.tbl$AddCutScore,
NorYNONAs.tbl$PASSEDFinalAssessment)
# Rows ..... AddCutScore
# Columns .. PASSEDFinalAssessment
#
# HINT: When selecting which object variable
# should be rows and which object variable
# should be columns, make the rows object
# variable the one with the greatest number of
# breakouts. This output is a 3 (rows) by 2
# columns presentation.
0 1
Low 76 163
Middle 26 157
High 0 42
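The line that stores this cross-tabulation as the object Performance falls outside the reproduced pages; consistent with the object name used below, it was presumably:
Performance <- base::table(NorYNONAs.tbl$AddCutScore,
  NorYNONAs.tbl$PASSEDFinalAssessment)
# Save the 3 x 2 cross-tabulation as the object Performance.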
base::class(Performance)
Performance
0 1
Low 76 163
Middle 26 157
High 0 42
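Similarly, the line that submits Performance to the chi-square test is not reproduced; consistent with the output that follows, it was presumably:
PerformanceSQ <- stats::chisq.test(Performance)
# Pearson's Chi-square test of independence on the saved table.
PerformanceSQ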
Pearson's Chi-squared test

data:  Performance
X-squared = 31.71, df = 2, p-value = 0.00000013
PerformanceSQ$observed %>%
print()
0 1
Low 76 163
Middle 26 157
High 0 42
PerformanceSQ$expected %>%
print()
0 1
Low 52.53879 186.4612
Middle 40.22845 142.7716
High 9.23276 32.7672
There are clear differences in the cell counts, observed versus expected.
Further, with a calculated p-value of 0.00000013, there is a statistically significant
(p <= 0.05) difference in performance on the Fail/Pass final assessment by breakouts
of AddCutScore (e.g., Low, Middle, and High), which is derived from POSTMeasure1.
The Chi-square analyses provide an additional degree of confirmation that
POSTMeasure1 has a large influence on PASSEDFinalAssessment outcomes.
Then, going back to what was found earlier in the binary logistic regression, observe
how the many other object variables in the binary logistic equation are all interesting,
but only a brief discussion is warranted given the exceptional predictive utility
of POSTMeasure1:
• Subjects enter the treatment program by choice, and there is no control over the
starting year or willingness to persist. There was a slight, potentially useful
predictive odds ratio for those who started in 2010, but notice how the predictive
utility of starting year dropped off soon after.
• It should also be mentioned that there is no control over age, other than subjects
must be adults (e.g., 18 years or older) and capable of giving consent for treat-
ment. Also, consider outcomes for gender and race-ethnicity. There is little pre-
dictive utility for these three object variables: age, gender, and race-ethnicity.
• For entry into treatment, it is expected that subjects will provide scores for three
pretreatment measures. Like what was seen in the correlation matrix, these three
pretreatment measures have little value in terms of their impact on the final Fail/
Pass assessment.
• The POSTMeasure1 object variable is the most useful measure in terms of its
impact on success with the Fail/Pass final assessment. Even a small increase in
POSTMeasure1 corresponds to a substantial increase in the odds of passing, as
reflected in its odds ratio.
SuccessOnTheFinalAssessment.fig <-
ggplot2::ggplot(data=NorYNONAs.tbl,
aes(x=POSTMeasure1)) +
geom_histogram(color="black", fill="dodgerblue") +
facet_wrap(~PASSEDFinalAssessment) +
ggtitle("Fail (0) and Pass (1) on the Fail/Pass Final
Assessment by POSTMeasure1 Scores (Range = 0 to 5)") +
labs(x="\nPOSTMeasure1 Score", y="Count\n") +
scale_x_continuous(labels=scales::comma, limits=c(1.8, 5.0),
breaks=scales::pretty_breaks(n=10)) +
scale_y_continuous(labels=scales::comma, limits=c(0, 50),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(strip.text.x=element_text(face="bold", size=14)) +
theme(axis.text.x=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 5.22
par(ask=TRUE); SuccessOnTheFinalAssessment.fig
More could be (and will be) done with this dataset, but the binary logistic regres-
sion provides ample evidence that POSTMeasure1 is the most useful object variable
in terms of Fail (0) or Pass (1) performance on the Fail/Pass final assessment. All
other object variables in the binary logistic regression algorithm are either limited
or have no practical use given the nature of how, why, and when subjects entered
treatment. Of course, that may not be the case with treatment programs for non-
human subjects, but that was not an issue in this addendum.
It is not merely a sidebar issue that the pretreatment metrics have no practical
impact on success with the final assessment; it is reasonable to question whether the
time and cost of these pretreatment assessments should be continued. Instead, it may
be of more value to concentrate on efforts needed to increase the posttreatment metric
(e.g., POSTMeasure1) prior to attempting the Fail/Pass final assessment, perhaps
by applying a practice test to prepare subjects for POSTMeasure1. Assume that the
data scientist charged with this dataset and these analyses has limited knowledge of
the structure of the treatment program, so it is difficult to offer more discussion on
this topic.
Data scientists add value when attempting the many activities associated with their
discipline. The prior odds ratio values are critically important and give meaning to
this inquiry, but many members of the public do not understand the true meaning of an
odds ratio and its use. The last section of this addendum will accordingly focus on
probability and prediction, concepts that the public should ostensibly understand with
a greater degree of comprehension.31
The task now is to prepare a figure that represents the probability of receiving a
pass on the Fail/Pass final assessment (e.g., PASSEDFinalAssessment = 1). The
previously adjusted dataset with no missing values will be used. As a brief reminder,
the logit model was prepared as:
logitPassedFinalAssessment <-
stats::glm(PASSEDFinalAssessment ~
YearStart +
AgeStart +
Gender +
RaceEthnicity +
PREMeasure1 +
PREMeasure2 +
PREMeasure3 +
POSTMeasure1,
data=NorYNONAs.tbl, family="binomial")
# Redundant: Build the model.
base::attach(logitPassedFinalAssessment)
utils::str(logitPassedFinalAssessment)
dplyr::glimpse(logitPassedFinalAssessment)
# Take time to study the extreme detail gained from using
# either the utils::str() function or the dplyr::glimpse()
# function, when applied against the model as compared to
# when these functions are applied against a dataset.
With the logit model in good form, create a new object variable to be added to
the NorYNONAs.tbl dataset. This new object variable, logistic_predictions, will be
based on application of the stats::predict() function.
31 The term ostensibly was used purposely. If presented with a valid deck of cards, many
members of the public understand that the probability of selecting the Ace of Spades is
one out of 52. However, when state-run lotteries have mega jackpots and there is heightened
interest, it is not at all uncommon for many members of the public to buy two tickets
instead of just one, in a self-declared effort to double the chance of winning. Of course,
doubling a nearly negligible probability still leaves a nearly negligible probability.
NorYNONAs.tbl$logistic_predictions <-
stats::predict(logitPassedFinalAssessment, type = "response")
# Prepare a new object variable that focuses on
# probability predictions, subject by subject --
# but with no missing data to confound the use of
# R and outcomes.
base::getwd()
base::ls()
base::attach(NorYNONAs.tbl)
utils::str(NorYNONAs.tbl)
dplyr::glimpse(NorYNONAs.tbl)
base::summary(NorYNONAs.tbl)
base::print(NorYNONAs.tbl)
base::summary(NorYNONAs.tbl$logistic_predictions)
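Although not part of the original presentation, the predicted probabilities can also be turned into a quick classification check; the 0.50 cutoff below is an assumption used only for illustration:
NorYNONAs.tbl$PredictedPass <-
  base::ifelse(NorYNONAs.tbl$logistic_predictions >= 0.50, 1, 0)
# Classify each subject as a predicted Pass (1) or Fail (0).
base::table(
  Predicted = NorYNONAs.tbl$PredictedPass,
  Observed = NorYNONAs.tbl$PASSEDFinalAssessment)
# A 2 x 2 confusion table of predicted versus observed outcomes.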
Prepare an estimate of correlation between two of the most pertinent object vari-
ables associated with this inquiry:
stats::cor(
NorYNONAs.tbl$POSTMeasure1,
NorYNONAs.tbl$logistic_predictions,
method="pearson")
[1] 0.539462
Using Base R, prepare a simple draft plot, in rough form, of the same relationship:
X axis POSTMeasure1 versus Y axis logistic_predictions:
graphics::plot(
NorYNONAs.tbl$POSTMeasure1, # X axis
NorYNONAs.tbl$logistic_predictions) # Y axis
The output of this simple plot, using the graphics::plot() function, should give a
general sense of how success on the Fail/Pass final assessment increases as
POSTMeasure1 increases. Use the ggplot2::ggplot() function to improve the
presentation of this finding (Fig. 5.23).
ProbPASSEDFinalAssessmentByPOSTMeasure1.fig <-
NorYNONAs.tbl %>%
ggplot2::ggplot(
aes(x=POSTMeasure1,
y=logistic_predictions)) +
geom_point(color="red") +
geom_smooth(method="lm", se=TRUE) +
labs(
x="\nPOSTMeasure1",
y="PASSEDFinalAssessment Probability:
Logistic Predictions\n",
title=
"Probability of Success at PASSEDFinalAssessment
by POSTMeasure1") +
annotate("text", x=3.25, y=0.50, fontface="bold",
size=05, color="black", hjust=0, family="mono",
label="Pearson's r = 0.539462") +
scale_x_continuous(labels=scales::comma, limits=c(2.0,
4.05), breaks=scales::pretty_breaks(n=5)) +
# Make the scale go slightly beyond the maximum value
# of the X axis, to allow room for the points so that
# they do not get lost in the figure.
scale_y_continuous(labels=scales::comma, limits=c(0.0,
1.05), breaks=scales::pretty_breaks(n=5)) +
# Make the scale go slightly above the upper limit of
# the Y axis, to allow room for the points so that
# they do not get lost in the figure.
theme_Mac()
# Fig. 5.23
par(ask=TRUE); ProbPASSEDFinalAssessmentByPOSTMeasure1.fig
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, and .xlsx file formats.
FIPSCountyStateCodesUSDA.csv
KSWheatYieldBushelPerAcre1934to2007.csv
NorYMultipleVariablesPrediction.csv
RBIRTHandRDEATHBeforeAdjustment2015Onward.xlsx
RBIRTHandRDEATHMidwest2015Onward.xlsx
RBIRTHandRDEATHNortheast2015Onward.xlsx
RBIRTHandRDEATHSouth2015Onward.xlsx
RBIRTHandRDEATHWest2015Onward.xlsx
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 6
Use of R-Based APIs (Application
Programming Interface) to Obtain Data
Data and money are similar in that they both have potential value, if used correctly.
Money kept in a locked box has limited value, but money gains value when it is
judiciously spread throughout a community by spending and respending, spawning
what is called the multiplier effect and ultimately increased economic impact.
Following along with this paradigm, data that are locked away and difficult to obtain
also have limited value due to disuse, but data gain value when they spread
throughout a community, allowing use, reuse, and impact, often in clever ways that
the data originator never imagined. Consider inquiries into the SARS-CoV-2 virus and
how freely available data relating to this virus have been used to develop helpful
efforts to combat COVID-19 (and possibly future infectious diseases), one example
being the now-recognized differing impacts of the virus by age group and how
mitigation efforts for different age profiles were justified by health officials. This
mitigation strategy was only justified by the free exchange of massive amounts of
reliable and valid data, from multiple resources, examined by multiple researchers.
The free exchange of data, often massive amounts of data, is not new. Early
efforts to share data between and among data scientists were developed concurrent
to the development and continuing enhancement of what is now known as the
Internet:
• First released in the early 1970s and later improved upon over many years of
effort, ftp (File Transfer Protocol) was one of the earliest tools used to move data
from one location to another, where data were exchanged between a local host
and a remote host. The use of ftp required advance planning, permissions, and
the deployment of necessary software and system configurations, but ftp was a
best practice at one time.1
• Developed in the early 1990s, the gopher protocol was a communication tool that
used a menu-type listing of resources that allowed easy distribution of data.
There were limitations on the applicability of gopher, and by the late 1990s, it
fell out of favor as an information-sharing tool.
• The World Wide Web (WWW) gained widespread use after its early 1990s
release and has since become the dominant tool for access to distributed informa-
tion in its many forms. Ease of use and widespread applicability across multiple
platforms, operating systems, browsers, etc. have contributed to its nearly uni-
versal use.
• Given this very brief overview of early attempts at data transfer, this introductory
text is focused on API (Application Programming Interface) clients and how
associated R-based functions are used as an increasingly popular tool for data
acquisition among data scientists. Without going into detail that would be well
beyond the scope of this text, the most dominant standard for API development
and use is Representational State Transfer (REST). The REST paradigm is struc-
tured along the concept of data request and response. Using this approach, a
client-based request for data, using HTTP and associated URLs, is made to a
server and if executed correctly the data are typically sent back in JSON format.2
The first three addenda for this lesson demonstrate how R-based API client func-
tions (e.g., functions from specific R packages) are used to obtain data. However,
knowing that some explicit detail on the heuristics of APIs is needed, review the
fourth addendum for this lesson to see the more detailed structure for the API pro-
cess and how JSON data are obtained in raw format and then put into a more man-
ageable and tidier format. Work by many individuals and their contributed R-based
API client functions now have a prime role in the fairly simple way by which R is
used to obtain freely available data.
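As a preview of that fourth addendum, the request-and-response pattern generally reduces to a few steps; a minimal sketch using the httr and jsonlite packages follows (the URL is only a placeholder, not a real endpoint):
APIResponse <- httr::GET("https://example.com/api/endpoint")
# Send an HTTP GET request to the (hypothetical) endpoint.
httr::status_code(APIResponse)
# A status code of 200 indicates success.
APIText <- httr::content(APIResponse, as="text", encoding="UTF-8")
# Extract the body of the response as raw JSON text.
APIData <- jsonlite::fromJSON(APIText)
# Convert the JSON into an R object, often a data frame or a list.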
Give special attention to the R-based APIs demonstrated in the first three addenda
for this lesson. Before any attempt is made to reproduce the efforts in these addenda,
look at the documentation for each R function that serves as an API client. As an
1 Search on the terms chmod 700 filename and chmod 777 filename and consider differences in
data protection early on with the use of distributed systems and current practices.
2 Although it is an oversimplification, think of REST as a scenario where a researcher at
Computer A sends a message to Computer B, asking for data. Computer B has been set so that
the message is then reviewed to see if the request is structured correctly, if the
authentication process (if any) confirms that the researcher is qualified to receive the
data, if the data are available, etc. If all requirements are met, then Computer B sends
the requested data back to Computer A, for the requesting researcher to use the data as
desired.
APIs and the Need for a Key
Many data resources, especially those associated with the federal government, use
a key to provide some degree of authentication and later command and control over
data access. Regarding the use of a key, authentication of course refers to user
identity. Command and control usually refer not only to the specific data made
available overall but also to the number of data requests and the number of variables
within each data request, although this approach is unique to each agency. There are
also controls built into the process to provide assurance that a human is responsible
for the data request and that the request is not from an automated process that could
possibly overwhelm the server housing the data.3
A wide variety of approaches are used regarding the ease of use and authentica-
tion for data access from many different API-based resources:
• Review Addendum 3 in this lesson for an example where no authentication is
needed to obtain data from the United States Environmental Protection Agency
(EPA). In this example, it is only necessary to know the URLs associated with
the data, data in this example specific to air pollution from electricity power plants.
• Review the front matter in the first lesson to see how data from the United States
Department of Education’s Integrated Postsecondary Education Data System
(IPEDS) are made available, but only by interacting with a Web-based Graphical
3 A key should be treated as if it were a password. A key provides a unique identification
of the approved user. Do not share a key with anyone else, in the same way that passwords
should always be kept private.
User Interface (GUI) resource where multiple selections and a click here and
click there approach is used for eventual data retrieval. Anyone with Internet
access can obtain the data. Required authentication is minimal, calling for the
numeric code of a selected postsecondary institution from among the thousands
of postsecondary institutions in the United States. Yet, there is no attempt at
verification of user identity, association with the selected institution, or intended
use of the data.
• An R-based API client (e.g., the function owidR::owid()) is used in the second
lesson (Addendum 1) to obtain data from Our World in Data. Observe how the
data requests associated with this data retrieval process require no key, no authen-
tication, etc. Anyone with interest and sufficient skills can obtain the data, data
that are freely available.
• However, a key is needed for data acquisition from many federal agencies, espe-
cially for large data requests. Although the details are unique to each agency,
look at instructions for key access and use for a few of the many resources that
use keys, knowing that there are many other federal agencies that also facilitate
data access by use of a readily available key:
–– Census Bureau: Complete the form at https://api.census.gov/data/key_signup.
html to obtain a key.
–– Bureau of Labor Statistics: Complete the form at https://data.bls.gov/registra-
tionEngine/ to obtain a key.
–– Department of Agriculture National Agricultural Statistics Service (USDA
NASS): Complete the form at https://quickstats.nass.usda.gov/api/ to
obtain a key.
There is a continuing conversation among those who use API keys on whether it
is best to declare the key in syntax each time it is used or to store the key one
time in the R environment, possibly in the .Rprofile file or the .Renviron file.
No judgment is offered here since this is a personal preference. Each approach
has advantages:
• The use of an API key is highly visible when it is embedded into syntax, but of
course the contents of the API key can then be easily seen by others whenever the
syntax is made visible, such as when script files like those in the Housekeeping
section of this text are shared with others.
• The one-time storage of an API key makes it unnecessary to declare the key each
time it is used, and API key contents are less visible to others, allowing greater
privacy and protection from misuse.
This is all a personal choice on how API keys are deployed. Just remember to
treat an API key as if it were a password, and do not share it with others.
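For those who prefer the one-time storage approach, the tidycensus package supports writing the key to the .Renviron file; a brief sketch (the key shown is only a placeholder):
tidycensus::census_api_key(
  "Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
  install=TRUE)
# With install=TRUE the key is saved to the .Renviron file and is
# then available in later sessions.
Sys.getenv("CENSUS_API_KEY")
# Confirm that the stored key can be retrieved.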
Structure of an API to Automate Data Retrieval
It has been mentioned more than a few times in this text that family income and
other indicators related to wealth and its opposite, poverty, are statistics that are
critical to those who work in public health, an area that regularly uses biostatistics.
The devastating impact of poverty on health, current and future, simply cannot be
overstated. For researchers in the United States, the Census Bureau is likely the best
data source for reliable, valid, and current poverty and wealth statistics.
Before reviewing this section, review Census Bureau guidelines on APIs, such as
Working with the Census Data API, https://www.census.gov/content/dam/Census/
library/publications/2020/acs/acs_api_handbook_2020_ch02.pdf. Then review
documentation for the R-based tidycensus package at https://cran.r-project.org/
web/packages/tidycensus/index.html and give attention to the tidycensus::get_acs()
function, which serves as an R-based API client, greatly facilitating the ease of use
as Census Bureau American Community Survey (ACS) data are obtained. Be sure
to ask for, obtain, use, and keep private a unique Census Bureau key.
Review and deploy the syntax presented in this section, simply to see how easy
it is to use the tidycensus::get_acs() function. Then, as an interesting alternative,
try to obtain the same data using Census Bureau graphical menus and see whether that
attempt yields comparable results without extensive effort.4 It is worth the time for
a data scientist to use APIs, as opposed to more cumbersome graphical menus that leave
behind no record of actions, limiting future replication.
4 To continue with an evaluation of available R-based API client functions, compare ease of
use and outcomes from use of the tidycensus::get_acs() function and the acs::acs.fetch()
function. Try both API clients over multiple queries to see if there is a reason to prefer
one API client over the other. It has been decided to use the tidycensus::get_acs()
function in this text, but the acs::acs.fetch() function also has great value.
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "D:/R_Packages")
# As a preference, all installed packages
# will now go to the external D:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("D:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
When looking at the syntax in the Housekeeping section and in the section where
packages are declared:
• A new external storage device was used, and it is named the D: drive, not the
F: drive.
• A new version of R was used and as such the # comment character was not used
in front of the install.packages() function, so that all required packages go to the
new drive (e.g., .libPaths(new = “D:/R_Packages”)).
• The update.packages() function was not used, but it is an option. If used, allow
time since it can be quite a long process.
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
install.packages("readxl", dependencies=TRUE)
library(readxl)
install.packages("magrittr", dependencies=TRUE)
library(magrittr)
install.packages("janitor", dependencies=TRUE)
library(janitor)
install.packages("rlang", dependencies=TRUE)
library(rlang)
install.packages("htmltools", dependencies=TRUE)
library(htmltools)
install.packages("httr", dependencies=TRUE)
library(httr)
install.packages("jsonlite", dependencies=TRUE)
library(jsonlite)
install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
install.packages("ggtext", dependencies=TRUE)
library(ggtext)
install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
install.packages("scales", dependencies=TRUE)
library(scales)
install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
install.packages("cowplot", dependencies=TRUE)
library(cowplot)
install.packages("writexl", dependencies=TRUE)
library(writexl)
###############################################################
install.packages("tidycensus", dependencies=TRUE)
library(tidycensus)
# CAUTION: The tidycensus package may take longer to
# download than other packages. Be patient.
tidycensus::census_api_key(
"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
install.packages("acs", dependencies=TRUE)
library(acs)
acs::api.key.install(
"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
acs.tables.install()
###############################################################
# Mapping #
###############################################################
install.packages("htmlTable", dependencies=TRUE)
library(htmlTable)
install.packages("maptools", dependencies=TRUE)
library(maptools)
install.packages("Rcpp", dependencies=TRUE)
library(Rcpp)
install.packages("rgdal", dependencies=TRUE)
library(rgdal)
install.packages("rgeos", dependencies=TRUE)
library(rgeos)
install.packages("sf", dependencies=TRUE)
library(sf)
install.packages("sp", dependencies=TRUE)
library(sp)
install.packages("stars", dependencies=TRUE)
library(stars)
install.packages("terra", dependencies=TRUE)
library(terra)
install.packages("xfun", dependencies=TRUE)
library(xfun)
install.packages("choroplethr", dependencies=TRUE)
library(choroplethr)
install.packages("choroplethrMaps", dependencies=TRUE)
library(choroplethrMaps)
install.packages("choroplethrAdmin1", dependencies=TRUE)
library(choroplethrAdmin1)
###############################################################
theme_Mac <- function() {   # NOTE: The opening lines of this
  ggplot2::theme(           # user-defined theme fall outside this
                            # excerpt; reconstructed minimally here.
    axis.ticks.length=unit(0.25,"cm"),
    panel.background=element_rect(fill="whitesmoke")
  )
}
# hjust - horizontal justification; 0 = left edge to 1 = right
# edge, with 0.5 the default
# vjust - vertical justification; 0 = bottom edge to 1 = top
# edge, with 0.5 the default
# angle - rotation; generally 1 to 90 degrees, with 0 the
# default
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
###############################################################
Go back to the Housekeeping section, and observe how the object variable
ACS2019 was created by using the tidycensus::load_variables() function. Open the
file saved as ACS2019.csv, a file of more than 25,000 lines, and spend a few hours
(yes, hours) reviewing the many topics addressed in the American Community
Survey (ACS) and corresponding breakouts. The column marked label provides a
narrative description of the topic for which data are available, and the column
marked name provides the table number and associated breakout code(s). By search-
ing from among the many variables involving money, it was decided that Table
B19013 would be the most appropriate table for this inquiry, where an estimate is
provided for breakouts of Median household income in the past 12 months (in 2021
inflation-adjusted dollars), with attention to breakouts by householder race-ethnicity
as well, detailed in the column marked concept.
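The Housekeeping lines that created ACS2019 are not reproduced in this excerpt; a sketch consistent with the description above (the exact arguments are assumptions) follows:
ACS2019 <- tidycensus::load_variables(2019, "acs1", cache=TRUE)
# Obtain the full listing of ACS variable names, labels, and
# concepts for the selected year and survey.
utils::write.csv(ACS2019, file="ACS2019.csv", row.names=FALSE)
# Save the listing as ACS2019.csv for unhurried review.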
With Housekeeping, environment setup, and table selection out of the way, use
the tidycensus::get_acs() function, serving as an R-based API client, to obtain the
Census Bureau American Community Survey data on Tennessee median household
income (e.g., code B19013_001) by county for 2019, initially using the ACS1 data-
set but later using the ACS5 dataset.
TNMedHouseIncomeB19013_001acs1_2019.tbl <-
tidycensus::get_acs(
# Use the tidycensus::get_acs() function as an R-based API
# client, to obtain American Community Survey (ACS) data from
# the Census Bureau.
geography="county",# Data for all counties, if available
variables = c("B19013_001"), # All Race-Ethnicity Groups
# Breakout Codes Race-Ethnicity Breakout Groups
# "B19013A_001" WHITE
# "B19013B_001" BLACK OR AFRICAN AMERICAN ALONE
# "B19013C_001" AMERICAN INDIAN AND ALASKA NATIVE ALONE
# "B19013D_001" ASIAN ALONE
# "B19013E_001" NATIVE HAWAIIAN OTHER PACIFIC ISLANDER
# "B19013F_001" SOME OTHER RACE ALONE
# "B19013G_001" TWO OR MORE RACES
# "B19013H_001" WHITE ALONE, NOT HISPANIC OR LATINO
# "B19013I_001" HISPANIC OR LATINO
year = 2019, # Year
output="tidy", # Output, tidy or wide
state = "TN", # State, Tennessee
geometry=FALSE, # Return sf geometry TRUE or FALSE
moe_level=95, # Margin of error
survey="acs1", # acs1, more restrictive than acs5
show_call=TRUE) # Display call made to Census API
# Note about survey= "acs1" or "acs5":
# acs1 - Data for areas with populations of 65,000+
# acs5 - Data for all areas
# Within what may seem to be but a fraction of a second,
# the following text appears on the screen:
# The 1-year ACS provides data for geographies with
# populations of 65,000 and greater.
# Getting data from the 2019 1-year ACS
# Using FIPS code '47' for state 'TN'
#
# An URL is also provided, preceded by the expression
# Census API call:
Challenge: As an interesting activity, put the URL for this Census API call into a
browser and see what is returned – a JSON-based data set. Review the data, espe-
cially before viewing Addendum 4 in the back matter of this lesson.
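As a further preview of Addendum 4, the same JSON could be pulled directly into R with the httr and jsonlite packages. The URL below is only illustrative of the general form reported by show_call=TRUE (the key placeholder must be replaced with a valid key):
CensusURL <- base::paste0(
  "https://api.census.gov/data/2019/acs/acs1",
  "?get=NAME,B19013_001E,B19013_001M",
  "&for=county:*&in=state:47",
  "&key=Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx")
CensusJSON <- httr::content(httr::GET(CensusURL),
  as="text", encoding="UTF-8")
CensusMatrix <- jsonlite::fromJSON(CensusJSON)
# The Census API returns rows of values in which the first row
# holds the column names.
CensusDF <- base::as.data.frame(CensusMatrix[-1, ],
  stringsAsFactors=FALSE)
base::names(CensusDF) <- CensusMatrix[1, ]
utils::head(CensusDF)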
base::attach(TNMedHouseIncomeB19013_001acs1_2019.tbl)
TNMedHouseIncomeB19013_001acs1_2019.tbl
# A tibble: 20 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 47001 Anderson County, Tennessee B19013_001 49044 6566.
2 47009 Blount County, Tennessee B19013_001 59276 5393.
3 47011 Bradley County, Tennessee B19013_001 52468 6135.
4 47037 Davidson County, Tennessee B19013_001 63938 2906.
5 47059 Greene County, Tennessee B19013_001 44722 6521.
6 47065 Hamilton County, Tennessee B19013_001 57502 3373.
7 47093 Knox County, Tennessee B19013_001 60283 2892.
8 47113 Madison County, Tennessee B19013_001 49820 4730.
9 47119 Maury County, Tennessee B19013_001 66781 3080
10 47125 Montgomery County, Tennessee B19013_001 56948 5282.
11 47141 Putnam County, Tennessee B19013_001 49282 5667.
12 47147 Robertson County, Tennessee B19013_001 66939 8000.
13 47149 Rutherford County, Tennessee B19013_001 69397 3951.
14 47155 Sevier County, Tennessee B19013_001 57741 6989.
15 47157 Shelby County, Tennessee B19013_001 52614 2951.
16 47163 Sullivan County, Tennessee B19013_001 51860 5197.
17 47165 Sumner County, Tennessee B19013_001 68743 4634.
18 47179 Washington County, Tennessee B19013_001 51428 6239.
19 47187 Williamson County, Tennessee B19013_001 115507 9084.
20 47189 Wilson County, Tennessee B19013_001 80071 7785.
writexl::write_xlsx(
TNMedHouseIncomeB19013_001acs1_2019.tbl,
path =
"D:\\R_Ceres\\TNMedHouseIncomeB19013_001acs1_2019.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("TNMedHouseIncomeB19013_001acs1_2019.xlsx")
base::file.info("TNMedHouseIncomeB19013_001acs1_2019.xlsx")
base::list.files(pattern =".xlsx")
The output displayed ACS1 2019 B19013_001 data for 20 Tennessee counties.
However, Tennessee has 95 counties. Perhaps the ACS5 survey will be more inclu-
sive. Note how the only difference in the syntax is the one line of code, declaring
acs5 instead of acs1.
TNMedHouseIncomeB19013_001acs5_2019.tbl <-
tidycensus::get_acs(
# Use the tidycensus::get_acs() function as an R-based API
# client, to obtain American Community Survey (ACS) data from
# the Census Bureau.
geography="county",# Data for all counties, if available
variables = c("B19013_001"), # All Race-Ethnicity Groups
# Breakout Codes Race-Ethnicity Breakout Groups
# "B19013A_001" WHITE
# "B19013B_001" BLACK OR AFRICAN AMERICAN ALONE
# "B19013C_001" AMERICAN INDIAN AND ALASKA NATIVE ALONE
# "B19013D_001" ASIAN ALONE
# "B19013E_001" NATIVE HAWAIIAN OTHER PACIFIC ISLANDER
# "B19013F_001" SOME OTHER RACE ALONE
# "B19013G_001" TWO OR MORE RACES
# "B19013H_001" WHITE ALONE, NOT HISPANIC OR LATINO
# "B19013I_001" HISPANIC OR LATINO
year = 2019, # Year
output="tidy", # Output, tidy or wide
state = "TN", # State, Tennessee
geometry=FALSE, # Return sf geometry TRUE or FALSE
moe_level=95, # Margin of error
survey="acs5", # acs5, less restrictive than acs1
show_call=TRUE) # Display call made to Census API
# Note about survey= "acs1" or "acs5":
# acs1 - Data for areas with populations of 65,000+
# acs5 - Data for all areas
base::attach(TNMedHouseIncomeB19013_001acs5_2019.tbl)
TNMedHouseIncomeB19013_001acs5_2019.tbl
# A tibble: 95 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 47001 Anderson County, Tennessee B19013_001 50392 2465.
2 47003 Bedford County, Tennessee B19013_001 50415 2663.
3 47005 Benton County, Tennessee B19013_001 37512 3367.
4 47007 Bledsoe County, Tennessee B19013_001 44122 4107.
5 47009 Blount County, Tennessee B19013_001 56667 2147.
6 47011 Bradley County, Tennessee B19013_001 51331 2347.
7 47013 Campbell County, Tennessee B19013_001 39803 2526.
8 47015 Cannon County, Tennessee B19013_001 55330 4619.
9 47017 Carroll County, Tennessee B19013_001 42637 2525.
10 47019 Carter County, Tennessee B19013_001 38092 2313.
# 85 more rows
# Use `print(n = ...)` to see more rows
writexl::write_xlsx(
TNMedHouseIncomeB19013_001acs5_2019.tbl,
path =
"D:\\R_Ceres\\TNMedHouseIncomeB19013_001acs5_2019.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("TNMedHouseIncomeB19013_001acs5_2019.xlsx")
base::file.info("TNMedHouseIncomeB19013_001acs5_2019.xlsx")
base::list.files(pattern =".xlsx")
As expected, the ACS5 survey is more inclusive: by pooling data over multiple
years, it provided a display of Tennessee Median Household Income statistics for
all 95 counties.
This brief display of American Community Survey data was purposely used to
provide a useful introduction to the complexity of Application Programming
Interface (API) construction. Know the nature of selected APIs and what they can
possibly provide, but be equally aware of their limitations and of what they do not
provide. Many R functions that serve as API clients have been purposely prepared to
follow along with a tidyverse approach to data science. It is not a mistake that the
tidycensus package includes the term tidy in its name; there is no mistaking how data,
when returned, are easily put into tidy format merely by using the output="tidy"
argument. Other packages may have functions that serve as API clients, but are the
data brought back in tidy format? This is clearly a desired output.
Examine the two datasets gained from initial use of the tidycensus::get_acs()
function:
• TNMedHouseIncomeB19013_001acs1_2019.tbl
• TNMedHouseIncomeB19013_001acs5_2019.tbl
Both datasets provide useful information on household income statistics:
• GEOID, State and county FIPS (Federal Information Processing System) code
A few actions could be used to manipulate the returned dataset and put it into desired
format, but it was returned in tidy format so it may not be necessary to go to great
lengths to organize the data. In the current format, it would be possible to prepare
descriptive statistics of median family income by county. For those who know the
5 Use Census Bureau resources at https://www.census.gov/programs-surveys/geography/guidance/
geo-areas/urban-rural.html and give attention to the file A state-sorted list of all 2020
Census Urban Areas for the US, Puerto Rico, and Island Areas first sorted by state FIPS
code, then sorted by Urban Area Census (UACE) code.
many counties in Tennessee, purposeful selection and deletions of both rows and
columns could be used to facilitate desired inferential analyses. With greater effort,
it is possible to merge the dataset with other data to support further inquiries, such
as comparisons of median family income by race-ethnicity breakouts, by level of
education breakouts, or by county population density.
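As one small example of the descriptive statistics mentioned above (a sketch following the summarize pattern used earlier in this text):
TNMedHouseIncomeB19013_001acs5_2019.tbl %>%
  dplyr::summarize(
    N = base::length(estimate),
    Minimum = base::min(estimate, na.rm=TRUE),
    Median = stats::median(estimate, na.rm=TRUE),
    Mean = base::mean(estimate, na.rm=TRUE),
    SD = stats::sd(estimate, na.rm=TRUE),
    Maximum = base::max(estimate, na.rm=TRUE))
# Descriptive statistics of 2019 median household income across
# all 95 Tennessee counties.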
Use the choroplethr package to create a choropleth-type map using data for all 95
Tennessee counties. It is important to know, however, that the choroplethr package
is organized so that the map is based on two required names for the object variables
in question: region (e.g., GEOID) and value (e.g., estimate, when using ACS sur-
veys, either acs1 or acs5). Make those accommodations before the choroplethr-
based maps are created.
TNMedHouseIncomeB19013_001acs5_2019.map <-
TNMedHouseIncomeB19013_001acs5_2019.tbl %>%
dplyr::rename(region=GEOID, value=estimate)
# When using the dplyr::rename() function to rename a column,
# the format is New_Name and then Old_Name, which for some
# may not be intuitive. Notice also how a single = character
# is used in this instance.
TNMedHouseIncomeB19013_001acs5_2019.map$region <-
as.numeric(TNMedHouseIncomeB19013_001acs5_2019.map$region)
# Put region (e.g., FIPS codes) into numeric format.
TNMedHouseIncomeB19013_001acs5_2019.map$value <-
as.numeric(TNMedHouseIncomeB19013_001acs5_2019.map$value)
# Put value (e.g., measured data) into numeric format.
base::attach(TNMedHouseIncomeB19013_001acs5_2019.map)
TNMedHouseIncomeB19013_001acs5_2019.map
DraftMapTNMedHouseIncomeB19013_001acs5_2019.fig <-
choroplethr::county_choropleth(
TNMedHouseIncomeB19013_001acs5_2019.map,
title=
"Tennessee 2019 Median Household Income by County",
legend = "Median Household Income",
num_colors = 9,
state_zoom = c("tennessee")) +
theme(plot.title = element_text(hjust = 0.5))
par(ask=TRUE); DraftMapTNMedHouseIncomeB19013_001acs5_2019.fig
TNMedHouseIncomeB19013_001acs5_2019.fig <-
choroplethr::county_choropleth(
TNMedHouseIncomeB19013_001acs5_2019.map,
title=
"Tennessee 2019 Median Household Income by County",
legend = "Median Household Income",
state_zoom=c("tennessee")) +
scale_fill_brewer(palette="Oranges") +
guides(fill=guide_legend(title=
"Median Household Income")) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="bottom")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is moved to the
# bottom of the figure, the title is centered, etc. Other
# embellishments could be provided, but it is judged that
# they are not needed.
# Fig. 6.1
par(ask=TRUE); TNMedHouseIncomeB19013_001acs5_2019.fig
Fig. 6.1
options(tigris_use_cache = TRUE)
TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl <-
tidycensus::get_acs(
# Use the tidycensus::get_acs() function as an R-based API
# client, to obtain American Community Survey (ACS) data from
# the Census Bureau.
#
# The data will be at a more granular level (census tract)
# and, with geometry = TRUE, are returned with simple
# features geometry suitable for mapping.
variables = "B19013_001",
state = "TN",
county = "Davidson",
geography = "tract",
geometry = TRUE,
year = 2019, # Year
output="tidy", # Output, tidy or wide
moe_level=95, # Margin of error
survey="acs5", # acs5, less restrictive than acs1
show_call=TRUE) # Display call made to Census API
# Note about survey= "acs1" or "acs5":
# acs1 - Data for areas with populations of 65,000+
# acs5 - Data for all areas
base::attach(TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl)
TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl
# A partial display of output, when geometry = TRUE is
# displayed, but parts of the output have been altered
# in appearance to allow for space.
GEOID geometry
1 47037018409 MULTIPOLYGON (((-86.98225 3...
2 47037016000 MULTIPOLYGON (((-86.77265 3...
3 47037011800 MULTIPOLYGON (((-86.77245 3...
4 47037017901 MULTIPOLYGON (((-86.84431 3...
5 47037018101 MULTIPOLYGON (((-86.885 36....
6 47037014200 MULTIPOLYGON (((-86.81129 3...
7 47037015804 MULTIPOLYGON (((-86.71027 3...
8 47037017402 MULTIPOLYGON (((-86.73944 3...
9 47037010702 MULTIPOLYGON (((-86.7287 36...
10 47037015618 MULTIPOLYGON (((-86.63462 3...
writexl::write_xlsx(
  TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl,
  path =
  "D:\\R_Ceres\\TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.xlsx",
  col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists(
  "TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.xlsx")
base::file.info(
  "TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.xlsx")
base::list.files(pattern =".xlsx")
TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = "orange") +
scale_fill_viridis_c(labels = scales::dollar) +
ggtitle(
"Davidson County, Tennessee (e.g., Nashville), Median Family
Income by Census Tract: 2019") +
theme(plot.title = element_text(hjust = 0.5))
# The fill scale needs to be declared only once.
# Much more, using the full set of available ggplot tools,
# could be done to improve the presentation of this figure.
# Fig. 6.2
The purpose of this syntax is to show how easy it is to obtain data using API functions associated with the tidyverse ecosystem (e.g., tidycensus::get_acs()) and to then use other functions associated with the tidyverse ecosystem to add value to the data by preparing figures (e.g., choroplethr::county_choropleth(), ggplot2::ggplot()) that communicate to the public the meaning of the data.
Fig. 6.2
Challenge: Use both the narrative and syntax in this addendum and other addenda
in this lesson to practice with APIs and the use of R. Then, in anticipation of later
needs and career advancement long after this text has been completed, put the nar-
rative and outcomes from these addenda into technical memorandum format. For
format, use the standard process touted throughout this text and equally seen among
other professionals in data science:
• Background
–– Description of the Data
–– Null Hypothesis
• Import Data
• Code Book and Data Organization
• Exploratory Graphics
–– Graphics Using Base R
–– Graphics Using the tidyverse Ecosystem
• Exploratory Descriptive Statistics and Measures of Central Tendency
• Exploratory Analyses
• Presentation of Outcomes
Let the technical memorandum sit for a few days, and then go back and review it
carefully, correcting any errors and problems with the presentation that may have
escaped attention at the time of preparation. As an additional quality assurance prac-
tice, ask a few trusted colleagues to review it for content, comprehension, flow, etc.
Avoid asking friends who feel obliged to offer congratulatory comments. Instead,
ask those who have a keen sense of professionalism to review it. It may be helpful
to ask the reviewer(s) to consider the following rubric, used for each memorandum:
• Background: What is the general theme for the summary memorandum? Why
were the data obtained, what do the data represent, and what are the characteris-
tics of the data? Does the Null Hypothesis, if included, address the identified
problem such that meaningful outcomes will likely come from analyses?
• Literature or Expertise: Although unlikely at these early stages, was any profes-
sional literature cited in the memorandum? Is the literature from a highly
regarded professional publication, a peer-reviewed publication accepted by the
scientific community?
• Methods: Would the identified methods for analysis allow replication? How were
the data obtained and how were the data organized? Are there any references to
how the original data could be obtained for those who wish to replicate and
expand upon cited analyses? Are the data in the public domain, allowing access
to all? Or are the data private (e.g., proprietary), requiring permission(s) for use?
• Descriptive Statistics: What are the measures of central tendency for numerically
oriented data (e.g., weight and systolic blood pressure) and what are the fre-
quency distributions for the factor-type data (e.g., race-ethnicity, gender) and
logical data (e.g., 0 and 1, Fail and Pass, Die and Survive)? Are the cited descrip-
tive statistics broad or are they specific for multiple breakouts (e.g., weight over-
all and then weight by gender, systolic blood pressure overall and then systolic
blood pressure by race-ethnicity)?
• Tables and Graphics: Were either tables or graphics used to describe the data and
the relationship(s) between and among the data? Describe the table(s), if used.
Describe the graphic(s), if used. Do both the tables and graphics support reader
comprehension?
• Inferential Analyses and Estimates of Association: Was there any attempt to
identify specific statistical tests in support of decision-making, to determine dif-
ferences between and among selected variables or groupings of variables, or to
determine if there is any meaningful relationship between and among variables
or groupings of variables? If not, was there at least a broad reference to statistical
analysis other than descriptive statistics (e.g., mean, standard deviation, median,
mode), percentages, and frequency distributions? If a statistically significant dif-
ference between groups is identified, is the term statistically significant used, and
if so, is a specific criterion p-value cited (e.g., 0.05, 0.01, 0.001)?
• Summary: How would the reader (as a consumer, informed citizen, taxpayer, voter, professional worker, dean, supervisor, etc.) benefit from reading this memorandum?
Addendum 1: Use of the tidyUSDA::getQuickstat() API
install.packages("tidyUSDA", dependencies=TRUE)
library(tidyUSDA)
IACountyCornYieldBuAc2021.tbl <-
tidyUSDA::getQuickstat(
key = 'UseTheKeyProvidedAtSign-Upxxxxxxxxxx',
# Based on a process similar to tidycensus,
# use the USDA NASS API key provided at
# https://quickstats.nass.usda.gov/api.
program = 'SURVEY',
year = '2021',
# The data will be specific to 2021, only.
# period = 'YEAR',
# There are many arguments associated with
# the tidyUSDA::getQuickstat() function. All
# are not needed for every attempt at data
# retrieval. A few are included in this
# syntax, to demonstrate their inclusion, but
# made inactive by using the # comment
# character.
geographic_level = 'COUNTY',
# county = c('COUNTY_1', 'COUNTY_2', 'etc.'),
# The remaining arguments below are assumed; they mirror the
# all-years query shown later in this addendum, restricted
# here to 2021, and include geometry = TRUE, consistent with
# the geometry column that appears in later output.
state = 'IOWA',
commodity = 'CORN',
data_item = 'CORN, GRAIN - YIELD, MEASURED IN BU / ACRE',
domain = 'TOTAL',
geometry = TRUE)
Footnote 6: If the NASS GUI were used, think of all the times it may be necessary to click (or not click) on either Year (from the 1920s into the early 2020s) or County (Iowa has 99 counties). Reproducible syntax seems like a good idea when confronted with this user experience.
Data scientists should assume little and verify much. With this caution as a general approach to the workflow, it is a best practice to verify how the returned data, IACountyCornYieldBuAc2021.tbl, are structured. The name of the tidyUSDA::getQuickstat() function might suggest that the data are returned in tidy format and, correspondingly, that the data are organized as a tibble, but is this assumption true? Use the tibble::is_tibble() function to verify the data format.
tibble::is_tibble(IACountyCornYieldBuAc2021.tbl)
[1] FALSE
Footnote 7: The tidyUSDA::getQuickstat() function by default returns a dataframe, not a tibble.
base::is.data.frame(IACountyCornYieldBuAc2021.tbl)
[1] TRUE
Use the tibble::as_tibble() function to address this concern. It is perhaps a minor concern, but it should still be addressed early on to remain consistent with the tidyverse ecosystem.
tibble::as_tibble(IACountyCornYieldBuAc2021.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
# A tibble: 85 × 59
GEOID year ALAND unit_desc short_desc Value
<chr> <int> <dbl> <chr> <chr> <dbl>
1 19021 2021 1488981867 BU / ACRE CORN, GRAIN - Y… 204.
2 19041 2021 1469139465 BU / ACRE CORN, GRAIN - Y… 203.
3 19059 2021 985574547 BU / ACRE CORN, GRAIN - Y… 178.
4 19063 2021 1025349325 BU / ACRE CORN, GRAIN - Y… 182.
5 19119 2021 1521901261 BU / ACRE CORN, GRAIN - Y… 217.
6 19141 2021 1484153957 BU / ACRE CORN, GRAIN - Y… 214.
7 19143 2021 1032593803 BU / ACRE CORN, GRAIN - Y… 208
8 19147 2021 1460400553 BU / ACRE CORN, GRAIN - Y… 190.
9 19149 2021 2234715257 BU / ACRE CORN, GRAIN - Y… 191.
10 19151 2021 1495047211 BU / ACRE CORN, GRAIN - Y… 214.
# 75 more rows
# 46 more variables: domain_desc <chr>, agg_level_desc <chr>,
# Use `print(n = ...)` to see more rows
With all upfront work accomplished, it is now possible to continue with standard
operations.
base::getwd()
base::ls()
base::attach(IACountyCornYieldBuAc2021.tbl)
utils::str(IACountyCornYieldBuAc2021.tbl)
dplyr::glimpse(IACountyCornYieldBuAc2021.tbl)
base::summary(IACountyCornYieldBuAc2021.tbl)
Look at descriptive statistics for 2021 Iowa corn yields (e.g., Value is listed in
this tibble) expressed as bushels per acre by agricultural district (e.g., asd_desc):
IACountyCornYieldBuAc2021.tbl %>%
dplyr::group_by(asd_desc) %>%
dplyr::filter(asd_desc %in% c(
"CENTRAL",
"EAST CENTRAL",
"NORTH CENTRAL",
"NORTHEAST",
"NORTHWEST",
"SOUTH CENTRAL",
"SOUTHEAST",
"SOUTHWEST",
"WEST CENTRAL")) %>%
# As a data trap quality assurance measure, retain the
# rows for agricultural districts (e.g., asd_desc) that
# are properly named and delete rows for all others,
# most importantly those rows with no value listed for
# asd_desc.
dplyr::summarize(
N = base::length(Value),
Minimum = base::min(Value, na.rm=TRUE),
Median = stats::median(Value, na.rm=TRUE),
Mean = base::mean(Value, na.rm=TRUE),
SD = stats::sd(Value, na.rm=TRUE),
Maximum = base::max(Value, na.rm=TRUE),
Missing = base::sum(is.na(Value))) %>%
base::print(n = 15)
# Descriptive statistics are generated by first using the
# dplyr::group_by() function against the object variable
# asd_desc. Then, to accommodate either poor data entry or
# no data entry for asd_desc (think of the row associated
# with OTHER COUNTIES, where there is no assigned datum for
# agricultural district), see how the dplyr::filter()
# function was used. Finally, the dplyr::summarize() function
# is used against a set of selected functions, all in an
# effort to make a neatly presented summary for all named
# agricultural districts.
# A tibble: 9 x 9
asd_desc N Minimum Median Mean SD Maximum Missing
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 CENTRAL 11 194. 211. 210. 8.47 223. 0
2 EAST CEN~ 6 192. 207. 205. 7.02 212. 0
3 NORTH CE~ 11 193. 204. 202. 5.95 211. 0
4 NORTHEAST 11 194. 206. 206. 8.19 219 0
5 NORTHWEST 11 178. 205. 203. 14.6 222. 0
6 SOUTH CE~ 8 166. 177. 180. 12.8 200 0
7 SOUTHEAST 9 153. 168. 173. 15.5 201. 0
8 SOUTHWEST 7 193. 208. 208. 7.76 217. 0
9 WEST CEN~ 10 200. 220. 219. 9.26 232. 0
# ... with 1 more variable: geometry <GEOMETRY >
This is all very interesting, but one year (2021) does not make a trend. Rainfall problems (a drought, too much rain overall, too much rain at a specific time in the growing cycle, etc.), a late or early frost, high winds that cause stalk lodging, and similar events can be severe in one part of a state and of minimal concern in other areas, especially in a large state such as Iowa. A longer time frame would be needed to make any reasonable inquiry into within-state regional differences.
To obtain data over multiple years and to use R in an effort to automate (e.g.,
iterate) this action, the purrr package and specifically the map() function and its
many variations are often used, as was seen in prior lessons. However, not all data resources are the same, nor are all R-based APIs. For data retrieval from
the USDA NASS resource with use of the tidyUSDA::getQuickstat() function, note
how data over multiple years (actually, all years with available data) can be obtained
without use of the purrr package and its map() function(s).
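A minimal sketch (not part of the original syntax) of the purrr-based iteration mentioned above, applied to tidycensus::get_acs(); it assumes a registered Census API key, and the years shown are illustrative only.
ACSYears <- 2017:2019 # Illustrative years, only
TNMedHouseIncomeMultiYear.tbl <-
  purrr::map_dfr(ACSYears, function(useyear) {
    tidycensus::get_acs(
      geography = "county",
      state = "TN",
      variables = "B19013_001",
      survey = "acs5",
      year = useyear,
      output = "tidy") %>%
      dplyr::mutate(year = useyear)
      # Tag each block of rows with its survey year so
      # the stacked tibble remains tidy.
  })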
IACountyCornYieldBuAcStarttoEnd.tbl <-
tidyUSDA::getQuickstat(
key = 'UseTheKeyProvidedAtSign-Upxxxxxxxxxx',
program = 'SURVEY',
# period = 'YEAR',
# As the tidyUSDA::getQuickstat() function
# has been organized, if a specific year is
# not declared (e.g., year = '2021') and
# period = 'YEAR' is also not used, the
# function fetches data for all available
# years. In other packages, such as those in
# the tidycensus package, the purrr::map()
# function is used to iterate over multiple
# years.
state = 'IOWA',
commodity = 'CORN',
data_item = 'CORN, GRAIN - YIELD, MEASURED IN BU / ACRE',
domain = 'TOTAL'
)
tibble::is_tibble(IACountyCornYieldBuAcStarttoEnd.tbl)
[1] FALSE
tibble::as_tibble(IACountyCornYieldBuAcStarttoEnd.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
base::getwd()
base::ls()
base::attach(IACountyCornYieldBuAcStarttoEnd.tbl)
utils::str(IACountyCornYieldBuAcStarttoEnd.tbl)
dplyr::glimpse(IACountyCornYieldBuAcStarttoEnd.tbl)
base::summary(IACountyCornYieldBuAcStarttoEnd.tbl)
writexl::write_xlsx(
IACountyCornYieldBuAcStarttoEnd.tbl,
path = "D:\\R_Ceres\\IACountyCornYieldBuAcStarttoEnd.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("IACountyCornYieldBuAcStarttoEnd.xlsx")
base::file.info("IACountyCornYieldBuAcStarttoEnd.xlsx")
base::list.files(pattern =".xlsx")
Following along with prior actions, look at descriptive statistics for Iowa corn
yields (bushels per acre) since the beginning of data collection to the most current
entries in the online dataset. Give special attention to the output for the variable
year, ranging from 1866 to 2022 – quite a range of years for data on corn yields.
IACountyCornYieldBuAcStarttoEnd.tbl %>%
dplyr::group_by(asd_desc) %>%
dplyr::filter(asd_desc %in% c(
"CENTRAL",
"EAST CENTRAL",
"NORTH CENTRAL",
"NORTHEAST",
"NORTHWEST",
"SOUTH CENTRAL",
"SOUTHEAST",
"SOUTHWEST",
"WEST CENTRAL")) %>%
dplyr::summarize(
N = base::length(Value),
Minimum = base::min(Value, na.rm=TRUE),
Median = stats::median(Value, na.rm=TRUE),
Mean = base::mean(Value, na.rm=TRUE),
SD = stats::sd(Value, na.rm=TRUE),
Maximum = base::max(Value, na.rm=TRUE),
Missing = base::sum(is.na(Value))
)
# A tibble: 9 x 8
asd_desc N Minimum Median Mean SD Maximum Missing
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 CENTRAL 1243 19.8 100. 106. 51.8 226 0
2 EAST CENTRAL 1050 26 96.1 105. 49.9 224. 0
3 NORTH CENTRAL 1150 15.8 100. 104. 53.1 213. 0
4 NORTHEAST 1150 17.1 95.3 103. 53.3 225. 0
5 NORTHWEST 1243 4.9 90.5 102. 54.2 222. 0
6 SOUTH CENTRAL 1137 4.1 75 83.2 45.8 200 0
7 SOUTHEAST 1147 7.7 89.6 94.9 48.8 220. 0
8 SOUTHWEST 953 5.9 83.3 92.6 49.8 217. 0
9 WEST CENTRAL 1243 3 87.9 99.1 52.9 235. 0
par(ask=TRUE)
ggplot2::ggplot(data=IACountyCornYieldBuAcStarttoEnd.tbl,
aes(x=year, y=Value)) +
geom_point(size=0.75) +
ggtitle(
"Iowa Corn Yields (Bushel per /Acre):
Mid-1860s to Early-2020s") +
labs(x="\nYear", y="Bushels per Acre\n") +
scale_x_continuous(limits=c(1865, 2022),
breaks=scales::pretty_breaks(n=20)) +
theme_Mac()
# Fig. 6.3
As evidenced in this figure, the upward trajectory of corn yields (bushels per acre) over time is amazing. The lowest yields in recent years are still several times greater than the highest yields from the earliest years. It is beyond the scope of
this addendum to offer specific reasons as to why yields have increased, but those
with a direct interest will certainly have ideas that include management practices
such as the use of inorganic fertilizers, herbicides, fungicides, insecticides, improved
plant breeding and seed selection, irrigation, etc.
Footnote 8: For those with special interest, review the biography of Henry A. Wallace, a native Iowan, who was appointed Secretary of Agriculture in 1933 and was elected as Vice President of the United States in 1940. His leadership, individually and in league with others, was instrumental in the development and use of hybrid corn beginning in the mid-1930s. Look at the figure on corn yields over time, and note how a few years after, by the late 1930s to early 1940s, corn yields per acre began their rapid ascent.
Fig. 6.3
Given the challenge of looking at trends over more recent times, consider using
tidyverse ecosystem tools to filter the comprehensive dataset to a manageable num-
ber of years, such as 2012 to 2021, a summary of corn yields over 10 growing
seasons:
IACountyCornYieldBuAc2012to2021.tbl <-
IACountyCornYieldBuAcStarttoEnd.tbl %>%
dplyr::group_by(year, asd_desc) %>%
# Now, group by year and asd_desc. The
# filter below restricts the dataset to
# 2012 to 2021, only.
dplyr::filter(asd_desc %in% c(
"CENTRAL",
"EAST CENTRAL",
"NORTH CENTRAL",
"NORTHEAST",
"NORTHWEST",
"SOUTH CENTRAL",
"SOUTHEAST",
"SOUTHWEST",
"WEST CENTRAL")) %>%
dplyr::filter(year %in% c(
"2012", "2013", "2014", "2015", "2016",
"2017", "2018", "2019", "2020", "2021"))
# Continue filtering, now selecting only
# those rows with data showing 2012 to 2021
# for the object variable year.
base::getwd()
base::ls()
base::attach(IACountyCornYieldBuAc2012to2021.tbl)
utils::str(IACountyCornYieldBuAc2012to2021.tbl)
dplyr::glimpse(IACountyCornYieldBuAc2012to2021.tbl)
base::summary(IACountyCornYieldBuAc2012to2021.tbl)
IACountyCornYieldBuAc2012to2021.tbl %>%
dplyr::summarize(
N = base::length(Value),
Minimum = base::min(Value, na.rm=TRUE),
Median = stats::median(Value, na.rm=TRUE),
Mean = base::mean(Value, na.rm=TRUE),
SD = stats::sd(Value, na.rm=TRUE),
Maximum = base::max(Value, na.rm=TRUE),
Missing = base::sum(is.na(Value))) %>%
base::print(n=90)
Selected sections of output were deleted to save space.
# A tibble: 90 x 9
# Groups: year [10]
year asd_desc N Minimum Median Mean SD Maximum
<int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2012 CENTRAL 13 129 151. 151. 9.64 164.
2 2012 EAST CENTRAL 11 115. 132. 135. 12.4 159.
3 2012 NORTH CENTRAL 12 122 144. 142. 14.3 165.
4 2012 NORTHEAST 12 126. 140 140. 10.0 154.
5 2012 NORTHWEST 13 110. 161. 157. 16.3 171.
6 2012 SOUTH CENTRAL 12 44.5 80.8 79.5 20.9 115.
7 2012 SOUTHEAST 12 52.4 119. 118. 27.9 163.
8 2012 SOUTHWEST 10 91.7 121. 119. 13.5 132
9 2012 WEST CENTRAL 13 105. 131. 127. 12.4 151.
10 2013 CENTRAL 13 135 154. 152. 12.7 181.
And as was seen before, a simple figure will help communicate findings on corn yields, but now only for a 10-year period, 2012 to 2021 (Figs. 6.4 and 6.5).
Fig. 6.4
Fig. 6.5
par(ask=TRUE)
ggplot2::ggplot(data=IACountyCornYieldBuAc2012to2021.tbl,
aes(x=asd_desc, y=Value)) +
geom_point(size=1.05) +
coord_flip() + # Use coord_flip to reverse X and Y axis
ggtitle(
"Iowa Corn Yields (Bushel per Acre) by Agricultural District:
2012 to 2021") +
labs(x="Agricultural District\n", y="Bushels per Acre\n") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=00))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# For this figure, observe how vjust=0.5 was used, to make
# the Agricultural District text align with the tick marks.
# Fig. 6.4
par(ask=TRUE)
ggplot2::ggplot(data=IACountyCornYieldBuAc2012to2021.tbl,
aes(x=as.factor(year), y=Value)) +
# The object variable year is temporarily put into factor
# format so that all years show on the X axis. Without
# this practice, the scale on the X axis would have been
# for every other year. A simple fix such as this attempt
# can be useful and does not require excessive efforts.
geom_point(size=1.05) +
ggtitle(
"Iowa Corn Yields (Bushel per Acre) by Year: 2012 to 2021") +
labs(x="\nYear", y="Bushels per Acre\n") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=00))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# For this figure, observe how vjust=0.5 was used, to make
# the Y axis (Bushels per Acre) text align with the tick marks.
# Fig. 6.5
Data scientists provide value, not only printouts of tables and figures, as valuable as those tables and figures may be. Use the agricolae package and its support for Oneway ANOVA to see if there are statistically significant (p <= 0.05) differences in corn yields by agricultural district over the ten-year period, 2012 to 2021.
install.packages("agricolae", dependencies=TRUE)
library(agricolae)
agricolae::HSD.test(
stats::aov(Value ~ asd_desc,
# Recall that the object variable Value represents corn
# yield, measured in bushels per acre.
data=IACountyCornYieldBuAc2012to2021.tbl), # Model
trt="asd_desc", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main=
"Iowa Corn Yields (Bushels per Acre) by Agricultural
District Using Tukey's HSD (Honestly Significant
Difference) Parametric Oneway ANOVA: 2012 to 2021")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
asd_desc, means
Value groups
NORTHEAST 189.321 a
NORTHWEST 188.462 a
EAST CENTRAL 187.410 a
CENTRAL 186.893 a
NORTH CENTRAL 186.218 ab
WEST CENTRAL 185.862 ab
SOUTHWEST 175.298 bc
SOUTHEAST 172.621 c
SOUTH CENTRAL 153.777 d
Using Oneway ANOVA, there are statistically significant (p <= 0.05) differences in corn yields (bushels per acre) from 2012 to 2021 by agricultural district. Yields ranged from a high of approximately 186 to 189 mean bushels per acre (Group a) to a low of about 154 mean bushels per acre (Group d). The output provides exact details on group membership, individual and overlapping (e.g., group ab and group bc).
However, environmental conditions are not the same each year, and of course wind, rainfall, temperatures, hours of sunlight, etc. could impact yields. How does year fit into this study of corn yields?
Footnote 9: Without going beyond the scope of this addendum, environmental conditions such as temperature are not easily measured as one or two datapoints in a spreadsheet, such as mean monthly high temperature and mean monthly low temperature. Consider the impact of extreme temperatures on corn, both high temperatures and low temperatures, especially during tasseling and silking, when pollination occurs. Extreme temperatures during these critical stages of development may have an adverse impact on kernel formation or grain fill. Large stalks, where the corn is as high as an elephant's eye, may look good when driving past a field, but most corn is grown, sold, and used for grain, not fodder.
IACountyCornYieldBuAc2012to2021.tbl$year <-
factor(IACountyCornYieldBuAc2012to2021.tbl$year)
base::attach(IACountyCornYieldBuAc2012to2021.tbl)
base::class(IACountyCornYieldBuAc2012to2021.tbl$year)
TwowayAgDistrictYear <-
stats::aov(Value ~ asd_desc * year,
data=IACountyCornYieldBuAc2012to2021.tbl)
# Twoway ANOVA for Value (corn yield) by
# Agricultural District and Year
base::summary(TwowayAgDistrictYear)
# Generate a Twoway ANOVA table.
The prior Oneway ANOVA confirmed that there are statistically significant (p <=
0.05) differences in corn yield by agricultural district. This simple Twoway ANOVA
confirms that finding. It is also evident that there are statistically significant (p <=
0.05) differences in corn yield by year. Note that there is also a statistically signifi-
cant (p <= 0.05) interaction between agricultural district and year. The
agricolae::HSD.test() function, in concert with this Twoway ANOVA approach, can
offer more understanding of these outcomes.
agricolae::HSD.test(TwowayAgDistrictYear, "asd_desc",
group=TRUE, console=TRUE)
# Use Tukey's HSD (Honestly Significant Difference) Mean
# Comparison Test to determine which breakouts are in common
# groups, and which breakout groups are not.
asd_desc, means
Value groups
NORTHEAST 189.2314 a
NORTHWEST 188.3848 a
EAST CENTRAL 187.3894 a
CENTRAL 186.8008 a
NORTH CENTRAL 186.1136 a
WEST CENTRAL 185.7400 a
SOUTHWEST 175.2172 b
SOUTHEAST 172.5696 b
SOUTH CENTRAL 153.7190 c
The same approach is needed for year, which is now a factor-type object variable
and easily used in this approach:
agricolae::HSD.test(TwowayAgDistrictYear, "year",
group=TRUE, console=TRUE)
# Use Tukey's HSD (Honestly Significant Difference) Mean
# Comparison Test to determine which breakouts are in common
# groups, and which breakout groups are not.
year, means
Value groups
2021 200.3429 a
2016 200.1991 a
2017 197.8962 ab
2019 193.8970 bc
2018 192.6743 cd
2015 187.8038 d
2014 178.1537 e
2020 175.0211 e
2013 159.9509 f
2012 130.3963 g
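The statistically significant interaction noted above can also be inspected. The sketch below (not part of the original syntax) uses the base-R stats::TukeyHSD() function on the same model object to list pairwise comparisons of the agricultural-district-by-year combinations.
InteractionHSD <- stats::TukeyHSD(TwowayAgDistrictYear,
  which = "asd_desc:year")
# Each row compares one district-by-year combination with
# another, reporting the difference in mean yield, a
# confidence interval, and an adjusted p value.
utils::head(InteractionHSD$`asd_desc:year`)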
Fig. 6.6
par(ask=TRUE)
savecex.axis <- par(cex.axis=0.65) # Adjust label
base::plot(agricolae::HSD.test(TwowayAgDistrictYear,
"asd_desc", alpha=0.05, group=TRUE),
main="Common Subgroups of Corn Yield by Agricultural District")
par(savecex.axis) # Return original setting.
# Fig. 6.6
par(ask=TRUE)
savecex.axis <- par(cex.axis=0.65) # Adjust label
base::plot(agricolae::HSD.test(TwowayAgDistrictYear,
"year", alpha=0.05, group=TRUE),
main="Common Subgroups of Corn Yield by Year")
par(savecex.axis) # Return original setting.
# Fig. 6.7
Value Added Challenge 1 – State Map of Iowa Corn Yield (Bushels per Acre)
in 2021.
It is stated throughout this text that data scientists add value. Look at the way a mapping package cognate to the tidyverse ecosystem, choroplethr, is used to provide a state map of county-by-county corn yields in 2021, with county borders in Iowa clearly showing on the map. Outcomes (e.g., gradients in yield, bushels per acre) show on a color-based choropleth. With a choropleth, it is important to see how dark-colored areas in the map indicate high levels of the variable in question (corn yield in this addendum), whereas lighter areas indicate low levels of the variable in question.
Footnote 10: Going back to the late 1700s and passage of the Northwest Ordinance, note how many counties in what later became the state of Iowa (and other midwestern states) are generally square. There are many resources that explain how land was surveyed into one-square-mile sections (640 acres), with 36 adjoining sections organized into a survey township, all on a rectangular grid. Townships were then collectively organized into what are commonly called box-shaped counties. There are exceptions, of course, considering natural borders such as rivers, but the organizational structure of borders has historically impacted land ownership, farming practices, etc.
Fig. 6.7
Fig. 6.8
The choroplethr package only needs two object variables to make a state-based
county-wide map, the five-digit FIPS code for each county (e.g., region) and the
measured data (e.g., value). Although tidyverse functions could be used, such as the
dplyr::rename() function, Base R functions will be used to put the data into required
format for production of the choropleth, using the choroplethr::county_choropleth()
function (Fig. 6.8).
base::names(IACountyCornYieldBuAc2021.tbl)[
names(IACountyCornYieldBuAc2021.tbl) ==
'GEOID'] <- 'region'
# Use the base::names() function to rename the object
# variable GEOID to region. Give attention to the use
# of lowercase in this instance.
IACountyCornYieldBuAc2021.tbl$region <-
as.numeric(IACountyCornYieldBuAc2021.tbl$region)
# The object variable region must be numeric to
# use the choroplethr package.
base::names(IACountyCornYieldBuAc2021.tbl)[
names(IACountyCornYieldBuAc2021.tbl) ==
'Value'] <- 'value'
# Use the base::names() function to rename the object
# variable Value to value. Give attention to the use
# of lowercase in this instance. R is case sensitive.
IACountyCornYieldBuAc2021.tbl$value <-
as.numeric(IACountyCornYieldBuAc2021.tbl$value)
# The object variable value must be numeric to
# use the choroplethr package. This action may
# be redundant, but it is done as a reminder of
# package requirements.
base::getwd()
base::ls()
base::attach(IACountyCornYieldBuAc2021.tbl)
utils::str(IACountyCornYieldBuAc2021.tbl)
dplyr::glimpse(IACountyCornYieldBuAc2021.tbl)
base::summary(IACountyCornYieldBuAc2021.tbl)
IACornYieldByCounty2021.map <-
choroplethr::county_choropleth(
IACountyCornYieldBuAc2021.tbl,
title = "Iowa Corn Yields (Bushels per Acre)
by County: 2021",
num_colors = 7,
# It is possible to have up to 9 breakouts for color
# gradients, but 7 was purposely selected for this
# state-wide map. Whether 9 or 7 were used, it is
# fairly easy to see the major areas where Iowa corn
# yields were reported.
state_zoom = c("iowa")) +
scale_fill_brewer(name="Bushels per Acre", palette=2) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="left")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is left justified,
# the title is centered, etc. Other embellishments could
# be provided, but it is judged that they are not needed.
# Fig. 6.8
par(ask=TRUE); IACornYieldByCounty2021.map
An accompanying table of corn yields (bushels per acre) would make a good
addition to the final summary on this inquiry.
IACornYieldBuAcCountyByCounty2021.tbl <-
IACountyCornYieldBuAc2021.tbl %>%
dplyr::group_by(county_name) %>%
dplyr::filter(asd_desc %in% c(
"CENTRAL",
"EAST CENTRAL",
"NORTH CENTRAL",
"NORTHEAST",
"NORTHWEST",
"SOUTH CENTRAL",
"SOUTHEAST",
"SOUTHWEST",
"WEST CENTRAL")) %>%
# As a data trap quality assurance measure, retain the
# rows for agricultural districts (e.g., asd_desc) that
# are properly named and delete rows for all others,
# most importantly those rows with no value listed for
# asd_desc since the county will also be in question.
dplyr::summarize(
N = base::length(value),
Maximum = base::max(value, na.rm=TRUE),
Missing = base::sum(is.na(value)))
# Remember that the object variable Value was renamed
# to value, to accommodate choroplethr requirements.
#
# There is only one datum (corn yield value) for each
# county since this analysis is specific to 2021, so
# a full set of descriptive statistics is not needed.
base::print(IACornYieldBuAcCountyByCounty2021.tbl, n=99)
# Allow for all 99 counties, knowing that data are
# missing for some.
#
# At first it may seem unnecessary, but in the original
# printout of this inquiry, the geometry data follow
# along with each county, county by county. Those
# coordinates were trimmed from the printout, but some
# may find it helpful if mapping is attempted,
# especially if mapping is done that requires latitude
# and longitude coordinates.
Selected sections of output were deleted to save space.
80 WINNEBAGO 1 199. 0
81 WINNESHIEK 1 200. 0
82 WOODBURY 1 208. 0
83 WORTH 1 208. 0
84 WRIGHT 1 193. 0
Challenge: Sort the values and list the three counties that had the highest corn
yields in 2021 and the three counties that had the lowest corn yields in 2021 (see the sketch below for one approach). Then look at the historical data, which were obtained earlier, and go back about 100 years to the
1920s. What were corn yields at that time, the highest three counties and the lowest
three counties? For those with special interest, look at farming practices then and
farming practices now. Go beyond corn yields, only, and find data on national popu-
lation, world population, and the degree to which Iowa corn is exported to other
nations, 100 or so years ago and now. Consider how the calories generated by Iowa
corn farmers feed, in part, a hungry world far beyond Iowa.
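One possible approach to the sorting portion of this challenge is sketched below (it is not part of the original syntax); it assumes the renamed value column and the county_name column seen earlier in this addendum.
IACountyCornYieldBuAc2021.tbl %>%
  dplyr::select(county_name, value) %>%
  dplyr::slice_max(value, n = 3)
# The three counties with the highest 2021 corn yields.
IACountyCornYieldBuAc2021.tbl %>%
  dplyr::select(county_name, value) %>%
  dplyr::slice_min(value, n = 3)
# The three counties with the lowest 2021 corn yields.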
Challenge: Review online resources for annual precipitation for a few represen-
tative cities in Iowa, 2012 to 2021. As an example, selecting Sac City, Sac County,
Iowa, it is evident that 2012 and 2013 were dry years. Is it surprising that corn yields
were low in those years, compared to other years? There is far more to corn yields
than annual precipitation, but rain and snow certainly impact available soil mois-
ture. Look at resources made available by the National Weather Service and use a
correlation technique, possibly Pearson’s r, to compare annual rainfall by year to
corn yields by year. What is the correlation estimate? If possible, be sure to exclude
data from irrigated acreage. Are the federal data available and organized to allow
that exclusion?
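One hedged sketch of the correlation suggested in this challenge is shown below. The object names SacCityPrecip2012to2021.tbl (columns year and precip_in) and IAYieldByYear.tbl (columns year and mean_yield) are hypothetical placeholders; they would need to be built from National Weather Service data and from the yield data already obtained.
# Hypothetical objects: SacCityPrecip2012to2021.tbl and
# IAYieldByYear.tbl are placeholders, not datasets created
# in this lesson.
PrecipVsYield.tbl <-
  dplyr::inner_join(IAYieldByYear.tbl,
    SacCityPrecip2012to2021.tbl,
    by = "year")
stats::cor.test(PrecipVsYield.tbl$mean_yield,
  PrecipVsYield.tbl$precip_in,
  method = "pearson")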
Value Added Challenge 2 – Other R-Based APIs to Gain USDA NASS Data.
Merely to show that there are multiple R-based APIs associated with the USDA
NASS resource, look at the R package rnassqs. It has many useful features and
should be considered, at least for some queries to NASS.
install.packages("rnassqs", dependencies=TRUE)
library(rnassqs)
Cranberry2010to2021.list <-
list(commodity_name="CRANBERRIES",
year=c("2010", "2011", "2012", "2013", "2014", "2015",
"2016", "2017", "2018", "2019", "2020", "2021"),
agg_level_desc="STATE",
state_alpha=c("MA", "NJ", "OR", "WA", "WI"),
commodity_desc="CRANBERRIES",
statisticcat_desc="YIELD")
# Prepare a list of parameters for use with the NASS
# data resource. Specifically, this list will query
# NASS for cranberry production for five states, from
# 2010 to 2021. This is one more example of how R-
# based APIs can be used to get data over multiple
# years and in this example data from multiple states.
base::class(Cranberry2010to2021.list)
nassqs_auth(key="UseTheKeyProvidedAtSign-Upxxxxxxxxxx")
# The USDA NASS key is free, but it must first be
# obtained at https://quickstats.nass.usda.gov/api.
Cranberry2010to2021.df <-
rnassqs::nassqs(Cranberry2010to2021.list)
# The rnassqs::nassqs() function activates an HTTP GET
# data request to the USDA NASS Quick Stats API, with
# data returned as a data.frame.
base::getwd()
base::ls()
base::attach(Cranberry2010to2021.df)
utils::str(Cranberry2010to2021.df)
dplyr::glimpse(Cranberry2010to2021.df)
base::summary(Cranberry2010to2021.df)
writexl::write_xlsx(
Cranberry2010to2021.df,
path = "D:\\R_Ceres\\Cranberry2010to2021.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
Much can be done with this dataset. Although it may not be an absolute require-
ment, put the dataframe into tibble format, to facilitate use of tidyverse ecosystem
functions. Then take time to study the key object variables. Give special attention to
the somewhat unique (e.g., historical) way that cranberry yield is measured,
BARRELS / ACRE. Search the Internet to learn more about this measure, where
there is a uniform weight set by the federal government even though it has been
years since cranberries were packed, distributed, and sold in barrels (e.g., like the
way corn is no longer packed in bushel baskets as part of the distribution process).
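A minimal sketch of the tibble conversion and quick inspection suggested above (not part of the original syntax); the columns named here (year, state_name, unit_desc, Value) are standard USDA Quick Stats fields, as seen in earlier output.
Cranberry2010to2021.tbl <-
  tibble::as_tibble(Cranberry2010to2021.df)
# Keep the converted object for later tidyverse work.
Cranberry2010to2021.tbl %>%
  dplyr::select(year, state_name, unit_desc, Value) %>%
  dplyr::glimpse()
# unit_desc should read BARRELS / ACRE for this commodity.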
From among many possible ways this dataset could be used, a few hints for com-
munication to a general audience would include:
• Start out with the use of Base R, and make a simple plot of year (X-axis) by
Value (Y-axis). The syntax and accompanying figure are shown below.
• Prepare a line graph of cranberry yields over time by state. This type of figure
would be extremely useful for those who are not fully acquainted with this crop
and where it is grown. Since there are only five states (e.g., breakouts) in this
example, it is likely that one unified figure, generated by using the ggplot2::ggplot()
function, would be sufficient. The syntax and accompanying figure are
shown below.
• It might also be beneficial, for those who require more precise detail, to prepare a table of yields by state and by year, and then again by year and by state. The dplyr::group_by() function and the dplyr::summarize() function will serve as at least one means of obtaining these statistics in a presentable format; a brief sketch follows below (Figs. 6.9 and 6.10).
Footnote 11: Use of facet_wrap() may not be needed, but it should still be attempted just to see if it is a reasonable way of communicating outcomes.
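The table-building hint in the bullet above might be sketched as follows (not part of the original syntax), using the tibble created in the earlier sketch; Value arrives as character from the API, so it is converted to numeric.
Cranberry2010to2021.tbl %>%
  dplyr::group_by(state_name, year) %>%
  dplyr::summarize(
    YieldBarrelsPerAcre = base::mean(as.numeric(Value), na.rm=TRUE),
    .groups = "drop") %>%
  base::print(n = 60)
# Swap the grouping order (year, then state_name) for the
# year-by-state view.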
base::plot(as.numeric(Value) ~ as.factor(year), # Y Axis ~ X Axis
data=Cranberry2010to2021.df,
main="Base R - Cranberry Yield (BARRELS / ACRE)
by Year: 2010 to 2021")
# Fig. 6.9
ggplot2::ggplot(data=Cranberry2010to2021.df,
aes(x=as.factor(year), y=as.numeric(Value), group=state_name,
color=state_name, shape=state_name)) +
geom_point(size=1.5) +
geom_line(size=1.0) +
labs(title =
"ggplot2 - Cranberry Yield (BARRELS / ACRE)
by Year: 2010 to 2021")
# Much more can and should be done to this figure to
# add value to the consumer:
# Add descriptive labels to the X Axis and Y Axis.
# Adjust the Y Axis scale to show 0 to maximum, plus
# a little more to allow full display of yields.
# Adjust the number of breaks along the Y Axis.
# Use a theme that is more vibrant and easier to see
# at a distance.
# Change the legend title to State, from state_name.
# In a selected open space area, add an annotation
# to explain the meaning of BARRELS / ACRE, a term
# that few would understand.
# Fig. 6.10
Fig. 6.9
Fig. 6.10
The challenge for Addendum 2 is to first obtain data on corn prices per bushel,
looking at the earliest historical records maintained by the United States Department
of Agriculture. Then obtain data on acreage planted in corn, using the earliest his-
torical records maintained by the United States Department of Agriculture. Once
the data are obtained, depending on structure, put the data into one unified dataset
so that it is possible to look at the association between prices and acreage. Is there a
statistically significant (p <= 0.05) correlation (e.g., association) between the two, corn prices and corn acreage? Use the tidyverse ecosystem to plot the two variables, corn prices (X-Axis) and corn acreage (Y-Axis).
Farming, whether by a small family farm operator or a large-scale agribusiness,
must show a profit if it is to continue for any length of time. We all have bills to pay.
How do profitability, farm management and conservation, corn prices, and corn
acreage fit into biostatistics?
Consider farming systems and management practices and their impact on the
environment in toto. Corn is a critical crop and vast acreage is devoted to corn pro-
duction, not only for domestic use but also as an exported commodity, where corn
is used for: human consumption; feed, forage, and bedding for livestock; fuel (e.g., ethanol); and many industrial uses (e.g., adhesives, enzymes, textile dyes), etc. Corn
usage is far greater than many could ever imagine.
However, if farm operators cannot make a reasonable year-after-year profit, they may elect to raise other row crops, or perhaps take the land out of row crop production. Even with the use of no-till farming practices, corn production often results in more erosion of valuable topsoil than pasture and hay crops. How do farm operators balance environmental protection v production of vital crops v reasonable profits, etc.? Although the term sustainable farming is increasingly seen
in the popular press and university curricula, farm operators must be able to make
a profit if they are to continue feeding the world. Consider these and related issues
as the comparison of corn acreage planted v commodity prices is determined.
These two concepts (acreage in production and commodity prices) are indeed
related to biostatistics.
Footnote 12: When determining a correlation coefficient between two variables, whether using Pearson's r, Spearman's rho, or Kendall's tau, always keep in mind two related common expressions: (1) past behavior is the best predictor of future behavior; and (2) correlation does not suggest causation.
IACornPricePerBushel1867Onward.tbl <-
tidyUSDA::getQuickstat(
key='UseTheKeyProvidedAtSign-Upxxxxxxxxxx',
program='SURVEY',
# period='YEAR',
# As the tidyUSDA::getQuickstat() function
# has been organized, if a specific year is
# not declared (e.g., year = '2021') and
# period = 'YEAR' is also not used, the
# function fetches data for all available
# years. In other packages, such as those in
# the tidycensus package, the purrr::map()
# functions are used to iterate over multiple
# years.
state='IOWA',
commodity='CORN',
data_item='CORN, GRAIN - PRICE RECEIVED, MEASURED IN $ / BU',
domain='TOTAL'
)
# Comment: Computing systems are neither as perfect nor as
# robust as desired. Syntax for the above NASS-specific API
# was attempted in the late AM, a few minutes before noon
# Eastern Time, a time of day when systems often see heavy
# use. In return, the following error message was generated
# after the data call timed out:
#
# Error: API did not return results. First verify that your
# input parameters work on the NASS website:
tibble::is_tibble(IACornPricePerBushel1867Onward.tbl)
[1] FALSE
tibble::as_tibble(IACornPricePerBushel1867Onward.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
# A tibble: 1,529 x 39
base::getwd()
base::ls()
base::attach(IACornPricePerBushel1867Onward.tbl)
utils::str(IACornPricePerBushel1867Onward.tbl)
dplyr::glimpse(IACornPricePerBushel1867Onward.tbl)
base::summary(IACornPricePerBushel1867Onward.tbl)
writexl::write_xlsx(
IACornPricePerBushel1867Onward.tbl,
path = "D:\\R_Ceres\\IACornPricePerBushel1867Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("IACornPricePerBushel1867Onward.xlsx")
base::file.info("IACornPricePerBushel1867Onward.xlsx")
base::list.files(pattern =".xlsx")
The file may seem to be a bit longer (e.g., to have more rows) than might at first be expected. Look closely at the values in the freq_desc column, and observe how data are provided both MONTHLY and ANNUAL. The data are valuable and can serve many potential uses, but they are not tidy, with MONTHLY and ANNUAL data all in the same column, freq_desc. This is not a tidy approach to data organization.
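Before filtering, a quick check (a sketch, not part of the original syntax) of how the rows split across the freq_desc column supports the assume-little-and-verify-much habit described earlier.
IACornPricePerBushel1867Onward.tbl %>%
  dplyr::count(freq_desc)
# One row per freq_desc value (e.g., ANNUAL, MONTHLY), with
# n giving the number of rows in each group.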
Use tidyverse ecosystem tools to filter out the MONTHLY rows, resulting in a dataset that is focused only on ANNUAL corn prices per bushel. Ideally, if ANNUAL is filtered correctly for freq_desc, there will be one and only one row of data for each year in the dataset, resulting in 155 rows, the number of years associated with this dataset.
IACornPricePerBushel1867OnwardADJUSTED.tbl <-
IACornPricePerBushel1867Onward.tbl %>%
dplyr::filter(freq_desc %in% c(
"ANNUAL"))
# Retain the rows that have ANNUAL
# (only!) in the freq_desc column and
# delete all other rows.
# After this filtering action, focusing on
# freq_desc, the resulting dataset consists
# of 155 rows -- data from 1867 to 2022.
base::getwd()
base::ls()
base::attach(IACornPricePerBushel1867OnwardADJUSTED.tbl)
utils::str(IACornPricePerBushel1867OnwardADJUSTED.tbl)
dplyr::glimpse(IACornPricePerBushel1867OnwardADJUSTED.tbl)
base::summary(IACornPricePerBushel1867OnwardADJUSTED.tbl)
There is now a somewhat manageable dataset of Iowa corn prices per bushel over
a multiple-year period, beginning in 1867. Move on to similar actions, but to obtain
acreage devoted to Iowa corn production by year. See how many prior actions can
be copied and pasted, with just clever changing of dataset names and object vari-
able names.
Immediately, observe how data on Iowa corn acreage starts in 1926 whereas data on
Iowa corn prices went back to 1867. The USDA NASS resource is a great resource,
but the unavailability of data from the past is a constant issue for those who work in
data science. Even with this limitation, approximately 100 years of data are avail-
able on Iowa corn acreage.
IACornAcresPlanted1926Onward.tbl <-
tidyUSDA::getQuickstat(
key='UseTheKeyProvidedAtSign-Upxxxxxxxxxx',
program='SURVEY',
# period='YEAR',
# As the tidyUSDA::getQuickstat() function
# has been organized, if a specific year is
# not declared (e.g., year = '2021') and
# period = 'YEAR' is also not used, the
# function fetches data for all available
# years. In other packages, such as those in
# the tidycensus package, the purrr::map()
# functions are used to iterate over multiple
# years.
state='IOWA',
commodity='CORN',
data_item='CORN - ACRES PLANTED',
domain='TOTAL'
)
tibble::is_tibble(IACornAcresPlanted1926Onward.tbl)
[1] FALSE
tibble::as_tibble(IACornAcresPlanted1926Onward.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
# A tibble: 10,538 × 39
base::getwd()
base::ls()
base::attach(IACornAcresPlanted1926Onward.tbl)
utils::str(IACornAcresPlanted1926Onward.tbl)
dplyr::glimpse(IACornAcresPlanted1926Onward.tbl)
base::summary(IACornAcresPlanted1926Onward.tbl)
writexl::write_xlsx(
IACornAcresPlanted1926Onward.tbl,
path = "D:\\R_Ceres\\IACornAcresPlanted1926Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("IACornAcresPlanted1926Onward.xlsx")
base::file.info("IACornAcresPlanted1926Onward.xlsx")
base::list.files(pattern =".xlsx")
This file is quite large in that it has information on acreage planted for all IOWA
agricultural district breakouts as well as IOWA overall, shown in the location_desc
column. To add more complexity, rows that read IOWA, only, in the location_desc column appear multiple times per year due to multiple entries in the reference_period_desc column. As an example, for year 2022, there are three entries for
the reference_period_desc column: YEAR, YEAR – JUN ACREAGE, YEAR –
MAR ACREAGE. Efforts are needed to end up with a dataset that has one and only
one composite entry of Value (e.g., CORN – ACRES PLANTED) and year
(e.g., year).
There are many ways to approach this data-wrangling challenge, and from
among the many options, here is an overall suggestion:
• For the object variable (e.g., column) location_desc, use the dplyr::filter() func-
tion to filter out all rows other than those rows that have IOWA, only, in the loca-
tion_desc column. This should result in 119 rows.
• Then for the object variable (e.g., column) reference_period_desc, use the
dplyr::filter() function to filter out all rows other than those rows that have YEAR,
only, in the reference_period_desc column. This should result in a final set of
98 rows.
Footnote 13: Do not obtain data and then immediately rush in and start analyses. Take time to study the data, visually and by using software. There is a reason for the 80-20 rule.
IACornAcresPlanted1926OnwardADJUSTED.tbl <-
IACornAcresPlanted1926Onward.tbl %>%
dplyr::filter(location_desc %in% c(
"IOWA")) %>%
# Retain the rows that have IOWA (only!)
# in the location_desc column and delete
# all others.
dplyr::filter(reference_period_desc %in% c(
"YEAR"))
# Retain the rows that have YEAR (only!)
# in the reference_period_desc column and
# delete all others.
# After the two filtering actions, both
# location_desc and reference_period_desc,
# the resulting dataset consists of 98 rows.
base::getwd()
base::ls()
base::attach(IACornAcresPlanted1926OnwardADJUSTED.tbl)
utils::str(IACornAcresPlanted1926OnwardADJUSTED.tbl)
dplyr::glimpse(IACornAcresPlanted1926OnwardADJUSTED.tbl)
base::summary(IACornAcresPlanted1926OnwardADJUSTED.tbl)
There are two datasets of interest, an adjusted dataset of Iowa corn acreage by year
and another adjusted dataset on Iowa corn prices by year. Both datasets have many
variables that may be interesting but have little to do with the inquiry at hand – an
examination of the association between corn production (acreage) and corn prices.
Does acreage increase as prices increase? Does acreage increase as prices decrease?
Consider use of the dplyr::select() function to put the two datasets into more
manageable order:
IACornPricePerBushel1867OnwardTrimmed.tbl <-
IACornPricePerBushel1867OnwardADJUSTED.tbl %>%
dplyr::select('statisticcat_desc', 'Value',
'freq_desc', 'state_alpha', 'year')
# The trimmed dataset for Iowa corn prices will be
# restricted to these five object variables.
base::attach(IACornPricePerBushel1867OnwardTrimmed.tbl)
utils::head(IACornPricePerBushel1867OnwardTrimmed.tbl)
The dataset is now tidy and manageable, consisting of 155 rows and 5 variables. A simple figure should give a good idea of year-by-year change in prices, and those with special interest may want to match these changes with world events, since corn is an exported farm commodity that is sensitive to global events. Going back to the run-up of corn prices during World War I, what was the impact of demand for Iowa corn during World War II? For those with special interest, look at the 1942 legislation on price controls and the conflicts between Leon Henderson (Office of Price Administration) and Claude Wickard (Secretary of Agriculture), a conflict that resulted in policy decisions with profound implications on then and later corn production, as well as other agricultural products (Fig. 6.11).
Footnote 14: Look at the price of Iowa corn in 1915 ($0.63 per bushel) and the rapid increases up to 1919 ($1.34 per bushel), a more than doubling of price in only a few years. Was World War I and the worldwide demand for food responsible for this increase? Give special attention to the $1.34 per bushel price of Iowa corn in 1919 and how the price of Iowa corn crashed throughout the 1920s and 1930s, with a low for Iowa corn at $0.32 per bushel in 1932. Concurrent with these low prices, review available resources on the Great Depression and the Dust Bowl. It is dangerous to say that X caused Y, but is there an association between the low prices for farm commodities in the 1920s and 1930s, the Great Depression of the 1930s, and the prior use of compromised farming practices that may have contributed to the 1930s Dust Bowl? This is of course an extremely complicated issue, but those who work in agriculture and biostatistics should at least be aware of these issues.
Footnote 15: Other global events impacting the rapid increase and soon after decline in Iowa corn prices would be the 1972 decision to export grain to the then Union of Soviet Socialist Republics (USSR) and later, the 1980 decision to embargo grain sales to the same entity. The point here is that farm commodity prices are impacted by far more than weather and similar environmental factors.
Fig. 6.11
ggplot2::ggplot(data=IACornPricePerBushel1867OnwardTrimmed.tbl,
aes(x=year, y=Value)) +
geom_point() +
ggtitle("Iowa Corn Production from 1860s to 2020s:
Price per Bushel") +
labs(x="\nYear", y="Price per Bushel\n") +
scale_x_continuous(limits=c(1865,2025),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::dollar, limits=c(0,7),
breaks=scales::pretty_breaks(n=7)) +
theme_Mac()
# Fig. 6.11
IACornAcresPlanted1926OnwardTrimmed.tbl <-
IACornAcresPlanted1926OnwardADJUSTED.tbl %>%
dplyr::select('location_desc', 'Value',
'freq_desc', 'statisticcat_desc', 'year')
# The trimmed dataset for Iowa corn acreage will be
# restricted to these five object variables.
base::attach(IACornAcresPlanted1926OnwardTrimmed.tbl)
utils::head(IACornAcresPlanted1926OnwardTrimmed.tbl)
Like the prior figure on Iowa corn prices over time, prepare a figure that exam-
ines Iowa acreage for corn, using these USDA NASS resources (Fig. 6.12).
Fig. 6.12
ggplot2::ggplot(data=IACornAcresPlanted1926OnwardTrimmed.tbl,
aes(x=year, y=Value)) +
geom_point() +
ggtitle("Iowa Corn Production from 1920s to 2020s:
Acreage") +
labs(x="\nYear", y="Acres Planted\n") +
scale_x_continuous(limits=c(1925,2025),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
15000000), breaks=scales::pretty_breaks(n=7)) +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=15))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# For this figure, observe how vjust=0.5 was used, to make
# the Y axis (Acres Planted) text align with the tick marks.
# Fig. 6.12
Although the resulting figure of Iowa corn acreage over time may seem serpen-
tine, compare the general trends, the ups and downs, to the prior figure on Iowa corn
prices. Is there an association? What is needed to determine the association, if
indeed there is one?
Ultimately, it will be necessary to join the two datasets into one unified dataset,
so that Iowa corn acreage per year and Iowa corn price per bushel per year are in the
same dataset, allowing analyses of the two object variables. The object variable year
is common to both trimmed datasets and that variable will be selected for common-
ality. However, a unique problem for the two datasets is that the object variable
Value is also in common, given how this is the default name for the major variable
of interest in USDA NASS datasets. A few more wrangling activities, namely renaming, are needed to take care of this issue.
# Rename IACornPricePerBushel1867OnwardTrimmed.tbl$Value
base::names(IACornPricePerBushel1867OnwardTrimmed.tbl)[
names(IACornPricePerBushel1867OnwardTrimmed.tbl) ==
'Value'] <- 'price'
# Use the base::names() function to rename the object
# variable Value to price.
base::attach(IACornPricePerBushel1867OnwardTrimmed.tbl)
utils::str(IACornPricePerBushel1867OnwardTrimmed.tbl)
utils::head(IACornPricePerBushel1867OnwardTrimmed.tbl)
# Rename IACornAcresPlanted1926OnwardTrimmed.tbl$Value
base::names(IACornAcresPlanted1926OnwardTrimmed.tbl)[
names(IACornAcresPlanted1926OnwardTrimmed.tbl) ==
'Value'] <- 'acres'
# Use the base::names() function to rename the object
# variable Value to acres.
base::attach(IACornAcresPlanted1926OnwardTrimmed.tbl)
utils::str(IACornAcresPlanted1926OnwardTrimmed.tbl)
utils::head(IACornAcresPlanted1926OnwardTrimmed.tbl)
Merge the price-focused dataset and the acres-focused dataset into one unified dataset, now that both trimmed datasets are in good form. The object variable year is common to both trimmed datasets and will be used to facilitate the merger.
The dplyr package supports more than a few join-type (e.g., merge) functions: inner_join(), left_join(), right_join(), and full_join(). Review easily available documentation on the unique outcomes of each join-type function. For this example, the base::merge() function is applied with by = "year", which behaves like an inner join and retains only the years common to both trimmed datasets, a practical choice given that the two trimmed datasets have an unequal number of rows.
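For comparison only, a tidyverse-style alternative would be dplyr::full_join(), which keeps the unmatched years and pads them with NA values. The following is merely a minimal sketch and is not the approach used below; the object name IACornPriceByAcresByYearFULL.tbl is hypothetical and is not used elsewhere in this lesson.
IACornPriceByAcresByYearFULL.tbl <-
dplyr::full_join(
IACornPricePerBushel1867OnwardTrimmed.tbl,
IACornAcresPlanted1926OnwardTrimmed.tbl,
by = "year")
# Years found in only one dataset (e.g., 1867 to 1925,
# which appear only in the price dataset) would be kept,
# with NA values in the columns from the other dataset.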
IACornPriceByAcresByYearJOINED.tbl <-
base::merge(
IACornPricePerBushel1867OnwardTrimmed.tbl,
IACornAcresPlanted1926OnwardTrimmed.tbl, by = "year")
tibble::is_tibble(IACornPriceByAcresByYearJOINED.tbl)
[1] FALSE
IACornPriceByAcresByYearJOINED.tbl <-
tibble::as_tibble(IACornPriceByAcresByYearJOINED.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
IACornPriceByAcresByYearJOINED.tbl
Selected sections of output were deleted to save space.
# A tibble: 97 × 9
The data of interest (e.g., Iowa corn prices per bushel by year and Iowa acreage
in corn per year) are now in the same dataset and ready for use. Consider the follow-
ing Null Hypothesis:
Null Hypothesis: There is no statistically significant (p <= 0.05) association
between Iowa corn prices per bushel by year and Iowa acreage in corn per year.
stats::cor.test(
IACornPriceByAcresByYearJOINED.tbl$price,
IACornPriceByAcresByYearJOINED.tbl$acres,
method="pearson",
use="pairwise.complete.obs")
Among the statistics gained from this correlation, observe how Pearson's r = 0.657049 indicates a moderately strong positive association between Iowa corn prices per bushel by year and Iowa acreage in corn per year. As always, a well-developed figure will help reinforce an understanding of this outcome (Fig. 6.13):
Fig. 6.13
par(ask=TRUE)
ggplot2::ggplot(data=IACornPriceByAcresByYearJOINED.tbl,
aes(x=price, y=acres)) +
geom_point(size=1.25) +
ggtitle("Iowa Corn Production from 1926 Onward:
Price per Bushel v Acreage Planted")+
geom_smooth(method=lm, se=TRUE) +
labs(x="\nPrice per Bushel", y="Acreage Planted\n") +
scale_x_continuous(labels=scales::dollar, limits=c(0,7),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
16000000), breaks=scales::pretty_breaks(n=7)) +
#############################################################
annotate("text", x=0.5, y=7500000, fontface="bold", size=04,
hjust=0, label=
"Null Hypothesis: There is no statistically significant ") +
annotate("text", x=0.5, y=6700000, fontface="bold", size=04,
hjust=0, label=
"(p <= 0.05) association between Iowa corn prices per ") +
annotate("text", x=0.5, y=5900000, fontface="bold", size=04,
hjust=0, label=
"bushel by year and Iowa acreage in corn by year.") +
#############################################################
annotate("text", x=4.0, y=7500000, fontface="bold", size=04,
hjust=0, label=
"Calculate p-value = 0.00000000000027") +
annotate("text", x=4.0, y=5900000, fontface="bold", size=04,
hjust=0, color="red", label=
"Pearson's r = 0.657049.") +
#############################################################
annotate("text", x=0.5, y=4500000, fontface="bold", size=04,
hjust=0, label=
"The Null Hypothesis is not accepted. Or, to phrase the ") +
annotate("text", x=0.5, y=3800000, fontface="bold", size=04,
hjust=0, label=
"outcome in a different manner, the Null Hypothesis is ") +
annotate("text", x=0.5, y=3100000, fontface="bold", size=04,
hjust=0, label=
"rejected. There is an association between X (price) ") +
annotate("text", x=0.5, y=2400000, fontface="bold", size=04,
hjust=0, label=
"and Y (acreage).") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=15))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# Fig. 6.13
Following along with the statistical outcomes, the figure provides visual evi-
dence that there is an association between X (price) and Y (acres). Recall, however,
that there is no suggestion of causation. It cannot be determined that X causes Y nor
can it be determined that Y causes X. It is only possible to say that there is an
association between X (price) and Y (acres) and then, depending on wording, that
the association is positive or negative.
Although there is complete confidence in the statistics and figures associated with the association between X (price) and Y (acres), an experienced data scientist knows that some individuals, especially among the public, will not fully benefit from the presentation up to this point, including the well-prepared scatter plot. Many members of the public can see the upward trend over time between X (price) and Y (acres), allowing for selected periods of marked fluctuation, often caused by either extreme weather or global events. It cannot be ignored, however, that many members of the public have only a vague (if any) idea of the area of an acre (e.g., 43,560 square feet) or the use of the bushel as a measurement (e.g., approximately 1.24 cubic feet, or, as a standard for corn, 56 pounds per bushel at 15.5% moisture content). To confront this problem, it may be necessary to prepare a few pictographs that introduce those terms to the public, such as:
• An acre is approximately the size of a football field (e.g., American football),
without the two end zones.
• It takes approximately 110 to 120 ears of corn to make a bushel.
As always, remember the background of the intended audience for the final sum-
mary. Testing the final document(s) with a broad and representative readership can
help with the final presentation.
Addendum 3: Use of Known URLs as a Proxy API (Application Programming Interface)
There are many API (Application Programming Interface) Web sites that have valuable data for those who work in biostatistics. The United States Environmental Protection Agency (EPA), a cabinet-level agency, provides extremely important information about toxins that impact the environment. Ultimately these toxins are linked to public health and, by extension, to the impact of public health on the workforce – certainly topics associated with biostatistics.
The EPA provides the data through many interfaces and in many formats. Some
EPA interfaces are quite easy to navigate, consisting of simple GUI (Graphical User
Interface) drop-down menus. Other EPA interfaces require some degree of expertise
with computing. There are also a few product-specific R-based APIs that interact
with and obtain data from the EPA, such as prior R-based APIs demonstrated earlier.
For this addendum, look at the way known URLs (Uniform Resource Locator)
are used to obtain data associated with fuel types and the impact of these many fuels
when used for the generation of electricity. From the larger collection of object
variables included in the obtained datasets, the focus for this addendum will be on
the following object variables16:
reporting_year
n2o_emissions_co2e
total_annual_heat_input
fuel_type
ch4_emissions_co2e
In advance, based on prior knowledge from working with the EPA’s interface and
processes for these data, it is known that the dataset in question consists of about
55,000 rows of data. However, to best manage resources, the EPA’s system housing
these data has been designed to limit the number of rows returned at one time, gen-
erating this error message if there were an attempt to obtain more than 10,000 rows
of data:
# message:
#
# error "Limit must be less than 10000"
#
It may be an aggravation, but the problem is easily solved by using a divide-and-conquer approach: break the data acquisition process into six separate data calls, with each returned file meeting the size limit of fewer than 10,000 rows. After the separate .csv files are brought into the R session, the files are merged into one common file. One simple data call to the EPA's data resource would be preferable, but that aspiration is not possible; the suggested process works, and the data are quite revealing and worth the effort.
Identify the EPA URLs containing the row-based data, knowing that it is best to
obtain no more than 9999 rows of data at one time.
16 Note: At one time, the original object variable names showed in UPPERCASE instead of lowercase. Always check datasets for current naming schemes, structure, data availability, etc. Data at online resources that are controlled by others are always subject to change: updates, deletions, modifications.
urlEPA01 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/00000:09999/CSV"
urlEPA02 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/10000:19999/CSV"
urlEPA03 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/20000:29999/CSV"
urlEPA04 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/30000:39999/CSV"
urlEPA05 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/40000:49999/CSV"
urlEPA06 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/50000:59999/CSV"
base::ls()
# Confirm that the objects urlEPA01 to urlEPA06 were
# created, with each consisting of the name for an URL.
Caution: In printed form, these URLs may wrap across two or more lines, due to word wrap and the limits of preparing text. When executing these actions (e.g., importing the data), confirm that each URL, individually, appears as a single unbroken string on one line, likely extending well to the right of the screen. Each URL consists of 78 characters, from the beginning double quote to the ending double quote.
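The import step itself is not reproduced at this point in the text. A minimal sketch, assuming the readr package (loaded with the tidyverse) is used to read each .csv resource directly from its URL and that the object names match those confirmed below, might be:
datEPA01 <- readr::read_csv(urlEPA01)
datEPA02 <- readr::read_csv(urlEPA02)
datEPA03 <- readr::read_csv(urlEPA03)
datEPA04 <- readr::read_csv(urlEPA04)
datEPA05 <- readr::read_csv(urlEPA05)
datEPA06 <- readr::read_csv(urlEPA06)
# Each call reads one block of 10,000 or fewer rows of the
# EPA .csv resource directly from its URL.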
base::ls()
utils::head(datEPA01); utils::head(datEPA02);
utils::head(datEPA03); utils::head(datEPA04);
utils::head(datEPA05); utils::head(datEPA06);
# Confirm that the objects datEPA01 to datEPA06 were
# created, with each consisting of data gained from the
# previously named URL.
The six separate datasets (e.g., datEPA01 to datEPA06) are structured so that the
file layout is the same for each dataset. The task now is to join the six datasets into
one unified dataset. There are many ways to achieve this aim, but because of the
common format for all six EPA datasets, the dplyr::bind_rows() function will
be used.
EPAFuelToxins.tbl <-
dplyr::bind_rows(
datEPA01,
datEPA02,
datEPA03,
datEPA04,
datEPA05,
datEPA06)
# Merge, join, blend, bind, etc. rows of
# the six row-based breakout EPA datasets
# into one common, complete, dataset.
tibble::is_tibble(EPAFuelToxins.tbl)
[1] TRUE
The dataset, obtained from the EPA interface and referencing different fuel types
and related information on toxins associated with the generation of electricity, is
now almost in final form. There are a few object variables that are not of direct inter-
est for this set of analyses, so a few more efforts are needed to put the dataset into
what is judged final form.
EPAFuelToxinsTrimmed.tbl <-
EPAFuelToxins.tbl %>%
dplyr::select('reporting_year',
'n2o_emissions_co2e',
'total_annual_heat_input',
'fuel_type',
'ch4_emissions_co2e')
base::getwd()
base::ls()
base::attach(EPAFuelToxinsTrimmed.tbl)
utils::str(EPAFuelToxinsTrimmed.tbl)
dplyr::glimpse(EPAFuelToxinsTrimmed.tbl)
base::summary(EPAFuelToxinsTrimmed.tbl)
base::summary(EPAFuelToxinsTrimmed.tbl$reporting_year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2010 2012 2015 2015 2018 2021
FrequencyYear <-
EPAFuelToxinsTrimmed.tbl %>%
janitor::tabyl(reporting_year,
show_na=TRUE,
show_missing_levels=TRUE) %>%
janitor::adorn_totals("row") %>%
janitor::adorn_pct_formatting(digits=2)
base::print(FrequencyYear, n=99)
# There are NOT 99 lines to the printout, but
# 99 was chosen as an exceptionally high and
# noticeable number, to allow printout of
# the full output.
reporting_year n percent
2010 5399 9.06%
2011 5499 9.23%
2012 5568 9.34%
2013 5336 8.96%
2014 5153 8.65%
2015 4955 8.32%
2016 4764 8.00%
2017 4650 7.80%
2018 4667 7.83%
2019 4582 7.69%
2020 4516 7.58%
2021 4497 7.55%
Total 59586 100.00%
dplyr::arrange(FrequencyYear, desc(n))
# Arrange output in descending order
# against the object n
reporting_year n percent
Total 59586 100.00%
2012 5568 9.34%
2011 5499 9.23%
2010 5399 9.06%
2013 5336 8.96%
2014 5153 8.65%
2015 4955 8.32%
2016 4764 8.00%
2018 4667 7.83%
2017 4650 7.80%
2019 4582 7.69%
2020 4516 7.58%
2021 4497 7.55%
A follow-up figure will help provide a graphical summary of the number of cases
(e.g., rows) associated with data for each year (Fig. 6.14).
Fig. 6.14
par(ask=TRUE)
ggplot2::ggplot(data=EPAFuelToxinsTrimmed.tbl,
aes(x=as.factor(reporting_year))) +
geom_bar(fill="red", color="black") +
coord_flip() +
ggtitle("Years Included in the Study of Fuel Types Used
for Electricity Production") +
labs(x="Year\n", y="\nN") +
# The X and Y axis are flipped, so adjust the labels.
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=1, vjust=0.5, angle=0))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# The bars for reporting_year were purposely not put into any
# type of order, increasing or decreasing. This decision was
# made to show the sequential progression of data collection.
# Fig. 6.14
FrequencyFuel <-
EPAFuelToxinsTrimmed.tbl %>%
janitor::tabyl(fuel_type,
show_na=TRUE,
show_missing_levels=TRUE) %>%
janitor::adorn_totals("row") %>%
janitor::adorn_pct_formatting(digits=2)
base::print(FrequencyFuel, n=99)
The printout of fuel types is quite long but attend to the screen printout to see the
many fuels used to generate electricity, even if a few fuel types are used sparingly.
Notice how the leading fuel is Natural Gas (Weighted U.S. Average), representing
57.41 percent of all valid cases in the dataset.17
dplyr::arrange(FrequencyFuel, desc(n))
# Arrange output in descending order
# against the object n
There are far too many fuel_type breakouts to easily make sense of the output.
Wrap the dplyr::mutate() function around the forcats::fct_lump() function to col-
lapse (e.g., lump) the number of breakouts into a manageable number, seven fuel
types plus other as the syntax is presented here.
17 Challenge: Use tools from the tidyverse ecosystem to observe change over time, if any, in percentage use of Natural Gas (Weighted US Average) in 2010 compared to 2021.
CollapseFuel.tbl <-
EPAFuelToxinsTrimmed.tbl %>%
dplyr::filter(!is.na(fuel_type)) %>%
# Remove the few NA values.
dplyr::mutate(fuel_type =
forcats::fct_lump(fuel_type, n = 7)) %>%
# Collapse the dataset into the 7 leading
# breakouts, with all remaining breakouts
# lumped into the ubiquitous breakout
# called other, all facilitated by use of
# the forcats::fct_lump() function.
dplyr::count(fuel_type)
base::print(CollapseFuel.tbl, n=99)
# A tibble: 8 x 2
fuel_type n
<fct> <int>
1 Bituminous 4239
2 Distillate Fuel Oil No. 2 12752
3 Kerosene 1111
4 Mixed (Electric Power sector) 591
5 Natural Gas (Weighted U.S. Average) 34142
6 Residual Fuel Oil No. 6 988
7 Subbituminous 3459
8 Other 2186
More testing of the dataset would be useful, but at this point in the initial analyses, the data seem both reliable and valid. Use the writexl::write_xlsx() function to immediately save a local copy of the data, for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
EPAFuelToxinsTrimmed.tbl,
path = "D:\\R_Ceres\\EPAFuelToxinsTrimmed.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("EPAFuelToxinsTrimmed.xlsx")
base::file.info("EPAFuelToxinsTrimmed.xlsx")
base::list.files(pattern =".xlsx")
Fig. 6.15
CollapseFuel.tbl %>%
dplyr::mutate(fuel_type = forcats::fct_reorder(fuel_type,
n)) %>%
ggplot2::ggplot(aes(x=fuel_type, y=n)) +
geom_col(fill="red", color="black") +
coord_flip() +
ggtitle("Fuel Types Used for Electricity Production:
2010 to 2021") +
labs(x="\nFuel Type", y="N\n") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=1, vjust=0.5, angle=0))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 6.15
18 Challenge: Look at mean values for n2o_emissions_co2e from 2010 to 2019, in 2020, and then again in 2021. Consider possible reasons for the gradual (even if somewhat uneven) decline from 2010 to 2019, the large drop-off in 2020, and the uptick in 2021. As data become available, what is the trend from 2022 onward?
install.packages("RcmdrMisc", dependencies=TRUE)
library(RcmdrMisc)
RcmdrMisc::numSummary(
EPAFuelToxinsTrimmed.tbl[,c("n2o_emissions_co2e")],
statistics=c("mean", "sd", "quantiles"),
quantiles=c(0.5), # Median
groups=reporting_year) # Breakouts by Year
Look at the extreme difference between mean (arithmetic average) and the
median (e.g., 50th percentile, 50%). Look also at the high values for the sd (e.g.,
standard deviation). There is clearly wide variation in values for
n2o_emissions_co2e.
Challenge: Prepare a density plot of n2o_emissions_co2e to better understand
the data distribution pattern for this object variable. The odd shape of the density
plot provides further confirmation that there is wide variation in values for
n2o_emissions_co2e.
RcmdrMisc::numSummary(
EPAFuelToxinsTrimmed.tbl[,c("total_annual_heat_input")],
statistics=c("mean", "sd", "quantiles"),
quantiles=c(0.5), # Median
groups=reporting_year) # Breakouts by Year
RcmdrMisc::numSummary(
EPAFuelToxinsTrimmed.tbl[,c("ch4_emissions_co2e")],
statistics=c("mean", "sd", "quantiles"),
quantiles=c(0.5), # Median
groups=reporting_year) # Breakouts by Year
At the risk of being repetitive, it cannot be ignored that the data for the three
measured object variables show wide variance in data distribution. This observation
will later impact the selected inferential test(s) when attempts are made to make
more sense of how the data can be used for statistical testing: nonparametric v
parametric.
There are three remaining issues that should be addressed, to make later analyses
against the full dataset, which will be called EPAFuelToxinsNoNAs.tbl, as easy as
possible:
• Remove any remaining NA values, so that the dataset has no missing data.
• Put reporting_year into factor format.
• Put fuel_type into factor format.
EPAFuelToxinsNoNAs.tbl <-
EPAFuelToxinsTrimmed.tbl %>%
tidyr::drop_na()
# Remove any remaining NAs.
base::summary(EPAFuelToxinsNoNAs.tbl)
EPAFuelToxinsNoNAs.tbl$reporting_year <-
base::as.factor(
EPAFuelToxinsNoNAs.tbl$reporting_year)
# Put reporting_year into factor format.
base::summary(EPAFuelToxinsNoNAs.tbl)
EPAFuelToxinsNoNAs.tbl$fuel_type <-
base::as.factor(
EPAFuelToxinsNoNAs.tbl$fuel_type)
# Put fuel_type into factor format.
base::summary(EPAFuelToxinsNoNAs.tbl)
base::getwd()
base::ls()
base::attach(EPAFuelToxinsNoNAs.tbl)
utils::str(EPAFuelToxinsNoNAs.tbl)
dplyr::glimpse(EPAFuelToxinsNoNAs.tbl)
base::summary(EPAFuelToxinsNoNAs.tbl)
Following along with inquiries into the emissions included in this dataset and the generation of electricity by various fuel types, there are many ways to structure meaningful inferential analyses. As a starting example, consider a Kruskal-Wallis one-way analysis of variance by ranks (the nonparametric test suited to the 12 reporting_year breakouts) for n2o_emissions_co2e by reporting_year:
agricolae::kruskal(
EPAFuelToxinsNoNAs.tbl$n2o_emissions_co2e, # Measured
EPAFuelToxinsNoNAs.tbl$reporting_year, # Grouping
alpha=0.05, group=FALSE, p.adj="holm",
main="n2o_emissions_co2e by reporting_year",
console=TRUE)
# Use holm for pairwise comparisons. Another choice could
# have been to use bonferroni for pairwise comparisons.
Selected sections of output were deleted to save space.
• Continue with the previously demonstrated Kruskal-Wallis ANOVA, but now for total_annual_heat_input by reporting_year and ch4_emissions_co2e by reporting_year (a sketch of the first of these follows this list).
• After collapsing fuel_type into a reasonable number of breakouts, investigate
Kruskal-Wallis ANOVA for n2o_emissions_co2e by fuel_type, total_annual_
heat_input by fuel_type, and ch4_emissions_co2e by fuel_type.
• For those who especially take up this challenge, structure the analyses (collapsing fuel_type into a reasonable number of breakouts) from the perspective of a nonparametric Friedman Two-Way Analysis of Variance by Ranks.
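As a hedged sketch of the first bulleted follow-up, mirroring the agricolae::kruskal() syntax already demonstrated (the output itself is not reproduced here):
agricolae::kruskal(
EPAFuelToxinsNoNAs.tbl$total_annual_heat_input, # Measured
EPAFuelToxinsNoNAs.tbl$reporting_year, # Grouping
alpha=0.05, group=FALSE, p.adj="holm",
main="total_annual_heat_input by reporting_year",
console=TRUE)
# Use holm for pairwise comparisons, exactly as in the
# n2o_emissions_co2e example shown above.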
In summary, there are many possible ways to examine the data, all to add value
to the data science process. Take up these challenges.
Value Added Challenge: For anyone who follows the news, it is well known that
there is an effort to take electricity-generating power plants that burn coal and oil
offline and to instead replace the use of these fuels with natural gas.19 The premise
is that natural gas generates less pollution than coal and oil, per unit of heat output
used to ultimately generate electricity. The tidyverse ecosystem can be used to criti-
cally examine the efficacy of this transition (Figs. 6.16, 6.17, and 6.18).
Fig. 6.16
19 The change from coal and oil to natural gas for generation of electricity has provided the opportunity for many impressive videos, showing dramatic images of the implosion of old infrastructure:
boilers and cooling stacks. Merely as one of many possible selections, search for videos of the June
19, 2011, implosion of electricity-generating infrastructure at Riviera Beach, Florida. In mere
seconds, two boilers and two 300-foot stacks came tumbling down, to make way for construction
of a new natural gas-powered system. A serendipitous outcome was construction of a dedicated
lagoon, where warm water from the new cooling towers is discharged into an area where manatees
(a protected species) can gather during the winter and thrive when cool water temperatures may
otherwise put them at stress. Look at the Manatee Cam (https://www.visitmanateelagoon.com/) in
the winter, when air temperatures are about 60F or 15C and look at what may seem to be 100 or
more manatees enjoying the benefit of warm water discharge from the power plant.
Fig. 6.17
Fig. 6.18
EPAn2o_emissions_co2eTrimDesStatNatGasBitCoal.tbl <-
# Create a new object that will hold descriptive
# statistics for Natural Gas and Bituminous coal.
EPAFuelToxinsTrimmed.tbl %>%
dplyr::filter(fuel_type %in% c(
"Natural Gas (Weighted U.S. Average)",
"Bituminous")) %>%
# Use the dplyr::filter() function to focus only
# on Natural Gas (Weighted U.S. Average) and
# Bituminous.
dplyr::group_by(fuel_type, reporting_year) %>%
# Group output by fuel_type and reporting_year,
# the focus for intended figure that will show
# change in Sum n2o_emissions_co2e over time,
# 2010 to 2021.
dplyr::summarize(
N = base::length(n2o_emissions_co2e),
Minimum = base::min(n2o_emissions_co2e),
Median = stats::median(n2o_emissions_co2e),
Mean = base::mean(n2o_emissions_co2e),
SD = stats::sd(n2o_emissions_co2e),
Maximum = base::max(n2o_emissions_co2e),
Sum = base::sum(n2o_emissions_co2e),
Missing = base::sum(is.na(n2o_emissions_co2e))
# Generate a complete set of useful descriptive
# statistics, but the focus for the
# intended figure will be Sum since ultimately
# this is the most useful measure of pollution.
)
base::print(
EPAn2o_emissions_co2eTrimDesStatNatGasBitCoal.tbl, n=24)
# Observe the 2010 to 2021 N for Bituminous and Natural
# Gas (Weighted U.S. Average). Then, observe the 2010 to
# 2021 Sum for the same, both in the enumerated table of
# descriptive statistics but especially in the figure.
#
# A tibble: 24 × 10
# Groups: fuel_type [2]
fuel_type reporting_year N Median Mean SD Maximum
<chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 Bituminous 2010 502 4427. 8535. 9228. 45670.
2 Bituminous 2011 483 3997. 7725. 8613. 45736.
3 Bituminous 2012 483 2989. 6553. 8095. 44264.
4 Bituminous 2013 453 3947. 7252. 8402. 42769
5 Bituminous 2014 397 5643. 8719. 8800. 44086.
6 Bituminous 2015 363 5613. 8122. 8357. 42725.
7 Bituminous 2016 313 7166. 8666. 8367. 38331.
8 Bituminous 2017 296 6231. 8248. 8487. 40904.
9 Bituminous 2018 284 6088. 8123. 8396. 40419.
10 Bituminous 2019 242 5430. 7645. 7929. 39239.
11 Bituminous 2020 223 4991. 6761. 7190. 37404.
12 Bituminous 2021 200 7180. 8381. 7667. 37538.
13 Natural Gas 2010 2806 12.4 81.2 132. 1966.
14 Natural Gas 2011 2857 11.4 80.4 128. 993.
15 Natural Gas 2012 2907 15.5 95.8 143. 840.
16 Natural Gas 2013 2927 10.7 84.0 130. 837.
17 Natural Gas 2014 2898 11.2 84.3 129. 656.
18 Natural Gas 2015 2843 17 101. 146. 854.
19 Natural Gas 2016 2823 22.2 105. 146. 793.
20 Natural Gas 2017 2791 18.1 102. 172. 3177.
21 Natural Gas 2018 2841 25.8 112. 149. 881.
22 Natural Gas 2019 2825 26.6 122. 167. 1774.
23 Natural Gas 2020 2802 28.4 124. 166. 888.
24 Natural Gas 2021 2822 28.3 120. 160. 789.
par(ask=TRUE)
ggplot2::ggplot(data=
EPAn2o_emissions_co2eTrimDesStatNatGasBitCoal.tbl,
aes(x=reporting_year, y=Sum)) +
geom_point(size=4.00, color="red") +
geom_line(size=1.00, color="black") +
facet_wrap(~fuel_type) +
ggtitle(
"Sum n2o_emissions_co2e in the United States as a
Result of Electricity Generation: 2010 to 2021") +
labs(x="\nYear", y="Sum n2o_emissions_co2e\n") +
scale_x_continuous(limits=c(2010,2021),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
5000000), breaks=scales::pretty_breaks(n=3)) +
theme_Mac() +
theme(strip.text.x = element_text(face="bold",
size = 12, color = "black"))
# Put the faceted headers in bold.
# Fig. 6.16
EPAch4_emissions_co2eTrimDesStatNatGasBitCoal.tbl <-
# Create a new object that will hold descriptive
# statistics for Natural Gas and Bituminous coal.
EPAFuelToxinsTrimmed.tbl %>%
dplyr::filter(fuel_type %in% c(
"Natural Gas (Weighted U.S. Average)",
"Bituminous")) %>%
# Use the dplyr::filter() function to focus only
# on Natural Gas (Weighted U.S. Average) and
# Bituminous.
dplyr::group_by(fuel_type, reporting_year) %>%
# Group output by fuel_type and reporting_year,
# the focus for intended figure that will show
# change in Sum ch4_emissions_co2e over time,
# 2010 to 2021.
dplyr::summarize(
N = base::length(ch4_emissions_co2e),
Minimum = base::min(ch4_emissions_co2e),
Median = stats::median(ch4_emissions_co2e),
Mean = base::mean(ch4_emissions_co2e),
SD = stats::sd(ch4_emissions_co2e),
Maximum = base::max(ch4_emissions_co2e),
Sum = base::sum(ch4_emissions_co2e),
Missing = base::sum(is.na(ch4_emissions_co2e))
# Generate a complete set of useful descriptive
# statistics, but the focus for the
# intended figure will be Sum since ultimately
# this is the most useful measure of pollution.
)
base::print(
EPAch4_emissions_co2eTrimDesStatNatGasBitCoal.tbl, n=24)
# Observe the 2010 to 2021 N for Bituminous and Natural
# Gas (Weighted U.S. Average). Then, observe the 2010 to
# 2021 Sum for the same, both in the enumerated table of
# descriptive statistics but especially in the figure.
par(ask=TRUE)
ggplot2::ggplot(data=
EPAch4_emissions_co2eTrimDesStatNatGasBitCoal.tbl,
aes(x=reporting_year, y=Sum)) +
geom_point(size=4.00, color="red") +
geom_line(size=1.00, color="black") +
facet_wrap(~fuel_type) +
ggtitle(
"Sum ch4_emissions_co2e in the United States as a
Result of Electricity Generation: 2010 to 2021") +
labs(x="\nYear", y="Sum ch4_emissions_co2e\n") +
scale_x_continuous(limits=c(2010,2021),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
2000000), breaks=scales::pretty_breaks(n=3)) +
theme_Mac() +
theme(strip.text.x = element_text(face="bold",
size = 12, color = "black"))
# Put the faceted headers in bold.
# Fig. 6.17
EPAtotal_annual_heat_inputTrimDesStatNatGasBitCoal.tbl <-
# Create a new object that will hold descriptive
# statistics for Natural Gas and Bituminous coal.
EPAFuelToxinsTrimmed.tbl %>%
dplyr::filter(fuel_type %in% c(
"Natural Gas (Weighted U.S. Average)",
"Bituminous")) %>%
# Use the dplyr::filter() function to focus only
# on Natural Gas (Weighted U.S. Average) and
# Bituminous.
dplyr::group_by(fuel_type, reporting_year) %>%
# Group output by fuel_type and reporting_year,
# the focus for intended figure that will show
# change in Sum total_annual_heat_input over time,
# 2010 to 2021.
dplyr::summarize(
N = base::length(total_annual_heat_input),
Minimum = base::min(total_annual_heat_input),
Median = stats::median(total_annual_heat_input),
Mean = base::mean(total_annual_heat_input),
SD = stats::sd(total_annual_heat_input),
Maximum = base::max(total_annual_heat_input),
Sum = base::sum(total_annual_heat_input),
Missing = base::sum(is.na(total_annual_heat_input))
# Generate a complete set of useful descriptive
# statistics, but the focus for the
# intended figure will be Sum since ultimately
# this is the most useful measure of pollution.
)
base::print(
EPAtotal_annual_heat_inputTrimDesStatNatGasBitCoal.tbl, n=24)
# Observe the 2010 to 2021 N for Bituminous and Natural
# Gas (Weighted U.S. Average). Then, observe the 2010 to
# 2021 Sum for the same, both in the enumerated table of
# descriptive statistics but especially in the figure.
par(ask=TRUE)
ggplot2::ggplot(data=
EPAtotal_annual_heat_inputTrimDesStatNatGasBitCoal.tbl,
aes(x=reporting_year, y=Sum)) +
geom_point(size=4.00, color="red") +
geom_line(size=1.00, color="black") +
facet_wrap(~fuel_type) +
ggtitle(
"Sum total_annual_heat_input in the United States as a
Result of Electricity Generation: 2012 to 2021") +
labs(x="\nYear", y="Sum total_annual_heat_input\n",
caption="Data are unavailable for 2010 and 2011") +
scale_x_continuous(limits=c(2010,2021),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
12000000000), breaks=scales::pretty_breaks(n=3)) +
theme_Mac() +
theme(strip.text.x = element_text(face="bold",
size = 12, color = "black"))
# Put the faceted headers in bold.
# Fig. 6.18
20 Be sure to notice that data for total_annual_heat_input were unavailable for two years, 2010
and 2011.
21 Consumers expect continuous and uninterrupted electricity for their factories, gas pumps at service stations, refrigeration at grocery stores, homes, hospitals, schools, shops, offices, traffic lights,
water purification plants, etc. Even a few minutes (seconds, actually) of interruption to the electric
power grid creates havoc. Downtime in the availability of electricity, even when power plants and
communities are faced with weather-related force majeure events are quickly deemed unaccept-
able by the public – the lights need to be on 24 hours a day, each day, every day, all year, with no
exceptions. Think of the February 2021 power outages in Texas. The disruptions, during an excep-
tionally active polar vortex that reached far into the South, with freezing weather and bitter storm
conditions, caused millions of consumers and businesses the hardship of intolerable living condi-
tions, diminished economic impact from lost productivity, flooded houses once frozen pipes even-
tually warmed up and discharged untold gallons of water into residences, and, worst of all, the
many deaths that could be directly attributable to the disruption of electric service as power-
generating plants were offline and power lines went down. Many consumers would have gladly
accepted a temporary increase in emissions, n2o_emissions_co2e and ch4_emissions_co2e, but of
course, power plants cannot be so easily transformed from one fuel type to another, and this can
certainly not be done quickly, without adequate (and costly) advance planning, if at all.
Addendum 4: API-Based Data in JavaScript Object Notation (JSON) Format
JavaScript Object Notation (JSON, often written in lowercase as json) is a text-based file format that is used to store and transfer data, typically using the Web to move data across multiple platforms. Not surprisingly, json data are often associated with Web applications.
A comparative advantage of json data is that the format is text based, only. Being text-based, json data have the potential for wide use across various computing systems and multiple users, since json data are not dependent on unique hardware or software. The simple text format makes json data potentially available to a wide community.22
The line-by-line appearance of json data may look a bit strange, at first, even
though json data are prepared in text format. There is no doubt that the structure for
json data is not immediately easy to read for those who are new to this file format,
but fortunately, many computing languages, including R, can work with json data
quite easily.
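As a small, self-contained illustration (a sketch using the jsonlite package, which is applied later in this addendum), a short json string parses directly into a rectangular R object:
jsonlite::fromJSON('[{"id": 1, "outcome": "case"},
                     {"id": 2, "outcome": "death"}]')
# Returns a two-row dataframe with columns id and outcome,
# showing how json text maps onto rectangular R data.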
This text is focused on the use of R and R-based API clients (e.g., functions). Accordingly, this addendum on json-based data is brief; however, json data have become so ubiquitous as the Web has grown in importance that an introductory data science text would be deficient without some treatment of them.
Given the continuing attention paid to COVID-19 and the desire to make data-based decisions when developing policies and protocols, it seemed appropriate to look at json data related to this disease, which is caused by the SARS-CoV-2 virus. The data, in json format, were obtained from a United States Centers for Disease Control and Prevention (CDC) Web-based resource: https://data.cdc.gov/Public-Health-Surveillance/Rates-of-COVID-19-Cases-or-Deaths-by-Age-Group-and/3rge-nu2a.
Rates of COVID-19 Cases or Deaths by Age Group and Vaccination Status
Data for CDC’s COVID Data Tracker site on Rates of COVID-19 Cases and Deaths
by Vaccination Status
Updated February 22, 2023
Data Provided by CDC COVID-19 Response, Epidemiology Task Force
Take time to explore the entire Web page associated with this resource. Notice
how the CDC makes the data available in many file formats, in their effort to make
resources available to as many users as possible. Give special attention to the Code
Book, where the data are described in full immediately after the header Columns in
this Dataset.
Using a known URL that has COVID-19 data provided by the Centers for Disease
Control and Prevention, the purpose of this addendum is to demonstrate how json
data are obtained and then put into a format suitable for R-based uses. The
22 In a similar manner, comma separated values (.csv) files are also text based, allowing wide use
across multiple platforms, software and hardware, and users.
following syntax is merely a very brief demonstration of the way json data can be
obtained and used:
CovidCDC.json <-
httr::GET("https://data.cdc.gov/resource/3rge-nu2a.json")
# Use the httr::GET() function to obtain the desired data.
CovidCDC.json
Response [https://data.cdc.gov/resource/3rge-nu2a.json]
Date: 2023-05-17 16:08
Status: 200
Content-Type: application/json;charset=utf-8
Size: 412 kB
[{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
httr::http_type(CovidCDC.json)
base::class(CovidCDC.json)
utils::str(CovidCDC.json)
Convert the json object to a dataframe and then to a tibble, suitable for use with
the tidyverse ecosystem.
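The conversion syntax is not shown at this point in the extraction. A minimal sketch, assuming the httr and jsonlite packages are available and producing the object names CovidCDC.df and CovidCDC.tbl used below, might be:
CovidCDC.df <- jsonlite::fromJSON(
httr::content(CovidCDC.json, as="text", encoding="UTF-8"))
# Parse the text body of the json response into a dataframe.
CovidCDC.tbl <- tibble::as_tibble(CovidCDC.df)
# Promote the dataframe to a tibble, to keep within
# the tidyverse paradigm.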
base::class(CovidCDC.df)
utils::str(CovidCDC.df)
base::getwd()
base::ls()
base::attach(CovidCDC.tbl)
utils::str(CovidCDC.tbl)
dplyr::glimpse(CovidCDC.tbl)
Rows: 1,000
Columns: 16
$ outcome <chr> "case", "case", "case", "ca
$ month <chr> "APR 2021", "APR 2021", "AP
$ mmwr_week <chr> "202114", "202114", "202114
$ age_group <chr> "12-17", "18-29", "30-49",
$ vaccine_product <chr> "all_types", "all_types", "
$ vaccinated_with_outcome <chr> "8", "674", "1847", "1558",
$ fully_vaccinated_population <chr> "36887", "2543093", "742840
$ unvaccinated_with_outcome <chr> "30785", "76736", "98436",
$ unvaccinated_population <chr> "17556462", "31091322", "41
$ crude_vax_ir <chr> "21.687857511", "26.5031597
$ crude_unvax_ir <chr> "175.348541181", "246.80841
$ crude_irr <chr> "8.085102048", "9.312414844
$ continuity_correction <chr> "0", "0", "0", "0", "0", "0
$ age_adj_vax_ir <chr> NA, NA, NA, NA, NA, NA, "22
$ age_adj_unvax_ir <chr> NA, NA, NA, NA, NA, NA, "22
$ age_adj_irr <chr> NA, NA, NA, NA, NA, NA, "9.
base::summary(CovidCDC.tbl)
The data in this example were originally obtained in json format. A few simple
actions were used to convert the data into a tibble, named CovidCDC.tbl in this
example.
Challenge: This addendum has been focused on how data obtained in json format
are put into eventual format as a tibble. Much more needs to be done to make use of
the COVID-19 data obtained from the CDC. A few actions are demonstrated below
but take up the challenge and use the dataset to make sense of the data, related to
COVID-19 vaccination metrics.
Immediately, notice how all data are in character format. Yet, the Code Book
describes the desired format for each object variable. Change data formats, at least
for a few object variables.
CovidCDC.tbl$outcome <-
as.factor(CovidCDC.tbl$outcome)
CovidCDC.tbl$month <-
as.factor(CovidCDC.tbl$month)
CovidCDC.tbl$mmwr_week <-
as.integer(CovidCDC.tbl$mmwr_week)
CovidCDC.tbl$age_group <-
as.factor(CovidCDC.tbl$age_group)
CovidCDC.tbl$vaccine_product <-
as.factor(CovidCDC.tbl$vaccine_product)
CovidCDC.tbl$vaccinated_with_outcome <-
as.numeric(CovidCDC.tbl$vaccinated_with_outcome)
CovidCDC.tbl$fully_vaccinated_population <-
as.numeric(CovidCDC.tbl$fully_vaccinated_population)
CovidCDC.tbl$unvaccinated_with_outcome <-
as.numeric(CovidCDC.tbl$unvaccinated_with_outcome)
CovidCDC.tbl$unvaccinated_population <-
as.numeric(CovidCDC.tbl$unvaccinated_population)
base::getwd()
base::ls()
base::attach(CovidCDC.tbl)
utils::str(CovidCDC.tbl)
dplyr::glimpse(CovidCDC.tbl)
writexl::write_xlsx(
CovidCDC.tbl,
path = "D:\\R_Ceres\\CovidCDC.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
# Confirm that the downloaded file is in good form
base::file.exists("CovidCDC.xlsx")
base::file.info("CovidCDC.xlsx")
base::list.files(pattern =".xlsx")
More recoding activities may be necessary, but for now, prepare a simple bar
chart that represents month and year on the X-axis and fully_vaccinated_population
on the Y axis. Notice the sequential order of month and year, representing a timeline
in the correct order.
The data will be specific to all_types within the vaccine_product column (e.g.,
object variable). There is no desire to double-count cases.
base::summary(CovidCDC.tbl$outcome)
case death
812 188
base::summary(CovidCDC.tbl$age_group)
base::summary(CovidCDC.tbl$vaccine_product)
With the original dataset (CovidCDC.tbl) in good form, use the dplyr::filter()
function to adjust the dataset, generating a new dataset
(CovidCDCCaseAllAgesAllTypes.tbl) that retains only those rows that meet all
requirements for a meaningful bar plot of the number of individuals in the United
States who are fully vaccinated, with any of the different available vaccines. Notice
how the number of rows declines from 1000 rows, to 812 rows, to 308 rows, to
finally 77 rows (Fig. 6.19).
Fig. 6.19
CovidCDCCaseAllAgesAllTypes.tbl <-
CovidCDC.tbl %>%
dplyr::filter(outcome %in% c("case")) %>%
# Retain the rows that have case (only!)
# in the outcome column and delete all
# other rows.
# 812 rows - current number of rows
dplyr::filter(age_group %in% c("all_ages_adj")) %>%
# Retain the rows that have all_ages_adj (only!)
# in the age_group column and delete all
# other rows.
# 308 rows - current number of rows
dplyr::filter(vaccine_product %in% c("all_types"))
# Retain the rows that have all_types
# (only!) in the vaccine_product column
# and delete all other rows.
# 77 rows - current number of rows
base::attach(CovidCDCCaseAllAgesAllTypes.tbl)
str(CovidCDCCaseAllAgesAllTypes.tbl)
ggplot2::ggplot(data=CovidCDCCaseAllAgesAllTypes.tbl,
aes(x=stats::reorder(month, mmwr_week),
# Give attention to this clever ordering
# process, how mmwr_week was used to put
# month in sequential order.
y=fully_vaccinated_population)) +
geom_bar(stat="summary", fill="red") +
labs(
title="Population (All Ages) Fully Vaccinated (All Types) for
COVID-19
by Month and Year: September 2022 End Date",
subtitle="Data are from the CDC.",
x="\nMonth and Year", y="Fully Vaccinated Population\n") +
scale_y_continuous(labels=scales::comma, limits=c(0,
160000000), breaks=scales::pretty_breaks(n=7)) +
theme_Mac() +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=00))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# For this figure, observe the values used on both the X axis
# and the Y axis for hjust, vjust, and angle.
# Fig. 6.19
Challenge: Look closely at the data and consider how they can be used for other
presentations and analyses. Much more can and should be done with these and other
COVID-19 data from the CDC, to learn more about this disease and from that
knowledge, to prepare for future diseases.
The important takeaway from this addendum is that there is no mystery to json
data – json data represent merely another data format. Data in json format are text
based, and by using R (as well as other languages), it is possible to put the special-
ized json structure into a format suitable for R, as either a dataframe or a tibble.
Data scientists need to know how to work with data in many formats, including
text-based json data. However, this text is geared for R and R-based APIs, thus the
minimal discussion of json data. It is suggested that it is far easier to acquire and
work with data brought into an active R session by using a tidy-focused API, but of
course, an experienced data scientist must have the skills to use json data if that is
the format at hand.
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, and .xlsx file formats.
CovidCDC.xlsx
Cranberry2010to2021.xlsx
EPAFuelToxinsTrimmed.xlsx
IACornAcresPlanted1926Onward.xlsx
IACornPricePerBushel1867Onward.xlsx
IACountyCornYieldBuAcStarttoEnd.xlsx
TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.xlsx
TNMedHouseIncomeB19013_001acs1_2019.xlsx
TNMedHouseIncomeB19013_001acs5_2019.xlsx
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 7
Putting It All Together – R, the tidyverse Ecosystem, and APIs
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "D:/R_Packages")
# As a preference, all installed packages
# will now go to the external D:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("D:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
The packages marked with a preceding R # comment character have been previ-
ously downloaded from CRAN, are housed in the correct directory (e.g., folder),
and represent the most currently available version. Accordingly, it is not necessary
to download them again, and application of the library() function is sufficient to put
the functions in these packages into use. To save space, it is not uncommon to see
syntax such as library(Package1); library(Package2); library(Package3), etc. on the
same line, but that practice was not used in this lesson.
# install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
# install.packages("readxl", dependencies=TRUE)
library(readxl)
# install.packages("magrittr", dependencies=TRUE)
library(magrittr)
# install.packages("janitor", dependencies=TRUE)
library(janitor)
# install.packages("rlang", dependencies=TRUE)
library(rlang)
# install.packages("htmltools", dependencies=TRUE)
library(htmltools)
# install.packages("httr", dependencies=TRUE)
library(httr)
# install.packages("jsonlite", dependencies=TRUE)
library(jsonlite)
# install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
# install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
# install.packages("ggtext", dependencies=TRUE)
library(ggtext)
# install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
# install.packages("scales", dependencies=TRUE)
library(scales)
# install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
# install.packages("cowplot", dependencies=TRUE)
library(cowplot)
# install.packages("writexl", dependencies=TRUE)
library(writexl)
###############################################################
# Use the tidycensus package and/or the acs package and the
# U.S. Census Bureau key to obtain state and/or county specific
# data from selected American Community Survey (ACS) and/or
# Decennial Census tables.
#
# Use the following URL to access the form needed to obtain an
# API key from the U.S. Census Bureau:
# https://api.census.gov/data/key_signup.html
#
# Complete details on the API process with U.S. Census Bureau
# are available at https://www.census.gov/content/dam/Census/
# library/publications/2020/acs/acs_api_handbook_2020_ch02.pdf.
# install.packages("tidycensus", dependencies=TRUE)
library(tidycensus)
# CAUTION: The tidycensus package may take longer to
# download than other packages. Be patient.
tidycensus::census_api_key(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
#"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
# CAUTION: The code for this key is reserved
# for [email protected] only. Use your own key!
#
# Note: Many 2020 Decennial Census data products are coming
# online, but all data are not yet available. Use the 2010
# Decennial Census as a base for table identification and
# nature of the data.
#
SF12010 <-
tidycensus::load_variables(2010, "sf1", cache = TRUE)
View(SF12010)
str(SF12010)
writexl::write_xlsx(SF12010,
path = "D:\\R_Ceres\\SF12010.xlsx", col_names=TRUE)
#
SF22010 <-
tidycensus::load_variables(2010, "sf2", cache = TRUE)
View(SF22010)
str(SF22010)
writexl::write_xlsx(SF22010,
path = "D:\\R_Ceres\\SF22010.xlsx", col_names=TRUE)
#
###############################################################
# install.packages("acs", dependencies=TRUE)
library(acs)
# acs.tables.install()
# Be patient. Some packages take more than a few minutes to
# install, and a few also take an unexpectedly long time to
# load when the library() function is called.
###############################################################
# Mapping #
###############################################################
# install.packages("htmlTable", dependencies=TRUE)
library(htmlTable)
# install.packages("ggmap", dependencies=TRUE)
library(ggmap)
# install.packages("maps", dependencies=TRUE)
library(maps)
# install.packages("maptools", dependencies=TRUE)
library(maptools)
# install.packages("Rcpp", dependencies=TRUE)
library(Rcpp)
# install.packages("rgdal", dependencies=TRUE)
library(rgdal)
# install.packages("rgeos", dependencies=TRUE)
library(rgeos)
# install.packages("sf", dependencies=TRUE)
library(sf)
# install.packages("sp", dependencies=TRUE)
library(sp)
# install.packages("stars", dependencies=TRUE)
library(stars)
# install.packages("terra", dependencies=TRUE)
library(terra)
# install.packages("usmap", dependencies=TRUE)
library(usmap)
# install.packages("xfun", dependencies=TRUE)
library(xfun)
# install.packages("choroplethr", dependencies=TRUE)
library(choroplethr)
# install.packages("choroplethrAdmin1", dependencies=TRUE)
library(choroplethrAdmin1)
# install.packages("choroplethrMaps", dependencies=TRUE)
library(choroplethrMaps)
# install.packages("zctaCrosswalk", dependencies=TRUE)
library(zctaCrosswalk)
# The zctaCrosswalk package may be a good substitute
# to the choroplethrZip package. The choroplethrZip
# package is not found at CRAN but is instead housed
# at github.
###############################################################
###############################################################
# needed. However, its bold and large fonts are the main
# reason for its use -- it makes figures easy to read.
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
###############################################################
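The full definition of theme_Mac() appears in an earlier lesson and is only partially echoed in the comments above. As a hedged illustration (not the author's actual definition), a user-created ggplot2 theme in the same spirit, favoring bold and comparatively large fonts, might be written as:
theme_Mac <- function() {
ggplot2::theme_bw() +
ggplot2::theme(
plot.title   = ggplot2::element_text(face="bold", size=14),
axis.title.x = ggplot2::element_text(face="bold", size=12),
axis.title.y = ggplot2::element_text(face="bold", size=12),
axis.text.x  = ggplot2::element_text(face="bold", size=10),
axis.text.y  = ggplot2::element_text(face="bold", size=10))
}
# theme_Mac() can then be added to a ggplot2 object, as shown
# throughout this lesson.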
At this point in the text, the ending lesson, and before the tidyverse ecosystem is put into use one last time, it seems best to identify a few functions that should be deployed occasionally to remain current with R, the tidyverse ecosystem, and the many periodic updates of each. This brief section also identifies possible conflicts and any other issues that may arise when using the tidyverse and its many packages and functions:
utils::news(package="R")
utils::news(package="tidyverse")
# Remain current with the tidyverse.
tidyverse::tidyverse_sitrep()
# Receive a situation report on the
# tidyverse, core packages and non-core
# packages.
tidyverse::tidyverse_conflicts()
# Identify tidyverse functions that may
# be in conflict with other functions if
# a PackageName::FunctionName() naming
# convention were not used.
As a reminder, as of early 2023 and the release of tidyverse 2.0.0, the lubridate
package is now among the core tidyverse packages. Use the following functions to
see highly detailed information about which package version is in use:
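The specific calls are not reproduced at this point; functions along the following lines (an assumption consistent with base R's utils package) would show the version details:
utils::packageVersion("tidyverse")
utils::packageVersion("lubridate")
utils::sessionInfo()
# Display the installed version of individual packages and a
# summary of the full session, including attached packages.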
Now that the upfront work of preparing for this lesson is completed, prepare a list of five years (2015, 2016, 2017, 2018, and 2019) for which county-wide data are available and will be requested from the Census Bureau, specifically the American Community Survey (ACS):
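The assignment that creates years2015to2019.lst is not shown in this extraction. A minimal sketch consistent with the syntax that follows (the element names are an assumption) might be:
years2015to2019.lst <-
base::list(Year2015=2015, Year2016=2016, Year2017=2017,
Year2018=2018, Year2019=2019)
# A named list of the five years of interest; the list is
# later passed to purrr::map_dfr() to iterate the data
# requests, one year at a time.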
base::getwd()
base::ls()
base::attach(years2015to2019.lst)
base::class(years2015to2019.lst)
This list, used along with the purrr::map_dfr() function, will facilitate the acqui-
sition of ACS data over multiple years, 2015 to 2019. Although this is not exactly
syntax for a loop, think of this as being like a looping process, where one set of
syntax is used to iterate the data acquisition process.
Use the tidycensus::get_estimates() function and the product="components"
argument to obtain data for 12 unique variables of interest to those who work in
public health. Download the file so that the data are available for future use.
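The data-acquisition syntax itself is also not reproduced here. A hedged sketch of how the list of years, purrr::map_dfr(), and tidycensus::get_estimates() might be combined (the exact arguments used by the author are an assumption) is:
USCounties2015to2019Components.tbl <-
purrr::map_dfr(years2015to2019.lst, function(y) {
tidycensus::get_estimates(
geography = "county",     # Data - all US counties
product   = "components", # Components-of-change variables
year      = y,            # One year per iteration
output    = "tidy") %>%   # Data - tidy format
dplyr::mutate(year = y)     # Record the year on each row
})
# purrr::map_dfr() applies the same data request to each year
# in the list and binds the returned rows into one tibble.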
base::attach(USCounties2015to2019Components.tbl)
base::unique(USCounties2015to2019Components.tbl$variable)
base::print(USCounties2015to2019Components.tbl, n=24)
Note: The syntax associated with this data request to the Census Bureau results
in a visible dataset that is somewhat large, generating a tibble of 188,520 rows by 6
columns. What is not seen, however, is the complexity of all that must occur to
organize the data in the requested manner: multiple years, multiple states, multiple
counties and GEOIDs within states, multiple MULTIPOLYGONs, etc. Be patient
when executing the syntax. If possible, processing speed seems fastest during early
morning nonbusiness hours for the Eastern Time Zone, or four or more hours behind
Zulu Time Zone (e.g., Greenwich Mean Time (GMT)). A complex data request to
the Census Bureau that required 26 minutes to process at noon on a regular Monday-to-Friday workday required less than 1 minute to process in the earliest hours of the morning, before sunrise.
Note: For those with confidence in their Internet connection, it may not be neces-
sary to download this file and similar files gained by use of the tidycensus::get_esti-
mates() function. It would be a rare event if Census Bureau ACS data were
unavailable. Further, the Census Bureau key is available to all, merely by complet-
ing a simple online form. These datasets can be easily recreated, eliminating nearly
all concerns about availability. But of course, the data could be saved if needed.
Challenge: Given the volumes of data, data relating to 12 unique variables over
multiple years, data for multiple states, and data for multiple counties, focus on the
variable NATURALINC (review Census Bureau resources to find a precise defini-
tion of natural increase) as it relates to change in populations. Review the syntax
found immediately below and use tidyverse ecosystem tools to embellish the figure:
Adjust the Y-axis scale, add a theme that increases readability, and add an annota-
tion that describes the meaning of natural increase for those who may not know the
correct application of this term. Then, explain why public health personnel need
to have detailed information about natural increase (as well as other indicators
related to population dynamics) at multiple levels of detail: by state, by county, by
tract, etc. Finally, be sure to study this syntax and notice how many different activi-
ties (e.g., filtering, computation, graphics) were all chained together into efficient
and highly transparent syntax that resulted in a useful draft figure (Fig. 7.1).
Fig. 7.1
USCounties2015to2019NATURALINC.fig <-
USCounties2015to2019Components.tbl %>%
# Currently, 188,520 rows by 6 columns
dplyr::filter(variable %in% c(
"NATURALINC")) %>%
# Now, 15,710 rows by 6 columns
# Retain rows where the string value shows
# in the object called variable.
dplyr::group_by(year) %>% # Prepare statistics for
dplyr::summarize( # each year, 2015-2019.
# N = base::length(value), # Descriptive statistics
# Minimum = base::min(value), # that are excluded by
# Median = stats::median(value), # using the R comment
# Mean = base::mean(value), # character.
# SD = stats::sd(value), # Prepare Sum NATURALINC
# Maximum = base::max(value), # by year statistics.
Sum = base::sum(value)) %>%
ggplot2::ggplot(aes(x=year, y=Sum)) +
geom_col(fill="red", color="black") +
labs(title =
"Sum United States Natural Increase (NATURALINC) by Year:
2015 to 2019")
# Fig. 7.1
par(ask=TRUE); USCounties2015to2019NATURALINC.fig
USCounties2019Housing.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
year="2019", # Data - 2019 only
product="housing", # Data - 01 unique variable
# HUEST
output="tidy", # Data - tidy format
geometry="TRUE") # Fetch geometry data.
base::attach(USCounties2019Housing.tbl)
base::unique(USCounties2019Housing.tbl$variable)
base::print(USCounties2019Housing.tbl, n=24)
USCounties2019HousingMultipleStates.tbl <-
USCounties2019Housing.tbl %>%
dplyr::filter(
grepl('Kentucky|Maryland|Virginia', NAME)) %>%
dplyr::group_by(GEOID, NAME) %>%
dplyr::summarize(
N = base::length(value),
# Minimum = base::min(value),
# Median = stats::median(value),
# Mean = base::mean(value),
# SD = stats::sd(value),
# Maximum = base::max(value),
Sum = base::sum(value)) %>%
dplyr::arrange(GEOID)
# Arrange the output in GEOID
# ascending order.
base::print(USCounties2019HousingMultipleStates.tbl, n=332)
Question: Replicate the above syntax, but instead use Ohio in place of Virginia in
the selection line, so that the filter reads grepl('Kentucky|Maryland|Ohio', NAME).
Would the dataset include data for any states other than Kentucky, Maryland, or
Ohio? If so, which state(s), and why would this occur?
Use the tidycensus::get_estimates() function and the product="population" argu-
ment to obtain population data. Along with overall population headcounts, the pop-
ulation density data are also quite useful. Consider the relevance of the population
density data for defined areas and how public health services are provided, urban v
rural, and the too common concern (whether founded or not) that rural residents
may not receive the same degree of public health service as urban residents.
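A minimal sketch of one way to acquire these population and density data is shown below; the exact arguments are assumptions, and only the object name comes from the syntax that follows.
USCounties2015to2019Population.tbl <-
purrr::map_dfr(years2015to2019.lst, function(x) {
tidycensus::get_estimates(
geography="county", # Data - all US counties
product="population", # Data - POP and DENSITY
year=x, # Data - one year per pass
output="tidy") %>% # Data - tidy format
dplyr::mutate(year=x) # Tag each row with its year.
})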
base::attach(USCounties2015to2019Population.tbl)
base::unique(USCounties2015to2019Population.tbl$variable)
base::print(USCounties2015to2019Population.tbl, n=24)
USCounties2019PopulationDensityPA.map <-
USCounties2015to2019Population.tbl %>%
dplyr::filter(variable %in% c(
"DENSITY")) %>%
# Retain rows where the string DENSITY shows in
# the object called variable.
dplyr::filter(year %in% c(
"2019")) %>%
# Retain rows where the string 2019 shows in
# the object called year.
dplyr::filter(grepl('Pennsylvania', NAME)) %>%
# Select Pennsylvania only.
dplyr::rename(region=GEOID)
# Rename GEOID to region, to satisfy naming
# requirements for the choroplethr package.
base::attach(USCounties2019PopulationDensityPA.map)
base::unique(USCounties2019PopulationDensityPA.map)
base::length(USCounties2019PopulationDensityPA.map)
base::print(USCounties2019PopulationDensityPA.map)
Fig. 7.2
USCounties2019PopulationDensityPA.map$region <-
base::as.numeric(
USCounties2019PopulationDensityPA.map$region)
base::attach(USCounties2019PopulationDensityPA.map)
utils::str(USCounties2019PopulationDensityPA.map)
par(ask=TRUE)
choroplethr::county_choropleth(
USCounties2019PopulationDensityPA.map,
state_zoom="pennsylvania")
# state_zoom (a standard argument of the
# choroplethr::county_choropleth() function) is an
# assumed completion of this call; it zooms the map to
# Pennsylvania. Add title= and legend= arguments as
# desired.
# Fig. 7.2
USCounties2019AgeGroupCA.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
state="California", # Data - CA only
year="2019", # Data - 2019 only
product="characteristics", # Data - 32 unique variables
breakdown="AGEGROUP", # Data - select AgeGroup only
# All ages, to
# Multiple age breakouts, to
# Median age
breakdown_labels="TRUE",
output="tidy")
base::attach(USCounties2019AgeGroupCA.tbl)
base::unique(USCounties2019AgeGroupCA.tbl$AGEGROUP)
base::print(USCounties2019AgeGroupCA.tbl, n=32)
Challenge: Prepare output that displays median age (2019) for each California
county and then prepare a choropleth to visually reinforce outcomes.
USCounties2019HispanicTX.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
state="Texas", # Data - TX only
year="2019", # Data - 2019 only
product="characteristics", # Data - 03 unique variables
breakdown="HISP", # Data - select Hispanic only
# Both Hispanic Origins
# Non-Hispanic
# Hispanic
breakdown_labels="TRUE",
output="tidy")
base::attach(USCounties2019HispanicTX.tbl)
base::unique(USCounties2019HispanicTX.tbl$HISP)
base::print(USCounties2019HispanicTX.tbl, n=06)
USCounties2019RaceVT.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
state="Vermont", # Data - VT only
year="2019", # Data - 2019 only
product="characteristics", # Data - 12 unique variables
breakdown="RACE", # Data - select Race only
# All races
# White alone
# Black alone
# American Indian and Alaska Native alone
# Asian alone
# Native Hawaiian and Other Pacific Islander alone
# Two or more races
# White alone or in combination
# Black alone or in combination
# American Indian and Alaska Native alone or in
# combination
# Asian alone or in combination
# Native Hawaiian and Other Pacific Islander alone or in
# combination
breakdown_labels="TRUE",
output="tidy")
base::attach(USCounties2019RaceVT.tbl)
base::unique(USCounties2019RaceVT.tbl$RACE)
base::print(USCounties2019RaceVT.tbl, n=99)
USCounties2019SEXWY.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
state="Wyoming", # Data - WY only
year="2019", # Data - 2019 only
product="characteristics", # Data - 03 unique variables
breakdown="SEX", # Data - select SEX only
# Both sexes
# Male
# Female
breakdown_labels="TRUE",
output="tidy")
base::attach(USCounties2019SEXWY.tbl)
base::unique(USCounties2019SEXWY.tbl$SEX)
base::print(USCounties2019SEXWY.tbl, n=06)
Challenge: Use the syntax below to then prepare two choropleth maps of
Wyoming counties: (1) percentage (2019) Males, and (2) percentage (2019)
Females. For those with more advanced skills, use a matched pairs (possibly a
Student’s t-Test for Matched Pairs) approach to the data and determine if there is a
by-county statistically significant (p ≤ 0.05) difference in age between males and
females.
Challenge: Use tools from the tidyverse ecosystem to determine the percentage
of males and females for each Wyoming county (2019). An approach is displayed
immediately below, but there are many ways to address this task so think of other
ways that the two percentages for each county can be easily calculated.
USCounties2019SEXWYPctFemaleMale.tbl <-
USCounties2019SEXWY.tbl %>%
group_by(GEOID) %>%
mutate(percentage = prop.table(value) * 100 * 2) %>%
arrange(SEX, GEOID)
base::attach(USCounties2019SEXWYPctFemaleMale.tbl)
base::print(USCounties2019SEXWYPctFemaleMale.tbl, n=69)
USCounties2019SEXWYPctFemaleMale.tbl <-
USCounties2019SEXWY.tbl %>%
group_by(NAME) %>%
mutate(percentage = prop.table(value) * 100 * 2) %>%
arrange(SEX, GEOID)
base::attach(USCounties2019SEXWYPctFemaleMale.tbl)
base::print(USCounties2019SEXWYPctFemaleMale.tbl, n=69)
The data previously used in this lesson were in long format, due to use of the
tidycensus::get_estimates() function and the output="tidy" argument. This feature
is very useful, placing fetched data in long (e.g., tidy) format immediately when
data are retrieved. Yet, data scientists often encounter wide format data and actions
need to be used to restructure the data into long format.
Fortunately, the tidyverse ecosystem can accommodate data restructuring, from
wide to long and from long to wide, as needed:
• The tidyr::pivot_longer() function is used to put wide format data into long for-
mat. The use of this function is quite common, as wide data that have good eye-
appeal as a spreadsheet-type table are restructured into long format to better
accommodate many features in the tidyverse ecosystem.
• The tidyr::pivot_wider() function is perhaps less frequently used, as long format
data are restructured into wide format.
Following with use of the tidyverse, it is not at all uncommon to see either the
tidyr::pivot_longer() function or the tidyr::pivot_wider() function used in association
with the %>% (e.g., pipe) operator as multiple actions are chained into one unified
set of syntax.
WMilkLbFatProtein.tbl <-
readxl::read_xlsx("WideMilkLbsPctFatPctProtein.xlsx",
sheet = 1, # Read in the 1st sheet.
col_names = TRUE, # The 1st row represents column names.
col_types = c(
"numeric", # Column A AyrshireLb
"numeric", # Column B GuernseyLb
"numeric", # Column C HolsteinLb
"numeric", # Column D JerseyLb
"numeric", # Column E AyrshirePctFat
"numeric", # Column F GuernseyPctFat
"numeric", # Column G HolsteinPctFat
"numeric", # Column H JerseyPctFat
"numeric", # Column I AyrshirePctProtein
"numeric", # Column J GuernseyPctProtein
"numeric", # Column K HolsteinPctProtein
"numeric", # Column L JerseyPctProtein
"text", # Column M Management
"text"), # Column N Farm
trim_ws = FALSE, # Retain leading/trailing whitespace.
n_max = 21)
# When viewing the object name WMilkLbFatProtein.tbl, for
# this lesson the convention W is used to relay how the data
# are in WIDE format. Later, the convention L is used to
# identify how data are in LONG format. The use of this
# naming schema is certainly not required, but it is a good
# programming practice (gpp) and improves communication and
# comprehension of data layout.
The object variables Management and Farm show as character-type objects, and
indeed they are strings of character values. Yet, they should be viewed as factors.
From among the many ways available in R, put these two object variables into factor
format and declare a specific order for the factors, with the ordering based on alpha-
betical order.
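The two level objects referenced in the factor() calls below are built here as a minimal sketch, assuming alphabetical ordering of the observed values:
ManagementLevels <-
base::sort(base::unique(WMilkLbFatProtein.tbl$Management))
FarmLevels <-
base::sort(base::unique(WMilkLbFatProtein.tbl$Farm))
# Alphabetically ordered character vectors of the
# observed Management and Farm values, used as the
# levels= argument in the factor() calls below.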
WMilkLbFatProtein.tbl$Management <-
factor(WMilkLbFatProtein.tbl$Management,
levels=ManagementLevels)
# Put the object variable Management into
# factor format, now as an ordered factor
# with specific levels.
WMilkLbFatProtein.tbl$Farm <-
factor(WMilkLbFatProtein.tbl$Farm,
levels=FarmLevels)
# Put the object variable Farm into factor
# format, now as an ordered factor with
# specific levels.
From among the many ways this task could be addressed, a reasonable approach
is to structure the data into long format using the tidyr::pivot_longer() function.
Note how this function is in contemporary use, whereas the tidyr::gather() function
is considered outdated and is no longer suggested for use, although it still works and
there are no current stated plans to remove it from availability.
First, create a wide format dataset that is restricted to the following data:
AyrshireLb, GuernseyLb, HolsteinLb, JerseyLb, Management, and Farm. This new
dataset excludes all data related to percent fat and percent protein.
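A minimal sketch of this restriction, assuming the dplyr::select() function (other approaches would work equally well):
WMilkLbBreedMgtFarm.tbl <-
WMilkLbFatProtein.tbl %>%
dplyr::select(AyrshireLb, GuernseyLb, HolsteinLb,
JerseyLb, Management, Farm)
# Retain the four pounds-of-milk columns and the two
# factor-type columns; the percent fat and percent
# protein columns are excluded.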
WMilkLbBreedMgtFarm.tbl
LMilkLbBreedMgtFarm.tbl <-
tidyr::pivot_longer(WMilkLbBreedMgtFarm.tbl,
-c(Management, Farm),
names_to = "Breed", values_to = "Pounds")
# Put the data into long format, using the
# tidyr::pivot_longer() function.
#
# The expression -c(Management, Farm) means
# that the tidyr::pivot_longer() function
# should pivot everything except Management
# and Farm. In this syntax, the minus sign
# means except.
LMilkLbBreedMgtFarm.tbl
base::getwd() # Working directory
base::ls() # List current files
base::attach(LMilkLbBreedMgtFarm.tbl) # Attach the file
utils::str(LMilkLbBreedMgtFarm.tbl) # File structure
dplyr::glimpse(LMilkLbBreedMgtFarm.tbl) # File structure
utils::head(LMilkLbBreedMgtFarm.tbl) # Print first few rows
summary(LMilkLbBreedMgtFarm.tbl) # Data summary
LMilkLbBreedMgtFarm.tbl
It seems that the wide format dataset WMilkLbBreedMgtFarm.tbl has been suc-
cessfully restructured into the long format dataset LMilkLbBreedMgtFarm.tbl. Yet,
as a quality assurance measure, it is best to confirm that the data in long format are
equivalent to the data in wide format and that the only difference is in the way the
data are organized. A data scientist assumes little and confirms much.
• First, prepare an initial figure that will be used to provide a sense of data distribu-
tion for all pounds of milk per lactation datapoints, knowing that the graphic will
be for Pounds in collapsed format – data from all 80 dairy cows.
• After this initial figure of collapsed data is prepared, prepare a set of figures that
addresses pounds of milk per lactation by Breed, by Management, and by Farm.
A violin plot with a superimposed boxplot will be used to prepare an initial figure
of pounds of milk per lactation, knowing that the figure represents data for all four
breeds (20 cows per breed * 4 breeds = 80 datapoints). By using a violin plot and a
superimposed boxplot, the resulting figure should give a good visual advance orga-
nizer of the data and where all data show compared to measures of central tendency,
such as the mean and the median. Be sure to consider how two graphical tools are
incorporated into one common figure, a useful feature supported by the
ggplot2::ggplot() function – a key part of the tidyverse ecosystem (Fig. 7.3).
par(ask=TRUE)
# As found throughout the many examples in this lesson and
# text, par(ask=TRUE) is used to freeze the screen and
# provide some degree of control of sequence and flow.
ggplot2::ggplot(data=LMilkLbBreedMgtFarm.tbl,
aes_string(x=1, y=Pounds))+
# The violin plot and superimposed boxplot in this figure
# are prepared only for one variable, Pounds. Notice the
# use of aes_string() and how x is accommodated, using the
# number 1 instead of a named variable.
geom_violin(width=4.5, fill="lightcyan") +
geom_boxplot(width = 0.5, fill="honeydew2", fatten=4,
outlier.shape=NA) +
# The fatten parameter is used to adjust the thickness of
# the line representing the median. The default is 2, but
# in this figure notice how the line has been adjusted to
# 4, making the line thick and by design very noticeable.
# The boxplot will purposely hide outliers, by using
# outlier.shape=NA, so that the outliers do not clash with
# the purposeful presentation of all datapoints from the 80
# dairy cows, which show as red circles. Be sure to
# observe placement of the 80 datapoints in relation to the
# mean (blue circle in the box) and median (black
# horizontal line in the box).
stat_summary(fun="mean",geom="point", size=6, pch=21,
color="blue", fill="blue")+
# Place a large blue circle inside the box, representing
# the mean. It is standard with a box plot that the
# median is represented by a horizontal bar inside the
# box.
geom_jitter(color= "red") + # Datapoints
annotate("text", x=1.0, y=16000, fontface=2, label=
"Median - the black horizontal line in the boxplot") +
annotate("text", x=1.0, y=22000, fontface=2, label=
456 7 Putting It All Together – R, the tidyverse Ecosystem, and APIs
Fig. 7.3
By showing both a violin plot and a boxplot in the same figure it was possible to
prepare an extremely interesting figure, a figure that would be less useful if only the
violin plot or the boxplot had been presented. Give special attention to both the
datapoints and the shape of the violin plot. There seem to be two unique data distri-
bution patterns for overall pounds of milk per lactation, but of course, recall that
there are multiple breeds represented in the dataset and resulting figure. As interest-
ing as this initial figure and later descriptive statistics may be, confirming inferential
analyses are needed to say more about this observation regarding data distribution.
Initial figures and descriptive statistics guide inquiries, but confirmation comes
from the application of appropriate inferential tests.
As displayed in the violin plot and superimposed boxplot, it is evident that data
(e.g., Pounds, pounds of milk per lactation) do not follow any normal distribution
pattern. A set of figures may help offer a better sense of the data and the figures may
also offer a few ideas on how to approach later statistical analyses, descriptive sta-
tistics, and inferential statistics (Figs. 7.4, 7.5, and 7.6).
Fig. 7.4
Fig. 7.5
Fig. 7.6
MilkLbBreed.fig <-
ggplot2::ggplot(LMilkLbBreedMgtFarm.tbl,
aes(x=Breed, y=Pounds)) +
geom_boxplot((aes(fill=Breed))) +
labs(title="Pounds of Milk per Lactation:
Breed",
x="\nBreed", y="Pounds of Milk\n") +
scale_y_continuous(labels=scales::comma,
limits=c(17000, 28000),
breaks=scales::pretty_breaks(n=6)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=1, angle=45))
# Fig. 7.4
par(ask=TRUE); MilkLbBreed.fig
MilkLbManagementBreed.fig <-
ggplot2::ggplot(LMilkLbBreedMgtFarm.tbl,
aes(x=Breed, y=Pounds)) +
geom_boxplot((aes(fill=Management))) +
labs(title="Pounds of Milk per Lactation:
Management by Breed",
x="\nBreed", y="Pounds of Milk\n") +
scale_y_continuous(labels=scales::comma,
limits=c(17000, 28000),
breaks=scales::pretty_breaks(n=6)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=1, angle=45))
# Fig. 7.5
par(ask=TRUE); MilkLbManagementBreed.fig
MilkLbFarmBreed.fig <-
ggplot2::ggplot(LMilkLbBreedMgtFarm.tbl,
aes(x=Breed, y=Pounds)) +
geom_boxplot((aes(fill=Farm))) +
labs(title="Pounds of Milk per Lactation:
Farm by Breed",
x="\nBreed", y="Pounds of Milk\n") +
scale_y_continuous(labels=scales::comma,
limits=c(17000, 28000),
breaks=scales::pretty_breaks(n=6)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=1, angle=45))
# Fig. 7.6
par(ask=TRUE); MilkLbFarmBreed.fig
Much more could and should be done with the data and with the generation of
figures that offer a glimpse of data patterns and, from that, suggest analyses that
offer more definitive answers regarding statistically significant differences and
associations (p ≤ 0.05). What should be recalled from this ending lesson is that
figures, regardless of how the data may appear in graphical format, do not provide
sufficient evidence to make any statements on significance. Formal statistical
testing is needed for judgments on significance.
# Base R
base::mean(LMilkLbBreedMgtFarm.tbl$Pounds)
[1] 20226.8
stats::sd(LMilkLbBreedMgtFarm.tbl$Pounds)
[1] 3299.03
base::length(LMilkLbBreedMgtFarm.tbl$Pounds)
[1] 80
stats::quantile(LMilkLbBreedMgtFarm.tbl$Pounds)
# tidyverse Ecosystem
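The tibble printed below was presumably prepared with the dplyr::summarize() function; a minimal sketch that reproduces the column names shown is:
LMilkLbOverallDescriptives <-
LMilkLbBreedMgtFarm.tbl %>%
dplyr::summarize(
Mean = base::mean(Pounds, na.rm=TRUE),
SD = stats::sd(Pounds, na.rm=TRUE),
Ptile25 = stats::quantile(Pounds, 0.25, na.rm=TRUE),
Median = stats::median(Pounds, na.rm=TRUE),
Ptile75 = stats::quantile(Pounds, 0.75, na.rm=TRUE),
Minimum = base::min(Pounds, na.rm=TRUE),
Maximum = base::max(Pounds, na.rm=TRUE),
Missing = base::sum(is.na(Pounds)),
N = base::length(Pounds))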
LMilkLbOverallDescriptives
# A tibble: 1 × 9
Mean SD Ptile25 Median Ptile75 Minimum Maximum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20227. 3299. 18192. 18700 20981 17106 26637
Missing N
<int> <int>
0 80
Not surprisingly, other than rounding, the values for mean, sd, length (e.g., N),
etc. are equivalent whether using standard functions that come with Base R or using
the tidyverse ecosystem and the dplyr::summarise() function. Now add another
layer of detail to use of the dplyr::summarise() function by also deploying the
dplyr::group_by() function.
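A minimal sketch for the by-breed tibble printed below, assuming the same summary columns as above together with the dplyr::group_by() function:
LMilkLbBreedDescriptives <-
LMilkLbBreedMgtFarm.tbl %>%
dplyr::group_by(Breed) %>%
dplyr::summarize(
Mean = base::mean(Pounds, na.rm=TRUE),
SD = stats::sd(Pounds, na.rm=TRUE),
Ptile25 = stats::quantile(Pounds, 0.25, na.rm=TRUE),
Median = stats::median(Pounds, na.rm=TRUE),
Ptile75 = stats::quantile(Pounds, 0.75, na.rm=TRUE),
Minimum = base::min(Pounds, na.rm=TRUE),
Maximum = base::max(Pounds, na.rm=TRUE),
Missing = base::sum(is.na(Pounds)),
N = base::length(Pounds))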
LMilkLbBreedDescriptives
# A tibble: 4 × 10
Breed Mean SD Ptile25 Median Ptile75
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AyrshireLb 18650. 471. 18210 18478. 19186
2 GuernseyLb 17763. 589. 17241 17783 18219.
3 HolsteinLb 25811. 346. 25605. 25663 25808.
4 JerseyLb 18683 478. 18297. 18569 18959.
1
For those who are not well-acquainted with dairy cattle and characteristics of the leading breeds,
review State and national standardized lactation averages by breed for cows calving in 2007
(https://queries.uscdcb.com/publish/dhi/dhi09/laall.shtml) for generalized by breed statistics not
only on milk production (lb), but statistics that also address fat and protein production (% and lb).
There are many dairy herdsmen who place a high value on the production of fat (% and lb.) and
protein (% and lb.) and are willing to accept less milk production in terms of measured weight (lb.).
2
Whether data are nonparametric or parametric is often a matter of personal judgment or group
consensus. Ideally, the final decision is based not only on observation of the data but is also a result
of applied tests such as the Anderson-Darling test or the Shapiro test, but more discussion on this
issue would go beyond the scope for this specific lesson.
3
For those with special interest in the issue of normal distribution and the selection of a nonpara-
metric or parametric approach to inferential test selection, look at use of the dlookr::normality()
function, such as dlookr::normality(Pounds) alone or chained to testing of normal distribution by
groups by using the dplyr::group_by() function.
# A tibble: 4 × 4
Breed Mean.Lb SD.Lb N.Lb
<fct> <dbl> <dbl> <int>
1 AyrshireLb 18650. 471. 20
2 GuernseyLb 17763. 589. 20
3 HolsteinLb 25811. 346. 20
4 JerseyLb 18683 478. 20
These descriptive statistics provide useful information, but it is not yet possible
to determine if there are statistically significant differences (p ≤ 0.05) in mean
pounds of milk production per lactation between and among the four dairy breeds.
There are of course more than a few R-based solutions to resolve this inquiry, where
a Oneway Analysis of Variance (Oneway ANOVA) is needed. To achieve this aim
and generate a summary of Oneway ANOVA outcomes, first use the tidyverse
broom package and a few supporting functions in the broom package.
Data are often in tidy format due to specific actions used to achieve that aim.
Following along with a tidy approach to data science, the broom package was devel-
oped to support the production of tidy output, especially the output of calculations
associated with inferential tests that examine differences between and among
groups. Consider the following attempt at a Oneway ANOVA and the desire to know:
• Are there statistically significant differences (p ≤ 0.05) in pounds of milk pro-
duction per lactation between and among the four breeds in this example?
• If so, which breeds are in common (i.e., there is no statistically significant
difference (p ≤ 0.05) in pounds of milk production per lactation between the
identified groups) and which breeds show difference (i.e., there is a statistically
significant difference (p ≤ 0.05) in pounds of milk production per lactation
between the identified groups)?
As background, the broom package is not part of the core tidyverse ecosystem
but it is certainly included among the many packages associated with the tidyverse
ecosystem. As such, it must be downloaded and put into use separately.
install.packages("broom", dependencies=TRUE)
library(broom)
PoundsMilkByBreedUsinglm <-
stats::lm(Pounds ~ Breed, data = LMilkLbBreedMgtFarm.tbl)
# lm - linear model, using base R
broom::tidy(stats::anova(PoundsMilkByBreedUsinglm))
# Generate a tibble of overall Oneway ANOVA
# outcomes.
# A tibble: 2 × 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Breed 3 842380181. 280793394. 1225. 3.15e-64
2 Residuals 76 17424104. 229265. NA NA
With a calculated p-value of 3.15e-64, which is certainly far less than p ≤ 0.05,
it can be stated that there is a statistically significant difference (p ≤ 0.05) in pounds
of milk production per lactation between the four dairy breeds. But this output does
not provide any sense of which breeds are in common and which breeds show
difference.
To achieve a more finite sense of commonality and difference between and
among groups (e.g., breeds), it is necessary to use a mean comparison test. Tukey’s
HSD (Honestly Significant Difference) test is among the most frequently used post
hoc tests for this purpose, making it possible to have some sense of multiple group
comparisons of commonality and differences for all potential pairs (e.g., group-by-
group comparisons).
PoundsMilkByBreedUsingTukeyHSD <-
aov(Pounds ~ Breed, data = LMilkLbBreedMgtFarm.tbl)
# aov - anova model, using base R
broom::tidy(TukeyHSD(PoundsMilkByBreedUsingTukeyHSD))
# Generate a tibble of groupwise comparative
# Oneway ANOVA outcomes.
# A tibble: 6 × 7
term contrast null.value estimate conf.low
<chr> <chr> <dbl> <dbl> <dbl>
1 Breed GuernseyLb-AyrshireLb 0 -887. -1285.
2 Breed HolsteinLb-AyrshireLb 0 7160. 6763.
3 Breed JerseyLb-AyrshireLb 0 32.6 -365.
4 Breed HolsteinLb-GuernseyLb 0 8048. 7650.
5 Breed JerseyLb-GuernseyLb 0 920. 522.
6 Breed JerseyLb-HolsteinLb 0 -7128. -7525.
It may take a few minutes to see which group comparisons are in common (e.g.,
those groups where the p-value is greater than 0.05). Equally, it requires a fair
amount of time to discover group comparisons where there are differences (e.g.,
those groups where the p-value is less than or equal to 0.05).
Because it can be challenging to work through these comparisons, especially when
there are many, it is best to use R packages and functions that provide an easier-to-
read output of group comparisons. The agricolae::HSD.test() function is quite use-
ful for this task, and its groupwise comparison table is very easy to read, adding
value to the analyses.
install.packages("agricolae", dependencies=TRUE)
library(agricolae)
agricolae::HSD.test(
aov(Pounds ~ Breed, data=LMilkLbBreedMgtFarm.tbl),# Model
trt="Breed", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Milk Production During Lactation by Dairy Breed:
Pounds")
# Wrap the agricolae::HSD.test() function around the
# Oneway ANOVA model obtained by using the aov()
# function. Select desired arguments, such as group,
# console, and alpha (e.g., p-value).
Breed, means
Pounds groups
HolsteinLb 25810.7 a
JerseyLb 18683.0 b
AyrshireLb 18650.3 b
GuernseyLb 17763.0 c
The mean for pounds of milk production per lactation for each of the four dairy
breeds is found in the numeric column, with production ranging from a mean of
25,810.7 pounds (Holstein) to a mean of 17,763.0 pounds (Guernsey). What is per-
haps especially helpful is the information in the groups column. Notice the use of
lower-case letters to distinguish between group membership:
• Holstein dairy cows (mean = 25,810.7 pounds) are marked as group a, and in this
example, there are no other breeds marked as group a – showing that Holstein
dairy cows were a group of their own in terms of pounds of milk production per
lactation. There is a statistically significant difference (p ≤ 0.05) in pounds of
milk production per lactation between Holstein dairy cows and all other breeds.
• Jersey dairy cows (mean = 18,683.0 pounds) and Ayrshire (mean = 18,650.3
pounds) dairy cows are both marked as group b. This designation indicates that
they were in common regarding pounds of milk production per lactation and in
more formal language it is appropriate to say that there was no statistically sig-
nificant difference (p ≤ 0.05) in pounds of milk production per lactation
between Jersey dairy cows and Ayrshire dairy cows.
• Guernsey dairy cows (mean = 17,763.0 pounds) are marked as group c, and in
this example, there are no other breeds marked as group c – showing that
Guernsey dairy cows were a group of their own in terms of pounds of milk pro-
duction per lactation. There is a statistically significant difference (p ≤ 0.05) in
pounds of milk production per lactation between Guernsey dairy cows and all
other breeds.
In terms of statistically significant differences (p ≤ 0.05), based on the data in
this example (which are from an N = 20 per breed teaching dataset and are therefore
not at all passed along as being representative of the industry), it can be said that: (a)
Holstein dairy cows produced more pounds of milk per lactation than (b) Jersey and
Ayrshire dairy cows, who produced more pounds of milk per lactation than (c)
Guernsey dairy cows, who produced the fewest pounds of milk per lactation.
It is cautioned, however, that no broad statements should be made from the figure
on pounds of milk production per lactation, or from the corresponding table, other
than what has been briefly stated immediately above. As a product
offered for sale, pounds of milk production per lactation is important, but it is also
important to consider many other contributions to potential profitability, such as
production costs by breed (e.g., larger cows have a higher housing cost than smaller
cows, larger cows will likely consume more feed than smaller cows, etc.) and milk
quality (e.g., percentage fat by breed and percent protein by breed). Biostatistics,
and especially the commercial applications of biostatistics, call for many complex
and often difficult-to-obtain sources of information.
Challenge: Going beyond analyses of pounds of milk production per lactation
overall and by Breed, organize data and syntax to accommodate analyses of pounds
of milk production per lactation by Management.
Challenge: Going beyond analyses of pounds of milk production per lactation
overall, by Breed, and by Management, organize data and syntax to accommodate
analyses of pounds of milk production per lactation by Farm.
Challenge: Going beyond the prior analyses of pounds of milk production per
lactation, organize data and syntax to examine possible differences and interactions
of all relevant variables in the dataset associated with milk production. Consider:
• Are there statistically significant differences (p ≤ 0.05) in Percent Fat by Breed?
• Are there statistically significant differences (p ≤ 0.05) in Percent Protein
by Breed?
• Are there statistically significant differences (p ≤ 0.05) in Percent Fat by
Management?
• Are there statistically significant differences (p ≤ 0.05) in Percent Protein by
Management?
• Are there statistically significant differences (p ≤ 0.05) in Percent Fat by Farm?
• Are there statistically significant differences (p ≤ 0.05) in Percent Protein
by Farm?
• What is the association (e.g., correlation) between Pounds of milk per lactation
and Percent Fat, by Percent Protein, etc.?
Challenge: For those with even greater interest in the use of statistics in data sci-
ence, structure the data and syntax to address these inquiries from a Twoway
ANOVA perspective. Address not only Pounds but also Percent Fat and Percent
Protein to determine outcomes by Breed, by Management, and by Farm, including
interactions of these measured variables between and among the grouping variables.
Much can be done with the original dataset, even though it is obviously a limited set
of data (N = 80 rows) designed for teaching purposes. A few hints on how to start
these analyses are displayed but be creative and go beyond this initial syntax
(Fig. 7.7).
TwowayPoundBMF <-
aov(Pounds ~ Management * Farm * Breed,
data=LMilkLbBreedMgtFarm.tbl)
# Twoway ANOVA for B(reed), M(anagement),
# F(arm) -- TwowayPoundBMF
summary(TwowayPoundBMF)
# Display the ANOVA table.
Df F value Pr(>F)
Management 1 105.05 0.000000000000004 ***
Farm 2 1.39 0.25651
Breed 3 3174.23 < 0.0000000000000002 ***
Management:Breed 3 7.84 0.00016 ***
Farm:Breed 6 0.27 0.94795
Residuals 64
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
agricolae::HSD.test(TwowayPoundBMF,"Breed", group=TRUE,
console=TRUE)
# Use Tukey’s HSD (Honestly Significant Difference) Mean
# Comparison Test to determine which breakouts are in
# common groups, and which breakout groups are not.
Breed, means
Pounds groups
HolsteinLb 25810.7 a
JerseyLb 18683.0 b
AyrshireLb 18650.3 b
GuernseyLb 17763.0 c
agricolae::HSD.test(TwowayPoundBMF,"Management",
group=TRUE, console=TRUE)
# Use Tukey’s HSD (Honestly Significant Difference)
# Mean Comparison Test.
Management, means
Pounds groups
Conventional 20567.6 a
Organic 19885.9 b
agricolae::HSD.test(TwowayPoundBMF,"Farm", group=TRUE,
console=TRUE)
# Use Tukey’s HSD (Honestly Significant Difference) Mean
# Comparison Test to determine which breakouts are in common
# groups, and which breakout groups are not.
Farm, means
Pounds groups
Lawson 20627.7 a
Skaggs 20507.5 a
Stanley 19936.2 b
Monroe 19835.6 b
Beautiful Graphics
Graphics (e.g., figures) are often prepared from three perspectives: graphics of
grouped data (e.g., a bar plot of median weight by identified race-ethnic breakout
groups), graphics of interval and real numbers (e.g., a density plot of weight for
each identified race-ethnic breakout group), and maps (e.g., a county-wide chorop-
leth map of each identified race-ethnic group by all identified Census Bureau
Tracts). Many figures have been prepared and displayed in this text. Review these
prior figures and the many figures listed and shown below, knowing that R supports
the production not only of these graphics (e.g., figures and maps) but the production
of many other types, too.
Grouped Data
• Bar Plot
• Mosaic Plot
• Waffle Plot
• Beanplot
• Beeswarm Plot
• Boxplot
• Density Plot
• Dotplot
• Histogram
• Line Chart
• Pirate Plot
• Quantile-Quantile (QQ) Plot
• Scatter Plot
• Scatter Plot Matrix
• Violin Plot
Beautiful Maps
• International
• National
• State
• County
• Sub-County
To support presentation of these graphics, create an abbreviated dataset for dem-
onstration purposes only. From among the many ways a dataset can be created
when using R, for this demonstration use the utils::read.table() function wrapped
around base::textConnection(). The data are part of a larger teaching dataset and are
inspired by actual data, but the data do not reflect Systolic Blood Pressure
measurements for actual patients.
#########################
# Abbreviated Code Book #
#########################
# L(ong)
# S(ystolic) B(lood) P(ressure), SBP
# G(ender)
# D(rug)
# R(aceEthnic)
In this dataframe, text instead of numeric codes was used for data relating to
Gender (Female and Male) and Race (Black, Hispanic, Other, and White). As a
purposeful contrast for this teaching dataset, numeric codes were used for data relat-
ing to Drug (1, 2, 3, 4, 5). Whole numbers were used to enter data for SBP, including
odd and even SBP measurements given the capabilities of contemporary digital
sphygmomanometers.
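A minimal sketch of the dataframe creation follows; the patient rows shown here are hypothetical placeholders used only to illustrate the approach, not the teaching dataset itself.
LSBPGDR.df <- utils::read.table(base::textConnection("
PatientID Gender Drug RaceEthnic SBP
P01 Female 1 Black 144
P02 Male 2 White 132
P03 Female 3 Hispanic 138
P04 Male 4 Other 117
P05 Female 5 White 126
"), header=TRUE, stringsAsFactors=FALSE)
# Hypothetical rows shown only to illustrate the
# read.table(textConnection()) approach; the full
# teaching dataset contains many more patients.
utils::str(LSBPGDR.df)
# Confirm the structure of the dataframe.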
base::colnames(LSBPGDR.df) <-
c("Patient", "Gender", "Drug", "Race", "SBP")
# Although somewhat redundant, clearly identify the name
# for each column. Note how PatientID was changed to
# Patient and how RaceEthnic was changed to Race, largely
# to reinforce how column names can be changed easily, if
# desired for later presentation purposes.
Look again at the output of the utils::str() function and see why the following
data for Drug and SBP need to be put into another format and then observe how this
is accomplished, using functions such as factor() and as.numeric(). Further, for the
object variable LSBPGDR.df$Drug, apply a label so that numeric codes such as 1, 2,
3, 4, and 5 are instead presented as Drug A, Drug B, Drug C, Drug D, and Placebo:
• Transform the object variable Gender from character to factor.
• Transform the object variable Drug from integer to factor.
• Transform the object variable Race from character to factor.
• Transform the object variable SBP from integer to numeric.
LSBPGDR.df$Gender <-
factor(LSBPGDR.df$Gender,
labels=c("Female", "Male"))
levels(LSBPGDR.df$Gender)
summary(LSBPGDR.df$Gender)
# Transform the object variable Gender from a
# character-type object variable to a factor-type
# object variable and provide meaningful text to
# describe the nature of each code.
LSBPGDR.df$Drug <-
factor(LSBPGDR.df$Drug,
labels=c("Drug A", "Drug B", "Drug C", "Drug D",
"Placebo"))
levels(LSBPGDR.df$Drug)
summary(LSBPGDR.df$Drug)
# Transform the object variable Drug from an
# integer-type object variable to a factor-type
# object variable and provide meaningful text to
# describe the nature of each numeric code.
Fig. 7.7
LSBPGDR.df$Race <-
factor(LSBPGDR.df$Race,
labels=c("Black", "Hispanic", "Other", "White"))
levels(LSBPGDR.df$Race)
summary(LSBPGDR.df$Race)
# Transform the object variable Race from a
# character-type object variable to a factor-type
# object variable and provide meaningful text to
# describe the nature of each code.
LSBPGDR.df$SBP <-
as.numeric(LSBPGDR.df$SBP)
summary(LSBPGDR.df$SBP)
# Transform the object variable SBP from an
# integer-type object variable to a numeric-type
# object variable.
Race SBP
Black :22 Min. : 99
Hispanic:19 1st Qu.:120
Other :20 Median :137
White :23 Mean :132
3rd Qu.:139
Max. :158
Give special attention to the base::summary() function output for Drug and how
it is now quite different since it was transformed from a set of numeric codes to
factor-type breakouts, using meaningful text to identify the different drugs used in
this study.
Bar Plot
Use the tidyverse ecosystem to prepare a bar plot of mean Systolic Blood Pressure
(SBP) by Race:
• Use the dplyr package to gain a sense of descriptive statistics.
• Use the forcats package to address ordering of the selected variable.
• Use the ggplot2 package to generate the figure, where the values for mean SBP
by Race are in descending order.
Add additional value to the figure by flipping (e.g., transposing) the X-axis and
the Y-axis, making it much easier to see the transition in mean SBP by Race
(Fig. 7.8).
RaceLSBPGDRDescriptives <-
LSBPGDR.df %>%
dplyr::group_by(Race) %>%
dplyr::summarize(
N = base::length(SBP),
Minimum = base::min(SBP, na.rm=TRUE),
Median = stats::median(SBP, na.rm=TRUE),
Mean = base::mean(SBP, na.rm=TRUE),
SD = stats::sd(SBP, na.rm=TRUE),
Maximum = base::max(SBP, na.rm=TRUE),
Missing = base::sum(is.na(SBP))
)
RaceLSBPGDRDescriptives
Fig. 7.8
# A tibble: 4 × 8
Race N Minimum Median Mean SD Maximum Missing
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Other 20 99 116 118. 12.5 139 0
2 White 23 108 135 129. 10.9 139 0
3 Hispanic 19 128 138 137. 3.83 145 0
4 Black 22 128 144 144. 8.48 158 0
LSBPGDR.df$Race <- forcats::fct_reorder(LSBPGDR.df$Race,
LSBPGDR.df$SBP, mean, na.rm=TRUE)
# Use the forcats::fct_reorder() function.
BarPlot.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(x=Race, y=SBP)) +
stat_summary(fun.data=mean_sdl, geom="bar") +
coord_flip() +
labs(
title=
"Bar Plot: Mean Systolic Blood Pressure (SBP) and Race",
subtitle=
"Order: Descending Mean SBP by Race ",
x="Race\n", y="\nSystolic Blood Pressure (SBP)") +
# Remember – output has been flipped.
scale_y_continuous(limits=c(0, 160),
breaks=scales::pretty_breaks(n=10)) +
theme_Mac()
# Fig. 7.8
par(ask=TRUE); BarPlot.fig
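The mosaic plot below uses the geom_mosaic() and product() functions, which come from the ggmosaic package rather than from ggplot2 itself. Assuming ggmosaic has not yet been installed and loaded:
install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)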
MosaicPlot.fig <-
ggplot2::ggplot(data=LSBPGDR.df) +
geom_mosaic(aes(x = product(Drug), fill = Race)) +
labs(title="Mosaic Plot: Drug and Race",
subtitle="Representation: Headcount by Groups",
x="\nDrug", y="Race\n") +
theme_Mac() +
theme(legend.title=element_blank())
# Fig. 7.9
par(ask=TRUE); MosaicPlot.fig
install.packages("waffle", dependencies=TRUE)
library(waffle)
Fig. 7.9
Using the janitor package, determine the frequency distribution of the identified
factor-type variables, or LSBPGDR.df$Drug in this example. This breakout infor-
mation is the basis for a Waffle Plot.
FrequencyDrug <-
LSBPGDR.df %>%
janitor::tabyl(Drug,
show_na=TRUE,
show_missing_levels=TRUE) %>%
janitor::adorn_pct_formatting(digits=2)
# End the chain here so that FrequencyDrug is assigned,
# then print the completed object.
base::print(FrequencyDrug, n=99)
Drug n percent
Drug A 16 19.05%
Drug B 16 19.05%
Drug C 16 19.05%
Drug D 16 19.05%
Placebo 20 23.81%
Prepare an object that includes name(s) of breakouts and the frequency of each
(Fig. 7.10).
Fig. 7.10
DrugBreakouts <- c(
'Drug A' = 16,
'Drug B' = 16,
'Drug C' = 16,
'Drug D' = 16,
'Placebo' = 20)
DrugBreakouts.fig <-
waffle::waffle(DrugBreakouts, rows=8,
title=
"Waffle Plot (e.g., Square Pie Chart): Drug Breakouts") +
theme_Mac() +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank()
)
# Much more can be done with this Waffle Plot, such
# as the addition of labels. Experiment with these
# and other embellishments, for this figure and all
# other figures in this section.
# Fig. 7.10
par(ask=TRUE); DrugBreakouts.fig
Beanplot
When viewing the beanplot, know that there are many who see the beanplot and the
violin plot as being the same general figure. Of course, others have a different opin-
ion, thus the reason why both figures are shown in this section (Fig. 7.11).
install.packages("beanplot", dependencies=TRUE)
library(beanplot)
BeanplotRaceSBP.fig <-
beanplot::beanplot(SBP ~ Race,
data=LSBPGDR.df,
main=
"Beanplot: Race and Systolic Blood Pressure (SBP)",
show.names=TRUE)
# Much more can be done to expand on the beanplot as a
# useful figure. Read the online documentation to see
# arguments supported by the beanplot::beanplot()
# function.
# Fig. 7.11
par(ask=TRUE); BeanplotRaceSBP.fig
Along with the beanplot as a figure, notice the statistics printed on the screen.
These statistics can be very useful. Take time to review their meaning and applica-
tion to fully understand outcomes between and among breakout groups.
Fig. 7.11
$bw
[1] 3.09916
$wd
[1] 8.14491
$names
[1] "Other" "White" "Hispanic" "Black"
$stats
[1] 117.550 128.739 136.895 144.091
$overall
[1] 131.94
$log
[1] ""
$ylim
[1] 89.7025 167.2975
$xlim
[1] 0.5 4.5
install.packages("ggbeeswarm", dependencies=TRUE)
library(ggbeeswarm)
BeeswarmRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(Race, SBP, col=Race)) +
geom_beeswarm(size=2.5) +
labs(title=
"Beeswarm Plot: Race and Systolic Blood Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_y_continuous(labels=scales::comma,
limits=c(0, 280),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.12
par(ask=TRUE); BeeswarmRaceSBP.fig
Fig. 7.12
BoxplotRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(Race, SBP, col=Race)) +
geom_boxplot(outlier.color="red",
outlier.shape=2, lwd=0.75) +
stat_boxplot(geom='errorbar', lwd=0.75) +
labs(title=
"Boxplot: Race and Systolic Blood Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_y_continuous(labels=scales::comma,
limits=c(075, 175),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.13
par(ask=TRUE); BoxplotRaceSBP.fig
Fig. 7.13
DensityRaceSBP.fig <-
ggplot2::ggplot(LSBPGDR.df, aes(x=SBP)) +
geom_density(color="blue", lwd=1.25) +
geom_vline(aes(xintercept=mean(SBP)), color="red",
linetype="dotted", size=0.75) +
facet_grid(cols=vars(Race)) +
labs(title=
"Density Plot: Race and Systolic Blood Pressure (SBP)",
subtitle="Dotted Line Represents Mean",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
theme_Mac() +
theme(legend.title=element_blank()) +
theme(strip.text=element_text(face="bold")) +
theme(axis.text.x=element_text(face="bold",
size=11, hjust=0.5, vjust=0.5, angle=45)) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.14
par(ask=TRUE); DensityRaceSBP.fig
Fig. 7.14
Fig. 7.15
DotplotRaceSBP.fig <-
ggplot2::ggplot(LSBPGDR.df, aes(x=Race, y=SBP,
fill=Race)) +
geom_dotplot(binaxis='y', stackdir='center') +
labs(title=
"Dot Plot: Race and Systolic Blood Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_y_continuous(labels=scales::comma,
limits=c(075, 175),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title= element_blank())
# Fig. 7.15
par(ask=TRUE); DotplotRaceSBP.fig
HistogramRaceSBP.fig <-
ggplot2::ggplot(LSBPGDR.df, aes(x=SBP, fill=Race)) +
geom_histogram(color="black", lwd=0.75) +
geom_vline(aes(xintercept=mean(SBP)), color="red",
linetype="dotted", size=0.75) +
geom_vline(aes(xintercept=median(SBP)),
color="darkgreen", linetype="dashed", size=1.25) +
facet_grid(cols=vars(Race)) +
labs(title=
"Histogram: Race and Systolic Blood Pressure (SBP)",
subtitle=
"Mean - Red Dotted Line and Median - Green Dashed Line",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_x_continuous(labels=scales::comma,
limits=c(100, 175),
breaks=scales::pretty_breaks(n=5)) +
scale_y_continuous(labels=scales::comma,
limits=c(0, 12),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title=element_blank()) +
theme(strip.text=element_text(face="bold")) +
theme(axis.text.x=element_text(face="bold",
size=11, hjust=0.5, vjust=0.5, angle=45)) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.16
par(ask=TRUE); HistogramRaceSBP.fig
Fig. 7.16
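The two line charts below use theme_tufte() and a modified theme_wsj(); both themes come from the ggthemes package, not from ggplot2 itself. Assuming ggthemes has not yet been installed and loaded:
install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)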
LineChartStaticRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(x=Race, y=SBP)) +
geom_line() +
geom_point(size=2.5) +
labs(
title=
"Line Chart - Static: Race and Systolic Blood Pressure (SBP)",
subtitle="Use of theme_tufte()",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
theme_tufte()
# As opposed to use of theme_Mac(), theme_tufte() was
# instead used for this figure. There are those who
# argue that the simplicity of presentations gained
# by the use of theme_tufte() is an advantage,
# compelling readers to give careful attention to
# outcomes -- based on the idea that such simplicity
# is more demanding than the use of large and bold
# fonts, vibrant colors, etc. As interested, look
# into the many publications by Edward Tufte to learn
# more about this approach to preparation of slides
# and other materials.
# Fig. 7.17
par(ask=TRUE); LineChartStaticRaceSBP.fig
Create a dataframe that addresses the issue of change over time (e.g., Year, 2019 to
2023). Use the base::data.frame() function to create the dataset, simply to show
another way of creating demonstration-type ad hoc datasets when using R in an
interactive fashion.
Fig. 7.17
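A minimal sketch of the dataframe creation follows; the Region names and Yield values are hypothetical placeholders used only so that the syntax below can be run.
RegionYearYield.df <- base::data.frame(
Region = base::rep(c("North", "South", "East", "West"),
each=5),
Year = base::rep(2019:2023, times=4),
Yield = c(152, 148, 160, 157, 163,
141, 139, 145, 150, 149,
133, 137, 140, 138, 142,
128, 131, 135, 139, 144))
# Hypothetical yields for four regions over the years
# 2019 to 2023, in long format.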
utils::str(RegionYearYield.df)
RegionYearYield.df$Region <-
as.factor(RegionYearYield.df$Region)
RegionYearYield.df$Year <-
as.factor(RegionYearYield.df$Year)
RegionYearYield.df$Yield <-
as.numeric(RegionYearYield.df$Yield) # Redundant
utils::str(RegionYearYield.df)
base::attach(RegionYearYield.df)
RegionYearYield.df %>% {
base::rbind(utils::head(., 3),
utils::tail(., 3) )} %>%
print()
# Although not an enumerated function, use the
# base::rbind() function to print both the head
# and tail of the enumerated dataset.
Fig. 7.18
LineChartMultipleRegionYearYield.fig <-
ggplot2::ggplot(data=RegionYearYield.df,
aes(x=Year, y=Yield, group=Region, color=Region)) +
geom_line(lwd=2) +
geom_point(size = 5, shape = 21, col="Black",
aes(fill=Region)) +
labs(
title= "Line Chart - Multiple: Yield by Region
and by Year: 2019 to 2023",
subtitle="Use of modified theme_wsj()",
x="\nYear", y="Yield\n") +
theme_wsj() +
theme(
plot.title=element_text(hjust=0.5),
plot.subtitle=element_text(hjust=0.5, face="bold")) +
# Center the title, subtitle and put the subtitle
# in bold.
theme(legend.title=element_blank()) +
# Do not print a legend title.
theme(legend.position="bottom") +
# Put the legend at the bottom.
theme(axis.title=element_text(size=12, face="bold"))
# Make labels bold for the X axis the Y axis.
# R supports many themes and if desired, modification
# of themes. Compare the vast differences between
# theme_tufte() and theme_wsj(), with the above changes
# to theme_wsj() and without changes.
# Fig. 7.18
par(ask=TRUE); LineChartMultipleRegionYearYield.fig
Pirate Plot
The Pirate Plot, found in the yarrr package (as in, talk like a pirate), works and plays
well with the ggplot2 package. It is not used regularly, but it may be of value to those
who take time to explore its many possibilities. To learn more, key the syntax
vignette("pirateplot", package="yarrr") at the R prompt. Notice how this function
can also produce descriptive statistics (Fig. 7.19).
install.packages("yarrr", dependencies=TRUE)
library(yarrr)
PiratePlotRaceSBP.fig <-
yarrr::pirateplot(formula=SBP ~ Race, data=LSBPGDR.df,
main=
"Pirate Plot: Race and Systolic Blood Pressure (SBP)",
theme=2,
# The argument theme is specific to this function.
plot=TRUE)
# Fig. 7.19
par(ask=TRUE); PiratePlotRaceSBP.fig
Fig. 7.19
PiratePlotRaceSBP.descriptives <-
yarrr::pirateplot(formula=SBP ~ Race, data=LSBPGDR.df,
plot=FALSE)
PiratePlotRaceSBP.descriptives
$summary
Race bean.num n avg inf.lb inf.ub
1 Other 1 20 117.550 111.564 123.164
2 White 2 23 128.739 124.216 133.435
3 Hispanic 3 19 136.895 134.963 138.613
4 Black 4 22 144.091 140.570 148.151
$avg.line.fun
[1] "mean"
$inf.method
[1] "hdi"
$inf.p
[1] 0.95
QQPlotRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df, aes(sample=SBP)) +
stat_qq() +
stat_qq_line() +
facet_grid(cols=vars(Race)) +
labs(title=
"Quantile-Quantile (QQ) Plot: Race and Systolic Blood
Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
theme_Mac() +
theme(legend.title=element_blank()) +
theme(strip.text=element_text(face="bold")) +
theme(axis.text.x=element_text(face="bold",
size=11, hjust=0.5, vjust=0.5, angle=00)) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# QQ plots are often prepared for private use and are
# often excluded from publications or presentations,
# since they are coupled with statistical tests for
# normality, such as the Anderson-Darling Test or the
# Shapiro Test. Thus, there are few embellishments
# to this figure.
# Fig. 7.20
par(ask=TRUE); QQPlotRaceSBP.fig
Scatter Plot
Make a teaching dataframe with three numeric variables. These variables relate to
different body measurements (e.g., WeightLb, HeightIn, and WaistIn) and will be
used to visually display the concept of association (e.g., correlation), the singular
correlation(s) of X v Y (e.g., Weight v Height, Weight v Waist, and Height v Waist)
and then use of a correlation matrix to address X v Y v Z and its many permutations
(e.g., Weight v Height v Waist) (Figs. 7.21, 7.22, and 7.23).
Fig. 7.20
Fig. 7.21
Fig. 7.22
Fig. 7.23
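A minimal sketch of the teaching dataframe follows; the measurement values are hypothetical placeholders used only so that the syntax below can be run.
WeightHeightWaist.df <- base::data.frame(
WeightLb = c(135L, 150L, 160L, 175L, 185L, 198L, 210L, 220L),
HeightIn = c( 62L, 64L, 66L, 68L, 69L, 70L, 72L, 74L),
WaistIn = c( 28L, 30L, 32L, 34L, 36L, 37L, 40L, 42L))
# Hypothetical body measurements, entered as integers so
# that the conversion to numeric format below has a
# purpose.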
utils::str(WeightHeightWaist.df)
WeightHeightWaist.df$WeightLb <-
as.numeric(WeightHeightWaist.df$WeightLb)
WeightHeightWaist.df$HeightIn <-
as.numeric(WeightHeightWaist.df$HeightIn)
WeightHeightWaist.df$WaistIn <-
as.numeric(WeightHeightWaist.df$WaistIn)
# Currently in integer format, strive to put all
# three object variables into numeric format.
utils::str(WeightHeightWaist.df)
base::attach(WeightHeightWaist.df)
stats::cor.test(HeightIn, WeightLb,
data=WeightHeightWaist.df, method="pearson")
ScatterPlotHeightWeight.fig <-
ggplot2::ggplot(data=WeightHeightWaist.df) +
geom_point(aes(x=HeightIn, y=WeightLb), size=3,
color="red", fill="black", shape=18) +
geom_smooth(aes(x=HeightIn, y=WeightLb), method=loess,
linetype="dashed", color="black", fill="blue") +
labs(title=
"Scatter Plot (Points and Smooth): Height v Weight",
x="\nHeight (Inches)", y="Weight (Pounds)\n") +
scale_y_continuous(labels=scales::comma, limits=c(0,
300), breaks=scales::pretty_breaks(n = 5)) +
annotate("text", x=60, y=250, fontface="bold", size=06,
color="black", hjust=0, family="mono",
label="Pearson's r = 0.800667") +
theme_Mac()
# Fig. 7.21
par(ask=TRUE); ScatterPlotHeightWeight.fig
stats::cor.test(HeightIn, WaistIn,
data=WeightHeightWaist.df, method="pearson")
ScatterPlotHeightWaist.fig <-
ggplot2::ggplot(data=WeightHeightWaist.df) +
geom_point(aes(x=HeightIn, y=WaistIn), size=3,
color="red", fill="black", shape=18) +
geom_smooth(aes(x=HeightIn, y=WaistIn), method=loess,
linetype="dashed", color="black", fill="blue") +
labs(title=
"Scatter Plot (Points and Smooth): Height v Waist",
x="\nHeight (Inches)", y="Waist (Inches)\n") +
scale_y_continuous(labels=scales::comma, limits=c(0,
60), breaks=scales::pretty_breaks(n = 5)) +
annotate("text", x=60, y=250, fontface="bold", size=06,
par(ask=TRUE); ScatterPlotHeightWaist.fig
stats::cor.test(WeightLb, WaistIn,
data=WeightHeightWaist.df, method="pearson")
ScatterPlotWeightWaist.fig <-
ggplot2::ggplot(data=WeightHeightWaist.df) +
geom_point(aes(x=WeightLb, y=WaistIn), size=3,
color="red", fill="black", shape=18) +
theme_Mac()
# theme_Mac() is an assumed completion so the plotting
# chain is syntactically complete; add geom_smooth() and
# labs() as in the prior scatter plots.
# Fig. 7.23
par(ask=TRUE); ScatterPlotWeightWaist.fig
From among many creative ways that could be used, create a scatterplot matrix
using the GGally::ggpairs() function (Fig. 7.24).
install.packages("GGally", dependencies=TRUE)
library(GGally)
ScatterPlotMatrixWeightHeightWaist.fig <-
GGally::ggpairs(WeightHeightWaist.df) +
labs(title=
"Scatter Plot Matrix: Weight, Height, and Waist")
# The title text here is an assumption.
Fig. 7.24
par(ask=TRUE); ScatterPlotMatrixWeightHeightWaist.fig
Violin Plot
Review the prior comments on how a beanplot is like a violin plot (Fig. 7.25).
Fig. 7.25
ViolinPlotRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(Race, SBP, col=Race, fill=Race)) +
geom_violin(size=2.5) +
labs(title=
"Violin Plot: Race and Systolic Blood Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_y_continuous(labels=scales::comma,
limits=c(90, 160),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title=element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.25
par(ask=TRUE); ViolinPlotRaceSBP.fig
4
Consider a map of the United Kingdom of Great Britain and Northern Ireland (UK). Should this
map include only England, Northern Ireland, Scotland, and Wales? How would the Republic of
Ireland show on this map, given how it is part of the same land mass as the land mass for Northern
Ireland? Then, add to this complexity the Channel Islands such as the Bailiwick of Guernsey and
the Bailiwick of Jersey. Should these two entities be included in a map of the UK? Should the Isle
of Man also show on the map? Should the British Virgin Islands, the Falkland Islands, Gibraltar,
and other British Overseas Territories show on the map? What about the Chagos Archipelago?
Should Rockall be included? The complexity of maps goes far beyond the use of R or any other
software for their creation.
International
From among many possible resources, use the maps::map() function to create a
world map. As created, the map shows basic geo-political boundaries, but detail is
limited and the resolution at which small geo-political entities can be viewed is
also limited.5 Use this map as a starting point for mapping (Fig. 7.26).
WorldMap <-
maps::map("world", fill=TRUE, col="blue")
title("World Map")
# Fig. 7.26
par(ask=TRUE); WorldMap
Challenge: The current world map shows geographic boundaries, but otherwise
there are no data associated with the map. Use the created world map (or perhaps a
world map from another source) and the tidyverse ecosystem to produce a geographic
entity-by-entity map-based choropleth of COVID-19 total cases per million, as of
the creation date for the dataset:
• Obtain a list of ISO (International Organization for Standardization) codes for
each entity assigned an ISO code. The ISO online browsing platform (https://
www.iso.org/obp/) is one possible resource, where codes are provided as: English
short name, French short name, Alpha-2 code, Alpha-3 code, and Numeric.
Fig. 7.26
5 Is it possible to distinguish the borders for Luxembourg in this map?
However, there are many resources that provide these codes for corresponding
geographic entities.
• Obtain longitude and latitude data for each geographic entity. There are many
resources for this task, but those who use Kaggle may find it convenient to use
the data obtained from https://www.kaggle.com/datasets/bohnacker/country-
longitude-latitude, at least as a starting point for this challenge.
• Finally, obtain the most current COVID-19 data, as provided by Our World in
Data (https://github.com/owid/covid-19-data/blob/master/public/data/latest/
owid-covid-latest.csv) and use the variable total_cases_per_million, in an effort
to have data that allows meaningful comparisons between and among identified
geographic entities. The object variable marked iso_code is likely the resource
that will be used to join files as the final dataset is prepared, possibly after some
degree of data wrangling.
It will be a task to use the tidyverse ecosystem and create one unified final dataset.
There are many steps to obtaining the data, reviewing the data, cleaning the
data, possibly joining the data to other datasets, etc. After these tasks are successfully
completed, it should then be simple to produce the world map choropleth.6 By
this point, nearing the end of this text, those with interest should be able to rise to
this challenge. There have been adequate pointers and examples for this challenge,
and one possible starting sketch is shown below. Reactions to these challenges are
part of the many day-to-day inquiries data scientists face.
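As a starting point only, the sketch below outlines one way the pieces might fit together with the tidyverse ecosystem. The file names, the country-name column in the longitude/latitude lookup file, and the join keys are assumptions that must be checked against the actual downloads; country-name mismatches (e.g., "USA" versus "United States") will almost certainly require some wrangling before the joins succeed.

# A minimal sketch, not a complete solution; file names and join keys are
# assumptions and should be adjusted after reviewing the actual data.
library(tidyverse)   # readr, dplyr, ggplot2, etc.

owid.df <- readr::read_csv("owid-covid-latest.csv")           # OWID latest data
iso.df  <- readr::read_csv("country-longitude-latitude.csv")  # assumed lookup file with
                                                              # country name and iso_code

world.df <- ggplot2::map_data("world") %>%                    # polygons from maps::map("world")
  dplyr::left_join(iso.df,  by = c("region" = "country")) %>% # assumed key: country name
  dplyr::left_join(owid.df, by = "iso_code")                  # OWID key: iso_code

ggplot2::ggplot(data=world.df,
  aes(x=long, y=lat, group=group, fill=total_cases_per_million)) +
  geom_polygon(color="gray70") +
  scale_fill_viridis_c(labels=scales::comma, na.value="white") +
  labs(title="COVID-19 Total Cases per Million",
       fill="Cases per Million") +
  theme_void()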
National
There are also many resources that can be used for national maps. Detailed
information on mapping states, counties, and divisions within counties is gained by
carefully reviewing the many materials related to Federal Information Processing
System (FIPS) codes, as provided by the Census Bureau and other government
offices.7,8
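As one small aid (an illustration using the usmap package, which is not itself part of the Census Bureau materials), helper functions are available for quick FIPS lookups:

# Assumes the usmap package is installed.
usmap::fips(state = "New Jersey")                     # state FIPS code ("34")
usmap::fips(state = "New Jersey", county = "Bergen")  # county FIPS code ("34003")
usmap::fips_info(c("01", "56"))                       # reverse lookup of state FIPS codes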
Once again, use the maps::map() function to create a map, but now of the United
States, outlining all 3,000-plus counties. After the map is created and as skills and
interest allow, respond to the Challenge outlined below. Review materials posted in
the earlier parts of this text to see the many complexities of FIPS codes: Why do
state-level FIPS codes range from 01 to 56 since there are 50 states, not 56? How do
county-level FIPS codes accommodate the multiple use of common names for
counties named Washington, Franklin, Jefferson, etc. (Fig. 7.27)?
6 Notice how data may not be available for all geographic entities, or there may be concerns
about the efficacy of some data.
7 Review Federal Information Processing System (FIPS) Codes for States and Counties,
https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt, for state-by-state and
county-by-county FIPS codes.
8 Review materials such as ZIP Code Tabulation Areas (ZCTAs) (https://www.census.gov/
programs-surveys/geography/guidance/geo-areas/zctas.html) to learn about the way United
States Postal Service ZIP Codes are accommodated when working with output gained from
the Census Bureau. Census Bureau ZCTAs seem to be similar to Postal Service ZIP Codes,
but not quite.
Fig. 7.27
USNationalMap <-
maps::map("county", fill=TRUE, col="cyan")
title("United States National Map of Counties")
# Fig. 7.27
par(ask=TRUE); USNationalMap
Challenge: The current United States county map shows geographic boundaries,
but otherwise there are no data associated with the map. Use the created county map
and the tidyverse ecosystem to produce a county-by-county choropleth of COVID-19
cases per 100,000, for all cases marked as date_updated February 24, 2022 (N = 3220
rows of data):
• Obtain the data file United_States_COVID-19_Community_Levels_by_County.
csv, associated with United States COVID-19 Community Levels by County
(https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-
Community-Levels-by-County/3nnm-4jni), provided by the Centers for Disease
Control and Prevention.
• Notice how FIPS codes are provided for each county along with other relevant
county-wide information, running over multiple dates (2/24/2022 to 5/11/2023,
as of the time this lesson was prepared). Accordingly, if these data are acceptable,
it is not necessary to obtain any files other than those already available. A sketch
of one possible approach is shown after this list.
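As a starting point only, the sketch below shows how the CDC file might be read, filtered to the February 24, 2022 update, and mapped county by county. The column names (county_fips, date_updated, covid_cases_per_100k) and the date format are assumptions to be verified against the downloaded file.

# A minimal sketch under assumed column names; preserve leading zeros in the
# county FIPS codes by reading them as character values.
library(tidyverse)
library(usmap)

CountyLevels.df <- readr::read_csv(
  "United_States_COVID-19_Community_Levels_by_County.csv",
  col_types = readr::cols(county_fips = readr::col_character())) %>%
  dplyr::filter(date_updated == "2022-02-24") %>%          # assumed date format
  dplyr::select(fips = county_fips, covid_cases_per_100k)  # plot_usmap() expects a fips column

usmap::plot_usmap(regions = "counties", data = CountyLevels.df,
  values = "covid_cases_per_100k", color = "gray80") +
  scale_fill_continuous(low = "white", high = "blue",
    name = "Cases per 100,000", labels = scales::comma) +
  labs(title = "COVID-19 Cases per 100,000 by County: February 24, 2022") +
  theme(legend.position = "right")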
State
Following along with the use of federal data as a prime data resource, review the
many materials associated with Surveillance Data from the Lyme Disease Data
Dashboard, as provided by the Centers for Disease Control and Prevention, https://
www.cdc.gov/lyme/datasurveillance/surveillance-data.html?CDC_AA_
refVal=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fwww.cdc.gov%2Flyme%2Fdatasurveillance%2Frec
ent-surveillance-data.html. Read the Limitations section and give special attention
to the statement, “Under-reporting and misclassification are features common to all
surveillance systems.”
After reviewing this resource, as much for process and presentation as well as the
provided materials, go to https://www.cdc.gov/lyme/datasurveillance/surveillance-
data.html and download the file at Lyme Disease Incidence Rates by State or
Locality [XLS – 4 KB], Lyme_Disease_Incidence_Rates_by_State_or_Locality.
csv, which is also available at the publisher’s Web site associated with this text.
Challenge: Use the tidyverse ecosystem and associated tools to adjust the dataset.
The many actions needed to adjust the dataset are not shown in this text. By this
part of the text, those with an interest should be able to put the data into usable
format, but the main tasks should produce an adjusted dataset that models what is
seen immediately below, where the emphasis is on entities with a State FIPS code
(remove the summary row marked US Incidence) and 2019 incidence data:
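One possible wrangling sketch is shown below. The column names used here (a State column and a 2019 incidence column) are assumptions, so inspect the downloaded file first (e.g., with utils::str() or dplyr::glimpse()) and adjust accordingly.

# A minimal wrangling sketch under assumed column names.
library(tidyverse)

LymeDisease2019.df <- readr::read_csv(
  "Lyme_Disease_Incidence_Rates_by_State_or_Locality.csv") %>%
  dplyr::filter(State != "US Incidence") %>%  # drop the national summary row
  dplyr::transmute(state = State,             # usmap::plot_usmap() can join on a state column
                   incidence = `2019`)        # keep the 2019 incidence values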
base::attach(LymeDisease2019.df)
utils::str(LymeDisease2019.df)
LymeDisease2019.df %>% {
base::rbind(utils::head(., 3),
utils::tail(., 3) )} %>%
print()
# Although not an enumerated function, use the
# base::rbind() function to print both the head
# and tail of the enumerated dataset.
With all of these actions completed, use the usmap::plot_usmap() function and the
syntax shown below, at least as a model, to produce a choropleth of all states and the
incidence of Lyme disease in 2019. Note how the syntax works and plays well with
the ggplot2 package and by extension the tidyverse ecosystem. Does the map
communicate to public health personnel and the public that Lyme disease is, at least
for 2019, a concern for the Northeast and Upper Midwest, but of minimal concern in
other regions (Fig. 7.28)?
Fig. 7.28
LymeDisease2019Map <-
usmap::plot_usmap(data = LymeDisease2019.df,
values="incidence", color="red") +
labs(
title=
"Incidence of Lyme Disease by State: 2019",
subtitle=
"Review the Interplay of Borrelia burgdorferi, Borrelia
mayonii, Ticks, Wild Mammals, and Humans
for Lyme Disease") +
scale_fill_continuous(low="white", high="blue",
name="Incidence", label=scales::comma) +
theme_map() +
theme(
plot.title=element_text(hjust=0.5),
plot.subtitle=element_text(hjust=0.5, face="bold")) +
# Center the title, subtitle and put the subtitle
# in bold.
theme(legend.position="right")
# Put the legend at the right.
# Fig. 7.28
par(ask=TRUE); LymeDisease2019Map
County
Using data available from the Centers for Disease Control and Prevention, look
once again at Surveillance Data, but focus now on United States COVID-19
Community Levels by County, https://data.cdc.gov/Public-Health-Surveillance/
United-States-COVID-19-Community-Levels-by-County/3nnm-4jni. Use the
Export button to download the associated dataset, marked as United_States_
COVID-19_Community_Levels_by_County.csv and made available at the
publisher’s Web site associated with this text.
Challenge: Use the tidyverse ecosystem and associated tools to adjust the dataset.
The many actions needed to adjust the dataset are not shown in this text (although
one possible wrangling sketch is offered after the footnote below). By this part of
the text, those with an interest should be able to put the data into usable format, but
the main tasks should produce an adjusted dataset that models what is seen
immediately below, where the selections, deletions, adjustments, etc. result in the
following9 (Fig. 7.29):
• State: New Jersey
• Counties: All 21 counties
• Date Updated: 4/7/2022 (April 07, 2022)
• Variable: covid_cases_per_100k
Fig. 7.29
9 Look at the accommodation that was needed for the county named Cape May. What is the issue?
From many possible ways to approach this issue, what tidyverse tool is best for this
accommodation?
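As a starting point only, the wrangling might resemble the sketch below. The column names (state, county, county_fips, county_population, covid_cases_per_100k, date_updated) and the date format are assumptions to be verified against the downloaded CDC file, and the Cape May accommodation noted in the footnote above still needs to be addressed.

# A minimal wrangling sketch under assumed column names.
library(tidyverse)

NJCountiesCovid19Feb242022.df <- readr::read_csv(
  "United_States_COVID-19_Community_Levels_by_County.csv",
  col_types = readr::cols(county_fips = readr::col_character())) %>%
  dplyr::filter(state == "New Jersey",
                date_updated == "2022-04-07") %>%  # assumed date format
  dplyr::select(fips = county_fips, county,
                county_population, covid_cases_per_100k)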
base::attach(NJCountiesCovid19Feb242022.df)
utils::str(NJCountiesCovid19Feb242022.df)
NJCountiesCovid19Feb242022.df %>% {
base::rbind(utils::head(., 5),
utils::tail(., 5) )} %>%
print()
# Although not an enumerated function, use the
# base::rbind() function to print both the head
# and tail of the enumerated dataset.
NJCountiesCovid19Feb242022Map <-
usmap::plot_usmap(data = NJCountiesCovid19Feb242022.df,
regions="state",
include="NJ",
values="covid_cases_per_100k",
color="red") +
labs(
title=
"COVID-19 Cases per 100,000 Population by New Jersey County",
subtitle=
"Data Updated April 07, 2022") +
scale_fill_continuous(low="white", high="blue",
name="Cases per 100,000", label=scales::comma) +
theme_map() +
theme(
plot.title=element_text(hjust=0.5),
plot.subtitle=element_text(hjust=0.5, face="bold")) +
# Center the title, subtitle and put the subtitle
# in bold.
theme(legend.position="right")
# Put the legend at the right.
# Fig. 7.29
par(ask=TRUE); NJCountiesCovid19Feb242022Map
Bonus Challenge: Look not only at the New Jersey by county choropleth map of
COVID-19 cases per 100,000 but also go back to the adjusted dataset, focusing on
the relationship (if any) between county population and COVID-19 cases per
100,000. Comparing these two object variables, what parts of the state seem to have
the greatest incidence of COVID-19 (e.g., cases per 100,000) and what parts of the
state seem to have the least?
Question: Using the dataset named NJCountiesCovid19Feb242022.df, is there
any association between COVID-19 cases per 100,000 and county population
(Fig. 7.30)?
stats::cor.test(
as.numeric(
NJCountiesCovid19Feb242022.df$county_population),
as.numeric(
NJCountiesCovid19Feb242022.df$covid_cases_per_100k),
method="pearson")
p-value = 0.000249
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.414486 0.877537
sample estimates:
cor
0.717748
Fig. 7.30
stats::cor.test(
as.numeric(
NJCountiesCovid19Feb242022.df$county_population),
as.numeric(
NJCountiesCovid19Feb242022.df$covid_cases_per_100k),
method="spearman")
p-value = 0.000231
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.733766
NJCountiesCovid19Feb242022Fig <-
ggplot2::ggplot(data=NJCountiesCovid19Feb242022.df,
aes(x=county_population,
y=covid_cases_per_100k)) +
geom_point(size=2.5, color="red") +
labs(
title=
"COVID-19 Cases per 100,000 by New Jersey County Population",
subtitle="Data Updated April 07, 2022",
x="\nNJ County Population",
y="COVID-19 Cases per 100,000\n") +
scale_x_continuous(labels=scales::comma, limits=c(0,
1000000), breaks=scales::pretty_breaks(n = 5)) +
scale_y_continuous(labels=scales::comma, limits=c(0, 200),
breaks=scales::pretty_breaks(n = 5)) +
annotate("text", x=0, y=175, fontface="bold", size=06,
color="black", hjust=0, family="mono",
label="Pearson's r ..... 0.717748") +
annotate("text", x=0, y=150, fontface="bold", size=06,
color="black", hjust=0, family="mono",
label="Spearman's rho .. 0.733766") +
theme_Mac()
# Fig. 7.30
par(ask=TRUE); NJCountiesCovid19Feb242022Fig
• Observation: Using the map, selected inferential testing for association (e.g.,
Pearson’s r and Spearman’s rho), and observations of a relatively small dataset
(N = 21 counties), the association between county_population and covid_cases_
per_100k is clearly established, using results from the enumerated
NJCountiesCovid19Feb242022.df dataset. As county population increases, the
relative incidence of COVID-19 also increases. Once again, it is emphasized that
the object variable covid_cases_per_100k is not a count of cases but is instead a
proportional measure that is adjusted for overall population.
• Question(s): It will take some effort to go back into prior news stories in the press
and archived government documents, but it is still worthwhile to ask about national
and state mandates regarding COVID-19 mitigation efforts. Were there uniform
COVID-19 mitigation mandates for all 21 New Jersey counties in early April
2022, given the association between degree of urban concentration and incidence
of COVID-19 cases? As county population increases, so does the degree of
COVID-19 infection among the population. Likewise, as county population
decreases, so does the degree of COVID-19 infection among the population. Is it
the responsibility of a data scientist to bring these types of observations (estimates
of association exceeding 0.70) to those who formulate and eventually mandate
policies – often policies with punitive outcomes for those who do not follow
mandates?
Sub-county
The Census Bureau provides detailed information by Tract and by Block. There are
many resources on these breakout areas, but Census Tracts (https://www2.census.
gov/geo/pdfs/education/CensusTracts.pdf) is a useful first resource. Notice how
Tracts range from a population of 1200 (minimum) to 8000 (maximum), depending
on the composition of the overall county. Most Tracts have a population of
approximately 4000.
The acquisition of health insurance is a gateway to better health. Yes, there are
those who have health insurance and may not take advantage of its many benefits,
but medical bills for most individuals are simply far too expensive without health
insurance. The figure presented below provides a choropleth-type map at the level
of Census Tracts for those individuals (age 35 to 64 years) in South Florida (e.g.,
Broward County, Miami-Dade County, and Palm Beach County) who do not have
health insurance, identified as American Community Survey (ACS) variable
B27010_050. This location-specific type of information could greatly help public
health personnel narrowcast outreach campaigns on the need for health insurance,
spending scarce resources wisely while gaining a positive return on investment for
media campaign costs (Fig. 7.31).
Fig. 7.31
SouthFloridaNoHealthInsuranceTract.df <-
tidycensus::get_acs(
geography="tract",
variables="B27010_050",
# 35 to 64 years, No health insurance coverage
state="FL",
county=c("Broward", "Miami-Dade", "Palm Beach"),
# South Florida counties
geometry=TRUE,
year=2021,
cache_table=TRUE,
output="tidy",
survey="acs5",
show_call=TRUE)
base::attach(SouthFloridaNoHealthInsuranceTract.df)
SouthFloridaNoHealthInsuranceTract.df %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = "dodgerblue") +
scale_fill_viridis_c(labels = scales::comma) +
labs(fill="Census Tract Population\nNo Health Insurance") +
ggtitle(
"South Florida (Broward County, Miami-Dade County, and
Palm Beach County) Population with No Health
Insurance (35 to 64 years) by Census Tract:
2021") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
# Much more, using the full set of available ggplot tools,
# could be done to improve the presentation of this figure.
# Fig. 7.31
Regarding Census Bureau maps at the level of either tracts or blocks, the three
counties in South Florida (e.g., Broward County, Miami-Dade County, and Palm
Beach County) were purposely selected to show what may at first seem to be unusual
output, especially for a tri-county area with a population of more than 6 million
residents – a tri-county population that exceeds the population of many states. To
offer context, consider how the minimum population for a Census Tract is 1200.
Then, for those who are not well-acquainted with these three counties, consider how
the population is minimal, if not totally absent, in large areas such as Lake
Okeechobee, the Agricultural Reserve and other western sections devoted to
agricultural use, many portions of the Everglades and Everglades National Park,
Biscayne National Park, selected wetland areas adjoining the Atlantic Ocean and
Florida Bay, etc. The map is correct, but context always helps for best use.
Look at the three South Florida counties again, but now focusing on the total
population by the four main racial groups. Given that 2020 Decennial Census data
are starting to come online, focus on all residents in these counties by the four most
representative races (Fig. 7.32):
Fig. 7.32
SoFlaRaceVariables <- c(
Hispanic = "P2_002N",
White = "P2_005N",
Black = "P2_006N",
Asian = "P2_008N")
SoFlaRaceVariables
SoFlaRace.df <-
tidycensus::get_decennial(
geography = "tract",
variables = SoFlaRaceVariables,
state = "FL",
county = c("Broward", "Miami-Dade", "Palm Beach"),
geometry = TRUE,
year = 2020,
show_call = TRUE)
SoFlaRace.df
base::attach(SoFlaRace.df)
utils::str(SoFlaRace.df)
SoFlaRaceDots <-
tidycensus::as_dot_density(
SoFlaRace.df,
value = "value",
values_per_dot = 100,
group = "variable")
# Allow time for this syntax to process. It is a
# complicated set of actions that takes time.
SoFlaRaceDots
base::attach(SoFlaRaceDots)
utils::str(SoFlaRaceDots)
SoFlaRaceBase <-
SoFlaRace.df[SoFlaRace.df$variable == "Asian", ]
SoFlaRaceBase
base::attach(SoFlaRaceBase)
utils::str(SoFlaRaceBase)
ggplot2::ggplot() +
geom_sf(data = SoFlaRaceBase,
fill = "white", color = "lightgray") +
geom_sf(data = SoFlaRaceDots,
aes(color = variable), size = 2.5) +
labs(
title=
"Population Distribution Throughout South Florida Counties
by Race (e.g., Asian, Black, Hispanic, and White) Using
2020 Decennial Census Data",
subtitle=
"Each dot represents 100 residents.") +
theme_void() +
theme(plot.title=element_text(face="bold", hjust=0.5)) +
theme(plot.subtitle=element_text(face="bold", hjust=0.5)) +
theme(legend.title=element_blank()) +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=00))
# Fig. 7.32
Another view that may help with a good understanding of population distribution
in South Florida by race is to produce four breakout panels, using the facet_grid()
function. Here is one way to address this issue, but there are many possible ways
that a grid of individual breakout maps, by race, could be produced (Fig. 7.33).
ggplot2::ggplot() +
geom_sf(data = SoFlaRaceBase,
fill = "white", color = "lightgray") +
geom_sf(data = SoFlaRaceDots,
aes(color = variable), size = 0.01) +
facet_grid(cols = vars(variable)) +
labs(
title=
"Population Distribution Throughout South Florida Counties
by Race (e.g., Asian, Black, Hispanic, and White) Using
2020 Decennial Census Data",
subtitle=
"Each dot represents 100 residents.\n") +
theme_void() +
theme(plot.title=element_text(face="bold", hjust=0.5)) +
theme(strip.text=element_text(face="bold")) +
theme(plot.subtitle=element_text(face="bold", hjust=0.5)) +
theme(legend.title=element_blank()) +
theme(legend.position = "none") +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=00))
# Fig. 7.33
There are many ways to use R to produce useful maps. For this current example
on the distribution of the four main races in South Florida, some may think that a
faceted map is best, but others may think that a single blended distribution-type
map is best. Fortunately, R is flexible and allows for many approaches, all based on
needs and agreement on presentation.
Fig. 7.33
Bonus Challenge: Use the data to recreate the map, but now using the tmap package.
This is an advanced challenge that may go beyond introductory instruction, but it is
worth the effort to attempt. Look at the syntax presented below, for a totally different
mapping task and dataset, to get a partial glimpse of how the tmap package is used
to provide annotations and even a compass:
tmap::tm_credits(
"\n\nWestern parts of Broward County (e.g., GEOID
12011980000) are\nmostly uninhabited, reserved for agriculture
and the Everglades.",
fontface = "bold",
size = 0.50,
position = c("left", "top")) +
tmap::tm_credits(
" Resource: Census Bureau - ACS5, 2021\n\n", # ID credit(s)
fontface = "bold", # Text in bold
size = 0.50, # Adjust size
position = c("left", "bottom")) + # Position
tmap::tm_compass(
north = 0, # Orientation - north
type = "8star", # Compass type
position = c("center", "bottom"),# Compass position
text.size = 0.55, # Compass text - size
size = 1.05, # Compass size
show.labels = 2) # Cardinal directions
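The annotation layers above attach to a base map in the same way that ggplot2 layers attach to a plot. As one hedged illustration (assuming tmap version 3.x syntax and the SouthFloridaNoHealthInsuranceTract.df object created earlier with geometry=TRUE), a base map might be sketched as follows, with the tm_credits() and tm_compass() layers then added with +:

# A minimal sketch; palette, titles, and layout settings are illustrative only.
library(tmap)

tmap::tm_shape(SouthFloridaNoHealthInsuranceTract.df) +
  tmap::tm_polygons(col = "estimate",
    palette = "viridis",
    title = "No Health Insurance (35 to 64)") +
  tmap::tm_layout(
    main.title = "South Florida Census Tracts: ACS 2021",
    main.title.position = "center",
    legend.outside = TRUE)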
R Markdown and LaTeX Demonstrations of a Summary Memorandum of Findings
Data scientists seemingly engage in a host of activities, all to use data in its many
forms to better understand trends and from this understanding, help guide
decision-making:
• Consult with deans, supervisors, clients, and others on possible data-centric
projects requiring focused inquiry.
• Determine scope, requirements (e.g., human, computing, software, etc.),
timelines, budgets, etc. of proposed inquiries.
• Discuss, negotiate, compromise, and finally agree on differences between first
ideas and eventual acceptance of project requirements.
• Hire and train staff, obtain supporting space, hardware, software, etc. Put planned
actions into place, obtain data, scrub and organize data, prepare formative
analyses of the data, consult with others on first findings, use feedback loops to
continue with next-step inquiries, etc.
• Complete planned inquiries and prepare initial formal documents that serve as
summaries of outcomes. Then, use feedback to revise formal documents, until
the documents are eventually in final accepted form. As appropriate, share
outcomes with a broad audience (internal and external) to further new learning
opportunities.
This section will briefly highlight a topic that has not yet been addressed in any
specific detail in this text – how data scientists prepare documents and the type of
documents prepared by data scientists. Of course, there is no one and only one
answer to either, but there are a few commonalities that should be considered:
• As projects are in process, it is common for data scientists to use informal
communication channels to communicate planned methods, data resources,
eventual methods, formative outcomes, etc. with staff, deans, supervisors, clients,
and interested peers. Along with informal face-to-face conversations and formal
meetings with those who are nearby, e-mail is also a common means of
communication, but consider the value of focused listservs and private blogs.
• As projects continue and finality is approaching, it is then common to use forms
of communication that include memos, public blogs, poster presentations, and
large group slide presentations at professional conferences.
• As projects are completed, it is then the norm to put closure to inquiries and to
prepare formal summary documents. If the project is proprietary, it may be
necessary to prepare a formal report that is submitted to only a small group of
authorized readers. For projects that can be shared with peers and ostensibly the
public, it is common to share findings by preparing a paper (e.g., an article) and
to then submit it to a relevant journal for peer review and eventual publication.
Following along with this theme, it is also common to present findings at a
professional conference, either as an invited speaker or by selection of a
competitive review committee.
The question remains, however: how does a data scientist prepare documents
that summarize findings from formal inquiries? From among the many ways
summaries are prepared, go back to the earliest days of computing and data science
as a guide to better understand current practices and how document preparation is
still put into use, especially by those who use R: typesetting.10,11
Typesetting has been used by electrical engineers and nascent computer scientists
since the beginning days of computing. When typesetting, it is common to use a
simple ASCII text editor to mark up selected text (as one example of a markup, the
string ce, often with a preceding period, was an early markup for centering text) and,
with accompanying software, the text can then be put into a very attractive format.
Some of the earliest approaches to typesetting, in use at or even before Neil
Armstrong’s first steps on the moon (July 20, 1969), were based on UNIX-based
roff (Run Off a Document) typesetting markup software and included eventual
variants such as nroff (Newer roff), troff (Typesetter roff), and groff (GNU roff).
The paradigm for typesetting is to use a text editor to prepare and mark up text
and to then use a preprocessor and a postprocessor for eventual final document
preparation. With correct preparation, it was a seamless and somewhat simple
operation to typeset documents during the earliest years of computing. However,
software either evolves or dies, and the roff-type typesetting programs now see less
use, having been replaced to a large degree by R Markdown and LaTeX, both of
which are demonstrated in this lesson, with accompanying files at the publisher’s
Web site associated with this text.
Static reports are the most common way outcomes from data science inquiries
are put into final form. These static reports can range from a one-page memorandum
to a report of 1,000 or more pages. The key here is that a static report is in fixed
format, both electronic and paper, when viewed by the reader, as opposed to
interactive reports (e.g., dashboards are increasingly popular, but often lack the fine
detail of a professional static report) that allow some degree of user engagement.
10 Many documents are also prepared by use of word processing software, but it is not necessary to
comment too much on its use other than to mention that some of the most popular word processing
software packages are proprietary and it cannot be assumed that interested peers and students have
access to the same. In contrast, the typesetting approach demonstrated in this section (both R
Markdown and LaTeX) is based on markup software that is legally and freely obtained.
11 Although there is no desire to make negative comments on the use of word processing software,
investigate the distinction between the expression WYSIWYG (what you see is what you get) v
WYSIAYG (what you see is all you get) when deciding whether to prepare a document with word
processing software v using a markup language and accompanying software. Decide if the need for
inclusion of comments, syntax, and other text directly in the manuscript, but text that is not visible
in the final report, has value when selecting document preparation software.
R Markdown
rmarkdown::render("RMarkdownHelloWorld.Rmd",
"html_document")
Note: It may be necessary to use the install.packages() function and the library()
function for the Rcmdr and rmarkdown packages, depending on prior use of these
packages.
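For orientation, the general shape of a minimal R Markdown file is sketched below. This sketch is illustrative only and is not the content of RMarkdownHelloWorld.Rmd, which is provided at the publisher’s Web site; an R Markdown file combines a YAML header, narrative text, and fenced R code chunks that are executed when the file is rendered.

---
title: "Hello World"
output: html_document
---

A first R Markdown document.

```{r}
summary(cars)   # any R code chunk; cars is a built-in dataset
```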
LaTeX
For those data scientists who need extensive access to typesetting capabilities,
including the ability to typeset mathematical notation(s), LaTeX may be a valued
choice when selecting typesetting software. A sample LaTeX file
(LaTeXMemorandum.tex) is provided at the publisher’s Web site associated with
this text. Open the file with a text editor to see its structure and the precise syntax.
Once the file is prepared, typically by using freely available LaTeX software, use
the editor’s features to prepare a pdf file (LaTeXMemorandum.pdf), also provided
at the publisher’s Web site associated with this text. There are many choices of
LaTeX editors, and it is beyond the purpose of this text to recommend one over
another, other than to say that LaTeX may not be the easiest markup software to learn,
but it can be used to create documents that are beyond compare.
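As one hedged alternative to working inside a LaTeX editor, the .tex file can also be compiled from within R, assuming a LaTeX distribution (e.g., TeX Live or TinyTeX) is already installed on the local machine:

tools::texi2pdf("LaTeXMemorandum.tex")   # produces LaTeXMemorandum.pdf
# Or, if the tinytex package is installed:
# tinytex::pdflatex("LaTeXMemorandum.tex")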
Concluding Comments and Next Steps
Go back to the first chapter in this text to review how data science, as a specific
discipline and not as an offshoot of either statistics or computer science, has evolved
into a recognized discipline as taught at the university level:
Detail for CIP Code 30.7001
Title: Data Science, General.
Definition: A program that focuses on the analysis of large-scale data sources
from the interdisciplinary perspectives of applied statistics, computer science, data
storage, data representation, data modeling, mathematics, and statistics. Includes
instruction in computer algorithms, computer programming, data management, data
mining, information policy, information retrieval, mathematical modeling,
quantitative analysis, statistics, trend spotting, and visual analytics.
Action: New
CIP: The Classification of Instructional Programs;
https://nces.ed.gov/ipeds/cipcode/cipdetail.aspx?y=56&cipid=92953
Much has been covered in this text that provides a glimpse of the day-to-day
technical skills required of a data scientist. It is only necessary to provide the reminder
that expertise is needed in grasping what data are and from that, using various
resources to either create or obtain data, to organize data into an acceptable form, to
subject the data to analyses of various types (often of a statistical nature but for
descriptive purposes too), and to create visualizations of the relationships between
and among the data. Anything more on these topics would be redundant to what has
already been presented, often multiple times.
Data scientists are highly valued professionals and, like all other professionals, they
cannot ignore soft skills. Data scientists are leaders and, as such, they need to know
how to interact with others (e.g., subordinates and supervisors, external individuals
of authority, peers, and the public) in a professional manner that advances
organizational goals and profitability.
It is assumed that a data scientist, especially an experienced data scientist with
supervisory responsibilities, must have excellent skills in analytics, computing,
statistics, etc. It is also expected that a data scientist who strives for career
advancement must also have a set of soft skills, including traits (in alpha order) such as:
• A desire to innovate and a willingness to try new approaches to traditional
practices
• Awareness of self and empathy for others
• Curiosity and the desire to investigate outcomes that may not have been expected
or are counter to norms
• Excellent communication skills, oral and written, including diction and the use
of correct grammar in all media
• Honesty, without exception
• Knowledge of how a company operates and why profitability is essential
• Problem-solving skills and a willingness to bring others into the problem-
solving process
• Professional presentation and an adherence to expected clothing (e.g., dress
code), grooming, manners, etc.
• Respect and courtesy for all contacts, subordinates, peers, and superordinates
• Spatial skills and the desire to use visuals to tell a story that may be lost on others
if only a numeric approach were used for summary presentations
The Bureau of Labor Statistics (BLS) is possibly the best resource for learning
about employment opportunities in data science. Explicit detail is obtained using
the Web-based Occupational Outlook Handbook. By review of the term Data
Scientists (https://www.bls.gov/ooh/math/data-scientists.htm) it is seen that salaries
are often greater than USD 100,000 per year. Equally important, the BLS projects
positive employment growth for data scientists, with employment growing at 36
percent over the next decade.
External Data and/or Data Resources Used in This Lesson
The publisher’s Web site associated with this text includes the following files,
presented in .csv, .txt, and .xlsx file formats.
Lyme_Disease_Incidence_Rates_by_State_or_Locality.csv
United_States_COVID-19_Community_Levels_by_County.csv
WideMilkLbsPctFatPctProtein.xlsx
owid-covid-latest.csv
A few additional files are also available at the publisher’s Web site associated
with this text, relating to typesetting and documents used for preparation of
summary reports:
LaTeXMemorandum.pdf
LaTeXMemorandum.tex
RMarkdownHelloWorld.HTML
RMarkdownHelloWorld.Rmd
Challenge: Use these files to practice and replicate the outcomes used in this
lesson. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.