Introduction to Data Science in Biostatistics: Using R, the Tidyverse Ecosystem, and APIs
Thomas W. MacFarland
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
audience of this text, will help those who are still new to biostatistics and in turn
improve retention of students and career advancement of future biostatisticians.
Data scientists provide value beyond the immediate. Following along with this
concept, value is added to this text in that most lessons are enhanced by greatly
detailed addenda, often multiple addenda in each lesson. New ideas, exposure to
new tidyverse ecosystem packages and functions, and needed skills are gradually
addressed with each advancing lesson. The addenda often introduce, reinforce, and
expand on specialized tidyverse ecosystem packages and functions that go beyond
what was previously presented, address parametric versus nonparametric approaches
toward data, and often include practice data sets that support incremental engage-
ment in pursuit of advanced skills.
Of equal importance to those with interest beyond a cursory introduction to
biostatistics, many challenge activities are included throughout each lesson and the
addenda. The first challenges are simple and should be completed successfully by
all. Later in the text, the challenges become more detailed, calling for creative
attempts to achieve success, with some challenge activities purposely lacking
complete guidance. The later challenges are often worded so that there is no single
correct approach to resolution; instead, they allow for multiple approaches. A few
of the last challenges also call for individual inquiry into more advanced topics and
resources on the use of R with biostatistics than what is presented in the text. These
later challenges will indeed be challenging, but data scientists face challenges daily.
Additional value is also added in that each external dataset mentioned and used
has been placed at the publisher’s Web site for this text. These datasets are easily
available for download, and their inclusion makes it possible to follow along with
the syntax presented in the text. Ideally, use the syntax and provided datasets to
reproduce the outcomes shown in this text. Then, go beyond the original syntax and
try different approaches to data organization, experiment with other data analysis
approaches, and consider additional functions and function arguments to produce
even more enhanced figures and maps, etc. Take on the role of a data scientist and
add value beyond base requirements.
The use of R and specifically the use of APIs and R’s evolving tidyverse ecosystem
for engagement in biostatistics is the focus of this text. By following along with a
gradual exposure to R, APIs, and the tidyverse ecosystem, this text should help
beginning students and early-stage researchers gradually increase their skills with
the use of R syntax for inquiries into biostatistics.
The first lesson of this text is somewhat unique in that it looks closely at the way
data science is viewed as an emerged (not emerging) discipline in higher education.
The United States Department of Education has a hierarchical coding system for the
way academic majors are organized, and from this system, a large collection of
majors that call for some degree of expertise in data science is identified. These
majors are then put into context by the hierarchical coding system used by the
United States Department of Labor and the eventual transition from academic prep-
aration to careers. Although higher education has experienced a noticeable decline
in enrollment over the last few years, that is not the case for data science. There is
clearly an increase in interest in data science, not surprisingly due to the growth of
data science as a career opportunity. Employment in data science is projected to
grow and salaries are projected to increase. A few basic summaries on the use of R
and data types are also stressed in the first lesson, as either a recapitulation for those
with prior exposure to R or as an introduction for those who are not as well versed
in the use of R and how data are viewed.
The next two lessons look closely at data. A summary of possible data sources
related to biostatistics is the focus of the second lesson. Although it may seem intui-
tive to those with experience, beginning students and early-stage researchers need
to know that there are many resources that either provide data that may totally meet
needs as inquiries are attempted, or the data may serve as a useful proxy. Government
data sources are especially valued and are stressed in this lesson. Knowing possible
data resources, the third lesson stresses a curated ten-point process for statistical
analyses, with an emphasis on how these processes are used in an all-inclusive
demonstration of statistical tests.
The process stressed in the third lesson leads to a more detailed introduction to
the tidyverse ecosystem in the fourth lesson. Emphasis is placed on how the
Preface
I cannot begin to adequately thank the many individuals who contribute to the open-
source paradigm and the countless number of hours given freely to software devel-
opment and management, often for little if any financial remuneration and only rare
acknowledgment by deans and supervisors as a metric for career advancement.
These individuals put the profession and the advancement of science first, often at
the cost of time away from personal pursuits.
I also want to recognize all at Springer who assisted with this text, editors Laura
Aileen Briskman and Faith Su and the many staff members, domestic and foreign,
who turn disparate files into a cohesive final text. To all – thank you
for your many ideas, feedback, help, and supporting efforts.
Contents
Identify and Organize the Data and All Relevant Variables ... 149
Outline Potential Approach(s) for Analyses and Consider Alternate Approaches ... 150
Put Plans into Action, with Frequent Checks for Quality Assurance ... 150
Individual Review of All Outcomes ... 150
External Review of Outcomes Whenever Possible ... 151
Report at an Appropriate Level for the Intended Audience ... 151
Debrief to Establish Processes for Future Improvements ... 151
General Approach When Using R for Statistical Analysis ... 152
Exploratory Graphics ... 152
Exploratory Descriptive Statistics and Measures of Central Tendency ... 152
Exploratory Analyses ... 153
Addendum: Use Inferential Statistics and R Syntax to Address Differences in Percentage Deaths from COVID-19 by the Urban v Rural Continuum ... 153
External Data and/or Data Resources Used in This Lesson ... 173
4 Data Science and R, Base R, and the tidyverse Ecosystem ... 175
Workflow for Reproducible, Efficient, and Accurate Analyses and Presentations ... 175
Base R ... 179
The tidyverse Ecosystem ... 181
The tidyverse Ecosystem as an Idea and the Need for Tidy Data ... 182
The Core tidyverse Ecosystem as a Set of Tools in R Packages for Data Science ... 184
Auxiliary Packages Outside of the Core tidyverse Ecosystem ... 185
Addendum 1: Complex Data Set on Birth Rates Easily Accommodated by Using the tidyverse Ecosystem ... 187
Addendum 2: Complex Data Set on Gross Domestic Product (GDP) and Comparison to Birth Rates by Using the tidyverse Ecosystem ... 206
Addendum 3: Individual Initiative of Planned Workflow, Analyses, and Graphical Presentations ... 213
Addendum 4: Essential tidyverse Ecosystem Functions That Every Data Scientist Should Master ... 217
External Data and/or Data Resources Used in This Lesson ... 219
5 Statistical Analyses and Graphical Presentations in Biostatistics Using Base R and the tidyverse Ecosystem ... 221
Overview of Using R for Statistical Analysis ... 221
Background ... 221
Import Data ... 222
Code Book and Data Organization ... 223
Index ... 525
Chapter 1
Emergence of Data Science as a Critical
Discipline in Biostatistics
This text is about data science. This text is specifically about the way data science is
deployed by those who work in the biological sciences, using R and R’s associated
tidyverse ecosystem. This text also provides multiple examples of the use of APIs
(Application Programming Interfaces), where selected API functions (e.g., data
retrieval clients) from R packages provide an efficient and reproducible way to
obtain extant data from external resources (a generic sketch of this retrieval pattern
follows the list below):
• This text is not only about algorithms, computing, computer science, computing
hardware, computing software, computing infrastructure, and programming,
although algorithms, computing, computer science, computing hardware, com-
puting software, computing infrastructure, and programming all have a major
role in data science.
• This text is not only about data, although data clearly have a major role in data
science.1
• This text is not only about mathematics, although mathematics has a major role
in data science.
1 In this text, the term data is nearly always used in a plural sense of the term (e.g., the data have
been recorded.) and the term datum is nearly always used in a singular sense of the term (e.g., the
datum has been recorded.). This approach follows along with the Latin origin of the term(s) datum
and data. The term datapoints is occasionally used, but the term datums is avoided. It is recognized,
however, that the term data is regularly seen in the literature for both singular and plural use. Going
beyond its use in statistics and data science, the term data, in the plural sense of the term, is often
viewed as a mass noun, indicating either a substance or quality that cannot be counted.
• This text is not only about statistics, although statistics has a major role in data
science.
• This text is not only about analytics, although analytics has a major role in data
science.
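
As a generic illustration of the API retrieval pattern mentioned above, the sketch below issues an HTTP request and parses the JSON reply; the endpoint, query parameters, and object names are hypothetical, and the lessons that follow rely instead on purpose-built client functions from specific R packages.

library(httr)      # low-level HTTP requests
library(jsonlite)  # JSON parsing

# Hypothetical endpoint and query parameters, for illustration only
response <- GET("https://api.example.gov/v1/observations",
                query = list(series = "example_series", format = "json"))
stop_for_status(response)                  # halt early on any HTTP error

payload      <- content(response, as = "text", encoding = "UTF-8")
observations <- fromJSON(payload)          # convert JSON text to an R object
str(observations)                          # inspect the returned structure

Because every step of the retrieval is expressed in syntax, the request can be rerun and audited, which is the efficiency and reproducibility advantage noted above.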
Instead, this text is about data science for those who work in the biological sci-
ences and choose to use the R language and its associated tidyverse ecosystem and
associated APIs. With this degree of context as to the purpose of this text:
• Data science is viewed as a multidisciplinary means of using data and associated
support structures and processes in creative ways that look for otherwise undis-
covered patterns and from that discovery not only solve current problems but
also provide insight into possible next steps regarding future desired outcomes.
• Within the paradigm of data science, it is argued that mathematics, statistics, and
analytics, although critically important, are backward looking.
• In contrast, data science is forward looking, by using processes, outcomes, and
associated data from the past. As always, remember the well-known expression
past behavior is the best predictor of future behavior.
Ultimately, it is often said that data science gives value to data, with value seen
in the biological sciences in terms of new methodologies and processes that ulti-
mately improve the human condition and mitigate threats of various types, result in
new products such as medicines and therapies, and contribute to improved efficien-
cies in agriculture, biology, environmental studies, medicine, public health, etc.
With adequate skills in data science, it is possible to focus on insight that can be
gained from limited as well as massive amounts of data (e.g., Big Data).
Compared with more well-established sciences, data science is a relatively new field
of study:
• It is often suggested that the impetus for the emergence of data science grew out
of Tukey’s early 1960s publication The Future of Data Analysis.
• By the early 1970s, the term data science was used by Naur (Concise Survey of
Computer Methods) and soon after by others.
• By the late 1970s and into the 1990s, the term data science was found in publica-
tions, conference papers, and other literature. Related professional associations
that focused on data science also emerged during the late 1970s and into
the 1990s.
• As computing machinery, software, and systems improved, there were growing
trends such that by the late 1990s and into the early 2000s more than a mere few
professionals used the term data scientist instead of computer scientist or statis-
tician as an official job title. Concurrently, journals devoted to data science also
emerged at this time.
• Finally, by 2020, the United States Department of Education recognized data sci-
ence as an emerged (not emerging) field of study and assigned a Classification of
Instructional Programs (CIP) code and definition for data science: CIP Code
30.7001; Data Science, General; a program that focuses on the analysis of large-
scale data sources from the interdisciplinary perspectives of applied statistics, com-
puter science, data storage, data representation, data modeling, mathematics, and statistics.
The State of Data Science and the Need for Data Scientists
Definition of Data
The etymology of the term data (plural) is derived from the Latin term datum (sin-
gular), meaning "given." The classical use of the term(s) datum and data was "given
fact." By the late 1940s, the terms datum and data were used in a computational
fashion, eventually evolving into corollary terms such as data processing, database,
and data entry.
For this text, the terms datum (singular) and data (plural) are defined as informa-
tion that describes an identified attribute. It is recognized that this definition of the
terms datum and data is quite broad, but data scientists work with many types of
data, necessitating recognition of the broad nature of data. Data can take many
forms, and in R, data scientists often work with various data structures (covered in
more detail later in this lesson) and data types. Base R and the tidyverse ecosystem
have useful packages and functions for working with the many types of data typi-
cally encountered.
2 Like uniform coding practices by nearly all other federal agencies, the United States Bureau of
Labor Statistics Standard Occupational Classification (SOC) system is a uniform coding system
that consists of nearly 900 codes, organized in a hierarchy, used to classify worker occupations.
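The assignments that created the vectors examined below are not part of this excerpt. A minimal reconstruction, inferred from the printed output that follows, might look like the lines shown here; the character and integer values match the output, while the numeric and logical values are assumptions for illustration.

CharacterTypeVector <- c("Jane", "Mary", "Esther")  # character data
IntegerTypeVector   <- c(1L, 3L, 123L)              # integer data (note the L suffix)
NumericTypeVector   <- c(1.25, 3.50, 123.75)        # assumed double-precision values
LogicalTypeVector   <- c(TRUE, FALSE, TRUE)         # assumed logical values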
CharacterTypeVector
base::attributes(CharacterTypeVector) # Metadata
NULL
base::class(CharacterTypeVector) # Object type
[1] "character"
base::length(CharacterTypeVector) # Number of datapoints
[1] 3
utils::str(CharacterTypeVector) # Internal structure
chr [1:3] "Jane" "Mary" "Esther"
base::typeof(CharacterTypeVector) # Object type
[1] "character"
IntegerTypeVector
[1] 1 3 123
base::attributes(IntegerTypeVector) # Metadata
base::class(IntegerTypeVector) # Object type
base::length(IntegerTypeVector) # Number of datapoints
utils::str(IntegerTypeVector) # Internal structure
base::typeof(IntegerTypeVector) # Object type
3 Throughout this text, the color green indicates R-based input and the color red indicates
R-based output.
4 In an effort to save space, look for the expression "Selected sections of output were deleted to
save space." Even so, copy and paste the R-based syntax to replicate outcomes.
NumericTypeVector
base::attributes(NumericTypeVector) # Metadata
base::class(NumericTypeVector) # Object type
base::length(NumericTypeVector) # Number of datapoints
utils::str(NumericTypeVector) # Internal structure
base::typeof(NumericTypeVector) # Object type
LogicalTypeVector
base::attributes(LogicalTypeVector) # Metadata
base::class(LogicalTypeVector) # Object type
base::length(LogicalTypeVector) # Number of datapoints
utils::str(LogicalTypeVector) # Internal structure
base::typeof(LogicalTypeVector) # Object type
Practice with other, self-created actions used to create data. When completed, enter
q() (an alias for the quit() function) at the R prompt to quit the session. Then
respond to the Save workspace image? prompt: Yes – No – Cancel.
Some prior experience with any programming language, and ideally R, will help
for those who use this text as an aid for first inquiries into data science. If needed,
those with limited experience with R should consider the many resources available
for users who are new to this language.
install.packages("swirl", dependencies=TRUE)
# While connected to the Internet, start the download
# process. As a hint, a prompt will show asking for the
# selection of a CRAN mirror site, the place(s) where R
# packages are available. Most users select the most
# local site.
library("swirl")
# Once a package is downloaded it is still necessary to
# put the package into use. With R, this process is put
# into place by using the library() function.
#
# Be patient. It may take a few minutes to download and
# install the swirl package.
swirl()
# Follow along with the text displayed on the screen.
Data are generated in ways that many do not even realize, and increasingly these
data have become a valued commodity:
• Think of the actions needed to purchase food items at a grocery store. Early on,
the individual owner of a local grocery store had a sense of inventory control by
using paper-based hand tallies of available stock and from this knowledge orders
were placed using a telephone to connect with a food distributor, calling out
requirements item by item. As grocery stores became larger, to gain efficiencies
5 Similar to other software associated with R, the swirl package is free and open source software
(FOSS). The package is free to download and install. The syntax associated with the package can
also be viewed, to better understand the package and possibly improve upon it.
in food distribution, dedicated managers at the local level were needed to keep up
with inventory control. However, in the mid-1970s, grocery supermarkets gained
such size that newly developed barcodes were introduced and placed on food
packaging. It was no longer necessary to place adhesive labels on each item,
identifying the current cost. Equally, cashiers no longer needed to hand enter the
price of each item at checkout but instead merely scanned each package, placing
the barcode over the scanner to add each item to the bill. Most importantly, the
data generated from each scan at the checkout line were sent to a central location
serving all stores in the grocery chain, facilitating efficiencies in inventory con-
trol, automated ordering, loss prevention assessment, etc. Now, radio frequency
identification (RFID) technologies are being explored to add additional efficien-
cies to the business operations of food distribution, from farm to table and all
points along the food distribution network – a data focused logistics web of
increasing complexity that is challenged with freshness and spoilage issues that
are unique to food distribution.
• Consider contemporary smart phones, the use of health-related apps made avail-
able at time of purchase of these phones, certain third-party apps purposely
downloaded from some type of online app provider, and how the default settings
of these apps are often set to maintain an automated tally of daily steps while
either walking or jogging. By themselves, these data may be interesting to the indi-
vidual user and could be helpful as part of an exercise regimen. However, if it were
possible for an exercise- or health-related company to obtain these data, then the
data could be monetized through the distribution of unsolicited email or other
means to those who record 3000 or more steps per day, advertising
running shoes, exercise apparel, organic high energy food bars, etc. How does
this happen? Depending on the app, locale of the user, and associated local and
national governance over the involved process (or lack of governance), the data
are quite possibly legally obtained by automated transfer to a commercial entity,
organized by a data broker, sold to a commercial business, used by a data science
team, and then sold to interested health product and health service businesses.
The result is that private individuals may receive unsolicited messages for prod-
ucts and services that may (or may not) be desired, a cost-effective form of direct
marketing, often with very satisfactory results for some individuals but an obtrusive
and unwanted disturbance to others.6 Unwanted messages
6 Some apps, of all types and not only those related to health, have automated background sharing
of generated data (e.g., daily number of steps, as in the current example) as a default setting. Yes,
there is often a way of disabling automated data sharing, but many users either do not know about
this option or find it difficult to disable automated data sharing. There is a growing movement by
many national governments to respond to this unwanted data harvesting process – the right to be
left alone. A common remedy is that app developers must make default automated data sharing
prominently known at the time of app download or first use, and it must be as easy to disable
default automated data sharing as it is to enable this process. There are a few national governments
that have applied large fines against companies associated with the digital advertisement industry
that, unknown to the user, obtain app-generated data without express permission, but the univer-
sal application of these protective practices is still uncertain. Regardless of the appropriate use of
large-scale background data harvesting, it is only possible due to the emergence of data science
applications.
Data science bridges the output of statistical analyses and other forms of analyt-
ics to gain insight and in turn support future problem solving. Again, consider the
expression Past behavior is the best predictor of future behavior. Data science uses
extant data, often exceptionally large amounts of complex data in various formats
and from various sources, to make discoveries that help justify decisions about
future actions.
Biostatistics is one of the many support mechanisms (some may say the most impor-
tant, but of course, that statement is subject to opinion and discussion) for the
broader discipline of data science in the biological sciences. There are conceivably
as many definitions of biostatistics as there are those who would attempt to make a
definition of this discipline. For the purposes of this text, biostatistics is defined as
a process (e.g., a distinct set of activities) where measures of many different types
are gained from, about, or in association with biological organisms, and these mea-
sures are then subjected to critical examination using various analyses, analyses that
usually involve some type of mathematical focus. Enabling actions associated with
biostatistics include, in part:
• Broad processes where biologically oriented problems are considered, through
direct inquiry, collaboration with colleagues, literature review, etc.
• Actionable methods are considered, developed, refined, and later implemented to
obtain reliable and valid data related to the identified problem(s).
• Specific actions are used to put the obtained data into usable formats.
• Software and computing machinery are used against the data to perform appro-
priate analyses, such as descriptive statistics (e.g., calculation of mean, standard
deviation, median, mode, range, and frequency distribution), inferential statistics
(e.g., Chi-square, Student’s t-test for either matched pairs or independent sam-
ples, analysis of variance), and measures of association (correlation and regres-
sion, etc.); a brief generic illustration in R follows this list.
• Software and computing machinery are used against the data to generate appro-
priate graphics, to visualize outcomes that may not be readily evident when
viewing numeric analyses only.
• Consideration of outcomes is used to provide some degree of analysis, interpre-
tation, and insight into outcomes, both those outcomes that are obtained by
numeric analyses and those outcomes that are visualized using graphical presen-
tation (e.g., figures and maps).
• Using the five-chapter model (e.g., introduction, literature review, methods,
results, and interpretation and conclusions) and its many derivations, collective
efforts are shared with others (e.g., internal distribution of a summary memo,
symposia poster session, conference presentation, journal article, and book pub-
lication) as part of a communication and quality assurance process that invites
discussion and feedback for the purpose of continuous inquiry and improvement.
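
As a brief, generic illustration of the kinds of analyses named in the list above (R's built-in iris data are used here; the text's own worked examples appear in later lessons):

data(iris)                                           # built-in practice dataset

# Descriptive statistics and measures of central tendency
mean(iris$Sepal.Length); sd(iris$Sepal.Length); median(iris$Sepal.Length)

# Inferential statistics: Student's t-test for two independent samples
setosa     <- subset(iris, Species == "setosa")$Sepal.Length
versicolor <- subset(iris, Species == "versicolor")$Sepal.Length
t.test(setosa, versicolor)

# Measure of association: correlation between two continuous measures
cor(iris$Sepal.Length, iris$Petal.Length)

# Exploratory graphic, to visualize what the numbers alone may not show
boxplot(Sepal.Length ~ Species, data = iris)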
Although biostatistics is perhaps the most common term used today to describe
these constructs, an earlier term for what is now considered biostatistics was
biometry, a term that may appear in older literature. Investigations into life
expectancy, going back to the 1800s, were first associated with the term biometry.
Those with a special interest in biostatistics and how it emerged as an area of scien-
tific inquiry should look into the life stories of Florence Nightingale, John Snow,
William Farr, Francis Galton, Karl Pearson, Thomas Junius Calloway, Ronald
Aylmer Fisher, Claude Shannon, and others – individuals who either developed
mathematical processes for problem identification, data collection, and statistically
oriented analysis of biological phenomena or developed processes and models for
the visual presentation of outcomes and potential impact(s) to fellow scientists, leg-
islators and government officials, the press, and eventually the public.
Although by no means an exhaustive list, biostatistics and increasingly the link-
age between biostatistics and data science includes inquiries into the following
disciplines:
• Agriculture: animals (including veterinary science), plants, soils, storage and distribution, etc.
• Biology
• Epidemiology
• Health Science
The United States Department of Education (DOE) provides many resources that
offer a sense of the growing level of interest in data science and specifically those
data science programs of study that are dependent on the use of biostatistics.7 The
two primary DOE resources are the Classification of Instructional Programs (CIP)
and the Integrated Postsecondary Education Data System (IPEDS):
• The first implementation of Classification of Instructional Programs (CIP, https://
nces.ed.gov/ipeds/cipcode/) coding started in 1980, with updates in 1985, 1990,
2000, 2010, and 2020. CIP codes are based on a two-digit, four-digit, and six-
digit coding system of increasing granularity and were developed to communi-
cate the nature of programs of study in postsecondary education, from a broad
7 The United States Department of Labor (Labor), which is detailed later in this lesson, is also an
excellent resource for information on career opportunities and required skills and education for
those who wish to work in data science. As an advance organizer to information presented later in
this lesson, review Occupational Employment and Wages - 15-2051 Data Scientists (https://www.bls.gov/oes/current/oes152051.htm) and notice the salaries for data scientists (national and
selected regions), job distribution throughout the nation, etc. Again, far more detail is provided
later in this lesson.
8 Review the detailed description for Medical Informatics (https://nces.ed.gov/ipeds/cipcode/cipdetail.aspx?y=55&cipid=87670) and follow the progression of detail: CIP Code 51, Health
Professions and Related Clinical Sciences; CIP Code 51.27, Medical Illustration and Informatics;
and CIP Code 51.2706, Medical Informatics.
9 Although the Classification of Instructional Programs was developed specifically for educational
purposes, the use of CIP codes is found throughout the many agencies, institutions, businesses, and
other entities that have some degree of interest in throughput of students in postsecondary educa-
tion, including: Department of Commerce, Bureau of the Census; Department of Education, Office
of Career, Technical, and Adult Education; Department of Education, Office for Civil Rights;
Department of Education, Office of Federal Student Aid; Department of Education, Office of
Special Education; Department of Homeland Security; Department of Labor, Bureau of Labor
Statistics; National Academy of Sciences; National Institutes of Health; National Occupational
Information Coordinating Committee; National Science Foundation; etc. CIP codes are also cre-
atively used by state departments of education and other state agencies, national organizations and
professional groups, higher education institutions, technical institutions, and the many businesses
that provide employment services.
10 The normal lag from data submission by individual postsecondary institutions to public access of
the data at the IPEDS Peer Analysis System has been noticeably extended due to the prior and
continuing challenges brought about by COVID-19 and the still frequent practice of work from
home (WFH) by many postsecondary education information workers and government counter-
parts, including external contracted workers. As an example, deadlines for IPEDS data submission
were extended, and it is expected that these extensions will ultimately impact the timeliness of data
availability. These issues related to data availability are of course not restricted to the United States
Department of Education but are endemic to many data resources, like the way COVID-19 has
become increasingly endemic.
topics as diverse as tuition and other student charges, fall term enrollment, institu-
tional control (e.g., public, private not for profit, private for profit), completions,
financial aid, faculty and staff headcounts, finance (e.g., revenue and expenses),
library resources, etc. – topics that those in data science and other programs of study
should consider. IPEDS data are highly transparent, and the data are available to the
public worldwide and require no notification of access, no permissions for access,
and no key or other user-identified authentication for access.
IPEDS data are not yet available through use of a function-type (e.g., client)
R-based Application Programming Interface (API). Instead, data are gained by
interaction with a Graphical User Interface (GUI) menu-type selection process.11
When all selections are completed it is possible to download an IPEDS dataset that
consists of data for hundreds of variables against thousands of postsecondary insti-
tutions. An individual dataset gained from the IPEDS Peer Analysis System is
always initially downloaded as a comma separated values (.csv) file, with the data
in a rectangular row by column format. With a .csv file format as the default, it is
then an easy task to put IPEDS data into other file formats, such as .xlsx file format,
if desired.12
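
A minimal sketch of that two-step workflow is shown below; the file name is hypothetical, and the writexl package is only one of several options for .xlsx output.

library(readr)    # tidyverse-style CSV import
library(writexl)  # one of several packages that write .xlsx files

# Hypothetical file name; use whatever name the IPEDS Peer Analysis System assigns
ipeds_data <- read_csv("ipeds_download.csv")

# The imported object is already rectangular (rows = institutions, columns =
# variables), so conversion to .xlsx, if desired, is a single call
write_xlsx(ipeds_data, "ipeds_download.xlsx")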
Using available six-digit Classification of Instructional Programs (CIP) informa-
tion about programs of study (e.g., academic programs, academic majors, curricula,
and disciplines) and CIP-related data gained from the IPEDS Peer Analysis System,
the following programs of study were selected for emphasis in this lesson as a sam-
ple, with each identified as being among those that require some degree of knowl-
edge about biostatistics in relation to the overarching use of tools associated with
data science. These programs of study represent many possible levels of instruction,
ranging from short-course certificate programs to multi-year terminal degree gradu-
ate and professional programs, but all have some degree of focus on the efficient
management and use of data (e.g., data science) and the general discipline of
biostatistics:13
11 The IPEDS Peer Analysis System offers a Save session option for replication of consistently
structured queries, but by no means does this option begin to equal the convenience and quality
assurance of an R-based API data query process based on reproducible syntax, as will be seen in
later lessons with other data resources.
12 These datasets are in wide format, but tidyverse ecosystem tools can be used to put the data into
long format, a tidy approach to dataset structure. This practice is demonstrated multiple times in
this text, in later lessons.
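
As a minimal sketch of that wide-to-long reshaping, using a small hypothetical extract rather than an actual IPEDS download:

library(tidyr)

# Hypothetical wide-format extract, for illustration only
ipeds_wide <- data.frame(
  institution = c("College A", "College B"),
  enroll_2019 = c(12000, 5400),
  enroll_2020 = c(11500, 5100)
)

# One row per institution-year combination: a tidy, long-format structure
ipeds_long <- pivot_longer(ipeds_wide,
                           cols         = starts_with("enroll_"),
                           names_to     = "year",
                           names_prefix = "enroll_",
                           values_to    = "enrollment")
ipeds_long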
13 From among many possible resources, refer to the United States Department of Homeland
Security STEM Designated Degree Program list (https://www.ice.gov/sites/default/files/documents/stem-list.pdf) for an audit of the more than 500 CIPs associated with STEM (science, tech-
nology, engineering, and mathematics) that are recognized by this federal executive cabinet-level
department. Many CIPs on this extensive list require exposure to data science and acquaintance
with biostatistics.
14 This listing and many others in this text that are similar should not be viewed as a word-processed
table. Instead, text of this type represents output from some type of computing activity, with some
degree of modification for presentation purposes. Accordingly, the text shows in red, following
along with the identification of input and output used throughout this text.
15 At the time these figures were prepared, Academic Year 2018–2019 was the ending point for the
availability of CIP-specific IPEDS data on completers. Even if the data were updated by the time
this text becomes available, it is best to question if completions data for Academic Year 2019–2020
and onward for the next few years are typical and appropriate for year-by-year comparisons. From
among the many social outcomes of reactions to COVID-19, higher education enrollment patterns
were greatly stressed as communities went into lockdown, students went home, and many students
suspended their studies, moved on, and may not return to higher education. It will likely be a few
years before postsecondary education enrollment patterns, throughput, and completions return to
expected patterns.
Challenge: Syntax used to generate figures associated with these data is pre-
sented in Addendum 1. Use the syntax to replicate figures for all 30 CIP six-digit
academic majors. More detail on syntax, such as what is shown in Addendum 1, will
be provided in later lessons. In these early stages of acquaintance with data science
and R, focus on process and come back to the syntax later if needed.
Within the United States Department of Labor, the Employment and Training
Administration has responsibility for Occupational Information Network (O*NET),
a hierarchical coding system that describes occupations. Not surprisingly, the
Occupational Information Network process parallels the way the previously
described National Center for Education Statistics, within the United States
Department of Education, also maintains a hierarchical coding system that describes
programs of study, the Classification of Instructional Programs (CIP).16,17
It is beyond the purpose of this lesson to provide detailed instruction for use of
the Occupational Information Network (O*NET), but for now consider how this
process deconstructs jobs into occupational characteristics and worker requirements
that address three key components for each job:
• The term knowledge represents the cognitive capabilities needed to succeed at an
expected level of performance.
• The term skills is related to the psychomotor behaviors needed to succeed at an
expected level of performance.
• The term abilities refers to the affective dispositions needed to succeed at an
expected level of performance.
16 Similar hierarchical coding systems are used throughout the many departments, bureaus, agen-
cies, offices, etc. associated with the United States federal government. The data maintained by
these entities should always be considered for use, either as a primary source or as a proxy that at
least provides guidance, if possible. Data scientists quickly learn about the use of data resources
that are convenient and freely available if the available data meet needs.
17 External resources related to biostatistics that allow access to data that are highly vetted, reliable,
and valid are identified in a later lesson.
18 Observe how Occupational Information Network (O*NET) codes use a different numbering
sequence than the Classification of Instructional Programs (CIP) codes. Even so, there is structure
(e.g., hierarchy) to O*NET codes, and with experience it is possible to learn more about the
requirements for each specific job, regardless of how the job is coded by a local employer. Take the
time to review, for at least a few selected jobs, highly detailed information on: (1) employment
estimate and mean wage, (2) percentile wage estimates, (3) industries with the highest concentration
of employment, (4) top paying industries, (5) states with the highest employment level, (6) states
with the highest concentration of jobs and location quotients, (7) top paying states, (8) metropolitan
areas with the highest employment level, (9) metropolitan areas with the highest concentration of
jobs and location quotients, (10) top paying metropolitan areas, (11) nonmetropolitan areas with the
highest concentration of jobs and location quotients, and (12) top paying nonmetropolitan areas.
OCC_CODE OCC_TITLE
=============================================================
15-2041 Statisticians
17-2031 Bioengineers and Biomedical Engineers
19-1011 Animal Scientists
19-1012 Food Scientists and Technologists
19-1013 Soil and Plant Scientists
19-1020 Biological Scientists
19-1021 Biochemists and Biophysicists
19-1022 Microbiologists
19-1023 Zoologists and Wildlife Biologists
19-1029 Biological Scientists, All Other
19-1032 Foresters
19-1040 Medical Scientists
19-1041 Epidemiologists
19-4010 Agricultural and Food Science Technicians
19-4021 Biological Technicians
19-4040 Environmental Science and Geoscience Technicians
19-4092 Forensic Science Technicians
25-1040 Life Sciences Teachers, Postsecondary
25-1041 Agricultural Sciences Teachers, Postsecondary
25-1042 Biological Science Teachers, Postsecondary
25-1070 Health Teachers, Postsecondary
25-1072 Nursing Instructors and Teachers, Postsecondary
29-1021 Dentists, General
29-1041 Optometrists
29-1051 Pharmacists
29-1131 Veterinarians
29-1141 Registered Nurses
29-1151 Nurse Anesthetists
29-1211 Anesthesiologists
29-1215 Family Medicine Physicians
29-1216 General Internal Medicine Physicians
29-1218 Obstetricians and Gynecologists
29-1221 Pediatricians, General
29-1228 Physicians, All Other; and Ophthalmologists, Except Pediatric
29-1248 Surgeons, Except Ophthalmologists
-------------------------------------------------------------
Use the online resources associated with the Occupational Information Network
(https://www.onetonline.org/) to conduct job-by-job research. Regarding jobs in
data science and biostatistics, the following behaviors and dispositions seem to
19 These behaviors and dispositions are organized in alphabetical order, to avoid any attempt to
suggest that there is a hierarchy of importance or sequence of these behaviors and dispositions.
Again, these behaviors and dispositions go across the many jobs associated with data science and
biostatistics. Use the Occupational Information Network for specific job-by-job details and work
activities.
20 Give special attention to the following six-digit CIP codes and how these programs of study
crosswalk to O*NET-identified jobs: CIP 30.7001 Data Science, General; CIP 51.1201 Medicine;
and CIP 51.2706 Medical Informatics.
OCC_CODE OCC_TITLE
=============================================================
11-9121.00 Natural Sciences Managers
11-9121.01 Clinical Research Coordinators
11-9121.02 Water Resource Specialists
15-2041.00 Statisticians
15-2041.01 Biostatisticians
19-1029.00 Biological Scientists, All Other
19-1029.01 Bioinformatics Scientists
19-1029.02 Molecular and Cellular Biologists
19-1029.03 Geneticists
19-1029.04 Biologists
19-1042.00 Medical Scientists, Except Epidemiologists
19-4021.00 Biological Technicians
25-1022.00 Mathematical Science Teachers, Postsecondary
25-1071.00 Health Specialties Teachers, Postsecondary
-------------------------------------------------------------
The depth of job-related statistics associated with data science and biostatistics
that are available from the Bureau of Labor Statistics should be explored in detail
when considering possible career path decisions. Again, using May 2020 statistics,
keeping in mind the national and worldwide impact of COVID-19 on economic
outcomes, including employment numbers and salaries, look at the following table
of total employment (e.g., TOT_EMP) and annual median salary (e.g., A_MEDIAN)
for selected occupations (e.g., OCC_CODE and OCC_TITLE) related to data sci-
ence and biostatistics, using BLS Occupation Codes (https://www.bls.gov/oes/current/oes_stru.htm):21
Using federal resources, look at the following output that highlights total employ-
ment and median salary for jobs in data science. Again, consider the prior comments
on displacement in the workplace due to the COVID-19 pandemic and why data
published in May 2020 are used, since it is assumed that this point in time provides a
stable summary of employment.
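
The output itself is produced in Addendum 2; the sketch below shows, in general terms, how such a summary might be assembled in R. The file name national_M2020_dl.xlsx is taken from the note that follows, while the sheet position and the example occupation codes are assumptions.

library(readxl)   # read .xlsx files
library(dplyr)    # filter rows and select columns

# Hypothetical import of the downloaded BLS file; verify the sheet position
# against the data dictionary described in the note below
oes <- read_excel("national_M2020_dl.xlsx", sheet = 1)

selected_codes <- c("15-2041", "19-1021", "19-1041", "29-1141")  # example codes

oes %>%
  filter(OCC_CODE %in% selected_codes) %>%
  select(OCC_CODE, OCC_TITLE, TOT_EMP, A_MEDIAN)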
21 See the prior comment about the use of data from either before or at the earliest stages of the
COVID-19 pandemic and the impact of mitigation practices such as lockdowns on data representa-
tion, thus the use of data from 2019 and early 2020, but not beyond.
Fig. 1.1 OCCJobsNationalSalary: annual median salary (vertical axis, $0 to $150,000) for the selected BLS job codes listed above (horizontal axis, Job Code, https://www.bls.gov/oes/current/oes_stru.htm). The syntax for this figure is found in Addendum 2 (C01Fig01OCCJobsNationalSalary.png).
Note: By looking at the files used to generate the data in this part of the les-
son (detailed explicitly in Addendum 2), both offline after downloading the file
and then again after importing the data into R, the characters * and # show in
many columns, columns that should be numeric but are not due to the presence
of these special characters. Using the notes found in the data dictionary (e.g.,
Sheet 2 of the original downloaded file national_M2020_dl.xlsx), it is identified
how the * character is used to show that a wage estimate is not available. It is
also stated that the # character is used to indicate that the wage is equal to or
greater than $100.00 per hour or $208,000 per year. The masking of the exact
wage is a purposeful decision by the Bureau of Labor Statistics and the data at
this upper limit are not readily available to the public at this resource. Although
data imputation is supported in R, it was judged that it would be inappropriate
to impute $100.00 per hour or $208,000 per year since there is no way of knowing
the exact hourly or yearly wage statistics. Accordingly, these data exist, but are
unavailable.22,23
National total employment numbers and median salaries for specific jobs pro-
vide a good start on using the Bureau of Labor Statistics for guidance on career choices
and the many decisions that should be considered when developing a personalized
educational program of study, in preparation for an eventual individualized career
path. Yet, recall that employment is localized and statistics that apply at the national
level may not be representative of state-wide and even more localized levels of
comparison.
The national statistics on employment numbers and salaries provide some degree of
guidance for each selected occupation, but do not forget that salaries are very regionally
specific. As an example, compare the May 2020 mean salary for the job 19-1022
(Microbiologists) for all states. When viewing these statistics, be sure to consider salaries
and localized cost of living, with impacts from local rents, local taxes, local transportation
costs, etc. Additionally, compare more regional salaries using All Data (XLS). Details on
how these comparisons are made, using R, are provided in Addendum 2, with the data
gained from https://www.bls.gov/oes/current/oes_stru.htm, the BLS Occupation Codes
resource.
22 R supports processes to work with data imputation through the use of functions from a few differ-
ent packages, such as: Amelia, brms, mi, mice, VIM, mitml, etc.
23 When using R, these special characters should be removed and replaced by NA, the symbolic
representation of missing data. A simple and effective transformation process is demonstrated in
Addendum 2.
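
Addendum 2 demonstrates the transformation against the downloaded file itself; the general idea, shown here with a small hypothetical data frame, is to convert the special characters to NA before coercing the affected columns to numeric.

library(dplyr)

# Small hypothetical example; the real columns come from the BLS download
bls_wages <- data.frame(
  OCC_CODE = c("15-2041", "29-1248"),
  A_MEDIAN = c("90000", "#"),   # "#" masks wages at or above $208,000 per year
  H_MEAN   = c("44.85", "*")    # "*" indicates no wage estimate is available
)

bls_wages_clean <- bls_wages %>%
  mutate(across(c(A_MEDIAN, H_MEAN),
                ~ as.numeric(na_if(na_if(.x, "*"), "#"))))
# The masked and missing entries are now NA, and the columns are numeric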
Metropolitan Area
=================
19-2041 Miami-Fort Lauderdale-West Palm Beach, FL 61020
19-2041 Orlando-Kissimmee-Sanford, FL 42160
19-2041 Tallahassee, FL 44910
Nonmetropolitan Area
====================
19-2041 South Florida nonmetropolitan area 47580
19-2041 North Florida nonmetropolitan area 43910
-------------------------------------------------------------
• From among the many possibilities, look at the way geographic and sector break-
outs are available and compare STEM (science, technology, engineering, and
mathematics) and non-STEM salaries.
• Then, review the salary-type datasets associated with entry-level educational
requirements (e.g., (1) no formal educational credential, (2) high school diploma
or equivalent, (3) some college, no degree, (4) postsecondary nondegree award,
(5) associate’s degree, (6) bachelor’s degree, (7) master’s degree, and (8) doc-
toral or professional degree), showing that additional education nearly always
results in increased salary.
Pre-ENIAC (1946)
24 Challenge: Search on the names Jean Jennings Bartik, Frances Bilas, Ruth Lichterman, Kathleen
McNulty, Betty Snyder, and Marlyn Wescoff. What was their role regarding ENIAC?
• Early-form punch cards were developed to tabulate 1880 United States Census
Bureau data, and results were so satisfactory (e.g., accuracy and speed) that the
1890 Census was tabulated using this process, continuing until the 1950s, when
more modern computers took over this massive task.
• The Atanasoff Berry Computer (ABC) was developed at Iowa State College
(Ames), later Iowa State University, in the late 1930s and very early 1940s.
• Colossus was a set of computing devices developed during World War II and was
used by British codebreakers in support of the war effort. It is considered, by
some, as an electronic digital computer that could be programmed.
As important as ENIAC may have been, it had practical limitations that challenged
uses beyond its original purpose. However, within a few years after the development
of ENIAC, by the early to mid-1950s, what we now call mainframe computers were
being used by businesses, not only for mathematical calculations but for back-office
activities as mundane, but important, as billing, fiscal accounting, payroll, etc. Look
into development of the following for a more complete picture of these early days
in computing:
• UNIVAC (Universal Automatic Computer) is considered by many as the first com-
mercial computer, with sales to large companies beginning in 1951.
• Going forward throughout the 1960s and 1970s, the commercialization of main-
frame computers continued. The early mainframe machines became smaller,
faster, easier to maintain, easier to program, and had greater and greater memory
and data capacity.25 They were still large and complex by today’s standards, but
with each iteration improvements made them available to mid-size and increas-
ingly smaller companies.
If there is disagreement on just what was the first large-scale computer, there are
equally different views on the development of small, personal computers:
• By the late 1960s and going on to the 1970s, the first scientific pocket calculators
became available to the public. Dependence on slide rules declined as hand-held
computing devices became part of the tool kit of many engaged in mathematics
and the sciences.
25 Search on the term Moore’s law, and see if it applies, then and now.
• Evolving away from calculators, by the 1970s, there was intense competition
between different companies in the development of what would be considered
early prototype personal computers, like competition in the automobile industry
during the early 1900s. A few companies prospered while other companies failed.
• Eventually, many manufacturers of personal computers started to stabilize, and
by the mid to late 1970s and into the early 1980s, small (e.g., desktop size) per-
sonal computers became increasingly available to the public. Software was also
developed such that by the mid-1980s, these personal computers became afford-
able and supported complex mathematical and statistical calculations, word pro-
cessing, database management, graphics, gaming, and access to distributed
computing systems by use of modems, etc. – all for use by an individual. January
22, 1984, is considered a bellwether date for personal computing when the first
national television advertisement for a personal computer was aired during Super
Bowl XVIII, with an estimated audience of nearly 80 million viewers.
Increasingly, computers had arrived, and people often wondered how they ever
lived without them, like years earlier when disruptive technology such as univer-
sal postal service, railroad, telegraph, telephone, automobile, airplane, radio, and
television first gained public acceptance.
public. By using the World Wide Web and a GUI Web browser, actions could now
be implemented by using a mouse and a simple click against a visual hyperlink.
Graphics were soon deployed in full color, video (at first limited to a few minutes
but soon in full length) could be played, and real-time multi-user communication
with others was now possible. In turn, commercial services were developed that
enhanced the user experience and were gladly adopted by the public.
In the early 1990s and onward, because of the Internet and the World Wide Web,
it was now possible for those with a user account, a personal computer (later, a
smart phone), a modem (soon to be replaced by faster capabilities), and network
access to communicate with others worldwide. At the earliest beginning of the
Internet, most communication was serious and focused on research and academic
activities. It did not take long, however, for topical bulletin boards, listservs, and
other utilities to gain acceptance by members of the public, for personal use and
enjoyment.26 Soon after, intense commercialization was deployed, where user expe-
riences were tracked, data related to these experiences were harvested, algorithms
were applied against the harvested data, and commercialization became the norm,
initially with little to no oversight by any authoritative agency. Even with what
many view as obtrusive commercialization, use of the Internet and the World Wide
Web is now embedded into daily actions (e.g., banking, education, health mainte-
nance, real-time and delayed time communication, shopping, and transportation and
logistics), and there are many who would now find it difficult to successfully use
something as simple as a tri-fold paper map instead of using one of the many map-
type apps for personal transportation on unknown streets and highways.
As mainframe computers gained popularity in the late 1950s and well into the
1970s, it was common to have all mainframe computing facilities on campus at one
central building and users in other buildings had access to the mainframe from
campus-based distributed, but restricted, terminals – hardware that was little more
than a black and white television-like monitor and keyboard. All data and software
were at the centrally located mainframe, and connection was restricted to the use of
physical cables that were often limited to only a minimal distance. As the Internet
and World Wide Web grew in use, there were those who remembered this paradigm,
but now saw how the Internet and the World Wide Web could be used to connect to
vastly superior computing facilities hundreds, if not thousands, of miles
26
For unexplained reasons, online videos of dancing cats were among the first entertainment experiences for many early Internet users, and these funny videos helped entice the public to try out both entertainment-oriented and more serious video resources that were becoming available.
away – evolving into what is now called cloud computing, with possibly thousands
of devices placed on racks at what are referred to as server farms.27
The term cloud computing was possibly used for the first time in the late 1990s, but it was not until a few years later that the term was adopted by large companies offering this type of distributed service. As is typical of jargon, there were so many competing views of cloud computing, the cloud, etc. that by 2011 the federal government's National Institute of Standards and Technology, a unit within the United States Department of Commerce, provided clarity by offering a definitive definition of the term, available at https://csrc.nist.gov/publications/detail/sp/800-145/final.
A key concept associated with cloud computing is the term service and the willingness of users to consume services on a pay-as-you-go basis. Cloud computing service is often viewed as the following:
• Software as a Service (SaaS), where an individual can use cloud-provided
software
• Platform as a Service (PaaS), where an individual can use cloud-provided utili-
ties and tools
• Infrastructure as a Service (IaaS), where an individual can use cloud-provided
resources for storage and processing
It is not suggested that cloud computing is merely a modern update to the use of
restricted terminals and their connection to a campus-based mainframe computer.
Even so, there are advantages to distributed computing, whether in the 1960s or the
2020s. Of course, when there are disruptions, which occasionally happen when con-
struction equipment cuts dedicated communication lines, such as fiber optic cable,
the disadvantages of distributed computing also become evident.
Data Types Supported by R
Data science is focused, not surprisingly, on data – data in its many forms. It is common to think of data as being either character (e.g., A, cat, Dog) or numeric (e.g., 1,
5, 1.5). R (as well as many other statistically oriented languages) can certainly
accommodate both character data and numeric data, but R can also accommodate
many other types of data. The following presentation is a mere introduction to the
many different data types supported by R, with more detail provided throughout the
addenda for this lesson as well as other lessons in this text.
27
Server farms, with thousands of computing devices in operation, consume vast amounts of elec-
tricity and generate extreme amounts of heat, requiring additional electricity for cooling. There is
a growing movement in cloud computing to locate server farms at locations that are naturally cool
throughout the year, if not cold, with electricity often generated by either geothermal or hydro
technology. The excess heat generated by the many servers is often captured and ported to other
buildings for heating purposes, adding to efficiencies of use.
R has been developed so that actions are often performed against object variables. That said, it is immediately necessary to demonstrate a few different ways
object variables are assigned values in R. Following this brief demonstration of
assignment there will be a discussion on why the <- assignment operator is used
throughout this text.
# Assignment using <-
Variable123 <- 123
# In this example, create a numeric object variable called
# Variable123 using the <- assignment operator.
Variable123
[1] 123
# Assignment using =
Variable456 = 456
# In this example, create a numeric object variable called
# Variable456 using the = assignment operator. Be sure to
# notice where there is one and only one = character.
Variable456
[1] 456
base::assign("Variable789", 789)
# In this example, create a numeric object variable called
# Variable789 using the base::assign() function. Be sure
# to notice that there are double quote marks around the
# named object variable.
Variable789
[1] 789
Regarding these examples, it is common to say that each object variable is assigned a value, although there are those who might go as far as to say is temporarily assigned the value instead of is assigned the value. The point here is to consider the term variable.
Variables, as the term suggests, may vary in value. Accordingly, the term assign-
ment is the better term to use when speaking of variable values.
It also needs to be mentioned that throughout this text, of the three assignment
operators previously demonstrated:
• The <- assignment operator is preferred, and other methods will be rarely, if ever,
used, at least in this text.
• The = character, as used for assignment, is too easily confused with the use of == (e.g., two equal signs with no space between them), which is used to test for equality.
• The base::assign() function is too cumbersome.
Look into the life of George Boole (1815–1864) and his 1854 text An Investigation
of the Laws of Thought On Which are Founded the Mathematical Theories of Logic
and Probabilities, and think about all of the many times some type of decision was
made that reduced to a simple FALSE or TRUE outcome. This outcome, FALSE or
TRUE, is the basis for Boolean logic and comparisons.
The use of Boolean logic is then enhanced by a formal order of operations, where
there is structure to the evaluation of comparisons. A few simple examples will
serve as an introduction to this critically important aspect of data science.28
28
Challenge: To save space, the outcomes of the R-based syntax presented in this section are not
always presented. However, either duplicate or copy and paste the syntax into an active R session
and replicate findings. This self-directed activity is among the many paths used in this text for
learning R and later the tidyverse ecosystem.
X <- 12
Y <- 15
Z <- 18
# Create three numeric object variables for use in the
# comparisons that follow.
X
[1] 12
Y
[1] 15
Z
[1] 18
X > Y
# The value of X is greater than the value of Y:
# FALSE or TRUE.
[1] FALSE
Y < Z
# The value of Y is less than the value of Z:
# FALSE or TRUE.
[1] TRUE
(X + Y) > Z
# The value of the summation of X plus Y is greater
# than the value of Z: FALSE or TRUE. Give special
# attention to the use of parentheses and how the use
# of encapsulating parentheses remove any possibility
# of incorrect grouping, or the summation of X plus Y
# in this simple example.
[1] TRUE
X != Y
# R supports many comparisons, including not equal
# (e.g., !=) in this example.
[1] TRUE
DogID <- c(
"D01", "D02", "D03", "D04", "D05", "D06",
"D07", "D08", "D09", "D10", "D11", "D12",
"D13", "D14", "D15", "D16", "D17", "D18")
# Create an object variable consisting of IDs.
DogBreed <- c(
"Beagle", "Beagle", "Beagle", "Beagle", "Beagle", "Beagle",
"Lab", "Lab", "Lab", "Lab", "Lab", "Lab",
"Terrier", "Terrier", "Terrier", "Terrier", "Terrier",
"Terrier")
# Create an object variable consisting of breeds.
DogGender <- c(
"Female", "Female", "Female", "Male", "Male", "Male",
"Female", "Female", "Female", "Male", "Male", "Male",
"Female", "Female", "Female", "Male", "Male", "Male")
# Create an object variable consisting of genders.
DogWeightLb <- c(
18, 20, 21, 22, 26, 23,
44, 42, 46, 44, 50, 58,
11, 13, 12, 14, 15, 13)
# Create an object variable consisting of weights (Lbs.).
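The step that joins these four object variables into the dataframe DogIDBreedGenderWeightLb.df is not reproduced above. A minimal sketch, assuming the same as.data.frame() and cbind() pattern described later in this lesson:
DogIDBreedGenderWeightLb.df <- as.data.frame(
  cbind(DogID, DogBreed, DogGender, DogWeightLb))
# Join the four object variables into one dataframe. Note
# that cbind() returns a character matrix, which is why
# DogWeightLb is coerced back into numeric format
# immediately below.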
base::attach(DogIDBreedGenderWeightLb.df)
utils::str(DogIDBreedGenderWeightLb.df)
DogIDBreedGenderWeightLb.df$DogWeightLb <-
as.numeric(DogIDBreedGenderWeightLb.df$DogWeightLb)
# DogWeightLb needs to be put into numeric format.
base::attach(DogIDBreedGenderWeightLb.df)
utils::str(DogIDBreedGenderWeightLb.df)
DogIDBreedGenderWeightLb.df
base::attach(DogIDAllBreedsFemaleWeightLb.df)
DogIDAllBreedsFemaleWeightLb.df
base::attach(DogIDBeagleMaleWeightLb.df)
DogIDBeagleMaleWeightLb.df
base::attach(DogIDLabORTerrierFemaleGTE13Lb.df)
DogIDLabORTerrierFemaleGTE13Lb.df
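The selection syntax that produced these three subsets is not reproduced above. A minimal sketch of how such selections might be made with Base R Boolean comparisons, with the row-selection conditions inferred from the subset names:
DogIDAllBreedsFemaleWeightLb.df <-
  DogIDBreedGenderWeightLb.df[
    DogIDBreedGenderWeightLb.df$DogGender == "Female", ]
# All breeds, female dogs only.
DogIDBeagleMaleWeightLb.df <-
  DogIDBreedGenderWeightLb.df[
    DogIDBreedGenderWeightLb.df$DogBreed == "Beagle" &
    DogIDBreedGenderWeightLb.df$DogGender == "Male", ]
# Beagles that are male.
DogIDLabORTerrierFemaleGTE13Lb.df <-
  DogIDBreedGenderWeightLb.df[
    (DogIDBreedGenderWeightLb.df$DogBreed == "Lab" |
     DogIDBreedGenderWeightLb.df$DogBreed == "Terrier") &
    DogIDBreedGenderWeightLb.df$DogGender == "Female" &
    DogIDBreedGenderWeightLb.df$DogWeightLb >= 13, ]
# Labs or Terriers, female, weighing 13 pounds or more.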
This brief discussion on the use of Boolean logic and associated tools that sup-
port selection should be enhanced by further study. It is always desirable to check
outcomes against a known quantity, just to be sure that the initial logic and all asso-
ciated syntax are correct. Although the term Pencil Tracing is now rarely used, it is
still a desired practice. Look at the syntax for selection and then look at the dataset.
For small sections of syntax, use a pencil and trace each action against the dataset
and see if the syntax and data would yield expected outcomes. Quality assurance
can take many forms and this degree of review is well worth the effort.
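As one small example of checking against a known quantity, the counts in the original dataframe can be tabulated and compared with the number of rows returned by a selection; the expected counts follow directly from the data as entered:
base::table(DogIDBreedGenderWeightLb.df$DogBreed,
            DogIDBreedGenderWeightLb.df$DogGender)
# Each breed should show 3 Female and 3 Male dogs.
base::nrow(DogIDAllBreedsFemaleWeightLb.df)
# With 3 female dogs in each of 3 breeds, 9 rows are
# expected in the female-only subset.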
Numeric Data
R supports many different data types, including different numeric data types. The
two most frequently encountered numeric data types are: (1) decimal numbers or
real numbers, and (2) integer numbers (e.g., whole numbers). It should be men-
tioned that integers are often used as codes for otherwise factor-type data (e.g.,
Gender: use 1 as code for Female and use 2 as code for Male). A brief explanation
of the two numeric data types follows.
Decimal or real numeric data are often expressed in some type of decimal format
(e.g., 1.23, 45.67, and 890.12). Consider the weight of lab rats (Rattus norvegicus
domestica) as a typical example of how decimal notation is often expressed in R:
RatWeightKg <- c(0.65, 0.72, 0.45, 1.05, 0.91)
# Create a numeric object variable of rat weights (Kg).
RatWeightKg
[1] 0.65 0.72 0.45 1.05 0.91
utils::str(RatWeightKg)
# Show the internal structure of the named object.
 num [1:5] 0.65 0.72 0.45 1.05 0.91
base::typeof(RatWeightKg)
# Determine the internal storage mode of the named object.
[1] "double"
base::mode(RatWeightKg)
# Determine and/or set the internal storage mode of the
# named object.
[1] "numeric"
The output gained from all three functions helps confirm the true nature of an object variable that is in decimal notation.30
29
Preemptively, it must be mentioned that an oddity of R is that the base::mode() function is not used as a measure of central tendency, unlike the base::mean() function or the stats::median() function. As a measure of central tendency, the mode (e.g., the most frequently occurring value in a set of values) is accommodated by using functions from external R packages, such as the DescTools::Mode() function, among many other complementary functions that serve the same purpose.
30
Any further discussion would go beyond the purpose of this lesson, but recall that decimal notation, as used in this example, is not universal and an experienced data scientist may encounter the use of commas instead of decimal points to express what is otherwise called decimal notation. R can accommodate this practice (e.g., the use of commas instead of decimal points), if needed.
There are many times when numbers do not have a decimal format and cannot have
a decimal format – whole numbers, otherwise known as integers. Perhaps the most
common way to express this notion is to say that there are five lab rats represented
in the object variable RatWeightKg. In this context, consider the integer data as
whole numbers – numbers that cannot have fractional representation, such as five
lab rats in this example and not 5.0 lab rats. Along with their use as IDs or for head-
counts (e.g., five lab rats), it is also quite common to see integers used as a
numeric code.
Consider a situation where the numbers 1 and 2 are used to designate gender:
• 1 ..... Female
• 2 ..... Male
The two numbers, 1 and 2, by no means represent anything other than a useful
code and are often used to reduce typing for when data are hand-entered into a
spreadsheet. R can easily accommodate these numbers, in the context as factor-type
integer codes.31
31
As a good programming practice (gpp), when there is no compelling reason to do otherwise, it may be best to have codes organized so that the terms they represent show in alphabetical order, thus 1 for Female and 2 for Male. Of course, there are times when an ordinal approach may be best for the assignment of integer factor-type codes, such as 1 (Small), 2 (Medium), and 3 (Large), which is decidedly not an alphabetical ordering but is instead an ordering by size. These issues are best communicated in a code book.
RatGender <- as.factor(c(1, 1, 1, 2, 2))
# Create a factor-type object variable of gender codes,
# where 1 = Female and 2 = Male.
RatGender
[1] 1 1 1 2 2
Levels: 1 2
utils::str(RatGender)
 Factor w/ 2 levels "1","2": 1 1 1 2 2
base::length(RatGender)
# Confirm the headcount (e.g., number of)
# rats assigned a gender.
[1] 5
Follow along with this simple coding scheme, where integers are used as an iden-
tification code for each individual lab rat:
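The ID assignments themselves are not shown above. A minimal sketch, consistent with the utils::str() output presented a little later in this lesson (five rats, with IDs 1 through 5):
RatID <- c(1, 2, 3, 4, 5)
# Create an object variable consisting of integer IDs,
# one ID for each lab rat.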
Now that the three object variables have been separately created, join them
together into one common dataframe, by wrapping the as.data.frame() function
around the cbind() function. Then, preemptively put each object variable into desired
data type:
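The construction of the dataframe itself is not reproduced here. A minimal sketch, following the as.data.frame() and cbind() pattern just described, with the data type conversions inferred from the utils::str() output shown below:
Rat.df <- as.data.frame(cbind(RatID, RatGender, RatWeightKg))
# Join the three object variables into one dataframe.
Rat.df$RatID       <- as.factor(Rat.df$RatID)
Rat.df$RatGender   <- as.factor(Rat.df$RatGender)
Rat.df$RatWeightKg <- as.numeric(Rat.df$RatWeightKg)
# Preemptively put each object variable into desired
# data type.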
Rat.df$RatGender.Recode <-
factor(Rat.df$RatGender,
labels=c("Female", "Male"))
levels(Rat.df$RatGender.Recode)
# NOTE: factor(...) and NOT as.factor (...)
# Create an enumerated object variable and put it
# into desired data type.
# Note the alphabetical ordering of the two
# factor-type breakouts: Female and then Male.
base::attach(Rat.df)
utils::str(Rat.df)
'data.frame': 5 obs. of 4 variables:
 $ RatID           : Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
 $ RatGender       : Factor w/ 2 levels "1","2": 1 1 1 2 2
 $ RatWeightKg     : num  0.65 0.72 0.45 1.05 0.91
 $ RatGender.Recode: Factor w/ 2 levels "Female","Male": 1 1 1 2 2
Rat.df
By following along with these actions, the dataset Rat.df has been put into proper
format, where: (1) the integer values for RatID and RatGender serve as codes for
named factor-type breakouts associated with ID and Gender, (2) the numeric values
for RatWeightKg are expressed correctly, following decimal notation, and (3) the
RatGender codes were used to create an enumerated object variable (RatGender.
Recode) that designates text expressions of Female and Male, improving readability
of any future use of this dataset.
As a value-added activity for this simple dataset, use functions associated with
Base R to make a simple barchart of RatWeightKg means broken out by the two
genders. Later, the tidyverse ecosystem will be demonstrated throughout this text
for similar activities, but first, this gentle introduction to Base R is helpful as a guide
to how problem-solving is approached (Fig. 1.2).
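The creation of the enumerated object Mean.RatGender is not reproduced above. A minimal sketch, assuming the same base::tapply() pattern that appears later in this lesson for Mean.DaysOfTreatment:
Mean.RatGender <-
  base::t(base::tapply(Rat.df$RatWeightKg,
    base::list(Rat.df$RatGender.Recode), base::mean))
# Create an object named Mean.RatGender that holds the
# mean RatWeightKg value for each gender breakout,
# Female and Male.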
Mean.RatGender
Female Male
[1,] 0.6066667 0.98
par(ask=TRUE)
graphics::barplot(Mean.RatGender, # Plot the enumerated object
main="Mean Weight (Kg) of Lab Rats by Gender",
col=c("red","darkblue"), # Colors for each bar
beside=TRUE, # Place bars side by side
xlab="Gender", # X axis label
ylab="Mean Weight (Kg)", # Y axis label
ylim=c(0.0, 1.1), # Y axis scale
font.axis=2, # Bold
font.lab=2, # Bold
cex.axis=1.1, # Increase font size
cex.lab=1.2) # Increase font size
text(x=1.5, y=0.675, labels="Mean Weight = 0.607 Kg", font=2)
text(x=3.5, y=1.050, labels="Mean Weight = 0.980 Kg", font=2)
# There is no easy way to know exactly where to place the
# text, x and y – experiment. Otherwise, look at where the
# text for Mean Weight has been placed for each of the two
# genders in the enumerated object Mean.RatGender and also
# observe how the text font has been put into bold format.
# Fig. 1.2
Fig. 1.2
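The lesson turns next to character (string) data. The assignments that created the string objects used below are not reproduced here. A minimal sketch so that the calls can be replicated; the values for the first four objects are purely hypothetical, while StringTen is written to match the output shown below:
StringExampleDoubleQuotes <- "Text enclosed in double quotes"
StringExampleSingleQuotes <- 'Text enclosed in single quotes'
StringOne <- "A"
StringTwo <- c("A", "B")
StringTen <- LETTERS[1:10]
# LETTERS is a built-in constant; the first ten letters
# yield "A" through "J".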
StringExampleDoubleQuotes
utils::str(StringExampleDoubleQuotes)
# Show the internal structure of the named object.
base::typeof(StringExampleDoubleQuotes)
# Determine the internal storage mode of the named object.
base::mode(StringExampleDoubleQuotes)
# Determine and/or set the internal storage mode of the
# named object.
StringExampleSingleQuotes
StringOne
StringTwo
StringTen
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
Although Base R has many functions that are used to accommodate strings, the
tidyverse ecosystem is far better for this task. The stringr package, which is included
among the packages associated with the core tidyverse, is perhaps among the best
packages for string manipulation. Examples from the stringr package are found in
later lessons in this text.
As an ending comment on the seemingly complex nature of using strings, look at
the following examples, and from this, always remember to check everything in the
pursuit of continuous quality assurance. In this example, consider the complexity of
accommodating a company payroll information system for an employee with the
following name:
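The employee's full name and the construction of the Payroll object are not reproduced here. A minimal sketch so that the comparisons below can be replicated; the first and middle names (Alfred and Mack) are taken from the discussion that follows, while the last name used here is purely a placeholder:
Payroll <- data.frame(
  Name_First  = "Alfred",
  Name_Middle = "Mack",
  Name_Last   = "Oberlin-Smythe") # Placeholder surname only
# A one-row payroll record with separate name fields.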
Payroll
Consider some of the common mistakes made when humans work with string
data, with this example based on names in a payroll information system.
Payroll$Name_First == "Alfred"
# Name_First is Alfred and this expression should
# return TRUE.
[1] TRUE
Payroll$Name_First == "Al"
# But, it is not unexpected that someone using the payroll
# system might enter Al instead of Alfred since the worker
# uses Al as an informal first name and most users do not
# even know his true first name, Alfred.
[1] FALSE
Then, consider the complexity of Mack as a middle name and how many users
would immediately type Mac, instead:
Payroll$Name_Middle == "Mac"
[1] FALSE
However, the last name is perhaps the string most likely to cause confusion and
errors. Look at the many FALSE statements that are generated by seemingly simple
confusion of correct spelling:
32
Social Security Numbers are now rarely used as a requested number for identification purposes, due to public concern about this practice and justified personal identity and security concerns.
Such name-based matching problems do not even begin to consider the issue of information security. A unique numeric code is simply much easier to manage than a character-based set of names.
Many beginning data science students and even entry-level data scientists find time
and dates a frustrating challenge, regardless of the language, including R. The lub-
ridate package, which is associated with the tidyverse ecosystem, is often used to
work with time and dates. For now, this brief introduction on how R accommodates time and dates will depend on tools associated with Base R, keeping the focus on foundational concepts. The use of specialized packages, especially lubridate, will come with more experience with data science and the tidyverse ecosystem.
When using R, it is essential to know that January 1, 1970, is the origin (e.g.,
base) date for counting days.33,34 Negative numbers express the number of days prior
to the origin (January 1, 1970), and conversely, positive numbers express the num-
ber of days after the origin (January 1, 1970):
as.numeric(as.Date('1969-12-31'))
# 1969 December 31
# Wrap the as.numeric() function around the
# as.Date() function, to determine the number
# of days from the R origin date, January 1,
# 1970.
[1] -1
as.numeric(as.Date('1970-01-01'))
# 1970 January 01
[1] 0
as.numeric(as.Date('1970-01-02'))
# 1970 January 02
[1] 1
33
The origin date for R (January 1, 1970) borrows from what is commonly referred to as the arbitrary UNIX epoch time and date of 00:00:00 UTC (Coordinated Universal Time) on Thursday, January 01, 1970.
34
For calculations going back over extreme lengths of time, it may be helpful to know that the
Gregorian calendar is used for UNIX epoch time, as opposed to use of the Julian calendar.
With this brief introduction to the way dates in R are viewed as the number of
days from the origin, in either direction, practice with a few dates to see the cumu-
lative number of dates over time, subtracting the numeric value of a beginning date
from the numeric value of an ending date:
IndependenceDayUSA1776 <- as.Date("1776-07-04")
IndependenceDayUSA1776
[1] "1776-07-04"
as.numeric(IndependenceDayUSA1776)
# Determine the number of days prior to the
# January 1, 1970, origin date, with negative
# values showing direction -- the number of days
# prior to the origin date.
[1] -70672
IndependenceDayUSA2026 <- as.Date("2026-07-04")
IndependenceDayUSA2026
[1] "2026-07-04"
as.numeric(IndependenceDayUSA2026)
# Determine the number of days since the January
# 1, 1970, origin date, with positive values
# showing direction – the number of days since
# the origin date.
[1] 20638
As an interesting quality assurance check, subtract the two dates and see if the output represents approximately 250 years, the sestercentennial (or semiquincentennial), to use the Latin-derived term(s) for a 250-year anniversary.
IndependenceDayUSA2026 - IndependenceDayUSA1776
# Subtract the two dates.
Time difference of 91310 days
91310/365
# Number of days from Jul-04-1776 to Jul-04-2026
# divided by 365 days per year, which is not
# quite right considering the impact of leap
# year and how leap year does not occur every
# four years.
#
# Review the rules regarding leap years to see
# why 1800 and 1900 were not leap years and how
# this issue clouds the precise use of 365 days
# per year in calculation of the number of days
# over long periods of time, such as the dates in
# this example. Even so, this calculation offers
# a good estimate of the number of years from one
# date to another.
[1] 250.1644
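The construction of DaysOfTreatment.df is not reproduced here. A minimal sketch, assuming Begin and End dates for subjects in three treatment groups, with Delta computed as the number of days between the two dates. The specific dates are hypothetical, chosen only so that the group means match the values shown in the figure syntax below (6.33, 14.67, and 40.33 days):
DaysOfTreatment.df <- data.frame(
  Treatment = c("Treatment1", "Treatment1", "Treatment1",
                "Treatment2", "Treatment2", "Treatment2",
                "Treatment3", "Treatment3", "Treatment3"),
  Begin = as.Date(rep("2023-03-01", 9)),
  End   = as.Date(c("2023-03-06", "2023-03-07", "2023-03-09",
                    "2023-03-14", "2023-03-16", "2023-03-17",
                    "2023-04-08", "2023-04-11", "2023-04-12")))
DaysOfTreatment.df$Delta <- as.numeric(
  DaysOfTreatment.df$End - DaysOfTreatment.df$Begin)
# Delta is the number of days of treatment, End minus Begin.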
base::attach(DaysOfTreatment.df)
utils::str(DaysOfTreatment.df)
DaysOfTreatment.df
A graphic of mean days of treatment, from Begin to End, will help emphasize
outcomes from this simple example on the use of dates as the basis for a datum (e.g.,
Delta). The syntax previously used will help guide this graphic, an efficient (e.g.,
tidy) approach to the reuse of R syntax (Fig. 1.3).
Fig. 1.3
Mean.DaysOfTreatment <-
base::t(base::tapply(DaysOfTreatment.df$Delta,
base::list(DaysOfTreatment.df$Treatment), base::mean))
# Create an object named Mean.DaysOfTreatment that holds
# the mean Delta value for each treatment breakout,
# Treatment 1, Treatment 2, and Treatment 3.
Mean.DaysOfTreatment
par(ask=TRUE)
graphics::barplot(Mean.DaysOfTreatment, # Plot the enumerated
main="Mean Days of Treatment", # object
col=c("red","darkblue", "black"), # Colors for each bar
beside=TRUE, # Side by side bars
xlab="Treatment", # X axis label
ylab="Mean Days", # Y axis label
ylim=c(0.0, 50.0), # Y axis scale
font.axis=2, # Bold
font.lab=2, # Bold
cex.axis=1.1, # Increase font size
cex.lab=1.2) # Increase font size
text(x=1.5, y=12.00, labels="Mean Days = 06.33", font=2)
text(x=3.5, y=20.00, labels="Mean Days = 14.67", font=2)
text(x=5.5, y=46.00, labels="Mean Days = 40.33", font=2)
# Experiment with the best place to place text. Additionally,
# bold fonts and large print make it easier to view and
# understand outcomes presented in graphical format.
# Fig. 1.3
Up to this point, only default date formats have been presented. There are many
ways that dates can be presented in R, and there are many resources on this topic.
As a simple example, look at the way IndependenceDayUSA1776 can be presented
with slight adjustment in format:
IndependenceDayUSA1776
# Default presentation of a date
[1] "1776-07-04"
Going beyond the number of days from an origin date and how this datum is used
for different calculations, consider how R accommodates minutes and hours:35
January012023OneMinutePastMidnight <-
as.POSIXct("2023-01-01 00:01:00")
January012023OneMinutePastMidnight
December312023OneMinuteBeforeMidnight <-
as.POSIXct("2023-12-31 23:59:00")
December312023OneMinuteBeforeMidnight
December312023OneMinuteBeforeMidnight -
January012023OneMinutePastMidnight
# Subtract the dates.
Although far more could be presented about time and dates when using R, it
should be obvious that this topic is possibly one of the most challenging tasks in
data science given the complexity of how time and dates are seen throughout
the world:
• Does the notation 07/10/23 mean July 10, 2023, or does it instead mean October
07, 2023? Both interpretations could be correct, based on geographic location
and local views on how dates are expressed, and some type of context is needed
to be sure of the correct meaning.
• There are 365 days in a year, right? Perhaps not – How does this simple constant
of 365 account for leap years?
• There is a leap year every four years, right? Perhaps not – How does this simple
constant account for the century years 1700, 1800, and 1900, which when using
the leap year formula cannot be divided evenly by 400?
35
As an independent activity, at the R prompt key help(as.POSIXct) to learn more about this otherwise somewhat complex topic, which with careful thought is not as complex as it may first appear.
• There are 24 hours each day, right? Perhaps not – How does this simple constant account for the days when, twice yearly, there is a change from Daylight Saving Time to Standard Time and from Standard Time to Daylight Saving Time?
• There are 60 seconds each minute, right? Perhaps not – How does this simple constant account for the rare, but real, leap second that is occasionally added to account for variation in the earth's rotation?
• There are 24 time zones, right? Perhaps not – How does this simple constant
account for time zone lines that clearly do not follow exact longitudes, local
governments that do not account for changing the clock twice yearly, and local
time zones that instead follow 30 minute and 45 minute offsets?
In summary for this section, context is the key to working correctly with dates
and time, for all languages and not only R:
• Know the origin date (January 1, 1970, for R).
• Confirm that leap years are accounted for correctly, which should not be auto-
matically assumed for all languages and software applications.
• Learn the specific notation for how dates are presented, locally and for the wider
international R community.
With practice, R can provide excellent results when time and dates are used, but
careful attention and practice are especially critical for this area.
Missing Data
The reality of data science is that it is only the rare dataset that is complete, with no
missing data. Data can be missing for many reasons:
• A subject was not available for measurement at the appointed time (e.g., an unex-
pected snowstorm kept some subjects from showing up at the clinic where mea-
surements are taken).
• A subject was available for measurement at the appointed time but could not or
would not be measured (e.g., a frightened large animal with sharp hooves and
horns could not be weighed due to continuous and possibly dangerous movement).
• A subject was available for measurement at the appointed time, the measurement
was recorded, but for unknown reasons, the recorded datum did not appear in the
final dataset (e.g., an unsecured folder of time-specific paper-based field notes
that blow away across an open field, never to be retrieved, when a car door is
opened).
• A subject was available for measurement at the appointed time, the measurement
was recorded, the datum appeared in the final dataset, but the recorded datum
was either illogical or so totally out of range that it was not an outlier but was
instead viewed as either a misreading or a data entry error (e.g., systolic blood
pressure measurements were inadvertently recorded as weight in pounds for
some subjects, but not all, by a poorly supervised trainee).
There is seemingly no end to the reasons why data can be missing. R has structures that can accommodate missing data, and a key part of data science is working around the unfortunate, but not at all unexpected, reality of missing data.
Consider a simple example of how missing data are treated when using R, know-
ing that the nomenclature is clearly different than what is used in other languages:
XNoMissingData <- c(10, 20, 30, 40, 50)
XNoMissingData
[1] 10 20 30 40 50
XMissingData <- c(10, NA, 30, 40, 50)
XMissingData
[1] 10 NA 30 40 50
36
Using R, the symbol NA equates to the term Not Available. The symbol NaN equates to the term
Not a Number.
37
For those who wish to explore this topic in more detail, use available resources such as keying help(NameOfFunction) at the R prompt to learn more about the following terms, as used in R: NULL, NA, NaN, and Inf and -Inf (positive and negative infinity, along with the related is.finite() and is.infinite() functions).
base::mean(XNoMissingData)
# Calculate the mean of a dataset where
# there are no missing data.
[1] 30
base::mean(XMissingData)
# Calculate the mean of a dataset where
# there is one missing datum, but where
# there is no accommodation for missing
# data.
[1] NA
base::mean(XMissingData, na.rm=TRUE)
# Calculate the mean of a dataset where
# there is at least one missing datum,
# using the na.rm=TRUE argument.
[1] 32.5
base::mean(XNoMissingData)
# Calculate the mean of a dataset where
# there are no missing data, without
# use of the na.rm=TRUE argument.
[1] 30
base::mean(XNoMissingData, na.rm=TRUE)
# Calculate the mean of a dataset where
# there are no missing data, but with
# use of the na.rm=TRUE argument.
[1] 30
base::summary(XMissingData)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   10.0    25.0    35.0    32.5    42.5    50.0       1
The base::is.na() function is also used, at least for smaller datasets, to gain a
sense of exactly which datapoints have a missing value:
base::is.na(XMissingData)
[1] FALSE  TRUE FALSE FALSE FALSE
Create a more embellished dataset with missing values to see how missing data
are accommodated in multiple columns of a dataset:
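The data entry step for this dataset is not reproduced here. A minimal sketch, with values that are hypothetical but chosen to be consistent with the outputs shown below (a mean weight of 163 pounds among non-missing values, four Female and four Male subjects, one missing weight, and one missing gender):
XDatasetWithMissingValues.df <- data.frame(
  Subject  = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  WeightLb = c(118, 132, 146, 155, NA, 168, 177, 190, 218),
  Gender   = c("Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male", NA))
# One missing weight (NA) and one missing gender (NA).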
XDatasetWithMissingValues.df$Subject <-
as.factor(XDatasetWithMissingValues.df$Subject)
XDatasetWithMissingValues.df$WeightLb <-
as.numeric(XDatasetWithMissingValues.df$WeightLb)
XDatasetWithMissingValues.df$Gender <-
as.factor(XDatasetWithMissingValues.df$Gender)
base::attach(XDatasetWithMissingValues.df)
utils::str(XDatasetWithMissingValues.df)
XDatasetWithMissingValues.df
# As an interesting observation, notice how NA shows
# for missing numeric values whereas <NA> shows for
# missing factor values.
base::mean(XDatasetWithMissingValues.df$WeightLb, na.rm=TRUE)
# Use the na.rm=TRUE argument to accommodate missing data.
[1] 163
base::table(XDatasetWithMissingValues.df$Gender)
# Look how NA does not show at all when using
# the base::table() function against a factor-
# type object variable.
Female Male
4 4
base::summary(XDatasetWithMissingValues.df)
# Look how summary provides a fairly complete
# view of the data, numeric and factor,
# including identification of missing values.
If there were a desire to remove all rows that have an NA value, consider use of
the stats::na.omit() function. In advance and only with caution, be sure to determine
that this is the desired action.38
38
Think of the expression measure twice and cut once before deploying any action that eliminates
data from a dataset.
XDatasetWithMissingValuesUsingna.omit <-
stats::na.omit(XDatasetWithMissingValues.df)
# Omit all rows that have a NA value.
XDatasetWithMissingValuesUsingna.omit
There are many functions in the tidyverse ecosystem that go far beyond these
Base R functions when working with (and around) missing data. These simple
examples provide a good start on recognizing that data will be missing and there
must be ways to accommodate this real-world issue.
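As a small preview, and certainly not the only approach available, the tidyr package (part of the core tidyverse) offers functions such as tidyr::drop_na() and tidyr::replace_na(). The sketch below assumes the tidyverse packages have been installed and loaded, which in this lesson does not happen until the addenda:
tidyr::drop_na(XDatasetWithMissingValues.df)
# Keep only the rows with no missing values, similar in
# spirit to stats::na.omit().
tidyr::replace_na(XDatasetWithMissingValues.df,
  list(WeightLb = 0))
# Replace missing WeightLb values with 0, shown only as an
# illustration; replacing missing data with 0 is rarely an
# appropriate analytic choice.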
R, meaning both Base R and the tidyverse ecosystem, can accommodate many different types of data structures. A few are discussed here; each discussion could be greatly expanded, but extended treatment would go beyond the purpose of an introductory text.
Although dataframes have been shown previously in this lesson, it needs repeating
that a dataframe is a rectangular collection of object variables, where each object
variable (e.g., column) consists of the same number of subjects (e.g., rows). Consider
a simple four by three dataframe (e.g., row by column - or 4 rows by 3 columns)
consisting of data on blood pressure from four subjects gained from three object
variables, each of a different data type:
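The data entry step for SBPAlert01.df is not reproduced here. A minimal sketch; the four SubjectID values and the first two SystolicBP values can be read from the matrix output later in this lesson, while the remaining SystolicBP values and the BPRisk entries are placeholders:
SBPAlert01.df <- data.frame(
  SubjectID  = c(1035, 1067, 2053, 1716),
  SystolicBP = c(122, 186, 145, 133), # Last two are placeholders
  BPRisk     = c("FALSE", "TRUE", "FALSE", "FALSE")) # Placeholders
# Four subjects (rows) and three object variables (columns).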
SBPAlert01.df$SubjectID <-
as.factor(SBPAlert01.df$SubjectID)
SBPAlert01.df$SystolicBP <-
as.numeric(SBPAlert01.df$SystolicBP)
SBPAlert01.df$BPRisk <-
as.logical(SBPAlert01.df$BPRisk)
base::attach(SBPAlert01.df)
utils::str(SBPAlert01.df)
The dataframe SBPAlert01.df has been put into desired format with the numbers
associated with SubjectID viewed as factors, the numbers associated with SystolicBP
viewed as real numbers, and the text associated with BPRisk viewed as logical
FALSE or TRUE values.
The tidyverse ecosystem has not yet been brought into this R session and as such
the tibble::as_tibble() function and the tibble::tibble() function cannot yet be dem-
onstrated at this point. However, tibbles are used in the addenda, in the back part of
this lesson and throughout this text. A tibble is a special type of dataframe that is
used extensively with the tidyverse ecosystem. It has unique features, and although
the tidyverse ecosystem can often be used with dataframes, it is common to use
tibbles when using functions associated with the tidyverse.
Factors
Factors need special mention in that they are a specialized data type, seen previ-
ously in this lesson (and will be seen throughout future lessons), but a more formal
introduction is provided here. Imagine three adult males from the general popula-
tion with the following heights:
Robert ... 164 centimeters
Juan .... 178 centimeters
Declan .. 192 centimeters
Assume that the measuring tape was used correctly and that these measures are
reliable and valid. However, it is not always easy to obtain precise measures and
sometimes broad measures are the best that can be obtained. Or it needs to be
considered that all recipients of the information on height may not have a good
grasp of the measuring scale. Thus, consider how these height measurements could
instead be expressed initially as:
Robert ... 1
Juan .... 2
Declan .. 3
Treat these 1-2-3 measures as ordered factor-type data, where listing in a code
book indicates that 1 = Short, 2 = Medium, and 3 = Tall. There is an order, and ide-
ally there needs to be a set of cut points for correct placement in each classification.
Of course, context is always important. It is unlikely that an adult male professional
basketball player with a measurement of 192 centimeters would ever be classified
as 3, or Tall – at least among this specialized group of adult males. A separate Code
Book and cut points would be needed for an ordered factor listing of heights for
these professional athletes, where a height of 212 centimeters would not be an
extreme upper-range value for those who are considered tall basketball players.
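A minimal sketch of how such ordered factor-type data might be expressed in R, using the base::factor() function with the ordered argument; the codes and labels follow the code book just described:
HeightClass <- factor(c(1, 2, 3),
  levels  = c(1, 2, 3),
  labels  = c("Short", "Medium", "Tall"),
  ordered = TRUE)
HeightClass
[1] Short  Medium Tall
Levels: Short < Medium < Tall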
Not all factor-type data are ordered, and many data of factor-type are coded list-
ings. The most common factor-type datum that is not ordered is perhaps gender,
with Female = 1 and Male = 2. The assignment of codes is merely a shortcut that
avoids excess keying, where it is far less work to key 1 instead of Female and to key
2 instead of Male. And note how the two codes in this example follow an alphabeti-
cal ordering (e.g., Female – Male), which is used to indicate that there is no ordering
based on the nature of the variable.
List
When using R, a list is a special type of vector containing other objects. Go back to
the demonstration for dataframes and note how a dataframe is actually a special
type of list, with the dataframe SBPAlert01.df consisting of three object variables,
each of a different type, but of the same length: (1) the numbers in SubjectID are
factors, (2) SystolicBP is composed of real numbers, and (3) the FALSE and TRUE
entries in BPRisk are logical values:
utils::str(SBPAlert01.df)
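A quick check confirms the point; the base::is.list() function returns TRUE for a dataframe:
base::is.list(SBPAlert01.df)
[1] TRUE
base::is.data.frame(SBPAlert01.df)
[1] TRUE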
Although lists can take many forms, for this lesson and later lessons, it will be
common to encounter rectangular lists, in the form of dataframes or tibbles. More
complex lists go beyond the introductory nature of this text.
Matrix
A dataframe is a rectangular object where the different variables can represent dif-
ferent data types, as seen in SBPAlert01.df. A matrix, however, must have all vari-
ables of the same data type.
Apply the base::as.matrix() function against the dataframe SBPAlert01.df, and
see the outcome of this action:
SBPAlert01.matrix <- base::as.matrix(SBPAlert01.df)
# Coerce the dataframe into a matrix.
SBPAlert01.matrix
str(SBPAlert01.matrix)
chr [1:4, 1:3] "1035" "1067" "2053" "1716" "122" "186" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "SubjectID" "SystolicBP" "BPRisk"
Matrices are used with some specialized functions pertaining to math, but it is
perhaps best to work with either dataframes or tibbles (when using the tidyverse
ecosystem) instead of a matrix. Perhaps more importantly for this lesson, the
ggplot2::ggplot() function, which is a key part of the tidyverse ecosystem for the
creation of Beautiful Graphics, either will not work or will work only with difficulty
against matrix data. Accordingly, a matrix-based object may need to be put into
either dataframe format or tibble format if there were a desire to plot the data using
functions from the ggplot2 package.
Vector
The simplest data structure in R is likely the vector. A vector is a collection of values of a single type: real numbers (e.g., numbers with a decimal value, even if the number does not display the decimal value); integers (e.g., whole numbers without decimal value); characters; or logical values. Consider two vectors, one vector consisting of only one numeric datapoint and the other vector consisting of five character datapoints:
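The assignments for the two vectors are not reproduced here. A minimal sketch; the single numeric datapoint (101) matches the output shown below, while the five character values are purely hypothetical:
VectorOneDatapoint <- c(101)
VectorFiveDatapoints <- c("Alpha", "Bravo", "Charlie",
                          "Delta", "Echo")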
VectorOneDatapoint
[1] 101
utils::str(VectorOneDatapoint)
base::length(VectorOneDatapoint)
base::is.vector(VectorOneDatapoint)
VectorFiveDatapoints
utils::str(VectorFiveDatapoints)
base::length(VectorFiveDatapoints)
base::is.vector(VectorFiveDatapoints)
Although others may have a different view on how to explain the ubiquity of
vectors in data science, it is argued in this text that vectors should be viewed as the
basic building blocks of nearly all complex R-based data structures. Certainly, vec-
tors are used to build the most common data structures in R and the associated
tidyverse ecosystem, namely dataframes and tibbles.39 Even so, data scientists
should have acquaintance with many different data structures.
Throughout this introductory lesson, R has been introduced in a gentle manner,
in keeping with the notion that: Petit à petit l’oiseau fait son nid. (French); Little by
little, the bird builds its nest. (English); and Doni doni kononi b’a nyaga da.
(Bambaran). The addenda that follow continue along with a gentle introduction to
R and more specifically the tidyverse ecosystem and the use of APIs. Increasing
complexity is seen as the addenda in this lesson progress and future lessons in this
text progress.
39
When using R, an array is an object that can store data in more than two dimensions. As seen
later, a matrix is a two-dimensional array.
Guidance was provided earlier in this lesson that introduced discussion on the num-
ber of completers (e.g., graduates), from Academic Year 2009–2010 to Academic
Year 2018–2019, of selected academic programs of study associated with data sci-
ence and biostatistics, using data provided by the United States Department of
Education.40 All data were gained from the Integrated Postsecondary Education
Data System (IPEDS) Peer Analysis System (https://nces.ed.gov/ipeds/use-the-
data), a United States Department of Education resource freely available to the pub-
lic, worldwide. The purpose of this addendum is to demonstrate how R and the
tidyverse ecosystem are used to generate figures that graphically communicate out-
comes relating to completions. As an advance organizer to R and the tidyverse eco-
system, give focus to the syntax, with more complete detail on the syntax provided
in later lessons.
The IPEDS Peer Analysis System (https://nces.ed.gov/ipeds/use-the-data) was
used to obtain and download the data associated with this part of the lesson. The
IPEDS Peer Analysis System interface does not currently provide data by use of an R-based Application Programming Interface (API) function (e.g., client), and detailed instructions on navigation through the IPEDS Peer Analysis System interface are far beyond the purpose of this lesson. However, all files downloaded
after using the IPEDS Peer Analysis System are available at the publisher’s Web
page associated with this text. For all selected programs of study, it may be helpful
to know that the data were originally downloaded in .csv format but were then also
saved in .xlsx format. Saving the original .csv files in .xlsx format allows convenient
use of the readxl package, a package that is associated with the tidyverse ecosystem,
but is not among packages in the core tidyverse ecosystem – but more detail on core
tidyverse ecosystem packages and associated tidyverse ecosystem packages is pro-
vided in later lessons.
Although an active R session was used in the front matter of this lesson, it is now
assumed that the R session for the addenda starts with the Housekeeping syntax,
below. Follow along with the provided syntax to first understand and then reproduce
40
As a brief explanation of the value of using Integrated Postsecondary Education Data System
(IPEDS) data on completers, it should be mentioned that six-digit CIP data (e.g., the highly granu-
lar data that are specific to individual programs of study) are available for nearly all completers, but
that is not the case for the availability of six-digit CIP (Classification of Instructional Programs)
fall term enrollment data from the IPEDS Peer Analysis System. The rationale for that decision is
that postsecondary students frequently change their academic major program of study before com-
pletion, such that six-digit CIP enrollment data would be an inconsistent and misleading false
friend. The decision to exclude fall term enrollment data also considers that many students transfer
from one postsecondary institution to another, again confounding the efficacy of using enrollment
as an appropriate metric. In contrast, completion of a program of study from a specific institution
and the awarding of either a certificate or degree by that institution is a recorded final event that
results in a fixed datum that will not change and therefore serves as a valid measure of interest for
individual programs of study.
the syntax used to produce the figures.41 The R-based syntax used to generate the
figure for CIP 01.0000 (Agriculture, General), CIP 26.1102 (Biostatistics), and CIP
51.1201 (Medicine) soon follows. When viewing the syntax used to prepare these
figures, note how the dplyr package and the ggplot2 package, key packages in the
core tidyverse ecosystem, are the dominant packages for organization of the data
(dplyr) and subsequently preparation of these figures (ggplot2).
There are a few different packages and related functions associated with the tidy-
verse ecosystem that are used to import data into an active R session:
• The readr package is part of core tidyverse and it supports many different func-
tions that are used to import delimited files during an active R session. Among
the many different file import functions supported by the readr package, perhaps
the most common is the readr::read_csv() function. Due to nearly universal use
of comma separated files, the readr::read_csv() function is often used to import
rectangular data (e.g., data organized in row by column format) that are in comma
separated values (.csv) format – an extremely common file format that is easily
shared with others.
• The readxl package is associated with the tidyverse ecosystem and is commonly
used to import spreadsheets that are in either .xls format (using the readxl::read_
xls() function) or .xlsx format (using the readxl::read_xlsx() function). The many arguments associated with these functions make the readxl package quite useful: when data are imported, they arrive organized as expected, with the data in declared format (a brief sketch of both import styles follows this list).
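A minimal sketch of the two import styles just described, using hypothetical file names and assuming the readr and readxl packages have been installed and loaded, as shown later in this addendum:
Completions.tbl <- readr::read_csv("CIP260102.csv")
# Import a comma separated values file; the result is a
# tibble.
Completions.tbl <- readxl::read_xlsx("CIP260102.xlsx",
  sheet = 1, col_names = TRUE)
# Import the same data saved in .xlsx format.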
A few reminders may be helpful for those who want to learn more about file
formats:
• Delimited files are commonly found in data science, where the file is structured
so that data are separated by some type of character. Commas are perhaps the
most common characters used to separate data in a row, data from one column to
data in another column.
• The .csv file format is structured so that commas are used to separate data fields.
This format has been used in data science since the early 1970s, with first use
often attributed to use of the Fortran programming language. Although .csv files
do not have the many robust features of other data-oriented file formats, simplic-
ity is the comparative advantage of the .csv file format. Simple .csv files can be
opened by nearly all text editors and spreadsheets, allowing nearly universal use
in data science. The United States Library of Congress (https://www.loc.gov/
preservation/digital/formats/fdd/fdd000323.shtml) provides a rich history of the
.csv format. The Library of Congress identifies the .csv format as a preferred
dataset format in their Recommended Formats Statement (RFS) for datasets.
Search for and read about standard RFC 4180 to learn more about the history and
41
The syntax in this addendum provides an advance organizer to the use of data science and R. For
now, give attention to process. More specific exposure to the use of R syntax in support of data
science is provided in later lessons.
use of files in .csv format. There are regions throughout the international data
science community where a comma character is used with numbers in place of a
decimal point (e.g., The mean weight of subjects was 12,34 Kg. instead of The
mean weight of subjects was 12.34 Kg.). In that case, it is possible to prepare a
.csv-equivalent delimited file with a semicolon character serving as a separator
between data fields, instead of a comma. Tabs are also frequently used to sepa-
rate data within a row. It is now uncommon, but white space (e.g., use of the
space bar) was once a common format used to separate row-based data into fixed
columns.
• The readxl::read_excel() function can be used to import both .xls format files and
.xlsx format files, and it is often used when the exact spreadsheet file extension
is unknown.
• Data are automatically organized as a tibble when using functions associated with the readxl package. A tibble is a specialized type of dataframe that has many advantages, and it is the standard dataframe format when working with the tidyverse ecosystem. More discussion about the tibble dataframe (e.g., dataset) format is provided in later lessons.
Data for this lesson were gained by using the IPEDS Peer Analysis System, orig-
inally downloaded as .csv files, put into .xlsx format, and placed on an external F:\
drive. With the first row of each spreadsheet serving as column name identifiers, the
data are rectangular and are consistently organized in more than 6000 rows and 22
columns. Each row represents a unique postsecondary institution, identified by a
unique UnitID, and each column represents a unique variable. In original format,
from when the data were first obtained, names in the column headers are quite long
and complex, but they are descriptive and fully illustrate the nature of the data.42 Be
sure to look at use of the many arguments associated with the readxl::read_xlsx()
function so that data, when imported, are in good form, especially the long col-
umn names.43
42
The original format column header names are certainly not tidy, but look at the syntax for how the base::colnames() function is used to put the final form column names into a more tidy format that is equally easy to read and understand.
43
The utils::read.table() function, associated with Base R, the set of packages and functions avail-
able when R is first downloaded, is a common tool for importing .csv files into an active R session.
One frequently used feature associated with the utils::read.table() function is use of the stringsAs-
Factors argument. By using this argument, it is possible to set character data (e.g., Female or Male,
Fail or Pass, etc.) as factors during the data import process. The tidyverse ecosystem and specifi-
cally the readxl package takes a totally different approach to this task. When the readxl package is
used to import data, using either the readxl::read_excel(), readxl::read_xls(), or readxl::read_xlsx()
functions, character data are retained as characters after they are imported and they are not put into
factor format during the data import process. If there were a desire to organize character data in
factor format, which is common, that forced action must come later, after the data are imported and
put into a tibble. The advantage of this approach is that when using tools in the tidyverse ecosys-
tem, data must be organized early-on and the syntax for this set of actions will be very visible as
syntax of its own, minimizing a possible misuse of the data because of a simple action that was
somehow earlier forgotten.
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
44
Ideally, in the near future the active R session will be based on use of the cloud, but for now a portable drive, a physical portable F:\ drive in this case, meets current needs.
45
After this syntax was prepared, the core tidyverse was updated to tidyverse 2.0.0. The lubridate
package was added as part of the new core tidyverse package of packages. More detail is provided
in a later lesson.
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
# All eight packages in the core tidyverse ecosystem
# will now be available and ready for use:
# dplyr
# forcats
# ggplot2
# purrr
# readr
# stringr
# tibble
# tidyr
# As a good programming practice (gpp), the following
# convention will be used when functions in the core
# tidyverse are used: PackageName::FunctionName(). By
# using this practice there should be no confusion as to
# which function is associated with which tidyverse (or
# other) package. Some may see this convention as an
# unnecessary redundant activity, but it is argued that
# this practice is part of the overall quality assurance
# process, especially for those who are new to data
# science, R, and the tidyverse ecosystem.
#
# Other packages will be downloaded as needed, but again,
# all eight packages associated with core tidyverse are
# now available and ready for use.
install.packages("readxl", dependencies=TRUE)
library(readxl)
# The readxl package is NOT part of the core tidyverse.
# It needs to be individually installed and loaded.
# The readxl package was selected for data import, but of
# course other packages could have also been used.
install.packages("magrittr", dependencies=TRUE)
library(magrittr)
# The magrittr package is NOT part of the core tidyverse.
# It needs to be individually installed and loaded.
# The pipe operator (e.g., %>%), from the magrittr
# package, is used to move an object forward. The use of
# pipes, often multiple pipes in a chain, is an essential
# part of how the tidyverse ecosystem is used to create
# syntax that is both functional and easily understood by
# others.
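A trivial illustration of the pipe operator, unrelated to the IPEDS data, simply to show how an object moves forward from one function to the next:
c(4, 9, 16) %>% sqrt() %>% sum()
# The vector is piped into sqrt(), and the square roots
# (2, 3, 4) are then piped into sum().
[1] 9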
The ggplot2 package is part of the core tidyverse and it is used to create Beautiful
Graphics. However, there are many ancillary packages that support the production
of graphical presentations, with features that go far beyond what can be prepared
using the ggplot2 package by itself. A few of these graphically oriented ancillary
packages include:
install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
install.packages("ggtext", dependencies=TRUE)
library(ggtext)
install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
install.packages("scales", dependencies=TRUE)
library(scales)
Challenge: As mentioned earlier, the R-based syntax used to generate the figures
for CIP 01.0000 (Agriculture, General), CIP 26.1102 (Biostatistics), and CIP
51.1201 (Medicine) soon follows. Go to the publisher’s Web site associated with
this text to obtain the full set of IPEDS-originated .xls files, one file for each of the
programs of study in the following output:
Challenge: Again, after reviewing the syntax and obtaining the data, reproduce
the figures for each of these academic programs of study, where each has some
degree of focus on biostatistics. The readxl::read_xlsx() function will be used to
start this task. Follow along with the arguments, such as sheet, col_names, col_
types, etc., to see the quality assurance measures used to be sure that data are
imported correctly and in desired format. Then use the syntax exactly as presented,
perhaps only changing the disk drive to personal preferences, to generate all figures.
In its original form, as the data are obtained from IPEDS, the column names are
extremely long, complex, have multiple spaces and special characters, etc. In short,
the column names are unmanageable from a tidy perspective. R supports many
functions that could be used to rename column names, including functions in either
the core tidyverse or those functions from packages that are ancillary to the tidy-
verse ecosystem as well as functions from Base R. The functions dplyr::rename()
and dplyr::rename_with() are commonly used, but other similar functions have
merit too, including gdata::rename.vars() and data.table::setnames().
From a much larger set of possible selections, the IPEDS Peer Analysis System
was used to construct 30 datasets relating to certificate and degree completion for
identified CIPs, programs of study that require some degree of association with data
science and biostatistics. Queries to the IPEDS Peer Analysis System followed a
consistent structure, allowing some degree of automation for data retrieval (given
that an R-based API function is not supported for these data) and eventual use of the
data to construct the figures:
• There are 22 columns.
• Column placement is consistent in terms of selected variables.
• The only difference from one download to another is that the data are CIP spe-
cific, but the number of rows (e.g., postsecondary institutions) and the structure
for columns (e.g., variables) are consistent.46
• Because of this consistency in the way columns are named and organized, it was judged best to use the base::colnames() function to rename the columns with manageable names. The tidyverse ecosystem is a great contribution to the R language, but functions from Base R should not be overlooked when their use is appropriate; in many cases they are the simplest approach to achieving aims.47
46. Ideally, the IPEDS Peer Analysis System would support function-specific R-based API (Application Programming Interface) data retrieval. R syntax would then invoke a function from an active R session and the desired data would be returned, eliminating manual interaction with an interface at the originating data source. Unfortunately, the IPEDS data resource does not yet support this type of API data retrieval. Those who work in data science must be able to react to multiple data acquisition processes, not only those that are ideal.
47. Search on the early 1300s writings of William of Ockham, who is generally credited with formulating Occam's razor, an approach that advocates simple solutions to problems whenever possible.
base::colnames(WCIP010000.tbl) <- c(
"UnitID", # Column 01 UnitID
"Institution", # Column 02 Institution
"Fall2019", # Column 03 Fall 2019 Enrollment
"Institution", # Column 04 Institution (redundant)
"City", # Column 05 City
"State", # Column 06 State
"FIPS", # Column 07 FIPS County
"Longitude", # Column 08 Longitude
"Latitude", # Column 09 Latitude
"Control", # Column 10 Institutional Control
"Highest", # Column 11 Highest Level Offered
"Carnegie", # Column 12 Carnegie Classification
"AY201819", # Column 13 AY 2018-19 Degrees/Certificates
"AY201718", # Column 14 AY 2017-18 Degrees/Certificates
"AY201617", # Column 15 AY 2016-17 Degrees/Certificates
"AY201516", # Column 16 AY 2015-16 Degrees/Certificates
"AY201415", # Column 17 AY 2014-15 Degrees/Certificates
"AY201314", # Column 18 AY 2013-14 Degrees/Certificates
"AY201213", # Column 19 AY 2012-13 Degrees/Certificates
"AY201112", # Column 20 AY 2011-12 Degrees/Certificates
"AY201011", # Column 21 AY 2010-11 Degrees/Certificates
"AY200910") # Column 22 AY 2009-10 Degrees/Certificates
There is now a high degree of assurance that the dataset has been imported correctly and that the data are in good order. Notice that many NAs show in the dataset, but that is expected. More than 6000 postsecondary institutions are represented in this dataset, gained by queries to the IPEDS Peer Analysis System, and of this large population only a few postsecondary institutions have a program of study coded as CIP 01.0000 Agriculture, General.
base::length(WCIP010000.tbl$AY201819)
# Number of lines (e.g., records, rows, etc.)
[1] 6179
base::table(is.na(WCIP010000.tbl$AY201819))
# For the identified object, generate a table of rows with
# missing data (e.g., is.na outcome is TRUE) and rows where
# data are not missing data (e.g., is.na outcome is FALSE).
# The data of interest to the upcoming figure are from the
# rows marked FALSE, where is.na yields a FALSE outcome.
FALSE TRUE
186 5993
Code Book: When navigating the IPEDS Peer Analysis System interface, a pop-up menu is available to describe the meaning of numeric codes. These codes are then downloaded along with the main dataset. There are variables in the IPEDS spreadsheets that are not currently needed to generate the desired figures, such as Longitude and Latitude or Carnegie Classification, but they were selected for possible use in the future.
Selected sections of the Code Book were deleted to save space, but all details are provided online, at the IPEDS Peer Analysis System resource.
# UnitID
# Each postsecondary institution is assigned a unique numeric
# code. Some institutions with multiple campuses have multiple
# UnitIDs, and some do not.
#
# Institution
# The actual name of an institution is provided, which may be
# different than what is commonly used in less formal usage.
#
# Fall 2019 Enrollment
# Unduplicated fall term headcount enrollment, as measured on a
# set census date near end of term, is provided. This metric
# is different from FTE (Full Time Equivalent) enrollment and
# duplicated fall term headcount enrollment. Recall that due to
# the COVID-19 pandemic and its impact on massive reduction in
# student engagement beginning in 2020, 2019 was the last year
# that was perhaps reflective of norm behavior and enrollment
# patterns.
#
# Institution (redundant)
# The name of the institution is provided again.
#
# City
# The city location is provided for the UnitID campus.
#
# State
# The state for the identified UnitID campus is provided.
#
# FIPS County
# Refer to https://www.census.gov/geographies/reference-files/
# 2020/demo/popest/2020-fips.html for a listing of the thousands
# of numeric state and county Federal Information Processing
# Standards (FIPS) codes. Numeric FIPS codes are a far more
# efficient way of identifying counties than the use of actual
# names since some county names show in multiple states (e.g.,
# Washington is used as a county or parish name in 31 states,
# Jefferson in 26, Franklin in 25, Jackson in 24 and Lincoln in
# 24, etc.).
#
# Longitude
# The longitude of the UnitID campus is used to create maps.
#
# Latitude
# The latitude of the UnitID campus is used to create maps.
#
# Institutional Control
# 1 Public
# 2 Private not-for-profit
# 3 Private for-profit
#
48. The tidyr::gather() function is deprecated. It is still included among the many functions available in the tidyr package, but there are no current efforts to improve its functionality. Most importantly, existing syntax from prior projects that uses the tidyr::gather() function still works, but new projects tend to use the tidyr::pivot_longer() function, given its support and improved functionality.
LCIP010000.tbl <-
tidyr::pivot_longer(WCIP010000Adjusted.tbl,
-c(UnitID),
names_to = "AY", values_to = "Completers")
# Put the data into long format, using the
# tidyr::pivot_longer() function.
#
# The expression -c(UnitID) means that the
# tidyr::pivot_longer() function should
# pivot everything except UnitID. In this
# syntax, the minus sign means except.
LCIP010000.tbl
Before any graphics are produced, it is important to know that the ggplot2::ggplot() function supports many different themes. A ggplot2 theme is a named collection of settings that controls the non-data appearance of a figure. To make the figures bold and vibrant, but also to reduce redundant keying, look at theme_Mac(), a self-created theme that will be used in concert with the ggplot2::ggplot() function.
Themes reduce redundant keying while adding value to a project. Additional themes, beyond the standard themes available to all, are used to enhance axis and tick mark presentation, bold labels and titles, centering, font size and color, etc. However, these ad hoc changes to standard themes require many lines of syntax. By keying that syntax one time and saving it under a unique name, it is possible to easily deploy a user-created theme such as theme_Mac() multiple times and in multiple projects. This multiple use and reuse of existing syntax promotes an efficient and tidy way of meeting project requirements.
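The full definition of theme_Mac() is not reproduced in this excerpt. A minimal sketch of a comparable user-created theme, built on ggplot2::theme_bw(), might look like the following; the specific settings are assumptions, not the exact choices behind theme_Mac():
theme_Mac <- function() {
  ggplot2::theme_bw() +
    ggplot2::theme(
      plot.title      = ggplot2::element_text(face="bold", size=16, hjust=0.5),
      plot.subtitle   = ggplot2::element_text(face="bold", size=12, hjust=0.5),
      axis.title      = ggplot2::element_text(face="bold", size=12),
      axis.text       = ggplot2::element_text(face="bold", size=10),
      legend.position = "none")
}
# Once keyed and saved, theme_Mac() can be added to any ggplot2
# object with the + operator, just like a standard theme.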
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
Create an object that summarizes the number of completers for each academic year, Academic Year 2009–2010 to Academic Year 2018–2019. The output supplies the labels placed over the top of each column when a column chart (e.g., bar chart) of completers by academic year is prepared (Fig. 1.4).
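The syntax that builds this summary object is not reproduced in this excerpt. A minimal sketch using the dplyr package, consistent with the printed output below, might be:
CIP010000SumByAY <- LCIP010000.tbl %>%
  dplyr::group_by(AY) %>%
  dplyr::summarize(
    sum = sum(Completers, na.rm=TRUE), # Total completers by AY
    n   = dplyr::n())                  # Records contributing to each AY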
CIP010000SumByAY
# Use this summary to help determine the range for the Y
# axis, labels placed over each academic year column, and
# also to confirm the correct ordering of academic years.
# A tibble: 10 x 3
AY sum n
<chr> <dbl> <int>
1 AY200910 2258 6179
2 AY201011 2508 6179
3 AY201112 2587 6179
4 AY201213 2691 6179
5 AY201314 2873 6179
6 AY201415 2932 6179
7 AY201516 3233 6179
8 AY201617 3302 6179
9 AY201718 3376 6179
10 AY201819 3464 6179
CIP010000AgricultureGeneral.fig <-
ggplot2::ggplot(data=LCIP010000.tbl, aes(x=AY, y=Completers)) +
geom_col(fill="red") +
geom_richtext(data=CIP010000SumByAY,
aes(x=AY, label=scales::comma(round(sum), accuracy=1),
y=sum), hjust=0.50, vjust = -0.75, size=5,
label.color="black", fontface="bold") +
# Notice the use of geom_richtext, not geom_text. A few
# desired embellishments are possible with geom_richtext,
# such as the large print label above each column and the
# way this label is highlighted in an attractive offset
# textbox with rounded corners. Give special attention to
# the way label=scales::comma(round(), accuracy=1) was used
# to use comma notation as a thousands separator and to be
# sure that whole numbers were generated since the labels
# represent counts and it would be inappropriate to use
# decimal notation.
labs(
title="CIP 01.0000 Agriculture, General",
subtitle="Completions (All Degrees and Certifications) by
Academic Year",
x = "\nAcademic Year",
y = "Completions - All Degrees and\nCertifications\n") +
annotate("text", x=5.5, y=-100.0, fontface="bold", size=03,
label="Academic Year: July 01 to June 30") +
# Notice how annotate() has been placed in a centered
# position (x=5.5 for this figure), below the columns.
# The scale_x_discrete() and scale_y_continuous() layers used
# for this figure are omitted from this excerpt; they follow
# the same pattern shown below for Fig. 1.5.
theme_Mac()
Fig. 1.4
par(ask=TRUE)
CIP010000AgricultureGeneral.fig
# Fig. 1.4
The figure offers a very useful perspective of throughput (e.g., completions of all degrees and certifications) for the selected CIP-based program of study. For those with special interest, the IPEDS Peer Analysis System can be used to provide not only completer totals but also breakouts by different degree levels. However, that degree of granularity is beyond the purpose of this summary presentation.
Review the syntax for the two other CIP-specific programs of study associated with data science and biostatistics selected for presentation in this text, from among the many possible selections: CIP 26.1102 (Biostatistics) and CIP 51.1201 (Medicine).
Once again, respond to the challenge to use the syntax and model in this addendum to create all 30 figures that reflect completions over time among programs of study that have some degree of association with biostatistics, from CIP 01.0000 Agriculture, General to CIP 51.3818 Nursing Practice, following along with the prior list.49 To save space, nearly all comments have been removed from the remaining syntax in this addendum since they repeat what was seen earlier (Figs. 1.5 and 1.6).
Fig. 1.5
Fig. 1.6
49. The best way to learn R syntax is to read R syntax prepared by others, write R syntax, make corrections, read documentation, and then experiment with multiple packages and functions, etc. Practice – practice – practice!
base::colnames(WCIP261102.tbl) <- c(
"UnitID", # Column 01 UnitID
"Institution", # Column 02 Institution
"Fall2019", # Column 03 Fall 2019 Enrollment
"Institution", # Column 04 Institution (redundant)
"City", # Column 05 City
"State", # Column 06 State
"FIPS", # Column 07 FIPS County
"Longitude", # Column 08 Longitude
"Latitude", # Column 09 Latitude
"Control", # Column 10 Institutional Control
"Highest", # Column 11 Highest Level Offered
"Carnegie", # Column 12 Carnegie Classification
"AY201819", # Column 13 AY 2018-19 Degrees/Certificates
"AY201718", # Column 14 AY 2017-18 Degrees/Certificates
"AY201617", # Column 15 AY 2016-17 Degrees/Certificates
"AY201516", # Column 16 AY 2015-16 Degrees/Certificates
"AY201415", # Column 17 AY 2014-15 Degrees/Certificates
"AY201314", # Column 18 AY 2013-14 Degrees/Certificates
"AY201213", # Column 19 AY 2012-13 Degrees/Certificates
"AY201112", # Column 20 AY 2011-12 Degrees/Certificates
"AY201011", # Column 21 AY 2010-11 Degrees/Certificates
"AY200910") # Column 22 AY 2009-10 Degrees/Certificates
base::length(WCIP261102.tbl$AY201819)
base::table(is.na(WCIP261102.tbl$AY201819))
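# Assumption: the dplyr::select() step that creates
# WCIP261102Adjusted.tbl is not reproduced in this excerpt; it
# presumably mirrors the CIP 51.1201 syntax shown later.
WCIP261102Adjusted.tbl <- WCIP261102.tbl %>%
  dplyr::select(c(
    UnitID,
    AY201819, AY201718, AY201617, AY201516, AY201415,
    AY201314, AY201213, AY201112, AY201011, AY200910))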
LCIP261102.tbl <-
tidyr::pivot_longer(WCIP261102Adjusted.tbl,
-c(UnitID),
names_to = "AY", values_to = "Completers")
LCIP261102.tbl
base::class(theme_Mac)
CIP261102SumByAY
CIP261102Biostatistics.fig <-
ggplot2::ggplot(data=LCIP261102.tbl, aes(x=AY, y=Completers)) +
geom_col(fill="red") +
geom_richtext(data=CIP261102SumByAY,
aes(x=AY, label=scales::comma(round(sum), accuracy=1),
y=sum), hjust=0.50, vjust = -0.75, size=5,
label.color="black", fontface="bold") +
labs(
title="CIP 26.1102 Biostatistics",
subtitle="Completions (All Degrees and Certifications) by
Academic Year",
x = "\nAcademic Year",
y = "Completions - All Degrees and\nCertifications\n") +
annotate("text", x=5.5, y=-40.0, fontface="bold", size=03,
label="Academic Year: July 01 to June 30") +
scale_x_discrete(labels = c(
"AY2009-10", # By using enumerated labels, the natural
"AY2010-11", # ordering of label placement was changed.
"AY2011-12", # As such, notice the reverse order of the
"AY2012-13", # enumerated labels.
"AY2013-14", #
"AY2014-15", # Quality Assurance Check: Compare the
"AY2015-16", # number in each textbox label to each value
"AY2016-17", # associated with CIP261102SumByAY.
"AY2017-18",
"AY2018-19")) +
scale_y_continuous(labels=scales::comma, limits=c(-40,1200),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
par(ask=TRUE)
CIP261102Biostatistics.fig
# Fig. 1.5
base::colnames(WCIP511201.tbl) <- c(
"UnitID", # Column 01 UnitID
"Institution", # Column 02 Institution
"Fall2019", # Column 03 Fall 2019 Enrollment
"Institution", # Column 04 Institution (redundant)
"City", # Column 05 City
"State", # Column 06 State
"FIPS", # Column 07 FIPS County
"Longitude", # Column 08 Longitude
"Latitude", # Column 09 Latitude
"Control", # Column 10 Institutional Control
"Highest", # Column 11 Highest Level Offered
"Carnegie", # Column 12 Carnegie Classification
"AY201819", # Column 13 AY 2018-19 Degrees/Certificates
"AY201718", # Column 14 AY 2017-18
86 1 Emergence of Data Science as a Critical Discipline in Biostatistics
Degrees/Certificates
"AY201617", # Column 15 AY 2016-17 Degrees/Certificates
"AY201516", # Column 16 AY 2015-16 Degrees/Certificates
"AY201415", # Column 17 AY 2014-15 Degrees/Certificates
"AY201314", # Column 18 AY 2013-14 Degrees/Certificates
"AY201213", # Column 19 AY 2012-13 Degrees/Certificates
"AY201112", # Column 20 AY 2011-12 Degrees/Certificates
"AY201011", # Column 21 AY 2010-11 Degrees/Certificates
"AY200910") # Column 22 AY 2009-10 Degrees/Certificates
base::length(WCIP511201.tbl$AY201819)
base::table(is.na(WCIP511201.tbl$AY201819))
WCIP511201Adjusted.tbl <- WCIP511201.tbl %>%
dplyr::select(c(
UnitID,
AY201819,
AY201718,
AY201617,
AY201516,
AY201415,
AY201314,
AY201213,
AY201112,
AY201011,
AY200910))
LCIP511201.tbl <-
tidyr::pivot_longer(WCIP511201Adjusted.tbl,
-c(UnitID),
names_to = "AY", values_to = "Completers")
LCIP511201.tbl
base::class(theme_Mac)
CIP511201SumByAY
CIP511201Medicine.fig <-
ggplot2::ggplot(data=LCIP511201.tbl, aes(x=AY, y=Completers)) +
geom_col(fill="red") +
geom_richtext(data=CIP511201SumByAY,
aes(x=AY, label=scales::comma(round(sum), accuracy=1),
y=sum), hjust=0.50, vjust = -0.75, size=5,
label.color="black", fontface="bold") +
labs(
title="CIP 51.1201 Medicine",
subtitle="Completions (All Degrees and Certifications) by
Academic Year",
x = "\nAcademic Year",
y = "Completions - All Degrees and\nCertifications\n") +
annotate("text", x=5.5, y=-300.0, fontface="bold", size=03,
label="Academic Year: July 01 to June 30") +
scale_x_discrete(labels = c(
"AY2009-10", # By using enumerated labels, the natural
"AY2010-11", # ordering of label placement was changed.
"AY2011-12", # As such, notice the reverse order of the
"AY2012-13", # enumerated labels.
"AY2013-14", #
"AY2014-15", # Quality Assurance Check: Compare the
"AY2015-16", # number in each textbox label to each value
"AY2016-17", # associated with CIP511201SumByAY.
"AY2017-18",
"AY2018-19")) +
scale_y_continuous(labels=scales::comma, limits=c(-300,23000),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
par(ask=TRUE)
CIP511201Medicine.fig
# Fig. 1.6
The purpose of the analyses in this addendum is to offer a sample of jobs that require some degree of acquaintance with data science and biostatistics. The United States Bureau of Labor Statistics is the sole source for these job-related data.
As time and interest permit, review the many resources made available by the Bureau of Labor Statistics, which go far beyond what is provided in this addendum.50
To reinforce, the United States Bureau of Labor Statistics was the source for the data in this addendum, national_M2020_dl.xlsx and state_M2020_dl.xlsx. The data were obtained by selecting the XLS option and then downloading the OEWS data posted at https://www.bls.gov/oes/tables.htm. Both datasets were downloaded to the F drive and have also been posted at the publisher's Web site associated with this text.
Like the IPEDS data in the prior addendum, the data from the Bureau of Labor Statistics are obtained through a typical Web-based interface; they are currently unavailable through an R-based API function serving as a client. Although an API data acquisition process would be ideal, as stated earlier, data scientists do not always find data in the desired format and instead need to adjust.
It was stated earlier that the characters * and # show in many columns, columns that should be numeric but are not due to the presence of these two characters. The notes found in the data dictionary (e.g., Sheet 2 of the downloaded file national_M2020_dl.xlsx) identify that the * character is used to show that a wage estimate is not available. The # character is used to indicate that the wage is equal to or greater than $100.00 per hour or $208,000 per year. The masking of the exact wage is a purposeful decision by the Bureau of Labor Statistics, and the exact values are not readily available at the original data source.
The task now is to import the data contained in the file national_M2020_dl.xlsx,
to immediately view it, and to then adjust the file based on the object variables of
interest in this addendum. For the previously identified jobs that require some
degree of acquaintance with data science and biostatistics, the focus for this adden-
dum is the presentation of statistics related to job code, job title, number of employ-
ees at the national level by job code, and annual median salary.51
NationalOCCJobsMay2020.tbl <-
readxl::read_excel("national_M2020_dl.xlsx", 1)
# Use the readxl::read_excel() function to import the .xlsx
# spreadsheet national_M2020_dl.xlsx into the current R
# session and place the contents into the object
# NationalOCCJobsMay2020.tbl.
#
# The number 1 that shows after the .xlsx filename is used to
# declare that only the 1st sheet in the spreadsheet should
# be read into NationalOCCJobsMay2020.tbl, the intended
# object in this example. However, prior to importing Sheet
# 1, review Sheet 2 which serves as a code book for the file.
base::getwd() # Working directory
dplyr::glimpse(NationalOCCJobsMay2020.tbl) # File structure
50. OCC-coded jobs refer to primary occupation. OCC codes are used by federal agencies and provide some degree of command and control over employment trends.
51. Although it is common to see the mean as a measure of central tendency when identifying average salary information, it is argued that the median may be a more appropriate statistic. Yet both measures of central tendency (e.g., mean and median) are commonly presented and used.
Based on the purpose of this addendum, it is only necessary to work with data
from four of the 31 columns that currently show in NationalOCCJobsMay2020.tbl:
NationalOCCJobsMay2020Adjusted1.tbl <-
NationalOCCJobsMay2020.tbl %>%
dplyr::select(c(
OCC_CODE, # Job code
OCC_TITLE, # Job title
TOT_EMP, # Total employment
A_MEDIAN)) # Annual median salary
# Use the dplyr::select() function to select only those
# columns that are required for the task at hand.
dplyr::glimpse(NationalOCCJobsMay2020Adjusted1.tbl)
52. This lesson is introductory. Give attention to the syntax, but more detail on its selection and use is presented in later lessons.
NationalOCCJobsMay2020Adjusted2.tbl <-
NationalOCCJobsMay2020Adjusted1.tbl %>%
dplyr::filter(OCC_CODE %in% c(
"15-2041", # Statisticians
"17-2031", # Bioengineers and Biomedical Engineers
"19-1011", # Animal Scientists
"19-1012", # Food Scientists and Technologists
"19-1013", # Soil and Plant Scientists
"19-1020", # Biological Scientists
"19-1021", # Biochemists and Biophysicists
"19-1022", # Microbiologists
"19-1023", # Zoologists and Wildlife Biologists
"19-1029", # Biological Scientists, All Other
"19-1032", # Foresters
"19-1040", # Medical Scientists
"19-1041", # Epidemiologists
"19-4010", # Agricultural and Food Science Technicians
"19-4021", # Biological Technicians
"19-4040", # Environmental Science and Geoscience Techni ...
"19-4092", # Forensic Science Technicians
"25-1040", # Life Sciences Teachers, Postsecondary
"25-1041", # Agricultural Sciences Teachers, Postsecondary
"25-1042", # Biological Science Teachers, Postsecondary
"25-1070", # Health Teachers, Postsecondary
"25-1072", # Nursing Instructors and Teachers, Postsecon ...
"29-1021", # Dentists, General
"29-1041", # Optometrists
"29-1051", # Pharmacists
"29-1131", # Veterinarians
"29-1141", # Registered Nurses
"29-1151", # Nurse Anesthetists
"29-1211", # Anesthesiologists
"29-1215", # Family Medicine Physicians
"29-1216", # General Internal Medicine Physicians
"29-1218", # Obstetricians and Gynecologists
"29-1221", # Pediatricians, General
"29-1228", # Physicians, All Other; and Ophthalmologists ...
"29-1248")) # Surgeons, Except Ophthalmologists
# Note the numbering sequence for the dataframe title:
# Adjusted2, not Adjusted1.
These actions should now result in a tibble-based dataframe that meets immediate requirements: the production of a printout of selected job codes, job titles, number of employees at the national level, and median annual salary.
base::getwd()
base::ls()
base::attach(NationalOCCJobsMay2020Adjusted2.tbl)
utils::str(NationalOCCJobsMay2020Adjusted2.tbl)
dplyr::glimpse(NationalOCCJobsMay2020Adjusted2.tbl)
utils::head(NationalOCCJobsMay2020Adjusted2.tbl)
base::summary(NationalOCCJobsMay2020Adjusted2.tbl)
The step-by-step process of deconstructing the original imported file into the desired tibble has resulted in a dataset that meets all requirements: the production of a printout that lists a self-selected sample of jobs that require some degree of acquaintance with data science and biostatistics, along with other useful information. Use the base::print() function to present an attractive printout of the final result, taking into account the required number of rows for the printout (gained from the dplyr::glimpse() function):
base::print(NationalOCCJobsMay2020Adjusted2.tbl, n=36,
width=64)
# A tibble: 36 x 4
OCC_CODE OCC_TITLE TOT_EMP A_MED~1
<chr> <chr> <chr> <chr>
1 15-2041 Statisticians 38860 92270
2 17-2031 Bioengineers and Biomedical Enginee~ 18660 92620
25 29-1041 Optometrists 36690 118050
26 29-1051 Pharmacists 315470 128710
27 29-1131 Veterinarians 73710 99250
28 29-1141 Registered Nurses 2986500 75330
29 29-1151 Nurse Anesthetists 41960 183580
Some of the job titles are quite long. If there were a desire to print a wider format
output, then merely adjust the width. Observe the difference, with width set to 110
(the width needed to have all job titles show in full) instead of 64.
base::print(NationalOCCJobsMay2020Adjusted2.tbl, n=36,
width=110)
R can easily accommodate this concern over missing values by using a fairly simple process associated with Base R. Of course, there are tidyverse ecosystem tools for this task but, going back to a prior comment on the value of simplicity, the syntax shown below is easy to implement and works well, given the introductory nature of this lesson.
NationalOCCJobsMay2020Adjusted2.tbl[
NationalOCCJobsMay2020Adjusted2.tbl == "#"] <- NA
# Replace all # characters with NA, the missing
# value indicator.
base::print(NationalOCCJobsMay2020Adjusted2.tbl, n=36,
width=64)
# Confirm that each # character has been replaced with NA.
The NA (e.g., missing value) character now shows, correctly, for the A_MEDIAN
object. There is one remaining task that is evident when looking at the first few lines
of the print output – TOT_EMP and A_MEDIAN both show as character-based
objects. There are many ways to accommodate this concern, but perhaps the easiest
way is to use a simple transformation process:53
53. The figure is presented in the front matter of this lesson.
NationalOCCJobsMay2020Adjusted2.tbl$TOT_EMP <-
as.numeric(NationalOCCJobsMay2020Adjusted2.tbl$TOT_EMP)
NationalOCCJobsMay2020Adjusted2.tbl$A_MEDIAN <-
as.numeric(NationalOCCJobsMay2020Adjusted2.tbl$A_MEDIAN)
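As an aside, the same conversion could be expressed with tidyverse tools; a minimal sketch of that alternative (an option, not the syntax used in this lesson) follows:
NationalOCCJobsMay2020Adjusted2.tbl <-
  NationalOCCJobsMay2020Adjusted2.tbl %>%
  dplyr::mutate(dplyr::across(c(TOT_EMP, A_MEDIAN), as.numeric))
# dplyr::across() applies as.numeric() to both columns in one call.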
base::getwd()
base::ls()
base::attach(NationalOCCJobsMay2020Adjusted2.tbl)
utils::str(NationalOCCJobsMay2020Adjusted2.tbl)
dplyr::glimpse(NationalOCCJobsMay2020Adjusted2.tbl)
utils::head(NationalOCCJobsMay2020Adjusted2.tbl)
base::summary(NationalOCCJobsMay2020Adjusted2.tbl)
NationalSalarySelectedOCCJobs.fig <-
ggplot2::ggplot(data=NationalOCCJobsMay2020Adjusted2.tbl,
aes(x=OCC_CODE, y=A_MEDIAN)) +
geom_col(fill="red") +
labs(
title=
"Selected Jobs Associated with Data Science and Biostatistics:
Annual Median Salary - May 2020",
subtitle=
"All Data are from the Bureau of Labor Statistics",
x = "\nJob Code,
https://www.bls.gov/oes/current/oes_stru.htm",
y = "Annual Median Salary\n") +
annotate("text", x=2.5, y=225000, fontface="bold", size=03,
hjust=0, label=
"Data are excluded in the original dataset for median") +
annotate("text", x=2.5, y=210000, fontface="bold", size=03,
hjust=0, label=
"salaries >= $208,000 per year.") +
scale_y_continuous(labels=scales::dollar, limits=c(0,250000),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(axis.text.x=element_text(face="bold", size=08,
hjust=0.5, vjust=1, angle=90)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
par(ask=TRUE)
NationalSalarySelectedOCCJobs.fig
# The syntax for this figure shows here, but the figure
# itself shows in the front matter of this lesson.
# Fig. 1.1
Much more could be done to examine readily available federal data regarding
career choices in biostatistics: academic programs of study and change in comple-
tions over time, job titles, job requirements, national survey of salaries by job, etc.
StateOEWSJobsMay2020.tbl <-
readxl::read_excel("state_M2020_dl.xlsx", 1)
# Use the readxl::read_excel() function to import the .xlsx
# spreadsheet state_M2020_dl.xlsx into the current R session
# and place the contents into the object
# StateOEWSJobsMay2020.tbl
#
# The number 1 that shows after the .xlsx filename is used to
# declare that only the 1st sheet in the spreadsheet should
# be read into StateOEWSJobsMay2020.tbl, the intended object
# in this example. However, prior to importing Sheet 1,
# review Sheet 2, which serves as a code book for the file.
With this federal dataset now available, use the prior organizational approach and R syntax (Base R and tools from the tidyverse ecosystem) to focus on a few selected jobs, such as OCC_CODE 29-1141 Registered Nurses. As a test of current skills with R, reproduce the figure for this job and see if it meets expectations, based on the generated figure (Fig. 1.7):
Fig. 1.7
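The syntax that prepares Job291141byStateMay2020.tbl is not reproduced in this excerpt. A minimal sketch, assuming the state-level file shares the OCC_CODE and A_MEDIAN column names seen in the national file and uses PRIM_STATE for the state abbreviation, might be:
Job291141byStateMay2020.tbl <- StateOEWSJobsMay2020.tbl %>%
  dplyr::filter(OCC_CODE == "29-1141") %>%
  dplyr::select(PRIM_STATE, OCC_CODE, A_MEDIAN)
Job291141byStateMay2020.tbl[
  Job291141byStateMay2020.tbl == "#"] <- NA
# Replace the # masking character with NA, as before.
Job291141byStateMay2020.tbl$A_MEDIAN <-
  as.numeric(Job291141byStateMay2020.tbl$A_MEDIAN)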
# Registered Nurses
# Job291141byStateMay2020.tbl
Job291141byStateMay2020.fig <-
ggplot2::ggplot(data=Job291141byStateMay2020.tbl,
aes(x=reorder(PRIM_STATE, -A_MEDIAN), y=A_MEDIAN)) +
geom_col(fill="red") +
labs(
title=
"OEWS Job 29-1141, Registered Nurses: Annual Median Salary
by State (Descending Order) - May 2020",
subtitle=
"All Data are from the Bureau of Labor Statistics",
x = "\nState, https://www.bls.gov/oes/#data",
y = "Annual Median Salary\n") +
annotate("text", x=35.0, y=120000, fontface="bold", size=03,
hjust=0, label=
"Data are excluded in the original dataset for median") +
annotate("text", x=35.0, y=114000, fontface="bold", size=03,
hjust=0, label=
"salaries >= \$208,000 per year.") +
annotate("text", x=35.0, y=105000, fontface="bold", size=03,
hjust=0, label=
"Data for all jobs are not provided by state for all states.")+
scale_y_continuous(labels=scales::dollar, limits=c(0,125000),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
par(ask=TRUE)
Job291141byStateMay2020.fig
# Fig. 1.7
Challenge: Use multiple federal resources to compare state-wide salary data with state-wide cost of living data. Although the prior back matter for this lesson provided useful by-state information about salaries for a small, self-selected sample of jobs that require some degree of acquaintance with data science and biostatistics, is this information sufficient to make informed lifelong career decisions, place-of-residence decisions, etc.? As evidenced by the tables and figures, salaries for the same job vary widely by state. However, there is also variance in cost of living across different regions within a state, especially urban versus rural regions.
The purpose of this late part of Addendum 2 is to offer a brief view of how extant data provided by the federal government, coupled with knowledge of the tidyverse ecosystem and data science skills, can be used to offer insight and improve decision making, in this case looking at rent as a proxy for cost of living. As a value-added activity at this early part of the text, an R-based Application Programming Interface (API) function (e.g., client) will be demonstrated in this example to automate the acquisition of data from the United States Census Bureau. Note how the API process is tidier than point-and-click menu selections at a Web page.54
54. An entire lesson in this text is provided on APIs. Look at the API process here, but use the later lesson for explicit detail on how APIs are used in data science.
# Use the tidycensus package and/or the acs package and the
# US Census Bureau key to obtain state and/or county specific
# data from selected American Community Survey (ACS) and/or
# Decennial Census tables.
#
# Use the following URL to access the form needed to obtain an
# API key from the US Census Bureau:
# https://api.census.gov/data/key_signup.html
#
# Complete details on the API process with the US Census Bureau
# are available at https://www.census.gov/content/dam/Census/
# library/publications/2020/acs/acs_api_handbook_2020_ch02.pdf.
install.packages("tidycensus", dependencies=TRUE)
library(tidycensus)
# CAUTION: The tidycensus package may take longer to
# download than other packages. Be patient.
tidycensus::census_api_key(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
#"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
# CAUTION: The code for this key is reserved
# for [email protected] only. Use your own key!
Far more detail on APIs is provided in a later lesson. For now, accept that an appropriate Census API key has been obtained (it is not shown in this lesson since it is a private key – obtain and use your own key) and that the variable code for this session (variables="B25031_004", Median gross rent, 2 bedrooms) is known, by examination of the file ACS2019.csv. Instead, simply focus on the simplicity of data acquisition through use of an R-based API function, tidycensus::get_acs() in this example.
All50StatesMedianRent <-
tidycensus::get_acs(
geography="state", # Breakouts by state
variables="B25031_004", # Median gross rent, 2 bedrooms
year=2019, # Year
survey="acs5", # ACS Survey
cache_table="TRUE", # Cache the table
output="tidy", # Tidy output
show_call=TRUE) # Confirm output at Census URL
# Median Gross Rent 2019 2 Bedrooms
print(All50StatesMedianRent, n=67)
# A tibble: 52 x 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B25031_004 759 6
2 02 Alaska B25031_004 1244 20
3 04 Arizona B25031_004 1007 7
4 05 Arkansas B25031_004 720 5
5 06 California B25031_004 1536 5
Challenge: Compare the 2019 median cost of rent by state to the previous summary of 2020 median salary by state for selected jobs. As an example, the 2019 median cost of rent in California was $1536 per month, whereas the 2020 median annual salary for a registered nurse in California was $118,410. What is the ratio of rent to salary? Compare the ratio to other states. For this one ratio-type metric alone, which state is the most favorable in terms of rent versus salary? Compare this metric for multiple jobs to see if outcomes are consistent. Then, prepare ratios for multiple jobs across multiple states. This challenge, when completed, should give a good understanding of how these data fit into career decisions for those with an interest in data science and biostatistics.
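As a small worked example, using only the two California figures quoted above:
(1536 * 12) / 118410
# Annual rent of $18,432 against an annual salary of $118,410
# yields a ratio of roughly 0.156, so about 15.6 percent of the
# median salary would go toward rent for a 2-bedroom unit.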
Career paths and life choices take many unexpected twists and turns, based on personal desires, job openings, relationships, family, etc. If relocation is an option, it may at first seem best to move to an area with the highest salaries for a chosen profession. However, some analysis of local cost of living, such as using rent as a recognized proxy for consumer costs, may contribute to a more informed decision. A career that includes the use of biostatistics, in whole or in part, involves many factors, such as salary and place of residence. As with data science in general, available data should be considered when developing life and career plans.
External Data and/or Data Resources Used in This Lesson
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, and .xlsx file formats.
CIP010000CompletionsAllCertDeg2010to2019AgrGeneral.xlsx
CIP010301CompletionsAllCertDeg2010to2019AgrProdOperGen.xlsx
CIP010401CompletionsAllCertiDeg2010to2019AgrFoodProdProc.xlsx
CIP010901CompletionsAllCertDeg2010to2019AnimalSciGen.xlsx
CIP011001CompletionsAllCertDeg2010to2019FoodScience.xlsx
CIP011101CompletionsAllCertDeg2010to2019PlantSciGen.xlsx
CIP011201CompletionsAllCertDeg2010to2019SoilSciAgmyGen.xlsx
CIP260502CompletionsAllCertDeg2010to2019MicrobioGen.xlsx
CIP261101CompletionsAllCertDeg2010to2019BioBiometrics.xlsx
CIP261102CompletionsAllCertDeg2010to2019Biostatistics.xlsx
CIP261103CompletionsAllCertDeg2010to2019Bioinformatics.xlsx
CIP261104CompletionsAllCertDeg2010to2019CompBiology.xlsx
CIP261199CompletionsAllCertDeg2010to2019BioinfCompBio.xlsx
CIP261306CompletionsAllCertDeg2010to2019PopBiology.xlsx
CIP261309CompletionsAllCertDeg2010to2019Epidemiology.xlsx
CIP270501CompletionsAllCertDeg2010to2019StatisticsGen.xlsx
CIP303001CompletionsAllCertDeg2010to2019ComputSci.xlsx
CIP440503CompletionsAllCertDeg2010to2019HlthPolAnaly.xlsx
CIP510401CompletionsAllCertDeg2010to2019Dentistry.xlsx
CIP511201CompletionsAllCertDeg2010to2019Medicine.xlsx
CIP511401CompletionsAllCertDeg2010to2019MedScientist.xlsx
CIP511901CompletionsAllCertDeg2010to2019OsteoMedOpath.xlsx
CIP512010CompletionsAllCertDeg2010to2019PharmSciences.xlsx
CIP512201CompletionsAllCertDeg2010to2019PubHealthGen.xlsx
CIP512202CompletionsAllCertDeg2010to2019EnvHealth.xlsx
CIP512401CompletionsAllCertDeg2010to2019VetMed.xlsx
CIP512706CompletionsAllCertDeg2010to2019MedInform.xlsx
CIP513801CompletionsAllCertDeg2010to2019NursingRegNur.xlsx
CIP513808CompletionsAllCertDeg2010to2019NursingSci.xlsx
CIP513818CompletionsAllCertDeg2010to2019NursPractice.xlsx
national_M2020_dl.xlsx
state_M2020_dl.xlsx
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()) and read.table(textConnection()).
Chapter 2
Data Sources in Biostatistics
Data scientists often work with data from external resources, data over which they often have little to no control regarding creation and original format. However, data scientists also often work with their own data – data that are the result of their actions and that they create, either individually or by delegation (with supervision) to subordinates.
Consider HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt, a tab-separated file serving as a teaching dataset. Immediately, it should be stated that no Code Book is provided to explain what the codes Code01-07 and CodeA-C represent. There is also no description as to why individual CIP codes show on multiple rows. That information is not relevant here, for this lesson. Instead, look at the way the data were saved by an inexperienced, and obviously unsupervised, assistant. The data are in tab-separated format; there is one row that includes a descriptive title, and there are two separate header rows. It would be wrong to say that the file cannot be used. Of course, the file can be used, but it would take some work to put it into a readily usable format. Planning and close supervision of assistants would have helped. Still, look at the file using a text editor to see the type of structure for this dataset, as an example of the challenges data scientists often face, even with data under their control.1
Challenge: HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt is avail-
able at the publisher’s Web site associated with this text. The data are clearly not in
1. It is often viewed that time on task in data science follows an 80-20 rule, where 80% of the time is given to organizing data into proper format and 20% of the time is given to the main job at hand: using the data for discovery purposes that place value on the entire data science process.
a tidy format. Using an editor, delete the title row and then organize the two header
rows so that there is only one descriptive header row, ideally so that for each column
the code provides a sense of Code01-07 and CodeA-C. Then prepare summations
for each column. These actions would be a good start as the data are made at least
somewhat tidy, but for this introductory activity, the Code Book is needed to offer
greater value to the data. After this experience, it should be obvious that hand edit-
ing of the file is no easy task. Later, with more experience, the tidyverse ecosystem
will be used to organize the data and provide value to its use. Those who are expe-
rienced with the tidyverse ecosystem should be able to achieve these aims using R
syntax, and these tools are introduced in a gradual manner throughout this text.
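For those who want a head start, a minimal sketch follows; it assumes the file has one title row, two header rows, and then tab-separated data, and the exact column contents are not shown here:
library(readr)
library(dplyr)

# Read the two header rows (skipping the title row) and paste
# them together into one descriptive header per column.
hdr <- readr::read_tsv(
  "HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt",
  skip=1, n_max=2, col_names=FALSE, show_col_types=FALSE)
new_names <- paste(hdr[1, ], hdr[2, ], sep="_")

# Re-read the data, skipping the title and both header rows, and
# apply the combined names.
Headcounts.tbl <- readr::read_tsv(
  "HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt",
  skip=3, col_names=new_names, show_col_types=FALSE)
dplyr::glimpse(Headcounts.tbl)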
Now consider LMilkManagementPoundsTidy.txt, a different personal dataset
that is tidy, with white space used to separate columns (e.g., variables). The full
dataset was originally quite comprehensive and addressed multiple variables associ-
ated with commercial milk production, such as: breed, operator, management prac-
tice, pounds of milk per lactation, percent fat, and percent protein. Data acquisition
and organization were closely supervised, the data were originally entered into a
common spreadsheet, the data were entered in wide format, and eventually R was
used to transform the desired data into long format. Further R-based actions were
used to sequester the data so that for each planned analysis there was one breakout
dataset, such as the dataset involving milk production, Management and Pounds. If
possible, plan for future work before data are ever obtained and put into some type
of organizational structure, ideally, so that more time is given to examination and
productive use of the data and less time is given to the mundane, but still critically
important, task of data organization.
Challenge: LMilkManagementPoundsTidy.txt is available at the publisher’s Web
site associated with this text. The data are in tidy format. Use the data to determine
the mean pounds of milk per lactation by management practice, Conventional and
Organic.
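A minimal sketch follows; it assumes the whitespace-separated file uses a header row with the column names Management and Pounds, consistent with the dataset description:
library(dplyr)

LMilk.df <- utils::read.table("LMilkManagementPoundsTidy.txt",
  header=TRUE, stringsAsFactors=TRUE)
LMilk.df %>%
  dplyr::group_by(Management) %>%
  dplyr::summarize(
    N          = dplyr::n(),
    MeanPounds = mean(Pounds, na.rm=TRUE))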
Local Data Sources
Local communities such as cities, counties, townships, tribal lands, and boroughs are often subject to sunshine laws, under which commissioners, freeholders, supervisors, directors, and all other public officers must hold formal meetings in an open setting and all relevant data under their charge must be made conveniently available to the public. Internet posting of data, with unfettered access, is now a common process by which data are made available to the public.
Consider two typical Web-based data sources made available by Palm Beach
County, Florida. The first resource is clearly associated with biostatistics in that it
relates to environmental issues. The other resource may not at first seem relevant to
biostatistics, but quite the opposite, it is an excellent proxy for gauging public health
and the way public health is impacted by the overall economy.
Palm Beach County, Florida, Natural Areas Trails: Many communities provide
an inventory of public nature trails since this resource is highly valued. Property
developers who wish to construct housing developments near protected natural
areas, to attract potential home buyers, find information about natural areas and
their location useful. Of course, those who focus on the conservation of natural
resources will also find this information useful as they try to restrict high-density
housing developments that encroach on nearby sensitive natural areas. Both parties
have access to these public data, in their attempt to meet goals.
Challenge: Review the URL https://opendata2-pbcgov.opendata.arcgis.com/datasets/PBCGOV::palm-beach-county-natural-areas-trails/explore?location=26.587800%2C-80.496600%2C9.32&showTable=true and notice that, along with the table first seen at this URL, it is also possible to download a shapefile (.shp file format), allowing creative mapping opportunities for those with this skill, with mapping a critical domain in data science. The file Palm_Beach_County_Natural_Areas_Trails.csv represents the data at this resource and it is available at the publisher's Web site associated with this text. Prepare a summary of the many different types of trails, including in part: boardwalk, equestrian, hiking, nature, multi-use, etc. How many trails are there of each type, and what is the mean length of each?
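A minimal sketch follows; the column names Trail_Type and Length_Miles are hypothetical placeholders and should be adjusted to match the header of the downloaded .csv file:
library(readr)
library(dplyr)

Trails.tbl <- readr::read_csv(
  "Palm_Beach_County_Natural_Areas_Trails.csv",
  show_col_types=FALSE)
Trails.tbl %>%
  dplyr::group_by(Trail_Type) %>%
  dplyr::summarize(
    N          = dplyr::n(),
    MeanLength = mean(Length_Miles, na.rm=TRUE))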
Palm Beach County, Florida, Bed Tax Collections: Many United States munici-
palities collect a daily short-term rental bed tax as one of many tools used to gener-
ate revenue from hotel guests, with the tax bundled into the total charge for lodging.2
At first, it may seem that this information is interesting, but not for those focused on
biostatistics. On the contrary, bed tax receipts are often viewed as a critical proxy
indicator of local economic vitality, a downstream indicator of public health due to
the linkage between public health and public finances. Given the pervasive impact
of COVID-19 on the economy, hotel-based bed taxes are an especially useful proxy measure of how the pandemic impacted the travel and tourism industry, comparing bed tax collections prior to the pandemic in 2019, during the worst of the pandemic in 2020 and 2021, and after the emergence of pandemic endemicity in late 2022 and onward. Never discount the usefulness of clever proxy measures.
Challenge: Review the URL https://discover.pbcgov.org/touristdevelopment/
pages/bed-tax-collections.aspx and observe how data can be downloaded in .pdf
format, with the resource Tourist_Development_Tax2021-2022.pdf also avail-
able at the publisher’s Web site associated with this text. Look at Gross Collections
and Net Collections and compare month-by-month change, beginning March
2019 to December 2021. Use the ggplot2::ggplot() function to prepare a line
chart of each (X axis Month and Year and Y axis Bed Tax Collection), Gross
Collections and Net Collections. Compare these line charts to any readily avail-
able resource that plots COVID-19 at the county level, such as what the New York
Times makes available at the URL https://www.google.com/search?client=firefox-
2. The Palm Beach County bed tax, conveniently called the Tourist Development Tax, is currently 6% of total charges and it is separate from any charges for state and county sales tax obligations.
In the same way that local governments provide a wide variety of data resources for
public access, states also work in the sunshine and provide relevant data for public
acquisition. Among the nearly countless number of datasets and related topics, look
at the sample datasets shown below. Of course, as time permits, look at datasets
from other states.
National School Lunch Programs (NSLP) Locations: The United States
Department of Agriculture, the United States Department of Education, states, local
school boards, and other educational agencies partner to coordinate school lunch
programs for those children who qualify. There are many data resources for the
National School Lunch Program (NSLP, https://www.fns.usda.gov/nslp), a long-
term program that was initiated at the federal level in 1946.
Challenge: Review the URL https://geodata.fdacs.gov/datasets/FDACS::nslp-
sites/explore?location=27.764172%2C-83.759111%2C6.54&showTable=true, an
NSLP resource provided by the Florida Department of Agriculture and Consumer
Services and specific to Florida educational institutions. The data at this resource
were easily downloaded as nslp_sites.csv and this dataset is also available at the
publisher's Web site associated with this text. This dataset is somewhat unusual in that it provides geospatial information (e.g., latitude and longitude coordinates) for thousands of schools and other service providers, allowing mapping opportunities. Once again, mapping is a rapidly emerging domain in data science. For those with skills beyond the introductory level, use this .csv file to map selected locations of participating NSLP schools throughout Florida, possibly by using functions from either the choroplethr package or the tmap package. Maps, as useful figures, are demonstrated later in this text.
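A minimal sketch follows; the column names Longitude and Latitude are assumptions about nslp_sites.csv and should be checked against the actual file, and the sf package is used here to build the spatial object passed to tmap:
library(readr)
library(sf)
library(tmap)

NSLP.tbl <- readr::read_csv("nslp_sites.csv", show_col_types=FALSE)
NSLP.sf  <- sf::st_as_sf(NSLP.tbl,
  coords=c("Longitude", "Latitude"), crs=4326)
# crs=4326 declares ordinary latitude/longitude coordinates.
tmap::tm_shape(NSLP.sf) +
  tmap::tm_dots()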
Florida Department of Health Tracking: Although the use of an API would be
desirable, by going to Florida Tracking (https://www.floridatracking.com/health-
tracking/, a resource with a degree of affiliation with the Centers for Disease Control
and Prevention), and completing a simple set of checks at a Graphical User Interface,
it is possible to construct a by-county dataset for all 67 Florida counties that focuses
on data associated with: (1) Life Expectancy at Birth, (2) Air Quality, and (3) Cancer
(various types and different measures; e.g., age-adjusted incidence rate and number
National Data Sources
Following along with the paradigm of government in the sunshine and requirements
that different government agencies make data readily available to the public, the
various federal cabinet departments and their respective agencies make available a
wealth of information, and there has been great improvement in the ready availabil-
ity of the data, along with its quality. It would take volumes to give any meaningful
detail on the many Web-based datasets offered by the federal government, so only a
few are listed below, but first visit the URL DATA.GOV (e.g., https://data.gov/) to
gain a first sense of possibilities.
3. Nearly all states provide similar information.
An excellent starting point for obtaining data from the decennial census is Quick Facts, https://www.census.gov/quickfacts/fact/table/US/PST045221. Along with selections made from a Graphical User Interface, it cannot be ignored that the Census Bureau is a leader in facilitating the use of R-based API functions as a means of making data available, including data from the Decennial Census, American Community Survey, Public Use Microdata Sample, and many other data resources (see Census Datasets, https://www.census.gov/data/datasets.html).
Challenge: Use the Quick Facts interface to create and download a dataset
detailing Cape May County, New Jersey, saved as QuickFactsApr072022
CapeMayCountyNJvUnitedStates.csv and made available at the publisher’s Web
site associated with this text. Carefully examine the file for format and consistent
presentation of information that describes: Population, Age and Sex, Race and
Hispanic Origin, Population Characteristics, Housing, Families and Living
Arrangements, Computer and Internet Use, Education, Health, Economy,
Transportation, Income and Poverty, Businesses, and Geography. Use Quick Facts
to search other communities. Anyone who engages in biostatistics relating to
humans and communities should visit the United States Census Bureau to learn as
much as possible about the areas under investigation.4
The data possibilities from the Centers for Disease Control and Prevention (CDC) may seem nearly endless for those data scientists who work in biostatistics. Look at the information related to Diagnosed Diabetes (https://gis.cdc.gov/grasp/diabetes/DiabetesAtlas.html) as a first review of what could be extremely useful for those who study this common disease.
Challenge: Use the dataset DiabetesAtlasData.csv (available at the publisher's Web site associated with this text) and the ggplot2::ggplot() function to construct a line chart of X axis (Year, 2000 to 2019) by Y axis (Total Percentage Diagnosed Adults), to offer a view of the growing incidence of this disease over the last two decades; a minimal sketch of the syntax follows at the end of this paragraph. But the curious data scientist would look not only for data related to diabetes but also for known comorbidities, to see if there are associations. With that said, a search tool was used to examine any possible relationship between diabetes and obesity (https://www.cdc.gov/nchs/hus/contents2019.htm?search=Obesity/overweight), a
4. With more experience in R, review the following R packages and how they can be used to obtain data from the Census Bureau, whether the Decennial Census, the American Community Survey (ACS), or other census-related resources: acs, censable, censusapi, cpsR, easycensus, idbr, ppmf, tidycensus, tidyqwi, totalcensus. All packages have merit, but by choice the tidycensus package is emphasized in this text.
known comorbidity of diabetes. This expanded query against CDC data yielded the
dataset SelectedHealthConditionsandRiskFactorsbyAge.xlsx (available at the pub-
lisher’s Web site associated with this text), addressing multiple risk factors and not
only diabetes and obesity.5 It would take some work to put the data into tidy format,
but clearly, it was not necessary to conduct what would possibly be redundant sur-
veys, clinical trials, or other time-intensive and expensive actions needed to obtain
the data.6 The CDC should always be among the first choices when data related to
public health issues are considered.
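A minimal sketch of the requested line chart follows; the column names Year and TotalPercentage are hypothetical placeholders and should be adjusted to match DiabetesAtlasData.csv:
library(readr)
library(ggplot2)

Diabetes.tbl <- readr::read_csv("DiabetesAtlasData.csv",
  show_col_types=FALSE)
ggplot2::ggplot(data=Diabetes.tbl,
  aes(x=Year, y=TotalPercentage)) +
  geom_line() +
  labs(
    title="Diagnosed Diabetes Among Adults: 2000 to 2019",
    x = "\nYear",
    y = "Total Percentage Diagnosed Adults\n")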
The United States Department of Agriculture recognizes the need for quality data in
support of food production and related concerns, as evidenced by their Data Strategy
Statement (https://www.usda.gov/sites/default/files/documents/usda-data-strategy.
pdf). For those who use R, the rnassqs package provides an interface to allow access
to data from the United States Department of Agriculture (USDA) National
Agricultural Statistics Service (NASS). The syntax and use of the rnassqs::nassqs()
function (sample syntax, showing how this function is used, follows), once estab-
lished and tested for validity, can be easily replicated with minor changes, allowing
repeated use across multiple regions, crops, years, etc.7
Challenge: Part of the syntax needed to obtain Kentucky corn yields over time is
provided below, eventually used to generate the file KentuckyCornYield1900Onward.
xlsx (this file is also available at the publisher’s Web site associated with this text),
but explicit detail and all syntax on use of the rnassqs::nassqs() function is provided
in the lesson on APIs. In response to this challenge, select known high-yielding
counties in Kentucky for corn production, such as Daviess, Fulton, Logan, Shelby,
and Warren, and for each county prepare a line chart of X axis Year by Y axis Corn
Yield (Bushels per Acre) to see the dramatic increase in productivity of corn pro-
duction, allowing for a few years when climate and disease challenged normal yields.
5. Although the two files mentioned in this section are not at all large and could easily be edited by using standard spreadsheet actions, as a challenge, use tidyverse ecosystem tools to put the data in good form, suitable for R. This may not be possible yet, but by the end of this text it should be possible to come back to these files to achieve that task. As an advance organizer, look at use of the janitor::row_to_names() function and the dplyr::slice() function, but again more detail is offered later in this text.
6. Data are rarely easy to obtain. Data are expensive. If a proxy dataset can be legally obtained and if that dataset meets needs, then consider its use – at least as a first indicator of direction.
7. As a quality assurance check on validity of the data, review corn (Zea mays) yields for 2012, a year of extreme Spring and Summer drought in many regions. Then check corn yields for 1970, a year when Southern Corn Leaf Blight (SCLB), a disease caused by the fungus Helminthosporium maydis, specifically Race T, caused extreme distress in many fields, reducing yields.
install.packages("rnassqs", dependencies=TRUE)
library(rnassqs)
# The Housekeeping section and related syntax
# has not yet been put into place in this
# lesson but is instead found at the start of
# Addendum 1.
nassqs_auth(key="UseTheKeyProvidedAtSign-Upxxxxxxxxxx")
# The USDA NASS key is free, but it must first be
# obtained at https://quickstats.nass.usda.gov/api.
#
# CAUTION: The code for this key is reserved
# for [email protected] only. Use your own key!
utils::str(KentuckyCornYield1900Onward.tbl, width=64,
strict.width="cut")
Consider the possibilities of how this dataset could be used to communicate outcomes to the public during National Farm-City Week, usually held in late November.
As stressed in this text, poverty has an extreme impact on public health. This text
purposely provides pointers to data resources that serve as national, state, and local
proxy indicators of the economic challenges that many families face, knowing the
impact of poverty on health, even when just one family member is out of work.
Poverty data should be considered biostatistics data.
As briefly mentioned earlier, the National School Lunch Program (NSLP), now
in place for more than 75 years, was established in a multi-agency attempt to pro-
vide wholesome and nutritious meals (often, breakfast, lunch, and a snack before
afternoon dismissal) during the school day to youth who are in need.8 Using
Graphical User Interface (GUI) menu-type selections at Number and percentage of
public school students eligible for free or reduced-price lunch, by state: Selected
years (https://nces.ed.gov/programs/digest/d18/tables/dt18_204.10.asp, select Click
here for the latest version of this table), look at the wealth of state-wide data on
participation in the NSLP, currently from the 2000–2001 school year onward.
Challenge: Data in NationalSchoolLunchProgram2000-01to2018-19.xlx, avail-
able at the publisher’s Web site associated with this text, are clearly not in tidy for-
mat. Using tidyverse ecosystem tools or, if needed, direct editing, put the data
into tidy format (by continuing with future lessons, direct editing should become
unnecessary).9 As a value-added activity, obtain state-wide Census-based data on
poverty. Merge these state-wide data on poverty with the state-wide dataset on
NSLP participation. Then, determine if there is any degree of association between
Census-obtained poverty statistics and NSLP-obtained percent eligibility rates in
free and reduced-price lunch programs. Data scientists often obtain, organize, scrub,
manipulate, etc. multiple datasets into one final dataset before any attempt is made
to use the data for value-added purposes. Come back to this challenge after skills
expand since it is a typical action in data science.
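Once both datasets are in tidy form, the merge itself can be brief. The sketch below only suggests the shape of that step; the tibble and column names are hypothetical placeholders, and the actual key depends on how the Census and NSLP data were prepared.

# A minimal sketch with hypothetical names: one row per state in each
# tibble, joined on a shared "state" column.
nslp_poverty.tbl <- dplyr::left_join(nslp.tbl, poverty.tbl,
  by = "state")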
8 Many school-based principals find creative ways to offer NSLP services to all students, regardless
of formal eligibility requirements, to avoid the stigma of participation for individual students
in need.
9 Read the footnotes at the bottom of the spreadsheet to see how the data were obtained and to also
gain a sense of expansion of eligibility requirements over time, thus expansion of participation
over time.
Following along with the theme that poverty impacts public health, data available
through the United States Department of Labor are without parallel when attempt-
ing to gauge the economic viability of a region, whether at the county, state, or national level.
Using data for September 01, 2020, during a period when avoidance of public gath-
erings due to the COVID-19 pandemic was at its most extreme, consider the data
presented in the file FloridaUnemploymentByCountySep-01-20.csv (this dataset is
also available at the publisher’s Web site associated with this text) and the accompa-
nying figure (Fig. 2.1), presenting a choropleth (a color-coded thematic map) of the
data for all 67 Florida counties and a brief summary of the most extreme values.10
Challenge: Give particular attention to the high unemployment rates in Central
Florida (especially Osceola County and Orange County) during this time; this
region is home to many internationally known theme parks, which of course are a
draw for nearby restaurants, hotels, airports, rental car agencies, and
similar tourist industry accommodations. The economic impact and later public
health impact of these exceptionally high unemployment rates are without current
parallel, and these impacts include cancelled and deferred medical appointments,
stress and the subsequent increased use of alcohol and tobacco, unhealthy family
dynamics due to missed pay checks and late housing payments, lost educational
opportunities for pK-12 students who did not have sufficient computing resources
at home to actively participate in emergency remote learning opportunities, etc.
Fig. 2.1
10 API-type functions from the blscrapeR package were used to obtain the unemployment data by
Florida county. It is important to note that the blscrapeR package is currently not available from
CRAN, but the latest development version is available at GitHub.
The Environmental Protection Agency (EPA) was established in the early 1970s and
since that time this agency has made tremendous strides to bring environmental
issues to public attention, often by using reliable and valid data to communicate
issues of importance.11 The EPA has a wide variety of data available to the public,
as part of an open government approach, and the URL https://www.epa.gov/data
should be reviewed to learn more about data availability.
A typical example of data provided by the EPA would be the data associated with
Air Quality Index (AQI), a metric that provides an overall summary of air quality
that should be easily understood by all and not only those scientists who work in this
specialized domain. As an example of the data, go to https://aqs.epa.gov/aqsweb/
airdata/download_files.html#Annual, unzip the file annual_aqi_by_county_2021.
zip, and then review the file annual_aqi_by_county_2021.csv, Annual Summary
Data AQI by County, which is also made available at the publisher’s Web site asso-
ciated with this text.12
Challenge: The dataset annual_aqi_by_county_2021.csv, addressing many coun-
ties in the United States, was selected in that the information could be easily under-
stood by the public, using column headers such as: Days with AQI, Good Days,
Moderate Days, Unhealthy for Sensitive Groups Days, Unhealthy Days, Very
Unhealthy Days, Hazardous Days, and Median AQI. For those with special interest
in how the COVID-19 pandemic and mitigating actions such as lockdowns impacted
not only public health but also environmental health, review data for 2019 (pre-
pandemic) and then data for 2020 and 2021. A ggplot2::ggplot() function line chart
of data from the column Median AQI for each of the 3 years, 2019, 2020, and 2021,
would be especially useful. Did the many public pandemic mitigating actions, such
as lockdowns, reduced economic activity, and limited automobile traffic, improve
air quality? Respond to this challenge, once skills with the tidyverse ecosystem are
sufficient. It serves as an excellent example of how the ggplot2::ggplot() function
can be used to communicate outcomes in a clear manner, ideally on a subject of
public interest.
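When skills with the tidyverse ecosystem are sufficient, the response might take roughly the following shape. The file names follow the EPA naming pattern for 2019 and 2020, and the column names (State, County, Year, Median AQI) are assumptions that should be checked against the downloaded files.

# A minimal sketch; verify file names and column headers before use.
library(tidyverse)
aqi.tbl <- dplyr::bind_rows(
  readr::read_csv("annual_aqi_by_county_2019.csv"),
  readr::read_csv("annual_aqi_by_county_2020.csv"),
  readr::read_csv("annual_aqi_by_county_2021.csv"))
aqi.tbl %>%
  dplyr::filter(State == "Florida", County == "Orange") %>%
  ggplot2::ggplot(aes(x = Year, y = `Median AQI`)) +
  geom_line() +
  geom_point()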
11 The early 1970s was a time when it was finally recognized that a healthy environment could not
be assumed, but instead attention to pollution and other environmental issues was needed. The first
Earth Day was celebrated on April 22, 1970. Concurrently, a Public Service Announcement (PSA)
television commercial (now retired) featuring an actor professionally known as Iron Eyes Cody
had tremendous impact on how it was in the national interest to give attention to environmental
concerns and the impact of the same on public health.
12 Review Air Quality Index (AQI) Basics, https://www.airnow.gov/aqi/aqi-basics/, to learn more
about the AQI scale.
For more than 70 years, the National Science Foundation (NSF) has played a
leading role in the way science is used to improve the human condition. On the topic
of leadership in the sciences, consider the annual NSF surveys used to monitor the
critical mass of doctoral recipients from various academic areas, those individuals
who are most likely to have later leadership roles in various fields of study in the
sciences, including the biological sciences.
Challenge: As an example, consider the NSF dataset Doctorate recipients, by
fine field of study: 2010–20, available at https://ncses.nsf.gov/pubs/nsf22300/data-
tables, and download the file nsf22300-tab013.xlsx, which is also available at the
publisher’s Web site associated with this text. For those who are concerned about
the production of sufficient numbers of individuals who can contribute to the sci-
ences, these data should offer a view of future leadership, considering how many of
these individuals may have a future 30–40-year career in their selected discipline.
The dataset is presented as an easy-to-read table and should take only a small
amount of effort to make the data suitable for use with R. For those with a special
interest, look at the data under the broad field of Agricultural sciences and natural
resources, from Agricultural economics to Natural resources and conservation. Use
the ggplot2::ggplot() function to prepare a line chart of general trends for selected
areas of study. Do the outcomes sync with national needs? These data may not be of
direct interest to the public, immediately, but they provide critical information for
appropriate policy makers as efforts are made to guide the nation's workforce in the sci-
ences, a workforce that needs skills with biostatistics and data science, well into
the future.
European Union and European Economic Area (EU/EEA) data on the daily number
of new reported COVID-19 cases and deaths are obtained at Data on the daily num-
ber of new reported COVID-19 cases and deaths by EU/EEA country (https://www.ecdc.europa.eu/en/publications-data/data-daily-new-cases-covid-19-eueea-coun-
try) and have been downloaded as COVID19CasesDeathsEUEEACountryApr-16-
2022.csv, with data updated after this point in time.13
Challenge: The data come from multiple cabinet-level ministries as well as other
resources and are made available on different dates; given this disparity, great
efforts are made to put the data into a unified dataset. Review All Topics, A to
Z (https://www.ecdc.europa.eu/en/all-topics) to gain a sense of the comprehensive
nature of this invaluable data resource. Select a few countries of special interest and
prepare a ggplot2-based line chart of date (X-axis) by total COVID-19 deaths
(Y-axis), to see the wave-like pattern of the pandemic, as the SARS-CoV-2 virus
mutated into variants, each with unique characteristics in terms of severity and
transmissibility.
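A sketch of one possible response follows. The column names (dateRep, deaths, countriesAndTerritories) follow the usual ECDC layout but are assumptions that should be verified against the downloaded file, and the selected countries are only examples.

# A minimal sketch, assuming the usual ECDC column names.
library(tidyverse)
ecdc.tbl <-
  readr::read_csv("COVID19CasesDeathsEUEEACountryApr-16-2022.csv") %>%
  dplyr::filter(countriesAndTerritories %in%
    c("France", "Germany", "Italy")) %>%
  dplyr::mutate(date = lubridate::dmy(dateRep)) %>%
  dplyr::arrange(countriesAndTerritories, date) %>%
  dplyr::group_by(countriesAndTerritories) %>%
  dplyr::mutate(total_deaths = cumsum(deaths)) %>%
  dplyr::ungroup()
ggplot2::ggplot(ecdc.tbl,
  aes(x = date, y = total_deaths, color = countriesAndTerritories)) +
  geom_line() +
  labs(x = "\nDate", y = "Total COVID-19 Deaths\n")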
13 This resource is very R friendly. Look at the directions at Script for downloading the CSV file into R software. The dataset is also available at the publisher's Web site associated with this text.
For those who work with international agricultural data, the data possibilities avail-
able at the United Nations Food and Agriculture Organization should be reviewed.
To do that, as a good starting point, look at the data related to Crops and Livestock
Products at FAOSTAT, https://www.fao.org/faostat/en/#data/QCL. From this start-
ing point, select France, Germany, and the United Kingdom of Great Britain and
Northern Ireland to gain a sense of the change in animal production over the last 60
or so years, from 1961 to 2020.
Challenge: After interacting with the interface, data were downloaded as
FranceGermanyUKLivestoack1961to2020.csv and this dataset is also available at
the publisher’s Web site associated with this text.14 For each selected country, com-
pare production for a selected variable (possibly Meat, turkey) over time, typically
by preparing a line chart based on use of the ggplot2::ggplot() function.
14 Similar data made available at FAOSTAT can be obtained using functions associated with the R
package FAOSTAT, but for now use the Graphical User Interface (GUI) to explore the many types
of data and how the data can be filtered to meet specific needs.
World Bank
From among many possible resources at the World Bank, review the birth rate data-
set, birth rate, crude (per 1000 people), made available at https://data.worldbank.
org/indicator/SP.DYN.CBRT.IN?end=2020&start=2000&view=chart, where birth
rate data are provided by country and a few selected regions or entities. Then review
the World Bank dataset relating to gross domestic product (GDP), GDP (current
US$), available at https://data.worldbank.org/indicator/NY.GDP.MKTP.CD.
Challenge: Clean (e.g., organize, scrub, etc.) each dataset (the datasets
BirthRatePer1000People.csv and GDPCurrentUSDollar.csv were both downloaded,
and each dataset is available at the publisher’s Web site associated with this text) as
needed, and eventually merge the two, ending with a new dataset that includes
Country Name, Country Code, 2019 (pre-pandemic) Gross Domestic Product, and
2019 (pre-pandemic) Birth Rate per 1000 People.15,16 As much as possible, antici-
pating that there may be some missing datapoints, construct a breakout data set that
includes the many countries associated with the African Union (from Algeria to
Zimbabwe) and the many countries associated with the European Union (from
Austria to Sweden) and explore the data, looking to see if there are any associations
between GDP and birth rate (consider use of the ggplot2::ggplot() function): (1)
overall, for all data, (2) for each of two constructed breakout groups, African Union
and European Union, and (3) for selected countries of individual interest. Come
back to this challenge later, if skills are not yet sufficient.
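As a shape to aim for once each downloaded file has been reduced to the columns of interest, the sketch below assumes the World Bank column labels Country Name and Country Code; the tibble names are hypothetical placeholders.

# A minimal sketch with hypothetical tibble names; the by = columns
# assume the usual World Bank download headers.
GDPBirthRate2019.tbl <- dplyr::inner_join(
  GDP2019.tbl, BirthRate2019.tbl,
  by = c("Country Name", "Country Code"))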
The World Health Organization (WHO) has been prominently mentioned over the
last few years, given its role in responding to the COVID-19 pandemic. As much
attention has been given to COVID-19, it is important to recall that there have been
many deaths during the pandemic that are the result of other causes, especially
behavioral and lifestyle choices, all for a variety of individual reasons. A leading
cause of death relates to the expression Deaths of Despair, with alcohol highly
associated with this expression. Saying that, go to Alcohol-attributable fractions,
all-cause deaths (%) at https://www.who.int/data/gho/data/indicators/indicator-
details/GHO/alcohol-attributable-fractions-all-cause-deaths-(-) and export the data.
15 The base::merge() function should not be ignored. The dplyr::bind_rows() and dplyr::bind_cols()
functions, associated with the tidyverse ecosystem, are also frequently used, along with other func-
tions from the dplyr package. Individual choice usually determines which function(s) are used to
achieve aims.
16 Are the codes in Country Code consistent for each of the two datasets? If so, this consistency
should help facilitate any merging actions of the two datasets.
There are a few commercial organizations that make public interest data available,
often with no authentication needed for access. However, in many cases, it may be
easier to obtain the data (or reasonable proxy data) at public resources, if possible.
Yet, it would be negligent not to identify perhaps the two most widely
known commercial entities providing data related to COVID-19, repeating the cau-
tion that the data are available at multiple resources, public and private.
The New York Times makes available data on COVID-19, providing extensive
resources. Look at the many selections, with most data at https://github.com/
nytimes/covid-19-data. The Cumulative Cases and Deaths datasets, at various
breakout levels (available as .csv files) and ending dates, are quite interesting and
useful for inquiries on this topic.17
17 Data may be out of date as COVID-19 is now endemic.
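For those who want to bring these data directly into an R session, a minimal sketch follows; the raw GitHub URL reflects the repository layout at the time of writing and should be verified before use.

# A minimal sketch; confirm the file path in the nytimes/covid-19-data
# repository before running.
nyt_states.tbl <- readr::read_csv(
  "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
utils::head(nyt_states.tbl)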
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all objects from the
                   # workspace. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
As used throughout this text, the Housekeeping section represents personal desires
in terms of how R is used, how settings are organized, where packages are kept,
default and other location(s) where files are maintained, etc. As always, use the
Housekeeping syntax as a guide, but of course, make changes as skills and prefer-
ences allow.
In keeping with what is seen in the Housekeeping section, use the many pack-
ages listed below as a starting point for what is often used in an R session that
focuses on the tidyverse ecosystem and its use in data science. Other packages will
be deployed later in this lesson, as needed.
In advance, notice a major difference in this Housekeeping section compared to
what was presented in the prior lesson. A # comment character is placed in front of
the install.packages() function for those packages that were previously downloaded.
There is confidence that the package is up to date and there is no need to download
the package again, at least not until there is an update.18,19 Of course, the library()
function is still used, to put the package into use.20
# install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
# install.packages("readxl", dependencies=TRUE)
library(readxl)
# install.packages("magrittr", dependencies=TRUE)
library(magrittr)
The ggplot2 package is part of the core tidyverse, and it is used to create beauti-
ful graphics.21 However, there are many ancillary packages that support the produc-
tion of graphical presentations, with features that go far beyond what can be prepared
using the ggplot2 package by itself. A few of these graphically oriented ancillary
packages include:
# install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
# install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
# install.packages("ggtext", dependencies=TRUE)
library(ggtext)
# install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
# install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
# install.packages("scales", dependencies=TRUE)
library(scales)
With all upfront work completed, it is now time to address the data associated
with the addenda in this lesson. Most approaches for use of the data will be based
on functions associated with the tidyverse ecosystem, but functions from Base R
18 Look into use of the old.packages() function and the update.packages() function, as needed.
19 It is common to comment-out syntax using the # comment character when there is a desire to
retain syntax that is important, but not currently needed.
20 It is a best practice to use the base::library() function instead of the base::require() function,
given a few known differences between the two functions.
21 The syntax needed to produce figures, typically using the ggplot2::ggplot() function, is purposely
presented throughout this text. Following this syntax, many, but not all, figures are also presented
in this text. Although this action saves space, the syntax should be used against the data and the
figures that are excluded from this text should still be produced, to gain a complete understanding
of the topics stressed in the syntax.
will be used when they represent the most appropriate approach toward problem-
solving, especially at an introductory level.
Addendum 1 is centered on the identification and later retrieval of data from Our
World in Data (https://ourworldindata.org/). Download the R-specific owidR pack-
age and then use a few simple functions to search for data that may help offer a
sense of life expectancy and wealth, as measured by Gross Domestic Product
(GDP). There may be many possible selections, so ideally the dataset names are
verbose and descriptive.
install.packages("owidR", dependencies=TRUE)
library(owidR)
owidR::owid_search("life expectancy")
# The owidR::owid_search() function returns: (1) a list of
# titles related to the declared search term, or "life
# expectancy" in this example, and (2) a list of the actual
# names (e.g., chart_id) of the identified files. Use the
# chart_id to retrieve the data, using R-based API-type
# syntax.
titles
[5,] "Life expectancy vs. GDP per capita"
[12,] "Life expectancy vs. healthcare expenditure"
[18,] "Life expectancy"
[44,] "Deaths from smallpox per 1,000 population vs. Life exp
[45,] "Life expectancy and smallpox deaths per 10,000 people
chart_id
[5,] "life-expectancy-vs-gdp-per-capita"
[12,] "life-expectancy-vs-healthcare-expenditure"
[18,] "life-expectancy"
[44,] "deaths-from-smallpox-per-1000-population-vs-life-expec
[45,] "sweden-life-expectancy-smallpox-deaths"
LifeExpectancyGDP.tbl <-
owidR::owid("life-expectancy-vs-gdp-per-capita")
# Use the owidR::owid() function to retrieve the desired Our
# World in Data dataset, life-expectancy-vs-gdp-per-capita in
# this example.
As a good programming practice (gpp), and because a dataset acquired from an
external Internet host may not always remain available, it is best to immediately
save a local copy of the data, or LifeExpectancyGDP.tbl for this example, which was
originally obtained by invoking the owidR::owid() function.
There are a few different functions that could be used to download a dataset cur-
rently in an R session, for safe keeping and assurance that the data will be avail-
able later:
• utils::write.csv()
• xlsx::write.xlsx()
• writexl::write_xlsx()
From among these possible options, the writexl::write_xlsx() function will be
used in this lesson:
install.packages("writexl", dependencies=TRUE)
library("writexl")
writexl::write_xlsx(LifeExpectancyGDP.tbl,
path = "F:\\R_Ceres\\LifeExpectancyGDP.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("LifeExpectancyGDP.xlsx")
base::file.info("LifeExpectancyGDP.xlsx")
base::list.files(pattern =".xlsx")
The file has been successfully retrieved from Our World in Data, but some col-
umn names are verbose. A few simple actions, using the base::colnames() function,
should make the column names more manageable.
base::colnames(LifeExpectancyGDP.tbl) <- c(
"entity", # Column 01 Entity
"code", # Column 02 Code for entity
"year", # Column 03 Year
"life_expectancy", # Column 04 Life Expectancy
"gdp_capita", # Column 05 GDP per Capita
"population", # Column 06 Population
"continent") # Column 07 Continent
A few key points associated with this file on world-wide trends for population,
life expectancy, and gross domestic product per capita need to be highlighted here,
to avoid later confusion:
• The term entity is used instead of terms such as country or nation. The data are
quite inclusive and include not only data for many countries but also selected
territories and other geographical locations that are not universally recognized
sovereign states. Entity, in this context, is a more correct term.
• Population and life expectancy are always difficult to estimate, but refer to the
original resources gained by deploying the owidR::owid_source() function to
gain a sense of the methods used in support of the dataset.
• As with many other Our World in Data datasets, it is assumed that Gross Domestic
Product (GDP) is adjusted for inflation to 2011 United States dollars (USD),
which of course can be a problematic application of the algorithms used for this
purpose, at least for some entities. Even so, it is assumed that GDP per capita is
useful as a comparative measure, allowing examination of change over time and
differences between and among selected geographic entities.
owidR::owid_source(LifeExpectancyGDP.tbl)
Although it is common to go straight to the data and begin planned and ad hoc
activities, data science calls for careful review of the data, to look for general trends
and possible discovery of the unexpected. Data science, as opposed to only data
analysis, also adds future value to the data, going beyond the basics. Look at a few
formative figures displaying trends over recent memory before more value-added
analyses and displays are attempted.
To achieve this aim, use the dplyr::filter() function to create a new dataset, where
the entire LifeExpectancyGDP.tbl dataset is examined, but only for data from 1950
onward. No attempt will be made to embellish the figures since they are only pre-
pared for diagnostic review (Fig. 2.2).
LifeExpectancyGDPWorld1950onward.tbl <-
LifeExpectancyGDP.tbl %>%
dplyr::filter(year >= 1950)
# Create the 1950 onward ad hoc dataset,
# using the dplyr::filter() function.
Population1950Onward.fig <-
ggplot2::ggplot(
data=LifeExpectancyGDPWorld1950onward.tbl,
aes(x=year, y=population)) +
geom_col(fill="red") +
labs(title="World Population:
1950 Onward") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
par(ask=TRUE); Population1950Onward.fig
LifeExpectancy1950Onward.fig <-
ggplot2::ggplot(
data=LifeExpectancyGDPWorld1950onward.tbl,
aes(x=year, y=life_expectancy)) +
geom_col(fill="red") +
labs(title="World Life Expectancy:
1950 Onward") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Fig. 2.2
par(ask=TRUE); LifeExpectancy1950Onward.fig
GDP1950Onward.fig <-
ggplot2::ggplot(
data=LifeExpectancyGDPWorld1950onward.tbl,
aes(x=year, y=gdp_capita)) +
geom_col(fill="red") +
labs(title="World GDP:
1950 Onward") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
par(ask=TRUE); GDP1950Onward.fig
par(ask=TRUE)
gridExtra::grid.arrange(
Population1950Onward.fig,
LifeExpectancy1950Onward.fig,
GDP1950Onward.fig, ncol=3)
# The Y axis scale was made blank to avoid any
# possible confusion with side by side comparisons,
# given different scales for each metric.
#
# C02Fig09World-WideOutcomesSideBySide.png
These figures, placed into one convenient side by side comparative figure, pro-
vide evidence that at least the most current data, overall and from 1950 onward, are
within expectations of a general upward trend, but of course with occasional
decreases.22
The data were retrieved using the owidR::owid() function, supporting an API
process, and the data have been subjected to an initial review for integrity and
expectations. The data will now be used for a more focused examination of a subset
of the data, to look for interesting outcomes among a few approximately neighboring
geographic entities and in turn add value to this data science experience.
22 The file has not yet been fully organized, scrubbed, cleaned, etc. Even so, these early figures
provide a sense of general trends.
LifeExpectancyGDPSample1900onward.tbl <-
LifeExpectancyGDP.tbl %>%
dplyr::filter(year >= 1900) %>%
dplyr::filter(entity %in% c(
"South Korea",
"North Korea",
"Taiwan",
"China",
"Honduras",
"United States"))
# The dplyr::filter() function is used twice in this syntax:
# (1) once to select only the data from year 1900 onward and
# to then (2) select data only for the six entities listed
# above: South Korea, North Korea, Taiwan, China, Honduras,
# and United States.
base::getwd()
base::ls()
base::attach(LifeExpectancyGDPSample1900onward.tbl)
utils::str(LifeExpectancyGDPSample1900onward.tbl)
dplyr::glimpse(LifeExpectancyGDPSample1900onward.tbl)
utils::head(LifeExpectancyGDPSample1900onward.tbl)
base::summary(LifeExpectancyGDPSample1900onward.tbl)
Before any detailed graphics are produced, it is important to know that the
ggplot2::ggplot() function supports many different themes. A ggplot2 theme is cre-
ated by using syntax to produce a figure with a desired appearance. In an attempt to
make the figures bold and vibrant, but also in an attempt to reduce redundant key-
ing, look at theme_Mac(), a self-created theme that will be used in concert with the
ggplot2::ggplot() function.
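The definition of theme_Mac() does not appear in this extract. A minimal sketch of what such a self-created theme might look like follows, assuming bold titles and axis labels on a plain background; the actual theme_Mac() used throughout the text may set different elements.

# A minimal sketch of a user-created ggplot2 theme; adjust as desired.
theme_Mac <- function() {
  ggplot2::theme_bw(base_size = 14) +
  ggplot2::theme(
    plot.title    = ggplot2::element_text(face = "bold", hjust = 0.5),
    plot.subtitle = ggplot2::element_text(face = "bold", hjust = 0.5),
    axis.title.x  = ggplot2::element_text(face = "bold"),
    axis.title.y  = ggplot2::element_text(face = "bold"),
    legend.title  = ggplot2::element_text(face = "bold"))
}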
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
With this preparation completed, try to produce one summative figure of popula-
tion to show the difficulty of producing an attractive figure straight off (Fig. 2.3).
Fig. 2.3
install.packages("directlabels", dependencies=TRUE)
library(directlabels)
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=population, group=code, color=code)) +
geom_line(size=1.25) +
geom_dl(aes(label=code), method=list(cex=0.75, rot = 25,
hjust=-.5, dl.combine("last.points"))) +
# Use geom_dl, from the directlabels package, to place
# labels at the end of each line, to better identify the
# geographic entity, avoiding reliance only on a color-
# coded legend.
labs(
title=
"Population of Selected Entities Over Time: 1900 to 2020",
subtitle=
"Data Were Obtained from Our World in Data",
x = "\nYear", y = "Population\n")
# Fig. 2.3
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=life_expectancy, group=code, color=code)) +
geom_line(size=1.25) +
geom_dl(aes(label=code), method=list(cex=0.75, rot = 25,
hjust=-.5, dl.combine("last.points"))) +
# Use geom_dl, from the directlabels package, to place
# labels at the end of each line, to better identify the
# geographic entity, avoiding reliance only on a color-
# coded legend.
labs(
title=
"Life Expectancy of Selected Entities Over Time: 1900 to 2020",
subtitle=
"Data Were Obtained from Our World in Data",
x = "\nYear", y = "Life Expectancy\n")
# NOTE: The closing arguments of this call were lost to a page break
# in the source; the title, subtitle, and axis labels shown here are
# reconstructed by analogy with the population figure above.
The life expectancy scale is generally adequate such that this line chart is quite
sufficient to see trends over time for each of the six selected entities. The ggplot
facet_grid() option will add more value and support a better understanding of
outcomes.
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=life_expectancy, group=code, color=code)) +
geom_line(size=1.75) +
facet_grid(cols = vars(entity)) +
# Note how entity was used and not code, to offer
# another view on detail.
labs(
title="Life Expectancy Over Time: 1900 Onward",
subtitle=
"Data Were Obtained from Our World in Data", x="\nYear",
y="Population\n") +
scale_y_continuous(labels=scales::comma, limits=c(20,
90), breaks=scales::pretty_breaks(n=10)) +
theme_Mac() +
theme(legend.position="none") +
theme(axis.text.x=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=90)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
Along with population change and life expectancy, GDP per capita is a leading
indicator of funds that can be devoted to health care, public sanitation and clean
water, workplace protections, and other factors that all contribute to general health
and wellness. Ostensibly, entities with a high GDP per capita have the potential to
promote better health care than entities with a lower GDP per capita. These metrics
do not dictate the quality of healthcare availability, but they do suggest potential
expenditures (Fig. 2.4).
Fig. 2.4
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=gdp_capita, group=code, color=code)) +
geom_line(size=0.75) +
# A thin line was declared in geom_line to accommodate to
# some degree overlap in the early years, when GDP per
# capita was quite low for multiple entities. It gave a
# slight improvement in output, but there is still a fair
# degree of overlap in the early years of this figure.
geom_dl(aes(label=code), method=list(cex=0.75, rot = 25,
hjust=-.5, dl.combine("last.points"))) +
# Use geom_dl, from the directlabels package, to place
# labels at the end of each line, to better identify the
# geographic entity, avoiding reliance only on a color-
# coded legend.
labs(
title=
"Gross Domestic Product (GDP) per Capita of Selected Entities
Over Time: 1900 Onward",
subtitle=
"Data Were Obtained from Our World in Data", x="\nYear",
y="Gross Domestic Product (GDP)\nper Capita\n") +
annotate("text", x=1900, y=25000, fontface="bold", size=03,
hjust=0, label=
"References: UN Population Division (2019) and Others")+
annotate("text", x=1900, y=60000, fontface="bold", size=03,
hjust=0, label="CHN China") +
annotate("text", x=1900, y=55000, fontface="bold", size=03,
hjust=0, label="HND Honduras") +
annotate("text", x=1900, y=50000, fontface="bold", size=03,
hjust=0, label="KOR South Korea") +
annotate("text", x=1900, y=45000, fontface="bold", size=03,
hjust=0, label="PRK North Korea") +
annotate("text", x=1900, y=40000, fontface="bold", size=03,
hjust=0, label="TWN Taiwan") +
annotate("text", x=1900, y=35000, fontface="bold", size=03,
hjust=0, label="USA United States") +
scale_y_continuous(labels=scales::dollar, limits=c(-1000,
60000), breaks=scales::pretty_breaks(n=10)) +
# By using labels=scales::dollar, the USD dollar sign will
# show on the Y axis.
theme_Mac() +
theme(legend.title=element_blank()) + # No legend title
theme(axis.text.x=element_text(face="bold", size=12,
hjust=0.5, vjust=1)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 2.4
There are more than a few years where some entities are missing GDP per capita
data, all for different reasons. Then, consider the GDP per capita overlap in the early
years, as referenced earlier. Given these concerns, once again ggplot’s facet_grid()
option will allow a more nuanced comparison of change in GDP per capita over
time and in turn add more value and support a better understanding of outcomes
(Fig. 2.5).
ggplot2::ggplot(
data=LifeExpectancyGDPSample1900onward.tbl,
aes(x=year, y=gdp_capita, group=code, color=code)) +
geom_line(size=1.75) +
facet_grid(cols = vars(entity)) +
# Note how entity was used and not code, to offer
# another view on detail.
labs(
title="Gross Domestic Product (GDP) per Capita Over Time:
1900 Onward",
subtitle=
"Data Were Obtained from Our World in Data", x="\nYear",
y="Gross Domestic Product (GDP)\nper Capita\n") +
scale_y_continuous(labels=scales::dollar, limits=c(-1000,
60000), breaks=scales::pretty_breaks(n=10)) +
# By using labels=scales::dollar, the USD dollar sign will
# show on the Y axis.
theme_Mac() +
theme(legend.position="none") +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=90)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and
# Y axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 2.5
Fig. 2.5
Challenge: Much more can be done with the data in the original format. As an
example, review the object variable LifeExpectancyGDPWorld1950onward.tbl$entity
and notice how some entities consist of regions, not just individual entities:
• High-income countries
• Upper-middle-income countries
• Lower-middle-income countries
• Low-income countries
• More developed regions
• Less developed regions
Following along with the approach used in this addendum (a starting sketch follows
this paragraph), look into life expectancy and GDP by region, as listed above: (1)
income, high to low (four breakouts); and (2) development, more and less (two
breakouts). Do the figures provide some degree of evidence that GDP impacts life
expectancy? Later, with more experience,
it will be possible to use various inferential analyses to examine this relationship
more closely, but for now, graphical displays are sufficient.
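A starting sketch, assuming the region labels appear verbatim in the entity column:

# Keep only the region-level entities, from 1950 onward, before
# plotting life expectancy and GDP per capita by region.
LifeExpectancyGDPRegions.tbl <- LifeExpectancyGDP.tbl %>%
  dplyr::filter(year >= 1950) %>%
  dplyr::filter(entity %in% c(
    "High-income countries",
    "Upper-middle-income countries",
    "Lower-middle-income countries",
    "Low-income countries",
    "More developed regions",
    "Less developed regions"))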
23 The Bureau of Labor Statistics (BLS) provides a free key for use of API access to data resources.
Periodic renewal of the key is required, with the BLS sending out an e-mail notice of directions for
renewal.
Addendum 2: United States Department of Labor, Bureau of Labor Statistics
install.packages("blsR", dependencies=TRUE)
library(blsR)
NatPctUnemp2017to2021ByGender.tbl <-
  blsR::get_n_series_table(
    series_ids = list(uer.men = "LNS14000001",
                      uer.women = "LNS14000002"),
    api_key = "UseTheKeyProvidedAtSign-Upxxxxxxxxxx",
    start_year = 2017, end_year = 2021)
# NOTE: The arguments to this call were lost to a page break in the
# source. The series IDs shown here (unemployment rate for men and
# for women) and the placeholder key are assumed reconstructions;
# verify the series IDs at the BLS Web site and use your own key.
base::getwd()
base::ls()
base::attach(NatPctUnemp2017to2021ByGender.tbl)
utils::str(NatPctUnemp2017to2021ByGender.tbl)
dplyr::glimpse(NatPctUnemp2017to2021ByGender.tbl)
utils::head(NatPctUnemp2017to2021ByGender.tbl)
base::summary(NatPctUnemp2017to2021ByGender.tbl)
writexl::write_xlsx(NatPctUnemp2017to2021ByGender.tbl,
path = "F:\\R_Ceres\\NatPctUnemp2017to2021ByGender.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
Observe how year and month are integers, but for now, it would be better to view
them as factors.24 The variables uer.men and uer.women are real numbers, in deci-
mal format. A few actions are needed to put the dataset into desired format.
base::colnames(NatPctUnemp2017to2021ByGender.tbl) <- c(
"Year", # Column 01 Year.
"Month", # Column 02 Month
"Uer.Men", # Column 03 Unemployment rate, men
"Uer.Women") # Column 04 Unemployment rate, women
# Uer - Unemployment Rate, as a Percentage
base::getwd()
base::ls()
base::attach(NatPctUnemp2017to2021ByGender.tbl)
utils::str(NatPctUnemp2017to2021ByGender.tbl)
dplyr::glimpse(NatPctUnemp2017to2021ByGender.tbl)
utils::head(NatPctUnemp2017to2021ByGender.tbl)
base::summary(NatPctUnemp2017to2021ByGender.tbl)
As a simple Quality Assurance check of the data, use the ggplot2::qplot() func-
tion with only a few embellishments, to look for trends in unemployment over time
and by gender. With assurance that the data are useful, future actions would likely
call for use of the ggplot2::ggplot() function, but that action is deferred in this
addendum to instead show how the ggplot2::qplot() function has value and should
be considered for initial graphics (Fig. 2.6).25
24 Although it is not needed for this addendum, many who are experienced with the tidyverse eco-
system might use the dplyr::mutate() function and either the lubridate::make_date() function or the
lubridate::make_datetime() function to accommodate how date(s) are considered. Review Schedule
of Releases for the Employment Situation (https://www.bls.gov/schedule/news_release/empsit.
htm) for precise information if greater granularity for dates is needed.
25 It is recognized that the ggplot2::qplot() function is deprecated, but it is still available, it still
works quite nicely, and it is purposely demonstrated in this addendum.
Fig. 2.6
USAUnempPct2017to2021Men.fig <-
qplot(data = NatPctUnemp2017to2021ByGender.tbl,
Year, Uer.Men, ylim=c(0,16),
main="USA Percentage Unemployment of Men
by Year: 2017 to 2021",
xlab="\nYear", ylab="Percentage Unemployment\nMen\n") +
annotate("text", x=0.5, y=15.0, fontface="bold", size=03,
hjust=0, label=
"Each dot is a datapoint for a specific month.") +
annotate("text", x=0.5, y=14.0, fontface="bold", size=03,
hjust=0, label=
"Some months had values that were similar") +
annotate("text", x=0.5, y=13.2, fontface="bold", size=03,
hjust=0, label=
"to other months.") +
theme_Mac()
USAUnempPct2017to2021Women.fig <-
qplot(data = NatPctUnemp2017to2021ByGender.tbl,
Year, Uer.Women, ylim=c(0,16),
main="USA Percentage Unemployment of Women
by Year: 2017 to 2021",
xlab="\nYear", ylab="Percentage Unemployment\nWomen\n") +
annotate("text", x=0.5, y=15.0, fontface="bold", size=03,
hjust=0, label=
"Each dot is a datapoint for a specific month.") +
annotate("text", x=0.5, y=14.0, fontface="bold", size=03,
hjust=0, label=
136 2 Data Sources in Biostatistics
par(ask=TRUE)
gridExtra::grid.arrange(
USAUnempPct2017to2021Men.fig,
USAUnempPct2017to2021Women.fig, ncol=2)
# Fig. 2.6
NatPctUnemp2017to2021ByYearByGender.tbl <-
dplyr::select(NatPctUnemp2017to2021ByGender.tbl,
-Month)
# Use the dplyr::select() function to remove a column
# by name, or Month in this set of syntax, since the
# immediate focus does not include by Month analyses.
base::getwd()
base::ls()
base::attach(NatPctUnemp2017to2021ByYearByGender.tbl)
utils::str(NatPctUnemp2017to2021ByYearByGender.tbl)
dplyr::glimpse(NatPctUnemp2017to2021ByYearByGender.tbl)
utils::head(NatPctUnemp2017to2021ByYearByGender.tbl)
base::summary(NatPctUnemp2017to2021ByYearByGender.tbl)
LNatPctUnemp2017to2021ByYearByGender.tbl <-
tidyr::pivot_longer(NatPctUnemp2017to2021ByYearByGender.tbl,
-c(Year),
names_to = "Gender", values_to = "Pct_Unemployment")
# Put the data into long format, using the
# tidyr::pivot_longer() function.
#
# As is often shown in this text, the leading L
# in the dataset (e.g., tibble) name means that
# the data are in long format.
#
# The expression -c(Year) means that the
# tidyr::pivot_longer() function should
# pivot everything except Year. In this
# syntax, the minus sign means except.
base::getwd()
base::ls()
base::attach(LNatPctUnemp2017to2021ByYearByGender.tbl)
utils::str(LNatPctUnemp2017to2021ByYearByGender.tbl)
dplyr::glimpse(LNatPctUnemp2017to2021ByYearByGender.tbl)
utils::head(LNatPctUnemp2017to2021ByYearByGender.tbl)
base::summary(LNatPctUnemp2017to2021ByYearByGender.tbl)
install.packages("onewaytests", dependencies=TRUE)
library(onewaytests)
onewaytests::describe(Pct_Unemployment ~ Year,
data=LNatPctUnemp2017to2021ByYearByGender.tbl)
onewaytests::describe(Pct_Unemployment ~ Gender,
data=LNatPctUnemp2017to2021ByYearByGender.tbl)
Challenge: With a focus on inferential statistics, use a few different packages for
Analysis of Variance (ANOVA) testing that should provide more definitive statistics
related to statistically significant difference (p ≤ 0.05), following along with the
desire to know26:
• Is there a statistically significant difference (p ≤ 0.05) in percentage unemploy-
ment by year?
• Is there a statistically significant difference (p ≤ 0.05) in percentage unemploy-
ment by gender?
• Is there a statistically significant interaction (p ≤ 0.05) between year and gender
regarding percentage unemployment?
install.packages("agricolae", dependencies=TRUE)
library(agricolae)
agricolae::HSD.test(
aov(Pct_Unemployment ~ Year,
data=LNatPctUnemp2017to2021ByYearByGender.tbl), # Model
trt="Year", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Percentage Unemployment by Year")
# Wrap the agricolae::HSD.test() function around the
# Oneway ANOVA model obtained by using the aov()
# function. Select desired arguments, such as group,
# console, and alpha (e.g., p-value).
26 Far more detail is provided in a later lesson regarding inferential tests.
Year, means
Pct_Unemployment groups
2020 8.10417 a
2021 5.35000 b
2017 4.35417 bc
2018 3.89167 c
2019 3.66667 c
install.packages("car")
library(car)
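The call that generated the Twoway ANOVA table below does not appear in this extract. A sketch of syntax likely to produce output of this form follows, assuming the long-format dataset and the car::Anova() function; the object name is only illustrative.

# Year is wrapped in factor() so that each year forms a separate
# group, and Year * Gender includes both main effects and the
# interaction term.
UnempYearGender.aov <- stats::aov(
  Pct_Unemployment ~ factor(Year) * Gender,
  data = LNatPctUnemp2017to2021ByYearByGender.tbl)
car::Anova(UnempYearGender.aov)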
Response: Pct_Unemployment
Sum Sq Df F value Pr(>F)
Year 315.7 4 27.203 0.00000000000000106 ***
Gender 0.0 1 0.001 0.974
Year:Gender 2.6 4 0.228 0.922
Residuals 319.2 110
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Twoway ANOVA output for percentage unemployment from 2017 to 2021
confirms that:
• There is a statistically significant difference (p ≤ 0.05) in percentage unemploy-
ment by year, with explicit details found in the Oneway ANOVA output.
• There is no statistically significant difference (p ≤ 0.05) in percentage unem-
ployment by gender.
• There is no statistically significant interaction (p ≤ 0.05) between year and gen-
der regarding percentage unemployment.
As useful as these results may be, data scientists need to be aware of conditions
that may impact the data and most importantly, the concern that data are reliable,
valid, and representative. Look at the years (and subsequent data) selected for this
addendum on United States percentage unemployment. The United States economy
was robust in 2017, 2018, and 2019, and unemployment was consistently at or near
record lows. Then COVID-19 impacted everything, with concerns growing in
January and February 2020 – with devastating impacts on the economy by mid-
March 2020. In a mere matter of days, millions of workers were either laid off (e.g.,
made redundant) or voluntarily left their jobs in response to factory and other physi-
cal plant shutdowns, lockdowns, distancing from others, fear of infection, and other
mitigation responses to COVID-19.
Then, whether true or not, there was a continuing storyline in the press that
women were impacted by lost employment at a rate far greater than men once the
COVID-19 layoffs began.27 Look at the following syntax and outcomes to see if the
data support this issue, recalling that collectively, from 2017 to 2021, there was no
difference in percentage unemployment between the two genders (p ≤ 0.05).
LNatPctUnemp2020to2021ByYearByGender.tbl <-
LNatPctUnemp2017to2021ByYearByGender.tbl %>%
dplyr::filter(Year == 2020 | Year == 2021)
# Use the dplyr::filter() function to have data for
# 2020 or 2021, only. Be sure to note the condition
# (Year == 2020 | Year == 2021) and NOT the frequent
# mistake (Year == 2020 | 2021).
base::getwd()
base::ls()
base::attach(LNatPctUnemp2020to2021ByYearByGender.tbl)
utils::str(LNatPctUnemp2020to2021ByYearByGender.tbl)
dplyr::glimpse(LNatPctUnemp2020to2021ByYearByGender.tbl)
utils::head(LNatPctUnemp2020to2021ByYearByGender.tbl)
base::summary(LNatPctUnemp2020to2021ByYearByGender.tbl)
27 The expression follow the science, wherever it leads was used extensively in the press throughout
the pandemic. It might have been more appropriate to use the expression follow the data, wherever
they lead. Reliable and valid data are essential to furthering discovery of true outcomes.
Now that there is a dataset with data for 2020 and 2021 only, apply the Twoway
ANOVA syntax again to see if percentage unemployment outcomes for these 2 years
are consistent with prior findings, from 2017 to 2021.
agricolae::HSD.test(
aov(Pct_Unemployment ~ Year,
data=LNatPctUnemp2020to2021ByYearByGender.tbl), # Model
trt="Year", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Percentage Unemployment by Year: 2020 and 2021")
# Wrap the agricolae::HSD.test() function around the
# Oneway ANOVA model obtained by using the aov()
# function. Select desired arguments, such as group,
# console, and alpha (e.g., p-value).
Year, means
Pct_Unemployment groups
2020 8.10417 a
2021 5.35000 b
As evidenced by the use of this Oneway ANOVA, there was a statistically signifi-
cant difference (p ≤ 0.05) in mean percentage unemployment by year, with 2020
mean Pct_Unemployment = 8.10417 and 2021 mean Pct_Unemployment = 5.3500.
The Twoway ANOVA approach to the data should provide more information, now
introducing Gender for these 2 years.
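As above, the call that produced the table below is not in this extract; a sketch of syntax likely to generate it, again assuming the car::Anova() function and an illustrative object name, is:

# Repeat the Year-by-Gender model for the 2020 and 2021 subset only.
UnempYearGender2020to2021.aov <- stats::aov(
  Pct_Unemployment ~ factor(Year) * Gender,
  data = LNatPctUnemp2020to2021ByYearByGender.tbl)
car::Anova(UnempYearGender2020to2021.aov)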
Response: Pct_Unemployment
Sum Sq Df F value Pr(>F)
Year 91.0 1 12.612 0.000927 ***
Gender 0.4 1 0.049 0.826670
Year:Gender 2.1 1 0.294 0.590110
Residuals 317.6 44
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, even for 2020 and 2021 alone, it is now confirmed that there is no
statistically significant difference (p ≤ 0.05) in mean percentage unemployment by
gender. A few statistics, using federal data as a resource, may help clarify actual
outcomes versus what was often (incorrectly) reported in the press concerning
differences in unemployment by gender.
LNatPctUnemp2020ByYearByGender.tbl <-
LNatPctUnemp2017to2021ByYearByGender.tbl %>%
dplyr::filter(Year == 2020)
# Use the dplyr::filter() function to have data for
# 2020, only.
base::getwd()
base::ls()
base::attach(LNatPctUnemp2020ByYearByGender.tbl)
utils::str(LNatPctUnemp2020ByYearByGender.tbl)
dplyr::glimpse(LNatPctUnemp2020ByYearByGender.tbl)
utils::head(LNatPctUnemp2020ByYearByGender.tbl)
base::summary(LNatPctUnemp2020ByYearByGender.tbl)
With the data restricted to 2020 only, a time of exceptionally high rates of unem-
ployment, see if there are differences between women and men regarding percent-
age unemployment.
onewaytests::describe(Pct_Unemployment ~ Gender,
data=LNatPctUnemp2020ByYearByGender.tbl)
The 2020 mean percentage unemployment for Uer.Men = 7.80833 and the 2020
mean percentage unemployment for Uer.Women = 8.40000. However, is this differ-
ence statistically significant (p ≤ 0.05)?
onewaytests::st.test(Pct_Unemployment ~ Gender,
data=LNatPctUnemp2020ByYearByGender.tbl,
alpha=0.05, na.rm=TRUE, verbose=TRUE)
# Perform a Student's t-test for two samples.
statistic : -0.391328
parameter : 22
p.value : 0.699319
Fig. 2.7
USAUnempPct2020BothGenders.fig <-
qplot(data = LNatPctUnemp2020ByYearByGender.tbl,
Gender, Pct_Unemployment, ylim=c(0,16),
main="USA Percentage Unemployment of Both Genders:
2020",
xlab="\nGender", ylab="Percentage Unemployment\n") +
scale_x_discrete(labels=c("Uer.Men" = "Men",
"Uer.Women" = "Women")) +
annotate("text", x=1.25, y=15.0, fontface="bold", size=03,
hjust=0, label=
"Each dot is a datapoint for a specific month.") +
annotate("text", x=1.25, y=14.0, fontface="bold", size=03,
hjust=0, label=
"Some months had values that were similar") +
annotate("text", x=1.25, y=13.2, fontface="bold", size=03,
hjust=0, label=
"to other months.") +
theme_Mac()
# Fig. 2.7
par(ask=TRUE); USAUnempPct2020BothGenders.fig
These outcomes are a demonstration of the type of value that data scientists pro-
vide to clients. Far more is possible, but again these additional possibilities are
addressed in other lessons.
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, .xlsx, and other file formats.
AlcoholAttributableFractionsAllCauseDeaths.csv
annual_aqi_by_county_2021.csv
BirthRatePer1000People.csv
COVID19CasesDeathsEUEEACountryApr-16-2022.csv
DiabetesAtlasData.csv
FDOH_HealthTrackingData.csv
FloridaUnemploymentByCountySep-01-20.csv
FranceGermanyUKLivestoack1961to2020.csv
GDPCurrentUSDollar.csv
HeadcountsbyCIPbyCode01-07andbyCodeA-CNotTidy.txt
KentuckyCornYield1900Onward.xlsx
LifeExpectancyGDP.xlsx
LMilkManagementPoundsTidy.txt
NationalSchoolLunchProgram2000-01to2018-19.xlx
NatPctUnemp2017to2021ByGender.xlsx
nsf22300-tab013.xlsx
nslp_sites.csv
Palm_Beach_County_Natural_Areas_Trails.csv
QuickFactsApr-07-2022CapeMayCountyNJvUnitedStates.csv
SelectedHealthConditionsandRiskFactorsbyAge.xlsx
SHA_05032023201717071.csv
Tourist_Development_Tax2021-2022.pdf
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 3
Role of Statistics for Decision-Making
in Biostatistics
There are many ways to approach the decision-making process when data scientists
use statistical analyses to address problems associated with the biological sciences.
The ten-point process discussed in this lesson provides a general framework for the
overall workflow, from initial problem identification to communication of current
outcomes and plans for future improvements. Other frameworks are often encountered,
but the general process is usually similar to what is presented in this lesson. Although
the text in this lesson may be brief, it is suggested that the best way to learn about
the topics in this lesson is to read the biostatistics literature, literature on many bio-
logical topics and literature from many biologically oriented resources, print and
online. Look for similarities in the way workflow is addressed, as a planned statisti-
cal approach is used to achieve results.
A relevant problem associated with COVID-19 is used for demonstration pur-
poses. Reviewing as many resources in the literature as time allows, see how other
problems in biostatistics are addressed, as data scientists try to investigate relevant
problems and communicate findings. See the similarities (and differences) in the
literature to the general structure for the ten points identified in this lesson.
Was the percentage of COVID-19 deaths in the many counties of the United States
impacted by the degree of urbanization, otherwise known as the urban-rural con-
tinuum? Heavily urban communities have high population densities, whereas more rural communities have far lower population densities.
As seen later in this lesson, the sample problem is focused on the use of the tidy-
verse ecosystem to adjust an extant federal dataset and to then examine if the urban-
rural continuum has any impact on the overall percentage of COVID-19 deaths. The
Centers for Disease Control and Prevention is the source for data in this example,
https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-by-County-and-Race-and/k8wy-p9cg. As a proxy for a provided Code Book, examine the federal publica-
tion 2013 NCHS Urban–Rural Classification Scheme for Counties.1
This resource, like its earlier 2006 counterpart, is a county-level population density
scheme; the National Center for Health Statistics (NCHS) urban–rural classification
scheme for counties has six levels: four metropolitan breakouts (large central metro,
large fringe metro, medium metro, and small metro) and two nonmetropolitan
breakouts (micropolitan and noncore). Review the resource and see in Table 1 (page
8) how there is far greater detail on these breakouts, which should be examined to
fully understand the complexity of the county-based urban-rural continuum.
When reviewing the background information, note how county data are available
for counties with more than 100 COVID-19 deaths. Deaths are cumulative from the
week ending January 4, 2020, onward to the most recent reporting week (May 7,
2022, in this example) identified in the original dataset for when it was obtained,
and based on county of occurrence since it would be impossible to know with any
assurance the county of infection. Review the original resource to see the conditions
as to why data may not be reported, such as the threshold that a county must have at
least 100 COVID-19 deaths to be included in that reporting column as well as other
conditions that protect privacy. The exclusion of some data, although needed for
confidentiality purposes, is a continual concern to data scientists and is addressed to
some degree in this lesson.
Review the ending materials in this lesson for details on how the tidyverse ecosys-
tem was used to obtain the data, a federal dataset freely available to anyone with
interest. The readr::read_csv() function was used to import the downloaded federal
data into the R session, but a great deal of effort was needed to identify the appro-
priateness of this dataset and to then adjust it for immediate use.2 Far more support-
ing detail is provided in the ending materials.
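A minimal sketch of that import step, assuming the downloaded extract was saved locally under a hypothetical file name:

# The file name below is a hypothetical placeholder for the local copy
# of the CDC extract.
CovidDeathsByCounty.tbl <- readr::read_csv(
  "ProvisionalCOVID19DeathsByCountyAndRace.csv")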
Review the ending materials in this lesson for details on how the tidyverse ecosys-
tem was used to organize the data, using R syntax exclusively. Give special attention
to the many R packages and functions used against the dataset, especially the use of
1 Review https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf, Pages 2 and 8, Series 2, No. 166, for more detail.
2 Those with a special interest may want to explore functions from the rio package to see other strategies for how data can be imported into an active R session.
functions from the dplyr package and movement from wide to long data format by
using the tidyr::pivot_longer() function.
Carefully examine the ending materials in this lesson to see the many approaches
used for inquiry into the problem associated with this lesson, the issue of possible
differences in percentage deaths from COVID-19 by the degree of urbanization of
local communities. As the syntax is examined, give special attention to:
• Data of interest: Not all data in the original dataset are needed for immediate use, and unneeded data can be sequestered from the final working dataset.
• R packages and functions: Many different packages and associated functions, including packages from Base R and the tidyverse ecosystem, have a role. Select the R packages and functions that meet needs.
• Expected workflow and timelines: Notice how the workflow is facilitated by following a structured approach to problem-solving and that this process supports efficient use of time and resources.
• Quality assurance checks: Quality assurance is a continuous process, with periodic checks even if at first these checks seem redundant. A small problem that goes undetected early on can easily result in compounded errors by the end of the process.
• Develop templates for all major actions: Give attention to how syntax is used (with modifications, as needed) in multiple ways and in multiple places. These templates promote efficient reuse of syntax that has a known history of acceptable use.
Notice how the syntax in the ending materials of this lesson may at times seem
redundant, but the many actions provide a continuous set of activities that promote
correct analyses and the production of relevant beautiful graphics.
As the data for this lesson are organized (especially, from wide to long) and then
analyzed, give special attention to how the data are subjected to both nonparametric
and parametric inferential analyses. This type of individual review is far too often
overlooked. In a final summary of outcomes, perhaps only the calculated p-value will be reported, along with the concluding statement on significance relative to a priori hypotheses. But multiple reviews of all outcomes are needed to have full assurance of final conclusions.
When outcomes are prepared for publication in the literature, whether a journal
article or part of a published text, it can be assumed that the final manuscript will be
reviewed by a reputable editor and anonymous peer reviewers, often three or more.
Ideally, the peer reviewers will be selected by the editor for their expertise with the
subject matter and nonacquaintance with the author(s) of the draft publication. This
process provides additional quality assurance that the final publication is acceptable
for readership by the intended community of readers.
Even for publications that are not intended for external publication in a journal
or text but are instead internal to a select group of readers in a company, department,
research laboratory, etc., external review by disinterested third parties is still an
essential part of the process for inquiry and communication of outcomes. Free
inquiry and equally free criticism make for improvement in the sciences.
Did the inquiry answer all possible issues associated with the identified problem?
Given that this is rarely the case, what else could have (or should have) been done
to address the problem if there were only more available staff members, more time
allowed for investigation before moving on to other priority projects, more funds
available for the purchase of additional resources, etc.? Data scientists, in personal
notes, keep a list of what else may have been done for future improvements. These
ideal processes may not be currently achievable, but they provide a framework for
future actions – actions that can be defended based on the desire for continuous
improvement.
Give considerable attention to the ending materials in this lesson to examine the R syntax, selected packages and functions, selected approaches, etc. used to address the problem of this lesson, the relationship between deaths from COVID-19 and the degree of urbanization across different communities. It is not enough
to merely obtain the data, whether self-obtained or obtained from an external
resource. An experienced data scientist will know that full comprehension by exter-
nal readers and viewers will come from multiple approaches to communication.
Exploratory Graphics
Exploratory Analyses
Data scientists, when they finally have a dataset in desired form, address analyses
from multiple perspectives. There may be a planned set of analyses, as is often the
case, but it is also advisable to allow for serendipity by possibly looking for
unplanned patterns. It is often the case that the planned set of analyses are the analy-
ses that eventually provide the best understanding of outcomes, but it is still advis-
able to allow for exploration of the data by engaging in:
Inferential Analyses that Address Differences Between and Among
Groups: Consider the use of the following:
• Nonparametric tests such as: Sign Test, Chi-Square, Mann-Whitney U Test,
Wilcoxon Matched-Pairs Signed-Ranks Test, Kruskal-Wallis Oneway Analysis
of Variance, Friedman Twoway Analysis of Variance
• Parametric tests such as: Student’s t-Test for Independent Samples, Student’s
t-Test for Matched Pairs, Oneway Analysis of Variance, and Twoway Analysis of
Variance
The Centers for Disease Control and Prevention (CDC) provides a regularly updated dataset, Provisional COVID-19 Deaths by County, and Race and Hispanic Origin, that addresses deaths from COVID-19 for counties with more than 100 COVID-19 deaths. Deaths are cumulative from the week ending January 4, 2020, and are ongoing. The dataset Provisional_COVID-19_Deaths_by_County__and_Race_and_Hispanic_Origin.csv is available at https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-by-County-and-Race-and/k8wy-p9cg.
# UrbanRuralCode
# 1 Large central metro
# 2 Large fringe metro
# 3 Medium metro
# 4 Small metro
# 5 Micropolitan
# 6 Noncore
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
In keeping with what is seen in the Housekeeping section, use the many pack-
ages listed below as a starting point for what is often used in an R session that
focuses on the tidyverse ecosystem and its use in data science.3
3 See the comment in an earlier lesson as to why a # comment character has been placed in front of most packages that were previously downloaded.
#install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
#install.packages("readxl", dependencies=TRUE)
library(readxl)
#install.packages("magrittr", dependencies=TRUE)
library(magrittr)
#install.packages("janitor", dependencies=TRUE)
library(janitor)
#install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
#install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
#install.packages("ggtext", dependencies=TRUE)
library(ggtext)
#install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
#install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
#install.packages("scales", dependencies=TRUE)
library(scales)
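Between the package loading above and the confirmation below, the text relies on a user-created ggplot2 theme, theme_Mac(), defined in earlier material not reproduced in this excerpt. A minimal, hypothetical sketch of such a theme function, offered only so the later plotting syntax is self-contained, might look like:
# Hypothetical stand-in only: the theme_Mac() used throughout this text
# is defined elsewhere. This sketch shows the general pattern of a
# user-created theme that wraps an existing ggplot2 theme and adjusts
# a few elements.
theme_Mac <- function(base_size = 14) {
  ggplot2::theme_bw(base_size = base_size) +
    ggplot2::theme(
      plot.title      = ggplot2::element_text(face = "bold"),
      axis.title      = ggplot2::element_text(face = "bold"),
      legend.position = "none")
}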
###############################################################
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
###############################################################
Address the data associated with the focus of this example, the percentage of deaths from COVID-19 by degree of urbanization (see the publication serving as the Code Book, 2013 NCHS Urban–Rural Classification Scheme for Counties, page 9, for more details):
Most approaches for the use of the data will be based on functions associated with the tidyverse ecosystem, but functions from Base R will be used when they represent the most appropriate approach toward problem-solving, especially for this introductory text.
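The syntax that downloads the federal file and creates WCOVID19Deaths.tbl appears in material not reproduced in this excerpt. As a minimal sketch, assuming the renamed file ProvisionalCOVID19DeathsbyCountyRaceHispanicOrigin.csv (listed at the end of this lesson) is in the working directory and that janitor::clean_names() is used to regularize the column names:
# Sketch only, not the exact syntax used in the text: import the CDC
# file and standardize the column names (e.g., urban_rural_description).
WCOVID19Deaths.tbl <- readr::read_csv(
  "ProvisionalCOVID19DeathsbyCountyRaceHispanicOrigin.csv",
  show_col_types = FALSE) %>%
  janitor::clean_names()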
base::getwd()
base::ls()
base::attach(WCOVID19Deaths.tbl)
utils::str(WCOVID19Deaths.tbl)
dplyr::glimpse(WCOVID19Deaths.tbl)
base::summary(WCOVID19Deaths.tbl)
Before the dataset is adjusted, it is best to learn more about the main object vari-
able of interest, urban_rural_description, since the output of the base::summary()
function for this object variable is not overly helpful.
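The syntax that builds this summary is not reproduced in this excerpt. A minimal sketch, assuming dplyr::count() is acceptable for the purpose and using the object name echoed below:
# Sketch only: count the rows for each urban_rural_description value
# and express each count as a percentage of all rows.
Summaryurban_rural_description <- WCOVID19Deaths.tbl %>%
  dplyr::count(urban_rural_description, name = "Count") %>%
  dplyr::mutate(Percentage = 100 * Count / base::sum(Count)) %>%
  dplyr::arrange(dplyr::desc(Count))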
Summaryurban_rural_description
# A tibble: 6 x 3
urban_rural_description Count Percentage
<chr> <int> <dbl>
1 Micropolitan 990 29.4
2 Large fringe metro 687 20.4
3 Medium metro 660 19.6
4 Small metro 642 19.1
5 Large central metro 204 6.07
6 Noncore 180 5.35
Remember that the terms Count and Percentage refer to occurrence in the dataset, not the population. Most counties in the United States are rural, but most people reside in more densely populated urban counties.
A decision needs to be made now on how to adjust the dataset. Should any col-
umns be deleted? What is the best course of action to address the many occurrences
of missing data? If the data are to be put into long format, what is the best way to
arrange the data?
Use the utils::str() function to once again review the structure of the dataset and make decisions on its final form, deciding on the columns that are needed for the planned analyses of COVID-19 deaths by degree of urbanization and temporarily deleting all columns that do not play a role in this planned analysis. Although the adjusted datasets will eventually exclude selected object variables, remember that the original data are retained in case it is ever necessary to go back to the data for other analyses.
utils::str(WCOVID19Deaths.tbl)
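The dplyr::select() step that produces WCOVID19DeathsAdjusted1.tbl is in material not reproduced here. A hedged sketch, assuming only the columns used later in this lesson are retained:
# Sketch only: keep the indicator column, the urban-rural description,
# and the race-ethnicity percentage columns needed for later analyses.
WCOVID19DeathsAdjusted1.tbl <- WCOVID19Deaths.tbl %>%
  dplyr::select(
    urban_rural_description, indicator,
    non_hispanic_white, non_hispanic_black,
    non_hispanic_american_indian_or_alaska_native,
    non_hispanic_asian,
    non_hispanic_native_hawaiian_or_other_pacific_islander,
    hispanic, other)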
base::getwd()
base::ls()
base::attach(WCOVID19DeathsAdjusted1.tbl)
utils::str(WCOVID19DeathsAdjusted1.tbl)
dplyr::glimpse(WCOVID19DeathsAdjusted1.tbl)
base::summary(WCOVID19DeathsAdjusted1.tbl)
It is now best to adjust the wide dataset one more time, applying the dplyr::filter()
function against the object variable indicator so that the eventual dataset contains
data only for percentage of COVID-19 deaths, removing the many rows for the two
other indicator object variable breakouts, distribution of all-cause deaths (%) and
distribution of population (%).
WCOVID19DeathsAdjusted2.tbl <-
WCOVID19DeathsAdjusted1.tbl %>%
dplyr::filter(indicator %in% c(
"Distribution of COVID-19 deaths (%)")) # COVID19 deaths, only
# Use the seemingly ubiquitous dplyr::filter() function
# to remove rows, which in this example retains only the rows
# with the term Distribution of COVID-19 deaths (%) in the
# indicator column. The output of this filtering syntax
# results in a new dataset, WCOVID19DeathsAdjusted2.tbl. This
# adjusted dataset will be used to reconfigure the data into
# Long format.
base::getwd()
base::ls()
base::attach(WCOVID19DeathsAdjusted2.tbl)
utils::str(WCOVID19DeathsAdjusted2.tbl)
dplyr::glimpse(WCOVID19DeathsAdjusted2.tbl)
base::summary(WCOVID19DeathsAdjusted2.tbl)
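The step that creates WCOVID19DeathsAdjusted3.tbl is likewise not reproduced in this excerpt. As a hypothetical stand-in only, one further light adjustment could be to drop the now-constant indicator column, which keeps the later syntax in this lesson runnable:
# Hypothetical stand-in only: the actual adjustment used in the text is
# not shown here. Dropping the constant indicator column is one option.
WCOVID19DeathsAdjusted3.tbl <- WCOVID19DeathsAdjusted2.tbl %>%
  dplyr::select(-indicator)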
base::getwd()
base::ls()
base::attach(WCOVID19DeathsAdjusted3.tbl)
utils::str(WCOVID19DeathsAdjusted3.tbl)
dplyr::glimpse(WCOVID19DeathsAdjusted3.tbl)
base::summary(WCOVID19DeathsAdjusted3.tbl)
The wide dataset should now be in good form and ready for restructuring, from
wide to long. The tidyr::pivot_longer() function is likely the best choice for this
task. As a summary of intentions, the long dataset will consist of percentages of
death from COVID-19 in one column and urban_rural_description breakouts in
another column. With the data in long format, it will be possible to use tidyverse
ecosystem functions to the best effect.
LCOVID19Deaths.tbl <-
tidyr::pivot_longer(
WCOVID19DeathsAdjusted3.tbl,
cols=c(
non_hispanic_white,
non_hispanic_black,
non_hispanic_american_indian_or_alaska_native,
non_hispanic_asian,
non_hispanic_native_hawaiian_or_other_pacific_islander,
hispanic,
other),
names_to = "RaceEthnic",
values_to = "Percentage")
# The wide dataset is now put in long format, with the
# race-ethnicity percentages for distribution of COVID-19
# deaths all set to one column named Percentage. There
# are two complementary columns, urban_rural_description
# and RaceEthnic. This dataset is in long format and has
# all of the information needed to determine if there are
# statistically significant differences (p <= 0.05) in
# percentages of COVID-19 deaths by urban-rural gradients.
# The measured datum percentage deaths should be viewed
# as a real number, but urban-rural gradients are clearly
# factor object variables that show no precise order. As
# such, a nonparametric approach will be used along with
# a parametric understanding of the data.
#
# Comment: This example is focused on the percentage of
# COVID-19 deaths by degree of urbanization. The long
# format also contains breakouts of RaceEthnic, but this
# object variable is not used for analyses in this
# lesson, but is retained in the long format dataset for
# possible further inquiry.
base::getwd()
base::ls()
base::attach(LCOVID19Deaths.tbl)
utils::str(LCOVID19Deaths.tbl)
dplyr::glimpse(LCOVID19Deaths.tbl)
base::summary(LCOVID19Deaths.tbl)
federal resource serving as a proxy Code Book as to why confidentiality and the
Rule of 100 called for this outcome, the occurrence of so many NAs.
For this lesson, a decision has been made that analyses will be based only on a
complete dataset. Given this decision, all rows with a missing value will be deleted
from the dataset and the dataset used in this demonstration will have no missing
values.4
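The syntax that removes rows with missing values and creates LCOVID19DeathsNOMissingData.tbl is not reproduced in this excerpt. A minimal sketch, assuming tidyr::drop_na() is an acceptable way to implement the complete-case decision described above:
# Sketch only: retain only complete rows, consistent with the decision
# to base all analyses on a dataset with no missing values.
LCOVID19DeathsNOMissingData.tbl <- LCOVID19Deaths.tbl %>%
  tidyr::drop_na()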
base::getwd()
base::ls()
base::attach(LCOVID19DeathsNOMissingData.tbl)
utils::str(LCOVID19DeathsNOMissingData.tbl)
dplyr::glimpse(LCOVID19DeathsNOMissingData.tbl)
base::summary(LCOVID19DeathsNOMissingData.tbl)
Exploratory data analysis (EDA): Descriptive statistics of the relevant dataset are needed to obtain a general feel for the data, with attention not only to trends but also to any extreme values. From among the many functions that could possibly be used to calculate these descriptive statistics, consider the dplyr::summarize() function. Recall that there are no missing data in the adjusted dataset LCOVID19DeathsNOMissingData.tbl. If there were missing data, it would be necessary to account for NA values, for example with the na.rm argument of the summary functions called inside dplyr::summarize().
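As a brief, hedged illustration of that point:
# Sketch only: had missing values been retained, each summary function
# would need the na.rm = TRUE argument, for example:
LCOVID19DeathsNOMissingData.tbl %>%
  dplyr::summarize(Mean = base::mean(Percentage, na.rm = TRUE))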
4 The literature will be useful for those who want to explore the efficacy of this decision, the deletion of all rows with missing data. This action, the deletion of all rows with missing data, must be identified in the methods section of any eventual summary and it may be necessary to defend this action.
LCOVID19DeathsNOMissingData.tbl %>%
dplyr::group_by(urban_rural_description) %>%
dplyr::summarize(
N = base::length(Percentage),
Minimum = base::min(Percentage),
Median = stats::median(Percentage),
Mean = base::mean(Percentage),
SD = stats::sd(Percentage),
Maximum = base::max(Percentage))
# Descriptive statistics are generated by first using the
# dplyr::group_by() function against the object variable
# urban_rural_description. The dplyr::summarize() function
# is then used against a set of selected functions to make
# a neatly presented summary of descriptive statistics.
# A tibble: 6 x 7
urban_rural_description N Minimum Median Mean Maximum
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Large central metro 360 0.001 0.0905 0.188 0.866
2 Large fringe metro 706 0.002 0.132 0.318 0.995
3 Medium metro 725 0.003 0.101 0.299 0.991
4 Micropolitan 565 0.014 0.648 0.566 1
5 Noncore 87 0.056 0.758 0.672 1
6 Small metro 549 0.01 0.185 0.381 0.992
# Median - Nonparametric
par(ask=TRUE)
ggplot(LCOVID19DeathsNOMissingData.tbl,
aes(x=stats::reorder(urban_rural_description, Percentage),
y = Percentage)) +
geom_bar(stat = "summary", fun = "median")
# This rough draft figure is prepared only to show a
# sense of direction for outcomes. Far more will be done
# when the data are used to prepare Beautiful Graphics.
# To achieve that aim, tidyverse ecosystem arguments will
# be used to best effect.
#
# Give attention to the way the stats::reorder() function
# was wrapped around urban_rural_description, reordering
# output on the X axis by Percentage values. This syntax
# may not be needed, but it clearly makes it easier to
# see the progression in percentage of COVID-19 deaths by
# degree of urbanization, the percentage mostly ranging
# from the smallest percentage (Large central metro) to
# the largest percentage (Noncore). Is it a coincidence
# that the progression of percentage deaths from COVID-19
# generally parallels the continuum of urbanization?
The least to the greatest ordering of percentage deaths from COVID-19 from a
nonparametric perspective, using the median, ranged from (1) Large central metro,
to (2) Medium metro, to (3) Large fringe metro, to (4) Small metro, to (5)
Micropolitan, to (6) Noncore. This ordering mostly follows the urban-rural continuum: the percentage of deaths from COVID-19 increased as population density changed from a high degree of urbanization (low percentage death rate) to a low degree of urbanization (high percentage death rate).
Why was the percentage deaths from COVID-19 the least in urban areas, consid-
ering the urban-rural continuum? The data provided by the Centers for Disease
Control and Prevention provided sufficient evidence as to what happened, but the
data cannot begin to answer unequivocally why urban areas had lower percentages
of COVID-19 deaths than the percentage deaths from COVID-19 in more rural
areas. A data scientist should know when to make definitive statements on out-
comes, but a data scientist should also know when to avoid conjecture that is not
supported by the evidence, such as the question of why in this example.
An inferential analysis of the data is needed, however, to learn more about percentage deaths from COVID-19 by degree of urbanization. The Kruskal-Wallis test by ranks, a nonparametric counterpart to a Oneway Analysis of Variance (ANOVA), will be used to examine any commonalities and differences between and among the urban-rural breakout groups in view of percentage death from COVID-19.
# install.packages("agricolae", dependencies=TRUE)
library(agricolae)
# Median - Nonparametric
agricolae::kruskal(LCOVID19DeathsNOMissingData.tbl$Percentage,
LCOVID19DeathsNOMissingData.tbl$urban_rural_description,
alpha=0.05, group=FALSE, p.adj="holm",
main="COVID-19 Percentage Deaths by Urban-Rural Continuum
Using Nonparametric Kruskal-Wallis Oneway ANOVA",
console=TRUE)
# Use holm for pairwise comparisons. Another choice could
# have been to use bonferroni for pairwise comparisons.
LCOVID19DeathsNOMissingData.tbl$urban_rural_description, means of the ranks
                      LCOVID19DeathsNOMissingData.tbl.Percentage    r
Large central metro 1042.37 360
Large fringe metro 1395.70 706
Medium metro 1288.42 725
Micropolitan 2018.79 565
Noncore 2270.72 87
Small metro 1538.50 549
# Mean - Parametric
par(ask=TRUE)
ggplot(LCOVID19DeathsNOMissingData.tbl,
aes(x=stats::reorder(urban_rural_description, Percentage),
y = Percentage)) +
geom_bar(stat = "summary", fun = "mean")
The least to the greatest ordering of percentage deaths from COVID-19 from a
parametric perspective, using the mean, ranged from (1) Large central metro, to (2)
Medium metro, to (3) Large fringe metro, to (4) Small metro, to (5) Micropolitan,
to (6) Noncore. This ordering mostly follows the urban-rural continuum: the percentage of deaths from COVID-19 increased as an area changed from a high degree of urbanization (low percentage death rate) to a low degree of urbanization (high percentage death rate).
What is also interesting about this figure is that the ordering of percentage deaths
from COVID-19 by urban-rural continuum is the same, regardless of whether a
nonparametric (median) or parametric (mean) approach served as the basis for prep-
aration of the figure. The percentages are different, comparing median to mean, as
evidenced in the descriptive statistics – but the practical outcomes in terms of order-
ing are equivalent.
agricolae::HSD.test(
aov(Percentage ~ urban_rural_description,
data=LCOVID19DeathsNOMissingData.tbl), # Model
trt="urban_rural_description", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="COVID-19 Percentage Deaths by Urban-Rural Continuum
Using Tukey's HSD (Honestly Significant Difference)
Parametric Oneway ANOVA")
# Wrap the agricolae::HSD.test() function around the
# Oneway ANOVA model obtained by using the aov()
# function. Select desired arguments, such as group,
# console, and alpha (the significance level).
urban_rural_description, means
Percentage groups
Noncore 0.671862 a
Micropolitan 0.566340 a
Small metro 0.381002 b
Large fringe metro 0.318435 c
Medium metro 0.298619 c
Large central metro 0.188347 d
Fig. 3.1
COVID19MedianPercentageDeathByUrbanRural <-
LCOVID19DeathsNOMissingData.tbl %>%
dplyr::group_by(urban_rural_description) %>%
dplyr::summarize(
Median = stats::median(Percentage))
COVID19MedianPercentageDeathByUrbanRural
# A tibble: 6 x 2
urban_rural_description Median
<chr> <dbl>
1 Large central metro 0.0905
2 Large fringe metro 0.132
3 Medium metro 0.101
4 Micropolitan 0.648
5 Noncore 0.758
6 Small metro 0.185
par(ask=TRUE)
ggplot2::ggplot(LCOVID19DeathsNOMissingData.tbl,
aes(
x=stats::reorder(urban_rural_description, Percentage),
y=Percentage)) +
geom_bar(stat = "summary", fun = "median", fill="red",
color="black") +
labs(
title="Median Percentage COVID-19 Deaths by Urban-Rural
Continuum as of May 13, 2022",
subtitle="Data: Centers for Disease Control and Prevention",
x = "\nUrban - Rural Breakout Groups",
y = "Median Percentage COVID-19\nDeaths\n") +
annotate("text", x=0.75, y=1.00, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="Percentage Density Urbanization Level") +
annotate("text", x=0.75, y=0.95, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="----------------------------------------") +
annotate("text", x=0.75, y=0.90, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.091 2,037 Large central metro") +
annotate("text", x=0.75, y=0.85, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.101 148 Medium metro") +
annotate("text", x=0.75, y=0.80, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.132 233 Large fringe metro") +
annotate("text", x=0.75, y=0.75, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.185 93 Small metro") +
annotate("text", x=0.75, y=0.70, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.648 55 Micropolitan") +
annotate("text", x=0.75, y=0.65, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.758 18 Noncore") +
annotate("text", x=0.75, y=0.60, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="----------------------------------------", ) +
annotate("text", x=0.75, y=0.50, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="Density: Median Persons per Square Mile") +
scale_y_continuous(breaks=scales::pretty_breaks(n=5),
limits=c(0, 1), sec.axis = dup_axis()) +
# Note how the Y axis has been duplicated, where the left Y
# Axis (default) is shown again to the right of the figure,
# making it easier to read the values for each bar in the
# bar chart.
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12, hjust=0.5))
# Adjust the Y axis text (bold, size 12, centered).
# Notice how the placement of this one-off theme() adjustment comes
# after theme_Mac(), so that it is not overridden.
# Fig. 3.1
# Mean - Parametric
COVID19MeanPercentageDeathByUrbanRural <-
LCOVID19DeathsNOMissingData.tbl %>%
dplyr::group_by(urban_rural_description) %>%
dplyr::summarize(
Mean = base::mean(Percentage))
COVID19MeanPercentageDeathByUrbanRural
# A tibble: 6 x 2
urban_rural_description Mean
<chr> <dbl>
1 Large central metro 0.188
2 Large fringe metro 0.318
3 Medium metro 0.299
4 Micropolitan 0.566
5 Noncore 0.672
6 Small metro 0.381
par(ask=TRUE)
ggplot2::ggplot(LCOVID19DeathsNOMissingData.tbl,
aes(
x=stats::reorder(urban_rural_description, Percentage),
y = Percentage)) +
geom_bar(stat = "summary", fun = "mean", fill="red",
color="black") +
labs(
title="Mean Percentage COVID-19 Deaths by Urban-Rural
Continuum as of May 13, 2022",
subtitle="Data: Centers for Disease Control and Prevention",
x = "\nUrban - Rural Breakout Groups",
y = "Mean Percentage COVID-19\nDeaths\n") +
annotate("text", x=0.75, y=1.00, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="Percentage Density Urbanization Level") +
annotate("text", x=0.75, y=0.95, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="----------------------------------------") +
annotate("text", x=0.75, y=0.90, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.188 2,037 Large central metro") +
annotate("text", x=0.75, y=0.85, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.299 148 Medium metro") +
annotate("text", x=0.75, y=0.80, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.318 233 Large fringe metro") +
annotate("text", x=0.75, y=0.75, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.381 93 Small metro") +
annotate("text", x=0.75, y=0.70, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.566 55 Micropolitan") +
annotate("text", x=0.75, y=0.65, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="0.672 18 Noncore") +
annotate("text", x=0.75, y=0.60, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="----------------------------------------") +
annotate("text", x=0.75, y=0.50, fontface="bold", size=03,
color="darkblue", hjust=0, family="mono",
label="Density: Median Persons per Square Mile") +
scale_y_continuous(breaks=scales::pretty_breaks(n=5),
limits=c(0, 1), sec.axis = dup_axis()) +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12, hjust=0.5))
# Adjust the Y axis text (bold, size 12, centered).
# Notice how the placement of this one-off theme() adjustment comes
# after theme_Mac(), so that it is not overridden.
# Fig. 3.2
Challenge: Much more can be done in terms of analyses and graphical presentations, not only with the long format adjusted table that has no missing data but also with the original dataset, WCOVID19Deaths.tbl (Fig. 3.2). Use WCOVID19Deaths.tbl and look at the data again, from many perspectives (for example, analyses by RaceEthnic breakouts, and not only the analyses and figures demonstrated in this lesson).
Fig. 3.2
Comment: If race-ethnicity were examined in any meaningful detail, consider
the issue of race-ethnicity health disparities across the population, and equally con-
sider disparities in percentage representation of residence by the many race-ethnicity
groups across the urban-rural continuum. It is far beyond the purpose of this lesson
to address why there may be disparities in outcomes (e.g., percentage COVID-19
deaths by different breakout groups, now possibly including race-ethnicity), but as
previously addressed the curious data scientist will at least examine the data, look
for trends and associations, and make these outcomes known to others.
Comment: For those who wish to go further with this line of inquiry, use data
from the Centers for Disease Control and Prevention and the National Center for
Health Statistics to examine Excess Deaths Associated with COVID-19, available
at https://www.cdc.gov/nchs/nvss/vsrr/covid19/excess_deaths.htm and the many .csv files that appear near the end of that Web page. Some comparisons can be made
along the Urban-Rural Continuum, comparing states that are known to be heavily
urban v states that are known to be quite rural, possibly New Jersey v Alaska,
Rhode Island v Wyoming, Massachusetts v Montana, etc. These comparisons
should provide another (e.g., proxy) view of the severity of COVID-19 along the
Urban-Rural Continuum. Federal data on this issue can also be obtained at
Provisional COVID-19 Deaths: Distribution of Deaths by Race and Hispanic
Origin (https://data.cdc.gov/NCHS/Provisional-COVID-19-Deaths-Distribution-
of-Deaths/pj7m-y5uh), using the file Provisional_COVID-19_Deaths__Distribution_of_Deaths_by_Race_and_Hispanic_Origin.csv and made available
at the Publisher’s Web site associated with this text renamed as
ProvisionalCOVID19DeathsDistributionofDeathsbyRaceandHispanicOrigin.csv.
Much needs to be done to put the data into tidy format, but of course, by the end
of engagement with this text, this should be an achievable skill.
External Data and/or Data Resources Used in This Lesson
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, and .xlsx file formats.
ProvisionalCOVID19DeathsbyCountyRaceHispanicOrigin.csv.
ProvisionalCOVID19DeathsDistributionofDeathsbyRaceandHispanicOrigin.csv.
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
Note: There is only one addendum in this lesson.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 4
Data Science and R, Base R, and the tidyverse Ecosystem
seem tedious, verbose, only marginally necessary, etc. – until there is an attempt to
use the data or the syntax later, individually or by others, when attempting replica-
tion and reuse. What will be remembered about a project six days after completion,
six weeks after completion, six months after completion, etc.? How will others, whether co-workers, staff members from other departments, or future hires, use the data and documentation if there is an attempt to reuse the original workflow for future purposes, a possibility so common that it should be assumed standard practice? Few projects are static, or to use current jargon, few projects are one and done.
When considering how R and the tidyverse ecosystem fit into data science and how data science is used to address problems in the biological sciences, review the following sequence of terms and workflow:
• Import the Data – The data need to be brought into an R session, but how this is
accomplished depends on the nature of the data, how the data have been saved,
where the data are currently housed prior to import, etc. The tidyverse ecosystem
supports many tools used to import data and it would be hard to imagine a dataset
that could not eventually be imported into R, but of course some datasets may be
more challenging than others.
• Tidy the Data – It would only be the rare exception for a dataset to be imported
into R that requires no further effort to put it into good form, with desired format
depending on many conditions. As the name suggests, the tidyverse ecosystem
has many excellent tools used to put data into good form – a tidy format.
• Transform the Data into Desired Format – Data are often originally organized in
ways that defy understanding, as those with minimal background in data science
prepare, use, and, in many cases, abuse spreadsheets. A data entry decision that
may seem intuitive and easy to read may not at all be acceptable for computer-
based analyses. The most common data transformation is likely moving data
from wide format to long format. This transformation, from wide to long, is
demonstrated multiple times in this text.
• Program or Prepare tidy Syntax – When using the tidyverse ecosystem, it is best
to review how others prepare syntax and there are multiple sources where sam-
ples are provided. Give attention to the packages and functions used to achieve
aims, note how tibbles are named, give attention to consistency in capitalization
and the use of underscores in object variable names, follow along with norms on
spacing, etc.
• Visualize Outcomes Frequently – The tidyverse ecosystem supports the notion of
Beautiful Graphics. Even if the tidyverse ecosystem were not used to prepare
final form figures suitable for professional publication, which it certainly sup-
ports, rough draft figures should nearly always be prepared for quality assurance
purposes, to be sure that outcomes are in range, or highlight outcomes if out-
comes are not in expected range.
• Model the Process to Support Consistency in Future Replication – Using a com-
bination of different internal documentation activities (e.g., good programming
practices such as meaningful variable names and not cryptic abbreviations and
detail needed, especially if that somewhat long character string were used as a
label in a barchart or some other figure? There are tidyverse ecosystem tools (and
Base R tools, too) that can remove the characters Florida if it were decided that this action is needed, leaving Broward County only – a label that will more easily fit on either the X axis or Y axis of a figure (a brief sketch follows this list).
• Imagine, however, that the dataset included the character string Washington
County, Florida, as a specific datapoint. Tools from either Base R or the tidy-
verse ecosystem could be used to change the datapoint to Washington County
only, but is this a wise choice? There is only one county in the United States
named Broward County, but the name Washington County is found in 30 or so
states in the United States, in honor of General George Washington, a founding
father and the first president of the United States. The tidyverse ecosystem can
accommodate the change, if desired, but the appropriateness of any change is a
human decision.
• Imagine another dataset, in original format as a spreadsheet, where comments
and titles show in the first few rows, prior to the inclusion of data. These com-
ments and titles are quite appropriate for when the spreadsheet is shared with
others, but these extraneous rows are not acceptable for inclusion in any rectan-
gular dataset. There are tidyverse ecosystem tools that can remove the unneeded
rows, and with appropriate knowledge of the tidyverse ecosystem, it is not neces-
sary to manually edit it outside of R.
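As a brief sketch of the kind of tool involved in the county-label bullet above, and using made-up labels rather than any dataset from this text, stringr::str_remove() from the tidyverse ecosystem could drop the state name when a human reviewer decides that the change is appropriate:
library(stringr)

# Hypothetical labels only; remove the trailing ", Florida" so that the
# shorter county name fits more easily on a figure axis.
countyLabels <- c("Broward County, Florida", "Miami-Dade County, Florida")
stringr::str_remove(countyLabels, ", Florida")
# [1] "Broward County"    "Miami-Dade County"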
The terms dataset, data frame, and dataframe are regularly used in data science,
often interchangeably. Regardless of how these terms are expressed, it is important
to become acquainted with the term tibble and its special use in the tidyverse eco-
system. A tibble is a rectangular dataframe, but with somewhat different character-
istics than what is the norm when using Base R and the traditional concept of a
dataframe. More about tibbles will be evident by reviewing the back matter in the
many lessons in this text, but for now, it is important to know that a dataframe can
be converted into a tibble by using the tibble::as_tibble() function. An especially useful feature of the tibble::as_tibble() function is the .name_repair argument, which can be especially convenient in the early stages of data organization, prior to use of the data.
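A minimal sketch of the .name_repair argument in use, with a small made-up data frame that has duplicate and empty column names:
# Sketch only: .name_repair resolves problematic column names at the
# moment a data frame is converted into a tibble.
df <- data.frame(1:2, 3:4, 5:6)
base::names(df) <- c("x", "x", "")      # Duplicate and empty names.
tibble::as_tibble(df, .name_repair = "unique")
# Column names are repaired to x...1, x...2, and ...3.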
It is also noteworthy to consider how the tidyverse ecosystem is used to put wide
data into long format and conversely the restructuring of long data into wide format.
These actions (most often, wide to long) are demonstrated throughout this text and
many examples can be found by searching the Internet, so only a brief recap is
needed here, but the convenient restructuring of data is central to why many data
scientists first embraced the tidyverse ecosystem.1 A simple example of how wide
data may appear after it is put into long format follows:
1 Look at the history of the reshape package (and later the reshape2 package and even later the tidyr package) and note the time of first availability for each package. The reshape package and the ggplot2 package were among the earliest packages associated with the tidyverse ecosystem, and their use was quickly accepted by data scientists within the R community, leading to the development of other tidyverse packages and associated tools.
# WIDE Data
# LONG Data
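The small wide and long displays from the printed text are not reproduced here. As a hedged, self-contained sketch of the same idea, using made-up values, tidyr::pivot_longer() restructures a wide tibble so that each row holds one observation of one variable:
library(tidyverse)

# Sketch only: three subjects measured at two time points, first in
# wide format and then restructured into long format.
wide.tbl <- tibble::tibble(
  id    = c("S01", "S02", "S03"),
  time1 = c(10, 12, 9),
  time2 = c(14, 15, 11))

long.tbl <- tidyr::pivot_longer(
  wide.tbl,
  cols      = c(time1, time2),
  names_to  = "Time",
  values_to = "Score")
long.tbl   # One row for each id-by-Time combination.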
Base R
S was developed in the mid-1970s and R grew out of S. It might help to back up
briefly to describe how statistical analyses were deployed in the early days of com-
puting, prior to the use of more contemporary statistical analysis software, such as
S and later R.
2 Review the term ASCII Art to see how simple graphics were attempted in these early days, often with questionable results (including the subject matter) for selected early days figures.
There are nearly countless resources for that aim, and they should be reviewed if
needed. Instead, it is assumed that those who review this introductory lesson on data
science for biostatistics have some degree of acquaintance with R and the many
tools available from when R is first downloaded, prior to the use of any external
packages, packages from among the 20,000 or so packages that go beyond what is
possible when R is first downloaded – especially tools associated with the tidyverse
ecosystem. Step back and become fairly well experienced with Base R if needed.
S was quite popular with a dedicated user base. In the mid-1990s when R, as an
outgrowth of S, was made available to the public, many former S users quickly
downloaded early versions of R and began testing its features and then used R for
production.3 Because R is open-source software, it was possible to customize selected features and develop packages and functions, as needed. For many data scientists, R
became a first-choice software selection for data management and organization,
statistical analysis, and the preparation of quality graphics.
Recognizing that software is never static, by the mid-2000s, there were major
improvements to R when the first iterations of packages and functions associated
with what would later be called the tidyverse ecosystem were released. The reshape
3 As early as January 6, 2009, a prominent national newspaper in the United States featured R and how it was emerging as a first choice for many engaged in data management and statistical analyses.
The tidyverse Ecosystem as an Idea and the Need for Tidy Data
With such wide acceptance of the tidyverse ecosystem, it should not be surprising
that there are nearly countless resources, including journal articles, blog postings,
tutorials, short courses, and videos on its many features and uses. There is no need
to repeat what is so readily available and can be gained by searching the Internet.
Yet, it would be remiss if the driving feature of the tidyverse ecosystem, tidy data,
were not detailed. To avoid any future frustration by those who are just beginning to
use R and the tidyverse ecosystem, it must be mentioned that it is the rare dataset
that needs no accommodation prior to first use. Messy data are abundant, especially
when datasets are prepared by those who are not data scientists and those who do
not use the tidyverse ecosystem. Data scientists need to learn how to modify messy
data so that the three key features of a tidy dataset are observed. There are countless
ways messy data are put into datasets, often in seemingly unmanageable configura-
tions. In contrast, tidy data are housed as datapoints in rectangular datasets, where
rectangular datasets are composed of rows and columns, with data found at the
intersection of rows and columns:
• Rows represent observations (e.g., subjects) and each row represents a singular
observation (e.g., singular subject).
• Columns represent variables and each column represents a singular variable.
• Cells represent values and each cell represents a singular value.
A brief example or two of a messy dataset and how the data could appear in tidy
format follows. Hand-editing with software, either an editor or a spreadsheet, can be
used for a small dataset, but imagine the challenges if there were hundreds or thou-
sands of rows for the messy dataset presented below. R and more specifically R’s
tidyverse ecosystem has tools that could be used to accommodate the desire to put
the data into tidy format, a challenge perhaps but certainly requiring less time on
task than hand-editing.4
# Messy Dataset
#
# First and Last Gender/
# Name Sex Weight Height Body Mass Index
# ==========================================================
# John Smith Male 165 Lb 5 foot 6 in 26.6
# Sally Rojas Female 126 Lbs 5'2" 23.0
# William Danick M 183 lb 6 f 24.8
# Walter Maurer m 199 lbs 6 feet 4" 24.2
# Juanita Adams f 143 5-4 24.5
# ----------------------------------------------------------
# Tidy Dataset
#
# id    nameLast nameFirst gender weightLbs heightInches  bmi
# ID001 Smith    John      m      165       66           26.6
# ID002 Rojas    Sally     f      126       62           23.0
# ID003 Danick   William   m      183       72           24.8
# ID004 Maurer   Walter    m      199       76           24.2
# ID005 Adams    Juanita   f      143       64           24.5
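As one small, hedged illustration of the kind of cleaning involved, and using only the made-up gender/sex values from the messy display above, the entries could be standardized with tidyverse ecosystem tools; the weight and height columns would need similar, more involved treatment:
library(dplyr)
library(stringr)

# Sketch only: collapse the inconsistent gender/sex entries to "m"/"f".
sexRaw <- c("Male", "Female", "M", "m", "f")
dplyr::case_when(
  stringr::str_to_lower(stringr::str_sub(sexRaw, 1, 1)) == "m" ~ "m",
  stringr::str_to_lower(stringr::str_sub(sexRaw, 1, 1)) == "f" ~ "f",
  TRUE ~ NA_character_)
# [1] "m" "f" "m" "m" "f"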
4 This is not the only place where it could be mentioned, but it should be remembered that names are messy, first names and last names. This issue is especially evident when datasets are merged. Imagine a subject with Sean as a first name. It is not at all inconceivable, right or wrong, that this name may be entered in other datasets as Jean, Jehan, John, Shane, Shaun, Shawn, Shayne, or Shon. Which spelling is correct, which is incorrect, and is it possible to accommodate differences? If there are language accent marks over some characters, how will the software address these special characters, if at all? Then consider the clever idea of using Social Security Numbers, National Identity Numbers, or some other de facto recognized means of identifying individuals by using unique government-issued codes. At first this idea may sound grand, but is it legal? If legal, is it prudent? These national identification numbers may allow for consistency in identifying individuals and they are especially convenient when datasets are merged, but their use also puts individuals at risk for identity fraud, such as unauthorized credit card purchases, title theft, and dodgy mortgage applications. Always deploy best practices against the possibility of a data breach and use caution when deploying identification codes.
Great writers read the writings of other writers. Great musicians listen to the
music of other musicians. Great athletes observe the on-field efforts of other ath-
letes. It should not be surprising, then, that data scientists who want to improve their
mastery of R and the tidyverse ecosystem should:
• Study the datasets used by other data scientists, to see how the data were
organized.
• Study the syntax used by other data scientists, to see how the syntax was prepared.
• Study the statistical output generated by other data scientists, to see how the
statistics were generated.
• Study the graphical output generated by other data scientists, to see how the
graphics were generated.
• Study the conclusions presented by other data scientists, to see how the conclu-
sions were presented.
R takes time to learn and the tidyverse ecosystem adds an even greater demand
on what is needed to master the tools inherent to data science in biostatistics. Yet,
the rewards are many (e.g., employment, salary, professional development, peer
acceptance, recognition, etc.) for those data scientists who master R and the tidy-
verse ecosystem.
It may be a bit confusing at first to understand exactly which R packages are included
among the many packages associated with the tidyverse ecosystem. As an example,
the following packages are typically viewed as those packages associated with the
core tidyverse, listed in alphabetical order:5
• dplyr – Use the dplyr package to manipulate data so that a final dataset is in
desired format in terms of row and column placement.
• forcats – Use the forcats package to accommodate factor-type variables, typi-
cally variables representing different categories.
• ggplot2 – Use the ggplot2 package, which is based on the Grammar of Graphics,
to create both draft figures and publishable Beautiful Graphics.
• lubridate – Use the lubridate package to facilitate the complexities of working
with dates, times, etc.
• purrr – Use the purrr package to deploy functional programming practices, typi-
cally avoiding cumbersome syntax for loops.
5 With the March 2023 release of tidyverse 2.0.0, it can now be stated that the lubridate package has been added as part of the package of packages associated with core tidyverse. Older references to core tidyverse may not refer to the lubridate package.
• readr – Use the readr package to import offline rectangular datasets into an active
R session, such as data in the following formats: comma separated values (.csv),
tab separated values (.tsv), and the now less common but once ubiquitous fixed
width format (.fwf).
• stringr – Use the stringr package to ease the complexities of manipulating
character-type variables (e.g., strings).
• tibble – Use the tibble package to rethink what a data frame represents, now
organizing data in a tibble which improves quality by forcing resolution of data
issues early in the workflow.
• tidyr – Use the tidyr package to have data in a tidy format, typically placing data
in wide format into long format where observations represent singular rows,
variables represent singular columns, and datapoints are represented in singular
cells – the intersection of a row and a column. Of course, the tidyr package can
also be used to place data in long format into wide format, but that practice is
perhaps less common.
Although reference to the core tidyverse is found throughout this text, it is worth
repeating that all packages in the core tidyverse can be downloaded and put into use
in one simple operation:
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
There are many R packages that are directly associated with the tidyverse, going
beyond packages in core tidyverse. Use the tidyverse::tidyverse_packages() func-
tion to list these packages and be sure to account for packages in core tidyverse by
using the include_self = TRUE argument.
tidyverse::tidyverse_packages(include_self = TRUE)
# List all packages in the tidyverse
Some of these auxiliary tidyverse packages have regular use in data science and
efforts should be made to learn about their potential. Other auxiliary tidyverse pack-
ages are used less frequently in that they have very specialized applications that may
not be the norm for most users. Consider a few examples from among the R pack-
ages considered auxiliary packages outside of the core tidyverse ecosystem:6
• broom – Use the broom package to provide a tidy summary of information
gained from models, supporting methods such as anova, glm, lm, etc.
• cli – Use the cli package to build user-defined command line interfaces (e.g., CLIs).
• readxl – Use the readxl package to bring rectangular (e.g., tabular) data housed
in a spreadsheet into an active R session, including the complexity of differenti-
ating between different sheets within a spreadsheet.
• rvest – Use the rvest package to scrape (e.g., harvest, obtain) data from Web
pages, an increasingly important data acquisition process among data scientists.7
The tidyverse ecosystem is an expanding array of packages that work and play
well with other packages in the tidyverse, the heuristics of the tidyverse, and the
actual deployment of the tidyverse. As an example, consider the R packages ggmo-
saic, ggthemes, and scales. These packages are used throughout the many lessons in
this text, given their association with the ggplot2 package. Many would say that
these associated packages should be viewed as part of the tidyverse ecosystem, but
others may hold back on this broad statement.
Consider the many possibilities addressed in the remaining parts of this lesson:
Complex Data Set on Birth Rates Easily Accommodated by Using the tidyverse
Ecosystem, Addendum 1; Complex Data Set on Gross Domestic Product (GDP) and
Comparison to Birth Rates by Using the tidyverse Ecosystem, Addendum 2; and
Individual Initiative of Planned Workflow, Analyses, and Graphical Presentations,
Addendum 3. A wide variety of functions associated with the tidyverse, core
6 Go to the URL https://cran.r-project.org/web/packages/available_packages_by_name.html#available-packages-A to see the list of Available CRAN Packages By Name. At the time this chapter was prepared, there were nearly 20,000 packages included in this listing. Of those packages, more than 70 packages had the text string tidy somewhere in the package name (not the package description). This listing does not include the many tidyverse ecosystem packages where the text string tidy is not included in the package name, such as broom and pillar. The tidyverse ecosystem is ubiquitous in R.
7 Data scraping is an increasingly important activity, but far beyond appropriate demonstration in an introductory text.
tidyverse ecosystem, and auxiliary packages outside of the core tidyverse ecosys-
tem, along with functions from other packages, are used to make sense of data from
well-respected external sources.
A self-curated (and therefore admittedly biased) listing of Essential tidyverse
Ecosystem Functions That Every Data Scientist Should Master is the subject of
Addendum 4. Regard the parenthetical reminder that this listing is biased. Others
will have different views on which tidyverse ecosystem functions are essential,
perhaps disagreeing with some choices on the list in Addendum 4 while equally offering other ideas of their own.
As used throughout these analyses, the Housekeeping section represents personal desires
in terms of how R is used, how settings are organized, where packages are kept, default
and other location(s) where files are maintained, etc. As always, use the Housekeeping
syntax as a guide, but of course make changes as skills and preferences allow.
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
#install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
#install.packages("readxl", dependencies=TRUE)
library(readxl)
#install.packages("magrittr", dependencies=TRUE)
library(magrittr)
#install.packages("janitor", dependencies=TRUE)
library(janitor)
#install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
#install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
#install.packages("ggtext", dependencies=TRUE)
library(ggtext)
#install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
#install.packages("scales", dependencies=TRUE)
library(scales)
#install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
#install.packages("cowplot", dependencies=TRUE)
library(cowplot)
#install.packages("writexl", dependencies=TRUE)
library(writexl)
###############################################################
With all of the Housekeeping work completed and expected packages put into
use, it is now time to address the data associated with the addenda in this lesson.
Most approaches to use of the data will be based on functions associated with the
tidyverse ecosystem, but functions from Base R will be used when they represent
the most appropriate approach toward problem-solving.
As an advance organizer to the addenda in this lesson, consider the structure of
Addendum 1 and how it is centered on the use of birth rate data made available by
the World Bank (https://www.worldbank.org/en/home) and previously referenced
in the second lesson in this text. What is special about this dataset, and why it was selected for this chapter on how the tidyverse ecosystem is such a valuable tool for data scientists, comes down to the following challenges:
• The first row of BirthRatePer1000People.csv is not a header row, in its current
form. If the dataset were uploaded without accommodation, it would be quite
difficult to make any meaningful use of the data.
• Instead, the first four rows of BirthRatePer1000People.csv are not even part of
the dataset and are either descriptive information about the dataset (e.g., row 1
and row 3) or blank rows used only for spacing (e.g., row 2 and row 4).
• The header row of BirthRatePer1000People.csv is found on row 5, but there are
known problems with the column names in terms of later use:
–– Some column names consist of two words with a blank space between the
two. This type of column naming scheme may cause problems if not
accommodated.
–– Some column names consist of numbers, only. Again, this type of column
naming scheme may cause problems if not accommodated.
A similar dataset, also from the World Bank but now focused on Gross Domestic
Product (GDP), is used in Addendum 2. This addendum is prepared so that the syn-
tax used in Addendum 1 should also be used in Addendum 2, such that a tidy
approach that works in one application should work in another application, often
with only minimal change (mostly object variable names and selected scales, given
disparate data).8
An additional dataset from the World Health Organization, somewhat like the
datasets gained from the World Bank, is used for Addendum 3. A tidy approach is
suggested for use of this dataset. However, Addendum 3 is not prescriptive and only
a minimum of syntax is provided. The dataset is identified, a different function
(associated with the tidyverse ecosystem) is demonstrated to import the data and
concurrently adjust the dataset, and a few ideas are offered on direction, but only a
few ideas relating to sampling, workflow, analyses, and graphics are provided.9
Whether for Addendum 1, Addendum 2, or Addendum 3, the tidyverse ecosys-
tem has functions that can easily accommodate these and many other challenges.
The tidyverse ecosystem provides an excellent platform to efficiently (1) import
data into an active R session, (2) adjust object variable names, (3) restructure, select,
and filter data based on specific criteria, (4) support statistical analyses and graphi-
cal presentations, (5) expand scope and increase value by merging external data into
a currently active dataset.
When considering the idea of Must Know tidyverse ecosystem functions
(Addendum 4), once again recall that any such list is self-curated. Add to this list as
experience with the tidyverse ecosystem grows.
Given this background, upload the data for Addendum 1, knowing in advance
after prior offline review that the file will not be in final form after upload:
8
To promote initiative, a series of challenges are offered on how the data associated with Addendum
2 can be used to address interesting issues. Yet, unlike what is seen in Addendum 1, in Addendum
2 some syntax and output are purposely excluded from this chapter, all part of the desire to gradu-
ally encourage skill development. Use the syntax and process first seen in Addendum 1 as a guide
for completion of the challenges identified in Addendum 2. This should be an achievable outcome,
even at the introductory level of this text.
9
A series of suggestions are offered on how the data associated with Addendum 3 can be used to
address interesting issues. Yet, unlike what is seen in Addendum 1 and less so in Addendum 2, in
Addendum 3, most syntax and output are purposely excluded. Use the syntax and process first seen
in Addendum 1 and partially repeated in Addendum 2 as a guide for completion of Addendum 3,
again as a purposeful effort to encourage skill development.
utils::head(WBirthRateByEntity1960Onward.tbl)
# Confirm the nature of the uploaded dataset,
# prior to any accommodations.
install.packages("janitor", dependencies=TRUE)
library(janitor)
WBirthRateByEntity1960Onward.tbl <-
janitor::row_to_names(WBirthRateByEntity1960Onward.tbl,
row_number=5, # Make this row the header row.
remove_rows_above=TRUE) # Remove rows above the header row.
# The janitor::row_to_names() function is used to remove the
# rows above row 5 (e.g., rows 1 to 4), AND row 5 is now set
# as the header row, the row that holds the dataset column
# names.
#
# The task of removing a set number of rows from a dataset
# and declaring a later row in the dataset as the header row
# could have been accomplished using Base R. Yet, success
# with this task is far easier to obtain and the process is
# more succinctly communicated to others by deploying a few
# lines of syntax associated with the tidyverse ecosystem,
# using functions from the janitor package in this example.
utils::head(WBirthRateByEntity1960Onward.tbl)
# Confirm the nature of the adjusted dataset,
# after accommodations.
WBirthRateByEntity1960Onward.tbl <-
janitor::clean_names(WBirthRateByEntity1960Onward.tbl)
# Use the janitor::clean_names() function to put all
# object variable names into a tidy format.
utils::str(WBirthRateByEntity1960Onward.tbl)
# Confirm the nature of the adjusted dataset, after
# accommodations. By doing this, note the way all
# variable names are in lower case and an underscore
# is used as a replacement for spaces that may show
# in variable names.
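The assignment that creates WBirthRateByEntity1960OnwardNAMES is not repeated at this point in the text; presumably it simply captures the cleaned column names, along the lines of this hedged one-line sketch:
WBirthRateByEntity1960OnwardNAMES <-
  base::names(WBirthRateByEntity1960Onward.tbl)
# Assumed syntax; stores the cleaned object variable names
# so that they can be printed and reviewed.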
WBirthRateByEntity1960OnwardNAMES
# Print names after using the
# janitor::clean_names() function.
10
Although the janitor::clean_names() function is used occasionally in this text, there are many data scientists who use this function for nearly all datasets in use, as a standard practice for naming object variables.
The three things that are most obvious after using the janitor::clean_names()
function are that:
• Object variable names that started with a letter have all been put into lower
case, including the leading letter of each, if applicable.
• Any spaces between multiple words in an object variable name have been
replaced with an underscore character.
• Object variable names that started with a number have all been renamed so that
each begins with a lowercase x character, because R object variable names
cannot begin with a digit.
Now that the dataset WBirthRateByEntity1960Onward.tbl seems to be in good
form, use standard R functions to confirm one last time that all is correct and ready
for later use.
base::getwd()
base::ls()
base::attach(WBirthRateByEntity1960Onward.tbl)
utils::str(WBirthRateByEntity1960Onward.tbl)
dplyr::glimpse(WBirthRateByEntity1960Onward.tbl)
base::summary(WBirthRateByEntity1960Onward.tbl)
With the changes made to the original dataset, which was gained from an exter-
nal Web-based resource, it should now be easier to work with the data, where the
goal now is to make sense of birth rates (per 1,000 people) over time (e.g., 1960 onward)
for six purposely selected entities:
• ARG, Argentina
• BGD, Bangladesh
• BOL, Bolivia
• CHN, China
• MEX, Mexico
• USA, United States
Because of this consistency in the way columns are named and organized, it is
judged best to use the base::colnames() function to rename the columns into more
presentable names. Functions from Base R should not be overlooked when their use
is appropriate; in many cases they offer the simplest and easiest approach to
achieving aims.
base::colnames(WBirthRateByEntity1960Onward.tbl) <- c(
"country_name", # Column 01 Geographic Entity Name
"country_code", # Column 02 Geographic Entity Code
"indicator_name", # Column 03 Indicator Name
"indicator_code", # Column 04 Indicator Code
"1960", # Column 05 1960
"1961", # Column 06 1961
"1962", # Column 07 1962
"1963", # Column 08 1963
"1964", # Column 09 1964
"1965", # Column 10 1965
"1966", # Column 11 1966
"1967", # Column 12 1967
"1968", # Column 13 1968
"1969", # Column 14 1969
"1970", # Column 15 1970
"1971", # Column 16 1971
"1972", # Column 17 1972
"1973", # Column 18 1973
"1974", # Column 19 1974
"1975", # Column 20 1975
"1976", # Column 21 1976
"1977", # Column 22 1977
"1978", # Column 23 1978
"1979", # Column 24 1979
"1980", # Column 25 1980
"1981", # Column 26 1982
"1982", # Column 27 1983
"1983", # Column 28 1983
"1984", # Column 29 1984
"1985", # Column 30 1985
"1986", # Column 31 1986
"1987", # Column 32 1987
"1988", # Column 33 1988
"1989", # Column 34 1989
"1990", # Column 35 1990
"1991", # Column 36 1991
"1992", # Column 37 1992
"1993", # Column 38 1993
"1994", # Column 39 1994
"1995", # Column 40 1995
"1996", # Column 41 1996
"1997", # Column 42 1997
"1998", # Column 43 1998
"1999", # Column 44 1999
"2000", # Column 45 2000
"2001", # Column 46 2001
"2002", # Column 47 2002
"2003", # Column 48 2003
"2004", # Column 49 2004
"2005", # Column 50 2005
"2006", # Column 51 2006
base::getwd()
base::ls()
base::attach(WBirthRateByEntity1960Onward.tbl)
utils::str(WBirthRateByEntity1960Onward.tbl)
dplyr::glimpse(WBirthRateByEntity1960Onward.tbl)
base::summary(WBirthRateByEntity1960Onward.tbl)
utils::str(WBirthRateByEntity1960Onward.tbl)
# As a prudent QA check, confirm that the data
# are organized and named, as desired, for this
# point in the workflow.
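The hand-keyed vector above continues, one quoted name per column, through the final year in the file; keying each year by hand is tedious and error prone (note how easily a year comment can drift out of step with its column). As a compact alternative, sketched here under the assumption that the year columns run from 1960 through 2021, the same name vector can be built programmatically; match the end year to the actual number of columns in the downloaded file:
base::colnames(WBirthRateByEntity1960Onward.tbl) <- c(
  "country_name", "country_code",
  "indicator_name", "indicator_code",
  base::as.character(1960:2021))
  # One name per year column; adjust 2021 to the final
  # year actually present in the file.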
With the data in acceptable format, it is now best to make a few additional adjust-
ments to the dataset, which in this case entails using the dplyr::select() function to
remove three columns (e.g., country_name, indicator_name, and indicator_code)
since these columns are not needed for later analyses and graphical presentations.11
This action may not be needed and the columns could be retained, but their removal
will help in that the dataset will be more manageable in the evolving desire to use a
tidy approach to use of the data. The original dataset is retained in case there is ever a
desire to go back and use the data in the columns currently planned for removal.
11
Throughout this text, the convention PackageName::FunctionName() has been used regularly to
identify functions by their full name, taking into account function namespace, a critical issue if a
selected package and associated functions are to work and play well with other packages and other
functions. The select() function is an excellent example of why this approach,
PackageName::FunctionName(), is used. For this lesson, the select() function is associated with
the dplyr package. However, install and load the MASS package and type help(select) at the R
prompt to see that the select() function is also associated with the MASS package. If the dplyr
package and the MASS package were both loaded in the same R session, there may be some confu-
sion as to the use of the select() function, alone. However, using a PackageName::FunctionName()
approach to naming functions takes care of any confusion. If there were ever any question as to
which R packages are available in the current session, merely type sessionInfo() and look at the
output in the other attached packages: section.
WBirthRateByEntity1960OnwardAdjusted.tbl <-
dplyr::select(WBirthRateByEntity1960Onward.tbl,
-c(1, 3, 4)) # Remove columns by index, not name.
# Use the seemingly ubiquitous dplyr::select() function
# to remove columns by index: Column 1, Column 3, and
# Column 4. Column 2 and all other columns are retained.
# Entity Column 01
# Entity Code Column 02
# Indicator Name Column 03
# Indicator Code Column 04
# Entity codes (e.g., Column 02) will be used, purposely,
# due to their consistent use of a three-character alpha
# coding scheme.
#
# Be sure to notice that in this example, the minus sign
# indicates that the listed columns should be removed.
base::getwd()
base::ls()
base::attach(WBirthRateByEntity1960OnwardAdjusted.tbl)
utils::str(WBirthRateByEntity1960OnwardAdjusted.tbl)
dplyr::glimpse(WBirthRateByEntity1960OnwardAdjusted.tbl)
base::summary(WBirthRateByEntity1960OnwardAdjusted.tbl)
LBirthRateByEntity1960OnwardAdjusted.tbl <-
tidyr::pivot_longer(
WBirthRateByEntity1960OnwardAdjusted.tbl,
-c(country_code),
names_to = "year", values_to = "birth_rate") %>%
dplyr::filter(country_code %in% c(
"ARG", # Argentina
"BGD", # Bangladesh
"BOL", # Bolivia
"CHN", # China
"MEX", # Mexico
"USA")) # United States
# There are two parts to this syntax, with %>% used to: (1)
# send the product of the tidyr::pivot_longer() function for
# (2) use by the dplyr::filter() function. That is to say:
# The tidyr::pivot_longer() function was used to put the
# data in WBirthRateByEntity1960OnwardAdjusted.tbl into
# long format, and LBirthRateByEntity1960OnwardAdjusted.tbl
# is the result of that action.
# The dplyr::filter() function was used to filter out all
# long format data other than data for the six selected
# country_code entities.
# Because of these actions, the working dataset is in long
# format, a tidy approach to data science, and the dataset is
# restricted to data of interest, excluding all other data --
# another tidy approach to data science.
#
# When using tidyr::pivot_longer(), -c(country_code) means
# that the tidyr::pivot_longer() function should pivot
# everything except country_code. In this context, the minus
# sign means except.
base::getwd()
base::ls()
base::attach(LBirthRateByEntity1960OnwardAdjusted.tbl)
utils::str(LBirthRateByEntity1960OnwardAdjusted.tbl)
dplyr::glimpse(LBirthRateByEntity1960OnwardAdjusted.tbl)
base::summary(LBirthRateByEntity1960OnwardAdjusted.tbl)
# Even if seemingly redundant, it is still a prudent QA
# check to confirm that the data are organized and
# named, as desired, for this point in the workflow now
# that the data of interest are in long format.
birth_rate
Min. :10.5
1st Qu.:17.8
Median :22.9
Mean :26.0
3rd Qu.:34.2
Max. :48.4
Give attention to the minimum value (10.5) and maximum value (48.4) for the
object variable LBirthRateByEntity1960OnwardAdjusted.tbl$birth_rate, which
will guide the Y axis scale, set from 0 (less than the minimum) to 50 (greater than
the maximum).
Use the writexl::write_xlsx() function to immediately save the data externally, for
safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
LBirthRateByEntity1960OnwardAdjusted.tbl,
path =
"F:\\R_Ceres\\LBirthRateByEntity1960OnwardAdjusted.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("LBirthRateByEntity1960OnwardAdjusted.xlsx")
base::file.info("LBirthRateByEntity1960OnwardAdjusted.xlsx")
base::list.files(pattern =".xlsx")
Before any graphics are produced, it is important to know that the ggplot2::ggplot()
function supports many different themes. A ggplot2 theme is created by using syn-
tax to produce a figure with a desired appearance. In an attempt to make the figures
bold and vibrant, but also in an attempt to reduce redundant keying, look at theme_
Mac(), a self-created theme that will be used multiple times in concert with the
ggplot2::ggplot() function.
When a theme is used multiple times, as a preferred format, it reduces redundant
keying while adding value to a project. The use of additional themes, other than the
standard themes available to all, is valuable to enhance axis and tick mark presenta-
tion, bold labels and titles, centering, font size and color, etc. However, the use of
these additional ad hoc changes to standard themes often requires many lines of
syntax. By preparing syntax that is keyed one time and then saved with a unique
name, it is possible to easily deploy a user-created theme such as theme_Mac()
multiple times and in multiple projects. This approach of multiple use and reuse of
existing syntax promotes an efficient and tidy way of meeting project requirements.
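The definition of theme_Mac() itself is not repeated at this point in the text. A minimal sketch of what such a user-created theme might look like follows; every specific setting shown here is an illustrative assumption rather than the author's actual theme:
theme_Mac <- function(base_size = 14) {
  ggplot2::theme_classic(base_size = base_size) +
    ggplot2::theme(
      plot.title = ggplot2::element_text(face = "bold", hjust = 0.5),
      axis.title = ggplot2::element_text(face = "bold"),
      axis.text  = ggplot2::element_text(color = "black"))
}
# Once keyed and saved, the user-created theme is added to a
# ggplot2::ggplot() pipeline exactly like a built-in theme, by
# appending + theme_Mac().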
With all actions in place, it is now possible to produce individual figures that
plot changes in birth rate by year (1960–2020) for the six distinct selected geo-
graphic entities that are either adjacent to each other or at least in general proximity:
• Asia: Bangladesh and China
• North America: Mexico and United States
• South America: Bolivia and Argentina
Later, a grid approach will be used to present geographically nearby comparisons
for the selected entities from Asia, North America, and South America. These com-
parisons are among the many ways that a data scientist adds value to a project:
• The organization of relevant, current, reliable, and valid data into a table is a
good start for consumption by designated readers of a project, but it is not
enough. Most people, other than those who work nearly day to day with data,
simply do not have the background and patience to go over rows and rows of data
and make sense of the same. There are only a few professionals who can truly
make sense of thousands of datapoints, and now with expansion of data science
in education, government, industry, transportation and logistics, etc., there is a
need for accommodation of millions of datapoints in the pursuit of true compre-
hension of outcomes, those outcomes that were purposely investigated as well as
outcomes that are the result of unplanned discovery (e.g., serendipity).
• Accordingly, data are not only put into a readable rectangular dataset of some
sort, but the data then need to be organized in a way that supports statistical
analyses, collapsed tables, and figures, with the example in this addendum
focused on use of the ggplot2::ggplot() function to create Beautiful Graphics.
With this desire in mind, it may be best to briefly recap the data workflow in this
addendum and to summarize what has happened and what final actions are still
needed to generate the desired figures:
• The data were originally found at a Web page readily available to the public, and
by interacting with a Graphical User Interface (GUI) process, the data were
downloaded to a personal computer.12
• The data were then imported into an active R session.
• Once in the active R session, a set of R functions was used to put the data into an
acceptable wide format, with the dataset adjusted so that the eventual first row,
serving as the header row, had acceptable column names.
• Actions were then taken to remove columns that were not needed for cur-
rent plans.
• The data were put into long format by using tidyverse ecosystem functions.
• With the data in long format, tidyverse ecosystem tools will be used to select data
based on selected codes and the output of that action will then be used to produce
descriptive figures that display outcomes.
• To add value that increases understanding of outcomes, side by side figures will
be prepared, comparing birth rates over time for geographic entities that either
share a common border or are at least near each other.
utils::str(LBirthRateByEntity1960OnwardAdjusted.tbl)
# As a last reminder prior to generating the desired
# figures, use the utils::str() function to confirm the
# nature of each object variable found in the dataset
# LBirthRateByEntity1960OnwardAdjusted.tbl.
Note: Prior to preparation of these figures, it would be best to look at the opera-
tional definition of Birth rate, crude (per 1000 people), at the World Bank Metadata
Glossary, https://databank.worldbank.org/metadataglossary/gender-statistics/
12
An Application Programming Interface (API) paradigm for data retrieval is often desired, but not
always feasible. Data scientists need to interact with multiple processes to obtain desired data.
[Fig. 1, Fig. 2, and Fig. 3]
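The ggplot2 syntax that builds each single-entity figure is likewise not repeated here. A hedged sketch of what one of them, BGDBirthggplot.fig, might look like, assuming the long-format tibble prepared above and the 0-to-50 Y axis scale discussed earlier (labels and geoms are illustrative assumptions):
BGDBirthggplot.fig <-
  LBirthRateByEntity1960OnwardAdjusted.tbl %>%
  dplyr::filter(country_code == "BGD") %>%
  ggplot2::ggplot(ggplot2::aes(
    x = base::as.numeric(year),
    y = base::as.numeric(birth_rate))) +
  ggplot2::geom_line(color = "red") +
  ggplot2::geom_point(color = "red") +
  ggplot2::scale_y_continuous(limits = c(0, 50)) +
  ggplot2::labs(
    title = "Bangladesh (BGD): Birth Rate per 1,000 People, 1960 Onward",
    x = "Year", y = "Birth Rate per 1,000 People") +
  theme_Mac()
# The other five figures follow the same pattern, changing only
# the country_code filter and the title.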
par(ask=TRUE); BGDBirthggplot.fig
###########################################################
###########################################################
par(ask=TRUE); CHNBirthggplot.fig
###########################################################
par(ask=TRUE)
gridExtra::grid.arrange(
BGDBirthggplot.fig,
CHNBirthggplot.fig, ncol=2)
# C04Fig17BGDandCHNBirthRate.png
par(ask=TRUE); MEXBirthggplot.fig
###########################################################
###########################################################
par(ask=TRUE); USABirthggplot.fig
###########################################################
par(ask=TRUE)
gridExtra::grid.arrange(
MEXBirthggplot.fig,
USABirthggplot.fig, ncol=2)
# C04Fig18MEXandUSABirthRate.png
par(ask=TRUE); ARGBirthggplot.fig
###########################################################
###########################################################
par(ask=TRUE); BOLBirthggplot.fig
###########################################################
par(ask=TRUE)
gridExtra::grid.arrange(
ARGBirthggplot.fig,
BOLBirthggplot.fig, ncol=2)
# C04Fig19ARGandBOLBirthRate.png
Go to the World Bank Web page(s) relating to Data and view the page https://data.
worldbank.org/indicator/NY.GDP.MKTP.CD to obtain a dataset that follows the
same layout and structure as the Birth Rate dataset used in Addendum 1, but now
providing data related to Gross Domestic Product (GDP).13 Download the .csv ver-
sion of the file and use a descriptive name, such as GDPCurrentUSDollar.csv.
Bring the GDP-focused dataset into the same R session started in Addendum 1,
ideally using the readr::read_csv() function and process previously demonstrated.
As a suggested activity, but with far less guidance in this addendum than what
was provided in Addendum 1, follow along with the same process used in Addendum
1 and eventually generate the file LGDPByEntity1960OnwardAdjusted.tbl.14 Use
this file to generate six additional figures; if the same ordering and naming scheme
are used as previously demonstrated in Addendum 1, the figures should be presented
as: BGDGDPggplot.fig, CHNGDPggplot.fig, MEXGDPggplot.fig, USAGDPggplot.fig,
ARGGDPggplot.fig, and BOLGDPggplot.fig. These figures, now in Addendum 2,
show GDP change over time and are likely quite interesting, much as the Birth Rate
changes over time in Addendum 1 were also interesting.15
Experienced data scientists know that context is always needed if data are to have
value. Use the figures prepared in this R session (Addendum 1, where syntax for the
figures was provided and Addendum 2, where syntax for the figures is not provided)
to offer that context, where in the same general figure, there is a comparison of Birth
Rate over time to Gross Domestic Product (GDP) over time, by placing two separate
figures in the same summative figure, using the gridExtra::grid.arrange() function:
13
There is a known association between wealth and health. Wealth does not guarantee health, but
wealth eases access to medicines, availability of services, etc. It is entirely appropriate to consider
proxies for wealth, such as Gross Domestic Product (GDP), when examining health-related issues
in biostatistics.
14
Addendum 2 purposely does not include the full set of syntax needed for suggested actions, such
as syntax used to generate the dataset LGDPByEntity1960OnwardAdjusted.tbl and many later
actions based on this dataset. Use the syntax in Addendum 1 as a guide for actions in Addendum
2, but of course make improvements with an expanding skill set and interest.
15
If a simple copy and paste were used to change the syntax in Addendum 1 for use in Addendum
2, do not forget that the main object variable of interest in Addendum 1 was listed as birth_rate,
whereas the main object variable of interest in Addendum 2 should be listed as either gdp or
GDP. Equally, the scale needs to be adjusted for the Y axis in Addendum 2, with a maximum value
for GDP at 21400000000000, whereas the maximum value for the birth_rate Y axis value in
Addendum 1 was 48.4.
par(ask=TRUE)
gridExtra::grid.arrange(BGDBirthggplot.fig, BGDGDPggplot.fig,
ncol=2)
# Prepare BGDGDPggplot.fig using the same process as was used
# to prepare BGDBirthggplot.fig and follow along for the five
# other GDP figures. Recall that there will be differences
# in object names and scales, but most syntax and certainly
# the process can be used for both figures.
par(ask=TRUE)
gridExtra::grid.arrange(CHNBirthggplot.fig, CHNGDPggplot.fig,
ncol=2)
par(ask=TRUE)
gridExtra::grid.arrange(MEXBirthggplot.fig, MEXGDPggplot.fig,
ncol=2)
par(ask=TRUE)
gridExtra::grid.arrange(USABirthggplot.fig, USAGDPggplot.fig,
ncol=2)
par(ask=TRUE)
gridExtra::grid.arrange(ARGBirthggplot.fig, ARGGDPggplot.fig,
ncol=2)
par(ask=TRUE)
gridExtra::grid.arrange(BOLBirthggplot.fig, BOLGDPggplot.fig,
ncol=2)
Add even more value to this inquiry by preparing a scatterplot with GDP for each
year (1960 onward) on the X axis and Birth Rate for each year (1960 onward) on the Y axis.
To achieve this aim, it will be necessary to merge the two files currently available
in this R session: LBirthRateByEntity1960OnwardAdjusted.tbl and
LGDPByEntity1960OnwardAdjusted.tbl.
There are many creative ways that R can be used to merge (e.g., join, bind) two
files, with the selection based in part on the structure of the two original files and the
desired structure of the final file – the eventual merged file.
• Do not overlook the use of Base R and how the base::merge() function has value
and is often an expedient choice when merging two files.
• However, the tidyverse ecosystem and specifically functions available through
use of the dplyr package should also be considered, again depending on the
structure of the two original files and the desired structure of the final file.
Common functions available with the dplyr package for joining files include:
–– dplyr::left_join()
–– dplyr::right_join()
–– dplyr::inner_join()
–– dplyr::full_join()
–– dplyr::anti_join()
–– dplyr::semi_join()
–– dplyr::nest_join()
–– dplyr::bind_rows()
–– dplyr::bind_cols()
Although this text is focused on the tidyverse ecosystem, this lesson also includes
reference to Base R functions. Consider the base::merge() function. If the naming
process used in Addendum 1 were followed in Addendum 2, then the two files
(focus on birth_rate in Addendum 1 and focus on gdp in Addendum 2) could easily
be merged by using the following syntax:
LBirthRateGDP1960Onward.df <-
base::merge(
LBirthRateByEntity1960OnwardAdjusted.tbl,
LGDPByEntity1960OnwardAdjusted.tbl,
by = c("country_code", "year"))
# Merge the two files, by country_code and
# by year.
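For comparison only, a hedged sketch of the tidyverse route mentioned above: dplyr::inner_join() performs the same merge, keeping only the country_code and year combinations present in both tables and returning a tibble directly:
LBirthRateGDP1960Onward.tbl <-
  dplyr::inner_join(
    LBirthRateByEntity1960OnwardAdjusted.tbl,
    LGDPByEntity1960OnwardAdjusted.tbl,
    by = c("country_code", "year"))
# Because the result is already a tibble, the later
# tibble::as_tibble() conversion step is not needed.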
base::getwd()
base::ls()
base::attach(LBirthRateGDP1960Onward.df)
utils::str(LBirthRateGDP1960Onward.df)
dplyr::glimpse(LBirthRateGDP1960Onward.df)
base::summary(LBirthRateGDP1960Onward.df)
LBirthRateGDP1960Onward.tbl <-
tibble::as_tibble(LBirthRateGDP1960Onward.df)
# Put the dataframe into tibble format, to be
# consistent with use of tidyverse ecosystem
# tools.
base::getwd()
base::ls()
base::attach(LBirthRateGDP1960Onward.tbl)
utils::str(LBirthRateGDP1960Onward.tbl)
dplyr::glimpse(LBirthRateGDP1960Onward.tbl)
base::summary(LBirthRateGDP1960Onward.tbl)
There are two cells where NA shows for Gross Domestic Product. Use the
stats::na.omit() function to remove those two rows, since they may cause possible
problems in future analyses – especially correlation analyses:16
16
It may be beyond the purpose of this text, but it would still be useful to review the literature to
see different views on when missing data should be removed from a dataset and equally, when
missing data should be retained but accommodated.
LBirthRateGDP1960OnwardNoNAs.tbl <-
stats::na.omit(LBirthRateGDP1960Onward.tbl)
# The stats package is included among the
# many Base R packages, available when R is
# first downloaded. Other functions, those
# in the tidyverse ecosystem as well as
# those in Base R could have been used, but
# the stats::na.omit() function was perhaps
# the simplest selection for this example.
base::getwd()
base::ls()
base::attach(LBirthRateGDP1960OnwardNoNAs.tbl)
utils::str(LBirthRateGDP1960OnwardNoNAs.tbl)
dplyr::glimpse(LBirthRateGDP1960OnwardNoNAs.tbl)
base::summary(LBirthRateGDP1960OnwardNoNAs.tbl)
writexl::write_xlsx(
LBirthRateGDP1960OnwardNoNAs.tbl,
path = "F:\\R_Ceres\\LBirthRateGDP1960OnwardNoNAs.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("LBirthRateGDP1960OnwardNoNAs.xlsx")
base::file.info("LBirthRateGDP1960OnwardNoNAs.xlsx")
base::list.files(pattern =".xlsx")
LBirthRateGDP1960OnwardNoNAs.tbl %>%
dplyr::group_by(country_code) %>%
dplyr::summarise(
NCountry = base::length(country_code),
MeanGDP = base::mean(GDP_rate),
MeanBirthRate = base::mean(birth_rate),
Pearson_r = stats::cor(x=GDP_rate, y=birth_rate,
method = "pearson")
)
# Assume that GDP and Birth Rate are interval and in turn
# use the parametric Pearson's test to estimate the
# association between the two variables, GDP_rate and
# birth_rate.
# A tibble: 6 x 5
country_code NCountry MeanGDP MeanBirthRate Pearson_r
<chr> <int> <dbl> <dbl> <dbl>
1 ARG 59 2.07e11 21.2 -0.866
2 BGD 61 6.15e10 34.5 -0.803
3 BOL 61 1.00e10 34.1 -0.878
4 CHN 61 2.58e12 20.1 -0.558
5 MEX 61 4.73e11 30.3 -0.914
6 USA 61 7.68e12 15.5 -0.775
17
The expression 7.68e12, with experience, is far more convenient than writing the exceptionally
long number 7680000000000.
[Fig. 4]
par(ask=TRUE)
GDP_rateBYbirth_rateBYEntity.fig <-
ggplot2::ggplot(data=LBirthRateGDP1960OnwardNoNAs.tbl,
aes(x=base::log10(GDP_rate), y=base::log10(birth_rate))) +
geom_point(color="red", size=2) +
theme_Mac() +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
theme(
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
# Use the base::log10() function to accommodate and provide
# clarity to the widely disparate scales between the values
# in GDP_rate (maximum = USD 21,400,000,000,000) and the
# values in birth_rate (maximum = 48.4). Recall that the
# purpose of this set of figures is to provide a visual cue
# for general trends. A full set of arguments and options
# supported with use of the ggplot2::ggplot() function is not
# used at this time. If desired, go beyond this initial
# graphic to offer even more detail.
#
# This figure confirms the general outcome gained from the
# use of Pearson's r, in that Birth Rate (Y axis) decreases
# as GDP (X axis) increases.
# C04Fig20GDPRateBirthRate.png
par(ask=TRUE); GDP_rateBYbirth_rateBYEntity.fig
With this overall trend established, that Birth Rate (Y axis) decreases as GDP (X axis)
increases, it is now useful to use a faceting technique to generate an association-
type figure for each geographic entity (e.g., a breakout). Faceting is easily
supported in R, often by many different approaches. From among the many options,
observe how functions found in the ggpubr package easily accommodate this task
(Fig. 5).
[Fig. 5]
GDP_rateBYbirth_rateBYEntity_Facet.fig <-
ggpubr::facet(GDP_rateBYbirth_rateBYEntity.fig,
facet.by = "country_code")
# C04Fig21GDPRateBirthRateFacet.png
par(ask=TRUE); GDP_rateBYbirth_rateBYEntity_Facet.fig
Challenge: Much more could be (and should be) done with the two datasets that
were eventually merged into one dataset. The output in Addendum 2 provides a few
ideas on how R, both Base R and the tidyverse ecosystem, supports a thorough
understanding of the main variables of interest in this lesson, specifically the notion
that over time, parents have fewer children as their income gradually increases.
Reminder: The full set of syntax associated with Addendum 2 has been excluded
from this text. Use the syntax in Addendum 1 as a guide but make improvements in
presentation and content. Go beyond what was presented in Addendum 1 and begin
to model the behaviors of an engaged data scientist. By this point in the text, these
should all be acceptable goals.
It is known that alcohol consumption has an impact on a host of social and health
conditions, resulting in debt, debilitation, destruction, despair, and death.18 It is far
beyond the focus of this text to offer any meaningful discussion on alcohol
consumption other than to say that the individual decision to consume alcohol is
influenced by many factors that either promote or inhibit drink. The dataset for this
addendum is specific to the percentage of alcohol-attributable deaths, whether due
to disease or injury.
18
The term Deaths of Despair is now part of the lexicon found in the popular press as well as the professional literature and is, collectively, a major contributor to awareness of the growing phenomenon of excess deaths, observed well before the COVID-19 pandemic, but now accelerated far beyond expectations.
The World Health Organization (WHO) is the resource for the dataset used in
this addendum. Look at the many resources available at The Global Health
Observatory – Explore a World of Health Data (https://www.who.int/data/gho/data)
and use this starting point to find the many datasets associated with alcohol con-
sumption and breakout datasets on liver cirrhosis, road traffic deaths and injuries,
cancer, etc.
The dataset for this addendum, Alcohol-attributable fractions, all-cause
deaths (%), is found at https://www.who.int/data/gho/data/indicators/indicator-
details/GHO/alcohol-attributable-fractions-all-cause-deaths-(-) and has since
been downloaded to a local drive on the personal computer used for this lesson.
An API suitable for R specific to this resource was not found. As such, follow the
directions associated with EXPORT DATA in CSV format, Right click here &
Save link to download the dataset, saved externally as
AlcoholAttributableFractionsAllCauseDeaths.csv.
In an effort to demonstrate a variety of tidyverse ecosystem resources, the
vroom::vroom() function will be used to import the data placed on the F:\ drive into
the active R session and to concurrently adjust the dataset during this process, where
desired columns are retained and those columns that are not needed are excluded.
To make the decision on what columns (e.g., object variables) to include and what
columns to exclude:
• Examine the Metadata descriptions, a standard practice that should always be
followed whenever metadata are provided.
• Then, offline, examine the downloaded .csv file for what seems to be intuitive
descriptors used for column names. Make decisions on which variables are of
interest and which variables are not needed for the current planned workflow,
analyses, and graphics. Once a decision has been made on desired contents and
workflow, use the vroom::vroom() function to import the data, using the col_select
argument to retain desired columns so that unneeded columns are never imported at all.
install.packages("vroom", dependencies=TRUE)
library(vroom)
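The vroom::vroom() call itself is not repeated at this point in the text. A minimal sketch, assuming the downloaded file sits in the working directory; the column names listed in col_select are assumptions and must be checked against the actual .csv file and the WHO metadata before use:
WAlcoholDeaths.tbl <- vroom::vroom(
  "AlcoholAttributableFractionsAllCauseDeaths.csv",
  col_select = c(SpatialDimValueCode, Location, Period,
                 Dim1, Value))
# col_select keeps only the desired columns at import time, so
# unneeded columns are never brought into the active R session.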
base::getwd()
base::ls()
base::attach(WAlcoholDeaths.tbl)
utils::str(WAlcoholDeaths.tbl)
dplyr::glimpse(WAlcoholDeaths.tbl)
base::summary(WAlcoholDeaths.tbl)
19
Recall that use of the leading W is a self-guided practice, indicating that the data are in wide
format, not long format.
20
The inclusion of geographic entities in this listing is aspirational. The dataset provided by the
World Health Organization may not include all entities. It is also possible that slightly different
wording is used for identification of these entities. Use personal judgment on which entities to
include.
21
Whatever the result (likely a Pearson’s r estimate), never forget that association (e.g., correlation)
does not suggest causation.
Review this brief self-curated list of Must Know tidyverse ecosystem functions. Are
some of these functions useful, but perhaps they fail to rise to the level of Must
Know? Are there other Must Know functions that should have been included in this
brief listing? Feel free to contact the author of this text, [email protected], for sug-
gestions on additions and deletions to this admittedly self-curated list of Must Know
tidyverse packages, functions, and arguments, as appropriate for an introductory text.
dplyr::across()
dplyr::add_count()
dplyr::anti_join()
dplyr::arrange()
dplyr::case_when()
dplyr::count()
dplyr::filter()
dplyr::group_by() %>% dplyr::summarize()
dplyr::inner_join()
dplyr::left_join()
dplyr::mutate()
dplyr::nest_by()
dplyr::pull()
dplyr::relocate()
dplyr::rename()
dplyr::select()
dplyr::semi_join()
dplyr::slice()
dplyr::transmute()
forcats::as_factor()
forcats::fct_collapse()
forcats::fct_count()
forcats::fct_drop()
forcats::fct_expand()
forcats::fct_explicit_na()
forcats::fct_lump()
forcats::fct_lump_lowfreq()
forcats::fct_lump_min()
forcats::fct_lump_n()
forcats::fct_lump_prop()
forcats::fct_recode()
forcats::fct_reorder()
base::levels()
ggplot2::ggplot()
aes()
geom_density()
geom_dotplot()
geom_histogram()
coord_flip()
scale_x_continuous()
scale_x_log10()
lubridate::dmy()
lubridate::make_date()
lubridate::mdy()
lubridate::ymd()
purrr::discard()
purrr::keep()
purrr::map()
purrr::map_dfr()
purrr::modify()
purrr::when()
readr::read_csv()
readr::read_csv2()
readr::read_delim()
readr::read_tsv()
stringr::str_count()
stringr::str_extract()
stringr::str_pad()
stringr::str_replace()
stringr::str_replace_all()
stringr::str_split()
stringr::str_starts()
stringr::str_subset()
stringr::str_to_lower()
stringr::str_to_sentence()
stringr::str_to_title()
stringr::str_to_upper()
stringr::str_which()
tibble::add_column()
tibble::add_row()
tibble::as_tibble()
tibble::column_to_rownames()
tibble::has_rownames()
tibble::remove_rownames()
tibble::rownames_to_column()
tibble::tibble()
tidyr::crossing()
tidyr::drop_na()
tidyr::expand()
tidyr::extract()
tidyr::nest()
tidyr::pivot_longer()
tidyr::pivot_wider()
tidyr::replace_na()
tidyr::separate()
tidyr::unite()
janitor::clean_names()
janitor::find_header()
janitor::remove_empty()
janitor::row_to_names()
janitor::tabyl()
janitor::top_levels()
magrittr::extract()
magrittr %>% pipe operator
tidylog::tally()
tidyselect::contains()
tidyselect::ends_with()
tidyselect::matches()
tidyselect::num_range()
tidyselect::starts_with()
vroom::vroom()
External Data and/or Data Resources Used in This Lesson
The publisher's Web site associated with this text includes the following files,
presented in .csv, .txt, and .xlsx file formats.
AlcoholAttributableFractionsAllCauseDeaths.csv
BirthRatePer1000People.csv
GDPCurrentUSDollar.csv
LBirthRateByEntity1960OnwardAdjusted.xlsx
LBirthRateGDP1960OnwardNoNAs.xlsx
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 5
Statistical Analyses and Graphical Presentations in Biostatistics Using Base R
and the tidyverse Ecosystem
Background
There are many ways by which data scientists identify problems requiring statistical
analysis. Some data scientists are company or university employees and receive
work assignments from deans, department chairs, managers, supervisors, etc. Some
data scientists are consultants and discuss project possibilities with private individu-
als who wish to obtain their services. Regardless of the method of employment or
workplace, data science calls for a formal process on how problem-solving is
approached, ideally a collaborative and transparent approach from beginning to end.
To address this aim, early-on efforts include:
• Description of the Data: However the problem identification and later statistical
analysis process begins, the key issue here is that a reasonable attempt must be
made to learn as much as possible about the data: possible data resources, threats
to data availability, processes needed for data acquisition, expectations of what
will be needed to put data into good form, expectations for outcomes derived
from the data, etc. Once these issues are addressed there will be an initial idea of
resources needed for engagement, eventual completion of the analyses, and for-
mat for summary presentation of outcomes. Early on, before decisions are made
that would be difficult and costly to change, the expected workflow should be
drafted, and data should be carefully examined to avoid unexpected later prob-
lems. Initial examination of the data and communication with all parties involved
in the process can avoid untold problems later, problems that may make correc-
tions difficult, time-consuming, and expensive.
• Null Hypothesis: In statistics, it is still common to structure lines of inquiry in
the form of a Null Hypothesis, such as the simple statement: There is no statisti-
cally significant difference (p <= 0.05) between members of Group X and mem-
bers of Group Y in terms of characteristic Z. Deconstruct this statement and see
how it provides a great deal of context about the eventual line of inquiry.1
Import Data
Data can take many forms: (1) numeric (e.g., whole numbers and decimals such as
3 or 3.0), (2) alpha (e.g., strings such as R or S), (3) alpha-numeric (e.g., 123 Main
Street, Apartment 456, Cliff Island, Maine, 04019), (4) logical (e.g., 0 or 1, False or
True, Go or Stop, etc.), and (5) dates (e.g., year, month, day, hour, second). Data can
be housed at many locations: (1) offline on some type of storage device, portable
drive or resident PC, (2) Web page or cloud location, public or private. Data can be
organized in many ways: (1) rectangular dataset in a row-by-column format, (2)
disorganized spreadsheet with multiple sheets, where data are available, but not in
any type of tidy row-by-column format, (3) text embedded in a Web page or some
other type of document. Base R has many functions that facilitate the importation of
data into an active session and functions associated with the tidyverse ecosystem are
especially useful for this task, especially when data are not immediately found in a
rectangular (e.g., tidy) row by column format. Many different functions for data
import are demonstrated throughout this text.
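As one hedged illustration of this range of import functions (the file names and URL below are hypothetical placeholders, not files used in this text):
# A comma-delimited file read from the local working directory.
LocalFile.tbl <- readr::read_csv("ExampleLocalFile.csv")
# A worksheet read from an Excel workbook.
LocalWorkbook.tbl <- readxl::read_excel("ExampleWorkbook.xlsx", sheet = 1)
# A delimited file read directly from a public Web location.
WebFile.tbl <- readr::read_csv("https://example.org/ExampleWebFile.csv")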
1
It cannot be ignored that there are those who question the efficacy of structuring analyses in the
form of a formal acceptance or rejection of a Null Hypothesis, suggesting that reporting p-values
(only) is a better way of making judgment on outcomes. It is far beyond the purpose of this text to
offer comments on this issue, pro or con. It would be remiss to avoid attention to the use of the Null
Hypothesis, but attention should also be given to those who support the publication of p-values only.
Exploratory Graphics
Perhaps one of the most beneficial trends in statistical analysis software develop-
ment over the last few iterations is the way by which high quality graphics can be
generated, often in an easy and intuitive manner. Previously, graphics (e.g., figures)
were often crude, unattractive, difficult to interpret, and of marginal value. Now,
graphics can take many forms, from simple figures that help guide the exploratory
data analysis process to finalized figures that rival any professional artistic
composition:
• Graphics Using Base R: Base R can be used for graphical output. Functions used
to produce bar charts and histograms, density plots, scatterplots, etc. are all part
of the large collection of graphical possibilities supported by Base R. With some
degree of expertise, figures generated by use of Base R can be quite detailed and
attractive.
• Graphics Using the tidyverse Ecosystem: The tidyverse ecosystem and specifi-
cally the ggplot2 package is now, for many data scientists who use R, the default
tool to produce attractive graphics, initial graphics for guidance and final graph-
ics for presentation. As used throughout this text, the ggplot2::ggplot() function
has been used along the lines of the Grammar of Graphics paradigm and the desire for
Beautiful Graphics, where figures are not only accurate and easy to read but also
exceptionally attractive. Depending on the source and period reviewed, the ggplot2
package regularly ranks among the most frequently downloaded packages available from CRAN.2
Once the data have been imported and subjected to initial graphics, to gain a sense
of general trends between and among the data, more specificity is obtained when the
data are subjected to functions used to numerically describe the data and offer a
sense of central tendency. Factor-type data, data that deal with categorical data (e.g.,
HeightGeneral: short, medium, or tall), are often initially examined using frequency
distributions. Numeric-type data, either integer or real (e.g., HeightCentimeters:
174.05, 182.88, 185.76), are often examined by average (e.g., mean, median, mode),
dispersion (e.g., variance and standard deviation), and range (e.g., minimum and
maximum). There are many functions, using Base R and the tidyverse ecosystem,
that put descriptive statistics and measures of central tendency into an attractive and
well-organized table-type output, and these functions are demonstrated throughout
this text.
Exploratory Analyses
The expression follow the science, wherever it leads became a regular phrase
because of COVID-19 press releases in the mass media and from government
offices. This expression follows the notion of exploratory analyses. Analyses that
address the original Null Hypothesis are always essential, but other, possibly more
interesting, and useful analyses may come from a curious broad investigation of the
data, analyses that follow the data but then go beyond initial plans. On this subject,
it is also useful to explore the data from multiple perspectives, using parametric
analyses and nonparametric analyses. As shown in Addendum 1 and then Addendum
2 of this lesson, data are not always as pristine as purported, and analyses from
parametric and nonparametric viewpoints that yield consistent outcomes provide
additional quality assurance that derived outcomes have value.
2
Download and activate the cranlogs package. Then, deploy the R syntax cranlogs::cran_downloads(packages="ggplot2", from="2018-01-01", to="2022-12-31"), or other selected dates, to see a running summary of the number of times the ggplot2 package has been downloaded from CRAN. The syntax cranlogs::cran_top_downloads(when="last-month", count=100) provides
comparative data on R package downloads and it is the rare month, if at all, that the ggplot2 pack-
age is not among the top 5 or 10 downloaded packages.
Presentation of Outcomes
Once data have been obtained, subjected to graphical and statistical processes, and
all results are finalized, fully engaged data scientists need to consider their audience
when preparing the presentation of outcomes. Who are members of the audience:
(1) peer knowledge experts and others with technical expertise who will easily
understand processes, selected tests, jargon, etc. (2) deans, managers, supervisors,
colleagues, and others who have general knowledge of the scientific process and
statistics, but are by no means subject matter experts or (3) members of the public
who need to understand the general implications of outcomes and subsequent action
plans, but who will disdain recommendations derived from otherwise solid out-
comes if the presentation is not geared for a general, but respectful, level of
understanding?
This text is focused on R, the tidyverse ecosystem, and the use of APIs to obtain
data. Throughout this text, data are obtained by various processes. The data are then
often subjected to statistical analyses, typically using inferential tests. Even so, this
text is not solely another statistics text, using R as the platform for statistical
analyses.
Given the purpose and scope of this text, the front matter in this lesson provides
a brief listing of leading statistical tests. This listing of leading statistical tests is
very important, but the focus of this lesson is found in the four addenda and by
extension throughout the entire text. Give careful attention to the addenda and how
the tidyverse ecosystem and APIs are used in support of statistical analyses.
Nonparametric Tests
Data that are viewed as nonparametric are far too often underappreciated when
inferential analyses are planned. Quite the opposite, nonparametric inferential anal-
yses are an essential part of data science and subsequently statistical analysis,
whether R or some other platform is used. Nonparametric inferential tests are often
called ranking tests and distribution-free tests, given that these tests are frequently
based on the use of ordinal data and similar data that are ordered but may not meet
the desired assumption of normal distribution (e.g., bell-shaped curve). As a brief
summary, nonparametric data: (1) do not have the precision of continuous data that
fall along an interval scale, (2) often have extreme deviation from normal
distribution, and (3) may have noticeable differences in the number of subjects in
breakout groups.
R supports many functions, from Base R packages and external packages, that can
be used for nonparametric inferential statistics, including tests such as3:
• The Chi-square Test is typically used to test for differences in proportions
between two or more groups.
• The Mann-Whitney U Test is used to determine if two independent groups are
from the same population.
• The Wilcoxon Matched-Pairs Signed Ranks Test is used to examine differences
between paired subjects, however the term pair is constructed.
• The Kruskal-Wallis H-Test Oneway ANOVA is used to examine if there are sta-
tistically significant differences when comparing multiple groups (often three or
more), with different factors for each group.
• The Friedman Twoway ANOVA uses repeated measures across three or more
matched groups to determine if similarities or differences exist between groups.
• Spearman’s rho is used to estimate the association between two separate mea-
sures.4 Yet, always consider the expression that association does not suggest
causation.
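As a quick, hedged reference, the Base R functions most often used for these nonparametric tests are sketched below with small made-up vectors (all values are illustrative only):
x <- c(12, 15, 14, 11, 19, 18)  # Illustrative measures, group 1.
y <- c(21, 17, 22, 16, 20, 23)  # Illustrative measures, group 2.
stats::chisq.test(base::matrix(c(30, 10, 20, 25), nrow = 2))  # Chi-square.
stats::wilcox.test(x, y)                        # Mann-Whitney U.
stats::wilcox.test(x, y, paired = TRUE)         # Wilcoxon matched-pairs.
stats::kruskal.test(base::list(x, y, x + 2))    # Kruskal-Wallis H.
stats::friedman.test(base::cbind(x, y, x + 2))  # Friedman twoway ANOVA.
stats::cor.test(x, y, method = "spearman")      # Spearman's rho.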
Parametric Tests
Unlike nonparametric data which are often based on ranks, parametric data are, ide-
ally, continuous and have exact parameters. Parametric data may be expressed as
whole numbers (e.g., integers) or as real numbers (e.g., decimals), with parametric analyses including tests
such as5:
• The Student’s t-Test for Independent Samples is used to determine if there are
statistically significant differences in measurement gained from two sepa-
rate groups.
• The Student’s t-Test for Matched Pairs is used to determine if there are statisti-
cally significant differences in measurement gained from matched subjects, sep-
arated into two separate groups, however the term pair is constructed.
• The Oneway ANOVA Test is used to determine if there are statistically signifi-
cant differences in measurement gained from three separate groups.
3
Use p <= 0.05 or some other selected p-value for both nonparametric analyses and parametric
analyses.
4
When considering the use of a test of association, such as Spearman’s rho, Pearson’s r, or
Kendall’s tau, it is common to use the term estimate when referencing the derived statistic.
5
It cannot be ignored that there are many times when data are purported to be parametric, but the
assumption of a continuous nature to parametric data is violated and the data should instead be
viewed as being nonparametric. The literature should be reviewed to examine this issue in
more detail.
• The Twoway ANOVA Test is used to determine if there are statistically signifi-
cant differences (and possible interactions) between groups when variables have
two or more categories.
• Pearson’s r is used to estimate the association between two separate measures.
Yet, always consider the expression association does not suggest causation.
• Linear Regression is used to provide a predicted estimate of a variable of interest
when there is one or more predictor variables.6
As a reminder, the use of functions associated with many of these inferential tests
is found in the addenda associated with this lesson and throughout this text. But of
course, there are many resources beyond this text on the use of R functions for these
inferential tests, with the functions found in both Base R and other packages housed
at CRAN.
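Again purely as a hedged quick reference, Base R counterparts for the parametric tests listed above, using small made-up values (illustrative only):
g2 <- base::factor(c("X", "X", "X", "Y", "Y", "Y"))   # Two groups.
g3 <- base::factor(c("X", "X", "Y", "Y", "Z", "Z"))   # Three groups.
z  <- c(4.1, 5.0, 4.7, 6.2, 5.9, 6.5)                 # Illustrative measures.
w  <- c(1.2, 2.3, 3.1, 4.4, 5.0, 6.2)                 # Illustrative predictor.
stats::t.test(z ~ g2)                         # t-Test, independent samples.
stats::t.test(z[1:3], z[4:6], paired = TRUE)  # t-Test, matched pairs.
base::summary(stats::aov(z ~ g3))             # Oneway ANOVA.
stats::cor.test(z, w, method = "pearson")     # Pearson's r.
base::summary(stats::lm(z ~ w))               # Linear regression, one predictor.
f1 <- base::gl(2, 6, labels = c("A", "B"))            # Factor 1, two levels.
f2 <- base::gl(3, 2, 12, labels = c("L", "M", "H"))   # Factor 2, three levels.
v  <- stats::rnorm(12, mean = 50, sd = 5)             # Illustrative response.
base::summary(stats::aov(v ~ f1 * f2))                # Twoway ANOVA with interaction.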
6
When using a test of association, correlation, or regression, recall that the overarching assumption
is based on the expression past behavior is the best predictor of future behavior.
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "F:/R_Packages")
# As a preference, all installed packages
# will now go to the external F:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("F:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
When looking at the syntax, below, notice how the # character was not placed in
front of each install.packages() function, which was the case in a few prior lessons.
An updated version of R was downloaded and put into use for this lesson.
Accordingly, it was decided as a good programming practice (gpp) to preemptively
download all required packages, too, to have the most up-to-date versions of R’s
many package-based functions. It is always a good idea, unless there are unusual
reasons otherwise, to use the most current version and supporting packages.
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
install.packages("readxl", dependencies=TRUE)
library(readxl)
install.packages("magrittr", dependencies=TRUE)
library(magrittr)
install.packages("janitor", dependencies=TRUE)
library(janitor)
install.packages("rlang", dependencies=TRUE)
library(rlang)
install.packages("htmltools", dependencies=TRUE)
library(htmltools)
install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
install.packages("ggtext", dependencies=TRUE)
library(ggtext)
install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
install.packages("scales", dependencies=TRUE)
library(scales)
install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
install.packages("cowplot", dependencies=TRUE)
library(cowplot)
install.packages("writexl", dependencies=TRUE)
library(writexl)
###############################################################
# Use the tidycensus package and/or the acs package and the
# U.S. Census Bureau key to obtain state and/or county specific
# data from selected American Community Survey (ACS) and/or
# Decennial Census tables.
#
# Use the following URL to access the form needed to obtain an
# API key from the U.S. Census Bureau:
# https://api.census.gov/data/key_signup.html
#
# Complete details on the API process with U.S. Census Bureau
# are available at https://www.census.gov/content/dam/Census/
# library/publications/2020/acs/acs_api_handbook_2020_ch02.pdf.
install.packages("tidycensus", dependencies=TRUE)
library(tidycensus)
# CAUTION: The tidycensus package may take longer to
# download than other packages. Be patient.
tidycensus::census_api_key(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
#"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
# CAUTION: The code for this key is reserved
# for [email protected] only. Use your own key!
install.packages("acs", dependencies=TRUE)
library(acs)
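If the key should persist across R sessions, the tidycensus package also supports writing it to the user-level .Renviron file by way of the install argument. A minimal sketch (substitute your own 40-digit key, and restart R afterward):
# Optional: store the key in .Renviron so it does not need to be
# supplied in every session. Use your own key, not the placeholder.
tidycensus::census_api_key(
  "Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
  install=TRUE)
Sys.getenv("CENSUS_API_KEY") # Confirm that the key is available.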
###############################################################
# Mapping #
###############################################################
install.packages("maptools", dependencies=TRUE)
library(maptools)
install.packages("rcpp", dependencies=TRUE)
library(rcpp)
install.packages("rgdal", dependencies=TRUE)
library(rgdal)
install.packages("rgeos", dependencies=TRUE)
library(rgeos)
install.packages("sf", dependencies=TRUE)
library(sf)
install.packages("stars", dependencies=TRUE)
library(stars)
install.packages("terra", dependencies=TRUE)
library(terra)
install.packages("xfun", dependencies=TRUE)
library(xfun)
install.packages("choroplethr", dependencies=TRUE)
library(choroplethr)
install.packages("choroplethrMaps", dependencies=TRUE)
library(choroplethrMaps)
install.packages("choroplethrAdmin1", dependencies=TRUE)
library(choroplethrAdmin1)
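The user-created theme_Mac() function is applied to nearly every figure that follows. For readers who want a stand-in while working through this addendum, a minimal sketch of a comparable user-created ggplot2 theme function is shown below; the specific settings here are assumptions, not the author's exact choices.
# A sketch of a user-created ggplot2 theme function comparable to theme_Mac().
# Assumes ggplot2 is attached (it is loaded above via the tidyverse).
theme_Mac <- function(base_size=12) {
  theme_classic(base_size=base_size) +
  theme(
    plot.title      = element_text(face="bold", hjust=0.5),
    axis.title.x    = element_text(face="bold"),
    axis.title.y    = element_text(face="bold"),
    strip.text      = element_text(face="bold"),
    legend.position = "bottom")
}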
###############################################################
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
###############################################################
Challenge: Review Addendum 1 and Addendum 2. Use the syntax in these two addenda to replicate all analyses and figures. Then, using the front matter and what is presented in these addenda, prepare a technical memorandum on outcomes associated with the data, following this general outline for presentation:
• Background
–– Description of the Data
–– Null Hypothesis
• Import Data
• Code Book and Data Organization
• Exploratory Graphics
–– Graphics Using Base R
–– Graphics Using the tidyverse Ecosystem
Background
Births and deaths are among the many base measures used to estimate excess deaths, a metric that was frequently cited in the press and by many government officials regarding the impact of the SARS-CoV-2 virus and, subsequently, COVID-19 on communities. Census Bureau resources are used in Addendum 1 and Addendum 2 to obtain data on the rates per 1000 persons of births (RBIRTH) and rates per 1000 persons of deaths (RDEATH) by county.7,8
County-based Census Bureau data are collectively collapsed by state and, eventually, the states are collapsed into four recognized geographical regions: Midwest, Northeast, South, and West (https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf). The data for this lesson will be gained from 2015 to 2019, to establish a base for trends prior to the impact of COVID-19, which emerged in very late 2019 and had become a pandemic by 2020.
Challenge: As data become available, replicate the syntax and process in
Addendum 1 and Addendum 2, but now for 2000 onward. Do the 2015 to 2019
trends hold, compared to 2000 onward?
7 The set of analyses in this addendum uses county-wide data, where county is declared as those primary administrative entities with a recognized FIPS (Federal Information Processing Standard) five-digit county code. Most states use the term county for these geographical areas, but the use of this term is not consistent. There are more than 3100 counties in the United States, and these entities also include terms such as: parish (Louisiana), borough (Alaska), census area (Alaska), and city (Maryland, Missouri, Nevada, Virginia). To compound this classification scheme, Washington DC, the District of Columbia, is classified by the Census Bureau as both a state and a county, yet it is neither. Fortunately, FIPS codes dismiss these concerns for those who use Census Bureau county-specific data: Washington County, Florida, FIPS code 12133; Washington Parish, Louisiana, FIPS code 22117; Washington, District of Columbia, FIPS code 11001, etc.
8 Refer to States, Counties, and Statistically Equivalent Entities (https://www2.census.gov/geo/pdfs/reference/GARM/Ch4GARM.pdf) for more specific information on how counties are viewed when using Census Bureau data. Give special attention to how codes are used to identify island areas such as Puerto Rico, perhaps the best-known island area under United States jurisdiction.
As a reminder, the data for this set of analyses are based on rates per 1000 persons of births (RBIRTH) and rates per 1000 persons of deaths (RDEATH). In its original format, the dataset had data for other variables, but the analyses in Addendum 1 and Addendum 2 are restricted to these two variables: RBIRTH and RDEATH. Be sure to notice how the variables that are not needed for this lesson are accommodated.
Null Hypotheses
The Census Bureau is a rich data source for the many vital statistics that have some
degree of impact on human health and wellness, thus the interest and use of these
data in biostatistics. From among the many possible analyses that could be achieved
from the planned dataset, Addendum 1 (a parametric approach to the data) and
Addendum 2 (a nonparametric approach to the data) will focus on the following null
hypotheses:
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Year (e.g., 2015, 2016, 2017, 2018, 2019).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Region (e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Year (e.g., 2015, 2016, 2017, 2018, 2019).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Region (e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Year (e.g., 2015, 2016, 2017, 2018, 2019) and by Region
(e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Year (e.g., 2015, 2016, 2017, 2018, 2019) and by Region
(e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant association (p <= 0.05) at the national level in the rate per 1000 per-
sons of births (RBIRTH) and the rate per 1000 persons of deaths (RDEATH).
The data will encompass multiple years (e.g., 2015, 2016, 2017, 2018, 2019) for
all four national regions (e.g., Midwest, Northeast, South, West).
Import Data
The tidycensus::get_estimates() function is used to obtain the data for all analyses attempted in Addendum 1 and Addendum 2. The data are from the Census Bureau Population Estimates. The data acquisition process was organized by using an Application Programming Interface (API) and facilitated by use of the purrr::map_dfr() function, which allowed data retrieval for multiple years, 2015 to 2019, all in one convenient process. Note how the purrr package is recognized as part of the tidyverse ecosystem and how the tidycensus package works and plays well with the tidyverse ecosystem.
After the desired Census Bureau Population Estimates dataset(s) are gained, data are then sequestered into the desired format. Specifically, data are organized by each of the four Census Bureau regions by using tidyverse ecosystem functions. Later, functions from the tidyverse ecosystem are used to put the four regional datasets into one unified national dataset.
It is more than recognized that many other approaches could have been used to put data into the desired format(s), the four multi-year breakout regional datasets and the final multi-year unified national dataset. The methods used in this addendum were purposely selected for teaching purposes, to demonstrate functions from the tidyverse ecosystem multiple times and to also reinforce the heuristics of the data and the many possibilities of what can be done once an inclusive dataset is obtained.9
9 With more experience, perhaps, think of and, if possible, implement other ways the tidyverse ecosystem can be used to have the equivalent data in format(s) suitable for analyses that support the many previously stated Null Hypotheses. Give special attention to how the tidycensus::get_estimates() function and the purrr::map_dfr() function are used to obtain the data instead of constructing some type of forced loop. Then, review how other tidyverse ecosystem functions are used to organize, sequester, and join data. There is no one and only one way to implement these analyses.
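As a point of reference, a minimal sketch of the multi-year acquisition pattern described above is shown here. The argument choices (geography, product, output) and the added year column are assumptions that mirror the code book that follows, not a verbatim copy of the lesson's syntax, and the call presumes the Census Bureau API key has already been set.
# A sketch of the multi-year pull with tidycensus and purrr.
# Assumes library(tidyverse) and library(tidycensus) are attached, as above.
years2015to2019.lst <- base::list(2015, 2016, 2017, 2018, 2019)
components2015to2019.tbl <- purrr::map_dfr(
  years2015to2019.lst,
  ~ tidycensus::get_estimates(
      geography = "county",     # County-level rows, FIPS-coded GEOID.
      product   = "components", # Components of change: BIRTHS, RBIRTH, RDEATH, etc.
      year      = .x,
      output    = "tidy") %>%   # Long format: NAME, GEOID, variable, value.
    dplyr::mutate(year = base::as.character(.x))) # Add a year column.
dplyr::glimpse(components2015to2019.tbl)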
#############################################################
#
# Code Book - Census Bureau Population Estimates After Using
# the tidycensus::get_estimates() Function and the product =
# "components" Argument Across Multiple Years
#
# Data when first obtained from the Census Bureau
#
# year, chr ................ 2015, 2016, 2017, 2018, and 2019
# NAME, chr ................. Names of counties, states, etc.
# GEOID, chr ........................... Five-digit FIPS code
# variable, chr ................ BIRTHS, DEATHS, DOMESTICMIG,
# INTERNATIONALMIG, NATURALINC, NETMIG, RBIRTH,
# RDEATH, RDOMESTICMIG, RINTERNATIONALMIG,
# RNATURALINC, RNETMIG
# value, num ...Rate per 1,000 Persons of Births (RBIRTH) and
# Rate per 1,000 Persons of Deaths (RDEATH)
#############################################################
#
# Region, chr ... An enumerated variable that is added to the
# dataset after initial data acquisition:
# Midwest, Northeast, South, West
#############################################################
A variety of mostly tidyverse ecosystem functions will be used on the dataset, in its original format, so that it meets the declared requirements. The major changes made to the original dataset, with the general pattern sketched in code immediately after this list, are:
• Delete data for 10 of the 12 breakout variables, so that data are retained only for
the breakouts RBIRTH and RDEATH.
• Prepare by Region breakout tibbles, and from this action prepare draft figures
and descriptive statistics.
• Add Region (e.g., Midwest, Northeast, South, West) as an enumerated column
(e.g., object variable), to aid later analyses.
• Temporarily transform the data from long to wide to give another view on how
correlation analyses are attempted.
• As needed, especially because of function requirements associated with mapping
activities, put selected character object variables into numeric format.
• Put the four by Region breakout tibbles (e.g., datasets) back into one unified
tibble, representing one common national dataset. This final dataset is used to
prepare a national map of rates for RBIRTH and RDEATH, with county outlines
and state outlines showing on the map. The national dataset will also be used for
statistical purposes.
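The per-region pattern of adding the enumerated Region column and then filtering down to RBIRTH and RDEATH, referenced in the list above, is short. In the sketch below, the upstream object MidwestRaw.tbl is a hypothetical name standing in for the Midwest subset of the acquired Population Estimates data; it is used here only to show the shape of the pipeline.
# A sketch of the per-region pattern; MidwestRaw.tbl is hypothetical.
RBIRTHandRDEATHMidwest2015Onward.tbl <- MidwestRaw.tbl %>%
  tibble::add_column(Region = "Midwest") %>%           # Add the enumerated Region column.
  dplyr::filter(variable %in% c("RBIRTH", "RDEATH"))   # Retain only the two breakouts.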
Exploratory Graphics
A variety of actions are used to prepare the four by Region (Midwest, Northeast,
South, West) breakout tibbles over the five years (2015, 2016, 2017, 2018, and
2019). As a reminder as to why these variables were selected, the data on RBIRTH
and RDEATH provide a base for trends prior to the COVID-19 epidemic, which had
major impact starting in early 2020. Ultimately, these data serve as base measures,
along with a host of other variables, to later address excess deaths, a key metric
associated with understanding the pandemic’s many impacts on public health and
wellness, educational institutions, workforce availability, supply chains, factory
output, economic growth, migrations and population redistribution, etc.
First, prepare a list of the 5 years (2015, 2016, 2017, 2018, and 2019) for which
county-wide data on RBIRTH and RDEATH are available from the Census Bureau
Population Estimates:
base::getwd()
base::ls()
base::attach(years2015to2019.lst)
base::class(years2015to2019.lst)
[1] "list"
10 Preemptively, it should be mentioned that there are many ways to obtain the data, organize the data, structure and restructure the data, modify the class of selected variables, etc. The steps taken in this part of the lesson may be somewhat repetitive and verbose, but they were purposely selected to demonstrate the complexity of this type of endeavor. Recall that this text is designed for those who are being introduced to the tidyverse ecosystem and the use of APIs for biostatistics. Those with more advanced skills might select other actions.
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHMidwest2015Onward.tbl)
utils::str(RBIRTHandRDEATHMidwest2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHMidwest2015Onward.tbl)
Rows: 10,550
Columns: 6
$ year <chr> "2015", "2015", "2015", "2015", "2015",
$ NAME <chr> "Adams County, Illinois, East North Cent
$ GEOID <chr> "17001", "17003", "17005", "17007", "170
$ variable <chr> "RBIRTH", "RBIRTH", "RBIRTH", "RBIRTH",
$ value <dbl> 12.48965, 14.69952, 9.50872, 10.69120, 8
$ Region <chr> "Midwest", "Midwest", "Midwest", "Midwes
base::summary(RBIRTHandRDEATHMidwest2015Onward.tbl)
base::print(RBIRTHandRDEATHMidwest2015Onward.tbl)
As one last quality assurance action, used to verify that filtering was done correctly, look again at the column named variable:
base::unique(RBIRTHandRDEATHMidwest2015Onward.tbl$variable)
With assurance that all is correct and that the enumerated tibble is ready for use, the writexl::write_xlsx() function should be used to immediately download the data, for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
RBIRTHandRDEATHMidwest2015Onward.tbl,
path = "F:\\R_Ceres\\RBIRTHandRDEATHMidwest2015Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is identified,
# especially the use of double back slashes.
base::file.exists("RBIRTHandRDEATHMidwest2015Onward.xlsx")
base::file.info("RBIRTHandRDEATHMidwest2015Onward.xlsx")
base::list.files(pattern =".xlsx")
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHNortheast2015Onward.tbl)
utils::str(RBIRTHandRDEATHNortheast2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHNortheast2015Onward.tbl)
base::summary(RBIRTHandRDEATHNortheast2015Onward.tbl)
base::print(RBIRTHandRDEATHNortheast2015Onward.tbl)
As one last quality assurance action, used to verify that filtering was done cor-
rectly, look again at the column named variable:
base::unique(RBIRTHandRDEATHNortheast2015Onward.tbl$variable)
With assurance that all is correct and that the enumerated tibble is ready for use,
the writexl::write_xlsx() function should be used to immediately download the data,
for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
RBIRTHandRDEATHNortheast2015Onward.tbl,
path = "F:\\R_Ceres\\RBIRTHandRDEATHNortheast2015Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is identified,
# especially the use of double back slashes.
base::file.exists("RBIRTHandRDEATHNortheast2015Onward.xlsx")
base::file.info("RBIRTHandRDEATHNortheast2015Onward.xlsx")
base::list.files(pattern =".xlsx")
#
# base::unique(RBIRTHandRDEATHSouth2015Onward.tbl$variable)
#
# View(RBIRTHandRDEATHSouth2015Onward.tbl)
#
tibble::add_column(Region = "South") %>%
# Continue with %>%
#
dplyr::filter(variable %in% c(
"RBIRTH",
"RDEATH"))
# Retain the RBIRTH and RDEATH
# rows and delete all others.
#
# The tibble, after modification by using the tidyverse
# ecosystem functions, now consists of 14,220 rows and 6
# columns.
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHSouth2015Onward.tbl)
utils::str(RBIRTHandRDEATHSouth2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHSouth2015Onward.tbl)
base::summary(RBIRTHandRDEATHSouth2015Onward.tbl)
base::print(RBIRTHandRDEATHSouth2015Onward.tbl)
As one last quality assurance action, used to verify that filtering was done cor-
rectly, look again at the column named variable:
base::unique(RBIRTHandRDEATHSouth2015Onward.tbl$variable)
With assurance that all is correct and that the enumerated tibble is ready for use,
the writexl::write_xlsx() function should be used to immediately download the data,
for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
RBIRTHandRDEATHSouth2015Onward.tbl,
path = "F:\\R_Ceres\\RBIRTHandRDEATHSouth2015Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is identified,
# especially the use of double back slashes.
base::file.exists("RBIRTHandRDEATHSouth2015Onward.xlsx")
base::file.info("RBIRTHandRDEATHSouth2015Onward.xlsx")
base::list.files(pattern =".xlsx")
# View(RBIRTHandRDEATHWest2015Onward.tbl)
#
tibble::add_column(Region = "West") %>%
# Continue with %>%
#
dplyr::filter(variable %in% c(
"RBIRTH",
"RDEATH"))
# Retain the RBIRTH and RDEATH
# rows and delete all others.
#
# The tibble, after modification by using the tidyverse
# ecosystem functions, now consists of 4,480 rows and 6
# columns.
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHWest2015Onward.tbl)
utils::str(RBIRTHandRDEATHWest2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHWest2015Onward.tbl)
base::summary(RBIRTHandRDEATHWest2015Onward.tbl)
base::print(RBIRTHandRDEATHWest2015Onward.tbl)
As one last quality assurance action, used to verify that filtering was done cor-
rectly, look again at the column named variable:
base::unique(RBIRTHandRDEATHWest2015Onward.tbl$variable)
With assurance that all is correct and that the enumerated tibble is ready for use,
the writexl::write_xlsx() function should be used to immediately download the data,
for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
RBIRTHandRDEATHWest2015Onward.tbl,
path = "F:\\R_Ceres\\RBIRTHandRDEATHWest2015Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is identified,
# especially the use of double back slashes.
base::file.exists("RBIRTHandRDEATHWest2015Onward.xlsx")
base::file.info("RBIRTHandRDEATHWest2015Onward.xlsx")
base::list.files(pattern =".xlsx")
Before graphics are attempted, often for draft purposes that provide initial guidance on general trends, it is best to prepare a series of breakout tibbles, where data for each of the four regions (e.g., Midwest, Northeast, South, West) are sequestered into two new tibbles, an RBIRTH tibble and an RDEATH tibble. Each of these eight new tibbles will be used to prepare a relevant boxplot, a boxplot that focuses on one and only one breakout (e.g., RBIRTH or RDEATH) for one and only one region (e.g., Midwest, Northeast, South, or West).11
Effort will then be made to consolidate the resulting breakout boxplots into a few meaningful side-by-side boxplots. Later, the ggplot2::ggplot() function and its facet capabilities will be used to demonstrate why use of the tidyverse ecosystem may be an easier way to approach these aims, even though Base R graphics still have a place in data science.
RBIRTH2015OnwardMidwest.tbl <-
RBIRTHandRDEATHMidwest2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
base::attach(RBIRTH2015OnwardMidwest.tbl)
RDEATH2015OnwardMidwest.tbl <-
RBIRTHandRDEATHMidwest2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
base::attach(RDEATH2015OnwardMidwest.tbl)
RBIRTH2015OnwardNortheast.tbl <-
RBIRTHandRDEATHNortheast2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
base::attach(RBIRTH2015OnwardNortheast.tbl)
RDEATH2015OnwardNortheast.tbl <-
RBIRTHandRDEATHNortheast2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
base::attach(RDEATH2015OnwardNortheast.tbl)
RBIRTH2015OnwardSouth.tbl <-
RBIRTHandRDEATHSouth2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
base::attach(RBIRTH2015OnwardSouth.tbl)
RDEATH2015OnwardSouth.tbl <-
RBIRTHandRDEATHSouth2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
base::attach(RDEATH2015OnwardSouth.tbl)
11 Explore use of the graphics::boxplot() function and its subset argument as a potential tool.
RBIRTH2015OnwardWest.tbl <-
RBIRTHandRDEATHWest2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
base::attach(RBIRTH2015OnwardWest.tbl)
RDEATH2015OnwardWest.tbl <-
RBIRTHandRDEATHWest2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
base::attach(RDEATH2015OnwardWest.tbl)
Now that the eight breakout tibbles are created, prepare eight corresponding boxplots, looking at value by variable for each. Add a few embellishments to improve presentation and understanding.12
Not only to save space but, perhaps more importantly, observe how the par(mfrow=c()) function is one tool (of many) for placing multiple figures into one composite figure. This type of presentation makes side-by-side comparisons easy to understand, but consistent scales are needed to fully benefit from side-by-side comparisons (Figs. 5.1 and 5.2).
Fig. 5.1
12 Give attention to the outliers showing in the boxplots. Outliers have a potential impact on later calculations of normality. Outliers can also introduce bias in model-building and the development of prediction equations. It is possible to trim (e.g., remove) outliers from the data, with the outliers::rm.outlier() function serving this purpose, along with other possible choices for this task. Of course, any alteration of data away from original form should only be done after careful thought and consideration of all possible unintended downstream impacts on later use of the data.
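As a brief, hedged illustration of the trimming option mentioned above, the outliers package can remove the single most extreme value from a numeric vector; such steps are best applied to a copy, so the original data remain untouched.
# A sketch of trimming one extreme value with outliers::rm.outlier().
install.packages("outliers", dependencies=TRUE)
library(outliers)
value_trimmed <- outliers::rm.outlier(
  RBIRTH2015OnwardMidwest.tbl$value, fill=FALSE)
base::length(RBIRTH2015OnwardMidwest.tbl$value) # Original count.
base::length(value_trimmed)                     # One fewer value.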
Fig. 5.2
par(ask=TRUE)
par(mfrow=c(2,2)) # 4 figures into a 2 row by 2 column grid
graphics::boxplot(data=RBIRTH2015OnwardMidwest.tbl,
value ~ variable, main="Midwest", xlab="RBIRTH",
ylab="Rate per 1,000 Persons", ylim=c(0, 35),
col="cornsilk")
#
graphics::boxplot(data=RBIRTH2015OnwardNortheast.tbl,
value ~ variable, main="Northeast", xlab="RBIRTH",
ylab="Rate per 1,000 Persons", ylim=c(0, 35),
col="cornsilk")
#
graphics::boxplot(data=RBIRTH2015OnwardSouth.tbl,
value ~ variable, main="South", xlab="RBIRTH",
ylab="Rate per 1,000 Persons", ylim=c(0, 35),
col="cornsilk")
#
graphics::boxplot(data=RBIRTH2015OnwardWest.tbl,
value ~ variable, main="West", xlab="RBIRTH",
ylab="Rate per 1,000 Persons", ylim=c(0, 35),
col="cornsilk")
# Fig. 5.1
par(ask=TRUE)
par(mfrow=c(2,2)) # 4 figures into a 2 row by 2 column grid
graphics::boxplot(data=RDEATH2015OnwardMidwest.tbl,
value ~ variable, main="Midwest", xlab="RDEATH",
ylab="Rate per 1,000 Persons", ylim=c(0, 30),
col="cornsilk")
#
graphics::boxplot(data=RDEATH2015OnwardNortheast.tbl,
value ~ variable, main="Northeast", xlab="RDEATH",
ylab="Rate per 1,000 Persons", ylim=c(0, 30),
col="cornsilk")
#
graphics::boxplot(data=RDEATH2015OnwardSouth.tbl,
value ~ variable, main="South", xlab="RDEATH",
ylab="Rate per 1,000 Persons", ylim=c(0, 30),
col="cornsilk")
#
graphics::boxplot(data=RDEATH2015OnwardWest.tbl,
value ~ variable, main="West", xlab="RDEATH",
ylab="Rate per 1,000 Persons", ylim=c(0, 30),
col="cornsilk")
# Fig. 5.2
Challenge: Much more could have been prepared using Base R functions, but for now these many boxplots should give a first glance at outcomes. As a challenge, replicate these breakout tibbles and breakout boxplots by year instead of by region. Do the trends demonstrated in the figures give a hint of future outcomes?
13 A geom, when using the ggplot2::ggplot() function, should be viewed as a geometric object.
When viewing these figures, the graphical images of the data distribution provide an early, visual sense of the descriptive statistics to come. It was decided to use geom_boxplot() for the figures, but other geoms could have just as easily been used (Figs. 5.3, 5.4, 5.5, and 5.6).
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHMidwest2015Onward.tbl,
aes(x=year, y=value)) +
geom_boxplot() +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Midwest") +
labs(x="\nYear", y="Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.3
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHNortheast2015Onward.tbl,
aes(x=year, y=value)) +
geom_boxplot() +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Northeast") +
labs(x="\nYear", y="Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.4
Fig. 5.3
Fig. 5.4
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHSouth2015Onward.tbl,
aes(x=year, y=value)) +
geom_boxplot() +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: South") +
labs(x="\nYear", y="Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.5
Fig. 5.5
Fig. 5.6
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHWest2015Onward.tbl,
aes(x=year, y=value)) +
geom_boxplot() +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: West") +
labs(x="\nYear", y="Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.6
Challenge: The same challenge that was offered when using the
graphics::boxplot() function is repeated, but now by use of the ggplot2::ggplot()
function. Replicate these breakout boxplots by year instead of Region. Do the
trends demonstrated in the figures give a hint of future outcomes?
The ggplot2::ggplot() function, which many data scientists (if not most) consider quite feature-rich, is often the first choice for the production of graphics when using R. This is especially the case because many R packages and their functions have been developed so that they work and play well with the ggplot2::ggplot() function. Increasingly, for those who wish to produce Beautiful Graphics, it is the exceptional figure where the ggplot2::ggplot() function was not among the first considerations.
It is far too common for those with limited experience in data science to generate a few draft figures and then immediately use inferential analyses to address Null Hypotheses and provide value by interpreting outcomes. The timing of this approach is undesirable in that it skips the examination of exploratory descriptive statistics and measures of central tendency. It may be somewhat dull to examine distribution patterns of data (e.g., data in the large and data by breakouts). It may be somewhat dull to prepare statistics such as mean, standard deviation, median, minimum, maximum, etc. Even so, these are the very statistics that provide guidance on inferential test selection, discovery of trends between and among variables, and other potential outcomes that may otherwise go unnoticed. These front-end tasks simply cannot be ignored.
With the importance of exploratory descriptive statistics and measures of central
tendency established as a vital concern, the first task is to examine the distribution
pattern(s) of all measured variables of interest.14 For this addendum, consider the
object variable named value, for each of the eight breakout tibbles, looking at value
datapoints for RBIRTH by all four regions and value datapoints for RDEATH by all
four regions.15 There are many ways to initiate these exploratory analyses, but for
this lesson review the following practices:
• Density plots will be prepared to graphically examine distribution pattern(s) for
value by established breakouts.
• QQ (e.g., Quantile-Quantile) plots will be prepared to graphically examine dis-
tribution pattern(s) for value by established breakouts.
The Anderson-Darling test will also be used to empirically examine distribution
patterns, specifically normal distribution, for value by established breakouts
(Figs. 5.7, 5.8, 5.9, and 5.10).
Fig. 5.7
14 The correct use of many inferential tests, especially those tests that use parametric data, assumes that data follow a normal distribution pattern (even though this assumption is frequently violated). If data do not follow a normal distribution pattern, it may be best to use an inferential test that is more appropriately associated with nonparametric data. There are many online and print resources on this topic, and greater discussion in this introductory text is not necessary.
15 Challenge: Once again, respond to the challenge that this approach should be replicated by year and not only by region.
Fig. 5.8
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHMidwest2015Onward.tbl,
aes(x=value)) +
geom_density(lwd=2, color="red") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Midwest") +
labs(x="\nYear", y="Density of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 0.28),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.7
Fig. 5.9
Fig. 5.10
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHNortheast2015Onward.tbl,
aes(x=value)) +
geom_density(lwd=2, color="red") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Northeast") +
labs(x="\nYear", y="Density of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 0.28),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.8
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHSouth2015Onward.tbl,
aes(x=value)) +
geom_density(lwd=2, color="red") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: South") +
labs(x="\nYear", y="Density of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 0.28),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.9
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHWest2015Onward.tbl,
aes(x=value)) +
geom_density(lwd=2, color="red") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: West") +
labs(x="\nYear", y="Density of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 0.28),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.10
Using the geom_density() geom, it seems that there might be an overall trend toward normal distribution. However, it is cautioned that the term might lacks precision. The visuals gained by using geom_density() are a good start, but the Anderson-Darling test (or some other test for normal distribution, such as the Shapiro-Wilk test) is needed for the final determination of normal distribution and, from that finding, whether a parametric or nonparametric approach (or both) should be used for selecting the appropriate inferential test(s).16
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHMidwest2015Onward.tbl,
aes(sample=value)) +
geom_qq(lwd=2, color="red") +
geom_qq_line(lwd=1.25, color="blue") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Midwest") +
labs(x="\nYear", y="QQ of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Somewhat different than might be expected, note how the
# aesthetic is expressed as aes(sample=value) and not
# aes(x=value).
Challenge: Both to save space and to encourage active participation, only one of the Quantile-Quantile plots of RBIRTH and RDEATH by Region is shown. As a challenge, generate all four QQ plot figures for this section. Be sure to go beyond mere visualization by also using an Anderson-Darling test, to continue with inquiries into normal distribution (Fig. 5.11).
16 Far too many times, a graphical figure appeared to show what might be a normal distribution, but that finding was not confirmed by applying an empirical normality test, whether the Anderson-Darling test, the Shapiro-Wilk test, or some other test. A decision must then be made on which paradigm, parametric or nonparametric, to use for inferential testing.
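For completeness, the Shapiro-Wilk alternative mentioned above can be sketched as follows; note that stats::shapiro.test() accepts at most 5000 values, so for the larger regional breakout tibbles in this lesson the Anderson-Darling test is the more practical choice.
# A sketch of the Shapiro-Wilk alternative; sample size must be 3 to 5000.
stats::shapiro.test(RDEATH2015OnwardNortheast.tbl$value) # About 1,085 values: acceptable.
# stats::shapiro.test(RBIRTH2015OnwardMidwest.tbl$value) # More than 5,000 values: would error.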
Fig. 5.11
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHNortheast2015Onward.tbl,
aes(sample=value)) +
geom_qq(lwd=2, color="red") +
geom_qq_line(lwd=1.25, color="blue") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: Northeast") +
labs(x="\nYear", y="QQ of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHSouth2015Onward.tbl,
aes(sample=value)) +
geom_qq(lwd=2, color="red") +
geom_qq_line(lwd=1.25, color="blue") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: South") +
labs(x="\nYear", y="QQ of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
# Fig. 5.11
par(ask=TRUE)
ggplot2::ggplot(data=RBIRTHandRDEATHWest2015Onward.tbl,
aes(sample=value)) +
geom_qq(lwd=2, color="red") +
geom_qq_line(lwd=1.25, color="blue") +
facet_wrap(~variable) +
ggtitle("RBIRTH and RDEATH: West") +
labs(x="\nYear", y="QQ of Rate per 1,000 People\n") +
scale_y_continuous(labels=scales::comma, limits=c(0, 35),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac()
install.packages("nortest", dependencies=TRUE)
library(nortest)
nortest::ad.test(RBIRTH2015OnwardMidwest.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RDEATH2015OnwardMidwest.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RBIRTH2015OnwardNortheast.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RDEATH2015OnwardNortheast.tbl$value)
# Calculated p-value <= 0.00075
nortest::ad.test(RBIRTH2015OnwardSouth.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RDEATH2015OnwardSouth.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RBIRTH2015OnwardWest.tbl$value)
# Calculated p-value <= 0.0000000000000002
nortest::ad.test(RDEATH2015OnwardWest.tbl$value)
# Calculated p-value <= 0.000000000236
It will be interesting to see if there are any meaningful differences in outcomes between the two approaches, parametric vs. nonparametric.
Now that normal distribution patterns have been addressed, it is necessary to
provide the standard measures associated with descriptive statistics and measures of
central tendency:
• N
• Minimum
• Median (e.g., 50th percentile)
• Mean (e.g., arithmetic average)
• SD (e.g., standard deviation)
• Maximum
• Missing (e.g., expressed in R as NA)
There are many R packages and associated functions that can be used for this
task, but it was decided to keep with the tidyverse ecosystem as these metrics are
calculated for each of the four RBIRTH and RDEATH breakout tibbles currently in
use. As a brief reminder, there are no missing data for the object variable value. If there had been missing data, special accommodations such as the use of na.rm=TRUE would have been used.
RBIRTHandRDEATHMidwest2015Onward.tbl %>%
dplyr::group_by(variable, year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
)
# Descriptive statistics are generated by first using the
# dplyr::group_by() function against the object variables
# called variable and year, with the dplyr::summarize()
# function then used against a set of selected functions,
# all in an effort to make a neatly presented summary of
# descriptive statistics for each region, by variable
# breakout (e.g., RBIRTH and RDEATH) and by year (e.g.,
# 2015, 2016, 2017, 2018, and 2019). There are many ways
# the variables could have been grouped and there are many
# possible descriptive statistics that could have been
# presented. Additionally, there are no missing data for
# value so it was not necessary to use the na.rm=TRUE
# argument.
# A tibble: 10 x 9
# Groups: variable [2]
variable year N Minimum Median Mean SD Maximum
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RBIRTH 2015 1055 4.05 11.5 11.7 2.54 27.4
2 RBIRTH 2016 1055 4.53 11.4 11.7 2.49 29.4
3 RBIRTH 2017 1055 5.25 11.4 11.6 2.44 29.5
4 RBIRTH 2018 1055 4.58 11.6 11.9 2.55 29.8
5 RBIRTH 2019 1055 4.76 11.0 11.3 2.29 26.4
6 RDEATH 2015 1055 0 10.0 9.95 2.45 18.1
7 RDEATH 2016 1055 2.19 10.4 10.4 2.52 20.8
8 RDEATH 2017 1055 2.12 10.3 10.2 2.43 19.1
9 RDEATH 2018 1055 1.07 11.1 11.2 2.79 23.7
10 RDEATH 2019 1055 0 10.4 10.4 2.45 18.6
RBIRTHandRDEATHNortheast2015Onward.tbl %>%
dplyr::group_by(variable, year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
)
# A tibble: 10 x 9
# Groups: variable [2]
variable year N Minimum Median Mean SD Maximum
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RBIRTH 2015 217 4.96 9.98 10.1 1.72 18.6
2 RBIRTH 2016 217 3.53 9.90 10.0 1.76 17.3
3 RBIRTH 2017 217 4.38 9.78 9.87 1.73 16.4
4 RBIRTH 2018 217 5.80 10.1 10.2 1.80 18.3
5 RBIRTH 2019 217 2.75 9.42 9.62 1.75 17.3
6 RDEATH 2015 217 4.78 9.40 9.60 1.80 16.1
7 RDEATH 2016 217 4.85 9.77 10.0 1.86 16.4
8 RDEATH 2017 217 4.84 9.70 9.91 1.85 16.3
9 RDEATH 2018 217 5.58 10.1 10.1 1.95 15.1
10 RDEATH 2019 217 5.04 10.0 10.2 1.83 16.0
RBIRTHandRDEATHSouth2015Onward.tbl %>%
dplyr::group_by(variable, year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
)
# A tibble: 10 x 9
# Groups: variable [2]
variable year N Minimum Median Mean SD Maximum
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RBIRTH 2015 1422 0 11.7 11.8 2.60 26.4
2 RBIRTH 2016 1422 0 11.5 11.6 2.58 29.2
3 RBIRTH 2017 1422 0 11.5 11.5 2.46 27.5
4 RBIRTH 2018 1422 3.23 11.7 11.8 2.48 24.2
5 RBIRTH 2019 1422 0.796 11.0 11.0 2.23 28.2
6 RDEATH 2015 1422 0 10.4 10.2 2.57 18.1
7 RDEATH 2016 1422 0 10.9 10.7 2.66 19.8
8 RDEATH 2017 1422 0 10.8 10.6 2.63 18.9
9 RDEATH 2018 1422 1.13 11.3 11.2 2.79 25.4
10 RDEATH 2019 1422 0 11.2 11.0 2.65 19.3
RBIRTHandRDEATHWest2015Onward.tbl %>%
dplyr::group_by(variable, year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
)
# A tibble: 10 x 9
# Groups: variable [2]
variable year N Minimum Median Mean SD Maximum
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RBIRTH 2015 448 0 11.7 12.0 3.63 29.6
2 RBIRTH 2016 448 0 11.6 11.8 3.55 28.7
3 RBIRTH 2017 448 0 11.6 11.8 3.39 28.5
4 RBIRTH 2018 448 0 11.7 12.0 3.61 30.4
5 RBIRTH 2019 448 0 10.7 10.9 3.20 28.0
6 RDEATH 2015 448 0 8.05 8.16 2.85 22.5
7 RDEATH 2016 448 0 8.50 8.48 2.96 22.7
8 RDEATH 2017 448 0 8.33 8.42 2.92 22.6
9 RDEATH 2018 448 2.07 8.89 9.06 3.10 22.9
10 RDEATH 2019 448 0 8.62 8.71 2.92 22.1
Exploratory Analyses
The many graphics and descriptive statistics (including measures of central tendency) prepared up to this point provide some understanding of general trends between and among the data. It is suspected that there may be statistically significant differences (p <= 0.05) in RBIRTH and RDEATH by year and by region, but the key term here is suspected. It is only too common for early observations to fail to reach statistical significance at the declared p-value once the data are analyzed properly, using an appropriate inferential test. Data scientists do not put their reputation and career at risk by drawing conclusions merely from a glance at descriptive statistics, measures of central tendency, or graphics, absent the appropriate inferential testing.
To address the Null Hypotheses, it is necessary to engage in the use of different
inferential tests. For this lesson a parametric approach toward the data will be used,
dependent on Oneway ANOVA (Analysis of Variance), Twoway ANOVA (Analysis
of Variance), and Pearson’s r coefficient of correlation. Other analyses could be
prepared, but they will be deferred until later lessons. As a reminder, review the Null
Hypotheses one more time:
Null Hypotheses
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Year (e.g., 2015, 2016, 2017, 2018, 2019).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Region (e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Year (e.g., 2015, 2016, 2017, 2018, 2019).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Region (e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of births (RBIRTH) by Year (e.g., 2015, 2016, 2017, 2018, 2019) and by Region
(e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant difference (p <= 0.05) at the national level in the rate per 1000 persons
of deaths (RDEATH) by Year (e.g., 2015, 2016, 2017, 2018, 2019) and by Region
(e.g., Midwest, Northeast, South, West).
• Using county-wide data gained from the Census Bureau, there is no statistically
significant association (p <= 0.05) at the national level in the rate per 1000 per-
sons of births (RBIRTH) and the rate per 1000 persons of deaths (RDEATH).
The data will encompass multiple years (e.g., 2015, 2016, 2017, 2018, 2019) for
all four national regions (e.g., Midwest, Northeast, South, West).
This lesson was structured so that there were four regional datasets, each gaining
data from 2015 to 2019. The task now is to join the four regional datasets into one
unified national dataset. There are many ways to complete this action, but because
of the common format for all four regional datasets, the dplyr::bind_rows() function
will be used in this lesson.
RBIRTHandRDEATHNational2015Onward.tbl <-
dplyr::bind_rows(
RBIRTHandRDEATHMidwest2015Onward.tbl,
RBIRTHandRDEATHNortheast2015Onward.tbl,
RBIRTHandRDEATHSouth2015Onward.tbl,
RBIRTHandRDEATHWest2015Onward.tbl)
# Merge, join, blend, bind, etc. rows of
# the four breakout regional datasets
# into one common national dataset.
base::getwd()
base::ls()
base::attach(RBIRTHandRDEATHNational2015Onward.tbl)
utils::str(RBIRTHandRDEATHNational2015Onward.tbl)
dplyr::glimpse(RBIRTHandRDEATHNational2015Onward.tbl)
base::summary(RBIRTHandRDEATHNational2015Onward.tbl)
base::print(RBIRTHandRDEATHNational2015Onward.tbl)
17 When observing how RBIRTHandRDEATHNational2015Onward.tbl consists of 31,420 rows, it is helpful to remember that the dataset covers data from each county in the United States over 5 years (e.g., 2015, 2016, 2017, 2018, 2019) and for two breakouts of variable (e.g., RBIRTH and RDEATH): 31,420 rows of county data divided by 5 years equals 6284, which is then divided by the two variable breakouts, resulting in 3142. Then go back to the earlier part of this lesson where it was stated that "There are more than 3100 counties in the United States," or 3142 to be more accurate, excluding county-equivalents in territories under United States jurisdiction. It is best to occasionally check, as a quality assurance measure, these mundane, but important, issues to be sure that data are correct.
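The arithmetic described in this footnote can be confirmed directly in R with a quick sketch against the unified national tibble:
# A quick quality assurance check of the row count described above.
base::nrow(RBIRTHandRDEATHNational2015Onward.tbl)              # 31,420 rows.
base::nrow(RBIRTHandRDEATHNational2015Onward.tbl) / (5 * 2)    # 3,142 counties.
dplyr::n_distinct(RBIRTHandRDEATHNational2015Onward.tbl$GEOID) # Should also be 3,142.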
RBIRTHandRDEATHNational2015Onward.tbl %>%
dplyr::group_by(year, variable, Region) %>%
# There are now three groups addressed by
# the dplyr::group_by() function. Be sure
# to compare the ordering within use of
# the dplyr::group_by() function and the
# way descriptive statistics are ordered
# in the tibble-based output.
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>%
print(n=40)
# The resulting tibble should be 40 rows:
# 5 breakouts for year * 2 breakouts for
# variable * 4 breakouts for Region = 40.
# A tibble: 40 x 10
# Groups: year, variable [10]
year variable Region N Minimum Median Mean SD
<chr> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 2015 RBIRTH Midwest 1055 4.05 11.5 11.7 2.54
2 2015 RBIRTH Northeast 217 4.96 9.98 10.1 1.72
3 2015 RBIRTH South 1422 0 11.7 11.8 2.60
4 2015 RBIRTH West 448 0 11.7 12.0 3.63
5 2015 RDEATH Midwest 1055 0 10.0 9.95 2.45
6 2015 RDEATH Northeast 217 4.78 9.40 9.60 1.80
7 2015 RDEATH South 1422 0 10.4 10.2 2.57
8 2015 RDEATH West 448 0 8.05 8.16 2.85
RBIRTHNational.tbl <-
RBIRTHandRDEATHNational2015Onward.tbl %>%
dplyr::filter(variable %in% c("RBIRTH"))
# Retain the RBIRTH rows.
# Although it is perhaps not totally necessary,
# it is often convenient to sequester data into a
# separate and individualized dataset, which in
# this example is a national dataset consisting
# of data for RBIRTH rows.
RBIRTHNational.tbl$year <-
as.factor(RBIRTHNational.tbl$year)
RBIRTHNational.tbl$GEOID <-
as.numeric(RBIRTHNational.tbl$GEOID)
RBIRTHNational.tbl$Region <-
as.factor(RBIRTHNational.tbl$Region)
RBIRTHNational.tbl$value <-
as.numeric(RBIRTHNational.tbl$value)
# Put the year, GEOID, Region, and value data into desired
# format, anticipating future needs and function requirements.
base::attach(RBIRTHNational.tbl)
utils::str(RBIRTHNational.tbl)
base::unique(RBIRTHNational.tbl$variable)
# Confirm that only RBIRTH shows in the
# enumerated dataset.
RBIRTHNational.tbl %>%
dplyr::group_by(year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>% # Continue by using %>%
# Up to this point of the chained syntax, using the %>% pipe
# operator, descriptive statistics are generated by first
# using the dplyr::group_by() function against the object
# variable year, with the dplyr::summarize() function then
# used against a set of selected statistically oriented
# functions, all in an effort to make a neatly presented
# summary of descriptive statistics. There are no missing
# data for the object variable called value so it was not
# necessary to use the argument na.rm=TRUE. There is a
# potential problem, however, in that the resulting tibble,
# by default, prints numbers with 3 significant digits.
# (Look at the tibble package and give attention to the
# pillar.sigfig option for more detail on this subject.) Do
# not be confused by what shows on screen v actual values.
# From among many ways to see more detail on the actual
# count of numbers to the right of the decimal point, look
# below at how the tibble was put into format as a data
# frame and then how the tidyverse ecosystem was used to
# offer greater precision in the final printout, showing a
# greater number of significant digits.
as.data.frame(.) %>% # Continue by using %>%
dplyr::mutate_if(is.numeric, round, 5)
Although there are many ways to address Analysis of Variance (both Oneway
ANOVA and Twoway ANOVA) using Base R, functions from the agricolae package
are demonstrated. The output is not only accurate, but it is also quite easy to inter-
pret. It is not always easy to interpret ANOVA output when using Base R functions,
but the problem of interpretation is largely mitigated by using functions from the
agricolae package.
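For comparison, a minimal sketch of the Base R route to the same Oneway ANOVA and post hoc comparisons is shown below; the agricolae-based syntax used in this lesson follows immediately afterward.
# A sketch of the Base R equivalents: Oneway ANOVA and Tukey's HSD.
summary(stats::aov(value ~ year, data=RBIRTHNational.tbl))
stats::TukeyHSD(stats::aov(value ~ year, data=RBIRTHNational.tbl))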
install.packages("agricolae", dependencies=TRUE)
library(agricolae)
agricolae::HSD.test(
stats::aov(value ~ year,
data=RBIRTHNational.tbl), # Model
trt="year", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Birth Rate per 1,000 Persons (RBIRTH) by Year
at the National Level Using Tukey's HSD (Honestly
Significant Difference) ParametricOneway ANOVA:
2015 to 2019 ")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
year, means
value groups
2018 11.7427 a
2015 11.6684 ab
2016 11.5388 bc
2017 11.4697 c
2019 10.9884 d
Fig. 5.12
Based on the output gained by using the agricolae::HSD.test() function, where rates per 1000 persons (value) for RBIRTH were examined by year, the group consolidations are:
• Group a: Rates per 1000 persons (value) for RBIRTH in 2018 (Mean = 11.7427) share commonality with rates per 1000 persons (value) for RBIRTH in 2015 (Mean = 11.6684).
• Group ab: Rates per 1000 persons (value) for RBIRTH in 2015 (Mean = 11.6684) share commonality with rates per 1000 persons (value) for RBIRTH in 2016 (Mean = 11.5388) and with the prior finding for 2018.
• Group bc: Rates per 1000 persons (value) for RBIRTH in 2016 (Mean = 11.5388) share commonality with rates per 1000 persons (value) for RBIRTH in 2017 (Mean = 11.4697) and with the prior finding for 2015.
• Group c: Rates per 1000 persons (value) for RBIRTH in 2017 (Mean = 11.4697) share commonality with rates per 1000 persons (value) for RBIRTH in 2016 (Mean = 11.5388).
• Group d: Rates per 1000 persons (value) for RBIRTH in 2019 (Mean = 10.9884) are totally unique and share no commonality with rates per 1000 persons (value) for RBIRTH in other years.
Use of the agricolae::HSD.test() function is especially helpful in that the output provides clearly organized and easy-to-understand summaries of descriptive statistics and group membership(s). Using a parametric perspective, notice how it is possible to see the overlap in mean(s) for different years and the unique mean(s) for individual years.
A summary graphic is also needed to reinforce RBIRTH differences by year, with the tidyverse ecosystem used once more. In this example, the onewaytests::gplot() function will be used to generate the figure, but give attention to the way this function has been developed in such a way that it works and plays well with the ggplot2::ggplot() function and the broader tidyverse ecosystem.
install.packages("onewaytests", dependencies=TRUE)
library(onewaytests)
RBIRTHNationalYEAR.fig <-
onewaytests::gplot(value ~ year, data=RBIRTHNational.tbl,
type = "errorbar", option = "se") +
geom_point(color="red", size=3) +
labs(
title =
"Birth Rate per 1,000 Persons (RBIRTH) by Year at the
National Level: 2015 to 2019",
subtitle = "Data: Census Bureau Population Estimates",
x = "\nYear",
y = "Birth Rate per 1,000 Persons\n") +
scale_y_continuous(labels=scales::comma, limits=c(10.80,
11.80), breaks=scales::pretty_breaks(n=7)) +
annotate("text", x=1.10, y=11.40, label="Year    Mean      SD",
fontface="bold", size=3, color="black", hjust=0,
family="mono") +
annotate("text", x=1.10, y=11.35, label="======================",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.30, label="2015 11.6684 2.73544",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.25, label="2016 11.5388 2.69872",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.20, label="2017 11.4697 2.60429",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.15, label="2018 11.7427 2.68837",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
annotate("text", x=1.10, y=11.10, label="2019 10.9884 2.41780",
fontface="bold", size=3, color="black", hjust=0, family="mono") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X-axis text and/
# or Y-axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 5.12
par(ask=TRUE); RBIRTHNationalYEAR.fig
18 The onewaytests package has functions that can be used for Oneway ANOVA. However, Oneway ANOVA output from the onewaytests package is overly verbose, and it is not as easy to interpret as Oneway ANOVA output from the agricolae package, thus the preference for the agricolae package for statistical output.
Fig. 5.13
As a final quality assurance check, compare the means (the large red dots) in the figure to what was seen in the Oneway ANOVA output gained by using agricolae function(s) and argument(s). Do the means in the figure generated by using the onewaytests::gplot() function correspond to the prior Oneway ANOVA output generated by using the agricolae::HSD.test() function?
For this analysis and the remaining Oneway ANOVA analyses, the comments will be kept to a minimum. Follow along with the prior presentation of how functions from the agricolae package and the onewaytests package are used.
Oneway ANOVA of RBIRTH by Region at the National Level (Fig. 5.13)
RBIRTHNational.tbl %>%
dplyr::group_by(Region) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>% # Continue by using %>%
as.data.frame(.) %>% # Continue by using %>%
dplyr::mutate_if(is.numeric, round, 5)
agricolae::HSD.test(
stats::aov(value ~ Region,
data=RBIRTHNational.tbl), # Model
trt="Region", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Birth Rate per 1,000 Persons (RBIRTH) by Region
at the National Level Using Tukey's HSD (Honestly
Significant Difference) Parametric Oneway ANOVA:
2015 to 2019 ")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
Region, means
# Oneway ANOVA of RDEATH by Year at the National Level
RDEATHNational.tbl <-
RBIRTHandRDEATHNational2015Onward.tbl %>%
dplyr::filter(variable %in% c("RDEATH"))
# Retain the RDEATH rows.
RDEATHNational.tbl$year <-as.factor(RDEATHNational.tbl$year)
RDEATHNational.tbl$GEOID<-as.numeric(RDEATHNational.tbl$GEOID)
RDEATHNational.tbl$Region <-as.factor(RDEATHNational.tbl$Region)
RDEATHNational.tbl$value <-as.numeric(RDEATHNational.tbl$value)
# Put the year, GEOID, Region, and value data into desired
# format, anticipating future needs and function requirements.
base::attach(RDEATHNational.tbl)
utils::str(RDEATHNational.tbl)
base::unique(RDEATHNational.tbl$variable)
# Confirm that only RDEATH shows in the
# enumerated dataset.
RDEATHNational.tbl %>%
dplyr::group_by(year) %>%
dplyr::summarize(
N = base::length(value),
Minimum = base::min(value),
Median = stats::median(value),
Mean = base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>% # Continue by using %>%
as.data.frame(.) %>% # Continue by using %>%
dplyr::mutate_if(is.numeric, round, 5)
agricolae::HSD.test(
stats::aov(value ~ year,
data=RDEATHNational.tbl), # Model
trt="year", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Death Rate per 1,000 Persons (RDEATH) by Year
at the National Level Using Tukey's HSD (Honestly
Significant Difference) Parametric Oneway ANOVA:
2015 to 2019 ")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
year, means
value std r Min Max
2015 9.80019 2.62162 3142 0.00000 22.4844
2016 10.22399 2.71406 3142 0.00000 22.6941
2017 10.13915 2.66652 3142 0.00000 22.5912
2018 10.80759 2.88947 3142 1.07124 25.3978
2019 10.40725 2.68092 3142 0.00000 22.1019
value groups
2018 10.80759 a
2019 10.40725 b
2016 10.22399 bc
2017 10.13915 c
2015 9.80019 d
RDEATHNationalYear.fig <-
onewaytests::gplot(value ~ year, data= RDEATHNational.tbl,
type = "errorbar", option = "se") +
geom_point(color="red", size=3) +
labs(
title =
"Death Rate per 1,000 Persons (RDEATH) by Year at the
National Level: 2015 to 2019",
subtitle = "Data: Census Bureau Population Estimates",
x = "\nYear",
y = "Death Rate per 1,000 Persons\n") +
scale_y_continuous(labels=scales::comma, limits=c(9.5,
11.00), breaks=scales::pretty_breaks(n=7)) +
# As a challenge, add an annotated table using the
# prior example.
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X-axis text and/
# or Y-axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 5.13
par(ask=TRUE); RDEATHNationalYear.fig
RDEATHNational.tbl %>%
dplyr::group_by(Region) %>%
dplyr::summarize(
N = base::length(value),
Minimum= base::min(value),
Median= stats::median(value),
Mean= base::mean(value),
SD = stats::sd(value),
Maximum = base::max(value),
Missing = base::sum(is.na(value))
) %>% # Continue by using %>%
as.
data
.frame(.) %>% # Continue by using %>%
dplyr::mutate_if(is.numeric, round, 5)
agricolae::HSD.test(
stats::aov(value ~ Region,
data=RDEATHNational.tbl), # Model
trt="Region", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Death Rate per 1,000 Persons (RDEATH) by Year
at the National Level Using Tukey's HSD (Honestly
Significant Difference) ParametricOneway ANOVA:
2015 to 2019 ")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
Region, means
Give special attention to how the Oneway ANOVA output for RDEATH by
Region is quite unique and not seen all that often. The mean for each regional break-
out (e.g., Midwest, Northeast, South, and West) is unique and there is no commonal-
ity (e.g., overlap for groups a, b, c, and d) in terms of group membership for any of
the four groups (Fig. 5.14).
Fig. 5.14
RDEATHNationalRegion.fig <-
onewaytests::gplot(value ~ Region, data= RDEATHNational.tbl,
type = "errorbar", option = "se") +
geom_point(color="red", size=3) +
labs(
title =
"Death Rate per 1,000 Persons (RDEATH) by Region at the
National Level: 2015 to 2019",
subtitle = "Data: Census Bureau Population Estimates",
x = "\nRegion",
y = "Death Rate per 1,000 Persons\n") +
scale_y_continuous(labels=scales::comma, limits=c(08.50,
11.00), breaks=scales::pretty_breaks(n=7)) +
# As a challenge, add an annotated table using the
# prior example.
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X-axis text and/
# or Y-axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 5.14
par(ask=TRUE); RDEATHNationalRegion.fig
Twoway ANOVA is used next to examine RBIRTH measured value datapoints by year (e.g., 2015, 2016, 2017, 2018, 2019)
and by Region (Midwest, Northeast, South, West):
RBIRTHTwowayYR.aov <-
stats::aov(value ~ year * Region,
data=RBIRTHNational.tbl)
# Twoway ANOVA for Y (year), and
# R (Region) --TwowayYR
base::attach(RBIRTHTwowayYR.aov)
base::class(RBIRTHTwowayYR.aov)
base::print(RBIRTHTwowayYR.aov)
Call:
stats::aov(formula = value ~ year * Region, data=
RBIRTHNational.tbl)
Terms:
year Region year:Region Residuals
Sum of Squares 1098.8 2755.5 96.0 105893.5
Deg. of Freedom 4 3 12 15690
base::summary(RBIRTHTwowayYR.aov)
# Wrap the base::summary around
# TwowayYR.aov, the enumerated
# object.
The overall outcome for Twoway ANOVA Using Base R at the top level of
understanding is easy to interpret:
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RBIRTH by year. The calculated (p <= 0.0000000000000002) is certainly less
than (p <= 0.05), confirming significant difference by year. Also note the use of
three asterisks, a common notation used to indicate statistically significant differ-
ence. Go back to the Oneway ANOVA findings to see exactly which years are in
common and which years are unique.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RBIRTH by Region. The calculated (p <= 0.0000000000000002) is certainly
less than (p <= 0.05), confirming significant difference by Region. Also note the
use of three asterisks, a common notation indicating statistically significant difference. Go back to the Oneway ANOVA findings to see exactly which Regions are in common and which Regions are unique.
Twoway ANOVA Using the Rfit Package: RBIRTH by year and by Region
It is always prudent to confirm findings, to look for consistency. There will be
times when individual statistics may be slightly different when different functions
are used due to selected algorithms, methods, rounding, etc. When there are differ-
ences between comparative analyses, they should be minimal and certainly summa-
tive outcomes should be consistent.
Use the Rfit package to confirm what was gained using
base::summary(RBIRTHTwowayYR.aov). Expect slight differences in calculated
statistics, but the outcomes should clearly be in parity.
install.packages("Rfit", dependencies=TRUE)
library(Rfit)
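The Rfit call that generates the table below is not reprinted at this point in the lesson; the comparison named a few sentences later indicates that it takes the following form:
Rfit::raov(value ~ year + Region,
data=RBIRTHNational.tbl)
# Rank-based (robust) Twoway ANOVA; compare the p-values in the
# table below to those from base::summary(RBIRTHTwowayYR.aov).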
Robust ANOVA Table
DF RD Mean RD F p-value
year 4 149.5876 37.39689 34.17400 0.00000
Region 3 635.9539 211.98465 193.71565 0.00000
year:Region 12 19.7175 1.64313 1.50152 0.11528
Going back to the comment that multiple approaches help with confirmation,
notice how use of the Rfit::raov() function confirmed Twoway ANOVA results
gained from using Base R, but with slightly different output. The calculated p-values
confirm that there are statistically significant (p <= 0.05) differences for year and
Region. Equally, the calculated p-value for interaction by year and by Region (e.g.,
year:Region) was in parity with what was seen using Base R, but output for the
calculated p-value was not exactly the same for both functions (e.g.,
base::summary(RBIRTHTwowayYR.aov) v Rfit::raov(value ~ year + Region, data=RBIRTHNational.tbl)).
install.packages("WRS2", dependencies=TRUE)
library(WRS2)
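For completeness, the call whose output appears immediately below (and is echoed in its Call: line) is:
WRS2::t2way(formula = value ~ year * Region,
data=RBIRTHNational.tbl, tr = 0.1)
# Robust Twoway ANOVA on 10 percent trimmed means.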
Call:
WRS2::t2way(formula = value ~ year * Region, data =
RBIRTHNational.tbl,
tr = 0.1)
value p.value
year 148.7459 0.001
Region 1038.5548 0.001
year:Region 15.7937 0.205
Once again, but now using the WRS2::t2way() function, the outcomes are con-
sistent even if the calculated statistics are slightly different. Regarding birth rates
per 1000 persons (RBIRTH, only):
• The p-value for year provides evidence that there are statistically significant
(p <= 0.05) differences by year.
• The p-value for Region provides evidence that there are statistically significant
(p <= 0.05) differences by Region.
• The p-value for year:Region provides evidence that there is no statistically sig-
nificant (p <= 0.05) interaction between year and Region.
Finally, an interaction plot is the best way to end this part of the lesson, where
focus will be on RBIRTH by year and by Region. Give special attention to any
overlaps that may occur but equally those cases where there is simply no overlap.
From a few different choices, the CGPfunctions::Plot2WayANOVA() function will
be used to generate the interaction plot. This function was purposely selected not
only because the figure is informative, but there is accompanying output about the
Twoway ANOVA that can be used to serve as a quality assurance check once again
for consistency and depth of understanding mean comparisons of RBIRTH between
and among the relevant object variables.
Be sure to note that the CGPfunctions::Plot2WayANOVA() function is compat-
ible with the ggplot2::ggplot() function. After a few trial and error attempts, it was
decided to revise the theme_Mac() function to accommodate how legend titles are
suppressed when using the CGPfunctions::Plot2WayANOVA() function, which is
compatible with the ggplot2::ggplot() function, but in a slightly different way -- thus
the need to alter theme_Mac() (Fig. 5.15).
Fig. 5.15
###############################################################
theme_MacNoLegendTitle <-function(base_size=12,
base_family ="sans"){
theme_stata() +
theme( # Embellishments to theme_stata()
plot.title=element_text(face="bold", size=14, hjust=0.5),
plot.subtitle=element_text(face="bold", size=12,
hjust=0.5),
plot.caption=element_text(face="bold", size=10, hjust=0.5),
axis.title.x=element_text(face="bold", size=14, hjust=0.5),
axis.text.x=element_text(face="bold", size=12, hjust=0.5),
axis.title.y=element_text(face="bold", size=14, hjust=0.5,
vjust=1, angle=90),
axis.text.y=element_text(face="bold", size=12, hjust=0.5),
legend.title=element_blank(),
# Suppress the legend title.
# legend.title=element_text(face="bold", size=12),
legend.text=element_text(face="bold", size=12),
axis.ticks.x=element_line(size=1.2),
axis.ticks.y=element_line(size=1.2),
axis.ticks.length=unit(0.25,"cm"),
panel.background=element_rect(fill="whitesmoke")
)
}
# hjust - horizontal justification; 0 = left edge to 1 = right
# edge, with 0.5 the default
# vjust - vertical justification; 0 = bottom edge to 1 = top
# edge, with 0.5 the default
# angle - rotation; generally 1 to 90 degrees, with 0 the
# default
base::class(theme_MacNoLegendTitle)
# Confirm that the user-created object
# theme_MacNoLegendTitle() is a function.
###############################################################
install.packages("CGPfunctions", dependencies=TRUE)
library(CGPfunctions)
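The exact Plot2WayANOVA() call used to produce Fig. 5.15 is not reproduced on this page. As a minimal sketch, assuming the default interaction-plot settings are acceptable (the object name below is hypothetical), the function needs only a formula and a data frame:
# Minimal sketch of an interaction plot for RBIRTH by year and Region.
# Additional arguments (confidence level, plot type, theme) exist; see
# ?CGPfunctions::Plot2WayANOVA before relying on them.
RBIRTHTwowayYR.int <-
CGPfunctions::Plot2WayANOVA(
formula = value ~ year * Region,
dataframe = RBIRTHNational.tbl)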
RDEATHTwowayYR.aov <-
stats::aov(value ~ year * Region,
data=RDEATHNational.tbl)
# Twoway ANOVA for Y (year), and
# R (Region) --TwowayYR
base::attach(RDEATHTwowayYR.aov)
base::class(RDEATHTwowayYR.aov)
base::print(RDEATHTwowayYR.aov)
Call:
stats::aov(formula = value ~ year * Region, data =
RDEATHNational.tbl)
Terms:
year Region year:Region Residuals
Sum of Squares 1720.7 8335.2 192.7 107330.2
Deg. of Freedom 4 3 12 15690
base::summary(RDEATHTwowayYR.aov)
# Wrap the base::summary around
# TwowayYR.aov, the enumerated
# object.
The overall outcome for Twoway ANOVA Using Base R at the top level of
understanding is easy to interpret:
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RDEATH by year. The calculated (p <= 0.0000000000000002) is certainly less
than (p <= 0.05), confirming significant difference by year. Also note the use of
three asterisks, a common notation indicating statistically significant difference.
Go back to the Oneway ANOVA findings to see exactly which years are in com-
mon and which years are unique.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RDEATH by Region. The calculated (p <= 0.0000000000000002) is certainly
less than (p <= 0.05), confirming significant difference by Region. Also note the
use of three asterisks, a common notation indicating statistically significant dif-
ference. Go back to the Oneway ANOVA findings to see exactly which Regions
are in common and which years are unique.
Twoway ANOVA Using the Rfit Package: RDEATH by year and by Region
It is always prudent to confirm findings, to look for consistency. There will be
times when individual statistics may be slightly different when different functions
are used due to selected algorithms, methods, rounding, etc. When there are differ-
ences, they should be minimal and, certainly, outcomes should be consistent.
Use the Rfit package to confirm what was gained using
base::summary(RDEATHTwowayYR.aov). Expect slight differences in calculated
statistics, but the outcomes should clearly be in parity.
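As with RBIRTH, the Rfit call itself is not reprinted here; consistent with the comparison named below, it takes the following form:
Rfit::raov(value ~ year + Region,
data=RDEATHNational.tbl)
# Rank-based (robust) Twoway ANOVA for RDEATH; compare the p-values
# in the table below to those from base::summary(RDEATHTwowayYR.aov).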
Robust ANOVA Table
DF RD Mean RD F p-value
year 4 149.9147 37.47868 29.34608 0.00000
Region 3 1705.3449 568.44832 445.09915 0.00000
year:Region 12 32.4758 2.70632 2.11907 0.01297
Going back to the comment that multiple approaches help with confirmation,
notice how use of the Rfit::raov() function confirmed Twoway ANOVA results
gained from using Base R, but with slightly different output. The calculated p-values
confirm that there are statistically significant (p <= 0.05) differences for year and
Region. Equally, the calculated p-value for interaction by year and by Region (e.g.,
year:Region) was in parity with what was seen using Base R, but output for the
calculated p-value was not exactly the same for both functions (e.g.,
base::summary(RDEATHTwowayYR.aov) v Rfit::raov(value ~ year + Region,
data=RDEATHNational.tbl)). See the prior comment about the way differences in calculated statistics can occur when different functions are used.
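For completeness, the call whose output follows (and is echoed in its Call: line) is:
WRS2::t2way(formula = value ~ year * Region,
data=RDEATHNational.tbl, tr = 0.1)
# Robust Twoway ANOVA on 10 percent trimmed means, now for RDEATH.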
Call:
WRS2::t2way(formula = value ~ year * Region, data =
RDEATHNational.tbl,
tr = 0.1)
value p.value
year 125.4927 0.001
Region 1220.4110 0.001
year:Region 25.7543 0.013
Once again, now using the WRS2::t2way() function, the outcomes are consistent
even if the calculated statistics are slightly different. Regarding death rates per 1000
persons (RDEATH, only):
• The p-value for year provides evidence that there are statistically significant
(p <= 0.05) differences by year.
• The p-value for Region provides evidence that there are statistically significant
(p <= 0.05) differences by Region.
• The p-value for year:Region provides evidence that there is a statistically signifi-
cant (p <= 0.05) interaction between year and Region.
Finally, an interaction plot is the best way to end this part of the lesson, where
focus will be on RDEATH by year and by Region. Give special attention to any
overlaps that may occur but equally those cases where there is simply no overlap.
Again, the CGPfunctions::Plot2WayANOVA() function will be used to generate the
interaction plot (Fig. 5.16).20
19
The inquiry into possible interaction for year:Region is especially interesting in this example, for
RDEATH datapoints of the object value. The interaction is statistically significant at (p <= 0.05),
but can the same be said for (p <= 0.01)? Why are there only two asterisks for this p-value when
statistical significance was symbolized previously with three asterisks? It is unacceptable to say
something such as: There is a statistically significant difference between X and Y in terms of Z. It
is instead necessary to say something such as: There is a statistically significant difference (p <=
0.05) between X and Y in terms of Z. The inclusion of (p <= 0.05, or any other declared p-value) in
this statement is not merely a fine point, but it is, instead, essential.
20
This text is focused on the use of R in data science at an introductory level. It would be far
beyond the purpose of this text to discuss in great detail the complexity of interactions between and
Fig. 5.16
among variables, how to use R to go beyond top-level statistical analysis of interactions, and inter-
pretations of the same. There are many R-based resources for those who desire more finite detail
on this topic, but as a starting point, consider a review of the many vignettes associated with the
interactions package, especially discussion of the Johnson-Neyman (J-N) interval and plots of this
statistic.
It would be interesting to see how the RBIRTH rates per 1000 persons and RDEATH
rates per 1000 persons compare to each other. Can the tidyverse ecosystem be used
to gain estimates of correlation between the two?
WRBIRTHandRDEATHNational2015Onward.tbl <-
RBIRTHandRDEATHNational2015Onward.tbl %>%
tidyr::pivot_wider(names_from = variable,
values_from = value)
# Put the data into WIDE format, to allow for
# simple comparisons between RBIRTH and RDEATH.
#
# Although the tidyverse ecosystem is often
# used to put wide data into long format, there
# are times when the opposite is an appropriate
# approach to data organization, the need to
# put long data into wide format.
base::getwd()
base::ls()
base::attach(WRBIRTHandRDEATHNational2015Onward.tbl)
utils::str(WRBIRTHandRDEATHNational2015Onward.tbl)
dplyr::glimpse(WRBIRTHandRDEATHNational2015Onward.tbl)
base::summary(WRBIRTHandRDEATHNational2015Onward.tbl)
base::print(WRBIRTHandRDEATHNational2015Onward.tbl)
With the wide data in final form, it is now a simple task to estimate the associa-
tion between RBIRTH and RDEATH (Fig. 5.17)21:
Fig. 5.17
21
The term often used with correlation and association is estimate. Pearson’s r correlation coeffi-
cient (parametric) is considered an estimate of association and the same concept applies to
Spearman’s rho correlation coefficient (nonparametric).
stats::cor.test(
WRBIRTHandRDEATHNational2015Onward.tbl$RBIRTH,
WRBIRTHandRDEATHNational2015Onward.tbl$RDEATH,
method = c("pearson"))
# It is viewed that the RBIRTH and RDEATH data
# are parametric and it is appropriate to use
# Pearson's r to estimate the association.
Pearson's product-moment correlation
data: WRBIRTHandRDEATHNational2015Onward.tbl$RBIRTH and
WRBIRTHandRDEATHNational2015Onward.tbl$RDEATH
t = -24.65, df = 15708, p-value<0.0000000000000002
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.207988 -0.177878
sample estimates:
cor
-0.192978
PearsonRBIRTHandRDEATHNational2015Onward.fig <-
ggplot2::ggplot(WRBIRTHandRDEATHNational2015Onward.tbl,
aes(x=RBIRTH, y=RDEATH)) +
geom_point() +
geom_smooth(method=lm, se=FALSE, lwd=3, color="red") +
labs(
title =
"Pearson's r Estimate of AssociationBetween RBIRTH
and RDEATH, from 2015 to 2019: National",
subtitle = "Data: Census Bureau Population Estimates",
x = "\nBirth Rate per 1,000 Persons (RBIRTH)\n",
y = "\nDeath Rate per 1,000 Persons (RDEATH)\n") +
annotate("text", x=20, y=25, fontface="bold", size=05,
color="black", hjust=0, family="mono",
label="Pearson's r = -0.192978") +
scale_x_continuous(labels=scales::comma, limits=c(0, 32)) +
scale_y_continuous(labels=scales::comma, limits=c(0, 27)) +
theme_Mac()
# Fig. 5.17
par(ask=TRUE); PearsonRBIRTHandRDEATHNational2015Onward.fig
Fig. 5.18
Fig. 5.19
RBIRTHNat2019.map <-
RBIRTHNational.tbl %>%
dplyr::filter(year %in% c("2019")) %>%
dplyr::select(-c('NAME', 'variable', 'Region'))
# The map will be restricted to RBIRTH datapoints
# for value in 2019.
base::attach(RBIRTHNat2019.map)
utils::str(RBIRTHNat2019.map)
base::names(RBIRTHNat2019.map)[names(RBIRTHNat2019.map) ==
'GEOID'] <- 'region'
# Use the base::names() function to rename the object
# variable GEOID to region. Give attention to the use
# of lowercase in this instance, region and NOT Region,
# to meet variable name requirements when using the map-
# based choroplethr package.
base::attach(RBIRTHNat2019.map)
utils::str(RBIRTHNat2019.map)
tibble[3,142 × 3] (S3: tbl_df/tbl/data.frame)
$ year : Factor w/ 5 levels "2015","2016",..: 5 5 5 5 5 5
$ region: num [1:3142] 17001 17009 17003 17005 17007 ...
$ value : num [1:3142] 11.84 8.52 9.15 8.83 10.54 ...
choroplethr::county_choropleth(RBIRTHNat2019.map,
title = "Birth Rate per 1,000 Persons in 2019:
California",
legend = "Birth Rate",
num_colors = 7,
state_zoom = c("california")) +
theme(plot.title = element_text(hjust = 0.5))
# DRAFT map, with 7 colors and no embellishments.
choroplethr::county_choropleth(RBIRTHNat2019.map,
title = "Birth Rate per 1,000 Persons (RBIRTH) by County
in 2019: California",
legend = "Birth Rate",
num_colors = 9,
state_zoom = c("california")) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="left")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is left justified,
# the title is centered, etc. Other embellishments could
# be provided, but it is judged that they are not needed.
#
# Map with 9 colors and embellishments.
# Fig. 5.18
RDEATHNat2019.map <-
RDEATHNational.tbl %>%
dplyr::filter(year %in% c("2019")) %>%
dplyr::select(-c('NAME', 'variable', 'Region'))
# The map will be restricted to RDEATH datapoints
# for value in 2019.
base::attach(RDEATHNat2019.map)
utils::str(RDEATHNat2019.map)
base::names(RDEATHNat2019.map)[names(RDEATHNat2019.map) ==
'GEOID'] <- 'region'
# Use the base::names() function to rename the object
# variable GEOID to region. Give attention to the use
# of lowercase in this instance, region and NOT Region,
# to meet variable name requirements when using the map-
# based choroplethr package.
base::attach(RDEATHNat2019.map)
utils::str(RDEATHNat2019.map)
choroplethr::county_choropleth(RDEATHNat2019.map,
title = "Death Rate per 1,000 Persons by County in 2019:
California",
legend = "Death Rate",
num_colors = 7,
state_zoom = c("california")) +
theme(plot.title = element_text(hjust = 0.5))
# DRAFT map, with 7 colors and no embellishments.
choroplethr::county_choropleth(RDEATHNat2019.map,
title = "Death Rate per 1,000 Persons (RDEATH) by County
in 2019: California",
legend = "Death Rate",
num_colors = 9,
state_zoom = c("california")) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="left")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is left justified,
# the title is centered, etc. Other embellishments could
# be provided, but it is judged that they are not needed.
#
# Map with 9 colors and embellishments.
# Fig. 5.19
Presentation of Outcomes
The analysis of RBIRTH and RDEATH for multiple years (e.g., 2015, 2016, 2017,
2018, 2019) and by multiple regions (e.g., Midwest, Northeast, South, and West) has
been the focus of this addendum, looking at outcomes from multiple viewpoints.
Going back to the original set of Null Hypothesis statements, the following out-
comes can now be stated:
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RBIRTH by year. The calculated (p <= 0.0000000000000002) is certainly less
than (p <= 0.05), confirming significant difference by year.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RBIRTH by Region. The calculated (p <= 0.0000000000000002) is certainly
less than (p <= 0.05), confirming significant difference by Region.
• It is confirmed that there is no statistically significant (p <= 0.05) interaction for
RBIRTH by year and by Region. The calculated (p <= 0.29) is certainly greater
than (p <= 0.05), confirming that there is no significant interaction for RBIRTH
by year and by Region.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RDEATH by year. The calculated (p <= 0.0000000000000002) is certainly less
than (p <= 0.05), confirming significant difference by year.
• It is confirmed that there are statistically significant (p <= 0.05) differences in
RDEATH by Region. The calculated (p <= 0.0000000000000002) is certainly
less than (p <= 0.05), confirming significant difference by Region.
• It is confirmed that there is a statistically significant (p <= 0.05) interaction for
RDEATH by year and by Region. The calculated (p <= 0.0053) is certainly less
than (p <= 0.05), confirming that there is a significant interaction for RDEATH
by year and by Region.
• It is confirmed that Pearson’s r for RBIRTH v RDEATH was -0.192978, with
p-value <= 0.0000000000000002, bringing into question any meaningful (e.g.,
practical) interpretation of association.
As time permits, go back to the prior analyses in this addendum and the accom-
panying figures to review exact statistics gained from the many analyses, knowing
that screen output has been edited in some cases to save space since many analyses
produce long sections of output. It is especially valuable to see printouts that clearly
identify commonality in group means and differences in group means. For those
with a special interest, investigate the many health-related, economic, and other
social contributors to the outcomes, those comparisons where there are no mean
differences between comparative groups and those comparative groups where there
are mean differences.
Addendum 2: A Nonparametric Approach to Statistical Analyses and Graphical Presentations
Addendum 1 addressed federal data on the two breakouts RBIRTH and RDEATH. It
would only be redundant to recapitulate discussion about the data. As a reminder, all
analyses demonstrated in Addendum 1 assumed that the data were such that it was
appropriate to base analyses from a parametric perspective – to assume that there
were acceptable levels of normal distribution, to assume that the data were inter-
val, etc.
The focus of Addendum 2 is to use the original dataset and breakout datasets
developed in Addendum 1 and look at the data again, but now from a nonparametric
perspective. Only a limited degree of discussion about the data is needed here since much
was stated in Addendum 1.
Nonparametric – Oneway ANOVA RBIRTH
install.packages("rstatix", dependencies=TRUE)
library(rstatix)
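The Kruskal-Wallis call itself is not shown above; a call of the following form produces the output that follows:
stats::kruskal.test(value ~ year,
data=RBIRTHNational.tbl)
# Nonparametric counterpart to the earlier Oneway ANOVA of RBIRTH by
# year; compare the chi-squared, df, and p-value to the output below.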
data: value by year
Kruskal-Wallis chi-squared = 193.6, df = 4, p-value < 0.0000000000000002
RBIRTHNational.tbl %>%
rstatix::dunn_test(value ~ year)
# A tibble: 10 x 9
.y. group1 group2 n1 n2 p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <chr>
1 value 2015 2016 3142 3142 4.75e- 2 1.42e- 1 ns
2 value 2015 2017 3142 3142 4.03e- 3 1.61e- 2 *
3 value 2015 2018 3142 3142 1.63e- 1 3.26e- 1 ns
4 value 2015 2019 3142 3142 3.52e-29 3.16e-28 ****
5 value 2016 2017 3142 3142 3.71e- 1 3.71e- 1 ns
6 value 2016 2018 3142 3142 7.35e- 4 3.67e- 3 **
7 value 2016 2019 3142 3142 2.68e-20 2.14e-19 ****
8 value 2017 2018 3142 3142 1.95e- 5 1.17e- 4 ***
9 value 2017 2019 3142 3142 7.61e-17 5.33e-16 ****
10 value 2018 2019 3142 3142 1.92e-36 1.92e-35 ****
RBIRTHNational.tbl %>%
rstatix::dunn_test(value ~ Region)
# A tibble: 6 x 9
.y. group1 group2 n1 n2 p p.adj p.adj.signif
* <chr> <chr> <chr> <int><int> <dbl> <dbl> <chr>
1 value Midwest Northeast 5275 1085 5.93e-117 2.96e-116 ****
2 value Midwest South 5275 7110 9.66e- 1 1 e+ 0 ns
3 value Midwest West 5275 2240 4.32e- 1 1 e+ 0 ns
4 value Northeast South 1085 7110 5.23e-122 3.14e-121 ****
5 value Northeast West 1085 2240 1.37e- 90 5.46e- 90 ****
6 value South West 7110 2240 4.32e- 1 1 e+ 0 ns
RDEATHNational.tbl %>%
rstatix::dunn_test(value ~ year)
# A tibble: 10 x 9
.y. group1 group2 n1 n2 p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <chr>
1 value 2015 2016 3142 3142 2.89e-10 1.73e- 9 ****
2 value 2015 2017 3142 3142 3.77e- 7 1.88e- 6 ****
3 value 2015 2018 3142 3142 2.34e-43 2.34e-42 ****
4 value 2015 2019 3142 3142 2.48e-19 2.24e-18 ****
5 value 2016 2017 3142 3142 2.21e- 1 2.21e- 1 ns
6 value 2016 2018 3142 3142 6.32e-14 4.43e-13 ****
7 value 2016 2019 3142 3142 7.26e- 3 1.45e- 2 *
8 value 2017 2018 3142 3142 2.65e-18 2.12e-17 ****
9 value 2017 2019 3142 3142 9.26e- 5 2.78e- 4 ***
10 value 2018 2019 3142 3142 1.46e- 6 5.84e- 6 ****
RDEATHNational.tbl %>%
rstatix::dunn_test(value ~ Region)
# A tibble: 6 x 9
.y. group1 group2 n1 n2 p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <chr>
1 value Midwest Northeast 5275 1085 2.72e- 10 2.72e- 10 ****
2 value Midwest South 5275 7110 7.20e- 15 1.44e- 14 ****
3 value Midwest West 5275 2240 3.10e-152 1.55e-151 ****
4 value Northeast South 1085 7110 3.59e- 27 1.08e- 26 ****
5 value Northeast West 1085 2240 2.18e- 34 8.71e- 34 ****
6 value South West 7110 2240 1.47e-241 8.81e-241 ****
install.packages("rcompanion", dependencies=TRUE)
library(rcompanion)
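The two result blocks below carry the output signature of the rcompanion::scheirerRayHare() function, a rank-based counterpart to the earlier Twoway ANOVA analyses. Calls of the following form, first for RBIRTH and then for RDEATH, are consistent with that output; treat the exact arguments as an assumption:
rcompanion::scheirerRayHare(value ~ year + Region,
data=RBIRTHNational.tbl)
rcompanion::scheirerRayHare(value ~ year + Region,
data=RDEATHNational.tbl)
# Scheirer-Ray-Hare test: an extension of the Kruskal-Wallis test to
# a two-factor design.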
DV: value
Observations: 15710
D: 1
MS total: 20568318
Df Sum Sq H p.value
year 4 3981854707 193.6 0.0000
Region 3 12107930757 588.7 0.0000
year:Region 12 211173885 10.3 0.5926
Residuals 15690 306806739699
DV: value
Observations: 15710
D: 1
MS total: 20568318
Df Sum Sq H p.value
year 4 4249998071 206.6 0.00000
Region 3 23496839829 1142.4 0.00000
year:Region 12 481830699 23.4 0.02432
Residuals 15690 294879029665
WRBIRTHandRDEATHMidwest2015Onward.tbl <-
RBIRTHandRDEATHMidwest2015Onward.tbl %>%
tidyr::pivot_wider(names_from = variable,
values_from = value)
# Put the data into WIDE format, to allow for
# simple comparisons between RBIRTH and RDEATH.
base::getwd()
base::ls()
base::attach(WRBIRTHandRDEATHMidwest2015Onward.tbl)
utils::str(WRBIRTHandRDEATHMidwest2015Onward.tbl)
dplyr::glimpse(WRBIRTHandRDEATHMidwest2015Onward.tbl)
base::summary(WRBIRTHandRDEATHMidwest2015Onward.tbl)
base::print(WRBIRTHandRDEATHMidwest2015Onward.tbl)
stats::cor.test(
WRBIRTHandRDEATHMidwest2015Onward.tbl$RBIRTH,
WRBIRTHandRDEATHMidwest2015Onward.tbl$RDEATH,
method = c("spearman"))
# Spearman's rho - nonparametric
stats::cor.test(
WRBIRTHandRDEATHMidwest2015Onward.tbl$RBIRTH,
WRBIRTHandRDEATHMidwest2015Onward.tbl$RDEATH,
method = c("pearson"))
# Pearson's r - parametric
Addendum 3: Data Wrangling, and Then Statistical Analyses and Mapping
Historically, Kansas has been one of the leading wheat-growing states, and it
retains its high ranking among all states in terms of wheat acreage, total production
of wheat, and wheat yields. There are many wheat types, but most wheat grown in
Kansas is classified as Hard Red Winter (HRW) wheat, which is known for having
high protein and strong gluten.22
The dataset for Addendum 3 was gained from the United States Department of
Agriculture (USDA) National Agricultural Statistics Service (NASS) Access Quick
Stats graphical user interface (GUI), https://quickstats.nass.usda.gov/. This GUI-
based resource was purposely selected, as opposed to the use of an API for data
retrieval. APIs have been demonstrated throughout this text and are the focus of a
later lesson.
Challenge: Similar to what has been presented throughout this lesson, follow
along with the outline that guides statistical analyses, but in a tidy and well-
organized manner:
• Background
–– Description of the Data
–– Null Hypothesis
• Import Data
• Code Book and Data Organization
• Exploratory Graphics
–– Graphics Using Base R
–– Graphics Using the tidyverse Ecosystem
• Exploratory Descriptive Statistics and Measures of Central Tendency
• Exploratory Analyses
• Presentation of Outcomes
Prepare a technical memorandum on the subject matter, data, process, and out-
comes for what is presented in this addendum. Even for those who are inexperi-
enced in wheat production, there are many resources that, for those with an interest,
can produce a cogent narrative relative to the data and how the data are obtained and
used. For many, wheat is part of the daily diet, thus the selection of a dataset on this
22
For those with special interest, review available resources to learn more about the many wheat
types commonly grown in the United States, including Hard Red Spring (HRS), Hard Red Winter
(HRW), Soft Red Winter (SRW), Hard White (HW), Soft White (SW), and Durum. Review the
environmental conditions, soil types, farming practices, etc. that are best for each wheat type.
Consider how this information fits into the background of the proposed memorandum associated
with this addendum.
critical crop and, from this base information, a look at the many biostatistics implications for
its production.
When preparing the technical memorandum, consider a reader who knows little
about Kansas and less about wheat farming. Use available resources to provide a
meaningful discussion on the history of wheat production in Kansas, current trends,
and reasons why there are such marked differences in yield by county. Keep the
discussion at a level that does not require expertise in agronomy.
As a reminder, most parts of this addendum will be completed by those who take
the challenge of starting out with a dataset that is somewhat complex. A few begin-
ning actions and ending actions will be presented, but most parts of this addendum
are self-completed, especially for those who have followed along with all lessons up
to this point and now have a skill set with R and the tidyverse that is greatly expand-
ing. Be sure to notice the many data wrangling actions needed to put the dataset into
good form, to support descriptive statistics, formative and final form graphics, infer-
ential analyses, mapping, and ultimately a concise conclusion. Occasionally, a few
hints will be offered, but there are so many possibilities on how the data can be used
that prescriptive mandates are generally avoided.
Background
The syntax used to import the data is demonstrated in this lesson, to provide a start-
ing point. Also note how data wrangling is used to put the data in good form. Give
special attention to a few challenges, such as efforts used to accommodate FIPS
(Federal Information Processing Standard) county codes, object variable name
requirements when using the choroplethr package, etc.
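The import statement itself does not appear on this page. As a sketch, assuming the Quick Stats query was downloaded as a comma-delimited file (the file name here is hypothetical), the import is a single readr call:
# Hypothetical import of the Quick Stats CSV download; adjust the file
# name and path to match the local copy.
KSWheat.tbl <- readr::read_csv("KS_Wheat_QuickStats.csv")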
base::getwd()
base::ls()
base::attach(KSWheat.tbl)
utils::str(KSWheat.tbl)
dplyr::glimpse(KSWheat.tbl)
base::summary(KSWheat.tbl)
Give attention to the messy (e.g., not tidy) column names for many object vari-
ables, especially the way spaces show in compound words (e.g., Geo Level, Ag
District, Ag District Code, etc.). Use the janitor::clean_names() function to accom-
modate this concern.
KSWheatYield.tbl <-
KSWheat.tbl %>%
janitor::clean_names()
# There are a few objects that have spaces in the object
# name. Use the janitor::clean_names() function to put
# object variable names into a tidy format.
base::getwd()
base::ls()
base::attach(KSWheatYield.tbl)
utils::str(KSWheatYield.tbl)
dplyr::glimpse(KSWheatYield.tbl)
base::summary(KSWheatYield.tbl)
KSWheatYield1934Onward.tbl <-
KSWheatYield.tbl %>%
# The tibble KSWheatYield.tbl currently consists
# of data in 7,760 rows and 21 columns. Use the
# dplyr::select() function, accompanied by the -
# character to delete previously identified
# unnecessary object variables (e.g., columns).
dplyr::select(-c('week_ending',
'zip_code',
'region',
'watershed_code',
'watershed',
'domain_category',
'cv_percent'))
# The tibble KSWheatYield.tbl now consists
# of data in 7,760 rows and 14 columns,
# having removed the 7 columns that were
# not needed.
KSWheatYield1934Onward.tbl$year <-
forcats::as_factor(KSWheatYield1934Onward.tbl$year)
# Specifically, to point out that year needs to be
# viewed as a factor and not as a number, look at
# the way the forcats::as_factor() function was
# used to put year into factor format, from the
# original numeric format.
base::getwd()
base::ls()
base::attach(KSWheatYield1934Onward.tbl)
utils::str(KSWheatYield1934Onward.tbl)
dplyr::glimpse(KSWheatYield1934Onward.tbl)
base::summary(KSWheatYield1934Onward.tbl)
At first, it seems that the data are in good form and that the tibble is ready for use,
but there are a few problems involving FIPS code(s) and how they are presented in
the current dataset:
• The state FIPS code (e.g., state_ansi; Kansas FIPS = 20) is separate from
the county FIPS codes.
• The county FIPS codes (there are 105 counties in Kansas) are inconsistent in the
number of digits. The three-digit FIPS code for Barton County, Kansas, should
read as 009. However, in the dataset, Barton County, Kansas, reads as 9. Compare
this outcome to the FIPS code for Lincoln County, Kansas, which is presented in
the tibble as 105, the correct three-digit FIPS code.
• Due to these problems (separate state and county FIPS codes and, more importantly, inconsistent padding for three-digit county FIPS codes), there needs to be a way to properly identify counties by their five-digit FIPS codes. Fortunately,
look at the way each county is identified in upper case (e.g., CAPS) within the
object variable called county. To meet the challenge of having the correct five-
digit FIPS code for each county:
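The county FIPS lookup table used for this purpose, KSCountyFIPSCodes.tbl, is imported next. The import statement is not reprinted on this page; as a sketch of the general idea, with the file name assumed (the actual file has 3,232 rows and the three columns FIPS, Name, and State):
# Hypothetical import of a national county FIPS lookup file; the file
# name is an assumption. FIPS is read as character and is converted
# to numeric later in this addendum.
KSCountyFIPSCodes.tbl <-
readr::read_csv("CountyFIPSCodes.csv",
col_types = readr::cols(
FIPS = readr::col_character(),
Name = readr::col_character(),
State = readr::col_character()))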
base::attach(KSCountyFIPSCodes.tbl)
utils::str(KSCountyFIPSCodes.tbl)
The tibble KSCountyFIPSCodes.tbl currently has 3232 rows and 3 columns. Use
the tidyverse ecosystem to make the data specific to Kansas, only. Then: (1) follow
along with the need to put county names in upper case, to match how county names
are listed in the wheat dataset, (2) keep the rows with Kansas FIPS codes and delete
all other rows, and (3) rename the Name column to county, to have a common name
for what will be the join element when the two datasets are merged into one new
dataset.
23
Experienced data scientists collect, organize, and curate private datasets, which may need adjust-
ment to some degree but otherwise serve anticipated needs such as the needs for this FIPS-oriented
dataset. A personal collection of these files is often needed.
KSCountyFIPSCodesUPPER.tbl <-
KSCountyFIPSCodes.tbl %>%
dplyr::mutate(Name = stringr::str_to_upper(Name)) %>%
# Put county names into all UPPER CASE, to match
# the way they show in the wheat dataset, to
# facilitate the join.
dplyr::filter(State %in% c("KS")) %>%
# Retain the 105 KS (e.g., Kansas) county
# rows and delete all others.
dplyr::rename(county=Name)
# When using the dplyr::rename() function to rename a column,
# the format is New_Name and then Old_Name, which for some
# may not be intuitive. Notice also how a single = character
# is used in this instance.
# The name county was introduced to be consistent with how
# county name is listed in the wheat dataset, which is needed
# to join the two files.
#
# Merely as a recap: (1) the all inclusive dataset of ZIP
# codes was altered by putting county names in all UPPER
# CASE, (2) rows with data were retained for KS and all other
# rows were deleted, and (3) the object Name was renamed as
# county.
base::getwd()
base::ls()
base::attach(KSCountyFIPSCodesUPPER.tbl)
utils::str(KSCountyFIPSCodesUPPER.tbl)
dplyr::glimpse(KSCountyFIPSCodesUPPER.tbl)
base::summary(KSCountyFIPSCodesUPPER.tbl)
# As expected, there are 105 rows in the dataset
# KSCountyFIPSCodesUPPER.tbl, one row representing
# one five-digit FIPS code for each Kansas county.
The data wrangling process continues, where it is now fairly easy to use the
dplyr::left_join() function to join KSWheatYield1934Onward.tbl (the adjusted
Kansas wheat yield dataset) and KSCountyFIPSCodesUPPER.tbl (the adjusted
Kansas county FIPS dataset) into one common dataset, to be called
KSWheatYieldByCounty1934to2007.tbl. The common object for each of the two
datasets is the object (e.g., column) named county, with county names in all
upper case.
Once the join process is completed and the two datasets are merged into one new
dataset, by using the %>% operator note how the FIPS object is renamed to region,
an object variable name that is needed for use with the choroplethr package. The
other term required for the choroplethr package is value, which is currently the
name for wheat yields (bushels per acre), so no action is needed for this requirement.
KSWheatYieldByCounty1934to2007.tbl <-
dplyr::left_join(KSWheatYield1934Onward.tbl,
KSCountyFIPSCodesUPPER.tbl, by = "county") %>%
# Apply the dplyr::left_join() function, with
# county (in UPPER CASE) the common variable in
# both datasets.
dplyr::rename(region=FIPS)
# The term region, representing FIPS codes in
# numeric format, is needed for use of the
# choroplethr package.
base::attach(KSWheatYieldByCounty1934to2007.tbl)
utils::str(KSWheatYieldByCounty1934to2007.tbl)
It seems that the only remaining task is the need to put the object variable region
into numeric format, changing it from the current character format. This task is
needed for successful use of the choroplethr package.
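A single conversion statement, in the style used earlier in this lesson, would complete the task:
KSWheatYieldByCounty1934to2007.tbl$region <-
base::as.numeric(KSWheatYieldByCounty1934to2007.tbl$region)
# Put region (the five-digit FIPS code) into numeric format, as
# required by the choroplethr::county_choropleth() function.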
base::getwd()
base::ls()
base::attach(KSWheatYieldByCounty1934to2007.tbl)
utils::str(KSWheatYieldByCounty1934to2007.tbl)
dplyr::glimpse(KSWheatYieldByCounty1934to2007.tbl)
base::summary(KSWheatYieldByCounty1934to2007.tbl)
The data transformations (e.g., wrangling) of the original Kansas wheat yield
dataset, gained from the USDA NASS, required careful planning so that the final
dataset meets needs for descriptive statistics, graphics, inferential statistics, and as a
value-added activity – mapping.
Challenge: Prepare a code book and explain the many ways wrangling was used to
organize the data.
Exploratory Graphics
KSWheatYieldByCounty1934to2007.fig <-
ggplot2::ggplot(data = KSWheatYieldByCounty1934to2007.tbl,
aes(x=year, y=value)) +
geom_point() +
stat_summary(
geom = "point",
fun = "mean",
col = "black",
size = 3,
shape = 21, # Circle
fill = "red") +
# An advantage of using geom_point() and stat_summary(),
# as presented in this figure, is that the figure is very
# comprehensive. Using geom_point() there is a sense of
# variance in wheat yield for each year whereas by using
# stat_summary() the mean is prominently displayed. Both
# add value to the figure.
labs(
title = "Overall and Mean Kansas Wheat Yield
(Bushels per Acre) Over Time",
subtitle = "Data: USDA NASS",
x = "\nYear",
y = "Yield (Bushels per Acre)\n") +
scale_x_discrete(breaks=scales::pretty_breaks(n=10)) +
scale_y_continuous(labels=scales::comma, limits=c(0, 85),
breaks=scales::pretty_breaks(n=10)) +
# As a challenge, add an annotated comment or two to help the
# reader. Perhaps it would be best to describe how the red dot
# indicates mean.
theme_Mac()
# Fig. 5.20
par(ask=TRUE); KSWheatYieldByCounty1934to2007.fig
Use this figure as a basis for discussion in the Presentation of Outcomes. Give
attention to the overall upward trajectory of Kansas wheat yield over time. Then
give special attention to the noticeable increases beginning in the mid to late 1950s.
There were clearly years from the mid to late 1950s with year-to-year declines, but
the overall trend has been increased yield over time. Why? What changed in terms
of farming practices that yields would increase so consistently beginning in the mid
to late 1950s? Data scientists add value, and this discussion would be a helpful addi-
tion to the Presentation of Outcomes.
Fig. 5.20
Exploratory Analyses
Are the prior Oneway ANOVA outcomes confirmed, now in the Twoway ANOVA? Is
there interaction between year and ag_district? If so, can it be explained?
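As a sketch of one way to begin (the addendum intentionally leaves the analysis open-ended), a Twoway ANOVA of yield by year and by agricultural district could mirror the earlier RBIRTH and RDEATH examples; the object name below is hypothetical:
# Hypothetical starting point: Twoway ANOVA of Kansas wheat yield
# (value) by year and by ag_district.
KSWheatTwowayYA.aov <-
stats::aov(value ~ year * ag_district,
data=KSWheatYieldByCounty1934to2007.tbl)
base::summary(KSWheatTwowayYA.aov)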
Presentation of Outcomes
KSWheatYieldByCounty2003.map <-
KSWheatYieldByCounty1934to2007.tbl %>%
dplyr::filter(year %in% c("2003"))
# Retain the 2003 rows.
base::getwd()
base::ls()
base::attach(KSWheatYieldByCounty2003.map)
utils::str(KSWheatYieldByCounty2003.map)
dplyr::glimpse(KSWheatYieldByCounty2003.map)
base::summary(KSWheatYieldByCounty2003.map)
KSWheatYieldByCounty2003.fig <-
choroplethr::county_choropleth(KSWheatYieldByCounty2003.map,
state_zoom = c("kansas"),
title = "Kansas Wheat Yield (Bushels per Acre) by County:
2003",
legend = "Wheat Yield (Bu/A)",
num_colors = 9) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="left")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is left justified,
# the title is centered, etc. Other embellishments could
# be provided, but it is judged that they are not needed.
# Fig. 5.21
par(ask=TRUE); KSWheatYieldByCounty2003.fig
Fig. 5.21
snow as well as rain and search on how snow is calculated as part of soil moisture
in terms of annual precipitation. Incorporate this information into a comprehensive
new dataset and determine if there is any association between annual precipitation
and wheat yield, year over year. Going beyond rainfall and snowfall, it would be
grand to consider data associated with irrigated acreage, possibly from water
pumped out of the Ogallala Aquifer or other sources.
Addendum 4: Prediction
Background
The data associated with this addendum are from a private dataset, adjusted for teach-
ing purposes. The data represent metrics from human subjects who participated in a
voluntary treatment (e.g., physical therapy) program. A set of measures, demographic
and assessment, were taken prior to participation, but these measures were not used
to limit enrollment. Enrollment in the treatment program was open to all willing
adults who provided written consent and subjects could leave at any time. Treatment
was ongoing, starting soon after the beginning of each year, from 2009 onward. The
dataset was obtained in late 2019 and reflects all subjects from 2009 to that end date.
After an unidentified period during which subjects engaged in treatment activi-
ties, a final performance assessment (POSTMeasure1) was required. Regardless of
the numeric score on the final performance metric, completion qualified subjects to
participate, by choice, in a terminal Fail/Pass (e.g., 0/1) final assessment. Again, by
choice, some subjects completed all treatment activities but selected to decline par-
ticipation in the terminal Fail/Pass (e.g., 0/1) final assessment. Those subjects who
did not receive a Pass on the terminal Fail/Pass assessment were allowed to attempt
the assessment at later dates. Continuance for these subjects was allowed until either
a Pass was obtained, or, by self-choice, a decision was made to finally decline fur-
ther participation. All subjects who participated in the terminal Fail/Pass assessment
received a certificate of completion, with a mark of Pass inscribed for those subjects
who passed the Fail/Pass terminal assessment.
This addendum is focused on prediction and how data gained from prior subjects
and their past activities can be used to predict future group outcomes. Consider the
common expression past behavior is the best predictor of future behavior. This
concept applies to biostatistics as well as the social and behavioral sciences. The
important thing to remember here is that prediction, even when valid, applies to
group outcomes and not the performance of any specific individuals.
Challenge: As an interesting activity, conduct a set of Pearson’s r correlations
and corresponding Null Hypotheses, to see if there is any meaningful correlation
between subject age, the three pre-assessment metrics, and the final post-performance
metrics. It is assumed that these measures represent interval data, but density plots
will be used to investigate this assumption. Then go beyond simple correlation
activities and move to regression, both linear and binomial, and finally prediction.
Before these later activities are attempted, it is suggested that those who have lim-
ited experience in statistics read about linear regression and binomial logistic
regression. It would be especially helpful to review the concept of odds ratio, as
opposed to the often-inappropriate use of the term(s) probability, chance, and odds.
Challenge: Use the syntax provided in this addendum to confirm outcomes, but
as skill level and interest allow go beyond the provided syntax to examine the data
in even greater detail. To save space, most figures are excluded from this addendum,
but the figures should be prepared and, as desired, improved upon.
Code Book
Rows: 849
Columns: 14
$ ID <dbl> 1 to end
$ YearStart <dbl> 2009 to 2019
$ AgeStart <dbl> 18 to 66
$ Gender <chr> A and B
$ RaceEthnicity <chr> A and B
$ PREMeasure1 <chr> 0 to 250
$ PREMeasure2 <dbl> 0 to 200
$ PREMeasure3 <chr> 0.00 to 5.00
$ POSTMeasure1 <dbl> 0.00 to 5.00
$ INTreatment <dbl> 0 and 1
$ COMPLETEDTreatment <dbl> 0 and 1
$ ATTEMPTEDFinalAssessment <dbl> 0 and 1
$ PASSEDFinalAssessment <dbl> 0 and 1
$ ATTEMPTSFinalAssessment <dbl> 1 to 10
Functions from the tidyverse ecosystem will be used to transform and reorganize
data into desired format(s).
As is often the case (but check to be sure), when binomial data are coded as 0 and
1, by convention the code 0 represents the negative (e.g., Die, Fail, No, Stop) and
the code 1 represents the positive (e.g., Live, Pass, Yes, Go).
base::getwd()
base::ls()
base::attach(NorY.tbl)
utils::str(NorY.tbl)
dplyr::glimpse(NorY.tbl)
base::summary(NorY.tbl)
Even if redundant, force each object variable into desired data type, particularly (but
not only) those object variables coded as 0 and 1.
NorY.tbl$ID <-
forcats::as_factor(NorY.tbl$ID)
NorY.tbl$YearStart <-
forcats::as_factor(NorY.tbl$YearStart)
NorY.tbl$AgeStart <-
base::as.numeric(NorY.tbl$AgeStart)
NorY.tbl$Gender <-
forcats::as_factor(NorY.tbl$Gender)
NorY.tbl$RaceEthnicity <-
forcats::as_factor(NorY.tbl$RaceEthnicity)
NorY.tbl$PREMeasure1 <-
base::as.numeric(NorY.tbl$PREMeasure1)
NorY.tbl$PREMeasure2 <-
base::as.numeric(NorY.tbl$PREMeasure2)
NorY.tbl$PREMeasure3 <-
base::as.numeric(NorY.tbl$PREMeasure3)
NorY.tbl$POSTMeasure1 <-
base::as.numeric(NorY.tbl$POSTMeasure1)
NorY.tbl$INTreatment <-
forcats::as_factor(NorY.tbl$INTreatment)
NorY.tbl$COMPLETEDTreatment <-
forcats::as_factor(NorY.tbl$COMPLETEDTreatment)
NorY.tbl$ATTEMPTEDFinalAssessment <-
forcats::as_factor(NorY.tbl$ATTEMPTEDFinalAssessment)
NorY.tbl$PASSEDFinalAssessment <-
forcats::as_factor(NorY.tbl$PASSEDFinalAssessment)
NorY.tbl$ATTEMPTSFinalAssessment <-
forcats::as_factor(NorY.tbl$ATTEMPTSFinalAssessment)
base::getwd()
base::ls()
base::attach(NorY.tbl)
utils::str(NorY.tbl)
dplyr::glimpse(NorY.tbl)
Rows: 849
Columns: 14
$ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9,
$ YearStart <fct> 2009, 2009, 2009, 2009, 2009
$ AgeStart <dbl> 25, 23, 23, 22, 22, 23, 25,
$ Gender <fct> B, B, B, B, A, B, A, B, B, A
$ RaceEthnicity <fct> B, A, B, A, B, B, B, B, B, B
$ PREMeasure1 <dbl> 176, 175, 173, 187, 178, 183
$ PREMeasure2 <dbl> 150, 145, 145, 153, 146, 149
$ PREMeasure3 <dbl> 2.57, 3.00, 2.77, 3.36, 3.20
$ POSTMeasure1 <dbl> 2.4332, NA, 2.5562, NA, 2.69
$ INTreatment <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
$ COMPLETEDTreatment <fct> 1, 0, 1, 0, 1, 0, 0, 1, 0, 0
$ ATTEMPTEDFinalAssessment <fct> 1, NA, 1, NA, 1, NA, NA, 1,
$ PASSEDFinalAssessment <fct> 0, NA, 1, NA, 1, NA, NA, 0,
$ ATTEMPTSFinalAssessment <fct> 3, NA, 1, NA, 2, NA, NA, 2,
base::summary(NorY.tbl)
Due to its role in binary logistic regression, give special attention to confirming that
PASSEDFinalAssessment is coded correctly:
base::levels(NorY.tbl$PASSEDFinalAssessment)
base::table(NorY.tbl$PASSEDFinalAssessment)
0 1
104 364
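With the outcome coding confirmed, a hedged sketch of the binomial logistic regression named in the earlier challenge follows; the model specification is an assumption, not syntax reproduced from this addendum:
# Sketch only: binomial logistic regression of the terminal Fail/Pass
# outcome on the final performance metric, with odds ratio(s) obtained
# by exponentiating the coefficient(s).
PassFinal.glm <-
stats::glm(PASSEDFinalAssessment ~ POSTMeasure1,
family=binomial(link="logit"), data=NorY.tbl)
base::summary(PassFinal.glm)
base::exp(stats::coef(PassFinal.glm)) # Odds ratio(s)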
With expected progression in the use of R as a data science tool, syntax for the fig-
ures of immediate importance is shown, but the figures are absent. Generate the
figures and from this display gain an initial sense of the data.
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=AgeStart)) +
geom_density(lwd=2, color="red") +
ggtitle("Age at Start of Treatment") +
labs(x="\nAge at Start", y="Density\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=PREMeasure1)) +
geom_density(lwd=2, color="red") +
ggtitle("PREMeasure1") +
labs(x="\nPREMeasure1", y="Density\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=PREMeasure2)) +
geom_density(lwd=2, color="red") +
ggtitle("PREMeasure2") +
labs(x="\nPREMeasure2", y="Density\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=PREMeasure3)) +
geom_density(lwd=2, color="red") +
ggtitle("PREMeasure3") +
labs(x="\nPREMeasure3", y="Density\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=POSTMeasure1)) +
geom_density(lwd=2, color="red") +
ggtitle("POSTMeasure1") +
labs(x="\nPOSTMeasure1", y="Density\n")
# Give special attention to the distribution
# pattern for POSTMeasure1 since this metric
# serves as the predictor variable in the
# later binomial logistic regression inquiry.
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=YearStart)) +
geom_bar(fill="red", color="black") +
ggtitle("Year Treatment Started") +
labs(x="\nYear", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=Gender)) +
geom_bar(fill="red", color="black") +
ggtitle("Gender") +
labs(x="\nGender", y="Count\n")
# Specific gender breakouts are purposely not
# identified by name, thus the cryptic codes A and
# B.
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=RaceEthnicity)) +
geom_bar(fill="red", color="black") +
ggtitle("Race-Ethnicity") +
labs(x="\nRace-Ethnicity", y="Count\n")
# Specific race-ethnicity breakouts (collapsed for
# this dataset) are purposely not identified by
# name, thus the cryptic codes A and B.
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=INTreatment)) +
geom_bar(fill="red", color="black") +
ggtitle("Currently in Treatment") +
labs(x="In Treatment\n", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=COMPLETEDTreatment)) +
geom_bar(fill="red", color="black") +
ggtitle("Completed Treatment") +
labs(x="Completed Treatment\n", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=ATTEMPTEDFinalAssessment)) +
geom_bar(fill="red", color="black") +
ggtitle("Attempted Final Assessment") +
labs(x="Attempted Final Assessment\n", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=PASSEDFinalAssessment)) +
geom_bar(fill="red", color="black") +
ggtitle("Passed Final Assessment") +
labs(x="Passed Final Assessment\n", y="Count\n")
par(ask=TRUE); ggplot2::ggplot(data=NorY.tbl,
aes(x=ATTEMPTSFinalAssessment)) +
geom_bar(fill="red", color="black") +
ggtitle("Number of Attempts at Final Assessment") +
labs(x="Number of Attempts\n", y="Count\n")
$ YearStart <fct>
$ Gender <fct>
$ RaceEthnicity <fct>
$ INTreatment <fct>
$ COMPLETEDTreatment <fct>
$ ATTEMPTEDFinalAssessment <fct>
$ PASSEDFinalAssessment <fct>
$ ATTEMPTSFinalAssessment <fct>
Among the many ways the tidyverse ecosystem supports data science, note the clever way
the janitor::tabyl() function supports a thorough understanding of the data, specifically
through frequency distributions of selected factor-type object variables.24
24 Later, experiment by adding janitor::adorn_totals(c("col", "row")) to the pipeline and
observe how this addition impacts the output.
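As a brief illustration of the footnote's suggestion (a sketch, since the exact call is not shown in the original), the totals adornment can be added to a one-way frequency table such as the YearStart breakout shown below:
NorY.tbl %>%
  janitor::tabyl(YearStart) %>%
  janitor::adorn_totals("row") %>%
  janitor::adorn_pct_formatting(digits=2)
# Appends a Total row to the one-way frequency table of YearStart;
# with a two-way tabyl, adorn_totals(c("col", "row")) adds both
# row and column margins.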
YearStart n percent
2009 128 15.08%
2010 123 14.49%
2011 144 16.96%
2012 73 8.60%
2013 59 6.95%
2014 50 5.89%
2015 32 3.77%
2016 34 4.00%
2017 83 9.78%
2018 61 7.18%
2019 62 7.30%
ATTEMPTSFinalAssessmentNorY.df <- NorY.tbl %>%  # Opening restored;
janitor::tabyl(ATTEMPTSFinalAssessment,          # the original lines
show_na=TRUE, show_missing_levels=TRUE) %>%      # fall outside this excerpt.
janitor::adorn_pct_formatting(digits=2)
base::print(ATTEMPTSFinalAssessmentNorY.df)
Note the question this raises about data entry for what shows as an 8th attempt.
Should this datapoint have been marked as 6? Did the subject really make a 6th, 7th, and
then an 8th attempt? Perhaps the datum 8 is correct, but quality assurance issues of this
type are often noticed only because frequency distribution printouts are produced.
$ AgeStart <dbl>
$ PREMeasure1 <dbl>
$ PREMeasure2 <dbl>
$ PREMeasure3 <dbl>
$ POSTMeasure1 <dbl>
NorY.tbl %>%
# dplyr::group_by(PASSEDFinalAssessment) %>%
# Remove the # comment character if there were a desire
# to have the printout show breakouts by the object
# variable PASSEDFinalAssessment, 0 and 1.
dplyr::summarize(
N = base::length(AgeStart),
Minimum = base::min(AgeStart, na.rm=TRUE),
Median = stats::median(AgeStart, na.rm=TRUE),
Mean = base::mean(AgeStart, na.rm=TRUE),
SD = stats::sd(AgeStart, na.rm=TRUE),
Maximum = base::max(AgeStart, na.rm=TRUE),
Missing = base::sum(is.na(AgeStart))
)
# A tibble: 1 x 7
N Minimum Median Mean SD Maximum Missing
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 849 18 24 26.7 6.70 66 0
NorY.tbl %>%
dplyr::summarize(
N = base::length(PREMeasure1),
Minimum = base::min(PREMeasure1, na.rm=TRUE),
Median = stats::median(PREMeasure1, na.rm=TRUE),
Mean = base::mean(PREMeasure1, na.rm=TRUE),
SD = stats::sd(PREMeasure1, na.rm=TRUE),
Maximum = base::max(PREMeasure1, na.rm=TRUE),
Missing = base::sum(is.na(PREMeasure1))
)
NorY.tbl %>%
dplyr::summarize(
N = base::length(PREMeasure2),
Minimum = base::min(PREMeasure2, na.rm=TRUE),
Median = stats::median(PREMeasure2, na.rm=TRUE),
Mean = base::mean(PREMeasure2, na.rm=TRUE),
SD = stats::sd(PREMeasure2, na.rm=TRUE),
Maximum = base::max(PREMeasure2, na.rm=TRUE),
Missing = base::sum(is.na(PREMeasure2))
)
NorY.tbl %>%
dplyr::summarize(
N = base::length(PREMeasure3),
Minimum = base::min(PREMeasure3, na.rm=TRUE),
Median = stats::median(PREMeasure3, na.rm=TRUE),
Mean = base::mean(PREMeasure3, na.rm=TRUE),
SD = stats::sd(PREMeasure3, na.rm=TRUE),
Maximum = base::max(PREMeasure3, na.rm=TRUE),
Missing = base::sum(is.na(PREMeasure3))
)
NorY.tbl %>%
dplyr::summarize(
N = base::length(POSTMeasure1),
Minimum = base::min(POSTMeasure1, na.rm=TRUE),
Median = stats::median(POSTMeasure1, na.rm=TRUE),
Mean = base::mean(POSTMeasure1, na.rm=TRUE),
SD = stats::sd(POSTMeasure1, na.rm=TRUE),
Maximum = base::max(POSTMeasure1, na.rm=TRUE),
Missing = base::sum(is.na(POSTMeasure1))
)
# A tibble: 1 x 7
N Minimum Median Mean SD Maximum Missing
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 849 2.08 2.99 3.00 0.355 3.96 375
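The repeated dplyr::summarize() blocks above can also be collapsed into a single call with dplyr::across(); a compact sketch, not part of the original presentation:
NorY.tbl %>%
  dplyr::summarize(dplyr::across(
    c(AgeStart, PREMeasure1, PREMeasure2, PREMeasure3, POSTMeasure1),
    list(
      N       = ~ base::length(.x),
      Median  = ~ stats::median(.x, na.rm=TRUE),
      Mean    = ~ base::mean(.x, na.rm=TRUE),
      SD      = ~ stats::sd(.x, na.rm=TRUE),
      Missing = ~ base::sum(is.na(.x))
    )))
# One row of output, with one column per variable-statistic pair,
# such as AgeStart_Mean and POSTMeasure1_Missing.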
Exploratory Analyses
Prepare estimates of correlation (e.g., association) between and among the numeric
object variables using Pearson’s r coefficient of correlation. Use the tidyverse eco-
system to prepare a correlation matrix of all numeric variables, by various degrees
of breakout complexity.
install.packages("corrr", dependencies=TRUE)
library(corrr)
NorY.tbl %>%
dplyr::select(YearStart, AgeStart, PREMeasure1, PREMeasure2,
PREMeasure3, POSTMeasure1) %>%
dplyr::group_by(YearStart) %>%
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print(n=55)
# As a challenge, use this syntax and be sure to see
# the diagonal pattern of a correlation matrix.
To offer more clarity, look at the correlation matrix without AgeStart, but still
grouping by YearStart.
NorY.tbl %>%
dplyr::select(YearStart, PREMeasure1, PREMeasure2,
PREMeasure3, POSTMeasure1) %>%
dplyr::group_by(YearStart) %>%
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print(n=44)
# As a challenge, use this syntax and be sure to see
# the diagonal pattern of a correlation matrix.
NorY.tbl %>%
dplyr::select(PREMeasure1, PREMeasure2, PREMeasure3,
POSTMeasure1) %>%
# dplyr::group_by(YearStart) %>%
# COMMENT OUT dplyr::group_by(YearStart)
# Use the # comment character.
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print()
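For a cleaner printed matrix, the corrr package also offers helper functions; a brief sketch (these calls are not shown in the original):
NorY.tbl %>%
  dplyr::select(PREMeasure1, PREMeasure2, PREMeasure3,
    POSTMeasure1) %>%
  corrr::correlate() %>%
  corrr::shave() %>%      # Keep only the lower triangle.
  corrr::fashion()        # Round values and blank the diagonal.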
25 As a very broad measure of the weak, to moderate, to strong continuum of Pearson's r and
estimates of correlation, there are those who see: Strong Relationship, r = 0.50 and onward
(+ or −); Moderate Relationship, r = 0.30 to 0.49 (+ or −); Weak Relationship, r = 0.00 to
0.29 (+ or −). Review multiple resources and talk to practitioners for other views.
26 When reading about estimates of correlation, give special attention to the concepts of
positive and negative degrees of association. Avoid the temptation, far too common among
those who are new to this type of inquiry, to think incorrectly that positive is good and
negative is bad. Positive and negative degrees of association only signify direction, and
with clever rewording each could be reversed.
27 Give attention to the diagonal nature of a correlation matrix.
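The lines that fit PredictionModel fall on a page not reproduced in this excerpt; consistent with the Call shown in the summary below, the model was presumably prepared along these lines:
PredictionModel <- stats::lm(PREMeasure1 ~ PREMeasure2,
  data=NorY.tbl)
# Simple linear regression of PREMeasure1 on PREMeasure2.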
base::summary(PredictionModel)
Call:
stats::lm(formula = PREMeasure1 ~ PREMeasure2, data = NorY.tbl)
Residuals:
Min 1Q Median 3Q Max
-10.903 -3.894 0.095 3.106 11.064
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.5001 4.7488 6.21 0.00000000082 ***
PREMeasure2 1.0028 0.0324 30.98 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Next, adjust the dataset to focus on the subjects who were most successful in
treatment. Sequester the dataset to include only those subjects with the following
conditions:
• Subjects had a mark of COMPLETEDTreatment = 1 and
• Subjects had a mark of ATTEMPTEDFinalAssessment = 1 and
• Subjects had a mark of PASSEDFinalAssessment = 1 and
• Subjects had a mark of ATTEMPTSFinalAssessment = 1.
base::length(NorY.tbl$ATTEMPTSFinalAssessment)
# Confirm number of subjects prior to adjusting
# the dataset.
[1] 849
NorYMostSuccessful.tbl <-
NorY.tbl %>%
dplyr::filter(COMPLETEDTreatment %in% c("1")) %>%
dplyr::filter(ATTEMPTEDFinalAssessment %in% c("1")) %>%
dplyr::filter(PASSEDFinalAssessment %in% c("1")) %>%
dplyr::filter(ATTEMPTSFinalAssessment %in% c("1"))
# NOTE: The last two filters fall on a page not reproduced here
# and are restored from the four conditions listed above.
base::length(NorYMostSuccessful.tbl$ATTEMPTSFinalAssessment)
# Confirm number of subjects after adjusting
# the dataset.
[1] 235
base::getwd()
base::ls()
base::attach(NorYMostSuccessful.tbl)
utils::str(NorYMostSuccessful.tbl)
dplyr::glimpse(NorYMostSuccessful.tbl)
Rows: 235
Columns: 14
$ ID <fct> 3, 11, 12, 18, 21, 22, 23, 24,
$ YearStart <fct> 2009, 2009, 2009, 2009, 2009,
$ AgeStart <dbl> 23, 22, 36, 22, 41, 24, 32,
$ Gender <fct> B, B, B, B, B, B, B, B, B, B,
$ RaceEthnicity <fct> B, B, B, B, B, B, B, B, B, B,
$ PREMeasure1 <dbl> 173, 183, 186, 183, 179, 184,
$ PREMeasure2 <dbl> 145, 146, 161, 150, 148, 149,
$ PREMeasure3 <dbl> 2.77, 3.73, 2.48, 3.28, 3.12,
$ POSTMeasure1 <dbl> 2.5562, 3.1773, 3.3769, 2.8607
$ INTreatment <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
$ COMPLETEDTreatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
$ ATTEMPTEDFinalAssessment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
$ PASSEDFinalAssessment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
$ ATTEMPTSFinalAssessment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
base::summary(NorYMostSuccessful.tbl)
Now that the dataset has been adjusted to include only those subjects who were
the most successful (those who met all threshold activities, were qualified to
attempt the final assessment, and passed the Fail/Pass final assessment on the
first attempt), see if the correlations are in any meaningful way different from
what was previously seen when the entire dataset was examined.28
NorY.tbl %>%
dplyr::select(PREMeasure1, PREMeasure2, PREMeasure3,
POSTMeasure1) %>%
# dplyr::group_by(YearStart) %>%
# COMMENT OUT dplyr::group_by(YearStart)
# Use the # comment character.
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print()
# ORIGINAL dataset
# A tibble: 4 x 5
term PREMeasure1 PREMeasure2 PREMeasure3 POSTMeasure1
<chr> <dbl> <dbl> <dbl> <dbl>
1 PREMeasure1 NA 0.730 0.673 0.276
2 PREMeasure2 0.730 NA -0.00198 0.179
3 PREMeasure3 0.673 -0.00198 NA 0.222
4 POSTMeasure1 0.276 0.179 0.222 NA
NorYMostSuccessful.tbl %>%
dplyr::select(PREMeasure1, PREMeasure2, PREMeasure3,
POSTMeasure1) %>%
# dplyr::group_by(YearStart) %>%
# COMMENT OUT dplyr::group_by(YearStart)
dplyr::group_modify(~ corrr::correlate(.x)) %>%
print()
# ADJUSTED dataset
28 Even if results are not published and shared with others, it is common to examine data
from many perspectives. Data scientists should be curious.
Review the correlation matrix from both perspectives, the original dataset (NorY.tbl)
and the adjusted dataset (NorYMostSuccessful.tbl). Are there any meaningful
(e.g., actionable) differences in interpretation of the estimates of correlation?
Then, even if somewhat redundant, see if there are any major changes in the linear
regression prediction equation for PREMeasure1 and PREMeasure2 among these
most successful subjects.
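As before, the fitting code is not reproduced here; consistent with the Call in the summary that follows, it was presumably:
PredictionModelMostSuccessful <- stats::lm(
  PREMeasure1 ~ PREMeasure2,
  data=NorYMostSuccessful.tbl)
# The same simple linear regression, fitted on the adjusted dataset.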
base::summary(PredictionModelMostSuccessful)
Call:
stats::lm(formula = PREMeasure1 ~ PREMeasure2,
data = NorYMostSuccessful.tbl)
Residuals:
Min 1Q Median 3Q Max
-11.02 -3.01 0.99 3.99 8.99
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.87 8.93 3.46 0.00065 ***
PREMeasure2 1.00 0.06 16.69 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For all subjects (e.g., PredictionModel), the Estimate for PREMeasure2 was
1.0028, whereas for those subjects who were the most successful
(PredictionModelMostSuccessful) the Estimate for PREMeasure2 was 1.00. Are
there any practical differences in these two estimates?
Challenge: Read not only about correlation, association, and linear regression, but
give attention to the extremely useful but far too often neglected topic of binary
logistic regression. Many life events, certainly an issue in biostatistics, eventually
equate to a binary decision, such as:
• Was the treatment effective or was the treatment ineffective?
• Should the program continue, or should the program be terminated?
• Did the subject live or did the subject die?
Before a binary logistic regression is attempted, it is necessary to consider missing
data and the impact of missing data on outcomes. For this example, review the
background and the original dataset and give attention to how certain thresholds of
completion must be met before subjects are qualified to attempt the Fail/Pass final
assessment. Subjects who have not met the required thresholds can (and should) be
eliminated from consideration. Once these subjects are removed, such as those subjects
from the later years of the dataset (who could not have been expected to complete all
activities), notice that missing data are not a major concern. With that issue
possibly resolved, consider the following questions:
• Is there some type of pattern to the missing data, or are the missing datapoints
distributed randomly?
• Is it desirable to impute data and, if so, how should this be done?
• If missing data are removed, will the eventual dataset have a sufficient N to jus-
tify binary logistic regression?
For this example, it has been decided (and can be justified, knowing that some may
have other views) that the final dataset should be complete and that there should be no
missing data. Notice how easy it is to use the tidyverse ecosystem to achieve that aim.
base::length(NorY.tbl$ID)
# Original dataset, including subjects
# who are not yet qualified to sit for
# the Fail/Pass final assessment.
[1] 849
NorYNONAs.tbl <-
NorY.tbl %>%
tidyr::drop_na()
# Remove all rows containing missing values.
base::length(NorYNONAs.tbl$ID)
[1] 464
base::summary(NorYNONAs.tbl)
# There are now no NAs in the dataset NorYNONAs.tbl
# and this tibble will be used for the binary
# logistic regression.
base::summary(NorYNONAs.tbl$PASSEDFinalAssessment)
0 1
102 362
base::getwd()
base::ls()
base::attach(NorYNONAs.tbl)
utils::str(NorYNONAs.tbl)
dplyr::glimpse(NorYNONAs.tbl)
base::summary(NorYNONAs.tbl)
The adjusted dataset is now in good form and ready for inquiry from a binary
logistic regression perspective, where it will be possible to see which identified
object variables contribute in a meaningful way to the Fail/Pass final assessment
outcome variable (e.g., PASSEDFinalAssessment)29:
Variable of interest: PASSEDFinalAssessment
Predictor Variables
YearStart ........ Factor
AgeStart ......... Numeric
Gender ........... Factor
RaceEthnicity .... Factor
PREMeasure1 ...... Numeric
PREMeasure2 ...... Numeric
PREMeasure3 ...... Numeric
POSTMeasure1 ..... Numeric
With this preparation, note how the ultimate goal of a binary logistic regression
is to examine odds ratios for the many predictor variables and how they contribute
to the binary outcome, or Fail/Pass in this example:
29 Do not ignore the data type of the many variables when engaged in binary logistic
regression. The binary variable of interest is viewed as a factor-type variable with two
levels (0/1, Fail/Pass, etc.). However, note how a combination of factor-type object
variables and numeric-type object variables have a role in the prediction algorithm. A
major value of binary logistic regression is that this approach toward analyses considers
factor-type variables as well as numeric-type variables.
logitPassedFinalAssessment <-
stats::glm(PASSEDFinalAssessment ~
YearStart +
AgeStart +
Gender +
RaceEthnicity +
PREMeasure1 +
PREMeasure2 +
PREMeasure3 +
POSTMeasure1,
data=NorYNONAs.tbl, family="binomial")
# Build the model.
base::summary(logitPassedFinalAssessment)
Call:
stats::glm(formula = PASSEDFinalAssessment ~ YearStart +
AgeStart + Gender + RaceEthnicity + PREMeasure1 +
PREMeasure2 + PREMeasure3 + POSTMeasure1, family =
"binomial", data = NorYNONAs.tbl)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.0943 0.0666 0.2635 0.5533 1.8273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.97596 5.15699 -1.74 0.0818 .
YearStart2010 0.11104 0.57921 0.19 0.8480
YearStart2011 -2.38911 0.52588 -4.54 0.000005544278345 ***
YearStart2012 -1.97203 0.60053 -3.28 0.0010 **
YearStart2013 -2.93007 0.59109 -4.96 0.000000715509315 ***
YearStart2014 -2.57440 0.64635 -3.98 0.000068057146887 ***
YearStart2015 -3.33261 0.70688 -4.71 0.000002422691728 ***
YearStart2016 -4.39314 0.76705 -5.73 0.000000010202655 ***
YearStart2017 -1.40819 1.19847 -1.17 0.2400
AgeStart -0.06399 0.02349 -2.72 0.0064 **
GenderA -0.56140 0.29335 -1.91 0.0556 .
RaceEthnicityA -0.77954 0.38717 -2.01 0.0441 *
PREMeasure1 0.00821 0.16402 0.05 0.9601
PREMeasure2 0.00636 0.16417 0.04 0.9691
PREMeasure3 -0.70767 1.64886 -0.43 0.6678
POSTMeasure1 4.80679 0.63862 7.53 0.000000000000052 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This output is interesting, and the statistics in the two columns z value and Pr(>|z|)
give a hint of eventual outcomes, but for now continue and look at confidence intervals
for the object variables in the fitted model.
stats::confint(logitPassedFinalAssessment)
# Determine confidence intervals for
# object variables in a fitted model.
This work with the fitted model should be given attention. However, the odds
ratios are the main concern for this binary logistic regression inquiry. As mentioned
earlier, study the precise meanings of odds ratio versus odds versus chance versus
probability versus prediction; these terms are not synonyms.
base::exp(stats::coef(logitPassedFinalAssessment))
# Determine odds ratios for object variables in
# a fitted model.
#
# Review the literature to be sure what odds
# ratio means, as opposed to the incorrect use
# of other terms that are not the same.
(Intercept) 0.000126412
YearStart2010 1.117438197
YearStart2011 0.091711721
YearStart2012 0.139173601
YearStart2013 0.053393163
YearStart2014 0.076199435
YearStart2015 0.035699636
YearStart2016 0.012361855
YearStart2017 0.244585459
AgeStart 0.938019029
GenderA 0.570408488
RaceEthnicityA 0.458616410
PREMeasure1 1.008243025
PREMeasure2 1.006377812
PREMeasure3 0.492789926
POSTMeasure1 122.337700994
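Although not shown in the original presentation, a common companion display pairs each odds ratio with a confidence interval on the same scale:
base::exp(base::cbind(
  OddsRatio = stats::coef(logitPassedFinalAssessment),
  stats::confint(logitPassedFinalAssessment)))
# One row per model term: the odds ratio followed by the
# exponentiated 2.5% and 97.5% profile-likelihood limits.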
30 Give attention to the base::cut() function and how it has many uses in data science.
NorYNONAs.tbl$AddCutScore <-
base::cut(NorYNONAs.tbl$POSTMeasure1,
breaks=c(-Inf, 3.00, 3.50, Inf),
labels=c("Low","Middle","High"))
base::attach(NorYNONAs.tbl)
base::table(NorYNONAs.tbl$AddCutScore,
NorYNONAs.tbl$PASSEDFinalAssessment)
# Rows ..... AddCutScore
# Columns .. PASSEDFinalAssessment
#
# HINT: When selecting which object variable
# should be rows and which object variable
# should be columns, make the rows object
# variable the one with the greatest number of
# breakouts. This output is a 3 (rows) by 2
# columns presentation.
0 1
Low 76 163
Middle 26 157
High 0 42
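The line that stores this cross-tabulation as the object Performance falls outside the reproduced pages; consistent with the object name used below, it was presumably:
Performance <- base::table(NorYNONAs.tbl$AddCutScore,
  NorYNONAs.tbl$PASSEDFinalAssessment)
# Save the 3 x 2 cross-tabulation as the object Performance.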
base::class(Performance)
Performance
0 1
Low 76 163
Middle 26 157
High 0 42
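Similarly, the line that submits Performance to the chi-square test is not reproduced; consistent with the output that follows, it was presumably:
PerformanceSQ <- stats::chisq.test(Performance)
# Pearson's Chi-square test of independence on the saved table.
PerformanceSQ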
Pearson's Chi-squared test

data:  Performance
X-squared = 31.71, df = 2, p-value = 0.00000013
PerformanceSQ$observed %>%
print()
0 1
Low 76 163
Middle 26 157
High 0 42
PerformanceSQ$expected %>%
print()
0 1
Low 52.53879 186.4612
Middle 40.22845 142.7716
High 9.23276 32.7672
There are clear differences in the cell counts, observed versus expected.
Further, with a calculated p-value of 0.00000013, there is a statistically significant
(p <= 0.05) difference in performance on the Fail/Pass final assessment by breakouts
of AddCutScore (e.g., Low, Middle, and High), which is derived from POSTMeasure1.
The Chi-square analyses provide an additional degree of confirmation that
POSTMeasure1 has a large influence on PASSEDFinalAssessment outcomes.
Then, going back to what was found earlier in the binary logistic regression, observe
how the many other object variables in the binary logistic equation are all interesting,
but only a brief discussion is warranted given the exceptional predictive utility
of POSTMeasure1:
• Subjects enter the treatment program by choice, and there is no control over the
starting year or willingness to persist. There was a slight, potentially useful
predictive odds ratio for those who started in 2010, but notice how the predictive
utility of starting year dropped off soon after.
• It should also be mentioned that there is no control over age, other than subjects
must be adults (e.g., 18 years or older) and capable of giving consent for treat-
ment. Also, consider outcomes for gender and race-ethnicity. There is little pre-
dictive utility for these three object variables: age, gender, and race-ethnicity.
• For entry into treatment, it is expected that subjects will provide scores for three
pretreatment measures. Like what was seen in the correlation matrix, these three
pretreatment measures have little value in terms of their impact on the final Fail/
Pass assessment.
• The POSTMeasure1 object variable is the most useful measure in terms of its
impact on success with the Fail/Pass final assessment. Even a small increase in
POSTMeasure1 corresponds to a substantial increase in the odds of passing, as
reflected in its odds ratio.
SuccessOnTheFinalAssessment.fig <-
ggplot2::ggplot(data=NorYNONAs.tbl,
aes(x=POSTMeasure1)) +
geom_histogram(color="black", fill="dodgerblue") +
facet_wrap(~PASSEDFinalAssessment) +
ggtitle("Fail (0) and Pass (1) on the Fail/Pass Final
Assessment by POSTMeasure1 Scores (Range = 0 to 5)") +
labs(x="\nPOSTMeasure1 Score", y="Count\n") +
scale_x_continuous(labels=scales::comma, limits=c(1.8, 5.0),
breaks=scales::pretty_breaks(n=10)) +
scale_y_continuous(labels=scales::comma, limits=c(0, 50),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(strip.text.x=element_text(face="bold", size=14)) +
theme(axis.text.x=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=0.5, vjust=1, angle=45))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 5.22
par(ask=TRUE); SuccessOnTheFinalAssessment.fig
More could be (and will be) done with this dataset, but the binary logistic regres-
sion provides ample evidence that POSTMeasure1 is the most useful object variable
in terms of Fail (0) or Pass (1) performance on the Fail/Pass final assessment. All
other object variables in the binary logistic regression algorithm are either limited
or have no practical use given the nature of how, why, and when subjects entered
treatment. Of course, that may not be the case with treatment programs for non-
human subjects, but that was not an issue in this addendum.
It is not merely a sidebar issue that the pretreatment metrics have no practical
impact on success with the final assessment; it is reasonable to question whether the
time and cost of these pretreatment assessments should be continued. Instead, it may
be of more value to concentrate on efforts needed to increase the posttreatment metric
(e.g., POSTMeasure1) prior to attempting the Fail/Pass final assessment, perhaps
by applying a practice test to prepare subjects for POSTMeasure1. Assume that the
data scientist charged with this dataset and these analyses has limited knowledge of
the structure of the treatment program, so it is difficult to offer more discussion on
this topic.
Data scientists add value when attempting the many activities associated with their
discipline. The prior odds ratio values are critically important and give meaning to
this inquiry, but many members of the public do not understand the true meaning of an
odds ratio and its use. The last section of this addendum will accordingly focus on
probability and prediction, concepts that the public should ostensibly understand with
a greater degree of comprehension.31
The task now is to prepare a figure that represents the probability of receiving a
pass on the Fail/Pass final assessment (e.g., PASSEDFinalAssessment = 1). The
previously adjusted dataset with no missing values will be used. As a brief reminder,
the logit model was prepared as:
logitPassedFinalAssessment <-
stats::glm(PASSEDFinalAssessment ~
YearStart +
AgeStart +
Gender +
RaceEthnicity +
PREMeasure1 +
PREMeasure2 +
PREMeasure3 +
POSTMeasure1,
data=NorYNONAs.tbl, family="binomial")
# Redundant: Build the model.
base::attach(logitPassedFinalAssessment)
utils::str(logitPassedFinalAssessment)
dplyr::glimpse(logitPassedFinalAssessment)
# Take time to study the extreme detail gained from using
# either the utils::str() function or the dplyr::glimpse()
# function, when applied against the model as compared to
# when these functions are applied against a dataset.
With the logit model in good form, create a new object variable to be added to
the NorYNONAs.tbl dataset. This new object variable, logistic_predictions, will be
based on application of the stats::predict() function.
31 The term ostensibly was used purposely. If presented with a valid deck of cards, many
members of the public understand that the probability of selecting the Ace of Spades is
one out of 52. However, when state-run lotteries have mega jackpots and there is heightened
interest, it is not at all uncommon for many members of the public to buy two tickets
instead of just one, in a self-declared effort to double the chance of winning. Of course,
doubling a nearly negligible probability still leaves a nearly negligible probability.
NorYNONAs.tbl$logistic_predictions <-
stats::predict(logitPassedFinalAssessment, type = "response")
# Prepare a new object variable that focuses on
# probability predictions, subject by subject --
# but with no missing data to confound the use of
# R and outcomes.
base::getwd()
base::ls()
base::attach(NorYNONAs.tbl)
utils::str(NorYNONAs.tbl)
dplyr::glimpse(NorYNONAs.tbl)
base::summary(NorYNONAs.tbl)
base::print(NorYNONAs.tbl)
base::summary(NorYNONAs.tbl$logistic_predictions)
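Although not part of the original presentation, the predicted probabilities can also be turned into a quick classification check; the 0.50 cutoff below is an assumption used only for illustration:
NorYNONAs.tbl$PredictedPass <-
  base::ifelse(NorYNONAs.tbl$logistic_predictions >= 0.50, 1, 0)
# Classify each subject as a predicted Pass (1) or Fail (0).
base::table(
  Predicted = NorYNONAs.tbl$PredictedPass,
  Observed = NorYNONAs.tbl$PASSEDFinalAssessment)
# A 2 x 2 confusion table of predicted versus observed outcomes.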
Prepare an estimate of correlation between two of the most pertinent object vari-
ables associated with this inquiry:
stats::cor(
NorYNONAs.tbl$POSTMeasure1,
NorYNONAs.tbl$logistic_predictions,
method="pearson")
[1] 0.539462
Using Base R, prepare a simple draft plot, in rough form, of the same relationship:
X axis POSTMeasure1 versus Y axis logistic_predictions:
graphics::plot(
NorYNONAs.tbl$POSTMeasure1, # X axis
NorYNONAs.tbl$logistic_predictions) # Y axis
The output of this simple plot, using the graphics::plot() function, should give a
general sense of how success on the Fail/Pass final assessment increases as
POSTMeasure1 increases. Use the ggplot2::ggplot() function to improve the
presentation of this finding (Fig. 5.23).
ProbPASSEDFinalAssessmentByPOSTMeasure1.fig <-
NorYNONAs.tbl %>%
ggplot2::ggplot(
aes(x=POSTMeasure1,
y=logistic_predictions)) +
geom_point(color="red") +
geom_smooth(method="lm", se=TRUE) +
labs(
x="\nPOSTMeasure1",
y="PASSEDFinalAssessment Probability:
Logistic Predictions\n",
title=
"Probability of Success at PASSEDFinalAssessment
by POSTMeasure1") +
annotate("text", x=3.25, y=0.50, fontface="bold",
size=05, color="black", hjust=0, family="mono",
label="Pearson's r = 0.539462") +
scale_x_continuous(labels=scales::comma, limits=c(2.0,
4.05), breaks=scales::pretty_breaks(n=5)) +
# Make the scale go slightly beyond the maximum value
# of the X axis, to allow room for the points so that
# they do not get lost in the figure.
scale_y_continuous(labels=scales::comma, limits=c(0.0,
1.05), breaks=scales::pretty_breaks(n=5)) +
# Make the scale go slightly above the upper limit of
# the Y axis, to allow room for the points so that
# they do not get lost in the figure.
theme_Mac()
# Fig. 5.23
par(ask=TRUE); ProbPASSEDFinalAssessmentByPOSTMeasure1.fig
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, and .xlsx file formats.
FIPSCountyStateCodesUSDA.csv
KSWheatYieldBushelPerAcre1934to2007.csv
NorYMultipleVariablesPrediction.csv
RBIRTHandRDEATHBeforeAdjustment2015Onward.xlsx
RBIRTHandRDEATHMidwest2015Onward.xlsx
RBIRTHandRDEATHNortheast2015Onward.xlsx
RBIRTHandRDEATHSouth2015Onward.xlsx
RBIRTHandRDEATHWest2015Onward.xlsx
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 6
Use of R-Based APIs (Application
Programming Interface) to Obtain Data
Data and money are similar in that they both have potential value, if used correctly.
Money kept in a locked box has limited value, but money gains value when it is
judiciously spread throughout a community by spending and respending, spawning
what is called the multiplier effect and ultimately increased economic impact.
Following along with this paradigm, data that are locked away and difficult to obtain
also have limited value due to disuse, but data gain value when they spread
throughout a community, allowing use, reuse, and impact, often in clever ways that
the data originator never imagined. Consider inquiries into the SARS-CoV-2 virus and
how freely available data relating to this virus have been used to develop helpful
efforts to combat COVID-19 (and possibly future infectious diseases), one example
being the now-recognized differing impacts of the virus by age group and how
mitigation efforts for different age profiles were justified by health officials. This
mitigation strategy was only justified by the free exchange of massive amounts of
reliable and valid data, from multiple resources, examined by multiple researchers.
The free exchange of data, often massive amounts of data, is not new. Early
efforts to share data between and among data scientists were developed concurrent
to the development and continuing enhancement of what is now known as the
Internet:
• First released in the early 1970s and later improved upon over many years of
effort, ftp (File Transfer Protocol) was one of the earliest tools used to move data
from one location to another, where data were exchanged between a local host
and a remote host. The use of ftp required advance planning, permissions, and
the deployment of necessary software and system configurations, but ftp was a
best practice at one time.1
• Developed in the early 1990s, the gopher protocol was a communication tool that
used a menu-type listing of resources that allowed easy distribution of data.
There were limitations on the applicability of gopher, and by the late 1990s, it
fell out of favor as an information-sharing tool.
• The World Wide Web (WWW) gained widespread use after its early 1990s
release and has since become the dominant tool for access to distributed informa-
tion in its many forms. Ease of use and widespread applicability across multiple
platforms, operating systems, browsers, etc. have contributed to its nearly uni-
versal use.
• Given this very brief overview of early attempts at data transfer, this introductory
text is focused on API (Application Programming Interface) clients and how
associated R-based functions are used as an increasingly popular tool for data
acquisition among data scientists. Without going into detail that would be well
beyond the scope of this text, the most dominant standard for API development
and use is Representational State Transfer (REST). The REST paradigm is struc-
tured along the concept of data request and response. Using this approach, a
client-based request for data, using HTTP and associated URLs, is made to a
server and if executed correctly the data are typically sent back in JSON format.2
The first three addenda for this lesson demonstrate how R-based API client func-
tions (e.g., functions from specific R packages) are used to obtain data. However,
knowing that some explicit detail on the heuristics of APIs is needed, review the
fourth addendum for this lesson to see the more detailed structure for the API pro-
cess and how JSON data are obtained in raw format and then put into a more man-
ageable and tidier format. Work by many individuals and their contributed R-based
API client functions now have a prime role in the fairly simple way by which R is
used to obtain freely available data.
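As a preview of that fourth addendum, the request-and-response pattern generally reduces to a few steps; a minimal sketch using the httr and jsonlite packages follows (the URL is only a placeholder, not a real endpoint):
APIResponse <- httr::GET("https://example.com/api/endpoint")
# Send an HTTP GET request to the (hypothetical) endpoint.
httr::status_code(APIResponse)
# A status code of 200 indicates success.
APIText <- httr::content(APIResponse, as="text", encoding="UTF-8")
# Extract the body of the response as raw JSON text.
APIData <- jsonlite::fromJSON(APIText)
# Convert the JSON into an R object, often a data frame or a list.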
Give special attention to the R-based APIs demonstrated in the first three addenda
for this lesson. Before any attempt is made to reproduce the efforts in these addenda,
look at the documentation for each R function that serves as an API client. As an
1 Search on the terms chmod 700 filename and chmod 777 filename and consider differences in
data protection early on with the use of distributed systems and current practices.
2 Although it is an oversimplification, think of REST as a scenario where a researcher at
Computer A sends a message to Computer B, asking for data. Computer B has been set so that
the message is then reviewed to see if the request is structured correctly, if the
authentication process (if any) confirms that the researcher is qualified to receive the
data, if the data are available, etc. If all requirements are met, then Computer B sends
the requested data back to Computer A, for the requesting researcher to use the data as
desired.
APIs and the Need for a Key
Many data resources, especially those associated with the federal government, use
a key to provide some degree of authentication and later command and control over
data access. Regarding the use of a key, authentication of course refers to user
identity. Command and control usually refer not only to the specific data made
available overall but also to the number of data requests and the number of variables
within each data request, although this approach is unique to each agency. There are
also controls built into the process to provide assurance that a human is responsible
for the data request and that the request is not from an automated process that could
possibly overwhelm the server housing the data.3
A wide variety of approaches are used regarding the ease of use and authentica-
tion for data access from many different API-based resources:
• Review Addendum 3 in this lesson for an example where no authentication is
needed to obtain data from the United States Environmental Protection Agency
(EPA). In this example, it is only necessary to know the URLs associated with
the data, data in this example specific to air pollution from electricity power plants.
• Review the front matter in the first lesson to see how data from the United States
Department of Education’s Integrated Postsecondary Education Data System
(IPEDS) are made available, but only by interacting with a Web-based Graphical
3 A key should be treated as if it were a password. A key provides a unique identification
of the approved user. Do not share a key with anyone else, in the same way that passwords
should always be kept private.
User Interface (GUI) resource where multiple selections and a click here and
click there approach is used for eventual data retrieval. Anyone with Internet
access can obtain the data. Required authentication is minimal, calling for the
numeric code of a selected postsecondary institution from among the thousands
of postsecondary institutions in the United States. Yet, there is no attempt at
verification of user identity, association with the selected institution, or intended
use of the data.
• An R-based API client (e.g., the function owidR::owid()) is used in the second
lesson (Addendum 1) to obtain data from Our World in Data. Observe how the
data requests associated with this data retrieval process require no key, no authen-
tication, etc. Anyone with interest and sufficient skills can obtain the data, data
that are freely available.
• However, a key is needed for data acquisition from many federal agencies, espe-
cially for large data requests. Although the details are unique to each agency,
look at instructions for key access and use for a few of the many resources that
use keys, knowing that there are many other federal agencies that also facilitate
data access by use of a readily available key:
–– Census Bureau: Complete the form at https://api.census.gov/data/key_signup.
html to obtain a key.
–– Bureau of Labor Statistics: Complete the form at https://data.bls.gov/registra-
tionEngine/ to obtain a key.
–– Department of Agriculture National Agricultural Statistics Service (USDA
NASS): Complete the form at https://quickstats.nass.usda.gov/api/ to
obtain a key.
There is a continuing conversation among those who use API keys on whether it
is best to declare the key in syntax each time it is used or to store the key one
time in the R environment, possibly in the .Rprofile file or the .Renviron file.
No judgment is offered here since this is a personal preference. Each approach
has advantages:
• The use of an API key is highly visible when it is embedded into syntax, but of
course the contents of the API key can then be easily seen by others whenever the
syntax is made visible, such as when script files like those in the Housekeeping
section of this text are shared with others.
• The one-time storage of an API key makes it unnecessary to declare the key each
time it is used, and API key contents are less visible to others, allowing greater
privacy and protection from misuse.
This is all a personal choice on how API keys are deployed. Just remember to
treat an API key as if it were a password, and do not share it with others.
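For those who prefer the one-time storage approach, the tidycensus package supports writing the key to the .Renviron file; a brief sketch (the key shown is only a placeholder):
tidycensus::census_api_key(
  "Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
  install=TRUE)
# With install=TRUE the key is saved to the .Renviron file and is
# then available in later sessions.
Sys.getenv("CENSUS_API_KEY")
# Confirm that the stored key can be retrieved.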
Structure of an API to Automate Data Retrieval
It has been mentioned more than a few times in this text that family income and
other indicators related to wealth and its opposite, poverty, are statistics that are
critical to those who work in public health, an area that regularly uses biostatistics.
The devastating impact of poverty on health, current and future, simply cannot be
overstated. For researchers in the United States, the Census Bureau is likely the best
data source for reliable, valid, and current poverty and wealth statistics.
Before reviewing this section, review Census Bureau guidelines on APIs, such as
Working with the Census Data API, https://www.census.gov/content/dam/Census/
library/publications/2020/acs/acs_api_handbook_2020_ch02.pdf. Then review
documentation for the R-based tidycensus package at https://cran.r-project.org/
web/packages/tidycensus/index.html and give attention to the tidycensus::get_acs()
function, which serves as an R-based API client, greatly facilitating the ease of use
as Census Bureau American Community Survey (ACS) data are obtained. Be sure
to ask for, obtain, use, and keep private a unique Census Bureau key.
Review and deploy the syntax presented in this section, simply to see how easy
it is to use the tidycensus::get_acs() function. Then, as an interesting alternative,
try to obtain the same data using Census Bureau graphical menus and see whether that
attempt yields comparable results without extensive effort.4 It is worth the time for
a data scientist to use APIs, as opposed to more cumbersome graphical menus that leave
behind no record of actions, limiting future replication.
4 To continue with an evaluation of available R-based API client functions, compare ease of
use and outcomes from use of the tidycensus::get_acs() function and the acs::acs.fetch()
function. Try both API clients over multiple queries to see if there is a reason to prefer
one API client over the other. It has been decided to use the tidycensus::get_acs()
function in this text, but the acs::acs.fetch() function also has great value.
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "D:/R_Packages")
# As a preference, all installed packages
# will now go to the external D:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("D:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
When looking at the syntax in the Housekeeping section and in the section where
packages are declared:
• A new external storage device was used, and it is named the D: drive, not the
F: drive.
• A new version of R was used and as such the # comment character was not used
in front of the install.packages() function, so that all required packages go to the
new drive (e.g., .libPaths(new = “D:/R_Packages”)).
• The update.packages() function was not used, but it is an option. If used, allow
time since it can be quite a long process.
install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
install.packages("readxl", dependencies=TRUE)
library(readxl)
install.packages("magrittr", dependencies=TRUE)
library(magrittr)
install.packages("janitor", dependencies=TRUE)
library(janitor)
install.packages("rlang", dependencies=TRUE)
library(rlang)
install.packages("htmltools", dependencies=TRUE)
library(htmltools)
install.packages("httr", dependencies=TRUE)
library(httr)
install.packages("jsonlite", dependencies=TRUE)
library(jsonlite)
install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
install.packages("ggtext", dependencies=TRUE)
library(ggtext)
install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
install.packages("scales", dependencies=TRUE)
library(scales)
install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
install.packages("cowplot", dependencies=TRUE)
library(cowplot)
install.packages("writexl", dependencies=TRUE)
library(writexl)
###############################################################
install.packages("tidycensus", dependencies=TRUE)
library(tidycensus)
# CAUTION: The tidycensus package may take longer to
# download than other packages. Be patient.
tidycensus::census_api_key(
"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
install.packages("acs", dependencies=TRUE)
library(acs)
acs::api.key.install(
"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
acs.tables.install()
###############################################################
# Mapping #
###############################################################
install.packages("htmlTable", dependencies=TRUE)
library(htmlTable)
install.packages("maptools", dependencies=TRUE)
library(maptools)
install.packages("Rcpp", dependencies=TRUE)
library(Rcpp)
install.packages("rgdal", dependencies=TRUE)
library(rgdal)
install.packages("rgeos", dependencies=TRUE)
library(rgeos)
install.packages("sf", dependencies=TRUE)
library(sf)
install.packages("sp", dependencies=TRUE)
library(sp)
install.packages("stars", dependencies=TRUE)
library(stars)
install.packages("terra", dependencies=TRUE)
library(terra)
install.packages("xfun", dependencies=TRUE)
library(xfun)
install.packages("choroplethr", dependencies=TRUE)
library(choroplethr)
install.packages("choroplethrMaps", dependencies=TRUE)
library(choroplethrMaps)
install.packages("choroplethrAdmin1", dependencies=TRUE)
library(choroplethrAdmin1)
###############################################################
theme_Mac <- function() {   # NOTE: The opening lines of this
  ggplot2::theme(           # user-defined theme fall outside this
                            # excerpt; reconstructed minimally here.
    axis.ticks.length=unit(0.25,"cm"),
    panel.background=element_rect(fill="whitesmoke")
  )
}
# hjust - horizontal justification; 0 = left edge to 1 = right
# edge, with 0.5 the default
# vjust - vertical justification; 0 = bottom edge to 1 = top
# edge, with 0.5 the default
# angle - rotation; generally 1 to 90 degrees, with 0 the
# default
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
###############################################################
Go back to the Housekeeping section, and observe how the object variable
ACS2019 was created by using the tidycensus::load_variables() function. Open the
file saved as ACS2019.csv, a file of more than 25,000 lines, and spend a few hours
(yes, hours) reviewing the many topics addressed in the American Community
Survey (ACS) and corresponding breakouts. The column marked label provides a
narrative description of the topic for which data are available, and the column
marked name provides the table number and associated breakout code(s). By search-
ing from among the many variables involving money, it was decided that Table
B19013 would be the most appropriate table for this inquiry, where an estimate is
provided for breakouts of Median household income in the past 12 months (in 2021
inflation-adjusted dollars), with attention to breakouts by householder race-ethnicity
as well, detailed in the column marked concept.
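The Housekeeping lines that created ACS2019 are not reproduced in this excerpt; a sketch consistent with the description above (the exact arguments are assumptions) follows:
ACS2019 <- tidycensus::load_variables(2019, "acs1", cache=TRUE)
# Obtain the full listing of ACS variable names, labels, and
# concepts for the selected year and survey.
utils::write.csv(ACS2019, file="ACS2019.csv", row.names=FALSE)
# Save the listing as ACS2019.csv for unhurried review.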
With Housekeeping, environment setup, and table selection out of the way, use
the tidycensus::get_acs() function, serving as an R-based API client, to obtain the
Census Bureau American Community Survey data on Tennessee median household
income (e.g., code B19013_001) by county for 2019, initially using the ACS1 data-
set but later using the ACS5 dataset.
TNMedHouseIncomeB19013_001acs1_2019.tbl <-
tidycensus::get_acs(
# Use the tidycensus::get_acs() function as an R-based API
# client, to obtain American Community Survey (ACS) data from
# the Census Bureau.
geography="county",# Data for all counties, if available
variables = c("B19013_001"), # All Race-Ethnicity Groups
# Breakout Codes Race-Ethnicity Breakout Groups
# "B19013A_001" WHITE
# "B19013B_001" BLACK OR AFRICAN AMERICAN ALONE
# "B19013C_001" AMERICAN INDIAN AND ALASKA NATIVE ALONE
# "B19013D_001" ASIAN ALONE
# "B19013E_001" NATIVE HAWAIIAN OTHER PACIFIC ISLANDER
# "B19013F_001" SOME OTHER RACE ALONE
# "B19013G_001" TWO OR MORE RACES
# "B19013H_001" WHITE ALONE, NOT HISPANIC OR LATINO
# "B19013I_001" HISPANIC OR LATINO
year = 2019, # Year
output="tidy", # Output, tidy or wide
state = "TN", # State, Tennessee
geometry=FALSE, # Return sf geometry TRUE or FALSE
moe_level=95, # Margin of error
survey="acs1", # acs1, more restrictive than acs5
show_call=TRUE) # Display call made to Census API
# Note about survey= "acs1" or "acs5":
# acs1 - Data for areas with populations of 65,000+
# acs5 - Data for all areas
# Within what may seem to be but a fraction of a second,
# the following text appears on the screen:
# The 1-year ACS provides data for geographies with
# populations of 65,000 and greater.
# Getting data from the 2019 1-year ACS
# Using FIPS code '47' for state 'TN'
#
# An URL is also provided, preceded by the expression
# Census API call:
Challenge: As an interesting activity, put the URL for this Census API call into a
browser and see what is returned – a JSON-based data set. Review the data, espe-
cially before viewing Addendum 4 in the back matter of this lesson.
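As a further preview of Addendum 4, the same JSON could be pulled directly into R with the httr and jsonlite packages. The URL below is only illustrative of the general form reported by show_call=TRUE (the key placeholder must be replaced with a valid key):
CensusURL <- base::paste0(
  "https://api.census.gov/data/2019/acs/acs1",
  "?get=NAME,B19013_001E,B19013_001M",
  "&for=county:*&in=state:47",
  "&key=Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx")
CensusJSON <- httr::content(httr::GET(CensusURL),
  as="text", encoding="UTF-8")
CensusMatrix <- jsonlite::fromJSON(CensusJSON)
# The Census API returns rows of values in which the first row
# holds the column names.
CensusDF <- base::as.data.frame(CensusMatrix[-1, ],
  stringsAsFactors=FALSE)
base::names(CensusDF) <- CensusMatrix[1, ]
utils::head(CensusDF)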
base::attach(TNMedHouseIncomeB19013_001acs1_2019.tbl)
TNMedHouseIncomeB19013_001acs1_2019.tbl
# A tibble: 20 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 47001 Anderson County, Tennessee B19013_001 49044 6566.
2 47009 Blount County, Tennessee B19013_001 59276 5393.
3 47011 Bradley County, Tennessee B19013_001 52468 6135.
4 47037 Davidson County, Tennessee B19013_001 63938 2906.
5 47059 Greene County, Tennessee B19013_001 44722 6521.
6 47065 Hamilton County, Tennessee B19013_001 57502 3373.
7 47093 Knox County, Tennessee B19013_001 60283 2892.
8 47113 Madison County, Tennessee B19013_001 49820 4730.
9 47119 Maury County, Tennessee B19013_001 66781 3080
10 47125 Montgomery County, Tennessee B19013_001 56948 5282.
11 47141 Putnam County, Tennessee B19013_001 49282 5667.
12 47147 Robertson County, Tennessee B19013_001 66939 8000.
13 47149 Rutherford County, Tennessee B19013_001 69397 3951.
14 47155 Sevier County, Tennessee B19013_001 57741 6989.
15 47157 Shelby County, Tennessee B19013_001 52614 2951.
16 47163 Sullivan County, Tennessee B19013_001 51860 5197.
17 47165 Sumner County, Tennessee B19013_001 68743 4634.
18 47179 Washington County, Tennessee B19013_001 51428 6239.
19 47187 Williamson County, Tennessee B19013_001 115507 9084.
20 47189 Wilson County, Tennessee B19013_001 80071 7785.
writexl::write_xlsx(
TNMedHouseIncomeB19013_001acs1_2019.tbl,
path =
"D:\\R_Ceres\\TNMedHouseIncomeB19013_001acs1_2019.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("TNMedHouseIncomeB19013_001acs1_2019.xlsx")
base::file.info("TNMedHouseIncomeB19013_001acs1_2019.xlsx")
base::list.files(pattern =".xlsx")
The output displayed ACS1 2019 B19013_001 data for 20 Tennessee counties.
However, Tennessee has 95 counties. Perhaps the ACS5 survey will be more inclu-
sive. Note how the only difference in the syntax is the one line of code, declaring
acs5 instead of acs1.
TNMedHouseIncomeB19013_001acs5_2019.tbl <-
tidycensus::get_acs(
# Use the tidycensus::get_acs() function as an R-based API
# client, to obtain American Community Survey (ACS) data from
# the Census Bureau.
geography="county",# Data for all counties, if available
variables = c("B19013_001"), # All Race-Ethnicity Groups
# Breakout Codes Race-Ethnicity Breakout Groups
# "B19013A_001" WHITE
# "B19013B_001" BLACK OR AFRICAN AMERICAN ALONE
# "B19013C_001" AMERICAN INDIAN AND ALASKA NATIVE ALONE
# "B19013D_001" ASIAN ALONE
# "B19013E_001" NATIVE HAWAIIAN OTHER PACIFIC ISLANDER
# "B19013F_001" SOME OTHER RACE ALONE
# "B19013G_001" TWO OR MORE RACES
# "B19013H_001" WHITE ALONE, NOT HISPANIC OR LATINO
# "B19013I_001" HISPANIC OR LATINO
year = 2019, # Year
output="tidy", # Output, tidy or wide
state = "TN", # State, Tennessee
geometry=FALSE, # Return sf geometry TRUE or FALSE
moe_level=95, # Margin of error
survey="acs5", # acs5, less restrictive than acs1
show_call=TRUE) # Display call made to Census API
# Note about survey= "acs1" or "acs5":
# acs1 - Data for areas with populations of 65,000+
# acs5 - Data for all areas
base::attach(TNMedHouseIncomeB19013_001acs5_2019.tbl)
TNMedHouseIncomeB19013_001acs5_2019.tbl
# A tibble: 95 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 47001 Anderson County, Tennessee B19013_001 50392 2465.
2 47003 Bedford County, Tennessee B19013_001 50415 2663.
3 47005 Benton County, Tennessee B19013_001 37512 3367.
4 47007 Bledsoe County, Tennessee B19013_001 44122 4107.
5 47009 Blount County, Tennessee B19013_001 56667 2147.
6 47011 Bradley County, Tennessee B19013_001 51331 2347.
7 47013 Campbell County, Tennessee B19013_001 39803 2526.
8 47015 Cannon County, Tennessee B19013_001 55330 4619.
9 47017 Carroll County, Tennessee B19013_001 42637 2525.
10 47019 Carter County, Tennessee B19013_001 38092 2313.
# 85 more rows
# Use `print(n = ...)` to see more rows
writexl::write_xlsx(
TNMedHouseIncomeB19013_001acs5_2019.tbl,
path =
"D:\\R_Ceres\\TNMedHouseIncomeB19013_001acs5_2019.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("TNMedHouseIncomeB19013_001acs5_2019.xlsx")
base::file.info("TNMedHouseIncomeB19013_001acs5_2019.xlsx")
base::list.files(pattern =".xlsx")
As expected, the ACS5 survey is more inclusive: by pooling data over multiple
years, it provided a display of Tennessee Median Household Income statistics for
all 95 counties.
This brief display of American Community Survey data was purposely used to
provide a useful introduction to the complexity of Application Programming
Interface (API) construction. Know the nature of selected APIs and what they can
possibly provide, but be equally aware of their limitations and of what they do not
provide. Many R functions that serve as API clients have been purposely prepared to
follow along with a tidyverse approach to data science. It is not a mistake that the
tidycensus package includes the term tidy in its name; there is no mistaking how data,
when returned, are easily put into tidy format merely by using the output="tidy"
argument. Other packages may have functions that serve as API clients, but are the
data brought back in tidy format? This is clearly a desired output.
Examine the two datasets gained from initial use of the tidycensus::get_acs()
function:
• TNMedHouseIncomeB19013_001acs1_2019.tbl
• TNMedHouseIncomeB19013_001acs5_2019.tbl
Both datasets provide useful information on household income statistics:
• GEOID, State and county FIPS (Federal Information Processing System) code
A few actions could be used to manipulate the returned dataset and put it into desired
format, but it was returned in tidy format so it may not be necessary to go to great
lengths to organize the data. In the current format, it would be possible to prepare
descriptive statistics of median family income by county. For those who know the
5 Use Census Bureau resources at https://www.census.gov/programs-surveys/geography/guidance/
geo-areas/urban-rural.html and give attention to the file A state-sorted list of all 2020
Census Urban Areas for the US, Puerto Rico, and Island Areas first sorted by state FIPS
code, then sorted by Urban Area Census (UACE) code.
many counties in Tennessee, purposeful selection and deletions of both rows and
columns could be used to facilitate desired inferential analyses. With greater effort,
it is possible to merge the dataset with other data to support further inquiries, such
as comparisons of median family income by race-ethnicity breakouts, by level of
education breakouts, or by county population density.
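As one small example of the descriptive statistics mentioned above (a sketch following the summarize pattern used earlier in this text):
TNMedHouseIncomeB19013_001acs5_2019.tbl %>%
  dplyr::summarize(
    N = base::length(estimate),
    Minimum = base::min(estimate, na.rm=TRUE),
    Median = stats::median(estimate, na.rm=TRUE),
    Mean = base::mean(estimate, na.rm=TRUE),
    SD = stats::sd(estimate, na.rm=TRUE),
    Maximum = base::max(estimate, na.rm=TRUE))
# Descriptive statistics of 2019 median household income across
# all 95 Tennessee counties.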
Use the choroplethr package to create a choropleth-type map using data for all 95
Tennessee counties. It is important to know, however, that the choroplethr package
is organized so that the map is based on two required names for the object variables
in question: region (e.g., GEOID) and value (e.g., estimate, when using ACS sur-
veys, either acs1 or acs5). Make those accommodations before the choroplethr-
based maps are created.
TNMedHouseIncomeB19013_001acs5_2019.map <-
TNMedHouseIncomeB19013_001acs5_2019.tbl %>%
dplyr::rename(region=GEOID, value=estimate)
# When using the dplyr::rename() function to rename a column,
# the format is New_Name and then Old_Name, which for some
# may not be intuitive. Notice also how a single = character
# is used in this instance.
TNMedHouseIncomeB19013_001acs5_2019.map$region <-
as.numeric(TNMedHouseIncomeB19013_001acs5_2019.map$region)
# Put region (e.g., FIPS codes) into numeric format.
TNMedHouseIncomeB19013_001acs5_2019.map$value <-
as.numeric(TNMedHouseIncomeB19013_001acs5_2019.map$value)
# Put value (e.g., measured data) into numeric format.
base::attach(TNMedHouseIncomeB19013_001acs5_2019.map)
TNMedHouseIncomeB19013_001acs5_2019.map
DraftMapTNMedHouseIncomeB19013_001acs5_2019.fig <-
choroplethr::county_choropleth(
TNMedHouseIncomeB19013_001acs5_2019.map,
title=
"Tennessee 2019 Median Household Income by County",
legend = "Median Household Income",
num_colors = 9,
state_zoom = c("tennessee")) +
theme(plot.title = element_text(hjust = 0.5))
par(ask=TRUE); DraftMapTNMedHouseIncomeB19013_001acs5_2019.fig
TNMedHouseIncomeB19013_001acs5_2019.fig <-
choroplethr::county_choropleth(
TNMedHouseIncomeB19013_001acs5_2019.map,
title=
"Tennessee 2019 Median Household Income by County",
legend = "Median Household Income",
state_zoom=c("tennessee")) +
scale_fill_brewer(palette="Oranges") +
guides(fill=guide_legend(title=
"Median Household Income")) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="bottom")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is moved to the
# bottom of the figure, the title is centered, etc. Other
# embellishments could be provided, but it is judged that
# they are not needed.
# Fig. 6.1
par(ask=TRUE); TNMedHouseIncomeB19013_001acs5_2019.fig
Fig. 6.1
options(tigris_use_cache = TRUE)
TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl <-
tidycensus::get_acs(
# Use the tidycensus::get_acs() function as an R-based API
# client, to obtain American Community Survey (ACS) data from
# the Census Bureau.
#
# The data will be at a more granular level (census tract)
# and, with geometry = TRUE, are returned with simple
# features geometry suitable for mapping.
variables = "B19013_001",
state = "TN",
county = "Davidson",
geography = "tract",
geometry = TRUE,
year = 2019, # Year
output="tidy", # Output, tidy or wide
moe_level=95, # Margin of error
survey="acs5", # acs5, less restrictive than acs1
show_call=TRUE) # Display call made to Census API
# Note about survey= "acs1" or "acs5":
# acs1 - Data for areas with populations of 65,000+
# acs5 - Data for all areas
base::attach(TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl)
TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl
# A partial display of output, when geometry = TRUE is
# displayed, but parts of the output have been altered
# in appearance to allow for space.
GEOID geometry
1 47037018409 MULTIPOLYGON (((-86.98225 3...
2 47037016000 MULTIPOLYGON (((-86.77265 3...
3 47037011800 MULTIPOLYGON (((-86.77245 3...
4 47037017901 MULTIPOLYGON (((-86.84431 3...
5 47037018101 MULTIPOLYGON (((-86.885 36....
6 47037014200 MULTIPOLYGON (((-86.81129 3...
7 47037015804 MULTIPOLYGON (((-86.71027 3...
8 47037017402 MULTIPOLYGON (((-86.73944 3...
9 47037010702 MULTIPOLYGON (((-86.7287 36...
10 47037015618 MULTIPOLYGON (((-86.63462 3...
writexl::write_xlsx(
  TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl,
  path =
  "D:\\R_Ceres\\TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.xlsx",
  col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists(
  "TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.xlsx")
base::file.info(
  "TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.xlsx")
base::list.files(pattern =".xlsx")
TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.tbl %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = "orange") +
scale_fill_viridis_c(labels = scales::dollar) +
ggtitle(
"Davidson County, Tennessee (e.g., Nashville), Median Family
Income by Census Tract: 2019") +
theme(plot.title = element_text(hjust = 0.5))
# The fill scale needs to be declared only once.
# Much more, using the full set of available ggplot tools,
# could be done to improve the presentation of this figure.
# Fig. 6.2
The purpose of this syntax is to show how easy it is to obtain data using API functions associated with the tidyverse ecosystem (e.g., tidycensus::get_acs()) and to then use other functions associated with the tidyverse ecosystem to add value to the data by preparing figures (e.g., choroplethr::county_choropleth(), ggplot2::ggplot()) that communicate to the public the meaning of the data.
Fig. 6.2
Challenge: Use both the narrative and syntax in this addendum and other addenda
in this lesson to practice with APIs and the use of R. Then, in anticipation of later
needs and career advancement long after this text has been completed, put the nar-
rative and outcomes from these addenda into technical memorandum format. For
format, use the standard process touted throughout this text and equally seen among
other professionals in data science:
• Background
–– Description of the Data
–– Null Hypothesis
• Import Data
• Code Book and Data Organization
• Exploratory Graphics
–– Graphics Using Base R
–– Graphics Using the tidyverse Ecosystem
• Exploratory Descriptive Statistics and Measures of Central Tendency
• Exploratory Analyses
• Presentation of Outcomes
Let the technical memorandum sit for a few days, and then go back and review it
carefully, correcting any errors and problems with the presentation that may have
escaped attention at the time of preparation. As an additional quality assurance prac-
tice, ask a few trusted colleagues to review it for content, comprehension, flow, etc.
Avoid asking friends who feel obliged to offer congratulatory comments. Instead,
ask those who have a keen sense of professionalism to review it. It may be helpful
to ask the reviewer(s) to consider the following rubric, used for each memorandum:
• Background: What is the general theme for the summary memorandum? Why
were the data obtained, what do the data represent, and what are the characteris-
tics of the data? Does the Null Hypothesis, if included, address the identified
problem such that meaningful outcomes will likely come from analyses?
• Literature or Expertise: Although unlikely at these early stages, was any profes-
sional literature cited in the memorandum? Is the literature from a highly
regarded professional publication, a peer-reviewed publication accepted by the
scientific community?
• Methods: Would the identified methods for analysis allow replication? How were
the data obtained and how were the data organized? Are there any references to
how the original data could be obtained for those who wish to replicate and
expand upon cited analyses? Are the data in the public domain, allowing access
to all? Or are the data private (e.g., proprietary), requiring permission(s) for use?
• Descriptive Statistics: What are the measures of central tendency for numerically
oriented data (e.g., weight and systolic blood pressure) and what are the fre-
quency distributions for the factor-type data (e.g., race-ethnicity, gender) and
logical data (e.g., 0 and 1, Fail and Pass, Die and Survive)? Are the cited descrip-
tive statistics broad or are they specific for multiple breakouts (e.g., weight over-
all and then weight by gender, systolic blood pressure overall and then systolic
blood pressure by race-ethnicity)?
• Tables and Graphics: Were either tables or graphics used to describe the data and
the relationship(s) between and among the data? Describe the table(s), if used.
Describe the graphic(s), if used. Do both the tables and graphics support reader
comprehension?
• Inferential Analyses and Estimates of Association: Was there any attempt to
identify specific statistical tests in support of decision-making, to determine dif-
ferences between and among selected variables or groupings of variables, or to
determine if there is any meaningful relationship between and among variables
or groupings of variables? If not, was there at least a broad reference to statistical
analysis other than descriptive statistics (e.g., mean, standard deviation, median,
mode), percentages, and frequency distributions? If a statistically significant dif-
ference between groups is identified, is the term statistically significant used, and
if so, is a specific criterion p-value cited (e.g., 0.05, 0.01, 0.001)?
• Summary: How would the reader (as a consumer, informed citizen, taxpayer, voter, professional worker, dean, supervisor, etc.) benefit from reading this memorandum?
Addendum 1: Use of the tidyUSDA::getQuickstat() API
install.packages("tidyUSDA", dependencies=TRUE)
library(tidyUSDA)
IACountyCornYieldBuAc2021.tbl <-
tidyUSDA::getQuickstat(
key = 'UseTheKeyProvidedAtSign-Upxxxxxxxxxx',
# Based on a process similar to tidycensus,
# use the USDA NASS API key provided at
# https://quickstats.nass.usda.gov/api.
program = 'SURVEY',
year = '2021',
# The data will be specific to 2021, only.
# period = 'YEAR',
# There are many arguments associated with
# the tidyUSDA::getQuickstat() function. All
# are not needed for every attempt at data
# retrieval. A few are included in this
# syntax, to demonstrate their inclusion, but
# made inactive by using the # comment
# character.
geographic_level = 'COUNTY',
# county = c('COUNTY_1', 'COUNTY_2', 'etc.'),
# The remaining arguments below are assumed; they mirror the
# all-years query shown later in this addendum, restricted
# here to 2021, and include geometry = TRUE, consistent with
# the geometry column that appears in later output.
state = 'IOWA',
commodity = 'CORN',
data_item = 'CORN, GRAIN - YIELD, MEASURED IN BU / ACRE',
domain = 'TOTAL',
geometry = TRUE)
Footnote 6: If the NASS GUI were used, think of all the times it may be necessary to click (or not click) on either Year (from the 1920s into the early 2020s) or County (Iowa has 99 counties). Reproducible syntax seems like a good idea when confronted with this user experience.
Data scientists should assume little and verify much. With this caution as a general approach to the workflow, it is a best practice to verify how the returned data, IACountyCornYieldBuAc2021.tbl, are structured. The name of the tidyUSDA::getQuickstat() function might suggest that the data are returned in tidy format and, correspondingly, that the data are organized as a tibble, but is this assumption true? Use the tibble::is_tibble() function to verify the data format.
tibble::is_tibble(IACountyCornYieldBuAc2021.tbl)
[1] FALSE
Footnote 7: The tidyUSDA::getQuickstat() function by default returns a dataframe, not a tibble.
base::is.data.frame(IACountyCornYieldBuAc2021.tbl)
[1] TRUE
Use the tibble::as_tibble() function to address this concern. It is perhaps a minor concern, but it should still be addressed early on to remain consistent with the tidyverse ecosystem.
tibble::as_tibble(IACountyCornYieldBuAc2021.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
# A tibble: 85 × 59
GEOID year ALAND unit_desc short_desc Value
<chr> <int> <dbl> <chr> <chr> <dbl>
1 19021 2021 1488981867 BU / ACRE CORN, GRAIN - Y… 204.
2 19041 2021 1469139465 BU / ACRE CORN, GRAIN - Y… 203.
3 19059 2021 985574547 BU / ACRE CORN, GRAIN - Y… 178.
4 19063 2021 1025349325 BU / ACRE CORN, GRAIN - Y… 182.
5 19119 2021 1521901261 BU / ACRE CORN, GRAIN - Y… 217.
6 19141 2021 1484153957 BU / ACRE CORN, GRAIN - Y… 214.
7 19143 2021 1032593803 BU / ACRE CORN, GRAIN - Y… 208
8 19147 2021 1460400553 BU / ACRE CORN, GRAIN - Y… 190.
9 19149 2021 2234715257 BU / ACRE CORN, GRAIN - Y… 191.
10 19151 2021 1495047211 BU / ACRE CORN, GRAIN - Y… 214.
# 75 more rows
# 46 more variables: domain_desc <chr>, agg_level_desc <chr>,
# Use `print(n = ...)` to see more rows
With all upfront work accomplished, it is now possible to continue with standard
operations.
base::getwd()
base::ls()
base::attach(IACountyCornYieldBuAc2021.tbl)
utils::str(IACountyCornYieldBuAc2021.tbl)
dplyr::glimpse(IACountyCornYieldBuAc2021.tbl)
base::summary(IACountyCornYieldBuAc2021.tbl)
Look at descriptive statistics for 2021 Iowa corn yields (e.g., Value is listed in
this tibble) expressed as bushels per acre by agricultural district (e.g., asd_desc):
IACountyCornYieldBuAc2021.tbl %>%
dplyr::group_by(asd_desc) %>%
dplyr::filter(asd_desc %in% c(
"CENTRAL",
"EAST CENTRAL",
"NORTH CENTRAL",
"NORTHEAST",
"NORTHWEST",
"SOUTH CENTRAL",
"SOUTHEAST",
"SOUTHWEST",
"WEST CENTRAL")) %>%
# As a data trap quality assurance measure, retain the
# rows for agricultural districts (e.g., asd_desc) that
# are properly named and delete rows for all others,
# most importantly those rows with no value listed for
# asd_desc.
dplyr::summarize(
N = base::length(Value),
Minimum = base::min(Value, na.rm=TRUE),
Median = stats::median(Value, na.rm=TRUE),
Mean = base::mean(Value, na.rm=TRUE),
SD = stats::sd(Value, na.rm=TRUE),
Maximum = base::max(Value, na.rm=TRUE),
Missing = base::sum(is.na(Value))) %>%
base::print(n = 15)
# Descriptive statistics are generated by first using the
# dplyr::group_by() function against the object variable
# asd_desc. Then, to accommodate either poor data entry or
# no data entry for asd_desc (think of the row associated
# with OTHER COUNTIES, where there is no assigned datum for
# agricultural district), see how the dplyr::filter()
# function was used. Finally, the dplyr::summarize() function
# is used against a set of selected functions, all in an
# effort to make a neatly presented summary for all named
# agricultural districts.
# A tibble: 9 x 9
asd_desc N Minimum Median Mean SD Maximum Missing
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 CENTRAL 11 194. 211. 210. 8.47 223. 0
2 EAST CEN~ 6 192. 207. 205. 7.02 212. 0
3 NORTH CE~ 11 193. 204. 202. 5.95 211. 0
4 NORTHEAST 11 194. 206. 206. 8.19 219 0
5 NORTHWEST 11 178. 205. 203. 14.6 222. 0
6 SOUTH CE~ 8 166. 177. 180. 12.8 200 0
7 SOUTHEAST 9 153. 168. 173. 15.5 201. 0
8 SOUTHWEST 7 193. 208. 208. 7.76 217. 0
9 WEST CEN~ 10 200. 220. 219. 9.26 232. 0
# ... with 1 more variable: geometry <GEOMETRY >
This is all very interesting, but one year (2021) does not make a trend. Rainfall problems (a drought, too much rain overall, too much rain at a specific time in the growing cycle, etc.), a late or early frost, high winds that cause stalk lodging, and similar events can be severe in one part of a state and of minimal concern in other areas, especially in a large state such as Iowa. A longer time frame would be needed to make any reasonable inquiry into within-state regional differences.
To obtain data over multiple years and to use R in an effort to automate (e.g.,
iterate) this action, the purrr package and specifically the map() function and its
many variations are often used, as was seen in prior lessons. However, not all data resources are the same, nor are all R-based APIs. For data retrieval from
the USDA NASS resource with use of the tidyUSDA::getQuickstat() function, note
how data over multiple years (actually, all years with available data) can be obtained
without use of the purrr package and its map() function(s).
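A minimal sketch (not part of the original syntax) of the purrr-based iteration mentioned above, applied to tidycensus::get_acs(); it assumes a registered Census API key, and the years shown are illustrative only.
ACSYears <- 2017:2019 # Illustrative years, only
TNMedHouseIncomeMultiYear.tbl <-
  purrr::map_dfr(ACSYears, function(useyear) {
    tidycensus::get_acs(
      geography = "county",
      state = "TN",
      variables = "B19013_001",
      survey = "acs5",
      year = useyear,
      output = "tidy") %>%
      dplyr::mutate(year = useyear)
      # Tag each block of rows with its survey year so
      # the stacked tibble remains tidy.
  })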
IACountyCornYieldBuAcStarttoEnd.tbl <-
tidyUSDA::getQuickstat(
key = 'UseTheKeyProvidedAtSign-Upxxxxxxxxxx',
program = 'SURVEY',
# period = 'YEAR',
# As the tidyUSDA::getQuickstat() function
# has been organized, if a specific year is
# not declared (e.g., year = '2021') and
# period = 'YEAR' is also not used, the
# function fetches data for all available
# years. In other packages, such as those in
# the tidycensus package, the purrr::map()
# function is used to iterate over multiple
# years.
state = 'IOWA',
commodity = 'CORN',
data_item = 'CORN, GRAIN - YIELD, MEASURED IN BU / ACRE',
domain = 'TOTAL'
)
tibble::is_tibble(IACountyCornYieldBuAcStarttoEnd.tbl)
[1] FALSE
tibble::as_tibble(IACountyCornYieldBuAcStarttoEnd.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
base::getwd()
base::ls()
base::attach(IACountyCornYieldBuAcStarttoEnd.tbl)
utils::str(IACountyCornYieldBuAcStarttoEnd.tbl)
dplyr::glimpse(IACountyCornYieldBuAcStarttoEnd.tbl)
base::summary(IACountyCornYieldBuAcStarttoEnd.tbl)
writexl::write_xlsx(
IACountyCornYieldBuAcStarttoEnd.tbl,
path = "D:\\R_Ceres\\IACountyCornYieldBuAcStarttoEnd.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("IACountyCornYieldBuAcStarttoEnd.xlsx")
base::file.info("IACountyCornYieldBuAcStarttoEnd.xlsx")
base::list.files(pattern =".xlsx")
Following along with prior actions, look at descriptive statistics for Iowa corn
yields (bushels per acre) since the beginning of data collection to the most current
entries in the online dataset. Give special attention to the output for the variable
year, ranging from 1866 to 2022 – quite a range of years for data on corn yields.
IACountyCornYieldBuAcStarttoEnd.tbl %>%
dplyr::group_by(asd_desc) %>%
dplyr::filter(asd_desc %in% c(
"CENTRAL",
"EAST CENTRAL",
"NORTH CENTRAL",
"NORTHEAST",
"NORTHWEST",
"SOUTH CENTRAL",
"SOUTHEAST",
"SOUTHWEST",
"WEST CENTRAL")) %>%
dplyr::summarize(
N = base::length(Value),
Minimum = base::min(Value, na.rm=TRUE),
Median = stats::median(Value, na.rm=TRUE),
Mean = base::mean(Value, na.rm=TRUE),
SD = stats::sd(Value, na.rm=TRUE),
Maximum = base::max(Value, na.rm=TRUE),
Missing = base::sum(is.na(Value))
)
# A tibble: 9 x 8
asd_desc N Minimum Median Mean SD Maximum Missing
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 CENTRAL 1243 19.8 100. 106. 51.8 226 0
2 EAST CENTRAL 1050 26 96.1 105. 49.9 224. 0
3 NORTH CENTRAL 1150 15.8 100. 104. 53.1 213. 0
4 NORTHEAST 1150 17.1 95.3 103. 53.3 225. 0
5 NORTHWEST 1243 4.9 90.5 102. 54.2 222. 0
6 SOUTH CENTRAL 1137 4.1 75 83.2 45.8 200 0
7 SOUTHEAST 1147 7.7 89.6 94.9 48.8 220. 0
8 SOUTHWEST 953 5.9 83.3 92.6 49.8 217. 0
9 WEST CENTRAL 1243 3 87.9 99.1 52.9 235. 0
par(ask=TRUE)
ggplot2::ggplot(data=IACountyCornYieldBuAcStarttoEnd.tbl,
aes(x=year, y=Value)) +
geom_point(size=0.75) +
ggtitle(
"Iowa Corn Yields (Bushel per /Acre):
Mid-1860s to Early-2020s") +
labs(x="\nYear", y="Bushels per Acre\n") +
scale_x_continuous(limits=c(1865, 2022),
breaks=scales::pretty_breaks(n=20)) +
theme_Mac()
# Fig. 6.3
As evidenced in this figure, the upward trajectory of corn yields (bushels per acre) over time is amazing. The lowest yields in recent years are still several times greater than the highest yields from the earliest years. It is beyond the scope of
this addendum to offer specific reasons as to why yields have increased, but those
with a direct interest will certainly have ideas that include management practices
such as the use of inorganic fertilizers, herbicides, fungicides, insecticides, improved
plant breeding and seed selection, irrigation, etc.
Footnote 8: For those with special interest, review the biography of Henry A. Wallace, a native Iowan, who was appointed Secretary of Agriculture in 1933 and was elected as Vice President of the United States in 1940. His leadership, individually and in league with others, was instrumental in the development and use of hybrid corn beginning in the mid-1930s. Look at the figure on corn yields over time, and note how a few years after, by the late 1930s to early 1940s, corn yields per acre began their rapid ascent.
Fig. 6.3
Given the challenge of looking at trends over more recent times, consider using
tidyverse ecosystem tools to filter the comprehensive dataset to a manageable num-
ber of years, such as 2012 to 2021, a summary of corn yields over 10 growing
seasons:
IACountyCornYieldBuAc2012to2021.tbl <-
IACountyCornYieldBuAcStarttoEnd.tbl %>%
dplyr::group_by(year, asd_desc) %>%
# Now, group by year and asd_desc. The
# filter below restricts the dataset to
# 2012 to 2021, only.
dplyr::filter(asd_desc %in% c(
"CENTRAL",
"EAST CENTRAL",
"NORTH CENTRAL",
"NORTHEAST",
"NORTHWEST",
"SOUTH CENTRAL",
"SOUTHEAST",
"SOUTHWEST",
"WEST CENTRAL")) %>%
dplyr::filter(year %in% c(
"2012", "2013", "2014", "2015", "2016",
"2017", "2018", "2019", "2020", "2021"))
# Continue filtering, now selecting only
# those rows with data showing 2012 to 2021
# for the object variable year.
base::getwd()
base::ls()
base::attach(IACountyCornYieldBuAc2012to2021.tbl)
utils::str(IACountyCornYieldBuAc2012to2021.tbl)
dplyr::glimpse(IACountyCornYieldBuAc2012to2021.tbl)
base::summary(IACountyCornYieldBuAc2012to2021.tbl)
IACountyCornYieldBuAc2012to2021.tbl %>%
dplyr::summarize(
N = base::length(Value),
Minimum = base::min(Value, na.rm=TRUE),
Median = stats::median(Value, na.rm=TRUE),
Mean = base::mean(Value, na.rm=TRUE),
SD = stats::sd(Value, na.rm=TRUE),
Maximum = base::max(Value, na.rm=TRUE),
Missing = base::sum(is.na(Value))) %>%
base::print(n=90)
Selected sections of output were deleted to save space.
# A tibble: 90 x 9
# Groups: year [10]
year asd_desc N Minimum Median Mean SD Maximum
<int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2012 CENTRAL 13 129 151. 151. 9.64 164.
2 2012 EAST CENTRAL 11 115. 132. 135. 12.4 159.
3 2012 NORTH CENTRAL 12 122 144. 142. 14.3 165.
4 2012 NORTHEAST 12 126. 140 140. 10.0 154.
5 2012 NORTHWEST 13 110. 161. 157. 16.3 171.
6 2012 SOUTH CENTRAL 12 44.5 80.8 79.5 20.9 115.
7 2012 SOUTHEAST 12 52.4 119. 118. 27.9 163.
8 2012 SOUTHWEST 10 91.7 121. 119. 13.5 132
9 2012 WEST CENTRAL 13 105. 131. 127. 12.4 151.
10 2013 CENTRAL 13 135 154. 152. 12.7 181.
And as was seen before, a simple figure will help communicate findings on corn yields, but now only for a 10-year period, 2012 to 2021 (Figs. 6.4 and 6.5).
Fig. 6.4
Fig. 6.5
par(ask=TRUE)
ggplot2::ggplot(data=IACountyCornYieldBuAc2012to2021.tbl,
aes(x=asd_desc, y=Value)) +
geom_point(size=1.05) +
coord_flip() + # Use coord_flip to reverse X and Y axis
ggtitle(
"Iowa Corn Yields (Bushel per Acre) by Agricultural District:
2012 to 2021") +
labs(x="Agricultural District\n", y="Bushels per Acre\n") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=00))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# For this figure, observe how vjust=0.5 was used, to make
# the Agricultural District text align with the tick marks.
# Fig. 6.4
par(ask=TRUE)
ggplot2::ggplot(data=IACountyCornYieldBuAc2012to2021.tbl,
aes(x=as.factor(year), y=Value)) +
# The object variable year is temporarily put into factor
# format so that all years show on the X axis. Without
# this practice, the scale on the X axis would have been
# for every other year. A simple fix such as this attempt
# can be useful and does not require excessive efforts.
geom_point(size=1.05) +
ggtitle(
"Iowa Corn Yields (Bushel per Acre) by Year: 2012 to 2021") +
labs(x="\nYear", y="Bushels per Acre\n") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=00))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# For this figure, observe how vjust=0.5 was used, to make
# the Y axis (Bushels per Acre) text align with the tick marks.
# Fig. 6.5
Data scientists provide value, not only printouts of tables and figures, as valuable as those tables and figures may be. Use the agricolae package and its support for Oneway ANOVA to see if there are statistically significant (p <= 0.05) differences in corn yields by agricultural district over the ten-year period, 2012 to 2021.
install.packages("agricolae", dependencies=TRUE)
library(agricolae)
agricolae::HSD.test(
stats::aov(Value ~ asd_desc,
# Recall that the object variable Value represents corn
# yield, measured in bushels per acre.
data=IACountyCornYieldBuAc2012to2021.tbl), # Model
trt="asd_desc", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main=
"Iowa Corn Yields (Bushels per Acre) by Agricultural
District Using Tukey's HSD (Honestly Significant
Difference) Parametric Oneway ANOVA: 2012 to 2021")
# Wrap the agricolae::HSD.test() function around the Oneway
# ANOVA model, obtained by using the stats::aov() function.
# Select desired arguments, such as group, console, and
# alpha (e.g., p-value).
asd_desc, means
Value groups
NORTHEAST 189.321 a
NORTHWEST 188.462 a
EAST CENTRAL 187.410 a
CENTRAL 186.893 a
NORTH CENTRAL 186.218 ab
WEST CENTRAL 185.862 ab
SOUTHWEST 175.298 bc
SOUTHEAST 172.621 c
SOUTH CENTRAL 153.777 d
Using Oneway ANOVA, there are statistically significant (p <= 0.05) differences in corn yields (bushels per acre) from 2012 to 2021 by agricultural district. Yields ranged from a high of approximately 186 to 189 mean bushels per acre (Group a) to a low of about 154 mean bushels per acre (Group d). The output provides exact details on group membership, individual and overlapping (e.g., group ab and group bc).
However, environmental conditions are not the same each year, and of course wind, rainfall, temperatures, hours of sunlight, etc. could impact yields. How does year fit into this study of corn yields?
Footnote 9: Without going beyond the scope of this addendum, environmental conditions such as temperature are not easily measured as one or two datapoints in a spreadsheet, such as mean monthly high temperature and mean monthly low temperature. Consider the impact of extreme temperatures on corn, both high temperatures and low temperatures, especially during tasseling and silking, when pollination occurs. Extreme temperatures during these critical stages of development may have an adverse impact on kernel formation or grain fill. Large stalks, where the corn is as high as an elephant's eye, may look good when driving past a field, but most corn is grown, sold, and used for grain, not fodder.
IACountyCornYieldBuAc2012to2021.tbl$year <-
factor(IACountyCornYieldBuAc2012to2021.tbl$year)
base::attach(IACountyCornYieldBuAc2012to2021.tbl)
base::class(IACountyCornYieldBuAc2012to2021.tbl$year)
TwowayAgDistrictYear <-
stats::aov(Value ~ asd_desc * year,
data=IACountyCornYieldBuAc2012to2021.tbl)
# Twoway ANOVA for Value (corn yield) by
# Agricultural District and Year
base::summary(TwowayAgDistrictYear)
# Generate a Twoway ANOVA table.
The prior Oneway ANOVA confirmed that there are statistically significant (p <=
0.05) differences in corn yield by agricultural district. This simple Twoway ANOVA
confirms that finding. It is also evident that there are statistically significant (p <=
0.05) differences in corn yield by year. Note that there is also a statistically signifi-
cant (p <= 0.05) interaction between agricultural district and year. The
agricolae::HSD.test() function, in concert with this Twoway ANOVA approach, can
offer more understanding of these outcomes.
agricolae::HSD.test(TwowayAgDistrictYear, "asd_desc",
group=TRUE, console=TRUE)
# Use Tukey's HSD (Honestly Significant Difference) Mean
# Comparison Test to determine which breakouts are in common
# groups, and which breakout groups are not.
asd_desc, means
Value groups
NORTHEAST 189.2314 a
NORTHWEST 188.3848 a
EAST CENTRAL 187.3894 a
CENTRAL 186.8008 a
NORTH CENTRAL 186.1136 a
WEST CENTRAL 185.7400 a
SOUTHWEST 175.2172 b
SOUTHEAST 172.5696 b
SOUTH CENTRAL 153.7190 c
The same approach is needed for year, which is now a factor-type object variable
and easily used in this approach:
agricolae::HSD.test(TwowayAgDistrictYear, "year",
group=TRUE, console=TRUE)
# Use Tukey's HSD (Honestly Significant Difference) Mean
# Comparison Test to determine which breakouts are in common
# groups, and which breakout groups are not.
year, means
Value groups
2021 200.3429 a
2016 200.1991 a
2017 197.8962 ab
2019 193.8970 bc
2018 192.6743 cd
2015 187.8038 d
2014 178.1537 e
2020 175.0211 e
2013 159.9509 f
2012 130.3963 g
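The statistically significant interaction noted above can also be inspected. The sketch below (not part of the original syntax) uses the base-R stats::TukeyHSD() function on the same model object to list pairwise comparisons of the agricultural-district-by-year combinations.
InteractionHSD <- stats::TukeyHSD(TwowayAgDistrictYear,
  which = "asd_desc:year")
# Each row compares one district-by-year combination with
# another, reporting the difference in mean yield, a
# confidence interval, and an adjusted p value.
utils::head(InteractionHSD$`asd_desc:year`)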
Fig. 6.6
par(ask=TRUE)
savecex.axis <- par(cex.axis=0.65) # Adjust label
base::plot(agricolae::HSD.test(TwowayAgDistrictYear,
"asd_desc", alpha=0.05, group=TRUE),
main="Common Subgroups of Corn Yield by Agricultural District")
par(savecex.axis) # Return original setting.
# Fig. 6.6
par(ask=TRUE)
savecex.axis <- par(cex.axis=0.65) # Adjust label
base::plot(agricolae::HSD.test(TwowayAgDistrictYear,
"year", alpha=0.05, group=TRUE),
main="Common Subgroups of Corn Yield by Year")
par(savecex.axis) # Return original setting.
# Fig. 6.7
Value Added Challenge 1 – State Map of Iowa Corn Yield (Bushels per Acre)
in 2021.
It is stated throughout this text that data scientists add value. Look at the way a mapping package cognate to the tidyverse ecosystem, choroplethr, is used to provide a state map of county-by-county corn yields in 2021, with county borders in Iowa clearly showing on the map. Outcomes (e.g., gradients in yield, bushels per acre) show on a color-based choropleth. With a choropleth, it is important to see how dark-colored areas in the map indicate high levels of the variable in question (corn yield in this addendum), whereas lighter areas indicate low levels of the variable in question.
Footnote 10: Going back to the late 1700s and passage of the Northwest Ordinance, note how many counties in what later became the state of Iowa (and other midwestern states) are generally square. There are many resources that explain how land was surveyed into one-square-mile sections (640 acres), with 36 adjoining sections organized into a survey township, all on a rectangular grid. Townships were then collectively organized into what are commonly called box-shaped counties. There are exceptions, of course, considering natural borders such as rivers, but the organizational structure of borders has historically impacted land ownership, farming practices, etc.
Fig. 6.7
Fig. 6.8
The choroplethr package only needs two object variables to make a state-based
county-wide map, the five-digit FIPS code for each county (e.g., region) and the
measured data (e.g., value). Although tidyverse functions could be used, such as the
dplyr::rename() function, Base R functions will be used to put the data into required
format for production of the choropleth, using the choroplethr::county_choropleth()
function (Fig. 6.8).
base::names(IACountyCornYieldBuAc2021.tbl)[
names(IACountyCornYieldBuAc2021.tbl) ==
'GEOID'] <- 'region'
# Use the base::names() function to rename the object
# variable GEOID to region. Give attention to the use
# of lowercase in this instance.
IACountyCornYieldBuAc2021.tbl$region <-
as.numeric(IACountyCornYieldBuAc2021.tbl$region)
# The object variable region must be numeric to
# use the choroplethr package.
base::names(IACountyCornYieldBuAc2021.tbl)[
names(IACountyCornYieldBuAc2021.tbl) ==
'Value'] <- 'value'
# Use the base::names() function to rename the object
# variable Value to value. Give attention to the use
# of lowercase in this instance. R is case sensitive.
IACountyCornYieldBuAc2021.tbl$value <-
as.numeric(IACountyCornYieldBuAc2021.tbl$value)
# The object variable value must be numeric to
# use the choroplethr package. This action may
# be redundant, but it is done as a reminder of
# package requirements.
base::getwd()
base::ls()
base::attach(IACountyCornYieldBuAc2021.tbl)
utils::str(IACountyCornYieldBuAc2021.tbl)
dplyr::glimpse(IACountyCornYieldBuAc2021.tbl)
base::summary(IACountyCornYieldBuAc2021.tbl)
IACornYieldByCounty2021.map <-
choroplethr::county_choropleth(
IACountyCornYieldBuAc2021.tbl,
title = "Iowa Corn Yields (Bushels per Acre)
by County: 2021",
num_colors = 7,
# It is possible to have up to 9 breakouts for color
# gradients, but 7 was purposely selected for this
# state-wide map. Whether 9 or 7 were used, it is
# fairly easy to see the major areas where Iowa corn
# yields were reported.
state_zoom = c("iowa")) +
scale_fill_brewer(name="Bushels per Acre", palette=2) +
theme(plot.title=element_text(hjust = 0.5)) +
theme(legend.position="left")
# Remember that the choroplethr package is associated
# with the tidyverse ecosystem and that ggplot2::ggplot()
# function arguments and options are generally supported.
# As such, the legend title is easily changed to a far
# more descriptive label, the legend is left justified,
# the title is centered, etc. Other embellishments could
# be provided, but it is judged that they are not needed.
# Fig. 6.8
par(ask=TRUE); IACornYieldByCounty2021.map
An accompanying table of corn yields (bushels per acre) would make a good
addition to the final summary on this inquiry.
IACornYieldBuAcCountyByCounty2021.tbl <-
IACountyCornYieldBuAc2021.tbl %>%
dplyr::group_by(county_name) %>%
dplyr::filter(asd_desc %in% c(
"CENTRAL",
"EAST CENTRAL",
"NORTH CENTRAL",
"NORTHEAST",
"NORTHWEST",
"SOUTH CENTRAL",
"SOUTHEAST",
"SOUTHWEST",
"WEST CENTRAL")) %>%
# As a data trap quality assurance measure, retain the
# rows for agricultural districts (e.g., asd_desc) that
# are properly named and delete rows for all others,
# most importantly those rows with no value listed for
# asd_desc since the county will also be in question.
dplyr::summarize(
N = base::length(value),
Maximum = base::max(value, na.rm=TRUE),
Missing = base::sum(is.na(value)))
# Remember that the object variable Value was renamed
# to value, to accommodate choroplethr requirements.
#
# There is only one datum (corn yield value) for each
# county since this analysis is specific to 2021, so
# a full set of descriptive statistics is not needed.
base::print(IACornYieldBuAcCountyByCounty2021.tbl, n=99)
# Allow for all 99 counties, knowing that data are
# missing for some.
#
# At first it may seem unnecessary, but in the original
# printout of this inquiry, the geometry data follow
# along with each county, county by county. Those
# coordinates were trimmed from the printout, but some
# may find it helpful if mapping is attempted,
# especially if mapping is done that requires latitude
# and longitude coordinates.
Selected sections of output were deleted to save space.
80 WINNEBAGO 1 199. 0
81 WINNESHIEK 1 200. 0
82 WOODBURY 1 208. 0
83 WORTH 1 208. 0
84 WRIGHT 1 193. 0
Challenge: Sort the values and list the three counties that had the highest corn
yields in 2021 and the three counties that had the lowest corn yields in 2021 (see the sketch below for one approach). Then look at the historical data, which were obtained earlier, and go back about 100 years to the
1920s. What were corn yields at that time, the highest three counties and the lowest
three counties? For those with special interest, look at farming practices then and
farming practices now. Go beyond corn yields, only, and find data on national popu-
lation, world population, and the degree to which Iowa corn is exported to other
nations, 100 or so years ago and now. Consider how the calories generated by Iowa
corn farmers feed, in part, a hungry world far beyond Iowa.
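One possible approach to the sorting portion of this challenge is sketched below (it is not part of the original syntax); it assumes the renamed value column and the county_name column seen earlier in this addendum.
IACountyCornYieldBuAc2021.tbl %>%
  dplyr::select(county_name, value) %>%
  dplyr::slice_max(value, n = 3)
# The three counties with the highest 2021 corn yields.
IACountyCornYieldBuAc2021.tbl %>%
  dplyr::select(county_name, value) %>%
  dplyr::slice_min(value, n = 3)
# The three counties with the lowest 2021 corn yields.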
Challenge: Review online resources for annual precipitation for a few represen-
tative cities in Iowa, 2012 to 2021. As an example, selecting Sac City, Sac County,
Iowa, it is evident that 2012 and 2013 were dry years. Is it surprising that corn yields
were low in those years, compared to other years? There is far more to corn yields
than annual precipitation, but rain and snow certainly impact available soil mois-
ture. Look at resources made available by the National Weather Service and use a
correlation technique, possibly Pearson’s r, to compare annual rainfall by year to
corn yields by year. What is the correlation estimate? If possible, be sure to exclude
data from irrigated acreage. Are the federal data available and organized to allow
that exclusion?
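One hedged sketch of the correlation suggested in this challenge is shown below. The object names SacCityPrecip2012to2021.tbl (columns year and precip_in) and IAYieldByYear.tbl (columns year and mean_yield) are hypothetical placeholders; they would need to be built from National Weather Service data and from the yield data already obtained.
# Hypothetical objects: SacCityPrecip2012to2021.tbl and
# IAYieldByYear.tbl are placeholders, not datasets created
# in this lesson.
PrecipVsYield.tbl <-
  dplyr::inner_join(IAYieldByYear.tbl,
    SacCityPrecip2012to2021.tbl,
    by = "year")
stats::cor.test(PrecipVsYield.tbl$mean_yield,
  PrecipVsYield.tbl$precip_in,
  method = "pearson")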
Value Added Challenge 2 – Other R-Based APIs to Gain USDA NASS Data.
Merely to show that there are multiple R-based APIs associated with the USDA
NASS resource, look at the R package rnassqs. It has many useful features and
should be considered, at least for some queries to NASS.
install.packages("rnassqs", dependencies=TRUE)
library(rnassqs)
Cranberry2010to2021.list <-
list(commodity_name="CRANBERRIES",
year=c("2010", "2011", "2012", "2013", "2014", "2015",
"2016", "2017", "2018", "2019", "2020", "2021"),
agg_level_desc="STATE",
state_alpha=c("MA", "NJ", "OR", "WA", "WI"),
commodity_desc="CRANBERRIES",
statisticcat_desc="YIELD")
# Prepare a list of parameters for use with the NASS
# data resource. Specifically, this list will query
# NASS for cranberry production for five states, from
# 2010 to 2021. This is one more example of how R-
# based APIs can be used to get data over multiple
# years and in this example data from multiple states.
base::class(Cranberry2010to2021.list)
nassqs_auth(key="UseTheKeyProvidedAtSign-Upxxxxxxxxxx")
# The USDA NASS key is free, but it must first be
# obtained at https://quickstats.nass.usda.gov/api.
Cranberry2010to2021.df <-
rnassqs::nassqs(Cranberry2010to2021.list)
# The rnassqs::nassqs() function activates an HTTP GET
# data request to the USDA NASS Quick Stats API, with
# data returned as a data.frame.
base::getwd()
base::ls()
base::attach(Cranberry2010to2021.df)
utils::str(Cranberry2010to2021.df)
dplyr::glimpse(Cranberry2010to2021.df)
base::summary(Cranberry2010to2021.df)
writexl::write_xlsx(
Cranberry2010to2021.df,
path = "D:\\R_Ceres\\Cranberry2010to2021.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
Much can be done with this dataset. Although it may not be an absolute require-
ment, put the dataframe into tibble format, to facilitate use of tidyverse ecosystem
functions. Then take time to study the key object variables. Give special attention to
the somewhat unique (e.g., historical) way that cranberry yield is measured,
BARRELS / ACRE. Search the Internet to learn more about this measure, where
there is a uniform weight set by the federal government even though it has been
years since cranberries were packed, distributed, and sold in barrels (e.g., like the
way corn is no longer packed in bushel baskets as part of the distribution process).
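A minimal sketch of the tibble conversion and quick inspection suggested above (not part of the original syntax); the columns named here (year, state_name, unit_desc, Value) are standard USDA Quick Stats fields, as seen in earlier output.
Cranberry2010to2021.tbl <-
  tibble::as_tibble(Cranberry2010to2021.df)
# Keep the converted object for later tidyverse work.
Cranberry2010to2021.tbl %>%
  dplyr::select(year, state_name, unit_desc, Value) %>%
  dplyr::glimpse()
# unit_desc should read BARRELS / ACRE for this commodity.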
From among many possible ways this dataset could be used, a few hints for com-
munication to a general audience would include:
• Start out with the use of Base R, and make a simple plot of year (X-axis) by
Value (Y-axis). The syntax and accompanying figure are shown below.
• Prepare a line graph of cranberry yields over time by state. This type of figure
would be extremely useful for those who are not fully acquainted with this crop
and where it is grown. Since there are only five states (e.g., breakouts) in this
example, it is likely that one unified figure, generated by using the ggplot2::ggplot()
function, would be sufficient. The syntax and accompanying figure are
shown below.
• It might also be beneficial, for those who require more precise detail, to prepare a table of yields by state and by year, and then again by year and by state. The dplyr::group_by() function and the dplyr::summarize() function will serve as at least one means of obtaining these statistics in a presentable format; a brief sketch follows below (Figs. 6.9 and 6.10).
Footnote 11: Use of facet_wrap() may not be needed, but it should still be attempted just to see if it is a reasonable way of communicating outcomes.
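The table-building hint in the bullet above might be sketched as follows (not part of the original syntax), using the tibble created in the earlier sketch; Value arrives as character from the API, so it is converted to numeric.
Cranberry2010to2021.tbl %>%
  dplyr::group_by(state_name, year) %>%
  dplyr::summarize(
    YieldBarrelsPerAcre = base::mean(as.numeric(Value), na.rm=TRUE),
    .groups = "drop") %>%
  base::print(n = 60)
# Swap the grouping order (year, then state_name) for the
# year-by-state view.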
base::plot(as.numeric(Value) ~ as.factor(year), # Y Axis ~ X Axis
data=Cranberry2010to2021.df,
main="Base R - Cranberry Yield (BARRELS / ACRE)
by Year: 2010 to 2021")
# Fig. 6.9
ggplot2::ggplot(data=Cranberry2010to2021.df,
aes(x=as.factor(year), y=as.numeric(Value), group=state_name,
color=state_name, shape=state_name)) +
geom_point(size=1.5) +
geom_line(size=1.0) +
labs(title =
"ggplot2 - Cranberry Yield (BARRELS / ACRE)
by Year: 2010 to 2021")
# Much more can and should be done to this figure to
# add value to the consumer:
# Add descriptive labels to the X Axis and Y Axis.
# Adjust the Y Axis scale to show 0 to maximum, plus
# a little more to allow full display of yields.
# Adjust the number of breaks along the Y Axis.
# Use a theme that is more vibrant and easier to see
# at a distance.
# Change the legend title to State, from state_name.
# In a selected open space area, add an annotation
# to explain the meaning of BARRELS / ACRE, a term
# that few would understand.
# Fig. 6.10
Fig. 6.9
Fig. 6.10
The challenge for Addendum 2 is to first obtain data on corn prices per bushel,
looking at the earliest historical records maintained by the United States Department
of Agriculture. Then obtain data on acreage planted in corn, using the earliest his-
torical records maintained by the United States Department of Agriculture. Once
the data are obtained, depending on structure, put the data into one unified dataset
so that it is possible to look at the association between prices and acreage. Is there a
statistically significant (p <= 0.05) correlation (e.g., association) between the two, corn prices and corn acreage? Use the tidyverse ecosystem to plot the two variables, corn prices (X-Axis) and corn acreage (Y-Axis).
Farming, whether by a small family farm operator or a large-scale agribusiness,
must show a profit if it is to continue for any length of time. We all have bills to pay.
How do profitability, farm management and conservation, corn prices, and corn
acreage fit into biostatistics?
Consider farming systems and management practices and their impact on the
environment in toto. Corn is a critical crop and vast acreage is devoted to corn pro-
duction, not only for domestic use but also as an exported commodity, where corn
is used for: human consumption; feed, forage, and bedding for livestock; fuel (e.g., ethanol); and many industrial uses (e.g., adhesives, enzymes, textile dyes), etc. Corn
usage is far greater than many could ever imagine.
However, if farm operators cannot make a reasonable year-after-year profit, they may elect to raise other row crops, or perhaps take the land out of row crop production. Even with the use of no-till farming practices, corn production often results in more erosion of valuable topsoil than pasture and hay crops. How do farm operators balance environmental protection v production of vital crops v reasonable profits, etc.? Although the term sustainable farming is increasingly seen
in the popular press and university curricula, farm operators must be able to make
a profit if they are to continue feeding the world. Consider these and related issues
as the comparison of corn acreage planted v commodity prices is determined.
These two concepts (acreage in production and commodity prices) are indeed
related to biostatistics.
Footnote 12: When determining a correlation coefficient between two variables, whether using Pearson's r, Spearman's rho, or Kendall's tau, always keep in mind two related common expressions: (1) past behavior is the best predictor of future behavior; and (2) correlation does not suggest causation.
IACornPricePerBushel1867Onward.tbl <-
tidyUSDA::getQuickstat(
key='UseTheKeyProvidedAtSign-Upxxxxxxxxxx',
program='SURVEY',
# period='YEAR',
# As the tidyUSDA::getQuickstat() function
# has been organized, if a specific year is
# not declared (e.g., year = '2021') and
# period = 'YEAR' is also not used, the
# function fetches data for all available
# years. In other packages, such as those in
# the tidycensus package, the purrr::map()
# functions are used to iterate over multiple
# years.
state='IOWA',
commodity='CORN',
data_item='CORN, GRAIN - PRICE RECEIVED, MEASURED IN $ / BU',
domain='TOTAL'
)
# Comment: Computing systems are neither as perfect nor as
# robust as desired. Syntax for the above NASS-specific API
# was attempted in the late AM, a few minutes before noon
# Eastern Time, a time of day when systems often see heavy
# use. In return, the following error message was generated
# after the data call timed out:
#
# Error: API did not return results. First verify that your
# input parameters work on the NASS website:
tibble::is_tibble(IACornPricePerBushel1867Onward.tbl)
[1] FALSE
tibble::as_tibble(IACornPricePerBushel1867Onward.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
# A tibble: 1,529 x 39
base::getwd()
base::ls()
base::attach(IACornPricePerBushel1867Onward.tbl)
utils::str(IACornPricePerBushel1867Onward.tbl)
dplyr::glimpse(IACornPricePerBushel1867Onward.tbl)
base::summary(IACornPricePerBushel1867Onward.tbl)
writexl::write_xlsx(
IACornPricePerBushel1867Onward.tbl,
path = "D:\\R_Ceres\\IACornPricePerBushel1867Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("IACornPricePerBushel1867Onward.xlsx")
base::file.info("IACornPricePerBushel1867Onward.xlsx")
base::list.files(pattern =".xlsx")
The file may seem to be a bit longer (e.g., to have more rows) than might at first be expected. Look closely at the values in the freq_desc column, and observe how data are provided both MONTHLY and ANNUAL. The data are valuable and can serve many potential uses, but they are not tidy, with MONTHLY and ANNUAL data all in the same column, freq_desc. This is not a tidy approach to data organization.
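Before filtering, a quick check (a sketch, not part of the original syntax) of how the rows split across the freq_desc column supports the assume-little-and-verify-much habit described earlier.
IACornPricePerBushel1867Onward.tbl %>%
  dplyr::count(freq_desc)
# One row per freq_desc value (e.g., ANNUAL, MONTHLY), with
# n giving the number of rows in each group.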
Use tidyverse ecosystem tools to filter out the MONTHLY rows, resulting in a dataset that is focused only on ANNUAL corn prices per bushel. Ideally, if ANNUAL is filtered correctly for freq_desc, there will be one and only one row of data for each year in the dataset, resulting in 155 rows, the number of years associated with this dataset.
IACornPricePerBushel1867OnwardADJUSTED.tbl <-
IACornPricePerBushel1867Onward.tbl %>%
dplyr::filter(freq_desc %in% c(
"ANNUAL"))
# Retain the rows that have ANNUAL
# (only!) in the freq_desc column and
# delete all other rows.
# After this filtering action, focusing on
# freq_desc, the resulting dataset consists
# of 155 rows -- data from 1867 to 2022.
base::getwd()
base::ls()
base::attach(IACornPricePerBushel1867OnwardADJUSTED.tbl)
utils::str(IACornPricePerBushel1867OnwardADJUSTED.tbl)
dplyr::glimpse(IACornPricePerBushel1867OnwardADJUSTED.tbl)
base::summary(IACornPricePerBushel1867OnwardADJUSTED.tbl)
There is now a somewhat manageable dataset of Iowa corn prices per bushel over
a multiple-year period, beginning in 1867. Move on to similar actions, but to obtain
acreage devoted to Iowa corn production by year. See how many prior actions can
be copied and pasted, with just clever changing of dataset names and object vari-
able names.
Immediately, observe how data on Iowa corn acreage starts in 1926 whereas data on
Iowa corn prices went back to 1867. The USDA NASS resource is a great resource,
but the unavailability of data from the past is a constant issue for those who work in
data science. Even with this limitation, approximately 100 years of data are avail-
able on Iowa corn acreage.
IACornAcresPlanted1926Onward.tbl <-
tidyUSDA::getQuickstat(
key='UseTheKeyProvidedAtSign-Upxxxxxxxxxx',
program='SURVEY',
# period='YEAR',
# As the tidyUSDA::getQuickstat() function
# has been organized, if a specific year is
# not declared (e.g., year = '2021') and
# period = 'YEAR' is also not used, the
# function fetches data for all available
# years. In other packages, such as those in
# the tidycensus package, the purrr::map()
# functions are used to iterate over multiple
# years.
state='IOWA',
commodity='CORN',
data_item='CORN - ACRES PLANTED',
domain='TOTAL'
)
tibble::is_tibble(IACornAcresPlanted1926Onward.tbl)
[1] FALSE
tibble::as_tibble(IACornAcresPlanted1926Onward.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
# A tibble: 10,538 × 39
base::getwd()
base::ls()
base::attach(IACornAcresPlanted1926Onward.tbl)
utils::str(IACornAcresPlanted1926Onward.tbl)
dplyr::glimpse(IACornAcresPlanted1926Onward.tbl)
base::summary(IACornAcresPlanted1926Onward.tbl)
writexl::write_xlsx(
IACornAcresPlanted1926Onward.tbl,
path = "D:\\R_Ceres\\IACornAcresPlanted1926Onward.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("IACornAcresPlanted1926Onward.xlsx")
base::file.info("IACornAcresPlanted1926Onward.xlsx")
base::list.files(pattern =".xlsx")
This file is quite large in that it has information on acreage planted for all IOWA
agricultural district breakouts as well as IOWA overall, shown in the location_desc
column. To add more complexity, rows that read IOWA, only, in the location_desc column appear multiple times per year due to multiple entries in the reference_period_desc column. As an example, for year 2022, there are three entries for
the reference_period_desc column: YEAR, YEAR – JUN ACREAGE, YEAR –
MAR ACREAGE. Efforts are needed to end up with a dataset that has one and only
one composite entry of Value (e.g., CORN – ACRES PLANTED) and year
(e.g., year).
There are many ways to approach this data-wrangling challenge, and from
among the many options, here is an overall suggestion:
• For the object variable (e.g., column) location_desc, use the dplyr::filter() func-
tion to filter out all rows other than those rows that have IOWA, only, in the loca-
tion_desc column. This should result in 119 rows.
• Then for the object variable (e.g., column) reference_period_desc, use the
dplyr::filter() function to filter out all rows other than those rows that have YEAR,
only, in the reference_period_desc column. This should result in a final set of
98 rows.
Footnote 13: Do not obtain data and then immediately rush in and start analyses. Take time to study the data, visually and by using software. There is a reason for the 80-20 rule.
IACornAcresPlanted1926OnwardADJUSTED.tbl <-
IACornAcresPlanted1926Onward.tbl %>%
dplyr::filter(location_desc %in% c(
"IOWA")) %>%
# Retain the rows that have IOWA (only!)
# in the location_desc column and delete
# all others.
dplyr::filter(reference_period_desc %in% c(
"YEAR"))
# Retain the rows that have YEAR (only!)
# in the reference_period_desc column and
# delete all others.
# After the two filtering actions, both
# location_desc and reference_period_desc,
# the resulting dataset consists of 98 rows.
base::getwd()
base::ls()
base::attach(IACornAcresPlanted1926OnwardADJUSTED.tbl)
utils::str(IACornAcresPlanted1926OnwardADJUSTED.tbl)
dplyr::glimpse(IACornAcresPlanted1926OnwardADJUSTED.tbl)
base::summary(IACornAcresPlanted1926OnwardADJUSTED.tbl)
There are two datasets of interest, an adjusted dataset of Iowa corn acreage by year
and another adjusted dataset on Iowa corn prices by year. Both datasets have many
variables that may be interesting but have little to do with the inquiry at hand – an
examination of the association between corn production (acreage) and corn prices.
Does acreage increase as prices increase? Does acreage increase as prices decrease?
Consider use of the dplyr::select() function to put the two datasets into more
manageable order:
IACornPricePerBushel1867OnwardTrimmed.tbl <-
IACornPricePerBushel1867OnwardADJUSTED.tbl %>%
dplyr::select('statisticcat_desc', 'Value',
'freq_desc', 'state_alpha', 'year')
# The trimmed dataset for Iowa corn prices will be
# restricted to these five object variables.
base::attach(IACornPricePerBushel1867OnwardTrimmed.tbl)
utils::head(IACornPricePerBushel1867OnwardTrimmed.tbl)
The dataset is now tidy and manageable, consisting of 155 rows and 5 variables. A simple figure should give a good idea of year-by-year change in prices, and those with special interest may want to match these changes with world events, since corn is an exported farm commodity that is sensitive to global events. Going back to the run-up of corn prices during World War I, what was the impact of demand for Iowa corn during World War II? For those with special interest, look at the 1942 legislation on price controls and the conflicts between Leon Henderson (Office of Price Administration) and Claude Wickard (Secretary of Agriculture), a conflict that resulted in policy decisions with profound implications on then and later corn production, as well as other agricultural products (Fig. 6.11).
Footnote 14: Look at the price of Iowa corn in 1915 ($0.63 per bushel) and the rapid increases up to 1919 ($1.34 per bushel), a more than doubling of price in only a few years. Was World War I and the worldwide demand for food responsible for this increase? Give special attention to the $1.34 per bushel price of Iowa corn in 1919 and how the price of Iowa corn crashed throughout the 1920s and 1930s, with a low for Iowa corn at $0.32 per bushel in 1932. Concurrent with these low prices, review available resources on the Great Depression and the Dust Bowl. It is dangerous to say that X caused Y, but is there an association between the low prices for farm commodities in the 1920s and 1930s, the Great Depression of the 1930s, and the prior use of compromised farming practices that may have contributed to the 1930s Dust Bowl? This is of course an extremely complicated issue, but those who work in agriculture and biostatistics should at least be aware of these issues.
Footnote 15: Other global events impacting the rapid increase and soon after decline in Iowa corn prices would be the 1972 decision to export grain to the then Union of Soviet Socialist Republics (USSR) and later, the 1980 decision to embargo grain sales to the same entity. The point here is that farm commodity prices are impacted by far more than weather and similar environmental factors.
Fig. 6.11
ggplot2::ggplot(data=IACornPricePerBushel1867OnwardTrimmed.tbl,
aes(x=year, y=Value)) +
geom_point() +
ggtitle("Iowa Corn Production from 1860s to 2020s:
Price per Bushel") +
labs(x="\nYear", y="Price per Bushel\n") +
scale_x_continuous(limits=c(1865,2025),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::dollar, limits=c(0,7),
breaks=scales::pretty_breaks(n=7)) +
theme_Mac()
# Fig. 6.11
IACornAcresPlanted1926OnwardTrimmed.tbl <-
IACornAcresPlanted1926OnwardADJUSTED.tbl %>%
dplyr::select('location_desc', 'Value',
'freq_desc', 'statisticcat_desc', 'year')
# The trimmed dataset for Iowa corn acreage will be
# restricted to these five object variables.
base::attach(IACornAcresPlanted1926OnwardTrimmed.tbl)
utils::head(IACornAcresPlanted1926OnwardTrimmed.tbl)
Like the prior figure on Iowa corn prices over time, prepare a figure that exam-
ines Iowa acreage for corn, using these USDA NASS resources (Fig. 6.12).
Fig. 6.12
ggplot2::ggplot(data=IACornAcresPlanted1926OnwardTrimmed.tbl,
aes(x=year, y=Value)) +
geom_point() +
ggtitle("Iowa Corn Production from 1920s to 2020s:
Acreage") +
labs(x="\nYear", y="Acres Planted\n") +
scale_x_continuous(limits=c(1925,2025),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
15000000), breaks=scales::pretty_breaks(n=7)) +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=15))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# For this figure, observe how vjust=0.5 was used, to make
# the Y axis (Acres Planted) text align with the tick marks.
# Fig. 6.12
Although the resulting figure of Iowa corn acreage over time may seem serpen-
tine, compare the general trends, the ups and downs, to the prior figure on Iowa corn
prices. Is there an association? What is needed to determine the association, if
indeed there is one?
Ultimately, it will be necessary to join the two datasets into one unified dataset,
so that Iowa corn acreage per year and Iowa corn price per bushel per year are in the
same dataset, allowing analyses of the two object variables. The object variable year
is common to both trimmed datasets and that variable will be selected for common-
ality. However, a unique problem for the two datasets is that the object variable
Value is also in common, given how this is the default name for the major variable
of interest in USDA NASS datasets. A few more wrangling activities, namely renaming, are needed to take care of this issue.
# Rename IACornPricePerBushel1867OnwardTrimmed.tbl$Value
base::names(IACornPricePerBushel1867OnwardTrimmed.tbl)[
names(IACornPricePerBushel1867OnwardTrimmed.tbl) ==
'Value'] <- 'price'
# Use the base::names() function to rename the object
# variable Value to price.
base::attach(IACornPricePerBushel1867OnwardTrimmed.tbl)
utils::str(IACornPricePerBushel1867OnwardTrimmed.tbl)
utils::head(IACornPricePerBushel1867OnwardTrimmed.tbl)
# Rename IACornAcresPlanted1926OnwardTrimmed.tbl$Value
base::names(IACornAcresPlanted1926OnwardTrimmed.tbl)[
names(IACornAcresPlanted1926OnwardTrimmed.tbl) ==
'Value'] <- 'acres'
# Use the base::names() function to rename the object
# variable Value to acres.
base::attach(IACornAcresPlanted1926OnwardTrimmed.tbl)
utils::str(IACornAcresPlanted1926OnwardTrimmed.tbl)
utils::head(IACornAcresPlanted1926OnwardTrimmed.tbl)
Merge the price-focused dataset and the acres-focused dataset into one unified dataset, now that both trimmed datasets are in good form. The object variable year is common to both trimmed datasets and will be used to facilitate the merger.
The dplyr package supports more than a few join-type (e.g., merge) functions: inner_join(), left_join(), right_join(), and full_join(). Review easily available documentation on the unique outcomes of each join-type function. For this example, the base::merge() function is applied with by = "year", which behaves like an inner join and retains only the years common to both trimmed datasets, a practical choice given that the two trimmed datasets have an unequal number of rows.
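For comparison only, a tidyverse-style alternative would be dplyr::full_join(), which keeps the unmatched years and pads them with NA values. The following is merely a minimal sketch and is not the approach used below; the object name IACornPriceByAcresByYearFULL.tbl is hypothetical and is not used elsewhere in this lesson.
IACornPriceByAcresByYearFULL.tbl <-
dplyr::full_join(
IACornPricePerBushel1867OnwardTrimmed.tbl,
IACornAcresPlanted1926OnwardTrimmed.tbl,
by = "year")
# Years found in only one dataset (e.g., 1867 to 1925,
# which appear only in the price dataset) would be kept,
# with NA values in the columns from the other dataset.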
IACornPriceByAcresByYearJOINED.tbl <-
base::merge(
IACornPricePerBushel1867OnwardTrimmed.tbl,
IACornAcresPlanted1926OnwardTrimmed.tbl, by = "year")
tibble::is_tibble(IACornPriceByAcresByYearJOINED.tbl)
[1] FALSE
IACornPriceByAcresByYearJOINED.tbl <-
tibble::as_tibble(IACornPriceByAcresByYearJOINED.tbl)
# Put the dataframe into tibble format, to keep
# within the tidyverse paradigm.
IACornPriceByAcresByYearJOINED.tbl
Selected sections of output were deleted to save space.
# A tibble: 97 × 9
The data of interest (e.g., Iowa corn prices per bushel by year and Iowa acreage
in corn per year) are now in the same dataset and ready for use. Consider the follow-
ing Null Hypothesis:
Null Hypothesis: There is no statistically significant (p <= 0.05) association
between Iowa corn prices per bushel by year and Iowa acreage in corn per year.
stats::cor.test(
IACornPriceByAcresByYearJOINED.tbl$price,
IACornPriceByAcresByYearJOINED.tbl$acres,
method="pearson",
use="pairwise.complete.obs")
Among the statistics gained from this correlation, observe how Pearson's r = 0.657049 indicates a moderately strong positive association between Iowa corn prices per bushel by year and Iowa acreage in corn per year. As always, a well-developed figure will help reinforce an understanding of this outcome (Fig. 6.13):
Fig. 6.13
par(ask=TRUE)
ggplot2::ggplot(data=IACornPriceByAcresByYearJOINED.tbl,
aes(x=price, y=acres)) +
geom_point(size=1.25) +
ggtitle("Iowa Corn Production from 1926 Onward:
Price per Bushel v Acreage Planted")+
geom_smooth(method=lm, se=TRUE) +
labs(x="\nPrice per Bushel", y="Acreage Planted\n") +
scale_x_continuous(labels=scales::dollar, limits=c(0,7),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
16000000), breaks=scales::pretty_breaks(n=7)) +
#############################################################
annotate("text", x=0.5, y=7500000, fontface="bold", size=04,
hjust=0, label=
"Null Hypothesis: There is no statistically significant ") +
annotate("text", x=0.5, y=6700000, fontface="bold", size=04,
hjust=0, label=
"(p <= 0.05) association between Iowa corn prices per ") +
annotate("text", x=0.5, y=5900000, fontface="bold", size=04,
hjust=0, label=
"bushel by year and Iowa acreage in corn by year.") +
#############################################################
annotate("text", x=4.0, y=7500000, fontface="bold", size=04,
hjust=0, label=
"Calculate p-value = 0.00000000000027") +
annotate("text", x=4.0, y=5900000, fontface="bold", size=04,
hjust=0, color="red", label=
"Pearson's r = 0.657049.") +
#############################################################
annotate("text", x=0.5, y=4500000, fontface="bold", size=04,
hjust=0, label=
"The Null Hypothesis is not accepted. Or, to phrase the ") +
annotate("text", x=0.5, y=3800000, fontface="bold", size=04,
hjust=0, label=
"outcome in a different manner, the Null Hypothesis is ") +
annotate("text", x=0.5, y=3100000, fontface="bold", size=04,
hjust=0, label=
"rejected. There is an association between X (price) ") +
annotate("text", x=0.5, y=2400000, fontface="bold", size=04,
hjust=0, label=
"and Y (acreage).") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=15))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# Fig. 6.13
Following along with the statistical outcomes, the figure provides visual evi-
dence that there is an association between X (price) and Y (acres). Recall, however,
that there is no suggestion of causation. It cannot be determined that X causes Y nor
can it be determined that Y causes X. It is only possible to say that there is an
association between X (price) and Y (acres) and then, depending on wording, that
the association is positive or negative.
Although there is complete confidence in the statistics and figures associated with the association between X (price) and Y (acres), an experienced data scientist knows that some individuals, especially among the public, will not fully benefit from the presentation up to this point, including the well-prepared scatter plot. Many members of the public can see the upward trend over time between X (price) and Y (acres), allowing for selected periods of marked fluctuation, often caused by either extreme weather or global events. It cannot be ignored, however, that many members of the public have only a vague (if any) idea of the area of an acre (e.g., 43,560 square feet) or the use of the bushel as a measurement (e.g., approximately 1.24 cubic feet, or, as a standard for corn, 56 pounds per bushel at 15.5% moisture content). To confront this problem, it may be necessary to prepare a few pictographs that introduce those terms to the public, such as:
• An acre is approximately the size of a football field (e.g., American football),
without the two end zones.
• It takes approximately 110 to 120 ears of corn to make a bushel.
As always, remember the background of the intended audience for the final sum-
mary. Testing the final document(s) with a broad and representative readership can
help with the final presentation.
Addendum 3: Use of Known URLs as a Proxy API (Application Programming Interface)
There are many API (Application Programming Interface) Web sites that have valuable data for those who work in biostatistics. The United States Environmental Protection Agency (EPA), a cabinet-level agency, provides extremely important information about toxins that impact the environment. Ultimately these toxins are linked to public health and, by extension, to the impact of public health on the workforce – certainly topics associated with biostatistics.
The EPA provides the data through many interfaces and in many formats. Some
EPA interfaces are quite easy to navigate, consisting of simple GUI (Graphical User
Interface) drop-down menus. Other EPA interfaces require some degree of expertise
with computing. There are also a few product-specific R-based APIs that interact
with and obtain data from the EPA, such as prior R-based APIs demonstrated earlier.
For this addendum, look at the way known URLs (Uniform Resource Locator)
are used to obtain data associated with fuel types and the impact of these many fuels
when used for the generation of electricity. From the larger collection of object
variables included in the obtained datasets, the focus for this addendum will be on
the following object variables16:
reporting_year
n2o_emissions_co2e
total_annual_heat_input
fuel_type
ch4_emissions_co2e
In advance, based on prior knowledge from working with the EPA’s interface and
processes for these data, it is known that the dataset in question consists of about
55,000 rows of data. However, to best manage resources, the EPA’s system housing
these data has been designed to limit the number of rows returned at one time, gen-
erating this error message if there were an attempt to obtain more than 10,000 rows
of data:
# message:
#
# error "Limit must be less than 10000"
#
It may be an aggravation, but the problem is easily solved by using a divide-and-conquer approach: break the data acquisition process into six separate data calls, with each returned file meeting the size limit of fewer than 10,000 rows. After the separate .csv files are brought into the R session, the files are merged into one common file. One simple data call to the EPA's data resource would be preferable, but that aspiration is not possible; the suggested process works, and the data are quite revealing and worth the effort.
Identify the EPA URLs containing the row-based data, knowing that it is best to
obtain no more than 9999 rows of data at one time.
16 Note: At one time, the original object variable names showed in UPPERCASE instead of lowercase. Always check datasets for current naming schemes, structure, data availability, etc. Data at online resources that are controlled by others are always subject to change: updates, deletions, modifications.
urlEPA01 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/00000:09999/CSV"
urlEPA02 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/10000:19999/CSV"
urlEPA03 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/20000:29999/CSV"
urlEPA04 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/30000:39999/CSV"
urlEPA05 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/40000:49999/CSV"
urlEPA06 <-
"https://data.epa.gov/efservice/D_FUEL_LEVEL_INFORMATION/ROWS/50000:59999/CSV"
base::ls()
# Confirm that the objects urlEPA01 to urlEPA06 were
# created, with each consisting of the name for an URL.
Caution: In printed form, these URLs may wrap across two or more lines, due to word wrap and the limits of preparing text. When executing these actions (e.g., importing the data), confirm that each URL, individually, appears as a single unbroken string on one line, likely extending well to the right of the screen. Each URL consists of 78 characters, from the beginning double quote to the ending double quote.
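The import step itself is not reproduced at this point in the text. A minimal sketch, assuming the readr package (loaded with the tidyverse) is used to read each .csv resource directly from its URL and that the object names match those confirmed below, might be:
datEPA01 <- readr::read_csv(urlEPA01)
datEPA02 <- readr::read_csv(urlEPA02)
datEPA03 <- readr::read_csv(urlEPA03)
datEPA04 <- readr::read_csv(urlEPA04)
datEPA05 <- readr::read_csv(urlEPA05)
datEPA06 <- readr::read_csv(urlEPA06)
# Each call reads one block of 10,000 or fewer rows of the
# EPA .csv resource directly from its URL.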
base::ls()
utils::head(datEPA01); utils::head(datEPA02);
utils::head(datEPA03); utils::head(datEPA04);
utils::head(datEPA05); utils::head(datEPA06);
# Confirm that the objects datEPA01 to datEPA06 were
# created, with each consisting of data gained from the
# previously named URL.
The six separate datasets (e.g., datEPA01 to datEPA06) are structured so that the
file layout is the same for each dataset. The task now is to join the six datasets into
one unified dataset. There are many ways to achieve this aim, but because of the
common format for all six EPA datasets, the dplyr::bind_rows() function will
be used.
EPAFuelToxins.tbl <-
dplyr::bind_rows(
datEPA01,
datEPA02,
datEPA03,
datEPA04,
datEPA05,
datEPA06)
# Merge, join, blend, bind, etc. rows of
# the six row-based breakout EPA datasets
# into one common, complete, dataset.
tibble::is_tibble(EPAFuelToxins.tbl)
[1] TRUE
The dataset, obtained from the EPA interface and referencing different fuel types
and related information on toxins associated with the generation of electricity, is
now almost in final form. There are a few object variables that are not of direct inter-
est for this set of analyses, so a few more efforts are needed to put the dataset into
what is judged final form.
EPAFuelToxinsTrimmed.tbl <-
EPAFuelToxins.tbl %>%
dplyr::select('reporting_year',
'n2o_emissions_co2e',
'total_annual_heat_input',
'fuel_type',
'ch4_emissions_co2e')
base::getwd()
base::ls()
base::attach(EPAFuelToxinsTrimmed.tbl)
utils::str(EPAFuelToxinsTrimmed.tbl)
dplyr::glimpse(EPAFuelToxinsTrimmed.tbl)
base::summary(EPAFuelToxinsTrimmed.tbl)
base::summary(EPAFuelToxinsTrimmed.tbl$reporting_year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2010 2012 2015 2015 2018 2021
FrequencyYear <-
EPAFuelToxinsTrimmed.tbl %>%
janitor::tabyl(reporting_year,
show_na=TRUE,
show_missing_levels=TRUE) %>%
janitor::adorn_totals("row") %>%
janitor::adorn_pct_formatting(digits=2)
base::print(FrequencyYear, n=99)
# There are NOT 99 lines to the printout, but
# 99 was chosen as an exceptionally high and
# noticeable number, to allow printout of
# the full output.
reporting_year n percent
2010 5399 9.06%
2011 5499 9.23%
2012 5568 9.34%
2013 5336 8.96%
2014 5153 8.65%
2015 4955 8.32%
2016 4764 8.00%
2017 4650 7.80%
2018 4667 7.83%
2019 4582 7.69%
2020 4516 7.58%
2021 4497 7.55%
Total 59586 100.00%
dplyr::arrange(FrequencyYear, desc(n))
# Arrange output in descending order
# against the object n
reporting_year n percent
Total 59586 100.00%
2012 5568 9.34%
2011 5499 9.23%
2010 5399 9.06%
2013 5336 8.96%
2014 5153 8.65%
2015 4955 8.32%
2016 4764 8.00%
2018 4667 7.83%
2017 4650 7.80%
2019 4582 7.69%
2020 4516 7.58%
2021 4497 7.55%
A follow-up figure will help provide a graphical summary of the number of cases
(e.g., rows) associated with data for each year (Fig. 6.14).
Fig. 6.14
par(ask=TRUE)
ggplot2::ggplot(data=EPAFuelToxinsTrimmed.tbl,
aes(x=as.factor(reporting_year))) +
geom_bar(fill="red", color="black") +
coord_flip() +
ggtitle("Years Included in the Study of Fuel Types Used
for Electricity Production") +
labs(x="Year\n", y="\nN") +
# The X and Y axis are flipped, so adjust the labels.
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=1, vjust=0.5, angle=0))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# The bars for reporting_year were purposely not put into any
# type of order, increasing or decreasing. This decision was
# made to show the sequential progression of data collection.
# Fig. 6.14
FrequencyFuel <-
EPAFuelToxinsTrimmed.tbl %>%
janitor::tabyl(fuel_type,
show_na=TRUE,
show_missing_levels=TRUE) %>%
janitor::adorn_totals("row") %>%
janitor::adorn_pct_formatting(digits=2)
base::print(FrequencyFuel, n=99)
The printout of fuel types is quite long but attend to the screen printout to see the
many fuels used to generate electricity, even if a few fuel types are used sparingly.
Notice how the leading fuel is Natural Gas (Weighted U.S. Average), representing
57.41 percent of all valid cases in the dataset.17
dplyr::arrange(FrequencyFuel, desc(n))
# Arrange output in descending order
# against the object n
There are far too many fuel_type breakouts to easily make sense of the output.
Wrap the dplyr::mutate() function around the forcats::fct_lump() function to col-
lapse (e.g., lump) the number of breakouts into a manageable number, seven fuel
types plus other as the syntax is presented here.
17 Challenge: Use tools from the tidyverse ecosystem to observe change over time, if any, in percentage use of Natural Gas (Weighted US Average) in 2010 compared to 2021.
CollapseFuel.tbl <-
EPAFuelToxinsTrimmed.tbl %>%
dplyr::filter(!is.na(fuel_type)) %>%
# Remove the few NA values.
dplyr::mutate(fuel_type =
forcats::fct_lump(fuel_type, n = 7)) %>%
# Collapse the dataset into the 7 leading
# breakouts, with all remaining breakouts
# lumped into the ubiquitous breakout
# called other, all facilitated by use of
# the forcats::fct_lump() function.
dplyr::count(fuel_type)
base::print(CollapseFuel.tbl, n=99)
# A tibble: 8 x 2
fuel_type n
<fct> <int>
1 Bituminous 4239
2 Distillate Fuel Oil No. 2 12752
3 Kerosene 1111
4 Mixed (Electric Power sector) 591
5 Natural Gas (Weighted U.S. Average) 34142
6 Residual Fuel Oil No. 6 988
7 Subbituminous 3459
8 Other 2186
More testing of the dataset would be useful, but at this point in the initial analyses, the data seem both reliable and valid. Use the writexl::write_xlsx() function to immediately save a local copy of the data, for safekeeping in case the data were ever to become unavailable in the future.
writexl::write_xlsx(
EPAFuelToxinsTrimmed.tbl,
path = "D:\\R_Ceres\\EPAFuelToxinsTrimmed.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
base::file.exists("EPAFuelToxinsTrimmed.xlsx")
base::file.info("EPAFuelToxinsTrimmed.xlsx")
base::list.files(pattern =".xlsx")
Fig. 6.15
CollapseFuel.tbl %>%
dplyr::mutate(fuel_type = forcats::fct_reorder(fuel_type,
n)) %>%
ggplot2::ggplot(aes(x=fuel_type, y=n)) +
geom_col(fill="red", color="black") +
coord_flip() +
ggtitle("Fuel Types Used for Electricity Production:
2010 to 2021") +
labs(x="\nFuel Type", y="N\n") +
theme_Mac() +
theme(axis.text.y=element_text(face="bold", size=12,
hjust=1, vjust=0.5, angle=0))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
# Fig. 6.15
18 Challenge: Look at mean values for n2o_emissions_co2e from 2010 to 2019, in 2020, and then again in 2021. Consider possible reasons for the gradual (even if somewhat uneven) decline from 2010 to 2019, the large drop-off in 2020, and the uptick in 2021. As data become available, what is the trend from 2022 onward?
install.packages("RcmdrMisc", dependencies=TRUE)
library(RcmdrMisc)
RcmdrMisc::numSummary(
EPAFuelToxinsTrimmed.tbl[,c("n2o_emissions_co2e")],
statistics=c("mean", "sd", "quantiles"),
quantiles=c(0.5), # Median
groups=reporting_year) # Breakouts by Year
Look at the extreme difference between mean (arithmetic average) and the
median (e.g., 50th percentile, 50%). Look also at the high values for the sd (e.g.,
standard deviation). There is clearly wide variation in values for
n2o_emissions_co2e.
Challenge: Prepare a density plot of n2o_emissions_co2e to better understand
the data distribution pattern for this object variable. The odd shape of the density
plot provides further confirmation that there is wide variation in values for
n2o_emissions_co2e.
RcmdrMisc::numSummary(
EPAFuelToxinsTrimmed.tbl[,c("total_annual_heat_input")],
statistics=c("mean", "sd", "quantiles"),
quantiles=c(0.5), # Median
groups=reporting_year) # Breakouts by Year
RcmdrMisc::numSummary(
EPAFuelToxinsTrimmed.tbl[,c("ch4_emissions_co2e")],
statistics=c("mean", "sd", "quantiles"),
quantiles=c(0.5), # Median
groups=reporting_year) # Breakouts by Year
At the risk of being repetitive, it cannot be ignored that the data for the three
measured object variables show wide variance in data distribution. This observation
will later impact the selected inferential test(s) when attempts are made to make
more sense of how the data can be used for statistical testing: nonparametric v
parametric.
There are three remaining issues that should be addressed, to make later analyses
against the full dataset, which will be called EPAFuelToxinsNoNAs.tbl, as easy as
possible:
• Remove any remaining NA values, so that the dataset has no missing data.
• Put reporting_year into factor format.
• Put fuel_type into factor format.
EPAFuelToxinsNoNAs.tbl <-
EPAFuelToxinsTrimmed.tbl %>%
tidyr::drop_na()
# Remove any remaining NAs.
base::summary(EPAFuelToxinsNoNAs.tbl)
EPAFuelToxinsNoNAs.tbl$reporting_year <-
base::as.factor(
EPAFuelToxinsNoNAs.tbl$reporting_year)
# Put reporting_year into factor format.
base::summary(EPAFuelToxinsNoNAs.tbl)
EPAFuelToxinsNoNAs.tbl$fuel_type <-
base::as.factor(
EPAFuelToxinsNoNAs.tbl$fuel_type)
# Put fuel_type into factor format.
base::summary(EPAFuelToxinsNoNAs.tbl)
base::getwd()
base::ls()
base::attach(EPAFuelToxinsNoNAs.tbl)
utils::str(EPAFuelToxinsNoNAs.tbl)
dplyr::glimpse(EPAFuelToxinsNoNAs.tbl)
base::summary(EPAFuelToxinsNoNAs.tbl)
Following along with inquiries into the emissions included in this dataset and the generation of electricity by various fuel types, there are many ways to structure meaningful inferential analyses. As a starting example, consider a Kruskal-Wallis one-way analysis of variance by ranks (the nonparametric test suited to the 12 reporting_year breakouts) for n2o_emissions_co2e by reporting_year:
agricolae::kruskal(
EPAFuelToxinsNoNAs.tbl$n2o_emissions_co2e, # Measured
EPAFuelToxinsNoNAs.tbl$reporting_year, # Grouping
alpha=0.05, group=FALSE, p.adj="holm",
main="n2o_emissions_co2e by reporting_year",
console=TRUE)
# Use holm for pairwise comparisons. Another choice could
# have been to use bonferroni for pairwise comparisons.
Selected sections of output were deleted to save space.
• Continue with the previously demonstrated Kruskal-Wallis ANOVA, but now for total_annual_heat_input by reporting_year and ch4_emissions_co2e by reporting_year (a sketch of the first of these follows this list).
• After collapsing fuel_type into a reasonable number of breakouts, investigate
Kruskal-Wallis ANOVA for n2o_emissions_co2e by fuel_type, total_annual_
heat_input by fuel_type, and ch4_emissions_co2e by fuel_type.
• For those who especially take up this challenge, structure the analyses (collapsing fuel_type into a reasonable number of breakouts) from the perspective of a nonparametric Friedman Two-Way Analysis of Variance by Ranks.
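As a hedged sketch of the first bulleted follow-up, mirroring the agricolae::kruskal() syntax already demonstrated (the output itself is not reproduced here):
agricolae::kruskal(
EPAFuelToxinsNoNAs.tbl$total_annual_heat_input, # Measured
EPAFuelToxinsNoNAs.tbl$reporting_year, # Grouping
alpha=0.05, group=FALSE, p.adj="holm",
main="total_annual_heat_input by reporting_year",
console=TRUE)
# Use holm for pairwise comparisons, exactly as in the
# n2o_emissions_co2e example shown above.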
In summary, there are many possible ways to examine the data, all to add value
to the data science process. Take up these challenges.
Value Added Challenge: For anyone who follows the news, it is well known that
there is an effort to take electricity-generating power plants that burn coal and oil
offline and to instead replace the use of these fuels with natural gas.19 The premise
is that natural gas generates less pollution than coal and oil, per unit of heat output
used to ultimately generate electricity. The tidyverse ecosystem can be used to criti-
cally examine the efficacy of this transition (Figs. 6.16, 6.17, and 6.18).
Fig. 6.16
19 The change from coal and oil to natural gas for generation of electricity has provided the opportunity for many impressive videos, showing dramatic images of the implosion of old infrastructure:
boilers and cooling stacks. Merely as one of many possible selections, search for videos of the June
19, 2011, implosion of electricity-generating infrastructure at Riviera Beach, Florida. In mere
seconds, two boilers and two 300-foot stacks came tumbling down, to make way for construction
of a new natural gas-powered system. A serendipitous outcome was construction of a dedicated
lagoon, where warm water from the new cooling towers is discharged into an area where manatees
(a protected species) can gather during the winter and thrive when cool water temperatures may
otherwise put them at stress. Look at the Manatee Cam (https://www.visitmanateelagoon.com/) in
the winter, when air temperatures are about 60F or 15C and look at what may seem to be 100 or
more manatees enjoying the benefit of warm water discharge from the power plant.
Fig. 6.17
Fig. 6.18
EPAn2o_emissions_co2eTrimDesStatNatGasBitCoal.tbl <-
# Create a new object that will hold descriptive
# statistics for Natural Gas and Bituminous coal.
EPAFuelToxinsTrimmed.tbl %>%
dplyr::filter(fuel_type %in% c(
"Natural Gas (Weighted U.S. Average)",
"Bituminous")) %>%
# Use the dplyr::filter() function to focus only
# on Natural Gas (Weighted U.S. Average) and
# Bituminous.
dplyr::group_by(fuel_type, reporting_year) %>%
# Group output by fuel_type and reporting_year,
# the focus for intended figure that will show
# change in Sum n2o_emissions_co2e over time,
# 2010 to 2021.
dplyr::summarize(
N = base::length(n2o_emissions_co2e),
Minimum = base::min(n2o_emissions_co2e),
Median = stats::median(n2o_emissions_co2e),
Mean = base::mean(n2o_emissions_co2e),
SD = stats::sd(n2o_emissions_co2e),
Maximum = base::max(n2o_emissions_co2e),
Sum = base::sum(n2o_emissions_co2e),
Missing = base::sum(is.na(n2o_emissions_co2e))
# Generate a complete set of useful descriptive
# statistics, but the focus for the
# intended figure will be Sum since ultimately
# this is the most useful measure of pollution.
)
base::print(
EPAn2o_emissions_co2eTrimDesStatNatGasBitCoal.tbl, n=24)
# Observe the 2010 to 2021 N for Bituminous and Natural
# Gas (Weighted U.S. Average). Then, observe the 2010 to
# 2021 Sum for the same, both in the enumerated table of
# descriptive statistics but especially in the figure.
#
# A tibble: 24 × 10
# Groups: fuel_type [2]
fuel_type reporting_year N Median Mean SD Maximum
<chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 Bituminous 2010 502 4427. 8535. 9228. 45670.
2 Bituminous 2011 483 3997. 7725. 8613. 45736.
3 Bituminous 2012 483 2989. 6553. 8095. 44264.
4 Bituminous 2013 453 3947. 7252. 8402. 42769
5 Bituminous 2014 397 5643. 8719. 8800. 44086.
6 Bituminous 2015 363 5613. 8122. 8357. 42725.
7 Bituminous 2016 313 7166. 8666. 8367. 38331.
8 Bituminous 2017 296 6231. 8248. 8487. 40904.
9 Bituminous 2018 284 6088. 8123. 8396. 40419.
10 Bituminous 2019 242 5430. 7645. 7929. 39239.
11 Bituminous 2020 223 4991. 6761. 7190. 37404.
12 Bituminous 2021 200 7180. 8381. 7667. 37538.
13 Natural Gas 2010 2806 12.4 81.2 132. 1966.
14 Natural Gas 2011 2857 11.4 80.4 128. 993.
15 Natural Gas 2012 2907 15.5 95.8 143. 840.
16 Natural Gas 2013 2927 10.7 84.0 130. 837.
17 Natural Gas 2014 2898 11.2 84.3 129. 656.
18 Natural Gas 2015 2843 17 101. 146. 854.
19 Natural Gas 2016 2823 22.2 105. 146. 793.
20 Natural Gas 2017 2791 18.1 102. 172. 3177.
21 Natural Gas 2018 2841 25.8 112. 149. 881.
22 Natural Gas 2019 2825 26.6 122. 167. 1774.
23 Natural Gas 2020 2802 28.4 124. 166. 888.
24 Natural Gas 2021 2822 28.3 120. 160. 789.
par(ask=TRUE)
ggplot2::ggplot(data=
EPAn2o_emissions_co2eTrimDesStatNatGasBitCoal.tbl,
aes(x=reporting_year, y=Sum)) +
geom_point(size=4.00, color="red") +
geom_line(size=1.00, color="black") +
facet_wrap(~fuel_type) +
ggtitle(
"Sum n2o_emissions_co2e in the United States as a
Result of Electricity Generation: 2010 to 2021") +
labs(x="\nYear", y="Sum n2o_emissions_co2e\n") +
scale_x_continuous(limits=c(2010,2021),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
5000000), breaks=scales::pretty_breaks(n=3)) +
theme_Mac() +
theme(strip.text.x = element_text(face="bold",
size = 12, color = "black"))
# Put the faceted headers in bold.
# Fig. 6.16
EPAch4_emissions_co2eTrimDesStatNatGasBitCoal.tbl <-
# Create a new object that will hold descriptive
# statistics for Natural Gas and Bituminous coal.
EPAFuelToxinsTrimmed.tbl %>%
dplyr::filter(fuel_type %in% c(
"Natural Gas (Weighted U.S. Average)",
"Bituminous")) %>%
# Use the dplyr::filter() function to focus only
# on Natural Gas (Weighted U.S. Average) and
# Bituminous.
dplyr::group_by(fuel_type, reporting_year) %>%
# Group output by fuel_type and reporting_year,
# the focus for intended figure that will show
# change in Sum ch4_emissions_co2e over time,
# 2010 to 2021.
dplyr::summarize(
N = base::length(ch4_emissions_co2e),
Minimum = base::min(ch4_emissions_co2e),
Median = stats::median(ch4_emissions_co2e),
Mean = base::mean(ch4_emissions_co2e),
SD = stats::sd(ch4_emissions_co2e),
Maximum = base::max(ch4_emissions_co2e),
Sum = base::sum(ch4_emissions_co2e),
Missing = base::sum(is.na(ch4_emissions_co2e))
# Generate a complete set of useful descriptive
# statistics, but the focus for the
# intended figure will be Sum since ultimately
# this is the most useful measure of pollution.
)
base::print(
EPAch4_emissions_co2eTrimDesStatNatGasBitCoal.tbl, n=24)
# Observe the 2010 to 2021 N for Bituminous and Natural
# Gas (Weighted U.S. Average). Then, observe the 2010 to
# 2021 Sum for the same, both in the enumerated table of
# descriptive statistics but especially in the figure.
par(ask=TRUE)
ggplot2::ggplot(data=
EPAch4_emissions_co2eTrimDesStatNatGasBitCoal.tbl,
aes(x=reporting_year, y=Sum)) +
geom_point(size=4.00, color="red") +
geom_line(size=1.00, color="black") +
facet_wrap(~fuel_type) +
ggtitle(
"Sum ch4_emissions_co2e in the United States as a
Result of Electricity Generation: 2010 to 2021") +
labs(x="\nYear", y="Sum ch4_emissions_co2e\n") +
scale_x_continuous(limits=c(2010,2021),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
2000000), breaks=scales::pretty_breaks(n=3)) +
theme_Mac() +
theme(strip.text.x = element_text(face="bold",
size = 12, color = "black"))
# Put the faceted headers in bold.
# Fig. 6.17
EPAtotal_annual_heat_inputTrimDesStatNatGasBitCoal.tbl <-
# Create a new object that will hold descriptive
# statistics for Natural Gas and Bituminous coal.
EPAFuelToxinsTrimmed.tbl %>%
dplyr::filter(fuel_type %in% c(
"Natural Gas (Weighted U.S. Average)",
"Bituminous")) %>%
# Use the dplyr::filter() function to focus only
# on Natural Gas (Weighted U.S. Average) and
# Bituminous.
dplyr::group_by(fuel_type, reporting_year) %>%
# Group output by fuel_type and reporting_year,
# the focus for intended figure that will show
# change in Sum total_annual_heat_input over time,
# 2010 to 2021.
dplyr::summarize(
N = base::length(total_annual_heat_input),
Minimum = base::min(total_annual_heat_input),
Median = stats::median(total_annual_heat_input),
Mean = base::mean(total_annual_heat_input),
SD = stats::sd(total_annual_heat_input),
Maximum = base::max(total_annual_heat_input),
Sum = base::sum(total_annual_heat_input),
Missing = base::sum(is.na(total_annual_heat_input))
# Generate a complete set of useful descriptive
# statistics, but the focus for the
# intended figure will be Sum since ultimately
# this is the most useful measure of pollution.
)
base::print(
EPAtotal_annual_heat_inputTrimDesStatNatGasBitCoal.tbl, n=24)
# Observe the 2010 to 2021 N for Bituminous and Natural
# Gas (Weighted U.S. Average). Then, observe the 2010 to
# 2021 Sum for the same, both in the enumerated table of
# descriptive statistics but especially in the figure.
par(ask=TRUE)
ggplot2::ggplot(data=
EPAtotal_annual_heat_inputTrimDesStatNatGasBitCoal.tbl,
aes(x=reporting_year, y=Sum)) +
geom_point(size=4.00, color="red") +
geom_line(size=1.00, color="black") +
facet_wrap(~fuel_type) +
ggtitle(
"Sum total_annual_heat_input in the United States as a
Result of Electricity Generation: 2012 to 2021") +
labs(x="\nYear", y="Sum total_annual_heat_input\n",
caption="Data are unavailable for 2010 and 2011") +
scale_x_continuous(limits=c(2010,2021),
breaks=scales::pretty_breaks(n=7)) +
scale_y_continuous(labels=scales::comma, limits=c(0,
12000000000), breaks=scales::pretty_breaks(n=3)) +
theme_Mac() +
theme(strip.text.x = element_text(face="bold",
size = 12, color = "black"))
# Put the faceted headers in bold.
# Fig. 6.18
20 Be sure to notice that data for total_annual_heat_input were unavailable for two years, 2010
and 2011.
21 Consumers expect continuous and uninterrupted electricity for their factories, gas pumps at service stations, refrigeration at grocery stores, homes, hospitals, schools, shops, offices, traffic lights,
water purification plants, etc. Even a few minutes (seconds, actually) of interruption to the electric
power grid creates havoc. Downtime in the availability of electricity, even when power plants and
communities are faced with weather-related force majeure events are quickly deemed unaccept-
able by the public – the lights need to be on 24 hours a day, each day, every day, all year, with no
exceptions. Think of the February 2021 power outages in Texas. The disruptions, during an excep-
tionally active polar vortex that reached far into the South, with freezing weather and bitter storm
conditions, caused millions of consumers and businesses the hardship of intolerable living condi-
tions, diminished economic impact from lost productivity, flooded houses once frozen pipes even-
tually warmed up and discharged untold gallons of water into residences, and, worst of all, the
many deaths that could be directly attributable to the disruption of electric service as power-
generating plants were offline and power lines went down. Many consumers would have gladly
accepted a temporary increase in emissions, n2o_emissions_co2e and ch4_emissions_co2e, but of
course, power plants cannot be so easily transformed from one fuel type to another, and this can
certainly not be done quickly, without adequate (and costly) advance planning, if at all.
Addendum 4: API-Based Data in JavaScript Object Notation (JSON) Format
JavaScript Object Notation (JSON, often written in lowercase as json) is a text-based file format that is used to store and transfer data, typically using the Web to move data across multiple platforms. Not surprisingly, json data are often associated with Web applications.
A comparative advantage of json data is that the format is text based, only. Being text-based, json data have the potential for wide use across various computing systems and multiple users, since json data are not dependent on unique hardware or software. The simple text format makes json data potentially available to a wide community.22
The line-by-line appearance of json data may look a bit strange, at first, even
though json data are prepared in text format. There is no doubt that the structure for
json data is not immediately easy to read for those who are new to this file format,
but fortunately, many computing languages, including R, can work with json data
quite easily.
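As a small, self-contained illustration (a sketch using the jsonlite package, which is applied later in this addendum), a short json string parses directly into a rectangular R object:
jsonlite::fromJSON('[{"id": 1, "outcome": "case"},
                     {"id": 2, "outcome": "death"}]')
# Returns a two-row dataframe with columns id and outcome,
# showing how json text maps onto rectangular R data.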
This text is focused on the use of R and R-based API clients (e.g., functions). Accordingly, this addendum on json-based data is brief; however, json data have become so ubiquitous as the Web has grown in importance that an introductory data science text would be deficient without some treatment of them.
Given the continuing attention paid to COVID-19 and the desire to make data-based decisions when developing policies and protocols, it seemed appropriate to look at json data related to this disease, which is caused by the SARS-CoV-2 virus. The data, in json format, were obtained from a United States Centers for Disease Control and Prevention (CDC) Web-based resource: https://data.cdc.gov/Public-Health-Surveillance/Rates-of-COVID-19-Cases-or-Deaths-by-Age-Group-and/3rge-nu2a.
Rates of COVID-19 Cases or Deaths by Age Group and Vaccination Status
Data for CDC’s COVID Data Tracker site on Rates of COVID-19 Cases and Deaths
by Vaccination Status
Updated February 22, 2023
Data Provided by CDC COVID-19 Response, Epidemiology Task Force
Take time to explore the entire Web page associated with this resource. Notice
how the CDC makes the data available in many file formats, in their effort to make
resources available to as many users as possible. Give special attention to the Code
Book, where the data are described in full immediately after the header Columns in
this Dataset.
Using a known URL that has COVID-19 data provided by the Centers for Disease
Control and Prevention, the purpose of this addendum is to demonstrate how json
data are obtained and then put into a format suitable for R-based uses. The
22 In a similar manner, comma separated values (.csv) files are also text based, allowing wide use
across multiple platforms, software and hardware, and users.
following syntax is merely a very brief demonstration of the way json data can be
obtained and used:
CovidCDC.json <-
httr::GET("https://data.cdc.gov/resource/3rge-nu2a.json")
# Use the httr::GET() function to obtain the desired data.
CovidCDC.json
Response [https://data.cdc.gov/resource/3rge-nu2a.json]
Date: 2023-05-17 16:08
Status: 200
Content-Type: application/json;charset=utf-8
Size: 412 kB
[{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
,{"outcome":"case","month":"APR 2021","mmwr_week":"202114","age_
httr::http_type(CovidCDC.json)
base::class(CovidCDC.json)
utils::str(CovidCDC.json)
Convert the json object to a dataframe and then to a tibble, suitable for use with
the tidyverse ecosystem.
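The conversion syntax is not shown at this point in the extraction. A minimal sketch, assuming the httr and jsonlite packages are available and producing the object names CovidCDC.df and CovidCDC.tbl used below, might be:
CovidCDC.df <- jsonlite::fromJSON(
httr::content(CovidCDC.json, as="text", encoding="UTF-8"))
# Parse the text body of the json response into a dataframe.
CovidCDC.tbl <- tibble::as_tibble(CovidCDC.df)
# Promote the dataframe to a tibble, to keep within
# the tidyverse paradigm.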
base::class(CovidCDC.df)
utils::str(CovidCDC.df)
base::getwd()
base::ls()
base::attach(CovidCDC.tbl)
utils::str(CovidCDC.tbl)
dplyr::glimpse(CovidCDC.tbl)
Rows: 1,000
Columns: 16
$ outcome <chr> "case", "case", "case", "ca
$ month <chr> "APR 2021", "APR 2021", "AP
$ mmwr_week <chr> "202114", "202114", "202114
$ age_group <chr> "12-17", "18-29", "30-49",
$ vaccine_product <chr> "all_types", "all_types", "
$ vaccinated_with_outcome <chr> "8", "674", "1847", "1558",
$ fully_vaccinated_population <chr> "36887", "2543093", "742840
$ unvaccinated_with_outcome <chr> "30785", "76736", "98436",
$ unvaccinated_population <chr> "17556462", "31091322", "41
$ crude_vax_ir <chr> "21.687857511", "26.5031597
$ crude_unvax_ir <chr> "175.348541181", "246.80841
$ crude_irr <chr> "8.085102048", "9.312414844
$ continuity_correction <chr> "0", "0", "0", "0", "0", "0
$ age_adj_vax_ir <chr> NA, NA, NA, NA, NA, NA, "22
$ age_adj_unvax_ir <chr> NA, NA, NA, NA, NA, NA, "22
$ age_adj_irr <chr> NA, NA, NA, NA, NA, NA, "9.
base::summary(CovidCDC.tbl)
The data in this example were originally obtained in json format. A few simple
actions were used to convert the data into a tibble, named CovidCDC.tbl in this
example.
Challenge: This addendum has been focused on how data obtained in json format
are put into eventual format as a tibble. Much more needs to be done to make use of
the COVID-19 data obtained from the CDC. A few actions are demonstrated below
but take up the challenge and use the dataset to make sense of the data, related to
COVID-19 vaccination metrics.
Immediately, notice how all data are in character format. Yet, the Code Book
describes the desired format for each object variable. Change data formats, at least
for a few object variables.
CovidCDC.tbl$outcome <-
as.factor(CovidCDC.tbl$outcome)
CovidCDC.tbl$month <-
as.factor(CovidCDC.tbl$month)
CovidCDC.tbl$mmwr_week <-
as.integer(CovidCDC.tbl$mmwr_week)
CovidCDC.tbl$age_group <-
as.factor(CovidCDC.tbl$age_group)
CovidCDC.tbl$vaccine_product <-
as.factor(CovidCDC.tbl$vaccine_product)
CovidCDC.tbl$vaccinated_with_outcome <-
as.numeric(CovidCDC.tbl$vaccinated_with_outcome)
CovidCDC.tbl$fully_vaccinated_population <-
as.numeric(CovidCDC.tbl$fully_vaccinated_population)
CovidCDC.tbl$unvaccinated_with_outcome <-
as.numeric(CovidCDC.tbl$unvaccinated_with_outcome)
CovidCDC.tbl$unvaccinated_population <-
as.numeric(CovidCDC.tbl$unvaccinated_population)
base::getwd()
base::ls()
base::attach(CovidCDC.tbl)
utils::str(CovidCDC.tbl)
dplyr::glimpse(CovidCDC.tbl)
writexl::write_xlsx(
CovidCDC.tbl,
path = "D:\\R_Ceres\\CovidCDC.xlsx",
col_names=TRUE)
# Give special attention to how the path is
# identified, especially the use of double
# back slashes.
# Confirm that the downloaded file is in good form
base::file.exists("CovidCDC.xlsx")
base::file.info("CovidCDC.xlsx")
base::list.files(pattern =".xlsx")
More recoding activities may be necessary, but for now, prepare a simple bar
chart that represents month and year on the X-axis and fully_vaccinated_population
on the Y axis. Notice the sequential order of month and year, representing a timeline
in the correct order.
The data will be specific to all_types within the vaccine_product column (e.g.,
object variable). There is no desire to double-count cases.
base::summary(CovidCDC.tbl$outcome)
case death
812 188
base::summary(CovidCDC.tbl$age_group)
base::summary(CovidCDC.tbl$vaccine_product)
With the original dataset (CovidCDC.tbl) in good form, use the dplyr::filter()
function to adjust the dataset, generating a new dataset
(CovidCDCCaseAllAgesAllTypes.tbl) that retains only those rows that meet all
requirements for a meaningful bar plot of the number of individuals in the United
States who are fully vaccinated, with any of the different available vaccines. Notice
how the number of rows declines from 1000 rows, to 812 rows, to 308 rows, to
finally 77 rows (Fig. 6.19).
Fig. 6.19
CovidCDCCaseAllAgesAllTypes.tbl <-
CovidCDC.tbl %>%
dplyr::filter(outcome %in% c("case")) %>%
# Retain the rows that have case (only!)
# in the outcome column and delete all
# other rows.
# 812 rows - current number of rows
dplyr::filter(age_group %in% c("all_ages_adj")) %>%
# Retain the rows that have all_ages_adj (only!)
# in the age_group column and delete all
# other rows.
# 308 rows - current number of rows
dplyr::filter(vaccine_product %in% c("all_types"))
# Retain the rows that have all_types
# (only!) in the vaccine_product column
# and delete all other rows.
# 77 rows - current number of rows
base::attach(CovidCDCCaseAllAgesAllTypes.tbl)
str(CovidCDCCaseAllAgesAllTypes.tbl)
ggplot2::ggplot(data=CovidCDCCaseAllAgesAllTypes.tbl,
aes(x=stats::reorder(month, mmwr_week),
# Give attention to this clever ordering
# process, how mmwr_week was used to put
# month in sequential order.
y=fully_vaccinated_population)) +
geom_bar(stat="summary", fill="red") +
labs(
title="Population (All Ages) Fully Vaccinated (All Types) for
COVID-19
by Month and Year: September 2022 End Date",
subtitle="Data are from the CDC.",
x="\nMonth and Year", y="Fully Vaccinated Population\n") +
scale_y_continuous(labels=scales::comma, limits=c(0,
160000000), breaks=scales::pretty_breaks(n=7)) +
theme_Mac() +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=0.5, vjust=0.5, angle=00))
# The special theme-based accommodations for X axis text and/
# or Y axis text need to be placed after the enumerated theme
# theme_Mac().
#
# For this figure, observe the values used on both the X axis
# and the Y axis for hjust, vjust, and angle.
# Fig. 6.19
Challenge: Look closely at the data and consider how they can be used for other
presentations and analyses. Much more can and should be done with these and other
COVID-19 data from the CDC, to learn more about this disease and from that
knowledge, to prepare for future diseases.
The important takeaway from this addendum is that there is no mystery to json
data – json data represent merely another data format. Data in json format are text
based, and by using R (as well as other languages), it is possible to put the special-
ized json structure into a format suitable for R, as either a dataframe or a tibble.
Data scientists need to know how to work with data in many formats, including
text-based json data. However, this text is geared for R and R-based APIs, thus the
minimal discussion of json data. It is suggested that it is far easier to acquire and
work with data brought into an active R session by using a tidy-focused API, but of
course, an experienced data scientist must have the skills to use json data if that is
the format at hand.
The publisher’s Web site associated with this text includes the following files, pre-
sented in .csv, .txt, and .xlsx file formats.
CovidCDC.xlsx
Cranberry2010to2021.xlsx
EPAFuelToxinsTrimmed.xlsx
IACornAcresPlanted1926Onward.xlsx
IACornPricePerBushel1867Onward.xlsx
IACountyCornYieldBuAcStarttoEnd.xlsx
TNDavidsonMedHouseIncomeB19013_001acs5_2019SF.xlsx
TNMedHouseIncomeB19013_001acs1_2019.xlsx
TNMedHouseIncomeB19013_001acs5_2019.xlsx
Challenge: Use these files to practice and replicate the outcomes used in this les-
son. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.
Chapter 7
Putting It All Together – R, the tidyverse Ecosystem, and APIs
###############################################################
# Housekeeping Use for All Analyses #
###############################################################
.libPaths(new = "D:/R_Packages")
# As a preference, all installed packages
# will now go to the external D:\ drive.
date() # Current system time and date.
Sys.time() # Current system time and date (redundant).
R.version.string # R version and version release date.
options(digits=6) # Confirm default digits.
options(scipen=999)# Suppress scientific notation.
options(width=60) # Confirm output width.
ls() # List all objects in the working
# directory.
rm(list = ls()) # CAUTION: Remove all files in the working
# directory. If this action is not desired,
# use the rm() function one by one to remove
# the objects that are not needed.
ls.str() # List all objects, with finite detail.
getwd() # Identify the current working directory.
setwd("D:/R_Ceres")
# Set to a new working directory.
# Note the single forward slash and double
# quotes.
# This new directory should be the directory
# where the data file is located, otherwise
# the data file will not be found.
getwd() # Confirm the working directory.
list.files() # List files at the PC directory.
.libPaths() # Library pathname.
.Library # Library pathname.
sessionInfo() # R version, locale, and packages.
search() # Attached packages and objects.
searchpaths() # Attached packages and objects.
###############################################################
The packages marked with a preceding R # comment character have been previ-
ously downloaded from CRAN, are housed in the correct directory (e.g., folder),
and represent the most currently available version. Accordingly, it is not necessary
to download them again, and application of the library() function is sufficient to put
the functions in these packages into use. To save space, it is not uncommon to see
syntax such as library(Package1); library(Package2); library(Package3), etc. on the
same line, but that practice was not used in this lesson.
# install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
# install.packages("readxl", dependencies=TRUE)
library(readxl)
# install.packages("magrittr", dependencies=TRUE)
library(magrittr)
# install.packages("janitor", dependencies=TRUE)
library(janitor)
# install.packages("rlang", dependencies=TRUE)
library(rlang)
# install.packages("htmltools", dependencies=TRUE)
library(htmltools)
# install.packages("httr", dependencies=TRUE)
library(httr)
# install.packages("jsonlite", dependencies=TRUE)
library(jsonlite)
# install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)
# install.packages("ggpubr", dependencies=TRUE)
library(ggpubr)
# install.packages("ggtext", dependencies=TRUE)
library(ggtext)
# install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)
# install.packages("scales", dependencies=TRUE)
library(scales)
# install.packages("gridExtra", dependencies=TRUE)
library(gridExtra)
# install.packages("cowplot", dependencies=TRUE)
library(cowplot)
# install.packages("writexl", dependencies=TRUE)
library(writexl)
###############################################################
# Use the tidycensus package and/or the acs package and the
# U.S. Census Bureau key to obtain state and/or county specific
# data from selected American Community Survey (ACS) and/or
# Decennial Census tables.
#
# Use the following URL to access the form needed to obtain an
# API key from the U.S. Census Bureau:
# https://api.census.gov/data/key_signup.html
#
# Complete details on the API process with U.S. Census Bureau
# are available at https://www.census.gov/content/dam/Census/
# library/publications/2020/acs/acs_api_handbook_2020_ch02.pdf.
# install.packages("tidycensus", dependencies=TRUE)
library(tidycensus)
# CAUTION: The tidycensus package may take longer to
# download than other packages. Be patient.
tidycensus::census_api_key(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
#"Yourx40xdigitxkeyxgoesxherexxxxxxxxxxxxx",
overwrite=FALSE)
# CAUTION: The code for this key is reserved
# for [email protected] only. Use your own key!
#
# Note: Many 2020 Decennial Census data products are coming
# online, but all data are not yet available. Use the 2010
# Decennial Census as a base for table identification and
# nature of the data.
#
SF12010 <-
tidycensus::load_variables(2010, "sf1", cache = TRUE)
View(SF12010)
str(SF12010)
writexl::write_xlsx(SF12010,
path = "D:\\R_Ceres\\SF12010.xlsx", col_names=TRUE)
#
SF22010 <-
tidycensus::load_variables(2010, "sf2", cache = TRUE)
View(SF22010)
str(SF22010)
writexl::write_xlsx(SF22010,
path = "D:\\R_Ceres\\SF22010.xlsx", col_names=TRUE)
#
###############################################################
# install.packages("acs", dependencies=TRUE)
library(acs)
# acs.tables.install()
# Be patient. Some packages take more than a few minutes to
# install, and a few also take an unexpectedly long time to
# load when the library() function is called.
###############################################################
# Mapping #
###############################################################
# install.packages("htmlTable", dependencies=TRUE)
library(htmlTable)
# install.packages("ggmap", dependencies=TRUE)
library(ggmap)
# install.packages("maps", dependencies=TRUE)
library(maps)
# install.packages("maptools", dependencies=TRUE)
library(maptools)
# install.packages("Rcpp", dependencies=TRUE)
library(Rcpp)
# install.packages("rgdal", dependencies=TRUE)
library(rgdal)
# install.packages("rgeos", dependencies=TRUE)
library(rgeos)
# install.packages("sf", dependencies=TRUE)
library(sf)
# install.packages("sp", dependencies=TRUE)
library(sp)
# install.packages("stars", dependencies=TRUE)
library(stars)
# install.packages("terra", dependencies=TRUE)
library(terra)
# install.packages("usmap", dependencies=TRUE)
library(usmap)
# install.packages("xfun", dependencies=TRUE)
library(xfun)
# install.packages("choroplethr", dependencies=TRUE)
library(choroplethr)
# install.packages("choroplethrAdmin1", dependencies=TRUE)
library(choroplethrAdmin1)
# install.packages("choroplethrMaps", dependencies=TRUE)
library(choroplethrMaps)
# install.packages("zctaCrosswalk", dependencies=TRUE)
library(zctaCrosswalk)
# The zctaCrosswalk package may be a good substitute
# to the choroplethrZip package. The choroplethrZip
# package is not found at CRAN but is instead housed
# at github.
###############################################################
###############################################################
# needed. However, its bold and large fonts are the main
# reason for its use -- it makes figures easy to read.
base::class(theme_Mac)
# Confirm that the user-created object
# theme_Mac() is a function.
###############################################################
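The full definition of theme_Mac() appears in an earlier lesson and is only partially echoed in the comments above. As a hedged illustration (not the author's actual definition), a user-created ggplot2 theme in the same spirit, favoring bold and comparatively large fonts, might be written as:
theme_Mac <- function() {
ggplot2::theme_bw() +
ggplot2::theme(
plot.title   = ggplot2::element_text(face="bold", size=14),
axis.title.x = ggplot2::element_text(face="bold", size=12),
axis.title.y = ggplot2::element_text(face="bold", size=12),
axis.text.x  = ggplot2::element_text(face="bold", size=10),
axis.text.y  = ggplot2::element_text(face="bold", size=10))
}
# theme_Mac() can then be added to a ggplot2 object, as shown
# throughout this lesson.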
At this point in the text, the ending lesson, and before the tidyverse ecosystem is put into use one last time, it seems best to identify a few functions that should be deployed occasionally to remain current with R, the tidyverse ecosystem, and the many periodic updates of each. This brief section also identifies possible conflicts and any other issues that may arise when using the tidyverse and its many packages and functions:
utils::news(package="R")
utils::news(package="tidyverse")
# Remain current with the tidyverse.
tidyverse::tidyverse_sitrep()
# Receive a situation report on the
# tidyverse, core packages and non-core
# packages.
tidyverse::tidyverse_conflicts()
# Identify tidyverse functions that may
# be in conflict with other functions if
# a PackageName::FunctionName() naming
# convention were not used.
As a reminder, as of early 2023 and the release of tidyverse 2.0.0, the lubridate
package is now among the core tidyverse packages. Use the following functions to
see highly detailed information about which package version is in use:
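The specific calls are not reproduced at this point; functions along the following lines (an assumption consistent with base R's utils package) would show the version details:
utils::packageVersion("tidyverse")
utils::packageVersion("lubridate")
utils::sessionInfo()
# Display the installed version of individual packages and a
# summary of the full session, including attached packages.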
Now that the upfront work of preparing for this lesson is completed, prepare a list of five years (2015, 2016, 2017, 2018, and 2019) for which county-wide data are available and will be requested from the Census Bureau, specifically the American Community Survey (ACS):
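The assignment that creates years2015to2019.lst is not shown in this extraction. A minimal sketch consistent with the syntax that follows (the element names are an assumption) might be:
years2015to2019.lst <-
base::list(Year2015=2015, Year2016=2016, Year2017=2017,
Year2018=2018, Year2019=2019)
# A named list of the five years of interest; the list is
# later passed to purrr::map_dfr() to iterate the data
# requests, one year at a time.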
base::getwd()
base::ls()
base::attach(years2015to2019.lst)
base::class(years2015to2019.lst)
This list, used along with the purrr::map_dfr() function, will facilitate the acqui-
sition of ACS data over multiple years, 2015 to 2019. Although this is not exactly
syntax for a loop, think of this as being like a looping process, where one set of
syntax is used to iterate the data acquisition process.
Use the tidycensus::get_estimates() function and the product="components"
argument to obtain data for 12 unique variables of interest to those who work in
public health. Download the file so that the data are available for future use.
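The data-acquisition syntax itself is also not reproduced here. A hedged sketch of how the list of years, purrr::map_dfr(), and tidycensus::get_estimates() might be combined (the exact arguments used by the author are an assumption) is:
USCounties2015to2019Components.tbl <-
purrr::map_dfr(years2015to2019.lst, function(y) {
tidycensus::get_estimates(
geography = "county",     # Data - all US counties
product   = "components", # Components-of-change variables
year      = y,            # One year per iteration
output    = "tidy") %>%   # Data - tidy format
dplyr::mutate(year = y)     # Record the year on each row
})
# purrr::map_dfr() applies the same data request to each year
# in the list and binds the returned rows into one tibble.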
base::attach(USCounties2015to2019Components.tbl)
base::unique(USCounties2015to2019Components.tbl$variable)
base::print(USCounties2015to2019Components.tbl, n=24)
Note: The syntax associated with this data request to the Census Bureau results
in a visible dataset that is somewhat large, generating a tibble of 188,520 rows by 6
columns. What is not seen, however, is the complexity of all that must occur to
organize the data in the requested manner: multiple years, multiple states, multiple
counties and GEOIDs within states, multiple MULTIPOLYGONs, etc. Be patient
when executing the syntax. If possible, processing speed seems fastest during early
morning nonbusiness hours for the Eastern Time Zone, or four or more hours behind
Zulu Time Zone (e.g., Greenwich Mean Time (GMT)). A complex data request to
the Census Bureau that required 26 minutes to process at noon on a regular Monday-to-Friday workday required less than 1 minute to process in the earliest hours of the morning, before sunrise.
Note: For those with confidence in their Internet connection, it may not be neces-
sary to download this file and similar files gained by use of the tidycensus::get_esti-
mates() function. It would be a rare event if Census Bureau ACS data were
unavailable. Further, the Census Bureau key is available to all, merely by complet-
ing a simple online form. These datasets can be easily recreated, eliminating nearly
all concerns about availability. But of course, the data could be saved if needed.
Challenge: Given the volumes of data, data relating to 12 unique variables over
multiple years, data for multiple states, and data for multiple counties, focus on the
variable NATURALINC (review Census Bureau resources to find a precise defini-
tion of natural increase) as it relates to change in populations. Review the syntax
found immediately below and use tidyverse ecosystem tools to embellish the figure:
Adjust the Y-axis scale, add a theme that increases readability, and add an annota-
tion that describes the meaning of natural increase for those who may not know the
correct application of this term. Then, explain why public health personnel need
to have detailed information about natural increase (as well as other indicators
related to population dynamics) at multiple levels of detail: by state, by county, by
tract, etc. Finally, be sure to study this syntax and notice how many different activi-
ties (e.g., filtering, computation, graphics) were all chained together into efficient
and highly transparent syntax that resulted in a useful draft figure (Fig. 7.1).
Fig. 7.1
USCounties2015to2019NATURALINC.fig <-
USCounties2015to2019Components.tbl %>%
# Currently, 188,520 rows by 6 columns
dplyr::filter(variable %in% c(
"NATURALINC")) %>%
# Now, 15,710 rows by 6 columns
# Retain rows where the string value shows
# in the object called variable.
dplyr::group_by(year) %>% # Prepare statistics for
dplyr::summarize( # each year, 2015-2019.
# N = base::length(value), # Descriptive statistics
# Minimum = base::min(value), # that are excluded by
# Median = stats::median(value), # using the R comment
# Mean = base::mean(value), # character.
# SD = stats::sd(value), # Prepare Sum NATURALINC
# Maximum = base::max(value), # by year statistics.
Sum = base::sum(value)) %>%
ggplot2::ggplot(aes(x=year, y=Sum)) +
geom_col(fill="red", color="black") +
labs(title =
"Sum United States Natural Increase (NATURALINC) by Year:
2015 to 2019")
# Fig. 7.1
par(ask=TRUE); USCounties2015to2019NATURALINC.fig
USCounties2019Housing.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
year="2019", # Data - 2019 only
product="housing", # Data - 01 unique variable
# HUEST
output="tidy", # Data - tidy format
geometry="TRUE") # Fetch geometry data.
base::attach(USCounties2019Housing.tbl)
base::unique(USCounties2019Housing.tbl$variable)
base::print(USCounties2019Housing.tbl, n=24)
USCounties2019HousingMultipleStates.tbl <-
USCounties2019Housing.tbl %>%
dplyr::filter(
grepl('Kentucky|Maryland|Virginia', NAME)) %>%
dplyr::group_by(GEOID, NAME) %>%
dplyr::summarize(
N = base::length(value),
# Minimum = base::min(value),
# Median = stats::median(value),
# Mean = base::mean(value),
# SD = stats::sd(value),
# Maximum = base::max(value),
Sum = base::sum(value)) %>%
dplyr::arrange(GEOID)
# Arrange the output in GEOID
# ascending order.
base::print(USCounties2019HousingMultipleStates.tbl, n=332)
Question: Replicate the above syntax, but instead use Ohio in place of Virginia in
the selection line, so that the filter reads grepl('Kentucky|Maryland|Ohio', NAME).
Would the dataset include data for any states other than Kentucky, Maryland, or
Ohio? If so, which state(s), and why would this occur?
Use the tidycensus::get_estimates() function and the product="population" argu-
ment to obtain population data. Along with overall population headcounts, the pop-
ulation density data are also quite useful. Consider the relevance of the population
density data for defined areas and how public health services are provided, urban v
rural, and the too common concern (whether founded or not) that rural residents
may not receive the same degree of public health service as urban residents.
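A minimal sketch of one way to acquire these population and density data is shown below; the exact arguments are assumptions, and only the object name comes from the syntax that follows.
USCounties2015to2019Population.tbl <-
purrr::map_dfr(years2015to2019.lst, function(x) {
tidycensus::get_estimates(
geography="county", # Data - all US counties
product="population", # Data - POP and DENSITY
year=x, # Data - one year per pass
output="tidy") %>% # Data - tidy format
dplyr::mutate(year=x) # Tag each row with its year.
})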
base::attach(USCounties2015to2019Population.tbl)
base::unique(USCounties2015to2019Population.tbl$variable)
base::print(USCounties2015to2019Population.tbl, n=24)
USCounties2019PopulationDensityPA.map <-
USCounties2015to2019Population.tbl %>%
dplyr::filter(variable %in% c(
"DENSITY")) %>%
# Retain rows where the string DENSITY shows in
# the object called variable.
dplyr::filter(year %in% c(
"2019")) %>%
# Retain rows where the string 2019 shows in
# the object called year.
dplyr::filter(grepl('Pennsylvania', NAME)) %>%
# Select Pennsylvania only.
dplyr::rename(region=GEOID)
# Rename GEOID to region, to satisfy naming
# requirements for the choroplethr package.
base::attach(USCounties2019PopulationDensityPA.map)
base::unique(USCounties2019PopulationDensityPA.map)
base::length(USCounties2019PopulationDensityPA.map)
base::print(USCounties2019PopulationDensityPA.map)
Fig. 7.2
USCounties2019PopulationDensityPA.map$region <-
base::as.numeric(
USCounties2019PopulationDensityPA.map$region)
base::attach(USCounties2019PopulationDensityPA.map)
utils::str(USCounties2019PopulationDensityPA.map)
par(ask=TRUE)
choroplethr::county_choropleth(
USCounties2019PopulationDensityPA.map,
state_zoom="pennsylvania")
# state_zoom (a standard argument of the
# choroplethr::county_choropleth() function) is an
# assumed completion of this call; it zooms the map to
# Pennsylvania. Add title= and legend= arguments as
# desired.
# Fig. 7.2
USCounties2019AgeGroupCA.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
state="California", # Data - CA only
year="2019", # Data - 2019 only
product="characteristics", # Data - 32 unique variables
breakdown="AGEGROUP", # Data - select AgeGroup only
# All ages, to
# Multiple age breakouts, to
# Median age
breakdown_labels="TRUE",
output="tidy")
base::attach(USCounties2019AgeGroupCA.tbl)
base::unique(USCounties2019AgeGroupCA.tbl$AGEGROUP)
base::print(USCounties2019AgeGroupCA.tbl, n=32)
Challenge: Prepare output that displays median age (2019) for each California
county and then prepare a choropleth to visually reinforce outcomes.
USCounties2019HispanicTX.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
state="Texas", # Data - TX only
year="2019", # Data - 2019 only
product="characteristics", # Data - 03 unique variables
breakdown="HISP", # Data - select Hispanic only
# Both Hispanic Origins
# Non-Hispanic
# Hispanic
breakdown_labels="TRUE",
output="tidy")
base::attach(USCounties2019HispanicTX.tbl)
base::unique(USCounties2019HispanicTX.tbl$HISP)
base::print(USCounties2019HispanicTX.tbl, n=06)
USCounties2019RaceVT.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
state="Vermont", # Data - VT only
year="2019", # Data - 2019 only
product="characteristics", # Data - 12 unique variables
breakdown="RACE", # Data - select Race only
# All races
# White alone
# Black alone
# American Indian and Alaska Native alone
# Asian alone
# Native Hawaiian and Other Pacific Islander alone
# Two or more races
# White alone or in combination
# Black alone or in combination
# American Indian and Alaska Native alone or in
# combination
# Asian alone or in combination
# Native Hawaiian and Other Pacific Islander alone or in
# combination
breakdown_labels="TRUE",
output="tidy")
base::attach(USCounties2019RaceVT.tbl)
base::unique(USCounties2019RaceVT.tbl$RACE)
base::print(USCounties2019RaceVT.tbl, n=99)
USCounties2019SEXWY.tbl <-
tidycensus::get_estimates(
geography="county", # Data - all US counties
state="Wyoming", # Data - WY only
year="2019", # Data - 2019 only
product="characteristics", # Data - 03 unique variables
breakdown="SEX", # Data - select SEX only
# Both sexes
# Male
# Female
breakdown_labels="TRUE",
output="tidy")
base::attach(USCounties2019SEXWY.tbl)
base::unique(USCounties2019SEXWY.tbl$SEX)
base::print(USCounties2019SEXWY.tbl, n=06)
Challenge: Use the syntax below to then prepare two choropleth maps of
Wyoming counties: (1) percentage (2019) Males, and (2) percentage (2019)
Females. For those with more advanced skills, use a matched pairs (possibly a
Student’s t-Test for Matched Pairs) approach to the data and determine if there is a
by-county statistically significant (p ≤ 0.05) difference in age between males and
females.
Challenge: Use tools from the tidyverse ecosystem to determine the percentage
of males and females for each Wyoming county (2019). An approach is displayed
immediately below, but there are many ways to address this task so think of other
ways that the two percentages for each county can be easily calculated.
USCounties2019SEXWYPctFemaleMale.tbl <-
USCounties2019SEXWY.tbl %>%
group_by(GEOID) %>%
mutate(percentage = prop.table(value) * 100 * 2) %>%
arrange(SEX, GEOID)
base::attach(USCounties2019SEXWYPctFemaleMale.tbl)
base::print(USCounties2019SEXWYPctFemaleMale.tbl, n=69)
USCounties2019SEXWYPctFemaleMale.tbl <-
USCounties2019SEXWY.tbl %>%
group_by(NAME) %>%
mutate(percentage = prop.table(value) * 100 * 2) %>%
arrange(SEX, GEOID)
base::attach(USCounties2019SEXWYPctFemaleMale.tbl)
base::print(USCounties2019SEXWYPctFemaleMale.tbl, n=69)
The data previously used in this lesson were in long format, due to use of the
tidycensus::get_estimates() function and the output="tidy" argument. This feature
is very useful, placing fetched data in long (e.g., tidy) format immediately when
data are retrieved. Yet, data scientists often encounter wide format data and actions
need to be used to restructure the data into long format.
Fortunately, the tidyverse ecosystem can accommodate data restructuring, from
wide to long and from long to wide, as needed:
• The tidyr::pivot_longer() function is used to put wide format data into long for-
mat. The use of this function is quite common, as wide data that have good eye-
appeal as a spreadsheet-type table are restructured into long format to better
accommodate many features in the tidyverse ecosystem.
• The tidyr::pivot_wider() function is perhaps less frequently used, as long format
data are restructured into wide format.
Following with use of the tidyverse, it is not at all uncommon to see either the
tidyr::pivot_longer() function or the tidyr::pivot_wider() function used in association
with the %>% (e.g., pipe) operator as multiple actions are chained into one unified
set of syntax.
WMilkLbFatProtein.tbl <-
readxl::read_xlsx("WideMilkLbsPctFatPctProtein.xlsx",
sheet = 1, # Read in the 1st sheet.
col_names = TRUE, # The 1st row represents column names.
col_types = c(
"numeric", # Column A AyrshireLb
"numeric", # Column B GuernseyLb
"numeric", # Column C HolsteinLb
"numeric", # Column D JerseyLb
"numeric", # Column E AyrshirePctFat
"numeric", # Column F GuernseyPctFat
"numeric", # Column G HolsteinPctFat
"numeric", # Column H JerseyPctFat
"numeric", # Column I AyrshirePctProtein
"numeric", # Column J GuernseyPctProtein
"numeric", # Column K HolsteinPctProtein
"numeric", # Column L JerseyPctProtein
"text", # Column M Management
"text"), # Column N Farm
trim_ws = FALSE, # Retain leading/trailing whitespace.
n_max = 21)
# When viewing the object name WMilkLbFatProtein.tbl, for
# this lesson the convention W is used to relay how the data
# are in WIDE format. Later, the convention L is used to
# identify how data are in LONG format. The use of this
# naming schema is certainly not required, but it is a good
# programming practice (gpp) and improves communication and
# comprehension of data layout.
The object variables Management and Farm show as character-type objects, and
indeed they are strings of character values. Yet, they should be viewed as factors.
From among the many ways available in R, put these two object variables into factor
format and declare a specific order for the factors, with the ordering based on alpha-
betical order.
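The two level objects referenced in the factor() calls below are built here as a minimal sketch, assuming alphabetical ordering of the observed values:
ManagementLevels <-
base::sort(base::unique(WMilkLbFatProtein.tbl$Management))
FarmLevels <-
base::sort(base::unique(WMilkLbFatProtein.tbl$Farm))
# Alphabetically ordered character vectors of the
# observed Management and Farm values, used as the
# levels= argument in the factor() calls below.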
WMilkLbFatProtein.tbl$Management <-
factor(WMilkLbFatProtein.tbl$Management,
levels=ManagementLevels)
# Put the object variable Management into
# factor format, now as an ordered factor
# with specific levels.
WMilkLbFatProtein.tbl$Farm <-
factor(WMilkLbFatProtein.tbl$Farm,
levels=FarmLevels)
# Put the object variable Farm into factor
# format, now as an ordered factor with
# specific levels.
From among the many ways this task could be addressed, a reasonable approach
is to structure the data into long format using the tidyr::pivot_longer() function.
Note how this function is in contemporary use, whereas the tidyr::gather() function
is considered outdated and is no longer suggested for use, although it still works and
there are no current stated plans to remove it from availability.
First, create a wide format dataset that is restricted to the following data:
AyrshireLb, GuernseyLb, HolsteinLb, JerseyLb, Management, and Farm. This new
dataset excludes all data related to percent fat and percent protein.
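A minimal sketch of this restriction, assuming the dplyr::select() function (other approaches would work equally well):
WMilkLbBreedMgtFarm.tbl <-
WMilkLbFatProtein.tbl %>%
dplyr::select(AyrshireLb, GuernseyLb, HolsteinLb,
JerseyLb, Management, Farm)
# Retain the four pounds-of-milk columns and the two
# factor-type columns; the percent fat and percent
# protein columns are excluded.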
WMilkLbBreedMgtFarm.tbl
LMilkLbBreedMgtFarm.tbl <-
tidyr::pivot_longer(WMilkLbBreedMgtFarm.tbl,
-c(Management, Farm),
names_to = "Breed", values_to = "Pounds")
# Put the data into long format, using the
# tidyr::pivot_longer() function.
#
# The expression -c(Management, Farm) means
# that the tidyr::pivot_longer() function
# should pivot everything except Management
# and Farm. In this syntax, the minus sign
# means except.
LMilkLbBreedMgtFarm.tbl
base::getwd() # Working directory
base::ls() # List current files
base::attach(LMilkLbBreedMgtFarm.tbl) # Attach the file
utils::str(LMilkLbBreedMgtFarm.tbl) # File structure
dplyr::glimpse(LMilkLbBreedMgtFarm.tbl) # File structure
utils::head(LMilkLbBreedMgtFarm.tbl) # Print first few rows
summary(LMilkLbBreedMgtFarm.tbl) # Data summary
LMilkLbBreedMgtFarm.tbl
It seems that the wide format dataset WMilkLbBreedMgtFarm.tbl has been suc-
cessfully restructured into the long format dataset LMilkLbBreedMgtFarm.tbl. Yet,
as a quality assurance measure, it is best to confirm that the data in long format are
equivalent to the data in wide format and that the only difference is in the way the
data are organized. A data scientist assumes little and confirms much.
• First, prepare an initial figure that will be used to provide a sense of data distribu-
tion for all pounds of milk per lactation datapoints, knowing that the graphic will
be for Pounds in collapsed format – data from all 80 dairy cows.
• After this initial figure of collapsed data is prepared, prepare a set of figures that
addresses pounds of milk per lactation by Breed, by Management, and by Farm.
A violin plot with a superimposed boxplot will be used to prepare an initial figure
of pounds of milk per lactation, knowing that the figure represents data for all four
breeds (20 cows per breed * 4 breeds = 80 datapoints). By using a violin plot and a
superimposed boxplot, the resulting figure should give a good visual advance orga-
nizer of the data and where all data show compared to measures of central tendency,
such as the mean and the median. Be sure to consider how two graphical tools are
incorporated into one common figure, a useful feature supported by the
ggplot2::ggplot() function – a key part of the tidyverse ecosystem (Fig. 7.3).
par(ask=TRUE)
# As found throughout the many examples in this lesson and
# text, par(ask=TRUE) is used to freeze the screen and
# provide some degree of control of sequence and flow.
ggplot2::ggplot(data=LMilkLbBreedMgtFarm.tbl,
aes_string(x=1, y=Pounds))+
# The violin plot and superimposed boxplot in this figure
# are prepared only for one variable, Pounds. Notice the
# use of aes_string() and how x is accommodated, using the
# number 1 instead of a named variable.
geom_violin(width=4.5, fill="lightcyan") +
geom_boxplot(width = 0.5, fill="honeydew2", fatten=4,
outlier.shape=NA) +
# The fatten parameter is used to adjust the thickness of
# the line representing the median. The default is 2, but
# in this figure notice how the line has been adjusted to
# 4, making the line thick and by design very noticeable.
# The boxplot will purposely hide outliers, by using
# outlier.shape=NA, so that the outliers do not clash with
# the purposeful presentation of all datapoints from the 80
# dairy cows, which show as red circles. Be sure to
# observe placement of the 80 datapoints in relation to the
# mean (blue circle in the box) and median (black
# horizontal line in the box).
stat_summary(fun="mean",geom="point", size=6, pch=21,
color="blue", fill="blue")+
# Place a large blue circle inside the box, representing
# the mean. It is standard with a box plot that the
# median is represented by a horizontal bar inside the
# box.
geom_jitter(color= "red") + # Datapoints
annotate("text", x=1.0, y=16000, fontface=2, label=
"Median - the black horizontal line in the boxplot") +
annotate("text", x=1.0, y=22000, fontface=2, label=
456 7 Putting It All Together – R, the tidyverse Ecosystem, and APIs
Fig. 7.3
By showing both a violin plot and a boxplot in the same figure it was possible to
prepare an extremely interesting figure, a figure that would be less useful if only the
violin plot or the boxplot had been presented. Give special attention to both the
datapoints and the shape of the violin plot. There seem to be two unique data distri-
bution patterns for overall pounds of milk per lactation, but of course, recall that
there are multiple breeds represented in the dataset and resulting figure. As interest-
ing as this initial figure and later descriptive statistics may be, confirming inferential
analyses are needed to say more about this observation regarding data distribution.
Initial figures and descriptive statistics guide inquiries, but confirmation comes
from the application of appropriate inferential tests.
As displayed in the violin plot and superimposed boxplot, it is evident that data
(e.g., Pounds, pounds of milk per lactation) do not follow any normal distribution
pattern. A set of figures may help offer a better sense of the data and the figures may
also offer a few ideas on how to approach later statistical analyses, descriptive sta-
tistics, and inferential statistics (Figs. 7.4, 7.5, and 7.6).
Fig. 7.4
Fig. 7.5
Fig. 7.6
MilkLbBreed.fig <-
ggplot2::ggplot(LMilkLbBreedMgtFarm.tbl,
aes(x=Breed, y=Pounds)) +
geom_boxplot((aes(fill=Breed))) +
labs(title="Pounds of Milk per Lactation:
Breed",
x="\nBreed", y="Pounds of Milk\n") +
scale_y_continuous(labels=scales::comma,
limits=c(17000, 28000),
breaks=scales::pretty_breaks(n=6)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=1, angle=45))
# Fig. 7.4
par(ask=TRUE); MilkLbBreed.fig
MilkLbManagementBreed.fig <-
ggplot2::ggplot(LMilkLbBreedMgtFarm.tbl,
aes(x=Breed, y=Pounds)) +
geom_boxplot((aes(fill=Management))) +
labs(title="Pounds of Milk per Lactation:
Management by Breed",
x="\nBreed", y="Pounds of Milk\n") +
scale_y_continuous(labels=scales::comma,
limits=c(17000, 28000),
breaks=scales::pretty_breaks(n=6)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=1, angle=45))
# Fig. 7.5
par(ask=TRUE); MilkLbManagementBreed.fig
MilkLbFarmBreed.fig <-
ggplot2::ggplot(LMilkLbBreedMgtFarm.tbl,
aes(x=Breed, y=Pounds)) +
geom_boxplot((aes(fill=Farm))) +
labs(title="Pounds of Milk per Lactation:
Farm by Breed",
x="\nBreed", y="Pounds of Milk\n") +
scale_y_continuous(labels=scales::comma,
limits=c(17000, 28000),
breaks=scales::pretty_breaks(n=6)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=1, angle=45))
# Fig. 7.6
par(ask=TRUE); MilkLbFarmBreed.fig
Much more could and should be done with the data and with the generation of
figures that offer a glimpse of data patterns and, from that, suggest analyses that
offer more definitive answers regarding statistically significant differences and
associations (p ≤ 0.05). What should be recalled from this ending lesson is that
figures, regardless of how the data may appear in graphical format, do not provide
sufficient evidence to make any statements on significance. Formal statistical
testing is needed for judgments on significance.
# Base R
base::mean(LMilkLbBreedMgtFarm.tbl$Pounds)
[1] 20226.8
stats::sd(LMilkLbBreedMgtFarm.tbl$Pounds)
[1] 3299.03
base::length(LMilkLbBreedMgtFarm.tbl$Pounds)
[1] 80
stats::quantile(LMilkLbBreedMgtFarm.tbl$Pounds)
# tidyverse Ecosystem
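The tibble printed below was presumably prepared with the dplyr::summarize() function; a minimal sketch that reproduces the column names shown is:
LMilkLbOverallDescriptives <-
LMilkLbBreedMgtFarm.tbl %>%
dplyr::summarize(
Mean = base::mean(Pounds, na.rm=TRUE),
SD = stats::sd(Pounds, na.rm=TRUE),
Ptile25 = stats::quantile(Pounds, 0.25, na.rm=TRUE),
Median = stats::median(Pounds, na.rm=TRUE),
Ptile75 = stats::quantile(Pounds, 0.75, na.rm=TRUE),
Minimum = base::min(Pounds, na.rm=TRUE),
Maximum = base::max(Pounds, na.rm=TRUE),
Missing = base::sum(is.na(Pounds)),
N = base::length(Pounds))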
LMilkLbOverallDescriptives
# A tibble: 1 × 9
Mean SD Ptile25 Median Ptile75 Minimum Maximum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20227. 3299. 18192. 18700 20981 17106 26637
Missing N
<int> <int>
0 80
Not surprisingly, other than rounding, the values for mean, sd, length (e.g., N),
etc. are equivalent whether using standard functions that come with Base R or using
the tidyverse ecosystem and the dplyr::summarise() function. Now add another
layer of detail to use of the dplyr::summarise() function by also deploying the
dplyr::group_by() function.
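A minimal sketch for the by-breed tibble printed below, assuming the same summary columns as above together with the dplyr::group_by() function:
LMilkLbBreedDescriptives <-
LMilkLbBreedMgtFarm.tbl %>%
dplyr::group_by(Breed) %>%
dplyr::summarize(
Mean = base::mean(Pounds, na.rm=TRUE),
SD = stats::sd(Pounds, na.rm=TRUE),
Ptile25 = stats::quantile(Pounds, 0.25, na.rm=TRUE),
Median = stats::median(Pounds, na.rm=TRUE),
Ptile75 = stats::quantile(Pounds, 0.75, na.rm=TRUE),
Minimum = base::min(Pounds, na.rm=TRUE),
Maximum = base::max(Pounds, na.rm=TRUE),
Missing = base::sum(is.na(Pounds)),
N = base::length(Pounds))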
LMilkLbBreedDescriptives
# A tibble: 4 × 10
Breed Mean SD Ptile25 Median Ptile75
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AyrshireLb 18650. 471. 18210 18478. 19186
2 GuernseyLb 17763. 589. 17241 17783 18219.
3 HolsteinLb 25811. 346. 25605. 25663 25808.
4 JerseyLb 18683 478. 18297. 18569 18959.
1
For those who are not well-acquainted with dairy cattle and characteristics of the leading breeds,
review State and national standardized lactation averages by breed for cows calving in 2007
(https://queries.uscdcb.com/publish/dhi/dhi09/laall.shtml) for generalized by breed statistics not
only on milk production (lb), but statistics that also address fat and protein production (% and lb).
There are many dairy herdsmen who place a high value on the production of fat (% and lb.) and
protein (% and lb.) and are willing to accept less milk production in terms of measured weight (lb.).
2
Whether data are nonparametric or parametric is often a matter of personal judgment or group
consensus. Ideally, the final decision is based not only on observation of the data but is also a result
of applied tests such as the Anderson-Darling test or the Shapiro test, but more discussion on this
issue would go beyond the scope for this specific lesson.
3
For those with special interest in the issue of normal distribution and the selection of a nonpara-
metric or parametric approach to inferential test selection, look at use of the dlookr::normality()
function, such as dlookr::normality(Pounds) alone or chained to testing of normal distribution by
groups by using the dplyr::group_by() function.
# A tibble: 4 × 4
Breed Mean.Lb SD.Lb N.Lb
<fct> <dbl> <dbl> <int>
1 AyrshireLb 18650. 471. 20
2 GuernseyLb 17763. 589. 20
3 HolsteinLb 25811. 346. 20
4 JerseyLb 18683 478. 20
These descriptive statistics provide useful information, but it is not yet possible
to determine if there are statistically significant differences (p ≤ 0.05) in mean
pounds of milk production per lactation between and among the four dairy breeds.
There are of course more than a few R-based solutions to resolve this inquiry, where
a Oneway Analysis of Variance (Oneway ANOVA) is needed. To achieve this aim
and generate a summary of Oneway ANOVA outcomes, first use the tidyverse
broom package and a few supporting functions in the broom package.
Data are often in tidy format due to specific actions used to achieve that aim.
Following along with a tidy approach to data science, the broom package was devel-
oped to support the production of tidy output, especially the output of calculations
associated with inferential tests that examine differences between and among
groups. Consider the following attempt at a Oneway ANOVA and the desire to know:
• Are there statistically significant differences (p ≤ 0.05) in pounds of milk pro-
duction per lactation between and among the four breeds in this example?
• If so, which breeds are in common (i.e., there is no statistically significant
difference (p ≤ 0.05) in pounds of milk production per lactation between the
identified groups) and which breeds show difference (i.e., there is a statistically
significant difference (p ≤ 0.05) in pounds of milk production per lactation
between the identified groups)?
As background, the broom package is not part of the core tidyverse ecosystem
but it is certainly included among the many packages associated with the tidyverse
ecosystem. As such, it must be downloaded and put into use separately.
install.packages("broom", dependencies=TRUE)
library(broom)
PoundsMilkByBreedUsinglm <-
stats::lm(Pounds ~ Breed, data = LMilkLbBreedMgtFarm.tbl)
# lm - linear model, using base R
broom::tidy(stats::anova(PoundsMilkByBreedUsinglm))
# Generate a tibble of overall Oneway ANOVA
# outcomes.
# A tibble: 2 × 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Breed 3 842380181. 280793394. 1225. 3.15e-64
2 Residuals 76 17424104. 229265. NA NA
With a calculated p-value of 3.15e-64, which is certainly far less than p ≤ 0.05,
it can be stated that there is a statistically significant difference (p ≤ 0.05) in pounds
of milk production per lactation between the four dairy breeds. But this output does
not provide any sense of which breeds are in common and which breeds show
difference.
To achieve a more finite sense of commonality and difference between and
among groups (e.g., breeds), it is necessary to use a mean comparison test. Tukey’s
HSD (Honestly Significant Difference) test is among the most frequently used post
hoc tests for this purpose, making it possible to have some sense of multiple group
comparisons of commonality and differences for all potential pairs (e.g., group-by-
group comparisons).
PoundsMilkByBreedUsingTukeyHSD <-
aov(Pounds ~ Breed, data = LMilkLbBreedMgtFarm.tbl)
# aov - anova model, using base R
broom::tidy(TukeyHSD(PoundsMilkByBreedUsingTukeyHSD))
# Generate a tibble of groupwise comparative
# Oneway ANOVA outcomes.
# A tibble: 6 × 7
term contrast null.value estimate conf.low
<chr> <chr> <dbl> <dbl> <dbl>
1 Breed GuernseyLb-AyrshireLb 0 -887. -1285.
2 Breed HolsteinLb-AyrshireLb 0 7160. 6763.
3 Breed JerseyLb-AyrshireLb 0 32.6 -365.
4 Breed HolsteinLb-GuernseyLb 0 8048. 7650.
5 Breed JerseyLb-GuernseyLb 0 920. 522.
6 Breed JerseyLb-HolsteinLb 0 -7128. -7525.
It may take a few minutes to see which group comparisons are in common (e.g.,
those groups where the p-value is greater than 0.05). Equally, it requires a fair
amount of time to discover group comparisons where there are differences (e.g.,
those groups where the p-value is less than or equal to 0.05).
Because it can be challenging to work through these comparisons, especially when
there are many, it is best to use R packages and functions that provide an easier-to-
read output of group comparisons. The agricolae::HSD.test() function is quite use-
ful for this task, and its groupwise comparison table is very easy to read, adding
value to the analyses.
install.packages("agricolae", dependencies=TRUE)
library(agricolae)
agricolae::HSD.test(
aov(Pounds ~ Breed, data=LMilkLbBreedMgtFarm.tbl),# Model
trt="Breed", # Treatment
group=TRUE, console=TRUE, alpha=0.05, # Arguments
main="Milk Production During Lactation by Dairy Breed:
Pounds")
# Wrap the agricolae::HSD.test() function around the
# Oneway ANOVA model obtained by using the aov()
# function. Select desired arguments, such as group,
# console, and alpha (e.g., p-value).
Breed, means
Pounds groups
HolsteinLb 25810.7 a
JerseyLb 18683.0 b
AyrshireLb 18650.3 b
GuernseyLb 17763.0 c
The mean for pounds of milk production per lactation for each of the four dairy
breeds is found in the numeric column, with production ranging from a mean of
25,810.7 pounds (Holstein) to a mean of 17,763.0 pounds (Guernsey). What is per-
haps especially helpful is the information in the groups column. Notice the use of
lower-case letters to distinguish between group membership:
• Holstein dairy cows (mean = 25,810.7 pounds) are marked as group a, and in this
example, there are no other breeds marked as group a – showing that Holstein
dairy cows were a group of their own in terms of pounds of milk production per
lactation. There is a statistically significant difference (p ≤ 0.05) in pounds of
milk production per lactation between Holstein dairy cows and all other breeds.
• Jersey dairy cows (mean = 18,683.0 pounds) and Ayrshire (mean = 18,650.3
pounds) dairy cows are both marked as group b. This designation indicates that
they were in common regarding pounds of milk production per lactation and in
more formal language it is appropriate to say that there was no statistically sig-
nificant difference (p ≤ 0.05) in pounds of milk production per lactation
between Jersey dairy cows and Ayrshire dairy cows.
• Guernsey dairy cows (mean = 17,763.0 pounds) are marked as group c, and in
this example, there are no other breeds marked as group c – showing that
Guernsey dairy cows were a group of their own in terms of pounds of milk pro-
duction per lactation. There is a statistically significant difference (p ≤ 0.05) in
pounds of milk production per lactation between Guernsey dairy cows and all
other breeds.
In terms of statistically significant differences (p ≤ 0.05), based on the data in
this example (which are from an N = 20 per breed teaching dataset and are therefore
not at all passed along as being representative of the industry), it can be said that: (a)
Holstein dairy cows produced more pounds of milk per lactation than (b) Jersey and
Ayrshire dairy cows, who produced more pounds of milk per lactation than (c)
Guernsey dairy cows, who produced the fewest pounds of milk per lactation.
It is cautioned, however, that no broad statements should be made from the figure
on pounds of milk production per lactation, or from the corresponding table, other
than what has been briefly stated immediately above. As a product
offered for sale, pounds of milk production per lactation is important, but it is also
important to consider many other contributions to potential profitability, such as
production costs by breed (e.g., larger cows have a higher housing cost than smaller
cows, larger cows will likely consume more feed than smaller cows, etc.) and milk
quality (e.g., percentage fat by breed and percent protein by breed). Biostatistics,
and especially the commercial applications of biostatistics, call for many complex
and often difficult-to-obtain sources of information.
Challenge: Going beyond analyses of pounds of milk production per lactation
overall and by Breed, organize data and syntax to accommodate analyses of pounds
of milk production per lactation by Management.
Challenge: Going beyond analyses of pounds of milk production per lactation
overall, by Breed, and by Management, organize data and syntax to accommodate
analyses of pounds of milk production per lactation by Farm.
Challenge: Going beyond the prior analyses of pounds of milk production per
lactation, organize data and syntax to examine possible differences and interactions
of all relevant variables in the dataset associated with milk production. Consider:
• Are there statistically significant differences (p ≤ 0.05) in Percent Fat by Breed?
• Are there statistically significant differences (p ≤ 0.05) in Percent Protein
by Breed?
• Are there statistically significant differences (p ≤ 0.05) in Percent Fat by
Management?
• Are there statistically significant differences (p ≤ 0.05) in Percent Protein by
Management?
• Are there statistically significant differences (p ≤ 0.05) in Percent Fat by Farm?
• Are there statistically significant differences (p ≤ 0.05) in Percent Protein
by Farm?
• What is the association (e.g., correlation) between Pounds of milk per lactation
and Percent Fat, by Percent Protein, etc.?
Challenge: For those with even greater interest in the use of statistics in data sci-
ence, structure the data and syntax to address these inquiries from a Twoway
ANOVA perspective. Address not only Pounds but also Percent Fat and Percent
Protein to determine outcomes by Breed, by Management, and by Farm, including
interactions of these measured variables between and among the grouping variables.
Much can be done with the original dataset, even though it is obviously a limited set
of data (N = 80 rows) designed for teaching purposes. A few hints on how to start
these analyses are displayed but be creative and go beyond this initial syntax
(Fig. 7.7).
TwowayPoundBMF <-
aov(Pounds ~ Management * Farm * Breed,
data=LMilkLbBreedMgtFarm.tbl)
# Twoway ANOVA for B(reed), M(anagement),
# F(arm) -- TwowayPoundBMF
summary(TwowayPoundBMF)
# Display the ANOVA table.
Df F value Pr(>F)
Management 1 105.05 0.000000000000004 ***
Farm 2 1.39 0.25651
Breed 3 3174.23 < 0.0000000000000002 ***
Management:Breed 3 7.84 0.00016 ***
Farm:Breed 6 0.27 0.94795
Residuals 64
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
agricolae::HSD.test(TwowayPoundBMF,"Breed", group=TRUE,
console=TRUE)
# Use Tukey’s HSD (Honestly Significant Difference) Mean
# Comparison Test to determine which breakouts are in
# common groups, and which breakout groups are not.
Breed, means
Pounds groups
HolsteinLb 25810.7 a
JerseyLb 18683.0 b
AyrshireLb 18650.3 b
GuernseyLb 17763.0 c
agricolae::HSD.test(TwowayPoundBMF,"Management",
group=TRUE, console=TRUE)
# Use Tukey’s HSD (Honestly Significant Difference)
# Mean Comparison Test.
Management, means
Pounds groups
Conventional 20567.6 a
Organic 19885.9 b
agricolae::HSD.test(TwowayPoundBMF,"Farm", group=TRUE,
console=TRUE)
# Use Tukey’s HSD (Honestly Significant Difference) Mean
# Comparison Test to determine which breakouts are in common
# groups, and which breakout groups are not.
Farm, means
Pounds groups
Lawson 20627.7 a
Skaggs 20507.5 a
Stanley 19936.2 b
Monroe 19835.6 b
Beautiful Graphics
Graphics (e.g., figures) are often prepared from three perspectives: graphics of
grouped data (e.g., a bar plot of median weight by identified race-ethnic breakout
groups), graphics of interval and real numbers (e.g., a density plot of weight for
each identified race-ethnic breakout group), and maps (e.g., a county-wide chorop-
leth map of each identified race-ethnic group by all identified Census Bureau
Tracts). Many figures have been prepared and displayed in this text. Review these
prior figures and the many figures listed and shown below, knowing that R supports
the production not only of these graphics (e.g., figures and maps) but the production
of many other types, too.
Grouped Data
• Bar Plot
• Mosaic Plot
• Waffle Plot
• Beanplot
• Beeswarm Plot
• Boxplot
• Density Plot
• Dotplot
• Histogram
• Line Chart
• Pirate Plot
• Quantile-Quantile (QQ) Plot
• Scatter Plot
• Scatter Plot Matrix
• Violin Plot
Beautiful Maps
• International
• National
• State
• County
• Sub-County
To support presentation of these graphics, create an abbreviated dataset for dem-
onstration purposes only. From among the many ways a dataset can be created
when using R, for this demonstration use the utils::read.table() function wrapped
around base::textConnection(). The data are part of a larger teaching dataset and are
inspired by actual data, but the data do not reflect Systolic Blood Pressure
measurements for actual patients.
#########################
# Abbreviated Code Book #
#########################
# L(ong)
# S(ystolic) B(lood) P(ressure), SBP
# G(ender)
# D(rug)
# R(aceEthnic)
In this dataframe, text instead of numeric codes was used for data relating to
Gender (Female and Male) and Race (Black, Hispanic, Other, and White). As a
purposeful contrast for this teaching dataset, numeric codes were used for data relat-
ing to Drug (1, 2, 3, 4, 5). Whole numbers were used to enter data for SBP, including
odd and even SBP measurements given the capabilities of contemporary digital
sphygmomanometers.
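A minimal sketch of the dataframe creation follows; the patient rows shown here are hypothetical placeholders used only to illustrate the approach, not the teaching dataset itself.
LSBPGDR.df <- utils::read.table(base::textConnection("
PatientID Gender Drug RaceEthnic SBP
P01 Female 1 Black 144
P02 Male 2 White 132
P03 Female 3 Hispanic 138
P04 Male 4 Other 117
P05 Female 5 White 126
"), header=TRUE, stringsAsFactors=FALSE)
# Hypothetical rows shown only to illustrate the
# read.table(textConnection()) approach; the full
# teaching dataset contains many more patients.
utils::str(LSBPGDR.df)
# Confirm the structure of the dataframe.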
base::colnames(LSBPGDR.df) <-
c("Patient", "Gender", "Drug", "Race", "SBP")
# Although somewhat redundant, clearly identify the name
# for each column. Note how PatientID was changed to
# Patient and how RaceEthnic was changed to Race, largely
# to reinforce how column names can be changed easily, if
# desired for later presentation purposes.
Look again at the output of the utils::str() function and see why the following
data for Drug and SBP need to be put into another format and then observe how this
is accomplished, using functions such as factor() and as.numeric(). Further, for the
object variable LSBPGDR.df$Drug, apply a label so that numeric codes such as 1, 2,
3, 4, and 5 are instead presented as Drug A, Drug B, Drug C, Drug D, and Placebo:
• Transform the object variable Gender from character to factor.
• Transform the object variable Drug from integer to factor.
• Transform the object variable Race from character to factor.
• Transform the object variable SBP from integer to numeric.
LSBPGDR.df$Gender <-
factor(LSBPGDR.df$Gender,
labels=c("Female", "Male"))
levels(LSBPGDR.df$Gender)
summary(LSBPGDR.df$Gender)
# Transform the object variable Gender from a
# character-type object variable to a factor-type
# object variable and provide meaningful text to
# describe the nature of each code.
LSBPGDR.df$Drug <-
factor(LSBPGDR.df$Drug,
labels=c("Drug A", "Drug B", "Drug C", "Drug D",
"Placebo"))
levels(LSBPGDR.df$Drug)
summary(LSBPGDR.df$Drug)
# Transform the object variable Drug from an
# integer-type object variable to a factor-type
# object variable and provide meaningful text to
# describe the nature of each numeric code.
Fig. 7.7
LSBPGDR.df$Race <-
factor(LSBPGDR.df$Race,
labels=c("Black", "Hispanic", "Other", "White"))
levels(LSBPGDR.df$Race)
summary(LSBPGDR.df$Race)
# Transform the object variable Race from a
# character-type object variable to a factor-type
# object variable and provide meaningful text to
# describe the nature of each code.
LSBPGDR.df$SBP <-
as.numeric(LSBPGDR.df$SBP)
summary(LSBPGDR.df$SBP)
# Transform the object variable SBP from an
# integer-type object variable to a numeric-type
# object variable.
Race SBP
Black :22 Min. : 99
Hispanic:19 1st Qu.:120
Other :20 Median :137
White :23 Mean :132
3rd Qu.:139
Max. :158
Give special attention to the base::summary() function output for Drug and how
it is now quite different since it was transformed from a set of numeric codes to
factor-type breakouts, using meaningful text to identify the different drugs used in
this study.
Bar Plot
Use the tidyverse ecosystem to prepare a bar plot of mean Systolic Blood Pressure
(SBP) by Race:
• Use the dplyr package to gain a sense of descriptive statistics.
• Use the forcats package to address ordering of the selected variable.
• Use the ggplot2 package to generate the figure, where the values for mean SBP
by Race are in descending order.
Add additional value to the figure by flipping (e.g., transposing) the X-axis and
the Y-axis, making it much easier to see the transition in mean SBP by Race
(Fig. 7.8).
RaceLSBPGDRDescriptives <-
LSBPGDR.df %>%
dplyr::group_by(Race) %>%
dplyr::summarize(
N = base::length(SBP),
Minimum = base::min(SBP, na.rm=TRUE),
Median = stats::median(SBP, na.rm=TRUE),
Mean = base::mean(SBP, na.rm=TRUE),
SD = stats::sd(SBP, na.rm=TRUE),
Maximum = base::max(SBP, na.rm=TRUE),
Missing = base::sum(is.na(SBP))
)
RaceLSBPGDRDescriptives
Fig. 7.8
# A tibble: 4 × 8
Race N Minimum Median Mean SD Maximum Missing
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Other 20 99 116 118. 12.5 139 0
2 White 23 108 135 129. 10.9 139 0
3 Hispanic 19 128 138 137. 3.83 145 0
4 Black 22 128 144 144. 8.48 158 0
LSBPGDR.df$Race <- forcats::fct_reorder(LSBPGDR.df$Race,
LSBPGDR.df$SBP, mean, na.rm=TRUE)
# Use the forcats::fct_reorder() function.
BarPlot.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(x=Race, y=SBP)) +
stat_summary(fun.data=mean_sdl, geom="bar") +
coord_flip() +
labs(
title=
"Bar Plot: Mean Systolic Blood Pressure (SBP) and Race",
subtitle=
"Order: Descending Mean SBP by Race ",
x="Race\n", y="\nSystolic Blood Pressure (SBP)") +
# Remember – output has been flipped.
scale_y_continuous(limits=c(0, 160),
breaks=scales::pretty_breaks(n=10)) +
theme_Mac()
# Fig. 7.8
par(ask=TRUE); BarPlot.fig
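The mosaic plot below uses the geom_mosaic() and product() functions, which come from the ggmosaic package rather than from ggplot2 itself. Assuming ggmosaic has not yet been installed and loaded:
install.packages("ggmosaic", dependencies=TRUE)
library(ggmosaic)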
MosaicPlot.fig <-
ggplot2::ggplot(data=LSBPGDR.df) +
geom_mosaic(aes(x = product(Drug), fill = Race)) +
labs(title="Mosaic Plot: Drug and Race",
subtitle="Representation: Headcount by Groups",
x="\nDrug", y="Race\n") +
theme_Mac() +
theme(legend.title=element_blank())
# Fig. 7.9
par(ask=TRUE); MosaicPlot.fig
install.packages("waffle", dependencies=TRUE)
library(waffle)
Fig. 7.9
Using the janitor package, determine the frequency distribution of the identified
factor-type variables, or LSBPGDR.df$Drug in this example. This breakout infor-
mation is the basis for a Waffle Plot.
FrequencyDrug <-
LSBPGDR.df %>%
janitor::tabyl(Drug,
show_na=TRUE,
show_missing_levels=TRUE) %>%
janitor::adorn_pct_formatting(digits=2)
# End the chain here so that FrequencyDrug is assigned,
# then print the completed object.
base::print(FrequencyDrug, n=99)
Drug n percent
Drug A 16 19.05%
Drug B 16 19.05%
Drug C 16 19.05%
Drug D 16 19.05%
Placebo 20 23.81%
Prepare an object that includes name(s) of breakouts and the frequency of each
(Fig. 7.10).
Fig. 7.10
DrugBreakouts <- c(
'Drug A' = 16,
'Drug B' = 16,
'Drug C' = 16,
'Drug D' = 16,
'Placebo' = 20)
DrugBreakouts.fig <-
waffle::waffle(DrugBreakouts, rows=8,
title=
"Waffle Plot (e.g., Square Pie Chart): Drug Breakouts") +
theme_Mac() +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank()
)
# Much more can be done with this Waffle Plot, such
# as the addition of labels. Experiment with these
# and other embellishments, for this figure and all
# other figures in this section.
# Fig. 7.10
par(ask=TRUE); DrugBreakouts.fig
Beanplot
When viewing the beanplot, know that there are many who see the beanplot and the
violin plot as being the same general figure. Of course, others have a different opin-
ion, thus the reason why both figures are shown in this section (Fig. 7.11).
install.packages("beanplot", dependencies=TRUE)
library(beanplot)
BeanplotRaceSBP.fig <-
beanplot::beanplot(SBP ~ Race,
data=LSBPGDR.df,
main=
"Beanplot: Race and Systolic Blood Pressure (SBP)",
show.names=TRUE)
# Much more can be done to expand on the beanplot as a
# useful figure. Read the online documentation to see
# arguments supported by the beanplot::beanplot()
# function.
# Fig. 7.11
par(ask=TRUE); BeanplotRaceSBP.fig
Along with the beanplot as a figure, notice the statistics printed on the screen.
These statistics can be very useful. Take time to review their meaning and applica-
tion to fully understand outcomes between and among breakout groups.
Fig. 7.11
$bw
[1] 3.09916
$wd
[1] 8.14491
$names
[1] "Other" "White" "Hispanic" "Black"
$stats
[1] 117.550 128.739 136.895 144.091
$overall
[1] 131.94
$log
[1] ""
$ylim
[1] 89.7025 167.2975
$xlim
[1] 0.5 4.5
install.packages("ggbeeswarm", dependencies=TRUE)
library(ggbeeswarm)
BeeswarmRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(Race, SBP, col=Race)) +
geom_beeswarm(size=2.5) +
labs(title=
"Beeswarm Plot: Race and Systolic Blood Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_y_continuous(labels=scales::comma,
limits=c(0, 280),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.12
par(ask=TRUE); BeeswarmRaceSBP.fig
Fig. 7.12
BoxplotRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(Race, SBP, col=Race)) +
geom_boxplot(outlier.color="red",
outlier.shape=2, lwd=0.75) +
stat_boxplot(geom='errorbar', lwd=0.75) +
labs(title=
"Boxplot: Race and Systolic Blood Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_y_continuous(labels=scales::comma,
limits=c(075, 175),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title= element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.13
par(ask=TRUE); BoxplotRaceSBP.fig
Fig. 7.13
DensityRaceSBP.fig <-
ggplot2::ggplot(LSBPGDR.df, aes(x=SBP)) +
geom_density(color="blue", lwd=1.25) +
geom_vline(aes(xintercept=mean(SBP)), color="red",
linetype="dotted", size=0.75) +
facet_grid(cols=vars(Race)) +
labs(title=
"Density Plot: Race and Systolic Blood Pressure (SBP)",
subtitle="Dotted Line Represents Mean",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
theme_Mac() +
theme(legend.title=element_blank()) +
theme(strip.text=element_text(face="bold")) +
theme(axis.text.x=element_text(face="bold",
size=11, hjust=0.5, vjust=0.5, angle=45)) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.14
par(ask=TRUE); DensityRaceSBP.fig
Fig. 7.14
Fig. 7.15
DotplotRaceSBP.fig <-
ggplot2::ggplot(LSBPGDR.df, aes(x=Race, y=SBP,
fill=Race)) +
geom_dotplot(binaxis='y', stackdir='center') +
labs(title=
"Dot Plot: Race and Systolic Blood Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_y_continuous(labels=scales::comma,
limits=c(075, 175),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title= element_blank())
# Fig. 7.15
par(ask=TRUE); DotplotRaceSBP.fig
HistogramRaceSBP.fig <-
ggplot2::ggplot(LSBPGDR.df, aes(x=SBP, fill=Race)) +
geom_histogram(color="black", lwd=0.75) +
geom_vline(aes(xintercept=mean(SBP)), color="red",
linetype="dotted", size=0.75) +
geom_vline(aes(xintercept=median(SBP)),
color="darkgreen", linetype="dashed", size=1.25) +
facet_grid(cols=vars(Race)) +
labs(title=
"Histogram: Race and Systolic Blood Pressure (SBP)",
subtitle=
"Mean - Red Dotted Line and Median - Green Dashed Line",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_x_continuous(labels=scales::comma,
limits=c(100, 175),
breaks=scales::pretty_breaks(n=5)) +
scale_y_continuous(labels=scales::comma,
limits=c(0, 12),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title=element_blank()) +
theme(strip.text=element_text(face="bold")) +
theme(axis.text.x=element_text(face="bold",
size=11, hjust=0.5, vjust=0.5, angle=45)) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.16
par(ask=TRUE); HistogramRaceSBP.fig
Fig. 7.16
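The two line charts below use theme_tufte() and a modified theme_wsj(); both themes come from the ggthemes package, not from ggplot2 itself. Assuming ggthemes has not yet been installed and loaded:
install.packages("ggthemes", dependencies=TRUE)
library(ggthemes)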
LineChartStaticRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(x=Race, y=SBP)) +
geom_line() +
geom_point(size=2.5) +
labs(
title=
"Line Chart - Static: Race and Systolic Blood Pressure (SBP)",
subtitle="Use of theme_tufte()",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
theme_tufte()
# As opposed to use of theme_Mac(), theme_tufte() was
# instead used for this figure. There are those who
# argue that the simplicity of presentations gained
# by the use of theme_tufte() is an advantage,
# compelling readers to give careful attention to
# outcomes -- based on the idea that such simplicity
# is more demanding than the use of large and bold
# fonts, vibrant colors, etc. As interested, look
# into the many publications by Edward Tufte to learn
# more about this approach to preparation of slides
# and other materials.
# Fig. 7.17
par(ask=TRUE); LineChartStaticRaceSBP.fig
Create a dataframe that addresses the issue of change over time (e.g., Year, 2019 to
2023). Use the base::data.frame() function to create the dataset, simply to show
another way of creating demonstration-type ad hoc datasets when using R in an
interactive fashion.
Fig. 7.17
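A minimal sketch of the dataframe creation follows; the Region names and Yield values are hypothetical placeholders used only so that the syntax below can be run.
RegionYearYield.df <- base::data.frame(
Region = base::rep(c("North", "South", "East", "West"),
each=5),
Year = base::rep(2019:2023, times=4),
Yield = c(152, 148, 160, 157, 163,
141, 139, 145, 150, 149,
133, 137, 140, 138, 142,
128, 131, 135, 139, 144))
# Hypothetical yields for four regions over the years
# 2019 to 2023, in long format.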
utils::str(RegionYearYield.df)
RegionYearYield.df$Region <-
as.factor(RegionYearYield.df$Region)
RegionYearYield.df$Year <-
as.factor(RegionYearYield.df$Year)
RegionYearYield.df$Yield <-
as.numeric(RegionYearYield.df$Yield) # Redundant
utils::str(RegionYearYield.df)
base::attach(RegionYearYield.df)
RegionYearYield.df %>% {
base::rbind(utils::head(., 3),
utils::tail(., 3) )} %>%
print()
# Although not an enumerated function, use the
# base::rbind() function to print both the head
# and tail of the enumerated dataset.
Fig. 7.18
LineChartMultipleRegionYearYield.fig <-
ggplot2::ggplot(data=RegionYearYield.df,
aes(x=Year, y=Yield, group=Region, color=Region)) +
geom_line(lwd=2) +
geom_point(size = 5, shape = 21, col="Black",
aes(fill=Region)) +
labs(
title= "Line Chart - Multiple: Yield by Region
and by Year: 2019 to 2023",
subtitle="Use of modified theme_wsj()",
x="\nYear", y="Yield\n") +
theme_wsj() +
theme(
plot.title=element_text(hjust=0.5),
plot.subtitle=element_text(hjust=0.5, face="bold")) +
# Center the title, subtitle and put the subtitle
# in bold.
theme(legend.title=element_blank()) +
# Do not print a legend title.
theme(legend.position="bottom") +
# Put the legend at the bottom.
theme(axis.title=element_text(size=12, face="bold"))
# Make labels bold for the X axis the Y axis.
# R supports many themes and if desired, modification
# of themes. Compare the vast differences between
# theme_tufte() and theme_wsj(), with the above changes
# to theme_wsj() and without changes.
# Fig. 7.18
par(ask=TRUE); LineChartMultipleRegionYearYield.fig
Pirate Plot
The Pirate Plot, found in the yarrr package (as in, talk like a pirate), works and plays
well with the ggplot2 package. It is not used regularly, but it may be of value to those
who take time to explore its many possibilities. To learn more, key the syntax
vignette("pirateplot", package="yarrr") at the R prompt. Notice how this function
can also produce descriptive statistics (Fig. 7.19).
install.packages("yarrr", dependencies=TRUE)
library(yarrr)
PiratePlotRaceSBP.fig <-
yarrr::pirateplot(formula=SBP ~ Race, data=LSBPGDR.df,
main=
"Pirate Plot: Race and Systolic Blood Pressure (SBP)",
theme=2,
# The argument theme is specific to this function.
plot=TRUE)
# Fig. 7.19
par(ask=TRUE); PiratePlotRaceSBP.fig
Fig. 7.19
PiratePlotRaceSBP.descriptives <-
yarrr::pirateplot(formula=SBP ~ Race, data=LSBPGDR.df,
plot=FALSE)
PiratePlotRaceSBP.descriptives
$summary
Race bean.num n avg inf.lb inf.ub
1 Other 1 20 117.550 111.564 123.164
2 White 2 23 128.739 124.216 133.435
3 Hispanic 3 19 136.895 134.963 138.613
4 Black 4 22 144.091 140.570 148.151
$avg.line.fun
[1] "mean"
$inf.method
[1] "hdi"
$inf.p
[1] 0.95
QQPlotRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df, aes(sample=SBP)) +
stat_qq() +
stat_qq_line() +
facet_grid(cols=vars(Race)) +
labs(title=
"Quantile-Quantile (QQ) Plot: Race and Systolic Blood
Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
theme_Mac() +
theme(legend.title=element_blank()) +
theme(strip.text=element_text(face="bold")) +
theme(axis.text.x=element_text(face="bold",
size=11, hjust=0.5, vjust=0.5, angle=00)) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# QQ plots are often prepared for private use and are
# often excluded from publications or presentations,
# since they are coupled with statistical tests for
# normality, such as the Anderson-Darling Test or the
# Shapiro Test. Thus, there are few embellishments
# to this figure.
# Fig. 7.20
par(ask=TRUE); QQPlotRaceSBP.fig
Scatter Plot
Make a teaching dataframe with three numeric variables. These variables relate to
different body measurements (e.g., WeightLb, HeightIn, and WaistIn) and will be
used to visually display the concept of association (e.g., correlation), the singular
correlation(s) of X v Y (e.g., Weight v Height, Weight v Waist, and Height v Waist)
and then use of a correlation matrix to address X v Y v Z and its many permutations
(e.g., Weight v Height v Waist) (Figs. 7.21, 7.22, and 7.23).
Fig. 7.20
Fig. 7.21
Fig. 7.22
Fig. 7.23
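A minimal sketch of the teaching dataframe follows; the measurement values are hypothetical placeholders used only so that the syntax below can be run.
WeightHeightWaist.df <- base::data.frame(
WeightLb = c(135L, 150L, 160L, 175L, 185L, 198L, 210L, 220L),
HeightIn = c( 62L, 64L, 66L, 68L, 69L, 70L, 72L, 74L),
WaistIn = c( 28L, 30L, 32L, 34L, 36L, 37L, 40L, 42L))
# Hypothetical body measurements, entered as integers so
# that the conversion to numeric format below has a
# purpose.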
utils::str(WeightHeightWaist.df)
WeightHeightWaist.df$WeightLb <-
as.numeric(WeightHeightWaist.df$WeightLb)
WeightHeightWaist.df$HeightIn <-
as.numeric(WeightHeightWaist.df$HeightIn)
WeightHeightWaist.df$WaistIn <-
as.numeric(WeightHeightWaist.df$WaistIn)
# Currently in integer format, strive to put all
# three object variables into numeric format.
utils::str(WeightHeightWaist.df)
base::attach(WeightHeightWaist.df)
stats::cor.test(HeightIn, WeightLb,
data=WeightHeightWaist.df, method="pearson")
ScatterPlotHeightWeight.fig <-
ggplot2::ggplot(data=WeightHeightWaist.df) +
geom_point(aes(x=HeightIn, y=WeightLb), size=3,
color="red", fill="black", shape=18) +
geom_smooth(aes(x=HeightIn, y=WeightLb), method=loess,
linetype="dashed", color="black", fill="blue") +
labs(title=
"Scatter Plot (Points and Smooth): Height v Weight",
x="\nHeight (Inches)", y="Weight (Pounds)\n") +
scale_y_continuous(labels=scales::comma, limits=c(0,
300), breaks=scales::pretty_breaks(n = 5)) +
annotate("text", x=60, y=250, fontface="bold", size=06,
color="black", hjust=0, family="mono",
label="Pearson's r = 0.800667") +
theme_Mac()
# Fig. 7.21
par(ask=TRUE); ScatterPlotHeightWeight.fig
stats::cor.test(HeightIn, WaistIn,
data=WeightHeightWaist.df, method="pearson")
ScatterPlotHeightWaist.fig <-
ggplot2::ggplot(data=WeightHeightWaist.df) +
geom_point(aes(x=HeightIn, y=WaistIn), size=3,
color="red", fill="black", shape=18) +
geom_smooth(aes(x=HeightIn, y=WaistIn), method=loess,
linetype="dashed", color="black", fill="blue") +
labs(title=
"Scatter Plot (Points and Smooth): Height v Waist",
x="\nHeight (Inches)", y="Waist (Inches)\n") +
scale_y_continuous(labels=scales::comma, limits=c(0,
60), breaks=scales::pretty_breaks(n = 5)) +
annotate("text", x=60, y=250, fontface="bold", size=06,
par(ask=TRUE); ScatterPlotHeightWaist.fig
stats::cor.test(WeightLb, WaistIn,
data=WeightHeightWaist.df, method="pearson")
ScatterPlotWeightWaist.fig <-
ggplot2::ggplot(data=WeightHeightWaist.df) +
geom_point(aes(x=WeightLb, y=WaistIn), size=3,
color="red", fill="black", shape=18) +
theme_Mac()
# theme_Mac() is an assumed completion so the plotting
# chain is syntactically complete; add geom_smooth() and
# labs() as in the prior scatter plots.
# Fig. 7.23
par(ask=TRUE); ScatterPlotWeightWaist.fig
From among many creative ways that could be used, create a scatterplot matrix
using the GGally::ggpairs() function (Fig. 7.24).
install.packages("GGally", dependencies=TRUE)
library(GGally)
ScatterPlotMatrixWeightHeightWaist.fig <-
GGally::ggpairs(WeightHeightWaist.df) +
labs(title=
"Scatter Plot Matrix: Weight, Height, and Waist")
# The title text here is an assumption.
Fig. 7.24
par(ask=TRUE); ScatterPlotMatrixWeightHeightWaist.fig
Violin Plot
Review the prior comments on how a beanplot is like a violin plot (Fig. 7.25).
Fig. 7.25
ViolinPlotRaceSBP.fig <-
ggplot2::ggplot(data=LSBPGDR.df,
aes(Race, SBP, col=Race, fill=Race)) +
geom_violin(size=2.5) +
labs(title=
"Violin Plot: Race and Systolic Blood Pressure (SBP)",
x="\nRace", y="Systolic Blood Pressure\n(SBP)\n") +
scale_y_continuous(labels=scales::comma,
limits=c(90, 160),
breaks=scales::pretty_breaks(n=5)) +
theme_Mac() +
theme(legend.title=element_blank()) +
theme(axis.text.y=element_text(face="bold",
size=12, hjust=0.5, vjust=0.5, angle=00))
# Fig. 7.25
par(ask=TRUE); ViolinPlotRaceSBP.fig
4
Consider a map of the United Kingdom of Great Britain and Northern Ireland (UK). Should this
map include only England, Northern Ireland, Scotland, and Wales? How would the Republic of
Ireland show on this map, given how it is part of the same land mass as the land mass for Northern
Ireland? Then, add to this complexity the Channel Islands such as the Bailiwick of Guernsey and
the Bailiwick of Jersey. Should these two entities be included in a map of the UK? Should the Isle
of Man also show on the map? Should the British Virgin Islands, the Falkland Islands, Gibraltar,
and other British Overseas Territories show on the map? What about the Chagos Archipelago?
Should Rockall be included? The complexity of maps goes far beyond the use of R or any other
software for their creation.
International
From among many possible resources, use the maps::map() function to create a
world map. As created, the map shows basic geo-political boundaries, but detail is
limited and the resolution at which small geo-political entities can be viewed is
also limited.5 Use this map as a starting point for mapping (Fig. 7.26).
WorldMap <-
maps::map("world", fill=TRUE, col="blue")
title("World Map")
# Fig. 7.26
par(ask=TRUE); WorldMap
Challenge: The current world map shows geographic boundaries, but otherwise
there are no data associated with the map. Use the created world map (or perhaps a
world map from another source) and the tidyverse ecosystem to produce a geographic
entity-by-entity map-based choropleth of COVID-19 total cases per million, as of
the creation date for the dataset:
• Obtain a list of ISO (International Organization for Standardization) codes for
each entity assigned an ISO code. The ISO online browsing platform (https://
www.iso.org/obp/) is one possible resource, where codes are provided as: English
short name, French short name, Alpha-2 code, Alpha-3 code, and Numeric.
Fig. 7.26
5 Is it possible to distinguish the borders for Luxembourg in this map?
However, there are many resources that provide these codes for corresponding
geographic entities.
• Obtain longitude and latitude data for each geographic entity. There are many
resources for this task, but those who use Kaggle may find it convenient to use
the data obtained from https://www.kaggle.com/datasets/bohnacker/country-
longitude-latitude, at least as a starting point for this challenge.
• Finally, obtain the most current COVID-19 data, as provided by Our World in
Data (https://github.com/owid/covid-19-data/blob/master/public/data/latest/
owid-covid-latest.csv) and use the variable total_cases_per_million, in an effort
to have data that allows meaningful comparisons between and among identified
geographic entities. The object variable marked iso_code is likely the resource
that will be used to join files as the final dataset is prepared, possibly after some
degree of data wrangling.
It will be a task to use the tidyverse ecosystem and create one unified final dataset.
There are many steps to obtaining the data, reviewing the data, cleaning the
data, possibly joining the data to other datasets, etc. After these tasks are successfully
completed, it should then be simple to produce the world map choropleth.6 By
this point, nearing the end of this text, those with interest should be able to rise to
this challenge. There have been adequate pointers and examples for this challenge,
and one possible starting sketch is shown below. Reactions to these challenges are
part of the many day-to-day inquiries data scientists face.
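As a starting point only, the sketch below outlines one way the pieces might fit together with the tidyverse ecosystem. The file names, the country-name column in the longitude/latitude lookup file, and the join keys are assumptions that must be checked against the actual downloads; country-name mismatches (e.g., "USA" versus "United States") will almost certainly require some wrangling before the joins succeed.

# A minimal sketch, not a complete solution; file names and join keys are
# assumptions and should be adjusted after reviewing the actual data.
library(tidyverse)   # readr, dplyr, ggplot2, etc.

owid.df <- readr::read_csv("owid-covid-latest.csv")           # OWID latest data
iso.df  <- readr::read_csv("country-longitude-latitude.csv")  # assumed lookup file with
                                                              # country name and iso_code

world.df <- ggplot2::map_data("world") %>%                    # polygons from maps::map("world")
  dplyr::left_join(iso.df,  by = c("region" = "country")) %>% # assumed key: country name
  dplyr::left_join(owid.df, by = "iso_code")                  # OWID key: iso_code

ggplot2::ggplot(data=world.df,
  aes(x=long, y=lat, group=group, fill=total_cases_per_million)) +
  geom_polygon(color="gray70") +
  scale_fill_viridis_c(labels=scales::comma, na.value="white") +
  labs(title="COVID-19 Total Cases per Million",
       fill="Cases per Million") +
  theme_void()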
National
There are also many resources that can be used for national maps. Detailed
information on mapping states, counties, and divisions within counties is gained by
carefully reviewing the many materials related to Federal Information Processing
System (FIPS) codes, as provided by the Census Bureau and other government
offices.7,8
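As one small aid (an illustration using the usmap package, which is not itself part of the Census Bureau materials), helper functions are available for quick FIPS lookups:

# Assumes the usmap package is installed.
usmap::fips(state = "New Jersey")                     # state FIPS code ("34")
usmap::fips(state = "New Jersey", county = "Bergen")  # county FIPS code ("34003")
usmap::fips_info(c("01", "56"))                       # reverse lookup of state FIPS codes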
Once again, use the maps::map() function to create a map, but now of the United
States, outlining all 3,000-plus counties. After the map is created and as skills and
interest allow, respond to the Challenge outlined below. Review materials posted in
the earlier parts of this text to see the many complexities of FIPS codes: Why do
state-level FIPS codes range from 01 to 56 since there are 50 states, not 56? How do
county-level FIPS codes accommodate the multiple use of common names for
counties named Washington, Franklin, Jefferson, etc. (Fig. 7.27)?
6 Notice how data may not be available for all geographic entities, or there may be concerns
about the efficacy of some data.
7 Review Federal Information Processing System (FIPS) Codes for States and Counties,
https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt, for state-by-state and
county-by-county FIPS codes.
8 Review materials such as ZIP Code Tabulation Areas (ZCTAs) (https://www.census.gov/
programs-surveys/geography/guidance/geo-areas/zctas.html) to learn about the way United
States Postal Service ZIP Codes are accommodated when working with output gained from
the Census Bureau. Census Bureau ZCTAs seem to be similar to Postal Service ZIP Codes,
but not quite.
Fig. 7.27
USNationalMap <-
maps::map("county", fill=TRUE, col="cyan")
title("United States National Map of Counties")
# Fig. 7.27
par(ask=TRUE); USNationalMap
Challenge: The current United States county map shows geographic boundaries,
but otherwise there are no data associated with the map. Use the created county map
and the tidyverse ecosystem to produce a county-by-county choropleth of COVID-19
cases per 100,000, for all cases marked as date_updated February 24, 2022 (N = 3220
rows of data):
• Obtain the data file United_States_COVID-19_Community_Levels_by_County.
csv, associated with United States COVID-19 Community Levels by County
(https://data.cdc.gov/Public-Health-Surveillance/United-States-COVID-19-
Community-Levels-by-County/3nnm-4jni), provided by the Centers for Disease
Control and Prevention.
• Notice how FIPS codes are provided for each county along with other relevant
county-wide information, running over multiple dates (2/24/2022 to 5/11/2023,
as of the time this lesson was prepared). Accordingly, if these data are acceptable,
it is not necessary to obtain any files other than those already available. A sketch
of one possible approach is shown after this list.
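As a starting point only, the sketch below shows how the CDC file might be read, filtered to the February 24, 2022 update, and mapped county by county. The column names (county_fips, date_updated, covid_cases_per_100k) and the date format are assumptions to be verified against the downloaded file.

# A minimal sketch under assumed column names; preserve leading zeros in the
# county FIPS codes by reading them as character values.
library(tidyverse)
library(usmap)

CountyLevels.df <- readr::read_csv(
  "United_States_COVID-19_Community_Levels_by_County.csv",
  col_types = readr::cols(county_fips = readr::col_character())) %>%
  dplyr::filter(date_updated == "2022-02-24") %>%          # assumed date format
  dplyr::select(fips = county_fips, covid_cases_per_100k)  # plot_usmap() expects a fips column

usmap::plot_usmap(regions = "counties", data = CountyLevels.df,
  values = "covid_cases_per_100k", color = "gray80") +
  scale_fill_continuous(low = "white", high = "blue",
    name = "Cases per 100,000", labels = scales::comma) +
  labs(title = "COVID-19 Cases per 100,000 by County: February 24, 2022") +
  theme(legend.position = "right")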
State
Following along with the use of federal data as a prime data resource, review the
many materials associated with Surveillance Data from the Lyme Disease Data
Dashboard, as provided by the Centers for Disease Control and Prevention, https://
www.cdc.gov/lyme/datasurveillance/surveillance-data.html?CDC_AA_
refVal=https%3A%2F%2Fround-lake.dustinice.workers.dev%3A443%2Fhttps%2Fwww.cdc.gov%2Flyme%2Fdatasurveillance%2Frec
ent-surveillance-data.html. Read the Limitations section and give special attention
to the statement, “Under-reporting and misclassification are features common to all
surveillance systems.”
After reviewing this resource, as much for process and presentation as well as the
provided materials, go to https://www.cdc.gov/lyme/datasurveillance/surveillance-
data.html and download the file at Lyme Disease Incidence Rates by State or
Locality [XLS – 4 KB], Lyme_Disease_Incidence_Rates_by_State_or_Locality.
csv, which is also available at the publisher’s Web site associated with this text.
Challenge: Use the tidyverse ecosystem and associated tools to adjust the dataset.
The many actions needed to adjust the dataset are not shown in this text. By this
part of the text, those with an interest should be able to put the data into usable
format, but the main tasks should produce an adjusted dataset that models what is
seen immediately below, where the emphasis is on entities with a State FIPS code
(remove the summary row marked US Incidence) and 2019 incidence data:
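One possible wrangling sketch is shown below. The column names used here (a State column and a 2019 incidence column) are assumptions, so inspect the downloaded file first (e.g., with utils::str() or dplyr::glimpse()) and adjust accordingly.

# A minimal wrangling sketch under assumed column names.
library(tidyverse)

LymeDisease2019.df <- readr::read_csv(
  "Lyme_Disease_Incidence_Rates_by_State_or_Locality.csv") %>%
  dplyr::filter(State != "US Incidence") %>%  # drop the national summary row
  dplyr::transmute(state = State,             # usmap::plot_usmap() can join on a state column
                   incidence = `2019`)        # keep the 2019 incidence values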
base::attach(LymeDisease2019.df)
utils::str(LymeDisease2019.df)
LymeDisease2019.df %>% {
base::rbind(utils::head(., 3),
utils::tail(., 3) )} %>%
print()
# Although not an enumerated function, use the
# base::rbind() function to print both the head
# and tail of the enumerated dataset.
With all of these actions completed, use the usmap::plot_usmap() function and the
syntax shown below, at least as a model, to produce a choropleth of all states and the
incidence of Lyme disease in 2019. Note how the syntax works and plays well with
the ggplot2 package and by extension the tidyverse ecosystem. Does the map
communicate to public health personnel and the public that Lyme disease is, at least
for 2019, a concern for the Northeast and Upper Midwest, but of minimal concern in
other regions (Fig. 7.28)?
Fig. 7.28
LymeDisease2019Map <-
usmap::plot_usmap(data = LymeDisease2019.df,
values="incidence", color="red") +
labs(
title=
"Incidence of Lyme Disease by State: 2019",
subtitle=
"Review the Interplay of Borrelia burgdorferi, Borrelia
mayonii, Ticks, Wild Mammals, and Humans
for Lyme Disease") +
scale_fill_continuous(low="white", high="blue",
name="Incidence", label=scales::comma) +
theme_map() +
theme(
plot.title=element_text(hjust=0.5),
plot.subtitle=element_text(hjust=0.5, face="bold")) +
# Center the title, subtitle and put the subtitle
# in bold.
theme(legend.position="right")
# Put the legend at the right.
# Fig. 7.28
par(ask=TRUE); LymeDisease2019Map
County
Using data available from the Centers for Disease Control and Prevention, look
once again at Surveillance Data, but focus now on United States COVID-19
Community Levels by County, https://data.cdc.gov/Public-Health-Surveillance/
United-States-COVID-19-Community-Levels-by-County/3nnm-4jni. Use the
Export button to download the associated dataset, marked as United_States_
COVID-19_Community_Levels_by_County.csv and made available at the
publisher’s Web site associated with this text.
Challenge: Use the tidyverse ecosystem and associated tools to adjust the dataset.
The many actions needed to adjust the dataset are not shown in this text (although
one possible wrangling sketch is offered after the footnote below). By this part of
the text, those with an interest should be able to put the data into usable format, but
the main tasks should produce an adjusted dataset that models what is seen
immediately below, where the selections, deletions, adjustments, etc. result in the
following9 (Fig. 7.29):
• State: New Jersey
• Counties: All 21 counties
• Date Updated: 4/7/2022 (April 07, 2022)
• Variable: covid_cases_per_100k
Fig. 7.29
9 Look at the accommodation that was needed for the county named Cape May. What is the issue?
From many possible ways to approach this issue, what tidyverse tool is best for this
accommodation?
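As a starting point only, the wrangling might resemble the sketch below. The column names (state, county, county_fips, county_population, covid_cases_per_100k, date_updated) and the date format are assumptions to be verified against the downloaded CDC file, and the Cape May accommodation noted in the footnote above still needs to be addressed.

# A minimal wrangling sketch under assumed column names.
library(tidyverse)

NJCountiesCovid19Feb242022.df <- readr::read_csv(
  "United_States_COVID-19_Community_Levels_by_County.csv",
  col_types = readr::cols(county_fips = readr::col_character())) %>%
  dplyr::filter(state == "New Jersey",
                date_updated == "2022-04-07") %>%  # assumed date format
  dplyr::select(fips = county_fips, county,
                county_population, covid_cases_per_100k)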
base::attach(NJCountiesCovid19Feb242022.df)
utils::str(NJCountiesCovid19Feb242022.df)
NJCountiesCovid19Feb242022.df %>% {
base::rbind(utils::head(., 5),
utils::tail(., 5) )} %>%
print()
# Although not an enumerated function, use the
# base::rbind() function to print both the head
# and tail of the enumerated dataset.
NJCountiesCovid19Feb242022Map <-
usmap::plot_usmap(data = NJCountiesCovid19Feb242022.df,
regions="state",
include="NJ",
values="covid_cases_per_100k",
color="red") +
labs(
title=
"COVID-19 Cases per 100,000 Population by New Jersey County",
subtitle=
"Data Updated April 07, 2022") +
scale_fill_continuous(low="white", high="blue",
name="Cases per 100,000", label=scales::comma) +
theme_map() +
theme(
plot.title=element_text(hjust=0.5),
plot.subtitle=element_text(hjust=0.5, face="bold")) +
# Center the title, subtitle and put the subtitle
# in bold.
theme(legend.position="right")
# Put the legend at the right.
# Fig. 7.29
par(ask=TRUE); NJCountiesCovid19Feb242022Map
Bonus Challenge: Look not only at the New Jersey by county choropleth map of
COVID-19 cases per 100,000 but also go back to the adjusted dataset, focusing on
the relationship (if any) between county population and COVID-19 cases per
100,000. Comparing these two object variables, what parts of the state seem to have
the greatest incidence of COVID-19 (e.g., cases per 100,000) and what parts of the
state seem to have the least?
Question: Using the dataset named NJCountiesCovid19Feb242022.df, is there
any association between COVID-19 cases per 100,000 and county population
(Fig. 7.30)?
stats::cor.test(
as.numeric(
NJCountiesCovid19Feb242022.df$county_population),
as.numeric(
NJCountiesCovid19Feb242022.df$covid_cases_per_100k),
method="pearson")
p-value = 0.000249
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.414486 0.877537
sample estimates:
cor
0.717748
Fig. 7.30
stats::cor.test(
as.numeric(
NJCountiesCovid19Feb242022.df$county_population),
as.numeric(
NJCountiesCovid19Feb242022.df$covid_cases_per_100k),
method="spearman")
p-value = 0.000231
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.733766
NJCountiesCovid19Feb242022Fig <-
ggplot2::ggplot(data=NJCountiesCovid19Feb242022.df,
aes(x=county_population,
y=covid_cases_per_100k)) +
geom_point(size=2.5, color="red") +
labs(
title=
"COVID-19 Cases per 100,000 by New Jersey County Population",
subtitle="Data Updated April 07, 2022",
x="\nNJ County Population",
y="COVID-19 Cases per 100,000\n") +
scale_x_continuous(labels=scales::comma, limits=c(0,
1000000), breaks=scales::pretty_breaks(n = 5)) +
scale_y_continuous(labels=scales::comma, limits=c(0, 200),
breaks=scales::pretty_breaks(n = 5)) +
annotate("text", x=0, y=175, fontface="bold", size=06,
color="black", hjust=0, family="mono",
label="Pearson's r ..... 0.717748") +
annotate("text", x=0, y=150, fontface="bold", size=06,
color="black", hjust=0, family="mono",
label="Spearman's rho .. 0.733766") +
theme_Mac()
# Fig. 7.30
par(ask=TRUE); NJCountiesCovid19Feb242022Fig
• Observation: Using the map, selected inferential testing for association (e.g.,
Pearson’s r and Spearman’s rho), and observations of a relatively small dataset
(N = 21 counties), the association between county_population and covid_cases_
per_100k is clearly established, using results from the enumerated
NJCountiesCovid19Feb242022.df dataset. As county population increases, the
relative incidence of COVID-19 also increases. Once again, it is emphasized that
the object variable covid_cases_per_100k is not a count of cases but is instead a
proportional measure that is adjusted for overall population.
• Question(s): It will take some effort to go back into prior news stories in the press
and archived government documents, but it is still worthwhile to ask about national
and state mandates regarding COVID-19 mitigation efforts. Were there uniform
COVID-19 mitigation mandates for all 21 New Jersey counties in early April
2022, given the association between degree of urban concentration and incidence
of COVID-19 cases? As county population increases, so does the degree of
COVID-19 infection among the population. Likewise, as county population
decreases, so does the degree of COVID-19 infection among the population. Is it
the responsibility of a data scientist to bring these types of observations (estimates
of association exceeding 0.70) to those who formulate and eventually mandate
policies – often policies with punitive outcomes for those who do not follow
mandates?
Sub-county
The Census Bureau provides detailed information by Tract and by Block. There are
many resources on these breakout areas, but Census Tracts (https://www2.census.
gov/geo/pdfs/education/CensusTracts.pdf) is a useful first resource. Notice how
Tracts range from a population of 1200 (minimum) to 8000 (maximum), depending
on the composition of the overall county. Most Tracts have a population of
approximately 4000.
The acquisition of health insurance is a gateway to better health. Yes, there are
those who have health insurance and may not take advantage of its many benefits,
but medical bills for most individuals are simply far too expensive without health
insurance. The figure presented below provides a choropleth-type map at the level
of Census Tracts for those individuals (age 35 to 64 years) in South Florida (e.g.,
Broward County, Miami-Dade County, and Palm Beach County) who do not have
health insurance, identified as American Community Survey (ACS) variable
B27010_050. This location-specific type of information could greatly help public
health personnel narrowcast outreach campaigns on the need for health insurance,
spending scarce resources wisely while gaining a positive return on investment for
media campaign costs (Fig. 7.31).
Fig. 7.31
SouthFloridaNoHealthInsuranceTract.df <-
tidycensus::get_acs(
geography="tract",
variables="B27010_050",
# 35 to 64 years, No health insurance coverage
state="FL",
county=c("Broward", "Miami-Dade", "Palm Beach"),
# South Florida counties
geometry=TRUE,
year=2021,
cache_table=TRUE,
output="tidy",
survey="acs5",
show_call=TRUE)
base::attach(SouthFloridaNoHealthInsuranceTract.df)
SouthFloridaNoHealthInsuranceTract.df %>%
ggplot(aes(fill = estimate)) +
geom_sf(color = "dodgerblue") +
scale_fill_viridis_c(labels = scales::comma) +
labs(fill="Census Tract Population\nNo Health Insurance") +
ggtitle(
"South Florida (Broward County, Miami-Dade County, and
Palm Beach County) Population with No Health
Insurance (35 to 64 years) by Census Tract:
2021") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
# Much more, using the full set of available ggplot tools,
# could be done to improve the presentation of this figure.
# Fig. 7.31
Regarding Census Bureau maps at the level of either tracts or blocks, the three
counties in South Florida (e.g., Broward County, Miami-Dade County, and Palm
Beach County) were purposely selected to show what may at first seem to be unusual
output, especially for a tri-county area with a population of more than 6 million
residents – a tri-county population that exceeds the population of many states. To
offer context, consider how the minimum population for a Census Tract is 1200.
Then, for those who are not well-acquainted with these three counties, consider how
the population is minimal, if not totally absent, in large areas such as Lake
Okeechobee, the Agricultural Reserve and other western sections devoted to
agricultural use, many portions of the Everglades and Everglades National Park,
Biscayne National Park, selected wetland areas adjoining the Atlantic Ocean and
Florida Bay, etc. The map is correct, but context always helps for best use.
Look at the three South Florida counties again, but now focusing on the total
population by the four main racial groups. Given that 2020 Decennial Census data
are starting to come online, focus on all residents in these counties by the four most
representative races (Fig. 7.32):
Fig. 7.32
SoFlaRaceVariables <- c(
Hispanic = "P2_002N",
White = "P2_005N",
Black = "P2_006N",
Asian = "P2_008N")
SoFlaRaceVariables
SoFlaRace.df <-
tidycensus::get_decennial(
geography = "tract",
variables = SoFlaRaceVariables,
state = "FL",
county = c("Broward", "Miami-Dade", "Palm Beach"),
geometry = TRUE,
year = 2020,
show_call = TRUE)
SoFlaRace.df
base::attach(SoFlaRace.df)
utils::str(SoFlaRace.df)
SoFlaRaceDots <-
tidycensus::as_dot_density(
SoFlaRace.df,
value = "value",
values_per_dot = 100,
group = "variable")
# Allow time for this syntax to process. It is a
# complicated set of actions that takes time.
SoFlaRaceDots
base::attach(SoFlaRaceDots)
utils::str(SoFlaRaceDots)
SoFlaRaceBase <-
SoFlaRace.df[SoFlaRace.df$variable == "Asian", ]
SoFlaRaceBase
base::attach(SoFlaRaceBase)
utils::str(SoFlaRaceBase)
ggplot2::ggplot() +
geom_sf(data = SoFlaRaceBase,
fill = "white", color = "lightgray") +
geom_sf(data = SoFlaRaceDots,
aes(color = variable), size = 2.5) +
labs(
title=
"Population Distribution Throughout South Florida Counties
by Race (e.g., Asian, Black, Hispanic, and White) Using
2020 Decennial Census Data",
subtitle=
"Each dot represents 100 residents.") +
theme_void() +
theme(plot.title=element_text(face="bold", hjust=0.5)) +
theme(plot.subtitle=element_text(face="bold", hjust=0.5)) +
theme(legend.title=element_blank()) +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=00))
# Fig. 7.32
Another view that may help with a good understanding of population distribution
in South Florida by race is to produce four breakout panels, using the facet_grid()
function. Here is one way to address this issue, but there are many possible ways
that a grid of individual breakout maps, by race, could be produced (Fig. 7.33).
ggplot2::ggplot() +
geom_sf(data = SoFlaRaceBase,
fill = "white", color = "lightgray") +
geom_sf(data = SoFlaRaceDots,
aes(color = variable), size = 0.01) +
facet_grid(cols = vars(variable)) +
labs(
title=
"Population Distribution Throughout South Florida Counties
by Race (e.g., Asian, Black, Hispanic, and White) Using
2020 Decennial Census Data",
subtitle=
"Each dot represents 100 residents.\n") +
theme_void() +
theme(plot.title=element_text(face="bold", hjust=0.5)) +
theme(strip.text=element_text(face="bold")) +
theme(plot.subtitle=element_text(face="bold", hjust=0.5)) +
theme(legend.title=element_blank()) +
theme(legend.position = "none") +
theme(axis.text.x=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=45)) +
theme(axis.text.y=element_text(face="bold", size=10,
hjust=1.0, vjust=1.0, angle=00))
# Fig. 7.33
There are many ways to use R to produce useful maps. For this current example
on the distribution of the four main races in South Florida, some may think that a
faceted map is best, but others may think that a single blended distribution-type
map is best. Fortunately, R is flexible and allows for many approaches, all based on
needs and agreement on presentation.
Fig. 7.33
Bonus Challenge: Use the data to recreate the map, but now using the tmap package.
This is an advanced challenge that may go beyond introductory instruction, but it is
worth the effort to attempt. Look at the syntax presented below, for a totally different
mapping task and dataset, to get a partial glimpse of how the tmap package is used
to provide annotations and even a compass:
tmap::tm_credits(
"\n\nWestern parts of Broward County (e.g., GEOID
12011980000) are\nmostly uninhabited, reserved for agriculture
and the Everglades.",
fontface = "bold",
size = 0.50,
position = c("left", "top")) +
tmap::tm_credits(
" Resource: Census Bureau - ACS5, 2021\n\n", # ID credit(s)
fontface = "bold", # Text in bold
size = 0.50, # Adjust size
position = c("left", "bottom")) + # Position
tmap::tm_compass(
north = 0, # Orientation - north
type = "8star", # Compass type
position = c("center", "bottom"),# Compass position
text.size = 0.55, # Compass text - size
size = 1.05, # Compass size
show.labels = 2) # Cardinal directions
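The annotation layers above attach to a base map in the same way that ggplot2 layers attach to a plot. As one hedged illustration (assuming tmap version 3.x syntax and the SouthFloridaNoHealthInsuranceTract.df object created earlier with geometry=TRUE), a base map might be sketched as follows, with the tm_credits() and tm_compass() layers then added with +:

# A minimal sketch; palette, titles, and layout settings are illustrative only.
library(tmap)

tmap::tm_shape(SouthFloridaNoHealthInsuranceTract.df) +
  tmap::tm_polygons(col = "estimate",
    palette = "viridis",
    title = "No Health Insurance (35 to 64)") +
  tmap::tm_layout(
    main.title = "South Florida Census Tracts: ACS 2021",
    main.title.position = "center",
    legend.outside = TRUE)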
R Markdown and LaTeX Demonstrations of a Summary Memorandum of Findings
Data scientists seemingly engage in a host of activities, all to use data in its many
forms to better understand trends and from this understanding, help guide
decision-making:
• Consult with deans, supervisors, clients, and others on possible data-centric
projects requiring focused inquiry.
• Determine scope, requirements (e.g., human, computing, software, etc.),
timelines, budgets, etc. of proposed inquiries.
• Discuss, negotiate, compromise, and finally agree on differences between first
ideas and eventual acceptance of project requirements.
• Hire and train staff, obtain supporting space, hardware, software, etc. Put planned
actions into place, obtain data, scrub and organize data, prepare formative
analyses of the data, consult with others on first findings, use feedback loops to
continue with next-step inquiries, etc.
• Complete planned inquiries and prepare initial formal documents that serve as
summaries of outcomes. Then, use feedback to revise formal documents, until
the documents are eventually in final accepted form. As appropriate, share
outcomes with a broad audience (internal and external) to further new learning
opportunities.
This section will briefly highlight a topic that has not yet been addressed in any
specific detail in this text – how data scientists prepare documents and the type of
documents prepared by data scientists. Of course, there is no one and only one
answer to either, but there are a few commonalities that should be considered:
• As projects are in process, it is common for data scientists to use informal
communication channels to communicate planned methods, data resources,
eventual methods, formative outcomes, etc. with staff, deans, supervisors, clients,
and interested peers. Along with informal face-to-face conversations and formal
meetings with those who are nearby, e-mail is also a common means of
communication, but consider the value of focused listservs and private blogs.
• As projects continue and finality is approaching, it is then common to use forms
of communication that include memos, public blogs, poster presentations, and
large group slide presentations at professional conferences.
• As projects are completed, it is then the norm to put closure to inquiries and to
prepare formal summary documents. If the project is proprietary, it may be
necessary to prepare a formal report that is submitted to only a small group of
authorized readers. For projects that can be shared with peers and ostensibly the
public, it is common to share findings by preparing a paper (e.g., an article) and
to then submit it to a relevant journal for peer review and eventual publication.
Following along with this theme, it is also common to present findings at a
professional conference, either as an invited speaker or by selection of a
competitive review committee.
The question remains, however: how does a data scientist prepare documents
that summarize findings from formal inquiries? From among the many ways
summaries are prepared, go back to the earliest days of computing and data science
as a guide to better understand current practices and how document preparation is
still put into use, especially by those who use R: typesetting.10,11
Typesetting has been used by electrical engineers and nascent computer scientists
since the beginning days of computing. When typesetting, it is common to use a
simple ASCII text editor to mark up selected text (as one example of a markup, the
string ce, often with a preceding period, was an early markup for centering text) and,
with accompanying software, the text can then be put into a very attractive format.
Some of the earliest approaches to typesetting, in use at or even before Neil
Armstrong’s first steps on the moon (July 20, 1969), were based on UNIX-based
roff (Run Off a Document) typesetting markup software and included eventual
variants such as nroff (Newer roff), troff (Typesetter roff), and groff (GNU roff).
The paradigm for typesetting is to use a text editor to prepare and mark up text
and to then use a preprocessor and a postprocessor for eventual final document
preparation. With correct preparation, it was a seamless and somewhat simple
operation to typeset documents during the earliest years of computing. However,
software either evolves or dies, and the roff-type typesetting programs now see less
use, having been replaced to a large degree by R Markdown and LaTeX, both of
which are demonstrated in this lesson, with accompanying files at the publisher’s
Web site associated with this text.
Static reports are the most common way outcomes from data science inquiries
are put into final form. These static reports can range from a one-page memorandum
to a report of 1,000 or more pages. The key here is that a static report is in fixed
format, both electronic and paper, when viewed by the reader, as opposed to
interactive reports (e.g., dashboards are increasingly popular, but often lack the fine
detail of a professional static report) that allow some degree of user engagement.
10 Many documents are also prepared by use of word processing software, but it is not necessary to
comment too much on its use other than to mention that some of the most popular word processing
software packages are proprietary and it cannot be assumed that interested peers and students have
access to the same. In contrast, the typesetting approach demonstrated in this section (both R
Markdown and LaTeX) is based on markup software that is legally and freely obtained.
11 Although there is no desire to make negative comments on the use of word processing software,
investigate the distinction between the expression WYSIWYG (what you see is what you get) v
WYSIAYG (what you see is all you get) when deciding whether to prepare a document with word
processing software v using a markup language and accompanying software. Decide if the need for
inclusion of comments, syntax, and other text directly in the manuscript, but text that is not visible
in the final report, has value when selecting document preparation software.
R Markdown
rmarkdown::render("RMarkdownHelloWorld.Rmd",
"html_document")
Note: It may be necessary to use the install.packages() function and the library()
function for the Rcmdr and rmarkdown packages, depending on prior use of these
packages.
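For orientation, the general shape of a minimal R Markdown file is sketched below. This sketch is illustrative only and is not the content of RMarkdownHelloWorld.Rmd, which is provided at the publisher’s Web site; an R Markdown file combines a YAML header, narrative text, and fenced R code chunks that are executed when the file is rendered.

---
title: "Hello World"
output: html_document
---

A first R Markdown document.

```{r}
summary(cars)   # any R code chunk; cars is a built-in dataset
```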
LaTeX
For those data scientists who need extensive access to typesetting capabilities,
including the ability to typeset mathematical notation(s), LaTeX may be a valued
choice when selecting typesetting software. A sample LaTeX file
(LaTeXMemorandum.tex) is provided at the publisher’s Web site associated with
this text. Open the file with a text editor to see its structure and the precise syntax.
Once the file is prepared, typically by using freely available LaTeX software, use
the editor’s features to prepare a pdf file (LaTeXMemorandum.pdf), also provided
at the publisher’s Web site associated with this text. There are many choices of
LaTeX editors, and it is beyond the purpose of this text to recommend one over
another, other than to say that LaTeX may not be the easiest markup software to learn,
but it can be used to create documents that are beyond compare.
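As one hedged alternative to working inside a LaTeX editor, the .tex file can also be compiled from within R, assuming a LaTeX distribution (e.g., TeX Live or TinyTeX) is already installed on the local machine:

tools::texi2pdf("LaTeXMemorandum.tex")   # produces LaTeXMemorandum.pdf
# Or, if the tinytex package is installed:
# tinytex::pdflatex("LaTeXMemorandum.tex")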
Concluding Comments and Next Steps
Go back to the first chapter in this text to review how data science, as a specific
discipline and not as an offshoot of either statistics or computer science, has evolved
into a recognized discipline as taught at the university level:
Detail for CIP Code 30.7001
Title: Data Science, General.
Definition: A program that focuses on the analysis of large-scale data sources
from the interdisciplinary perspectives of applied statistics, computer science, data
storage, data representation, data modeling, mathematics, and statistics. Includes
instruction in computer algorithms, computer programming, data management, data
mining, information policy, information retrieval, mathematical modeling,
quantitative analysis, statistics, trend spotting, and visual analytics.
Action: New
CIP: The Classification of Instructional Programs;
https://nces.ed.gov/ipeds/cipcode/cipdetail.aspx?y=56&cipid=92953
Much has been covered in this text that provides a glimpse of the day-to-day
technical skills required of a data scientist. It is only necessary to provide the reminder
that expertise is needed in grasping what data are and from that, using various
resources to either create or obtain data, to organize data into an acceptable form, to
subject the data to analyses of various types (often of a statistical nature but for
descriptive purposes too), and to create visualizations of the relationships between
and among the data. Anything more on these topics would be redundant to what has
already been presented, often multiple times.
Data scientists are highly valued professionals and, like all other professionals, they
cannot ignore soft skills. Data scientists are leaders and, as such, they need to know
how to interact with others (e.g., subordinates and supervisors, external individuals
of authority, peers, and the public) in a professional manner that advances
organizational goals and profitability.
It is assumed that a data scientist, especially an experienced data scientist with
supervisory responsibilities, must have excellent skills in analytics, computing,
statistics, etc. It is also expected that a data scientist who strives for career
advancement must also have a set of soft skills, including traits (in alpha order) such as:
• A desire to innovate and a willingness to try new approaches to traditional
practices
• Awareness of self and empathy for others
• Curiosity and the desire to investigate outcomes that may not have been expected
or are counter to norms
• Excellent communication skills, oral and written, including diction and the use
of correct grammar in all media
• Honesty, without exception
• Knowledge of how a company operates and why profitability is essential
• Problem-solving skills and a willingness to bring others into the problem-
solving process
• Professional presentation and an adherence to expected clothing (e.g., dress
code), grooming, manners, etc.
• Respect and courtesy for all contacts, subordinates, peers, and superordinates
• Spatial skills and the desire to use visuals to tell a story that may be lost on others
if only a numeric approach were used for summary presentations
The Bureau of Labor Statistics (BLS) is possibly the best resource for learning
about employment opportunities in data science. Explicit detail is obtained using
the Web-based Occupational Outlook Handbook. By review of the term Data
Scientists (https://www.bls.gov/ooh/math/data-scientists.htm) it is seen that salaries
are often greater than USD 100,000 per year. Equally important, the BLS projects
positive employment growth for data scientists, with employment growing at 36
percent over the next decade.
External Data and/or Data Resources Used in This Lesson
The publisher’s Web site associated with this text includes the following files,
presented in .csv, .txt, and .xlsx file formats.
Lyme_Disease_Incidence_Rates_by_State_or_Locality.csv
United_States_COVID-19_Community_Levels_by_County.csv
WideMilkLbsPctFatPctProtein.xlsx
owid-covid-latest.csv
A few additional files are also available at the publisher’s Web site associated
with this text, relating to typesetting and documents used for preparation of
summary reports:
LaTeXMemorandum.pdf
LaTeXMemorandum.tex
RMarkdownHelloWorld.HTML
RMarkdownHelloWorld.Rmd
Challenge: Use these files to practice and replicate the outcomes used in this
lesson. Be creative and go beyond what was previously presented.
All other data, if any, were enumerated directly in the R session, using functions
such as round(rnorm()), read.table(textConnection()), etc.