Handbook of
Regression Modeling
in People Analytics
With Examples in R and Python
Keith McNulty
First edition published 2021
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
The right of Keith McNulty to be identified as author of this work has been asserted by him in accor-
dance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
Reasonable efforts have been made to publish reliable data and information, but the author and pub-
lisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.
com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermis-
[email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003194156-0
Contents
Introduction
3 Statistics Foundations
3.1 Elementary descriptive statistics of populations and samples
3.1.1 Mean, variance and standard deviation
3.1.2 Covariance and correlation
3.2 Distribution of random variables
3.2.1 Sampling of random variables
3.2.2 Standard errors, the 𝑡-distribution and confidence intervals
3.3 Hypothesis testing
3.3.1 Testing for a difference in means (Welch’s 𝑡-test)
3.3.2 Testing for a non-zero correlation between two variables (𝑡-test for correlation)
3.3.3 Testing for a difference in frequency distribution between different categories in a data set (Chi-square test)
3.4 Foundational statistics in Python
3.5 Learning exercises
3.5.1 Discussion questions
3.5.2 Data exercises
References
Glossary
Index
Notes on data used in this book
For R and Python users, each of the data sets used in this book can be
downloaded individually by following the code in each chapter. Alternatively
for R users who intend to work through all of the chapters, all data sets
can be loaded into an R session in advance by installing and loading the
peopleanalyticsdata R package.
Over the past decade or so, increases in compute power, emergence of friendly
analytic tools and an explosion of data have created a wonderful opportu-
nity to bring more analytical rigor to nearly every imaginable question. Not
coincidentally, organizations are increasingly looking to apply all that data
and capability to what is typically their greatest area of expense and their
greatest strategic differentiator—their people. For too long, many of the most
critical decisions in an organization—people decisions—had been guided by
gut instinct or borrowed ‘best practices’ and the democratization of people an-
alytics opened up enticing pathways to fix that. Suddenly, analysts who were
originally interested in data problems began to be interested in people prob-
lems, and HR professionals who had dedicated their careers to solving people
problems needed more sophisticated analysis and data storytelling to make
their cases and to refine their approaches for greater efficiency, effectiveness
and impact.
Doing data work with people in organizations has complexities that some
other types of data work don’t. Often, the employee populations are
smaller than the data sets used in other areas, sometimes limiting the
methods that can be used. Various regulatory requirements may dictate what
data can be gathered and used, and what types of evidence might be required
for various programs or people strategies. Human behavior and organizations
are sufficiently complex that typically, multiple factors work together in influ-
encing an outcome. Effects can be subtle or meaningful only in combination,
or difficult to tease apart. While in many disciplines, prediction is the most
important aim, for most people analytics projects and practitioners, under-
standing why something is happening is critical.
While the universe of analytical approaches is wonderful and vast, the best
‘Swiss army knife’ we have in people analytics is regression. This volume is
an accessible, targeted work aimed directly at supporting professionals doing
people analytics work. I’ve had the privilege of knowing and respecting Keith
McNulty for many years – he is the rare and marvelous individual who is deeply
expert in the mechanics of data and analytics, curious about and steeped in
the opportunities to improve the effectiveness and well-being of people at work,
and a gifted teacher and storyteller. He is among the most prolific standard-
bearers for people analytics. This new open-source volume is in keeping with
many years of contributions to the practice of understanding people at work.
After nearly 30 years of doing people analytics work and the privilege of
leading people analytics teams at several leading global organizations, I am
still excited by the problems we get to solve, the insights we get to spawn,
and the tremendous impact we can have on organizations and the people that
comprise them. This work is human and technical and important and exciting
and deeply gratifying. I hope that you will find this Handbook of Regression
Modeling in People Analytics helps you uncover new truths and create positive
impacts in your own work.
Alexis A. Fink
December 2020
Alexis A. Fink, PhD is a leading figure in people analytics and has led
major people analytics teams at Microsoft and Intel before her current role as
Vice President of People Analytics and Workforce Strategy at Facebook. She is
a Fellow of the Society for Industrial and Organizational Psychology and is a
frequent author, journal editor and research leader in her field.
Introduction
many students and practitioners make the mistake of trying to run multivari-
ate models without even a basic understanding of the underlying mathematics
of those models, and I find it very difficult to see how they can be credible in
responding to a wide range of questions or critique about their work without
such an understanding. That said, it is also not necessary for students and
practitioners to understand the deepest levels of theory in order to be fluent
in running and interpreting multivariate models. In this book I have tried to
limit the mathematical exposition to a level that allows confident and fluent
execution and interpretation.
I subscribe strongly to the principles of open source sharing of knowledge. If
you want to reference the material in this book or use the exercises or data
sets in trainings or classes, you are free to do so and you do not need to request
my permission. I only ask that you make reference to this book as the source.
I expect this book to improve over time. If you found this book or any part of
it helpful in solving a problem, I’d love to hear about it. If you have comments
that improve on or question any aspect of the contents of this book, I encourage
you to leave an issue¹ on its GitHub repository. This is the most reliable way
for me to see your comment. I promise to consider all comments and input,
but I do have to make a personal judgment about whether they are helpful to
the aims and purpose of this book. If I do make changes or additions based
on your input I will make a point to acknowledge your contribution in future
editions.
I would like to thank the following individuals who have reviewed or con-
tributed to this book at some point during its development: Liz Romero, Alex
LoPilato, Kevin Jaggs, Seth Saavedra. My sincere thanks to Alexis Fink for
drawing on her years of people analytics experience to set the context for this
book in her foreword. My thanks to the people analytics community for their
constant encouragement and support in sharing theory, content and method,
and to the R community for all the work they do in giving us amazing and
constantly improving statistical tools to work with. Finally, I would like to
thank my family for their patience and understanding on the evenings and
weekends I dedicated to the writing of this book, and for tolerating far too
much dinner conversation on the topic of statistics.
Keith McNulty
December 2020
1 https://github.com/keithmcnulty/peopleanalytics-regression-book/issues
1 The Importance of Regression in People Analytics
In the 19th century, when Francis Galton first used the term ‘regression’ to
describe a statistical phenomenon (see Chapter 4), little did he know how
important that term would be today. Many of the most powerful tools of
statistical inference that we now have at our disposal can be traced back to
the types of early analysis that Galton and his contemporaries were engaged in.
The sheer number of different regression-related methodologies and variants
that are available to researchers and practitioners today is mind-boggling, and
there are still rich veins of ongoing research that are focused on defining and
refining new forms of regression to tackle new problems.
Neither could Galton have imagined the advent of the age of data we now live
in. Those of us (like me) who entered the world of work even as recently as 20
years ago remember a time when most problems could not be expected to be
solved using a data-driven approach, because there simply was no data. Things
are very different now, with data being collected and processed all around us
and available to use as direct or indirect measures of the phenomena we are
interested in.
Along with the growth in data that we have seen in recent years, we have also
seen a rapid growth in the availability of statistical tools—open source and free
to use—that fundamentally change how we go about analytics. Gone are the
clunky, complex, repeated steps on calculators or spreadsheets. In their place
are lean statistical programming languages that can implement a regression
analysis in milliseconds with a single line of code, allowing us to easily run
and reproduce multivariate analysis at scale.
So given that we have access to well-developed methodology, rich sources of
data and readily accessible tools, it is somewhat surprising that many ana-
lytics practitioners have a limited knowledge and understanding of regression
and its applications. The aim of this book is to encourage inexperienced ana-
lytics practitioners to ‘dip their toes’ further into the wide and varied world
of regression in order to deliver more targeted and precise insights to their
organizations and stakeholders on the problems they are most interested in.
While the primary subject matter focus of this book is the analysis of people-
related phenomena, the material is easily and naturally transferable to other
DOI: 10.1201/9781003194156-1
First, data sets in people analytics are rarely large enough to facilitate sat-
isfactory prediction accuracy, and so attention is usually shifted to inference
for this reason alone. Second, in the field of people analytics, decisions often
have a real impact on individuals. Therefore, even in the rare situations where
accurate predictive modeling is attainable, stakeholders are unlikely to trust
the output and bear the consequences of predictive models without some sort
of elementary understanding of how the predictions are generated. This re-
quires the analyst to consider inference power as well as predictive accuracy
in selecting their modeling approach. Again, many regression models come
to the fore because they are commonly able to provide both inferential and
predictive value.
Finally, the growing importance of evidence-based practice in many clinical
and professional fields has generated a need for more advanced modeling skills
to satisfy rising demand for quantitative evidence from decision makers. In
people-related fields such as human resources, many varieties of specialized
regression-based models such as survival models or latent variable models
have crossed from academic and clinical settings into business settings in recent
years, and there is an increasing need for qualified individuals who understand
and can implement and interpret these models in practice.
We will start with a theoretical description and then provide a real example
from a later chapter to illustrate.
Imagine we have a population 𝒫 for which we believe there may be a non-
random relationship between a certain construct or set of constructs 𝒞 and a
certain measurable outcome 𝒪. Imagine that for a certain sample 𝑆 of obser-
vations from 𝒫, we have a collection of data which we believe measure 𝒞 to
some acceptable level of accuracy, and for which we also have a measure of
the outcome 𝒪.
𝑦 = 𝑓(𝑋) + 𝜖
where 𝑓 is some transformation or function of the data in 𝑋 and 𝜖 is a random,
uncontrollable error.
𝑓 can take the form of a predetermined function with a formula defined on
𝑋, like a linear function for example. In this case we can call our model a
parametric model. In a parametric model, the modeled value of 𝑦 is known
as soon as we know the values of 𝑋 by simply applying the formula. In a
non-parametric model, there is no predetermined formula that defines the
modeled value of 𝑦 purely in terms of 𝑋. Non-parametric models need further
information in addition to 𝑋 in order to determine the modeled value of 𝑦—for
example the value of 𝑦 in other observations with similar 𝑋 values.
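As a minimal sketch of this distinction, the following R code produces a modeled value of 𝑦 first from a parametric model and then from a simple non-parametric estimate. The simulated data, the object names and the choice of a nearest-neighbor average as the non-parametric method are illustrative assumptions, not examples taken from this book.

```r
# simulate a sample in which y depends linearly on x plus random error
# (illustrative data only)
set.seed(123)
x <- runif(100, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)

# parametric: f is a predetermined linear formula with estimated coefficients
linear_model <- lm(y ~ x)
coefs <- coef(linear_model)

# the modeled value at x = 5 follows from the formula alone
parametric_value <- unname(coefs[1] + coefs[2] * 5)

# non-parametric: there is no formula for f, so the modeled value at x = 5
# is the average of y over the 10 observations whose x values are closest to 5
nearest <- order(abs(x - 5))[1:10]
nonparametric_value <- mean(y[nearest])

parametric_value
nonparametric_value
```

The parametric value needs only the two coefficients and the new value of 𝑋, while the non-parametric value cannot be computed without going back to the other observations.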
Regression models are designed to derive 𝑓 using estimation based on statis-
tical likelihood and expectation, founded on the theory of the distribution
of random variables. Regression models can be both parametric and non-
parametric, but by far the most commonly used methods (and the majority
of those featured in this book) are parametric. Because of their foundation in
statistical likelihood and expectation, they are particularly suited to helping
This book is primarily focused on steps 7–10 of this process. That is not to
say that steps 1–6 are not important. Indeed these steps are critical and often
loaded with analytic traps. Defining the problem, collecting reliable measures
and cleaning and organizing data are still the source of much pain and angst
for analysts, but these topics are for another day.
regression methods to a variety of people analytics data sets and problems. All
in all, sixteen different data sets are used as walkthrough or exercise examples,
and all of these data sets are fictitious constructions unless otherwise indicated.
Despite the fiction, they are deliberately designed to present the reader with
something resembling how the data might look in practice, albeit cleaner and
more organized.
The chapters of this book are arranged as follows:
• Chapter 2 covers the basics of the R programming language for those who
want to attempt to jump straight in to the work in subsequent chapters
but have very little R experience. Experienced R programmers can skip this
chapter.
• Chapter 3 covers the essential statistical concepts needed to understand
multivariate regression models. It also serves as a tutorial in univariate and
bivariate statistics illustrated with real data. If you need help developing
a decent understanding of descriptive statistics, random distribution and
hypothesis testing, this is an important chapter to study.
• Chapter 4 covers linear regression and in the course of that introduces many
other foundational concepts. The walkthrough example involves modeling
academic results from prior results. The exercises involve modeling income
levels based on various work and demographic factors.
• Chapter 5 covers binomial logistic regression. The walkthrough example in-
volves modeling promotion likelihood based on performance metrics. The
exercises involve modeling charitable donation likelihood based on prior do-
nation behavior and demographics.
• Chapter 6 covers multinomial regression. The walkthrough example and
exercise involves modeling the choice of three health insurance products by
company employees based on demographic and position data.
• Chapter 7 covers ordinal regression. The walkthrough example involves mod-
eling in-game disciplinary action against soccer players based on prior disci-
pline and other factors. The exercises involve modeling manager performance
based on varied data.
• Chapter 8 covers modeling options for data with explicit or latent hierarchy.
The first part covers mixed modeling and uses a model of speed dating
decisions as a walkthrough and example. The second part covers structural
equation modeling and uses a survey for a political party as a walkthrough
example. The exercises involve modeling latent variables in an employee
engagement survey.
• Chapter 9 covers survival analysis, Cox proportional hazard regression and
frailty models. The chapter uses employee attrition as a walkthrough exam-
ple and exercise.
• Chapter 10 outlines alternative technical approaches to regression modeling
in both R and Python. Models from previous chapters are used to illustrate
these alternative approaches.
• Chapter 11 covers power analysis, focusing in particular on estimating the
2 The Basics of the R Programming Language
DOI: 10.1201/9781003194156-2
2.1 What is R?
R is a programming language that was originally developed by and for statis-
ticians, but in recent years its capabilities and the environments in which
it is used have expanded greatly, with extensive use nowadays in academia
and the public and private sectors. There are many advantages to using a
programming language like R. Here are some:
There is often heated debate about which tools are better for doing non-
trivial statistical analysis. I personally find that R provides the widest array
of resources for those interested in inferential modeling, while Python has
a more well-developed toolkit for predictive modeling and machine learning.
Since the primary focus of this book is inferential modeling, the in-depth
walkthroughs are coded in R.
The initial stages of using R can be challenging, mostly due to the need to
become familiar with how R understands, stores and processes data. Extensive
trial and error is a learning necessity. Perseverance is important in these early
stages, as well as an openness to seek help from others either in person or via
online forums.
2.3 Data in R
As you start to do tasks involving data in R, you will generally want to store
the things you create so that you can refer to them later. Simply calculating
something does not store it in R. For example, a simple calculation like this
can be performed easily:
3 + 3
## [1] 6
## [1] 9
You will see above that you can comment your code by simply adding a # to
the start of a line to ensure that the line is ignored by the interpreter.
Note that assignment to an object does not result in the value being displayed.
To display the value, type the name of the object, use the print() command,
or wrap the assignment in parentheses.
## [1] 6
## [1] 9
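The steps described above can be sketched as follows. The object names my_sum and new_sum are hypothetical and chosen only for illustration.

```r
# a calculation on its own is displayed but not stored anywhere
3 + 3

# assign the result to a named object using <- (nothing is displayed)
my_sum <- 3 + 3

# three ways to display a stored value
my_sum                     # type the name of the object
print(my_sum)              # use the print() command
(new_sum <- my_sum + 3)    # wrap the assignment in parentheses
```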
All data in R has an associated type, to reflect the wide range of data that R
is able to work with. The typeof() function can be used to see the type of a
single scalar value. Let’s look at the most common scalar data types.
Numeric data can be in integer form or double (decimal) form.
typeof(my_integer)
## [1] "integer"
typeof(my_double)
## [1] "double"
## [1] "character"
## [1] "logical"
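A minimal sketch of the four scalar types is given here. The values assigned, and the names my_character and my_logical, are assumptions made for illustration.

```r
# integers are created with the L suffix; plain numbers are stored as doubles
my_integer <- 1L
my_double <- 6.38

# character data is quoted text; logical data is TRUE or FALSE
my_character <- "some text"
my_logical <- TRUE

typeof(my_integer)
typeof(my_double)
typeof(my_character)
typeof(my_logical)
```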
Vectors are one-dimensional structures containing data of the same type and
are notated by using c(). The type of the vector can also be viewed using
the typeof() function, but the str() function can be used to display both the
contents of the vector and its type.
str(categories)
# character vector
ranking <- c("Medium", "High", "Low")
str(ranking)
str(ranking_factors)
The number of elements in a vector can be seen using the length() function.
length(categories)
## [1] 5
## [1] 1 2 3 4 5 6 7 8 9 10
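The vector operations above can be sketched as follows. The contents of categories and the ordered factor ranking_ordered are assumptions for illustration; only the ranking vector is taken from the text.

```r
# a character vector built with c()
categories <- c("A", "B", "C", "A", "C")
str(categories)

# a character vector converted to a factor (levels are sorted alphabetically)
ranking <- c("Medium", "High", "Low")
ranking_factors <- factor(ranking)
str(ranking_factors)

# an ordered factor with an explicit level order (illustrative extension)
ranking_ordered <- factor(
  ranking,
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)
str(ranking_ordered)

# number of elements in a vector
length(categories)

# the colon operator creates a sequence of integers
my_sequence <- 1:10
my_sequence
```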
If you try to mix data types inside a vector, it will usually result in type
coercion, where one or more of the types are forced into a different type to
ensure homogeneity. Often this means the vector will become a character
vector.
## int [1:5] 1 2 3 4 5
## num [1:2] 1 6
## num [1:6] 1 2 3 1 3 1
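Type coercion can be seen in a short sketch like the one below. The object names are hypothetical.

```r
# an integer vector
vec <- 1:5
str(vec)

# adding a character value coerces every element to character
mixed_chr <- c(vec, "hello")
str(mixed_chr)

# mixing logical and numeric values coerces TRUE to 1
mixed_num <- c(TRUE, 6)
str(mixed_num)

# mixing integer and double values coerces the integers to double
mixed_dbl <- c(1L, 2.5)
typeof(mixed_dbl)
```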
Matrices are two-dimensional data structures of the same type and are built
from a vector by defining the number of rows and columns. Data is read into
the matrix down the columns, starting left and moving right. Matrices are
rarely used for non-numeric data types.
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
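As a small sketch of how a matrix is filled from a vector, the code below builds the matrix shown above. The byrow = TRUE variant is included as an illustrative contrast, and the object names are hypothetical.

```r
# data is read into the matrix down the columns by default
m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
m

# entries are addressed by [row, column]
m[1, 2]

# byrow = TRUE fills across the rows instead
m_by_row <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
m_by_row
```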
Arrays are n-dimensional data structures with the same data type and are
not used extensively by most R users.
Lists are one-dimensional data structures that can take data of any type.
## List of 3
## $ : num 6
## $ : logi TRUE
## $ : chr "hello"
List elements can be any data type and any dimension. Each element can be
given a name.
str(new_list)
## List of 3
## $ scalar: num 6
## $ vector: chr [1:2] "Hello" "Goodbye"
## $ matrix: int [1:2, 1:2] 1 2 3 4
new_list$matrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
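A list with the structure shown above could be built as in the sketch below. The construction code is an assumption that is consistent with the printed output, not necessarily the original code.

```r
# a named list whose elements have different types and dimensions
new_list <- list(
  scalar = 6,
  vector = c("Hello", "Goodbye"),
  matrix = matrix(1:4, nrow = 2, ncol = 2)
)
str(new_list)

# named elements can be accessed using $ or [[ ]]
new_list$matrix
new_list[["vector"]][2]
```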
Dataframes are the most used data structure in R; they are effectively a
named list of vectors of the same length, with each vector as a column. As
such, a dataframe is very similar in nature to a typical database table or
spreadsheet.
# create a dataframe from two vectors of the same length
names <- c("John", "Ayesha")
ages <- c(31, 24)
(df <- data.frame(names, ages))
## names ages
## 1 John 31
## 2 Ayesha 24
# get dimensions of df
dim(df)
## [1] 2 2
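Because a dataframe is effectively a named list of equal-length vectors, its columns behave like list elements. The sketch below illustrates this; the department column is a hypothetical addition.

```r
# build the dataframe directly from named vectors
df <- data.frame(
  names = c("John", "Ayesha"),
  ages = c(31, 24)
)

# a dataframe is a list under the hood
is.list(df)

# each column is a vector that can be extracted by name
df$ages

# a new column must have the same length as the existing ones
df$department <- c("Sales", "Finance")
dim(df)
```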
2.4 Working with dataframes
To work with data in R, you usually need to pull it in from an outside source
into a dataframe². R facilitates numerous ways of importing data from simple
.csv files, from Excel files, from online sources or from databases. Let’s load a
data set that we will use later: the salespeople data set, which contains some
information on the sales, average customer ratings and performance ratings
of salespeople. The read.csv() function can accept a URL address of the file
if it is online.
2 R also has some built-in data sets for testing and playing with. For example, check out
mtcars by typing it into the terminal, or type data() to see a full list of built-in data sets.
We might not want to display this entire data set before knowing how big it
is. We can view the dimensions, and if it is too big to display, we can use the
head() function to display just the first few rows.
dim(salespeople)
## [1] 351 4
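The loading and inspection steps can be sketched without an internet connection by writing a small .csv file to a temporary location first. The file and the object name sales_sample are hypothetical, and the six rows are a tiny stand-in for the full salespeople data set; a URL string could be passed to read.csv() in place of the file path.

```r
# write a small illustrative .csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("sales,performance", "594,2", "446,3", "674,4", "525,2", "657,3", "918,2"),
  csv_path
)

# read the file into a dataframe
sales_sample <- read.csv(csv_path)

# check the dimensions before displaying anything
dim(sales_sample)

# display only the first three rows
head(sales_sample, 3)
```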
We can view a specific column by using $, and we can use square brackets to
view a specific entry. For example if we wanted to see the 6th entry of the
sales column:
salespeople$sales[6]
## [1] 918
Alternatively, we can use a [row, column] index to get a specific entry in the
dataframe.
salespeople[34, 4]
## [1] 3
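Both ways of indexing can be tried on a small dataframe, as in this sketch. The dataframe sales_sample is a hypothetical six-row stand-in for the salespeople data.

```r
sales_sample <- data.frame(
  sales = c(594, 446, 674, 525, 657, 918),
  performance = c(2, 3, 4, 2, 3, 2)
)

# 6th entry of the sales column, using $ and square brackets
sales_sample$sales[6]

# the same entry using a [row, column] index
sales_sample[6, 1]

# columns can also be referenced by name inside the brackets
sales_sample[3, "performance"]

# leaving the column index empty returns the entire row
sales_sample[2, ]
```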
str(salespeople)
We can also see a statistical summary of each column using summary(), which
tells us various statistics depending on the type of the column.
summary(salespeople)
Note that there is missing data in this dataframe, indicated by NAs in the
summary. Missing data is identified by a special NA value in R. This should
not be confused with "NA", which is simply a character string. The function
is.na() will look at all values in a vector or dataframe and return TRUE or
FALSE based on whether they are NA or not. Adding these up with the
sum() function counts each TRUE as 1 and each FALSE as 0, which effectively
provides a count of missing data.
sum(is.na(salespeople))
## [1] 3
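The counting logic can be checked on a small example. The ratings dataframe below is hypothetical.

```r
ratings <- data.frame(
  sales = c(594, NA, 674),
  performance = c(2, 3, NA)
)

# TRUE wherever a value is missing
is.na(ratings)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of NAs
sum(is.na(ratings))

# the same count broken down by column
colSums(is.na(ratings))

# the character string "NA" is not a missing value
is.na("NA")
```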
This is a small number of NAs given the dimensions of our data set and we
might want to remove the rows of data that contain NAs. The easiest way
is to use the complete.cases() function, which identifies the rows that have
no NAs, and then we can select those rows from the dataframe based on that
condition. Note that you can overwrite objects with the same name in R.
# keep only the rows that contain no NAs
salespeople <- salespeople[complete.cases(salespeople), ]
# confirm no NAs
sum(is.na(salespeople))
## [1] 0
We can see the unique values of a vector or column using the unique() function.
unique(salespeople$performance)
## [1] 2 3 4 1
Dataframes can be subsetted to contain only rows that satisfy specific condi-
tions.
Note the use of ==, which is used in many programming languages, to test for
precise equality. Similarly we can select rows based on inequalities (> for
‘greater than’, < for ‘less than’, >= for ‘greater than or equal to’, <= for ‘less
than or equal to’, or != for ‘not equal to’). For example:
## sales performance
## 1 594 2
## 2 446 3
## 3 674 4
## 4 525 2
## 5 657 3
## 6 918 2
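A short sketch of subsetting by rows and by columns follows. The dataframe sales_sample is a hypothetical stand-in, and the thresholds are chosen only for illustration.

```r
sales_sample <- data.frame(
  sales = c(594, 446, 674, 525, 657, 918),
  performance = c(2, 3, 4, 2, 3, 2)
)

# rows where performance is exactly 2
perf_two <- subset(sales_sample, subset = performance == 2)

# rows where sales are at least 600
high_sales <- subset(sales_sample, subset = sales >= 600)

# the same row selection using square brackets
high_sales_alt <- sales_sample[sales_sample$sales >= 600, ]

# keep only specific columns
sales_only <- subset(sales_sample, select = c("sales"))

high_sales
```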
Two dataframes with the same column names can be combined by their rows.
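As a minimal sketch, two small dataframes with matching column names can be stacked using rbind(). The dataframes first_batch and second_batch are hypothetical.

```r
first_batch <- data.frame(sales = c(594, 446), performance = c(2, 3))
second_batch <- data.frame(sales = c(674, 525), performance = c(4, 2))

# stack the rows of the second dataframe under those of the first
combined <- rbind(first_batch, second_batch)
combined
```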
2.5 Functions, packages and libraries
Functions usually take one or more arguments. Often there are a large number
of arguments that a function can take, but many are optional and not required
to be specified by the user. For example, the function head(), which displays
the first rows of a dataframe³, has only one required argument x: the name
of the dataframe. A second argument is optional, n: the number of rows to
display. If n is not entered, it is assumed to have the default value n = 6.
When running a function, you can either specify the arguments by name or
you can enter them in order without their names. If you enter arguments
without naming them, R expects the arguments to be entered in exactly the
right order.
# see fewer rows - arguments need to be in the right order if not named
head(salespeople, 3)
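The difference between named and positional arguments can be checked on the built-in mtcars data set, as in this sketch. The object names are hypothetical.

```r
# n is not entered, so the default of 6 rows is used
default_rows <- head(mtcars)

# arguments specified by name
named <- head(x = mtcars, n = 3)

# arguments entered in order without names
positional <- head(mtcars, 3)

# named arguments can be given in any order
reordered <- head(n = 3, x = mtcars)

identical(named, positional)
identical(named, reordered)
```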
3 It actually has a broader definition but is mostly used for showing the first rows of a
dataframe.
FIGURE 2.2: Results of a search for the head() function in the RStudio
Help browser
Functions are not limited to those that come packaged in R. Users can write
their own functions to perform tasks that are helpful to their objectives. Ex-
perienced programmers in most languages subscribe to a principle called DRY
(Don’t Repeat Yourself). Whenever a task needs to be done repeatedly, it is
poor practice to write the same code numerous times. It makes more sense to
write a function to do the task.
In this example, a simple function is written which generates a report on a
dataframe:
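The function body is not reproduced in this section. A minimal sketch that would produce a report in the same format, and which should be treated as an assumption rather than the original code, is:

```r
# report the number of rows, columns and NA entries in a dataframe
df_report <- function(df) {
  paste0(
    "This dataframe contains ", nrow(df), " rows and ", ncol(df),
    " columns. There are ", sum(is.na(df)), " NA entries."
  )
}
```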
We can test our function by using the built-in mtcars data set in R.
df_report(mtcars)
## [1] "This dataframe contains 32 rows and 11 columns. There are 0 NA entries."
All the common functions that we have used so far exist in the base R instal-
lation. However, the beauty of open source languages like R is that users can
write their own functions or resources and release them to others via packages.
Once you have installed a package, you can see what functions are available
by calling for help on it, for example using help(package = MASS). One pack-
age you may wish to install now is the peopleanalyticsdata package, which
contains all the data sets used in this book. By installing and loading this
package, all the data sets used in this book will be loaded into your R ses-
sion and ready to work with. If you do this, you can ignore the read.csv()
commands later in the book, which download the data from the internet.
Once you have installed a package into your package library, to use it in your
R session you need to load it using the library() function. For example, to
load MASS after installing it, use library(MASS). Often nothing will happen
when you use this command, but rest assured the package has been loaded
and you can start to use the functions inside it. Sometimes when you load
the package a series of messages will display, usually to make you aware of
certain things that you need to keep in mind when using the package. Note
that whenever you see the library() command in this book, it is assumed
that you have already installed the package in that command. If you have not,
the library() command will fail.
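The install-then-load sequence can be sketched as follows. The installation line is commented out because it only needs to be run once and requires an internet connection; the requireNamespace() guard is an illustrative precaution so that the sketch does not fail when MASS is not installed.

```r
# one-time installation into your package library
# install.packages("MASS")

if (requireNamespace("MASS", quietly = TRUE)) {
  # before loading, stepAIC() cannot be found by its bare name
  available_before <- exists("stepAIC")

  # load the package into the current R session
  library(MASS)

  # after loading, the function is available
  available_after <- exists("stepAIC")
}
```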
Once a package is loaded from your library, you can use any of the functions
inside it. For example, the stepAIC() function is not available before you
load the MASS package but becomes available after it is loaded. In this sense,
functions ‘belong’ to packages.
4 Venables and Ripley (2002)
Problems can occur when you load packages that contain functions with the
same name as functions that already exist in your R session. Often the mes-
sages you see when loading a package will alert you to this. When R is faced
with a situation where a function exists in multiple packages you have loaded,
R always defaults to the function in the most recently loaded package. This
may not always be what you intended.
One way to completely avoid this issue is to get in the habit of namespacing
your functions. To namespace, you simply use package::function(), so to
safely call stepAIC() from MASS, you use MASS::stepAIC(). Most of the time
in this book when a function is being called from a package outside base R, I
use namespacing to call that function. This should help avoid confusion about
which packages are being used for which functions.
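A short sketch of namespacing in practice, assuming MASS is installed. The model formula and the use of the built-in mtcars data set are illustrative choices and are not taken from the examples in this book.

```r
# Fit an illustrative linear model on the built-in mtcars data set
full_model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Call stepAIC() through its namespace; library(MASS) is not required
reduced_model <- MASS::stepAIC(full_model, trace = FALSE)

# Confirm which package the function belongs to
print(environmentName(environment(MASS::stepAIC)))
```

Because the package is named explicitly, the call resolves to the same function regardless of which other packages have been loaded.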
Even in the most elementary briefing about R, it is very difficult to ignore the
pipe operator. The pipe operator makes code more natural to read and write
and reduces the typical computing problem of many nested operations inside
parentheses. The pipe operator comes inside many R packages, particularly
magrittr and dplyr.
## [1] 388.6684
This is nested and needs to be read from the inside out in order to align
with the instructions. The pipe operator %>% takes the command that comes
before it and places it inside the function that follows it (by default as the
first argument). This reduces complexity and allows you to follow the logic
more clearly.
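To make the contrast concrete, here is a hedged sketch using a small made-up vector of sales figures. The values and the names sales, nested and piped are hypothetical and are not the data behind the output shown above. It assumes the magrittr package is installed.

```r
# Assumes magrittr is installed: install.packages("magrittr")
library(magrittr)

# Hypothetical sales figures, invented for illustration only
sales <- c(120, 340, 560, 410, 780)

# Nested form: read from the inside out
# (keep values below 500, then take their mean)
nested <- mean(subset(sales, sales < 500))

# Piped form: read from left to right, one step per line
# Each %>% passes the left-hand result as the first argument of the next call
piped <- sales %>%
  subset(sales < 500) %>%
  mean()

# Both forms carry out the same computation
print(identical(nested, piped))
```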
for if it had jammed, the line would surely have snapped and the
whale been lost.
“The winch was then started and the whale drawn slowly toward
the ship.”
The burst of speed was soon ended and the whale sounded for ten
minutes, giving us all a chance to breathe and wonder what had
happened. When the animal came up again, far ahead, the spout was
high and full, with no trace of blood, so we knew that he would need
a second harpoon to finish him. I was delighted, for I had long
wished for a chance to get a roll of motion-picture film showing the
killing of a whale, and now the conditions were ideal—good light,
little wind, and no sea.
I ran below to get the cinematograph and tripod and set it on the
bridge while the gun was being loaded. The winch was then started
and the whale drawn slowly toward the ship. He persisted in keeping
in the sunlight, which drew a path of glittering, dancing points of
light, beautiful to see but fatal to pictures. I shouted to Captain
Andersen, asking him to wait a bit and let the whale go down, hoping
it would rise in the other direction. He did so and the animal swung
around, coming up just as I wished, so that the sun was almost
behind us. It was now near enough to begin work and I kept the
crank of the machine steadily revolving whenever it rose to spout.
The whale was drawn in close under the bow and for several minutes
lay straining and heaving, trying to free himself from the biting iron.
“Stand by! I’m going to shoot now,” sang out the Gunner, and in a
moment he was hidden from sight in a thick black cloud.
The beautiful gray body was lying quietly at the surface when the
smoke drifted away, but in a few seconds the whale righted himself
with a convulsive heave. The poor animal was not yet dead, though
the harpoon had gone entirely through him. Captain Andersen called
for one of the long slender lances which were triced up to the ship’s
rigging, and after a few more turns of the winch had brought the
whale right under the bows, he began jabbing the steel into its side,
throwing his whole weight on the lance. The whale was pretty “sick”
and did not last long, and before the roll of motion-picture film had
been exhausted it sank straight down, the last feeble blow leaving a
train of round white bubbles on the surface.
A sei whale at Aikawa, Japan. This species is about forty-eight feet
long and is allied to the finback and blue whales.
Andersen and I went below for breakfast and by the time we were
on deck again the whale had been inflated and was floating easily
beside the ship. When we had reached the bridge the Gunner said:
“I don’t want to go in yet with this one; we’ll cruise about until
twelve o’clock and see if we can’t find another. I am going up in the
top and then we’ll be sure not to miss any.”
I stretched out upon a seat on the port side of the bridge and lazily
watched the water boil and foam over the dead whale as we steamed
along at full speed. Captain Andersen was singing softly to himself,
apparently perfectly happy in his lofty seat. So we went about for two
hours and I was almost asleep when Andersen called down:
“There’s a whale dead ahead. He spouted six times.”
“‘There’s a whale dead ahead. He spouted six times.’”
“The click of the camera and the crash of the gun sounding at
almost the same instant.” The harpoon, rope, wads, smoke, sparks
and the back of the whale are shown in the photograph.
I was wide awake at that and had the camera open and ready for
pictures by the time we were near enough to see the animal—a sei
whale—blow. He was spouting constantly and this argued well, for
we were sure to get a shot if he continued to stay at the surface. The
Bo’s’n made a flag ready so that the carcass alongside could be let go
and marked. Apparently this was not going to be necessary, for there
was plenty of food and the whale was lazily wallowing about, rolling
first on one side and then on the other, sometimes throwing his fin in
the air and playfully slapping the water, sending it upward in geyser-like
jets.
“Half speed!” shouted the Gunner; then, “Slow!” and “Dead slow!”
The little vessel slipped silently along, the propellers hardly
moving and the nerves of every man on board as tense as the strings
of a violin. In four seconds the whale was up, not ten fathoms away
on the port bow, the click of the camera and the crash of the gun
sounding at almost the same instant. The harpoon struck the animal
in the side, just back of the fin, and he went down without a struggle,
for the bursting bomb had torn its way into the great heart.
By eleven o’clock it was alongside and slowly filling with air while
the ship was churning her way toward the station. Andersen went
below for a couple of hours’ sleep in the afternoon, and I dozed on
the bridge in the sunshine. We were just off Kinka-San at half-past
six, and by seven were blowing the whistle at the entrance to the bay.
Three other ships, the San Hogei, Ne Taihei, and Akebono, were
already inside but had no whales. Later Captain Olsen, of the
Rekkusu Maru, brought in a sei whale, but this was the only other
ship that had killed during the day. About eleven o’clock, just as I
came from the station house after developing the plates, and started
to go out to the ship, the Fukushima and Airondo Maru stole quietly
into the bay and dropped anchor. They, too, had been unsuccessful,
and, we learned later, had not even seen a whale.
Before we turned in for the night Captain Andersen said to me:
“We were just off Kinka-San at half-past six, and by seven were
blowing the whistle at the entrance to the bay.”
“We hunted them for two hours, trying first one and then the
other—they had separated—without once getting near enough
even for pictures.”
The ship got under way at two o’clock the next morning, and within
half an hour was pitching badly in a heavy sea. At five Andersen and
I turned out and climbed to the bridge, both wearing oilskins and
sou’westers to protect ourselves from the driving spray. The sun was
up in a clear sky, but the wind was awful. The man in the top shouted
down that he had seen no whales, but that many birds were about,
showing that food must be plentiful and near the surface. Captain
Andersen turned to me with a smile:
“Don’t you worry! We’ll see one before long. I’m always lucky
before breakfast.”
Almost while he was speaking the man aloft sang out, “Kujira!”
The kujira proved to be two sei whales a long way off. When we were
close enough to see, it became evident that it would only be a chance
if we got a shot. They were not spouting well and remained below a
long time.
“He was running fast but seldom stayed down long, his high
sickle-shaped dorsal fin cutting the surface first in one direction,
then in another.”
We hunted them for two hours, trying first one and then the other
—they had separated—without once getting near enough even for
pictures. It was aggravating work, and I was glad to hear Andersen
say:
“We’ll leave them and see if we can find some others. They are
impossible.”
When we came up from breakfast six other ships were visible,
some of them not far away and others marked only by long trails on
the horizon. We passed the San Hogei near enough to hear Captain
Hansen shout that he had seen no whales, and then plowed along
due south directly away from the other ships. In a short time, one by
one, they had dropped away from sight and even the smoke paths
were lost where sky and sea met.
“Then turning about with his entire head projecting from the
water like the bow of a submarine, he swam parallel with the
ship.”
“Two boat hooks were jabbed into the shark’s gills and it was
hauled along the ship’s side until it could be pulled on deck.”
One big shark, the most persistent of the school, had sunk his teeth
in the whale’s side and, although half out of water, was tearing away
at the blubber and paying not the slightest attention to the pieces of
old iron which the sailors were showering upon him. When the
harpoon was rigged and the line made fast, Andersen climbed out
upon the rope-pan in front of the gun and jammed the iron into the
shark’s back. Even then the brute waited to snatch one more
mouthful before it slid off the carcass into the water. It struggled but
little and seemed more interested in returning to its meal than in
freeing itself from the harpoon, but two boat hooks were jabbed into
its gills and it was hauled along the ship’s side until it could be pulled
on deck. This was no easy task, for it must have weighed at least two
hundred pounds and began a tremendous lashing with its tail when
the crew hauled away. “Ya-ra-cu-ra-sa,” sang the sailors, each time
giving a heave as the word “sa” was uttered, and the shark was soon
flapping and pounding about on deck. The seamen prodded it with
boat hooks and belaying pins and I must confess that I had little
sympathy for the brute when the blood poured out of its mouth and
gills, turning the snow-white breast to crimson. I paced its length as
it lay on the deck, taking good care to miss the thrashing tail and the
vicious snaps of its crescent-shaped jaws. It measured just twelve
feet and, although a big one, was by no means the largest of the
school.
A sei whale swimming directly away from the ship. The nostrils or
blowholes are widely expanded and greatly protruded.
The air pumps were still at work inflating the carcass alongside,
and the gun had not yet been loaded. Captain Andersen ran forward
with the powder charge sewed up in its neat little sack of cheesecloth;
and after the Bo’s’n had rammed it home, wadded the gun, and
inserted the harpoon, we were ready for work. The vessel had been
taking a long circle about the whale, which was blowing every few
seconds, and now we headed straight for it.
Like the last one, this animal was pursuing a school of sardines
and proved easy to approach. Andersen fired at about fifteen
fathoms, getting fast but not killing at once, and a second harpoon
was sent crashing into the beautiful gray body which before many
hours would fill several hundred cans and be sold in the markets at
Osaka. The sharks again gathered about the ship when the whale was
raised to the surface, but this time none was harpooned as we were
anxious to start for the station.
It was nearly three o’clock when the ship was on her course and
fully six before we caught a glimpse of the summit of Kinka-San, still
twenty miles away. A light fog had begun to gather, and in the west
filmy clouds draped themselves in a mantle of red and gold about the
sun. Ere the first stars appeared, the wind freshened again and the
clouds had gathered into puffy balls edged with black, which scudded
across the sky and settled into a leaden mass on the horizon. It was
evident that the good weather had ended and that we were going to
run inside just in time to escape a storm.
CHAPTER IX
HABITS OF THE SEI WHALE
“For many years the sei whale was supposed to be the young of
either the blue or the finback whale, and it was not until 1828 that
it was recognized by science as being a distinct species.”
For many years the sei whale was supposed to be the young of either
the blue or the finback whale, and it was not until 1828 that it was
recognized by science as being a distinct species. The Norwegians
gave the animal its name because it arrives upon the coast of
Finmark with the “seje,” or black codfish (Pollachius virens), but in
Japan it is called iwashi kujira (sardine whale).