Handbook of
Regression Modeling
in People Analytics
With Examples in R and Python

Keith McNulty
First edition published 2021

by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press


2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2021 Keith McNulty

CRC Press is an imprint of Taylor & Francis Group, LLC

The right of Keith McNulty to be identified as author of this work has been asserted by him in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.
com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermis-
[email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.

ISBN: 9781032041742 (hbk)


ISBN: 9781032046631 (pbk)
ISBN: 9781003194156 (ebk)
DOI: 10.1201/9781003194156

Typeset in Latin Modern font


by KnowledgeWorks Global Ltd.
Contents

Foreword by Alexis Fink xiii

Introduction xv

1 The Importance of Regression in People Analytics 1


1.1 Why is regression modeling so important in people analytics? 2
1.2 What do we mean by ‘modeling’ ? . . . . . . . . . . . . . . . 3
1.2.1 The theory of inferential modeling . . . . . . . . . . . 3
1.2.2 The process of inferential modeling . . . . . . . . . . . 5
1.3 The structure, system and organization of this book . . . . . 6

2 The Basics of the R Programming Language 9


2.1 What is R? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 How to start using R . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Data in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Homogeneous data structures . . . . . . . . . . . . . . 14
2.3.3 Heterogeneous data structures . . . . . . . . . . . . . 16
2.4 Working with dataframes . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Loading and tidying data in dataframes . . . . . . . . 18
2.4.2 Manipulating dataframes . . . . . . . . . . . . . . . . 22
2.5 Functions, packages and libraries . . . . . . . . . . . . . . . . 24
2.5.1 Using functions . . . . . . . . . . . . . . . . . . . . . . 24
2.5.2 Help with functions . . . . . . . . . . . . . . . . . . . 25
2.5.3 Writing your own functions . . . . . . . . . . . . . . . 26

DOI: 10.1201/9781003194156-0 v

2.5.4 Installing packages . . . . . . . . . . . . . . . . . . . . 26


2.5.5 Using packages . . . . . . . . . . . . . . . . . . . . . . 27
2.5.6 The pipe operator . . . . . . . . . . . . . . . . . . . . 28
2.6 Errors, warnings and messages . . . . . . . . . . . . . . . . . 29
2.7 Plotting and graphing . . . . . . . . . . . . . . . . . . . . . . 31
2.7.1 Plotting in base R . . . . . . . . . . . . . . . . . . . . 31
2.7.2 Specialist plotting and graphing packages . . . . . . . 33
2.8 Documenting your work using R Markdown . . . . . . . . . 34
2.9 Learning exercises . . . . . . . . . . . . . . . . . . . . . . . . 37
2.9.1 Discussion questions . . . . . . . . . . . . . . . . . . . 37
2.9.2 Data exercises . . . . . . . . . . . . . . . . . . . . . . 38

3 Statistics Foundations 39
3.1 Elementary descriptive statistics of populations and samples 40
3.1.1 Mean, variance and standard deviation . . . . . . . . . 40
3.1.2 Covariance and correlation . . . . . . . . . . . . . . . 43
3.2 Distribution of random variables . . . . . . . . . . . . . . . . 46
3.2.1 Sampling of random variables . . . . . . . . . . . . . . 46
3.2.2 Standard errors, the 𝑡-distribution and confidence inter-
vals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1 Testing for a difference in means (Welch’s 𝑡-test) . . . 51
3.3.2 Testing for a non-zero correlation between two variables
(𝑡-test for correlation) . . . . . . . . . . . . . . . . . . 54
3.3.3 Testing for a difference in frequency distribution be-
tween different categories in a data set (Chi-square test) 56
3.4 Foundational statistics in Python . . . . . . . . . . . . . . . 58
3.5 Learning exercises . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.1 Discussion questions . . . . . . . . . . . . . . . . . . . 62
3.5.2 Data exercises . . . . . . . . . . . . . . . . . . . . . . 63

4 Linear Regression for Continuous Outcomes 65


4.1 When to use it . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1.1 Origins and intuition of linear regression . . . . . . . . 65
4.1.2 Use cases for linear regression . . . . . . . . . . . . . . 66
4.1.3 Walkthrough example . . . . . . . . . . . . . . . . . . 67
4.2 Simple linear regression . . . . . . . . . . . . . . . . . . . . . 69
4.2.1 Linear relationship between a single input and an out-
come . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.2 Minimising the error . . . . . . . . . . . . . . . . . . . 70
4.2.3 Determining the best fit . . . . . . . . . . . . . . . . . 73
4.2.4 Measuring the fit of the model . . . . . . . . . . . . . 74
4.3 Multiple linear regression . . . . . . . . . . . . . . . . . . . . 76
4.3.1 Running a multiple linear regression model and inter-
preting its coefficients . . . . . . . . . . . . . . . . . . 76
4.3.2 Coefficient confidence . . . . . . . . . . . . . . . . . . 77
4.3.3 Model ‘goodness-of-fit’ . . . . . . . . . . . . . . . . . . 78
4.3.4 Making predictions from your model . . . . . . . . . . 81
4.4 Managing inputs in linear regression . . . . . . . . . . . . . . 82
4.4.1 Relevance of input variables . . . . . . . . . . . . . . . 83
4.4.2 Sparseness (‘missingness’) of data . . . . . . . . . . . . 83
4.4.3 Transforming categorical inputs to dummy variables . 84
4.5 Testing your model assumptions . . . . . . . . . . . . . . . . 86
4.5.1 Assumption of linearity and additivity . . . . . . . . . 86
4.5.2 Assumption of constant error variance . . . . . . . . . 88
4.5.3 Assumption of normally distributed errors . . . . . . . 89
4.5.4 Avoiding high collinearity and multicollinearity between
input variables . . . . . . . . . . . . . . . . . . . . . . 90
4.6 Extending multiple linear regression . . . . . . . . . . . . . . 93
4.6.1 Interactions between input variables . . . . . . . . . . 93
4.6.2 Quadratic and higher-order polynomial terms . . . . . 96
4.7 Learning exercises . . . . . . . . . . . . . . . . . . . . . . . . 97
4.7.1 Discussion questions . . . . . . . . . . . . . . . . . . . 97

4.7.2 Data exercises . . . . . . . . . . . . . . . . . . . . . . 97

5 Binomial Logistic Regression for Binary Outcomes 101


5.1 When to use it . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1.1 Origins and intuition of binomial logistic regression . . 102
5.1.2 Use cases for binomial logistic regression . . . . . . . . 103
5.1.3 Walkthrough example . . . . . . . . . . . . . . . . . . 104
5.2 Modeling probabilistic outcomes using a logistic function . . 106
5.2.1 Deriving the concept of log odds . . . . . . . . . . . . 107
5.2.2 Modeling the log odds and interpreting the coefficients 109
5.2.3 Odds versus probability . . . . . . . . . . . . . . . . . 110
5.3 Running a multivariate binomial logistic regression model . . 112
5.3.1 Running and interpreting a multivariate binomial logis-
tic regression model . . . . . . . . . . . . . . . . . . . 113
5.3.2 Understanding the fit and goodness-of-fit of a binomial
logistic regression model . . . . . . . . . . . . . . . . . 116
5.3.3 Model parsimony . . . . . . . . . . . . . . . . . . . . . 120
5.4 Other considerations in binomial logistic regression . . . . . 122
5.5 Learning exercises . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5.1 Discussion questions . . . . . . . . . . . . . . . . . . . 124
5.5.2 Data exercises . . . . . . . . . . . . . . . . . . . . . . 124

6 Multinomial Logistic Regression for Nominal Category Out-


comes 127
6.1 When to use it . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.1.1 Intuition for multinomial logistic regression . . . . . . 127
6.1.2 Use cases for multinomial logistic regression . . . . . . 128
6.1.3 Walkthrough example . . . . . . . . . . . . . . . . . . 128
6.2 Running stratified binomial models . . . . . . . . . . . . . . 131
6.2.1 Modeling the choice of Product A versus other products 131
6.2.2 Modeling other choices . . . . . . . . . . . . . . . . . . 133
6.3 Running a multinomial regression model . . . . . . . . . . . 133
6.3.1 Defining a reference level and running the model . . . 134

6.3.2 Interpreting the model . . . . . . . . . . . . . . . . . . 136


6.3.3 Changing the reference . . . . . . . . . . . . . . . . . . 137
6.4 Model simplification, fit and goodness-of-fit for multinomial lo-
gistic regression models . . . . . . . . . . . . . . . . . . . . . 138
6.4.1 Gradual safe elimination of variables . . . . . . . . . . 138
6.4.2 Model fit and goodness-of-fit . . . . . . . . . . . . . . 139
6.5 Learning exercises . . . . . . . . . . . . . . . . . . . . . . . . 140
6.5.1 Discussion questions . . . . . . . . . . . . . . . . . . . 140
6.5.2 Data exercises . . . . . . . . . . . . . . . . . . . . . . 141

7 Proportional Odds Logistic Regression for Ordered Category


Outcomes 143
7.1 When to use it . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1.1 Intuition for proportional odds logistic regression . . . 143
7.1.2 Use cases for proportional odds logistic regression . . . 145
7.1.3 Walkthrough example . . . . . . . . . . . . . . . . . . 145
7.2 Modeling ordinal outcomes under the assumption of propor-
tional odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.1 Using a latent continuous outcome variable to derive a
proportional odds model . . . . . . . . . . . . . . . . . 148
7.2.2 Running a proportional odds logistic regression model 150
7.2.3 Calculating the likelihood of an observation being in a
specific ordinal category . . . . . . . . . . . . . . . . . 153
7.2.4 Model diagnostics . . . . . . . . . . . . . . . . . . . . 154
7.3 Testing the proportional odds assumption . . . . . . . . . . . 155
7.3.1 Sighting the coefficients of stratified binomial models . 156
7.3.2 The Brant-Wald test . . . . . . . . . . . . . . . . . . . 157
7.3.3 Alternatives to proportional odds models . . . . . . . 158
7.4 Learning exercises . . . . . . . . . . . . . . . . . . . . . . . . 159
7.4.1 Discussion questions . . . . . . . . . . . . . . . . . . . 159
7.4.2 Data exercises . . . . . . . . . . . . . . . . . . . . . . 160

8 Modeling Explicit and Latent Hierarchy in Data 163


8.1 Mixed models for explicit hierarchy in data . . . . . . . . . . 164

8.1.1 Fixed and random effects . . . . . . . . . . . . . . . . 164


8.1.2 Running a mixed model . . . . . . . . . . . . . . . . . 165
8.2 Structural equation models for latent hierarchy in data . . . 170
8.2.1 Running and assessing the measurement model . . . . 173
8.2.2 Running and interpreting the structural model . . . . 180
8.3 Learning exercises . . . . . . . . . . . . . . . . . . . . . . . . 185
8.3.1 Discussion questions . . . . . . . . . . . . . . . . . . . 185
8.3.2 Data exercises . . . . . . . . . . . . . . . . . . . . . . 185

9 Survival Analysis for Modeling Singular Events Over Time 187


9.1 Tracking and illustrating survival rates over the study period 189
9.2 Cox proportional hazard regression models . . . . . . . . . . 193
9.2.1 Running a Cox proportional hazard regression model . 194
9.2.2 Checking the proportional hazard assumption . . . . . 196
9.3 Frailty models . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.4 Learning exercises . . . . . . . . . . . . . . . . . . . . . . . . 200
9.4.1 Discussion questions . . . . . . . . . . . . . . . . . . . 200
9.4.2 Data exercises . . . . . . . . . . . . . . . . . . . . . . 201

10 Alternative Technical Approaches in R and Python 203


10.1 ‘Tidier’ modeling approaches in R . . . . . . . . . . . . . . . 204
10.1.1 The broom package . . . . . . . . . . . . . . . . . . . . 204
10.1.2 The parsnip package . . . . . . . . . . . . . . . . . . . 208
10.2 Inferential statistical modeling in Python . . . . . . . . . . . 209
10.2.1 Ordinary Least Squares (OLS) linear regression . . . . 209
10.2.2 Binomial logistic regression . . . . . . . . . . . . . . . 211
10.2.3 Multinomial logistic regression . . . . . . . . . . . . . 212
10.2.4 Structural equation models . . . . . . . . . . . . . . . 213
10.2.5 Survival analysis . . . . . . . . . . . . . . . . . . . . . 215
10.2.6 Other model variants . . . . . . . . . . . . . . . . . . . 218

11 Power Analysis to Estimate Required Sample Sizes for Mod-


eling 221
11.1 Errors, effect sizes and statistical power . . . . . . . . . . . . 222

11.2 Power analysis for simple hypothesis tests . . . . . . . . . . . 224


11.3 Power analysis for linear regression models . . . . . . . . . . 228
11.4 Power analysis for log-likelihood regression models . . . . . . 229
11.5 Power analysis for hierarchical regression models . . . . . . . 231
11.6 Power analysis using Python . . . . . . . . . . . . . . . . . . 232

12 Further Exercises for Practice 235


12.1 Analyzing graduate salaries . . . . . . . . . . . . . . . . . . . 235
12.1.1 The graduates data set . . . . . . . . . . . . . . . . . 236
12.1.2 Discussion questions . . . . . . . . . . . . . . . . . . . 236
12.1.3 Data exercises . . . . . . . . . . . . . . . . . . . . . . 236
12.2 Analyzing a recruiting process . . . . . . . . . . . . . . . . . 237
12.2.1 The recruiting data set . . . . . . . . . . . . . . . . . 238
12.2.2 Discussion questions . . . . . . . . . . . . . . . . . . . 238
12.2.3 Data exercises . . . . . . . . . . . . . . . . . . . . . . 239
12.3 Analyzing the drivers of performance ratings . . . . . . . . . 239
12.3.1 The employee_performance data set . . . . . . . . . . . 240
12.3.2 Discussion questions . . . . . . . . . . . . . . . . . . . 240
12.3.3 Data exercises . . . . . . . . . . . . . . . . . . . . . . 241
12.4 Analyzing promotion differences between groups . . . . . . . 241
12.4.1 The promotion data set . . . . . . . . . . . . . . . . . 242
12.4.2 Discussion questions . . . . . . . . . . . . . . . . . . . 242
12.4.3 Data exercises . . . . . . . . . . . . . . . . . . . . . . 242
12.5 Analyzing feedback on learning programs . . . . . . . . . . . 243
12.5.1 The learning data set . . . . . . . . . . . . . . . . . . 243
12.5.2 Discussion questions . . . . . . . . . . . . . . . . . . . 244
12.5.3 Data exercises . . . . . . . . . . . . . . . . . . . . . . 244

References 247

Glossary 249

Index 253
Notes on data used in this book

For R and Python users, each of the data sets used in this book can be
downloaded individually by following the code in each chapter. Alternatively, for R users who intend to work through all of the chapters, all data sets
can be loaded into an R session in advance by installing and loading the
peopleanalyticsdata R package.

# install peopleanalyticsdata package
install.packages("peopleanalyticsdata")
library(peopleanalyticsdata)

# see a list of data sets
data(package = "peopleanalyticsdata")

# find out more about a specific data set ('managers' example)
help(managers)
Foreword by Alexis Fink

Over the past decade or so, increases in compute power, the emergence of
friendly analytic tools and an explosion of data have created a wonderful
opportunity to bring more analytical rigor to nearly every imaginable question.
Not coincidentally, organizations are increasingly looking to apply all that
data and capability to what is typically their greatest area of expense and
their greatest strategic differentiator—their people. For too long, many of
the most critical decisions in an organization—people decisions—had been
guided by gut instinct or borrowed ‘best practices’, and the democratization of
people analytics opened up enticing pathways to fix that. Suddenly, analysts
who were originally interested in data problems began to be interested in
people problems, and HR professionals who had dedicated their careers to
solving people problems needed more sophisticated analysis and data
storytelling to make their cases and to refine their approaches for greater
efficiency, effectiveness and impact.
Doing data work with people in organizations has complexities that some other
types of data work do not. Often, the employee populations are relatively small
compared to the data sets used in other areas, sometimes limiting the methods
that can be used. Various regulatory requirements may dictate what data can be
gathered and used, and what types of evidence might be required for various
programs or people strategies. Human behavior and organizations are
sufficiently complex that, typically, multiple factors work together in
influencing an outcome. Effects can be subtle, meaningful only in combination,
or difficult to tease apart. While in many disciplines prediction is the most
important aim, for most people analytics projects and practitioners,
understanding why something is happening is critical.
While the universe of analytical approaches is wonderful and vast, the best
‘Swiss army knife’ we have in people analytics is regression. This volume is
an accessible, targeted work aimed directly at supporting professionals doing
people analytics work. I’ve had the privilege of knowing and respecting Keith
McNulty for many years – he is the rare and marvelous individual who is deeply
expert in the mechanics of data and analytics, curious about and steeped in
the opportunities to improve the effectiveness and well-being of people at
work, and a gifted teacher and storyteller. He is among the most prolific
standard-bearers for people analytics. This new open-source volume is in
keeping with many years of contributions to the practice of understanding
people at work.

After nearly 30 years of doing people analytics work and the privilege of
leading people analytics teams at several leading global organizations, I am
still excited by the problems we get to solve, the insights we get to spawn,
and the tremendous impact we can have on organizations and the people that
comprise them. This work is human and technical and important and exciting
and deeply gratifying. I hope that you will find this Handbook of Regression
Modeling in People Analytics helps you uncover new truths and create positive
impacts in your own work.
Alexis A. Fink
December 2020
Alexis A. Fink, PhD is a leading figure in people analytics who led major
people analytics teams at Microsoft and Intel before her current role as Vice
President of People Analytics and Workforce Strategy at Facebook. She is a
Fellow of the Society for Industrial and Organizational Psychology and a
frequent author, journal editor and research leader in her field.
Introduction

As a fresh-faced undergraduate in mathematics in the 1990s, I took an
introductory course in statistics in my first term. I would never take another.
I struggled with the subject, scored my lowest grade in it and swore I would
never go anywhere near it again.
How wrong I was. Today I live and breathe statistics. How did that happen?
Firstly, statistics is about solving real-world problems, and amazingly there
was not a single mention of a relatable problem from real life in that course I
took all those years ago, just abstract mathematics. Nowadays, I know from
my work and my personal learning activities that the mathematics has no
meaning without a motivating problem to apply it to, and you’ll see example
problems all through this book.
Secondly, statistics is all about data, and working with real data has
encouraged me to reengage with statistics and come at it from a different
angle—bottom-up, you could say. Suddenly all those concepts that were put up
on whiteboards using abstract formulas now had real meaning and consequence for
the data I was working with. For me, real data helps statistical theory
come to life, and this book is supported by numerous data sets designed for
the reader to engage with.
But one more step solidified my newfound love of statistics, and that was when
I put regression modeling into practice. Faced with data sets that I initially
believed were just far too messy and random to be able to produce genuine
insights, I progressively became more and more fascinated by how regression
can cut through the messiness, compartmentalize the randomness and lead
you straight to inferences that are often surprising both in their clarity and
in their conclusions.
Hence my motivation for writing this book, which is to give others—whether
working in people analytics or otherwise—a starting point for a practical
learning of regression methods, with the hope that they will see immediate
applications to their work and take advantage of a much-underused toolkit that
provides strong support for evidence-based practice.
I am a mathematician who is now a practitioner of analytics. For this reason
you should see that this book is neither afraid of nor obsessed with the
mathematics of the methodologies covered. It is my general observation that
many students and practitioners make the mistake of trying to run multivariate
models without even a basic understanding of the underlying mathematics
of those models, and I find it very difficult to see how they can be credible in
responding to a wide range of questions or critique about their work without
such an understanding. That said, it is also not necessary for students and
practitioners to understand the deepest levels of theory in order to be fluent
in running and interpreting multivariate models. In this book I have tried to
limit the mathematical exposition to a level that allows confident and fluent
execution and interpretation.
I subscribe strongly to the principles of open source sharing of knowledge. If
you want to reference the material in this book or use the exercises or data
sets in trainings or classes, you are free to do so and you do not need to request
my permission. I only ask that you make reference to this book as the source.
I expect this book to improve over time. If you found this book or any part of
it helpful to solving a problem, I’d love to hear about it. If you have comments
to improve or question any aspect of the contents of this book I encourage
you to leave an issue1 on its Github repository. This is the most reliable way
for me to see your comment. I promise to consider all comments and input,
but I do have to make a personal judgment about whether they are helpful to
the aims and purpose of this book. If I do make changes or additions based
on your input I will make a point to acknowledge your contribution in future
editions.
I would like to thank the following individuals who have reviewed or con-
tributed to this book at some point during its development: Liz Romero, Alex
LoPilato, Kevin Jaggs, Seth Saavedra. My sincere thanks to Alexis Fink for
drawing on her years of people analytics experience to set the context for this
book in her foreword. My thanks to the people analytics community for their
constant encouragement and support in sharing theory, content and method,
and to the R community for all the work they do in giving us amazing and
constantly improving statistical tools to work with. Finally, I would like to
thank my family for their patience and understanding on the evenings and
weekends I dedicated to the writing of this book, and for tolerating far too
much dinner conversation on the topic of statistics.
Keith McNulty
December 2020

1 https://github.com/keithmcnulty/peopleanalytics-regression-book/issues
1
The Importance of Regression in People
Analytics

In the 19th century, when Francis Galton first used the term ‘regression’ to
describe a statistical phenomenon (see Chapter 4), little did he know how
important that term would be today. Many of the most powerful tools of
statistical inference that we now have at our disposal can be traced back to
the types of early analysis that Galton and his contemporaries were engaged in.
The sheer number of different regression-related methodologies and variants
that are available to researchers and practitioners today is mind-boggling, and
there are still rich veins of ongoing research that are focused on defining and
refining new forms of regression to tackle new problems.
Neither could Galton have imagined the advent of the age of data we now live
in. Those of us (like me) who entered the world of work even as recently as 20
years ago remember a time when most problems could not be expected to be
solved using a data-driven approach, because there simply was no data. Things
are very different now, with data being collected and processed all around us
and available to use as direct or indirect measures of the phenomena we are
interested in.
Along with the growth in data that we have seen in recent years, we have also
seen a rapid growth in the availability of statistical tools—open source and free
to use—that fundamentally change how we go about analytics. Gone are the
clunky, complex, repeated steps on calculators or spreadsheets. In their place
are lean statistical programming languages that can implement a regression
analysis in milliseconds with a single line of code, allowing us to easily run
and reproduce multivariate analysis at scale.
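
This point can be made concrete. The following is a minimal sketch in Python with invented illustrative data (the variable names and numbers are assumptions for illustration, not an example from the book, which demonstrates equivalents in R and Python):

```python
import numpy as np

# Invented illustrative data: years of service (x) and a performance score (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

# A simple linear regression fitted in a single line of code
slope, intercept = np.polyfit(x, y, deg=1)

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
```

The single `np.polyfit` call does the work that once required laborious repeated steps on a calculator or spreadsheet, which is precisely what makes reproducible, at-scale analysis practical.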
So given that we have access to well-developed methodology, rich sources of
data and readily accessible tools, it is somewhat surprising that many
analytics practitioners have a limited knowledge and understanding of
regression and its applications. The aim of this book is to encourage
inexperienced analytics practitioners to ‘dip their toes’ further into the
wide and varied world of regression in order to deliver more targeted and
precise insights to their organizations and stakeholders on the problems they
are most interested in.
While the primary subject matter focus of this book is the analysis of
people-related phenomena, the material is easily and naturally transferable to
other disciplines. Therefore this book can be regarded as a practical
introduction to a wide range of regression methods for any analytics student
or practitioner.
It is my firm belief that all people analytics professionals should have a strong
understanding of regression models and how to implement and interpret them
in practice, and my aim with this book is to provide those who need it with
help in getting there. In this chapter we will set the scene for the technical
learning in the remainder of the book by outlining the relevance of regression
models in people analytics practice. We also touch on some general inferential
modeling theory to set a context for later chapters, and we provide a preview
of the contents, structure and learning objectives of this book.

1.1 Why is regression modeling so important in people analytics?
People analytics involves the study of the behaviors and characteristics of
people or groups in relation to important business, organizational or
institutional outcomes. This can involve both qualitative methods and
quantitative methods, but if data is available related to a particular topic
of interest, then quantitative methods are almost always considered important.
With such a specific focus on outcomes, any analyst working in people
analytics will frequently need to model these outcomes both to understand what
influences them and to potentially predict them in the future.
Modeling an outcome with the primary goal of understanding what influences
it can be quite a different matter to modeling an outcome with the primary
goal of predicting if it will happen in the future. If we need to understand what
influences an outcome, we need to get inside a model and construct a formula
or structure to infer how each variable acts on that outcome, we need to get
a sense of which variables are meaningful or not, and we need to quantify the
‘explainability’ of the outcome based on our variables. If our primary aim is
to predict the outcome, getting inside the model is less important because
we don’t have to explain the outcome, we just need to be confident that it
predicts accurately.
A model constructed to understand an outcome is often called an inferential
model. Regression models are the most well-known and well-used inferential
models available, providing a wide range of measures and insights that help
us explain the relationship between our input variables and our outcome of
interest, as we shall see in later chapters of this book.
The current reality in the field of people analytics is that inferential models
are required more often than predictive models. There are two reasons for this.
First, data sets in people analytics are rarely large enough to facilitate sat-
isfactory prediction accuracy, and so attention is usually shifted to inference
for this reason alone. Second, in the field of people analytics, decisions often
have a real impact on individuals. Therefore, even in the rare situations where
accurate predictive modeling is attainable, stakeholders are unlikely to trust
the output and bear the consequences of predictive models without some sort
of elementary understanding of how the predictions are generated. This re-
quires the analyst to consider inference power as well as predictive accuracy
in selecting their modeling approach. Again, many regression models come
to the fore because they are commonly able to provide both inferential and
predictive value.
Finally, the growing importance of evidence-based practice in many clinical
and professional fields has generated a need for more advanced modeling skills
to satisfy rising demand for quantitative evidence from decision makers. In
people-related fields such as human resources, many varieties of specialized
regression-based models such as survival models or latent variable models
have crossed from academic and clinical settings into business settings in recent
years, and there is an increasing need for qualified individuals who understand
and can implement and interpret these models in practice.

1.2 What do we mean by ‘modeling’?


The term ‘modeling’ has a very wide range of meaning in everyday life and
work. In this book we are focused on inferential modeling, and we define that as
a specific form of statistical learning, which tries to discover and understand a
mathematical relationship between a set of measurements of certain constructs
and a measurement of an outcome of interest, based on a sample of data on
each. Modeling is both a concept and a process.

1.2.1 The theory of inferential modeling

We will start with a theoretical description and then provide a real example
from a later chapter to illustrate.
Imagine we have a population 𝒫 for which we believe there may be a non-
random relationship between a certain construct or set of constructs 𝒞 and a
certain measurable outcome 𝒪. Imagine that for a certain sample 𝑆 of obser-
vations from 𝒫, we have a collection of data which we believe measure 𝒞 to
some acceptable level of accuracy, and for which we also have a measure of
the outcome 𝒪.

By convention, we denote the set of data that measure 𝒞 on our sample 𝑆 as


𝑋 = 𝑥1 , 𝑥2 , … , 𝑥𝑝 , where each 𝑥𝑖 is a vector (or column) of data measuring
at least one of the constructs in 𝒞. We denote the set of data that measure 𝒪
on our sample set 𝑆 as 𝑦. An upper-case 𝑋 is used because the expectation
is that there will be several columns of data measuring our constructs, and a
lower-case 𝑦 is used because the expectation is that the outcome is a single
column.
Inferential modeling is the process of learning about a relationship (or lack of
relationship) between the data in 𝑋 and 𝑦 and using that to describe a rela-
tionship (or lack of relationship) between our constructs 𝒞 and our outcome
𝒪 that is valid to a high degree of statistical certainty on the population 𝒫.
This process may include:
• Testing a proposed mathematical relationship in the form of a function,
structure or iterative method
• Comparing that relationship against other proposed relationships
• Describing the relationship statistically
• Determining whether the relationship (or certain elements of it) can be
generalized from the sample set 𝑆 to the population 𝒫
When we test a relationship between 𝑋 and 𝑦, we acknowledge that data
and measurements are imperfect and so each observation in our sample 𝑆
may contain random error that we cannot control. Therefore we define our
relationship as:

𝑦 = 𝑓(𝑋) + 𝜖
where 𝑓 is some transformation or function of the data in 𝑋 and 𝜖 is a random,
uncontrollable error.
𝑓 can take the form of a predetermined function with a formula defined on
𝑋, like a linear function for example. In this case we can call our model a
parametric model. In a parametric model, the modeled value of 𝑦 is known
as soon as we know the values of 𝑋 by simply applying the formula. In a
non-parametric model, there is no predetermined formula that defines the
modeled value of 𝑦 purely in terms of 𝑋. Non-parametric models need further
information in addition to 𝑋 in order to determine the modeled value of 𝑦—for
example the value of 𝑦 in other observations with similar 𝑋 values.
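To make this distinction concrete, here is a minimal sketch in R (not from the book) contrasting a parametric linear model with a non-parametric local regression on simulated data; all names and values are illustrative.

```r
# Illustrative sketch: a parametric vs a non-parametric model of y in terms of x
set.seed(123)
x <- runif(100, 0, 10)
y <- 2 * x + rnorm(100, sd = 2)
df <- data.frame(x, y)

# Parametric: a linear model has the predetermined formula y = b0 + b1*x,
# so a modeled value of y follows directly from x and the fitted coefficients
param_mod <- lm(y ~ x, data = df)
coef(param_mod)

# Non-parametric: loess smooths locally, using nearby observations
# rather than a single global formula
nonparam_mod <- loess(y ~ x, data = df)

# both can generate modeled values for new observations
predict(param_mod, newdata = data.frame(x = 5))
predict(nonparam_mod, newdata = data.frame(x = 5))
```

The non-parametric fit needs the original observations near x = 5 to produce its modeled value, whereas the parametric fit needs only its two estimated coefficients.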
Regression models are designed to derive 𝑓 using estimation based on statis-
tical likelihood and expectation, founded on the theory of the distribution
of random variables. Regression models can be both parametric and non-
parametric, but by far the most commonly used methods (and the majority
of those featured in this book) are parametric. Because of their foundation in
statistical likelihood and expectation, they are particularly suited to helping
answer questions of generalizability—that is, to what extent the relationship
observed in the sample 𝑆 can be inferred for the population 𝒫, which is
usually the driving force in any form of inferential modeling.
Note that there is a difference between establishing a statistical relationship
between 𝒞 and 𝒪 and establishing a causal relationship between the two. This
can be a common trap that inexperienced statistical analysts fall into when
communicating the conclusions of their modeling. Establishing that a relation-
ship exists between a construct and an outcome is a far cry from being able
to say that one causes the other. This is the common truism that ‘correlation
does not equal causation’.
To bring our theory to life, consider the walkthrough example in Chapter 4 of
this book. In this example, we discuss how to establish a relationship between
the academic results of students in the first three years of their education
program and their results in the fourth year. In this case, our population 𝒫 is
all past, present and future students who take similar examinations, and our
sample 𝑆 is the students who completed their studies in the past three years.
𝑋 = 𝑥1 , 𝑥2 , 𝑥3 are each of the three scores from the first three years, and 𝑦
is the score in the fourth year. We test 𝑓 to be a linear relationship, and we
establish that such a relationship can be generalized to the entire population
𝒫 with a substantial level of statistical confidence¹.
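As a hypothetical sketch of how such a test might look in code, the following simulates data resembling this scenario and fits a linear model; the column names and generating values are invented for illustration and are not the book's actual Chapter 4 data set.

```r
# Simulated illustration of the Chapter 4 scenario (names and data invented)
set.seed(42)
n <- 500
year1 <- rnorm(n, 55, 12)                       # plays no role in the outcome
year2 <- rnorm(n, 60, 10)
year3 <- 0.5 * year2 + rnorm(n, 30, 8)
year4 <- 0.6 * year3 + 0.3 * year2 + rnorm(n, 10, 6)
scores <- data.frame(year1, year2, year3, year4)

# test a linear form for f: which inputs generalize beyond the sample?
model <- lm(year4 ~ year1 + year2 + year3, data = scores)
summary(model)   # coefficients, p-values and R-squared inform the inference
```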
Almost all our work in this book will refer to the variables 𝑋 as input variables
and the variable 𝑦 as the outcome variable. There are many other common
terms for these which you may find in other sources—for example 𝑋 are often
known as independent variables or covariates while 𝑦 is often known as a
dependent or response variable.

1.2.2 The process of inferential modeling

Inferential modeling—regression or otherwise—is a process of numerous steps.


Typically the main steps are:

1. Defining the outcome of interest 𝒪 and the input constructs 𝒞 based on a broader evidence-based objective
2. Confirming that 𝒪 has reliable measurement data
3. Determining which data can be used to measure 𝒞
4. Determining a sample 𝑆 and collecting, refining and cleaning data
5. Performing exploratory data analysis (EDA) and proposing a set of models to test for 𝑓
6. Putting the data in an appropriate format for each model
7. Running the models
8. Interpreting the outputs and performing model diagnostics
9. Selecting an optimal model or models
10. Articulating the inferences that can be generalized to apply to 𝒫

¹ We also determine that 𝑥1 (the first-year examination score) plays no significant role in 𝑓 and that introducing some non-linearity into 𝑓 further improves the statistical accuracy of the inferred relationship.

This book is primarily focused on steps 7–10 of this process². That is not to
say that steps 1–6 are not important. Indeed these steps are critical and often
loaded with analytic traps. Defining the problem, collecting reliable measures
and cleaning and organizing data are still the source of much pain and angst
for analysts, but these topics are for another day.

1.3 The structure, system and organization of this book

The purpose of this book is to put inexperienced practitioners firmly on a


path to the confident and appropriate use of regression techniques in their
day-to-day work. This requires enough of an understanding of the underlying
theory so that judgments can be made about results, but also a practical set
of steps to help practitioners apply the most common regression methods to
a variety of typical modeling scenarios in a reliable and reproducible way.
In most chapters, time is spent on the underlying mathematics. Not to the
degree of an academic theorist, but enough to ensure that the reader can
associate some mathematical meaning to the outputs of models. While it may
be tempting to skip the math, I strongly recommend against it if you intend
to be a high performer in your field. The best analysts are those who can
genuinely understand what the numbers are telling them.
The statistical programming language R is used for most of the practical
demonstration in each chapter. Because R is open source and particularly
well geared to inferential statistics, it is an excellent choice for those whose
work involves a lot of inferential analysis. In later chapters, we show imple-
mentations of all of the available methodologies in Python, which is also a
powerful open source tool for this sort of work.
Each chapter involves a walkthrough example to illustrate the specific method
and to allow the reader to replicate the analysis for themselves. The exercises
at the end of each chapter are designed so that the reader can try the same
method on a different data set, or a different problem on the same data set,
to test their learning and understanding. In the final chapter, a series of data
sets and exercises are provided with limited instruction in order to give the
reader an opportunity to test their overall knowledge in selecting and applying
regression methods to a variety of people analytics data sets and problems. All
in all, sixteen different data sets are used as walkthrough or exercise examples,
and all of these data sets are fictitious constructions unless otherwise indicated.
Despite the fiction, they are deliberately designed to present the reader with
something resembling how the data might look in practice, albeit cleaner and
more organized.

² The book also addresses steps 5 and 6 in some chapters.
The chapters of this book are arranged as follows:
• Chapter 2 covers the basics of the R programming language for those who
want to attempt to jump straight in to the work in subsequent chapters
but have very little R experience. Experienced R programmers can skip this
chapter.
• Chapter 3 covers the essential statistical concepts needed to understand
multivariate regression models. It also serves as a tutorial in univariate and
bivariate statistics illustrated with real data. If you need help developing
a decent understanding of descriptive statistics, random distribution and
hypothesis testing, this is an important chapter to study.
• Chapter 4 covers linear regression and in the course of that introduces many
other foundational concepts. The walkthrough example involves modeling
academic results from prior results. The exercises involve modeling income
levels based on various work and demographic factors.
• Chapter 5 covers binomial logistic regression. The walkthrough example in-
volves modeling promotion likelihood based on performance metrics. The
exercises involve modeling charitable donation likelihood based on prior do-
nation behavior and demographics.
• Chapter 6 covers multinomial regression. The walkthrough example and
exercise involves modeling the choice of three health insurance products by
company employees based on demographic and position data.
• Chapter 7 covers ordinal regression. The walkthrough example involves mod-
eling in-game disciplinary action against soccer players based on prior disci-
pline and other factors. The exercises involve modeling manager performance
based on varied data.
• Chapter 8 covers modeling options for data with explicit or latent hierarchy.
The first part covers mixed modeling and uses a model of speed dating
decisions as a walkthrough and example. The second part covers structural
equation modeling and uses a survey for a political party as a walkthrough
example. The exercises involve modeling latent variables in an employee
engagement survey.
• Chapter 9 covers survival analysis, Cox proportional hazard regression and
frailty models. The chapter uses employee attrition as a walkthrough exam-
ple and exercise.
• Chapter 10 outlines alternative technical approaches to regression modeling
in both R and Python. Models from previous chapters are used to illustrate
these alternative approaches.
• Chapter 11 covers power analysis, focusing in particular on estimating the
required minimum sample sizes in establishing meaningful inferences for


both simple statistical tests and multivariate models. Examples related to
experimental studies are used to illustrate, such as concurrent validity stud-
ies of selection instruments. Example implementations in R and Python are
outlined.
• Chapter 12 is a set of problems and data sets which will allow the reader to
practice the skills they have learned in this book and apply them to a vari-
ety of people analytics domains such as recruiting, performance, promotion,
compensation and learning. Sets of discussion questions and data exercises
will guide the reader through each problem, but these are designed in a way
that encourages the independent selection and application of the methods
covered in this book. These data sets, problems and exercises would suit as
homework material for classes in statistical modeling or people analytics.
2
The Basics of the R Programming Language

Most of the work in this book is implemented in the R statistical program-


ming language which, along with Python, is one of the two languages that I
use in my day-to-day statistical analysis. Sample implementations in Python
are also provided at various points in the book. I have made efforts to keep
the code as simple as possible, and I have tried to avoid the use of too many
external packages. For the most part, readers should see (especially in the ear-
lier chapters) that code blocks are short and simple, relying wherever possible
on base R functionality. No doubt there are neater and more effective ways to
code some of the material in this book using a wide array of R packages—and
some of these are illustrated in Chapter 10—but my priority has been to keep
the code simple, consistent and easily reproducible.
For those who wish to follow the method and theory without the implemen-
tations in this book, there is no need to read this chapter. However, the style
of this book is to use implementation to illustrate theory and practice, and so
tolerance of many code blocks will be necessary as you read onward.
For those who wish to simply replicate the models as quickly as possible, full
code is provided throughout this book by means of interspersed code blocks.
Assuming all the required external packages have been installed, these code
blocks should all be transportable and immediately usable. For those who
are extra-inquisitive and want to explore how I constructed graphics used for
illustration (for which code is usually not displayed), the best place to go is
the Github repository1 for this book.
This chapter is for those who wish to learn the methods in this book but
do not know how to use R. However, it is not intended to be a full tutorial
on R. There are many more qualified individuals and existing resources that
would better serve that purpose—in particular I recommend Wickham and
Grolemund (2016). It is recommended that you consult these resources and
become comfortable with the basics of R before proceeding into the later
chapters of this book. However, acknowledging that many will want to dive
in sooner rather than later, this chapter covers the absolute basics of R that
will allow the uninitiated reader to proceed with at least some orientation.
¹ https://github.com/keithmcnulty/peopleanalytics-regression-book

DOI: 10.1201/9781003194156-2

2.1 What is R?
R is a programming language that was originally developed by and for statis-
ticians, but in recent years its capabilities and the environments in which
it is used have expanded greatly, with extensive use nowadays in academia
and the public and private sectors. There are many advantages to using a
programming language like R. Here are some:

1. It is completely free and open source.


2. It is faster and more efficient with memory than popular graphical
user interface analytics tools.
3. It facilitates easier replication of analysis from person to person
compared with many alternatives.
4. It has a large and growing global community of active users.
5. It has a large and rapidly growing universe of packages, which are
all free and which provide the ability to do an extremely wide range
of general and highly specialized tasks, statistical and otherwise.

There is often heated debate about which tools are better for doing non-
trivial statistical analysis. I personally find that R provides the widest array
of resources for those interested in inferential modeling, while Python has
a more well-developed toolkit for predictive modeling and machine learning.
Since the primary focus of this book is inferential modeling, the in-depth
walkthroughs are coded in R.

2.2 How to start using R

Just like most programming languages, R itself is an interpreter which receives


input and returns output. It is not very easy to use without an IDE. An IDE is
an Integrated Development Environment, which is a convenient user interface
allowing an R programmer to do all their main tasks including writing and
running R code, saving files, viewing data and plots, integrating code into
documents and many other things. By far the most popular IDE for R is
RStudio. An example of what the RStudio IDE looks like can be seen in
Figure 2.1.
FIGURE 2.1: The RStudio IDE

To start using R, follow these steps:

1. Download and install the latest version of R from https://www.r-project.org/. Ensure that the version suits your operating system.

2. Download the latest version of the RStudio IDE from https://rstudio.com/products/rstudio/ and view the video on that page to familiarize yourself with its features.

3. Open RStudio and play around.

The initial stages of using R can be challenging, mostly due to the need to
become familiar with how R understands, stores and processes data. Extensive
trial and error is a learning necessity. Perseverance is important in these early
stages, as well as an openness to seek help from others either in person or via
online forums.

2.3 Data in R

As you start to do tasks involving data in R, you will generally want to store
the things you create so that you can refer to them later. Simply calculating
something does not store it in R. For example, a simple calculation like this
can be performed easily:

3 + 3

## [1] 6

However, as soon as the calculation is complete, it is forgotten by R because


the result hasn’t been assigned anywhere. To store something in your R session,
you will assign it a name using the <- operator. So I can assign my previous
calculation to an object called my_sum, and this allows me to access the value
at any time.

# store the result


my_sum <- 3 + 3

# now I can work with it


my_sum + 3

## [1] 9

You will see above that you can comment your code by adding a #; everything
from the # to the end of the line is ignored by the interpreter.
Note that assignment to an object does not result in the value being displayed.
To display the value, the name of the object must be typed, the print()
command used or the command should be wrapped in parentheses.

# show me the value of my_sum


my_sum

## [1] 6

# assign my_sum + 3 to new_sum and show its value


(new_sum <- my_sum + 3)

## [1] 9
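The print() function mentioned above works the same way:

```r
# print() also displays the value of a stored object
print(new_sum)

## [1] 9
```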

2.3.1 Data types

All data in R has an associated type, to reflect the wide range of data that R
is able to work with. The typeof() function can be used to see the type of a
single scalar value. Let’s look at the most common scalar data types.
Numeric data can be in integer form or double (decimal) form.

# integers can be signified by adding an 'L' to the end


my_integer <- 1L
my_double <- 6.38

typeof(my_integer)

## [1] "integer"

typeof(my_double)

## [1] "double"

Character data is text data surrounded by single or double quotes.

my_character <- "THIS IS TEXT"


typeof(my_character)

## [1] "character"

Logical data takes the form TRUE or FALSE.

my_logical <- TRUE


typeof(my_logical)

## [1] "logical"

2.3.2 Homogeneous data structures

Vectors are one-dimensional structures containing data of the same type and
are notated by using c(). The type of the vector can also be viewed using
the typeof() function, but the str() function can be used to display both the
contents of the vector and its type.

my_double_vector <- c(2.3, 6.8, 4.5, 65, 6)


str(my_double_vector)

## num [1:5] 2.3 6.8 4.5 65 6

Categorical data—which takes only a finite number of possible values—can


be stored as a factor vector to make it easier to perform grouping and manip-
ulation.

categories <- factor(


c("A", "B", "C", "A", "C")
)

str(categories)

## Factor w/ 3 levels "A","B","C": 1 2 3 1 3

If needed, the factors can be given order.

# character vector
ranking <- c("Medium", "High", "Low")
str(ranking)

## chr [1:3] "Medium" "High" "Low"

# turn it into an ordered factor


ranking_factors <- ordered(
ranking, levels = c("Low", "Medium", "High")
)

str(ranking_factors)
## Ord.factor w/ 3 levels "Low"<"Medium"<..: 2 3 1

The number of elements in a vector can be seen using the length() function.

length(categories)

## [1] 5

Simple numeric sequence vectors can be created using shorthand notation.

(my_sequence <- 1:10)

## [1] 1 2 3 4 5 6 7 8 9 10

If you try to mix data types inside a vector, it will usually result in type
coercion, where one or more of the types are forced into a different type to
ensure homogeneity. Often this means the vector will become a character
vector.

# numeric sequence vector


vec <- 1:5
str(vec)

## int [1:5] 1 2 3 4 5

# create a new vector containing vec and the character "hello"


new_vec <- c(vec, "hello")

# numeric values have been coerced into their character equivalents


str(new_vec)

## chr [1:6] "1" "2" "3" "4" "5" "hello"

But sometimes logical or factor types will be coerced to numeric.



# attempt a mixed logical and numeric


mix <- c(TRUE, 6)

# logical has been converted to binary numeric (TRUE = 1)


str(mix)

## num [1:2] 1 6

# try to add a numeric to our previous categories factor vector


new_categories <- c(categories, 1)

# categories have been coerced to background integer representations


str(new_categories)

## num [1:6] 1 2 3 1 3 1

Matrices are two-dimensional data structures of the same type and are built
from a vector by defining the number of rows and columns. Data is read into
the matrix down the columns, starting left and moving right. Matrices are
rarely used for non-numeric data types.

# create a 2x2 matrix with the first four integers


(m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2))

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

Arrays are n-dimensional data structures with the same data type and are
not used extensively by most R users.
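For reference, a small array can nonetheless be created and indexed as follows (an illustrative sketch, not an example from the book):

```r
# build a 2x3x2 array from the first twelve integers,
# filled down columns, one 'layer' at a time
my_array <- array(1:12, dim = c(2, 3, 2))

# index with [row, column, layer]
my_array[1, 2, 2]

## [1] 9
```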

2.3.3 Heterogeneous data structures

Lists are one-dimensional data structures that can take data of any type.

my_list <- list(6, TRUE, "hello")


str(my_list)

## List of 3
## $ : num 6
## $ : logi TRUE
## $ : chr "hello"

List elements can be any data type and any dimension. Each element can be
given a name.

new_list <- list(


scalar = 6,
vector = c("Hello", "Goodbye"),
matrix = matrix(1:4, nrow = 2, ncol = 2)
)

str(new_list)

## List of 3
## $ scalar: num 6
## $ vector: chr [1:2] "Hello" "Goodbye"
## $ matrix: int [1:2, 1:2] 1 2 3 4

Named list elements can be accessed by using $.

new_list$matrix

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

Dataframes are the most used data structure in R; they are effectively a
named list of vectors of the same length, with each vector as a column. As
such, a dataframe is very similar in nature to a typical database table or
spreadsheet.

# two vectors of different types but same length


names <- c("John", "Ayesha")
ages <- c(31, 24)

# create a dataframe
(df <- data.frame(names, ages))

## names ages
## 1 John 31
## 2 Ayesha 24

# get types of columns


str(df)

## 'data.frame': 2 obs. of 2 variables:


## $ names: chr "John" "Ayesha"
## $ ages : num 31 24

# get dimensions of df
dim(df)

## [1] 2 2

2.4 Working with dataframes


The dataframe is the most common data structure used by analysts in R, due
to its similarity to data tables found in databases and spreadsheets. We will
work almost entirely with dataframes in this book, so let’s get to know them.

2.4.1 Loading and tidying data in dataframes

To work with data in R, you usually need to pull it in from an outside source
into a dataframe². R facilitates numerous ways of importing data from simple
.csv files, from Excel files, from online sources or from databases. Let's load a
data set that we will use later—the salespeople data set, which contains some
information on the sales, average customer ratings and performance ratings
of salespeople. The read.csv() function can accept a URL address of the file
if it is online.

² R also has some built-in data sets for testing and playing with. For example, check out mtcars by typing it into the terminal, or type data() to see a full list of built-in data sets.

# url of data set


url <- "http://peopleanalytics-regression-book.org/data/salespeople.csv"

# load the data set and store it as a dataframe called salespeople


salespeople <- read.csv(url)

We might not want to display this entire data set before knowing how big it
is. We can view the dimensions, and if it is too big to display, we can use the
head() function to display just the first few rows.

dim(salespeople)

## [1] 351 4

# hundreds of rows, so view first few


head(salespeople)

## promoted sales customer_rate performance


## 1 0 594 3.94 2
## 2 0 446 4.06 3
## 3 1 674 3.83 4
## 4 0 525 3.62 2
## 5 1 657 4.40 3
## 6 1 918 4.54 2

We can view a specific column by using $, and we can use square brackets to
view a specific entry. For example if we wanted to see the 6th entry of the
sales column:

salespeople$sales[6]

## [1] 918

Alternatively, we can use a [row, column] index to get a specific entry in the
dataframe.

salespeople[34, 4]

## [1] 3

We can take a look at the data types using str().

str(salespeople)

## 'data.frame': 351 obs. of 4 variables:


## $ promoted : int 0 0 1 0 1 1 0 0 0 0 ...
## $ sales : int 594 446 674 525 657 918 318 364 342 387 ...
## $ customer_rate: num 3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
## $ performance : int 2 3 4 2 3 2 3 1 3 3 ...

We can also see a statistical summary of each column using summary(), which
tells us various statistics depending on the type of the column.

summary(salespeople)

## promoted sales customer_rate performance


## Min. :0.0000 Min. :151.0 Min. :1.000 Min. :1.0
## 1st Qu.:0.0000 1st Qu.:389.2 1st Qu.:3.000 1st Qu.:2.0
## Median :0.0000 Median :475.0 Median :3.620 Median :3.0
## Mean :0.3219 Mean :527.0 Mean :3.608 Mean :2.5
## 3rd Qu.:1.0000 3rd Qu.:667.2 3rd Qu.:4.290 3rd Qu.:3.0
## Max. :1.0000 Max. :945.0 Max. :5.000 Max. :4.0
## NA's :1 NA's :1 NA's :1

Note that there is missing data in this dataframe, indicated by NAs in the
summary. Missing data is identified by a special NA value in R. This should
not be confused with "NA", which is simply a character string. The function
is.na() will look at all values in a vector or dataframe and return TRUE or
FALSE based on whether they are NA or not. By adding these up using the
sum() function, it will take TRUE as 1 and FALSE as 0, which effectively provides
a count of missing data.

sum(is.na(salespeople))

## [1] 3

This is a small number of NAs given the dimensions of our data set and we
might want to remove the rows of data that contain NAs. The easiest way
is to use the complete.cases() function, which identifies the rows that have
no NAs, and then we can select those rows from the dataframe based on that
condition. Note that you can overwrite objects with the same name in R.

salespeople <- salespeople[complete.cases(salespeople), ]

# confirm no NAs
sum(is.na(salespeople))

## [1] 0

We can see the unique values of a vector or column using the unique() function.

unique(salespeople$performance)

## [1] 2 3 4 1

If we need to change the type of a column in a dataframe, we can use the


as.numeric(), as.character(), as.logical() or as.factor() functions. For
example, given that there are only four unique values for the performance
column, we may want to convert it to a factor.

salespeople$performance <- as.factor(salespeople$performance)


str(salespeople)

## 'data.frame': 350 obs. of 4 variables:


## $ promoted : int 0 0 1 0 1 1 0 0 0 0 ...
## $ sales : int 594 446 674 525 657 918 318 364 342 387 ...
## $ customer_rate: num 3.94 4.06 3.83 3.62 4.4 4.54 3.09 4.89 3.74 3 ...
## $ performance : Factor w/ 4 levels "1","2","3","4": 2 3 4 2 3 2 3 1 3 3 ...

2.4.2 Manipulating dataframes

Dataframes can be subsetted to contain only rows that satisfy specific condi-
tions.

(sales_720 <- subset(salespeople, subset = sales == 720))

## promoted sales customer_rate performance


## 290 1 720 3.76 3

Note the use of ==, which is used in many programming languages to test for
precise equality. Similarly, we can select rows based on inequalities (> for
‘greater than’, < for ‘less than’, >= for ‘greater than or equal to’, <= for ‘less
than or equal to’, or != for ‘not equal to’). For example:

high_sales <- subset(salespeople, subset = sales >= 700)
head(high_sales)

## promoted sales customer_rate performance
## 6 1 918 4.54 2
## 12 1 716 3.16 3
## 20 1 937 5.00 2
## 21 1 702 3.53 4
## 25 1 819 4.45 2
## 26 1 736 3.94 4

To select specific columns, use the select argument.

salespeople_sales_perf <- subset(salespeople,
                                 select = c("sales", "performance"))
head(salespeople_sales_perf)

## sales performance
## 1 594 2
## 2 446 3
## 3 674 4
## 4 525 2
## 5 657 3
## 6 918 2

Two dataframes with the same column names can be combined by their rows.

low_sales <- subset(salespeople, subset = sales < 400)

# bind the rows of low_sales and high_sales together
low_and_high_sales <- rbind(low_sales, high_sales)
head(low_and_high_sales)

## promoted sales customer_rate performance
## 7 0 318 3.09 3
## 8 0 364 4.89 1
## 9 0 342 3.74 3
## 10 0 387 3.00 3
## 15 0 344 3.02 2
## 16 0 372 3.87 3

Two dataframes with different column names can be combined by their
columns.

# two dataframes with two columns each
sales_perf <- subset(salespeople,
select = c("sales", "performance"))
prom_custrate <- subset(salespeople,
select = c("promoted", "customer_rate"))

# bind the columns to create a dataframe with four columns
full_df <- cbind(sales_perf, prom_custrate)
head(full_df)

## sales performance promoted customer_rate
## 1 594 2 0 3.94
## 2 446 3 0 4.06
## 3 674 4 1 3.83
## 4 525 2 0 3.62
## 5 657 3 1 4.40
## 6 918 2 1 4.54

2.5 Functions, packages and libraries


In the code so far we have used a variety of functions, for example head(),
subset() and rbind(). Functions are operations that take certain defined inputs
and return an output, and they exist to perform common, useful operations.

2.5.1 Using functions

Functions usually take one or more arguments. Often there is a large number
of arguments that a function can take, but many are optional and not required
to be specified by the user. For example, the function head(), which displays
the first rows of a dataframe³, has only one required argument x: the name
of the dataframe. A second argument, n, is optional: the number of rows to
display. If n is not entered, it is assumed to have the default value n = 6.
When running a function, you can either specify the arguments by name or
you can enter them in order without their names. If you enter arguments
without naming them, R expects the arguments to be entered in exactly the
right order.

# see the head of salespeople, with the default of six rows
head(salespeople)

## promoted sales customer_rate performance
## 1 0 594 3.94 2
## 2 0 446 4.06 3
## 3 1 674 3.83 4
## 4 0 525 3.62 2
## 5 1 657 4.40 3
## 6 1 918 4.54 2

# see fewer rows - arguments need to be in the right order if not named
head(salespeople, 3)

³ It actually has a broader definition but is mostly used for showing the first rows of a dataframe.

## promoted sales customer_rate performance
## 1 0 594 3.94 2
## 2 0 446 4.06 3
## 3 1 674 3.83 4

# or if you don't know the right order,
# name your arguments and you can put them in any order
head(n = 3, x = salespeople)

## promoted sales customer_rate performance
## 1 0 594 3.94 2
## 2 0 446 4.06 3
## 3 1 674 3.83 4

2.5.2 Help with functions

Most functions in R have excellent help documentation. To get help on the
head() function, type help(head) or ?head. This will display the results in
the Help browser window in RStudio. Alternatively you can open the Help
browser window directly in RStudio and do a search there. An example of the
browser results for head() is in Figure 2.2.

FIGURE 2.2: Results of a search for the head() function in the RStudio
Help browser

The help page normally shows the following:

• Description of the purpose of the function
• Usage examples, so you can quickly see how it is used
• Arguments list so you can see the names and order of arguments
• Details or notes on further considerations on use
• Expected value of the output (for example head() is expected to return a
similar object to its first input x)
• Examples to help orient you further (sometimes examples can be very
abstract in nature and not so helpful to users)

2.5.3 Writing your own functions

Functions are not limited to those that come packaged in R. Users can write
their own functions to perform tasks that are helpful to their objectives.
Experienced programmers in most languages subscribe to a principle called DRY
(Don’t Repeat Yourself). Whenever a task needs to be done repeatedly, it is
poor practice to write the same code numerous times. It makes more sense to
write a function to do the task.
In this example, a simple function is written which generates a report on a
dataframe:

# create df_report function
df_report <- function(df) {
paste("This dataframe contains", nrow(df), "rows and",
ncol(df), "columns. There are", sum(is.na(df)), "NA entries.")
}

We can test our function by using the built-in mtcars data set in R.

df_report(mtcars)

## [1] "This dataframe contains 32 rows and 11 columns. There are 0 NA entries."

2.5.4 Installing packages

All the common functions that we have used so far exist in the base R
installation. However, the beauty of open source languages like R is that users
can write their own functions or resources and release them to others via packages.

A package is an additional module that can be installed easily; it makes
resources available which are not in the base R installation. In this book we will
be using functions from both base R and from popular and useful packages.
As an example, a popular package used for statistical modeling is the MASS
package, which is based on methods in a popular applied statistics book⁴.
Before an external package can be used, it must be installed into your
package library using install.packages(). So to install MASS, type
install.packages("MASS") into the console. This will send R to the main internet
repository for R packages (known as CRAN). It will find the right version of
MASS for your operating system and download and install it into your package
library. If MASS needs other packages in order to work, it will also install these
packages.
If you want to install more than one package, put the names of the packages
inside a character vector—for example:

my_packages <- c("MASS", "DescTools", "dplyr")
install.packages(my_packages)

Once you have installed a package, you can see what functions are available
by calling for help on it, for example using help(package = MASS). One pack-
age you may wish to install now is the peopleanalyticsdata package, which
contains all the data sets used in this book. By installing and loading this
package, all the data sets used in this book will be loaded into your R ses-
sion and ready to work with. If you do this, you can ignore the read.csv()
commands later in the book, which download the data from the internet.

2.5.5 Using packages

Once you have installed a package into your package library, to use it in your
R session you need to load it using the library() function. For example, to
load MASS after installing it, use library(MASS). Often nothing will happen
when you use this command, but rest assured the package has been loaded
and you can start to use the functions inside it. Sometimes when you load
the package a series of messages will display, usually to make you aware of
certain things that you need to keep in mind when using the package. Note
that whenever you see the library() command in this book, it is assumed
that you have already installed the package in that command. If you have not,
the library() command will fail.
Once a package is loaded from your library, you can use any of the functions
inside it. For example, the stepAIC() function is not available before you
⁴ Venables and Ripley (2002)

load the MASS package but becomes available after it is loaded. In this sense,
functions ‘belong’ to packages.
Problems can occur when you load packages that contain functions with the
same name as functions that already exist in your R session. Often the mes-
sages you see when loading a package will alert you to this. When R is faced
with a situation where a function exists in multiple packages you have loaded,
R always defaults to the function in the most recently loaded package. This
may not always be what you intended.
One way to completely avoid this issue is to get in the habit of namespacing
your functions. To namespace, you simply use package::function(), so to
safely call stepAIC() from MASS, you use MASS::stepAIC(). Most of the time
in this book when a function is being called from a package outside base R, I
use namespacing to call that function. This should help avoid confusion about
which packages are being used for which functions.
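As a brief illustration, here is a sketch of namespacing in action, assuming the MASS package is installed (it ships with most standard R distributions). The stepAIC() function can be called directly, without any prior library(MASS) call:

```r
# fit a linear model on the built-in mtcars data set
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# call stepAIC() from MASS via namespacing -- no library(MASS) needed
reduced <- MASS::stepAIC(model, trace = FALSE)

# the result is an ordinary fitted lm object
class(reduced)

## [1] "lm"
```

Namespacing also makes explicit where stepAIC() comes from, so a reader of your code does not have to guess which loaded package supplied the function.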

2.5.6 The pipe operator

Even in the most elementary briefing about R, it is very difficult to ignore the
pipe operator. The pipe operator makes code more natural to read and write
and reduces the typical computing problem of many nested operations inside
parentheses. The pipe operator is provided by many R packages, particularly
magrittr and dplyr.

As an example, imagine we wanted to do the following two operations in one
command:

1. Subset salespeople to only the sales values of those with sales
less than 500
2. Take the mean of those values

In base R, one way to do this is:

mean(subset(salespeople$sales, subset = salespeople$sales < 500))

## [1] 388.6684

This is nested and needs to be read from the inside out in order to align
with the instructions. The pipe operator %>% takes the command that comes
before it and places it inside the function that follows it (by default as the
first argument). This reduces complexity and allows you to follow the logic
more clearly.
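Rewritten with the pipe operator, the same calculation reads left to right in the order of the two instructions. This is a minimal sketch assuming the magrittr package is installed; a small stand-in vector is used so that the example is self-contained (salespeople$sales works in exactly the same way):

```r
library(magrittr)

# stand-in sales figures
sales <- c(594, 446, 674, 318, 364, 342, 387)

# step 1: subset to the values less than 500; step 2: take their mean
sales %>%
  subset(subset = sales < 500) %>%
  mean()

## [1] 371.4
```

Each %>% hands the result on its left to the function on its right as the first argument, so the code can be read as a sequence of steps rather than from the inside out.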