Statistics for Biology and Health

Daniel Sorensen

Statistical Learning in Genetics
An Introduction Using R
Statistics for Biology and Health

Series Editors
Mitchell Gail, Division of Cancer Epidemiology and Genetics, National Cancer
Institute, Rockville, MD, USA
Jonathan M. Samet, Department of Environmental & Occupational Health,
University of Colorado Denver - Anschutz Medical Campus, Aurora, CO, USA
Statistics for Biology and Health (SBH) includes monographs and advanced
textbooks on statistical topics relating to biostatistics, epidemiology, biology, and
ecology.
Daniel Sorensen

Statistical Learning in Genetics
An Introduction Using R
Daniel Sorensen
Aarhus University
Aarhus, Denmark

ISSN 1431-8776 ISSN 2197-5671 (electronic)


Statistics for Biology and Health
ISBN 978-3-031-35850-0 ISBN 978-3-031-35851-7 (eBook)
https://doi.org/10.1007/978-3-031-35851-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.


To my beloved Pia
Preface

This book evolved from a set of notes written for a graduate course on Likelihood
and Bayesian Computations held at Aarhus University in 2016 and 2018. The
audience was life-science PhD students and post-docs with a background in either
biology, agriculture, medicine or epidemiology, who wished to develop analytic
skills to perform genomic research. This book is addressed to this audience of
numerate biologists, who, despite an interest in quantitative methods, lack the
formal mathematical background of the professional statistician. For this reason,
I offer considerably more detail in explanations and derivations than may be needed
for a more mathematically oriented audience. Nevertheless, some mathematical
and statistical prerequisites are needed in order to extract maximum benefit from
the book. These include introductory courses on calculus, linear algebra and
mathematical statistics, as well as a grounding in linear and nonlinear regression and
mixed models. Applied statistics and biostatistics students may also find the book
useful, but may wish to browse hastily through the introductory chapters describing
likelihood and Bayesian methods.
I have endeavoured to write in a style that appeals to the quantitative biologist,
while remaining concise and using examples profusely. The intention is to cover
ground at a good pace, facilitating learning by interconnecting theory with examples
and providing exercises with their solutions. Many exercises involve programming
with the open-source package R, a statistical software that can be downloaded and
used with the free graphical user interface RStudio. Most of today’s students
are competent in R and there are many tutorials online for the uninitiated. The
R-code needed to solve the exercises is provided in all cases and is written,
with few exceptions, with the objective of being transparent rather than efficient.
The reader has the opportunity to run the codes and to modify input parameters
in an experimental fashion. This hands-on computing contributes to a better
understanding of the underlying theory.
The first objective of this introduction is to provide readers with an understanding
of the techniques used for analysis of data, with emphasis on genetic data. The
second objective is to teach them to implement these techniques. Meeting these
objectives is an initial step towards acquiring the skills needed to perform data-


driven genetics/genomics research. Despite the focus on genetic applications, the


mathematics of the statistical models and their implementation are relevant for
many other branches of quantitative methods. An appendix in the opening chapter
provides an overview of basic quantitative genomic concepts, making the book more
accessible to an audience of "non-geneticists".
I have attempted to give a balanced account of frequentist/likelihood and
Bayesian methods. Both approaches are used in classical quantitative genetic and
modern genomic analyses and constitute essential ingredients in the toolkit of the
well-trained quantitative biologist.
The book is organised in three parts. Part I (Chaps. 2–5) presents an overview
of likelihood and Bayesian inference. Chapter 2 introduces the basic elements
of the likelihood paradigm, including the likelihood function, the score and the
maximum likelihood estimator. Properties of the maximum likelihood estimator are
summarised and several examples illustrate the construction of simple likelihood
models, the derivation of the maximum likelihood estimators and their properties.
Chapter 3 provides a review of three computational methods for fitting likelihood
models: Newton-Raphson, the EM (expectation-maximisation) algorithm and gra-
dient descent. After a brief description of the methods and the essentials of their
derivation, several examples (13 in all) are developed to illustrate their imple-
mentation. Chapter 4 covers the basics of the Bayesian approach, mostly through
examples. The first set of examples illustrates the types of inference that are possible
(joint, conditional and marginal inferences) when the posterior distributions have
known closed forms. In this case, inferences can be exact using analytical methods,
or can be approximated using Monte Carlo draws from the posterior distribution. A
number of options are available when the posterior distribution is only known up
to proportionality. After a very brief account of Bayesian asymptotics, the chapter
focuses on Markov chain Monte Carlo (McMC) methods. These are recipes for
generating approximate draws from posterior distributions. Using these draws, one
can obtain Monte Carlo estimates of the complete posterior distribution, or Monte
Carlo estimates of summaries such as the mean, variance and posterior intervals. The
chapter provides a description of the Gibbs sampling algorithm and of the joint and
single-site updating of parameters based on the Metropolis-Hastings algorithm. An
overview of the tools needed for analysis of the McMC output concludes the chapter.
An appendix provides the mathematical details underlying the magic of McMC
within the constraints imposed by the author’s limited mathematics. Chapter 5
illustrates applications of McMC. Several of the examples discussed in connection
with Newton-Raphson and the EM algorithm are revisited and implemented from a
Bayesian McMC perspective.
Part II of the book has the heading Prediction. The boundaries between Parts I
and II should not be construed as rigid. However, the heading emphasises the main
thread of Chaps. 6–11, with an important detour in Chap. 8 that discusses mul-
tiple testing. Chapter 6 introduces many important ingredients of prediction: best
predictor, best linear predictor, overfitting, bias-variance trade-off, cross-validation.
Among the topics discussed are the accuracy with which future observations can be
predicted, how this accuracy is measured, the factors affecting it and, importantly,

how a measure of uncertainty can be attached to accuracy. The body of the chapter
deals with prediction from a classical/frequentist perspective. Bayesian prediction
is illustrated in several examples throughout the book and particularly in Chap. 10.
In Chap. 6, many important ideas related to prediction are illustrated using a simple
least-squares setting, where the number of records n is larger than the number of
parameters p of the model; this is the n > p setup. However, in many modern
genetic problems, the number of parameters greatly exceeds the number of records;
the p ≫ n setup. This calls for some form of regularisation, a topic introduced
in Chap. 7 under the heading Shrinkage Methods. After an introduction to ridge
regression, the chapter provides a description of the lasso (least absolute shrinkage
and selection operator) and of a Bayesian spike and slab model. The spike and
slab model can be used for both prediction and for discovery of relevant covariates
that have an effect on the records. In a genetic context, these covariates could be
observed genetic markers and the challenge is how to find as many promising mark-
ers among the hundreds of thousands available, while incurring a low proportion
of false positives. This leads to the topic reviewed in Chap. 8: False Discovery
Rate. The subject is first presented from a frequentist perspective as introduced
by Benjamini and Hochberg in their highly acclaimed work, and is also discussed
using empirical Bayesian and fully Bayesian approaches. The latter is implemented
within an McMC environment using the spike and slab model as driving engine.
The complete marginal posterior distribution of the false discovery rate can be
obtained as a by-product of the McMC algorithm. Chapter 9 describes some of
the technical details associated with prediction for binary data. The topics discussed
include logistic regression for the analysis of case-control studies, where the data are
collected in a non-random fashion, penalised logistic regression, lasso and spike and
slab models implemented for the analysis of binary records, area under the curve
(AUC) and prediction of a genetic disease of an individual, given information on
the disease status of its parents. The chapter concludes with an appendix providing
technical details for an approximate analysis of binary traits. The approximation
can be useful as a first step, before launching the full McMC machinery of a more
formal approach. Chapter 10 deals with Bayesian prediction, where many of the
ideas scattered in various parts of the book are brought into focus. The chapter
discusses the sources of uncertainty of predictors from a Bayesian and frequentist
perspective and how they affect accuracy of prediction as measured by the Bayesian
and frequentist expectations of the sample mean squared error of prediction. The
final part of the chapter introduces, via an example, how specific aspects of a
Bayesian model can be tested using posterior predictive simulations, a topic that
combines frequentist and Bayesian ideas. Chapter 11 completes Part II and provides
an overview of selected nonparametric methods. After an introduction of traditional
nonparametric models, such as the binned estimator and kernel smoothing methods,
the chapter concentrates on four more recent approaches: kernel methods using basis
expansions, neural networks, classification and regression trees, and bagging and
random forests.
Part III of the book consists of exercises and their solutions. The exercises
(Chap. 12) are designed to provide the reader with deeper insight of the subject

discussed in the body of the book. A complete set of solutions, many involving
programming, is available in Chap. 13.
The majority of the datasets used in the book are simulated and are intended to illustrate
important features of real-life data. The size of the simulated data is kept within the
limits necessary to obtain solutions in reasonable CPU time, using straightforward
R-code, although the reader may modify size by changing input parameters.
Advanced computational techniques required for the analysis of very large datasets
are not addressed. This subject requires a specialised treatment beyond the scope of
this book.
The book has not had the benefit of having been used as material in repeated
courses by a critical mass of students, who invariably stimulate new ideas, help with
a deeper understanding of old ones and, not least, spot errors in the manuscript and
in the problem sections. Despite these shortcomings, the book is completed and out
of my hands. I hope the critical reader will make me aware of the errors. These
will be corrected and listed on the web at https://github.com/SorensenD/SLGDS.
The GitHub site also contains most of the R-codes used in the book, which can be
downloaded, as well as notes that include comments, clarifications or additions of
themes discussed in the book.

Aarhus, Denmark Daniel Sorensen


May 2023
Acknowledgements

Many friends and colleagues have assisted in a variety of ways. Bernt Guldbrandtsen
(University of Copenhagen) has been a stable helping hand and helping mind.
Bernt has generously shared his deep biological and statistical knowledge with
me on many, many occasions, and provided also endless advice with LaTeX and
MarkDown issues, with programming details, always with good spirits and patience.
I owe much to him. Ole Fredslund Christensen (Aarhus University) read several
chapters and wrote a meticulous list of corrections and suggestions. I am very
grateful to him for this effort. Gustavo de los Campos (Michigan State University)
has shared software codes and tricks and contributed with insight in many parts
of the book, particularly in Prediction and Kernel Methods. I have learned much
during the years of our collaboration. Parts of the book were read by Andres Legarra
(INRA), Miguel Pérez Enciso (University of Barcelona), Bruce Walsh (University
of Arizona), Rasmus Waagepetersen (Aalborg University), Peter Sørensen (Aarhus
University), Kenneth Enevoldsen (Aarhus University), Agustín Blasco (Universidad
Politécnica de Valencia), Jens Ledet Jensen (Aarhus University), Fabio Morgante
(Clemson University), Doug Speed (Aarhus University), Bruce Weir (University
of Washington), Rohan Fernando (retired from Iowa State University) and Daniel
Gianola (retired from the University of Wisconsin-Madison). I received many
helpful comments, suggestions and corrections from them. However, I alone am
responsible for the errors that escaped scrutiny. I would be thankful if I could be
made aware of these errors.
I acknowledge Eva Hiripi, Senior Editor, Statistics Books, Springer, for consis-
tent support during this project.
I am the grateful recipient of many gifts from my wife Pia. One has been essential
for concentrating on my task: happiness.

Contents

1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Sampling Distribution of a Random Variable . . . . . . . . . . . . . . . . . . 3
1.3 The Likelihood and the Maximum Likelihood Estimator . . . . . . . . . . 5
1.4 Incorporating Prior Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Frequentist or Bayesian? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7 Appendix: A Short Overview of Quantitative Genomics. . . . . . . . . . . 32

Part I Fitting Likelihood and Bayesian Models


2 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.1 A Little Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2 Summary of Likelihood Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3 Example: The Likelihood Function of Transformed Data. . . . . . . . . . 59
2.4 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.5 Example: Bivariate Normal Model with Missing Records . . . . . . . . . 63
2.6 Example: Likelihood Inferences Using Selected Records. . . . . . . . . . 66
2.7 Example: The Likelihood Function with Truncated Data . . . . . . . . . . 71
2.8 Example: The Likelihood Function of a Genomic Model . . . . . . . . . . 72
3 Computing the Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1 Newton-Raphson and the Method of Scoring . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 Gradient Descent and Stochastic Gradient Descent . . . . . . . . . . . . . . . . 98
3.3 The EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.1 Example: Estimating the Mean and Variance of a Normal
Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.2 Posterior Predictive Distribution for a New Observation . . . . . . . . . . . 151
4.3 Example: Monte Carlo Inferences of the Joint Posterior
Distribution of Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153


4.4 Approximating a Marginal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155


4.5 Example: The Normal Linear Mixed Model . . . . . . . . . . . . . . . . . . . . . . . . 156
4.6 Example: Inferring a Variance Component from a
Marginal Posterior Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.7 Example: Bayesian Learning—Inheritance of Haemophilia . . . . . . . 162
4.8 Example: Bayesian Learning—Updating Additive
Genetic Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.9 A Brief Account of Bayesian Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.10 An Overview of Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 171
4.11 The Metropolis-Hastings Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.12 The Gibbs Sampling Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
4.13 Output Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.14 Appendix: A Closer Look at the McMC Machinery . . . . . . . . . . . . . . . 194
5 McMC in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.1 Example: Estimation of Gene Frequencies from ABO
Blood Group Phenotypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.2 Example: A Regression Model for Binary Data . . . . . . . . . . . . . . . . . . . . 213
5.3 Example: A Regression Model for Correlated Binary Data . . . . . . . . 220
5.4 Example: A Genomic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.5 Example: A Mixture Model of Two Gaussian Components . . . . . . . 234
5.6 Example: An Application of the EM Algorithm
in a Bayesian Context—Estimation of SNP Effects . . . . . . . . . . . . . . . . 239
5.7 Example: Bayesian Analysis of the Truncated Normal Model. . . . . 244
5.8 A Digression on Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

Part II Prediction
6 Fundamentals of Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.1 Best Predictor and Best Linear Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.2 Estimating the Regression Function in Practice: Least Squares . . . 263
6.3 Overview of Things to Come . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.4 The Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
6.5 Estimation of Validation MSE of Prediction in Practice . . . . . . . . . . . 280
6.6 On Average Training MSE Underestimates Validation MSE . . . . . . 284
6.7 Least Squares Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7 Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
7.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
7.2 The Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.3 An Extension of the Lasso: The Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . 319
7.4 Example: Prediction Using Ridge Regression and Lasso . . . . . . . . . . 319
7.5 A Bayesian Spike and Slab Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
8 Digression on Multiple Testing: False Discovery Rates. . . . . . . . . . . . . . . . . 333
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.2 Preliminaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336

8.3 The Benjamini-Hochberg False Discovery Rate . . . . . . . . . . . . . . . . . . . . 338


8.4 A Bayesian Approach for a Simple Two-Group Mixture
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
8.5 Empirical Bayes Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
8.6 Local False Discovery Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
8.7 Storey’s q-Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
8.8 Fully Bayesian McMC False Discovery Rate . . . . . . . . . . . . . . . . . . . . . . . 350
8.9 Example: A Two-Component Gaussian Mixture . . . . . . . . . . . . . . . . . . . 352
8.10 Example: The Spike and Slab Model with Genetic Markers . . . . . . . 361
9 Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
9.1 Prediction for Binary Observations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
9.2 Mean Squared Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
9.3 Logistic Regression with Non-random Sampling. . . . . . . . . . . . . . . . . . . 375
9.4 Penalised Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
9.5 The Lasso with Binary Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
9.6 A Bayesian Spike and Slab Model for Binary Records . . . . . . . . . . . . 380
9.7 Area Under the Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
9.8 Prediction of Disease Status of Individual Given Disease
Status of relatives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
9.9 Appendix: Approximate Analysis of Binary Traits . . . . . . . . . . . . . . . . . 411
10 Bayesian Prediction and Model Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
10.1 Levels of Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
10.2 Prior and Posterior Predictive Distributions. . . . . . . . . . . . . . . . . . . . . . . . . 419
10.3 Bayesian Expectations of MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
10.4 Example: Bayesian and Frequentist Measures of Uncertainty . . . . . 430
10.5 Model Checking Using Posterior Predictive Distributions . . . . . . . . . 435
11 Nonparametric Methods: A Selected Overview . . . . . . . . . . . . . . . . . . . . . . . . . 445
11.1 Local Kernel Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
11.2 Kernel Methods Using Basis Expansions . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
11.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
11.4 Classification and Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
11.5 Bagging and Random Forests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
11.6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533

Part III Exercises and Solutions


12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
12.1 Likelihood Exercises I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
12.2 Likelihood Exercises II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
12.3 Bayes Exercises I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
12.4 Bayes Exercises II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
12.5 Prediction Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562

13 Solution to Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575


13.1 Likelihood Exercises I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
13.2 Likelihood Exercises II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
13.3 Bayes Exercises I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
13.4 Bayes Exercises II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
13.5 Prediction Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
Author Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
Chapter 1
Overview

1.1 Introduction

Suppose there is a set of data consisting of observations in humans on forced


expiratory volume (FEV, a measure of lung function; lung function is a predictor
of health and a low lung function is a risk factor for mortality), or on the presence or
absence of heart disease and that there are questions that could be answered using
these data. For example, a statistical geneticist may wish to know:
1. Is there a genetic component contributing to the total variance of these traits?
A positive answer suggests that genetic factors are at play. The next step would
be to investigate the following:
2. Is the genetic component of the traits driven by a few genes located on
a particular chromosome, or are there many genes scattered across many
chromosomes? How many genes are involved and is this a scientifically sensible
question?
3. Are the genes detected protein-coding genes, or are there also noncoding genes
involved in gene regulation?
4. How is the strength of the signals captured in a statistical analysis related to the
two types of genes? What fraction of the total genetic variation is allocated to
both types of genes?
5. What are the frequencies of the genes in the sample? Are the frequencies
associated with the magnitude of their effects on the traits?
6. What is the mode of action of the genes?
7. What proportion of the genetic variance estimated in 1 can be explained by the
discovered genes?
8. Given the information on the set of genes carried by an individual, will a
genetic score constructed before observing the trait help with early diagnosis
and prevention?
9. How should the predictive ability of the score be measured?


10. Are there other non-genetic factors that affect the traits, such as smoking
behaviour, alcohol consumption, blood pressure measurements, body mass
index and level of physical exercise?
11. Could the predictive ability of the genetic score be improved by incorporation
of these non-genetic sources of information, either additively or considering
interactions? What is the relative contribution from the different sources of
information?
The first question has been the focus of quantitative genetics for many years,
long before the so-called genomic revolution, that is, before breakthroughs
in molecular biology made the sequencing of whole genomes technically and
economically possible, resulting in hundreds of thousands or millions of genetic markers
(single nucleotide polymorphisms, SNPs) for each individual in the data set. Until
the end of the twentieth century, before dense genetic marker data were available,
genetic variation of a given trait was inferred using resemblance between relatives.
This requires equating the expected proportion of genotypes shared identical
by descent, given a pedigree, with the observed phenotypic correlation between
relatives. The fitted models also retrieve “estimates of random effects”, the predicted
genetic values that act as genetic scores and are used in selection programs of farm
animals and plants.
Answers to questions 2–7 would provide insight into genetic architecture and
thereby into the roots of many complex traits and diseases. This has important
practical implications for drug therapies targeted to particular metabolic pathways,
for personalised medicine and for improved prediction. These questions could not
be sensibly addressed before dense marker data became available (perhaps with
the exception provided by complex segregation analysis that allowed searching for
single genes).
Shortly after a timid start where use of low-density genetic marker information
made its appearance, the first decade of the twenty-first century saw the construction
of large biomedical databases that could be accessed for research purposes where
health information was collected. One such database was the British 1958 cohort
study, including medical records from approximately 3000 individuals genotyped
for one million SNPs. These data provided for the first time the opportunity to begin
addressing questions 2–7. However, a problem had to be faced: how to fit and
validate a model with one million unknowns to a few thousand records and how to
find a few promising genetic markers from the million available avoiding a large
proportion of false positives? This resulted in a burst of activity in the fields of
computer science and statistics, leading to development of a methodology designed
to meet the challenges posed by Big Data.
In recent years, the amount of information in modern data sets has grown and
become formidable, and the challenges have not diminished. One example is the
UK Biobank, which provides a wealth of health information from half a million UK
participants. The database is regularly updated, and a team of scientists recently
reported that the complete exome sequence was completed (about 2% of the genome,
involved in coding for proteins and considered to be important for identifying
disease-causing or rare genetic variants). The study involved more than 150,000
individuals genotyped for more than 500 million SNPs (Halldorsson et al 2022).
These data are paired with detailed medical information and constitute an
unparalleled resource for linking human genetic variation to human biology and
disease.
An important task for the statistical geneticist is to adapt, develop and implement
models that can extract information from these large-scale data and to contribute to
finding answers to the 11 questions posed above. This is an exercise on inference
(such as estimation of genetic variation), on gene detection (among the millions
of genetic markers that may be included in a probability model, how to screen
the “relevant” ones for further study?), on prediction (how does the quality of
prediction of future records, for example, outcome of a disease, improve with this
new knowledge about the trait?) and on how to fit the probability models. There are
several areas of expertise that must be developed in order to fulfil this data-driven
research task. An initial step is to understand the methodology that underlies the
probability models and to learn the modern computer-intensive methods required
for fitting these models. The objective of this book is to guide the reader to take this
first step.
This opening chapter gives an overview of the book’s content, omitting many
technicalities that are revealed in later chapters, and is intended to give a flavour
of the way ahead. The first part is about methodology and introduces, by means of
an example, the concepts of probability distribution, likelihood and the maximum
likelihood estimator. This is followed by a brief description of Bayesian methods
indicating how prior knowledge can be incorporated in a probability model and
how it can affect inferences. The second part of the chapter presents models
for prediction and for detection of genes using parametric and nonparametric
approaches. There is an appendix that offers a brief tour of the quantitative
genetic/genomic model. The goal is to introduce the jargon and the basic quanti-
tative genetic/genomic concepts used in the book.

1.2 The Sampling Distribution of a Random Variable

A useful starting point is to establish the distinction between a probability
distribution and a likelihood function. For example, assume a random variable X
that has a Bernoulli probability distribution. This random variable can take 1 or 0
as possible values (more generally, it can have two modalities) with probabilities
θ and 1 − θ, respectively. The mean of the distribution is

E(X | θ) = 0 × Pr(X = 0 | θ) + 1 × Pr(X = 1 | θ) = θ

and the variance is

Var(X | θ) = E(X² | θ) − [E(X | θ)]² = θ − θ² = θ(1 − θ).

A binomial distribution arises from the sum of n mutually independent Bernoulli
random variables, all having the same probability θ. Therefore, the expected value
and the variance of a binomially distributed random variable are nθ and nθ(1 − θ),
respectively.
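As a quick numerical check (my addition, not part of the original text), these moments can be verified by simulation in R; the values n = 20 and θ = 0.1 below are arbitrary:

# simulate binomial draws and compare empirical moments with n*theta and n*theta*(1-theta)
set.seed(1)
n <- 20
theta <- 0.1
x <- rbinom(100000, size = n, prob = theta)
mean(x)   # close to n * theta = 2
var(x)    # close to n * theta * (1 - theta) = 1.8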
With this background, imagine that a sample of size n of unrelated haploid
individuals is obtained from some population with the objective of estimating the
allele frequency at a biallelic locus. The sample contains x copies of allele A and
n − x copies of allele a. The n data points are draws assumed to be identically and
independently distributed, and in each draw, the probability of observing an A allele
is θ. Since the random variable can take two modalities (A or a), the number of
copies drawn, X, is binomially distributed with parameters n and θ and probability
mass function

Pr(X = x | n, θ) = (n choose x) θ^x (1 − θ)^(n−x),   x = 0, 1, ..., n,  0 < θ < 1.    (1.1)

For fixed values of n and θ, one can plot (1.1) as a function of x = 0, 1, ..., n,
and this defines the probability distribution of X. Figure 1.1 shows two different
binomial distributions. Importantly, in (1.1) the parameters n and θ are fixed and
the random variable is X, the number of A alleles drawn (here I distinguish between
the random variable X and its realised value, x; this distinction is not necessarily
followed throughout the book).

Fig. 1.1 Left: binomial probability distribution with parameters n = 20, θ = 0.1; right: binomial probability distribution with parameters n = 100, θ = 0.1
The probability distribution in the right panel of Fig. 1.1 is more symmetrical
than in the left panel. This is due to the different sample sizes n. As sample size
increases further, X will approach its limiting distribution which is the normal
distribution by virtue of the central limit theorem.
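A minimal R sketch (my addition) that reproduces the two probability distributions of Fig. 1.1 using dbinom; the plotting choices are arbitrary:

# binomial probability mass functions as in Fig. 1.1
par(mfrow = c(1, 2))
x1 <- 0:9
barplot(dbinom(x1, size = 20, prob = 0.1), names.arg = x1,
        xlab = "X", ylab = "Probability")    # left panel: n = 20, theta = 0.1
x2 <- 0:24
barplot(dbinom(x2, size = 100, prob = 0.1), names.arg = x2,
        xlab = "X", ylab = "Probability")    # right panel: n = 100, theta = 0.1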

1.3 The Likelihood and the Maximum Likelihood Estimator

Consider now viewing (1.1) in a different manner, whereby x and n are fixed and
θ varies. To be specific, assume that the sample size is n = 27 and the number
of copies of allele A in the sample is x = 11. One can plot the probability of
obtaining x = 11 copies of A in a sample of size n = 27, for all permissible values
of θ, as in Fig. 1.2. For example, for θ = 0.1, Pr(X = 11 | n = 27, θ = 0.1) =
0.242 × 10^(−4), and for θ = 0.6, Pr(X = 11 | n = 27, θ = 0.6) = 0.203 × 10^(−1). This
plot is the likelihood function for θ, L(θ | x, n), and the value of θ that maximises
this function is known as the maximum likelihood estimate of θ (I will use MLE
short for maximum likelihood estimator or maximum likelihood estimate and ML
for maximum likelihood).

Fig. 1.2 Binomial model: likelihood function for θ, given data n = 27, x = 11

One way of finding the maximum likelihood estimate of θ is to differentiate
(1.1) and find the maximiser. It is equivalent, but often easier, to maximise the
logarithm of the likelihood function, the loglikelihood, denoted as ℓ(θ | x, n):

∂ℓ(θ | x, n)/∂θ = ∂/∂θ [log (n choose x) + x log(θ) + (n − x) log(1 − θ)].

Carrying out the differentiation and setting the result equal to zero shows that the
MLE of θ must satisfy

x/θ − (n − x)/(1 − θ) = 0.

Solving for θ yields the MLE

θ̂ = x/n,    (1.2)

which, in the case of the example, with x = 11 and n = 27, gives θ̂ = 0.41.
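The likelihood of Fig. 1.2 and its maximiser can be evaluated numerically in R; a sketch (my addition), using a grid of θ values:

# likelihood for theta given x = 11, n = 27, and its maximiser
x <- 11; n <- 27
theta <- seq(0.001, 0.999, length.out = 1000)
lik <- dbinom(x, size = n, prob = theta)
plot(theta, lik, type = "l", xlab = "theta", ylab = "Likelihood")
theta[which.max(lik)]   # numerical maximiser, close to the analytical MLE
x / n                   # closed-form MLE (1.2), about 0.41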

The Sampling Variance of the Maximum Likelihood Estimator

Usually, one needs to quantify the degree of uncertainty associated with an estimate.
In classical likelihood, the uncertainty is described by the sampling distribution of
the MLE. In the case of the example, the sampling distribution of θ̂ is the probability
distribution of this estimator obtained by drawing repeated binomial samples of
fixed size n, with the probability parameter fixed at its MLE, θ = θ̂. The MLE is
computed in each sample and the sampling distribution of the MLE is characterised
by these estimates.
In this binomial example, the sampling distribution of θ̂ is known exactly; it is
proportional to a binomial distribution (since X is binomial and n is fixed). The
small-sample variance of the maximum likelihood estimator is

Var(θ̂) = Var(X/n) = θ(1 − θ)/n.    (1.3)

The parameter θ is typically not known and is replaced by the MLE θ̂. Then

V̂ar(θ̂) = θ̂(1 − θ̂)/n.
In many cases, the MLE does not have a closed form and the small-sample
variance is not known. One can then appeal to large-sample properties of the MLE;
one of these is that, asymptotically, the MLE is normally distributed, with mean
equal to the parameter and variance given by minus the inverse of the second
derivative of the loglikelihood evaluated at θ = θ̂.

Fig. 1.3 Left: histogram of the Monte Carlo distribution of the MLE for the binomial model, with n = 27, θ = 0.41. Right: histogram of the Monte Carlo distribution of the MLE for the binomial model, with n = 100, θ = 0.41. The overlaid normal curves represent the asymptotic approximation of the distribution of the MLE

The second derivative of the loglikelihood is

∂²ℓ(θ | x, n)/(∂θ)² = ∂²/(∂θ)² [log (n choose x) + x log(θ) + (n − x) log(1 − θ)]
                    = −x/θ² − (n − x)/(1 − θ)².

In this expression, substituting θ with the MLE θ̂ and taking minus the reciprocal
yields

{−∂²ℓ(θ | x, n)/(∂θ)² |θ=θ̂}⁻¹ = V̂ar(θ̂) = θ̂(1 − θ̂)/n ≈ 0.009.    (1.4)

In this simple example, the asymptotic variance agrees with the small-sample
variance. An approximate 95% confidence interval for θ based on asymptotic
theory is

0.41 ± 1.96 × 0.095 = (0.22, 0.60).    (1.5)

This means that there is a 95% probability that this interval contains the true
parameter θ. The “probability” is interpreted with respect to a set of hypothetical
repetitions of the entire data collection and analysis procedure. These repetitions
consist of many random samples of data drawn under the same conditions, and a
confidence interval is computed for each sample. The random variable is the
confidence interval computed for each sample; in 95 intervals out of 100 (for a 95%
confidence interval), the interval will contain the unobserved θ.
Figure 1.3 (left) shows the result of simulating 100,000 times from a binomial
distribution with n = 27 and θ = 0.41, computing the MLE in each replicate and
plotting the distribution as a histogram. This represents the (small-sample) Monte
Carlo sampling distribution of the MLE. Overlaid is the asymptotic distribution of
the MLE, which is normal with mean 0.41 and variance given by (1.4), equal to 0.009.
The right panel of Fig. 1.3 displays the result of a similar exercise with n = 100 and
θ = 0.41. The fit of the asymptotic approximation is better with the larger sample
size.
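A small simulation along the lines of Fig. 1.3 (my sketch; the left-panel parameters are used):

# Monte Carlo sampling distribution of the MLE, as in Fig. 1.3 (left)
set.seed(1)
n <- 27
theta <- 0.41
mle <- rbinom(100000, size = n, prob = theta) / n   # MLE x/n in each replicate
hist(mle, freq = FALSE, xlab = "MLE of theta", main = "")
curve(dnorm(x, mean = theta, sd = sqrt(theta * (1 - theta) / n)),
      add = TRUE)   # overlay the asymptotic normal approximation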
A glance at a standard calculus book reveals that the curvature of a function f at
a point θ is given by

c(θ) = f″(θ) / [1 + f′(θ)²]^(3/2).

In the present case, the function f is the loglikelihood ℓ, whose first derivative
evaluated at θ = θ̂ is equal to zero. The curvature of the loglikelihood at θ = θ̂
is therefore

c(θ̂) = ℓ″(θ̂).

(I use the standard notation ℓ″(θ̂) for ∂²ℓ(θ | x, n)/(∂θ)² evaluated at θ = θ̂.)
Note As the loglikelihood increases or decreases, so does the likelihood; therefore,
the value of the parameter that maximises one also maximises the other. Working
with the loglikelihood is to be preferred to working with the likelihood function
because it is easier to differentiate a sum than a product. The curvature of the
loglikelihood at the MLE is related to the sample variance of the MLE. This last
point is illustrated in Fig. 1.4. As n increases from 27 to 100, the likelihood function
becomes sharper and more concentrated about the MLE.
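To make the link between curvature and the sampling variance concrete, here is a small numerical sketch (my addition) for the example with x = 11 and n = 27:

# observed information and asymptotic 95% confidence interval, cf. (1.4)-(1.5)
x <- 11; n <- 27
theta_hat <- x / n
d2 <- -x / theta_hat^2 - (n - x) / (1 - theta_hat)^2   # second derivative at the MLE
var_hat <- -1 / d2                                     # minus the reciprocal, about 0.009
theta_hat + c(-1, 1) * 1.96 * sqrt(var_hat)            # about (0.22, 0.60)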

1.4 Incorporating Prior Information

Fig. 1.4 Circled line: likelihood function for θ, given n = 27, x = 11. Full line: likelihood function for θ, given n = 100, x = 41

Imagine that there is prior information about the frequency θ of allele A from
comparable populations. Bayesian methods provide a natural way of incorporating
such prior information into the model. This requires eliciting a prior distribution
for θ that captures what is known about θ before obtaining the data sample.
This prior distribution is combined with the likelihood (which, given the model,
contains all the information arising from the data) to form the posterior distribution
that is the basis for the Bayesian inference. Specifically, using Bayes theorem:

Posterior ∝ Prior × Likelihood.    (1.6)

If the prior density of θ is labelled g(θ), (1.6) becomes

p(θ | x, n) ∝ g(θ) L(θ | x, n),    (1.7)

indicating that the posterior density is proportional to the prior density times the
likelihood. Probability statements about θ require scaling (1.7). This involves
dividing the right-hand side of (1.7) by

Σ_i g(θ_i) L(θ_i | x, n),    (1.8)

if θ is discrete (in which case g is a probability mass function), or by

∫ g(θ) L(θ | x, n) dθ,    (1.9)

if it is continuous (in which case g is a probability density function).

Using a Discrete Prior

This example is adapted from Albert (2009). Continuing with the binomial model,
a simple approach to incorporating prior information on θ is to write down possible
values and to assign weights to these values. A list of possible values of θ could be

0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95.    (1.10)

Based on previous knowledge, one is prepared to assign the weights

1.0, 5.2, 8.0, 7.2, 4.6, 2.1, 0.7, 0.1, 0.0, 0.0,

which are converted to probabilities by dividing each weight by the sum; this gives
the prior probability distribution of θ

0.04, 0.18, 0.28, 0.25, 0.16, 0.07, 0.02, 0.00, 0.00, 0.00.    (1.11)

The likelihood for θ is proportional to (1.1). The combinatorial term does not
contain information about θ, so one can write

L(θ | x = 11, n = 27) ∝ θ^11 (1 − θ)^(27−11).

Evaluating this expression for all the possible values of θ in (1.10) yields a list of
ten numbers (too small to be written down here). Label these ten numbers

L(0.05), L(0.15), ..., L(0.95).    (1.12)

To obtain the posterior (1.7), the terms in (1.11) are multiplied by the corresponding
terms in (1.12). For example,

p(θ = 0.05 | x = 11, n = 27) ∝ 0.04 × L(0.05),
p(θ = 0.15 | x = 11, n = 27) ∝ 0.18 × L(0.15),

and so on with the remaining eight terms. After scaling with the sum

0.04 × L(0.05) + 0.18 × L(0.15) + ⋯ + 0.00 × L(0.95),

posterior probabilities can be assigned to the ten possible values of θ. These
posterior probabilities are (rounded to two decimal places)

0.00, 0.00, 0.13, 0.48, 0.33, 0.06, 0.00, 0.00, 0.00, 0.00.

Based on these posterior probabilities, the posterior mean of θ is 0.38, and the
probability that θ falls in the set {0.25, 0.35, 0.45} is 0.94.
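The discrete-prior calculation can be carried out in a few lines of R; this sketch (my addition) reproduces the posterior probabilities and summaries quoted above:

# posterior over a discrete grid of theta values, binomial likelihood with x = 11, n = 27
theta  <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95)  # values in (1.10)
weight <- c(1.0, 5.2, 8.0, 7.2, 4.6, 2.1, 0.7, 0.1, 0.0, 0.0)
prior  <- weight / sum(weight)                 # prior probabilities (1.11)
lik    <- theta^11 * (1 - theta)^(27 - 11)     # likelihood evaluated at each theta
post   <- prior * lik / sum(prior * lik)       # scaled posterior
round(post, 2)
sum(theta * post)                              # posterior mean, about 0.38
sum(post[theta %in% c(0.25, 0.35, 0.45)])      # about 0.94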

Using a Beta Prior: The Beta-Binomial Model

Another possible prior is to assign a beta distribution to θ, with the appropriate
parameters to reflect prior information. This is a continuous distribution with
support between 0 and 1 and has two parameters denoted a and b that determine
the shape. When a = b, the distribution is symmetric.
One way of using a beta distribution that matches the prior probabilities (1.11) is
as follows. Notice that the sum of the first three probabilities in (1.11) represents the
probability that θ is smaller than or equal to 0.25. This probability is equal to 0.50.
Similarly, the sum of the first five probabilities is the probability that θ is smaller
than or equal to 0.45. This probability is equal to 0.91. The values of θ equal to
0.25 and 0.45 are two quantiles. Let F(0.25; a, b) and F(0.45; a, b) represent the
cumulative distribution functions (cdf) of the beta distribution for θ = 0.25 and
for θ = 0.45, respectively (the cdf is F(x; a, b) = Pr(X ≤ x; a, b)). Then the
parameters a and b of the beta distribution that match the prior probabilities (1.11)
can be found by minimising the function

(F(0.25; a, b) − 0.5)² + (F(0.45; a, b) − 0.91)²

with respect to a and b. This can be achieved using the function optim in R, as
indicated in the following code:
mod <- function(par){
  # squared distance between the beta cdf at the two quantiles
  # and the target prior probabilities 0.50 and 0.91
  a <- par[1]
  b <- par[2]
  fct <- (pbeta(0.25,a,b)-0.5)^2 + (pbeta(0.45,a,b)-0.91)^2
  return(fct)
}
res <- optim(par=c(3,3),mod)   # minimise, starting from a = b = 3
res$par

## [1] 2.89705 8.04717

The function returns a = 2.90, b = 8.05. As a check, one can compute the
cumulative distribution functions:

pbeta(0.45,2.9,8.05)

## [1] 0.9099298

pbeta(0.25,2.9,8.05)

## [1] 0.4995856

Figure 1.5 displays the discrete prior defined by (1.10) and (1.11) and the prior
based on Be(2.90, 8.05).

Fig. 1.5 Left: discrete prior distribution defined by (1.10) and (1.11). Right: beta prior Be(2.90, 8.05)

As mentioned above, the likelihood is proportional to (1.1); that is,

L(θ | x, n) ∝ θ^x (1 − θ)^(n−x).    (1.13)

Seen as a function of θ, this is the kernel of a beta distribution with a = x + 1
and b = n − x + 1. The pdf (probability density function, sometimes referred to as
density function) of the beta distribution is

p(θ) = [Γ(a + b) / (Γ(a) Γ(b))] θ^(a−1) (1 − θ)^(b−1),   θ ∈ [0, 1], a, b > 0,

where Γ is the gamma function. The posterior distribution is obtained by combining
this likelihood with the prior Be(2.90, 8.05). This results in a posterior distribution
that has the beta density with parameters x + 1 + 2.90 − 1 and n − x + 1 + 8.05 − 1.
The posterior distribution has the form

p(θ | x = 11, n = 27) = Be(13.90, 24.05).    (1.14)

Fig. 1.6 The prior density Be(2.90, 8.05), the likelihood Be(12, 17) and the posterior density Be(13.90, 24.05) of the probability θ

Figure 1.6 displays plots of the prior, the likelihood and the posterior for the
example. The posterior distribution is sharper than the prior distribution, and its
probability mass is concentrated between that of the prior distribution and the
likelihood. In Bayesian inference, the posterior distribution provides the necessary
information for drawing conclusions about θ. For example, the mean, mode and
median of (1.14) are 0.37, 0.36 and 0.36, respectively. The 95% posterior interval
for θ is (0.24, 0.50). The inference using the Bayesian approach is sharper than
that based on the likelihood (1.5). Further, the frequentist confidence interval and
the Bayesian posterior interval have different interpretations. In the latter, the
interval is fixed and the associated probability is the probability that the true
parameter falls in the interval.
Note If a prior for θ is Be(a, b), then p(θ) ∝ θ^(a−1) (1 − θ)^(b−1). The
likelihood of the binomial model is proportional to θ^x (1 − θ)^(n−x), which is
the kernel of Be(x + 1, n − x + 1). The posterior is then proportional to
θ^(a+x−1) (1 − θ)^(b+n−x−1), which is the kernel of Be(a + x, b + n − x).
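The conjugate update in the Note, and the posterior summaries quoted above, can be checked with standard R functions; a sketch (my addition):

# beta-binomial conjugate update: prior Be(2.90, 8.05), data x = 11, n = 27
a <- 2.90; b <- 8.05
x <- 11; n <- 27
a_post <- a + x        # 13.90
b_post <- b + n - x    # 24.05
a_post / (a_post + b_post)               # posterior mean, about 0.37
qbeta(0.5, a_post, b_post)               # posterior median, about 0.36
qbeta(c(0.025, 0.975), a_post, b_post)   # 95% posterior interval, about (0.24, 0.50)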

Prior Influence on Inferences

A Bayesian analysis is seldom complete without investigating how prior information


affects the conclusions. In this example, one can compare the inference about .θ
using either the discrete or the beta prior. With the former, the mean is .0.38 and the
probability that .θ falls in the set .{0.25, 0.35, 0.45} is .0.94. The numerical values
arrived at with the beta prior are quite similar.
In a situation where no prior information is available about .θ , one could resort
either to maximum likelihood or to a Bayesian approach with a non-informative
prior. The development of non-informative (or reference) priors can become quite
technical, especially in complex model scenarios; a pragmatic approach is often
chosen along the following lines. In the absence of information about .θ , the
investigator may consider three possible beta distributions as displayed in Fig. 1.7
(taken from Carlin and Louis 1996). The so-called Jeffreys prior is transformation
invariant; the Be(1, 1) is a special case that retrieves a uniform distribution
(assigning equal probabilities to all values of θ); the Be(2, 2) is mildly informative,
assigning larger probabilities to intermediate values of θ. The combination of these
priors with the likelihood θ^11 (1 − θ)^16 gives rise to the three posterior distributions
shown in Fig. 1.8.

Fig. 1.7 The densities Be(0.5, 0.5), Be(1, 1) and Be(2, 2) to model prior information about θ
In this particular example, three very different prior distributions give rise to
very similar posterior distributions. This is often the case when the likelihood is
very informative relative to the prior distribution. Using the uniform prior .Be(1, 1),
the posterior is .Be(12, 17) with mean value (the mode and median are almost the
same) and .95% posterior interval equal to .0.41 and .(0.24, 0.59), respectively. The
posterior interval is a little wider than that based on the sharper prior .Be (2.90, 8.05)
and almost identical (numerically) to the one based on the normal approximation to
the maximum likelihood estimator. The posterior probability that .θ is less than or
equal to .0.2 is

∫_{θ=0}^{θ=1} I(θ ≤ 0.2) p(θ|y, n) dθ = 0.00496.    (1.15)
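Since, with the uniform prior, the posterior is Be(12, 17), the probability in (1.15) is simply the beta cdf evaluated at 0.2 and can be checked in R:

pbeta(0.2, 12, 17)   # approx. 0.00496, in agreement with (1.15)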
Fig. 1.8 The three posterior distributions corresponding to the three priors of Fig. 1.7 when
n = 27, x = 11

Simulating from the Posterior Distribution

Inferences drawn in the examples above were exact. This is possible when the
analytical form of the posterior distribution is known and features from it (such as
probability intervals and moments) can be calculated. In principle, features from any
posterior distribution can also be obtained using samples drawn from it, making use
of standard theorems from the time series literature. For example, any function of
the random variable X, .h (X) with finite expectation .E(h (X)) can be approximated
by
E(h(X)) ≈ (1/N) Σ_{i=1}^{N} h(x_i),    (1.16)

where N is the sample size. Using R, a sample (x_i), i = 1, …, 100,000, of size 100,000 from
the posterior Be(12, 17) results in a sample mean

E(X) ≈ (1/100,000) Σ_{i=1}^{100,000} x_i = 0.41.

A 95% posterior interval for θ can be estimated as the 2.5th to 97.5th percentiles of
the empirical distribution of the draws x_i using the R function quantile. If the
vector of draws is denoted dat, the R code is:

set.seed(7117)
dat <- rbeta(100000,12,17)
quantile(dat,c(0.025,0.975))

## 2.5% 97.5%
## 0.2454898 0.5942647

Finally, using the simulated values, a Monte Carlo estimate of the posterior
probability that .θ is less than or equal to .0.2 is obtained using

P̂r(θ ≤ 0.2) = (1/100,000) Σ_{i=1}^{100,000} I(x_i ≤ 0.2) = 0.00499,

which is a Monte Carlo estimator of (1.15). These figures are in good agreement
with the exact results.
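Continuing with the draws stored in dat above, this Monte Carlo estimate is a one-liner, since the mean of a logical vector in R is the proportion of TRUE values:

mean(dat <= 0.2)     # Monte Carlo estimate of Pr(theta <= 0.2), approx. 0.005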
In this example, it was straightforward to sample directly from the posterior
distribution because the normalising constant (1.9) is known, and therefore the form
of the posterior is fully specified. Often the normalising constant cannot be obtained
in closed form, particularly when .θ contains many elements. Chapter 4 discusses
how Monte Carlo draws from the approximate posterior distribution can still be
obtained using Markov chain Monte Carlo (McMC) methods.
An important issue with inferences based on Monte Carlo samples from posterior
distributions is the accuracy of posterior summaries. The latter are subject to
sampling uncertainty that depends on the size of the Monte Carlo sample and on
the degree of autocorrelation of the samples. Methods to quantify this uncertainty
are reviewed in Chap. 4.

Estimating Moments Using (Correlated) Samples from Posterior


Distributions

Result (1.16) is extremely useful and is applied routinely in an McMC environment


to estimate features from posterior distributions. Typically, the elements that
constitute the sample are correlated. However, despite this correlation structure,
consistent estimators of features of posterior distributions can be obtained. For
example, the sample mean

μ̂ = (1/N) Σ_{i=1}^{N} x_i,    (1.17)

the lag-k sample autocovariance

γ̂(k) = (1/N) Σ_{i=1}^{N−k} (x_i − μ̂)(x_{i+k} − μ̂)    (1.18)

and the lag-k sample autocorrelation

ρ̂(k) = γ̂(k)/γ̂(0)    (1.19)

are consistent estimators of the respective population parameters. In (1.19), γ̂(0) is
the sample variance.
As an example, consider generating draws from the lag-1 autoregressive model
 
x_t = ρ x_{t−1} + e_t,   |ρ| < 1,   e_t ∼ N(0, σ² = 1),   t = 1, …, N,    (1.20)

where the e_t are iid (independently and identically distributed). Using R, I
simulated N = 1000, 10,000 and 100,000 observations from (1.20) using ρ = 0.8
and σ² = 1, with the initial condition x_1 ∼ N(0, 1). This generates a strongly
autocorrelated structure among the draws. The marginal mean and variance of this
process are

E(x_t) = 0,    Var(x_t) = σ²/(1 − ρ²).

The estimates of the mean (.0.0), variance (.2.778) and correlation (.0.8) with samples
of size .N = 1000, .N = 10,000 and .N = 100,000 are .(−0.060, 2.731, 0.793),
.(−0.046, 2.749, 0.797) and .(0.018, 2.774, 0.800), respectively. Despite the rather

strong degree of autocorrelation, the estimates are quite acceptable and get closer to
the true values as sample size increases.
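The simulation code itself is not listed in the text; a minimal sketch along the following lines (object names are my own) generates draws from (1.20) and computes the three estimates for one sample size:

set.seed(1)
N <- 100000; rho <- 0.8; sigma <- 1
x <- numeric(N)
x[1] <- rnorm(1)                              # initial condition x1 ~ N(0,1)
for (t in 2:N) x[t] <- rho*x[t-1] + rnorm(1, 0, sigma)
mean(x)                                       # estimate of E(x_t) = 0
var(x)                                        # estimate of sigma^2/(1 - rho^2) = 2.778
acf(x, lag.max = 1, plot = FALSE)$acf[2]      # lag-1 autocorrelation, close to 0.8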

1.5 Frequentist or Bayesian?

There has been much heated debate between frequentists and Bayesians about
the advantages and shortcomings of both methods of inference. One can easily
construct examples where one of the methods gives silly answers and the other
performs fairly well. One such example is the following. Imagine that in the
situation discussed above, rather than obtaining .x = 11 copies of allele A in a
sample of size .n = 27, one obtained .x = 0. This outcome is not unlikely when
.θ is small; for example, the probability of obtaining .x = 0 when .n = 27 and
θ = 0.04 is Pr(x = 0|n = 27, θ = 0.04) = 0.33. In this situation, the maximum
likelihood (ML) estimate (1.2) is 0 and its variance is also 0, which is clearly a silly
result (classical maximum likelihood is problematic when the estimate lies on the
boundary of the parameter space).

Fig. 1.9 The three posterior distributions corresponding to the three priors of Fig. 1.7 when n =
27, x = 0
How does the Bayesian approach behave in such a situation? The three posterior
distributions corresponding to the three priors of Fig. 1.7 are shown in Fig. 1.9.
The Jeffreys prior and the uniform prior lead to posterior distributions Be(0.5, 27.5)
and Be(1, 28), respectively (the latter has the same mathematical form as the likelihood,
proportional to (1 − θ)^27); these have modal values of 0. The weakly informative prior
Be(2, 2) yields a posterior of the form Be(2, 29), which has a mode at θ ≈ 0.034. The
posterior means of these three distributions are 0.018, 0.034 and 0.065, respectively.
The 95% posterior intervals are

(0.18 × 10^−4, 0.88 × 10^−1),
(0.90 × 10^−3, 0.12),
(0.81 × 10^−2, 0.17),

respectively. The posterior probabilities that θ is less than or equal to 0.05 for
Be(0.5, 27.5), Be(1, 28) and Be(2, 29) are 0.91, 0.76 and 0.45, respectively.
In this extreme situation, prior information plays an important role (certainly,
compared to the case n = 27, x = 11, displayed in Fig. 1.8).
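These summaries can be reproduced with pbeta and qbeta applied to the three posteriors; the sketch below is not from the book, but follows directly from the distributions given above:

post <- list(c(0.5, 27.5), c(1, 28), c(2, 29))        # the three posterior Be(a, b)
t(sapply(post, function(p) c(mean  = p[1]/(p[1] + p[2]),
                             lower = qbeta(0.025, p[1], p[2]),
                             upper = qbeta(0.975, p[1], p[2]),
                             P.le.0.05 = pbeta(0.05, p[1], p[2]))))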
Fig. 1.10 A Be(2, 25) prior distribution and a Be(2, 52) posterior distribution when n = 27,
x = 0

One may consider other priors from the beta family that put strong probability
mass in the neighbourhood of zero. One possibility is to use a .Be (2, 25) that has a
mode at .θ = 0.04. With .n = 27 and .x = 0, this results in a posterior .Be (2, 52).
The prior and posterior distributions are plotted in Fig. 1.10. The mean and the mode
of the posterior distribution are .0.037 and .0.019, respectively. The .95% posterior
interval is now .(0.005, 0.101) and .Pr (θ ≤ 0.05|n = 27, x = 0) = 0.75.
The beta prior does not assign probability mass to .θ = 0, and this rules out
the possibility that .θ = 0 in its posterior distribution. One way of including 0 as a
possible value for .θ is to use a two-component mixture prior, where one component
is a point mass at zero and the other is a beta prior. Mixture distributions are
discussed in Chaps. 3, 7 and 9.
There is flexibility associated with the Bayesian approach and a carefully chosen
prior distribution will lead to stable inferences about .θ . The cost is a result which
is partly affected by prior input. With a small amount of data and when parameters
lie on the boundary of the parameter space, there is little else to choose from. In
such a situation, the most important role of prior distributions may well be to
obtain inferences about .θ that are stable and that provide a fair picture of posterior
uncertainty, conditional on the model.
Frequentist and Bayesian
There are situations where instead of choosing between frequentist or Bayesian,
one could use frequentist and Bayesian tools in a meaningful way. This is the case
with model checking where one is interested in studying the ability of a model to
account for particular features of the data or to give reasonable predictions of future

observations. Key literature is Rubin (1984), Gelman et al (1995) and Gelman et al


(1996). Suppose that the data vector y (of length n) is a realisation from the sampling
model .p (y|θM , M), where .θM is a vector of parameters and M represents the
assumed model. If this assumption is adequate, then one would expect that a new
realisation from .p (·|θM , M) should result in a vector .yrep say, that resembles y.
Instead of working in n dimensions, one can construct a scalar function T of the
data and .θM , .T (y, θM ), designed to study a particular feature of the data that is
scientifically relevant. One can then compare the observed value of .T (y, θM ) with
its sampling distribution under .p (·|θM , M). An observed value that falls in the
extreme tails of the sampling distribution indicates a potential failing of the model
to account for T. Equivalently, one can study whether zero is an atypical value in
the distribution of the difference T(y, θ_M) − T(y_rep, θ_M).
Parameter .θM is typically unknown. The frequentist proposition is to replace .θM
by some point estimator .θ̂M and then proceed as above treating .θM as known and
equal to θ̂_M.
A Bayesian, rather than generating data y_rep from p(y_rep|θ̂_M, M), does so from
p(y_rep|y, M), the density of the posterior predictive distribution, given by

p(y_rep|y, M) = ∫ p(y_rep|θ_M, y, M) p(θ_M|y, M) dθ_M,    (1.21)
where often p(y_rep|θ_M, y, M) = p(y_rep|θ_M, M). One then observes whether zero
is an extreme value in the posterior predictive distribution of T(y, θ_M) − T(y_rep, θ_M),
where θ_M is generated from the posterior [θ_M|y, M] and, given θ_M, y_rep is generated
from [y_rep|θ_M, M]. This “Bayesian frequentist”, like the frequentist, accounts
for the uncertainty in y_rep due to the sampling process from p(y_rep|θ_M, y, M).
However, unlike the frequentist, account is also taken of the uncertainty about
θ_M described by its posterior distribution [θ_M|y, M]. Model checking applied in

this manner, although embedded in the Bayesian paradigm, is frequentist in spirit


because it decides whether the observed data look reasonable under the posterior
predictive distribution based on repetitions of data that could have been generated
by the model. All this is typically carried out using McMC methods that provide
great flexibility to question models.
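As a hedged sketch of this idea (not code from the book), a posterior predictive check for the binomial example of this chapter can be built by drawing θ from its posterior, drawing a replicated count given each draw, and locating the observed count in the resulting distribution; here the discrepancy is simply T(y) = y and the uniform prior (posterior Be(12, 17)) is assumed:

# Posterior predictive check for the binomial example (uniform prior, posterior Be(12, 17))
set.seed(771)
n <- 27; x <- 11; B <- 10000
theta.rep <- rbeta(B, x + 1, n - x + 1)           # draws from the posterior [theta | y]
y.rep <- rbinom(B, size = n, prob = theta.rep)    # replicated counts, draws from (1.21)
# Tail area of the observed count under the posterior predictive distribution;
# values close to 0 or 1 would flag a failure of the model to account for T(y) = y
mean(y.rep <= x)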
Model checking using posterior predictive distributions is discussed in Chap. 10.

1.6 Prediction

The second part of the book provides an introductory overview of prediction.


A stylised setup is as follows. Data matrix z has the structure .zi = (xi , yi ),
.i = 1, 2, . . . , n, where .xi is a p-dimensional vector of covariates (or predictor

variables); .yi , a scalar, is a response variable; and the n vectors are independent
and identically distributed realisations from some distribution. Examples of both

parametric (frequentist and Bayesian) and nonparametric models are given here. In
the case of parametric models where the response y is quantitative, a general form
for the association between x and y is

.yi = f (xi ) + ei , (1.22)

where f , the conditional mean, is a fixed unknown function of .xi and of parameters
and .ei is a random error term, assumed independent of .xi , with mean zero. Using
the data z, an estimate of f labelled .f is obtained by some method that leads to
.Ê(y0 |x0 ) = ŷ0 , a point prediction of the average value of .y0 , evaluated at a new

value of the covariate .x = x0 (in a frequentist setting, conditional on estimates of


parameters that index f ):

ŷ_0 = f̂(z, x_0).    (1.23)

The notation emphasises that the estimation procedure inputs data z and yields a
prediction ŷ_0 for x = x_0. For example, in standard least squares regression, f(x_i) =
E(y_i|x_i) = x_i′b and f̂(z, x_0) = x_0′b̂, where b̂ = (x′x)^(−1)x′y and x_i′ is the ith row of
matrix x.
With binary responses, one may fit a logistic regression. Here, the modelling is at
the level of the probability. Specifically, letting Pr(y_i = 1|x_i) = π(x_i), the logistic
model can be written as

ln[π(x_i)/(1 − π(x_i))] = x_i′b,    i = 1, 2, …, n.

A maximum likelihood estimate b̂, together with a new input x_0, results in a
predicted probability π̂(x_0) that can be transformed into a predicted value according
to the rule:

ŷ_0 = 1 if π̂(x_0) ≥ 0.5,   ŷ_0 = 0 if π̂(x_0) < 0.5.    (1.24)
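In R, rule (1.24) amounts to thresholding the inverse logit of the linear predictor at 0.5; the small function below is an illustrative sketch, with names and numerical values that are hypothetical rather than taken from the book:

# Illustrative sketch of rule (1.24); b holds the ML estimates, intercept first
classify01 <- function(x0, b) {
  eta <- sum(c(1, x0) * b)            # linear predictor x0'b
  p   <- exp(eta)/(1 + exp(eta))      # predicted probability pi(x0)
  ifelse(p >= 0.5, 1, 0)              # rule (1.24)
}
classify01(x0 = c(0.2, -1.3), b = c(0.5, 1.1, -0.7))   # hypothetical values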
Measuring Prediction Performance
The performance of the predictions can be evaluated measuring how well they match
observed data. One measure of predictive performance is the sample mean squared
error (.MSE) :
MSE_t = (1/n_t) Σ_{i=1}^{n_t} (y_i − ŷ_i)²,    (1.25)

where n_t is the number of records and ŷ_i = f̂(z, x_i). When MSE_t (1.25) is
computed using the data that was used to fit the model, the training data, it is
known as the sample training mean squared error. If the objective is to study how
well the model predicts a yet-to-be-observed record, .MSEt can be misleading as it
overestimates predictive performance. In fact, it can be made arbitrarily small by
including a large number of covariates.

Table 1.1 Training and validating mean squared errors for the prostate cancer data, as the
number of covariates included in the linear predictor increases from 5 to 30. A standard logistic
model is implemented, and the mean squared errors represent the proportion of misclassifications
in the training and validating data

No. of covariates   5      10     15     20     25     30
MSE_t               0.29   0.31   0.16   0.12   0.06   0.00
MSE_v               0.27   0.37   0.25   0.33   0.35   0.39
A more reliable measure of the prediction ability of the model is to test how well
predictions match observations from a new sample of data .z0 (or hold-out data), the
validating data, drawn from the same distribution as the training data. The validating
data is .z0i = (y0i , x0i ), .i = 1, 2, . . . , nv and the sample validating mean squared
error is
MSE_v = (1/n_v) Σ_{i=1}^{n_v} (y_0i − ŷ_0i)².    (1.26)

In (1.26), ŷ_0i is the ith prediction computed using the training data z evaluated
at the value of the ith covariate x_0i. That is, ŷ_0i = f̂(z, x_0i). With binary
observations, (1.25) and (1.26) represent the proportion of cases where ŷ_0i ≠ y_0i,
or misclassification error.
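With 0/1 observations and predictions, both quantities reduce to a simple proportion; an illustrative helper (not from the book) is:

misclass <- function(y, yhat) mean(y != yhat)          # proportion of misclassified cases
misclass(y = c(1, 0, 1, 1), yhat = c(1, 1, 1, 0))      # 0.5, with made-up values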
As an illustration of some of these concepts, I use data from a microarray study
of prostate cancer from Singh et al (2002). The study includes 52 men with tumour,
50 healthy men and a total of .n = 102 men. The genetic expression of a panel
of .p = 6033 genes was measured for each man. The level of gene expression is
associated with the level of activity of the gene. A larger number implies a more
active gene. The n × p matrix of covariates is then x = {x_ij}, i = 1, 2, …, n =
102; j = 1, 2, …, p = 6033, with p ≫ n.
Fitting Traditional Logistic Regression
To illuminate some of the consequences of overfitting, a first analysis is undertaken
with traditional logistic regression models involving .p < n. Data are divided
into training and validating sets with equal numbers in each. The logistic models
are fitted to the training data using maximum likelihood, and the estimates of the
parameters are used to predict the outcome (healthy/not healthy) in the validating
data. The models differed in the number of covariates included. The change in
.MSEt and .MSEv as the number of covariates (columns in x) increases in the

different models from .p = 5 to .p = 30 in steps of 5 is shown in Table 1.1 for


one training/validating split. The covariates were arbitrarily chosen as the first 5
columns of x, the first 10 and so on. The figures in the table show an increase in
.MSEv as the number of covariates increases beyond 15 and a parallel increase of

the overstatement of the model’s predictive ability, as judged by the steady fall in
MSE_t when the number of covariates is larger than 10.

The code below reads the data (singh2002), splits it into a training and a
validating/testing set (y.test and y.train) and in the bottom part,
• fits a logistic regression to the training data using the R-function GLM
• using the ML estimates, computes the predicted liabilities in the training and
validating data
• based on these liabilities, computes Pr(Y = 1|b̂), where b̂ is the ML estimate
• transforms the probabilities into the .0/1 scale
• computes .MSE (misclassification error) in the training and validating data
The figures in Table 1.1 were generated using this code. The code illustrates the
case with 15 covariates (first 15 columns of matrix x). The output agrees with the
figures in the third column of the table:

# CODE0101
# READING SINGH ET AL 2002 DATA
rm(list=ls()) # CLEAR WORKSPACE
# Lasso solutions using package glmnet
#install.packages("glmnet", .libPaths()[1])
#install.packages("sda")
library("sda")

library(glmnet)

data(singh2002)
X<-singh2002$x
y<-ifelse(singh2002$y=="cancer",1,0)
n<-nrow(X)
Xlasso<-X
set.seed(3037)
train=sample(1:nrow(X),nrow(X)/2)
test=(-train)
y.test=y[test]
y.train<-y[train]
#
# RESAMPLES TRAIN/TEST DATA AND COMPUTES MSE
# FOR EACH RESAMPLE/REPLICATE
t1 <- seq(1,15,1) # CHOOSE THE FIRST 15 COLUMNS OF x
X1 <-X[,t1]
n <- length(t1)
datarf <- data.frame(cbind(y,X))
nc <- 1 # EXAMPLE WITH 1 REPLICATE ONLY
res <- matrix(data=NA, nrow=nc,ncol=3)
for(i in 1:nc){
if(i > 1){train <- sample(1:nrow(datarf),nrow(datarf)/2)}
glm.fit <- glm(y[train] ~ X1[train,] ,
family=binomial(link="logit"))

# CALCULATE PREDICTED LIABILITY FOR THE TRAINING (liabT) AND THE


# VALIDATING DATA (liabV)
liabV <- X1[-train,1:n]%*%glm.fit$coefficients[2:(n+1)]+
glm.fit$coefficients[1]
liabT <- X1[train,1:n]%*%glm.fit$coefficients[2:(n+1)]+
glm.fit$coefficients[1]
# COMPUTE Pr(Y=1) BASED ON THESE LIABILITIES
probT <- exp(liabT)/(1+exp(liabT))
probV <- exp(liabV)/(1+exp(liabV))
# COMPUTE PREDICTED VALUES IN TRAINING AND VALIDATING DATA
# ON THE 0/1 SCALE
predT <- ifelse(probT > 0.5, "1", "0")
predV <- ifelse(probV > 0.5, "1", "0")
# COMPUTE MISCLASSIFICATION ERROR
predclassT <- mean((as.numeric(predT) - y.train)^2)
predclassV <- mean((as.numeric(predV) - y.test)^2)

# IF CURIOUS COMPUTE LOG-LIKELIHOOD, DEVIANCE,


# AIC USING TRAINING DATA
# ll <- sum(y.train*liabT) - sum(log(1+exp(liabT)))
# dev <- -2*ll
# AIC <- dev + 2*(n+1)
# ***********************************

res[i,] <- c(n,predclassT,predclassV)


}
res

## [,1] [,2] [,3]


## [1,] 15 0.1568627 0.254902
Selection of Covariates for Prediction
An objective of the experiment above is to find genes that may have an effect
on prostate cancer. The data are typical of a genomic setup where the number of
variables p (e.g. genetic markers) measured for each individual is considerably
larger than the number of individuals n; the classical p ≫ n scenario. In the context
of prediction, these variables enter as covariates in linear regression or logistic
regression models, but only a subset are likely to contribute meaningfully. Inclusion
of redundant variables may improve model fit (reflected in small values of training
mean squared error, MSE_t), but will definitely result in poor predictions. From the
point of view of model implementation, when p ≫ n, some form of regularisation
or shrinkage is needed. This constitutes an important topic of the book. Three
examples are provided in this overview. The first is a parametric model; the other
two are nonparametric approaches specifically developed to deal with, but not
restricted to, the p ≫ n situation. Like many of the tools used for the analysis of the
type of data sets commonly found in genomic studies, these models can be useful
as guidance in the choice of predictors for further study.
Fitting the Lasso
The parametric example is based on the lasso (Tibshirani 1996 , “least absolute
shrinkage and selection operator”) that is a regularisation method with a tuning
parameter governing the amount of shrinkage of the regression parameters towards

zero. Since the lasso solutions typically include many coefficients equal to zero
when the tuning parameter is sufficiently large, it does model selection and
shrinkage simultaneously.
The lasso logistic regression model is fitted using the public package glmnet
(Friedman et al 2009) implemented in R. Documentation about glmnet can be found
in Hastie and Qian (2016) and in Friedman et al (2010).
To obtain predictions, the code below executes first the function cv.glmnet on
the training data in order to find the value of the tuning parameter (.λ) that optimises
prediction ability measured by .MSE. In a second step, glmnet is executed again
on the training data using this best .λ to obtain the final values of the regression
parameters. The code then constructs the predictions from the output of this second
run. The model is finally tested on the training and on the validating data.
A more direct implementation of glmnet without the need to generate estimates
of regression parameters is indicated at the bottom of the code.
The lasso logistic regression was run on the prostate data including all 6033
covariates representing the gene expression profiles. Lasso chooses 36 covariates
and sets the remaining equal to zero. The model with these 36 covariates was used
to classify the observations in the validating data and resulted in a .MSEv equal to
.0.25. In other words, .51×0.25 ≈ 13 out of the 51 observations in the validating data

are incorrectly classified. At face value, the result for .MSEv matches that obtained
in Table 1.1 when the first 15 columns of x were included in the linear predictor.
The latter can be interpreted as a logistic regression model where 15 out of 6033
covariates are randomly chosen. The result based on the lasso is not encouraging:
# CODE0102
# READING SINGH ET AL 2002 DATA
rm(list=ls()) # CLEAR WORKSPACE
# Lasso solutions using package glmnet
#install.packages("glmnet", .libPaths()[1])
#install.packages("sda")
library("sda")
library(glmnet)
data(singh2002)
X<-singh2002$x
y<-ifelse(singh2002$y=="cancer",1,0)
n<-nrow(X)
Xlasso<-X
set.seed(3037)
train=sample(1:nrow(X),nrow(X)/2)
test=(-train)
y.test=y[test]
y.train<-y[train]
#
# ********** FOR PREDICTION USING LASSO *****************
repl <- 1 # NUMBER OF REPLICATES
# (RESAMPLES TRAINING / VALIDATING)
result <- matrix(data=NA, nrow=repl,ncol=4)
set.seed(3037)
for (i in 1:repl){
if(i > 1){train <- sample(1:nrow(Xlasso),nrow(Xlasso)/2)}
y.train <- y[train]

y.test <- y[-train]


# STEP 1: cross-validation; find best value of lambda
# alpha=1: LASSO; alpha=0: RIDGE REGRESSION
cv.out=cv.glmnet(Xlasso[train,],y[train],alpha=1,
family="binomial",type = "class")
#plot(cv.out)
bestlam=cv.out$lambda.min
#bestlam

# Using best lambda, fit model on training data


# to obtain final parameter estimates

# STEP 2
fm=glmnet(y=y[train],x=Xlasso[train,],alpha=1,lambda=bestlam,
family="binomial",type.measure= "class")
nzcf<-coef(fm)
cf<-which(fm$beta[,1]!=0)
if (length(cf) == 0){
out <-c(i,length(cf))
print(out)
break
}
#length(cf) # NO. REGRESSION PARAMETERS IN FINAL MODEL
# CONSTRUCT PREDICTIONS FROM OUTPUT OF fm
# 1. VALIDATING DATA
predglmnet<-fm$a0+Xlasso[-train,cf]%*%fm$beta[cf]
probs <- exp(predglmnet)/(1+exp(predglmnet))
predclass_test <- as.numeric(ifelse(probs > 0.5, "1", "0"))
# 2. TRAINING DATA
predglmnet<-fm$a0+Xlasso[train,cf]%*%fm$beta[cf]
probs <- exp(predglmnet)/(1+exp(predglmnet))
predclass_train <- as.numeric(ifelse(probs > 0.5, "1", "0"))
result[i,] <- c(mean((predclass_train-y.train)^2),
mean((predclass_test-y.test)^2),bestlam,length(cf))
}
result

## [,1] [,2] [,3] [,4]


## [1,] 0 0.254902 0.01948886 36
#NOTE: for prediction, GLMNET can be implemented more directly,
# using in STEP2:
###############################################################
# fm.predclass=predict(cv.out,s=bestlam,newx=Xlasso[test,],
# family="binomial",type="class")
# mean((as.numeric(fm.predclass)-y.test)^2) # VALIDATION ERROR
# RATE (BASED ON CLASS LABELS)

###############################################################

The somewhat disappointing performance of the lasso was investigated further


by randomly splitting the training/testing data set 50 times. This provides a picture
of the sampling variation of the .MSE over the joint distribution of training/testing
data. The mean validating mean squared error, over the 50 replicates, was .0.30,
with a minimum of .0.16 and a maximum of .0.53. The number of covariates not set
equal to zero ranged from 2 to 41 with a median of 24. There is not a model that

is consistently singled out as a good predictor over replications. This is a reflection


of the ubiquitous multicollinearity in a multidimensional setting where covariates
become highly correlated (a covariate can be expressed as a linear combination of
others). Therefore, a different set of covariates is chosen in each replication.
Fitting a Classification Tree
The first of the two nonparametric models that are fitted to the data is a
classification tree (Breiman et al 1984), which is described briefly via
the example generated by the R-code below. The code executes the R function
tree; this requires installation of the package tree. The data set singh2002
includes 6033 covariates and the response variable is y: a binary classification
variable with modalities “healthy” and “cancer”:
# CODE0103
rm(list=ls()) # CLEAR WORKSPACE
set.seed(30331)
#install.packages("tree")
library(sda)
library(tree)
# library(glmnet)
data(singh2002)
d <- data.frame(singh2002$x)
d$y <- singh2002$y
nrep <- 1 # NUMBER OF REPLICATES
res <- matrix(data=NA,nrow=nrep,ncol=3)
ptm<-proc.time()
for ( i in 1:nrep ) {
cat(i,"\n",sep="")
train <- c(sample(1:50,25),sample(51:102,26))
# FIT THE TREE TO THE TRAINING DATA
trees <- tree(y ~ . , data=d[train,])
# FIT FUNCTION PREDICT TO THE TRAINING AND VALIDATING DATA
predtreev <- predict(trees,d[-train,],type="class")
predtreet <- predict(trees,d[train,],type="class")
# CLASSIFICATION ERROR IN TRAINING AND VALIDATING DATA
predv <- sum(predtreev==d$y[-train])/length(d$y[-train])
predt <- sum(predtreet==d$y[train])/length(d$y[train])
# RECORD TRAINING / VALIDATING CLASSIFICATION ERROR AND
# NUMBER OF COVARIATES IN TREE
res[i,]<-c((1-predt),(1-predv),length(summary(trees)$used))
}

## 1

proc.time()-ptm

## user system elapsed


## 2.95 0.17 3.12
res

## [,1] [,2] [,3]


## [1,] 0 0.1764706 2

tab <- table(predtreev,d$y[-train])


tab

##
## predtreev cancer healthy
## cancer 17 0
## healthy 9 25

# CHECK CLASSIFICATION ERROR


(tab[1,2]+tab[2,1])/(length(d$y[-train]))

## [1] 0.1764706

#summary(res)
#plot(trees)
#text(trees,pretty=0)

Figure 1.11 indicates that in the particular replicate, the algorithm isolated 2
of the 6033 covariates, .X77 and .X237 . Starting at the top of the tree, the 51 cases
in the training data have been split into two groups: one, to the left, that shows
expression profile for .X77 less than a threshold .t1 = −0.777 and those to the right
with .t1 > −0.777. The group on the left is not split further and constitutes a terminal
node. On the right side, a second split based on the profile of .X237 and a threshold
.t2 = −0.855 gives rise to two terminal nodes. The result can be interpreted as an

interaction between the two markers.

Fig. 1.11 Output from a classification tree fitted to data singh2002 from Singh et al (2002)
using the R function tree (R-code CODE0103). Results from one replicate. The tree splits on
X77 < −0.777342 and X237 < −0.85516; terminal nodes are labelled "cancer" and "healthy"

A predicted value is attached to each terminal node. For a new individual, a


predicted value is obtained by starting at the top of the tree and following the splits
downwards until the terminal node with its predicted value is reached. The new
individual is assigned the prediction given by that terminal node. In the example,
if the new individual were to show a value for .X77 = −0.3 and for .X237 = −0.1,
following the tree from top to bottom would lead to a prediction taking the modality
“healthy”.
Typing the tree object (in this case, trees) gives further details associated with the
figure. Terminal nodes are indicated with asterisks. R prints output from each branch
of the tree in the form of the node, split criterion t, the number of observations in the
branch n, the deviance, the classification for the branch (“cancer”/“healthy” in the
present case) and the proportion of observations in the branch that take the values
“cancer”/“healthy”:

library(tree)
trees

## node), split, n, deviance, yval, (yprob)


## * denotes terminal node
##
## 1) root 51 70.68 cancer ( 0.5098 0.4902 )
## 2) X77 < -0.777342 20 0.00 cancer ( 1.0000 0.0000 ) *
## 3) X77 > -0.777342 31 30.46 healthy ( 0.1935 0.8065 )
## 6) X237 < -0.85516 6 0.00 cancer ( 1.0000 0.0000 ) *
## 7) X237 > -0.85516 25 0.00 healthy ( 0.0000 1.0000 ) *

The output above indicates that at the top of the tree at .X77 (which is the
root since the tree is upside down), there are 51 records (the training data), and
the proportion of “cancer” is .0.5098. After the first split, to the left, the split is
.t1 < −0.777, and 20 observations are classified as “cancer” and 0 as “healthy”

leading to proportions of (.1.00, 0.00). To the right, the split is .t1 > −0.777 that
gives rise to 31 records, with a proportion of “healthy” equal to .0.8065 (25 out of
the 31 records are "healthy", those whose t_2 > −0.855, associated with covariate
X237).

Various algorithms are available to decide which variable to split and the
splitting value t to use for the construction of the tree. Some of these topics
are deferred to the chapter on nonparametric methods. Here, I concentrate on the
predictive ability of the method. For the particular replicate, the classification error
in the training and validating data is 0 and .0.18, respectively. Replicating the
experiment 50 times gives a mean classification error in training and validating
data equal to .0.016 and .0.197, respectively, with (minimum, maximum) values of
(.0.000, 0.078) and (.0.098, 0.333), respectively. With the parameters for the utility
function tree.control set at the default values, the number of covariates over
the 50 replicates included in each tree fluctuates between 2 and 3, and these
covariates vary over replicates. For these data, the classification tree performs
considerably better than the lasso.

Interestingly, in all the cases, the classification trees capture what can be
interpreted as an interaction involving two or three covariates. These are not the
same covariates for the 50 trees. Perhaps, this must not come as a surprise: 6033
covariates give rise to more than 18 million different two-way interactions. There-
fore, predictors based on interacting covariates are prone to be highly correlated.
More generally and as noted with the lasso, in the high-dimensional setting, the
multicollinearity of the covariate matrix is often extreme, and any of the p covariates
in the .n × p matrix can be written as a linear combination of the others. This means
that there are likely many sets of pairs of covariates (other than .X77 and .X237 ) that
could predict just as well. It does not follow that the model cannot be trusted as a
prediction tool, but rather that one must not overstate the importance of .X77 and
.X237 as the only genes associated with the response variable. As with the lasso, the

analysis with the classification tree provides inconclusive evidence of specific genes
affecting prostate cancer.
Fitting a Random Forest
A problem often mentioned with trees is that they exhibit high variability. Small
changes in the data can result in the construction of very different trees and their
predictions can be impaired. However, they are an integral part of another method
known as random forest (Breiman 2001) whose prediction performance benefits by
the process of averaging. The random forest consists of many classification trees
and each is created as follows:
• Create a sample of size .nv by drawing with replacement from the .nv data in the
training data. Repeat this B times to generate B samples. (With random forests,
there is an alternative way of estimating validating mean squared error using the
entire data, without cross-validation. Details are discussed in Chap. 11).
• For each sample, generate a classification tree. Each time a split in the tree is
considered, a random sample of m unique predictors is chosen as split candidates
from the p predictors. For classification, it is customary to use m ≈ √p. This
step has the effect of decorrelating the ensemble of B trees (in classification
trees, the construction of a split involves all the predictors).
For classification, once the trees are available, the final prediction is obtained
by a majority vote. Thus, for .B = 10, say, if for a particular observation in the
validating data six or more trees classify it as ."1", the predicted value for this
observation is ."1". The prediction obtained in this manner usually outperforms the
prediction of classification trees. This improvement in performance arises from the
fact that a prediction based on B predictors with very low correlation has smaller
variance than a single prediction. The low correlation is ensured in the second
step. The first step involving the bootstrapping of the training data is known as
bagging, short for bootstrap aggregating, whereby the results of several bootstrap
samples are averaged. (The mean squared error is a measure of the performance of
a predictor, whose expectation includes the variance of the predictor, a squared bias
term and a pure noise term associated with the variance of the predictand. Therefore,
the performance of a predictor improves as its variance is reduced. This reduction
Fig. 1.12 Average proportion of correct classifications in the validating data (in red) of a random
forest over 200 replicates against the number of covariates included in the ensemble of trees.
Maximum and minimum over replicates in blue

is achieved by constructing a prediction averaged over several bootstrap samples


whose variance is smaller than the variance of an estimate based on a single sample).
To study prediction ability, the random forest was implemented on the singh2002
data set using the R function randomForest. I executed 200 replicates (200 splits
of training/testing data), and in each replicate, the number of covariates included in
the split of a particular tree ranged from 5 to 120 as indicated in the code below in the
variable mtry <- c(5,20,50,80,120). The average proportion of correct
classifications in the validating data (“cancer”, “healthy”), as well as the minimum
and maximum over the 200 replicates as a function of the number of covariates
(mtry on the x-axis), is shown in Fig. 1.12.
The results are quite impressive. The average proportion of correct classifications
in the validating data is of the order of 97%, with a minimum-to-maximum range
over replicates of 84% to 100%.
The code used to implement the random forest is shown below:
# CODE0104
#install.packages("randomForest")
rm(list=ls()) # CLEAR WORKSPACE
library(sda)
library(randomForest)
data(singh2002)
d <- data.frame(X=singh2002$x)
d$y <- singh2002$y
n0 <- sum(d$y=="healthy")
n1 <- sum(d$y=="cancer")

set.seed(3037)
p <- .5
nrep <- 1
mtry <- c(5,20,50,80,120)
sumd <- data.frame()
res <- rep(NA,nrep)
ptm<-proc.time()
for ( m in mtry) {
cat("mtry ",m,"\n",sep="")
for ( rep in 1:nrep ) {
cat("Replicate ",rep,"\n",sep="")
train <- c(sample( 1:n0,floor(p*n0) ),
sample( (n0+1):(n0+n1),floor(p*n1) ))
rf.singh =randomForest(y ~.,
data=d,
subset =train,
mtry=m,
importance =TRUE)
predict <- predict(rf.singh,d[-train,])
observed <- d$y[-train]
t <- table(observed,predict)
print(t)
res[rep] <- (t[1,1]+t[2,2])/sum(t)
}
sumd <- rbind(sumd,c(m,min(res),mean(res),median(res),
max(res),var(res)))
}
proc.time()-ptm
names(sumd) <- c("mtry","min","mean","median","max","var")

with(sumd,plot(mtry,mean,type="l",col="red",ylim=c(min(min),1),
ylab="1 - Mean Squared Error",
xlab="Number of Predictors Considered at each Split"))
with(sumd,lines(mtry,min,lty=2,col="blue"))
with(sumd,lines(mtry,max,lty=2,col="blue"))

While in this particular set of data the random forest was the clear winner among
the prediction machines tested, it is important to mention that there is no uniformly
best prediction machine. A different set of data may produce different results. Very
marked differences among prediction methods ought to raise suspicion and warrant
careful investigation of the data (Efron 2020). This is particularly important in this
era of increasingly larger data sets where the consequence of bias due to non-random
sampling is magnified. The point is elaborated in Meng (2018). Spurious results may
be obtained by complex interactions between a prediction method and a particular
structure in the training data at hand that may not be reproduced when the model is
deployed using validating data.

1.7 Appendix: A Short Overview of Quantitative Genomics

I provide a brief and compact description of the quantitative genetics/genomics


model and introduce terms used repeatedly in the book, such as allele, locus,

diploid, haploid, genotype, Hardy-Weinberg law, single nucleotide polymorphisms


(SNPs), genomewide association study (GWAS), allele content, quantitative trait
loci (QTL), linkage, linkage disequilibrium, phenotype, genotype, genetic value,
genetic variance, additive genetic value (breeding value), additive genetic effect
(additive effect of a gene substitution), additive genetic variance, heritability,
expected additive genetic relationship, additive genetic relationship matrix, genomic
relationship matrix, genomic model, genomic value and genomic variance.

The Classical Quantitative Genetics Model

The starting point of the mathematical genetics model is the metaphor that describes
chromosomes as strings of beads, with each bead representing a gene. Genes are the
unit of inheritance. In mammals and many other groups, each cell carries two copies
of each chromosome; they are said to be diploid. Most fungi, algae and human
gametes have only one chromosome set and are haploid.
The complete set of chromosomes of an organism includes sex chromosomes
and autosomes. For example, in humans, there are 23 pairs of chromosomes and
of these, 22 pairs are non-sex chromosomes known as autosomes. The majority of
genes are located on the autosomes and in this book I consider autosomal loci only.
In diploid organisms, at a specific location on the chromosome called the locus,
each of the two copies of the chromosome carries a gene. The pair of genes
constitute the genotype at the particular locus. Genes exist in different forms known
as alleles. Here, I consider biallelic loci, so for a given locus in diploid individuals,
if the two alleles are A and a, the three genotypes could be denoted, say AA, Aa
and aa (no distinction is made between Aa and aA). For example, an individual
with genotype Aa received one allele (say A) from the mother and the other from
the father.
The standard quantitative genetic model assumes that the expression of a trait
value y (the phenotype, here centred with zero mean) in diploid individuals is
determined by the additive contributions of a genetic value G and an environmental
value e,

y = G + e,

where e ∼ (0, σ²) is often assumed independent of G. The genetic value is defined


as the conditional mean of the phenotype, given genotype, .E (y|G) and is the result
of the joint action of a typically unknown number q of quantitative trait loci (QTL).

The Single Locus Model

Consider first a trait affected by a single biallelic locus with the three genotypes
labelled AA, Aa and aa. Let p denote the frequency of allele A in the population

(assumed to be the same in both sexes). In a large population, assuming random


mating among parents and in the absence of random genetic drift, selection and
mutation, in the offspring generation, the frequency of genotype AA is .p2 , of
genotype Aa is .2p(1 − p) and of genotype aa is .(1 − p)2 . In this overview, gene
frequencies p are treated as known constants that remain unchanged over repeated
cycles of random mating.
Define the random variable z* as

       2, with probability p²,
z* =   1, with probability 2p(1 − p),
       0, with probability (1 − p)²,

The random variable .z∗ is known as the allele content (here, allele A is arbitrarily
taken as reference and .z∗ = 2 if the genotype has two copies of allele A).
For this locus, E(z*) = 2p, Var(z*) = 2p(1 − p) and, for individuals k and
j, Cov(z*_j, z*_k) = a_jk 2p(1 − p), where a_jk is the expected additive genetic
relationship (given a pedigree) between k and j (e.g. a_jk = 0.5 if j and k are
non-inbred full sibs or parent and offspring), also interpreted as the expected
proportion of alleles shared identical by descent between j and k (genes that are
identical by descent (IBD) are copies of a specific gene carried by some ancestral
individual). The note note0101.pdf at https://github.com/SorensenD/SLGDS has a
derivation of these results.
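These moments can also be checked with a small simulation sketch (not from the book): allele contents are sampled under Hardy-Weinberg proportions and the empirical mean and variance are compared with 2p and 2p(1 − p).

set.seed(101)
p <- 0.3; N <- 1e6
# allele contents z* sampled with Hardy-Weinberg probabilities p^2, 2p(1-p), (1-p)^2
zstar <- sample(c(2, 1, 0), size = N, replace = TRUE,
                prob = c(p^2, 2*p*(1 - p), (1 - p)^2))
c(mean(zstar), 2*p)            # empirical and expected mean, 0.6
c(var(zstar), 2*p*(1 - p))     # empirical and expected variance, 0.42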
In a large (idealised) random mating population, in the absence of selection,
mutation or migration, the relationship between gene frequency p and genotype
frequency .p2 remains constant from generation to generation. The property is
derived from a theorem known as the Hardy-Weinberg law that provides one expla-
nation for the maintenance of genetic variation in such idealised random mating
population. In a population in Hardy-Weinberg equilibrium, genotype frequencies
at a particular locus in the offspring generation depend on gene frequencies in the
parent generation.
From now on, the codes for the three genotypes are centred as .z = z∗ − E (z∗ )
so that .E (z) = 0, .Var (z) = 2p (1 − p) and phenotypic values are also centred, so
that .E(y) = 0.
The genetic value .G (z) at the locus can take three modalities corresponding to
the three genotypes at the biallelic locus and can be decomposed into two terms:

G(z) = αz + δ,    (1.27)

where .αz is the best linear predictor of genetic value. The best linear predictor
αz is also known as the additive genetic value or breeding value: the best linear
.

approximation describing the relationship between genetic value and allele content
z (best linear prediction is discussed on page 259; see also the example on page 261
for more details on the concepts of additive genetic values and effects, where it is
shown that .α, the additive genetic effect of the locus or average substitution effect

at the locus is also the regression of y on z). The residual term .δ is orthogonal to z
and includes deviations between .G (z) and .αz.
The genetic variance contributed by the locus in the population (based on the law
of total variance)

Var(G(z)) = Var_z(E[G(z)|z]) + E_z(Var[G(z)|z])    (1.28)

is orthogonally decomposed into an additive genetic component of variance σ²_a, the
first term on the right-hand side, and a residual or dominant component of genetic
variance, Var(δ), the second term. The additive genetic variance (variance of
the additive genetic values) in this single locus model, assuming Hardy-Weinberg
equilibrium, is

σ²_a = Var_z(E[G(z)|z]) = Var(αz|α) = 2α²p(1 − p).

If the linear fit is perfect, the genetic variance is equal to the additive genetic
variance. Importantly, additive genetic variation at the locus arises due to variation
in allele content z among individuals at the locus. The substitution effect .α is treated
as a fixed albeit unknown parameter (this is stressed by conditioning on .α).
The (narrow sense) heritability of the trait is defined as the ratio of the additive
genetic variance to the phenotypic variance: h² = σ²_a/σ²_y, where σ²_y = Var(y),
the marginal variance of the phenotype.
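For a numerical illustration of these definitions (the values of p, α and the environmental variance below are chosen arbitrarily, not taken from the book), the single-locus additive variance and heritability follow directly:

p <- 0.3; alpha <- 0.5; sigma2.e <- 1      # illustrative values, not from the book
sigma2.a <- 2*alpha^2*p*(1 - p)            # additive genetic variance 2*alpha^2*p*(1-p)
sigma2.y <- sigma2.a + sigma2.e            # phenotypic variance (no dominance assumed)
h2 <- sigma2.a/sigma2.y                    # narrow-sense heritability
c(sigma2.a, h2)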

Models with Many Loci

The extension to q biallelic loci involves a random vector z = (z_1, …, z_q) of
allele contents of the q genotypes. Under random mating, .Var (zk ) = 2pk (1 − pk ),
.k = 1, . . . , q and .Cov (zk , zl ) = 2Dkl , where the linkage disequilibrium (LD)

parameter .Dkl between loci k and l is defined as follows. Choose the paternal (or
maternal) gamete and let the random variable U take the value 1 if allele .Ak is
present in the paternal gamete at locus k and zero otherwise; let the random variable
V take the value 1 if allele .Al is present in the paternal gamete at locus l and zero
otherwise. Then .Dkl is defined as the covariance between U and V :

D_kl = Cov(U, V),

and Cov(z_k, z_l) = 2D_kl since, in the diploid model, the genotype results from
the random union of two gametes. Covariances involving alleles of different loci
between gametes are zero. Linkage disequilibrium is created by evolutionary forces
such as selection, mutation and drift and is broken down by random mating, as a
function of time (measured in generations) and of the distance that separates the
intervening loci. Generally, loci that are physically close together show stronger
LD.
botany; will be an endowment quite large enough for the purpose.
Thence, rail—two nights and a day—to Mobile, where it was warm and
springlike, but no flowers out, barring an early violet. Thence to New
Orleans, which has a great exposition and a crowd, and where, in a sudden
change to cold, I caught a dreadful cold. It began with such a hoarseness
that, going, Mrs. G. and I, to dine with Dr. Richardson (son-in-law of
Short), where we met your and Dyer’s friends, Mr. and Mrs. Morris[131] of
Jamaica, I was taken speechless. I was only for a few hours at the
Exposition (I hate such), but Mrs. Gray went a second time to see Mexican
things. Dr. Farlow, joining us at New Orleans, brought, to our surprise,
passes for us to go by the Mexican Central Road to the city of Mexico and
back to El Paso (the junction with the road to California), and we decided to
undertake it. One day and a night took us to San Antonio, Texas, where we
stayed Saturday, Sunday, and Monday, till evening, trying to recover from
our colds, driving over the country through chaparral of mesquite bushes
(Prosopis) and opuntias. When we awoke next morning we were coursing
along the rocky banks of the Rio Grande del Norte, mounting into a high
region more arid still, if possible, the only flowers out a Vesicaria; and
descending into a great cattle ranch region we reached El Paso at 3.30 A.M.;
got to bed again; had the day there and on the other side of the river, at El
Paso del Norte, in the Mexican State of Chihuahua, whence at evening we
took our Pullman for three nights and two days’ journey to this place,
through Chihuahua, Zacatecas, Aguas-Caliente, Leon, etc., reaching here
yesterday morning at 8.30. We are comfortably placed in the Hotel Iturbide.
Farlow and I have looked about somewhat, though I am still suffering from
catarrh and cough; Mrs. Gray laid up with hers. This afternoon a Mexican
gentleman to whom we took letters called and drove Farlow and me out to
Chapultepec, whence a most magnificent view of the whole Valley of
Mexico and the surrounding mountains, including Popocatapetl and its
more broadly snowy companion,—with its more difficult name, meaning
White Lady,—at this season always with cloudless tops. The cypresses of
Chapultepec are glorious trees, plenty of them, full of character, and of a
port which should help to distinguish the Mexican species from the North
American. I wish you could see them. And such old trees of Schinus molle,
the handsomest of trees either old or young, the old trunks wonderfully
bossed. Is it a native of Mexico? I thought only of Chili. But it is well at
home here.
Such yucca trees as we have seen on the way here, with trunks at base
two or three feet in diameter, weirdly branched, looking like doum palms.
Opuntias of two or three arborescent species, some huge, and other cacti not
a few.
I have still to compare Arizona with the plateau of northern Mexico. But
I see they are all pretty much one thing....
Orizaba, February 27, 1885.
Since my former sheet, Farlow and I have been mousing about the city
of Mexico, I coughing most of the time, in a clear, dry air and nearly
cloudless sky, weather which should be most delightful, but somehow it is
bad for the throat (for the natives as well as for us), and the rarefied air puts
one out of breath at a little exertion; mornings and evenings cool and fresh,
the midday warm, in the sun trying.... Called in a physician, a sort of
medical man to American embassy, who came here with Maximilian, and
stayed. Very intelligent. Ordered us to come here as soon as Mrs. Gray
could travel. Here only 4,028 feet and a warmer damp air. Well, we tried it
yesterday; had to leave city of Mexico at 6.15 A.M., our hotel at 5.30 cold,
no breakfast; had to travel till ten or nearly before we could get even a
decent cup of coffee, at junction of road to Vera Cruz and Puebla, and after
rising to 8,333 feet in getting out of the Valley of Mexico; but at 1 P.M., at
Esperanza, in the Tierra Frias, had a capital dinner, and met train from Vera
Cruz. Here pine-trees on the hills all round us, two species. Soon begins the
descent and a complete change of air, the other side all dry and horrid dust,
making our catarrh worse than ever; now the moisture from the Gulf of
Mexico makes all green; the road by skillful engineering pitches down
4,000 feet to this, the greater part of the descent all in eight or nine miles of
straight line as the bird flies. In all the Valley of Mexico and to the north of
it really nothing in blossom yet, all so dry, except Senecio salignus, if I
rightly remember the name, a shrub of 1-4 feet, just becoming golden with
blossoms. But the moment we began the descent all was flowery, two
species of Baccharis, Eupatoria, Erigeron mucronatum (so much cultivated
under the false name of Vittadenia triloba), Lœseliæ species, Arbutus,
(Xalapensis) in bud, and many things of which we shall know more when
we return over the route.... Very comfortable hotel here. Botteri[132] left an
élève here who knows something of botany, but lives out of reach on a
hacienda. We found a garden combined with a small coffee plantation. The
proprietor thereof, speaking a little French, has filled his ground with a lot
of things that will stand here. It is just in medias res, two hours below Tierra
Frias, two above (or at Cordoba, only seventeen miles, but 2,000 feet lower)
true tropical. Papaya fruits here, also Persea gratissima, etc. And the
oranges are delicious. I have passed the whole morning with the garden
man, while Farlow went up a small steep mountain, and brought back
various things. We shall drive this afternoon to the Cascade of Rincon
Grande (cascades are most rare in Mexico).
The air here suits us; shall try to leave our coughs here and at Cordoba
below.
On the way here had views of Popocatapetl and the more beautiful and
diversified Iztaccihuatl from the sides, and wound round the base of Mt.
Orizaba. A true Mexican town this. Mrs. Gray enjoying sights from the
window; will be able to drive out this afternoon, though the clouds are
sinking too much and mist gathering, a great contrast to the city of Mexico.
P. M.—We went, but saw the falls (very picturesque) in a wet mist, and
for botany got a lot of subtropical Mexican plants, the like of which I never
saw growing before: among Compositæ, Lagascea (large heads), Tree
Vernonias of the Scorpioides set, Calea, Andromachia, etc., etc.
Cordoba, March 2, 1885.
... To continue. On Saturday, a fine and sunny morning, Farlow and I
drove off for the Cascade of Barrio Nuevo, almost as beautiful as the other,
and had a long morning in clambering and collecting. In the grounds on the
way are planted trees of a Bombacea, in flower before the leaf, probably
Pachira. The peak of Orizaba shows as a narrow streak of white over a near
mountain, from the windows of our room; but by going half a mile east the
whole comes out splendidly.
Sunday morning we were comparatively quiet, but at 3.50 P.M. we were
off for Cordoba, less than an hour distant by rail, and 2,000 feet lower. A
queer little town, with only a poor, truly Mexican inn, a set of rooms in the
single story, all round a patio, into which the country diligence drives, and
on rear side the stables back against the rooms, as Farlow found to his
discomfort, only a thin wall between his room and the horse’s mangers. Tile
floors, cot-beds, but clean, and the food certainly better than was to be
expected.
Fine view of Orizaba. An American, Dr. Russell, here, whom I looked
up. And he took us to an American German, Mr. Fink, who collects
Orchids, etc., commercially. He took us to a garden, and we were going to
the river bank and ravine, but, though out of season, rain set in, and we
came home rather wet.
I fear our afternoon excursion may be lost, but it now looks like clearing.
The way from Orizaba here is magnificent, for mountains, railroad-
engineering, and culture vegetation. I hope we can get into some wild
tropical vegetation, but uncertain; can stay here only to-morrow at most. We
are cut off from news of all the world; little could we get in Mexico city;
less since....
You would be amused, as I have known you to be in Italy, at my knack
of explaining myself by gesture, and so getting on....
Lathrop, California, May 1, 1885.
We have only this morning left Rancho Chico and have set our faces
eastward. Waiting for our train I improve the rare bit of leisure to write a
line.
First of all, we are both well. No cough, however obstinate, could abide
this charming climate. And having no excuse for further stay we enter upon
the “beginning of the end” of a holiday which now only lacks ten days of
three months. What a pity to turn our backs on all the fruits we see growing
around us, having enjoyed only the cherries, which are just coming in. Well,
we have a basket of them, as big as plums, and so good! to solace the first
days of the desert part of our journey. We shall have desert enough on the
way home, as we cross Arizona and New Mexico by the Atlantic and
Pacific railway, through the northern part of those Territories (having come
out by the southern), a country quite new to us. How often we have wished
for you and Lady Hooker!
When and whence did I write you last? I think from Los Angeles and
before our trip to San Diego.
Instead of a short journey by sea (which my wife detests) we made a
long circumbendibus by rail to the southernmost town in California;
declined an invitation to go over the border into Mexican California; was, in
fact, too unwell to do anything in the field, and so, finding the coast too
cool and damp, returned, stopping two nights with Parish and wife, at their
little ranch at San Bernardino, in a dry and warm region, a charming valley
girt with high mountains, on the eastern side still snow-topped,—indeed
they are so most of the summer. Back thence to Los Angeles we soon went,
down to the port San Pedro, and took steamer for Santa Barbara, the very
paradise of California in the eyes of its inhabitants, and indeed of most
others. Our cruise of only eight hours on the Pacific was pleasant, and most
of it in daylight.
Arriving after dark, we found, to our surprise, the mayor of the little
town on the wharf with a carriage for our party (wife, Farlow, and self),
who drove us to the fine watering-place kind of hotel, and on being shown
at once to our rooms we found them all alight and embowered in roses, in
variety and superbness such as you never saw the beat of, not to speak of
Bougainvilleas, Tacsonias, and passion-flowers, Cape-bulbs in variety, etc.,
etc., and a full assortment of the wild flowers of the season. Mrs. Gray was
fairly taken off her feet. During the ten or eleven days we stayed, there were
few in which we were not taken on drives, the most pleasant and various.
The views, even from our windows, of sea and mountain and green hills
(for California is now verdant, except where Eschscholtzia and Bahias and
Layia, etc., and Lupines turn it golden or blue) were just enchanting; and on
leaving we were by good management allowed to pay our hotel bill.... Had
you been of the party I believe the good people would have come out with
oxen and garlands, and would hardly have been restrained.
Here we were driven out fifteen miles to one of the great ranches,—a
visit of two nights and a day,—that of Mr. Cooper, a very refined family;
the whole ranch flanked on the windward sides by eucalyptus groves,
apricots, almond, peach-trees, etc., by the dozens of acres; but the produce
on which the enthusiastic owner has set his heart is that of the olive, and he
makes the best of olive oil, and in a large way. Hollister’s ranch is still
larger, miles long every way; both reach from mountain-top to sea, and
have fine drives up cañons, in these fine oaks and plane-trees, occasionally
an Acer macrophyllum and an Alder. Avoiding the sea, which gives a short
route, we reached San Francisco by a lovely drive, in a hired wagon, over a
pass in the Santa Inez Mountains to the coast (south) at Ventura, and so up
the broad and long Santa Clara Valley to Newhall, on the Southern Pacific
railway, not very far above Los Angeles (two days’ drive, most pleasant),
then by rail overnight and to this place to breakfast, and on to San
Francisco.
We stopped this time at the Lick House, where we had, European-wise, a
room, not quite so good as we had at the Palace Hotel eight years ago, and
fed at the restaurant, very nice and reasonable, when we were not visiting or
invited out, which was most of the time. So it was not expensive, our room
(parlor, bedroom shutting off, and a bathroom) costing only about 12
shillings for us both. Harkness looks the same, but older; is absorbed in
fungology. Here again we were made much of for twelve days, most busy
ones. General McDowell, who you remember dined us at the “Palace,” is
ill; we saw him twice, and he has since so failed that we daily expect to hear
of the end.
May 4. In Farlie’s Chalet hotel in the Grand Cañon of the Colorado.
Dr. Brigham, you remember, who took us to the Chinese theatre, is now
married, and has three children by a bright wife, with a rich father, and a
handsome house, above Presidio,—a fine site, and filled with fine things
from all countries, and such a rose-garden; gave us a handsome dinner.
Alvord and wife (president now of Bank of California), noble people, did
wonders for us, and a dinner and drives. A lunch over at the university; and
another by General (commanding the Western Department in place of
McDowell, and in the choice house the latter built) and Mrs. Pope (she an
old acquaintance); then we went over to San Rafael, a night with the
Barbers, and next day a drive up behind Mount Tamalpais to the cañon
reservoir of water-works, and saw, at length (having failed on all former
visits), that huge Madroña (Arbutus Menziesii), like one of those great and
wide-spreading oaks you used to admire. Next day to Monterey, which we
saw nothing of on that hurried visit eight years ago, when our single day
was sacrificed to Hayden’s insane desire to see a coal mine on a bare hill!
Now there are eighteen miles of good drive around all Point Pinos and
through it, and Cupressus macrocarpa on the seaside verge, noble and
picturesque old trees, and no lack of young ones, a little back, and grand sea
and shore views.
On the other side of the town, in a grove of great live oaks and Pinus
insignis mixed, made into a beautiful park and park gardens, with a separate
railway station in the grounds, is the crack hotel of the Western coast, the
work of the Pacific Railway Company, which has also bought and appended
the whole of the pine grove, five or six miles long and two or three wide,
thus preserving Pinus insignis and the cypress, the latter much needing it.
Mr. and Mrs. Alvord, knowing our visit was to be, had telegraphed for
best rooms, and joined us unexpectedly; took us on the long drive the next
day, with four fine horses.... They showed us no end of kind attention.
At length we got off for a visit to Chico (leaving Farlow to apologize at
Santa Cruz, etc.), a quicker way than before, a steam-ferry across Suisin
Bay helping. And there we had a nice time indeed, from Saturday evening
to Friday morning, every day, drives and picnics, and botanizing, and
feeding on (besides strawberries) such cherries, just coming in in acres of
cherry-orchards, the only fruits yet in season. That big fig-tree, in the
branches of which I used to hide and feast, or rather cram, is bigger than
ever, but the figs green, to my sorrow. And we cannot wait for them.
General Bidwell[133] and wife have aged little in the eight years, are as
good as ever, full of all noble and good works, as well as of generous
hospitality; have taken wonderfully to botany; remember you most
affectionately and long for a real visit. His great ambition is to make drives,
good roads, through the ranch, for pleasure as well as use; he has now over
a hundred miles of them. That big oak[134] is finer than ever; not a dead
branch.
Well, off at length; at Lathrop joined our eastward train at evening; up
the San Joaquin valley all night, and had early morning for the wonderful
Tahachapi Pass. Breakfast at Mohave. (I must send you a railroad map.)
There took the Atlantic & Pacific Railroad, over the sandy desert to the
Great Colorado at supper, to Peach Spring station at two A.M., and next
morning in an easy “buckboard wagon” twenty-two miles and 4,000 feet
descent into this wonderful cañon, a piece of it, which its explorer, Major
Powell, has made famous.
This afternoon and evening we are to get up and back, and on in the
night and morning to Flagstaff, and the ancient cliff dwellings.
In the Cars, Kansas City, May 8, 1885.
Let me finish up these mems. We have now only a run of eleven hours to
St. Louis, where we stay three or four days with Dr. Engelmann (Jr.), and
then home.
The cañon trip well repaid the journey and its rough accessories. Some
of the views are of those depicted by Powell. We find that Tylor and
Moseley were here last year. As the man whom we had introductions to at
Flagstaff was absent for a day or two, though we found he had left
substitutes, and as we wanted to get home as soon as we could, we gave up
the visit to the cave and cliff dwellings. I dare say the models in clay, made
at Washington, are as good as the originals. So we came on, one and a half
nights and two days, and to-night we shall sleep in beds at St. Louis. We
bear this sort of travel quite well. From Mohave to the Colorado is very
sandy and complete desert, descending eastward many hundred feet. Near
Mohave lots of tree yuccas, looking very like those in northern part of
Mexico. From the Colorado to Peach Spring we passed in the dark, but had
risen to about 6,000 feet, and we kept on an elevation of 4,000 to nearly
8,000 feet all across the rest of Arizona and New Mexico, the higher parts
wooded with conifers, that is, Pinus ponderosa of the Rocky Mountains
form and Juniperus. At Las Vegas, New Mexico, we laid over one train, to
rest and visit the Hot Springs; no great to see, except a spick and span new
hotel, too fine for the place, and some very hot water.
Well, this trip, which will nearly round out to three and a half months,
has been long and enjoyable indeed.
At St. Louis will be letters, perhaps one from you.
Ever yours,
A. Gray.
Part of yesterday and last night was down along the Arkansas, the
reverse of our journey eight years ago. Country much settled up.
Cambridge, August 26, 1885.
... Charles Wright is dead, at seventy-three and a half; had been suffering
of heart-disease, went out to his barn, was missed as the evening drew on,
was found dead. So they go, one by one....
The summer is almost gone,—one hardly knows how,—but, then, we
have a longer and finer autumn than you have in England.
The five hundred copies which I printed in 1878 are gone. And, as I have
to print new copies, I take the opportunity to correct on the stereotype plates
when I can,—a great lot of wrong references to volume, page, plates,—that
is, such as we have found out. What a bother they are, and how impossible
to make correct in the first place, and to keep so through the printer’s
hands! Then there are lots of important corrections to make, and new
species and genera galore.
So,—in an evil moment, you will say—I set about a supplement to this
new issue,—also of the other part. For, as I have now brought out in the two
parts all the Gamopetalæ, and as I begin to doubt if I shall hold out to
accomplish much more, I thought it best to leave behind at least these in
good state. But it is no small job. And this, with the great amount of
herbarium work that goes along with it, or beside it, just uses up the
summer; for I dare guess it will keep me occupied all September....
The last news of you is a letter from your dear wife to mine,—giving
such a pleasant picture of the two boys, and of your enjoyment of them. You
say you are quite well, and Lady Hooker much the same,—which is
comforting. But you are naturally growing older, like myself. I tire sooner
than I used to do, and have not so sure a touch nor so good a memory. The
daily grind we both find more wearing....
We should like to come over to you once more,—but it seems less and
less practicable; unless I become actually unfit for work, and then I shall
not be worth seeing....
Your affectionate old friend,
A. Gray.
Old, indeed; the president of the Naturæ Curiosorum wrote me on
August 3 that I have been one of the curious for fifty years.

Dr. Gray wrote a notice of Charles Wright for the “American Journal of
Science,” in which he says that “Charles Wright was born at Wethersfield,
Connecticut; graduated at Yale in 1835. Had an early love for botany, which
may have taken him to the South as a teacher in Mississippi, whence he
went to Texas, joining the early immigration, and occupied himself
botanizing and surveying, and then again in teaching. He accompanied
various expeditions, and no name is more largely commemorated in the
botany of Texas, New Mexico, and Arizona than Charles Wright. It is an
acanthaceous genus of this district, of his own discovery, that bears the
name of Carlowrightia. Surely no botanist ever better earned such scientific
remembrance by entire devotion, acute observation, severe exertion, and
perseverance under hardship and privation.” He was engaged later for
several years “in his prolific exploration of Cuba.”
“Mr. Wright was a person of low stature and well-knit frame, hardy
rather than strong, scrupulously temperate, a man of simple ways, always
modest and unpretending, but direct and downright in expression, most
amiable, trusty, and religious. He accomplished a great amount of useful
and excellent work for botany in the pure and simple love of it; and his
memory is held in honorable and grateful remembrance by his surviving
associates.”[135]

TO JOHN H. REDFIELD.

Cambridge, November 3, 1885.


My dear Redfield,—I was interested in your Corema Con.
I have a remark to make on the last sentence of it; I would ask, How
could the plant have an introduction following the glacial period? And
where could it have come from?
Of course my idea is that it existed at the higher north before the glacial
period—that is my fad.
But one sees that this is one of a few plants that may be appealed to in
behalf of an Atlantis theory,—as coming across the Atlantic, making this
Corema a derivation from C. alba, of Portugal, or of its ancestor. But the
Atlantic is thought to be too deep for an Atlantis; and we do not need it
much.
What induces me to refer to your paragraph is to ask whether your
“following the glacial period,” that is, recent introduction, means in your
thought that our species is a direct descendant of Corema alba, which by
some chance got wafted across the Atlantic.
That is the most probable notion, next to my theory.
For consider, we know the genus only on these two opposite shores.
Perhaps—so far as I know, there is no more C. alba in the Old World
than C. Conradii in the New. And if it were in New England that the former
occurs, we could say that the Old World received the genus from the New—
via the Gulf Stream.
November 6.
... I start farther back than the retreat of the glaciers. I suppose that the
common ancestor of both Coremas was in the high north before the glacial
period, and that the two, in their limited but dissociated habitats, are what is
left after such vicissitudes!
In that view it does not matter how long New England coast was under
water. Our plant and its companions were then further south or west.
Yours ever,
A. Gray.
On the approach of Dr. Gray’s seventy-fifth birthday it was suggested
among the younger botanists that some tribute of love and respect should be
presented to him. Accordingly a letter was sent to all botanists whose
addresses could be obtained within the very limited time. A silver vase was
decided upon, and designs furnished, which were most happily and
beautifully carried out. The description, copied from the “Botanical
Gazette,” gives its size and decorations.
“It is about eleven inches high exclusive of the ebony pedestal, which is
surrounded by a hoop of hammered silver, bearing the inscription ‘1810,
November eighteenth, 1885—Asa Gray—in token of the universal esteem
of American Botanists.’
“The decoration of one side is Graya polygaloides, surrounded by
Aquilegia Canadensis, Centaurea Americana, Jeffersonia diphylla,
Rudbeckia speciosa, and Mitchella repens. On the other Shortia galacifolia,
Lilium Grayi, Aster Bigelovii, Solidago serotina, and Epigæa repens. The
lower part of the handles runs into a cluster of Dionæa leaves, which clasps
the body of the vase, and their upper parts are covered with Notholæna
Grayi. Adlumia cirrhosa trails over the whole background. The entire
surface is oxidized, which gives greater relief to the decorations.”
Greetings in the form of cards and letters, sent by those who gave the
vase, were placed on a silver salver accompanying the gift, with the
inscription, “Bearing the greetings of one hundred and eighty botanists of
North America to Asa Gray on his seventy-fifth birthday, November 18th,
1885.”
Dr. Gray was exceedingly touched and delighted, as well as
overwhelmed with surprise. And the day, with pleasant calls and
congratulations from friends and neighbors, gifts of flowers with warm and
kindly notes, was made a memorable one indeed.
His response to the senders of the vase was printed and sent to all who
could be reached.
Herbarium of Harvard University,
Cambridge, Mass., November 19, 1885.
To J. C. Arthur, C. R. Barnes, J. M. Coulter, Committee, and to the
numerous Botanical Brotherhood represented by them:
As I am quite unable to convey to you in words any adequate idea of the
gratification I received on the morning of the 18th inst., from the wealth of
congratulations and expressions of esteem and affection, which welcomed
my seventy-fifth birthday, I can do no more than render to each and all my
heartiest thanks. Among fellow-botanists, more pleasantly connected than
in any other pursuit by mutual giving and receiving, some recognition of a
rather uncommon anniversary might naturally be expected. But this full
flow of benediction, from the whole length and breadth of the land whose
flora is a common study and a common delight, was as unexpected as it is
touching and memorable. Equally so is the exquisite vase which
accompanied the messages of congratulation and is to commemorate them,
and upon which not a few of the flowers associated with my name or with
my special studies are so deftly wrought by art, that of them one may
almost say, “The art itself is nature.”
The gift is gratefully received, and it will preserve the memory to those
who come after us of a day made by you, dear brethren and sisters, a very
happy one to
Yours affectionately,
Asa Gray.

TO S. M. J.

November 19, 1885.


We meant our day to have been most quiet, and I completely and J.
largely were taken by surprise. So we had to send for two or three
neighbors, especially to see the vase.
J. will bring it in to you, no doubt, for she is very proud of it. The lines I
have already written have taken all the strength out of my right arm, but not
all the love out of my heart, of which a good share is yours.

TO W. M. CANBY.

Cambridge,
November 19, 1885.
My dear Canby,—Many thanks for your felicitations. There is much I
want to write, and to say what a surprise we had, and how perfect the vase
is. But my arm is worn out with note-writing.
Yours affectionately,
Asa Gray.
Two poems and a poetical epigram came among the rest!

TO SIR EDWARD FRY.

Cambridge,
January 31, 1886.
My dear Friend,—I am a laggard correspondent, I fear. Here are your two
most friendly and interesting letters, as far back as November, one of which
crossed, and one which announced, the reception of my long letter which
gave a sketch of our journeyings which began almost a year ago. For we are
now already in the middle of another winter. I doubt if we shall flee from
this one, although it has shown some severity. In the first place, we may
thankfully say that neither Mrs. Gray nor I can say that we require it; and I
cannot bear to lose the time: I seem to need the more of this as the stock
diminishes; for, somehow, I cannot get as much done in a day as I used to
do. Moreover, it is no good running away from winter unless you can go
far. For our southern borders have been unusually wintry, and they want our
guards and preparations against cold.... We were glad enough to get back to
our well and equably warmed house, where, indeed, we are most
comfortable.
You called my attention, I believe, to Professor Allen’s book on the
“Development of Christian Doctrine.” I take shame to myself that I did not
procure and read it. But I know its lines, and read some part of it before it
was in the book, and, of course, I like it much.
I am going, in a few days, to send you a little book, with similar
bearings, which I read in the articles of which it is made up. I think you will
find much of it interesting.
Bishop Temple’s “Bampton Lectures” seemed to me very good as far as
it went, but hardly came up to expectation.
I saw something of Canon Farrar when here. He pleased well, and I think
was well pleased; and personally he was very pleasing and lovable.
I wish more of the English Churchmen would visit us, and give more
time especially to the study of their own branch of the church in the United
States,—a very thriving one. I think they might learn much that would be
helpful and hopeful,—difficult as it may be to apply the experience and the
ways of one country to another.
I have seen, but not read, Mr. Forbes’s “Travels in Eastern Archipelago.”
Those who have read it here say it is very interesting. We have a great lot of
his dried plants from Sumatra and Java, unnamed, which at odd hours I am
arranging for the herbarium. I hope that in his new journey he will manage
to make better specimens. But, as he is primarily an entomologist, this can
hardly be expected. But, if I rightly understand, he goes out now with a
good backing and probably better conveniences for collecting than he could
have had before.
We have been, and still are, much interested in English politics and
election excitements. You are having very anxious times, indeed. What a
pity that some one party, that is, one of the two great parties, is not strong
enough, and homogeneous enough, to command the situation for the time
being, and to deal independently of Parnell, or, indeed, of Chamberlain....
We Americans are wonderfully peaceful—our only real questions now
pending are financial, and those not yet treated as they ought to be, on party
lines. We have an awful silver craze; but we hope to arrest it before it comes
to the worst, though sense and argument are at present ineffectual.
We have a comfortable trust in the principle that “Providence specially
protects from harm the drunken, the crazy, and the United States of
America.”
I see our friend Professor Thayer now and then. He is well and
flourishing. Mrs. Gray and I are very well indeed, and we send our most
cordial good wishes to you all.
Very sincerely yours,
Asa Gray.

TO J. D. HOOKER.

Cambridge, March 9, 1886.


When I read A. de Candolle’s notice of Boissier, I thought it was
“charming.” Anyhow, it brought back to me the charming memory of a very
lovable man. I dare say neither De Candolle nor I has done justice to
Boissier’s work. I could only touch and go,—make a picture that would just
sketch the kind of man he was.
... Yes, I have got on Ranunculaceæ, and have done up to and through
Ranunculus, minus the Batrachium set, of which happily we have few in
North America, that we know of. But having done some while ago the
Gamopetalæ of Pringle’s interesting North Mexican collection, I am now
switched off to the same in a hurried collection made by Dr. Palmer, in an
unvisited part of Chihuahua, in which very much is new. One after another
those Mocino[136] and Sessé plants turn up. Also those of Wislizenus,
whom the Mexicans for a time interned on the flanks of the Sierra Madre.
We are bound to know the botany of the parts of Mexico on our frontier,
and so must even do the work. Pringle goes back there directly, with
increased facilities, and will give special attention to the points of territory
which I regard as most hopeful.
Trelease,[137] our most hopeful young botanist,—established at St.
Louis,—is here for a part of the winter, to edit a collection of the scattered
botanical publications of Engelmann which Shaw pays for—or at least pays
for to a large extent. He would have the plates and figures, and that will
double the cost and the sum Shaw offered to provide. We may have to sell
some of the edition in order to recoup the charges....
Yes, you hit a blot. I can see to all my own books, such as the
“Synoptical Flora.” But, somehow, I cannot restrain the publishers from
altering the date of their title-pages when they print off a new issue from the
stereotype plates....
What do I call an alpine plant? Why, one that has its habitat above the
limit of trees—mainly—though it may run down lower along streams. But
in a dry region, where forest has no fair chance, we might need to mend the
definition.
Upon your paper, I got a few notes—offhand, by references.
I premise that in New England we have two places where several alpine
plants are stranded at lower levels than they ought, peculiar conditions of
configuration and shelter having preserved them, while the exposed higher
grounds have lost them. They are Willoughby Mountain and the Notch of
Mt. Mansfield, Vermont.
As to your III. Of the whole list of alpine plants of Oregon and
northward and not of California, I can put my hand upon only two that are
yet known in California, viz., Arenaria verna and Vaccinium cæspitosum,
which comes in its var. arbuscula only.
There is a great lack of alpine arctic plants in California. First, because
there is not much place for them now; secondly, because there have been
such terrible and vast volcanic deposits—lava and ashes—that they must
have been all killed out.
But for all these matters we shall one of these days have fuller and surer
data—after my day. Well, I must stop....

TO A. DE CANDOLLE.

Cambridge, June 29, 1886.


My dear De Candolle,—Your letter and inclosure of the 15th inst. gave
me much pleasure. Not only had I a natural curiosity to know more of
Coulter,[138] but also I find it important to know his routes in Mexico and
California.
At Los Angeles, last year, I fell in with one of the “old settlers” who
knew him, and who accompanied him on that expedition into the Arizona
desert on the lower Colorado. Mr. Ball will ascertain and let me know other
particulars of the man, and the date of his death, which probably occurred
not long after that last letter to you, from Paris.
In various ways I am convinced that I am on the verge of
superannuation. Still I work on; and now, dividing the orders with Mr.
Watson (who, though not young, is eight or ten years my junior), we are
working away at the Polypetalæ of the “Synoptical Flora of North
America,” with considerable heat and hope. But it is slow work!
Tuckerman, our lichenologist, has gone before us! I shall in a few days
send you a copy of the memorial of him which I contributed to the Council
report of the American Academy of Sciences and am having reprinted in the
“American Journal of Science” for July.
My wife is fairly well.... She is always busy; and we both enjoy life with
a zest, being in all respects very happily situated, particularly in having
plenty to do.
Let us hope that you may still be able to give us better accounts of
Madame de Candolle and of yourself; and believe me to be always,
Yours affectionately,
Asa Gray.

TO J. D. DANA.

Botanic Garden, Cambridge, Mass., September 20, 1886.


My dear Dana,—Well! “the books” have just come.
I suppose you are in no hurry for notices of them, and would prefer short
ones....
I rather like to do such things incog., as in the “Nation,” in which I
sometimes take a shot at this or that.
I and wife are well,—very.
Had a week in Old Oneida, which still looks natural. I am grinding away
at “Flora,” and probably shall be found so doing when I am called for.
Very well! I have a most comfortable and happy old age. Wishing you
the same,
Yours ever,
A. Gray.

TO J. D. HOOKER.

Cambridge, September 15, 1886.


... Has Ball returned to England? If so, please tell him that he promised
to look up in Dublin, and give from his own knowledge, some details of
Coulter’s life. Alphonse de Candolle has sent me copies of what letters he
has, and they enable me to trace Coulter’s movements and whereabouts,
which is helpful.
Old Goldie,[139] your father’s correspondent lang syne, died only this
summer, very old.
My last bit of work was upon our Portulacaceæ for my “Flora.” The
genera are thin. It is as much as one can do to keep up Montia (though if
that fails Claytonia should go to it rather than the contrary, by right,—but
convenience would call for the contrary), also Spraguea.
I have been having a holiday. A fortnight ago my wife and I set out;
made a visit to my natal soil, in the centre of the State of New York, in
Oneida County; had a gathering of the surviving members—most of them
—of the family, of which I am the senior,—two widowed sisters (one a
sister-in-law), there resident, and an older one who came with her husband
from Michigan; my oldest brother and family, who have the paternal
homestead; the unmarried sister, who passes all her winters with us;
children and some grandchildren. One brother, a lawyer in New York, and
residing near by in New Jersey, with wife and two boys, did not come.
Another absent nephew is in California, well settled there.
It is a pretty country, the upper valley of the Mohawk and of tributary
streams from the south, which interlock with tributaries of the
Susquehanna, at a height of 1,000 to 1,500 feet above tide-water, beautiful
rolling hills and valleys, fertile and well cultivated, more like much of rural
England than anything else you saw over here. We wished you and Lady
Hooker could have been with us in our drives. The summer air is just
delightful, soft and fresh.
On our return we struck off and visited my brother Joe and family, in the
environs of New York, and so came home much refreshed—though, indeed,
I hardly felt the need of a holiday.
Sargent has just started for a trip to the southern part of the mountains of
North Carolina,—a region we are fond of and long to show you.
Now I am going to pitch into Malvaceæ. I am quite alone. Goodale took
off Sereno Watson with him, on a slow steamer to Amsterdam; will run for
a fortnight or so over nearer parts of the Continent, and Watson will look in
at Kew. He was much worn down, and the rest and change will be good for
him. I have filled my sheet with this gossip.

It was during this visit that Dr. Gray, when the family gathered one
morning for breakfast, had disappeared. He came in smiling when the meal
was half over, and in answer to the anxious question where he had been,
said, “Oh, I have been to say to Mrs. Rogers that I forgave her for getting
above me in the spelling-class.”
Cambridge, October 31, 1886.
Dear Hooker,—Thanks for a nice long letter from Bournemouth,
September 27. Thanks, too, for the hope—though rather dim—that you and
wife may come over to us in the spring. Before winter is over we must
arrange some programme; for we four must meet again somehow and
somewhere, while in the land of the living. But how is a problem.
... I see how difficult it must be for you to get away as far as to us. Our
obstacle to any amount of strolling away is mainly the fear that if I interrupt
my steady work on the “Flora of North America,” I may not get back to it
again, or have the present zeal and ability for prosecuting it.
On the other hand, if I and my wife do not get some playdays now, while
we can enjoy them, the time will soon come when we shall have to say that
we have no pleasure in them. Therefore we are in sore straits.... If really you
cannot come, then we will brave out the winter here, as we did last winter
and are none the worse; then we will seriously consider whether Mahomet
shall go to the mountain, which will not come to Mahomet.
I grind away at “Flora,” but, like the mills of the gods, I grind slowly, as
becomes my age,—moreover, to continue the likeness, I grind too
“exceedingly fine,” being too finical for speed, pottering over so many
things that need looking into, and which I have not the discretion to let
alone. Consequently the grist of each day’s work is pitiably small in
proportion to the labor expended on it. I am now at Malvaceæ, which I once
enjoyed setting to rights, and of which the North American species have got
badly muddled since I had to do with them.
If Sereno Watson—who should be back again in twenty days—will only
go on with the Cruciferæ, which he has meddled with a deal, and then do
the Caryophyllaceæ, which are in like case, we may by March 1st have all
done up to the Leguminosæ.
We learn to-day, through a pamphlet sent by Miss Horner, that Bunbury
is dead—in June last....
Your “Primer”—new edition—has not come yet. Do not forget it. And
then, as my manner is, I will see if I can find fault with it. Same with
Bentham’s “Hand-book,” new edition....
I do not wonder that you are happy and contented. We should so like to
see father, mother, and children in their encampment at Sunningdale. May
plenty of sunshine be theirs!
Ball has sent me early sheets of his book. I must find time to go through
its pages.
The L.’s abroad, except the two girls (who are to winter at San Remo)
are now en voyage homeward. William, their father, has been painted by
Holl. He is a good subject. Saw your sister B. (and kind Lombe); she writes
a charming letter to my wife; seems to hold her own wonderfully.
Cambridge, November 22, 1886.
Well, I have got safely through my seventy-sixth birthday, which gives a
sort of assurance. I have always observed that if I live to November 18, I
live the year round!
You are working at Euphorbs, etc.; I at Malvaceæ, in which I find a good
deal to do for the species, and something for the betterment of genera....

TO SIR EDWARD FRY.

Cambridge, November 13, 1886.


My good Friend,—Let me turn for a moment to our quarter-millennial
celebration of the foundation of our university, though you in Europe may
count our antiquity as very modern. It was an affair of three days,
culminating on Monday last, and was altogether very pleasant. You will like
to know that among the honorary degrees given, was one to Professor
Allen, of the Episcopal Theological School here, in recognition of the
merits of his “Continuity of Religious Thought,” which work, I am glad to
remember, you much liked. The Mother Cambridge sent to us the master of
St. John’s, Dr. Taylor, and Professor Creighton, of Immanuel College, to
which the founders and first professors of Harvard belonged. Mrs.
Creighton came with him, and we found them pleasant people. I suppose
Lowell’s oration, Holmes’s poem, and the doings in general will be in print
before very long, and I shall not forget to send you a copy.
We have been away from Cambridge very little this last summer and
autumn, only on very short visits, or one rather longer one to my birthplace
in the central portion of New York, where we had a family gathering.
There is a lull just now in your political situation. I certainly at your last
election should have gone against Gladstone! How so many of my
countrymen—I mean thoughtful people—approve of home rule, i.e., of
semi-secession, I hardly understand. But local government as to local affairs
is our strength, and is what we are brought up to. Also, our safety is in that
the land—the agricultural land—is so largely owned by the tiller....
We should like to see old friends in England once more in the flesh, and
the feeling grows so that I may feign a scientific necessity, and we may, if
we live and thrive, cross over to you next summer. At least we dream of it,
though it may never come to pass.

TO J. D. HOOKER.

Cambridge, January 18, 1887.


My dear Hooker,—Glad to see the “Botanical Magazine” figure of
Nymphæa flava, t. 6917.
There is something not quite right in the history as you give it. Leitner
was the botanist who showed the plant to Audubon, and gave it the name
which Audubon cites, and he died—was killed by the Florida Indians
—“half a century ago.” He was the “a naturalist” you refer to.
The whole history and the mode of growth, stolons, etc., has been
repeatedly published here in the journals, etc. See Watson’s “Index”
Supplement, etc. Not that this is any matter, even about poor Leitner.
Cambridge, January 25, 1887.
... Yes, it has seemed to me clear that you could not cross the Atlantic at
present. And so it logically follows that we must.
I had been coming to this conclusion, and only the day before your letter
arrived my good wife and I had put our heads together and concluded that,
if nothing occurred meanwhile to prevent, we would cross over, say in
April. It is time we set about it, if we are ever to do it; and several things
seem to indicate that this is a more favorable time than we can expect later.
As this will be “positively Dr. Gray’s last appearance on your shores,”
we must make the most of it. Shall we have a Continental jaunt together, or
shall you be too much tied to home?
Meanwhile I must work hard and steadily....
As you “weed out” surplus of herbarium Kew, keep them for me. When I
come I will take care of them. It is (as usual) good of you to think of us.
You have done so for so long a time that it is only “second nature”—very
good nature too.
Williamson, plant-fossil, long ago begged us to come to British
Association at Manchester, and be his guests. If I do, what think you of my
preparing a paper for Botanical Section; and will you join me in it? two
venerables—anglice old fogies—on Nomenclature and Citations.
There are some points I should like to argue out and explain; to put on
record, though it may be of no use. Not that one wants to get up a
discussion in such a body—that would never do....
Cambridge, February 22.
Thank you for sending me your edition of Bentham’s “Handbook,”
which looks well in its more condensed shape, and in which I dare say you
have put a good deal of conscientious work. But it seems to me that Reeve
& Company give it poor type and paper.
I am putting through a rehash of my “Lessons in Botany,”[140] more
condensed, yet fuller, and with a new name. This, with the companion book,
which I must live to do over, Deo favente, is the principal thing for bread,
and I need it for an endowment to keep up the herbarium here, after my
time.
Well,—don’t speak of it aloud,—we have secured our passages for April
7, and if I can get present work off my hands in time, we may be on your
soil soon after Easter.
You may imagine me very busy, indeed.
Yours affectionately,
A. Gray.
Dr. Gray, with Mrs. Gray, landed in England, April 18, and went from
Liverpool to stay at Sunningdale with Sir Joseph and Lady Hooker, where a
quiet, restful week was most pleasantly passed. He went to London the first
of May for a few days, meeting again old friends, dining with them, and
dropping in for calls, “to report himself,” as he said. He did a little work at
Kew, going back and forth; then crossed to Paris, finding at the Jardin des
Plantes what he had especially wanted to see, Lamarck’s herbarium, which
had been acquired since he was last there. It completed satisfactorily his
studies in Asters, as he had now seen everything of the genus to be found in
herbaria of importance.
A journey in Normandy with Sir J. D. Hooker had been planned for May,
but Sir Joseph was unable to leave England, so Dr. Gray arranged to go to
Vienna. He greatly enjoyed the railroad journey from Bâle, in May, the
fruit-trees white with blossoms about Lake Zurich, then the wilder
mountain scenery, and Salzburg, all bringing back the memories of his first
European journey forty-eight years before.

TO A. DE CANDOLLE.

Herbarium, Kew, April 23, 1887.

My dear De Candolle,—You will be a little surprised at the sudden
transfer of Mrs. Gray and myself to England; but I wanted a vacation and
one more bit of pleasant travel with Mrs. Gray while we are both alive and
capable of enjoying it. Whether I shall look in upon you at Geneva is
doubtful, but it may be, even for a moment. We never expect to have
repeated the pleasant week at Geneva of the spring of 1881.
We expect to go to Paris early in May, but subsequent movements are
uncertain.
Always, dear De Candolle, affectionately yours,
Asa Gray.

TO ——.
May 15, 1887.
I think the journey from Bâle, in Switzerland, to Salzburg was
wonderfully fine and a great success, and that May is a good time to do it,
while there is plenty of snow in the mountains. Lake Wallenstadt showed to
great advantage. And I had no idea that the pass of the Arlberg, from
Feldkirk to Innspruck, was so high or so very fine. I believe it is the highest
railway pass across the Alps. I was quite unprepared (which was all the
better) for the exquisite and wild, and in parts grand, scenery of the next
day’s journey through the heart of Lower Tyrol and the Salzburg
Salzkammergut, by a slower train, a roundabout road making more than
twice the direct distance from Innspruck to Salzburg, through the Zillerthal
and over a fairly high pass on to the upper part of the Salzach, and down it
through some wild cañons into the plain, from nine A.M. till five, of choicest
scenery. The great castle, so picturesquely placed in the Lichtenstein
(plain), is Schloss-Werden. Rainy day at Salzburg, or should have had noble
views. If the weather had been good, I think we would have driven from
Salzburg to Ischl, and then come by the Traunsee to Linz. But after all, from
my remembrance, it would hardly have come up to what we had already
seen. And though it was a rainy day for the Danube, we did see everything
pretty well, and most comfortably, in the ladies’ cabin of the steamer, with
windows all round the three sides, and most of the time the whole to
ourselves, or with only one quiet lady, who evidently cared nothing for the
views. J. says I was bobbing all the time from one side to the other. I was
looking out for the views which I had when going up the Danube forty-eight
years ago. J. thinks it not equal to the Rhine, but there is rather more
of it, or scattered over more space.

TO SIR J. D. HOOKER.

Hôtel Beau Rivage, Geneva,


May 24, 1887.
I do believe we shall have to return to America to thaw out. Here we
arrive in Geneva this morning, full of memories of delightful summer, ten
days earlier than this in 1881, to find snow down even to foothills of the
Jura and on Mont Salève; it came two days ago, and the air, though clear, is
very chilly, which is not to my liking.
Vienna was much better, excepting our last day, which had a cold and
high wind, and our night journey to Munich was cold and comfortless, in