Statistics for Biology and Health
Daniel Sorensen
Statistical
Learning in
Genetics
An Introduction Using R
Statistics for Biology and Health
Series Editors
Mitchell Gail, Division of Cancer Epidemiology and Genetics, National Cancer
Institute, Rockville, MD, USA
Jonathan M. Samet, Department of Environmental & Occupational Health,
University of Colorado Denver - Anschutz Medical Campus, Aurora, CO, USA
Statistics for Biology and Health (SBH) includes monographs and advanced
textbooks on statistical topics relating to biostatistics, epidemiology, biology, and
ecology.
Daniel Sorensen
Statistical Learning
in Genetics
An Introduction Using R
Daniel Sorensen
Aarhus University
Aarhus, Denmark
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book evolved from a set of notes written for a graduate course on Likelihood
and Bayesian Computations held at Aarhus University in 2016 and 2018. The
audience was life-science PhD students and post-docs with a background in either
biology, agriculture, medicine or epidemiology, who wished to develop analytic
skills to perform genomic research. This book is addressed to this audience of
numerate biologists, who, despite an interest in quantitative methods, lack the
formal mathematical background of the professional statistician. For this reason,
I offer considerably more detail in explanations and derivations than may be needed
for a more mathematically oriented audience. Nevertheless, some mathematical
and statistical prerequisites are needed in order to extract maximum benefit from
the book. These include introductory courses on calculus, linear algebra and
mathematical statistics, as well as a grounding in linear and nonlinear regression and
mixed models. Applied statistics and biostatistics students may also find the book
useful, but may wish to browse hastily through the introductory chapters describing
likelihood and Bayesian methods.
I have endeavoured to write in a style that appeals to the quantitative biologist,
while remaining concise and using examples profusely. The intention is to cover
ground at a good pace, facilitating learning by interconnecting theory with examples
and providing exercises with their solutions. Many exercises involve programming
with the open-source package R, a statistical software that can be downloaded and
used with the free graphical user interface RStudio. Most of today’s students
are competent in R and there are many tutorials online for the uninitiated. The
R-code needed to solve the exercises is provided in all cases and is written,
with few exceptions, with the objective of being transparent rather than efficient.
The reader has the opportunity to run the codes and to modify input parameters
in an experimental fashion. This hands-on computing contributes to a better
understanding of the underlying theory.
The first objective of this introduction is to provide readers with an understanding
of the techniques used for analysis of data, with emphasis on genetic data. The
second objective is to teach them to implement these techniques. Meeting these
objectives is an initial step towards acquiring the skills needed to perform data-
how a measure of uncertainty can be attached to accuracy. The body of the chapter
deals with prediction from a classical/frequentist perspective. Bayesian prediction
is illustrated in several examples throughout the book and particularly in Chap. 10.
In Chap. 6, many important ideas related to prediction are illustrated using a simple
least-squares setting, where the number of records n is larger than the number of
parameters p of the model; this is the n > p setup. However, in many modern
genetic problems, the number of parameters greatly exceeds the number of records;
the p ≫ n setup. This calls for some form of regularisation, a topic introduced
in Chap. 7 under the heading Shrinkage Methods. After an introduction to ridge
regression, the chapter provides a description of the lasso (least absolute shrinkage
and selection operator) and of a Bayesian spike and slab model. The spike and
slab model can be used for both prediction and for discovery of relevant covariates
that have an effect on the records. In a genetic context, these covariates could be
observed genetic markers and the challenge is how to find as many promising mark-
ers among the hundreds of thousands available, while incurring a low proportion
of false positives. This leads to the topic reviewed in Chap. 8: False Discovery
Rate. The subject is first presented from a frequentist perspective as introduced
by Benjamini and Hochberg in their highly acclaimed work, and is also discussed
using empirical Bayesian and fully Bayesian approaches. The latter is implemented
within an McMC environment using the spike and slab model as driving engine.
The complete marginal posterior distribution of the false discovery rate can be
obtained as a by-product of the McMC algorithm. Chapter 9 describes some of
the technical details associated with prediction for binary data. The topics discussed
include logistic regression for the analysis of case-control studies, where the data are
collected in a non-random fashion, penalised logistic regression, lasso and spike and
slab models implemented for the analysis of binary records, area under the curve
(AUC) and prediction of a genetic disease of an individual, given information on
the disease status of its parents. The chapter concludes with an appendix providing
technical details for an approximate analysis of binary traits. The approximation
can be useful as a first step, before launching the full McMC machinery of a more
formal approach. Chapter 10 deals with Bayesian prediction, where many of the
ideas scattered in various parts of the book are brought into focus. The chapter
discusses the sources of uncertainty of predictors from a Bayesian and frequentist
perspective and how they affect accuracy of prediction as measured by the Bayesian
and frequentist expectations of the sample mean squared error of prediction. The
final part of the chapter introduces, via an example, how specific aspects of a
Bayesian model can be tested using posterior predictive simulations, a topic that
combines frequentist and Bayesian ideas. Chapter 11 completes Part II and provides
an overview of selected nonparametric methods. After an introduction of traditional
nonparametric models, such as the binned estimator and kernel smoothing methods,
the chapter concentrates on four more recent approaches: kernel methods using basis
expansions, neural networks, classification and regression trees, and bagging and
random forests.
Part III of the book consists of exercises and their solutions. The exercises
(Chap. 12) are designed to provide the reader with deeper insight of the subject
discussed in the body of the book. A complete set of solutions, many involving
programming, is available in Chap. 13.
The majority of the datasets used in the book are simulated and intended to illustrate
important features of real-life data. The size of the simulated data is kept within the
limits necessary to obtain solutions in reasonable CPU time, using straightforward
R-code, although the reader may modify size by changing input parameters.
Advanced computational techniques required for the analysis of very large datasets
are not addressed. This subject requires a specialised treatment beyond the scope of
this book.
The book has not had the benefit of having been used as material in repeated
courses by a critical mass of students, who invariably stimulate new ideas, help with
a deeper understanding of old ones and, not least, spot errors in the manuscript and
in the problem sections. Despite these shortcomings, the book is completed and out
of my hands. I hope the critical reader will make me aware of the errors. These
will be corrected and listed on the web at https://github.com/SorensenD/SLGDS.
The GitHub site also contains most of the R-codes used in the book, which can be
downloaded, as well as notes that include comments, clarifications or additions of
themes discussed in the book.
Many friends and colleagues have assisted in a variety of ways. Bernt Guldbrandtsen
(University of Copenhagen) has been a stable helping hand and helping mind.
Bernt has generously shared his deep biological and statistical knowledge with
me on many, many occasions, and also provided endless advice with LaTeX and
Markdown issues and with programming details, always with good spirits and patience.
I owe much to him. Ole Fredslund Christensen (Aarhus University) read several
chapters and wrote a meticulous list of corrections and suggestions. I am very
grateful to him for this effort. Gustavo de los Campos (Michigan State University)
has shared software codes and tricks and contributed with insight in many parts
of the book, particularly in Prediction and Kernel Methods. I have learned much
during the years of our collaboration. Parts of the book were read by Andres Legarra
(INRA), Miguel Pérez Enciso (University of Barcelona), Bruce Walsh (University
of Arizona), Rasmus Waagepetersen (Aalborg University), Peter Sørensen (Aarhus
University), Kenneth Enevoldsen (Aarhus University), Agustín Blasco (Universidad
Politécnica de Valencia), Jens Ledet Jensen (Aarhus University), Fabio Morgante
(Clemson University), Doug Speed (Aarhus University), Bruce Weir (University
of Washington), Rohan Fernando (retired from Iowa State University) and Daniel
Gianola (retired from the University of Wisconsin-Madison). I received many
helpful comments, suggestions and corrections from them. However, I am solely
responsible for the errors that escaped scrutiny. I would be thankful if I could be
made aware of these errors.
I acknowledge Eva Hiripi, Senior Editor, Statistics Books, Springer, for consis-
tent support during this project.
I am the grateful recipient of many gifts from my wife Pia. One has been essential
for concentrating on my task: happiness.
Contents
1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Sampling Distribution of a Random Variable . . . . . . . . . . . . . . . . . . 3
1.3 The Likelihood and the Maximum Likelihood Estimator . . . . . . . . . . 5
1.4 Incorporating Prior Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Frequentist or Bayesian? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7 Appendix: A Short Overview of Quantitative Genomics. . . . . . . . . . . 32
Part II Prediction
6 Fundamentals of Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.1 Best Predictor and Best Linear Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.2 Estimating the Regression Function in Practice: Least Squares . . . 263
6.3 Overview of Things to Come . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.4 The Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
6.5 Estimation of Validation MSE of Prediction in Practice . . . . . . . . . . . 280
6.6 On Average Training MSE Underestimates Validation MSE . . . . . . 284
6.7 Least Squares Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7 Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
7.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
7.2 The Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.3 An Extension of the Lasso: The Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . 319
7.4 Example: Prediction Using Ridge Regression and Lasso . . . . . . . . . . 319
7.5 A Bayesian Spike and Slab Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
8 Digression on Multiple Testing: False Discovery Rates. . . . . . . . . . . . . . . . . 333
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.2 Preliminaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
Author Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
Chapter 1
Overview
1.1 Introduction
10. Are there other non-genetic factors that affect the traits, such as smoking
behaviour, alcohol consumption, blood pressure measurements, body mass
index and level of physical exercise?
11. Could the predictive ability of the genetic score be improved by incorporation
of these non-genetic sources of information, either additively or considering
interactions? What is the relative contribution from the different sources of
information?
The first question has been the focus of quantitative genetics for many years,
long before the so-called genomic revolution, that is, before breakthroughs
in molecular biology made technically and economically possible the sequencing of
whole genomes, resulting in hundreds of thousands or millions of genetic markers
(single nucleotide polymorphisms (SNPs)) for each individual in the data set. Until
the end of the twentieth century before dense genetic marker data were available,
genetic variation of a given trait was inferred using resemblance between relatives.
This requires equating the expected proportion of genotypes shared identical
by descent, given a pedigree, with the observed phenotypic correlation between
relatives. The fitted models also retrieve “estimates of random effects”, the predicted
genetic values that act as genetic scores and are used in selection programs of farm
animals and plants.
Answers to questions 2–7 would provide insight into genetic architecture and
thereby, into the roots of many complex traits and diseases. This has important
practical implications for drug therapies targeted to particular metabolic pathways,
for personalised medicine and for improved prediction. These questions could not
be sensibly addressed before dense marker data became available (perhaps with
the exception provided by complex segregation analysis that allowed searching for
single genes).
Shortly after a timid start where use of low-density genetic marker information
made its appearance, the first decade of the twenty-first century saw the construction
of large biomedical databases that could be accessed for research purposes where
health information was collected. One such database was the British 1958 cohort
study including medical records from approximately 3000 individuals genotyped
for one million SNPs. These data provided for the first time the opportunity to begin
addressing questions 2–7. However, a problem had to be faced: how to fit and
validate a model with one million unknowns to a few thousand records and how to
find a few promising genetic markers from the million available avoiding a large
proportion of false positives? This resulted in a burst of activity in the fields of
computer science and statistics, leading to development of a methodology designed
to meet the challenges posed by Big Data.
In recent years, the amount of information in modern data sets has
grown and become formidable and the challenges have not diminished. One
example is the UK Biobank that provides a wealth of health information
from half a million UK participants. The database is regularly updated and
a team of scientists recently reported that the complete exome sequence was
completed (about 2% of the genome involved in coding for proteins and
Fig. 1.1 Left: binomial probability distribution with parameters n = 20, θ = 0.1; right: binomial
probability distribution with parameters n = 100, θ = 0.1
the random variable X and its realised value, x. This distinction is not necessarily
followed throughout the book).
The probability distribution in the right panel of Fig. 1.1 is more symmetrical
than in the left panel. This is due to the different sample sizes n. As sample size
increases further, X will approach its limiting distribution which is the normal
distribution by virtue of the central limit theorem.
Consider now viewing (1.1) in a different manner, whereby x and n are fixed and
θ varies. To be specific, assume that the sample size is n = 27 and the number
of copies of allele A in the sample is x = 11. One can plot the probability of
obtaining x = 11 copies of A in a sample of size n = 27, for all permissible values
of θ as in Fig. 1.2. For example, for θ = 0.1, Pr(X = 11|n = 27, θ = 0.1) =
0.242 × 10^{-4} and for θ = 0.6, Pr(X = 11|n = 27, θ = 0.6) = 0.203 × 10^{-1}. This
plot is the likelihood function for θ, L(θ|x, n), and the value of θ that maximises
this function is known as the maximum likelihood estimate of θ (I will use MLE
short for maximum likelihood estimator or maximum likelihood estimate and ML
for maximum likelihood).
Carrying out the differentiation and setting the result equal to zero shows that the
MLE of θ must satisfy

\frac{x}{\theta} - \frac{n - x}{1 - \theta} = 0,

which, in the case of the example, with x = 11 and n = 27, gives θ̂ = 0.41.
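As a hands-on check, the likelihood can be evaluated over a grid of values of θ and maximised numerically. The following R sketch (not the book's code, and the grid spacing is an arbitrary choice) reproduces the values quoted above.

# EVALUATE THE BINOMIAL LIKELIHOOD OVER A GRID OF theta AND LOCATE ITS MAXIMUM
x <- 11; n <- 27
theta <- seq(0.001, 0.999, by = 0.001)
lik <- dbinom(x, size = n, prob = theta)   # L(theta | x, n)
theta[which.max(lik)]                      # APPROX. 0.41 = x/n
dbinom(x, n, 0.1)                          # APPROX. 0.242e-4
dbinom(x, n, 0.6)                          # APPROX. 0.203e-1
plot(theta, lik, type = "l", xlab = "theta", ylab = "Likelihood")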
Usually, one needs to quantify the degree of uncertainty associated with an estimate.
In classical likelihood, the uncertainty is described by the sampling distribution of
the MLE. In the case of the example, the sampling distribution of .θ̂ is the probability
distribution of this estimator obtained by drawing repeated binomial samples of
fixed size n, with the probability parameter fixed at its MLE, .θ = θ̂ . The MLE is
computed in each sample and the sampling distribution of the MLE is characterised
by these estimates.
In this binomial example, the sampling distribution of .θ̂ is known exactly; it is
proportional to a binomial distribution (since X is binomial and n is fixed). The
small sample variance of the maximum likelihood estimator is
\mathrm{Var}(\hat{\theta}) = \mathrm{Var}\left(\frac{X}{n}\right) = \frac{\theta(1 - \theta)}{n}. \qquad (1.3)
The parameter θ is typically not known and is replaced by the MLE θ̂. Then

\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{\hat{\theta}(1 - \hat{\theta})}{n}.
In many cases, the MLE does not have a closed form and the small sample
variance is not known. One can then appeal to large sample properties of MLE; one
Fig. 1.3 Left: histogram of the Monte Carlo distribution of the MLE for the binomial model,
with n = 27, θ = 0.41. Right: histogram of the Monte Carlo distribution of the MLE for the
binomial model, with n = 100, θ = 0.41. The overlaid normal curves represent the asymptotic
approximation of the distribution of the MLE
of these is that, asymptotically, the MLE is normally distributed, with mean equal
to the parameter and variance given by minus the inverse of the second derivative
of the loglikelihood evaluated at θ = θ̂. The second derivative of the loglikelihood
is

\frac{\partial^2 \ell(\theta|x,n)}{(\partial\theta)^2} = \frac{\partial^2}{(\partial\theta)^2}\left[\log\binom{n}{x} + x\log(\theta) + (n - x)\log(1 - \theta)\right] = -\frac{x}{\theta^2} - \frac{n - x}{(1 - \theta)^2}.
In this expression, substituting θ with the MLE θ̂ and taking a reciprocal
yields

-\left[\frac{\partial^2 \ell(\theta|x,n)}{(\partial\theta)^2}\bigg|_{\theta=\hat{\theta}}\right]^{-1} = \frac{\hat{\theta}(1 - \hat{\theta})}{n} = \widehat{\mathrm{Var}}(\hat{\theta}) \approx 0.009. \qquad (1.4)
In this simple example, the asymptotic variance agrees with the small sample
variance. An approximate 95% confidence interval for θ based on asymptotic
theory is

\hat{\theta} \pm 1.96\sqrt{\widehat{\mathrm{Var}}(\hat{\theta})} = 0.41 \pm 1.96\sqrt{0.009} \approx (0.22, 0.60). \qquad (1.5)

This means that there is a 95% probability that this interval contains the true
parameter θ. The “probability” is interpreted with respect to a set of hypothetical
repetitions of the entire data collection and analysis procedure. These repetitions
consist of many random samples of data drawn under the same conditions and
where a confidence interval is computed for each sample. The random variable
is the confidence interval that is computed for each sample; in 95 intervals out
of 100 (in a 95% confidence interval), the interval will contain the unobserved θ.
Figure 1.3 (left) shows the result of simulating 100,000 times from a binomial
distribution with n = 27 and θ = 0.41, computing the MLE in each replicate and
plotting the distribution as a histogram. This represents the (small sample) Monte
Carlo sampling distribution of the MLE. Overlaid is the asymptotic distribution of
the MLE, which is normal with mean 0.41 and variance given by (1.4), equal to 0.009.
The right panel of Fig. 1.3 displays the result of a similar exercise with n = 100 and
θ = 0.41. The fit of the asymptotic approximation is better with the larger sample
size.
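The simulation behind this comparison can be sketched as follows (the seed and plotting details are my assumptions, not the author's code).

# MONTE CARLO SAMPLING DISTRIBUTION OF THE MLE VERSUS THE ASYMPTOTIC NORMAL
set.seed(771)
n <- 27; theta <- 0.41; nrep <- 100000
mle <- rbinom(nrep, size = n, prob = theta) / n   # MLE x/n IN EACH REPLICATE
var(mle)                                          # CLOSE TO theta*(1-theta)/n = 0.009
hist(mle, freq = FALSE, breaks = 30, xlab = "MLE of theta", main = "")
curve(dnorm(x, mean = theta, sd = sqrt(theta * (1 - theta) / n)),
      add = TRUE, col = "red")                    # ASYMPTOTIC APPROXIMATION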
A glance at a standard calculus book reveals that the curvature of a function f at
a point θ is given by

c(\theta) = \frac{f''(\theta)}{\left[1 + f'(\theta)^2\right]^{3/2}}.

In the present case, the function f is the loglikelihood ℓ, whose first derivative
evaluated at θ = θ̂ is equal to zero. The curvature of the loglikelihood at θ = θ̂
is

c(\hat{\theta}) = \ell''(\hat{\theta}).

(I use the standard notation \ell''(\hat{\theta}) for \frac{\partial^2 \ell(\theta|x,n)}{(\partial\theta)^2}\big|_{\theta=\hat{\theta}}.)
Note As the loglikelihood increases or decreases, so does the likelihood; therefore,
the value of the parameter that maximises one also maximises the other. Working
with the loglikelihood is to be preferred to working with the likelihood function
because it is easier to differentiate a sum than a product. The curvature of the
loglikelihood at the MLE is related to the sample variance of the MLE. This last
point is illustrated in Fig. 1.4. As n increases from 27 to 100, the likelihood function
becomes sharper and more concentrated about the MLE.
Imagine that there is prior information about the frequency .θ of allele A from
comparable populations. Bayesian methods provide a natural way of incorporating
such prior information into the model. This requires eliciting a prior distribution
for .θ that captures what is known about .θ before obtaining the data sample.
This prior distribution is combined with the likelihood (which, given the model,
contains all the information arising from the data) to form the posterior dis-
tribution that is the basis for the Bayesian inference. Specifically using Bayes
theorem:

p(\theta|x, n) \propto p(\theta)\, L(\theta|x, n), \qquad (1.7)

indicating that the posterior density is proportional to the prior density times the
likelihood. Probability statements about θ require scaling (1.7). This involves
dividing the right-hand side of (1.7) by
This example is adapted from Albert (2009). Continuing with the binomial model,
a simple approach to incorporate prior information on θ is to write down possible
values and to assign weights to these values. A list of possible values of θ could be

0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, \qquad (1.10)

with corresponding weights

1.0, 5.2, 8.0, 7.2, 4.6, 2.1, 0.7, 0.1, 0.0, 0.0,

that are converted to probabilities dividing each weight by the sum; this gives the
prior probability distribution of θ

0.04, 0.18, 0.28, 0.25, 0.16, 0.07, 0.02, 0.00, 0.00, 0.00. \qquad (1.11)
The likelihood for θ is proportional to (1.1). The combinatorial term does not
contain information about θ, so one can write

L(\theta|x = 11, n = 27) \propto \theta^{11}(1 - \theta)^{16}.

Evaluating this expression for all the possible values of θ in (1.10) yields a list of
ten numbers (too small to be written down here). Label these ten numbers (1.12).
To obtain the posterior (1.7), the terms in (1.11) are multiplied by the corresponding
term in (1.12). For example, the first term of (1.11) is multiplied by the first term
of (1.12), the second by the second, and so on with the remaining eight terms. After
scaling with the sum, the posterior probabilities are

0.00, 0.00, 0.13, 0.48, 0.33, 0.06, 0.00, 0.00, 0.00, 0.00.
Based on these posterior probabilities, the posterior mean of θ is 0.38, and the
probability that θ falls in the set {0.25, 0.35, 0.45} is 0.94.
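The discrete-prior computation can be reproduced with a few lines of R (a sketch, not the book's code).

# POSTERIOR FROM A DISCRETE PRIOR: PRIOR TIMES LIKELIHOOD, THEN SCALING
theta  <- seq(0.05, 0.95, by = 0.1)              # GRID OF VALUES (1.10)
weight <- c(1.0, 5.2, 8.0, 7.2, 4.6, 2.1, 0.7, 0.1, 0.0, 0.0)
prior  <- weight / sum(weight)                   # PRIOR PROBABILITIES (1.11)
lik    <- dbinom(11, size = 27, prob = theta)    # LIKELIHOOD AT EACH GRID VALUE
post   <- prior * lik / sum(prior * lik)         # SCALED POSTERIOR
round(post, 2)
sum(theta * post)                                # POSTERIOR MEAN, APPROX. 0.38
sum(post[3:5])                                   # Pr(theta IN {0.25,0.35,0.45}), APPROX. 0.94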
cumulative distribution functions (cdf) of the beta distribution for θ = 0.25 and
for θ = 0.45, respectively (the cdf is F(x; a, b) = Pr(X ≤ x; a, b)). Then the
parameters a and b of the beta distribution that match the prior probabilities (1.11)
can be found by minimising the function

[F(0.25; a, b) - 0.50]^2 + [F(0.45; a, b) - 0.91]^2

with respect to a and b. This can be achieved using the function optim in R as
indicated in the following code:
mod <- function(par){
a <- par[1]
b <- par[2]
# SQUARED DISTANCE BETWEEN THE BETA CDF AND THE TARGET PRIOR PROBABILITIES
fct <- (pbeta(0.25,a,b)-0.5)^2 + (pbeta(0.45,a,b)-0.91)^2
return(fct)
}
res <- optim(par=c(3,3),mod) # MINIMISE OVER (a,b), STARTING VALUES (3,3)
res$par
The function returns a = 2.90, b = 8.05. As a check, one can compute the
cumulative distribution functions:
pbeta(0.45,2.9,8.05)
## [1] 0.9099298
pbeta(0.25,2.9,8.05)
## [1] 0.4995856
Figure 1.5 displays the discrete prior defined by (1.10) and (1.11) and the prior
based on Be(2.90, 8.05).
Fig. 1.5 Left: discrete prior distribution defined by (1.10) and (1.11). Right: beta prior
Be(2.90, 8.05)
The likelihood is

L(\theta|x, n) \propto \theta^{x}(1 - \theta)^{n-x}, \qquad (1.13)

and the beta prior density is

p(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1 - \theta)^{b-1}, \quad \theta \in [0, 1],\ a, b > 0,
Figure 1.6 displays plots of the prior, the likelihood and the posterior for the
example. The posterior distribution is sharper than the prior distribution, and its
probability mass is concentrated between that of the prior distribution and the
likelihood. In Bayesian inference, the posterior distribution provides the necessary
information for drawing conclusions about θ. For example, the mean, mode and
median of (1.14) are 0.37, 0.36, 0.36. The 95% posterior interval for θ is
(0.24, 0.50). The inference using the Bayesian approach is sharper than
that based on the likelihood (1.5). Further, the frequentist confidence interval and
the Bayesian posterior intervals have different interpretations. In the latter, the
Fig. 1.6 The prior density Be(2.90, 8.05), the likelihood Be(12, 17) and the posterior density
Be(13.90, 24.05) of the probability θ
confidence interval is fixed and the associated probability is the probability that
the true parameter falls in the interval.
Note If a prior for θ is Be(a, b), then p(θ) ∝ θ^{a-1}(1 − θ)^{b-1}. The
likelihood of the binomial model is proportional to θ^{x}(1 − θ)^{n-x}, which is
the kernel of Be(x + 1, n − x + 1). The posterior is then proportional to
θ^{a+x-1}(1 − θ)^{b+n-x-1}, which is the kernel of Be(a + x, b + n − x).
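A quick numerical check of this conjugate update for the example (prior Be(2.90, 8.05) with x = 11, n = 27) can be written as follows; this is a sketch, not code from the book.

# CONJUGATE BETA-BINOMIAL UPDATE AND POSTERIOR SUMMARIES
a <- 2.90; b <- 8.05; x <- 11; n <- 27
a.post <- a + x; b.post <- b + n - x         # POSTERIOR Be(13.90, 24.05)
a.post / (a.post + b.post)                   # POSTERIOR MEAN, APPROX. 0.37
(a.post - 1) / (a.post + b.post - 2)         # POSTERIOR MODE, APPROX. 0.36
qbeta(0.5, a.post, b.post)                   # POSTERIOR MEDIAN, APPROX. 0.36
qbeta(c(0.025, 0.975), a.post, b.post)       # CENTRAL 95% POSTERIOR INTERVAL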
Fig. 1.7 The densities .Be(0.5, 0.5), .Be(1, 1) and .Be(2, 2) to model prior information about .θ
Fig. 1.8 The three posterior distributions corresponding to the three priors of Fig. 1.7 when
n = 27, x = 11
Inferences drawn in the examples above were exact. This is possible when the
analytical form of the posterior distribution is known and features from it (such as
probability intervals and moments) can be calculated. In principle, features from any
posterior distribution can also be obtained using samples drawn from it, making use
of standard theorems from the time series literature. For example, any function of
the random variable X, h(X), with finite expectation E(h(X)), can be approximated
by

E(h(X)) \approx \frac{\sum_{i=1}^{N} h(x_i)}{N}, \qquad (1.16)

where x_1, ..., x_N are draws from the distribution of X. In the example, using
100,000 draws from the posterior distribution,

E(X) \approx \frac{\sum_{i=1}^{100{,}000} x_i}{100{,}000} = 0.41.
A 95% posterior interval for θ can be estimated as the 2.5th to 97.5th percentiles of
the empirical distribution of the draws x_i using the R function quantile. If the
draws are stored in the object dat, the R code is:
set.seed(7117)
dat <- rbeta(100000,12,17)
quantile(dat,c(0.025,0.975))
## 2.5% 97.5%
## 0.2454898 0.5942647
Finally, using the simulated values, a Monte Carlo estimate of the posterior
probability that θ is less than or equal to 0.2 is obtained using

\widehat{\Pr}(\theta \le 0.2) = \frac{1}{100{,}000}\sum_{i=1}^{100{,}000} I(x_i \le 0.2) = 0.00499,
which is a Monte Carlo estimator of (1.15). These figures are in good agreement
with the exact results.
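The exact values can be obtained directly from the Be(12, 17) posterior sampled above; the following lines (not from the book) provide the comparison.

# EXACT COUNTERPARTS OF THE MONTE CARLO SUMMARIES
12 / (12 + 17)                       # EXACT POSTERIOR MEAN, APPROX. 0.414
qbeta(c(0.025, 0.975), 12, 17)       # EXACT 95% POSTERIOR INTERVAL
pbeta(0.2, 12, 17)                   # EXACT Pr(theta <= 0.2), APPROX. 0.005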
In this example, it was straightforward to sample directly from the posterior
distribution because the normalising constant (1.9) is known, and therefore the form
of the posterior is fully specified. Often the normalising constant cannot be obtained
in closed form, particularly when .θ contains many elements. Chapter 4 discusses
how Monte Carlo draws from the approximate posterior distribution can still be
obtained using Markov chain Monte Carlo (McMC) methods.
An important issue with inferences based on Monte Carlo samples from posterior
distributions is the accuracy of posterior summaries. The latter are subject to
sampling uncertainty that depends on the size of the Monte Carlo sample and on
the degree of autocorrelation of the samples. Methods to quantify this uncertainty
are reviewed in Chap. 4.
The Monte Carlo estimates of the mean, the lag-k autocovariance and the lag-k
autocorrelation of the draws are

\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad (1.17)

\hat{\gamma}(k) = \frac{1}{N}\sum_{i=1}^{N-k} (x_i - \hat{\mu})(x_{i+k} - \hat{\mu}), \qquad (1.18)

\hat{\rho}(k) = \frac{\hat{\gamma}(k)}{\hat{\gamma}(0)}. \qquad (1.19)
These estimators can be applied to autocorrelated draws, such as those produced by
an McMC algorithm. As an illustration, consider the first-order autoregressive process

x_t = \rho x_{t-1} + e_t, \quad e_t \sim N(0, \sigma^2), \qquad (1.20)

where the e's are iid (independently and identically distributed). Using R, I
simulated N = 1000, 10,000 and 100,000 observations from (1.20) using ρ = 0.8
and σ² = 1, with the initial condition x_1 ∼ N(0, 1). This generates a strongly
autocorrelated structure among the draws. The marginal mean and variance of this
process are

E(x_t) = 0, \qquad \mathrm{Var}(x_t) = \frac{\sigma^2}{1 - \rho^2}.

The estimates of the mean (0.0), variance (2.778) and correlation (0.8) with samples
of size N = 1000, N = 10,000 and N = 100,000 are (−0.060, 2.731, 0.793),
(−0.046, 2.749, 0.797) and (0.018, 2.774, 0.800), respectively. Despite the rather
strong degree of autocorrelation, the estimates are quite acceptable and get closer to
the true values as sample size increases.
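A sketch of this simulation for N = 100,000 is shown below (the seed and coding details are assumptions, not the author's code).

# SIMULATE THE AUTOREGRESSIVE PROCESS (1.20) AND ESTIMATE MEAN, VARIANCE, CORRELATION
set.seed(37)
N <- 100000; rho <- 0.8
x <- numeric(N)
x[1] <- rnorm(1)                                # INITIAL CONDITION x1 ~ N(0,1)
for (t in 2:N) x[t] <- rho * x[t - 1] + rnorm(1)
mean(x)                                         # CLOSE TO 0
mean((x - mean(x))^2)                           # CLOSE TO 1/(1 - rho^2) = 2.778
acf(x, lag.max = 1, plot = FALSE)$acf[2]        # LAG-1 AUTOCORRELATION, CLOSE TO 0.8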
1.5 Frequentist or Bayesian?

There has been much heated debate between frequentists and Bayesians about
the advantages and shortcomings of both methods of inference. One can easily
construct examples where one of the methods gives silly answers and the other
performs fairly well. One such example is the following. Imagine that in the
situation discussed above, rather than obtaining x = 11 copies of allele A in a
sample of size n = 27, one obtained x = 0. This outcome is not unlikely when
θ is small; for example, the probability of obtaining x = 0 when n = 27 and
θ = 0.04 is Pr(x = 0|n = 27, θ = 0.04) = 0.33. In this situation, the maximum
likelihood (ML) estimate (1.2) is 0 and its variance is also 0, which is clearly a silly
result (classical maximum likelihood is problematic when the estimate lies on the
boundary of the parameter space).
How does the Bayesian approach behave in such a situation? The three posterior
distributions corresponding to the three priors of Fig. 1.7 are shown in Fig. 1.9.
Fig. 1.9 The three posterior distributions corresponding to the three priors of Fig. 1.7 when
n = 27, x = 0
Jeffreys prior and the uniform prior (the latter yielding a posterior with the same
mathematical form as the likelihood, proportional to (1 − θ)^{27}) lead to posterior
distributions Be(0.5, 27.5) and Be(1, 28), respectively; these have modal values of 0.
The weakly informative prior Be(2, 2) yields a posterior of the form Be(2, 29), which
has a mode at θ ≈ 0.034. The posterior means of these three distributions are
0.018, 0.034 and 0.065, respectively. The 95% posterior intervals
are

(0.18 × 10^{-4}, 0.88 × 10^{-1}), (0.90 × 10^{-3}, 0.12), (0.81 × 10^{-2}, 0.17),
respectively. The posterior probabilities that θ is less than or equal to 0.05 for
Be(0.5, 27.5), Be(1, 28) and Be(2, 29) are 0.91, 0.76 and 0.45, respectively.
Fig. 1.10 A Be(2, 25) prior distribution and a Be(2, 52) posterior distribution when n = 27, x = 0
One may consider other priors from the beta family that put strong probability
mass in the neighbourhood of zero. One possibility is to use a Be(2, 25) that has a
mode at θ = 0.04. With n = 27 and x = 0, this results in a posterior Be(2, 52).
The prior and posterior distributions are plotted in Fig. 1.10. The mean and the mode
of the posterior distribution are 0.037 and 0.019, respectively. The 95% posterior
interval is now (0.005, 0.101) and Pr(θ ≤ 0.05|n = 27, x = 0) = 0.75.
The beta prior does not assign probability mass to .θ = 0, and this rules out
the possibility that .θ = 0 in its posterior distribution. One way of including 0 as a
possible value for .θ is to use a two-component mixture prior, where one component
is a point mass at zero and the other is a beta prior. Mixture distributions are
discussed in Chaps. 3, 7 and 9.
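To illustrate the idea, the following sketch combines a point mass at zero with the Be(2, 25) prior of the previous example; the prior weight of 0.5 on the point mass is an arbitrary assumption, not a value from the text.

# TWO-COMPONENT MIXTURE PRIOR: POINT MASS AT theta = 0 PLUS A Be(2,25) COMPONENT
pi0 <- 0.5; a <- 2; b <- 25; n <- 27           # DATA: x = 0 OUT OF n = 27
marg.beta <- beta(a, b + n) / beta(a, b)       # Pr(x = 0 | BETA COMPONENT)
pi0 / (pi0 + (1 - pi0) * marg.beta)            # Pr(theta = 0 | x = 0), APPROX. 0.81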
There is flexibility associated with the Bayesian approach and a carefully chosen
prior distribution will lead to stable inferences about θ. The cost is a result which
is partly affected by prior input. With a small amount of data and when parameters
lie on the boundary of the parameter space, there is little else to choose from. In
such a situation, the most important role of prior distributions may well be to
obtain inferences about .θ that are stable and that provide a fair picture of posterior
uncertainty, conditional on the model.
Frequentist and Bayesian
There are situations where instead of choosing between frequentist or Bayesian,
one could use frequentist and Bayesian tools in a meaningful way. This is the case
with model checking where one is interested in studying the ability of a model to
account for particular features of the data or to give reasonable predictions of future
20 1 Overview
p yrep |y, M =
. p yrep |θM , y, M p (θM |y, M) dθM (1.21)
where often, .p yrep |θM , y, M = p yrep |θM , M . One then observes whether zero
is an extreme value in the posterior predictive distribution .T (y, θM )−T yrep , θM ,
where .θM is generated from the posterior .[θM |y, M] and given .θM , .yrep is generated
from .[yrep |θM , M]. This “Bayesian frequentist”, like the frequentist,
accounts
for the uncertainty in .yrep due to the sampling process from .p yrep |θM , y, M .
However, unlike the frequentist, account is also taken of the uncertainty about
.θM described by its posterior distribution .[θM |y, M]. Model checking applied in
1.6 Prediction
The data z consist of n pairs (y_i, x_i), where x_i is a vector of covariates (explanatory or input
variables); y_i, a scalar, is a response variable; and the n vectors are independent
and identically distributed realisations from some distribution. Examples of both
parametric (frequentist and Bayesian) and nonparametric models are given here. In
the case of parametric models where the response y is quantitative, a general form
for the association between x and y is

y_i = f(x_i) + e_i, \qquad (1.22)

where f, the conditional mean, is a fixed unknown function of x_i and of parameters,
and e_i is a random error term, assumed independent of x_i, with mean zero. Using
the data z, an estimate of f labelled f̂ is obtained by some method that leads to
Ê(y_0|x_0) = ŷ_0, a point prediction of the average value of y_0, evaluated at a new
value x_0 of the covariate:

\hat{y}_0 = \hat{f}(z, x_0). \qquad (1.23)
The notation emphasises that the estimation procedure inputs data z and yields a
prediction ŷ_0 for x = x_0. For example, in standard least squares regression, f(x_i) =
E(y_i|x_i) = x_i b and f̂(z, x_0) = x_0 b̂, where b̂ = (x'x)^{-1} x'y and x_i is the ith row of
matrix x.
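A toy simulated example of (1.23) in the least squares case is sketched below; the data and the new covariate value x0 are invented purely for illustration.

# LEAST SQUARES PREDICTION AT A NEW COVARIATE VALUE x0
set.seed(123)
n <- 50
x <- cbind(1, rnorm(n))                   # INTERCEPT PLUS ONE COVARIATE
y <- x %*% c(1, 2) + rnorm(n)             # TRUE b = (1, 2)
bhat <- solve(t(x) %*% x, t(x) %*% y)     # bhat = (x'x)^(-1) x'y
x0 <- c(1, 0.5)                           # NEW COVARIATE VALUE
sum(x0 * bhat)                            # PREDICTION yhat0 = fhat(z, x0)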
With binary responses, one may fit a logistic regression. Here, the modelling is at
the level of the probability. Specifically, letting Pr(y_i = 1|x_i) = π(x_i), the logistic
model can be written as

\ln\left[\frac{\pi(x_i)}{1 - \pi(x_i)}\right] = x_i b, \quad i = 1, 2, \ldots, n.
A measure of prediction performance is the mean squared error

\mathrm{MSE}_t = \frac{1}{n_t}\sum_{i=1}^{n_t} (y_i - \hat{y}_i)^2, \qquad (1.25)

where n_t is the number of records and ŷ_i = f̂(z, x_i). When MSE_t (1.25) is
computed using the data that was used to fit the model, the training data, it is
known as the sample training mean squared error. If the objective is to study how
well the model predicts a yet-to-be-observed record, MSE_t can be misleading as it
Table 1.1 Training and validating mean squared errors for the prostate cancer data, as the
number of covariates included in the linear predictor increases from 5 to 30. A standard logistic
model is implemented, and the mean squared errors represent the proportion of misclassifications
in the training and validating data
No. of covariates    5      10     15     20     25     30
MSE_t                0.29   0.31   0.16   0.12   0.06   0.00
MSE_v                0.27   0.37   0.25   0.33   0.35   0.39
In (1.26), ŷ_0i is the ith prediction computed using the training data z, evaluated
at the value of the ith covariate x_0i. That is, ŷ_0i = f̂(z, x_0i). With binary
observations, (1.25) and (1.26) represent the proportion of cases where ŷ_0i ≠ y_0i,
that is, the misclassification error.
As an illustration of some of these concepts, I use data from a microarray study
of prostate cancer from Singh et al (2002). The study includes 52 men with tumour,
50 healthy men and a total of n = 102 men. The genetic expression of a panel
of p = 6033 genes was measured for each man. The level of gene expression is
associated with the level of activity of the gene. A larger number implies a more
active gene. The n × p matrix of covariates is then x = {x_ij}, i = 1, 2, . . . , n =
102; j = 1, 2, . . . , p = 6033, with p ≫ n.
Fitting Traditional Logistic Regression
To illuminate some of the consequences of overfitting, a first analysis is undertaken
with traditional logistic regression models involving .p < n. Data are divided
into training and validating sets with equal numbers in each. The logistic models
are fitted to the training data using maximum likelihood, and the estimates of the
parameters are used to predict the outcome (healthy/not healthy) in the validating
data. The models differed in the number of covariates included. The change in
MSE_t and MSE_v as the number of covariates (columns in x) increases in the
linear predictor is shown in Table 1.1. The table reveals
the overstatement of the model’s predictive ability, as judged by the steady fall in
MSE_t when the number of covariates is larger than 10.
The code below reads the data (singh2002), splits it into a training and a
validating/testing set (y.test and y.train) and in the bottom part,
• fits a logistic regression to the training data using the R-function GLM
• using the ML estimates, computes the predicted liabilities in the training and
validating data
• based on these liabilities, computes Pr(Y = 1|b̂), where b̂ is the ML estimate
• transforms the probabilities into the 0/1 scale
• computes MSE (misclassification error) in the training and validating data
• computes .MSE (misclassification error) in the training and validating data
The figures in Table 1.1 were generated using this code. The code illustrates the
case with 15 covariates (first 15 columns of matrix x). The output agrees with the
figures in the third column of the table:
# CODE0101
# READING SINGH ET AL 2002 DATA
rm(list=ls()) # CLEAR WORKSPACE
# Lasso solutions using package glmnet
#install.packages("glmnet", .libPaths()[1])
#install.packages("sda")
library("sda")
library(glmnet)
data(singh2002)
X<-singh2002$x
y<-ifelse(singh2002$y=="cancer",1,0)
n<-nrow(X)
Xlasso<-X
set.seed(3037)
train=sample(1:nrow(X),nrow(X)/2)
test=(-train)
y.test=y[test]
y.train<-y[train]
#
# RESAMPLES TRAIN/TEST DATA AND COMPUTES MSE
# FOR EACH RESAMPLE/REPLICATE
t1 <- seq(1,15,1) # CHOOSE THE FIRST 15 COLUMNS OF x
X1 <-X[,t1]
n <- length(t1)
datarf <- data.frame(cbind(y,X))
nc <- 1 # EXAMPLE WITH 1 REPLICATE ONLY
res <- matrix(data=NA, nrow=nc,ncol=3)
for(i in 1:nc){
if(i > 1){train <- sample(1:nrow(datarf),nrow(datarf)/2)}
glm.fit <- glm(y[train] ~ X1[train,] ,
family=binomial(link="logit"))
zero. Since the lasso solutions typically include many coefficients equal to zero
when the tuning parameter is sufficiently large, it does model selection and
shrinkage simultaneously.
The lasso logistic regression model is fitted using the public package glmnet
(Friedman et al 2009) implemented in R. Documentation about glmnet can be found
in Hastie and Qian (2016) and in Friedman et al (2010).
To obtain predictions, the code below executes first the function cv.glmnet on
the training data in order to find the value of the tuning parameter (.λ) that optimises
prediction ability measured by .MSE. In a second step, glmnet is executed again
on the training data using this best .λ to obtain the final values of the regression
parameters. The code then constructs the predictions from the output of this second
run. The model is finally tested on the training and on the validating data.
A more direct implementation of glmnet without the need to generate estimates
of regression parameters is indicated at the bottom of the code.
The lasso logistic regression was run on the prostate data including all 6033
covariates representing the gene expression profiles. Lasso chooses 36 covariates
and sets the remaining equal to zero. The model with these 36 covariates was used
to classify the observations in the validating data and resulted in a MSE_v equal to
0.25. In other words, 51 × 0.25 ≈ 13 out of the 51 observations in the validating data
are incorrectly classified. At face value, the result for MSE_v matches that obtained
in Table 1.1 when the first 15 columns of x were included in the linear predictor.
The latter can be interpreted as a logistic regression model where 15 out of 6033
covariates are randomly chosen. The result based on the lasso is not encouraging:
# CODE0102
# READING SINGH ET AL 2002 DATA
rm(list=ls()) # CLEAR WORKSPACE
# Lasso solutions using package glmnet
#install.packages("glmnet", .libPaths()[1])
#install.packages("sda")
library("sda")
library(glmnet)
data(singh2002)
X<-singh2002$x
y<-ifelse(singh2002$y=="cancer",1,0)
n<-nrow(X)
Xlasso<-X
set.seed(3037)
train=sample(1:nrow(X),nrow(X)/2)
test=(-train)
y.test=y[test]
y.train<-y[train]
#
# ********** FOR PREDICTION USING LASSO *****************
repl <- 1 # NUMBER OF REPLICATES
# (RESAMPLES TRAINING / VALIDATING)
result <- matrix(data=NA, nrow=repl,ncol=4)
set.seed(3037)
for (i in 1:repl){
if(i > 1){train <- sample(1:nrow(Xlasso),nrow(Xlasso)/2)}
y.train <- y[train]
# STEP 1: CROSS-VALIDATION ON THE TRAINING DATA TO FIND THE BEST LAMBDA
# (SETTINGS ASSUMED TO MATCH STEP 2 BELOW)
cv.out=cv.glmnet(y=y[train],x=Xlasso[train,],alpha=1,
family="binomial",type.measure= "class")
bestlam=cv.out$lambda.min
# STEP 2
fm=glmnet(y=y[train],x=Xlasso[train,],alpha=1,lambda=bestlam,
family="binomial",type.measure= "class")
nzcf<-coef(fm)
cf<-which(fm$beta[,1]!=0)
if (length(cf) == 0){
out <-c(i,length(cf))
print(out)
break
}
#length(cf) # NO. REGRESSION PARAMETERS IN FINAL MODEL
# CONSTRUCT PREDICTIONS FROM OUTPUT OF fm
# 1. VALIDATING DATA
predglmnet<-fm$a0+Xlasso[-train,cf]%*%fm$beta[cf]
probs <- exp(predglmnet)/(1+exp(predglmnet))
predclass_test <- as.numeric(ifelse(probs > 0.5, "1", "0"))
# 2. TRAINING DATA
predglmnet<-fm$a0+Xlasso[train,cf]%*%fm$beta[cf]
probs <- exp(predglmnet)/(1+exp(predglmnet))
predclass_train <- as.numeric(ifelse(probs > 0.5, "1", "0"))
result[i,] <- c(mean((predclass_train-y.train)^2),
mean((predclass_test-y.test)^2),bestlam,length(cf))
}
result
###############################################################
## 1
proc.time()-ptm
##
## predtreev cancer healthy
## cancer 17 0
## healthy 9 25
## [1] 0.1764706
#summary(res)
#plot(trees)
#text(trees,pretty=0)
Figure 1.11 indicates that in the particular replicate, the algorithm isolated 2
of the 6033 covariates, X77 and X237. Starting at the top of the tree, the 51 cases
in the training data have been split into two groups: one, to the left, that shows
an expression profile for X77 less than a threshold t1 = −0.777, and those to the right
with X77 greater than t1. The group on the left is not split further and constitutes a terminal
node. On the right side, a second split based on the profile of X237 and a threshold
t2 = −0.855 gives rise to two terminal nodes. The result can be interpreted as an
library(tree)
trees
The output above indicates that at the top of the tree at X77 (which is the
root, since the tree is upside down), there are 51 records (the training data), and
the proportion of “cancer” is 0.5098. After the first split, to the left, the split is
t1 < −0.777, and 20 observations are classified as “cancer” and 0 as “healthy”,
leading to proportions of (1.00, 0.00). To the right, the split is t1 > −0.777, which
gives rise to 31 records, with a proportion of “healthy” equal to 0.8065 (25 out of
the 31 records are “healthy”, those whose t2 > −0.855, associated with covariate
X237).
Various algorithms are available to decide which variable to split and the
splitting value t to use for the construction of the tree. Some of these topics
are deferred to the chapter on nonparametric methods. Here, I concentrate on the
predictive ability of the method. For the particular replicate, the classification error
in the training and validating data is 0 and 0.18, respectively. Replicating the
experiment 50 times gives a mean classification error in training and validating
data equal to 0.016 and 0.197, respectively, with (minimum, maximum) values of
(0.000, 0.078) and (0.098, 0.333), respectively. With the parameters for the utility
function tree.control set at the default values, the number of covariates over
the 50 replicates included in each tree fluctuates between 2 and 3, and these
covariates vary over replicates. For these data, the classification tree performs
considerably better than the lasso.
Interestingly, in all the cases, the classification trees capture what can be
interpreted as an interaction involving two or three covariates. These are not the
same covariates for the 50 trees. Perhaps, this must not come as a surprise: 6033
covariates give rise to more than 18 million different two-way interactions. There-
fore, predictors based on interacting covariates are prone to be highly correlated.
More generally and as noted with the lasso, in the high-dimensional setting, the
multicollinearity of the covariate matrix is often extreme, and any of the p covariates
in the n × p matrix can be written as a linear combination of the others. This means
that there are likely many sets of pairs of covariates (other than X77 and X237) that
could predict just as well. It does not follow that the model cannot be trusted as a
prediction tool, but rather that one must not overstate the importance of X77 and
X237 as the only genes associated with the response variable. As with the lasso, the
analysis with the classification tree provides inconclusive evidence of specific genes
affecting prostate cancer.
Fitting a Random Forest
A problem often mentioned with trees is that they exhibit high variability. Small
changes in the data can result in the construction of very different trees and their
predictions can be impaired. However, they are an integral part of another method
known as random forest (Breiman 2001) whose prediction performance benefits by
the process of averaging. The random forest consists of many classification trees
and each is created as follows:
• Create a sample of size nv by drawing with replacement from the nv data in the
training data. Repeat this B times to generate B samples. (With random forests,
there is an alternative way of estimating validating mean squared error using the
entire data, without cross-validation. Details are discussed in Chap. 11).
• For each sample, generate a classification tree. Each time a split in the tree is
considered, a random sample of m unique predictors is chosen as split candidates
from the p predictors. For classification, it is customary to use m ≈ √p. This
step has the effect of decorrelating the ensemble of B trees (in classification
trees, the construction of a split involves all the predictors).
For classification, once the trees are available, the final prediction is obtained
by a majority vote. Thus, for B = 10, say, if for a particular observation in the
validating data six or more trees classify it as "1", the predicted value for this
observation is "1". The prediction obtained in this manner usually outperforms the
prediction of classification trees. This improvement in performance arises from the
fact that a prediction based on B predictors with very low correlation has smaller
variance than a single prediction. The low correlation is ensured in the second
step. The first step involving the bootstrapping of the training data is known as
bagging, short for bootstrap aggregating, whereby the results of several bootstrap
samples are averaged. (The mean squared error is a measure of the performance of
a predictor, whose expectation includes the variance of the predictor, a squared bias
term and a pure noise term associated with the variance of the predictand. Therefore,
the performance of a predictor improves as its variance is reduced. This reduction
Fig. 1.12 Average proportion of correct classifications in the validating data (in red) of a random
forest over 200 replicates against the number of covariates included in the ensemble of trees.
Maximum and minimum over replicates in blue
# REQUIRES library(randomForest) AND OBJECTS DEFINED EARLIER IN THE CODE:
# A DATA FRAME d WITH THE RESPONSE y, AND n0, n1 (NUMBER OF RECORDS IN EACH CLASS)
set.seed(3037)
p <- .5
nrep <- 1
mtry <- c(5,20,50,80,120)
sumd <- data.frame()
res <- rep(NA,nrep)
ptm<-proc.time()
for ( m in mtry) {
cat("mtry ",m,"\n",sep="")
for ( rep in 1:nrep ) {
cat("Replicate ",rep,"\n",sep="")
train <- c(sample( 1:n0,floor(p*n0) ),
sample( (n0+1):(n0+n1),floor(p*n1) ))
rf.singh =randomForest(y ~.,
data=d,
subset =train,
mtry=m,
importance =TRUE)
predict <- predict(rf.singh,d[-train,])
observed <- d$y[-train]
t <- table(observed,predict)
print(t)
res[rep] <- (t[1,1]+t[2,2])/sum(t)
}
sumd <- rbind(sumd,c(m,min(res),mean(res),median(res),
max(res),var(res)))
}
proc.time()-ptm
names(sumd) <- c("mtry","min","mean","median","max","var")
with(sumd,plot(mtry,mean,type="l",col="red",ylim=c(min(min),1),
ylab="1 - Mean Squared Error",
xlab="Number of Predictors Considered at each Split"))
with(sumd,lines(mtry,min,lty=2,col="blue"))
with(sumd,lines(mtry,max,lty=2,col="blue"))
While in this particular set of data the random forest was the clear winner among
the prediction machines tested, it is important to mention that there is no uniformly
best prediction machine. A different set of data may produce different results. Very
marked differences among prediction methods ought to raise suspicion and warrant
careful investigation of the data (Efron 2020). This is particularly important in this
era of increasingly larger data sets where the consequence of bias due to non-random
sampling is magnified. The point is elaborated in Meng (2018). Spurious results may
be obtained by complex interactions between a prediction method and a particular
structure in the training data at hand that may not be reproduced when the model is
deployed using validating data.
1.7 Appendix: A Short Overview of Quantitative Genomics

The starting point of the mathematical genetics model is the metaphor that describes
chromosomes as strings of beads, with each bead representing a gene. Genes are the
unit of inheritance. In mammals and many other groups, each cell carries two copies
of each chromosome; they are said to be diploid. Most fungi, algae and human
gametes have only one chromosome set and are haploid.
The complete set of chromosomes of an organism includes sex chromosomes
and autosomes. For example, in humans, there are 23 pairs of chromosomes and
of these, 22 pairs are non-sex chromosomes known as autosomes. The majority of
genes are located on the autosomes and in this book I consider autosomal loci only.
In diploid organisms, at a specific location on the chromosome called the locus,
each of the two copies of the chromosome carries a gene. The pair of genes
constitute the genotype at the particular locus. Genes exist in different forms known
as alleles. Here, I consider biallelic loci, so for a given locus in diploid individuals,
if the two alleles are A and a, the three genotypes could be denoted, say AA, Aa
and aa (no distinction is made between Aa and aA). For example, an individual
with genotype Aa received one allele (say A) from the mother and the other from
the father.
The standard quantitative genetic model assumes that the expression of a trait
value y (the phenotype, here centred with zero mean) in diploid individuals is
determined by the additive contributions of a genetic value G and an environmental
value e,
y = G + e,
Consider first a trait affected by a single biallelic locus with the three genotypes
labelled AA, Aa and aa. Let p denote the frequency of allele A in the population
The random variable $z^*$ is known as the allele content (here, allele A is arbitrarily taken as reference and $z^* = 2$ if the genotype has two copies of allele A). For this locus, $E(z^*) = 2p$, $\mathrm{Var}(z^*) = 2p(1-p)$ and, for individuals k and j, $\mathrm{Cov}(z_j^*, z_k^*) = a_{jk}\,2p(1-p)$, where $a_{jk}$ is the expected additive genetic relationship (given a pedigree) between k and j (e.g. $a_{jk} = 0.5$ if j and k are non-inbred full sibs, or parent and offspring), also interpreted as the expected proportion of alleles shared identical by descent between j and k (genes that are identical by descent (IBD) are copies of a specific gene carried by some ancestral individual). The note note0101.pdf at https://github.com/SorensenD/SLGDS has a derivation of these results.
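These moments are easy to check numerically. The short sketch below is illustrative only and is not taken from the book; the allele frequency and the number of individuals are arbitrary choices, and genotypes are drawn assuming random mating, so that the allele content follows a binomial distribution with two trials.
# Minimal sketch (assumptions as stated above): simulate the allele content
# z* of N unrelated individuals at a biallelic locus with allele frequency p.
# Under random mating z* ~ Binomial(2, p), so E(z*) = 2p and
# Var(z*) = 2p(1 - p).
set.seed(37)                                  # arbitrary seed
p <- 0.3                                      # assumed frequency of allele A
N <- 100000                                   # number of individuals
zstar <- rbinom(N, size = 2, prob = p)
c(mean(zstar), 2 * p)                         # empirical mean versus 2p
c(var(zstar), 2 * p * (1 - p))                # empirical variance versus 2p(1-p)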
In a large (idealised) random-mating population, in the absence of selection, mutation or migration, gene and genotype frequencies remain constant from generation to generation: the frequency of allele A stays at p and the genotypes AA, Aa and aa occur with frequencies $p^2$, $2p(1-p)$ and $(1-p)^2$. This property follows from a theorem known as the Hardy-Weinberg law, which provides one explanation for the maintenance of genetic variation in such an idealised random-mating population. In a population in Hardy-Weinberg equilibrium, genotype frequencies at a particular locus in the offspring generation depend only on gene frequencies in the parent generation.
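The Hardy-Weinberg property can be illustrated with a minimal simulation (not from the book; the starting genotype distribution and the population size are arbitrary choices): after a single round of random union of gametes, genotype frequencies settle at the Hardy-Weinberg proportions and remain there in subsequent generations.
# Minimal sketch (assumptions as stated above): random union of gametes
# drives genotype frequencies to the Hardy-Weinberg proportions.
set.seed(101)
N <- 100000
geno <- sample(0:2, N, replace = TRUE, prob = c(0.6, 0.1, 0.3))  # arbitrary start
for (gen in 1:3) {
  # each offspring receives one allele from each of two randomly chosen parents;
  # a parent with allele content g transmits allele A with probability g/2
  allele1 <- rbinom(N, 1, geno[sample(N, replace = TRUE)] / 2)
  allele2 <- rbinom(N, 1, geno[sample(N, replace = TRUE)] / 2)
  geno <- allele1 + allele2
  cat("generation ", gen, ": genotype frequencies ",
      paste(round(table(factor(geno, levels = 0:2)) / N, 3), collapse = " "),
      "\n", sep = "")
}
p <- mean(geno) / 2
round(c((1 - p)^2, 2 * p * (1 - p), p^2), 3)  # Hardy-Weinberg proportions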
From now on, the codes for the three genotypes are centred as $z = z^* - E(z^*)$, so that $E(z) = 0$ and $\mathrm{Var}(z) = 2p(1-p)$, and phenotypic values are also centred, so that $E(y) = 0$.
The genetic value $G(z)$ at the locus can take three modalities corresponding to the three genotypes at the biallelic locus and can be decomposed into two terms:
$$G(z) = \alpha z + \delta, \qquad (1.27)$$
where $\alpha z$ is the best linear predictor of genetic value. The best linear predictor $\alpha z$ is also known as the additive genetic value or breeding value: the best linear approximation describing the relationship between genetic value and allele content z (best linear prediction is discussed on page 259; see also the example on page 261 for more details on the concepts of additive genetic values and effects, where it is shown that $\alpha$, the additive genetic effect of the locus or average substitution effect at the locus, is also the regression of y on z). The residual term $\delta$ is orthogonal to z and includes the deviations between $G(z)$ and $\alpha z$.
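The decomposition in (1.27) can be checked numerically. The sketch below is not from the book: the allele frequency and the genotypic values of the three genotypes (which include a dominance deviation, so the linear fit is not perfect) are arbitrary choices. The least-squares regression of the genetic value on the centred allele content gives the substitution effect $\alpha$, and the residual $\delta$ is uncorrelated with z.
# Minimal sketch (assumptions as stated above): decompose the genetic value
# G(z) into the best linear predictor alpha*z plus a residual delta.
set.seed(11)
p <- 0.25                                   # assumed frequency of allele A
N <- 100000
zstar <- rbinom(N, 2, p)                    # allele content: 0, 1 or 2
z <- zstar - 2 * p                          # centred allele content
gval <- c(0, 0.8, 1)[zstar + 1]             # arbitrary genotypic values of aa, Aa, AA
G <- gval - mean(gval)                      # centred genetic value
alpha <- coef(lm(G ~ z - 1))[["z"]]         # regression of G on z: substitution effect
delta <- G - alpha * z                      # residual of the linear fit
c(alpha, cov(delta, z))                     # the covariance is essentially zero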
The genetic variance contributed by the locus in the population (based on the law of total variance) is
$$\mathrm{Var}(G(z)\,|\,\alpha) = \alpha^2\,\mathrm{Var}(z) + \mathrm{Var}(\delta) = 2p(1-p)\alpha^2 + \mathrm{Var}(\delta),$$
and the first term, $\sigma_a^2 = 2p(1-p)\alpha^2$, is the additive genetic variance. If the linear fit is perfect, the genetic variance is equal to the additive genetic variance. Importantly, additive genetic variation at the locus arises due to variation in allele content z among individuals at the locus. The substitution effect $\alpha$ is treated as a fixed albeit unknown parameter (this is stressed by conditioning on $\alpha$).
The (narrow sense) heritability of the trait is defined as the ratio of the additive genetic variance to the phenotypic variance: $h^2 = \sigma_a^2 / \sigma_y^2$, where $\sigma_y^2 = \mathrm{Var}(y)$ is the marginal variance of the phenotype.
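Continuing the sketch above (again an illustration rather than the book's code, with an arbitrary environmental variance), a phenotype is obtained by adding environmental noise to the genetic value; the heritability is then the fraction of the phenotypic variance accounted for by the additive genetic variance, and the regression of phenotype on allele content recovers $\alpha$, as stated earlier.
# Continuation of the sketch above (not from the book): add environmental
# noise and compute the narrow-sense heritability h^2 = sigma_a^2 / sigma_y^2.
e <- rnorm(N, mean = 0, sd = 1)             # assumed environmental standard deviation of 1
y <- G + e                                  # centred phenotype
sigma2a <- 2 * p * (1 - p) * alpha^2        # additive genetic variance at the locus
h2 <- sigma2a / var(y)
c(h2, coef(lm(y ~ z))[["z"]], alpha)        # heritability; regression of y on z is close to alpha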
The linkage disequilibrium (LD) parameter $D_{kl}$ between loci k and l is defined as follows. Choose the paternal (or maternal) gamete and let the random variable U take the value 1 if allele $A_k$ is present in the paternal gamete at locus k and zero otherwise; let the random variable V take the value 1 if allele $A_l$ is present in the paternal gamete at locus l and zero otherwise. Then $D_{kl}$ is defined as the covariance between U and V:
$$D_{kl} = \mathrm{Cov}(U, V) = E(UV) - E(U)E(V),$$
and $\mathrm{Cov}(z_k, z_l) = 2D_{kl}$ since, in the diploid model, the genotype results from the random union of two gametes and covariances between alleles at different loci carried on different gametes are zero. Linkage disequilibrium is created by evolutionary forces such as selection, mutation and drift, and is broken down by random mating, as a function of time (measured in generations) and of the distance that separates the intervening loci. Generally, loci that are physically close together show stronger LD.
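The relation $\mathrm{Cov}(z_k, z_l) = 2D_{kl}$ can also be verified by simulation. The sketch below is illustrative only and not taken from the book; the four haplotype frequencies are arbitrary and chosen so that the two loci are in linkage disequilibrium.
# Minimal sketch (assumptions as stated above): verify Cov(z_k, z_l) = 2*D_kl
# when genotypes are formed by the random union of two gametes.
set.seed(7)
N <- 100000
# arbitrary frequencies of the haplotypes A_k A_l, A_k a_l, a_k A_l, a_k a_l
hapfreq <- c(0.4, 0.1, 0.2, 0.3)
haps <- rbind(c(1, 1), c(1, 0), c(0, 1), c(0, 0))    # (U, V) indicators per haplotype
draw <- function(n) haps[sample(4, n, replace = TRUE, prob = hapfreq), ]
pat <- draw(N)                              # paternal gametes
mat <- draw(N)                              # maternal gametes
Dkl <- cov(pat[, 1], pat[, 2])              # LD: covariance of U and V within a gamete
zk <- pat[, 1] + mat[, 1]                   # allele content at locus k
zl <- pat[, 2] + mat[, 2]                   # allele content at locus l
c(2 * Dkl, cov(zk, zl))                     # the two quantities agree closely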