The Fundamentals of Modern Statistical Genetics



Nan M. Laird · Christoph Lange

Nan M. Laird
Department of Biostatistics
Harvard University
Boston, MA 02115, USA
[email protected]

Christoph Lange
Department of Biostatistics
Harvard University
Boston, MA 02115, USA
[email protected]

Statistics for Biology and Health
Series Editors

M. Gail
National Cancer Institute
Bethesda, MD 20892, USA

Klaus Krickeberg
Le Châtelet
F-63270 Manglieu, France

Jonathan M. Samet
Department of Preventive Medicine
Keck School of Medicine
University of Southern California
1441 Eastlake Ave. Room 4436, MC 9175
Los Angeles, CA 90089

A. Tsiatis
Department of Statistics
North Carolina State University
Raleigh, NC 27695, USA

W. Wong
Department of Statistics
Stanford University
Stanford, CA 94305-4065, USA

ISSN 1431-8776
ISBN 978-1-4419-7337-5 e-ISBN 978-1-4419-7338-2
DOI 10.1007/978-1-4419-7338-2
Springer New York Dordrecht Heidelberg London

© Springer Science+Business Media, LLC 2011


All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York,
NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in
connection with any form of information storage and retrieval, electronic adaptation, computer
software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if
they are not identified as such, is not to be taken as an expression of opinion as to whether or not
they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To our families, and all of the families whose
data we have analyzed.

Preface

Statistical genetics has played a pivotal role for more than a century in the discovery
of genes that cause disease in humans. Driven by advances in molecular genetics
and medicine and the continuing improvements in genotyping technology, statistical
models and methods have adapted over time to the challenges presented by new
study designs.
In this book we discuss the statistical models and methods that are used to
understand human genetics from a historical perspective. From Mendel’s first
experiments to more recent genome-wide association studies, we describe how
genetic information can be incorporated into statistical models to discover disease
genes. While we cover most of the commonly used approaches in statistical genetics
(e.g., aggregation analysis, segregation, linkage analysis, etc.), the focus of the book
is on modern approaches to association analysis. Our treatment of earlier topics
is mainly to help the reader see the larger picture and understand the historical
development of methods. We provide numerous examples to illustrate key points
throughout the text, both of Mendelian and complex genetic disorders.
Most statisticians, biostatisticians and data analysts are aware of the key role
that their disciplines have played in finding disease genes, but have little direct
knowledge of how gene discovery via gene mapping works. This book arises from
teaching courses to graduate students, with varying levels of statistical preparation,
at the Harvard School of Public Health. Our intended audience for this book is
largely quantitatively oriented health scientists, including biostatisticians,
statisticians, epidemiologists, physicians and molecular geneticists, who want to
learn about statistical methods for genetic analysis, whether to better analyze
genetic data, or to pursue research in methodology. We assume familiarity with
elementary probability, statistical inference and methods, specifically
distributions for two or more variables, conditional, marginal and joint
distributions, Bayes’ rule, likelihood methods, hypothesis testing, estimation,
correlation and the essential ideas of regression, including linear, log-linear
and logistic. However, the book emphasizes concepts and examples, and the
exercises include problems for students with a broad range of skill levels. We
assume no formal training in genetics, but familiarity with basic concepts in
molecular genetics is necessary and will be reviewed in the first chapter.
There are many excellent texts in statistical methods currently available to
students and we have used many of them in our teaching. This book shares much with
the classic texts of Sham (1998) and Lange (2002), both of which were written with
a similar audience in mind. Our book is less focused on linkage and more focused on
association analysis than the text by Sham, and provides easier reading for students
with less mathematical training than the book by Lange. We also share much with
the newer texts by Thomas (2004) and Yang (2000), being less epidemiologically
oriented than Thomas, with more emphasis on human disease than Yang. The book
by Foulkes (2009) has a stronger emphasis on software implementation while our
focus is on statistical theory and methods.

Boston, Massachusetts Nan M. Laird


Bad Godesberg, Germany Christoph Lange
Acknowledgments

This book would not have been possible without our former students, the staff at
the Harvard School of Public Health, our colleagues, friends and our families, whom
we would like to thank for their support, encouragement and help during the
writing of the book. The book is based on the statistical genetics courses
that we have been teaching over the last 10 years here at the Harvard School of
Public Health. With their feedback and during enjoyable discussions, many of our
former students helped us to improve our courses and thereby the book. Without
the active and patient support from the staff at the Department of Biostatistics at the
Harvard School of Public Health, especially Jelena Follweiler, we would not have
been able to put the book together. We also thank the reviewers, our colleagues,
friends and family members for their keen eye during the review process of the
book. We are especially thankful to Kaustubh Adhikari, Gourab De, Matt McQueen,
Jessica Lasky-Su and Tony Paredo for their help with the exercises, Ross Lazaras
for help with figures, and to Wai-Ki Yip, Deborah Blacker, Garrett Fitzmaurice,
Jonathan Haines, Alkes Price, Benjy Raby, David Alexander, Lily Altstein, Sharon
Lutz, Cory Zigler, Andreas Kraeusling, and several anonymous reviewers for their
comments on drafts of the book. We are also indebted to Tyler VanderWeele for
contributing the section on Compositional Epistasis and Compositional Gene-Environment
Interactions. Last, but not least, we would like to mention John Kimmel for leading
us through the book writing process.

Contents

1 Introduction to Statistical Genetics and Background in Molecular Genetics
   1.1 Basic Concepts in Genetic Disease
   1.2 Review of Molecular Genetics
   1.3 Types of Genetic Variants
   1.4 Effects of Genetic Variants on Disease

2 Principles of Inheritance: Mendel’s Laws and Genetic Models
   2.1 Mendel’s Experiments
   2.2 A Framework for Genetic Models
   2.3 The Biology Underlying Mendelian Inheritance
   2.4 Exercises

3 Some Basic Concepts from Population Genetics
   3.1 Estimation of Allele Frequencies
   3.2 Population Substructure
      3.2.1 Population Stratification
      3.2.2 Population Admixture
      3.2.3 Population Inbreeding
   3.3 Hardy-Weinberg Equilibrium
      3.3.1 Testing for HWE
      3.3.2 Some Causes of the Failure of HWE
      3.3.3 Measuring the Departure from HWE
   3.4 Exercises

4 Aggregation, Heritability and Segregation Analysis: Modeling Genetic Inheritance Without Genetic Data
   4.1 Preliminaries
   4.2 Aggregation Analysis
      4.2.1 Estimating Recurrence Risk Ratios
      4.2.2 Further Simplifications
   4.3 Heritability Analysis
   4.4 Segregation Analysis
      4.4.1 Segregation Analysis for Dominant Mendelian Diseases
      4.4.2 Segregation Analysis for Recessive Mendelian Diseases
      4.4.3 Summary
   4.5 Exercises

5 The General Concepts of Gene Mapping: Linkage, Association, Linkage Disequilibrium and Marker Maps
   5.1 Introduction
   5.2 Genetic Markers and Marker Maps
   5.3 Testing for Linkage or Association: Basic Concepts
   5.4 A Formal Definition of Linkage Disequilibrium and Related Measures Used to Describe Linkage Disequilibrium
   5.5 The Origin and Extent of LD in the Human Genome
   5.6 The Human Genome and HapMap Projects
   5.7 Exercises

6 Basic Concepts of Linkage Analysis
   6.1 Basic Approach to Assessing Linkage Between Two Loci
   6.2 The Direct Counting Method
   6.3 The Interpretation of LOD Scores
   6.4 Exercises

7 The Basics of Genetic Association Analysis
   7.1 Testing Association with Dichotomous Disease Traits: Codominant, Recessive and Dominant Models
   7.2 The Additive Genetic Model: The Alleles Test and the Trend Test
   7.3 Small Sample and Permutation Tests
   7.4 Which Mode of Inheritance Should We Assume for Testing?
   7.5 Estimating Effect Sizes and Confidence Intervals
   7.6 Examples of Testing Association with Diallelic Markers
   7.7 The Regression Approach: Extensions to Covariate Adjustment and to Other Phenotypes
   7.8 Association Analysis with Complex Traits: An Association Between INSIG2 and BMI
   7.9 Sample Size and Power Considerations for Case-Control Design
   7.10 Power and Effect Estimation: Testing a Marker in LD with the DSL
   7.11 Exercises

8 Population Substructure in Association Studies
   8.1 The Impact of Population-Admixture and Stratification on Genetic Association Tests
   8.2 Genomic Control Approaches
   8.3 Modeling the Effects of Population Admixture and Stratification
   8.4 Regression-Based and Principal Component Approaches
   8.5 Exercises

9 Association Analysis in Family Designs
   9.1 The Trio Design and the TDT
   9.2 Family Based Association Tests: FBAT
      9.2.1 Missing Parents
      9.2.2 Comparative Power for Family-Based and Case-Control Designs
   9.3 Applications
      9.3.1 Using FBAT to Obtain the TDT
      9.3.2 Deriving a TDT for a Recessive Mode of Inheritance
      9.3.3 Informative Families
      9.3.4 Codominant Mode of Inheritance
      9.3.5 Multiallelic Test
      9.3.6 Using Unaffected Offspring
      9.3.7 Missing Parental Information
      9.3.8 Quantitative Traits
   9.4 Exercises

10 Advanced Topics
   10.1 The Multiple Testing Problem in Association Studies
      10.1.1 Methods Based on P-Value Adjustment
      10.1.2 Permutation and Monte Carlo Tests
   10.2 Other Methods for the Analysis of Multiple SNPs, Including Haplotypes
   10.3 Gene–Environment/Gene–Drug Interaction
   10.4 Exercises

11 Genome Wide Association Studies
   11.1 Introduction
   11.2 Quality–Control for the Genotype Data
   11.3 Multi-Stage Designs
   11.4 Testing Strategies for Family-Based Studies
   11.5 Replication, Non-replications and Meta-analysis
   11.6 Exercises

12 Looking Toward the Future

A Basic Concepts of Linkage Analysis (Continued from Chapter 6)
   A.1 General Issues with Parametric Linkage Analysis
   A.2 Non-parametric Linkage Analysis
   A.3 Multipoint Linkage Analysis

B A Class of Score Tests for Family Designs
   Properties of the Score Test
   Missing Parents

C The TDT Tests for Both Linkage and Association (LD)

Bibliography

Index

Chapter 1
Introduction to Statistical Genetics
and Background in Molecular Genetics

An understanding of the basic ideas of inheritance has been evident throughout the
history of mankind, ever since the domestication of animals or the practice of
farming began. The Babylonians and ancient Egyptians utilized cross pollination of
crops and selection of domesticated animals for breeding, but did not develop a
formal theory for the principles underlying the inheritance of traits. Later,
ancient Greek philosophers developed elementary theories to explain how inheritance
worked in humans, grappling unsuccessfully with the apparent paradox that inherited
characteristics can sometimes differ between offspring and parents. Some diseases in
humans, such as sickle cell anemia and hemophilia, have been recognized as
inherited disorders for centuries and, as the science of medicine developed, so too
did the recognition that many diseases are heritable. Yet, bipolar disorder, one of
the oldest known disorders in humans, was not widely regarded as heritable until
the 1950s.
Although we can document an awareness of the basic concept of inheritance for
millennia, most of our current knowledge about inherited human diseases has been
acquired only in the last century. As the concept of inherited disease gradually
developed, Genetics, the science of inherited variation and heritable biological
material in living organisms, became an integral part of the search for the origin
of disease. Today stories of gene discovery for many diseases dominate the news
landscape.
Despite centuries of formal and informal observation of patterns of inheritance in
humans, the discipline of human genetics is relatively young. Humans are difficult
to study because, in contrast to plant and animal genetics, experimental crossings
are not possible, environmental factors are hard to control, and humans have small
families with many years required for a new generation to develop. Environmental
and genetic factors broadly overlap during childhood, making it difficult to
separate the relative contributions of the environment and genetics to the
development of disease. As a result, much of our understanding of basic genetic
principles and how genes affect variation in organisms comes from experimental
studies in plants (Mendel’s experiments with garden peas in the 1860s) and animals
(Morgan’s experiments in the 1920s with flies). Mendel’s laws were initially largely
ignored, but ‘rediscovered’ by scientists in the 1900s and hotly debated by
geneticists, biologists, statisticians and biometricians. Part of this debate
centered on the apparent conflict between Mendel’s work Experiments in Plant
Hybridization (1865) and Darwin’s theories set forth in On the Origin of Species
(1859), which was published just prior to Mendel’s paper. Darwin used the notion of
inherited traits as the basis for natural selection, but he believed that traits in
parents were ‘blended’ in the offspring. Mendel’s work verified the inheritance of
traits, but he deliberately used discrete traits that were not blended in offspring.
Developing models and theories for how Mendel’s discrete inherited units could
explain variation in continuous human characteristics was a subject of much debate
during these early years of statistical genetics. In this text we use the term trait
broadly to encompass both measured and discrete characteristics, as well as disease
outcomes.

N.M. Laird, C. Lange, The Fundamentals of Modern Statistical Genetics,
Statistics for Biology and Health, DOI 10.1007/978-1-4419-7338-2_1,
© Springer Science+Business Media, LLC 2011

1.1 Basic Concepts in Genetic Disease


Statistical Genetics is a branch of statistics that deals with the analysis of inherited
traits and genetic data. We use genetic data loosely here to refer to the biological
material that is inherited during reproduction via egg and sperm cells. In the early
days, statistical genetics was largely dominated by statistics for experimental
studies in plants and animals. Galton’s statistical work in the 1880s on the
inheritance of height in humans is an important exception to this rule. Over the
years, the methodological focus of statistical genetics has changed to keep pace
with the different kinds of genetic data that technology has made available. Most
recently, new technologies arising from the Human Genome Project and HapMap Project
have generated
a surge of methodological development to address unsolved problems in human
genetics. The development of statistical models and methods to explain how genes
influence traits continues to be a common goal in plant, animal and human genetics.
When the discipline of statistical genetics was just beginning, we had little
understanding of the basic biological underpinning of genetics and inheritance apart from
the fact that humans had ‘units’–later termed ‘genes’–that were inherited from their
parents and that ‘units’ could differ from person to person. Most important from the
statistical point of view, there was no standardized way to assay or characterize the
genetic information at the molecular level in an individual. The available data for
most statistical investigations consisted only of traits, also known as phenotypes.
We use the terms traits and phenotypes here to mean individual characteristics, not
observed at the molecular level, which are thought to have a heritable basis. For
example, a person’s blood type at the ABO locus is a phenotype which depends
upon their variants at the ABO gene. A person’s phenotype (here blood group) can
be obtained without knowledge of their gene variant. However, knowing a person’s
blood type will imply something about the information encoded in their ABO gene.
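
The ABO example can be made concrete with a small sketch. It assumes the usual three-allele simplification of the ABO system (alleles A, B and O, with A and B codominant and both dominant over O); that dominance pattern and the names blood_type and compatible_genotypes are illustrative assumptions, not taken from the text. The sketch shows that the phenotype is a many-to-one function of the genotype, so an observed blood group narrows down, but does not always identify, the underlying pair of alleles.

```python
from itertools import combinations_with_replacement

# Assumed simplification (not from the text): three alleles, with A and B
# codominant and both dominant over O.
ALLELES = ("A", "B", "O")


def blood_type(genotype):
    """Map an unordered pair of ABO alleles to the observed blood group."""
    alleles = set(genotype)
    if alleles == {"A", "B"}:
        return "AB"
    if "A" in alleles:
        return "A"
    if "B" in alleles:
        return "B"
    return "O"


def compatible_genotypes(phenotype):
    """List every unordered genotype that would produce the given blood group."""
    return [
        "".join(pair)
        for pair in combinations_with_replacement(ALLELES, 2)
        if blood_type(pair) == phenotype
    ]


if __name__ == "__main__":
    for group in ("A", "B", "AB", "O"):
        print(group, compatible_genotypes(group))
```

Under these assumptions, blood groups AB and O each correspond to a single genotype, whereas groups A and B are each compatible with two, which is the sense in which a phenotype implies something, but not everything, about the gene.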
In these early years, statistical genetics was focused on methods for determining if
traits or diseases were inherited and measuring the degree of inheritance (studies of
familial aggregation), and for determining the underlying genetic model that explains
the relationship between the phenotype and the underlying disease (segregation
analysis). For these analyses, individuals with the disease, called probands, were
identified; information on relatives of the probands was used to form family or
pedigree structures. The term ascertain is used when referring to probands to indicate
