Introduction To Genomics Second Edition PDF

Uploaded by

Mila Anasanti

Introduction to Genomics


INTRODUCTION TO
GENOMICS
SECOND EDITION

Arthur M. Lesk
The Pennsylvania State University

Immured the whole of Life

Within a magic Prison
– Emily Dickinson

Great Clarendon Street, Oxford OX2 6DP
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide in
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi
Kuala Lumpur Madrid Melbourne Mexico City Nairobi
New Delhi Shanghai Taipei Toronto
With offices in
Argentina Austria Brazil Chile Czech Republic France Greece
Guatemala Hungary Italy Japan Poland Portugal Singapore
South Korea Switzerland Thailand Turkey Ukraine Vietnam
Oxford is a registered trade mark of Oxford University Press
in the UK and in certain other countries
Published in the United States
by Oxford University Press Inc., New York
© Arthur M Lesk 2012
The moral rights of the author have been asserted
Database right Oxford University Press (maker)
First edition 2007
All rights reserved. No part of this publication may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
without the prior permission in writing of Oxford University Press,
or as expressly permitted by law, or under terms agreed with the appropriate
reprographics rights organization. Enquiries concerning reproduction
outside the scope of the above should be sent to the Rights Department,
Oxford University Press, at the address above
You must not circulate this book in any other binding or cover
and you must impose the same condition on any acquirer
British Library Cataloguing in Publication Data
Data available
Library of Congress Cataloging in Publication Data
Data available
Typeset by Graphicraft Limited, Hong Kong
Printed in Italy on acid-free paper by L.E.G.O. S.p.A. – Lavis TN

ISBN 978–0–19–956435–4

10 9 8 7 6 5 4 3 2 1
For Victor and Valerie
PREFACE TO THE FIRST EDITION

Of all the claims on our curiosity, we want most to understand ourselves. What are we? What lies in our future? Many features of our lives depend on accidents of history. The time and place of our birth largely determine what language we first learn to speak and whether we are likely to be well-fed and well-educated and receive adequate medical care. Many aspects of our future depend on events outside ourselves and beyond our control.

Within ourselves, also, there are constraints on our lives that brook relatively little argument. In some respects, we are at the mercy of our genomes. Under normal circumstances, all of our basic anatomy and physiology, and eye colour, height, intelligence and basic personality traits, are ingrained in our DNA sequences. This is not to say that our genomes dictate our lives. Some of the constraints are tight – eye colour, for instance – but our genetic endowment also confers on us a remarkable robustness.

This robustness also is a product of evolution. When Shakespeare wrote of ‘the thousand natural shocks that flesh is heir to’, he coupled the challenges of life to heredity. Within the last century, lifestyles have changed with a rapidity hitherto unknown (except for the instants of asteroid impacts). Our talents have many opportunities to nurture themselves and develop in novel ways, and we can meet and survive brutal stresses. These are gifts of our genomic endowment: What genes control is the response of an organism to its environment.

The human genome is only one of the many complete genome sequences known. Taken together, genome sequences from organisms distributed widely among the branches of the tree of life give us a sense, only hinted at before, of the very great unity in detail of all life on Earth. This recognition has changed our perceptions, much as the first pictures of the Earth from space engendered a unified view of our planet.

Of course, superimposed on this basic unity is great variety. We ask: What is special about us? What do we share with our parents and siblings and how do we differ from them? What do we share with all other human beings and what makes us different from the other members of our species? What are the sources of our differences from our closest extant non-human relatives, the chimpanzees? What do we have in common and how do we diverge from other species of primates? of mammals? of vertebrates? of eukaryotes? of all other living things?

The complete sequences of human and other genomes give us complete information about the underlying text of this story. We are beginning to understand how our lives shape themselves under the influence of our genes plus our surroundings.

We are also beginning to intervene. Genetic engineering of microorganisms is an established technique. Genetically modified plants and animals exist and are the subjects of lively debate. To override the genes for hair colour is trivial. Changes in lifestyle or behaviour can – to some extent – avoid or postpone development of diseases to which we are genetically prone. Gene therapy offers the promise of rectifying some inborn defects.

In this book, we shall explore this new knowledge, what it tells us about ourselves and how we can apply it. With power derived from knowledge goes commitment to act wisely. We have responsibilities, to ourselves, to other people, to other species and to ecosystems ranging in size up to the entire biosphere.

Ethical, legal and social issues have been a prominent component of the human genome project. Most technical questions in genomics, as in other scientific subjects, have objectively correct answers. We do not know all the answers, but they are out there for us to discover. Ethical, legal and social issues are different. Many choices are possible. Their selection is not the privilege of scientists in individual laboratories, but of society as a whole. Scientists do have a responsibility to contribute to the informed public discussion that is essential for wise decisions.

One problem encountered in writing about genomics is the need to pick and choose from the many riches of the subject. The list of subjects that cannot be left out is too long and threatens to reduce the treatment of each to superficiality. There is also a serious organizational challenge: many phenomena must be approached from several different points of view. A reader may be relieved to conclude that a topic has been beaten thoroughly into submission in one chapter, only to encounter it again, alive and kicking, in a different context.

The speed at which the field is moving causes other problems. One is often pleased with a draft of a section only to find the carefully described conclusions modified in next week’s journals. Yet, there is a great pleasure in seeing Nature’s secrets emerging before one’s eyes.

Another casualty of rapid progress is a loss of interest in history and biography. We are fantastically interested in the development of the sea urchin and the fruit fly, but not at all in the development of molecular genomics. Intellectual struggles that occupied entire careers leave behind only terse conclusions, often without any appreciation of the experiments that established the facts, much less of the alternative hypothesis tested and rejected. The force of the scientists’ personalities, and their foibles, are forgotten. This is too bad: those who do not learn from the successes of history will find it harder to emulate them.

Genomics is an interdisciplinary subject. The phenomena we want to explain are biological. But many fields contribute to the methods and the intellectual approaches that we bring to bear on the data. Physicists, mathematicians, computer scientists, engineers, chemists, clinical practitioners and researchers, have all joined in the enterprise. This book will appeal, at least in part, to all of them. However, the central point of view remains focused on the biology.

More specifically, the focus is on human biology. In fact, on the biology of humans who are curious about other species, albeit primarily for what the other species tell us about ourselves. This choice naturally reflects the potential readership of this book. (If bacteria or fruit flies could read, genomics textbooks would look very different.)

This book assumes that the reader already has some acquaintance with modern molecular biology, and builds on and develops this background, as a self-contained presentation. It is suitable as a textbook for undergraduates or starting postgraduate students.

Exercises, problems and ‘weblems’ at ends of chapters test and consolidate understanding and provide opportunities to practise skills and explore additional subjects. Exercises are short and straightforward applications of material in the text. Answers to exercises appear on the web site associated with the book. Problems, also, make use of no information not contained in the text, but require lengthier answers or in some cases calculations. The third category, ‘weblems’, require access to the World Wide Web. Weblems are designed to give readers practice with the tools required for further study and research in the field.
PREFACE TO THE SECOND EDITION

Fast, inexpensive sequencing has transformed genomics. The landmark goal, the $US1000 human genome, will likely soon be achieved. At the time of writing, thousands of human individuals have had their full genomes sequenced, and many more are on the way. A very large number of people have had sequences determined for individual genes. For example, mutations in BRCA1 suggest an increased likelihood of developing breast or ovarian cancer.

Genetic testing for disease is one of many possible fields of application of genomics. Clinical medicine heads the list; no one doubts that it was the promise of improvements in health that motivated support for the original human genome project. Understanding the relationships between genes and disease will allow more precise diagnosis and warnings of increased risk of disease in patients and their offspring. It will allow design of treatment tailored to the biochemical characteristics of the patient, called pharmacogenomics.

Genomes of other organisms also have implications for human health, especially those of pathogenic organisms that have developed, or are threatening to develop, antibiotic resistance. Other applications of genomics include improvement of crops and domesticated animals, enhancing food production, and support of conservation efforts dedicated to preserving endangered species. Development of alternative energy sources is a challenge to both physics and biology.

Underlying these applications, genomics offers us a profound understanding of fundamental principles of biology. On the personal level, genome exegesis will fundamentally alter our perception of ourselves: what does it mean to be human? The answer lies somewhere in the complex interplay of our genes and life histories. For some characteristics, there is a simple answer. Your eye colour, and whether or not you suffer from sickle-cell anaemia, depend exclusively on the sequences of particular genes. For most of your phenotypic traits, however, the assignment of contributions to their origin from genome, epigenetics, and life history is a severe challenge.

In this book, I have tried to present a balanced view of the background of the subject, the technical developments that have so greatly increased the data flow, the current state of our knowledge and understanding of the data, and applications to medicine and other fields. One aspect of the first edition that I liked was its concision. Unfortunately, this has had to be sacrificed to the stampeding progress of the field.
PLAN OF THE SECOND EDITION

Chapter 1, Introduction to Genomics, sets the stage, and introduces all of the major players: DNA and protein sequences and structures, genomes and proteomes, databases and information retrieval, and bioinformatics and the World Wide Web. Subsequent chapters develop these topics in detail. Chapter 1 briefly provides the framework of how they fit together and sets them in their context of biomedical, physical, and computational sciences.

Chapter 2 demonstrates that Genomes are the Hub of Biology. Whereas genome sequences are determined from individuals, to appreciate life as a whole requires extending our point of view spatially, to populations and interacting populations; and temporally, to consider life as a phenomenon with a history. We can study the characteristics of life in the present, we can determine what came before, and we can – at least to some extent – extrapolate to the future. The ‘central dogma’ and the genetic code underlie the implementation of the genome, in terms of the synthesis of RNAs and proteins. Absent from Crick’s original statement of the central dogma is the crucial role of regulation in making cells stable and robust, two characteristics essential for survival.

Chapter 3, Mapping, Sequencing, Annotation, and Databases, describes how genomics has emerged from classical genetics and molecular biology. The first nucleic acid sequencing, by groups led by F. Sanger and W. Gilbert, in the 1970s, was a breakthrough comparable to the discovery of the double helix of DNA. The challenges of sequencing stimulated spectacular improvements in technology. The first was the automation of the Sanger method. The original sequences of the human genome were accomplished by batteries of automated Sanger sequencers. Subsequently, a series of ‘new generations’ of novel approaches have brought the landmark goal, the $US1000 human genome, within reach. Where do all the data go? Chapter 3 also introduces the databanks that archive, curate, and distribute the data, and some of the information-retrieval tools that make them accessible to scientific enquiry.

Chapter 4, Comparative Genomics, begins with a general survey of the different modes of genome organization with which living things have experimented. With complete sequences of genomes of many different species, we can confront and compare them. But, of course, different individuals of a species do not necessarily have identical genomes. What is the nature and extent of the variability? With intraspecies variability as a baseline, what are the similarities and differences between genomes of different species? Comparative genomics thereby allows us to address a question central not only to this book but to the field as a whole: what does it mean to be human?

Chapter 5, Evolution and Genomic Change, relates genomics to evolution, a major unifying principle of biology. (Arguably the laws of thermodynamics are another.) T. Dobzhansky famously said: ‘Nothing in biology makes sense except in the light of evolution.’ Description of some of the important ideas and tools – of taxonomy and phylogeny, on the classical species level and on the molecular level – will be useful in organizing the material in subsequent chapters.

Chapter 6, Genomes of Prokaryotes, surveys the genomes of bacteria and archaea in more detail. Taxonomy and phylogeny of prokaryotes present problems because of extensive horizontal gene transfer. This challenges the whole idea of a hierarchy of biological classification. Many bacteria have been cloned and studied in isolation, especially those responsible for disease. However, a new field, metagenomics, deals with the entire complement of living things in an environmental sample, allowing us to address questions about interspecies interaction in the ‘real world’. Sources include ocean water, soil, and the human gut.

Chapter 7 surveys Genomes of Eukaryotes. It starts with yeast, which is about as simple as a eukaryote can get. Selected plant, invertebrate, and chordate genomes illuminate the many profound common features of eukaryotic genomes; and the very great variety of structures, biochemistry, and lifestyles that are compatible with the underlying similarities.

Chapter 8, Genomics and Human Biology, develops applications to the study of our own species. Although clinical applications are undoubtedly the most important, genomics has important contributions to make to human palaeontology, anthropology, and the law. The ability to extract DNA from extinct species, including Neanderthals, sheds light on our early evolution. Events in our history, including migrations and domestication of crops and animals, have left their traces in DNA sequences.

Chapter 9 deals with Transcriptomics, the measurement and application of protein expression patterns. These measurements have been carried out using microarrays. However, as sequencing technology grows in power, it may replace microarrays as the method of choice for these measurements. Applications treated include changes in different physiological states, such as the diauxic shift in yeast, or sleep and waking in rats; in plant and animal development; and in diagnosis and treatment of disease.

Chapter 10, Proteomics, describes the principles of protein structure and the high-throughput data streams that provide information about sets of proteins in cells. Proteomics is an essential complement to genomics. The interactions and relationships between the genome and proteome are intimate, both during cellular activity and in the longer term in evolution. (A colleague once entitled a keynote lecture: ‘Genes are from Venus, proteins are from Mars.’)

The last chapter emphasizes attempts to integrate our data and understanding, an area known as systems biology. Whereas classical biochemistry made great contributions to demonstrating the properties of proteins in isolation, our job now is to put things back together. Chapter 11, Systems Biology, presents a description of biological organization that is based on networks. Cells contain parallel sets of networks based on physical and logical interactions among molecules. Each network also has static and dynamic aspects. The ultimate, most profound, goal is a complete and integrated picture of all of life’s activity, from the molecule to the biosphere.
NEW TO THIS EDITION

The most important change since publication of the first edition has been the spectacular progress in high-throughput sequencing. The resulting growth of the data produced – in quality, quantity, and type – has altered the entire landscape of genomics itself, and its influence has invaded surrounding fields. No area of biology has been left unscathed.

The new edition reflects this. Many more complete genomes are available. Instead of a few isolated snapshots of the evolution, we can trace its pathways through the phyla. Instead of sequencing single individuals, it is possible to measure directly the variation within populations. Palaeogenomics has opened a window onto extinct species.

High-throughput sequencing has created novel data streams, such as RNAseq to measure the transcriptome. The new techniques complement, and may for some purposes even supersede, some common experimental techniques such as microarrays.

Study of human disease through sequencing continues to be a major effort. Identification of genes responsible for particular diseases permits testing, genetic counselling, and risk assessment and avoidance. For many important diseases, a patient can expect more precise diagnosis and prognosis, and more precise recommendations for treatment. Cancer genomics – the comparison of sequences from normal and tumour cells from single patients – has become a major activity.

In the new edition, extended coverage is given both to the applications of genome sequences to working out of evolutionary relationships in microorganisms, plants and animals; and to clinical applications to humans. Non-clinical applications to human biology and history are sufficient to justify a chapter of their own. The genomics of crop domestication not only sheds light on human – as well as plant – history, but emphasizes the reciprocal interactions between humans and the rest of the biosphere.

Nevertheless, progress is happening so fast as to make unavoidable the feeling of frustration in aiming at a moving target. The hope is that the second edition has erected for the reader a sound framework, both intellectual and factual, that will make it possible, when encountering subsequent developments, to see where and how they fit in.
RECOMMENDED READING

Where else might the interested reader turn? This book is designed as a companion volume to three others: Introduction to Protein Architecture: the Structural Biology of Proteins; Introduction to Protein Science: Architecture, Function, and Genomics; and Introduction to Bioinformatics (all published by Oxford University Press). Of course there are many fine books by many authors, some of which are listed as recommended reading at the ends of the chapters. The goal is that each reader will come to recognize his or her own interests, and be equipped to follow them up.

Many applications of genomics to health care are discussed in the book. However, nothing here should be taken as offering medical advice to anyone about any condition.

INTRODUCTION TO GENOMICS ON THE WEB

Results and research in genomics make use of the web, both for storage and distribution of data, and methods of analysis. Readers will need to become familiar with web sites in genomics, and to develop skills in using them. Many useful sites are mentioned in the book. The author’s Introduction to Bioinformatics offers a pedagogical approach to computational aspects of genomics. However, clearly the place to learn about the web is on the web itself.

To this end, an Online Resource Centre at www.oxfordtextbooks.co.uk/orc/leskgenomics2e/ accompanies this book. This contains material from the book – figures and ‘movies’ of the pictures of structures, answers to exercises, and hints for solving problems. In addition it contains a guided tour of web sites in genomics, coordinated with the printed book, with additional exercises and problems (of the ‘weblem’ variety). Some of these are suitable for use as practical or laboratory assignments.

ACKNOWLEDGEMENTS

I thank S. Ades, G.F. Anderson, M.M. Babu, S.L. Baldauf, P. Berman, B. de Bono, D.A. Bryant, C. Cirelli, A. Cornish-Bowden, N.V. Fedoroff, J.G. Ferry, R. Flegg, J.R. Fresco, D. Grove, R. Hardison, E. Holmes, H. Klein, E. Koc, A.S. Konagurthu, T. Kouzarides, H.A. Lawson, E.L. Lesk, M.E. Lesk, V.E. Lesk, V.I. Lesk, D.A. Lomas, B. Luisi, P. Maas, K. Makova, W.B. Miller, C. Mitchell, J. Moult, E. Nacheva, A. Nekrutenko, G. Otto, A. Pastore, D. Perry, C. Praul, K. Reed, G.D. Rose, J. Rossjohn, S. Schuster, B. Shapiro, J. Tamames, A. Tramontano, A.A. Travers, A. Valencia, G. Vriend, L. Waits, J.C. Whisstock, A.S. Wilkins and E.B. Ziff for helpful advice.

I thank the staff of Oxford University Press for their skills and patience in producing this book.
CONTENTS

Preface to the first edition vii

Preface to the second edition ix
Plan of the second edition x
New to this edition xii
Recommended reading xiii
Introduction to genomics on the web xiii
Acknowledgements xiii
Sections marked with * are clinically related

1 Introduction to Genomics 3
The human genome 4
Phenotype = genotype + environment + life history + epigenetics 4

Contents of the human genome 6

Genes that encode the proteome 8
The leap from the one-dimensional world of sequences to the three-dimensional world we inhabit 9

Varieties of genome organization 10

Chromosomes, organelles, and plasmids 10
Genes 13
Dynamic components of genomes 15

Genome sequencing projects 17

Genome projects and the development of our current information library 18

Variations within and between populations 20

Cancer genome sequencing* 20
Human genome sequencing 20
The human genome and medicine* 21
Prevention of disease 21
Detection and precise diagnosis 22
Discovery and implementation of effective treatment 22
Health care delivery 23

The evolution and development of databases 24

Databank evo-devo 25
Genome browsers 25

Protein evolution: divergence of sequences and structures within and between species 30
Different globins diverged from a common ancestor 30
Ethical, legal, and social issues (*, partly) 32
Databases containing human DNA sequence information 32
Recommended Reading 35
Exercises, Problems, and Weblems 36

2 Genomes are the Hub of Biology 41

Individuals, populations, the biosphere: past, present, and future 42
The central dogma, and peripheral ones 43
Expression patterns 43
Regulation of gene expression 44
Proteomics 48
Genomics and developmental biology 48
Genes and minds: neurogenomics 50
Populations 52
Single-nucleotide polymorphisms (SNPs) and haplotypes 52
A clinically important haplotype: the major histocompatibility complex* 55
Mutations and disease* 57
Genetic diseases – some examples of their causes and treatment* 59
Haemoglobinopathies – molecular diseases caused by abnormal haemoglobins 59
Phenylketonuria 60
Alzheimer’s disease 61
SNPs and cancer 62
Species 66
The biosphere 67
Extinctions 67
Recommended Reading 70
Exercises, Problems, and Weblems 71

3 Mapping, Sequencing, Annotation, and Databases 79

Classical genetics as background 80
What is a gene? 81
Maps and tour guides 81
Genetic maps 82
Linkage 82
Linkage disequilibrium 83
Chromosome banding pattern maps 84
High-resolution maps, based directly on DNA sequences 88
Restriction maps 89
Discovery of the structure of DNA 90
DNA sequencing 93
Frederick Sanger and the development of DNA sequencing 94
The Maxam–Gilbert chemical cleavage method 97
Automated DNA sequencing 97
Organizing a large-scale sequencing project 98
Bring on the clones: hierarchical – or ‘BAC-to-BAC’ – genome sequencing 98
Whole-genome shotgun sequencing 99
High-throughput sequencing 100
Life in the fast lanes 103
Databanks in molecular biology 104
Nucleic acid sequence databases 106

Protein sequence databases 106

Databases of genetic diseases – OMIM and OMIA* 106
Databases of structures 107
Classifications of protein structures 107
Specialized or ‘boutique’ databases 107
Expression and proteomics databases 107
Databases of metabolic pathways 109
Bibliographic databases 109
Surveys of molecular biology databases and servers 109

4 Comparative Genomics 115

Introduction 116
Unity and diversity of life 116
Taxonomy based on sequences 117

Sizes and organization of genomes 121

Genome sizes 121

Viral genomes 124

Recombinant viruses 124
Influenza: a past and current threat* 126

Genome organization in prokaryotes 129

Replication and transcription 129
Gene transfer 130

Genome organization in eukaryotes 132

Photosynthetic sea slugs: endosymbiosis of chloroplasts 132
How genomes differ 133
Variation at the level of individual nucleotides 133
Duplications 134
Comparisons at the chromosome level: synteny 143

What makes us human? 143

Comparative genomics 143
Combining the approaches: the FOXP2 gene 144

Genomes of chimpanzees and humans 144

Genomes of mice and rats 145
Model organisms for study of human diseases* 146
The genome of Caenorhabditis elegans 147
The genome of Drosophila melanogaster 147
Homologous genes in humans, worms, and flies 148

The ENCODE project 151

The modENCODE project 153

5 Evolution and Genomic Change 161

Evolution is exploration 162

Biological systematics 164
Biological nomenclature 164
Measurement of biological similarities and differences 165

Homologues and families 166

Pattern matching – the basic tool of bioinformatics 167
Sequence alignment 167
Defining the optimum alignment 171
Approximate methods for quick screening of databases 173
Pattern matching in three-dimensional structures 176

Evolution of protein sequences, structures, and functions 176

The effects of single-site mutations 177
Evolution of protein structure and function 178

Phylogeny 181
Phylogenetic trees 183
Clustering methods 184
Cladistic methods 185
The problem of varying rates of evolution 185
Bayesian methods 186

Short-circuiting evolution: genetic engineering 186

Recommended Reading 187
Exercises, Problems, and Weblems 188

6 Genomes of Prokaryotes 191

Evolution and phylogenetic relationships in prokaryotes 192

Major types of prokaryotes 192
Do we know the root of the tree of life? 194

Archaea 195
The genome of Methanococcus jannaschii 197
Life at extreme temperatures 197
Comparative genomics of hyperthermophilic archaea: Thermococcus kodakarensis and Pyrococci 201

Bacteria 204
Genomes of pathogenic bacteria* 204
Genomics and the development of vaccines* 206

Metagenomics: the collection of genomes in a coherent environmental sample 208
Marine cyanobacteria – an in-depth study 208

7 Genomes of Eukaryotes 215

The origin and evolution of eukaryotes 216
Evolution and phylogenetic relationships in eukaryotes 216
The yeast genome 216
The evolution of plants 218
The genome of the sea squirt (Ciona intestinalis) 220
The genome of the pufferfish (Tetraodon nigroviridis) 221
The chicken genome 224
The platypus genome (Ornithorhynchus anatinus) 225
The dog genome 226

Palaeosequencing – ancient DNA 229

Recovery of DNA from ancient samples 229

DNA from extinct birds 230

The moas of New Zealand 230
The dodo and the solitaire 232
High-throughput sequencing of mammoth DNA 232
The mammoth nuclear genome 233
The phylogeny of elephants 233

8 Genomics and Human Biology 237

Genomics in personal identification 238
Mitochondrial DNA 239
Gender identification 240
Physical characteristics 241

The domestication of crops 241

Maize (Zea mays) 243
Rice (Oryza sativa) 245
Chocolate (Theobroma cacao) 245
The T. cacao genome 247

Genomics in anthropology 250

The Neanderthal genome 250
Ancient populations and migrations 251

Genomics and language 258

Recommended Reading 260
Exercises, Problems, and Weblems 261

9 Microarrays and Transcriptomics 265

Introduction 266
Applications of DNA microarrays 268

Analysis of microarray data 269


Expression patterns in different physiological states 272

The diauxic shift in Saccharomyces cerevisiae 272
Sleep in rats and fruit flies 274

Expression pattern changes in development 276

Variation of expression patterns during the life cycle of Drosophila melanogaster 276
Flower formation in roses 278

Expression patterns in learning and memory: long-term potentiation 282

Conserved clusters of co-expressing genes 284

Evolutionary changes in expression patterns 285

Applications of microarrays in medicine* 287
Development of antibiotic resistance in bacteria 287
Childhood leukaemias 290
Whole transcriptome shotgun sequencing: RNA-seq 292
Recommended Reading 293
Exercises, Problems, and Weblems 293

10 Proteomics 297
Introduction 298
Protein nature and types 298
Protein structure 299
The chemical structure of proteins 299
Conformation of the polypeptide chain 300
Protein folding patterns 301

Post-translational modifications 304

Why is there a common genetic code with 20 canonical amino acids? 306
Separation and analysis of proteins 307
Polyacrylamide gel electrophoresis (PAGE) 307
Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) 307
Mass spectrometry 308
Classification of protein structures 310
SCOP 312
Changes in folding patterns in protein evolution 313
Many proteins change conformation as part of the mechanism of their function 314
Conformational change during enzymatic catalysis 314
Motor proteins 316
Allosteric regulation of protein function 318
Conformational states of serine protease inhibitors (serpins) 320
Protein structure prediction and modelling 322
Homology modelling 323
Available protocols for protein structure prediction 327
Structural genomics 328

Directed evolution and protein design 328

Directed evolution of subtilisin E 329
Enzyme design 330

Protein complexes and aggregates 330

Protein aggregation diseases* 331
Properties of protein–protein complexes 332
Multisubunit proteins 335

11 Systems Biology 341

Introduction to systems biology 342

Two parallel networks: physical and logical 342
Statics and dynamics of networks 343

Pictures of networks as graphs 344

Trees 345

Sources of ideas for systems biology 346

Complexity of sequences 346
Shannon’s definition of entropy 347
Randomness of sequences 348
Static and dynamic complexity 349
Computational complexity 350

The metabolome 351

Classification and assignment of protein function 351
Metabolic networks 354
Databases of metabolic pathways 354
Methionine synthesis in Escherichia coli 355
The Kyoto Encyclopedia of Genes and Genomes (KEGG) 356
Evolution and phylogeny of metabolic pathways 357
Carbohydrate metabolism in archaea 358
Reconstruction of metabolic networks 360

Regulatory networks 361

Signal transduction and transcriptional control 362
Structures of regulatory networks 362

Dynamics, stability, and robustness 363

Robustness through redundancy 363
Dynamic modelling 365

Protein interaction networks 366

Structural biology of regulatory networks 370
Protein–DNA interactions 373
Structural themes in protein–DNA binding and sequence
recognition 373
An album of transcription regulators 374

Gene regulation 376

The transcriptional regulatory network of Escherichia coli 376
Regulation of the lactose operon in E. coli 380

The genetic regulatory network of Saccharomyces cerevisiae 382

Adaptability of the yeast regulatory network 383
Recommended Reading 384
Exercises, Problems, and Weblems 385

Epilogue 389
Index 390
Collect a sample of your cells by rinsing out your mouth with dilute salt water. Add
some detergent to dissolve the cell and nuclear membranes, releasing the contents.
Precipitate out the proteins by adding alcohol; stir, and let settle.

In a few minutes, fibres will rise out of the murk to the top of the vessel.

These fibres are your DNA (plus that of some bacteria that were living in your
mouth). The sequence of your DNA is your lifelong endowment. It determined
that you are human, that you are male or female, that you are brown- or blue-eyed,
that you are right- or left-handed. But think of these fragile cables not as a leash that
constrains you but as the cords that raise the curtains on the drama of your life.

– The Slurry with the Fringe on the Top

CHAPTER 1

Introduction to Genomics

LEARNING GOALS

• Knowing the basic facts about the human genome – how many base pairs it contains, estimates
of how many genes it contains that code for proteins or RNAs.
• Recognizing the contributions to any individual’s phenotype from the genome sequence itself,
from life history, and from epigenetic signals within the fertilized egg.
• Appreciating that the human genome contains extensive repetitive regions of various kinds.
• Knowing the basic central dogma, that DNA is transcribed to RNA, which is translated to
protein. Beyond this, knowing that many protein-coding genes show variable splicing, which
adds another dimension of complexity to the scheme by which the genome specifies the
proteome.
• Understanding the importance of comparative genome sequencing projects, to reveal processes
of evolution, and to help interpret regions in the human genome.
• Appreciating the large number of genome projects treating different species, widely distributed
among life forms, and including metagenomics, which produces very large amounts of data
from environmental samples.
• Understanding how different types of human DNA sequencing projects are organized, including
those carried out by large international organizations, the collection of data by law-enforcement
organizations, those carried out for specific clinical tests, and direct-to-the-public sequencing
usually motivated by questions of genealogy.
• Distinguishing the several types of potential applications of genome sequence data to medicine,
including ways in which information about large numbers of people can support clinical research,
and ways in which information about specific individuals can help treat disease more effectively,
or – even better – prevent it.
• Understanding the importance of computer science and bioinformatics in producing the raw
sequence data, in creating databases in molecular biology, in archiving and careful curation of
the data, in distributing them via the web, and in creating information-retrieval tools to allow
effective mining of the data for research and applications.
• Appreciating the ethical, legal, and social implications of collection of DNA sequence data and
the conflicting demands of public safety and individual privacy.

The human genome

A human genome contains approximately 3.2 × 10^9 base pairs, distributed among 22 paired chromosomes, plus two X chromosomes in females and X and Y in males. The first human genomes were determined in 2001, the culmination of 10 years of pioneering work and dedication. Since then, advances in technology have made genomic sequencing cheaper and faster. Sequence data now flow copiously. This creates the challenges of understanding the information that our genomes contain, and applying the data and analysis to improve human welfare. Sequencing genomes of other species both facilitates these goals and extends them, by revealing general principles of biology.

How do the contents of our genomes determine who we are?

Phenotype = genotype + environment + life history + epigenetics

Each reader of this book is an individual, with physical, biochemical, and psychological characteristics. (Do not be surprised if these distinctions become more and more nebulous!) Each of you has a general form and metabolism that is common to all humans. At the molecular level, you have much in common with other species as well. But there is also substantial variation within our species, to give you your individual appearance and character. You are in a state of health somewhere within the spectrum between robust good health and morbid disease. You are currently in some psychological state, and in some mood, reflecting your personality and current activities.

• Your genotype is your DNA sequence, both nuclear and mitochondrial. (For plants, include also the sequence of the chloroplast DNA.)
• Your phenotype is the collection of your observable traits, other than your DNA sequence. These include macroscopic properties such as height, weight, eye and hair colour; and microscopic ones, such as possible sickle-cell anaemia, glucose-6-phosphate dehydrogenase deficiency, or the retention beyond infancy of the ability to digest lactose. Of great importance to clinical applications of genomics are traits that govern susceptibility to disease and risk factors; and those that determine the effectiveness of different drugs in different individuals. These allow for personalized prevention and treatment of disease based on DNA sequences, or pharmacogenomics.
• Your life history includes the integrated total of your experiences, and the physical and psychological environment in which you developed. Your nutritional history has influenced your physical development. A nurturing environment and educational opportunities have influenced your mental development. Less obvious than most aspects of your life history is the growing recognition of the importance of your in utero environment in determining your development curve and even your adult characteristics.
• At the interface between the genome and life experience are epigenetic factors. It is largely true that all cells of your body, except sperm or egg cells and cells of the immune system, have almost the same DNA sequence (subject to usually modest amounts of accumulated mutations). Yet your tissues are differentiated, with different sets of genes expressed or silenced in liver, brain, etc. Now, some of these regulatory signals survive cell division. (When a liver cell divides, it divides into two liver cells.) Your parents’ own life histories might have altered the epigenetic patterns in their cells, and the fertilized egg from which you were subsequently formed contained some of these ‘pre-differentiation’ signals. In this way inheritance of acquired characteristics has re-entered respectable mainstream biology.

The relative importance of these factors in determining your phenotype varies from trait to trait. Some are determined solely and irrevocably by your alleles for specific genes. Others depend on complex interactions between your genes and your life history, and epigenetic signals from your parents.

A genome is like a page of printed music. The page is a fixed physical object, but the notes are consistent with realizations, in time and space, in a variety of ways – a limited variety of ways.

Thus a genome constrains but does not dictate the features of an organism. Varied surroundings and experience lead organisms to explore different states consistent with their genomes. Even in microorganisms, expression of many genes is conditional. Synthesis of enzymes for lactose utilization in E. coli depends on the presence of substrate in the medium. The shift of yeast between anaerobic and aerobic metabolism is a temporary and reversible change in physiological state in response to changes in environmental conditions.

Most people would not regard simple and transient alterations in appearance that have no genetic component at all as true changes in phenotype. Trimming one’s nails, getting a haircut, and wearing makeup are common examples.

Conversely, many environmental effects, thought of as lifestyles, have long-term effects on development. Exercise enhances musculature. Education enhances intellectual development. A phenylalanine-controlled diet can prevent the harmful effects of the metabolic disease phenylketonuria. Conversely, malnutrition, disease, injury, and cruel treatment can be physically and mentally debilitating.

Science has provided a variety of means of interposition between genotype and phenotype, widely applied and sometimes abused. Examples include simple changing of hair colour, tattooing, cosmetic surgery and ‘cosmetic endocrinology’ (including both the treatment of children deficient in human growth hormone and the use of performance-enhancing drugs by athletes), and even – at least arguably – psychoanalysis.

Some environmental effects have specific windows of responsiveness to give results that are subsequently irreversible. Growth hormone treatments and surgery to produce the famous castrati as opera singers are examples. In turtles and other reptiles, incubation temperature controls whether eggs develop into males or females. The effect depends on the temperature dependence of the expression of the enzyme aromatase, which converts androgens to oestrogens. (The corresponding enzyme in humans is the target of inhibitors used in treatment of cancers that are stimulated to grow by oestrogen, to reduce the exposure of the tumour to oestrogen.)

Given the variability of the contributions of heredity, environment, life history, and epigenetics in determining phenotype, how can we measure their relative importance for any particular trait? The classic method of distinguishing effects of genetics from those of surroundings and experience – ‘nature’ and ‘nurture’ – is controlled experiments with genetically identical organisms. For human beings, this means monozygotic twins. (Monozygotic human triplets are extremely rare.) Many studies have compared the similarities of identical twins reared apart, individuals who have the same genes but different environments. Less well-controlled are comparisons of identical with fraternal twins, or of non-twin siblings with adopted children reared in the same family. There have been suggestions that fraternal twins share greater similarity than non-twin siblings as a result of the shared uterine experience.

Human twin and sibling studies to distinguish hereditary components of traits have a long and contentious history, as they have formed the basis for important social decisions. Intelligence quotient, or IQ, is the ratio of mental age to chronological age. When properly measured, it remains constant in early childhood. Objective and quantitative measurement of intelligence in adults is extremely difficult. Attempts to measure inheritance of intelligence – or even of score on intelligence quotient tests; not the same thing if the tests are culturally biased – have produced controversial and unsatisfying results. It has proved very difficult to devise tests free of socioeconomic bias.*

* See Gould, S.J. (1996). The Mismeasure of Man. W.W. Norton, New York.

• Phenotypic traits – both macroscopic and molecular – depend on a combination of influences from genome sequences themselves, the individual’s life history, and epigenetic signals in the fertilized egg.

Contents of the human genome

A walk through the human genome is like a tour of a continent. One encounters centres of bustling activity, rich in genes and their regulatory elements. These are like villages and even cities. One passes also through large tracts of emptiness, or regions with unrelieved monotony of repeated elements.

An inventory of the human genome includes:

• The most prominent and familiar aspects of the genome, the regions that code for proteins. Protein-coding genes are transcribed into messenger RNA (mRNA). After processing, ribosomes translate mature mRNA to polypeptide chains. Each triplet of nucleotides in the message corresponds to one amino acid, according to the genetic code (see Box 1.1). Box 1.2 lists the 20 canonical amino acids.
Despite their importance, protein-coding genes occupy a small fraction of the human genome – no more than about 2–3% of the overall sequence. They are distributed across the different chromosomes, but not evenly. Many protein-coding genes appear in multiple copies, either identical or diverged into families. For instance, humans have over 900 related olfactory-receptor genes, and some animals have many more.

• Francis Crick encapsulated this scheme in the Central Dogma of Molecular Biology:
DNA makes RNA makes Protein

• Some regions of the genome encode non-protein-coding RNA molecules (that is, RNAs exclusive of messenger RNAs), including but not limited to transfer RNAs, the RNA components of ribosomes, and microRNAs and small interfering RNAs that regulate translation (miRNAs and
siRNAs). There are about 3000 genes coding for RNAs, exclusive of the mRNAs translated to proteins. It is becoming clear that the RNA-ome is much richer than had been suspected. Except for RNAs involved in the machinery of protein synthesis, such as transfer RNAs and the ribosome itself, most non-coding RNAs are involved in control of gene expression.
The regions that encode proteins and non-protein-coding RNAs correspond to molecules that form non-transient parts of the cell’s contents. They do of course ‘turn over’, but at rates lower than messenger RNAs. mRNAs must have short lifetimes in order to turn off transcription, as part of the processes that regulate gene expression.
• Other regions contain binding sites for ligands responsible for regulation of transcription. In assessing the total amount of the genome dedicated to control, one would need to include both the regulatory sites themselves, and all the proteins and RNAs encoded that have regulatory functions, arguably including receptors.
• Repetitive elements of unknown function account for surprisingly large fractions of our genomes. Long and Short Interspersed Elements (LINEs and SINEs) account for 21% and 13% of the genome. Even more-highly repeated sequences – minisatellites and microsatellites – may appear as tens or even hundreds of thousands of copies, in aggregate amounting to 15% of the genome. (See Box 1.3.)

• We know functions of some regions of the genome. That we cannot assign functions to others may merely reflect our ignorance. Some regions appear to have a life of their own, hitchhiking and reproducing within genomes, and contributing to evolution. In some cases they actively enhance rates of chromosomal rearrangements.

BOX 1.1 Protein synthesis

Transcription of a protein-coding gene into RNA is followed in eukaryotes by splicing to form a mature messenger RNA (mRNA) molecule. The ribosome synthesizes a polypeptide chain according to the sequence of triplets of nucleotides, or codons, in the mRNA. The protein folds spontaneously to a native three-dimensional structure that accounts for its biological function.

The standard genetic code is shown here. The codons are those appearing in DNA rather than RNA; that is, the codons contain t rather than u. Both three-letter and one-letter abbreviations for the amino acids appear. Note that the code is redundant: except for Met and Trp, multiple codons specify the same amino acid. A mutation that changes a codon to another codon for the same amino acid is called a synonymous mutation. Three triplets are reserved as STOP signals, effecting termination of translation. Variations from this standard code occur in mitochondria and chloroplasts, and sporadically in individual species.

The standard genetic code

ttt Phe F    tct Ser S    tat Tyr Y    tgt Cys C
ttc Phe F    tcc Ser S    tac Tyr Y    tgc Cys C
tta Leu L    tca Ser S    taa STOP     tga STOP
ttg Leu L    tcg Ser S    tag STOP     tgg Trp W

ctt Leu L    cct Pro P    cat His H    cgt Arg R
ctc Leu L    ccc Pro P    cac His H    cgc Arg R
cta Leu L    cca Pro P    caa Gln Q    cga Arg R
ctg Leu L    ccg Pro P    cag Gln Q    cgg Arg R

att Ile I    act Thr T    aat Asn N    agt Ser S
atc Ile I    acc Thr T    aac Asn N    agc Ser S
ata Ile I    aca Thr T    aaa Lys K    aga Arg R
atg Met M    acg Thr T    aag Lys K    agg Arg R

gtt Val V    gct Ala A    gat Asp D    ggt Gly G
gtc Val V    gcc Ala A    gac Asp D    ggc Gly G
gta Val V    gca Ala A    gaa Glu E    gga Gly G
gtg Val V    gcg Ala A    gag Glu E    ggg Gly G

BOX 1.2 The twenty standard amino acids in proteins

Non-polar amino acids
G glycine       A alanine        P proline          V valine
I isoleucine    L leucine        F phenylalanine    M methionine

Polar amino acids
S serine        C cysteine       T threonine        N asparagine
Q glutamine     Y tyrosine       W tryptophan

Charged amino acids
D aspartic acid E glutamic acid  K lysine           R arginine
H histidine

Amino acid names are frequently abbreviated to their first three letters – for instance Gly for glycine – except for isoleucine, asparagine, glutamine, and tryptophan, which are abbreviated to Ile, Asn, Gln, and Trp, respectively. The rare amino acid selenocysteine has the three-letter abbreviation Sec and the one-letter code U. The even rarer amino acid pyrrolysine has the three-letter abbreviation Pyl and the one-letter code O.

Amino acid sequences are always stated in order from the N-terminal to the C-terminal. This is also the order in which ribosomes synthesize proteins: ribosomes add amino acids to the free carboxy terminus of the growing chain.
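As a minimal sketch of how the codon table in Box 1.1 is applied, the following Python builds the standard genetic code as a dictionary and translates a coding-strand DNA sequence. The function names `translate` and `is_synonymous` and the example sequence are illustrative assumptions, not taken from the text; `*` stands for STOP.

```python
from itertools import product

# Standard genetic code in the DNA alphabet (t rather than u), as in Box 1.1.
# Codons are enumerated with the first base varying slowest, in the order t, c, a, g.
BASES = "tcag"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base t
    "LLLLPPPPHHQQRRRR"  # first base c
    "IIIMTTTTNNKKSSRR"  # first base a
    "VVVVAAAADDEEGGGG"  # first base g
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate coding-strand DNA from its first base, stopping at the first STOP codon."""
    dna = dna.lower()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # one of the three STOP triplets
            break
        protein.append(aa)
    return "".join(protein)


def is_synonymous(codon_a, codon_b):
    """True if two different codons specify the same amino acid."""
    a, b = codon_a.lower(), codon_b.lower()
    return a != b and CODON_TABLE[a] == CODON_TABLE[b]


print(translate("atggcttgctaa"))    # MAC
print(is_synonymous("ctt", "ctc"))  # True: both code for Leu
print(is_synonymous("atg", "ata"))  # False: Met versus Ile
```

The sketch ignores everything upstream of translation (splicing, choice of reading frame, variant codes in mitochondria and chloroplasts); it only illustrates the table lookup itself.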

BOX 1.3 Repetitive elements in the human genome

Moderately repetitive DNA

• Functional
  – dispersed gene families, created by gene duplication followed by divergence
    – e.g. actin, globin
  – tandem gene family arrays
    – rRNA genes (250 copies)
    – tRNA genes (50 sites with 10–100 copies each in human)
    – histone genes in many species
• Without known function
  – short interspersed elements (SINEs)
    – Alu is an example
    – 200–300 bp long
    – 100 000s of copies (300 000 Alu)
    – scattered locations (not in tandem repeats)
  – long interspersed elements (LINEs)
    – 1–5 kb long
    – 10–10 000 copies per genome
  – pseudogenes

Highly repetitive DNA

• minisatellites
  – composed of repeats of 14–500 bp segments
  – 1–5 kb long
  – many different ones
  – scattered throughout the genome
• microsatellites
  – composed of repeats of up to 13 bp
  – ∼100s of kb long
  – ∼10^6 copies/genome
  – most of the heterochromatin around the centromere
• telomeres
  – 250–1000 repeats at the end of each chromosome
  – contain a short repeat unit (typically 6 bp: TTAGGG in human genome, TTGGGG in Paramecium, TAGGG in trypanosomes, TTTAGGG in Arabidopsis)
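To make the idea of a tandem repeat in Box 1.3 concrete, here is a small Python sketch that counts the longest uninterrupted run of a repeat unit (such as the human telomeric unit TTAGGG) and labels a unit by the length ranges given in the box. The function names and the toy sequence are assumptions made for illustration; real repeat annotation tolerates mismatches and is considerably more involved.

```python
import re


def longest_tandem_run(sequence, unit):
    """Return the largest number of consecutive exact copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+", re.IGNORECASE)
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


def repeat_class(unit_length):
    """Label a repeat unit using the length ranges listed in Box 1.3."""
    if unit_length <= 13:
        return "microsatellite"
    if unit_length <= 500:
        return "minisatellite"
    return "other"


# Toy sequence (not real data): five copies, a spacer, then two more copies.
toy = "ACGA" + "TTAGGG" * 5 + "CCA" + "TTAGGG" * 2
print(longest_tandem_run(toy, "TTAGGG"))  # 5
print(repeat_class(len("TTAGGG")))        # microsatellite
```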

Genes that encode the proteome

It is now believed that the human genome contains about 23 000 protein-coding genes. Some regions of the genome are relatively poor in protein-coding genes. These include the subtelomeric regions, on all chromosomes, and chromosomes 18 and X. In contrast, chromosomes 19 and 22 are relatively rich in protein-coding genes.

Most human protein-coding genes contain exons (expressed regions) interrupted by introns (regions spliced out of mRNA and not translated to protein). The average exon size is about 200 bp. It is primarily the variability in intron size that causes the large size differences among protein-coding genes: the gene for insulin is 1.7 kb long, the LDL receptor gene is 5.45 kb, and the dystrophin gene is 2400 kb.

Genes appear on both strands. In many cases, unrelated genes are fairly well separated. However, there are examples of genes that partially overlap; and cases of an entire gene appearing, on the complementary strand, within an intron of another gene.

A typical protein-coding gene locus contains the exons and introns, with splice-signal sites at the intron–exon junctions. Transcription of the gene may be under the control of cis-regulatory elements near the gene, either upstream or down. Other regulatory elements may appear elsewhere in the genome, even on different chromosomes.

Often a neighbourhood of a gene contains a set of closely linked related genes. This is because a common mechanism of evolution is gene duplication followed by divergence. It is often possible to follow evolution through a set of successive duplications. However, in some cases a set of identical copies of a gene can appear on different chromosomes. The gene for ubiquitin is an example.

Ideally, it would be possible, following determination of a genome sequence, to infer the corresponding proteome – that is, the amino-acid sequences of the proteins expressed. However, several mechanisms introduce additional variety into the genome–proteome relationship:

• In eukaryotes, a mechanism of generating variety from a single gene sequence is alternative splicing. Alternative splicing involves forming a mature messenger RNA from different choices of exons from a gene, but always in the order in which they appear in the genome. It is estimated that ∼95% of multi-exon protein-coding genes in the human genome produce splice variants. There are also some known cases in which multiple promoters lead to transcription of parts of the same region into different proteins. If the reading frames of the different transcripts are not in phase, there will be no relationship between the protein sequences.
• In both prokaryotes and eukaryotes, RNA editing can produce one or more proteins for which the amino-acid sequence may differ from that predicted from the genome sequence. For instance, in the wine grape (Vitis vinifera), the mRNAs arising from every mitochondrial protein-coding gene are subject to multiple C→U editing events, most of which alter the encoded amino acid. In humans, nuclear protein-coding genes are subject to editing that changes adenine to inosine (inosine has the coding properties of guanine). The editing, and hence the final amino-acid sequence, can be tissue-specific.

Variable splicing and RNA editing describe the relationship between genome sequences and proteins potentially encoded in them. Of course, a large proportion of cellular activity is dedicated to the regulation of gene expression – the selection of which potentially encoded proteins are expressed, and in what amounts.

The immune system stands outside the general assertions about the protein-coding regions of the human genome. The number of antibodies that we produce dwarfs all the other proteins – it is estimated that each human synthesizes 10^8–10^10 antibodies. The generation of such high diversity arises by special combinatorial splicing at the DNA, not the RNA, level.

Some regions of the genome contain pseudogenes. Pseudogenes are degenerate genes that have mutated so far from their original sequences that the polypeptide sequence they encode will not be functional. In some cases, pseudogenes have been picked up by viruses from mRNA, and reverse transcribed. This is recognizable from the fact that the introns have been lost.

• The human genome, and other eukaryotic genomes, contains genes that code for proteins and non-coding RNAs (that is, other than messenger RNAs), control regions, pseudogenes (non-functional sequences derived from genes by degeneration), and a wide variety of repetitive sequences. Genes that encode proteins in eukaryotes contain exons – regions that can be translated – and introns – regions that are spliced out before translation. The possibility of omitting one or more exons, called variable splicing, adds complexity to the relationship between base sequences of genes and amino acid sequences of proteins. In many cases, RNA editing creates additional differences between the DNA sequences in the genome and the amino acid sequences of the proteins.

The leap from the one-dimensional world of sequences to the three-dimensional world we inhabit

Gene sequences, mRNA sequences and amino acid sequences are all, from the logical point of view, one dimensional. To perform their proper catalytic, regulatory, or structural activities, proteins must adopt precise three-dimensional structures. The miracle is that the structure is inherent in the amino acid sequence. (See Figure 1.1.) For each natural amino acid sequence, there is a unique stable native state that under proper conditions is spontaneously taken up.

The evidence is from the reversible denaturation of proteins: a native protein that is heated, or otherwise brought to conditions far from its normal physiological environment, will denature to a disordered, biologically inactive state. When normal conditions are restored, proteins renature, readopting the native structure, indistinguishable in structure and function from the original state. No information is available to the denatured protein, to direct its renaturation, other than the amino acid sequence. (See Box 1.4.)

We therefore have the paradigm:

• DNA sequence determines protein sequence;
• protein sequence determines protein structure;
• protein structure determines protein function.

Because amino acid sequence determines protein structure, we should be able to write computer programs to predict protein structures. This would be useful, because we know many more amino acid sequences than experimentally determined three-dimensional structures of proteins. Structure prediction methods have recently improved substantially (see Chapter 10). Reliable predictions will allow the creation of a library of the structures of the proteins encoded in any genome.

The principle that amino acid sequence dictates protein structure has been the most fundamental principle of structural molecular biology. Imagine, therefore, the shock produced by the observation of an effect of a synonymous mutation on a protein structure. In humans, the Multidrug Resistance 1 (MDR1) gene encodes a membrane pump, P-glycoprotein. In 2007, Kimchi-Sarfaty et al. observed that a synonymous mutation in MDR1 produces a product with altered affinity for ligands. The hypothesis is that protein
folding and membrane insertion are occurring during protein synthesis. A change in the rate of synthesis on the ribosome, perhaps because of differential concentrations of tRNAs, could produce the transient exposure of different partial sequences. This might bias the folding pathway. Ribosome pausing is like page turning in music: in principle it should not affect the result.

At this point, most people would agree that this observation represents an exception to the principle that amino acid sequence alone determines protein structure, rather than overturning it entirely.

[Figure 1.1 diagram: a sequence of bases in DNA is translated to a sequence of amino acids in a protein, which folds spontaneously to a precise three-dimensional structure. Triplets of bases are read from one strand; the genetic code ‘translation table’ maps three bases to one amino acid.]

Figure 1.1 A most ingenious paradox: the translation of DNA sequences to amino acid sequences is very simple to describe logically; it is specified by the genetic code. The folding of the polypeptide chain into a precise three-dimensional structure is very difficult to describe logically. However, translation requires the immensely complicated machinery of the ribosome, tRNAs, and associated molecules, but protein folding occurs spontaneously.

• Proteins fold to native states based on information contained in the amino acid sequence. Supporting this principle are observations of reversible denaturation. However, those experiments were carried out on isolated proteins in dilute solution. The situation in the cell is much more crowded, and this makes a difference. For instance, proteins are in danger of aggregation, with fatal consequences for the cell. If you boil an egg and then cool it down, the proteins do not renature. What you have is an aggregate of denatured proteins.
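The rule stated earlier in this section, that alternative splicing joins different choices of exons but always in genomic order, can be illustrated with a short Python sketch. It enumerates every non-empty subset of exons, which is only a combinatorial upper bound: real splicing is far more constrained. The function names and the three toy exon sequences are hypothetical, chosen purely for illustration.

```python
from itertools import combinations


def splice_variants(exons):
    """Yield (exon_indices, mature_sequence) for every non-empty choice of exons.

    itertools.combinations returns indices in increasing order, so the chosen
    exons are always joined in the order in which they appear in the gene.
    """
    for k in range(1, len(exons) + 1):
        for chosen in combinations(range(len(exons)), k):
            yield chosen, "".join(exons[i] for i in chosen)


def skipping_preserves_frame(exon):
    """True if omitting this exon leaves the downstream reading frame in phase."""
    return len(exon) % 3 == 0


# Hypothetical three-exon gene (toy sequences, not real data).
exons = ["atggct", "tgcga", "ttcgattaa"]
variants = dict(splice_variants(exons))

print(len(variants))                       # 7, i.e. 2**3 - 1 exon choices
print(variants[(0, 2)])                    # atggctttcgattaa (middle exon skipped)
print(skipping_preserves_frame(exons[1]))  # False: 5 bases is not a multiple of 3
```

The frame check reflects the point made above about transcripts that are not in phase: omitting an exon whose length is not a multiple of three shifts every downstream codon.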

Varieties of genome organization

Chromosomes, organelles, and plasmids

The biosphere as we know it includes living things based on cells and also viruses. The most general classification of cells, according to both their structure and molecular biology, divides prokaryotes, simple cells without a nucleus, from eukaryotes, cells with nuclei.

From the point of view of genomics, the most relevant difference between prokaryotic and eukaryotic cells is the form and organization of the genetic material.

Table 1.1 Differences between prokaryotic and eukaryotic cells

Feature                     Typical feature of prokaryotic cell   Typical feature of eukaryotic cell

Size                        10 µm                                 ∼0.1 mm
Subcellular division        No nucleus                            Nucleus
State of major component    Circular loop, few proteins           Complexed with histones to form
  of genetic material         permanently attached                  chromosomes
Internal differentiation    No organized subcellular structure    Nuclei, mitochondria, chloroplasts,
                                                                    cytoskeleton, endoplasmic reticulum,
                                                                    Golgi apparatus
Cell division               Fission                               Mitosis (or meiosis)

In structuring their DNA, cells have two problems to solve. The first is a packaging challenge. The DNA of E. coli is 1.6 mm long but must fit into a cell 2 μm long and 0.8 μm wide (1 μm = 0.001 mm). Eukaryotic cells have a harder version of the problem: the nucleus of a human cell has a diameter of 6 μm, but the total length of the DNA is 1 m. How can the DNA be deployed in cells in a form that is compact but accessible to proteins active in replication and transcription? The second problem is what to do during cell division, after DNA replication, to ensure that each daughter cell gets one copy of the DNA. Different types of cells, and viruses, solve these problems in different ways.

In the typical prokaryotic cell, most of the DNA has the form of a single closed, or circular, molecule. It is complexed with proteins to form a structure called a nucleoid. The DNA is attached to the inside of the plasma membrane but is accessible to molecules in the cytoplasm. Some bacteria have multiple circular DNA molecules; others have linear DNA. In addition, prokaryotic cells can contain plasmids, small pieces of circular DNA, neither complexed permanently with protein nor attached to the membrane. The development and spread of antibiotic resistance in pathogenic bacteria, an increasingly serious public health problem, are often associated with exchange of plasmids among strains. Bacterial plasmids are also used as vectors for genetic engineering.

In a eukaryotic cell, most of the DNA is sequestered in the nucleus. The nucleus is the site of DNA replication and RNA synthesis in gene transcription. Nuclear DNA is complexed with histones and other proteins to form chromosomes, large nucleoprotein complexes. Each chromosome contains a single linear molecule of DNA. The nuclei of different species contain different numbers of chromosomes and in each species the chromosomes vary in length. Humans contain 46 chromosomes – 22 pairs, plus two X chromosomes in females or one X and one Y chromosome in males. Deviations from the normal complement of chromosomes have clinical consequences; for example, the presence of three copies of chromosome 21 (trisomy 21) is associated with Down’s syndrome.

The state – notably the accessibility to transcriptional machinery – of different regions of DNA in eukaryotic chromosomes is modulated by the local structure of the chromosome, notably the interaction with histones (Figure 1.2).

Subcellular organelles, including mitochondria and chloroplasts, are believed to have originated as intracellular parasites (see Box 1.4). These organelles contain additional DNA in the form of single closed or circular molecules, uncomplexed with histones, like the DNA of prokaryotes. Mitochondria and chloroplasts also contain their own protein-synthesizing machinery, using a slightly different dialect of the nearly universal genetic code.

Eukaryotic cells may also contain plasmids. Yeast artificial chromosomes (YACs) – plasmids in yeast cells – are of great utility in genome sequencing projects.
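The scale of the packaging problem can be put in numbers. The following is a minimal sketch that uses only the lengths quoted in the text (1.6 mm of DNA in a 2 μm E. coli cell; 1 m of DNA in a 6 μm human nucleus). The function name `compaction_ratio` is illustrative, and comparing the DNA length with a single linear dimension of its container is a simplifying assumption, not a model of how the DNA is actually folded.

```python
# Unit conversions to metres.
MICROMETRE = 1e-6
MILLIMETRE = 1e-3


def compaction_ratio(dna_length_m: float, container_length_m: float) -> float:
    """Return how many times longer the stretched-out DNA is than its container."""
    return dna_length_m / container_length_m


# E. coli: 1.6 mm of DNA in a cell 2 micrometres long (figures from the text).
e_coli_ratio = compaction_ratio(1.6 * MILLIMETRE, 2 * MICROMETRE)

# Human cell: 1 m of DNA in a nucleus 6 micrometres in diameter (figures from the text).
human_ratio = compaction_ratio(1.0, 6 * MICROMETRE)

print(f"E. coli: DNA is about {e_coli_ratio:,.0f} times the cell length")
print(f"Human:   DNA is about {human_ratio:,.0f} times the nuclear diameter")
```

On these figures the human nucleus faces a linear mismatch roughly two hundred times larger than that of E. coli, which is the sense in which eukaryotes have the harder version of the problem.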
12 1 Introduction to Genomics

Figure 1.2 (a) Hierarchical organization of structure of DNA in eukaryotic chromosomes. From top: DNA double helix; if entirely straight, it would be much longer than dimensions of cell. DNA forms nucleosomes, containing a protein core of histones, around which nearly two turns of DNA (about 150 bp) are wrapped. The nucleosomes condense to form a 30 nm helical fibre called the solenoid. Further levels of compaction produce the bottom image, corresponding to the typical appearance of a chromosome in a metaphase karyotyping spread. (Reproduced with permission of themedicalbiochemistrypage.org) (b) The nucleosome core particle. The structure comprises 146 bp of DNA, the strands coloured brown and turquoise; and four pairs of histones, coloured blue, green, yellow, and red. (From: Luger, K., Mäder, A.W., Richmond, R.K., Sargent, D.F. & Richmond, T.J. (1997). Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature 389, 251–260.)
Labels in panel (a), from top: naked duplex DNA; “beads-on-a-string” created by formation of nucleosomes; 30 nm solenoid; extended form of chromosome; condensed section of chromatin; mitotic chromosome.

BOX 1.4 Origin of intracellular organelles: the endosymbiont hypothesis

Mitochondria and chloroplasts are subcellular particles involved in energy transduction (Figure 1.3). Mitochondria carry out oxidative phosphorylation, the conversion into ATP of reducing power derived from metabolizing food. Chloroplasts carry out photosynthesis, the capture of light energy in the form of ‘reducing power’ – NADPH – and ATP.

Both types of particle lead a quasi-independent life within the cell. They are surrounded by membranes, they have their own genetic material and protein-synthesizing machinery, and they reproduce themselves within cells independently of cell division.

To perform their functions, it is essential for mitochondria and chloroplasts to be enclosed. This is because electrochemical and photochemical energy conversion require establishment of a pH gradient across the organelle membrane. The passage of protons through the membrane is coupled to generation first of mechanical energy and then of chemical bond energy through the action of the molecular motor ATP synthase: chemiosmotic energy stored in pH gradient → mechanical energy in ATP synthase → chemical bond energy of ATP.

Where might a cell look to find a small, self-enclosed, and largely self-sufficient object to serve as an organelle? Why, at another cell! There is a consensus that mitochondria and chloroplasts originated as prokaryotic endosymbionts that originally lived independently but took up residence inside other cells. Evidence for this includes: (1) the state of the DNA: organelle DNA is circular and uncomplexed with histones, like that of prokaryotes; and (2) the fact that organelle ribosomes resemble those of prokaryotes rather than those of eukaryotic cells.

We can identify the origins of mitochondria and chloroplasts from the similarity of their genome sequences to those of living species. Mitochondria are likely to have originated as a relative of Rickettsia, parasites that cause typhus and Rocky Mountain spotted fever. Chloroplasts evolved from cyanobacteria.

The genomes of mitochondria and chloroplasts have degenerated from their original forms. The human mitochondrial genome has 16 569 bp compared with the >1 Mb size of contemporary Rickettsia. Mitochondrial genome sizes range from 5966 bp in Plasmodium to 366 924 bp in Arabidopsis.

Chloroplast genomes contain 30 000–200 000 bp, compared with cyanobacteria, such as Synechocystis, with a 3.5 Mbp genome.

This shrinking of the genome is not entirely the effect of squeezing out the non-functional DNA in the interests of slimming down the organelle. Mitochondria and chloroplasts synthesize relatively few proteins, and certainly not all that they need. There has been a lot of transfer of genes from the mitochondria and chloroplast genomes to the nucleus. Proteins synthesized with a specific N-terminal targeting sequence are translocated into the appropriate organelle (see Chapter 4).

It is believed that chloroplasts of eukaryotic cells arose from cyanobacteria independently at least three times, in lineages leading to (1) green algae and higher plants; (2) red algae; and (3) a small group of unicellular algae called the glaucophytes.

Figure 1.3 Structure of a mitochondrion. The convoluted inner membrane is easily visible. Red arrows indicate ribosomes. (Reproduced with permission of themedicalbiochemistrypage.org)

Genes

As they come off the sequencing machines, genomes are long strings of As, Ts, Gs, and Cs, without captions or sign posts. Many people think that only a small fraction (possibly only ∼5%) of the human genome is functional. The challenge is to identify the functional regions and figure out what they do.

• Protein-coding regions. Because three consecutive bases correspond to one amino acid, any nucleic acid sequence can be translated into an amino acid sequence in six ways. Beginning at nucleotide 1, 2, or 3 gives three possible phases of translation, and translating the reverse complement – the other strand – gives another three possible phases. A protein-coding region will contain open reading frames (ORFs) in one of the six reading phases. An ORF is a region of DNA sequence, of reasonable length, that begins with an initiation codon (ATG) and ends with a stop codon. An ORF is a potential protein-coding region.
• Some regions are expressed as non-protein-coding RNA. They will show regions of local self-complementarity corresponding to hairpin loops. For instance, genes for transfer RNA (tRNA) will contain the signature cloverleaf pattern (see Figure 1.4).
• Other regions are targets of regulatory interactions.

Gene identification in prokaryotes is easier than in eukaryotes. Prokaryotic genomes are smaller and contain fewer genes. Genes in bacteria are contiguous – they lack the introns characteristic of eukaryotic genomes. The intergene spaces are small. In E. coli, for instance, approximately 90% of the sequence is coding. Features such as ribosome-binding sites are conserved.
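The six reading phases and the definition of an ORF given above translate directly into a short scan. The following is a minimal sketch, not a production gene finder: the function names, the `min_codons` threshold (standing in for "reasonable length"), and the toy sequence are all illustrative assumptions. Coordinates on the "-" strand are reported relative to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the other strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int):
    """Yield (phase, start, end) for each ATG...stop run in the three phases of one strand.

    start is the index of the A of ATG; end is one past the stop codon.
    The codon count used for the length filter includes the stop codon.
    """
    for phase in range(3):
        start = None
        for i in range(phase, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield phase, start, i + 3
                start = None


def six_frame_orfs(seq: str, min_codons: int = 30):
    """Scan both strands; return a list of (strand, phase, start, end)."""
    seq = seq.upper()
    found = [("+", p, s, e) for p, s, e in orfs_in_strand(seq, min_codons)]
    found += [("-", p, s, e)
              for p, s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return found


# Toy example: a five-codon ORF (ATG AAA TTT GGG TAA) starting at index 2.
print(six_frame_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

A scan of this kind only proposes candidates. In an intron-free prokaryotic genome most long ORFs are genes; in eukaryotes, where coding regions are interrupted, it is far less informative on its own.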

If identification of genes in mammalian genomes is – in a common simile – like hunting for needles in a haystack, identification of genes in prokaryotes is no worse than hunting for needles in a sewing kit.

Protein-coding genes in higher eukaryotes are sparsely distributed and most are interrupted by introns. Identification of exons is one problem and assembling them is another. Alternative splicing patterns present additional difficulty. Simpler eukaryotes, such as yeast, are not quite as bad: about 67% of the yeast genome codes for protein and fewer than 4% of the genes contain introns.

There are two basic approaches to identifying genes in genomes.

1. A priori methods, which seek to recognize sequence patterns within expressed genes and the regions flanking them. Protein-coding regions will have distinctive patterns of codon statistics, of course including – but not limited to – the absence of stop codons.
2. ‘Been there, seen that’ methods, that recognize regions corresponding to previously known genes, from the similarity of their translated amino acid sequences to known proteins in another species, or by matching expressed sequence tags (ESTs) (see Box 1.5).

A priori methods are the blunter instrument. Combined approaches are also possible.

Figure 1.4 The framework of the structure of transfer RNA (tRNA). The * marks the 3′ end, the site of attachment of the amino acid. tRNA contains several regions of double-helical structure, indicated by coloured patches. These regions have almost perfect complementarity of base pairs. They therefore induce a pattern observable in the DNA sequence, by which it is possible to recognize genes for tRNAs in genomes.
Labels in the figure: Acceptor stem; TψC loop; Dihydrouracil loop; Anticodon loop.

BOX 1.5 Expressed sequence tags

An expressed sequence tag (EST) is a sequence that corresponds to at least a part of a transcribed gene. In a cell, transcription of genes and – in eukaryotes – maturation of mRNAs by splicing out the introns, produces single-stranded RNA molecules that correspond directly to coding sequences. Collecting these RNA molecules, converting them to complementary DNA (cDNA) using the enzyme reverse transcriptase and sequencing the terminal fragments of the cDNA produces a set of ESTs. It is necessary to sequence only a few hundred initial bases of cDNA. This gives enough information to identify a known protein coded by the mRNA. Characterization of genes by ESTs is like indexing poems or songs by their first lines.

Characteristics of eukaryotic genes useful in identifying them include:

• The initial (5′) exon starts with a transcription start point, preceded by a core promoter site, such as the TATA box, typically ∼30 bp upstream. It is free of in-frame stop codons and ends immediately before a GT splice signal. (Occasionally a non-coding exon precedes the exon that contains the initiator codon.)
• Internal exons, like initial exons, are free of in-frame stop codons. They begin immediately after an AG splice signal and end immediately before a GT splice signal.
• The final (3′) exon starts immediately after an AG splice signal and ends with a stop codon, followed by a polyadenylation signal sequence. (Occasionally a non-coding exon follows the exon that contains the stop codon.)
• All coding regions have non-random sequence characteristics, based partly on codon usage preferences. Empirically, it is found that statistics of hexanucleotides perform best in distinguishing coding from non-coding regions. Using a set of known genes from an organism as a training set, pattern-recognition programs can be tuned to particular genomes.

• An early step in analysing a newly sequenced genome is to find the genes that code for proteins and RNAs, and to try to identify them.

Table 1.2 Transposable elements in the human genome

Element | Estimated number | % of total genome
SINE + LINE | 2.4 × 10^6 | 33.9
LTR | 0.3 × 10^6 | 8.3
Transposons | 0.3 × 10^6 | 2.8
Total | 3.0 × 10^6 | ∼45

Data from: Bannert, N. & Kurth, R. (2004). Retroelements and the human genome: New perspectives on an old relation. Proc. Natl. Acad. Sci. USA 101, 14572–14579.
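The exon characteristics listed earlier in this section (AG immediately before an internal exon, GT immediately after it, and no in-frame stop codons) can be expressed as a simple filter on candidate intervals. The following is a minimal sketch under stated assumptions: the function names and the toy sequence are hypothetical, and because the correct reading phase of an internal exon depends on the preceding exons, the sketch accepts an interval if any one of its three phases is free of stop codons. Real gene finders combine such signals with codon and hexanucleotide statistics.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_phase(exon: str) -> bool:
    """True if at least one of the three reading phases contains no stop codon."""
    for phase in range(3):
        codons = (exon[i:i + 3] for i in range(phase, len(exon) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def could_be_internal_exon(genomic: str, start: int, end: int) -> bool:
    """Test genomic[start:end] (0-based, end exclusive) against the splice signals.

    Requires AG immediately upstream, GT immediately downstream,
    and at least one stop-free reading phase inside the interval.
    """
    genomic = genomic.upper()
    if start < 2 or start >= end or end + 2 > len(genomic):
        return False
    preceded_by_ag = genomic[start - 2:start] == "AG"
    followed_by_gt = genomic[end:end + 2] == "GT"
    return preceded_by_ag and followed_by_gt and has_open_phase(genomic[start:end])


# Toy sequence: intron end (..CAG) + candidate exon + intron start (GTAAGT).
toy = "TTTCAG" + "GCTGAACCTGGA" + "GTAAGT"
print(could_be_internal_exon(toy, 6, 18))   # candidate flanked by AG and GT
print(could_be_internal_exon(toy, 5, 18))   # shifted by one base: no AG upstream
```

Because AG and GT dinucleotides occur by chance about once every 16 positions each, a filter like this accepts many false candidates; it narrows the search rather than identifying exons outright.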
Dynamic components of genomes

Transposable elements are skittish segments of DNA, found in all organisms, that move around the genome. They were discovered by B. McClintock in the 1940s in studies of maize. Transposable elements in Indian maize (or corn) create a genetic mosaic, giving the ears a mottled appearance (see Figure 1.5). In this case, transposition is fast enough to affect an individual organism. Other transposable elements move more slowly, on evolutionary timescales.

The focus of this section is transposable elements within nuclear genes of eukaryotes. We shall discuss the traffic of genes between organelle (mitochondria and chloroplasts) and nuclear genomes in Chapter 4.

Figure 1.5 The ears of Indian maize are mosaics. The dark pigments are anthocyanins. Yellow sectors arise when a jumping element or transposon interferes with expression or function of the genes for biosynthesis of anthocyanins, during development of individual kernels. (Photograph courtesy of L.D. Graham, Rockingham, VA, USA)

Different types of element show alternative mechanisms of transposition.

Retrotransposons (class I) replicate via an RNA intermediate. Many, if not all, of them are degenerate retroviruses.

Transposons (class II) produce DNA copies without an intermediate RNA stage. They encode an enzyme called transposase, which recognizes sequences within the transposon itself, cuts it out, and inserts it elsewhere. Often the excision is sloppy, leaving a mutation at the original site. Sometimes, a bit of the surrounding sequence adheres to and accompanies the transposed material.

Because transposable elements can replicate, they are related to some of the types of repetitive sequence found in genomes (see Table 1.2). Mammalian genomes contain retrotransposons (RNA-mediated replication) called long and short interspersed elements (LINEs and SINEs). LINEs are typically 1–5 kb long, with tens to tens of thousands of copies. The most common LINE, L1, appears ∼20 000 times in the genome. SINEs are typically 200–300 bp long, with hundreds of thousands of copies in the genome, at scattered locations. The human genome contains about 300 000 copies of the most common SINE, the Alu element, which is 280 bp long. The total amount of L1 + Alu is 7% of the human genome. LINEs encode a reverse transcriptase and can replicate autonomously. SINEs are too short to encode their own reverse transcriptase. SINEs depend on LINEs or other sources of the required activities for replication.

Transposons contain inverted repeats at their ends, which are the targets of the excision machinery (see Figure 1.6 and Box 1.6). Replication may occur in ‘cut-and-paste’ mode, moving the transposon from
one site to another. Alternatively, in ‘copy-and-paste’ mode, replication leaves the original copy behind, while creating another.

If two equivalent transposons are nearby, they can move a whole segment including all the material between them. Transfer of multiple genes to a plasmid is a common mechanism of generation of antibiotic resistance in bacteria. (The Tn3 transposon illustrated in Box 1.6 contains only one set of terminal repeats.)

Figure 1.6 Fragment of a chromosome containing a transposon (red). The ends of the transposon contain inverted repeat sequences, demarcating the region for excision. Within the transposon is a gene for the transposase enzyme, giving the region the capacity for autonomous replication.

Biological effects of transposable elements include:

• Sequence broadcasting. Multiple copies of elements of a sequence may be distributed to various locations in the genome.
• Altering properties of genes. The arrival of a fragment of sequence within or in the vicinity of a gene may, if inserted into a coding region, render the gene product non-functional, creating a ‘knock-out’ effect. An inserted segment near a gene may affect its regulation or alter its splicing pattern. (Approximately 20% of human genes have transposable elements in flanking non-coding sequences.) Even an insertion in an intron can affect rates of transcription by slowing down the polymerase as it passes through.
• Transposable elements as an important engine of evolution. They provide a mechanism for gene evolution by gene fusion or exon shuffling. Transposable element insertion can cause species-specific alternative splicing patterns. This can produce new protein isoforms. It can also lead to disease; for example, the cause of a case of ornithine aminotransferase deficiency was a single base change that activated a cryptic 5′ splice site in an Alu element, introducing an in-frame stop codon leading to a truncated protein.
• Causing chromosomal rearrangements. This can include inversions, translocations, transpositions, and duplications, perhaps through mispairing of chromosomes during cell division. The deletions leading to Prader–Willi and Angelman syndromes (see Chapter 3) are associated with a mutation in the sequence of a nearby transposable element.
• Leakage of epigenetic modification. From the landlord’s point of view, transposable elements are squatters. For example, they make up 70% of the maize genome. In one sense, it is bad enough that they clutter up the DNA, but, even worse, eukaryotes must defend themselves against the expression of transposable elements. The tactics of defence are to methylate transposable elements, or to use small interfering RNAs (siRNAs) (see pp. 6–7). Some cancers and other diseases that lead to hypomethylation of DNA can cause transcriptional reactivation of some transposable elements. Methylation also cuts down on the mobility of those transposable elements that require transcription for mobility. However, mechanisms of silencing transposable elements can also affect neighbouring genes.

BOX 1.6 The 5′ and 3′ ends of transposons contain inverted repeats

The beginning and end of the sequence of the Tn3 transposon of E. coli contain a terminal repeat. This is a plasmid encoding β-lactamase, conferring ampicillin resistance, as well as a transposase.*

   1 GGGGTCTGAC GCTCAGTGGA ACGAAAACTC ACGTTAAGCA ACGTTTTCTG CCTCTGACGC
  61 CTCTTTTAAT GGTCTCAGAT GACCTTTGGT CACCAGTTCT GCCAGCGTGA AGGAATAATG
 ...
4861 TTTTTAATTT AAAAGGATCT AGGTGAAGAT CCTTTTTGAT AATCTCATGA CCAAAATCCC
4921 TTAACGTGAG TTTTCGTTCC ACTGAGCGTA AGACCCC

* Heffron, F., McCarthy, B.J., Ohtsubo, H. & Ohtsubo, E. (1979). DNA sequence analysis of the transposon Tn3: three genes and three sites involved in transposition of Tn3. Cell 18, 1153–1163.
Genome sequencing projects 17

• Transposable elements add a dynamic component to genetic change, at a higher level than point mutations. These
changes may have significant biological effects.
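The inverted terminal repeat shown in Box 1.6 can be checked by comparing the start of the Tn3 sequence with the reverse complement of its end. The following is a minimal sketch: the two constants are the ends of the sequence as printed in Box 1.6 with the spaces removed, while the function names are illustrative and the position-by-position comparison (no gaps allowed) is a simplifying assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the other strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_repeat_matches(left_end: str, right_end: str) -> tuple[int, int]:
    """Compare the left end with the reverse complement of the right end.

    Returns (number of matching positions, number of positions compared).
    """
    mirrored = reverse_complement(right_end)
    compared = min(len(left_end), len(mirrored))
    matches = sum(a == b for a, b in zip(left_end[:compared], mirrored[:compared]))
    return matches, compared


# Ends of the Tn3 sequence as printed in Box 1.6 (spaces removed).
TN3_START = "GGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAG"   # first 38 bases
TN3_END = "TTAACGTGAGTTTTCGTTCCACTGAGCGTAAGACCCC"       # last 37 bases

matches, compared = terminal_repeat_matches(TN3_START, TN3_END)
print(f"{matches} of {compared} positions match")
```

For the sequence as printed, 36 of the 37 compared positions match, which is the signature of an inverted repeat: read on opposite strands, the two ends of the element present nearly the same sequence to the transposase.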

Genome sequencing projects

The static contents of the human genome, and its dynamic aspects, are similar in general features to what other genomes contain. As genome sequencing techniques become easier, the field is progressing in two directions: (a) determining more and more human genome sequences, especially those that may prove useful in research into disease anticipation and prevention, and (b) sequencing the genomes of many different species, for at least one individual each.

The National Center for Biotechnology Information database currently reports:

Table 1.3 Genome sequencing projects. Figures refer to numbers of species and strains, not numbers of individuals (a)

Organism type | Number of genomes completed | Number of genomes in progress
Viruses and viroids (b) | 3889 |
Archaea | 113 | 91
Bacteria | 1588 | 4914
Eukaryotes | 36 | 1175

(a) Source: http://www.ncbi.nlm.nih.gov/genome
(b) A viroid is a small single-stranded RNA which can replicate autonomously, but does not encode protein, nor is encapsulated within a coat.

There are many reasons for sequencing non-human genomes. The most important ones are that they reveal the processes of evolution, and that they help us to understand the functions of different regions of the human genome.

Other genomes are essential to illuminate ours. An important principle is that if evolution conserves something, it is essential. If evolution does not conserve something, it is not essential. In trying to understand the function of the ∼98% of the human genome that does not recognizably code for proteins or non-coding RNAs, we gain important clues by comparing the human genome with other mammalian genomes. Regions that are conserved must be conserved for a reason. We can focus on these regions to try to determine their function. Regions that are not conserved can be reserved for later.

He who does not know foreign languages does not know anything about his own – Goethe, Kunst und Alterthum

Sequences of genomes of other species also have direct application to human welfare. The genomes of pathogens that have developed antibiotic resistance, or are threatening to, give clues that we can use to try to keep ahead of them. Other practical applications include improving crops and domesticated animals. Genomics can also support conservation efforts aimed at preserving endangered species.

Genomics, allied with anthropology and archaeology, helps recount the history of the human species. It can reveal patterns of migration. It can trace the domestication of plants and animals. These applications to many aspects of human biology are the subject of Chapter 8.

Many genome projects target individual species. In addition, a major component of public DNA sequence data repositories comes from metagenomic data. These are sequences determined from environmental samples, without isolating individual organisms. Sources include ocean water, soil samples, and the human gut.

• Some readers will be surprised to learn that within the volume of their bodies, there are more prokaryotic cells than human ones.

Nevertheless, the focus of genome sequencing efforts has been human subjects. Clinical applications have created a very large amount of sequence data for individual human genes. Many people have undergone genetic testing: for example, many prospective parents determine whether they are carriers of cystic fibrosis. Many women test for potentially dangerous mutations in genes that can predispose an
individual to cancer, such as BRCA1 and BRCA2. Up to now, these tests have been carried out when a family history suggests increased risk. Reduced sequencing costs may lead to more widespread testing for these and other risk-alerting genes. Some people make use of publicly available genotyping services, outside medical supervision, to explore genealogy. Law-enforcement agencies determine DNA sequences from samples from crime scenes.

A separate area of comparative genomics involves comparing the genome sequences of normal cells and cancer cells from patients. These can differ in essential ways, which assist precise diagnosis and guide treatment.

Does this information appear in databases? For non-human sequences, the expectation is that the data will be publicly available. For human sequences, the situation is very different, as major questions of privacy arise (see section on ethical, legal, and social issues).

Genome projects and the development of our current information library

The first genome sequenced was that of the single-stranded DNA virus, bacteriophage φX-174. F. Sanger and co-workers published this result, 5386 bases, in 1977. Recognition of the importance of sequencing stimulated intensive efforts to improve and automate the techniques. A major breakthrough was the replacement of the autoradiography of gels, each nucleotide occupying a separate lane, with four fluorescent dyes, permitting a ‘one-pot’ reaction. Leroy Hood and colleagues automated this technique, developing a machine that supported a generation of sequencing projects. Recently, a number of novel approaches, or ‘next-generation’ sequencing techniques, have been developed. We shall discuss these in Chapter 3.

Figure 1.7 Sequence data reported from fragments using single-end sequencing and paired-end sequencing techniques. Data reported arise from red and green regions. The lengths of the black regions are known only approximately, from the fragment length. Typically, fragments used are about 200 bp long, varying in length by about 10%.
Panels: Single-end read; Paired-end read.

Vocabulary for high-throughput DNA sequencing

Fragment: a small piece of genomic DNA – typically several hundred bp in length – subject to an individual partial sequence determination, or read.
Single-end read: technique in which sequence is reported from only one end of a fragment (see Figure 1.7).
Paired-end read: technique in which sequence is reported from both ends of a fragment (with a number of undetermined bases between the reads that is known only approximately).
Read length: the number of bases reported from a single experiment on a single fragment.
Assembly: the inference of the complete sequence of a region from the data on individual fragments from the region, by piecing together overlaps.
Contig: a partial assembly of data from overlapping fragments into a contiguous region of sequence.
De novo sequencing: determination of a full-genome sequence without using a known reference sequence from an individual of the species to avoid the assembly step.
Resequencing: determination of the sequence of an individual of a species for which a reference genome sequence is known. The assembly process is replaced by mapping the fragments onto the reference genome.
Exome sequencing: targeted sequencing of regions in DNA that code for parts of expressed proteins (exons). There are approximately 180 000 exons in the human genome.
Variability, Single-nucleotide polymorphisms (SNPs): variability is the differences among genome sequences from different individuals of the same species. Much observed variability has the form of isolated substitutions at individual positions, or single-nucleotide polymorphisms.
RNAseq: sequencing the contents and composition of the RNAs in the cell, the transcriptome, by conversion of RNA to complementary DNA and sequencing the result.
ChIP-seq: sequencing of the fragments of DNA to which particular proteins are known to bind. ChIP-seq permits the identification of the targets, in the genome, of DNA-binding proteins.
Methylation-pattern determination: comparison of sequences of native DNA, and DNA after treatment with bisulphite to convert unmethylated C to U. Methylation of C residues in CpG dinucleotides at the 5′ position is the most common signal in animals for repression of transcription.
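The vocabulary entries for assembly and contig can be illustrated with a toy assembler that pieces reads together by their overlaps. The following is a minimal sketch, assuming error-free reads all taken from the same strand; the greedy merge-the-longest-overlap strategy and the function names are illustrative choices, not the method used by any particular assembly program, and reads wholly contained within other reads are not handled.

```python
def overlap_length(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap.

    Returns the contigs that remain when no further overlaps are found.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three toy reads that overlap end to end.
print(greedy_assemble(["ATGGCGT", "GCGTACA", "ACAGGTT"]))
```

With real data the same idea must cope with sequencing errors, reads from both strands, and repeated sequences that create ambiguous overlaps, which is why assembly is computationally demanding.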
High-throughput sequencing

The past few years have seen truly astounding progress in the development of high-throughput sequencing techniques. Initial determination of a draft of the human genome took ten years, at an estimated cost of US$3 × 10^9. At the time of writing, instruments exist that can produce 250 Gb per week. The largest dedicated institution in the field, the BGI – formerly the Beijing Genomics Institute, but currently in Shenzhen – has 128 such instruments. Each can produce 25 × 10^9 bp per day! This corresponds to one human genome at over 8X coverage. Running at full capacity, these resources could produce 10 000 human genomes per year.

Moreover, there is no reason to think that the technical progress will not continue to accelerate. Undoubtedly by the time this book is published, these specs will be outdated. (See Weblem 1.1.)

There are two aspects of a large-scale sequencing project. One is the generation of the raw data. Most methods sequence long DNA molecules by fragmenting them, and partially sequencing the pieces. To determine the first genome from a species, these short sequences must be assembled into the whole sequence, using overlaps between the individual fragments.

The typical length of the individual short sequences reported is called the read length of the method. The goals of contemporary technical development are to increase not only the number of bases sequenced per unit time and per unit cost, but also the read length.

Both generation of raw data, and assembly, depend crucially on effective and efficient computer programs. Some contemporary genome centres have as many computational biologists on their staffs as ‘wet-lab’ scientists.

The very high throughput sequencing capacity of new instruments allows addressing several types of biological questions.

De novo sequencing

The most ambitious type of sequencing project is the determination of the complete sequence of the first genome from a species. The total DNA must be broken into fragments, typically about 200 bp long. High-throughput sequencing can produce partial sequence information for such fragments, either from one end only, or from paired ends (see Figure 1.7). Paired-end sequencing reports partial sequences from both ends of the same fragment. In either case, the number of bases reported is the read length. The number of unknown bases between the paired ends is limited (by the estimate of the overall fragment length) but the exact value is unknown.

Next it is necessary to assemble the genome, from the sequences of overlapping fragments. A contig is a partial assembly of fragments, into a contiguous stretch of sequence. As the assembly proceeds, the contigs grow in length, like the completed portion of a jigsaw puzzle.

Assembly requires a sufficient number of fragments to cover the entire genome, with enough replicates to be able to detect errors. The ratio of the total number of bases sequenced to the genome length is the coverage of the data set. There is now consensus that complete and accurate assembly of a novel genome requires collection of data with a coverage of 30 or 50 (30X or 50X).

For prokaryotic genomes, the process outlined, which corresponds to ‘shotgun’ sequencing of fragments, allows accurate assembly. For a large eukaryotic genome, achieving the highest quality result requires a genetic map, breaking the assembly problem into smaller pieces.

Eukaryotic genome assembly is a very computer-intensive problem. Development of algorithms and computer programs for effective assembly of fragmentary sequence data is a thriving field of research. The resources of many genome centres are such that assembly is the rate-limiting step in the process.

Resequencing

Once a reference genome for an individual of a species is available – for example, a published human genome – the sequences of genomes from other individuals of the species are considerably easier to determine. It is not necessary to assemble fragment sequences de novo, but merely to map them onto the reference genome. Except for highly repetitive regions, this is fairly straightforward. Coverage must be adequate so that the error rate of sequence determination is less than the frequency of natural variation. In the specific case of sequencing genomes from cancer cells, it is preferable to sequence normal cells from the same patient rather than to try to infer from the reference sequence the genome changes arising from the disease.
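The throughput and coverage figures quoted in this section can be checked with a few lines of arithmetic. The following is a minimal sketch: the instrument count and daily output are the figures given in the text, while the round human genome length of 3 × 10^9 bp and the function names are assumptions made for illustration.

```python
HUMAN_GENOME_BP = 3.0e9  # assumption: round figure for the human genome length


def coverage(total_bases: float, genome_length: float) -> float:
    """Coverage = total number of bases sequenced / genome length."""
    return total_bases / genome_length


def genomes_per_year(instruments: int, bases_per_day: float,
                     genome_length: float, target_coverage: float) -> float:
    """Number of genomes per year that a set of instruments can sequence
    at a given coverage, running every day at full capacity."""
    bases_per_year = instruments * bases_per_day * 365
    return bases_per_year / (genome_length * target_coverage)


# One instrument producing 25 x 10^9 bp per day (figure from the text).
daily_coverage = coverage(25e9, HUMAN_GENOME_BP)

# 128 instruments, each genome sequenced to 30X (figures from the text).
yearly_genomes = genomes_per_year(128, 25e9, HUMAN_GENOME_BP, 30)

print(f"One instrument-day: about {daily_coverage:.1f}X of a human genome")
print(f"128 instruments at 30X: about {yearly_genomes:,.0f} genomes per year")
```

Under these assumptions one instrument-day gives a little over 8X, and the full installation gives somewhat more than 10 000 genomes per year at 30X, consistent with the round figures in the text.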

Exome sequencing

One goal of resequencing is to determine variation in the genome of an individual from the reference genome. By correlating these variations with phenotype – for example, the presence of an inherited disease – it is possible to identify the genetic origin of the lesion. Many inherited diseases result from loss of activity of particular proteins. The loss of activity frequently arises from a specific mutation in the sequence coding for the protein. To identify such a mutation, it is not necessary to sequence the entire genome but only the protein-coding regions; namely, the exons. There are approximately 180 000 exons in the human genome, amounting in total to approximately 30 Mb, or ∼1% of the entire genome.
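The figures quoted for the exome can be checked with a few lines of arithmetic. The exon count and total exon length are taken from the text; the total genome size of about 3.2 Gb is an assumption introduced here for the calculation, not a value stated in this passage.

```python
EXON_COUNT = 180_000            # from the text
EXOME_BP = 30_000_000           # 30 Mb, from the text
GENOME_BP = 3_200_000_000       # assumption: haploid human genome of ~3.2 Gb


def mean_exon_length(exome_bp, exon_count):
    """Average exon length in base pairs."""
    return exome_bp / exon_count


def exome_fraction(exome_bp, genome_bp):
    """Fraction of the genome occupied by exons."""
    return exome_bp / genome_bp


print(f"mean exon length: {mean_exon_length(EXOME_BP, EXON_COUNT):.0f} bp")
print(f"exome fraction:   {exome_fraction(EXOME_BP, GENOME_BP):.2%}")
```

Under these assumptions the average exon is a little under 170 bp long, and the exome comes out slightly below 1% of the genome, consistent with the approximate value given above.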

Variations within and between populations

Any two people – except for identical siblings – have genomic sequences that differ at approximately 0.1% of the positions. Measurements of multiple human genomes permit distinction between random components of this variation, and those that systematically characterize different populations. Much but not all of the variation takes the form of isolated base substitutions, or single-nucleotide polymorphisms. Sequence variation in humans has applications in anthropology, to trace migration patterns, and in personal identification to prove paternity or for crime investigation. Sequence variability in other species gives clues to the history of the species, including but not limited to understanding the history of domestication of animals and crop plants.

Cancer genome sequencing

Healthy cells do accumulate mutations at a modest rate. Cancer cells that have lost checks on accuracy of DNA replication accumulate mutations copiously. To distinguish variations arising from the disease it is preferable to compare the sequences from tumour cells with those from normal cells from the same individual, rather than with a single reference genome.

• There are many different applications of high-throughput sequencing techniques, designed to address different types of questions.
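The reason for comparing tumour cells with normal cells from the same individual can be made concrete with a small sketch. It assumes, purely for illustration, that all three sequences have the same length and differ only by base substitutions; the function names and toy sequences are hypothetical.

```python
def substitutions(reference, sample):
    """Set of (position, reference_base, sample_base) differences.

    Assumes equal-length sequences with substitutions only (no indels).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length in this sketch")
    return {
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    }


def somatic_variants(reference, normal, tumour):
    """Variants seen in the tumour but not in the patient's normal cells."""
    return substitutions(reference, tumour) - substitutions(reference, normal)


reference = "ACGTACGT"   # toy reference genome
normal = "ACGTACTT"      # patient's inherited variant at position 6
tumour = "ACCTACTT"      # inherited variant plus a new change at position 2

print(substitutions(reference, tumour))               # both differences
print(somatic_variants(reference, normal, tumour))    # disease-related only
```

Comparing the tumour with the reference alone reports both differences; subtracting the variants already present in the patient's normal cells leaves only the change that arose in the tumour.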

Human genome sequencing

In October 2010, Nature published the estimate that over 2700 full human genomes will have been sequenced by the end of that month, and suggested that by the end of 2011 the total will be over 30 000. A few of the individuals are known. These include J. Craig Venter, the subject of the original Celera Corporation genome sequencing project, James D. Watson, Bishop Desmond Tutu, and Stanford University professor Stephen Quake. (Perhaps analysis might reveal a genetic locus associated with the Nobel Prize.) Actress Glenn Close had her genome sequenced, motivated by a family history that included several individuals who suffered from mental illness. The academic human DNA sequencing project used DNA primarily from a donor from Buffalo, New York. His identity has not been revealed.

Venter has also sequenced the genome of his poodle, Shadow. They are therefore the first human–pet combination with both sequences known.

Sequencing is now done under a variety of auspices:

• A number of international organizations with ambitious specific targets. These include the International HapMap Project http://hapmap.ncbi.nlm.nih.gov/ that focuses on the variations in sequences in populations distributed around the world. They are collecting in particular an atlas of single-nucleotide polymorphisms (SNPs) which are substitutions of individual bases. Clusters of SNPs that appear to be inherited in tandem are called haplotypes. (See Chapter 2.)

• The 1000-genome project is an extension of the HapMap project towards complete genome data, with an emphasis on discovering the conditions required to ensure appropriate data quality in projects of this type (http://www.1000genomes.org/). Specific goals include careful sequencing of family groups (mother + father + child), and detailed sequencing of 1000 protein-coding regions in 1000 individuals. Although one may wonder to what extent the stated goals will be overtaken by the growing ‘background’ of genome sequencing, the importance of the commitment of the international organizations to data quality control, curation, and free distribution should not be underestimated.

• Several companies now offer personal genome sequencing. Many provide sequencing of mitochondrial DNA or individual loci in nuclear DNA, for tracing of ancestry. The application of DNA to demonstration of legal relationships – most commonly, paternity testing – is well established.

Up to now, the cost of a full-genome sequence has been prohibitive for most people if the motivation is casual curiosity. However, it is already true that the cost of sequencing a person’s DNA is comparable to the cost of a night’s stay in hospital in the USA. As costs fall, and equipment becomes smaller, and we can derive more and more clinically useful information from sequence data, the conclusion seems inescapable that anyone entering a hospital will have at least a partial DNA sequence determination along with taking his or her pulse rate, temperature, and blood pressure. (See Problem 1.7.)

• It has taken 30 years for DNA sequencing to make the transition from Nobel Prize breakthrough research to secondary school science projects.

The teacher in a secondary school in New Jersey, USA, organized a project to analyse DNA samples of the students in her class. This was not full-genome sequencing but produced limited, genealogy-oriented data. The students compared the results with their own cultural backgrounds.

The following example is not human DNA sequencing, but usefully hints at the potential variety of applications, and the relative ease with which they can be carried out. In 2008, two teen-age students checked samples of fish from New York City sushi bars, using a genetic-fingerprinting technique called DNA Bar Coding (see p. 117). They discovered that, of the samples they could identify, 25% were mislabelled, in fact originating in less-expensive species than advertised. They identified some restaurants that were free of mislabelling.

The human genome and medicine

The development and delivery of health care challenge biomedical science, engineering, industry, and government – both separately and in their coordination. Not all components of these challenges are within the scope of this book. What is relevant is the increasing scientific sophistication of medical treatment and the shrinking of the distance and time between laboratory discovery and clinical practice: walls between ‘pure’ and ‘applied’ science have come tumblin’ down.

Of course, immense improvements in human health could be achieved by quite low-tech measures. Obvious examples include educating people about the long-term dangers of obesity, smoking, and alcohol abuse; use of seatbelts in automobiles and aeroplanes; and the provision, to the many people who lack them, of basic medical care and hygienic living conditions – for instance, clean water. In all communities, preserving a mutually nurturing relationship with nature will improve people’s mental and spiritual health, which – although in ways not entirely demonstrable yet in the laboratory – has important effects on physical well-being.

Nevertheless, genomics and proteomics have played a central role in the recent transformation of medicine and surgery, making contributions to prevention of disease, to detection and precise diagnosis, and to effective treatment.

Prevention of disease

• Vaccinations are pre-emptive strikes against infectious diseases. They prime the immune system to recognize pathogens. Some vaccines are (almost)

completely effective. Vaccines have eliminated smallpox entirely, and polio from much of the world. Even a partially effective vaccine can tip the balance between a devastating pandemic – for example, of influenza – and a controllable set of localized outbreaks.

• Understanding individual genetic predispositions to disease can, for some conditions, help to preserve health through suggesting changes in lifestyle, and/or medical treatment. An example of a risk factor detectable at the genetic level involves α1-antitrypsin, a protein that normally functions to inhibit elastase in the alveoli of the lung. People homozygous for the Z mutant of α1-antitrypsin (342Glu→Lys) express only a dysfunctional protein. They are at risk of emphysema because of damage to the lungs from endogenous elastase unchecked by normal inhibitory activity, and also of liver disease because of accumulation of a polymeric form of α1-antitrypsin in hepatocytes, where it is synthesized. Smoking makes the development of emphysema all but certain. Heavy smokers homozygous for Z-antitrypsin generally die from respiratory disease by the age of 50. In these cases, the disease is brought on by a combination of genetic and environmental factors.

‘Genetics loads the gun and environment pulls the trigger’ – J. Stern

• In other cases, detection of genetic abnormalities will not prevent a disease but can dispel fear of the unknown (see Box 1.7). Genetic counselling is a potential preventative approach to avoiding abnormalities or diseases that arise from dangerous combinations of parental genes.

Detection and precise diagnosis

Early detection of many diseases permits simpler and more successful treatment. With a range of therapeutic strategies often available, precise classification may allow better prediction of the probable course of a disease, and dictate optimal treatment. For instance, oncologists classify leukaemias into seven subtypes. Determination of the subtype from gene expression patterns permits better prognosis and treatment.

BOX 1.7 Huntington’s disease

Huntington’s disease is an inherited neurodegenerative disorder affecting approximately 30 000 people in the USA. Its symptoms are quite severe, including uncontrollable dance-like (choreatic) movements, mental disturbance, personality changes, and intellectual impairment. Death usually follows within 10–15 years of the onset of symptoms. The gene arrived in New England, USA, during the colonial period in the 17th century. It may have been responsible for some accusations of witchcraft. The gene has not been eliminated from the population, because the age of onset – 30–50 years – is after the typical reproductive period.

Formerly, members of affected families had no alternative but to face the uncertainty and fear, during youth and early adulthood, of not knowing whether they had inherited the disease. The discovery of the gene for Huntington’s disease in 1993 made it possible to identify affected individuals. The gene contains expanded repeats of the trinucleotide CAG, corresponding to polyglutamine blocks in the corresponding protein, huntingtin. (Huntington’s disease is one of a family of neurodegenerative conditions resulting from trinucleotide repeats.) The larger the block of CAGs, the earlier the onset and more severe the symptoms. The normal gene contains 11–28 CAG repeats. People with 29–34 repeats are unlikely to develop the disease and those with 35–41 repeats may develop only relatively mild symptoms. However, people with >41 repeats are almost certain to suffer full Huntington’s disease.

Discovery and implementation of effective treatment

Advances in basic science provide the background of understanding, the basis for applications. We study the biology of viruses and bacteria to take advantage of their vulnerabilities, and the biology of humans to ward off the consequences of ours.

A large component of the progress in effective treatments for diseases involves the development of drugs. It is a sobering experience to ask a classroom full of students how many would be alive today without at least one course of drug therapy during a serious illness (ignoring diseases escaped through vaccination) or to ask the students how many of their

surviving grandparents would be leading lives of greatly reduced quality without regular treatment with drugs. The answers are eloquent. They engender fear of the new antibiotic-resistant strains of infectious microorganisms.

The traditional drug development process involved identifying a target – usually a protein – either from host or pathogen, the behaviour of which it is desired to affect. A drug is a molecule that intervenes in a living process by interacting with the target.

Recent scientific advances have accelerated the drug development process. Identifying metabolic features unique to a pathogen helps to identify targets for antibacterial and antiviral agents. Human proteins provide other drug targets to deal with molecular dysfunction or to adjust regulatory controls. Knowing the structure of a target permits computer-assisted drug design by molecular modelling.

Health care delivery

Many countries in which advanced treatments are available face severe economic impediments to the equitable delivery of medical and surgical care to their citizens. Research creates novel treatments, but many of them are extremely costly. In the USA, treatment for serious disease is already too expensive to be paid for out of most people’s earnings, health insurance has not been universal and ‘safety nets’ to protect the poor and elderly have been cut back. Baby-boomers, born in the late 1940s, are now approaching elderly status. All of these factors will put further pressure on the system. The Health Care and Education Reconciliation Act, which became US law in March 2010, should ameliorate the situation.

Considerations of social and economic policy are outside the scope of this book. However, science can make a contribution by improving the efficiency of use of the resources available.

For example, many drugs vary in their effectiveness in different patients. A drug may be effective for some patients and useless for others. Some patients may tolerate a treatment easily; others may suffer side effects ranging from discomfort through disability to death. Analysis of patients’ genes and proteins permits selection of drugs and dosages optimal for individual patients, a field called pharmacogenomics. Physicians can thereby avoid experimenting with different therapies, a procedure that is dangerous in terms of side effects – sometimes even fatal – and in any case is wasteful and expensive. Treatment of patients for adverse reactions to prescribed drugs consumes billions of dollars in health care costs. Conversely, being able to predict individual patients’ responses can make it possible to rescue drugs that are safe and effective in a minority of patients, but which have been rejected before or during clinical trials because of inefficacy or severe side effects in the majority of patients.

Some specific examples:

• Acute lymphoblastic leukaemia is a childhood cancer treated by thiopurines. In the patient, the enzyme thiopurine methyltransferase breaks down the drug. A genetic variant producing an inactive enzyme threatens build-up of toxic levels of the drug in patients. Screening for the deficiency allows monitoring to determine appropriate dosages.

• Abacavir is a drug used in treatment of AIDS. 4–8% of patients show a serious, potentially fatal hypersensitivity reaction. This is correlated with MHC allele HLA-B*5701. Genomic screening can thereby detect potential hypersensitivity, and guide treatment.

• Cytochromes P450 are a family of enzymes in the liver responsible for metabolizing a wide variety of drugs. Sequence variations affect the activities of these enzymes, to the point where lowered activity, or loss of activity, can cause drug toxicity. Genetic tests for variations in cytochrome P450 genes warn of potential overdose dangers. When J.D. Watson’s genome sequence was determined, it emerged that he is homozygous for an unusual allele of the drug-metabolizing cytochrome gene CYP2D6. Individuals with this genotype metabolize some drugs relatively slowly. Watson had been taking β-blockers to reduce his blood pressure; however, the treatment made him unacceptably sleepy. Based on information from his genome he is now taking a lower dose.

Our ability to rationalize and anticipate individual differences in responses to drugs can improve the rate of clinical success. In improving the application of resources by avoiding treatments that are useless or even harmful, science can have a direct effect on the economics of health care delivery.

• Applications of genome sequencing to medicine already include: detection of diseases irrevocably implied by gene
sequences, such as Huntington’s disease, detection of diseases which an individual has an enhanced risk of developing
but for which the risks can be lessened by changes in lifestyle or prophylactic surgery, genetic counselling of prospective
parents, and the identification of optimal therapy depending on detailed diagnosis of the variety of the disease, and on
the expected reaction of the patient to different drugs.
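As an illustration of detecting a disease ‘implied by gene sequences’, the repeat-length ranges quoted in Box 1.7 for Huntington’s disease can be turned into a small sketch. The function names, the wording of the categories, and the toy sequence are hypothetical; the box gives no category for fewer than 11 repeats, so that case is reported separately here as an assumption. This is a teaching example, not a diagnostic procedure.

```python
import re


def longest_cag_run(dna):
    """Length, in repeat units, of the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_repeats(n):
    """Map a CAG repeat count onto the ranges quoted in Box 1.7."""
    if n > 41:
        return "almost certain to suffer full Huntington's disease"
    if n >= 35:
        return "may develop only relatively mild symptoms"
    if n >= 29:
        return "unlikely to develop the disease"
    if n >= 11:
        return "normal range"
    return "below the range quoted in Box 1.7"  # assumption: not covered by the box


# Toy fragment: 20 CAG units flanked by unrelated bases (not real data).
fragment = "ATGGCG" + "CAG" * 20 + "CAACCG"
count = longest_cag_run(fragment)
print(count, "->", classify_repeats(count))
```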

The evolution and development of databases

High-throughput sequencing methods are generating immense amounts of data. How can this information be archived and presented in useful form? This is the responsibility of databases.

Modern genomics combines biological data with computer science and statistics. From this union have emerged both an intellectual framework and a technological toolkit. These contribute essential components to research and applications of genomics and related fields. Access to data and software has become as necessary a part of the infrastructure of research as distilled water. Computer storage and software are essential for generating, collecting, archiving, curating, distributing, retrieval, and analysis of biological data. Many biologists specialize in computing; all find it an essential resource.

Sources of biological data include several high-throughput streams, including:

• Systematic genome sequencing
• Protein expression patterns
• Metabolic pathways
• Protein interaction patterns and regulatory networks
• The scientific literature, including bibliographical databases. Now that much of the scientific literature is available online, it can itself be the subject of data mining.

Bioinformatics started with clerical projects for archiving and distribution of data. Annotation and even curation of the data were, initially, minimal. Data entry structuring was rudimentary. Many databanks were distributed as a series of flat files – that is, plain text – to provide the lowest common denominator of intelligibility by different computer systems.

Recognition of the importance of access to the data led to policy decisions by journals and funding agencies that obliged scientists to make their sequence and structure data publicly available. This increased the harvest and permitted the combination of fragmentary and incoherent data into logically structured collections. Databanks began to adopt fixed formats and controlled vocabularies, and recognized the importance of careful curation and annotation of the data.

There followed a recognition of the power of information-retrieval tools, which require imposing a structure on the data. These made possible the selective search and retrieval of data needed to answer particular scientific questions. Other methods permit numerical and/or textual analysis. Sequence alignment is by far the most common example.

For the development and provision of these and many other tools, biology is indebted to computer science. Computer science is a young but nevertheless mature field, moving swiftly on the back of devices for high-capacity information storage and speed of calculation. Its background was mathematics and engineering. Its goal has been the development of methods for making the most effective use of the available technology. This involves understanding what determines the effectiveness of methods, devising efficient methods for the problems we want to solve, and production of computer programs to implement these methods.

The growth in the amount of information, together with the novel and unusual constellation of skills required for managing it, has led to the establishment of specialized institutions to organize the work.

The earliest databanks, the Protein Sequence Databank, started by Margaret O. Dayhoff at the National Biomedical Research Foundation, Georgetown University, Washington DC, USA; and the Protein Data Bank of macromolecular structures, started by Walter C. Hamilton of Brookhaven National Laboratory, New York, USA, were outgrowths of – and originally

often found themselves uncomfortably competing for funding with – research activities.

With increasing recognition of the importance of databanks and the growing political attractiveness of their goals, several high-profile institutions have been established. Examples include the US National Center for Biotechnology Information (NCBI) at the US National Library of Medicine (NLM) in Bethesda, Maryland, USA, and the European Bioinformatics Institute outstation of the European Molecular Biology Laboratory, in Hinxton, Cambridgeshire, UK.

In addition to archiving and curating, databanks have been active in developing software for information retrieval and analysis. This is an integral part of making the resources in their care available on-line. However, the advantages of vesting responsibility for the archives in monolithic organizations do not preclude multiple routes of access to the data. Colloquially, anyone can design an individual ‘front end’.

A databank without effective modes of access is merely a data graveyard.

Databank evo-devo

Archiving projects originally tended to specialize, matching the nature of the data with the skills of the scientists curating it. Typically, the Protein Data Bank employed crystallographers, whereas genomic databanks attracted sequence specialists. The databanks tended to develop along different lines, dictated in part by the nature of the data.

This independence and divergence has some drawbacks, notably the difficulty of answering the questions of greater subtlety that arise in studying relationships among information contained in separate databanks. For example: for which proteins of known structure involved in diseases of purine biosynthesis in humans are there related proteins in yeast? We are setting conditions on known structure, specified function, detection of relatedness, correlation with disease, and specified species. We need links that facilitate simultaneous access to several databanks.

In principle, the problems would go away if all the databanks merged into one. To some extent this is happening. For example, the umbrella database UniProt integrates the contents, features, and annotation of several individual databases of protein families, domains, and functional sites, and contains links to others.

Alternatively, databanks are taking advantage of the World Wide Web to include a dense network of links among different archives. Today, the quality of a database depends not only on the information it contains but also on the effectiveness of its links to other sources of information. The growing importance of simultaneous access to databanks has led to research in databank interactivity – how can databanks ‘talk to one another’ without sacrificing the freedom of each one to structure its own data in appropriate ways?

Specialized user communities may extract subsets of the data, or recombine data from different sources, to provide specialized avenues of access. Such ‘boutique’ databases depend on the primary archives as sources of information, but redesign the organization and presentation. Indeed, different derived databases can ‘slice and dice’ the same information in different ways. Similarly, an individual may improve a method for solving an important problem – for example, identification of genes within genome sequences – and make the method available via a web site.

A reasonable extrapolation suggests the idea of specialized ‘virtual databases’, grounded in the archives but providing individual scope and function, tailored to the needs and achievements of individual research groups or even individual scientists.

Genome browsers

There are many databases in the field of molecular biology. We shall survey them in Chapter 3. However, a particular species of database that deals with full-genome sequences and related information is a genome browser.

Genome browsers are projects designed to organize and annotate genome information, and to present it via web pages together with links to related data, such as evolutionary relationships or correlations with disease. Genome browsers are like encyclopaedias. There is a commitment to linking the actual sequence data to as many as possible of the resources of data about an organism. Two major genome browsers are Ensembl, a joint project of the Sanger Centre and the European Bioinformatics Institute in Hinxton, UK, and the University of California at Santa Cruz Genome Browser in the USA.

In addition to presenting the data, genome browsers provide tools for searching and analysis. You can scroll through chromosomes, zooming in on interesting regions. For any region, you can see its contents and annotated properties. For genes, you can see information about function, expression, and homologues.

Let us look briefly at a specific example. The globins form a family of proteins in humans and other species. The α-globin gene cluster provides an example both of the appearance of genome browser web pages, and of the contents of an interesting region of the human genome.

Definitely ‘worth a detour’ is the α-globin locus on chromosome 16. Our access is through a genome browser. Figure 1.8 shows a diagram of chromosome 16 (right) and the assignments of genes in the chromosome (left). The banding pattern on the chromosome is produced by differential uptake of Giemsa stain, a dye that reacts with DNA. The bands reflect the local structure of the chromosome, which depends on the DNA sequences. It is clear from this figure that the light bands are more gene-rich than the dark ones.

[Figure 1.8 image: panels ‘Known genes’ (left) and ‘Chromosome 16’ (right), with bands labelled from p13.3 at the top to q24.3 at the bottom.]

Figure 1.8 Human chromosome 16, from the web site of Ensembl, based at the Sanger Centre in Hinxton, Cambridgeshire. Right: a schematic diagram of the chromosome, showing the centromere, and the banding patterns. Left: assignments of genes to the DNA sequence along the chromosome. Genes of known function appear in red. Additional genes appear in white. There is a clear correlation between the absence of grey and black bands on the chromosome and gene-rich regions.

The α-globin gene cluster is near the upper tip of chromosome 16, in the band denoted p13.3. (For the nomenclature of chromosome bands see p. 85.) We can focus in on this region to see the distribution of individual genes (Figure 1.9). The α-globin locus contains five genes that are expressed, ζ, αD, α2, α1, and θ1, and two degenerate pseudogenes, ψζ and ψα1 (see Figure 1.10).

These are not the only globins in the genome. Compare the α-globin locus with another multigene cluster, the β-globin locus, which appears on chromosome 11 (see Figures 1.10 and 1.11). In addition, there are single genes for myoglobin, cytoglobin, and neuroglobin (see Weblem 1.2). All of these genes arose through evolution by duplication and divergence (see Figure 1.11).

The order of the genes on the chromosome has another significance. Their transcription, leading to protein synthesis, follows a strict developmental pattern. A human embryo (up to six weeks after conception) primarily synthesizes two haemoglobin chains: ζ and ε. These molecules form a ζ2ε2 tetramer. From six weeks after conception until about eight weeks after birth, the predominant species shifts to foetal haemoglobin, α2γ2. This is succeeded by adult haemoglobin, α2β2. As the organism develops,

expression passes between genes in order of their position on the chromosome.

Focusing down still more finely within the α-gene cluster, we can see the structure of the gene for HBA, the α-subunit of adult haemoglobin (see Figure 1.12). Like most other eukaryotic genes, it is divided into exons (expressed segments of the gene) and introns (intervening regions) (Figure 1.13).

Figure 1.9 Globin cluster, from the genome viewer of the US National Center for Biotechnology Information (NCBI).

Chromosome 16, α-globin gene cluster: ζ ψζ αD ψα1 α2 α1 θ1
Chromosome 11, β-globin gene cluster: ε Gγ Aγ ψβ δ β

Figure 1.10 Distribution of protein-coding genes and pseudogenes in the α-globin cluster on chromosome 16, and the β-globin cluster on chromosome 11.

Now we have reached the level of the sequence itself (see Figures 1.14 and 1.15). The protein folds

[Figure 1.11 diagram, top to bottom: gene duplication; divergence (α, β); translocation (α, β); further duplications and divergence (ζ, α; ε, γ, β); present-day clusters ζ ψζ αD ψα1 α2 α1 θ1 and ε Gγ Aγ ψβ δ β.]

Figure 1.11 Haemoglobin genes and pseudogenes are distributed on their chromosomes in a way that appears to reflect their evolution via duplication and divergence. That is, adjacent genes are similar in sequence. The evolutionary tree can be drawn without any intersecting lines.

Figure 1.12 The structure of the gene for the α-subunit of human haemoglobin.

Exon 1 Intron 1 Exon 2 Intron 2 Exon 3

Figure 1.13 Structure of the human α1-globin gene (HBA): 3′-untranslated region in red, exons in green, introns in black, and
5′-untranslated region in cyan. This exon/intron pattern is conserved in many expressed vertebrate globin genes, including haemoglobin
α and β chains, and myoglobin. In contrast, the genes for plant globins have an additional intron, genes for Paramecium globins one
fewer intron, and genes for insect globins contain none. The gene for human neuroglobin, a homologue expressed at low levels in the
brain, contains three introns, like plant globin genes.

LOCUS       HSAGL1                  1138 bp    DNA     linear   PRI 24-APR-1993
DEFINITION  Human alpha-globin germ line gene.
ACCESSION   V00488
VERSION     V00488.1  GI:28546
KEYWORDS    alpha-globin; germ line; globin.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Catarrhini;
            Hominidae; Homo.
REFERENCE   1  (bases 1 to 1138)
  AUTHORS   Liebhaber,S.A., Goossens,M.J. and Kan,Y.W.
  TITLE     Cloning and complete nucleotide sequence of human 5'-alpha-globin
            gene
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 77 (12), 7054-7058 (1980)
  PUBMED    6452630
COMMENT     KST HSA.ALPGLOBIN.GL [1138].
FEATURES             Location/Qualifiers
     source          1..1138
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
     prim_transcript 98..929
     exon            98..230
                     /number=1
     CDS             join(135..230,348..551,692..820)
                     /codon_start=1
                     /product="alpha globin"
                     /protein_id="CAA23748.1"
                     /db_xref="GI:28547"
                     /db_xref="GOA:P01922"
                     /db_xref="UniProtKB/Swiss-Prot:P01922"
                     /translation="MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYF
                     PHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLL
                     SHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
     exon            348..551
                     /number=2
     exon            692..929
                     /number=3
ORIGIN
1 aggccgcgcc ccgggctccg cgccagccaa tgagcgccgc ccggccgggc gtgcccccgc
61 gccccaagca taaaccctgg cgcgctcgcg gcccggcact cttctggtcc ccacagactc
121 agagagaacc caccatggtg ctgtctcctg ccgacaagac caacgtcaag gccgcctggg
181 gtaaggtcgg cgcgcacgct ggcgagtatg gtgcggaggc cctggagagg tgaggctccc
241 tcccctgctc cgacccgggc tcctcgcccg cccggaccca caggccaccc tcaaccgtcc
301 tggccccgga cccaaacccc acccctcact ctgcttctcc ccgcaggatg ttcctgtcct
361 tccccaccac caagacctac ttcccgcact tcgacctgag ccacggctct gcccaagtta
421 agggccacgg caagaaggtg gccgacgcgc tgaccaacgc cgtggcgcac gtggacgaca
481 tgcccaacgc gctgtccgcc ctgagcgacc tgcacgcgca caagcttcgg gtggacccgg
541 tcaacttcaa ggtgagcggc gggccgggag cgatctgggt cgaggggcga gatggcgcct
601 tcctctcagg gcagaggatc acgcgggttg cgggaggtgt agcgcaggcg gcggcgcggc
661 ttgggccgca ctgaccctct tctctgcaca gctcctaagc cactgcctgc tggtgaccct
721 ggccgcccac ctccccgccg agttcacccc tgcggtgcac gcttccctgg acaagttcct
781 ggcttctgtg agcaccgtgc tgacctccaa ataccgttaa gctggagcct cggtagccgt
841 tcctcctgcc cgctgggcct cccaacgggc cctcctcccc tccttgcacc ggcccttcct
901 ggtctttgaa taaagtctga gtgggcggca gcctgtgtgt gcctgggttc tctctgtccc
961 ggaatgtgcc aacaatggag gtgtttacct gtctcagacc aaggacctct ctgcagctgc
1021 atggggctgg ggagggagaa ctgcagggag tatgggaggg gaagctgagg tgggcctgct
1081 caagagaagg tgctgaacca tcccctgtcc tgagaggtgc cagcctgcag gcagtggc
//

Figure 1.14 EMBL Data Library entry for human haemoglobin, α chain.
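
The CDS line of the entry, join(135..230,348..551,692..820), lists the coding part of each exon as 1-based, inclusive positions. The following Python sketch shows one way to read such a location and check that the spliced coding sequence is a whole number of codons. The function names parse_join and splice are illustrative, not part of any database software, and the sketch handles only simple join(...) locations on the forward strand.

```python
import re

def parse_join(location):
    """Parse a location such as 'join(135..230,348..551)' into
    a list of (start, end) pairs, 1-based and inclusive."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def splice(sequence, segments):
    """Concatenate the 1-based, inclusive segments of a sequence."""
    return "".join(sequence[start - 1:end] for start, end in segments)

# CDS location copied from the entry in Figure 1.14.
cds = parse_join("join(135..230,348..551,692..820)")
lengths = [end - start + 1 for start, end in cds]
total = sum(lengths)

print(lengths)                        # coding bases contributed by each exon
print(total, total // 3, total % 3)   # total bases, codons, leftover bases

# Toy illustration of splicing (hypothetical sequence, not the globin gene):
toy = "aaCCCggTT"
print(splice(toy, parse_join("join(3..5,8..9)")))
```

The three segments contribute 96, 204, and 129 bases: 429 in all, or 143 codons. That is consistent with the 142 residues listed under /translation plus one stop codon.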

(a) Nucleotide sequence of mRNA:

auggugcugucuccugccgacaagaccaacgucaaggccgccugggguaaggucggcgcgcacgcuggcg
aguauggugcggaggcccuggagaggauguuccuguccuuccccaccaccaagaccuacuucccgcacuu
cgaccugagccacggcucugcccagguuaagggccacggcaagaagguggccgacgcgcugaccaacgcc
guggcgcacguggacgacaugcccaacgcgcuguccgcccugagcgaccugcacgcgcacaagcuucggg
uggacccggucaacuucaagcuccuaagccacugccugcuggugacccuggccgcccaccuccccgccga
guucaccccugcggugcacgccucccuggacaaguuccuggcuucugugagcaccgugcugaccuccaaa
uaccguuaa

(b) Translation to an amino acid sequence:

MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNA
VAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK
YR

Figure 1.15 (a) Nucleotide base sequence of mature messenger RNA produced from the gene sequence in previous figure. a = adenine, u = uracil, g = guanine, c = cytosine. (b) Amino acid sequence from translation of (a). Each letter stands for one of the twenty canonical amino acids. The convention is that bases appear in lower case, amino acids in upper case. For instance, a = adenine (a base) and A = alanine (an amino acid).
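
The step from part (a) to part (b) of Figure 1.15 is a table lookup, one codon at a time. A minimal Python sketch follows, assuming the standard genetic code and a reading frame that starts at the first base. The names CODON_TABLE and translate are illustrative. For brevity the example translates only the first 24 and the last 18 bases of the mRNA shown in the figure.

```python
BASES = "ucag"
# Standard genetic code, with codons ordered uuu, uuc, uua, uug, ucu, ...
# '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(mrna):
    """Translate lower-case mRNA from its first base, stopping at a stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)

first_24 = "auggugcugucuccugccgacaag"   # start of the mRNA in Figure 1.15(a)
last_18 = "accuccaaauaccguuaa"          # end of the same mRNA, in frame

print(translate(first_24))
print(translate(last_18))
```

The two fragments give MVLSPADK and TSKYR, the beginning and end of the amino acid sequence in Figure 1.15(b).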

spontaneously to its proper three-dimensional structure. Two such haemoglobin α chains combine with two corresponding β chains to form the tetrameric structure (Figure 1.16).

Figure 1.16 The structure of adult human haemoglobin. The protein contains four polypeptide chains (two α-subunits and two β-subunits), each binding a haem group. Cylinders represent α-helices. Spheres represent atoms of the haem groups. (Do not confuse the use of α to designate both a type of helix and a type of subunit of haemoglobin.)

• Database organizations archive data, curate and annotate them, make the data available over the World Wide Web, and provide information-retrieval tools facilitating research.
• The variety of specialities required of the database staff to provide all of these has resulted in large, often international, database institutions. A specific type of database aimed at presenting genomic sequences and related information is called a genome browser.

Protein evolution: divergence of sequences and structures within and between species

Different globins diverged from a common ancestor

Differences in gene sequences, differences in the corresponding amino acid sequences, and differences in three-dimensional structure reflect evolutionary divergence. The globins within the α and β clusters are more closely related than members of the α cluster are to members of the β cluster. Other globins in the human genome (myoglobin, neuroglobin, and cytoglobin) are more distant relatives. Corresponding globins in related species, such as human and horse, have also diverged. In general, the divergence at the molecular level parallels the divergence of the species according to classical taxonomic methods. But the power of comparative genomics and proteomics in tracing precise relationships both within and among species is immense.

The basic tool for investigating sequence divergence is the multiple sequence alignment (see Figure 1.17).

(a) Mammalian Globin Sequences

10 20 30 40 50 60
| | | | | |
Human Haemoglobin a chain VLSPADKTNVKAAWGKVGA_ HAGEYGAEALERMFLSFPTTKTYFPHF_DLS_____HGS
Horse Haemoglobin a chain VLSAADKTNVKAAWSKVGG_ HAGEYGAEALERMFLGFPTTKTYFPHF_DLS_____HGS
Human Haemoglobin b chain VHLTPEEKSAVTALWGKV___ NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
Horse Haemoglobin b chain VQLSGEEKAAVLALWDKV___ NEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGN
Sperm whale myoglobin VLSEGEWQLVLHVWAKVEA_ DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKAS
L k V a W KV e G eaL R P T F F dLs g

70 80 90 100 110 120

| | | | | |
Human Haemoglobin a chain AQVKGHGKKVADALTNAVAHV____ D_DMPNALSALSDLHAHKLRVDPVNFKLLSHCLLV
Horse Haemoglobin a chain AQVKAHGKKVGDALTLAVGHL____D_DLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLS
Human Haemoglobin b chain PKVKAHGKKVLGAFSDGLAHL____ D_NLKGTFATLSELHCDKLHVDPENFRLLGNVLVC
Horse Haemoglobin b chain PKVKAHGKKVLHSFGEGVHHL____ D_NLKGTFAALSELHCDKLHVDPENFRLLGNVLVV
Sperm whale myoglobin EDLKKHGVTVLTALGAILKK_____ KGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIH
vK HGkkV h d Ls lH Kh vdp nf ll l

130 140 150 160

| | | |
Human Haemoglobin a chain TLAAHLP_ A_ EFTPAVHASLDKFLASVSTVLTSKYR
Horse Haemoglobin a chain TLAVHLP_ N _ DFTPAVHASLDKFLSSVSTVLTSKYR
Human Haemoglobin b chain VLAHHFG_ K_ EFTPPVQAAYQKVVAGVANALAHKYH
Horse Haemoglobin b chain VLARHFG_ K_ DFTPELQASYQKVVAGVANALAHKYH
Sperm whale myoglobin VLHSRHP_ G_ DFGADAQGAMNKALELFRKDIAAKYKELGYQG
La h Ftp a K v l KY

(b) Eukaryote and Prokaryote Full-length Globin Sequences

10 20 30 40 50 60
| | | | | |
Human Haemoglobin a chain _ VLSPADKTNVKAAWGKVGA_ HAGEYGAEALERMFLSFPTTKTYFPHF_ DLS_____ HGS
Horse Haemoglobin a chain _ VLSAADKTNVKAAWSKVGG_ HAGEYGAEALERMFLGFPTTKTYFPHF_ DLS_____ HGS
Human Haemoglobin b chain VHLTPEEKSAVTALWGKV___ NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGN
Horse Haemoglobin b chain VQLSGEEKAAVLALWDKV___ NEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGN
Sperm whale myoglobin _ VLSEGEWQLVLHVWAKVEA_ DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKAS
Chironomus erythrocruorin __ LSADQISTVQASFDKVKG_____ DPVGILYAVFKADPSIMAKFTQFAG_ KDLESIKGT
Lupin leghaemoglobin GALTESQAALVKSSWEEFNA_ NIPKHTHRFFILVLEIAPAAKDLFS_ FLK_ GTSEVPQNN
Bacterial globin (Vitreoscilla sp.) __ MLDQQTINIIKATVPVLKEHGVTITTTFYKNLFAKHPEVRPLFD_ M__________ GR
l v P F f

70 80 90 100 110 120

| | | | | |
Human Haemoglobin a chain AQVKGHGKKVADALTNAVAHV____D_DMPNALSALSDLHAHKLRVDPVNFKLLSHCLLV
Horse Haemoglobin a chain AQVKAHGKKVGDALTLAVGHL____D_DLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLS
Human Haemoglobin b chain PKVKAHGKKVLGAFSDGLAHL____D_NLKGTFATLSELHCDKLHVDPENFRLLGNVLVC
Horse Haemoglobin b chain PKVKAHGKKVLHSFGEGVHHL____D_NLKGTFAALSELHCDKLHVDPENFRLLGNVLVV
Sperm whale myoglobin EDLKKHGVTVLTALGAILKK_____KGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIH
Chironomus erythrocruorin APFETHANRIVGFFSKIIGEL___ P__NIEADVNTFVASHKPRG_ VTHDQLNNFRAGFVS
Lupin leghaemoglobin PELQAHAGKVFKLVYEAAIQLEVTGVVVSDATLKNLGSVHVSKG_ VADAHFPVVKEAILK
Bacterial globin (Vitreoscilla sp.) QESLEQPKALAMTVLAAAQNI__ ENLPAILPAVKKIAVKHCQAG_ VAAAHYPIVGQELLG
h H

130 140 150 160

| | | |
Human Haemoglobin a chain TLAAHLP_A_EFTPAVHASLDKFLASVSTVLTSKYR
Horse Haemoglobin a chain TLAVHLP_N_DFTPAVHASLDKFLSSVSTVLTSKYR
Human Haemoglobin b chain VLAHHFG_K_EFTPPVQAAYQKVVAGVANALAHKYH
Horse Haemoglobin b chain VLARHFG_K_DFTPELQASYQKVVAGVANALAHKYH
Sperm whale myoglobin VLHSRHP_G_DFGADAQGAMNKALELFRKDIAAKYKELGYQG
Chironomus erythrocruorin YMKAHT_____DFAGAEAAWGATLDTFFGMIFSKM
Lupin leghaemoglobin TIKEVVG_A_KWSEELNSAWTIAYDELAIVIKKEMDDAA
Bacterial globin (Vitreoscilla sp.) AIKEVLGDAA__TDDILDAWGKAYGVIADVFIQVEADLYAQAVE

Figure 1.17 (a) Multiple sequence alignment of five mammalian globins: sperm whale myoglobin, and the α and β chains of human
and horse haemoglobin. Each sequence contains approximately 150 residues. In the line below the tabulation, upper-case letters
indicate residues that are conserved in all five sequences, and lower-case letters indicate residues that are conserved in all but sperm
whale myoglobin. (b) Multiple sequence alignment of full-length globins from eukaryotes and prokaryotes. Many fewer positions are
conserved than in the mammals-only case. In the line below this tabulation, upper-case letters indicate residues that are conserved
in all eight sequences, and lower-case letters indicate residues that are conserved in all but the bacterial globin.
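
The conservation line under each block of Figure 1.17 can be derived mechanically from the aligned rows. The Python sketch below marks the columns in which every sequence has the same residue, and computes the percentage identity of a pair. It is an illustration only: the function names are hypothetical, and it uses just the first 19 ungapped residues of three of the sequences in the figure, writing '-' rather than a blank for a non-conserved column. Because it compares three rows rather than five, it marks more columns than the line printed in Figure 1.17(a).

```python
def conserved_columns(aligned, gap="_"):
    """Return the shared residue at each fully conserved column, '-' elsewhere."""
    out = []
    for column in zip(*aligned):
        first = column[0]
        if first != gap and all(res == first for res in column):
            out.append(first)
        else:
            out.append("-")
    return "".join(out)

def percent_identity(a, b, gap="_"):
    """Percentage of identical residues among columns with no gap in either row."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    same = sum(x == y for x, y in pairs)
    return 100.0 * same / len(pairs)

# First 19 residues of three rows of Figure 1.17 (no gaps in this stretch).
human_alpha = "VLSPADKTNVKAAWGKVGA"
horse_alpha = "VLSAADKTNVKAAWSKVGG"
whale_myoglobin = "VLSEGEWQLVLHVWAKVEA"

print(conserved_columns([human_alpha, horse_alpha, whale_myoglobin]))
print(round(percent_identity(human_alpha, horse_alpha), 1))
print(round(percent_identity(human_alpha, whale_myoglobin), 1))
```

Over this short stretch the two α chains differ at 3 of 19 positions, whereas the human α chain and myoglobin differ at most positions, in line with the closer relationship between the haemoglobin chains.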

The basic tool for investigating structural divergence is superposition (see Figure 1.18). Applications of these techniques have become heavy industries in molecular biology. We shall make abundant use of them in the following chapters.

Figure 1.18 Superposition of three closely related mammalian globins: sperm whale myoglobin (black), human haemoglobin α chain (cyan) and β chain (magenta).
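
Superposition means finding the rigid-body motion that brings corresponding atoms of two structures as close together as possible, usually in the least-squares sense. The Python sketch below illustrates the idea in two dimensions, where the best rotation angle has a simple closed form. This is a simplification chosen for brevity: real protein structures are superposed in three dimensions, and the function name superpose_2d and the test coordinates are hypothetical.

```python
import math

def superpose_2d(mobile, target):
    """Least-squares fit of `mobile` onto `target` (equal-length lists of (x, y)).

    Returns the rotation angle in radians and the root-mean-square
    deviation (RMSD) between corresponding points after the fit."""
    n = len(mobile)
    # Translate both point sets so that their centroids lie at the origin.
    mcx = sum(x for x, _ in mobile) / n
    mcy = sum(y for _, y in mobile) / n
    tcx = sum(x for x, _ in target) / n
    tcy = sum(y for _, y in target) / n
    p = [(x - mcx, y - mcy) for x, y in mobile]
    q = [(x - tcx, y - tcy) for x, y in target]
    # Minimizing the summed squared distances is equivalent to maximizing
    # cos(theta) * dot + sin(theta) * cross, hence theta = atan2(cross, dot).
    dot = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(p, q))
    cross = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(p, q))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    squared = sum((c * px - s * py - qx) ** 2 + (s * px + c * py - qy) ** 2
                  for (px, py), (qx, qy) in zip(p, q))
    return theta, math.sqrt(squared / n)

# Hypothetical example: the target is the mobile set rotated by 90 degrees
# and shifted by (5, 5), so a perfect fit exists.
mobile = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
target = [(5.0, 5.0), (5.0, 6.0), (3.0, 5.0)]
theta, rmsd = superpose_2d(mobile, target)
print(math.degrees(theta), rmsd)
```

For diverged proteins, such as those in Figure 1.18, no rigid motion gives a perfect fit, and the residual RMSD serves as a measure of structural divergence.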

Ethical, legal, and social issues

Knowledge creates power. Power requires control. Control requires decisions.

Advances in genomics have created problems that individuals and societies must face. In setting up the Human Genome Project, the US Department of Energy and NIH recognized the importance of ethical, legal, and social issues by allocating 3–5% of the funding to them. We shall discuss these topics throughout the book, in context, keeping several general categories in mind.

DNA databases containing information about every citizen of a country are technically feasible. Routine testing of all newborns for genetic diseases is common, and it would be simple to determine the sequences of the regions used by law-enforcement agencies for identification. Fairly soon it will be relatively inexpensive to determine complete genome sequences of individuals.

Should this information be determined? If it is determined, who should have access to it?

Provided DNA sequence information is kept as private as normal medical records, sequencing can benefit individuals.

More controversial questions arise in allowing the information to be collected into a generally accessible databank. Most people would have no problem accepting an effort that would assist in the capturing of criminals, especially those criminals that are likely to repeat their offences. Most people would have no problem accepting an effort that would make it easier to identify those killed on a battlefield or in a terrorist attack. Most people recognize that extensive data on genome sequences, in a form that can be correlated with clinical records, would be an extremely valuable source for research. Questions have been raised, however, over:

• Privacy issues: should inclusion in a databank of genomic information about individuals require the individuals’ consent?
• What data should be included? Should the data be limited to the minimum required for standard identification procedures, or be more extensive? (For instance, should sufficient additional data be kept to identify physical features or ethnic characteristics?)
• Access: who should have access to the information?

Databases containing human DNA sequence information

There are two major national repositories of human genome information in the UK (see Box 1.8).

• The National DNA Database (NDNAD) primarily supports law enforcement agencies.
• The UK Biobank has the goal of improving prevention, diagnosis, and treatment of illness. It has amassed a large collection of clinical data, and biological samples to provide sequence information.

Also based in the UK are the comprehensive nucleotide sequence databanks at the European Bioinformatics Institute and The Sanger Centre.

In England and Wales, police have been allowed to take samples from any individual arrested on suspicion of all but the most minor crimes, and to retain the derived information and the samples even if the person arrested is never charged. Consent is not required. Access to the database is not strictly limited to police: the database has also been used to support research projects without consent of the individuals represented in it.

In the current political climate, there is intense pressure to tip the scales towards giving governments powers that can be used to protect their citizens. The dangers are that such powers, although they appear to be innocuous to innocent people, may be compatible with abuse if safeguards are inadequate or, in the worst case, deliberately violated. Even in the United Kingdom, a House of Commons committee in 2005* described as ‘extremely regrettable’ the fact ‘that for most of the time that the NDNAD has been in existence there has been no formal ethical review of applications to use the database and the associated samples for research purposes’. If the situation in the UK could have been described as ‘extremely regrettable’, one need not be a Colonel Blimp to shudder to think what things are like elsewhere (see Box 1.8).

It has been suggested that the NDNAD be extended to the entire population. The current contents of the database show ethnic and gender biases, which a universal database would eliminate. A higher fraction of reported crimes might be solved, but there is no consensus about how significant this would be. Arguments against such an extension include what is at the moment still a very high cost. There is also intense debate over whether the benefits to society would justify the loss of individual privacy.

An important point about privacy of computer databanks is that, even if there is consensus that certain data should be kept confidential, computer security is simply not up to the task of ensuring that it will be.

BOX 1.8 DNA Sequence Databases, Law Enforcement, and the Courts

The UK National DNA Database (NDNAD) is one of the largest forensic DNA collections in the world. It stores both a database of sequence-derived information, and the biological samples from which the sequences were determined. In mid-2010 it contained profiles of an estimated 5.4 million individuals. Of these, 78.50% are male and 20.82% female; a small number are unassigned. The NDNAD contains personal identification ‘DNA fingerprints’, and gender. The probability that sequences from a sample collected at a crime scene will match an entry in the database is over 50%!

Entries in the NDNAD are of two types:

(a) Samples from known individuals. Police can collect samples, without consent, from anyone suspected of a ‘recordable’ offence, not only before trial and possible conviction, but even before being charged with a crime. In the UK, the list of recordable offences is steadily growing through incremental legislation. Roughly speaking, recordable offences include all but the most trivial antisocial actions. For example, under the Football (Offences) Act of 1991, ‘unlawfully going onto the playing area’ is a recordable offence.

For purposes of elimination, the NDNAD also contains samples from persons present at a crime scene, and from police officers.

(b) Samples collected at a crime scene. They may originate from the perpetrator of a crime, from a victim, or even from someone who had been at the scene at some time other than during the commission of the crime. In the event of a match, these become samples from known individuals. (In the case of a partial match, they may suggest that the crime scene sample came from a near relative of the individual matched.)

* House of Commons Science and Technology Committee, Forensic Science on Trial, 29 March 2005. http://www.publications.parliament.uk/pa/cm200405/cmselect/cmsctech/96/96i.pdf

In the UK, the governing legislation has been in an active state, particularly with respect to retention of samples; and, under that heading, particularly with respect to retention of samples from people not convicted of a crime. The Criminal Justice Act of 2003 authorized the widespread collection of samples for DNA profiles in England and Wales, without consent. (This was extended to Northern Ireland in the following year.) The act also allowed the indefinite retention of the information, even if the suspect were never charged with a crime. In 2008, the Counter-Terrorism Act extended the criteria that allow police to demand samples for DNA sequencing.

The law in Scotland, in this as in many other respects, varies from that of England and Wales. Scotland maintains a separate DNA database, and shares results with England. One salient legal difference is that Scotland does not permit automatic indefinite retention of samples from people not convicted of a crime.

The European Court of Human Rights has aligned itself more closely with the Scottish law.

A case against the UK government was brought to the European Court by two individuals who wanted their identifying information expunged from the NDNAD and their samples destroyed. One was tried for attempted robbery but acquitted; the other was never charged. The court held that there had been violation of Article 8 of the European Convention for the Protection of Human Rights and Fundamental Freedoms:

‘In conclusion, the Court finds that the blanket and indiscriminate nature of the powers of retention of the fingerprints, cellular samples and DNA profiles of persons suspected but not convicted of offences, as applied in the case of the present applicants, fails to strike a fair balance between the competing public and private interests and that the respondent State has overstepped any acceptable margin of appreciation in this regard. Accordingly, the retention at issue constitutes a disproportionate interference with the applicants’ right to respect for private life and cannot be regarded as necessary in a democratic society’.

In the US, in the absence of overriding Federal legislation, laws of individual states govern DNA collection. The variety of guidelines for collection and retention of samples expressed in state laws has had a patchy career in the (state) courts. Some have been declared unconstitutional.

In the US, the analogue of the UK NDNAD is the Combined DNA Index System and the National DNA Index System (CODIS/NDIS), maintained by the Federal Bureau of Investigation (FBI). In late 2010, NDIS contained almost 10 million DNA profiles. This is about twice the size of the corresponding UK NDNAD, but represents a smaller percentage of the national population.

The Genetic Information Nondiscrimination Act (GINA) of 2008 aimed at protecting individual privacy with respect to genomic information. It prohibits:

(a) Health insurance companies from reducing coverage or increasing prices to individuals based on information from genetic tests.
(b) Employers from making hiring decisions based on DNA sequence information.
(c) Companies from demanding or even requesting a genetic test.

However, the law does not apply to people applying for life insurance or long-term care or disability insurance.

The Health Care and Education Reconciliation Act became US Federal law in March 2010. In theory, it should allay some of the fears associated with the potential consequences of improper release of genetic information.

In principle, use of DNA samples in research should require consent of the individuals donating the material: consent not only for its collection, but for the specific uses to which the samples will be put. There have been cases of ‘research goal creep’ in which permission was granted for analysis of restricted scope, but the samples were subsequently used for other studies.

A US case testing the propriety of use of samples collected voluntarily from subjects was settled in April 2010. A Native American group, the Havasupai, who inhabit an inaccessible area of the Grand Canyon, has among the highest known incidences of Type II diabetes. In 1991, 55% of Havasupai women and 38% of Havasupai men were affected. Scientists from Arizona State University collected samples from the group, and carried out research projects, believed by the subjects to be focused on susceptibility to diabetes. However, using the same samples, the scientists also investigated genetic susceptibility for schizophrenia, and evidence for migration patterns. The subjects objected. Researchers pointed to the wording in the consent form; representatives of the Havasupai alleged (among other things) that the wording was too vague to constitute truly informed consent. The ensuing lawsuit was settled, with the Arizona state agency responsible for the university agreeing to pay the Havasupai US$700 000, and to return the samples.

The conclusion is that, in the UK and US at least, legislation is moving in the direction of greater protection of privacy of DNA sequence information. Belief in this protection, which may in some respects be illusory, may lead to increased genetic testing, both in regular medical practice and by private companies. (a) As with mailing lists, testing companies may have the right to sell genetic information to outside parties. (b) Given the increased degree of international sharing of identification information, individuals need to be concerned not with the countries with the most secure databases, but with those with the least secure ones. (c) Experience has shown that much private information in fact becomes disseminated, either through accident or design.

Ethical considerations for compiling DNA databases

Genomic databases are useful in identifying individuals from samples collected at crime scenes. In drafting the laws granting law-enforcement agencies authorization to maintain such databases, there is a tension between protecting society against offenders and upholding individuals’ rights of privacy.

● RECOMMENDED READING

• The first two books are sources for the history of the development of molecular biology after
the Second World War. The book by Rosenfield, Ziff, & Van Loon is a less formal but no less serious
account of the founding people and events.
Judson, H.F. (1980). The Eighth Day of Creation: Makers of the Revolution in Biology. Jonathan
Cape, London.
de Chadarevian, S. (2002). Designs for Life: Molecular Biology after World War II. Cambridge
University Press, Cambridge.
Rosenfield, I., Ziff, E.B., & Van Loon, B. (1983). DNA for Beginners. Writers & Readers
Publishing, London.
• The next five publications deal with genomics and the Human Genome Project, placing it in
broad context. The book by Sulston & Ferry is a personal account by one of the major players.
Ridley, M. (1999). Genome: The Autobiography of a Species in 23 Chapters. HarperCollins
Publishers, New York.
Lander, E.S. & Weinberg, R.A. (2000). Genomics: journey to the center of biology. Science 287,
1777–1782.
Wolfsberg, T.G., Wetterstrand, K.A., Guyer, M.S., Collins, F.S., & Baxevanis, A.D. (2002). A user’s
guide to the human genome. Nat. Genet. 32 (Suppl.), 1–79.
Sulston, J. & Ferry, G. (2002). The Common Thread: A Story of Science, Politics, Ethics and the
Human Genome. Bantam Press, London.
Choudhuri, S. (2003). The path from nuclein to human genome: a brief history of DNA with a
note on human genome sequencing and its impact on future research in biology. Bull. Sci.
Technol. Soc. 23, 360–367.
The 18 February 2011 issue of Science magazine contains several articles recognizing the tenth
anniversary of the sequencing of the human genome.

• The richness of non-protein coding RNAs in genomes:

Baker, M. (2011). Long noncoding RNAs: the search for function. Nat. Methods 8, 379–383.
• Genomics and personalized medicine:
Cooper, D.N., Chen, J.M., Ball, E.V., Howells, K., Mort, M., Phillips, A.D., Chuzhanova, N.,
Krawczak, M., Kehrer-Sawatzki, H., & Stenson, P.D. (2010). Genes, mutations, and human
inherited disease at the dawn of the age of personalized genomics. Hum. Mutat. 31, 631–655.
Marian, A.J. (2010). Editorial review: DNA sequence variants and the practice of medicine. Curr.
Opin. Cardiol. 25, 182–185.
Ma, Q. & Lu, A.Y.H. (2011). Pharmacogenetics, pharmacogenomics, and individualized
medicine. Pharmacol. Rev. 63, 437–459.
Brunicardi, F.C., Gibbs, R.A., Wheeler, D.A., Nemunaitis, J., Fisher, W., Goss, J., & Chen, C.
(2011). Overview of the development of personalized genomic medicine and surgery. World
J. Surg. 35, 1693–1699.
Mahungu, T.W., Johnson, M.A., Owen, A., & Back, D.J. (2009). The impact of pharmacogenetics on
HIV therapy. Int. J. STD AIDS 20, 145–151.
• Finally, a collection of papers on ethical, legal, and social issues, describing some of the
interactions, and collisions, between scientific advances and social policy; and a more recent
paper, and a government report.
Gaskell, G. & Bauer, M.W. (eds) (2006). Genomics & Society: Legal, Ethical & Social
Dimensions. Earthscan, London.
Levitt, M. (2007). Forensic databases: benefits and ethical and social costs. Brit. Med.
Bull. 83, 235–248.
In March 2010, The Home Affairs Committee of the UK House of Commons published a report
on The National DNA Database: http://www.publications.parliament.uk/pa/cm200910/
cmselect/cmhaff/222/22202.htm.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 1.1 Make a very rough estimate of the average density of protein-coding genes in the
human genome. (Total genome size ∼3 × 10⁹ bp; total number of genes ∼3 × 10⁴.)
Exercise 1.2 Assume that most eukaryotes have approximately 25 000 protein-coding genes,
and that the average eukaryotic protein has a length of 300 amino acids. Assume that other
functional regions, including RNA-coding genes and control regions, do not require more base
pairs than the protein-coding genes themselves (an assumption for which there exists not the
slightest justification other than ignorance). Estimate the minimum size that a eukaryotic genome
could potentially have.
Exercise 1.3 For the standard genetic code (Box 1.1), give an example of a pair of codons related
by a synonymous single-site substitution (a) at the third position and (b) at the first position.
(c) Give an example of a pair of codons related by a non-synonymous single-site substitution
at the third position. (d) Can a change in the second position of a codon ever produce a
synonymous mutation?
Exercise 1.4 (a) Is it possible to convert Phe to Tyr by a single base change? If so, what would be
possible wild-type and mutant codons? (b) Is it possible to convert Ser to Arg by a single base
change? If so, what would be possible wild-type and mutant codons? (c) What is the minimum
number of base substitutions that would convert Cys to Glu? (d) In the evolution of an essential
protein encoded by a single gene, a Trp is converted to a Gln by two successive single-base
changes. What is the intermediate codon?
Exercise 1.5 RNA editing in the mitochondria of higher plants often changes cytosine to uracil at
the second position of codons. What amino acid changes could this effect? Note that higher plant
mitochondria use the standard genetic code.
Exercise 1.6 A single base-pair deletion in an exon in a protein-coding gene would be very serious
because it would throw off the reading frame. What would you expect to be the effect of a single
base-pair deletion in a gene for a structural RNA molecule? (Consider transfer RNA, for example;
see Figure 1.4.)
Exercise 1.7 Here is a fragment of length 200 from a mitochondrial genome. If this fragment were
processed by a sequencer giving read length 35, what would be (a) the result of a single-end
read, and (b) the result of a paired-end read? (See Figure 1.7.)

1 5′-gttaatgtag cttaaactaa agcaaggcac tgaaaatgcc tagatgagtc tgcctactcc

61 ataaacataa aggtttggtc ctagcctttc tattagttga cagtaaattt atacatgcaa
121 gtatctgcct cccagtgaaa tatgccctct aaatccttac cggattaaaa ggagccggta
181 tcaagctcac ctagagtagc tcatgacgcc ttgctaaacc acgcccccac gggatacagc-3′

Exercise 1.8 Which of the following bands of human chromosome 16 are gene-rich: p13.3,
q22.1, q11.2? (See Figure 1.9.)
Exercise 1.9 On a photocopy of Figure 1.9, mark the approximate position of the α-globin gene
cluster.
Exercise 1.10 From Figure 1.10, estimate the number of genes per base pair in the globin region.
Exercise 1.11 On a photocopy of Figure 1.14, mark with highlighters the regions shown in
Figure 1.13, in the same colours as in Figure 1.13.
Exercise 1.12 On a photocopy of the amino acid sequence of the human haemoglobin α chain in
Figure 1.15(b), indicate the regions arising from the three exons in the gene.
Exercise 1.13 What are the symmetries of the structure of the haemoglobin molecule? (See
Figure 1.16.) If the α and β subunits were identical, what additional symmetries would there be?
Exercise 1.14 On a photocopy of the sequence of the Tn3 transposon of E. coli (Box 1.6), indicate
the extents of the inverted terminal repeats.
Exercise 1.15 How might genome sequencing be applied to the conservation of endangered
species?

Problems
Problem 1.1 Draw a rough sketch of the outline of a small eukaryotic plant cell (representing
10 000 nm diameter) to fill almost a whole page. Within this outline, draw in, at scale, the
cellular components listed in the table. At the same scale, draw an E. coli cell (approximate
diameter 2000 nm).

Globular protein diameter       4 nm
Cell membrane thickness        10 nm
Ribosome                       11 nm
Large virus                   100 nm
Mitochondrion                3000 nm
Length of chloroplast        5000 nm
Cell nucleus                 6000 nm

Problem 1.2 From the following data, compute the number of genes per Mb for mitochondria,
rickettsia, chloroplasts, and cyanobacteria. Compare the mitochondria with the rickettsia and
the chloroplasts with the cyanobacteria. Is the smaller genome size of the organelles largely the
result of eliminating non-functional DNA, or loss of genes, some of which were transferred to
the nuclear genome?

Genome                              Number of bp    Number of genes
Human mitochondrion                       16 569                 37
Rickettsia prowazekii                  1 111 523                834
Arabidopsis thaliana chloroplast         154 478                128
Synechocystis sp. PCC 6803             3 573 470              ∼3500
  (plus plasmids)

Problem 1.3 It is estimated that the human immune system synthesizes 10⁸–10¹⁰ antibodies.
The portion of a typical antibody active in binding antigen consists of two variable domains each
containing 100 amino acids. If every variable domain were separately encoded in the genome but
could pair promiscuously, then the number of variable domains that needed to be encoded would
be of order of magnitude the square root of the total number of antibodies. How many bases
would be required to encode separately all the variable domains? Compare this with the size of
the human genome.
Problem 1.4 Human haemoglobin α-subunits have helices A, B, C, E, F′, F, G, and H. Human
haemoglobin β-subunits have helices A, B, C, D, E, F′, F, G, and H. The sites of interaction of the
histidine residues with the iron in the haem group are in the F and G helices. In Figure 1.16, in
what colours do the α-subunits appear and in what colours do the β-subunits appear?
Problem 1.5 Outline how you would search for tRNA genes in a genome sequence. Assume that
the lengths of the double-helical regions and the lengths of the single-stranded regions between
them vary within relatively tight limits and that there are only a limited number of deviations from
perfect base pairing in the helical regions. How would your method have to be modified if tRNA
genes contained introns?
Problem 1.6 In Figure 1.17(a), (a) find five positions in which the amino acid is the same in all
haemoglobin chains but different in myoglobin. (b) Find four positions at which human and horse
α chains have the same amino acid, human and horse β chains have the same amino acid, and
myoglobin has a different amino acid. (c) Find two positions at which myoglobin and human
and horse β chains have the same amino acid, but the α chains have a different one. (d) Classify
the following sequence. Is it a haemoglobin α chain, a haemoglobin β chain, or a myoglobin?

MGLSDAEWQLVLNVWGKVEADIPGHGQDVLIRLFKGHPETLEKFDRFKHLKTEDEMKASE
DLKKHGTTVLTALGGILKKKGQHEAEIQPLAQSHATKHKIPVKYLEFISEAIIQVIQSKH
SGDFGADAQGAMSKALELFRNDIAAKYKELGFQG
Problem 1.7 The annual number of births in the UK is approximately 700 000. Two ‘back-of-the-envelope
calculations’: (a) If the goal of a US$1000 genome is attained, what would be the total cost of
sequencing 700 000 babies? The current annual budget of the UK National Health Service is
£120 000 000 000. What percentage of the current NHS budget would be required to sequence
the genomes of all babies born in the UK? (Whether the money would be available, and, if it
were, whether sequencing would be the best use of it, are separate questions.) (b) If the sequence
data were stored completely and independently, by what percentage would they increase the
current holdings of the Nucleotide Sequence Databanks?
Problem 1.8 Suppose that you knew your personal DNA fingerprint in the format stored in
national DNA databases. Assume that the information did not reveal that you were unusually
susceptible to any known disease. What are the arguments for and against your voluntarily
depositing these data in the national DNA database of the country of your residence? What
conditions would you want to impose on the use of the data?
Problem 1.9 Outline the arguments for and against the use of animals for testing in drug
development. What restrictions might be imposed that would mitigate any of the arguments
against use of animals in testing? How and to what extent would the effectiveness of drug
development suffer if these restrictions came into force?

Weblems
Weblem 1.1 At the time of preparing the manuscript of this book, the sequencing capacity of the
largest dedicated genomics institute was estimated as 3.2 × 10¹² bp per day. This corresponds to
1000 human genome equivalents per day. What is the corresponding current maximal sequencing
capacity?
Weblem 1.2 For each of the following traits of any human being, estimate the contributions of
(1) sequences of particular genes, (2) life history and environment, (3) epigenetic factors: (a) blood
type; (b) adult height; (c) native language; (d) whether or not a person will develop Huntington’s
disease; (e) whether or not a person will develop emphysema; (f) whether or not a person will
develop Angelman syndrome.
Weblem 1.3 Which human chromosomes contain genes for ubiquitin?
Weblem 1.4 The gene for insulin is 1.7 kb long; the LDL receptor gene is 5.45 kb; and the
dystrophin gene is 2400 kb. (a) What are the lengths of the amino acid sequences of these
proteins? (b) What are the ratios of total exon length/total gene length and total intron length/
total gene length for these three genes?
CHAPTER 2

Genomes are the Hub of Biology

LEARNING GOALS

• Recognizing that genomics has transformed our approaches to all the classical topics of biology
and medicine.
• Appreciating that despite the individual variation among genomes within and among
populations of humans, other animals, and plants, the idea that species are discrete entities
is still valid. For prokaryotes, the situation is somewhat murkier.
• Adding to Crick’s central dogma – DNA makes RNA makes protein – recognition of the
necessity for regulating transcription, translation, and protein activity, for the health of cells
and organisms.
• Distinguishing the static components of genomes – the full nucleotide sequences that appear
in databases – from the dynamic aspects involved in the responsiveness of cells to internal and
external signals, and in the programmes of development seen in higher organisms.
• Knowing the different types of control mechanisms in organisms, and their points of application,
including but not limited to regulation of activities of proteins and control over gene expression.
• Realizing the importance of comparative genomics in revealing conserved elements that are
likely to have interesting functions.
• Knowing the different mechanisms by which mutations can affect human health.
• Being familiar with a number of important diseases with genetic components, and knowing
which ones can be treated, and for which ones lifestyle adjustments can reduce the danger
inherent in genetic risk factors.
• Recognizing that the roster of species has been in flux. Mass extinctions have characterized the
history of life. We are currently in another period of mass extinction.
Individuals, populations, the biosphere: past, present, and future

A genome sequence belongs to an individual organism. But, if the goal is to ‘see life clearly and
to see it whole’ it is essential to relate genomes to one another. Genome comparisons are necessary
across both space and time. We analyse differences in genomes within and among species that are
currently extant, or that are accessible from sequencing preserved specimens of extinct species.
Within species, we study genetic variation within and between populations. In humans, the
intraspecific variation reveals the history of our origin and dispersal, and how we have adapted to
different environments and lifestyles. An example is the development of adult lactose tolerance, a
concomitant of the dietary change associated with cattle domestication. The original trait, loss of
the ability to digest lactose after infancy – at the time of weaning – persists in many populations.
For these populations, alternative sources of calcium include fermented dairy products such as
cheese or yoghurt – bacteria hydrolyse the lactose – or soybean. (It is said that ‘The soybean is
the cow of Asia’.)

Comparing genomes between species can reveal their relationships and how they achieve their
similarities and differences. We share 96% of our genomes with our nearest relatives, the
chimpanzees, and many of our proteins are identical. This small amount of genomic change must
account for the differences between humans and chimps. Indeed, genomics is essential to
understanding and drawing species boundaries. As useful a concept as a species may be, there is
consensus that its definition is a very tricky problem.

Genomes also contain records of their evolutionary history, which we can try to reconstruct. The
organisms that populate the Earth are the product not only of the response to an imposed geological
environment, but very largely show effects of interactions among individuals and species. There are
many obvious examples of competition among members of the same species, and attack and defence
between species. Our struggle against pathogenic viruses and bacteria is a salient example.
Moreover, in the longer term the geological environment has in crucial respects been imposed by
life. The early organisms that released oxygen through photosynthesis changed the composition of
the atmosphere. This made possible the development of aerobic metabolism. The Earth has been the
theatre of an enormous network of such interactions.

A feature of the history of life is the origin of new species and the extinction of many others. In
some cases, extinctions arise from external events – for instance, an asteroid impact. In others,
extinction is the result of deliberate activity – sailors killing the dodos of Mauritius, or
twentieth-century scientists exterminating the smallpox virus in the wild. Today, habitat
destruction threatens many other species, often an inadvertent result of human activity. To some
threatened extinctions we can however plead not guilty: examples include the mollicute infection
(‘elm yellows’) which is devastating the elm trees of the eastern United States and southern
Ontario (Figure 2.1), the fungal ‘white-nose disease’ threatening bats in eastern North America,
and devil facial tumour disease, a transmissible cancer in Tasmanian devils (Sarcophilus harrisii),
believed to have originated as a single mutation in a single individual.

Figure 2.1 Elm tree infected with elm yellows.
Pennsylvania Department of Conservation and Natural Resources – Forestry Archive, Bugwood.org

We can study the present as thoroughly as we wish, and the past as extensively as we can. What of
the future? One thing in which we can have confidence is that molecular biology will give us greater
control over living things, including but not limited to ourselves. Clinical applications, of
genomics and of other fields, have the potential to enhance the health of humans, and of animals and
plants. It is already possible to intercede against some genetic diseases, by replacement of
dysfunctional proteins through direct administration; for example, insulin therapy for diabetes and
blood clotting factors for haemophilia. At the forefront of clinical developments are methods of
rectification of mutant genes, through gene delivery by viruses (X-linked adrenoleukodystrophy), or
by introduction of functional genes through stem cells (α1-antitrypsin deficiency).

Genomes are now central to all of academic biology and clinical applications. In this chapter we
shall emphasize the integrality of the relationships between genomes and other, formerly
independent, areas of science. We shall begin with a cell, and then widen our point of view
successively to organisms, populations, species, and the biosphere as a whole.

The central dogma, and peripheral ones

In 1958, F. Crick proposed the ‘central dogma’ of molecular biology:

DNA makes RNA makes Protein.

Like many other dogmas, it provides important insights, even if it is not the whole story.

A less concise statement of the dogma would be: DNA sequences govern the synthesis of nucleotide
sequences of RNA (transcription), which in turn govern the synthesis of amino acid sequences of
proteins (translation). In his formulation, Crick emphasized the one-way flow of information: that
amino acid sequences do not dictate the synthesis of nucleotide sequences of RNAs – this is still
true – and that RNA sequences are not transcribed into DNA sequences. We now know that some viruses
can ‘reverse transcribe’ RNA sequences into DNA.

Also sacrificed to concision in Crick’s statement of the dogma is the question of regulation and
control. In a healthy cell, the traffic through the dense network of metabolic pathways is
coordinated, so that no intermediates build up to unwanted levels, and adequate amounts of products
are created when and where they are needed. This involves regulation both of the activities of
molecules already present in the cell, and control of synthesis of proteins and RNAs. An example of
a mechanism of control over the activity of enzymes present in the cell would be feedback
inhibition, in which the product of a concatenation of metabolic reactions inhibits an enzyme
catalysing a step early in the sequence. Many other mechanisms exist to regulate the concentrations
of enzymes and other proteins, by control over levels of transcription of the genes that encode
them.

Thus cells contain two parallel networks: an enzymatic network manipulating metabolites, and a
regulatory network manipulating genes, transcripts, and proteins.

• The central dogma – DNA makes RNA makes protein – describes not only a series of molecular events
but a direction of information transfer. Information sensing and response are also essential for
integration and coordination of the activities and expression patterns of many molecules. In this
way, cells can achieve stability and robustness.
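The one-way flow of information that the dogma describes can be made concrete with a short Python sketch. It is an illustration only, not part of the original text: the function names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding (sense) strand so that transcription reduces to replacing T by U, and the standard genetic code is assumed. Regulation, splicing, and reverse transcription are deliberately left out.

```python
# Minimal sketch of DNA -> RNA -> protein, assuming the input is the coding
# (sense) strand and that the standard genetic code applies.
BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA that has the same sequence as the coding strand (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("ATGGCCAAGTAA")
protein = translate(mrna)
print(mrna, protein)
```

Note that nothing in the sketch maps a protein sequence back to a nucleotide sequence, which is the asymmetry Crick emphasized.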

Expression patterns

The ∼23 000 protein-coding genes in the human genome, exclusive of the immune system, can give rise
to at least as many proteins – many more, if splice variants are taken into account. However, a
healthy cell must synthesize only those proteins necessary for the requirements of its physiological
state, and its differentiated type. Regulation of expression is common to all cellular life forms
and even to some viruses. In prokaryotes, for instance, Escherichia coli will show robust expression
of the genes in the lac operon only if growing on a medium containing lactose. Human cells in
different organs will synthesize proteins appropriate for their cell types.

Not only must each cell choose which proteins to synthesize; it must control the rate of production
of each. A variety of mechanisms achieve this control. Some involve the binding of proteins to
specific DNA sequences, to control protein synthesis at the level of transcription.

• The genome sets the parameters that circumscribe a potential life. Some constraints are tight,
others relatively loose. Conversely, life implements the genome, in the form of synthesis of RNAs
and proteins, and in the regulatory mechanisms that keep cellular activities stable and robust.

Regulation of gene expression

Living things must regulate the synthesis of proteins encoded in their genomes. Transcription and
translation must be dynamic – to produce the right amount of the right protein at the right time at
the right place. In this way, cells can respond to stimuli by altering their physiological state, or
even their physical form. The driving force for these changes in profile of proteins produced may be
changes in the environment, or internal signals directing different stages of the cell cycle or
developmental programmes. We have mentioned that the appearance of lactose in the medium can trigger
transcription of the lactose operon in E. coli. Transcription of another operon, encoding enzymes
for the biosynthesis of the amino acid tryptophan, will be repressed if tryptophan is present in
adequate concentrations. Similarly, a human cell may differentiate into a neuron, sprouting
dendrites and an axon, and synthesizing tissue-specific or even cell-specific proteins.

The central dogma of DNA → RNA → protein suggests several possible leverage points for regulation of
protein expression.

Most control takes place at the level of transcription

In prokaryotes, a specific focus of transcriptional regulation is at or near the binding site of RNA
polymerases to DNA, just upstream of (5′ to) the beginning of the gene. Repressors can turn off
transcription by occluding the binding site, blocking polymerase activity. In contrast, activators
can actively recruit polymerases through cooperative binding, along with polymerase, to a site on
the DNA.

Gene regulation in eukaryotes is more complex. Transcription regulators bind to DNA at positions
proximal to the gene as in prokaryotes, but also at remote sites. The control of human β-globin
expression illustrates such a scheme (see Box 2.1). Regulatory interactions also govern the
expression of other transcription factors. The resulting control networks show far greater
complexity, in both their logic and their dynamics, than those of viruses or prokaryotes.

The gift of complexity is robustness. Eukaryotic control networks show an ability to reprogram
themselves and to respond to stimuli by changing cell state. The source of robustness appears to be
redundancy. Yeast (Saccharomyces cerevisiae) has about 6000 genes. Under ‘normal’ – non-stress –
conditions, about 80% of them are being expressed. It is also true that yeast can survive
approximately 80% of single-gene knockouts. (It would be interesting to know the overlap of these
sets! Because the two fractions together exceed 100%, at least about 60% of the genes must belong to
both.) Many expressed genes must be redundant, and redundancy provides robustness. (Gene regulatory
networks are the subject of Chapter 11.)

Other mechanisms of transcription regulation in eukaryotes involve changes in patterns of
methylation of DNA associated with changes in the structure of chromatin. Eukaryotic chromosomes
contain complexes of DNA with histones (see Figure 1.2). Chromatin remodelling is an important
mechanism of transcriptional control. Reversible chemical modification of histones, by a mellifluous
variety of reactions including deacetylation, methylation, decarboxylation, phosphorylation,
ubiquitinylation, and sumoylation, leads to alterations of the DNA–histone interactions that render
transcription-initiation sites more or less accessible.

In differentiation, DNA methylation is a regulatory mechanism that survives cell division (see
Box 2.2). Methylation of cytosine in CpG islands silences the adjacent genes, possibly by
stimulating chromatin remodelling. When a cell divides, enzymes copy the methylation patterns,
preserving the settings of the regulatory switches.

• CpG islands are regions of high GC content, rich in the dinucleotide sequence CG, that appear at
the 5′ ends of vertebrate genes. Methylation of C residues silences genes.
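The description of CpG islands as GC-rich regions enriched in the dinucleotide CG can be turned into a simple numerical test. The sketch below is illustrative only: the function names are hypothetical, and the thresholds (length of at least 200 bp, GC content of at least 50%, observed/expected CpG ratio of at least 0.6) are assumptions of the kind commonly used in the literature, not values taken from this chapter.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed CG dinucleotides divided by the number expected from base composition."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (n_c * n_g)


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the three assumed thresholds; the defaults are illustrative, not canonical."""
    return (len(seq) >= min_len
            and gc_content(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_ratio)


cpg_rich = "CG" * 100   # GC-rich and full of CG dinucleotides
cpg_poor = "CAG" * 70   # GC-rich but with no CG dinucleotide at all
print(looks_like_cpg_island(cpg_rich), looks_like_cpg_island(cpg_poor))
```

The second toy sequence shows why GC content alone is not enough: it is two-thirds G and C yet contains no CG dinucleotide, so it fails the ratio test.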
BOX 2.1 Control of β-globin gene expression

The protein-coding genes of the globin loci interact with many control regions. The β-globin region
includes promoters proximal to individual genes; a locus control region occupying a region between
6 and 22 kb upstream of the most 5′ gene (ε); a 250 bp pyrimidine-rich region 5′ to the δ gene (YR);
and enhancer regions, which may appear on the same chromosome, in some cases near to and in other
cases distant from the gene they control, or on entirely different chromosomes.

Control of globin gene expression is exerted for both tissue specificity and developmental
progression. In humans, transcription of the β-globin region switches from the embryonic ε-globin in
the yolk sac to the foetal γ-globins in the liver (Gγ and Aγ) and finally to the adult β-globin
produced in cells derived from bone marrow.

An essential mechanism of control is modulation of local chromatin conformation. Chromosomes contain
nucleoprotein complexes called chromatin. DNA is associated with proteins called histones
(Figure 1.2). Chromatin conformational changes can be induced by covalent modification of histones
and by binding of chromatin-remodelling proteins. Differential sensitivity of sites to DNAse I
digestion measures differences in exposure. Some regions near actively transcribed genes are
hypersensitive to DNAse I digestion (see Figure 2.2a). The locus control region upstream of the
β-globin locus consists of five hypersensitive regions. The degree of their exposure is correlated
with transcriptional activity.

Regulation of β-globin expression involves interaction of the locus control region with proximal
promoters associated with specific genes. The interactions are mediated by a large complex of
proteins recruited to the site. The alternative interactions of the locus control region, with
foetal- and with adult-expressed genes, suggest that expression is determined by a competition
between these interactions (see Figure 2.2b,c).

A well-known enhancer of globin expression is erythropoietin, a glycoprotein hormone encoded on
chromosome 7. Erythropoietin does not interact with sequences in the vicinity of the globin locus
but works indirectly by activating intracellular signalling pathways by binding to a receptor.
Erythropoietin expression is sensitive to oxygen tension. Hypoxia increases erythropoietin
production. This occurs naturally at high altitudes; as a result, people who live at 2500 m above
sea level have about 12% more haemoglobin than people who live at sea level. Athletes take advantage
of this. About 3 months of adaptation time is necessary to build up this differential.

A surprising player in the game of globin expression regulation is acetylcholinesterase, most famous
for its physiological role in neural synapses and neuromuscular junctions. Acetylcholinesterase also
regulates globin synthesis.

Figure 2.2 (a) β-Globin region showing protein-coding genes (ε, Gγ, Aγ, δ, β) and locus control
region, consisting of five DNAse hypersensitive segments. Pseudogene ψβ not shown. (b,c) Schematic
structural model for the control of globin gene expression. Part (b) shows the foetal structure,
with interaction between the locus control region and the Gγ and Aγ genes, mediated by proteins
(green). Blue circles indicate chromatin-remodelling complexes. In this configuration, the Gγ and Aγ
genes will be expressed. In part (c), the cyan circle indicates the PYR complex of proteins, which
binds to the pyrimidine-rich (YR; Y stands for pyrimidine) region just 5′ to the δ gene. This
binding reconfigures the system: PYR blocks the foetal mode of interaction of the locus control
region with the γ region (b). Instead, the locus control region interacts with, and promotes the
expression of, the β gene.
Adapted from: Bank, A. (2005). Understanding globin regulation in β-thalassemia: it’s as simple as
α, β, γ, δ. J. Clin. Invest. 115, 1470–1473.
BOX 2.2 Mammalian females are X-chromosome-silenced mosaics

An example of gene silencing by DNA methylation is the formation of Barr bodies in female mammals.
Cells of mammalian females (except for oocytes) have two X chromosomes. The product of the Xist gene
(an RNA molecule) on one of the X chromosomes inactivates that entire chromosome and causes it to
form a compact, transcriptionally inert object, called a Barr body. The Xist gene on the other X
chromosome is inactivated by cytosine methylation, leaving that chromosome normal in structure and
activity. As a result, cells of both males and females have one active X chromosome. This is the
mammalian solution of the ‘dosage compensation’ problem that arises because the genomes of males and
females contain different numbers of copies of X chromosome genes.

In human females, and females of other placental mammals, each cell chooses at random which X
chromosome to inactivate. Most mammalian females are, therefore, mosaics of cells expressing genes
from alternative X chromosomes. A visible example of this mosaicity is a calico cat, necessarily a
female (see Figure 2.3b). A calico cat has a yellow-coat allele on the X chromosome inherited from
one parent and a black-coat allele on the other X chromosome. (In the white patches of the coat,
neither allele is expressed.) The size of the coloured patches on the coat reveals when the genes
were inactivated. In contrast, in female marsupials, all cells inactivate the paternally derived X
chromosome. (This is an example of the general phenomenon of genetic imprinting, the dependence of
phenotype on the parental origin of a gene.)

A difficulty in cloning of higher animals is how to restore pluripotency to the single cell from
which the animal will develop. Figure 2.3 shows (a) a cloned cat, named Cc (for Copycat), and (b) its
source organism (very loosely, its mother), Rainbow, a calico cat. Although the two cats are
genetically identical as far as the nucleotide sequences of their DNA are concerned, Rainbow is a
mosaic and Cc is not. The cell that produced Cc had an inactivated X chromosome and this
inactivation was replicated in all of Cc’s cells.

Figure 2.3 (a) Cc – or Copy Cat – a cloned cat. (b) Rainbow – a calico cat, the source organism for
Cc. The varied-colour appearance is reminiscent of Indian maize (Figure 1.5), and indeed there are
some similarities in the mechanisms that created them.
Photos reproduced courtesy of The College of Veterinary Medicine & Biomedical Sciences, Texas A&M
University.
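The link between random X inactivation, its inheritance through cell division, and the size of the resulting patches can be illustrated with a toy Python model. This is a deliberately crude sketch, not a description of real development: the function names, the labels 'M' and 'P' for the maternally and paternally derived X chromosomes, and the assumption that every cell divides synchronously are all hypothetical simplifications.

```python
import random


def inactivate_founders(n_founders: int, rng: random.Random) -> list:
    """Each founder cell independently silences its maternal ('M') or paternal ('P') X."""
    return [rng.choice("MP") for _ in range(n_founders)]


def grow_patches(founders: list, divisions: int) -> list:
    """Daughter cells copy the silenced state, so each founder yields one uniform clone."""
    return [[state] * (2 ** divisions) for state in founders]


rng = random.Random(0)                      # fixed seed so the run is repeatable
founders = inactivate_founders(8, rng)      # the random choice, made once per founder
early = grow_patches(founders, divisions=6) # inactivation early: many divisions follow
late = grow_patches(founders, divisions=2)  # inactivation late: few divisions follow
print(len(early[0]), len(late[0]))
```

In this model an earlier inactivation leaves more subsequent divisions and therefore larger uniform patches, which is the sense in which patch size records the timing of inactivation. A clone grown from a single already-inactivated cell, as in the case of Cc, corresponds to a single founder and hence a single patch.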

Some mechanisms of regulation act at the level of translation

Antisense RNA will form a double helix with mRNA, and block translation. (Antisense RNA is
single-stranded RNA complementary to mRNA.)

Introduction of genes for antisense RNA can silence genes. For example, ethylene is a plant hormone
that stimulates ripening. Genetic modification – a controversial activity in the context of
agriculture – has produced a tomato with longer shelf life. An artificial gene in the ‘Flavr Savr’
tomato is transcribed to an antisense RNA that greatly reduces translation of a gene involved in
ethylene synthesis, delaying ripening.
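Complementarity between an mRNA and its antisense RNA is a matter of Watson–Crick pairing between antiparallel strands, which a few lines of Python can make explicit. The sketch is illustrative only: the function names are hypothetical, sequences are written 5′ to 3′, and only the standard A–U and G–C pairs are considered.

```python
# Watson-Crick partners in the RNA alphabet.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def antisense(mrna: str) -> str:
    """Return the antisense strand of an mRNA, written 5' to 3' (reverse complement)."""
    return mrna.upper().translate(RNA_COMPLEMENT)[::-1]


def is_fully_paired(mrna: str, other: str) -> bool:
    """True if `other` would pair with `mrna` along its whole length."""
    return other.upper() == antisense(mrna)


target = "AUGGCCAAG"
blocker = antisense(target)
print(blocker, is_fully_paired(target, blocker))
```

Taking the antisense of the antisense strand returns the original sequence, which is why a gene transcribed from the opposite strand yields an RNA able to pair with the normal message.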
In RNA interference, a short stretch of double-stranded RNA (∼20 bp) elicits degradation, by a
ribonucleoprotein complex, of mRNA complementary to either of the strands. RNA interference may have
a natural function in defence against viruses. It has been applied in the laboratory to achieve
effective gene knockouts in studies aimed at deducing gene functions.

Another mechanism of translation control is the attachment of ligands to the Shine–Dalgarno sequence
in mRNA, preventing the RNA from binding to the ribosome. In E. coli, vitamins B1 (thiamin) and B12
(adenosylcobalamin) bind to an mRNA containing transcripts of genes encoding proteins involved in
their biosynthesis. This is a kind of feedback inhibition – adequate amounts of a product inhibit
the synthesis of more. However, unlike the more familiar product inhibition of an enzyme, this
control is applied at the level of RNA, rather than protein.

Different modes of transcriptional control vary in their dependence on external conditions and all
are reversible – some more readily than others. The lactose operon control in E. coli is a ‘toggle’
switch that can respond to both the appearance of lactose and its subsequent exhaustion. Other
control processes are cyclic, such as cell-cycle control and diurnal rhythms. Cellular
differentiation in higher eukaryotes is usually irreversible, except in certain forms of cancer.

Translation of eukaryotic genes may produce several different splice variants. Regulation of
translation in eukaryotes may affect the distribution of splice variants produced. Mechanisms
include (1) degradation of specific mRNA variants by microRNAs (miRNAs) and siRNAs, which may
repress translation of specific splice variants; and (2) splicing factors, RNA-binding proteins
that interact with the transcripts of specific exons (or even introns) and interact with the
splicing machinery to direct maturation of the mRNA. At the transcriptional level, chromatin
remodelling may render certain exons inaccessible and affect the splice variant expressed.

Some regulatory mechanisms affect protein activity

After expression of proteins, cells can modulate the activities of proteins by post-translational
modifications, some of which are reversible. Binding of ligands can affect protein activities,
through allosteric changes for example.

Cells regulate the activities of their proteins by mechanisms applied at the levels of
transcription, translation, post-translational modification of proteins, and response of proteins
to ligation (see Figure 2.4). Although these processes are biochemically distinct, cells apply them
in coordinated ways. Chemical modifications of transcription factors can quantitatively regulate
amounts of proteins in cells, rather than simply switching transcription on and off. Different
control processes are effective over different time scales: cells must sometimes react quickly, to
threat or stress; at other times rhythmically, over cell cycles; and sometimes in programmes
unfolding over years or even decades during development of an organism. It is not possible to sort
the different control mechanisms into fast- and slow-acting categories. There are layers of
complexity here that we are only beginning to understand.

• There are many mechanisms, and types of targets, for regulation. These include control over
expression patterns, and control over the activities of proteins and non-coding RNAs in the cell.
For instance, allosteric changes are ligand-induced conformational changes in proteins that modify
activity, often leading to cooperative binding curves, as in haemoglobin.

Figure 2.4 Steps in protein expression subject to control.
DNA → (transcription) → RNA → (translation) → expressed protein → (post-translational modification)
→ modified protein.
Transcription: gene copy number; promoter activity; repression/attenuation; induction; DNA
methylation/chromatin remodelling.
Translation: mRNA lifetime; codon usage, tRNA levels; ribosome binding; alternative splicing; RNA
interference.
Post-translational modification: protein turnover; chemical modification (phosphorylation, etc.);
inhibition; allosteric change.
Proteomics in their developmental toolkits – that is, sets of genes

active in guiding development. For example, HOX
Proteins are the executive branch of the cell. Some
genes are responsible for organization of anterior–
proteins are structural, such as the keratins that
posterior (head-to-tail) patterning in the body plans
form our hair and the outer horny layer of our skin.
of flies and humans, and even C. elegans. The human
Some are catalytic: the enzymes that catalyse meta-
Paired Box gene PAX6 is required for proper eye
bolic reactions. Others are involved in regulation,
development, but if expressed in Drosophila can
including but not limited to the proteins that bind
transform embryonic wing tissue to an ectopic (=
DNA to control expression. Some are involved
out-of-place) eye. The implication is that despite the
in signal transduction, receiving signals at the cell
very great differences in gross anatomy of the eyes in
surface, and transmitting the signal to regulatory
vertebrates, insects, and octopus, the visual systems
proteins.
arose from a common ancestor. Both the molecular
structures of the initial light receptors – the rhodop-
• Think of intermediary metabolism, catalysed by
sins – and the architecture of the neural pathways
enzymes, as the ‘smokestack industries’ of the cell, and confirm this conclusion.
the regulatory systems, including signal transduction Full-genome sequences allow the tracking of simi-
and expression control, as the ‘silicon valley’. larities and differences in developmental processes
during evolution. In particular, it is possible to iden-
tify homologues of genes from the developmental
To carry out such a wide variety of functions, toolkit across different phyla. The results both make
proteins show a great diversity of three-dimensional use of, and illuminate, phylogenetic relationships.
conformations. This diversity in structure and func- This can be thought of as an extrapolation, to the
tion is nevertheless compatible with many common structural features.

All proteins are polymers of amino acids. There is a common repetitive mainchain, with sidechains attached at regular positions. Each sidechain is one of a canonical set of 20, specified by the genetic code (see Box 1.1). At least two other amino acids naturally extend the genetic code in rare cases, and many proteins are subject to post-translational modifications whereby the sidechains are modified by a variety of rearrangements or substitutions. Phosphorylation of specific sidechains is a common mechanism for regulating protein activity.

Proteomics is the subject of Chapter 10.

Genomics and developmental biology

As they emerge from sequencing machines and assemblers, genomes are static data sets. Their implementation in cells is dynamic: genomes contain developmental programmes governing expression patterns of genes at different life stages, and reactions and responses to environmental stimuli.

Just as different species show very different body plans despite high similarities in their genes and proteins, different taxa can also show high similarities

molecular level, of the classical relationship between embryology and taxonomy.

• Correct taxonomic assignment of species with unusual features can reveal not only phylogenetic relationships, but developmental ones also. The phylum cnidaria contains almost 10 000 species of aquatic animals. Sea anemones and jellyfish are typical examples (Figure 2.5a,b). The worm Buddenbrockia plumatellae (Figure 2.5c) was until recently an enigma. Its body plan did not strikingly resemble the familiar radially symmetric cnidaria such as sea anemones or jellyfish; it might easily be thought to be related to nematodes. However, Holland and co-workers have recently shown, on the basis of sequence alignments of 129 proteins, that B. plumatellae is a cnidarian. The significance of this observation for developmental biology is that it extends the body plan observed for cnidaria, requiring investigation of unsuspected aspects of the developmental pathways in this species.
• HOX genes are a classic example of how genomes illuminate developmental biology, both within a species, and among species (‘evo-devo’). Organisms with bilateral symmetry, including insects and
Expression patterns 49

vertebrates, contain HOX genes, which encode a family of DNA-binding proteins. The expression of these genes varies along the anterior–posterior (head-to-tail) body axis, and controls the setting out of the body plan. HOX genes have overlapping domains of expression within the body. Different regions of the embryo develop into anatomically distinct regions of the adult, based on the subset of HOX genes expressed.

Indeed there is a fascinating mapping between (1) the order of the genes on the chromosome, (2) the relative times during development of the onset of their activity, and (3) the order of their action along the body. (The β-globin locus shares the first two of these but not the third (see p. 45).)
HOX genes reveal the duplications that have
occurred during vertebrate evolution. Insects and
amphioxus have a single HOX cluster. Humans
have four HOX clusters. Zebrafish have seven HOX
clusters, interpretable as a series of duplications:
1→2→4→8 followed by loss of one to reduce 8→7.
The HOX genes illustrate the conservation of the
developmental toolkit in species with very different
body plans.
• Conversely, the distribution of DNA methylases illustrates the diversification of the developmental toolkit even in organisms with similar body plans.

DNA methylation patterns are important signals for transcription control in vertebrate development and tissue differentiation. In contrast, many invertebrates show little or no DNA methylation. Examples include C. elegans and D. melanogaster, the first two invertebrate genomes sequenced. These observations would suggest that DNA methylation arose in the vertebrate lineage. However, although C. elegans lacks genes for DNA methylases, and D. melanogaster has but an incomplete complement, the genome sequences of related nematodes and insects show that the genes were present in invertebrates, and some or all of them have been lost in the specific lineages leading to C. elegans and D. melanogaster. For instance, the honeybee contains a full, functional complement of DNA methylase genes, and its DNA is methylated. However, it appears that the invertebrates and vertebrates differ in the pattern and function of DNA methylation.

Figure 2.5 (a) Sea anemone, Anthopleura sola, showing the typical radially symmetric cnidarian body plan. (Photograph by Charles Halloran) (b) Jellyfish, photographed by David Burdick in the Mariana Islands. (from NOAA Photo Library, image reef1133, National Oceanic and Atmospheric Administration/US Department of Commerce). (c) Scanning electron microscopy image of a Buddenbrockia plumatellae. From: Jiménez-Guri, E., Philippe, H., Okamura, B., Holland, P.W. (2007). Buddenbrockia is a cnidarian worm. Science 317, 116–118.
50 2 Genomes are the Hub of Biology

• The mapping from genome to proteins is a static

correspondence. But in fact, implementation of the
genome is a dynamic process, both (1) in the short
term, as the response to internal and external stimuli,
and (2) in the long term, as the unfolding of pro-
grammes of development and differentiation. Evolution
illuminates both the static and dynamic aspects of the
implementation of the genome.

Genes and minds: neurogenomics

Higher mental processes unique to humans are the last major unexplored territory in our understanding of life. Mental illness is a major cause of disability in Europe and the USA, but it has been difficult to integrate human mental achievements and diseases into mainstream biology and medicine. This is partly because the physical correlates of mental events are so complex and partly because of a clinical tradition of treating mental disease by conversation between therapist and patient, and by counselling. That tradition, with its origin in the pervasive influence of Freud, was based on the assumption that the causes of mental disease were either emotional trauma alone or emotional consequences of physical trauma. Even though we now know that the genetic component of neuropsychiatric disease is much more important than formerly suspected and that even environmentally caused neuropsychiatric diseases have biochemical effects, the fact remains that treatments alternative to counselling, based on a profound biological understanding, are not yet available. Drugs now play an important role in treating symptoms of neuropsychiatric disease. But Macbeth’s anguished cry, ‘Canst thou not minister to a mind diseased?’, is still largely an unmet challenge.

One approach to mental processes in humans is to work our way up from simpler nervous systems. This was Sydney Brenner’s motive for choosing C. elegans as a model organism. The adult hermaphrodite form has exactly 302 nerve cells and 7000 synapses (see Figure 2.6). J. White worked out the complete ‘wiring diagram’ by tracing individual cells through serial sections. We know not only the static structure and organization of the adult system, but the details of how it develops.

Figure 2.6 Top: the nervous system of C. elegans, head at left, labelled with green fluorescent protein. Bottom: the brain of C. elegans, with different groups of neurons labelled with different-coloured fluorescent proteins.
Reproduced by permission of Prof. H. Hutter.

Some people espouse this approach. (‘You have to walk before you can run.’) Others retort that human mental achievements and diseases go far beyond those of C. elegans. It is undeniable that understanding the minds of worms – or even of chimpanzees – must have limited applicability to humans. Nevertheless, many surprising similarities have appeared, even in such distantly related organisms.

What analogues of human mental phenomena do model organisms show? The basic mechanisms of sensation, such as vision and olfaction, are similar in many animals. Learning and memory are widespread throughout the animal kingdom. Flies learn; even worms learn. Language is one uniquely human attribute and its genetic component has been traced through clinical disorders. One might assume that ‘higher’ emotions and sophisticated talents – love, despair, the ability to play chess, or talent for art or music or science – would also be absent from simple model organisms, but in many cases tantalizing analogies do exist. Fruit flies show courtship behaviour under the control of known genes. The objection that fruit flies are merely displaying unsophisticated instinctive behaviour raises the question of how rational and sophisticated is most human activity, including courtship (especially courtship).

Genomics provides the bridge between the minds of the different species. Two ways to use model organisms are the study of homologues to illuminate functions of human proteins and the insertion of human genes into the model organisms to study the effects of their expression.

Figure 2.7 Alternative social behaviours of C. elegans, feeding on a lawn of agar in a Petri dish. Left: solitary feeding. Right: clumping.
The agar is slightly thicker at the edges of the dish, which is why the worms congregate there.
From: de Bono, M. & Bargmann, C.I. (1998). Natural variation in a neuropeptide Y receptor homolog modifies social behavior and food response in
C. elegans. Cell 94, 679–689.

Models for neurological disease, in natural or transgenic Drosophila and C. elegans, include Parkinson’s and Huntington’s disease, Friedreich’s ataxia (a degenerative disease of the nervous system), and early-onset dystonia (neuromuscular dysfunction producing sustained involuntary and repetitive muscle contractions or abnormal postures). Species more closely related to humans, including zebrafish and mice, provide models for most mental diseases.

Traditionally, there was a fairly rigid distinction between neurological/physiological/biochemical diseases affecting cognitive performance, and psychiatric diseases believed to have emotional causes and effects. However, we now recognize that even if an illness is caused by an unhealthy emotional environment (and leaving aside the genetic component of susceptibility to disease in response to such an environment), organic changes – down to the molecular level – underlie the psychological manifestations.

There is now good evidence for a substantial genetic component in numerous psychological conditions, including dyslexia, autism, schizophrenia, attention-deficit hyperactivity disorder, and others. A number of conditions appear in both familial forms, showing relatively simple patterns of inheritance, and sporadic forms, with more complex genetic components and greater influence of environment. Familial forms often show early onset.

For instance, familial Alzheimer’s disease is a relatively early-onset form of the condition, with early onset defined in this case as appearing before the age of 65. It affects about 10% of those with Alzheimer’s disease. The familial form is associated with mutations in genes on chromosomes 1, 14, and 21. In contrast, a mutation in the gene for ApoE on chromosome 19 causes increased risk of late-onset Alzheimer’s disease (see pp. 61–62).

Genetics of behaviour

• Different strains of C. elegans show different social behaviour in dining. C. elegans can be grown on an agar lawn in a Petri dish. The wild type collected from Australia will congregate in groups to feed. The wild type collected from Britain will eat separately (see Figure 2.7). The difference has been traced to a single amino acid change in a seven-transmembrane helix protein, NPR-1.

The unity of life is a theme of this book, but perhaps it would be wrong to read too much into this example.

• In fruit flies, T. Tully and co-workers have identified a gene associated with memory. CREB (cyclic AMP response element-binding protein) encodes a transcription factor, part of a large family of paralogues in mammals and distributed widely in eukaryotes and prokaryotes. Some engineered changes in CREB produced flies that could learn but not store memories; other changes produced flies that learned substantially faster than normal. Alterations in mouse CREB have produced memory impairment.

• The Lesch–Nyhan syndrome was the first correlation discovered between a specific human genetic defect and behavioural anomalies. It is an X-linked deficiency of a single protein, the enzyme hypoxanthine–guanine phosphoribosyltransferase. The consequent inability to metabolize uric acid properly leads to physical symptoms including gout and kidney stones. But patients also show poor muscle control, mental retardation, facial grimacing, writhing and repetitive limb movements, and uncontrollable lip and finger biting to the point of severe self-mutilation.
• Seasonal affective disorder, a mood change in response to prolonged darkness, is related to the regulation of circadian rhythms. Jet lag is a related condition. The system of genes involved in circadian rhythms was originally worked out in fruit flies and has homologues in many animals, including humans and C. elegans, and in plants.
• Mutations in an X-linked human gene, DLG3, cause severe learning disability. Knockout of the mouse homologue PSD95 produces learning-impaired mice. The protein encoded by PSD95 binds to the NMDA receptor, a protein involved in synaptic plasticity. Overexpression of NMDA receptor gives mice superior learning and memory abilities.
• Several genes are implicated in schizophrenia. A common effect, hypersensitivity to the neurotransmitter dopamine in the brain, plays a role in schizophrenia. Some antipsychotic drugs block dopamine receptors on neuronal surfaces; conversely, amphetamines, which release dopamine, aggravate the symptoms.
• Depression is a common response to stress. Although we all suffer stressful episodes in our lives, the development of debilitating mental disease depends on our alleles in the promoter region of the gene for the serotonin transporter (5-HTT). Individuals with one or two copies of the shorter allele of this gene are more likely to exhibit depression and suicidal tendencies than individuals homozygous for the longer allele.
• Maltreatment in childhood is a cause of antisocial behaviour. Obviously. However, the likelihood of development of antisocial behaviour depends on the allele for monoamine oxidase A (MAOA). Maltreated children with a genotype producing high expression levels of MAOA are less likely to become antisocial or violent offenders. A. Caspi and colleagues found that, although only 12% of a sample of people had the combination of low-activity MAOA genotype and childhood maltreatment, these individuals accounted for 44% of subsequent convictions for violent offences.

• Cognitive abilities are our species’ proudest achievement. Neuropsychiatric illness is a correspondingly difficult problem to understand and treat. Both have genetic components that are beginning to be understood, in part through study of analogues in other species.
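The MAOA figures quoted in the list above (12% of the sample, 44% of the convictions) can be turned into a rough measure of over-representation. The sketch below is only an illustration of the arithmetic: it treats the two percentages as exact, compares convictions per head inside and outside the group, and uses a function name that is ours, not the study's.

```python
def relative_conviction_rate(group_share, conviction_share):
    """Ratio of convictions per person inside the group to outside it.

    group_share: fraction of the sample in the group (0.12 in the text).
    conviction_share: fraction of convictions attributed to it (0.44).
    """
    per_head_inside = conviction_share / group_share
    per_head_outside = (1 - conviction_share) / (1 - group_share)
    return per_head_inside / per_head_outside


ratio = relative_conviction_rate(0.12, 0.44)
print(f"Convictions per person, group relative to rest: {ratio:.2f}")
```

On these assumptions the group carries several times more convictions per person than the rest of the sample; the calculation says nothing about causation or about individuals.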

Populations

An interacting group of individuals of the same species is a population. How do genomes vary in a population? Study of the variation reveals the population structure and its history. A population that has passed through a bottleneck, or that has developed in isolation from a small ‘founder’ group, will show very narrow variation. All cheetahs, for example, are as closely related as human siblings. A population with relatively high variation is likely to have a longer evolutionary history. It is such observations that suggest that humans originated in Africa.

Single-nucleotide polymorphisms (SNPs) and haplotypes

All people, except for identical siblings, have unique DNA sequences.

Comparisons between unrelated individuals reveal overall differences between whole-genome sequences of ∼0.1%. Any change in DNA sequence is a mutation, including substitutions, insertions and deletions, and translocations. Many of the differences between individuals have the form of individual isolated base
Populations 53

BOX 2.3 Guide to databases of SNPs and related databases

General SNP links                              http://www.snpforid.org/snpdata.html
NCBI dbSNP                                     http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp
The SNP Consortium                             http://snp.cshl.org/
HapMap                                         http://www.hapmap.org/
Applied Biosystems Assays-on-Demand            http://myscience.appliedbiosystems.com/cdsEntry/Form/assay_search_basic.jsp
Ensembl                                        http://www.ensembl.org/Homo_sapiens/
HGVBase                                        http://hgvbase.cgb.ki.se/
SeattleSNPs                                    https://gvs.gs.washington.edu/GVS/
dbSNP database at NCBI                         http://www.ncbi.nlm.nih.gov/snp
Human Gene Mutation Database                   http://www.hgmd.cf.ac.uk
OMIM (Online Mendelian Inheritance in Man)     http://www.ncbi.nlm.nih.gov/omim

substitutions, or single-nucleotide polymorphisms (SNPs). There are also many short deletions.

• Single-nucleotide polymorphisms (SNPs) are single-base variations between genomes.

Databases now contain over 100 000 mutations, in 3700 genes (see Box 2.3). This is 6.2% of the total ∼23 000 genes. Of course this number is growing rapidly; approximately 10 000 new mutations are discovered each year.

Each of us bears an accumulated collection of SNPs, reflecting mutations that occurred in our ancestors. Some constellations of SNPs are co-inherited as blocks. Others are not. Mutations in different DNA molecules of diploid chromosomes become separated within a single generation, by assortment. Mutations on the same chromosome become separated more slowly, by recombination. Haploid sequences, such as most of the human Y chromosome, or mitochondrial DNA, are not subject to recombination. Mutations in these sequences remain together.

Mutations in the same DNA molecule in diploid chromosomes will become unlinked by recombination events that occur between their loci. The greater the separation between two sites, the greater the frequency of recombination. However, recombination rates vary widely along the genome, by several orders of magnitude. SNPs on opposite sides of recombinational ‘hot spots’ are more likely to be separated in any generation. SNPs lying within recombination-poor (‘cold’) regions will tend to stay together.

In humans, many 100 kb regions tend to remain intact. They show the expected number of SNPs, but relatively few of the possible combinations. An average SNP density of 0.1%, or 1 SNP/kb, suggests ∼100 SNPs per 100 kb. The genome of any individual may possess, or may lack, each of them, giving a very large number ($2^{100}$) of possible combinations. However, many 100 kb regions show fewer than five combinations of SNPs. These discrete combinations of SNPs in recombination-poor regions define an individual’s haplotype, or ‘haploid genotype’.

Haplotypes provide a very economical characterization of entire genomes. They simplify the search for genes responsible for diseases, or any other phenotype–genotype correlations. For field biologists, including anthropologists, haplotypes permit detection of migratory and interbreeding patterns in populations.

• Haplotypes are local combinations of genetic polymorphisms that tend to be co-inherited.

In looking for genes responsible for diseases, or other phenotypic traits, haplotypes provide a magnifying glass. The goal is to correlate phenotype with

genome sequence. The target may be to identify one base out of $3 \times 10^{9}$. By correlating phenotype with haplotype, only enough sequence must be collected to localize the site to within the typical length of a haplotype block, perhaps ∼100 kb, containing only a few genes.

• Think of boundaries between haplotype blocks as being like the grooves in a bar of chocolate that permit it to be broken easily into bite-size fragments.

Variations in human genomes are the subject of several large-scale projects.

The SNP Consortium (http://snp.cshl.org) collects human SNPs. Its database currently contains nearly 4.2 million SNPs.

The International HapMap Project collects and curates haplotype distributions from several human populations. SNPs are its raw material, from which it identifies the correlations among them. Phase I of the project, published in October 2005, had the goal of measuring the distributions of at least one SNP every 5 kb across the whole human genome. Blood samples were provided by 269 individuals from four continents (see Box 2.4). Over 1 million SNPs of significant frequency (>5%) were documented. In addition, ten selected 500 kb regions were fully sequenced from 48 of the samples. Phase II will extend the analysis of the samples to determine an additional 4.6 million SNPs from the same individuals.

The work of the International HapMap Consortium, together with other studies, shows that:

• Most of the variations appear in all populations sampled. Some of the inter-population differences reflect different relative amounts of the same SNPs.
• A very few SNPs are unique to particular populations. For example, out of over 1 million SNPs, only 11 are consistently different – in the sample studied – between all individuals of European origin and all individuals of Chinese or Japanese origin.
• The genomes of individuals from Japan and China are very similar, suggesting more recent common ancestry than other population pairs in the study.
• The X chromosome varies more between different populations than other chromosomes. This may arise from the fact that males contain only one X chromosome, the genes on which are, therefore, more subject to selective pressure. Recombinations of X chromosomes can occur, but only in females.
• Lengths of haplotype blocks vary among the different sources of samples. They tend to be shorter among populations from Africa, consistent with the idea of an African origin of the human species. The idea is that the older the population – more accurately, the larger the number of generations – the greater the chance of recombination.
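The numerical argument for haplotype blocks made earlier in this section can be checked with a few lines of arithmetic. The sketch below simply restates the approximate figures given in the text (a genome of about 3 × 10^9 bases, blocks of about 100 kb, about 1 SNP per kb); the variable names are ours and the values are order-of-magnitude estimates, not measurements.

```python
# Approximate figures quoted in the text.
GENOME_SIZE_BP = 3 * 10**9   # bases in the human genome
BLOCK_SIZE_BP = 100_000      # typical haplotype block, about 100 kb
SNPS_PER_KB = 1              # average SNP density of 0.1%

# Expected number of SNP sites in one block.
snps_per_block = (BLOCK_SIZE_BP // 1000) * SNPS_PER_KB

# Each site is either present or absent, so the combinations multiply.
possible_combinations = 2 ** snps_per_block

# Number of block-sized regions that tile the genome: localizing a trait
# to a block narrows the search from one base in the genome to one block.
blocks_in_genome = GENOME_SIZE_BP // BLOCK_SIZE_BP

print(f"SNPs per block:        {snps_per_block}")
print(f"Possible combinations: {possible_combinations:.2e}")
print(f"Blocks in the genome:  {blocks_in_genome}")
```

Set against a number of possible combinations this large, the observation that many regions show fewer than five combinations is what makes a haplotype such a compact description.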

BOX 2.4 Origin of samples for the International HapMap Project*

Population origin                        Location          Number of individuals   Relationships
Yoruba                                   Ibadan, Nigeria   90                      30 parent–offspring trios
Northern and western European descent    Utah, USA         90                      30 parent–offspring trios
Han Chinese                              Beijing, China    45                      –
Japanese                                 Tokyo, Japan      44                      –

Why the choice of parent–offspring combinations? A difficulty in determining haplotypes in heterozygous regions of diploid chromosomes is how to determine which SNPs lie in the same DNA molecule. Comparison of parental and child sequences can sort the observed SNPs into haploid contributions.

* The International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437, 1299–1320.
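The reason for sampling parent–offspring trios can be made concrete with a small sketch. It is not the HapMap Consortium's method, only a minimal illustration of the idea: at each SNP site, the child's two alleles are assigned to the maternal or paternal chromosome whenever the parents' genotypes allow only one assignment. The function name, the two-letter genotype encoding, and the example genotypes are all invented for illustration.

```python
def phase_child(mother, father, child):
    """Split a child's genotypes into maternal and paternal haplotypes.

    Each argument is a list of two-letter genotypes, one per SNP site,
    e.g. "AG". Sites that cannot be resolved from the trio alone are
    reported as None in both haplotypes.
    """
    maternal, paternal = [], []
    for m, f, (a, b) in zip(mother, father, child):
        # Can allele a come from the mother and b from the father, or the reverse?
        a_from_mother = a in m and b in f
        b_from_mother = b in m and a in f
        if not (a_from_mother or b_from_mother):
            raise ValueError(f"genotypes {m}, {f}, {a}{b} are not a consistent trio")
        if a == b or (a_from_mother and not b_from_mother):
            maternal.append(a)
            paternal.append(b)
        elif b_from_mother and not a_from_mother:
            maternal.append(b)
            paternal.append(a)
        else:
            # All three individuals are heterozygous: the trio alone cannot decide.
            maternal.append(None)
            paternal.append(None)
    return maternal, paternal


# Hypothetical genotypes at four SNP sites.
mother = ["AA", "CT", "GG", "AG"]
father = ["AG", "CC", "GT", "AG"]
child = ["AG", "CT", "GT", "AG"]

maternal_hap, paternal_hap = phase_child(mother, father, child)
print("maternal:", maternal_hap)
print("paternal:", paternal_hap)
```

The last site shows the residual difficulty: when mother, father, and child are all heterozygous for the same pair of alleles, the trio does not resolve the phase, and other information is needed.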

Ethical issues faced in the HapMap project

The International HapMap Consortium paid due attention to ethical, legal, and social issues. Informed consent of the donors preceded collection of samples. The procedure for informed consent involved not only individual agreement, but also community engagement, including interactive public explanation of the project. Samples were labelled anonymously. In fact, more samples were collected than used (similar in some ways to the principle of issuing blank cartridges to a firing squad). Nevertheless, characteristics of a population constitute personal information, the release of which may affect all individuals in the population, including those who were never asked to contribute a sample, and even those who refused. For this reason the HapMap Consortium did not collect medical information about the sample contributors, even under the protection of consent and anonymity.

Application of haplotypes to infer relationships between populations: the Barbary macaques

Comparison of genomes between populations can reveal histories of migrations. Here we shall consider the Barbary macaques; Chapter 8 will treat human history.

The macaques (Macaca sylvanus) on the island of Gibraltar are the only wild primates on the continent of Europe – except for certain football fans. The island has been host to these animals for several hundreds of years. Other populations of the same species exist across the Mediterranean, in northwest Africa.

Although there is fossil evidence for the ancient presence of Barbary apes in Europe, Gibraltar is not a refuge for survivors of that population.

Instead, almost all the population arrived in Gibraltar during the Second World War, deliberately imported to enhance morale during those dark days. Politically, Gibraltar is the last remnant of the British Empire in continental Europe: It is said that Gibraltar will remain British as long as the macaque population survives. With just four individuals left on the island in 1943, Winston Churchill ordered that the population be restocked. He minuted the Colonial Secretary: ‘The establishment of the apes should be 24. Action should be taken to bring them up to this number at once and maintain it thereafter’.

Where did they come from? Modolo, Salzburger, & Martin analysed a 428 bp segment of mitochondrial DNA from individuals from the Gibraltar colony and from seven natural populations in Algeria and Morocco.* Among the individuals sampled, this region contained 24 different haplotypes. The haplotypes differed by between 1 and 26 mutations, all but one mutation being an SNP.

Figure 2.8 shows the clustering of the sequences. Each haplotype is represented by a circle with a radius proportional to the number of individuals bearing the haplotype. Colour coding indicates the location of sample collection. For instance, individuals of haplotype M16 are found mostly in Gibraltar, sometimes in Algeria, and infrequently in Morocco.

The topology of the relationship has several interesting implications.

• There are three major clusters. One cluster (upper right) contains most of the individuals from Morocco. The individuals from Algeria show two populations, from different collection areas, a major one (upper left) and a minor one (bottom centre). One location, Pic des Singes, provided individuals with haplotypes linked to each of the well-separated Algerian clusters.
• Dating of the divergence suggests that the Moroccan and Algerian populations separated over 1.2 million years ago.
• It is likely that the female line in the current Barbary macaques originated from both the Moroccan and Algerian populations.

* Modolo, L., Salzburger, W., & Martin, R.D. (2005). Phylogeography of Barbary macaques (Macaca sylvanus) and the origin of the Gibraltar colony. Proc. Natl. Acad. Sci. USA 102, 7392–7397.

A clinically important haplotype: the major histocompatibility complex

In the human genome, proteins of the major histocompatibility complex (MHC; known in humans as the human leucocyte antigen (HLA) system) are

Figure 2.8 Haplotype network from 428-bp segments of mitochondrial DNA collected from free-living Barbary macaques (Macaca sylvanus) inhabiting Gibraltar, Algeria, and Morocco. The size of each coloured circle is proportional to the number of individuals bearing each haplotype. Each line segment represents a single mutation. Thus, the sequences represented by M18 and M19, just to the right of the inset map, differ by three mutations. The colours of the circles in the graph indicate locations of sample collection (see inset map).
Copyright National Academy of Sciences, reprinted by permission.

encoded in a ∼4 Mb region on chromosome 6 (6p21.31). In vertebrate species, each individual expresses a set of MHC proteins selected from a diverse genetic repertoire of the species. The system is highly polymorphic, with 50–150 alleles per locus, higher sequence variation than found in most polymorphic proteins. The set of MHC proteins expressed defines a partial haplotype of an individual. Compared with other haplotype blocks, the MHC region shows unusually wide individual variation.

The MHC region contains over 120 expressed genes, coding for proteins that:

• provide the mechanism by which the immune system distinguishes ‘self’ molecules – those to be tolerated – from ‘non-self’ molecules – those recognized as foreign invaders that must be repelled;
• determine individual profiles of competence for resistance to diseases;
• are useful markers for determining relationships among populations of humans and animals, and for tracing large-scale migrations and population interactions.
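Whether the markers are MHC alleles or the mitochondrial haplotypes of Figure 2.8, relationships among populations are inferred by counting differences between aligned sequences. The sketch below shows only that counting step, the quantity represented by the line segments in a haplotype network. The three ten-base sequences and their labels are invented for illustration and are not data from any study.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


# Hypothetical aligned haplotypes (not real macaque or MHC sequences).
haplotypes = {
    "hapA": "ACGTACGTAC",
    "hapB": "ACGTACGTAT",
    "hapC": "ACCTACGAAT",
}

# Pairwise distance table: one entry per unordered pair of haplotypes.
distances = {
    (p, q): count_differences(haplotypes[p], haplotypes[q])
    for p, q in combinations(sorted(haplotypes), 2)
}

for (p, q), d in distances.items():
    print(f"{p} vs {q}: {d} difference(s)")
```

Building the network itself, and dating the divergences, requires further methods that this sketch does not attempt.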

MHC haplotypes control donor–recipient compatibility in transplants

Surgical patients, if not immunosuppressed by drugs, will reject transplanted organs – unless the donor is an identical sibling – because the transplant is recognized as foreign. MHC proteins bind peptides and present them on cell surfaces. The triggering event in alerting the immune system to the presence of a foreign protein is the recognition by a T-cell receptor of a complex between an MHC protein and a peptide derived from the foreign protein (see Figure 2.9).

MHC haplotype influences autoimmune diseases – breakdowns in self-/non-self-distinguishability that result in a person’s immune system attacking his or her own tissues. Examples of autoimmune diseases include rheumatoid arthritis, multiple sclerosis, type I diabetes, and systemic lupus erythematosus.

In addition to triggering immune responses in mature individuals, MHC–peptide complexes are also involved in the removal of self-complementary T cells in the thymus during development, at the stage when the distinction between self and non-self is ‘learnt’.

Figure 2.9 The immunological distinction between ‘self’ and ‘non-self’ resides in the proteins of the major histocompatibility complex (MHC) and their interaction with T-cell receptors. This picture shows a human T-cell receptor in complex with a class I MHC protein and a viral peptide. MHC proteins have broad specificity, each binding many peptides including those of self and non-self origins. Cell surfaces contain large numbers of MHC–peptide complexes, among which those binding foreign peptides are a small minority. T-cell receptors, in contrast, have narrow specificity, and pick out the complexes containing foreign peptides, like a professional antiques dealer spotting a valuable item in a rummage sale.
Labels in figure: Cα, Cβ, Vα, Vβ, MHC, β2m.

MHC haplotypes determine patterns of disease resistance

Different MHC molecules have different binding specificities and can present different sets of peptides. People whose MHC molecules do not effectively present epitopes from a particular pathogen will be more susceptible to infection. For instance, MHC haplotype is a predictor of survival horizon in people infected by human immunodeficiency virus.

MHC haplotypes influence mate selection

Opposites attract: one person will tend to find another romantically attractive if they have different MHC haplotypes. The mechanism is apparently through linkage of MHC haplotype and body scent. This effect will tend to produce offspring that have MHC molecules that can present a broader repertoire of peptides, producing broader resistance to infection.

• A person’s MHC haplotype is the set of alleles for over 100 highly polymorphic sites. MHC haplotype is an explicit signature of individuality, governing transplant rejection, resistance to infection, and even mate selection.

Mutations and disease

Many mutations, even if they are not synonymous mutations, are consistent with a healthy life, and typical life-span, provided that the individual practises a reasonable lifestyle. Such mutations contribute to our ethnic and individual variety. Loss of some proteins is surprisingly innocuous. Mice lacking myoglobin thrive, and even show athletic performance comparable to normal mice.

In some cases, species-wide loss of biosynthetic enzymes is not generally considered a disease, but contributes to the list of essential nutrients. For instance, whereas most animals can synthesize

vitamin C, we must provide it in our diet. If not, the result is the disease scurvy.

Nevertheless, evolution has largely optimized proteins for their roles in healthy organisms. Therefore, most mutations causing amino acid sequence changes are deleterious, impairing protein function and threatening to produce disease.

Mutations associated with human disease are collected by the organization Online Mendelian Inheritance in Man (OMIM)™ (http://www.ncbi.nlm.nih.gov/omim). A corresponding site, Online Mendelian Inheritance in Animals (OMIA) (http://www.ncbi.nlm.nih.gov/omia) collects mutations associated with diseases in animals, other than human and mouse.

By what mechanisms can mutations affect human health?

Mutations causing defects in some proteins can be accommodated by adjustments in lifestyle. Other mutations cause disease only in combination with unusual features of lifestyle or specific triggering events. We have already mentioned the Z-mutation of α1-antitrypsin, whose tendency to cause emphysema is enhanced by smoking.

Loss-of-function mutations are often recessive, so that homozygosity for the mutant allele typically has more severe consequences than heterozygosity. Every individual is heterozygous for some deleterious mutations that, if homozygous, would be lethal.

Many diseases are associated with the formation of insoluble aggregates, usually of misfolded proteins. These include classical amyloidoses, Alzheimer’s and Huntington’s disease, aggregates of misfolded serpins, and prion diseases. Polymerization of insulin creates problems in production, storage, and delivery in diabetes therapy. Mutations that destabilize proteins can increase the proportion of misfolded proteins, which show a greater tendency to form aggregates.

Some mutations that produce defective proteins show complex interactions with other traits, with the result that some deleterious mutations have not been eliminated from populations by selection, because they carry some compensating advantages. For example, the genes for sickle-cell anaemia and for glucose-6-phosphate dehydrogenase deficiency confer resistance to malaria (see Box 2.5).

Dysfunction of a regulatory protein or receptor can disorganize the operation of a pathway even if all components of the pathway regulated are normal. Some abnormal regulatory proteins cannot be activated at all, whereas others are constitutively activated and cannot be shut off. The effects include:

Physiological defects: a number of diseases are associated with mutations in G protein-coupled receptors. Some mutations in opsins are associated with colour blindness. Certain mutations in the common G protein target of olfactory receptors lead to loss of sense of smell.

Developmental defects: several types are traceable to mutations in hormone receptors. For instance, Laron syndrome, a phenotype including diminished stature, arises from a mutation in the human growth hormone receptor. Administration of exogenous growth hormone does not restore normal growth.

• Many diseases are caused directly by mutations; many others have a genetic component and arise in the context of an interaction between genetics and lifestyle.

BOX 2.5 Glucose-6-phosphate dehydrogenase deficiency, food taboos, folk medicine, pharmacogenomics, and mosquito breeding seasons

Deficiency of glucose-6-phosphate dehydrogenase (G6PDH) is the most common enzyme deficiency, affecting over 400 million people worldwide. It is a recessive X-linked genetic defect, affecting up to 10% of populations in which mutations are common.

Glucose-6-phosphate dehydrogenase (G6PDH) catalyses the reaction:

glucose-6-phosphate + NADP → 6-phosphogluconate + NADPH,

the first step in the pentose phosphate shunt. The NADPH produced by this reaction is needed to maintain the reduced glutathione that disposes of hydrogen peroxide (H₂O₂). It is particularly important in red blood cells, which, lacking nuclei and mitochondria, are

metabolically impoverished, and have no alternative mechanism for detoxifying H₂O₂. Without active G6PDH, build-up of H₂O₂ will oxidize and denature haemoglobin, leading to destruction of red blood cells, producing a condition called haemolytic anaemia.

Eating fava beans, especially if uncooked, can induce anaemic episodes in people deficient in G6PDH. The danger of eating fava beans has been recognized since antiquity, and has been associated with food taboos, and preparation techniques designed to reduce toxicity. Pythagoras, for example, banned eating of fava beans in his school. We now know that fava beans contain the compounds vicine and convicine, metabolized in the intestine to isouramil and divicine, which react with oxygen to produce hydrogen peroxide, subjecting cells to oxidative stress.

Other chemicals, including certain drugs, present the same danger to G6PDH-deficient people. During World War I, some patients were observed to suffer dangerous side effects from the antimalarial drug primaquine. Many drugs, including sulphonamides, are now contraindicated for G6PDH-deficient patients, as is the taking of large doses of vitamin C. The observation of variations in effectiveness and toxicity of different drugs in different people has developed into the new field of pharmacogenomics, the tailoring of drug treatments to the genotype of the individual patient.

Why have dysfunctional G6PDH genes remained at such a high level in the population? Why does primaquine produce haemolytic anaemia in G6PDH-deficient patients, and does this have a relationship to its antimalarial activity? And why have fava beans continued to be grown if non-toxic alternatives are available?

The malarial parasite invades the red blood cell of its host, and competes metabolically with normal activity. Primaquine and related drugs, such as chloroquine, subject the red blood cells to oxidative stress. Cells stressed by both parasite and drug are the most vulnerable, and if they die they take the parasite down with them. Because consumption of fava beans subjects cells to oxidative stress, they also provide an antimalarial effect, recognized in folk medicine. Indeed, fava beans have some effect against malaria even for people with normal G6PDH activity; those with abnormal G6PDH have a greater advantage, until the maturing Plasmodium produces its own G6PDH.

The link with malaria is the likely explanation of the persistence of the gene in the population and the fava bean in agriculture. A final clue appears in the calendar: there is good overlap between the fava bean harvest period and the peak Anopheles breeding season.

Genetic diseases – some examples of their causes and treatment

Haemoglobinopathies – molecular diseases caused by abnormal haemoglobins

Sickle-cell anaemia: Pauling and co-workers showed in 1949 that haemoglobin isolated from patients with sickle-cell anaemia differed in electric charge from normal haemoglobin. Patterns of inheritance showed that sickle-cell anaemia is a genetic disease. Pauling’s discovery was therefore the first evidence that genes precisely control the structures of proteins. This preceded the first determination of the amino acid sequence of a protein.

Recall that haemoglobin contains four polypeptide chains, two α chains and two β chains (Figure 1.16). The sickle-cell mutation changes residue 6 of the β chain from a charged sidechain, glutamic acid, to a nonpolar one, valine (β6Glu→Val). This creates a ‘sticky patch’ on the surface of the molecule. As a result, the mutant haemoglobin forms polymers within the erythrocyte, in the unligated or deoxy state (in the deoxy form, typical of venous blood, haemoglobin is not binding oxygen). To flow through small capillaries, erythrocytes must be deformable, as their typical size, 7.8 μm (in humans), is larger than the diameter of small capillaries. The formation of the polymers has a rigidifying effect on the erythrocyte, impeding its flow and blocking capillaries (Figure 2.10). In the traffic jam building up behind a plugged capillary, arriving red cells release their oxygen to surrounding hypoxic tissues, become deoxygenated, and thereby aggravate the problem. This produces pain – because of reduced oxygen supply resulting from capillary congestion – and anaemia and jaundice, consequences of the rapid breakdown of red blood cells.
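The sickle-cell substitution can be traced to a single-base change in one codon (gag to gtg). As a minimal sketch, assuming the standard genetic code, the short Python fragment below classifies a single-codon change by its effect on the encoded amino acid. The function name classify_point_mutation and the category labels are illustrative choices, not part of any established library.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_point_mutation(wild_codon, mutant_codon):
    """Classify a codon change (illustrative helper, not a library function)."""
    # Accept lower-case input and RNA spelling (U in place of T).
    wild = CODON_TABLE[wild_codon.upper().replace("U", "T")]
    mutant = CODON_TABLE[mutant_codon.upper().replace("U", "T")]
    if wild == mutant:
        return "silent"          # same amino acid (or both stop codons)
    if mutant == "*":
        return "nonsense"        # amino acid codon becomes a stop codon
    if wild == "*":
        return "read-through"    # stop codon lost, chain is extended
    return "missense"            # one amino acid replaced by another


if __name__ == "__main__":
    # The sickle-cell change: gag (Glu, E) to gtg (Val, V).
    print(CODON_TABLE["GAG"], "->", CODON_TABLE["GTG"],
          classify_point_mutation("gag", "gtg"))
```

The same helper distinguishes the other kinds of point mutation discussed in this section, for example a change from a codon for an amino acid to a stop codon.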

Figure 2.10 Coloured scanning electron micrograph of normal (round) and sickle-shaped erythrocytes. The shape of the cells is caused by aggregation of mutant haemoglobin in its deoxy state. The symptoms of the disease are caused not so much by the shape of the cells but by their inability to distort their shape in order to pass through the narrow capillaries. (Science Photo Library).

The most common SNP associated with sickle-cell anaemia changes the codon gag to gtg. It is possible to test specifically for this mutation. Alternatively, sequencing the entire region would pick up other possible mutations that might lead to thalassaemias.

Thalassaemias are genetic diseases associated with defective or deleted haemoglobin genes: most Caucasians have four genes for the α chain of normal adult haemoglobin, two alleles of each of the two tandem genes α1 and α2. Therefore α-thalassaemias can present clinically in different degrees of severity, depending on how many genes encode normal α chains. Only deletions leaving fewer than two active genes present as symptomatic under normal conditions. Observed genetic defects include deletion of both genes (a process made more likely by the tandem gene arrangement and sequence repetition, which make crossing over more likely) and loss of chain termination leading to translational ‘read through’, creating extended polypeptide chains, which are unstable.

β-thalassaemias are usually point mutations. These may be:

• missense mutations (amino acid substitutions): the sickle-cell mutation is an example of a missense mutation;
• nonsense mutations (changes from a triplet coding for an amino acid to a stop codon) leading to premature termination and a truncated protein;
• mutations in splice sites;
• mutations in regulatory regions;
• certain deletions, including the normal termination codon and the intergenic region between δ and β genes, creating δ–β fusion proteins.

Phenylketonuria

Phenylketonuria (PKU) is a genetic disease caused by deficiency in a metabolic enzyme, phenylalanine hydroxylase, the enzyme that converts phenylalanine to tyrosine (see Figure 2.11). If the condition is untreated, phenylalanine accumulates in the blood to toxic levels, and the high levels of phenylalanine cause a variety of developmental defects, including mental retardation, microcephaly, and seizures. The disease cannot be cured, but symptoms can be avoided by lifestyle control: a phenylalanine-free diet. Screening of newborns for PKU is legally required in the United States and many other countries.

PKU is an example of how understanding the molecular mechanism of a disease can, for some conditions, help restore and preserve health through suggested changes in lifestyle and/or medical treatment.

PKU is an autosomal recessive trait, associated with mutations in the phenylalanine hydroxylase gene on the long arm of chromosome 12 (12q22–12q24.1). In the UK and USA the prevalence is about 1 in 10 000 individuals; about 1 in 50 are carriers. PKU is the subject of neonatal screening in many countries.

A large number of known mutations are associated with PKU, appearing in all 13 exons of the gene and in flanking sequences (http://www.pahdb.mcgill.ca/). Many but not all of them are SNPs. These include many that affect catalytic activity, somewhat fewer that affect regulation, and even fewer that affect the assembly of the tetrameric enzyme.

The classical test for PKU depended on the build-up of phenylalanine and its degradation products such as phenylpyruvate in neonatal blood and urine.

(Phenylpyruvate is a ketone, hence the name of the disease. See Figure 2.11.) From a blood sample taken from a neonate, mass spectrometry can measure abnormal concentrations of phenylalanine or tyrosine. It is also possible to detect mutations in the gene by sequencing, although should a novel mutation appear it might not be possible to conclude with confidence that the mutant protein is dysfunctional. Genomic sequencing could also detect carriers, allowing counselling of potential parents.

Figure 2.11 Phenylalanine (top left) and three metabolites. Top right: tyrosine, produced by the normal phenylalanine hydroxylase enzyme. Bottom left: phenylpyruvate, a degradation product formed in large quantities by people suffering from untreated phenylketonuria. Bottom right: trans-cinnamate, produced from phenylalanine by the enzyme phenylalanine ammonia-lyase. A derivative of this enzyme is now in phase II clinical trials for treatment of phenylketonuria.

Management of PKU depends on enforcing a low-phenylalanine diet. This is not entirely satisfactory: compliance is a common problem, as low-phenylalanine foods are relatively unpalatable and do not provide complete nutrition (see Weblem 2.6). Artificial mixtures of amino acids – phenylalanine-free formulas – replace high-protein foods.

It is particularly tricky to manage women with PKU during pregnancy. Remember that PKU is an autosomal recessive trait. A woman with PKU must be homozygous for defective phenylalanine hydroxylase (although the two alleles need not necessarily bear the same mutation). If such a woman becomes pregnant, it is likely that the foetus is only a carrier (unless the father is also a carrier). The problem is to control phenylalanine levels in the mother so as to provide adequate nutrition to the foetus, without subjecting the mother to toxic levels of phenylalanine.

A current topic of PKU research is enzyme replacement therapy. It is not satisfactory to administer functional phenylalanine hydroxylase itself, as is done with insulin for diabetes. Phenylalanine hydroxylase is a tetramer, requires a cofactor, and is subject to complex regulatory controls. An alternative is to use phenylalanine ammonia-lyase, an enzyme found in plants, fungi, and bacteria, which converts phenylalanine to trans-cinnamic acid (see Figure 2.11). This enzyme is more stable, and does not require cofactors. The product, trans-cinnamic acid, degrades to hippuric acid, excreted in the urine. The problem of antigenicity is addressed by attachment of polyethylene glycol (PEG). As a drug, pegylated phenylalanine ammonia-lyase is currently in phase II clinical trials.

Alzheimer’s disease

The symptoms of Alzheimer’s disease are loss of cognitive functions, characterized by: loss of train of thought, progressive memory problems, missing important appointments, etc. The most common form is late-onset Alzheimer’s, appearing in people over the age of 65. Approximately 50% of people over age 85 suffer from it. Alzheimer’s disease is a very severe public-health problem, especially in view of increased life-spans. Early-onset Alzheimer’s, defined as first appearing at age <65, is rarer. Even rarer is familial Alzheimer’s, involving <1% of cases, appearing at age 40–60.

The risk of late-onset Alzheimer’s disease is correlated with SNPs in apolipoprotein E (ApoE). The basic function of this 317-residue protein is to remove cholesterol from the blood. The gene for ApoE is on chromosome 19. There are four common alleles, which differ by SNPs:

ApoE1 = rs429358(C) + rs7412(T) [minor variant]
ApoE2 = rs429358(T) + rs7412(T)
ApoE3 = rs429358(T) + rs7412(C) [∼55%]
ApoE4 = rs429358(C) + rs7412(C)

The correlations with risk of Alzheimer’s disease are:

At least one E4 allele → increased risk of Alzheimer’s
At least one E2 allele → decreased risk of Alzheimer’s

SNPs and cancer

SNPs are relevant to cancer research and treatment in several ways:

(1) Mutations detectable in the genome indicate propensity for development of cancers. Mutations in BRCA1 and BRCA2, as indicators for likelihood of breast and ovarian cancer development, are probably the best known.
(2) Sequence analysis can predict disease progression and outcome.
(3) Sequence analysis can help choose optimal treatment.
(4) Tumour progression often involves mutations and divergence of cell lines.

The onset of cancer is associated with loss of genome integrity. Cancer results from accumulated mutations that break down the controls on cell growth. The source can be in three classes of genes: genes that regulate cell proliferation, genes required for repair of DNA damage, and genes that control apoptosis.

The ‘two-hit hypothesis’ relates the difference between sporadic and familial forms of disease to the need to mutate both copies of such genes.

Consider, as an example, retinoblastoma, a rare childhood tumour of the eye. Approximately 30–40% of cases are familial; the rest sporadic. The familial form shows an autosomal dominant inheritance pattern. Clinical characteristics of familial retinoblastoma that distinguish it from the sporadic picture are early onset, and the appearance of multiple tumours, affecting both eyes.

The two-hit hypothesis offers an explanation for the differential age of onset, and severity, of familial and sporadic retinoblastoma. The idea is that non-familial cases require inactivation of both copies of the retinoblastoma gene, each of which was originally functional. Separate and independent mutations are necessary. In contrast, familial retinoblastoma affects a person who has inherited one defective and one functional copy of the gene. That is, the first hit is inherited; all that is needed is the second hit (see Figure 2.12).

Tumour suppressor genes protect cells against development of cancer. They encode proteins that inhibit tumour formation. Their normal function can

[Figure 2.12 panels. Sporadic retinoblastoma: two hits required (two functional Rb copies; first hit; second hit; tumour). Familial retinoblastoma: first hit inherited, one more required (one defective Rb copy inherited; second hit; tumour).]

Figure 2.12 Explanation of the difference between sporadic and familial retinoblastoma, a rare cancer of the retina. According to the
two-hit theory, two copies of a gene must be inactivated. In sporadic retinoblastoma, both copies are originally functional, and two
separate, independent mutations are required to inactivate them. In familial retinoblastoma, one defective copy of the gene is inherited.
Only a single mutation, in the other allele, is required to produce the disease.
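The quantitative intuition behind the two-hit hypothesis can be illustrated with a small calculation. The sketch below is a deliberately simplified model, not taken from the text: it assumes that each gene copy in each cell is inactivated independently with the same probability, and both the cell count and the mutation probability are hypothetical round numbers chosen only for illustration.

```python
def expected_double_hit_cells(n_cells, mu, first_hit_inherited):
    """Expected number of cells with both gene copies inactivated.

    n_cells: number of susceptible cells (hypothetical value).
    mu: probability that one gene copy is inactivated in one cell
        (hypothetical value; hits are assumed independent).
    first_hit_inherited: True for the familial case, in which every cell
        already carries one defective copy.
    """
    hits_needed = 1 if first_hit_inherited else 2
    return n_cells * mu ** hits_needed


# Hypothetical illustrative values, not measurements.
N_CELLS = 1_000_000
MU = 1e-6

sporadic = expected_double_hit_cells(N_CELLS, MU, first_hit_inherited=False)
familial = expected_double_hit_cells(N_CELLS, MU, first_hit_inherited=True)

if __name__ == "__main__":
    print(f"sporadic: {sporadic:.2e} expected double-hit cells")
    print(f"familial: {familial:.2e} expected double-hit cells")
    print(f"ratio familial/sporadic: {familial / sporadic:.2e}")
```

Under these assumptions the familial case is more likely than the sporadic case by a factor of 1/mu, which is consistent with the earlier onset and multiple tumours described for familial retinoblastoma.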

be to inhibit cell growth; mutations ‘take the foot off the cell-growth brake’. Mutants in these genes raise the risk of developing cancer.

Well-known examples of tumour suppressor genes are: BRCA1 and BRCA2. In the general population:

∼12% of women will develop breast cancer;
∼1.4% of women will develop ovarian cancer.

Of women with a harmful mutation in BRCA1 or BRCA2:

∼60% will develop breast cancer;
∼15–40% will develop ovarian cancer.

BRCA1 and BRCA2 encode long proteins unrelated in sequence and structure (see Box 2.6). Both proteins are required for chromosome stability, participating in mechanisms of repair of DNA double-strand breaks.

Many BRCA1 mutations are known. (Many are not SNPs.) Their prevalence varies among populations, showing strong founder effects (see Table 2.1). Testing for mutations in these genes is now quite common (see Box 2.7).

BOX 2.6 The genes BRCA1 and BRCA2

Gene    Chromosome band   Gene length   Protein length (amino acids)   Number of exons
BRCA1   17q21             >100 kb       1863                           24 (22 coding, exon 11 very large)
BRCA2   13q12–q13         >200 kb       3418                           27

Table 2.1 Common BRCA1 and BRCA2 mutations

Population       Common BRCA1 mutations        Common BRCA2 mutations
Ashkenazi Jews   185delAG, 5382insC            6174delT
Iceland                                        999del5
Denmark          2594delC, 5208T→C
Lithuania        4153delA, 5382insC,
                 61G→C
China            589delCT, IVS7–27del10,       3337C→T
                 1081delG, 2371–2372delTG
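The lifetime risk figures quoted above for the general population and for carriers of harmful BRCA1 or BRCA2 mutations can be compared directly. The following sketch simply divides one set of quoted figures by the other to give an approximate relative risk. It is a rough illustration only: the general-population figures include carriers, and the dictionary and function names are invented for this example.

```python
# Lifetime risks as quoted in the text (fractions, not percentages).
GENERAL_POPULATION = {"breast": 0.12, "ovarian": 0.014}
# Carrier risks as (low, high) bounds; a single value is repeated.
MUTATION_CARRIERS = {"breast": (0.60, 0.60), "ovarian": (0.15, 0.40)}


def relative_risk_range(cancer):
    """Return (low, high) ratio of carrier risk to general-population risk."""
    low, high = MUTATION_CARRIERS[cancer]
    baseline = GENERAL_POPULATION[cancer]
    return low / baseline, high / baseline


if __name__ == "__main__":
    for cancer in ("breast", "ovarian"):
        low, high = relative_risk_range(cancer)
        print(f"{cancer}: about {low:.0f}- to {high:.0f}-fold the baseline risk")
```

On these figures the relative increase is larger for ovarian cancer than for breast cancer, even though the absolute risk for carriers is higher for breast cancer.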

BOX 2.7 Genetic testing for mutations in breast cancer genes BRCA1 and BRCA2

Breast cancer is a leading killer of women in western Europe and the USA, affecting over 13% of the population (100 times as many women as men) and causing death in about 3% of all women. Among the known genetic factors that raise the risk of breast cancer are mutations in the genes BRCA1 and BRCA2.

Screening for mutations in these genes can provide a risk-alerting system for breast and ovarian cancer. However, planning of population-wide screening programmes must take into consideration cost/benefit analysis. In many countries, medical care delivery policy is set by a national health service. (The USA is a notable exception.) Governments must decide how to apportion resources, taking into account the cost of any procedure and the utility of the information it produces. Mutations in BRCA1 and BRCA2 are associated with only about 5–10% of breast cancers. Conversely, not all women with a mutation in either of these genes develop cancer, although over 50% do. The BRCA1 and BRCA2 genes are long, multiexon sequences each >100 kb long, making total resequencing of the genes a complicated procedure. Many mutations are known, spaced widely within the exons. Given that even a complete resequencing of the genes would not provide a helpful prognosis in many cases, it is not deemed useful,

with current technology, to screen the entire population fully for BRCA1 and BRCA2 mutations.

The logic of the decision changes for individuals suspected to be at high risk. These include women who:

• have a close relative known to have a BRCA1 or BRCA2 mutation;
• have close relatives who have been diagnosed with early-onset (age <50 years) breast or ovarian cancer; or
• have themselves been diagnosed with breast cancer and want to know their likelihood of developing ovarian cancer.

For these individuals, the likelihood of finding useful information certainly justifies genetic testing.

In some genetically relatively homogeneous populations, practical advantage can be taken of the observation that specific mutations in BRCA1 and/or BRCA2 are common in that population. For instance, Ashkenazi Jews are about 20 times more likely to bear a mutant in one of the two genes than the general population. Three mutations – 185delAG and 5382insC in BRCA1 and 6174delT in BRCA2 – account for 90% of inherited breast and ovarian cancers in this group. Testing for these three mutations or for any specific mutation known in a relative – whether one of these three

[Figure 2.13 flow chart, shown in parallel for wild type (AUG ... CAA ... Stop) and mutant (AUG ... UAA (new stop) ... Stop): collect sample; isolate mRNA; reverse transcribe cDNA; PCR amplification; in vitro transcription and translation; immunoprecipitation and gel electrophoresis, with gel lanes W1, M1, W2, M2, W3, M3.]

Figure 2.13 Schematic explanation of the protein truncation test. Starting from mRNA, or in some cases from genomic DNA,
a series of fragments covering the gene or exon of interest is amplified. In vitro transcription and translation produces the
corresponding polypeptide chains. A mutation that introduces an internal stop codon or a deletion leading to a frame shift
produces some shortened fragments that can be distinguished on a gel.
Straight lines indicate nucleic acids; wavy lines indicate polypeptides. W1, W2, and W3 are peptides derived from the wild-type
gene. M1, M2, and M3 are derived from the mutant gene, using the same primers. The red and blue dots in the peptides
correspond to the position of the CAA in the wild-type mRNA and the UAA in the mutant mRNA. In this example, the peptides
W1 and M1 have the same size, but M2 is shorter than W2 and M3 is shorter than W3. Note that the mobilities of the peptides
are not strictly proportional to their molecular weights.
The triplets in parentheses refer to the original mRNA sequence. Different bases appear in the molecules that are created during
the procedure. (See Exercises 2.6–2.9.)
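The principle of the protein truncation test can be mimicked in a few lines of code: translate a wild-type and a mutant message, and compare the lengths of the resulting peptides. The sketch below assumes the standard genetic code. The 18-base message is a made-up example containing the CAA to UAA change used in Figure 2.13; it is not a real BRCA1 or BRCA2 sequence, and the function name translate is an illustrative choice.

```python
# Standard genetic code for RNA, built from the conventional UCAG ordering.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base U
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna):
    """Translate from the first base, stopping at the first stop codon."""
    peptide = []
    for start in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[start:start + 3]]
        if residue == "*":       # stop codon: release the chain
            break
        peptide.append(residue)
    return "".join(peptide)


# Hypothetical message: AUG GCU CAA GGU UCU UAA (Met-Ala-Gln-Gly-Ser-Stop).
WILD_TYPE = "AUGGCUCAAGGUUCUUAA"
# Nonsense mutation: the third codon, CAA (Gln), becomes UAA (stop).
MUTANT = WILD_TYPE[:6] + "UAA" + WILD_TYPE[9:]

if __name__ == "__main__":
    for name, message in (("wild type", WILD_TYPE), ("mutant", MUTANT)):
        peptide = translate(message)
        print(f"{name}: {peptide} ({len(peptide)} residues)")
```

The shorter mutant peptide corresponds to the faster-migrating bands that the test detects on the gel.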

or any other – is much simpler and more cost-effective than a full resequencing of the genes. Resequencing all of the exons of the gene is at present about ten times more expensive and can always be considered as a subsequent step if none of the specific mutations appear. This is likely to change, as the cost of sequencing diminishes.

What techniques are used to screen for common mutations?

• The protein truncation test detects premature stop codons by amplifying the coding region of exon 11 of BRCA1 or exons 11 and 12 of BRCA2 (see Figure 2.13).
• The single-stranded conformational polymorphism test is applied to analysis of exons 2–10 and 12–24 of BRCA1 and exons 2–10 and 12–27 of BRCA2 (see Figure 2.14).
• Full resequencing of the gene.

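[Figure 2.14 flow chart: normal gene (G·C pair) and mutant gene (A·T pair); PCR; heteroduplexing, giving the two original pairs (G·C, A·T) and two mismatched pairs (G·T, A·C); conformation-sensitive gel electrophoresis, with gel lanes ‘Normal’ and ‘Normal + mixed’.]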

Figure 2.14 Conformation-sensitive gel electrophoresis, a method for detecting localized mutations. This figure shows the
procedure for analysing a sample from a heterozygote, containing normal DNA from one chromosome (red) and a mutation
on the other (blue). The region surrounding the site of suspected mutation is amplified by PCR. Melting and annealing produces
the two original double-stranded regions and also two mismatched pairs. The regions are sufficiently similar to form a hybrid pair
despite the single-base mismatch, but the mismatch causes sufficient conformational deformation to alter mobility on the gel,
especially under partially denaturing conditions. A normal sample would produce only a single band (left lane on the gel), but the
mixture of normal and mutant would produce additional bands, the mismatched pairs showing altered mobility.

• Applications of genomics in cancer research include:

(a) understanding the cellular events required to transform normal to cancer cells
(b) improving the precision of diagnosis and prediction of outcome
(c) guidance in choosing the most effective therapy
(d) surveying populations for high-risk individuals, who carry mutations in tumour-suppressor genes.

Species

The next step up from populations is species. Species are a fundamental unit of evolution. Species represent nature’s experiments in structures and lifestyles. It is species that were the elements of Linnaeus’s taxonomy, and it was the emergence of novel species that gave Darwin his life’s work and the title of his major book.

It is extremely rare in science that a concept so fundamental to a field as species is in biology would be so difficult to define. Genomics has greatly illuminated but not solved the problem.

To understand what we mean by species we need not only an explanation of the concept, but a criterion for deciding whether two organisms – or, more appropriately, two populations – belong to the same species or different ones. Ernst Mayr enunciated the classic biological approach, focusing on the idea that different species are reproductively isolated in nature. Mayr’s definition applies to sexually reproducing organisms: individuals that can mate with each other to produce viable and fertile offspring when they encounter each other in nature are members of the same species. When reproductive barriers arise – for instance, when groups of animals are trapped on different islands by rising sea levels – the species divides into two or more populations that, separately, interbreed only within themselves and maintain separate gene pools. The separated populations may pursue different evolutionary paths and ultimately diverge to form separate species. Of course, geographic isolation is only one possible cause of speciation.

Many other definitions have been proposed, including those based on comparisons of features of phenotypes, and divergence of genomic sequences. For higher organisms at least, the different definitions are usually consistent, but they are not entirely equivalent.

• Many modern biologists define a species as a group of similar organisms that interbreed naturally to produce fertile offspring. An alternative approach is to base species definitions on genomic sequences. This approach is extremely powerful even for higher organisms, and essential for prokaryotes.

Any valid definition must take into account the fundamental observation that species of higher organisms are discrete entities. The taxonomic hierarchy is in principle a classification of species. We now recognize that the static classification is largely congruent with the evolutionary tree of ancestor–descendant relationships.

• Why living things should be ‘quantized’ into discrete species is a very subtle question.

A fundamental paradox about the biology of higher organisms is that although species are discrete, new species can evolve. Darwin’s finches are a classic example of reproductive isolation of populations that have diverged into separate species. Resolution of this paradox is quite a difficult challenge, but perhaps a basic component of the answer would be that it is the discreteness of ecological niches – to which individual species are optimized – that accounts for the discreteness of species of higher organisms.

For prokaryotes, in contrast, the species concept is in serious trouble. In particular the presence of large amounts of horizontal gene transfer overturns the idea that there is a hierarchy of ancestor–descendant relationships.

Few people doubt that if the problem of understanding species can be resolved, genomics will play a crucial role. Advantages of genomic approaches over others are that they allow:

(1) more accurate assignment of relationships (as we saw in the case of Buddenbrockia plumatellae (see p. 48)),
(2) quantitative measurements of the divergences between species, and
(3) estimation of the time elapsed since the last common ancestor. (Now, classical palaeontology also allows measurements of dates of origins of species that have left fossil records. In fact from fossil records we can date extinctions, which is not possible from genomics!)

• Here we have described both the importance of the species concept and some of the problems associated with it. Both the concept and the problems will be themes of much of the rest of this book.
fundamental observation that species of higher

BOX 2.8 A census of species?

We simply do not know the number of species that have ever existed. Even for the species alive today, only a fraction have been formally described. About 1.5 million species are known to science (see below), but estimates suggest that at least twice as many exist, and probably even an order of magnitude more than that.

Most palaeontologists agree that living species amount to less than 1% of the number that have ever lived, but it is impossible to infer precise estimates.

Group of organisms   Number of species described
Insects              751 000
Other animals        281 000
Higher plants        248 400
Fungi                 69 000
Protozoa              30 800
Algae                 26 900
Prokaryotes            4 800
Viruses                1 000

Over half of the described species are insects. Almost 40% of the insects are beetles (order Coleoptera). When J.B.S. Haldane was asked what he had learned about God from his study of biology, he replied that God must have ‘. . . an inordinate fondness for beetles.’

Given the large numbers of even fairly closely related species, one might be tempted to think that the species that lack scientific description are only minor variations on familiar themes.

This is not true.

The Burgess Shale, in the Canadian Rockies near Banff, British Columbia, Canada, contains some of the world’s best-preserved fossils from the Cambrian period, ∼530 million years ago. The locality contained a rich, predominantly invertebrate, community that left fossils in an unusually high-quality state of preservation. Some of the forms are similar to known animals, both extant and extinct, but others show profound structural differences, including completely different body architectures. The loss of these species has impoverished the natural world in general and the science of biology in particular.

These fossils strongly suggest that the living forms we see are, to a large extent, the result of historical accident and that the course of evolution might easily have been very different.
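The statements about the table in Box 2.8 can be checked with simple arithmetic. The sketch below sums the tabulated counts and computes the insect share. The beetle figure is only an estimate derived from the phrase ‘almost 40%’, so it is an upper bound implied by the text rather than a tabulated value, and the variable names are chosen for this example.

```python
# Numbers of described species, as tabulated in Box 2.8.
DESCRIBED_SPECIES = {
    "Insects": 751_000,
    "Other animals": 281_000,
    "Higher plants": 248_400,
    "Fungi": 69_000,
    "Protozoa": 30_800,
    "Algae": 26_900,
    "Prokaryotes": 4_800,
    "Viruses": 1_000,
}

total_described = sum(DESCRIBED_SPECIES.values())
insect_fraction = DESCRIBED_SPECIES["Insects"] / total_described

# "Almost 40% of the insects are beetles": treat 40% as an upper bound.
beetle_upper_bound = 0.40 * DESCRIBED_SPECIES["Insects"]

if __name__ == "__main__":
    print(f"total described: {total_described:,}")
    print(f"insects: {insect_fraction:.1%} of described species")
    print(f"beetles: somewhat under {beetle_upper_bound:,.0f} species")
```

The tabulated total is about 1.4 million, in line with the rounded figure of about 1.5 million quoted in the box, and insects account for just over half of it.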

The biosphere

Palaeontologist G.E. Hutchinson wrote a book entitled The Ecological Theater and the Evolutionary Play. Life on Earth has been a 3.5 billion year epic of exploration and adventure. It has included comedy and tragedy. Hutchinson emphasized the interdependence of environment and living things, which reciprocally affect each other. (We have already mentioned the example of the photosynthetic origin of atmospheric oxygen.)

We shall never know the full extent of the species that Nature has generated in its explorations (see Box 2.8). Organisms alive today represent only a fraction of all that have existed.

Extinctions

We now recognize that the roster of extant species is in flux. Although Darwin was the first to argue convincingly that new species arise, the idea of extinction was propounded by Cuvier in 1796. Nevertheless, the idea of the immutability of species retained strong adherents, a problem Darwin faced later. When US President Thomas Jefferson sent Meriwether Lewis and William Clark to explore the North American continent, he urged them to be on the lookout for mammoths. Anyone could have hoped that mammoths, extinct in Eurasia, might have survived in the New World; the idea of a fixed roster of species shaded hope into confidence. Cuvier believed that extinction of species was the result of major natural catastrophes. Indeed, sometimes, natural catastrophes do cause large-scale simultaneous extinctions (see Figure 2.15 and Box 2.9). An example is the asteroid that landed in what is now the Yucatan Peninsula of Mexico, approximately 65 million years ago, at the boundary between the Cretaceous and Tertiary ages. However, in other cases, species reach a more peaceful end of their existence – ‘dying

[Figure 2.15: Time Scale of Earth History. Four timelines (Cenozoic, Mesozoic, Palaeozoic, Precambrian), each with an axis in mya.]

Figure 2.15 Geological ages (e.g. Cenozoic), epochs (e.g. Quaternary), and cataclysmic events (e.g. asteroid impact: mass extinction)
in black. First appearance of, or prevalence of, different life forms in red. mya = millions of years ago.

peacefully in their beds’ so to speak – as a result of gradual environmental change.
A major mass extinction is going on right now and we are largely, although not entirely, to
blame. Human activity is responsible for the disappearance of species through excessive
hunting, habitat destruction, transport of species that replace native ones, and the
introduction of toxic chemicals into the environment. Moreover, extinction of one species may
cause the loss of others. For instance, some plants are dependent on a particular insect for
pollination. Loss of the insect is fatal to the plant, and the network of dependence may be
quite far ranging. North American bees are being killed in great numbers, by the Varroa mite,
a parasite that has developed resistance to the pesticides that formerly controlled it, and by
the mysterious colony collapse disorder, which may be at least partly viral in origin. Many
Californian almond growers must now import honeybees to ensure the pollination of their trees.
(The problem is not restricted to almonds: about one-third of our food depends on honeybee
pollination.)

Extinction of the thylacine (Thylacinus cynocephalus)
It is a sad event when the last individual of a species dies. The last thylacine, or Tasmanian
wolf, named Benjamin, died in the Hobart Zoo on 7 September 1936 (see Figure 2.16). (On the
same day, Boulder Dam, now Hoover Dam, started operation. Straddling the Colorado River, it
provides water and electrical power to the south-western USA, including the rich agricultural
lands of Southern California and the city of Los Angeles. The opening of the Hoover Dam

Box 2.9 Mass extinctions

Period               Approximate date (mya)*   Likely cause                         Estimated loss of taxa
Ordovician–Silurian  439                       Formation and melting of glaciers    60% of marine genera
                                                                                    25% of marine families
Late Devonian        364                       ?                                    57% of marine genera
                                                                                    22% of marine families
Permian–Triassic     251                       Asteroid impact (?)                  95% of all species!
                                                                                    53% of marine genera
                                                                                    53% of marine families
                                                                                    70% of land species
End-Triassic         199–214                   Volcanic eruptions, possibly         52% of marine genera
                                               leading to global warming            22% of marine families
                                                                                    ? vertebrate species
Cretaceous–Tertiary  65                        Asteroid impact                      47% of marine genera
                                                                                    16% of marine families
                                                                                    18% of land vertebrate
                                                                                    families (including dinosaurs)
Present              0                         Human activity, in part              ???

* mya, millions of years ago.

and the extinction of the thylacine bear simultaneous witness to human control over nature.)
The thylacine appears prominently on the label of a well-known Tasmanian beer.

Figure 2.16 The Tasmanian wolf, or thylacine (Thylacinus cynocephalus), was a marsupial
showing striking convergent evolution in overall body form to the placental wolf/dog family.
It had stripes on its back and tail. It lived on ‘mainland’ Australia and New Guinea several
thousand years ago, but its recent distribution was confined to south-western Tasmania. The
thylacine was challenged by the introduction of dogs, and hunted to extinction in the early
20th century. Despite claims of sightings since the 1930s, no convincing evidence for survival
has emerged.
The preservation of specimens in museums has prompted suggestions that the thylacine might be
cloned, but it appears that degradation of the DNA has been too extensive.

Conversely, it is a happy event when a threatened species is rescued.

Survival of Père David’s deer (Elaphurus davidianus)
In 1865, Père Armand David (1826–1900), a French Basque priest and biologist, learned of an
unusual animal living in the Imperial Hunting Park, near Beijing, China, within precincts
reserved to the Emperor. He was able to observe the animals and to take an antler and two
hides back to Europe. Recognized as a novel species, the deer were named for the missionary
(see Figure 2.17). (Père David also discovered the giant panda.)
The animal’s range had extended over a large area of western and northern China, but it became
extinct in the wild. (A picture of this deer illustrates a 17th century map of Xensi
province.) A herd in the hunting park survived. The emperor gave about a dozen deer as gifts
to zoos in Europe and Japan. Herbrand Arthur Russell, 11th Duke of Bedford, installed a small
herd on his Woburn Abbey estate.
It was well that he did. The Beijing deer staggered under several blows. There were severe
floods in 1894–1895, reducing the Chinese herd to 20–30 animals. The survivors were lost in
1900 during the military campaign to lift the siege of European legations during the Boxer
Rebellion.
The remaining Père David’s deer were collected from European zoos and pooled in the Woburn
Abbey herd. This contained 18 individuals, of which 11 were fertile. Some – probably not all –
of these were the founder individuals of the entire current population. The deer bred, with
the size of the Woburn Abbey herd reaching 250 in 1945.
In 1956, some deer were sent to Beijing. In 1957, a calf was born in China for the first time
in over 50 years. They are now maintained in dedicated nature reserves and have also been
reintroduced into the wild. The site of the Emperor’s hunting garden is now a park devoted to
Père David’s deer. The deer have come home!

Figure 2.17 Père David’s Deer.
(Photograph by Tony Pearse, taken at Woburn Abbey in 1993.)

● RECOMMENDED READING

• A standard textbook in this field:

Strachan, T. & Read, A. (2010). Human Molecular Genetics, 4th ed. Garland Science, New York.
• Genomics and developmental biology:
Cañestro, C., Yokoi, H., & Postlethwait, J.H. (2007). Evolutionary developmental biology and
genomics. Nat. Rev. Genet. 8, 932–942.
• Genomics and neurobiology:
Boguski, M.S. & Jones, A.R. (2004). Neurogenomics: at the intersection of neurobiology and
genome sciences. Nat. Neurosci. 7, 429–433.
Berger, M.S., Couldwell, W.T., Rutka, J.T., & Selden, N.R. (2010). Introduction: neurogenomics
and neuroproteomics. 28, E1.
Sforza, D.M. & Smith, D.J. (2003). Genetic and genomic strategies in learning and memory.
Curr. Genomics 4, 475–485.
Nestler, E.J. & Hyman, S.E. (2010). Animal models of neuropsychiatric disorders. 13, 1161–1169.
• The general consequences of mutation:
Hill, W.G. & Loewe, L. (2010). The population genetics of mutations: good, bad and indifferent.
Philos. Trans. Roy. Soc. Lond. B: Biol. Sci. 365, 1153–1167.
• A description of the large number of haemoglobin mutants then known, set in a public-health
context:
Weatherall, D.J. & Clegg, J.B. (2001). Inherited haemoglobin disorders: an increasing global
health problem. Bull. World Health Organ. 79, 704–712.

• Cancer genomics:
Ding, L., Wendl, M.C., Koboldt, D.C., & Mardis, E.R. (2010). Analysis of next-generation
genomic data in cancer: accomplishments and challenges. Hum. Mol. Genet. 19,
R188–R196.
Chin, L., Hahn, W.C., Getz, G., & Meyerson, M. (2011). Making sense of cancer genomic data.
Genes & Dev. 25, 534–555.
Majewski, I.J. & Bernards, R. (2011). Taming the dragon: genomic biomarkers to individualize
the treatment of cancer. Nat. Med. 17, 304–312.
Stratton, M.R. (2011). Exploring the genomes of cancer cells: progress and promise. Science
331, 1553–1558.
• A fascinating general essay about evolutionary biology:
Hutchinson, G.E. (1965). The Ecological Theater and the Evolutionary Play. Yale University
Press, New Haven, CT, USA.
• Discussion of the problem of biological species. Papers from a colloquium in honour of 100-year
old Ernst Mayr.
Hey, J., Fitch, W.M., & Ayala, F.J. (2005). Systematics and the origin of species: An introduction.
Proc. Natl. Acad. Sci. USA 102, 6515–6519.
de Queiroz, K. (2005). Ernst Mayr and the modern concept of species. Proc. Natl. Acad. Sci.
USA 102 (Suppl 1), 6600–6607.
• Current and past extinctions:
May, R.M. (2010). Ecological science and tomorrow’s world. Philos. Trans. Roy. Soc. Lond. B:
Biol. Sci. 365, 41–47.
Barnosky, A.D., Matzke, N., Tomiya, S., et al. (2011). Has the Earth’s sixth mass extinction
already arrived? Nature 471, 51–57.
• Two celebrated palaeontologists interpret the Burgess Shale fossils:
Gould, S.J. (1990). Wonderful life: the Burgess Shale and the Nature of History. W.W. Norton,
New York.
Conway Morris, S. (1998) The Crucible of Creation: The Burgess Shale and the Rise of Animals.
Oxford University Press, Oxford.
See also the exchange between Conway Morris and Gould (1998): Showdown on the Burgess
Shale. Nat. Hist. 107, 48–55.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 2.1 A randomly chosen pair of humans will show an average nucleotide diversity
that lies between 1 base pair in 1000 and 1 in 1500. Any human and any chimpanzee differ in
approximately 1 base pair in 100. (Note that the chimpanzee genome is somewhat larger than
the human.) (a) Estimate the number of differences in the total sequences between two randomly
chosen humans. (b) Estimate roughly the number of differences in the total sequences between a
human and a chimpanzee.
Exercise 2.2 Why is it possible to have calico cats but not calico kangaroos (barring abnormal cell
division at an early embryonic stage or a skin disease)?

Exercise 2.3 From material presented in this chapter, give examples of mutations with clinical
consequences that involve: (a) dysfunction of a metabolic enzyme, (b) ‘read-through’ producing
extended proteins, (c) proteins with lower stability than the normal version, (d) increase in the
risk factor for a disease not known to be associated with the primary (?) function of a protein,
(e) enhanced probability of developing cancer. In each case state, if possible, the nature of the
mutation, and the mechanism by which it produces clinical consequences.
Exercise 2.4 Most newborn mammals can digest lactose in infancy, consistent with their
dependence on maternal milk for feeding. The enzyme lactase hydrolyses the disaccharide lactose,
the major carbohydrate component in milk, to glucose + galactose. In most mammalian species,
expression of lactase ceases at the time of weaning. Many human populations follow the
mammalian paradigm: lactase expression is permanently turned off when a child is about 4 years
old. Loss of lactase expression produces subsequent intolerance to dairy products. A single-site
mutation that causes lifetime lactase expression – or lactase persistence – was selected for in
populations that domesticated cattle and depended on dairy products in their adult diet. In Europe,
lactase persistence is more common in the north, where less exposure to sunlight makes northern
Europeans more dependent on dairy products as sources of calcium and vitamin D precursors.
Ninety-six per cent of Swedes and Finns are lactase persistent.
(a) Would you expect the mutation to be found within the coding regions of the lactase gene
itself?
(b) Heterozygotes produce approximately half the amount of lactase. (This is sufficient to avoid
the symptoms of lactose intolerance.) From this observation, what additional inference can you
make about the location of the mutation?
Exercise 2.5 Suppose that there are ten SNPs in a 10 kb region. If the region is on the Y
chromosome, how many possible haplotypes are there? If the region is on a diploid chromosome,
how many possible haplotypes are there?
Exercise 2.6 In the diagram of the protein truncation test (Figure 2.13), what actual triplets appear
in place of the triplets shown in parentheses?
Exercise 2.7 In the diagram of the protein truncation test (Figure 2.13), which pair of peptides has
the larger difference in length: W2 and M2 or W3 and M3?
Exercise 2.8 In the protein truncation test (see Figure 2.13), in cases in which the novel stop
codon arises from a single-site substitution, as shown in the figure, why would it not work to
subject the amplified cDNA fragments directly to electrophoresis, i.e. to skip the in vitro
transcription and translation step?
Exercise 2.9 What would the gel in Figure 2.13 look like if the person from whom the sample was
taken had the mutation C→G at the same site?
Exercise 2.10 In the conformation-sensitive gel electrophoresis technique for detecting mutations
(Figure 2.14), the gel is run under mildly denaturing conditions to accentuate conformational
differences, thereby increasing the differences in mobility. Why would a gel fail to give useful
information if it were run under fully denaturing conditions?
Exercise 2.11 Why might the prescription of monoamine oxidase inhibitors to a child presenting
with symptoms of anxiety, in the case of suspected maltreatment, be contraindicated?
Exercise 2.12 In Figure 2.8 showing the data on Barbary macaques, where did the animal
corresponding to the isolated orange point come from?

Problems
Problem 2.1 Survival of a SNP. Suppose a person is heterozygous for a novel, selectively neutral
mutation. Suppose the person has two children that survive to reproductive age. The probability
of loss of the mutation in that one generation is 25%. If each descendant has two children that
survive to reproductive age, what is the probability of complete disappearance of the mutation in
200 years? Assume 25 years per generation.
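One way to set up Problem 2.1 is as a simple branching process. The sketch below is only an illustration under stated assumptions: every carrier has exactly two surviving children, each child inherits the mutation independently with probability 1/2, and carriers never mate with each other. The function name and parameters are hypothetical, not taken from the text.

```python
def loss_probability(generations, children=2, p_transmit=0.5):
    """Probability that a neutral mutation carried by one heterozygous
    founder has disappeared after the given number of generations.

    Assumptions (illustrative only): every carrier has exactly `children`
    surviving offspring, each of which inherits the mutation independently
    with probability `p_transmit`.
    """
    q = 0.0  # after zero generations the founder still carries the mutation
    for _ in range(generations):
        # A child leaves no surviving copy if it did not inherit the mutation,
        # or if it inherited it and its own lineage later died out.
        q = (1.0 - p_transmit + p_transmit * q) ** children
    return q


if __name__ == "__main__":
    generations = 200 // 25  # 200 years at 25 years per generation
    print(loss_probability(1))            # one generation: 0.25, as stated in the problem
    print(loss_probability(generations))  # loss within 200 years
```

With one generation the function reproduces the 25% figure quoted in the problem, which is a quick check that the recursion matches the stated model.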
Problem 2.2 Replication of RNA viruses is error-prone. It is estimated that the replication of
HIV-1 introduces one mistake per replication of its ∼10⁵ bp genome. If the estimated generation
time of HIV-1 in a human body is ∼1 day and 10¹⁰ progeny viruses are produced per patient
per day, and assuming that an AIDS patient is initially infected by viruses with identical
genomes, estimate whether the patient will (a) generate a mutation at every possible site
in the genome every day; and (b) generate mutations at every possible pair of sites in the
genome every day.
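The counting behind Problem 2.2 can be sketched in a few lines. This is a rough order-of-magnitude comparison, not a full solution: it assumes that every error is a single-base substitution, that each site can change to any of three other bases, and that errors fall uniformly along the genome. The variable names are illustrative; the genome length and daily virus production are the figures quoted in the problem.

```python
from math import comb

GENOME_LENGTH = 10**5        # sites, as quoted in the problem
PROGENY_PER_DAY = 10**10     # new virions per patient per day, as quoted
ERRORS_PER_REPLICATION = 1   # one mistake per genome copy, as quoted
ALTERNATIVE_BASES = 3        # assumption: substitutions only

# Number of distinct single-site variants and distinct two-site variants.
single_variants = GENOME_LENGTH * ALTERNATIVE_BASES
double_variants = comb(GENOME_LENGTH, 2) * ALTERNATIVE_BASES**2

# Total new errors introduced per day across all progeny.
errors_per_day = PROGENY_PER_DAY * ERRORS_PER_REPLICATION

if __name__ == "__main__":
    print("single-site variants:", single_variants)
    print("two-site variants:   ", double_variants)
    print("errors per day:      ", errors_per_day)
    print("errors per single-site variant:", errors_per_day / single_variants)
    print("more two-site variants than daily progeny:",
          double_variants > PROGENY_PER_DAY)
```

Comparing the number of possible variants with the number of genomes produced per day gives the scale of the answer; a more careful treatment would model the number of errors per genome as a random variable.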

Problem 2.3 Assume the model of expression switching in the human β-globin region shown in
Figure 2.2(b,c). (a) What developmental progression of globin expression would you expect if
the region containing the Gγ and Aγ genes were interchanged with the region containing the δ
and β genes? (b) What developmental progression would you expect in patients with deletions
in the δ- and β-regions, including the pyrimidine-rich control region YR?
Problem 2.4 (a) In the experiment that produced Cc (Copycat) as a clone of Rainbow (see
Figure 2.3), the X-linked inactivation of the cell selected was not reversed. If a number of other
cats were cloned by the same procedure, from other cells taken from Rainbow, what coat colours
would the cats produced show? (b) Suppose a way were discovered to reverse the X-linked
inactivation in the cell from Rainbow from which another cat were cloned. Would the cloned
cat have a coat indistinguishable in appearance from Rainbow’s? (c) Would monozygotic twin
(= ‘identical’ twins, in genome sequence at least) natural daughters of Rainbow show identical
coat colour patterns? (d) If your answer to (c) suggests that the twin daughters might not look
the same, how then could you be sure that two daughter cats are monozygotic twins, without
sequencing their entire genomes?
Problem 2.5 A transitive relationship is one such that if A is related to B, and B is related to C,
then A is related to C. (In arithmetic, equality is a transitive relation – if A = B and B = C, then
A = C.) Consider whether ‘belong to the same species’ is a transitive relationship. There are
examples of ‘ring species’, sets of geographically connected populations such that members of
neighbouring populations can interbreed, but at least two populations, not necessarily the most
geographically distant, cannot. A classic example is the Larus gulls, which inhabit regions
surrounding the North Pole (Figure 2.18).

For which definitions of species (see p. 66) is ‘belong to the same species’ a transitive relationship?
Problem 2.6 We know that viruses can deliver genes to mammals, including humans. Examples
include curing X-linked adrenoleukodystrophy in humans, and increasing the apparent intelligence
of mice. (a) What kinds of problems would you expect to arise in setting guidelines for what
genetic modifications of humans should be allowed? (b) How far could you go towards drafting
such a set of guidelines?

Problem 2.7 In a study of brown bear (Figure 2.19) populations in northwest North America,
samples of mitochondrial DNA were collected and sequenced from 317 free-ranging brown bears
(Ursus arctos) from 22 localities. Forty-six variable sites corresponded to 29 haplotypes, which
clustered into four major clades (see Table 2.2). Table 2.3 identifies the location(s) at which bears
with the corresponding haplotypes were found.

(a) On a copy of a map showing Alaska, northwestern Canada, and the lower 48 states as far
south as northern Wyoming, mark in different colours the sites of appearance of bears
with mitochondrial DNA in the different classes. Use the data in Table 2.3. Describe the
geographical distribution of the different classes. Do they overlap substantially?

Figure 2.18 North circumpolar distribution of Larus gull species.

1. Lesser Black-backed Gull (Larus fuscus)
2. Lesser Black-backed Gull (Siberian population of Larus fuscus)
3. Heuglin’s Gull (Larus fuscus heuglini or Larus heuglini)
4. Birula’s Gull (Larus argentatus birulai)
5. East Siberian Herring Gull (Larus argentatus vegae or Larus vegae)
6. American Herring Gull (Larus argentatus smithsonianus or Larus smithsonianus)
7. Herring Gull (Larus argentatus)
Species linked by arrows can interbreed. Notably, however, Lesser Black-backed Gulls (1) and Herring Gulls
(7) cannot.
By Global_European_Union.svg: S. Solberg J. this file: Own work: Frédéric Michel (Global_European_Union.svg)
[CC-BY-3.0 (www.creativecommons.org/licenses/by/3.0)], via Wikimedia Commons.

Figure 2.19 North American brown bear cubs playing.

(Photograph by S. Hillebrand, from US Fish and Wildlife digital library.)

Table 2.2 Mitochondrial sequence data from brown bears in northwest North America.
These data do not represent continuous sequences, but only the variable positions.
A dot indicates that the base at this position is the same as in the reference sequence
in the first line.

Clade       Sequence data                                    Location of collection

Clade I     CCCTCCCAACGTTAACATTACGTAATCGAACAGCGGGTTAGGGAAC   O
            ..T.T.T.......................................   Q
            ..T.TTT.......................................   PU
            ....T.........................................   Q
            ....T.T.......................................   MOPRSV
            ....TTT....................................G..   U
            ....TTT.......................................   T

Clade II    ......TGG......T.......G.........T.......AA...   HN
            ......TGG.....GT.........................AA.G.   G
            ......TGG......T............G..........G.AA.G.   LM
            ......TGG......T............G............AA.G.   IJK
            ......TGG......TG................T.......AA.G.   K
            ......TGG.....GT...G.................C...AA.G.   H
            ......TGG......T.................T...C...AA.G.   H
            ......TGG......T.................T.......AA.G.   H
            ......TGG.....GT.......G.............C...AA.G.   H

Clade III   TT..T.T..T..CG...................T......A.A..T   BE
            TT..T.T..T..CG........C..........T........A..T   ABCDE
            TT..T.T.....CG........C.........ATA.......A..T   B
            TT..T.T.....CG........C..........T...........T   BC
            TT..T.T..T..CG........C..........T...........T   A
            TT..T.T..T.CCG........C.........ATA.......A..T   B
            TT..T.T..T..CG........C..........T..A.....A..T   A
            TT..T.T..T..CG...............G...T..A.....A..T   C

Clade IV    .T.C......A......C..TA.GGCT...T....A..C...A..T   F
            .T.C....G.A......C.GTA.GGCT...T....A..C...A..T   F
            .T.C......A......C.GTA.GGCT...T....A..C...AG.T   F

Table 2.3 Sample collection locations of bears with sequences in Table 2.2. For instance,
bears with the first haplotype of clade III were found at location B (latitude 48.0°N,
longitude 113.0°W) and location E (latitude 51.0°N, longitude 118.1°W).

Location   Latitude, °N   Longitude, °W
A          44.4           110.3
B          48.0           113.0
C          48.5           116.5
D          51.0           114.2
E          51.0           118.1
F          57.2           134.6
G          58.7           133.5
H          60.8           139.5
I          67.8           115.3
J          69.2           124.0
K          69.4           129.0
L          69.0           138.0
M          69.2           143.8
N          60.1           142.5
O          63.0           145.5
P          60.0           154.8
Q          63.1           151.0
R          68.5           158.0
S          65.5           165.0
T          55.2           162.7
U          58.3           155.0
V          60.9           161.2

(b) You are lost somewhere in northwest North America. You collect a tissue sample from a brown
bear in the vicinity and determine the following mitochondrial DNA haplotype (relative to the
reference sequence in Table 2.2):
.T.C......A......C.GTA.GGCT...T....A..C...A..T
Where are you?
(c) Additional sequences were collected from four polar bears (Ursus maritimus) from zoos:
Reference sequence   CCCTCCCAACGTTAACATTACGTAATCGAACAGCGGGTTAGGGAAC
Polar bear 1         .T........A......CCGTA.GG.....T....A..C...A..T
Polar bear 2         .T........A......CCGTA.GG..A..T....A..C...A..T
Polar bear 3         .T........A......CCGTA.GG..A...GA..A..C...A..T
Polar bear 4         .T........A......CCGTA.GG..A....A..A......A..T
(The reference sequence, from a brown bear, is the same as the reference sequence in
Table 2.2.) To which class of brown bears are these polar bears most closely related? Where
are the brown bears most closely related to polar bears found?
(d) For the brown bear sequences, compute the average number of sequence differences between
classes. Assuming a divergence rate of ∼0.125 sequence changes per 10 000 years, estimate
the times since divergence of the classes. Note that the total length of the region sequenced
was 294 bases.
Brown bears immigrated to North America from Asia over a temporary land bridge
over the Bering Strait. The first fossil evidence for brown bears in the New World is from
50 000–70 000 years ago. Is it likely that the classes diverged in North America or that they
had already diverged in Asia?
(e) Assume that (1) the original North American brown bear population contained all of the
currently observed haplotypes; and (2) there is now a continuous brown bear habitat covering
all areas listed in Table 2.3. What then accounts for the current geographical distribution of the
different haplotype classes? There are two questions to address:
• what accounted for their initial separation?
• what is continuing to keep them separated?

Consider the following possible scenarios:

• Climate changes in the past, associated with glaciation, fragmented the population and left
pockets that reflect ‘founder effects’.
• Climate changes in the past, associated with glaciation, fragmented the population, and
mutations within each group of bears accumulated to form separate haplotype groups.
• Female bears tend to be more philopatric than males, i.e. male bears have larger home
ranges and disperse greater distances than females. Daughters often set up home ranges
within their mother’s home range. (Recall that mitochondrial DNA sequences tell us only
about patterns of maternal inheritance.)
To what extent do these scenarios account for the observations? What additional experiments
would you suggest to illuminate the situation further?
(f) Ethical, legal, and social issues. The division of North American brown bears into subspecies
affects conservation issues. The brown bear populations of the lower 48 states of the USA are
endangered. If these populations represent an evolutionary significant unit (ESU), they could
be protected as a distinct population segment under the US Endangered Species Act. To
qualify as an ESU, a population must (1) be substantially reproductively isolated from other
populations of that species; and (2) contain an important component in the evolutionary
legacy of a species (69 Fed. Reg. at 31355). Outline a petition to the US Secretary of the
Interior or the US Fish and Wildlife Service arguing that the US brown bear population of the
lower 48 states should be declared an ESU and listed as a distinct population segment. What,
if any, additional data would you recommend collecting that might strengthen the case?
(In fact, such a petition has been successful, and the brown bears in the lower 48 states are
currently protected under the Endangered Species Act.)
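For the sequence comparisons in Problem 2.7, the dot notation of Table 2.2 first has to be expanded against the reference sequence before haplotypes can be compared position by position. The sketch below shows one possible way to do this. The function names are hypothetical, and only the reference sequence, the three clade IV haplotypes, and the haplotype from part (b) are copied from the problem; the remaining rows of Table 2.2 could be added in the same way.

```python
REFERENCE = "CCCTCCCAACGTTAACATTACGTAATCGAACAGCGGGTTAGGGAAC"

# Clade IV haplotypes as written in Table 2.2 (dots mean "same as reference").
CLADE_IV = [
    ".T.C......A......C..TA.GGCT...T....A..C...A..T",
    ".T.C....G.A......C.GTA.GGCT...T....A..C...A..T",
    ".T.C......A......C.GTA.GGCT...T....A..C...AG.T",
]

# Haplotype of the unknown bear in part (b).
QUERY = ".T.C......A......C.GTA.GGCT...T....A..C...A..T"


def expand(haplotype, reference=REFERENCE):
    """Replace each dot by the reference base at the same position."""
    if len(haplotype) != len(reference):
        raise ValueError("haplotype and reference must have the same length")
    return "".join(r if h == "." else h for h, r in zip(haplotype, reference))


def differences(a, b, reference=REFERENCE):
    """Number of variable positions at which two dot-notation haplotypes differ."""
    return sum(x != y for x, y in zip(expand(a, reference), expand(b, reference)))


if __name__ == "__main__":
    for row in CLADE_IV:
        print(row, differences(QUERY, row))
```

The same `differences` function can be applied to every pair of rows to obtain the average number of differences between classes asked for in part (d).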

Weblems
Weblem 2.1 Using the Online Mendelian Inheritance in Man™ site (http://www.ncbi.nlm.nih.gov/omim),
what are the clinical consequences of the mutations associated with (a) haemoglobin
Adana? (b) haemoglobin Malhacen?
Weblem 2.2 What animal models are available for the human mental diseases schizophrenia,
depression, and bipolar disorder?
Weblem 2.3 Mammalian females achieve dosage compensation by silencing one of the X
chromosomes in each cell. In birds, however, males are homogametic (ZZ) and females are
heterogametic (ZW), with the W chromosome being gene deficient like the mammalian Y
chromosome. Do male birds achieve dosage compensation by silencing one of their Z
chromosomes?
Weblem 2.4 Determine a feature of a human MHC haplotype that is associated with a relatively
slow progression to AIDS after HIV infection.
Weblem 2.5 Which MHC haplotypes give indications for effectiveness of use of the following
drugs: (a) carbamazepine (epilepsy, bipolar disorder) (b) ximelagatran (an anticoagulant)?
Weblem 2.6 Suggest menus, as close as possible to normal diets, for breakfast, lunch, and dinner,
appropriate for a university student with PKU. Include recommended amounts.
Weblem 2.7 The three most frequent mutations in BRCA1 and BRCA2 genes in Ashkenazi Jews
are 185delAG and 5382insC in BRCA1 and 6174delT in BRCA2. In what exons of these genes do
these mutations appear?
Weblem 2.8 What are the most common mutations in BRCA1 among Swedish women?
CHAPTER 3

Mapping, Sequencing,
Annotation, and Databases

LEARNING GOALS

• To know some of the important landmarks in the historical background, including the classical
work of Darwin and Mendel, through Morgan and Sturtevant, to the more recent research
leading to the discovery of the double-helical structure of DNA and the development of the
human genome project.
• To distinguish different types of map – genetic linkage maps, chromosome banding patterns,
restriction maps, and DNA sequences – and the relationships among them.
• To understand the relationships among linkage, linkage disequilibrium, haplotypes and the
collection of single-nucleotide polymorphism (SNP) data.
• To understand how the basic principles of DNA sequencing developed from the initial
breakthroughs to the current automated high-throughput systems.
• To understand the primer extension reaction catalysed by DNA polymerase and its termination
by dideoxynucleoside triphosphates.
• To understand the basis and importance of the polymerase chain reaction (PCR) as a method for
amplification of selected DNA sequences within a mixture.
• To grasp the significance and relationships of reads, overlaps, contigs, and assemblies as parts of
an overall strategy and organization of a sequencing project.
• To be familiar with sequencing gels using autoradiography and to appreciate the advantages of
using fluorescent chain-terminating dideoxynucleoside triphosphates.
• To understand new developments in high-throughput sequencing, and the goals set for the next
few years of development.
• To be able to contrast hierarchical strategies of whole-genome sequencing based on maps and
BAC clones, with the whole-genome shotgun approach.
• To understand techniques of genetic testing based on sequence determination.

Classical genetics as background

Similarities between parents and offspring, within human families, and in animals and plants,
have always been obvious. An understanding of how heredity works is more recent.

The story begins in a 10-year period starting in the late 1850s. Threads then cast on would
require a further century to ramify and intertwine.

• On 1 July 1858, Charles Darwin and Alfred Russel Wallace presented to the Linnaean Society of
London their ideas on the development of species through natural selection of inheritable
traits.* Darwin’s book, The Origin of Species, appeared on 22 November 1859.
• In 1860, Louis Pasteur published the observation that the mould Penicillium glaucum
preferentially metabolized the l-form of tartaric acid, one of two mirror-image molecules that
have identical structures except that one is left-handed and the other right-handed. This
work, based on Pasteur’s earlier separation of racemic tartaric acid by manual selection of
crystals of different shape, brought the idea of three-dimensional molecular structure into
biology: to understand how a biological process works, we must know the detailed spatial
structure of the relevant molecules. Thus began a long courtship between biology and molecular
structure that has flowered into an intimate and fecund marriage.
• On 8 February and 8 March 1865, the monk Gregor Mendel read his paper, Experiments on Plant
Hybridization, to the Natural History Society of Brünn in Moravia. (US President Abraham
Lincoln was assassinated on 15 April.) His paper was published the following year in the
Society’s Proceedings. Mendel had entered the monastery, instead of becoming a teacher,
because he had failed the botany exam. Twice! The price of insisting on original ideas.

A lack of understanding of the mechanism of heredity was a barrier to the development of
Darwin’s insights. Mendel’s work supplied the crucial missing ideas: the discreteness and
persistence of the elements of hereditary transmission, later called genes. Nevertheless,
although copies of the Proceedings of the Natural History Society of Brünn were distributed
around the scientific community, Mendel’s work went largely unnoticed until it was
rediscovered early in the 20th century.
There are several ironic aspects to this situation.
1. Among the recipients in Britain of the Brünn Society Proceedings were the Royal Society of
London and the Linnaean Society. Darwin was a member of both and had access to their
libraries. Even more, Darwin personally owned a book by W.O. Focke, Plant Hybridisation,
published in 1880, which included a section describing Mendel’s work and its implications.
When J.G. Romanes, preparing an article on hybrids for the Encyclopaedia Britannica, appealed
to Darwin for help in making his review complete, Darwin sent his copy of Focke’s book. But,
in one of the nearest near-misses in scientific history, neither Darwin nor Romanes read the
section on Mendel’s work. (How do we know this? The relevant pages were never cut open! The
book is now in the Cambridge University Library, with the pages still intact.)
2. Darwin came close to an independent statement of Mendel’s conclusion, that traits that
differ between parents persist in the offspring, rather than blend. In a letter to Thomas
Huxley in 1857,
* Several earlier writers had suggested the idea of nat- Darwin wrote:
ural selection, including William Charles Wells and Patrick ‘I have lately been inclined to speculate, very crudely
Mayhew. It appears, for example, in the appendix to May-
and indistinctly, that propagation by true fertilisation
hew’s 1831 book, On Naval Timber and Arboriculture,
will turn out to be a sort of mixture, and not true fusion,
explicitly alluding to the possibility of creating novel spe-
cies. (Mayhew’s interest was in optimizing the growth of of two distinct individuals, or rather of innumerable
trees for building warships for the Royal Navy.) Mayhew individuals, as each parent has its parents and ancestors.
complained when The Origin of Species first appeared, and I can understand no other view of the way in which
Darwin gave credit to Wells and Mayhew in subsequent crossed forms go back to so large an extent to ancestral
editions. forms. But all this, of course, is infinitely crude.’

Indeed, Darwin himself hybridized pea plants and observed segregation of traits! In 1866 he wrote to Wallace:
‘I crossed the Painted Lady and Purple sweetpeas, which are very differently coloured varieties, and got, even out of the same pod, both varieties perfect but none intermediate.’
In the same letter, Darwin pointed out, as an obvious example of discrete rather than blending inheritance, that male and female parents give rise to male and female offspring.

The legacy of the 1860s – Darwin, Pasteur, Mendel – was completed by the discovery of DNA by Friedrich Miescher in 1869. The structure and function of DNA were equally unknown. However, microscopic observations of the role of chromatin in fertilization led quite early on to suggestions that Miescher’s substance was ‘responsible . . . for the transmission of hereditary characteristics’.* The cell biologists got there first, and got it right.

The idea of DNA as the hereditary material then vanished for many years. It encountered considerable resistance when it was subsequently proposed again.

* Hertwig, W.A.O. (1885). Das Problem der Befruchtung und der Isotropie des Eies, eine Theorie der Vererbung. Jenaische Zeitsch. f. Medizin u. Naturwiss. 18, 276–318. For an article about Miescher’s scientific career, see Dahm, R. (2005). Friedrich Miescher and the discovery of DNA. Dev. Biol. 278, 274–288.

What is a gene?

In 1928, Frederick Griffith studied virulent and non-virulent strains of Streptococcus pneumoniae, showing that the virulent strain, even if killed, contained a substance that could transform a non-virulent strain into a virulent one and that the induced virulence is heritable. In 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty tested different chemical components of the cell for transforming activity. They identified the DNA from the virulent strain as the molecule that induced the transformation. (We now interpret bacterial transformation as ‘horizontal gene transfer’.) As controls, they showed that transformation was inhibited by enzymes that destroy DNA but not by enzymes that destroy proteins. However, their contemporaries were not receptive to their conclusions. General acceptance of the idea that DNA is the hereditary material awaited the 1952 experiments of Alfred Hershey and Martha Chase, who showed that when bacteriophage T2 replicates itself in Escherichia coli, it is the viral DNA and not the viral protein that enters the host cell and carries the inherited characteristics of the virus.

• DNA was discovered in 1869. Avery, MacLeod, and McCarty showed in 1944 that DNA was the active substance in bacterial transformation. Hershey and Chase showed that during bacteriophage infection, DNA but not protein entered the host cell.

Maps and tour guides

Maps tell us where things are. More specifically, they tell us where things are in relation to other things. In genomics, maps have been essential in revealing the organization of the hereditary material.

Different types of map describe different types of observation:

1. Linkage maps of genes
2. Banding patterns of chromosomes
3. Restriction maps – DNA cleavage fragment patterns
4. DNA sequences.

Genes, as discovered by Mendel, were entirely abstract entities. Chromosomes are physical objects, with banding patterns as their visible landmarks. Only with DNA sequences are we dealing directly with stored hereditary information in its physical form. Restriction maps are in effect partial DNA sequences – they give the positions of particular oligonucleotides within DNA molecules.

It was the very great achievement of the last century of biology to forge connections between these maps. A crucial idea that emerged from mapping is that the organization of hereditary information is linear.

The first steps – and giant strides they were indeed – proved that, within any chromosome, linkage maps are one-dimensional arrays. In fact, all of these types of map are one-dimensional, and indeed they are co-linear. Any school child now knows that genes are strung out along chromosomes and that each gene corresponds to a DNA sequence. But the proofs of these statements earned a large number of Nobel Prizes.

The complete sequences of genomes are the culmination of the entire mapping enterprise. But genome sequences describe the hereditary information of organisms in only a one-dimensional and static form. What they don’t tell us is (a) how this information is implemented in space and time; (b) how gene expression is choreographed by orderly developmental programmes; and (c) the influence of surroundings and experience on the structure and activities of the organism.

Genetic maps

Gene maps were classically determined from patterns of inheritance of phenotypic traits (see Box 3.1).

Mendel discovered that elements of heredity are discrete and persist through a lineage. He observed recessive characteristics that disappear in one generation but re-emerge in a later one. Expressed in modern terminology, he observed traits that depend on a single locus, with a dominant and a recessive allele – denote them D and d. Mendel further observed that the distribution of phenotypes followed simple statistical rules. The offspring of Dd and Dd parents showed the dominant phenotype three times as often as the recessive. He inferred that the genotypes of the offspring, DD, Dd, dD (all three producing the dominant phenotype) and dd (producing the recessive phenotype) occur at the frequencies expected from segregation in the gametes, and independent transmission to the offspring, of the elements of heredity.

Studying the simultaneous inheritance of several traits, Mendel found further statistical regularities consistent with the existence of stable, independent, and persistent elements of heredity that distribute themselves randomly. The cards are shuffled and a new hand is redealt to each offspring.

Mendel had no concept of the physical nature of the elements of heredity that he was studying and was content to describe their behaviour in abstract terms. It is interesting to compare Mendel’s contemporary, the physicist James Clerk Maxwell. Despite the success of Maxwell’s kinetic theory – based on an abstract model of particles in random motion – the idea that matter is actually composed of tiny particles gained widespread acceptance only with Perrin’s studies of Brownian motion, over half a century later. Indeed, it is questionable whether a general belief in the physical reality of atoms came before or after a general belief in the physical reality of genes!

BOX 3.1 Vocabulary inherited from classical genetics

Here are traditional meanings of these terms. We must reconsider how they should be defined in the light of recent understanding.

Gene: A bearer of hereditary information.*
Trait: An observable property or feature of an individual organism.
Phenotype: The collection of observable traits of any individual, other than genomic sequence.
Genotype: The sequence of any individual’s genome.
Allele: One of the set of possible genes that govern a particular trait.
Homozygote: An individual that has two identical alleles at some locus.
Heterozygote: An individual that has two different alleles at some locus.
Segregation: The separation of corresponding alleles during the reproductive process.
Independent assortment: The uncorrelated choices of genes for different characters that each parent transmits to children.
Linkage: Absence or reduction of independent assortment of parental genes, which are usually transmitted together because they lie on the same chromosome.

* With apologies to A.S. Eddington.

Linkage

Mendel did not report that in some cases genes for different traits do not show independent assortment but are linked, i.e. their alleles are co-inherited.
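
The 3:1 ratio described under 'Genetic maps', and the expectation under independent assortment that linked genes fail to meet, can be checked by simple enumeration. The following Python sketch is illustrative only: the function names (`gametes`, `phenotype`, `cross`) and the genotype encoding are assumptions made for this example, not anything defined in the text, and the code assumes complete dominance, equally likely gametes, and no linkage.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def gametes(genotype):
    """List the equally likely gametes of one parent (one allele per locus).

    A genotype is a list of two-letter strings, one per locus,
    e.g. ["Dd"] or ["Aa", "Bb"] (hypothetical encoding for this sketch).
    """
    return ["".join(alleles) for alleles in product(*genotype)]


def phenotype(gamete1, gamete2):
    """Return a phenotype label: upper case at a locus if either allele is dominant."""
    return "".join(
        a.upper() if a.isupper() or b.isupper() else a.lower()
        for a, b in zip(gamete1, gamete2)
    )


def cross(parent1, parent2):
    """Phenotype frequencies among offspring, assuming independent assortment."""
    counts = Counter(
        phenotype(g1, g2)
        for g1 in gametes(parent1)
        for g2 in gametes(parent2)
    )
    total = sum(counts.values())
    return {label: Fraction(n, total) for label, n in counts.items()}


if __name__ == "__main__":
    print(cross(["Dd"], ["Dd"]))              # single locus
    print(cross(["Aa", "Bb"], ["Aa", "Bb"]))  # two unlinked loci
```

For the Dd × Dd cross this enumeration gives 3/4 dominant to 1/4 recessive. For two unlinked loci it gives the 9:3:3:1 pattern of phenotypes. Linked genes depart from the second result because the four gamete types are then no longer equally frequent.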

Linked traits are governed by genes on the same chromosome. However, in many cases linkage is incomplete. During gamete formation, alleles on different chromosomes of a homologous pair can recombine. This occurs as a result of crossing over, the exchange of material between homologous chromosomes during copying in meiosis (see Figure 3.1).

[Figure 3.1 diagram: parental genotype (diploid) A B / a b; after meiosis, gametes A B + a b (no recombination) or A b + a B (recombination).]

Thomas Hunt Morgan, at Columbia University in New York City, USA, observed varying degrees of linkage in different pairs of genes. He suggested that the extent of recombination could be a measure of the distance between the genes on the chromosome. Morgan’s student Alfred Sturtevant, then an undergraduate, made a crucial observation: the data were consistent with a linear distribution of genes. What he found was that genetic distance, as measured by crossing-over frequency, was additive. Consider three genes A, B, and C. Suppose that the distance from A to B is 5 and the distance from B to C is 3. Then, if the distance from A to C is 8 = 5 + 3, the observations are consistent with a linear and additive structure with gene order A–B–C. (Alternatively, the distances would also be additive if the distance from A to C were 2, implying gene order A–C–B.) Note that additivity of distances does not hold for points at the vertices of a triangle rather than on a line.

Sturtevant’s analysis made it possible to determine the order of genes along each chromosome and to plot them along a line at positions consistent with the distances between them. The unit of length in a gene map is the Morgan, defined by the relation that 1 centimorgan (cM) corresponds to a 1% recombination frequency. We now know that 1 cM is ∼10^6 bp in humans, but it varies with the location in the genome, the distance between genes, and the gender of the parent: for males 1 cM is ∼1.05 Mb; for females, 1 cM is ∼0.88 Mb. Crossing over is reduced in pericentromeric regions. Other regions are ‘hot spots’ for crossing over. It is estimated that ∼80% of genetic recombination takes place in no more than ∼25% of our genome.
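
Sturtevant's additivity test and the conversion between genetic and physical distance can be written out in a few lines. The Python sketch below is a minimal illustration under stated assumptions: the names `consistent_orders`, `cm_to_mb`, and `MB_PER_CM` are hypothetical, distances are taken as exact numbers, and the conversion factors are only the rough human averages quoted in the text (they vary along the genome).

```python
from itertools import permutations

# Approximate human conversion factors quoted in the text (Mb per cM).
# These are genome-wide averages, not constants.
MB_PER_CM = {"average": 1.0, "male": 1.05, "female": 0.88}


def consistent_orders(dist, tol=1e-9):
    """Return gene triples (left, middle, right) whose distances are additive.

    `dist` maps frozenset({x, y}) to the map distance between genes x and y.
    A triple is reported when d(left, middle) + d(middle, right) equals
    d(left, right); mirror images such as C-B-A are not listed separately.
    """
    genes = sorted({g for pair in dist for g in pair})

    def d(x, y):
        return dist[frozenset((x, y))]

    orders = []
    for left, mid, right in permutations(genes, 3):
        additive = abs(d(left, mid) + d(mid, right) - d(left, right)) <= tol
        if left < right and additive:
            orders.append((left, mid, right))
    return orders


def cm_to_mb(cm, parent="average"):
    """Convert a genetic distance in cM to an approximate physical distance in Mb."""
    return cm * MB_PER_CM[parent]


if __name__ == "__main__":
    example = {frozenset("AB"): 5, frozenset("BC"): 3, frozenset("AC"): 8}
    print(consistent_orders(example))   # gene order consistent with additivity
    print(cm_to_mb(5, parent="female"))
```

With distances A–B = 5, B–C = 3, and A–C = 8, the only additive order is A–B–C; changing A–C to 2 gives A–C–B instead, and a set of distances that forms a triangle gives no consistent order at all.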
Linkage guides the search for genes. To identify the gene responsible for a disease, look for a marker of known location that tends to be co-inherited with the disease phenotype. The target gene is then likely to be on the same chromosome, at a position near to the marker.

Figure 3.1 Consider two loci on the same chromosome. One locus has alleles A and a, the other has alleles B and b. One individual has alleles A and B on one chromosome (pink) and alleles a and b (blue) on the other (top). Gametes from this individual may form without recombination to give a haploid gamete containing alleles A and B and another haploid gamete containing alleles a and b (lower left). Each gamete contains a chromosome identical (at least as far as these loci are concerned) to one of the parental chromosomes. Alternatively, crossover between the two loci may produce recombinant gametes: one haploid gamete containing alleles A and b and another haploid gamete containing alleles a and B (lower right). Neither gamete contains a chromosome identical to a parental one. The fraction of gametes showing recombination depends on several factors, notably the distance between the loci. (The same fractions of recombinants and non-recombinants would be produced even if the parent were homozygous, although they would be likely to be indistinguishable.)
In the absence of selection, the fraction of viable recombinant and non-recombinant gametes produced depends only on the genotype of the individual that produced them. It has nothing to do with the allele distribution in the population. If a remote descendant had the parental genotype shown here, its gametes would show the same fractions of recombinants and non-recombinants.

Linkage disequilibrium

Figure 3.1 showed the gametes arising from one parent, heterozygous for two traits. The recombination rate depends only on the structure of the chromosomes of this individual – not on whether this genotype is rare or common in a population.

Now suppose that we have a large interbreeding population and that every individual in the population has the same parental genotype shown in Figure 3.1. How will the genotype distribution in the population develop? (Assume that no combination of alleles for these two traits has any selective advantage or produces any preferential mating pattern.) Recombination at meiosis, followed by zygote formation, can in

principle produce individuals of three genotypes: AB/ab, Ab/aB = aB/Ab and ab/ab (where, for instance, AB/ab signifies an individual with alleles AB on one chromosome and ab on the other; this is the parental genotype in Figure 3.1). Starting from the original completely AB/ab population, eventually recombination will randomize the allelic correlation, producing a population in which the ratio of genotypes AB/ab : Ab/aB = aB/Ab : ab/ab is 1:2:1. (This assumes that the overall gene frequency of the population is A = a and B = b; see Problem 3.1.)

How fast the randomization occurs depends on the recombination rate, which depends on the genetic distance between the loci. In the short term, if recombination is infrequent, the parental genotype AB/ab will continue to predominate. The deviation of the genotype distribution in the population from the ultimate 1:2:1 ratio is called linkage disequilibrium.

• Two markers are in linkage disequilibrium if the observed distribution of different combinations of alleles differs from that expected on the basis of independent hereditary transmission of the individual alleles.

In the absence of linkage disequilibrium, the frequencies of allelic combinations will be proportional to the products of the frequencies of the individual alleles. Linkage disequilibrium measures the deviation from this equilibrium distribution.

If the overall fractions of the alleles at two loci are p_A, p_a = 1 − p_A, p_B, p_b = 1 − p_B, then:

equilibrium value of p_AB = p_A × p_B
equilibrium value of p_Ab = p_A × p_b
equilibrium value of p_aB = p_a × p_B
equilibrium value of p_ab = p_a × p_b.

Note that:
• there is no necessary relationship between p_A and p_B;
• at equilibrium:
p_AB × p_ab = p_Ab × p_aB = p_A × p_B × p_a × p_b;
• a measure of linkage disequilibrium is:
D = p_AB × p_ab − p_Ab × p_aB
where D = 0 implies that the system is at equilibrium, D > 0 implies that chromosomes with AB and ab are more common than expected, D < 0 implies that chromosomes with Ab and aB are more common than expected.

Suppose there is no selection or gene import into the population, i.e. the overall frequencies of individual alleles in the population remain constant. Then linkage disequilibrium will decay as a result of recombination. With each successive generation, the value of D will become closer to 0.

Linkage and linkage disequilibrium are closely related but distinct concepts

Linkage is about the distribution of loci among chromosomes. Linkage disequilibrium is about the distribution of allelic patterns in populations. Close linkage of two loci on a chromosome is a common source of long-term persistence of linkage disequilibrium. Two genes at opposite ends of the same chromosome, although formally linked, may not show significant linkage disequilibrium, because crossing over is frequent. Conversely, it is possible – although rare – to observe linkage disequilibrium between two genes on different chromosomes. (This can happen in two ways: (1) a community of immigrants imports a particular set of single-nucleotide polymorphisms (SNPs) into a larger population and they preferentially inter-marry for many generations, or (2) theoretically by interactions between gene products that permit only certain combinations of alleles to be viable.)

Classical linkage maps typically involved markers no less than 1 cM apart (∼1 Mb in humans) (this is the situation shown in Figure 3.2). Linkage disequilibrium is detectable between markers ∼0.01–0.02 cM apart (∼10–20 kb). Therefore, linkage disequilibrium is a much finer tool for localizing a target gene.

Chromosome banding pattern maps

Banding patterns are visible features on chromosomes (see Box 3.2). The most commonly used pattern is G-banding, produced by Giemsa stain. The bands reflect base composition and chromosome loop structure. The darker regions tend to contain highly condensed heterochromatin of relatively low GC/AT ratio and sparse in gene content.

The karyotype of an individual comprises the structures of the individual chromosomes. The karyotype

is largely constant for all individuals within a species, but varies between species. This is the result of chromosome rearrangement during evolution. The inability of cells with incongruent karyotypes to pair properly is one barrier to fertility that contributes to species divergence.

Although most individuals of a species have the same karyotype, occasionally aberrant chromosomes appear. Some of them are lethal and others are correlated with disease. (For example, Prader–Willi or Angelman syndromes; see Box 3.2.) Studies of chromosome banding patterns support several types of investigations.

BOX 3.2 Nomenclature of chromosome bands

In many organisms, chromosomes are numbered in order of size, 1 being the largest. The two arms of human chromosomes, separated by the centromere, are called the p (petite = short) arm and q (queue) arm. Regions within the chromosome are numbered p1, p2 . . . and q1, q2 . . . outward from the centromere. Additional digits indicate band subdivisions. For example, certain bands on the q arm of human chromosome 15 are labelled 15q11.1, 15q11.2, and 15q12. Originally, bands 15q11 and 15q12 were defined; subsequently, 15q11 was divided into 15q11.1 and 15q11.2.

[Ideogram with band labels. p arm: 13, 12, 11.2, 11.1. q arm: 11.1, 11.2, 12, 13.1, 13.2, 13.3, 14, 15.1, 15.2, 15.3, 21.1, 21.2, 21.3, 22.1, 22.2, 22.31, 22.32, 22.33, 23, 24.1, 24.2, 24.3, 25.1, 25.2, 25.3, 26.1, 26.2, 26.3.]

Deletions in the region 15q11–13 are associated with Prader–Willi and Angelman syndromes. These syndromes have the interesting feature that alternative clinical consequences depend on whether the affected chromosome is paternal (leading to Prader–Willi syndrome) or maternal (leading to Angelman syndrome). This observation of genomic imprinting shows that the genetic information in a fertilized egg is not simply the bare DNA sequences contributed by the parents. Chromosomes of paternal and maternal origin have different states of methylation, signals for differential expression of their genes. The process of modifying the DNA, which takes place during differentiation in development, is already present in the zygote.

[Figure 3.2 diagram: markers along a chromosome segment in the order A r s t x M y u v w B; scale labels 0.2 cM and 1 cM.]

Figure 3.2 Suppose a mutation to a disease gene, M, occurred in a human population 50 generations ago. Consider a portion of the genome that includes the site of the disease mutation, M, plus genes for two known phenotypic traits, A and B, and two closely spaced markers, x and y. The genes for traits A and B are 1 cM away from the mutated locus. The markers x and y are 0.1 cM from M.
It is highly probable that markers x and y will be co-inherited with M in any pedigrees for which records exist, as the probability of recombination between x and M or between y and M in any generation is small: 0.1% = 0.001. The probability of recombination in 50 generations is approximately 0.001 × 50 = 0.05. On the other hand, the probability that markers A and M or B and M have been separated by recombination is very high. The probability of recombination between A and M in any generation is 1%. The probability of recombination in 50 generations is approximately 0.4 (see Problem 3.2).
In the history of transmission of this disease gene over 50 generations, markers x and y are likely to be co-inherited with the disease, but genes for traits A and B will not be reliably coupled with M. Therefore, genes A and B, separated by 1 cM (∼1 Mb in the human), will not be a reliable guide to localizing the target gene M by looking in family pedigrees for genes co-inherited with M. However, a distribution of markers such as x and y, separated by 0.2 cM (∼200 kb) or less, is likely to provide a reliable guide to localizing the target gene M through correlation of disease occurrence and genetic markers in family pedigrees.
For humans, we do not have access to 50 generations of records and DNA samples (which would amount to about 1000 years). However, the effects of recombination during the 50 generations since the mutation filter out all but the most closely linked genes from the co-inheritance pedigree. Later we shall see that haplotype groupings simplify the identification of gene–marker correspondences.
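
The numbers quoted in the caption of Figure 3.2, and the linkage-disequilibrium measure D defined earlier in this section, are easy to reproduce. The Python sketch below is illustrative only. Its function names are hypothetical, it treats generations as independent with a fixed per-generation recombination probability r, and the decay rule D_t = D_0 × (1 − r)^t is a standard population-genetics result for random mating that the text does not state explicitly.

```python
def prob_separated(r, generations):
    """Probability of at least one recombination between two loci in n generations.

    Assumes an independent recombination probability r in each generation:
    1 - (1 - r)**n, which is close to r * n when r * n is small.
    """
    return 1.0 - (1.0 - r) ** generations


def linkage_disequilibrium(p_AB, p_Ab, p_aB, p_ab):
    """D = p_AB * p_ab - p_Ab * p_aB, from the four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    return p_AB * p_ab - p_Ab * p_aB


def decayed_d(d0, r, generations):
    """D after some generations of random mating (assumed standard decay rule)."""
    return d0 * (1.0 - r) ** generations


if __name__ == "__main__":
    # Markers 0.1 cM from M (r = 0.001) and genes 1 cM from M (r = 0.01).
    print(round(prob_separated(0.001, 50), 3))
    print(round(prob_separated(0.01, 50), 3))
    # A population containing only AB and ab chromosomes, in equal numbers.
    d0 = linkage_disequilibrium(0.5, 0.0, 0.0, 0.5)
    print(d0, decayed_d(d0, 0.01, 50))
```

For the closely spaced markers the exact value stays near the linear estimate of 0.05, while for loci 1 cM apart it is close to the 0.4 given in the caption. The same (1 − r) factor governs how quickly D shrinks, which is why disequilibrium persists only between closely linked markers.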

Correlation between genetic linkage maps and chromosome structure

Chromosome aberrations include deletions, translocations (of material from one chromosome to another), and inversions. The genetic consequences of a short deletion in only one of a pair of homologous chromosomes – only one allele instead of two for traits that map to the deleted region – allowed for direct mapping of genes to positions on chromosomes. In this way, the abstract genetic linkage maps could be superimposed onto the chromosome. This was first done in the 1930s after the discovery of the very large chromosomes in the Drosophila melanogaster salivary glands. The correlation of chromosome aberrations with changes in genetic linkage patterns proved the co-linearity of the two maps. Together with the mapping of sex-linked traits to the X chromosome, the genetic consequences of chromosomal deletions confirmed what had previously been a hypothesis – that chromosomes carry hereditary information (see Figure 3.3).

Figure 3.3 Loss of part of one of a pair of homologous chromosomes leads to a structure during chromosome replication called a ‘deletion loop’. This drawing, from original work carried out shortly after the discovery in the 1930s of the large chromosomes in the salivary gland of Drosophila, shows two chromosomes that are paired except in a region in which one chromosome has suffered a deletion. Genes that map to this region can, in affected individuals, show ‘pseudodominance’ – the appearance of a recessive phenotype as a result of deletion of a masking dominant allele.
From: Painter, T.S. (1934). Salivary chromosomes and the attack on the gene. J. Hered. XXV, 465–476.

• Complete sequencing of yeast chromosome III in 1992 gave the first opportunity for direct comparison of a genetic linkage map and positions in the DNA sequence.

The modern technique for mapping genes onto chromosomes is fluorescent in situ hybridization (FISH). A probe oligonucleotide sequence labelled with fluorescent dye is hybridized to a chromosome. The location where the probe is bound shows up directly in a photograph of the chromosome. Typical resolution is ∼10^5 bp, but specialized new techniques can achieve high resolution, down to 1 kb. Simultaneous FISH with two probes can detect linkage and even estimate genetic distances. FISH can also detect chromosomal abnormalities (see Figures 3.4 and 3.5, and Box 3.3).

Studies of evolutionary changes in karyotype

If we compare our chromosomes with those of a chimpanzee, we see that some large-scale rearrangements have taken place. Human chromosome 2 is split into two separate chromosomes in the chimpanzee (see Figure 3.6). However, most regions in the corresponding chromosomes of the two species show conservation of banding patterns. Such regions are called syntenic blocks. For human and chimpanzee, full genome sequences are available. They confirm that the relationships suggested by the comparisons of banding patterns reflect conservation at the level of DNA sequences.

BOX 3.3 Chromosome abnormalities are frequently associated with cancer

The Philadelphia chromosome is an abnormal, shortened chromosome 22, arising from a translocation – an exchange of chromosomal segments – between chromosome 22 and chromosome 9. The breakpoints are at bands 9q34 and 22q11, and this translocation is denoted t(9;22)(q34;q11) (see Figures 3.4 and 3.5).

The disastrous effects of the Philadelphia translocation arise because both breakpoint sites are within genes. This results in genes for two chimaeric, or fusion, proteins, combining ABL1 (Abelson murine leukaemia viral oncogene homologue 1), a tyrosine kinase, from chromosome 9, and BCR, the breakpoint cluster region gene, from chromosome 22. In normal cells, ABL1 and BCR are separate. As a result of the translocation, chromosome 22 encodes a BCR–ABL1 fusion and chromosome 9 encodes an ABL1–BCR fusion.

The BCR–ABL1 gene on chromosome 22 encodes a fusion protein with tyrosine kinase activity insensitive to the normal regulation of ABL1. (The ABL1–BCR gene on chromosome 9 is apparently silent.) There is good evidence for the association with cancer of the product of this abnormal gene: (1) the fusion protein is expressed in the proliferating cells; and (2) a drug – imatinib mesylate (Gleevec®) – that inhibits the kinase activity is therapeutically effective.

Because the translocation produces unique DNA sequences at the site of the join, it is possible to design FISH probes that span the common breakpoints of both the ABL1 and BCR genes. A red signal detects sequences on the ABL1 side of the breakpoint; a green signal detects sequences on the BCR side of the breakpoint. Normal cells would show two green and two red signals, on different chromosomes. Diseased cells show overlapping red and green signals (appearing yellow) on the aberrant chromosomes, and single red and green signals for the normal homologues.

[Figure 3.4 image labels: 22, der(9), Ph, 9; probes BCR, ABL.]

Figure 3.4 Detection by FISH probes of the Philadelphia (Ph) chromosome, associated with chronic myeloid leukaemia. It arises from a reciprocal exchange between chromosomes 9 and 22. The figure shows a representative image of a metaphase cell from a patient with chronic myeloid leukaemia analysed by FISH with fusion (red/green or yellow) signals marking the Ph and der(9) chromosomes (der stands for derivative: a derivative chromosome combines segments from two or more normal chromosomes). The normal 9 and 22 homologues are shown by a red and a green signal, respectively.
Figure courtesy of Dr E. Nacheva, Royal Free and University College Medical School, UK.

[Figure 3.5 diagram labels: 9, 22, der(9), Ph; genes ABL1, BCR.]

Figure 3.5 Transposition of material between chromosomes 9 and 22 forms the Philadelphia chromosome (Ph), associated with chronic myeloid leukaemia. Left: normal chromosome 9 (green), containing the ABL1 gene (yellow) and normal chromosome 22 (magenta) containing the BCR gene (cyan). The arrow shows the positions of the breakpoints for the transposition. Right: the der(9) and Ph chromosomes, showing exchange of material at the bottom. Note that the site of translocation on each chromosome is within each gene.

Applications to diagnosis of disease

Many diseases are caused by – or at least correlated with – chromosomal abnormalities, including many forms of cancer (see Figures 3.4 and 3.5). It seems likely that partial or complete DNA sequencing of

individuals will supersede the cytogenetic detection of large-scale mutations.

Figure 3.6 Left: human chromosome 2. Right: matching chromosomes from a chimpanzee.

High-resolution maps, based directly on DNA sequences

Formerly, we could see genomes only by the reflected light of phenotypes. Now, markers are no longer limited to genes with phenotypically observable effects, which are anyway too sparse for an adequately high-resolution map of the human genome. Now that we can interrogate DNA sequences directly, any features of DNA that vary among individuals can serve directly as markers.

The first genetic markers based directly on DNA sequences, rather than on phenotypic traits, were restriction fragment length polymorphisms (RFLPs). The genetic marker is the size of the restriction fragments that contain a particular sequence within them (see Figure 3.7).

[Figure 3.7 diagram; the gel panel is labelled ‘Direction of electrophoresis’.]

Figure 3.7 Restriction enzymes cleave DNA at the sites of specific sequences to produce fragments separable, according to size, by gel electrophoresis. Left: Regions of two homologous chromosomes showing two alleles that differ in the length of a restriction fragment. The fragment contains a marker (red or blue rectangle), a region that has the same sequence on both chromosomes. Vertical arrows show the cut sites for one restriction enzyme. The common fragment is ‘labelled’ red on one chromosome and blue on the other. After cleavage (middle), the fragments are loaded on a gel (right) and separated according to size. Cut sites for the restriction enzyme may appear in many places in the genome, on different chromosomes, not just at the positions of the allele illustrated. In order to see only those fragments that originate at the locus of interest, a radioactive or fluorescent probe complementary to the common marker (the rectangle) is used via Southern blotting to make visible only the selected fragments.

A restriction endonuclease is an enzyme that cuts DNA at a specific sequence, typically 4, 6, or 8 bp long (see Box 3.4). Many specificity sites for restriction enzymes are palindromic, in the sense that the sequence is equal to its reverse complement. For instance:

EcoRI recognition site: GAATTC
                        CTTAAG

Other useful types of marker include:
• Variable number tandem repeats (VNTRs), also called minisatellites. VNTRs contain regions 10–100 bp long, repeated a variable number of times – same sequence, different number of repeats. In any individual, VNTRs based on the same repeat motif may appear only once in the genome or several times, with different lengths on different chromosomes. The distribution of the sizes of the repeats is the marker. Inheritance of VNTRs can be followed in a family and correlated with a disease phenotype like any other trait. VNTRs were the first genetic sequence data used for personal identification – genetic fingerprints – in paternity and in criminal cases (see Chapter 8).
• Short tandem repeat polymorphisms (STRPs), also called microsatellites. STRPs are regions of only
Maps and tour guides 89

(a) EcoRI BamHI Both Size

BOX EcoRI illustrates the nomenclature kb
3.4 of restriction enzymes 0.5

Eco abbreviates the name of the species of origin

1.0
(Escherichia coli) as the first letter of the genus name
and the first two of the species name. The letter R spec-
ifies the strain. (Not all restriction enzyme names contain 1.5
a strain identifier.) The Roman numeral, in this case I,
distinguishes different restriction enzymes from the
2.0
same strain of the same organism. Another example is
TaqI, from Thermus aquaticus.
2.5

approximately 2–5 bp but repeated many times, 3.0

typically 10–30 consecutive copies. They have
several advantages as markers over VNTRs, one
* 3.5
of which is a more even distribution over the
human genome.
4.0
There is no reason why these markers need lie
within expressed genes and usually they do not. The
CAG repeats in the gene for Huntington’s disease and 4.5
certain other disease genes are exceptions (see p. 22).
Markers have applications in several different ﬁelds. (b)
* EcoRI
Medical applications of markers include their use
in tissue typing to identify compatible donors for Both
transplants, detection of disease susceptibility, and 0 1 2 3 4 5 6 7 8 9 kb
prediction of individual drug-response variation (phar- BamHI
macogenomics). Anthropologists use them to trace Figure 3.8 (a) Pattern of a gel showing sizes of restriction
migrations and relationships among populations. fragments produced by EcoRI (red), BamHI (blue), or both (black).
The DNA sequence itself can also be used as a (Direction of migration: up the page.) (b) Positions of the cutting
marker. Physically, it is a sequence of nucleotides in sites, deduced from the measurements in (a). Each lane of (a)
the molecule, computationally a string of characters corresponds to a strip of the same colour in (b) in the following
way: each band on the gel in (a) corresponds to the length of a
A, T, G, and C. Collection and application of
fragment between two successive cutting sites in (b). For instance,
sequence data directly often makes use of mitochon- the green* indicates a band on the gel and the fragment that gave
drial DNA or haplotypes (see below). rise to it. In contrast to Figure 3.7, in this case we do not want to
limit our observation of the fragments to those at a single locus.
In this case, it is relatively easy to determine the unique order of
Restriction maps fragments that accounts for all of the data. In more complicated
cases, it is possible to simplify the problem by placing radioactive
Splitting a long molecule of DNA – for example, the tags on either end of the segment being mapped, or by carrying
DNA in an entire chromosome – into fragments of out partial digests to produce additional fragments (see Exercise 3.6).
convenient size for cloning and sequencing requires
additional maps to report the order of the fragments, speciﬁcities, produces overlapping fragments. From
so that the entire sequence can be reconstructed from the sizes of the fragments produced by individual
the sequences of the fragments. enzymes and in combination, it is possible to con-
Cutting DNA with a restriction enzyme produces struct a restriction map, stating the order and dis-
a set of fragments (as in Figure 3.7). Cutting the same tance between the restriction enzyme cleavage sites
DNA with other restriction enzymes, with different (see Figure 3.8 and Box 3.5).
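The two ideas above, palindromic recognition sites and fragment sizes from single and double digests, can be illustrated with a minimal Python sketch. The 93 bp molecule and the helper names (`reverse_complement`, `cut_positions`, `fragment_sizes`) are invented for illustration and are not taken from the text. The recognition sequences and cut points of EcoRI (G^AATTC) and BamHI (G^GATCC) are the standard ones.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_palindromic(site):
    """True if the site equals its own reverse complement."""
    return site == reverse_complement(site)

def cut_positions(dna, site, offset):
    """Top-strand cut positions for every occurrence of `site` in `dna`."""
    cuts, start = [], dna.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = dna.find(site, start + 1)
    return cuts

def fragment_sizes(length, cuts):
    """Sizes of the pieces of a linear molecule cut at the given positions."""
    bounds = [0] + sorted(set(cuts)) + [length]
    return [right - left for left, right in zip(bounds, bounds[1:])]

ECORI = ("GAATTC", 1)  # cuts after the first G: G^AATTC
BAMHI = ("GGATCC", 1)  # cuts after the first G: G^GATCC

# Made-up 93 bp linear molecule with two EcoRI sites and one BamHI site.
dna = "A" * 20 + "GAATTC" + "C" * 30 + "GGATCC" + "T" * 10 + "GAATTC" + "G" * 15

eco = cut_positions(dna, *ECORI)
bam = cut_positions(dna, *BAMHI)
lanes = {
    "EcoRI": fragment_sizes(len(dna), eco),
    "BamHI": fragment_sizes(len(dna), bam),
    "Both": fragment_sizes(len(dna), eco + bam),
}
for name, sizes in lanes.items():
    # A gel reports only the sizes, not the order, so print them sorted.
    print(name, sorted(sizes))
```

Building a restriction map is the inverse problem: given only the three sorted lists of sizes, find an ordering of cut sites consistent with all of them.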

BOX 3.5 Give my regards to restriction maps

As an analogue of a restriction map: walk the entire length of Broadway in New York (see http://www.marktaw.com/local/MarksWalkingTour.html). Mark on the map the location of every Starbucks coffee shop and calculate the number of blocks between successive sites; then mark the location of every Citibank office and calculate the number of blocks between successive sites; then calculate the number of blocks between every occurrence of either Starbucks or Citibank. A mutation in one of these ‘sites’ will change the sizes of the fragments, allowing the mutation to be located in the map.

(Thanks to Professor B. Misra, New York University)

Restriction enzymes can produce fairly large pieces of DNA. Cutting the DNA into smaller pieces, which are cloned and ordered by sequence overlaps, produces a finer dissection of the DNA called a contig map.

In the past, the connections between chromosomes, genes, and DNA sequences have been essential for identifying the molecular deficits underlying inherited diseases, such as Huntington’s disease or cystic fibrosis. Sequencing of the human genome has changed the situation radically.

Discovery of the structure of DNA

Life involves the controlled manipulation of matter, energy, and information. Biochemists were familiar with the structures and mechanisms of enzymes, living molecules catalyzing conversions of matter and energy. What kind of molecule could store and manipulate information?

Chemical analysis of DNA during the early part of the 20th century characterized the constituents – the bases and the sugars – and the nature of their linkage. DNA is a polynucleotide chain, containing a repetitive backbone of sugar–phosphate units, with small organic bases – adenine, thymine, guanine, and cytosine – attached to each sugar (see Figure 3.9).

Not only was the distribution of the bases along the chain unknown, its significance was entirely unsuspected (except for a prescient comment by physicist Erwin Schrödinger – based on ideas of Max Delbrück – in his influential 1944 book, What is Life?: ‘We believe a gene – or perhaps the whole chromosome fibre – to be an aperiodic solid.’ Do not be misled by the reference to a solid. Schrödinger himself included a footnote: ‘That it [the chromosome] is highly flexible is no objection, so is a thin copper wire.’) Indeed, many people accepted P.A. Levene’s idea that DNA contained a regular repetition of a constant four-nucleotide unit. It was, therefore, considered that DNA simply lacked the versatility required to convey hereditary information, compared, for example, with proteins. We now know that eukaryotic DNA does contain repetitive sequences, but is not limited to them.

• This attitude contributed to the general lack of acceptance of the experiments of Avery, MacLeod, and McCarty as proof that DNA was the genetic material.

An understanding of how DNA worked in biological processes required a detailed three-dimensional structure. In the 1950s, the method of choice for determination of molecular structure was X-ray crystallography. The pioneer of X-ray structure determination, Sir Lawrence Bragg, Cavendish Professor of Physics at Cambridge, was an enthusiastic supporter of the efforts of Max Perutz and his group to extend methods of X-ray crystallography to biological molecules as large as proteins.

[Figure 3.9 diagram: one strand of DNA drawn from the 5′ direction to the 3′ direction, with the bases adenine, guanine, cytosine, and thymine attached to successive sugars of the sugar–phosphate backbone.]

Figure 3.9 The chemical structure of one strand of DNA. The backbone consists of deoxyribose sugars linked by phosphodiester bonds. In its chemical bonding pattern, the backbone is a regular repetitive structure, independent of the bases. Each base, one attached to each sugar, can be one of four choices. In the sequence of bases, DNA is logically equivalent to a linear message written in a four-letter alphabet.

However, DNA molecules are flexible and do not form classic crystals. They can be drawn into fibres, which can be thought of as crystalline in two dimensions, and disordered around the fibre axis. This disorder severely impoverishes the available data. X-ray diffraction patterns of three-dimensional crystals permit an objective determination of the individual atomic positions in a structure. In contrast, fibre diffraction data present a trickier puzzle. Guided by all available clues, the biochemist must imagine a model that is consistent with the known covalent structure and with the X-ray diffraction pattern.

X-ray diffraction patterns of DNA fibres were measured at King’s College, London, by a group that included Rosalind Franklin and Maurice Wilkins. In May 1951, Wilkins spoke at a conference in Naples. In the audience was a young American postdoctoral scientist named James Watson. Watson was converted. He took his newly kindled interest in X-ray structure determination to Cambridge, where he met a graduate student named Francis Crick, and together they sought the structure of DNA.

For DNA, the clues included: (a) titration curves that showed the bases to be involved in internal hydrogen-bonding interactions; and (b) E. Chargaff’s observations that, although the amounts of the four bases differ among samples of DNA from different organisms, the amounts of adenine and thymine are always equal and the amounts of guanine and cytosine are always equal. Chargaff published his results in 1949 and described them to Watson and Crick during a visit to Cambridge in July 1952.

It was recognized as a race. Lined up against Watson and Crick were Franklin and Wilkins at King’s College and Linus Pauling in California. Pauling was a frightening contender, with many major discoveries already to his credit. He had recently beaten the Cambridge group to the structure of the α-helix in proteins.

Watson and Crick did no experimental work themselves, but confined their efforts to rationalizing all of the data and clues that they knew, including fibre diffraction data. Watson attended talks in which the King’s group’s fibre diffraction photographs were shown. Franklin continued to improve the data, producing in 1952 the photograph in Figure 3.10. This photograph appeared in a Medical Research Council report, which was shown to Watson and Crick without Franklin’s knowledge. The fibre diffraction data implied that the chains formed helices, with a repeat distance of 34 Å, containing 10 residues per turn (i.e. 3.4 Å between successive bases). Although manuscripts left by Franklin suggest that she was close to solving the structure herself, Watson and Crick beat her to the goal.

What animated Watson and Crick’s model was the idea of the complementarity of specific pairs of bases,

together with the recognition that AT and GC pairs had compatible stereochemistry. There had been several premonitory hints at this crucial idea. A Cambridge mathematician, John Griffith (nephew of the Frederick Griffith who discovered bacterial transformation), told Crick of a preferential attraction between A and T, and between G and C, that he had deduced from theoretical calculations. This, together with Chargaff’s rules, immediately suggested the idea of complementarity.* (A colleague has commented: in biology, theory suggests and experiment proves; in physics experiment suggests, theory proves. A.S. Eddington, an astrophysicist, once warned against putting ‘too much confidence in observational results until they have been confirmed by theory’.)

What in retrospect screams hydrogen-bonded base pairing (see Figure 3.11) only whispered, before the model was imagined. The leap from energetic and compositional complementarity to the correct structural complementarity was not (as it may perhaps now appear) an easy step. Even the fact that the model rationalized the hints from Griffith and Chargaff and the titration curves was more confirming evidence than proof. The famous sentence from Watson and Crick’s paper, ‘It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material’ distracts by its coyness from the underlying logic: it was the clear structural basis for the biological activity that made the model immediately and utterly convincing (see Figure 3.12).

Once the base pairing was recognized, the pieces of the puzzle quickly fell into place. On 28 February 1953, Crick strode into the Eagle, a pub in downtown Cambridge near the laboratory, and announced to anyone listening that he and Watson had discovered the secret of life. The 25 April issue of Nature contained three papers on the structure of DNA, one by Watson and Crick, and two from the King’s group. A second paper by Watson and Crick, on the genetic implications of the structure, appeared in the 30 May issue of Nature.

• It was an eventful spring: Stalin died on 5 March; Edmund Hillary and Tenzing Norgay reached the summit of Mount Everest on 29 May; Queen Elizabeth II was crowned on 2 June.

Rosalind Franklin died in 1958. Only the living can win a Nobel Prize. The 1962 Prize in Physiology or Medicine was awarded to Watson, Crick, and Wilkins.

* Lagnado, J. (2005). From pabulum to prions (via DNA): tale of two Griffiths. The Biochemist 27, 33–35.

Figure 3.10 X-ray diffraction pattern of DNA fibre, by R. Franklin. The X-shaped pattern at the centre of the picture is diagnostic of a helical structure. From the distribution of intensity, it is possible to deduce the number of residues per turn and the symmetry of the structure.
From: Franklin, R.E., & Gosling, R.G. (1953). The structure of sodium thymonucleate fibres. II. The cylindrically symmetrical Patterson function, Acta Cryst. 6, 678–685.

[Figure 3.11 diagram: hydrogen-bonded pairs adenine–thymine and guanine–cytosine, each base attached to its sugar.]

Figure 3.11 The complementary base pairings: adenine–thymine and guanine–cytosine.

Figure 3.12 The structure of DNA. The two helical strands wind around the outside of the structure. The bases are inside, stacked like
the treads of a staircase with their planes perpendicular to the axis of the double helix. The bases are visible through the major and
minor grooves and are thereby accessible for interaction with proteins. This picture shows a stereo pair, most easily viewed with a
standard stereo viewer or lorgnette.

The story has been told on many occasions, in print and on film. Notable for its unusually sensational treatment of scientific history is Watson’s 1968 autobiography, The Double Helix. A collection of the reviews it elicited still makes interesting reading.† There is now consensus that Franklin’s contributions to the discovery of the structure of DNA were underestimated, not least by Watson, who in The Double Helix disparages her mercilessly, both professionally and personally. Of numerous recent attempts to redress the balance, Sir Aaron Klug’s lecture is the most authoritative.‡

† Stent, G. (ed.) (1980). The Double Helix: Text, Commentary, Reviews, Original Papers. W.W. Norton, New York & London.
‡ Klug, A. (2003). The discovery of the DNA double helix. In Changing Science and Society. T. Krude (ed.) Cambridge University Press, Cambridge, pp. 5–43.

DNA sequencing

In 1953, after the Hershey–Chase experiment and the announcement of the double helix, molecular biologists knew that DNA contained the hereditary information. They saw how cellular machinery could achieve access to it, using base-pair complementarity. But if the sequence of the bases was like a text everyone wanted to read, not only was it a text in an unknown language, but there were not even any examples of the language, because the sequences were unknown. The importance of the problem of

determining DNA sequences was obvious, but the difficulties frightened most people away from confronting the challenge.

Frederick Sanger and the development of DNA sequencing

Cambridge biochemist Frederick Sanger devoted his career to sequence determination of biological macromolecules. First, he determined the amino acid sequence of the protein insulin, proving for the first time that proteins had definite sequences. This work won him the Nobel Prize in Chemistry in 1958. He next turned his attention to sequencing of RNA, and then to DNA. His methods for sequencing of DNA won a second chemistry Nobel Prize, in 1980, shared with Paul Berg and Walter Gilbert. Sanger’s professional legacy to the field in general, and his personal influence on his many students and colleagues, is unsurpassed.

To appreciate the state of the art during the early 1970s, it is important to recognize that it was very difficult even to prepare pure samples of single-stranded DNA for attempts at sequencing. Restriction enzymes were a new discovery. They did not provide the mature and versatile technology that we take for granted today. Bacteriophage φX-174, providing 5386 bp of purifiable, single-stranded DNA, was a natural source of material for Sanger’s early work, yielding the first whole genome sequence.

Even φX-174 is too long to sequence ‘in one go’. It depended, as has every subsequent genome sequencing project, on solving the ‘Humpty Dumpty problem’: putting fragments together again. (Box 3.6 introduces the sequence assembly problem.) Sanger’s achievement was a reliable and convenient method for sequencing relatively long fragments, making unambiguous assembly possible.

Sanger’s method used DNA polymerase to synthesize a new strand of DNA, embodying, as part of the synthesis, reactions that reveal the sequence.

DNA polymerase is a replication enzyme that synthesizes the strand complementary to a piece of single-stranded DNA. It requires a primer: a short stretch of complementary strand to be extended by successive addition of nucleotides (see Figure 3.13). The polymerase requires a supply of nucleoside triphosphates. In successive steps, the enzyme adds to the growing primer strand the nucleotide complementary to the next unpaired base in the template. This reaction also forms the basis of the polymerase chain reaction (PCR) (see Figure 3.14).

BOX 3.6 The problem of sequence assembly from short fragments

To illustrate what is involved in assembling a long sequence from overlapping short fragments, consider the first two verses of Shakespeare’s Richard III:

Now is the winter of our discontent
Made glorious summer by this sun of York

Breaking this up into overlapping 10-letter fragments (treat the line-break as a space), and presenting them in random order, gives:

ntent Made
this sun o
f Yor
discontent
ious summe
r by this
er of our
glorious
summer by
our disco
winter of
Made glor
Now is the
s the wint

It is easy to reassemble the pieces, and would be easy even if you didn’t know the answer. Observe that the longer the fragments, the easier the challenge. (See Exercise 3.1.) Recognize also, that repetitive sequences are harder. In Act 2 of Hamlet, Polonius says:

’tis true, ’tis true ’tis pity,
And pity ’tis ’tis true

It would be more difficult to reassemble this from fragments, because the repetitions create ambiguities. Indeed, the very large amounts of repetitive sequence in eukaryotic genomes do create problems for assembly algorithms.
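The assembly exercise of Box 3.6 can be tried in a few lines of Python. The following is a minimal sketch of a greedy overlap-merging strategy, which is only one simple approach and not a description of any production assembler. The fragmentation scheme (10-letter windows starting every 6 letters) and the function names are assumptions chosen for illustration; they do not reproduce the exact fragments printed in the box.

```python
import random

def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of pieces with the largest overlap."""
    pieces = list(fragments)
    while len(pieces) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps: the remaining pieces are separate contigs
        merged = pieces[best_i] + pieces[best_j][best_k:]
        pieces = [p for n, p in enumerate(pieces) if n not in (best_i, best_j)]
        pieces.append(merged)
    return pieces

text = ("Now is the winter of our discontent "
        "Made glorious summer by this sun of York")

# Assumed fragmentation: 10-letter windows starting every 6 letters,
# so that neighbouring fragments share 4 letters.
fragments = [text[i:i + 10] for i in range(0, len(text) - 9, 6)]
random.Random(0).shuffle(fragments)  # present them in scrambled order

contigs = greedy_assemble(fragments)
print(contigs)
```

For this text the shared 4-letter ends are all distinct, so the greedy merges recover the two lines. With a repetitive text such as the Polonius quotation, several merges can tie or mislead, which is the ambiguity the box describes.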

Figure 3.13 Primer extension. Duplication of a piece of single-stranded DNA, the template (red), occurs by extension of a short primer (blue) by addition of complementary nucleotides to the primer. The arrow shows the direction of synthesis. The basis of the Sanger method for DNA sequencing is the termination of the extension by poisoning the reaction with a modified base – a dideoxynucleoside triphosphate – lacking the reactive group necessary for continuing the extension.

[Figure 3.14 diagram steps: original sample; separate strands, add primers; extend primers.]

Figure 3.14 Schematic diagram of the polymerase chain reaction (PCR), a molecular copying technique that can selectively amplify very small quantities of DNA in a sample. In this figure, red strands are complementary to blue strands.
The target sequence is called the amplicon. One cycle doubles the amount. First, separate the strands of the original sample and add primers complementary to both ends of the target region. (The short red and blue arrows represent the primers.) Primer extension by DNA polymerase reconstitutes two copies of the original sample. Repetition of these operations doubles the amount of amplicon at each step. Only the first and second cycles are shown in this diagram. Exponential growth leads quickly to the production of large quantities of the desired sequence.
Specificity is achieved by tailoring the primers to the target sequence. Amplification is effective even if the target sequence appears in a small amount within a large background of other DNA.
A development called real-time PCR allows the quantitative measurement of levels of mRNA in a sample by first converting it to cDNA and then following the rate of the amplification reaction by measurement of incorporation of a fluorescent label. The basic idea is that the more abundant a sequence in a mixture, the faster the build-up of the product of its PCR amplification.
Extremely useful in understanding PCR are animations available on the web. Searching will turn up a number of sites, for example: http://users.ugent.be/~avierstr/principles/pcrani.html.

DNA polymerase links the 5′-hydroxyl of the newly added nucleotide to the 3′ position of the end of the primer. The activating triphosphate provides the energy. Sanger’s idea was to include in the mixture of nucleoside triphosphates a fraction of molecules containing no reactive 3′ position, a dideoxynucleoside triphosphate (see Figure 3.15). One reaction mixture contains normal deoxyATP, deoxyGTP, deoxyTTP, and a mixture of normal deoxyCTP plus dideoxyCTP (Figure 3.15). As primer extension proceeds, when the next unpaired base in the template is a guanine, the enzyme will randomly incorporate either a normal deoxycytidine, after which the extension will continue, or a dideoxycytidine, after which no further extension will occur. (Three other reactions are run in parallel containing all four normal triphosphates plus dideoxyATP, dideoxyGTP, or dideoxyTTP.)

This reaction produces a mixture of fragments of different lengths, each of which ends in a (modified) deoxycytidine:

3′-atacagagaatctagatacagagttgttcgag   Template
5′-tatgtctcttagat →                   Primer extension

Fragments produced:
5′-tatgtctcttagatctatgtctcaacaagctc
5′-tatgtctcttagatctatgtctcaacaagc
5′-tatgtctcttagatctatgtctcaac
5′-tatgtctcttagatctatgtctc
5′-tatgtctcttagatctatgtc
5′-tatgtctcttagatc
5′-tatgtctc
5′-tatgtc

The crucial point is that the fragments are nested. To determine the positions of the cytosine residues in the extended primer, it is sufficient to determine the lengths of the fragments. Polyacrylamide gel electrophoresis can separate oligonucleotides according to their lengths accurately enough for sequence determination. It is possible to separate polynucleotides differing in length by a single base, up to molecules about 1000 bases long.

• Almost all DNA sequencing methods, from the earliest to those in most common use today, depend on breaking a long DNA molecule into fragments, sequencing the fragments, and then assembling them, via overlaps, into the complete sequence. The methods of Sanger and of Maxam and Gilbert worked by creating a set of nested fragments, each terminating in a known nucleotide, that could be separated on a gel. Knowing the length of the fragment and the terminal nucleotide allowed the sequence to be ‘read off’ the gel.

In Sanger’s original procedure, the dideoxycytidine carried a radioactive label and the gel was developed by autoradiography. To determine the positions of the other three nucleotides, it was necessary to run four reactions, in parallel, each containing radioactive dideoxy analogues of one of the four nucleoside triphosphates. Four separate reactions are necessary because each nucleotide gives the same signal – the darkening of the film from the radioactive spot. Running the products of the four reactions in parallel lanes on the same gel allows the sequence to be read from the autoradiograph (see Figure 3.16).

• Labelling the four dideoxynucleoside triphosphates with different fluorescent dyes, with distinguishable colours, allows ‘one-pot-one-lane’ sequencing reactions (see pp. 18 and 97).

[Figure 3.15 diagram: structures of deoxycytidine triphosphate (left) and dideoxycytidine triphosphate (right).]

Figure 3.15 Left: normal deoxycytidine triphosphate, containing both a reactive 5′-hydroxyl, activated as the triphosphate derivative permitting it to be added to the growing primer strand, and a 3′-hydroxyl (red), to which the next nucleotide will be attached. Right: dideoxycytidine triphosphate, in which the 3′ position is unreactive. Dideoxycytidine can be incorporated into the growing strand, but no subsequent extension is possible. P-P-P is a simplified representation of the triphosphate group.

[Figure 3.16 image: autoradiograph with lanes G, A, T, and C, annotated with base calls and sequence positions from about 4260 to 4380.]

Figure 3.16 A gel from early in the history of DNA sequencing, showing part of the genome of bacteriophage φX-174.
From: Sanger, F., Nicklen, S., & Coulson, A.R. (1977). DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74, 5463–5467.
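The logic of nested fragments can be checked with a short Python sketch that uses the template from the example above. It idealizes the chemistry in ways the text does not: every possible termination product is assumed to be present, and fragments are counted from the first base of the synthesized strand, ignoring the primer. The function names are illustrative only.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}

def extended_strand(template_3to5):
    """Strand synthesized on a template written 3'->5', itself written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)

def terminated_fragments(strand, base):
    """Every prefix of `strand` ending in `base`: the products of one dideoxy reaction."""
    return [strand[:i + 1] for i, b in enumerate(strand) if b == base]

template = "atacagagaatctagatacagagttgttcgag"  # 3'->5', from the example above
strand = extended_strand(template)

# One "lane" per dideoxynucleotide; a gel reports only the fragment lengths.
lanes = {base: {len(f) for f in terminated_fragments(strand, base)}
         for base in "acgt"}

# Reading the gel: for each length in turn, note which lane holds a band.
read = "".join(next(b for b in "acgt" if n in lanes[b])
               for n in range(1, len(strand) + 1))

print(sorted(lanes["c"]))  # lengths of the fragments ending in dideoxycytidine
print(read)
```

The lengths in the `c` lane correspond to the eight fragments listed above, and reading all four lanes from shortest to longest reconstructs the synthesized strand.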

[Figure 3.17 trace, with base calls: aatatgtctcttagatctatgtctcaacaagctcctagttgcattgcatctgatgcaactgagac]

Figure 3.17 A tracing of a sequencing fluorescent chromatogram (simulated). Different-coloured peaks correspond to different bases.

The Maxam–Gilbert chemical cleavage method

The Maxam–Gilbert method for DNA sequencing, also developed in the mid-1970s, also worked by comparing nested fragments. A sample of single-stranded DNA, labelled at its 5′-end, is cleaved by base-specific reagents. As in Sanger’s method, polyacrylamide gel electrophoresis separates the fragments by size. The separate cleavage reactions produce fragments that share the 5′-labelled end, but differ in the bases at which the specific 3′-cleavage occurs. The sequence can be read from an autoradiograph.

The Maxam–Gilbert method had early successes, for which Gilbert shared the Nobel Prize. It was Sanger’s method that spawned subsequent developments, however. Because the Maxam–Gilbert method does not use primed DNA synthesis, its applicability is inherently limited to sequences adjacent to restriction sites or other fixed termini. Another disadvantage of the Maxam–Gilbert method was reagent toxicity, notably of hydrazine, a neurotoxin.

Automated DNA sequencing

Using fluorescent dyes as reporters, rather than radioactivity, was an important technical advance in sequencing. Radioisotopes present health hazards in both use and disposal, and are expensive. (One of Sanger’s ‘postdocs’, who wore his hair long in the fashion of the 1970s, was obliged to have a haircut because of contamination from frequently pushing his hair out of his eyes.) By attaching different fluorescent dyes to the four dideoxynucleoside triphosphates, each fragment produced gives a different signal, depending on which dideoxynucleotide terminated the extension. All four reactions can be done ‘in the same pot’, and electrophoresis separates them in a single lane. A laser focused at a fixed point identifies the fragments as they pass. The result can be displayed as a four-colour chromatogram in which successive peaks correspond to successive bases in the sequence (see Figure 3.17).

Quality is important too: the phred score q measures sequencing accuracy. q = 20 implies a probability of ≤1% error per base (see Box 3.7). A common unit of sequencing costs is US$ per 1000 bases determined to q = 20 accuracy. A sequence of 1000 q = 20 bases would be expected to contain no more than ten errors.

In 1986, L. Hood, L. Smith, and co-workers described an instrument, based on detection of base-specific fluorescent tags, that led to the automation of DNA sequencing. Instruments produced by Applied Biosystems implemented the work of Hood, Smith, and co-workers, with improvements by J.M. Prober and co-workers at DuPont. Capillary systems for fragment separation to replace flat-sheet gels were another essential advance. A population explosion in automatic DNA sequencing machines filled the high-throughput installations that produced the human and other complete genomes. Instruments available in 1998 were able to produce ∼1 Mb of sequence per day.

Current goals are to improve the technology by additional orders of magnitude. The US National Institutes of Health (NIH) has set goals for a US$100 000 human genome by 2009 and a US$1000 genome by 2014 (see Figure 3.18).

[Figure 3.18 plot: ‘Cost of sequencing’. Vertical axis: cost per base (US$), logarithmic, from 1E+05 down to 1E-07. Horizontal axis: year, 1960 to 2020. Two points are marked ‘?’.]

Figure 3.18 Fall in the cost of sequencing over time. Note logarithmic scale on the cost axis. The sea change in 1998 from the introduction of next-generation sequencing platforms is striking. The US National Institutes of Health set goals for a US$100 000 human genome in 2010, and a US$1000 human genome in 2016. Has the 2010 target been met? (See Weblem 3.4.)
(Data from: http://www.genome.gov/sequencingcosts/)

BOX 3.7 Phred scores: a measure of quality of sequence determination

The phred score of a sequence determination is a measure of sequence quality. It specifies the probability that the base reported is correct.

If p = the probability that a base is in error, then the corresponding phred score q = −10 log10(p).

Here is a short table:

Quality score q   Probability of error   Error rate
10                0.1                    1 base in 10 wrong
20                0.01                   1 base in 100 wrong
30                0.001                  1 base in 1000 wrong
40                0.0001                 1 base in 10 000 wrong
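The phred relation of Box 3.7, q = −10 log10(p), and its inverse are easy to compute. The sketch below uses hypothetical function names and reproduces the table rows and the statement that 1000 bases at q = 20 are expected to contain about ten errors.

```python
import math

def phred_from_error(p):
    """Phred score q for a per-base error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def error_from_phred(q):
    """Per-base error probability corresponding to phred score q."""
    return 10 ** (-q / 10)

def expected_errors(qualities):
    """Expected number of wrong bases in a read with the given per-base scores."""
    return sum(error_from_phred(q) for q in qualities)

for q in (10, 20, 30, 40):
    p = error_from_phred(q)
    print(f"q = {q}: p = {p:g}, about 1 base in {round(1 / p)} wrong")

print(expected_errors([20] * 1000))  # a 1000-base read, every base at q = 20
```

Because real reads carry a different score for each base, summing the per-base error probabilities, as `expected_errors` does, is more informative than quoting a single score for the whole read.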

Organizing a large-scale sequencing project

Two general approaches to genome sequencing projects are:

1. The hierarchical method, in which the whole genome is first fragmented and cloned into bacterial artificial chromosomes (BACs) (see Box 3.8), and the order of the fragments is established before sequencing them.
2. The whole-genome shotgun method, which works directly with large numbers of smaller fragments, with a concomitantly more challenging assembly problem.

Bring on the clones: hierarchical – or ‘BAC-to-BAC’ – genome sequencing

One approach to organizing the sequencing of a large DNA molecule involves dividing the sample into pieces of known relative position.

• First, cut the DNA into fragments of about 150 kb. Clone them into BACs. For example, Arabidopsis thaliana has a haploid genome size of about 10^8 bp. A 3948 clone BAC library for A. thaliana contains ∼100 kb inserts per clone, giving approximately fourfold coverage.
• Identify a series of clones in the library that contains overlapping fragments. Although referred to as ‘fingerprinting’, this process depends on shared (rather than unique) features of overlapping clones, including:
  1. overlap of restriction fragment size patterns
  2. amplification of single-copy DNA between interspersed repeat elements and checking for similar size patterns of fragments
  3. mapping sequence-tagged sites (STSs) and looking for fragments sharing STSs.

• Using the overlaps, order the clones according to their position along the original large target DNA molecule.
• Subfragment each clone, sequence the fragments and assemble them. The idea is that the clones are small enough that the ∼1500 bp sequenced subfragments can be assembled to give the complete sequence of the ∼150 kb BAC clones. Then the clones can be assembled using their known order in the original sequence.

BOX 3.8 BACs: bacterial artificial chromosomes

A plasmid is a small piece of double-stranded DNA in a bacterial cell, in addition to the main genome. A bacterial artificial chromosome, or BAC, is a plasmid containing foreign DNA – for instance, a fragment of the human genome. A typical BAC, in an Escherichia coli cell, can carry about 250 000 bp.

The idea of a BAC is to provide storage for relatively small DNA molecules. The storage should be stable and, by growing up the cells, replicable.

The number of copies per cell varies from plasmid to plasmid. A high copy number would be advantageous if we wanted to use a bacterial culture as a factory to produce large amounts of protein. However, multiple copies of plasmids in cells can undergo recombination and are therefore not suitable for stable storage of DNA fragments. The BACs in common use in genome sequencing projects are based on choosing an E. coli plasmid that maintains a low copy number and shows low recombination rates.

Figure 3.19 Diagram of an assembly of a region of DNA from shotgun fragments. Strands of DNA are shown in red and blue. ‘Reads’ of fragments from the red strand are shown in cyan; those from the blue strand are shown in magenta. In practice, sequences are determined only for about 500 bp at both ends of longer fragments. This assembly, based on overlaps, leaves one gap. Although for one region all data arise from fragments of only one strand, this does not leave a gap.

Whole-genome shotgun sequencing

The idea of the whole-genome shotgun approach is to sequence random pieces of the DNA and then put them together in the right order. If this can be done, one can skip the laborious stage of creating a map as the basis for assembling partial sequences.

In the whole-genome shotgun sequencing of the Drosophila melanogaster genome, the DNA was sheared into random pieces of approximately 2, 10, and 150 kb. For each piece, the sequences of approximately 500 bp from each end were determined; these are called reads. A computer program then assembled the results into a maximal set of contiguous sequence, or a contig (see Figure 3.19). The fully assembled genome sequence, built by coalescence of contigs, is known as the ‘Golden Path’.

The coverage is the average number of times each base appears in the fragments. If G = genome length, N = number of reads and L = length of a read, the coverage = NL/G.

E. Lander and M. Waterman derived formulas for the number of gaps expected as a function of coverage and genome size. For instance, for a 1 Mbp genome, eight- to tenfold coverage should permit assembly of the reads into five contigs. Completion of the process, called finishing, involves synthesis and sequencing of specific fragments to close the gaps.

Whole-genome shotgun sequencing worked smoothly for prokaryotes, which contain relatively less internal repetitive sequence. Repeats create problems in assembly, and this led to scepticism about the feasibility of the shotgun approach for a complex eukaryotic genome. In fact, the Drosophila genome has fewer repeats than mammalian genomes and this contributed to its successful sequencing by shotgun methods. In the event, the publication of the Drosophila genome in 2000 contained 120 Mb of finished sequence, with about 1600 gaps. In the latest release of the Drosophila genome (April 2004), the number of gaps has been reduced to 23.

Highly skewed base composition, as in Plasmodium falciparum – which contains ∼80 mol% AT – also

complicates application of whole-genome shotgun techniques.

Even if the ultimate goal is a complete, finished, genome sequence, it may be possible to identify genes in a partly assembled genome with many gaps, provided that the genes are contained within contigs.

Celera took the success of whole-genome shotgun sequencing of the fruit fly as ‘proof of principle’, justifying its use in their human genome project. Despite the simultaneous announcement of academic and commercial human genomes, on 26 June 2000 (see Chapter 1), the academic group made its results publicly available (largely to preclude patenting) and it has been alleged that Celera made use of these data in their assembly.*

BOX 3.9 Common and different steps in ‘BAC-to-BAC’ and whole-genome shotgun methods

1. Make random cuts to produce fragments of:
   ‘BAC-to-BAC’ method: ∼150 kb
   Whole-genome shotgun method: ∼2000 bp and 10 000 bp
2. Make plasmid library in BACs.
3. ‘BAC-to-BAC’ method: Fingerprint, overlap, and order BAC clones.
   Whole-genome shotgun method: Skip this step.
4. Partially sequence 1500 bp subfragments of individual clones.
5. Assemble overlaps by computer.
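The coverage formula NL/G and the Lander and Waterman estimate quoted above can be checked numerically. The following is a minimal sketch, not the authors' formulation: the function names are hypothetical, the 500 bp read length is borrowed from the Drosophila example, and the contig estimate uses the simplest form of the Lander–Waterman result, N·exp(−c), which assumes that any overlap between two reads, however short, can be detected.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Average number of times each base is sampled: N * L / G."""
    return num_reads * read_length / genome_length


def expected_contigs(num_reads, read_length, genome_length, min_overlap=0):
    """Lander-Waterman estimate of the number of contigs: N * exp(-c * sigma).

    sigma = 1 - min_overlap / read_length. min_overlap = 0 is an assumed
    idealization (every overlap is detectable).
    """
    c = coverage(num_reads, read_length, genome_length)
    sigma = 1 - min_overlap / read_length
    return num_reads * math.exp(-c * sigma)


# A. thaliana BAC library from the text: 3948 clones, ~100 kb inserts, 10^8 bp genome.
print(coverage(3948, 100_000, 10**8))          # about 3.9, i.e. roughly fourfold

# 1 Mbp genome at eightfold coverage with assumed 500 bp reads: N = 8 * G / L.
reads_8x = 8 * 10**6 // 500
print(expected_contigs(reads_8x, 500, 10**6))  # about 5 contigs
```

Under these assumptions the estimate falls steeply with coverage: at tenfold coverage the same calculation gives fewer than one expected gap-separated contig beyond the first, which is why modest increases in coverage pay off so well in finishing.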

Comparison of BAC-to-BAC and whole-genome shotgun approaches

Suppose the sample of DNA for sequencing comes from a diploid organism. Fragments arising from homologous regions of two chromosomes of a pair may have sequence differences. Correct assembly must place them at the same location, noting the discrepancies, and must not split these reads into different contigs because of the imperfect matches.

BAC-to-BAC methods are more robust than whole-genome shotgun methods with respect to this problem. Each BAC is a clone and, therefore, has a unique sequence. Successive clones may come from different chromosomes and may, therefore, show mismatches, but this will not affect the order of assembly of the clones.

Note that an unambiguous success of whole-genome shotgun methods, the Drosophila genome, was based on a highly inbred laboratory strain. Sequencing the DNA of an individual from a natural, outbred population – or, even worse, DNA pooled from several individuals from such a population – would present a more severe challenge.

Box 3.9 contrasts the alternative approaches.

* See Sulston, J. & Ferry, G. (2002). The Common Thread: a Story of Science, Politics, Ethics and the Human Genome. Bantam Press, London; and Ashburner, M. in Jobling, M.A., Hurles, M.E., & Tyler-Smith, C. (2003). Human Evolutionary Genetics: Origins, Peoples and Disease, Garland Science, New York, p. 27.

High-throughput sequencing

Several rewards await the developer of a breakthrough improvement in DNA sequencing technology. Sequencing has shown that it can enhance human welfare. Some clinical applications have already entered medical practice; sequencing supports research that will provide additional ones. Other applications include more effective and safer production of food, and biotechnological approaches to generating safe and abundant consumable energy.

Intellectually, all of modern biology rides on the back of DNA sequencing. High quantities of data allow addressing more subtle questions. Think of higher coverage, or being able to treat a larger sample cohort, as a magnifying glass that brings details into sharper focus.

There are also significant financial rewards. The market for sequencing machines was over US$1 billion in 2010. The X Prize Foundation, which sponsored a US$10 million prize for space flight, has launched another US$10 million competition, for sequencing 100 human genomes in 10 days, for less than US$10 000 per genome.

General approaches to improving the throughput/cost ratio are: (a) Miniaturization, and (b) Parallelization, or multiplexing. Common to many but not all of the new methods are preparation steps in which the target DNA is fragmented, common adaptors are attached to one or both ends, and – in most methods

– the results are amplified. The results are distributed spatially, either in an array of wells, or fixed to a ground.

Major players in the field, and the features of their methods, include:

Roche 454 Life Sciences. The 454 system achieves multiplexing by forming ‘polonies’ – single-molecule replicates. Digests of DNA are ligated to a common PCR primer, and amplified by replication in individual wells (Figure 3.20). The samples in the wells generate individual signals, detected in parallel.

As in the Sanger method, DNA polymerase extends the primer. The fragments are exposed, in four successive operations, to each of the four bases. The base added is identified ‘on the fly’ through detection of the pyrophosphate released (see Box 3.10). In this case, each addition generates the same signal. Each nucleotide is presented separately, so we know which base is added to each well because we know which nucleotide triggered the signal. Figure 3.21 shows an integrated view of the whole system.

Figure 3.20 Each bead, attached to a clone of amplified fragments, occupies a separate well in the Roche 454 Life Sciences high-throughput sequencing platform.
(454 Sequencing © Roche Diagnostics)

Figure 3.21 System overview of the Roche 454 Life Sciences high-throughput sequencing platform. The bead that is flashing at you has just added a nucleotide, signalled by the luciferase detector (see Figure 3.22).
(454 Sequencing © Roche Diagnostics)

Helicos. The playing field for Helicos sequencing is a flow-cell surface to which billions of oligo-T molecules are fixed. The subject DNA is fragmented into 100–200 bp fragments, and an oligo-A primer is added to the 3′-end of each fragment. These primers bind to the oligo-Ts fixed to the surface.

BOX 3.10 Pyrosequencing

Pyrosequencing detects incorporation of a specific nucleotide into a growing strand by detecting the pyrophosphate (PPi) released in a step of the reaction catalysed by polymerase:

Template: 3′-ATACAGAGAATCTAGAT . . . + TTP → ATACAGAGAATCTAGAT + PPi
Primer: 5′-TATGTCTCT → TATGTCTCTT

In each cycle, the reaction is separately exposed to each of the four triphosphates.* Incorporation of the complementary nucleotide releases pyrophosphate. ATP sulphurylase converts the pyrophosphate to ATP:

adenosine 5′-phosphosulphate + PPi → ATP

which is detected by a light-producing luciferase reaction (see Figure 3.22). Apyrase destroys unincorporated nucleoside triphosphates and ATP produced upon successful incorporation.

[Reaction scheme. ATP sulphurylase: AMP-SO3H + PPi → ATP + SO4²⁻. Luciferase: ATP + luciferin → ADP + oxyluciferin + light.]

Figure 3.22 Detection of matching nucleotides by the luciferase reaction to signal the appearance of PPi released when a nucleotide is incorporated.

Each cycle of successive exposure to the four nucleoside triphosphates produces one sequenced base.

* As in many plays, including The Merchant of Venice, Turandot, etc., in which an eligible woman is offered several suitors in turn.

The fragments are sequenced by synthesis. Flooding the system with a polymerase and one fluorescently labelled nucleotide results in incorporation of the nucleotide onto each fragment that has the complementary base adjacent to the growing primer. As in the Roche 454 Life Sciences system, an image of the system shows fluorescent spots at the positions at which the nucleotide was incorporated. After washing and removal of the fluorescent tag, the process repeats with the other three nucleotides, in succession. The four images reveal the distribution of the incorporated nucleotides. Then the process repeats, for the next position.

Note that in this system there is no amplification step.

Illumina Solexa. In the preparation step, fragments bound to a surface are amplified in situ. They form distinct clusters on the surface. At each step, the polymerase adds a base. The four bases have four different fluorescent tags. Therefore the distribution of colours, in an image of the field, identifies which base was added to each cluster. Remove the fluorescent tags, and repeat. The result is a kaleidoscopic movie of shifting colours, one frame per position.

Applied Biosystems Solid. This is perhaps the trickiest method to explain. To prepare the system, a set of fragments is extended by standard adaptors at both ends. The fragments are amplified and attached to beads. The beads are fixed to a glass slide. A primer is annealed to the adaptor sequence.

The sequencing step is to expose the samples to fluorescently labelled probes eight nucleotides long. The probes have the following features:

(a) The first two positions include all possible dinucleotides.
(b) The remaining six positions are ‘wild cards’ – molecules that will pair with all four bases.
(c) The 5′ end of the probe bears one of four fluorescent tags. Because there are four tags and 16 dinucleotides, the system is degenerate, each tag representing four of the sixteen dinucleotides.

Figure 3.23 shows how the system works. The first probe bound to one of the fragments starts with TC, implying that the first two bases from the original fragment (after the adaptor) are AG. However, because the fluorescent tag, yellow in this diagram, represents AG, CT, GA, or TC, we only know that the first dinucleotide must be one of these four. We don’t yet know the identity of any one base.

Remove the last three nucleotides, and repeat the process. After the next step, we know that the dinucleotide offset by three positions from the first must be one of four possibilities corresponding to the red tag. In summary, what we know about the target sequence at this point is that it must have the following form:

position        1  2  3  4  5  6  7
                A  G  ?  ?  ?  A  T
             or C  T        or C  G
             or G  A        or G  C
             or T  C        or T  A

This process continues for a total of seven cycles, giving us additional partial information about 35 bases in the sequence.

How are the ambiguities resolved? By generating overlapping information. The newly synthesized strand is removed. A new primer is added, which binds at one position offset from the first. Now the first dinucleotide probe that binds has a blue tag (Figure 3.23) and therefore must be AA, CC, GG, or TT. But the adaptor base is an A; therefore the dinucleotide must be TT, and the target sequence must contain AA (of which the first A is from the adaptor).

Lining these possibilities up with the earlier information means that the sequence must have the form:

position        1  2  3  4  5  6  7
                A  G  ?  ?  ?  A  T
             or C  T        or C  G
             or G  A        or G  C
             or T  C        or T  A
             A  A  ?  ?  ?  ?  ?

Therefore the base incorporated at the first position of the original run must be A (which we knew from the adaptor). Therefore the second base must be a G. Now we know that the sequence begins AG. Knowing the base at position 2 resolves the ambiguity at position 3, etc., and we can step along the sequence (see Figure 3.24).
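The way one known base resolves a chain of degenerate tags can be sketched in a few lines. This is an illustration only, not the vendor's software: the function names are hypothetical, the three named tag sets (yellow, red, blue) are those given in the text, and the fourth set is assigned the name ‘green’ here as an assumption, since the text does not name it.

```python
# Dinucleotides sharing one fluorescent tag. Yellow, red and blue are taken
# from the text; "green" is an assumed name for the remaining four dinucleotides.
TAG_SETS = {
    "blue": ("AA", "CC", "GG", "TT"),
    "yellow": ("AG", "CT", "GA", "TC"),
    "red": ("AT", "CG", "GC", "TA"),
    "green": ("AC", "CA", "GT", "TG"),
}
TAG_OF = {pair: tag for tag, pairs in TAG_SETS.items() for pair in pairs}


def encode(sequence):
    """Return the tag for each pair of adjacent bases in the sequence."""
    return [TAG_OF[sequence[i:i + 2]] for i in range(len(sequence) - 1)]


def decode(first_base, tags):
    """Recover the bases from a known first base and the successive tags."""
    bases = [first_base]
    for tag in tags:
        # Within one tag set each base occurs exactly once as the first base,
        # so the previous base and the tag determine the next base uniquely.
        candidates = [b for b in "ACGT" if TAG_OF[bases[-1] + b] == tag]
        bases.append(candidates[0])
    return "".join(bases)


# Adaptor base A, then a blue tag and a yellow tag, as in the worked example.
print(decode("A", ["blue", "yellow"]))  # AAG: adaptor A, then target bases A and G
```

Because each tag set is unchanged when both bases are replaced by their complements, the same table serves whether one reads the probe dinucleotide or the template dinucleotide.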

Figure 3.23 DNA sequencing with Applied Biosystems’ Solid technology. Fragment (black) lengthened by adaptor CAGCTAGCTAGCA (pink), and fixed to beads. From a mixture of probes containing 5′ dinucleotides plus 6 ‘wild cards’ (shown as grey N and Z) the probe starting with TC binds, and is ligated. The yellow fluorescence shows that the dinucleotide must be TC, GA, CT, or AG; at this point we don’t know that it is TC. Seven such cycles of extension provide partial information about separated dinucleotides in the fragment.
After removing the synthesized strand, the process is repeated with a displacement of one residue. Now the blue fluorescence shows that the new dinucleotide sequence must be TT, GG, CC, or AA. Knowing the initial A from the adaptor shows that the dinucleotide, complementary to the adaptor, must therefore be TT, and that the adjacent dinucleotide, known from the previous round to be TC, GA, CT, or AG, must be TC and hence that the second base in the fragment must be G. We can walk along the sequence, resolving successive ambiguities (see Figure 3.24).
(From: Anderson, M.W., & Schrijver, I. (2010). Next generation DNA sequencing and the future of genomic medicine. Genes 1, 38–69.)

[Figure 3.24 diagram, ‘Dual interrogation of each base’: read positions 0–35 against five primer rounds (universal sequencing primers n, n-1, n-2, n-3, and n-4), each with ligation cycles 1–7; asterisks indicate positions of interrogation.]

Figure 3.24 In the Applied Biosystems Solid high-throughput sequencing platform, five rounds of primer reset are completed for each primer. By reprocessing the same fragment with a shifted primer, overlapping signals provide enough information to resolve the ambiguities arising from the fact that sets of four of the sixteen possible dinucleotide probes have the same fluorescent tag. Through the sliding-primer process, almost every base is interrogated in two independent binding and ligation reactions by two different primers. For example, the base at read position 20 is tested by primer n-1 in its fifth cycle, and by primer n-2 in its fourth cycle. This double testing is not only necessary to resolve ambiguities in labelling, but provides a sensitive mechanism of error detection.
(Photo courtesy of Life Technologies.)

Life in the fast lanes*

The turkey genome offers an example of how high-throughput technology facilitated the recent sequencing of the first genome of a species. As is common, an international consortium carried out the project. The lead investigators were from Virginia Tech, the University of Minnesota, the University of Maryland, and the US Department of Agriculture. The 1.1 Gb turkey genome was completed in Autumn 2010.

The success of the project depended on:

1. Two high-throughput platforms for ‘brute force’ sequencing power.
   • A Roche 454 GS-FLX Titanium platform generated ∼5X coverage
   • An Illumina Genome Analyzer II platform generated ∼25X coverage
2. A genetic map and a BAC library, to assist in the assembly. In order to support the assembly, Sanger sequencing of elements of the BAC library added about another 6X coverage. (Sanger sequencing was used only because the BAC map construction was carried out before the newer equipment became available.)

The high-throughput sequencing platforms produced the amounts of raw data shown in Table 3.1 (see Exercise 3.14).

Sequencing of BACs produced an integrated physical and genetic map of the turkey genome. 725 contigs assembled from the BAC sequence data comprised an average of 76 clones. The average length of the contigs was 2300 kb.

The genome was assembled using a genetic map, maps based on BAC clones, and comparison with the chicken genome, which was already available. Only for the Z chromosome was the chicken relied on for assembly. It would be appropriate to regard the turkey genome as an independent, de novo, assembly.

* Section title is a malaprop: lanes are older-generation technology.

Table 3.1 Turkey genome data produced by high-throughput sequencing platforms

Platform                      Type of read               Number of reads   Average read length
Roche 454 GS-FLX Titanium     shotgun                    13 million        366 bp
                              3 kb fragment paired end   3 million         180 bp
                              20 kb fragment paired end  1 million         195 bp
Illumina Genome Analyzer II   shotgun                    200 million       74 bp
                              180 bp paired end          200 million       74 bp

Databanks in molecular biology

After completion of sequence and annotation, a genome enters the databanks of molecular biology – ‘to take its place in society’.

The many databanks form interlocking networks. Release of a genome into any of the major archival projects is like casting a stone into a lake, sending ripples through the whole system. The genome itself is a nucleic acid sequence, but the protein-coding genes it contains will, after translation into amino acid sequences, contribute to protein sequence databanks.

Databases in molecular biology have grown with astonishing rapidity recently. Their development follows rules of their own. Here we can only describe general principles and guide the reader to his or her own exploration of this world. (See the Online Resource Centre associated with this book.)

The original databases were small, specialized, and – by post-web standards – isolated. It has always been true that different types of information need to be curated by people with the appropriate expertise. Specialists in different areas of biology organized the archiving of data related to their interests. One problem was that even where data overlapped – for instance, amino acid sequences are common to protein sequence and structure databases – there was relatively little effort to use controlled vocabularies and to make storage formats compatible. The International Scientific Unions and CODATA (the Committee on Data of the International Council of Scientific Unions) made important contributions, but problems remained.

With (1) the growing recognition of the importance of bioinformatics to research in biology, (2) the spectacular increases in the quantity of data, and – above all – (3) the emergence of the internet and World Wide Web, came pressure towards growth and integration of databanks. Requirements for large-scale funding, and the need for combining biological and computational expertise, led to the creation of national and international institutions responsible for archiving and curating the data.

The term ‘databank’ suggests a metaphor that perhaps is outdated: a bank as a safe place for something valuable, from which you can make a withdrawal and then go off shopping up and down the high street. This description emphasizes the archiving and curation activities of the databanks. Of course, these activities remain absolutely essential. However, the databanks also provide facilities – or at least links – for the computational analysis of the information recovered. (More like banks within large shopping malls.)

The realization that different data types were not intellectual islands, but that researchers needed coordinated access to them, led to the forging of links among the databases. The World Wide Web made this possible. The results were (1) systematization of formats and vocabularies, so that data in different collections became compatible; (2) pointers or links from each databank to others, facilitating access to related material; and (3) development of information-retrieval software that would streamline access to different databanks and smooth the passage from retrieving data to subjecting the results to computational analysis. Gathering all sequences in birds homologous to a given human protein is a problem in information retrieval; forming a multiple sequence alignment of these sequences is computational analysis. Smooth passage means a simple pipeline from the sequences returned by the database search into a multiple sequence alignment program.

In summary, the requirements of a major database project include the following:

1. Harvest the data plus annotations, curate them – that is, check both for accuracy and format – and distribute them.
2. Track and back up the data so that they do not get lost. Should any question arise, it should be possible to trace the data back to their origin and review all subsequent actions performed on them.
3. Provide links from the data to relevant items in other databanks, including bibliographical libraries such as PubMed.
4. Provide information retrieval and analysis software to support a research pipeline that includes both recovery of selected data and calculations with them.
5. Provide ample documentation and tutorial information, so that users can make effective use of the facilities.
6. Keep up with scientific advances in both biology and informatics. These may suggest improvements in the presentation and facilities.
7. Be responsive to users’ needs.

Primary data collections related to biological macromolecules include:

• nucleic acid sequences, including whole-genome projects;
• amino acid sequences of proteins;

• protein and nucleic acid structures;
• small-molecule crystal structures;
• protein functions;
• expression patterns of genes;
• metabolic pathways and networks of interaction and control; and
• publications.

Nucleic acid sequence databases

The worldwide nucleic acid sequence archive, The International Nucleotide Sequence Database Collaboration, is a partnership of EMBL-Bank at the European Bioinformatics Institute (EBI), the DNA Data Bank of Japan (DDBJ) at the Center for Information Biology (CIB), and GenBank at the National Center for Biotechnology Information (NCBI).

The groups exchange data daily. As a result, the raw data are identical, although the format in which they are stored and the nature of the annotation vary among them. These databases curate, archive, and distribute DNA and RNA sequences collected from genome projects, scientific publications, and patent applications.

Entries have a life cycle in the database. Because of the desire on the part of the user community for rapid access to data, new entries are made available before annotation is complete and checks are performed. Entries mature through the classes:

unannotated → preliminary → unreviewed → standard.

Rarely, an entry ‘dies’ – a few have been removed when they are determined to be erroneous.

• In addition to the International Nucleotide Sequence Database Collaboration, we have discussed several other DNA sequence databanks:
  Genome browsers – databanks organized around one or more genome sequences, with links to other information sources about the organism.
  The International HapMap Consortium – database of single-nucleotide polymorphisms.
  Forensic DNA databases – for which law enforcement agencies in various countries are responsible.

Protein sequence databases

In 2002, three protein sequence databases – the Protein Information Resource (PIR, at the National Biomedical Research Foundation of the Georgetown University Medical Center in Washington DC, USA), and SWISS-PROT and TrEMBL (from the Swiss Institute of Bioinformatics in Geneva and the European Bioinformatics Institute in Hinxton, UK) – coordinated their efforts, to form the UniProt consortium. The partners in this enterprise share the database but continue to offer separate information-retrieval tools for access.

The PIR grew out of the very first sequence database, developed by Margaret O. Dayhoff – the pioneer of the field of bioinformatics. SWISS-PROT was developed at the Swiss Institute of Bioinformatics. TrEMBL contains the translations of genes identified within DNA sequences in the EMBL Data Library. TrEMBL entries are regarded as preliminary. They mature – after curation and extended annotation – into full-fledged UniProt entries.

Today, almost all amino acid sequence information arises from translation of nucleic acid sequences. Information about ligands, disulphide bridges, subunit associations, post-translational modifications, glycosylation, splice variants, effects of mRNA editing, etc. is not available from gene sequences. For instance, from genetic information alone, one would not know that human insulin is a dimer linked by disulphide bridges. Protein sequence databanks collect this additional information from the literature and provide suitable annotations.

Databases of genetic diseases – OMIM and OMIA

Online Mendelian Inheritance in Man™ (OMIM™) is a database of human genes and genetic disorders. Its original compilation, by V.A. McKusick, M. Smith, and colleagues, was published on paper. The NCBI of the US National Library of Medicine has developed it into a database accessible from the web and introduced links to other archives of related information, including sequence databanks and the medical literature. OMIM is now well integrated with the NCBI information-retrieval system ENTREZ. A related

database, the OMIM Morbid Map, deals with genetic diseases and their chromosomal locations. OMIA (Online Mendelian Inheritance in Animals) is a corresponding database for disease and other inherited traits in animals – excluding human and mouse.

Databases of structures

Structure databases archive, annotate, and distribute sets of atomic coordinates.

Approximately 80 000 protein structures are now known. Most were determined by X-ray crystallography or nuclear magnetic resonance (NMR). The Worldwide Protein Data Bank (wwPDB) now comprises four collaborating primary archival projects to integrate the archiving and distribution of experimentally determined biological macromolecular structures:

• The Research Collaboratory for Structural Bioinformatics (RCSB), in the USA
• The Protein Data Bank in Europe (PDBe), at EBI, UK
• The Protein Data Bank Japan (Osaka, Japan)
• The Biological Magnetic Resonance Data Bank (BMRB), in the USA.

The wwPDB sites accept depositions, process new entries, and maintain the archives.

These and many other web sites organize and provide access to these data, including but not limited to pictorial displays. Naturally, there is considerable overlap among them. Each has its own strengths, based in many cases on the research interests of the contributing scientists: the PDBe has recently embarked on an ambitious software development program for structural analysis. Many sites offer search facilities to identify structures of interest, based on the presence of keywords (or a logical combination of keywords), or numerical values such as the year of deposition. Different sites differ also in their ‘look and feel’, and users will discover their own preferences.

The wwPDB overlaps in scope with several other databases. The Cambridge Crystallographic Data Centre (CCDC) archives the structures of small molecules. This information is extremely useful in studies of conformations of the component units of biological macromolecules, and for investigations of macromolecule–ligand interactions, including but not limited to applications to drug design. The Nucleic Acid Structure Databank (NDB) at Rutgers University, New Brunswick, New Jersey, USA, complements the wwPDB.

Classifications of protein structures

Several web sites offer hierarchical classifications of the entire PDB according to the folding patterns of the proteins. These include:

• SCOP: structural classification of proteins
• CATH: class/architecture/topology/homologous superfamily
• DALI: based on extraction of similar structures from distance matrices
• CE: a database of structural alignments.

These sites are useful general entry points to protein structural data. For instance, SCOP offers facilities for searching on keywords to identify structures, navigation up and down the hierarchy, generation of pictures, access to the annotation records in the PDB entries, and links to related databases.

Specialized or ‘boutique’ databases

Many individuals or groups select, annotate, and recombine data focused on particular topics, and include links affording streamlined access to information about subjects of interest. For instance, the Protein Kinase Resource is a specialized compilation that includes sequences, structures, functional information, laboratory procedures, lists of interested scientists, tools for analysis, a bulletin board, and links. It has recently been redesigned, with a view to integrating expanded information content with a workbench equipped with embedded tools for launching analyses of the data within the user interface.

Expression and proteomics databases

Recall the central dogma: DNA makes RNA makes protein. Genomic databases contain DNA sequences.
Expression databases record measurements of mRNA levels, usually via ESTs (expressed sequence tags: short terminal sequences of cDNA synthesized from mRNA) describing patterns of gene transcription. Proteomics databases record measurements on proteins, describing patterns of gene translation.
Comparisons of expression patterns give clues to (1) the function and mechanism of action of gene products; (2) how organisms coordinate their control over metabolic processes in different conditions – for instance, yeast under aerobic or anaerobic conditions; (3) the variations in mobilization of genes in different tissues, or at different stages of the cell cycle, or of the development of an organism; (4) mechanisms of antibiotic resistance in bacteria and consequent suggestion of targets for drug development; (5) the response to challenge by a parasite; (6) the response to medications of different types and dosages, to guide effective therapy.
There are many databases of ESTs. In most, the entries contain fields indicating tissue of origin and/or subcellular location, stage of development, conditions of growth, and quantification of expression level. Within GenBank, the dbEST collection currently contains almost 70 million entries, from 2281 species. The species with the largest numbers of entries in dbEST are shown in Table 3.2.

Table 3.2 Species with the largest numbers of entries in dbEST

Species                                               Number of entries
Homo sapiens (human)                                  8 314 483
Mus musculus + domesticus (mouse)                     4 853 533
Zea mays (maize)                                      2 019 105
Sus scrofa (pig)                                      1 620 479
Bos taurus (cattle)                                   1 559 494
Arabidopsis thaliana (thale cress)                    1 529 700
Danio rerio (zebra fish)                              1 481 937
Glycine max (soybean)                                 1 461 624
Xenopus (Silurana) tropicalis (western clawed frog)   1 271 375
Oryza sativa (rice)                                   1 251 304
Ciona intestinalis                                    1 205 674
Rattus norvegicus + sp. (rat)                         1 162 136
Triticum aestivum (wheat)                             1 071 367
Drosophila melanogaster (fruit fly)                     821 005
Xenopus laevis (African clawed frog)                    677 806
Oryzias latipes (Japanese medaka)                       666 891
Brassica napus (oilseed rape)                           643 874
Gallus gallus (chicken)                                 600 423
Panicum virgatum (switchgrass)                          546 245
Hordeum vulgare + subsp. vulgare (barley)               501 620
Salmo salar (Atlantic salmon)                           498 212
Caenorhabditis elegans (nematode)                       393 714
Phaseolus coccineus                                     391 150
Porphyridium cruentum                                   386 903
Canis lupus familiaris (dog)                            382 629
Vitis vinifera (wine grape)                             362 392
Physcomitrella patens subsp. patens                     362 131
Ictalurus punctatus (channel catfish)                   354 466
Ovis aries (sheep)                                      338 364
Branchiostoma floridae (Florida lancelet)               334 502
Nicotiana tabacum (tobacco)                             332 667
Pinus taeda (loblolly pine)                             328 662
Malus domestica (apple tree)                            324 512
Picea glauca (white spruce)                             313 110
Bombyx mori (domestic silkworm)                         309 472
Aedes aegypti (yellow fever mosquito)                   301 596
Solanum lycopersicum (tomato)                           297 104
Oncorhynchus mykiss (rainbow trout)                     287 967
Linum usitatissimum                                     286 852
Neurospora crassa                                       277 147
Gasterosteus aculeatus (three-spined stickleback)       276 992
Medicago truncatula (barrel medic)                      269 238
Gossypium hirsutum (upland cotton)                      268 797
Pimephales promelas                                     258 504
Aplysia californica (California sea hare)               255 605

Some EST collections are specialized to particular tissues (e.g. muscle, teeth) or to species. In many cases, there is an effort to link expression patterns to other knowledge of the organism. For instance, the Jackson Laboratory Gene Expression Information Resource Project for Mouse Development coordinates data on gene expression and developmental anatomy.
Many databases provide connections between ESTs in different species, for instance, linking human and mouse homologues, or relationships between human disease genes and yeast proteins. Other EST collections are specialized to a type of protein, for instance cytokines. A large effort is focused on cancer, integrating information on mutations, chromosomal rearrangements, and changes in expression patterns to identify genetic changes during tumour formation and progression.
Although of course there is a close relationship between patterns of transcription and patterns of translation, direct measurements of protein contents
of cells and tissues – proteomics – provide additional valuable information. Because of differential rates of translation of different mRNAs, measurements of proteins directly give a more accurate description of patterns of gene expression than measurements of transcription. Post-translational modifications can be detected only by examining the proteins.

Databases of metabolic pathways

The Kyoto Encyclopedia of Genes and Genomes (KEGG) collects individual genomes, gene products, and their functions, but its special strength lies in its integration of biochemical and genetic information. KEGG focuses on interactions: molecular assemblies, and metabolic and regulatory networks. It has been developed under the direction of M. Kanehisa.
KEGG organizes five data types into a comprehensive system:

1. catalogues of chemical compounds in living cells
2. gene catalogues
3. genome maps
4. pathway maps
5. orthologue tables.

The catalogues of chemical compounds and genes contain information about particular molecules or sequences. Genome maps integrate the genes themselves according to their chromosomal location. In some cases, knowing that a gene appears in an operon can provide clues to its function.
Pathway maps describe potential networks of molecular activities, both metabolic and regulatory. A metabolic pathway in KEGG is an idealization corresponding to a large number of possible metabolic cascades, combining reactions occurring in different organisms. It can generate a real metabolic pathway of a particular organism, by matching the proteins of that organism to enzymes within the reference pathways.
One enzyme in one organism would be referred to in KEGG in its orthologue tables, which link the enzyme to related ones in other organisms. This permits analysis of relationships between the metabolic pathways of different organisms.

Bibliographic databases

MEDLINE (based at the US National Library of Medicine) integrates the medical literature, including very many papers dealing with subjects in molecular biology not overtly clinical in content. It is included in PubMed, a bibliographical database offering abstracts of scientific articles, integrated with other information retrieval tools of the NCBI within the National Library of Medicine (http://www.ncbi.nlm.nih.gov/PubMed/).
One very effective feature of PubMed is the option to retrieve related articles. This is a very quick way to ‘get into’ the literature of a topic. Here’s a tip: if you are trying to start to learn about an unfamiliar subject, try adding the keyword tutorial to your search in a general search engine, or the keyword review to your search in PubMed.

Surveys of molecular biology databases and servers

It is difficult to explore any topic in molecular biology on the web without quickly bumping into a list of this nature. Lists of web resources in molecular biology are very common. They contain, to a large extent, the same information but vary widely in their ‘look and feel’. The real problem is that, unless they are curated, they tend to degenerate into lists of dead links.

Each year the January issue of the journal Nucleic Acids Research contains a set of articles on databases in molecular biology. This is an invaluable reference.

This book does not contain a long annotated list of relevant and recommended sites, for the following reasons. First, you do not want a long list; you need a short one. Second, the web is too volatile for such a list to stay useful for very long. It is much more effective to use a general search engine to find what you want at the moment you want it.
My advice is to spend some time browsing; it will not take you long to find a site that appears reasonably stable and has a style compatible with your methods of work. Alternatively, here is a site that is comprehensive and shows signs of a commitment to keeping it up to date: http://www.expasy.org. It is a suitable site for starting a browsing session.

● RECOMMENDED READING

• Discussions of haplotypes in general, and the important MHC complex in particular:

Neale, B.M. (2010). Introduction to linkage disequilibrium, the HapMap, and imputation. Cold
Spring Harb Protoc; 2010; doi:10.1101/pdb.top74.
Vandiedonck, C., & Knight, J.C. (2009). The human Major Histocompatibility Complex as a
paradigm in genomics research. Brief. Funct. Genomic Proteomic. 8, 379–394.
• The following describe the recent advances that have produced the high-throughput sequencing
platforms on which contemporary sequencing depends:
Davies, K. (2010). The $1000 Genome: The Revolution in DNA Sequencing and the New Era of
Personalized Medicine. Free Press, New York.
Mardis, E. (2008). The impact of next-generation sequencing technology on genetics. Trends
Genet. 24, 133–141.
Ng, P.C. & Kirkness, E.F. (2010). Whole genome sequencing. Methods Mol. Biol. 628, 215–226.
• A collection of papers on new techniques and their applications in medicine:
Janitz, M., ed. (2008) Next-Generation Genome Sequencing: Towards Personalized Medicine.
Wiley-VCH, Weinheim.
• The next two articles describe the development of databases of sequences and structures.
Smith, T.F. (1990). The history of the genetic sequence databases. Genomics 6, 701–707.
Bernstein, H. & Bernstein, F. (2005). Databanks of macromolecular structure. In Database
Annotation in Molecular Biology: Principles and Practice, Lesk, A.M. (ed.) J. Wiley & Sons,
Chichester, pp. 63–79.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 3.1 Two loci with alternative alleles A/a and B/b, respectively, are 1 cM apart. A cross
between parents of genotype AB/AB and ab/ab produces a large number of offspring. Assuming
no selective difference between genotypes, estimate the fraction of the next generation that has
genotype Ab/aB.
Exercise 3.2 Gene A has two alleles, A1 and A2. Gene B has two alleles, B1 and B2. In a population,
the following haplotype frequencies are observed: A1B1 = 0.2, A2B2 = 0.45, A1B2 = 0.15, A2B1 = 0.2.
Calculate D, the extent of linkage disequilibrium.
Exercise 3.3 A fruitfly with a chromosome deletion shows pseudodominance for a trait. On a
photocopy of Figure 3.3, indicate with an ‘X’ where the locus for this trait might be.
Exercise 3.4 The Philadelphia translocation occurs in a bone marrow cell, resulting in the
development of chronic myeloid leukaemia. Would the patient transmit this leukaemia to his
or her offspring?
Exercise 3.5 On a photocopy of Figure 3.6, indicate two positions in the human chromosome
where one would look for genes that are linked in humans but unlinked in chimpanzees.
Exercise 3.6 (a) On a photocopy of Figure 3.8(a), indicate with an ‘A’ the band on the gel
that corresponds to the BamHI fragment in Figure 3.8(b) which begins at 1 kb and ends at 5 kb.
(b) On a photocopy of Figure 3.8(b), indicate with a ‘B’ the fragment that gives rise to the band
in the EcoRI lane of the gel which corresponds to the lowest molecular mass fragment.
Exercise 3.7 The lengths of blocks that define human haplotypes vary, partly because of the
variation of recombination rates along the genome. Would you expect the haplotype blocks
to vary more if their sizes are measured in terms of number of base pairs or in terms of genetic
distance in centimorgans?
Exercise 3.8 Suppose that there are ten SNPs in a 10 kb region. If the region is on the Y
chromosome, how many possible haplotypes are there? If the region is on a diploid chromosome,
how many possible haplotypes are there?
Exercise 3.9 On a photocopy of Figure 3.9, indicate, by crossing atoms out and writing atoms in,
how the structure would have to be changed to illustrate an RNA molecule with the equivalent
base sequence.
Exercise 3.10 From a photocopy of Figure 3.11, cut out the individual bases and show that
guanine and thymine could form a non-canonical base pair containing two hydrogen bonds.
By comparing with a copy of the standard base pairs, show the extent to which a guanine–
thymine pair would not match the correct relative position and orientation of the sugars to be
stereochemically compatible with standard base pairs in a double helix of standard structure.
Uracil has the same hydrogen-bonding specificity as thymine. Guanine–uracil ‘wobble’ base
pairs are implicated in codon–anticodon interactions between tRNA and mRNA.
Exercise 3.11 The tetranucleotide illustrated in Figure 3.9 is self-complementary. (a) What does
this mean? (b) Make two photocopies of Figure 3.9. From one of them trim off the names of the
bases. From the other, cut out the individual bases and mount them adjacent to the first in position
to form Watson–Crick base pairs. Draw in the hydrogen bonds between bases. (It will not work
simply by turning one copy upside down and mounting it next to the other. In a double helix,
the two copies are symmetrically disposed in three dimensions but not in two dimensions.)
Exercise 3.12 If all of the DNA in all of the cells of your body were laid end to end, would you be
surprised if it were longer than the diameter of the solar system? Calculate the result and compare.
The semi-major axis of Pluto’s orbit is 5 906 376 272 km. The number of cells in an adult human
body has been estimated as $10^{13}$.
Exercise 3.13 In Figure 3.13, the region of the template strand not complexed with the primer is
shown as continuing the helical structure. Although justifiable pedagogically for clarity, why is this
not a structurally correct representation?
Exercise 3.14 On a photocopy of Figure 3.19, indicate the longest contig available from the data
given.
Exercise 3.15 In Figure 3.19, (a) What is the minimal coverage of any position (this is obvious)?
(b) What is the maximal coverage of any position (i.e. the largest number of fragments in which
the same position appears)? (c) Estimate the average coverage of the entire region. (Hint:
measure the total lengths of the fragments and divide by the length of the region.)
Exercise 3.16 The International Human Genome Mapping Consortium fingerprinted 300 000 BAC
clones. Assuming an average insert size of 150 kb and a 3.2 Gb genome size, what coverage
would be expected?
Exercise 3.17 One difficulty in extracting reads that correspond to mitochondrial DNA from
sequencing mixed fragments of nuclear and mitochondrial DNA is that the nuclear genome
contains segments homologous to regions of the mitochondrial genome, called numts.
Mammalian genomes contain 50–450 kb of numts. (The human genome contains 1005 such
segments, of average length 446 bp.) Estimate the fraction of reads from fragments of mammoth
DNA that are likely to be numts. The mammoth genome is 4.7 Gb long.
Exercise 3.18 Referring to Figure 3.24, by what primers, in which cycle, is the base at position
10 tested?
Exercise 3.19 Referring to Figure 3.23, suppose that the dinucleotide that bound to positions
23456789 gave a green fluorescence. What is the base at position 3?
Exercise 3.20 How much raw sequence data was generated for the turkey genome project?
How many human genome equivalents does this amount to?

Problems
Problem 3.1 Consider two linked traits in a population in which half of the individuals are
double heterozygotes with genotype AB/ab and the other half are double homozygotes (AB/AB).
Assuming no selective advantage of any combination of alleles for these traits and no preferential
mating, after recombination brings the population to equilibrium, what will be the ratio of AB/AB,
Ab/aB = aB/Ab and ab/ab individuals?
Problem 3.2 Consider two markers 1 cM apart. (a) What is the probability that there will be
recombination between them in one generation? (b) What is the probability that there will not be
recombination between them in one generation? (c) What is the formula for the probability that
there will not be recombination between them in n generations? (d) Evaluating this formula, what
is the probability that there will not be recombination between them in n = 10, 20, 30, 40, and
50 generations?
Problem 3.3 As a simplified but illustrative example of sequence assembly, we saw the first two
verses of Richard III chopped into overlapping 10-character fragments. (a) Chop these lines into
consecutive overlapping 5-character fragments, and scramble these fragments into random order.
Is it still possible to reconstruct the lines without ambiguity? Why is it more difficult to do so than
to reconstruct the lines from 10-character fragments? (b) Try generating, and then reassembling,
10- and 5-character fragments, presented in random order, of the lines of Polonius
(ignore punctuation marks):
… ’tis true, ’tis true ’tis pity,
And pity ’tis ’tis true

In each case, is it still possible to reconstruct the lines without ambiguity?

(c) Try the same with Richard II’s speech:
Your cares set up do not pluck my cares down.
My care is loss of care, by old care done;
Your care is gain of care, by new care won:
The cares I give I have, though given away;
They tend the crown, yet still with me they stay.

Problem 3.4 Extend Problem 3.3 to simulate the effect of paired-end reads. Take any text of
about 100 words in length (a sonnet is about the right length) and write a program to create
fragments whose lengths are distributed roughly normally around 30 ± 5 characters.
Print reads of 8 characters from each end. Tabulate the data from different fragments in random
order. Try to reassemble the text. Study how the difficulty of the assembly depends on the read
length, fragment length, and coverage.
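
One possible starting point for the program that Problem 3.4 asks for is sketched below. It is only a sketch under stated assumptions: the function names `paired_end_reads` and `read_coverage` are hypothetical, fragment lengths are drawn from a normal distribution and clamped so that every fragment fits inside the text and holds two non-overlapping reads, and both reads are reported in the forward direction (real paired-end reads come from opposite strands). Reassembly is left to the reader.

```python
import random


def paired_end_reads(text, n_fragments, mean_len=30.0, sd_len=5.0, read_len=8, seed=0):
    """Sample random fragments of `text`; return shuffled (left, right, length) tuples."""
    if len(text) < 2 * read_len:
        raise ValueError("text is shorter than two reads")
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    reads = []
    for _ in range(n_fragments):
        # Fragment length: roughly normal, rounded, and clamped so that the
        # fragment holds two non-overlapping reads and fits inside the text.
        length = round(rng.gauss(mean_len, sd_len))
        length = max(2 * read_len, min(length, len(text)))
        start = rng.randrange(len(text) - length + 1)
        fragment = text[start:start + length]
        # Keep only the two ends of the fragment, plus its length.
        reads.append((fragment[:read_len], fragment[-read_len:], length))
    rng.shuffle(reads)  # present the pairs in random order
    return reads


def read_coverage(n_fragments, read_len, text_len):
    """Coverage by reads alone: two reads per fragment, divided by the text length."""
    return 2 * n_fragments * read_len / text_len


if __name__ == "__main__":
    sample = ("Your cares set up do not pluck my cares down. "
              "My care is loss of care, by old care done; "
              "Your care is gain of care, by new care won")
    for left, right, length in paired_end_reads(sample, n_fragments=10):
        print(f"{left!r} ... {right!r}  (fragment length {length})")
    print("read coverage:", read_coverage(10, 8, len(sample)))
```

Changing `read_len`, `mean_len`, and `n_fragments` gives the different read lengths, fragment lengths, and coverages whose effect on assembly the problem asks you to study.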
Problem 3.5 Lander & Waterman* derived formulas for the expected completeness of an
assembly as a function of coverage (G = genome length, N = number of reads, L = read length,
c = NL/G = coverage):
probability that a base is not sequenced = $e^{-c}$
total expected gap length = $G e^{-c}$
total number of gaps = $N e^{-c}$

* Lander, E.S. & Waterman, M.S. (1988). Genomic mapping by ﬁngerprinting random clones:
a mathematical analysis. Genomics 2, 231–239.

(a) What fraction of a genome could you expect to assemble from eightfold coverage? (b) What
total gap length would you expect in an assembly of a 2 Mb target genome size from eightfold
coverage? (c) How many gaps would you expect in an assembly of a 2 Mb target genome size
from an eightfold coverage of fragments with a read length of 500? (d) You want to sequence
a 4 Mb genome by the shotgun method, by assembling random fragments with read length 500.
What coverage would you require, to expect no more than four gaps, assuming no complications
arising from repetitive sequences or far-from-equimolar base composition?
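
The three Lander–Waterman expressions quoted in Problem 3.5 are easy to evaluate numerically. The sketch below simply codes them as stated (c = NL/G, probability of a missed base e^{-c}, expected gap length G e^{-c}, expected number of gaps N e^{-c}). The function names `lander_waterman` and `reads_for_coverage` are hypothetical, and the demonstration values (a 1 Mb target, read length 400, fivefold coverage) are deliberately different from those in the problem.

```python
import math


def lander_waterman(genome_len, n_reads, read_len):
    """Evaluate the Lander-Waterman expectations for a shotgun assembly."""
    c = n_reads * read_len / genome_len      # coverage, c = NL/G
    p_missed = math.exp(-c)                  # probability that a base is not sequenced
    return {
        "coverage": c,
        "fraction_sequenced": 1.0 - p_missed,
        "expected_gap_length": genome_len * p_missed,   # G * e^(-c)
        "expected_gaps": n_reads * p_missed,            # N * e^(-c)
    }


def reads_for_coverage(genome_len, read_len, coverage):
    """Number of reads needed to reach a given coverage (rounded up)."""
    return math.ceil(coverage * genome_len / read_len)


if __name__ == "__main__":
    G, L, c = 1_000_000, 400, 5              # illustrative values only
    N = reads_for_coverage(G, L, c)
    for name, value in lander_waterman(G, N, L).items():
        print(f"{name}: {value:.4g}")
```

Part (d) of the problem runs the calculation the other way round: because N itself depends on c, the condition on the number of gaps becomes an equation in c that can be solved by trying successive values of c with the function above.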
Problem 3.6 Figure 3.25 shows a sequencing gel. (a) What is the sequence of this fragment?
(b) Can you see any self-complementary regions in this fragment that might form hairpin loops?
(c) On the basis of your answer to part (b), would you guess that this region encodes RNA
or protein?

Figure 3.25 Autoradiograph of a sequencing gel (simulated). The shortest fragment travels the
farthest. In this diagram, the direction of travel is down the page.

Problem 3.7 Figure 3.26(a) shows a series of measurements from the SOLiD technology (see
p. 102). Figure 3.26(b) shows the colour coding of the dinucleotides. The known template
sequence implies that the first base is an A, therefore the dinucleotide at positions 0-1 is A-?
(a) What is the sequence of the fragment? (b) Suppose another fragment differs by a SNP at
position 15, and suppose that this SNP is a transition mutation (that is, a purine to the other
purine, or a pyrimidine to the other pyrimidine). How would a figure, corresponding to
Figure 3.26(a), that presents SOLiD results from the mutated fragment, differ from
Figure 3.26(a)?
Problem 3.8 From Figure 3.18, (a) determine the rate of change of the cost of sequencing over
the years 2005–2007. (b) Determine the rate of change of the cost of sequencing over the years
2008–2010. (c) According to these figures, what would be the cost in 2010 of determining a
human-sized genome at 10X coverage?
Problem 3.9 Many people have asked whether the author is related to Filippo Brunelleschi,
who was born in 1377, died in 1446 and is buried beneath the cathedral in Florence (the dome
of which he famously created). Assume that you were granted permission to exhume his body
and collect a tissue sample. (a) Estimate the number of generations between Brunelleschi and the
author. (b) Assuming that the author is a direct descendant of Brunelleschi, would you expect
to be able to prove it by DNA sequencing? Explain your answer.

[Figure 3.26, panel (a): read positions 0–35, interrogated in primer rounds 1–5 with universal sequencing primers (n) to (n-4), over ligation cycles 1–7; asterisks indicate positions of interrogation. Panel (b): grid of first base against second base (A, C, G, T), coloured by dinucleotide.]

Figure 3.26 (a) A series of measurements from the SOLiD technology. (b) The colour coding of the dinucleotides.

Problem 3.10 You are asked to devise a Master’s degree programme to train annotators for
databases in molecular biology. (a) What background would you require for entry to the
programme? (b) What courses would you require students to take during the programme?

Weblems
Weblem 3.1 Which of the following families have genes appearing in tandem arrays in the human
genome and which have genes dispersed among several chromosomes? (a) Actin, (b) tRNA,
(c) all globins, (d) HOX genes, (e) the major histocompatibility complex.
Weblem 3.2 What institution currently has the highest sequencing throughput power? How
many Gb per week can this institute produce?
Weblem 3.3 For the major companies providing sequencing equipment, what are the current
throughput rate and typical read length of their state-of-the-art instruments?
Weblem 3.4 On a photocopy of Figure 3.18, add points to bring the figure up-to-date.
CHAPTER 4

Comparative Genomics

LEARNING GOALS

• Knowing the three major divisions of living things – archaea, bacteria, and eukaryotes – based
on analysis of the sequences of 16S rRNA genes.
• Recognizing the prevalence of horizontal gene transfer, especially among prokaryotes, and
understanding that horizontal gene transfer is inconsistent with the hierarchical ‘tree of life’ picture
that the Linnaean classification scheme suggests.
• Being familiar with major events in the history of life.
• Appreciating the general distribution of genome sizes and numbers of genes.
• Distinguishing the characteristics of different types of genome organization in viruses,
prokaryotes, and eukaryotes.
• Recognizing the effects of gene duplication on genome evolution.
• Being able to distinguish the meanings of homologue, orthologue, and paralogue.
• Understanding the mechanism of genome change at the levels of individual bases, genes,
chromosome segments, and whole genomes.
• Understanding the limits of what genomes determine and what they do not determine, and the
limits of what we can currently explain on the basis of genetics and what we cannot.
• Appreciating, as far as possible, what makes us human.
• Understanding the idea of a model organism in the study of human disease.
• Appreciating the goals and plans of the Encyclopedia of DNA Elements (ENCODE) project, and
the related project, modENCODE.

Introduction

It is likely that life originated on Earth about 3.5 billion years ago. The first cellular life forms were undoubtedly prokaryotes. Eukaryotes appeared about 2 billion years later. There are enough residual similarities among living things to suggest a common ancestor of us all. The great diversity of living forms is, therefore, the result of divergence.
Sequence analysis gives the most unambiguous evidence for the relationships among species. For higher organisms, sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology usually give a consistent picture. Classification of microorganisms is more difficult, partly because it is less obvious how to select the features on which to classify them, and partly because a large amount of lateral gene transfer threatens to overturn the picture of the evolutionary tree entirely.
In this chapter, we discuss general approaches to comparative genomics of different species. From a human-centred view, we ask how to compare genomes in a way that illuminates our relationship with other species. In succeeding chapters, we extend this discussion to a more general view of the interaction and evolution of genomes.
Other aspects of comparative genomics lie outside the scope of this chapter: (a) Description of the variability of genomes within species. We have discussed the HapMap project, which is the comparative genomics of humans. (b) Cancer genomics has become a major thrust of research. Goals include both the study of cancer-related gene variations within populations that define risk factors (for instance, mutations in the tumour suppressor genes BRCA1 and BRCA2 enhance an individual’s likelihood of developing breast and ovarian cancer), and the study of genomic changes within single individuals as tumours develop and diverge.

Unity and diversity of life

The diversity of life fascinates everyone. The macroscopic life forms most familiar to us come in discrete types called species. Linnaeus, an 18th-century Swedish naturalist, first organized the characteristics of different species into a logical framework. He introduced the system of nomenclature still used today. (We take up biological systematics in more detail in Chapter 5.)
Linnaeus classified living things according to a hierarchy: kingdom, phylum, class, order, family, genus, and species. It usually suffices to specify the lowest two levels, as a binomial: genus and species. For instance, Homo sapiens = humans or Drosophila melanogaster = fruit fly (see Table 4.1). Each binomial uniquely identifies a species that may also be known by one or more common names; for instance, Bos taurus = cow. Conversely, many common names refer to whole groups of species. For example, there are many species of whales, not all in the same genus or even the same family. Of course, most species do not have common names at all.
Common names can cause, or result from, confusion. Often settlers on new continents applied common names to animals similar in appearance to familiar ones, but not in fact closely related to them. For instance, early European settlers in Australia called a native animal the koala bear, although marsupial koalas are not closely related to European bears.

Table 4.1 Classifications of humans and the fruit fly

           Human       Fruit fly
Kingdom    Animalia    Animalia
Phylum     Chordata    Arthropoda
Class      Mammalia    Insecta
Order      Primata     Diptera
Family     Hominidae   Drosophilidae
Genus      Homo        Drosophila
Species    sapiens     melanogaster

For macroscopic organisms, the Linnaean classification is reinterpretable as a phylogenetic tree – a set of ancestor–descendant relationships between species. A thread of continuous family history unites all organisms. We have already alluded to the dissonance between the concept of a continuous evolutionary pathway from a common ancestor to daughter species and the idea that species are fundamentally discrete (see pp. 65–66). Nevertheless, the concept of a species remains a useful one, even though it has proved very difficult to define precisely, even for macroscopic organisms.
Genome sequences provide the most general, detailed, and consistent approach to definition of species. Sequences rule microbial taxonomy, but jostle for power with traditional morphological methods in the classification of plants and animals (see Box 4.1).
Microbiologists use Linnaean nomenclature for bacteria but find themselves uncomfortable doing so. Structural characteristics of bacteria lend themselves less well to distinguishing species than the physical features of higher organisms.

• Traditional Linnaean biological nomenclature assigns names to species, and classifies species according to evolutionary relationships. It is often difficult to draw boundaries between species, especially for microorganisms.

Traditional methods for bacterial classification were based on features of morphology (cell size and shape), biochemistry (uptake of stains, carbon and nitrogen sources, fermentation products), and physiology (growth temperature range and optimum, osmotic tolerance). Gram-positive and Gram-negative bacteria differ in their ability to take up crystal violet or methylene blue stain: Gram-positive bacteria contain a thick (20–80 nm) peptidoglycan layer in their cell wall that binds the stain. Immunological cross-reactivity has also been a basis for classification, especially among infectious species that elicit a clinical motivation for their classification. Before sequencing, hybridization of DNA from two different bacteria was a criterion for similarity. Most bacterial DNAs will form hybrid double-helical structures provided that the similarity in base sequence is >80%.
Later, following seminal work by C. Woese, prokaryotic species were defined in terms of variations in 16S ribosomal RNA (rRNA) and other sequences. Bacteria for which the 16S rRNA sequences are more than about 2.5–3% different are considered different species. Typically, this corresponds to no more than 70% similarity in overall genome sequence. If humans and chimpanzees were bacteria, we would easily be considered as the same species!

BOX 4.1 DNA barcoding

Field biologists now often characterize populations by DNA sequences. For higher animals, the sequence of the cytochrome c oxidase subunit I mitochondrial region (COI) provides a compact index for species identification. In most groups, this region is 648 bp long. The sequence variation within a species is small compared with differences between species.
Biologists describe the identification of species by the sequence of this region as ‘barcoding’. The Barcode of Life database (BOLD) collects the information, currently covering over 100 000 species. Its query system converts COI sequences to taxonomic assignments (http://www.barcodinglife.org).
Barcoding is a focus of the debate over traditional morphology-based taxonomy versus reliance on sequences. Of course, for classification of long-extinct organisms for which no DNA sequences are available, there is no choice. And, despite their utility, barcodes tell us very little about the organisms they identify. No one would deny the phenotypic richness observable in living specimens – not least in their behaviour – much of which we cannot yet infer, even from complete genome sequences.
It was by sequencing the Barcode regions that the New York secondary school students investigated the species attributions in sushi restaurants (p. 21).

Taxonomy based on sequences

Protein, RNA, and DNA sequences have illuminated relationships between species, both for macroscopic organisms and microbes. The sequences have clarified some relationships but have exposed others as simplistic. Major results include the following:
118 4 Comparative Genomics

[Figure 4.1: phylogenetic tree with three main branches. Bacteria: green non-sulphur bacteria, Gram-positive bacteria, purple bacteria, cyanobacteria, flavobacteria, Thermotoga, Aquifex. Archaea: extreme halophiles, Methanobacterium, Methanococcus, Thermoplasma, Pyrodictium, Thermococcus, Thermoproteus. Eukarya: animals, slime moulds, fungi, entamoebae, plants, algae, ciliates, flagellates, trichomonads, diplomonads. Asterisks (❉) mark branch points.]

Figure 4.1 Major divisions of the tree of life. Bacteria (blue) and archaea (magenta) are prokaryotes; their cells do not contain nuclei.
Bacteria include the typical microorganisms responsible for many infectious diseases and, of course, Escherichia coli, the mainstay of
molecular biology. Archaea include, but are not limited to, extreme thermophiles and halophiles, sulphate reducers, and methanogens.
We ourselves are eukarya – organisms containing cells with nuclei (green and red). Asterisks mark crucial splitting points (see Exercise 4.4).
This phylogenetic tree was derived by C. Woese from comparisons of ribosomal RNAs. These RNAs are present in all organisms, and
show the right degree of divergence. (Too much or too little divergence and relationships become invisible.) Figure 4.2 shows in more
detail the group that includes us – animals, fungi, and plants (red).

• All life on Earth has enough general similarity to show that all life forms had a common origin. Evidence includes the universality of the basic chemical structures of DNA, RNA, and proteins, the universality of their general biological roles, and the near-universality of the genetic code.
• On the basis of 16S rRNAs, C. Woese divided living things most fundamentally into three domains: bacteria, archaea, and eukarya. A domain occupies a level in the hierarchy above kingdom.
Figure 4.1 shows the major divisions of the tree of life. At the ends of the eukaryote branch are the metazoa, including yeast and all multicellular organisms – fungi, plants, and animals (see Figure 4.2). We and our closest relatives are in the vertebrate branch of the deuterostomes (see Figure 4.3).
Although archaea and bacteria are both unicellular organisms that lack a nucleus, at the molecular level archaea are in some ways more closely related to eukarya than to bacteria. It is also likely that the archaea are the closest living organisms to the root of the tree of life.
• Dating of historical events from sequence differences. As species diverge, their sequences diverge. L. Pauling and E. Zuckerkandl suggested that if sequence divergence occurred at a constant rate, it would provide a ‘molecular clock’ that would allow the dating of the splits in lineage between species. Although the clock is not universal, judicious calibration of rates of sequence change with palaeontological data permits dating of events in the history of life (see Box 4.2 and Figure 2.15).
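The molecular-clock idea in the last bullet can be made concrete with a little arithmetic. The sketch below assumes the simplest possible clock: two lineages that split t million years ago each accumulate changes at the same constant rate r, so the observed distance between them is d = 2rt. A fossil-dated split calibrates r, which then dates another split. The function names and all numbers are hypothetical, chosen only for illustration; the sketch ignores multiple substitutions at one site and rate variation between lineages.

```python
def calibrate_rate(distance, split_time_myr):
    """Rate per site per million years per lineage, from a split of known age.

    distance: fraction of sites that differ between the two calibration species.
    split_time_myr: age of their divergence (e.g. from fossils), in Myr.
    """
    if split_time_myr <= 0:
        raise ValueError("split time must be positive")
    # Both lineages accumulate changes, hence the factor of 2.
    return distance / (2.0 * split_time_myr)


def date_split(distance, rate):
    """Estimated divergence time (Myr) for a pair of species, given a rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)


# Hypothetical calibration pair: 20% of sites differ, fossils date the split at 100 Myr.
rate = calibrate_rate(0.20, 100.0)

# Hypothetical query pair: 1.2% of sites differ.
estimate = date_split(0.012, rate)
print(f"rate = {rate:.4f} per site per Myr, estimated split = {estimate:.1f} Myr ago")
```

With these made-up inputs the estimate comes out at about 6 Myr. In practice the calibration step is where the 'judicious' choices enter: different genes and lineages give different rates.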

BOX 4.2 Molecular phylogeny and chronology

Molecular approaches to phylogeny developed against a background of traditional taxonomy, based on a variety of morphological characters, embryology, geographical distribution and, for fossils, information about the geological context (stratigraphy). The classical methods have some advantages. Traditional taxonomists have much greater access to extinct organisms via the fossil record. They can date the appearance and extinction of species by geological methods (see Figure 2.15).
Molecular biologists, in contrast, have very limited access to extinct species. Some subfossil remains of species that became extinct as recently as within the last two centuries have legible DNA, including specimens of the quagga (a relative of the zebra), the thylacine (Tasmanian ‘wolf’, a marsupial), the mammoth from the permafrost in Russia, the dodo of Mauritius, the ‘elephant bird’ of Madagascar, and some New Zealand birds, for instance, moas. It has been possible to sequence mitochondrial DNA from ∼10 000-year-old remains of an ‘Irish elk’. DNA sequences from Neanderthal man have been recovered from individuals who died approximately 40 000 years ago. But Jurassic Park remains fiction!
A crucial event in the acceptance of molecular methods occurred in 1967 when V.M. Sarich and A.C. Wilson dated the time of divergence of humans from chimpanzees at 5 million years ago, based on immunological data. At that time, traditional palaeontologists dated this split at 15 million years ago and were reluctant to accept the molecular approach. Reinterpretation of the fossil record led to acceptance of a more recent split and broke the barrier to general acceptance of molecular methods. It is now generally accepted that human and chimpanzee lineages diverged between ∼6 and 8 million years ago.

[Figure 4.2: tree diagram. Taxa, in the order shown: Vertebrata (human), Cephalochordata (lancelets), Urochordata (sea squirts), Hemichordata (acorn worms), Echinodermata (starfish, sea urchins), Bryozoa, Entoprocta, Platyhelminthes (flatworms), Pogonophora (tube worms), Brachiopoda, Phoronida, Nemertea (ribbon worms), Annelida (segmented worms), Echiura, Mollusca (snails, clams, squids), Sipunculan (peanut worms), Gnathostomulida, Rotifera, Gastrotricha, Nematoda (roundworms), Priapulida, Kinorhynchs, Onychophora (velvet worms), Tardigrada (water bears), Arthropoda (insects, crabs), Ctenophora (comb jellies), Cnidaria (jellyfish), Poriferans (sponges), Fungi (yeast, mushrooms), Plants. Bracket labels: Deuterostomes, Lophotrochozoa, Protostomes, Ecdysozoa, Bilateria.]

Figure 4.2 Phylogenetic tree of metazoa (multicellular animals). Bilateria include all animals that share a left–right symmetry of body plan. Protostomes and deuterostomes (red) are two major lineages that separated at an early stage of evolution, estimated at 670 million years ago. They show very different patterns of embryological development, including different early cleavage patterns, opposite orientations of the mature gut with respect to the earliest invagination of the blastula, and the origin of the skeleton from mesoderm (deuterostomes) or ectoderm (protostomes). Protostomes comprise two subgroups distinguished on the basis of the sequences of an RNA from the small ribosomal subunit and HOX genes. (HOX genes govern the development of body plans.) Morphologically, ecdysozoa have a moulting cuticle – a hard outer layer of organic material. Lophotrochozoa have soft bodies. Figure 4.3 shows in more detail the group that includes us – the deuterostomes (red).

[Figure 4.3: tree diagram of the deuterostomes. Taxa, in the order shown: Echinoderms (starfish), Urochordates (tunicate worms), Cephalochordates (amphioxus), Jawless fish (lamprey, hagfish), Cartilaginous fish (shark), Bony fish (zebrafish), Amphibians (frog), Mammals (human), Reptiles (lizard), Birds (chicken).]

Figure 4.3 Phylogenetic tree of vertebrates and our closest relatives. Chordates, including vertebrates, and echinoderms are all deuterostomes. Examples of each are shown in blue.

• The importance of horizontal gene transfer. This is the acquisition of genetic material by one organism from another by natural rather than laboratory procedures through some means other than descent from a parent during replication or mating (see Box 4.3). Several mechanisms of horizontal gene transfer are known, including direct uptake, as in Griffith’s pneumococcal transformation experiments, or via a viral carrier. Arrangement of species into phylogenetic trees, in contrast, assumes strict ancestor–descendant relationships between different organisms during evolution.

Horizontal gene transfer among different species has affected most genes in prokaryotes. It requires a change in our thinking from ordinary ‘clonal’ or parental models of heredity. Microorganisms do not easily fit into the structure of the ‘tree’ of life but require a more complex organizational chart.

BOX 4.3 Please pass the genes: horizontal gene transfer

On learning that Streptomyces griseus trypsin is more closely related to bovine trypsin than to other microbial proteinases, Brian Hartley commented in 1970 that ‘. . . the bacterium must have been infected by a cow’. This was a clear example of lateral or horizontal gene transfer – a bacterium picking up a gene from the soil in which it was growing, that an organism of another species had deposited there. The classic experiments on pneumococcal transformation by F. Griffith, and those by O. Avery, C. MacLeod, and M. McCarty that identified DNA as the genetic material, are another example.
Evidence for horizontal transfer includes (1) discrepancies among evolutionary trees constructed from different genes; and (2) direct sequence comparisons between genes from different species.

• In Escherichia coli, about 25% of the genes appear to have been acquired by transfer from other species.
• In microbial evolution, horizontal gene transfer is more prevalent among operational genes – those responsible for ‘housekeeping’ activities such as biosynthesis – than among informational genes – those responsible for organizational activities such as transcription and translation.
For example:
– Bradyrhizobium japonicum, a nitrogen-fixing bacterium, symbiotic with higher plants, has two glutamine synthetase genes: one is similar to those of its bacterial relatives; the other is 50% identical to those of higher plants;
– rubisco (ribulose-1,5-bisphosphate carboxylase/oxygenase), the enzyme that first fixes carbon dioxide at entry to the Calvin cycle of photosynthesis, has been passed around between bacteria, mitochondria, and algal plastids, as well as undergoing gene duplication;
– many phage genes appearing in the E. coli genome provide further examples and point to a mechanism of transfer.

Nor is the phenomenon of horizontal gene transfer limited to prokaryotes. Both eukaryotes and prokaryotes are chimaeras. Eukaryotes derive their informational genes primarily from an organism related to Methanococcus, and their operational genes primarily from proteobacteria, with some contributions from cyanobacteria and methanogens. Almost all informational genes from Methanococcus itself are similar to those in yeast. At least eight human genes appeared in the Mycobacterium tuberculosis genome. S. griseus trypsin is an example of eukaryote → prokaryote transfer.
The observations hint at the model of a ‘global organism’, or a genomic World Wide DNA Web from which organisms download genes at will! How can this be reconciled with the fact that the discreteness of species has been maintained? We offered the conventional explanation, that the living world contains ecological ‘niches’ to which individual species are adapted: the discreteness of niches explains the discreteness of species. But this explanation depends on the stability of normal heredity to maintain the fitness of the species. Why would the global organism not break down the lines of demarcation between species, just as global access to pop culture threatens to break down lines of demarcation among national and ethnic cultural heritages? Perhaps the answer is that it is the informational genes, which appear to be less subject to horizontal transfer, that determine the identity of the species.
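Statements in Box 4.3 such as ‘50% identical’ rest on counting matching positions in a sequence alignment. The sketch below shows one common way to compute percent identity from two already-aligned sequences, assuming gaps are written as ‘-’ and that columns containing a gap are left out of the count. Other conventions for the denominator exist. The function name and the two short peptide strings are hypothetical, invented for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Made-up aligned fragments (not real protein sequences).
seq_a = "MKT-AYIAKQ"
seq_b = "MKSLAYLAKQ"
print(f"{percent_identity(seq_a, seq_b):.1f}% identical")
```

A gene that scores far higher against a distant species than against its owner's close relatives, as with the second glutamine synthetase of Bradyrhizobium japonicum, is the kind of discrepancy that suggests horizontal transfer.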

Sizes and organization of genomes

We appeal to genomes to help us to understand ourselves as individuals, and our relationships with all of the other organisms that march in the pageant of life. To make progress, we must integrate several data streams, including:

• genome sequences;
• RNA and protein expression patterns;
• the spatial organization of individual macromolecules, their complexes, organelles, entire cells, tissues, and bodies; and
• regulatory networks, the internal structure and logic of adaptive control systems.

Even these may not be enough. History – sometimes observed but more usually inferred – provides essential additional clues. We see, today, a snapshot of one stage in a history of life that extends back in time for at least 3.5 billion years. We must try to read the past in contemporary genomes, which contain records of their own development.

US Supreme Court Justice Felix Frankfurter wrote that ‘. . . the American constitution is not just a document, it is a historical stream.’ Like a genome!

This programme requires development of novel methods. New fields of study require new approaches. S.E. Luria once suggested that to determine common features of all life one should not try to survey everything, but, rather, identify the organism most different from us and see what we have in common with it. Let us combine this with a complementary idea: to take the most closely related organisms and identify the differences. That is:

• How do the human genome and the E. coli genome express our common heritage?
• How do genomes that are over 96% identical create the differences between humans and chimpanzees?

If we could answer these questions, we would have achieved a lot.

• We appeal to genomes to help us to understand evolutionary relationships. With a few exceptions, our access to genome sequences is limited to organisms alive today, that is, to a single snapshot in time. However, genomes contain records of their history, which give us a window onto the past.

Genome sizes

One reason for resistance to Darwin’s theory of evolution was its denial to human beings of a special status relative to animals. Genomics threatens to do this all over again. Humans do have unique features. Many people, not excluding molecular biologists, expect these features to be reflected in the genome. And so they must be, although, frankly, not in any obvious way.

• The term C-value has been used to refer to the amount of DNA in a haploid cell, i.e. a gamete; the letter C refers to the constancy of the amount of DNA per cell in a species.

The overall size of the human genome is not special. Different organisms have different total amounts of DNA per cell (see Figure 4.4 and Table 4.2).

[Figure 4.4: genome-size ranges for mammals, animals, plants, fungi, bacteria, and viruses; horizontal axis from 10³ to 10¹¹.]

Figure 4.4 Distribution of genome sizes in different groups of living things. The horizontal scale gives the number of bases or base pairs.

Table 4.2 Genome sizes

Organism                    Number of      Number    Comment
                            base pairs     of genes

φX-174                      5 386          10        Virus infecting E. coli
Influenza A                 13 590         10        Strain A/goose/Guangdong/1/96(H5N1)
Human mitochondrion         16 569         37        Subcellular organelle
Epstein–Barr virus (EBV)    172 282        80        Cause of mononucleosis
Nanoarchaeum equitans       490 885        552       Archaeon, smallest known genome of a cellular organism
Mycoplasma pneumoniae       816 394        680       Cause of cyclic pneumonia epidemics
Rickettsia prowazekii       1 111 523      834       Bacterium, cause of epidemic typhus
Mimivirus                   1 181 404      1 262     Virus with the largest known genome
Borrelia burgdorferi        1 471 725      1 738     Bacterium, cause of Lyme disease
Aquifex aeolicus            1 551 335      1 749     Bacterium from hot spring
Thermoplasma acidophilum    1 564 905      1 509     Archaeal prokaryote, lacks cell wall
Helicobacter pylori         1 667 867      1 589     Chief cause of stomach ulcers
Methanococcus jannaschii    1 664 970      1 783     Archaeal prokaryote, thermophile
Haemophilus influenzae      1 830 138      1 738     Bacterium, cause of middle-ear infections
Thermotoga maritima         1 860 725      1 879     Marine bacterium
Archaeoglobus fulgidus      2 178 400      2 437     Another archaeon
Deinococcus radiodurans     3 284 156      3 187     Radiation-resistant bacterium
Synechocystis               3 573 470      4 003     Cyanobacterium, ‘blue-green alga’
Vibrio cholerae             4 033 460      3 890     Cause of cholera
Mycobacterium tuberculosis  4 411 532      3 959     Cause of tuberculosis
Bacillus subtilis           4 214 814      4 779     Popular in molecular biology
Escherichia coli            4 639 221      4 485     Molecular biologists’ all-time favourite
Saccharomyces cerevisiae    12 495 682     5 770     Yeast, first eukaryotic genome sequenced
Caenorhabditis elegans      100 258 171    19 099    ‘The worm’
Arabidopsis thaliana        135 000 000    25 498    Flowering plant (angiosperm), ‘the weed’
Drosophila melanogaster     122 653 977    13 472    The fruit fly
Takifugu rubripes           3.65 × 10⁸     23 000    Pufferfish (fugu fish)
Human                       3.3 × 10⁹      23 000
Wheat                       16 × 10⁹       30 000
Salamander                  10¹¹           ?
Psilotum nudum              2.5 × 10¹¹     ?         Whisk fern, a simple plant
Amoeba dubia                6.7 × 10¹¹     ?         Protozoan

There is a general correlation between complexity of organism and amount of DNA per cell. Prokaryotes have less DNA per cell than eukaryotes, and yeast has less than mammals. However, although humans have more DNA per cell than certain other organisms popular in molecular biology, including Caenorhabditis elegans and the fruit fly, many organisms have even greater amounts than we do. The genome of Amoeba dubia is 200 times larger than the human genome. The genome of the marbled lungfish (Protopterus aethiopicus), a closer relative, is 43 times as large as ours.
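The size comparisons in this paragraph are simple ratios of the base-pair counts in Table 4.2. As a rough check, the sketch below divides a few of those counts by the human value. The dictionary and function names are illustrative assumptions. The figures are copied from Table 4.2, and only organisms listed there are included.

```python
# Genome sizes in base pairs, as listed in Table 4.2.
GENOME_SIZE_BP = {
    "Escherichia coli": 4_639_221,
    "Saccharomyces cerevisiae": 12_495_682,
    "Caenorhabditis elegans": 100_258_171,
    "Drosophila melanogaster": 122_653_977,
    "Human": 3.3e9,
    "Amoeba dubia": 6.7e11,
}


def relative_size(organism, reference="Human"):
    """Genome size of `organism` as a multiple of the reference genome."""
    return GENOME_SIZE_BP[organism] / GENOME_SIZE_BP[reference]


for name in GENOME_SIZE_BP:
    print(f"{name:26s} {relative_size(name):10.4f} x human")
```

On these figures the Amoeba dubia genome is a little over 200 times the human one, consistent with the statement in the text, while the worm and the fruit fly each come out at only a few per cent of the human value.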

Why the different amounts of DNA? As far as we know, most of the human genome does not encode protein or RNA. Regions of genomes without known function are often referred to as ‘junk DNA’. Of course, the fact that we may not know the function of much of our genome does not mean that it has none. (Maybe it is junk, but it is certainly not all transcriptionally inert. A series of recent discoveries has revealed many new types of RNA molecules, mostly involved in control processes. It would be naïve to doubt that many more types will come to light.) Moreover, the amount of space between genes affects the rate of crossing over and recombination and, thereby, rates of evolution. Indeed, the large amount of repetitive sequence between our genes enhances recombination rates by promoting homologous recombination. Rate of evolutionary change is a characteristic of a species that is certainly subject to selective pressure. Features of the genome that affect rate of evolution cannot be dismissed entirely as junk.
If genome size per se does not single out humans, what about numbers of genes? Again there is a general correlation between complexity of organism and estimated numbers of genes. Viral genomes encode only a few proteins. Prokaryote genomes contain hundreds or thousands of genes. The simple eukaryote yeast has almost 6000 genes, fewer than twice as many as E. coli. Metazoa have tens of thousands of genes.
However, within groups of related organisms, including vertebrates, there is no simple correlation between apparent complexity of organism, or even genome size, and numbers of genes (see Table 4.3). Two vertebrates, the puffer fish and humans, appear to have roughly the same number of genes but differ by almost an order of magnitude in genome size. It was also unexpected to find that the worm C. elegans appears to have more genes than the fruit fly.

• Even taking alternative splicing and RNA editing into account, these figures give only a static idea of proteome complexity. Cells control gene expression patterns by complex and dynamic regulatory networks. Conclusion: it is difficult to correlate numbers of expressed genes with organismal complexity if one has no good way of measuring either.

The phenomena of alternative splicing and RNA editing show the situation to be more complicated than simple gene estimates make it appear. This is one reason why it has been difficult to get an accurate count of the number of genes in humans and other higher organisms. In eukaryotes, estimates of gene number refer to maximal sets of exons in units that are coordinately transcribed and translated. In fact, variation in splicing may create many proteins from each gene. As an extreme example, in the mammalian immune system, billions of distinct antibodies arise from regions in the genome containing fewer than ∼100 exons. (The immune system is special: splicing occurs at the DNA, not the RNA, level.)
RNA editing is the alteration of bases in mRNA, after transcription. The changes are usually either C→U or A→I (I = inosine, has the coding properties of G). If only some mRNA from the same gene is edited, an extra degree of variability in the proteins arises. Investigation of RNA editing is a relatively new field, and many more implications of the process in health and disease remain to be revealed. However, it is known that defective RNA editing contributes to the pathology of sporadic amyotrophic lateral sclerosis (a neurodegenerative disease, of which the most famous sufferers have been Lou Gehrig and Stephen Hawking).
The conclusion is that it is very difficult to estimate the size – to say nothing of the complexity – of a eukaryote’s proteome from its genome.
The basis of the complexity of expression patterns, metabolic activity, and indeed all other phenotypic features is the organization of the genome itself. Different types of organism have experimented with different solutions of the problems of packaging long, narrow strands of DNA and of controlling access of transcriptional machinery to different regions.

Table 4.3 Distribution of genome sizes and gene densities

Species        Genome size   Coding   Approximate        Estimated gene
               (Mb)          (%)      number of genes    density (kb/gene)

E. coli        4.64          88       4 485              1.03
Yeast          12.5          70       6 000              2.1
Puffer fish    365           15       23 000             10
A. thaliana    115           29       23 000             6
Human          3289          1.3      23 000             143
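The gene-density column of Table 4.3 is, in essence, genome size divided by gene count. The sketch below recomputes that quotient from the genome-size and gene-number columns of the table. The names used are illustrative assumptions. A naive quotient reproduces the tabulated values for E. coli, yeast, and human, but not exactly those for the puffer fish and A. thaliana, which presumably rest on somewhat different size or gene-count estimates.

```python
# (genome size in Mb, approximate number of genes), copied from Table 4.3.
TABLE_4_3 = {
    "E. coli": (4.64, 4485),
    "Yeast": (12.5, 6000),
    "Puffer fish": (365, 23000),
    "A. thaliana": (115, 23000),
    "Human": (3289, 23000),
}


def kb_per_gene(size_mb, n_genes):
    """Average amount of genome per gene, in kilobases (1 Mb = 1000 kb)."""
    return size_mb * 1000.0 / n_genes


density = {name: kb_per_gene(*row) for name, row in TABLE_4_3.items()}
for name, value in density.items():
    print(f"{name:12s} {value:8.2f} kb/gene")
```

The spread of more than a hundredfold between E. coli and human, for gene counts that differ only about fivefold, is the quantitative form of the point made in the text: most of the extra DNA in large genomes is not accounted for by extra genes.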

Viral genomes

Viruses infect cells using specialized proteins on cell surfaces that effect attachment and invasion. Viral nucleic acid enters the host cell. In some cases, viral proteins required for replication also enter the host cell. Once inside the host cell, the invading viral molecules must (1) make multiple copies of the viral genome; (2) synthesize viral proteins, including enzymes active only within the host cell, and coat proteins (and others) to be assembled into the progeny virions; and then (3) ‘pack up and leave’.
Viral genomes contain only relatively short stretches of nucleic acids. Some, such as the virus that causes hepatitis C, encode a single polyprotein, cleavage of which produces the few proteins the virus needs to take over the cell. Other viruses, such as human immunodeficiency virus type 1 (HIV-1), contain several genes. The HIV-1 genome is about 9.8 kb long, containing a total of nine genes (see Figure 4.5 and Box 4.4). One gene encodes the Gag–Pol fusion protein, which is cleaved to release Gag, the HIV-1 protease, reverse transcriptase, and integrase. Other mRNAs expressed by HIV-1 contain introns and are spliced to express Rev and Tat.

[Figure 4.5: map of the HIV-1 genes gag, gag-pol, vif, vpr, tat, rev, vpu, env, and nef along a scale from 0 to 9 kb.]

Figure 4.5 Diagram showing the sizes of the individual gene transcripts of HIV-1. The introns of the rev and tat genes are indicated by a thin line. The proteins encoded by these genes are:

Gene   Proteins                     Function

gag    p24, p6, p7, p17             Structural proteins of capsid and matrix
pol    Reverse transcriptase,       Integration into host genome and cleavage of
       integrase, protease          viral-encoded polyproteins
env    Precursors of gp120, gp41    Envelope proteins, active in attachment and fusion to host cells
tat    Tat                          Facilitates transcription of viral RNA
rev    Rev                          Enhances cytoplasmic export of transcripts
nef    Nef                          Interferes with host immune function
vif    Vif                          Interferes with host defence
vpr    Vpr                          Needed for nuclear import of viral nucleic acid
vpu    Vpu                          Promotes assembly and release of progeny virus; also stimulates
                                    degradation of host CD4 proteins, disabling the host immune system

Recombinant viruses

Mixed infections of a cell by different viruses permit genetic recombination. It is even possible to package unaltered nucleic acid from one strain into an envelope composed of protein from another strain. In that event, the adsorption–penetration–surface antigenicity characters are those of the coat proteins, but the hereditary characteristics are those of the nucleic acid. (Infection with such a virus is a kind of natural Hershey–Chase experiment; see p. 81.) These effects can alter host specificity.

BOX 4.4 Types of viral genome

As assembled within the virion, a viral genome may consist of:

Nucleic acid             Examples
• Single-stranded DNA    Bacteriophages φX-174 and M13
• Double-stranded DNA    Adenoviruses, smallpox virus, Epstein–Barr virus, bacteriophage λ
• Single-stranded RNA    Bacteriophages MS2, Qβ, tobacco mosaic virus, HIV-1
• Double-stranded RNA    Bluetongue virus

Single-stranded DNA viral genomes are generally converted to double-stranded DNA by the host. Replication of RNA viruses is prone to mutation because the error-correction mechanisms active in host DNA replication do not apply. This helps viruses to evade host immune systems and facilitates their jumping between host species. (Emerging viral diseases, including but not limited to acquired immunodeficiency syndrome (AIDS) and avian flu, are usually based on viruses with RNA genomes.)
A viral genome consisting of single-stranded RNA can be:
(+)sense = same sequence as protein-translatable mRNA
(−)sense = complementary sequence to mRNA
ambisense = mixture of both.
Inside the cell, (+)sense viral RNAs present themselves as messenger RNA (mRNA) and are translated. (−)Sense viral RNAs and double-stranded viral RNAs require specialized polymerases for conversion to mRNA. Retroviral genomes contain (+)sense RNA, which is reverse transcribed into host DNA. These viral polymerases and reverse transcriptases are proteins that are contained in the infecting virion and enter the host cell along with the viral nucleic acid.
Some viral genomes are infectious on their own. For some RNA viruses, a DNA reverse transcript of the viral RNA is infectious (although at a lower rate than the natural virion). This permits preparation of large quantities of viral genomes for vaccines, avoiding the lability and high mutation rate of viral RNA replication.
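The difference between (+)sense and (−)sense genomes in Box 4.4 comes down to a sequence relationship: a (+)sense genome already reads as mRNA, whereas a (−)sense genome is its reverse complement. The sketch below models only that relationship. It says nothing about the polymerases that do the conversion in the cell. The function names and the nine-base example are hypothetical, and ambisense genomes are not handled.

```python
# Watson-Crick pairing for RNA bases.
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(rna):
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    rna = rna.upper()
    unknown = set(rna) - set("ACGU")
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return rna.translate(_COMPLEMENT)[::-1]


def mrna_from_genome(genome_rna, sense):
    """mRNA-sense sequence corresponding to a single-stranded RNA genome.

    sense: "+" if the genome has the same sequence as the mRNA,
           "-" if it is complementary to the mRNA.
    """
    if sense == "+":
        return genome_rna.upper()
    if sense == "-":
        return reverse_complement_rna(genome_rna)
    raise ValueError("sense must be '+' or '-'")


# Made-up example: the same short message carried in either sense.
plus_genome = "AUGAAAUGA"
minus_genome = reverse_complement_rna(plus_genome)
print(minus_genome, "->", mrna_from_genome(minus_genome, "-"))
```

Because the ribosome can read only the mRNA-sense strand, a (−)sense genome is not translatable as it arrives, which is why such viruses must bring their own polymerase into the cell.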

• Both HIV-1 and influenza viruses have become major threats to human health after jumping from animal hosts. Their high mutation rates – with severe clinical consequences – reflect the fact that their genomes are RNA.

In the laboratory, a virus can be constructed as a vector to produce foreign proteins inside a cell. Two applications of this technique are as follows.
1. To produce a vaccine, insert a DNA sequence coding for the immunogen (perhaps the HIV-1 surface glycoprotein gp120) into the vaccinia virus genome. (The HIV-1 surface protein itself is of course not infectious: the Hershey–Chase experiment again!)* Infection by recombinant virus leads to expression of immunogen and elicitation of an immune response, giving the host protective immunity. The immunity created by such an intracellular exposure to the immunogen is much more powerful than that achievable simply by injecting the immunogen into the bloodstream.
2. A recombinant virus carrying a normal human gene can be useful for gene therapy. Can a retroviral vector reintroduce the normal variant into the patient’s genome? A recent successful application has been the treatment of adrenoleukodystrophy (an inherited degenerative disorder leading to progressive brain damage, adrenal gland failure, and early death) by gene delivery using a virus derived from HIV. Treatments for many other diseases are at various stages of research. For instance, cystic fibrosis arises from a mutation in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. Gene therapy for cystic fibrosis is now in Phase I clinical trials.

* But HIV-1 does insert protein as well as RNA into the host cell. Fortunately for Hershey, Chase, and the field, the virus they worked with, bacteriophage T2, does not.

Influenza: a past and current threat

Influenza is a contagious disease caused by a virus that infects the respiratory tract. The virus is passed around a population in droplets created when an infected individual coughs or sneezes. Unlike HIV-1, the virus that causes AIDS, influenza virus can survive outside the host, greatly facilitating its transmission.
Every year, influenza seasonally affects many people worldwide. In a typical year, 38 000 people die in the USA from influenza or related complications, and 200 000 are hospitalized. The mortality rate is 0.8%, with most fatalities occurring in the very young or elderly. Worldwide, the usual annual number of deaths is 1–1.5 million.
However, in some years, influenza and associated complications attack more viciously. A famous pandemic occurred at the end of the First World War: within an 18 month period in 1918–1919, influenza killed an estimated 50–100 million people, far more than had died in the war. The mortality rate was 1% of those affected. Such an influenza pandemic today could have an even higher mortality in regions of the world containing many people immunocompromised by AIDS.
In most influenza seasons, fatalities occur as a result of bacterial infection of lungs weakened by the virus, to which the elderly are more vulnerable. The 1918–1919 epidemic was different, in the higher percentage of fatalities among young people. Several explanations have been offered including: (1) the high density of young soldiers in military camps and battlefields, leading to more effective transmission; (2) overcrowding and poor nutrition and health care among refugees; and (3) previous epidemics in the 1850s and 1889, leaving many elderly people with some immunity.
Three types of influenza virus are known, of which type A is the most dangerous. The virion contains a spherical lipoprotein coat enclosing eight nucleoproteins containing the RNA genome, encoding a total of ten proteins. Protruding from the envelope are several hundred ‘spikes’ containing the proteins haemagglutinin (80% of the spikes) and neuraminidase (about 20%) (see Figure 4.6). These proteins are essential to the reproduction of the virus. Haemagglutinin binds to host cell surface glycoprotein receptors to promote viral entry into cells. Neuraminidase helps progeny virions to get out. Both haemagglutinin and neuraminidase are targets of drugs.

Figure 4.6 Influenza virus.
Picture courtesy Professor Y. Kawaoka, University of Wisconsin, USA, and University of Tokyo, Japan.

The virus can evolve by point mutations, also called antigenic drift, or by genetic recombination. Immunologically distinct strains of viruses are called serotypes. Different serotypes vary both in the ease of their spreading and in the mortality of infection. For instance, binding of virus to mucosal respiratory surfaces may be affected by amino acid sequence polymorphisms of both viral proteins and host receptors. Contagion and mortality are also dependent on characteristics of the host population, including density and general health levels. A strain that is both highly contagious and has a high mortality rate would be very dangerous indeed. As part of an effort to understand why the 1918–1919 strain was so dangerous, scientists have recently reconstructed that virus, based on material recovered from contemporary postmortem specimens.
Different strains of influenza virus contain one of 13 recognized types of gene for haemagglutinin (H) and one of nine recognized types of gene for neuraminidase (N). These types identify different major strains of viruses. For instance, the strain that caused the 1918–1919 pandemic was H1N1.
Many strains of the virus infect only a restricted range of species and with different mortality rates.

These properties can change as the virus evolves. A strain that infects animals can potentially become infectious to humans. Species range depends on the different forms of sialic acid presented on viral glycoproteins. An important determinant is haemagglutinin residue 226, which is Gln in viruses infectious to birds and Leu in viruses infectious to humans.

Table 4.4 Population in China (millions)

Year    Humans    Pigs    Poultry
1968    790       5.2     12.3
2005    1300      508     13 000

Avian flu
• policies of various governments that do not ad-
In 2006, an H5N1 strain of avian flu characterized
equately reimburse farmers who must sacrifice
by very high mortality infected domestic poultry
animals, creating a disincentive to report disease;
in several countries. It is considered a particularly
the result is a delay or even default of an effective
dangerous threat to humans because it has a high
response.
mutation rate and recombines readily. One way for
the virus to jump from birds to humans is for two Increased population densities of both humans and
strains to co-infect pigs and use them as a ‘mixing animals threaten a greater rate of spread of a danger-
vessel’ for recombination. ous strain of virus, even in comparison with the
Avian flu can normally infect only birds and, in recent 1968–1969 epidemic (see Table 4.4).
some cases, pigs. Domestic poultry stocks raised Aggressive approaches to controlling avian flu
under conditions of very high population density are have involved large-scale culling of stocks. In 1997,
particularly vulnerable. Often migratory birds are the H5N1 strain infected poultry in Hong Kong and
carriers but do not get sick. (The 2006 avian flu caused six human fatalities. The entire poultry popu-
strain was spread from southeast Asia to Russia by lation of the island had to be destroyed: 1.5 mil-
migratory birds.) lion birds in three days. An H7N7 epidemic in the
The H5N1 strain prevalent in 2006 was first Netherlands in 2003 led to the killing of >30 million
identified in Hong Kong in 1997 and traced to ducks birds (approximately twice the human population of
from Guandong province. It jumped the species the country). In Asia in 2004, over 100 million birds
barrier to mammals, becoming infectious to pigs, were culled.
in April 2004. It then became supervirulent, killing
rodents, birds, and humans. This H5N1 strain is Drugs against influenza
100% fatal in domesticated chickens and in 54% of Tamiflu (oseltamivir) and Relenza (zanamivir) are
reported human cases. Human to human transmis- the two major drugs against influenza. Both are
sion is uncommon. inhibitors of the viral neuraminidase.
Compared with previous epidemics, the world Relenza (Figure 4.7) was designed at the Common-
today is particularly vulnerable because of: wealth Scientific and Industrial Research Organiza-
tion (CSIRO) laboratory in Melbourne, Australia.
• increased human population densities;
Crystal structures of influenza neuraminidase showed
• widespread long-distance travel; that conserved sequences formed a cavity, suggesting
• intensive livestock production (including antibiotic a target site for drugs. By targeting the active site
feeding, which may create drug-resistant strains of of the enzyme, it is harder for the virus to evolve
infectious bacteria); and resistance.

[Figure 4.7 graphic: (a) structural formula of zanamivir; (b) stereo pair showing the drug in the neuraminidase active site, with labelled residues Arg115, Glu116, Asp148, Arg149, Arg153, Trp176, Ile220, Glu225, Glu274, Arg291, and Arg373.]

Figure 4.7 (a) The structure of the anti-influenza drug zanamivir (Relenza). (b) Zanamivir is a transition-state analogue that binds to the
active site of influenza neuraminidase. Here the atoms of the drug are shown as large spheres and the residues from the neuraminidase
are shown in ball-and-stick representation in stereo.

Ethical dilemma: publication of RNA sequence of the virulent 1918–1919 strain of influenza virus

Recently, scientists were able to recover and sequence the strain of influenza active in the 1918–1919 pandemic. The journal Science published the work and, consistent with editorial policy, required that the sequence be deposited in the nucleic acid databanks.

In The New York Times on 17 September 2005, R. Kurzweil and W. Joy wrote an article critical of the decision to make the sequence generally available in databanks on the grounds that terrorists might use the information to recreate the virus and use it as a weapon.

Reactions to the publication of the reconstructed pandemic viral sequence illustrate the conflict between the recognition that free and open access to information is a benefit to the progress of science, and the dangers of its misuse. A precedent occurred before the Second World War when physicist Leo Szilard tried, unsuccessfully, to persuade colleagues not to publish results that might prove useful in the development of atomic weapons. He suggested that journals record dates of receipt and acceptance of manuscripts but then sequester the articles for the duration. This occurred well before the strict secrecy imposed after the Manhattan Project was organized.

Science did make an exception to its mandatory-deposition policy in publishing the Human Genome Draft Sequence by J.C. Venter and co-workers in 2001.

Genome organization in prokaryotes

A typical prokaryotic genome has the form of a single circular molecule of double-stranded DNA, between 0.6 and 10 million bp long. For instance, a cell of E. coli strain K12 contains a single molecule of double-stranded DNA 4 639 675 bp long, closed into a circle. The DNA is supercoiled and associated with histone-like proteins into a ‘chromosome’, appearing in a subcellular structure called the nucleoid. Some E. coli cells may contain plasmids: short, usually circular, double-stranded DNA molecules, ranging from 1 kb to several megabases in length.

Although single circular genomes containing most of the DNA are common in bacteria and archaea, many exceptions are known. Many prokaryotic cells contain plasmids. Some prokaryotes have linear DNA. Borrelia burgdorferi, the organism that causes Lyme disease, is an example. B. burgdorferi also contains numerous plasmids, some of which are circular and some linear. Other prokaryotes contain more than one chromosome. Vibrio cholerae, the organism that causes cholera, contains two circular DNA molecules of 2 961 146 and 1 072 314 bp.

Some but not all prokaryote genomes contain insertion sequences, mobile genetic elements similar to eukaryotic transposons.

The 4.6 Mb chromosome of E. coli encodes approximately 4500 genes, distributed on both strands. The absence of introns and the shorter intergenic regions account for the higher coding densities. A very large fraction of the DNA, 87.8%, codes for proteins, 0.8% codes for structural RNAs and only 0.7% has no known function (see Table 4.5 and Figure 4.8).

Table 4.5 Coding percentage and average gene density

Species       Coding    Average gene density
E. coli       >90%      1 gene/kb
Pufferfish    15%       1 gene/10 kb
Human         5%        1 gene/30 kb

[Figure 4.8 graphic: histogram of number of genes (0–2000) against gene length in base pairs (0–5000).]

Figure 4.8 Distribution of gene lengths in E. coli. Two very long genes for hypothetical proteins, yeeJ and ydbA, of length 7152 and 8619 bp, respectively, are omitted. The average gene length is 960 bp. Most genes are less than 1500 bp long.

Many prokaryotic genomes have been sequenced. They illuminate, in a somewhat simpler context than the human genome, how these organisms solve problems common to all cellular life forms. In addition, there are good practical motives for studying features of prokaryotes. Differences between prokaryotic and eukaryotic metabolism – enzymes unique to prokaryotes – are appropriate targets for drugs against infection. Of great importance to clinical medicine is understanding how prokaryotes evolve to develop pathogenicity and antibiotic resistance.

We visualize the contents of bacterial chromosomes as concentric circular diagrams, looking vaguely like ‘tie-dyed’ patterns (see Figure 4.9).

Replication and transcription

In E. coli, replication begins at a specific site called oriC and proceeds in both directions. This site is the calibration point from which the genome is indexed. Replication ends at the terC site, found almost, but not exactly, half way around the circle. (In contrast, archaea often have multiple sites of origin of replication.)

In prokaryotes, many mRNA transcripts contain several tandem genes, which require separate initiation of translation. (In this, they are unlike viral polyproteins, which are translated in one piece and then cleaved.) In bacteria, but less frequently in archaea,

[Figure 4.9(a) graphic: circular BacMap Genome Atlas map of Escherichia coli K12, complete genome (accession NC_000913; length 4,639,675 bp; 5,144 genes), with position ticks every 500 kbp. The legend distinguishes genes encoding proteins and genes encoding functional RNA on the forward and reverse strands, and groups the COG functional categories under information storage and processing, cellular processes, metabolism, and poorly characterized.]

Figure 4.9 Map of the genome of E. coli K12. (a) Full view. Red arrows show protein-coding regions of the forward strand. Blue arrows
show protein-coding regions of the reverse strand. Pink arrows show structural RNA-encoding regions of the forward strand. Cyan
arrows show structural RNA-encoding regions of the reverse strand. Radial ticks identify individual gene products, colour coded
according to function. COG categories refer to the Clusters of Orthologous Groups database (http://www.ncbi.nlm.nih.gov/COG/).
(b) Expanded view of the region containing the his operon. The BacMap site provides access to genomes of bacteria and archaea
(http://wishart.biology.ualberta.ca/BacMap/).
Pictures reproduced, by permission, from BacMap: An Interactive Atlas for Exploring Bacterial Genomes. See: Stothard, P., Van Domselaar, G.,
Shrivastava, S, Guo, A., O’Neill, B., Cruz, J., Ellison, M., & Wishart, D.S. (2005). BacMap: an interactive picture atlas of annotated bacterial genomes.
Nucl. Acids Res. 33, D317–D320.

co-transcribed genes have related functions, forming an ‘operon’.

Timing illuminates the interrelationships among these processes. Under ordinary conditions, it takes E. coli 40 minutes to replicate its genome. The full generation time between cell divisions is about an hour. This explains why genes that require high rates of expression tend to be near the origin of replication: the availability for transcription of partially replicated DNA in effect increases the copy number of such genes. Conversely, the half-life of mRNA is only a few minutes. Therefore, translation must overlap transcription.

Gene transfer

There are three methods of transfer of DNA between prokaryotic cells.

[Figure 4.9(b) graphic (continued): expanded BacMap view of the Escherichia coli K12 genome, centred on base 2,090,000 at zoom 36 and spanning roughly 2060–2110 kbp. Labelled genes include the his operon (hisL, hisG, hisD, hisC, hisB, hisH, hisA, hisF, hisI) and its neighbours; the legend is the same as in part (a).]

• Transformation. The uptake of ‘naked’ DNA, as in the experiments of Griffith and of Avery, MacLeod, and McCarty.
• Conjugation. Insertion of some or all of the DNA from one cell into another – the prokaryotic equivalent of ‘mating’, although there is no meiosis or zygote formation. Bacterial conjugation does permit formation of recombinants. The start point for DNA transfer varies with the position in the genome of a mobile site. This is not the same as the origin of replication, oriC.
• Transduction. Transfer of DNA from one cell to another via a bacteriophage. During replication in one cell, a phage can pick up fragments of bacterial DNA and transmit them to another cell subsequently infected by progeny virions.

Bacterial conjugation has proved very useful in genome mapping. Transfer of a complete genome takes 100 minutes. Interrupting the process at different times (by physical agitation) results in partial genome transfer. Identifying which genes have entered the recipient cell after different intervals revealed the order of the genes. Positions in the genetic map of E. coli, for example, were classically expressed in minutes. Now, of course, they are specified in terms of the DNA sequence itself.

• Prokaryotes have several mechanisms for sharing genetic material: transformation by naked DNA, conjugation, and transfer via viruses.
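
Several of the numbers quoted in this section can be checked with a few lines of arithmetic. The Python sketch below is illustrative only. The function names are invented for this example. The dosage estimate assumes the standard steady-growth model in which a gene at the origin is, on average, 2^(C/τ) times as abundant as one at the terminus (C = replication time, τ = generation time); that model is an assumption brought in here, not something stated above. The conversion from map minutes to base pairs assumes a uniform transfer rate measured from the transfer start point.

```python
# Values quoted in the text for E. coli K12.
GENOME_BP = 4_639_675     # chromosome length in base pairs
GENE_COUNT = 4500         # approximate number of genes
REPLICATION_MIN = 40.0    # time to replicate the genome
GENERATION_MIN = 60.0     # generation time ("about an hour")
TRANSFER_MIN = 100.0      # time to transfer a complete genome by conjugation


def gene_density_per_kb(genes=GENE_COUNT, genome_bp=GENOME_BP):
    """Average number of genes per kilobase of genome."""
    return genes / (genome_bp / 1000)


def origin_to_terminus_dosage(replication_min=REPLICATION_MIN,
                              generation_min=GENERATION_MIN):
    """Assumed model: mean copy-number ratio of origin-proximal to
    terminus-proximal genes in steadily growing cells is 2**(C/tau)."""
    return 2 ** (replication_min / generation_min)


def map_minutes_to_bp(minutes, genome_bp=GENOME_BP, transfer_min=TRANSFER_MIN):
    """Approximate distance (bp) from the transfer start point for a gene
    that enters the recipient after `minutes`, assuming a uniform rate."""
    if not 0 <= minutes <= transfer_min:
        raise ValueError("minutes must lie between 0 and the full transfer time")
    return round(minutes / transfer_min * genome_bp)


if __name__ == "__main__":
    print(f"gene density: {gene_density_per_kb():.2f} genes/kb")
    print(f"origin/terminus dosage ratio: {origin_to_terminus_dosage():.2f}")
    print(f"one map minute is about {map_minutes_to_bp(1)} bp")
```

Under these assumptions the gene density comes out close to the 1 gene/kb listed in Table 4.5, and one minute on the classical map corresponds to roughly 46 kb of sequence.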

Genome organization in eukaryotes

Genomic information in eukaryotic cells is divided between the main nuclear genome and cytoplasmic organelles: mitochondria and chloroplasts.

In the nucleus, DNA is complexed with proteins to form chromosomes. We have already noted that chromatin remodelling is an important component of regulation of gene expression. DNA in organelles also forms nucleoprotein complexes. These resemble bacterial nucleoids, reflecting the endosymbiont origin of organelles. Organelle genomes are circular, double-stranded DNA molecules. Organelles in some species contain more than one DNA molecule.

With a few exceptions, the amount of nuclear DNA per cell is constant in all cells of an organism except for gametes. Organelles vary in number of copies of the DNA that they contain. Moreover, cells of different tissues contain different numbers of organelles. Mitochondria are more numerous in cells that consume large amounts of energy, such as brain, heart, and eye (about 10 000 mitochondria per cell), than in skin cells (only a few hundred). In plants, a leaf cell may contain up to 100 chloroplasts, the number varying among species. Leaves may contain 10⁶ chloroplasts per mm² of surface area. Unsurprisingly, root cells have none.

Mitochondrial genomes vary in size among species. Human mitochondrial DNA is 16 569 bp long. Yeast mitochondrial DNA is 75 kb and that of plants is considerably larger: the DNA of muskmelon (cantaloupe) mitochondria is 2.4 Mb! Chloroplast genomes range from about 110 to 160 kb, larger than animal mitochondrial DNAs. In some species, such as the protozoan Cryptosporidium, the mitochondria contain no DNA at all!

Mitochondria and chloroplasts carry out their own protein synthesis. Chloroplasts and plant mitochondria translate their genes according to the standard genetic code, but animal mitochondria use variants.

There is active traffic between organelle and nuclear genomes (see Box 4.5). Approximately 90% of chloroplast proteins are encoded by nuclear genes and gene transfer is still going on. However, the differences in genetic code inhibit mitochondrial → nuclear transfer in animals (see Table 4.6). In the Arabidopsis thaliana genome, a 620 kb insertion of mitochondrial DNA in nuclear chromosome 2 contains much of the mitochondrial genome including some duplicated material. (The full mitochondrial genome is only 366 924 bp long.)

Table 4.6 Numbers of RNAs and proteins encoded in organelle genomes

Organelle              RNA encoded                                   Proteins encoded
Animal mitochondria    Two ribosomal RNAs, 22 tRNAs                  12 or 13
Plant mitochondria     Three ribosomal RNAs, ∼22 tRNAs               30–39
Chloroplast            Four ribosomal RNAs (two copies), 37 tRNAs    50–57

Photosynthetic sea slugs: endosymbiosis of chloroplasts

The endosymbiotic origin of mitochondria and plant chloroplasts is a well-accepted theory about events that happened 1–2 billion years ago. Acquisition of endosymbiotic chloroplasts by sea slugs is observable today (see Figure 4.10). The slugs, which are molluscs – i.e. animals – eat algae. They open the algal cells and discard the contents – including the nucleus – except for the chloroplasts. The chloroplasts are taken up into host cells, where they carry out photosynthesis. The slug can live for months on the molecules synthesized using solar energy.

The mollusc does not get an entirely ‘free lunch’. In algae typical of the slug’s food, the chloroplast genome encodes only 13% of the organelle proteins. During the active life of the chloroplast within the animal’s cells, its proteins turn over and must be synthesized. Genes from the algal chloroplast have entered the mollusc nuclear genome and are expressed by the host.

Figure 4.10 A lettuce sea slug (Elysia crispata) on a patch of the alga Bryopsis. The slug eats algae, extracts and endocytoses the chloroplasts, and then basks in the sun, as in the picture, while the chloroplasts photosynthesize organic compounds.
Photograph by William Capman, Augsburg College, Minneapolis, MN, USA.

BOX 4.5 Traffic between the mitochondrial and nuclear genomes

Rps14, a protein from the small subunit of the rice mitochondrial ribosome, is encoded by a nuclear gene, rps14, on chromosome 8. In fact, by alternative splicing, the five exons of this region (centre strip) encode both Rps14 (ribosomal protein 14 of the small subunit) and SdhB (the B subunit of succinate dehydrogenase).

[Diagram: genomic DNA (centre strip) with exons 1–5 and segments labelled 740 bp and 1142 bp; the rps14 product is drawn above the strip and the sdhB product below it.]

Both Rps14 and SdhB are synthesized in the cytoplasm and transported into the mitochondria. It is likely that both genes were originally in the mitochondrial genome. A region similar to the nuclear gene rps14 remains in the rice mitochondrial genome. This mitochondrial gene is translated, but the product has become non-functional as a result of four single-nucleotide deletions that destroy the reading frame. Certain other higher plants have functional mitochondrial rps14 genes (broadbean, rapeseed), whereas others resemble rice in containing non-functional mitochondrial rps14 genes, but functional nuclear genes (potato, Arabidopsis).

Genes similar to sdhB have not been observed in plant mitochondrial genomes. It is likely, therefore, that the move of the sdhB gene from the mitochondrial to the nuclear genome is an old event and the move of the rps14 gene is a relatively recent one.

Moving a mitochondrial gene to the nucleus moves the site of its expression to the cytoplasm. It needs a leader sequence containing a proper targeting signal to direct the protein to mitochondria. The protein encoded by the rice nuclear gene for mitochondrial rps14 appears to have borrowed a mitochondrial targeting signal from sdhB, part of an earlier generation of immigrants, by alternative splicing. Compared with products of mitochondrial genes for Rps14, the nuclear-encoded version has an N-terminal extension derived from the sdhB exons. This extension is cleaved off in the mitochondria.
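
The earlier remark that differences in genetic code inhibit mitochondrial → nuclear transfer in animals can be made concrete with a small translation sketch. The Python below is a minimal illustration, not a full translation tool: it builds the standard code table, derives the vertebrate mitochondrial variant by changing four codons (TGA read as Trp, ATA as Met, AGA and AGG as stops), and translates an invented nine-codon-free toy reading frame under both tables. The sequence and the function name `translate` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)
}

# Vertebrate mitochondrial code: four codons are read differently.
VERTEBRATE_MITO = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(dna, code):
    """Translate a DNA reading frame, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = code[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    toy_gene = "ATGTGAATATAA"  # invented sequence, for illustration only
    print("mitochondrial reading:", translate(toy_gene, VERTEBRATE_MITO))
    print("nuclear reading:      ", translate(toy_gene, STANDARD))
```

Read with the mitochondrial table, the toy frame gives a three-residue peptide; read with the standard table, as it would be after transfer to the nucleus, translation stops after the first residue because TGA is a stop codon.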

How genomes differ

There is a growing consensus that the dynamics of expression patterns embodies the most interesting features of genomes. It is nevertheless prudent to begin less ambitiously, with static aspects – the sequences themselves. Similarities and differences among genome sequences appear (1) at the levels of individual bases; (2) at the level of genes (see Box 2.6); (3) in larger-scale blocks; and (4) at the level of whole genomes that have undergone complete duplications.

Variation at the level of individual nucleotides

Closely related genomes tend to contain regions encoding closely related proteins. Alignments of the sequences of homologous genes reveal differences,

                               10        20        30        40        50        60
                                |         |         |         |         |         |
Human                      VKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMIKPFFHSLSEKYSN___VIF
Chicken                    VKSVGNLADFEAELKAAGEKLVVVDFSATWCGPCKMIKPFFHSLCDKFGD___VVF
Neurospora crassa      MSDGVKHINSAQEFANLLNTT__QYVVADFYADWCGPCKAIAPMYAQFAKTFSIPNFLAF
Staphylococcus aureus     MAIVKVTDADFDSKVESG___VQLVDFWATWCGPCKMIAPVLEELAADYEG__KADI
                           vk       F   l       vvvDF AtWCGPCKmI P    l           f

                               70        80        90       100       110       120
                                |         |         |         |         |         |
Human                  LEVDVDDCQDVASECEVKCMPTFQFFKKGQKVGEFS_____GANKEKLEATINELV____
Chicken                IEIDVDDAQDVATHCDVKCMPTFQFYKNGKKVQEFS_____GANKEKLEETIKSLV____
Neurospora crassa      AKINVDSVQQVAQHYRVSAMPTFLFFKNGKQVAVNGSVMIQGADVNSLRAAAEKMGRLAK
Staphylococcus aureus  LKLDVDENPSTAAKYEVMSIPTLIVFKDGQPVDKVVG____FQPKENLAEVLDKHL____
                          dVD  q vA    V  mPTf ffK G  V         ga ke L

Figure 4.11 Thioredoxins are proteins that catalyse disulphide-exchange reactions, contributing to the speed and accuracy of the
protein-folding process. The human thioredoxin gene extends over 13 kb and consists of five exons. This figure shows the alignment
of amino acid sequences of thioredoxins from two vertebrates (human and chicken), a fungus (Neurospora crassa), and a bacterium
(Staphylococcus aureus). Colour coding: green, amino acids with medium-sized and large hydrophobic side chains; yellow, small side
chains; magenta, polar side chains; blue, positively charged side chains; red, negatively charged side chains. Upper-case letters in black
on the line below the sequences indicate amino acids conserved in all four sequences. Lower-case letters in the line below the sequences
indicate amino acids conserved in three of the four sequences.

mostly in the form of single-site mutations or insertions and deletions. Typically, there is reasonable correlation between overall species divergence and divergence of sequences of individual genes and the corresponding proteins. Comparisons of amino acid and gene sequences of thioredoxins provide a typical example (see Figure 4.11).

Compare these protein sequences with the corresponding gene sequences from human, chicken, and Staphylococcus aureus (see Figure 4.12). Note the large gap in the bacterial gene corresponding to the intron in the human and chicken genes. The asterisks under the sequences indicate positions containing the same base in all three genomes. The colons indicate positions containing two identical bases among the three; in most cases, two common bases appear in the human and chicken sequences, even in the non-coding regions. Note the frequent occurrence of patterns ‘**:’ and ‘**-blank’. What is the likely reason for this?

• In most cases, divergence of the sequences of genes and proteins correlates well with the divergence of the species.

Duplications

Duplications of individual genes, of regions containing many genes, and of complete genomes have been an important mechanism of evolution. They are a prolific source of variation, the raw material of both selection and genetic drift.

Duplications are seen in archaea, bacteria, and eukaryotes. Estimates of amounts of duplication vary, but there is agreement that it is substantial. The A. thaliana genome, for instance, contains over 60% duplications.

BOX 4.6 What can happen to a gene?

During evolution:
1. A gene may pass to descendants, accumulating favourable (or unfavourable) mutations or drifting neutrally.
2. A gene may be lost.
3. A gene may be duplicated, followed by divergence or by loss of one of the pair.
4. A gene may undergo horizontal transfer to an organism of another species.
5. A gene may undergo complex patterns of fusion, fission, or rearrangement, perhaps involving regions encoding individual protein domains.

Human actgcttttcaggaagccttggacgctgcaggtgataaacttgtagtagttgacttctca
Chicken gctgattttgaggcagaactgaaagctgctggtgagaagcttgtagtagttgatttctct
S.aureus gcagattttgattcaaaagtagaatctggtgtacaa---------ttagtagatttttgg
*:* **** *:: *: *: * :***: *:::* :: :::::::****:** **:*:

Human gccacgtggtgtgggccttgcaaaatgatcaagcctttctttcat---------------
Chicken gccacatggtgtggaccatgtaaaatgatcaagccatttttccatgtaagtagcctttgt
S.aureus gcaacatggtgtggtccatgtaaaatgatcgctccggtattagaa---------------
**:** ******** ** ** *********:::** :* ** :*:

Human --------------gtgagtattaaacaatgtctgctttgtaagagatttgtgttttttg
Chicken ttttcacagtaacagtaagtat-acacaaatacttctgtgcaacttgtcagtaatatg-g
S.aureus ------------------------------------------------------------
:: ::::: : :::: :: :: :: :: : :: : : :

Human agttggtggtcacagtggtaggaaagaaagacagtt----aaaggattttggtttcggtg
Chicken aggaaacctcctttgtctgtggggtggatggtatttcttgaaggagaatttgtagaagta
S.aureus ------------------------------------------------------------
:: : :: :: : : : : :: :: : :: :: ::

Human gg-----gggatttctttggctccatctttggtctaaaagtagtagtataacaaataatt
Chicken tgtgattggtaactattgaataaagtacttggatacacagcagggaacagacatgctgtt
S.aureus ------------------------------------------------------------
: :: : : :: : :::: : :: :: ::: ::

Human taggtttgatacatgtagcccattgaaa-acaaattttagaagttaattttgtcttaaat
Chicken gcattttgtctgtcctggtgctctgatgcacagtctgtaggtacctagcttccctcaaga
S.aureus ------------------------------------------------------------
:::: : : : ::: ::: : ::: : :: :: ::

Human agttctttttttccccacattgaaaca-----tgggcctta--tttgaaatcccagccta
Chicken a---ctggtaacagtggagttgaaacagtgtgtgtacactggctctgattattaaaactg
S.aureus ------------------------------------------------------------
: :: : :::::::: :: : : : ::: : ::

Human gaatttgatatgccaaactgtttt---atactaa--gaaaaatttgatttagagaaaatt
Chicken cattcagataagctgtagcatctttctgtagtgatggggaggctgtaagggaaggaaagg
S.aureus ------------------------------------------------------------
: : :::: :: : : :: :: : : : : : : :: :::

Human tatgtctcttagatctatgt-ctccaaaaga----tctaaatttttggatctttaattag
Chicken cctttcccttgtgcttaggtgcttcagcagactcttccaggggatggagctgaaaattaa
S.aureus ------------------------------------------------------------
: :: ::: :: :: :: :: ::: :: : : : :::::

Human tctctagttttattaagtttccatttaagaagcttaagcttgggtatgttgcattgccat
Chicken tttctggtca-gtaaagtagctgctttaaggtaacgagtcaag---tctcacagcaccag
S.aureus ------------------------------------------------------------
: ::: :: : :::: : :: : :: : : : :: :::

Human tacctagttctaaatctttt-------tggatttttcattttaaattttccag-------
Chicken tttctatgaatgcatcttttaaagaagtggctttcctggagcagtaactacataattttg
S.aureus ------------------------------------------------------------
: ::: : ::::::: ::: ::: : : ::

Human -------------tccctctctgaaaagtattccaac---gtgatattccttgaagtaga
Chicken tttttcattctagagtctgtgtgacaagtttggtgat---gtggtgttcattgaaattga
S.aureus -------------gaattagcagctgactatgaaggtaaagctgacattttaaaattaga
:* : :*: :*:* * : :::*:: : :*: *::** * **

Human tgt ggatgactgtcag

Chicken tgt ggatgatgcccag
S.aureus tgttgatgaaaatcca
***:*::::* **:

Figure 4.12 Alignment of partial thioredoxin gene sequences from the genomes of human, chicken, and Staphylococcus aureus.
The region shown contains exons 2 and 3 from the vertebrate genes.
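
The conservation lines printed under the alignments in Figures 4.11 and 4.12 follow simple column-by-column rules, which can be sketched in Python. This is a minimal illustration under stated assumptions, not the program used to prepare the figures: the function names are invented, gap characters are assumed never to count as matches, and the nucleotide rule is implemented exactly as worded in the text (any two identical bases give a colon), so its output need not agree character for character with the printed figure.

```python
from collections import Counter

GAP_CHARS = set("-_ ")  # assumption: none of these counts as a residue


def _most_common(column):
    """Return (residue, count) for the commonest non-gap character."""
    residues = [c.upper() for c in column if c not in GAP_CHARS]
    if not residues:
        return None, 0
    return Counter(residues).most_common(1)[0]


def protein_consensus(rows):
    """Figure 4.11 convention: upper case if a residue is conserved in every
    sequence, lower case if conserved in all but one, blank otherwise."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    line = []
    for column in zip(*rows):
        residue, count = _most_common(column)
        if count == len(rows):
            line.append(residue.upper())
        elif count == len(rows) - 1:
            line.append(residue.lower())
        else:
            line.append(" ")
    return "".join(line)


def dna_markers(rows):
    """Rule as stated for Figure 4.12: '*' if all three bases agree,
    ':' if exactly two agree, blank otherwise."""
    if len(rows) != 3 or len({len(row) for row in rows}) != 1:
        raise ValueError("expected three aligned rows of equal length")
    line = []
    for column in zip(*rows):
        _, count = _most_common(column)
        line.append("*" if count == 3 else ":" if count == 2 else " ")
    return "".join(line)


if __name__ == "__main__":
    # Columns 25-45 of Figure 4.11, spanning the conserved WCGPC motif.
    thioredoxins = [
        "LVVVDFSATWCGPCKMIKPFF",  # human
        "LVVVDFSATWCGPCKMIKPFF",  # chicken
        "YVVADFYADWCGPCKAIAPMY",  # Neurospora crassa
        "VQLVDFWATWCGPCKMIAPVL",  # Staphylococcus aureus
    ]
    print(protein_consensus(thioredoxins))
    # First ten columns of Figure 4.12: human, chicken, S. aureus.
    print(dna_markers(["actgcttttc", "gctgattttg", "gcagattttg"]))
```

Applied to the thioredoxin slice, the protein rule picks out the conserved WCGPC motif in upper case, matching the consensus line of Figure 4.11 over those columns.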

Table 4.7 Percentages of duplicated genes

Species                       Duplicate genes (%)
Bacteria
  Mycoplasma pneumoniae       44
  Helicobacter pylori         17
  Haemophilus influenzae      17
Archaea
  Archaeoglobus fulgidus      30
Eukarya
  Saccharomyces cerevisiae    30
  Caenorhabditis elegans      49
  Drosophila melanogaster     41
  Arabidopsis thaliana        65
  Homo sapiens                38

From: Zhang, J. (2003). Evolution by gene duplication: an update. Trends Ecol. Evol. 18, 292–298.

BOX 4.7 Homologues, orthologues, and paralogues

Homologues are regions of genomes, or portions of proteins, that are derived from a common ancestor. Because only in rare cases can we actually observe the ancestor–descendant relationship, most assignments of homology are inferences from similarity in sequence, structure, and/or genomic context.

Paralogues are related genes that have diverged to provide separate functions in the same species. Orthologues, in contrast, are homologues that perform the same function in different species. (For instance, the α and β chains of human haemoglobin are paralogues, and human and horse myoglobin are orthologues.)

Other related sequences may be pseudogenes, which may have arisen by duplication or by retrotransposition from mRNA, followed by the accumulation of mutations to the point of loss of function or expression.

Duplication of genes

Organisms in all three domains of life show duplication of individual genes.

After duplication, both copies of a gene may survive and diverge. Alternatively, one copy may turn into a pseudogene or be deleted, leaving only one functional copy.

As first proposed by S. Ohno in 1970, duplication followed by divergence is an important source of proteins with novel functions. It is generally easier to ‘recruit’ and adapt an already active molecule to a new function than to invent a new protein from scratch. The course of evolution of proteins descended from a common ancestor will differ, depending on whether they are retaining or changing their function (see Box 4.7).

‘Walls supply stones more easily than quarries, and palaces and temples will be demolished to make stables of granite, and cottages of porphyry.’ – Johnson, Rasselas.

In analysing the divergence of related genes, how can we distinguish the effect of selection from genetic drift? Given two aligned gene sequences, we can calculate Ks, the number of synonymous substitutions, and Ka, the number of non-synonymous substitutions. Most but not all synonymous substitutions are changes in the third position of codons. (The calculation of Ka and Ks involves more than simple counting because of the need to estimate and correct for possible multiple changes.) The ratio of Ka/Ks distinguishes the role of selective pressure and drift in the divergence of genes after duplication:

Ka/Ks ≈ 1     Neutral evolution: silent and substitution mutations have occurred to approximately equal extents.
Ka/Ks >> 1    Positive selection: substitution mutations are more prevalent than silent mutations, implying that selective pressures are active and the substitutions are advantageous.
Ka/Ks << 1    Purifying selection: substitution mutations are underrepresented, implying that the sequence is optimized fairly rigidly, with relatively little tolerance for mutation.

• A common mechanism of evolution is the duplication of a gene corresponding to a protein subunit, of an entire gene, or even of a whole genome. Creating two copies of any of these entities means that one can continue to provide an essential function, while the other can diverge to explore other possibilities.
changes in the third position of codons. (The calcula-
How genomes differ 137
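The Ka/Ks comparison described above can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the method used in the studies cited: it counts synonymous and non-synonymous sites in the style of Nei and Gojobori, skips any codon that differs at more than one position (a simplification), and applies a Jukes–Cantor correction for multiple changes. The function names, the threshold in `classify`, and the toy sequences are all hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Weighted number of synonymous sites in one codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    total += 1 / 3
    return total


def jukes_cantor(p):
    """Correct an observed proportion of differences for multiple changes."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(seq1, seq2):
    """Return (Ka, Ks) for two aligned, gap-free, upper-case coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn_sites = all_sites = 0.0
    syn_diffs = nonsyn_diffs = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        changed = sum(a != b for a, b in zip(c1, c2))
        if changed > 1:
            continue  # simplification: ignore codons with several changes
        syn_sites += (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        all_sites += 3
        if changed == 1:
            if CODON_TABLE[c1] == CODON_TABLE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    nonsyn_sites = all_sites - syn_sites
    if syn_sites == 0 or nonsyn_sites == 0:
        raise ValueError("not enough sites to estimate both rates")
    return jukes_cantor(nonsyn_diffs / nonsyn_sites), jukes_cantor(syn_diffs / syn_sites)


def classify(ratio, tolerance=0.5):
    """Map a Ka/Ks ratio onto the three regimes; the tolerance is illustrative."""
    if ratio > 1 + tolerance:
        return "positive selection"
    if ratio < 1 - tolerance:
        return "purifying selection"
    return "neutral evolution"


# Toy example: one synonymous (GCT -> GCC) and one non-synonymous (AAA -> AGA) change.
ka, ks = ka_ks("ATGGCTGGTAAA", "ATGGCCGGTAGA")
print(f"Ka = {ka:.3f}, Ks = {ks:.3f}, Ka/Ks = {ka / ks:.3f}: {classify(ka / ks)}")
```

Real analyses use longer alignments and more careful treatment of multiple changes within a codon; the sketch only shows why the two kinds of substitution must be normalized by different numbers of available sites before the ratio is formed.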

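[Figure 4.13: tree of globin gene duplications on a time axis from 1000 million years ago to the
present. An ancestral globin gives rise to neuroglobin (NGB, chromosome 14), cytoglobin (CYGB,
chromosome 17), myoglobin (MB, chromosome 22), and haemoglobin; the haemoglobin α chains
(ζ, ψζ, ψα, α2, α1) lie on chromosome 16 and the β chains (ε, γG, γA, ψβ, δ, β) on chromosome 11.]
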
Figure 4.13 Duplication and dispersal through the genome of globin genes during animal evolution.
From: Burmester, T., Ebner, B., Weich, B., & Hankeln, T. (2002). Cytoglobin: a novel globin type ubiquitously expressed in vertebrate tissues.
Mol. Biol. Evol. 19, 416–421.

The many globins in the human genome provide a good example of gene duplication and
divergence (see Figure 1.11). Genes for several versions of the haemoglobin α and β chains
form clusters on chromosomes 16 and 11. Other, isolated, loci contain genes for neuroglobin,
cytoglobin, and myoglobin. Closely linked genes, as in the α- and β-globin regions, suggest
relatively recent divergence. Yet even within these clusters, the proteins encoded have
diverged in function, showing small but significant variations in oxygen affinity and
responses to allosteric effectors. Also, these proteins appear at different stages of our
development, implying divergence in the control of their expression.

We can date the globin duplications by looking back into evolutionary history (see
Figure 4.13). Neuroglobin split off from other globins before the last common ancestor of
the vertebrates, perhaps 10⁹ years ago. The divergence of myoglobin and cytoglobin from
haemoglobin occurred before the emergence of the jawless fishes, during the Cambrian about
500 million years ago. The divergence of α- and β-globins occurred early in the vertebrate
lineage, approximately 450 million years ago.

Within mammals, the α- and β-globin regions are quite variable in content and extent (see
Figure 4.14). Even within the primate lineage, we can date a duplication of the γ-globin
gene (see Figure 4.15).

BOX 4.8 Modular proteins

A modular protein contains a linear string of compact units, called domains. Domains appear
to have independent stability and can be ‘mixed and matched’ with one another in different
proteins. Domains sometimes, but by no means always, correspond to single exons.
Modular proteins are common in eukaryotes. Individual domains of eukaryotic modular proteins
are often separately homologous to single-domain prokaryotic proteins (and, less commonly,
vice versa).

[Figure 4.14: maps of the β-globin locus, drawn on a scale of 0–140 kb, for zebrafish, Xenopus,
chicken, kangaroo, rabbit, mouse, rat, goat, sheep, cow, galago, and human.]

Figure 4.14 Layout of the β-globin locus in selected vertebrates. Colour coding: light brown,
fish; green, amphibian; purple, avian; magenta, marsupial; dark brown, ε-like; dark blue,
γ-like; orange, δ-like; cyan, β-like; white, rat-specific pseudogene; red, η-like.
The zebrafish and Xenopus regions illustrate the organization of the region prior to the
separation of α- and β-globins. They alone of the species illustrated here contain only
α- and β-globin genes.
Goat and sheep are closely related species that show similar patterns. Rat and mouse are
closely related species that show different patterns.
After: Aguileta, G., Bielawski, J.P., & Yang, Z. (2004). Gene conversion and functional divergence in the β-globin gene family. J. Mol. Evol. 59, 177–189.

Duplication can affect individual exons

Fibronectin, a large extracellular protein involved in cell adhesion and migration, is a
modular protein (see Box 4.8) containing multiple tandem repeats of

three types of domain called F1, F2, and F3. It is a linear array of the form:
(F1)₆(F2)₂(F1)₃(F3)₁₅(F1)₃ (see Figure 4.16). In the human genome, each domain of fibronectin
is encoded by either one or two tandem exons.
Fibronectin domains also appear in other modular proteins. The duplication of the exon(s)
encoding a domain, followed by transfer to another protein, is called ‘exon shuffling’.

[Figure 4.15: diagram of gene order in the β-globin region for an early placental mammal, other
mammals, prosimians, New World monkeys, and Old World monkeys (including apes, chimps, and
humans). Gene rows shown: ε γ η δ β; ε γ ψ δ β; ε γ1 γ2 ψ δ β.]

Figure 4.15 Evolution of primate β-globin region (not drawn to scale). Red boxes, embryonically
expressed genes; green boxes, post-embryonically expressed genes; blue boxes, foetally
expressed genes; black boxes, pseudogenes. Mammals ancestral to the groups in this chart had an
embryonic ε-globin and a post-embryonic β-globin gene. Marsupials retain this pair (see
Figure 4.14). By the time this diagram begins, the ε gene had duplicated to form three embryonic
genes, ε, γ, and η, and the β gene had duplicated to form δ and β. The η gene fell into
desuetude, mutating into a pseudogene. Subsequent duplication of the γ gene in anthropoids
produced γ1 and γ2, and a change in the control of expression to convert the γ genes from
embryonic to foetal expression.

Family expansion: G-protein-coupled receptors

Repeated duplications can generate large numbers of homologues. G-protein-coupled receptors
(GPCRs) are a large superfamily of eukaryotic cell-surface receptors active in signal
recognition and processing, including the senses of sight, taste, and smell.
The human genome contains about 700 active GPCRs. They are integral membrane proteins with a
common structure comprising seven transmembrane helices (see Figure 4.17). GPCRs interact with
G proteins within the cell. Some GPCRs mediate responses to extracellular chemical signals.
Reception of the signal triggers intracellular signalling cascades, which may reach as far as
the nucleus to affect gene expression.

Figure 4.16 A fragment of fibronectin, a modular protein, showing four tandem domains.

The interaction patterns of large families of functionally related proteins can create great
complexity. Odorant receptors are GPCRs expressed on sensory neurons in the nasal cavities of
humans and animals. The human genome has about 1000 odorant-receptor genes, of which only
40% are active. The mouse genome has about 1300, of which 80% are active. These genes are
distributed around mammalian genomes, arranged in clusters of up to 100 genes, in >40
locations in mouse and >100 locations in humans. Formation of many of the clusters appears
to antedate the divergence of humans and mice, but individual gene duplication and divergence
within clusters has led to specialization.

• GPCRs are important in the pharmaceutical industry, for both therapeutic effectiveness and
  financial reward. About half of all prescription drugs target GPCRs.

Although mammals have of the order of 1000 odorant-receptor proteins, and each neuron in the
nasal epithelium expresses only one odorant-receptor allele, over ten times as many scent
molecules can be distinguished. How is this achieved? Each scent molecule interacts with
several different receptors. Conversely, each receptor protein binds several related scent
molecules. Each neuron signals the detection of one group of odours, and the brain compares
the outputs and performs the required computation. Identification of a specific scent depends
on its detection by a combination of receptors.

• We depend less on our sense of smell than mice and dogs do, but we have about half the
  number of expressed odorant receptor proteins. Would you have expected a larger discrepancy?

Large-scale duplications

The genomes of many species contain duplications of multigene regions, the length varying
from species to species.
Large-scale segmental duplications are an important component of the difference between
human and chimpanzee genomes, affecting about 2.7% of the genome. Some duplications are found
in chimpanzees but not humans, some in humans but not chimpanzees, and some in both.
Some duplications that appear in the human but not in the chimpanzee genome involve segments
associated with developmental disorders, including a region on human chromosome 15 involved
in Prader–Willi and Angelman syndromes (see p. 85). These syndromes arise from microdeletions.
In humans, the duplication contributes to the frequency of the

Figure 4.17 G-protein-coupled receptors (GPCRs) are a large family of transmembrane proteins involved in signal transduction into
cells. They share a substructure containing seven transmembrane helices, arranged in a common topology. This figure shows the first
experimentally determined mammalian GPCR structure, bovine opsin [1H68]. This molecule senses light and generates a nerve impulse.
The seven-helical structure is common to the family of GPCRs. The helices traverse the membrane, with loops protruding outside and
inside the cell. This figure shows a view parallel to the membrane, with the extracellular side at the top. The transmembrane region is
generally flanked by N- and C-terminal domains. The N-terminal domain is always outside the cell and the C-terminal domain always
inside.
GPCRs constitute the largest known family of receptors. The family is as old as the eukaryotes and is large and diverse. Mammalian
genomes contain ∼1500–2000 GPCRs, accounting for about 3–5% of the genome. A similar fraction of the C. elegans genome codes
for GPCRs.
Some GPCRs are involved in sensory reception, including vision, smell, and taste. Some, like opsin and bacteriorhodopsin, bind
chromophores. (Bacteriorhodopsin is not a signalling molecule but a light-driven proton pump.) Others respond to extracellular ligands
including hormones and neurotransmitters.
As expected from the structure, in many groups of GPCRs the sequences of the helical regions diverge less than the sequences of the
loops. It is the loops that determine the specificity of the ligand, and of the G-protein partner.
The common mechanism of function of GPCRs is a conformational change, induced by ligand binding or light absorption. The
activated state of the GPCR interacts with an intracellular G protein, triggering a signal cascade. As there are substantially more GPCRs
than G proteins, many GPCRs must interact with a single G protein. For instance, all odorant receptors interact with the same G protein
α-subunit.
GPCRs are the targets for many drugs used in the treatment of high blood pressure, asthma, allergies, and other conditions. The large
number of related GPCRs is a challenge to the design of drugs that bind to a unique target. Many drugs have undesired side effects
because of imperfect specificity.

disease, by presenting sites for homologous recombination during meiosis, which show up in
some of the gametes as deletions. It is, therefore, likely that Prader–Willi and Angelman
syndromes are less common in chimpanzees than in humans.

Whole-genome duplication

Genomes can duplicate if the chromosomes replicate but do not segregate properly into
separate progeny cells upon mitosis.
The mere appearance of two copies of many genes does not prove whole-genome duplication. One
must adduce (1) the genome-wide occurrence of pairs of homologous genes appearing in the same
order; or (2) ‘molecular clock’ evidence showing equal divergence times in many pairs of
homologues. (However, different genes diverge at different rates, and homologues may be under
different selective pressures. Clock arguments are, therefore, relatively weak.)
The yeast genome underwent a duplication about 10⁸ years ago. The effects are obscured by
subsequent chromosomal rearrangements and by massive loss of duplicated material. The
duplication has nevertheless left its traces in multiple homologues that retain their genomic
order. The yeast genome contains 55 duplicated regions, on average 55 kb long, together
covering ∼50% of the genome and including 376 pairs of homologous genes.
In a seminal 1970 book, Evolution by Gene Duplication, S. Ohno proposed that the vertebrate
genome is the product of one or more complete genome duplications. Genome sequences confirm
his prescient

insight. Nor are whole-genome duplications limited to vertebrates. They have occurred
frequently in plant lineages (see p. 219).
In the lineage leading to vertebrates, the genomes of the cephalochordate Amphioxus
(Branchiostoma floridae), and the urochordate Ciona intestinalis, showed evidence for two
rounds of whole-genome duplication. Individual genes in these relatives corresponded to
multiple genes in vertebrates, with enough synteny preserved to show that the process
happened in parallel on a large scale. It appears that two whole-genome duplications in the
vertebrate line occurred after the split between the primitive chordate relatives,
urochordates and cephalochordates, from vertebrates, about 400–600 million years ago. More
recently, a third duplication occurred in the lineage leading to ray-finned fishes, such as
zebrafish and medaka. (See p. 222.)
What happens to all those extra genes? Most metazoa have roughly the same number of
protein-coding genes, in the range 20 000–25 000. Whole-genome duplication is followed by
massive gene loss. Some genes do form paralogous groups, and can even duplicate further,
individually, creating gene and protein families of various sizes. The globins and GPCRs are
examples.
It is interesting to see which kinds of genes do take advantage of the duplication. The
comparison of the genome of the primitive chordate Branchiostoma floridae with genomes of
vertebrates shows that the set of duplicates that is retained after whole-genome duplication
is enriched in genes for signal transduction, transcriptional regulation, neuronal activity,
and development: precisely the features with which early chordates were experimenting.

• Evolution subscribes to the advice, attributed to Yogi Berra: ‘When you come to a fork in
  the road – take it.’ We may even add: if you don’t come to a fork in the road – take it
  anyway.

Plant genomes are very susceptible to duplication. The sequence of the Arabidopsis genome
reveals at least two and possibly three successive duplication events.
Most plants are polyploids, i.e. they contain multiple sets of entire chromosomes.
Autopolyploids contain multiple copies of genomes from the same parent. Allopolyploids
contain multiple copies of genomes from different parents. Many crop species are polyploids,
relative to the wild species from which they were domesticated, including wheat, alfalfa,
oats, coffee, potatoes, sugar cane, cotton, peanuts, and bananas. Often polyploidy increases
the size of the fruit or grain, a useful property for agriculture (see Box 4.9).

BOX 4.9 Polyploidy in wheat

The wheat first used in agriculture, in the Middle East at least 10 000–15 000 years ago, is
a diploid called einkorn (Triticum monococcum), containing 14 chromosomes (7 pairs). Emmer
wheat (T. dicoccum), also cultivated since palaeolithic times, and durum wheat (T. turgidum),
are merged hybrids of relatives of einkorn with other wild grasses to form tetraploid species.
Additional hybridizations, to different wild wheats, gave hexaploid forms, including spelt
(T. spelta) and modern common wheat (T. aestivum). Triticale, a robust crop developed in
modern agriculture and currently used primarily for animal feed, is an artificial genus
arising from crossing durum wheat (T. turgidum) and rye (Secale cereale). Most triticale
varieties are hexaploids.

Variety of wheat   Classification        Chromosome complement
Einkorn            Triticum monococcum   AA
Emmer wheat        Triticum dicoccum     AABB
Durum wheat        Triticum turgidum     AABB
Spelt              Triticum spelta       AABBDD
Common wheat       Triticum aestivum     AABBDD
Triticale          Triticosecale         AABBRR

A, genome of original diploid wheat or a relative; B, genome of a wild grass, Aegilops
speltoides or a relative; D, genome of another wild grass, T. tauschii or a relative;
R, genome of rye S. cereale.

All of these species are still cultivated – some to only minor extents – and have their
individual uses in cooking. Spelt, or farro in Italian, is the basis of a well-known soup;
pasta is made from durum wheat; and bread is made from T. aestivum.
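The chromosome complements listed in Box 4.9 can be read mechanically: each letter stands for one haploid genome. As a minimal sketch, assuming the basic chromosome number of 7 per genome for wheat, its wild relatives, and rye (so that a diploid such as einkorn has 14 chromosomes), the Python below turns a complement such as AABBDD into a ploidy level and a total chromosome count. The function name and return format are hypothetical.

```python
from collections import Counter

# Assumption: every genome (A, B, D, R) contributes 7 chromosomes per haploid set.
BASE_NUMBER = 7
PLOIDY_NAMES = {2: "diploid", 4: "tetraploid", 6: "hexaploid", 8: "octaploid"}


def describe_complement(complement):
    """Return (ploidy name, chromosome count, kind) for a complement like 'AABB'."""
    counts = Counter(complement)
    if any(n != 2 for n in counts.values()):
        raise ValueError("this sketch expects each genome letter exactly twice")
    sets = len(complement)
    ploidy = PLOIDY_NAMES.get(sets, f"{sets}-ploid")
    # More than one distinct genome means the sets came from different parents.
    kind = "allopolyploid" if len(counts) > 1 else "single genome type"
    return ploidy, sets * BASE_NUMBER, kind


for name, complement in [("Einkorn", "AA"), ("Durum wheat", "AABB"),
                         ("Common wheat", "AABBDD"), ("Triticale", "AABBRR")]:
    print(name, complement, describe_complement(complement))
```

The sketch treats every entry in the table as a hybrid of distinct diploid genomes; it would need extending to describe autopolyploids, in which the same genome letter appears more than twice.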

[Figure 4.18: great apes comparative karyotype, chromosomes I–XXII, X, and Y. Within each group
the chromosomes are shown, from left: HSA, PPA, GGO, PPY (Sumatran). In the ideograms, shading
marks heterochromatin, ancestral forms are labelled ANC., and individual chromosomes carry
literature citations from the original figure.]

Figure 4.18 Top: photograph of banding patterns. Bottom: ideograms. HSA, Homo sapiens; PTR, Pan troglodytes (chimpanzee); GGO,
Gorilla gorilla; PPY, Pongo pygmaeus (orang-utan).
Photographs courtesy of Prof. M. Rocchi, Università di Bari, Italy.

Polyploidy may have other advantages. In studies of Arctic flora, it is observed that the
fraction of polyploid, relative to diploid, plant species increases towards higher latitudes.
Many arctic plants tend to exist in small, separated populations and frequently go through
‘bottlenecks’ of marginal survival, for instance during glaciations. After recession of the
ice, deglaciated areas may be repopulated by a few or even one dispersed seed. Carrying many
copies of the genome in the cells of each individual may help to preserve genetic diversity,
even in tiny populations.
Although polyploidization is much more common in plants than in animals, related species of
frogs (genus Xenopus) are diploid, tetraploid, octaploid, and dodecaploid. One tetraploid
mammal is known, the rat Tympanoctomys barrerae from the Monte Desert in west-central
Argentina. These species provide a model for control of expression of duplicated genes. For
example, in ‘polyploid’ frogs, silencing is non-syntenic. Each copy of the genome contains
some expressed and some silenced genes. This is a different model from the silencing of an
entire X chromosome in cells of mammalian females (see Chapter 1).
There are a number of examples of tissue-specific polyploidization. The endosperm of maize
kernels undergoes repeated cycles of endoreplication (replication of nuclear DNA in the
absence of mitosis) to produce cells that can have as many as 96 copies of the haploid
genome. In mammals, it is observed that the number of polyploid cells in the liver increases
with age, or in response to disease or surgery even in children. This may be a defence
against oxidative stress. In the bone marrow of mammals, very large polyploid cells called
megakaryocytes ‘bud off’ portions of their cytoplasm to form platelets. (Platelets are
enucleate cells in the blood involved in clotting.)
In a related condition, called polyteny, replicated chromosomes remain in alignment rather
than separate as in polyploids. This is the origin of the giant salivary gland chromosomes of
Drosophila, which played such an important role in the history of cytogenetics (see
Figure 3.3).

Comparisons at the chromosome level: synteny

Comparison of chromosome banding patterns provides snapshots of similarities and differences
in large-scale organization among eukaryotic genomes. Synteny literally means ‘on the same
band’, that is, on the same chromosome. (The chromosome exchange that causes chronic myeloid
leukaemia (see Chapter 3) is a breaking of synteny.) Closely related species generally show
a correspondence between large syntenic blocks. The similarity of the banding patterns
reveals the underlying similarity of the patterns in the DNA sequences themselves.
Figure 4.18 shows the relationships among the karyotypes of human, chimpanzee, gorilla, and
orang-utan.

What makes us human?

It is too difficult to look only at the human genome and try to deduce . . . ourselves. Two
approaches help in understanding the genome.
• Comparative genomics. We can compare the human and chimpanzee genomes and ask how
  differences between these genomes might give rise to differences between the species.
• Study of human disease. Many mutations cause disease and give clues to the functions of the
  affected regions. These regions may encode enzymes or regulatory proteins or RNAs, or they
  may be DNA sequences that are targets of regulatory mechanisms. Understanding the effects
  of mutations both illuminates human biology and, often, has immediate clinical applications.

Comparative genomics

The human and chimpanzee genomes are about 96% identical. To understand what makes us
human – or at least what makes us not chimpanzees – we can focus on 13 Mb of different
sequence, rather than the full 3.2 billion. There are even fewer differences in our amino
acid sequences. Humans and chimpanzees express very similar sets of proteins, and most of the

homologous proteins of chimpanzee and human are identical or very similar. About 30% of
homologous human and chimpanzee proteins show no differences at all. On average, there are
only two amino acid differences.
How, then, do humans and chimpanzees develop differently? The ultimate answer must lie within
the static sequence of the genome. However, a satisfactory answer will require understanding
of the dynamics, specifically of patterns of regulation of gene expression.
There is a paradox here. On the one hand, living systems are fairly robust to perturbations.
Yeast, for example, survives individual knockout of 80% of its genes. On the other hand, the
4% differences between chimpanzee and human genomes make profound differences in phenotype.
This suggests a chaotic system, one in which tiny perturbations can lead to large changes in
the subsequent trajectory. Superposed on the robustness are specific changes that exert
immense leverage.
There are two ways to find these crucial sequences. One is to look closely at the differences
between human and chimpanzee genomes and try to figure out what the changed loci are doing.
Another is to examine human mutations that affect phenotypic properties that chimpanzees do
not share with humans, such as language and reasoning.

Combining the approaches: the FOXP2 gene

Language is a unique feature of our species. It should show up as a genetic difference
between our genomes and those of other species, including chimpanzee.
Many people suffer from diseases that interfere with production or comprehension of language,
or both. Some of these are associated with trauma, or with complex genetics. However, one
abnormality with simple Mendelian inheritance appears in a family in London. Members of the
‘KE’ family have a severe disorder affecting both the facial motor control involved in
producing speech and also the mental processing of language.
The mutation responsible for this condition has been identified. It is a single-nucleotide
polymorphism (SNP) in a gene called FOXP2, which encodes a transcription factor. The major
protein encoded by FOXP2 is 715 amino acids long. It is quite a stable protein from the
evolutionary point of view, with only one substitution between mouse and the identical
chimpanzee, rhesus macaque, and gorilla sequences. However, the human protein has two
mutations relative to the other primate sequences.
This example illustrates the power of a combination of studies of human phenotypes and
comparative sequence analysis. And yet the observation that the phenotype shown by members of
the KE family arises from two SNPs in a single gene may be deceptively simple. We cannot
conclude that the expression of only one gene is involved in creating the phenotype, as is
the case for phenylketonuria, for example. The FOXP2 gene product is a transcription factor.
Its activity affects the expression of many genes. The effectiveness of their coordinated
expression required co-evolution, i.e. sequence changes in other genes.

Genomes of chimpanzees and humans

The genome sequence of Clint, a male chimpanzee from the Yerkes National Primate Research
Center at Emory University, in Atlanta, Georgia, USA, was reported in 2005. Clint represented
the West African subspecies Pan troglodytes verus.
As expected from so closely related a species:
• There is close alignment of the genome: 96% is alignable with the human genome.
• The sequences of the alignable regions differ at 1.23% of the positions. Recognizing that
  there is intraspecies divergence among humans and among chimpanzees, it is likely that the
  true interspecies difference amounts to about 1% of the alignable sequence.
• The 4% non-alignable regions represent insertions and deletions. It is estimated that about
  45 Mb of human sequence do not correspond to chimpanzee sequence and a similar amount of
  chimpanzee sequence does not correspond to human sequence. More positions in the genomes
  differ as a result of insertions/deletions than differ as a result of base substitutions.
• The distribution of differences is variable across the genome. For all syntenic 1 Mb
  segments across the genomes, the range in difference is about 0.5–2.5%. Looking at the
  distribution with respect to the chromosomes, divergence tends to be higher near the
  telomeres. The divergence is lowest for the X chromosome and highest for the Y.
• The proteins encoded are also very similar in sequence. Of 13 454 orthologous proteins, 29%
  have identical sequences. On average, there are one to two amino acid residue differences
  between corresponding chimpanzee and human proteins.
• Although most proteins are very similar, a few show large Ka/Ks ratios, suggesting that
  they are under positive selection. These include two proteins, glycophorin C and granulysin
  (involved in combating infection) and other proteins involved in reproduction. (Selection
  can act most directly on reproduction itself, producing high rates of evolutionary change.)
• Changes in gene expression patterns show that genes active in the brain have changed more
  rapidly in humans.

Sadly and unexpectedly, Clint died a few weeks before the paper on his genome was submitted
to Nature. He was 24 years old. Even in the wild, chimpanzees can live for 40–45 years.
Cheeta, a chimp that appeared in Tarzan movies in the 1930s, is almost 80 years old.
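The claim that more positions differ through insertions and deletions than through base substitutions can be checked with rough arithmetic. The sketch below uses only figures quoted in this chapter (a genome of about 3.2 billion base pairs, 96% alignable, 1.23% divergence in the alignable part, and about 45 Mb of lineage-specific sequence in each species); the variable names are hypothetical and the result is an order-of-magnitude estimate, not a published value.

```python
# Figures quoted in the text; all sizes in megabases (Mb).
GENOME_MB = 3200.0            # about 3.2 billion base pairs
ALIGNABLE_FRACTION = 0.96     # fraction alignable between human and chimpanzee
SUBSTITUTION_RATE = 0.0123    # differences per alignable position
INDEL_MB_PER_SPECIES = 45.0   # sequence present in one species but not the other

# Positions that differ by base substitution within the alignable sequence.
substitution_mb = GENOME_MB * ALIGNABLE_FRACTION * SUBSTITUTION_RATE

# Positions that differ because of insertions/deletions, counting both species.
indel_mb = 2 * INDEL_MB_PER_SPECIES

print(f"substitutions: about {substitution_mb:.0f} Mb")
print(f"insertions/deletions: about {indel_mb:.0f} Mb")
print("indels affect more positions:", indel_mb > substitution_mb)
```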

Genomes of mice and rats

The mouse and rat are by far the most common mammalian laboratory animals. Knowledge of these
species and correlation with human biology is encyclopaedic. Determination of mouse and rat
complete genome sequences was clearly a high-priority goal. The genome of the laboratory
mouse (Mus musculus) appeared in December 2002. The genome of the brown or Norway rat (Rattus
norvegicus) appeared in April 2004.
Mice, rats, and humans are closely related mammals and illuminate one another. Laboratory
studies on rodents are useful guides to the biochemistry and molecular biology of humans.
Mice and rats provide the first test of tolerance to, and effectiveness of, novel drugs aimed
ultimately at human therapy. Outside the laboratory, however, the close relationship has been
tragic for humans: shared parasites permit rats to transmit disease. (An epidemic of bubonic
plague in 1347–1352 killed a third of the population of Europe.) Shared diets make food
supplies vulnerable to rodent infestation.
The last common ancestor of humans and rodents lived approximately 75 million years ago. Rats
and mice separated much more recently: 12–24 million years ago. The genomes of all three
species are approximately the same size. The rat genome is ∼5% smaller than the human genome.
The mouse genome is ∼15% smaller than the human genome. Sequence divergence and chromosome
segment rearrangement appear to have been faster in the rodent lineage. The human genome
shows more duplication – one reason why it is larger.
Human, mouse, and rat genomes encode similar numbers of genes. Most proteins have homologues
in all three species, with very similar amino acid sequences (see Figure 4.19). (Transgenic
animals – in which rodent genes are replaced with human genes – can equip model organisms
with exact human sequences if necessary.) The genes for most rodent–human homologues have a
common exon–intron structure. Gene duplications create protein families, which may be of
different sizes in different species. For instance, consistent with their greater dependence
on a sense of smell, rodents have more odorant receptors than we do.
Some genomic variation is observable at the chromosomal level. The mouse has 19 chromosomes,
plus X/Y. The rat has 20 chromosomes, plus X/Y. Synteny between the mouse and rat genomes is
high.

Cytochrome c
                 10        20        30        40        50        60
                  |         |         |         |         |         |
Human    GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
Mouse    GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWG
Rat      GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWG
         GDVEKGKKIF  KC QCHTVEKGGKHKTGPNLHGLFGRKTGQA G SYT ANKNKGI WG

                 70        80        90       100
                  |         |         |         |
Human    EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
Mouse    EDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
Rat      EDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
         EDTLMEYLENPKKYIPGTKMIF GIKKK ERADLIAYLKKATNE

Figure 4.19 Alignment of the amino acid sequences of cytochrome c from human, mouse, and rat. All have 104 residues. The mouse
and rat sequences are identical. Letters below the sequences show residues conserved in all three species. The human sequence shows
nine substitutions, most of which are conservative. For colour coding, see Figure 4.11.
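The counts quoted in the caption of Figure 4.19 can be reproduced directly from the aligned sequences. The following Python sketch compares the sequences position by position; the sequences are transcribed from the figure, and the helper name `differences` is hypothetical.

```python
# Cytochrome c sequences transcribed from Figure 4.19 (no gaps in the alignment).
HUMAN = ("GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG"
         "EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE")
MOUSE = ("GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWG"
         "EDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE")
RAT = ("GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWG"
       "EDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE")


def differences(seq_a, seq_b):
    """List (position, residue_a, residue_b) for every mismatch, numbering from 1."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


human_vs_mouse = differences(HUMAN, MOUSE)
identity = 100 * (len(HUMAN) - len(human_vs_mouse)) / len(HUMAN)

print("length:", len(HUMAN))
print("mouse and rat identical:", MOUSE == RAT)
print("human/mouse substitutions:", len(human_vs_mouse))
for position, human_residue, mouse_residue in human_vs_mouse:
    print(f"  position {position}: {human_residue} -> {mouse_residue}")
print(f"human/mouse identity: {identity:.1f}%")
```

Counting identical positions treats every substitution alike; deciding which substitutions are conservative, as the caption does, would additionally need a table of residue properties or a substitution matrix.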

Synteny between the mouse and human genomes is variable. Most human chromosomes contain many
small blocks that correspond separately to regions distributed among several mouse
chromosomes. However, almost all of human chromosome 20 corresponds almost continuously with
a region of mouse chromosome 2. Almost all of human chromosome X appears in mouse chromosome
X, differing by rearrangement of nine contiguous blocks, including some reversals. Almost all
of human chromosome 17 appears within a region of mouse chromosome 11; however, the sequence
is broken into 16 segments that are rearranged, including some reversals.
The blocks are identified by matching sequences of genetic markers. Even though most of the
correspondences are distributed among different chromosomes, most of the genomes can be
partitioned into syntenic blocks, making up a total of 2.35 Gb (over 90% of the mouse
genome). These regions include almost all known exons and regulatory regions.
Alignment of genetic maps is less stringent than alignment of sequences. The rearrangements
within and among chromosomes make it impossible to align the three genomes as one long linear
sequence. However, at the nucleotide level, 40% of human and mouse genomes are
block-alignable, about 1 Gb in all.
The differences among the human, mouse, and rat genomes arise from selection or neutral
drift. Regions changing under selection rather than drift are likely to be functional. This
is a powerful way to search for regulatory regions, which are harder to identify in genomes
than protein-encoding genes.
Genomics confirms the utility of the rat and mouse for clinical research. Of a set of ∼1000
genes for which known mutations are associated with human disease, almost all have homologues
in rodents. In certain interesting cases, the sequence of the human disease-associated mutant
is identical to the mouse and rat wild type. There has probably been co-evolution, with a
compensatory change in some other gene or genes. This can be a source of clues to the
function and interactions involved in the disease.

Model organisms for study of human diseases

Above the molecular level, the differences between humans and flies and between humans and worms are more obvious than the similarities. At the biochemical and genomic level, the situation is reversed.

The underlying common features of the structure, organization, and development of different species are of both academic interest and practical importance, as flies and worms provide models for human diseases.

BOX 4.10 Distribution of C. elegans genes

Chromosome  Size (Mb)  Number of protein genes  Density of protein genes (kb/gene)  Number of tRNA genes
I           7.9        2803                     5.06                                13
II          8.5        3259                     3.65                                6
III         7.6        2508                     5.40                                9
IV          9.2        3094                     5.17                                7
V           9.8        4082                     4.15                                5
X           10.1       2631                     6.54                                3

From: The C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating
biology. Science 282, 2012–2018.

The genome of Caenorhabditis elegans

The nematode worm Caenorhabditis elegans entered molecular biology in 1963, at the invitation of Sydney Brenner. Brenner recognized its potential as a sufficiently complex organism to be interesting but simple enough to permit complete analysis of its development and neural circuitry, at least at the cellular level.

The C. elegans genome, completed in 1998, was the first full DNA sequence of a multicellular organism. C. elegans contains ∼97 Mb of DNA distributed on six paired chromosomes (see Box 4.10). There is an X but no Y chromosome: different genders in C. elegans are a self-fertilizing hermaphrodite, genotype XX, and a male, genotype XO (i.e. a single unpaired X chromosome).

The C. elegans genome is about eight times larger than that of yeast, and its 19 099 predicted genes are approximately three times the number in yeast. Exons cover ∼27% of the genome. The genes contain an average of five introns. The gene density is relatively low, for a eukaryote, with ∼1 gene/5 kb of DNA. Approximately 25% are in clusters of related genes.

The genome of Drosophila melanogaster

The total chromosomal DNA of Drosophila melanogaster contains about 180 Mb. Approximately two-thirds is euchromatin, a relatively uncoiled and non-compact form, containing most of the active genes. The euchromatic portion, about 120 Mb, was the first segment of the sequence released. The other one-third of the Drosophila genome appears as heterochromatin, highly compact regions flanking the centromeres. Heterochromatin contains many tandem repeats of the sequence AATAACATAG and relatively few genes.

The genome is distributed over five chromosomes: three large autosomes, a tiny chromosome containing only ∼1 Mb of euchromatin and an X/Y chromosome pair, of which the Y chromosome is heterochromatic and relatively gene poor. The fly’s ∼14 000 genes are approximately double the number in yeast, but fewer than in C. elegans, perhaps a surprise. The average density of genes in the euchromatin sequence is 1 gene/9 kb; about half that of C. elegans (see Box 4.11). The genes of the metacentric chromosomes 2 and 3 are reported separately for the two arms, arbitrarily designated left (L) and right (R). The other chromosomes are telocentric (see Figure 4.20).

Determination of the D. melanogaster genome sequence was a collaboration between industry (Celera Genomics) and the academic Drosophila Genome Projects based in Berkeley, California, USA, and in Europe. The project was a methodological testbed.

• First, it showed that a relatively large eukaryotic genome could be completed by the method of whole-genome shotgun sequencing (see Chapter 3).

BOX 4.11 Distribution of D. melanogaster genes

Chromosome arm  Size (Mb)  Number of protein genes  Density of protein genes (kb/gene)  Number of tRNA genes
X               22.2       2279                     9.7                                 25
2L              22.4       2537                     8.8                                 40
2R              20.8       2947                     7.1                                 100
3L              23.8       2718                     8.8                                 49
3R              27.9       3501                     8.0                                 80
4               1.28       83                       15.4                                0

Data from Release 5.22: http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=7227.
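The density column of Box 4.11 can be recomputed from the size and gene-count columns. The following is a minimal sketch, assuming that density means chromosome-arm size divided by number of protein genes, with 1 Mb taken as 1000 kb; the names `BOX_4_11`, `kb_per_gene`, and `density_table` are illustrative and do not come from the text.

```python
# Rows copied from Box 4.11:
# (chromosome arm, size in Mb, number of protein genes, reported kb/gene)
BOX_4_11 = [
    ("X", 22.2, 2279, 9.7),
    ("2L", 22.4, 2537, 8.8),
    ("2R", 20.8, 2947, 7.1),
    ("3L", 23.8, 2718, 8.8),
    ("3R", 27.9, 3501, 8.0),
    ("4", 1.28, 83, 15.4),
]


def kb_per_gene(size_mb, n_genes):
    """Average amount of DNA per gene in kb (assumes 1 Mb = 1000 kb)."""
    return size_mb * 1000.0 / n_genes


def density_table(rows):
    """Map each chromosome arm to its kb/gene value, rounded to one decimal."""
    return {arm: round(kb_per_gene(size, genes), 1) for arm, size, genes, _ in rows}


# Total number of protein genes over all arms listed in the box.
TOTAL_GENES = sum(genes for _, _, genes, _ in BOX_4_11)

if __name__ == "__main__":
    for arm, size, genes, reported in BOX_4_11:
        print(arm, round(kb_per_gene(size, genes), 1), "reported:", reported)
    print("total protein genes:", TOTAL_GENES)
```

Under these assumptions the recomputed values agree with the printed density column, and the gene counts sum to 14 065, in line with the figure of ∼14 000 genes quoted for the fly.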

• Second, the annotation of the sequence took place in a burst of activity: the intensive sessions of an 11-day ‘jamboree’ meeting held at Celera in November 1999. The ∼45 participants included experts representing the inherited knowledge of a century of fruit-fly biology and local computer experts involved in the sequencing. The flavour of the meeting is well described from the personal point of view by one of the participants, M. Ashburner.*

* Ashburner, M. (2006). Won for All: How the Drosophila Genome Was Sequenced. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, USA.

[Figure 4.20: diagram of chromosome arms 2L, 2R, 3L, 3R, X, Y, and 4.]
Figure 4.20 The chromosomes of Drosophila melanogaster. Heterochromatin is shown in red.

Homologous genes in humans, worms, and flies

Once the genomes were sequenced and annotated, comparisons showed that homologues of many genes appear in all three species. Forty-four per cent of protein-coding genes from D. melanogaster have human homologues, 25% of protein-coding genes from C. elegans have human homologues, and 23% of protein-coding genes from D. melanogaster have homologues in C. elegans.

Some proteins with common functions, such as cytochrome c, are expected to be quite similar in the different species (see Figure 4.21).

In other cases, different species have adapted homologous proteins to slightly different functions. C. elegans and D. melanogaster are favourite subjects of developmental biologists. It is therefore of interest that a number of transcription regulators involved in developmental control are common to human, D. melanogaster and C. elegans. These include PAX (paired box domain) and HOX (homeobox domain) proteins.

Models of human disease

A model organism is a species in which an interesting feature of human biology, especially a disease, can be studied (Table 4.8). Ideally, a model organism is small and robust, has a relatively simple genome, is easy to maintain and manipulate (both physically and

Cytochrome c 10 20 30 40 50 60
| | | | | |
Human G D V EKG K K I F I M K CS Q C H T V E K G GK H K T G P N L H GL F G R K T G Q A PG Y S Y T A A N K NK
D. melanogaster (isoform 1) GSG D A ENG K K I F V Q K CA Q C H T Y E V G GK H K V G P N L G GV V G R K C G T A AG Y K Y T D A N I KK
D. melanogaster (isoform 2) GVPAG D V EKG K K L F V Q R CA Q C H T V E A G GK H K V G P N L H GL I G R K T G Q A AG F A Y T D A N K AK
C. elegans SDIPAG D Y EKG K K V Y K Q R CL Q C H V V D S _ TA T K T G P T L H GV I G R T S G T V SG F D Y S A A N K NK
G D E G K K C Q C H K G P L G G R G G Y A N K

70 80 90 100 110
| | | | |
Human G I I W G E D T LME Y L E N P K K YI P G T K M I F V GI K K K E E R A DLI A Y L K K A T NE_ _
D. melanogaster (isoform 1) G V T W T E G N LDE Y L K D P K K YI P G T K M V F A GL K K A E E R A DLI A F L K S N K ___ _
D. melanogaster (isoform 2) G I T W N E D T LFE Y L E N P K K YI P G T K M I F A GL K K P N E R G DLI A Y L K S A T K__ _
C. elegans G V V W T K E T LFE Y L L N P K K YI P G T K M V F A GL K K A D E R A DLI K Y I E V E S AKS L
G W LEY L P K K YI P G T K M F G K K E R DLI

Figure 4.21 Alignments of the amino acid sequences of cytochrome c from human, D. melanogaster (two isoforms) and C. elegans.
For colour coding, see Figure 4.12.
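How similar two aligned sequences are can be summarized as a percentage of identical residues. The sketch below is one possible convention, not the only one: it counts identical residues over the columns in which neither sequence has a gap, so other choices of denominator (for example the full alignment length) would give different numbers. The function name `percent_identity` is illustrative; the two sequences are the human and C. elegans rows of the second block of Figure 4.21, transcribed with the spacing removed.

```python
GAP = "_"  # gap symbol used in the figure


def percent_identity(seq_a, seq_b, gap=GAP):
    """Percentage of identical residues over columns where neither row has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Second alignment block of Figure 4.21, written in chunks of ten columns.
HUMAN = ("GIIWGEDTLM" "EYLENPKKYI" "PGTKMIFVGI" "KKKEERADLI" "AYLKKATNE_" "_")
WORM = ("GVVWTKETLF" "EYLLNPKKYI" "PGTKMVFAGL" "KKADERADLI" "KYIEVESAKS" "L")

if __name__ == "__main__":
    print(round(percent_identity(HUMAN, WORM), 1))
```

The same function applies unchanged to any pair of rows of an alignment, provided both rows have the same number of columns.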

genetically) in the laboratory, has a short generation time, is safe to humans, and comes with an extensive knowledge of its biology. As each model organism has its own strengths and limitations, different organisms are useful for different investigations.

In principle, biologists will use the simplest organism that illustrates the human feature of interest. It is easier to do experiments with yeast than with fruit flies. However, sometimes there is no choice. The only animal other than humans that is susceptible to leprosy is the armadillo, which satisfies few, if any, of the criteria of an ideal laboratory organism.

There are two ways in which model organisms can contribute to understanding and treatment of human disease. The first is to observe homologues in the model organisms of genes implicated in human diseases. One can then study the effect on the model organism of mutation or knockout of the homologues. The second is to introduce a human gene into a model organism and discover its phenotypic effect. A model animal containing an active human gene makes it possible to screen libraries of compounds for potential drugs.

Table 4.8 shows some of the human disease-associated genes with homologues in D. melanogaster, C. elegans, and Saccharomyces cerevisiae. The database Homophila provides links between human disease-associated genes and Drosophila homologues.

Despite the fact that insects are not very closely related to mammals, fruit flies are useful in the study of human disease. The D. melanogaster genome contains homologues of human genes implicated in cancer and in cardiovascular, neurological, endocrinological, renal, metabolic, and haematological diseases. Some of these homologues have different functions in humans and flies. Other human disease-associated genes can be introduced into, and studied in, the fly. For instance, the gene for human spinocerebellar ataxia type 3, when expressed in the fly, produces similar neuronal cell degeneration. There are now fly models for Parkinson’s disease and malaria.

C. elegans also provides human disease models. Mutations in the human gene for presenilin-1 (PS1) are associated with familial early-onset Alzheimer’s disease. Mutations in the homologous gene in C. elegans, sel-12 (Figure 4.22), do show neurological defects, but in only a few neurons. Mutants do show more profound defects in egg laying, but this may be a secondary effect.

Although there are greater differences between the nervous systems of humans and C. elegans than between their machineries for respiratory energy transduction, the difference between the homologues shown here is not greater than the difference between the cytochrome c proteins. The relationship between sequence and function in proteins is full of surprises.

• We have discussed selected genomes (chimpanzee, mouse, rat, worm, and fly): the first because it is the closest extant relative we have, and the others because of their importance as laboratory animals. Two aspects of comparative genomics are to study differences between genomes, to try to account for phenotypic divergence; and to study and apply similarities. One important application is to use laboratory animals as models for human diseases.

Table 4.8 Human disease-associated genes shared with worms, flies, and yeast

Affected area | Disease | Description | Gene | Similarity in worm | Similarity in fly | Similarity in yeast
Bones | Multiple exostoses | Ossification at tips of femur, pelvis, or ribs | EXT1 | *** | ** | –
Blood | Leukaemia | Chronic myelogenous leukaemia, a blood cell cancer | ABL1 | *** | *** | *
 | Bruton agammaglobulinaemia | Lack of mature B cells | BTK | *** | ** | *
 | Glucose-6-phosphate dehydrogenase deficiency | Drug- and stress-induced rupture of red blood cells | G6PD | **** | **** | ****
Brain | Early-onset Alzheimer’s disease | | PS1 | ** | ** | –
 | Fragile X syndrome | Common cause of mental retardation | FMR1 | ** | – | –
 | Juvenile Parkinson’s disease | | PARK2 | *** | ** | *
Colon | Hereditary non-polyposis cancer | | MSH2 | *** | *** | ***
 | Adenomatous polyposis | Polyps that become malignant | APC | *** | * | –
Ears | Hereditary deafness | | MYO15 | *** | *** | ***
Eyes | Retinoblastoma | Cancer of the eye | RB1 | * | * | –
Heart | Familial cardiac myopathy | Inherited cardiac disease | MYH7 | *** | *** | ***
 | Long QT syndrome | Sometimes fatal cardiac arrhythmias | 3-SCN5A | *** | ** | *
Kidney | Polycystic kidney disease 2 | | PKD2 | ** | ** | –
Liver | Wilson’s disease | Build-up of copper in cells, causing liver disease and other symptoms | ATP7B | *** | *** | ***
Lung | Cystic fibrosis | Progressive disease of lungs and pancreas | CFTR | *** | *** | –
 | Lung cancer | Caused by defects in p53 gene, which can also cause cancer of the oesophagus, colon, brain, lung, breast, and skin | p53 | * | – | –
Muscles | Duchenne’s muscular dystrophy | Progressive atrophy of muscles | DMD | *** | *** | –
Pancreas | Pancreatic cancer | | MADH4 | *** | * | –
 | Pancreatic cancer | | RAS | ** | ** | **
Prostate | Advanced cancer of the prostate | Caused by mutations in the PTEN gene, which can also cause cancer of the brain, endometrium, and breast | PTEN | ** | ** | *
Skin | Xeroderma pigmentosum D | Early-onset skin cancer | XPD | *** | ** | ***
 | Neurofibromatosis 1 | Soft tumours at many sites, plus skeletal and neurological defects | NF1 | *** | * | **
Thyroid | Cancer of the thyroid | Multiple endocrine neoplasia type 2 | MEN2 | *** | ** | *

Based on data from Rubin, G.M., et al. (2000). Comparative genomics of the eukaryotes. Science 287, 2204–2215. Presentation adapted
from http://www.hhmi.org/genesweshare/e400.html.

10 20 30 40 50 60 70
| | | | | | |
Human MTELPAPLSYFQNAQMSEDNHLSNTVRSQNDNRERQEHNDRRSLGHPEPLSNGRPQGNSRQVVEQDEEED
C. elegans MPSTRRQQEGGGADAETHTVYGTNLITNRNS _ _ _ _ QEDENVV
R QE H N NS DE

80 90 100 110 120 130 140

| | | | | | |
Human EELTLKYGAKHVIMLFVPVTLCMVVVVATIKSVSFYTRKDG_QLIYTPFTEDTETVGQRALHSILNAAIM
C. elegans EEAELKYGASHVIHLFVPVSLCMALVVFTMNTITFYSQNNGRHLLYTPFVRETDSIVEKGLMSLGNALVM
EE LKYGA HVI LFVPV LCM VV T FY G L YTPF T L S NA M
150 160 170 180 190 200 210
| | | | | | |
Human ISVIVVMTILLVVLYKYRCYKVIHAWLIISSLLLLFFFSFIYLGEVFKTYNVAVDYITVALLIWNFGVVG
C. elegans LCVVVLMTVLLIVFYKYKFYKLIHGWLIVSSFLLLFLFTTIYVQEVLKSFDVSPSALLVLFGLGNYGVLG
V V MT LL V YKY YK IH WLI SS LLLF F IY EV K V V N GV G

220 230 240 250 260 270 280

| | | | | | |
Human MISIHWKGPLRLQQAYLIMISALMALVFIKYLPEWTAWLILAVISVYDLVAVLCPKGPLRMLVETAQERN
C. elegans MMCIHWKGPLRLQQFYLITMSALMALVFIKYLPEWTVWFVLFVISVWDLVAVLTPKGPLRYLVETAQERN
M IHWKGPLRLQQ YLI SALMALVFIKYLPEWT W L VISV DLVAVL PKGPLR LVETAQERN

290 300 310 320 330 340 350

| | | | | | |
Human ETLFPALIYSSTMVW _ _ _ LVNMAEGDPEAQRRVSKNS _ _ _ _ _ _ _ _ _ _ KYNAESTERESQDTVAENDDGGF
C. elegans EPIFPALIYSSGVIYPYVLVTAVENTTDPREPTSSDSNTSTAFPGEASCSSETPKRPKVKRIPQKVQIES
E FPALIYSS LV E S S E R

360 370 380 390 400 410 420

| | | | | | |
Human SEEWEAQRDSHLGPHRSTPESRAAVQELSSSILAGEDPEERGVKLGLGDFIFYSVLVGKASATASGDWNT
C. elegans NTTASTTQNSGVRVERELAAERPTVQDAN _ _ _ FHRHEEEERGVKLGLGDFIFYSVLLGKASS _ _ YFDWNT
S R R VQ EERGVKLGLGDFIFYSVL GKAS DWNT

430 440 450 460 470 480

| | | | | |
Human TIACFVAILIGLCLTLLLLAIFKKALPALPISITFGLVFYFATDYLVQPFMDQLAFHQFYI
C. elegans TIACYVAILIGLCFTLVLLAVFKRALPALPISIFSGLIFYFCTRWIITPFVTQVSQKCLLY
TIAC VAILIGLC TL LLA FK ALPALPISI GL FYF T PF Q

Figure 4.22 Alignment of the amino acid sequences of the human protein presenilin-1 and the C. elegans homologue SEL-12.

The ENCODE project

The ENCODE project (Encyclopedia of DNA elements) is a systematic development and application of comparative genomics. It has the ultimate goal of developing methods for comprehensive identification of functional regions of the human genome, including coding and regulatory regions. A selected portion of the human genome (1%, about 30 Mb) will be the initial focus. The basic approach will be comparative genomics and will involve both laboratory and computational analysis.

High-quality sequences will be finished to state-of-the-art standards, including resolving difficult regions. Medium-quality sequences will have >8-fold coverage, with manual refinement of assembly. Unfinished sequences are whole-genome shotguns; the coverage may vary and assembly may be incomplete.

Regions corresponding to the selected human genome segments from 29 vertebrates will be sequenced (see Table 4.9). These data will illuminate each other. The ENCODE project will apply, improve, and develop, as necessary, a variety of experimental and computational methods. Lessons learned from work with the selected subset will guide the scaling up of successful methods to analysis of entire genomes.

Coordinating with ENCODE, the HapMap Project (see Chapter 1) focuses on variations among humans in ten of the ENCODE regions. Sequences from 48 individuals from different geographic origins have yielded 30 000 SNPs.

Analysis of function involves two steps: deciding whether a segment has functional significance and, if

Table 4.9 Target species of the ENCODE project

Quality of sequencing

High Medium Unfinished

Class:
Actinopterygii Zebrafish
Amphibia Frog
Aves Chicken
Class: Mammalia
Order: Suborder:
Monotremata Platypus
Marsupialia Opossum
Proboscidea African elephant
Insectivora Tenrec
Xenarthra Armadillo
Insectivora Hedgehog
Insectivora Shrew
Chiroptera Bat
Artiodactyla Cow
Carnivora Dog
Carnivora Cat
Rodentia Mouse
Rodentia Rat
Rodentia Guinea pig
Lagomorpha Rabbit
Primates Prosimii Galago
Primates Prosimii Mouse lemur
Primates Platyrrhini Dusky titi
Primates Platyrrhini Owl monkey
Primates Platyrrhini Marmoset
Primates Catarrhini Colobus
Primates Catarrhini Macaque
Primates Catarrhini Baboon
Primates Hominidae Orang-utan
Primates Hominidae Chimpanzee
Primates Hominidae Human

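[Figure 4.23, panel (a): aligned sequences from Human, Chimp, Baboon, Rhesus monkey, Green monkey, Dusky titi, Colobus, and Spider monkey. Legend: sequence elements conserved in all species are candidate functional elements; marked positions show a nucleotide difference with at least one species.]
[Figure 4.23, panel (b): percent variation (0% to 100%) plotted against position (0 bp to 1200 bp); LXR-alpha exon 3 is marked.]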

Figure 4.23 Patterns of variation in multiple sequence alignments can suggest regions of likely function. This diagram shows analysis
of a 1200 bp region in primate genomes containing an exon of the liver X receptor α gene and flanking regions. This gene encodes a
nuclear receptor responsive to elevated levels of intracellular cholesterol. (a) Human, reference sequence; purple, regions in which the
sequence of the indicated region differs from at least one other species. The columns with no purple are conserved in all species and
define regions likely to be functional. (b) Plot of % variation along the sequence. Regions of lowest variability correspond to the known
exon. (A similar but not identical approach was used by E.A. Kabat and T.T. Wu in their classic work identifying complementarity-
determining regions of antibodies from regions of hypervariability.)
From: Nobrega, M.A. & Pennacchio, L.A. (2004). Comparative genomic analysis as a tool for biological discovery. J. Physiol. 554, 31–39.
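The idea behind panel (b) of Figure 4.23 can be sketched in a few lines: mark each alignment column as variable if the aligned sequences do not all agree, then report the percentage of variable columns in successive windows. This is an illustration under simplifying assumptions (non-overlapping windows, gaps treated as ordinary characters), not the procedure of the cited paper; the toy sequences are invented and are not LXR-alpha data, and the function names are hypothetical.

```python
def column_is_variable(column):
    """True if the characters in one alignment column are not all identical."""
    return len(set(column)) > 1


def percent_variation(alignment, window):
    """Percentage of variable columns in each non-overlapping window."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    if window < 1:
        raise ValueError("window must be at least 1")
    variable = [column_is_variable(col) for col in zip(*alignment)]
    result = []
    for start in range(0, length, window):
        chunk = variable[start:start + window]
        result.append(100.0 * sum(chunk) / len(chunk))
    return result


# Invented toy alignment: the first eight columns are fully conserved.
TOY_ALIGNMENT = [
    "ACGTACGTAAAA",
    "ACGTACGTCAAT",
    "ACGTACGTAGAA",
]

if __name__ == "__main__":
    print(percent_variation(TOY_ALIGNMENT, window=4))
```

Windows with a value near zero correspond to the conserved stretches that the figure highlights as candidate functional elements.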

so, identifying what it does (see Figure 4.23). Approximately 5% of the human genome is conserved with respect to mouse and rat sequences. This 5% should have interesting functions (without implying that the other 95% does not). Only about one-third of this 5% is predicted to encode protein. Analysis of function will require treatment of both protein-coding and non-protein-coding regions.

Accordingly, the criteria for selection of regions for the ENCODE project included choosing regions with ranges of gene density and of non-exonic conservation with respect to the mouse sequence. The result is a set of 44 discrete regions, spread around different human chromosomes and the syntenic regions in other species. These include well-studied regions such as the α- and β-globin loci and the region containing CFTR, the gene for the cystic fibrosis transmembrane conductance regulator, for which sequence information from different species is known.

Sequences of the ENCODE target regions can be aligned and compared (see Figure 4.23).

In 2007, ENCODE moved into its second phase, and a companion project, modENCODE, began (Table 4.10).

The modENCODE project

modENCODE extends the ENCODE project to model organisms. Its initial goal is to identify functional elements in the C. elegans and D. melanogaster genomes.

Current projects include but are not limited to those that focus on the following:

• the transcriptome: as complete as possible a description of what elements of the genome are actually transcribed, with a classification, as far as possible, of putative function, at least whether the transcript is likely to code for protein or non-protein-coding RNA.
• chromatin function and histone variants: genomic distribution of modifications to histones and other chromosome-associated protein regulatory elements.
• the 3′ UTR-ome: untranslated regions 3′ to coding sequences are important sites for post-transcriptional regulation of expression.
• transcription factors: identification of the DNA-binding sites, and measurement of expression patterns at different life stages.
• For D. melanogaster: a complete roster of small and microRNAs and assignment of their functions when possible.

Note the focus on regulatory elements. Papers appearing in Science in late 2010 report modENCODE results for worm and fly, respectively. These include reports of many new genes that encode proteins or RNAs.

Table 4.10 Approximate sizes of ENCODE regions

Chromosome  Approximate sizes of ENCODE regions (Mb) (gene of interest)
1           0.5
2           0.5, 0.5, 0.5, 0.5
4           0.5
5           0.5, 0.5, 1.0 (interleukin)
6           0.5, 0.5, 0.5, 0.5
7           0.5, 1.0, 1.1, 1.2, 1.9 (CFTR)
8           0.5
9           0.5
10          0.5
11          0.5, 0.5, 0.6, 0.5 (Apo cluster), 1.0 (β-globin)
12          0.5
13          0.5, 0.5
14          0.5, 0.5
15          0.5
16          0.5, 0.5, 0.5 (α-globin)
18          0.5, 0.5
19          1.0
20          0.5
21          0.5, 1.7
22          1.7
X           0.5, 1.2
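The entries of Table 4.10 can be checked against two figures quoted for the ENCODE pilot regions: 44 discrete regions, totalling about 30 Mb. The sketch below simply tallies the table as transcribed; the dictionary name `ENCODE_REGIONS_MB` and the helper names are illustrative.

```python
# Region sizes in Mb, transcribed from Table 4.10 (keys are chromosomes).
ENCODE_REGIONS_MB = {
    "1": [0.5],
    "2": [0.5, 0.5, 0.5, 0.5],
    "4": [0.5],
    "5": [0.5, 0.5, 1.0],
    "6": [0.5, 0.5, 0.5, 0.5],
    "7": [0.5, 1.0, 1.1, 1.2, 1.9],
    "8": [0.5],
    "9": [0.5],
    "10": [0.5],
    "11": [0.5, 0.5, 0.6, 0.5, 1.0],
    "12": [0.5],
    "13": [0.5, 0.5],
    "14": [0.5, 0.5],
    "15": [0.5],
    "16": [0.5, 0.5, 0.5],
    "18": [0.5, 0.5],
    "19": [1.0],
    "20": [0.5],
    "21": [0.5, 1.7],
    "22": [1.7],
    "X": [0.5, 1.2],
}


def count_regions(regions):
    """Total number of listed regions over all chromosomes."""
    return sum(len(sizes) for sizes in regions.values())


def total_size_mb(regions):
    """Summed size of all listed regions, in Mb."""
    return sum(sum(sizes) for sizes in regions.values())


if __name__ == "__main__":
    print(count_regions(ENCODE_REGIONS_MB), "regions")
    print(round(total_size_mb(ENCODE_REGIONS_MB), 1), "Mb in total")
```

The tally gives 44 regions and 29.9 Mb, consistent with the text; 29.9 Mb is 1% of a genome of roughly 3 Gb.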

● RECOMMENDED READING

• Papers treating some of the current debate in biological taxonomy and the strengths and
weaknesses of barcoding, and a description of the database:
Moritz, C. & Cicero, C. (2004). DNA barcoding: promise and pitfalls. PLoS Biol. 2, e354.
Ratnasingham, S. & Hebert, P.D.N. (2007). BOLD: The Barcode of Life Data System. Molecular
Ecology Notes 7, 355–364.
Stoeckle, M.Y. & Hebert, P.D. (2008). Barcode of life. Sci. Am. 299(4), 82–86, 88.
• Detailed review of work on genome comparisons and what they tell us about genome contents
and evolution:
Miller, W., Makova, K.D., Nekrutenko, A., & Hardison, R.C. (2004). Comparative genomics.
Annu. Rev. Genomics Hum. Genet. 5, 15–56.
• How modern developments in biological data collection affect the use of model organisms in the
study of human biology and disease:
Barr, M.M. (2003). Super models. Physiol. Genomics 13, 15–24.
• Importance of duplications:
Levasseur, A. & Pontarotti, P. (2011). The role of duplications in the evolution of genomes
highlights the need for evolutionary-based approaches in comparative genomics. Biol.
Direct. 18, 11.
• Alternative splicing:
Park, J.W. & Graveley, B.R. (2007). Complex alternative splicing. Adv. Exp. Med. Biol. 623, 50–63.
• RNA editing:
Maas, S., Kawahara, Y., Tamburro, K.M., & Nishikura, K. (2006). A-to-I RNA editing and human
disease. RNA Biol. 3, 1–9.
Maas, S. (2010). Gene regulation through RNA editing. Discov Med. 10, 379–386.
Farajollahi, S. & Maas, S. (2010). Molecular diversity through RNA editing: a balancing act.
Trends Genet. 26, 221–230.
• ENCODE and modENCODE:
The ENCODE Project Consortium (2011). A user’s guide to the Encyclopedia of DNA Elements
(ENCODE). PLoS Biol. 9, e1001046.
Elsner, M. & Mak, H.C. (2011). A modENCODE snapshot. Nat. Biotechnol. 29, 238–240.
Muers, M. (2011). Functional genomics: the modENCODE guide to the genome. Nat. Rev.
Genet. 12, 80.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 4.1 On a photocopy of Figure 4.4, (a) indicate the position of the human genome;
(b) indicate the position of the pufferfish genome; (c) indicate the position of the rice (Oryza
sativa) genome; (d) indicate the position of the yeast S. cerevisiae genome; (e) indicate the
position of the E. coli genome; (f) indicate the range of sizes of mitochondrial genomes; and
(g) indicate the range of sizes of chloroplast genomes.
Exercise 4.2 In H.G. Wells’ 1898 novel The War of the Worlds, invaders from Mars are overcome
by disease. Assuming that life on Mars developed independently of life on Earth, why is it unlikely
that the Martians died of viral infections?
Exercise 4.3 Which of the following pairs are orthologues? Which are paralogues? Which are
neither? (a) Human trypsin and horse trypsin; (b) human trypsin and horse chymotrypsin;
(c) human trypsin and human elastase; (d) Bacillus subtilis subtilisin and horse chymotrypsin.
Exercise 4.4 On a photocopy of Figure 4.1, indicate estimates of the dates of the events at the
points marked by asterisks.
Exercise 4.5 What RNA molecule is most closely linked to the his operon in E. coli (see Figure 4.9)?
Exercise 4.6 Human mitochondrial DNA is 16 569 bp long. A brain cell may contain 10 000
mitochondria. What fraction of the DNA in a brain cell is mitochondrial? Assume 1 mtDNA/
mitochondrion.
Exercise 4.7 Some antibiotics, for example streptomycin, block protein synthesis in bacteria but
not cytoplasmic protein synthesis in eukaryotes. Mitochondria and chloroplasts contain their own
protein-synthesizing machinery. Would you expect streptomycin to block mitochondrial protein
synthesis? Explain your answer.
Exercise 4.8 On a photocopy of Figure 4.1, indicate by an arrow linking them the approximate
positions of the source and destination organisms for the endosymbiotic origins of mitochondria
and chloroplasts.
Exercise 4.9 Why could a mollusc that extracts chloroplasts from algae that it eats not simply let
the chloroplasts reside in some body cavity (like the symbiotic bacteria in the guts of ruminant
animals) rather than endocytosing them?
Exercise 4.10 The main E. coli chromosome contains 4 639 221 bp. The cell is roughly a cylinder
about 0.1 mm in diameter and 0.2 mm long. If the length of an extended segment of DNA in the B
conformation is 3.4 Å per base pair (0.34 nm), what would be the diameter of the chromosome if
it were geometrically a circle?
Exercise 4.11 On a photocopy of Figure 4.11, mark the positions that (a) contain the same amino
acid in all three eukaryotes but differ in S. aureus; (b) contain the same amino acid in humans and
chickens but are not the same in Neurospora and/or S. aureus; (c) contain the same amino acid in
humans and S. aureus, but this amino acid does not appear at this position in both of the other
two species.
Exercise 4.12 On a photocopy of Figure 4.11, indicate which regions of the amino acid sequences
are encoded by which blocks of the nucleotide sequences in Figure 4.12. (This exercise is similar to
Exercise 1.12.)
Exercise 4.13 On a photocopy of Figure 4.13, draw horizontal lines at the beginning and end of
the Devonian (see Figure 2.15). What is the earliest type of species to emerge after the split between
neuroglobin and cytoglobin?
Exercise 4.14 Which animals in Figure 4.14 have the largest number of functional β-globin genes?
Exercise 4.15 Which animals in Figure 4.14 show a triplication of part of the β-region? Which
genes make up the repeating unit?
Exercise 4.16 Which animals in Figure 4.14 show a pseudogene most closely related to a δ-globin?
Exercise 4.17 The human and chimpanzee genomes are 96% identical. (a) How many individual
bases of the human genome differ from the corresponding positions of the chimp genome?
(b) Assume that the human genome is 3% coding and contains about 20 000 genes, and that
all of the sequence differences are independent, single-base changes distributed randomly
throughout the genome (these assumptions are definitely not true). Estimate the fraction of
genes mutated between humans and chimpanzees.
Exercise 4.18 On a photocopy of Figure 4.1, indicate where three whole-genome duplications are
believed to have occurred.

Problems
Problem 4.1 On a copy of Figure 4.7(b), indicate the following interactions: (a) the positively
charged guanidino group in zanamivir (left in Figure 4.7b) forms salt bridges with Glu116 and
Glu225 in the neuraminidase active site; (b) hydroxyl groups of the glycerol moiety of zanamivir
(at the right in Figure 4.7b) are hydrogen bonded to Glu274; (c) the carbonyl oxygen of the
N-acetyl sidechain is hydrogen bonded to Arg149; (d) the methyl group of the N-acetyl sidechain
makes hydrophobic interactions with Ile220 and Trp176.
Problem 4.2 Read the letters to The New York Times (17 September 2005) discussing the decision
to publish the genome sequence of the 1918 pandemic strain of influenza virus. Summarize the
arguments for and against publication.
Problem 4.3 From the partial sequences in Figure 4.12, (a) how many positions (necessarily in the
coding regions) contain the same base in all three genomes? What percentage of positions contain
the same base in all three genomes? (b) How many positions in the coding regions are common to
human and chicken but different in S. aureus? To what percentage of coding positions does this
correspond? (c) How many positions in the non-coding regions are common to human and
chicken? To what percentage of non-coding positions does this correspond?
Problem 4.4 For the coding regions of the genes for human and chicken thioredoxins (p. 135),
calculate the Ka/Ks ratio. What can you infer about the relative importance of selection and drift in
accounting for the difference in the corresponding amino acid sequences?
Problem 4.5 To what time frame can you date the duplication of the γ gene in the β-globin locus
of primates? (See Figures 2.15, 4.13, and 4.14.)
Problem 4.6 Figure 4.24 shows an evolutionary tree of hominids. With reference to Figure 4.18,
can you find similarities in the chromosome structures that confirm these relationships between
human, common chimpanzee, gorilla, and orang-utan?

[Figure: phylogenetic tree. Horizontal axis: millions of years ago, from 25 to 0. Tips, top to bottom: Pygmy chimp, Common chimp, Human, Gorilla, Orang-utan, Gibbon.]

Figure 4.24 Phylogenetic tree of hominids.

Problem 4.7 Which substitutions between rodent and human cytochrome c would not be
considered conservative mutations?
Problem 4.8 For the following pairs of homologous proteins, what are the percentages of
identical amino acids in an optimal alignment and what are the percentages of identical residues
or conservative substitutions in an optimal alignment: (a) the cytochrome c proteins of humans
and C. elegans; and (b) presenilin-1 homologues of humans and C. elegans?

Weblems
Weblem 4.1 Identify a virus and a prokaryote such that the genome of the virus is larger than the
genome of the prokaryote (see Figure 4.4).
Weblem 4.2 Find examples of viruses with (a) a double-stranded circular DNA genome;
(b) a double-stranded linear DNA genome; (c) a single-stranded (+)sense RNA genome; and
(d) a single-stranded (−)sense RNA genome.
Weblem 4.3 What was the most common serotype of influenza in the USA during the 2010–2011
season (www.cdc.gov/flu/weekly)?
Weblem 4.4 Give an example of a species that represents the simplest known form of (a) metazoan;
(b) deuterostome; (c) placoderm; and (d) eutherian.
Weblem 4.5 Mitochondrial DNA is often edited before translation. When a mitochondrial gene is
transferred to the nucleus, there are two possibilities: (1) the copying is DNA→DNA. In this case,
the nuclear version will initially have the mitochondrial DNA sequence and will require mutation
to effect the changes introduced in the mitochondrion by the editing; or (2) the nuclear gene is
reverse transcribed from the edited mitochondrial mRNA. Compare the sequences of the nuclear
gene for cytochrome oxidase II from mung bean (Vigna radiata) with mitochondrial sequences in
related legumes, before and after editing. Do the data support either hypothesis?
Weblem 4.6 Align the following sequences: (1) the nuclear-encoded rps14 gene from rice
(Oryza sativa); (2) the mitochondrial-encoded rps14 gene from broadbean (Vicia faba);
(3) the nuclear-encoded sdhB gene from rice. Identify the leader sequence in the rice rps14 gene,
targeting mitochondrial import, that appears to have been borrowed from the rice sdhB gene.
Weblem 4.7 Make two histograms of E. coli genes, similar to that of Figure 4.8, showing in one
of them the size distribution of genes appearing clockwise in the genome and in the other the
size distribution of genes appearing counter-clockwise in the genome. Describe any systematic
differences that appear.
Weblem 4.8 Many people believe that Rickettsiae are the closest extant relatives of the organism
that, after endosymbiosis, gave rise to mitochondria. Rickettsiae are obligate aerobes. (a) Can you
identify an enzyme that (1) catalyses an anaerobic function in mitochondria and (2) lacks a known
homologue in Rickettsiae? (b) If you can find one, and if the rickettsial origin of mitochondria is a
valid hypothesis, what are reasonable explanations of the presence of the anaerobic enzymatic
function in mitochondria? (c) How would you test these explanations?
Weblem 4.9 In eukaryotes, the recombination rate per kilobase – the physical distance that
corresponds to the genetic distance – varies among species by several orders of magnitude
overall. (It also varies within each genome.) It depends primarily on overall genome size. You
can find the data in computer-readable form through this book’s Online Resource Centre
(www.oxfordtextbooks.co.uk/orc/leskgenomics2e/). Draw graphs of recombination rate against
genome size, distinguishing data from different groups of organisms. (a) What relationship do
you observe, i.e. colloquially, what is the shape of the curve? (b) Can you plot the data in a way
that gives a linear relationship? (c) Do the data from the different groups of organisms follow
the same relationship? If not, how do they differ? (d) What general conclusions can you draw?
Weblem 4.10 Add to Figure 4.13 the corresponding partial gene sequence from N. crassa.
Describe its relationship to human, chicken, and S. aureus sequences. In particular, answer
questions analogous to those in Exercise 4.11 for the DNA sequences.
Weblem 4.11 What full-genome project is in progress, but not yet complete, for an organism in
each of the following categories: (a) fungus; (b) amphibian; (c) land plant; (d) insect (not a
species of Drosophila); (e) primate?
Weblem 4.12 Collect and align sequences of the protein HSP70 from about six (of each)
Gram-positive bacteria, proteobacteria, other Gram-negative bacteria, and archaea. Identify
an insertion common to proteobacteria and other Gram-negative bacteria but absent from
Gram-positive bacteria and archaea. On this basis, sketch the topology of a phylogenetic tree
relating Gram-positive bacteria, proteobacteria, other Gram-negative bacteria, and archaea.
Where do these results suggest placing the root of the prokaryotic tree?
Weblem 4.13 The genetic code used in translation of genes in animal mitochondria differs from
the standard one. Was this feature simply inherited from the original symbiont that gave rise
to the organelle, or did it arise subsequently by divergence? Determine (a) what genetic code is
used by the living organism that appears to be the closest extant relative of the original symbiont,
Rickettsia prowazekii. (b) Do all animal mitochondria use the same variant of the standard genetic
code? Based on these data, suggest an answer to the question.
Weblem 4.14 What is significant about the following species that justified sequencing their entire
genomes? (a) Plasmodium falciparum; (b) Aspergillus fumigatus; (c) Tropheryma
whipplei; (d) Sulfolobus tokodaii; (e) sea urchin; (f) Ciona intestinalis.
Weblem 4.15 In 1976, swine flu killed Private David Lewis, a soldier at Fort Dix, in central New
Jersey, south of Princeton, USA. What was the response of the US Government, under President
Gerald Ford? What was the course of the potential epidemic? In retrospect, does the response
appear to have been the right one? Could you have advised President Ford to make a different
response? If so, based on what facts known at the time of Private Lewis’s illness?
Weblem 4.16 Compute Ka/Ks ratios for the genes for the two isoforms of D. melanogaster
cytochrome c. Does the divergence appear to have arisen from selection or drift?
Weblem 4.17 Compute Ka/Ks ratios for the genes for human presenilin-1 and the C. elegans
homologue sel-12. Does the divergence appear to have arisen from selection or drift?
Weblem 4.18 The classification of tarsiers among primates is ambiguous. Palaeontology places
tarsiers with lemurs, lorises, and galagos, in the suborder prosimians. Some genetic relationships
place tarsiers with monkeys, apes, and humans. It is also possible that tarsiers form a separate
prosimian infraorder. Find the structure of the β-globin region of the tarsier genome and explain
what these data suggest about where, in Figure 4.15, the tarsiers belong.
Weblem 4.19 The human homologues of C. elegans NPR-1 are the neuropeptide Y receptors
NPY1R, NPY2R, . . . What mutations are known in these human receptors and what, according
to Online Mendelian Inheritance in Man (OMIM™), are their effects?
Weblem 4.20 In humans, cholesteryl ester transfer protein is important in controlling blood
levels of high-density lipoproteins. Do homologues of this protein exist in (a) mouse; (b) rat;
and (c) hamster?
Weblem 4.21 Mouse and rat cytochrome c have identical amino acid sequences. Do they have
identical gene sequences also? If not, show the differences between the cytochrome c genes in
mouse and rat. Distinguish between exons and introns.
Weblem 4.22 How many cytochrome c pseudogenes are there in the human genome?
Weblem 4.23 Which of the following genes have the same number of exons in human and
mouse orthologues? (a) Haemoglobin A (human gene HBA); (b) cytochrome c (human gene
HCS); (c) spermidine synthase (human gene SRM).
Weblem 4.24 Are any regions containing globin genes, other than haemoglobin α and β, included
in the choices of ENCODE regions studied by the HapMap project? If so, which ones?
CHAPTER 5

Evolution and Genomic Change

LEARNING GOALS

• To understand the coordination of changes in genotype and phenotype during evolution.

• To know the principles of biological classification and the grammar of biological nomenclature.
• To appreciate the distinction between similarity and homology, and that homology, usually
unobservable, is an inference from similarity.
• To recognize that measurements of similarity between gene or protein sequences offer our best
insight into relationships between individuals or between species.
• To understand the general idea of pattern recognition, and be able to use tools such as the
dotplot to recognize similarities among sequences.
• To understand the basis of constructing phylogenetic trees, methods for calculating them, and
the information different types of tree contain.

In the preceding chapter, we described some of the differences observed in comparing closely
related species, and distantly related ones, at various levels from molecules to karyotypes. In this
chapter we seek to understand how these changes came about. The general answer is of course
evolution. The availability of the data from genomics and related fields – such as protein structure
determinations – challenges us to probe into the detailed mechanism by which evolutionary
changes occur. Tools for analysing similarities include sequence-alignment algorithms, and, from
the similarities, methods for generating phylogenetic trees. These tools are part of the essential
skill set for anyone working in the field of genomics.

Evolution is exploration

Evolution is exploration. Exploration leads to discovery. Discovery leads to change. Change can
appear as creativity. (Let us avoid the word progress, much-abused in this context.) But it is
all based on exploration.
By exploration, we mean that life can probe the vicinity of its current state, generating and
testing variations. Evolution involves exploration and change at many levels. What makes biology
so complex is that the changes at different levels are intimately linked.

• The most fundamental level of exploration is mutation of genome sequences. It is through
mutations and, in sexually reproducing organisms, allelic reassortment, that life explores the
neighbourhood of a current genotypic and phenotypic state.
A mutation can affect a transcribed molecule, changing either the amino acid sequence of a
protein, or a base in a non-protein-coding RNA. Alternatively, changes in splice sites or
regulatory sequences can change protein expression levels. A mutation that causes loss of an
essential function will be lethal – a blind alley of evolutionary exploration. Conversely, other
mutations might be expected to have little or no effect. Such putatively ‘silent mutations’
include changes among synonymous triplets in protein-coding genes, or mutations in pseudogenes,
or – presumably – mutations in the large portions of some genomes currently described as junk.
However, even mutations producing synonymous codons in protein-coding genes could interact with
transfer RNA (tRNA) levels to affect translation rates, and thereby affect protein structure
(see pp. 9–10).
• Altered proteins explore possibilities of altered structure, including altered
post-translational modification; and altered function, including changes in enzymatic activity
or changes in regulatory signals.
A single conservative amino acid substitution at a site on the surface of a protein distant from
the active site would be expected to have only localized, self-contained effects on protein
structure and function.
However, some mutations send ripples through the system and have far-reaching effects. Small
changes in single HOX genes have immense leverage in creating the overall body plans of animals.
For example, a major change in body plan in metazoans occurred about 400 million years ago,
with the emergence of insects with six legs from arthropod ancestors with large numbers of legs.
Experiments of W. McGinnis and co-workers showed that changes in one protein, Ubx, a HOX
homologue, are sufficient to achieve this large-scale anatomical transition.
• At the chromosomal level, evolution can explore distributions of genes. This can involve local
or global gene duplication and transposition of either small segments of chromosomes or
large-scale blocks. Degradation of synteny can lead to infertility. This is one of the
mechanisms of speciation.
• At the cellular level, evolution has explored different kinds of organization, notably the
prokaryote–eukaryote division. Some but not all cells have cell walls. Some, but not all, have
chloroplasts. Complex organisms develop many types of specialized cell, tissue and organs.
• Individuals explore different possible life histories. The context and interactions of our
lives shape our development. For humans more than other species, cultural heritage and
experience have a great effect on physical as well as on mental development. We as individuals
also have greater control over how we explore the potential inherent in our genomes. This
freedom is, of course, incomplete, for societies are ecosystems and constrain our development
and activities.
• Within populations of individuals of the same species, evolution explores varying
distributions of allele frequencies. Natural populations show genotypic and phenotypic
variation. Even in the absence of mutations, populations can ‘react’ to changing conditions by
varying gene frequencies: Industrial melanism is a classic example (see Box 5.1).
• At the level of body plan, a visit to a zoo or botanical garden reveals life’s stunning
variety. Comparative anatomy reveals the underlying similarities among

BOX 5.1 Industrial melanism and its reversal

One variety of the British pepper moth, Biston betularia (variety typica), has a mottled
black-and-white colouring. The moths are nocturnal; during the day, they roost on tree trunks.
Before the rise of industrial pollution, light-coloured lichens encrusted the trees and the
moths were protected from birds by camouflage.
Another variety, carbonaria, has a uniformly dark colour. It was first observed, as a mutant, in
the mid-19th century, near Manchester in the north of England. Historical collections show the
rarity of the dark variety at that time. The difference between the two varieties is controlled
by a single gene that controls the amount of the black pigment, melanin, produced.
As the industrial revolution advanced, soot killed the lichens and blackened the bark of the
trees. Against the darkened trees, the dark moths were better camouflaged. Within a century, the
population had become 90% variety carbonaria. Figure 5.1 shows, quite convincingly, the
differences in appearance of both trees and moths. The difference is a shift within the
population of the allelic frequency distribution of a single gene.
Let it not be said that England did not take steps to curb air pollution. Coal fires were banned
in London by Edward I in 1273, albeit only temporarily. Parliament followed up with the Clean
Air Act of 1956. At present, B. betularia populations in areas recovering from soot deposits are
shifting back to higher proportions of the mottled typica variety.

Figure 5.1 Industrial melanism. Left: light- and dark-coloured pepper moths (Biston betularia) on normal trees, with lichens
growing on the bark. Right: light- and dark-coloured pepper moths on lichen-free trees encrusted with soot. Each picture
contains two moths, one camouflaged and the other easily visible. Can you spot the camouflaged moths?
Reproduced with permission from: Kettlewell, H.B.D. (1956). Further selection experiments on industrial melanism in the Lepidoptera.
Heredity 10, 287–301.

different animals and plants, but also the different design solutions for structural,
locomotory, and sensory systems between vertebrates and invertebrates.
• At the level of ecosystems, different populations explore their modes of interaction. Many
pairs of species co-evolve. Some examples of co-evolution of species are known from cooperating
species; for instance, the correlation between the anatomy of flowers and their insect
pollinators – a subject studied by Darwin himself – or the correlation between colour changes in
fruit ripening and the development of colour vision in animals. Other examples of co-evolution
involve species in competition or conflict; for example, predator–prey relationships. These
include the wars between humans and pathogenic bacteria and viruses.

The mechanism of evolutionary change is now understood, at least in general terms. Genetic
reassortment and mutation generate inheritable phenotypic variation. Phenotype-dependent
differential rates of reproduction – that is, natural selection – govern which alleles, at which
frequencies, are passed on to succeeding generations. Alternatively, even in the absence of
selection, genetic drift can lead to alterations in genome contents and distributions in
populations.
The two elements of exploration – variation from the current state of the system and change to a
new one – occur at many levels, from individual genomes to proteins to cells to ecosystems. A
long-standing challenge of biology is to understand the relationships among these different
levels of evolution.
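The interplay of selection and drift described above can be illustrated with a small simulation. The following Python sketch is not a model taken from this chapter: it assumes a single locus with two alleles, a constant population of N diploid individuals (2N gene copies), non-overlapping generations, and a hypothetical selective advantage s for one allele (a Wright–Fisher model with simple genic selection). The function name and all parameter values are illustrative assumptions.

```python
import random


def simulate_allele_frequency(p0, population_size, generations, s=0.0, seed=0):
    """Follow the frequency of one allele under drift and (optionally) selection.

    p0              -- starting frequency of the allele (0..1)
    population_size -- number of diploid individuals, N (2N gene copies)
    s               -- hypothetical selective advantage of the allele; 0 = pure drift
    Returns the list of frequencies, one per generation, starting with p0.
    """
    rng = random.Random(seed)          # seeded, so runs are reproducible
    copies = 2 * population_size
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Selection: carriers of the allele contribute (1 + s) times as many
        # gene copies to the pool from which the next generation is drawn.
        expected = p * (1 + s) / (1 + p * s)
        # Drift: the next generation is a random sample of 2N copies.
        count = sum(rng.random() < expected for _ in range(copies))
        p = count / copies
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    drift_only = simulate_allele_frequency(0.5, 50, 100, s=0.0, seed=1)
    with_selection = simulate_allele_frequency(0.05, 500, 100, s=0.1, seed=1)
    print("drift only, final frequency:", drift_only[-1])
    print("selection + drift, final frequency:", with_selection[-1])
```

Running the sketch with different seeds shows the qualitative point of the text: with s = 0 the frequency wanders at random and may reach 0 or 1, whereas a positive s biases the walk upwards, more reliably so in larger populations.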

Biological systematics

Classically, the unit of large-scale evolution is the species. Species represent nature’s
experiments in structures and lifestyles. Both Linnaeus and Darwin recognized the importance of
species, and made them the focus of their work. The concept of species remains essential,
despite its attendant difficulties (see p. 66). In view of the theoretical problems, biologists
present the analysis of known species, and higher-order taxa, in terms based on tradition and
convention. We shall explore how modern analytic methods based on genomic and structural data
mesh with the classical approaches.
Study of the vast variety of living organisms requires that we organize what we observe and
measure. We have to agree on what we call things. Biological taxonomy encompasses identifying
new life forms (new in the sense of new to the scientific literature), deciding where they fit
in, and assigning them a name – based on some ‘real or fancied characteristic of the form
described’ (A.S. Romer) and equipped with a proper description and deposition of specimen
material.

Biological nomenclature
Two problems in organizing biological nomenclature are what to name and how to assign the names.
The taxonomic hierarchy – kingdom, phylum, class, order, family, genus, and species – introduced
by Linnaeus is still in use, although with modifications. (Another legacy is the continued use
of classical languages.) Members of more restrictive categories, for instance, several species
in the same genus, have more shared features and higher degrees of similarity than the members
of more inclusive categories, such as phylum or class. As Nature is not as ‘neat’ as the 18th
century scientists would have liked, boundaries between taxa are fuzzy.
Descriptions of new species are more common than descriptions of new genera, families, etc.
During the years 1970–1998, five times as many new species and subspecies were described than
new genera. New families, orders, classes, and phyla appear much more infrequently (Table 5.1).
Biological names usually describe features of a species (e.g. giant kangaroo, Macropus
giganteus, which means large-foot gigantic, not to be confused with the ‘bigfoot’ primate
alleged to inhabit the northwest USA and adjacent regions of Canada). Names may indicate
location (e.g. Virginia opossum, Didelphis virginiana), or recognize the discoverer (e.g.
Darwin’s rhea, Rhea darwinii, a large bird encountered by Darwin on his visit to South America).
J. Gould named the species in Darwin’s honour in 1837. (P.H.G. Mohring had named the genus in
1752, to reflect the large size of the birds: the Greek goddess Rhea, mother of Zeus, was a
female titan, member of a mythological race of giants.) The scientific name of Père David’s
deer, Elaphurus davidianus, is another example of an animal named after its discoverer.

Table 5.1 New taxa described from 1970 to 1998

Taxa             Numbers described
New phyla        11
New classes      44
New orders       100
New families     731
New genera       8 579
New species      38 590
New subspecies   3 231

From: Winston, J.E. & Metzger, K. (1998). Trends in taxonomy revealed by published literature.
BioScience 48, 125–128.
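As a quick check of the statement that about five times as many new species and subspecies as new genera were described, the ratio can be computed directly from the counts in Table 5.1. This is only an arithmetic sketch in Python; the dictionary name is an illustrative assumption.

```python
# Counts copied from Table 5.1 (new taxa described, 1970-1998).
NEW_TAXA = {
    "phyla": 11,
    "classes": 44,
    "orders": 100,
    "families": 731,
    "genera": 8579,
    "species": 38590,
    "subspecies": 3231,
}

# Ratio of (new species + new subspecies) to new genera.
species_level = NEW_TAXA["species"] + NEW_TAXA["subspecies"]
ratio = species_level / NEW_TAXA["genera"]

if __name__ == "__main__":
    print(f"species and subspecies per new genus: {ratio:.2f}")
```

The ratio comes out slightly below five, consistent with the rounded figure quoted in the text.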

Other sources of names include expedition sponsors, thesis supervisors, and public figures such
as kings and queens, politicians, artists, musicians, and sporting figures. A marine mollusc,
Rotaovula hirohitoi, was named for the former Emperor of Japan, who was himself a serious marine
biologist. Some scientists have named creatures with loathsome features after rivals, as
insults. Finally, J.E. Winston has written: ‘These days, it would be considered pretty tacky
. . . to name a species after yourself’.* For unusual and, in some cases, amusing examples, see
http://home.earthlink.net/~misaak/taxonomy.html, http://cache.ucr.edu/~heraty/menke.html and
http://cache.ucr.edu/~heraty/yanega.html#ECOLOGY.

* Winston, J.E. (1999). Describing Species. Columbia University Press, New York, p. 165.

Biological nomenclature is governed by international agreements, adopted by consent by
professional scientists. The International Codes of Zoological and Botanical Nomenclature
separately offer rules for the naming of animals and plants. Nomenclature of bacteria grew out
of, and eventually split off from, the botanical code. Virologists have developed their own
classification. Currently the International Union of Biological Sciences, a member of the
International Council of Scientific Unions, concerns itself with biological nomenclature. It is
a sponsor of the Species 2000 project, an effort to curate a complete and integrated database of
the world’s species, including plants, animals, fungi, and microbes. The Species 2000 project
coordinates its activities with other similar international efforts, including the Interagency
Taxonomic Information System (ITIS) and the Global Biodiversity Information Facility (GBIF). The
related projects of the Tree of Life (http://tolweb.org/tree/) and ARKive
(http://www.arkive.org/) include pictorial databases.
The World Conservation Union, usually known by its former name, the International Union for the
Conservation of Nature and Natural Resources (IUCN), maintains a ‘Red list’ of endangered
species.

Measurement of biological similarities and differences
Ultimately, comparisons in biology involve observations of differences between individual
organisms. With access to living populations, it is possible to get a sense of the variability
among individuals and to observe a variety of features, including physiology and lifestyle, in
addition to ‘static’ anatomy.
In contrast, for extinct species, palaeontologists are often limited to fragmentary samples of
hard parts – bones and teeth – sometimes from a single individual. Indeed, in the classical era
of natural history exploration, a museum in Europe would often receive only a preserved body,
even from species that still thrived elsewhere in the world. (Père David’s deer is a typical
example.) Despite these handicaps in data collection, biologists in the pre-molecular era built
their taxonomic edifice on studies of comparative anatomy, embryology, and stratigraphy for
dates. They developed spectacular expertise: it was said that Cuvier, the founder of vertebrate
palaeontology, could reconstruct the entire skeleton of an animal from a single bone.
Understanding biological diversity requires observation and measurement of similarities and
differences. What features should one compare? Classically, the choice depended on expertise and
experience. W.E. Le Gros Clark wrote,

While it may be broadly accepted that, as a general proposition, degrees of genetic relationship
can be assessed by noting degrees of resemblance in anatomical details, it needs to be
emphasized that morphological characters vary considerably in their significance for this
assessment. Consequently it is of the utmost importance that particular attention should be
given to those characters whose taxonomic relevance has been duly established by comparative
anatomical and palaeontological studies.
– Le Gros Clark, W.E. (1971). The Antecedents of Man, 3rd edn. Quadrangle Books, Chicago,
pp. 11–12.

This approach works fine in the hands of a professional as expert and distinguished as Le Gros
Clark, but it has also elicited attempts to make classification methods more quantitative and
objective. These attempts include the development of computational methods for interpreting
similarities of a wide spectrum of features, some but not all based on sequence data.

Molecular techniques
Many molecular properties have been used for phylogenetic studies, some surprisingly long ago.

Serological cross-reactivity was applied to detect relationships from the beginning of the last
century until superseded by the direct use of sequences. E.T. Reichert and A.P. Brown published,
over a century ago (in 1909), a phylogenetic analysis of fishes based on haemoglobin crystals.
Their work was based on Stenö’s law (1669), which states that although different crystals of the
same substance have different dimensions – some are big, some small – they have the same
interfacial angles. We now understand that this law reflects the similarity in microscopic
arrangement and packing of the atomic or molecular units within the crystals. Reichert and Brown
showed that the interfacial angles of crystals of haemoglobins isolated from different species
showed patterns of similarity and divergence parallel to the species’ taxonomic relationships.
Reichert and Brown’s results are replete with significant implications, which can be appreciated
only in retrospect. They demonstrate that proteins have definite, fixed shapes, an idea by no
means recognized at the time. They imply that, as species progressively diverge, the structures
of their haemoglobins progressively diverge also. In 1909, no one had a clue about nucleic acid
or protein sequences. In principle, therefore, the recognition of evolution of protein
structures preceded, by half a century, the idea of evolution of nucleotide and amino acid
sequences. Reichert and Brown even saw a structural difference between oxy- and
deoxyhaemoglobin.

• The Reichert and Brown work has my vote for the most premature scientific result ever.

Today, DNA sequences provide the best measures of similarities among species for phylogenetic
analysis. Many genes are available for comparison. This is fortunate, because, given a set of
species to be studied, it is necessary to find genes that vary at an appropriate rate. Genes
that remain almost constant among the species of interest provide no discrimination. Genes that
vary too much cannot be aligned.
Fortunately genes vary widely in their rates of change. The mammalian mitochondrial genome, a
circular, double-stranded DNA molecule approximately 16 000 bp long, provides a useful
fast-changing set of sequences for the study of evolution among closely related species. In
contrast, slowly changing ribosomal RNA (rRNA) sequences were used by C. Woese to identify the
three major divisions of life: archaea, bacteria, and eukarya (see Figure 4.1).

• In order to develop a clear picture of the relationships between species, it is necessary to
pick a molecule that is changing at a reasonable rate. There must be enough change that the
signal does not sink below the noise level, but not so much change as to obscure common
features.

Homologues and families

Products of evolution retain similarities. The similarities appear at many levels – related
people, recently diverged species, tissues within an organism containing related cell types but
varying protein expression patterns, amino acid sequences and structures of proteins, and DNA
sequences. A major theme of biology traditionally has been to recognize and classify such
similarities, with a view to understanding how they arose and, when appropriate, to what purpose
(i.e. with what selective advantage).
To trace the course of evolution, we must quantitatively measure such similarities. There are
many possible objects of such analysis – sequences of individual genes, full-genome sequences,
sequences and structures of proteins, anatomical features, patterns of development, and any
other phenotypic character one might choose. In many cases, the patterns of similarity between
different features of a set of species give corresponding results, bolstering our confidence in
their significance.
However, it is necessary to keep clearly in mind that similarity, which is observable, is a
surrogate for relationship, which usually is not. Related biological objects are homologues, or
families. In many cases, such as the globins, the similarities are sufficient to give us
confidence that we are analysing a family of related molecules. Ideally, we have a spectrum of
similarities, including close relatives and some distant ones, with the distant relatives linked
by chains of close ones. For the comparisons of globins from different species, the congruence
of the degrees of similarity of the molecules with measurements of similarities of other sets of
molecules and with the classical taxonomic relationships between species is reassuring. For
globins within a single species, the common conserved features argue that we are dealing with a
single diverged family.
But divergence does not stop within the scope of our ability to detect homologues, and there are
many cases where (a) a tantalizing tenuous degree of similarity suggests homology, but we remain
unsure whether or not the inference of relationship is valid, or (b) there is sufficient
dissimilarity between two molecules or structures that homology is unsuspected, but a series of
missing links clearly connects the two.
Our most precise tools measure similarity between sequences or between molecular structures.
These tools are mature methods. They have been calibrated to allow us to decide, in all but the
hardest cases, whether or not we are dealing with homologues.
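The idea that distant relatives can be ‘linked by chains of close ones’ can be sketched as a small graph problem: join every pair whose similarity exceeds a threshold, then collect the connected groups. The Python below is only an illustration under stated assumptions. The protein labels, the similarity values, and the threshold are hypothetical and not real data, and single-linkage grouping of this kind is one simple choice, not a calibrated method for inferring homology.

```python
from collections import deque


def group_into_families(similarity, threshold):
    """Group items into families by chaining pairwise similarities.

    similarity -- dict mapping (item_a, item_b) pairs to a similarity score
    threshold  -- pairs scoring at least this much are linked directly
    Returns a list of sets: the connected components of the linkage graph.
    """
    neighbours = {}
    for (a, b), score in similarity.items():
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if score >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    families, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable through a
        # chain of sufficiently similar pairs.
        family, queue = set(), deque([start])
        seen.add(start)
        while queue:
            item = queue.popleft()
            family.add(item)
            for other in neighbours[item] - seen:
                seen.add(other)
                queue.append(other)
        families.append(family)
    return families


# Hypothetical pairwise similarities (e.g. fraction of identical residues).
TOY_SIMILARITY = {
    ("P1", "P2"): 0.62,
    ("P2", "P3"): 0.55,
    ("P1", "P3"): 0.18,  # too dissimilar to be linked directly
    ("P1", "P4"): 0.10,
    ("P2", "P4"): 0.11,
    ("P3", "P4"): 0.12,
}

if __name__ == "__main__":
    print(group_into_families(TOY_SIMILARITY, threshold=0.3))
```

In this toy example P1 and P3 are not similar enough to be paired directly, yet both fall into one family because P2 links them; P4 remains on its own.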

Pattern matching – the basic tool of bioinformatics

Given suitable data, computer programs can measure similarities and extract common patterns.
Programs to extract patterns in sequences are powerful and readily available. Indeed, sequence
comparisons are a problem common to many fields, including the text editors available in all
computer systems. For protein structures also, it is possible to detect and measure similarities
and common patterns. This makes it possible to study sequence–structure relationships
quantitatively.
Other types of biological information – such as protein function, expression patterns,
information about characteristics that distinguish species – do not present themselves quite so
naturally in forms adapted for computational analysis. They require foundation work to create
models of the information, including identification of the important categories of data, and
controlled and carefully defined vocabularies for their description. The Gene Ontology
Consortium’s classifications of protein functions allowed development of tools for quantitative
measurement of similarity and divergence of function. (See Chapter 11.)
Such rules and regulations governing how to express data are also essential for database
integration. They provide the basis on which independent databases in related or overlapping
fields can communicate and cooperate with one another. They allow information retrieval software
to handle queries requiring coordinated access to several databases.

• To measure similarity between two sequences, find their optimal alignment – the best matching
up of the individual characters – and produce a cumulative score of the similarities between the
characters at each position.

Sequence alignment
Given two or more sequences, we wish to:

• measure their similarity;
• understand how the residues match up;
• observe patterns of conservation and variability; and
• infer evolutionary relationships.

If we can do this, we will be in a good position to go fishing in databanks for related
sequences, and measuring relative degrees of similarity among genes or proteins. A major
application of sequence alignment is to the annotation of genes, through identification of
homologues, in order to assign structure and function to as many genes as possible.
Sequence alignment is the identification of residue–residue correspondences. Any assignment of
correspondences that preserves the order of the residues within the sequences is an alignment.
Alignments may contain gaps. For example,

Given two text strings:
    first string  = a b c d e
    second string = a c d e f
a reasonable alignment would be:
    a b c d e -
    a - c d e f

Some alignments are better than others. For the sequences gctgaacg and ctataatc:

An uninformative alignment:
    - - - - - - - g c t g a a c g
    c t a t a a t c - - - - - - -
An alignment without gaps:
    g c t g a a c g
    c t a t a a t c
An alignment with gaps:
    g c t g a - a - - c g
    - - c t - a t a a t c
And another:
    g c t g - a a - c g
    - c t a t a a t c -

Most readers would consider the last of these alignments the best of the four. To decide whether
it is the best of all possibilities, we need a way of examining all possible alignments
systematically. We need to compute a score reflecting the quality of each possible alignment and
to identify an alignment with the optimal score. The optimal alignment may not be unique:
several different alignments may give the same best score. Moreover, even minor variations in
the scoring scheme may change the ranking of alignments, causing a different one to emerge as
the best.

  D O R O T H Y C R O W F O O T H O D G K I N
D D                                 D
O   O   O           O     O O     O
R     R           R
O   O   O           O     O O     O
T         T                   T
H           H                   H
Y             Y
H           H                   H
O   O   O           O     O O     O
D D                                 D
G                                     G
K                                       K
I                                         I
N                                           N

Figure 5.2 Dot plot showing identities between the short name (DOROTHYHODGKIN) and full name
(DOROTHYCROWFOOTHODGKIN) of a famous protein crystallographer.
Letters corresponding to isolated matches are shown in non-bold type. The longest matching
regions, shown in red, are the first and last names DOROTHY and HODGKIN. Shorter matching
regions, such as the OTH of dorOTHy and crowfoOTHodgkin, or the RO of doROthy and cROwfoot, are
noise. Note the effect of the ‘insertion’ of Crowfoot in interrupting and displacing the
matching.
R R R R
The dot plot A A A A A A A A A
C C C
The dot plot is a simple picture that gives an over- A A A A A A A A A
D D D
view of pairwise sequence similarity. Less obvious is A A A A A A A A A
its close relationship to alignments. B B B B
R R R R
The dot plot is a table or matrix. The rows cor- A A A A A A A A A
respond to the residues of one sequence and the col-
umns to the residues of the other sequence. In its Figure 5.3 Dot plot showing identities between a repetitive
sequence (ABRACADABRACADABRA) and itself. The repeats
simplest form, the positions in the dot plot are left
appear on several subsidiary diagonals parallel to the main
blank if the residues are different and filled if they diagonal.
match. Stretches of similar residues show up as
diagonals in the upper left–lower right (northwest–
southeast) direction (see Figure 5.2). A dot plot relating real amino acid sequences –
Dot plots gives quick pictorial statements of the human and Xenopus laevis ephrin B3 – shows that
relationship between two sequences. Obvious fea- the similarity is stronger in the N-terminal part of the
tures of similarity stand out. Figure 5.3 shows a dot protein. Figure 5.5 shows the dot plot and the cor-
plot of a sequence containing internal repetitions. responding sequence alignment. It is useful to look at
Figure 5.4 shows a dot plot of a palindromic sequence these together and see how the regions of high and
(a sequence that is identical to its reversal). low similarity correspond in the two figures.
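The construction of a dot plot described in this section can be sketched in a few lines of Python. This is only an illustration under simple assumptions: cells are filled on exact identity, and the function names `dot_plot` and `diagonal_runs` and the `min_len` cut-off are invented for this example rather than taken from any published program.

```python
def dot_plot(rows: str, cols: str) -> list[str]:
    """Return one text line per residue of `rows`.

    A cell shows the residue when rows[i] == cols[j] and is blank otherwise.
    """
    return [" ".join(r if r == c else " " for c in cols) for r in rows]


def diagonal_runs(rows: str, cols: str, min_len: int = 3) -> list[tuple[int, int, str]]:
    """List maximal runs of matches along northwest-southeast diagonals.

    Each run is reported as (row index, column index, matching text).
    Runs shorter than `min_len` are treated as noise and skipped.
    """
    runs = []
    for i in range(len(rows)):
        for j in range(len(cols)):
            if rows[i] != cols[j]:
                continue
            # Skip cells that continue a run begun one step up and to the left.
            if i > 0 and j > 0 and rows[i - 1] == cols[j - 1]:
                continue
            k = 0
            while i + k < len(rows) and j + k < len(cols) and rows[i + k] == cols[j + k]:
                k += 1
            if k >= min_len:
                runs.append((i, j, rows[i:i + k]))
    return runs


if __name__ == "__main__":
    short_name = "DOROTHYHODGKIN"
    full_name = "DOROTHYCROWFOOTHODGKIN"
    print("  " + " ".join(full_name))
    for residue, line in zip(short_name, dot_plot(short_name, full_name)):
        print(residue + " " + line)
    print(diagonal_runs(short_name, full_name))
```

With the two names of Figure 5.2, the runs of length three or more are DOROTHY, HODGKIN, and the shorter OTH match that the caption classifies as noise.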
Pattern matching – the basic tool of bioinformatics 169

Figure 5.4 Dot plot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal.
This is not just word play – regions in DNA recognized by transcriptional regulators or restriction enzymes have sequences related to palindromes. Longer regions of DNA or RNA containing inverted repeats of this form can form stem–loop structures.

• Ephrins are proteins that guide axons in the developing nervous system, and play a number of other important roles in development.

A disadvantage of the dot plot is that its ‘reach’ into the realm of distantly related sequences is poor. In analysing sequences, one should always look at a dot plot to be sure of not missing anything obvious, but be prepared to apply more subtle tools.

Dot plots and alignments

How can we derive an optimal alignment of two sequences? Conceptually, any alignment – that is, any assignment of residue–residue correspondences – is equivalent to a path through a dot plot. A diagonal move corresponds to an equivalence between two residues. Horizontal and vertical moves correspond to insertions and deletions. Because any allowable alignment assigns residues uniquely, and in order along the sequences, the only allowable moves are southeast (diagonal), east (horizontal), and south (vertical).
Figure 5.6 shows the optimal path through the Dorothy Hodgkin dot plot. The path passes through the largest number of matching residues. The horizontal segment of the path corresponds to the insertion of Crowfoot.
In this example, the optimal alignment and optimal path are obvious. In general, a computer program must examine all possibilities. How to do that effectively is a matter of some delicacy. Without explaining the methods in detail, the trick is to decide, for each partial path, what its best extension is. Algorithms for relating locally optimal moves to integrated optimal pathways – that is, for constructing full alignments – depend on a mathematical technique called dynamic programming.1

1 For further details, see Lesk, A.M. (2008). Introduction to Bioinformatics, 3rd ed. Oxford University Press, Oxford.

• A dot plot shows perspicuously the quality and distribution of the pattern of similarity between two sequences. Each possible alignment of the two sequences corresponds to a path through the dot plot, from upper left to lower right.

Varieties and extensions

Global alignment assigns correspondences to all residues in the sequences. If one sequence is shorter than the other, the difference in length must be made up by insertions/deletions.
Local alignment is a pattern-matching technique for identifying a match for a short probe sequence within a much longer text. Gaps outside the local match are not penalized. This is a common task in text searching and editing. Finding all instances of the word ‘dream’ in Hamlet is an example of local pattern matching. (Allowing a mismatch and a gap would pick up the word ‘drum’ as well as ‘dream’.) If you consider the DNA sequence of an entire human chromosome as a long string of characters, searching for a particular gene sequence within the chromosome is a local matching problem.
A very important extension of pairwise sequence alignment is multiple sequence alignment, the mutual alignment of three or more sequences. Usually we can find large families of similar sequences by identifying homologues in many different species.
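The idea of deciding, for each partial path, what its best extension is can be illustrated with a small dynamic-programming sketch. It rests on a deliberately simple assumption that is not a full scoring scheme: the best path is the one that passes through the largest number of matching residues, and gaps cost nothing. The function name `best_path_alignment` is invented for this example.

```python
def best_path_alignment(a: str, b: str) -> tuple[int, str, str]:
    """Align `a` and `b` along a path through the most matching residues.

    Returns (number of matches, gapped a, gapped b).
    """
    n, m = len(a), len(b)
    # best[i][j] = most matches on any path from cell (i, j) to the lower right corner.
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            diagonal = best[i + 1][j + 1] + int(a[i] == b[j])
            best[i][j] = max(diagonal, best[i + 1][j], best[i][j + 1])

    # Walk from upper left to lower right, always taking a move that keeps the best total.
    i = j = 0
    top, bottom = [], []
    while i < n or j < m:
        if i < n and j < m and best[i][j] == best[i + 1][j + 1] + int(a[i] == b[j]):
            top.append(a[i])      # diagonal move: a[i] is paired with b[j]
            bottom.append(b[j])
            i += 1
            j += 1
        elif j < m and best[i][j] == best[i][j + 1]:
            top.append("-")       # horizontal move: gap in a
            bottom.append(b[j])
            j += 1
        else:
            top.append(a[i])      # vertical move: gap in b
            bottom.append("-")
            i += 1
    return best[0][0], "".join(top), "".join(bottom)


if __name__ == "__main__":
    matches, top, bottom = best_path_alignment("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN")
    print(matches)
    print(top)
    print(bottom)
```

For the two Dorothy Hodgkin strings this path contains 14 matches, with a single horizontal segment opposite CROWFOOT. Realistic programs replace the count of matches by a score that also penalizes mismatches and gaps.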

(a) [Dot plot comparing human ephrin B3 with Xenopus laevis ephrin B3; the image is not reproduced here.]

(b) 10 20 30 40 50 60
| | | | | |
Human LSLEPVYWNSANKRFQAEGGYVLYPQIGDRLDLLCPRARPPGPHSSPNYEFYKLYLVGGA
Xenopus SLDPIYWNSSNKRFEDTEGYVLYPQIGDRLDLLCPRSEPQGPFSSSPYEYYKLYLVGTK
SL P YWNS NKRF GYVLYPQIGDRLDLLCPR P GP SS YE YKLYLVG

70 80 90 100 110 120

| | | | | |
Human QGRR_CEAPPAPNLLLTCDRPDLDLRFTIKFQEYSPNLWGHEFRSHHDYYIIATSDGTRE
Xenopus EEMSSCSILRTPNLLLTCDRPSQDLRFTIKFQEFSPNLWGHEFQSQRDYYIIATSDGTMD
C PNLLLTCDRP DLRFTIKFQE SPNLWGHEF S DYYIIATSDGT

130 140 150 160 170 180

| | | | | |
Human GLESLQGGVCLTRGMKVLLRVGQSPRGGAVPRKPVSEMPMERDRGAAHSLEPGKENLPGD
Xenopus GIETLQGGVCETKGMKVTLKVGQSPNGATPPRRPSSAG _ _ _ KDSGISPSVPNPDIPNVGE
G E LQGGVC T GMKV L VGQSP G PR P S D G S G

190 200 210 220 230 240

| | | | | |
Human PTSNATSRGAEGPLPPPSMPAVAGAAGGLALLLLGVAGAGGAMCWRRRRAKPSESRHPGP
Xenopus TSGNATKTGENGPLPISHVPLVAGAAGGAALLLL_VFGVVGWVCHRRRQAKHSDTRHP_P
NAT G GPLP P VAGAAGG ALLLL V G G C RRR AK S RHP P

250 260 270 280 290 300

| | | | | |
Human GSFG _ _ _ _ _ _ RGGS _ _ _ _ _ _ _ _ _ _ _ _ _
LGLGGGGGMGPR EAEPGELG _ _ IALRGGGAADP
Xenopus LSLGSITSPKRGGNNNGHEPSDIIMPLRPSEAGAFCPHYEKVSGDYGHPVYIVQDMASQS
S G RGG L G P E G G A

Human PFCPHYE_
Xenopus PANIYYKV
P Y

Figure 5.5 Relationships between the sequences of ephrin B3 proteins from human and Xenopus laevis. (a) Dot plot. The major
signal is along the main diagonal, interrupted by occasional divergent regions, and showing the substantially weaker similarity near
the C terminus. (b) Sequence alignment. Amino acids are colour coded by physicochemical type. Letters under the sequences indicate
positions occupied by the same residue in both sequences.

Figure 5.6 A path through the Dorothy Hodgkin dot plot. Diagonal arrows correspond to aligned residues, horizontal arrows to gap insertions. The corresponding alignment is the obvious one:

DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN

Multiple sequence alignment reveals the underlying patterns contained in a set of related sequences much more clearly than pairwise sequence alignments.
Programs for all of these different alignment problems are available on the web:

Global alignment (pairwise and multiple)
CLUSTAL W   http://www.ebi.ac.uk/clustalw
T-Coffee    http://www.igs.cnrs-mrs.fr/Tcoffee/tcoffeecgi/index.cgi
EMBOSS      http://www.ebi.ac.uk/emboss/align

Local alignment
SSEARCH     http://pir.georgetown.edu/pirwww/search/pairwise.shtml
EMBOSS      http://www.ebi.ac.uk/emboss/align/

Defining the optimum alignment

To go beyond ‘alignment by eyeball’ via dot plots, we must define quantitative measures of sequence similarity and difference.
Given two character strings, two measures of the distance between them are as follows:

• the Hamming distance, defined between two strings of equal length, is the number of positions with mismatching characters.
• the Levenshtein, or edit, distance between two strings of not necessarily equal length is the minimal number of ‘edit operations’ required to change one string into the other, where an edit operation is a deletion, insertion, or alteration of a single character in either sequence.

For example:

agtc
cgta      Hamming distance = 2

ag-tcc
cgctca    Levenshtein distance = 3

A given sequence of edit operations induces a unique alignment, but not vice versa.
For applications to molecular biology, we wish to assign variable weights to different edit operations. For nucleic acids, we know that transition mutations (purine↔purine and pyrimidine↔pyrimidine; i.e. A↔G and T↔C) are more common than transversions (purine↔pyrimidine; i.e. (A or G)↔(T or C)). For proteins, amino acid substitutions tend to be conservative: the replacement of one amino acid by another with similar size or physicochemical properties is more likely to occur than its replacement by another amino acid with dissimilar properties. Similarly, the deletion of several contiguous bases or amino acids is more probable than the independent deletion of the same number of isolated bases.
A computer program can score each path through the dot plot by adding up the scores of the individual steps. For each substitution, it adds the score of the mutation, depending on the pair of residues involved. For horizontal and vertical moves, it adds a suitable gap penalty.

Scoring schemes

A scoring system must account for residue substitutions and insertions or deletions. (An insertion from one sequence’s point of view is a deletion as seen by the other.) Deletions, or gaps in a sequence, will have scores that depend on their lengths.
For nucleic acid sequences, it is common to use a simple scheme for substitutions, +1 for a match, −1 for a mismatch, or a more complicated scheme based on the higher frequency of transition mutations than transversion mutations. One possibility is:

      A    G    T    C
A    20   10    5    5
G    10   20    5    5
T     5    5   20   10
C     5    5   10   20

For proteins, a variety of scoring schemes have been proposed. We might group the amino acids into classes of similar physicochemical type and score +1 for a match within a residue class and −1 for residues in different classes. We might try to devise a more precise substitution score from a combination of properties of the amino acids. Alternatively, we might try to let the proteins teach us an appropriate scoring scheme. M.O. Dayhoff did this first, by collecting statistics on substitution frequencies in the protein sequences then known. Her results were used for many years to score alignments. They have been superseded by newer matrices (see Box 5.2) based on the very much larger set of sequences that has subsequently become available.

The BLOSUM matrices

S. Henikoff and J.G. Henikoff developed the BLOSUM matrices for scoring substitutions in amino acid sequence comparisons. The BLOSUM matrices are based on the BLOCKS database of aligned protein sequences; hence the name: BLOcks SUbstitution Matrix. From regions of closely related proteins alignable without gaps, Henikoff and Henikoff

BOX The BLOSUM62 matrix used for scoring amino acid sequence similarity
5.2

Rows and columns are in alphabetical order of the three- substitution is the same as the rate of its reverse, but
letter amino acid names. Only the lower triangle of the because it is difficult to determine the differences between
matrix is shown, as the substitution probabilities are taken the two rates).
as symmetric (not because we are sure that the rate of any

Ala (A) 4
Arg (R) −1 5
Asn (N) −2 0 6
Asp (D) −2 −2 1 6
Cys (C) 0 −3 −3 −3 9
Gln (Q) −1 1 0 0 −3 5
Glu (E) −1 0 0 2 −4 2 5
Gly (G) 0 −2 0 −1 −3 −2 −2 6
His (H) −2 0 1 −1 −3 0 0 −2 8
Ile (I) −1 −3 −3 −3 −1 −3 −3 −4 −3 4
Leu (L) −1 −2 −3 −4 −1 −2 −3 −4 −3 2 4
Lys (K) −1 2 0 −1 −3 1 1 −2 −1 −3 −2 5
Met (M) −1 −1 −2 −3 −1 0 −2 −3 −2 1 2 −1 5
Phe (F) −2 −3 −3 −3 −2 −3 −3 −3 −1 0 0 −3 0 6
Pro (P) −1 −2 −2 −1 −3 −1 −1 −2 −2 −3 −3 −1 −2 −4 7
Ser (S) 1 −1 1 0 −1 0 0 0 −1 −2 −2 0 −1 −2 −1 4
Thr (T) 0 −1 0 −1 −1 −1 −1 −2 −2 −1 −1 −1 −1 −2 −1 1 5
Trp (W) −3 −3 −4 −4 −2 −2 −3 −2 −2 −3 −2 −3 −1 1 −4 −3 −2 11
Tyr (Y) −2 −2 −2 −3 −2 −1 −2 −3 2 −1 −1 −2 −1 3 −3 −2 −2 2 7
Val (V) 0 −3 −3 −3 −1 −2 −2 −3 −3 3 1 −2 1 −1 −2 −2 0 −3 −1 4
A R N D C Q E G H I L K M F P S T W Y V
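As an illustration of how entries such as those in Box 5.2 are used, the following sketch scores a given gapped alignment by adding a substitution score for each aligned pair and subtracting a penalty for each gap. Several things here are assumptions made for the example: only six matrix entries are copied from the box, the two peptide strings are invented, the function names are hypothetical, and a gap of length L is charged gap_open + (L − 1) × gap_extend, which is one common convention among several.

```python
# A few entries copied from the lower triangle of Box 5.2; this is not the full matrix.
BLOSUM62_SUBSET = {
    ("R", "K"): 2,
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("W", "W"): 11,
    ("G", "G"): 6,
    ("A", "A"): 4,
}


def substitution_score(x: str, y: str, matrix: dict = BLOSUM62_SUBSET) -> int:
    """Look up a symmetric substitution score; raises KeyError for pairs not in the subset."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def score_alignment(top: str, bottom: str, gap_open: float = 11, gap_extend: float = 1) -> float:
    """Score a gapped alignment given as two equal-length strings with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    gap_row = None  # which row holds the gap that is currently open, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            # The first position of a gap pays the opening penalty, later ones the extension penalty.
            total -= gap_extend if gap_row == row else gap_open
            gap_row = row
        else:
            total += substitution_score(x, y)
            gap_row = None
    return total


if __name__ == "__main__":
    print(score_alignment("RLDWG", "KIEWG"))      # substitutions only
    print(score_alignment("RLDWAAG", "KIEW--G"))  # same pairs plus one gap of length 2
```

The default penalties of 11 and 1 are the values quoted in this chapter for protein alignments with BLOSUM62; any other matrix or penalty pair can be passed in instead.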

calculated the ratio of the number of observed pairs of amino acids at any position to the number of pairs expected from the overall amino acid frequencies. In order to avoid overweighting closely related sequences, the Henikoffs replaced groups of proteins that had sequence identities higher than a threshold by either a single representative or a weighted average. The threshold of 62% similarity produces the commonly used BLOSUM62 substitution matrix. This is offered by all programs as an option and is the default in most.
The BLOSUM62 matrix is shown in Box 5.2. It expresses scores as log-odds values:

\[ \text{Score of mutation } i \leftrightarrow j = \log_{10} \left( \frac{\text{observed } i \leftrightarrow j \text{ mutation rate}}{\text{mutation rate expected from amino acid frequencies}} \right) \]

The numbers are multiplied by 10, to avoid decimal points. The matrix entries reflect the probabilities of mutational events. A value of +2 (e.g. leucine↔isoleucine) implies that in related sequences the mutation would be expected to occur 1.6 times more frequently than random. The calculation is as follows: the matrix entry 2 corresponds to the actual value 0.2 because of the scaling. The value 0.2 is log10 of the relative expectation value of the mutation. As log10(1.6) = 0.2, the expectation value is 1.6.
The probability of two independent mutational events is the product of their probabilities. By using logarithms, we have scores that we can add up rather than multiply, a computational convenience.

Scoring insertions and deletions, or ‘gap weighting’

To form a complete scoring scheme for alignments, we need, in addition to the substitution matrix, a way of scoring gaps. How important are insertions and deletions, relative to substitutions? We need to distinguish gap initiation:

aaagaaa
aaa-aaa

from gap extension:

aaaggggaaa
aaa----aaa

For aligning DNA sequences, the popular alignment software package CLUSTALW recommends use of the identity matrix for substitution (+1 for a match, 0 for a mismatch) and gap penalties of 10 for gap initiation and 0.1 for gap extension by one residue. For aligning protein sequences, the recommendations are to use the BLOSUM62 matrix for substitutions, with gap penalties of 11 for gap initiation and 1 for gap extension by one residue.

• To define optimal alignment, we must assign scores for each possible substitution and corresponding scores for gap initiation and extension.

Approximate methods for quick screening of databases

It is routine to screen genes from a new genome against databases, to find similarities to other sequences. Databases have grown so large that programs based on exact local alignments are too slow. Approximate methods can detect close relationships well and quickly but are inferior to the exact ones in picking up very distant relationships. In practice, they give satisfactory performance in the many cases in which the probe sequence is fairly similar to one or more sequences in a databank, and they are, therefore, certainly worth trying first.

The original paper on BLAST: Altschul, S.F., et al. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403–410, was the field’s most highly cited paper published in the 1990s.

A typical approximation approach such as BLAST (basic local alignment search tool) takes a small integer k and determines all instances of each ‘word’ of length k (i.e. each set of k consecutive characters, with no gaps) of the probe sequence that occur in any sequence in the database. A candidate sequence is a sequence in the databank containing a large number of matching k-tuples, with equivalent spacing in probe and candidate sequences. For a selected set of candidate sequences, approximate optimal alignment calculations are then carried out, with the time- and space-saving restriction that the paths through the matrix considered are restricted to bands around the diagonals containing many matching k-tuples. It is clearest to show the procedure in terms of a dot plot (see Figure 5.7).
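The word-lookup idea can be sketched as follows. This is a toy illustration, not the BLAST implementation: the function names are invented, a single string stands in for the database, and ‘equivalent spacing’ is interpreted as matches that fall on the same diagonal, that is, with the same difference between the position in the candidate and the position in the probe.

```python
from collections import Counter, defaultdict


def index_words(sequence: str, k: int) -> dict[str, list[int]]:
    """Record the start positions of every k-letter word in `sequence`."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def count_hits_by_diagonal(probe: str, target_index: dict[str, list[int]], k: int) -> Counter:
    """Count exact k-word matches on each diagonal (target position minus probe position)."""
    diagonals = Counter()
    for p in range(len(probe) - k + 1):
        for t in target_index.get(probe[p:p + k], []):
            diagonals[t - p] += 1
    return diagonals


if __name__ == "__main__":
    target = "DOROTHYCROWFOOTHODGKIN"
    probe = "DOROTHYHODGKIN"
    hits = count_hits_by_diagonal(probe, index_words(target, 3), 3)
    for diagonal, count in sorted(hits.items()):
        print(f"diagonal {diagonal:+d}: {count} matching words")
```

Diagonals that collect many word matches mark the bands in which a more careful alignment is worth computing. Indexing the database once, before any query arrives, is what makes the lookup fast.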

[Figure 5.7 panels, each a dot plot of the probe sequence against the database to be searched: (1) Empty dot plot; (2) Word lookup; (3) Match extension; (4) Local gapped alignment. The images are not reproduced here.]

Figure 5.7 Schematic diagram showing the mechanism of a BLAST search. BLAST solves the problem of finding matches of a probe
sequence in a full genome or a full database that are much longer than the probe sequence.
(1) The ‘playing field’ of the algorithm is the outline of a dot plot, just as if the problem were going to be solved by application of an
exact-alignment method.
(2) BLAST first divides the probe sequence into fixed-length words of length k; here k = 4. It then identifies all exact occurrences of
these words in the full database, with no mismatches or gaps. Note that the same four-letter word may occur several times in the
probe sequence (shown here in red), and of course each four-letter word may match many times within the database. It is possible
to do this step quickly after pre-processing the database to record the sites of appearance of all four-letter words.
(3) Starting with each match, BLAST tries to extend the match in both directions, still with no mismatches or gaps allowed.
(4) Given the extended matches, BLAST tries to put them together by doing alignments allowing mismatches and gaps, but only within
limited regions containing the preliminary matches (grey areas). The result of this step is to add to the matches the positions shown
as ★. This produces longer matching regions.
It is the restriction of the more complex matching procedure to relatively small regions, rather than applying it to the entire matrix, that
gives the method its speed. The price to pay is that the method will miss a combined match lying outside the grey area. In the example
illustrated, the matching regions coloured red and green, at the right of the matrix, will not be combined but reported as separate hits.
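Steps (2) and (3) of the caption can be sketched as follows. This is a simplified illustration that follows the caption literally (exact word matches, then extension with no mismatches or gaps); production BLAST programs use scoring thresholds instead. The function names are invented for the example, and a single string stands in for the database.

```python
def extend_exact_match(probe: str, target: str, p: int, t: int, k: int) -> tuple[int, int, int]:
    """Extend an exact k-letter match at probe[p], target[t] in both directions.

    No mismatches or gaps are allowed. Returns (probe start, target start, length).
    """
    start_p, start_t = p, t
    while start_p > 0 and start_t > 0 and probe[start_p - 1] == target[start_t - 1]:
        start_p -= 1
        start_t -= 1
    end_p, end_t = p + k, t + k
    while end_p < len(probe) and end_t < len(target) and probe[end_p] == target[end_t]:
        end_p += 1
        end_t += 1
    return start_p, start_t, end_p - start_p


def extended_matches(probe: str, target: str, k: int) -> list[tuple[int, int, int]]:
    """Find every k-letter word of the probe in the target and extend each hit."""
    found = set()  # several words usually extend to the same region, so de-duplicate
    for p in range(len(probe) - k + 1):
        word = probe[p:p + k]
        t = target.find(word)
        while t != -1:
            found.add(extend_exact_match(probe, target, p, t, k))
            t = target.find(word, t + 1)
    return sorted(found)


if __name__ == "__main__":
    for match in extended_matches("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN", 3):
        print(match)
```

Step (4), joining the extended matches by a gapped alignment restricted to a band around them, would then be applied only to the regions reported here.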

There are several variations on this theme, including the original BLAST program and its variants (see Box 5.3).

Multiple sequence alignments and pattern detection

Multiple sequence alignments are rich in information about patterns of conservation. They help us to understand the common features of structure and function of a family of sequences, by showing us which residues are crucial (and therefore conserved). They also help us to identify distant homologues with greater confidence than a pairwise sequence alignment could.
The patterns inherent in a multiple sequence alignment are not merely inferences from the alignment table – this is leaving it too late – but can actively contribute to creating a high-quality alignment. The idea is for an algorithm to learn the underlying patterns while it is assembling the multiple sequence alignment.
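A minimal sketch of reading conservation off a multiple sequence alignment follows. The three aligned sequences are made up for the illustration, the function names are hypothetical, and ‘conserved’ is taken in the strictest sense of identical in every sequence; practical tools use softer measures.

```python
from collections import Counter


def column_conservation(alignment: list[str]) -> list[tuple[str, float]]:
    """For each column return (most common symbol, fraction of sequences carrying it)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    summary = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        summary.append((symbol, count / len(column)))
    return summary


def conserved_positions(alignment: list[str]) -> list[int]:
    """Indices of columns occupied by the same residue (not a gap) in every sequence."""
    return [
        i
        for i, (symbol, fraction) in enumerate(column_conservation(alignment))
        if fraction == 1.0 and symbol != "-"
    ]


if __name__ == "__main__":
    toy_alignment = [  # invented sequences, already aligned
        "GDRLDLLCPR",
        "GDKLDILCPK",
        "GERLDLLCPR",
    ]
    print(column_conservation(toy_alignment))
    print(conserved_positions(toy_alignment))
```

The per-column counts computed here are also the raw material from which a profile of the family can be built.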

BOX 5.3 Different ‘flavours’ of BLAST search different databases

Program    Searches for:                                       In:
BLASTN     Nucleotide sequence                                 Nucleotide sequence database
BLASTX     Six-frame translations of a nucleotide sequence     Protein sequence database
BLASTP     Protein sequence                                    Protein sequence database
TBLASTN    Protein sequence                                    Six-frame translations of a nucleotide sequence database
TBLASTX    Six-frame translations of a nucleotide sequence     Six-frame translations of a nucleotide sequence database

One very powerful program based on this approach is PSI-BLAST, an extension of BLAST for multiple sequence alignment (see Figure 5.8). PSI-BLAST constructs a profile, i.e. a conservation pattern, in an initial multiple alignment of the ‘hits’ from a preliminary BLAST search. Armed with the profile, the method returns to the database and does a more sensitive search, giving higher weight to well-conserved positions; it then realigns what it finds and refines the profile. Several such cycles of refinement of the profile give PSI-BLAST the power both to detect distant relationships and to create high-quality multiple sequence alignments.

[Figure 5.8 flowchart: Input sequence and protein sequence databank → BLAST search → Filter results (E < threshold) → Multiple sequence alignment → Position-specific scoring matrix → Iterate the BLAST search. Part of a position-specific scoring matrix is shown:]

        A    R    N    D    C  ...
1  R   -2   -3   -4   -5   -2  ...
2  D   -3   -3    4   -7   -2  ...
3  A   -1   -4   -4   -5   -1  ...
...

Figure 5.8 Schematic flowchart of a PSI-BLAST calculation to detect protein sequences in a database that are similar to a probe sequence. The user submits an input sequence and chooses a protein sequence databank to probe.
First, using the input sequence and a standard substitution matrix such as BLOSUM62, an ordinary BLAST calculation identifies similar sequences in the database and assigns a statistical measure of significance, E, to each ‘hit’. For each sequence retrieved from the database, E is the number of sequences of equal or higher similarity to the probe sequence that would be expected to be found in the database, just by chance. The program will select those sequences for which E is no greater than a specified threshold, often chosen as 0.005, and perform a multiple sequence alignment of them.
By counting the relative frequencies of different amino acids in each column of the multiple sequence alignment, the program will derive a position-specific scoring matrix. The red box at the lower left shows part of a position-specific scoring matrix. The columns are labelled by the 20 natural amino acids, shown in blue. The rows are labelled by the sequence to be scored by the matrix, residue numbers in red and amino acids in green. In this case, the N-terminal sequence of the sequence to be scored is RDA . . . The entries in the matrix are the log-odds scores of finding each amino acid at each position in the multiple alignment. For instance, the entry under A in row 3 is −1; therefore, the probability of finding an A at the third position is proportional to 10^−1.
To find the score of the sequence, add up the values in the R column of the first row, the D column of the second row, the A column of the third row, etc., to give: (−3) + (−7) + (−1) = −11, which corresponds to the product of probabilities 10^−3 × 10^−7 × 10^−1 = 10^−11. In this example, the probabilities are expressed unscaled and as logarithms to the base 10. Note that the sequence being scored may contain gaps.
This matrix can be used as an alternative to the input sequence and substitution matrix in a BLAST search. Each subsequent BLAST search, based on the matrix derived in the previous step, will return a different set of ‘hits’. With a sensible choice of input parameters, the procedure will usually converge to produce a more reliable set of similar sequences than would be returned by the simple BLAST search of the input sequence performed in the first step.

Perhaps the most powerful pattern analysis algorithms are based on hidden Markov models. A hidden Markov model is a mathematical construct that generates sequences according to internal probabilistic rules. Successive rolls of a pair of dice generate a sequence of numbers between 2 and 12, with different probabilities. However, there is not even a probabilistic link between the successive numbers that come up. Typing with one finger at random generates a sequence of characters, again with no correlation between successive letters. However, if you insist on typing ten keys per second, so that you can only move your hand a limited extent between successive keystrokes, then Q is more likely to be followed by W than by M. The result would be a probabilistically generated sequence with the observed distribution of each character dependent on what preceded it.
To represent a set of sequences with a hidden Markov model, imagine a computer program that generates sequences of nucleotides, or amino acids, according to rules that govern the probability distributions of successors to each letter. For example, an adenine would be assigned a set of probabilities for being followed by another adenine, or a thymine, or a cytosine, or a guanine, or a gap. A different set of probabilities would govern the successors of a thymine, a guanine, a cytosine, or a gap. The enhanced power of hidden Markov models over position-specific scoring matrices stems from the correlation between successive positions.
For example, it is observed that the dinucleotide frequency CpG is lower in higher organisms than would be expected from the overall mole fractions of C and G in the genome. A hidden Markov model would reflect this in a lowered probability of G in a position following a C, relative to the probabilities of a G following A, T, or another G.
Given a set of sequences, the process of training a hidden Markov model involves adjusting all of the probability distributions so that the sequences generated by the model have a high probability of reproducing the set of sequences analysed.

• A multiple sequence alignment is much richer in information than a pairwise sequence alignment. A hidden Markov model is a method for capturing the information.

Pattern matching in three-dimensional structures

Given two or more structures – perhaps of several homologous proteins – we can frame questions generalized from sequence alignment. How can we measure structural similarity quantitatively? Can we derive a sequence alignment from structural comparisons?
In the native state of a protein, the mainchain follows a curve in space. The general spatial layout of this curve defines a folding pattern. The backbones of related proteins show recognizably similar but not identical folding patterns. A letter of the alphabet in different type fonts – for instance, the letter b set in two different typefaces – illustrates the topological similarities, and differences, in detail seen among proteins related by evolutionary divergence that share a common folding pattern. A better analogy for widely divergent proteins might be the letters B and R, which share the letter P as a common core substructure but in addition have either a loop (B) or a stroke (R) that differ. Homologous protein structures typically contain fairly large, well-fitting substructures. Figure 5.9 shows a superposition of local regions of two proteins and an overall superposition of two entire structures.
Extraction of the maximum common substructure induces an alignment of the sequences. This is called a structural alignment. Because structure changes more conservatively than sequence during evolution, for distantly related proteins it may be possible to align the sequences on the basis of the structures even if methods based purely on sequences cannot recognize the relationship.

• A structural alignment is nevertheless an alignment: an assignment of residue–residue correspondences. Instead of assigning the correspondence by matching the characters in two or more sequences, a structural alignment assigns the correspondence to residues that occupy similar positions in space, relative to the molecular framework.
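Looking back at the successor probabilities discussed for hidden Markov models, the following sketch estimates, from a set of sequences, the probability of each letter following each other letter. It is a deliberate simplification: a first-order Markov chain with no hidden states and no gap symbol, trained by simple counting. The function name and the three short training sequences are invented for the example.

```python
from collections import Counter, defaultdict


def train_successor_probabilities(sequences: list[str]) -> dict[str, dict[str, float]]:
    """Estimate P(next letter | current letter) from the adjacent pairs observed."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[current][following] += 1
    return {
        current: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for current, counts in pair_counts.items()
    }


if __name__ == "__main__":
    training_set = ["CACT", "CCTA", "GCAG"]  # made-up sequences in which C is never followed by G
    model = train_successor_probabilities(training_set)
    for current in sorted(model):
        print(current, model[current])
    print("P(G | C) =", model["C"].get("G", 0.0))
```

In this toy training set the dinucleotide CG never occurs, so the estimated probability of G after C is zero, an exaggerated version of the CpG depletion mentioned earlier. A practical model would add pseudocounts so that unseen successors keep a small non-zero probability.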

Evolution of protein sequences, structures, and functions

Extending Crick’s classic ‘central dogma’ gives us the basic paradigm:

DNA → RNA → amino acid sequence of a protein → protein structure → protein function

During evolution, selection acts on protein function to alter gene frequencies in populations, closing the loop back to DNA.
Transcription of DNA to RNA, and translation of mRNA by ribosomes, takes us as far as the amino acid sequence. The amino acid sequence dictates the protein structure by a spontaneous folding process (see Chapter 1). Folding produces a native state. For most proteins, the native state structure contains an active site with the proper geometry, charge distribution, and hydrogen-bonding potential to interact specifically with other proteins, or with small-molecule ligands. In many cases, the active site contains catalytic residues that produce enzymatic activity.

Figure 5.9 Local and global superpositions of protein structures.
(a) Two β-hairpins from the antigen-binding site of an antibody [1VFA, 2FBJ]. Only main-chain atoms are shown. The ‘stems’ of the loops, parts of strands of β-sheet, superpose well (black and cyan regions, at bottom of picture). The connections have different lengths and conformations, and do not superpose well (red and blue regions, at top of picture). This is an example of a local well-fitting substructure. It involves only a small contiguous region of the chains.
(b) Superposition of regions with a folding pattern called the HeH (helix–extended loop–helix). Black: domain from RNA-binding domain of transcriptional terminator protein ρ (from E. coli) [1A62]. Red: domain from KU heterodimer (human) [1JEQ]. This figure shows a ‘chain trace,’ a polygon in space connecting one point from each residue.
The helices at either end of the chains superpose well. The extended regions between the helices do not. The sequence alignment induced by the structural superposition is this:

RNA-binding domain   TPVSELITLGENMGLEN--LARMRKQDIIFAILKQH
KU heterodimer       FTVPMLKEACRAYGL--KSG-L-KKQELLEALTKHF

The effects of single-site mutations

The native states of proteins are the cumulative effect of many inter-residue interactions. What then should we expect to be the result of a perturbation in the amino acid sequence?
Consider a SNP leading to a single amino acid substitution. Will the structure stay the same? Changing one amino acid without otherwise altering the structure would leave most interactions intact, except for those involving the mutated residue itself (and conservative mutations may preserve even these). Nevertheless, sometimes changing a single residue is enough to blow the original structure apart. An example is the mutation A174D in human aldolase (see Box 5.4). In other cases changes to residues providing specific interactions with ligands may alter activity. Some mutations do not alter the structure but destabilize it; frequently this is enough to cause disease. Box 5.4, treating human

BOX 5.4 Hereditary fructose intolerance and mutants of aldolase B

The enzyme aldolase catalyses the cleavage of two substrates:

fructose-1,6-bisphosphate → glyceraldehyde-3-phosphate + dihydroxyacetone phosphate
fructose-1-phosphate → glyceraldehyde + dihydroxyacetone phosphate

Fructose-1,6-bisphosphate is a mainstream metabolite in glycolysis and gluconeogenesis, classic pathways of glucose metabolism. Fructose-1-phosphate arises in metabolism of dietary fructose. Different isozymes of aldolase have different relative activities towards fructose-1,6-bisphosphate and fructose-1-phosphate.

Approximately 1 in 20 000 people suffer from hereditary fructose intolerance, a defect in the liver isozyme, aldolase B. The gene encoding this protein maps to locus 9q22.3 in the human, giving the trait an autosomal recessive inheritance pattern. For affected individuals, ingestion of fructose, a monosaccharide common in fruits and honey, leads to vomiting, discomfort, and hypoglycaemia. Problems often first appear in infancy as fructose and sucrose are added to the diet upon weaning. The condition can be fatal if unrecognized and untreated; however, for most patients it is sufficient to adopt a diet free of fructose and sucrose.

Numerous mutations have been associated with aldolase B dysfunction, including amino acid substitutions, nonsense mutations producing truncated protein, insertions and deletions including frameshifts, and changes in splice sites. The most common mutations are A149P (over 50% of cases worldwide) and A174D.

T. Cox and co-workers* characterized the proteins corresponding to several known mutants. Normal aldolase B is a tetramer of four 363-amino-acid subunits. Because all mutants were discovered in patients presenting with Hereditary Fructose Intolerance, all had reduced or absent enzymatic activity. Two classes of mutants were:

• Catalytic mutants: these can be expressed as intact tetramers, retaining some activity at 37°C. These include W147R and R303W.
• Structural mutants: these are destabilized, and show catalytic activity (if at all) only after expression at 22–23°C. These include N334K, A149P, L256P, and A174D.

The substitutions in the catalytic mutants occur in or near the active site. The substitutions in the structural mutants occur either in a residue buried in the monomeric structure (A174D, which does not fold at all, as a result of burying a charged sidechain), or in the subunit interface, which causes the protein to dissociate into monomers (N334K, L256P, and A149P).

* Rellos, P., Sygusch, J., & Cox, T.M. (2000). Expression, purification and characterization of natural mutants of human aldolase B. J. Biol. Chem. 275, 1145–1151.

aldolase, illustrates several effects of single amino acid substitutions.

Many small changes in amino acid sequence leave the basic structure intact, producing only small conformational changes. In this sense, protein structures are robust to mutation – not to all mutations, but to enough mutations to allow variability. This is essential – and sufficient – for evolution.

• A sequence that so lacked robustness that any mutation would destroy it, could not exist. It could have no neighbouring precursor and processes of evolution could never find its sequence.

Z. Wang and J. Moult have described some of the kinds of structural effects of single amino acid substitutions related to human diseases² (see Figures 5.10, 5.11, and 5.12).

² Wang, Z. & Moult, J. (2001). SNPs, protein structure, and disease. Hum. Mut. 17, 263–270.

Evolution of protein structure and function

Sequences and structures of related proteins show coordinated evolutionary divergence. As sequences progressively diverge, structures progressively deform. Typically, a core of the structure, including the major elements of secondary structure and, usually, the active site, retains its folding pattern. Other, peripheral regions of the structure can refold entirely.

Figure 5.10 Factor XIIIa is the enzyme at the final step of the blood coagulation cascade. It cross-links fibrin molecules, stabilizing clots. The normal protein contains an arginine sidechain that forms a salt bridge and multiple hydrogen bonds to a neighbouring aspartate sidechain and to main-chain carbonyl atoms. The arginine and aspartate sidechains are shown in green. Mutation of the arginine to an isoleucine (not shown) removes these interactions, destabilizing the protein and resulting in poor clot formation [1F13].

Figure 5.11 Aldolase A is an enzyme in the glycolytic pathway. It is an isozyme of aldolase B, the protein involved in Hereditary Fructose Intolerance. Aldolase is normally a tetramer, stabilized by hydrogen bonds involving an aspartate residue at the inter-subunit interface. Mutation of this aspartate to a glycine destabilizes the tetramer. Although this has no detectable effect on most cell types, red blood cells show weakened cell membranes, causing a congenital form of haemolytic anaemia [2ALD].

In this process, structure changes more conservatively than sequence. In many families of proteins, we can recognize structural similarity in relatives so distant that there is no easily visible signal of the similarity in the sequence.

The reason for retention of protein conformation in general, and the structure of the active site in particular, is selection for maintenance of function. A need to retain function imposes constraints on protein stability and structural change during evolution.

It is easier to see the effects of these constraints than to understand their mechanism. In many cases certain specific residues are directly involved in function – for example the iron-linked histidine of the globins – and these are immutable. In contrast, constraints that maintain the overall folding pattern are dispersed around the sequence, and it is only by studying patterns of residue conservation in large-scale alignments of homologous proteins that we can begin to understand the constraints imposed by structure on sequence.

Figure 5.12 Retinol-binding protein transports vitamin A around the bloodstream, bound in a deep hydrophobic cavity within the protein. This figure shows a model of mutant 75Gly→Asp of retinol-binding protein. Note that the model was built solely by inserting the sidechain, to observe the structural consequences. No attempt was made to try to predict the structural deformation produced. What the model shows is that there are steric and electrostatic incompatibilities between the sidechain of 75Asp and the ligand. This explains the observation of decreased affinity for retinol, producing vitamin A deficiency and night blindness.

Figure 5.13 Relationships among sequence, structure, and function:

• similar sequences can be relied on to produce similar protein structures, with divergence in structure increasing progressively with
the divergence in sequence;
• conversely, similar structures are often found with very different sequences: in many cases, the relationships in a family of proteins
can be detected only in the structures, the sequences having diverged beyond the point of our being able to detect the underlying
common features;
• similar sequences and structures often produce proteins with similar functions, but exceptions abound;
• conversely, similar functions are often carried out by non-homologous proteins with dissimilar structures, e.g. the different families
of proteinases, sugar kinases, and lysyl-tRNA synthetases.

When a protein evolves to change its function, many of these constraints are released – or, more precisely, replaced by alternative constraints required by the new function. The relationship between sequence and function is much more complex than the relationship between sequence and structure. Small changes in sequence, during evolution, usually make only small changes in structure. Often they make only small changes in function also. But changes in function do not necessarily require large changes in sequence or structure – function can jump.

Indeed, a protein can change function without any sequence changes at all. For instance, in the duck, an active lactate dehydrogenase and an enolase serve as crystallins in the eye lens, although they do not encounter the substrates in situ. In other birds, crystallins are closely related to enzymes, but some divergence has already occurred, with loss of catalytic activity. (This proves that the enzymatic activity is not necessary in the eye lens.) Many other such examples are known of ‘recruitment’, or ‘moonlighting’, by proteins – adapting to a novel function with relatively little sequence change.

Conversely, proteins with very different sequences and structures can have the same function. For instance, many families of proteinases differ in sequence and structure, sharing only a common general catalytic activity. Figure 5.13 summarizes, in a schematic way, the landscape of protein space with respect to the relations among sequence, structure, and function.

All three features of proteins – sequence, structure, and function – are potentially useful in interpreting new genome sequences. We expect that many regions of the new genome encode proteins similar to relatives known in other species. We can find them by looking for similar patterns in the sequences. We can expect that the structures will be similar, and indeed can calibrate the expected difference in structure from the extent of divergence in sequence. As Figure 5.13 shows, however, we cannot be as confident in assuming that function will be conserved.

Phylogeny

Once we have measured the similarity of one or more properties of a set of individuals, or species, we can try to arrange them according to their apparent pattern of divergence. If in fact the similarities do arise during descent from a common ancestor, it should be possible to depict the relationships in a family tree (for individuals), or phylogenetic tree (for species and higher taxa). The goal is to present the pattern of similarities and divergences in a consistent diagram such that close relationships within the diagram correspond to high degrees of similarity.

The assumption is that such a diagram will have the form of a tree. (We saw an example of a tree in Figure 4.1.) The computations that derive the optimal phylogenetic tree from a matrix of similarities are not trivial, and the problem has been a challenge in research for some time.

The goal of phylogeny is a logical arrangement of a set of species, populations, individuals, and genes. The arrangement is derived from observed similarities. The basic principle is that the origin of similarity is common ancestry. Although there are many exceptions, arising from convergent evolution or horizontal gene transfer, this basic principle is crucial both for rationalizing contemporary observations and for opening a window onto the history of life.

From phylogeny, we infer relationships – among species, populations, individuals, or genes. Relationship is taken in the literal sense of kinship or genealogy, i.e. assignment of a scheme of ancestors and descendants (see Box 5.5).

• A phylogenetic tree is a diagram showing ancestor–descendant relationships, that captures a pattern of similarities, in that individuals or species closely linked in the tree have high similarity.

BOX 5.5 Concepts related to biological classification and phylogeny

• Homology means, specifically, descent from a common ancestor.
• Similarity is the measurement of resemblance or difference, independent of the source of the resemblance. Similarity is observable now and involves no historical hypotheses. In contrast, assertions of homology require inferences about historical events, which are almost always unobservable.
• Similarity and dissimilarity. Data suitable for phylogenetic analysis may be specified equivalently in terms of similarities between objects or by dissimilarities. In comparing two DNA sequences, we may count the percentage of identical residues in an optimal alignment. This is a measure of similarity – the higher the value, the more similar the sequences. Alternatively, we could count the number of mutations separating the sequences. This is a measure of dissimilarity.
• Clustering is bringing together similar items, distinguishing classes made up of objects that are more similar to one another than they are to other objects outside the classes. Most people would agree about degrees of similarity, but clustering is more subjective. When classifying objects, some people prefer larger classes, tolerating wider variation; others prefer smaller, tighter classes. They are called, respectively, groupers and splitters.
• Hierarchical clustering is the formation of clusters of clusters of . . .
• The distinction between clustering and classification. Clustering is the determination of the set of classes into which a group of samples should be divided. Classification is the assignment of a sample to its proper place in a known set of classes.
• Phylogeny is the description of biological ancestor–descendant relationships, usually expressed as a tree. A statement of phylogeny among objects assumes homology and depends on classification.
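The two measures described in Box 5.5 under 'Similarity and dissimilarity' can be sketched in a few lines. This is a minimal illustration, assuming the sequences are already aligned, of equal length, and free of gaps; the function names are hypothetical and the example sequences are chosen only for illustration.

```python
def count_differences(a: str, b: str) -> int:
    """Dissimilarity: number of aligned positions at which the residues differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Similarity: percentage of aligned positions carrying identical residues."""
    return 100.0 * (len(a) - count_differences(a, b)) / len(a)


print(count_differences("ATCC", "TCGG"))  # 4
print(percent_identity("ATCC", "ATGC"))   # 75.0
```

The two values carry the same information here: a higher percent identity corresponds to a lower count of differences.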

Relationships between species are rarely directly observable, even in such an ‘obvious’ case as Darwin’s finches. However, from genomics, species relationships can usually be deduced reliably.

Evolutionary relationships give us an historical glimpse of the development of life (see Figure 4.1). Although molecules themselves cannot be dated, evolutionary events observed on the molecular level can be calibrated with the fossil record.

The results of phylogenetic analyses are usually presented in the form of an evolutionary tree (see Box 5.6).

Figure 5.14 shows the relationships among the ratites – large flightless birds, such as the ostrich. The ancestor of the ratites is believed to be a bird that could fly, probably related to the extant tinamous.

Such a tree, showing descendants of a single original ancestral species, is said to be rooted. (The root of the tree usually appears at the top or the side; botanists will have to get used to this.)

Alternatively, we may be able to specify relationships but not order them according to a history.

[Tree diagram: root labelled ‘Common ancestor’; leaves, from left to right: Moa, Cassowary, Emu, Kiwi, Ostrich, Rhea]

Figure 5.14 Phylogenetic tree of ratites (large flightless birds), based on mitochondrial DNA sequences. The common ancestor is at the root of this tree, appearing at the top of the graph. A surprising implication of these DNA sequences is that the moa and kiwi are not the closest relatives and, therefore, New Zealand must have been colonized twice by ratites or their ancestors. In terms of geography, this is less surprising if one looks at a map of the ancient continent Gondwanaland prior to its break-up, rather than at a contemporary map on which the distance between Africa and New Zealand is large.

BOX 5.6 Structure and contents of an evolutionary tree

In computer science, a tree is a particular kind of graph. A graph is a structure containing nodes (abstract points) connected by edges (represented as lines between the points). A path between two nodes in a graph is a series of consecutive edges that begins at one node and ends in the other. In a general graph, there may be many paths between any two nodes. (In Chapters 11 and 12, we discuss graphs in more detail.)

A tree is a special kind of graph. First of all, a tree must be connected, meaning that there is a path through the graph between any two points. Second, in a tree there is only one path between every two points. We have already seen several trees; for example, Figure 4.1.

A particular node may be selected as a root of a tree. However, this is not necessary – abstract trees may be rooted (for instance, Figure 5.14) or unrooted (for instance, Figure 5.15). In phylogenetic trees, the root is the earliest common ancestor of all of the other nodes. Rooted phylogenetic trees explicitly show ancestor–descendant relationships, as from any node there is a connected path up through successive ancestors terminating at the root. Unrooted trees show the topology of relationship but not the pattern of descent.

It may be possible to assign numbers to the edges of a graph to indicate some kind of ‘length’ of the edges, corresponding to a ‘distance’ between the nodes that the edges connect. These lengths are not necessarily geometric distances, but may be abstract values. Given edge lengths, the graph may be drawn to scale, with the sizes of the edges proportional to the assigned lengths.

In phylogenetic trees, edge lengths signify either some measure of the dissimilarity between two taxa, or the length of time since their separation. The assumption that differences between properties of living species reflect their divergence times will be true only if the rates of divergence are the same in all branches of the tree. Many exceptions are known. For instance, among mammals, many proteins from rodents show relatively fast evolutionary rates.
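The two defining properties of a tree in Box 5.6 (connected, and exactly one path between every two nodes) can be checked mechanically. The sketch below is illustrative only: the node labels and both function names are hypothetical, and it relies on the standard graph-theory fact that a connected graph has unique paths exactly when it has one edge fewer than it has nodes.

```python
from collections import defaultdict


def is_tree(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is a tree."""
    nodes = set(nodes)
    # A connected graph has exactly one path between every two nodes
    # if and only if it has one edge fewer than it has nodes.
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    # Depth-first search from an arbitrary node to test connectedness.
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for v in neighbours[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def path_to_root(parent, node):
    """Follow parent links from a node up through its ancestors to the root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


# Hypothetical rooted tree: the root has children X and Y; A to D are leaves.
parent = {"X": "root", "Y": "root", "A": "X", "B": "X", "C": "Y", "D": "Y"}
edges = list(parent.items())
nodes = set(parent) | set(parent.values())

print(is_tree(nodes, edges))                 # True
print(is_tree(nodes, edges + [("A", "C")]))  # False: A and C now have two paths
print(path_to_root(parent, "C"))             # ['C', 'Y', 'root']
```

Storing each node's parent is one simple way to represent a rooted tree; an unrooted tree would keep only the edge list.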

[Unrooted tree diagram; taxa shown: Sharp-beaked ground finch, Woodpecker finch, Large tree finch, Large ground finch, Cactus finch, Medium tree finch, Large cactus finch, Small tree finch, Cocos finch, Vegetarian finch, Medium ground finch, Small ground finch, Warbler finch]

Figure 5.15 Unrooted tree of relationships among finches from the Galapagos and Cocos Islands. Darwin studied the Galapagos finches
in 1835, noting the differences in the shapes of their beaks and the correlation of beak shape with diet. Finches that ate fruits had beaks
like those of parrots, whereas finches that ate insects had narrow, prying beaks. These observations were seminal to the development of
Darwin’s ideas. As early as 1839 he wrote, in The Voyage of the Beagle, ‘Seeing this gradation and diversity of structure in one small,
intimately related group of birds, one might really fancy that from an original paucity of birds in this archipelago, one species had been
taken and modified for different ends’.

The relationships among the finches of the Galapagos Islands, studied by Darwin, plus a related species from the nearby Cocos Island are shown in an unrooted tree (Figure 5.15). Addition of data from a species on the South American mainland ancestral to the island finches would allow us to root the tree.

• The idea of phylogeny is to observe different degrees of similarity among species or higher taxa, assume that the species are related by descent from a common ancestor and that higher degrees of similarity correspond to closer relationships, and to try to capture the relationships in a tree diagram showing ancestor–descendant relationships such that species more closely related according to the tree do have higher degrees of similarity.

The statement of a tree of relationships may reveal only the connectivity or topology of the tree, in which case the lengths of the branches are arbitrary. A more ambitious goal is to show the distances between taxa quantitatively, for instance to label the branches with the time since divergence from a common ancestor.

A phylogenetic tree tells us the organization of a set of taxa. It does not tell us how they should be grouped or partitioned. For example, the tree of Darwin’s finches looks as if it could reasonably be partitioned into two or three clusters. The guiding principle in selecting a partition is that the intracluster similarities be relatively high and the intercluster similarities be relatively low. If the data are intrinsically well grouped, then the clustering is obvious. If there is no clear separation, proper clustering is ambiguous and difficult.

• The PHYLIP package (PHYLogeny Inference Package) of J. Felsenstein is an integrated set of links to tools for phylogenetics, and sources of software (http://evolution.genetics.washington.edu/phylip/software.html). Some multiple sequence alignment packages, such as CLUSTAL W, provide facilities to convert their alignments to phylogenetic trees.

Phylogenetic trees

Given a set of data that characterize different groups of organisms – for example, DNA or protein sequences, or protein structures, or shapes of teeth from different species of animals – how can we derive information about the relationships among the organisms in which they were observed? To what extent does the topology of the relationships depend on the choice of character? In particular, are there any systematic discrepancies between the implications of molecular and palaeontological analysis?

Broadly, there are two approaches to deriving phylogenetic trees. One approach makes no reference to

any historical model of the relationships. Measure a set of distances between species and generate the tree by a hierarchical clustering procedure. This is called the phenetic approach. The alternative, the cladistic approach, is to consider possible pathways of evolution, infer the features of the ancestor at each node and choose an optimal tree according to some model of evolutionary change. Phenetics is based on similarity; cladistics is based on genealogy.

Clustering methods

Phenetic, or clustering, approaches to determination of phylogenetic relationships are explicitly non-historical. Indeed, hierarchical clustering is perfectly capable of producing a tree even in the absence of evolutionary relationships. A departmental store has goods clustered into sections according to the type of product – for instance, clothing or furniture – and subclustered into more closely related subdepartments, such as men’s and women’s shoes. Men’s and women’s shoes have a common ancestor, but there is no implication that shoes and furniture do.

A simple clustering procedure works as follows: given a set of species, determine for all pairs a measure of the similarity or difference between them. This could depend on a physical body trait, such as the difference between the average adult height of members of two species, or one could use the number of different bases in alignments of mitochondrial DNA.

To create a tree from the set of dissimilarities:

• First, choose the two most closely related species and insert a node to represent their common ancestor.
• Then replace the two selected species by a set containing both, and replace the distances from the pair to the others by the average of the distances of the two selected species to the others. Now we have a set of pairwise dissimilarities, not between individual species, but between sets of species. (Regard each remaining individual species as a set containing only one element.)
• Then repeat the process.

This method of tree building is called the UPGMA method (unweighted pair group method with arithmetic mean; see Box 5.7).

BOX 5.7 Calculation of phylogenetic trees by clustering

Consider four species characterized by homologous sequences ATCC, ATGC, TTCG, and TCGG. Taking the number of differences as the measure of dissimilarity between each pair of species, we will use a simple clustering procedure to derive a phylogenetic tree.

The distance matrix is:

          ATCC   ATGC   TTCG   TCGG
ATCC       0      1      2      4
ATGC              0      3      3
TTCG                     0      2
TCGG                            0

(As the matrix is symmetric, we need fill in only the upper half.)

The smallest distance is 1, between ATCC and ATGC. Therefore, our first cluster is {ATCC, ATGC}. The tree will contain the fragment:

[Tree fragment: ATCC and ATGC joined at a common node]

The reduced distance matrix is:

                {ATCC, ATGC}   TTCG              TCGG
{ATCC, ATGC}    0              ½(2 + 3) = 2.5    ½(4 + 3) = 3.5
TTCG                           0                 2
TCGG                                             0

The number 3.5 in the upper right was calculated by averaging the distances between ATCC and TCGG = 4 and between ATGC and TCGG = 3.

The next cluster is {TTCG, TCGG}, with distance 2. Finally, linking the clusters {ATCC, ATGC} and {TTCG, TCGG} gives the tree:

[Tree diagram: the root joins the clusters {ATCC, ATGC} and {TTCG, TCGG} by branches of length 1.5 each; ATCC and ATGC join their common node by branches of length 0.5 each; TTCG and TCGG join theirs by branches of length 1 each]

Branch lengths have been assigned according to the rule:

Branch length of edge between nodes X and Y = ½ × distance between X and Y

Whether the branch lengths are truly proportional to the divergence times of the taxa represented by the nodes must be determined from external evidence.
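The clustering procedure can be sketched in code, using the four sequences of Box 5.7. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, the input sequences are assumed to be distinct and pre-aligned, and the cluster-to-cluster distance is weighted by cluster size so that it equals the arithmetic mean over all member pairs (for this four-species example that gives the same numbers as the plain average described in the text).

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def upgma(seqs):
    """Cluster distinct aligned sequences; return merges as (cluster_a, cluster_b, distance)."""
    clusters = [(s,) for s in seqs]  # each species starts as a one-element set
    dist = {frozenset(p): float(hamming(p[0][0], p[1][0]))
            for p in combinations(clusters, 2)}
    merges = []
    while len(clusters) > 1:
        # Choose the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merges.append((a, b, dist[frozenset((a, b))]))
        merged = a + b
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Size-weighted mean of the two old distances, i.e. the
            # arithmetic mean over all pairs of members.
            dist[frozenset((merged, c))] = (
                len(a) * dist[frozenset((a, c))] + len(b) * dist[frozenset((b, c))]
            ) / len(merged)
        clusters.append(merged)
    return merges


for a, b, d in upgma(["ATCC", "ATGC", "TTCG", "TCGG"]):
    # Branch length rule of Box 5.7: half the distance between the joined nodes.
    print(a, b, "distance", d, "branch length", d / 2)
```

The three merges occur at distances 1, 2, and 3, giving the branch lengths 0.5, 1, and 1.5 shown in Box 5.7.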

Cladistic methods

Cladistic methods deal explicitly with the patterns of ancestry implied by the possible trees relating a set of taxa. Their aim is to select the correct tree by utilizing an explicit model of the evolutionary process.

The most popular cladistic methods in molecular phylogeny are the maximum parsimony and maximum likelihood approaches. They are specialized to sequence data, starting from a multiple sequence alignment. Neither maximum parsimony nor maximum likelihood could be applied to anatomic characters such as average adult height.

The maximum parsimony method of W. Fitch defines an optimal tree as the one that postulates the fewest mutations (see Box 5.8).

The maximum likelihood method assigns quantitative probabilities to mutational events, rather than merely counting them. Like maximum parsimony, maximum likelihood reconstructs ancestors at all nodes of each tree considered; however, it also assigns branch lengths based on the probabilities of the mutational events postulated. For each possible tree topology, the assumed substitution rates are varied to find the parameters that give the highest likelihood of producing the observed sequences. The optimal tree is the one with the highest likelihood of generating the observed data.

Both maximum parsimony and maximum likelihood methods are superior to clustering techniques. This has been demonstrated with cases where independent evidence – for instance, from palaeontology – provides a correct answer, and also with simulated data – computer generation of evolving sequences.

BOX 5.8 Calculation of phylogenetic trees by maximum parsimony

Given species characterized by homologous sequences ATCG, ATGG, TCCA, and TTCA, the tree:

[Tree diagram: root ATCA. One branch (A→G) leads to ancestor ATCG, whose descendants are ATCG (no change) and ATGG (C→G). The other branch (A→T) leads to ancestor TTCA, whose descendants are TCCA (T→C) and TTCA (no change).]

postulates four mutations. Note that the ancestral sequences, ATCG, TTCA, and ATCA, are not part of the observable data.

An alternative tree:

[Tree diagram: root ATCG. One branch (G→A) leads to ancestor ATCA, whose descendants are ATCG (A→G) and TCCA (A→T, T→C). The other branch (A→T) leads to ancestor TTCG, whose descendants are ATGG (T→A, C→G) and TTCA (G→A).]

postulates eight mutations. Note that the second tree implies that the G → A mutation in the fourth position occurred twice independently. The former tree is optimal according to the maximum parsimony method, because no other tree involves fewer mutations. In many cases, several trees may postulate the same number of mutations, fewer than any other tree. For such cases, the maximum parsimony approach does not give a unique answer.

The problem of varying rates of evolution

Suppose that four species, A, B, C, and D, have the phylogenetic tree:

[Tree diagram with leaves A, B, C, D]

This tree is consistent with the dissimilarity matrix:

     A    B    C    D
A    0    3    3    3
B         0    2    2
C              0    1
D                   0

Suppose, however, that taxon D is changing very fast, although the phylogeny is unaltered. The dissimilarity matrix might then be observed to be:

     A    B    C    D
A    0    3    3    20
B         0    2    20
C              0    20
D                   0

from which we would derive the incorrect phylogenetic tree:

[Tree diagram with leaves A, B, C, D]

All of the methods discussed here are subject to errors of this kind if the rates of evolutionary change vary along different branches of the tree. To test for varying rates, compare the species under consideration with an outgroup – a species more distantly related to all of the species in question than any pair of them is to each other. For instance, if we are studying species of primates, a non-primate mammal such as the cow would be a suitable outgroup. If the rates of evolution among the primate species were constant, we would expect to observe approximately equal dissimilarity measures between all primate species and the cow. If this is not observed, the suggestion is that evolutionary rates have varied among the primates, and the character being used may well not provide the correct phylogenetic tree.

Bayesian methods

The problem we are trying to solve is:

Of all possible phylogenetic trees organizing the relationships among different species, based on a given multiple sequence alignment, which one has the highest probability of generating the observed multiple sequence alignment, under some model of evolutionary change?

The model might be specified in terms of probabilities of mutation rates, etc., and for the moment it would seem that a weakness of the approach is the difficulty of knowing how to specify the model explicitly and accurately.

Nevertheless, from any such model of evolutionary change, we can compute the probability that any tree would produce the observed multiple sequence alignment. Suppose we begin the problem in a state of complete ignorance, meaning that we consider that initially – for all we know – all potential phylogenetic trees must be regarded as equally probable. Then Bayes’ rule states that we want to choose the tree with the highest probability of producing the observed multiple sequence alignment.

What makes this approach so powerful is that we can optimize the probability of producing the observed data not only over possible trees, but over different models of evolutionary change. This releases us from making overly constricting assumptions such as constancy of molecular clock rates over different branches of the tree, identical mutation probabilities at all sites, etc. The calculations are nevertheless feasible. There is consensus that programs based on the Bayesian approach are the most powerful tools for deriving phylogenetic trees from multiple sequence alignments.
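The step from 'all trees equally probable' to 'choose the tree most likely to produce the data' can be written out explicitly. The notation below is introduced here for illustration and is not the book's: T is a candidate tree, D the observed multiple sequence alignment, M the assumed model of evolutionary change, and N the number of candidate trees.

```latex
\begin{align*}
P(T \mid D, M) &= \frac{P(D \mid T, M)\, P(T \mid M)}
                       {\sum_{T'} P(D \mid T', M)\, P(T' \mid M)}
  && \text{(Bayes' rule)} \\[4pt]
P(T \mid M) &= \frac{1}{N} \quad \text{for every } T
  && \text{(complete initial ignorance)} \\[4pt]
\Rightarrow\quad P(T \mid D, M) &= \frac{P(D \mid T, M)}{\sum_{T'} P(D \mid T', M)}
  \;\propto\; P(D \mid T, M) \\[4pt]
\Rightarrow\quad \arg\max_{T} P(T \mid D, M) &= \arg\max_{T} P(D \mid T, M)
\end{align*}
```

The denominator is the same for every tree, so under a uniform prior the most probable tree given the data is the tree that gives the data the highest probability. With a non-uniform prior the factor P(T | M) no longer cancels and must be kept.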

Short-circuiting evolution: genetic engineering

Evolutionary divergence arises in nature through generation of variation by random mutation, followed by either selection or genetic drift to alter allele frequencies in populations or to create novel species. Contemporary techniques allow deliberate transfer of genes, to create organisms with altered characters directly. In addition to gene therapy for disease (see p. 125) – and genetically modified crop plants – many applications are available or under development:

Use of microorganisms as protein factories. Microorganisms are routinely used in the laboratory to express proteins – human or otherwise – for research. An example with clinical application is the microbial synthesis of human growth hormone. Formerly, the only source of the hormone was by post-mortem extraction from pituitary glands. This carried the risk of transmitting prion disease. Other microbiologically produced human proteins with clinical applications include insulin, and many monoclonal antibodies. Still other applications include manufacture of fuels, or plastics, or dissolving oil spills.

In the USA the attempt to patent genetically modified bacteria that could break down hydrocarbons was a landmark case, decided in favour of granting the patent by a 5–4 decision of the United States Supreme
Recommended reading 187

Court in 1980. Of course, novel varieties of flowers (a) pesticide-resistant plants, that allow treatments
produced by classical methods of breeding and selec- to kill weeds or insect pests without damaging
tion have been protectable for many years. The Inter- the plants.
national Union for the Protection of New Varieties of (b) a related approach is a plant that makes its
Plants (UPOV) is an intergovernmental organization own insecticide. Bt-corn (maize) contains a natural
established by treaty in 1961. It is now, appropri- insect-killing gene transferred from Bacillus
ately, turning its attention to biotechnology, and legal thuringiensis.
and intellectual property issues.
(c) crops with enhanced nutritional value. An
Genetically modified animals. Higher animals are also example is ‘golden rice’ enriched in vitamin A
used as protein factories, in cases where the active (see p. 245).
protein requires postranslational modifications of (d) fruit with longer shelf life, such as the ‘flavr-savr’
which microorganisms are incapable. Production of tomato.
drugs by this route is called ‘pharming’. Genetically
(e) crops that produce only sterile seeds.
engineered goats secrete an anticoagulant, human
antithrombin III, in their milk. This product has been There are a number of controversial aspects to
approved for clinical use in the United States. these activities. In the case of genetically modified
Other goals of genetically modified animals include: plants, there is concern over the spreading of genes
from the crops to undesired hosts. For instance, if
(a) enhancing the nutritional value of food; e.g.
a gene for herbicide resistance is introduced into a
pork enriched in w-3 fatty acids.
crop plant, it would make it easier selectively to kill
(b) pigs lacking the surface antigens that produce weeds without affecting the crop plant. However, it
rejection by the human immune system, as a has been observed that the gene can spread to the
source of organs for transplant. weeds. Another concern is economic. Use of sterile
(c) animals that grow faster and/or require less seeds requires farmers to purchase new seeds each
expensive feed; for example fast-growing salmon. year. It precludes traditional agricultural practice of
(d) protecting livestock against disease; e.g. cows lack- holding back a portion of a crop for replanting.
ing prion proteins and therefore immune to BSE. In addition to the specific economic implications,
(e) allergen-free pets. there is a widespread feeling that biotechnology
might alter the relationship between people and
(f) fish that glow in colours by virtue of genes for
Nature that have been a common cultural heritage
fluorescent proteins.
for thousands of years. It would be wrong to dismiss
Genetically modified plants. Many crop plants are these feelings as irrational or as characterizing only
targets for genetic modification. Goals include: a fringe.

● RECOMMENDED READING

• Two general articles on phylogeny:

Whelan, S., Liò, P. & Goldman, N. (2001). Molecular phylogenetics: state-of-the-art methods
for looking into the past. Trends Genet. 17, 262–272.
Baldauf, S.L. (2003). Phylogeny for the faint of heart: a tutorial. Trends Genet. 19, 345–351.
• Phylogenetic relationships in eukaryotes:
Baldauf, S.L. (2003). The deep roots of eukaryotes. Science 300, 1703–1706.
• Estimates of the history of diversity of living species:
Jackson, J.B.C. & Johnson, K.G. (2001). Paleoecology: measuring past biodiversity. Science 293,
2401–2404.

• Discussions of sequence analysis:

Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences. Cambridge University Press,
Cambridge.
Doolittle, R.F. (1986). Of URFS and ORFS/A Primer on How to Analyze Derived Amino Acid
Sequences. University Science Books, Mill Valley, CA, USA.
Li, H. & Homer, N. (2010). A survey of sequence alignment algorithms for next-generation
sequencing. Brief. Bioinf. 11, 473–483.
• It is possible to represent phylogenetic relationships in forms more general than tree structures:
Bandelt, H.-J. & Dress, A.W.M. (1992). Split Decomposition: A new and useful approach to
phylogenetic analysis of distance data. Molec. Phy. Evol. 1, 242–252.
Huson, D.H. & Scornavacca, C. (2011). A survey of combinatorial methods for phylogenetic
networks. Genome Biol. Evol. 3, 23–35.
• Discussion of applications of genetic engineering:
Arnold, F.H. (2008). The race for new biofuels. Eng. Sci. 71, 12–19.
Brustad, E.M. & Arnold, F.H. (2011). Optimizing non-natural protein function with directed
evolution. Curr. Opin. Chem. Biol. 15, 201–210.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 5.1 On two photocopies of Figure 5.15, indicate a reasonable division of the species into
(a) three clusters; (b) five clusters.
Exercise 5.2 What is the Hamming distance between the words DECLENSION and RECREATION?
Exercise 5.3 What is the Levenshtein distance between the words BIOINFORMATICS and
CONFORMATION?
Exercise 5.4 The Levenshtein distance between the strings agtcc and cgctca is 3, consistent with
the following alignment:
ag-tcc
cgctca
Provide a sequence of three edit operations that convert agtcc to cgctca.
Exercise 5.5 To what alignment does the path through the following dot plot correspond?

[Dot plot: the phrase THE●RETORT●COURTEOUS (across the top) compared with THE●REPLY●CHURLISH (down the side), with ● marking the spaces between words. Matching letters are marked, and a path of arrows traces an alignment through the plot.]

Exercise 5.6 In the dot plot appearing in Figure 5.5, there is an interruption of the matching at a
height approximately at the level of the downward-pointing arrow at the left that precedes the
words Xenopus laevis. On a photocopy of Figure 5.5(b), indicate where in the sequence this
region appears.
Exercise 5.7 How would you use a dot plot to pick up palindromic DNA sequences of the type
that appear partly on each strand, as in the specificity sites of restriction endonucleases?
Exercise 5.8 According to the BLOSUM62 matrix: (a) is a histidine (H) more likely to change to
an asparagine (N) or to an aspartic acid (D)? (b) What is the ratio of the probability that a histidine
will be observed to change to an asparagine to the probability expected on the basis of the amino
acid composition of the protein that it will change to an asparagine?
Exercise 5.9 Consider the box (red outline) showing a part of a position-specific scoring matrix
in Figure 5.8. Suppose you were scoring a protein with 225 residues according to this matrix.
(a) How many columns would you expect there to be in the position-specific scoring matrix?
(b) How many rows would you expect there to be?
Exercise 5.10 On photocopies of Figure 5.13, indicate points (a) where a pair of highly diverged
homologous proteins with similar structure but without obvious sequence similarities might lie;
(b) where a pair of non-homologous proteins with similar structure might lie; (c) where a pair of
enzymes that share a function but not a structure (for instance, serine and cysteine proteinases)
might lie.
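
The Hamming and Levenshtein distances used in Exercises 5.2–5.4 can be computed mechanically. The following is a minimal sketch, not a reference implementation: the function names are illustrative, and the Levenshtein routine assumes unit cost for each substitution, insertion, and deletion. The second call uses the strings of Exercise 5.4, for which the exercise itself states a distance of 3.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    # prev[j] is the distance between the characters of a processed so far
    # (excluding the current one) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(hamming("ATCC", "ATGC"))
print(levenshtein("agtcc", "cgctca"))
```

Only one row of the dynamic-programming table is kept at a time, which is enough to obtain the distance; recovering the edit operations themselves would require keeping the full table and tracing back through it.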

Problems
Problem 5.1 Draw a dot plot of the following sequence from the wheat dwarf virus genome:
ttttcgtgagtgcgcggaggctttt against itself. In what respects is it not a perfect palindrome?
Problem 5.2 How would you adapt the dot plot formalism to search for regions of DNA or RNA
that form local double-helical regions? Assume that the two hydrogen-bonded regions are
separated by only a short unpaired loop, as for example in tRNA (see Figure 1.4).
Problem 5.3 (a) How might the course of the BLAST calculation shown in Figure 5.7 differ if the
word length were chosen as 3 instead of 4? (b) How might the course of the BLAST calculation
shown in Figure 5.7 differ if the word length were chosen as 7 instead of 4?
Problem 5.4 The phylogenetic tree (p. 184) is derived from a complete dissimilarity matrix,
i.e. a specification of a measure of the dissimilarity between every pair of tetranucleotides. The
numbers associated with each edge reproduce the measures of dissimilarity between connected
nodes: i.e. the sum of the edges in the path between ATCC and ATGC is 0.5 + 0.5 = 1, which
is the value in the matrix corresponding to row ATCC and column ATGC. For every pair of
tetranucleotides, calculate the sum of the numbers associated with the edges in the path between
them. For which pairs do the results agree with the original dissimilarity matrix? For which pairs
do the results disagree?
Problem 5.5 Examples in the chapter derived a phylogenetic tree for the four sequences ATCC,
ATGC, TTCG, and TCGG by the UPGMA method (unweighted pair group method with arithmetic
mean) and a phylogenetic tree for the sequences ATCG, ATGG, TCCA, and TTCA by the
maximum-parsimony method. Derive phylogenetic trees for the sequences ATCC, ATGC, TTCG,
and TCGG by the maximum-parsimony method and for the sequences ATCG, ATGG, TCCA, and
TTCA by the UPGMA method. Show all intermediate steps. Compare the results with the trees
derived in the chapter.
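
For the dot plot questions (Exercise 5.7, Problems 5.1 and 5.2), one possible approach is to plot a sequence against its own reverse complement, so that a palindromic site appears as an unbroken diagonal. The sketch below assumes this approach; the function names are illustrative, and the EcoRI recognition site gaattc is used only as a familiar example that does not appear in the text.

```python
COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.lower()))


def dot_plot(rows: str, cols: str) -> list:
    """One text line per residue of `rows`; '*' marks identity with `cols`."""
    return ["".join("*" if r == c else "." for c in cols) for r in rows]


def is_palindromic(seq: str) -> bool:
    """True if the sequence reads the same on both strands."""
    return seq.lower() == reverse_complement(seq)


site = "gaattc"  # EcoRI recognition site, chosen here as an assumption-free toy case
for base, line in zip(site, dot_plot(site, reverse_complement(site))):
    print(base, line)
print(is_palindromic(site))
```

The same functions can be applied to the wheat dwarf virus sequence of Problem 5.1; breaks in the main diagonal then show where the sequence departs from a perfect palindrome.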

Weblems
Weblem 5.1 Draw a picture of the human aldolase B monomer. Indicate the sites of the mutations
N334K, L256P, A149P, and A174D. Indicate the region of the active site and the regions
of intersubunit contacts. Comment on the possible severity of the effects of these mutations on
structure and function of the protein.
Weblem 5.2 Retrieve the globin sequences shown in Figure 1.17(b). Perform a multiple sequence
alignment, and draw the phylogenetic tree. Comment on ways in which the tree seems biologically
reasonable; and – if any – ways in which it does not.
CHAPTER 6

Genomes of Prokaryotes

LEARNING GOALS

• To know the features that distinguish the major divisions of life and to appreciate how
differences of lifestyle reflect differences in genomes and structures.
• To understand the molecular basis of adaptations; for example, to life at high temperatures, or
different ocean depths.
• To appreciate, at the molecular level, the genomic and phenotypic differences among selected
related species of prokaryotes.
• To face the problem of bacterial pathogenicity, and the development of antibiotic resistance.
• To recognize the vast variety of different microorganisms that inhabit, and mutually interact in,
environmental samples. These habitats include oceans and soils, and internal environments such
as the human (or animal) gut.

Evolution and phylogenetic relationships in prokaryotes

Prokaryotes have several claims on our interest.

• They cause infectious diseases. Some diseases, such as tuberculosis, are major public health problems. It is a challenge to control these diseases in the face of the development of antibiotic resistance.
• Molecular biologists study prokaryotes as examples of relatively simple cells, to understand fundamental principles of metabolism, genetics, and development.
• Historically, prokaryotes represent the earliest forms of life, from which all others are derived. They had the biosphere to themselves for over 2 billion years.
• Prokaryotes are important mediators of ecological processes and geological cycles. Indeed, geological and biological phenomena are linked in an intimate marriage, which has seen its turbulent episodes. Purely geological events such as asteroid impacts have caused mass extinctions. Purely biological events, such as the development of photosynthetic processes that released large quantities of O2 into the atmosphere, and respiration that released CO2, have altered general flows of matter and energy, affecting the development of the Earth’s geochemistry and climate. Microbes respond to human-caused environmental damage. They can aggravate ecological problems, but also hold out hope of ameliorating them (Table 6.1).

Table 6.1 Landmarks in history of life

Formation of Earth               ∼4.5 × 10^9 years ago
Origin of life                   >3.8 × 10^9 years ago
Cyanobacterial photosynthesis    >2.7 × 10^9 years ago
Rise of atmospheric O2           2.3–1 × 10^9 years ago
First metazoan                   ∼1 × 10^9 years ago
Cambrian                         ∼0.5 × 10^9 years ago

The major habitats of prokaryotes are the open ocean, surface soils, and subsurface sediments beneath both ocean and soil. The total carbon content of prokaryotes is between 60 and 100% of the total carbon found in plants, but the total nitrogen and phosphorus in prokaryotes is probably ten times that of plants. The bodies of humans and other animals harbour many microbes, but – important as the consequences for health and disease may be – as an overall reservoir we are a minor player.

• The oceans also contain viruses in very great abundance and variety. Most of these are uncharacterized. They have been called the ‘dark matter of the biosphere’. It is likely that viruses are an important mediator of gene transfer between marine prokaryotes.

The exploration of potential habitats by prokaryotes approaches saturation. Prokaryote cells divide actively. Production is estimated at 1.7 × 10^30 cells per year, the open ocean being the highest contributor. This fecundity gives prokaryotes the opportunity to evolve quickly. The resulting variety of prokaryotes includes the colonists of inhospitable habitats such as hot springs and very salty lakes. It also includes almost continuous local variations, adaptations to microniches (Table 6.2).

Major types of prokaryotes

C. Woese divided prokaryotes into archaea and bacteria, on the basis of 16S rRNA gene sequences. Figure 6.1 shows the secondary structures of a region within the 16S rRNA that differs in bacteria, archaea, and eukaryotes. In context, Figure 6.2 shows the tertiary structure of this region within the full Escherichia coli 16S rRNA structure in the ribosome.
Numerous other differences between archaea and bacteria have subsequently emerged, involving genomic, structural, and metabolic features:

• some genes in archaea but none in bacteria contain introns;
• there are systematic differences in tRNA sequences between archaea and bacteria;
• enzymes involved in DNA replication, such as DNA polymerases and some of the tRNA synthetases involved in protein synthesis, differ between archaea and bacteria;

Table 6.2 Distribution of prokaryotic cells

Habitat                      Number of prokaryotic cells (×10^28)    Total carbon in prokaryotes (×10^15 g)
Ocean subsurface             355                                     303
Terrestrial subsurface       25–250                                  22–215
Soil                         26                                      26
Oceans, lakes, and rivers    12                                      2.2
Within all human bodies      0.00004

From: Whitman, W.B., Coleman, D.C., & Wiebe, W.J. (1998). Prokaryotes: the unseen majority. Proc. Natl. Acad. Sci. USA 95, 6578–6583.

Figure 6.1 Secondary structure patterns of regions of 16S rRNA that differ among (a) bacteria (Escherichia coli); (b) archaea (Methanococcus vannielii); and (c) eukaryotes (Saccharomyces cerevisiae). Dots represent individual residues. Lines indicate complementary base pairing. The systematic differences in the lengths of the helical regions and the constraints imposed by the complementarity contributed to the patterns that Woese detected in the alignment of the sequences and in the derived phylogenetic trees. (rRNA = ribosomal RNA)
These diagrams provide only a two-dimensional view. Figure 6.2 shows the actual three-dimensional structure of this region within the entire 16S rRNA structure of E. coli.

Figure 6.2 (a) Three-dimensional structure of 16S rRNA from the Escherichia coli ribosome [2AVY], showing the region of Figure 6.1(a) highlighted in red and blue. (b) Detailed structure of the region shown in Figure 6.1(a). Note that the acute angle between two helical regions drawn in the conventional representation of the secondary structure in the preceding figure is merely a drafting convention and does not correspond to the true three-dimensional structure.

• archaea but not bacteria contain DNA-associated proteins resembling histones;
• membranes of all cells contain phospholipids: compounds combining a glycerol molecule with long-chain organic molecules (see Figure 6.3); however:
– bacteria and eukaryotes build cell membranes from phospholipids containing D-glycerol; archaea use L-glycerol;
– the organic chains in bacteria and eukaryotes are fatty acids, typically 16–18 carbon atoms long, while archaea instead use polyisoprenes (the branching of the isoprene chains permits the formation of links between different phospholipids in archaeal membranes; this allows the membrane to develop a higher-order structure);
– bacteria and eukaryotes link the organic sidechains to glycerol with an ester linkage while archaea prefer an ether linkage;
• cell wall structures: bacterial but not archaeal cell walls contain peptidoglycan, a combination of sugar derivatives and peptides; and
• archaea and bacteria differ in their complement of metabolic pathways.

Do we know the root of the tree of life?

There is consensus that life on Earth began over 3.5 billion years ago. The earliest remaining evidence for cellular life has the form of microfossils, called stromatolites, from South Africa and Australia. These arose from cyanobacteria related to modern prokaryotes. What preceded them has not left physical remnants and can only be inferred from what traces they have left in contemporary molecular biology. It is widely believed that forms of life based on RNA – as both information archive and catalysts – existed before proteins took over the ‘executive branch’.
The name archaea suggests that they represent the oldest forms of life. However, there has been extensive gene transfer between archaea and bacteria. It is not possible to assign LUCA – the last universal common ancestor of all known life forms – to either the archaeal or bacterial branch of the evolutionary tree. It is thought that the two lineages split very soon after the origin of cellular life. A branch of the archaea, the Korarchaeota, may be the closest extant relatives of LUCA.

• C. Woese divided all living things into three groups: archaea, bacteria, and eukaryotes. Although we do not have reliable knowledge of the earliest events in life history, it is likely that archaea are closest to LUCA, the last universal common ancestor of us all.

[Figure 6.3 panels: archaeal membrane phospholipid (polyisoprene units, ether linkage, L-glycerol); bacterial membrane phospholipid (unbranched fatty acids, ester linkage, D-glycerol).]

Figure 6.3 The chemical structure of the cell membrane differs between archaea and bacteria. A phospholipid is a combination of glycerol (a three-carbon alcohol), a phosphate group, and long hydrocarbon moieties. In archaea, the hydrocarbons are terpenes, or polyisoprenes, attached to the glycerol with an ether linkage. (Double bonds are not shown for simplicity.) In bacteria, they are fatty acids, esterified with the glycerol. Glycerol has two mirror-image forms, and archaeal and bacterial membranes contain different enantiomers. In the glycerol moieties in the figure (shown in red), triangles indicate bonds to groups in front of the central carbon, whereas broken lines indicate bonds to groups farther away than the central carbon. These differences must have correlates in the genomes, which contain genes that encode alternative sets of synthetic enzymes to produce these structures.

Archaea

The first archaea discovered lived at high temperatures near sea-floor hydrothermal vents or in lakes containing very high concentrations of salt, such as the Dead Sea. However, not all archaea are adapted to extreme environments. Indeed, there is some evidence that mesophilic archaea came first and that thermophiles were a later adaptation. Conversely, not all thermophiles are archaea; Thermus aquaticus (the source of Taq polymerase, an enzyme in common use for polymerase chain reaction amplification of DNA) is a bacterium.
Archaea are an abundant component of life in the open ocean, making up ∼20% of all marine microbes. They also associate with a variety of metazoan hosts.
The major groupings of archaea (Figure 6.4) are as follows.

• Crenarchaeota. Many but not all of these are thermophiles. They include Sulfolobus and Thermoproteus.
• Euryarchaeota. These include methanogens, sulphate reducers, and many extreme halophiles, thermophiles, and acidophiles, including:
– Halobacterium salinarum, which can grow in salt concentrations above 4 M! Many people find its photosynthetic abilities even more interesting: H. salinarum contains a bacteriorhodopsin with which it captures sunlight energy as ATP without involving chlorophyll.
– Picrophilus torridus, an extreme acidophile first isolated from the sulphurous volcanic springs of northern Japan. It can grow at pH 0.7!

[Figure 6.4 tree labels. Euryarchaeota: Halobacterium, Halococcus, Natronococcus, halophilic methanogen, marine Euryarchaeota, Archaeoglobus, Methanobacterium, Methanocaldococcus, Methanothermus, Methanosarcina, Methanospirillum, Methanopyrus, Thermoplasma, Ferroplasma, Picrophilus, Thermococcus, Pyrococcus. Crenarchaeota: marine Crenarchaeota, Sulfolobus, Pyrodictium, Thermoproteus, Desulfurococcus. Korarchaeota.]

Figure 6.4 Phylogenetic tree of archaea, based on analysis of 16S rRNA sequences. Major archaeal groupings are coloured as follows:
Euryarchaeota: Archaeoglobali, Halobacteria, Methanobacteria, Methanococci, Methanomicrobia, Methanopyri, Thermococci, Thermoplasmata
Crenarchaeota: Desulfurococcales
Korarchaeota
Nanoarchaeota (not shown)

BOX 6.1 Methanogens as sources of greenhouse gas emission: the case of New Zealand

New Zealand is home to 4 million people, 10 million cattle, and 45 million sheep. The sheep and cattle host methanogenic archaea in their stomachs to help to digest fodder.
In the USA and European countries, animals make only a relatively small contribution to greenhouse gas emissions. In contrast, in New Zealand, ruminant-produced methane accounts for approximately half of the country’s total greenhouse gas production. When New Zealand signed the Kyoto protocol, the government proposed to tax farmers to the tune of $NZ11 per ton of carbon emitted. (At that time $NZ1 ≈ UK£0.47 ≈ $US0.76.) This would amount to an annual charge of about $NZ0.09 per sheep and $NZ0.72 per cow.
The proposal met determined resistance from the pastoral community, some of it couched in surprisingly ribald terms. The New Zealand government ultimately abandoned the idea. Research into the effects of different fodders on internal flora and even antibiotics specifically targeting archaea is now under way.

– Methanogens. These are strict anaerobes, dependent on the reaction:

CO2 + 4H2 → CH4 + 2H2O

Methanogenic archaea live in the guts of ruminant animals and help to digest cellulose. Cellulases hydrolyse plant fodder to simple sugars, from which CO2 and H2 are produced by fermentation. A cow can produce hundreds of litres of methane per day! (See Box 6.1.)
• Korarchaeota. These were discovered by environmental sampling of a hot spring in Yellowstone National Park, Wyoming, in the western USA. They are perhaps closest to the root of all archaea.
• Nanoarchaeota. These have been identified as a single small (∼400 nm diameter) hyperthermophile from a submarine hot vent.

The last two phyla are minor, at least in terms of our current knowledge of them.

The genome of Methanococcus jannaschii

The microorganism Methanococcus jannaschii was collected from a hydrothermal vent 2600 m deep off the coast of Baja California, Mexico, in 1983. It is a thermophilic organism, surviving at temperatures from 48 to 94°C, with an optimum at 85°C. M. jannaschii is capable of self-reproduction from inorganic components. Its overall metabolic reaction is the synthesis of methane from H2 and CO2. It is a strict anaerobe.

• Hydrothermal vents are underwater volcanoes emitting hot lava and gases through cracks in the ocean floor. They create niches for living communities disconnected from the surface; these use minerals from the vent as nutrients. These communities of microorganisms, and some animals, are the only forms of life not dependent on sunlight, directly or indirectly, for their energy.

The genome of M. jannaschii was sequenced in 1996 by The Institute for Genomic Research (TIGR). It was the first archaeal genome sequenced. It contains a large chromosome with a circular double-stranded DNA molecule 1 664 976 bp long and two extrachromosomal elements of 58 407 and 16 550 bp. There are 1738 predicted coding regions, of which 1682 are on the chromosome and 44 and 12 are on the large and small extrachromosomal elements, respectively. Some RNA genes contain introns. As in other prokaryotic genomes, there is little non-coding DNA.
M. jannaschii would appear to satisfy Luria’s goal of finding our most distant extant relative. Comparison of its genome sequence with others shows that it is distantly related to other forms of life. Only 38% of the open reading frames could be assigned a function on the basis of homology to proteins known from other organisms. However, to everyone’s great surprise, archaea are in some ways more closely related to eukaryotes than to bacteria! They are a complex mixture. In archaea, proteins involved in transcription, translation, and regulation are more similar to those of eukaryotes. Archaeal proteins involved in metabolism are more similar to those of bacteria.

Life at extreme temperatures

Organisms from the three major divisions of life – archaea, bacteria, and eukaryotes – show wide, overlapping ranges of optimal growth temperatures (see Figure 6.5). To survive at elevated temperatures, thermophiles and hyperthermophiles must synthesize molecules that are stable to heat denaturation. Accordingly, adaptations to high-temperature survival might be observed in DNA, in RNA, and in proteins. Enzymes from hyperthermophiles have applications in laboratory molecular biology and in industry.

[Figure 6.5 axes: growth temperature from −30 to 110°C, with bars for eukaryotes, bacteria, and archaea, and bands labelled psychrophiles, mesophiles, thermophiles, and hyperthermophiles.]

Figure 6.5 ‘Some like it hot, some like it cold . . .’ Distribution of known growth temperatures of eukaryotes, archaea, and bacteria. Ranges defining hyperthermophiles, thermophiles, mesophiles, and psychrophiles are approximate.

What choices do organisms have to adjust the thermal stability of their constituents?

Thermophiles can evolve proteins with enhanced stability. Moreover, any favourable amino acid sequence is compatible with many gene sequences. In principle, organisms can take advantage of the redundancy of the genetic code to adjust the thermal stability of their nucleic acids. For double-stranded DNA and RNA, the thermal stability increases linearly with the G+C content. (However, this is only one aspect of the application of the redundancy in the code. See Box 6.2.)

BOX 6.2 Codon usage patterns

General observations about codon distributions are:
• Different genomes show different codon usage patterns.
• The variation in codon usage pattern among different genes within the genome of one species is less than the variation between species.
• Codon preference pattern tends to be preserved in closely related species, but diverges as species diverge. The similarity of the pattern in archaeal and bacterial thermophiles is evidence that codon usage patterns can be determined by selection.
• Within genomes, highly expressed proteins show stronger bias in codon usage. Genes for highly expressed proteins are enriched in sets of ‘preferred’ codons, and these preferred codons are often correlated with greater tRNA abundances. Matching the pattern of codon usage and tRNA abundance can make protein synthetic throughput higher.

Singer & Hickey* compared genome sequences from 40 prokaryotes. The organisms included eight archaeal mesophiles and thermophiles and 32 bacterial mesophiles and thermophiles, with optimal growth temperatures ranging from 18 to 97°C. This permitted a study focusing on adaptations to high temperature, not biased by archaeal–bacterial differences. Their results show that:

• DNA. Perhaps surprisingly, overall genomic G+C content is not correlated with growth temperature. However, a correlation with growth temperature is observed in the distribution of dinucleotides, with thermophiles, mesophiles, and psychrophiles showing different characteristic patterns. Thermophiles and hyperthermophiles stabilize their DNA structures, not by increasing the G+C content, but by tight binding of special ligands.
• RNA. The G+C content of non-protein coding RNA is correlated with growth temperature, especially in double-stranded regions. High G+C content in double-stranded regions enhances their thermostability. Single-stranded RNAs, including messenger RNAs (mRNAs), are relatively rich in purines, notably adenine, for reasons that are not clear. Although the overall G+C content of DNA is not correlated with growth temperature, codon usage patterns are. Coding sequences in thermophiles are enriched relative to mesophiles in synonymous codons ending in C or G.
• Proteins. Comparisons of homologues show that proteins from thermophiles and hyperthermophiles:
– tend to be shorter than their homologues from mesophiles, most of the residues lost coming from surface loops;
– have more charged residues at their surfaces, both positive and negative (Arg, Lys, His, Asp, Glu); the formation of stabilizing salt bridges is a common feature of their structures (see Figure 6.6);

* Singer, G.A. & Hickey, D.A. (2003). Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene 317, 39–47; Hickey, D.A. & Singer, G.A. (2004). Genomic and proteomic adaptations to growth at high temperature. Genome Biol. 5, 117.

Figure 6.6 Proteins from thermophiles and hyperthermophiles are enriched in salt bridges relative to their mesophilic homologues. Positively charged sidechains are shown in blue. Negatively charged sidechains are shown in red. (a) Subunit of glutamate dehydrogenase from the mesophilic bacterium Clostridium symbiosum [1HRD]. (b) Subunit of glutamate dehydrogenase from the hyperthermophilic archaeon Pyrococcus furiosus [1GTM]. (c) Sequence alignment of the hyperthermophilic and mesophilic glutamate dehydrogenase subunits.

Glutamate dehydrogenase
                               10        20        30        40        50        60
                                |         |         |         |         |         |
Clostridium symbiosum  SKYVDRVIAEVEKKYADEPEFVQTVEEVLSSLGPVVDAHPEYEEVALLERMVIPERVIEF
Pyrococcus furiosus    ____ADPYEIVIKQLERAAQYMEISEEALEFLK___________________RPQRIVEV
                                 V K            EE L  L                     P R  E

                               70        80        90       100       110       120
                                |         |         |         |         |         |
Clostridium symbiosum  RVPWEDDNGKVHVNTGYRVQFNGAIGPYKGGLRFAPSVNLSIMKFLGFEQAFKDSLTTLP
Pyrococcus furiosus    TIPVEMDDGSVKVFTGFRVQHNWARGPTKGGIRWHPEETLSTVKALAAWMTWKTAVMDLP
                         P E D G V V TG RVQ N A GP KGG R  P   LS  K L      K     LP

                              130       140       150       160       170       180
                                |         |         |         |         |         |
Clostridium symbiosum  MGGAKGGSDFDPNGKSDREVMRFCQAFMTELYRHIGPDIDVPAGDLGVGAREIGYMYGQY
Pyrococcus furiosus    YGGGKGGIIVDPKKLSDREKERLARGYIRAIYDVISPYEDIPAPDVYTNPQIMAWMMDEY
                        GG KGG   DP   SDRE  R         Y  I P  D PA D          M   Y

                              190       200       210       220       230       240
                                |         |         |         |         |         |
Clostridium symbiosum  RKIVGG__FYNGVLTGKARSFGGSLVRPEATGYGSVYYVEAVMKHEN_DTLVGKTVALAG
Pyrococcus furiosus    ETISRRKTPAFGIITGKPLSIGGSLGRIEATARGASYTIREAAKVLGWDTLKGKTIAIQG
                         I        G  TGK  S GGSL R EAT  G  Y      K    DTL GKT A  G

                              250       260       270       280       290       300
                                |         |         |         |         |         |
Clostridium symbiosum  FGNVAWGAAKKL_AELGAKAVTLSGPDGYIYDPEGITTEEKINYMLEMRASGRNKVQDYA
Pyrococcus furiosus    YGNAGYYLAKIMSEDFGMKVVAVSDSKGGIYNPDGLNADEVLKWKNEHGS_____VKDFP
                        GN     AK      G K V  S   G IY P G    E      E        V D

                              310       320       330       340       350       360
                                |         |         |         |         |         |
Clostridium symbiosum  DKFGVQFFPGEKPWGQKVDIIMPCATQNDVDLEQAKKIVANNVKYYIEVANMPTTNEALR
Pyrococcus furiosus    ___GATNITNEELLELEVDVLAPAAIEEVITKKNADNIKA___KIVAEVANGPVTPEADE
                          G      E      VD   P A         A  I A   K   EVAN P T EA

                              370       380       390       400       410       420
                                |         |         |         |         |         |
Clostridium symbiosum  FLMQQPNMVVAPSKAVNAGGVLVSGFEMSQNSERLSWTAEEVDSKLHQVMTDIHDGSAAA
Pyrococcus furiosus    ILFEKG_ILQIPDFLCNAGGVTVSYFEWVQNITGYYWTIEEVRERLDKKMTKAFYDVYNI
                        L         P    NAGGV VS FE  QN     WT EEV   L   MT

                              430       440       450
                                |         |         |
Clostridium symbiosum  AERYGLGYNLVAGANIVGFQKIADAMMAQGIAW_
Pyrococcus furiosus    AKEK__NIHMRDAAYVVAVQRVYQAMLDRGWVKH
                       A            A  V  Q    AM   G


– contain relatively fewer uncharged polar residues (Ser, Thr, Gln, Asn, Cys); some of these have thermolabile sidechains (His, Gln, Thr); and
– contain higher proportions of hydrophobic β-branched residues (see Box 6.3 and Figure 6.7).

Also, hyperthermophiles have special ‘chaperones’ – proteins that assist in protein folding. It is likely that this is an adaptation to the challenge of high-temperature growth.

BOX 6.3 Effect of β-branched sidechains on protein stability

Proteins from thermophiles and hyperthermophiles are enriched in amino acids with β-branched sidechains. For example, compare leucine and isoleucine (see Figure 6.7). How does this help to achieve high-temperature stability?

Folding of a protein to a unique native state is a compromise. Attractive inter-residue interactions favour formation of a compact native state. However, the greater conformational freedom of the polypeptide chain in the denatured state favours unfolding. To stabilize the native state, the attractive interactions must ‘pay for’ the loss of conformational freedom.

Thermodynamically, this is expressed by the criterion for stability:

G_Native − G_Denatured = ΔG = ΔH − TΔS < 0

where ΔH is the enthalpy change, ΔS the entropy change, and T the absolute temperature.

• The enthalpy, H, represents the attractive inter-residue interactions. Attractive interactions lower the enthalpy and make H more negative. The enthalpy of the native state is lower than that of the denatured state:

ΔH = H_Native − H_Denatured < 0 favours folding.

• The entropy, S, represents the conformational freedom. In the denatured state, the protein molecules adopt many different possible conformations, whereas in the native state, the conformations of many degrees of freedom are fixed; thus, the entropy of the denatured state is higher than the entropy of the native state:

ΔS = S_Native − S_Denatured < 0 favours unfolding.

• Because systems (at constant temperature and pressure) come to an equilibrium state of minimum Gibbs free energy (G), a protein will form the native state if and only if a favourable ΔH overcomes an unfavourable ΔS:

ΔG = ΔH − TΔS < 0

Because the entropy term is weighted by T, it assumes relatively higher importance at higher temperatures.

Assuming that a sidechain buried in the compact interior of a folded protein has a unique conformation, its loss of conformational freedom upon folding depends on the freedom it has in the denatured state. The lower the freedom in the denatured state, the lower the loss of entropy upon adopting a unique conformation in the native state. Because of the higher degree of crowding of atoms around the Cα in β-branched sidechains, they have less conformational freedom than β-unbranched sidechains, even in the denatured state. Therefore, β-branched sidechains contribute less to the unfavourable entropy change upon folding than β-unbranched sidechains. This is true at all temperatures but is more significant at the higher temperatures at which proteins from thermophiles and hyperthermophiles have to form their native states, because of the factor T in the entropy term.

Leucine: CH3(δ)–CH(γ)(CH3)–CH2(β)–CH(α)(NH3+)–COO−
Isoleucine: CH3(δ)–CH2(γ)–CH(β)(CH3)–CH(α)(NH3+)–COO−

Figure 6.7 The amino acids leucine and isoleucine. The carbon atoms in the sidechain are labelled α, β, γ, and δ outwards from the carbon that will appear in the mainchain of a protein when the COO− and NH3+ groups of the amino acid form peptide bonds. Isoleucine is said to be β-branched because, looking outwards along the sidechain from the Cβ, there are two carbon substituents. The Cβ of leucine, in contrast, has only one carbon substituent, looking outwards along the sidechain. (Leucine is branched at the γ-carbon.)
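The argument of Box 6.3 can be made concrete with a short numerical sketch. The values of ΔH and ΔS below are hypothetical, chosen only to illustrate the trend, and the sketch assumes that ΔH and ΔS do not themselves vary with temperature, which is a simplification for real proteins. It evaluates ΔG = ΔH − TΔS at 25°C and 100°C for two cases that differ only in the size of the entropy loss on folding.

```python
def delta_g(delta_h, delta_s, temperature):
    """Free energy of folding, dG = dH - T*dS (kJ/mol).

    A negative value means the native state is favoured.
    """
    return delta_h - temperature * delta_s


# Hypothetical values for illustration only; they are not measurements.
DELTA_H = -400.0            # kJ/mol, favourable enthalpy of folding
DELTA_S_UNBRANCHED = -1.20  # kJ/(mol K), larger entropy loss on folding
DELTA_S_BRANCHED = -1.05    # kJ/(mol K), smaller entropy loss on folding


def entropy_saving(temperature):
    """Extra stability (kJ/mol) gained from the smaller entropy loss."""
    return (delta_g(DELTA_H, DELTA_S_UNBRANCHED, temperature)
            - delta_g(DELTA_H, DELTA_S_BRANCHED, temperature))


for t in (298.15, 373.15):  # 25 degrees C and 100 degrees C, in kelvin
    print(f"T = {t:.2f} K")
    print(f"  dG, larger entropy loss : {delta_g(DELTA_H, DELTA_S_UNBRANCHED, t):8.2f} kJ/mol")
    print(f"  dG, smaller entropy loss: {delta_g(DELTA_H, DELTA_S_BRANCHED, t):8.2f} kJ/mol")
    print(f"  stability gained        : {entropy_saving(t):8.2f} kJ/mol")
```

With these assumed numbers, both cases are stable at 25°C, but only the case with the smaller entropy loss remains stable at 100°C, and the stability gained grows in proportion to T.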

Comparative genomics of hyperthermophilic archaea: Thermococcus kodakarensis and Pyrococci

A hyperthermophilic archaeon, Thermococcus kodakarensis strain KOD1, was isolated from a hot sulphur spring (102°C, pH 5.8) on the shore of Kodakara Island, in the Ryukyu archipelago between Kagoshima in southwest Japan, and Okinawa (29° 12′ N, 129° 19′ E). T. kodakarensis KOD1 is a strict anaerobe, normally growing by reducing elemental sulphur to H2S.

General features of the genome

Fukui and co-workers reported the complete genome sequence of T. kodakarensis KOD1 (see Figure 6.8). The single, circular chromosome contains 2 088 737 bp, with a G+C content of 52 mole % (see Figure 6.8). A total of 2306 coding sequences were identified, with average length 833 bp, covering 92% of the genome. There are 46 genes for tRNA, two of which (for Trp and Met) contain introns.

Database searching suggested specific functions for half of the proteins (1165 out of 2306), and general functional classes for another 205. Of the proteins with known homologues, 240 are specific to the order Thermococcales. Of the remaining proteins, 261 appear to be unique to T. kodakarensis, as no homology to any other known protein was detectable.

Fifteen of the proteins are inteins, which catalyse the excision and splicing of intervening sequences after translation; i.e. the protein itself contains the self-splicing activity.

The genome contains numerous mobile elements, including four virus-related integrases and seven homologues of transposases – proteins that catalyse the movement of DNA segments around a genome. However, transposase activity was not observed, suggesting that these proteins have lost their function.

Molecular physiology of T. kodakarensis

By mapping the proteins of T. kodakarensis for which functions can be assigned, it is possible to reconstruct the metabolic and transport pathways of T. kodakarensis. (See Figure 6.9 for an overview.)

Comparative genomics of T. kodakarensis

The genome sequences of three close relatives of T. kodakarensis – Pyrococcus abyssi, Pyrococcus horikoshii, and Pyrococcus furiosus – allowed comparisons and descriptions of the evolutionary relationships between these archaea at the genome and protein levels.

Some of the differences are characteristic of the different genera – Thermococcus and Pyrococcus. These include the G+C content and the number of coding sequences (Table 6.3).

The loss of synteny between Pyrococcus and Thermococcus genera illuminates the origin of the difference in protein content (see Figure 6.10). The rearrangement among the Pyrococcus species themselves is substantial and involves a major inversion between P. abyssi and P. horikoshii. In contrast, between the genera, the genome has been completely shuffled and redealt. Indeed, there is no large contiguous region in the T. kodakarensis genome with no correspondence in pyrococci. This shows that the larger genome of T. kodakarensis is not the result of a recent horizontal transfer of a large block from a distant lineage.

Table 6.3 Characteristics of T. kodakarensis, Pyrococcus abyssi, Pyrococcus horikoshii, and Pyrococcus furiosus

Characteristic      T. kodakarensis   P. abyssi    P. horikoshii   P. furiosus

Genome size (bp)    2 088 737         1 765 118    1 738 505       1 908 256
G+C (mole %)        52.0              44.7         41.9            40.8
Coding sequences    2 306             1 784        2 065           2 065
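
The summary figures quoted for these genomes can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers given in the text and in Table 6.3; the function names are illustrative, and the coding-fraction estimate assumes that coding sequences do not overlap.

```python
# Genome sizes and numbers of coding sequences as quoted in Table 6.3.
genomes = {
    "T. kodakarensis": {"size_bp": 2_088_737, "cds": 2306},
    "P. abyssi": {"size_bp": 1_765_118, "cds": 1784},
    "P. horikoshii": {"size_bp": 1_738_505, "cds": 2065},
    "P. furiosus": {"size_bp": 1_908_256, "cds": 2065},
}

# Average coding-sequence length quoted in the text for T. kodakarensis.
MEAN_CDS_LENGTH_BP = 833


def coding_fraction(n_cds, mean_cds_bp, genome_bp):
    """Estimated fraction of the genome covered by coding sequences.

    Assumes coding sequences do not overlap one another.
    """
    return n_cds * mean_cds_bp / genome_bp


def cds_per_mb(n_cds, genome_bp):
    """Number of coding sequences per megabase of genome."""
    return n_cds / (genome_bp / 1e6)


tk = genomes["T. kodakarensis"]
fraction = coding_fraction(tk["cds"], MEAN_CDS_LENGTH_BP, tk["size_bp"])
print(f"T. kodakarensis coding fraction: {100 * fraction:.1f}%")

for name, g in genomes.items():
    print(f"{name:16s} {cds_per_mb(g['cds'], g['size_bp']):7.1f} coding sequences per Mb")
```

The estimated coding fraction for T. kodakarensis rounds to the 92% stated in the text.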
Figure 6.8 Diagram of the genome of T. kodakarensis strain KOD1. The contents of the consecutive circles, from the outermost, are:
(1) Scale in 0.2 Mb increments, plus the predicted origin of replication (oriC).
(2) Predicted protein-coding regions in clockwise direction.
(3) Predicted protein-coding regions in anticlockwise direction.
(4) Predicted tRNA coding regions in clockwise direction (red).
(5) Predicted tRNA coding regions in anticlockwise direction (blue).
(6) Predicted mobile elements in clockwise direction (red). Lines indicate transposase genes; boxes indicate virus-related regions.
(7) Predicted mobile elements in anticlockwise direction (blue). Lines indicate transposase genes; boxes indicate virus-related regions.
(8) G+C content (mol. %) in 10 kb window.
Circles (2) and (3) are colour coded according to function:

Functional category Colour

Translation, ribosomal structure, biogenesis Magenta

Transcription Pink
DNA replication, recombination, repair Pale pink
Cell division, chromosome partitioning Forest green
Post-translational modification, protein turnover, chaperones Yellow
Cell envelope biogenesis, outer membrane Light yellow
Cell motility, secretion Light green
Inorganic ion transport/metabolism Pale green
Signal transduction Medium turquoise
Energy production/conversion Purple
Carbohydrate transport/metabolism Light blue
Amino acid transport/metabolism Cyan
Nucleotide transport/metabolism Violet
Co-enzyme metabolism Pale turquoise
Lipid metabolism Medium purple
Secondary metabolites biosynthesis/transport/catabolism Light sky blue

From: Fukui, T., Atomi, H., Kanai, T., Matsumi, R., Fujiwara, S., & Imanaka, T. (2005). Complete genome sequence of the hyperthermophilic archaeon
Thermococcus kodakarensis KOD1 and comparison with Pyrococcus genomes. Genome Res. 15, 352–363.
[Figure 6.9: schematic of a T. kodakarensis cell, with predicted transporters and permeases drawn on the membrane and metabolic pathways (A)–(E) in the interior; the components are listed in the caption below.]

Figure 6.9 Reconstructed scheme of metabolism and solute transport in Thermococcus kodakarensis. Components or pathways for which no predictable enzymes could be assigned appear
in red.
Each gene product with a predicted function in ion or solute transport is illustrated on the membrane. The transporters and permeases are grouped by substrate specificity, as cations
(violet), anions (green), carbohydrates/carboxylates/amino acids (yellow), and unknown (grey).
Metabolic pathways appear in the interior of the cell: (A) glycolysis (modified Embden–Meyerhof pathway); (B) pyruvate degradation; (C) amino acid degradation; (D) sulphur reduction;
and (E) hydrogen evolution and formation of proton-motive force, coupled with ATP generation.
Abbreviations: DHAP, dihydroxyacetone phosphate; E-4-P, erythrose 4-phosphate; F-1,6-BP, fructose 1,6-bisphosphate; G-3-P, glyceraldehyde 3-phosphate; G-6-P, glucose 6-phosphate;
OAA, oxaloacetate; 2OG, 2-oxoglutarate; PEP, phosphoenolpyruvate; 3-PG, 3-phosphoglycerate; PRPP, 5-phosphoribosyl 1-pyrophosphate; R-5-P, ribose 5-phosphate; Ribu-1,5-BP,
ribulose 1,5-bisphosphate; ACS, acetyl-CoA synthetase (ADP-forming); AlaAT, alanine aminotransferase; Fbp, fructose 1,6-bisphosphatase; FDH, formate dehydrogenase; FNOR,
ferredoxin:NADP oxidoreductase; Gdh, glutamate dehydrogenase; IOR, indolepyruvate:ferredoxin oxidoreductase; KOR, 2-oxoacid: ferredoxin oxidoreductase; PflDA, pyruvate formate
lyase and its activating enzyme; Oad, oxaloacetate decarboxylase; Pck, phosphoenolpyruvate carboxykinase; POR, pyruvate:ferredoxin oxidoreductase; Pps, phosphoenolpyruvate synthase;
Pyk, pyruvate kinase; VOR, 2-oxoisovalerate: ferredoxin oxidoreductase; AppABCDF, ABC-type dipeptide/oligopeptide transporter; CbiMOQ, ABC-type Co2+ transporter; CorA, Mg2+/Co2+
transporter; CysAT, ABC-type sulphate transporter; EriC, voltage-gated Cl− channel protein; FbpABC, ABC-type Fe3+ transporter; FeoAB, Fe2+ transporter; FepBCD, ABC-type Fe3+-
siderophore transporter; GltT, H+/glutamate symporter; Kch, Ca2+-gated K+ channel protein; MalEFGK, ABC-type maltodextrin transporter; MnhB-G, multisubunit Na+/H+ antiporter;
ModABC, ABC-type Mo2+ transporter; NapA and NhaC, Na+/H+ antiporter; NatAB, ABC-type Na+ efflux pump; PitA, Na+/phosphate symporter; PstABCS, ABC-type phosphate transporter;
PutP, Na+/proline symporter; SnatA, small neutral amino acid transporter; TrkAH, Trk-type K+ transporter; ZnuABC, ABC-type Mn2+/Zn2+ transporter; ZupT, heavy metal cation transporter.
From: Fukui et al. (2005) (see Figure 6.8).

Figure 6.10 Arrangement of homologous segments in the genomes of Thermococcus kodakarensis, Pyrococcus abyssi, P. horikoshii, and P. furiosus. Genome lengths shown in the figure (bp): P. abyssi, 1 765 118; P. horikoshii, 1 738 505; P. furiosus, 1 908 256; T. kodakarensis, 2 088 737.
From: Fukui et al. (2005) (see Figure 6.8).
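
The contrast between a single large inversion and a complete reshuffling of gene order can be illustrated with toy data. The sketch below is not the method used by Fukui et al.; it is a minimal illustration that assumes each genome is given as an ordered list of shared gene identifiers (invented here), treats the chromosome as linear for simplicity, and scores synteny as the fraction of neighbouring gene pairs that remain neighbours in the other genome.

```python
def conserved_adjacencies(order_a, order_b):
    """Fraction of adjacent gene pairs in order_a that are also adjacent
    (in either orientation) in order_b.

    Genes are identified by shared labels; circularity is ignored.
    """
    position_b = {gene: i for i, gene in enumerate(order_b)}
    pairs = list(zip(order_a, order_a[1:]))
    kept = sum(
        1
        for g1, g2 in pairs
        if g1 in position_b
        and g2 in position_b
        and abs(position_b[g1] - position_b[g2]) == 1
    )
    return kept / len(pairs)


# Hypothetical gene orders, for illustration only.
reference = list(range(10))
one_inversion = [0, 1, 2, 6, 5, 4, 3, 7, 8, 9]  # genes 3-6 inverted
shuffled = [5, 0, 8, 3, 9, 1, 6, 2, 7, 4]       # order fully scrambled

print(conserved_adjacencies(reference, one_inversion))
print(conserved_adjacencies(reference, shuffled))
```

An inversion breaks only the two adjacencies at its endpoints, so most neighbour relationships survive, whereas in the scrambled order none does.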

Bacteria

Bacteria form the other division of prokaryotes (Figure 6.11). Bacteria have been known for much longer than archaea (A. van Leeuwenhoek discovered bacteria in 1676). In consequence, bacterial taxonomy bears a considerable load of historical baggage. It has required genomes to sort out the phylogenetic relationships. However, the genomes also show large amounts of horizontal gene transfer that preclude any neat solution in terms of a simple phylogenetic tree.

Figure 6.11 suggests one recent approach to classifying the main groups of bacteria. Box 6.4 gives some examples of better-known organisms in the different groups, with some brief comments. For a classification of bacteria focusing on pathogens, see http://www.microbialrosettastone.com/.

[Figure 6.11: tree diagram. Groups shown: α-, β-, γ-, δ-, and ε-Proteobacteria, Aquificales, Chlamydiae, Bacteroidetes, Chlorobium, Fibrobacter, Spirochaetes, Cyanobacteria, Chloroflexi, Deinococci, Thermus, Thermotogae, Actinobacteria, Fusobacteria, and Firmicutes, with Gram-negative and Gram-positive groupings marked.]

Figure 6.11 Phylogenetic tree of some major bacterial types. This diagram reflects the topology of the tree but not the extent of divergence between and within groups. Some of the groups are phyla, others are genera.
Reference: Bacterial (Prokaryotic) Phylogeny Webpage (2006) (http://www.bacterialphylogeny.com/index.htm).

Genomes of pathogenic bacteria

A bacterial pathogen is a bacterium that can cause disease. It can do so by virtue of having virulence factors, which may include toxins, surface proteins that mediate attachment to cells, defensive shields (proteins and carbohydrates), and secreted enzymes. In many cases, closely related strains or species differ in pathogenicity. This suggests that comparison of their genomes could help to identify virulence factors. Knowledge of virulence factors would permit: (a) testing of foods for presence of pathogenic strains, for instance of E. coli, (b) choosing suitable drug targets, and (c) designing vaccines.

Examples include:

E. coli. Because a non-pathogenic strain (K-12 MG1655) of E. coli occupies such a central role in molecular biology, it is easy to forget that it exists in nature, and that related strains are pathogenic. We discussed the E. coli genome in Chapter 4.

Strain E. coli O157:H7 causes haemorrhagic colitis, which can be fatal. A comparison of the genomes of this strain with that of K12 strain MG1655 shows
BOX 6.4 Characteristics of major groups of bacteria

Group | Examples | Comments

Firmicutes | Bacilli, staphylococci, lactobacilli, Clostridia | Listeria and staphylococci can be infectious; Clostridia can cause food poisoning; lactobacilli are useful in yoghurt production
Actinobacteria | Micrococcus, Streptomyces | Decompose dead plant material; source of antibiotics
Fusobacteria | Fusobacterium nucleatum | Live in human gut, involved in periodontal infection
Thermotogae | Thermotoga subterranea | Thermophilic or hyperthermophilic; some are anaerobic
Thermus | Thermus aquaticus | Thermophilic; source of Taq polymerase
Deinococci | Deinococcus radiodurans | D. radiodurans is unusually radiation resistant
Chloroflexi | Chloroflexus aurantiacus | Photosynthetic, but do not produce O2; may provide clue to early development of photosynthesis
Cyanobacteria | Prochlorococcus marinus | Chlorophyll-based photosynthesis; most split H2O and produce O2; gave rise to chloroplasts via symbiosis
Spirochaetes | Leptospira, Borrelia burgdorferi, Treponema pallidum | Some are pathogenic (leptospirosis, syphilis, Lyme disease)
Fibrobacters | Fibrobacter intestinalis | Live in gut; help cattle to digest cellulose
Chlorobium | Chlorobium tepidum | Green sulphur bacteria; photosynthetic: oxidize sulphide to sulphur
Bacteroidetes | Bacteroides fragilis | Some are marine plankton; others are anaerobic, live in the gut and can cause infection. Porphyromonas gingivalis causes gum disease
Chlamydiae | Chlamydia trachomatis | Grow intracellularly; major cause of blindness; also cause sexually transmitted infections of the urogenital system
Aquificales | Aquifex aeolicus | Extremophiles, autotrophs
ε-Proteobacteria | Helicobacter pylori | H. pylori lives in gut, cause of ulcers
δ-Proteobacteria | Desulfovibrio desulfuricans | Mostly aerobic; some anaerobic examples reduce sulphur or sulphate
α-Proteobacteria | Rhodospirillum rubrum, Rhizobium, Rickettsia | Rhizobium are symbiotic with legumes and fix nitrogen; Rickettsia cause typhus; gave rise to mitochondria via symbiosis
β-Proteobacteria | Burkholderia, Bordetella, Thiobacillus, Neisseria | Some live on inorganic nutrients; others are infectious, different species causing pertussis, gonorrhoea and meningitis
γ-Proteobacteria | Escherichia coli, Haemophilus influenzae, Pseudomonas aeruginosa, Yersinia pestis, Salmonella typhimurium | Important in medicine and molecular biology: cause enteritis, typhoid, bubonic plague, and others

that the genome of K-12 MG1655, 4 639 221 bp, is shorter than that of O157:H7, 5 528 445 bp. Regions amounting to 4.1 Mb are common to both strains. The unshared genes tend to cluster in strain-specific regions. Consistent with the difference in genome length, the strain-specific regions in K-12 MG1655 contain 0.53 Mb (528 genes), and those in O157:H7 contain 1.34 Mb (1387 genes).

It is likely that the strains diverged about 4.5 million years ago. A clue to the origin of the differences between the strains is the atypical base composition of the strain-specific regions. This suggests that they entered the respective genomes by horizontal gene transfer.

• Horizontal gene transfer is a common theme in development of virulence and antibiotic resistance.

Helicobacter pylori. Half the world’s population is infected with H. pylori. One out of 10 infected people develops clinical disease: gastritis, duodenal and gastric ulcers, and some cancers. Proof that H. pylori infection is the cause of ulcers was obtained – over the disbelief of the scientific and medical establishments at the time – by Barry Marshall, who swallowed a culture of H. pylori, and quickly developed symptoms of gastritis.

H. pylori strains are very diverse (they have been applied to tracking of patterns of human migration). Three strains have been sequenced completely. Strain 26695 contains about 1.7 Mbp, and about 1550 genes. Other sequenced strains differ by about 6%. Virulence appears to be associated with a common Cag 40 kb pathogenicity island containing >40 genes. The appearance of genes within this island is correlated with virulence. This pathogenicity island is common to many bacteria. It is likely that it has been circulated by horizontal gene transfer.

Staphylococcus aureus. S. aureus infections are a growing clinical problem because of the aggressive development of antibiotic resistance. (The development of resistance to vancomycin – the ‘antibiotic of last resort’ – is discussed in Chapter 9.)

The genomics of S. aureus has been pursued vigorously, in order to identify the mechanisms of development and spread of resistance. The S. aureus genome is about 2.8–2.9 Mb long. Assignment of approximately 2600 open reading frames accounts for almost 85% of the genome. There is a single plasmid containing about 25 000 bp. Genes for enhanced antibiotic resistance are encoded by a transposon inserted into the plasmid. Comparison of the sequences – in particular, observation of lack of synteny – has made it clear that the development of methicillin resistance was not a single event, producing a clone that was subsequently selected. Instead, the resistance elements were acquired many times by many strains, via horizontal gene transfer.

A comparison of the sequences of many S. aureus strains, encompassing different clinical phenotypes, showed that 78% of genes were common to all strains, including isolates from cow and sheep. The remaining 22%, which are at least partially strain-specific, tend to be localized within 18 large regions of difference (RDs), ranging from 3 to 50 kb long. Figure 6.12 shows the presence of these regions in the different strains, and the correlation of the pattern with methicillin resistance.

Genomics and the development of vaccines

Genomics and recombinant technology have made possible a new generation of approaches to vaccine design.

A vaccine against hepatitis B virus is expressed in yeast cells. It is a surface antigen, a viral envelope protein, the gene for which was cloned into yeast. A vaccine against Bordetella pertussis (the causative agent of whooping cough) is based on the toxin, a multi-subunit protein. By genetic engineering, the molecule was completely detoxified by introduction of mutations, which removed the enzymatic activity but left the immunological properties intact. That is, antibodies raised against the detoxified form protect against the native protein.

A more general approach to vaccine design involves comparing the genome sequences of pathogenic and nonpathogenic strains to identify virulence factors that might serve as the basis of vaccines.

Neisseria meningitidis serogroup B is the major cause of meningitis and septicaemia in children and young adults. From the 2 272 351 bp genome sequence, computational methods predicted 2158 genes. Algorithms predicted that 600 of them would be on the cell surface or secreted. These were candidates for vaccines. Of these, 350 were expressed in E. coli, and tested in mice for an immune response that produced
[Figure 6.12: grid showing, for each S. aureus isolate (MSA-numbered isolates, RF122, and COL), the presence or absence of each of the regions of difference RD1–RD18.]

Figure 6.12 Results of comparison of sequences of 36 isolates of S. aureus. RD = regions of difference, localized segments of the
genome of high variability among strains. Filled squares indicate RD present; empty squares, RD absent. Hatched squares correspond
to methicillin-resistant strains. Red indicates isolates of electrophoretic type 234, the predominant type causing toxic shock syndrome.
From: Fitzgerald, J.R., Sturdevant, D.E., Mackie, S.M., Gill, S.R., & Musser, J.M. (2001). Evolutionary genomics of Staphylococcus aureus: Insights into
the origin of methicillin-resistant strains and the toxic shock syndrome epidemic. Proc. Nat. Acad. Sci. USA 98, 8821–8826. Reproduced by permission.

The future of antibiotic development

The development of resistant strains of pathogens presents a severe challenge to medicine. There is consensus that novel antibiotics will be needed. Some are already in the ‘pipeline’, currently in the clinical testing phase. However, the research that produced today’s ‘new’ antibiotics was initiated in the early 1990s and pharmaceutical companies are reducing their emphasis on antibiotic research. It is likely that fewer new antibiotics will emerge in the current industrial climate. The paradox of a growing need for new discoveries coupled with the reduction in resources aimed at generating them creates a problem that may become a crisis.

The novel developments associated with genomics, the subject of this book, can in principle contribute to the development of antibiotics. It is possible to identify targets – specific proteins, essential for a pathogen, that differ from mammalian proteins sufficiently to suggest that drugs against these proteins would be effective against the pathogen but non-toxic to mammals. Unlike classical antibiotic research practice, experimental methods are now available that can define the mechanism of action of a drug while it is still under development.

A related prospect is to turn to biology as well as chemistry to discover new therapeutic agents, including revival of an old suggestion of using bacteriophages clinically.

A number of small biotech companies have started up, funded by venture capital, to try to explore a number of non-traditional avenues. Given the decade ‘lead time’ required for a new drug to make its way from the laboratory to approval in clinical use, the process must be set in motion immediately. A problem with this approach is a line of recent US court decisions that impose stricter criteria on the patentability of procedures that might be considered ‘natural processes’.

The problems are multidimensional, involving science, economics, long-term forecasting, and regulatory and patent law. Each field presents an individual set of difficulties. These difficulties are compounded by the necessity to solve them all simultaneously in the face of both genuine conflicts between different goals, and boundaries between professions that impede communication and cooperation. There is, however, consensus that the problems must be solved.
bactericidal antibodies. In order to achieve a vaccine that covered as many strains as possible, surveys of different strains for sequence variability of these candidate vaccines revealed which ones were relatively constant. The results of the work are promising candidates for vaccines now under development.

• Our descendants may well look back at the second half of the 20th century as a narrow window during which bacterial infections could be controlled, and before and after which they could not.

Metagenomics: the collection of genomes in a coherent environmental sample

Classically, microbiologists studied prokaryotes by growing them in culture, isolating pure strains for detailed study. Powerful as the methods were, and useful as they were for clinical applications and research, they were also blinders that prevented full appreciation of the variety and interactions of species in natural environments. DNA sequencing has made it possible to:

• clarify evolutionary relationships;
• use high-throughput sequencing methods to study a cross-section of the life in a natural sample;
• study the majority of strains that are difficult to grow in culture; and
• appreciate the relationships and interactions among different species that share an ecosystem.

• A millilitre of ocean water may contain 100–200 species. A gram of soil may contain 4000.

From natural samples containing complex mixtures, it is possible to amplify and determine sequences directly, without culturing individual strains. The molecule of choice has been 16S rRNA. This is partly because of its traditional role as a molecule that varies at the appropriate rate to distinguish ancient phylogenetic branching patterns. In addition, rRNA is not very prone to horizontal gene transfer. It thereby preserves the distinctions between taxa – perhaps, however, this disguises the mixing that has taken place with other genes. Another disadvantage of characterizing an organism by its rRNA is that rRNA does not reveal any details of the metabolism or other adaptations of the species.

An example of metagenomics is the sequencing of 16S rRNA genes from ocean water from the Sargasso Sea. A group led by J.C. Venter sequenced 109 non-redundant regions. Many novel sequences were found, although it is difficult to assemble complete genomes and avoid chimaeras.

Marine cyanobacteria – an in-depth study

The basic concepts of genome evolution are secure: organisms explore genome variations. Adaptations drive some changes – via divergence, allelic redistribution within a population, gene loss, or gene acquisition by horizontal gene transfer. Neutral genetic drift accounts for other changes, especially in small populations.

What is more difficult to understand is how populations make choices and adopt strategies. If organisms encounter environments varying in space and/or time, into how many populations, or even new species, will they split? Which proteins will diverge – in sequence, in function, or in expression pattern? What novel genes are needed and where will they come from?

In most field situations, the ‘topology’ of evolutionary space depends on a complicated interaction of physical and ecological variables, such as geographic barriers imposed by landscape, climate, and interspecies cooperation or competition. Complex environments give rise to complex biological communities.

The distribution of cyanobacteria in the open oceans in temperate regions offers a relatively simple ecological context. The distributions – both of environmental features and of species or strains – depend on a single variable, depth. There are many correlates of depth: light intensity and quality, temperature, pressure, ultraviolet light penetration, nutrient availability (notably, sources of nitrogen and iron), and the occurrence of predators and viruses. Yet, for all its complexity, the system is one-dimensional.

BOX 6.5 Prochlorococcus and Synechococcus genomes

Feature                  Prochlorococcus   Prochlorococcus   Synechococcus
                         strain MED4       strain MIT9313    strain WH8102

Preferred light level    High              Low
Length (bp)              1 657 990         2 410 873         2 434 428
G+C (mol%)               30.8              50.7              59.4
Protein coding (%)       88                82                85.6
Protein coding genes     1 716             2 273             2 526
RNA genes                40                51                44
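
The entries in Box 6.5 can be turned into comparable per-genome quantities. The sketch below is illustrative only: it uses the lengths, coding percentages, and gene counts from the box, and the derived mean coding length per gene assumes that the coding percentage refers to non-overlapping protein-coding sequence. The dictionary keys and function names are introduced here for illustration.

```python
# Values copied from Box 6.5.
strains = {
    "Prochlorococcus MED4": {"length_bp": 1_657_990, "coding_pct": 88.0, "genes": 1716},
    "Prochlorococcus MIT9313": {"length_bp": 2_410_873, "coding_pct": 82.0, "genes": 2273},
    "Synechococcus WH8102": {"length_bp": 2_434_428, "coding_pct": 85.6, "genes": 2526},
}


def genes_per_kb(entry):
    """Protein-coding genes per kilobase of genome."""
    return entry["genes"] / (entry["length_bp"] / 1000)


def mean_coding_length_bp(entry):
    """Approximate mean coding length per gene, assuming no overlap."""
    return entry["length_bp"] * entry["coding_pct"] / 100 / entry["genes"]


for name, entry in strains.items():
    print(
        f"{name:24s} {genes_per_kb(entry):.3f} genes/kb, "
        f"about {mean_coding_length_bp(entry):.0f} bp coding per gene"
    )
```

On these figures the small MED4 genome is not more sparsely populated with genes than the others: it has fewer genes in total, at a density of roughly one gene per kilobase.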

The major populations of cyanobacteria in temperate and tropical oceans belong to two related genera, Prochlorococcus and Synechococcus. They are responsible for a significant fraction of worldwide photosynthesis. Prochlorococcus is believed to have diverged from Synechococcus fairly recently.

Ocean environments are stratified. Studies of the distribution of Prochlorococcus ecotypes in a vertical column in the Sargasso Sea reveal a division into two types of strain. Closer to the surface than about 130 m depth, the majority of Prochlorococcus strains are adapted to high light levels. Strains prevalent below 130 m depth are adapted to low light levels (see Figure 6.13).

• Prochlorococcus strains are adapted to different ambient light levels, which decrease with increasing depth below the ocean surface. High light levels: ≥200 µmol photons m−2 s−1. Low light levels: ≤30–50 µmol photons m−2 s−1.

Rocap and co-workers compared the genomes of high-light- and low-light-adapted Prochlorococcus strains with Synechococcus* (see Box 6.5). Prochlorococcus strain MIT9313 is adapted to growth in low light intensity. Both its distribution with depth (Figure 6.13) and the general features of its genome are similar to Synechococcus. In contrast, the high-light-adapted strain MED4 is ‘lean and mean’: its genome is unusually small and encodes fewer proteins. Prochlorococcus MED4 is the smallest known organism that generates oxygen.

* Rocap et al. (2003). Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation. Nature 424, 1042–1047.

[Figure 6.13: plot of depth (m), from 0 to 500, against percentage of total sequences, from 0 to 20, with curves for Synechococcus, high-light-adapted Prochlorococcus, and low-light-adapted Prochlorococcus.]

Figure 6.13 Distribution with depth of three types of cyanobacteria: Synechococcus, Prochlorococcus strains adapted to high-light-intensity habitats, and Prochlorococcus strains adapted to low-light-intensity habitats. The appearance of large amounts of low-light-adapted Prochlorococcus at 200 m, below the level where it seemed to have virtually disappeared, is puzzling. The broken line from 200–500 m depth is interpolated; measurements have been made at 200 and 500 m but not in between.
After: DeLong, Preston & Mincer, et al. (2006). Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311, 496–503.

BOX 6.6 Cyanobacterial photosynthesis

The photosynthetic apparatus of cyanobacteria contains large chromophore-containing macromolecular
complexes:

• Two coupled photosystems that carry out energy transduction – the capture of light energy. These
are called PSI and PSII.
• Antenna pigments that make light harvesting more efficient by absorbing light and transferring the
excitation energy to the reactive chlorophylls.

Approximately half of the proteins of Synechococcus are common to all three species: 1314 out of
2526. Only 38 genus-specific proteins appear in both Prochlorococcus strains but not in
Synechococcus. Many of these 38 proteins are involved in the synthesis of the light-harvesting
complex of Prochlorococcus, which has a structure unusual among bacteria (see Box 6.6).

Within the Prochlorococcus genus, many genes are strain specific: MIT9313 has 923 proteins that do
not appear in MED4 (about half of these do appear in Synechococcus). This, together with the
observation that the MED4 genome and proteome are substantially smaller, implies that many genes
have been lost in the differentiation of the strains.

How are different Prochlorococcus strains adapted to differences in ambient light intensities and
spectral distributions?

Effective interactions with light require both efficient energy transduction and protection from
photochemical damage caused by excitation energy spillover.

• Antenna complexes of photosystem II (PSII) of most cyanobacteria, including Synechococcus,
contain protein complexes called phycobilisomes. Prochlorococcus is unusual among cyanobacteria in
using, as PSII antennae, proteins binding unusual modified (divinyl) chlorophylls, called Pcb
proteins.
Where did the Pcb proteins come from? They appear to have been recruited from a family called
iron-stress-inducible (ISI) proteins (see Weblem 6.4). ISI proteins are expressed in cyanobacteria
under conditions of low iron concentrations. They provide an alternative to the iron-rich protein
ferredoxin in the electron transport chain.
• High-light-adapted and low-light-adapted strains differ in the relative amounts of chlorophyll a2
and b2. High-light-adapted strains have mostly chlorophyll a2; low-light-adapted strains have more
chlorophyll b2, which absorbs optimally in the blue region of the spectrum. This is an appropriate
match to the colour of the ambient light below the ocean surface. The ratio of concentrations of
chlorophylls a2 and b2 can vary with habitat conditions. In the low-light-adapted strain MIT9313,
the ratio can change by at least a factor of two, producing more chlorophyll a2 at higher light
intensities. The chlorophyll a2/b2 ratio in the high-light-adapted strain MED4 is less sensitive to
ambient light intensity. The nature of the control mechanism and the set of proteins that change
expression patterns are not yet fully understood.

Another adaptation for ‘scavenging’ light in dark environments is to increase the number of genes
for light-harvesting proteins. Low-light-adapted strains have more copies of the genes that code for
the chlorophyll-binding antenna protein Pcb. In addition, MED4 contains a phycoerythrin, an antenna
protein that binds a chromophore absorbing green light.

Protection against photochemical damage

Ultraviolet light can damage DNA. Major products include thymine dimers, the linkage of adjacent
thymine residues in DNA. In response to the threat of mutation, cells contain repair enzymes,
including photolyase, which recovers thymines from dimers. Because ultraviolet light does not
penetrate far into sea water, Prochlorococcus strain MED4, living nearer the surface, is in greater
danger from photochemical damage than MIT9313. Indeed, MED4, but not MIT9313, contains a gene for
photolyase. Another difference between the strains is also probably related to photo-oxidative
stress: MED4 contains perhaps twice as many high-light-inducible proteins as MIT9313. From their
distribution in the genome, some of these appear to have arisen by recent duplication events.

• ‘Mad dogs and Englishmen go out in the midday sun,’ sang Noël Coward. Plants and
ocean-surface-dwelling cyanobacteria do too – they have no choice.

Utilization of nitrogen sources

In the ocean, the prevalent form of nitrogen near the surface is ammonium, produced in part by
fixation of atmospheric nitrogen. In deep waters, nitrate is more common.* Nitrate is produced by
degradation of dead organic matter: dead organisms sink.

Different organisms assimilate nitrogen from molecular nitrogen, nitrate (NO3−), nitrite (NO2−), or
ammonium ion (NH4+).

The two Prochlorococcus strains differ in their ability to assimilate nitrogen from different
sources (see Box 6.7). There has been a successive loss in nitrogen-assimilating ability with the
divergence from Synechococcus to low-light-adapted Prochlorococcus (deeper-living) to
high-light-adapted Prochlorococcus (living nearer the surface). Synechococcus has active nitrate and
nitrite reductase genes and can use either ion, as well as ammonium, as a nitrogen source. The
low-light-adapted Prochlorococcus MIT9313 has lost nitrate reductase but retains nitrite reductase;
it can use nitrite but not nitrate as a nitrogen source. The high-light-adapted Prochlorococcus MED4
has neither reductase and must get its nitrogen from ammonium or from other reduced nitrogen
compounds such as amino acids.

BOX 6.7 Assimilation of nitrogen

Reaction            Enzyme                  Synechococcus   Prochlorococcus   Prochlorococcus
                                            WH8102          MED4              MIT9313

N2 → NH4+           Nitrogenase             Absent          Absent            Absent
NO3− → NO2−         Nitrate reductase       Present         Absent            Absent
NO2− → NH4+         Nitrite reductase       Present         Absent            Present
NH4+ → glutamine    Glutamine synthetase    Present         Present           Present

Protection against predators and viruses

Other cellular life forms graze on Prochlorococcus and Synechococcus strains. Viruses infect them.
Defensive adaptations involve genes encoding proteins involved in the synthesis of
lipopolysaccharides and polysaccharides, which form the basis of cell surface recognition. Both
strains of Prochlorococcus have acquired, by horizontal gene transfer, clusters of genes for surface
polysaccharides not shared by the other strain or by Synechococcus. As evidence of foreign origin,
the 40.8 kb cluster in MIT9313 has a G+C content of 42 mole %, substantially lower than that of the
MIT9313 genome as a whole, 50.7 mole %.
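
The G+C argument for the foreign origin of the MIT9313 polysaccharide cluster can be sketched as a sliding-window calculation: compute the G+C content of successive windows and flag those that differ markedly from the whole-sequence value. The Python below is a minimal illustration, not the method used in the original study. The function names, the window and step sizes, and the threshold are all assumptions, and the sequence in the usage note is a made-up toy string.

```python
def gc_percent(seq):
    """Return the G+C content of a DNA string as a percentage (mole %)."""
    seq = seq.upper()
    # Count only unambiguous bases, so that N and other codes are ignored.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def gc_windows(seq, window, step):
    """Yield (start, G+C %) for successive windows along the sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_percent(seq[start:start + window])


def atypical_windows(seq, window, step, min_diff):
    """List windows whose G+C % differs from the whole-sequence value by >= min_diff."""
    overall = gc_percent(seq)
    return [(start, gc) for start, gc in gc_windows(seq, window, step)
            if abs(gc - overall) >= min_diff]
```

For example, `atypical_windows("GC" * 50 + "AT" * 20 + "GC" * 50, window=40, step=20, min_diff=30.0)` flags the windows that overlap the A+T-rich insert in the toy string. For a real genome the window would be thousands of bases long. Atypical composition is only suggestive of horizontal transfer and would normally be combined with other evidence.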

* See: http://www.es.flinders.edu.au/~mattom/IntroOc/notes/figures/fig5a5.html.

● RECOMMENDED READING

• General discussions of prokaryotic classification:

Oren, A. & Papke, R.T. (2010). Molecular phylogeny of microorganisms. Caister Academic
Press, Norfolk, UK.
• Last universal common ancestor (LUCA) and related topics:
Mat, W.K., Xue, H., & Wong, J.T. (2008). The genomics of LUCA. Front. Biosci. 13, 5605–5613.
Puigbò, P., Wolf, Y.I., & Koonin, E.V. (2009). Search for a ‘Tree of Life’ in the thicket of the
phylogenetic forest. J. Biol. 8, 59.
• The history of the growth of oxygen in the atmosphere: an intersection between biochemistry
and geology:
Raymond, J. & Segrè, D. (2006). The effect of oxygen on biochemical networks and the
evolution of complex life. Science 311, 1764–1767.
Holland, H.D. (2006). The oxygenation of the atmosphere and oceans. Phil. Trans. R. Soc. Lond.
B: Biol. Sci. 361, 903–915.
• Metagenomics:
DeLong, E.F. & Karl, D.M. (2005). Genomic perspectives in microbial oceanography. Nature
437, 336–342.
Scanlan, D.J. et al. (2009). Ecological genomics of marine picocyanobacteria. Microbiol. Mol.
Biol. Rev. 73, 249–299.
Wooley, J.C., Godzik, A., & Friedberg, I. (2010). A primer on metagenomics. PLoS Comput.
Biol. 6, e1000667.
Bohlin, J. (2011). Genomic signatures in microbes – properties and applications. Sci. World J. 11,
715–725.
• New approaches to vaccine design:
Rossolini, G.M. & Thaller, M.C. (2010). Coping with antibiotic resistance: contributions from
genomics. Genome Medicine 2, 15.
Scarselli, M., Giuliani, M.M., Adu-Bobie, J., & Rappuoli, R. (2005). The impact of genomics on
vaccine design. Trends Biotech. 23, 84–91.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 6.1 Which of the differences between archaea and bacteria, described on pp. 192–194,
could you derive from genome sequence alone?
Exercise 6.2 Using the standard genetic code (Box 1.1), are there any amino acids for which a
hyperthermophilic organism could not preferentially choose a codon with G or C in the third
position?
Exercise 6.3 From the table in Box 6.4, what are two groups of bacteria that are (a) photosynthetic,
(b) live in the human gut, (c) pathogenic?
Exercise 6.4 On a photocopy of Figure 6.6, circle salt bridges that appear at (approximately)
common positions in both structures.

Exercise 6.5 On a photocopy of Figure 6.7, identify positions at which there is a charged residue
in the P. furiosus sequence but an uncharged residue in the C. symbiosum sequence. (Charged
residues = K, R (shown in blue), D, E, H (shown in red); uncharged residues = all the others.)
Exercise 6.6 On a photocopy of Figure 6.9, (a) circle the glycolysis/gluconeogenesis pathway;
(b) circle the Calvin cycle.
Exercise 6.7 On a photocopy of Figure 6.10, circle a region in which synteny is maintained in
P. abyssi, P. horikoshii, and P. furiosus.
Exercise 6.8 Why is the recombinant vaccine against hepatitis B expressed in yeast and not
E. coli?
Exercise 6.9 If 1 ml of seawater contains 200 species, how big would the metagenome be?

Problems
Problem 6.1 Draw a Venn diagram showing the numbers of genes specific to one, common to
pairs, and common to all three of the following: Prochlorococcus strain MED4, Prochlorococcus
strain MIT9313, and Synechococcus strain WH8102.
Problem 6.2 In Figure 6.12, hatched squares indicate MRSA strains. (a) Find two RDs (regions
of diversity) that are present in all methicillin-resistant strains studied. (b) Find two RDs that are
present in some methicillin-resistant strains, but not every strain containing either of them is
methicillin resistant. (c) Is there any RD that is present in every methicillin-resistant strain, and
absent from every methicillin-sensitive strain?

Weblems
Weblem 6.1 Print out the complete secondary structures of 16S rRNAs from E. coli,
Methanococcus vannielii, and Saccharomyces cerevisiae (http://www.rna.icmb.utexas.edu/).
On each structure, indicate where the region illustrated in Figure 6.1 appears.
Weblem 6.2 Are any archaea implicated in human disease?
Weblem 6.3 (a) Identify a bacterium with a growth temperature below 10°C. (b) Identify an
archaeon with a growth temperature below 10°C. (See Figure 6.5.)
Weblem 6.4 Where did Prochlorococcus get the chlorophyll-binding proteins of its photosystem II
antenna? (a) Using either of the Prochlorococcus Pcb proteins (UniProt ID PCBA_PROMM or
PCBA_PROMP), search using PSI-BLAST for homologous cyanobacterial proteins with different
functions. You should find – among others – ISIA_SYNP6, ISIA_SYNP7, and ISIA_SYNP2. Present
the results of this search, editing the output down to the most relevant information. (b) What
is the function of the ISI family of proteins? (c) Align the full sequences of PCBA_PROMM,
PCBA_PROMP, ISIA_SYNP6, ISIA_SYNP7, and ISIA_SYNP2, using CLUSTAL W or T-Coffee.
Comment on the extent of the divergence. (d) Determine the fraction of identical residues
between every pair of sequences from the multiple sequence alignment and draw a
phylogenetic tree.
Weblem 6.5 The high-light-intensity-adapted Prochlorococcus strain MED4 contains a
phycoerythrin, but the low-light-intensity-adapted strain MIT9313 does not. Do most
Synechococcus strains contain phycoerythrins? If so, align the sequences of a prochlorococcal
and synechococcal phycoerythrin and comment on the extent of the divergence compared
with the divergence of prochlorococcal Pcb proteins from synechococcal ISI proteins (see
Weblem 6.4).
CHAPTER 7

Genomes of Eukaryotes

LEARNING GOALS

• Having a clear sense of how eukaryotic cells differ from prokaryotic cells.
• Understanding the relationships among the major types of eukaryotes.
• Recognizing that fungi in general and yeasts in particular are among the simplest eukaryotes,
and have served as model organisms in the molecular biology laboratory (in addition to
applications in agriculture and cooking).
• Knowing the unique features of higher plants, and the focus on Arabidopsis thaliana – ‘the fruit
fly of botany’.
• For the animals, appreciating the evolutionary path from early organisms, more distantly related
to humans, to mammals.
• Respecting the power – albeit limited – of the ability to recover and sequence DNA from extinct
organisms.

We, like all other species that exist or have existed, are the product of a long evolutionary history.
Genomes give us snapshots of landmarks along the route. They allow us to understand the
topology of the path – where and when it branched. They reveal when certain features arose. In
some cases they show us the experiments – some productive, some abortive – that preceded the
mature result subsequently adopted.

The origin and evolution of eukaryotes

There is consensus that eukaryotes are descended from prokaryotes. Evidence includes the
observations that both prokaryotes and eukaryotes use the same genetic code, and share many
metabolic pathways.

When did eukaryotes originate? There is consensus that prokaryotes had the world to themselves for
many years since the origin of life (dated at no later than 3.5 billion years ago). The first
eukaryotic fossil is approximately 2 billion years old. The discovery in datable oil deposits of
biochemicals produced only by eukaryotes suggests an earlier origin, almost 3 billion years ago.

The first eukaryotes began as unicellular organisms. Multicellularity began with formation of
colonies, followed by cell specialization, perhaps first by simple symbiosis. Developmental
programmes allowed specialization within clonal clusters.

There followed exploration of a great variety of body plans, based on a variety of tissue types.
Major known landmarks in the development of higher organisms include the division between animals
and plants. Some animals adopted body structures with bilateral symmetry. Some of these became
vertebrates. Eventually, some became human.

Until relatively recently, our view of the history of life was limited to organisms that have left
either descendants or fossils. The Burgess Shale deposits show us that we have missed a lot of
interesting alternatives. It is true that it has become possible to recover and sequence ancient
DNA. This has opened a window on extinct species, but only in a limited way.

Our only real possibility of reconstructing our history is through the genomes of extant organisms
that have been around for a long time, and to read the story that their sequences tell.

• In order to reconstruct evolutionary history from genomes, we must analyse genomes from species,
the ancestors of which first arose in the distant past. What do they share with their close
relatives, that might offer genomic characterization of a group of species; for instance, what are
the defining genomic features of vertebrates? What do ancient species share with their predecessors
and successors? What innovations did they achieve? Which of those were dead ends and which did
other species descended from them adopt and develop?

Evolution and phylogenetic relationships in eukaryotes

Genome sequences provide detailed information about evolutionary relationships among species. So
many genomes are now known that we must pick and choose only a few of the interesting examples.
The Genomes On Line Database (GOLD) lists 2007 completed or ongoing eukaryotic genome sequencing
projects.

Figure 7.1 shows the major groups of eukaryotes. Note the ‘star’ topology – at this level of low
resolution, eukaryotes are a bush rather than a tree. In this chapter, however, we shall follow a
more directed path, roughly in the direction towards higher complexity. This correlates fairly well
with date of origin.

From the comparative genomics of eukaryotes, we can ask questions about features of humans that we
consider essential. For instance, we can look into the origin of some features of a body plan, or
the immune system, or the endocrine system, or features of the nervous system. Phrasing the
questions loosely: Who invented them? When? What if anything did they come from? What alternatives
were experimented with and how well did they work?

This is an ambitious programme. It is appropriate to begin with one of the simplest eukaryotes,
yeast.

The yeast genome

Yeast, like E. coli, is an organism better known to many of us from molecular biology labs than
from nature. It has served as a model eukaryote, because of its relative simplicity, ease of growth,
having both haploid and diploid states, and safety. Of course we

are also grateful to yeast for bread, wine, and beer. On the other hand some related fungi are
infectious.

Yeast is one of the simplest known eukaryotic organisms. Its cells, like our own, contain a nucleus
and other specialized intracellular compartments. The sequencing of its genome, by an international
consortium comprising ∼100 laboratories, was completed in 1996.

The genome of baker’s yeast, Saccharomyces cerevisiae, contains 12 052 000 bp, distributed among 16
chromosomes. The chromosomes range in size over an order of magnitude, from the 1352 kbp chromosome
IV to the 230 kbp chromosome I.

The S. cerevisiae genome is 3.5 times the length of that of E. coli, and less than a tenth of the
human genome. Many strains also contain one or more plasmids. The genome is relatively compact,
with genes accounting for about 72% of the sequence. There are fewer repeat sequences compared with
genomes of more complex eukarya.

A duplication of the entire yeast genome appears to have occurred ∼150 million years ago. This was
followed by translocations of pieces of the duplicated DNA and loss of one of the copies of most
(∼92%) of the genes.

Approximately 6000 protein-coding genes are predicted. Relatively few contain introns. However, the
genes of a related yeast, Schizosaccharomyces pombe, are much richer in introns.

S. cerevisiae contains many genes for non-protein-coding RNAs. A single tandem array on chromosome
XII encodes 120 copies of ribosomal RNA. There are 40 genes for small nuclear RNAs, and 275 genes
for transfer RNA, about one-third of which contain introns.

Of the protein-coding genes, 4777 correspond to molecules to which a function can be assigned.
About 1000 more contain some similarity to known proteins in other species. Another ∼800 are
similar to ORFs in other genomes that correspond to unknown proteins. Many of these homologues
appear in prokaryotes. Only ∼1/3 of yeast proteins have identifiable homologues in the human
genome.

[Figure 7.1: star-shaped tree with branches labelled Plants (land plants, many algae); Alveolates
(dinoflagellates) and Stramenopiles (diatoms, brown algae), under the label Chromalveolates;
Rhizaria (foraminifera); Amoebozoa (amoebas); Discicristates (trypanosomes); Excavates I
(diplomonads); Opisthokonts (sponges, fungi, animals). Two asterisks appear near the centre.]

Figure 7.1 Major classes of eukaryote, with example species in parentheses. The asterisks mark
possible positions of the root of the tree.
From: Baldauf, S.L. (2003). The deep roots of eukaryotes. Science 300, 1703–1706, and personal
communication.

Table 7.1 Assignment of Saccharomyces cerevisiae gene products to different functional categories

Functional category                                              Number of proteins

metabolism                                                       1514
energy                                                           367
cell cycle and DNA processing                                    1012
transcription                                                    1077
protein synthesis                                                480
protein fate (folding, modification, destination)                1154
protein with binding function or cofactor requirement
  (structural or catalytic)                                      1049
regulation of metabolism and protein function                    253
cellular transport, transport facilities, and transport routes   1038
cellular communication/signal transduction mechanism             234
cell rescue, defence, and virulence                              554
interaction with the environment                                 463
transposable elements, viral and plasmid proteins                120
cell fate                                                        273
development (systemic)                                           69
biogenesis of cellular components                                862
cell type differentiation                                        452
unclassified proteins                                            1393
functionally classified proteins                                 4777
functionally unclassified proteins                               1394

The classification of yeast protein functions shown in Table 7.1 is taken from the Saccharomyces
genome

database, http://mips.gsf.de/genre/proj/yeast/Search/Catalogs/catalog.jsp.

The evolution of plants

Plants and animals parted company a long time ago. Although all life forms share much of their
molecular biology, plants derive energy from sunlight via photosynthesis. The consequences for
their structure, lifestyle, and developmental programmes have been profound. They require many
proteins dedicated to their unique biophysical and metabolic activities. Plant genome sequences
illuminate the similarities to and differences from other eukaryotes:

• Plants share some functions with animals. At the genomic and proteomic level, are they achieved
in similar ways?
• Some functions are unique to plants. At the genomic and proteomic level, where did they come
from? Were they invented, adapted, or borrowed?

From a common ancestor living about 800 million years ago, metazoa split into three major groups:
fungi, animals, and plants. Higher plants evolved from single-celled organisms formerly classified
as green algae. Green algae have now been split into streptophytes and chlorophytes; streptophytes
are related to higher plants.

Plants came ashore to occupy land environments about 450 million years ago, in the mid-Ordovician.
Most plants today are angiosperms, or plants with flowers (see Figure 7.2). Angiosperms arose about
140–190 million years ago.

[Figure 7.2: phylogenetic tree of land plants drawn against a geological timescale (Palaeozoic:
Ordovician, Silurian, Devonian, Carboniferous, Permian; Mesozoic: Triassic, Jurassic, Cretaceous;
Cenozoic: Tertiary, Quaternary), with branches leading to bryophytes (mosses), ferns, ginkgos,
conifers, cycads, and angiosperms (flowering plants: eudicots and monocots).]

Figure 7.2 Phylogeny of land plants. The picture is limited to groups with extant examples with
which readers may be familiar. Monocots and eudicots are named for the difference between single
and double cotyledons in the embryo, but many other features separate them.

The first complete nuclear genome of a higher plant to be sequenced was that of Arabidopsis
thaliana, common name thale cress. It is related to turnip, cabbage, and broccoli. Its ease of
handling and rapid generation time have made it a favoured subject for research in plant molecular
biology. A. thaliana has been called ‘the fruit fly of botany’.

The Arabidopsis thaliana genome

A. thaliana has a relatively small genome – 146 Mb – distributed over five chromosomes (see
Box 7.1). (The maize genome is almost 20 times as large.)

• The compact genome was one reason why the research community adopted Arabidopsis.
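
The per-chromosome lengths and gene counts tabulated in Box 7.1 can be cross-checked against the whole-nucleus totals and converted into an average spacing of genes. The Python sketch below does both. The variable and function names are illustrative. Note that the lengths are those of the sequenced nuclear DNA (about 116 Mb), not the estimated 146 Mb genome size.

```python
# (length in bp, number of genes) per chromosome, copied from Box 7.1.
CHROMOSOMES = {
    1: (30_432_563, 6_905),
    2: (19_705_359, 4_178),
    3: (23_403_063, 5_313),
    4: (18_585_042, 4_088),
    5: (23_810_767, 6_248),
}
# The 'Full nucleus' row of Box 7.1.
NUCLEAR_TOTAL = (115_936_794, 26_732)


def totals(chromosomes):
    """Sum lengths and gene counts over all chromosomes."""
    length = sum(bp for bp, _ in chromosomes.values())
    genes = sum(n for _, n in chromosomes.values())
    return length, genes


def kb_per_gene(length_bp, genes):
    """Average amount of sequence per gene, in kilobases."""
    return length_bp / genes / 1000.0


if __name__ == "__main__":
    print("chromosome rows match the total row:", totals(CHROMOSOMES) == NUCLEAR_TOTAL)
    for number, (length_bp, genes) in CHROMOSOMES.items():
        print(f"chromosome {number}: {kb_per_gene(length_bp, genes):.2f} kb per gene")
    print(f"whole nucleus: {kb_per_gene(*NUCLEAR_TOTAL):.2f} kb per gene")
```

On these figures the chromosome rows sum exactly to the totals given in the box, and the sequenced nuclear genome averages a little over 4 kb of DNA per predicted gene.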

BOX 7.1 The Arabidopsis thaliana genome

Genome size                        146 Mb (estimated)
Sequenced nuclear DNA              115 936 794 bp
Predicted protein coding genes     26 732
Pseudogenes                        3818
Alternately spliced genes          2330
Transposons                        >10% of genome

Gene distribution     Length (bp)      Number of genes

Chromosome 1          30 432 563       6 905
Chromosome 2          19 705 359       4 178
Chromosome 3          23 403 063       5 313
Chromosome 4          18 585 042       4 088
Chromosome 5          23 810 767       6 248
Full nucleus          115 936 794      26 732
Mitochondrion         366 924          135
Chloroplast           154 478          122

The Arabidopsis nuclear genome is relatively compact. Protein-coding genes contain an average of
5.4 exons, of average length 276 bp, separated by relatively short introns about 165 bp long. The
intergenic spacing is also short, about 4.6 kb. A feature of plant genes is that the G+C content of
exons (44 mole %) is higher than that of introns (32 mole %).

The structure of the A. thaliana genome reveals both local and genome-wide duplications. There were
probably three polyploidizations, estimates of the dates of which vary widely. The ranges 225–300
million years ago for the first, 150–170 million years ago for the second, and 25–40 million years
ago for the most recent have been suggested. In addition, local duplications have affected ∼17% of
genes. Close relatives, such as cabbage and cauliflower, have undergone additional polyploidizations
during the 12 million years since they diverged from Arabidopsis.

A. thaliana has a mitochondrial and a chloroplast sequence as well. Genome analysis must address
questions of divisions of labour. Relative to animal cells, organelles in plant cells bear a
greater metabolic burden, if only because of the activities of chloroplasts. Chloroplast genomes
are relatively gene dense, with preserved gene order. In plant mitochondria, genes are more widely
spaced and recombination is more common. Mitochondrial and chloroplast genes contain fewer introns
(Table 7.2).

Table 7.2 Genes containing introns

Genome                          Nuclear    Chloroplast    Mitochondrial

Genes containing introns (%)    80         18.4           12

The Arabidopsis proteome contains many genes specific to plants, including those involved in
photosynthesis and in the metabolism of components of cell walls. Arabidopsis is rich in genes that
encode water-transporting channels, peptide-hormone transporters, metabolic and biosynthetic
enzymes, and proteins involved in defence, detoxification, and environmental sensing.

Plants have many special metabolic pathways, for photosynthesis and for the metabolism of cell wall
components, alkaloids, and growth regulators such as auxins and gibberellins. Complex metabolism
requires the genome to encode a large and varied set of enzymes. In keeping with the essential role
of light in plant life, Arabidopsis has many light sensors that regulate development and circadian
responses.

Plants are also threatened by pathogens and have evolved defence mechanisms dissimilar from our
immune system. One weapon that plants deploy against pathogens involves the production of
reactive-oxygen species. Plants synthesize other defence molecules against animals, but, also,
other molecules that attract pollinators. These attractants have provided useful sources of
flavours, fragrances, and drugs, encompassing traditional ‘herbal medicine’ and modern
pharmacology.

Comparing the proteins encoded in the nuclear genome of Arabidopsis with human proteins, the
fraction of homologues observed varies with functional category. For protein synthesis, 60% of
nuclear-encoded Arabidopsis genes have human homologues. For transcription regulation, the figure
is only 30%. It is not that transcription is poorly represented in plant genomes; it is just that
plants do it differently. In fact, plants have several times as many transcription

factors as the fruit fly. Although many components of the signal-transduction pathways familiar
from animals are absent in plants, plants have developed specific transcription factor families
unknown in animals.

Many Arabidopsis genes are homologous to human genes implicated in disease. For instance, plants
and animals have similar DNA repair systems, and Arabidopsis has a homologue of BRCA2 (see
Chapter 2). For some human disease-associated genes, the plant homologue is more similar to the
human protein than those from fruit fly or Caenorhabditis elegans. Study of the function of the
plant homologues will be illuminating, even though it is unlikely that Arabidopsis will be suitable
for clinical trials of drugs intended for human use!

Plants are of course not ancestral to humans. We now turn to species that represent some important
branching points in our own ancestry. That is, a succession of species with which we shared more
and more recent common ancestors: urochordates, fishes, birds, monotremes, and other mammals.

The genome of the sea squirt (Ciona intestinalis)

Vertebrates arose within the chordate phylum, branching off from other lines that led to sea squirts
and lancelets (see Figure 7.3). The sea squirt (C. intestinalis) represents one of the most
primitive of our chordate relatives (Figure 7.4). Its genome provides insight into chordate and
vertebrate origins.

[Figure 7.3: tree with branches, from top to bottom, Vertebrata (human), Cephalochordata
(lancelets), Urochordata (sea squirts; Ciona intestinalis), Hemichordata (acorn worms), and
Echinodermata (starfish, sea urchins).]

Figure 7.3 The evolutionary position of the sea squirt, Ciona intestinalis, compared to other
chordates and our nearest non-chordate deuterostome relatives. (From Figure 4.2.)

[Figure 7.4: photograph.]

Figure 7.4 Adult Ciona intestinalis.
Cañestro, C., Bassham, S., & Postlethwait, J.H. (2003). Seeing chordate evolution through the Ciona
genome sequence. Genome Biol. 4, 208 (Photo by Andrew Martinez).

• Chordates have a notochord, a rudimentary cartilaginous skeleton running dorsally from head to
tail. A nerve cord lies parallel, adjacent, and dorsal to the notochord. Vertebrates retain the
notochord during early embryonic development (vertebrates are also chordates) but replace it with
the vertebral column.

The C. intestinalis genome has approximately 160 Mbp, about 1/20 the size of the human genome.
From the initial sequence determined, approximately 16 000 proteins were predicted. This is
comparable to invertebrates, and lower than typical vertebrate repertoires. The genes are tightly
packed (7.5 kb/gene, compared to 100 kb/gene for humans).

Because of the position of Ciona in the evolutionary tree, it is of interest to analyse its genes
according to where homologues exist.

• Almost 60% of the genes have homologues in C. elegans and/or D. melanogaster. These represent

genes shared by species ancestral to both invertebrates and chordates.
• A few genes look more similar to genes from worm and/or fly than to vertebrate genes. It is
likely that these are vestiges of the common ancestor lost in the lineage leading to vertebrates.
An example is the gene for haemocyanin, the oxygen carrier in many invertebrates. Ciona also
contains genes for four globins, the vertebrate oxygen carrier.
Some haemocyanins also have enzymatic activity, as phenoloxidases, enzymes which convert
monophenols to diphenols, and/or diphenols to o-quinones. These reactions are involved in
invertebrate immune responses: a defence reaction activates phenoloxidase activity, producing
reactive quinones, which contribute to the inactivation of foreign organisms.
Indeed, one system carefully looked for in Ciona, but conspicuous by its absence, is an adaptive
immune system. Ciona does not appear to have genes for immunoglobulins, T-cell receptors, or MHC
proteins. This appears then to be a later invention, by vertebrates. The characteristic molecules
are absent from even the primitive jawless vertebrates – lamprey and hagfish.

The genome of the pufferfish (Tetraodon nigroviridis)

Tetraodon nigroviridis is a freshwater pufferfish. Ancestors of humans and fish parted company
450 million years ago. Comparing the genome of T. nigroviridis with other vertebrate genomes should
therefore reveal defining properties of vertebrates.

The T. nigroviridis genome is about 340 Mb in length, an order of magnitude smaller than the human.
Contributing to the compactness are a relative paucity of repetitive transposable elements, and
shorter introns and intergenic regions. Forty per cent is protein coding! Approximately 28 000
protein-coding genes have been identified.

The T. nigroviridis genome strikingly illuminates the large-scale structure of vertebrate genomes.
First, there is evidence for whole-genome duplication. Chromosome rearrangements have complicated
what might originally have been a simple pattern. There remains, however, a considerable degree of
synteny between pairs of groups of paralogous genes on different chromosomes. These common syntenic
blocks within the T. nigroviridis genome arose by whole genome duplication followed by chromosome
rearrangement.
Second, it is possible to map syntenic groups
• A fifth of the genes have no apparent homologue between T. nigroviridis and human. Figure 7.5 shows
in vertebrates or invertebrates. It is likely that reciprocal maps. Consider Figure 7.5a. Imagine each
homologues will be discovered – perhaps when the human chromosome coloured a constant separate
protein structures are determined – or they may colour; for example, colour human chromosome 2
be very highly diverged within the urochordate pink. Then for each gene on human chromosome 2,
lineage. find homologues on T. nigroviridis chromosomes,
Some Ciona proteins carry out functions specific and colour them pink also. The large blocks of
to urochordates. The urochordate body is sur- pink in T. nigroviridis chromosome 2 indicate long
rounded by a ‘tunic’, made of fibrous cellulose-like blocks that are syntenic with human chromosome 2.
polysaccharide. (Tunicates is another name for this Of course, some of human chromosome 2 appears
group.) Ciona contains enzymes for synthesis of elsewhere in the T. nigroviridis karyotype; for
cellulose, and endogluconases, which degrade cel- instance, at the top of T. nigroviridis chromosome 3.
lulose. Of course cellulose as a structural material Figure 7.5b is the reciprocal map: colour the T. nigro-
is unusual in organisms other than plants and viridis chromosomes a solid colour, and map them
bacteria. There is evidence that the last common onto the human set.
ancestor of urochordates acquired the cellulose It cannot be seen in Figure 7.5 directly, but
synthase gene by lateral transfer from bacteria. typically one human region aligns with two regions
Cellulose degradation is more widespread. The in T. nigroviridis. The explanation is whole-genome
endogluconases of Ciona are most similar to duplication in T. nigroviridis but not in human.
homologues in animals that digest cellulose, such Figure 7.6 shows this in more detail. Here Hsa =
as termites and some cockroaches. Homo sapiens; human chromosomes are numbered
Hsa1–Hsa22 plus HsaX. Tni = T. nigroviridis, and Anc stands for ancestral vertebrate chromosomes.

This pattern is very interesting. First, as mentioned, most regions of the human chromosomes correspond to two regions from T. nigroviridis. Moreover, there are strange interleaving patterns within the mapping. This is expanded in two examples, from human chromosomes 16 and X. Each small box represents a gene. The expanded region of human chromosome 16 is a combination of T. nigroviridis chromosomes 5 and 13. It is the pattern you would expect from a pack of cards if you started with the red cards in one hand, the black cards in the other hand, and shuffled the deck.

[Figure 7.5 panels: (a) Tetraodon chromosomes 1–21, coloured according to human chromosomes 1–22 and X; (b) Human chromosomes 1–22 and X, coloured according to Tetraodon chromosomes 1–21.]

Figure 7.5 Mapping of syntenic blocks between human and Tetraodon nigroviridis chromosomes. The definition of synteny is not a very strict one. A synteny is recorded if a grouping of two or more genes in one species has an orthologue on the same chromosome in the other species, independent of order and orientation.
(a) Each coloured band in each of the T. nigroviridis chromosomes corresponds to a conserved syntenic block in the human chromosome of the same colour. For instance, T. nigroviridis chromosome 2 has many pink areas, indicating extensive relationship with human chromosome 2. T. nigroviridis chromosome 17 has many purple areas, indicating relationship with human chromosome 10.
(b) Reciprocal map, showing mapping of T. nigroviridis blocks onto human chromosomes. Here the close relationship between human chromosome 10 and T. nigroviridis chromosome 17 appears in green.
From: Jaillon, O., Aury, J.M., Brunet, F., Petit, J.L., Stange-Thomann, N., Mauceli, E., et al. (2004). Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957.

The history that explains this pattern – and for which these observations provide compelling evidence – is that there was a whole-genome duplication in the T. nigroviridis lineage, but not in the common ancestor, nor in the human lineage after divergence. The chromosomes duplicated in an ancestor of T. nigroviridis, after divergence from the lineage leading to humans.

This produced what ultimately became T. nigroviridis chromosomes 5 and 13. If there were no chromosomal rearrangements or gene loss, the matchings of T. nigroviridis chromosomes 5 and 13 with human chromosome 16 would be the same. In Figure 7.6, in the upper expanded box, there would be green boxes above the Hsa16 line corresponding to all the red boxes below it, and red boxes below the Hsa16 line corresponding to all the green boxes above it.

But there have been rearrangements and loss of many of the duplicated genes. (As both human and T. nigroviridis have, to a first approximation, the same number of genes, clearly approximately half the genes present after the duplication must have been lost.) Which of the pair of duplicates was lost appears to be random. And there were also chromosomal rearrangements. For instance, T. nigroviridis chromosomes 5 and 13 also contribute to human chromosome 15.

The assumption of a whole-genome duplication can not only rationalize the pattern we see now, it can be run in reverse to infer the ancestral vertebrate karyotype. The alternating patterns seen on human chromosomes 16 and X in Figure 7.6 correspond to gene loss but not chromosomal rearrangement (within the expanded region). It is possible to ask for

[Figure 7.6 diagram: human chromosomes Hsa1–Hsa22 and HsaX, each divided into blocks labelled with letter codes (A–L, U, V, W, Z) matching the ancestral chromosomes AncA–AncL, AncU, AncV, AncW, and AncZ, and mapped to T. nigroviridis chromosomes Tni1–Tni21. Two regions are expanded to the level of individual genes: part of Hsa16, matched with Tni13 and Tni5, and part of HsaX, matched with Tni1 and Tni7.]

Figure 7.6 More-detailed mapping of synteny blocks between T. nigroviridis and human chromosomes. The two ‘blown-up’ regions
show matches between individual genes. Notice that in general two T. nigroviridis regions map to one human one, evidence for whole
genome duplication. In the detailed regions there is an alternation of matches, arising from random loss of one copy of each pair of
genes produced by the whole genome duplication. Hsa = Homo sapiens; human chromosomes are numbered Hsa1–Hsa22 plus HsaX.
Tni = T. nigroviridis. Anc = ancestral vertebrate. (Blocks AncU, AncV, AncW, and AncZ contain small amounts of sequence that could
not be assigned to the twelve ancestral chromosomes.)
From: Jaillon, O., Aury, J.M., Brunet, F., Petit, J.L., Stange-Thomann, N., Mauceli, E., et al. (2004). Genome duplication in the teleost fish Tetraodon
nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957.
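The alternation of matches described in this caption can be illustrated with a toy simulation. The sketch below is only an illustration under simplifying assumptions that are not from the source: every duplicated gene pair loses exactly one copy, each copy is equally likely to be the one lost, and no chromosomal rearrangements occur. The function names `duplicate_and_lose` and `render` are hypothetical.

```python
import random


def duplicate_and_lose(n_genes, seed=0):
    """Simulate whole-genome duplication followed by random loss of one copy.

    Each ancestral gene is duplicated onto chromosomes "A" and "B"; one copy
    is then lost at random (assumption: 50/50, independent for each gene).
    Returns, in ancestral gene order, which chromosome keeps the survivor.
    """
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n_genes)]


def render(survivors):
    """Draw the two duplicated chromosomes above and below an unduplicated
    reference chromosome, marking each surviving gene with '#'."""
    top = "".join("#" if s == "A" else "." for s in survivors)
    reference = "-" * len(survivors)
    bottom = "".join("#" if s == "B" else "." for s in survivors)
    return "\n".join([top, reference, bottom])


survivors = duplicate_and_lose(40, seed=1)
print(render(survivors))
```

Each column holds one ancestral gene, and a `#` above or below the reference line shows which duplicated chromosome retained it. The marks switch sides irregularly, which is the shuffled red-and-black-cards pattern seen in the expanded regions of the figure.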
the minimal set of rearrangements that accounts for the entire pattern. Reversing those rearrangements, on paper of course, provides a sketch of the ancestral vertebrate chromosomes (see Figure 7.6). The suggestion is that the ancestral vertebrate genome was distributed on 12 chromosomes, and had the repertoire of ∼20 000–30 000 protein-coding genes that are typical of extant vertebrates.

• The pufferfish is a vertebrate. At least three-quarters of its genes have human homologues. By comparison of its genome with humans, it is possible to reconstruct the ancestral vertebrate karyotype.

The chicken genome

The chicken genome was the first complete sequence of a bird. It is a useful outgroup for study of the comparative genomics of mammals. The lineages leading to birds and mammals diverged about 310 million years ago.

The chicken is also important as an animal raised for food. The consumption of chickens in the UK amounts to 2.5 million birds per year, and about 11 billion eggs. Chickens were domesticated from the grey jungle fowl (G. sonneratii) in Asia, about 8000 years ago. Consequently, the genetics of chickens has been studied extensively. There are very many different breeds; for example, some are specialized for egg production, others for meat. It is anticipated that the genome will have practical applications in food production.

The chicken has also been a popular laboratory animal. It has contributed to research in developmental biology, virology, immunology, and cancer.

At ∼1.2 Gbp, the chicken genome is substantially smaller than most mammalian genomes. However, it contains approximately the same number of genes. The chicken has 38 pairs of autosomes and one pair of sex chromosomes, called Z and W. As in other birds, but unlike in mammals, females are heterogametic (ZW) and males are homogametic (ZZ).

About 90% of the 1.05 Gb of assembled sequence was anchored to its proper chromosome location. It is interesting that the synteny between human and chicken is more conserved than between human and mouse.

Why is the chicken genome smaller than the human and other mammalian genomes? The chicken genome is relatively poor in interspersed repeats and pseudogenes. Expansion of many gene families is greater in mammalian genomes.

In an attempt to extract a ‘core’ set of vertebrate proteins, the human, pufferfish (Takifugu rubripes), and chicken genomes were compared to expose a common gene set. Approximately 7000 protein-coding genes from chicken have orthologues in both pufferfish and human. These are likely to implement common necessary functions, and one expects to find them in most higher vertebrates. These common genes are expressed in many different tissues. This is another typical signature of a gene that is not rapidly evolving for a lineage-specific function.

The next step in this line of enquiry would be to determine which of these genes are also expressed in primitive vertebrates and invertebrates. This will define a higher-vertebrate common core gene repertoire.

Comparing chicken and human only, about 60% of the ∼23 000 chicken protein-coding genes have unique human homologues. Whereas such pairs between human and mouse or rat show an average of 88% sequence conservation, between human and chicken this drops to 75.3%. Proteins with different classes of function differ in conservation with human homologues: transport proteins are more highly conserved than average, and proteins of the immune response show only 60% sequence conservation.

The chicken has some avian-specific proteins. These include one family of keratins, which in chickens form feathers; mammals have expanded a different family, to form hair. Chickens have genes for avidin, a protein appearing in the egg whites of reptiles, amphibians, and birds. The very strong binding of biotin to avidin (KD ≈ 10⁻¹⁵ M) has been applied in the laboratory for purification. What is its natural function? It is thought to protect eggs from bacteria, which require free biotin as a cofactor in numerous reactions.

Conversely, some human genes that chickens lack are:

• milk proteins such as casein
• enamel proteins, associated with loss of teeth in the bird lineage subsequent to Archaeopteryx, a primitive bird that did have teeth
• vomeronasal receptors. The vomeronasal system is a secondary chemosensory or odour detection system that appears in many vertebrates, including humans, fish, reptiles, and others. Its absence from chicken and other birds signifies a loss in the avian lineage, rather than an invention in the mammalian one.

The platypus genome (Ornithorhynchus anatinus)

Extant mammals form a class divided into three orders:

monotremes: only the platypus and two species of echidna
marsupials: kangaroos, opossums, koalas, and many others, including all mammals native to Australia and New Guinea
placentals: all other mammals, including humans.

The platypus (Ornithorhynchus anatinus) is as distant a relative as we have among mammals. Startling to its discoverers, and to us now, is its mixture of mammalian and reptilian characteristics. Like mammals, the platypus has hair, and nurses its young (it has mammary glands, but not teats – the milk is released through localized pores in the skin, modified sweat glands). Like reptiles it lays eggs, and it has venom, delivered by males through ankle spurs.

Careful study of the anatomy revealed many other unusual characteristics. For instance, the name monotreme (= single aperture) refers to the single orifice serving both the urogenital and digestive systems.

An unusual sensory capacity is electroreception, the ability of the platypus to perceive electrical impulses. The platypus can locate and catch prey through use of a combination of mechano- and electroreceptors in its bill. A platypus will attack a battery immersed in water in the dark.

Study of the molecular biology of the platypus revealed other surprises, including 10 sex chromosomes (males are always XYXYXYXYXY). However, the sex determination system is closer to that of birds than of most mammals. In marsupials and placental mammals, the primary locus for sex determination is SRY, a gene on the Y chromosome that encodes a transcription factor. The platypus lacks this gene.

The closest to an extant reptile–mammal transitional form that we have, the platypus has offered us the chance to see how the basic distinctive features of mammals originated.

The platypus genome was the first monotreme genome sequenced. It has 2.2 billion base pairs. 18 527 protein-coding genes were identified. Not unexpectedly, a majority of these have orthologues in opossum (a marsupial), human, dog, and mouse (placentals), and even chicken. Of particular interest are genes not found in other mammals. Like its anatomy, the platypus genome shows a mixture of mammalian and non-mammalian features. These include:

Odour receptors: the platypus odorant receptor genes are for the most part recognizable homologues of those in other mammals. The repertoire is more akin to that of other mammals than to reptiles. There are roughly half as many odorant-receptor genes as in other mammals, but this may be a reflection of the animal’s aquatic lifestyle.

Milk: although true milk is unique to mammals, non-mammalian animals that incubate their eggs secrete fluids that protect eggs from desiccation and/or infection. However, unlike those primitive precursors, platypus milk resembles that of other mammals. It is a complex mixture with both nutritive and antimicrobial functions.

Eggs: unlike the eggs of marsupials and placental mammals, which are nourished internally, the platypus lays eggs that contain yolk. Common to the yolks of eggs of fish, amphibians, reptiles, birds, and most invertebrates is the protein vitellogenin. Vitellogenin is the precursor of the lipoproteins and phosphoproteins that are major protein components of egg yolk. However, vitellogenins are not restricted to eggs. For instance, bees also use vitellogenin as a food store. Vitellogenin genes were lost in the lineages leading to marsupials and placental mammals, and retained in the monotremes. In other mammals, the placenta became the locus of embryonic development, and the mother supplied nutrients. Monotremes have a primitive form of placenta, called a yolk-sac placenta.

Venom: venom is one of those ideas that has proved useful to a variety of species, including monotremes
and reptiles. Platypus venom is a complex cocktail containing proteins evolved by duplications of genes with other functions. However, although there are some biochemical features common to platypus and snake venoms, there is evidence that they developed independently. The enlistment of defensin-like peptides as components of venom is an example of convergent, or at least parallel, evolution (see Figure 7.7).

• In its macroscopic phenotypic characteristics, the platypus shows a combination of reptilian and mammalian features. Analysis of its genome also reveals these characteristics of a transitional form. However, there are some details that appear only from detailed sequence information; for instance, that the defensin homologue in platypus venom is not retained from reptiles, but convergently evolved.

[Figure 7.7 diagram: gene tree with clades labelled vDLPs, therian β-defensins, vCrotasins, and vCLPs, and a key to β-defensin lineages 1–6.]

Figure 7.7 Evolutionary tree and points of gene duplication of defensins in birds, reptiles, platypus, and therians (= marsupial + placental mammal). Defensins are a group of families of small proteins found in a variety of vertebrate and invertebrate species. The therian molecules are not components of venom. They have antibacterial activity, generally functioning by forming pores within the microbial cell membrane, allowing cell contents to leak out.
A class of venom defensin-like proteins (vDLPs) has arisen independently in several lineages, including reptiles and platypus. In Crotalus snake venoms, the vDLPs are neurotoxins affecting voltage-gated sodium channels. The mechanism of action of the vDLP in platypus venom is still unknown.
From: Warren, W.C., Hillier, L.W., Graves, J.A.M., Birney, E., Ponting, C.P., et al. (2008). Genome analysis of the platypus reveals unique signatures of evolution. Nature 453, 175–183.

The dog genome

Dogs and humans have lived with and cared for each other for over 10 000 years. Dogs are work, sport, and companion animals. All readers will know of dog–human partnerships that rival human–human relationships in emotional intimacy.
Were those not sufficient reasons for interest in the dog genome, the biology of the dog presents numerous scientific challenges and opportunities.

• The dog is an outgroup of other mammals for which complete genome sequences have been determined (see Figure 7.8).
• Dogs are an ideal species in which to study domestication. To a far greater extent than other genera, dogs and their relatives offer both a variety of inbred populations – the different breeds – and corresponding wild populations. The genomes of dogs and wolves are much closer than those of humans and chimpanzees. The sequence divergence in chromosomal DNA between wolves and dogs is 0.04% in exons and 0.21% in introns. Unlike humans and chimpanzees, dogs and wolves can interbreed.
• Dogs share many human genetic diseases. Many are specific to individual breeds, and good genealogical and clinical records are available. The breeds are highly inbred: many have small founder populations and some have gone through bottlenecks. This simplifies the search for the gene or genes responsible for the disease.
• Because many drugs are tested in dogs, understanding of their molecular biology is useful. Dogs have also been used in research on gene therapy.
• Dogs show a vast morphological variation, notably in size. Information about genetic regulation of developmental pathways is implicit in the comparative genomics of different breeds. For instance, a single mutation controls breadth of skull and shortness of face. In humans, mutation in the homologous protein is responsible for Treacher Collins syndrome, a developmental disorder affecting the skull and face.
Different breeds also vary generally in personality traits, providing an opportunity to identify genes for aggressiveness and passivity.

Romulus and Remus, founders of Rome, were, according to tradition, suckled by a wolf.

[Figure 7.8 diagram: tree of mammals. Euarchontoglires: primates (human, chimp), rodents (mouse, rat), lagomorphs (rabbit). Laurasiatheria (dog, cat, horse, cow, whale). Xenarthra (armadillo, anteater). Afrotheria (aardvark). Marsupials (kangaroo). Monotremes (platypus).]

Figure 7.8 Phylogeny of mammals, showing monotremes and marsupials (green) and the four major groups of eutherian mammals: Euarchontoglires, Laurasiatheria, Xenarthra, and Afrotheria (blue). Human, chimpanzee, mouse, and rat are all Euarchontoglires. Dogs belong to the Laurasiatheria. Complete genome sequences are known for species shown in red.

History of the dog

The order Carnivora, to which domestic dogs and cats belong, originated during the Palaeocene, ∼60 million years ago (see Figure 7.9). Dog-like carnivores are known from fossils from 40 million years ago. The current closely related species – wolf, coyote, jackal, and red fox – split off about 3–4 million years ago. The wolf lineage gave rise to the domesticated dog, Canis familiaris.

Domestication of dogs is recorded in archaeological artefacts 14 000–15 000 years old, but probably took place much earlier. The first colonists of North America, who came across the Bering Strait about 20 000–15 000 years ago, brought domesticated dogs with them.

Evidence from the genome suggests that dogs went through two population bottlenecks. The first occurred ∼9000 generations ago (∼27 000 years) upon domestication. The second, ∼30–90 generations ago, signals the origin of breed divergence. There are now about 300–1000 breeds of dogs. The American Kennel Club recognizes 150 as genetically separated populations, with closed gene pools.

Genome variation among breeds of dogs

The most complete canine genome is that of Tasha, a female boxer. Her genome was determined by the shotgun method, with 31.5 million reads providing ∼7.5-fold coverage (Table 7.3).

The dog genome is slightly smaller than that of humans, in part because dogs have fewer repeat
sequences. The short, interspersed element (SINE) is shorter in dogs than in humans (1 500 000 copies of SINEs make up 13% of the human genome).

[Figure 7.9 diagram: phylogenetic tree of the Carnivora. Feliformia: Nandinia, Felidae, Viverridae, Hyaenidae, Herpestidae, Malagasy carnivorans. Caniformia: Canidae, and Arctoidea comprising Ursidae, Pinnipedia (Phocidae, Otariidae, Odobenidae), and Musteloidea (Ailurus, Mephitidae, Procyonidae, Mustelidae). Mustelidae: basal/other mustelids, Martes group, Mustela, Lutrinae.]

Figure 7.9 Domestic dogs and cats fall into the two main suborders of the order Carnivora. The two lineages split about 48 million years ago. Note that the pictures of the animals are not drawn to scale.
© 2005 From: Flynn, J.J., Finarelli, J.A., Zehr, S., Hsu, J., & Nedbal, M. (2005). Molecular phylogeny of the Carnivora (Mammalia): assessing the impact of increased sampling on resolving enigmatic relationships. Syst. Biol., 54, 317–337. Reproduced by permission of Taylor & Francis Group, LLC (http://www.taylorandfrancis.com).

In addition to determining the reference sequence, 2.5 million single-nucleotide polymorphisms (SNPs) were determined from 11 breeds of dogs. Given the inbred nature of the individual breeds, it is not surprising that fewer SNPs appear when comparing individuals within breeds than in comparisons amongst different breeds. Comparing individual
boxers, there is ∼1 SNP/1600 bases. Between breeds, there is ∼1 SNP/900 bases.

Table 7.3 The dog genome

Feature                          Value           Comment
Number of chromosomes            39 pairs        More than humans
Genome length                    2.4 × 10⁹ bp    Slightly less than humans
Number of proteins identified    37 774

Concomitant consequences of closer relationships within breeds are:

• Greater interbreed than intrabreed sequence difference. Over the entire species, dogs and humans show similar levels of nucleotide diversity between individuals: a frequency of different bases of ∼8 × 10⁻⁴. However, the genetic homogeneity is much greater within breeds of dogs than within distinct human populations.
• Longer haplotype blocks. Within breeds, haplotype blocks may be as long as 100 kb. Haplotype blocks shared by different breeds are about 10 kb long. In comparison, the length of haplotypes in modern humans is about 20 kb.
• Linkage disequilibrium within breeds extends over several megabases. Across all breeds it is greatly reduced, extending only over tens of kilobases.

Comparison of dog, human, and mouse genomes

Dogs have 39 pairs of chromosomes compared with 23 pairs in human and 20 in mouse. Therefore, the human and mouse chromosomes must have been reassorted to make up the dog karyotype. Nevertheless, 94% of the dog genome appears in conserved synteny blocks with the human and mouse genomes.

Approximately 5% of the dog genome constitutes functional elements common to dog, human, and mouse. This is higher than the protein-coding fraction of the human genome. It includes regulatory elements and non-protein-coding RNAs, and further suggests that it is premature to dismiss as ‘junk’ the regions of the genome to which we cannot yet assign function.
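
The SNP densities quoted earlier in this section ("1 SNP per N bases") are easier to compare with the species-wide nucleotide diversity of ∼8 × 10⁻⁴ once they are expressed as frequencies per base. The short calculation below is a sketch using only figures from the text; the helper names `snp_frequency` and `expected_snps` are hypothetical, and treating SNPs as evenly spread along the sequence is a simplifying assumption.

```python
# Figures quoted in the text; helper names are hypothetical.
BASES_PER_SNP_WITHIN_BOXERS = 1600
BASES_PER_SNP_BETWEEN_BREEDS = 900
SPECIES_WIDE_DIVERSITY = 8e-4  # differing bases per base, whole species


def snp_frequency(bases_per_snp):
    """Convert '1 SNP per N bases' into differing sites per base."""
    return 1.0 / bases_per_snp


def expected_snps(bases_per_snp, region_bp):
    """Expected number of differing sites in a region of region_bp bases,
    assuming SNPs are spread evenly (a simplification)."""
    return region_bp / bases_per_snp


within = snp_frequency(BASES_PER_SNP_WITHIN_BOXERS)
between = snp_frequency(BASES_PER_SNP_BETWEEN_BREEDS)

print(f"within boxers:  {within:.2e} per base")
print(f"between breeds: {between:.2e} per base")
print(f"species-wide:   {SPECIES_WIDE_DIVERSITY:.2e} per base")
print(f"differences expected in a 100 kb block within a breed: "
      f"{expected_snps(BASES_PER_SNP_WITHIN_BOXERS, 100_000):.1f}")
```

Expressed this way, the within-boxer value (6.25 × 10⁻⁴) and the between-breed value (about 1.1 × 10⁻³) fall on either side of the species-wide figure.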

Palaeosequencing – ancient DNA

Recovery of DNA from ancient samples

The recovery and sequencing of DNA from extinct species offers us a window onto evolutionary history. Source material includes fossils collected from their deposits, mummified samples, eggshells in the subfossil state (it was even possible, by testing DNA from the outer surface of moa shells, to show that male birds were responsible for incubating the eggs), preserved seeds or other plant material, specimens from museums, and clinical collections of pathogens. For instance, there are repositories of samples of influenza virus dating back almost a century. Although often only minuscule amounts of material are available, PCR amplification can produce reasonable quantities for sequencing.

Like other biological material, DNA degrades after death, unless it is preserved. The best protection is to exclude liquid water. This can occur if the samples are frozen, or desiccated by heat, or if sequestered within compartments such as teeth, bone, or hair.

Because DNA from ancient samples is usually present in only microscopic quantities, contamination is a serious danger. Contaminants can be microbial, or human from the scientists handling the specimens. Even without contamination, samples suffer from fragmentation, and from chemical change resulting in sequence changes. The most common is deamination of cytosine to uracil.

However, advances in isolation methods can reduce further damage during the extraction phase, and careful technique can exclude contamination by scientist DNA. Palaeosequencers must ‘rough it’ during field work, but apply unusually painstaking care in handling samples. It is a speciality practised by few, but of interest to many.

DNA from extinct birds

The moas of New Zealand

In the absence of terrestrial mammals, New Zealand’s largest animals were flightless birds. The moas ranged in size up to giants 3 m tall and weighing 300 kg (Figure 7.10). They became extinct after the arrival of human settlers from Polynesia.

Moas are ratites, an order of birds that also includes the New Zealand kiwi, the ostriches of Africa, the emu and cassowary of Australia and New Guinea, the rhea in South America, and the extinct elephant bird of Madagascar (Aepyornis maximus). This monster was 3.3 m tall, and weighed 450 kg!

The taxonomy of moas has been difficult to resolve by classical methods, in the absence of living examples.

[Figure 7.11 diagram: tree of moa species. Dinornithidae: Dinornis robustus, Dinornis novaezealandiae. Megalapterygidae: Megalapteryx didinus. Emeidae: Pachyornis australis, Pachyornis elephantopus, Pachyornis geranoides, Anomalopteryx didiformis, Emeus crassus, Euryapteryx curtus. Key: North Island fossils, South Island fossils.]

Figure 7.11 Phylogenetic tree of moa species, from mitochondrial DNA sequences. This classification divides moas into nine species, grouped into three families. Widths of coloured bands corresponding to species reflect intraspecies variation. North Island specimens in red, South in blue. For A. didiformis and the two Dinornis species – but not E. curtus – the clustering separates specimens from the two islands.
After: Bunce, M., Worthy, T.H., Phillips, M.J., Holdaway, R.N., et al. (2009). The evolutionary history of the extinct ratite moa and New Zealand Neogene paleogeography. Proc. Nat. Acad. Sci. U.S.A. 106, 20646–20651. Copyright (2009) National Academy of Sciences, U.S.A.

Mitochondrial DNA sequences from 29 specimens produced a phylogenetic tree comprising nine species, grouped into three families: the Dinornithidae, Megalapterygidae, and Emeidae (Figure 7.11). The data suggest that the Emeidae separated 5.27 million years ago, with more copious recent radiation within the last 2 million years.

It is interesting to correlate the estimated divergence times of the species with the geological history of New Zealand. Today, the North and South Islands are separated by Cook Strait, 23 km across at its narrowest. But, approximately 30–21 million years ago, sea levels reduced New Zealand to a few scattered islands. The somewhat larger South Island was isolated from the North, until about 2–1.5 million years ago.

Approximately 8.5–5 million years ago the New Zealand ‘alps’ formed. This mountain chain runs roughly parallel to the main axis of the South Island. It divides the habitats into wet rainforest on the West, and dry, warmer regions on the East.

Figure 7.10 Photograph of an assembled skeleton of Dinornis novaezealandiae, towering over celebrated 19th-century anatomist Sir Richard Owen. Based in London, Owen was the recipient of many interesting specimens discovered in the far reaches of the Empire, including the bones of moas, and a preserved specimen of the platypus. He was the driving force behind the establishment of the British Museum of Natural History in South Kensington. The first specimen from a moa to reach Owen was a 15-cm fragment of bone. Owen correctly identified it as coming from the femur of a giant bird. In this photo Owen is holding the original fragment in his right hand. With his left, he is indicating its position within the full skeleton, discovered and assembled later.
[Figure 7.12 panels: reconstructions of each genus around a map of New Zealand labelled North Island, South Island, Southern Alps, and Cook Strait.]

Family: Dinornithidae

Dinornis
Systematics: Two species, D. robustus (South Island, blue) and D. novaezealandiae (North Island, red).
Dimensions: 56–249 kg and 90 to 200 cm in height – significant sexual dimorphism with females up to three times the mass of males.
Habitat: Browsing generalist – has been found in upland, lowland, and open forest habitats. The larger forms occupied low rainfall areas.

Family: Megalapterygidae

Megalapteryx
Systematics: Monotypic, M. didinus (South Island).
Dimensions: 28–80 kg and 65 to 95 cm. Pleistocene specimens are significantly larger than Holocene forms.
Habitat: Subalpine scrub, grassland, and high country forests (usually >900 m).

Family: Emeidae

Anomalopteryx
Systematics: Monotypic, A. didiformis.
Dimensions: 26–64 kg and 50 to 90 cm.
Habitat: Non-coastal lowland forests with a continuous canopy.

Euryapteryx
Systematics: Monotypic, E. curtus (formerly E. gravis and E. geranoides).
Dimensions: 12–109 kg and 51 to 103 cm.
Habitat: Drier climates – typically lowland open forest and coastal sites.

Emeus
Systematics: Monotypic, E. crassus (South Island).
Dimensions: 36–79 kg and 73 to 99 cm in height.
Habitat: Preference for lowland forest (usually <200 m) and swamps.

Pachyornis
Systematics: P. geranoides (North Island), P. elephantopus (blue), and P. australis (green) (South Island).
Dimensions: 17–163 kg and 54 to 121 cm in height.
Habitat: P. australis occupied subalpine grassland; P. geranoides and P. elephantopus preferred lowland forest edges and wetland vegetation.

Figure 7.12 Reconstructions, classification, and estimated geographical distributions of species of moa, extinct flightless birds from
New Zealand.
From: Bunce, M., Worthy, T.H., Phillips, M.J., Holdaway, R.N., et al. (2009). The evolutionary history of the extinct ratite moa and New Zealand
Neogene paleogeography. Proc. Nat. Acad. Sci. USA 106, 20 646–20 651. Copyright (2009) National Academy of Sciences, USA.

The data on divergence of sequences are consistent with the historical geology. The suggested scenario is that the divergence of the major groups took place on the South Island, after the alps formed. When the land links arose, during glaciations, birds began to inhabit the North Island, taking advantage of the new surroundings to generate a new round of divergence (Figure 7.12).

The dodo and the solitaire

The dodo (Raphus cucullatus) was a large, flightless bird that inhabited the island of Mauritius, in the Indian Ocean, east of Madagascar (Figure 7.13). It was a robust bird, about a metre in height and weighing about 20 kg (about three times the size of a typical Thanksgiving turkey). It is common for island species to be either significantly larger, or significantly smaller than their mainland counterparts. Such substantial changes can obscure taxonomic positions.
Sailors stopping at the island found the dodo easy prey – it could not fly, and lacked appropriate fear of human predators. As a result, the dodo became extinct. The last survivor was shot in 1681.
The solitaire (Pezophaps solitaria) was a related bird from a neighbouring island east of Mauritius, Rodrigues. It also became extinct, outliving the dodo by perhaps a century.
Even museum specimens of the dodo are rare. The Oxford University Museum of Natural History had one. It was seen by don Charles Dodgson, who used it as a character in Alice in Wonderland. Only partially saved from a fire during a tidying-up exercise, the Oxford specimen is the only known source of soft tissues from a dodo. What remains now comprises a head, and a leg and foot, each with some skin attached. There are many bones on Mauritius, but preservation conditions on the tropical island are not conducive to preservation of DNA. It was the Oxford remnants that provided the material for DNA sequencing.
From the Oxford sample of the dodo, and samples from Rodrigues of the solitaire, it was possible to amplify and sequence short overlapping fragments between 120 and 180 bp. For comparison, corresponding sequences were analysed from many extant species of putative relatives, including various species of pigeons and doves. From the extant species, sequences were determined of 1.4 kb of mitochondrial DNA, comprising regions of the genes for 12S ribosomal RNA (360 bp) and for cytochrome b (1050 bp).
A phylogenetic tree constructed from these data fixed the taxonomic position of the dodo. It is a pigeon, of the family Columbidae. The closest extant relative of the dodo and solitaire is the Nicobar pigeon (Caloenas nicobarica), which lives in Southeast Asia.

Figure 7.13 Mauritius dodo.
From: ‘A German Menagerie Being a Folio Collection of 1100 Illustrations of Mammals and Birds’ by Edouard Poppig, 1841.

High-throughput sequencing of mammoth DNA

Mammoths are unusual among extinct organisms in the favourable conditions of their Arctic habitats for preservation of DNA. Even better than most specimens was a ∼28 000-year-old jawbone, found on the shore of Baikura-turku, a bay extending from the south-eastern corner of Lake Taymyr. Extraction from 1 g of bone yielded ∼0.73 mg DNA.
Fragments of the DNA attached to small sepharose beads were amplified in lipid vesicles by PCR. Six runs of a Roche/454 Life Sciences Genome Sequencer 20 System produced a total of 1 943 593 reads, with average length of ∼95 bp.
The DNA in the sample contained about 50% mammoth DNA, a mixture of nuclear and mitochondrial; the rest was bacterial contaminant. Using reference sequences to sort out mammoth sequences from bacterial sequences, and mammoth nuclear sequences from mitochondrial ones, the aggregate harvest was ∼95 Mb of mammoth sequence. This included 7.3-fold coverage of the 16 770 bp mitochondrial DNA. The remainder gave a partial view of the mammoth nuclear genome (∼3%).
In addition to what it tells us about the mammoth, the significance of this work is its demonstration of the power of (1) the latest instrumentation and (2) resequencing in determining the organelle genome. There is no need for selective amplification of the target sequence. Instead, it is possible to assemble the mitochondrial DNA from the very large quantity of data available, given the scaffolding available from a related sequence, in this case that of the Indian elephant. Note that, with over sevenfold coverage, the reference sequence is not really needed for the assembly, but rather for identifying the reads corresponding to mitochondrial DNA.
Indeed, computational experiments have shown that a reasonably good assembly is possible using as a reference sequence the mitochondrial DNA of the dugong, a distant relative. Ancestors of mammoths and dugongs diverged around 65–70 million years ago. Mammoth and dugong mitochondrial DNA sequences are only 75.3% identical. In practice, a reference sequence from a closer relative than the dugong is to mammoth would in most cases be available. Therefore the dugong–mammoth assembly offers a ‘worst-case’ analysis.
For the sake of argument, suppose one’s interest were limited to the mitochondrial DNA sequence. One might choose to amplify the mitochondrial DNA and sequence that only. Sequencing fragments from all of the DNA is, comparatively, very inefficient in its use of the data produced, as far as determining the mitochondrial sequence is concerned. However, comparisons with other ancient-DNA sequencing projects suggest that it is efficient in terms of the amount of precious sample used. For the study of extinct species, this is an overriding consideration.

The mammoth nuclear genome

The mammoth nuclear genome presented a harder problem. It is estimated to be approximately 4.17 Gbp in length, longer than the human genome.
DNA was extracted from hair samples from a Siberian animal, denoted M4, that died about 20 000 years ago. It is interesting that because the DNA is fragmented, the relatively short read lengths of the sequencer were not a great problem. The average read length produced was about 150 bp. This yielded 3.6 Gb of sequence. Combining this with additional sequences determined from other individuals produced a total of 4.17 Gb of sequence. Calibration of error rates suggests an average of 6 errors out of 10 000 bases arising from DNA damage, and 8/10 000 from sequencing.
The genome of the African elephant (Loxodonta africana) provided a reference sequence. The L. africana genome had been sequenced at 7× coverage, and assembled. Alignment of the reads from the mammoth samples, to L. africana and to other genomes representing potential contaminants, showed that over 90% of the reads were mammoth DNA, for a total of 3.3 Gb of mammoth sequence.
It was possible to compare amino acid sequences of mammoth proteins with the orthologues in elephants and other species. The results suggest that mammoth and African elephant differ, on average, in one residue per protein. It is difficult to assign selective or even functional significance to these, in general. Even in the cases of residues unique to mammoth, compared to a wide spectrum of other placental mammals, the sequence is only the starting point for investigation of the proteins thereby identified as interesting candidates for follow-up studies.

The phylogeny of elephants

Access to DNA from extinct species has allowed resolution of two problems in elephant phylogeny:
(a) How many species of African elephants are there?
There are two populations of elephants in Africa, living in the savannah and in the forest. Some authorities have described them as separate species: Loxodonta africana (savannah) and L. cyclotis (forest). Others have considered them
a single species, or regard L. cyclotis as a subspecies of L. africana.
(b) Are mammoths more closely related to African elephants or Indian elephants (Elephas maximus)?

Both questions have engendered considerable debate in the relevant specialist literature.
In order to assess phylogenetic relationships among African and Indian elephants and mammoths, the extinct American mastodon provided an outgroup. DNA from a tooth, estimated to be between 50 000 and 130 000 years old, provided 1.76 Mb of mastodon sequence. The corresponding regions from elephants and mammoths were also sequenced. The data set for analysis contained, from each species, approximately 40 000 bp of sequence from 375 loci. These data allowed comparison of the divergences between the groups.
The results showed that the variation between L. africana and L. cyclotis is approximately the same as between mammoths and Asian elephant (E. maximus). Note that mammoths and Asian elephants are not even in the same genus, but many people have believed that L. africana and L. cyclotis are the same species! Nevertheless, the conclusion is that mammoths are more closely related to Asian elephants than to African elephants, and that L. africana and L. cyclotis should be considered as separate species.

• Returning to the question of what defines species boundaries, it is clear that this distinction was drawn at least primarily on the basis of similarity of DNA sequence. What about the classical biological definition of whether the species hybridize in nature? (In captivity, even African and Indian elephants are interfertile.) L. africana and L. cyclotis do give rise to hybrids in the Uganda–Congo border region where their ranges overlap. However, as implied by the divergence of the DNA sequences, hybrids are in fact rare, and the two populations do maintain separate gene pools.

The splitting of a species into two has implications for conservation, because in principle splitting would require a separate decision about whether each were endangered. In fact, all of the world’s elephant species are endangered.
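
The sequencing yields quoted earlier for the mammoth jawbone and hair samples can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are invented here, the input figures are the rounded values given in the text, and the results should therefore be read as approximate.

```python
# Back-of-envelope check of the mammoth sequencing figures quoted in the text.
# All inputs are the rounded values from the text; function names are illustrative.

def total_bases(n_reads, mean_read_length):
    """Raw yield in base pairs: number of reads times mean read length."""
    return n_reads * mean_read_length

def fold_coverage(sequenced_bases, target_length):
    """Average number of times each base of the target is covered."""
    return sequenced_bases / target_length

# Jawbone sample: six runs, 1 943 593 reads of about 95 bp each.
raw_yield = total_bases(1_943_593, 95)
mammoth_yield = 0.5 * raw_yield          # "about 50%" of the DNA was mammoth

# Mitochondrial genome: 7.3-fold coverage of 16 770 bp.
mito_bases = 7.3 * 16_770

# Hair samples: 3.3 Gb of mammoth sequence against an estimated 4.17 Gbp genome.
nuclear_coverage = fold_coverage(3.3e9, 4.17e9)

# Combined error rate: 6 per 10 000 from damage plus 8 per 10 000 from sequencing.
error_rate = (6 + 8) / 10_000

print(f"raw yield:            {raw_yield / 1e6:.0f} Mb")
print(f"mammoth share (~50%): {mammoth_yield / 1e6:.0f} Mb")
print(f"mitochondrial bases:  {mito_bases / 1e3:.0f} kb")
print(f"nuclear coverage:     {nuclear_coverage:.2f}-fold")
print(f"combined error rate:  {error_rate:.2%}")
```

The raw yield of roughly 185 Mb, halved for the bacterial contaminant, is in line with the ∼95 Mb of mammoth sequence reported, and the mitochondrial reads account for only about 0.12 Mb of it. That small share is why sequencing everything is inefficient for the organelle genome alone, even though it is economical in sample.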

● RECOMMENDED READING

• General discussions of topics in eukaryotic evolution, many in the form of collections of papers:
Hirt, R.P. & Horner, D.S., eds. (2004). Organelles, Genomes and Eukaryote Phylogeny: An
Evolutionary Synthesis in the Age of Genomics. CRC Press, Boca Raton, FL, USA.
• On June 29, 2006, The Royal Society held a discussion meeting, Major steps in cell evolution:
palaeontological, molecular, and cellular evidence of their timing and global effects. The meeting
was organized by T. Cavalier-Smith, M. Brasier, and T.M. Embley, and published in Philosophical
Transactions of the Royal Society B, volume 361, issue 1470.
Katz, L.A. & Bhattacharya, D. (2006). Genomics and Evolution of Microbial Eukaryotes. Oxford
University Press, Oxford.
Baldauf, S.L. (2008). An overview of the phylogeny and diversity of eukaryotes. Journal of
Systematics and Evolution 46, 263–273.
Telford, M.J. & Littlewood, D.T.J., eds. (2009). Animal Evolution / Genomes, Fossils and Trees.
Oxford University Press, Oxford.
• An atlas of life forms, showing phylogenetic relationships and dates of divergence:
Hedges, S.B. & Kumar, S. (2009). The Timetree of Life. Oxford University Press, Oxford.

• Theory and applications of linkage disequilibrium:

Slatkin, M. (2008). Linkage disequilibrium – understanding the evolutionary past and mapping
the medical future. Nature Reviews Genetics 9(6), 477–85.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 7.1 What fraction of the intergenic space in the nuclear genome of A. thaliana is
occupied by transposons?
Exercise 7.2 On a photocopy of Figure 7.2, mark the approximate dates of the duplications in the
A. thaliana genome on the branch leading to eudicots.
Exercise 7.3 Calculate the average gene density in the nuclear, mitochondrial, and chloroplast
genomes of A. thaliana.
Exercise 7.4 Describe how you would test the assertion that the last common ancestor of
urochordates acquired the cellulose synthase gene by lateral transfer from bacteria. What
sequence information would you gather, and how would you analyse it?
Exercise 7.5 In Figure 7.5(b), which T. nigroviridis chromosomes have substantial regions of
synteny with human chromosome 10?
Exercise 7.6 On a photocopy of Figure 7.5(a), indicate the regions in which T. nigroviridis
chromosomes 5 and 13 contribute to human chromosome 15.
Exercise 7.7 Figure 7.12 shows that specimens of Euryapteryx curtus appear on both North and
South Islands. On which island is it likely that the species arose? What reasoning leads you to this
conclusion?
Exercise 7.8 In the dog, the variation in mitochondrial DNA sequences is lower than the variation
in nuclear DNA sequences. What does this suggest about the breeding behaviour of domesticated
dogs?
Exercise 7.9 For which of the following domesticated species could a population survive if
released into the wild? Dog, cat, chicken, parakeet, maize, rice, and wheat.
Exercise 7.10 It is much easier to study mitochondrial DNA than nuclear – it is smaller, and more
abundant in cells. What is the danger of assigning phylogeny by comparing populations through
sequencing mitochondrial DNA of various individuals, in species that are matrilocal (that is, females
remain within a herd, males leave)? An example was a study of elephants, based on mitochondrial
DNA sequences. What criticism might be raised?

Problems
Problem 7.1 Figure 7.14 shows an alignment of globins from Ciona intestinalis and human
haemoglobin α and β, myoglobin, cytoglobin, and neuroglobin. Some N- and C-terminal
extensions have been trimmed. (a) Which pair of globins has the largest number of identical
residues in this alignment? (b) Are the Ciona globins more similar to one another than the human
globins are to one another? (c) Which human globin do the Ciona globins most resemble?

Ciona intestinalis and human globins

                            10        20        30        40        50        60
                             |         |         |         |         |         |
Ciona globin 1       MPFTDEELKLLRDSWDEVKKLGMKEVGLHIFTGLLNAAPSLRTLFYTIDLPDEEELTID
Ciona globin 2       MGLTTEEIGLLRSSWNEMKTIGMKELGLLIFHRLFSDVPRIRKMFYNLELPDDETLTME
Ciona globin 3       MSLTSEQVVLLRSSWQTIGKLGMSNVGLAVLHRLFNDVPETLPFFHSVLSP_TQQTEIE
Ciona globin 4      DEGLKRSDIINIQDSWNTLKGFGYETVGMLVLHRLFNDAPQTRYLFSQLSLSSNESFTLE
Human haemoglobin α  MVLSPADKTNVKAAWGKVGAH_AGEYGAEALERMFLSFPTTKTYFPHF__________D
Human haemoglobin β MVHLTPEEKSAVTALWGKVN___VDEVGGEALGRLLVVYPWTQRFFESFGDL____STPD
Human myoglobin      MGLSDGEWQLVLNVWGKVEAD_IPGHGQEVLIRLFKGHPETLEKFDKFKHLKSE____D
Human cytoglobin    SEELSEAERKAVQAMWARLYAN_CEDVGVAILVRFFVNFPSAKQYFSQFKHMEDP____L
Human neuroglobin      MERPEPELIRQSWRAVSRS_PLEHGTVLFARLFALEPDLLPLFQYNCRQFSS___PE
                                   W           G           P     F

                            70        80        90       100       110       120
                             |         |         |         |         |         |
Ciona globin 1      VMRENKKVVAHATRIANAISKFIKFLDQPDELEKLLTSLGESHARRQ_VDPESFEYVAPV
Ciona globin 2      AMRSNQKMSRHATRIATSISTYLKLADQPEELKTFLNGLGELHAGHN_VEPEDFEYLAPV
Ciona globin 3      VLKSNAKVVRHASRVGLSIDKIINLLDNGEELVKYLLFLGQVHVKRS_IPRKYFSAMGPV
Ciona globin 4      QMRNNSRVVYHANRVARAVGRLVDLIELPTNFTDHLVWLGQRHAYHG_VAPVNFDYMGPV
Human haemoglobin α LSHGSAQVKGHGKKVADALTNAVAHVDD___MPNALSALSDLHAHKLRVDPVNFKLLSHC
Human haemoglobin β AVMGNPKVKAHGKKVLGAFSDGLAHLDN___LKGTFATLSELHCDKLHVDPENFRLLGNV
Human myoglobin     EMKASEDLKKHGATVLTALGGILKKKGH___HEAEIKPLAQSHATKHKIPVKYLEFISEC
Human cytoglobin    EMERSPQLRKHACRVMGALNTVVENLHDPDKVSSVLALVGKAHALKHKVEPVYFKILSGV
Human neuroglobin   DCLSSPEFLDHIRKVMLVIDAAVTNVEDLSSLEEYLASLGRKHRAVG_VKLSSFSTVGES
                              H                           l   H          f

                           130       140       150       160
                             |         |         |         |
Ciona globin 1      ILSVIGGHLKLPSNSPTLQAWVKAYGVLRNGIVSAMEA_____
Ciona globin 2      MLAVIGGQLNLNSNSSILQAWVKAYGVLRNGIVRGMYAYQG__
Ciona globin 3      LLSVISAVLEKDLDAPVMQAWATAYGVIEQGIIDGM_______
Ciona globin 4      LLETIKVNLELPSDSPTLSAWAKAYGVIKNGIKDAIIATYAEG
Human haemoglobin α LLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR______
Human haemoglobin β LVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH______
Human myoglobin     IIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG
Human cytoglobin    ILEVVAEEFASDFPPETQRAWAKLRGLIYSHVTAAYKEVGWVQ
Human neuroglobin   LLYMLEKCLGPAFTPATRAAWSQLYGAVVQAMSRGWDGE____
                                       a

Figure 7.14 Alignment of globins from Ciona intestinalis and human haemoglobin α and β, myoglobin, cytoglobin, and neuroglobin.
Some N- and C-terminal extensions have been trimmed.
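
Counting identical residues by eye across nine aligned sequences is error-prone, so a short script may help with comparisons of the kind Problem 7.1 asks for. The sketch below is one possible approach, not a prescribed method: the function names are invented here, the three toy rows are made up (they are not rows of Figure 7.14), and it assumes every row has been padded to the same length, with `_` or a space marking a gap.

```python
from itertools import combinations

GAPS = "_ "  # characters treated as alignment gaps (assumed convention)

def count_identities(row_a, row_b, gaps=GAPS):
    """Count aligned columns in which both rows carry the same residue.

    Columns in which either row has a gap never count as identical.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a not in gaps)

def rank_pairs(alignment):
    """Return (identities, name_a, name_b) for every pair, most similar first."""
    scores = [
        (count_identities(alignment[a], alignment[b]), a, b)
        for a, b in combinations(sorted(alignment), 2)
    ]
    return sorted(scores, reverse=True)

# Toy rows, invented purely to show the call pattern.
toy = {
    "seq1": "MKV_LTA",
    "seq2": "MKI_LSA",
    "seq3": "MRV__TA",
}

for n, a, b in rank_pairs(toy):
    print(f"{a} vs {b}: {n} identical columns")
```

To apply it to the figure, each sequence would be entered as a single string formed by joining its three blocks, so that all nine strings have the same length.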

Weblems
Weblem 7.1 (a) Has at least one organism from each of the major eukaryote classes shown in
Figure 7.1 been the subject of a full-genome sequencing project? For each class, give an example
of such a species if possible. (b) For which eukaryotic phyla has at least one species been the subject
of a full-genome sequencing project? For each phylum, give an example of such a species if possible.
Weblem 7.2 Find a yeast protein in each of the first ten functional categories in Table 7.1.
Weblem 7.3 What is the latest common ancestor of the human and the aardvark? (Hint: compare
the full taxonomy listings in any entry of a human and an aardvark sequence.)
Weblem 7.4 The UniProtKB entry for chicken ovocleidin 116 is Q9PUT1_CHICK. This protein is
eggshell-specific in chickens. Does it have any mammalian homologues?
Weblem 7.5 Mammals excrete nitrogenous waste in the form of urea. Birds excrete uric acid. Urea
in mammals is formed by the urea cycle, a metabolic pathway that forms urea (H₂NC(=O)NH₂)
from ammonia and aspartate. The first enzyme in this pathway is carbamoyl phosphate synthetase
I, which catalyses the reaction:
2ATP + HCO₃⁻ + NH₄⁺ → 2ADP + H₂NC(=O)OPO₃²⁻ (carbamoyl phosphate) + Pᵢ
(a) Does the chicken contain a homologue of human carbamoyl phosphate synthetase I?
(b) Is there reason to believe it is not functional? (c) In what tissues in chicken is it expressed?
(d) What might be the function of this enzyme in the chicken?
CHAPTER 8

Genomics and Human Biology

LEARNING GOALS

• To understand the science underlying the use of genomics for personal identification.
• To recognize what characteristics of an unknown individual can be inferred from a sample
of blood or saliva, and that the use of these inferences in criminal investigation remains
controversial, with different jurisdictions adopting different regulations.
• To appreciate how mitochondrial DNA sequences were used to identify the remains of the
Russian royal family.
• To see the domestication of crop plants as experiments in directed genome change, and to
appreciate that we can now analyse the genetic differences between the wild progenitor and
the varieties used in contemporary farming.
• To consider the relationship between humans and Neanderthals, based on a sequencing of the
Neanderthal mitochondrial and nuclear genomes.
• To understand past patterns of human migration, as reflected in mitochondrial DNA haplotypes.

A prominent theme in our presentation of genomics has been the potential for applications that
improve the health of humans, animals, and plants. In this short chapter we collect a few
applications of genomics to some of the other human sciences.

Genomics in personal identification

Legal applications of DNA sequencing depend on several scientific facts:

1. The genomes of all individuals except identical siblings are unique. Like fingerprints, genomes provide a unique personal identification. A bloodstain at a crime scene, like a set of fingerprints, can be traced to a specific individual.
2. The genome of every person combines chromosomes from his or her parents. Unlike fingerprints, therefore, genomes can indicate familial relationships; notably, identification of paternity.
3. Each person’s genome contains genes that influence, even if they do not inevitably determine, recognizable features, such as eye colour. In principle, DNA left by an unknown individual at a crime scene could be analysed to suggest a physical description of the source individual.
4. Thus, unlike fingerprints, genomes contain much more information about a person than simple identification. The treatment of this information by governmental authorities raises ethical and legal questions. We have already raised some of these questions in Chapter 1.

The use of molecular characteristics to identify people and relationships is a century old. The earliest methods applied the classical blood groups – A, B, AB, and O. A suspect with blood type O must be innocent of a crime committed by a person of blood type A. A person of blood type O could not be the parent of a child of type AB. However, many people share the same blood type. Therefore, blood typing can prove innocence but not guilt. In contrast, DNA sequences can provide positive proof of guilt or paternity.
A. Jeffreys at the University of Leicester discovered DNA ‘fingerprinting’ in 1984, when he and his colleagues compared the sizes of the restriction fragments of DNA samples including a human family group (father, mother, and child). Different individuals give different patterns – the gel provided a ‘bar code’ unique to each individual. (This bar code is different from the one used for species identification – see p. 117.) Moreover, the pattern from the child’s DNA was a combination of those of the parents (see Figure 8.1).

Figure 8.1 Pattern of a gel showing DNA fingerprints from a mother (M), father (F), and child (C). Every band in lane C matches one appearing in lane F or lane M or both. The bands on the gel correspond to restriction fragments from a complete digest by HinfI. The gel separates fragments according to size.

Why do different individuals give different patterns of restriction fragment sizes? One possible cause of the difference is a mutation in a restriction site, causing that site not to be cleaved. In this case, two fragments from unmutated DNA will correspond to a single longer fragment from the mutated sample. (In terms of our analogy between restriction maps and distances between consecutive Starbucks cafes on Broadway in New York City, imagine the effect on the pattern if one of the cafes were to close; see p. 90.) Alternatively, somewhere between two restriction sites there may be a short repetitive stretch of DNA, the number of copies of which is unstable during replication, where the polymerase ‘stutters’. Expansion of such a repeat will lengthen the restriction fragment in which it appears. The fragment will occupy a different position on a gel that separates fragments according to size.
Such a short repetitive segment of DNA is called a variable number tandem repeat (VNTR). VNTRs are
generally flanked by recognition sites for the same restriction enzyme, which will neatly excise them, producing fragments of different lengths (see Box 8.1). It is these fragment lengths that vary between individuals, a variation known as restriction fragment length polymorphism (RFLP). The fragments can be separated on a gel according to size and detected by Southern blotting.

• VNTRs are characteristics of genome sequences; RFLPs are artificial mixtures of short stretches of DNA created in the laboratory in order to identify VNTRs.

The patterns were easy to determine from a sample of DNA. Jeffreys and his co-workers quickly established that they were unique to individuals, providing a ‘genetic fingerprint’.
The first legal application, in 1985, was to a case of disputed identification involving a family of UK citizens. A child in the family visited Ghana. When he returned to the UK, immigration authorities suspected him of being an impostor, not entitled to UK residency. None of the classical blood tests, including A, B, AB, O, and other blood groups, and even MHC haplotyping – which gives much higher discrimination – produced definitive results. Indeed, there was a possibility that the boy was related to the woman who claimed to be his mother, but perhaps he was her nephew rather than her son. Quite fine distinctions were therefore essential. Jeffreys’ DNA fingerprints, comparing the patterns from the child’s DNA with those of members of the UK family, proved his identity to the satisfaction of the Home Office. The family were reunited.
Jeffreys also applied his method to criminal identification. In the first case, DNA fingerprinting proved the innocence of a suspect who had actually confessed to two crimes. The true criminal was discovered after a survey of DNA samples from almost 4000 people living in the region. In this single case, DNA fingerprinting proved both the innocence of a man under arrest, in serious danger of conviction and punishment; and the guilt of the real criminal.
A substantial number of persons convicted and sent to jail before Jeffreys’ discovery have subsequently been proved innocent by analysis of samples saved from the evidence presented at their trial.
Despite its successes, identification by gel separation of RFLPs has disadvantages in practice. It requires relatively large amounts of undegraded DNA (10–50 ng of material no shorter than 20 000–25 000 bp). Since the development of PCR, DNA-based identification methods have tested for the presence of selected regions known to vary in the population, using PCR to amplify those present. This greatly improves sensitivity. Subnanogram amounts suffice to identify 100 bp regions. It is possible to get a positive identification from a single hair (of a person, or in one case of the cat of a criminal’s parents), or from the saliva on a licked envelope.
The method in common use now is to PCR amplify a short tandem repeat (STR) typically containing 2–5 bp, repeated between a few and a dozen times. Amplification produces fragments about 200–500 bp long. Loci in common use show 5–20 common alleles, and 8–15 loci are tested.
Jeffreys has recently introduced a newer identification method based on a single VNTR locus (see Problem 8.1).
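
To see why testing 8–15 STR loci with 5–20 alleles each discriminates so sharply, it can help to work out how often two unrelated people would share a whole profile by chance. The sketch below is a simplified illustration and not the procedure used by forensic laboratories. It assumes Hardy–Weinberg genotype proportions, independence between loci, and, purely for convenience, equally frequent alleles; real allele frequencies are unequal and are taken from population databases. The function names are invented here.

```python
def locus_match_probability(allele_freqs):
    """Chance that two unrelated people share a genotype at one locus.

    Assumes Hardy-Weinberg proportions: a homozygote has frequency p*p,
    a heterozygote 2*p*q. The match probability is the sum of squared
    genotype frequencies.
    """
    total = 0.0
    n = len(allele_freqs)
    for i in range(n):
        for j in range(i, n):
            if i == j:
                genotype = allele_freqs[i] ** 2
            else:
                genotype = 2 * allele_freqs[i] * allele_freqs[j]
            total += genotype ** 2
    return total

def profile_match_probability(loci):
    """Multiply per-locus probabilities, assuming the loci are independent."""
    p = 1.0
    for freqs in loci:
        p *= locus_match_probability(freqs)
    return p

# Low end of the ranges in the text: 8 loci, 5 equally frequent alleles each.
low = profile_match_probability([[1 / 5] * 5] * 8)
# High end: 15 loci, 20 equally frequent alleles each.
high = profile_match_probability([[1 / 20] * 20] * 15)

print(f"8 loci, 5 alleles:   about 1 in {1 / low:.2e}")
print(f"15 loci, 20 alleles: about 1 in {1 / high:.2e}")
```

Even at the low end of the quoted ranges the chance match is of the order of one in a billion under these idealized assumptions. Relatives share alleles far more often than unrelated people, which this calculation deliberately ignores.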

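The two sources of restriction fragment length variation described for RFLPs, loss of a restriction site and expansion of a tandem repeat, can be imitated with a toy digest. The sketch below is an illustration under stated assumptions: the sequence built by `toy_locus` is invented, not real genomic DNA, and the cut position one base into the HinfI site GANTC is an assumption of this example.

```python
import re

HINFI = re.compile(r"GA[ACGT]TC")  # HinfI recognition sequence 5'-GANTC-3'
CUT_OFFSET = 1                     # assumed cut position: after the first base of the site

def digest(seq, site=HINFI, offset=CUT_OFFSET):
    """Fragment lengths, longest first, from a complete digest of a linear molecule."""
    cuts = [m.start() + offset for m in site.finditer(seq)]
    edges = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)

def toy_locus(n_repeats, middle_site="GAATC"):
    """Invented sequence: three HinfI sites, with a tandem repeat between the first two."""
    spacer = "ACCTACCTAC"  # 10 bp filler that contains no HinfI site
    unit = "CATT"          # hypothetical 4 bp repeat unit
    return (spacer + "GAATC" + unit * n_repeats + middle_site
            + spacer + "GAATC" + spacer)

reference = digest(toy_locus(3))
site_lost = digest(toy_locus(3, middle_site="GAATA"))  # point mutation destroys the middle site
expanded = digest(toy_locus(5))                        # repeat expands from 3 to 5 copies

print("reference:       ", reference)
print("site mutated:    ", site_lost)
print("repeat expanded: ", expanded)
```

In the mutated sample the 17 bp and 15 bp fragments of the reference are replaced by a single 32 bp fragment, and in the expanded sample the 17 bp fragment grows by two repeat units to 25 bp. On a gel, either change would move a band.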
• DNA fingerprinting has shown itself to be a very reliable and useful method of personal identification. We discussed ethical, legal, and social issues associated with databanks of DNA sequences in Chapter 1.

BOX 8.1 The two restriction enzymes in most common use in DNA profiling: HaeIII and HinfI

Sequence specificity (arrows mark the cleavage positions)

HaeIII   5′...GG↓CC...3′       HinfI   5′...G↓ANTC...3′
         3′...CC↑GG...5′               3′...CTNA↑G...5′

Mitochondrial DNA

Human mitochondrial DNA is 16 569 bp long.
It contains a hypervariable 100 bp region, which
varies by 1–2% between unrelated individuals. The mitochondrial DNA of unrelated people typically differs at eight positions. Mitochondrial DNA is very abundant and survives very well. It was used to identify the remains of the Russian royal family (see Box 8.2).

Gender identification

It is possible to decide whether a nuclear DNA sample came from a male or female. Obviously, detection of any sequence unique to the Y chromosome will prove male origin. Another technique in common use exploits the different versions of the gene for amelogenin on the X and Y chromosomes. The X version contains a 6 bp deletion. PCR amplification of this region from a female will give one band from the two identical X copies of the gene; DNA from a male will give two bands, one from the X and one from the Y.
DNA identification has provided evidence in several very high-profile cases that readers will be familiar with.

• The trial of O.J. Simpson for the murder of his former wife; he was acquitted despite presentation of evidence by the prosecution that he was the source of fresh bloodstains found at the scene of the crime.
• A stain on a White House intern’s dress provided evidence against US President William J. Clinton.
• Comparison of DNA from descendants of early 19th-century US President Thomas Jefferson and Sally Hemings, a slave on his Virginia plantation, proved Jefferson to be the father of Hemings’ children.

Less sensational, but more important in everyday law enforcement, is the fact that DNA evidence is sufficiently definitive and widely accepted to avoid many trials, by not indicting innocent people.
Applications of DNA identification techniques to animals include the proof of claims that Dolly the sheep was indeed a clone, testing of horses and dogs to confirm breeders’ claims of pedigrees, testing of commercial whale meat to check for endangered species, and even a suggestion of creating a database

BOX 8.2 Identification of the remains of the family of Tsar Nicholas II from analysis of mitochondrial DNA

For most of us, all of our mitochondria are genetically identical, a condition called homoplasmy. However, in some individuals, different mitochondria contain different DNA sequences; this is called heteroplasmy. Such sequence variation in a disease gene in the mitochondrial genome can complicate the observed inheritance pattern of the disease.
The most famous case of heteroplasmy involved Tsar Nicholas II of Russia. After the revolution in 1917, the Tsar and his family were taken into exile in Yekaterinburg in Central Russia. During the night of 16–17 July 1918, the Tsar, Tsarina Alexandra, at least three of their five children, their physician, and three servants who had accompanied the family were killed and their bodies buried in a secret grave. When the remains were rediscovered, assembly of the bones and examination of the dental work suggested – and sequence analysis confirmed – that the remains included an expected family group. The identity of the remains of the Tsarina was proved by matching the mitochondrial DNA sequence with that of a maternal relative, Prince Philip, Chancellor of the University of Cambridge, Duke of Edinburgh – and grandnephew of the Tsarina. (Prince Philip’s shared maternal line with Alexandra means that in principle his chances of suffering from haemophilia were 12.5%.)
However, comparisons of mitochondrial DNA sequences of the putative remains of Nicholas II with those of two maternal relatives revealed a difference at base 16 169: the Tsar had a C and the relatives a T. Extreme political and even religious sensitivities mandated that no doubts were tolerable. Further tests showed that the Tsar was heteroplasmic; T was a minor component of his mitochondrial DNA at position 16 169. To confirm the identity beyond any reasonable question, the body of Grand Duke Georgij, brother of the Tsar, was exhumed and was shown to have the same rare heteroplasmy.
The domestication of crops 241

to identify dogs whose owners do not clean up after them in municipal parks.

Physical characteristics

Suppose a sample containing DNA is collected at a crime scene, and there is reason to believe
that a criminal deposited it. It is possible to use the sample for identification. But
suppose the source individual is not represented in the forensic databanks, and is not one of
the suspects – usual or unusual – rounded up. It is still technically feasible to make some
inferences about the person the police are looking for.

It is possible to predict certain physical characteristics from analysis of DNA sequences.
Gender, obviously, but also colour of hair, eyes, and skin, and ethnic background. Use of
these inferences for suspect profiling is controversial.

In some cases, it is possible to infer the source individual’s family name! Oxford don Bryan
Sykes discovered that all males named Sykes in the UK are descendants of a single founder
individual, and all carry specific diagnostic features of their Y-chromosome sequences.
Police could deduce, from a sample left at a crime scene, whether the source individual was
named Sykes or not (unless, of course, he changed his name).

Other possible analyses of a blood sample left at a crime scene might provide an estimate of
the time of day of deposition. Certain chemicals, for instance melatonin, vary in
concentration in blood and saliva following regular circadian rhythms. Such tests do not
involve DNA. DNA methylation correlates with age. It might not be too fanciful to imagine a
police investigator asking:

Now then, Grandfather Sykes, where were you at 11 pm last night?

• In addition to matching a DNA sample with an individual, it is possible to analyse
crime-scene samples to infer several characteristics, including eye and hair colour,
complexion, and ethnicity. Use of these inferences in criminal investigation remains
controversial, and there is substantial variation in what different jurisdictions permit.

The domestication of crops

The transition from hunting/gathering to agriculture represents a major change in human
activity, diet, and social and economic organization. Domestications changed the biology of
plants and animals, and even of humans – for instance, the ability to digest lactose past
infancy is associated with domestication of cattle.

Many different plants were domesticated, in different regions around the world (Figure 8.2).
Although these domestications were independent events, there are many common features:

• Characteristics that improve the product:
– enlargement of fruit and/or seed.
– improved flavour and/or nutrition.
• Characteristics that facilitate harvesting:
– synchronization of ripening time.
– larger central stalks relative to side shoots – technically, increased apical dominance.
For instance, contemporary maize has a central stalk, with the ears growing at the tips of
short branches. The ancestral species, teosinte, was highly branched (see Figure 8.3). This
allows more plants per unit area tilled; it is analogous to building skyscrapers in cities.
– seeds do not fall off the plant (called shattering). However, to facilitate harvesting the
link of the seed to the plant should be relatively weak. The loss of seed dispersal can
render the plant no longer viable in the wild.
• Tillering – shoots fill empty spaces between plants. This makes it unnecessary to plant
seeds at specific intervals.
• Increased self-pollination.

These favourable properties are the result of genetic changes during domestication.
Documented types of changes include: amino acid substitutions, deletion/truncations altering
the functions of individual proteins,
242 8 Genomics and Human Biology

The location of the known independent centres of domestication (map labels):
Pepo squash 5000 B.P.; Sunflower 4800 B.P.; Chenopod 4000 B.P.; Marshelder 4400 B.P.;
Broomcorn millet 8000 B.P.; Foxtail millet 8000 B.P.; Pepo squash 10 000 B.P.;
Maize 9000–7000 B.P.; Common bean 4000 B.P.; African rice 2000 B.P.; Pearl millet 3000 B.P.;
Sorghum 4000 B.P.; Rice 8000 B.P.; Foxnut 8000 B.P.; Emmer wheat 10 000 B.P.;
Einkorn wheat 10 000 B.P.; Barley 10 000 B.P.; Arrowroot 8000 B.P.;
Yam (D. trifida) 6000 B.P.; Cotton 5000 B.P.; Sweet potato 4500 B.P.; Peanut?;
Manioc 8000 B.P.; Yam (D. alata) 7000 B.P.?; Chile peppers 6000 B.P.; Banana 7000 B.P.;
Taro 7000 B.P.?; Potato 7000 B.P.?; Quinoa 5000 B.P.

Figure 8.2 Populations in many regions of the world have domesticated plants. Dates shown are based on archaeological evidence,
not DNA sequence analysis. (B.P. = before present.)
From: Doebley, J.F., Gaut, B.S., & Smith, B.D. (2006). Cell 127, 1309–1321.


Figure 8.3 Comparison of modern maize with its teosinte progenitor (Z. mays subsp. parviglumis). (a) Teosinte grows many long,
tasselled branches. (b) In modern maize, many short branches bear the ears at their tips. (c) The kernels of teosinte are encapsulated
in a hard compartment. This picture shows both mature (left, dark) and immature (right) kernels. (d) Comparison of kernels of teosinte
and modern maize.
Sources: (a) From US Department of Agriculture, Natural Resources Conservation Service, Plant Materials Program, Plant Release Photo Gallery.
(b) Photo by David T. Webb, distributed by the Botanical Society of America. (c, d) Photographs by Hugh Iltis.

transposon insertion, regulatory changes, splice-site mutation, and gene duplications.

In general, domesticated plants show lower genetic diversity than their wild, progenitor
species. Causes of loss of genetic diversity include (a) population bottlenecks and/or
(b) selective sweeps. Comparisons with genes that are selectively neutral, with respect to
domestication phenotypes, reveal the relative importance of these two effects. Selectively
neutral genes lose genetic diversity only through bottlenecks.
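
The comparison described above can be sketched numerically. One common summary of genetic diversity is nucleotide diversity, the average number of pairwise differences per site among aligned sequences. The Python sketch below computes it for wild and domesticated samples at two loci and reports the fraction of wild diversity retained at each. The sequences are invented toy data and the function names are illustrative assumptions. Under the reasoning in the text, the neutral locus estimates the loss due to the bottleneck alone, so a locus that retains markedly less diversity than that is a candidate target of a selective sweep.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Average number of pairwise differences per site among aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)

def retained_diversity(wild, domesticated):
    """Fraction of the wild sample's diversity that remains after domestication."""
    return nucleotide_diversity(domesticated) / nucleotide_diversity(wild)

# Invented toy alignments (four sequences of eight sites each).
wild_neutral = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "TCGTACGT"]
dom_neutral = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "ACCTACGT"]
wild_selected = ["GGATCCAA", "GGATCCAT", "GCATCCAA", "AGATCCAA"]
dom_selected = ["GGATCCAA", "GGATCCAA", "GGATCCAA", "GGATCCAT"]

neutral_ratio = retained_diversity(wild_neutral, dom_neutral)
selected_ratio = retained_diversity(wild_selected, dom_selected)
print(f"neutral locus retains {neutral_ratio:.2f} of wild diversity")
print(f"candidate locus retains {selected_ratio:.2f} of wild diversity")
```

In this toy case the candidate locus retains less diversity than the neutral one, which is the pattern a selective sweep superimposed on a bottleneck would be expected to produce.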

A very interesting question about the genetics of domestication is: to what extent were the
genes selected for in domestication present in progenitor populations, and to what extent
are they de novo mutations? Examples of both types are known. The large genetic variability
in the progenitor population makes it harder to find a rare allele even if it were present in
the original population.

• Studies of the genomes of crop plants have applications to agriculture, including the
search for varieties that produce yields improved in quality and quantity, require less
fertilizer and pesticides, and are resistant to disease. If genomes of wild progenitor
species are also available, it is possible to study the genomics of domestication.

Maize (Zea mays)

The cultivation of maize (Zea mays) supported the pre-Columbian civilizations of Central and
South America. Maize was brought to Europe in the 15th and 16th centuries and quickly spread
around the world. Maize is now the world’s third-largest crop plant, after rice and wheat.
The annual harvest amounts to 7 × 10¹¹ kg, raised on 3.3 × 10¹¹ km² of land. Maize is grown
primarily as food, for humans and animals, but some is converted into ethanol for fuel.

In the late 20th century, the intensive development of high-yielding crop varieties began.
This ‘green revolution’, in addition to improving yields, bred maize for a higher content of
lysine. Corn was formerly lysine-poor. As a result, people with diets based primarily on
maize were traditionally lysine-deficient.

• In the USA and Canada, maize is called corn.

Maize is a domesticated form of a grass called teosinte. Several varieties of wild teosinte
survive in Mexico. Analyses of isozyme and microsatellite diversity – ‘paternity tests’ for
species – have identified the progenitor of maize as a teosinte subspecies, Z. mays subsp.
parviglumis. A combination of archaeological and molecular evidence suggests that teosinte
was domesticated in southern Mexico between 6000 and 9000 years ago. The earliest artefacts
showing domesticated maize are cobs found in a cave in Oaxaca, Mexico, dated 6250 years ago.

Rafael Guzmán, an undergraduate student at the Universidad de Guadalajara, discovered the
teosinte progenitor strain of maize during field work in southwestern Mexico in late 1978.
Guzmán was stimulated by a challenge contained in a New Year’s card from botanist Hugh
Iltis. The importance of his discovery for maize science cannot be overestimated. Access to
the progenitor strain makes possible detailed comparisons of sequences. Maize and teosinte
are still interfertile, permitting the reintroduction of specific alleles. The teosinte that
Guzmán discovered contains unique virus-resistance genes that have been bred into the maize
used in agriculture.

Notable differences between teosinte and modern maize include:

• Teosinte has many long branches tipped by tassels. The tassels correspond to male flowers.
Modern maize has a single main stalk, with the tassel at the top. The many short lateral
branches bear the ears at their tips. The ears, containing the seeds, of course develop from
female flowers. (Separate male and female flowers are a feature of varieties of teosintes
and maize, not shared by other grasses.)
• The teosinte ear contains 5–12 kernels. Their structure adapts them for dispersal by
passage through the digestive tracts of birds and mammals. Individual seeds grow inside an
encasing glume hardened by silica and lignin (see Figure 8.3c). For human consumption, the
kernels would have to be ground, or ‘popped’ by heating. At maturity, teosinte seeds
separate spontaneously, another aid to dispersal. In contrast, an ear of modern maize has
several hundred kernels, with no hard casing, and the kernels remain fixed to the ear (see
Box 8.3).

These very large phenotypic differences between teosinte and maize appear in plants that
have very similar genomes. This presents a paradox. Its resolution must emerge from study of
the genomes, the proteins, and the expression patterns. How did the genome change upon
domestication? How many genes were involved? Which genes? How have the changes in the genome
produced the phenotypic effects?

BOX 8.3 Glume and doom: modern maize could not survive in the wild

Loss of teosinte’s hard and adhering seed casing in modern maize is fatal for its natural
method of seed dispersal. If eaten by birds or other animals, the seeds would not pass
through and be sown. They would be digested, as they are when we eat maize. As a result,
modern maize could not survive in the wild. It depends on humans to plant seed each year.

There is a symmetry, literally a symbiosis: human populations – including Mesoamerican
civilizations and the earliest pilgrim settlements in New England – have been dependent on
maize for food, and maize is dependent on humans for sowing its seed. One could say –
depending on one’s point of view – that maize has domesticated humans. This is a
characteristic of many but not all domestications.

• For maize we are fortunate to have both the modern varieties and the progenitor, teosinte.

The first clues to the extent of the genetic change came from classical genetics. G. Beadle –
better known for his ‘one-gene–one-enzyme’ hypothesis – crossed a strain of maize with
teosinte and examined the second generation (F2). He concluded that domestication involved
about five major loci. The experiment was not intended to pinpoint the number of altered
genes – Beadle wanted primarily to distinguish whether a large or a small number of loci were
involved.

• Some people believed that the phenotypic differences were so great as to preclude
conversion of teosinte to maize by human selection. Beadle’s point was that the genetic
differences might be simpler than suspected.

The 2.3 Gb genome of maize was sequenced in 2009. With the availability of sequence
information, it is possible to determine which genes in maize have been subject to
selection. Sites selected during domestication should show enhanced loss of genetic
diversity. There should be a signal from these sites, observable in comparisons of sequence
diversity between maize and teosinte. This effect is amplified by the high natural variation
in the progenitor population: in Z. mays subsp. parviglumis, the nucleotide diversity at
silent sites is as high as 2–3%. (Of course, these numbers may not accurately reflect the
characteristics of a small founder population that gave rise to maize.) The overall loss in
diversity upon domestication is estimated at ∼25–30%, attributable to the ‘domestication
bottleneck’. The signal of selection must stand out, at a higher conservation level, above
this background.

• What are the signatures of selection? A decrease in nucleotide diversity, increased linkage
disequilibrium, and altered population frequencies of polymorphic nucleotides in a gene and
linked regions.

The results of such sequence comparisons suggest that about 3%, or about 1200 genes, were
targets of selection during maize domestication. At least 50 contribute to agronomically
significant traits. Many of the genes involved in morphological changes are clustered around
loci that may correspond to Beadle’s observations. Others affect biochemical characters that
Beadle did not investigate.

Some of the genes involved have been identified as transcription regulators.

• One gene that differs between teosinte and maize is tb1 (tb = teosinte branched). The maize
version represses lateral shoot development and converts tassels to ears (compare Figures
8.3a and 8.3b). Maize has ∼30% of the diversity in teosinte populations in the protein-coding
region of this gene – approximately the background level of diversity changes upon
domestication – but only ∼2% of diversity of teosinte in the region 5′ to the gene. TB1 is a
repressor, which binds to specific sites in promoters of cell-cycle genes. In this case,
selection has been applied not to the coding region of the gene but to a regulatory region.
Indeed, the transcription level of tb1 in maize is higher than that of teosinte. This appears
to be the primary mechanism of action of the genomic difference: there has been selection for
expression level, rather than for a changed amino acid sequence of the protein

expressed. For tb1, the alleles selected for domestication are present in wild teosinte.
• Another gene that differs between teosinte and maize is tga1 (tga = teosinte glume
architecture). This gene affects the structure of the glume, the surroundings of the seed.
The teosinte allele produces the hard seed case. The maize allele reduces the glume to a soft
membrane underneath the kernels. For tga1, the expression level is similar in teosinte and
maize. However, there is a specific single nucleotide change in one exon, substituting a
lysine in a teosinte protein with an asparagine in maize. The maize allele has not been found
in wild teosintes.

Can we know whether these and other genetic differences appeared at the time of
domestication? It has been possible to sequence DNA from cobs found in sites of a variety of
ages, the oldest dated at 4400 years ago. The 4400-year-old cob shows modern-maize alleles
for three genes tested, including tb1, showing that all three were present in the maize
population at least that long ago. However, teosinte-specific alleles of some genes appear in
2000-year-old cobs from New Mexico, in the southwest USA. Selection was incomplete then, at
least in regions distant from the site of original cultivation.

Rice (Oryza sativa)

Rice is a very important crop plant in much of the world. It forms the major component of
agriculture in Asia and India. Rice accounts for about 20% of human calorie intake
world-wide.

Rice was domesticated independently at least twice. O. sativa japonica originated in south
China, at least 10 000 years ago. O. sativa indica originated in eastern India or Indonesia.
The progenitor of both was Oryza rufipogon. Familiar from grocery shops are many different
varieties of rice, used in different national cuisines. The characteristics of the major ones
are:

Indica: long, slender grains; grown mainly in tropical areas
Japonica: rounder, shorter grains; temperate and tropical varieties

Minor varieties:
Aus: drought tolerant; Bangladesh and West Bengal
Aromatic: Iran, Pakistan, India, Nepal; including Basmati.

What was the course of rice domestication? Was there a sequential process:

O. rufipogon → O. sativa japonica → O. sativa indica
(or rufipogon → indica → japonica)?

These simple models do not explain the observations that japonica and indica share some
domestication alleles, but not others. The currently accepted model involves originally
separate O. rufipogon populations, separately domesticated to O. sativa japonica and
O. sativa indica, followed by genetic exchange and subsequent specialization.

• Do not confuse the wild progenitor Oryza rufipogon of domesticated rice O. sativa with the
vegetable known as ‘wild rice’, now grown primarily in the northern United States, Zizania
palustris and certain other Zizania species. These are not close relatives of Oryza.

Golden rice

Vitamin A deficiency is an important public health problem, prevalent in regions where rice
is an important dietary component. Golden rice is characterized macroscopically by yellow
grains, and biochemically by high concentrations of β-carotene, precursor of vitamin A.
Golden rice was created by genetic engineering of rice by inserting two genes in the
β-carotene synthesis pathway. These genes are under the control of an endosperm-specific
promoter. The genes introduced are: phytoene synthase, from maize; and carotene desaturase
from the soil bacterium Erwinia uredovora. It is estimated that an individual could achieve
the minimum daily requirement of vitamin A by eating 75 g of golden rice per day. Whether
cultivation of golden rice, like other genetically modified plants, will be allowed, is
under discussion.

Chocolate (Theobroma cacao)

The source of chocolate is the seed of a tree, Theobroma cacao, that originated in the Amazon
basin of western South America (Figure 8.4). It is a relatively

(a) (b)

Figure 8.4 (a) Theobroma cacao tree showing fruit. (b) A ripe pod split open,
showing the beans. To create chocolate from beans, the seeds are fermented
and dried. Processing extracts cocoa butter (triacylglycerols) and cocoa powder
(proteins and polysaccharides, plus small molecules such as flavonoids,
terpenes, and theobromine).
Photo (a) by Paul Bolstad, University of Minnesota Bugwood.org, (b) by Keith Weller,
USDA Agricultural Research Service, Bugwood.org.

finicky plant, unable to tolerate low temperatures or aridity. This restricts it, primarily,
to within 10° latitude of the equator. Chocolate was an important drink in the major Central
American Olmec, Maya, and Aztec cultures (see Box 8.4). It became popular in Europe after the
Spanish conquests. Today, beyond its native habitat in South and Central America, T. cacao is
important in the agriculture of Brazil and tropical Africa. West Africa currently produces
70% of the world’s chocolate, led by the Côte d’Ivoire and Ghana.

BOX 8.4 Some history of chocolate

In the Maya and Aztec civilizations in Central America chocolate was a medicinal and
ceremonial drink, and an expensive one.¹ Fermentation could produce an alcoholic beverage.
The Aztecs prepared chocolate, not as a sweet as we know it, but mixed with chilli pepper to
form a spicy, bitter drink. A Maya vase in the collection of the Princeton University art
museum shows a woman pouring chocolate from a height into a receptacle, to produce froth.
Even more analogous to wine, in the Maya culture it served sacramental, nutritious, and
medicinal purposes. The beans also served as currency: a turkey cost 100 cacao beans.

European contact with cacao beans began in 1502 when Columbus captured the cargo of two Mayan
trading canoes on his fourth voyage. The beans made little impact. Wider appreciation of
chocolate awaited Cortez’s return from his conquest of Mexico. In 1528 Cortez presented to
Charles V a sample of cacao beans, and the tools and recipe for preparing them. In 1544, a
group of Maya nobles accompanied Franciscan friars to the court of Philip II and demonstrated
its preparation.

In contrast to the spicy, bitter Mayan and Aztec beverages, the Spanish added sugar. This
initiated the transition of chocolate from a medicinal to a recreational substance. Chocolate
beverages containing alcohol are now prepared by adding spirits to chocolate.

Chocolate was introduced to Europe before coffee or tea were. It was the first exposure of
Europeans to stimulant alkaloids such as caffeine. The taste for chocolate grew and spread,
first as a drink and later in the solid, familiar ‘bar’ form. It is widely believed that the
marriage of Philip III’s daughter Anne of Austria to Louis XIII of France in 1615 brought
chocolate across the Pyrenees. The first chocolate house in London opened in 1657. The
popularity of these beverages derived originally from a belief in their medicinal value, as
well as their flavour. Samuel Pepys’s diary mentions drinking chocolate on several occasions,
alluding to its curative properties:

    Waked in the morning with my head in a sad taking through the last night’s drink, which
    I am very sorry for; so rose and went out with Mr. Creed to drink our morning draft,
    which he did give me in chocolate to settle my stomach. (24 April 1661)

The odour and taste of natural chocolate derives from unique flavonoids. Pharmacologically
active compounds in chocolate include:

• theobromine (10% by weight of dark chocolate): heart stimulant, cough suppressor,
vasodilator, lowering blood pressure leading to feelings of relaxation, diuretic. Pet lovers
be warned: theobromine is toxic to dogs and cats.
• phenethylamine: a stimulant, similar in effect to amphetamines.

The 18th-century physician and architect Sir Hans Sloane was the inventor of milk chocolate.
Sloane is better known as the founding donor of the original collection of the British
Museum. London’s chic Sloane Square was named after him because his heirs owned the land
developed. Less well known is that the classic red British phone box, now an endangered
species, was based on his tomb, which Sloane himself designed.

¹ ‘. . . chocolate drinks occupied the same niche as expensive French champagne does in our
own [American] culture.’ S.D. Coe and M.D. Coe, The History of Chocolate, 2nd. ed., Thames
and Hudson, London, 2007, p. 61.

T. cacao is susceptible to fungal infections, notably ‘witches’ broom’. This fungus has hit
Brazil’s plantations very hard, but has so far not affected trees grown in Africa. A search
for resistant varieties has motivated many expeditions to the headwaters of the Amazon. These
expeditions also provided an opportunity to map out the geographic distribution of genetic
variation in T. cacao trees. Areas of the highest variability are in the upper reaches of the
Amazon, in what are now parts of Ecuador. This identifies the probable region of origin of
the species.

Fourteen genetic clusters have been identified, extending the known varieties, which have
been differentiated according to the appearance of the beans – as well as their flavour.

Cultivars of T. cacao important in agriculture include: criollo (T. cacao ssp. cacao), the
best tasting, and used for the finest products; forestero (T. cacao ssp. sphaerocarpum),
inferior in taste but resistant to disease (90% of the world’s cacao crop is forestero); and
trinitario, a criollo–forestero hybrid, superior in taste to forestero and providing its
hardiness.

• Unlike maize and rice, Theobroma cacao did not have to undergo a substantial genetic
modification to domesticate it. Originally, the seeds from the native varieties were
harvestable directly.

The T. cacao genome

Two T. cacao projects have sequenced the criollo and forestero varieties.

T. cacao is a diploid organism containing 10 chromosomes. The assembly of the 420 Mbp criollo
genome at 16.7× coverage includes 76% of the genome. Much of the remainder comprises
repetitive regions. A linkage map allowed anchoring 67% of the 326 Mb sequenced within the
10 chromosomes. 28 798 protein-coding genes were identified, more than in A. thaliana. The
average gene size was 3346 bp, with a mean of 5.03 introns/gene. Figure 8.5 shows the shared
and unique genes from T. cacao.

Motivations for sequencing the T. cacao genome include the desire to understand:

(Venn diagram with one set for each of Theobroma cacao, Arabidopsis thaliana, Populus
trichocarpa, Vitis vinifera, and Glycine max.)

Figure 8.5 Numbers of shared and unique gene families and, in parentheses, numbers of genes from chocolate (Theobroma cacao),
thale cress (Arabidopsis thaliana), black cottonwood (Populus trichocarpa), soybean (Glycine max), and wine grape (Vitis vinifera).
From: Argout, X., et al. (2011). The genome of Theobroma cacao. Nat. Genetics 43, 101–108.

• the determinants of resistance to witches’ broom and other diseases
• the sources of the flavours in the chocolate produced
• the place of T. cacao in the general phylogeny of angiosperm plants.

Known disease-resistance genes in plants encode nucleotide-binding site/leucine-rich repeat
(NBS-LRR) proteins and receptor protein kinase (RPK) proteins. The T. cacao criollo genome is
relatively poor in one class of NBS-LRR genes, including those encoding the toll interleukin
receptor (TIR) motif. In T. cacao only 4% of genes orthologous to NBS-LRR contain the TIR
motif. The corresponding number for A. thaliana is 65%! In contrast, the T. cacao criollo
genome contains orthologues of all four NPR1 subfamilies known in A. thaliana.

The molecules that give cocoa its distinctive flavour include flavonoids, alkaloids, and
terpenoids. T. cacao criollo is rich in enzymes producing these compounds. A. thaliana has 36
genes for flavonoid biosynthesis pathway enzymes; T. cacao has 96. T. cacao criollo also has
expanded families of genes encoding enzymes involved in terpenoid synthesis.

It would be tempting to hypothesize that the paucity of toll interleukin receptor genes might
account for the lower disease resistance of criollo relative to forestero, and the
enhancement in numbers of genes for flavonoid and terpenoid biosynthesis enzymes might
account for the enhanced flavour of criollo relative to forestero. Comparison of the genomes
of the two subspecies, which is so far only in the preliminary stages, suggests that this may
be too simplistic an analysis.

Comparing the genomes of several flowering plants allows inference of the primordial
angiosperm genome. The evolution of eudicot genomes is characterized by polyploidization
events. Therefore synteny is an important clue to the phylogeny, in addition to the
divergence of individual gene sequences.
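
The idea that conserved gene order (synteny) carries phylogenetic signal can be made concrete with a toy calculation. The Python sketch below takes orthologous gene pairs, each given as (gene index along a chromosome of genome A, gene index along a chromosome of genome B), and groups them into collinear runs. The function name, the `max_gap` and `min_size` parameters, and the example coordinates are illustrative assumptions; this is not the method used in the cited genome paper, and real analyses also handle inversions and many chromosomes at once.

```python
def collinear_blocks(pairs, max_gap=2, min_size=3):
    """Group orthologue pairs into runs that preserve gene order.

    pairs: list of (pos_a, pos_b) gene indices, one chromosome per genome.
    A run is extended while both positions increase and neither advances
    by more than max_gap genes. Only same-orientation runs are detected.
    """
    blocks, current = [], []
    for pos_a, pos_b in sorted(pairs):
        if current:
            prev_a, prev_b = current[-1]
            step_a, step_b = pos_a - prev_a, pos_b - prev_b
            if 0 < step_a <= max_gap and 0 < step_b <= max_gap:
                current.append((pos_a, pos_b))
                continue
            # The run is broken: keep it only if it is long enough.
            if len(current) >= min_size:
                blocks.append(current)
        current = [(pos_a, pos_b)]
    if len(current) >= min_size:
        blocks.append(current)
    return blocks

# Hypothetical orthologue positions: two conserved segments and one stray gene.
pairs = [(1, 10), (2, 11), (3, 13), (4, 40), (5, 41), (6, 42), (7, 43), (9, 5)]
for block in collinear_blocks(pairs):
    print(len(block), "genes:", block[0], "to", block[-1])
```

Segments that remain collinear between two species are evidence of shared ancestral chromosome structure; repeated segments within a single genome, found by comparing a genome with itself, are the kind of evidence used to infer past polyploidization.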

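(Panel labels. (a) Chromosomes of Arabidopsis (1–5), soybean (1–20), papaya (1–9), grape
(1–19), and poplar (1–19) mapped against cacao (1–10). (b) Cacao chromosomes c1–c10.
(c) Tree from a paleohexaploid ancestor, n = 21, with nodes at 123, 90, 77, and 59 MYA and
branches labelled Eurosid I and Eurosid II, leading to papaya (1–9), Arabidopsis (1–5), cacao
(1–10), poplar (1–19), soybean (1–20), and grape (1–19); each lineage is annotated with
counts of whole-genome duplications (R) and chromosome fusions (F).)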

Figure 8.6 Syntenic relationships of eudicot genomes, comparing T. cacao with thale cress (Arabidopsis thaliana), soybean (Glycine
max), black cottonwood (Populus trichocarpa), papaya (Carica papaya), and wine grape (Vitis vinifera), and inferring the chromosome
structure of the common ancestor. (a) Mapping of orthologues from T. cacao to the other five genomes. For each species, chromosomes
are numbered consecutively. Different colours represent seven ancestral eudicot linkage groups. (b) Syntenic relationships within the
T. cacao genome. There is evidence for ancestral triplication of the genome. For example, light green links regions of chromosomes 1,
2, and 8. (c) A model for the evolutionary history of the karyotype. The six species studied derived from an original eudicot ancestor
with 7 chromosomes. To produce the current karyotypes there were whole-genome duplications (R) and chromosome fusions (F).
From comparisons of the chromosomes of the six species, it is possible to trace the path from the current chromosome structure of
the six species (shown at the bottom), back to the common ancestor. The gene distribution among T. cacao chromosomes is the
closest, of these six species, to the ancestral form.
From: Argout, X., et al. (2011). The genome of Theobroma cacao. Nat. Genetics 43, 101–108.

Figure 8.6 shows the syntenic relationships of six eudicot species. The results suggest that
the ancestral eudicot contained seven chromosomes. Interestingly, of these six species,
T. cacao appears to be closest to the ancestral form.

Genomics in anthropology

Our genomes contain the history of the origins and development of our species. We have
already seen how genomics can elucidate phylogenetic relationships, and how palaeosequencing
of extinct species can widen the coverage of these studies. The palaeosequence that has most
captured popular interest has been that of one of our closest relatives: the Neanderthal
genome.

The Neanderthal genome

In 1856, about a dozen bones were found in a limestone quarry in the Neander Valley near
Düsseldorf (Valley = Thal in German). Although, in retrospect, Neanderthal bones had appeared
earlier, the 1856 discovery was the first to be recognized as remains from a new species.

Neanderthals were hominids closely related to modern humans. The earliest fossils in which
they appear are about 130 000 years old. They were approximately the same size as humans, but
substantially more robust. Their cranial capacity was as large as modern humans or perhaps
slightly larger, although this in itself does not guarantee comparable cognitive abilities.
Neanderthals did fashion and use tools, and produce at least decorative arts. It has been
suggested that they wore makeup. They engaged in complex funerary practices.

Both modern humans and Neanderthals inhabited Europe and Central Asia, until the Neanderthals
vanished, about 30 000 years ago. Articles in scientific journals inquire about their
similarities to and differences from modern humans, and why they became extinct. The popular
press focuses on the possibility of sexual congress between Neanderthals and humans.

The sequencing of Neanderthal mitochondrial DNA was completed in 2008. For the nuclear
genome, DNA extracted from bones from three individuals from a site in Croatia produced a
total of 4 Gb of Neanderthal sequence.

A three-way comparison among Neanderthal, human, and chimpanzee sequences allows analysis of
human–Neanderthal divergence. There are 78 sites in protein coding genes containing amino
acids specific to humans, that is, the amino acid at that position is the same in chimpanzee
and Neanderthal, and different in human. In contrast, one result of interest is that
Neanderthals have the human version of the FOXP2 gene, involved in language skills in humans.

The data suggest that the populations ancestral to Neanderthals and humans split about
270 000–400 000 years ago. Admittedly, this is a very rough estimate. But it antedates the
migration of humans from Africa. It suggests the following scenario: the populations split in
Africa 370 000 years ago. One population migrated to Europe to give rise to the
Neanderthals. The other remained in Africa, and gave rise to humans. Then humans migrated to
Europe about 40 000 years ago, where they coexisted with Neanderthals for about 6000 years.

Did humans interbreed with Neanderthals? The question is quite controversial. It is argued
that by looking at human variation patterns from the HapMap data, Neanderthals are more
closely related to humans from regions other than Africa, than to African populations. If
there were no genetic admixture, one would expect no difference: Neanderthals should be
equally closely related to all human populations. The scenario implied is that on their way
out of Africa, humans interbred with Neanderthals and carried the genes thereby picked up
around the rest of the world. The estimate is that 1–4% of the human genome is Neanderthal in
origin. One might reasonably suppose that Neanderthals should be most closely related to
European human populations – for
that is where they coexisted for longest – but the data do not support this.

• It has been possible to sequence DNA from bones of Neanderthals, a species of hominid that has
been extinct for about 30 000 years. Humans and Neanderthals both inhabited Europe for about 6000
years. One question addressed using the Neanderthal sequence is whether there was human–Neanderthal
interbreeding. It has been estimated that 1–4% of the human genome is Neanderthal derived. However,
this conclusion is controversial.

Ancient populations and migrations

When people move around, they take their DNA with them. This makes it possible to trace patterns
of migration. Many studies focus on mitochondrial DNA sequences, which reveal lines of maternal
inheritance (see Box 8.5). Y chromosomes provide complementary paternal information. Although
studies of Y sequences are more sparse, in many cases they corroborate the implications of the
mitochondrial data.
There is now a consensus that our species, Homo sapiens, arose in Africa approximately
100 000–150 000 years ago. Migrations beginning approximately 60 000 years ago took our ancestors
around the world, and continue to do so. Modern population flows are documented in historical
records; for ancient migrations, in contrast, we depend on archaeological relics, modern genomics,
and linguistics to infer the timing, the routes, the numbers of individuals, and even perhaps the
motivation.
Other crucial transitions in human social organization, such as turning from hunting to
agriculture, are reflected to some extent in domestications of other species such as maize and dog.

BOX 8.5 Human mitochondrial DNA haplogroups

Human mitochondrial DNA is a double-stranded, closed, circular molecule 16 569 bp long. It is
inherited almost exclusively through maternal lines. A fertilized egg contains the mother’s
mitochondria. Although sperm contain mitochondria – essential to provide energy for their motility –
the few paternal mitochondria that enter the egg are selectively eliminated. As a haploid entity,
mitochondrial DNA is, therefore, not subject to recombination, and changes only by mutation.
Mitochondrial DNA is estimated to adopt one mutation every 25 000 years. This gives a reasonable
rate of divergence to trace human migration patterns. (Nuclear DNA mutates approximately ten times
more slowly than mitochondrial DNA because (1) histones protect it; (2) active repair mechanisms
edit out some mutations; and (3) the activity of mitochondria in oxidative phosphorylation exposes
mitochondrial DNA to mutagenic oxygen radicals.)
Human mitochondrial DNA contains genes for 22 tRNAs, two ribosomal RNAs, and 13 proteins. The major
non-coding region is the control region, or D-loop, involved in regulation and initiation of
replication. This region is about 1 kb long. It shows a higher rate of substitution than the rest of
the mitochondrial genome, by a factor of about four.
Different mitochondrial DNA sequences are associated with different populations. Mutations are
referred to the first human mitochondrial DNA sequence determined, called the Cambridge Reference
Sequence. Groups of related sequences are called haplogroups. (The distribution of the number of
sequence differences between different individuals has a peak at ∼70 for Africans and ∼30 for
non-Africans.) The original classification of sequence variants depended on changes in restriction
sites (see Figure 8.7). This was followed by explicit sequencing of the control region, focusing on
its two highly polymorphic segments. For finest resolution, contemporary studies are now more
frequently determining full mitochondrial DNA sequences, except in cases of ancient DNA where the
best recoverable material may be fragmentary.
Several databases focus on human mitochondrial genomes, including MITOMAP (http://www.mitomap.org)
and mtDB (http://www.genpat.uu.se/mtDB).
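
The clock arithmetic implied by the rate quoted in this box can be sketched in a few lines. This is only an illustration, assuming a constant rate of one mutation per 25 000 years along a single maternal lineage; the function name and the choice of time spans (dates mentioned in this chapter) are ours, and real estimates depend on which region is sequenced and on how the clock is calibrated.

```python
# Rate quoted in Box 8.5: about one mutation per 25 000 years along a lineage.
YEARS_PER_MUTATION = 25_000


def expected_mutations(years):
    """Expected number of mutations accumulated along one lineage,
    assuming a constant (clock-like) rate."""
    return years / YEARS_PER_MUTATION


# Time spans taken from dates mentioned in the chapter (illustrative only).
for label, years in [
    ("origin of Homo sapiens (upper estimate)", 150_000),
    ("start of migrations out of Africa", 60_000),
    ("end of the last ice age", 11_400),
]:
    print(f"{label}: {years} years -> about {expected_mutations(years):.1f} mutations")
```

On these assumptions a lineage accumulates only a handful of mutations over the whole period of interest, which is why the faster-changing control region and, increasingly, full mitochondrial sequences are used for finer resolution.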

Prevalent in:
L1 Africa
L2 Africa
‘Eve’ L3 L3+ Africa
CZ Z Siberia
C NE Asia, Amerinds
D NE Asia, Amerinds

M M* India, Asia
E
G East Asia
Q Oceania
A Asia, Americas
Z Russia, Saami
W W. Urals, E. Baltic
N X Near East, Caucasus
Y Siberia
N* Asia, Oceania
B Amerinds, SE Asia
F Asia
H1 Europe
HV
H3 Europe
R HV0 V Oceania
R* East Asia
P Oceania
T Eastern Baltic Sea, Urals
J Europe
U Europe
K Europe

Figure 8.7 Phylogenetic tree of major mitochondrial haplogroups. The nomenclature began with a study of Native Americans, or
Amerinds, and the letters A, B, C, and D were assigned to them. Other letters were introduced and were subdivided as needed as
more detailed sequencing data appeared. HV0 was formerly called pre-V.

P. Forster has created a ‘movie’ showing successive stages of human dispersal (see Figure 8.8).
The evidence for human origins in Africa is that contemporary genetic diversity is highest there.
The mitochondrial DNA haplogroup L1, believed to be the oldest haplotype that survives, is found in
the KhoiSan of the Kalahari Desert in southern Africa and in the Biaka pygmies of the central
African rainforest (see Figure 8.8a). An expansion of L2 and L3 haplogroups took place within Africa
about 80 000–60 000 years ago (see Figure 8.8b). Over two-thirds of contemporary Africans belong to
these groups.
The first emigration from Africa occurred about 85 000–55 000 years ago. The participants in this
first dispersal carried the L3 mitochondrial DNA haplogroup, which arose ∼84 000 years ago.
Mutations in L3 gave rise to haplogroups M and N, ∼60 000 years ago (see Figure 8.8c). Haplogroup M
mitochondrial DNA appears in ancient populations from the Andaman Islands, southern continental
India, and the Malaysian Peninsula. This suggests a dispersal via what is now southern Iran along
the coast to India, the Malaysian Peninsula, and Australia. Archaeological evidence shows that
humans reached Australia by 46 000 years ago. Dating of the earliest human remains in Australia is
consistent with the extinction of many large mammals and birds shortly thereafter (see Figure 8.8d).
From 60 000 to 30 000 years ago, human populations expanded in southern Europe and Asia,
accumulating mutations in mitochondrial DNA to form new haplogroups (see Figure 8.8e).
Expansion into Europe was delayed and interrupted. Between 20 000 and 30 000 years ago – an
interglacial era – the first human Europeans encountered and replaced the Neanderthal population
(see Figure 8.8f). The same climate conditions permitted a northwards expansion in Asia.
The closing of the Bering Strait allowed humans from northwest Asia to move across to North

[Figure 8.8 (a) map labels: Neanderthals; Homo erectus; Skhul/Qafzeh, Israel; Omo valley, Ethiopia; Klasies River Mouth, South Africa; haplogroups L0, L1.]
[Figure 8.8 (b) map labels: Neanderthals; Homo erectus?; haplogroups L1, L2, L3.]

Figure 8.8 ‘Movie’ of human migration patterns, based on mitochondrial DNA sequences. Letters indicate haplotypes.
(a) The beginnings, ∼150 000 years ago. (b) Expansion and divergence within Africa, 80 000–60 000 years ago. (c) Out of
Africa, 60 000–50 000 years ago. (d) Spread through the south coast of the Indian Ocean, reaching Australia, 50 000–30 000 years
ago. (e) Expansion in the eastern Mediterranean, India, southeast Asia and Australia, and central Asia, 30 000 years ago. (f) After
30 000 years ago, milder climate conditions permitted expansion northwards in Europe and Asia. (g) Crossing the Bering Strait,
first human inhabitants of American continents, 20 000–15 000 years ago. (h) Ice Age, 20 000 years ago, with humans forced
to retreat south. (i) Spread to South America, 18 000 years ago. (j) Warmer climate, with resettlement of northern latitudes,
15 000–13 000 years ago. (k) Sudden warming and subsequent stable climate, 11 400 years ago, allowing the spread of agriculture.
(l) Expansion to islands, 2000 years ago. (m) The current picture.
From: www.rootsforreal.com. See also: Forster, P. (2004). Ice ages and the mitochondrial DNA chronology of human dispersals: a review. Phil. Trans. R.
Soc. B: Biol. Sci. 359, 255–264 (Figure 5).
[Figure 8.8 (c) map labels: Neanderthals; haplogroups L1, L2, L3, M, N.]
[Figure 8.8 (d) map labels: Neanderthals; Niah Cave, Borneo; Lake Mungo, Australia; haplogroups L1, L2, L3, M, N.]
[Figure 8.8 (e) map labels: Neanderthals; Yana river, Siberia; haplogroups L1, L2, L3, M, Mx, My, Mz, N, Nz, A, B, D, F, H, I, JT, P, Q, R, U.]
[Figure 8.8 (f) map labels: Gravettian, Europe; Masterov Kliuch, Siberia; Zhoukoudian, China; haplogroups L1, L2, L3, M, Mx, My, Mz, N, Nz, A, B, D, F, H, I, JT, P, Q, R, U.]
[Figure 8.8 (g) map labels: haplogroups L1, L2, L3, M, Mx, My, Mz, N, Ny, Nz, A, B, C, D, F, H, I, JT, P, Q, R, U, X.]
[Figure 8.8 (h) map labels: haplogroups L1, L2, L3, My, Mz, N, Ny, Nz, A, B, C, D, F, H, I, JT, P, Q, R, U, V, X.]
[Figure 8.8 (i) map labels: Meadowcroft, Pennsylvania; haplogroups L1, L2, L3, Mx, My, Mz, N, Ny, Nz, A, B, C, D, F, H, I, JT, P, Q, R, U, V, X.]
[Figure 8.8 (j) map labels: Mesa, Alaska; Studence, Siberia; Magdalenian, Europe; Clovis, New Mexico; Monte Allegre, Brazil; Monte Verde, Chile; haplogroups L1, L2, L3, Mx, My, Mz, N, Ny, Nz, A, B, C, D, F, H, I, JT, P, Q, R, U, V, X, Y, Z.]
[Figure 8.8 (k) map labels: haplogroups L1, L2, L3, M1, Mx, My, Mz, N, Ny, Nz, A, B, C, D, F, H, I, J, P, Q, R, T, U, V, X, Y, Z; two locations marked ‘?’.]
[Figure 8.8 (l) map labels: haplogroups L1, L2, L3, M1, Mx, My, Mz, N, Ny, Nz, A, B, C, D, F, H, I, J, P, Q, R, T, U, V, X, Y, Z.]
[Figure 8.8 (m) map labels: haplogroups L1, L2, L3, M1, Mx, My, Mz, N, Ny, Nz, A, B, C, D, F, H, I, J, P, Q, R, T, U, V, X, Y, Z.]


America and to expand southwards (see Figure 8.8g). Human remains found in Alaska have been dated
to 9800–9200 years ago. Current evidence suggests that humans first arrived in America
20 000–15 000 years ago.
When glaciers covered northern Europe and arctic America again, humans retreated southwards (see
Figure 8.8h and i). Only isolated pockets of the original settlers remained in Europe. One such
pocket was ancestral to the Basques, with their now-unique H and V mitochondrial haplogroups. A
subsequent warm period, starting 15 000 years ago, saw resettlement of northern Europe (see
Figure 8.8j). The genetic diversity in northern Europe is accordingly reduced. The ending of the
last ice age, quite abruptly 11 400 years ago, permitted the expansion of agriculture (see
Figure 8.8k, and Box 8.6).

• Agriculture reached Britain about 5000 years ago.

Late migrations, during the last 2000 years, populated islands such as Greenland by Inuit from
Alaska, Madagascar by people from southeast Asia, and Pacific islands by Austronesians (see
Figure 8.8l). The contemporary distribution of mitochondrial haplogroups contains the records of
this history (see Figure 8.8m).

BOX 8.6 The spread of agriculture and the source of European populations

We have discussed crop domestication from the plants’ point of view. Let us now examine the human
consequences. There is a consensus that agriculture began in the Middle East about 12 000 years
ago. After the glaciers receded, farming spread northwards through Europe, supporting a population
increasing in size.
The nature of the process has been the subject of debate – a debate that has not become noticeably
less heated as more data have been measured. Opinions span a spectrum between the extremes of a
movement of people – farmers and their descendants who moved north and east – and dissemination of
culture – adoption of farming by remnants of Palaeolithic Europeans and their descendants.
The spread of Indo-European languages along the same northeast vector, correlated with the movement
of agriculture, is ambiguous – like agriculture, languages can travel either by migration of people
or by cultural transmission. However, surveys of genomes of contemporary and ancient Europeans
should be able to detect the relative contributions of Neolithic Middle Eastern farmers and
Palaeolithic Europeans.
The contributions to contemporary European DNA gene sequences have been apportioned to several
populations including the Basques – representing the original Palaeolithic European inhabitants –
and Middle Easterners – representing the originators of agriculture.* The study included markers on
autosomes and the Y chromosome. The mean admixture suggested an approximately equal contribution
from the two populations. There is a geographic gradient. The Middle-Eastern contribution decreases
as the distance from the Middle East increases.
The study by Dupanloup and co-workers used contemporary populations as representatives of ancient
ones. Another study examined ancient DNA directly.† Mitochondrial DNA was sequenced from remains at
7500-year-old Neolithic sites linked by cultural artefacts to the initial spread of farming. These
mitochondrial DNA results suggest that these early farmers did not contribute a large fraction to
the European mitochondrial gene pool. Sequences from the Y chromosome confirm this.
The conclusion is that early farmers did migrate to Europe from the Near East.

* Dupanloup, I., Bertorelle, G., Chikhi, L., & Barbujani, G. (2004). Estimating the impact of
prehistoric admixture on the genome of Europeans. Mol. Biol. Evol. 21, 1361–1372.
† Haak, W., Balanovsky, O., Sanchez, J.J., et al. (2010). Ancient DNA from early European neolithic
farmers reveals their near eastern affinities. PLoS Biol. 8, e1000536.

Genomics and language

Language, as a biological phenomenon, links genomics with several other disciplines, including
neurobiology, development, medicine, and anthropology. Language is also a pre-eminent social
phenomenon. It underlies interpersonal communication (both face-to-face and worldwide), commerce,
and literature. Many important social decisions (e.g. the design of educational systems) depend on
appreciating the biological substrates of language. These biological substrates are subtle and our
knowledge of them is far from complete.
Language is a feature of all human populations. Of the approximately 6500 languages currently
spoken, a few major ones such as Chinese, English, Hindi, and Spanish can each claim over 400
million speakers (see Table 8.1). Others have much smaller communities: Europe retains niche
languages such as Basque and Breton. There were estimated to be about 250 languages of indigenous
Australians at the time of settlement by Europeans, of which about two-thirds survive. There are
over 800 distinct languages spoken on the island of New Guinea.

Table 8.1 Major languages of the world

Native language     Number of speakers
Mandarin Chinese    >1 000 000 000
English             500 000 000
Hindi               495 000 000
Spanish             425 000 000

Human spoken languages have many features in common with biological species. Both languages and
species exhibit varying degrees of similarity and diversity. Languages and species can be classified
into taxonomies that, at the highest levels at least, are hierarchical. In both languages and
species, ancestor–descendant relationships can be observed. Languages diverge, as species do. The
Romance languages, divergent descendants of Latin, are the Galapagos finches of linguistics. Latin
itself sits within the larger class of Indo-European languages, of which Sanskrit, the classical
language of India, is the extant language putatively closest to their common ancestor. Dialects of
languages are analogous to genetic haplotypes.
Just as species have become extinct, so have many languages. Many languages are currently
‘endangered’, with only a few elderly speakers remaining. Fortunately, like species, languages can
be rescued from extinction. Navajo, in decline until the 1940s, is now thriving. Hebrew, which had
survived as a written language, re-emerged 50 years ago as the spoken language of the new country
of Israel. In some cases, it is possible to reconstruct an extinct common ancestor of surviving
descendant languages. This can have interesting implications about the lifestyles of its speakers.
For instance, the similarities in some words in Indo-European languages, and dissimilarities in
others, suggest that the speakers of the ancestral Indo-European language had domesticated dogs and
horses but not cats and camels.
Linguists have developed quantitative methods to measure divergence of languages. Languages develop
variation in pronunciation (or spelling, if the language is written) and in structure. An example
of variation in pronunciation would be the difference between the English word ship and the German
cognate Schiff. An example of a structural feature of a language is word order: in sentences in
English and many other languages, the typical word order is subject–verb–object, as in ‘The mouse
ate the cheese’. In some languages, such as Turkish, typical word order is subject–object–verb.
Phonological change is much more rapid than structural change – molecular biologists might find a
useful analogy to the rapidity of sequence change relative to structural change in proteins!
There are obvious correlations between people’s native language and physical features that are
genetically determined; for instance, blue or brown eyes correlate well with native speakers of
Swedish or Mandarin, respectively. However, language is clearly transmitted culturally rather than
genetically: any child can become a fluent speaker of any language to which he or she is exposed in
the cradle. (There is, however, a finite window: it is much more difficult after puberty to become
fluent in a new spoken language; some aspect of neural plasticity is shut down. As a result,
language has served, and continues to serve, as a litmus test for immigrants. This is,
unfortunately, frequently used as a criterion for social discrimination, by persons who may even
believe sincerely that there is a correlation between imperfect speech in a non-cradle language and
inherent or genetic inferiority.)
Although there is no direct genetic link with native language, L.L. Cavalli-Sforza and co-workers
found substantial similarities between clustering of genetic variation among human populations and
language groups. For instance, just as the Basque language is an isolate, unrelated to any other
language spoken in Europe, the Basque people are also genetically distinct. This link between
languages and genetics is useful because we can estimate rates of genetic change and thereby infer
when languages arose and diverged. One conclusion is that most of the world’s language families
arose, in a burst, between 6000 and 25 000 years ago. When such inferences from genetics can be
compared with datable archaeological evidence or historical records, the results are sometimes,
although not always, gratifyingly consistent.

In Ivanhoe, Sir Walter Scott contrasted the French word veau for the meat, with calf for the animal,
noting that ‘. . . he is Saxon when he requires tendance, and takes a Norman name when he becomes
matter of enjoyment’.

Comparison of genetic and linguistic data can illuminate what happens when two populations collide.
Some discrepancies between correlations of genetics and languages appear when conquerors impose a
new language on an indigenous population.

• This happened in (what is now) Great Britain after the Anglo-Saxon invasions around the 5th
century. The indigenous Celtic languages were pushed to far reaches of the British Isles (and
Brittany; Breton is a Celtic language). Less effective, linguistically, was the Norman Conquest of
1066, after which English absorbed many French words but retained its own vocabulary alongside them.
• Hungary was converted to speaking a Uralic language upon invasion by the Magyars in 896 ad.
• In contrast, the fall of Rome in 476 ad was not followed by a replacement of Latin by a Germanic
tongue.

Conversely, although native language is not determined by genes, language differences can retard
genetic mixing among populations by creating cultural barriers to gene transfer.
Linguistics will continue to be an area in which the very subtle links between genomics and culture
can be explored.

● RECOMMENDED READING

• A historical review, and a recent article, on the use of DNA sequences in personal identification:
Jeffreys, A.J. (2003). Genetic fingerprinting. In Changing Science and Society. T. Krude (ed.)
Cambridge University Press, Cambridge, pp. 44–67.
Giardina, E., Spinella, A., & Novelli, G. (2011). Past, present and future of forensic DNA typing.
Nanomedicine 6, 257–270.
• Discussions of domestication of crop plants:
Murphy, D.J. (2007). People, Plants, and Genes. The Story of Crops and Humanity. Oxford
University Press, Oxford.
Doebley, J.F., Gaut, B.S., & Smith, B.D. (2006). The molecular genetics of crop domestication.
Cell 127, 1309–1321.
Tang, H., Sezen, U., & Paterson, A.H. (2010). Domestication and plant genomes. Curr. Opin.
Plant Biol. 13, 160–166.
Zeder, M.A. (2006). Central questions in the domestication of plants and animals. Evol.
Anthropol. 15, 105–117.
Doebley, J. (2006). Plant science. Unfallen grains: how ancient farmers turned weeds into crops.
Science 312, 1318–1319.
Staller, J., Tykot, R.H., & Benz, B. (2006). Histories of Maize: Multidisciplinary Approaches to
the Prehistory, Linguistics, Biogeography, Domestication, and Evolution of Maize. Elsevier,
London.
• The contribution of genomics to understanding the history of agriculture:
Armelagos, G.J. & Harper, K.J. (2005). Genomics at the origins of agriculture. Evol. Anthropol.
14, 68–77, 109–121.
• A discussion of the ‘green revolution’ and extrapolating from contemporary agricultural statistics
to the needs of the future and how they might be met:
Davies, W.P. (2003). An historical perspective from the Green Revolution to the gene revolution.
Nutr. Rev. 61, S124–S134.
• The Neanderthal genome and relation to the human:
Noonan, J.P. (2010). Neanderthal genomics and the evolution of modern humans. Genome Res.
20, 547–553.
• Genomics and human migrations:

Brumfield, R.T., Beerli, P., Nickerson, D.A., & Edwards, S.V. (2003). The utility of single nucleotide
polymorphisms in inferences of population history. Trends Ecol. Evol. 18, 248–256.
Robinson, R. (2010). Ancient DNA indicates farmers, not just farming, spread west. PLoS Biol. 8,
e1000535.
Brandon, M.C., Lott, M.T., Nguyen, K.C., Spolim, S., Navathe, S.B., Baldi, P., & Wallace, D.C.
(2005). Mitomap: a human mitochondrial genome database – 2004 update. Nucl. Acids Res.
33, D611–D613.
Ingman, M. & Gyllensten, U. (2006). mtDB: Human Mitochondrial Genome Database, a
resource for population genetics and medical sciences. Nucl. Acids Res. 34, D749–D751.
Lee, Y.S., Kim, W.Y., Ji, M., Kim, J.H., & Bhak, J. (2009). MitoVariome: a variome database of
human mitochondrial DNA. BMC Genomics. 3, Suppl 3: S12.
• The history of languages and the relationship to genomics:
Cavalli-Sforza, L.L. (2000). Genes, Peoples and Languages. Farrar, Straus and Giroux, New York.
Kamusella, T. (2009). The Politics of Language and Nationalism in Modern Central Europe.
Palgrave Macmillan, Basingstoke.
Evans, N. (2010). Dying Words: Endangered Languages and What They Have to Tell Us.
Blackwell’s, Chichester.
Harrison, K.D. (2010). The Last Speakers: The Quest to Save the World’s Most Endangered
Languages. National Geographic Society, Washington DC, USA.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 8.1 On a photocopy of Figure 8.1, (a) sketch in and indicate by the letter X a band, the
appearance of which would prove that F (F = the person whose DNA produced lane F) was not
the father of child C, but individual M could be the mother. (b) Sketch in and indicate by the letter
Y a band, the appearance of which would prove that M was not the mother of child C, but F
could be the father. (c) Sketch in and indicate by the letter Z a single band, the appearance of
which would prove that M was not the mother of child C and F was not the father.
Exercise 8.2 On a photocopy of Figure 8.1, relabel lane M as C, lane F as M, and lane C as F.
Now M sues F for support of child C, alleging paternity. Can F prove from these data that he is
not the father of C?
Exercise 8.3 Why could DNA analysis not decide whether or not a woman in the UK was named
Sykes?
Exercise 8.4 Chocolate (Theobroma cacao) is not shown on the map in Figure 8.2. Insert T. cacao
on a photocopy of this figure.
Exercise 8.5 One of the papers on the recently sequenced Theobroma cacao genome states
that the Belizean criollo genotype ‘. . . is suitable for a high-quality genome sequence assembly
because it is highly homozygous as a result of the many generations of self-fertilization that
occurred during the domestication process’. Why do these characteristics make this variety suitable
for a high-quality genome sequence assembly?
Exercise 8.6 (a) How many genes in Theobroma cacao do not have homologues appearing in
Arabidopsis thaliana, Populus trichocarpa, Glycine max, and Vitis vinifera? (b) How many
gene families are shared by all four species? (c) How many gene families appear in A. thaliana,
P. trichocarpa, Glycine max, and V. vinifera, but not in T. cacao?
Exercise 8.7 How many syntenic groups in the T. cacao genome show evidence for triplication
of the genome? List the triplets of chromosomes linked by each group.

Problems
Problem 8.1 Jeffreys has recently introduced a newer identification method based on one VNTR
locus, D1S8 (or MS32) (see Figure 8.9). This locus contains hundreds of repeats of a 19 bp
sequence. One base in the repeat is hypervariable, with the result that each repeat may contain,
or lack, a HaeIII restriction enzyme cut site. The feature identifying any individual’s DNA is a binary
code specifying the sequence of presence or absence of HaeIII restriction sites in successive repeats
within the locus. (The sites on homologous chromosomes vary independently, multiplying the
variability.)
The identification is read by designing PCR primers: one to a common sequence outside the
repeat, and two others to each of the alternative repeat sequences. Amplification using the
common primer and one of the others produces a series of fragments: each fragment spans
the sequence complementary to the common primer, outside the repeat, to each repeat
complementary to the other primer. These fragments can be separated by size on a gel and
read off as a bar code. A similar, separate amplification using the other primer can detect
fragments that contain the other possible sequence.
Sketch the appearance of a gel containing two lanes, one from each of the amplifications shown
in Figure 8.9.

Figure 8.9 Variable repeat of regions containing either of two alternative sequences, one of which
contains a cutting site for restriction enzyme HaeIII. The order of the choices between the alternative repeat
sequences provides a ‘binary code’. In this diagram, blue represents one repeat sequence and red the other.
Top: one PCR amplification is based on primers to one of the alternative sequences (blue arrow) and to a
common region flanking the repeat (magenta arrow), which produces fragments starting at each of the
occurrences of one of the alternative repeat sequences. The fragments produced are indicated below the
arrows. Bottom: a second PCR amplification based on a primer to the other repeat sequence (red arrow)
and the common flanking region (magenta arrow) produces fragments starting at each of the occurrences
of the other repeat sequence. The fragments produced are indicated below the arrows. (This is a somewhat
simplified description of the actual technique.)
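
The readout described in Problem 8.1 can be sketched in code. The example below is a simplified illustration, not the actual laboratory protocol: the allele `"1101001"` and the 100 bp flanking distance are hypothetical, only the 19 bp repeat length comes from the problem statement, and a single allele is modelled (in a real sample the ladders from the two homologous chromosomes would be superimposed). A `"1"` marks a repeat containing the HaeIII site and a `"0"` a repeat lacking it.

```python
REPEAT_LENGTH = 19   # bp per repeat, as stated in Problem 8.1
FLANK_LENGTH = 100   # bp from the common primer to the first repeat (hypothetical)


def fragment_lengths(code, repeat_type):
    """Sizes (bp) of the fragments from one amplification.

    code is a string such as "1101001". A fragment runs from the common
    flanking primer to each repeat of the requested type, so its size
    grows by one repeat length per position along the locus.
    """
    return [FLANK_LENGTH + i * REPEAT_LENGTH
            for i, r in enumerate(code, start=1) if r == repeat_type]


def read_code(lane_one, lane_zero):
    """Recover the binary code by merging the two lanes in order of size."""
    bands = [(size, "1") for size in lane_one] + [(size, "0") for size in lane_zero]
    return "".join(symbol for _, symbol in sorted(bands))


def draw_gel(code):
    """Text sketch of the two lanes, largest fragment at the top."""
    rows = ["  bp  lane1 lane0"]
    for i in range(len(code), 0, -1):
        size = FLANK_LENGTH + i * REPEAT_LENGTH
        lane1 = "====" if code[i - 1] == "1" else "    "
        lane0 = "====" if code[i - 1] == "0" else "    "
        rows.append(f"{size:4d}  {lane1}  {lane0}")
    return "\n".join(rows)


code = "1101001"                       # hypothetical allele
lane_one = fragment_lengths(code, "1")
lane_zero = fragment_lengths(code, "0")
print(draw_gel(code))
print("code read back from the gel:", read_code(lane_one, lane_zero))
```

Each rung of the ladder appears in exactly one of the two lanes, which is what allows the two amplifications together to be read as a bar code.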

Problem 8.2 George Beadle crossed a strain of maize with teosinte and examined the second
generation (F2). Approximately 1/500 plants was identical to the maize grandparent and
approximately 1/500 was identical to the teosinte grandparent. Assuming Mendelian inheritance
of n unlinked genes, such that the teosinte grandparent was homozygous for n teosinte alleles and
the maize grandparent was homozygous for n maize alleles, estimate the value of n such that the
fraction of F2 plants genetically identical to the grandparents at all n loci is approximately 1/500.
Problem 8.3 Numerous genera of ants cultivate fungi for food. Suggest some similarities
and some differences between the domestication of fungi by ants and the domestication of
agricultural crops by humans.
Problem 8.4 If humans first entered Alaska from Asia 20 000 years ago and archaeological
remains are found in Monte Verde, Chile, dated 14 000 years ago, what mean annual rate of
southward migration speed is implied? What two villages or cities in the vicinity of where you
live are as far apart as this rate would imply for 10 years of migration?
Problem 8.5 What accounts for the paucity of disease-resistant genes in T. cacao? It has been
suggested that the genes encoding toll interleukin receptor motif proteins are older than the
angiosperm–gymnosperm split, and that they have been lost in lineages including genes in
T. cacao. (a) What alternative hypothesis is suggested by the observation that T. cacao is close
to the ancestral eudicot? (b) How might these hypotheses be tested?

Weblems
Weblem 8.1 What is the difference in structure between theobromine and caffeine?
Weblem 8.2 (a) Are there any other species in the genus Theobroma? (b) What other genus of
plant is most closely related to Theobroma?
Weblem 8.3 A region of human mitochondrial DNA sequence, positions 16 047–16 385, was
determined to have the following mutations, relative to the Cambridge Reference Sequence (CRS):

          Position (nt)
Source    16 147  16 172  16 189  16 223  16 248  16 320  16 355
CRS       C       T       T       C       C       C       C
Unknown   A       C       C       T       T       T       T

To what mitochondrial haplogroup does the sequence belong?
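
As a first step towards a question like Weblem 8.3, the differences from the reference can be written in the usual substitution notation (reference base, position, observed base). The sketch below does only that, using the data in the table above; the function names are illustrative, and assigning a haplogroup still requires looking the variants up in a resource such as the mitochondrial databases mentioned in Box 8.5.

```python
# Data from the table in Weblem 8.3.
POSITIONS = [16147, 16172, 16189, 16223, 16248, 16320, 16355]
CRS = "CTTCCCC"       # Cambridge Reference Sequence bases at those positions
UNKNOWN = "ACCTTTT"   # bases observed in the unknown sample

PURINES = {"A", "G"}


def list_variants(positions, reference, sample):
    """Return the differences from the reference, e.g. 'T16189C'."""
    return [f"{ref}{pos}{alt}"
            for pos, ref, alt in zip(positions, reference, sample)
            if ref != alt]


def substitution_type(ref, alt):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; otherwise transversion."""
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


for variant in list_variants(POSITIONS, CRS, UNKNOWN):
    ref, alt = variant[0], variant[-1]
    print(variant, substitution_type(ref, alt))
```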

CHAPTER 9

Microarrays and
Transcriptomics

LEARNING GOALS

• To understand the principles underlying microarray technology.

• To be aware of the types of application for which microarrays are suitable.
• To know what a gene expression table is and how it is derived from a microarray experiment.
• To be able to distinguish gene-based and sample-based analysis and how they support different
types of application.
• To understand how microarrays can be applied to the study of changing gene expression
patterns in different physiological states.
• To understand how microarrays can reveal the different time courses of expression of different
genes during the yeast diauxic shift.
• To understand the general features of the correlation of gene expression patterns with
development in Drosophila melanogaster.
• To recognize how microarrays can be used to work out the genes and proteins responsible for
specific features of a phenotype.
• To appreciate how microarrays can be used to study the molecular biology of higher mental
processes in mammals, including learning and the evolution of human language abilities.

Introduction

Microarrays provide the link between the static genome and the dynamic proteome. We use
microarrays: (1) to analyse the mRNAs in a cell, to reveal the expression patterns of proteins;
and (2) to detect genomic DNA sequences, to reveal absent or mutated genes.
For an integrated characterization of cellular activity, we want to determine what proteins are
present, where and in what amounts.

• The transcriptome of a cell is the set of RNA molecules it contains; the proteome is its proteins.

We infer protein expression patterns from measurements of the relative amounts of the
corresponding mRNAs. Hybridization is an accurate and sensitive way to detect whether any
particular nucleic acid sequence is present. Microarrays achieve high-throughput analysis by
running many hybridization experiments in parallel (see Box 9.1).
Expression patterns can also help to identify genes that underlie diseases. Some diseases, such
as cystic fibrosis, arise from mutations in single genes. For these, isolating a region by genetic
mapping can help to pinpoint the lesion. Other diseases, such as asthma, depend on interactions
among many genes, with

BOX 9.1 The basic innovation of microarrays is parallel processing

Compare the following types of measurement:

• ‘One-to-one’. To detect whether one oligonucleotide has a particular known sequence, test
whether it can hybridize to the oligonucleotide with the complementary sequence.
• ‘Many-to-one’. To detect the presence or absence of a query oligonucleotide in a mixture,
spread the mixture out and test each component of the mixture for binding to the oligonucleotide
complementary to the query. This is a northern or Southern blot.
• ‘Many-to-many’. To detect the presence or absence of many oligonucleotides in a mixture,
synthesize a set of oligonucleotides, one complementary to each sequence of the query list, and
test each component of the mixture for binding to each member of the set of complementary
oligonucleotides. Microarrays provide an efficient, high-throughput way of carrying out these
tests in parallel.

To achieve parallel hybridization analysis a large number of DNA oligomers is affixed to known
locations on a rigid support, in a regular two-dimensional array. The mixture to be analysed is
prepared with fluorescent tags to permit the detection of the hybrids. The array is exposed to
the mixture. Some components of the mixture bind to some elements of the array. These elements
now show the fluorescent tags. Because we know the sequence of the oligomeric probe in each spot
in the array, measurement of the positions of the hybridized probes identifies their sequences.
This identifies the components present in the sample (see Figure 9.1).
Such a DNA microarray is based on a small wafer of glass or nylon, typically 2 cm². Oligonucleotides
are attached to the chip in a square array, at densities between 10 000 and 250 000 positions
per cm². The spot size may be as small as ∼150 µm in diameter. The grid is typically a few
centimetres across. A yeast chip contains over 6000 oligonucleotides, covering all known genes of
Saccharomyces cerevisiae.
A DNA array, or DNA chip, may contain 400 000 probe oligomers. Note that this is larger than the
total number of genes, even in higher organisms (excluding immunoglobulin genes). However, the
technique requires duplicates and controls, reducing the number of different genes that can be
studied simultaneously. Nevertheless, it is possible to buy a single chip containing all known
human genes (not all immunoglobulin genes, of course). Also available is a set of ‘tiling’ chips
that cover the entire human genome sequence.
A mixture is analysed by exposing it to the microarray under conditions that promote
hybridization, then washing away any unbound oligonucleotides. To compare material from different
sources, the samples are tagged with differently coloured fluorophores. Scanning the array
collects the data in computer-readable form.

[Figure 9.1 flow diagram: Sample and Control → Isolate mRNA → cDNA synthesis (control and sample
tagged with different fluorescent dyes) → Hybridization, washing → Measure fluorescence pattern]

Figure 9.1 Schematic diagram of a microarray experiment.
A sample to be tested is compared with a control of known properties. From each source, mRNA is
isolated and converted to cDNA, using reagents bearing a fluorescent tag, with different colours
for the control and sample. After hybridizing to the microarrays and washing away unbound
material, the bound target oligonucleotides appear at specific positions. A red spot indicates
binding of oligonucleotides from the sample. A green spot indicates binding of oligonucleotides
from the control. A yellow spot indicates binding of both. Each probe, represented here by a wavy
black line affixed to the support, really contains many copies of a single oligonucleotide.
Indeed, for accurate measurement, the concentration of the target must greatly exceed the
concentration of the probe. If both red- and green-tagged targets are complementary to the
oligonucleotide probe at one spot, both can bind to different probe molecules within the same spot.

environmental factors as complications. To understand the aetiology of multifactorial diseases
requires the ability to determine and analyse expression patterns of many genes, which may be
distributed around different chromosomes.

• The immobilized material on the chip is the probe. The sample tested is the target.

Microarrays are also used to screen for mutations and polymorphisms. Microarrays containing many
sequence variants of a single gene can detect differences from a standard reference sequence.
Different types of chip support different investigations.

• In an expression chip, the immobilized oligonucleotides are cDNA samples, typically 20–80 bp
long, derived from mRNAs of known genes. This is by far the most common type of microarray. The
target sample might contain mRNAs from normal or diseased tissue, for comparison.
Typically, one position on the chip contains an oligonucleotide with the exact sequence we want
to test for and, as a control, another position contains a corresponding mismatched
oligonucleotide, differing by one base near the centre of the sequence. These form a probe pair.
To detect a single mRNA, a chip may contain 16–20 probe pairs, spread over the mRNA sequence.
• In genomic hybridization, one looks for gains or losses of genes, or changes in copy number.

The probe sequences, fixed on the chip, are large pieces of genomic DNA from known chromosomal
locations, typically 500–5000 bp long. The target mixtures contain genomic DNA from normal or
disease states. For instance, some types of cancer arise from chromosome deletions, which can be
identified by microarrays.
• Mutation or polymorphism microarray analysis is the search for patterns of single-nucleotide
polymorphisms (SNPs). The oligonucleotides on the chip are selected from reference genomic data.
They correspond to many known variants of individual genes.
• Protein microarrays are arrays of protein detectors – usually antibodies – that detect
protein–protein interactions.
• Tissue microarrays collect and assemble microscopic samples of tissue. They permit comparative
analysis of the molecular biology and immunohistochemistry of the samples.

• DNA microarrays analyse the RNAs of a cell, to reveal expression patterns of proteins and of
non-protein-coding RNAs; or genomic DNAs, to reveal absent or mutant genes. One of the reasons
that microarrays are such a versatile technology is that there are many different kinds of chips.
Commercially available chips can target all or at least most known genes for individual species,
for example, or even tile an entire genome.

Microarray data are semiquantitative

Microarrays are capable of comparing concentrations of target oligonucleotides. This allows
investigation of responses to changed conditions. Unfortunately, the precision is low. Moreover,
mRNA levels, detected by the array, do not always quantitatively reflect protein levels. Indeed,
usually mRNAs are reverse transcribed into more stable cDNAs for microarray analysis; the yields
in this step may also be non-uniform. Microarray data are, therefore, semiquantitative: although
the distinction between presence and absence is possible, determination of relative levels of
expression in a controlled experiment is more difficult, and measurement of absolute expression
levels is beyond the capability of current microarray techniques. A change in expression levels
of a gene between two samples by a factor ≥1.5–2 is generally considered a significant difference.

Applications of DNA microarrays

• Investigating cellular states and processes. Profiles of gene expression that change with
cellular state or growth conditions can give clues to the mechanism of sporulation, or to the
change from aerobic to anaerobic metabolism. In higher organisms, variations in expression
patterns among different tissues, or different physiological or developmental states, illuminate
the underlying biological processes.
• Comparison of related species. The very great similarity in genome sequence between humans and
chimpanzees suggested that the profound phenotypic differences must arise at the level of
regulation and patterns of protein and RNA expression, rather than in the few differences between
the amino acid sequences of the proteins themselves. Microarrays are the appropriate technique
for following up this idea.
• Diagnosis of genetic disease. Testing for the presence of mutations can confirm the diagnosis
of a suspected genetic disease. Detection of carriers can help in counselling prospective parents.
• Genetic warning signs. Some diseases are not determined entirely and irrevocably by genotype,
but the probability of their development is correlated with genes or their expression patterns.
Microarray profiling can warn of enhanced risk.
• Precise diagnosis of disease. Different related types of leukaemia can be distinguished by
signature patterns of gene expression. Knowing the exact type of the disease is important for
prognosis and for selecting optimal treatment.
• Drug selection. Genetic factors can be detected that govern responses to drugs, which in some
patients render treatment ineffective and in others cause unusual or serious adverse reactions.
• Determination of gene function. A gene with an expression pattern similar to genes in a
metabolic pathway is also likely to participate in the pathway.
• Target selection for drug design. Proteins showing enhanced transcription in particular disease
states might be candidates for attempts at pharmacological intervention.

• Pathogen resistance. Comparisons of genotypes or expression patterns between bacterial strains
susceptible and resistant to an antibiotic point to the possible involvement of the proteins in
the mechanism of resistance.
• Following temporal variations in protein expression. This permits timing the course of
(1) responses to pathogen infection, (2) responses to environmental change, (3) changes during
the cell cycle, and (4) developmental shifts in expression patterns.

Analysis of microarray data

The raw data of a microarray experiment is an image in which the colour and intensity of the
fluorescence reflect the extent of hybridization to alternative probes (see Figure 9.1). The two
sets of targets are tagged with red and green fluorophores. If only one target hybridizes, the
spot appears green; if only the other target hybridizes, the spot appears red. If both hybridize,
the colour of the corresponding spot appears yellow.
Extraction of reliable biological information from a microarray experiment is not
straightforward. Despite extensive internal controls, there is considerable noise in the
experimental technique. In many cases, variability is inherent within the samples themselves.
Microorganisms can be cloned; animals can be inbred to a comparable degree of homogeneity.
However, experiments using RNA from human sources – for example, a set of patients suffering from
a disease and a corresponding set of healthy controls – are at the mercy of the large individual
variations that unrelated humans present. Indeed, inbred animals, and even apparently identical
eukaryotic tissue-culture samples, show extensive variability.
Data reduction involves many technical details of image processing, checking of internal
controls, dealing with missing data, selecting reliable measurements, and putting the results of
different arrays on consistent scales. There is extensive redundancy in a microarray – each
sequence may be represented by several spots, and in addition to straight duplicates, they may
correspond to different regions of a gene. Probe pairs – one perfectly matching oligonucleotide
and the other containing a deliberate mismatch – allow data verification. Different
oligonucleotides cover different segments of each region of interest in the sequence. Typically,
one gene may correspond to ∼30–40 spots.

The initial goal of data processing is a gene expression table. This is a matrix containing
relative expression levels, derived from the raw data. The rows of the matrix correspond to
different genes and the columns to different sources of material. Of course, the gene expression
table is not a simple ‘replica plate’ of the microarray itself. The microarray fluorescence
pattern contains the raw data from which the gene expression table must be extracted. Data from
many spots on the microarray will contribute to the calculation of the relative expression level
of each gene.
A typical experiment compares expression patterns in material from two sources – perhaps a
control of known properties and a sample to be tested. We may wish to compare organisms growing
under different experimental conditions and/or physiological states, or DNA from different
individuals or different tissues, or a series of developmental stages.
Two general approaches to the analysis of a gene expression matrix involve (1) comparisons
focused on the genes, i.e. comparing distributions of expression patterns of different genes by
comparing rows in the expression matrix; or (2) comparisons focused on samples, i.e. comparing
expression profiles of different samples by comparing columns of the expression matrix.

• Comparisons focused on genes: how do gene expression patterns vary among the different samples?
Suppose a gene is known to be involved in a disease, or linked to a change in physiological state
in response to changed conditions. Other genes co-expressed with the known gene may participate
in related processes contributing to the disease or change in state. More generally, if two rows
(two genes) of the gene expression matrix show similar expression patterns across the samples,
this suggests

a common pattern of regulation and some relationship between their functions, possibly including
(but not limited to) a direct physical interaction.
• Comparisons focused on samples: how do samples differ in their gene expression patterns? A
consistent set of differences among the samples may distinguish and characterize the classes from
which the samples originate. If the samples are from different controlled sources (for instance,
diseased and healthy animals), do samples from different groups show consistently different
expression patterns? If so, given a novel sample, we could assign it to its proper class on the
basis of its observed gene expression pattern.

How can we measure the similarity of different rows or columns? Each row or column of the
expression matrix can be considered as a vector in a space of many dimensions. In the row
vectors, or gene vectors (a row corresponds to a gene), each position refers to the same gene in
different samples. The gene vector has as many elements as there are samples. In the column
vectors, or sample vectors (a column corresponds to a sample), each entry refers to a different
gene in a single sample. The sample vector has as many elements as there are genes reported. It
is possible to calculate the ‘angle’ between different gene vectors, or between different sample
vectors, to provide a measure of their similarities. The smaller the angle, the more similar the
pattern.
The gene vectors and sample vectors correspond, separately, to points in spaces of many
dimensions. The number of dimensions is either the number of genes or the number of samples. We
may not be able to visualize easily points in a space with more than three dimensions, but all of
our intuition about geometry works fine. For instance, it is natural to ask whether subsets of
the points form natural clusters – points with high mutual similarity – characterizing either
sets of genes or sets of samples.
We have already encountered clustering in relation to phylogenetic trees. Similarly, in analysis
of gene expression arrays, after finding clusters we can bring similar genes and samples
together. This amounts to reordering the rows and columns of the gene expression matrix. The
results are often displayed as a chart, coloured according to the difference in expression
pattern. Figure 9.2 contains an example.

Depending on the origin of the samples, what is already known about them, and what we want to
learn, data analysis can proceed in different directions.

1. The simplest case is a carefully controlled study, using two different sets of samples of
known characteristics. For instance, the samples might be taken from bacteria grown in the
presence or absence of a drug, from juvenile or adult fruit flies, or from healthy humans and
patients with a disease. We can focus on the question: what differences in gene expression
pattern characterize the two states? Can we design a classification rule such that, given another
sample, we can assign it to its proper class? This would be useful in diagnosis of disease.
Subject to the availability of adequate data, such an approach can be extended to systems of more
than two classes.
In computer science, training such a classification algorithm is called ‘supervised learning’.
The expression pattern of each sample is given by a vector corresponding to a single column of
the matrix. This corresponds to a point in a many-dimensional space – as many dimensions as there
are genes. In favourable cases, the points may fall in separated regions of space. Then a
scientist, or a computer program, will be able to draw a boundary between them. In other cases,
separation of classes may be more difficult.
2. In a different experimental situation, we might not be able to pre-assign different samples to
different categories. Instead, we hope to extract the classification of samples from the
analysis. The goal is to cluster the data to identify classes of samples and then to investigate
the differences among the genes that characterize them.

An intrinsic problem – and a severe one – in interpreting gene expression data is the fact that
the number of genes is much larger than the number of samples. We are trying to understand the
relationship of one space of very many variables (the genes) to another one (the phenotype) from
only a few measured points (the samples). The sparsity of the observations does not give us
anywhere near adequate coverage. Statistical methods bear a heavy burden in the analysis to give
us confidence in the significance of our conclusions.
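The ‘angle’ between gene vectors or sample vectors described in this section can be made concrete
with a short sketch. The following Python fragment is an illustration only, not a method taken
from the text: the gene names, the expression values, and the helper names (angle_between,
sample_vector, nearest_class) are all hypothetical, and the classification rule shown is a simple
nearest-neighbour rule chosen for brevity, one of many possible supervised approaches.

```python
import math

def angle_between(u, v):
    """Return the angle, in degrees, between two equal-length expression vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

# Hypothetical gene expression table: rows are genes, columns are samples.
# Values are invented log ratios, for illustration only.
expression = {
    "geneA": [1.0, 2.0, -1.0, -2.0],
    "geneB": [2.0, 4.0, -2.0, -4.0],   # same pattern as geneA, larger amplitude
    "geneC": [-1.0, -2.0, 1.0, 2.0],   # opposite pattern to geneA
}

def sample_vector(table, j):
    """Column j of the table: one sample across all genes."""
    return [row[j] for row in table.values()]

def nearest_class(sample, labelled_samples):
    """Assign `sample` the label of the labelled sample vector at the smallest angle.

    A deliberately simple nearest-neighbour rule, assumed here for illustration.
    """
    return min(labelled_samples, key=lambda item: angle_between(sample, item[1]))[0]

# Gene-based comparison (rows) and sample-based comparison (columns).
gene_angle_ab = angle_between(expression["geneA"], expression["geneB"])
gene_angle_ac = angle_between(expression["geneA"], expression["geneC"])
sample_angle_01 = angle_between(sample_vector(expression, 0), sample_vector(expression, 1))

# Hypothetical labelled sample vectors (one value per gene) and a new sample.
labelled = [("control", [1.0, 2.0, -1.0]), ("treated", [-1.0, -2.0, 1.5])]
predicted = nearest_class([0.9, 2.2, -1.1], labelled)
```

Because the angle ignores the overall magnitude of a vector, geneA and geneB count as similar
even though their amplitudes differ. Whether that is desirable depends on the question being
asked; a distance measure sensitive to amplitude would give a different grouping.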

[Figure 9.2: clustered chart of expression levels. Columns: Vehicle and Dexamethasone samples.
Colour scale: 0.0 to 5.0. Inset: chemical structure of dexamethasone.]

Figure 9.2 Glucocorticoids are regulators of immune system physiology, widely used as anti-inflammatory drugs, for instance in asthma
and arthritis. However, cataract formation is a common side effect of prolonged use.
Glucocorticoids are well known to affect gene expression patterns. The steroid binds to a glucocorticoid receptor bound on a cell
surface. Like other steroid hormone receptors, the glucocorticoid receptor is a zinc-finger transcription factor. The ligated receptor
dimerizes and translocates to the nucleus where it binds a DNA sequence called the glucocorticoid response element, activating other
transcription factors to modulate expression of target genes.
Gupta and co-workers studied the effect of dexamethasone, a hydrocortisone analogue, on the protein expression pattern in cultured
epithelial cells of the human lens.* Three samples were treated with 1 mM dexamethasone for 4 hours. Three corresponding controls
were given only ‘vehicle’. (In pharmacology, a vehicle is the medium of delivery of the drug.) RNA was extracted, purified, and
converted to cDNA. The microarray used was an Affymetrix chip that contained 22 283 known human transcripts and expressed
sequence tags (ESTs) for about 15 000 genes.
The figure shows data for these six samples. Levels of expression enhanced or reduced by >1.5-fold were considered significant.
Data in red correspond to genes upregulated by dexamethasone; data in green correspond to downregulated genes. The scale on
the right relates colour to expression change. The results identified 93 upregulated transcripts and 43 downregulated ones.
In the figure, the data are clustered based on overall expression level (gene vector, according to rows) and also based on expression
on different chips (sample vector, according to columns). Note that the clusterings by gene and by sample are independent: it would be
possible to change the arrangement of the columns without altering the arrangement of the rows and vice versa.
The trees at the top and at the left indicate the similarities among the results, according to sample vector and gene vector,
respectively. The sample-vector tree at the top cleanly separates the control triple replicates (vehicle) and the dexamethasone-treated
triple replicates. For both control and dexamethasone-treated replicates, the observed expression changes appeared in at least two
of the three measurements. The gene-vector tree at the left has a major breakpoint between downregulated and upregulated genes
(for one gene the behaviour of different replicates is inconsistent).
To discover the implications of these data for cataract induction, the next step would be to examine the biological functions of the
glucocorticoid-sensitive genes. Modern bioinformatics software makes the transition to this computation a facile one. In this case, Gupta
et al. found that the modulated genes involved a wide range of functions. It appears that glucocorticoid elicits a network of responses
that must be pursued downstream to make direct contact with the mechanism of cataract formation.
* Gupta, V., Galante, A., Soteropoulos, P., Guo, S., & Wagner, B.J. (2005). Global gene profiling reveals novel glucocorticoid induced changes in gene
expression of human lens epithelial cells. Mol. Vis. 11, 1018–1040.
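The >1.5-fold criterion mentioned in the caption can be illustrated with a small sketch. This is
not the procedure used by Gupta and co-workers: the function name classify_fold_change, the gene
labels, and the intensity values are hypothetical, and comparing the means of the replicates is a
simplifying assumption (the caption describes a rule based on agreement in at least two of three
measurements).

```python
def classify_fold_change(treated, control, threshold=1.5):
    """Label a gene 'up', 'down', or 'unchanged' from replicate expression levels.

    Compares the mean of the treated replicates with the mean of the controls.
    """
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    if mean_treated <= 0 or mean_control <= 0:
        raise ValueError("expression levels must be positive")
    ratio = mean_treated / mean_control
    if ratio > threshold:
        return "up"
    if ratio < 1.0 / threshold:
        return "down"
    return "unchanged"

# Hypothetical triplicate measurements: (treated replicates, control replicates).
measurements = {
    "gene_x": ([300.0, 320.0, 310.0], [100.0, 110.0, 90.0]),
    "gene_y": ([40.0, 50.0, 45.0], [100.0, 95.0, 105.0]),
    "gene_z": ([100.0, 120.0, 110.0], [100.0, 100.0, 100.0]),
}

calls = {gene: classify_fold_change(t, c) for gene, (t, c) in measurements.items()}
```

A threshold on the ratio alone says nothing about the scatter among replicates, which is one
reason a real analysis would combine it with a consistency or statistical test.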

Many clustering algorithms have been applied to microarray data, including those that try to work
out simultaneously both the number of clusters and the boundaries between them. All algorithms
must face the difficulty arising from the sparsity of sampling. Sometimes it is possible to
simplify the problem by identifying a small number of combinations of genes that account for a
large portion of the variability. This is called reduction of dimensionality.

• Processing the data from a microarray experiment produces a gene expression table, or matrix.
The rows index the genes and the columns index the samples. We can either focus on the genes, and
ask: how do patterns of expression of different genes vary among the different samples? Or we can
focus on the samples, and ask: how do the samples differ in their gene expression patterns?
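One common way to find a combination of genes that accounts for a large portion of the
variability is principal component analysis. The text does not name a specific method, so the
following Python sketch is an assumption: it finds only the leading principal component, by power
iteration on the gene–gene covariance matrix, and the function name first_principal_component and
the sample values are hypothetical.

```python
import math

def first_principal_component(samples, iterations=200):
    """Return (direction, fraction_of_variance) for the leading principal component.

    samples: list of sample vectors, each a list of expression values (one per gene).
    """
    n = len(samples)
    d = len(samples[0])
    means = [sum(s[k] for s in samples) / n for k in range(d)]
    centred = [[s[k] - means[k] for k in range(d)] for s in samples]
    # Covariance matrix between genes (d x d).
    cov = [[sum(row[i] * row[j] for row in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance along the direction found.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue / total_variance

# Hypothetical data: four samples, three genes. Genes 1 and 2 vary together;
# gene 3 is nearly constant.
samples = [
    [1.0, 2.0, 5.0],
    [2.0, 4.1, 5.1],
    [3.0, 5.9, 4.9],
    [4.0, 8.0, 5.0],
]

direction, fraction = first_principal_component(samples)
```

In this invented example a single combination of the first two genes captures nearly all of the
variation, so three dimensions reduce effectively to one. With thousands of genes and few samples
the covariance matrix is large and poorly determined, and a library routine based on singular
value decomposition would be the practical choice.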

Expression patterns in different physiological states

A fundamental question in biology is how different components of cells smoothly integrate their
activities. Measurements of expression patterns tell us part of the story. They provide an
inventory of the components but suggest only inferentially how they interact.
Comparisons of alternative physiological states of an organism offer the possibility of
extracting, from an entire genome, a subset of genes that underlie a particular life process. An
example of shift in physiological state in microorganisms is diauxy. Diauxy, or double growth, is
the switch in metabolic state of a microorganism when, having exhausted a preferred nutrient, it
‘retools’ itself for growth on an alternative. The organism may show a biphasic growth curve,
with a lag period while the changed complement of proteins is synthesized.

• Jacques Monod discovered diauxy approximately 70 years ago, during his predoctoral work in
Paris. He described his observations in his 1941 thesis. Resuming his research career after the
war, he and his colleagues at the Institut Pasteur made their fundamental discoveries about the
mechanism of gene regulation.

The diauxic shift in yeast is the transition from fermentative to oxidative metabolism upon
exhaustion of glucose as an energy source. In this chapter, we shall examine the effects of the
reconfiguration of the expression patterns, comparing the different complements of genes active
in the two states. In Chapter 11, we shall return to the yeast diauxic shift to examine the
control mechanisms that regulate the transcriptional reprogramming.
As an example from higher organisms, we shall look at changes in gene expression patterns in
sleep and wakefulness.

The diauxic shift in Saccharomyces cerevisiae

Yeast is capable of adapting its metabolism to a variety of environmental conditions. In the
presence of glucose, Saccharomyces cerevisiae will – even in the presence of oxygen –
preferentially use the Embden–Meyerhof fermentative pathway, reducing glucose to ethanol.
Exhaustion of available glucose produces the ‘diauxic shift’ to oxidative metabolism, sending the
products of fermentation through the Krebs (tricarboxylic acid) cycle and mitochondrial oxidative
phosphorylation (see Figure 9.3).
The shift is not merely a redirection of metabolic flux through alternative pathways using
pre-existing proteins but involves protein synthesis, with a substantial change in expression
pattern. Many genes are involved, not only enzymes in the awakened metabolic pathways. Another
consequence of switching to respiratory metabolism is the danger of oxidative damage, and the
oxidative stress response requires enhanced expression of many other genes.

• Oxygen is essential for aerobic life, yet its reduced forms include some of the most toxic
substances with which cells must cope.

The cells are effectively sensing the level of glucose. As glucose itself acts as a repressor of
expression of many genes, depletion of glucose releases this repression – ‘turning on’ a variety
of genes.

[Figure 9.3 pathway diagram. Labels: Glucose, Gluconeogenesis, Glycolysis, Fermentation,
Pyruvate, Ethanol, Oxaloacetate, Acetyl-CoA, Glyoxylate cycle, Krebs cycle, Succinate, Oxidative
phosphorylation]

Figure 9.3 Some metabolic pathways in yeast affected by the diauxic shift.
In the presence of ample glucose, yeast will adopt an anaerobic metabolic regimen, converting
glucose to ethanol via glycolysis and fermentation (Embden–Meyerhof pathway; red arrows). Upon
running out of glucose, it will shift to an aerobic metabolic state in which ethanol is converted
to CO2 and H2O via the Krebs cycle and oxidative phosphorylation (blue arrows). For the
alternative pathways of energy release, pyruvate is the branch compound.
However, the shift to a utilization of a different energy source creates two concomitant problems.
1. The oxidation of ethanol does not provide precursors for essential biosynthetic pathways.
Oxidation of ethanol converts all carbon to CO2. Some must be retained and converted to three-
and four-carbon compounds, and even glucose, via the glyoxylate cycle and gluconeogenesis.
Acetyl-CoA is the branch compound for this shift – it enters both the Krebs cycle and the
glyoxylate cycle.
The glyoxylate cycle shares several intermediates with the Krebs cycle but, to conserve carbon,
leaves out the decarboxylations that produce CO2. Succinate (green dot), one of the metabolites
common to the Krebs and glyoxylate cycles, is converted to oxaloacetate in mitochondria. It then
feeds into numerous biosynthetic pathways (green arrow).
2. In addition to the metabolic pathways shown in this figure, yeast must also activate pathways
of defence against oxidative stress. The danger is from potential chemical attack by reactive
oxygen species, produced by partial reduction of oxygen: the superoxide radical (O2−•), hydrogen
peroxide (H2O2), and the hydroxyl radical (OH•). In its defence, yeast increases expression of
genes involved in detoxification of reactive oxygen species. The mechanism of induction of these
genes depends on the transcription factor Yap1p. An increase in the concentration of reactive
oxygen species leads to formation of disulphide bridges in Yap1p. This causes a conformational
change that masks a nuclear export signal. The result is redistribution of the transcription
factor to the nucleus, the site of its activity.

An important component of the mechanism by which high glucose levels lead to repression is the
state of phosphorylation of a transcriptional regulatory protein called Mig1. The phosphorylated
form of Mig1 stays in the cytoplasm. Upon dephosphorylation, it enters the nucleus and binds to
promoters of various genes, repressing their expression. The activity of kinase Snf1–Snf4, which
phosphorylates Mig1, depends on the AMP/ATP concentration ratio in the cell. Growth on high
levels of glucose generates high levels of ATP via glycolysis, keeping the Snf1–Snf4 kinase
inactive. This leaves Mig1 in its active, dephosphorylated form, repressing glucose-sensitive
genes.
In a seminal paper in 1997, DeRisi, Iyer, and Brown reported on the changes in gene expression in
the diauxic shift in S. cerevisiae.* They observed that the patterns of gene expression were
stable during exponential growth in a glucose-rich medium. This justifies considering the initial
anaerobically growing population as being in a state that can be characterized by a set of
expression levels of genes. (If, alternatively, there were fluctuations in expression levels that
were large compared with the differences between anaerobic and aerobic regimes, it would not be
possible to analyse the diauxic shift in terms of expression patterns. In fact, if there are
fluctuations, they average out over the population.)

* DeRisi, J.L., Iyer, V.R., & Brown, P.O. (1997). Exploring the metabolic and genetic control of
gene expression on a genomic scale. Science 278, 680–686.

As glucose is depleted, the expression pattern changes (see Figure 9.4, from work by Brauer and
co-workers). The changes affect a large fraction, almost 30%, of the genes (Table 9.1).

Table 9.1 Genes affected when glucose is depleted

Expression ratio in aerobic/anaerobic states    >2     <1/2    >4     <1/4
Number of genes                                 710    1030    183    203

The genes differentially expressed are associated with proteins in several functional classes.
Upon converting from anaerobic to aerobic metabolism several things occur.

• Metabolic pathways are rerouted. Synthesis of pyruvate decarboxylase is shut down; synthesis of

tions, and oxidative phosphorylation (blue arrows in Figure 9.3).
• Genes related to protein synthesis show a decrease in expression level. These include genes for
ribosomal proteins, tRNA synthetases, and initiation and elongation factors. An exception is that
genes encoding mitochondrial ribosomal molecules have generally enhanced expression.
• Genes related to a number of biosynthetic pathways show reduced expression. These include genes
encoding enzymes of amino acid and nucleotide metabolism.
• Genes involved in defence against oxidative stress from reactive oxygen species show enhanced
expression. Proteins encoded include catalases, peroxidases, superoxide dismutases, and
glutathione S-transferases.

• The diauxic shift in yeast is the transition from fermentative to oxidative metabolism when the
yeast runs out of glucose in the medium. The shift is effected by a retooling in which the
expression patterns of relevant genes are altered.

[Figure 9.4: chart of expression patterns. Column headings: D=0.05, then Time (hours) 7.25, 7.5,
7.75, 8.0, 8.25, 8.5, 8.75, 9.0, 9.25, 9.5, 9.75, 10.0]

Sleep in rats and fruit flies

All humans sleep. The overt characteristics of sleep
are familiar: approximately cyclic periods of reduced
consciousness, relaxation, and quiescence; a raised
arousal threshold; and dreaming (highly correlated
to periods of rapid eye movements or REM). Neuro-
physiologists distinguish different stages of sleep, with
Figure 9.4 Expression patterns of yeast growing initially on different characteristic patterns in electroencephalo-
glucose, showing the yeast undergoing a diauxic shift. Column grams. The consequences of sleep deprivation are
headings indicate time in hours after exhaustion of glucose in the also familiar: reduced vigilance and performance and
samples undergoing diauxic shift. general stroppiness. These demonstrate that sleep has
From: Brauer, M.J., Saldanha, A.J., Dolinski, K., & Botstein, D. (2005). a necessary restorative function.
Homeostatic adjustment and metabolic remodeling in glucose-limited
yeast cultures. Mol. Biol. Cell 16, 2503–2517.
Other animals sleep. Even Caenorhabditis elegans
can enter a state of torpor, akin to sleep. Fruit flies
pyruvate carboxylase is enhanced. This switches sleep, and their sleep shares many features of ours.
the product of the reaction of pyruvate from Like us, they sleep deeper and longer after sleep
acetaldehyde to oxaloacetate. Enhanced synthesis deprivation. Caffeine keeps them awake. Their sleep
of genes for fructose-1,6-bisphosphatase and correlates with changes in brain electrical activity.
phosphoenolpyruvate carboxylase changes the We would not expect fruit flies to show all of the
direction of two steps in glycolysis. Expression is higher neurophysiological correlates of sleep, such as
enhanced for genes encoding the enzymes that dreaming, but it is not unreasonable to hope to find
carry out the new Krebs and glyoxylate cycle reac- some analogues at the biochemical level.
It is always useful to study a biological phenomenon in the simplest organism that exhibits it, as well as in humans. Comparisons of flies and mammals may reveal fundamental common features disguised within the individual complexities of different species. The results may also illuminate the functions associated with homologous genes. This is not to deny that, in the realm of cognitive phenomena, humans have pushed things farther than other species and present unique features.

Sleep disorders present common and serious medical problems. Sleep deprivation is a major cause of accidents, leading to loss of life and damage to health, property, and productivity. Indeed in rats and flies, prolonged sleep deprivation is itself fatal, after slightly more than 2 weeks.

C. Cirelli and co-workers studied gene expression patterns of rats and fruit flies in three physiological states: spontaneously awake, spontaneously asleep, and sleep deprived. Protocols were similar but not identical in the experiments on the two species.*

Flies were prepared by accustoming them to an alternating regimen: 12 hours with the lights on, awake (8 a.m. to 8 p.m.) and 12 hours in the dark, asleep (8 p.m. to 8 a.m.). A group of flies was then sleep-deprived for 8 hours after the normal end of the waking period, i.e. from 8 p.m. to 4 a.m.

A complication in studying the effect of sleep and wakefulness arises from the circadian rhythms that the expression patterns of many genes are known to obey. The use of the sleep-deprived animals allowed the effects of waking and sleep states to be distinguished from time-of-day effects. To this end, samples of spontaneously asleep and sleep-deprived flies were collected at 4 a.m. Samples from spontaneously awake flies were collected at 4 p.m. (Table 9.2).

Table 9.2 Sleep and wakefulness states of flies

State     Time of sample collection
          4 p.m.                 4 a.m.
Awake     Spontaneously awake    Sleep deprived
Asleep                           Spontaneously asleep

* Cirelli, C., LaVaute, T.M., & Tononi, G. (2005). Sleep and wakefulness modulate gene expression in Drosophila. J. Neurochem. 94, 1411–1419; C. Cirelli (2005). A molecular window on sleep: changes in gene expression between sleep and wakefulness. Neuroscientist 11, 63–74.

The logic is that if the expression pattern of a gene differs between the sleep-deprived and spontaneously awake flies, both in a waking state, the controlling factor must be time of day.

A gene was classified as sleep related if its expression was elevated by a factor >1.5 in spontaneously asleep flies relative to both spontaneously awake and sleep-deprived flies. A gene was classified as wakefulness related if its expression was elevated by a factor >1.5 in both spontaneously awake and sleep-deprived flies relative to spontaneously asleep flies. Some genes showed the influence of both physiological state and time of day.

Cirelli and co-workers studied expression patterns of ∼10 000 genes of the fly. Of these, 121 were wakefulness related and 12 were sleep related. The expression of a partly overlapping set of 130 genes was modulated by time of day: 87 were more highly expressed at 4 p.m. and 43 were more highly expressed at 4 a.m.

The overlap of sleep/wakefulness-related genes with those modulated by time of day demonstrated a relationship between homeostatic and circadian regulation. Two-thirds of sleep-related genes and one-fifth of wakefulness-related genes were modulated by time of day. This is consistent with the observation that flies with mutations in circadian genes can show abnormal homeostatic regulation of sleep.

In both flies and rats, the genes preferentially expressed in wakefulness and sleep fell into several different functional categories.

In flies, genes preferentially expressed in waking states included those encoding proteins involved in detoxification, including cytochrome P450s and glutathione S-transferases; genes involved in defence against immune challenge and in lipid, carbohydrate, and protein metabolism; a transcription factor; a nuclear receptor; and the circadian gene cryptochrome. Genes preferentially expressed in sleep included the glial gene anachronism, the gene encoding the catalytic subunit of glutamate–cysteine ligase, and other genes involved in lipid metabolism.

In rat brains, a much larger number of genes than in flies had raised expression levels during wakefulness. (For rats, a less strict criterion of significant change in expression level was applied: a ratio of >1.2 compared with a ratio of >1.5 in flies.) In contrast to flies, which show fewer sleep-related than
wakefulness-related genes, in the rat approximately the same number of genes showed enhanced expression in sleep as in wakefulness, as shown in Table 9.3. The genes with enhanced expression are associated with different biological functions.

Table 9.3 Enhanced expression of genes in the rat in sleep and wakefulness

                      Wakefulness related                                      Sleep related
Learning and memory   Synaptic plasticity: acquisition, potentiation           Synaptic plasticity: consolidation, depression
Transport             –                                                        Membrane trafficking and maintenance
Metabolism            Energy metabolism                                        Cholesterol biosynthesis
                      Transcription (positive regulation)                      Transcription (negative regulation)
                      Translation (negative regulation)                        Translation (positive regulation)
                      Translation                                              Translation
Stress response       General stress response and unfolded protein response
Cell signalling       Depolarization-sensitive                                 Hyperpolarization-promoting (leakage)
                      Glutamatergic neurotransmission                          GABAergic neurotransmission

From: Cirelli, C. (2005). A molecular window on sleep: changes in gene expression between sleep and wakefulness. Neuroscientist 11, 63–74.

In the rat, wakefulness-related genes are associated with memory acquisition, energy metabolism, transcription activation, cellular stress, and excitatory neurotransmission. Sleep-related genes are associated with a potassium channel, translation machinery, long-term memory, and membrane trafficking and maintenance, including synthesis and transport of glia-derived cholesterol. Cholesterol is a major component of myelin and other membranes. (Membrane maintenance suggests an analogue, at the molecular level, of Shakespeare’s metaphor that sleep ‘. . . knits up the ravelled sleeve of care . . .’.)

Similarities between rats and flies in the functional categories of sleep- and wakefulness-associated genes are interesting. Wakefulness-associated genes in both species include those for members of the Egr family of transcription factors, the mammalian NGFI-B nuclear receptor and the fly orthologue, homologous MAP kinase phosphatases, and for proteins involved in cholesterol synthesis.

Another window on to the molecular biology of sleep is the effect of mutation. A fruit fly mutant, Shaker, sleeps approximately one-third as long as wild-type flies. The gene involved encodes a potassium channel that affects neural electrical activity. Some humans can get by regularly on only 3–4 hours’ sleep per 24 hour interval, but it is not known whether this trait is under the control of the homologous gene. However, what does suggest a link to the fly mutant is a rare human disorder, Morvan’s syndrome, the symptoms of which include insomnia. At least one case of Morvan’s syndrome has appeared to be of immune origin, involving an autoantibody against a potassium channel.

• If the physiology of sleep is not well understood, the molecular biology of sleep is even more obscure. Measurements of changes in gene expression during sleep and wakefulness give clues as to what distinguishes the states, at the molecular level.
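
The classification rule used for the fly data in this section (a factor of >1.5 between states, with the sleep-deprived sample separating state from time of day) can be written out explicitly. The following is a minimal sketch in Python, not the authors' analysis code: the function names, the dictionary keys, and the reuse of the 1.5 threshold for the time-of-day comparison are all assumptions made for illustration.

```python
from typing import Dict

FOLD = 1.5  # fold-change criterion quoted in the text for flies

# Each gene is described by three expression levels (arbitrary units):
#   "awake"    - spontaneously awake, sampled at 4 p.m.
#   "asleep"   - spontaneously asleep, sampled at 4 a.m.
#   "deprived" - sleep deprived (awake), sampled at 4 a.m.
# These key names are hypothetical.


def classify_state(expr: Dict[str, float], fold: float = FOLD) -> str:
    """Apply the sleep/wakefulness rule described in the text."""
    awake, asleep, deprived = expr["awake"], expr["asleep"], expr["deprived"]
    # Sleep related: higher in sleep than in BOTH waking samples.
    if asleep > fold * awake and asleep > fold * deprived:
        return "sleep related"
    # Wakefulness related: BOTH waking samples higher than the sleep sample.
    if awake > fold * asleep and deprived > fold * asleep:
        return "wakefulness related"
    return "unclassified"


def time_of_day_modulated(expr: Dict[str, float], fold: float = FOLD) -> bool:
    """Both samples are awake, so a difference points to time of day.

    Using the same fold threshold here is an assumption; the text does
    not state the criterion applied to this comparison.
    """
    hi = max(expr["awake"], expr["deprived"])
    lo = min(expr["awake"], expr["deprived"])
    return hi > fold * lo


if __name__ == "__main__":
    gene = {"awake": 3.0, "asleep": 1.0, "deprived": 1.6}
    print(classify_state(gene), time_of_day_modulated(gene))
```

The example gene is higher in both waking samples than in sleep, and also differs between the two waking samples, so it would fall into the group that shows the influence of both physiological state and time of day.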

Expression pattern changes in development

Variation of expression patterns during the life cycle of Drosophila melanogaster

During their lifetime, insects undergo macroscopic changes in body plan that are more profound than any post-embryonic development in humans and other mammals, and even in amphibians. This allows juvenile and adult flies to occupy different ecological niches. The major stages of a fly’s life are embryonic, larval, pupal, and adult. Metamorphosis occurs during the pupal stage: flies spend their ‘adolescence’ sequestered within a pupa (to the envy of many a parent of a human teenager).

Fly development has been intensively studied at the molecular level. An impressive understanding has been achieved of the mechanism of translation of molecular signals into macroscopic anatomy. The genesis of specific organs – eyes, legs – has been carefully analysed. We have already encountered HOX genes and their relationship to body plan.

Arbeitman and co-workers examined changes in transcription patterns in Drosophila melanogaster during different stages of its life. When they took up the problem, it was known from earlier work that large-scale changes in gene expression occurred. Microarrays made possible a more systematic and thorough study.

cDNAs containing representatives of 4028 genes (about one-third of the total estimated number in D. melanogaster) revealed expression patterns for 66 selected time periods from embryo through to adulthood (see Figure 9.5). Expression levels were compared with pooled mRNA from all life stages, to represent a (weighted) average expression level. The interval between measurements varied from 1 hour (for embryos) up to several days (for adults) until a total age after fertilized egg of up to 40 days.

Stage     Approximate interval between measurements
Embryo    1 hour
Larva     5 hours
Pupa      8 hours
Adult     3–4 days

[Figure 9.5. Horizontal axis: developmental time, divided into stages E, L, P, A. Colour scale: expression level, <0.25, 0.33, 0.5, 1, 2, 3, >4]

Figure 9.5 Gene expression profiles at different life stages of D. melanogaster, ordered by the time of the first rise in transcription levels. E, embryo; L, larva; P, pupa; A, adult. The scale of expression level, relative to that of pooled mRNA samples from all developmental stages, is shown by intensity of colour: black, small change in expression level; dark blue→light blue, increasing downregulation relative to the control; dark yellow→light yellow, increasing upregulation relative to the control. Because of the variation in measurement intervals, the developmental time scale governing the horizontal axis does not correspond to calendar time. The embryonic stage lasts ∼1 day, the larval stage ∼4 days, the pupal stage 5 days, and the adult stage until cessation of data collection, 30 days.
Looking at the distribution of yellow, it can be seen that some genes are expressed at high levels at single specific stages, but others are expressed at high levels at more than one stage. A gene expressed throughout the life of the fly, at no less than the level of the pooled sample, would appear here as a row containing only black and yellow regions, with no blue.
From: Arbeitman, M.N., et al. (2002). Gene expression during the life cycle of Drosophila melanogaster. Science 297, 2270–2275. Copyright AAAS. Reproduced by permission.

Most of the genes tested – 3483 out of 4028, or 86% – changed expression levels significantly at some stage(s) of life. Of these, 3219 varied by a factor of >4 between their maximum and minimum values.

The data show that in Drosophila, as in other species, genes participating in a common process often exhibit parallel expression patterns and similar perturbations of these patterns in mutants. For instance, the expression pattern of the eyes absent mutant, which produces an eyeless or at least a reduced eye phenotype, forms a cluster, in the analysis of expression patterns, with 33 genes. Of these, 11 are
already known to function in eye differentiation or phototransduction. The other 22 are likely to as well; the data provide at least hypotheses, and at best reliable clues, to the function of these genes.

Different life stages make different demands on different genes

Different genes exhibit different temporal patterns of expression. Most of the developmentally modulated genes are expressed in the embryonic stage, as the whole system is getting started. Genes expressed in the early embryo include transcription factors, proteins involved in signalling and signal transduction, cell-adhesion molecules, channel and transport proteins, and biosynthetic enzymes. A third of these are maternally deposited genes; many of these fall off in expression level within 6–7 hours.

• Stathopoulos and Levine* comment that ‘. . . the genesis of a complex organism from a fertilized egg is the most elaborate process known in biology, and thereby depends on “every trick in the book”’.

* Stathopoulos, A. & Levine, M. (2002). Whole-genome expression profiles identify gene batteries in Drosophila. Dev. Cell 3, 464–465.

The genes studied include one large stage-specific class (36.3%) that shows a single major peak of expression. Some of these remain constitutively expressed (at lower levels) subsequently. Others show sharp peaks in expression level.

Another group of genes (40.3%) shows two peaks in expression. Two patterns are common: genes with their first onset of enhanced expression early in embryogenesis generally have their second at pupation, with elevated expression levels continuing into the pupal stage. Genes with their first onset of enhanced expression late in embryogenesis generally have their second at the late pupal stage, with elevated expression levels continuing into the adult stage. The remaining 23.4% of genes show multiple peaks in expression level.

The observation of similarities between expression patterns in embryonic stages and pupal stages, and between larval and adult stages, is interesting. Certain analogies suggest themselves. In both embryonic and pupal stages, body structures are forming and physical activity is minimal. In contrast, larval and adult stages have a more active lifestyle but stasis of anatomical form, although larvae but not adults grow substantially in size. Consistent with the notion of a ‘go back and get it right this time’ aspect to metamorphosis is the occurrence of some dedifferentiation in the pupa.

Ideas of this sort can be examined in light of the nature of genes that show different lifelong expression patterns. For example, maximal expression levels of most metabolic genes occur during larval and adult stages. Another set of genes involved in larval and adult muscle development has a similar two-peak expression pattern. More precise analysis is possible and shows that steps in the regulatory hierarchy for muscle development show peaks at different times, with genes expressed later being downstream in the regulatory hierarchy. Similar time-of-onset sequences appear in both larvae and pupae. The two stages at which body plans are formed – embryo and pupa – re-utilize not only the same materials but also some of the same mechanisms.

• Measurement of expression patterns at different stages reveals which batteries of genes are active in development. Particularly striking, in Drosophila, is the alternation of time-of-onset of the expression patterns of some genes: embryo and pupa show similarities, and larva and adult.

Flower formation in roses

Roses have long been favourites of gardeners and lovers: for their appearance, for their scent, and, perhaps less commonly, for their purported medicinal qualities – rose hips (fruits) are rich in vitamin C. Breeders have responded by creating tens of thousands of varieties.

• Roses have been known in Europe since antiquity and appear frequently in literature, famously in works by Dante and Shakespeare. Roses symbolized the two armies fighting for control of England at Bosworth in 1485.

Roses have been cultivated for 5000 years. Rose varieties brought to Europe from China at the end of the 18th century introduced the very desirable property of recurrent blooming throughout the season,
rather than flowering only once a year. Some arrived in 1794 with Lord Macartney on his return from the famous embassy to the court of the Qianlong emperor. It took only a single gene change to achieve recurrent flowering. The Chinese strains also enlarged the palette to include yellow and scarlet.

Flower formation involves tissue differentiation and formation of an organ with specialized structure and function. Flowers are the reproductive organs of plants. Their beautiful appearance and scent attract insect pollinators. In addition to homeotic genes that generate the structure of the flower, novel metabolic pathways are activated to produce the small molecules responsible for the colours and scents. Colour and scent are properties of the petals, which are the focus of this section.

Plant developmental biologists distinguish six stages in the development of a rose flower (see Figure 9.6). In formation of petals, an initial stage of cell division may stop while the flower is less than half its final size. Subsequent development occurs by differential cell elongation.

Figure 9.6 Three of the six stages in the development of a rose flower: stages 1 (left), 4 (centre), and 6 (right). At stage 1, petals are beginning to emerge. At stage 4, cells are actively elongated and show increasing pigment concentrations. Finally, at stage 6, the flower is fully open. This figure illustrates the cultivar Fragrant Cloud.
From: Dafny-Yelin, M., et al. (2005). Flower proteome: changes in protein spectrum during the advanced stages of rose petal development. Planta 222, 37–46.

Regulation of pathways that produce scents in roses

Compounds from roses are an important source of fragrances (see Table 9.4). Their production rises to a peak in the mature flower, in consequence of raised expression levels of the enzymes that synthesize them.

Table 9.4 Major classes of molecule in the scent of Fragrant Cloud roses

Compounds                          Amount emitted (μg/flower per day)
Esters                             61
Aromatic and aliphatic alcohols    37
Monoterpenes                       18
Sesquiterpenes                     10

• Terpenes are compounds formed from isoprene units (2-methyl-1,3-butadiene). Sesquiterpenes are 15-carbon compounds formed from three isoprene units.

In order to identify enzymes involved in rose scent production, Guterman and co-workers created an expressed sequence tag (EST) database and compared expression patterns in different stages of flower development.* They contrasted two tetraploid cultivars, called Fragrant Cloud and Golden Gate. Fragrant Cloud gives large, red flowers with a strong scent (this is the cultivar illustrated in Figure 9.6). Golden Gate has yellow flowers and a less distinct odour to humans. Two experiments might identify the proteins involved in scent production: (1) a comparison of mature flowers between Fragrant Cloud (highly scent producing) and Golden Gate (poorly scent producing); and (2) a comparison of Fragrant Cloud flowers at an early developmental stage, before scent production ramps up, with the mature Fragrant Cloud flower, which is rich in scent.

• Why an EST library? The rose genome has not yet been fully sequenced. It is about 500–600 Mb long, distributed on seven chromosomes. Most species are diploid or tetraploid; a few are triploid, hexaploid, or octaploid.

* Guterman, I. et al. (2002). Rose scent: genomics approach to discovering novel floral fragrance-related genes. Plant Cell 14, 2325–2328.

The cDNA libraries created from both flowers in stage 4 contained 2139 unique sequences. Of these, 1288 were found only in Fragrant Cloud and 746 only in Golden Gate. Expression patterns were studied of 350 Fragrant Cloud genes associated with:
• primary and secondary metabolism;
• development;
• transcription;
• cell growth;
• cell biogenesis and organization;
• cell rescue;
• signal transduction; and
• unknown functions.

Taking, as a threshold of significance, a twofold difference in expression level, 77 genes (about one-fifth) had higher expression levels in Fragrant Cloud than in Golden Gate, and only three had higher levels in Golden Gate; the rest had similar expression levels in both varieties. Comparing Fragrant Cloud flowers in stage 1 and stage 4, 65 genes had higher expression levels in stage 4 and 14 had lower levels. Common to both sets were 40 genes that were more highly expressed both in Fragrant Cloud relative to Golden Gate and in Fragrant Cloud stage 4 relative to Fragrant Cloud stage 1 flower development (see Figure 9.7).

[Figure 9.7 diagram: 25, Fragrant Cloud stage 4 > stage 1; 40, common; 37, Fragrant Cloud > Golden Gate]

Figure 9.7 Two rose cultivars, Fragrant Cloud and Golden Gate, differ in their maximal odour production in stage 4 of flower development: Fragrant Cloud is rich in scent and Golden Gate is poor. Scent development in Fragrant Cloud is fully developed in stage 4 relative to the immature flowers in stage 1.
In comparisons of expression patterns, 37 genes were expressed more highly in stage 4 Fragrant Cloud rose flowers than in stage 4 Golden Gate rose flowers only; 25 genes were expressed more highly in stage 4 Fragrant Cloud rose flowers than in stage 1 Fragrant Cloud rose flowers only; and 40 were expressed more highly in stage 4 Fragrant Cloud flowers than in both stage 4 Golden Gate flowers and stage 1 Fragrant Cloud flowers.
These experiments can help to identify proteins preferentially expressed in stage 4 Fragrant Cloud flowers that are involved in scent biosynthesis.

What are these 40 genes? Fifteen are involved in metabolism and seven appear to have roles in secondary metabolism, i.e. reactions outside of the core metabolic pathways responsible for normal growth, development, and reproduction. Two very strongly upregulated genes encode proteins similar in sequence to known enzymes: glutamate decarboxylase, and (+)-δ-cadinene synthase. The enzyme (+)-δ-cadinene synthase is involved in the synthesis of sesquiterpenes, one of the classes of floral scent molecule.

Cloning of the (+)-δ-cadinene synthase gene to produce recombinant enzyme permitted direct functional studies. It catalyses the reaction of farnesyl diphosphate to produce the sesquiterpene germacrene D, the major sesquiterpene component of the scent of Fragrant Cloud (see Figure 9.8).

Figure 9.8 Conversion of precursor farnesyl diphosphate to germacrene D, a common scent molecule produced by rose petals.

Colour: the elusive blue rose

A walk through many neighbourhoods in temperate climates will reveal the many colours that roses can display, notably red, pink, peach, and yellow. Conspicuous by its absence is blue. It has not been possible to produce a blue rose by classical selective breeding. This is not for lack of trying: in 1840 the horticultural societies of Great Britain and Belgium offered a prize of 500 000 francs for one.

The major pigments of flower petals are anthocyanins: cyanidin, pelargonidin, and delphinidin. All are synthesized from a common precursor, dihydrokaempferol (see Figure 9.9). It is delphinidin that is blue, but roses lack the gene for the enzyme that produces dihydromyricetin from dihydrokaempferol (marked by a green * in the figure).

Scientists at the Australian Commonwealth Scientific and Industrial Research Organisation (CSIRO), in collaboration with a Melbourne company, Florigene, and the Suntory Corporation of Japan, have created a blue rose by genetic engineering. There were two challenges: to produce the blue pigment and to turn off synthesis of red and orange pigments.

To suppress the red and orange pigments, the target was the enzyme dihydroflavonol reductase (DFR). DFR modifies precursor molecules in all three branches of the pathway; a DFR− plant would have white flowers, providing a ‘clean slate’ for production and display of blue pigments. However, DFR is also needed for synthesis of delphinidin.

It was necessary to manipulate the pathways with precision. An interfering RNA (RNAi) was engineered
to knock out the rose DFR. Then, introduction of a DFR from iris, together with the gene for flavonoid 3′,5′-hydroxylase from pansy to supply the enzyme missing in roses, produced a plant with high levels of petal delphinidin and only small amounts of cyanidin. The rose and iris DFR genes are quite similar but have two important differences: (1) the natural specificity profile of the iris DFR produces delphinidin predominantly; and (2) the RNAi can be sufficiently specific that it selectively inactivates the rose DFR and not the iris homologue.

The new rose is a shade of blue (see Figure 9.10). The reason it is not a purer blue has to do with the relatively acidic pH of rose petals compared with other flowers. (Indeed, anthocyanins can be used as pH indicators.) Finding ways to modify the intracellular pH is the subject of current research.

[Figure 9.9 scheme, three branches: dihydroquercetin → (DFR) → cyanidin 3-glucoside; dihydrokaempferol → (DFR) → pelargonidin 3-glucoside; dihydromyricetin → (DFR) → delphinidin 3-glucoside. The step from dihydrokaempferol to dihydromyricetin is marked *.]

Figure 9.9 Simplified scheme showing structures and metabolic relationships of anthocyanin flower pigments. The difficulty in breeding a blue rose is that roses lack the gene that encodes the enzyme flavonoid 3′,5′-hydroxylase (F3′,5′H), which produces the precursor of the blue pigment delphinidin. In other plants, this enzyme acts at the point marked by a green *. The steps corresponding to the vertical arrows are all catalysed by a single enzyme, dihydroflavonol reductase (DFR).

• Blue roses do not exist in nature, and attempts to breed them have been unsuccessful. Understanding of the underlying metabolic pathways of the pigments allows rational attempts at genetic engineering. It is interesting that the pH of the petals has an effect – the pigments are ‘indicators’. If this effect helped to stimulate Sydney Brenner’s fascination for molecular biology,* that may be its greatest significance for the field.

Figure 9.10 Picture of delphinidin-rich rose flower, produced by methods similar to those described in the text.
Photograph from Suntory Ltd.

* Brenner, S. (2001). My Life in Science, BioMed Central Ltd., London, pp. 5–6.
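
Returning to the scent comparison earlier in this section: the three regions of Figure 9.7 follow by simple set arithmetic from the counts given in the text (77 genes higher in Fragrant Cloud than in Golden Gate, 65 higher in stage 4 than in stage 1, and 40 common to both). The following is a minimal sketch in Python; the function name venn_counts and its argument names are illustrative choices, not taken from the study.

```python
from typing import Dict


def venn_counts(n_a: int, n_b: int, n_both: int) -> Dict[str, int]:
    """Split two overlapping gene sets into the regions of a Venn diagram.

    n_a    - size of set A (here: Fragrant Cloud > Golden Gate)
    n_b    - size of set B (here: stage 4 > stage 1 in Fragrant Cloud)
    n_both - number of genes in both sets
    """
    if n_both < 0 or n_both > min(n_a, n_b):
        raise ValueError("overlap cannot exceed the smaller set")
    return {
        "a_only": n_a - n_both,
        "both": n_both,
        "b_only": n_b - n_both,
        "union": n_a + n_b - n_both,
    }


if __name__ == "__main__":
    # Counts quoted in the text for the two comparisons.
    regions = venn_counts(n_a=77, n_b=65, n_both=40)
    print(regions)
```

With the quoted counts, 37 genes are specific to the cultivar comparison and 25 to the stage comparison, which are the two outer numbers shown in Figure 9.7.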

Expression patterns in learning and memory: long-term potentiation

Learning and memory involve changes in the structure and biochemistry of nerve cells. Nervous systems are dynamic networks, passing signals among cells. Synapses are the sensitive points at which neurons interact. As each neuron integrates inputs from several others, its output depends on the ‘weighting’ of its inputs – the distribution of the strengths of the input synapses. Increasing or reducing the strengths of individual synaptic connections modulates the dynamics of the network.

Learning and memory must involve some permanent structural change. The observation that memory survives periods of coma, during which neural activity ceases, proves this. The change in the network cannot, therefore, be purely a change in the dynamic state.

Long-term potentiation (LTP) is a neural phenomenon underlying learning and memory. LTP is a persistent increase in strength of a synaptic connection as a result of stimulation of the upstream cell. The original observation, first described by Bliss and Lomo in 1973, was that high-frequency stimulation of a synapse during a finite time interval produced a persistent subsequent enhancement of the postsynaptic response. It is believed that transient effects, lasting <1 hour, require modifications of pre-existing proteins at the synapse. Longer-lasting effects involve protein synthesis and gene transcription, and resulting structural remodelling of synapses.

• Learning must be regarded as a specialized form of development.

In addition to neural plasticity, learning also involves generation of new neurons. LTP stimulates both neurogenesis and enhanced survival of new cells.

Park and co-workers studied genes that change their expression pattern in response to LTP.* They studied cells from the mouse dentate gyrus, a structure within the hippocampus. The hippocampus is known to be involved in memory formation, especially spatial memory (see Box 9.2). Following high-frequency stimulation for four separated 1-second intervals, total RNA was extracted after several time intervals – 30, 60, 90, and 120 minutes after stimulation – and then reverse transcribed into cDNA and hybridized onto Affymetrix GeneChip arrays. The chip reported 12 000 genes and ESTs. Samples of unstimulated tissue provided controls.

* Park, C.S., Gong, R., Stuart, J., & Tang, S.-J. (2006). Molecular network and chromosomal clustering of genes involved in synaptic plasticity in the hippocampus. J. Biol. Chem. 281, 30195–30211.

Park and colleagues found, within all time points, 1664 genes with statistically significant changed expression patterns. Of these, 39% were upregulated and 61% downregulated. The genes identified suggest that LTP produces changes in a variety of processes affecting cell morphology and affects interactions among cells and between cells and the extracellular matrix.

Specific functional assignment showed several categories of genes, shown in Table 9.5 (this table shows a composite containing genes identified as having changed expression at any of the time points; see also Figure 9.11). Most but not all of the categories contain examples of both up- and downregulated genes.

BOX 9.2 London taxi drivers: spatial memory and the hippocampus

London taxi drivers must have an exhaustive knowledge of the metropolitan geography, of optimal routes between points both famous and obscure, and of variations in traffic patterns. As portrayed in the famous 1979 film The Knowledge, drivers must pass a strict test to earn their licence before taking control of a classic black London cab. Consistent with the involvement of the hippocampus in spatial memory, brain scans show that London taxi drivers have a larger hippocampus than a control group, and that the hippocampus enlarges with time spent behind the wheel.*

* Maguire, E.A., Gadian, D.G., Johnsrude, I.S., et al. (2000). Navigation-related structural change in the hippocampi of taxi drivers. Proc. Natl. Acad. Sci. USA 97, 4398–4403; Maguire, E.A., Spiers, H.J., Good, C.D., Hartley, T., Frackowiak, R.S., & Burgess, N. (2003). Navigation expertise and the human hippocampus: a structural brain imaging analysis. Hippocampus 13, 250–259.

Time points: Control, 30 min, 60 min, 90 min, 120 min
Gene name Gene function
Galactosylceramidase Myelin metabolism
Cerebellin 1 precursor protein Neuropeptide
42 kD cGMP-dependent protein kinase anchoring protein Synaptic plasticity
Striatin Ca2+ signalling in spines

Glutamate receptor, ionotropic, NMDAr1 Synaptic plasticity

Glial fibrillary protein Filament; astrocytes

Sodium channel, voltage-gated, type VI, α subunit Synaptic plasticity

Transforming growth factor, β3 Synapse formation

Histocompatibility 2 Synaptic plasticity
Heat shock protein, 60 kD Neuroprotection
Kallikrein 6 Synapse formation
Recoverin Neuronal Ca2+ sensor
Nucleoside diphosphate kinase Synaptic vesicle endocytosis
Purinergic receptor P2Y Neuromodulator
Prodynorphin Opioid precursor; neurotransmitter
Cyclin-dependent kinase 9 Neuronal differentiation
Megakaryocyte-associated tyrosine kinase Neurite outgrowth
Matrix metalloproteinase 15 Synapse remodelling

FK506-binding protein 12 Neurite growth

Cytochrome P450, 40 Neurosteroid synthesis

Activating transcription factor 5 Neural differentiation
Early B-cell factor 1 Axon path finding
Protocadherin α13 Synapse recognition
Cytochrome P450, 11a Neurosteroid synthesis
Tumour necrosis factor superfamily, member 8, ligand Synaptic plasticity
Cadherin 11 Synapse formation
SRY-box containing gene 18 Regulation of opioid expression
Neurexophilin 2 Ligand for neurexins
Glial cell line-derived neurotrophic factor family receptor α Axonal growth
γ-Aminobutyric acid (GABA-A) receptor Synaptic plasticity
Platelet-activating factor receptor Synaptic plasticity
Cadherin 16 Synaptogenesis
Arachidonate 15-lipoxygenase Axon guidance
A disintegrin and metalloprotease domain 5 Integrin ligand
Follicle stimulating hormone β subunit Neural plasticity
Fibroblast growth factor 7 Neurite outgrowth
Potassium channel, subfamily K, member 1 Synaptic plasticity

Filaggrin Homology to myelin basic protein (MBP)

N-acetylated α-linked acidic dipeptidase 2 Neuromodulation

Integrin α2 Synapse differentiation

SRY-box containing gene 3 Neurogenesis

Synuclein γ Neuroprotection

Lectin, galactose-binding, soluble 9 Synaptic plasticity

Synaptogyrin 2 Synaptic exocytosis
Deoxynucleotidyltransferase, terminal Synaptic plasticity

Tumour necrosis factor receptor superfamily, member 6 Neurite degeneration

Galanin Neuropeptide

Insulin II Neurite outgrowth

Early growth response 2 Myelination
Potassium voltage-gated channel, subfamily H, member 2 Neuritogenesis
Myelin protein zero Myelination

Expression scale: –3.0 to 3.0

Figure 9.11 Clustering of genes differentially expressed after induction of LTP. The scale at the bottom indicates the expression level
relative to the control: green = enhanced expression; red = reduced expression. Brackets indicate clusters of genes with known neural or
synaptic functions.
From Park, C.S., et al. (2006). Molecular network and chromosomal clustering of genes involved in synaptic plasticity in the hippocampus. J. Biol.
Chem. 281, 30 195–30 211.
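
The colour scale in Figure 9.11 reports expression relative to the control. As a minimal sketch of how such values might be derived and labelled, the following assumes that the scale is a log2 ratio of treated to control signal (the caption does not state the base). The intensities, gene labels, function names, and the threshold of 1.0 are all illustrative assumptions, not values from Park et al.

```python
import math

def log2_ratio(treated, control):
    """Log2 of treated/control signal; positive means enhanced expression."""
    return math.log2(treated / control)

def classify(ratio, threshold=1.0):
    """Label a log2 ratio as up, down, or unchanged (threshold is an assumption)."""
    if ratio >= threshold:
        return "up"
    if ratio <= -threshold:
        return "down"
    return "unchanged"

# Hypothetical (control, treated) intensities for three placeholder genes.
signals = {
    "geneA": (100.0, 400.0),
    "geneB": (200.0, 50.0),
    "geneC": (150.0, 160.0),
}

calls = {gene: classify(log2_ratio(treated, control))
         for gene, (control, treated) in signals.items()}
print(calls)
```

On a log scale, a fourfold increase and a fourfold decrease sit symmetrically at +2 and −2, which is one reason a symmetric scale such as −3.0 to 3.0 is convenient for a heat map.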

Table 9.5 Genes that change expression pattern in response to long-term potentiation (LTP)

Functional category                               Upregulated   Downregulated
The extracellular matrix and its regulation            +              +
Membrane protein/cell surface/adhesion molecule        +              +
Neurosteroid hormone metabolism                        −              +
Cytokine/growth factor/receptor                        +              +
Other receptors/signalling                             +              +
Ion channel                                            +              +
Transcription factor/regulation                        +              +
Translation                                            +              +
Neurotransmitter receptor/neuromodulator               +              +
Regulation of cytoskeleton                             +              +
Mitochondrial/energy production                        +              +
Proteases/protease inhibitors                          +              +
Immunoresponsive proteins/oxidative stress/            +              +
neuroprotection/cell death
Myelin-related proteins                                −              +
Chromatin structure                                    +              −

Expression patterns can identify particular genes involved in neural plasticity. Many of the genes identified by changed expression patterns were already known to play roles in synaptogenesis, synapse differentiation, neurite outgrowth, and synaptic plasticity. Others were not previously known to be involved in LTP, but their altered expression pattern makes them candidates to be tested for their effects on LTP.
For example, transglutaminase is known to be expressed in neural tissue and appears in synapses. However, it was not known to be implicated in LTP. A connection was confirmed by showing that cystamine – a specific antagonist of transglutaminase – impairs LTP. (Cystamine is an inhibitor of transglutaminase and also causes disulphide exchange producing unfolding. Transglutaminase also helps to produce protein aggregates in Huntington’s disease. Cystamine does ameliorate Huntington’s disease in the mouse, but the mechanism is unclear.)
One puzzling gene is CDC25B, an oncogene encoding a tyrosine phosphatase, which functions as a cell-cycle regulator. Use of a specific inhibitor of CDC25B protein product blocked LTP. CDC25B must, therefore, have an essential role, but the mechanism remains obscure. This is precisely the kind of unexpected connection that high-throughput methods can turn up.
Some of the differentially expressed genes are coordinated into coherent pathways. These include enhanced expression of genes in the MAPK signalling cascade (which was already known to be important in LTP) and the Wnt signalling pathway (which had not previously been connected with LTP).
Comparison of expression patterns at different times after LTP induction revealed the temporal expression patterns of different genes. An interesting observation is that many genes in the same general functional groups have similar temporal expression profiles. Conversely, different time points are associated with a ‘schedule’ of activity of genes with particular types of function.

• Genes with common time profiles. Genes involved in responses to external stimuli are upregulated at 30 and 60 minutes after LTP induction. These may be involved in interactions between pre- and post-synaptic components. Genes involved in signal transduction and transcription regulation provide less-clear time profiles. It is likely that their effects are indirect.
• Events happening at particular times. Many genes active at 30 minutes are involved in cell–cell interaction, synapse formation and remodelling, and neurite outgrowth. These represent a relatively early component of the response. Genes related to the cytoskeleton are downregulated at 30 minutes, but upregulated at 60, 90, and 120 minutes. These may be involved in structural changes at the synapse.

Conserved clusters of co-expressing genes

Mapping the loci of the differentially expressed genes showed that they are concentrated in specific chromosomal regions (tandem duplicates were removed). These clusters tended preferentially to contain genes with similar functions. Comparison of the distributions of homologues in the genomes of rats, humans, Drosophila, and C. elegans suggests that the clustering is conserved in evolution. The clustering may contribute to a mechanism for common regulation of expression, reminiscent of a bacterial operon.

Evolutionary changes in expression patterns

The very high similarity in genome sequence between humans and chimpanzees suggests that the evolutionary differences between such closely related species would not lie primarily in the relatively small changes in sequences of individual proteins, but in expression patterns. Microarrays permit a test of this idea. Nevertheless, amino acid sequence changes in proteins may be significant, even though they may be small. One example is the FOXP2 gene, discussed in Chapter 4. Another is a contributor to the control of overall cerebral cortical size (a crude but not entirely irrelevant feature of our mental evolution), the gene ASPM (abnormal spindle-like microcephaly-associated), which has undergone positive selection in the lineage leading to humans.

• The idea of the importance of changes in expression pattern to evolution appeared in a seminal paper by M.C. King and A.C. Wilson in 1975.

The design of the experiments presents a number of difficulties, however.
• There is a high background of variation in expression pattern among different individuals of any species and among different tissues within any individual. This makes it difficult to identify changes unambiguously attributable to species differences.
• Use of a microarray containing oligomer sequences derived from human genes to measure mRNA levels in chimpanzee tissue underestimates the expression levels in the chimpanzee because of less-effective hybridization resulting from sequence changes.

Nevertheless, carefully controlled experiments show that there are significant differences between human and chimpanzee expression patterns that arose during evolution.
• The set of genes that show different expression patterns between humans and chimpanzees is rich in transcription factors. This is in accord with King and Wilson’s hypothesis.
• The differences in expression pattern are not uniform in different tissues. Our hearts and livers have diverged from those of chimpanzees in expression pattern more than our brains, both in terms of the numbers of differentially expressed genes and the amount of the differences in transcription levels. (Would you have guessed this?) However, looking at the course of evolution using the macaque, for instance, as an outgroup, shows that the human brain is particularly rich in genes with increased expression levels relative to the chimpanzee, consistent with the distinct differences in cognitive abilities. In other tissues, there is a more even distribution of genes expressed more highly in humans or more highly in the chimpanzee.
• Changes in expression patterns tend to be lower in X and higher in Y chromosomes than in autosomes. For brain tissue, the average human/chimpanzee ratio of expression level is about 1.51 for autosomes, 1.43 for the X chromosome, and 2.14 for the Y chromosome.
• Duplicated genes tend to show a higher divergence in expression pattern than non-duplicated genes. One possible consequence of gene duplication is divergence and specialization of function. This is consistent with the requirement for differential control of gene expression. (Recalling Chapter 1, vertebrate α- and β-globins are exceptions to this paradigm: it is necessary to calibrate their levels of expression. It is not yet clear what mechanism achieves this. Synthesis of different amounts of α- and β-globin causes thalassaemias.)
• The differences in expression patterns are not uniform, even across different autosomes. During the 6–7 million year period of divergence, there has been substantial chromosome rearrangement between humans and chimpanzees (see Figure 3.6).

Plot axes: gene expression ratio (vertical, about 1.3 to 2.2) against chromosome (horizontal: 1–22, X, Y).

Figure 9.12 Average ratio of gene expression levels between human and chimp cortical tissue in collinear (red) and rearranged (blue)
chromosomes. X and Y chromosomes shown in green.
After: Marquès-Bonet, T., Cáceres, M., Bertranpetit, J., Preuss, T.M., Thomas, J.W., & Navarro, A. (2004). Chromosomal rearrangements and the
genomic distribution of gene-expression divergence in humans and chimpanzees. Trends Genet. 20, 524–529.
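
The brain-tissue ratios quoted in the text (about 1.51 for autosomes, 1.43 for the X chromosome, and 2.14 for the Y chromosome) can be compared on a common footing by expressing the sex-chromosome values relative to the autosomal average. The short sketch below does only that arithmetic; the variable and function names are illustrative.

```python
# Average human/chimpanzee expression ratios for brain tissue, as quoted in the text.
ratios = {"autosomes": 1.51, "X": 1.43, "Y": 2.14}

def percent_difference(value, reference):
    """Percentage by which value differs from reference."""
    return (value / reference - 1.0) * 100.0

relative_to_autosomes = {
    name: percent_difference(value, ratios["autosomes"])
    for name, value in ratios.items()
    if name != "autosomes"
}

for name, diff in relative_to_autosomes.items():
    print(f"{name} chromosome: {diff:+.1f}% relative to autosomes")
```

On these figures the X chromosome lies about 5% below the autosomal average and the Y chromosome about 42% above it.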

Table 9.6 Areas of brain for which human/chimpanzee expression pattern differences were measured

Area of brain                 Function in human                                        Function in chimpanzee if known to differ from human
Dorsolateral prefrontal       Important for higher brain functions: working memory,
cortex                        conscious control of behaviour
Anterior cingulate cortex     Autonomic functions: heart rate and blood pressure,
                              and cognitive functions such as reward anticipation,
                              decision making, empathy, and emotion
Broca’s area                  Mainly language, plus action                             Gesture, especially control over orofacial action,
                                                                                       including communicative acts
Central part of cerebellum    Coordinating complex movements such as walking
Caudate nucleus               Regulation and organization of information sent
                              to frontal lobes
Pre-motor cortex              Sensory guidance of movement, activating proximal
                              and trunk muscles
Area homologous to Broca’s    Not entirely clear; some involvement in communication
in right hemisphere           without syntactic contribution to language use

In cortical tissue, changes in expression patterns are larger among genes in rearranged chromosomes than syntenic ones (see Figure 9.12).

The comparative analysis of expression patterns in human and chimpanzee brains has been pursued to high resolution. Khaitovich and colleagues used an array containing ∼10 000 human genes and arrays containing ∼40 000 human transcripts to test several areas of human and chimpanzee brains* (see Table 9.6).
Differences between the species must be extracted from the variation among individuals. Within each species, typically a few hundred genes (out of ∼10 000 tested) vary in expression in different brain regions. There is relatively low variation within the cortex itself, but over 1000 genes show differences in expression pattern between the cerebellum and other regions.
It is surprising to observe a similarity of expression pattern in humans between Broca’s area, associated with speech, and the homologous right-hemisphere area, which is not. This observation suggests that the achievement of language in humans did not depend on localized changes in transcription patterns.
Functional analysis showed that genes encoding proteins involved in signal transduction, cell–cell communication, differentiation, and development show a greater-than-random tendency to vary in expression between regions, within both species. Genes encoding proteins involved in protein synthesis and turnover tend to show conserved expression patterns.
Approximately 10% of the genes studied differ in expression pattern between humans and chimpanzees. Most of the differences appear in two or more regions of the brain. The cerebellum contains several genes showing species-specific differences in expression not shared by other regions. An analysis of functional categories of the genes showing enhanced or reduced interspecies expression differences does not reveal an enrichment in specific families.
If our goal is to understand at the molecular level the phenotypic differences between humans and chimpanzees involving higher mental functions such as cognition and language, our data must be very accurate and detailed, for as the traits grow more subtle, the molecular signal grows correspondingly fainter. At some point, it will be necessary to trace the origin of the changes in expression patterns to protein and genomic sequences. This does not contradict the King and Wilson hypothesis; after all, amino acid sequence changes modulate the functions of regulatory proteins. We may also be required to address different levels of complexity: the different traits may depend on very complicated patterns of interactions, difficult to infer from properties of individual genes.

* Khaitovich, P., et al. (2004). Regional patterns of gene expression in human and chimpanzee brains. Genome Res. 14, 1462–1473.

Applications of microarrays in medicine

Development of antibiotic resistance in bacteria

The growth in bacterial resistance to antibiotics has created a crisis in disease control.
One of the most powerful antibiotics available for use in humans is vancomycin, a 1.5 kDa glycopeptide antibiotic isolated from a soil bacterium in Borneo, Amycolatopsis orientalis (see Figure 9.13). Vancomycin was first used clinically in 1958 when infectious strains of staphylococci developed penicillin resistance (see Box 9.3). It became the antibiotic of choice for many infections and the drug of last resort for some.

• Development of drug resistance by pathogenic microorganisms threatens to deprive us of the ability to control infectious disease. Widespread use of antibiotics, not only in human clinical medicine, but in raising animals, has contributed to the severity of the problem.

Vancomycin acts by interfering with cell wall synthesis. The cell wall in Gram-positive bacteria is a combination of polysaccharides and peptides. Linear polysaccharides, formed from alternating N-acetylglucosamine and N-acetylmuramic acid units, are cross-linked by short peptides: L-ala–D-gln–L-lys–D-ala–D-ala. Vancomycin acts by binding to the oligopeptide, preventing the cross-linking. Without a robust cell wall, bacteria cannot stand up to their internal osmotic pressure.
The development of resistant Staphylococcus aureus strains occurred gradually. Vancomycin-resistant enterococci appeared in 1977. Twenty years later, S. aureus developed resistance. The strains were already methicillin resistant. Vancomycin-resistant S. aureus (VRSA) strains appeared in 2002. They have been found in Europe and the USA (see Box 9.3). Resistance is measured by an increase in the minimum inhibitory concentration (MIC), which is related to the clinically effective dose (Table 9.7).
One contribution to the spread of vancomycin resistance may be the practice in Europe of widespread feeding of avoparcin (see Figure 9.13) to animals. Homologues of the resistance genes are present in the source organism for vancomycin, A. orientalis. The finding of bacterial DNA contamination in animal feed-grade avoparcin containing sequences related to the resistance gene cluster strongly suggests that the use of avoparcin has led to gene transfer to bacteria which could be taken up by the animals.

Structures shown: vancomycin; α-avoparcin (R = H); β-avoparcin (R = Cl).

Figure 9.13 (left) Vancomycin, a glycopeptide antibiotic produced by the bacterium A. orientalis. (right) Antibiotics α- and β-avoparcin,
related to vancomycin and produced by Streptomyces candidus.
From: Lu, K., Asano, R., & Davies, J. (2004). Antimicrobial resistance gene delivery in animal feeds. Emerg. Infect. Dis. 10, 679–683.

BOX 9.3 Development of vancomycin resistance – a chronology*

1941   First clinical use of penicillin G
1942   Appearance of penicillin-resistant Staphylococcus aureus
1950s  Multidrug-resistant S. aureus widespread
1956   Vancomycin described
1958   First clinical use of vancomycin
1960   First clinical use of methicillin
1961   Appearance of methicillin-resistant S. aureus
1960s  Spread of methicillin-resistant S. aureus
1970s  Methicillin-resistant S. aureus widespread
1988   Appearance of vancomycin-resistant enterococci
1992   Laboratory transfer of high-level vancomycin resistance from enterococci to methicillin-resistant S. aureus
1997   Appearance of vancomycin-intermediate S. aureus in clinical setting
2002   Appearance of vancomycin-resistant S. aureus in clinical setting

* Pfeltz, R.F. & Wilkinson, B.J. (2004). The escalating challenge of vancomycin resistance in Staphylococcus aureus. Curr. Drug Targets Infect. Disord. 4, 273–294.

Table 9.7 Effect of minimum inhibitory concentration (MIC) on resistance

Resistance of S. aureus strain   Minimum inhibitory concentration (MIC)   Year first appeared
Sensitive                        1 μg/ml                                  –
Intermediate (VISA)              8–16 μg/ml                               1997
Resistant (VRSA)                 >32 μg/ml                                2002

Eventually the genes found their way to bacteria that infect humans.

• The toxicity of vancomycin in its early days was caused by impurities – the brown preparations were nicknamed ‘Mississippi mud’. Since the mid-1980s, purification procedures and the safety of the preparations have improved.

S. aureus has adopted two basic strategies for achieving vancomycin resistance. These approaches can be thought of as defence and attack. Both are effective, if success – for the bacterium – can be defined as attaining a level of resistance that survives doses of vancomycin that would be intolerably toxic to the patient.
Acting defensively, S. aureus achieves the intermediate stage of vancomycin resistance (VISA) by a

number of structural changes, including reduced growth rate, reduction in cell wall cross-links, increased cell-wall thickness, and the appearance of D-glutamic acid instead of D-glutamine in the peptide. The genomic changes responsible for the VISA phenotype can be magnified by vancomycin challenge and selection to produce VRSA strains with MIC = 32 μg/ml.
The alternative, for the bacterium, is a counterattack on vancomycin, to ‘pull its sting’. S. aureus has achieved high vancomycin resistance by picking up a specific plasmid from a resistant Enterococcus. The plasmid contains a cluster of genes, leading to changing the D-ala–D-ala at the C terminus of the cross-linking pentapeptide to D-ala–D-lactate (Table 9.8). The modified peptide can enter the cell wall but has a lower binding affinity for vancomycin by a factor of ∼1000.

Table 9.8 Genes in vancomycin resistance cluster

Gene   Action of gene product
VanH   Reduces pyruvate to D-lactate
VanA   Esterifies D-ala–D-lactate
VanX   Hydrolyses D-ala–D-ala, leaving D-ala–D-lactate to build the cell wall
VanS   A kinase that senses vancomycin and initiates transcription of the other genes. In the absence of vancomycin, they are not expressed.

• Microorganisms also develop resistance by evolving enzymes that destroy an antibiotic or pump it out of cells. S. aureus followed this route to gain resistance to penicillin, which initially led clinicians to turn to vancomycin.

Mongodin and co-workers compared expression patterns of genes in VISA strains (MIC ∼8 μg/ml) with VRSA strains produced by selection, not containing the resistance plasmid.* The array contained 2688 oligonucleotides. The experiments were run in parallel, starting with two different clinical VISA isolates. Upon increased vancomycin resistance, 35 genes consistently showed increased expression, some as high as 30-fold, and 16 consistently showed decreased expression.
Genes upregulated with increased vancomycin resistance are associated with the following:
• purine biosynthesis, which is a large component of the change in expression: 15 of the 35 upregulated genes involved purine biosynthesis or transport, and there was a mutation in the regulator of the purine biosynthesis operon;
• cell envelope synthesis, remodelling, and degradation;
• proteins involved in transport and binding of amino acids, peptides and amines, and nucleic acid components (including purines);
• synthesis of staphyloxanthin, an orange carotenoid that gives S. aureus its golden colour;
• folic acid synthesis; and
• unknown functions.

Genes downregulated with increased vancomycin resistance are associated with:
• energy metabolism;
• cell envelope biosynthesis;
• proteins involved in transport and binding of carbohydrates, organic alcohols, and acids;
• salvage of nucleic acid components;
• regulatory functions; and
• tetracycline resistance.

It is not always easy to put together the details of a change in expression pattern involving many metabolic subsystems in order to grasp the salient message (see Figure 9.14). However, it is reasonable to think that the goal of the changes is to defend the cell wall, as that is the target of the antibiotic. As Mongodin and colleagues suggested, many of the changes in expression levels combine to funnel metabolites to the formation of ATP. These changes include downregulation of the genes that encode proteins for conversion of ATP to the corresponding deoxynucleoside triphosphate for DNA synthesis (nrdD) and for the degradation of AMP (deoD). Key enzymes in glycolysis and fermentation are downregulated, diverting glucose 6-phosphate through the pentose phosphate pathway to form the ribose component of ATP.

* Mongodin, E., Finan, J., Climo, M.W., Rosato, A., Gill, S., & Archer, G.L. (2003). Microarray transcription analysis of clinical Staphylococcus aureus isolates resistant to vancomycin. J. Bacteriol. 185, 4638–4643.
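
The roughly 1000-fold loss of vancomycin binding affinity for the D-ala–D-lactate peptide can be placed on an energy scale. The sketch below is a back-of-the-envelope estimate, assuming that the factor is a ratio of dissociation constants and that the temperature is 298.15 K; neither assumption comes from the text, which gives only the factor itself.

```python
import math

R = 8.314          # gas constant, J mol^-1 K^-1
T = 298.15         # assumed temperature in kelvin (not stated in the text)
FOLD_CHANGE = 1000.0  # approximate loss of binding affinity quoted in the text

# Change in binding free energy for a given ratio of dissociation constants:
# ddG = R * T * ln(Kd_modified / Kd_original)
ddg_joules = R * T * math.log(FOLD_CHANGE)
ddg_kj = ddg_joules / 1000.0
ddg_kcal = ddg_joules / 4184.0

print(f"ddG ≈ {ddg_kj:.1f} kJ/mol ({ddg_kcal:.1f} kcal/mol)")
```

Under these assumptions the change corresponds to roughly 17 kJ/mol (about 4 kcal/mol), a modest energy by the standards of molecular recognition.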

Diagram labels: glucose, glucose-6P, glutamate, pentose phosphate pathway, ribose-5P, glycolysis, purine biosynthesis, xanthine, uracil, lactate, pyruvate, formate, nitrite, Krebs cycle, guanosine-5P (GMP), inosine-5P (IMP), ATP, dATP, DNA, AMP, degradation.

Figure 9.14 With enhancement of vancomycin resistance in the laboratory by selection after vancomycin challenge, expression patterns
of genes associated with some processes are upregulated (blue arrows) and others are downregulated (red arrows). A major target of
upregulation is purine synthesis, aimed at enhanced production of ATP for energy requirements.
After: Mongodin, E., Finan, J., Climo, M.W., Rosato, A., Gill, S., & Archer, G.L. (2003). Microarray transcription analysis of clinical Staphylococcus
aureus isolates resistant to vancomycin. J. Bacteriol. 185, 4638–4643.

Synthesis of the thickened cell wall is a very energy-intensive process. The ratio of cell-wall volume to total cell volume increased by 41% in the vancomycin-resistant cells. Perhaps reduced cellular growth rate is a price that must be paid if a larger fraction of the cell’s energy budget goes into cell-wall synthesis.

• Drug resistance in pathogens is a crucial problem in contemporary medicine. Learning, in detail, how it comes about, will be essential for developing ways to prevent it.

Childhood leukaemias

Haemopoietic stem cells are the undifferentiated precursors of all types of blood cell. They mature by differentiating along one of two pathways (see Figure 9.15). B and T cells of the immune system have followed the lymphoid path; red blood cells have followed the myeloid path. An abnormal genetic transformation leading to unregulated proliferation of any blood cell, at any stage of differentiation, gives rise to leukaemia. Leukaemias can be classified according to the type of cell that is proliferating.
Acute lymphoblastic leukaemia is the most common cancer of children, representing almost one-third of childhood cancers. The main treatment is chemotherapy, the success rate for which has greatly improved in the past quarter of a century. Nevertheless, conventional therapy is unsuccessful in about 25% of patients.
Measurements of gene expression patterns have permitted molecular classification of disease subtypes, and correlations with response to chemotherapy and likelihoods of rapid or delayed recurrence or long-term survival.

• From the expression pattern of a group of 50 genes, it is possible to distinguish almost perfectly between lymphoblastic and myeloid leukaemias and, for lymphoblastic leukaemias, to distinguish B- and T-cell lineages. The results are calibrated against established methods based on flow cytometry. The variability in expression pattern is unusually high in acute lymphoblastic leukaemia relative to other types of cancer. It is thereby feasible to create a molecular taxonomy of childhood leukaemias.
• Expression patterns can predict the likelihood of a favourable outcome. This combines questions of the success of therapy, the likelihood of spontaneous relapse after remission, and the development of secondary tumours. A complicating factor of studies of this type in humans is the fact that the samples are taken from patients under a variety of treatments.

Pluripotent stem cells
  Myeloid stem cells
    Granulocyte–macrophage stem cells
      Granulocytic stem cells: neutrophils, eosinophils, and basophils
      Monocytic stem cells: monocytes and macrophages
    Megakaryocytic stem cells: megakaryocytes and platelets
    Erythropoietic stem cells: red blood cells
  Lymphoid stem cells: B and T lymphocytes

Figure 9.15 Haematopoiesis is the formation of new blood cells. Our blood contains many types of cell:

Cell type Function

Neutrophils Respond to bacterial infection

Eosinophils Respond to allergens and to infections by parasites
Basophils Respond to allergens
Monocytes and macrophages Remove dead tissue and respond to infections by bacteria and fungi
Megakaryocytes Precursors of platelets
Platelets Involved in blood clotting
Red blood cells (erythrocytes) Contain haemoglobin and transport O2 and CO2
B lymphocytes Produce specific antibodies
T lymphocytes Destroy infected cells and regulate immune responses

These cells arise by different developmental pathways from a common stem-cell precursor, which can potentially differentiate into any
of the mature cell types, a property called totipotency. As maturation proceeds, cells first become pluripotent (able to mature into some
but not all cell types) and then finally committed to a single ultimate form.
Normally, haematopoiesis produces approximately 175 billion red cells, 70 billion granulocytes (neutrophils, eosinophils, basophils),
and 175 billion platelets every day. [One billion is 10⁹.] When we are challenged by infection, production can be stepped up by an order
of magnitude.
Leukaemia is the uncontrolled proliferation of any of these types of blood cell, either mature forms or their precursors. In mammals,
mature erythrocytes, being enucleate, cannot themselves proliferate. However, mutations can occur in precursor cells. For example, a
mutation affecting the JAK2 signalling pathway in the stem-cell precursor can result in overproduction of erythrocytes, a disease called
primary polycythaemia vera. (Secondary polycythaemia vera, also characterized by overproduction of red cells, is a response to lack of
oxygen; possible causes include heavy smoking, emphysema, or moving without acclimation to a high altitude.)
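
The daily production figures in the caption are easier to appreciate as rates per second. The following sketch simply divides the quoted daily totals by the number of seconds in a day; the dictionary keys and variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400

# Approximate daily production quoted in the caption (one billion = 10**9).
daily_production = {
    "red cells": 175e9,
    "granulocytes": 70e9,
    "platelets": 175e9,
}

per_second = {cell: count / SECONDS_PER_DAY
              for cell, count in daily_production.items()}

for cell, rate in per_second.items():
    print(f"{cell}: about {rate / 1e6:.1f} million per second")
```

On these figures, normal haematopoiesis yields about two million red cells every second, before any increase in response to infection.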

Clinical experience shows that the time interval of first remission is a good predictor of long-term survival. A group of genes was found with an expression pattern correlated with length of remission, i.e. these genes are differentially expressed in patients with early and late relapse. Testing the expression levels of these genes can improve the precision of prognosis. Identifying the pathways in which the genes are involved can illuminate the underlying biology of the disease progression. Genes involved in cell proliferation and DNA repair were upregulated in the early-relapse group.
Development of a second cancer, not related to leukaemia in any obvious way, is a common and very serious complication of acute lymphoblastic leukaemia. Brain tumours are one of the most common secondary malignancies. Several genes have been identified, the expression patterns of which correlate with the risk of secondary brain tumours.

• Expression pattern can predict effectiveness of treatment and guide the choice of therapy. In a study using 14 500 probe sets and samples from 173 patients, sets of 20–40 genes were identified that distinguish resistance and sensitivity to four different drugs: prednisolone, vincristine, asparaginase, and daunorubicin. These results, even taken as purely empirical correlations, have clinical utility in guiding treatment. Their interpretation at the genetic level reveals that the activities of the drugs involve some different as well as some common pathways. A set of 45 genes was found to be correlated with resistance to all four of the drugs. The majority of these genes involve transcription, DNA repair, cell-cycle maintenance, and nucleic acid metabolism.
• Identification of specific genes involved in diseases can suggest targets for drug development. For instance, one type of acute lymphoblastic leukaemia is associated with overexpression of the gene FLT3 for a receptor tyrosine kinase. Patients with mutations that produce constitutively active receptors have a poor prognosis. FLT3 inhibitors are now in clinical trials.

Thus, expression profiles can permit tailoring of drug therapy, both to the specific disease and subtype, and to the patient.

• Expression profiling can (a) permit precise diagnosis of the subtype of the disease, (b) predict the likely course of the disease, and (c) guide choice of therapy.

Whole transcriptome shotgun sequencing: RNA-seq

Gene expression profiling has been the most common application of microarrays. The goal is to measure the identities and relative amounts of different RNA transcripts in populations of cells. Comparison of transcript profiles between healthy and disease states, or under different external conditions, or as a function of time, reveals the changes in gene expression patterns. Examples of all of these appear in this chapter.
RNA-seq – or whole transcriptome shotgun sequencing – is an alternative method of measuring RNA levels, applying high-throughput sequencing techniques. RNA isolated from cells is fragmented, reverse-transcribed to cDNA, and sequenced. Assembly is easiest by aligning with a reference genome. This will also automatically pick up post-transcriptional edits.
The RNA-seq approach has some advantages over microarrays:
• To construct a microarray, one must make a choice of what probe sequences to include. RNA-seq will report whatever is there, with no prior commitment to any set of possible sequences.
• In principle, sequencing methods give more precise measurements of RNA concentrations, by recording how frequently each sequence appears in the pooled results. Not only is sequencing potentially more precise, it has a higher dynamic range.
• If a population of cells includes both a host and a pathogen, the transcriptomes of both are simultaneously measurable. In contrast, without a specially designed chip, a microarray would most likely contain probes only from the host.
Nevertheless, despite many claims, reports of the death of microarrays are greatly exaggerated. Technically, although the sequencing platforms accurately report the relative amounts of different cDNAs presented to them, there is potential bias in the yields of reverse transcription from different RNA molecules. For instance, internal secondary structure of the RNA may interfere with primer binding. (This may be a problem with microarray experiments also.) In addition, it is currently true that microarray measurements are less expensive than the RNA-seq approach to collecting equivalent data. Of course, the cost of sequencing is changing rapidly.

• Methods for expression profiling include microarrays and whole transcriptome shotgun sequencing, or RNA-seq. Both methods are in widespread use at present. By asking in detail what one wants from the results, one can make an intelligent choice between them for any particular experiment on any particular system.
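The idea that sequencing measures RNA levels "by recording how frequently each sequence appears in the pooled results" can be illustrated with a minimal sketch. It assumes that each read has already been assigned to a transcript (for example by alignment to a reference genome); the function name, the transcript labels, and the toy read list are hypothetical, and real pipelines apply further corrections (such as for transcript length) that are not shown here.

```python
from collections import Counter

def counts_per_million(read_assignments):
    """Scale raw read counts so that they sum to one million.

    read_assignments: one transcript label per sequenced read
    (assumed to come from a prior alignment step, not shown).
    """
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads supplied")
    return {name: n * 1_000_000 / total for name, n in counts.items()}

# Hypothetical toy sample: reads from two host genes and one pathogen gene.
reads = ["hostGeneA"] * 6 + ["hostGeneB"] * 3 + ["pathogenGeneX"] * 1
cpm = counts_per_million(reads)
for name in sorted(cpm):
    print(name, cpm[name])
```

Because nothing in the counting step depends on a predefined probe set, the pathogen transcript is reported alongside the host transcripts without any special design.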

● RECOMMENDED READING

• General discussions of microarrays:

Butte, A. (2002). The use and analysis of microarray data. Nat. Rev. Drug Discov. 1, 951–960.
Penkett, C.J. & Bähler, J. (2004). Getting the most from public microarray data. Eur. Pharm. Rev.
1, 8–17.
• The problem of bacterial drug resistance and the development of novel antibiotics:
Amábile-Cuevas, C.F. (2003). New antibiotics and new resistances. Am. Sci. 91, 138–149.
Projan, S.J. & Shlaes, D.M. (2004). Antibacterial drug discovery: is it all downhill from here?
Clin. Microbiol. Infect. Suppl. 4, 18–22.
Projan, S.J., Gill, D., Lu, Z., & Herrmann, S.H. (2004). Small molecules for small minds? The case
for biologic pharmaceuticals. Expert Opin. Biol. Ther. 4, 1345–1350.
Thomson, C.J., Power, E., Ruebsamen-Waigmann, H., & Labischinski, H. (2004). Antibacterial
research and development in the 21st century – an industry perspective of the challenges.
Curr. Opin. Microbiol. 7, 445–450.
Overbye, K.M. & Barrett, J.F. (2005). Antibiotics: where did we go wrong? Drug Discov. Today
10, 45–52.
Barrett, J.F. (2005). Can biotech deliver new antibiotics? Curr. Opin. Microbiol. 8, 498–503.
Talbot, G.H., Bradley, J., Edwards, J.E., et al. (2006). Bad bugs need drugs: an update on the
development pipeline from the Antimicrobial Availability Task Force of the Infectious Diseases
Society of America. Clin. Infect. Dis. 42, 657–668.
• Evolution of language:
Fisher, S.E. & Marcus, G.F. (2006). The eloquent ape: genes, brains and the evolution of
language. Nat. Rev. Genet. 7, 9–20.
• A collection of papers describing the emerging field combining genomics and neuroscience:
Jones, B.C. & Mormède, P. (eds.) (2006). Neurobehavioural Genetics/Methods and
Applications. CRC Press, Boca Raton.
• Presentation of whole transcriptome shotgun sequencing, or RNA-seq:
Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-seq: a revolutionary tool for transcriptomics.
Nat. Rev. Genet. 10, 57–63.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 9.1 In the third level (under Hybridization, washing), Figure 9.1 shows, schematically, a
row of 14 probe oligomers, corresponding to the leftmost 14 elements in one of the rows of the
19 × 19 square array of fluorescent dots in the fourth level. Which row?
Exercise 9.2 A professor of molecular biology wanted to design a microarray experiment to
detect tRNA sequences. He suggested using the sequences of the anticodon stem–loop as target
oligonucleotides (see Figure 1.4). A student pointed out that this was not likely to be successful.
For what reason?
Exercise 9.3 On a photocopy of Figure 9.2, (a) indicate the position of a gene that is highly
downregulated in one of the vehicle samples but not in the other two; (b) indicate the position of
a gene that is more highly upregulated in one of the dexamethasone samples than in the other two.

Exercise 9.4 On a photocopy of Figure 9.4, indicate the position of a gene that is first upregulated
and subsequently downregulated during the yeast diauxic shift.
Exercise 9.5 (a) Citrate synthase and (b) pyruvate decarboxylase are two of the enzymes that
change expression level in the diauxic shift in yeast. On a photocopy of Figure 9.3, mark the
approximate positions of the reactions that they catalyse.
Exercise 9.6 Transcriptome profiling is the measurement of patterns of mRNA concentrations.
However, 90% of the RNA in a cell is ribosomal. How could you apply the fact that mRNAs carry
a 3′ poly A tail to avoid interference from the high background levels of ribosomal RNA?

Problems
Problem 9.1 The average ratio of gene expression levels between human and chimpanzee is
∼1.5 for collinear chromosomes and can be as high as ∼1.6–1.7 for rearranged chromosomes
(Figure 9.12). (a) Qualitatively describe the differences in banding pattern between human and
chimpanzee for human chromosomes 5, 6, 9, and 10 (see Figure 4.18). (b) Are your results
consistent with the data in Figure 9.12? (c) Which would you expect to have an average ratio of
expression levels closer to 1.0, genes on human chromosome 3 and their gorilla homologues, or
genes on human chromosome 8 and their gorilla homologues?
Problem 9.2 Describe the reasons for and against including antibiotics routinely in animal feed.
Decide on a conclusion to be drawn from these arguments and formulate a paragraph of
recommendations.
Problem 9.3 A recent study of 1737 patients treated with vancomycin for S. aureus infections
concluded that many patients received less than an adequate dose to maintain serum
concentrations above the MIC.* For patients infected by vancomycin-sensitive S. aureus, this
characterized 7.9% of patients given continuous infusions and 19% of patients to whom the
drug was administered by periodic intravenous injections. For patients infected by S. aureus strains
with intermediate-level resistance to vancomycin, this characterized 79.1% of patients given
continuous infusions and 87.8% of patients given periodic intravenous injections. What are the
expected effects of this situation on (a) the individual patients involved, and (b) the spread of
vancomycin-resistant S. aureus strains?
Problem 9.4 You have a sample of mRNA which you convert to cDNA. From this material, what
information can you derive from a microarray that would not be available from a high-throughput
sequencing run with a Roche 454 Life Sciences Genome Sequencer?

Weblems
Weblem 9.1 An EST for rose germacrene D synthase appears in dbEST (GenBank) as entry
BQ105086. Blast this sequence against general sequence databases. What properties of rose
germacrene synthase can you infer from the results?
Weblem 9.2 (a) What is the chromosome location of the ASPM gene? (b) What mutations are
known? (c) What is their clinical effect?
Weblem 9.3 Search for articles on discovery of novel antibiotics and study a number of them
to learn about the current situation. (Suggested search terms: future antibiotic development.
Alternatively, consult articles cited as recommended reading in this chapter.) (a) Provide facts and
figures that describe trends in funding for research by large pharmaceutical companies devoted to
novel antibiotic development. (b) Assuming that these figures confirm that large pharmaceutical
companies are curtailing the relevant research programmes, what are the reasons for this, in the
face of high demand for novel antibiotics? (c) Can you recommend appropriate actions by
governments, industry, and the academic community? Give credible reasons that justify the
conclusion that if your actions were adopted they would accelerate the development and approval
of novel antibiotics. What do you see as the most serious obstacles to acceptance of your ideas?
How would you suggest overcoming them? (Warning: consider the possibility that the less
ambitious your suggestions, the more credibly you can justify them.)

* Kitzis, M.D. & Goldstein, F.W. (2006). Monitoring of vancomycin serum levels for the treatment of staphylococcal infections. Clin. Microbiol. Infect. 12, 92–95.
CHAPTER 10

Proteomics

LEARNING GOALS

• To understand the fundamental chemical structure of proteins: the mainchain and sidechains,
types of sidechain, and common post-translational modifications.
• To understand the basic description of protein conformation.
• To be able to distinguish between primary, secondary, tertiary, and quaternary structures.
• To understand the use of polyacrylamide gel electrophoresis (PAGE) to separate proteins.
• To understand the technique and uses of mass spectrometry.
• To appreciate the principles of classifications of protein-folding patterns.
• To understand the possibilities and difficulties of protein structure prediction.
• To understand the goals of structural genomics projects.

Introduction

The proteome is the complete set of proteins associated with a sample of living matter. Proteomics deals with the proteins that form the structures of living things, are active in living things, or are produced by living things. This includes their nature, distribution, activities, interactions, and evolution. Many fields contribute to proteomics.
• Chemistry and biochemistry. These include physical methods, such as spectroscopy, kinetics, and techniques of structure determination, and organic and biochemical methods for working out mechanisms of enzymatic catalysis. Techniques for separation and analysis of proteins have their sources in chemistry and molecular biology.
• Molecular and cellular biology. These disciplines help to coordinate our knowledge of individual proteins into an understanding of the biological context, and of how protein activities are integrated.
• Evolutionary biology. Proteins evolve. Evolution explores variations in amino acid sequences, protein structures, interactions and functions, and patterns of protein expression.
• Structural genomics. This activity applies advances in X-ray crystallography and nuclear magnetic resonance (NMR) to high-throughput delivery of coordinate sets of proteins.
• Bioinformatics. This brings together the many data streams of genomics, expression patterns, and proteomics, to assemble databases and create links among them. This enables their coordinated application to problems of biology, clinical medicine, agriculture, and technology.
Bioinformatics coordinates its efforts with structural genomics to guide and supplement experimental structure determinations by prediction of protein structures from amino acid sequences. Prediction of protein structure is an important technique, given the great disparity between the very large number of experimentally determined sequences and the relatively few structures. Methods for prediction of protein structure from sequence, most of which take advantage of the known protein structures, can provide libraries of three-dimensional models of proteins encoded in genomes.

Protein nature and types

Proteins are where the action is.
• Proteins have a great variety of functions. There are structural proteins (molecules of the cytoskeleton, epidermal keratin, viral coat proteins); catalytic proteins (enzymes); transport and storage proteins (haemoglobin, retinol-binding protein, ferritin); regulatory proteins (including hormones, many kinases and phosphatases, and proteins that control gene expression); and proteins of the immune system and the immunoglobulin superfamily (including antibodies and proteins involved in cell–cell recognition and signalling). How can proteins accomplish so many different things? By coming in a great variety of structures, specialized to carry out different functions.
• The amino acid sequences of proteins dictate their three-dimensional structures and their folding pathways. Under physiological conditions of solvent and temperature, proteins fold spontaneously to an active native state. The amino acid sequence of a protein must not only preferentially stabilize the native state; it must also contain a ‘road map’ telling the protein how to get there, starting from the many diverse conformations that comprise the unfolded state. This is called the folding pathway.
• Advances in protein science have spawned the biotechnology industry. It is now possible to design and test modifications of known proteins and to design novel ones with desired functions.

Protein structure

The chemical structure of proteins

Chemically, protein molecules are long polymers typically containing several thousand atoms, composed of a uniform repetitive backbone (or mainchain) with a particular sidechain attached to each residue (see Figure 10.1). The amino acid sequence of a protein specifies the order of the sidechains.

• A protein is a message written in a twenty-letter alphabet.

The sidechains in proteins show a variety of physicochemical features: some are charged, some are uncharged but polar, and others are hydrophobic (see Box 10.1). Different types of residue make different types of interaction.

• Hydrogen bonding. The hydrogen bond is an interaction between two polar atoms (oxygen or nitrogen; occasionally sulphur) mediated by a hydrogen atom. Several types of hydrogen bond are extremely important to biology:
– Water is an extensively hydrogen-bonded liquid. This accounts for its physicochemical properties, for instance its high boiling point. The structure of water determines the solubility of different substances. Water is not merely the medium of most biochemical processes, it is an active participant in many.
– Hydrogen bonds between nucleic acid bases mediate the complementarity between adenine and thymine, and between guanine and cytosine.
– Hydrogen bonds between C=O and H–N groups stabilize the structure of proteins.
Hydrogen bonds are about 20 times weaker than covalent chemical bonds. In aqueous solution, the solvent water can form hydrogen bonds to polar groups in nucleic acids and proteins. There is a competition between intramolecular hydrogen bonds and solute–solvent hydrogen bonds. Hydrogen bonds in solution can be easily broken and reformed.

[Figure 10.1 schematic: the constant mainchain . . . N–Cα–C(=O)–N–Cα–C(=O)–N–Cα–C(=O) . . . of residues i − 1, i, and i + 1, with the variable sidechains Sᵢ₋₁, Sᵢ, and Sᵢ₊₁ attached to the successive Cα atoms.]

Figure 10.1 Proteins contain a mainchain of constant structure. Attached at regular intervals are sidechains of variable structure, each chosen (with few exceptions) from the canonical set of 20 amino acids. Here Sᵢ₋₁, Sᵢ, and Sᵢ₊₁ represent successive sidechains. Different sequences of sidechains characterize different proteins. It is the sequence that gives each protein its individual structural and functional characteristics.

BOX 10.1 The amino acids

Glycine: +NH3–Cα–COO−
Alanine: –Cα–CH3
Serine: –Cα–CH2–OH
Cysteine: –Cα–CH2–SH
Threonine: –Cα–CH(OH)–CH3
Proline: [pyrrolidine ring linking Cα back to the mainchain N]
Valine: –Cα–CH–(CH3)2
Leucine: –Cα–CH2–CH–(CH3)2
Isoleucine: –Cα–CH(CH3)–CH2–CH3
Methionine: –Cα–CH2–CH2–S–CH3
Phenylalanine: –Cα–CH2–[phenyl ring]
Tyrosine: –Cα–CH2–[phenyl ring]–OH
Aspartic acid: –Cα–CH2–COO−
Glutamic acid: –Cα–CH2–CH2–COO−
Histidine: –Cα–CH2–[imidazole ring]
Asparagine: –Cα–CH2–CONH2
Glutamine: –Cα–CH2–CH2–CONH2
Lysine: –Cα–CH2–CH2–CH2–CH2–NH3+
Arginine: –Cα–CH2–CH2–CH2–NH–C(NH2)(NH2+)
Tryptophan: –Cα–CH2–[indole ring]

• Hydrophobic interactions. Hydrophobic residues have sidechains that are primarily hydrocarbon in nature. They have thermodynamically unfavourable interactions with water. Salad dressing is an everyday example of the hydrophobic effect: the thermodynamic unfavourability of dissolving oil in water causes a phase separation. It is energetically favourable to bury hydrophobic sidechains in the interior of a protein, where they are not exposed to the solvent. This is a general feature of the structures of globular proteins.
• Disulphide bridges. In addition to the primary chemical bonds in the individual residues and the peptide bonds joining the residues into a polymer, cysteine residues in proteins, with sidechain –CH2SH, can form disulphide bonds: –CH2S–SCH2–. Disulphide bonds contribute to the stability of native states. In order to denature proteins fully, it is necessary to break any disulphide bonds.

Different possible conformations of the backbone of a protein bring different types of residue into spatial proximity and expose some but not all residues to the solvent. Every conformation therefore has a different associated energy that depends on the distribution of favourable and unfavourable interactions. The native state of a soluble globular protein is the conformation that optimizes the set of interactions among the residues and between the residues and the solvent.

• Different types of residues make different types of interactions, including hydrogen bonds, hydrophobic interactions, and disulphide bridges. Formation of the native structure allows optimal formation of favourable inter-residue and residue–solvent interactions.

Helices and sheets

Underlying the great variety of protein folding patterns are some recurrent structural themes. Helices and sheets are two conformations of the polypeptide chain that appear in many proteins. They satisfy the hydrogen bonding potential of the mainchain N–H and C=O groups, while keeping the mainchain in an unstrained conformation. They thereby solve certain structural problems faced by all globular proteins. They present general solutions – where in this context ‘general’ means ‘compatible with all (or at least almost all) amino acid sequences’.
Helices and sheets are like Lego® pieces, standard units of structure of which many proteins are built and which can be put together in different ways.
Helices are formed from a single consecutive set of residues in the amino acid sequence. They are therefore a local structure of the polypeptide chain, i.e. they form from a set of residues consecutive in the sequence. The mainchain hydrogen-bonding pattern of an α-helix, the most common type of helix, links the C=O group of residue i to the H–N group of residue i + 4.
Sheets form by lateral interactions of several independent sets of residues to create a hydrogen-bonded network that is often nearly flat, but sometimes cylindrical (forming a barrel structure). Unlike helices, sheets need not form from consecutive regions of the chain but may bring together sections of the chain separated widely in the sequence.

• Helices and sheets are recurrent structures, stabilized by mainchain hydrogen bonding, that appear in many protein structures.

Conformation of the polypeptide chain

The conformation of a polypeptide chain can be described in terms of angles of internal rotation around the bonds in the mainchain (see Figure 10.2). The bonds between the N and Cα, and between the Cα and C, are single bonds. Internal rotation around these bonds is not restricted by the electronic structure of the bond, only by possible steric collisions in the conformations produced (see Box 10.2).
The entire conformation of the protein can be described by these angles of internal rotation. Each set of four successive atoms in the mainchain defines an angle. In each residue i (except for the N and C termini), the angle φᵢ is the angle defined by atoms C(of residue i − 1)–N–Cα–C, and the angle ψᵢ is the angle defined by atoms N–Cα–C–N(of residue i + 1). Then ωᵢ is the angle around the peptide bond itself, defined by the atoms Cα–C–N(of residue i + 1)–Cα(of residue i + 1).
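Since each conformational angle is defined by four successive mainchain atoms, it can be computed directly from their coordinates. The following is a minimal sketch, not taken from the book: the function names and the toy coordinates are hypothetical, and the sign convention assumed is the usual one in which 0° corresponds to cis and 180° to trans. For φ the four atoms would be C(i − 1), N, Cα, C; for ψ they would be N, Cα, C, N(i + 1); for ω they would be Cα, C, N(i + 1), Cα(i + 1).

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed angle of internal rotation (degrees) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane of atoms p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane of atoms p1, p2, p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy coordinates (not real protein geometry), with the central bond along z.
cis_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis_angle, trans_angle, quarter_turn)
```

Applying such a function to every residue of a coordinate set would give the list of (φ, ψ, ω) values that describes the mainchain conformation.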

[Figure 10.2 diagram: the mainchain atoms of residue i (Nᵢ, Cαᵢ, Cᵢ, Oᵢ) with the adjoining atoms of residues i − 1 and i + 1, marked with the angles φᵢ, ψᵢ, and ωᵢ.]

Figure 10.2 Conformational angles describing the folding of the polypeptide chain.
The mainchain of each residue (except the C-terminal residue) contains three chemical bonds: N–Cα, Cα–C, and the peptide bond C–N linking the residue to its successor. The conformation of the mainchain is described by the angles of rotation around these three bonds:

Rotation around:    N–Cα bond    Cα–C bond    Peptide bond (C–N)
Name of angle:      φ            ψ            ω

ω is restricted to be close to 180° (trans) or, infrequently, close to 0° (cis).

The peptide bond has a partial double-bond character and adopts two possible conformations: trans (by far the more common) and cis (rare). Angle ω is restricted to be close to 180° (trans) or 0° (cis).
Proline is an exception: the sidechain is linked back to the N of the mainchain to form a pyrrolidine ring. This restricts the mainchain conformation of proline residues. It disqualifies the N atom as a hydrogen-bond donor, for instance in helices or sheets. Also, the energy difference between cis and trans conformations is less for proline residues than for others. Most cis peptides in proteins appear before prolines.

Protein folding patterns

Focusing on the backbone, in the native state the polypeptide chain follows a curve in space. The general spatial layout of this curve defines a folding pattern. We now know over 70 000 protein structures. There is great but not infinite variety: many proteins have similar folding patterns. The native states are selected from a large but finite repertoire.
In describing protein structures, the Danish protein chemist K.U. Linderstrøm-Lang described a hierarchy of levels of protein structure. The amino acid sequence – roughly the set of chemical bonds – is

BOX 10.2 The Sasisekharan–Ramakrishnan–Ramachandran diagram

The mainchain conformation of each residue is determined primarily by the two angles φ and ψ, assuming the common trans conformation of the peptide bond, ω = 180°.
For some combinations of φ and ψ, atoms would collide, a physical impossibility. V. Sasisekharan, C. Ramakrishnan, and G.N. Ramachandran first plotted the sterically allowed regions (see Figure 10.3). There are two main allowed regions, one around φ = −57°, ψ = −47° (denoted αR) and the other around φ = −125°, ψ = +125° (denoted β) with a ‘neck’ between them. The mirror image of the αR conformation, denoted αL, is allowed for glycine residues only. (As glycine is achiral – identical to its mirror image – a Ramachandran plot specialized to glycine must be right–left symmetric. For non-glycine residues, collisions of the Cβ atom forbid the αL conformation.)
The two major allowed conformations of the mainchain, αR and β, correspond to the two major types of secondary structure: α-helix and β-sheet. The α-helix is right-handed, like the threads of an ordinary bolt. In the β region, the chain is nearly fully extended.
A graph showing the φ and ψ angles for the residues of a protein against the background of the allowed regions is called a Sasisekharan–Ramakrishnan–Ramachandran plot, often called a Ramachandran plot for short.
It is no coincidence that the same conformations that correspond to low-energy states of individual residues also permit the formation of structures with extensive mainchain hydrogen bonding. The two effects thereby cooperate to lower the energy of the native state.
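As a rough illustration of how the regions named in Box 10.2 might be used, the sketch below labels a (φ, ψ) pair by the nearest of the three region centres quoted in the box, taking αL as the mirror image of αR. This is a deliberately crude stand-in, assumed for illustration only: the true allowed regions are irregular areas determined by steric collisions, not circles around a centre, and the function and dictionary names are hypothetical.

```python
import math

# Region centres (phi, psi) in degrees. alpha_R and beta are the values quoted
# in Box 10.2; alpha_L is taken as the mirror image of alpha_R.
CENTRES = {
    "alpha_R": (-57.0, -47.0),
    "beta": (-125.0, 125.0),
    "alpha_L": (57.0, 47.0),
}

def angular_difference(a, b):
    """Smallest absolute difference between two angles, allowing for wrap-around."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def nearest_region(phi, psi):
    """Name of the region centre closest to (phi, psi); an illustrative heuristic."""
    def distance(centre):
        return math.hypot(angular_difference(phi, centre[0]),
                          angular_difference(psi, centre[1]))
    return min(CENTRES, key=lambda name: distance(CENTRES[name]))

print(nearest_region(-60.0, -45.0))   # helical conformation
print(nearest_region(-120.0, 130.0))  # extended conformation
print(nearest_region(60.0, 45.0))     # mirror-image helical conformation
```

The wrap-around handling matters because the angles are periodic: ψ = 170° and ψ = −170° differ by only 20°.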

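[Figure 10.3 plot: φ on the horizontal axis and ψ on the vertical axis, each running from −180° to 180°; the αR and αL regions are labelled, glycine residues are marked G, and residues 82N and 21F are labelled individually.]

Figure 10.3 A Sasisekharan–Ramakrishnan–Ramachandran plot of bovine acylphosphatase [2ACY]. Sterically most-favourable regions are shown in green and sterically allowed regions in yellow. Residues with φ > 0, mostly glycines, appear in red.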

called the primary structure. The assignment of helices and sheets – the hydrogen-bonding pattern of the mainchain – is called the secondary structure. The assembly and interactions of the helices and sheets are called the tertiary structure. For proteins composed of more than one subunit, J.D. Bernal called the assembly of the monomers the quaternary structure (see Figure 10.4 and Box 10.3).
Some proteins change their quaternary structure as part of a regulatory process. Cyclic AMP activates protein kinase A by a mechanism involving subunit dissociation. The resting, inactive form of protein kinase A is a tetramer of two catalytic subunits and two regulatory subunits. In this resting state, the regulatory subunits inhibit the activity of the catalytic subunits. Binding of cyclic AMP to protein kinase A dissociates the tetramer, releasing individual catalytic subunits in active form.
In some cases, evolution can merge proteins – changing quaternary to tertiary structure. For example, five separate enzymes in Escherichia coli that catalyse successive steps in the pathway of biosynthesis of aromatic amino acids correspond to five regions of a single protein in the fungus Aspergillus nidulans.

• We describe protein folding patterns according to a hierarchy of primary, secondary, tertiary, and quaternary structures. See Box 10.3 and Figure 10.4.

Domains

One way that proteins have evolved increasing complexity is by assembling a large protein from a set of smaller quasi-independent subunits, either by forming stable oligomers, as in haemoglobin (see Figure 10.4), or by concatenating units within a single polypeptide chain. Domains are compact units within the folding pattern of a single chain. Justifications for regarding them as quasi-independent include the observation that domains can be ‘mixed and matched’ in different proteins and, in many cases, the similarities of their

Figure 10.4 Underlying the great variety of protein folding patterns are a number of common structural features. For instance, α-helices and β-sheets are standard elements of the ‘parts list’ of many protein structures. α-helices and β-sheets were modelled by L. Pauling before their experimental observation. Pauling recognized that helices and sheets provide convenient ways for the residues to achieve comfortable steric relationships and satisfy the requirements for backbone hydrogen bonding in an (almost) sequence-independent manner.
This figure shows, at the upper left, the primary structure in terms of a simple extended chain. The standard secondary structures, the α-helix and β-sheet, are shown at the upper right, with hydrogen bonds indicated by broken lines. Tertiary structure is represented, at the lower left, by acylphosphatase, which contains two α-helices packed against a five-stranded β-sheet. Human haemoglobin, a tetramer containing two copies of two types of chain, illustrates quaternary structure, at the lower right. (Acylphosphatase is not a subunit of haemoglobin.)

folding patterns to those of homologous monomeric proteins.
Domains form the basis of the higher-level protein structural organization typical of eukaryotic proteins. Modular proteins are multidomain proteins that often contain many copies of closely related domains. For example, fibronectin, a large extracellular protein involved in cell adhesion and migration, contains 29 domains including multiple tandem repeats of three types of domain, F1, F2, and F3 (see Figure 4.16). It is a linear array of the form: (F1)₆(F2)₂(F1)₃(F3)₁₅(F1)₃. Fibronectin domains also appear in other modular proteins. (See http://www.bork.embl-heidelberg.de/Modules/ for pictures and nomenclature.)
To create new proteins, inventing new domains is an unusual event. It is far more common to create different combinations of existing domains in increasingly complex ways. These processes can occur independently, and take different courses, in different phyla.
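The compact notation for the fibronectin domain array can be expanded mechanically, which also confirms the stated total of 29 domains. The sketch below is an illustration only: the function name is hypothetical, and the notation is assumed to be written with ordinary digits as repeat counts.

```python
import re
from collections import Counter

def expand_domain_array(notation):
    """Expand notation such as '(F1)6(F2)2' into a flat list of domain names."""
    domains = []
    for name, count in re.findall(r"\(([^()]+)\)(\d+)", notation):
        domains.extend([name] * int(count))
    return domains

fibronectin = expand_domain_array("(F1)6(F2)2(F1)3(F3)15(F1)3")
totals = Counter(fibronectin)

print(len(fibronectin))        # total number of domains in the array
print(sorted(totals.items()))  # number of copies of each domain type
```

The total is 6 + 2 + 3 + 15 + 3 = 29, made up of 12 F1, 2 F2, and 15 F3 domains.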

BOX 10.3 Protein structure – basic vocabulary

Polypeptide chain: Linear polymer of amino acids.
Mainchain: Atoms of the repetitive concatenation of peptide groups . . . N–Cα–(C=O)N–Cα–(C=O) . . .
Sidechains: Sets of atoms attached to each Cα of the mainchain. Most sidechains in proteins are chosen from a canonical set of 20.
Primary structure: The chemical bonds linking atoms in the amino acid sequence in a protein.
Hydrogen bond: A weak interaction between two neighbouring polar atoms, mediated by a hydrogen atom.
Secondary structure: Substructures common to many proteins, compatible with mainchain conformations, free of interatomic collisions, and stabilized by hydrogen bonds between mainchain atoms. Secondary structures are compatible with all amino acids, except that a proline necessarily disrupts the hydrogen-bonding pattern.
α-Helix: Type of secondary structure in which the chain winds into a helix, with hydrogen bonds between residues separated by four positions in the sequence.
β-Sheet: Another type of secondary structure, in which sections of mainchain interact by lateral hydrogen bonding.
Folding pattern: Layout of the chain as a curve through space.
Tertiary structure: The spatial assembly of the helices and sheets, and the pattern of interactions between them. (Folding pattern and tertiary structure are nearly synonymous terms.)
Quaternary structure: The assembly of multisubunit proteins from two or more monomers.
Native state: The biologically active form of a protein, which is compact and low energy. Under suitable conditions, proteins form native states spontaneously.
Denaturant: A chemical that tends to disrupt the native state of a protein; for instance, urea.
Denatured state: Non-compact, structurally heterogeneous state formed by proteins under conditions of high temperature, or high concentrations of denaturant.
Post-translational modification: Chemical change in a protein after its creation by the normal protein-synthesizing machinery.
Disulphide bridge: Sulphur–sulphur bond between two cysteine sidechains. A simple example of a post-translational modification.

Post-translational modifications

How much does genomics actually tell us about the proteome? Even if we could identify coding regions of genomes with complete accuracy, we would not know about:

• levels of transcription – or even absence of transcription;
• formation of different splice variants (in eukaryotes);
• mRNA editing – exclusive of splicing – before translation, which alters the amino acid sequence;
• the nature and binding sites of ligands integral to the final structure; and
• post-translational modifications, the subject of this section.

The ribosome synthesizes proteins by using the genetic code to direct the incorporation of a sequence of amino acids chosen from the canonical 20. Selenocysteine and pyrrolysine are two natural rare extensions of the standard genetic code.

However, the protein world is richer than the standard genetic code suggests. Many proteins contain ligands, such as metal ions or small organic molecules, as intrinsic and permanent parts of the structures. The nature of the binding between protein and ligand depends on the protein as well as the ligand. For instance, the haem group is bound covalently to cytochrome c but non-covalently to (almost all) globins. (Of course, proteins bind many molecules transiently. Enzyme–substrate complexes provide many examples.)

Post-translational modifications can take several forms (see also Box 10.4).

• Attaching various groups to sidechains, including but not limited to acetate, phosphate, lipids, and carbohydrates. Disulphide bridge formation is a related example. Some additions, such as sulphation, are permanent modifications; others, notably phosphorylations, are in many cases reversible.
• Conversions, for instance deamidation of asparagine (or glutamine) to aspartic acid (glutamic acid), or deimination of arginine to citrulline.
• Removing peptides, either from a terminus or from the middle of the chain, and in a few cases even making cyclic permutations.
• Addition of other peptides or proteins, not always by extension of the mainchain, through peptide linkages.

BOX 10.4 Major types of post-translational modification

• Attachments of groups to termini and sidechains. Although acetylation of protein N termini is not uncommon, most modifications involve sidechains. Many possible derivatives are observed.
  – Reversible phosphorylation of serine, threonine, or tyrosine sidechains is a very common means of regulating protein activity. However, irreversible phosphorylation of tau protein contributes to the development of Alzheimer’s disease. The neurotoxicity of some organophosphorus compounds arises from their irreversible phosphorylation of acetylcholinesterase.
  – Attachment of sugars or oligosaccharides to proteins to make glycoproteins. In mammals, many glycoproteins appear on cell surfaces, to mediate cell–cell recognition and communication and immune system recognition. The difference between O, A, and B blood groups resides in the carbohydrate attached to serum glycoproteins and to (non-protein) glycolipids on cell surfaces. Many viruses, including influenza and HIV-1, gain entry to cells via cell-surface glycoprotein receptors. Lectins are carbohydrate recognition proteins that mediate cell–cell recognition and communication and sugar transport. In vertebrates, recognition of sugars on the surfaces of bacteria is a component of the immune response to infection. Deficiencies in turnover lead to glycoprotein storage diseases involving accumulation of incompletely degraded glycoproteins or oligosaccharides. Examples include α- and β-mannosidosis and aspartylglucosaminuria. Tay–Sachs disease is a related condition, part of a larger family of diseases called the lysosomal storage diseases. Tay–Sachs disease arises from a mutation in the gene for the α subunit of hexosaminidase A. In Tay–Sachs disease, the dysfunction of the mutant protein impedes the degradation of a ganglioside, rather than the degradation of a glycoprotein.
  – Addition of oligomers of the small protein ubiquitin to lysine residues targets proteins for degradation by the proteasome. Conversely, the methylation of lysine is believed to ‘protect’ proteins against ubiquitinylation and thereby against degradation.
• Post-translational modification by proteolytic cleavage implies that the protein is synthesized with extra amino acids beyond those required to form the native state. What is the purpose of the additional residues?
  – Some proteins are synthesized with N-terminal signal peptides that direct their transport to particular subcellular compartments or organelles, or mark them for secretion.
  – It would be dangerous to turn rogue proteases loose on the cells in which they are synthesized. Many proteases are synthesized in inactive forms and then activated by cleavage.
  – Facilitation of folding. Insulin contains two polypeptide chains, one of 21 residues and the other of 30 residues. The precursor proinsulin is a single 81-residue polypeptide chain from which excision of an internal peptide produces the mature protein. Insulin contains one intrachain and two interchain disulphide bridges. Attempts to renature mature insulin – after unfolding and breaking the disulphide bridges – give poor yields. Many incorrectly paired disulphide bridges form. In vivo, the precursor proinsulin folds into a three-dimensional structure with the cysteines in proper relative positions to form the correct disulphide bridges. Excision of a central region by endopeptidases then produces the mature dimer. Unfolded proinsulin can spontaneously refold correctly.
• Most post-translational cleavage reactions are carried out by proteases. Alternatively, inteins are proteins that have a ‘self-splicing’ activity. They autocatalytically excise internal peptides and join the ends. (In contrast, peptide excision from proinsulin leaves two chains that are not joined by a peptide bond.)
• The lectin concanavalin A is synthesized in a precursor form that is a cyclic permutation of the final structure. Thus, during maturation of the protein, there is cleavage of an internal peptide bond and formation of a new peptide bond between the original N and C termini. For concanavalin A, the DNA sequence of the gene is not co-linear with the amino acid sequence of the mature protein.

Why is there a common genetic code with 20 canonical amino acids?

Almost all organisms synthesize proteins containing a canonical set of 20 amino acids.

However, both nature and the laboratory show that 20 amino acids are not a fundamental limitation. Selenocysteine and pyrrolysine are natural exceptions. P. Schultz and co-workers have extended the genetic code by introducing modified tRNAs and synthetases into E. coli, yeast, and even mammalian cells in tissue culture.* Approximately 70 novel amino acids are now available to be introduced into proteins at specific sites. Some of the novel amino acids show designed steric or electronic properties; others contain chromophores as fluorescent reporters or are susceptible to photocross-linking; there are glycosylated amino acids; iodine derivatives to facilitate X-ray structure determination; and sidechains containing other types of reactive group.

In understanding the contents and layout of the common genetic code, can we go beyond F. Crick’s comment that the code is a ‘frozen accident’?

There is now consensus that prokaryotes were and are engaged in widespread horizontal gene transfer (see p. 120). This suggests – leaving aside the question of what the optimal genetic code should be – that it would be to the advantage of any participating species to conform to some standard, as that would give it access to all of the other genes. Analogously, anyone can run any operating system they want on a computer, but the obvious advantages of running the same system as many other people exert pressure to conform to some standard. Perhaps that is at least a partial explanation of why almost all species have the same genetic code.

It is true that if different species adopted different genetic codes, this might protect them against viruses jumping from other species.

But why not a code with many more than 20 amino acids? Certainly, one perfectly feasible way to introduce greater versatility into the components of proteins is by expanding the genetic code. However, keeping within the general framework of a triplet code, introducing more amino acids at the expense of the redundancy of the code threatens to reduce robustness. An alternative approach to greater versatility without this cost is to effect post-translational modifications of individual amino acids. Whether or not this reasoning is the correct explanation, post-translational modification is the choice that nature seems largely to have made.

* See http://schultz.scripps.edu/research.html and Wang, L. & Schultz, P.G. (2004). Expanding the genetic code. Angew. Chem. Int. Ed. Engl. 44, 34–66.
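
The trade-off between the number of amino acids and the redundancy of a triplet code can be made concrete by counting codons. The following is a minimal Python sketch, assuming the standard genetic code written out in the conventional TCAG codon order; the variable and function names are illustrative and do not come from the text.

```python
from collections import Counter
from itertools import product

# Standard genetic code, one letter per codon, with codons enumerated in
# TCAG order (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# Map each of the 64 triplets to its amino acid (or stop).
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons_per_amino_acid(table):
    """Count how many codons specify each amino acid (stop codons excluded)."""
    return Counter(aa for aa in table.values() if aa != "*")


counts = codons_per_amino_acid(CODON_TABLE)
sense_codons = sum(counts.values())

print(f"{sense_codons} sense codons encode {len(counts)} amino acids")
print(f"mean redundancy: {sense_codons / len(counts):.2f} codons per amino acid")
print("amino acids with a single codon:",
      sorted(aa for aa, n in counts.items() if n == 1))
```

In a triplet code the total of 64 codons is fixed, so every additional amino acid would have to take its codons from those currently assigned to existing amino acids or to stop signals, which is the loss of redundancy the argument above refers to.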

Separation and analysis of proteins

The complete complement of a cell’s proteins is a large and complex set of molecules. Metazoa contain tens of thousands of protein-encoding genes. Different splice variants multiply the number of possible proteins. Vertebrate immune systems generate billions of molecules by specialized techniques of combinatorial gene assembly.

To give some idea of the ‘dynamic range’ required of detection techniques, the protein inventory of a yeast cell varies from 1 copy per cell to 1 million copies per cell.

Examples of techniques for separating mixtures of proteins include gel filtration, chromatography, and electrophoresis. All methods of separating molecules require two things:

1. A difference in some physical property, between the molecules to be separated; and
2. a mechanism, taking advantage of that property, to set the molecules in motion; the speed differing according to the value of the property selected. This moves apart molecules with different properties.

In some separation methods, one component can stand still and the other(s) move away from it. Affinity chromatography is an example. With others, different species can all move, at different rates, and spread themselves out.

• To measure an inventory of the proteins in a sample, the proteins must be: (1) separated, (2) identified, (3) counted.

Polyacrylamide gel electrophoresis (PAGE)

In electrophoresis, an electric field exerts force on a molecule. The force is proportional to the molecule’s total or net charge. In a vacuum, the corresponding acceleration would be inversely proportional to the mass. However, counteracting the acceleration from the electric field are retarding forces from the medium through which the proteins move. Polyacrylamide gels contain networks of tunnels, with a distribution of sizes. Smaller proteins can enter smaller tunnels as well as larger ones, and therefore move faster through the gel than larger proteins. Proteins with different mobilities move different distances during a run, spreading them out on the gel.

The mobility of a native protein depends on its mass and its shape. Higher mass tends to reduce mobility; more compact shape tends to increase it. In particular the mobility of denatured proteins is lower than that of the corresponding native states. To achieve a separation that depends solely on molecular weight, denature the proteins. Common denaturing media include urea (which competes for hydrogen bonds), and the reducing agent dithiothreitol to break S–S bridges (and iodoacetamide to prevent their reformation).

Sodium dodecyl sulphate (SDS) is a negatively charged detergent that helps to denature proteins. Multiple detergent molecules bind all along the polypeptide chain. The result is a protein–detergent complex that has an extended shape, with a uniform charge density along its length.

Carrying out SDS-PAGE in one dimension spreads out a mixture of proteins or nucleic acids into bands. Running several samples on the same gel in parallel lanes is a familiar procedure if only from sequencing gels. The results of protein gels can be made visible (‘developed’) by staining with Coomassie Blue, or, if the samples are radioactively labelled, by autoradiography. Often markers of known molecular weight are run in a separate lane for calibration.

Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE)

One-dimensional PAGE will not adequately separate a very complicated mixture of proteins. The bands in a lane on a gel will overlap, and contain mixtures of proteins with similar sizes. To achieve better resolution, a two-stage procedure first separates proteins according to charge; then an SDS-PAGE step, run in a direction 90° from the original direction, separates according to size.

[Figure 10.5: three gels labelled Stage 1, Stage 4, and Stage 6; horizontal axis pI from 4 to 8; vertical axis with marker values 97.4, 66.2, 45.0, 31.0, 21.5, and 14.4.]

Figure 10.5 Two-dimensional PAGE gels of rose petal proteins at developmental stages 1, 4, and 6. Each gel contains over 600 proteins, of which 421 are common to all three stages. About 12% of the proteins are stage specific.
From: Dafny-Yelin, M., et al. (2005). Flower proteome: changes in protein spectrum during the advanced stages of rose petal development. Planta 222, 37–46.

The charge on a protein depends on the charged residues it contains, and the pH of the medium. At different values of pH, ionizable groups on proteins have different charges. For instance, a free histidine
sidechain is positively charged below pH ∼5, and uncharged above pH ∼7. For any protein, there is a pH at which it has a net charge of 0. This is called its isoelectric point.

A protein at its isoelectric point will feel no force in an electric field. It will not migrate in electrophoresis. To separate proteins according to their isoelectric points, establish a pH gradient in a medium and apply an electrophoretic field. The proteins will migrate, changing their charge as they pass through regions of different pH, until they reach their isoelectric points and then they will stop. The result, called isoelectric focusing, spreads proteins out according to their charged sidechains.

After the proteins are spread out along a lane by isoelectric focusing, running PAGE at 90° spreads them out in two dimensions (Figure 10.5). It is possible to compare the resulting patterns. Spots of interest can be eluted and identified by mass spectrometry (see next section).

• Polyacrylamide gel electrophoresis (PAGE) is a common method for protein separation. Proteins migrate through a gel with different mobilities depending on their mass, shape, and charge. Proteins separated by SDS-PAGE are denatured by the detergent sodium dodecyl sulphate, creating a protein–detergent complex with a uniform layer of negative charge. Protein mobility in SDS-PAGE depends only on relative molecular mass.

Mass spectrometry

Mass spectrometry is a physical technique that characterizes molecules by measurement of the masses of their ions, or of ions formed from their fragments. Applications to molecular biology include:

• rapid identification of the components of a complex mixture of proteins;
• sequencing of proteins and nucleic acids (see p. 309);
• analysis of post-translational modifications or substitutions relative to an expected sequence; and
• measuring extents of hydrogen–deuterium exchange to reveal the solvent exposure of individual sites (providing information about static conformation, dynamics, and interactions).

Identification of components of a complex mixture

First, the components are separated by electrophoresis, then the isolated proteins are digested by trypsin to produce peptide fragments with relative molecular masses of about 800–4000. Trypsin cleaves proteins after Lys and Arg residues. Given a typical amino acid composition, a protein of 500 residues yields about 50 tryptic fragments. The mass spectrometer measures the masses of the fragments with very high accuracy (see Figure 10.6). The list of fragment masses, called the peptide mass fingerprint, characterizes the protein (Figures 10.7 and 10.8). Searching a database of fragment masses identifies the unknown sample.
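
The prediction side of peptide mass fingerprinting can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used by any particular database: the cleavage rule is the simple one given in the text (cut after every Lys or Arg, ignoring the commonly applied exception for a following proline), the residue masses are standard monoisotopic values that the text itself does not list, and the example sequence and function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values, not from the text).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide accounts for the free N and C termini


def tryptic_digest(sequence):
    """Split a protein sequence after every Lys (K) or Arg (R)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):          # C-terminal peptide without K or R
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mass_fingerprint(sequence):
    """Sorted list of predicted tryptic fragment masses."""
    return sorted(round(peptide_mass(p), 3) for p in tryptic_digest(sequence))


EXAMPLE = "AGKMNLLQVVRSTK"             # hypothetical sequence for illustration
for pep in tryptic_digest(EXAMPLE):
    print(f"{pep:10s} {peptide_mass(pep):10.3f}")
```

A database search compares a measured list of masses against lists predicted in this way for every known sequence. A post-translational modification shifts the mass of the fragment that carries it, so that fragment no longer matches its predicted value.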
Mass spectrometry is sensitive and fast. Peptide mass fingerprinting can identify proteins in subpicomole quantities. Measurement of fragment masses to better than 0.1 mass units is quite good enough to resolve isotopic mixtures. It is a high-throughput method, capable of processing 100 spots/day (although sample preparation time is longer). However, there are limitations. Only proteins of known sequence can be identified from peptide mass fingerprints, because only their predicted fragment masses are included in the databases. (As with other fingerprinting methods, it would be possible to show that two proteins from different samples are likely to be the same, even if no identification is possible.) Post-translational modifications interfere because they alter the masses of the fragments.

[Figure 10.6 labels: laser (λ = 337 nm), vaporization and ionization; accelerating voltage; field-free flight (coasting), in which lighter ions move faster and heavier ions move slower; detector, which records the time of flight.]
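
The time-of-flight principle shown in Figures 10.6 and 10.7 follows from an energy balance: an ion of charge ze accelerated through a voltage V gains kinetic energy zeV = ½mv², so its speed is v = √(2zeV/m) and its flight time over a field-free tube of length L is t = L/v. The Python sketch below puts numbers on this. The 20 kV accelerating voltage follows the label in Figure 10.7, but the 1 m tube length and the function names are assumptions made for illustration only.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms


def flight_time(mass_da, charge=1, voltage=20_000.0, tube_length_m=1.0):
    """Time in seconds for an ion to coast along the field-free tube.

    Energy balance: z*e*V = (1/2)*m*v**2, hence v = sqrt(2*z*e*V/m).
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * ELEMENTARY_CHARGE * voltage / mass_kg)
    return tube_length_m / velocity


def mass_to_charge(time_s, voltage=20_000.0, tube_length_m=1.0):
    """Invert the relation: m/z (daltons per elementary charge) from a flight time."""
    return (2 * ELEMENTARY_CHARGE * voltage
            * (time_s / tube_length_m) ** 2 / DALTON)


for m in (1000.0, 4000.0):
    print(f"m/z {m:6.0f}: {flight_time(m) * 1e6:.1f} microseconds")
```

Because the flight time scales with the square root of m/z, an ion four times heavier takes twice as long to arrive, and the measured arrival time converts directly into a mass-to-charge ratio.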
Figure 10.6 Schematic diagram of a mass spectrometry experiment.

Figure 10.7 Identification of components of a mixture of proteins by elution of individual spots, digestion, and fingerprinting of the peptide fragments by MALDI–TOF (matrix-assisted laser desorption ionization–time of flight) mass spectrometry, followed by looking up the set of fragment masses in a database.
[Diagram labels: 2D gel; eluted spot; peptide fragments; MALDI-TOF mass spectrometer (laser, ~20 kV acceleration, coasting, detector); peptide mass fingerprint; database search; identification of peptide.]
The operation of the spectrometer involves the following steps.
1. Production of the sample in an ionized form in the vapour phase. Proteins being fairly delicate objects, it has been challenging to vaporize and ionize them without damage. Two ‘soft-ionization’ methods that solve this problem are:
   • matrix-assisted laser desorption ionization (MALDI): the protein sample is mixed with a substrate or matrix that moderates the delivery of energy; a laser pulse absorbed initially by the matrix vaporizes and ionizes the protein;
   • electrospray ionization (ESI): the sample in liquid form is sprayed through a small capillary with an electric field at the tip to create an aerosol of highly charged droplets, which fragment upon evaporation, ultimately producing ions, which may be multiply charged, devoid of solvent; these ions are transferred into the high vacuum region of the mass spectrometer.
2. Acceleration of the ions in an electric field. Each ion emerges with a velocity proportional to the square root of its charge/mass ratio.
3. Passage of the ions into a field-free region, where they ‘coast’.
4. Detection of the times of arrival of the ions. The time of flight (TOF) indicates the mass-to-charge ratio of the ions.
5. The result of the measurements is a trace showing the flux as a function of the mass-to-charge ratio of the ions detected (see Figure 10.8).

[Figure 10.8 spectrum: relative intensity (%) against m/z from about 1000 to 2400, with labelled peaks at 957.5369, 981.5262, 1108.6494, 1123.6340, 1157.6235, 1273.7909, 1282.7338, 1407.7719, 1479.8687, 1501.7567, 1578.856, 1617.8866, 1622.8511, 1654.8632, 1826.0116, 1919.0556, 1932.8910, 1964.8928, 2163.0570, 2254.2646, and 2425.44.]

Figure 10.8 Mass spectrum of a tryptic digest. Of the 21 highest peaks (shown in black), 15 match expected tryptic peptides of the 39 kDa subunit of cow mitochondrial complex I. This easily suffices for a positive identification.
Figure courtesy of Dr I.M. Fearnley, MRC Dunn Human Nutrition Unit, Cambridge, UK.

Protein sequencing by mass spectrometry

Fragmentation of a peptide produces a mixture of ions. Conditions under which cleavage occurs primarily at
peptide bonds yield a series of ions differing by the masses of single amino acids. The y ions are a set of nested fragments containing the C terminus (see Figure 10.9a) (b ions are nested fragments containing the N terminus). The difference in mass between successive y ions is the mass of a single residue. The amino acid sequence of the peptide is, therefore, deducible from analysis of the mass spectrum (see Figure 10.9b).

Two ambiguities remain: Leu and Ile have the same mass and cannot be distinguished, and Lys and Gln have almost the same mass and usually cannot be distinguished. Discrepancies from the masses of standard amino acids signal post-translational modifications. In practice, the sequence of about 5–10 amino acids can be determined from a peptide of length <20–30 residues.

Measuring deuterium exchange in proteins

If a protein is exposed to heavy water (D₂O), mobile hydrogen atoms will exchange with deuterium at rates dependent on the protein conformation. By exposing proteins to D₂O for variable amounts of time, mass spectrometry can give a conformational map of the protein. Applied to native proteins, the results give information about the structure. Applied to initially denatured proteins brought to renaturing conditions using pulses of exposure, the method can give information about intermediates in folding.

• Mass spectrometry is often used to characterize proteins isolated from mixtures. The peptide mass fingerprint – the list of fragment masses – is usually sufficient to identify a protein.
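
Reading a sequence from a ladder of y ions amounts to matching the mass difference between neighbouring peaks against a table of residue masses. The Python sketch below applies this to the y3 to y7 peak positions labelled in Figure 10.9(b). The residue masses are standard monoisotopic values not given in the text, and the 0.05 mass-unit tolerance and the function names are assumptions chosen for illustration.

```python
# Monoisotopic residue masses in daltons (standard values, not from the text).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_matching(delta, tol=0.05):
    """All residues whose mass lies within tol of a peak-to-peak difference."""
    return sorted(aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol)


def read_y_ladder(y_peaks, tol=0.05):
    """Candidate residues for each step between successive y-ion peaks.

    Stepping up the ladder adds residues towards the N terminus, so the
    result reads the sequence 'in reverse'.
    """
    peaks = sorted(y_peaks)
    return [residues_matching(hi - lo, tol) for lo, hi in zip(peaks, peaks[1:])]


# y3 ... y7 peak positions as labelled in Figure 10.9(b).
Y_PEAKS = [373.24, 501.29, 614.37, 727.45, 841.49]

for step, candidates in enumerate(read_y_ladder(Y_PEAKS), start=4):
    print(f"y{step - 1} -> y{step}: {'/'.join(candidates)}")
```

At this tolerance the first step is reported as K/Q and the next two as I/L, which reproduces the two ambiguities noted above; the Lys/Gln pair can be separated only when the peak positions are measured more precisely than the 0.036 mass-unit difference between those residues.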

[Figure 10.9 panels: (a) a sample peptide and the nested series of y ions produced by collision-induced dissociation, from which mass spectrometry gives the C-terminal sequence; (b) mass spectrum, relative intensity (%) against m/z from 100 to 1000, with labelled peaks a1 104.03, y1 175.09, b2 246.06, y2 (mass not printed), y3 373.24, y4 501.29, y5 614.37, y6 727.45, and y7 841.49; the residues read off between successive peaks are R V V Q L L N M.]

Figure 10.9 Peptide sequencing by mass spectrometry. Collision-induced dissociation produces a mixture of ions. (a) The mixture contains a series of ions, differing by the masses of successive amino acids in the sequence. The ions are not produced in sequence as suggested by this list, but the mass-spectral measurement automatically sorts them in order of their mass/charge ratio. (b) Mass spectrum of fragments suitable for C-terminal sequence determination. The greater stability of y ions over b ions in fragments produced from tryptic digests simplifies the interpretation of the spectrum. The mass differences between successive y-ion peaks are equal to the individual residue masses of successive amino acids in the sequence. Because y ions contain the C terminus, the y-ion peak of smallest mass contains the C-terminal residue etc., and therefore the sequence comes out ‘in reverse’. The two leucine residues in this sequence could not be distinguished from isoleucine in this experiment.
From: Carroll, J., Fearnley, I.M., Shannon, R.J., Hirst, J., & Walker, J.E. (2003). Analysis of the subunit composition of complex I from bovine heart mitochondria. Mol. Cell Proteomics 2, 117–126 (supplementary figure S138).

Classification of protein structures

Several web sites offer hierarchical classifications of the entire Protein Data Bank (wwPDB; see p. 107) according to the folding patterns of the proteins. These include:

• SCOP: Structural Classification of Proteins
  http://scop.mrc-lmb.cam.ac.uk/scop/
• CATH: Class/Architecture/Topology/Homologous superfamily
  http://cathwww.biochem.ucl.ac.uk/latest/
• DALI: based on extraction of similar structures from distance matrices
  http://ekhidna.biocenter.helsinki.fi/dali/start
• CE: a database of structural alignments
  http://cl.sdsc.edu/

These sites describe projects derived from the primary archival databases of macromolecular coordinates. They are useful general entry points to protein structural data.

• It is in the tertiary structure of domains that proteins show their individuality and variety. Classifying proteins according to their tertiary structure indicates evolutionary relationships (or, at the very least, interesting structural similarities) between proteins that might have diverged so far that the relationship is not detectable by comparing their amino acid sequences.

SCOP

SCOP, by A.G. Murzin, L. Lo Conte, B.G. Ailey, S.E. Brenner, T.J.P. Hubbard, and C. Chothia, organizes protein structures in a hierarchy according to evolutionary origin and structural similarity. At the lowest level of the hierarchy are individual domains. SCOP groups sets of domains into families of homologues, for which the similarities in structure and sequence (and sometimes function) imply a common evolutionary origin. Families, containing proteins of similar structure and function but for which the evidence for evolutionary relationship is suggestive but not compelling, form superfamilies. Superfamilies that share a common folding topology, for at least a large central portion of the structure, are grouped as folds. Finally, each fold group falls into one of the general classes. The major classes in SCOP are α, β, α + β, α/β, and miscellaneous ‘small proteins’, many of which have little secondary structure and have structures stabilized by disulphide bridges or ligands. Box 10.5 shows the SCOP classification of E. coli CheY (see Figure 10.10).

• SCOP (Structural Classification of Proteins) offers facilities for searching on keywords to identify structures, navigation up and down the hierarchy, generation of pictures, access to the annotation records in the PDB entries, and links to related databases.

BOX 10.5 SCOP classification of CheY protein of Escherichia coli

1. Root: SCOP.
2. Class: Alpha and beta proteins (α/β). Mainly parallel β-sheets (β–α–β units).
3. Fold: Flavodoxin-like. Three layers, α/β/α; parallel β-sheet of five strands, order 21345.
4. Superfamily: CheY-like.
5. Family: CheY-related.
6. Protein: CheY protein.
7. Species: Escherichia coli.

Figure 10.10 E. coli CheY protein. CheY is a bacterial signal-transduction protein involved in regulation of flagellar dynamics in chemotaxis. Activation of a receptor causes phosphorylation of CheY. Phosphorylated CheY interacts with flagellum protein FliM to induce tumbling [3CHY].

The latest SCOP release contains 38 221 PDB entries split into 110 800 domains. The distribution of entries at different levels of the hierarchy is shown in Table 10.1.

Table 10.1 The distribution of SCOP entries at different levels of the hierarchy

Class                                 Number of folds   Number of superfamilies   Number of families
All α proteins                              284                  507                     871
All β proteins                              174                  354                     742
Alpha and beta proteins (α/β)               147                  244                     803
α and β proteins (α + β)                    376                  552                    1055
Multi-domain proteins                        66                   66                      89
Membrane and cell surface proteins           58                  110                     123
Small proteins                               90                  129                     219
Total                                      1195                 1962                    3902

To locate a protein of interest in SCOP, the user can traverse the structural hierarchy, or search via

keywords, such as protein name, PDB code, function (including Enzyme Commission number), and name of fold (for instance, barrel). For each structure, SCOP provides textual information, pictures, and links to other databases.

Numerous other web sites offering classifications of protein structures are indexed at: http://www.bioscience.org/urllists/protdb.htm.

Changes in folding patterns in protein evolution

Proteins identified by SCOP as related by evolution show recognizably similar but not identical folding patterns. Figure 10.11 compares spinach plastocyanin and cucumber stellacyanin.

Figure 10.11 Two related proteins that share the same general folding pattern, but differ in detail. Circles represent copper ions. (a) Spinach plastocyanin [1AG6], (b) cucumber stellacyanin [1JER]. Superposition showing (c) the entire structures and (d) only the well-fitting core (plastocyanin, green; stellacyanin, magenta). The main secondary structural elements of these proteins are two β-sheets packed face-to-face. It is seen in the superposition that several strands of β-sheet are conserved but displaced, and that the helix at the right of the cucumber stellacyanin structure has no counterpart in the spinach plastocyanin structure. Even the (relatively) well-fitting core shows the conservation of folding topology but nevertheless reveals considerable distortion.

For illustrations of the degree of similarities of proteins grouped together at
different levels of the hierarchy, and discussion of other classification schemes, see Chapter 4 of Introduction to Protein Architecture (Lesk, A.M. Oxford University Press, Oxford), which contains a large number of pictures of protein structures suitable for browsing by any reader interested in exploring the stunning variety of folding patterns seen in nature.

Many proteins change conformation as part of the mechanism of their function

The fundamental principle is that proteins fold to unique native structures. However, the mechanism of action of many proteins requires flexibility, or conformational change, during the active cycle. Most protein conformational changes are responses to binding.

• An enzyme may change structure after binding a substrate and/or a cofactor.
• Conformational changes arising from interactions with one or more other proteins, or nucleic acids, are a common component of regulatory mechanisms.
• Some proteins are microscopic motors, interconverting chemical and mechanical energy.
• Some serpins (serine protease inhibitors) are synthesized in a metastable active state and convert spontaneously to an inactive native state. This gives them a limited lifetime of activity, under tighter control than normal turnover processes. (Serpins also undergo an analogous structural change when they are cleaved by proteases, as part of their mechanism of inhibition.)

• Many proteins are microscopic machines, with internal parts moving in precise ways to support their function.

Conformational change during enzymatic catalysis

A general scheme for enzyme catalysis is:

1. An enzyme (E) reversibly binds a substrate (S) to form a Michaelis complex (ES).
2. The ES complex changes to a transition state (ES‡), the peak of the thermodynamic barrier between substrate and product.
3. The transition state converts to the enzyme–product complex (EP).
4. The EP complex breaks down to release product (P) and re-form the original enzyme.

E + S ⇌ ES ⇌ ES‡ → EP → E + P

Many enzymes have different structures in the unligated state, in the Michaelis complex, in the transition state, and/or in the enzyme–product complex.

Many binding sites occur in clefts between protein domains. Binding often induces conformational changes involving reorientation of domains, to close the structure around the ligand. Some of these changes can be described as a ‘hinge motion’, in which the two domains remain individually rigid but change their relative orientation by means of structural changes in only a few residues in the regions linking the domains. Hinge motion in myosin is responsible for the impulse in muscle contraction.

• By their nature, transition states are reactive and difficult to trap long enough for structure determination. Possible solutions include enzymes binding transition-state analogues or inhibitors, or lowering the temperature to slow down the reaction.

Arginine kinase

Arginine kinase catalyses the reaction:

L-arginine + ATP ⇌ N-phospho-L-arginine + ADP

The phosphate group is added to a nitrogen of the guanidinium group of the arginine sidechain, not to the nitrogen attached to the Cα. In invertebrates, phosphoarginine serves as an energy store from which ATP can be regenerated. In vertebrates, phosphocreatine plays an analogous role. (Some athletes take creatine orally to increase their energy storage capacity.)
Many proteins change conformation as part of the mechanism of their function 315

• One enzyme family – creatine kinase, arginine kinase, glycocyamine kinase, taurocyamine kinase, hypotaurocyamine kinase, lombricine kinase, opheline kinase, and thalassamine kinase – maintains ATP concentrations during bursts of muscular activity.

The structure of arginine kinase from the horseshoe crab (Limulus polyphemus) differs between the unligated state and the state binding ADP, arginine, and nitrate, a simulation of the transition state in which the nitrate mimics the phosphate group being transferred (see Figure 10.12).

Closure of interdomain clefts also occurs in transport proteins. Transport proteins act as carriers for their ligands, without catalysing reactions.

Ribose-binding protein is one of many periplasmic proteins in bacteria that are involved in chemotaxis and transport (see Figure 10.13). These proteins scavenge for nutrients in the cell’s environment by coupling ligation to interaction with transporters or chemotaxis receptors in the inner membrane.

Figure 10.12 Superposition of two conformational states of horseshoe crab arginine kinase [1M15, 1M80]. The unligated state is shown
in pink and purple; the ligated state in dark green and cyan. The ligands, arginine and ADP, appear in the ligated structure only. There
are steric clashes between the ligands and the unligated structure in its position in the picture.
The nature of the conformational change has reminded many people of the Venus flytrap. Regions of the structure have come together
around the ligands. The motion of the small domain at the top of the picture is primarily a ‘hinge’ motion – the mobile domain moves
almost rigidly around an axis through the interdomain interface. The axis is approximately perpendicular to the page.
The parts of the structure at the lower right also deform. This protein is showing ‘induced fit’ – in response to ligation.

Figure 10.13 Superposition of two structures of ribose-binding protein [2DRI, 1URP]. The unligated structure is shown in pink and cyan;
the ligated structure is shown in dark green and purple. The ribose, in yellow, appears only in the ligated structure.
Compared with arginine kinase (Figure 10.12), this is a more pure ‘hinge’ motion: the individual domains remain nearly rigid.
The conformational change is achieved by rotations about bonds in only a few residues in the hinge region itself.
316 10 Proteomics

Ribose-binding protein, like many other members of this family, undergoes conformational changes upon ligation such that domains of the protein close around the ligand. The structural changes increase the protein–ligand interactions. They also create a new surface recognized by transport complexes.

Regulation of G protein activity

GTP-binding proteins (or G proteins) are an important class of signal transducer. One of them, p21 Ras, is a molecular switch in pathways controlling cell growth and differentiation. Ras has two conformational states, which differ in the structure of a local mobile region (see Figure 10.14). The resting, inactive state binds GDP. Membrane-bound G-protein-coupled receptors (see p. 362) trigger a GDP–GTP exchange transition, associated with a conformational change (see Figure 10.15). Activated Ras binds Raf-1, a serine/threonine kinase. The Ras–Raf-1 complex initiates the MAP kinase phosphorylation cascade. Ultimately, the signal enters the nucleus, where it activates transcription factors regulating gene expression.

Ras has a GTPase activity to reset it to the inactive state. Mutations that abrogate the GTPase activity are oncogenic. Mutants that are trapped in the active state continuously trigger proliferation. Mutations in Ras appear in 30% of human tumours.

• G proteins are a large family of signal transducers. Their proper function requires that they alternate between two states of different structure and activity. Mutations that leave the G protein Ras trapped in an active state are oncogenic.

Motor proteins

Motor proteins use chemical energy to set molecules in controlled motion. (Heat is random molecular motion – conversion of chemical energy to heat is easy!) There are two requirements: (1) coupling ATP hydrolysis to conformational change, to generate a force; and (2) organizing a cycle of attachment and detachment to a mechanical substrate, to allow the force to generate movement.

Figure 10.14 p21 Ras bound to GTP. Although an active GTPase, the system was stabilized for crystal-structure analysis by cooling to
100 K [1QRA].

Figure 10.15 The conformational change in p21 Ras from the inactive GDP-binding conformation to the active GTP-binding
conformation primarily involves two regions (shown here in red) that form a patch on the molecular surface [1QRA, 1Q21].

Figure 10.16 Schematic diagram of a sarcomere. Thick myosin filaments (red) overlap thin actin filaments (black). In the main diagram, it is cursorily indicated that multiple myosin molecules from thick filaments interact with adjacent thin filaments. In fact, each thick filament contains several hundred myosin molecules. The inset shows different stages of the power stroke. From left to right: attachment, conformational change propelling the thin filament inwards by ∼10 nm, detachment (followed by recovery of original conformation of the myosin head).

Some motor proteins propel themselves – and their cargo – by exerting force against a stationary object, such as a cytoskeletal filament. Others remain stationary and propel movable objects.
• Myosins interact with actin during muscle contraction.
• Kinesins and dyneins interact with microtubules, mediating organelle transport, chromosome separation in mitosis, and movements of cilia and flagella.
Myosins, kinesins, and dyneins are primarily linear motors. In contrast, ATPase is a rotary motor.
• ATPase rotates during its action. Oxidative phosphorylation and photosynthesis create pH gradients across the membranes of mitochondria and chloroplasts, respectively. The mechanical step of ATPase activity is part of the mechanism for converting the osmotic energy of the potential gradient across a membrane to the high-energy phosphate bond of ATP.

• Motor proteins are energy transducers. Some involve conversion of chemical energy – via ATP hydrolysis – to mechanical energy. ATP synthase converts chemiosmotic energy to chemical energy by ATP formation, with a rotary motor as part of its mechanism.

The sliding filament mechanism of muscle contraction

The structural and mechanical unit of vertebrate skeletal muscle is an intracellular organelle called the sarcomere. Sarcomeres contain interdigitating filaments of actin and myosin (Figure 10.16). The actin filaments are fixed to structures called the Z-disks at the ends of the sarcomere. The motor protein myosin pulls the actin filaments inwards towards the centre of the sarcomere. During contraction, the actin and myosin filaments do not themselves shorten but slide past one another, shortening the sarcomere by increasing the region of overlap. Think of the shortening of a bicycle pump during its compression stroke.

A large muscle may contain ∼10⁴–10⁵ sarcomeres, laid end to end. Each sarcomere has a resting length of ∼2.5 μm and can contract by ∼0.3 μm. Therefore, the entire muscle can contract by about 1–2 cm.

Individual myosin molecules are large fibrous proteins of relative molecular mass ∼5 × 10⁵. They contain a fibrous section ∼1.6–1.7 μm long, and a globular head. Each thick filament contains ∼200–300 myosin molecules. The mechanical coupling between actin and myosin occurs through the myosin head, as shown in Figure 10.16.

During the power stroke, the myosin head undergoes a cycle of attachment–detachment and conformational change. From left to right in the inset in Figure 10.16, attachment of the myosin head is followed by conformational change that propels the actin towards the centre of the sarcomere. Detachment is followed by restoration of the initial conformation of the myosin. The myosin heads are like oars that ‘row’ the actin filaments towards the centre of the sarcomere. The displacement of the actin is ∼10 nm per myosin molecule per cycle. Hydrolysis of one molecule of ATP during each cycle of each myosin molecule provides the energy.

Structures of fragments of myosin containing the globular head have defined the mechanism of the conformational change (see Figure 10.17).

Figure 10.17 The contraction of muscle is a transformation of chemical energy to mechanical energy. It is carried out at the molecular level by a hinge motion in myosin, while myosin is attached to an actin filament. The cycle of attach to actin–change conformation–release from actin in a large number of individual myosin molecules creates a macroscopic force within the muscle fibre. (a) The structure of myosin subfragment 1 from chicken. The active site binds and hydrolyses ATP. ELC and RLC are the essential and regulatory light chains [2MYS]. (b) Hinge motion in myosin. Comparison of parts of chicken myosin open form [2MYS] (no nucleotide bound) and closed form binding the ATP analogue ADP·AlF₄⁻ [1BR2]. This shows the segments of the structure that surround the hinge region. (c) Model of the swinging of the long helical region in myosin as a result of the hinge motion. The dashed line shows a model of the position that the complete long helix would occupy in the closed form [2MYS] and [1BR1]. This conformational change is coupled to hydrolysis of ATP. It takes place while myosin is bound to actin, providing the power stroke for muscle contraction. In the context of the assembly and mechanism of function of a muscle filament, it is arguable that one should regard the helix as fixed and the head as swinging. However, this would not show the magnitude of the conformational change as dramatically.

Allosteric regulation of protein function

Allostery is modulation of the activity of a protein at one site by structural changes caused by binding a molecule at a distant site.

Allosteric proteins show ‘action at a distance’: ligand binding at one site affects activity at another. An impulse at the first site must transmit a conformational change affecting the second. In contrast, GTP-activated p21 Ras (Figures 10.14 and 10.15) shows ligand-induced regulation of activity, but the structural change is adjacent to the ligand. It is more challenging to explain the properties of haemoglobin, in which the shortest distance between binding sites is over 20 Å (2 nm).

• An allosteric protein with multiple binding sites for the same ligand may show cooperative binding.

• Allosteric proteins deviate from the Michaelis–Menten curve in ligand binding or, in the cases of allosteric enzymes, in reaction velocity as a function of substrate concentration. The cooperativity is achieved by ligation-induced conformational change.

Allosteric changes in haemoglobin

To play its physiological role in oxygen distribution effectively, haemoglobin must capture oxygen in the lungs as efficiently as possible and release as much as possible to other tissues. To achieve this ‘take from the rich, give to the poor’ effect, haemoglobin has a high oxygen affinity at high oxygen partial pressure (pO2) and a low affinity at low pO2.

Haemoglobin shows positive cooperativity: binding of oxygen increases the affinity for additional oxygen (Figure 10.18). Some proteins show negative cooperativity: binding reduces the affinity for additional ligand.

Figure 10.18 Oxygen-dissociation curves for myoglobin and haemoglobin, plotted as fraction ligated (0 to 1.0) against partial pressure of oxygen (0 to 60 mmHg). Myoglobin shows a simple equilibrium, with a binding constant independent of oxygen concentration. Haemoglobin shows positive cooperativity, the binding constant for the first oxygen being several orders of magnitude smaller than the binding constant for the fourth oxygen. The units for partial pressure are traditional in the literature about this topic: 760 mmHg = 1 atmosphere = 101 325 Pa.

In contrast to a ligand that induces cooperative binding of the same ligand, an effector alters the activity of a protein towards a different ligand. Bisphosphoglycerate (BPG) is an effector for haemoglobin. It binds preferentially to the deoxy form, decreasing the oxygen affinity and enhancing oxygen release. People and animals living at high altitude have higher concentrations of BPG than their sea-level relatives.

A mammalian foetus depends on its mother for oxygen. It is perhaps surprising that the oxygen affinity of isolated human foetal haemoglobin is lower than that of adult haemoglobin. However, foetal haemoglobin has a lower affinity for the effector BPG. This difference in the interaction with the effector gives foetal haemoglobin a higher oxygen affinity than the maternal haemoglobin.

The vertebrate haemoglobin molecule is a tetramer containing two identical α chains (α1 and α2) and two identical β chains (β1 and β2) (see Figure 1.16). It can adopt two structures: deoxyhaemoglobin (unligated) and oxyhaemoglobin (four oxygen molecules bound).

• The difference in colour between arterial and venous blood reveals the different state of the iron in ligated and unligated haemoglobin.

The oxygen affinity of the oxy form of haemoglobin is similar in magnitude to that of isolated α and β subunits and to that of myoglobin, a monomeric globin. The oxygen affinity of the deoxy form is much less: the ratio of binding constants for the first and fourth oxygens is 1:150–300, depending on conditions. Therefore, it is the deoxy form that is special, as it has had its oxygen affinity ‘artificially’ reduced.

J. Monod, J. Wyman, and J.-P. Changeux proposed that the reduced oxygen affinity of the subunits in the deoxy state of haemoglobin arises from structural constraints that hold the subunits in a ‘tense’ (T), internally inhibited form, whereas the oxy form is in a ‘relaxed’ (R) form, as free to bind oxygen as the isolated monomer.

A general model for cooperativity is that the subunits of a protein are in equilibrium between the T and R forms. This model rationalizes the properties of haemoglobin.
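The T/R picture can be made quantitative. The sketch below evaluates the fractional saturation given by the concerted two-state model of Monod, Wyman, and Changeux for a protein with n equivalent sites. The symbols `alpha`, `L`, and `c` follow the usual textbook notation for that model and are not used in the text here; the numerical values are illustrative assumptions only (with `c` taken inside the 1:150–300 range of binding-constant ratios quoted above) and are not fitted to haemoglobin data.

```python
def mwc_saturation(alpha: float, L: float, c: float, n: int = 4) -> float:
    """Fractional saturation in the concerted two-state (T/R) model.

    alpha : ligand concentration divided by the R-state dissociation constant
    L     : ratio [T]/[R] in the absence of ligand (assumed value)
    c     : K_R / K_T, ratio of R- and T-state dissociation constants;
            c < 1 means the T state binds the ligand more weakly
    n     : number of binding sites (4 for haemoglobin)
    """
    bound_r = alpha * (1.0 + alpha) ** (n - 1)              # ligand on R-state sites
    bound_t = L * c * alpha * (1.0 + c * alpha) ** (n - 1)  # ligand on T-state sites
    total = (1.0 + alpha) ** n + L * (1.0 + c * alpha) ** n
    return (bound_r + bound_t) / total


if __name__ == "__main__":
    # Illustrative parameters (assumptions, not measurements).
    L_assumed, c_assumed = 1.0e4, 1.0 / 200.0
    for alpha in (0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0):
        y = mwc_saturation(alpha, L_assumed, c_assumed)
        print(f"alpha = {alpha:5.1f}   fraction ligated = {y:.3f}")
```

Setting `c = 1` removes the difference between the T and R states, and the expression collapses to the simple hyperbola `alpha / (1 + alpha)`, the kind of non-cooperative curve shown for myoglobin in Figure 10.18.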

1. At low partial pressures of oxygen, all of the subunits of haemoglobin are in the T form and are unligated (i.e. not binding oxygen). The binding constant for oxygen is low, because binding to the T state is inhibited.
2. At high partial pressures of oxygen, all of the subunits of haemoglobin are in the R form and each subunit binds an oxygen. The binding constant for oxygen is high, because binding of oxygen to the R state is unconstrained.
3. In the erythrocyte, haemoglobin is an equilibrium mixture of deoxy and oxy forms; the concentration of partially ligated forms is tiny. Binding of between two and three oxygen molecules shifts the subunits concertedly from all being T state to all being R state.
4. Effector molecules such as BPG modify oxygen affinity by shifting the T ⇌ R equilibrium, by preferentially stabilizing one of the two forms.

The interpretation of this scheme in structural terms was one of the early triumphs of protein crystallography. The structures of haemoglobin in different states of ligation have been studied with intense interest, because of their physiological and medical importance and because they were thought to offer a paradigm of the mechanism of allosteric change. The two crucial questions to ask of the haemoglobin structures are:
1. What is the mechanism by which the oxygen affinity of the deoxy form is reduced?
2. How is the equilibrium between low- and high-affinity states altered by oxygen binding and release?

Comparison of the oxy and deoxy structures has defined the changes in tertiary structures of individual subunits, and in the quaternary structure. The allosteric change involves an interplay between changes in tertiary and quaternary structure (see Figure 10.19).
• The details of the quaternary structure – the relative geometry of the subunits and the interactions at their interfaces – are determined by the way the subunits fit together.
• The fit of the subunits depends on the shapes of their surfaces.
• The tertiary structural changes alter the shapes of the surfaces of the subunits, changing the way they fit together.

The haemoglobin tetramer can be thought of as a pair of dimers: α1β1 and α2β2. The allosteric change involves a rotation of 15° of the α1β1 dimer with respect to the α2β2 dimer around an axis approximately perpendicular to their interface. (The motion is like that of a pair of shears with α1 and α2 as the blades and β1 and β2 as the handles.)

Starting from the deoxy structure, ligation of oxygen creates strain at the haem group, arising from a change in the position of the iron and the histidine sidechain linked to it. To relieve this strain, there are shifts in the F helix and the FG corner (the region of the chain between the F and G helices). To accommodate these shifts, a set of tertiary structural changes alters the overall shape of the α1β1 and α2β2 dimers, notably the shifting of the relative positions of the FG corners. In consequence, the deoxy quaternary structure is destabilized because the dimers no longer fit together properly (having changed their shape). Adopting the alternative quaternary structure requires the tertiary structural changes to take place even in subunits not yet liganded. As a result of the quaternary structural change, these unligated subunits have been brought to a state of enhanced oxygen affinity. It is important to emphasize that this is a sequence of steps in a logical process and not a description of a temporal pathway of a conformational change.

Figure 10.19 Some important structural differences between oxy- and deoxyhaemoglobin [1HHO, 2HHB]. (a) Changes at the haem group in human haemoglobin result in a change in state of ligation. This figure shows the F helix, proximal histidine, and haem group of the β chain in the oxy (black) and deoxy (red) forms; only the oxy haem is shown. The structures were superposed on the haem group. (b) The α1β1 dimer in oxy (red) and deoxy (black, in blown-up regions only) forms. In the blown-up regions, only the F helix, FG corner, and haem group are shown. The oxy and deoxy α1β1 dimers have been superposed on their interface; in this frame of reference, there is a small shift in the haem groups and a shift and conformational change in the FG corners. (c) Alternative packing of α1 and β2 subunits in oxyhaemoglobin (red) and deoxyhaemoglobin (black). The oxy and deoxy structures have been superposed on the F and G helices of the α1 monomer. Although for the purposes of this illustration we have regarded the α1 subunit as fixed and the β2 subunit as mobile, only the relative motion is significant.

Conformational states of serine protease inhibitors (serpins)

Figure 10.20 shows the serpin antithrombin III in two conformational states, native and latent.

Serpins show multiple conformational states with different folding patterns. Under physiological conditions, the native states of inhibitory serpins are metastable, converting spontaneously to the latent state. In the native state (Figure 10.20a), the main β-sheet (green) has five strands (the rightmost much shorter than the others). The reactive-centre loop (red) is exposed, not participating in any secondary structure. It is available to interact with a protease. In the latent state (Figure 10.20b), the reactive-centre loop forms a sixth strand within the main β-sheet. The two structures have identical amino acid sequences and chemical bonding patterns, but topologically different secondary and tertiary structures.

The latent state resembles the state produced by cleavage of the reactive-centre loop, as part of the mechanism of inhibitory action of this molecule.

• For a description of the mechanism of inhibition and its dependence on cleavage and conformational rearrangement, see Lesk, A.M. Introduction to Protein Science, 2nd ed., p. 219 (Oxford University Press, Oxford).

Figure 10.20 Antithrombin III, a serine proteinase inhibitor. (a) Native conformation; (b) latent conformation [1ATH].

Protein structure prediction and modelling

The observation that each protein folds spontaneously into a unique three-dimensional native conformation implies that nature has an algorithm for predicting protein structure from amino acid sequence. Some attempts to understand this algorithm are based solely on general physical principles; others are based on observations of known amino acid sequences and protein structures. A proof of our understanding would be the ability to reproduce the algorithm in a computer program that could predict protein structure from amino acid sequence. The Critical Assessment of Structure Prediction (CASP) programmes provide ‘blind’ tests of the state of the art (see Box 10.6).

BOX 10.6 Critical Assessment of Structure Prediction (CASP)

Judging of techniques for predicting protein structures requires blind tests. To this end, J. Moult initiated biennial CASP programmes. Crystallographers and NMR spectroscopists in the process of determining a protein structure are invited to (1) publish the amino acid sequence several months before the expected date of completion of their experiment; and (2) commit themselves to keeping the results secret until an agreed date. Predictors submit models, which are held until the deadline for release of the experimental structure. Then the predictions and experiments are compared.

The results of CASP evaluations record progress in the effectiveness of predictions, which has occurred partly because of the growth of the databanks but also because of improvements in the methods.

Most attempts to predict protein structure from basic physical principles alone try to reproduce the interatomic interactions in proteins, to define a numerical energy associated with any conformation. Computationally, the problem of protein structure
prediction then becomes a task of finding the global minimum of the conformational energy function over all possible backbone and sidechain conformations. So far this approach has not generally succeeded, partly because of the imprecision of the energy function and partly because the minimization algorithms tend to get trapped in local minima.

The alternatives to a priori methods are approaches based on assembling clues to the structure of a target sequence by finding similarities to known structures. These empirical or ‘knowledge-based’ techniques have become very powerful and are currently the most successful methods known.
• Homology modelling. Suppose a target protein of known amino acid sequence but unknown structure is related to one or more proteins of known structure. Then we expect that much of the structure of the target protein will resemble that of the known protein. The related protein of known structure can therefore serve as a basis for a model of the target protein. The challenge is to predict how the differences between the sequences are reflected in differences between the structures. This can be thought of as the ‘differential’ rather than the ‘integral’ form of the folding problem.
• Attempts to predict secondary structure without attempting to assemble these regions in three dimensions. The results are lists of regions of the sequence predicted to form α-helices and regions predicted to form strands of β-sheet.
• Fold recognition. Given a library of known structures, determine which of them shares a folding pattern with a query protein of known sequence but unknown structure. If the folding pattern of the target protein does not occur in the library, such a method should recognize this. The results are a nomination of a known structure that has the same fold as the query protein, or a statement that no protein in the library has the same fold as the query protein.
• Prediction of novel folds, either by a priori or knowledge-based methods. The results are a complete coordinate set for at least the mainchain and sometimes the sidechains also. The model is intended to have the correct folding pattern, but would not be expected to be comparable in quality to an experimental structure.

D. Jones has likened the distinction between fold recognition and a priori modelling to the difference between a multiple-choice question on an examination and an essay question.

Homology modelling

Model building by homology is a useful technique when one wants to predict the structure of a target protein of known sequence, and the target protein is related to at least one other protein of known sequence and structure. If the proteins are closely related, the known protein structures – called the parents – can serve as the basis for a model of the target. It is on homology modelling that we depend to extend the results of structural genomics to the entire protein world.

The completeness and quality of the results depend crucially on how similar the sequences are. As a rule of thumb, if the sequences of two homologous proteins have 50% or more identical residues in an optimal alignment, the structures are likely to have similar conformations over more than 90% of the model. This is a conservative estimate, as Figure 10.21 shows.

Although the quality of the model will depend on the degree of similarity of the sequences, it is possible to specify this quality before experimental testing. Therefore, knowing how good a model is necessary for the intended application permits intelligent prediction of the probable success of the exercise.

Steps in homology modelling are as follows.
1. Align the amino acid sequences of the target and the protein or proteins of known structure. Usually, insertions and deletions will lie in the loop regions between helices and sheets.
2. Determine mainchain segments to represent the regions containing insertions or deletions. Stitching these regions into the mainchain of the known protein creates a model for the complete mainchain of the target protein.
3. Replace the sidechains of residues that have been mutated. For residues that have not mutated, retain the sidechain conformation. Residues that have mutated tend to keep the same sidechain conformational angles and could be modelled on this basis. However, computational methods are

(a)
                              10        20        30        40        50        60
                               |         |         |         |         |         |
Chicken lysozyme      KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS
Baboon α-lactalbumin  KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ESTEYGLFQISN
                      K F  CEL         D Y    L    C     S   TQA   N   ST YG  QI

                              70        80        90       100       110       120
                               |         |         |         |         |         |
Chicken lysozyme      RWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGN-GMNAWVAWRNRCKGTD
Baboon α-lactalbumin  ALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILDI--KGIDYWIAHKALC-TEK
                        WC     P SRN C I C   L  DIT    CAKKI      G   W A    C

                             130
                               |
Chicken lysozyme      VQA-WIRGCRL-
Baboon α-lactalbumin  -LEQWL--CE-K
                          W   C

(b)

Figure 10.21 (a) Aligned sequences and (b) superposed structures of two related proteins, hen egg white lysozyme (black) [1AKI] and baboon α-lactalbumin (red) [1ALC]. The sequences are related (37% identical residues in the aligned sequences) and the structures are very similar. Each protein could serve as a good model for the other, at least as far as the course of the mainchain is concerned.
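The percentage identity quoted in the caption can be recomputed from the alignment in panel (a). The sketch below counts identical residues over the columns in which both sequences have a residue. Two points are assumptions rather than statements from the text: the gap positions are a reconstruction of the printed alignment, and the choice of denominator (columns without a gap in either sequence) is only one of several conventions in use.

```python
def identity_counts(seq_a: str, seq_b: str, gap: str = "-") -> tuple[int, int]:
    """Return (identical, aligned) for two rows of a pairwise alignment.

    'aligned' counts columns where neither row has a gap;
    'identical' counts those columns where the residues match.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("alignment rows must have the same length")
    identical = aligned = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip insertions and deletions
        aligned += 1
        if a == b:
            identical += 1
    return identical, aligned


# Rows of Figure 10.21(a); gap placement reconstructed (assumption).
CHICKEN_LYSOZYME = (
    "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINS"
    "RWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGN-GMNAWVAWRNRCKGTD"
    "VQA-WIRGCRL-"
)
BABOON_LACTALBUMIN = (
    "KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ESTEYGLFQISN"
    "ALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILDI--KGIDYWIAHKALC-TEK"
    "-LEQWL--CE-K"
)

if __name__ == "__main__":
    same, total = identity_counts(CHICKEN_LYSOZYME, BABOON_LACTALBUMIN)
    print(f"{same} identical residues in {total} aligned positions "
          f"({100.0 * same / total:.0f}% identity)")
```

With this reconstruction the count comes to 44 identical residues in 119 aligned positions, consistent with the 37% stated in the caption and well below the 50% rule of thumb, even though the two structures are very similar.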

now available to search over possible combinations of sidechain conformations.
4. Examine the model – both by eye and using programs – to detect any serious collisions between atoms. Relieve these collisions, as far as possible, by manual manipulations.
5. Refine the model by limited energy minimization. The role of this step is to fix up the exact geometrical relationships at places where regions of the mainchain have been joined together and to allow the sidechains to wriggle around a bit to place themselves in comfortable positions. The effect is really only cosmetic – energy refinement will not correct serious errors in such a model.

In most families of proteins, the structures contain relatively constant regions and more variable ones. The core of the structure of the family retains the folding topology, although it may be distorted, but the periphery can entirely refold (see Figure 10.11); in contrast, hen egg white lysozyme and baboon α-lactalbumin, shown in Figure 10.21, are closely related and quite similar in structure.

A single parent structure will permit reasonable modelling of the conserved portion of the target protein but will fail to produce a satisfactory model of the variable portion. A more favourable situation occurs when several related proteins of known structure can serve as parents for modelling a target protein. These reveal the regions of constant and variable structure in the family. The observed distribution of structural variability among the parents dictates an appropriate distribution of constraints to be applied to the model.

Mature software for homology modelling is available. SWISS-MODEL is a web site that will accept the amino acid sequence of a target protein, determine whether a suitable parent or parents for homology modelling exist and, if so, deliver a set of coordinates for the target. SWISS-MODEL (http://www.expasy.org/swissmodel/SWISS-MODEL.html) was developed by T. Schwede, M.C. Peitsch, and N. Guex, now at the Geneva Biomedical Research Institute. Another program in widespread use, MODELLER, was developed by A. Šali. Each of these is associated with a library of models corresponding to
amino acid sequences in databanks. MODBASE (http://salilab.org/modbase) and 3DCrunch (http://swissmodel.expasy.org/SM_3DCrunch.html) collect homology models of proteins of known sequence.

• Homology modelling is one of the most useful techniques for protein structure prediction – when it is applicable.

Secondary structure prediction

The goal of secondary structure prediction is to identify those residues within the sequence that will form helices and strands of β-sheet in the native structure, independent of their spatial arrangement in the tertiary structure.

The original motivations for secondary structure prediction included:
• a belief that prediction of secondary structure should be substantially easier than prediction of tertiary structure;
• a belief that prediction of secondary structure would be a step towards prediction of tertiary structure; and
• both experimental evidence from co-polymers of amino acids and statistical evidence from the observed residue compositions of helices and β-sheets in solved protein structures implying that there are preferences among the residues for forming (or breaking) helices.

Early work on secondary structure prediction made a priori predictions based on tables of residue preferences. Methods were tested according to a three-state model in which each residue in the prediction and in the experimental structure was assigned to the classes ‘helix’, ‘sheet’, and ‘other’, and the percentage of residues assigned to the correct class was defined as a measure of success called Q3. Such methods achieved typical accuracies corresponding to Q3 ∼55%.

ondary structure is to be predicted. The patterns are based on the distribution of residues at the positions in the alignment table.

The most powerful pattern recognition algorithms now being applied to secondary structure prediction include neural networks and hidden Markov models. The basic idea is to develop a network of implications containing a large set of adjustable parameters governing the assignment of each residue to the three classes: helix, sheet, or other. The systems are quite general and the parameters must be adjusted by ‘training’ on known sequences and structures.

The best current methods claim average accuracies of Q3 ∼75%.

• CASP categories change as the field progresses. Secondary structure prediction and fold recognition have been discontinued. Prediction of residue–residue contacts, of disordered regions, and the ability to refine models have been added.

Prediction of novel folds: ROSETTA

ROSETTA is a program developed by D. Baker and colleagues that predicts protein structure from amino acid sequence by assimilating information from known structures. At several recent CASP programmes, ROSETTA showed the most consistent success on targets in both the ‘novel fold’ and ‘fold recognition’ categories.

ROSETTA predicts a protein structure by first generating structures of fragments using known structures and then combining them. For each contiguous region of three and nine residues, instances of that sequence and related sequences are identified in proteins of known structure. For fragments this small, there is no assumption of homology to the target protein. The distribution of conformations of the fragments in the proteins of known structure models the distribution of possible conformations of the corresponding fragments of the target structure.
Progress depended on the recognition that tables of ROSETTA explores the possible combinations of
many aligned sequences contained consensus infor- fragment conformations, evaluating compactness,
mation that could improve the prediction accuracy. paired b-sheets, and burial of hydrophobic residues.
The idea is to apply pattern-recognition algorithms The procedure carries out 1000 independent simula-
to a set of aligned sequences homologous to the tions, with starting structures chosen from the frag-
sequence of an unknown structure for which the sec- ment conformation distribution patterns evaluated
326 10 Proteomics

favourably. The structures that result from these simulations are clustered and the centres of the largest clusters presented as predictions of the target structure. The idea is that a structure that emerges many times from independent simulations is likely to have favourable features.

There is a general belief that the work of Baker and colleagues represents a major breakthrough in the field of protein structure prediction.

Robetta (http://robetta.bakerlab.org) is a web server designed to integrate and implement the best of the protein structure prediction tools. The central pipeline of the software involves first the parsing of a submitted amino acid sequence of a protein of unknown structure into putative domains. Then homology modelling techniques are applied to those domains for which suitable parents of known structure exist and the de novo methods developed by Baker and co-workers are applied to other domains. In addition, the user will receive the results of other prediction methods based on software developed outside the Robetta group. These include, for example, predictions of secondary structure, coiled coils, and transmembrane helices.

Some methods are specialized to particular types of structure.

Prediction of transmembrane proteins and signal peptides

Many proteins are designed to sit within membranes. Membrane proteins mediate the exchange of matter, energy, and information between cell interiors and surroundings. Examples of membrane protein functions include energy transduction via the generation or release of concentration gradients across cell or organelle membranes, and signal reception and transmission.

It is estimated that, in the human genome, approximately 30% of genes encode membrane proteins. Approximately 70% of known targets of drugs are membrane proteins. Given that membrane proteins are so common, it is important to have reliable tools for their identification. Relatively few membrane protein structures have been determined experimentally. This places a greater burden on computational tools for sequence analysis, to identify and characterize them.

Among their adaptations, membrane proteins contain regions of mostly non-polar residues that interact with the organic layer. Many membrane proteins contain a set of seven consecutive α-helices that traverse the membrane, oriented approximately perpendicular to the plane of the membrane (see Figure 4.17). These helices are connected by loops that protrude into the aqueous surroundings. A second class of membrane protein structures contains a β-barrel. Transmembrane helices are typically 15–30 residues long. Although enriched in hydrophobic residues, they contain some polar sidechains, usually in interfaces between α-helices packed together in the structure.

A useful clue to the orientation of the helices across the membrane is the ‘+ inside rule’. The loops between helices lie either entirely inside or entirely outside the cell or organelle. Those inside contain a preponderance of positively charged residues.

A simple approach to prediction of membrane proteins involves looking for amino acid segments of 15–30 residues in length that are rich in hydrophobic residues. However, signal peptides also contain hydrophobic helices: the signal sequence typically comprises a positively charged n-region, followed by a helical hydrophobic h-region, followed by a polar c-region. Methods for recognizing transmembrane helices in amino acid sequences tend to pick up the h-regions of signal peptides as false positives. Methods for recognizing signal peptides in amino acid sequences tend to pick up transmembrane helices as false positives.

L. Käll, A. Krogh, and E.L.L. Sonnhammer trained hidden Markov models to test simultaneously for transmembrane helices and signal peptides. The goals are to find both at the same time, to discriminate between them in the results, and to predict not only the positions of the transmembrane helices but also the locations – cytoplasmic or interior – of the loops. The method, called Phobius, is available at http://phobius.cgb.ki.se/.

Phobius is the most successful algorithm currently available for recognizing signal peptides and helical transmembrane proteins, and for predicting the orientation of the transmembrane segments. Phobius is capable of distinguishing h-domains of signal peptides from transmembrane helices: the number of false classifications of signal peptides was 3.9%, and

the number of false classifications of transmembrane helices was 7.7%. These results represent a great improvement over previous methods. It is interesting that addressing the two problems at once proved to be more successful than treating them separately.

Coiled-coil regions

Proteins containing coiled coils are known among structural proteins such as α-keratin and also occur in a variety of globular proteins associated with a number of functions, prominently including transcription regulation. Figure 10.22, showing a leucine zipper, is a typical example.

Figure 10.22 Coiled-coil BZIP domain encoded by proto-oncogene c-jun [1JNM].

Such coiled-coil domains contain a signature pattern in their amino acid sequences. They show heptad repeats – seven-residue patterns – containing positions denoted a, b, c, d, e, f, and g, of which the first and fourth positions – a and d – are usually hydrophobic. Here is the sequence of the leucine zipper protein GCN4, with the heptads demarcated and the hydrophobic positions indicated by asterisks:

    abcdefg   abcdefg   abcdefg   abcdefg
    *  *      *  *      *  *      *  *
R | MKQLEDK | VEELLSK | NYHLENE | VARLKKL | VG

Programs for predicting coiled coils include Coils, by A. Lupas and J. Lupas (http://www.ch.embnet.org/software/COILS_form.html), and Paircoil, by B. Berger, D.B. Wilson, E. Wolf, T. Tonchev, M. Milla, and P.S. Kim (http://paircoil.lcs.mit.edu/webcoil.html).

Available protocols for protein structure prediction

Here we collect a few of the many web sites that deal with protein structure prediction.

Many sites act as ‘dispatchers’ for other sites, accepting a sequence and resending it to many other web servers. These allow users to submit a sequence once and have it processed by many other sites to which the query is distributed. These are sometimes called ‘metaservers’.

Homology modelling

SWISS-MODEL: http://www.expasy.org/swissmod/SWISS-MODEL.html. Results of the application of SWISS-MODEL to proteins of known sequence are available through 3DCrunch: http://swissmodel.expasy.org/SM_3DCrunch.html.

MODELLER (homology modelling software): http://salilab.org/modeller/modeller.html. Results of the application of MODELLER to proteins of known sequence are available through MODBASE: http://salilab.org/modbase.

Secondary structure prediction

A list of secondary-structure prediction servers can be found at: http://abs.cit.nih.gov/main/otherservers.html. Quite a few sites are listed. Many methods are available through the following site: http://cubic.bioc.columbia.edu.

Prediction of full three-dimensional structure

Available at the Robetta web site: http://robetta.bakerlab.org.

Prediction of antibody structure

Available at: http://www.biocomputing.it/pigs.

Prediction of transmembrane helices and signal peptides

Phobius is available at http://phobius.cgb.ki.se/.

Prediction of coiled coils

See Coils, by A. Lupas and J. Lupas, at: http://www.ch.embnet.org/software/COILS_form.html and Paircoil, by B. Berger and co-workers, at: http://paircoil.lcs.mit.edu/webcoil.html.
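
As a rough illustration of the heptad pattern described under ‘Coiled-coil regions’, the following Python sketch labels the residues of a sequence with heptad positions a–g for a chosen register and reports the fraction of a and d positions occupied by hydrophobic residues. The hydrophobic residue set, the function names, and the idea of scanning all seven registers are assumptions made for illustration. This is not the method used by Coils or Paircoil, which score sequences against statistics derived from known coiled coils.

```python
# Illustrative assumption: which residues count as hydrophobic.
HYDROPHOBIC = set("AILMFVWY")
HEPTAD = "abcdefg"


def heptad_labels(length, offset=0):
    """Label each residue with a heptad position; residue `offset` is position 'a'."""
    return [HEPTAD[(i - offset) % 7] for i in range(length)]


def ad_hydrophobic_fraction(seq, offset=0):
    """Fraction of 'a' and 'd' positions that hold a hydrophobic residue."""
    labels = heptad_labels(len(seq), offset)
    core = [res for res, lab in zip(seq, labels) if lab in "ad"]
    if not core:
        return 0.0
    return sum(res in HYDROPHOBIC for res in core) / len(core)


def best_register(seq):
    """Try all seven registers; return the (offset, fraction) pair with the highest fraction."""
    return max(((off, ad_hydrophobic_fraction(seq, off)) for off in range(7)),
               key=lambda pair: pair[1])


# GCN4 leucine zipper segment from the text, with the heptad separators removed.
gcn4 = "R" + "MKQLEDK" + "VEELLSK" + "NYHLENE" + "VARLKKL" + "VG"
offset, fraction = best_register(gcn4)
print(offset, round(fraction, 2))
```

For the GCN4 segment the best register places position a on the methionine that follows the leading arginine, which is the register marked in the text; the single non-hydrophobic core residue is the asparagine at the a position of the third heptad.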
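
The simple hydrophobic-segment approach described under ‘Prediction of transmembrane proteins and signal peptides’ can be sketched as a sliding-window scan. The version below uses the Kyte–Doolittle hydropathy scale. The window length of 19 residues and the threshold of 1.6 are commonly quoted choices for that scale but are assumptions here, as are the function names and the toy sequence. As the text notes, a scan of this kind cannot tell a transmembrane helix from the h-region of a signal peptide; Phobius addresses that with hidden Markov models, which this sketch does not attempt.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_means(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    scores = [KD_SCALE[res] for res in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_segments(seq, window=19, threshold=1.6):
    """Merge overlapping or adjacent high-scoring windows into (start, end) spans.

    Positions are 0-based and `end` is exclusive.
    """
    spans = []
    for i, mean in enumerate(window_means(seq, window)):
        if mean > threshold:
            if spans and i <= spans[-1][1]:
                spans[-1] = (spans[-1][0], i + window)
            else:
                spans.append((i, i + window))
    return spans


# Toy sequence (an assumption): polar flanks around one 19-residue hydrophobic stretch.
toy = "MKKDE" * 4 + "LIVALLAVGLIFAVLLAIV" + "KRDEQ" * 4
print(candidate_segments(toy))
```

Because neighbouring windows share most of their residues, the reported span is usually somewhat longer than the hydrophobic stretch itself; a real predictor would trim the ends and then apply the ‘+ inside rule’ to orient the helix.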

• Because proteins fold into native structures at the dictation of the amino acid sequence, it should be possible to write a computer program to predict protein structure from amino acid sequence. Many people have tried to predict different features of protein structures – prediction of the full three-dimensional structure remains the ultimate goal. The CASP (Critical Assessment of Structure Prediction) programme subjects the efforts to objective tests.

Structural genomics

In analogy with full-genome sequencing projects, structural genomics has the commitment to deliver the structures of the complete protein repertoire. X-ray crystallographic and NMR experiments will solve a ‘dense set’ of proteins, such that all proteins are close enough to one or more experimentally determined structures to model them confidently. More so than genomic sequencing projects, structural genomics projects combine results from different organisms. The human proteome is of course of special interest, as are proteins unique to infectious microorganisms.

The goals of structural genomics have become feasible partly by advances in experimental techniques, which make high-throughput structure determination possible, and partly by advances in our understanding of protein structures, which define reasonable general goals for the experimental work and suggest specific targets.

How many structures are needed? The theory and practice of homology modelling suggests that at least 30% sequence identity between target and some experimental structure is necessary. This means that experimental structure determinations will be required for an exemplar of every sequence family, including many that share the same basic folding pattern. Experiments will have to deliver the structures of something like 10 000 domains. In the year 2010, 7936 structures were deposited in the PDB, so the throughput rate is not far from what is required.

Directed evolution and protein design

One strand of Darwin’s thinking that led to the theory of evolution was the observation that farmers could improve the quality of livestock by selective breeding. He drew an analogy between this artificial selection and the idea of natural selection that he was proposing as the mechanism of evolution. We now recognize that evolution by natural selection takes place at the molecular level. Why not artificial selection also?

Natural proteins do many things, but not everything we would like them to. For applications in technology, it would be useful to have proteins that would:

• have activities unknown in nature;
• show activity towards unnatural substrates, or altered specificity profiles;
• be more robust than natural proteins, retaining their activity at higher temperature or in organic solvents; and
• show different regulatory responses, enhanced expression, or reduced turnover.

One day, we shall be able to design amino acid sequences a priori that will fold into proteins with desired functions. As this is not yet possible, scientists have used directed evolution – or artificial selection – to generate molecules with novel properties starting from natural proteins.

Evolution requires the generation of variants and differential propagation of those with favourable features. Molecular biologists dealing with microbial evolution have advantages over the farmers that Darwin observed. We can generate large numbers of variants artificially. Screening and selection can, in many cases, be done efficiently, by stringent growth conditions, and there are virtually no limits on the size of the ‘flock’ or ‘litter’. Darwin might well have been envious. He wrote:

. . . as variations manifestly useful or pleasing to man appear only occasionally, the chance of their appearance will be much increased by a large number of individuals being kept. Hence, number is of the highest importance for success. On this principle Marshall formerly remarked,

with respect to the sheep of parts of Yorkshire, ‘as they generally belong to poor people, and are mostly in small lots, they never can be improved.’
– The Origin of Species, Chapter 1.

The procedure of directed evolution comprises these steps:

1. Create variant genes by mutagenesis or genetic recombination.
2. Create a library of variants by transfecting the genes into individual bacterial cells.
3. Grow colonies from the cells and screen for desirable properties.
4. Isolate the genes from the selected colonies and use them as input to step 1 of the next cycle.

Strategies for generating variants include (a) single and multiple amino acid substitutions, (b) recombination, and (c) formation of chimaeric molecules by mixing and matching segments from several homologous proteins. Each method has its advantages and disadvantages. The smaller the change in sequence, the more likely that the result will be functional. Yet, multiple substitutions or recombinations give a greater chance of generating novel features. The choice depends in part on the nature of the goal. For instance, it is easier to lose a function than to gain one. (Why would you want to lose a function? Removal of product inhibition to enhance throughput in an enzymatically catalysed process is an example.)

Directed evolution of subtilisin E

Subtilisins are a family of bacterial proteolytic enzymes. Subtilisin E, from the mesophilic bacterium Bacillus subtilis, is a 275-residue monomer. It becomes inactive within minutes at 65°C. Directed evolution has produced interesting variants, with features including enhanced thermal stability and activity in organic solvents.

Enhancement of thermal stability by directed evolution

Thermitase, a subtilisin homologue from Thermoactinomyces vulgaris, remains stable up to 80°C. The existence of thermitase is reassuring, because it shows that the evolution of subtilisin to a thermostable protein is possible. However, subtilisin E and thermitase differ in 157 amino acid residues. Do we have to go this far? Are all of the changes essential for thermostability, or has there been considerable neutral drift as well?

A thermostable variant of subtilisin E, produced by directed evolution, differs from the wild type by only eight residue substitutions. The variant is identical to thermitase in its temperature of optimum activity: 76°C (17°C higher than the original molecule) and stability at 83°C (a 200-fold increase relative to the wild type).

The procedure involved successive rounds of generation of variants, and screening and selection of those showing favourable properties. The formation of mutations, via error-prone PCR to produce an average of two to three base changes per gene, was alternated with in vitro recombination to find the best combinations of substitutions at individual sites (Figure 10.23). At each step, several thousand clones were screened for activity and thermostability.

[Plot: ln t1/2 at 65°C (vertical axis, 0–8) against generation (horizontal axis, WT and 1–5); steps are labelled ‘Random mutagenesis’ and ‘Recombination’.]

Figure 10.23 Directed evolution of a thermostable subtilisin. The starting wild type (WT) was subtilisin E from the mesophilic bacterium B. subtilis. Steps of random mutagenesis were alternated with recombination. At each step, screening for improved properties and artificial selection chose candidates for the next round. (t1/2 measured in minutes.)
After: Zhao, H. & Arnold, F.H. (1999). Directed evolution converts subtilisin E into a functional equivalent of thermitase. Prot. Eng. 12, 47–53.

The optimal variant differed from the wild type at eight positions: N188S, S161C, P14L, N76D, G166R, N181D, S194P, and N218S. Figure 10.24 shows their distribution in the structure. Most of the substitutions are far from the active site, which is not surprising as the wild type and variant do not

differ in function. Most of the sites of substitution are in loops between regions of secondary structure. These regions are the most variable in the natural evolution of the subtilisin family. However, only two of the substitutions produce the amino acids that appear in those positions in thermitase. Two of the substitutions are in α-helices, including P14L. P14L has a certain logic: proline tends to destabilize an α-helix because it costs a hydrogen bond.

Figure 10.24 The sites of mutation in B. subtilis subtilisin E that produced a thermostable variant by directed evolution. The sidechains shown are those of the final product. [Labelled sites in the drawing: S161C, S194P, G166R, N188S, N181D, N218S, P14L, and N76D.]

Enzyme design

We know that mutations can change the structure and function of proteins, in some cases in radical ways. We observe this in natural protein evolution and can achieve it by artificial selection (see preceding section). We can do a reasonable job at predicting the structural changes arising from mutations. Putting these together suggests that we should be able to design proteins with altered or even novel functions in silicio.

Gramicidin S is a cyclic decapeptide antibiotic from Bacillus brevis. It contains unusual amino acids, including D-phenylalanine and ornithine. Synthesis of gramicidin S is independent of the normal ribosomal protein-synthesizing machinery. Instead, the enzyme gramicidin synthetase isomerizes the substrate L-phenylalanine to D-phenylalanine, and activates it to the amino acid adenylate. Another component of the enzyme effects the polymerization, in a sequence-specific manner.

C.-Y. Chen, I. Georgiev, A.C. Anderson, and B.R. Donald computationally redesigned gramicidin synthetase to accept other amino acids as substrates. The wild-type enzyme has no activity towards Arg, Glu, Lys, and Asp. Their computational modelling approach successfully predicted the sequences of modified enzymes with activity for each of these four unnatural substrates.

Protein complexes and aggregates

Within cells, life is organized and regulated by a set of protein–protein and protein–nucleic acid interactions.

Interacting proteins and nucleic acids span a range of structures and activities:

• simple dimers or oligomers in which the monomers appear to function independently;
• oligomers with functional ‘cross-talk’, including ligand-induced dimerization of receptors and allosteric proteins such as haemoglobin (Figure 1.16), phosphofructokinase, and aspartate carbamoyltransferase;
• large fibrous proteins such as actin or keratin;
• non-fibrous structural aggregates such as viral capsids;
• large aggregates with dynamic properties such as F1-ATPase, pyruvate dehydrogenase, the GroEL–GroES chaperonin, and the proteasome;
• protein–nucleic acid complexes, including ribosomes, nucleosomes, transcription regulation complexes, splicing and repair particles, and viruses;
• many proteins, whether monomeric or oligomeric, which function by interacting with other proteins. These include all enzymes with protein substrates and many antibodies, inhibitors, and regulatory proteins.

Box 10.7 Diseases associated with protein aggregates

Disease | Aggregating protein | Comment
Sickle-cell anaemia | Deoxyhaemoglobin-S | Mutation creates hydrophobic patch on surface
Classical amyloidoses | Immunoglobulin light chains, transthyretin, and many others | Extracellular fibrillar deposits
Emphysema associated with Z-antitrypsin | Mutant α1-antitrypsin | Destabilization of structure facilitates aggregation
Huntington’s | Altered huntingtin | One of several polyglutamine-repeat diseases
Parkinson’s | α-Synuclein | Found in Lewy bodies
Alzheimer’s | Aβ, τ | Aβ = 40–42 residue fragment
Spongiform encephalopathies | Prion proteins | Infectious, despite containing no nucleic acid

Protein aggregation diseases

Protein interactions are frequently associated with disease, caused by misfolded or mutant proteins that are prone to aggregation. Amyloidoses are diseases characterized by extracellular fibrillar deposits. Alzheimer’s and Huntington’s diseases are also associated with protein aggregation. Many aggregates contain proteins in a common crossed-β-sheet structure, different from their native state. There are a variety of causes of protein aggregation, including overproduction of a protein, destabilizing mutations, and inadequate clearance in renal failure (see Box 10.7).

Alzheimer’s disease

Alzheimer’s disease is a neurodegenerative disease common in the elderly. It is associated with two types of deposits:

(a) dense insoluble extracellular protein deposits, called senile plaques. These contain the Aβ fragment (the N-terminal 40–43 residues) of a cell surface receptor in neurons, the β-protein precursor (βPP) or amyloid precursor protein (APP).
(b) neurofibrillary tangles, twisted fibres inside neurons, containing microtubule-associated protein tau. There is some evidence that amyloid deposits promote tangle formation.

Abnormalities in tau appear in other neurodegenerative diseases, the tauopathies.

Prion diseases – spongiform encephalopathies

Prion diseases are a set of neurodegenerative conditions of animals and humans, associated with deposition of protein aggregates in the brain, and a characteristic sponge-like appearance of the brains of affected individuals seen in postmortem investigation.

Prion diseases are unusual among protein deposition diseases in that they are transmissible; and unusual among transmissible diseases in that the infectious agent is a protein. Moreover – another unusual feature – some prion diseases are hereditary, for example, familial Creutzfeldt–Jakob disease (CJD). (Distinguish a disease transmitted from mother to baby perinatally by passage of an infectious agent – as in many AIDS cases – from a truly hereditary disease depending on parental genotype.) All hereditary human prion diseases involve mutations in the same gene, one that encodes the protein found in the aggregates deposited in the brains of sufferers of both hereditary and infectious prion diseases.

The prion protein can exist in two forms: the normal PrPC and the dangerous PrPSc. PrPSc but not PrPC can (a) form aggregates, (b) catalyse the conversion of additional PrPC to PrPSc within the brain of an

individual person or animal, and (c) infect other individuals, by various routes including ingestion of nervous tissue from an affected animal (or person, in the case of kuru).

• PrPC = prion protein-Cellular; PrPSc = prion protein-Scrapie.

Prion disease presents widespread health problems for humans and animals. In 2001 a serious epidemic of bovine spongiform encephalopathy (colloquially, ‘mad-cow disease’) devastated the United Kingdom countryside. There was an apparent association with the appearance of human cases of variant Creutzfeldt–Jakob disease (vCJD). In the hereditary disease familial CJD, symptoms began to appear in people aged 55–75. Variant CJD affected people in their twenties. It is hypothesized that these outbreaks were associated with transmission of prion protein infections across species barriers: sheep to cows for BSE, and cows to humans for vCJD.

Prion proteins form a family of homologous proteins in many species of animals, and also in yeast, but apparently not in C. elegans or Drosophila. Normal human prion protein is synthesized as a 253-residue polypeptide. This comprises: an N-terminal signal peptide, followed by a domain containing ∼5 tandem repeats of the octapeptide PHGGGWGQ (in mammals), a conserved 140-residue domain, and a C-terminal hydrophobic domain. The signal domain and the C-terminal domain are cleaved off, and the protein is anchored to the extracellular side of the cell membrane of neurons by a GPI (glycosylphosphatidylinositol) group bound to the C-terminal residue Ser231 of the mature protein.

The normal role of PrPC is not clear. Mice in which PrPC has been knocked out develop normally for a time, and eventually die of apparently unrelated developmental defects. In fact, PrPC-knockout mice are not susceptible to infection with PrPSc, an observation important in proving the mechanism of the disease.

The nature of the conformational change is still not entirely clear. The change from α to β structure shown by circular dichroism is one clue. In principle, prion proteins show multiple structures from one polypeptide sequence. However, differences in glycosylation patterns between PrPC and PrPSc have been reported; these may play a role in defining the conformation.

The mechanism by which PrPSc catalyses the transformation of additional PrPC to PrPSc is also not clear. Inherited prion diseases are associated with mutants, presumably increasing the tendency for conformational mobility (Table 10.2). A related question concerns the kinetics of the process – what governs the rate of accumulation of aggregates that causes many prion diseases to appear only among the elderly?

• Many diseases arise from formation of protein aggregates, including: sickle-cell anaemia, amyloidoses, Alzheimer’s disease, Huntington’s disease, familial and variant Creutzfeldt–Jakob disease, prion diseases. Most of these are genetic, some are infectious.

Properties of protein–protein complexes

Stoichiometry – what is the composition of the complex?

Protein complexes vary widely in the numbers and variety of molecules they contain. Some contain only

Table 10.2 Some diseases associated with prion proteins

Disease | Species affected | Symptoms
scrapie | sheep | hypersensitivity, unusual gait, tremor
bovine spongiform encephalopathy, or ‘mad cow disease’ | cow | similar to scrapie
kuru | human | loss of coordination, dementia
Creutzfeldt–Jakob disease (CJD) | human, age 55–75 | impaired vision and motor control, dementia
variant CJD | human, age 20–30 | psychiatric and sensory anomalies preceding dementia
Gerstmann–Sträussler–Scheinker syndrome | human | dementia
fatal familial insomnia | human | sleep disorder

Table 10.3 Dissociation constants of some protein–ligand

BOX Evolution of the proteasome complexes
10.8
Biological context Ligand Typical KD

The archaeal proteasome contains 14 identical a sub- Allosteric activator Monovalent ion 10−4–10−2
units and 14 identical b subunits. They are arranged in Co-enzyme binding NAD, for instance 10−7–10−4
four stacked rings, a7–b 7–b 7–a7. All b subunits have pro-
Antigen–antibody Various 10−4–10−16
tease activity. The core of eukaryotic proteasome also complexes
contain the a7–b7–b7–a7 stacked ring structure, but each
Thrombin inhibitor Hirudin 5 × 10−14
ring contains seven diverged and non-identical subunits.
Trypsin inhibitor Bovine pancreatic 10−14
Thus, the eukaryotic proteasome contains seven homo-
trypsin inhibitor
logous but non-identical a subunits and seven homo-
Streptavidin Biotin 10−15
logous but non-identical b subunits. Only three of the
eukaryotic b subunits have protease activity. The eukary-
otic proteasome also contains large regulatory subunits
in addition to the a–b rings, which select ubiquitinylated
Dissociation constants of protein–ligand com-
proteins for degradation. plexes span a wide range, as shown in Table 10.3.
Structural studies have elucidated several import-
ant features of the interactions between soluble
proteins, that contribute to affinity.
a few proteins; others are very large. For example,
pyruvate dehydrogenase contains hundreds of subunits and some viral capsids contain thousands.
Some prokaryotic proteins containing identical subunits are homologous to eukaryotic proteins
containing related but non-identical subunits, arising by gene duplication and divergence. The
proteasome is an example (see Box 10.8). Some viruses achieve diversity without duplication, by
combining proteins with the same sequence but different conformations.

Affinity – how stable is the complex?
The measure of the affinity of a complex is the dissociation constant, KD, the equilibrium
constant for the reverse of the binding reaction:

protein–ligand ⇌ protein + ligand

\[ K_D = \frac{[\mathrm{P}][\mathrm{L}]}{[\mathrm{PL}]} \]

where [P], [L], and [PL] denote the numerical values of the concentrations of protein (P),
ligand (L), and protein–ligand complex (PL), respectively, expressed in mol l⁻¹. The lower the
KD, the tighter the binding. KD corresponds to the concentration of free ligand at which half the
proteins bind ligand and half are free: [P] = [PL].

• The Michaelis constant of an enzyme approximates the dissociation constant of the
enzyme–substrate complex when catalysis is slow relative to release of the substrate.

• What holds the proteins together? Burial of hydrophobic sidechains, hydrogen bonds and salt
bridges, and van der Waals forces. A typical protein–protein interface might involve 22 residues
and 90 atoms, of which 20% would be main-chain atoms, and an occasional water molecule
(Box 10.9). Burial of 1 Å² contributes ∼100 J mol⁻¹ to stability. There is, on average, one
intermolecular hydrogen bond per 170 Å² of interface area. The average value of the surface area
buried in binary protein complexes is ∼1600 Å². The minimum buried surface for stability of a
protein–protein complex is ∼1000 Å².
• Do proteins change conformation in complexing? In some cases the interaction energy has to
‘pay for’ the conformational change and the interface tends to be correspondingly larger.
Complexes that involve conformational changes generally bury >2000 Å².
• What determines specificity? Complementarity of the occluding surfaces, in shape,
hydrogen-bonding potential, and charge distribution. Prediction of protein complexes from the
structures of the partners is the docking problem. Reliable solution of this problem, together
with progress in structural genomics, would permit in silico screening of proteomes for
interacting partners.
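The definition of KD given above implies a simple binding curve. As a minimal sketch, assuming a
single binding site and that the free ligand concentration [L] is known (the function name
`fraction_bound` and the example KD are illustrative, not taken from the text), the fraction of
protein carrying ligand is [L]/(KD + [L]):

```python
def fraction_bound(ligand_conc: float, k_d: float) -> float:
    """Fraction of protein in the PL form at free ligand concentration [L].

    From K_D = [P][L]/[PL]:  [PL] / ([P] + [PL]) = [L] / (K_D + [L]).
    Both arguments must be in the same units (for example mol l^-1).
    """
    if ligand_conc < 0 or k_d <= 0:
        raise ValueError("ligand concentration must be >= 0 and K_D must be > 0")
    return ligand_conc / (k_d + ligand_conc)


# Hypothetical K_D of 1 micromolar, chosen only for illustration.
K_D = 1e-6
for multiple in (0.1, 1, 10):
    bound = fraction_bound(multiple * K_D, K_D)
    print(f"[L] = {multiple} x K_D -> fraction bound = {bound:.3f}")
```

At [L] = KD the function returns one half, which is the half-saturation condition [P] = [PL]
stated above.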

How are complexes organized in three dimensions?

When two proteins form a complex, each leaves a ‘footprint’ on the surface of the other, defining
the portion of the surface involved in the interaction. If two proteins interact using the same
surface on both, the complex is closed. If two proteins interact through different surfaces, the
complex is open. The significance is that a closed complex does not allow additional proteins to
bind with the same interaction.

BOX 10.9 A protein–protein interface: phage M13 gene III protein and E. coli TolA

During infection of E. coli by phage M13, a complex forms between the N-terminal domain of the
minor coat gene III protein of the phage and the C-terminal domain of a receptor protein in the
bacterial cell membrane, TolA (see Figure 10.25).
The complex is stabilized by burial of 1765 Å² of surface area, by combination of β-sheets from
both proteins to form an extended β-sheet (see Figure 10.25a) and by several linkages of
sidechains by hydrogen bonds and salt bridges. The area buried in the complex is divided almost
evenly between the two partners.

Figure 10.25 The interface between phage M13 gene III protein (N-terminal domain), orange, and
E. coli protein TolA (C-terminal domain), blue, [1TOL]. (a, b) Folding patterns and relative
orientation of domains, viewed approximately (a, c) perpendicular and (b) parallel to the
interface. Note the β-sheet formed from strands contributed by both partners. (c) Slice through
the interface, with TolA shown in black, gene III protein in red, and water molecules in blue. It
is possible that another water molecule sits next to the one inside the structure.
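The rules of thumb quoted earlier in this chapter (∼100 J mol⁻¹ of stabilization per Å² buried,
and about one hydrogen bond per 170 Å² of interface) can be applied to the 1765 Å² interface of
Box 10.9. The sketch below is only a back-of-envelope estimate: the function name is
illustrative, and treating the entire buried area as hydrophobic is an assumption that makes the
energy an upper bound.

```python
from typing import Tuple

J_PER_MOL_PER_A2 = 100.0  # stabilization per square angstrom buried (rule of thumb from the text)
A2_PER_HBOND = 170.0      # interface area per hydrogen bond (rule of thumb from the text)


def interface_estimates(buried_area_a2: float) -> Tuple[float, float]:
    """Return (stabilization in kJ/mol, expected number of interface hydrogen bonds)."""
    stabilization_kj = buried_area_a2 * J_PER_MOL_PER_A2 / 1000.0
    hydrogen_bonds = buried_area_a2 / A2_PER_HBOND
    return stabilization_kj, hydrogen_bonds


# Buried area reported for the gene III protein-TolA complex.
energy, hbonds = interface_estimates(1765.0)
print(f"about {energy:.0f} kJ/mol and about {hbonds:.0f} hydrogen bonds")
```

The estimate ignores the entropic cost of association and the polar part of the interface, so it
indicates scale only.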

An open complex, in which the surface of potential interaction is not occluded, can grow by
accretion of additional subunits. Thus, open but not closed complexes are compatible with the
formation of aggregates by continued addition of monomers making the same interaction.

Multisubunit proteins
An important class of protein–protein complexes is oligomeric or multisubunit proteins. We appeal
to structural biology to address the following questions.
• What is the stoichiometry? How many different types of subunit appear and how many of each are
present? Most proteins are homodimers or homotetramers. Monomers and heterooligomers are less
common. The ribosome is an extreme example of a heterooligomer. Proteins containing odd numbers
of subunits are rarer than those containing even numbers of subunits.
• What is the relationship between the contributions of different subunits to the interface?
Consider a dimer of two identical subunits: in isologous binding, the interface is formed from
the same sets of residues from both monomers; in heterologous binding, different monomers
contribute different sets of residues to the binding site. A handshake is isologous.
• Is the structure open or closed? In an open structure, at least one of the sites forming the
binding surface is exposed in at least one of the subunits, so that additional subunits could be
added on. In a closed structure, all binding surfaces are in contact with partners and the
assembly is saturated. Domain swapping – exchange of segments between two interacting domains –
often but not always produces closed isologous dimers.
• What is the symmetry of the structure? Symmetry is the rule, rather than the exception, in
structures of oligomeric proteins. The subunits in most dimers are related by an axis of twofold
symmetry. Yeast hexokinase is an exception. It forms an asymmetric dimer. In the human growth
hormone receptor, a nearly symmetric dimer binds an asymmetric ligand (see Figure 10.26).
• Do any of the subunits undergo conformational changes on assembly? Often we don’t know. In
cases of extensively interlocked interfaces, such as the Trp repressor, the monomers could not
adopt the same structure in the absence of their partners. Allosteric proteins can undergo
ligand-dependent conformational changes. In ATP synthase, a threefold symmetric complex of αβ
subunits is distorted by interaction with the γ subunit.

• An isologous open structure is not possible. Why?

Figure 10.26 Human growth hormone (blue) in complex with two molecules illustrating the dimerized exterior domain of its receptor
(green, orange) [3hhr].

● RECOMMENDED READING

• Two fine general references:

Branden, C.-I. & Tooze, J. (1999). Introduction to Protein Structure, 2nd edn. Garland, New York.
Liljas, A., Liljas, L., Piskur, J., Lindblom, G., Nissen, P., & Kjeldgaard, M. (2009). Textbook of
Structural Biology. World Scientific, Singapore.
• Reviews of current techniques in proteomics:
de Hoog, C.L. & Mann, M. (2004). Proteomics. Annu. Rev. Genomics Hum. Genet. 5, 267–293.
Domon, B. & Aebersold, R. (2006). Mass spectrometry and protein analysis. Science 312,
212–217.
• The current state of the art in protein structure prediction:
Moult, J. (2005). A decade of CASP: progress, bottlenecks, and prognosis in protein structure
prediction. Curr. Opin. Struct. Biol. 15, 285–289.
Janin, J. (2005). Assessing predictions of protein–protein interaction: the CAPRI experiment.
Prot. Sci. 14, 278–283.
Tramontano, A. (2006). Protein Structure Prediction: Concepts and Applications. Wiley–VCH,
Weinheim.
• Articles about protein complexes:
Russell, R.B., et al. (2004). A structural perspective on protein–protein interactions. Curr. Opin.
Struct. Biol. 14, 313–324.
Sali, A. & Chiu, W. (2005). Macromolecular assemblies highlighted. Structure 13, 339–341.
• A companion volume to this one, treating the topics of this chapter in more detail:
Lesk, A.M. (2010). Introduction to Protein Science/Architecture, Function and Genomics.
Oxford University Press, Oxford.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 10.1 For each of the following amino acids, say whether they are hydrophobic, polar,
positively charged at pH 7, or negatively charged at pH 7: (a) leucine; (b) aspartic acid;
(c) glutamine; (d) phenylalanine; (e) lysine.
Exercise 10.2 (a) Identify a hydrophobic amino acid that is more bulky than alanine but less bulky
than leucine. (b) Identify two amino acids that have almost the same size and shape (differing only
in that one has a methyl group and the other has a hydroxyl group).
Exercise 10.3 On a photocopy of Figure 10.2, indicate the bond, a rotation around which would
correspond to the conformational angle ψᵢ₋₁.
Exercise 10.4 Would it be possible, by rotation around bonds shown in Figure 10.2, to convert
residue i from the L to the D conformation?
Exercise 10.5 Estimate the values of φ and ψ that correspond to the αL conformation in Figure 10.3.
Exercise 10.6 Figure 10.2 shows the trans conformation of the polypeptide chain. (a) If the angle
labelled ωᵢ in that figure is changed from ω = 180° trans to ω = 0° cis, keeping the positions of
all atoms in residues i − 1 and i fixed, what atom would occupy the position currently occupied
by Hᵢ₊₁? (b) In the structure shown in Figure 10.2, φᵢ = 180°. The only unlabelled atom is the
hydrogen connected to Cαᵢ. Assuming a rotation that keeps the positions of Nᵢ, Hᵢ, and the atoms
of residue i − 1 fixed, estimate the value of φᵢ that would place this unlabelled hydrogen atom at
the position that Cβᵢ occupies in Figure 10.2.
Exercise 10.7 Describe and compare the nature of the accelerating and retarding forces on the
molecules in mass spectrometry and SDS-PAGE.
Exercise 10.8 In a typical protein–protein interface of area 1700 Å²: (a) how many intermolecular
hydrogen bonds would you expect to be formed? (b) How many fixed water molecules would you
expect to find in the interface? (c) If the entire buried area were hydrophobic, what contribution to
the free energy of stabilization would you estimate it to make?
Exercise 10.9 In the dimer between syntrophin and neuronal nitric oxide synthase (see Figure 10.27),
(a) is the dimer structure open or closed? (b) What secondary structure element is shared between
the two domains?

Figure 10.27 Interaction between PDZ domains in syntrophin (cyan) and neuronal nitric oxide synthase
(magenta) [1QAV].

Problems
Problem 10.1 P. Schultz has posed the question: would an extended genetic code – perhaps
one in which one of the rarely used stop codons coded for a novel amino acid with a somewhat
unusual size, shape, or charge distribution – be ‘better’ than the normal one? The question could
be taken to apply to either natural or artificial extensions of the code. How would you design
experiments to answer this question? What precautions would you consider necessary?
Problem 10.2 As a general rule, vertebrates use creatine as a phosphogen and invertebrates use
arginine. Figure 10.28 shows the sequence alignment of creatine kinases (CK) from rabbit and
chicken, and arginine kinases (AK) from sea cucumber, horseshoe crab, and abalone. The numbers
of identical residues in pairs of sequences in this alignment are:

                    Rabbit CK   Chicken CK   Sea cucumber AK   Horseshoe crab AK   Giant abalone AK
Rabbit CK           378         252          226               147                 132
Chicken CK          252         381          219               129                 129
Sea cucumber AK     226         219          370               154                 137
Horseshoe crab AK   147         129          154               357                 191
Giant abalone AK    132         129          137               191                 358

10 20 30 40 50 60
| | | | | |
Rabbit CK (KCRM_RABIT) GNTHNKYKLNYKSEEEYPDLSKHNNHMAKVLTPDLYKKLRDKETPSGFTLDDVIQTGVDN
Chicken CK (KCRS_CHICK) ATVHEKRKL_FPPSADYPDLRKHNNCMAECLTPAIYAKLRDKLTPNGYSLDQCIQTGVDN
Sea cucumber AK (KARG_STIJA) MANLNQKKYPAKDDFPNFEGHKSLLSKYLTADMYAKLRDVATPSGYTLDRAIQNGVDN
Horseshoe crab AK (KARG_LIMPO) MVDQATLDKLEAGFKKLQEASDCKSLLKKHLTKDVFDSIKNKKTGMGATLLDVIQSGVEN
Abalone AK (KARG_NORMA) MLAMASVEELWA___ KLDGAADCKSLLKNNLTKERYEALKDKKTKFGGTLADCIRSGCLN
LT y l dk T G tL Iq Gv N

70 80 90 100 110 120

| | | | | |
Rabbit CK (KCRM_RABIT) PGHPFIMTVGCVAGDEESYTVFKDLFDPIIQDRHGGFKP_ TDKHKTDLNHENLKGGDDLD
Chicken CK (KCRS_CHICK) PGHPFIKTVGMVAGDEESYEVFAEIFDPVIKARHNGYDPRTMKHHTDLDASKITHG_QFD
Sea cucumber AK (KARG_STIJA) PD____ FHLGLLAGDEETYTVFADLFDPVIEEYHNGFKK_ TDNHKTDLDASKILDD_ VLD
Horseshoe crab AK (KARG_LIMPO) LD____ SGVGIYAPDAESYRTFGPLFDPIIDDYHGGFKLTDKHPPKEWGDINTL__ VDLD
Abalone AK (KARG_NORMA) LD____ SGVGIYACDPDAYTVFADVLDAVIKEYHKVPEL__ KHPEPEMGDLDKLNFGDLD
vG A D e Y vF fDp I H g lD

130 140 150 160 170 180

| | | | | |
Rabbit CK (KCRM_RABIT) PH__ YVLSSRVRTGRSIKGYTLPPHCSRGERRAVEKLSVEALNSLTGEFKGKYYPLKSMT
Chicken CK (KCRS_CHICK) ER__ YVLSSRVRTGRSIRGLSLPPACSRAERREVENVVVTALAGLKGDLSGKYYSLTNMS
Sea cucumber AK (KARG_STIJA) PA__ YVISSRVRTGRNIRGMALSPHVCRSERRAIEKMVSEALNSLAADLKGKYYSLMKMD
Horseshoe crab AK (KARG_LIMPO) PGGQFIISTRVRCGRSLQGYPFNPCLTAEQYKEMEEKVSSTLSSMEDELKGTYYPLTGMS
Abalone AK (KARG_NORMA) PSGEYIVSTRVRVGRSHDSYGFPPVLTKQERLKMEEDTKAAFEKFSGELAGKYFPLEGMS
p y S RVR GRs g P er E al GkYy L M

190 200 210 220 230 240

| | | | | |
Rabbit CK (KCRM_RABIT) EQEQQQLIDDHFLFDKPVSPLLLASGMARDWPDARGIWHNDNKSFLVWVNEEDHLRVISM
Chicken CK (KCRS_CHICK) ERDQQQLIDDHFLFDKPVSPLLTCAGMARDWPDARGIWHNNDKTFLVWINEEDHTRVISM
Sea cucumber AK (KARG_STIJA) EKTQQQLIDDHFLFDRPVSRHFTSGGMARDFPDGRGIWHNDKKNFLVWINEEDHTRIISM
Horseshoe crab AK (KARG_LIMPO) KATQQQLIDDHFLFKE_GDRFLQTANACRYWPTGRGIFHNDAKTFLVWVNEEDHLRIISM
Abalone AK (KARG_NORMA) KEDQKQMTEDHFLFKD_DDRFLRDAGGYNDWCSGRGIFFNTAKNFLVWVNEEDHLRLISM
QqQlidDHFLF l g rdwp RGI hN K FLVW NEEDH R ISM

250 260 270 280 290 300

| | | | | |
Rabbit CK (KCRM_RABIT) EKGGNMKEVFRRFCVGLQKIEEIFK_ KAGHPFMWNEHLGYVLTCPSNLGTGLRGGVHVKL
Chicken CK (KCRS_CHICK) EKGGNMKRVFERFCRGLKEVERLIK_ ERGWEFMWNERLGYVLTCPSNLGTGLRAGVHVKL
Sea cucumber AK (KARG_STIJA) QMGGNMKEVFERFTRGLTEVEKHIKDKTGKEFMKNDHLGFVLTCPSNLGTGVRCSVHAKL
Horseshoe crab AK (KARG_LIMPO) QKGGDLKTVYKRLVTAVDN_____ IESK_ LPFSHDDRFGFLTFCPTNLGTTMRASVHIQL
Abalone AK (KARG_NORMA) QKGGDLAAVYKRLVVAINT_____ MTASGLSFAKRDGLGYLTFCPSNLGTALRASVHMKI
kGG k V R g F lG CPsNLGT R V H kl

310 320 330 340 350 360

| | | | | |
Rabbit CK (KCRM_RABIT) AHLSKH_PKFEEILTRLRLQKRGTGGVDTAAVGSVFDISNADRLGSSEVEQVQLVVDGVK
Chicken CK (KCRS_CHICK) PRLSKD_ PRFPKILENLRLQKRGTGGVDTAAVADVYDISNLDRMGRSEVELVQIVIDGVN
Sea cucumber AK (KARG_STIJA) PHMAKD_ KRFEEICTKMRLQKRGTSGEFTESVGGVYDISNLDRLGSSEVEQVNCVIKGVK
Horseshoe crab AK (KARG_LIMPO) PKLAKDRKVLEDIASKFNLQVRGTRGEHTESEGGVYDISNKRRLGLTEYQAVREMQDGIL
Abalone AK (KARG_NORMA) PNLAAS_ PEFKSFCDNLNIQARGIHGEHTESVGGVYDLSNKRRLGLTEYQAVEEMRVGVE
l k f i lQ RGt G T vg V DiSN RlG E V Gv

370 380
| |
Rabbit CK (KCRM_RABIT) LMVEMEKKLEKGQSIDDMIPAQK____
Chicken CK (KCRS_CHICK) YLVDCEKKLEKGQDIKVPPPLPQFGRK
Sea cucumber AK (KARG_STIJA) VLIEMEKKLEKGESIDDLVPK______
Horseshoe crab AK (KARG_LIMPO) EMIKMEKAAA_________________
Abalone AK (KARG_NORMA) ACLAKEKELAAAKK_____________
EK l

Figure 10.28 Alignment of creatine kinase (CK) from rabbit and chicken, and arginine kinase (AK) from sea cucumber, horseshoe crab,
and abalone.

(a) Does sea cucumber arginine kinase appear to be more related to vertebrate creatine kinases
or other invertebrate arginine kinases? (b) On a photocopy of Figure 10.28, circle (at least two)
regions, each at least four residues long, in which sea cucumber arginine kinase resembles
vertebrate creatine kinases more closely than it resembles other invertebrate arginine kinases, and
circle (at least two) regions, each at least four residues long, in which sea cucumber arginine kinase
resembles other invertebrate arginine kinases more closely than it resembles vertebrate creatine
kinases. (c) Can you identify any residues that might conceivably be responsible for the difference
in substrate specificity between arginine and creatine? (d) Outline how you could test the
hypothesis you presented as your answer to part (c), using only computational and not wet
laboratory methods. (e) Which is more likely: that arginine kinase activity evolved once and that
no protein in any ancestor of sea cucumber had creatine kinase activity, or that sea cucumber
arginine kinase evolved from a precursor it shared with present vertebrate creatine kinases?
Explain your reasoning.

Weblems
Weblem 10.1 Choose one of the following glycoprotein storage diseases: (1) aspartylglucosaminuria,
(2) α-mannosidosis, (3) β-mannosidosis, (4) Sandhoff–Jatzkewitz disease, or (5) sialidosis. (a) What
is the inheritance pattern of this disease? (b) What is the incidence of this disease in the USA; i.e.
what fraction of the population is affected? (c) What is the biochemical defect that causes this
disease?
Weblem 10.2 Some athletes take creatine in order to build up their ability to store energy as
phosphocreatine. Athletes find that creatine certainly improves performance in ‘burst’ events –
all-out effort for up to 10 seconds – but its efficacy for ‘endurance events’ is debated. Examples
of burst events include the 100 metre dash, a point in tennis, or a play in football. In all of these
cases, there is a pause between energy bursts to allow replenishment of phosphocreatine by
oxidative metabolism. This takes ∼30–60 seconds. (The International Tennis Federation rules allow
no more than 20 seconds between points.) (a) How does the US Food and Drug Administration
classify creatine? (b) Is creatine banned by the Olympics or major commercial sports enterprises?
(c) Why would it be difficult to enforce a ban on creatine supplements?
Weblem 10.3 The bacterium Pseudomonas fluorescens and the fungus Curvularia inaequalis each
possesses a chloroperoxidase, an enzyme that catalyses halogenation reactions. Do these enzymes
have the same folding pattern?
CHAPTER 11

Systems Biology

LEARNING GOALS

• To gain a sense of the discipline of systems biology as an integrative approach to all the ‘omics’
disciplines.
• To understand the idea of networks and their representation as graphs.
• To appreciate that many aspects of metabolic networks are shared by different organisms, and
that they evolve.
• To know the databases dealing with metabolic pathways, including EcoCyc, KEGG, WIT, and
BRENDA, and be fairly fluent at using them and the links they contain.
• To distinguish between static and dynamic aspects of biological networks.
• To understand the ideas of stability and robustness and the mechanisms by which life achieves
them.
• To appreciate how computational concepts important for systems biology, such as randomness
and complexity, have been made precise and quantitative.
• To understand the structure, dynamics, and evolution of metabolic networks.
• To know the different ways of experimentally determining protein–protein and protein–nucleic
acid interactions.
• To be familiar with some of the basic types of DNA-binding protein.
• To understand the structures, dynamics, and evolution of regulatory networks.
• To appreciate the adaptability of the yeast regulatory network.

Introduction to systems biology

The goal of systems biology is the synthesis of all biological data into a unified picture of the
structure, dynamics, logistics, and ultimately the logic of living things. Systems biology focuses
on the integration of gene, RNA, and protein activity.
Molecules are social animals and life depends on their interactions. As individual molecules have
specialized functions, control mechanisms are required to organize and coordinate their
activities. Failure of control mechanisms can lead to disease and even death.

Two parallel networks: physical and logical
Systems biology deals with networks. Networks consist of sets of molecules and the interactions
among them. There are networks of genes, of RNAs, of proteins, and of metabolites. The same set
of molecules may be connected by different types of interaction or relationship, to form
different networks (see Box 11.1).

BOX 11.1 Some networks in systems biology

Network      Element of network                Connection between network elements
Genomes      Gene                              Homology or shared expression pattern or linkage
Protein      Protein                           Homology or regulatory relationship or shared
                                               expression pattern or physical complex formation
Metabolite   Chemical compound, e.g. glucose   Substrate and product of an enzymatic reaction or
                                               similarity in structure or similarity in reactivity

In cells, two interaction networks are in operation: (a) a physical network of protein–protein
and protein–nucleic acid complexes, and (b) a logical network of control cascades. Interactions
may be physical or logical. Often, they are both. Physical and logical networks operate in
parallel. A macromolecular complex such as the ribosome is a network of proteins and RNAs,
interacting through the physical contacts in their assembly. A transcription regulatory network
is a network of genes, exerting logical control over expression patterns via the synthesis of
specific DNA-binding proteins. A transcription factor that acts by binding to DNA may never
interact physically with the proteins the expression of which it controls. Metabolic pathways
have a similar duality: many but not all metabolic pathways are mediated by physical
protein–protein interactions and regulated by logical ones.
Examples of purely physical interactions include macromolecular complexes – both multiprotein
complexes and protein–nucleic acid complexes. Examples of logical interactions not mediated
entirely by direct physical interaction between proteins include feedback loops in which the
increase in concentration of a product of a metabolic pathway inhibits an enzyme catalysing one
of the early steps in the pathway, or the secretion of a small molecule as a signal to other
cells, in ‘fire and forget’ mode (see Box 11.2). In these cases, the logical interaction is
transmitted by diffusion of a small molecule, rather than physical contact between source and
recipient of the signal.
The allosteric change in haemoglobin is an example of simultaneous physical and logical
interaction: the subunits of haemoglobin respond to changes in oxygen levels by a conformational
change that alters oxygen affinity. Another example is the transmission of a signal from the
surface of a cell across the membrane to the interior by dimerization of a receptor. This can be
the initial trigger of a process that ultimately affects gene expression. Not all links of this
process need involve protein–protein interactions; some may be mediated by diffusion of small
molecules such as cyclic AMP.
Even though particular complexes may participate in both physical and logical networks, the two
networks remain distinct in terms of their organization and their biological function, and it is
useful to keep the distinction between them in mind, especially when they overlap.

• Cells contain both physical and logical networks. Many interactions are common to both.
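The idea that one set of molecules can support several networks can be made concrete by tagging
each interaction with its kind. The following is a minimal sketch: the molecule names and the
interactions are invented placeholders rather than data from the text, `subnetwork` is an
illustrative helper name, and edges are treated as undirected for simplicity even though logical
control usually has a direction.

```python
from collections import defaultdict

# (molecule, molecule, kind of interaction). Every entry is an invented placeholder.
INTERACTIONS = [
    ("hb_alpha", "hb_beta", "physical"),      # subunits in contact within one complex
    ("hb_alpha", "hb_beta", "logical"),       # the same pair is also coupled allosterically
    ("enzyme_D", "enzyme_E", "physical"),     # two enzymes forming a complex
    ("tf_A", "enzyme_D", "logical"),          # regulator of enzyme_D expression, no contact
    ("metabolite_X", "enzyme_D", "logical"),  # end product inhibiting an early enzyme
]


def subnetwork(interactions, kind):
    """Return adjacency sets for one kind of interaction (edges treated as undirected)."""
    adjacency = defaultdict(set)
    for first, second, edge_kind in interactions:
        if edge_kind == kind:
            adjacency[first].add(second)
            adjacency[second].add(first)
    return dict(adjacency)


physical = subnetwork(INTERACTIONS, "physical")
logical = subnetwork(INTERACTIONS, "logical")
shared = sorted(set(physical) & set(logical))
print("molecules present in both networks:", shared)
```

Keeping the kind on the edge rather than on the node lets the same molecule appear in both
networks while the two networks themselves stay distinct.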

BOX 11.2 Cell–cell communication in microorganisms: quorum sensing

Control mechanisms not involving direct protein–protein interactions mediate intercellular
signalling in microorganisms. Vibrio fischeri is a marine bacterium that can adopt alternative
physiological states in which bioluminescence is active or inactive. (Literally a ‘light
switch’.) The organism can live free in seawater or colonize the light organs of certain species
of fish or squid. It is bioluminescent only when growing within the animal.
What controls the switch? The bacteria respond to the local density of bacterial cells, a form of
communication called quorum sensing. In V. fischeri, quorum sensing is mediated by secretion and
detection of a small signalling molecule, N-(3-oxohexanoyl)-homoserine lactone. Related species
use other N-acyl homoserine lactones, abbreviated to AHL. AHL can diffuse freely out of the cells
in which it is synthesized. Within the light organs, culture densities can reach 10¹⁰–10¹¹ cells
ml⁻¹, and the AHL concentration can exceed the threshold of about 5–10 nM for flipping the
physiological switch.
Bacterial genes luxI and luxR govern the regulation. The product of luxI is involved in the
synthesis of AHL. The luxR gene product, LuxR, contains a membrane-bound domain, which detects
the AHL signal, and a transcriptional activator domain. LuxR activates an operon that includes
(1) genes for synthesis of luciferase (the enzyme responsible for the bioluminescence); and
(2) luxI, expression of which synthesizes additional AHL, amplifying the signal and sharpening
the transition.
The host also senses the bacteria: the light organs of squid grown in sterile salt water do not
develop properly. This appears to be a reaction to the luminescence, rather than to the AHL. For
the animal, the luminescence contributes to camouflage: disguise from predators at lower depths,
by blending with illumination from the sky. The masking of shadows is a natural form of
‘make-up’. (The bioluminescence also regularly surprises diners in seafood restaurants, who jump
to the conclusion that their glowing dinner is of extraterrestrial origin. However, most
bioluminescent bacteria are harmless, although some strains of the related Vibrio species,
V. cholerae, the causative agent of cholera, are weakly bioluminescent. In fact, the virulence of
V. cholerae is also under the control of quorum sensing, by a related mechanism.)

Statics and dynamics of networks

Networks have both static and dynamic aspects. The reaction sequences and enzyme names memorized
by generations of biochemistry students, and appearing on wall charts, mugs, etc., reflect the
static structure of the metabolic pathway network. The patterns of flow of metabolites through
the network, and the control of expression patterns of metabolic enzymes in response to changing
nutrient availability, are dynamic features.
Stability is an important goal of regulatory dynamics. Concentrations of metabolites in cells are
carefully controlled. Sometimes they remain roughly constant (steady-state conditions). Sometimes
they vary cyclically. Sources of metabolic stability include:
• constant rate of input (for instance, rhythmic breathing ensures a regular supply of oxygen);
• feedback inhibition of enzymes;
• allosteric control of activity of enzymes;
• turning proteins ‘on’ and ‘off’ by phosphorylation and dephosphorylation; and
• control of amounts of proteins by regulation of expression.
Robustness is another crucial feature of the dynamics of biological networks. Biological systems
need to be robust, both for survival of individuals under stress and for the plasticity required
for evolution.
• Under unchanging environmental conditions, an organism’s biochemical systems must be stable.
• Under rapidly changing conditions, the system must accommodate both neutral and stressful
perturbations. Internally driven short-term adjustments dampen out fluctuations and choreograph
programmes such as the cell cycle. Responses to external stimuli adjust to changes in the
composition or levels of nutrients or oxygen.
• Longer-term regulatory controls in an individual organism include changes of physiological
state, such as sporulation in bacteria, the course of our response to and recovery from viral
infections such as a cold, and the unfolding of developmental stages during a lifetime.
Populations respond to long-term changes in conditions by evolving.
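Feedback inhibition, listed above among the sources of metabolic stability, can be illustrated
with a toy model that is not taken from the text. A product X is made at a rate that falls as X
accumulates and is consumed in proportion to its concentration. The rate law, the quadratic
inhibition term, all parameter values, and the function name `steady_state` are illustrative
assumptions.

```python
def steady_state(supply: float, feedback: bool, k_inhibit: float = 1.0,
                 k_use: float = 1.0, dt: float = 0.01, steps: int = 20000) -> float:
    """Integrate dX/dt = production - k_use * X by forward Euler and return the final X.

    With feedback, production = supply / (1 + (X / k_inhibit)**2);
    without feedback, production = supply.
    """
    x = 0.0
    for _ in range(steps):
        if feedback:
            production = supply / (1.0 + (x / k_inhibit) ** 2)
        else:
            production = supply
        x += dt * (production - k_use * x)
    return x


for feedback in (False, True):
    low = steady_state(2.0, feedback)
    high = steady_state(20.0, feedback)
    print(f"feedback={feedback}: a tenfold rise in supply changes X by a factor of {high / low:.2f}")
```

In this sketch the concentration without feedback tracks the supply one for one, whereas with
feedback the same tenfold change in supply moves the steady-state level by a much smaller factor.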

Pictures of networks as graphs

We think of networks in terms of graphs (see Boxes 11.3 and 11.4). Any network contains a set of
elements called nodes, and edges connecting the nodes that stand for interactions. The familiar
map of the London Underground is a network, taking the stations as the nodes and the tracks
connecting the stations as edges. Two stations interact if they are connected by tracks on one or
more lines. Note that the modern London Underground map shows the topology of the network but
does not quantitatively represent the geography of the city. (An early map, from 1925, did
maintain geographical accuracy. This was possible when the system was simpler than it is now.)
Some of the maps now posted in the Paris Métro are fairly accurate geographically. Considered as
graphs, a geographically accurate map and a simplified map with the same edges, or connections,
correspond to the same network.

• Graphs are abstract representations of networks. They show the connectivity of the network.
Labelled graphs can show physical distances between nodes, or other properties of edges such as
throughput capacity.

BOX 11.3 The idea of a graph

• A graph consists of a set of vertices, V, and a set of edges, E. The vertices correspond to the
nodes of the network.
• Each edge is specified by a pair of vertices.
• In a directed graph, the edges are ordered pairs of vertices.
• In a labelled graph, there is a value associated with each edge. (A directed graph is a special
case of a labelled graph: consider the arrowheads as labels.)

An undirected unlabelled graph specifies the connectivity of a network but not the distances
between vertices (the topology but not the geometry, as in the modern London Underground map).
Optionally, labels on the edges can indicate distances. For example, some phylogenetic trees
indicate only the topology of the ancestor–descendant relationships. Others indicate
quantitatively the amount of divergence between species. Phylogenetic trees are often drawn with
the lengths of the branches indicating the time since the last common ancestor. This is a
pictorial device for labelling the edges.
Many graphs do not correspond to physical structures, and in any event edge labels need not
reflect geometry in the usual sense. For example, the links in a network of metabolic pathways
might be labelled to reflect flow patterns.

[Figure: the same six vertices, V1–V6, drawn three times: as a graph, as a directed graph, and as
a labelled graph.]

BOX 11.4 Examples of graphs

• Sets of people who have met each other
• Electricity distribution systems
• Phylogenetic trees
• Metabolic pathways
• Chemical bonding patterns in molecules
• Citation patterns in scientific literature
• The World Wide Web

A fundamental property of a network is its connectivity. If VA and VZ are vertices in a graph:
• a path from VA to VZ is a series of vertices: VA, VB, VC . . . VZ, such that an edge in the
graph connects each successive pair of vertices; the edges of a path in a directed graph must be
traversed in the proper direction (in city traffic, a path must obey designations of ‘one-way’
streets);
• the number of edges in the chain is called the length of the path; and
• a cycle is a path of length >2 for which the initial and final end points are the same, but in
which no intermediate link is repeated.
Sequences of consecutive metabolic reactions are pathways in a graph of metabolites. An
irreversible reaction corresponds to a directed edge. A concatenation of signal-transduction
events is a pathway in a regulatory network.

• A vitamin is a compound that we must eat because we cannot synthesize it. Therefore, there can
be no path in the metabolic network leading to a vitamin.

For some networks, such as metabolic pathways or patterns of traffic in cities, the dynamics of
the system depend on the transmission capacities of the individual links. These capacities can be
indicated as labels of the edges of the graph. This allows modelling of patterns of flow through
the network. Examples include route planning, in travel or deliveries. Note that the shortest
path may well not give optimal throughput. In many cities, taxi drivers are exquisitely sensitive
– and insensitively voluble – about currently optimal traffic paths.
A graph that contains a path between any two vertices is said to be connected. Alternatively, a
graph may split into several connected components. The graph on page 344 has two connected
components, one containing five vertices and one containing only one vertex. (In the extreme
case, a graph could contain many vertices but no edges at all.) It is often useful to determine
the shortest path between any two nodes, and to characterize a network by the distribution of
shortest path lengths. The phrase ‘six degrees of separation’ – the title of a play by John
Guare, made into a film – refers to the assertion (attributed originally to Marconi) that if the
people in the world are vertices of a graph and the graph contains an edge whenever two people
know each other, then the graph is connected and there is a path between any two vertices with
length ≤6.
The London Underground network is connected in that there is (usually) a route between any two
stations. Many questions familiar to commuters are shared in the analysis of biological networks;
for example: what are the paths connecting station A and station B? Regarding different lines as
subnetworks, how easy is it to transfer from one to another, i.e. what is the nature of the
patterns of connectivity? In case of failure of one or more links, is the network robust, i.e.
does it remain connected?

Trees

A tree is a special form of graph (see p. 182). A tree is a connected graph containing only one
path between each pair of vertices. A hierarchy is a tree: examples include military chains of
command and the Linnaean taxonomy. A tree cannot contain a cycle: if it did, there would be two
paths from the initial point (= the final point) to each intermediate point. In the undirected
graph on page 344, the subgraph consisting of vertices V1, V2, V4, V5, and V6 is a tree. Adding
an edge from V1 to V5 would create an alternative path from V1 to V5, and the cycle
V1→V2→V4→V5→V1; the graph would no longer be a tree.
The density of connections is the mean number of edges per vertex and characterizes the structure
of a graph. A fully connected graph contains an edge between every pair of nodes. A fully
connected graph of N vertices has N − 1 connections per vertex. A graph with no edges has
0 connections per node. Nervous systems of higher animals achieve their power not only by
containing large numbers of neurons but also by high degrees of connectivity.
Sometimes there are limits on numbers of connections. For many human societies, in the graph in
which individuals are the vertices and edges link people married to each other, each node has
connectivity 0 or 1. Hydrocarbon structures can be represented as graphs, with the hydrogen and
carbon atoms as vertices and the chemical bonds as the edges. The rules of valence require that
each node corresponding to a carbon atom has ≤4 connections.
In other networks, connectivities follow statistical regularities. For instance, the World Wide
Web can be considered to be a directed graph: individual documents are the nodes and hyperlinks
are the edges. The distribution of incoming and outgoing links follows power laws: P(k), the
probability of k edges, is proportional to k^(−q), where q = 2.1 for incoming links and q = 2.45
for outgoing links (see Box 11.5).
The density of connections is very important in defining the properties of a network. For instance,

the interactions that spread disease among humans and/or animals form a network. Whether a disease will cause an epidemic depends not only on the ease of transmission in any particular interaction but also on the density of connections. As the density of connections – the rate of interactions – increases, the system can exhibit a qualitative change in behaviour, analogous to a phase change in physical chemistry, from a situation in which the disease remains under control to an epidemic spreading through an entire population. The classic approach of ‘quarantine’ – isolating people for 40 days – works by cutting down the degree of connectivity of the disease-transmission network. Note that a carrier who shows no symptoms – ‘Typhoid Mary’* was a classic case – serves as a hub of the disease-transmission network.

Two historical epidemics associated with wars demonstrate the distinction between topology and geometry in network connectivity.

In the early years of the Peloponnesian War, Athens suffered a severe epidemic. (From Thucydides’ detailed description of the symptoms, the disease was probably bubonic plague.) A factor contributing to its transmission was the crowding of people into the city from the surrounding countryside, out of fear of greater vulnerability to military invasion.

After World War I, an epidemic of influenza killed ∼50–100 million people, more than died in the war itself. Long-distance travel by soldiers returning from the war helped spread the disease. Any epidemic needs an infectious agent and a high density of routes of transmission.

These examples show that the controlling factor is the density of the connections and not the density of the people.

BOX 11.5 ‘Small-world’ networks

Many observed networks, including biological networks, the World Wide Web, and electric power distribution grids, have the characteristics of high clustering and short path lengths. They include relatively few nodes with very large numbers of connections, called ‘hubs’, and many nodes with few connections. These combine to produce short path lengths between all nodes. From this feature, they are called ‘small-world networks’. Such networks tend to be fairly robust, staying connected after failure of random nodes. Failure of a hub would be disastrous but is unlikely, because there are few hubs.

Many networks, notably the World Wide Web, are continuously adding nodes. The connectivity distribution tends to remain fairly constant as the network grows. These are called ‘scale-free’ networks.
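The graph quantities used in this section (connected components, shortest path lengths, and the mean number of connections per vertex) can be computed in a few lines. The following Python sketch is illustrative only. The edge list is an assumption: it contains the edges V1–V2, V2–V4, and V4–V5 implied by the cycle mentioned in the text, plus a hypothetical edge V4–V6, with V3 left isolated so that the graph has one component of five vertices and one of a single vertex.

```python
from collections import deque

def connected_components(adj):
    """Group vertices into connected components by breadth-first search."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components

def shortest_path_lengths(adj, source):
    """Number of edges on the shortest path from source to each reachable vertex."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def mean_connections(adj):
    """Mean number of connections per vertex (N - 1 for a fully connected graph)."""
    return sum(len(neighbours) for neighbours in adj.values()) / len(adj)

# Hypothetical example graph (see the assumptions stated above).
vertices = ["V1", "V2", "V3", "V4", "V5", "V6"]
edges = [("V1", "V2"), ("V2", "V4"), ("V4", "V5"), ("V4", "V6")]
adj = {v: set() for v in vertices}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

component_sizes = sorted(len(c) for c in connected_components(adj))
distances_from_v1 = shortest_path_lengths(adj, "V1")
density = mean_connections(adj)
```

Breadth-first search is used because, in an unweighted graph, it visits vertices in order of increasing distance from the source. Vertices in a different component (here V3) simply never appear in the distance table.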

Sources of ideas for systems biology

Just as molecular biology calls upon chemistry, systems biology calls upon mathematics for help in making quantitative the general ideas about network properties described in the preceding sections.

Several related ideas are important in coping with the static and dynamic aspects of systems biology. These include complexity, entropy, randomness, redundancy, robustness, predictability, and chaos. We deal with these in our daily lives, but without the need to define them precisely and quantitatively. How well do we really understand these concepts? What are the relationships among them? And how can they be used to illuminate biology in general and systems biology in particular?

* Mary Mallon (1869–1938) presented the following unfortunate combination of features: (1) she was infected with typhoid; (2) she did not show symptoms; and (3) she worked for many families as a cook.

Complexity of sequences

The simplest complex object in biology is a sequence. We have all heard of random sequences and probably agree that the more random the sequence, the more complex it is. For example, genomic sequences contain ‘low-complexity’ regions. In the human genome, such regions include simple repeats, or microsatellites, or regions of highly skewed nucleotide composition such as AT-rich or GC-rich regions, or polypurine or

polypyrimidine stretches. Are these regions more or less random than a region containing a gene that encodes a specific protein? How can such properties of sequences be measured?

Here is an approach to such questions. Take a sequence of characters:

AGTCTCTA . . . , or AATAAAAATAAA . . . , or ABZXUVJFLT. . . .

What determines the amount of information needed to specify the next character in each sequence? Less information is required if the set of possible characters – A, T, G, C – is very small or if the distribution is very skewed – AATAAAAATAAA – than if the set is very large and the ratio of different characters is more even.

How can we make this quantitative? Genomic sequences are limited to characters A, T, G, and C. To identify each symbol, it is enough to ask two ‘yes-or-no’ questions. For instance:

1. Is it a purine (or a pyrimidine)? (Purine implies it is A or G.)
2. Is it 6-amino (or 6-keto)? (6-Amino implies it is A or C.)

Knowing the answers to these two questions is enough for us to identify one of the four bases uniquely. Representing yes = 1 and no = 0, each ‘yes-or-no’ question provides 1 binary digit, or one bit of information. We could encode each nucleotide of a genome sequence as a two-bit binary string.

To identify a character of the ordinary alphabet – abcd . . . z – requires more than two yes-or-no questions. It is therefore reasonable to think that a character string of full text is more complex than a genomic sequence of the same length containing only characters A, T, G, and C.

Questions of how much information is needed to specify an amino acid appear in the genetic code itself. How many nucleotides are required to encode 20 amino acids? If each position in a gene can contain one of four nucleotides, then there are only 16 possible dinucleotides – not enough. So, if the same number of nucleotides is to be required for each amino acid, there must be at least three nucleotides per codon, as observed. As there are only 20 amino acids, the triplet code contains redundancy.

Why not make do with fewer amino acids? If 15 amino acids (plus a stop signal) – not unreasonable – would suffice, then from the information point of view, a doublet code would be possible. However, a two-base codon/two-base anticodon interaction would probably not have adequate stability.

Shannon’s definition of entropy

In 1948, C.E. Shannon introduced the concept of entropy into information theory, as part of his analysis of signal transmission. Suppose a text contains symbols with relative probability $p_i$. Shannon’s measure of entropy is:

$$H = -\sum_i p_i \log_2 p_i$$

The Shannon entropy, $H$, can be interpreted as the minimum average number of bits per symbol required to transmit the sequence.

For example, for a genomic sequence with equimolar base composition ($p_A = p_T = p_G = p_C = 0.25$),

$$H = -\sum_i p_i \log_2 p_i = -[0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25] = 2$$

($\log_2 0.25 = -2$.)

The result $H = 2$ for the gene sequence with equimolar base composition recovers our informal result that two bits, or two ‘yes-or-no’ questions, are required. For a sequence limited to two equiprobable characters A and T: $p_A = p_T = 0.5$, $H = -[0.5 \log_2 0.5 + 0.5 \log_2 0.5] = 1$. This also makes sense, because, knowing that the only choices are A and T, we can decide which it is with one ‘yes-or-no’ question, or one bit.

Suppose that a sequence is known to have the skewed nucleotide composition $p_A = p_T = 0.42$, and $p_G = p_C = 0.08$. Then:

$$H = -[0.42 \log_2 0.42 + 0.42 \log_2 0.42 + 0.08 \log_2 0.08 + 0.08 \log_2 0.08] = 1.63$$

What is the significance of the fact that the value $H = 1.63$ is less than 2? The uncertainty in each transmitted symbol is not complete – it is more likely to be A or T than G or C. In principle, we can use this knowledge to improve the coding efficiency.
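The three entropy values worked out above can be checked numerically. The short Python sketch below is an illustration, not part of the original treatment. The function names `shannon_entropy` and `sequence_entropy` are hypothetical, and estimating the probabilities of a sequence from its observed character frequencies is an added assumption.

```python
import math
from collections import Counter

def shannon_entropy(probabilities):
    """H = -sum(p_i * log2(p_i)), in bits per symbol; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def sequence_entropy(sequence):
    """Entropy of a sequence, taking observed character frequencies as the p_i."""
    counts = Counter(sequence)
    total = len(sequence)
    return shannon_entropy(count / total for count in counts.values())

# The three compositions discussed in the text.
h_equimolar = shannon_entropy([0.25, 0.25, 0.25, 0.25])   # four equiprobable bases
h_two_bases = shannon_entropy([0.5, 0.5])                  # only A and T, equiprobable
h_skewed = shannon_entropy([0.42, 0.42, 0.08, 0.08])       # AT-rich composition

# A skewed example sequence from the text: mostly A, occasionally T.
h_example = sequence_entropy("AATAAAAATAAA")
```

Note that this measure depends only on the composition of the sequence, not on the order of its characters. It treats each symbol independently, so any shuffling of AATAAAAATAAA gives the same value.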

The Morse code for telegraphy took such advantage of unequal letter distribution frequencies to encode common letters with short sequences and uncommon letters with longer ones. For instance E = dot (length one) and J = dot-dash-dash-dash (length four).

Depending on the content of the message, the efficiency of transmission depends on how it is encoded. A message containing repetitions – GACGACGACGACGAC . . . – would not have particularly low entropy if encoded one nucleotide at a time, but an entropy of 0 if encoded one triplet at a time. Compression algorithms are sensitive to such nuances and optimize the encoding.

Conversely, looking at distributions of oligonucleotides (dinucleotides, triplets, etc.) is a useful way of detecting biologically significant patterns. Codon-usage patterns in protein-coding regions are examples. Some algorithms for gene identification make use of biases in coding regions of frequencies of hexanucleotides.

Although the actual genetic code does not achieve the theoretical efficiency that entropy calculations suggest, and indeed there does not even seem to be selection for reduction in the size of non-viral genomes, it is clear that the redundancy in the genetic code has biological significance. Many single-base mutations are silent. Conservative mutations allow proteins to evolve with small non-lethal changes that, cumulatively, can achieve large changes in structure and function. And of course the redundancy in having two copies of the genetic information in two strands of DNA is used to detect and correct errors in replication, and to repair DNA damage.

Shannon entropy is linked with thermodynamic entropy through the general notion of disorder or randomness. The relationship has been explored by physicists, including J.C. Maxwell and L. Szilard, in their discussions of ‘Maxwell’s demon’, and by E.T. Jaynes.

Randomness of sequences

The Shannon entropy of sequences is related to the idea of randomness, another concept that we know from everyday life without worrying too much about exactly what it means. A.N. Kolmogorov defined the complexity of a sequence of numbers as the length of the shortest computer program that can reproduce the sequence. Thus the sequence 0,0,0,0,0,0,0 . . . is far from random, as it is the output of the very short program:

Step 1: print 0.
Step 2: go back to step 1.

Periodic sequences, such as:

Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, Monday, . . .

are also of low complexity. In contrast, a truly random sequence has no description shorter than the sequence itself.

The relationship between complexity, randomness, and compressibility

One way to shorten the specification of a non-random sequence is to compress it. We all use compression algorithms on our files to save disk space. If a sequence is truly random, in the sense of Kolmogorov, it cannot be compressed. By definition, non-random sequences can be compressed.

One basic principle of compression is that: if you can predict what is coming next, you can compress effectively.

The reason that sequences such as 0,0,0,0, . . . and Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, Monday, . . . are so effectively compressible is that it is simple to decide what the successor of any element is. Even sequences for which it is not possible to decide unambiguously what the next element is can be compressed if some indications are available. It is not even necessary that the rules be supplied ‘up front’ as they can be for sequences such as 0,0,0,0, . . . and Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, Monday, . . . The rules and statistics of prediction of a successor can be generated on the fly from the incoming data. The rule, ‘the weather on some day is likely to be the same as the weather the day before’ would – in most places – be good enough for effective compression of a series of weather reports.

Putting together these considerations suggests a general idea that the harder it is to predict the contents of a data set from a subset of the data, the more complex the data set is.

The relationships among complexity, predictability, and compressibility, which we have so far described for character strings, apply to the static structures of other types of object, including images,

three-dimensional structures, and – especially – networks. Indeed, most types of biological information can be regarded as networks. For instance, a nucleotide sequence is equivalent to a network in which the individual bases are the nodes, and each base is connected by a directed edge pointing to the next base. That’s a perfectly proper graph! Conversely, recognizing that sequences are networks can usefully lead us to ask – can we define analogues of sequence alignment for more general networks? (Yes, we can.)

In biology, we are also interested in the complexity of processes.

Static and dynamic complexity

One dimension of complexity is time. Is it possible to distinguish static from dynamic complexity? If we could define and measure the static complexity of a system, this would provide an approach to dynamic complexity: we could ask how the static complexity of a system changes with time. Such changes appear to be governed by at least some general rules. If you stop someone on the street, they might well say that in closed systems the laws of thermodynamics require that complexity always increases in natural processes. Other passers-by might say that the solar system is structurally complex but, ignoring tidal effects, dynamically simple. Will these statements hold up to rigorous analysis?

Within classical Newtonian mechanics, we could base an analysis of dynamic complexity on the definition and description of the trajectories of a system of particles. From the Kolmogorov point of view, the initial positions and velocities of the particles, knowledge of the forces between them, and Newton’s laws of motion together provide a concise description of the dynamics of such a system.

However, even within the framework of classical dynamics, this concise description can break down in the case of chaotic states. In chaotic states, very small changes in the initial conditions can lead to very large changes in the ensuing trajectories. Prediction of the dynamics requires very precise statement of the initial conditions and very precise knowledge of the forces. Specification of the information required to describe the dynamics cannot in these cases be concise. Chaos is an extreme form of dynamic complexity.

Another way to look at this is directly relevant to systems biology: the dynamics of non-chaotic systems are robust to small changes in initial conditions, but the dynamics of chaotic systems are not robust to small changes in initial conditions.

Chaos and predictability

The discovery of the laws of mechanics in the 17th century – Newton’s Principia was published in 1687 – gave rise to the hope that the dynamics of the solar system in particular (and much, if not all, of the universe in general) was predictable. Laplace expressed the view that:

If we can imagine a consciousness great enough to know the exact locations and velocities of all the objects in the universe at the present instant, as well as all forces, then there could be no secrets from this consciousness. It could calculate anything about the past or future from the laws of cause and effect.

Leaving aside philosophical questions of the implications about free will and responsibility, there are also issues of computability. How much information do we really need, and how accurately do we need it, to predict the dynamics of the solar system? The weather? The universe? In chaotic systems, accurate prediction of the dynamic development requires unachievably accurate knowledge of the initial conditions (to the point where Heisenberg’s uncertainty principle killed off Laplace’s hope of perfect determinism).

It is true that in classical mechanics even chaotic systems are subject to Poincaré’s recurrence principle: any system of particles held at fixed total energy will eventually return arbitrarily closely to any set of initial positions and velocities. (What rescues the second law of thermodynamics is that the closer the reapproach demanded, the longer the time required, i.e. the rarer the fluctuations that achieve the recurrence.) However, knowing that the configuration will recur does not simplify the calculation of the trajectories of the particles.

Through unpredictability, chaotic dynamics is associated with complexity. However, chaotic dynamics is not entirely incompatible with order and even the ‘spontaneous’ generation of order. In governing the time course of evolution of a system, chaotic dynamics does sometimes produce stable states or approximations to stable states – these are called attractors.
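The contrast between robust and chaotic dynamics can be made concrete with a toy system. The sketch below uses the logistic map, x → r·x·(1 − x). This is a standard textbook example that is not discussed in the original, so the choice of map, the parameter values, and the function name `logistic_trajectory` are all assumptions made for illustration. Two trajectories are started 10⁻¹⁰ apart, once in a non-chaotic regime (r = 2.5) and once in a chaotic regime (r = 4).

```python
def logistic_trajectory(x0, r, steps):
    """Iterate x -> r * x * (1 - x) and return the list of visited values."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def max_separation(r, x0=0.2, delta=1e-10, steps=60):
    """Largest gap between two trajectories whose starting points differ by delta."""
    a = logistic_trajectory(x0, r, steps)
    b = logistic_trajectory(x0 + delta, r, steps)
    return max(abs(x - y) for x, y in zip(a, b))

# Non-chaotic regime: both trajectories settle onto the same point attractor, 1 - 1/r.
stable_gap = max_separation(r=2.5)
stable_end = logistic_trajectory(0.2, 2.5, 60)[-1]

# Chaotic regime: the initially tiny difference is amplified at each step.
chaotic_gap = max_separation(r=4.0)
```

In the non-chaotic case the gap stays of the order of the initial difference and the trajectory ends at 0.6. In the chaotic case the gap grows until the two trajectories are effectively unrelated, even though the rule generating them is a one-line program.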

Attractors are sometimes unique points; in other cases they are periodic and/or localized states. There have been examples of apparent generation of order in model systems evolving ‘at the edge of chaos’.

There are even examples of static or structural order in chaotic systems. Many sequences associated with chaotic behaviour have a fractal structure. This means that if an object is dissected into parts, the parts have a structure similar to that of the whole (as well as to one another). B. Mandelbrot has produced many beautiful images. This self-similarity at different scales implies that if we know part of such a structure we can predict a larger segment of it. This should trigger the idea that predictability should permit compressibility and effectively reduce complexity. Indeed, such internal structural relationships have been applied to compression. Fractal image compression is an effective tool for reducing the size of images to a form from which the recovered image is not exactly the same as the starting image but perceptually equivalent.

Fractal structures in biology include branching patterns of plants and of the circulatory systems of vertebrates. At the molecular level, the storage polysaccharide glycogen has features of a fractal structure.

Computational complexity

Perhaps the best-developed area of analysis of the complexity of processes comes from studies of the complexities of computational problems.

An algorithm in computer science defines a process for solving a computational problem. For some problems, the execution time required to solve them is directly proportional to the size of the problem. These problems are said to be of order $O(N)$ (read ‘Oh-N’). For instance, searching for a number in an unsorted table requires an execution time proportional to the length $N$ of the table, $O(N)$. For some problems, the execution time increases only as $N \log N$. Sorting a list is an $O(N \log N)$ problem. For some problems, the execution time increases as $N^2$, $N^3$, . . . The alignment of two sequences by dynamic programming (see Chapter 3) is an $O(N^2)$ problem. These are said to be polynomial-time problems. Still other problems have even greater time demands. Enumerating all subsets of a set containing $N$ members is $O(2^N)$.

Computer scientists define the complexity of a problem in terms of the dependence of execution time on problem size (see Box 11.6).

BOX 11.6 Classes P and NP

A problem that can be solved in polynomial time is said to be in class P. $O(N \log N)$ algorithms are faster than $O(N^2)$ and are, therefore, in class P.

Suppose, however, that the optimal algorithm to solve a problem may have order worse than polynomial – for instance, it might have exponential order $O(2^N)$ – but that if you propose a solution, it can be checked in polynomial time. Such a problem is said to be of class NP. (NP does not stand for non-polynomial, but for non-deterministic polynomial, referring to a different model for the computation. Don’t worry about this technical distinction.)

Consider the problem of sorting a list of numbers into order. That is, given a series of N numbers: 2,1,7,5,8,4,3, . . . an algorithm must produce as output the numbers rearranged into order: 1,2,3,4,5,7,8, . . . Whatever the order of the optimal algorithm that solves the problem, an algorithm to verify that 1,2,3,4,5,7,8, . . . is a solution (or that 1,8,7,2,4,5,3, . . . , is not a solution) can run in time linear in the length of the list. It is necessary only to check that each number is greater than or equal to its predecessor, which can be done by looking at each element of the list once. Therefore, sorting a list of numbers into order is a problem in class NP. (Sorting also happens to be in class P; sorting algorithms are known with order $O(N \log N)$.)

NP-complete problems. Does P = NP?

Many NP problems have equivalent complexities, in the sense that if a polynomial algorithm were discovered for one, it could be applied to solve others. The set of NP-complete problems is the set of NP problems such that if we could solve any one of them in polynomial time, we would be able to solve all of them in polynomial time. In other words, the discovery of a polynomial-time algorithm for any problem known to be NP-complete would cause the classes P and NP to coalesce. But are there any NP problems that are not in class P? This is the famous unsolved question of computer science: does P = NP?
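The difference between checking a proposed answer and enumerating possibilities can be seen directly in code. The following Python sketch is illustrative only, and the function names `verify_sorted` and `all_subsets` are hypothetical. It verifies the two candidate orderings quoted in the box with a single pass through the list, counting comparisons, and separately enumerates all subsets of a small set to show the $2^N$ growth mentioned in the main text.

```python
from itertools import combinations

def verify_sorted(numbers):
    """Check a proposed ordering in one pass; returns (is_sorted, comparisons_made)."""
    comparisons = 0
    for previous, current in zip(numbers, numbers[1:]):
        comparisons += 1
        if current < previous:
            return False, comparisons   # one out-of-order pair is enough to reject
    return True, comparisons            # at most N - 1 comparisons in total

def all_subsets(items):
    """Enumerate every subset of items; there are 2**N of them."""
    items = list(items)
    return [subset
            for size in range(len(items) + 1)
            for subset in combinations(items, size)]

unsorted_input = [2, 1, 7, 5, 8, 4, 3]
good_candidate = [1, 2, 3, 4, 5, 7, 8]
bad_candidate = [1, 8, 7, 2, 4, 5, 3]

good_result = verify_sorted(good_candidate)
bad_result = verify_sorted(bad_candidate)
solved = sorted(unsorted_input)          # Python's built-in sort is O(N log N)
subset_count = len(all_subsets(range(10)))
```

The verification never needs more than N − 1 comparisons, whatever method produced the candidate. Adding one element to the set, by contrast, doubles the number of subsets to enumerate.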

What is the relationship between computational complexity and our notions of the complexity associated with entropy, randomness, and predictability? Think of an algorithm as operating on a set of input data. The algorithm might extract information from the data, as in a program that solves an equation. Or the algorithm might modify the data, as in sorting a list of numbers. The successive steps of the algorithm leave a trace as a sequence of intermediate steps. We can analyse the complexity of the trace, just as we do any other sequences, including genome sequences.

The theory of computational complexity places general limits on the efficiency of computations, independent of the nature of the hardware. In principle, these limits constrain cells as much as they do human programmers. Cells do lots of computations – for instance, how much of each gene should be transcribed at the moment. Now, the classical theory of computational complexity applies to traditional computer architectures, in which successive operations are executed one at a time. The inherent complexity of many biological computations implies that cells could not use this organization in their calculations. This is an inherent constraint on the design of living regulatory systems. In fact, many computations within cells operate as parallel processes. These are not subject to the constraints derived from the classical theory of computational complexity.

Computer scientists are extending the theory of complexity to alternative computer architectures. The comparison of the constraints imposed by different computer architectures affords insight into how cells organize different calculations.

The metabolome

Classification and assignment of protein function

Proteins have a very wide variety of functions. In systems biology, there are two particular classes of function that form dynamic networks: (a) the enzymes that run the biochemistry of the cell, and (b) the regulatory networks that exercise control to provide stability and robustness.

The Enzyme Commission

The first detailed classification of protein functions was that of the Enzyme Commission (EC). In 1955, the General Assembly of the International Union of Biochemistry (IUB), in consultation with the International Union of Pure and Applied Chemistry (IUPAC), established an International Commission on Enzymes, to systematize nomenclature. The Enzyme Commission published its classification scheme, first on paper and now on the web: http://www.chem.qmul.ac.uk/iubmb/enzyme/.

EC numbers (looking suspiciously like IP numbers) contain four numeric fields, corresponding to a four-level hierarchy. For example, EC 1.1.1.1 corresponds to the reaction:

an alcohol + NAD = the corresponding aldehyde or ketone + NADH2

Note that several reactions, involving different alcohols, would share this number (whether or not the same enzyme catalysed them); but that the same dehydrogenation of one of these alcohols by an enzyme using the alternative cofactor NADP would not. It would be assigned EC 1.1.1.2.

The first field in an EC number indicates to which of the six main divisions (classes) the enzyme belongs:

Class 1. Oxidoreductases
Class 2. Transferases
Class 3. Hydrolases
Class 4. Lyases
Class 5. Isomerases
Class 6. Ligases

The significance of the second and third numbers depends on the class. For oxidoreductases the second number describes the substrate and the third number the acceptor. For transferases, the second number describes the class of item transferred, and the third number describes either more specifically what they transfer or in some cases the acceptor. For hydrolases, the second number signifies the kind of bond cleaved (e.g. an ester bond) and the third number the molecular context (e.g. a carboxylic ester or a thiolester). (Proteinases, a type of hydrolase, are treated slightly

differently, with the third number including the mechanism: serine proteinases, thiol proteinases, and acid proteinases are classified separately.) For lyases the second number signifies the kind of bond formed (e.g. C–C or C–O), and the third number the specific molecular context. For isomerases, the second number indicates the type of reaction and the third number the specific class of reaction. For ligases, the second number indicates the type of bond formed and the third number the type of molecule in which it appears. For example, EC 6.1 is for C–O bonds (enzymes acylating tRNA), EC 6.2 for C–S bonds (acyl-CoA derivatives), etc. The fourth number gives the specific enzymatic activity.

The Enzyme Structures Database at PDBe links Enzyme Commission numbers to proteins of known structure (http://www.ebi.ac.uk/thornton-srv/databases/enzymes/).

The Gene Ontology™ Consortium protein function classification

In 1999, Michael Ashburner and many co-workers faced the problem of annotating the soon-to-be-completed Drosophila melanogaster genome sequence. As a classification of function, the EC classification was unsatisfactory, if only because it was limited to enzymes. Ashburner organized the Gene Ontology™ Consortium to produce a standardized scheme for describing function.

• An ontology is a formal set of well-defined terms with well-defined interrelationships; that is, a dictionary and rules of syntax.

The Gene Ontology™ Consortium (http://www.geneontology.org) has produced a systematic classification of gene function, in the form of a dictionary of terms, and their relationships.

Organizing concepts of the Gene Ontology project include three categories:

• Molecular function: a function associated with what an individual protein or RNA molecule does in itself; either a general description such as enzyme, or a specific one such as alcohol dehydrogenase (specifying a catalytic activity, not a protein). This is function from the biochemist’s point of view.
• Biological process: a component of the activities of a living system, mediated by a protein or RNA, possibly in concert with other proteins or RNA molecules; either a general term such as signal transduction, or a particular one such as cyclic AMP synthesis. This is function from the cell’s point of view.

Because many processes are dependent on location, Gene Ontology (GO) also tracks:

• Cellular component: the assignment of site of activity or partners; this can be a general term such as nucleus or a specific one such as ribosome.

Figure 11.1 shows an example of the GO classification.

Neither the EC nor the GO classification is an assignment of function to individual proteins. The EC emphasized that: ‘It is perhaps worth noting, as it has been a matter of long-standing confusion, that enzyme nomenclature is primarily a matter of naming reactions catalysed, not the structures of the proteins that catalyse them’. (See http://www.chem.qmul.ac.uk/iubmb/nomenclature/.)

Assigning EC or GO numbers to proteins is a separate task. Such assignments appear in protein databases such as UniProtKB.

Comparison of Enzyme Commission and Gene Ontology classifications

Enzyme Commission identifiers form a strict four-level hierarchy, or tree. For example, isopentenyl-diphosphate Δ-isomerase is assigned EC number 5.3.3.2. The initial 5 specifies the most general category, 5 = isomerases, 5.3 comprises intramolecular isomerases, 5.3.3 those enzymes that transpose C=C bonds, and the full identifier 5.3.3.2 specifies the particular reaction. In the molecular function ontology, GO assigns the identifier 0004452 to isopentenyl-diphosphate Δ-isomerase. (The numerical GO identifiers themselves have no interpretable significance.)

Figure 11.2 compares the EC and GO classifications of isopentenyl-diphosphate Δ-isomerase. The figure shows a path from GO:0004452 to the root node of the molecular function graph, GO:0003674. In this case there are four intervening nodes, progressively more general categories as we move up the figure. Note that the GO description of this enzyme as an oxidoreductase is inconsistent with the EC classification, in which a committed choice between oxidoreductase and isomerase must be made at the highest level of the EC hierarchy.
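The structural difference between the two schemes, a strict tree for EC numbers and a graph with possibly several routes between terms for GO, can be sketched in code. The Python example below is a minimal illustration under stated assumptions. The function names `ec_lineage` and `paths_to_ancestor` are hypothetical, and the small GO fragment contains only the four terms and four parent links described in the caption of Figure 11.1 (two routes from enzyme to ATP-dependent helicase). It does not query any real EC or GO database.

```python
EC_CLASSES = {
    1: "Oxidoreductases", 2: "Transferases", 3: "Hydrolases",
    4: "Lyases", 5: "Isomerases", 6: "Ligases",
}

def ec_lineage(ec_number):
    """Split an EC number into its chain of increasingly specific prefixes."""
    fields = ec_number.split(".")
    if len(fields) != 4 or not all(field.isdigit() for field in fields):
        raise ValueError(f"expected four numeric fields, got {ec_number!r}")
    return [".".join(fields[:depth]) for depth in range(1, 5)]

def paths_to_ancestor(term, ancestor, parents):
    """List every path from term up to ancestor in a graph given as term -> parents."""
    if term == ancestor:
        return [[term]]
    return [[term] + rest
            for parent in parents[term]
            for rest in paths_to_ancestor(parent, ancestor, parents)]

# Illustrative fragment only: the parent links named in the Figure 11.1 caption.
GO_PARENTS = {
    "ATP-dependent helicase": ["helicase", "adenosine triphosphatase"],
    "helicase": ["enzyme"],
    "adenosine triphosphatase": ["enzyme"],
    "enzyme": [],
}

lineage = ec_lineage("5.3.3.2")
top_level_class = EC_CLASSES[int(lineage[0])]
go_paths = paths_to_ancestor("ATP-dependent helicase", "enzyme", GO_PARENTS)
```

Each EC number has exactly one lineage because every field has a single parent prefix. In the GO fragment the same term is reached by two paths, which is why the GO graphs are not trees.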

[Figure 11.1, panel (a), molecular function. Node labels: molecular function; nucleic acid binding; DNA binding; chromatin binding; lamin/chromatin binding; enzyme; helicase; DNA helicase; ATP-dependent helicase; ATP-dependent DNA helicase; adenosine triphosphatase; DNA-dependent adenosine triphosphatase.]

[Figure 11.1, panel (b), DNA metabolism. Node labels: DNA metabolism; DNA degradation; DNA packaging; DNA replication; DNA repair; DNA recombination; mitochondrial genome maintenance; DNA-dependent DNA replication; mitochondrial DNA-dependent DNA replication; pre-replication complex formation and maintenance; DNA unwinding; DNA priming; DNA initiation; DNA strand elongation; DNA ligation; lagging strand elongation; leading strand elongation.]

Figure 11.1 Selected portions of the three categories of Gene Ontology, showing classifications of functions of proteins that interact
with DNA.
(a) Molecular function: including general DNA binding by proteins, and enzymatic manipulations of DNA.
(b) Biological process: DNA metabolism.
(c) Cellular component: Different places within the cell.
These pictures illustrate the general structure of the Gene Ontology classification. Each term describing a function is a node in a graph.
Each node has one or more parents and may have one or more descendants: arrows indicate direct ancestor–descendant relationships.
A path in the graph is a succession of nodes, each node the parent of the next. Nodes can have ‘grandparents’, and more remote
ancestors.
Unlike the EC hierarchy, the Gene Ontology graphs are not trees in the technical sense, because there can be more than one path
from an ancestor to a descendant. For example, there are two paths in (a) from enzyme to ATP-dependent helicase. Along one path
helicase is the intermediate node. Along the other path adenosine triphosphatase is the intermediate node.
Although the nodes are shown on discrete levels to clarify the structure of the graph, all the nodes on any given level do not
necessarily have a common degree of significance; unlike family, genus, and species levels in the Linnaean taxonomic tree, or the ranks
in military, industrial, academic, etc. organizations. GO terms could not have such a common degree of significance, given that there
can be multiple paths, of different lengths, between different nodes.
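The difference between a tree and the Gene Ontology graph can be checked mechanically by counting paths. The following is a minimal sketch in Python, assuming only the four molecular-function terms named in the caption; the dictionary `PARENTS` and the function `count_paths` are illustrative names, not part of any Gene Ontology software.

```python
from functools import lru_cache

# Child -> parents, transcribed from the terms named in the caption of
# Figure 11.1. This is a hand-copied fragment, not the real GO data.
PARENTS = {
    "enzyme": ["molecular function"],
    "helicase": ["enzyme"],
    "adenosine triphosphatase": ["enzyme"],
    "ATP-dependent helicase": ["helicase", "adenosine triphosphatase"],
}


def count_paths(ancestor, descendant):
    """Count the distinct paths leading from descendant up to ancestor."""

    @lru_cache(maxsize=None)
    def walk(node):
        if node == ancestor:
            return 1
        # Sum over every parent; a node with no parents contributes 0.
        return sum(walk(parent) for parent in PARENTS.get(node, []))

    return walk(descendant)


print(count_paths("enzyme", "ATP-dependent helicase"))
print(count_paths("molecular function", "helicase"))
```

In a tree every count would be 0 or 1. A count greater than 1, as for enzyme and ATP-dependent helicase here, is what makes the structure a more general directed acyclic graph.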
[Figure 11.1 (continued), panel (c): graph diagram, not reproduced here.
(c) Cellular component graph. Nodes: Cell; Cytoplasm; Nucleus; Nucleolus; Nucleoplasm; Nuclear membrane; Replication fork; Pre-replicative complex; α DNA polymerase:primase complex; δ DNA polymerase complex; DNA replication factor A complex; DNA replication factor C complex; Origin recognition complex.]

Metabolic networks

Metabolism is the flow of molecules and energy through pathways of chemical reactions. Substrates of metabolic reactions can be macromolecules – proteins and nucleic acids – as well as small compounds such as amino acids and sugars.
The full panoply of metabolic reactions forms a complex network. The structure of the network corresponds to a graph in which metabolites are the nodes and the substrate and product of each reaction define an edge in the graph. The dynamics of the network depend on the flow capacities of all of the individual links, analogous to traffic patterns on the streets of a city.
Some patterns within the metabolic network are linear pathways. Others form closed loops, such as the tricarboxylic acid (Krebs) cycle. Many pathways are highly branched and interlock densely. However, metabolic networks also contain recognizable clusters or blocks, for instance catabolic and anabolic reactions. There is a relatively high density of internal connections within clusters and relatively few connections between them.

Enzyme Commission (EC)                                  Gene Ontology (GO)
(no corresponding entry)                                molecular_function, GO:0003674
catalytic activity, EC: x.x.x.x                         catalytic activity, GO:0003824
isomerases, EC: 5.x.x.x                                 isomerase activity, GO:0016853
intramolecular isomerases, EC: 5.3.x.x                  intramolecular oxidoreductase activity, GO:0016860
enzymes that transpose C=C bonds, EC: 5.3.3.x           intramolecular oxidoreductase activity, transposing C=C bonds, GO:0016863
isopentenyl-diphosphate Δ-isomerase activity,           isopentenyl-diphosphate Δ-isomerase activity,
EC: 5.3.3.2                                             GO:0004452

Figure 11.2 Comparison of Enzyme Commission and Gene Ontology classifications of isopentenyl-diphosphate Δ-isomerase.
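The description of a metabolic network as a graph, with metabolites as nodes and one edge per substrate–product pair, translates directly into a small data structure. The sketch below is illustrative only: the reaction list is a hand-picked toy (the first two steps of glycolysis and a heavily abbreviated stand-in for the tricarboxylic acid cycle), and the function names are assumptions of this example rather than the interface of any pathway database.

```python
from collections import deque

# (substrate, product) pairs. The cycle is deliberately shortened and is
# only a stand-in for the real tricarboxylic acid cycle.
REACTIONS = [
    ("glucose", "glucose-6-P"),
    ("glucose-6-P", "fructose-6-P"),
    ("oxaloacetate", "citrate"),
    ("citrate", "2-oxoglutarate"),
    ("2-oxoglutarate", "succinate"),
    ("succinate", "oxaloacetate"),
]


def build_graph(reactions):
    """Return a dict mapping each metabolite to the set of its products."""
    graph = {}
    for substrate, product in reactions:
        graph.setdefault(substrate, set()).add(product)
        graph.setdefault(product, set())  # end products are nodes too
    return graph


def reachable(graph, start):
    """Metabolites that can be formed from start by one or more reactions."""
    seen = set()
    queue = deque(graph[start])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph[node])
    return seen


def on_cycle(graph, metabolite):
    """True if the metabolite can be regenerated from itself (closed loop)."""
    return metabolite in reachable(graph, metabolite)


graph = build_graph(REACTIONS)
print(sorted(reachable(graph, "glucose")))
print(on_cycle(graph, "citrate"), on_cycle(graph, "glucose"))
```

A linear pathway shows up as a chain with no node on a cycle; a closed loop shows up as a set of nodes that are each reachable from themselves. Nothing in this static picture says how fast material flows along each edge.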
Databases of metabolic pathways
Biochemists have learned a lot about different enzymes in different species. Approximately an eighth of the sequences in the UniProt database are enzymes (over 2 million sequences in all). They come from ∼200 000 species. They represent ∼100 000 different enzymatic activities.
Databases organize this information, collecting it within a coherent and logical structure, with links to other databases that provide different data selections and different modes of organization. EcoCyc deals with E. coli. It is the model for – and linked with – numerous parallel databases, with uniform web interfaces, treating other organisms. BioCyc is the ‘umbrella’ collection. KEGG, the Kyoto Encyclopedia of Genes and Genomes, contains information from multiple organisms. WIT contains metabolic reconstructions derived from genome sequences (Table 11.1).

Table 11.1 Databases of metabolic pathways

Database    Home page
EcoCyc      http://ecocyc.org
BioCyc      http://www.biocyc.org
KEGG        http://www.genome.jp/kegg/
WIT         www.mcs.anl.gov/compbio

EcoCyc
EcoCyc is a database representing what we know about the biology of E. coli, strain K-12 MG1655. It contains:
• the genome: the complete sequence, and for each gene its position and function if known;
• transcription regulation: operons, promoters, and transcription factors and their binding sites;
• metabolism: the pathways, including details of the enzymology of individual steps; for each enzyme the reaction, activators, inhibitors, and subunit structure are given;
• membrane transporters: transport proteins and their cargo; and
• links to other databases: protein and nucleic-acid sequence data, literature references, and comparisons to different E. coli strains.

Methionine synthesis in Escherichia coli
A tiny subset of the E. coli metabolic network is the pathway for synthesis of methionine from aspartate (see Figure 11.3).
To appreciate the logic of the system from diagrams such as Figure 11.3, keep in mind that both the reaction sequence and the control cascades are embedded in much larger networks.
• The first step, phosphorylation of L-aspartate, is common to the biosynthesis of methionine, lysine, and threonine. E. coli contains three aspartate kinases, encoded by three separate genes, each specific for one of the end-product amino acids. They catalyse the same reaction but are subject to separate regulation.
• The third step, conversion of L-aspartate-semialdehyde to L-homoserine, is common to the methionine and threonine synthesis pathways. Two homoserine dehydrogenases are separately encoded. Regulation of expression of the aspartate kinases and homoserine dehydrogenases suffices to control all three pathways.
• The piece of the regulatory network shown is also extracted from a more complex tapestry. For example, CRP (the cAMP receptor protein, also called catabolite activator protein) regulates more than 200 genes!
• Methionine is converted to S-adenosylmethionine, a common participant in methyl group transfers. S-Adenosylmethionine activates the Met repressor (encoded by metJ) (see Figure 11.3). This is a more complicated form of feedback. In classic feedback inhibition, a product interacts directly with an enzyme that produces one of its precursors. In this case, the product interacts with a repressor, which reduces the expression of enzymes that produce its precursors. (See page 47.)
In the EcoCyc web page that contains the information corresponding to this figure, the items are active. Links to other internal pages expand information about metabolites, cofactors, enzymes, genes, and regulators. It is possible to ‘zoom’ in or out by controlling the level of detail. For instance, asking for less detail than the contents of Figure 11.3(a) would first eliminate the information about the genes and enzymes and then reduce the pathway to an outline showing only critical intermediates:

L-aspartate → homoserine → L-homocysteine → L-methionine
[Figure 11.3: pathway and regulatory diagrams, not reproduced here.
(a) Pathway from L-aspartate to methionine, with the gene and enzyme for each step:

Step  Gene(s)     Enzyme                                   Product
1     metL        Aspartate kinase                         L-Aspartate 4-phosphate
2     asd         Aspartate semialdehyde dehydrogenase     L-Aspartate-semialdehyde
3     metL        Homoserine dehydrogenase II              L-Homoserine
4     metA        Homoserine O-succinyltransferase         O-Succinyl-L-homoserine
5     metB        O-Succinylhomoserine(thiol)lyase         Cystathionine
6     metC/malY   Cystathionine-β-lyase                    L-Homocysteine
7     metE/metH   L-Homocysteine transmethylase            Methionine

(b) Regulatory network. Transcription factors shown: SoxR, CRP, MalI, SoxS, OxyR, Fur, MetJ, PhoP, MetR. Regulated genes shown: metA, metB, metC, metE, metH, metL, malY.]

Figure 11.3 (a) The synthesis of methionine from aspartate is a seven-step pathway through a linear sequence of intermediates (black). Different enzymes (green) catalyse different steps. They are encoded by the genes shown in blue. metC and malY encode alternative cystathionine-β-lyases. The final step, conversion of L-homocysteine to L-methionine, is also catalysed by two different L-homocysteine transmethylases, encoded by two genes, metE and metH. One mechanism of control is at the protein level: there is ‘feedback inhibition’ by the product, methionine, which inhibits homoserine O-succinyltransferase. This is shown by the red line; note that it connects the product, methionine, to an enzyme, not to a gene. (b) Expression of the genes in the L-aspartate → L-methionine pathway (blue rectangles) is subject to regulation. Circles contain the genes for the transcription factors that control expression. Regulated genes appear in rectangles. Molecules that tend to enhance transcription are connected to their targets by green arrows. Molecules that tend to repress transcription are connected to their targets by red lines ending in a ‘T’. In addition to the links shown here, most of the regulatory proteins feed back on themselves; in most cases, the self-regulatory signal is a repression.
Control is exerted on every protein of the pathway, from a variety of points of initiation.

• Readers are urged to explore the EcoCyc web site on their own, either deliberately or serendipitously, or guided by weblems in this chapter or at the Online Resource Centre.

It is also possible to explore in other dimensions. The methionine synthesis pathway is embedded in larger networks. One of these involves synthesis of the amino acids lysine and threonine in addition to methionine, all starting with aspartate (see Figure 11.4).

The Kyoto Encyclopedia of Genes and Genomes (KEGG)
The Kyoto Encyclopedia of Genes and Genomes (KEGG) is an extremely comprehensive battery of databases for molecular biology and genomics. One of its special strengths is an integration of metabolic and genomic information. KEGG contains pathway maps, which describe potential networks of molecular activities, both metabolic and regulatory.
Figure 11.5 shows a pathway from KEGG, the reductive carboxylate cycle in photosynthetic bacteria. This pathway is basically the Krebs cycle, run backwards.

• Several databases present information about metabolic pathways. These include EcoCyc, BioCyc, WIT, and KEGG. By linking the steps in metabolic pathways to individual genes and proteins, it is possible to reconstruct pathways in particular species, and to compare metabolic pathways in different species.

KEGG derives its power from the very dense network of links among these categories of information, and additional links to many other databases to which the system maintains access. Two examples of the kinds of questions that can be treated with KEGG are:

(a) It has been suggested that simple metabolic pathways evolve into more complex ones by gene duplication and subsequent divergence. Searching the pathway catalogue for sets of enzymes that share a folding pattern will reveal clusters of linked paralogues.
(b) KEGG can take the set of known enzymes from some organism and check whether they can be integrated into established metabolic pathways. A gap in a pathway suggests a missing enzyme or an unexpected alternative pathway.

[Figure 11.4: pathway diagram, not reproduced here. Metabolites shown: L-Aspartate; L-Aspartate-semialdehyde; Homoserine; L-Threonine; L-Homocysteine; L-Methionine; L-Lysine.]

Figure 11.4 The pathway of amino acid biosynthesis from aspartate branches after aspartate semialdehyde. In this figure, the black sequence corresponds to the previous example, and the green pathways are the immediate context. The aspartate → methionine sequence is a subnetwork of the network shown here. Each amino acid plays a regulatory role, exerting feedback inhibition over its own synthesis, without affecting the others. It looks as if threonine and lysine both individually inhibit the first step of the synthesis of all three products, but this step is catalysed by three separate aspartate kinases, allowing specialized regulation.

• Several databases assemble biochemical reactions into metabolic pathways. Individual steps are linked to Enzyme Commission and Gene Ontology Consortium classifications of function, and to individual proteins that catalyse the reactions. These databases are useful in organizing the assignment of function to proteins identified in newly sequenced genomes.

Evolution and phylogeny of metabolic pathways
Most organisms share many common metabolic pathways. But there are many individual variations. Some organisms have metabolic competence completely absent from others. Plants but not humans have enzymes for reactions involved in photosynthesis and cell-wall formation.
Some organisms achieve the same overall metabolic transformation but use alternative pathways; that is, different sets of intermediates. For instance, classical glycolysis and the Entner–Doudoroff pathway are alternative routes from glucose to pyruvate, each comprising a whole succession of reactions (Figure 11.6). Often, organisms will share many steps in a metabolic transformation but some will extend or truncate the pathway. Humans have lost activity in the last enzyme in vitamin C synthesis, L-gulonolactone oxidase, and so we must include vitamin C in our diet. Most mammals have a working copy of the gene for this protein, and can synthesize vitamin C. We have an inactive pseudogene.
Another example is the pathway for nitrogen excretion (see Figure 11.7). Organisms with more water available in their immediate surroundings use more of the reactions.
We can represent the metabolic networks of different species as graphs. The nodes are metabolites. There are edges between pairs of metabolites if the organism has an enzyme that will convert one to the other, or if the interconversion is spontaneous. We can then compare the graphs to get a quantitative measure of the divergence. Intuitively, we expect that the divergence in metabolic network should correspond to the divergence between species as measured from comparing genome sequences.
The procedure outlined here deals with a static and binary picture of the metabolic network. Either a transformation is possible, or it is not. It is entirely
possible that enzymes that catalyse corresponding steps in the network have very different kinetic constants in two species, or are subject to different kinds of regulation. In this case the dynamic patterns of traffic through the network might be quite different, even if the topology of the network is the same. Think of the difference in traffic flow through a city during rush hour, and at midnight. The roads haven’t changed, but the kinetics have.

[Figure 11.5: pathway map, not reproduced here. Metabolites shown: phosphoenolpyruvate, pyruvate, acetyl-CoA, acetate, citrate, cis-aconitate, isocitrate, 2-oxoglutarate, succinyl-CoA, succinate, fumarate, L-malate, oxaloacetate, L-alanine, CO2, and reduced ferredoxin. EC numbers labelling the steps: 2.7.9.2, 1.4.1.1, 4.1.1.31, 1.2.7.1, 6.2.1.1, 2.3.3.8, 4.2.1.3, 1.1.1.37, 4.2.1.2, 1.3.99.1, 1.1.1.42, 6.2.1.5, 1.2.7.3. Linked maps: carbon fixation; alanine and aspartate metabolism; sulphur metabolism; glutamate metabolism.]

Figure 11.5 Metabolic pathway map from The Kyoto Encyclopedia of Genes and Genomes (KEGG). This figure shows the reductive carboxylate cycle, and its links to other metabolic processes. The numbers in square boxes are EC numbers identifying the reactions at each step.

Carbohydrate metabolism in archaea
The common pathway from glucose to pyruvate in bacteria and eukaryotes is the Embden–Meyerhof glycolytic route (see Figure 11.6).
B. Siebers and P. Schönheit have studied the pathways of carbohydrate metabolism in archaea. In the initial conversion of glucose to pyruvate, they observed a number of differences from both the standard Embden–Meyerhof glycolytic pathway and the Entner–Doudoroff alternative.
Pyrococcus furiosus, Thermococcus celer, Archaeoglobus fulgidus strain 7324, Desulfurococcus amylolyticus, and Pyrobaculum aerophilum use a modified Embden–Meyerhof pathway (Figure 11.9). Sulfolobus solfataricus and Haloarcula marismortui use a modified Entner–Doudoroff pathway (Figure 11.8). Thermoproteus tenax uses both.
In addition to the differences in the sequence of metabolites, the enzymes that catalyse even the same
reactions are almost always not homologues of bacterial or eukaryotic ones. Many of them use different cofactors. Bacterial and eukaryotic phosphofructokinases (which convert fructose-6-phosphate to fructose-1,6-bisphosphate) use ATP as the phosphoryl donor. The archaeal enzymes that catalyse this reaction can use ATP, ADP, or even inorganic pyrophosphate. In addition, some of the familiar enzymes are under allosteric control. The control relationships are not retained in the corresponding archaeal enzymes.

• Of particular interest for comparative genomics are facilities to compare pathways among different organisms. Alignment and comparison of pathways can expose how pathways have diverged between species. Even if the pathways are the same, in some cases the enzymes are non-homologous.

[Figure 11.6: pathway diagrams, not reproduced here.
(a) glucose → glucose-6-P → fructose-6-P → fructose-1,6-bisP → glyceraldehyde-3-P + dihydroxyacetone phosphate → 2 (1,3-bisphosphoglycerate) → 2 (3-phosphoglycerate) → 2 (2-phosphoglycerate) → 2 (phosphoenolpyruvate) → 2 (pyruvate)
(b) glucose → glucose-6-P → 6-P-gluconate → 2-keto-3-deoxy-6-P-gluconate → glyceraldehyde-3-P + pyruvate; glyceraldehyde-3-P → 1,3-bisphosphoglycerate → 3-phosphoglycerate → 2-phosphoglycerate → phosphoenolpyruvate → pyruvate]

Figure 11.6 (a) Embden–Meyerhof glycolytic pathway, (b) Entner–Doudoroff pathway. Note that the enzymatic conversion of glyceraldehyde-3-phosphate to pyruvate is the same in both pathways (green branch).

[Figure 11.7: reaction sequence, not reproduced here. The excreted form is given for each group of organisms, and the enzyme for each step:
Uric acid (primates, birds)
  → urate oxidase → Allantoin (some mammals)
  → allantoinase → Allantoate (bony fish)
  → allantoicase → Urea (amphibians)
  → urease → Ammonia (marine invertebrates)]

Figure 11.7 Succession of reactions to produce excreted forms of end products of nitrogen metabolism.
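The idea of comparing the metabolic graphs of two species to obtain a quantitative measure of divergence can be illustrated with the two routes of Figure 11.6. The sketch below treats each route as a set of directed metabolite-to-metabolite edges and scores the difference with the Jaccard distance. The choice of Jaccard distance is an assumption made for this example, one simple option among many, and the edge lists are simplified transcriptions of the figure, not data from any database.

```python
# Steps shared by both routes: glyceraldehyde-3-P to pyruvate.
LOWER_BRANCH = [
    ("glyceraldehyde-3-P", "1,3-bisphosphoglycerate"),
    ("1,3-bisphosphoglycerate", "3-phosphoglycerate"),
    ("3-phosphoglycerate", "2-phosphoglycerate"),
    ("2-phosphoglycerate", "phosphoenolpyruvate"),
    ("phosphoenolpyruvate", "pyruvate"),
]

EMBDEN_MEYERHOF = set(
    [
        ("glucose", "glucose-6-P"),
        ("glucose-6-P", "fructose-6-P"),
        ("fructose-6-P", "fructose-1,6-bisP"),
        ("fructose-1,6-bisP", "glyceraldehyde-3-P"),
        ("fructose-1,6-bisP", "dihydroxyacetone phosphate"),
    ]
    + LOWER_BRANCH
)

ENTNER_DOUDOROFF = set(
    [
        ("glucose", "glucose-6-P"),
        ("glucose-6-P", "6-P-gluconate"),
        ("6-P-gluconate", "2-keto-3-deoxy-6-P-gluconate"),
        ("2-keto-3-deoxy-6-P-gluconate", "glyceraldehyde-3-P"),
        ("2-keto-3-deoxy-6-P-gluconate", "pyruvate"),
    ]
    + LOWER_BRANCH
)


def jaccard_distance(edges_a, edges_b):
    """1 - |shared edges| / |all edges|; 0 means identical, 1 means disjoint."""
    union = edges_a | edges_b
    if not union:
        return 0.0
    return 1.0 - len(edges_a & edges_b) / len(union)


print(sorted(EMBDEN_MEYERHOF & ENTNER_DOUDOROFF))
print(round(jaccard_distance(EMBDEN_MEYERHOF, ENTNER_DOUDOROFF), 3))
```

Like the procedure it illustrates, this measure is static and binary: an edge is either present or absent, and differences in kinetic constants or regulation between species are invisible to it.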
[Figure 11.8: pathway diagram, not reproduced here. Both branches start glucose → gluconate → KDG. Branch (a): KDG → pyruvate + GA; GA → glycerate → 2 PG → PEP → pyruvate. Branch (b): KDG → KDPG → pyruvate + GAP; GAP → 3 PG (directly, or via 1.3 BPG) → 2 PG → PEP → pyruvate. Numbers on the arrows identify the enzymes listed in the caption; cofactors shown include NAD(P)+/NAD(P)H, ADP/ATP, Fdox/Fdred, Pi, and H2O.]

Figure 11.8 Modifications of the Entner–Doudoroff (ED) pathway in Archaea. (a) The non-phosphorylative ED pathway in Thermoplasma acidophilum. (b) The semi-phosphorylative ED pathway in halophilic Archaea. A branched ED (combining (a) and (b)) appears in S. solfataricus and T. tenax. Abbreviations: 1.3 BPG, 1,3-bisphosphoglycerate; Fdox and Fdred, oxidized and reduced ferredoxin; GA, glyceraldehyde; GAP, glyceraldehyde-3-phosphate; KDG, 2-keto-3-deoxygluconate; KDPG, 2-keto-3-deoxy-6-phosphogluconate; PEP, phosphoenolpyruvate; 2 PG, 2-phosphoglycerate; 3 PG, 3-phosphoglycerate. Enzymes are numbered as follows: 1, glucose dehydrogenase; 2, gluconate dehydratase; 3, KD(P)G aldolase; 4, glyceraldehyde dehydrogenase (proposed for T. acidophilum), glyceraldehyde:ferredoxin oxidoreductase (proposed for T. tenax) or glyceraldehyde oxidoreductase (proposed for S. acidocaldarius); 5, glycerate kinase; 6, enolase; 7, pyruvate kinase; 8, KDG kinase; 9, GAPDH; 10, phosphoglycerate kinase; 11, GAPN; 12, phosphoglycerate mutase.
From: Siebers, B. & Schönheit, P. (2005). Unusual pathways and enzymes of central carbohydrate metabolism in Archaea. Curr. Opin. Micro. 8, 695–705.

Reconstruction of metabolic networks

Pathway comparison can be useful for annotation of genomes. It is often possible to assign function to proteins on the basis of similarity to sequences of proteins of known function in other organisms. However, sometimes there are several weak similarities to other proteins and it is unclear which is the true homologue. Conversely, sometimes an organism has a metabolic pathway but no annotated enzyme for an essential step. Confronting the unannotated proteins with the unassigned functions can sometimes identify the protein that fills the gap in the pathway.
If an enzyme needed for a pathway cannot be identified, even by weak sequence similarity, it may be that the organism has evolved a non-homologous enzyme for the task. For example, the archaeon Methanococcus jannaschii has a pathway for biosynthesis of chorismate from 3-dehydroquinate. Enzymes for most of the steps have homologues in bacteria and/or eukaryotes. However, shikimate kinase was not identifiable from sequence similarity. M. jannaschii must have some protein with this function. How can it be found?
Although in bacteria, genes consecutive in pathways are often consecutive in operons in the genome, this is not true of M. jannaschii. However, the genes for successive steps of the chorismate biosynthesis pathway are clustered and consecutive in another archaeon, Aeropyrum pernix. It was possible to propose a gene
for a shikimate kinase in A. pernix and to identify a homologue of that gene in M. jannaschii. Experiments confirmed the prediction that the M. jannaschii gene thus identified (MJ1440) encoded a shikimate kinase. It has no sequence similarity to bacterial or eukaryotic shikimate kinases. A protein from a different family has been recruited for the archaeal pathway.

[Figure 11.9: pathway diagram, not reproduced here. The common sequence of steps is: Glucose → G-6-P (GLK) → F-6-P (PGI) → F-1,6-BP (PFK) → DHAP + GAP (FBA; TIM interconverts DHAP and GAP) → 3-PG (GAPOR or GAPN) → 2-PG (PGM) → PEP (enolase) → Pyruvate (PK). Columns compare the enzyme variants (cPGI, PGI/PMI, aFBA) and cofactors (ADP/AMP, ATP/ADP, PPi/Pi, Fdox/Fdred, NAD(P)+/NAD(P)H) used at each step in Pyrococcus, Thermococcus, and Archaeoglobus 7324; Desulfurococcus; Pyrobaculum; and Thermoproteus.]

Figure 11.9 Modifications of the Embden–Meyerhof (EM) pathway in Archaea. In this case most of the reactions are the same. The enzymes are not homologous to those that catalyse the corresponding reactions in bacteria and eukarya. Note the differences in cofactors. The steps and mechanisms of regulation also differ. Abbreviations: aFBA, archaeal class I FBA; cPGI, cupin PGI; DHAP, dihydroxyacetone phosphate; FBA, fructose 1,6-bisphosphate aldolase; F-1,6-BP, fructose 1,6-bisphosphate; Fdox and Fdred, oxidized and reduced ferredoxin; F-6-P, fructose-6-phosphate; GAP, glyceraldehyde-3-phosphate; GAPN, non-phosphorylative glyceraldehyde 3-phosphate dehydrogenase; GAPOR, glyceraldehyde-3-phosphate-ferredoxin oxidoreductase; GLK, glucokinase (ADP- or ATP-dependent); G-6-P, glucose-6-phosphate; PEP, phosphoenolpyruvate; PFK, 6-phosphofructokinase; 2-PG, 2-phosphoglycerate; 3-PG, 3-phosphoglycerate; PGI/PMI, bifunctional phosphoglucose/phosphomannose isomerase; PGI, phosphoglucose isomerase; PGM, phosphoglycerate mutase; PK, pyruvate kinase; TIM, triosephosphate isomerase.
From: Siebers, B. & Schönheit, P. (2005). Unusual pathways and enzymes of central carbohydrate metabolism in Archaea. Curr. Opin. Micro. 8, 695–705.

Regulatory networks

Regulatory networks pervade living processes. Control interactions are organized into linear signal transduction cascades and reticulated into control networks.
Any individual regulatory action requires (1) a stimulus; (2) transmission of a signal to a target; (3) a response; and (4) a ‘reset’ mechanism to restore the resting state (see Figure 11.10). Many regulatory actions are mediated by protein–protein complexes. Transient complexes are common in regulation, as dissociation provides a natural reset mechanism.
Some stimuli arise from genetic programmes. Some regulatory events are responses to current internal
metabolite concentrations. Others originate outside the cell; the signal is detected by surface receptors and transmitted across the membrane to an intracellular target.
G-protein-coupled receptors (GPCRs) illustrate the components of signal transduction. Recall that GPCRs contain seven transmembrane helices, with a binding site for triggering ligands on the extracellular side and a binding site for the downstream recipient of the signal, a heterotrimeric G protein, on the intracellular side.
G proteins consist of three subunits: Gα, Gβ, and Gγ. Gα and Gγ are anchored to the membrane. In the resting inactive state, Gα binds GDP. An activated GPCR binds to a specific G protein and catalyses GTP–GDP exchange in the Gα subunit. This destabilizes the trimer, dissociating Gα:

Gα(GDP)GβGγ → Gα(GTP) + GβGγ

The separated components, Gα and GβGγ, activate downstream targets, such as adenylate cyclase.
A single activated GPCR can interact successively with many G-protein molecules, amplifying the signal. It is therefore essential to turn the signal off, after it has had its effect. Mutations that render a GPCR constitutively active cause a number of diseases, the symptoms emerging from a war between the rogue receptor and the feedback mechanisms that are unequal to the task of restraining its effects. Different GPCRs have different mechanisms for restoring the resting state. Rhodopsin, for example, is inactivated by cleavage of the isomerized chromophore.
The activity of the heterotrimeric G proteins is turned off by the GTPase activity of Gα, converting Gα(GTP) to Gα(GDP). Gα(GDP) does not bind to its receptors – shutting down that pathway of signal transmission. Gα(GDP) rebinds the GβGγ subunits. This resets the system.

[Figure 11.10: diagram, not reproduced here. Labels: Input impulse; Signal transmission; Reset mechanism; Output action.]

Figure 11.10 The elementary step in a regulatory network. An input impulse is received by a node, which transmits a signal to a downstream node, causing an output action. This is followed by reset of the upstream node to its inactive state. Combination of such elementary diagrams gives rise to the complex regulatory networks in biology.

Signal transduction and transcriptional control

The signal transduction network exerts control ‘in the field’ by a variety of mechanisms including inhibitors, dimerization, ligand-induced conformational changes including but not limited to allosteric effects, GDP–GTP exchange or kinase–phosphatase switches, and differential turnover rates. This component acts fast, on subsecond timescales. The transcriptional regulatory network exerts control ‘at headquarters’ through control over gene expression. This component is slower, acting on a timescale of minutes.
General characteristics of all control pathways include the following:

• A single signal can trigger a single response or many responses.
• A single response can be controlled by a single signal or influenced by many signals.
• Each response may be stimulatory – increasing an activity – or inhibitory – decreasing an activity.
• Transmission of signals may damp out stimuli or amplify them.

There are ample opportunities for complexity, opportunities of which cells have taken extensive advantage.

Structures of regulatory networks

Think of control, or regulatory, networks as assemblies of activities. Although mediated in part by physical assemblies of macromolecules (protein–protein and protein–nucleic acid complexes), regulatory networks:
1. Tend to be unidirectional. A transcription activator may stimulate the expression of a metabolic enzyme, but the enzyme may not be involved directly in regulating the expression of the transcription factor.
2. Have a logical dimension. It is not enough to describe the connectivity of a regulatory network. Any regulatory action may stimulate or repress the activity of its target. If two interactions combine to activate a target, activation may require both stimuli (logical ‘and’), or either stimulus may suffice (logical ‘or’).
3. Produce dynamic patterns. Signals may produce combinations of effects with specified time courses. Cell-cycle regulation is a classic example.

The structure of a regulatory network can be described by a graph in which edges indicate steps in pathways of control. Regulatory networks are directed graphs (see p. 344): the influence of vertex A on vertex B is expressed by a directed edge connecting A and B. An edge directed from vertex A to vertex B is called an outgoing connection from A and an incoming connection to B. Conventionally, an arrow indicates a stimulatory interaction, and a ‘T’ symbol indicates an inhibitory interaction. An edge connecting a vertex to itself indicates auto-regulation. A double-headed arrow indicates reciprocal stimulation of two nodes; note that this is not the same as an undirected edge.

[Diagram, not reproduced here. Three examples: A ⊣ B, inhibitory interaction; A with an edge looping back to itself, auto-regulatory interaction; A ↔ B, reciprocal interaction.]

Dynamics, stability, and robustness

An unlabelled, undirected graph gives a static picture of the topology of a network. The dynamic states are more complex (see Box 11.7), including:

• equilibrium;
• steady state;
• states that vary periodically;
• unfolding of developmental programmes;
• chaotic states;
• runaway or divergence; and
• shutdown.

Although much is known about the mechanisms of individual elements of control and signalling pathways, understanding their integration is a subject of current research. For instance, the idea that healthy cells and organisms are in stable states is certainly no more than an approximation (in most cases, it is an idealization).
Understanding how cells achieve even an apparent approximation to stability is also quite tricky. It is likely that great redundancy of control processes lies at its basis. Regulation is based on the result of many individual control mechanisms – here a short feedback loop, there a multistep cascade. Somehow the independent actions of all of the individual signals combine to achieve an overall, integrated result. It is like the operation of the ‘invisible hand’ that, according to Adam Smith, coordinates individual behaviour into the regulation of national economies.

• Robustness is more than stability. Stability is keeping your composure in unchanging conditions. Robustness is keeping your composure in changing conditions.

Robustness through redundancy
In principle, networks can achieve robustness through an extension of the mechanism by which redundancy confers stability. The most direct approach is simple substitutional redundancy: if two proteins are each capable of doing a job, knock out one and the other takes over. In the London Underground, this would correspond to a second line running over the same route. For instance, when the Circle Line is not running, passengers travelling between Paddington and King’s Cross stations can use the Hammersmith & City line that runs on the same tracks. In yeast, for example, single-gene knockouts of over 80% of the ∼6200 open reading frames are survivable injuries.
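The conventions for regulatory networks described in this section (directed edges, a sign for stimulation or inhibition, self-edges for auto-regulation, and logical ‘and’ or ‘or’ combination of stimuli) can be captured in a few lines. The following is a minimal sketch with a hypothetical four-node network; the node names, the `Edge` class, and in particular the rule that any active repressor overrides the activators are assumptions made for illustration, not properties of a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    sign: int  # +1 stimulatory (arrow), -1 inhibitory ('T' symbol)


# Hypothetical network: A and B stimulate C, R represses C,
# R represses itself, and A and B stimulate each other.
EDGES = [
    Edge("A", "C", +1),
    Edge("B", "C", +1),
    Edge("R", "C", -1),
    Edge("R", "R", -1),  # auto-regulation
    Edge("A", "B", +1),  # reciprocal stimulation is two directed edges,
    Edge("B", "A", +1),  # which is not the same as one undirected edge
]


def incoming(node):
    return [e for e in EDGES if e.target == node]


def outgoing(node):
    return [e for e in EDGES if e.source == node]


def auto_regulated():
    return sorted({e.source for e in EDGES if e.source == e.target})


def target_active(node, active, logic="and"):
    """Decide whether node is switched on, given the set of active regulators.

    Assumed rule: activators are combined by 'and' or 'or', and any
    active repressor vetoes activation.
    """
    inputs = [e for e in incoming(node) if e.source != node]
    activators = [e.source in active for e in inputs if e.sign > 0]
    repressed = any(e.source in active for e in inputs if e.sign < 0)
    combine = all if logic == "and" else any
    return bool(activators) and combine(activators) and not repressed


print(target_active("C", {"A"}, logic="and"))
print(target_active("C", {"A"}, logic="or"))
print(auto_regulated())
```

The same connectivity gives different behaviour under ‘and’ and ‘or’ logic, which is why a description of the edges alone does not specify a regulatory network.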
364 11 Systems Biology

BOX 11.7 Dynamic states of a network of processes

• At equilibrium, one or more forward and reverse processes occur at compensating rates, to leave the amounts of different substances unchanged:

A ⇌ B

Chemical equilibria are generally self-adjusting upon changes in conditions or in concentrations of reactants or products.

• A steady state will exist if the total rate of processes that produce a substance is the same as the total rate of processes that consume it. For instance, the two-step conversion

A → B → C

could maintain the amount of B constant, provided that the rate of production of B (the process A → B) is the same as the rate of its consumption (the process B → C). The net effect would be to convert A to C.

A cyclic process could maintain a steady state in all of its components:

A → B → C → A

A steady state in such a cyclic process with all reactions proceeding in one direction is very different from an equilibrium state. Nevertheless, in some cases, it is still true that altering external conditions produces a shift to another, neighbouring steady state.

• States that vary periodically appear in the regulation of the cell cycle, circadian rhythms, and seasonal changes such as annual patterns of breeding in animals and flowering in plants. Circadian and seasonal cycles have their origins in the regular progressions of the day and year, but have evolved a certain degree of internalization.

• Many equilibrium and some steady-state conditions are stable, in the sense that concentrations of most metabolites are changing slowly if at all and the system is robust to small changes in external conditions. The alternative is a chaotic state, in which small changes in conditions can cause very large responses. Weather is a chaotic system: the meteorologist E. Lorenz asked, ‘Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?’ In a carefully regulated system, chaos is usually well worth avoiding, and it is likely that life has evolved to damp down the responses to the kinds of fluctuation that might give rise to it. Chaotic dynamics does sometimes produce approximations to stable states – these are called strange attractors. Understanding stability in dynamic systems subject to changing environmental stimuli is important but is beyond the scope of this book.

• Unfolding of developmental programmes occurs over the course of the lifetime of the cell or organism. Many developmental events are relatively independent of external conditions and are controlled primarily by regulation of gene expression patterns.

• Runaway or divergence. Absence of a predator can lead to uncontrolled multiplication of a species. An example is the growth of the rabbit population of Australia from an ‘inoculum’ of 24 animals in 1859. Breakdown in control over cellular proliferation leads to unconstrained growth in cancer.

• Shutdown is part of the picture. Apoptosis is the programmed death of a cell, as part of normal developmental processes or in response to damage that could threaten the organism, such as DNA strand breaks. Breakdown of mechanisms of apoptosis – for instance, mutations in the protein p53 – is an important cause of cancer.
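
The steady-state condition for the two-step conversion A → B → C can be illustrated with a minimal numerical sketch. It assumes, purely for illustration, that A is held constant so that B is produced at a fixed rate, and that B is consumed with first-order kinetics; the function name, rate values, and time step are hypothetical and are not taken from the text.

```python
def simulate_b(supply=1.0, k_consume=0.5, b_start=0.0, dt=0.01, steps=5000):
    """Integrate d[B]/dt = supply - k_consume * [B] with simple Euler steps.

    supply    : assumed constant rate of the process A -> B
    k_consume : assumed first-order rate constant of the process B -> C
    b_start   : initial amount of B
    """
    b = b_start
    for _ in range(steps):
        production = supply          # rate of A -> B
        consumption = k_consume * b  # rate of B -> C
        b += dt * (production - consumption)
    return b


# At the steady state production equals consumption, so [B] = supply / k_consume.
expected_steady_b = 1.0 / 0.5

# Starting below or above the steady-state level, B relaxes to the same value.
b_from_zero = simulate_b(b_start=0.0)
b_from_excess = simulate_b(b_start=5.0)
```

In this sketch the amount of B settles at the value where the two rates balance, while A continues to be converted to C throughout; that is the distinction between a steady state and an equilibrium.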

Some duplicated genes contribute to substitutional redundancy. For example, in studying models for diabetes it appears that mice and rats (but not humans) have two similar but non-allelic insulin genes. Substitutional redundancy requires equivalence not only of function but of expression levels. In the mouse, knocking out either insulin gene leads to compensatory increased expression of the other, producing a normal phenotype.

Coordinated expression patterns, providing substitutional redundancy, are more probable among duplicated genes than among unrelated ones. For example, Escherichia coli contains two fructose-1,6-bisphosphate aldolases. One, expressed only in the presence of special nutrients, is non-essential under normal growth conditions. However, the other is essential. In this case, functional redundancy does not provide robustness. These two enzymes are probably

homologous, but they are distant relatives, not the product of a recent gene duplication. One is a member of a family of fructose-1,6-bisphosphate aldolases typical of bacteria and eukaryotes, whereas the other is a member of another family that occurs in archaea. E. coli is unusual in containing both.

An alternative mechanism of network robustness is distributed redundancy: equivalent effects achieved through different routes. In normal E. coli, approximately two-thirds of the NADPH produced in metabolism arises via the pentose phosphate shunt, which requires the enzyme glucose-6-phosphate dehydrogenase. Knocking out the gene for this enzyme leads to metabolic shifts, after which increased levels of NADH produced by the tricarboxylic acid cycle are converted to NADPH by a transhydrogenase reaction. The growth rate of the knockout strain is comparable to that of the parent.

Dynamic modelling

Diagrams such as Figure 11.3 give a static picture of the structure of a metabolic pathway and its control. Can we model the dynamics? What would it mean to do so?

A challenge that might – naively – appear relatively simple would be to predict the effect of knocking out an enzyme. An easy guess would be to expect a build-up of the substrate of the missing enzyme. However, if the metabolic pathways branch in the vicinity of that metabolite, the consequences of a knockout are more complex.

For example, the disease phenylketonuria results most commonly from a specific dysfunctional (i.e. knocked-out) enzyme, phenylalanine hydroxylase. The normal function of phenylalanine hydroxylase is to convert phenylalanine to tyrosine (see page 60). In phenylketonuria, phenylalanine does indeed build up. However, the excess phenylalanine is converted by phenylalanine transaminase to phenylpyruvic acid:

phenylalanine → phenylpyruvic acid
[Structural formulae: the –NH2 group of phenylalanine is replaced by =O in phenylpyruvic acid; both retain the –COOH group.]

Both compounds accumulate. As phenylpyruvic acid is less readily absorbed by the kidneys than phenylalanine, it is excreted into the urine, giving the disease its name. (Phenylalanine is not a ketone.) The Guthrie test for phenylketonuria measures the concentration of phenylalanine in the blood of newborns.

A challenge greater than predicting the effect of a single knockout would be to simulate the entire metabolic network: given an initial set of metabolite concentrations, to predict the concentrations as a function of time. The idea would be to combine predictions of the rates of individual reactions, assuming a simple model such as Michaelis–Menten kinetics, or more complex models of allosteric enzymes. This requires knowing accurately the kinetic constants of all of the enzymes, including effects of inhibitors. It requires being able to give a sensible treatment of the idea of ‘substrate concentration’ within a cell divided into compartments and to deal with questions of rates of diffusion in a crowded intracellular environment. Longer-term simulation would require knowing the kinetics of transcription regulation, for which no simple model analogous to the Michaelis–Menten equation is available. There are also serious computational issues involving how precisely the kinetic parameters must be known, and the extent to which simplifying assumptions – for instance, the steady-state approximation – are justified.

Accurate simulation of metabolic patterns of entire cells is a clear target for research in the field. However, the problem is a difficult one. Current approaches include:

• Attempts at detailed numerical analysis of simple networks. For instance, a simulation of the aspartate → threonine pathway (see Figure 11.4) in E. coli represented the enzymatic transformations and feedback inhibition as a set of coupled equations.* Changes in expression pattern were not included. Steady-state solutions were compared with experimental measurements on cell extracts. It was possible to:
– simulate the time course of threonine synthesis and the effects of changes in initial metabolite concentrations;
– predict the steady-state concentrations of intermediates;

* Chassagnole, C., Raïs, B., Quentin, E., Fell, D.A., & Mazat, J.P. (2001). An integrated study of threonine-pathway enzyme kinetics in Escherichia coli. Biochem. J. 356, 415–423.

– predict the effects of changes in concentrations of individual enzymes on overall throughput, expressed as flux control coefficients; such data can help to guide development of microbial factories for increased yield of particular products;
– for different steps, distinguish whether the substrates and products are approximately at equilibrium.

• The flux control coefficient is the percentage change in flux divided by the percentage change in amount of enzyme. It is not a property of the enzyme, but a property of a reaction within a metabolic network. A flux control coefficient equal to 1 would correspond to a rate-limiting step.

• Focusing not on individual enzymes but on potential sets of flow rates. Represent the metabolic network as a graph. Metabolites are the nodes. Edges correspond to reactions: an edge connects two compounds if there is a reaction, or possibly several reactions, that interconvert them. The goal is to predict the flow rate through each edge. Recently the models have been generalized to include regulation of expression. There are general constraints on the set of flow rates:
– under steady-state conditions, the fluxes through each node must add up to zero; i.e. for each compound, the amount that is synthesized or supplied externally must equal the amount used up or secreted;
– the flux control coefficients of all of the reactions contributing to a single flux must add up to 1;
– the flux through any edge is limited by the values of the Michaelis–Menten parameter Vmax for all enzymes contributing to the edge; and
– the thermodynamic properties of each reaction determine whether or not the reaction is reversible: this is a property of the substrate and product of the reaction, not of the enzyme; the flux of an irreversible reaction must be ≥0.

• It is interesting to see whether the space of possible metabolic states is connected or broken up into separated regimens.

In general, many possible flow patterns, or metabolic states, are consistent with the constraints. To determine a single metabolic state to compare with experiments, it is possible to select from the feasible states the one that is optimal for ATP production or for growth rate.

A variety of observable quantities are predictable.

• The effects of changes of medium or gene knockouts: which enzymes are essential for growth on different carbon sources?
• What are limiting factors in growth?
• What are maximal theoretical yields of ATP, or assimilation of carbon, etc.?
• What are the fluxes through individual pathways? This is difficult but not impossible to measure.
• What are the flux control coefficients of different enzymes?
• For optimal growth, how much oxygen and carbon source are taken up?

Such models have been constructed for several organisms, including prokaryotes and eukaryotes. Predictions have generally achieved good agreement with experiments.
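
The definition of the flux control coefficient, and the constraint that the coefficients for a single flux add up to 1, can be checked numerically on a toy pathway. The sketch below assumes a hypothetical two-step pathway X0 ⇌ S → X1 with linear rate laws; the rate laws, parameter values, and function names are illustrative assumptions and are not taken from the simulations cited in the text.

```python
def steady_state_flux(e1, e2, k1=1.0, k_rev=0.5, k2=2.0, x0=1.0):
    """Steady-state flux through the assumed pathway X0 <-> S -> X1.

    Assumed rate laws (amounts of enzyme e1, e2 scale the rates):
        v1 = e1 * (k1 * x0 - k_rev * s)   reversible first step
        v2 = e2 * k2 * s                  irreversible second step
    Setting v1 = v2 gives the steady-state level of S and hence the flux.
    """
    s = e1 * k1 * x0 / (e1 * k_rev + e2 * k2)
    return e2 * k2 * s


def flux_control_coefficient(step, e1=1.0, e2=1.0, rel_change=1e-6):
    """Fractional change in flux divided by fractional change in enzyme."""
    base = steady_state_flux(e1, e2)
    if step == 1:
        perturbed = steady_state_flux(e1 * (1 + rel_change), e2)
    else:
        perturbed = steady_state_flux(e1, e2 * (1 + rel_change))
    return ((perturbed - base) / base) / rel_change


c1 = flux_control_coefficient(1)
c2 = flux_control_coefficient(2)
total = c1 + c2  # summation constraint: expected to be close to 1
```

With these assumed parameters, control is shared between the two steps rather than residing in a single rate-limiting enzyme, which illustrates why the coefficient is a property of a reaction within its network and not of the enzyme alone.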

Protein interaction networks

The units from which interaction networks are assembled are:

• for physical networks, a protein–protein or protein–nucleic acid complex;
• for logical networks, a dynamic connection in which the activity of a process is affected by a change in external conditions, or by the activity of another process.

Most experiments reveal only pairwise interactions. The challenges are to integrate pairwise interactions into a network and then to study the structure and dynamics of the system.
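
As a minimal sketch of what integrating pairwise interactions into a network can mean computationally, the code below stores each reported pair as an undirected edge and then groups proteins into connected sets. The protein names P1 to P5 and both function names are hypothetical placeholders, not data from any experiment described here.

```python
from collections import defaultdict


def build_network(pairs):
    """Turn a list of (protein, protein) pairs into an adjacency map."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)  # interactions are treated as symmetric
        network[b].add(a)
    return network


def connected_components(network):
    """Group proteins that are linked directly or through intermediates."""
    seen = set()
    components = []
    for start in network:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # depth-first traversal from the starting protein
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(network[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical pairwise results, e.g. from bait-prey experiments.
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
network = build_network(pairs)
components = connected_components(network)
```

Here P1 and P3 end up in the same component although no experiment paired them directly, which is the kind of structure that only emerges once the pairwise observations are assembled.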

Figure 11.11 (a) X-ray image of the microtubule network of a mouse epithelial cell labelled using metal-conjugated antibodies (to form individual particles ∼50 nm in diameter). Different regions of the cell have measurably different X-ray absorbances. Colouring the image according to X-ray absorbance brings out contrasts; of course, there is no suggestion that the colours correspond to a realistic interaction with visible light. Here the microtubule network appears in blue, and the nucleus and nucleoli in orange. The total width of the field is 120 μm. (b) Cryo X-ray tomography of a yeast cell (S. cerevisiae). This image shows a 0.5 μm section at 60 nm resolution. Lipid droplets are coloured white, the vacuole and nucleus are red. The arrow points to the nucleus. Other cytoplasmic structures appear green and orange. Cell diameter, 5 μm.
(a) From: Meyer-Ilse, W., et al. (2001). High resolution protein localization using soft X-ray microscopy. J. Microsc. 201, 395–403.
(b) From: Larabell, C.A. & Le Gros, M.A. (2004). X-ray tomography generates 3-D reconstructions of the yeast, Saccharomyces cerevisiae, at 60-nm resolution. Mol. Biol. Cell 15, 957–962.

Many techniques detect physical interactions directly. These include:

• X-ray and NMR structure determinations can not only identify the components of the complex, but reveal how they interact, and whether conformational changes occur upon binding.
• X-ray tomography of cells produces images by measuring differential absorption of low-energy X-rays (see Figure 11.11). From a series of tilted views, three-dimensional reconstruction is possible.
• Two-hybrid screening systems. Transcriptional activators such as Gal4 contain a DNA-binding domain and an activation domain. Suppose these two domains are separated, and one test protein is fused to the DNA-binding domain and a second test protein is fused to the activation domain. Then a reporter protein will be expressed only if the components of the activator are brought together by formation of a complex between two test proteins (Figure 11.12). High-throughput methods allow parallel screening of a ‘bait’ protein for interaction with a large number of potential ‘prey’ proteins.
• Chemical crosslinking fixes complexes so that they can be isolated. Subsequent proteolytic digestion and mass spectrometry permits identification of the components.
• Coimmunoprecipitation. An antibody raised to a ‘bait’ protein binds the bait together with any other ‘prey’ proteins that interact with it. The interacting proteins can be purified and analysed, for instance by western blotting, or mass spectrometry.
• Chromatin immunoprecipitation identifies DNA sequences that bind proteins (Figure 11.13).
• Phage display. Genes for a large number of proteins are individually fused to the gene for a phage coat protein, to create a population of phage each of which carries copies of one of the extra proteins exposed on its surface. Affinity purification against an immobilized ‘bait’ protein selects phage displaying potential ‘prey’ proteins. DNA extracted from the interacting phages reveals the amino acid sequences of these proteins.
• Surface plasmon resonance analyses the reflection of light from a gold surface to which a protein has been attached. The signal changes if a ligand binds to the immobilized protein. (The method detects localized changes in the refractive index of the medium adjacent to the gold surface. This is related to the mass being immobilized.)
• Fluorescence resonance energy transfer. If two proteins are tagged by different chromophores, transfer of excitation energy can be observed over distances up to about 60 Å.
• Tandem affinity purification (TAP) allows probing cells in vivo for partners that bind to a selected ‘bait’ protein.

Fusion of the bait protein to two affinity tags, separated by a cleavage site, permits high-efficiency extraction of complexes, using two successive, or tandem, affinity purification steps separated by a cleavage step to expose the second tag. The cleavage steps are specific and require only mild conditions, in order to leave the bait–prey complex intact (Figure 11.14).

The fusion proteins contain an individual bait protein extended at its N or C terminus by a calmodulin-binding peptide, a TEV protease cleavage site (TEV is a cysteine protease from tobacco etch virus), and protein A (which binds with high affinity to an available antibody). The double tag, and the two-step purification, give superior performance relative to a single-tag technique, in yield of low-concentration complexes.

In vivo, the expressed tagged bait protein binds to a set of prey proteins. Figure 11.14 shows the purification protocol to recover the complexes from cell extracts.

Figure 11.12 (a) A transcription activator contains two domains, a DNA-binding domain (pink) and a transcription-activator domain (TA; blue). Together they induce expression of a reporter gene. The lacZ gene encoding β-galactosidase (see p. 380) is a common choice of reporter gene, because chromogenic substrates make β-galactosidase easy to detect. (b) Transcription proceeds if the DNA-binding domain and TA domain are separated in different proteins that can form a complex. A ‘bait’ protein B (red) is fused to the DNA-binding domain and a ‘prey’ protein P (cyan) is fused to the TA domain. Formation of a complex between bait and prey brings together the DNA-binding and TA domains, inducing transcription. This is the basis of a high-throughput approach for detecting pairs of interacting proteins.

Figure 11.13 Chromatin immunoprecipitation. Treatment with formaldehyde cross-links proteins and DNA, fixing the complexes that exist within a cell. After isolation of chromatin, breaking the DNA into small fragments allows separation of proteins by binding to specific antibodies, carrying the DNA sequences along with them. Reversal of the cross-link followed by sequencing of the DNA identifies the specific DNA sequence to which each protein binds. To identify multiple sites in the genome to which the protein binds, the DNA fragments can be analysed using a microarray. To avoid the requirement for antibodies specific for each protein to be tested, the proteins can be fused to a standard epitope, or to a sequence that can be biotinylated, taking advantage of the very high biotin–streptavidin affinity.
[Diagram steps: Proteins bind to DNA; Cross-link; Fragment; Purify by immunoprecipitation; Cleave cross-links and separate components; ID protein; Sequence DNA or microarray.]

[Figure 11.14 diagram labels: bait + TAP tag (calmodulin-binding peptide, TEV protease cleavage site, protein A); intracellular proteins; IgG-coated beads; first elution; cleavage by TEV protease; calmodulin-coated beads; second elution; final elution with EGTA.]

Figure 11.14 The construct and protocol in the tandem affinity purification (TAP) method for purification of complexes with a selected
‘bait’ protein. The fusion protein containing the bait and the two tags separated by the TEV cleavage site binds in vivo to proteins in the
cell. A first affinity purification step binds the bait protein to a column containing IgG-coated beads that bind specifically to the first tag,
protein A. After thorough washing, cleavage by TEV protease releases the bound complexes, and exposes the second tag. A second
affinity purification step binds the bait protein to a column containing calmodulin-coated beads, which bind specifically to the second
tag, the calmodulin-binding peptide. After washing, elution with the chelating agent ethylene glycol tetraacetic acid (EGTA) releases the
purified complexes.

Determination of a protein interaction network requires measurements of many bait proteins. Because all bait proteins carry the same fusion sequences, the purification and cleavage steps in TAP are the same for all of them. There is no need to design different purification protocols for different bait proteins.

Other methods provide complementary information:

• Domain recombination networks. Many eukaryotic proteins contain multiple domains. A feature of eukaryotic evolution is that a domain may appear in different proteins with different partners. In some cases proteins in a bacterial operon catalysing successive steps in a metabolic pathway are fused into a single multidomain protein in eukarya. The domains of the eukaryotic protein are individually homologous to the separate bacterial proteins. (Examples of proteins fused in eukarya and separate in prokaryotes are also known.) It is possible to create a network by defining an interaction between two protein domains whenever homologues of the two domains appear in the same protein. This is evidence for some functional link between the domains, even in species where the domains appear in separate proteins.
• Coexpression patterns. Clustering of microarray data identifies proteins with common expression patterns. They may have the same tissue distribution, or be up- or down-regulated in parallel in different physiological states. This is also suggestive evidence that they share some functional link. In the response of M. tuberculosis to the drug isoniazid, genes for the Fatty Acid Synthesis complex are coordinately upregulated. They are on an operon-like gene cluster, and in fact these proteins do form a physical complex. On the other hand, alkyl hydroperoxidase (AHPC) is also upregulated in

response to isoniazid. AHPC acts to relieve oxidative stress. There is no evidence that it physically interacts with the Fatty Acid Synthesis complex, or that it mediates a metabolic transformation coupled to fatty acid synthesis. It is a second, independent, component of the response to isoniazid.
• Phylogenetic distribution patterns. The phylogenetic profile of a protein is the set of organisms in which it and its homologues appear. Proteins in a common structural complex or pathway are functionally linked and expected to co-evolve. Therefore proteins that share a phylogenetic profile are likely to have a functional link, or at least to have a common subcellular origin. There need be no sequence or structural similarity between the proteins that share a phylogenetic distribution pattern. A welcome feature of this method is that it derives information about the function of a protein from its relationship to nonhomologous proteins.

Each of these methods provides a basis for a protein interaction network. The networks formed by combining each set of interactions are different, although they overlap, to a greater or lesser extent. They give different views of the kinds of relationships between proteins that exist in cells. It is possible to form a more comprehensive network by combining different types of interactions. For instance, the DIP database is a curated collection of experimentally determined protein–protein interactions (see http://dip.doe-mbi.ucla.edu/). It contains data about 71 276 interactions between 23 201 proteins from 372 organisms.

A limitation that remains is the difficulty of determining structures of transient complexes, or of systems showing substantial conformational changes upon assembly. The situation is shared with much of current molecular biology: we are coming to grips with static structures but are awaiting the development of methods for treating the dynamics.

Structural biology of regulatory networks

Many molecules involved in regulation are multidomain proteins. Each domain in a multidomain protein is relatively free to interact with other molecules. An interaction domain is a part of a protein that confers specificity in ligation of a partner. Regulatory proteins contain a limited number of types of interaction domains, which have diverged to form large families with different individual specificities. For instance, the human genome contains 115 SH2 domains, and 253 SH3 domains. (Src-Homology domains SH2 and SH3 are named for their homologies to domains of the src family of cytoplasmic tyrosine kinases.) Many individual interaction domains even interact with different partners as they participate in successive steps of a control cascade. Initial interactions may also trigger recruitment of additional proteins to form large regulatory complexes.

Figure 11.15 shows types of interaction domain complexes with ligands, including binding of peptides (which may be attached to proteins), and protein–protein complexes. Protein–nucleic acid complexes will appear next.

Many interaction domains are sensitive to the state of post-translational modification of their ligands, for instance binding preferentially to states of a ligand in which specific tyrosines, serines, or threonines are phosphorylated. These and other post-translational modifications function as switches, turning on or interrupting/resetting a signalling cascade.

Protein–protein complex formation allows a cell to detect a signal molecule in the external medium and report its arrival to the cell interior, without the signal molecule itself ever needing to enter the cell. Many receptors use an ingenious dimerization mechanism. The receptor has external, transmembrane, and internal segments. An external ligand binds to two molecules of receptor (see Figure 10.26). The juxtaposition of the external portions also brings the internal portions together, because they are tethered to the external regions by the transmembrane segments. Interaction between the interior segments triggers a conformational change that activates a process such as phosphorylation of a protein. This may initiate a signal transduction cascade that can transmit and amplify the original stimulus (see Box 11.8). Within the cell, ligand-induced dimerization may activate DNA-binding domains (Figure 11.16).

Large-scale protein interaction networks are built up from many individual interactions. Figure 11.17 shows a portion of an interaction network of yeast proteins, based on sets of proteins that have been found together in solved structures.

L-Dopa is used to treat Parkinson’s disease.

BOX 11.8 Regulation of tyrosine hydroxylase illustrates several control mechanisms

Tyrosine hydroxylase catalyses the conversion of L-tyrosine to L-3,4-dihydroxyphenylalanine (L-dopa) in neurons, a step in the synthesis of the neurotransmitters dopamine and adrenaline. Tyrosine hydroxylase is the focus of many diverse forms of regulation, including control over transcription and RNA processing and turnover. One regulatory pathway is triggered by the arrival of a neurotransmitter at the external cell surface.

• External binding to a receptor activates adenylate cyclase inside the cell, which synthesizes cyclic AMP from ATP.
• Binding of cyclic AMP activates protein kinase A. Protein kinase A forms inactive tetramers containing two catalytic and two regulatory subunits, or dissociates into active monomers. Binding of cyclic AMP to protein kinase A breaks up the tetramer, releasing monomeric catalytic subunits in active form.
• Active protein kinase A phosphorylates tyrosine hydroxylase at Ser40, upregulating its activity.
• The mechanism balancing this stimulation is the specific dephosphorylation of Ser40 by phosphatase 2A.
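
The balance between phosphorylation by protein kinase A and dephosphorylation by phosphatase 2A can be pictured with a very small model. The sketch below assumes, as a simplification not stated in the box, that both steps are first order in the enzyme form they act on; the rate constants and the function name are hypothetical values chosen only to show the behaviour.

```python
def phosphorylated_fraction(k_kinase, k_phosphatase):
    """Steady-state fraction of tyrosine hydroxylase phosphorylated at Ser40.

    Assumed first-order cycle:
        unphosphorylated --k_kinase-->      phosphorylated
        phosphorylated   --k_phosphatase--> unphosphorylated
    At steady state k_kinase * (1 - f) = k_phosphatase * f, which gives
    f = k_kinase / (k_kinase + k_phosphatase).
    """
    return k_kinase / (k_kinase + k_phosphatase)


# Hypothetical rates: low kinase activity before the signal arrives,
# higher activity once cyclic AMP has released the catalytic subunits.
basal = phosphorylated_fraction(k_kinase=0.1, k_phosphatase=1.0)
stimulated = phosphorylated_fraction(k_kinase=2.0, k_phosphatase=1.0)
```

In this picture the phosphatase sets how quickly and how far the system returns towards the basal state once kinase activity falls again.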

Figure 11.15 Some types of interactions involved in regulatory

signalling. (a) Binding of a peptide (magenta) by an SH3 domain
[1CKA]. SH3 domains are common constituents of regulatory
proteins. Functions of SH3 domains include signal transduction,
protein and vesicle trafficking, cytoskeletal organization, cell
polarization, and organelle biosynthesis. (b) Domain–domain
interaction. PDZ domains in syntrophin (magenta) and neuronal
nitric oxide synthase (cyan) [1QAV].

Figure 11.16 Glucocorticoid receptor binding domains with dexamethasone, a cortisol analogue that has anti-inflammatory and immunosuppressant activities [1M2Z] (see page 271). Ligation-induced dimerization can lead to translocation from the cytoplasm into the nucleus, and activation of gene transcription by DNA-binding domains (not shown).
[Figure 11.17: network diagrams, panels (a) and (b). Nodes are labelled with protein family names (for example COX1, ATP_synt, ubiquitin, cyclin, pkinase, WD40, HSP70, TFIIA). Functional categories in the key: Signalling; Ubiquitin proteases; RNA polymerase; Secretory pathway; ATP synthase; Folding; Cytochrome C1; Cytochrome oxidase; Chromosome structure.]

Figure 11.17 Portion of the interaction network of yeast proteins: (a) describes the interactions of individual proteins, and (b) shows the interactions within a subnetwork based on
representations of different protein families, in different functional categories, linked in (a). This figure is based on structural data and modelling. Each relationship implies a physical
interaction between the proteins. Some of the interactions involve stable complexes (for instance, RNA polymerase II); others involve transient complexes.
From: Aloy, P. & Russell, R. (2005). Structure-based systems biology: a zoom lens for the cell. FEBS Lett. 579, 1854–1858.

Protein–DNA interactions

DNA–protein complexes mediate several types of process:

• replication, including repair and recombination;
• transcription;
• regulation of gene expression; and
• DNA packaging, including nucleosomes and viral capsids.

Different processes require different degrees of DNA-sequence specificity (see Box 11.9).

BOX 11.9 Specificities of DNA-binding proteins

DNA-binding proteins show varying degrees of DNA-sequence specificity.

• Some DNA-binding proteins are relatively non-specific with respect to nucleotide sequence, including DNA replication enzymes and histones.
• Some, for instance EcoRV, bind to DNA with low specificity but cleave only at GATATC. This combination permits a mechanism of finding the target sequence by initial non-specific binding followed by diffusion in one dimension along the DNA.
• Some recognize specific nucleotide sequences. For example, the EcoRI restriction endonuclease binds specifically to GAATTC sequences with almost absolute specificity. It is a homodimer that recognizes palindromic sequences.
• Some DNA-binding proteins recognize consensus sequences. For example, the phage Mu transposase and repressor proteins bind 11 bp sequences of the form CTTT[A/T]PyNPu[A/T]A[A/T] (where [A/T] = A or T, Py = either pyrimidine (C or T), Pu = either purine (A or G), and N = any of the four bases).
• Some recognize nucleotide sequences indirectly, via modulations of local DNA structure. For example, the TATA box-binding protein takes advantage of the greater flexibility of AT-rich sequences to form complexes in which the DNA is very strongly bent (see Figure 11.23). The distinction between sequence specificity achieved through direct interaction with bases and specificity through recognition of local structure has been termed ‘digital versus analogue readout’.
• Some recognize general structural features of DNA, such as mismatched bases or supercoiling.
• Some DNA-binding proteins form an initial complex with high DNA-sequence specificity, followed by recruitment of other proteins of low specificity to enhance overall binding affinity or create a functional complex.

Structural themes in protein–DNA binding and sequence recognition

What does a protein looking at a stretch of DNA in the standard B conformation see? (See Figure 3.12.) What could it hope to grab hold of? Prominent general features are the sugar–phosphate backbone, including charged phosphates suitable for salt bridges and potential hydrogen-bond partners in the sugar hydroxyl groups. Contact with the bases is accessible through the major and minor grooves, although unless the DNA is distorted the bases are visible only ‘edge on’. Hydrogen-bonding patterns between bases in the grooves and particular amino acids account for some of the DNA-sequence specificity in binding. However, many protein–DNA hydrogen bonds are mediated by intervening water molecules, an effect that tends to reduce the specificity.

The idea that an α-helix has the right size and shape to fit into the major groove of DNA was noted in the 1950s. The structures of the first protein–DNA complexes confirmed this prediction. It became the paradigm for protein–DNA interactions. Indeed, when a student solving the structure of the Met repressor–DNA complex told his supervisor that, in the electron-density map he was interpreting, it looked as if a β-sheet were binding in the major groove, he was advised, with patience strongly tinged with condescension, to go back and look for the helix.

We now recognize great structural variety in DNA–protein interactions. A few examples include:

• Helix-turn-helix domains. These appear in prokaryotic proteins that regulate gene expression, eukaryotic homeodomains involved in developmental control, and histones that package DNA in chromosomes.
• Zinc fingers, including eukaryotic transcription factors, and steroid and hormone receptors.
• Proteins with β-sheets that interact with DNA, for instance the gene regulatory proteins, Met and Arc repressors, and the TATA box-binding protein.
• Leucine zippers that act as eukaryotic transcriptional regulators.
• The ‘high mobility group’ proteins in eukaryotes and the prokaryotic protein HU, which bind sequences non-specifically and bend DNA.
• Enzymes that interact with DNA and are involved in replication, transcription, repair, and uncoiling. Some are relatively small; others are large multiprotein complexes. They show many different types of folding pattern. Many distort the DNA structure in order to get access to the bases that are the target of their activity.
• Viral capsid proteins and histones, which package DNA into compact forms.

These examples are an anecdotal list, not a classification.
RNA-binding proteins have a separate variety. Some resemble DNA-binding proteins. Others bind to RNA molecules of defined structure; for instance, enzymes that interact with tRNA, including but not limited to aminoacyl-tRNA synthetases, and the ribosome itself.
From the point of view of systems biology, a very important class of DNA-binding proteins is transcriptional regulators. These proteins and their complexes with DNA have been a focus of structural biology. There is a great variety of structures and a few recurrent structural themes.

An album of transcription regulators

λ Cro
Bacteriophage λ is a virus containing a double-stranded DNA genome of 48 502 bp. A λ phage infecting an E. coli cell chooses – depending on which genes are active – between lysis and lysogeny. Phage λ can replicate and lyse the cell, releasing ∼100 progeny. Alternatively, it can integrate its DNA into the host genome. The phage in such a lysogenized cell is dormant and can be released by stimuli that switch it from the lysogenic to the lytic state.
λ Cro binds to DNA as a symmetrical dimer (see Figure 11.18). Its target sequence is approximately palindromic, in the sense that two complementary strands contain approximately the same sequences in reverse order (see Exercise 11.5):

CTATCACCGCAAGGGATAA
GATAGTGGCGTTCCCTATT

The bases to which the protein makes contact are shown in bold-face. The protein interacts with both strands.

Figure 11.18 Bacteriophage λ Cro is an example of the ‘helix-turn-helix’ structural motif. Following along the chain, the first secondary structure is a helix, followed by two more helices that frame the motif. The second of these two helices (the third helix in the molecule) – called the recognition helix – lies in the major groove and makes extensive contacts with the DNA. A long C-terminal tail wraps around the DNA, following the minor groove [1CRO].

The eukaryotic homeodomain antennapedia
Homeodomains are highly conserved eukaryotic proteins, active in control of animal development. They regulate homeotic genes, i.e. genes that specify locations of body parts. Antennapedia is a Drosophila protein responsible for initiating leg development (see Figure 11.19). The earliest mutations found in antennapedia produced ectopic legs at the positions of antennae. Loss-of-function mutations produce antennae at the positions of legs.
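The approximate palindrome in the λ Cro target sequence quoted above can be checked directly. The following is a minimal Python sketch, not taken from the book: the helper names `complement`, `reverse_complement`, and `palindrome_matches` are illustrative choices, and the two sequences are the strands printed in the text.

```python
# Watson-Crick pairing table used to build the complementary strand.
PAIR = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.translate(PAIR)

def reverse_complement(seq):
    """The complementary strand read in the opposite direction."""
    return complement(seq)[::-1]

def palindrome_matches(seq):
    """Count positions where a strand agrees with its own reverse complement."""
    return sum(a == b for a, b in zip(seq, reverse_complement(seq)))

top = "CTATCACCGCAAGGGATAA"      # upper strand as printed in the text
bottom = "GATAGTGGCGTTCCCTATT"   # lower strand as printed in the text

matches = palindrome_matches(top)
print(complement(top) == bottom)
print(f"{matches} of {len(top)} positions match the reverse complement")
```

For this 19-base site, 10 positions coincide, and they fall in symmetric pairs about the centre. The central base of an odd-length site can never equal its own complement, so a perfect score is impossible here, which is one reason the text says "approximately palindromic".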

Figure 11.19 The homeodomain antennapedia–DNA complex [9ANT]. As with many DNA-binding proteins, an α-helix binds in the major groove of DNA. The structure of the antennapedia–DNA complex resembles, in some respects, prokaryotic helix-turn-helix proteins such as λ Cro. However, the tail that wraps around into the minor groove is N terminal to the helix-turn-helix motif in antennapedia, instead of C terminal as in λ Cro.

Leucine zippers as transcriptional regulators
Leucine zippers form another type of dimeric transcriptional regulator (see Figure 11.20). The Jun protein forms a homodimer consisting of an N-terminal domain with many positively charged sidechains that bind to DNA and a C-terminal leucine zipper domain involved in dimerization. Jun can dimerize not only with itself but with other related proteins, notably Fos. Different dimers have different DNA-sequence specificities and different affinities, affording subtle patterns of control.

Zinc fingers
Zinc fingers are small modules found in eukaryotic transcription regulators. Each finger recognizes a triplet of bases in DNA. Tandem arrays of fingers recognize an extended region (see Figure 11.21). Understanding the relationship between the amino acid sequences of individual zinc fingers and the DNA sequences they bind would permit modular design of gene-specific repressors, by assembling a sequence of fingers.

The E. coli Met repressor
The Met repressor negatively regulates genes in the methionine biosynthesis pathway (see Figure 11.22).

The TATA box-binding protein
A TATA box is a sequence (consensus TATA[A/T]A[A/T]) upstream of the transcriptional start site of eukaryotic genes. Recognition of this sequence by the TATA box-binding protein (see Figure 11.23) initiates the formation of the basal transcription complex, a large multiprotein particle. This is an
Figure 11.21 Zif268, a tandem three-finger structure binding
the sequence GCGTGGGCG [1AAY]. Each finger interacts with
three consecutive bases. In each finger, three positions along the
α-helix, non-consecutive in the amino acid sequence, contain
primary determinants of the DNA-sequence specificity.
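The one-finger-per-triplet reading described for Zif268 can be illustrated by splitting its binding site into consecutive triplets. This is a small Python sketch under stated assumptions: the function name `finger_triplets` is hypothetical, and the sketch says nothing about which finger contacts which triplet or in which orientation.

```python
def finger_triplets(site):
    """Split a binding site into consecutive 3-base subsites, one per finger."""
    if len(site) % 3 != 0:
        raise ValueError("site length must be a multiple of 3")
    return [site[i:i + 3] for i in range(0, len(site), 3)]

zif268_site = "GCGTGGGCG"   # sequence quoted in the Figure 11.21 caption
triplets = finger_triplets(zif268_site)
print(triplets)
```

A modular design scheme of the kind the text envisages would run this idea in reverse: choose a finger for each desired triplet and concatenate them.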

Figure 11.20 Jun dimer binding to DNA [1JNM]. The proteins grip the DNA as if they were picking it up with chopsticks. The α-helices bind in major grooves on opposite sides of the double helix. This structure shares with λ Cro, and many other DNA-binding proteins, the symmetry of the complex, which mimics the dyad symmetry of the DNA double helix. This requires, on the part of the protein, formation of symmetrical dimers, and on the part of the DNA, an approximately palindromic target sequence. For Jun dimers, the target sequence is ATGACGTCAT.

Figure 11.22 Like many other DNA-binding proteins, the Met repressor binds DNA as a symmetrical dimer [1CMA]. In the complex, each monomer contributes one strand of two-stranded β-sheet, which sits in the major groove, with sidechains making hydrogen bonds to bases. The co-repressor, S-adenosylmethionine, is required for high-affinity binding.

Figure 11.23 The TATA box-binding protein [1YTB]. The obvious feature of this complex is the very strong bending and unwinding
induced in the DNA. A long curved β-sheet sits against an unusually flat surface on the DNA, the result of prying open of the minor
groove. Phe sidechains intercalate between the bases.
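The TATA box consensus TATA[A/T]A[A/T] given in the text maps naturally onto a regular expression. The sketch below uses Python's standard `re` module; the function name `find_tata_boxes` and the example sequence are made up for illustration. A realistic promoter search would also have to consider distance from the transcription start site and tolerate mismatches, neither of which is attempted here.

```python
import re

# Bracketed alternatives in the consensus become character classes.
TATA_CONSENSUS = re.compile(r"TATA[AT]A[AT]")

def find_tata_boxes(dna):
    """Return (start index, matched text) for each non-overlapping consensus match."""
    return [(m.start(), m.group()) for m in TATA_CONSENSUS.finditer(dna.upper())]

example = "GGCTATAAAAGGC"   # hypothetical sequence, not from the book
hits = find_tata_boxes(example)
print(hits)
```

Because `finditer` reports non-overlapping matches, overlapping candidate sites in a long AT-rich stretch would need a lookahead pattern instead.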

Figure 11.24 The structure of the DNA-binding subunit of p53 shows a double-β-sheet fold [1TSR]. A helix sits in the major groove and
sidechains from loops connecting strands of the β-sheet insert into the minor groove.

example of initial binding of a protein to DNA followed by recruitment of other proteins to form an active complex.

p53
p53 is a transcriptional activator and a tumour suppressor (see Figure 11.24). It is of great clinical importance because mutations in the gene for p53 are very common in tumours.
p53 acts by surveilling genome integrity. Damage to DNA induces enhanced expression of p53, which stalls cell-cycle progression. This gives time for DNA repair; if repair is unsuccessful, the ‘fail-safe’ mechanism is apoptosis.

Gene regulation

Cells regulate the expression patterns of their genes. They sense internal cues to maintain metabolic stability, and external cues to respond to changes in the surroundings. The point of contact between genome and expression is the binding of RNA polymerase to promoter sequences, upstream of genes, to initiate transcription. This sensitive point is a juicy target for regulatory interactions (see Box 11.10).

BOX 11.10 Vocabulary of gene regulation

Operator A control region associated with a gene.
Promoter A region upstream of a gene, the site of RNA polymerase binding to initiate transcription.
Repressor A DNA-binding protein that blocks transcription.
Operon A set of tandem genes in bacteria, usually catalysing consecutive steps in a metabolic pathway, under coordinated transcriptional control.
cis-Regulatory region A segment of DNA that regulates expression of genes on the same DNA molecule. The lac repressor binding site is a cis-regulator of the adjacent protein-coding genes lacZ, lacY, and lacA (see p. 380).
Transcription start site The position in the gene that corresponds to the first residue in the mRNA.
Constitutive mutant A mutant defective in repression of a gene, which in consequence is expressed continuously.

The transcriptional regulatory network of Escherichia coli

Investigation of the mechanism of transcription regulation began with the work of F. Jacob and J. Monod on the lac operon in E. coli. The field has burgeoned, with comprehensive studies of the E. coli regulatory network, together with work on other organisms, notably yeast. E. coli contains genes for 4398 proteins, 167 of which are recognized transcription factors. There are 2369 regulatory interactions, among the transcription factors and the genes they control (Figure 11.25).

• There are many fewer known regulatory interactions than genes. Many genes in E. coli are organized into operons, under coordinated control – one regulatory interaction controlling many genes. E. coli is estimated to contain ∼2700 operons. Conclusion: more interactions remain to be discovered.

This network has been the subject of many investigations. There are questions about the static topology. Some of these address the local structure of the network, deriving the common types of small subgraphs, or the motifs. The fork, scatter pattern, and feed-forward loop are motifs in the regulatory networks of E. coli and other organisms (see Box 11.11). The network contains a large number of auto-regulatory connections. An auto-regulatory activator amplifies responses; an auto-regulatory repressor damps them out. In the E. coli regulatory network, links between different transcription factors are primarily activating, and auto-regulatory interactions are often repressive. A high density of repressive auto-regulatory interactions increases what might be thought of as the viscosity of the medium in which the network is active. The ‘one-two punch’, or feed-forward loop, motif, can also act to filter out random fluctuations, preventing the propagation of noise.
Combinations of the elementary motifs form modules within the network. These clusters of nodes are often dedicated to control of expression of genes with related physiological functions, such as a group of proteins responding to oxidative stress, or a group involved in aromatic amino acid biosynthesis. Such sets of genes need not be linked as a single operon.
Other analyses address the large-scale structure of the network. The distribution of degrees of the nodes, that is, the histogram of the number of edges meeting at a node, follows a power law:

number of nodes with k edges $\propto 10^{-bk}$

with b ≈ 0.8. The scale-free topology means that some nodes have many connections, and form the ‘hubs’ of the network.
Is there substantial feedback from downstream nodes? (A social analogy: is the network hierarchical or democratic? That is, does your boss listen to you, or just give orders? Ring Lardner’s classic sentence, ‘Shut up, he explained’, emphasizes an absence, or dysfunction, of receptors for feedback signals from lower levels back to higher ones.) Although the E. coli transcriptional regulatory network contains many auto-regulatory interactions, it does not contain larger cycles; that is, paths in which gene A regulates gene B regulates gene C regulates . . . regulates gene A. Such cycles can lead to instabilities in the dynamics of a network.
Correlation of the topology of segments of the network with function suggests that short pathways, feed-forward loops, and repressive auto-regulatory interactions are involved in control of metabolic functions, such as a switch to alternative nutrients. This type of network topology is adapted to maintenance of homeostasis. Long hierarchical cascades, and activating auto-regulatory interactions, regulate developmental processes, such as biofilm formation, and flagellar development involved in mobility and chemotaxis.
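Several of the structural questions raised above (which nodes are auto-regulatory, how connections are distributed, and whether any cycle longer than a self-loop exists) can be asked of any directed graph. The following Python sketch does so for a toy network. Everything in it is an assumption made for illustration: the gene names, the wiring, and the function names are invented, and out-degree is used as a simple stand-in for the "number of edges meeting at a node".

```python
from collections import Counter

# Hypothetical toy network: each key regulates the genes in its set.
toy_network = {
    "tfA": {"tfA", "tfB", "g1", "g2", "g3"},  # a small hub that also regulates itself
    "tfB": {"g3", "g4"},
    "g1": set(), "g2": set(), "g3": set(), "g4": set(),
}

def auto_regulators(net):
    """Nodes with an edge back to themselves (auto-regulation)."""
    return sorted(node for node, targets in net.items() if node in targets)

def out_degree_histogram(net):
    """Map each out-degree k to the number of nodes with that many outgoing edges."""
    return Counter(len(targets) for targets in net.values())

def has_larger_cycle(net):
    """Depth-first search for any cycle A -> B -> ... -> A, ignoring self-loops."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = {node: UNSEEN for node in net}

    def visit(node):
        state[node] = ACTIVE
        for target in net[node]:
            if target == node:
                continue                      # self-loops are auto-regulation, not larger cycles
            if state[target] == ACTIVE:
                return True                   # reached a node still on the current path
            if state[target] == UNSEEN and visit(target):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNSEEN and visit(node) for node in net)

print(auto_regulators(toy_network))
print(dict(out_degree_histogram(toy_network)))
print(has_larger_cycle(toy_network))
```

Adding a single feedback edge, for example from `g4` back to `tfA`, is enough to make `has_larger_cycle` return `True`, which is the kind of loop the text reports as absent from the E. coli network.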

Figure 11.25 The E. coli transcriptional regulatory network represented as a directed graph. Colour-coding of nodes: transcription
factors are shown as blue squares; regulated operons are shown as red circles. Colour-coding of links: activators, blue; repressors, green;
indeterminate, brown.
From: Dobrin, R., Beg, Q.K., Barabási, A.L., & Oltvai, Z.N. (2004). Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 5, 10.

BOX 11.11 Common motifs in biological control networks

Within the high complexity of typical regulatory networks, certain common patterns appear frequently. In the architecture of networks, these form building blocks which contribute to higher levels of organization. Shen-Orr, Milo, Mangan, & Alon* have described examples including: the fork, the scatter, and the ‘one-two punch’ (a phrase from the boxing ring):

[Diagrams: fork, scatter, ‘one-two punch’]

The fork, also called the single-input motif, transmits a single incoming signal to two outputs. Successive forks, or forks with higher branching degrees, are an effective way to activate large sets of genes from a single impulse. Generalizations of the binary fork include more downstream genes under common control (more tines to the fork), and auto-regulation of the control node. Forks can achieve general mobilization. Moreover, if the regulatory genes have different thresholds for activation, the dynamics of building up the signal can produce a temporal pattern of successive initiation of the expression of different genes.
The scatter configuration, also called the multiple input motif, can function as a logical ‘or’ operation: both downstream targets become active if either of the input impulses is active. Generalizations of the square scatter pattern shown may contain different numbers of nodes on both layers. Note that scatter patterns are superpositions of forks.
The ‘one-two punch’, also called the ‘feed-forward loop’, affects the output both directly through the vertical link; and indirectly and subsequently, through the intermediate link.
This motif can show interesting temporal behaviour if activation of the target requires simultaneous input from both direct and indirect paths (logical ‘and’). Because build-up of the intermediate requires time, the direct signal will arrive before the indirect one. Therefore a short pulsed input to the complex will not activate the output – by the time the indirect signal builds up, the direct signal is no longer active. The system can thereby filter out transient stimuli in noisy inputs (Figure 11.26). Conversely, the active state of the system can shut down quickly upon withdrawal of the external trigger.

* Shen-Orr, S.S., Milo, R., Mangan, S., & Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 31, 64–68.

[Figure 11.26 diagram. (a) Constant input to node 1: when outputs from 1 and 2 simultaneously exceed threshold, node 3 fires. (b) Pulse input to node 1: outputs from 1 and 2 never simultaneously exceed threshold, therefore node 3 never fires.]

Figure 11.26 A ‘one-two punch’, or feed-forward loop, equipped with suitable AND logic at the downstream node, can filter out
transient noise. (a) Constant input; (b) pulse input. The mechanism of signal transmission is the synthesis of a stimulatory molecule
by an activated node. To avoid ‘locking the signal on’, this molecule must subsequently be removed. The effect described here
depends on the time course of build-up and decay of the signal.
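The filtering behaviour described for the feed-forward loop can be reproduced with a very small discrete-time simulation. This is a sketch under stated assumptions rather than a model from the book: the update rule, the `rate` and `threshold` values, and the function name `simulate_feed_forward` are all illustrative choices. Node 1 is the input, node 2 is an intermediate that builds up and decays gradually, and node 3 fires only when both signals are above threshold (AND logic).

```python
def simulate_feed_forward(inputs, rate=0.2, threshold=0.5):
    """Return, for each time step, whether the output node fires.

    inputs    -- sequence of 0/1 values: the signal arriving at node 1
    rate      -- fraction by which the intermediate moves toward the input per step
    threshold -- level both signals must exceed for the output to fire
    """
    intermediate = 0.0
    fired = []
    for x in inputs:
        # The intermediate relaxes toward the input: it builds up while the
        # input is on and decays once the input is withdrawn.
        intermediate += rate * (x - intermediate)
        # AND logic: direct signal present AND intermediate above threshold.
        fired.append(x > threshold and intermediate > threshold)
    return fired

constant = simulate_feed_forward([1] * 10)            # sustained input
pulse = simulate_feed_forward([1, 1, 1] + [0] * 7)    # short transient

print("constant input fires:", any(constant))
print("short pulse fires:", any(pulse))
```

With these parameters the sustained input switches the output on at the fourth step, whereas the three-step pulse ends before the intermediate crosses the threshold, so the output never fires. Because the direct signal is part of the AND condition, the output also drops as soon as the input is withdrawn.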

The dynamic properties of the network are also of interest. These include both the response of the network to changing conditions, as in the lac operon, and the comparison of regulatory networks in related organisms to understand how networks evolve.
Even a constant network can produce different outputs from different inputs. However, even within an organism, networks can change their structure in response to changes in conditions. This can affect even some of the hubs of the network, the points at which changes have the most far-reaching effects. This has been examined most closely in yeast (see p. 383).
Similarities and differences in the regulatory interactions in related organisms illuminate how their networks evolve. The evolutionary retention of transcription factors is smaller than that of target genes. Even transcription factors that serve as hubs are not more highly conserved. Different organisms are relatively free to explore different regulatory pathways, even to regulate orthologous genes. This may well be the ‘other side of the coin’ of the redundancy in the networks that provides robustness.
There is evidence that larger changes in regulatory networks reflect changes in lifestyle. Organisms with similar lifestyle – several species of soil bacteria of the genera Bacillus, Corynebacterium, and Mycobacterium – conserve regulatory interactions, as do intracellular parasites Mycoplasma, Rickettsiae, and Chlamydiae. Conversely, when organisms adopt different lifestyles, individual transcription factors can be deleted.

• Regulatory networks are directed graphs. Some simple motifs, or common small subgraphs, form the lowest level of network structure. Networks can reprogram themselves, within an organism; and evolve, between species.

Regulation of the lactose operon in E. coli
The lactose operon of E. coli contains structural genes and regulatory regions (see Table 11.2 and Figure 11.27a). Three regions of the operon encode proteins; these are co-transcribed into a single mRNA and translated separately. The lacI gene (encoding lac repressor) lies upstream of the operator and is constitutively expressed.

[Figure 11.27 diagram. (a) Promoter region with CAP site and operator upstream of lacZ, lacY, lacA: low level of transcription. (b) CAP and RNA polymerase bound in the promoter region: high level of transcription. (c) Repressor (rep) bound in the promoter region: no transcription.]

Figure 11.27 States of the lactose operon. (a) The promoter region contains regulatory sites upstream of the protein-encoding genes lacZ, lacY, and lacA. (b) Binding of CAP to its upstream site within the promoter enhances the binding affinity of RNA polymerase, turning transcription on. (c) Binding of repressor blocks binding of RNA polymerase, turning transcription off. Two additional subsidiary repressor binding sites are not shown.

• Figure 11.27(c) is a simplification: lac repressor actually binds to three sites. The cover of the 1 March 1996 issue of Science magazine shows a model of the lac repressor, binding to two sites on a ∼100 bp section of DNA, plus the CAP protein. The DNA is bent into a loop.

The control regions of the lactose operon function as a switch, turning on and off transcription of the protein-encoding genes. Control is exerted through binding of CAP to the CAP site and lac repressor protein to the operator site. (The CAP–DNA complex is very similar to the structure shown in Figure 11.18. CAP stands for catabolite activator protein; CRP stands for cyclic-AMP receptor protein. These are the same protein. Unfortunately CRP also stands for C-reactive protein, a different protein synthesized in the liver and a useful marker in humans of inflammation.) Cyclic AMP (cAMP), produced in higher quantities if glucose is absent, and lactose

Table 11.2 Functions of the protein-encoding genes of the lactose operon

Gene   Enzyme                         Function
lacZ   β-Galactosidase                Hydrolyses lactose → glucose + galactose; isomerizes lactose → allolactose
lacY   β-Galactoside permease         Pumps lactose into the cell
lacA   β-Galactoside transacetylase   Unknown, possibly detoxification?

analogues bind to CAP and lac repressor proteins, respectively, to control their binding affinities.

• Binding of lactose induces a conformational change in the repressor, from a tightly binding form to a weakly binding one. Lac repressor will bind only in the absence of lactose.
• The presence of glucose reduces the concentration of cyclic AMP, causing a conformational change in CAP to a weakly binding form. CAP will bind only in the absence of glucose.

The actual molecule that binds to repressor to reduce its affinity for its site on DNA is allolactose. Allolactose is an isomer of lactose, produced from lactose by β-galactosidase. Alternative lactose analogues also stimulate transcription. One that is useful in the laboratory is isopropylthiogalactoside (IPTG), for two reasons: (1) IPTG enters the cell even if the lacY-encoded transporter is dysfunctional or not expressed; and (2) IPTG is not metabolized; therefore its concentration stays constant during the course of an experiment.
The switch thereby responds to the type of sugar in the medium.

• If both glucose and lactose are present, neither control protein binds (Figure 11.27a). RNA polymerase binds only weakly. Transcription occurs at a low basal level.
• If glucose is not present and lactose is present, the CAP–cAMP complex binds to the promoter. The binding of CAP–cAMP with RNA polymerase is cooperative. Interactions with CAP–cAMP increase the affinity of RNA polymerase, thereby stimulating transcription to approximately 40 times the basal level (see Figure 11.27b).
• If lactose is not present, the repressor binds to the operator site. This blocks RNA polymerase and turns off transcription (see Figure 11.27c).

In summary (− means ‘absence of’; + means ‘presence of’):

              Lactose −     Lactose +
Glucose −     Repression    Robust transcription
Glucose +     Repression    Basal level of transcription

The operator site is between the CAP site and the origin of transcription. As a result, lactose absence ‘trumps’ glucose absence. That is, in the absence of lactose, the binding of repressor stops transcription whether or not glucose is absent.
The effect is to express the proteins of the lac operon only if the medium contains lactose but not glucose. The bacteria are saying, in effect, ‘I prefer to grow on glucose. If glucose is there, I don’t want high expression levels of the genes for lactose transport and metabolism, even if lactose is present. Only if lactose is present and glucose is not present, express the genes that transport and cleave lactose.’
The lactose operon switch is an example of a ‘fire-and-forget’ mechanism. Once the mRNA is synthesized, what happens on the DNA does not affect it. However, the mRNA for the protein-coding genes of the lac operon (lacZ, lacY, and lacA) has a half-life of ∼3 minutes. In the absence of continuous or repeated induction, synthesis of the lactose-metabolizing enzymes will cease within minutes. This resets the switch.

Logical diagram of the lac operon
Figure 11.28 represents the lac operon control logic as a network. This diagram is almost equivalent to the table showing the response to presence and absence of glucose and lactose (see Exercise 11.7).

[Figure 11.28 diagram: inputs ‘Glucose present’ and ‘Lactose present’ feed an AND node whose output is ‘Transcription?’]

Figure 11.28 Logical diagram of lac operon. Green arrows show positive regulation. The red ‘T’ shows negative regulation. The circle containing AND will pass through a positive signal only if glucose is not present (red ‘T’) and lactose is present (green arrow). For many regulatory circuits, we know many of the inputs to a node but do not know the logic.
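The three cases listed in the text, and the summary table, amount to a small truth table. The Python sketch below encodes it as a function. The name `lac_operon_state` and the three string labels are illustrative choices, and the function captures only the qualitative logic, not binding affinities or expression levels.

```python
def lac_operon_state(glucose, lactose):
    """Qualitative transcription state of the lac operon for a given medium.

    glucose, lactose -- booleans: is the sugar present in the medium?
    """
    repressor_bound = not lactose   # repressor binds the operator only without lactose
    cap_bound = not glucose         # CAP-cAMP binds its site only without glucose

    if repressor_bound:
        return "repressed"          # repressor blocks RNA polymerase regardless of CAP
    if cap_bound:
        return "high"               # CAP-cAMP recruits RNA polymerase (about 40x basal)
    return "basal"                  # neither control protein bound

for glucose in (False, True):
    for lactose in (False, True):
        print(f"glucose={glucose!s:5} lactose={lactose!s:5} -> "
              f"{lac_operon_state(glucose, lactose)}")
```

Checking the repressor first mirrors the statement that lactose absence ‘trumps’ glucose absence. The three-valued output also shows why a plain AND gate is only almost equivalent to the table: the gate cannot distinguish basal transcription from repression.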

The genetic regulatory network of Saccharomyces cerevisiae

A recent study of transcription regulation in yeast treated a network containing 3459 genes, corresponding to approximately half of the known proteome of S. cerevisiae. The genes included 142 that encode transcription regulators and 3317 target genes other than transcription regulators. There are 7074 known regulatory interactions among these genes, including effects of regulators on one another and of regulators on non-regulatory targets.
Analysis of the overall network architecture revealed several features.

• The distribution of incoming connections to target genes has a mean value of 2.1 and is exponential. Most target genes receive direct input from about two transcriptional regulators. The probability that a gene is controlled by k transcription regulators, k = 1, 2, . . . , is proportional to $e^{-ak}$, with a = 0.8.
• The distribution of outgoing connections has a mean value of 49.8 and obeys a power law. The probability that a given transcriptional regulator controls k genes is proportional to $k^{-b}$, with b = 0.6. Power-law behaviour is common in networks and characterizes topologies in which a few nodes – the ‘hubs’ – have many connections and many nodes have few. In regulatory networks, hubs tend to be fairly far upstream, forming important foci of regulation with far-reaching control.
• The average number of intermediate nodes in a minimal path between a transcriptional regulator and a target gene is 4.7. The maximal number of intermediate nodes in a path between two nodes is 12.
• The clustering coefficient of a node is a measure of the degree of local connectivity within a network. If all neighbours of a node are connected to one another, the clustering coefficient of the node = 1. If no pair of neighbours of a node are connected to each other, the clustering coefficient of the node = 0. The mean clustering coefficient, averaged over all nodes, is a measure of the local density of the network. For the yeast transcriptional regulatory network, the mean clustering coefficient is 0.11.

Figure 11.29 is a cartoon-like sketch of a fragment of such a network indicating, rather loosely, some of its general features. Nodes are divided into transcriptional regulators, shown as circles, and target genes, shown as squares. Target genes are distinguished by having no output connections. There is extensive interregulation among the transcription factors, to a much higher density of interconnections than can intelligibly be shown in this diagram. Think of a seething broth of transcription factors, within the shaded area, sending out signals to target genes. The shaded area indicates only the logical clustering of the transcriptional regulators. There is no suggestion about physical localization; indeed, transcriptional regulators interact with DNA and almost never interact physically with the proteins whose expression they control.

[Figure 11.29 diagram: transcriptional regulators (circles), each with about 50 outgoing connections, linked through five intermediate nodes to target genes (squares), ending at the ultimate receptor of the signal.]

Figure 11.29 Simplified sketch illustrating some features of an ‘average’ segment of the pathways in the yeast interaction network. Transcriptional regulators appear as circles. Target genes appear as squares. A transcriptional regulator typically has direct influence over about 50 genes, indicated by multiple connections from the filled black circle to the circles on the line below it. Roughly one in ten of the neighbours of any node is connected to another neighbour, indicated by the horizontal arrow on the second row. The ultimate receptor of the signal lies at the end of a pathway typically containing about five intermediate nodes (shown in black). This ultimate target gene receives on average about two inputs. This diagram shows only a small fragment of a network that is in fact quite dense.
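The clustering coefficient defined in the list above is straightforward to compute for a small graph. The sketch below is illustrative only: the four-node graph and the function names are invented, and edges are treated as undirected for simplicity, which is an assumption of this example and not a statement about how the yeast study handled edge direction.

```python
from itertools import combinations

# Hypothetical toy graph: A, B, C form a triangle; D hangs off A.
# Each node maps to the set of its neighbours (edges are undirected).
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

def clustering_coefficient(graph, node):
    """Fraction of pairs of a node's neighbours that are themselves connected."""
    neighbours = graph[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0                      # no pairs of neighbours to connect
    linked_pairs = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return linked_pairs / (k * (k - 1) / 2)

def mean_clustering(graph):
    """Average of the per-node clustering coefficients."""
    return sum(clustering_coefficient(graph, n) for n in graph) / len(graph)

for node in sorted(toy_graph):
    print(node, round(clustering_coefficient(toy_graph, node), 3))
print("mean:", round(mean_clustering(toy_graph), 3))
```

In this toy graph B and C each score 1 (their two neighbours are linked), A scores 1/3, and D scores 0, for a mean of 7/12. A mean of 0.11, as reported for yeast, therefore indicates that only about one pair of neighbours in ten is directly connected.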

Each transcriptional regulator directly influences approximately 50 genes on average, although, as with other ‘small-world’ networks following power-law distributions of connectivities, the distribution is very skewed – some ‘hubs’ have very many output connections, but most nodes have very few. A few of the interregulatory connections between transcription factors are shown in red. In about 10% of cases, two neighbours of the same transcription factor interact with each other. A path from one regulator (filled black circle) to one ultimate receptor (filled black square), through five intermediate nodes, is shown in black. The intermediate nodes are other transcriptional regulators, connected both within the path drawn in black and off this path. Even the transcription factor used as the origin of the path receives input connections. Although it is possible to identify target genes from the absence of outgoing connections, it is more difficult to identify ultimate initiators of signal cascades.
The ultimate receptor is a target gene that receives regulatory input but itself has no output links. This target is expected to receive (on average) a second control input. The black target node receives input via a black arrow, along the selected path, and via a red arrow suggesting the second input. Of course the second input may arrive via a path that shares common nodes with the black path, including other routes from the filled black circle.
The dense forest of additional pathways from which this fragment is extracted is not shown. Some ‘back-of-the-envelope’ calculations indicate: (1) there are ∼3500 nodes, each receiving an average of two input connections; (2) there are ∼140 transcription factors, making an average of 50 output connections; (3) the number of input connections must equal the number of output connections, and indeed 3500 × 2 = 140 × 50 = 7000.
Given the complexity, it is difficult to illustrate larger segments of the network in more detail than the simplified version appearing in Figure 11.29. Analysis of the structures of regulatory networks is an active current research topic. The motifs described in Box 11.11 are the ‘secondary structures’ of network architectures.
The high ratio of interactions to transcription regulators implies that we cannot expect to associate individual regulatory molecules with single, dedicated activities (as we can, for the most part, with metabolic enzymes). Instead, the activity of the network involves the coordinated activities of many individual regulatory molecules.

Adaptability of the yeast regulatory network
The yeast regulatory network achieves versatility and responsiveness by reconfiguring its activities. This is seen by comparing the changes in the activities of networks controlling yeast gene expression patterns in different physiological regimens of the organism: cell cycle, sporulation, diauxic shift (the change from anaerobic fermentative metabolism to aerobic respiration as O2 levels increase), DNA damage, and stress response. Cell cycling and sporulation involve the unfolding of endogenous gene expression programmes; the others are responses to environmental changes.
Different states are characterized both by similarities and differences in gene expression patterns and by the components of the regulatory network that are active. There is considerable shift in expression of target genes. About a quarter of the target genes are specialized to individual physiological states. Of the total of about 3000 target genes, the expression levels of only about half do not show major changes in the different states. Of the 1906 that show altered expression levels in different states, almost half (803) are specialized to a single physiological state.
In contrast, different states show much more overlap in the usage of transcriptional regulators. For instance, for cell-cycle control, 280 target genes (8%) are differentially regulated by 70 (49%) of the transcription regulators. Clearly, there is a much greater degree of specialization in the target genes. In general, half of the transcription factors are active in at least three of the five physiological regimens. However, contrasting with the high overlap of usage of the transcriptional regulators (the nodes), the overlap of the activities within the network (the connections) is relatively low. Different components of the interaction network organize the different gene expression patterns in different states.
Whereas different physiological states are characterized by substitutions of different sets of synthesized proteins, the regulatory network uses much of the same structure but reconfigures the pattern
of activity. Think of the transcription factors as ‘hardware’ and the connections as reprogrammable
‘software’. The molecules do not change but the interactions do: in different states, many
transcription regulators change most, or a substantial part, of their interactions. In particular,
the set of transcription regulators that forms the hubs of the network – those with many outgoing
connections that form foci of control – is not a constant feature of the system. Some hubs are common
to all states, but others step forward to take control in different physiological regimens. The result
of the reconfiguration of activity is that over half of the regulatory interactions are unique to the
different states.
The effect of the changes in the active interaction patterns is to alter the topological
characteristics of the network in different states. For instance, under panic conditions – DNA damage
and stress – the average number of genes under the control of individual transcriptional regulators
increases; the average minimal path length between regulator and target decreases; and the clustering
becomes less dense (i.e. there is less interregulation among transcription factors). This can be
understood in terms of a need for fast and general mobilization – the equivalent of broadcasting
‘Go! Go! Go!’ over the radio. Normal circumstances – cell-cycle control, for instance – allow for a
more dignified and precise regulatory state, which permits finer control over the temporal course of
expression patterns. In cell-cycle control and sporulation, there is a much denser interregulation
among transcription factors and longer minimal path lengths between transcriptional regulators and
target genes.
Different physiological states also differ in their usage of the common motifs – fork, scatter, and
‘one-two punch’ (see Box 11.11). Forks are used more in conditions of stress, diauxic shift, and DNA
damage. They are appropriate to the need for quick action. Requirements for a build-up of
intermediates would delay the response. Conversely, the ‘one-two punch’ motif is more common in
cell-cycle control. This is consistent with the need for a signal from one stage to be stabilized
before the cell enters the next stage.
Much of evolution proceeds towards greater specialization. The human eye is a classic example. It is
an intricate and fine-tuned structure, features that were once adduced as evidence against Darwin’s
theory. Many evolutionary pathways show a trade-off between specialized adaptation and generalized
adaptability. Regulatory networks are an exception. Evolution has produced structures that are both
specialized and versatile. The reconfigurability of regulatory networks allows them to respond
robustly to changes in conditions by creating many different structures specialized to the conditions
that elicit them.
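The back-of-the-envelope bookkeeping quoted in this section (about 3500 target nodes with two inputs each, about 140 transcription factors with 50 outputs each) works because every directed connection is counted once as an output and once as an input. The following is a minimal Python sketch of that check, not part of the original analysis; the toy edge list and the names `TF1`, `geneA`, and so on are hypothetical, introduced only for illustration.

```python
from collections import Counter

def degree_totals(edges):
    """Return (total out-degree, total in-degree) for (regulator, target) edges."""
    out_degree = Counter(regulator for regulator, _ in edges)
    in_degree = Counter(target for _, target in edges)
    return sum(out_degree.values()), sum(in_degree.values())

# Hypothetical toy network: two regulators, three target genes.
toy_edges = [
    ("TF1", "geneA"), ("TF1", "geneB"), ("TF1", "geneC"),
    ("TF2", "geneA"), ("TF2", "geneC"),
]
total_out, total_in = degree_totals(toy_edges)

# Approximate figures quoted in the text.
edges_from_targets = 3500 * 2      # ~3500 nodes, two inputs each
edges_from_regulators = 140 * 50   # ~140 transcription factors, 50 outputs each

print(total_out, total_in)
print(edges_from_targets, edges_from_regulators)
```

Both totals must agree for any directed network, whatever its wiring, so the equality is a consistency check on the two averages rather than a finding about yeast.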

● RECOMMENDED READING

• Several authors offer general overviews of the systems view in biology:

Cramer, F. & Loewus, D.I. (Translator) (1993). Chaos and Order: The Complex Structure of
Living Systems. J. Wiley & Sons, New York.
Adami, C. (2002). What is complexity? BioEssays 24, 1085–1094.
Wagner, A. (2005). Robustness and Evolvability in Living Systems. Princeton University Press,
Princeton.
Alon, U. (2006). An Introduction to Systems Biology. Chapman & Hall, London.
Brenner, S. (2010). Sequences and consequences? Phil. Trans. Roy. Soc. Lond. B 365, 207–212.
• Barabási and co-workers have studied general principles of networks. Biological networks have
many properties in common with other networks:
Albert, R. & Barabási, A.-L. (2002). Statistical mechanics of complex networks. Rev. Mod. Phys.
74, 47–97.
Barabási, A.-L. (2003). Linked: How Everything Is Connected to Everything Else and What It
Means. Plume Books, New York.
• The following papers describe the structure, dynamics, and evolution of cellular signalling and
regulatory networks:
Ideker, T. (2004). A systems approach to discovering signaling and regulatory pathways – or,
how to digest large interaction networks into relevant pieces. Adv. Exp. Med. Biol. 547,
21–30.

Babu, M.M., Luscombe, N.M., Aravind, L., Gerstein, M., & Teichmann, S.A. (2004). Structure
and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14, 283–291.

Luscombe, N.M., Babu, M.M., Yu, H., Snyder, M., Teichmann, S.A., & Gerstein, M.B. (2004).
Genomic analysis of regulatory network dynamics reveals large topological changes. Nature
431, 308–312.

● EXERCISES, PROBLEMS, AND WEBLEMS

Exercises
Exercise 11.1 In the undirected, unlabelled graph in Box 11.3 on page 344, (a) name two vertices
such that if you add an edge between them at least one vertex has exactly four neighbours. (Note
that two edges may cross without making a new vertex at their point of intersection.) (b) Name
two vertices such that if you add an edge between them to the original graph, the graph becomes
an (unrooted) tree. (c) Name two vertices (neither of them V1) such that if you add an edge
between them to the graph produced in (b), the resulting graph does not remain a tree. (d) Name
two vertices such that if you add an edge between them to the original graph, there is exactly one
path between V1 and V3, with no vertices repeated, and it has length 4. (e) Name two vertices
such that if you add an edge between them to the original graph, there are alternative paths, of
lengths 3 and 4, between V1 and V5, with no vertices repeated. (In determining the length of a
path, you have to count the number of edges in the path. A path of length 2 between V1 and V5
contains one intermediate vertex.)
Exercise 11.2 Of the examples of graphs in Box 11.4, (a) which are directed graphs? (b) Which
are labelled graphs? (c) In each example, what is the set of nodes? (d) In each example, what is
the set of edges?
Exercise 11.3 What information is contained in Figure 11.3(b) that could not be recovered from
the kind of data produced by the experiments shown in Figure 11.13?
Exercise 11.4 For which of the methods for determining interacting proteins (pp. 366ff) (a) must
one of the proteins be purified; (b) must both of the proteins be purified?
Exercise 11.5 The binding site for λ Cro (p. 374) is an approximate palindrome, i.e. the two
strands contain approximately the same sequence in reverse order. (a) On a copy of the binding
site, indicate which six residues best fit the palindrome pattern. (b) A palindromic binding site can
interact with a dimeric protein by presenting surfaces of similar structure to both protein subunits.
How far apart are the six-residue regions identified in part (a)? How do you rationalize that
distance in terms of features of the structure of DNA?
Exercise 11.6 From Figure 11.3, (a) what would be the effect of increased expression of metJ
on the expression of metC? (b) What would be the effect of increased expression of OxyP on
the expression of metJ? (c) What would be the effect of increased expression of OxyP on the
expression of metP?
Exercise 11.7 Redraw Figure 11.28 with the top boxes containing glucose absent and lactose
absent instead of glucose present and lactose present.
Exercise 11.8 Match mutant to phenotype:

Dysfunctional gene
1. lacI
2. Mutation in repressor binding site
3. Mutation in RNA polymerase binding site
4. lacZ
5. lacY

Resulting phenotypes
(a) Operon expressed, no lactose uptake
(b) Operon expressed, no glucose or galactose produced
(c) No expression
(d) Constitutive expression of operon because no repression is possible

Exercise 11.9 In the London Underground: (a) What is the shortest path between Moorgate and
Embankment stations? Note that, considered as a graph, the shortest path between two nodes
is the path with the fewest intervening nodes, not the path that would take the minimal time
or fewest interchanges. (b) What is the shortest cycle containing Kings Cross, Holborn, and
Oxford Circus stations? (c) The clustering coefficient of a node in a graph is defined as follows.
Suppose the node has k neighbours. Then the total number of possible connections between the neighbours
is k(k − 1)/2. The clustering coefficient is the observed number of connections between the neighbours
divided by this maximum possible number. If the neighbours of a station are the other stations
that can be reached without passing through any intervening stations, what is the clustering
coefficient of the Oxford Circus station? (If necessary, see
http://www.bbc.co.uk/london/travel/downloads/tube_map.html)
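As a hedged illustration of the clustering coefficient (the number of connections actually present among a node's k neighbours, divided by the k(k − 1)/2 connections that are possible), here is a small Python sketch. The four-node graph is a made-up example, not a piece of the Underground map, and the function name is an assumption of this sketch.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of possible links among the neighbours of `node` that exist.

    `graph` maps each node to the set of its neighbours (undirected graph).
    """
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to connect
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return links / (k * (k - 1) / 2)

# Hypothetical toy graph: A is linked to B, C and D; B and C are linked to each other.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

for station in sorted(toy):
    print(station, clustering_coefficient(toy, station))
```

For a real map, the same function applies once each station is entered with the set of stations reachable from it without an intervening stop.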
Exercise 11.10 In the London Underground, (a) what is the maximum path length between any
two stations? (That is, for which two stations does the shortest trip between them involve the
maximum number of intervening stops?) (b) If the District Line were not active, what stations, if
any, would be inaccessible by underground? (c) If the Jubilee Line were not active, what stations,
if any, would be inaccessible by underground?
Exercise 11.11 On a photocopy of the three common network control motifs (Box 11.11, p. 378),
(a) indicate which nodes are controlled by only one upstream node; (b) indicate which node exerts
control over only one downstream node.
Exercise 11.12 On a photocopy of the simplified fragment of the yeast regulatory network
(Figure 11.29) indicate examples of the network control motifs (a) star and (b) ‘one-two punch’.
(c) Add one arrow to create a scatter motif.
Exercise 11.13 In the overall yeast transcriptional regulatory network, the number of incoming
connections to target genes follows an exponential distribution, i.e. the probability that a gene is
controlled by k transcriptional regulators is proportional to $e^{-ak}$, with a = 0.8, k = 1, 2, . . . . What is
the ratio of the number of target genes receiving four input connections to the number receiving
two input connections?
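For an exponential distribution of this kind, the normalizing constant cancels in any ratio, so the relative abundance of genes with k1 and k2 inputs depends only on the difference k1 − k2. The Python sketch below shows that relationship in general form; the function name and the sample values of k are illustrative choices, not taken from the text.

```python
import math

def input_ratio(k1, k2, a):
    """Ratio P(k1) / P(k2) when P(k) is proportional to exp(-a * k)."""
    return math.exp(-a * (k1 - k2))

a = 0.8  # decay constant quoted in the exercise
for k1, k2 in [(2, 1), (3, 1), (5, 3)]:
    print(f"P({k1}) / P({k2}) = {input_ratio(k1, k2, a):.3f}")
```

Pairs with the same difference in k, such as (3, 1) and (5, 3), give the same ratio.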

Problems
Problem 11.1 (a) In one species, only enzyme A catalyses the conversion S → P, the rate-limiting step
of a reaction pathway. What is the flux control coefficient of enzyme A? (b) In a related species,
distinct but similar enzymes A and B both catalyse the S → P reaction. The kinetic characteristics
of A and B are identical: S → P is still the rate-limiting step of the pathway. What is the flux control
coefficient of A?
Problem 11.2 For dissociation of a complex involving a simple equilibrium, AB ⇌ A + B, the
equilibrium constant, $K_D = [A][B]/[AB]$, is equal to the ratio of forward and reverse rate
constants: $K_D = k_{off}/k_{on}$.
For avidin–biotin, $K_D = 10^{-15}$ M. Suppose $k_{on}$ were as fast as the diffusion limit, ∼ $10^{9}$ M$^{-1}$ s$^{-1}$.
(a) What is the value of $k_{off}$? (b) What would be the half-life of the avidin–biotin complex?
(c) Suppose $k_{on}$ for avidin–biotin were $10^{7}$ M$^{-1}$ s$^{-1}$. What would be the half-life of the complex?
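The relations needed here are k_off = K_D × k_on and, for first-order dissociation, half-life = ln 2 / k_off. The Python sketch below applies them under stated assumptions: K_D is taken as 10^-15 M, and the on-rates are taken as 10^9 and 10^7 M^-1 s^-1, the usual order of magnitude for a diffusion-limited association and a value a hundredfold slower. The function names are illustrative.

```python
import math

def off_rate(k_d, k_on):
    """Dissociation rate constant k_off (s^-1) from K_D (M) and k_on (M^-1 s^-1)."""
    return k_d * k_on

def half_life(k_off):
    """Half-life (s) of a first-order process with rate constant k_off (s^-1)."""
    return math.log(2) / k_off

K_D = 1e-15                  # M, assumed value for avidin-biotin
for k_on in (1e9, 1e7):      # M^-1 s^-1, assumed orders of magnitude
    k_off = off_rate(K_D, k_on)
    days = half_life(k_off) / 86400
    print(f"k_on = {k_on:.0e}: k_off = {k_off:.1e} s^-1, half-life = {days:.1f} days")
```

Because the half-life is inversely proportional to k_off, a hundredfold slower association at fixed K_D implies a hundredfold longer-lived complex.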
Problem 11.3 Write detailed structures for all of the metabolites that appear in Figure 11.3a.
Problem 11.4 The network of metabolic pathways must obey constraints of thermodynamics
and physical-organic chemistry. Meléndez-Hevia and colleagues suggested the principle that
metabolic pathways are optimized, subject to the constraints, for the minimum number of steps.
The non-oxidative phase of the pentose phosphate pathway converts six five-carbon sugars to
five six-carbon sugars:

6 ribulose-5-phosphate → 5 glucose-6-phosphate

A simplified model of a pathway for this conversion is a series of steps, each of which is either:

• transfer of a two-carbon unit from one sugar to another (a transketolase reaction); or

• transfer of a three-carbon unit from one sugar to another (a transaldolase or aldolase reaction).

Represent each sugar only by a number of carbon atoms. Starting with six five-carbon sugars,
one possible initial step would be a transketolase step converting two five-carbon sugars to a
three-carbon sugar and a seven-carbon sugar. Assume that all intermediates must have at least
three carbon atoms.
Create a tableau with the following initial and final states (an initial transketolase (TK) step is
also shown):

Step Number of carbons in sugar molecules

0 5 5 5 5 5 5
TK
1 3 7 5 5 5 5
...
N 6 6 6 6 6 0

Copy and fill in the tableau to find the shortest route from the top (step 0, six five-carbon sugars)
to the bottom (five six-carbon sugars). Identify the intermediates created. Compare this with the
observed metabolic pathway.
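One way to explore the tableau systematically is a breadth-first search over multisets of sugar sizes, starting from six five-carbon sugars as in the tableau. The Python sketch below is only an illustration of that idea under stated assumptions: a step moves two carbons (labelled TK) or three carbons (labelled TA, covering transaldolase and aldolase) between two sugars; a donor must keep at least three carbons unless a three-carbon sugar is consumed entirely; and a fully consumed sugar (size 0) takes no further part. The function names are hypothetical.

```python
from collections import deque

def moves(state):
    """Yield (label, next_state) for every allowed single transfer."""
    for i, donor in enumerate(state):
        for j, acceptor in enumerate(state):
            if i == j or donor == 0 or acceptor == 0:
                continue
            for label, unit in (("TK", 2), ("TA", 3)):
                remainder = donor - unit
                # Assumed rule: the donor keeps >= 3 carbons,
                # or a three-carbon sugar is used up completely.
                if remainder != 0 and remainder < 3:
                    continue
                nxt = list(state)
                nxt[i] = remainder
                nxt[j] = acceptor + unit
                yield label, tuple(sorted(nxt))

def shortest_route(start, goal):
    """Breadth-first search; returns [(label, state), ...] from start to goal."""
    start, goal = tuple(sorted(start)), tuple(sorted(goal))
    previous = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            route = []
            while state is not None:
                entry = previous[state]
                route.append((entry[1] if entry else "start", state))
                state = entry[0] if entry else None
            return route[::-1]
        for label, nxt in moves(state):
            if nxt not in previous:
                previous[nxt] = (state, label)
                queue.append(nxt)
    return None

route = shortest_route((5, 5, 5, 5, 5, 5), (6, 6, 6, 6, 6, 0))
for label, state in route:
    print(label, state)
```

States are stored as sorted tuples, so the printed rows list the sugar sizes in ascending order rather than in the column order of the tableau. Breadth-first search guarantees that the route found has the fewest steps under the assumed rules, although several routes of the same length may exist.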
Problem 11.5 Choose 15 amino acids, by crossing off the ones most similar to others. Devise a
doublet genetic code for these 15 residues, plus a stop signal, that is as close as possible to the
actual triplet code.
Problem 11.6 (a) You create a strain of E. coli in which the order of promoter and operator in
the lac operon are reversed. Will this strain express lacZ, lacY, and lacA in the absence of lactose
and glucose? (b) You create a strain of E. coli by moving the operator from its normal position
to a position between the lacZ and lacY genes. Will this strain express lacZ, lacY, and lacA in the
absence of lactose and glucose? Will this strain express lacZ, lacY, and lacA in the presence of
lactose and absence of glucose? (c) You add exogenous cyclic AMP to wild-type E. coli. Will lacZ,
lacY, and lacA be expressed in the presence of glucose and lactose? (d) You create a strain of
E. coli with a point mutation in lacZ that renders the enzyme completely dysfunctional. Will this
strain express lacY in the presence of lactose? Will this strain express lacY in the presence of
isopropylthiogalactoside (IPTG)?
Problem 11.7 Indicate how to connect a selection of the three common network control motifs
so that a single input node can influence three output nodes.
388 11 Systems Biology

Problem 11.8 What is the minimum number of ‘yes-or-no’ questions required to identify a specific
letter of the upper-case alphabet: ABC . . . Z?
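The reasoning behind questions of this type is that each ‘yes-or-no’ answer can at best halve the set of remaining candidates, so q questions distinguish at most 2^q possibilities. A minimal Python sketch of that bound follows, assuming every question splits the candidates as evenly as possible; the function name is illustrative.

```python
def min_questions(n_items):
    """Smallest q with 2**q >= n_items, i.e. the ceiling of log2(n_items)."""
    if n_items < 1:
        raise ValueError("need at least one item")
    return (n_items - 1).bit_length()

for n in (2, 4, 20, 64):
    print(n, "items ->", min_questions(n), "questions")
```

Integer bit arithmetic is used instead of a floating-point logarithm so that exact powers of two are handled without rounding error.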

Weblems
Weblem 11.1 In the methionine biosynthetic pathway (see Figure 11.3), the product, methionine,
inhibits an enzyme in the middle of the pathway, homoserine O-succinyltransferase. (a) The
accumulation of which intermediate might this inhibition be expected to cause? (b) In what
other pathways is this intermediate involved that might use it up?
Weblem 11.2 Figure 11.4 shows the amino acid biosynthesis leading from aspartate to
methionine, threonine, and lysine. On a photocopy of this figure, write in the names of the
omitted intermediates at the unlabelled positions between consecutive arrows.
Weblem 11.3 The enzymes encoded by metC and malY in E. coli convert cystathionine to
L-homocysteine. Each has another function in addition. What are these other functions?

Weblem 11.4 Compare the pathways for biosynthesis of chorismate in E. coli, M. jannaschii, and
Aeropyrum pernix. What is the earliest common intermediate in these pathways? What are its
precursors in the three species?
Weblem 11.5 Define the following terms: (a) interactome; (b) signalome. (c) This is more difficult:
can you think of, and define, a reasonable ‘ome’ that has not yet been proposed?
Weblem 11.6 Compare the methionine biosynthesis pathway from aspartate to methionine in
E. coli and yeast. (a) Are there any differences in the series of intermediates? (b) Are the enzymes
that catalyse similar transformations homologous? Show alignments of the amino acid sequences
where possible. Use EcoCyc for E. coli (http://ecocyc.org) and the Saccharomyces Genome
Database (http://www.yeastgenome.org) for yeast, searching in each case for ‘methionine
biosynthesis’.
Weblem 11.7 Draw Figure 11.27(a) to scale, using the known E. coli genome sequence as a
source of the true sizes of the regions.
Weblem 11.8 Find the page in EcoCyc corresponding to Figure 11.3. Choose different levels of
detail and list what types of information are presented at each level.
Weblem 11.9 An enzyme with a function related to peptidylglycine monooxygenase (EC
1.14.17.3), and linked to it in the ENZYME DB, is 1-aminocyclopropane-1-carboxylate oxidase
(EC 1.14.17.4). (a) What is the lowest common ancestor of the two reactions in the EC
classification? (b) What is the lowest common ancestor of the two reactions in the Gene Ontology
molecular function classification? (c) Are these two enzymes closely related in the Gene Ontology
classification?
Weblem 11.10 What identifiers does Gene Ontology associate with E. coli aspartate
aminotransferase, in the molecular function category? Arrange them in a directed acyclic graph,
indicating the parent–child relationships between these identifiers.
Weblem 11.11 According to EcoCyc, what reactions can orotidine 5′-monophosphate undergo?
What enzymes catalyse these reactions? What genes encode these enzymes?
Weblem 11.12 Figure 11.5 shows the reductive carboxylate cycle, and the EC numbers of the
enzymes that catalyse the individual steps. Find the corresponding information for the tricarboxylic
acid cycle (or Krebs cycle), and for the glyoxylate cycle. Do an alignment of the metabolites
participating in these cycles, and display the EC numbers of the enzymes that correspond to different
reactions in different cycles. Report what is common to pairs or to all three of these pathways, in
terms of (a) metabolites, (b) links between metabolites, corresponding to reactions, (c) enzymes
that catalyse the reactions.
EPILOGUE

The new century has already seen major achievements in genomics. Sequencing of the human genome
is the jewel in the crown. And yet, the field is still in a preparative and anticipatory stage. We
can be confident that:

• Methods of sequence determination will increase in power. There will be an explosion in the number
of complete sequences of different human beings and of many other organisms.
• Tools for analysis will make progress. Better algorithms, making effective use of more data, will
produce more reliable inferences. Modelling of structure and process will improve, towards the goal
of the simulation of the complete cell in silico.
• We will achieve a more profound understanding of what life is and how it works.

We will also gain greater control over living systems. Many applications will emerge, in clinical,
agricultural, and technological fields. Some of these are relatively simple extrapolations from what
has already been achieved. Others seem more like the stuff of science fiction: a tightly coupled
silicon–life interface and the in vivo deployment of nanoparticles sensing and interacting with our
biochemical states. Nevertheless, they represent natural developments of the current state of the art.
Understanding the genome may ultimately release us from its constraints.
INDEX

α1-antitrypsin 22, 43, 58, 331 ancient DNA, see palaeosequencing AT-rich 346, 373
α-helix 303ff. Angelman syndrome 16, 39, 85 – 86, autism 51
β-blocker 23 139 –140 autoantibody 276
β-galactosidase 368, 380 –381 angiogenin 240 autoimmune disease 57
β-sheet 303ff angiosperm 122, 218, 248, 263 autopolyploid 141
λ cro 374 annelida 119 autoradiograph 18, 79, 96 –97, 113, 307
antennapedia 374 –375 autoregulatory 363, 377, 379
anthocyanin 15, 280 –281 autosome 147, 224, 258, 285
A anthopleura 49 avoparcin 287–288
anthropoid 138
aardvark 227, 236 antibiotic 11, 16 –17, 23, 108, 127, 129,
acetaldehyde 274 155, 191–192, 196, 205 –207, 212, B
acetylation 44, 305 269, 287–289, 293 –295, 330
acetylcholinesterase 45, 305 antibody 9, 38, 123, 153, 177, 186, 206, baboon 152–153, 324
acetyl-CoA 204, 273, 352, 358 208, 268, 276, 291, 298, 327, 330, backbone (protein) 299ff.
achiral 301 333, 367–368 BacMap 130 –131
acorn worm 119, 220 anticoagulant 77, 187 bacterial artificial chromosome (BAC)
actinobacteria 204 –205 anticodon 14, 111, 293, 347 98ff.
actinopterygii 152 antigen 38, 55, 61, 124, 126, 177, 187, bacteriophage 18, 81, 94, 96, 120, 125,
acylphosphatase 302–303 206, 333 131, 207, 291, 334, 367, 373 –374
adenosylcobalamin 47 anti-inflammatory 271 display 367
adenovirus 125 anti-influenza 128 bacteriorhodopsin 140, 195
adenylate cyclase 371 antimalarial 59 Barcode of Life 117, 154
adrenoleukodystrophy 43, 73, 125 antiporter 204 barley 108, 242
Aepyornis 230 antipsychotic 52 barrel 300, 313, 326
Affymetrix 271, 282 antithrombin 187, 320, 322 base pair 92
agammaglobulinaemia 150 antiviral 23 basophils 291
aggregation 60, 331 Aplysia 108 B-cell 283
aldolase 177–179, 190, 360 –361, Apolipoprotein E, 51, 61– 62 Beadle, G. 244, 262
364 –365, 387 apoptosis 62, 364, 376 Bernal, J.D. 302
alfalfa 141 apyrase 102 bilateria 119
alga 13, 31, 47, 67– 68, 118, 120, 122, Arabidopsis thaliana 8, 13, 38, 98, 108, biodiversity 165, 187
132–133, 156, 217–218 132–133, 136, 141, 215, 218 –220, biofuels 188
alignment 167 247–249, 261 biogeography 260
approximate methods 173 archaea 17, 68, 115, 118, 122, 129 –136, bioinformatics 24ff., 167ff.
dot-plot and 169 192–204, 358 –361, 365 bioluminescence 343
multiple 174 Archaeopteryx 224 biotechnology 17, 25, 27, 106, 187, 298
structure 176 armadillo 149, 152, 227 biotin 224, 333, 368, 387
alkaloid 219, 247–248 aromatase 5 bipolar 77
allergen 187, 291 arousal 274 bisphosphoglycerate 319, 359ff.
allolactose 380 –381 arrhythmia 150 BLAST (Basic Local Alignment Search
allopolyploid 141 arthritis 57, 271 Tool) 173ff.
allosteric 47, 318ff. arthropod 116, 119, 162 BLOSUM matrices 172–173, 175
alveolates 217 artiodactyl 152 bluetongue virus 125
Alzheimer’s disease 51, 58, 61– 62, Ashburner, M 100, 148, 352 Bombyx mori 108
149 –150, 305, 331–332 Ashkenazi Jews, common BRCA gene bottleneck 52, 143, 227, 242, 244, 336
amino acid 6ff., 200, 299ff. mutants 63 – 64, 77 BRCA1, BRCA2 18, 62– 65, 77, 116,
substitution 172 asparaginase 292 220
amphioxus 49, 119, 141 aspartylglucosaminuria 305, 339 breakpoint 87, 271
ampicillin 16 Aspergillus 159, 302 Brenner, S. 50, 147, 281, 312, 384
amplicon 95 assembly (sequence) 18 –19, 94, 98ff., bryophyte 218
amyloidosis 58, 331–332 112–113, 151, 247, 292 bryozoa 119
amyotrophic lateral sclerosis (ALS) 124 asteroid impact 42, 67– 69, 192 BSE (bovine spongiform encephalopathy)
anaemia 4, 58 – 60, 179, 331–332 asthma 140, 266, 271 331–332
anaerobe 196 –197, 201 astrocyte 283 bubonic plague 145, 205, 346
anaerobic 5, 108, 158, 205, 268, 273, ataxia 51, 149 Buddenbrockia 48 – 49
383 ATPase 204, 317, 330 buffalo 20
chip 18, 183, 201, 266 –268, 271, 282, coverage 19, 98 –100
C 292 craniata 29
chip-seq 18 C-reactive protein 380
cadherin 283 Chironomus 31 creatine 314 –315, 337–339
Caenorhabdidis elegans 108, 136, 274, chiroptera 152 C-region 326
285, 332 chloroperoxidase 339 Creutzfeldt–Jakob disease 331–332
development 48 chlorophyll 195, 205, 210, 213 Crick, F.H.C. 6, 41, 43, 91–92, 111,
genome 49, 122–123, 140, 147–149, chlorophyte 218 176, 306
151, 153, 220 chloroplast 4, 6, 11–13, 15, 37–38, criollo 247–248, 261
nervous system 50 –52, 151 132–133, 155 –156, 162, 205, 219, Critical Assessment of Structure
caffeine 247, 263, 274 235, 317 Prediction 322
calcium 42, 72 chloroquine 59 crossing-over 83
Cambrian 67– 68, 137, 192 chocolate 245 –250, 261–263 cross-linking 287, 289, 306
cancer 5, 16, 18 –20, 23, 42, 47, 62– 65, cholera 122, 129, 343 cultivar 247, 279 –280
71–72, 87, 108, 116, 149 –150, 206, cholesterol 61, 153, 276 C-value 121
224, 268, 290 –291, 364 chordate 29, 116, 119, 141, 220 –221, cyclic-AMP 380
canidae 228 235 cystathione 356, 388
cantaloupe 132 choreatic 22 cystic fibrosis 17, 90, 125, 150, 153, 266
capillaries 59 – 60, 97, 309 chorismate 204, 360, 388 cytoglobin 26, 30, 137, 156, 235 –236
capsid 124, 330, 333, 373 –374 chromatin 8, 12, 47, 81, 84, 132, 142, cytokine 108, 284
carbamazepine 77 147–148, 154, 284, 353
carboniferous 68, 218 immunoprecipitation 367–368
carboxykinase 204 remodelling 45
carnivora 152, 227–228 chromogenic 368 D
carotenoid 289 chromophore 140, 210, 306, 362, 368
carrier 17, 60 – 61, 120, 127, 221, 268, chromosome 26ff, 142 database 24, 30, 104ff.
315, 346 chymotrypsin 155 bibliographic 109
cartilaginous 119 cilia 23, 34, 118, 317 BLOCKS 172
CASP (see Critical Assessment of Ciona intestinalis 108, 141, 159, clusters of orthologous groups (COG)
Structure Prediction) 220 –221, 235 –236 130
cassowary 182, 230 circadian rhythm 52, 219, 241, 275, 364 DNA 106, ethical issues 32ff.,
cat (cloned) 46 cis-aconitate 358 barcoding 117
catalase 274 cis-regulator 8, 377 ESTs 108, 279
catarrhini 29, 152 citrulline 305 enzyme structures 352
catfish 108 c-jun 327 expression and proteomics 108ff.
CATH 107, 113, 310 cladistics 184 genetic diseases 106 –107
cDNA 14, 64, 72, 95, 108, 267–268, clotting, see coagulation genome browsers 25
271, 277, 279, 282, 292, 294 cloverleaf (tRNA) 13 genomes (GOLD) 216
Celera Genomics 19 –20, 23, 100, CLUSTAL-W 171, 173, 183, 213 homophila 149
147–148, 295, 307, 309, 337 cnidaria 48 – 49, 119 metabolic pathways 109, 354
cell-adhesion 278 coagulation 43, 143, 179, 291 MITOMAP 251
cell-cycle 47, 244, 284, 292, 363, 376, codon 6, 13 –16, 36 –37, 60, 136, 162, nucleic acid structures 107
383 –384 212, 337, 347–348 protein function classification 351ff.
cellulase 196 usage 197–198 protein interactions 366, 370
cellulose 196, 205, 221, 235 co-enzyme 130 –131, 202, 333 protein sequences 106
cell-wall 289 –290, 357 cofactor 61, 217, 224, 314, 351, 355, protein structures 107, structure
centimorgan 111 359 –360 classification 107, 310, prediction
centromere 8, 26, 85, 147 coiled-coil 326 –327 325
channel 219, 226 coimmunoprecipitation 367 quick screening of 173ff.
chaos 144, 349 –350, 363 –364, 384 co-infect 127 Saccharomyces genome database
chaperone 130 –131, 200, 202, 330 co-inheritance 85 217–218
Chargaff’s rules 91–92 colitis 204 SNP databases 53 –54
Chase, see Hershey–Chase collision-induced ionization 311 Species 2000 165
cheetah 52 complexity 346ff. taxonomy 165
chemiosmotic 12, 317 concanavalin A 306 daunorubicin 292
chemotaxis 312, 315, 377 conformation, protein 300ff dbEST 108, 294
chemotherapy 290 congenital 179 dbSNP 53
CheY 312 conifer 218 defensin 226
chimaera 120, 208 contig 18 –19, 90, 99 –100, 104 dehydroquinate 360
chimpanzee 42, 50, 71, 86, 88, 110, CopyCat 46, 73 Deinococcus radiodurans 122, 205
117ff., 139ff., 149, 152, 156 –157, co-repressor 375 delphinidin 280 –281
227, 250, 268, 285 –287, 294 corn, see maize dementia 332
denaturation 9 –10, 200, 304, 307–308, electroreception 225 fern 68, 122, 218
310 electrospray ionization 309 ferredoxin 204, 210, 358, 360 –361
dendrites 44 elephant 119, 152, 230 –231, 233 –235 ferritin 298
deoxyhaemoglobin 166, 319, 321, 331 Elm yellows 42 fibre diffraction 91–92
dephosphorylation 273 EMBL-BANK 106 fibronectin 137–139, 303
depolarization-sensitive 276 emmer wheat 141, 242 fibrosis 150
detoxification 273, 380 emphysema 22, 39, 58, 291, 331 flavodoxin 312
deuterostome 118 –119, 157, 220 enamel 224 flavonoid 246 –247, 281
dexamethasone 271, 293, 371 enantiomer 195 ‘Flavr Savr’ tomato 46
diabetes 34, 43, 57, 61, 364 encephalopathy 331–332 flow cytometry 290
diauxic shift 272–274, 383 –384 ENCODE 151–154 fluorophore 266
dideoxynucleotides 79, 95 –97 modENCODE 153 fluorescence resonance energy transfer
dihydroflavonol 281 endocytosis 132, 283 (FRET) 368
dihydrokaempferol 280 –281 endogluconase 221 foetal haemoglobin 26, 45, 319
dihydromyricetin 280 –281 endonuclease 88, 189, 373 forestero 247–248
dihydroquercetin 281 endopeptidase 306 fork 141, 354, 377–379, 384
dihydroxyphenylalanine 371 endoplasmic reticulum 11 fossil 55, 66 – 67, 71, 76, 118 –119, 182,
dimorphism 231 endoreplication 143 194, 216, 227, 229, 234, 250
dinornithidae 230 –231 endosperm 143, 245 FOXP2 144, 250, 285
dinosaurs 68 – 69 endosymbiont 12, 132, 156, 158 fractal 350
dipeptide 203 –204 enhancer 45 frameshift mutation 178
diptera 116 enolase 180, 361 Franklin, R. 91–92
disease transmission 346 enteritis 205
disulphide 106, 273, 284, 300, 304 –306, enthalpy 200
312 ENTREZ 53, 106, 158 G
exchange 134 entropy 200, 346 –348, 351
dithiothreitol 307 enucleate 143, 291 gag (HIV-1 protein) 124
DNA polymerase 94 –95, 101–102, 192, Enzyme Commission (EC) 313, 351ff. galactose 72, 283, 380, 386
195, 238, 354 eosinophil 291 galactosylceramidase 283
DNA structure 91–93 ephrin 168 –170 Galapagos Islands 183, 259
docking 333 epidemic 122, 126 –127, 145, 159, 207, gamete 82– 83, 121, 132, 140
dodo 42, 119, 232 332, 346 ganglioside 305
dog genome 226ff. epidermal 298 G-banding 84
domain (protein) 137ff., 302ff. epigenetic 3 –5, 16, 39 GC-rich region 346
swapping 335 epilepsy 77 geochemistry 192
Drosophila melanogaster 51, 86, 108, epitope 57, 368 gibberellin 219
116, 143, 332 erythrocruorin 31 Gibraltar 55 –56
development 48, 275 –278, 374 erythrocyte 59 – 60, 291, 320 Giemsa stain 84
genome 99 –100, 122, 136, 147–149, erythropoietin 45 Gilbert, W. 94, 96 –97
285, 352 Escherichia coli glaciation 77, 143, 230
dot plot 168ff. genome 129 gleevec 87
Duchenne’s muscular dystrophy 150 transcriptional regulatory network 378 glia 275 –276, 283
dugong 233 euarchontoglires 29, 227 globin 7, 26 –32, 37–38, 73, 114,
dystonia 51 euchromatin 147 136 –138, 153 –154, 156 –157, 179,
dystrophin 8, 39 eudicot 218, 235, 248 –250, 263 190, 221, 235 –236, 285, 305,
dystrophy 43, 73, 150 euryarchaeota 195 –196 318 –321
euteleostomi 29 allosteric change in haemoglobin
eutheria 29, 157, 227 318ff., 342
E exocytosis 283 haemoglobin 26 –32, 37–38, 44 – 45,
exome 18, 20 47, 59 – 60, 136 –137, 166 –167, 291,
Embden–Meyerhof pathway 204, exon 8ff., 14, 20, 133 298, 302–303
272–273, 358 –359 ExPASy 109 myoglobin 26, 29 –32, 38, 57,
echidna 225 extinction 41– 42, 66 – 69, 71, 118, 192, 136 –137, 319
echinoderm 119, 220 252, 259 glucocorticoid receptor 271, 371
EcoCyc 341, 355 –356, 388 extremophile 205 glucokinase 360
EcoRI 88 – 89, 110 eyeless 277 gluconeogenesis 178, 213, 273
EcoRV 373 glume 243 –245
ectopic 48, 374 glutamatergic 276
effector 137, 319 –320 F glutathione 58, 274 –275
einkorn 141, 242 glyceraldehyde 178, 204, 359 –361
elastase 22, 155 farro 141 glycerol 156, 194 –195, 246
electrophoresis 64 – 65, 72, 88, 96 –97, fava bean 59 glycogen 350
207, 297, 307–308 feed-forward loop 377, 379 glycolipid 305
glycolysis 178 –179, 204, 213, 273 –274, histone 7, 11–13, 44 – 45, 129, 154, 194, isologous interaction 335
289 –290, 357–359 251, 373 –374 isopropylthiogalactoside 381, 387
glycophorin 145 HIV 36, 73, 77, 124 –126, 305 isozyme 178 –179, 243
glycoprotein 9, 45, 125 –127, 305, 339 homeobox 148
glycosylation 106, 306 homeodomain 374 –375
golden rice 245 homeostasis 274 –275, 377 J
Gondwanaland 182 homeotic gene 279, 374
G protein 138, 140, 316, 362 homodimer 335 jaundice 59
gramicidin 330 homogametic 77, 224 Jeffreys, A. 238 –239
granulocyte 291 homology 136, 161, 167, 181, 197, 201,
granulysin 145 283, 342, 370
green fluorescent protein 50 modelling 323 –328 K
GroEL 330 homoplasmy 240
GTP 95, 204, 316, 318, 362 homozygote 82, 112 Kabat, E.A. 153
honeybee 49, 68 kallikrein 283
heterochromatin 8, 84, 142, 147–148 Kanehisa, M. 109
H Huntington’s disease 22, 89, 284 kangaroo 71, 138, 164, 225, 227
huntingtin 22, 331 keratin 48, 224, 298, 327, 330
haemagglutinin 126 –127 Huxley, H.E. 80 kinase 180, 203 –204, 248, 273, 276,
haematopoiesis 290 –291 hydrocortisone 271 283, 289, 298, 302, 315, 335,
haemocyanin 221 hydrogen bond 91–92, 176, 179, 355 –357, 362, 371, 373
haemoglobin (see globin) 299 –304, 330, 333ff., 373 arginine 314 –316, 337–339
haemoglobinopathy 59, 331 hydrophobicity 134, 156, 179, 200, database 107
haemolytic anaemia 59, 179 299 –300, 325 –327, 331–333, phosphofructokinase 330, 359
haemophilia 43, 240 336 –337 shikimate 360 –361
Haemophilus influenzae 122, 136, 205 hyperpolarization 276 tyrosine 87, 292, 370
haemorrhagic fever 204 hypersensitive 45 korarchaeota 195 –196
hagfish 119 hyperthermophile 196 –198, 200 –202, Krebs cycle 272–274, 290, 354, 356, 388
hairpin 13, 113, 177 205, 212
half-life 130, 381, 387 hypervariable region 239, 262
halophile 118, 195 –196, 361 hypoglycemia 178 L
Hamming distance 171 hypomethylation 16
haplogroup 251–253, 257, 263 hypoxanthine-guanine lactalbumin 324
haploid 53 –54, 83, 98, 121, 143, 216, phosphoribosyltransferase 52 lactamase 16
251 hypoxia 45 lactone 343, 357
haplotype 20, 52–57, 72–73, 75 –77, lactose 4 –5, 42– 44, 47, 72, 241, 283,
79, 85, 89, 110 –111, 229, 237, 380 –381, 385 –387
252–253, 259 I lagomorph 152, 227
HapMap 20 –21, 53 –55, 106, 110, 116, lamprey 119, 221
151, 159, 250 immunocompromised 126 lancelet 108, 119, 220
Havasupai 34 immunodeficiency 125 Last Universal Common Ancestor
hedgehog 152 immunoglobulin 221, 266, 298, 331 (LUCA) 195, 212
Helicobacter pylori 122, 136, 205 –206 immunohistochemistry 268 laurasiatheria 227
hemichordate 119, 220 immunology 57, 117, 119, 126, 206, 224 L-dopa 370 –371
hepatitis 124, 206, 213 immunoprecipitation 64, 367–368 lectin 283, 305 –306
hepatocyte 22 immunosuppression 57 leghaemoglobin 31
herbicide 187 indolepyruvate 204 lemur 152, 159
hermaphrodite 50, 147 influenza 122, 125 –128, 136, 157, lepidoptera 163
Hershey–Chase experiment 81, 93, 205, 305 leprosy 149
124 –125 insecta 116 leptospirosis 205
heterochromatin 8, 84, 142, 147–148 insecticide 187 Lesch–Nyhan syndrome 52, 113
heterodimer 177 insulin 8, 43, 58, 61, 94, 106, 186, 283, leucine-rich repeat 248
heteroduplex 65 306, 364 leucine zipper 327, 374 –375
heterogametic 77, 224 integrase 124, 201 leucocyte 55
heteroplasmy 240 interactome 388 leukaemia 22–23, 87, 110, 143, 150,
heterozygote 54, 58, 65, 72, 82– 83, intercalation 376 268, 290 –292
112 interleukin 154, 248, 263 Levenshtein distance 171
hexanucleotide 15, 348 intron 8ff., 14, 27ff., 124, 129 ligase 275, 351–352
hexokinase 335 iodoacetamide 307 lignin 173, 243, 292
hidden Markov model (HMM) ionization 309 Linnaeus, C. 66, 80, 115 –117, 164, 345,
175 –176, 325 –326 ionotropic 283 353
hippocampus 282–283 isoelectric focusing 308 lipid 130 –131, 194 –195, 202, 232, 275,
hirudin 333 isoform 16, 149, 159 305, 367
lipopolysaccharide 211 microscopy 49, 367 neurotrophic 283
lipoprotein 61, 126, 159, 225 microtubule 317, 331, 367 nitrogen-assimilating 211
lipoxygenase 283 migration 253ff. nitrogen-fixing 120
lizard 119 mimivirus 122 non-deterministic polynomial (NP)
LUCA (see Last Universal Common minisatellite 7– 8, 88 algorithm 350
Ancestor) miocene 68 non-synonymous mutation 36, 136
luciferase 101–102, 343 misfold 58, 331 notochord 220
Luria, S.E. 68 – 69, 121, 197, 218 mismatch 65, 100, 169, 171, 173 –174, N-phospho-L-arginine 314
lyase 61, 204, 210, 351–352, 356 267, 269, 373 nucleoid 11, 129, 132
lymphocyte 291 mitosis 11, 140, 143, 317 nucleolus 354
lysogeny 374 moa 119, 182, 229 –231, 329, 350 nucleosome 12, 330, 373
lysozyme 324 MODBASE 325, 327
modENCODE 153
module 303, 375, 377 O
M mollusc 119, 132–133, 156, 165
monoamine oxidase 52, 72 octopus 48
macaque 144, 152, 285 monoclonal 186 odorant receptor 139 –140, 145, 225,
Barbary 55 –56 mononucleosis 122 247, 279 –280
macrophage 291 monosaccharide 178 oestrogen 5
macropus 164 monoterpene 279 oligocene 68
mainchain (protein) 48, 176, 299 monotreme 152, 220, 225, 227 OMIA (Online Mendelian Inheritance in
maize 15 –16, 46, 108, 119, 143, moonlighting proteins 180 Animals) 58, 106 –107, 116, 350
187–188, 218, 220, 232, 235, Morgan, T.H. 79, 83, 111 OMIM (Online Mendelian Inheritance in
241–245, 247, 251, 260, 262, mosaic 15, 46, 125 Man) 53, 58, 77, 106 –107, 159
320 –321 mucosa 126 oncogene 87, 284, 327, 316
major histocompatibility complex 55, 57, multidrug resistance 9, 204, 288 one-gene-one-enzyme hypothesis 244
110, 114, 283 multiple sclerosis 57 operon 43, 109, 130, 285, 289, 343, 355,
MALDI-TOF 309 muscle 51–52, 108, 150, 278, 286, 314, 360, 369, 377–379
maltodextrin 203 –204 317–318 lac, in E. coli 43, 47, 376, 379ff.
mammoth 67, 111, 119, 232–234 muskmelon 132 opioid 283
Mandelbrot, B. 350 mutagen 251, 329 opossum 152, 164, 225
mannosidosis 305, 339 mycoplasma 122, 136, 380 opsin 48, 58, 140, 195, 362
marmoset 152 myelin 276, 283 –284 orangutan 143, 157
marsupial 46, 68 – 69, 116, 119, 138, myelogenous 150 ORF (open reading frame) 13, 188, 217
152, 225 –227 myeloid 87, 110, 143, 290 –291 organophosphorus insecticide 305
mass spectrometry 311ff. myoglobin (see globin) ornithine 16, 330
mastodon 234 myosin 314, 317–318 Ornithorhynchus anatinus 225
maximum-parsimony 184, 189 orthologue 109, 115, 136, 155, 159,
megakaryocyte 143, 283, 291 222, 224 –225, 233, 248 –249,
Megalapteryx 231 N 276, 356
meiosis 11, 83, 131, 140 Oryza sativa 245
melanin 163 N-acetylglucosamine 287 oseltamivir 127
melatonin 241 N-acetylmuramic acid 287 ostrich 182, 230
meningitis 205 –206 nanoarchaeota 196 O-succinyl-L-homoserine 356
mesoderm 119 nanomedicine 260 outgroup 186, 224, 227, 234, 285
Met repressor 375 nanoparticles 389 oxygenase 120, 283, 388
metabolome 351ff. Neanderthal 119, 237, 250 –251, oxygen-dissociation curve, haemoglobin
metacentric 147 253, 260 319
metagenomics 213 Neisseria meningitidis 205 –206 oxyluciferin 102
metaphase 12, 87 network, in systems biology 342ff.
metaservers 327 neuraminidase 126 –128, 156
methanogen 118, 120, 195 –196 neurexin 283 P
methicillin 206 –207, 213, 287–288 neurexophilin 283
MHC (see major histocompatibility neuritogenesis 283 p53 150, 364, 376
complex) neurodegenerative diseases 22, 124, PAGE (see polyacrylamide gel
Michaelis–Menten kinetics 314, 319, 331 electrophoresis)
333, 365 –366 neurogenomics 50 PAIRCOIL 327
microarray 265ff., 368 –369 neuroglobin 26, 29 –30, 137, 156, paired-end read 18 –19, 37, 112
microdeletion 139 235 –236 palaeosequencing 229ff., 250ff.
microfossil 194 neuromodulator 283 –284 palindrome 169, 189, 385
microRNA 6, 47, 154 neuropsychiatric disease 50 –51 panda 69
microsatellite 7– 8, 88, 243, 346 neurotoxin 97, 226, 305 paralogue 51, 115, 136, 155, 357
Paramecium 8, 29 pluripotent 46, 291
Parkinson’s disease 51, 149 –150, pneumonia 81, 122, 136 Q
331, 370 pogonophora 119
Pasteur, L. 80 – 81, 272 pol (HIV-1 protein) 124 quagga 119
paternity test 20 –21, 88, 238, 261 poliomyelitis 22
pathogen 11, 17, 21, 23, 42, 57, 129, polyacrylamide gel electrophoresis
163, 191, 204 –207, 212, 219, 96 –97, 297, 307–308 R
229, 269, 290, 292 polyadenylation 14
Pauling, L. 59, 91, 118, 303 polycystic kidney disease 150 rabbit 138, 152, 227, 337–338, 364
PAX gene 48, 148 polycythaemia 291 racemic mixture 80
PDB (see wwPDB) polyglutamine-repeat diseases 22, 331 Ramachandran plot (see Sasisekharan-
PDBe 107, 352 polynomial-time algorithm 350 Ramakrishnan-Ramachandran plot)
peanut 119, 141, 242 polyploid 141, 143, 219, 248 ras 150, 316, 318
penicillin 80, 287–289 polyprotein 124, 129 RCSB (see Research Collaboratory for
peptide bond 299ff. polysaccharide 211, 221, 246, 287, 350 Structural Bioinformatics) 107
periplasm 315 polyteny 143 reactive-centre 320
permafrost 119 population 52 reactive oxygen species 273 –274
permease 204, 380 porifera 119 receptor 6 – 8, 45, 48, 51–52, 58, 126,
peroxidase 274, 339, 369, 373 position-specific scoring matrix 175, 138 –140, 145, 153, 225, 248, 271,
pertussis 205 –206 189 275 –276, 283 –284, 292, 305, 312,
pesticide 68, 187, 243 post-translational modification 47– 48, 314 –315, 330ff., 342, 362, 370ff.,
Père David’s deer 69 –70, 164 –165 106, 109, 131, 162, 187, 202, 377ff.
Perutz, M.F. 90 304 –306, 308 –310, 370 T-cell 57, 221
phage display 367 potato 133, 141, 242 recombination 53 –54, 83 – 85, 99, 123ff.,
pharmacogenomics 4, 23, 36, 58 –59, 89 power law 345, 377, 382 130 –131, 140, 202, 206, 219, 251,
pharming 187 precambrian 68 280, 329, 353, 369, 373
phenylalanine ammonia-lyase 61 prednisolone 292 regulation 4, 6, 24, 43, 45 – 48, 52, 58,
phenylketonuria 5, 60 – 61, 144, 365 presenilin 149, 151, 157, 159 60 – 62, 87, 109, 121, 123, 125, 143,
phenylpyruvate 60 – 61, 365 primaquine 59 146, 151, 154, 197, 217, 219, 229,
philopatric 77 primase 354 251, 271, 275 –276, 287, 290 –291,
PHOBIUS 326 –327 prion 58, 92, 186 –187, 331–332 298, 302, 305, 312, 314, 328, 330,
phocidae 228 prochlorococcus 205, 209 –211, 213 333, 342–384
phosphofructokinase 330, 359 –360 prodynorphin 283 allosteric 318ff.
phosphogen 337 proinsulin 306 of G-protein activity 316
phosphorylase 362 promoter 8, 14, 47, 244, 273 of G-protein activity 316
photocross-linking 306 proteasome 305, 330, 333 141, 144, 148, 162, 169, 219, 227,
photosynthesis 12, 42, 67, 120, 132–133, protein 244, 268, 270, 272–273, 276,
192, 195, 205, 209 –210, 212, aggregate 7, 10, 58, 233, 284, 279ff., 289 –291, 327, 342–384
218 –219, 317, 356 –357 330 –333, 335 relenza 127–128
photosystem 210, 213 folding 10, 134, 176, 200, 217, 298 renaturation 9 –10, 306, 310
phototransduction 278 folding pattern 107, 176 –179, 300ff. replication 15, 17, 19, 46, 81, 101, 130,
phred 97 structure 9 –10, 162, 166, 176ff., 140, 143, 271, 374
phycobilisomes 210 299ff. repressor 18, 44, 47, 244, 272–273, 335,
phycoerythrin 210, 213 proteinase 120, 124, 180, 189, 283 –284, 355 –356, 363, 373ff.
PHYLIP 183 305 –306, 314, 320, 322, 333, Lac operon 380ff.
phylogeny 181ff. 351–352, 368 –369, 373 reptile 5, 68, 119, 224 –226
phylogenetic trees 181ff. trypsin digestion 308 Research Collaboratory for Structural
Bayesian methods 186 Protein Data Bank Japan 107 Bioinformatics 107
cladistic methods (maximum protein-protein 342 restriction enzyme 79, 88 –90, 94,
parsimony, maximum likelihood) protein truncation test 64 – 65, 72 97–98, 169, 189, 239, 251,
185 proteome 8 –9, 48, 107, 297ff. 373
clustering methods 184 proterozoic 68 restriction fragment length polymorphism
phylogeography 55 protocadherin 283 (RFLP) 88, 238 –239
PKU (see phenylketonuria) proton gradient 12, 140, 204 restriction map 81, 89 –90
plankton 205 protostomes 119 retinoblastoma 62, 150
Plasmodium falciparum 13, 59, 99, 159 protozoa 67, 122, 132 retinol 179, 298
plastid 38, 120 psychoanalysis 5 retroelements 15
plastocyanin 313 pufferfish 122, 129, 155, 221, 224 retrotransposition 136
platelet 143, 283, 291 pupa 276 –278 retrotransposons 15
platyhelminthes 119 pyrimidine-rich 45, 73 retrovirus 15, 125
platypus 152, 225 –227, 230 pyrosequencing 79, 101 reverse transcriptase 14 –15, 124 –125,
pliocene 68 pyrrolidine 301 292
RFLP (see restriction fragment length sediment 192 solute 105, 200, 204, 268, 299, 373
polymorphism) segregation 81– 82, 140 solvent 298 –300, 308 –309, 328 –329
rheumatoid arthritis 57 selection (in evolution) 54, 58, 80, Sonnhammer, E. 326
rhizaria 217 83 – 84, 123, 134, 136, 140, soybean 42, 108, 248 –249
rhodopsin 48, 140, 195, 362 145 –146, 163, 166, 176, 179, species
riboflavin 204 186 –187, 198, 206, 233, 242–245, difficulty of definition 42, 66, 117,
ribonucleoprotein 47 280, 285, 289 –290, 328ff., 348 120, 164
ribosome 6 –7, 10, 13, 47, 133, 176, selenocysteine 7 endangered, extinct 17, 42, 67ff., 118,
192–194, 304, 330, 335, 342, selenomethionine 304, 306 165, 216, 229ff., 240, 250
352, 374 sepharose 232 sperm 4, 31–32, 122, 143, 159, 218,
rice 108, 133, 187, 242–243, 247 sequence-tagged site 98 245, 248, 251, 263
golden 245 sequencing 18ff. (see also spermidine 159
rickettsia 13, 38, 122, 158, 205, 380 palaeosequencing) spindle 285
RNA 6, 11, 13, 17, 46, 94, 106 –107, decreasing cost 4, 18, 98 spinocerebellar ataxia type 3, 149
118, 121, 123, 125, 126, 129 –131, exome 20 splicing 8 –9, 14, 16, 43, 47, 60, 106,
143, 154, 162, 169, 177, 195, genome sequencing projects 17 124, 162, 178, 219, 242, 304, 307
197–198, 209, 217, 229, (see also individual species) squid 119, 343
268 –269, 271, 282, 342, 352, high-throughput techniques 98ff., src-homology domain 370
371, 374 232ff. staphyloxanthin 289
messenger (mRNA) 6 –7, 9, 14 –15, history of techniques 18ff., 90ff. starfish 119, 220
30, 43 – 44, 46 – 47, 64, 95, serotonin 52 steady state 343, 364 –366
108 –109, 124, 129 –130, 136, serotype 126, 157 stellacyanin 313
176, 198, 266 –268, 277, 285, serpin 58, 314, 320 stem cell 291
377, 380 –381 seven-helical transmembrane protein streptavidin 333, 368
mRNA editing 9, 106, 123, 304 140 streptomycin 155
miRNA and siRNA 7, 16, 47, 154 shikimate dehydrogenase 360 –361 stress 44, 47, 52, 150, 272–274, 276,
ribosomal 7, 117–119, 132, 166, shotgun sequencing 19, 79, 98 –100, 104, 343, 383 –384
192–194, 196, 208, 217, 232, 113, 147, 151, 227, 292–293 oxidative 59, 143, 210, 272–274, 284,
251 sialic acid 127 370, 377
transfer (tRNA) 8, 10, 13 –14, 47, sialidosis 339 stromatolite 194
132, 147–148, 162, 192, 198, sickle-cell anaemia 4, 58 – 60, 331–332 subgraph 345, 377, 380
201–202, 217, 251, 274, 306, sidechain 48, 59, 134, 178, 194, 198, substitution 48, 60, 136, 144 –146, 162,
352, 374 200, 299ff., 314, 320, 323 –324, 171–173, 175, 177–178, 185, 241,
RNA interference 47, 280 –281 326, 330, 333 –334, 375 –376 251, 308, 328 –330, 363 –364, 383
RNA polymerase 16, 44, 376 –377, signal 4 – 6, 8, 14, 44 – 45, 87, 96 –97, (see also single-nucleotide
380 –381 101, 102, 104, 133, 138 –141, 162, polymorphism)
RNAseq 18, 292–293 170, 202, 217, 220, 244, 273, subtilisin 155, 329 –330
ROBETTA 326 –327 276 –278, 280, 283 –284, 287, 291, sulfolobus 159, 195 –196, 358
298, 305, 312, 316, 342ff., 352, sulphurylase 101–102
356, 362ff., 370ff. supercoil 129, 373
S peptides 326 –327, 332 superfamily 107, 138, 283, 298,
transcriptional control 18, 48 – 49, 86, 310, 312
Saccharomyces cerevisiae 44, 122, 136, 361, 382ff. superoxide dismutase 274
149, 155, 193, 213, 217, 266, signalome 388 superposition (of protein structures or
272–273, 367, 382–383 signal-transduction 220, 312, 345 substructures) 32, 176 –177, 313,
salamander 122 silkworm 108 315, 321, 324
Sali, A. 324 single-end read 18, 37 suppressor 62– 63, 65, 116, 247, 376
Sanger, F. 18, 94ff. single-input motif 379 surface plasmon resonance 367
sarcomere 317–318 single-nucleotide polymorphism 18, 20, sweetpea 81
Sargasso Sea 208 –209 52–55, 60 – 63, 84, 144, 151, SWISS-MODEL 324 –325, 327
Sasisekharan–Ramakrishnan– 177–178, 228 –229, 268 SWISS-PROT 29, 106, 354
Ramachandran plot 301–302 single-stranded synaptogyrin 283
satellite 7– 8, 88, 243, 346 conformational polymorphism test 65 synechocystis 13, 38, 122
scale-free network 346, 377 DNA 94, 97, 125 synonymous mutation 6, 9, 57, 136,
schizophrenia 34, 51–52, 77 RNA 14, 17, 125, 198 162, 198
Schizosaccharomyces pombe 217 sleep 23, 272, 274 –276, 332 synteny 86, 141, 143, 145 –146, 153,
scrapie 332 smallpox 22, 42, 125 162, 201, 206, 221–224, 229,
SDS-PAGE 307–308, 337 SNP (see single-nucleotide 248 –250, 286
secondary structure 178, 192–194, 292, polymorphism) syntrophin 337, 371
301–304, 312–313, 320ff., 325 –327, Solexa 120 synuclein 283, 331
330, 374, 383 SOLiD (Applied Biosystems sequencer) synuclein 283, 331
prediction 325 –327 102–104 systematics 71, 116, 164 –165, 231, 234
secretory 373 solubility 299 systems biology 341ff.
signal 48, 130 –131, 140 –141, 202, 201–202, 206, 208, 211, 243,
T 217, 220, 278, 280, 284, 287, 312, 305 –306, 330, 333, 368, 374
316, 345, 352, 361–362, 370 –371 φX-174 18, 122
Takifugu rubripes 122, 224 transfection 329 HIV-1 57, 124 –126, 305
tandem affinity purification 368 –369 transgenic 51, 145 influenza 122, 125ff., 229, 305
tarsier 159 transhydrogenase 365 vitamin 47, 58 –59, 72, 179, 187, 245,
Tasmanian devil 42, 68 – 69, 119 transition (mutation) 171, 343 278, 345, 357
TATA box 14, 375 transition-state 128, 314 vitellogenin 225
TATA box-binding protein 373 –376 translation 6 –10, 13 –14, 29 –30, 43 – 47,
tau 21, 108, 116 –117, 141, 305, 315, 64, 105 –106, 108, 120, 123, 125,
331, 343 129 –133, 154, 162, 175 –176, 187, W
tauopathies 331 197, 201–202, 276 –277, 284, 348,
taurocyamine 315 374 (see also post-translational wakefulness 247, 272, 274 –276
taxonomy 30, 48, 65 – 66, 117–118, modification) warbler 183
164 –167, 204, 230, 232, 259, 290, translocation 13, 16, 28, 52, 86 – 87, Watson, J.D. 20, 23, 91–93, 111
345, 353 217, 271 whale 31–32, 68, 116, 227, 240
T-cell 57, 221, 290 transmembrane 51, 125, 138, 140, 153, wheat 108, 122, 141, 189, 235,
T-coffee 171, 213 326 –327, 362, 370 242–243
telomere 8, 145 transposition 15 –17, 87, 136, 162 wobble hypothesis 111
teosinte 241–245, 262 transposon 15 –17, 129, 206, 219, 242 Woburn Abbey 70
termination (chain) 8, 60, 95 –96, 177 transthyretin 331 wwPDB 107, 110, 310, 312–313, 325,
tertiary structure 192, 302–304, 312, transversion 171 328, 352
320 –321, 325 trEMBL 106, 354
tetracycline 289 triticale 141
Tetraodon nigroviridis 221–223 triticosecale 141 X
thalassaemia 45, 60, 285 Triticum 108, 141
Theobroma cacao 248ff. trypanosome 8, 217 xanthine 52, 290
theobromine 246 –247, 263 trypsin 22, 43, 58, 120, 155, 308, 331, 333 xenarthra 152, 227
thermitase 329 –330 tryptic digest 308, 310 –311 Xenopus 108, 138, 143, 168, 170, 189
Thermococcus kodakarensis 201ff. tuberculosis 120, 122, 192, 369 xeroderma pigmentosum 150
thermolabile 200 tumour 5, 20, 42, 108, 150, 283, X-linked 43, 52, 58, 73, 86
thermophile 118, 122, 195 –198, 200 290 –291, 316 X-ray tomography 367
thermostability 198, 329 –330 tumour-suppressor 62– 63, 65, 116, 376
thiamin 47 tunicate 119, 221
thioredoxin 134 –135 turkey 103 –104, 112, 232, 246 Y
thrombin 187, 320, 322, 333 two-hybrid screening 367
thylacine 68 – 69, 119 typhoid 205, 346 YAC (yeast artificial chromosome) 11,
thymus 57 typhus 13, 122, 205 96 –97, 178, 204, 297, 307–308,
tiling 266 338, 359 –360
time scale 68 y-chromosome 241
tinamou 182 U yeast 5, 11, 25, 108, 118 –120, 144,
tissue-culture 269 149 –150, 206, 216, 306 –307, 332,
tobacco 108, 125, 368 ubiquitin 8, 39, 44, 305, 333, 373 335, 363, 367, 380
tomato 46, 108, 187 ultraviolet 208, 210 gene expression control 266, 272ff.,
torpor 274 uncoil 147, 374 377, 382ff.
toxin 97, 204, 206, 226 UniProt 25, 29, 106, 213, 236, 352 genome 14, 44, 86, 122–124, 132,
transcriptase, reverse 14 –15, 124 –125, upregulation 371 140, 147, 216ff.
292 urea 34, 236, 307, 359 protein interaction network 370 –371
transcription 6 –9, 11, 14 –16, 18, 26, uric acid 52, 61, 91, 205, 236, 359 Yersinia pestis 205
43ff., 60, 64, 108 –109, 120, urochordata 119, 141, 220 –221, 235 y-ion 311
123 –124, 129 –131, 141, 144, 148, Y-rich region 45
154, 162, 169, 176 –177, 197, 202,
217, 219 –220, 225, 244, 266, 268, V
271ff., 282–285, 287, 289 –292, Z
304, 316, 327, 330, 342–343, 351, vaccine 21–22, 125, 204, 206, 208, 212–213
355 –356, 362–363, 365, 367–368, vacuole 367 zanamivir 127–128, 156
371, 373 –386 vancomycin 206, 287–290, 294 Z-antitrypsin 22, 331
transcription activator 368 varroa mite 68 Z-disks 317
transcriptome 18, 154, 266, 292–294 venom 225 –226 zebra 49, 51, 108, 119, 138, 141, 152
transduction vesicle 232, 283, 371 zebrafish 119, 138, 152
DNA transfer 131 vincristine 292 zinc finger 271, 375
energy 12, 149, 210, 317, 326 virus 9 –11, 15, 17, 22, 42– 44, 47, 67, zipper 327, 374 –375
photo 278 81, 121, 124 –125, 128, 163, 192, Zuckerkandl, E. 118